Shakespeare vs Wu-Tang: Big Data and the Dangers of Overinterpretation

Let’s say you have an important decision to make. Your teenage daughter will be taking a standardized test in a few years, and the score she achieves in the vocabulary section will be a big factor in whether she gets into the college of her (or your) choice. Your first thought is that you should encourage her to read lots of literary classics, but then you happen across an article a few of your old college friends have shared on Facebook that shows definitively—like in actual numbers—that the members of the Wu-Tang Clan sport vocabularies that dwarf even Shakespeare’s. So do you tell your daughter to forgo the book group and push her to listen to more rap instead (assuming a parent’s encouragement wouldn’t ruin it)?

This is an admittedly somewhat absurd example, but it does highlight some of the dangers of misreading or overinterpreting statistics. No one graph or metric ever tells the whole story. But with all the increasingly powerful tools for sifting through Big Data coming on the market, many business leaders are being tempted to make decisions based on a few numbers. To understand what those numbers mean, you need to ask two other questions. And once you have a grasp of what the numbers mean, you still have to remember what your goals are in relation to those numbers.

Does GZA really have a bigger vocabulary than Shakespeare did?

The two questions you have to ask:

1. How were these numbers generated?

The article that compares rappers’ vocabularies to Shakespeare’s is called “The Largest Vocabulary in Hip Hop,” and it was written by a guy named Matthew Daniels. If you go straight to the graph, you can see that GZA uses 6,426 unique words while Shakespeare only uses 5,170. Doesn’t this mean GZA wins?

Let’s look at the methods and sources. For the rappers, Daniels had his program count the first 35,000 words of all the albums they released and eliminate all the repeats. For the bard, he set the same program on the first 5,000 words of seven plays to get the same word count. This strategy is actually really simple, and Daniels himself notes that it runs into some problems of interpretation:

“I used a research methodology called token analysis to determine each artist’s vocabulary. Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin’ vs. pimpin), they’re removed from the dataset. It still isn’t perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king sh**), featured vocalists, and repetitive choruses.”
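To make the counting concrete, here is a minimal sketch of that kind of unique-word count. It is not Daniels’s actual program: the function name `unique_word_count`, the lowercasing step, and the rule that a word is any run of letters or digits are all assumptions made for illustration. Only the apostrophe removal and the 35,000-word cutoff come from his description.

```python
import re

def unique_word_count(text, limit=35_000):
    """Count distinct word forms among the first `limit` words of `text`.

    Hypothetical helper for illustration only; lowercasing and the
    letters-and-digits definition of a word are assumptions.
    """
    # Drop straight and curly apostrophes so pimpin' and pimpin match.
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    # Treat every run of letters or digits as one word.
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    # A set keeps one copy of each distinct spelling.
    return len(set(tokens[:limit]))

sample = "Pimps pimp pimping pimpin' pimpin shorty shawty"
print(unique_word_count(sample))  # 6
```

Only the two spellings of pimpin collapse into one. Every other variant, including shorty and shawty, counts as a separate word.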

In other words, a rapper whose spelling is inconsistent will have artificially high numbers of unique words. Daniels insists that though the graph is hard to interpret, “It’s still directionally interesting,” suggesting he thinks the relationships it reveals are meaningful. But does that apply to rap lyrics and Elizabethan plays equally? That brings us to the next important question.

2. How does the context of the behaviors behind the numbers affect their meaning?

Daniels refers to Shakespeare’s vocabulary in the opening of his article mostly for rhetorical effect. So it’s a little disturbing how many people latched on to the Shakespeare number in reporting on the findings. After the first two sentences, the only mention of the bard comes in the line, “As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, first 35,000 of Moby Dick).”

But how meaningful is that benchmark? This question boils down to whether writing rap lyrics is the same thing as writing plays or a novel. Or are we comparing apples to oranges? Daniels points out at the end of his article that it’s difficult to say what the different numbers mean even when we’re just comparing one rapper to another. Jay-Z, for instance, has rapped, “I dumbed down for my audience to double my dollars/ They criticized me for it, yet they all yell ‘holla’.”

Part of the appeal of rap comes from the fact that its messages alienate and enrage certain demographics, so it’s an arena that’s likely far more friendly to the practice of making up words. Shakespeare, on the other hand, was trying to appeal to multiple audiences, on multiple levels. Token analysis is completely blind to puns and double entendres, which Shakespeare uses quite a bit. Shouldn’t these be counted twice? Plus, rap songs are amenable to single-word sentences, or chains of words without articles or syntax. It’s harder to get away with this when you’re trying to create characters and tell stories without the benefit of sick beats to keep your audience engaged. On the other hand, you don’t have as many repeating choruses in plays.

Let’s get back to your goals.

So would your daughter be better off listening to rap than reading Shakespeare? Well, no, not unless the standardized test she’s preparing for has a section of antonyms for words like shizzle. Of course, the test probably won’t have a lot of doths or thous either. So maybe have her read some more modern fiction—or lots of blogs.

The key point here is that when you have an easily comprehensible graph, it’s hard not to place too much weight on it. Who has the time to dig into messy and complicated methods sections? Who has the time to listen to a whole story? So when a number comes along that seems to tell a whole story in just a few digits, we just can’t help ourselves. But data is just one small, though important, part of understanding. And Big Data should only be one part of a much more comprehensive Business Intelligence strategy.

So are there any decisions we might make based on Daniels’s graph? Well, even if it doesn’t show that rappers and hip hop artists are as brilliant as Shakespeare, it probably does show that they’re at least pretty impressive. You may even decide to pay them a little more respect—or even to try listening to their music.

