This week, I had the great privilege of attending the second Computation and Visualization Consortium, led by Danny Kaplan, Randy Pruim, and Nick Horton. I met so many fascinating people, and learned so much in five days, it’s almost as if I am a totally different person. For example, this time last week I had no idea how to use R to scrape Beyonce lyrics from the internet, or how to get data from Wikipedia and IMDB on several different TV shows to make a new dataset. But what can I say? I have been transformed in a mere 5 days.
TV Shows Data!
I used data freely available from Wikipedia and IMDB.com to make a little dataset of 400 episodes of television. This dataset includes information on every episode (as of this blog posting) of Bob’s Burgers, Seinfeld, Community, The Americans, and Breaking Bad. If you would like to see my RStudio markdown file for scraping the data and doing a few visualizations, click here.
The code in this file worked today, but since Wikipedia and IMDB change all the time, there is no guarantee that the code will work if you try to run it.
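For anyone who has never scraped a table before, here is a minimal sketch of the general idea. It is not the code from the markdown file. It assumes the rvest package is installed, and it parses a tiny made-up HTML snippet (the `episode_html` string, its titles, and its numbers are all placeholders) instead of a live Wikipedia page, so it does not break when the websites change.

```r
library(rvest)

# Placeholder HTML standing in for an episode table on a web page.
# The titles and numbers are invented for illustration only.
episode_html <- '
<table class="wikitable">
  <tr><th>No.</th><th>Title</th><th>Viewers (millions)</th></tr>
  <tr><td>1</td><td>Pilot</td><td>1.5</td></tr>
  <tr><td>2</td><td>Second Episode</td><td>1.2</td></tr>
</table>'

# For a real page, pass its URL to read_html() instead of a string
page <- read_html(episode_html)

# Find every table with the "wikitable" class and convert the first one
tables <- html_elements(page, "table.wikitable")
episodes <- html_table(tables[[1]])

print(episodes)
```

On a real page there are usually several tables, so the index in `tables[[1]]` would likely need adjusting, and the column names typically need some cleaning afterwards.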
Here’s a little taste of what I found. I’m a pretty big Community fan. Any of my students can attest to that as I am constantly using scenarios from Community in class and on tests. Sadly, fewer and fewer people seem to be watching Community. It isn’t much of a surprise that NBC dropped it:
Breaking Bad, on the other hand, had a much different pattern, starting off slow, and then really ramping up in viewership:
As we can see from the plot above, the season finale of Breaking Bad got huge numbers. I haven’t watched Breaking Bad (it is on my to-do list), but from what I understand there was a bit of a cliffhanger for the finale, and there was some fear of spoilers, which explains why people wanted to watch the episode at its first showing.
So far I have just done a couple of univariate visualizations. We can also try to explore some relationships here. I plotted Millions of Viewers against average rating on IMDB for each episode for all five TV shows:
One thing really sticks out here. One episode had about 70 million viewers, far more than any other episode in the dataset. It was the season finale of Seinfeld, which, if my memory of 1998 serves me correctly, was a huge cultural moment. Memories…
There are so many different relationships we could look at here. Are Community episodes that are written by Dan Harmon more highly rated on IMDB? Are episodes that are directed by women watched by a higher number of viewers than episodes directed by men? Is there a trend in the ratings for Bob’s Burgers over time? But don’t just take my word for it. Take a look at the code, and run it yourself!
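As a starting point for questions like these, here is a rough sketch in base R. The data frame `tv` and everything in it (the show names "Show A" and "Show B", the viewer counts, the ratings) are made-up placeholders, not values from the scraped dataset. The column names are assumptions too.

```r
# Made-up placeholder rows; the real dataset has about 400 episodes
tv <- data.frame(
  show    = c("Show A", "Show A", "Show B", "Show B", "Show B"),
  viewers = c(1.2, 1.5, 3.1, 2.8, 10.4),  # millions of viewers
  rating  = c(8.1, 8.4, 7.9, 8.0, 9.2)    # average IMDB rating
)

# Scatterplot of viewers against rating, one color per show
plot(tv$rating, tv$viewers,
     col = as.integer(factor(tv$show)), pch = 19,
     xlab = "Average IMDB rating", ylab = "Millions of viewers")

# Which episode has the most viewers?
top_episode <- tv[which.max(tv$viewers), ]
print(top_episode)

# Mean rating per show
mean_rating <- tapply(tv$rating, tv$show, mean)
print(mean_rating)
```

The same `tapply()` pattern would work for comparing groups such as writer or director, assuming the dataset has a column for them.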
I also took a brief dip into a more complicated data scraping venture with a lot of help from the very patient and knowledgeable Danny Kaplan. I was interested in learning about Beyonce’s lyrics, so I scraped some of them from www.azlyrics.com.
I don’t have a lot (or really any) experience with analysis of texts, but it can be fun to try new things. For starters, here are the Top 10 most frequent words used in the Dangerously In Love album:
With Danny’s help I also found that the most common three-word phrase was “I love you”, clocking in at 25 times. Heartwarming. Other popular phrases included “be with you” (20 times) and “lookin’ so crazy” (18 times).
Less common phrases included “your honesty integrity” (once) and “a green beret” (also once). Ten points go to anybody who can identify in which songs those lonely phrases appeared.
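Counting words and three-word phrases can be sketched in a few lines of base R. This is only an illustration of the idea, not the code Danny and I used, and the `lyrics` string is placeholder text, not actual lyrics.

```r
# Placeholder text, not actual lyrics
lyrics <- "the cat sat on the mat and the cat sat on the hat"

# Lowercase, then split on anything that is not a letter or a straight
# apostrophe (curly apostrophes would need extra handling)
words <- strsplit(tolower(lyrics), "[^a-z']+")[[1]]
words <- words[words != ""]

# Word frequencies, most common first
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 10))

# Three-word phrases: paste each word with the two words that follow it
n <- length(words)
trigrams <- paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n])
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
print(head(trigram_counts, 3))
```

Treating the whole album as one long string means a phrase can straddle the end of one line and the start of the next, so splitting by line or by song first may give cleaner counts.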
Lastly, because I love networks, I made a word network. Here the nodes represent words and an arrow would go from the node representing “I” to the node representing “you” if Beyonce sang “I you” in this album at least once. Here is what I get:
Do you want to analyze this massive jumble? Let me know! It could be quite fun!
If you want to know how I got this big mess of platinum-selling words, please take a look at this R Markdown file.
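For a rough idea of how the arrows in a word network like this can be built, here is a small base R sketch. It is an illustration under my own assumptions, not the code in the markdown file, and the `lyrics` string is placeholder text.

```r
# Placeholder text, not actual lyrics
lyrics <- "i love you and i love me"

words <- strsplit(tolower(lyrics), "[^a-z']+")[[1]]
words <- words[words != ""]

# One row per pair of consecutive words: an arrow goes from -> to
n <- length(words)
pairs <- data.frame(from = words[-n], to = words[-1],
                    stringsAsFactors = FALSE)

# An arrow is drawn if the pair occurs at least once,
# so keep each directed pair only one time
edges <- unique(pairs)
print(edges)
```

An edge list in this two-column shape is what network packages generally expect as input; with the igraph package installed, for example, `graph_from_data_frame(edges, directed = TRUE)` should turn it into a graph that can be plotted.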
I’m so excited that I can scrape the internet with R now. I will never hand scrape anything again! Hurrah!