The students in my Intro Statistics classes are diligently working on their Statistics Projects. Every time I meet with a project group, I get excited and inspired by their ideas. So what to do with all that inspiration? Why not do a statistical analysis of Humans of New York, one of the most popular photo blogs of all time?
This is all to say that I am excited to analyse some Humans of New York data! The first thing I did was figure out how many blog posts are in Humans of New York (HONY). As of this morning, there were a little over 4,000. Next, using a random number generator, I chose a random sample of 50 blog posts. The URLs for the random blog posts are included below. I decided to keep track of a few variables: how many notes each post has (these can be either likes or re-shares), the age of the post in days, how many words were in the caption, how many humans were the focus of each photo, whether or not there were any animals, and whether or not there were any children in the photo. I collected the data on the morning of May 5, 2014, and the information may have changed since then.
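For anyone who wants to repeat the sampling step, here is a minimal Python sketch of drawing a simple random sample of posts. It is not the exact procedure used for this post: it assumes the posts can be numbered 1 through 4,000 (a placeholder, since the real total was only "a little over 4,000"), and the function name and seed are made up for illustration.

```python
import random

def draw_post_sample(total_posts, sample_size, seed=None):
    """Pick `sample_size` distinct post numbers from 1..total_posts."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no post is picked twice.
    return sorted(rng.sample(range(1, total_posts + 1), sample_size))

# 4000 is a placeholder for the true number of posts; 2014 is an arbitrary seed.
post_numbers = draw_post_sample(total_posts=4000, sample_size=50, seed=2014)
print(post_numbers[:5])
```

Fixing the seed only makes the draw repeatable; any seed (or none) gives an equally valid random sample.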
Exploring the sample with univariable statistics and visualizations:
First off, I wanted to learn about the number of notes that each photo gets. In my sample of 50 photos, there was quite a wide range. This photo from August 22, 2010 only had 10 notes, whereas this photo from May 22, 2013 had a whopping 51,766 notes! So, as you may have guessed, the number of notes variable is quite right-skewed:
The mean number of notes in this sample was 3,086, with a 95% CI of (955, 5,218). So I am 95% confident that the mean number of notes for all HONY posts is between 955 and 5,218. There are two reasons why my confidence interval is so wide: first of all, the number of notes varies a lot from photo to photo, and secondly, I only have a sample of 50 photos. Trust me, I’m typing “only 50” right now, but when I put this dataset together, 50 felt HUGE.
I also found that in this sample the number of words in the photo caption is right-skewed, with a sample mean of 24 words and a 95% CI that went from 16 words to 32 words. The number of days the photo has been on HONY is pretty uniform (barring one outlier). As for the number of humans in each photo, a few photos had no humans, while others had 3, but the majority of the photos had one or two people:
Lastly, about a third of the photos featured children (95% CI: 20.67% to 47.33%) while only 8% featured animals (95% CI: 0.55% to 15.45%).
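To show where an interval like this comes from, here is a rough Python sketch of a confidence interval for a mean. It uses the normal approximation (a critical value of about 1.96), which is an assumption of this sketch; a t-based interval would use a slightly larger multiplier (roughly 2.01 for a sample of 50). The note counts in the example are made up and are not the HONY sample.

```python
import math
import statistics

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for a mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error: sample standard deviation divided by sqrt(n).
    std_error = statistics.stdev(values) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error

# Made-up, right-skewed note counts for illustration only.
example_notes = [12, 250, 480, 900, 1500, 2200, 3100, 4800, 9000, 40000]
low, high = mean_confidence_interval(example_notes)
print(f"mean = {statistics.fmean(example_notes):.0f}, 95% CI = ({low:.0f}, {high:.0f})")
```

Both reasons for a wide interval show up in the standard error: a large standard deviation makes it bigger, and a small n makes it bigger too.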
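Intervals for proportions work the same way. The sketch below computes a simple Wald (normal-approximation) interval; the function name is made up, and since the exact interval method behind the numbers above is not stated here, the output comes out close to, but not identical to, those figures. The only input taken from the sample is that 8% of 50 photos means 4 photos with animals.

```python
import math
from statistics import NormalDist

def wald_proportion_ci(successes, n, confidence=0.95):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    # Standard error of a sample proportion.
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p_hat - z * std_error, p_hat + z * std_error

# 8% of 50 photos is 4 photos with animals.
low, high = wald_proportion_ci(successes=4, n=50)
print(f"{low:.2%} to {high:.2%}")
```

With only 4 "successes" the lower end sits very close to zero, which is a hint that the normal approximation is being stretched for a count this small.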
Looking at relationships in the data:
I wanted to see how the number of notes for each photo is related to both the number of words in the caption and how long the photo has been up on the website. Here is what I found:
This one really popular blog post makes it hard to tell what is going on here! Just by eyeballing these scatter plots, I would say there seems to be a negative relationship between the number of notes and the number of days on the website. This makes sense to me since the blog has had increasing popularity over time. It also looks like there is a positive relationship between the number of words in a photo caption and the number of notes the photo receives.
I’m also interested in seeing whether photos with children tend to get more notes than photos without children. Let’s take a look at this side-by-side boxplot (leaving out our most popular post) comparing the number of notes for posts with children vs. without children:
My intuition was that the average number of notes for photos with children would be greater than the average number of notes for photos without children. These data do not seem to agree with my intuition: the average number of notes for a photo without a child is about 3,749, as opposed to an average of 1,540 notes when the photo depicts a child. This difference is not statistically significant (p-value = 0.20), so I don’t have enough evidence to say that the average number of notes differs by whether or not children are in the picture.
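One way to test a difference in means without leaning on normality (handy with data this skewed) is a permutation test. To be clear, this is a sketch of that technique and not necessarily the test behind the p-value above; the function name, the seed, and the note counts are all made up for illustration.

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Share of shuffles at least as extreme as the observed difference.
    return extreme / n_permutations

# Made-up note counts, not the HONY sample.
with_children = [300, 900, 1200, 1800, 2500, 4100]
without_children = [150, 700, 2000, 3500, 6000, 9500, 12000]
p_value = permutation_test(with_children, without_children)
print(f"permutation p-value: {p_value:.3f}")
```

The idea is that if children made no difference, the "child" and "no child" labels would be interchangeable, so we check how often a random relabeling produces a gap as big as the one observed.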
Trying to Predict the (Log of the) Number of Notes
I next wanted to see if I could use a few easy-to-collect variables to explain some of the variability in the number of notes for each photo. In order to achieve this, I fit a linear regression model with notes as the dependent variable, and the number of days the photo has been up, the number of words in the caption, whether or not children were in the photo, and the number of people in the photo as independent variables.
Since it isn’t always appropriate to fit a linear regression model (for example, when you have a non-linear relationship or non-constant variance), it is really important to look at model diagnostics. Here is the residuals vs. fitted values plot for my first model:
I must say that this isn’t looking too hot. Ideally my residuals would be spread out evenly above and below the zero line, and there would be no clear pattern. Sadly that is not the case here. When your residuals look like this, that means it is time to head back to the drawing board. Fortunately, we still have some options here.
When I log transform the number of notes, and also add a quadratic effect of the number of days the photo has been online, I end up with a much better residual plot:
This still isn’t the residual vs. fitted values plot of my dreams, but I am able to stomach it. Soooo let’s move forward and see what this regression tells me. Here is the equation for my model:
predicted log(notes) = β0 + β1(words) + β2(days) + β3(days^2) + β4(people) + β5(children)
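Here is a minimal sketch of fitting a model with this shape by ordinary least squares in plain Python. It is not the software or data used for this post: the helper names are made up, the rows are synthetic values generated from known coefficients (so we can check the fit recovers them), and the natural log is assumed.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_log_notes_model(rows):
    """rows: (notes, words, days, people, children) tuples; children is 0 or 1."""
    # Design matrix columns: intercept, words, days, days^2, people, children.
    X = [[1.0, w, d, d * d, p, c] for _, w, d, p, c in rows]
    y = [math.log(notes) for notes, *_ in rows]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic data: (words, days, people, children) values are made up.
true_beta = [5.0, 0.02, 0.1, -0.005, 0.3, -0.2]
predictors = [(5, 1, 1, 0), (12, 2, 2, 1), (30, 3, 1, 0), (8, 4, 0, 0),
              (45, 5, 3, 1), (20, 6, 2, 0), (3, 7, 1, 1), (60, 8, 1, 0),
              (15, 9, 2, 1), (25, 10, 3, 0)]
rows = []
for w, d, p, c in predictors:
    log_notes = sum(b * x for b, x in zip(true_beta, [1.0, w, d, d * d, p, c]))
    rows.append((math.exp(log_notes), w, d, p, c))

beta_hat = fit_log_notes_model(rows)
print([round(b, 4) for b in beta_hat])
```

Solving the normal equations directly is fine for a small, well-scaled example like this one. With real post ages in the hundreds or thousands of days, the days^2 column gets very large, so a statistics package (or rescaling days first) is the safer route.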
After adjusting for other variables in the model, I found that the predicted log number of notes increases by 0.016 for each additional word in the photo caption (p-value = 0.008). In other words, everything else being equal, the longer the caption, the more notes a photo is predicted to get. That isn’t to say that the best way to increase the number of notes is just to write longer captions… we don’t have proof of a causal association here. In fact, it could be that there is something inherently interesting that inspires both longer captions and more notes. We just can’t tell! Now, if we could do an experiment where we randomly assigned captions of different lengths, then we could try to make some causal claims. As of right now, nobody is giving me that power, so we will have to put that idea on the back-burner.
With just these few variables, my model is accounting for 69% of the variation in the log number of notes! That was higher than I would have hoped for. When I think of the HONY blog, it really seems to me that the popularity of the image is highly related to the story that image is telling. To think that we could explain 69% of the variability in the log number of notes with just these basic ways of measuring each post is pretty astounding!
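A coefficient on the log scale is easier to read as a percent change. Assuming the log here is the natural log (an assumption of this sketch), exponentiating the coefficient gives the multiplicative change in predicted notes per extra word. The function name is made up for illustration.

```python
import math

def percent_change_per_unit(log_coefficient):
    """Convert a coefficient on the natural-log scale to a percent change."""
    return (math.exp(log_coefficient) - 1) * 100

# 0.016 is the coefficient for caption words reported above.
per_word = percent_change_per_unit(0.016)
ten_words = percent_change_per_unit(0.016 * 10)
print(f"{per_word:.2f}% per word, {ten_words:.1f}% per ten words")
```

So, holding the other variables fixed, each extra caption word goes with roughly 1.6% more predicted notes, and ten extra words with roughly 17% more.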
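For reference, this "percent of variation explained" number is R-squared: one minus the ratio of leftover (residual) variation to total variation in the response the model was fit to, which here is on the log scale. A small sketch with made-up values (not the HONY data) shows the arithmetic:

```python
from statistics import fmean

def r_squared(observed, predicted):
    """Share of the variation in `observed` accounted for by `predicted`."""
    mean_obs = fmean(observed)
    # Residual sum of squares: variation the model leaves unexplained.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation around the plain average.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Made-up log-scale values for illustration only.
observed_log_notes = [4.0, 5.0, 6.0, 7.0, 8.0]
predicted_log_notes = [4.5, 4.5, 6.0, 7.5, 7.5]
print(round(r_squared(observed_log_notes, predicted_log_notes), 3))
```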
Wrapping it up
One of the (many) great things about statistics is that I didn’t have to look at every single blog post to be able to make some reasonable claims about what is happening on the Humans of New York blog. Because I took a random sample, I could generalize from the 50 posts in my sample to the 4,000+ posts that make up the Humans of New York blog.
This was just one analysis of one website on the internet. There are many, many more things we could analyze. The sky is the limit!
Statistics: so amazing!
If you want to do your own analysis with these data, check out the random sample that I used in the links below. And if you end up doing your own analysis, do let me know! I would love to hear about your findings!