I feel like I am on a roll here. Last month, my Bayesian paper (joint work with Krista Gile, Joseph Hogan, Nancy Barnett, and Crystal Linkletter) was published in Statistics in Medicine, and just today my RDS paper (which I worked on with Dr. Krista Gile) was published in the Electronic Journal of Statistics!
I am really interested in RDS, and here’s why: it is really important to understand the health behaviors and needs of people who are at higher risk for certain health issues, like HIV. But it turns out that it is really hard to get a nice random sample of the people who tend to be at higher risk for HIV.
For instance, it is hard to get a random sample of injection drug users, or a random sample of sex workers. Since many statistical methods assume that you are using a random sample, those statistical methods can go awry if you get your sample in some other way… like a convenience sample.
So what are we supposed to do? How are we supposed to find out if HIV prevention measures are actually working in these higher risk groups if we can’t get a nice random sample? Should we use a convenience sample even though we know it will cause our results to be biased? Should we give up?
Well, we should never give up. This is an important problem, and clearly we need to work to find a reasonable solution. Fortunately, there are several techniques (or statistical methods) for dealing with this very situation. One of these techniques is called Respondent-Driven Sampling, or RDS, which was first proposed by Heckathorn and in subsequent years has been improved upon by many researchers in sociology and statistics.
RDS makes use of the fact that the people we want to include in our sample often know other people we also want to include. In other words, our target populations tend to be networked. RDS takes advantage of these networks: people in our sample help recruit other people, and the researchers keep track of who recruits whom. In this way the researchers get to see a small part of the social network of their target population. That’s why it is called Respondent-Driven Sampling: the respondents drive the recruitment of other respondents.
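To make the recruitment idea concrete, here is a minimal toy sketch in Python. It is only an illustration, not the procedure from any real study: the function name `simulate_rds`, the `coupons` limit on how many people each respondent may recruit, the five-person `toy_network`, and the rule that recruits are picked at random from contacts not yet sampled are all assumptions I made up for the example.

```python
import random
from collections import deque


def simulate_rds(network, seeds, coupons=2, target_size=10, rng=None):
    """Simulate respondent-driven recruitment on a toy network.

    network: dict mapping each person to a list of their contacts.
    Returns a list of (recruiter, recruit) pairs; seeds have recruiter None.
    All names and rules here are hypothetical, for illustration only.
    """
    rng = rng or random.Random(0)
    sampled = set(seeds)
    records = [(None, s) for s in seeds]
    queue = deque(seeds)
    while queue and len(sampled) < target_size:
        recruiter = queue.popleft()
        # Only contacts who are not already in the sample can be recruited.
        eligible = [c for c in network[recruiter] if c not in sampled]
        rng.shuffle(eligible)
        for recruit in eligible[:coupons]:
            if len(sampled) >= target_size:
                break
            sampled.add(recruit)
            # The researchers record who recruited whom.
            records.append((recruiter, recruit))
            queue.append(recruit)
    return records


# A made-up network of five people and their contacts.
toy_network = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "E"],
    "D": ["B", "E"],
    "E": ["C", "D"],
}
records = simulate_rds(toy_network, seeds=["A"], coupons=2, target_size=4)
for recruiter, recruit in records:
    print(recruiter, "->", recruit)
```

Every pair in `records` is a relationship in the network that the researchers actually got to observe, which is the "small part of the social network" mentioned above.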
Now you may be thinking to yourself, “Well, don’t you still have biases in this RDS business?” And if you were thinking that, you would be absolutely correct. For one thing, it turns out that people tend to be friends with people very similar to themselves, so this is by no means a simple random sample. However, one cool thing about RDS is that since we partially observe this network, we can take the network structure into account when we make our estimates from the RDS data we have gathered.
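As a rough illustration of what “taking the network structure into account” can look like, here is a small Python sketch of one common idea from the RDS literature: weight each respondent by the inverse of the number of contacts they report, on the reasoning that well-connected people are more likely to be recruited. This is not necessarily the approach used in my papers, and the function name `inverse_degree_estimate` and the numbers below are made up for the example.

```python
def inverse_degree_estimate(outcomes, degrees):
    """Estimate a population proportion with inverse-degree weights.

    outcomes: 0/1 values for each respondent (e.g. has the trait or not).
    degrees: the number of contacts each respondent reports (must be > 0).
    Hypothetical helper for illustration, not a library function.
    """
    if len(outcomes) != len(degrees):
        raise ValueError("outcomes and degrees must have the same length")
    if any(d <= 0 for d in degrees):
        raise ValueError("every reported degree must be positive")
    # Respondents with many contacts get smaller weights.
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)


# Made-up data: the two respondents with the trait are also the best connected.
outcomes = [1, 1, 0, 0]
degrees = [10, 5, 2, 1]

naive = sum(outcomes) / len(outcomes)
weighted = inverse_degree_estimate(outcomes, degrees)
print("naive sample proportion:", naive)
print("inverse-degree weighted:", round(weighted, 3))
```

In this toy data the weighted estimate comes out lower than the plain sample proportion, because the respondents with the trait are the ones who were easiest to reach through the network.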
That isn’t to say that RDS fixes everything and is perfect. In fact, RDS assumes that lots of things are true that we either can’t verify or that are highly unlikely. These assumptions are the greatest weaknesses of RDS, but they also point the way to the greatest opportunities to improve it. It is actually really important to try to improve RDS, since it is being used in studies all around the world to assess HIV prevalence. If RDS isn’t working well, then we may be overestimating or underestimating both HIV prevalence and the effectiveness of different health interventions.
There are some assumptions in RDS that have previously been considered fairly reasonable, like the assumption that each relationship (edge) in the network is equally likely to be sampled. Spoiler alert: in my new paper, we look into the assumption of equal edge inclusion probabilities, and we find that it is actually not such a reasonable assumption. If you want to learn more, head on over to EJS, where the paper is available for free!