Much of my research is on social networks, which are representations of how people, groups of people, or sometimes chimpanzees interact or are otherwise connected with each other. When analyzing social network data, as when analyzing any data, it is always important to consider how the data were collected. For social networks we have to consider not only missing data on each individual in the network, but also the possibility that information on the relationships between individuals is missing.
My newest paper (co-authored with Matthew Harrison, Nancy Barnett, Krista Gile, and Joseph Hogan) in Biostatistics looks at a particular kind of missingness in network data. In a fixed-choice design, each individual is asked to name up to a certain number of other people with whom they share the relationship of interest, and the result is called fixed-choice data. With fixed-choice data, we can be sure that we have complete data for each individual who named fewer than the maximum number of relationships, but for each individual who did name the maximum, we don’t know whether they would have named others had they been allowed to.
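To make the distinction concrete, here is a minimal Python sketch that splits respondents into those whose nominations are known to be complete and those who hit the cap. The function name `split_by_censoring`, the toy `nominations` dictionary, and the cap of 3 are all made up for illustration; none of them come from the paper.

```python
def split_by_censoring(nominations, max_choices):
    """Split respondents by whether their nomination list may be cut off.

    nominations: dict mapping each respondent to the list of people they named.
    max_choices: the cap imposed by the fixed-choice survey (hypothetical value).
    """
    # Fewer names than the cap: the list is known to be complete.
    complete = sorted(p for p, named in nominations.items()
                      if len(named) < max_choices)
    # Reached the cap: there may be ties we never got to see.
    possibly_censored = sorted(p for p, named in nominations.items()
                               if len(named) >= max_choices)
    return complete, possibly_censored


# Toy survey with a cap of three nominations per person.
nominations = {
    "ana": ["ben", "cat"],
    "ben": ["ana", "cat", "dan"],
    "cat": ["ana"],
    "dan": [],
}
complete, possibly_censored = split_by_censoring(nominations, max_choices=3)
print(complete)           # ['ana', 'cat', 'dan']
print(possibly_censored)  # ['ben']
```

In this toy survey only "ben" reached the cap, so his list is the only one that might be missing names.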
Why would a fixed-choice design pose a problem for analysis? Well, if no one reached the maximum number of relationships, it wouldn’t be a problem at all: you would get to assume that you have complete information on the relationships between individuals in your sample. But let’s imagine that I want to use network data to see if people who share a particular trait are more or less likely to be friends. If I have data from a fixed-choice design, and there are indeed individuals in the sample who named the maximum number, then I don’t know whether the people those individuals did not name in the survey are actually people they are connected to. So using fixed-choice data to test whether sharing a particular trait is associated with sharing a relationship can lead to biased results if you ignore the fact that some of your relationship data are missing.
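A toy simulation can illustrate this kind of bias. The Python sketch below is not the analysis from the paper. It assumes a deliberately dense network, a binary trait, and respondents who, when they have more ties than the cap, report a uniformly random subset. Every name and parameter value (`simulate`, `odds_ratio`, `p_same`, `p_diff`, `max_choices`) is hypothetical.

```python
import random


def odds_ratio(ties, traits):
    """Odds of a tie among same-trait pairs divided by the odds among
    different-trait pairs, counting every unnamed person as a non-tie."""
    n = len(traits)
    counts = {True: [0, 0], False: [0, 0]}  # same trait? -> [ties, non-ties]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            same = traits[i] == traits[j]
            counts[same][0 if j in ties[i] else 1] += 1
    (a, b), (c, d) = counts[True], counts[False]
    return (a / b) / (c / d)


def simulate(n=300, p_same=0.5, p_diff=0.2, max_choices=5, seed=1):
    rng = random.Random(seed)
    traits = [i % 2 for i in range(n)]  # alternate the binary trait

    # True ties: i would name j with a probability that depends on
    # whether they share the trait.
    true_ties = []
    for i in range(n):
        named = {
            j for j in range(n)
            if j != i
            and rng.random() < (p_same if traits[i] == traits[j] else p_diff)
        }
        true_ties.append(named)

    # Fixed-choice survey: anyone with more ties than the cap reports
    # only a random subset of size max_choices (an assumption).
    observed = []
    for named in true_ties:
        if len(named) > max_choices:
            named = set(rng.sample(sorted(named), max_choices))
        observed.append(named)

    return odds_ratio(true_ties, traits), odds_ratio(observed, traits)


full_or, naive_or = simulate()
print(f"odds ratio, complete data: {full_or:.2f}")
print(f"odds ratio, capped data treated as complete: {naive_or:.2f}")
```

Under these assumptions, treating the capped nominations as complete pulls the estimated odds ratio toward 1 relative to the complete-data estimate. The size and direction of the bias in real data will depend on how people choose whom to name and on how many of them hit the cap.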
In this paper we develop a new method for accounting for the missingness in fixed-choice network data, and we suggest a new design that asks individuals who have reported the maximum number of relationships to also report how many more people they would have named had there been no maximum. It turns out that this extra bit of information can really help with our estimation.
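The following Python sketch does not reproduce the estimation method. It only shows what the follow-up question pins down for each respondent: the total number of people they would have named, even though which people remains unknown. The function `implied_out_degree` and its arguments are hypothetical names chosen for this example.

```python
def implied_out_degree(n_named, max_choices, n_additional=None):
    """Return (lower, upper) bounds on how many people a respondent would
    have named with no cap. upper is None when it cannot be determined."""
    if n_named < max_choices:
        # Below the cap: the reported count is the full count.
        return n_named, n_named
    if n_additional is None:
        # At the cap with no follow-up answer: only a lower bound.
        return n_named, None
    # At the cap with a follow-up answer: the count is known exactly.
    total = n_named + n_additional
    return total, total


print(implied_out_degree(2, 5))     # (2, 2)
print(implied_out_degree(5, 5))     # (5, None)
print(implied_out_degree(5, 5, 3))  # (8, 8)
```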
Next up, I want to develop an R package that implements this method and upload it to CRAN. I think that sounds like a fun project for the summer.