Tackling the US obesity epidemic using network analysis

[This post is based on a paper I published with my colleagues from Northwestern University back in 2015]

This was a very interesting project for me, because we used computational tools and teamed up with medical doctors to approach a relevant health issue in the US: obesity. As everybody knows, obesity is a health problem of almost epidemic proportions in the US (over 60% of adults were either obese or overweight as of 2014). While exercise and diet are necessary tools, they are not always enough. Doctors sometimes recommend support groups, such as Weight Watchers. The specific research question we addressed was: is it possible to obtain benefits similar to those of health support groups just from online platforms?

We obtained our data by partnering with an online weight management platform. The data includes detailed, self-reported information about the weight evolution of the users over time. This platform includes many resources, such as forums to discuss specific issues, as well as a complete database of different foods and calories. More interestingly for us, this platform includes a friendship tool that allows users to connect with each other, and hence we have records of friendships too.

First, we explore the data set, and present some basic statistics:

From this initial exploration, we learn that the entire set of users that have entered at least two weigh-ins in the system (Group 1) is around 22K users, over 80% of them female, with an average initial Body Mass Index of 31 and an average age of 43. We can also get the average values for these and other variables for subsets of users in the platform. For example: those with at least two weigh-ins and an engagement time of at least 6 months (Group 2), where we define engagement time as the total time between a user's sign-up day and her or his last day of recorded activity on the platform; or those that also have at least one friend (Group 4). We observe that as we move from the general Group 1 to more engaged subsets of users (Groups 2, 3 or 4), both the total weight loss and the amount of online communication with other users are higher. This is an interesting clue that we'll use in our model for weight loss later.
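The group definitions above can be sketched as a few filters over user records. This is only an illustration: the records, field names, and the 180-day cutoff for "6 months" are assumptions made up for the example, not the platform's actual schema or our exact code.

```python
from datetime import date

# Hypothetical user records; ids, dates and counts are invented for illustration.
users = [
    {"id": "a", "signup": date(2013, 1, 1), "last_active": date(2013, 9, 1), "weigh_ins": 12, "friends": 3},
    {"id": "b", "signup": date(2013, 2, 1), "last_active": date(2013, 2, 1), "weigh_ins": 1, "friends": 0},
    {"id": "c", "signup": date(2013, 3, 1), "last_active": date(2013, 5, 1), "weigh_ins": 4, "friends": 0},
    {"id": "d", "signup": date(2013, 1, 15), "last_active": date(2013, 12, 1), "weigh_ins": 20, "friends": 0},
]

def engagement_days(user):
    """Days between sign-up and the last day of recorded activity."""
    return (user["last_active"] - user["signup"]).days

# Group 1: at least two weigh-ins.
group1 = [u for u in users if u["weigh_ins"] >= 2]
# Group 2: Group 1 plus at least 6 months (assumed here to be 180 days) of engagement.
group2 = [u for u in group1 if engagement_days(u) >= 180]
# Group 4: Group 2 plus at least one friend.
group4 = [u for u in group2 if u["friends"] >= 1]

for name, group in [("Group 1", group1), ("Group 2", group2), ("Group 4", group4)]:
    print(name, [u["id"] for u in group])
```

Each group is built from the previous one, which mirrors the idea of moving toward increasingly engaged subsets of the same population.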

Next, by defining a link as an accepted friendship request between two users of the platform, we can build a network of our population:

Network representation of the friendships among users in the system under study. We represent women by a circle and men by a triangle, and we color coded the total weight loss of the users over their total stay in the system.

We observe several things from the figure above: the network has a giant component (GC) that includes about 75% of the users, while the rest of the population forms small groups or couples. We also notice that the individuals in the center of the network tend to have a higher success at losing weight than those in the periphery. This is another interesting clue!
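As a rough sketch of how a friendship network and its giant component can be computed, the snippet below builds an undirected graph from accepted friendship pairs and finds the largest connected component with a breadth-first search. The friendship list and user ids are made up for the example; in practice a graph library would do the same job.

```python
from collections import defaultdict, deque

def build_graph(friendships):
    """Undirected adjacency map: one link per accepted friendship."""
    adj = defaultdict(set)
    for u, v in friendships:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def connected_components(adj):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Hypothetical accepted friendship requests.
friendships = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]
adj = build_graph(friendships)
components = connected_components(adj)
giant = max(components, key=len)
gc_fraction = len(giant) / len(adj)
print(sorted(giant), round(gc_fraction, 2))
```

Note that this fraction is taken over users with at least one friend, since users without friends never appear in the friendship list.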

The ultimate goal of this research project is to build a weight loss model that includes all the relevant variables available to us. We have age and gender for the users, as well as number of friends, number of online communications with other users in the system, and so on.

Before we move any further, there is an important detail to consider in this and other observational cases where there is no control over the amount of time that the participants remain in the system: self-selection and potential bias. It is reasonable to think that some users will drop out pretty early on, while others will stick to the program and carry on. In fact, if we plot the attrition rates as a function of time, we observe how the population drops out as time goes by:

Engagement time for the entire population (red line), as well as for people with at least one friend in the platform (green line). 

We observe that, for the entire population, 40% of people never return to the system after signing up, while 20% of them remain active after 6 months (180 days). More interestingly, among people that make at least one friend, 50% remain engaged after 6 months. This is yet another clue!
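The engagement curve can be read as the fraction of users whose engagement time reaches a given number of days. A minimal sketch, assuming we already have one engagement time (in days) per user; the numbers below are invented and are not our data:

```python
def fraction_remaining(engagement_days, t):
    """Fraction of users whose engagement time is at least t days."""
    return sum(1 for d in engagement_days if d >= t) / len(engagement_days)

def fraction_never_returned(engagement_days):
    """Fraction of users with no recorded activity after the sign-up day."""
    return sum(1 for d in engagement_days if d == 0) / len(engagement_days)

# Hypothetical engagement times, in days, for ten users.
days = [0, 0, 0, 0, 10, 45, 90, 200, 365, 400]

print(fraction_never_returned(days))
print(fraction_remaining(days, 180))

# Evaluating the curve on a grid of days gives the points to plot.
curve = [(t, fraction_remaining(days, t)) for t in range(0, 401, 100)]
```

Computing the same curve separately for users with at least one friend gives the second line in the figure.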

Next, we explore the correlation between a user's integration (or 'embeddedness') in the social network and her or his weight loss performance. We start by simply selecting different subgroups of users with increasing embeddedness in the social network, and plotting their total weight loss at the 6-month mark:

In the left panel we observe an increase in average weight loss for users that are increasingly embedded in the network: members belonging to the network's giant component lose around 6.8% of their body weight in six months. This value is above the clinically significant threshold (5%), and also higher than the value for the general population (4.5%) or for the non-networked members (4.1%, not shown). The differences in weight loss for all pairs of sets shown are statistically significant, except for networked users vs users in the GC.

A more specific measure of network embeddedness is given by the k-shell index. Without going into a lot of detail, let's say that this index is calculated by an iterative pruning algorithm, and it measures how deeply connected to other members of the network a user is, and how deeply connected those members are in turn (if you picture the network as the layers of an onion, then nodes with k = 1 or 2 are in the periphery, while those with k = 8 or 12 are nodes deeply embedded in the core of the onion-network).
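For readers who want to see the pruning idea concretely, here is a minimal sketch of a standard k-shell decomposition: repeatedly remove the nodes with the fewest remaining links, and label each node with the pruning level at which it was removed. The small graph is made up, and this is an illustration of the general algorithm rather than the exact code we ran.

```python
def k_shell_index(adj):
    """Map each node to its k-shell index via iterative pruning.

    adj: dict mapping node -> set of neighbor nodes (undirected graph).
    """
    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to shell k.
        to_prune = [node for node in remaining if degree[node] <= k]
        if not to_prune:
            k += 1  # nothing left to peel at this level; move one layer in
            continue
        while to_prune:
            node = to_prune.pop()
            if node not in remaining:
                continue  # already pruned via a duplicate entry
            shell[node] = k
            remaining.discard(node)
            # Removing a node lowers its neighbors' degrees, which may
            # push them into the same shell.
            for neighbor in adj[node]:
                if neighbor in remaining:
                    degree[neighbor] -= 1
                    if degree[neighbor] <= k:
                        to_prune.append(neighbor)
    return shell

# Hypothetical network: a triangle (a, b, c), a pendant user d, and a user g with no friends.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "g": set(),
}
shells = k_shell_index(adj)
print(shells)
```

In this toy network the user with no friends lands in the k = 0 layer, the pendant user in k = 1, and the three mutually connected users in k = 2.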

We can then plot the average weight loss of users at each k-shell embeddedness level (note that people without any friends are in the k=0 layer):

Because users stay in the system for different amounts of total time (as we have just seen), and to make a fair assessment and comparison of everyone's progress, we have to measure weight loss after a fixed amount of time.

Going even further, it is possible that people don't just drop out randomly, but that certain characteristics make some users more likely to drop out than others. For example, it is reasonable to think that users who do not manage to lose any weight after a while give up, while more successful ones stay in the system. If we do not correct for this self-selection bias, we could overestimate the effect of the online platform on weight loss outcomes.

If you are interested, you can read the details in our paper, but for now, let's just say that we apply what is called the Heckman correction, a two-step process that accounts for possible selection bias and censoring and corrects for their effects. The first step is a logistic regression to estimate the likelihood of users dropping out of the system before we can measure their weight change (ideally, you use different independent variables for this part than for the second part). The second step is the actual linear regression model for weight change. Fortunately for us, the R package sampleSelection includes this model, and we only have to give it the variables we want to use for estimating the likelihood of remaining in the platform, as well as the variables for estimating weight change.
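For reference, the textbook form of the two-step Heckman correction can be written as below. This is the standard formulation, in which the first step is a probit model, and it may differ in detail from the exact specification in our paper. The symbols are notation introduced here for illustration: $s_i = 1$ means user $i$ stays long enough for the weight change $y_i$ to be observed, $z_i$ are the variables in the selection step, $x_i$ are the variables in the weight change step, $u_i$ is standard normal with correlation $\rho$ to $\varepsilon_i$, and $\phi$ and $\Phi$ are the standard normal density and distribution functions.

```latex
\begin{aligned}
s_i^{*} &= z_i^{\top}\gamma + u_i, \qquad s_i = \mathbf{1}[\,s_i^{*} > 0\,]
  && \text{(step 1: selection)} \\
y_i &= x_i^{\top}\beta + \varepsilon_i \quad \text{observed only if } s_i = 1
  && \text{(outcome)} \\
\mathbb{E}[\,y_i \mid x_i, z_i, s_i = 1\,] &= x_i^{\top}\beta
  + \rho\,\sigma_{\varepsilon}\,\lambda(z_i^{\top}\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
  && \text{(step 2: corrected regression)}
\end{aligned}
```

In words: the second step is an ordinary linear regression of weight change on $x_i$ plus one extra term, $\lambda(z_i^{\top}\hat{\gamma})$, estimated from the first step. If that extra term has no effect ($\rho = 0$), selection is not biasing the weight change estimates.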

Next, I present the multiple models we tested, which include different numbers of independent variables, after applying the Heckman correction:

Let's break down what we found. First of all, every model includes the initial BMI of the users (since it would make sense that the heavier a user is initially, the more weight she may eventually lose), the number of weigh-ins (which is considered a measure of engagement or commitment with the program), and the total amount of time in the system. These three variables are significantly, negatively correlated with weight change (or in other words, positively correlated with weight loss). We also observe that the number of online communications per week with other users is significant, along with the user's embeddedness as measured by the k-shell index (as we already knew). We also find that a user's network betweenness is not significant. More interestingly, the user's number of friends, as well as the average weight change of a user's circle of friends, is ONLY significant if considered in isolation (without any other network-related variables): both stop being significant once we add the user's k-shell embeddedness back in.

This was an interesting finding for me: sometimes, variables appear significantly correlated with the outcome, but only until you include some other variable.

After exploring the contribution of each variable separately and in combination, the final model we picked for weight loss is the one listed last, which accounts for 27% of the variability in our data (not bad, given all the missing information about our subjects, such as diet, habits, genetics, etc.!)

So, to summarize, in this project I started exploring the data by grouping it into different subsets, and I obtained several useful clues about what could be going on in the system. I also used network science to uncover users' attributes that would otherwise be invisible or inaccessible, such as network embeddedness. After I gathered a set of 'interesting variables' that could potentially correlate with weight loss, I built my final model, including a statistical correction against potential selection bias. The results from my model indicate that, even after controlling for initial weight and engagement, social network embeddedness has a positive correlation with weight loss for the individuals in my population.

Here you can read my original research paper.