Analyzing Food Insecurity in the United States

Authors: Christine Shen, Kara Chu, Akshat Srivastav, Anika Chakrabarti

13 min read · Dec 23, 2021

(Source: https://helpingpeople.org/what-food-insecurity-means-for-a-childs-development/)

Introduction

According to Hunger and Health, an estimated 1 in 8 Americans were food insecure in 2020¹. The United States, although an economic superpower, faces a serious domestic challenge in food insecurity. Before we dive into models that help us study food insecurity, it is essential that we define it. According to the U.S. Department of Agriculture, food insecurity is the lack of consistent access to enough food for an active, healthy lifestyle¹. It is also worth defining food deserts, since they may be central to addressing food insecurity. A food desert, according to Medical News Today, is a region whose inhabitants have limited access to healthy and affordable food options². Let’s see how this ties in with food insecurity!

To explore this problem of food insecurity, our team scoured the internet and found a dataset from the U.S. Department of Agriculture. The dataset, named Food Environment Atlas Data, documents variables such as the percentage of children, seniors, and adults with low access to food, along with economic factors such as poverty rates, for states and counties across the US. Our team’s objective is to build an ML model that uses these variables to predict whether or not a region within the U.S. is a food desert.

Supplemental Nutrition Programs

Previously known as the Food Stamp Program, the Supplemental Nutrition Assistance Program (SNAP) is a federal program that provides food benefits to low-income families and individuals³. But is this program effectively reaching those with low food accessibility? This is a question we studied by visualizing the variation in SNAP benefits and food access by county.

The map above displays the association between accessibility to supermarkets and the use of government food assistance, based on USDA data from 2015. Each dot represents a county, and the size of the dots corresponds to supermarket accessibility. If a dot appears larger, that county has a larger percentage of people living at least 1 mile from a market in urban areas or more than 10 miles in rural areas. The color of the dots relates to the share of people who use SNAP benefits provided by the government. If a dot is redder, a higher percentage of people in that county receive SNAP benefits. The large, yellow dots, which are concentrated down the center of the country, are particularly interesting. These counties have both low accessibility to supermarkets and a low percentage of their population using government assistance. To help alleviate these issues, the US government may want to adjust how it implements its food assistance program so that it targets the counties where it is most needed.

Another government assistance program that we decided to take a closer look at is the Special Supplemental Nutrition Program for Women, Infants, and Children — otherwise known as WIC⁴. This is an assistance program specifically targeted towards pregnant, breastfeeding, or postpartum women, infants, and children under five years old. Some of the services WIC provides are supplementary nutritious foods, nutrition education, and health service referrals. It is important to note that participants must qualify for the program by meeting income guidelines and being at “nutritional risk,” and it is possible that not everyone who qualifies for the program will receive benefits. According to the US Department of Agriculture, “Congress does not set aside funds to allow every eligible individual to participate in the program,” since the program only operates on federal grants⁵.

Program participants receive vouchers to purchase healthy foods at specific WIC-authorized stores. The number of WIC-authorized stores, as well as the available WIC resources, clinics, and health centers that provide WIC services, varies by county, as the program is run at a local level. For this reason, we decided to look into the range of WIC services by county.

In the above map, we examine the relationship between WIC-authorized stores and populations with low access to food stores. Each dot on the map represents a county. Similar to the first map, the size of each data point corresponds to the percentage of the population with low access to stores, and the color corresponds to the number of WIC-authorized stores in that particular county. For example, the large red dot in Southern California is Los Angeles County, which shows that Los Angeles has a large number of WIC-authorized stores when compared to the rest of the country. By the size of this data point, we can also see that a large percentage of Los Angeles County is living with low accessibility to food stores. In counties with low accessibility and low numbers of WIC-authorized stores (i.e. light-colored, large dots), it is likely that there are more people who qualify for WIC than those who receive WIC benefits. This indicates a clear disparity in food security across the country.

Another aspect of the dataset that we wanted to examine was the variety of food resources in the US, and how access to these resources varies by county. For simplicity, the above bar graph only visualizes counties in California. The different colors represent the number of grocery stores, fast food restaurants, SNAP-authorized stores, and convenience stores in each county, expressed as the number of stores per 1,000 residents. Similar to WIC, SNAP provides services for individuals to receive supplementary nutritious foods, and these foods can be purchased at specific SNAP-authorized stores.

As the graph shows, there is a wide range of accessibility just within the state of California. The variable that appears to have the widest range is the number of SNAP-authorized stores. For example, compare Trinity County and Marin County. While this chart only looks at one state, it would be interesting to see if the range of accessibility increases when comparing counties in different states.

So far, we have found that the levels of government assistance and accessibility to food resources are nowhere near consistent across the country, and there are many different geographic, political, and demographic factors at play.

Child Poverty

According to the US Department of Agriculture, “Children have always been the largest category of WIC participants.”⁶ For this reason, we wanted to take a closer look at how children are affected by low accessibility to health resources. The following two maps show the child poverty rate by county and the number of WIC redemptions per capita by county, respectively. The goal of these visualizations was to identify any relationship between areas with high child poverty and the number of WIC redemptions per person.

Something that stands out from the first map is simply the prevalence of child poverty across the country, as a large number of counties appear to fall within a 30–40 percent range or above. When comparing the two maps, there seems to be a slight correlation between the child poverty rate in some areas and the number of WIC redemptions per person. However, there was a lot of missing data for the number of WIC redemptions. While the missing data makes it difficult for us to come to a conclusion about the relationship between these two maps, it is clear that child poverty is an issue that many counties face.

Food Insecurity and its Relationship with Diabetes

Diabetes and Food Insecurity Scatterplot

Another item of interest is how food insecurity contributes to health problems. Diabetes is one of the most prevalent diseases in the United States, and it is heavily linked to people’s diets. A UCLA/Health Affairs study⁷ showed that amputation rates for diabetes cases are higher in certain regions of Los Angeles, reflecting health disparities. Thus, we created a scatter plot to examine the average household food insecurity percentages from 2012–2017 in relation to the average diabetes percentages of the different counties in the country. Although there is no clear pattern in the scatter plot, it is still worth continuing to examine the effects of food insecurity, the availability of healthy foods, and how they both relate to the development of metabolic diseases.

Classifying Food Deserts

After uncovering interesting insights about our dataset, it’s time to dive into some ML! Before we begin talking about all the complex and not-so-complex models, let’s spell out our problem more numerically. We want to predict whether or not a certain region in the U.S. is a food desert. Our team decided to use the variable PCT_LACCESS_POP to determine whether a region was a food desert. The variable name, which might look like intimidating jargon, simply denotes the percentage of the population that has low access to food. To quantify a food desert we look at the median of this variable: since a higher value means a larger share of residents with low access, regions that surpass the median are food deserts while regions that fall below the median are not. It’s important to note that this cutoff is quite subjective, but it gives a rough idea of how we might go about classifying food deserts.

What might also seem apparent now is that we are staring at a classification problem, i.e. we don’t want to predict a continuous quantity like the number of people with low access to food. Rather we want to classify our region into one of two categories: food desert or NOT a food desert.
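The labeling rule can be sketched in a few lines of plain Python. This is only an illustration: the county names and percentages are made up, the helper `label_food_deserts` is hypothetical, and it assumes that a higher PCT_LACCESS_POP means worse access, so counties strictly above the median are labeled food deserts.

```python
from statistics import median

# Hypothetical PCT_LACCESS_POP values; the numbers are made up for illustration.
counties = {
    "County A": 5.2,
    "County B": 18.7,
    "County C": 33.1,
    "County D": 12.4,
    "County E": 47.9,
}

def label_food_deserts(pct_low_access):
    """Return {county: 1 or 0}, where 1 means 'food desert'.

    Assumption: a county is a food desert when its share of residents
    with low access is strictly above the median share.
    """
    cutoff = median(pct_low_access.values())
    return {name: int(value > cutoff) for name, value in pct_low_access.items()}

labels = label_food_deserts(counties)
```

Because the cutoff is the median, roughly half of the counties end up in each class, which keeps the two classes balanced.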

Let’s now have a look at some of the models we have used to help us tackle our problem. The primary models that we used to tackle this problem are:

Random Forests
Logistic Regression
Gaussian Naive Bayes

Random Forests

A Random Forest is essentially a collection of Decision Trees. A Decision Tree, in layman’s terms, is kind of like a flowchart that explores the decision alternatives with regard to the feature space. In the Decision Tree, we start with a question and make a decision for a certain feature. At each decision point, we have a node that splits into other nodes. At the end of the tree, we reach a prediction, which in our case would classify a region as a food desert or not.

However, a general issue with decision trees is that they tend to overfit the data, which means that they do not generalize well to data they haven’t seen before. Overfitting of this kind is a typical case of high variance, and to deal with it we use a Random Forest, which combines many decision trees. In the end, our output is essentially an average of the outputs of all the individual decision trees. In the case of classification, our Random Forest output represents what the majority of our decision trees have to say. If the majority of the trees in the Random Forest classify our region as a food desert, the Random Forest will classify it as a food desert, and vice versa.
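The voting step described above can be illustrated with a small plain-Python sketch. The function `majority_vote` and the example votes are hypothetical, and this shows only how tree outputs are combined, not how the trees are trained. Library implementations may average predicted class probabilities rather than count hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by most trees.

    Ties go to the class that appears first in the list.
    """
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for one county: 1 = food desert, 0 = not.
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
```

Three of the five trees vote for class 1, so the forest labels this county a food desert.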

Before fitting any ML model, we split our data into training and testing sets. Once we fit our model on the training data, we compare its accuracy on the training set with its accuracy on the test set. Our Random Forest model reaches 100% accuracy on the training data but 97.63% on the test data. A perfect training score is a red flag, as it suggests that our model is overfitting. Let’s have a look at the confusion matrix for the Random Forest to get an idea of its misclassifications on the test set. Here, the confusion matrix confirms what we already know: 49.86% + 47.77% = 97.63% of our classifications are True Negatives (TN) or True Positives (TP), which matches our test accuracy of 97.63%.
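A minimal sketch of the split-and-compare step is shown here in plain Python. The helper names, the 25% test fraction, and the fixed seed are assumptions for illustration; the split actually used for the results above is not stated.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

rows = list(range(100))  # stand-ins for 100 county records
train, test = train_test_split(rows)
```

A model would be fit on `train` only, and `accuracy` would then be computed once on `train` and once on `test`. A large gap between the two numbers is the usual sign of overfitting.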

Our Random Forest classification also appears to be very precise and sensitive: high precision means few false positives, and high sensitivity means few false negatives. In that sense, our Random Forest is reliable on the test set.
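To make the terms concrete, here is a small sketch that computes the confusion matrix counts, accuracy, precision, and sensitivity (recall) from true and predicted labels. The labels are made up and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 = food desert."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def summarize(y_true, y_pred):
    """Accuracy, precision, and sensitivity; a ratio is 0.0 if its denominator is 0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels for eight counties.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = summarize(y_true, y_pred)
```

Accuracy is the share of TP and TN among all predictions, which is why the two diagonal cells of a normalized confusion matrix add up to the accuracy.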

From our Random Forest, we can also see what features are most important in the graph below.

From this graph, it appears that the most important features start with PCT_LACCESS. All of these features tell us the percentage of people from specific demographic groups that have low access to food. These features are highly correlated with each other, which is probably why we might be overfitting. In our other methods, let’s explore ways to reduce overfitting.

Logistic Regression

Our logistic regression model follows the equation below. For simplicity, assume that we only have one feature X with two associated coefficients, a and b. This is kind of like a linear equation where a is the intercept and b is the slope. The output of a logistic regression model is the probability that our data point belongs to class 1; the probability of class 0 is one minus that value⁸.
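For readers who prefer code to formulas, the standard one-feature logistic model can be sketched as follows. The coefficient values in the usage are made up, and the 0.5 threshold is a common default rather than something stated above.

```python
import math

def predict_probability(x, a, b):
    """Standard logistic model: P(class 1 | x) = 1 / (1 + exp(-(a + b * x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def predict_class(x, a, b, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return int(predict_probability(x, a, b) >= threshold)

# Hypothetical coefficients: intercept a = -1.0, slope b = 1.0.
p_low = predict_probability(0.0, -1.0, 1.0)   # linear part is -1, so p is below 0.5
p_high = predict_probability(2.0, -1.0, 1.0)  # linear part is +1, so p is above 0.5
```

The linear part a + b * x can take any real value, and the logistic function squeezes it into the range between 0 and 1 so it can be read as a probability.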

The graph for this logistic regression model is given below.

However, to take care of the overfitting problem, we decided to apply PCA to our dataset. PCA, which stands for Principal Component Analysis, reduces the number of features needed in our model. We currently have 57 features, which makes our model fairly complex. The idea behind PCA is that out of 57 features we want to extract a smaller number of Principal Components, each a combination of the original features, that form a compact representation of the data while retaining most of the relevant information. We want to find the number of components that captures 95% of the variance in our features.

To decide the number of components to reduce our data to, we plot a graph to see how the percentage of variance captured varies with the number of components.
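The selection rule behind such a plot can be sketched without any plotting: accumulate the share of variance explained by each component and stop at the first count that reaches the target. The ratios below are made up for a six-component toy case, and `components_needed` is a hypothetical helper; in practice a PCA library would supply the per-component ratios.

```python
from itertools import accumulate

def components_needed(explained_variance_ratios, target=0.95):
    """Smallest number of leading components whose cumulative share reaches target."""
    running_totals = accumulate(explained_variance_ratios)
    for count, cumulative in enumerate(running_totals, start=1):
        if cumulative >= target:
            return count
    return len(explained_variance_ratios)  # target never reached: keep everything

# Made-up ratios, sorted from most to least variance explained.
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
n_components = components_needed(ratios)
```

In this toy case the first four components cover about 90% of the variance and the fifth pushes the total past 95%, so five components are kept.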

It seems like the number of components needed to pass this 95% threshold is 41. This means we need 41 components to capture 95% of the variance in the dataset. Notice that a model with only 41 components/features is less complex than a model with 57 features. After engineering our features through PCA, logistic regression gives us an accuracy of 90.59% on the test set. It gives us a very similar accuracy on the training set, which means that our model has done a good job at generalizing, but it suffers from slight bias, i.e. the accuracy is somewhat lower than before. Below is the confusion matrix (for the test set) for reference. Let’s try a model that might give us better accuracy.

Gaussian Naive Bayes

The Gaussian Naive Bayes model relies on conditional probabilities of the form P(A|B), i.e. the probability of event A occurring given that B has occurred. For the sake of simplicity, suppose we only have one feature x and are trying to predict the class/category y. Our Naive Bayes model estimates P(y | x) and assigns our data point to a category if this probability is greater than 0.5. This threshold can be adjusted.

This model rests on two assumptions. The first is that, within each class, our data x comes from the Gaussian or ‘bell-shaped’ normal distribution; this is the ‘Gaussian’ in the name. The second is that the features are independent of one another once the class is known. That second assumption rarely holds exactly and is slightly naive, hence the ‘naive’ in Gaussian Naive Bayes. To compute P(y | x) we rely on Bayes’ Theorem, which is given below.

P(y | x) = P(x | y) · P(y) / P(x)

Here, P(y) is estimated from the share of each class in the dataset, and P(x | y) is computed by assuming that x follows a Gaussian distribution within each class, with its mean and variance estimated from the data. P(x) acts as a normalizing constant that is the same for both classes. Our model then computes P(y | x) to classify our region as a food desert or not.

Instead of running this model on the regular dataset, we decided to run it on the principal components we obtained after conducting PCA. After engineering our features in such a way, this model gives us a training accuracy of 92.58% and a testing accuracy of 92.34%. Our Gaussian Naive Bayes model does not appear to be overfitting, and it gives us the best test accuracy of the two models trained on the principal components!
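The calculation can be sketched for a single feature in plain Python. The toy data and function names are made up for illustration. The sketch estimates a prior and a Gaussian mean and variance per class, then applies Bayes’ Theorem, and it assumes each class has at least two distinct values so that the variance is not zero.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(xs, ys):
    """Estimate P(y) and the per-class mean and variance of x."""
    params = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        params[label] = {
            "prior": len(values) / len(xs),
            "mean": mean(values),
            "var": pvariance(values),
        }
    return params

def posterior(x, params):
    """Return {class: P(class | x)} using Bayes' Theorem."""
    joint = {
        label: p["prior"] * gaussian_pdf(x, p["mean"], p["var"])
        for label, p in params.items()
    }
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {label: value / evidence for label, value in joint.items()}

# Toy data: one feature, class 1 = food desert, class 0 = not.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
model = fit(xs, ys)
probs = posterior(8.0, model)
```

With several features, the per-feature likelihoods would be multiplied together, treating the features as independent given the class.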

Below is the confusion matrix for reference.

Conclusion

So what have we learned about food insecurity in the US? One major theme that stood out in our results was how widely access to healthy food options varies. By examining factors such as proximity to stores, food assistance program benefits, and poverty rates by county, we can clearly see that the variables which contribute to a region’s overall accessibility are not consistent across the board. Many regions struggle to provide sufficient food benefits, and there are many different demographic and political factors contributing to a county’s accessibility. To further study these factors, we employed several classification machine learning models to predict whether regions can be classified as food deserts. The Random Forest scored highest on the test set (97.63%) but fit its training data perfectly, a sign of overfitting. Of the models that did not appear to overfit, the best performing was the Gaussian Naive Bayes, which showed a 92.34% test accuracy. Overall, we found that food accessibility is a serious issue in this country, and there are many different factors affecting a region’s access to affordable and healthy foods.