Analyzing Longevity Across the World

DataRes at UCLA
Dec 21, 2023

Aryan Sunkersett (Project Lead), Cynthia Du, Alexis Adzich, Alvin Huang, Li Mingyang, Anish Deshpande, Daniel Wang, Steven Liu

Introduction

People want to live longer because, well, it’s in our DNA. Survival instincts, fear of the unknown after death, and the desire to accomplish personal goals all play a part. Plus, who wouldn’t want more time for adventures, relationships, and just indulging in the pleasures of life? It’s like yearning for a sequel to a heart-pounding movie that ended on a cliffhanger.

In this blog post, we will investigate the leading factors that contribute to variation in life expectancy among countries by breaking down the data from “Global Country Information Dataset 2023”. Check out what we found!

Increase in Life Expectancy Across Decades

Before we dive into the different factors that affect life expectancy, we will first explore how life expectancy has changed across decades. Over that time, numerous inventions and other influences have significantly improved quality of life. To see how this has affected life expectancy, we plotted it for several countries by decade. While many nations show a consistent upward trend, South Africa is a notable anomaly. Contrary to the overall trend, its life expectancy fell sharply from 63 years in 1990 to 58 years in the 2000s and 2010s.

The primary cause for this decline can be attributed to South Africa grappling with one of the most severe HIV/AIDS epidemics globally during the 2000s. The exceptionally high prevalence of HIV resulted in a substantial number of AIDS-related deaths, significantly impacting the life expectancy, particularly among the adult population in their economically productive years. Despite efforts to address the HIV/AIDS epidemic in 2010, the South African healthcare system encountered challenges related to infrastructure, resource allocation, and workforce capacity.

As of 2020, there is a noteworthy improvement, with life expectancy rebounding to 65, surpassing the recorded life expectancy of 1990. This positive shift underscores the efficacy of interventions, improvements in healthcare infrastructure, and the resilience of the population. Nevertheless, ongoing vigilance and sustained efforts are imperative to ensure continued progress in enhancing life expectancy in the face of evolving health challenges.

How do the variables in our dataset affect life expectancy?

The graphic above gives us a cursory glance at how the variables in our dataset correlate with one another. However, we are specifically concerned with the impact that other variables have on life expectancy. The column highlighted in red shows how life expectancy correlates with each of the other variables of interest. While these are likely not the only variables that affect life expectancy, they give us good insight into which factors to focus on.

Immediately, we can see that physicians per thousand and minimum wage have a strong positive correlation with life expectancy, while infant mortality and birth rate have a strong negative correlation with life expectancy. CO₂ emissions and Urban Population have slight positive correlations. Other than these obvious observations, it is hard to tell how well the other variables are correlated. For this, we would need to do a deeper analysis. It may also be worth noting how other variables relate to one another (for example, physicians per thousand and birth rate are strongly negatively correlated, while urban population and CO₂ emissions go hand in hand).

Diving Deeper into the Impact of Physicians per Thousand Population on Life Expectancy

Let’s explore exactly how physicians per thousand correlates with life expectancy. Intuitively, a positive relationship makes sense, as people in countries with more physicians have better access to healthcare. Indeed, in the graph above, the relationship between life expectancy and physicians per thousand looks roughly logarithmic: life expectancy tends to rise with the number of physicians, but by less and less. However, the graph also shows that at about 0–1 physicians per thousand, life expectancy varies greatly, ranging from about 52 to 78 years. This shows that the number of physicians in a country doesn’t accurately predict life expectancy on its own. In addition, as the number of physicians per thousand increases, life expectancy levels off and doesn’t really go above 85, which makes sense because life expectancy has a practical limit and cannot increase indefinitely.

In general, as physicians per thousand increases, life expectancy increases as well, with most of the improvement coming by around the 2 physicians per thousand mark and only marginal benefits thereafter. However, there are scattered data points that do not follow this trend (e.g., the country with over 8 physicians per thousand has a life expectancy below 80, while the country with just over 2 physicians per thousand has a life expectancy of just below 85). Therefore, we can see that there is a relationship between life expectancy and the number of physicians per thousand, but other factors influence the life expectancy of a nation as well.

How do CO₂ Emissions Affect Life Expectancy?

The two scatterplots above examine the relationship between CO₂ emissions, GDP, and life expectancy by country. The first plot includes all countries, while the second one removes the five countries that emit over 1,000,000 tons of CO₂ (China, USA, India, Russia, and Japan) and re-scales the plot. These five countries emit so much more CO₂ compared to the rest of the world that the plot becomes difficult to interpret and the large majority of countries get squished to the left of the graph. When these five countries are removed from the picture, the graph becomes much more balanced. Perhaps these graphs display the imbalance in CO₂ pollution more than life expectancy.
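The filtering step behind the second plot can be sketched as follows. The three large figures are the ones quoted in this post. "Country A" through "Country C" and their values are hypothetical placeholders standing in for the rest of the world.

```python
THRESHOLD = 1_000_000  # tons of CO2, the cut-off described above

emissions = {
    "China": 9_893_038,          # figure quoted in this post
    "United States": 5_006_302,  # figure quoted in this post
    "India": 2_407_672,          # figure quoted in this post
    "Country A": 120_000,        # hypothetical
    "Country B": 45_000,         # hypothetical
    "Country C": 8_000,          # hypothetical
}

# Keep only the countries at or below the threshold.
kept = {country: tons for country, tons in emissions.items() if tons <= THRESHOLD}
removed = sorted(set(emissions) - set(kept))

# The x-axis only has to span the largest remaining value, which is why
# the second plot looks far less squished.
axis_max_before = max(emissions.values())
axis_max_after = max(kept.values())
print(removed)
print(axis_max_before, axis_max_after)
```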

Looking at the plots, it seems that GDP generally increases as CO₂ emissions increase. As for life expectancy, its relationship with CO₂ emissions is like that between squares and rectangles: every high CO₂ emission country has a high life expectancy, but not every high life expectancy country has high CO₂ emissions. In practice, this makes sense: countries that emit more CO₂ tend to have more economic activity and are thus more industrialized and generally wealthier, which contributes to better standards of living and longer life expectancy. For example, China emits 9,893,038 tons of CO₂ per year, which is almost double the next highest country, the United States, at 5,006,302 tons. The United States in turn emits more than double the next highest country, India, at 2,407,672 tons. China and the United States clearly dominate the rest of the world in CO₂ emissions as well as GDP, which makes sense because their economies are two of the strongest in the world and rapidly growing, requiring a lot of carbon-intensive activities. However, because of their wealthy economies and higher standards of living due to industrialization, they also have high average life expectancies despite high CO₂ emissions.

Looking at the different continents, Asia seems to contribute the most to CO₂ emissions, as it has the most points towards the right of the graph. Asia’s employment and production rely heavily on carbon-intensive activities like manufacturing, transportation, and energy, explaining why these countries also have a high GDP. Europe also has a lot of countries with high life expectancy, but many of them do not have a high GDP and have low CO₂ emissions. Monaco, the most densely populated country in the world, has an extremely high life expectancy but sits on the low end of CO₂ emissions. This is because Monaco is a very small country, and almost 70% of residents are immigrants, many very wealthy. On the other hand, Africa has many countries that have very low CO₂ emissions, GDP, and life expectancy.

How Does Fertility Rate Affect Life Expectancy?

As suggested by the original correlation matrix, birth rate is one of the variables most directly related to life expectancy, and we will use fertility rate as a proxy for it. This graphic shows the joint kernel density estimate of life expectancy and fertility rate for different countries. The distributions along the top and right edges of the graph show the marginal distribution of each variable. As you can see, different continents have noticeably distinct distributions of fertility rate and life expectancy.
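For readers unfamiliar with kernel density estimates, the idea can be sketched by hand: place a small Gaussian bump on every country and average the bumps. The data below are made up, and the fixed bandwidths `hx` and `hy` are assumptions chosen for illustration. A plotting library would normally pick bandwidths automatically.

```python
import math


def gaussian(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def joint_kde(xs, ys, hx, hy):
    """Return f(x, y), a product-Gaussian kernel density estimate."""
    n = len(xs)

    def density(x, y):
        total = sum(
            gaussian((x - xi) / hx) * gaussian((y - yi) / hy)
            for xi, yi in zip(xs, ys)
        )
        return total / (n * hx * hy)

    return density


# Toy, made-up (fertility rate, life expectancy) pairs for six hypothetical countries.
fertility = [1.4, 1.6, 1.7, 4.8, 5.2, 5.5]
life_exp = [82.0, 81.0, 80.0, 61.0, 63.0, 60.0]

density = joint_kde(fertility, life_exp, hx=0.5, hy=2.0)

# Density is high where countries cluster and close to zero elsewhere.
low_fertility_high_life = density(1.6, 81.0)
high_fertility_high_life = density(5.2, 81.0)
print(low_fertility_high_life, high_fertility_high_life)
```

Contour plots like the one above are level curves of this kind of density surface, drawn once per continent.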

Notably, Europe consistently has the lowest fertility rate but maintains a high life expectancy, even on the lower ends. This is likely due to a combination of factors, including high levels of economic development, access to healthcare, and social safety nets. Conversely, Africa has the lowest life expectancy relative to the other continents, while generally maintaining a higher fertility rate across the board. This may be due to a number of factors, including high poverty rates, limited access to healthcare, and in certain cases, poor access to education for women.

It is important to remember that this graphic does not necessarily imply that a higher fertility rate reduces life expectancy. However, it does suggest that there is a correlation between the two factors. It is likely that this correlation is influenced by other global factors such as wealth inequality and access to education. The other continents seem to generally overlap, suggesting that the relationship between fertility rate and life expectancy is more complex than a simple linear correlation. It is likely that there are a variety of factors that contribute to the different distributions seen in the graphic, and more research is needed to understand these factors fully.

Does Sleep Duration Affect Life Expectancy?

We also analyzed the effect of sleep duration on life expectancy across countries. We merged a dataset that encompassed sleep duration data of 50 countries with our main dataset, albeit excluding 5 countries that were not part of our main dataset. The 45 countries were predominantly from Europe and Asia and boasted relatively developed economies.

For ease of analysis, we classified these countries based on a binary factor — whether their average sleep duration is over or under the expert-recommended threshold of 7 hours. Notably, it is evident from our boxplot that the median life expectancy of countries with over the recommended sleep duration is 5 years longer than the median of those with insufficient sleep. Moreover, among the countries with over-recommended sleep durations, 89% boasted an average life expectancy of at least 81 years. Conversely, this figure plummeted to a mere 25% for countries with under-recommended sleep durations, highlighting a striking difference between the two groups. Given that the bulk of sleep duration data were collected on relatively developed countries, other factors, such as GDP and physicians per capita, differ minimally across these countries. Therefore, we can conclude with significant confidence that among relatively developed countries, acquiring over 7 hours of sleep is a key contributing factor to higher life expectancy.
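The grouping behind the boxplot can be sketched as below. The six countries and their values are made up, and treating exactly 7 hours as meeting the threshold is an assumption of this sketch.

```python
from statistics import median

RECOMMENDED_HOURS = 7.0   # threshold described above
LIFE_CUTOFF = 81.0        # life expectancy cut-off used for the share

# Toy, made-up rows: (country, average sleep hours, life expectancy).
countries = [
    ("A", 7.4, 82.5),
    ("B", 7.2, 81.9),
    ("C", 7.1, 80.6),
    ("D", 6.8, 78.0),
    ("E", 6.5, 76.4),
    ("F", 6.9, 81.2),
]


def summarize(rows):
    """Median life expectancy and share of rows at or above LIFE_CUTOFF."""
    lifes = [life for _, _, life in rows]
    share = sum(life >= LIFE_CUTOFF for life in lifes) / len(lifes)
    return median(lifes), share


# Assumption: exactly 7 hours counts as meeting the recommendation.
over = [row for row in countries if row[1] >= RECOMMENDED_HOURS]
under = [row for row in countries if row[1] < RECOMMENDED_HOURS]

over_median, over_share = summarize(over)
under_median, under_share = summarize(under)
print(f"over:  median {over_median}, share at 81+ {over_share:.0%}")
print(f"under: median {under_median}, share at 81+ {under_share:.0%}")
```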

Impact of Retirement Age on Life Expectancy

Another investigation we made into data not originally in the global countries information dataset was retirement age. Our initial hypothesis was that a higher retirement age would lead to a lower life expectancy. Since a higher retirement age is indicative of a more stressful work culture, it will presumably take a toll on the health of the working population. Yet, the results of our analysis, as depicted in the hexbin plot, reveal a different story.

Surprisingly, countries with higher retirement ages had longer life expectancies. Upon deeper analysis, we realized that countries with higher retirement ages also tend to have superior healthcare services that contribute to enhanced disease prevention and treatment. Conversely, countries with lower retirement ages face challenges in providing adequate healthcare, which elevates the risk of mortality in the event of disease outbreaks. Hence, the perceived stress associated with prolonged working years appears to be mitigated by the availability of high-quality healthcare services. Furthermore, retiring early could lead to inadequate savings. Even with accessible and high-quality healthcare services, early retirees might find themselves financially strained when confronted with exorbitant medical bills for severe illnesses.

KNN and Random Forest Regression Model

Building off graphical visualizations and feature analysis, a regression machine learning model provides additional insight into predicting longevity and identifying the features that influence life span. The two regression models evaluated were a KNN model, chosen because it tends to perform well on fairly regular, low-variance data, and a Random Forest Regression model, which combines multiple decision trees to make accurate predictions. In preparation for fitting the models, we preprocessed the data by cleaning it and converting any string objects, such as numbers with dollar signs or commas, into numerical data. Untrainable columns, such as “Country Name” or “Capital City,” were excluded. Finally, an important step for regression models, especially a KNN model, is to normalize the numerical data: KNN relies on distances between data points, so features measured on large scales would otherwise dominate.
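A minimal sketch of these two preprocessing steps is shown below. The helper names `to_number` and `min_max_scale` and the sample strings are hypothetical, and min-max scaling is only one possible choice of normalization.

```python
def to_number(text):
    """Strip dollar signs and commas, then parse the remainder as a float."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned)


def min_max_scale(values):
    """Rescale values to the range [0, 1] (one possible normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw strings in the style described above.
raw_gdp = ["$1,200", "$45,000", "$9,800"]

gdp = [to_number(s) for s in raw_gdp]
scaled = min_max_scale(gdp)
print(gdp)
print(scaled)
```

After scaling, a column measured in tens of thousands of dollars and a column measured in single digits contribute on the same footing to a distance calculation.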

To implement the models, the data was split into train and test sets (30% reserved in the test set). This division enabled fitting the model on the training data, followed by evaluating its performance on unseen test data to check for overfitting. During the training process, optimal model parameters also need to be selected.

The KNN model requires a k-value parameter, which dictates how many neighboring data points the model uses to make each prediction. After evaluating k-values of 1–20, we found that the model made the most accurate predictions on the test dataset at k = 12, with an RMSE of 3.7 years. The root mean squared error (RMSE) measures the typical distance between our predicted values and the actual longevity.
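The k search can be sketched with a small hand-written KNN regressor. Everything below runs on synthetic data generated in the snippet, so the best k it finds says nothing about the real dataset. Picking k on the test set mirrors the procedure described above, although a separate validation split would give a cleaner estimate.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training rows closest to x."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)


random.seed(0)

# Synthetic data: two already-normalized features and a noisy target.
X = [(random.random(), random.random()) for _ in range(100)]
y = [60 + 20 * a - 5 * b + random.gauss(0, 1) for a, b in X]

# Reserve 30% of the rows for testing.
n_test = len(X) * 30 // 100
train_X, test_X = X[:-n_test], X[-n_test:]
train_y, test_y = y[:-n_test], y[-n_test:]

# Evaluate k = 1..20 and keep the value with the lowest RMSE.
scores = {}
for k in range(1, 21):
    predictions = [knn_predict(train_X, train_y, x, k) for x in test_X]
    scores[k] = rmse(test_y, predictions)

best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```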

Some of the most important Random Forest Regression model parameters are n_estimators, the number of decision trees the model uses; max_depth, the maximum depth of each decision tree; and max_features, the number of features considered when looking for the best split. We used a grid search to find the optimal parameter choices, which gave us values of max_depth = 5, max_features = “sqrt”, and n_estimators = 200. With these training parameters, our model predicted longevity on the test dataset with an RMSE value of 2.76 years.
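A grid search over these three parameters might look like the sketch below, assuming scikit-learn and NumPy are available. The data are synthetic, and the candidate values, the 3-fold cross-validation, and the scoring choice are assumptions for illustration rather than the exact settings used for the results above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: six features, with the target driven by the first two.
rng = np.random.default_rng(0)
X = rng.random((120, 6))
y = 60 + 20 * X[:, 0] - 8 * X[:, 1] + rng.normal(0, 1, 120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed candidate values; every combination is tried.
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [3, 5],
    "max_features": ["sqrt", 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X_train, y_train)

# Score the best model found on the held-out test rows.
test_predictions = search.predict(X_test)
test_rmse = float(np.sqrt(np.mean((test_predictions - y_test) ** 2)))
print(search.best_params_, round(test_rmse, 2))
```

Because the parameters are chosen by cross-validation on the training rows only, the test set stays untouched until the final RMSE is computed.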

While it is not a perfect comparison, the RMSE is essentially the standard deviation of the residuals, and both models returned final RMSE values of less than half of the standard deviation of life expectancy. Comparing the two, the Random Forest Regression model performed better than the KNN model, with a lower RMSE indicating that its predictions were closer to the actual values.
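To see why the standard deviation is a natural yardstick, note that a model that predicts the average life expectancy for every country has an RMSE exactly equal to the (population) standard deviation. The sketch below checks this on made-up values.

```python
import math
from statistics import mean, pstdev


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Toy, made-up life expectancies for six hypothetical countries.
life_exp = [58.0, 64.0, 71.0, 75.0, 79.0, 83.0]

# Baseline "model": always predict the mean.
baseline_predictions = [mean(life_exp)] * len(life_exp)
baseline_rmse = rmse(life_exp, baseline_predictions)

print(round(baseline_rmse, 3), round(pstdev(life_exp), 3))
```

An RMSE below half the standard deviation therefore means the model's typical error is less than half that of this always-guess-the-mean baseline.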

Using the random forest regressor, we also plotted a simple feature importance chart to see which features mattered the most when predicting life expectancy. Funnily enough, the model strayed away from using things like GDP and Armed Forces Size as proxies for how developed a country is. Instead, it leans towards Occam’s razor, with the three most important features directly related to either birth or death, making for very simple and direct life expectancy predictors.
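Feature importances of this kind can be read off a fitted scikit-learn random forest, as in the sketch below. The data are synthetic and the target is deliberately built from the first two columns, so the ranking it prints is true by construction and is not a finding about the real dataset. The feature names are illustrative labels only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Illustrative labels for four synthetic, already-normalized columns.
feature_names = ["infant_mortality", "birth_rate", "gdp", "armed_forces_size"]
X = rng.random((200, 4))

# Synthetic target driven almost entirely by the first two columns.
y = 80 - 15 * X[:, 0] - 8 * X[:, 1] + rng.normal(0, 0.5, 200)

model = RandomForestRegressor(
    n_estimators=200, max_depth=5, max_features="sqrt", random_state=0
)
model.fit(X, y)

# Importances are non-negative and sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>18}: {importance:.3f}")
```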

Conclusion

In this exploration of global longevity factors, we found that healthcare accessibility, economic conditions, fertility rate, and even sleep patterns play profound roles. Notable trends include the logarithmic relationship between physicians per thousand and life expectancy, and the general increase in life expectancy over time. Some surprises were how strong the impact of a good night’s sleep is and the lack of dropoff in life expectancy in countries with a high retirement age. Machine learning models took a different approach to predicting life expectancy, with the Random Forest Regression model emphasizing the significance of maternal mortality, infant mortality, and birth rate. Overall, our findings highlight the complex web of factors influencing life expectancy across diverse nations. Short of using simple proxies for how developed a country is, or just indirectly measuring life expectancy through something like mortality, it is hard to establish a causal connection between any one factor and life expectancy. Except for getting over 7 hours of sleep. We are pretty sure about that one. So if you are reading this, and it’s past your bedtime, go to sleep. It might just save you some time.
