Being Woke About Strokes!

DataRes at UCLA
Jun 13, 2021

By Anish Dulla (Project Lead), Akshat Srivastav, Eleanor Pae, Ojas Bardiya, Zoeb Jamal

Introduction

Stroke is the fifth leading cause of death in the United States, making it one of the biggest healthcare concerns in America. While strokes often result from diet and lifestyle choices that build up over decades, steps can be taken to reduce their severity and frequency before they occur. To do so, patients need to know whether they are at risk so that they can make the changes needed to protect their wellbeing. This is why early detection of stroke risk is so important. While doctors have traditionally assessed stroke risk themselves after looking at patient data, the introduction of machine learning in healthcare promises new accuracy and efficiency in identifying at-risk patients. We aim to build and refine these ML models.

To take a closer look at stroke prediction, team Woke About Strokes at UCLA DataRes investigated patient data from Kaggle. This dataset contains 5110 patient records, including eleven clinical features for predicting stroke events. For each patient, the dataset includes fields such as gender, age, whether the patient had hypertension, whether the patient had heart disease, whether the patient was ever married, the patient’s work type, the patient’s residence type, average glucose level, BMI, smoking status, and a target stroke variable.

Exploratory Data Analysis

Before we began building our classification machine learning models, we decided to explore our dataset to discover any relationships between variables and trends within the data.

Relationship Between Glucose Level, Work Type, and Smoking

This heat map shows the average glucose level by work type and smoking status. This matters because elevated glucose levels are associated with hypertension, a major risk factor for stroke, so it helps to understand how glucose levels vary with profession and smoking habits. In our data, people who smoke and have children appear to be at the highest risk of hypertension.
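Each cell of a heat map like this is simply a group average. Here is a minimal sketch of that computation, assuming each record is a dictionary with the hypothetical keys `work_type`, `smoking_status`, and `avg_glucose_level`. The sample records are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Made-up records for illustration only; key names are assumptions.
records = [
    {"work_type": "Private", "smoking_status": "smokes", "avg_glucose_level": 110.0},
    {"work_type": "Private", "smoking_status": "smokes", "avg_glucose_level": 130.0},
    {"work_type": "Private", "smoking_status": "never smoked", "avg_glucose_level": 95.0},
    {"work_type": "Self-employed", "smoking_status": "smokes", "avg_glucose_level": 120.0},
]


def mean_glucose_by_group(rows):
    """Return {(work_type, smoking_status): mean glucose level} for the rows."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for row in rows:
        key = (row["work_type"], row["smoking_status"])
        totals[key][0] += row["avg_glucose_level"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


heatmap_cells = mean_glucose_by_group(records)
```

Each key of `heatmap_cells` corresponds to one cell of the heat map, and the value is the color intensity for that cell.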

Demographic Information and Strokes

Demographic information can be key to predicting the likelihood of suffering a stroke. Some key demographic factors to consider are gender, marital status, and age. Let’s dive in!

First, we consider gender and marital status. From the figures, it appears that 56.63% of stroke patients were female. To those planning for marriage, think twice! Our data shows that 88.35% of patients who suffered a stroke have been married at least once.

Next, we explore the BMI of individuals who suffered a stroke. There does not appear to be a significant difference in the BMI distribution between those who suffered a stroke and those who did not. The median BMI for those who suffered a stroke is slightly higher, at about 29.7, but has a smaller spread than for those who did not suffer a stroke. If we categorize BMI into classes such as Underweight, Normal, Overweight, and Obese, it comes as no surprise that 46.89% of those who suffered a stroke are obese while 35.89% are overweight. The share of overweight and obese individuals is larger among those who suffered a stroke.
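The BMI classes can be sketched as a simple binning step. The cutoffs below (18.5, 25, and 30) are the commonly used adult BMI thresholds and are an assumption here, since the article does not state which boundaries were used. The sample BMI values are made up for illustration.

```python
from collections import Counter


def bmi_category(bmi):
    """Map a BMI value to a class using assumed cutoffs of 18.5, 25, and 30."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def category_shares(bmi_values):
    """Return the percentage of values that fall into each BMI class."""
    counts = Counter(bmi_category(b) for b in bmi_values)
    total = len(bmi_values)
    return {name: 100.0 * n / total for name, n in counts.items()}


# Made-up BMI values for illustration only.
shares = category_shares([17.0, 22.5, 27.0, 29.7, 31.0, 36.6, 40.2, 33.3])
```

Running the same function separately on the stroke and non-stroke groups gives the class compositions that can then be compared.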

Other risk factors we looked at included smoking status and the overlap between a stroke event and heart disease. We anticipated that heavy smokers or past smokers would be at greater risk of stroke. Looking at our data, we noticed that a large percentage of those who suffered a stroke event were smokers or former smokers. Although the correlation is not large, it is important to note that the number of smokers and former smokers who suffered a stroke exceeds the number of nonsmokers who had a stroke.

Relationship Between Glucose Level, BMI, and Strokes

Next, we look at the incidence of stroke with respect to glucose level and BMI. We visualize their combined effect on a patient’s risk of stroke using a 2-D scatterplot. Looking at the data, we see higher rates of stroke incidence (the proportion of strokes among the total number of observations) as glucose level increases. Interestingly, the correlation between BMI and stroke incidence seems to weaken at very high glucose levels: once the glucose level exceeds about 200, there isn’t a significant increase in stroke incidence at higher values of BMI.
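Stroke incidence, as defined above, is just the number of strokes divided by the number of observations in a group. A minimal sketch, assuming each row is a `(glucose_level, had_stroke)` pair and using the approximate cutoff of 200 mentioned in the text. The rows are made up for illustration.

```python
def stroke_incidence_by_glucose(rows, cutoff=200.0):
    """Return stroke incidence below and at/above a glucose cutoff.

    rows: iterable of (glucose_level, had_stroke) pairs, had_stroke in {0, 1}.
    A group with no observations gets None instead of a proportion.
    """
    groups = {"below": [], "at_or_above": []}
    for glucose, had_stroke in rows:
        key = "at_or_above" if glucose >= cutoff else "below"
        groups[key].append(had_stroke)
    return {
        key: (sum(flags) / len(flags) if flags else None)
        for key, flags in groups.items()
    }


# Made-up observations for illustration only.
sample_rows = [(90, 0), (100, 0), (110, 1), (120, 0), (210, 1), (230, 0)]
incidence = stroke_incidence_by_glucose(sample_rows)
```

The same grouping could be repeated within BMI bands to compare incidence across both variables at once.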

Machine Learning Data Exploration

Before we dive into machine learning, we can also run a principal component analysis (PCA), a technique that reduces the dimensionality of our dataset, to check whether certain combinations of traits make someone more likely to suffer a stroke.

From the plot, we can see that there is some clustering, which suggests that machine learning algorithms may be able to predict strokes with high accuracy.
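For readers curious about what PCA does under the hood, here is a minimal NumPy sketch. It standardizes the numeric columns and projects them onto the top two principal directions. The feature matrix is made up for illustration (its columns stand in for age, glucose level, and BMI), and this is not necessarily how our plot was produced.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components.

    Returns the projected points and the share of variance each kept
    component explains.
    """
    X = np.asarray(X, dtype=float)
    # Standardize columns so large-scale features do not dominate.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Z @ Vt[:n_components].T, explained[:n_components]


# Made-up rows: (age, glucose level, BMI). Illustration only.
X_demo = [
    [25, 85.0, 21.0],
    [34, 92.0, 24.5],
    [48, 110.0, 27.0],
    [59, 150.0, 30.5],
    [71, 190.0, 29.0],
    [80, 220.0, 33.0],
]
projected, explained_ratio = pca_project(X_demo)
```

Plotting the two columns of `projected`, colored by the stroke label, gives the kind of scatterplot described above.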

Stroke Prediction With Machine Learning

Having explored the patient data and the trends in stroke prediction, we began to build classification machine learning models to predict whether a patient would be at risk for stroke. To predict these classifications, we used KNN, Decision Tree, and SVM classification algorithms. We looked at gender, age, whether the patient had hypertension, whether the patient had heart disease, whether the patient was ever married, the patient’s work type, the patient’s residence type, average glucose level, BMI, and smoking status to make these predictions. Our SVM algorithm best predicted stroke outcomes, at 95.76% accuracy. Let’s take a deeper look at these algorithms and why they achieved their respective accuracies.

The first model we used was a K-Nearest Neighbors (KNN) classifier. Roughly speaking, this approach determines which class a particular observation belongs to under the assumption that similar observations lie close to each other. K refers to the number of nearest neighbors, and by changing the value of K, one can tune the KNN classification model to produce the most accurate results. This algorithm correctly classified patients 95.5% of the time at K = 5, a very accurate model.
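The core idea of KNN fits in a few lines. The sketch below is a from-scratch illustration, not our project code: it uses Euclidean distance and a majority vote, and it assumes the features have already been scaled to comparable ranges. The training points are made up.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up, already-scaled 2-D points: label 0 = no stroke, 1 = stroke.
train_X = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7)]
train_y = [0, 0, 0, 1, 1, 1]

prediction_near_cluster_1 = knn_predict(train_X, train_y, (0.85, 0.85), k=3)
prediction_near_cluster_0 = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
```

Using an odd K with two classes avoids tied votes, which is one reason values like 3 or 5 are common choices.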

This visualization depicts how our KNN algorithm grouped together similar data points.

Another ML model we used was the Decision Tree, which is good at predicting an outcome based on a combination of traits. The decision tree was first trained using an 80% training split, stratified so that the ratio of stroke to non-stroke cases was the same in the training and testing data. K-fold cross validation was used to determine the optimal model complexity.
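To make the stratified split concrete, here is a minimal from-scratch sketch. It splits each class separately so both sides keep the same stroke ratio. The label list is made up for illustration, and in practice a library helper such as scikit-learn's `train_test_split` with its `stratify` argument is the usual way to do this.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) preserving the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 5 stroke cases among 100 patients.
labels = [1] * 5 + [0] * 95
train_idx, test_idx = stratified_split(labels)
train_stroke_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_stroke_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

With a rare positive class, an unstratified random split can easily leave the test set with too few stroke cases to evaluate on, which is why the stratification matters here.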

Strangely, the model worked best when it used only one column: age. Essentially, the model just checked whether the age was over 67.5. If it was, it classified the patient as a stroke case.

However, we knew this would not be a viable model. We decided to split the dataset by age and train two separate models. One dataset contained all the patients under 67.5 and the other contained patients 67.5 and up. K-fold cross validation determined that the model worked best with 1 to 3 branches. The model for the younger patients gave us an accuracy of 97.9% and used age, glucose level, and BMI to make its predictions.

The model for the “old” patients, on the other hand, had a lower accuracy but still had an optimal depth of 1 branch. Using the glucose level column, this model predicted stroke cases with an accuracy of 84%.
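The two-model setup can be sketched as a simple routing step. The 67.5 age threshold comes from the results above, but the rules inside the two placeholder sub-models are invented purely for illustration, since the trained trees' actual thresholds are not listed here. The dictionary keys are also assumptions.

```python
AGE_THRESHOLD = 67.5  # split point reported for the single-column tree


def single_split_rule(age):
    """Depth-1 tree from the text: flag a stroke if age is over 67.5."""
    return 1 if age > AGE_THRESHOLD else 0


def predict_with_age_split(patient, young_model, old_model):
    """Send patients under 67.5 to one model and those 67.5 and up to the other."""
    model = old_model if patient["age"] >= AGE_THRESHOLD else young_model
    return model(patient)


# Placeholder sub-models. The cutoffs below are hypothetical, NOT trained values.
def young_model(patient):
    return 1 if patient["avg_glucose_level"] > 250.0 and patient["bmi"] > 35.0 else 0


def old_model(patient):
    return 1 if patient["avg_glucose_level"] > 180.0 else 0


older_patient = {"age": 70, "avg_glucose_level": 200.0, "bmi": 25.0}
younger_patient = {"age": 40, "avg_glucose_level": 200.0, "bmi": 25.0}
older_prediction = predict_with_age_split(older_patient, young_model, old_model)
younger_prediction = predict_with_age_split(younger_patient, young_model, old_model)
```

Routing by age first means each sub-model only has to learn patterns within its own age group, instead of letting age dominate every split.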

Lastly, we used a Support Vector Machine (SVM), which separates the data into two classes (patients who suffered a stroke and patients who did not) using an optimal hyperplane chosen to maximize the margin between the classes. While the SVM is difficult to interpret, it is still useful in our case, since we are primarily concerned with identifying stroke in patients. We observed an accuracy of 95.76% using this model, as shown in the confusion matrix below.
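As a rough sketch of how an SVM classifier can be fit, the snippet below uses scikit-learn's `SVC` on synthetic data. The linear kernel, the value of `C`, the feature scaling step, and the data itself are all assumptions made for illustration and do not reflect our actual model configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: columns play the role of (age, glucose level).
no_stroke = rng.normal(loc=[40.0, 90.0], scale=[8.0, 10.0], size=(50, 2))
stroke = rng.normal(loc=[75.0, 200.0], scale=[5.0, 15.0], size=(50, 2))
X = np.vstack([no_stroke, stroke])
y = np.array([0] * 50 + [1] * 50)

# Scale features first, since the margin depends on distances between points.
svm_model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm_model.fit(X, y)

train_accuracy = svm_model.score(X, y)
example_prediction = svm_model.predict([[80.0, 210.0]])[0]
```

The synthetic classes here are deliberately well separated, so the fit is easy. Real patient data overlaps far more, which is where the choice of kernel and `C` starts to matter.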

The confusion matrix shows when the model was correct and when it misclassified.
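For a two-class problem, a confusion matrix boils down to four counts. The sketch below tallies them and derives accuracy and recall. The labels are invented for illustration and are not our model's predictions.

```python
def confusion_counts(y_true, y_pred):
    """Tally true/false positives and negatives for binary labels (1 = stroke)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def recall(counts):
    """Share of actual stroke cases that the model caught."""
    actual_positives = counts["tp"] + counts["fn"]
    return counts["tp"] / actual_positives if actual_positives else 0.0


# Invented labels: 5 stroke cases among 100 patients.
y_true = [1] * 5 + [0] * 95
# Invented predictions: 2 strokes caught, 3 missed, 1 false alarm.
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

counts = confusion_counts(y_true, y_pred)
demo_accuracy = accuracy(counts)
demo_recall = recall(counts)
```

In this made-up example the accuracy is 96% even though only 2 of the 5 stroke cases are caught, which is why it helps to read the confusion matrix alongside the accuracy figure when one class is rare.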

Although our three machine learning algorithms worked in fundamentally different ways, they were all quite accurate in this classification task, with SVM being the most accurate. We hope this article was able to give you a taste of just how impactful machine learning can be in the healthcare industry. The possibilities are truly endless. To see our work, check out our GitHub.
