What makes an Ultimate Fighter?
By Hana Lim, Zoeb Jamal, Dara Tan, Ben Brill, Kaushik Naresh, and Lia Bergman-Turnbull
MMA is one of the fastest-growing sports in the world right now, and no company has capitalized on that growth more than the UFC. The number of events per year has grown steadily, as has the UFC’s global presence. UFC athletes are also becoming increasingly well known; stars like Conor McGregor and Ronda Rousey are now household names. In addition, gambling on fight outcomes has become a huge industry, with many people placing extremely high bets on their favorite fighters. As a group, we decided to investigate the “Ultimate UFC Dataset” on Kaggle. This dataset comprises three CSV files: a master file that contains all the fight data, an upcoming events file, and a recent events file. Our initial aim was to see if we could uncover any interesting trends in the growth of the sport. After investigating a little more, we decided to also see if we could determine what characteristics make a good UFC fighter and whether we could build machine learning models to predict the outcomes of upcoming fights.
How has UFC grown in recent years?
In order to answer the first of our many questions, we used heat maps, count plots, and an interactive choropleth map to observe the trends.
From the heat map, we can clearly see that the number of matches per year is gradually increasing, as the colors get lighter as we move through time. One interesting discrepancy is April 2020, which is blank due to the cancellation of events in light of COVID-19. Based on the heat map, we can also see that there is no clear “season” for UFC events; there are high concentrations throughout the year, including from July to December. The highest number of fights was held in July 2016, the month of UFC 200 on the Las Vegas Strip.
From this count plot, we can see a similar trend: the number of fights has followed a roughly linear upward trend over the years. One interesting thing to consider is that the worldwide stay-at-home orders do not seem to have affected the UFC as much as some would have expected. After a small hiccup in April of 2020, they managed to get back to almost normal numbers. For a sport that is synonymous with huge crowds and packed arenas, one would think the pandemic would have decimated the industry. However, the UFC, much like their athletes, will not go down without a fight. Many events were held on “UFC Fight Island”, a bubble initially established in Abu Dhabi, UAE to work around travel restrictions that prevented some international fighters from competing. In light of the pandemic, this became the perfect location to hold UFC events safely.
While the majority of UFC events have been and are still held in the United States, we can see from the choropleth map that the number of events in other countries has gradually started to increase. In fact, since April of 2021, events have been held exclusively on UFC Fight Island in Abu Dhabi, UAE.
Another question we wanted to investigate was the possible influence of attendance on the intensity of the fight. Is it possible that the crowd has an impact on a fighter’s choices? To visualize this, we decided to use a pirate plot.
In order to get the attendance data, we had to scrape data from a table on Wikipedia. We cleaned this to exclude cancelled events and in the end had data for about 551 events. In order to line this attendance data up with the fight data from Kaggle, we decided that the following conditions must be met:
- The country names in both datasets must be equal
- The (last) names of the fighters from the “event” column of the Wikipedia dataset must match the names of the fighters in the Kaggle dataset
- The date from the Wikipedia dataset must be within ±1 day of the date in the Kaggle dataset (to account for time zone discrepancies)
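The three matching conditions above could be sketched roughly as follows. This is a minimal illustration in plain Python, not our actual preprocessing code: the dictionary keys `country` and `date` and the helper names are assumptions made for the example, and the event title format ("… : A vs. B") is assumed as well.

```python
from datetime import date

def last_names(event_title):
    """Extract lowercase last names from a title like 'UFC 200: Tate vs. Nunes'."""
    matchup = event_title.split(":")[-1]
    parts = matchup.replace("vs.", "vs").split(" vs ")
    return {part.strip().lower() for part in parts}

def is_match(wiki_row, kaggle_row):
    """Return True if a Wikipedia event row and a Kaggle fight row meet all three conditions."""
    # Condition 1: same country (ignoring case and surrounding whitespace).
    same_country = (
        wiki_row["country"].strip().lower() == kaggle_row["country"].strip().lower()
    )
    # Condition 2: last names in the event title match the two fighters.
    fight_names = {
        kaggle_row["R_fighter"].split()[-1].lower(),
        kaggle_row["B_fighter"].split()[-1].lower(),
    }
    same_fighters = last_names(wiki_row["event"]) == fight_names
    # Condition 3: dates differ by at most one day (time zone discrepancies).
    close_dates = abs((wiki_row["date"] - kaggle_row["date"]).days) <= 1
    return same_country and same_fighters and close_dates
```

Only headline bouts appear in an event title, which is one plausible reason why far fewer fights than events end up matched under rules like these.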
After data preprocessing, we were able to match 368 fights. To determine the “intensity” of the fight, we calculated the sum of significant strikes landed, the sum of takedowns landed, and the sum of wins by KO/TKO.
To get the x-axis of the pirate plot, the extracted attendance statistic was binned by percentile (bin 1 = 0th to 10th percentile, etc.). To get the y-axis, the calculated statistics were used collectively as a measure of “intensity.”
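As a rough sketch of the binning and the intensity measure described above: the example below assumes the three statistics are simply added together, which is only one way of using them "collectively", and the key names are hypothetical.

```python
def intensity(fight):
    """Combine the three statistics into one score (a plain sum is an assumption)."""
    return fight["sig_strikes"] + fight["takedowns"] + fight["ko_tko_wins"]

def decile_bin(value, all_values):
    """Return a bin from 1 to 10 based on the share of values strictly below `value`."""
    below = sum(v < value for v in all_values)
    # Integer arithmetic avoids floating point surprises at bin edges.
    return (below * 10) // len(all_values) + 1
```

With this definition, every zero-attendance event lands in bin 1, because no attendance figure is strictly below zero.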
From the pirate plot, we were able to determine that the fights in bin 1 (those with attendance in the lowest 10 percent, where all the zero-attendance fights during COVID fall) actually had a much lower intensity than the fights in any other bin. This suggests that in empty arenas, the athletes spend more time calculating their next moves rather than adopting a super aggressive style.
What makes a fighter more likely to win?
To see whether physical attributes correlate with better performance, we first investigated the effect of height on the number of wins.
In order to prepare the data, we first calculated the signed differences (winner attribute less loser attribute) and standardized each variable across the entire dataset. For the physical attributes like height, reach, and weight, we binned by rounding the standardized value (x-axis variables are categorical).
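The preparation step above can be illustrated with a short sketch. It assumes that the signed differences themselves are standardized (z-scored) and then rounded to the nearest integer to form the bins; the function name is made up for this example.

```python
from statistics import mean, pstdev

def standardized_bins(winner_values, loser_values):
    """Signed difference (winner minus loser), z-scored, then rounded into integer bins."""
    diffs = [w - l for w, l in zip(winner_values, loser_values)]
    mu, sigma = mean(diffs), pstdev(diffs)
    z_scores = [(d - mu) / sigma for d in diffs]
    # Rounding turns the continuous z-score into a categorical x-axis value.
    return [round(z) for z in z_scores]
```

A bin of 0 then means the winner and loser were about as far apart as in a typical fight, while bins beyond ±2 correspond to unusually large mismatches.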
In the plots, we saw that the physical attributes within bins -2 to 2 have larger, symmetric “win difference” densities, with a “win difference” of -1 to 1 being the densest. In contrast, more extreme differences in physical attributes (bins below -2 or above 2) tended to have more skewed “win difference” densities, as indicated by the section either above or below the bar and band being much thicker while the other side is much more tail-like.
This suggests that the possession of a particular physical attribute does not automatically give you an inherent advantage over another fighter in a match. It is more about having a difference in physical attributes relative to your opponent.
How do different weight classes fight?
Having seen how physical attributes can affect wins, we also wanted to see whether they could affect the type of win. In a UFC fight, there are three ways a fight can end: by knockout (the loser is unconscious) or technical knockout (the loser is unable to intelligently defend themself), by submission (the opponent taps out), or by decision (three judges score the fight). To visualize this, we looked at the percentage of bouts that were won by each win type for each weight class, by gender. The graph shown depicts this for the male weight classes, sorted from lightest to heaviest, with the exception of catchweight (which isn’t really a weight class, just fighters who agree to compete at a certain weight). In the lighter weight classes, wins by decision are far more common. For example, in the flyweight class, a median of 50% of fights were won by unanimous decision. In contrast, knockouts are more likely in the heavier classes. In the heavyweight division, a median of 60% of fights were won by KO or TKO. The fact that bigger fighters can put more force behind their strikes can also clearly be seen in the visualization: there are more points in the doctor stoppage category than in the lighter weight divisions.
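To make the percentage calculation concrete, here is a minimal sketch that computes the share of each finish type within a weight class. It computes one overall percentage per class, which is a simplification: the plot summarizes a distribution of such percentages (hence the medians quoted above). The function name and the finish labels are illustrative assumptions.

```python
from collections import Counter, defaultdict

def finish_percentages(fights):
    """fights: iterable of (weight_class, finish_type) pairs.

    Returns {weight_class: {finish_type: percentage of that class's bouts}}.
    """
    counts = defaultdict(Counter)
    for weight_class, finish in fights:
        counts[weight_class][finish] += 1
    return {
        weight_class: {
            finish: 100 * n / sum(counter.values()) for finish, n in counter.items()
        }
        for weight_class, counter in counts.items()
    }
```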
What stats do top fighters share?
We also decided to look at some top fighters’ stats to see what techniques one should work on in order to become a UFC titleholder. We decided to look at the competitors from UFC 200: Tate vs Nunes, the event behind the yellow block on the heat map (July 2016 had the most bouts of any month). For this event, the UFC Women’s Bantamweight Championship bout between then-champion Miesha Tate and top contender Amanda Nunes was announced as the new main event.
We found that the top fighters tended to land more strikes, and with far greater accuracy, than the average values in the dataset. They were also able to bring their opponents to the ground far more often, again with a much higher success rate.
Player Analysis of Miesha Tate and Amanda Nunes
- Winstreak: median
- Nunes has higher statistics than Tate in win streak, strike accuracy, takedown accuracy, win percentage, and total career wins; the only exception is average strikes.
- Amanda Nunes took the women’s bantamweight championship from Miesha Tate
Titleholders and the effect on their wins
In addition to this, we also wanted to see if the number of title bouts a fighter had participated in could have any effect on their wins. To do so, we made a line plot, with a 95% confidence interval, relating the number of wins to the number of title bouts.
From this plot, we can see that the number of wins and the number of title bouts have a positive relationship. Given that fighters must be experienced and already have many wins in order to even get the opportunity to participate in a title bout, this makes perfect sense. We also see the flip side of that relationship: fighters with less experience and fewer wins have fought in fewer title bouts.
How do people bet on UFC fighters?
Our dataset included a column for the American odds a fighter had in a fight. In this system, the odds for the favorite are indicated by a minus sign and indicate the amount of money you need to stake to win $100. Meanwhile, the odds for the underdog are indicated by a plus sign and represent the amount you will win for every $100 you stake. In both cases, you get back your initial wager as well as the money won. The difference between the odds for the favorite and underdog increases as the probability of the favorite winning increases.
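The arithmetic behind American odds can be sketched in a few lines. The function names are our own for illustration; the implied probability is a standard conversion added here for context and is not a column in the dataset.

```python
def profit_on_stake(odds, stake):
    """Profit (excluding the returned stake) for a winning bet at American odds."""
    if odds < 0:
        # Favorite: you must stake |odds| to win 100.
        return stake * 100 / -odds
    # Underdog: you win `odds` for every 100 staked.
    return stake * odds / 100

def implied_probability(odds):
    """Win probability implied by the odds, ignoring the bookmaker's margin."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)
```

For example, a $150 bet on a -150 favorite wins $100, while a $100 bet on a +200 underdog wins $200; the corresponding implied probabilities are 60% and roughly 33%.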
From the joint plot, we can see that the betting odds for the red and blue fighters (this is how the two are differentiated) are distributed quite symmetrically, which is expected given that the color is set randomly.
From this histogram, we can also see that the majority of odds do not have a large spread. While there are some lopsided matches according to the gamblers, the majority of fights are priced as rather evenly matched. In fact, we noticed that the odds tended to be a poor predictor of the actual winner of the fight.
After inspecting the dataset the first time and noticing the vast number of columns of fight data available, we were all quite excited about the possibility of creating machine learning models to predict the winner of a UFC fight based on the stats of the athletes competing. Each row of the dataset represents a fight. The names of the fighters are stored in the R_fighter and B_fighter columns, and their statistics, like current win streak, height, and weight, are all stored in their respective columns. In addition to the fighters’ data, we also had access to statistics from the fight, like the type of finish and the duration.
Unfortunately, the dataset required a lot of modification to make it more intuitive. In the original dataset, many columns were irrelevant to any given fight. For example, fighters in a Heavyweight bout would be given a NaN value in the “B_fighter Flyweight rank” category, since they would both be ranked Heavyweight fighters. To fix this issue, we pivoted the data to transform the 21 weight class rank columns into just 2 columns: weight class and rank. Once this was adjusted, we removed columns with zero variance as well as highly correlated variables (correlation coefficient greater than 0.8) to avoid multicollinearity. We then split the data into training and testing sets with an 80:20 ratio and ran our selected machine learning models.
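The cleaning steps above might look something like the following sketch. It is written in plain Python rather than with a data frame library, and it makes two assumptions worth flagging: the rank columns are represented as a dictionary of weight class to rank (with `None` standing in for NaN), and for each highly correlated pair the column seen first is kept while the later one is dropped. All names are hypothetical.

```python
from math import sqrt

def collapse_ranks(ranks_by_class):
    """Reduce many mostly-empty rank columns to one (weight_class, rank) pair."""
    for weight_class, rank in ranks_by_class.items():
        if rank is not None:
            return weight_class, rank
    return None, None  # unranked fighter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def select_columns(columns, threshold=0.8):
    """columns: {name: list of numbers}. Returns the names of the columns to keep."""
    kept = []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # zero variance: the column carries no information
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            continue  # too strongly correlated with a column already kept
        kept.append(name)
    return kept
```

Checking each candidate only against columns that were already kept means the result depends on column order, which is a design choice of this sketch rather than something dictated by the method.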
Surprisingly, all of our models achieved accuracy scores greater than 70%, with some even getting close to 90%! Initially, we were shocked by this. We looked at the rate at which betting odds correctly predicted the winner and found that to be only around 62%. However, we then realized that gamblers normally investigate and make bets based on the individual competitors. Given that no fighter has more than 30 official UFC bouts, the small sample size may explain the relatively low accuracy of their predictions. We, on the other hand, are using data from over 2500 bouts after preprocessing, so we are able to make much more accurate general predictions based on fighter data and fight statistics.
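For reference, the betting baseline mentioned above can be computed along these lines. This is a sketch under the assumption that each fight is reduced to a tuple of red odds, blue odds, and a winner label of "Red" or "Blue"; the function name is made up.

```python
def favorite_accuracy(fights):
    """Share of fights won by the betting favorite.

    fights: iterable of (red_odds, blue_odds, winner) with American odds.
    """
    correct = total = 0
    for red_odds, blue_odds, winner in fights:
        if red_odds == blue_odds:
            continue  # no favorite, so the odds make no prediction
        # With American odds, the lower (more negative) number is the favorite.
        favorite = "Red" if red_odds < blue_odds else "Blue"
        correct += favorite == winner
        total += 1
    return correct / total
```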
Logistic Regression, K-Nearest Neighbors, simple Neural Network
Logistic Regression was difficult to work with at first, given the imbalance between Red and Blue winners. After stratifying the outcome variable, the algorithm turned out an 84% accuracy, much to our surprise.
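One common way to handle this kind of imbalance is a stratified train/test split, which keeps the Red/Blue ratio the same in both sets. The sketch below illustrates that idea with an 80:20 split; treating "stratifying" as stratified sampling is our reading for this example, and the function is a hand-rolled illustration rather than a library call.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving each class's proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test
```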
We tried the K-Nearest Neighbors algorithm next due to its simplicity. KNN algorithms classify a data point based on a distance function: the point is assigned the class chosen by a majority vote of its nearest neighbors.
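The idea fits in a few lines of plain Python. This is a toy illustration of the technique using Euclidean distance, not the model we trained; the value of `k` is an arbitrary choice for the example.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the training examples by Euclidean distance to the query point.
    nearest = sorted(
        zip(train_points, train_labels), key=lambda pair: dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because the vote depends on raw distances, features on larger scales dominate unless the data is scaled first.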
Training deep neural networks is very challenging and quickly becomes involved, as advanced mathematical concepts such as stochastic gradient descent come into play. Instead, we tried a simple neural network in Python to see how it would perform.
K-Nearest Neighbors and our simple Neural Network produced decent accuracies of 70.5% and 72.5%, respectively, without significant hyperparameter tuning.
After fitting the initial random forest model, we extracted the significant features using the `varImp` method. Variable importance refers to how much a model relies on a variable to make accurate predictions, so the higher a variable’s importance value, the more important the variable is.
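To give some intuition for "how much a model relies on a variable", the sketch below uses permutation importance: shuffle one feature and measure how much accuracy drops. Note that this is a different, model-agnostic technique swapped in for illustration; it is not necessarily what `varImp` computes for a random forest. The toy model and feature names are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    # Copy each row, replacing only the chosen feature with a shuffled value.
    permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)

def toy_model(row):
    """Hypothetical stand-in model: predicts Red when Red has the longer reach."""
    return "Red" if row["reach_diff"] > 0 else "Blue"
```

A feature the model never looks at gets an importance of exactly zero, since shuffling it cannot change any prediction.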
The top 20 most important features can be clearly seen in this bar chart. These features played significant roles in the random forest model, so in the second run we used only these variables and improved our accuracy to ~87%.
Support Vector Machine (SVM)
This supervised machine learning method is well suited to classification problems. SVM performs classification by finding the hyperplane (boundary) that separates the two or more classes. To optimize the prediction accuracy, we used the SVM kernel intended for non-linear separation problems. Tuning the hyperparameters showed that gamma = cost = 1 gave the best prediction accuracy. After scaling the data and tuning the hyperparameters, we got an accuracy of ~88% for the SVM classification model.
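For readers curious about what gamma controls, here is a sketch of the radial basis function (RBF) kernel, a common choice for non-linear problems. That the kernel used was RBF is an assumption suggested by the gamma hyperparameter; the text above does not name the kernel explicitly.

```python
from math import exp

def rbf_kernel(x, y, gamma=1.0):
    """Similarity between two points: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)
```

The kernel equals 1 for identical points and decays toward 0 as they move apart; a larger gamma makes that decay faster, so each training point influences a smaller neighborhood. This is also why scaling the data first matters.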
For the decision tree model, we used the same data as in the random forest model and used a LabelEncoder to transform some of the text-based columns (like weight division, sex, etc.) to numerical values. We then used K-fold cross validation to determine the optimal complexity, which ended up being 26.
After determining this, we fit the model with the actual training data and got an accuracy of 0.8969 when scored against the testing data.
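The two preprocessing ideas above, label encoding and K-fold splitting, can be sketched in plain Python as follows. These are hand-written illustrations of the concepts, not the library implementations we used, and the function names are our own.

```python
def label_encode(values):
    """Map each distinct text value to an integer (classes in sorted order)."""
    classes = sorted(set(values))
    mapping = {cls: code for code, cls in enumerate(classes)}
    return [mapping[v] for v in values], mapping

def k_fold_indices(n_samples, k=5):
    """Yield (training_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_samples))
        yield training, validation
        start += size
```

To choose a complexity setting, one would fit the model on each training part for every candidate value, average the validation scores across folds, and keep the value with the best average.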
Applying our ML models
Unfortunately, the upcoming events CSV file in the dataset was not as developed as we had hoped, so we had to manually create DataFrames with the columns of interest to predict the outcome of upcoming UFC fights. We managed to do this for one extremely anticipated match: the Light Heavyweight title bout between Israel Adesanya and Jan Błachowicz on March 6th. We obtained values for columns like “R_winstreak” and “R_wins_by_KO/TKO” from the most recent bout each fighter had participated in. We then manually created a DataFrame containing data for only their upcoming fight and used the DecisionTreeClassifier, which had 89.7% accuracy, to predict the outcome. According to our model, Jan Błachowicz will win the fight.
On March 6th, Jan Błachowicz beat Israel Adesanya by unanimous decision and successfully defended the light heavyweight belt.
Overall, our analysis yielded some very interesting results. Our investigation into the growth of the UFC revealed that, thanks to Fight Island, it was not that badly affected by the lockdown measures imposed due to COVID-19. We also illustrated how the UFC and MMA in general have become much more global, with international events increasing in recent years. In addition to this, we discovered that attendance may actually have an impact on the intensity of a fight and that, more than advantageous physical attributes, it is the discrepancies between opponents that really make all the difference. We also built some rather accurate ML models that can be used to predict the outcome of UFC fights and even made an actual prediction of our own for a very highly anticipated title bout.
Here’s a link to our GitHub to check out more of our work.