Predicting The Madness of March

DataRes at UCLA
12 min read · Apr 12


Authors: Kevin Hamakawa (Project Lead), David Oplatka, Vatsal Jalan, Matthew Day, Brian Dinh

Introduction:

March Madness is the pinnacle event of college basketball. In less than four weeks, 67 games are played and one champion is crowned. In addition to rooting for their favorite teams, millions of fans fill out brackets to try to predict the outcome of every game in the tournament. The tournament is single elimination, so one bad game from a top-rated team can bust thousands of brackets. No one is known to have ever filled out a perfect bracket, but we wanted to see if we could find some trends that would give us a better idea of what might happen this March. To see what truly goes into a championship team, we looked at factors such as three-point shooting percentage, turnover rate, and seeding.

Dataset:

Our main dataset, found at https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset, contains T-Rank information from 2013–2021. T-Rank is a college basketball ranking that tracks many different stats about college basketball teams, including, but not limited to, three-point percentage, turnover percentage, offensive/defensive efficiency, and power rating. While the 2020 data is limited because the tournament was canceled due to COVID-19, the dataset contains over 20 variables describing each team's performance over the past decade, which we used to explore March Madness.

Cursed by Seed?

Each year after the regular season is complete, teams anxiously wait to see what seed the NCAA DI men’s basketball committee will give them. Most teams want the best possible seed, but that might not give them the best chance to go far in the tournament or win a national championship. In general, a better seed gives a team a better chance of success in the tournament, but there are a few noteworthy exceptions to consider. As seen in the graph below, the average round reached by a team has a strong negative correlation with its seed number.

The March Madness Bracket consists of four quadrants that seed teams 1 through 16 (Source: Jane Street Puzzles)

However, it is also worth noting the two surprisingly high marks at seeds №7 and 11. One factor to consider is the “side” of the bracket on which these seeds are placed. The seven and eleven seeds are in the bottom half of the bracket. If these two seeds win their first-round matchups, their next likely opponents are the №2 and 3 seeds, respectively. On the other hand, the winner of the game between the №8 and 9 seeds is likely to play its next game against a №1 seed, and №1 seeds have overwhelmingly the greatest success, as seen in the above graph.
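To make the bracket-path argument concrete, here is a minimal sketch of one region of the standard 16-team bracket. The pairings are the usual first-round matchups; the function name `second_round_favorite` is ours, made up for illustration, and it simply assumes the better seed wins the neighboring first-round game.

```python
# First-round pairings within one region, listed in bracket order.
# Neighboring pairings (0-1, 2-3, 4-5, 6-7) feed the same second-round game.
FIRST_ROUND = [(1, 16), (8, 9), (5, 12), (4, 13), (6, 11), (3, 14), (7, 10), (2, 15)]


def second_round_favorite(seed: int) -> int:
    """Return the best (lowest-numbered) seed this team could meet in round two."""
    for i, pairing in enumerate(FIRST_ROUND):
        if seed in pairing:
            # Even indices pair with the next entry, odd indices with the previous one.
            neighbor = FIRST_ROUND[i + 1] if i % 2 == 0 else FIRST_ROUND[i - 1]
            return min(neighbor)
    raise ValueError(f"seed must be between 1 and 16, got {seed}")


for seed in (7, 11, 8, 9):
    opponent = second_round_favorite(seed)
    print(f"No. {seed} seed -> likely second-round opponent: No. {opponent} seed")
```

The №7 and 11 seeds draw the №2 and 3 seeds, while the №8 and 9 seeds both run into the №1 seed.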

When it comes to winning championships, having a top seed matters. №1 seeds are the overwhelming favorites, making up 65% of all March Madness champions, and nearly 90% of national champions have been a №3 seed or better. It is again worth noting some anomalies in the above graph. When we look just beyond the 3 seeds, the numbers drop off significantly. №4 and 5 seeds have combined for only six title-game appearances and only one national championship. One explanation is the same one given above for the struggles of the №8 and 9 seeds: they are on the same side of the bracket as the №1 seed. In conclusion, teams should generally want the best seed possible while also landing on the opposite side of the bracket from the №1 seed.

If We Play the Right Way, Can We Win?

Outside of seeding, it is often difficult to predict which teams will have success in the tournament. One playing style that some people favor is called three-and-d. The term typically describes a player who is a lockdown defender and also has a high three-point shooting percentage, but we can also apply it to a team’s general playing style. An ESPN video offering advice on how to fill out a bracket suggested considering lower seeds who “press on defense generating steals and launch a lot of threes” (ESPN, 2019) as candidates to upset a higher-seeded opponent.

We decided to take a look at some of these stats and see how they impact teams’ success in the tournament. We created a standardized statistic that gives equal weight to a team’s relative three-point shooting percentage and its turnover differential (steal rate minus turnover rate). We then plotted this statistic against teams’ win percentages and the round they reached in the tournament. As displayed in the first graph above, teams’ regular season win percentage in 2018 had a moderately strong positive correlation with their relative success in net turnovers and three-point shooting percentage. However, this did not translate well to success in the tournament, as seen in the second graph above. Although the teams that made it the furthest in the tournament did have high turnover differentials and three-point percentages, other teams with comparable stats were eliminated in the first round. Overall, the three-and-d stat does not appear to be a reliable predictor of teams’ success in the tournament. Other factors that will be discussed later may have a greater impact on teams’ late tournament runs.
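The exact formula behind the three-and-d statistic is not given here, so the following is only a sketch of one way to build an equally weighted, standardized score: z-score each component, then average the two. The team names and numbers are made up, and averaging z-scores is our assumption, not necessarily the calculation used for the graphs.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up teams: (three-point %, turnovers forced %, turnovers committed %)
teams = {
    "Team A": (38.5, 21.0, 16.5),
    "Team B": (33.0, 17.5, 19.0),
    "Team C": (35.5, 19.0, 18.0),
    "Team D": (31.0, 20.5, 20.5),
}

three_pt = [stats[0] for stats in teams.values()]
turnover_diff = [stats[1] - stats[2] for stats in teams.values()]  # forced minus committed

# Equal weighting (an assumption): average the two standardized components.
three_and_d = {
    name: (z3 + zt) / 2
    for name, z3, zt in zip(teams, z_scores(three_pt), z_scores(turnover_diff))
}

for name, score in sorted(three_and_d.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:+.2f}")
```

Because both components are standardized first, neither one dominates the score just because it is measured on a wider scale.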

How Critical is 3-point Offense and Defense to a Team’s Winning Rate?

The evolution of basketball has called for more modern offensive and defensive strategies. One that has been emerging for a long time is ‘small ball’, in which coaches favor smaller players with greater scoring prowess over more physically imposing ones. This concept has given rise to ‘small ball’ teams that aim to do much of their scoring from the three-point line.

We wanted to explore this phenomenon statistically and examine whether a team’s three-point percentage was related to its winning rate. We also wanted to see whether a team’s three-point defense contributed positively to its winning rate.

We used scatterplots to visualize any potential relationship between efficient 3-point offense/defense and the winning percentage of teams.

There does seem to be a positive correlation between the two variables. To test our hypothesis in a slightly different context, we also mapped each team’s chance of beating an average Division 1 team onto the scatterplot, with blue points marking a low chance, orange a fair chance, and green a high chance. The plot supports our hypothesis both in practice, in terms of games actually won, and in theory, in terms of the chance of beating an average Division 1 team.

Looking at defense, we see that, in general, the lower the three-point percentage a team allows, the higher its winning rate and its chance of beating an average Division 1 team. It is important to acknowledge that the chance of beating an average Division 1 team depends on a variety of factors, and what makes a Division 1 team ‘average’ is another discussion to be had.
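For readers who want a number to go with the scatterplots, a Pearson correlation coefficient is one simple way to summarize the relationships described above. The sketch below uses made-up values, not figures from the dataset, and the helper `pearson_r` is our own illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up season values for five hypothetical teams.
three_pt_made_pct = [31.0, 33.5, 35.0, 36.5, 38.0]     # offense: % of threes made
three_pt_allowed_pct = [37.0, 35.5, 34.0, 33.0, 31.5]  # defense: % allowed to opponents
win_pct = [0.40, 0.48, 0.55, 0.63, 0.72]

print(f"offense vs. win %: r = {pearson_r(three_pt_made_pct, win_pct):+.2f}")
print(f"defense vs. win %: r = {pearson_r(three_pt_allowed_pct, win_pct):+.2f}")
```

A positive r for offense and a negative r for defense would match the pattern in the plots: making more threes and allowing fewer both go with winning more. A correlation alone does not show that either one causes wins.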

March Madness & Money:

Aside from the analysis that can be performed on the extensive data collected on in-game stats, win records, and seeding, there’s one key variable off the court that colleges across the nation have quietly been optimizing in order to improve their chances in the prestigious bracket: money. Since 2003, the average funding for an NCAA DI men’s college basketball team has tripled, from about 100 thousand dollars per team in a given year to over 300 thousand dollars per team in 2019. For the teams that actually make it to March Madness, the change is even more dramatic: from about 200 thousand dollars per team in 2003 to over 1.2 million dollars in 2019, a roughly sixfold increase. Colleges are spending more and more money on their basketball teams, and successful programs are increasing their funding at a faster rate. This naturally raises the following questions: Can you buy yourself into March Madness? Can you buy yourself all the way to the championship?

In 1994, Congress passed the Equity in Athletics Disclosure Act (EADA), requiring each college to make public its spending on its men’s and women’s sports teams in an effort to promote transparency and equity in college athletics. As a result, there is a plethora of publicly available data showing exactly how much funding a given NCAA DI men’s basketball team receives in a given year, and, if I were to compare that data to, say, previous March Madness bracket results, there may be interesting results. So I did:

This bar graph looks at 2019, the most recent year for which we have both team funding data and March Madness results. The x-axis represents the possible placements of an NCAA DI men’s basketball team in March Madness, while the y-axis represents the average funding of the teams that received each placement (in millions of dollars).
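The aggregation behind a chart like this is a simple group-and-average. Here is a minimal sketch of that step; the dollar amounts are placeholders invented for illustration, not real EADA figures, and the placement labels are assumptions.

```python
from collections import defaultdict

# (placement, team funding in dollars): placeholder values, NOT real EADA data.
records = [
    ("Did not qualify", 250_000),
    ("Did not qualify", 350_000),
    ("R68", 300_000),
    ("R64", 900_000),
    ("R64", 1_100_000),
    ("R32", 1_400_000),
]

# Accumulate a running [sum, count] for each placement.
totals = defaultdict(lambda: [0, 0])
for placement, funding in records:
    totals[placement][0] += funding
    totals[placement][1] += 1

# Average funding per placement, in millions of dollars (the chart's y-axis unit).
avg_funding_millions = {
    placement: total / count / 1_000_000
    for placement, (total, count) in totals.items()
}

for placement, avg in avg_funding_millions.items():
    print(f"{placement}: ${avg:.2f}M")
```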

Can you buy yourself into March Madness?

As you can see, there is certainly a large difference in funding between the teams who did not qualify for March Madness and the teams who qualified directly and made it to at least the round of 64. That said, there is almost no difference in funding between teams who did not qualify and teams who placed in the round of 68, so poorly funded teams seem to have at least an equal shot of making it into the bracket, even if their chance of advancing isn’t high. Additionally, the disparity between the teams who did not qualify and teams who did grows in a roughly linear fashion up to the Sweet 16 (more on that later). While this is nice to look at, the statistic that really solidifies the answer to this question is this: the average funding for a team who does not make it into March Madness is around 300,000 dollars. The average funding for a team who does is roughly four times as high, at a little above 1.2 million dollars! This strongly suggests that the answer to this question is yes; but one more question still remains…

Can you buy yourself all the way to the championship?

While there is a general correlation between team funding and getting further in the bracket, that correlation starts to fall apart once you reach the Sweet 16. This follows intuition: the variance in a single-elimination bracket with high stakes and young players is incredibly high, and that variance is what March Madness owes much of its name to. It is also interesting to note how clearly upsets show up in the data set. Duke, one of the four best-funded basketball programs in any given year over the past decade, was eliminated in the Elite Eight in 2019; hence the outlier in this data visualization. The fact that the correlation almost disappears beyond the Sweet 16, together with the generally high variance in results between years despite the low variance in team funding between years, suggests that the answer to this question is no.

What does this all mean?

If current trends continue, it looks like the funding gap between the most successful programs and everyone else will increase, and the disparity in results is likely to increase as well. While a sad outcome for underdogs, this also means that March Madness will likely become easier to predict. Before you get your fantasy brackets out, let me be clear: it is who will be in March Madness that will become easier to predict. As for who will win March Madness, the correlation between funding and placement falls flat after the Sweet 16. Fret not, statistical predictions haven’t ruined the volatility and excitement of the Sweet 16; at least not yet… So in the meantime, grab some popcorn, donate some money to your alma mater, and hope they can clutch out a win.

Additional disclaimer: it is unclear whether better funding causes teams to get into March Madness or whether teams who get into March Madness receive better funding as a result, although my suspicion is that, at the very least, each causes the other.

What is the Probability of an Upset?

Amid ongoing outside debate about how the March Madness Selection Committee assigns seeds, we have seen supposedly favorable matchups play out in unexpected ways, as well as large discrepancies between a team’s seed and its ranking. This recurring trend creates madness for die-hard viewers trying to predict bracket outcomes, and it also trips up participating teams as they make their run towards the championship.

This visualization shows how the probability of an upset differs across the rounds of the tournament. Although one might expect upset probability to decrease as the tournament progresses, it is actually highest in the Elite 8. The Elite 8 upset rate is about 31.0% (0.310344), while the second-highest rate, about 25.8% (0.258036), is in the first round. Such a high upset probability in the Elite 8 looks strange, but the data bear it out. When lower seeds face higher seeds this late in the tournament, some upsets are bound to occur; at the rate above, roughly 1.2 of the 4 Elite 8 games in a typical year result in an upset. First-round upsets are less frequent than Elite 8 upsets but still common, as teams are occasionally seeded inaccurately within their region and are exposed when they face a supposedly weaker opponent. The trend is less pronounced in the other rounds, where the top seeds usually take over and claim the remaining tournament spots!
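As a rough sketch of how per-round upset rates like these could be computed: count the games in each round, count those won by the worse seed, and divide. The game results below are made up, and the definition of an upset (the winner has the larger seed number, with same-seed games skipped) is our assumption about how the figures were derived.

```python
from collections import Counter

# (round, winner's seed, loser's seed): made-up results for illustration.
games = [
    ("R64", 1, 16), ("R64", 12, 5), ("R64", 2, 15), ("R64", 9, 8),
    ("E8", 1, 2), ("E8", 3, 1), ("E8", 2, 4),
    ("F4", 1, 1),  # same seeds: neither result counts as an upset
]

played, upsets = Counter(), Counter()
for round_name, winner_seed, loser_seed in games:
    if winner_seed == loser_seed:
        continue  # skip games with no favorite by seed
    played[round_name] += 1
    if winner_seed > loser_seed:  # larger seed number means the worse seed
        upsets[round_name] += 1

upset_rate = {round_name: upsets[round_name] / played[round_name] for round_name in played}

for round_name, rate in upset_rate.items():
    print(f"{round_name}: {rate:.1%} of games were upsets")
```

A stricter definition, such as requiring a seed gap of two or more, would lower every rate, so the choice of definition matters when comparing rounds.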

Can we predict March Madness?

Finally, we wanted to use this dataset to predict the outcome of this year’s March Madness. After trying out several different models, we decided on a linear regression model to predict the “POSTSEASON” column of the original dataset, which records the round in which each team was eliminated from the bracket.

Rather than creating a classification model for the original postseason column (originally categorical data), we decided to convert these different categories (Champions, 2nd, F4, E8) into numerical data, allowing us to create a more distinct ranking for each team in the tournament. Furthermore, we decided to remove all of the teams that didn’t make the tournament from the dataset, as we found that setting their place to “69th” would skew a lot of the numbers towards a higher postseason result.

For feature selection, we used the leaps package in R, which chose adjusted offensive efficiency, adjusted defensive efficiency, power rating, defensive effective field goal %, defensive turnover %, defensive rebounding %, three-point percentage, defensive three-point percentage, adjusted tempo (an estimate of possessions per 40 minutes), and wins above bubble as our predictor variables. Our model achieved an adjusted R-squared of 0.471 with an MAE of 8.133. Here is our model’s prediction for this year’s March Madness, sorted by the lowest predicted POSTSEASON value.

  1. Purdue
  2. Texas A&M
  3. Alabama
  4. Saint Mary’s
  5. San Diego St.
  6. Gonzaga
  7. Arizona
  8. Connecticut
  9. UCLA
  10. Houston
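For anyone unfamiliar with the two fit statistics quoted above, here is a short sketch of how adjusted R-squared and mean absolute error (MAE) are computed from a model’s predictions. The model itself was fit in R; this Python snippet uses made-up actual and predicted values, so its numbers have nothing to do with the reported 0.471 and 8.133.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def adjusted_r_squared(actual, predicted, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - ss_residual / ss_total
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)


# Made-up numeric placements and model predictions for eight hypothetical teams.
actual = [1, 2, 4, 8, 16, 32, 64, 64]
predicted = [6, 5, 10, 12, 20, 30, 50, 58]

print(f"MAE: {mean_absolute_error(actual, predicted):.2f}")
print(f"Adjusted R-squared (2 predictors): {adjusted_r_squared(actual, predicted, 2):.3f}")
```

The MAE is in the same units as the target, so an MAE of about 8 on a placement scale means predictions miss by roughly eight places on average.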

Of course, no model can predict March Madness with certainty. Since the brackets and seeding have not been released yet, this isn’t a bracket prediction, but simply the model’s estimate of which teams have the best chance of winning the tournament. Some interesting teams to note are Texas A&M (currently ranked 24th), predicted to place 2nd, and Houston (currently ranked 1st), predicted to place 10th.

Good luck with your brackets and make sure to reference this article when preparing for future tournaments!

Sources:

  1. https://www.youtube.com/watch?v=R9i1VNdrSMI&t=171s
  2. https://www.janestreet.com/puzzles/bracketology-101-index/
  3. https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset?select=cbb.csv
  4. https://www.kaggle.com/datasets/woodygilbertson/ncaam-march-madness-scores-19852021
  5. https://ope.ed.gov/athletics/
