After thinking about this off and on for probably more than five years, I finally built a machine learning application that can predict NCAA tournament basketball games, in time for the 2025 March Madness tournament. Occasionally, through the years, before I had done any actual work on the model, I'd find myself fantasizing that my application would predict the perfect bracket. I'd imagine myself in a podcast interview explaining how I was as surprised as anyone at the results, but confident that I had not somehow created Skynet. Reviewing the aftermath of my bracket, I definitely did not need an answer prepared for how I knew I had not accidentally created Skynet.
I entered brackets completed by my machine learning application in four places:
- In the Kaggle March Machine Learning Mania 2025 competition, I placed 931st of 1,727 competitors (46th percentile).
- Against the human picks in my friend, family, and coworker leagues, the model finished between the 40th and 75th percentiles.
- Overall, my model made 38 of 63 picks correctly (~60%).
The Kaggle competition I entered provided box scores for each game of the NCAA regular season. I wrote a script to calculate season-average statistics for each team from those box scores, along with true shooting percentage and a couple of statistics I made up, "blowout wins" and "blowout losses." I used these calculated statistics for each team as features against the result of each NCAA tournament game since 2003 (win versus loss, encoded as 1 versus 0) to train my model. The final weights learned by the model are shown below.
Rank | Factor | Weight | Weight (Absolute Value) |
---|---|---|---|
1 | Losing Team Offensive Rebounds Ave | -0.6079 | 0.6079 |
2 | Winning Team Offensive Rebounds Ave | 0.5802 | 0.5802 |
3 | Winning Team True Shooting Percentage | 0.4817 | 0.4817 |
4 | Winning Team Free Throws Attempted Ave | -0.4543 | 0.4543 |
5 | Losing Team Turnovers Ave | 0.4396 | 0.4396 |
6 | Winning Team Winning Percentage | 0.3323 | 0.3323 |
7 | Losing Team True Shooting Percentage | -0.3006 | 0.3006 |
8 | Winning Team Free Throws Made Ave | 0.2903 | 0.2903 |
9 | Winning Team Turnovers Ave | -0.2874 | 0.2874 |
10 | Losing Team Free Throws Attempted Ave | 0.2409 | 0.2409 |
11 | Losing Team Effective Field Goal Percentage | -0.2229 | 0.2229 |
12 | Winning Team Blocks Ave | 0.2187 | 0.2187 |
13 | Losing Team Three Points Made Ave | 0.2082 | 0.2082 |
14 | Losing Team Blocks Ave | -0.1939 | 0.1939 |
15 | Losing Team Steals Ave | -0.1714 | 0.1714 |
16 | Winning Team Steals Ave | 0.1626 | 0.1626 |
17 | Losing Team Winning Percentage | -0.1399 | 0.1399 |
18 | Winning Team Three Points Made Ave | -0.1369 | 0.1369 |
19 | Losing Team Free Throws Made Ave | -0.1317 | 0.1317 |
20 | Losing Team Score Ave | 0.1221 | 0.1221 |
21 | Winning Team Score Ave | -0.1107 | 0.1107 |
22 | Winning Team Personal Fouls Ave | -0.0991 | 0.0991 |
23 | Losing Team Blowout wins | 0.0886 | 0.0886 |
24 | Winning Team Three Points Attempted Ave | 0.0885 | 0.0885 |
25 | Losing Team Three Points Attempted Ave | -0.0872 | 0.0872 |
26 | Losing Team Personal Fouls Ave | 0.0836 | 0.0836 |
27 | Winning Team Blowout losses | 0.0765 | 0.0765 |
28 | Winning Team Assists Ave | -0.0617 | 0.0617 |
29 | Losing Team Blowout losses | 0.0495 | 0.0495 |
30 | Losing Team Defensive Rebounds Ave | -0.0384 | 0.0384 |
31 | Winning Team Blowout wins | 0.0349 | 0.0349 |
32 | Winning Team Effective Field Goal Percentage | -0.0342 | 0.0342 |
33 | Losing Team Assists Ave | -0.0144 | 0.0144 |
34 | Winning Team Defensive Rebounds Ave | -0.0073 | 0.0073 |
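For reference, the true shooting percentage used above follows the standard formula. Here's a minimal sketch in Python; the sample numbers are illustrative, not real team totals:

```python
def true_shooting_pct(points, fga, fta):
    """Standard true shooting percentage: PTS / (2 * (FGA + 0.44 * FTA)).

    The 0.44 factor approximates the share of free-throw attempts
    that end a possession.
    """
    return points / (2 * (fga + 0.44 * fta))


# Illustrative season totals, not real team data:
season_ts = true_shooting_pct(points=2400, fga=1800, fta=600)
print(round(season_ts, 4))  # → 0.5814
```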
There were two factors that drove some deep, incorrect "Cinderella" predictions:
The initial version of the model was trained with the seed of each team in a given match-up, but the first bracket I/it/we completed seemed too boring. For the past couple of years, the word "parity" seemed to get thrown around a lot. It felt like the general consensus was that there was no longer an advantage to being a "blue blood" or a high seed or from a major conference.
So I took the seed data out, but I didn't have another metric to account for strength of schedule. The next iteration of my bracket had Alabama (2) getting upset by Robert Morris (15) in the first round. It also loved Troy (14), having them upset Kentucky (3) and Illinois (6). And it seemed to love VCU (11) and a St. Mary's (7) team that my favorite NCAA commentator (Mark Titus) said didn't really pass the eye test.
I quickly looked up Robert Morris. They had a good record and good statistics, but some bad losses against "name brand" schools. To account for teams that were merely the best in a weak conference, I threw together calculations for "blowout wins" and "blowout losses." I defined a blowout loss as a game a team lost by more than 10 points. My hope was that this would catch teams from weaker conferences that get invited early in the season to pad the records of schools in traditionally stronger conferences (the "cupcakes"). I defined a blowout win as a game a team won by more than 10 points either as the away team or on a neutral court. I hoped this would filter out teams that did not play well away from home, reward teams talented enough to blow opponents out on the road, and maybe capture teams with "grit."
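The two definitions above are easy to express in code. Here's a sketch with an invented game-log schema (the dict keys and the sample games are assumptions, not the competition's actual format):

```python
def blowout_counts(games):
    """Count blowout wins and losses from one team's game log.

    `games` is a hypothetical schema: dicts with 'team_score',
    'opp_score', and 'location' ('home', 'away', or 'neutral').
    """
    wins = losses = 0
    for g in games:
        margin = g["team_score"] - g["opp_score"]
        if margin < -10:
            # Blowout loss: lost by more than 10, at any venue.
            losses += 1
        elif margin > 10 and g["location"] in ("away", "neutral"):
            # Blowout win: won by more than 10 away or on a neutral court.
            wins += 1
    return wins, losses


# Illustrative game log, not real results:
log = [
    {"team_score": 78, "opp_score": 64, "location": "away"},     # road blowout win
    {"team_score": 90, "opp_score": 70, "location": "home"},     # home blowout, not counted
    {"team_score": 55, "opp_score": 72, "location": "home"},     # blowout loss
    {"team_score": 68, "opp_score": 65, "location": "neutral"},  # close win
]
wins, losses = blowout_counts(log)
print(wins, losses)  # → 1 1
```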
After including the blowout game counts, the model did flip to having Alabama beat Robert Morris, but not much else changed. It still had Troy, VCU, and St. Mary's making runs in the East region. It still had McNeese St. (12) making a run and High Point (13) upsetting Purdue (4) in the Midwest region.
I ran a SQL query on my database and realized that Troy, VCU, St. Mary's, and Robert Morris all had high season-average offensive rebounds. High Point averaged more than one more offensive rebound per game than Purdue.
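The shape of that query is simple to reproduce. Here's a sqlite3 sketch with an invented schema and made-up numbers (my actual table layout and the real averages are not shown here):

```python
import sqlite3

# Invented schema and numbers, just to illustrate the shape of the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE box_scores (team TEXT, off_reb INTEGER)")
conn.executemany(
    "INSERT INTO box_scores VALUES (?, ?)",
    [("Troy", 14), ("Troy", 12), ("High Point", 11), ("High Point", 10),
     ("Purdue", 8), ("Purdue", 9)],
)

# Season-average offensive rebounds per team, highest first.
rows = conn.execute(
    "SELECT team, AVG(off_reb) AS oreb_avg "
    "FROM box_scores GROUP BY team ORDER BY oreb_avg DESC"
).fetchall()
print(rows)  # → [('Troy', 13.0), ('High Point', 10.5), ('Purdue', 8.5)]
```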
I was surprised and confused by how much emphasis the model put on offensive rebounding. It seemed even stranger given that, of all the statistics, the model gave the least weight to defensive rebounding. I don't have an explanation for why it loves offensive rebounding, and I'm not sure what to do about it going forward, if anything.
Though I was disappointed with the results, this was a fun project, and one that I intend to tweak and maintain for the rest of my life.
This initial model deployed a "kitchen sink" approach. I just kind of threw everything I could come up with into the training data. I was curious to see what it would identify as most important.
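The "kitchen sink" idea can be sketched in a few lines: fit a logistic regression on everything, then rank the features by the absolute value of their learned weights, the same ranking used in the table above. This toy version uses made-up feature names and random data standing in for the real training table (the post doesn't name the actual model or library, so plain gradient descent is used here):

```python
import numpy as np

# Hypothetical feature names and synthetic data, not the real training set.
rng = np.random.default_rng(0)
features = ["off_reb_avg", "ts_pct", "turnovers_avg", "def_reb_avg"]
X = rng.normal(size=(500, len(features)))
# Toy outcome (1 = win) driven mostly by the first feature.
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(float)

# Fit logistic-regression weights with plain gradient descent.
w = np.zeros(len(features))
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted win probability
    w -= 0.1 * (X.T @ (p - y)) / len(y)  # gradient of the logistic loss

# Rank features by |weight|, as in the table above.
ranked = sorted(zip(features, w), key=lambda fw: abs(fw[1]), reverse=True)
```

On this synthetic data the top-ranked feature is the one that actually drives the outcome, which is the whole point of the exercise: let the weights reveal what matters.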
I'd like to pare down the number of statistics that go into the model, or at least have a clear rationale for including each one. By next year, I'd like to come up with categories of statistics that reflect different styles of play or strengths, and include just one from each category.
I'm also interested in developing a way to factor in each team's coach, as well as the "name on the jersey" of the school. I'd like to come up with a way to quantify what it means to "be Duke" or "a Kentucky," or to be a nobody. Similarly, I'd love a way to reflect that a coach like Rick Pitino is suddenly behind St. John's. But I view these as long-term goals that will take years of iterating on my code, data sources, and theories.