Over the last couple weeks, I know many of you have been DEPRIVED of content from your favorite blogging DJ (that's me). While my colleagues Ky & Jim have hopefully filled this void, DJ Dimes has officially resurfaced to share the first chapter of his new novel, "Data Dishin".
During my time away, I chose to silo myself from the outside world and write ~1,500 lines of code in R, a language far more comfortable to me than the English one. To be more specific, I have been primarily focused on merging two behemoth data files with the intention of developing my own consolidated dataset of all D1 teams and all D1 games. Thanks to statistical hoop pioneers Ken Pomeroy and Ken Massey, aspiring data scientists like myself can get their hands on robust datasets FOR FREE, which provide granular information on team statistics and individual game results dating all the way back to 2002 (!!!). Despite getting lost countless times in a perpetually tangled web of rows and columns, I was finally able to construct my own "one-stop-shop" college basketball data repository. Its purpose will be to inform recurring analyses on trends and observations across the college hoop landscape, with comparisons to prior seasons and specific historical teams. These recurring analyses will serve as separate, standalone chapters in the "Data Dishin" series.
With that said, the main plot of this series will center around the birth and rise of a dangerously addictive drug, which will come to be known in the streets as "Green Magic". Developed in the highly automated 3MW lab, "Green Magic" is built on a secret, proprietary recipe and perfected through many long iterations of cooking and testing, which I will outline below...
Description & Key Ingredients
Apologies in advance to any parents or DEA members whom I have officially frightened with a slight misrepresentation of our product. "Green Magic" is a single-game prediction model that attempts to project what will happen in a matchup today, based on what happened in similar historical matchups. In other words, for a game today featuring team A & team B, the model looks at the characteristics of both team A and team B (team "DNA"), as well as the situational game characteristics of the matchup itself (game "circumstances"). The model then searches for past matchups that featured teams with similar "DNA" playing under similar "circumstances". Allow me to elaborate on what I mean by "DNA" and "circumstances":
- DNA: Team information or team statistics that reveal the "strengths" and "weaknesses" of each team, as well as describe their style of play. Examples include:
- 2-point FG %
- 3-point FG %
- Turnover Rate (Offensive & Defensive)
- Rebounding Rate (Offensive & Defensive)
- Free-Throw Rate (Offensive & Defensive)
- Effective Height (How tall/long is Team A?)
- Effective Experience (How "old" is Team A?)
- Points Distribution (What % of Team A's points are 2s vs. 3s vs. FTs?)
- Adjusted Efficiencies (Offensive & Defensive)
- Tempo/Pace (How "fast" does Team A play?)
- Circumstances: Game information that specifies where & when the game was played, as it relates to both teams in the matchup. Examples include:
- Game Location (Was Team A at home or on road?)
- Game Date
- Days between Games (How many days off has Team A had?)
- Travel Distance (How far did Team A travel?)
For the most part, all of these data variables have been included in the early batches of Green Magic (version 1.1). Some of the game circumstance information is not yet incorporated, but will be prioritized in subsequent model enhancements. The detailed team statistics (DNA) currently embedded in the model still provide plenty of data points to identify similar historical matchups and assess what happened in them. My current master datafile contains specific box score data on ~33,000 D1 games dating back to 2008 (2002-2007 games require additional data manipulation). Green Magic version 1.1 DOES NOT include neutral site games, which I also de-prioritized for the first stage of testing. Just for context, ~10% of the initial raw master file contained data for neutral games. In other words, ~90% of all regular season D1 games are played at a "true home" or "true road" location.
While the "cook stage" will be an iterative process that improves over time, I wanted to give a high-level overview of the key steps involved in version 1.1:
- Understand Importance of Team DNA Variables to Game Outcomes:
This is by far the longest, but most critical, step in the process. A big chunk of my time involved running many simple regressions of different team stats and evaluating how significant they were in predicting final game scores. The ultimate goal here is to determine how much each team factor should be weighted when identifying the most similar historical teams & matchups. For example, most of us who watch excessive amounts of hoop would agree that how well a team shoots from the floor (Effective FG%) is more critical to a game outcome than how tall they are (Effective Height). Quantifying this difference in importance, amongst many other variables, was, and will continue to be, my primary focus.
I first began by performing a regression using the most critical team stats that determine final game scores: Adjusted Offensive Efficiency, Adjusted Defensive Efficiency & Adjusted Tempo/Pace. The bread-and-butter of most game projection models (including KenPom's FanMatch formula) involves estimating the points per possession each team will score, and how many possessions will occur in the game. I ran one regression predicting home team scores and another predicting away team scores, using just these three variables for each. Specifically, for predicting home team score, I incorporated the home team's Adjusted Offensive Efficiency, the away team's Adjusted Defensive Efficiency and both the home & away teams' Adjusted Tempo (combined below into "pace"). The opposite factors were tested for predicting the away team score:
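For readers who want to see the mechanics, this regression step can be sketched as follows. My actual pipeline is written in R; this is an illustrative Python version run on synthetic data, so the fitted numbers below are stand-ins, not the model's real coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
home_adj_oe = rng.normal(105, 8, n)  # home team's Adjusted Offensive Efficiency
away_adj_de = rng.normal(100, 8, n)  # away team's Adjusted Defensive Efficiency
pace = rng.normal(68, 4, n)          # combined tempo estimate (possessions)

# Simulate home scores as points-per-possession times possessions, plus noise
home_score = (home_adj_oe + away_adj_de) / 200 * pace + rng.normal(0, 6, n)

# Ordinary least squares: home_score ~ intercept + AdjOE + AdjDE + pace
X = np.column_stack([np.ones(n), home_adj_oe, away_adj_de, pace])
coef, *_ = np.linalg.lstsq(X, home_score, rcond=None)
print(dict(zip(["intercept", "AdjOE", "AdjDE", "pace"], coef.round(3))))
```

The away-score regression is identical with the offensive/defensive inputs swapped.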
From the output above, we can see that Adjusted Offensive Efficiency ("AdjOE") has a relatively larger impact (though not significantly so) on determining a game's final score than Adjusted Defensive Efficiency ("AdjDE"). This is no surprise to me, or to others who have buried themselves in the weeds of basketball analytics. Generally, teams have more control over their offensive efficiency game-to-game than their defensive efficiency. Or, to be more "statistically correct", there is more variance in individual teams' game-to-game performance on defense than on offense. This confirmed my hypothesis that I should put a higher weight on Adjusted Offensive Efficiency than on Adjusted Defensive Efficiency when identifying the most similar historical teams and matchups.
An important thing to mention about the Adjusted Offensive Efficiency and Adjusted Defensive Efficiency team metrics is that many of the other team stats are in some way already embedded in the AdjOE and/or AdjDE numbers. For example, Effective FG%, Turnover Rate, Rebounding Rate & Free Throw Rate (better known as the "Four Factors" statistics) are all important team stats I wish to use in my similarity assessment, but they are directly used in the AdjOE & AdjDE calculations.
Another example of this is the team statistic called "Defensive Fingerprint" (courtesy of KenPom's team pages), which is a numerical value that corresponds to the style of defense a certain team plays. The smallest values (which actually go negative in his calculation) refer to dominant zone teams (e.g., Syracuse), while higher values refer to teams that play man-to-man almost exclusively. This statistic is not based on any sort of qualitative assessment whatsoever; instead, it relies on other team stats to derive what defense a team is playing. Specifically, 3-point attempt % (the percentage of shots allowed that are 3s) and free-throw attempt % (how often a team puts its opponents on the line) are two key inputs to the Defensive Fingerprint statistic.
Therefore, the same issues arise when regressing 3-point attempt % and free-throw attempt % alongside Defensive Fingerprint, or each of the Four Factors stats alongside AdjOE/AdjDE. Both cases present multicollinearity issues, since the initial statistics are embedded in the final calculated statistics. This was evident when assessing the Variance Inflation Factor (VIF) for each team statistic. Not to drown you in nerdiness, but both 3-point attempt % and free-throw attempt % had a VIF of over 15 (anything over 10 is generally considered problematic and will hurt the regression output) when all variables were tested together in the same regression.
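For reference, the VIF for a variable is 1/(1 - R²), where R² comes from regressing that variable on all the other predictors. A quick sketch (in Python rather than my R pipeline, with made-up data) shows how a nearly redundant variable blows up the VIF:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X: 1 / (1 - R^2),
    where R^2 comes from regressing that column on all the others."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1 / (1 - r2))
    return np.array(out)

# Made-up example: x3 is almost a copy of x1, while x2 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = 0.95 * x1 + 0.05 * rng.normal(size=300)
v = vif(np.column_stack([x1, x2, x3]))
print(v.round(1))  # x1 and x3 show huge VIFs; x2 stays near 1
```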
In order to reduce correlation, or dependence, amongst team variables, my solution was to go forward with a step-by-step or "gated" approach to the distance calculation(s). I would first filter for the most similar historical matchups based strictly on Adjusted Offensive Efficiency and Adjusted Defensive Efficiency (distance calculation step 1, or "gate" 1). Then, I would perform three additional similarity or distance calculations based on the following team factors:
- Four Factors
- Style (Defensive Fingerprint & Tempo)
- Other (Effective Height & Effective Experience)
So to clarify, the initial distance calculation in step 1 would be performed on all 33K games in my dataset. I would then sort all 33K games based on that distance calc from smallest to largest distance (or most to least similar), keeping only the most similar games at the top of that sorted list. The three subsequent distance calculations would be performed only on this subset of games.
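Conceptually, the gated approach looks something like this sketch (Python rather than R; the column positions, the 0.6/0.4 weight split, and the plain Euclidean second gate are placeholder assumptions, not the model's actual values):

```python
import numpy as np

def gated_similarity(current, history, top_n=150):
    """Gate 1: rank all historical games by weighted distance on the two
    efficiency factors; keep the top_n closest, then compute a secondary
    distance (e.g., Four Factors) on that subset only."""
    w = np.array([0.6, 0.4])  # placeholder weights: AdjOE weighted above AdjDE
    d1 = np.sqrt((((history[:, :2] - current[:2]) * w) ** 2).sum(axis=1))
    keep = np.argsort(d1)[:top_n]          # indices of the most similar games
    subset = history[keep]
    d2 = np.sqrt(((subset[:, 2:] - current[2:]) ** 2).sum(axis=1))
    return keep, d1[keep], d2

# Toy demo: 6 team factors, the first two standing in for AdjOE/AdjDE
rng = np.random.default_rng(2)
history = rng.normal(size=(1000, 6))
current = rng.normal(size=6)
keep, d1, d2 = gated_similarity(current, history)
print(len(keep), len(d2))
```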
Below is the regression output from all team factors tested in version 1.1, excluding Adjusted Offensive Efficiency & Adjusted Defensive Efficiency (I am showing only the home team score prediction output here). As I alluded to earlier, the purpose of these regressions is to determine how much each factor should be weighted in the distance calculation:
Every team factor shown above would either be included in one of the three separate distance calculations, or simply removed, due to the aforementioned multicollinearity issue. The team factors that were ultimately included were assigned weights, based on how significant they proved to be in these score prediction regressions. The above example is only one of many regressions tested to arrive at the appropriate weightings for each of the three distance calcs. However, examining all factors holistically gives us a good initial view of what is truly predictive for determining game scores. Given this example is predicting the home team score only, you will notice the offense-specific factors show up as more significant for the home team, while the defense-specific factors are more significant for the away team. This is relatively intuitive: we would expect the away team's defense to play a bigger role in determining the home team's score, and the inverse would be true if I showed the regression results for the away team's score. Now let's get into the specifics of the distance calculation mechanics for the "Overall", "Four Factors", "Style" & "Other" similarity assessments.
- Calculate Distance or "Similarity" of Historical Matchups to Current Matchup:
In version 1.1, the initial distance calculation using just Adjusted Offensive Efficiency and Adjusted Defensive Efficiency is used to arrive at a subset of the most similar historical matchups. In my final output, this distance calculation, and its corresponding final game projection, is called the "Overall" similarity calculation. Currently, I have set the threshold at 150 games for this "Overall" similarity assessment. This means once the distance calculation is run on all 33K games, I sort by the "nearest" matchups and pick out the top 150 closest historical games. Based on how close each of these matchups is to the current matchup I am attempting to project, each is assigned a weight of its own, reflecting how similar it is to the current matchup. So for the "Overall" assessment, each historical game is assigned a similarity score (distance calculation), which uses only Adjusted Offensive Efficiency & Adjusted Defensive Efficiency as the two team factors. As the initial regressions showed, along with many prior research studies, offense exhibits much lower variance than defense in predicting a team's game-to-game performance. Therefore, in version 1.1 of the model, I assign a slightly bigger weight to Adjusted Offensive Efficiency than to Adjusted Defensive Efficiency.
An extreme example may help clarify further. Say I have two teams today: Team A, with an AdjOE of 30 and an AdjDE of 90 ... and Team B, with an AdjOE of 75 and an AdjDE of 40. Now say one of the historical matchups in my master datafile includes two Team A comparables, call them Team X and Team Y, both of whom played an identical Team B comparable, call this Team Z. Which matchup gets a higher similarity score, or lower distance calculation, depends on whether Team X or Team Y is more like Team A. Let's say Team X has an AdjOE of 25 and an AdjDE of 90 (same as Team A). Now let's say Team Y has an AdjOE of 30 (same as Team A) and an AdjDE of 85. Both Team X and Team Y have an overall delta of 5 from Team A. However, because a higher weight has been assigned to offensive similarity (AdjOE), the model will produce a higher similarity score (or lower distance calculation) for Team Y.
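In sketch form, with hypothetical weights of 0.6 on AdjOE and 0.4 on AdjDE (the model's real weights differ), the Team X vs. Team Y comparison works out like this:

```python
# Hypothetical weights -- the post only claims AdjOE gets a slightly larger weight
W_OE, W_DE = 0.6, 0.4

def weighted_dist(a, b):
    return W_OE * abs(a["adj_oe"] - b["adj_oe"]) + W_DE * abs(a["adj_de"] - b["adj_de"])

team_a = {"adj_oe": 30, "adj_de": 90}   # today's team
team_x = {"adj_oe": 25, "adj_de": 90}   # comparable with a delta of 5 on offense
team_y = {"adj_oe": 30, "adj_de": 85}   # comparable with a delta of 5 on defense

print(weighted_dist(team_a, team_x))  # 3.0 -- offensive gap, penalized more heavily
print(weighted_dist(team_a, team_y))  # 2.0 -- defensive gap, so Team Y is "closer"
```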
My model will rank the top 150 similarity-score matchups and return them in a table of 150 rows, each with all of the box score statistics and team statistics for that particular matchup. The last column of this table shows the similarity score, or distance calculation. Below is an example of Green Magic's output for the Iowa & Iowa State matchup at Iowa State earlier this year. The "h_" teams and variables are to be compared to Iowa State, and the "a_" teams and variables are to be compared to Iowa. Most columns have been hidden due to space constraints:
Only 20 of the 150 rows returned are shown above. In order to arrive at a projected score, simply multiply both the home team scores and away team scores by the distance weight adjustment (shown in the far-right column). Thus the final projected scores for both the home team and away team are a weighted-average calculation of the 150 closest historical matchups. In this specific scenario, Green Magic projected a 79-71 Iowa State win, which equates to an 8-point spread. This initially felt high to me, until I saw the actual game line open at 6.5 and move to 7 before tip-off.
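The weighted-average projection itself is simple arithmetic. Here is a toy example with 4 matchups instead of 150, and with scores and weights invented purely for illustration:

```python
# Toy example: 4 historical matchups instead of 150; the scores and the
# distance-based weights are invented for illustration (weights sum to 1)
home_scores = [78, 82, 75, 80]
away_scores = [70, 73, 69, 72]
weights = [0.4, 0.3, 0.2, 0.1]

proj_home = sum(s * w for s, w in zip(home_scores, weights))
proj_away = sum(s * w for s, w in zip(away_scores, weights))
print(round(proj_home, 1), round(proj_away, 1))  # 78.8 70.9
```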
To re-iterate, the above example is the "Overall" distance assessment, which is only 1 of 4 similarity scores calculated for each current matchup. A similar visual output and weighted-average distance score is calculated for the "Four Factors", "Style" & "Other" factors. Here are the specific team statistics included in each:
- Four Factors (16 team statistics):
- Home & Away Team Offensive & Defensive Effective Field Goal %, Offensive & Defensive Turnover Rate, Offensive & Defensive Rebounding Rate, Offensive & Defensive Free-Throw Rate
- Style (4 team statistics):
- Home & Away Team Defensive Fingerprint & Adjusted Tempo (Pace)
- Other (4 team statistics):
- Home & Away Team Effective Height & Effective Experience
Once I arrive at a projected score using all four similarity assessments (Overall, Four Factors, Style & Other), I combine these into a final score projection, which I call the "aggregated projected score" in the final output. Essentially, the home team projected score and away team projected score for each of the four similarity measures are multiplied by 0.25 to arrive at this weighted-average final projection. It is this aggregated measure that I will focus on most for tracking the against-the-spread and over/under performance of Green Magic.
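In other words, the aggregation is just an equal-weight average of the four projections. A toy sketch with invented numbers:

```python
# Four similarity assessments, each with an invented (home, away) projection
projections = {
    "overall":      (79, 71),
    "four_factors": (78, 72),
    "style":        (80, 70),
    "other":        (77, 73),
}
# Equal 0.25 weights on each assessment = a simple average of the four
agg_home = sum(h for h, _ in projections.values()) * 0.25
agg_away = sum(a for _, a in projections.values()) * 0.25
print(agg_home, agg_away)  # 78.5 71.5
```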
Green Magic Outlook & Future Product Enhancements
Even in its most nascent form (version 1.1), Green Magic's potency is still very dangerous. Though the statistical techniques applied in version 1.1 are by no means advanced (yet), the final output format allows for a great deal of qualitative assessment, which complements the purely quantitative distance calculations. Therefore, in its current state, I will be using Green Magic as another tool in my daily wagering analysis, and not as a stand-alone system. However, I will be tracking the model's accuracy as if it were used as a stand-alone system, in addition to my actual wagers using my own intuition.
Beginning the week after Christmas, I will be tracking daily/weekly against-the-spread (and totals) records of the model's aggregate score projections. For those of you irrationally confident individuals who absolutely need this narcotic now, I will be tweeting and/or posting the top daily "leans", or games with the largest variance between actual spreads/totals & Green Magic projected spreads/totals. These specific games correspond to the highest-confidence bets, according to the Green Magic projections. I would also like to get some preliminary feedback on whether the final output format is intuitive and easy to read, so it can be tweaked for a more user-friendly experience.
Stay tuned for more updates of Green Magic's day-to-day performance, as well as additional process improvements I plan to make over the coming weeks. Just as a teaser, Version 1.2 and beyond will incorporate some of the following enhancements:
- Adjustments to Team Factor Weightings
- Inclusion of Additional Game Circumstance Factors
- Adjustments for Current Season Rule Changes (I have already incorporated a tempo adjustment, but there are probably more factors that also need to be accounted for)
- Inclusion of Player-specific Statistics
- Additions to Green Magic Performance Tracking:
- Green Magic predictions compared to public consensus betting (Does Green Magic side with the public?)
- Green Magic predictions compared to KenPom FanMatch predictions
- Variance analysis between Vegas Spreads, KenPom FanMatch predictions & Green Magic predictions (There is a proper balance in how close Green Magic projections are to FanMatch. Too close means the model is providing minimal value, while too far apart means the model is probably inaccurate in the long-run)