This is a description of my College Football Algorithm used during the 2018 Season developed during the summer of 2018. I have made updates to the algorithm, which you can read about here.
What is the overall rationale for the algorithm?
Simply put, the algorithm compares how a team performs against each opponent with how that opponent’s average opponent performs.
If Team A scores 45 on Team B when Team B normally gives up just 20/game, Team A will be awarded points. Similarly, if Team A holds Team B to just 7 points when Team B normally scores 28, Team A is also awarded points. This is dynamic throughout the season, and points won or missed out on early can be reversed as the numbers become more accurate later. Initially, I intended for this to be the entire basis of a team’s algorithmic score, as if strength of schedule were already factored in. That is, if a team plays a 12-game regular season, a conference championship, and two playoff games, and outperforms opponent average offensive and defensive numbers in all 15 games, they should have a high enough score to beat out the rest of the competition, and the fact that they were above average both offensively and defensively means they should be the easy #1. Unfortunately, being better than average on both sides of the ball in every game is not an easy task, and even the best teams will lose out on points. Because of this, with the initial algorithm, Wisconsin topped the 2017 College Football world.
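As a rough sketch of this raw-score idea (the function name and the one-point-per-point scale are my assumptions; the source only describes rewarding performances that beat opponent averages):

```python
def raw_game_points(points_for, points_against,
                    opp_avg_allowed, opp_avg_scored):
    """Hypothetical sketch: award points for beating the opponent's
    season averages on both sides of the ball. The 1:1 point scale
    is an assumption, not the algorithm's actual weighting."""
    points = 0.0
    # Offense: did we score more than this opponent usually allows?
    if points_for > opp_avg_allowed:
        points += points_for - opp_avg_allowed
    # Defense: did we hold them below their usual scoring output?
    if points_against < opp_avg_scored:
        points += opp_avg_scored - points_against
    return points

# Team A scores 45 on a defense giving up 20/game and holds an
# offense averaging 28 to just 7 points.
print(raw_game_points(45, 7, 20, 28))  # 25.0 + 21.0 = 46.0
```

Because opponent averages are recomputed every week, calling this function with updated averages naturally produces the dynamic, season-long behavior described above.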
From there, I realized that an actual strength of schedule (SOS) calculation needed to be introduced. Strength of schedule is determined both by the “class” of the teams on a given schedule (whether they are in a Power 5 conference: SEC, Big XII, Pac 12, ACC, or Big 10; a Group of 5 conference: AAC, Mountain West, Sun Belt, MAC, and C-USA; or an FCS school), and opponent record with a team’s own record factored out. I also included the team’s season record in the final score calculation.
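A minimal sketch of that two-part SOS idea follows. The specific class weights and the way the two pieces combine are my assumptions; the source only says SOS uses conference class (P5 / G5 / FCS) plus combined opponent record with the team’s own games factored out.

```python
# Assumed class weights -- the real values are not published here.
CLASS_WEIGHT = {"P5": 1.0, "G5": 0.85, "FCS": 0.6}

def opponent_record_factor(opponents, team):
    """Combined opponent win fraction, ignoring games against `team`."""
    wins = losses = 0
    for opp in opponents:
        for other, outcome in opp["results"]:  # e.g. ("Texas", "W")
            if other == team:
                continue  # factor out the head-to-head result
            wins += outcome == "W"
            losses += outcome == "L"
    return wins / max(wins + losses, 1)

def strength_of_schedule(opponents, team):
    # Average class weight of the schedule, scaled by opponent record.
    class_part = sum(CLASS_WEIGHT[o["class"]] for o in opponents) / len(opponents)
    return class_part * opponent_record_factor(opponents, team)
```

Factoring out the team’s own games matters: otherwise every win a team earns would drag down its own SOS by handing each opponent a loss.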
And that’s how the algorithm sat for a while. Just three components: record, raw score, and strength of schedule. As the 2018 season progressed, I saw that things needed to change slightly. The three-component system was working fine on a macro scale, but it was ignoring a clearly important statistic: margin of victory (MOV). Alabama, though their resume was not great, was handling all of their opponents with ease. So much so that starting quarterback Tua Tagovailoa hardly saw snaps in the second half of football games. Their closest game was not close by any stretch of the imagination. Sure, they were winning points for beating the averages, but they were far better than just above average: they were the best! I didn’t want margin of victory to be as heavy-handed as the other components. After all, a team that wins by 7 each week is still winning. Maybe they took their foot off the gas. Maybe the other team scored a touchdown to turn a 14-point lead into a 7-point lead with seconds left in the game. Maybe a team that wins most games close had one bad game where they lost by 24 to a good team. With all of these thoughts in mind, I added a MOV stat that ranged from 0.8 to 1: enough to move a clearly better team up a few spots, but not enough to become the entire basis of the algorithm.
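One way to realize that 0.8-to-1 band is a simple linear mapping from a team’s average margin onto the band. The linear form and the function name are my assumptions; the source only gives the range:

```python
def mov_multiplier(avg_margin, worst_avg, best_avg):
    """Hypothetical sketch: map a team's average margin of victory
    linearly onto the 0.8-1.0 band. The linear mapping is an
    assumption; the source only specifies the range."""
    if best_avg == worst_avg:
        return 1.0
    frac = (avg_margin - worst_avg) / (best_avg - worst_avg)
    # Clamp to [0, 1] so extreme margins stay inside the band.
    return 0.8 + 0.2 * max(0.0, min(1.0, frac))

# Even the nation's worst average margin only trims a score by 20%;
# the best average margin leaves it untouched.
print(mov_multiplier(-20, -20, 30))  # 0.8
print(mov_multiplier(30, -20, 30))   # 1.0
print(mov_multiplier(5, -20, 30))    # 0.9 (midpoint)
```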
Where is this algorithm used?
During the 2018 offseason I developed this ranking system to be included in the reddit.com/r/cfb poll committee (here). My application was accepted, and I have been tweaking the algorithm in small ways since then as I get more data to test its accuracy. I also submit the rankings to Ken Massey for the Massey composite rankings (found here).
The rankings change so much from week to week. Why?
In a human ranking system, if you’re going to lose, it’s much better to do it early. That’s because poll inertia is important. A human makes a top 25 one week and then the next week looks at the results of the games played and uses their previous top 25 as the groundwork upon which they move teams. Did the number 1 team win? Cool. They stay at 1. Same with number 2? Wow, this is easy! Teams in the top spots only move up and down when other highly ranked teams lose or, on rare occasions, when teams dominate their competition enough to warrant a jump in the ranks. If you are the #1 team and you lose in week 3, you might fall to #10, but week by week, as you keep winning and teams above you lose (as almost everyone does at some point), you creep up the ranks. The end result might be that you lost one game and wind up #2. If that same team with the same schedule, however, loses in the final week of the season, a human will punish them for the loss, even if they only drop a few spots to #4. So we are left with a case where a team plays the same schedule with the same record, but their final ranking depends only on when they lost. That’s partly what happened in 2008. Oklahoma lost first, to Texas, and had time to recover through the rest of the season. Texas lost with 3 games remaining and no real chance to prove themselves, and poor Tech lost in their last regular-season game.
This is a catch-22 for a human. Continuing the example with the 2008 season in mind, if the regular season ends and you think Texas is the best team of the three because, within the triangle, their win was on a neutral site and their loss was at night, on the road, on a literal last-second play, after beating (at the time of each game) #1 Oklahoma, #11 Missouri, and #7 Oklahoma State in consecutive weeks, this might be good justification. But what everyone else sees is the choice to hop Texas, who just easily handled their rival at the bottom of the division 49-9, over Oklahoma, who had themselves just beaten a highly ranked Oklahoma State 61-41. It’s hard to justify when you view last week’s rankings as the foundation for the next week. A team loses and they have to fall with that mentality. A team wins, but maybe every team they had beaten in the past all of a sudden looks worse, so their resume looks less impressive. How can even a diligent human take note of every game? A computer has no issues with looking at an entire body of work each week. In some sense, a computer algorithm is an independent calculation each week. The #1 team last week might have done just fine this week, but there’s more at play than just that one game. It’s likely that they retain their spot if they scored well enough to get there in the first place, but maybe enough things change, maybe all the teams they previously played lose, maybe they don’t win by enough, and they drop because the computer saw something we didn’t.
All this is to say that this algorithm has a different process when ranking teams than a human would. All games a team plays are in their little basket. Each week we get new information and we throw it in the basket and shake it up. Everything mixes and the timeline becomes irrelevant. The end result doesn’t care if you lost last week or 10 weeks ago, it looks at the whole body of work. For this reason my algorithm might appear to change a lot from week to week, especially in early weeks, but the end result is all mathematically based.
Is this a power ranking?
This algorithm is purely backward-looking and does not place any emphasis on time or streaks. You could make a legitimate argument that a team that lost their first three games of the season and has won 8 since is in a better situation than a team that started with 8 wins and has lost their last three. I agree that there is some nuance to this. Accounting for time and streaks could also, in a subtle sense, account for injuries. Perhaps a team loses a starter and the whole operation falls apart. A legitimate claim could be made that the team that won 8 games to start the season is not the same team that lost the most recent 3, and that the latter is the team you would actually see on the field today.
This is a legitimate concern, and many computer algorithms do have some form of recency bias. Many, also, do not. The reason I don’t is because I am more interested in seeing how a team’s season resume stacks up with another’s. Part of what I am trying to get away from is the inclination to punish a team heavily for a loss just for the sake of punishment. If the #3 team loses late in the season, we can’t forget their entire body of work just because the thing most recently on our mind is the loss.
A team that just beat a ranked opponent by a wide margin moved down. Why?
Late in the season especially, there are a lot of moving parts. Every team a team has played continues to impact their rating, even weeks after their head-to-head game took place. This impact comes in several ways. First, the average offensive and defensive stats for each team change as they play more games. A team might hold their opponent to 24 points and be rewarded for that performance while their opponent’s offense averages 28.8 points/game, but as the season goes on, if that opponent’s scoring average dips below 24, those early-season points go away. The second way that previous opponents impact ratings is in the SOS calculation. A combined opponent record of 48-24 one week might really help a team’s rating, but that can change if those opponents go a combined 1-7 the following week.
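A tiny numeric illustration of the first effect (the 1:1 point scale is my assumption): the same 24-point defensive performance is re-scored each week against the opponent’s *current* scoring average, so early-season credit can evaporate as that average falls.

```python
def defense_points(points_allowed, opp_avg_scored):
    """Hypothetical credit for holding an opponent below their
    current season scoring average (assumed 1:1 scale)."""
    return max(0.0, opp_avg_scored - points_allowed)

# Held them to 24 while they averaged 28.8/game: credit earned.
print(round(defense_points(24, 28.8), 1))  # 4.8
# Weeks later their average has dipped to 23.5: credit gone.
print(defense_points(24, 23.5))            # 0.0
```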
All that is to say that if you look at week to week changes as an indication of a team’s performance in the previous week, you don’t see the full picture. Every win inherently helps a team’s score. It is impossible for a win to hurt more than a loss. It’s also important to remember that, again, the computer doesn’t forget, and makes every comparison it is designed to make based on all games.
Two teams ranked consecutively had a head-to-head and the team ranked lower won. Why don’t you flip them?
Okay, I know I just wrote that the computer doesn’t forget and makes every comparison it is designed to make based on all games. The quick answer here is that the computer doesn’t recognize head-to-heads as head-to-heads. It cares about the data (offense, defense, who won, team averages, records, etc.), but doesn’t care that it was Texas playing Oklahoma. The final rating for a team, especially late in the season, is the result of a complex web of data. The fact that two teams who previously played each other end up as neighbors in the ranking is pure coincidence, and there is far more at play than just the one game they had against each other.
Are you happy with these results?
Every week, part of my rankings fails what many commonly call “the eye test.” That is, “does this look right?” Or at least, “does this, with my own bias of team histories and brands, look right?” You see, a team like Alabama, Oklahoma, Clemson, Texas, Michigan, etc., will often receive the benefit of the doubt after losing because of their program’s success, either recent or historical. Whether people care to admit it or not, seeing a team perform well in the past will influence how they see that team in the present. Also, a team’s ranking at the time a game is played is going to influence how a human sees strength of schedule, whether they like it or not. A win against Week 4’s #8 team looks good. People remember that. They might not remember to check that that #8 team from week 4 is barely in the top 40 by week 12. Maybe they will in that one particular case. But maybe they overlook another. College football is a messy web of 130 (FBS) teams playing 12 games apiece. The end result is never obvious. The eye test carries with it all the biases of the brain to which it is attached.
Part of the reason I like computer rankings is that they show you things you might not have otherwise seen. In Week 4 of 2018, Duke and Kentucky, at least when compared to opponent averages to that point in the season, had outperformed everyone in almost every category through their 4 games. They had also played respectable schedules. Computers don’t care about your name, logo, or what you have done for 30 years. Through 4 games, it’s hard to argue Kentucky and Duke aren’t good, and the only real way to say they are worse than teams like OU is to appeal to history and recruiting. Would OU win a head-to-head against both of these teams? Probably more often than not, but we haven’t seen them perform at that level through 4 games, so historic program dominance shouldn’t implicitly affect the rankings, at least in my opinion.
Do you think averaging MOV works out to a valid measurement?
I honestly don’t like the pure average of MOV. Winning a lot by a small margin in low-scoring games can all be undone by one fairly large loss to a highly ranked team. The reason I’m okay with it as it is, though, is that it doesn’t seem to have a major impact on a team’s score. That means you can get blown out by 70 in every game (making you the lowest team in the category), but your MOV score will only cut your raw score by a small factor in the end. When you also keep in mind that most other teams are having their own scores reduced (simply by nature of not being the very best), whatever impact MOV has on total score is not major. It’s mostly a finessing tool to move teams who have been consistently blowing out opponents to higher spots within their tiers.
Do you think you have a good balance between SOS and MOV?
Part of my answer above has to do with the balance. While MOV scales scores slightly, SOS is incorporated in two places. In one, it is a direct multiplier, and in the other it impacts how many points are even available to a team in the first place. This has a more direct impact on final score. SOS is much more important and I think it is weighted appropriately. SOS is also iteratively calculated until the results even themselves out.
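To make the balance concrete, here is a hedged sketch of how the pieces might combine. Per the description above, the raw points are assumed to already reflect SOS-capped point availability, SOS then reappears as a direct multiplier, MOV scales the result within 0.8-1.0, and record is included; the exact arithmetic and the function name are my assumptions, not the published formula.

```python
def final_score(raw_points, sos, mov_factor, win_pct):
    """Hypothetical combination of the components described above.
    raw_points: assumed to already reflect SOS-capped availability.
    sos: strength of schedule, applied again as a direct multiplier.
    mov_factor: margin-of-victory scaler in [0.8, 1.0].
    win_pct: season winning percentage."""
    return raw_points * sos * mov_factor * win_pct

# A team with 500 raw points, a 0.9 SOS, a strong MOV factor, and
# an 11-1 record:
score = final_score(500.0, 0.9, 0.95, 11 / 12)
```

Note how SOS touches the score twice while MOV can only shave off at most 20%, which matches the claim that SOS carries far more weight.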
Do you think your system and the way it handles MOV fairly addresses strong offensive teams versus strong defensive teams?
I did play around with this some. A 17-3 win is much different from a 56-42 win. A 70-13 win and a 42-7 win are similar as far as dominance goes but show up differently in raw MOV. I had in the past toyed with giving teams points based on how the two scores compared within the scope of the single game, in addition to the raw score given based on averages comparisons. I assigned points for a shutout, but only when paired with winning by more than 10 (to avoid massive point paydays simply for shutouts; 3-0 isn’t a dominant win). I assigned points based on whether you doubled your opponent’s score, tripled it, etc., and set an ultimate cap (I think 4x or 5x). When I ran the algorithm, I hated the results. Teams I tried to “help” actually fell. Teams I didn’t expect (and that didn’t really pass the eye test for me) moved up. In the end, I scrapped the whole system because it felt too complex, and my simple metric was giving me a result that honestly seemed much better to me.
Your results seem to favor winning and not losing (so long as your opponents are at least half decent) a bit more than the aggregate results. Are you satisfied with this?
It actually doesn’t inherently punish losing. A team that outperforms opponent offensive and defensive averages but still loses can earn almost as many raw game points as a team that beats the same opponent but does neither. A team can also earn more raw points losing close to a P5 team than a team that beats a G5 team unconvincingly. Overall, however, win percentage is still an important metric in my algorithm. Early in the season, 1 loss is a big hit to that value. It’s not so much the act of losing that causes the hit as the fact that, with so few games played, 1 loss is a much larger percentage of the overall record. A team that is 2-1 right now is effectively the same as one that ends 8-4, yet I expect many teams currently ranked highly at 2-1 to finish much better than 8-4. 2018 A&M started 2-2, but almost any team would have with losses to Clemson and Alabama, and I don’t think they will finish 6-6. All that is to say that this evens out as the season progresses. Later in the season (week 8, 9, 10, 11, etc.), going from 9-1 to 9-2 isn’t nearly as much of a slide. Assuming raw score is still fairly high, I don’t expect a team to fall out of the top 25 for losing 1 game while still having a good record. To demonstrate further: Auburn finished 11th in my ranking in 2017 with a record of 10-4. They played well through 14 games, and their 4 losses were all to tough teams. Despite losing 4 times, they found themselves at #11, above many teams with 1, 2, and 3 losses.
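The record arithmetic above is easy to verify: one loss is a much larger share of the record early on, and a 2-1 start is exactly the same winning percentage as an 8-4 finish.

```python
def win_pct(wins, losses):
    """Season winning percentage (fraction of games won)."""
    return wins / (wins + losses)

# 2-1 and 8-4 are the same percentage; 9-1 to 9-2 is a small slide.
assert win_pct(2, 1) == win_pct(8, 4)
print(round(win_pct(9, 1), 3), round(win_pct(9, 2), 3))  # 0.9 0.818
```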
Why did you choose the values that you chose?
The answer to this question is less beautiful. By toying with the algorithm and changing the numbers around, I settled on the values I use today. As a metric for comparison, I used the AP rankings and Massey’s composite rankings, calculating the total difference in ranks within the top 25. Nothing can be perfect, and the two even differ from each other. I find that my ranking often differs from the two by about as much as they differ from each other. Additionally, my goal is not to use a computer to recreate the AP poll or Massey’s rankings. I want to produce something that looks respectable but is also my own.
Why should you care?
I can’t sit here and defend this algorithm as if it is the best to ever exist. I can point to several others that overall do a much better job. In general, aggregate polls tend to pass eye tests far better than individual ballots because outliers tend to be averaged out. This, however, is just a single computer ranking created and run by an amateur. You might notice your team higher or lower than you might have expected. You might see that they are ranked behind a team they beat earlier in the season, or that a team with a worse record or with a “worse loss” is ahead of them. I hold no grudges. The computer has no bias. This is simply my best attempt at an accurate result for college football rankings. If you enjoy the sport, or the numbers even half as much as I do, then I hope you are interested in what I have to say and that my algorithm can shed some light on aspects you might not have previously seen with your team or others.