What is the overall goal of this ranking?
This ranking uses R to order the teams so as to minimize the total number of “Ranking Violations.”
What is a Ranking Violation?
A ranking violation is defined as any time a team with a lower rank in a generated ranking beat a team with a higher rank. For example, in 2017, Seattle (Rank 20) beat Philadelphia (Rank 2). According to the on-field result, Seattle is apparently the better team (I’ll address the concerns and conclusions we can draw from the ranking later); ranking Philadelphia higher despite this result is therefore a ranking violation.
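The counting itself is simple enough to sketch. The original code is in R, but here is a minimal Python illustration of the definition, using the Seattle/Philadelphia example from above (the two-element game list is purely illustrative):

```python
def ranking_violations(ranking, games):
    """Count games in which the lower-ranked team beat the higher-ranked team.

    ranking: list of team names, best team first.
    games: list of (winner, loser) pairs; rematches simply appear twice.
    """
    position = {team: i for i, team in enumerate(ranking)}
    return sum(1 for winner, loser in games if position[winner] > position[loser])

# 2017 example from the text: Seattle beat Philadelphia, yet
# Philadelphia is ranked higher, so that ranking carries one violation.
games = [("Seattle", "Philadelphia")]
print(ranking_violations(["Philadelphia", "Seattle"], games))  # 1
print(ranking_violations(["Seattle", "Philadelphia"], games))  # 0
```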
You claim to have minimized Ranking Violations, yet just under 20% of the season’s games still register as violations. Why?
First, to be clear, I am not claiming a true minimization of RV. The numbers I report are simply the lowest I have personally found online, and I can be reasonably confident that I have gotten close to the minimum, given that I have tested more than 4 million ranking permutations against my best produced ranking. If a lower one exists, please share it with me.
Now on to the other part of the question: why is the RV so high? Unfortunately, some ranking violations are unavoidable. If a team goes 1-1 against a divisional rival on the season (e.g., Dallas and Philadelphia), a ranking violation is inevitable: no matter which team is ranked in front, you are guaranteed an instance in which a lower-ranked team beat a higher-ranked team. In 2017, there were 17 such instances.
These are not the only guaranteed instances, however. More common are cases in which you have loops of teams who all beat each other. Take, for instance, the 2017 loop where Kansas City > New England > Buffalo > Kansas City. There are tons of these loops, and there are loops within loops. Each loop guarantees a ranking violation, again, because somebody has to be on the top and somebody has to be on the bottom.
There are also objectively bad ways to handle these loops from an RV perspective. Ordering the loop as KC > NE > Buffalo gives you one RV, but ordering it Buffalo > KC > NE gives you two (because the Patriots beat the Bills twice), and worse still, Buffalo > NE > KC gives you three ranking violations.
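To make that arithmetic concrete, here is a small Python check (a sketch; the actual code is in R). The game list encodes the 2017 results above: Kansas City beat New England, New England beat Buffalo twice, and Buffalo beat Kansas City.

```python
def ranking_violations(ranking, games):
    # A violation is any game whose winner sits below its loser in the ranking.
    position = {team: i for i, team in enumerate(ranking)}
    return sum(1 for w, l in games if position[w] > position[l])

# 2017 loop from the text; NE beat Buffalo twice, hence the repeated pair.
games = [("KC", "NE"), ("NE", "Buffalo"), ("NE", "Buffalo"), ("Buffalo", "KC")]

print(ranking_violations(["KC", "NE", "Buffalo"], games))  # 1
print(ranking_violations(["Buffalo", "KC", "NE"], games))  # 2
print(ranking_violations(["Buffalo", "NE", "KC"], games))  # 3
```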
As you can see, there’s a right and a wrong way to approach these loops. The code doesn’t really think about it like that, although it arrives at the same conclusion.
How does the code work?
I wrote the code in R and pulled the data from pro-football-reference.com. After importing the data, the code creates a dataframe of the ranks, team names, records, and the maximum/minimum rank recorded for each team at the current violation count. The first step in reducing ranking violations is to reorder the teams using weighted random number generation based on each team’s record. The logic is that the violation count will tend to be lower when teams with more wins and fewer losses are near the top. In this step, every team theoretically has a shot at any rank, but a 14-2 team is far more likely to land near the top than a 2-14 team. New rankings are generated this way and their violation counts are calculated; if a new ranking’s RV is lower than the incumbent’s, the new replaces the old, and the process continues.
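The R pipeline itself isn’t shown here, so the snippet below is only a Python sketch of this first step, and the exact weighting (wins + 1) is my assumption: teams are drawn one at a time with probability proportional to their wins, and a candidate ranking replaces the incumbent only when its violation count is strictly lower.

```python
import random

def ranking_violations(ranking, games):
    position = {team: i for i, team in enumerate(ranking)}
    return sum(1 for w, l in games if position[w] > position[l])

def weighted_random_ranking(records, rng):
    """Draw a full ordering; teams with more wins tend to land higher.

    records: dict of team -> wins. The 'wins + 1' weight is illustrative.
    """
    pool = dict(records)
    ranking = []
    while pool:
        teams = list(pool)
        weights = [pool[t] + 1 for t in teams]
        pick = rng.choices(teams, weights=weights, k=1)[0]
        ranking.append(pick)
        del pool[pick]
    return ranking

def best_of_weighted_draws(records, games, trials, rng):
    best, best_rv = None, float("inf")
    for _ in range(trials):
        candidate = weighted_random_ranking(records, rng)
        rv = ranking_violations(candidate, games)
        if rv < best_rv:  # keep only strict improvements
            best, best_rv = candidate, rv
    return best, best_rv

rng = random.Random(0)
records = {"A": 3, "B": 2, "C": 1}            # wins only, toy example
games = [("A", "B"), ("A", "C"), ("B", "C")]  # transitive results: min RV is 0
ranking, rv = best_of_weighted_draws(records, games, trials=200, rng=rng)
print(ranking, rv)  # ['A', 'B', 'C'] 0
```

With a transitive toy schedule the zero-violation ordering is found almost immediately; the real 32-team schedule is what makes the search hard.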
After a certain point, continuing in this manner offers very little benefit. There are 32! (about 2.63E35) possible orderings, and once the ranking is already optimized to some extent, generating an entirely independent ranking each time is a poor way to search for the tip of the bell curve. So, for the next step, I search for new rankings by simply swapping two teams at random. If the swap negatively affects the RV (makes it higher), it is undone and a new swap is tried; otherwise, the new ranking replaces the old.
Once there is reason to believe a minimum has been hit, the ranking is temporarily saved and the process is repeated to ensure the result is not stuck in a local minimum. If a repeat run ends at the same RV or a higher one, it is discarded.
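Again as a hedged Python sketch (the actual implementation is in R, and the step and restart counts here are arbitrary): the inner loop swaps two random teams and undoes any swap that raises the RV, while the outer loop reruns the descent and keeps a restart only if it strictly beats the best RV found so far.

```python
import random

def ranking_violations(ranking, games):
    position = {team: i for i, team in enumerate(ranking)}
    return sum(1 for w, l in games if position[w] > position[l])

def swap_descent(ranking, games, steps, rng):
    """Randomly swap two teams; undo the swap if RV rises, otherwise keep it
    (ties are kept too, which lets the search drift across plateaus)."""
    ranking = list(ranking)
    rv = ranking_violations(ranking, games)
    for _ in range(steps):
        i, j = rng.sample(range(len(ranking)), 2)
        ranking[i], ranking[j] = ranking[j], ranking[i]
        new_rv = ranking_violations(ranking, games)
        if new_rv > rv:
            ranking[i], ranking[j] = ranking[j], ranking[i]  # undo bad swap
        else:
            rv = new_rv
    return ranking, rv

def descend_with_restarts(start, games, restarts, steps, rng):
    """Rerun the descent; a restart's result is kept only if it strictly
    beats the best RV seen so far, as described in the text."""
    best, best_rv = swap_descent(start, games, steps, rng)
    for _ in range(restarts):
        candidate, rv = swap_descent(start, games, steps, rng)
        if rv < best_rv:
            best, best_rv = candidate, rv
    return best, best_rv

rng = random.Random(0)
# The KC/NE/Buffalo loop from earlier: its minimum possible RV is 1.
games = [("KC", "NE"), ("NE", "Buffalo"), ("NE", "Buffalo"), ("Buffalo", "KC")]
ranking, rv = descend_with_restarts(["Buffalo", "NE", "KC"], games,
                                    restarts=5, steps=200, rng=rng)
print(rv)  # 1
```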
The final step is to scan through the swapping method hundreds of thousands more times, logging the highest and lowest rank each team reaches to get a strong sense of its “possible range” at that RV. From that, I can gauge how far in either direction a team can be stretched in rank, for personal intuition. After that, I run a line of code that swaps any two teams with different overall records so that the team with the better record is ranked higher than the team with the worse record, as long as the RV is not affected.
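That last record-based pass can be sketched as follows (hypothetical Python again, since the original is in R; I compare records by wins alone, which is equivalent when every team plays the same number of games):

```python
def ranking_violations(ranking, games):
    position = {team: i for i, team in enumerate(ranking)}
    return sum(1 for w, l in games if position[w] > position[l])

def record_cleanup(ranking, games, wins):
    """Swap any pair where a team with strictly more wins sits below a team
    with fewer wins, keeping only swaps that leave the RV unchanged."""
    ranking = list(ranking)
    rv = ranking_violations(ranking, games)
    changed = True
    while changed:
        changed = False
        for i in range(len(ranking)):
            for j in range(i + 1, len(ranking)):
                upper, lower = ranking[i], ranking[j]
                if wins[lower] > wins[upper]:
                    ranking[i], ranking[j] = lower, upper
                    if ranking_violations(ranking, games) == rv:
                        changed = True          # keep the RV-neutral swap
                    else:
                        ranking[i], ranking[j] = upper, lower  # undo
    return ranking

# A team with more wins ranked behind one it never played gets promoted,
# since the swap cannot change the violation count.
print(record_cleanup(["B", "A"], games=[], wins={"A": 8, "B": 6}))  # ['A', 'B']
```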
Does your final ranking include an eye test, or is it purely algorithmically driven?
The last step, which swaps teams by record so long as the RV isn’t affected, is essentially an extension of the eye test. An 8-5 team ranked behind a 6-7 team it never played will get swapped in front of that team, because it has a better record and the move does not change the RV. This cleans things up to the eye to some degree, but the bottom line is that the ranking isn’t really meant to pass the eye test. In some sense, it’s almost the opposite. On NFL ranking message boards or Facebook posts, you’ll often see fans crying out, “Why are they ranked so highly? We beat them in week 7!” That complaint relies on the assumption that going purely off who beat whom is truly the best metric (I legitimately don’t know that it is). This ranking should demonstrate that sometimes it isn’t, but at the end of the day, it is the most representative of on-field results.
What are the conclusions to draw from this?
An RV-minimization ranking probably isn’t what you want to look to for making bets or predictions about games. It’s also going to be harder to convince yourself that it passes the eye test, because it’s more complicated than simply placing teams with better records above teams with worse records. The game results for head-to-head match-ups are unforgiving: beating an opponent by 50 and winning in overtime on a 60-yard field goal have the same effect on the violation count, so some of the nuance is removed. If team A beats team B, but that result is the outlier and B would beat A 90% of the time, it doesn’t matter. For this reason, algorithms that boast lower RV are often among the most controversial when compared to other rankings. This can largely be seen in the Massey Composites for Football and Basketball (both of which I am part of: see College Basketball here and College Football here). Polls that largely ignore RV leave room to be advanced in other ways and to consider instances when game outcomes are not necessarily related to true team strength. In doing so, however, that extra piece of information is lost. The web of games is complicated, and while at first glance this ranking might look truly awful, it is the best representation of the on-field results from the perspective of head-to-head match-ups.