Indeed Wrestling: What I learned from Meltzer's 280+ WWF PPVs Ratings (Part Three)

We continue...

1/22/2014 Part One: We ask some Questions; We take a first look at annual average ratings.
1/23/2014 Part Two: We test some options to little avail; We get distracted; We learn little.

Our goal is to establish a "fair" method of weighting individual matches so that we can aggregate our information (in this case, WWF PPV Star Ratings from Dave Meltzer). To begin, we've been comparing annual PPV averages. We've looked at weighting all matches the same (a/b), weighting matches that occurred closer to the end of the card (i.e. the "main event") more heavily (c/d), and weighting matches according to their length in minutes (e/f). In each case, we still ended up with the same subset of years on top as "best years" (chronologically 2001, 2008, 2009, 2011, 2013) and "worst years" (chronologically 1988, 1989, 1990, 1992, 1999).
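To make the three weighting schemes concrete, here is a minimal Python sketch. The `Match` class, the `weighted_average` helper and the three-match card are all hypothetical names and numbers invented for illustration; in particular, using the raw card position as the weight is an assumption, not necessarily the exact formula behind options c/d.

```python
from dataclasses import dataclass


@dataclass
class Match:
    stars: float    # star rating for the match
    position: int   # 1 = opener; higher = closer to the main event
    minutes: float  # match length in minutes


def weighted_average(matches, weight):
    """Average star rating where each match counts in proportion to weight(match)."""
    total_weight = sum(weight(m) for m in matches)
    return sum(m.stars * weight(m) for m in matches) / total_weight


# Hypothetical three-match card (made-up numbers, not real PPV data).
card = [Match(1.0, 1, 5.0), Match(2.0, 2, 10.0), Match(4.0, 3, 20.0)]

equal = weighted_average(card, lambda m: 1.0)               # every match counts the same
by_position = weighted_average(card, lambda m: m.position)  # later matches count more (assumed linear)
by_length = weighted_average(card, lambda m: m.minutes)     # longer matches count more

print(equal, by_position, by_length)
```

On this toy card the average climbs as more weight shifts toward the long, late, well-rated match, which is the kind of movement the different options are meant to expose.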

This isn't to say there's no variation, especially in the middle - for instance, 1986 could be as high as 10th place (option D - positive ratings only, card placement: 2.55 avg score) or as low as 27th place (option E - all ratings, match length weighted: 1.66 avg score). Yet overall, we're not getting drastically different results.

Should we continue? Yes. I am not satisfied with these previous methods. Every wrestling match is not identical. Matches are given different amounts of time to work with, and they are constrained by the abilities of the wrestlers. The place in the storyline arc these matches occupy will vary (i.e. beginning/middle/end of a feud). Both match length (in minutes) and match placement (on card) are artificial proxies for estimating how "important" a match is. Furthermore, there can be a separation between which matches "sold the PPV" and which matches are the most "important" -- for example, how the title match can be overshadowed by the Royal Rumble. Simply weighting by match length can be vexing -- is a 30 minute match really FIVE times "more important" than a 6 minute frenzy? (Perhaps we could adjust for time logarithmically, so that the difference in my example would be more like 211% instead of 500%.)

On the subject of Card placement, let's look at some statistics:

As you can see, since PPVs became a monthly affair, the average PPV card has hovered between 7 and 8 matches (often including the pre-show); the 1996-2013 average was 7.55. What's interesting is how steady that number has been for the last fifteen years.

Let's discuss an "average" card. Since 1996, the average hovers around 95 minutes, but with some significant variation. (The standard deviation is 5:24, which suggests that with a normal distribution we could be looking at two standard deviations, or about +/- 11 minutes, around the mean: annual PPV averages should fall between 84 minutes and 106 minutes, which they do.)
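The arithmetic behind that range can be checked in a few lines of Python. This is only a sketch: `to_minutes` is a hypothetical helper, and the band is taken to be two standard deviations either side of the mean.

```python
def to_minutes(clock):
    """Convert an 'M:SS' string into fractional minutes."""
    minutes, seconds = clock.split(":")
    return int(minutes) + int(seconds) / 60


mean = 95.0              # average card length since 1996, in minutes
sd = to_minutes("5:24")  # standard deviation of the annual averages

# Two standard deviations either side of the mean.
low = mean - 2 * sd
high = mean + 2 * sd

print(round(low, 1), round(high, 1))
```

Rounded to whole minutes, that is the 84 to 106 minute band quoted above.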

As you'll see, it's not quite a normal curve. But it's not terrible either.

However, it's terribly hard to simply break out how the 95-minute card would split among 8 matches:

Match #1: Match E: 11:53
Match #2: Match A: 4:20
Match #3: Match B: 6:11
Match #4: Match C: 7:48
Match #5: Match D: 9:47
Match #6: Match F: 14:17
Match #7: Match G: 17:54
Match #8: Match H: 22:51

These are average times per match (A is shortest; H is longest). In fact, if you just try to average the time for each match slot, you end up with lots of 9-12 minute undercard matches - but that doesn't match up to reality. (The standard deviation in my "hypothetical card" for match #1 through #7 was 1.31 minutes while the "real" standard deviation in terms of time was 5.61 minutes. That told me that I needed more "wild swings".) The final match (#8) was the longest in about half of the instances (and second longest in another quarter of the sample).

We can see how, hypothetically, card position (#1-#8) relates to match length (A-H), though it hardly maps like a 1:1 function. What's far more compelling occurs when we look at average time versus average star rating:

This is really a surprising result. The line goes from 25% (quartile 1) to 75% (quartile 3), with the black box in between (its bottom line is the median and its top line is the average). Essentially, positive star ratings appear to have a linear relationship with time -- the longer the match, the higher the average star rating.

Let's flip the relationship - given a match length, look at the average star rating...

Importantly, the linear relationship between time and average star rating peaks around 21:30. (Chart was created by breaking matches into quarter minute segments and plotting average star rating of matches in that timeslice.)

Stars vs Time (average)
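For anyone who wants to rebuild that chart, the quarter-minute bucketing might look something like the sketch below. The function name and the sample data are hypothetical; only the approach (slice by 15 seconds, average the stars in each slice) comes from the description above.

```python
from collections import defaultdict


def average_stars_by_timeslice(matches, slice_minutes=0.25):
    """Group (minutes, stars) pairs into fixed-width time slices and average each slice."""
    buckets = defaultdict(list)
    for minutes, stars in matches:
        slice_index = int(minutes // slice_minutes)  # which quarter-minute slice this match falls in
        buckets[slice_index].append(stars)
    # Key each slice by its starting time in minutes.
    return {
        index * slice_minutes: sum(values) / len(values)
        for index, values in sorted(buckets.items())
    }


# Made-up sample: (match length in minutes, star rating).
sample = [(10.05, 2.0), (10.20, 3.0), (10.30, 1.5), (21.60, 4.0)]
result = average_stars_by_timeslice(sample)
print(result)
```

The first two sample matches land in the same slice (starting at 10:00) and get averaged together; the other two each sit alone in their own slice.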

Keep in mind these are averages. If you look at the earlier chart with the quartiles, you'll notice that a 10-minute match could easily land anywhere from * (one star) to **3/4 (2.75 stars); and that's just the "average results" - covering the 25th percentile to the 75th percentile.

But, it provides us an interesting comparison point: every 2 minutes 45 seconds, the match moves up by about half a star until around the twenty minute mark. But after twenty minutes, other factors seem to come into play which drive the star rating variance; the linear relationship really crumbles.
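As a rough worked example of that rule of thumb, here is a small Python sketch. The function name is hypothetical, and the rule is only meant to apply to matches under roughly twenty minutes, where the linear trend holds.

```python
STARS_PER_STEP = 0.5     # about half a star...
MINUTES_PER_STEP = 2.75  # ...for every 2 minutes 45 seconds


def expected_star_gain(shorter_minutes, longer_minutes):
    """Expected gap in average star rating between two match lengths.

    Only meaningful while both lengths sit inside the roughly linear
    region (under about twenty minutes).
    """
    steps = (longer_minutes - shorter_minutes) / MINUTES_PER_STEP
    return steps * STARS_PER_STEP


# A 17-minute match versus a 6-minute match: 11 minutes apart, i.e. four steps.
gain = expected_star_gain(6.0, 17.0)
print(gain)
```

In other words, on average the 17-minute match would be expected to come in about two stars higher than the 6-minute one, though the quartile spread shows how much individual matches can stray from that.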

We're seeing that card placement is related to match length, and match length is related to star rating. We want to introduce a new variable. Let's call it "importance".

How can we quantify how "important" the matches are?

I decided we could start with a variation of applying the OCELOT (Overly Complicated ELO Theorem) wrestler algorithm:

Named for a physics professor, the Elo Rating System was created as a system for rating chess players. Each player is assigned a numerical ranking where higher rankings correspond to better performing players. When two people of unequal ability compete several times, the better performing player is "expected" to win a certain number of the battles. The expected outcome can be calculated as a formula based on the Elo rankings of each competitor. Whenever a competitor performs better than their expected result, their score increases while their competitor loses points.
While the expected outcome of each competition is calculated by the relative scores of each competitor, the magnitude of the points transfer is dictated by a concept known as the k-value. In Chess, the k-factor is typically a uniform number, with the possibility that players with limited history may have a higher k-factor in order to reach their "true" rating quicker. In my pro-wrestling model, the k-value was essentially the "importance" of each competition. Therefore, losing a world championship match was rated as far more important than simply losing the preliminary match at an untelevised event.
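For readers who haven't seen Elo before, the sketch below uses the standard chess formulas (a logistic expected score on a 400-point scale, and a points transfer of k times the surprise). It is not the OCELOT code itself, and the two k-values are hypothetical numbers picked only to show how "importance" scales the swing.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(winner_rating, loser_rating, k):
    """Return the new (winner, loser) ratings after one match with k-value k."""
    expected_win = expected_score(winner_rating, loser_rating)
    delta = k * (1.0 - expected_win)  # the bigger the upset, the bigger the transfer
    return winner_rating + delta, loser_rating - delta


# Hypothetical k-values: a title change matters more than an untelevised prelim.
K_HOUSE_SHOW = 8
K_TITLE_CHANGE = 32

prelim = update(1500, 1500, K_HOUSE_SHOW)
title = update(1500, 1500, K_TITLE_CHANGE)
upset = update(1400, 1600, K_TITLE_CHANGE)

print(prelim, title, upset)
```

Two evenly rated wrestlers swap 4 points in the prelim but 16 in the title change, and an underdog winning the title match takes more than 16.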

What I did was calculate the "average ELO" rating for each PPV match based on the simplified ELO system where Title Changes and TV tapings were assigned higher k-values. (I didn't include card placement because that variable is being evaluated separately.) People's scores go up and down depending on who they beat in matches. The idea was that the important matches would involve the active wrestlers with the highest average ELO scores. (Note on a limitation: this model did not add any kind of time lapsing function or alternative Federation scoring, so when Roddy Piper returned to WWE in the 2000s, he carried over his high ELO score from the 80s.)
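The "average ELO per match" idea can be sketched in Python as follows. The wrestler names, ratings and helper functions are placeholders invented for illustration; the real scores would come out of the simplified ELO run described above.

```python
def match_average_elo(participants, ratings):
    """Mean Elo score of everyone in the match."""
    return sum(ratings[name] for name in participants) / len(participants)


def most_important_match(card, ratings):
    """Pick the match on the card whose participants have the highest average Elo."""
    return max(card, key=lambda participants: match_average_elo(participants, ratings))


# Placeholder ratings and a two-match card (not real data).
ratings = {
    "Wrestler A": 1700.0,
    "Wrestler B": 1650.0,
    "Wrestler C": 1500.0,
    "Wrestler D": 1450.0,
}
card = [("Wrestler C", "Wrestler D"), ("Wrestler A", "Wrestler B")]

headline = most_important_match(card, ratings)
print(headline, match_average_elo(headline, ratings))
```

Because it averages over however many names are in the tuple, the same helper would cover tag matches or multi-person matches as well as singles bouts.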

Here's the highest-rated ELO match for each PPV:

As you can see, given the lack of time-aging, a wrestler like Hogan (or experienced hands like Undertaker or Bret Hart) is going to do quite well.

The good news is that we have a new approach we can use for weighting the events on a PPV. And we move ahead...

Saturday, January 25, 2014
