Saturday, December 07, 2013

How do you tabulate all those stats?

Perhaps people are interested in my #wrestlenomics process.  Let's start with taking results and turning them into win-loss records by wrestler.

Here we do go.

Step One: Find a good datasource.

This can actually be the hardest step.  You want to find useful, concise, complete and accurate summary of wrestling results. Each of these adjectives is important:

* Useful - An alternative word might be "consistent". You want to split your data by key terms.  It's terribly helpful if the transcriber uses the same set of words to describe match results and if you only have one match per line.
* Concise - it's FAR easier to compile your starting point if an entire year's worth of results are available on a single webpage. However, it's far too common to locate results which have been split into separate webpages; you often spend as much time opening and copying individual pages as you'll spend on the actual analysis.
* Complete - For instance, if you're working on annual results, make sure your dataset includes both television tapings and house shows.
* Accurate - You want to trust the source of the data.

For this tutorial, I'll start with one of my favorite websites:

Step Two: Transfer Data into your Spreadsheet

Data Source: Let's work on WWF 1991 from the The History of WWE for our example.
Spreadsheet Software: I am quite partial to Microsoft Excel so I'll be using MS Excel 2010 in this example, but you can use earlier (or newer) versions of Excel or other free applications like Google Docs.

a. "Select All" on the webpage
b.  Paste Special (right-click) and choose as "Text" in your spreadsheet (I usually start in cell B2).

Getting rid of any unusual formatting (bold text, images, merged cells) is key. You really want to start with a clean dataset.

Step Three: Label your Columns and number your lines

It's important that you can preserve the initial state of your data so you can always get the information in the same order as you started.

I started with all my text in cell B2 leaving Column A completely blank.

I label my columns:
A1 = linenum (this will be a unique number used to identify the data)
B1 = linetxt (this is the text from the webpage)
C1 = type (i.e. is it a wrestling match or a date or random note?)

 I label my first line of text with a number (typically in the form of YYYY0001 so in this case the first line is 19910001).
Label cell A2 as a number "19910001"

For the rest of column A, I use an equation to auto-number each line: A3 = A2+1
I carry the formula down to the end of the dataset.
There's several ways you can do that.
Method1: What I usually do is copy cell A3 (highlight cell A3, hit control+C), go the bottom of the sheet (control+End), moving over to column A, highlighting back to cell A3 (control+shift+up) and pasting the formula (control+V).
Method2: Grab the corner of the cell (your cursor will change from a large "open" plus-sign to a tiny plus-sign) and "drag" the formula down until you reach the bottom of the data.

Since may reorder data later in the process, let's freeze the data.
Copy column A and paste special values on Column A.  (highlight column A, control+C, move to cell A1, right-click Paste Special->Values).

It's also good form to freeze the top row so you can see your labels across the spreadsheet.
Menu: View => Freeze Panes => Freeze Top Row

Step Four: Label some obvious Data

The goal at the end of this exercise is to have every row of data labelled in column C.

These are the codes that I use:
  • blank = no data on this line
  • event = location & date of Wrestling Show (possibly attendance too)
  • match = wrestling match
  • other = battle royal or other contest (Divas Dance Contest, for instance)
  • taping = indicating beginning of a television or PPV taping
  • included = information on other stars who appeared on the card without match details
  • notes = other information that is not a wrestling match (for instance, who "no showed" due to injury)
 Add Filters to the top row. (Data => Filter)

Let's label the blank rows first.  Click on large grey down arrow in cell B1 for the filter and choose "Text Filters => Equals".

We'll start with the blank lines.  Press OK.

I label the top cell "blank" and copy it to the bottom.
(Type "blank" into cell C2, control+C, control+shift+end, hold down shift and press left until you only have column C highlighted, control+V).

The quickest way to reset things is to click on the Filter icon twice (turning it off and back on again).

Next thing to label is some obvious "notes". Notes on The History of WWE begin with a dash (-).

Next thing to label is some of the events "event". Some events on The History of WWE contain a @.

Label all of these as "event".

Step Five: Search for the Match buzzwords

There's several ways that matches are described:
  • Jimmy Snuka defeated Black Bart
  • Saba Simba pinned Buddy Rose
  • Davey Boy Smith fought the Warlord to a double count-out
  • Kerry Von Erich vs. Ted Dibiase
The keywords are: "defeated", "pinned", "fought" and "vs."
We're going to search for those four words.

D1 = "min"
E1 = " defeated " (leading space)
F1 = " pinned " (leading space)
G1 = " fought " (leading space)
H1 = " vs. " (leading space)
D2 = "=min(E2:H2)"  (find the minimum number in cells E2 through cells H2)

Copy formula in cell D2 all the way to the bottom of the sheet.

Next let's add the formula looking for the keyword in the linetxt.

E2: =IFERROR(FIND(E$1,$B2),"")

Copy this to cell F2, G2 and H2.  Then copy cells E2 through H2 to the bottom row.

Next, let's assign "match" if minimum (column D) fall between 1-124. Filter on column D.
Label any filtered rows as column C as "match".

Filter on Column B for matches that begin with "Included" or "Also included".

 Label any filtered rows as column C as "included".

Filter on Column B for matches that end with a ":" and label them as "taping"

Filter on Column B for matches that contain a " won " and label them as "other"

Filter on Column C for lines that have not yet been labelled.
There are about 45 (out of 5131 records) left.  Manually assign the remaining labels.

Most remaining lines are labelled "notes" or "event".  There are occasional misspelled match terms or alternate terms used such as "WWF IC Champion Mr. Perfect battled Davey Boy Smith to a draw" or "Jesse Bolt defetaed the Brooklyn Brawler".  Usually, it's easiest to manually edit these lines (switching "battled" to "fought" and fixing the spelling for "defeated").

Step Six: Attach Event Labels to all Matches

We're going to label column I as "event".
I2: =IF(C2="blank","blank",IF(C2="event",B2,I1))

Copy the formula to the bottom of the data.

Step Seven: Isolate Matches

Filter on "type" (Column C) on "match".

Copy the match data.

Paste Values on the next Sheet.

Part I ended with us pasting values of the Matches on a separate sheet.

We'll pick up there.

Step Six: Isolate the Winners and Losers

First, let's freeze the top row.

View => Freeze Panes => Freeze Top Row

We can delete columns E/F/G/H (defeated/pinned/fought/vs.)

Let's label column F as "word".

Using our minimum word calculation in column D (min), let's find the word from the column B (llinetxt).

F2:  =MID(B2,D2,FIND(" ",B2,D2+1)-D2)

Copy the formula to the bottom of the data.

Label column G as "AA" and column H as "BB".

AA will represent the wrestlers on the first team (typically the winners) and BB will be for the other wrestlers.


Column G (AA) is looking at the wrestlers are the left side of the word (but to the right of a colon, for things like "Prime Time Wrestling - 11/4/91:").
Column H (BB) is looking at the wrestlers to the right side of the word.

 Step Seven: Trim down the "Losing" Team

 New Column Labels
Column I = "BB_Trim"
Column J = "min"
Column K = " to "
Column L = " at "
Column M = " after "
Column N = " follow"
Column O = " by "
Column P = " via "
Column Q = " when "
Column R = " with "
Column S = "; "
Column T = " in "

The goal is to reduce the length of the second team (which currently has a lot of extraneous data to the actual person/team, i.e. "Jake Roberts via count-out" to "Jake Roberts").

I2: =TRIM(IF(J2>0,LEFT(H2,J2-1),H2))
J2: =MIN(K2:T2)
K2: =IFERROR(FIND(K$1,$H2),"")
L2: =IFERROR(FIND(L$1,$H2),"")
M2: =IFERROR(FIND(M$1,$H2),"")
N2: =IFERROR(FIND(N$1,$H2),"")
O2: =IFERROR(FIND(O$1,$H2),"")
P2: =IFERROR(FIND(P$1,$H2),"")
Q2: =IFERROR(FIND(Q$1,$H2),"")
R2: =IFERROR(FIND(R$1,$H2),"")
S2: =IFERROR(FIND(S$1,$H2),"")
T2: =IFERROR(FIND(T$1,$H2),"")

Copy these formulas to the bottom of the data.

 Step Eight: Separate the Teams and the Valets

 Copy all of the match data from Sheet2 onto Sheet3 and special paste values.

Go ahead and trim the columns - we're going to keep:

Column Labels
Column A = "linenum"
Column B = "linetxt"
Column C = "type"
Column D = "min"
Column E = "event"
Column F = "word"
Column G = "AA"
Column H = BB_trim"

Freeze top row (View => Freeze Panes => Freeze Top Row)

Column I = "team"
Column J = "team_num"

First, we want to copy all of Column G to Column I.

I2: =G2
J2: ="AA"

Copy this to the bottom.

Next we want to copy all of the data from row 2 to the bottom of the data and append it to the bottom of the dataset.  Starting at the first row of this duplicated row (if you lose track, search for the linenum that is in cell A2, the second location is where your data begins repeating)..

I3800: =H3800
J3800: ="BB"

Copy this down to the bottom of your data.

 You should now have a column I which has the names of all of the teams involved in matches and column J which indicates whether they are part of team AA (usually the winners) or team BB (usually the losers).

Next we want to strip off the valets and manager information which is in the paratheses.

New Columns
Column K = "("
Column L = ")"
Column M = "()"
Column N = "team_trim"

K2: =IFERROR(FIND(K$1,$I2),"")
L2: =IFERROR(FIND(L$1,$I2),"")
M2: =IFERROR(MID(I2,K2,L2-K2+1),"")

I would sort my sheet on column M using the "Sort Z to A" button on Data menu.

The top examples will move to the top.

What we want to do is both preserve the ()_text as well as remove it from the "team" line.

Here's the process:
I add one more column O which is labelled "()_text"
I copy and paste values from column M into column O.
Then I copy and paste values from column N into column I.
Then it's another "Sort Z to A" on Column M.

If you're lucky, there's only a few lines of data that have more than one set of text in paratheses.
For us, it's only about nine lines.

Find the best way to append the useful data to column O.

Usually, I use a quick dummy column P (labelled temp).  
P2: =O2&" "&M2
Carry down to the bottom of the ()_text.
Copy the data from column O and paste values on top of column O.
Delete Column P.
Copy the data from column N and paste values to column I.

Remove columns K/L/M/N.

 Step Nine: Separate Teams into Wrestlers

If things have gone well, columns should be:
Column A: linenum
Column B: linetxt
Column C: type
Column D: min
Column E: event
Column F: word
Column G: AA
Column H: BB_trim
Column I: team
Column J: team_num
Column K: ()_text

All of these columns should be values only - no formulas.

The next goal is to separate these teams into individual wrestlers.

New Columns
Column L: person
Column M: " & "
Column N: person1
Column O: person2

L2: =SUBSTITUTE(SUBSTITUTE(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(I2," and "," & "),","," & "),"  "," ")),".","")," & & "," & ")M2: =IFERROR(FIND($M$1,L2),"")N2: =IF(M2<>"",TRIM(LEFT(L2,M2-1)),"")
O2: =IF(M2<>"",TRIM(RIGHT(L2,LEN(L2)-M2)),"")

Column L is converting column I (team) in three ways: converting "and", "/" and "," to "&" and removing periods. Then we're splitting into two groups - a single person (column N) and the remaining people (column O).

Next, we're going to have an iterative process to finish deconstructing the teams.  Here are the steps involved:

1. Sort A-Z on column M ( & )
2. Turn on the Filter
3. Filter on Column M for empty columns (Text Filters "equals" blank)
4. Delete data from first blank cell in columns M/N/O to the bottom of data

5. Turn off the Filter
6. Copy columns O and P and Paste Values on Columns P/Q

7. Copy the non-empty Columns from P to column L
8. Append a copy of the non-empty rows of the data to the bottom of the dataset
9. Starting with the new data appended on the bottom, paste non-empty data from Column Q back on Column L
10. Delete Columns P and Columns Q

Your goal is to rinse, lather and repeat these ten steps until you no longer have any entries in column L with a "&" in them.

Once you're down with that, delete Columns M/N/O (which should be blank).

Step Ten: Cleaning up the Corners

At this point now there's two major steps left: (a) change names into final form, (b) assign wins/losses.

(a) Final Names

Column M: "final person"

This is a process.  First you want to remove the "Champion" titles.

M2: =TRIM(IFERROR(RIGHT(RIGHT(L2,LEN(L2)-FIND(" Champion",L2)),LEN(RIGHT(L2,LEN(L2)-FIND(" Champion",L2)))-FIND(" ",RIGHT(L2,LEN(L2)-FIND(" Champion",L2)))),L2))

Then I usually do a PivotTable to start analyzing how many misspelling or multiple identities or tag teams will we need to decipher.

(b) Wins & Losses

defeated win loss
pinned win loss
fought draw draw
vs. unknown unknown

Column N: "result"

N2: =IF(F2=" vs. ","unknown",IF(F2=" fought ","draw",IF(J2="AA","win","loss")))

At this point, you should have a moderately usable dataset.  You still need to remove some "false positives" (lines of data that aren't actually matches) or lines of data from events that weren't in your original federation (filter on column E event). 

No comments: