Friday, January 09, 2009

Simpson's Paradox



I know that I am a maths (Brit) dork, given a past posting about *emailing equations* as well as a fascination with running numbers for everything I do... I remember all race times, splits, positions, wattages, min/mile paces etc...

I am actually more than a math dork, I am a statistics dork! Statistics was my initial major in college, even though I eventually graduated with a double major in languages and only a minor in math and statistics! I suppose I was in denial in college, but I soon put that straight afterwards by joining a *white-shoe* New York investment bank to trade fixed income... now I work for an investment management firm that prides itself on the number of geeky quant jocks it employs :)

I delight in taking a statistical approach to things, and recently I came across a work problem which reminded me of the *Yule-Simpson effect*, a.k.a. Simpson's Paradox. What is it, you ask? It is a statistical paradox wherein a result that holds within each of several separate groups is reversed when those groups are combined... simple, d'oh! That's probably not too clear, so let me try an *athletic* example... One of my favorite books, Michael Lewis' Moneyball, demonstrated the richness of statistics available in the world of baseball. Let's review some batting averages (hits/at-bats, then the average):

| Player | 1995 | 1996 | Combined |
|---|---|---|---|
| Derek Jeter | 12/48 .250 | 183/582 .314 | 195/630 .310 |
| David Justice | 104/411 .253 | 45/140 .321 | 149/551 .270 |

In the example above, Justice had the higher batting average in both the '95 and '96 seasons (.253 and .321 respectively, against Jeter's .250 and .314), but when you combine the two seasons, Jeter comes out on top (.310 versus .270)!!!


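How the heck does Jeter have the higher two-season average?

Looking more closely at the results, this phenomenon occurs when the number of at-bats differs greatly between seasons, and in opposite directions for the two players. Jeter had almost all of his at-bats (582 of 630) in 1996, the season in which both players hit well, while Justice had most of his (411 of 551) in 1995, the season in which both hit poorly. Each combined average is weighted by at-bats, so Jeter's is pulled toward his strong season and Justice's toward his weak one. I like to think of it as a scale issue: it's not just how well you do it, it's how often you do it!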
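For the fellow dorks, here is a minimal Python sketch that re-runs the numbers from the table above. The names `stats`, `average` and `combined_average` are my own, chosen purely for illustration; the hits and at-bats are copied straight from the table.

```python
# (hits, at-bats) per season, copied from the batting table above
stats = {
    "Jeter":   {"1995": (12, 48),   "1996": (183, 582)},
    "Justice": {"1995": (104, 411), "1996": (45, 140)},
}


def average(hits, at_bats):
    """Batting average for a single season."""
    return hits / at_bats


def combined_average(seasons):
    """Average over all seasons: total hits divided by total at-bats."""
    total_hits = sum(hits for hits, _ in seasons.values())
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    return total_hits / total_at_bats


for player, seasons in stats.items():
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    for year, (hits, at_bats) in seasons.items():
        share = at_bats / total_at_bats  # weight of this season in the combined figure
        print(f"{player:8s} {year}: {average(hits, at_bats):.3f} "
              f"({share:.0%} of his at-bats)")
    print(f"{player:8s} combined: {combined_average(seasons):.3f}")
```

Running it shows Justice ahead in each season but Jeter ahead overall, because the combined figure is a weighted average and the two players' weights sit on opposite seasons.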

So why does this paradox suddenly capture my attention? After I won my age-group at Ironman Arizona last November, I bumped into an acquaintance who congratulated me on the win and on getting a *Kona slot*. She (intentionally?) weakened her congratulations by adding that it looked to be an *easy age-group*. WTF? What does that mean? I took it to mean that my age-group wasn't the fastest. Indeed, if you take away the forced groupings, the results might be different. She was right... I wouldn't have won any of the younger age-groups... but hey, I'm not younger, so I won by the rules of the race! Might my race have been different had I reacted in a more competitive fashion and picked up the pace when I was passed by a 28-year-old rather than a 38-year-old? Perhaps, but I was racing for an age-group placing and a Kona slot, so I was officially racing against the 38-year-old and not the 28-year-old...

USA Triathlon rankings involve the same paradox. I could finish ahead of another age-grouper every time we compete against one another, but if that athlete races more or enters *easier* races, she may well have a higher ranking overall at the end of the season.

So what? I think reflecting on the above has been a healthy reminder that numbers, be they rankings, wattages or bpm, are just that: numbers. They need a context, a comparison and a reference point to make them come alive... and even then we need to proceed cautiously...
