The Summer of Jeff

The Wild Card Effect

Posted in tennis by Jeff on January 24, 2011

I’ve written before about the types of players awarded wild cards into professional men’s tennis tournaments.  While they can be categorized in different ways, there are two characteristics that are true of almost all wild cards:

  1. Without a wild card, they would not be able to play in the tournament.
  2. Tournament organizers see them as an asset to the event.

The first isn’t quite true; many wild cards would otherwise enter the qualifying draw, and some would reach the main draw that way.  We can still conclude that WCs are, at least according to ATP entry rankings, inferior to other players who appear in the main draw.  The only possible exceptions worth mentioning are qualifiers and other wild cards.

The second doesn’t necessarily tell us anything about the skill level of a player.  Simply having James Blake in the draw probably boosts ticket sales for any event in the U.S.  Other WCs are awarded to promote a tournament in other ways, perhaps by giving one WC to the winner of a junior event, or a special qualifying tournament for local amateurs.

While these cases are common enough, a major factor in the awarding of wild cards is the tournament organizer’s belief that a WC can compete.  So the WC goes to a player returning from injury, or a veteran coming back from retirement.  Or a junior who is rocketing up the rankings, or who has recently won a major collegiate event.

All this is to say, in the aggregate, players granted wild cards are usually better than their ranking says they are.

Thus, when we look at matches with one wild card and one non-wild card and apply my algorithm to predict the winner, we should anticipate that wild cards outperform expectations.

Empirical results

In fact, they do.  The effect is substantial, and it holds at multiple levels of competition.

In testing the hypothesis, I controlled for home court advantage, an important consideration that is easily conflated with the wild card effect.  After all, a large percentage of wild cards are granted to local players, so without careful analysis, it would not be clear how much of the advantage can be attributed to the wild card selection and how much to the benefits of playing in one’s home country.

I ran the numbers with a dataset comprising all ATP main draw, ATP qualifying draw, and Challenger main draw matches from 2008 to 2010.  The results were fairly consistent from year to year.

At the ATP main draw level, the dataset yielded over 900 matches between a wild card and a non-wild card.  The wild card won the match about 15% more often than expected.  We can approximate this effect by multiplying the WC’s ranking points by 1.3.

The other two levels showed even larger effects over about 2600 relevant matches.  In ATP qualifying and Challenger main draw matches, wild cards won over 25% more often than expected.  We can approximate this effect by multiplying the WC’s ranking points by 1.55.

Commentary

The existence of a positive “wild card effect” is not a surprise, nor is the magnitude.  Essentially, when a player is awarded a wild card, we’re given more information about him than ranking points otherwise offer.

I suspect the difference in magnitude between the higher and lower levels is fairly straightforward, as well.  While some players receive ATP wild cards straight from the amateur ranks, as can be the case with collegiate champions, most ATP wild cards go to somewhat established players on the fringes of success.  These players are often inside the top 150, meaning that they’ve played a lot of professional tournaments, so while their ranking might undervalue them slightly, it is a fairly accurate gauge of their ability level.

By contrast, qualifying and challenger-level wild cards often go to less experienced players.  They may not be full-time professionals or they may spend most of their time playing collegiate or junior tournaments.  They usually have rankings, but the point totals may only be based on a handful of events.

Example from Australia

The most successful wild card in the Australian Open was Aussie youngster Bernard Tomic, who reached the third round, beating Jeremy Chardy and Feliciano Lopez before losing to Rafael Nadal.

As he was a local and a wild card, we now know to adjust his ranking points twice before estimating his likelihood of winning a match.  Instead of estimating his talent with his pre-tourney ranking point total of 239, we adjust upward to 435.  That still puts him as an underdog against Chardy’s 960 points, but it means we would have given him a 30% chance of winning instead of an 18% chance.
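The double adjustment can be sketched in a few lines of Python. The wild card multiplier of 1.3 is the ATP main draw figure reported above; the home multiplier of 1.4 is an assumption, inferred from the 239-to-435 jump in the Tomic example rather than stated anywhere in the post, and the function name is made up for illustration.

```python
def adjust_points(points, wild_card=False, home=False,
                  wc_mult=1.3, home_mult=1.4):
    """Scale raw ranking points for wild card and home-court effects.

    wc_mult is the ATP main draw wild card multiplier from the post;
    home_mult=1.4 is an inferred value, not a figure the post states.
    """
    adjusted = float(points)
    if wild_card:
        adjusted *= wc_mult
    if home:
        adjusted *= home_mult
    return adjusted


# Tomic: 239 raw points, adjusted once for the wild card, once for home court.
tomic = adjust_points(239, wild_card=True, home=True)
print(round(tomic))  # 435
```

The adjusted total would then be fed into the win-probability algorithm in place of the raw ranking points.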

Of course, the 2011 Australian Open isn’t very instructive here, since six of the other wild cards lost their first matches, while the final WC, Benoit Paire, drew qualifier Flavio Cipolla in the first round, and was a favorite.

Tennis Home Court – Research Notes

Posted in tennis by Jeff on January 23, 2011

I’ve built out my men’s tennis results database quite a bit in the last couple of months, so I thought I’d revisit my research into home court advantage.

To recap, I started with ATP main draw matches from 2009.  I focused on the subset of matches where the tournament was in the home country of one player, but not the other.  I excluded matches where either player was a wild card entry–that usually applies to the home player.  I did so because I think there is a separate “wild card” effect that reflects selection bias.  (Tourney organizers choose players who did not make the cut but whose chances, for whatever reason, are better than their ranking would suggest.)

As I reported in my initial research, using about 450 matches from the 2009 main draw dataset, the home player won 17% more matches than expected.  (“Expected” wins are derived from my bare-bones algorithm to predict the winner of the match.)  Using ranking points, this is roughly equivalent to giving the home player credit for 50% more ranking points than he actually has.
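One plausible reading of "won 17% more matches than expected" is actual home wins divided by the sum of the pre-match win probabilities, minus one. The Python sketch below takes that reading, which is an assumption; the function name and the toy numbers are invented, and the probabilities would come from whatever prediction algorithm is in use.

```python
def excess_over_expected(matches):
    """matches: iterable of (home_win_prob, home_won) pairs.

    Returns actual wins / expected wins - 1, so 0.17 would mean
    "17% more wins than expected".
    """
    matches = list(matches)
    expected = sum(prob for prob, _ in matches)
    actual = sum(1 for _, won in matches if won)
    if expected <= 0:
        raise ValueError("expected wins must be positive")
    return actual / expected - 1.0


# Toy sample, not real data: four matches, two home wins.
sample = [(0.36, True), (0.50, False), (0.25, True), (0.49, False)]
print(f"{excess_over_expected(sample):+.1%}")
```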

For example, Lleyton Hewitt is currently ranked 54th, with 870 ranking points.  If we make this adjustment for the Australian Open, we’d say he’ll play at a level equal to someone with 1,305 ranking points, which would be 32nd in the world.  Instead of giving him a 36% chance of winning his first round match against David Nalbandian, the home-court-adjusted number would give him a 47% chance.  In this case the results might bear us out: The match went to 9-7 in the fifth set.

The surprise came when I expanded the dataset to include Challenger main draw matches and ATP-level qualifier matches.  In 2009 Challengers, home players only won 6% more often than expected–equivalent to a ranking points multiplier of 1.15.  In 2009 ATP qualies, the home court advantage was only 2%–a multiplier of about 1.05.  Whatever confers the home court advantage in ATP main draw matches may not apply at all levels.

I next looked at the same datasets for 2010.  Here are the home court advantages (and ranking points multipliers) observed last year:

  • ATP main draw: 12% (1.35)
  • Challenger main draw: 4% (1.1)
  • ATP qualifiers: 14% (1.3)

The first two numbers don’t differ much from the ’09 observations, but the qualifier numbers come out of nowhere.

Until I’m able to look at more matches from before 2009, I hesitate to draw any conclusions about the qualifiers.  That still leaves us with a fairly consistent gap between the home court advantage observed at the ATP main draw and Challenger main draw levels.

To the extent that crowd involvement plays a part, it seems reasonable to expect that players would get a bigger boost on a bigger stage.  Even on outer courts in the early rounds, fans tend to pull for the locals.  At challengers, the atmosphere is often more like a club tournament where the audience is next to nonexistent.

Another major possibility is that some combination of selection bias and the inadequacy of my prediction algorithm accounts for the lack of observed home court advantage in challengers.  Players have more choice of where to play at the lower levels, so they will tend to stay closer to home.  It may mean that, even exclusive of wild cards, the distribution of home-country players and non-home-country players is different; perhaps the bottom ends of challenger draws are disproportionately packed with home-country players.  This is something that I can investigate further.

UPDATE: Just ran the numbers for 2008.  The ATP main draw home court advantage remained consistent, at a 16% boost for the home player.  The ATP qualifier pool also showed the same home court advantage.  However, 2008 differed from later years in that in Challenger main draw matches, home players got an 11% boost, much bigger than in 2009 or 2010.

Marginal ATP Rankings

Posted in tennis by Jeff on January 22, 2011

ATP rankings are frustrating: They are a decent approximation for player skill, but there are so many obvious flaws.  Some of those flaws derive from the problem of needing one number–there’s no accounting for surface, for instance.

The one that frustrates me the most is how much luck is allowed to creep into a player’s ranking.  When a player is awarded points for his performance in a certain tournament, there is no consideration of the skill level of the players he defeated.  So two players who lose in the second round get the same number of points, even if one defeated a 16-year-old wild card in the first round and the other defeated Rafael Nadal in the first round.

There are plenty of arguments in favor of the present way of doing things.

  • First, there’s the circular problem of finding a starting point–if ranking points aren’t an adequate measure of skill, how do you give numerical credit based on the skill of opponents?
  • Second, players don’t display consistent levels of skill; if Milos Raonic is in the fourth round of the Australian Open, he is probably playing better than he was four months ago when he lost in the first round of the U.S. Open.  Perhaps the person who defeats him in Melbourne deserves more points than the guys who beat him in qualifiers and challengers last fall.  Players also display different levels of skill depending on surface; beating Juan Carlos Ferrero is more impressive on clay than on grass, and you’re more likely to do so in a later round on clay.
  • Third, you could say that it all comes out in the wash.  Pros play a lot of tournaments, and while you might only get 20 points for beating a top-10 player in the first round, you might get an additional 90 points for beating an unseeded player three rounds later.

We could settle for the status quo, or we could experiment with a different approach and test it.  Testing these things is an enormous task, so for today I’m just presenting the experiment itself.

Opponent-based point awards

I looked at all ATP-level main draw and qualifying draw matches, along with Challenger-level main draw matches.  I figured out the marginal points awarded to the winner of each match (e.g., by winning in the third round in the Aussie Open, you get 180 points instead of 90 points, for 90 marginal points) and the ranking points of the loser at the time of the match.

For instance, when Nadal beat Federer in the Madrid final, Nadal was awarded 400 marginal points, and Federer had 10,690 ranking points.  Sum each type of points over every match in the dataset, and it turns out that the total marginal points awarded are approximately 4.5% of the total ranking points of the losers.

Thus, if we use a simple linear model, instead of giving Nadal 400 marginal points for winning that match, we give him 4.5% of 10,690, or 463 points.  In this case, not a big difference.  But when top players are upset in early rounds, the adjustment is huge.

To take a very different example: In Miami last year, Olivier Rochus beat Novak Djokovic in the round of 64.  For advancing to the round of 32, Rochus earned 20 marginal points.  Djokovic’s ranking point total at that point was 8,220, so if we give Rochus 4.5% of that, he gets 365 points.  As we’ll see, that single adjustment rockets him up the rankings.
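The linear model described in this section can be sketched as follows in Python. The function names are invented for illustration, and the rate is fixed at the rounded 4.5% from the text. A flat 4.5% gives 481 and 370 points for the two examples, a little above the 463 and 365 quoted, which fits with the text giving the rate only as approximate.

```python
def fit_rate(matches):
    """matches: iterable of (marginal_points_awarded, loser_ranking_points).

    Returns total marginal points as a fraction of total loser points.
    """
    matches = list(matches)
    total_marginal = sum(marginal for marginal, _ in matches)
    total_loser = sum(loser for _, loser in matches)
    return total_marginal / total_loser


def opponent_based_award(loser_points, rate=0.045):
    """Points for a win under the linear model: a fixed share of the
    loser's ranking points at the time of the match."""
    return rate * loser_points


print(round(opponent_based_award(10690)))  # Nadal d. Federer, Madrid final
print(round(opponent_based_award(8220)))   # Rochus d. Djokovic, Miami R64

# A real fit would run over every match; these are just the two quoted above.
print(round(fit_rate([(400, 10690), (20, 8220)]), 4))
```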

Pros and Cons

Compared to the present ATP ranking system, this approach gives more credit to the players who are capable of a top-10 performance, even if they play at that level very rarely.  As we’ll see, a single major upset can make a huge difference, so perhaps it weights a single match too heavily.  If Rochus happened to play Djokovic on a day when Djokovic had the flu, does he really deserve 365 points?

Another potential problem is that this model doesn’t consider the level of the opponents that a player loses to.  Nikolay Davydenko is known for his ability to beat Federer or Nadal, but in consecutive weeks in October, he lost to Pablo Cuevas and Mischa Zverev.  Should we rank someone based on their ability to defeat “better” players, or their inability to defeat “lesser” players?  As always, the standard ATP ranking system appears to be a decent compromise.

For my purposes, what matters is how well a ranking system predicts future results.  I hope that soon I’ll be able to report on how this one performs.

In the meantime, here are the 2010 year-end top 100, using the opponent-based model I’ve described.  I’ve also included each player’s actual 2010 year-end ranking and the difference between their placement in the two systems.

Rk   Player                   Pts  Actual  Diff  
1    Rafael Nadal            4562       1     0  
2    Roger Federer           4529       2     0  
3    Robin Soderling         3905       5     2  
4    David Ferrer            3450       7     3  
5    Andy Murray             3347       4    -1  
6    Tomas Berdych           2891       6     0  
7    Jurgen Melzer           2772      11     4  
8    Novak Djokovic          2730       3    -5  
9    Fernando Verdasco       2697       9     0  
10   Andy Roddick            2331       8    -2  
11   Gael Monfils            2266      12     1  
12   Nikolay Davydenko       2070      22    10  
13   Mikhail Youzhny         2059      10    -3  
14   Ivan Ljubicic           1952      17     3  
15   Guillermo Garcia-Lopez  1948      33    18  
16   Nicolas Almagro         1908      15    -1  
17   Marcos Baghdatis        1859      20     3  
18   Marin Cilic             1843      14    -4  
19   Albert Montanes         1832      25     6  
20   Michael Llodra          1822      23     3  
21   Ernests Gulbis          1695      24     3  
22   Viktor Troicki          1654      28     6  
23   Mardy Fish              1646      16    -7  
24   Jo-Wilfried Tsonga      1577      13   -11  
25   Stanislas Wawrinka      1573      21    -4  
26   Richard Gasquet         1570      30     4  
27   Florian Mayer           1497      37    10  
28   John Isner              1473      19    -9  
29   Philipp Kohlschreiber   1456      34     5  
30   Feliciano Lopez         1396      32     2  
31   David Nalbandian        1395      27    -4  
32   Juan Monaco             1394      26    -6  
33   Samuel Querrey          1353      18   -15  
34   Xavier Malisse          1342      60    26  
35   Jeremy Chardy           1272      45    10  
36   Andrei Goloubev         1236      36     0  
37   Juan Carlos Ferrero     1227      29    -8  
38   Jarkko Nieminen         1214      39     1  
39   Gilles Simon            1180      41     2  
40   Janko Tipsarevic        1178      49     9  
41   Benjamin Becker         1145      53    12  
42   Michael Berrer          1144      58    16  
43   Thomaz Bellucci         1125      31   -12  
44   Alexander Dolgopolov    1058      48     4  
45   Denis Istomin           1058      40    -5  
46   Andreas Seppi           1047      52     6  
47   Thiemo de Bakker        1033      43    -4  
48   Potito Starace          1031      47    -1  
49   Daniel Gimeno            983      56     7  
50   Olivier Rochus           971     113    63  
51   Lleyton Hewitt           944      54     3  
52   Julien Benneteau         941      44    -8  
53   Marcel Granollers        937      42   -11  
54   Juan Ignacio Chela       872      38   -16  
55   Pablo Cuevas             868      63     8  
56   Tommy Robredo            851      50    -6  
57   Philipp Petzschner       817      57     0  
58   Sergey Stakhovsky        813      46   -12  
59   Dudi Sela                808      75    16  
60   Santiago Giraldo         805      64     4  
61   Michael Zverev           797      82    21  
62   Radek Stepanek           789      62     0  
63   Fabio Fognini            784      55    -8  
64   Mikhail Kukushkin        781      59    -5  
65   Yen-Hsun Lu              745      35   -30  
66   Igor Andreev             722      79    13  
67   Carlos Berlocq           719      66    -1  
68   Ryan Sweeting            716     116    48  
69   Teimuraz Gabashvili      714      80    11  
70   Arnaud Clement           703      78     8  
71   Lukas Lacko              691      89    18  
72   Tobias Kamke             652      67    -5  
73   Pere Riba                646      72    -1  
74   Rainer Schuettler        642      84    10  
75   Robin Haase              626      65   -10  
76   Florent Serra            625      69    -7  
77   Leonardo Mayer           622      94    17  
78   Rui Machado              618      93    15  
79   Kevin Anderson           596      61   -18  
80   Albert Ramos             595     123    43  
81   Ivo Karlovic             567      73    -8  
82   Frederico Gil            554     101    19  
83   Daniel Brands            546     104    21  
84   Alejandro Falla          544     105    21  
85   Simon Greul              534     130    45  
86   Simone Bolelli           521     107    21  
87   Filippo Volandri         509      91     4  
88   Ilia Marchenko           488      81    -7  
89   Marco Chiudinelli        486     117    28  
90   Filip Krajinovic         483     214   124  
91   Victor Hanescu           481      51   -40  
92   Bjorn Phau               479     102    10  
93   Ivan Dodig               478      88    -5  
94   Kei Nishikori            477      98     4  
95   Evgueni Korolev          468     140    45  
96   James Blake              467     135    39  
97   Ruben Ramirez-Hidalgo    466      77   -20  
98   Ricardo Mello            461      76   -22  
99   Grigor Dimitrov          460     106     7  
100  Brian Dabul              458      85   -15

Predictiveness of ATP Rankings – Research Notes

Posted in tennis by Jeff on January 19, 2011

I’m working on some bigger projects right now that might take some time before they see the light. In the meantime, here are a couple of things I’ve discovered about ATP rankings and their use to predict the outcome of matches.

1. In my earlier research, I found that in the “buckets” of matches that the favorite is most likely to win, my algorithm is still reasonably accurate.  In other words, if the ranking points predict that Nadal, say, has a 98% chance of beating the 140th ranked player, his chances are in fact that high.  The algorithm was as accurate on the extreme high end as it was anywhere else on the spectrum.

However, I only included matches in my sample where both players were ranked inside the top 200.  I thought that was an innocuous enough cutoff, but I see now why it was misleading.  If we limit the sample that way, the most extreme favorites will only be the very top players.  In fact, the only players whom my algorithm gives a 95% chance of beating the 200th ranked player are the top 5.

When I expanded the sample to players ranked outside of the top 200, the high end broke down.  In other words, in the bucket of matches where the favorite had a 90% or better chance of winning, the favorite isn’t winning that often.
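A bucket check of this kind can be sketched in Python as below. This is only an illustration of the idea, not the code behind these results: the function name, the 10-point bucket width, and the toy sample are all assumptions.

```python
from collections import defaultdict


def calibration_table(predictions, width_pct=10):
    """predictions: iterable of (favorite_win_prob, favorite_won).

    Groups matches into buckets width_pct percentage points wide and
    returns {bucket_floor_pct: (n_matches, mean_predicted, actual_rate)}.
    """
    top_bucket = 100 // width_pct - 1
    buckets = defaultdict(list)
    for prob, won in predictions:
        # Work in tenths of a percent to avoid float edge cases at 0.7, 0.9, ...
        idx = min(round(prob * 1000) // (width_pct * 10), top_bucket)
        buckets[idx].append((prob, won))
    table = {}
    for idx in sorted(buckets):
        rows = buckets[idx]
        n = len(rows)
        mean_predicted = sum(p for p, _ in rows) / n
        actual_rate = sum(1 for _, w in rows if w) / n
        table[idx * width_pct] = (n, mean_predicted, actual_rate)
    return table


# Toy sample, not real data.
toy = [(0.92, True), (0.95, False), (0.91, True), (0.97, True),
       (0.55, True), (0.58, False)]
for floor, (n, predicted, actual) in calibration_table(toy).items():
    print(f"{floor}%+: n={n} predicted={predicted:.3f} actual={actual:.3f}")
```

If the algorithm is well calibrated, the predicted and actual columns should roughly agree in every bucket; the breakdown described above would show up as an actual rate well below the predicted one in the 90%+ bucket.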

There are several possible explanations for this, none of which account for the entire effect, but many of which surely play a part:

  • I’m still only looking at ATP-level matches, and if a player outside of the top 200 is in an ATP main draw match, he was not exactly randomly selected.  He may be playing at “home” on a wild card, he may be hot after a solid week in qualifying, he is probably on his favorite surface, and his ranking may be misleading due to injury.
  • Outside of the top 5 or top 10, players are substantially less consistent.  It’s tough to imagine Robin Soderling losing to a qualifier right now, but easy to see, say, Fernando Verdasco or Ivan Ljubicic doing so.

More fundamentally, I suspect that the further down the rankings you go, the less the difference in points really means.  Certainly there’s much more movement–once you get outside the top 50, one good showing can easily gain you 10, 20, or more spots.  That doesn’t mean that a player is suddenly more skilled, which is the way my algorithm has to treat him.

Controlling for surface, wild card status, and more will help reconcile some of these differences, but ultimately, matches between drastically mismatched (on paper) opponents may have to be treated differently than matches between more closely matched peers.

2. Eliminating some quirks of the ATP ranking system doesn’t break it, at least not for my purposes.  In the process of my current projects, I wanted to be able to more easily tweak the parameters of the ranking system, so I started by rebuilding the existing one.  But there are a lot of quirks:

  • The top 4 or 5 players get a lot of points from the Tour Championships.
  • Davis Cup players get points.
  • Rankings are limited to a player’s top 18 tournaments, but there are some limitations on what those tournaments must be, resulting in cases where a player gets credited for a poor showing at a grand slam, but does not get credited for a better showing (worth more points) at a smaller tournament.

All of these quirks have their purposes, given that the ATP’s priorities are built around keeping fans interested and ensuring that top players focus on the most important events.  But they are a pain in the butt to incorporate in an on-the-fly system, so I just ignored them.
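A stripped-down ranking of this kind might be nothing more than the sum of a player's best 18 results, with no rules about which events must count. That is an assumption about what "ignoring the quirks" amounts to, not the actual code; the function name and the sample season in this Python sketch are hypothetical.

```python
import heapq


def improvised_points(results, best_n=18):
    """Sum a player's best_n tournament point totals, ignoring every
    other rule (mandatory events, Davis Cup, Tour Championships)."""
    return sum(heapq.nlargest(best_n, results))


# Hypothetical season of 20 events; the two weakest results are dropped.
season = [1200, 720, 360, 360, 180, 180, 180, 90, 90, 90,
          90, 45, 45, 45, 45, 20, 20, 10, 10, 0]
print(improvised_points(season))  # 3770
```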

And as it turns out, they are not affecting my results in any meaningful way.  I’ve re-run a couple of earlier projects with my “improvised” rankings, and nothing is changing by more than a percent or two.  Occasionally the effect is strong on a certain player (I think the improvised system bumps Juan Carlos Ferrero from #29 to inside the top 15 at 2010 year-end), but in the aggregate, it makes no difference.

2011 Aussie Open Simulation Results

Posted in programming, tennis by Jeff on January 16, 2011

Using my simple ranking-points-based algorithm to determine the odds that each player wins a match, I ran simulations using the 2011 Australian Open draw.

As usual, the keyword is “simple,” and you can easily find all sorts of intuitive reasons to discount the results.  There’s no consideration for surface, so clay-court specialists are generally overrated.  Players returning from injury (Del Potro, especially, and Karlovic) have taken a hit in the rankings, and are thus underrated here, as well.

I’m also publishing the code that I use to generate these sims. It should work for any single-elimination tournament up to 128 competitors, and is easily expandable to handle larger brackets.  The function ‘calcWP’ is specific to my tennis algorithm, but you could swap in something like log5 very easily. I also included the .csv file I used for the draw, so you can see the format, or tinker with the parameters and come up with your own Aussie sim.
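For readers who just want the shape of such a simulation, here is a minimal Monte Carlo sketch in Python. It is not the code published with this post: `ratio_win_prob` is a simple stand-in for `calcWP`, not the actual algorithm, and the function names, defaults, and four-player section are for illustration only.

```python
import random


def ratio_win_prob(points_a, points_b):
    """Stand-in win probability for player A (NOT the post's calcWP)."""
    return points_a / (points_a + points_b)


def simulate_draw(draw, win_prob=ratio_win_prob, n_sims=10000, seed=0):
    """draw: list of (name, points) in bracket order. The length must be
    a power of two and the names must be unique.

    Returns {name: [P(wins round 1), P(wins round 2), ...]}.
    """
    n_players = len(draw)
    n_rounds = n_players.bit_length() - 1
    if n_players < 2 or 2 ** n_rounds != n_players:
        raise ValueError("draw size must be a power of two")
    rng = random.Random(seed)
    wins = {name: [0] * n_rounds for name, _ in draw}
    for _ in range(n_sims):
        alive = list(draw)
        for rnd in range(n_rounds):
            survivors = []
            # Adjacent players in the current list meet in this round.
            for i in range(0, len(alive), 2):
                a, b = alive[i], alive[i + 1]
                winner = a if rng.random() < win_prob(a[1], b[1]) else b
                wins[winner[0]][rnd] += 1
                survivors.append(winner)
            alive = survivors
    return {name: [c / n_sims for c in counts] for name, counts in wins.items()}


# First four lines of the draw, with points taken from the table in this post.
section = [("Nadal", 12390), ("Daniel", 564),
           ("Sweeting", 486), ("Gimeno-Traver", 844)]
for name, probs in simulate_draw(section).items():
    print(name, [f"{p:.1%}" for p in probs])
```

Any function that takes two point totals and returns a probability can be passed as `win_prob`, so the stand-in is easy to replace. Because it is not the real algorithm, its output will not match the table.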

Your 2011 Australian Open…

Player               points    R64    R32    R16     QF     SF      F      W  
Nadal             1   12390  96.9%  92.7%  87.0%  78.1%  66.1%  49.6%  34.5%  
Daniel                  564   3.1%   1.4%   0.5%   0.1%   0.0%   0.0%   0.0%  
Sweeting          Q     486  35.3%   1.6%   0.5%   0.1%   0.0%   0.0%   0.0%  
Gimeno-Traver           844  64.7%   4.3%   1.9%   0.7%   0.2%   0.0%   0.0%  
Tomic             W     239  17.9%   3.1%   0.1%   0.0%   0.0%   0.0%   0.0%  
Chardy                  960  82.1%  39.6%   3.7%   1.4%   0.4%   0.1%   0.0%  
Falla                   540  27.3%  11.3%   0.7%   0.2%   0.0%   0.0%   0.0%  
Lopez F          31    1310  72.7%  46.0%   5.6%   2.6%   0.9%   0.2%   0.0%  
                                                                              
Isner            20    1850  74.0%  56.8%  31.7%   5.8%   2.5%   0.8%   0.2%  
Serra                   711  26.0%  14.0%   4.6%   0.5%   0.1%   0.0%   0.0%  
Stepanek                735  62.1%  20.4%   6.9%   0.6%   0.2%   0.0%   0.0%  
Gremelmayr        Q     469  37.9%   8.8%   2.2%   0.1%   0.0%   0.0%   0.0%  
Machado                 573  41.2%  10.2%   3.3%   0.2%   0.0%   0.0%   0.0%  
Giraldo                 785  58.8%  18.3%   7.2%   0.7%   0.2%   0.0%   0.0%  
Young D           Q     435  14.6%   5.4%   1.4%   0.1%   0.0%   0.0%   0.0%  
Cilic            15    2140  85.4%  66.1%  42.8%   8.7%   4.1%   1.4%   0.4%  
                                                                              
Youzhny          10    2920  85.6%  70.1%  51.9%  29.2%   8.1%   3.3%   1.1%  
Ilhan                   574  14.4%   6.2%   2.2%   0.5%   0.0%   0.0%   0.0%  
Kavcic            Q     552  38.0%   7.1%   2.4%   0.5%   0.0%   0.0%   0.0%  
Anderson K              868  62.0%  16.6%   7.4%   2.1%   0.2%   0.0%   0.0%  
Raonic            Q     351  36.4%   6.8%   1.0%   0.2%   0.0%   0.0%   0.0%  
Phau                    581  63.6%  18.0%   4.1%   0.9%   0.1%   0.0%   0.0%  
Chela                  1070  39.3%  27.8%   9.9%   3.2%   0.4%   0.1%   0.0%  
Llodra           22    1575  60.7%  47.4%  21.0%   8.8%   1.6%   0.4%   0.1%  
                                                                              
Nalbandian       27    1480  64.2%  49.1%  18.4%   8.2%   1.4%   0.4%   0.1%  
Hewitt                  870  35.8%  23.1%   6.1%   2.0%   0.2%   0.0%   0.0%  
Berankis                589  61.1%  19.1%   3.9%   1.0%   0.1%   0.0%   0.0%  
Matosevic         W     392  38.9%   8.8%   1.4%   0.2%   0.0%   0.0%   0.0%  
Russell                 547  67.0%  10.1%   3.6%   0.8%   0.1%   0.0%   0.0%  
Ebden             W     288  33.0%   2.7%   0.6%   0.1%   0.0%   0.0%   0.0%  
Nieminen               1062  20.2%  14.5%   7.7%   2.8%   0.4%   0.1%   0.0%  
Ferrer            7    3735  79.8%  72.7%  58.4%  39.4%  12.6%   5.8%   2.4%  
                                                                              
Soderling         4    5785  87.9%  83.6%  71.9%  58.3%  35.9%  15.6%   7.9%  
Starace                 945  12.1%   8.9%   4.2%   1.7%   0.3%   0.0%   0.0%  
Muller            Q     466  76.9%   6.9%   2.0%   0.5%   0.1%   0.0%   0.0%  
Stadler           Q     155  23.1%   0.7%   0.1%   0.0%   0.0%   0.0%   0.0%  
Istomin                1031  86.2%  41.8%   8.8%   3.6%   0.8%   0.1%   0.0%  
Hernych           Q     196  13.8%   1.9%   0.1%   0.0%   0.0%   0.0%   0.0%  
Mello                   627  30.0%  12.8%   1.9%   0.6%   0.1%   0.0%   0.0%  
Bellucci         30    1355  70.0%  43.5%  11.0%   5.3%   1.4%   0.2%   0.1%  
                                                                              
Gulbis           24    1505  64.3%  41.5%  20.7%   6.3%   1.9%   0.4%   0.1%  
Becker                  870  35.7%  17.9%   6.4%   1.3%   0.2%   0.0%   0.0%  
Dolgopolov              928  53.6%  22.8%   8.6%   1.8%   0.4%   0.0%   0.0%  
Kukushkin               815  46.4%  17.9%   6.3%   1.2%   0.2%   0.0%   0.0%  
Seppi                   900  59.6%  19.2%   8.7%   1.9%   0.4%   0.0%   0.0%  
Clement                 627  40.4%   9.9%   3.5%   0.6%   0.1%   0.0%   0.0%  
Petzschner              839  24.3%  12.6%   5.5%   1.1%   0.2%   0.0%   0.0%  
Tsonga           13    2345  75.7%  58.2%  40.4%  15.8%   6.3%   1.6%   0.5%  
                                                                              
Melzer           11    2785  91.2%  77.7%  54.3%  22.9%  10.4%   3.0%   1.0%  
Millot            Q     334   8.8%   3.3%   0.7%   0.1%   0.0%   0.0%   0.0%  
Ball              W     344  32.5%   4.1%   0.9%   0.1%   0.0%   0.0%   0.0%  
Riba                    672  67.5%  14.8%   5.5%   1.0%   0.2%   0.0%   0.0%  
Sela                    568  77.8%  21.8%   5.0%   0.7%   0.1%   0.0%   0.0%  
Del Potro               180  22.2%   2.4%   0.2%   0.0%   0.0%   0.0%   0.0%  
Zemlja            Q     376  15.1%   6.9%   1.2%   0.1%   0.0%   0.0%   0.0%  
Baghdatis        21    1785  84.9%  68.9%  32.2%  10.7%   3.8%   0.8%   0.2%  
                                                                              
Garcia-Lopez     32    1300  62.1%  44.0%  10.6%   4.2%   1.2%   0.2%   0.0%  
Berrer                  835  37.9%  22.8%   3.9%   1.1%   0.2%   0.0%   0.0%  
Schwank                 580  50.6%  16.9%   2.3%   0.5%   0.1%   0.0%   0.0%  
Mayer L                 572  49.4%  16.3%   2.1%   0.4%   0.1%   0.0%   0.0%  
Marchenko               624  49.3%   5.5%   2.3%   0.6%   0.1%   0.0%   0.0%  
Ramirez Hidalgo         638  50.7%   5.7%   2.4%   0.6%   0.1%   0.0%   0.0%  
Beck K                  543   7.0%   3.2%   1.2%   0.3%   0.0%   0.0%   0.0%  
Murray            5    5760  93.0%  85.5%  75.3%  56.7%  35.5%  15.6%   7.9%  
                                                                              
Berdych           6    3955  96.4%  78.5%  63.1%  42.3%  22.0%   9.6%   3.4%  
Crugnola          Q     194   3.6%   0.5%   0.1%   0.0%   0.0%   0.0%   0.0%  
Kohlschreiber          1215  63.8%  15.2%   8.3%   3.1%   0.9%   0.2%   0.0%  
Kamke                   724  36.2%   5.8%   2.4%   0.7%   0.1%   0.0%   0.0%  
Harrison          W     313  32.3%   6.7%   0.6%   0.1%   0.0%   0.0%   0.0%  
Mannarino               612  67.7%  22.8%   3.9%   0.9%   0.1%   0.0%   0.0%  
Dancevic          Q     172   9.0%   2.2%   0.1%   0.0%   0.0%   0.0%   0.0%  
Gasquet          28    1385  91.0%  68.3%  21.5%   8.8%   2.5%   0.6%   0.1%  
                                                                              
Davydenko        23    1555  60.0%  41.5%  17.1%   6.5%   2.0%   0.5%   0.1%  
Mayer F                1073  40.0%  23.9%   8.0%   2.3%   0.6%   0.1%   0.0%  
Fognini                 855  59.6%  22.7%   6.5%   1.7%   0.3%   0.0%   0.0%  
Nishikori               599  40.4%  12.0%   2.7%   0.5%   0.1%   0.0%   0.0%  
Zverev                  611  38.3%   7.2%   2.4%   0.5%   0.1%   0.0%   0.0%  
Tipsarevic              935  61.7%  16.0%   7.2%   2.0%   0.4%   0.1%   0.0%  
Schuettler              597  13.5%   5.8%   1.9%   0.4%   0.1%   0.0%   0.0%  
Verdasco          9    3240  86.5%  71.1%  54.1%  30.3%  14.2%   5.7%   1.8%  
                                                                              
Almagro          14    2160  84.5%  68.0%  41.9%  15.4%   6.8%   2.2%   0.5%  
Robert            Q     460  15.5%   6.6%   1.8%   0.2%   0.0%   0.0%   0.0%  
Andreev                 622  52.1%  13.7%   4.5%   0.8%   0.1%   0.0%   0.0%  
Volandri                574  47.9%  11.8%   3.6%   0.5%   0.1%   0.0%   0.0%  
Cipolla           Q     190  32.6%   3.4%   0.4%   0.0%   0.0%   0.0%   0.0%  
Paire             W     366  67.4%  12.5%   2.5%   0.3%   0.0%   0.0%   0.0%  
Luczak            W     400  14.7%   8.6%   1.9%   0.2%   0.0%   0.0%   0.0%  
Ljubicic         17    1965  85.3%  75.5%  43.4%  15.1%   6.2%   1.8%   0.4%  
                                                                              
Troicki          29    1385  86.2%  64.4%  16.2%   7.2%   2.4%   0.5%   0.1%  
Tursunov                263  13.8%   4.5%   0.3%   0.0%   0.0%   0.0%   0.0%  
Dabul                   584  58.6%  19.8%   2.7%   0.7%   0.1%   0.0%   0.0%  
Mahut             Q     424  41.4%  11.3%   1.1%   0.2%   0.0%   0.0%   0.0%  
Karlovic                670  52.8%   6.2%   2.5%   0.7%   0.1%   0.0%   0.0%  
Dodig                   606  47.2%   5.0%   2.0%   0.5%   0.1%   0.0%   0.0%  
Granollers              993  11.6%   7.2%   3.6%   1.4%   0.4%   0.1%   0.0%  
Djokovic          3    6240  88.4%  81.6%  71.5%  56.9%  40.2%  21.9%  10.2%  
                                                                              
Roddick           8    3565  88.5%  78.1%  61.4%  42.2%  16.8%   8.1%   2.7%  
Hajek                   560  11.5%   5.8%   2.0%   0.4%   0.0%   0.0%   0.0%  
Przysiezny              590  51.7%   8.5%   2.9%   0.7%   0.1%   0.0%   0.0%  
Kunitsyn                551  48.3%   7.6%   2.5%   0.7%   0.1%   0.0%   0.0%  
Berlocq                 725  47.1%  16.8%   4.0%   1.2%   0.2%   0.0%   0.0%  
Haase                   803  52.9%  20.0%   5.2%   1.7%   0.3%   0.0%   0.0%  
Benneteau               965  38.5%  21.8%   6.3%   2.3%   0.4%   0.1%   0.0%  
Monaco           26    1480  61.5%  41.5%  15.7%   7.2%   1.7%   0.5%   0.1%  
                                                                              
Wawrinka         19    1855  76.7%  52.3%  28.1%  12.8%   3.5%   1.2%   0.2%  
Gabashvili              626  23.3%   9.4%   2.6%   0.6%   0.1%   0.0%   0.0%  
Dimitrov          Q     518  29.5%   7.5%   1.8%   0.4%   0.0%   0.0%   0.0%  
Golubev                1135  70.5%  30.8%  12.7%   4.2%   0.8%   0.2%   0.0%  
Gil                     551  40.2%   8.3%   2.4%   0.5%   0.0%   0.0%   0.0%  
Cuevas                  790  59.8%  16.4%   6.1%   1.6%   0.2%   0.0%   0.0%  
De Bakker               950  25.2%  14.9%   6.2%   1.9%   0.3%   0.1%   0.0%  
Monfils          12    2560  74.8%  60.4%  40.2%  21.5%   7.2%   2.9%   0.7%  
                                                                              
Fish             16    1996  70.1%  52.0%  32.0%   8.2%   3.9%   1.3%   0.3%  
Hanescu                 915  29.9%  16.4%   6.8%   1.0%   0.3%   0.0%   0.0%  
Robredo                 915  65.2%  23.4%   9.9%   1.5%   0.4%   0.1%   0.0%  
Devvarman               514  34.8%   8.2%   2.4%   0.2%   0.0%   0.0%   0.0%  
Stakhovsky              925  64.4%  24.8%  10.2%   1.6%   0.4%   0.1%   0.0%  
Brands                  541  35.6%   9.3%   2.6%   0.3%   0.1%   0.0%   0.0%  
Kubot                   670  24.5%  11.4%   3.9%   0.5%   0.1%   0.0%   0.0%  
Querrey          18    1860  75.5%  54.5%  32.1%   7.8%   3.4%   1.1%   0.2%  
                                                                              
Montanes         25    1495  74.3%  48.4%   8.8%   4.5%   1.7%   0.5%   0.1%  
Brown                   573  25.7%  10.3%   0.9%   0.2%   0.0%   0.0%   0.0%  
Andujar                 683  40.9%  14.9%   1.4%   0.4%   0.1%   0.0%   0.0%  
Malisse                 956  59.1%  26.4%   3.4%   1.4%   0.4%   0.1%   0.0%  
Lu                     1141  53.7%   6.2%   3.2%   1.4%   0.5%   0.1%   0.0%  
Simon                  1005  46.3%   4.8%   2.3%   0.9%   0.3%   0.1%   0.0%  
Lacko                   553   4.3%   1.4%   0.4%   0.1%   0.0%   0.0%   0.0%  
Federer           2    9245  95.7%  87.6%  79.6%  70.0%  56.6%  40.3%  22.4% 

Python Code for Marcel Projections

Posted in baseball analysis, programming by Jeff on January 14, 2011

A while back, I posted retro-Marcel projections for over 100 seasons.  They were generated with some Python code, and now you can play with it.

You’ll also need some Baseball-Databank files.  (Well, you don’t need them, but they will make the process much easier.)

The ‘import’ lines refer to a few utilities that I’ve written.  Those are also available on GitHub.  At some point, I’ll write up a summary of some of my Python utilities.  I’m sure that none of them are original (for instance, turning a 2-d matrix into a .csv, or vice versa), but I use them all the time, and they might come in handy for you, too.


Python Code for Tennis Markov

Posted in programming, tennis by Jeff on January 13, 2011

I’ve published my code for the tennis Markov project.  You can find it here:

  • Single game outcome. Takes the server’s probability of winning a single point and the current score; returns the server’s chance of winning the game.
  • Tiebreak outcome. Takes the server’s probability of winning a single service point, the probability of winning a single return point, and the current score; returns the server’s chance of winning the tiebreak.
  • Single set outcome. Takes the server’s probability of winning a single service point, the probability of winning a single return point, and the current game score; returns the server’s chance of winning the set. (Assumes a standard tiebreak set.)
  • Match outcome. Takes the server’s probability of winning a single service point, the probability of winning a single return point, the current score in points, games, and sets, and the number of sets; returns the server’s chance of winning the match.

The logic in the tiebreak problem is knotty, and the code reflects that.  I’m sure there’s a better way of doing it, but I didn’t feel like working it out once I got to the answer.

In the other functions, the code is pretty clean, and I’ve commented it more than I otherwise would.  The math gets a little hairy, though.
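To give a flavor of the simplest case in the list above, here is a minimal sketch of a single-game calculation.  It is not the published code: the function name, the argument names, and the encoding of the score as points won by each player (0, 1, 2, 3 for love, 15, 30, 40) are assumptions made for illustration.  It also assumes the server wins every point with the same fixed probability.

```python
def game_win_prob(p, server_pts=0, returner_pts=0):
    """Chance that the server wins the game from the given score.

    p            -- server's probability of winning any single point
    server_pts   -- points the server has won so far in this game
    returner_pts -- points the returner has won so far in this game
    (Names and score encoding are illustrative assumptions.)
    """
    q = 1.0 - p
    s, r = server_pts, returner_pts

    # Game already decided: four or more points with a two-point margin.
    if s >= 4 and s - r >= 2:
        return 1.0
    if r >= 4 and r - s >= 2:
        return 0.0

    # From deuce, the server must win two points in a row before the
    # returner does, which sums to p^2 / (p^2 + q^2).
    deuce = p * p / (p * p + q * q)

    if s >= 3 and r >= 3:
        if s == r:
            return deuce
        if s > r:
            # Advantage server: win the point, or fall back to deuce.
            return p + q * deuce
        # Advantage returner: the server must first get back to deuce.
        return p * deuce

    # Otherwise, play the next point and recurse on both outcomes.
    return (p * game_win_prob(p, s + 1, r)
            + q * game_win_prob(p, s, r + 1))


if __name__ == "__main__":
    for p in (0.5, 0.6, 0.7):
        print(p, round(game_win_prob(p), 4))
```

Calling `game_win_prob(0.6, 3, 3)` gives the chance of holding from deuce, and `game_win_prob(0.6, 0, 3)` the chance of holding from love-40.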

Isner Loses on Points, Wins Match…Again

Posted in tennis by Jeff on January 12, 2011

Last month, I noticed that John Isner wins a whole lot of matches despite losing more than half of the total points.  Such matches are not terribly rare, but Isner is in a class by himself: it happened eight times last season alone.

Sure enough, he’s picking up right where he left off.  In his first match of the year, he beat Robin Haase 3-6, 7-6(4), 7-5, despite losing 110 of 211 total points.

While Isner managed to break serve once (in his one attempt), he was amazingly inept against Haase’s serve, winning only 24 percent of points on return.  I haven’t made a thorough survey, but that’s one of the lowest return success rates I’ve ever seen.

Between this and the staggering numbers of tiebreaks played by the likes of Isner and Ivo Karlovic, it really seems like the tallest guys are playing a different sport.


Minor League Splits Databases Available

Posted in baseball analysis by Jeff on January 11, 2011

It’s been a good run, but Minor League Splits will no longer exist in its original form.

I took down the site a few months ago, and have decided to make all of the underlying databases freely available.  This includes full play-by-play of all affiliated minor leagues in the U.S. from 2005 to 2010.

I haven’t decided whether I will update this with 2011 data at the end of next season.  At the very least, I won’t be doing any in-season updates.

At some point in the future, I may open source the code I use to collect and analyze the play-by-play.  That’s a ways off, though.  This was my first major programming project five years ago, so the original code isn’t very good, and as MLBAM has changed things, I’ve added one ugly hack on top of another.  As is, the code is probably not usable for anyone aside from me.

Click here to see what’s available.

Ivo Karlovic and the Inevitable Tiebreak

Posted in tennis by Jeff on January 10, 2011

Ivo Karlovic is back.  He missed most of last season due to a foot injury, but he’s healed, and playing just like he always has.  In Doha last week, he reached the quarterfinals, beating Philipp Kohlschreiber in the round of 16.

What should come as a surprise to no one is that, before reaching the quarters, he played a total of five sets, every one of which went to a tiebreak.  For most opponents, Karlovic is impossible to break, and since his game is so service-centered, he doesn’t break serve much himself.

That’s the anecdotal story, and it’s intuitively sound.  Does the data back it up?

To find out, I used a data set of all ATP-level matches from 2001 to 2010 and counted, among other things, how many sets ended in a tiebreak.

In that span, about 17 percent of sets ended in tiebreaks.  Indeed, Karlovic has played tiebreaks at a higher rate than anyone else.  And it isn’t even close.

After eliminating everyone who played fewer than 200 sets in the last decade, we’re left with 205 players.  Of those, 33 guys reached a tiebreak in at least 20 percent of sets.  Only 7 played tiebreaks in more than a quarter of sets.  Karlovic reached a tiebreak in 40 percent of sets, more than anyone else in this time period.

Rounding out the top of the list are Chris Guccione at 35% (in only 203 sets), John Isner and Wayne Arthurs at 33%, and Alexander Waske at 31%.

The highest tiebreak rate compatible with top-10 level success seems to be about 24 percent.  That’s where Ivan Ljubicic has been over the last decade, while Pete Sampras (albeit in only 252 sets) and Andy Roddick are at 23 percent.  To find more elite-level players, we must go down to 21 percent, the level Marat Safin and Jo-Wilfried Tsonga have maintained.
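For anyone who wants to reproduce this kind of count, here is a rough sketch of the tally.  The record layout (one tuple per set, with the two players and their game totals) and the function name are assumptions for illustration, not the format of my data set.  A set is counted as a tiebreak set when it finished 7-6.

```python
from collections import defaultdict


def tiebreak_rates(set_records, min_sets=200):
    """Share of each player's sets that ended in a tiebreak.

    set_records -- iterable of (player_a, player_b, games_a, games_b),
                   one entry per completed set (hypothetical layout)
    min_sets    -- drop players with fewer sets than this
    """
    sets_played = defaultdict(int)
    tiebreaks = defaultdict(int)

    for player_a, player_b, games_a, games_b in set_records:
        # A 7-6 score, in either direction, means a tiebreak was played.
        went_to_tiebreak = {games_a, games_b} == {6, 7}
        for player in (player_a, player_b):
            sets_played[player] += 1
            tiebreaks[player] += int(went_to_tiebreak)

    return {
        player: tiebreaks[player] / count
        for player, count in sets_played.items()
        if count >= min_sets
    }


if __name__ == "__main__":
    # Made-up sets, only to show the call; not real match data.
    demo = [
        ("A", "B", 7, 6),
        ("A", "B", 6, 4),
        ("A", "C", 6, 7),
        ("B", "C", 6, 3),
    ]
    print(tiebreak_rates(demo, min_sets=3))
```

Sorting the resulting dictionary by value, highest first, gives a leaderboard like the one described above.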

But did they win?

Tennis fans often view success in tiebreaks as a proxy for clutch, and perhaps they are right to do so.  If two players reach a tiebreak, they are fairly evenly matched, and the tiebreak itself doesn’t necessarily give an edge to either player.

This may explain why few top players end up in a high percentage of tiebreaks.  Only a handful of players are able to sustain tiebreak winning percentages above 60 percent, but to have a very successful season, you need to win more than 60 percent of sets.

What came as a surprise to me is that the players who reach the most tiebreaks are not necessarily that successful in the tiebreaks.  In fact, there is no meaningful correlation between the two rates.

Karlovic, for instance, won only 49 percent of tiebreaks in the last decade.  Ljubicic won only 52 percent, and Safin only 50 percent.  Yet there are plenty of standouts at the high end of the spectrum: Isner has won 63 percent of tiebreaks, while Roddick has won 64 percent and Tsonga has won 61 percent.

The same variety is on display among those players who contest tiebreaks at the lowest rates.  David Ferrer ends up in a tiebreak only 11 percent of the time, and wins only 47 percent of them, but Nicolas Kiefer played tiebreaks in 12 percent of his sets and won 58 percent.
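As a quick illustration of what "no meaningful correlation" looks like, the sketch below computes a Pearson correlation for just the eight players quoted above with both numbers.  This is a tiny hand-picked sample, not the full 205-player set, so it only shows the calculation; the variable and function names are my own for this example.

```python
from math import sqrt

# (tiebreak rate, tiebreak win rate) in percent, as quoted in this post.
SAMPLE = {
    "Karlovic": (40, 49),
    "Isner": (33, 63),
    "Ljubicic": (24, 52),
    "Roddick": (23, 64),
    "Safin": (21, 50),
    "Tsonga": (21, 61),
    "Kiefer": (12, 58),
    "Ferrer": (11, 47),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


tiebreak_rate = [pair[0] for pair in SAMPLE.values()]
tiebreak_win_rate = [pair[1] for pair in SAMPLE.values()]
r = pearson_r(tiebreak_rate, tiebreak_win_rate)

if __name__ == "__main__":
    print(f"r = {r:.2f}")
```

Even in this small sample the coefficient comes out close to zero, in line with the result for the full data set.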

Perhaps this is all reassuring.  I suspect I’m not the only tennis fan to be annoyed watching Karlovic or Isner cruise to a tiebreak with what appears to be minimal effort.  At least, for many such players, winning the set is not such an automatic result.