The Summer of Jeff

Toward tennis analytics

Posted in tennis by Jeff on March 28, 2009

I spend a lot of my time crunching baseball stats, and I spend a lot of the rest of my time playing and watching tennis.  It stands to reason that eventually I’d try to mash the two together.  (Would Federer’s OBP be over .400?  I think so.)

Variants of sabermetrics have mostly spread to other team sports, such as football, basketball, and hockey.  This is no surprise, since a very important aspect of this type of analysis is teasing out and valuing the individual contributions that led to results at the team level.  In tennis, there’s no such thing.  If Murray beats Federer, we can attribute the full responsibility for the victory to Murray and the full responsibility for the loss to Federer.

So on a match-by-match basis, sabermetrics-style analysis doesn’t seem to have a lot to offer tennis.  Over the longer term, though, I think there are a lot of ways we could gain knowledge by applying the same techniques to tennis.

The most obvious arena to me is that of rankings.  You don’t hear very many complaints about the structure of the tennis ranking system.  This is partly because it has been more or less the same forever, and probably in larger part because it only matters so much.  While serious flaws in the rankings would lead to a harder or easier draw for some players, the ultimate results would be the same.  The best players should win the most tournaments regardless of whether they are seeded 1st, 7th, or 27th.

The stakes aren’t as high, but I’d be surprised if tennis rankings are any better than, say, the NCAA football BCS rankings.  They are doubtless flawed for some different reasons (and I don’t know a lot about the details of the BCS rankings), but they are similar in that both could be improved by available mathematical techniques.

Here are three easily remedied problems with the existing tennis ranking system:

  1. Some tournaments count for more points than others.  This is more glaring in the ATP rankings with the identification of Masters-level tournaments.  I suppose a defense can be made that, at least for men, grand slams should count for more points (men play best-of-five sets instead of best-of-three), but apart from that, the different points totals seem to be little more than a bid for more public attention.
  2. Points are based on round and tournament, not on opponent.  If you win in the first round of the US Open, you get the same number of points whether you beat Roger Federer or Frank Dancevic.
  3. Rankings are based on a precise 12-month period.  While there is something of a tennis “season,” it doesn’t make much sense to give players credit for something that happened 11 months ago, but not for something that happened 13 months ago.  A weighted system (perhaps: multiply results from the last six months by 3, results from the previous six months by 2, and results from the 6 or 12 months before that by 1) would give a more accurate picture of a player’s performance.
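
To make the weighting in (3) concrete, here is a minimal Python sketch.  The function names, the day cutoffs (183, 365, and 730 days), and the choice of a 12-month rather than 6-month final window are all assumptions for illustration, as is the idea that each result arrives as a (date, points) pair.

```python
from datetime import date


def recency_weight(match_date, as_of):
    """Return 3, 2, 1, or 0 depending on how old a result is.

    Assumed cutoffs: under ~6 months -> 3, under 12 months -> 2,
    under 24 months -> 1, anything older (or in the future) -> 0.
    """
    age_days = (as_of - match_date).days
    if age_days < 0:
        return 0
    if age_days < 183:
        return 3
    if age_days < 365:
        return 2
    if age_days < 730:
        return 1
    return 0


def weighted_points(results, as_of):
    """Sum ranking points, each scaled by its recency weight.

    `results` is a list of (match_date, points) pairs (hypothetical format).
    """
    return sum(points * recency_weight(d, as_of) for d, points in results)
```

With this in place, a result from 11 months ago and one from 13 months ago differ by a factor of two instead of all-or-nothing.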

Fixing (2) would make it possible to turn (1) into a non-issue.  If a system took into account opponent quality, there would be no reason to discriminate between tournaments.  Even challenger-level tournaments could be treated the same way.

(A fourth possible problem is that match results are considered binary.  A loss in a third-set tiebreak is the same as a 6-0 6-0 result.  It’s not a foregone conclusion that this is a problem.  It may be that discriminating between lopsided and close wins has no predictive value for future results.  Intuitively, it seems wrong to penalize a player like David Nalbandian for frequently playing five-set matches if he ultimately wins most of them.  This is certainly something worth testing, however.)

If one had a database of the last year or two’s worth of matches, it would be reasonably easy to create rankings similar to the strength-of-schedule-based rankings used in college sports (only better).  I would imagine that the top of those rankings would look very much like the top of the ATP rankings, but some surprises would crop up beyond the top 5-10.
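
As a rough illustration of what such a ranking could look like, here is a sketch in Python modeled loosely on the RPI formula used in college basketball (25% own winning percentage, 50% opponents’ winning percentage, 25% opponents’ opponents’ winning percentage).  It is a simplification, not the official formula: it doesn’t exclude a player’s own matches when computing opponents’ records, and the function name and the (winner, loser) input format are assumptions.

```python
from collections import defaultdict


def sos_ratings(matches):
    """Rate players from a list of (winner, loser) pairs.

    Simplified RPI-style blend of a player's winning percentage,
    his opponents' winning percentage, and their opponents' winning
    percentage. Tournament and round are deliberately ignored.
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in matches:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    def mean(values):
        return sum(values) / len(values)

    # Each player's own winning percentage.
    win_pct = {p: wins[p] / played[p] for p in played}
    # Average winning percentage of the opponents each player faced.
    opp_pct = {p: mean([win_pct[o] for o in opponents[p]]) for p in played}
    # One level further out: how strong were those opponents' schedules?
    opp_opp_pct = {p: mean([opp_pct[o] for o in opponents[p]]) for p in played}

    return {
        p: 0.25 * win_pct[p] + 0.5 * opp_pct[p] + 0.25 * opp_opp_pct[p]
        for p in played
    }
```

Two undefeated players no longer come out equal under this scheme: the one who beat opponents with better records gets the higher rating, regardless of which tournament the matches were played in.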

Of course, since no one is paying tennis players based on perceived worth (as baseball teams generally pay players), better rankings wouldn’t have much of an effect on most players.  The only people I can imagine seeing much value in the project are people like me who geek out on this stuff, and those who bet on tennis.

The possibility of building a system to bet on tennis raises all sorts of additional variables.  While SOS-based rankings would probably do a better job of predicting match results than the present rankings, they would have obvious flaws.  I’ve said nothing so far about surface (the probability that Nadal beats Federer is very different on clay than on grass), and I haven’t included the possibility that some players match up better against certain opponents than against others.

The surface issue seems fairly easy to fix, simply by weighting matches on that surface more heavily than matches on other surfaces.  Player matchups would be more difficult.  Perhaps some version of similarity scores could be developed, and then previous matches could be weighted based on similarity score.  For instance, if Djokovic is playing Isner, Djokovic’s previous matches against big servers would be weighted more heavily than other matches.
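
The surface part of this could be sketched as follows in Python.  The function name, the (surface, won) input format, and the default weight of 2.0 for same-surface matches are all assumptions; the right weight is something you would have to fit against real results.

```python
def surface_weighted_win_rate(results, surface, same_surface_weight=2.0):
    """Win rate for one player, counting matches on `surface` more heavily.

    `results` is a list of (match_surface, won) pairs (hypothetical format).
    Returns None if the player has no matches at all.
    """
    total_weight = 0.0
    won_weight = 0.0
    for match_surface, won in results:
        # Matches on the target surface get the larger weight.
        weight = same_surface_weight if match_surface == surface else 1.0
        total_weight += weight
        if won:
            won_weight += weight
    if total_weight == 0:
        return None
    return won_weight / total_weight
```

The same shape would work for matchups: replace the surface comparison with a similarity score between the upcoming opponent and each past opponent, and use that score as the weight.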

Because of the obvious value of this sort of thing to gamblers, I’d be shocked if it hadn’t been done before.  (Publicly, I’m not so sure.)  I don’t know how easy it would be to build and maintain a sufficient database of matches to rank players using a more sophisticated method.  I do know I’d much prefer rankings based on these improvements to what we currently have to work with.
