Python code for WPA stats
A long time ago I put together a Python version of the win expectancy/volatility calculations contained in Studes’s WPA spreadsheet. Those were the days: if we wanted a post-game WPA graph, we had to do it ourselves.
I’ve brushed off the cobwebs and published the code. Click here to see it.
All this does is calculate the win expectancy and volatility (~leverage) in any situation. It doesn’t calculate WPA on the play. Of course, if you’re running this on a play-by-play log, it’s trivial to get WPA by comparing the win expectancy (WX) before and after each play.
‘Volatility’ is the difference between the win expectancies that would result from a home run and from a strikeout. To normalize it so that the average volatility is 1.0, I have this code divide the result by 0.133. Depending on your dataset, that might not be quite right. There are more sophisticated ways to measure leverage, though this one is adequate for many purposes.
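The two calculations described above can be sketched in a few lines. This is an illustration rather than the published code: the function names are hypothetical, the win expectancy values are assumed to come from whatever lookup you already have, and 0.133 is simply the normalizing constant mentioned above.

```python
# Normalizing constant from the post; it may not be right for every dataset.
AVG_SWING = 0.133


def volatility(wx_after_hr, wx_after_k, avg_swing=AVG_SWING):
    """Gap between the win expectancy after a home run and after a strikeout,
    scaled so that an average situation comes out near 1.0.

    abs() is used so the result is the same whether the win expectancies are
    expressed for the batting team or for the home team.
    """
    return abs(wx_after_hr - wx_after_k) / avg_swing


def wpa_by_play(wx_sequence):
    """Given the win expectancy before the first play and after each play,
    return the change in win expectancy (WPA) credited to each play."""
    return [round(after - before, 4)
            for before, after in zip(wx_sequence, wx_sequence[1:])]


if __name__ == "__main__":
    print(volatility(0.633, 0.5))
    print(wpa_by_play([0.5, 0.55, 0.48]))
```

A sequence of N+1 win expectancy values yields N WPA values, one per play.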
Thank you Studes, Tango, and others for publishing all that you have. As is so often the case, I’m just the code monkey.
2011 Aussie Open Simulation Results
Using my simple ranking-points-based algorithm to determine the odds that each player wins a match, I ran simulations using the 2011 Australian Open draw.
As usual, the keyword is “simple,” and you can easily find all sorts of intuitive reasons to discount the results. There’s no consideration for surface, so clay-court specialists are generally overrated. Players returning from injury (Del Potro, especially, and Karlovic) have taken a hit in the rankings, and are thus underrated here as well.
I’m also publishing the code that I use to generate these sims. It should work for any single-elimination tournament up to 128 competitors, and is easily expandable to handle larger brackets. The function ‘calcWP’ is specific to my tennis algorithm, but you could swap in something like log5 very easily. I also included the .csv file I used for the draw, so you can see the format, or tinker with the parameters and come up with your own Aussie sim.
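For readers who want the general shape of such a simulation, here is a minimal sketch. It is not the published code: it uses log5 in place of ‘calcWP’, and the function names, the ratings dictionary, and the assumption that the draw is listed in bracket order with a power-of-two size are all illustrative.

```python
import random


def log5(a, b):
    """Log5 estimate of the chance that a competitor rated a beats one rated b.
    Ratings are assumed to lie strictly between 0 and 1."""
    return a * (1 - b) / (a * (1 - b) + b * (1 - a))


def simulate_bracket(draw, ratings, win_prob=log5, rng=random):
    """Play out one single-elimination bracket.

    draw lists the players in bracket order (its length is a power of two).
    Returns a list of rounds; each round is the list of players who won it.
    """
    rounds = []
    alive = list(draw)
    while len(alive) > 1:
        winners = []
        for i in range(0, len(alive), 2):
            a, b = alive[i], alive[i + 1]
            p = win_prob(ratings[a], ratings[b])
            winners.append(a if rng.random() < p else b)
        rounds.append(winners)
        alive = winners
    return rounds


def run_sims(draw, ratings, n=10000, seed=None):
    """Return, for each player, the fraction of simulations in which that
    player won round 1, round 2, ..., and the final."""
    rng = random.Random(seed)
    n_rounds = len(draw).bit_length() - 1
    wins = {player: [0] * n_rounds for player in draw}
    for _ in range(n):
        for r, winners in enumerate(simulate_bracket(draw, ratings, rng=rng)):
            for player in winners:
                wins[player][r] += 1
    return {player: [c / n for c in counts] for player, counts in wins.items()}


if __name__ == "__main__":
    draw = ["A", "B", "C", "D"]
    ratings = {"A": 0.9, "B": 0.5, "C": 0.5, "D": 0.5}
    for player, probs in run_sims(draw, ratings, n=5000, seed=1).items():
        print(player, probs)
```

Swapping in a different model only requires passing another two-argument function as win_prob.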
Your 2011 Australian Open…
Player (seed, Q = qualifier, W = wild card) Points R64 R32 R16 QF SF F W
Nadal 1 12390 96.9% 92.7% 87.0% 78.1% 66.1% 49.6% 34.5%
Daniel 564 3.1% 1.4% 0.5% 0.1% 0.0% 0.0% 0.0%
Sweeting Q 486 35.3% 1.6% 0.5% 0.1% 0.0% 0.0% 0.0%
Gimeno-Traver 844 64.7% 4.3% 1.9% 0.7% 0.2% 0.0% 0.0%
Tomic W 239 17.9% 3.1% 0.1% 0.0% 0.0% 0.0% 0.0%
Chardy 960 82.1% 39.6% 3.7% 1.4% 0.4% 0.1% 0.0%
Falla 540 27.3% 11.3% 0.7% 0.2% 0.0% 0.0% 0.0%
Lopez F 31 1310 72.7% 46.0% 5.6% 2.6% 0.9% 0.2% 0.0%
Isner 20 1850 74.0% 56.8% 31.7% 5.8% 2.5% 0.8% 0.2%
Serra 711 26.0% 14.0% 4.6% 0.5% 0.1% 0.0% 0.0%
Stepanek 735 62.1% 20.4% 6.9% 0.6% 0.2% 0.0% 0.0%
Gremelmayr Q 469 37.9% 8.8% 2.2% 0.1% 0.0% 0.0% 0.0%
Machado 573 41.2% 10.2% 3.3% 0.2% 0.0% 0.0% 0.0%
Giraldo 785 58.8% 18.3% 7.2% 0.7% 0.2% 0.0% 0.0%
Young D Q 435 14.6% 5.4% 1.4% 0.1% 0.0% 0.0% 0.0%
Cilic 15 2140 85.4% 66.1% 42.8% 8.7% 4.1% 1.4% 0.4%
Youzhny 10 2920 85.6% 70.1% 51.9% 29.2% 8.1% 3.3% 1.1%
Ilhan 574 14.4% 6.2% 2.2% 0.5% 0.0% 0.0% 0.0%
Kavcic Q 552 38.0% 7.1% 2.4% 0.5% 0.0% 0.0% 0.0%
Anderson K 868 62.0% 16.6% 7.4% 2.1% 0.2% 0.0% 0.0%
Raonic Q 351 36.4% 6.8% 1.0% 0.2% 0.0% 0.0% 0.0%
Phau 581 63.6% 18.0% 4.1% 0.9% 0.1% 0.0% 0.0%
Chela 1070 39.3% 27.8% 9.9% 3.2% 0.4% 0.1% 0.0%
Llodra 22 1575 60.7% 47.4% 21.0% 8.8% 1.6% 0.4% 0.1%
Nalbandian 27 1480 64.2% 49.1% 18.4% 8.2% 1.4% 0.4% 0.1%
Hewitt 870 35.8% 23.1% 6.1% 2.0% 0.2% 0.0% 0.0%
Berankis 589 61.1% 19.1% 3.9% 1.0% 0.1% 0.0% 0.0%
Matosevic W 392 38.9% 8.8% 1.4% 0.2% 0.0% 0.0% 0.0%
Russell 547 67.0% 10.1% 3.6% 0.8% 0.1% 0.0% 0.0%
Ebden W 288 33.0% 2.7% 0.6% 0.1% 0.0% 0.0% 0.0%
Nieminen 1062 20.2% 14.5% 7.7% 2.8% 0.4% 0.1% 0.0%
Ferrer 7 3735 79.8% 72.7% 58.4% 39.4% 12.6% 5.8% 2.4%
Soderling 4 5785 87.9% 83.6% 71.9% 58.3% 35.9% 15.6% 7.9%
Starace 945 12.1% 8.9% 4.2% 1.7% 0.3% 0.0% 0.0%
Muller Q 466 76.9% 6.9% 2.0% 0.5% 0.1% 0.0% 0.0%
Stadler Q 155 23.1% 0.7% 0.1% 0.0% 0.0% 0.0% 0.0%
Istomin 1031 86.2% 41.8% 8.8% 3.6% 0.8% 0.1% 0.0%
Hernych Q 196 13.8% 1.9% 0.1% 0.0% 0.0% 0.0% 0.0%
Mello 627 30.0% 12.8% 1.9% 0.6% 0.1% 0.0% 0.0%
Bellucci 30 1355 70.0% 43.5% 11.0% 5.3% 1.4% 0.2% 0.1%
Gulbis 24 1505 64.3% 41.5% 20.7% 6.3% 1.9% 0.4% 0.1%
Becker 870 35.7% 17.9% 6.4% 1.3% 0.2% 0.0% 0.0%
Dolgopolov 928 53.6% 22.8% 8.6% 1.8% 0.4% 0.0% 0.0%
Kukushkin 815 46.4% 17.9% 6.3% 1.2% 0.2% 0.0% 0.0%
Seppi 900 59.6% 19.2% 8.7% 1.9% 0.4% 0.0% 0.0%
Clement 627 40.4% 9.9% 3.5% 0.6% 0.1% 0.0% 0.0%
Petzschner 839 24.3% 12.6% 5.5% 1.1% 0.2% 0.0% 0.0%
Tsonga 13 2345 75.7% 58.2% 40.4% 15.8% 6.3% 1.6% 0.5%
Melzer 11 2785 91.2% 77.7% 54.3% 22.9% 10.4% 3.0% 1.0%
Millot Q 334 8.8% 3.3% 0.7% 0.1% 0.0% 0.0% 0.0%
Ball W 344 32.5% 4.1% 0.9% 0.1% 0.0% 0.0% 0.0%
Riba 672 67.5% 14.8% 5.5% 1.0% 0.2% 0.0% 0.0%
Sela 568 77.8% 21.8% 5.0% 0.7% 0.1% 0.0% 0.0%
Del Potro 180 22.2% 2.4% 0.2% 0.0% 0.0% 0.0% 0.0%
Zemlja Q 376 15.1% 6.9% 1.2% 0.1% 0.0% 0.0% 0.0%
Baghdatis 21 1785 84.9% 68.9% 32.2% 10.7% 3.8% 0.8% 0.2%
Garcia-Lopez 32 1300 62.1% 44.0% 10.6% 4.2% 1.2% 0.2% 0.0%
Berrer 835 37.9% 22.8% 3.9% 1.1% 0.2% 0.0% 0.0%
Schwank 580 50.6% 16.9% 2.3% 0.5% 0.1% 0.0% 0.0%
Mayer L 572 49.4% 16.3% 2.1% 0.4% 0.1% 0.0% 0.0%
Marchenko 624 49.3% 5.5% 2.3% 0.6% 0.1% 0.0% 0.0%
Ramirez Hidalgo 638 50.7% 5.7% 2.4% 0.6% 0.1% 0.0% 0.0%
Beck K 543 7.0% 3.2% 1.2% 0.3% 0.0% 0.0% 0.0%
Murray 5 5760 93.0% 85.5% 75.3% 56.7% 35.5% 15.6% 7.9%
Berdych 6 3955 96.4% 78.5% 63.1% 42.3% 22.0% 9.6% 3.4%
Crugnola Q 194 3.6% 0.5% 0.1% 0.0% 0.0% 0.0% 0.0%
Kohlschreiber 1215 63.8% 15.2% 8.3% 3.1% 0.9% 0.2% 0.0%
Kamke 724 36.2% 5.8% 2.4% 0.7% 0.1% 0.0% 0.0%
Harrison W 313 32.3% 6.7% 0.6% 0.1% 0.0% 0.0% 0.0%
Mannarino 612 67.7% 22.8% 3.9% 0.9% 0.1% 0.0% 0.0%
Dancevic Q 172 9.0% 2.2% 0.1% 0.0% 0.0% 0.0% 0.0%
Gasquet 28 1385 91.0% 68.3% 21.5% 8.8% 2.5% 0.6% 0.1%
Davydenko 23 1555 60.0% 41.5% 17.1% 6.5% 2.0% 0.5% 0.1%
Mayer F 1073 40.0% 23.9% 8.0% 2.3% 0.6% 0.1% 0.0%
Fognini 855 59.6% 22.7% 6.5% 1.7% 0.3% 0.0% 0.0%
Nishikori 599 40.4% 12.0% 2.7% 0.5% 0.1% 0.0% 0.0%
Zverev 611 38.3% 7.2% 2.4% 0.5% 0.1% 0.0% 0.0%
Tipsarevic 935 61.7% 16.0% 7.2% 2.0% 0.4% 0.1% 0.0%
Schuettler 597 13.5% 5.8% 1.9% 0.4% 0.1% 0.0% 0.0%
Verdasco 9 3240 86.5% 71.1% 54.1% 30.3% 14.2% 5.7% 1.8%
Almagro 14 2160 84.5% 68.0% 41.9% 15.4% 6.8% 2.2% 0.5%
Robert Q 460 15.5% 6.6% 1.8% 0.2% 0.0% 0.0% 0.0%
Andreev 622 52.1% 13.7% 4.5% 0.8% 0.1% 0.0% 0.0%
Volandri 574 47.9% 11.8% 3.6% 0.5% 0.1% 0.0% 0.0%
Cipolla Q 190 32.6% 3.4% 0.4% 0.0% 0.0% 0.0% 0.0%
Paire W 366 67.4% 12.5% 2.5% 0.3% 0.0% 0.0% 0.0%
Luczak W 400 14.7% 8.6% 1.9% 0.2% 0.0% 0.0% 0.0%
Ljubicic 17 1965 85.3% 75.5% 43.4% 15.1% 6.2% 1.8% 0.4%
Troicki 29 1385 86.2% 64.4% 16.2% 7.2% 2.4% 0.5% 0.1%
Tursunov 263 13.8% 4.5% 0.3% 0.0% 0.0% 0.0% 0.0%
Dabul 584 58.6% 19.8% 2.7% 0.7% 0.1% 0.0% 0.0%
Mahut Q 424 41.4% 11.3% 1.1% 0.2% 0.0% 0.0% 0.0%
Karlovic 670 52.8% 6.2% 2.5% 0.7% 0.1% 0.0% 0.0%
Dodig 606 47.2% 5.0% 2.0% 0.5% 0.1% 0.0% 0.0%
Granollers 993 11.6% 7.2% 3.6% 1.4% 0.4% 0.1% 0.0%
Djokovic 3 6240 88.4% 81.6% 71.5% 56.9% 40.2% 21.9% 10.2%
Roddick 8 3565 88.5% 78.1% 61.4% 42.2% 16.8% 8.1% 2.7%
Hajek 560 11.5% 5.8% 2.0% 0.4% 0.0% 0.0% 0.0%
Przysiezny 590 51.7% 8.5% 2.9% 0.7% 0.1% 0.0% 0.0%
Kunitsyn 551 48.3% 7.6% 2.5% 0.7% 0.1% 0.0% 0.0%
Berlocq 725 47.1% 16.8% 4.0% 1.2% 0.2% 0.0% 0.0%
Haase 803 52.9% 20.0% 5.2% 1.7% 0.3% 0.0% 0.0%
Benneteau 965 38.5% 21.8% 6.3% 2.3% 0.4% 0.1% 0.0%
Monaco 26 1480 61.5% 41.5% 15.7% 7.2% 1.7% 0.5% 0.1%
Wawrinka 19 1855 76.7% 52.3% 28.1% 12.8% 3.5% 1.2% 0.2%
Gabashvili 626 23.3% 9.4% 2.6% 0.6% 0.1% 0.0% 0.0%
Dimitrov Q 518 29.5% 7.5% 1.8% 0.4% 0.0% 0.0% 0.0%
Golubev 1135 70.5% 30.8% 12.7% 4.2% 0.8% 0.2% 0.0%
Gil 551 40.2% 8.3% 2.4% 0.5% 0.0% 0.0% 0.0%
Cuevas 790 59.8% 16.4% 6.1% 1.6% 0.2% 0.0% 0.0%
De Bakker 950 25.2% 14.9% 6.2% 1.9% 0.3% 0.1% 0.0%
Monfils 12 2560 74.8% 60.4% 40.2% 21.5% 7.2% 2.9% 0.7%
Fish 16 1996 70.1% 52.0% 32.0% 8.2% 3.9% 1.3% 0.3%
Hanescu 915 29.9% 16.4% 6.8% 1.0% 0.3% 0.0% 0.0%
Robredo 915 65.2% 23.4% 9.9% 1.5% 0.4% 0.1% 0.0%
Devvarman 514 34.8% 8.2% 2.4% 0.2% 0.0% 0.0% 0.0%
Stakhovsky 925 64.4% 24.8% 10.2% 1.6% 0.4% 0.1% 0.0%
Brands 541 35.6% 9.3% 2.6% 0.3% 0.1% 0.0% 0.0%
Kubot 670 24.5% 11.4% 3.9% 0.5% 0.1% 0.0% 0.0%
Querrey 18 1860 75.5% 54.5% 32.1% 7.8% 3.4% 1.1% 0.2%
Montanes 25 1495 74.3% 48.4% 8.8% 4.5% 1.7% 0.5% 0.1%
Brown 573 25.7% 10.3% 0.9% 0.2% 0.0% 0.0% 0.0%
Andujar 683 40.9% 14.9% 1.4% 0.4% 0.1% 0.0% 0.0%
Malisse 956 59.1% 26.4% 3.4% 1.4% 0.4% 0.1% 0.0%
Lu 1141 53.7% 6.2% 3.2% 1.4% 0.5% 0.1% 0.0%
Simon 1005 46.3% 4.8% 2.3% 0.9% 0.3% 0.1% 0.0%
Lacko 553 4.3% 1.4% 0.4% 0.1% 0.0% 0.0% 0.0%
Federer 2 9245 95.7% 87.6% 79.6% 70.0% 56.6% 40.3% 22.4%
Python Code for Marcel Projections
A while back, I posted retro-Marcel projections for over 100 seasons. They were generated with some Python code, and now you can play with it.
You’ll also need some Baseball-Databank files. (Well, you don’t need them, but they will make the process much easier.)
The ‘import’ lines refer to a few utilities that I’ve written. Those are also available on GitHub. At some point, I’ll write up a summary of some of my Python utilities. I’m sure that none of them are original (for instance, turning a 2-d matrix into a .csv, or vice versa), but I use them all the time, and they might come in handy for you, too.
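As an illustration of the kind of utility meant here, a matrix-to-CSV round trip can be written with the standard csv module. This is a sketch with hypothetical function names, not the utilities referred to above.

```python
import csv


def matrix_to_csv(rows, path):
    """Write a 2-d list (a list of rows) to a .csv file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def csv_to_matrix(path):
    """Read a .csv file back into a 2-d list. Every cell comes back as a string."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f)]
```

Note that the round trip does not preserve types: numbers written out come back as strings and need converting.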
Python Code for Tennis Markov
I’ve published my code for the tennis Markov project. You can find it here:
- Single game outcome. Takes the server’s probability of winning a single point and the current score; returns the server’s chance of winning the game.
- Tiebreak outcome. Takes the server’s probability of winning a single service point, the probability of winning a single return point, and the current score; returns the server’s chance of winning the tiebreak.
- Single set outcome. Takes the server’s probability of winning a single service point, the probability of winning a single return point, and the current game score; returns the server’s chance of winning the set. (Assumes a standard tiebreak set.)
- Match outcome. Takes the server’s probability of winning a single service point, the probability of winning a single return point, the current score in points, games, and sets, and the number of sets; returns the server’s chance of winning the match.
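To give a sense of the simplest of these, here is a minimal sketch of the single-game calculation. It is not the published code: the function name and the convention of passing the score as points won (0, 1, 2, 3 for love, 15, 30, 40) are assumptions, and every point is treated as independent with the same probability.

```python
from functools import lru_cache


def game_win_prob(p, server_pts=0, returner_pts=0):
    """Chance that the server wins the game, given probability p of winning
    any single point and the current score in points won."""
    q = 1 - p
    # From deuce, the server must win two points in a row before the returner
    # does; solving D = p*p + 2*p*q*D gives the closed form below.
    deuce = p * p / (p * p + q * q)

    @lru_cache(maxsize=None)
    def win(a, b):
        if a >= 4 and a - b >= 2:
            return 1.0                 # server has already won the game
        if b >= 4 and b - a >= 2:
            return 0.0                 # returner has already won the game
        if a >= 3 and b >= 3:
            if a == b:
                return deuce
            if a > b:
                return p + q * deuce   # advantage server
            return p * deuce           # advantage returner
        # Otherwise play one more point and recurse.
        return p * win(a + 1, b) + q * win(a, b + 1)

    return win(server_pts, returner_pts)


if __name__ == "__main__":
    print(round(game_win_prob(0.6), 4))        # from 0-0
    print(round(game_win_prob(0.6, 1, 3), 4))  # from 15-40
```

The deuce case is handled in closed form so the recursion always terminates; the tiebreak, set, and match calculations follow the same pattern with more state.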
The logic in the tiebreak problem is knotty, and the code reflects that. I’m sure there’s a better way of doing it, but I didn’t feel like working it out once I got to the answer.
In the other functions, the code is pretty clean, and I’ve commented it more than I otherwise would. The math gets a little hairy, though.
Roll-your-own blogging software
A few years ago, I moved my GMAT Hacks website off of WordPress. I wrote the code for a basic blogging platform using Python, and since then, I’ve built it out a little more. A content management system (CMS) does not have to be complicated. And as Blogger, WordPress and others have shown, the platform is generic; I’ve used almost exactly the same code to drive GMAT Hacks, GRE HQ, and the College Splits blog.
I’m not going to share any code, but I will walk through the process. It’s very intuitive in Python, and I’m sure it’s similarly straightforward in many other languages.
The various blogging platforms offer much of what you’ll ever need, and they are generally easy to use and modify. That’s why this very post is on a WordPress blog. But especially in the case of my GMAT site, I needed more flexibility to automatically update special types of pages and create customized sidebars and footers.
The basics
A do-it-yourself CMS can consist of as few as three files:
- A database of some sort that, for each post, stores title, body text, date, and other information, possibly including category, tag(s), and anything else you can dream up. I think this is simple enough not to require further explanation.
- A simple script to add items to the database and edit items already in the database.
- A script that uses the site template to generate pages for each post using the database.
Let’s look at the last two in a little more detail.
Add and edit items
This is also pretty simple. The one aspect worth mentioning is that it’s important to validate everything going in–if you’re ambitious, you may even try to validate the HTML in the posts themselves. I limit myself to checking that a new post’s category already exists and that the post’s date is valid. (On some of my sites, I use YYYYMMDDX as a post ID, where X is an index to differentiate multiple posts from the same day.)
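The two checks mentioned above might look something like the following. This is only a sketch: the function names and the dictionary layout of a post are hypothetical, and it assumes that X in YYYYMMDDX is a single digit.

```python
from datetime import datetime


def parse_post_id(post_id):
    """Split a YYYYMMDDX post ID into (date, index).

    Raises ValueError if the ID is malformed or the date does not exist.
    Assumes X is a single digit.
    """
    if len(post_id) != 9 or not post_id.isdigit():
        raise ValueError("post ID must be nine digits: %r" % post_id)
    date = datetime.strptime(post_id[:8], "%Y%m%d").date()
    return date, int(post_id[8])


def validate_post(post, known_categories):
    """Check that a new post's category already exists and its date is valid."""
    if post["category"] not in known_categories:
        raise ValueError("unknown category: %r" % post["category"])
    parse_post_id(post["id"])
```

Raising an exception, rather than returning False, keeps a bad entry from slipping into the database unnoticed.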
Generate the site
Depending on how thorough you want to be, this script can get fairly complex. (Mine is currently a bit longer than 400 lines of code.)
At its most basic, it’s just a matter of creating a page for each post and uploading each one. Here are a few more things it can do:
- Uploading some pages to multiple locations. For example, you might want your most recent post to be the front page on your site. So the page “category/recent-post.html” might also be uploaded as “index.html.”
- Creating tables of contents. On my GMAT site, I have a chronological TOC, a site-wide TOC with posts sorted by category, and an individual TOC page for each category. I also have a “recent posts” page with a chronological list of the last 10 posts. The script creates each one every time I update the site.
- Creating an XML feed. You might include the last 5 or 10 posts, and you have the flexibility to include all, some, or none of the body of the post.
- Updating pages outside of the blog hierarchy. The first page of my GMAT site does not contain a blog post, but the script creates it, so that it always links to the most recent post.
- Varying sidebar and footer content. My footers are generally predictable–they link to the previous post, as well as a category-specific table of contents. But I also include an ad for one of my books. (For some posts, I randomly rotate the ads with each site update.) With full control over the script, I can put an ad for my math book on math-related pages and my verbal book on verbal-related pages. I also have a few different sidebars for different purposes. In a few cases, I even drop the footer content altogether.
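As one example from the list above, the feed can be built with the standard library. The sketch below assumes an RSS 2.0 layout and a hypothetical post structure (dictionaries with title, date as YYYYMMDD, path, and body); none of that is taken from the actual script.

```python
import xml.etree.ElementTree as ET


def build_feed(posts, site_title, site_url, limit=10, include_body=True):
    """Return an RSS 2.0 document (as a string) for the most recent posts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = site_title

    # YYYYMMDD strings sort chronologically, so a plain sort is enough.
    recent = sorted(posts, key=lambda p: p["date"], reverse=True)[:limit]
    for post in recent:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = site_url + post["path"]
        if include_body:
            ET.SubElement(item, "description").text = post["body"]
    return ET.tostring(rss, encoding="unicode")
```

ElementTree escapes special characters in the post body, which avoids one common source of broken feeds.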
Unlike pages served by, say, WordPress, every single page on all of my blogs is a flat HTML page. This ensures that the pages are very fast to load regardless of traffic level. It takes a little more time to generate and upload the site: my GMAT site, for instance, now consists of over 300 pages, and most of them have a ‘recent posts’ box on the sidebar, so they must be updated each time I add a new post. But with a decent connection, that only takes a couple of minutes.
The way my script works, it sorts the database by date, then goes through the list twice. The first time, it creates the various TOCs, the XML feed, and the list of recent posts that I use in the sidebar. The second time, it creates the individual pages.
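The two-pass structure can be sketched as follows. This is a simplified illustration, not the actual script: the post dictionaries, the template placeholders, and the choice to return the pages in a dictionary instead of uploading them are all assumptions.

```python
def generate_site(posts, template, recent_count=10):
    """Build every page in two passes over the date-sorted posts.

    template is a string with {title}, {body}, and {recent} placeholders.
    Returns (pages, toc_by_category), where pages maps a path to its HTML.
    """
    ordered = sorted(posts, key=lambda p: p["date"], reverse=True)

    # Pass 1: gather the shared pieces that every page needs.
    toc_by_category = {}
    for post in ordered:
        toc_by_category.setdefault(post["category"], []).append(post["title"])
    recent_box = "\n".join(p["title"] for p in ordered[:recent_count])

    # Pass 2: render each post with the shared pieces filled in.
    pages = {}
    for post in ordered:
        pages[post["path"]] = template.format(
            title=post["title"], body=post["body"], recent=recent_box)

    # The newest post doubles as the front page.
    if ordered:
        pages["index.html"] = pages[ordered[0]["path"]]
    return pages, toc_by_category
```

The first pass has to finish before any page is rendered, because the sidebar on the oldest post depends on the newest ones.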
If you have questions about the process, feel free to post them in the comments.
Python for baseball
A fair number of people are curious enough about my baseball projects (Minor League Splits, College Splits) to inquire about the tools I use. Here’s the answer.
When I decided to take a crack at collecting minor league split statistics in 2006, I had no programming background at all. For reasons I don’t recall, I ended up learning Python. It has proven to be a very good choice–it was extremely easy to learn, and I could start building stuff almost immediately.
In fact, to this day, almost everything I do is written in Python. The one major exception is that the web interface for Minor League Splits is written in JavaScript. (Click on “view source” on any MLS player page and you’ll see some ugly, ugly JavaScript.) Instead of rewriting the site in JS, I probably should have taken the opportunity to learn a Python web framework like Django, but I didn’t, and I haven’t since.
Even though the software that runs College Splits manages some very large databases (by baseball standards, anyway), I don’t use any kind of database-specific language. I know many statisticians rely on MySQL; there’s a commonly-used API that allows Python scripts to work with MySQL databases. (There are also APIs for just about any other db format.) But I don’t use it. I’ve written a fair amount of Python code to simulate some of the power of SQL, but ultimately, my databases sit in CSV files.
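To show what a SQL-like query over a CSV file can look like, here is a minimal sketch built on the standard csv module. The function name and the column names in the example are hypothetical; it is not the code described above.

```python
import csv


def select(source, where=None, columns=None):
    """A rough stand-in for SELECT columns FROM source WHERE condition.

    source is any iterable of CSV lines with a header row, such as an open
    file. where is an optional function that takes a row (a dict) and
    returns True to keep it. columns is an optional list of fields to keep.
    """
    rows = [row for row in csv.DictReader(source)
            if where is None or where(row)]
    if columns is None:
        return rows
    return [{c: row[c] for c in columns} for row in rows]


if __name__ == "__main__":
    lines = ["player,team,hr", "Smith,NYY,30", "Jones,BOS,12"]
    print(select(lines, where=lambda r: int(r["hr"]) > 20, columns=["player"]))
```

Every value arrives as a string, so numeric comparisons need an explicit conversion, as in the example.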
In the end, there’s just not much that a baseball statistician needs a programming language to do. From my perspective, the most important tools are those that allow me to do text parsing: getting play-by-play logs from various formats into a standardized version that I use (very much like Retrosheet’s). Python’s built-in libraries make it very easy to do much of that.
XML parsing also arises quite a bit, especially if you’re grabbing data from MLB. There are plenty of Python libraries that do that. (I ended up writing my own.) Creating and uploading flat HTML files is also a breeze. For instance, I wrote my own blogging platform in a few hundred lines of code; that’s what runs the College Splits blog, as well as the GMAT Hacks and GRE HQ websites. (More on that another day.)
I can’t say very much about what makes Python better than other languages, because I don’t have enough experience with other languages to know that it is. For a beginner, I wouldn’t recommend anything else. You may discover reasons to end up in another language, but even Python’s detractors acknowledge that it’s about as easy as it gets.
If you do decide to teach yourself Python, I encourage you to start working on “real” projects as quickly as possible. Don’t bite off too much–you might just work on coding some common baseball stats (OBP, SLG, ERA, etc.), or when you’re ready for more, write a program to compare players given the parameters of a certain fantasy league. Having a relevant goal makes it a lot easier to stay motivated. If it hadn’t been for the motivation of Minor League Splits, I probably would never have become skillful enough to try any other major programming projects.
The one thing I wish someone had told me when I was a beginner is this: Anything you’re working on, someone else has probably done. I’m embarrassed to recall how many functions I wrote that were no more than clumsy replicas of built-in functions. Some of the tools I’ve written (to work with CSV and XML formats, for instance) I treated as exercises for myself, but there are several options out there. Even when you’re engrossed in your first project, take some time out to browse through Python’s documentation, or a how-to book or two. These will remind you that the language does a lot more than you’re aware of, and keep you from spending time on work others have done.
Another word of advice I wish I’d had: don’t pay too much attention to the constant refrain to “comment your code.” If you get a developer job, comment your code. If you’re doing this stuff for fun, don’t worry too much about it. Instead, always think about making your code reusable. You don’t need to adopt all the trappings of object-oriented programming, but if you’re doing almost anything baseball-related, realize you’ll need it again.
For instance, writing a function for something like OPS takes a couple of minutes no matter how good you are with the language…but once you’ve written it, if you keep it separate from other things (for instance, a function should calculate only OPS, not AVG, BABIP, and OPS), you can use it again and again. I was an awful programmer in 2006, but there is code I wrote in my first few months that I still find myself reusing.
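A sketch of what “one function, one stat” can look like, using the standard definitions of OBP and SLG. The function and argument names are just one possible choice.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base per plate appearance that counts."""
    denom = ab + bb + hbp + sf
    return (h + bb + hbp) / denom if denom else 0.0


def slg(h, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at-bat."""
    total_bases = h + doubles + 2 * triples + 3 * hr
    return total_bases / ab if ab else 0.0


def ops(h, doubles, triples, hr, bb, hbp, ab, sf):
    """OPS is OBP plus SLG; it reuses the two functions above."""
    return obp(h, bb, hbp, ab, sf) + slg(h, doubles, triples, hr, ab)


if __name__ == "__main__":
    print(round(ops(h=150, doubles=30, triples=5, hr=20,
                    bb=60, hbp=5, ab=500, sf=5), 3))
```

Because each function does one thing, OBP and SLG remain available on their own for any later project.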