The Summer of Jeff

Python for baseball

Posted in programming by Jeff on July 16, 2010

A fair number of people are curious enough about my baseball projects (Minor League Splits, College Splits) to inquire about the tools I use.  Here’s the answer.

When I decided to take a crack at collecting minor league split statistics in 2006, I had no programming background at all.  For reasons I don’t recall, I ended up learning Python.  It has proven to be a very good choice–it was extremely easy to learn, and I could start building stuff almost immediately.

In fact, to this day, almost everything I do is written in Python.  The one major exception is that the web interface for Minor League Splits is written in JavaScript.  (Click on “view source” on any MLS player page and you’ll see some ugly, ugly JavaScript.)  Instead of rewriting the site in JS, I probably should have taken the opportunity to learn a Python web framework like Django, but I didn’t, and I haven’t since.

Even though the software that runs College Splits manages some very large databases (by baseball standards, anyway), I don’t use any kind of database-specific language.  I know many statisticians rely on MySQL; there’s a commonly used API that allows Python scripts to work with MySQL databases.  (There are also APIs for just about any other database format.)  But I don’t use it.  I’ve written a fair amount of Python code to simulate some of the power of SQL, but ultimately, my databases sit in CSV files.
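To make "simulating some of the power of SQL" over CSV files a little more concrete, here is a minimal sketch using only Python's standard csv module. It is not the code behind College Splits: the column names (player, team, ab, h), the sample rows, and the helper names select and group_sum are all made up for illustration, and it assumes Python 3.

```python
import csv
import io
from collections import defaultdict

# Hypothetical batting data; in practice this would be a CSV file on disk.
SAMPLE = """player,team,ab,h
Smith,ATL,4,2
Jones,ATL,3,0
Smith,ATL,5,1
Lee,BOS,4,3
"""

def select(rows, where=lambda row: True):
    """Rough equivalent of SELECT * ... WHERE: keep rows that pass the test."""
    return [row for row in rows if where(row)]

def group_sum(rows, key, fields):
    """Rough equivalent of SELECT key, SUM(field) ... GROUP BY key."""
    totals = defaultdict(lambda: dict.fromkeys(fields, 0))
    for row in rows:
        for field in fields:
            totals[row[key]][field] += int(row[field])
    return dict(totals)

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
atl = select(rows, where=lambda row: row["team"] == "ATL")
totals = group_sum(atl, key="player", fields=["ab", "h"])
print(totals)
```

Everything comes back from the csv module as strings, so the int() conversion inside group_sum is doing the job a typed database column would otherwise do.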

When it comes down to it, there’s just not much that a baseball statistician needs a programming language to do.  From my perspective, the most important tools are those for text parsing: getting play-by-play logs from various formats into the standardized version that I use (very much like Retrosheet’s).  Python’s built-in libraries make much of that very easy.
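For a sense of what that kind of text parsing looks like, here is a small sketch using the standard re module. The raw log format, the event codes, and the function name parse_line are invented for the example; real sources each have their own quirks, and this is not the standardized format mentioned above.

```python
import re

# Hypothetical raw log lines; every real source formats these differently.
RAW_LOG = """Top 1st: Smith singled to left.
Top 1st: Jones struck out swinging.
Bot 1st: Lee homered to right.
"""

LINE_PATTERN = re.compile(
    r"^(?P<half>Top|Bot) (?P<inning>\d+)(?:st|nd|rd|th): "
    r"(?P<batter>\w+) (?P<play>.+?)\.?$"
)

# A few description verbs mapped to short event codes (made up for this sketch).
EVENT_CODES = {"singled": "1B", "homered": "HR", "struck": "K"}

def parse_line(line):
    """Turn one raw line into a standardized dict, or None if unrecognized."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    verb = match.group("play").split()[0]
    return {
        "inning": int(match.group("inning")),
        "half": match.group("half").lower(),
        "batter": match.group("batter"),
        "event": EVENT_CODES.get(verb, "UNKNOWN"),
    }

events = [parse_line(line) for line in RAW_LOG.splitlines()]
for event in events:
    print(event)
```

Returning None for lines that don't match, rather than raising an error, makes it easy to collect the oddball lines and decide later whether the pattern needs to grow.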

XML parsing also arises quite a bit, especially if you’re grabbing data from MLB.  There are plenty of Python libraries that do that.  (I ended up writing my own.)  Creating and uploading flat HTML files is also a breeze.  For instance, I wrote my own blogging platform in a few hundred lines of code; that’s what runs the College Splits blog, as well as the GMAT Hacks and GRE HQ websites.  (More on that another day.)
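As one example of the library route, the sketch below reads a scrap of XML with the standard library's xml.etree.ElementTree and turns it into a flat HTML table. The XML here is hypothetical (the real MLB feeds use their own element and attribute names), the function names are made up, and uploading the result is left out.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical XML; not the actual structure of any MLB feed.
SAMPLE_XML = """<game>
  <atbat batter="Smith" event="Single"/>
  <atbat batter="Jones" event="Strikeout"/>
  <atbat batter="Lee" event="Home Run"/>
</game>"""

def read_atbats(xml_text):
    """Return (batter, event) pairs for every <atbat> element."""
    root = ET.fromstring(xml_text)
    return [(ab.get("batter"), ab.get("event")) for ab in root.iter("atbat")]

def render_html(atbats):
    """Build a flat HTML table, escaping the text that goes into each cell."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(escape(batter), escape(event))
        for batter, event in atbats
    )
    return "<table>\n{}\n</table>".format(rows)

atbats = read_atbats(SAMPLE_XML)
page = render_html(atbats)
print(page)
```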

I can’t say very much about what makes Python better than other languages, because I don’t have enough experience with other languages to know that it is.  For a beginner, I wouldn’t recommend anything else.  You may discover reasons to end up in another language, but even Python’s detractors acknowledge that it’s about as easy as it gets.

If you do decide to teach yourself Python, I encourage you to start working on “real” projects as quickly as possible.  Don’t bite off too much–you might just work on coding some common baseball stats (OBP, SLG, ERA, etc.), or when you’re ready for more, write a program to compare players given the parameters of a certain fantasy league.  Having a relevant goal makes it a lot easier to stay motivated.  If it hadn’t been for the motivation of Minor League Splits, I probably would never have become skillful enough to try any other major programming projects.
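A fantasy-league comparison can start out very small. The sketch below assumes a simple points league; the scoring weights, the two players, and their stat lines are all invented, so swap in your own league's settings.

```python
# Hypothetical points-league settings: points awarded per counting stat.
LEAGUE_SCORING = {"hr": 4, "rbi": 1, "sb": 2, "so": -1}

# Made-up season lines for two invented players.
PLAYERS = {
    "Player A": {"hr": 30, "rbi": 95, "sb": 5, "so": 140},
    "Player B": {"hr": 12, "rbi": 60, "sb": 40, "so": 80},
}

def fantasy_points(stats, scoring):
    """Total points for one stat line under the given scoring weights."""
    return sum(weight * stats.get(stat, 0) for stat, weight in scoring.items())

def rank_players(players, scoring):
    """Player names sorted from most to fewest fantasy points."""
    return sorted(
        players,
        key=lambda name: fantasy_points(players[name], scoring),
        reverse=True,
    )

ranking = rank_players(PLAYERS, LEAGUE_SCORING)
print(ranking)  # ['Player B', 'Player A'] with the made-up numbers above
```

Because the league settings live in one dictionary, comparing the same players under a different league is a matter of passing in a different set of weights.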

The one thing I wish someone had told me when I was a beginner is this: Anything you’re working on, someone else has probably done.  I’m embarrassed to recall how many functions I wrote that were no more than clumsy replicas of built-in functions.  Some of the tools I’ve written–to work with CSV and XML formats, for instance–I treated as exercises for myself, but there are several options out there.  Even when you’re engrossed in your first project, take some time out to browse through Python’s documentation, or a how-to book or two.  These will remind you that the language does a lot more than you’re aware of, and keep you from spending time on work others have done.
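A few examples of the kind of thing that is easy to rewrite by hand without realizing Python already provides it. The data is made up; sum, max, sorted, and collections.Counter are all part of standard Python.

```python
from collections import Counter

# Made-up hit totals and plate-appearance outcomes.
hits = {"Smith": 152, "Jones": 98, "Lee": 171}
events = ["1B", "K", "HR", "1B", "BB", "K", "1B"]

# No hand-written accumulator loop needed.
total_hits = sum(hits.values())

# No manual "best so far" bookkeeping needed.
leader = max(hits, key=hits.get)

# No hand-written sort routine needed.
by_hits = sorted(hits, key=hits.get, reverse=True)

# No dictionary-of-tallies boilerplate needed.
event_counts = Counter(events)

print(total_hits, leader, by_hits, event_counts.most_common(1))
```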

Another word of advice I wish I’d had–don’t pay too much attention to the constant refrain to “comment your code.”  If you get a developer job, comment your code.  If you’re doing this stuff for fun, don’t worry too much about it.  Instead, always think about making your code reusable.  You don’t need to adopt all the trappings of object-oriented programming, but if you’re doing almost anything baseball-related, assume you’ll need it again.

For instance, writing a function for something like OPS takes a couple of minutes no matter how good you are with the language…but once you’ve written it, if you keep it separate from other things (for instance, a function should calculate only OPS, not AVG, BABIP, and OPS), you can use it again and again.  I was an awful programmer in 2006, but there is code I wrote in my first few months that I still find myself reusing.
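Here is a minimal sketch of what "keep it separate" might look like, assuming Python 3 (where / is true division). The function and parameter names are just one reasonable choice, and the stat line at the bottom is invented.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = ab + bb + hbp + sf
    return (h + bb + hbp) / denominator if denominator else 0.0

def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases divided by at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab if ab else 0.0

def ops(obp_value, slg_value):
    """OPS is simply on-base percentage plus slugging percentage."""
    return obp_value + slg_value

# Invented stat line: 150 hits (100 1B, 30 2B, 5 3B, 15 HR) in 500 at-bats.
player_obp = obp(h=150, bb=50, hbp=5, ab=500, sf=5)
player_slg = slg(singles=100, doubles=30, triples=5, hr=15, ab=500)
player_ops = ops(player_obp, player_slg)
print(round(player_ops, 3))  # 0.836
```

Each function does one job, so obp and slg stay useful on their own, and ops doesn't care where its two inputs came from.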

6 Responses


  1. Olivier said, on July 16, 2010 at 7:18 pm

    What got me into python was that, for parsing the nhl play-by-play sheets, BeautifulSoup was waaaaaay simpler to use than anything else.

    I bet my code is uglier than yours too.

  2. john said, on July 16, 2010 at 7:48 pm

    Any good books on Python? What would be the best course of action if one was wanting to learn?

  3. Jeff said, on July 17, 2010 at 5:18 pm

    There are plenty of free tutorials/books online. Really, it doesn’t matter … just find a few, browse to see what seems most helpful to you, and dig in. You won’t end up working all the way through one book anyway. Just learn enough to start making stuff, start making stuff, see what else you want to know how to do, then go find out how to do that. And repeat.

  4. Chris said, on July 17, 2010 at 6:05 pm

    I started working with Perl when I first read Baseball Hacks but never was comfortable with it. I started learning Django (which runs on Python) and for whatever reason that stuck. I’ve used python for some college baseball projects and on census data and it’s worked very well.

    John, check out Dive into Python. It’s free and it’s helped me on a few questions.

  5. Sean Smith said, on July 18, 2010 at 9:02 pm

    I’ll have to give python a try sometime. What I use, almost exclusively, is vba for excel. It is more powerful than most people think. I can get gameday data into batting lines, splits, and stuff like that. Though I’m not at the level where I can recreate the Retrosheet files.

    I would recommend commenting code. To me it has a ton of value when I’m looking at something I did a year ago, and try to figure out what I was doing. If your memory is better than mine though, to each his own.

  6. Jeff said, on July 21, 2010 at 6:12 pm

    Sean –

    There’s certainly no harm in commenting code, and it can do a lot of good. My comment was more in response to the blanket recommendation you’ll get in any intro programming book: MUST COMMENT CODE!!! [paraphrased]

    Well-written python code can comment itself. A toy example:

    def calculateBattingAverage(h, ab):
        if ab == 0: return 0
        else: return float(h)/ab

    One goal is certainly to write code that will be understandable and usable to you (and maybe others) at some indefinite future time. I love it when that’s achievable without any formal comments.
