Python for baseball
A fair number of people are curious enough about my baseball projects (Minor League Splits, College Splits) to inquire about the tools I use. Here’s the answer.
When I decided to take a crack at collecting minor league split statistics in 2006, I had no programming background at all. For reasons I don’t recall, I ended up learning Python. It has proven to be a very good choice–it was extremely easy to learn, and I could start building stuff almost immediately.
Even though the software that runs College Splits manages some very large databases (by baseball standards, anyway), I don’t use any kind of database-specific language. I know many statisticians rely on MySQL; there’s a commonly-used API that allows Python scripts to work with MySQL databases. (There are also APIs for just about any other db format.) But I don’t use it. I’ve written a fair amount of Python code to simulate some of the power of SQL, but ultimately, my databases sit in CSV files.
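To give a sense of what “simulating SQL” over CSV files can look like, here is a minimal sketch using only the standard library. It is not the code behind College Splits; the `select` helper, the column names, and the sample rows are all made up for illustration.

```python
import csv
import io

# Hypothetical sample data; in practice this would be a CSV file on disk.
SAMPLE_CSV = """player,team,ab,h
Smith,Durham,400,120
Jones,Toledo,350,84
Lee,Durham,90,31
"""


def select(rows, where=None, order_by=None, reverse=False):
    """Rough stand-in for SQL's SELECT ... WHERE ... ORDER BY (hypothetical helper)."""
    result = [row for row in rows if where is None or where(row)]
    if order_by is not None:
        result.sort(key=order_by, reverse=reverse)
    return result


# csv.DictReader yields one dict per row, keyed by the header line.
# Every value comes back as a string, so numbers must be converted by hand.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Roughly: SELECT * FROM batting WHERE team = 'Durham' ORDER BY h DESC
durham = select(
    rows,
    where=lambda r: r["team"] == "Durham",
    order_by=lambda r: int(r["h"]),
    reverse=True,
)

for row in durham:
    print(row["player"], row["h"])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="")`.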
When it comes down to it, there’s just not much that a baseball statistician needs a programming language to do. From my perspective, the most important tools are the ones for text parsing: getting play-by-play logs from various source formats into the standardized format I use (very much like Retrosheet’s). Python’s built-in libraries make much of that very easy.
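As a rough illustration of that kind of parsing, the sketch below turns one line of a play-by-play log into a standardized record with the built-in `re` module. The input format, the field names, and the event codes are all assumptions invented for this example; real logs vary a great deal and need many more patterns.

```python
import re

# Assumed input format (hypothetical): "Top 3rd: Jones doubled to left, Smith scored."
LINE_PATTERN = re.compile(
    r"^(?P<half>Top|Bottom) (?P<inning>\d+)(?:st|nd|rd|th): "
    r"(?P<batter>\w+) "
    r"(?P<verb>singled|doubled|tripled|homered|struck out|walked)"
)

# Illustrative short codes for the standardized record.
EVENT_CODES = {
    "singled": "S",
    "doubled": "D",
    "tripled": "T",
    "homered": "HR",
    "struck out": "K",
    "walked": "W",
}


def parse_play(line):
    """Return a standardized dict for one log line, or None if it doesn't match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return {
        "inning": int(match.group("inning")),
        "half": match.group("half"),
        "batter": match.group("batter"),
        "event": EVENT_CODES[match.group("verb")],
    }


print(parse_play("Top 3rd: Jones doubled to left, Smith scored."))
print(parse_play("Pitching change."))  # not a plate appearance, so no match
```

Returning `None` for unrecognized lines makes it easy to collect them and see which patterns still need to be written.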
XML parsing also arises quite a bit, especially if you’re grabbing data from MLB. There are plenty of Python libraries that do that. (I ended up writing my own.) Creating and uploading flat HTML files is also a breeze. For instance, I wrote my own blogging platform in a few hundred lines of code; that’s what runs the College Splits blog, as well as the GMAT Hacks and GRE HQ websites. (More on that another day.)
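For anyone who would rather not write a parser, the standard library’s `xml.etree.ElementTree` module handles simple cases in a few lines. The sketch below is only an illustration: the XML snippet, its tag names, and its attributes are invented and are not MLB’s actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; real feeds use their own element and attribute names.
SAMPLE_XML = """
<game id="example-game">
  <atbat batter="Smith" event="Single"/>
  <atbat batter="Jones" event="Strikeout"/>
  <atbat batter="Smith" event="Home Run"/>
</game>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Group each batter's events in the order they appear in the document.
events_by_batter = {}
for atbat in root.iter("atbat"):
    batter = atbat.get("batter")
    events_by_batter.setdefault(batter, []).append(atbat.get("event"))

print(root.get("id"))
print(events_by_batter)
```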
I can’t say very much about what makes Python better than other languages, because I don’t have enough experience with other languages to know that it is. For a beginner, I wouldn’t recommend anything else. You may discover reasons to end up in another language, but even Python’s detractors acknowledge that it’s about as easy as it gets.
If you do decide to teach yourself Python, I encourage you to start working on “real” projects as quickly as possible. Don’t bite off too much–you might just work on coding some common baseball stats (OBP, SLG, ERA, etc.), or when you’re ready for more, write a program to compare players given the parameters of a certain fantasy league. Having a relevant goal makes it a lot easier to stay motivated. If it hadn’t been for the motivation of Minor League Splits, I probably would never have become skillful enough to try any other major programming projects.
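As a starting point, the three stats mentioned above might be coded something like this. The formulas are the standard ones; the function signatures and the choice to return `None` when a stat is undefined are just one possible design.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = ab + bb + hbp + sf
    if denominator == 0:
        return None  # undefined with no plate appearances counted
    return (h + bb + hbp) / denominator


def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases / at-bats."""
    if ab == 0:
        return None  # undefined with no at-bats
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab


def era(er, ip):
    """Earned run average: 9 * ER / IP.

    ip must be a true decimal (6 1/3 innings is 6.333...), not the
    box-score shorthand "6.1".
    """
    if ip == 0:
        return None  # undefined with no innings pitched
    return 9 * er / ip


# Made-up stat lines, purely for illustration.
print(round(obp(h=150, bb=50, hbp=5, ab=500, sf=5), 3))
print(round(slg(singles=100, doubles=30, triples=5, hr=15, ab=500), 3))
print(round(era(er=20, ip=60), 2))
```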
The one thing I wish someone had told me when I was a beginner is this: Anything you’re working on, someone else has probably done. I’m embarrassed to recall how many functions I wrote that were no more than clumsy replicas of built-in functions. Some of the tools I’ve written (to work with CSV and XML formats, for instance) I treated as exercises for myself, but there are several ready-made options out there. Even when you’re engrossed in your first project, take some time out to browse through Python’s documentation, or a how-to book or two. These will remind you that the language does a lot more than you’re aware of, and keep you from spending time on work others have already done.
Another word of advice I wish I’d had: don’t pay too much attention to the constant refrain to “comment your code.” If you get a developer job, comment your code. If you’re doing this stuff for fun, don’t worry too much about it. Instead, always think about making your code reusable. You don’t need to adopt all the trappings of object-oriented programming, but if you’re doing almost anything baseball-related, assume you’ll need that code again.
For instance, writing a function for something like OPS takes a couple of minutes no matter how good you are with the language…but once you’ve written it, if you keep it separate from other things (for instance, a function should calculate only OPS, not AVG, BABIP, and OPS), you can use it again and again. I was an awful programmer in 2006, but there is code I wrote in my first few months that I still find myself reusing.
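Here is a small sketch of what that separation buys you. The `ops` function does one thing only, so the same few lines serve both a single lookup and a leaderboard. The player names and stat lines are invented for the example.

```python
def ops(on_base, slugging):
    """On-base plus slugging, and nothing else."""
    return on_base + slugging


# Hypothetical players, stored as (OBP, SLG) pairs.
stat_lines = {
    "Smith": (0.366, 0.470),
    "Jones": (0.310, 0.395),
    "Lee": (0.401, 0.388),
}

# Reuse 1: a single player's OPS.
print(round(ops(*stat_lines["Smith"]), 3))

# Reuse 2: the same function as a sort key for a leaderboard.
ranked = sorted(stat_lines, key=lambda name: ops(*stat_lines[name]), reverse=True)
for name in ranked:
    print(name, round(ops(*stat_lines[name]), 3))
```

Had the function also computed AVG and BABIP, every caller would need to supply inputs for all three stats, even when only OPS was wanted.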