The Summer of Jeff

Roll-your-own blogging software

Posted in programming by Jeff on July 28, 2010

A few years ago, I moved my GMAT Hacks website off of WordPress.  I wrote the code for a basic blogging platform using Python, and since then, I’ve built it out a little more.  A content management system (CMS) does not have to be complicated.  And as Blogger, WordPress and others have shown, the platform is generic; I’ve used almost exactly the same code to drive GMAT Hacks, GRE HQ, and the College Splits blog.

I’m not going to share any code, but I will walk through the process.  It’s very intuitive in Python, and I’m sure it’s similarly straightforward in many other languages.

The various blogging platforms offer much of what you’ll ever need, and they are generally easy to use and modify.  That’s why this very post is on a WordPress blog.  But especially in the case of my GMAT site, I needed more flexibility to automatically update special types of pages and create customized sidebars and footers.

The basics

A do-it-yourself CMS can consist of as few as three files:

  1. A database of some sort that, for each post, stores title, body text, date, and other information, possibly including category, tag(s), and anything else you can dream up.  I think this is simple enough not to require further explanation.
  2. A simple script to add items to the database and edit items already in the database.
  3. A script that uses the site template to generate pages for each post using the database.

Let’s look at the last two in a little more detail.

Add and edit items

This is also pretty simple.  The one aspect worth mentioning is that it’s important to validate everything going in–if you’re ambitious, you may even try to validate the HTML in the posts themselves.  I limit myself to checking that a new post’s category already exists and that the post’s date is valid.  (On some of my sites, I use YYYYMMDDX as a post ID, where X is an index to differentiate multiple posts from the same day.)
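
As an illustration only (not the actual script described here), the two checks above might look like the following sketch. The field names `id` and `category`, the function name `validate_post`, and the assumption that X is a single digit are all hypothetical.

```python
from datetime import datetime


def validate_post(post, known_categories):
    """Return a list of problems; an empty list means the post looks fine."""
    problems = []
    post_id = post.get("id", "")

    # Assumed ID layout: YYYYMMDDX, i.e. eight date digits plus a one-digit index.
    if len(post_id) != 9 or not post_id.isdigit():
        problems.append("id must be 9 digits (YYYYMMDDX)")
    else:
        try:
            # strptime rejects impossible dates such as February 30.
            datetime.strptime(post_id[:8], "%Y%m%d")
        except ValueError:
            problems.append("id does not start with a valid date")

    # A new post may only use a category that already exists.
    if post.get("category") not in known_categories:
        problems.append("unknown category: %r" % post.get("category"))

    return problems
```

Returning a list of problems rather than raising on the first one makes it easy to report everything wrong with a post in one go.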

Generate the site

Depending on how thorough you want to be, this script can get fairly complex.  (Mine is currently a bit longer than 400 lines of code.)

At its most basic, it’s just a matter of creating a page for each post and uploading each one.  Here are a few more things it can do:

  1. Uploading some pages to multiple locations.  For example, you might want your most recent post to be the front page on your site.  So the page “category/recent-post.html” might also be uploaded as “index.html.”
  2. Creating tables of contents.  On my GMAT site, I have a chronological TOC, a site-wide TOC with posts sorted by category, and an individual TOC page for each category.  I also have a “recent posts” page with a chronological list of the last 10 posts.  The script creates each one every time I update the site.
  3. Creating an XML feed.  You might include the last 5 or 10 posts, and you have the flexibility to include all, some, or none of the body of the post.
  4. Updating pages outside of the blog hierarchy.  The first page of my GMAT site does not contain a blog post, but the script creates it, so that it always links to the most recent post.
  5. Varying sidebar and footer content.  My footers are generally predictable–they link to the previous post, as well as a category-specific table of contents.  But I also include an ad for one of my books.  (For some posts, I randomly rotate the ads with each site update.)  With full control over the script, I can put an ad for my math book on math-related pages and my verbal book on verbal-related pages.  I also have a few different sidebars for different purposes.  In a few cases, I even drop the footer content altogether.
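
To make a couple of these ideas concrete, here is a rough sketch (not the real 400-line script) covering items 1 and 5: writing the newest post a second time as index.html, and choosing footer content by category. The template, the `ADS` mapping, the post fields, and the choice to write into a local directory instead of uploading are all assumptions for the example.

```python
import os
from string import Template

# Hypothetical site template; a real one would be much richer.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1>$body<div class='footer'>$footer</div></body></html>"
)

# Hypothetical mapping from category to footer ad.
ADS = {"math": "Ad for the math book", "verbal": "Ad for the verbal book"}


def render_post(post):
    """Fill the template, picking the footer ad from the post's category."""
    footer = ADS.get(post["category"], "")  # no ad for other categories
    return PAGE.substitute(title=post["title"], body=post["body"], footer=footer)


def write_page(out_dir, rel_path, html):
    """Write one flat HTML file, creating directories as needed."""
    path = os.path.join(out_dir, rel_path)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


def build_site(posts, out_dir):
    """Write every post; the newest one is also written as index.html."""
    newest = max(posts, key=lambda p: p["id"])  # IDs start with YYYYMMDD
    for post in posts:
        html = render_post(post)
        write_page(out_dir, "%s/%s.html" % (post["category"], post["slug"]), html)
        if post is newest:
            write_page(out_dir, "index.html", html)
```

An upload step (FTP, rsync, or similar) would then push the contents of `out_dir` to the server.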

Unlike the way, say, WordPress does things, every single page on all of my blogs is a flat html page.  This ensures that the pages are very fast to load regardless of traffic level.  It takes a little more time to generate and upload the site–for instance, my GMAT site now consists of over 300 pages, and most of them have a ‘recent posts’ box on the sidebar, so they must be updated each time I add a new post.  But with a decent connection, that only takes a couple of minutes.

The way my script works, it sorts the database by date, then goes through the list twice.  The first time, it creates the various TOCs, the XML feed, and the list of recent posts that I use in the sidebar.  The second time, it creates the individual pages.
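
The two-pass idea can be sketched roughly as follows. This is an illustrative outline, not the actual script: it assumes each post is a dict with `date` (an ISO string such as "2010-07-28"), `category`, `slug`, `title`, and `body`, sorts newest first, and leaves out the XML feed for brevity.

```python
def generate(posts, recent_count=10):
    """Return a dict mapping relative paths to page HTML."""
    # Sort once, newest first; ISO date strings sort correctly as text.
    ordered = sorted(posts, key=lambda p: p["date"], reverse=True)

    # Pass 1: collect everything the individual pages depend on.
    links, by_category = [], {}
    for post in ordered:
        link = '<a href="/%s/%s.html">%s</a>' % (
            post["category"], post["slug"], post["title"])
        links.append(link)
        by_category.setdefault(post["category"], []).append(link)

    def as_list(items):
        return "<ul>%s</ul>" % "".join("<li>%s</li>" % i for i in items)

    recent_box = as_list(links[:recent_count])
    pages = {"toc.html": as_list(links)}
    for category, cat_links in by_category.items():
        pages["%s/index.html" % category] = as_list(cat_links)

    # Pass 2: build each post page, now that the sidebar content exists.
    for post in ordered:
        path = "%s/%s.html" % (post["category"], post["slug"])
        pages[path] = "<h1>%s</h1>%s<div class='sidebar'>%s</div>" % (
            post["title"], post["body"], recent_box)
    return pages
```

The first pass has to finish before any post page is built, because every page embeds the same recent-posts box.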

If you have questions about the process, feel free to post them in the comments.

Marcel spreadsheets for batters

Posted in baseball analysis by Jeff on July 23, 2010

Following up on my earlier post about historical Marcels, I’ve just posted full spreadsheets with Marcel forecasts for hitters going back to 1901.

For each year, I included a prediction for every non-pitcher who appeared in any of the previous three seasons, or would be a rookie that year.  (For the rookies, of course, the prediction is very close to league average.)  I didn’t distinguish at all between leagues, which I’m sure creates some wonkiness, especially around the years of the Federal League, the years of World War II, and the last few seasons, when the AL/NL difference became stark.

Click here for the directory with single-year spreadsheets available for download.  At some point in the near future I’ll add pitchers, and at some point in the less-near future I’ll make a more user-friendly interface so that you can view the stats directly on the web.

Marcel forecasts 1941

Posted in baseball analysis by Jeff on July 18, 2010

Here’s something I’ve been meaning to do for a long time.

Using 1938-40 stats and the Marcel forecasting algorithm, I generated batter projections for the 1941 season.  You can download the full spreadsheet (CSV) here.

My intention was to follow Tango’s algorithm exactly.  I didn’t do any park adjustments, nor did I consider any league differences.  I included a “reliability” column to indicate how much data each projection was based on.  Those numbers top out at about 0.88 for a player who played full seasons in 1938-40, and bottom out at 0 for a rookie with no MLB experience.  I ran projections for every non-pitcher who played in 1938, ’39, ’40, or ’41.
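
For readers curious where a number like 0.88 could come from, here is a small sketch of the reliability figure as Marcel is commonly described: plate appearances from the last three seasons weighted 5/4/3, then compared against 1,200 plate appearances of league-average regression. Those weights and the 1,200 figure are assumptions taken from the usual description of the system, not something spelled out in this post, and the function name is made up for the example.

```python
def marcel_reliability(pa_last, pa_prev, pa_oldest,
                       weights=(5, 4, 3), regression_pa=1200):
    """Share of a projection that comes from the player's own record.

    Assumes the commonly cited Marcel weights (5/4/3, most recent season
    first) and 1,200 PA of regression toward league average.
    """
    weighted_pa = (weights[0] * pa_last
                   + weights[1] * pa_prev
                   + weights[2] * pa_oldest)
    return weighted_pa / (weighted_pa + regression_pa)
```

Under these assumptions, three seasons of 700 plate appearances give 8400 / 9600 = 0.875, and a rookie with no plate appearances gets 0.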

As you might imagine, Ted Williams and Joe DiMaggio come out near the top.  But neither would have been projected to have the best offensive season in 1941; that honor went to Hank Greenberg, who had to leave behind a .269/.410/.463 slash line when he was drafted in May.  Those three, along with Johnny Mize and Jimmy Foxx, would have been forecast as the best of the bunch; Marcel gives them all wOBAs between .431 and .441.  No one else is above .407.

Pitchers and other seasons to come soon.

Python for baseball

Posted in programming by Jeff on July 16, 2010

A fair number of people are curious enough about my baseball projects (Minor League Splits, College Splits) to inquire about the tools I use.  Here’s the answer.

When I decided to take a crack at collecting minor league split statistics in 2006, I had no programming background at all.  For reasons I don’t recall, I ended up learning Python.  It has proven to be a very good choice–it was extremely easy to learn, and I could start building stuff almost immediately.

In fact, to this day, almost everything I do is written in Python.  The one major exception is that the web interface for Minor League Splits is written in Javascript.  (Click on “view source” on any MLS player page and you’ll see some ugly, ugly Javascript.)  Instead of rewriting the site in JS, I probably should have taken the opportunity to learn a Python web framework like Django, but I didn’t, and I haven’t since.

Even though the software that runs College Splits manages some very large databases (by baseball standards, anyway), I don’t use any kind of database-specific language.  I know many statisticians rely on MySQL; there’s a commonly-used API that allows Python scripts to work with MySQL databases.  (There are also APIs for just about any other db format.) But I don’t use it.  I’ve written a fair amount of Python code to simulate some of the power of SQL, but ultimately, my databases sit in CSV files.
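
To give a sense of what “simulating some of the power of SQL” over CSV files can mean, here is a minimal sketch using only the standard library. It is not the code behind College Splits; the helper names `load_csv` and `select`, the column names, and the placeholder rows are all invented for the example.

```python
import csv
import io


def load_csv(fileobj):
    """Read a CSV file into a list of dicts keyed by the header row."""
    return list(csv.DictReader(fileobj))


def select(rows, where=None, order_by=None, reverse=False):
    """A tiny stand-in for SELECT ... WHERE ... ORDER BY."""
    result = [r for r in rows if where is None or where(r)]
    if order_by is not None:
        result.sort(key=order_by, reverse=reverse)
    return result


# Placeholder data, not real statistics.
SAMPLE = """player,team,pa,hr
Player A,XYZ,410,12
Player B,XYZ,95,3
Player C,ABC,388,21
"""

rows = load_csv(io.StringIO(SAMPLE))
regulars = select(
    rows,
    where=lambda r: int(r["pa"]) >= 300,   # csv values arrive as strings
    order_by=lambda r: int(r["hr"]),
    reverse=True,
)
```

Here `regulars` holds Player C and then Player A, the two rows with at least 300 plate appearances, sorted by home runs.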

Ultimately, there’s just not much that a baseball statistician needs a programming language to do.  From my perspective, the most important tools are those that allow me to do text parsing, getting play-by-play logs from various formats into a standardized version that I use (very much like Retrosheet’s).  Python’s built-in libraries make it very easy to do much of that.

XML parsing also arises quite a bit, especially if you’re grabbing data from MLB.  There are plenty of Python libraries that do that.  (I ended up writing my own.)  Creating and uploading flat HTML files is also a breeze.  For instance, I wrote my own blogging platform in a few hundred lines of code; that’s what runs the College Splits blog, as well as the GMAT Hacks and GRE HQ websites.  (More on that another day.)
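
As one example of the library route (as opposed to a home-grown parser), the standard library’s `xml.etree.ElementTree` handles simple documents in a few lines. The XML layout below is made up for illustration and is not meant to match MLB’s actual feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical play-by-play snippet; element and attribute names are invented.
SAMPLE_XML = """<game id="example">
  <atbat batter="101" pitcher="201" event="Single"/>
  <atbat batter="102" pitcher="201" event="Strikeout"/>
</game>"""


def atbat_events(xml_text):
    """Return (batter, event) pairs for every <atbat> element, in order."""
    root = ET.fromstring(xml_text)
    return [(ab.get("batter"), ab.get("event")) for ab in root.iter("atbat")]
```

From there, each pair can be written out in whatever standardized play-by-play format you prefer.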

I can’t say very much about what makes Python better than other languages, because I don’t have enough experience with other languages to know that it is.  For a beginner, I wouldn’t recommend anything else.  You may discover reasons to end up in another language, but even Python’s detractors acknowledge that it’s about as easy as it gets.

If you do decide to teach yourself Python, I encourage you to start working on “real” projects as quickly as possible.  Don’t bite off too much–you might just work on coding some common baseball stats (OBP, SLG, ERA, etc.), or when you’re ready for more, write a program to compare players given the parameters of a certain fantasy league.  Having a relevant goal makes it a lot easier to stay motivated.  If it hadn’t been for the motivation of Minor League Splits, I probably would never have become skillful enough to try any other major programming projects.

The one thing I wish someone had told me when I was a beginner is this: Anything you’re working on, someone else has probably done.  I’m embarrassed to recall how many functions I wrote that were no more than clumsy replicas of built-in functions.  Some of the tools I’ve written–to work with CSV and XML formats, for instance–I treated as exercises for myself, but there are several options out there.  Even when you’re engrossed in your first project, take some time out to browse through Python’s documentation, or a how-to book or two.  These will remind you that the language does a lot more than you’re aware of, and keep you from spending time on work others have done.

Another word of advice I wish I’d had–don’t pay too much attention to the constant refrain to “comment your code.”  If you get a developer job, comment your code.  If you’re doing this stuff for fun, don’t worry too much about it.  Instead, always think about making your code reusable.  You don’t need to adopt all the trappings of object-oriented programming, but if you’re doing almost anything baseball related, realize you’ll need it again.

For instance, writing a function for something like OPS takes a couple of minutes no matter how good you are with the language…but once you’ve written it, if you keep it separate from other things (for instance, a function should calculate only OPS, not AVG, BABIP, and OPS), you can use it again and again.  I was an awful programmer in 2006, but there is code I wrote in my first few months that I still find myself reusing.
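
A sketch of what “one function, one stat” might look like, using the standard definitions of OBP, SLG, and OPS. The function and argument names are just one possible choice.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denom = ab + bb + hbp + sf
    return (h + bb + hbp) / denom if denom else 0.0


def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases / at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab if ab else 0.0


def ops(obp_value, slg_value):
    """OPS is simply OBP plus SLG; it computes nothing else."""
    return obp_value + slg_value
```

Because `ops` takes the two rates as inputs, it can be reused with numbers from any source, whether a full stat line, a split, or a projection.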

The Qualifier Advantage

Posted in tennis by Jeff on July 13, 2010

By definition, qualifiers in a pro tennis tournament are among the lowest-ranked players in the draw. (Wild cards are sometimes lower ranked, and occasionally a player substantially improves his ranking between the cut-off date for tourney acceptance and the tourney itself. That’s why Thomaz Bellucci was playing qualies at last year’s US Open despite a ranking well inside the top 100.)

So, it stands to reason that qualifiers would usually lose their first-round match; if they survive, their odds of getting further plummet with each passing round.

That doesn’t seem like what happens. Sure, a large number of qualifiers don’t make it past the first round, and sometimes the ones who make it past the first round did so because they drew another qualifier (or wild card) to open the tournament. But anecdotally, it seems that qualifiers do better than they’re “supposed to.”

That’s my first testable claim, one that I’m not going to test today:

Qualifiers perform better than their rankings suggest they would.

If that’s true, the more compelling question is, “Why?”  That’s tougher to test, but the reasons are worth some speculation.  Here are several possibilities:

  1. The hot hand.  In grand slams, qualifiers have to win three matches in a row.  In any tournament, they have to have won at least two.  Hell, in some USA Futures tourneys, qualifiers have to win four in a row.  Maybe the momentum is worth something.
  2. A short lay-off.  A player ranked in the middle of the top 100 is getting direct entry into most draws, but not winning very often.  He might go several weeks in a row in which he plays once, maybe twice per week.  So each week, he has a new match after a full week without any match play.  On the other hand, the qualifier has just played two or three matches.  The higher-ranked player has the advantage of being “fresh,” but how fresh is too fresh?  This factor would be difficult to separate from the hot hand, since a player who has recent match play is generally one who has been winning.
  3. A favored surface.  The main disadvantage of ATP rankings as a predictor of match outcomes is that they don’t distinguish between surfaces.  So if we use rankings, we make the false assumption that Andy Roddick is just as likely to beat Gael Monfils on clay as he is on grass.  So maybe the players who come through qualifying are ones who are best on the surface of that tournament.  Thus, their ranking doesn’t reflect their skill level on the surface, but their performance reveals it in qualifying.
  4. Rankings overstate the difference between players.  This is another avenue for research.  Clearly there’s less difference between #50 and #150 than, say, #1 and #10.  But how much less?  It’s amazing how much movement there is in the back half of the top 100–I would guess that over 150 players per year appear in the top 100.  So, if ranking points overstate the difference in talent level, it shouldn’t come as a surprise that a qualifier ranked #147 has a good shot at knocking off a player ranked #54.
  5. Unfamiliarity.  Maybe an up-and-coming player has an advantage against a more established opponent in their first meeting.  For instance, last week at Newport, qualifier Richard Bloomfield beat Christophe Rochus and Santiago Giraldo en route to the final.  Rochus and Giraldo aren’t exactly superstars, but Bloomfield would have a better opportunity to learn about their games than they would about his.  I doubt this theory is worth much.
  6. There’s nothing to lose.  Here’s a commentator favorite.  Bloomfield, once he got into the main draw, had “nothing to lose,” while all the pressure was on more established players.  I’m not sure I buy that, and even if I do, I’m not sure it has much value here.  It would seem to apply to almost every match, so any calculation we make using player rankings would already have the “nothing to lose” factor built in.

Many of the same factors apply to wild cards, as well.  WCs would probably be tougher to test because so many of them have limited track records.  Some players get WCs for winning local junior tournaments, or upon returning from a long injury layoff.

In any event, there’s a lot to study here.