Monday, April 19, 2010

The Super-Secret Origin of IRP

So early this morning, I had an idea profound enough to wake me up and make me get on the computer. Considering that I had been dreaming about very pleasant things - I don't remember much, but waterskiing pirates were involved - this was not an insignificant occurrence. What moved me so? Why, baseball statistics, of course!

It occurred to me that the result of every plate appearance must have an intrinsic value, value of course meaning runs. That was the original thought rather than the conclusion, and I honestly don't know how I formulated it. I do have a justification, though, which I came up with afterwards. The result of a plate appearance affects how many runs a team can expect to score on average, but it's the circumstance that changes the value of the result (as in, a single is worth more in Case A than in Case B because the bases are loaded in Case A but empty in Case B; it's the situation that dictates the worth, rather than the hit itself). There's no reason to think the result itself will change unless the circumstances around it change. Before we get too heavy, though, here's an example that illustrates just what I'm getting at when I say "value" and gives me something to refer to in my explanation.

The Cardinals are batting. There's nobody out and Ryan Ludwick is on third base. Albert Pujols singles and Ludwick scores. We might therefore think that the single is worth a run - that's the rather noble idea behind Runs Batted In, probably the most maligned statistic in existence. It is true in a concrete way, but there are two big ideas missing: 1) the hit plated a run only because the sequence of events prior to Pujols' plate appearance made it possible - if Ludwick had made an out instead of getting on, no runs would have scored on that play; and 2) the batter reaches base and therefore has a chance to score himself, beyond the other results of the play.

Pushing a little more, we can see that the single was actually worth more than a run. According to the run expectancy matrix supplied by Baseball Prospectus, a team with no outs and a runner on third base will score an average of 1.31 runs that inning. Pujols' hit scored Ludwick, and now Matt Holliday comes up with a runner on first base and no outs. In that situation (runner on first, no outs) the Cardinals can expect to score 0.88 more runs. Pujols' hit really did more than just score Ludwick. It also created an opportunity for further scoring.

(This is getting into an area of probability that might be a little confusing. If you're cool, feel free to go on to the next paragraph. If you're my mother, stick around. One would think that since the Cards already scored one run, they should only expect another 0.31. It's important to remember, though, that the situation resets itself every batter, much like a coin-flip. The figure 1.31 was the average of the results of when the batter created situations where more than 1.31 runs could score and when the batter created situations where less than 1.31 runs could score. Obviously, you can't actually score three-tenths of a run, so more or less than the average number will be scored: 0, 1, 2, and so on. Think of the decimal places as telling us which integer is more likely in the grand scheme of things.)

We have statistics in order to identify individual performance, but in a team sport, the natural urge is to figure out how those stats fit into a team's overall showing. In baseball, an offensive player's performance has to be judged by how many runs he's responsible for - after all, runs are the only positive offensive result, so there's nothing that a team should value in its batters besides run production. In our example, though, Ludwick's scoring from third shouldn't factor into assessing how much Pujols contributed to the team's effort, as he wasn't responsible for the situation that scored the run. What he's actually responsible for is the situation that appears for the guy on-deck. That's really the key point behind my dream-interrupting thought, so if you take nothing else away from this post, take that.

So, we've realized that the quality of a plate appearance should really be determined by the circumstance it creates. That passes the eye test: a batter reaches base because he did something right - got a hit, worked a walk - and that increases his team's chances of scoring. He should get credit for that. One thing you could do is just take the difference between the situation the batter faced and the situation that resulted, using the run expectancy matrix. But, in the example, you'd have to take into account that Ludwick scored. Otherwise you'd be docking Pujols about half-a-run despite the fact that he did his number one job - not make an out. If you just add the run scored to the total, you're again giving Pujols credit for the situation he didn't create. What we need to do is separate the valuation of the hit from the run driven in.
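To make those two tempting-but-flawed options concrete, here is a quick Python sketch of the arithmetic, using only the run expectancy figures quoted above. The variable names are made up for illustration.

```python
# Run expectancy figures quoted in this post (Baseball Prospectus matrix).
RE_THIRD_NO_OUTS = 1.31  # runner on third, nobody out: what Pujols faced
RE_FIRST_NO_OUTS = 0.88  # runner on first, nobody out: what Holliday faces
RUNS_SCORED = 1          # Ludwick crossing the plate

# Option 1: compare the two situations only. The single looks like a loss.
situation_only = RE_FIRST_NO_OUTS - RE_THIRD_NO_OUTS

# Option 2: add the run that scored. Now Pujols is credited for Ludwick
# being on third, which was not his doing.
with_run = situation_only + RUNS_SCORED

print(f"situation only: {situation_only:+.2f}")  # -0.43
print(f"with the run:   {with_run:+.2f}")        # +0.57
```

Neither number isolates what the single itself was worth, which is the whole problem.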

You could just give Pujols the 0.88 runs. After all, that specific situation resulted from his plate appearance alone. There are a couple of problems with that, though. First off, it would mean that he'd actually get more credit for making an out and stranding Ludwick at third (the one-out, runner on third situation yields 0.97 runs) than he would by singling. Another problem: even if Pujols had made an out, there would still have been a probability of scoring. Let's say he hit a sacrifice fly and Ludwick scored, bringing up Holliday with nobody on and one out. The situation Pujols created for Holliday would still yield an average of 0.28 runs. Until the third out is recorded, the average number of runs still to come is always positive, so it wouldn't be right to give Pujols the full 0.88 runs. That would be crediting him for runs that would have, on average, scored without help from his plate appearance. The last problem, that I can think of right now at least, is that sometimes you'd be crediting the batter with the results of previous plate appearances. It didn't apply in this example, but if there had been a runner on first as well as third, the single would have presented Holliday with two base runners (unless one of them was Enos Slaughter, of course), only one of whom Pujols could properly claim responsibility for. In a real-world sense, yes, that's the situation Pujols left Holliday, but we want to whittle things down to the thoroughly independent result in order to find out which results well and truly belong to the individual player.

(There's also a practical problem with trying to deal directly with the run expectancy matrix: time. It would be preposterously time-consuming to go through and record the situation after every single plate appearance for every single player in every single game. It'd be possible with computer programs - that's what Sean Forman does, after all - but I simply don't know enough to be able to manage that task in a timely enough fashion.)

So, what can free the individual's production from the influence of other players? Well, there are three situations that are absolutely free from the effects of earlier batters: nobody on, with zero, one, or two outs. For instance, with nobody on and nobody out, a team will average 0.52 runs. If the batter singles, the run average goes up to 0.88. This means that the batter's single created an extra 0.36 runs. Look at it. The only play that contributed to that 0.36 increase was the single, for which the batter can claim full responsibility. No other batters and no other events were involved. We can say that, with zero outs, a single by itself is worth 0.36 runs and not worry that the total is under external influence. You repeat the exercise with one and two outs and average the three figures to find out the overall value of the generic single. It's quite reasonable to assume that one-third of all plate appearances take place with zero out, one-third with one out, and one-third with two out - there's simply no substantive reason for that not to be the case - so that's a very easy calculation to make.
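As a rough sketch of that calculation in Python: the function name and its list-based inputs are my own invention, and it assumes the equal one-third weighting described above. Only the zero-out figures (0.52 and 0.88) are quoted in this post, so the example sticks to those.

```python
def bases_empty_value(re_before, re_after):
    """Average the run expectancy gained by one result across out counts.

    re_before: bases-empty run expectancy, one entry per out count.
    re_after:  run expectancy after the result, in the same order.
    Every out count is weighted equally (the one-third assumption).
    """
    if not re_before or len(re_before) != len(re_after):
        raise ValueError("need matching, non-empty lists of run expectancies")
    gains = [after - before for before, after in zip(re_before, re_after)]
    return sum(gains) / len(gains)

# Zero-out single, using only figures quoted in this post.
zero_out_single = bases_empty_value([0.52], [0.88])
print(round(zero_out_single, 2))  # 0.36
```

For the full figure you would pass three-element lists (zero, one, and two outs) taken from the run expectancy matrix.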

A word about home runs: I decided to score a nobody on, nobody out home run as being worth 1 run rather than 1.52 (the run plus the ensuing situation; this was my original strategy). A batter cannot produce more than 1 run in a plate appearance without outside help - he'd need a runner on base in front to drive in or a batter behind him to convert, too. Therefore, we top out at 1 run produced per plate appearance.

I'm about to post a short list containing the run value of each plate appearance result. You'll note that I have grouped walks and singles together as "One-Base"; in the vacuum of nobody on, where we can see the inherent value of the result itself and not the effect it has on the surroundings or the surroundings have on it, the one-base plays are well and truly identical. I've left out hit by pitch and reached on error because, honestly, those are largely unpredictable defensive mistakes that don't merit much in the way of credit. They're infrequent enough that ignoring them will have a very minor impact on the eventual data, and we have to leave something for version 2.0, don't we? So, here are the inherent run values for one-base plays, doubles, and triples (rounded to 4 decimal places):

One-Base - 0.2457 runs
Double - 0.4164 runs
Triple - 0.5825 runs

Before I get into the final stretch, I'd like to just provide some thoughts on these results, which I find interesting. The gap between a single and a double and the gap between a double and a triple are very, very similar, which is just plain cool. It's somewhat intuitive, but plenty of things you think are that simple just aren't, so that's nifty. What's even more interesting is that a one-base play is almost exactly a quarter of a home run - again, you might guess that it's a quarter, but the fact that it actually is a quarter is awesome - but a triple isn't even three-fifths of a home run. In fact, there's a bigger difference between a triple and a home run than there is between a single and a triple! This really emphasizes how important it is to just get on base. Just look at the 2009 Leaderboard. Walk leader Adrian Gonzalez (119) produced 29.2 runs by taking the free pass. Doubles leader Brian Roberts (56)? 23.3 by double. Triples leader Shane Victorino (13)? Just 7.6 by triple. Just reaching first base frequently produces a lot of runs, although nothing quite packs the wallop of the home run.
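If you'd like to check those comparisons yourself, here is a small Python sketch. It uses only the run values and league-leading totals quoted above; the dictionary and variable names are just labels I picked.

```python
# Inherent run values from the list above; a home run is capped at 1 run.
RUN_VALUES = {"one_base": 0.2457, "double": 0.4164, "triple": 0.5825, "home_run": 1.0}

# Gaps between neighboring results.
single_to_double = RUN_VALUES["double"] - RUN_VALUES["one_base"]
double_to_triple = RUN_VALUES["triple"] - RUN_VALUES["double"]
single_to_triple = RUN_VALUES["triple"] - RUN_VALUES["one_base"]
triple_to_homer = RUN_VALUES["home_run"] - RUN_VALUES["triple"]
print(f"{single_to_double:.4f} {double_to_triple:.4f}")  # 0.1707 0.1661
print(f"{single_to_triple:.4f} {triple_to_homer:.4f}")   # 0.3368 0.4175

# 2009 league leaders quoted above (walks count as one-base plays).
leaders = [
    ("Adrian Gonzalez", "one_base", 119),
    ("Brian Roberts", "double", 56),
    ("Shane Victorino", "triple", 13),
]
runs_by_leader = {name: count * RUN_VALUES[kind] for name, kind, count in leaders}
for name, runs in runs_by_leader.items():
    print(f"{name}: {runs:.1f}")  # 29.2, 23.3, 7.6
```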

Anyway, I'm sure you can see how I'm choosing to apply these results. By taking the number of singles+walks, doubles, triples, and home runs a player collects in a season and multiplying each by its inherent value, you can see exactly how many runs a player produced independent of his surroundings. We can get a rate statistic by simply dividing this value by the number of plate appearances. This is where the small errors I mentioned before come in, but I really don't think they make enough of a difference to worry about. Even Chase Utley, the three-time reigning hit by pitch king who actually does seem to have a talent for getting hit, would only lose 6 runs using this method. That's pretty small in the grand scheme of things, though I might just bite the bullet and include hit by pitch anyway. It's starting to niggle at me, though I don't like the idea of messing with a rather pretty looking spreadsheet.
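In code, the whole thing might look something like the sketch below. The function names are hypothetical, the stat line is invented purely for illustration (it is not a real player), and the run values are the rounded ones listed above, so totals can differ slightly from a spreadsheet using unrounded values.

```python
# Inherent run values from the list above; a home run is capped at 1 run.
RUN_VALUES = {"one_base": 0.2457, "double": 0.4164, "triple": 0.5825, "home_run": 1.0}

def season_total(one_base, doubles, triples, home_runs):
    """Runs produced: each count times its inherent run value.

    one_base is singles plus walks; hit by pitch and reached on error
    are ignored, as in the post.
    """
    return (one_base * RUN_VALUES["one_base"]
            + doubles * RUN_VALUES["double"]
            + triples * RUN_VALUES["triple"]
            + home_runs * RUN_VALUES["home_run"])

def per_plate_appearance(total, plate_appearances):
    """Rate version: runs produced divided by plate appearances."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return total / plate_appearances

# Hypothetical stat line, made up for illustration only.
total = season_total(one_base=150, doubles=30, triples=5, home_runs=20)
rate = per_plate_appearance(total, plate_appearances=600)
print(f"{total:.2f} / {rate:.3f}")
```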

I'm in the process of crunching the data for every player, but here are a few notable ones from 2009. I tried to get two greats, two goods, two averages, and two bads, based on other statistics like OPS+. The totals are first, followed by per plate appearance.

MLB 600 PA Average: 65.04 / 0.108
Joe Mauer: 92.19 / 0.152
Albert Pujols: 117.44 / 0.168
Derek Jeter: 88.31 / 0.123
Troy Tulowitzki: 88.94 / 0.142
Curtis Granderson: 85.52 / 0.120
Kevin Kouzmanoff: 59.02 / 0.103
Yuniesky Betancourt: 43.38 / 0.085
Jason Kendall: 43.52 / 0.083

Interesting. Granderson shows the power of the big fly, as his season was, by all accounts, thoroughly average, but his thirty home runs help him sneak right up on Jeter.

Anyway, the only thing left to do is come up with a name. I like acronyms, so I'm trying to think of a good one. The main idea is about the independent value of each offensive play. Call it Independent Run Production. IRP. The rate value can be Independent Run Production Average, or IRPA. Not insane about it, but it'll do for now. Now I just need to crunch the data.

If you have a better name, or any other thoughts, please let me know!

1 comment:

  1. I have done some revisions ... I made a rather sizable logical mistake (I'm sure you can figure out where), but I think I corrected it.
