Batter's Box Interactive Magazine

(Bonus points to anyone who can name the 1980s Canadian band that debuted with that album title.)

There's a lot of sound and fury in sabrmetric circles today, and it's the doing of Bill James. He wrote an article for SABR's Baseball Research Journal titled "Underestimating the Fog," and he's gotten a lot of folks pretty riled up in the process. We don't have access to the essay itself, so here's a link to an article that discusses it.

Summarizing the summary of another person's work is a mug's game, but here I go anyway: James argues that sabrmetrics has fallen into a bad habit. Sabrmetrics has long relied on the principle of persistence to establish whether a given phenomenon is an actual skill or just a coincidence. For example, James says, players with better-than-average stolen base totals and batting average marks one year tend to also post better-than-average totals and marks in subsequent years. They do it in a consistently measurable pattern, so therefore it's a skill.

"Clutch hitting," on the other hand -- hitting for a higher average late in the game, in close games, with the game on the line, etc. -- does not persist. A player who hits .390 in "clutch situations" one year may hit .225 the next year and .281 the year after that. If this player really had a clutch-hitting skill -- for whatever reason, he becomes more focused and more productive in tight situations -- he should be able to do it every year (and frankly, he should be able to do it every at-bat). Because he doesn't, his performance in those situations is considered simply to be luck, coincidence, the alignment of the stars, or what have you. It is transient.

This approach has been used by numerous sabrmetricians, including James, to discount a number of hallowed "truths" held by traditional baseball folk (platoon differentials, catchers' ERA, etc.). In the best traditions of science, they insist on running the numbers and conducting the experiment to see if reality conforms with theory and legend. In many ways, it's the fundamental principle of sabrmetrics, and it has taught us scads of previously unrealized things about baseball.

Only now James isn't so sure. He recently revised his opinion of his own 1988 study that discounted platoon differentials. He thinks that there was a whole lot of noise and static in the numbers he looked at, and that that noise drowned out patterns that might be there. In the platoon differentials situation, he thinks that the randomness, luck and noise might be ten times as loud as whatever signal might be coming through, and that may be why we can't make out the signal.
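To get a feel for how loud that noise is, here is a minimal simulation sketch. It assumes (my reading, not necessarily James's) that "ten times as loud" means the noise has ten times the standard deviation of the real skill. The player count, the .005 skill spread, and every function name are invented for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def simulate(players=500, signal_sd=0.005, noise_ratio=10.0, seed=1):
    """Year-to-year correlation of an observed split when a real skill is buried in noise."""
    rng = random.Random(seed)
    noise_sd = signal_sd * noise_ratio  # assumption: "ten times as loud" = 10x the std. dev.
    skill = [rng.gauss(0.0, signal_sd) for _ in range(players)]  # the real, persistent effect
    year1 = [s + rng.gauss(0.0, noise_sd) for s in skill]        # what we observe in year 1
    year2 = [s + rng.gauss(0.0, noise_sd) for s in skill]        # what we observe in year 2
    return pearson(year1, year2)

def expected_r(noise_ratio):
    """Theoretical correlation under this toy model: signal variance over total variance."""
    return 1.0 / (1.0 + noise_ratio ** 2)

print("observed r:", round(simulate(), 3))
print("expected r:", round(expected_r(10.0), 3))
```

Under this toy model every player has a real skill by construction, yet the expected year-to-year correlation is 1/(1 + 10^2), or about 0.01. A sample of a few hundred players cannot reliably tell that apart from zero, which is the kind of fog the essay seems to have in mind.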

James's essential point, I think, is summarized in this line: "We ran astray because we have been assuming that random data is proof of nothingness, when in reality, random data proves nothing." In other words, just because we can't see a pattern where we were expecting to find one doesn't mean the pattern isn't there. It may simply mean we're not looking hard enough, or not looking at it from the right angle, or that it simply isn't measurable by our current means. But the absence of measurable evidence is not, in itself, proof of anything. Failure to locate is not the same as non-existence.

James concludes his essay with a metaphor of a sentry standing guard on a foggy night:

...A sentry is looking through a fog, trying to see if there is an invading army out there, somewhere through the fog. He looks for a long time, and he can't see any invaders, so he goes and gets a really, really bright light to shine into the fog. Still doesn't see anything.

The sentry returns and reports that there is just no army out there -- but the problem is, he has underestimated the density of the fog. It seems, intuitively, that if you shine a bright enough light into the fog, if there was an army out there you'd have to be able to see it -- but in fact you can't. That's where we are: we're trying to see if there's an army out there, and we have confident reports that the coast is clear -- but we may have underestimated the density of the fog. The randomness of the data is the fog. What I am saying in this article is that the fog may be many times more dense than we have been allowing for.

Let's look again; let's give the fog a little more credit. Let's not be too sure that we haven't been missing something important.

The reaction to the James essay has not been difficult to make out -- the signals are loud and clear. The Primer thread that links to the summary article is quite a read, featuring full-scale rants, discourses on optimum bullpen use, and the always-amusing antagonism of Backlasher. More enlightening commentary can be found (not surprisingly) from TangoTiger, and I'm reliably informed that our very own Craig B is planning an article responding to the James piece in an upcoming edition of The Hardball Times.

James probably does overstate his case somewhat, and there's a key objection that both Tango and Craig have already raised: is there any practical difference between a signal that doesn't exist and a signal that is so small and quiet that it has no effect in practice on a baseball field? They suspect there is not, and I'm inclined to think they're right. But I think the James essay is nonetheless important, for two main reasons.

The first is that it shines a light on an increasingly unfortunate attribute in sabrmetric discussion. Many people who are attracted to the intellectual rigour and accomplishments of sabrmetrics find themselves turned off by what they perceive as an odd mix of condescension and zealotry on the part of some of its proponents (ironically enough, James himself is somewhat infamous for his personal quirks in this respect). Hang around Primer or other sabrmetric-friendly neighbourhoods long enough and you soon get a sense of people who believe that those who disagree with them are really too stupid and naive for words. Primer in particular has fallen into the bad habit of linking to articles from decrepit sportswriters and old-school managers and, essentially, laughing at them.

This stems in part, I think, from a degree of self-satisfaction in some sabrmetric quarters, a quasi-Enlightenment assumption that reason rules here in the light while superstition governs out there in the darkness. You could argue that some folks in the sabrmetric community have gotten a little carried away with themselves. The problem is that this sort of attitude is both infectious and lethal to honest inquiry, insulating the inquirer from the effect of differing viewpoints. An article like James's can serve as an antidote to the smugness that can come with insularity. The most prominent practitioners of this art will argue that they don't have to be pleasant or open in their approach to their work; they only have to be right, and they are. I respectfully disagree with that attitude as a sound pedagogical approach.

The second is that what James is doing here is turning sabrmetric principles on sabrmetrics itself. A prime danger faced by the sabrmetric community, which is in ascendance throughout baseball as never before, is that it will become as stubborn and set in its beliefs and methodologies as the people it used to attack. One of the chief attractions of sabrmetric thought is that it challenges conventional wisdom: "Is this really true? I know it's been accepted as true for decades, but do the facts back it up?" The first sabrmetric writer I read and enjoyed, Rob Neyer, applied that skeptical curiosity to numerous aspects of baseball and showed that many of them were not supported by the evidence. That's the lifeblood of the genre, and it must be protected.

It's important, it's critical, that sabrmetrics always be willing to re-examine its own conclusions, revisit its old attitudes. It's easy to shine the light on the fog, see no army and say "Good old light -- never let me down yet." The desire to always shine brighter and brighter lights on a subject is the engine of inquiry. James is, I think, telling sabrmetricians never to grow too fond of or too complacent about their lights.

Scientific inquiry doesn't have an axe to grind or a philosophy to expound; it's not defensive or angry. It simply wants to know if something is true or not. There is a strain of thought in sabrmetrics -- not a strong one, but not invisible either -- that seems to think baseball evolution ended when Oakland hired Billy Beane as General Manager. Thankfully, the bulk of sabrmetrics has lost none of its rigour and honesty, and is willing to always question the questioner as well as the subject. Essays like "Underestimating the Fog" -- flawed as it might be -- will, I hope, ensure that that rigour continues.

Posted by Jordan on Wednesday, March 16 2005 @ 12:34 PM EST.

Living in a Fog | 16 comments | Create New Account

The following comments are owned by whomever posted them. This site is not responsible for what they say.

cbugden - Wednesday, March 16 2005 @ 02:28 PM EST (#106375) #

The group is the Wonderful Grand Band

Jordan - Wednesday, March 16 2005 @ 02:35 PM EST (#106379) #

And the Newfoundland contingent at Batter's Box rises to two!

Mike Green - Wednesday, March 16 2005 @ 02:53 PM EST (#106391) #

There are many ways of collecting evidence on specific issues- be it clutch hitting, catcher's ERA and so on. Year over year persistence is but one of the measuring sticks.

For instance, career clutch vs. non-clutch hitting is a useful marker. The sample sizes are larger, and not subject to the same kinds of issues as James describes. A similar approach to catcher's ERA can also be applied.

Incidentally, elsewhere in the SABR journal, there's a cool career SB/CS chart for most of the great modern catchers (and some of the lesser ones like Milt May) from Bench on down. Pudge has thrown out 48% of baserunners to lead over Bench by 4%; Piazza has thrown out 20% of baserunners. Over their careers, it looks like the difference in throwing ability between Pudge and Piazza has been worth on the order of 300 runs. Wow.

Chuck - Wednesday, March 16 2005 @ 03:03 PM EST (#106398) #

Don Malcolm has long been a grumpy old man questioning neo-sabermetrics, but his barking has been so loud and so extreme that any message of value he might have been delivering has been lost. His main whipping boys have been the fellows at BP, whom he has even gone so far as to suggest are MBA's gone wild, generating an industry simply to make money, even if they don't actually believe what they are saying.

He has lost all credibility and probably has no audience any more, which is too bad, because he often had interesting things to say. Were it not for his level of self-satisfaction (even worse than that of the sabermetric elite) and his singleminded agenda of attacking BP (and Rob Neyer), he could well be the voice of saber-dissidence that helps keep the saber-minded on the straight and narrow.

That voice hasn't existed, or at least not from anyone that anyone respected (sportswriters expressing an anti-sabermetric view are generally revealing a deeper-seated anti-sabermetric agenda).

There was a thread recently about the tone of the new BP. I am in the process of reading it (cover to cover) and am bothered by the heightened level of arrogance that drips from its pages.

The sabermetric world needs someone with credibility to express dissenting opinion. I am glad Bill James is assuming that role, if only to spark debate.

Andrew K - Wednesday, March 16 2005 @ 04:20 PM EST (#106422) #

I'm happy to see some debate about this, but I'm afraid that it involves the very issue which most turns me off sabermetrics. More precisely, I should say it turns me off sabermetricians. My problem is that I am a statistician (of sorts).

A vanishingly small number - almost none - of those who call themselves sabermetricians have the statistical background sufficient to evaluate the accuracy of their results, whether those results be the measurement of some unknown factor or a prediction. (In this respect MGL and Tango seem to be the outstanding writers, although I would query their methodology in some of the detail.) Even more worryingly, very many of those who write articles and comments on message boards do not even seem to care about this gap in their knowledge. "Oh yes, there are sample size issues" we hear time after time. I always want to scream "well find out how much margin of error there is in your result then and come back when you've completed the study properly".

It's almost like some bizarre other-world logic, where people set such store by measuring everything under the sun, and try to use it to put value on players, but cannot or will not quantify the uncertainty in their measurement. I've no doubt that the really excellent sabermetricians do indeed do this, but the hoi polloi sure don't.

I have found myself so annoyed with the lack of statistical rigour, and the lack of interest shown by so many writers in trying to remedy the situation, that I have been dissuaded from pursuing a proper study of my own. The audience probably wouldn't appreciate it. I find it extremely sad, but it is probably an inevitable result of the necessary stats being quite hard, at least compared with the Excel spreadsheet calculations of numbers themselves which people so blithely throw around.

Anyway, my word on the simple question in the summary - which seems to boil down to whether unmeasurable differences are genuinely unimportant or just unknowable - is that of course they are important, no matter how small. If we had a system for playing poker which was 1% better than anyone else's, we would take our 1% advantage and use it. Never mind that we won't play enough poker ever to tell whether our 1% theoretical advantage translated into serious money. We take every edge we can get.

Now the more serious question is to put a value on your edge. If player A costs £100k more than player B, and maybe possibly has some advantageous makeup but that makeup is too small to be sure about (even based on the whole career up until now) then we need to price our certainty that the advantage exists. I don't see why we can't do that, with a suitably Bayesian approach, although pricing anything in baseball seems to be fraught with (non-statistical) problems due to replacement level and roster limitations.
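As a rough illustration of what "pricing our certainty" could look like, here is a two-hypothesis Bayes sketch. Everything in it is an assumption made up for the example: the 10% prior, the .280 baseline, the 20-point clutch edge, the 100,000 value, and the function names. A proper treatment would use a continuous prior over the size of the edge rather than a yes/no choice.

```python
import math

def posterior_has_edge(hits, at_bats, base_avg, edge, prior):
    """P(player truly hits base_avg + edge in the clutch | observed clutch record)."""
    def loglik(p):
        # Binomial log-likelihood; the binomial coefficient is identical under
        # both hypotheses, so it cancels out of the odds and is omitted here.
        return hits * math.log(p) + (at_bats - hits) * math.log(1 - p)
    log_odds = (math.log(prior / (1 - prior))
                + loglik(base_avg + edge) - loglik(base_avg))
    return 1.0 / (1.0 + math.exp(-log_odds))

def expected_value(posterior, value_if_real):
    """Price the edge as its value if real, discounted by how sure we are that it is real."""
    return posterior * value_if_real

# Hypothetical player: 300 hits in 1000 career clutch at-bats.
p = posterior_has_edge(hits=300, at_bats=1000, base_avg=0.280, edge=0.020, prior=0.10)
print(f"posterior that the edge is real: {p:.2f}")
print(f"what the edge is worth paying for: {expected_value(p, 100_000):,.0f}")
```

The point of the sketch is only that an edge too small to "prove" can still be given a price: the posterior never has to reach certainty for the expected value to be non-zero.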

jeff_h - Wednesday, March 16 2005 @ 05:17 PM EST (#106425) #

Given the caveat that the difference between a de minimis and non-existent effect cannot, you know, be deemed of much greater importance than, you know, de minimis....

I think Bill James' current supporters might be grasping for straws in order to find cause to inveigh against their longerstanding complaint, i.e., that the sabermetric community is arrogant.

Relatedly, the second issue raised is that "sabermetrics is the new old thing," i.e., the conventional wisdom that is too settled in its self-acceptance.

For the first complaint to be more than stylistic complaining, the second complaint would need to be valid. For the second complaint to be valid... the attack on "sabermetric orthodoxy" would need to be more than trivially significant.

Since no one is necessarily persuaded by the "> than trivial test," I think this is more hot air than substance.

Re James -- literally my intellectual hero -- I think his strongly redeeming desire to be contrarian is leading him astray. The same "excessive veneration for contrarianism" has killed "neoliberalism" in American punditry; the short definition of that, btw, is The New Republic and the DLC (the Washington Monthly, which gave birth to it, has become supple enough to move away from reflexive contrarianism).

Mike Green - Wednesday, March 16 2005 @ 05:46 PM EST (#106426) #

The problem does not boil down to whether unmeasurable differences are genuinely unimportant or just unknowable.

It is a critique of the use of a particular measurement device (year over year persistence) to attempt to determine whether a particular phenomenon (clutch hitting, catcher's ERA) exists. Regardless of whether James is correct about the particular measurement device, it does not mean that other measurement devices are incapable of providing reliable data with respect to the phenomenon. James is not saying that "clutch hitting is unmeasurable, but that doesn't mean that it does not exist". What he is saying is: "we cannot infer from the absence of year-to-year persistence of clutch hitting that it does not exist".

One of the other phenomena James mentions is streak hitting. Here is a fascinating MGL study. Note especially MGL's comment #9. In MGL's study, hot hitters had an OPS of .901 in the week subsequent to their hot streak. As a baseline, MGL used their OPS in prior and subsequent years of .870. But, the same hot hitters had an OPS of .885 in the year of the streak, but not including the streak. The conclusions here are not obvious.

Andrew K - Wednesday, March 16 2005 @ 06:00 PM EST (#106429) #

Mike Green, it's difficult to reply to James' article based only on a summary of a summary.

If the question is about year-to-year persistence then we shouldn't forget that, generally, one can only sign a player for approximately a whole year. So effects which do not have predictive value year-to-year aren't much help in valuing players. Of course they might be very useful in informing lineup construction.

The hot hitting question is certainly tricky, although there are statistical techniques which ought to be applied. I remember when I first read that article by MGL, I wanted to get the data and do some "proper" analysis, based on the events themselves (hits, outs, walks, etc) as opposed to the rate stats involved. By "proper" I don't mean that MGL's is flawed, but that there are known statistical techniques to apply to this problem, much more sophisticated than looking at the mean performance after "hot" or "cold" periods, however categorised. This is one of the things I have been put off doing, because of the prevailing sabermetric indifference to statistical rigour.

jeff_h - Wednesday, March 16 2005 @ 06:11 PM EST (#106430) #

1. The "MGL" link does not work.

2. The issue is what to presume from:

(a) an extant hoary cliche,
(b) significant analytic inquiry, and
(c) no evidence of a phenomenon.

Supremely humble people might suggest that one infers nothing. However, every moment of the day we make decisions, such as crossing the street or entering into an elevator without having confirmed that it has been inspected recently, that are philosophically consistent... and suggest that absent real evidence to presuppose the existence of something (danger, clutch hitting, wmds in Iraq, an actually compassionate conservative), we conclude that the thing doesn't exist.

John Northey - Wednesday, March 16 2005 @ 07:05 PM EST (#106435) #

I remember from my years of stats classes how it was incredibly hard to prove something doesn't exist as, in order to do so, you had to check every possible way that it could exist and show it didn't.

Clutch hitting, for example, has yet to be shown but given the low number of AB's per year it is almost impossible to prove it one way or the other. For example, using a basic statistical technique you get, for 100 AB's, a confidence interval of +/-9.8%. This would mean that I could, statistically, say that a guy who hits for a 300 batting average over those 100 AB's could have a 'true' average somewhere between 202 and 398 with 95% confidence. To be 99% confident I'd give a range of 171 to 429. Not very useful.

Now, let's say he had 500 AB's in clutch situations. Now that 300 hitter has a 95% range of 256 to 344. Still way too big a range to be useful. OK, what about 1000 AB's? 269 to 331. 5000 AB's? 286 to 314. 10000? 290 to 310.
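For anyone who wants to check the arithmetic, here is a minimal sketch of the usual normal-approximation margin of error for a proportion. The function names are made up for illustration. The default p=0.5 is the worst-case assumption, and it appears to be what the figures above use; that is an inference from the numbers, not something stated in the comment.

```python
import math

# z-scores for two-sided normal confidence levels
Z = {0.95: 1.96, 0.99: 2.576}

def margin(at_bats, p=0.5, level=0.95):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return Z[level] * math.sqrt(p * (1 - p) / at_bats)

def interval(avg, at_bats, p=0.5, level=0.95):
    """Interval around an observed average, in batting-average points (e.g. 300)."""
    m = margin(at_bats, p, level)
    return round((avg - m) * 1000), round((avg + m) * 1000)

for n in (100, 500, 1000, 5000, 10000):
    print(n, "AB's:", interval(0.300, n))
```

Passing p=0.3 instead of the worst case narrows the ranges a little, but not enough to change the conclusion: the width shrinks only with the square root of the number of AB's.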

It takes a heck of a lot of AB's to prove what a player will hit beyond a statistical doubt. This is why baseball will always have a mix of scouts and stats. Now, how to go beyond this level? Perhaps checking results from the player's individual swings of the bat, where you get a lot more data and could check line drives, swings and misses, strike zone judgement, etc. if someone has a way to track the data in a reliable way.

Now, there are statistical methods that could reduce the margin of error and perhaps get us more useful results but the bottom line is baseball rarely produces enough of any situation for one player for us to judge if that player, via statistics, will produce at a 330 level or a 260 level. We can draw overall conclusions (shortstops as a group will never be as good hitters as first basemen for example) but for one player in one situation? I don't see it unless we find some new ways of measuring the activity beyond stats that are based on the results of each plate appearance (ie: Avg/OBP/Slg/OPS/TA/etc.)

John Northey - Wednesday, March 16 2005 @ 07:10 PM EST (#106436) #

It is true, we trust that the elevator will not fall down the shaft when we get in it. But guess what? Every so often it will go down and crash. We'll jaywalk and get hit once in a while. A conservative will be compassionate sometimes too, trust me.

The question becomes, how much evidence do you need to trust something does or does not exist? Crossing the road? I trust my eyes that no one is charging at me. I could be wrong, but figure the odds are in my favor. If you are a MLB team and you decide to blow an extra $2 million on a player just because you think he is 'clutch', though, you will want something beyond 'I saw Jeter make this amazing play in the playoffs, and he seems to always do it' to justify it.

Andrew K - Thursday, March 17 2005 @ 04:41 AM EST (#106447) #

John, your post is correct (although the confidence intervals will be a little smaller as we can be pretty sure that the player is not really a .500 hitter). And I quite agree that we are going to find it very hard to get evidence that a particular player has "the clutch ability".

But this is not a random drug trial. We don't need to know that a certain player is all but certain to be a clutch player. We just want to know that there's a good chance that he is, price our certainty and the gain we hope to make, and use it to inform lineup construction. This is by no means impossible.

Lineup construction is something which has bothered me for quite a while now, by the way. I can't believe that the old wisdom of leadoff hitter, number three with high average, number four cleanup hitter with power, etc etc, is necessarily the way to maximise run production. And it's the sort of question which ought to be easy to address, and quantify error for, with Monte Carlo methods. This also applies to sac bunt strategies. Does anyone know who has already done this?
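A toy version of such a Monte Carlo study might look like the sketch below. It is deliberately crude and rests on stated assumptions: the three batter profiles are invented, every runner advances exactly as many bases as the hit, and there are no steals, double plays, errors or bunts. The function names are hypothetical too.

```python
import math
import random

# Invented per-plate-appearance probabilities: (walk, single, double, triple, homer).
# Whatever probability is left over is an out.
SLUGGER = (0.12, 0.14, 0.06, 0.00, 0.07)
CONTACT = (0.08, 0.20, 0.05, 0.01, 0.01)
WEAK = (0.05, 0.15, 0.03, 0.00, 0.01)

def plate_appearance(probs, rng):
    """Return 0 for an out, 1-4 for a hit of that many bases, or 'BB' for a walk."""
    r = rng.random()
    for outcome, p in zip(("BB", 1, 2, 3, 4), probs):
        if r < p:
            return outcome
        r -= p
    return 0

def advance(bases, outcome):
    """Crude base-running: a walk forces runners; a hit moves everyone that many bases."""
    first, second, third = bases
    if outcome == "BB":
        runs = 1 if (first and second and third) else 0
        return (True, second or first, third or (first and second)), runs
    runners = [i + 1 for i, on in enumerate(bases) if on] + [0]  # 0 is the batter
    moved = [b + outcome for b in runners]
    runs = sum(1 for b in moved if b >= 4)
    return tuple(i in moved for i in (1, 2, 3)), runs

def game_runs(lineup, rng, innings=9):
    """Runs scored by one team in one simulated game."""
    runs, batter = 0, 0
    for _ in range(innings):
        outs, bases = 0, (False, False, False)
        while outs < 3:
            outcome = plate_appearance(lineup[batter % len(lineup)], rng)
            batter += 1
            if outcome == 0:
                outs += 1
            else:
                bases, scored = advance(bases, outcome)
                runs += scored
    return runs

def estimate(lineup, games=2000, seed=0):
    """Mean runs per game and the standard error of that mean."""
    rng = random.Random(seed)
    totals = [game_runs(lineup, rng) for _ in range(games)]
    mean = sum(totals) / games
    var = sum((t - mean) ** 2 for t in totals) / (games - 1)
    return mean, math.sqrt(var / games)

traditional = [CONTACT, CONTACT, SLUGGER, SLUGGER] + [WEAK] * 5
for name, lineup in (("traditional", traditional), ("reversed", traditional[::-1])):
    mean, se = estimate(lineup)
    print(f"{name}: {mean:.2f} runs/game, +/- {1.96 * se:.2f}")
```

The standard error is the "quantify error" part: two batting orders only differ meaningfully if the gap between their means is large next to those +/- figures, and raising the number of simulated games is the cheap way to shrink them.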

fra paolo - Thursday, March 17 2005 @ 07:17 AM EST (#106450) #

Don Malcolm

People got so irritated by the 'noise' in Don Malcolm's commentaries that they overlooked his good points. One could easily have complained about the noise in the hope of encouraging him to stop (although that might just have egged him on) while still respecting his baseball sense.

It's also fair to say that, as he banged on, his case against BP and Neyer was increasingly misrepresented as sour grapes from someone who had played an important role in keeping the 'abstract' idea going in 89-95, but whose approach proved less commercially attractive than BP's. His dispute with them had, as Chuck suggests above, a strong element of principled opposition. Although he did also trash the New Historical Abstract at one point. He was an equal opportunity blunderbuss.

To me it's a great shame the Big Bad Baseball web site has now vanished, even if he had stopped updating it.

Brande36 - Thursday, March 17 2005 @ 12:58 PM EST (#106474) #

Another complication is that increasing the sample size for a player introduces more variables, e.g., 10,000 AB’s would require over 20 years of data. Over that time, physical abilities peak and decline, while skills such as strike zone judgement may improve significantly. He may also become less anxious (or more) at being in a clutch situation. The 40-year old is not the same player as 20 years earlier.

robertdudek - Tuesday, March 22 2005 @ 12:24 AM EST (#107067) #

In baseball, when dealing with individual players, only large differences will be detectable by statistical analysis.

The fog that James talks about is not only small sample size, but it is also selection bias. Players who have a huge weakness against either lefthanded or righthanded pitchers rarely become major league regulars and so almost all regulars cluster around the "natural" platoon differential.

There are also the various conditions and myriad factors influencing each and every baseball event. These constitute the main density of the fog James discusses.

In baseball, we will never be able to make iron-clad statements, except when the dataset we are dealing with is truly massive. There is just no way sabrmetrics will ever approach the rigour of the physical sciences.

TangoTiger - Tuesday, March 22 2005 @ 11:47 AM EST (#107102) #

Andrew: If you are brave, read these: BO 1 BO 2 BO 3 BO 4 BO 5 It took me halfway through to use the differing PA. Otherwise, wait for our book next year. It'll be 20 pages in Word, and be (hopefully) tight and easy to understand. Same goes for all the other strategies you mentioned. We've done 7 chapters so far, out of 12 to 16. If all that is still not good enough: your #2 hitter should be one of your 2 or 3 best, if not the best, hitter on your team. We're also shopping around, and hopefully we'll get resolution on that soon enough.
