Thanks to the miracle of technology and the relatively advanced state of sabermetrics, evaluating a pool of ballplayers statistically, even one as large and disparate as the NCAA, has become pretty easy if you want to pour in the skull sweat, and can find the data.
I set out to do both, and got huge slices of help from friends, and also from people I'd never met before. Before I get into the whys, hows, and whats of my research, I want to thank four terrific people for their help with this project, of which this article is but a tiny part.
The first is Boyd Nation, the most knowledgeable layman in America about college baseball. Boyd is a true gentleman, and has managed to put together 99% of the data on which this project is based, and made it available for free to anyone who wants it. When I needed assistance, he was there within 24 hours to give me what I needed with kind wishes.
I would also like to thank my friend Vinay Kumar, who helped me greatly when I ran into methodological difficulties. Also, Joe C. and Noffs, readers of Baseball Primer, who offered assistance without any chance of reward. Thanks, guys.
As I said above, I set out to evaluate the pool of NCAA Division I players. Division I is huge: there are over 280 participating schools, and 8,000 players compete in any given year. What's worse, there is a wide variation in ballparks, much more so than in MLB, and an even wider variation in competition. Unlike MLB, where teams play relatively equal schedules, some NCAA teams will play 80% or more of their games against teams in the top 40% (or bottom 40%) of the talent pool.
Once I had found that data was available on every hitter and pitcher to play Division I baseball in 2002 and 2003, I knew that I would have to crunch the numbers and come up with a rating system. After all, what is a sabermetrician without a rating system? To begin with (for the first iteration of this project at least) I have chosen a very simple system, used to good effect by Lee Sinins: the RCAA/RSAA evaluation system. It uses runs allowed by pitchers and runs created by batters, and compares players to their league averages, giving a run value above or below average for the number of outs (or innings) a player uses or records.
RCAA/RSAA is a particularly useful method of analysis for college players. Most of us, myself included, don't have any particular interest in college play per se; rather, we are interested in analyzing the performance of the top players, the actual prospects. Using a "baseline" of an average college player is probably more in line with what we are interested in than using "replacement level," which, in addition to being relatively difficult to calculate (not least because talent at the Division I level is very unevenly distributed), tells us little about how the best players compare to one another.
I tweaked the system slightly. Knowing that the college game is different from the pro game, I decided on a linear run estimation system rather than a non-linear one, and so chose Jim Furtado's xRuns, a simple system. I found that for the entire pool of players, xR (as it is known) underestimated scoring by about 6%. This is presumably due to the higher run-scoring environment in the metal-bat game; the average team in Division I scores about 6.5 runs per game. This makes sense: in an environment with more hits and more men on base, each offensive event has a greater impact than it otherwise would.
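To make the linear-weights idea concrete, here is a minimal sketch of an xR-style estimator. The event weights below are the commonly published MLB-derived eXtrapolated Runs values for a few core events (several terms, such as GIDP and sacrifices, are omitted for brevity); whether these exactly match the implementation used for this study is an assumption on my part.

```python
def xr_estimate(singles, doubles, triples, homers, walks_hbp,
                steals, caught, outs_in_play, strikeouts):
    """Rough linear-weights run estimate from basic batting events."""
    return (0.50 * singles + 0.72 * doubles + 1.04 * triples
            + 1.44 * homers + 0.34 * walks_hbp
            + 0.18 * steals - 0.32 * caught
            - 0.090 * outs_in_play - 0.098 * strikeouts)

def xr_calibrated(raw_xr, shortfall=0.06):
    """Scale raw xR up by the ~6% pool-wide shortfall noted in the text."""
    return raw_xr * (1.0 + shortfall)
```

Note that since every player is ultimately compared to league-average xR rather than to actual runs scored, the global 6% shortfall cancels out of the ratings either way.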
When I redo the study and update my spreadsheet, and for future articles, I am hoping to use a run estimation tool that better approximates runs scored. All suggestions are welcome. At any rate, I will be using xRAA instead of RCAA to present my data (I should note that all xR figures are compared to league average xR, not runs scored, so that the 6% discrepancy doesn't work its way into the ratings). The problem is the data available: the team totals are, unfortunately, not reliable (so NewRC is out as a method), and the available historical information is sketchy. But we'll figure something out.
Once I had xR for every player in the database, I recognized the need to make two adjustments. The first was a park adjustment. Park adjustment is a well-understood technique and I don't need to get into the theory of it here, but I should point out that college ballparks have a much wider distribution of park factors than pro parks, which reflects not only the larger number of teams, but also the greater geographic diversity and diversity of facilities.
Thankfully, my efforts were considerably sped along by Boyd Nation, who has conventionally calculated four-year park factors available for 2002 and 2003, along with all the other data on his website. The 2003 factors have an additional benefit: park factors based on a weighted average of all the parks a team played in over the course of a year. This is very important information; teams in the Mountain West conference, for instance, play dozens of games a year at high altitude (San Diego State's home park factor is 104, but the average of all parks they played in was 114!). The use of four-year park factors reflects the shorter NCAA season, almost always fewer than 70 games and usually fewer than 60.
For 2002, the park factors are also four-year, but aren't weighted averages, just home factors. I am still tweaking the PFs for 2002, based on home/road+neutral games, but for now I use the PFs in an absolutely standard way.
Park-adjusted numbers, though, aren't the whole story, because of the need to adjust for various levels of competition. Division I ranges from Texas and Florida State, who would be competitive with teams from the low-level minors, to Alabama A&M and Western Illinois, who make up the numbers. In particular, there are massive differences in scheduling. Any two given NCAA players not only won't play similar schedules, they likely won't have any common opponents. So Mitch Maier's performance at Toledo, impressive as it is, is not in fact better than Beau Hearod's at Alabama. It looks like a better performance, but that's only because it is compiled against inferior competition. Toledo's opponents had a .474 winning percentage; their opponents' opponents also had a .474 winning percentage. Alabama, meanwhile, scheduled opponents with a collective .582 winning percentage, and their opponents' opponents had a .546 winning percentage. (All opponents' winning percentages (OppWP) and opponents' opponents' winning percentages (OppOppWP) listed are weighted for the number of games played versus each team.)
So what can we do with this? Ideally, it would be great to adjust for the quality of the pitching staff (offense for pitchers) that each batter faced. Unfortunately, my data is not detailed enough for this, nor are my analytical abilities that advanced. But I settled on a solution that allows for an adjustment to level of competition, without making particular adjustments for offense or defense.
First, I used OppWP and OppOppWP to derive a "true ability" winning percentage for each team's opponents. I used the "log5" method to do this, but if you want a basic approximation, adding the OppWP and OppOppWP and subtracting .500 usually gets close enough. This estimates what record the team's opponents would have had against a baseline level of competition, in this case .500.
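Both versions can be sketched in a few lines. The odds-ratio form below is the standard log5-style adjustment; that this is precisely the form used in the study is an assumption, and as the quick check shows, it lands very close to the back-of-envelope version anyway.

```python
def true_opp_wp(opp_wp, opp_opp_wp):
    """Log5-style estimate of opponents' 'true' winning percentage against
    a .500 baseline: combine the opponents' observed WP with the strength
    of *their* schedule via an odds-ratio adjustment."""
    num = opp_wp * opp_opp_wp
    return num / (num + (1.0 - opp_wp) * (1.0 - opp_opp_wp))

def true_opp_wp_approx(opp_wp, opp_opp_wp):
    """The quick approximation from the text: add the two, subtract .500."""
    return opp_wp + opp_opp_wp - 0.500

# Using the schedules cited in the text:
toledo = true_opp_wp(0.474, 0.474)    # about .448
alabama = true_opp_wp(0.582, 0.546)   # about .626 (approximation gives .628)
```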
Once I had done this, I used the Pythagorean formula. Or rather, the Reverse Pythagorean, which sounds like a most unmentionable perversion but is really quite useful in this context. What we need is not an approximation of the quality of a team's opponents in terms of wins and losses; in order to adjust a measure of talent expressed in runs, what we need is an approximation of the quality of a team's opponents measured in runs.
Converting runs to wins, of course, is done by the Pythagorean formula. Converting wins back to runs can be done the same way. First, we assume that each team's won-lost record is equally due to offense and defense (or pitching-plus-defense, if you prefer). Yes, this is a pretty massive assumption, but it's necessary for the time being given the data we have. (This is why I refer to this adjustment as a "competition adjustment": it only adjusts for the level of competition, not for the player's actual opposition.) Then we can plug the opponents' winning percentage back into the Pythagorean formula and derive a number of runs scored and allowed for that team. That, essentially, is the quality of the offense and defense that team faced that year, all measured against a baseline of the Division I average (a .500 team).
If you're interested, Vinay Kumar contributed the calculation. It uses WP/(1-WP), the won-lost ratio, rather than the straight winning percentage, and derives a multiplier that indicates what fraction of league average a team's offense/defense is. The logic: the Pythagorean formula (with an exponent of 2) implies that the won-lost ratio equals the square of the run ratio R/RA, so the run ratio is the square root of the won-lost ratio; splitting that ratio equally between offense and defense gives each side a factor of the fourth root. That multiplier, the Competition Adjustment, is the fourth root of the won-lost ratio:
(WP/(1-WP))^0.25
So for Arkansas State 2003, for instance, whose opponents' won/lost ratio (adjusted for opponents' opponents) was just about 1.33, the formula yields a Competition Adjustment of 1.073. In other words, we assume that Arkansas State's opponents were 7.3% above average in both run scoring and run prevention.
This allows us to make a very simple adjustment to the "average" level for RSAA and xRAA: we move the "average" baseline to which the Arkansas State players are compared up 7.3% for pitchers, and down 7.3% for hitters.
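The Competition Adjustment and the baseline shift can be sketched as follows, with the Arkansas State figures from the text as a sanity check. Interpreting "down 7.3%" literally as multiplication by (2 - adjustment) is my reading; dividing by the adjustment instead would give a nearly identical result.

```python
def competition_adjustment(opp_wl_ratio):
    """Competition Adjustment: fourth root of opponents' adjusted
    won-lost ratio, i.e. (WP/(1-WP))**0.25."""
    return opp_wl_ratio ** 0.25

def competition_adjustment_from_wp(opp_wp):
    """Same calculation, starting from a winning percentage."""
    return competition_adjustment(opp_wp / (1.0 - opp_wp))

def shifted_baselines(lg_avg_runs, comp_adj):
    """Shift the 'average' baseline: up for pitchers (who faced
    above-average hitters), down for hitters (who faced above-average
    pitchers). Returns (pitcher_baseline, hitter_baseline)."""
    return lg_avg_runs * comp_adj, lg_avg_runs * (2.0 - comp_adj)

# Arkansas State 2003: opponents' won-lost ratio of about 1.33
# yields an adjustment of about 1.073, as in the text.
ca = competition_adjustment(1.33)
```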
Once the park adjustments and competition adjustments are applied to xRAA, we get park-and-competition-adjusted xRAA, or **xRAA; in keeping with the convention of one asterisk for park-adjusted numbers, we'll use two for park-and-competition adjustment. This measures how many runs above average a player would have been, in the same number of opportunities, against perfectly average opposition in a perfectly average park. It puts every hitter in the NCAA on the same footing, and we can compare them directly.
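Assembling the pieces for a hitter might look like the sketch below. The exact order of operations is my reading of the method, not necessarily the spreadsheet's, and the numbers in the check are invented for illustration.

```python
def hitter_xraa2(player_xr, outs, lg_xr_per_out, park_factor, comp_adj):
    """Sketch of park-and-competition-adjusted runs above average (**xRAA)
    for a hitter. park_factor is on the 100-is-neutral scale."""
    # neutralize the player's park(s)
    park_adjusted_xr = player_xr / (park_factor / 100.0)
    # an average hitter facing this schedule would create fewer runs,
    # so the baseline moves down by the competition adjustment
    baseline = lg_xr_per_out * outs / comp_adj
    return park_adjusted_xr - baseline

# Hypothetical: 60 xR in 150 outs, league average 0.30 xR/out,
# a 110 park factor, and a 1.05 competition adjustment.
example = hitter_xraa2(60.0, 150, 0.30, 110.0, 1.05)
```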
So I'll stop, and present a top 50 for 2003. Eventually, I will have numbers for every player in 2002 and 2003 available. The only comment I will make on this table is that the Blue Jays managed to select the fourth-best hitter in the NCAA in 2003 in the 18th round of the amateur draft. As in any analysis of fewer than 300 plate appearances (remember, these are short seasons!), small sample size warnings apply.
Top 50 hitters, NCAA Division I, 2003
Rk Name Team **OWP **xRAA
1 Jeremy Cleveland North Carolina .910 70.1
2 Michael Aubrey Tulane .907 66.8
3 Rickie Weeks Southern .930 64.2
4 Ryan Roberts Texas-Arlington .889 54.9
5 Brian Buscher South Carolina .849 50.7
6 Ricardo Nanita Florida International .879 47.9
7 Stephen Drew Florida State .841 47.2
8 Tony Richie Florida State .843 46.3
9 Tony McQuade Florida State .831 45.7
10 Jonny Kaplan Tulane .825 44.9
11 Josh Anderson Eastern Kentucky .847 44.8
12 Michael Johnson Clemson .870 44.2
13 David Murphy Baylor .837 43.8
14 Carlos Quentin Stanford .842 43.7
15 Matt Hopper Nebraska .851 42.8
16 Sean Farrell North Carolina .824 42.6
17 Chris Durbin Baylor .819 42.5
18 Brian Snyder Stetson .846 42.4
19 Lee Curtis College of Charleston .826 41.9
20 Dustin Majewski Texas .827 41.6
21 Beau Hearod Alabama .841 41.6
22 Jeff Larish Arizona State .831 41.3
23 Ryan Garko Stanford .833 41.1
24 John Gragg Bethune-Cookman .847 41.0
25 Chad Hauseman Jacksonville .867 39.9
26 Adam Boeve Northern Iowa .837 39.6
27 Jeff Fiorentino Florida Atlantic .830 39.3
28 Ryan Braun Miami, Florida .829 39.0
29 Michael Brown William and Mary .845 38.7
30 Aaron Hill Louisiana State .816 38.5
31 Clint King Southern Mississippi .803 36.6
32 Landon Powell South Carolina .804 36.5
33 Jeff Cook Southern Mississippi .797 36.4
34 Mitch Maier Toledo .842 36.0
35 Jamie Hemingway North Carolina-Wilmington .803 35.7
36 Ryan McGraw Coastal Carolina .781 35.6
37 Neil Sellers Eastern Kentucky .799 35.5
38 Anthon Garibaldi Southeastern Louisiana .857 35.4
39 Jordan Foster Lamar .815 35.0
40 Brad Snyder Ball State .824 34.9
41 Conor Jackson California .864 34.8
42 Kevin Melillo South Carolina .799 34.8
43 Brian Hopkins Southeast Missouri State .821 34.6
44 Ryan Mulhern South Alabama .801 33.9
45 Keith Brachold Marist .788 32.8
46 Eddy Martinez Florida State .842 32.5
47 Ryan Gordon North Carolina-Greensboro .789 32.0
48 David Coffey Georgia .815 31.6
49 Christian Snavely Ohio State .794 31.5
50 David Castillo Oral Roberts .782 31.3