Classical Sabermetrics: What is the Lahman Database?
The beginning of baseball data analysis.
With the rise of tracking technologies such as Hawk-Eye and Statcast, a single baseball game now reportedly produces a whopping seven terabytes of data! This revolution in data collection gives analysts access to information that previously could only be approximated, such as ball spin rate, batted-ball velocity, strike zone areas, and much more. The result is a richer, more accurate analysis of baseball, in which every player’s effectiveness can be examined in far greater depth than was possible decades ago. But how was analysis done before Hawk-Eye and Statcast? Besides the box score, the Lahman database was the premier source of information for all things baseball.
It was created by Sean Lahman in a grueling effort to make relevant baseball statistics openly available, so the public could conduct their own research and analyses.
The Lahman database is a relational database containing tables of hitting, pitching, and fielding statistics for every Major League Baseball player who has appeared in a game since 1871! This means the data set spans every main era of Major League Baseball leading up to the modern one: proto-baseball (pre-1900), dead-ball (1901-1919), live-ball (1920-1941), integration (1942-1960), expansion (1961-1976), free agency (1977-1993), and power (1994-2005).
How do I Utilize the Lahman Database?
You can access the Lahman database on Sean Lahman’s website, https://www.seanlahman.com/baseball-archive/statistics/, where you can download the entire database and work with it locally, using a SQL editor or, if you’re not familiar with computer programming, Microsoft Access. Once the database is loaded into your software of choice, it’s time to query it, that is, to pull out the subset of data you want for your analysis. Let’s say you want the strikeouts, wins, losses, and ERA of Boston Red Sox pitchers after 1955 (an important starting year for sabermetric analysis); a simple query in SQL or Access will do it. Once you have the queried data, you can export it to data analysis software like R or Python to conduct your own analysis. I made a visualization that shows the flow of information from the database to data analysis.
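To make the Red Sox example concrete, here is a minimal sketch of such a query, run from Python with the built-in sqlite3 module. It assumes the data lives in a table named Pitching with the columns playerID, yearID, teamID, W, L, SO, and ERA, as in the Lahman schema, and that Boston is stored as 'BOS'. The player IDs and numbers below are made-up sample rows standing in for the real table, so treat this as an illustration rather than actual results.

```python
import sqlite3

# Build a tiny in-memory stand-in for the Lahman "Pitching" table.
# Column names follow the Lahman schema; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Pitching (
        playerID TEXT, yearID INTEGER, teamID TEXT,
        W INTEGER, L INTEGER, SO INTEGER, ERA REAL
    )
    """
)
sample_rows = [
    ("pitcher01", 1950, "BOS", 10, 8, 95, 4.10),   # too early: filtered out
    ("pitcher02", 1967, "BOS", 22, 9, 246, 3.16),
    ("pitcher03", 1986, "BOS", 24, 4, 238, 2.48),
    ("pitcher04", 1986, "NYA", 18, 6, 131, 3.29),  # wrong team: filtered out
]
conn.executemany("INSERT INTO Pitching VALUES (?, ?, ?, ?, ?, ?, ?)", sample_rows)

# Strikeouts, wins, losses, and ERA for Red Sox pitchers after 1955.
query = """
    SELECT playerID, yearID, SO, W, L, ERA
    FROM Pitching
    WHERE teamID = 'BOS' AND yearID > 1955
    ORDER BY yearID, playerID
"""
red_sox_pitching = conn.execute(query).fetchall()
conn.close()

for row in red_sox_pitching:
    print(row)
```

The SELECT statement is the part that carries over to any SQL editor; against the real database you would skip the table-building lines and connect to your local copy instead.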
This sounds like a long process. What alternatives do I have for accessing baseball data?
The Lahman database is perfectly usable for data analysis today because its data is accurate, but what if you want a quicker way to access baseball data? If you happen to know Python or R, there’s a quick answer.
The pybaseball package for Python gives access not only to the basic metrics found in the Lahman database, but also to data from Baseball Savant, FanGraphs, and Baseball Reference, putting a myriad of information at a Python programmer’s fingertips. Remember the technologies mentioned earlier? pybaseball even includes data gathered by Hawk-Eye and Statcast (albeit not the uncompressed seven terabytes per game).
R has the ‘Lahman’ package, which provides everything in the Lahman database without requiring you to query it in SQL or Access, because the package loads all of the database’s tables into R as data frames.
Besides the ‘Lahman’ package, R also has an equivalent of pybaseball called ‘baseballr’. baseballr does the same things pybaseball does (except in R) and adds functions for calculating metrics like OPS, wOBA, and FIP.
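For readers curious what metrics like OPS and FIP actually compute, here is a minimal sketch in Python using the standard public formulas. This is not how baseballr or pybaseball implement them; the function names, the sample stat lines, and the FIP constant of 3.10 are all assumptions for illustration (the real constant changes from season to season).

```python
def ops(h, bb, hbp, ab, sf, tb):
    """On-base plus slugging: OBP + SLG."""
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)  # on-base percentage
    slg = tb / ab                                # slugging: total bases per at-bat
    return obp + slg


def fip(hr, bb, hbp, so, ip, fip_constant):
    """Fielding Independent Pitching; the constant is league- and season-specific."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + fip_constant


# Hypothetical batter: 150 hits, 50 walks, 5 hit-by-pitches,
# 500 at-bats, 5 sacrifice flies, 250 total bases.
batter_ops = ops(h=150, bb=50, hbp=5, ab=500, sf=5, tb=250)

# Hypothetical pitcher: 20 home runs, 50 walks, 5 hit batters,
# 200 strikeouts, 200 innings pitched; 3.10 is a placeholder constant.
pitcher_fip = fip(hr=20, bb=50, hbp=5, so=200, ip=200, fip_constant=3.10)

print(f"OPS: {batter_ops:.3f}")
print(f"FIP: {pitcher_fip:.3f}")
```

The inputs map onto counting stats the Lahman tables already hold, which is why these metrics are easy to derive once the data is in your analysis software.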
You may ask what the point of querying the original database is when there are quicker ways of getting the data you need. You might want to analyze the data in Excel, in which case querying the database in Access would be ideal. You might also want to practice SQL, and given how large the data set is, the Lahman database is a great one to learn and practice on.
In terms of the diagram above, using one of the packages described earlier eliminates steps 1 and 2, because the data is already imported into your data analysis software!
With more robust data sources like Baseball Savant available, why is the Lahman database important?
Besides being a great way for beginners in sabermetrics to practice important and demanding data skills like SQL and R, the Lahman database is a historical landmark in sports analytics.
When Sean Lahman organized his database, he not only revolutionized casual sabermetrics, he changed all of baseball analytics with his elegant organization of data. The Lahman database sowed the seeds of what was to come after its creation in July 1995: the Red Sox’s 2004 victory using ‘moneyball’ tactics, which started to popularize sabermetrics in a Major League setting, and the mainstream adoption of data science in baseball with the introduction of Statcast in 2015.
It is through the Lahman database that beginner analysts can start to understand the importance of clean, easily accessible data and be reminded of the data revolution it started in baseball.
Important Links:
https://bit.ly/3Op9HEH - The R ‘Lahman’ package (link to the PDF documentation)
https://bit.ly/3QuYdkZ - The Python ‘pybaseball’ package (link to the documentation on GitHub)
https://bit.ly/3NZX5UR - Website to learn MySQL (a commonly used SQL database system)