Since starting my senior year of college, I’ve gotten involved with sports analytics organizations on my campus and just started a research project in baseball modeling, which has led me to ponder why exactly I am so fixated on baseball and what precisely baseball research looks like. The latter is the primary subject of this quick article, but the former, ‘why exactly am I so interested in baseball?’, is a question one ought to ask at some point. My answer: I don’t know, I just think it’s fun. So what does research in baseball look like?

Research Question
Well, it’s a lot like research in a ‘standard’ environment: you start with a broad category you’re interested in. In physics it might be about quasi-stellar satellites; in baseball it might be going deeper into what it means to be an undervalued player. The only rule for choosing a category is that it has to be something you’re genuinely passionate about, because if you have the passion, then generating inquiry comes naturally. For me, I really enjoy investigating pitching and the value of players (as you may have seen with my article on Corbin Burnes). Use the passion you have for a topic to generate a research question, that is, a question that drives the investigation you want to spend time on. It doesn’t have to be original in the beginning; just developing an initial question is a great baseline for further refinement.
Ask questions such as: Will it challenge a common idea? Will it be particularly helpful to other researchers? Can this open more areas of research?
Literature Review
After you have your research question, the next step is to review the literature on current research in the specific subfield you’re investigating. If you’re in school, go to your library database and start looking up research articles; if you’re independent, Google Scholar is a good place to start. For example, if I’m interested in baseball modeling, I might search for ‘Bayesian modeling in baseball’ or ‘machine learning in baseball’ as a starting point.
Use the literature you find to steer your investigation. Take note of the practices commonly used when writing a research paper in your field, so that when the time comes you can write your own paper without too many issues. For research projects that are heavily data based (a lot of research in today’s era), it is also good to know where the existing literature got its data, so that you can potentially get data from that source as well (assuming it’s publicly available).
Data Attributes and Manipulation
The most important aspect of most research projects in today’s era is probably the data you can find, because you can’t do any form of prediction, classification, or analysis without it. Luckily, baseball data is publicly available in many places, primarily Baseball Savant and FanGraphs; there are other sources, but these two offer a lot to an amateur such as myself.
You often only want a subset of the data you might query from these websites, so you definitely want to be comfortable with querying tools, whether that be the tools built into the websites, SQL, pandas in Python, or dplyr in R.
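To make the idea of taking a subset concrete, here is a minimal pandas sketch. The table is a toy stand-in for a pitch-level export: the column names (pitcher, pitch_type, release_speed), the pitcher labels, and every value are assumptions made up for illustration, and fastballs_over is a hypothetical helper, not part of any library.

```python
import pandas as pd

# Toy stand-in for a pitch-level export; all names and values are made up.
pitches = pd.DataFrame({
    "pitcher": ["Pitcher A", "Pitcher A", "Pitcher B", "Pitcher B", "Pitcher C"],
    "pitch_type": ["FF", "SL", "FF", "CU", "FF"],
    "release_speed": [96.1, 87.4, 93.0, 79.8, 98.2],
})

def fastballs_over(df, min_speed):
    """Return rows labeled 'FF' (four-seam fastball) at or above min_speed mph."""
    mask = (df["pitch_type"] == "FF") & (df["release_speed"] >= min_speed)
    return df.loc[mask, ["pitcher", "release_speed"]].reset_index(drop=True)

# Subset: only the hard fastballs.
hard_fastballs = fastballs_over(pitches, 95.0)

# Summary: average release speed per pitcher across all pitch types.
avg_speed_by_pitcher = pitches.groupby("pitcher")["release_speed"].mean()

print(hard_fastballs)
print(avg_speed_by_pitcher)
```

The same filter-then-summarize pattern carries over to SQL (WHERE and GROUP BY) or dplyr (filter and summarise), so it matters less which tool you pick than that you get comfortable with one of them.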
Once you have the data manipulated (this is a time-consuming step), you can start implementing whatever analysis or model you want to apply to it. Examples would be exploratory data analysis, or implementing a machine learning model like k-means clustering or linear discriminant analysis.
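As a rough sketch of what the modeling step can look like, here is k-means clustering with scikit-learn on a handful of pitches. The two features (velocity and spin rate), the numbers themselves, and the choice of two clusters are all assumptions for illustration, and cluster_pitches is a hypothetical helper; a real project would use far more data and would need to justify the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up (velocity in mph, spin rate in rpm) pairs, for illustration only.
pitch_features = np.array([
    [96.0, 2300.0],
    [95.2, 2250.0],
    [97.1, 2350.0],
    [80.5, 2800.0],
    [79.8, 2750.0],
    [81.3, 2850.0],
])

def cluster_pitches(features, n_clusters=2, seed=0):
    """Standardize the features, then assign each row to a k-means cluster."""
    # Scale first so spin rate (thousands) does not swamp velocity (tens).
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(scaled)

labels = cluster_pitches(pitch_features)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which pitches end up grouped together. Linear discriminant analysis would be the choice instead if you already had labels (say, known pitch types) and wanted to classify new pitches rather than discover groups.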
Final Thoughts
From my little experience in baseball research, I have learned the basic process of starting research, but by no means am I an expert. I’m sure I have some kinks to work out before writing an extensive guide on how to do baseball research, but this is a good place to start! I will come back to this in the future with revisions covering how to develop a research question in more depth, how to obtain data from baseball data sources, and how to determine what models or analysis to implement.
Various organizations specialize in baseball research, the most popular being SABR, the Society for American Baseball Research, where most of baseball’s famous sabermetricians like Sean Lahman, Bill James, Peter Palmer, and Dick Cramer have done groundbreaking work in sabermetrics. Another big research website that is more community based is FanGraphs, where the writers tend to explain baseball metrics fabulously and experiment with new statistics that are in the works.
Notes
I’ve been busy at work this past month and have sadly had no time to write for my blog. But given the current state of the game, with Aaron Judge and Albert Pujols making history with their absolute power on the field, and the Cleveland Guardians and Seattle Mariners entering the postseason as two teams nobody expected to make such strides at the beginning of the season, I have plenty of content to write about.
I am in the groove of things this semester now, so I am hoping to post more on here!
Like I mentioned at the beginning of the article, I am currently in the process of developing a baseball research project in the classification domain of machine learning. More on that in a future article!