
Certificate of Completion
THIS ACKNOWLEDGES THAT
Owen Goff
HAS COMPLETED THE FALL 2025 DATA SCIENCE BOOT CAMP
Roman Holowinsky, PhD
DIRECTOR
NOVEMBER 13, 2025
DATE

TEAM
Sports Analytics
Owen Goff, Andrei Prokhorov, Garo Sarajian, Jonghyun Lee

In this project we analyze pitch-level data from the pybaseball GitHub repository. We use the 2025 season, which contains 754,044 pitches and 119 features. Our goal is to predict a decline in the pitcher's performance and to suggest a substitution. The data is imbalanced, since most pitches do not lead to a substitution. The standard feature for predicting the outcome is the pitch count, which behaves differently for starting pitchers and relief pitchers. Logistic regression on the pitch count alone gives an F1 score of 0.24, while XGBoost trained on a broader set of features gives an F1 score of 0.32. Additionally, we implemented RE288 (run expectancy 288) to convey information to the manager. It gives the expected number of runs for a given pitcher in each of the 288 possible states the game can be in at the time of the pitch: 24 base-out states combined with 12 ball-strike counts. We also included a Kaplan-Meier estimator and a proportional hazards model from survival analysis, and a recurrent neural network, as alternatives.
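
The RE288 state space described above can be illustrated with a minimal sketch. It assumes the usual run-expectancy convention (the average number of runs scored from a given state to the end of the half-inning), which may differ in detail from the team's implementation. The names `STATES` and `run_expectancy` and the toy observations are hypothetical and are not taken from the project or from pybaseball.

```python
from collections import defaultdict
from itertools import product

# 8 base occupancies (1B, 2B, 3B each empty or occupied) x 3 out counts
# give the 24 base-out states.
BASES = list(product((0, 1), repeat=3))
OUTS = (0, 1, 2)
# 4 ball counts x 3 strike counts give the 12 ball-strike counts.
BALLS = (0, 1, 2, 3)
STRIKES = (0, 1, 2)

# Every state is a tuple (bases, outs, balls, strikes): 24 * 12 = 288 in total.
STATES = [
    (bases, outs, balls, strikes)
    for bases in BASES
    for outs in OUTS
    for balls in BALLS
    for strikes in STRIKES
]


def run_expectancy(observations):
    """Average the runs scored after each observed state.

    `observations` is an iterable of (state, runs_rest_of_inning) pairs,
    one per pitch. States that never occur are left out of the result.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, runs in observations:
        totals[state] += runs
        counts[state] += 1
    return {state: totals[state] / counts[state] for state in counts}


# Toy observations with invented run totals, for illustration only.
bases_empty = ((0, 0, 0), 0, 0, 0)   # nobody on, 0 outs, 0-0 count
bases_loaded = ((1, 1, 1), 0, 3, 0)  # bases loaded, 0 outs, 3-0 count
toy = [
    (bases_empty, 0),
    (bases_empty, 1),
    (bases_loaded, 2),
    (bases_loaded, 3),
    (bases_loaded, 1),
]
table = run_expectancy(toy)
```

To obtain a per-pitcher table, as in the description above, the same averaging would be run on the subset of pitches thrown by that pitcher.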
