Analyzing Pitching, Fielding and Hitting as They Relate to Win Percentages
Modern life is filled with numbers. Analytics is becoming increasingly important to almost every aspect of life, and sports is no different. Baseball has long been one of the most studied sports in terms of data analytics, because the rules of the game lend themselves to straightforward analysis: every pitch is a single experiment. This paper investigates how well several statistical measures correlate with how a team performs on the field. The starting point is the correlation between some of the "basic" statistics and winning, followed by some "advanced" stats to see how well they capture whether teams are picking up wins on the field.
In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import csv,random,time
#Load data and calculate the required advanced statistics. Note: this file location is for working locally and will need
#to be updated for other systems.
Teams_data = pd.read_csv('Project 1 Baseball\\baseballdatabank-2017.1\\core\\Teams.csv')
#Replace Blanks as NaN
#basestring exists only in Python 2; str is the Python 3 equivalent
Teams_data = Teams_data.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)
print(Teams_data.isnull().sum())
The summary above shows that quite a bit of data is missing from the set. This is a drawback to analyzing it: many of the statistics were simply not tracked early in baseball history, or were not recorded because of outside events (the World Wars, for example). The two columns whose gaps matter most for our analysis are HBP and SF (hit by pitch and sacrifice flies). In this data set they are not recorded until the 2000 season, so our analysis will focus on data after 1999. That suits this study well, since the period falls inside a single era of baseball in which the game's rules have been consistent.
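To make the gap concrete, one way to check where the missing values sit is to compare the share of missing HBP and SF entries before and after the 2000 season. The sketch below uses a small made-up table (the `toy` frame and its values are hypothetical, not taken from Teams.csv); the same pattern should apply to `Teams_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Teams table; values are made up for illustration.
toy = pd.DataFrame({
    'yearID': [1998, 1999, 2000, 2001],
    'G':      [162, 162, 162, 162],
    'HBP':    [np.nan, np.nan, 51.0, 64.0],
    'SF':     [np.nan, np.nan, 43.0, 49.0],
})

# Share of missing values per column, split at the 2000 season.
is_modern = toy['yearID'] > 1999
missing_share = toy[['HBP', 'SF']].isnull().groupby(is_modern).mean()
missing_share.index = ['pre-2000', '2000 onward']
print(missing_share)
```

A share near 1.0 for the earlier period and near 0.0 for the later one would support restricting the main analysis to seasons after 1999.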
In [62]:
#Load data and calculate the required advanced statistics. Note: this file location is for working locally and will need
#to be updated for other systems.
Teams_data = pd.read_csv('Project 1 Baseball\\baseballdatabank-2017.1\\core\\Teams.csv')
#Estimate Required Missing Data from other data
sacflies_per_game = (Teams_data['SF'][Teams_data['yearID'] > 1999]/Teams_data['G'][Teams_data['yearID'] > 1999]).mean()
hbp_per_game = (Teams_data['HBP'][Teams_data['yearID'] > 1999]/Teams_data['G'][Teams_data['yearID'] > 1999]).mean()
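#Assign the filled column back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
Teams_data['SF'] = Teams_data['SF'].fillna(Teams_data['G']*sacflies_per_game)
Teams_data['HBP'] = Teams_data['HBP'].fillna(Teams_data['G']*hbp_per_game)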
Teams_data['WinPercent'] = Teams_data['W']/Teams_data['G']
Teams_data['Batting_Average'] = Teams_data['H']/Teams_data['AB']
Teams_data['Singles'] = Teams_data['H']-(Teams_data['2B']+Teams_data['3B']+Teams_data['HR'])
#Plate appearances are different from at bats, since they include walks, hit by pitches and sacrifice flies.
Teams_data['PA_equiv'] = Teams_data['AB']+Teams_data['HBP']+Teams_data['BB']+Teams_data['SF']
Teams_data['OBP'] = (Teams_data['H']+Teams_data['BB']+Teams_data['HBP'])/Teams_data['PA_equiv']
Teams_data['SLG'] = (Teams_data['Singles']+2*Teams_data['2B']+3*Teams_data['3B']+4*Teams_data['HR'])/Teams_data['AB']
Teams_data['OPS'] = Teams_data['OBP']+Teams_data['SLG']
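#DICE (Defense-Independent Component ERA) is a fielding-independent measure of pitching developed by Clay Dreslough.
#The published formula divides by innings pitched, so IPouts (outs recorded) is converted to innings first.
#Hit batsmen from the published formula are not included here.
Teams_data['DICE'] = 3.0+((13*Teams_data['HRA']+3*Teams_data['BBA']-2*Teams_data['SOA'])/(Teams_data['IPouts']/3.0))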
#Select Modern Data and re-index
Teams_data_modern = Teams_data[Teams_data['yearID'] > 1999]
Teams_data_modern = Teams_data_modern.set_index('yearID')
#Group Data and calculate Statistics
grouped_teams = Teams_data_modern.groupby(Teams_data_modern.index)
league_BA = grouped_teams['Batting_Average'].mean()
league_era = grouped_teams['ERA'].mean()
league_E = grouped_teams['E'].mean()
league_ops = grouped_teams['OPS'].mean()
league_obp = grouped_teams['OBP'].mean()
league_slg = grouped_teams['SLG'].mean()
league_DICE = grouped_teams['DICE'].mean()
league_FP = grouped_teams['FP'].mean()
Teams_data_modern['OPSPlus'] = 100*(((Teams_data_modern['OBP']/league_obp)+(Teams_data_modern['SLG']/league_slg))-1)
Teams_data_modern['ERAPlus'] = 100*(2-Teams_data_modern['ERA']/(league_era*Teams_data_modern['PPF']/100))
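The formulas in the cell above can be easier to follow on a single stat line. The sketch below restates them as plain functions and applies them to one made-up team; the function names and all of the numbers are hypothetical and only meant to mirror the column arithmetic used in this notebook (which, like the cell above, applies a park factor to ERA+ but not to OPS+).

```python
def rate_stats(H, doubles, triples, HR, AB, BB, HBP, SF):
    """Return (OBP, SLG, OPS) using the same arithmetic as the notebook columns."""
    singles = H - (doubles + triples + HR)
    obp = (H + BB + HBP) / float(AB + BB + HBP + SF)
    slg = (singles + 2 * doubles + 3 * triples + 4 * HR) / float(AB)
    return obp, slg, obp + slg

def ops_plus(obp, slg, lg_obp, lg_slg):
    """OPS relative to the league average; 100 means league average."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)

def era_plus(era, lg_era, ppf):
    """ERA relative to the park-adjusted league average; higher is better."""
    return 100 * (2 - era / (lg_era * ppf / 100.0))

# Made-up season totals for one team.
obp, slg, ops = rate_stats(H=1500, doubles=300, triples=30, HR=200,
                           AB=5500, BB=500, HBP=50, SF=50)
print(round(obp, 3), round(slg, 3), round(ops, 3))

# A team that exactly matches the league averages scores 100 by construction.
print(ops_plus(obp, slg, lg_obp=obp, lg_slg=slg))
print(era_plus(era=4.0, lg_era=4.0, ppf=100))
```

Both "plus" statistics are centered on 100, so a value above 100 reads as better than the league average for hitters (OPS+) and for pitchers (ERA+) alike.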
In [48]:
#Produce Basic Box Plots for some averages
grouped_data = pd.concat([league_BA,league_ops,league_era,league_DICE,league_FP,league_E],axis =1)
grouped_data.plot(kind='line',subplots=True, layout=(3,2), sharex=False, sharey=False, figsize = (12,12),title='Yearly Averages For Various Statistics')
print("")
grouped_data.plot(kind='box', subplots=True, layout=(3,2), sharex=False, sharey=False, figsize = (12,12),title='Yearly Averages Distribution For Various Statistics')
Out[48]:
We can see some interesting trends. In any given year the hitting statistics and pitching statistics tend to follow a similar path. This shouldn't be surprising, since both sets of statistics record the same matchup between pitcher and batter. Likewise, it should not surprise us that the number of errors committed has decreased as the average fielding percentage has increased. Comparing the box plots of the traditional statistics with the advanced ones, batting average shows a somewhat larger spread than OPS. ERA varies more near its upper quartile, whereas DICE varies more near its bottom quartile.
Now that we have seen some overall trends in our statistics, let's look at how they correlate with team wins. We will look at the data in two ways. First is an analysis of the whole time period, giving a single overall correlation value. Second, the data is broken down by year so that the year-to-year variation in the correlation can be explored.
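The difference between the pooled correlation and the per-year correlations can be seen on a tiny made-up table. The sketch below is only an illustration: the `toy` frame and its values are hypothetical, and it assumes the per-year result of `groupby(...).corr()` is indexed by year and then by column name, which is how pandas lays it out.

```python
import pandas as pd

# Hypothetical team seasons: three teams in each of two years.
toy = pd.DataFrame({
    'yearID':     [2000, 2000, 2000, 2001, 2001, 2001],
    'WinPercent': [0.60, 0.50, 0.40, 0.58, 0.52, 0.41],
    'ERA':        [3.50, 4.20, 4.90, 4.60, 3.90, 5.10],
})

# Pooled view: one Pearson coefficient over every team season.
overall = toy[['WinPercent', 'ERA']].corr()['WinPercent']

# Per-year view: one correlation matrix per season, stacked in a MultiIndex.
by_year = toy.groupby('yearID')[['WinPercent', 'ERA']].corr()

# Keep only the ERA row of each yearly matrix to get ERA vs. WinPercent by year.
era_by_year = by_year.xs('ERA', level=1)['WinPercent']

print(overall)
print(era_by_year)
```

Because a lower ERA is better, the coefficient against win percentage comes out negative in this toy data; what matters for comparing statistics is its size.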
In [49]:
names = ['WinPercent','ERA','Batting_Average','E']
df_corr = Teams_data_modern[['WinPercent','ERA','Batting_Average','E']].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,4,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Matrix Heat Map, Regular Statistics')
plt.show()
print(df_corr['WinPercent'])
df_corr_groups = grouped_teams[['WinPercent','ERA','Batting_Average','E']].corr()
print(df_corr_groups['WinPercent'])
Not so surprisingly, as the old adage goes, good pitching beats good hitting. Since 2000, teams with better ERAs have posted higher win percentages, with a Pearson correlation coefficient of .624, a moderately strong correlation with pitching. Batting average has not shown such a strong correlation, especially in the last few years. In 2014 the correlation was just .28, and it bottomed out in 2015 at just .08. It was still under its overall correlation last year as well. The swing stat here is errors committed. Some years its correlation is especially high, topping ERA for the highest score. Other years, like 2015, it scored just .08.
Do our advanced stats do a better job in correlating to wins?
In [50]:
names = ['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']
#Select columns with the same list used for the tick labels so the heat map axes are labeled correctly
df_corr_advanced = Teams_data_modern[names].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr_advanced, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,6,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Matrix Heat Map, Advanced Statistics')
plt.show()
print(df_corr_advanced['WinPercent'])
df_corr_groups_adv = grouped_teams[['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']].corr()
print(df_corr_groups_adv['WinPercent'])
In an interesting twist, DICE does not match up with wins as well as ERA does. It still has what most statisticians would call a moderate correlation, but not as strong as regular ERA. OPS does a much better job than batting average alone, although its correlation also seems to have dropped quite a bit from the early 2000s to today. Fielding percentage is a little stronger than raw errors, but the difference is barely measurable. Fielding percentage does seem to swing less from year to year than errors, though.
ERA+ and OPS+ are further refinements: both compare a team against the league average and show how much better (or worse) the team is than that average.
The analysis below pulls the team with the best winning percentage in a given year, then finds the teams with the best OPS+ and ERA+, so we can see how often the three line up.
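The "best team per year" selection can be sketched on a small made-up table before running it on the real data. Everything here is hypothetical (the `toy` frame, the team names A/B/C, and the `leaders` helper); the sketch also assumes there are no ties for the yearly maximum, since a tie would give two leaders for one season.

```python
import pandas as pd

# Hypothetical seasons for three teams, indexed by year like Teams_data_modern.
toy = pd.DataFrame({
    'yearID':     [2000, 2000, 2000, 2001, 2001, 2001],
    'name':       ['A', 'B', 'C', 'A', 'B', 'C'],
    'WinPercent': [0.60, 0.50, 0.40, 0.48, 0.58, 0.45],
    'ERAPlus':    [120.0, 95.0, 85.0, 90.0, 115.0, 97.0],
    'OPSPlus':    [98.0, 110.0, 92.0, 101.0, 108.0, 91.0],
}).set_index('yearID')

def leaders(df, column):
    """Name of the team with the highest value of `column` in each year."""
    season_max = df.groupby(level=0)[column].transform('max')
    is_leader = (df[column] == season_max).values
    return df.loc[is_leader, 'name']

# One row per year: who led in wins, ERA+ and OPS+, and whether they agree.
summary = pd.DataFrame({
    'wins': leaders(toy, 'WinPercent'),
    'ERAPlus': leaders(toy, 'ERAPlus'),
    'OPSPlus': leaders(toy, 'OPSPlus'),
})
summary['era_match'] = summary['wins'] == summary['ERAPlus']
summary['ops_match'] = summary['wins'] == summary['OPSPlus']
print(summary)
```

Counting the `True` values in the two match columns gives a direct tally of how often the win leader was also the ERA+ or OPS+ leader.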
In [51]:
#Uncomment this line if this block has already been run to reset the index to yearID
#Teams_data_modern = Teams_data_modern.set_index('yearID')
grouped_teams = Teams_data_modern.groupby(Teams_data_modern.index)
#Pass the aggregation by name ('max'); passing the builtin max is deprecated in recent pandas versions
idwp = grouped_teams['WinPercent'].transform('max') == Teams_data_modern['WinPercent']
best_wins = Teams_data_modern[idwp]
idop = grouped_teams['OPSPlus'].transform('max') == Teams_data_modern['OPSPlus']
best_opsp = Teams_data_modern[idop]
ider = grouped_teams['ERAPlus'].transform('max') == Teams_data_modern['ERAPlus']
best_erap = Teams_data_modern[ider]
print(best_wins[['name','WinPercent','OPSPlus','ERAPlus']])
print("")
print(best_opsp[['name','WinPercent','OPSPlus']])
print("")
print(best_erap[['name','WinPercent','ERAPlus']])
combined_best = best_wins[['name','WinPercent','OPSPlus','ERAPlus']].merge(best_opsp[['name','WinPercent','OPSPlus']],right_index=True, left_index=True)
combined_bests = combined_best.merge(best_erap[['name','WinPercent','ERAPlus']],right_index=True, left_index=True)
The team with the best winning percentage rarely lined up with the best OPS+ list, but did a bit better with the best ERA+ list.
Overall, this research has demonstrated that teams that win in baseball tend to be the ones that pitch well and play enough defense to limit runs. Hitting well also helps, but hitting statistics are somewhat less informative about winning. The advanced statistics do tend to produce a better correlation, but anything that does not deal specifically with scoring or allowing runs does not pair nearly as well as the statistics that do. These correlations of course do not mean that a team with a good ERA+ will always have a better win-loss record, but rather that a team with a good score is also fairly likely to have a good record.
A note on some of the limitations of this data set. Because it records some statistics, namely sacrifice flies and hit by pitch, only from the 2000 season onward, it only makes sense to analyze from that point forward. Secondly, baseball is a game of eras. It does not make a lot of sense to compare the game from over 100 years ago, when players changed teams and leagues multiple times a season, with the modern market, where players are truly professional. These limitations do not affect the analysis laid out in this paper to this point, but anyone who wanted to measure how these statistics matched up throughout history and draw a historical narrative from them would be basing any judgments on estimates.
The final analysis attempts to estimate the correlations for ERA+ and OPS+ on our historical data, despite the limitations stated above. We already dealt with the missing values by estimating them earlier, so what's left is to run a similar analysis with these estimated values.
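It helps to see what the earlier estimate actually does to the historical rows. The sketch below repeats the per-game imputation on a made-up table (the `toy` frame, its values, and the `SF_filled` and `SF_estimated` columns are hypothetical) and adds a flag so estimated values can be told apart from recorded ones.

```python
import numpy as np
import pandas as pd

# Hypothetical team seasons; sacrifice flies are missing before 2000.
toy = pd.DataFrame({
    'yearID': [1950, 1975, 2000, 2001],
    'G':      [154, 162, 162, 160],
    'SF':     [np.nan, np.nan, 40.5, 48.0],
})

# Average sacrifice flies per game over the seasons where SF is recorded.
modern = toy[toy['yearID'] > 1999]
sf_per_game = (modern['SF'] / modern['G']).mean()

# Fill each missing season with games played times the modern per-game rate,
# and keep a flag marking which values are estimates.
toy['SF_filled'] = toy['SF'].fillna(toy['G'] * sf_per_game)
toy['SF_estimated'] = toy['SF'].isnull()
print(toy)
```

Every estimated season gets the same per-game rate, so the filled values differ between teams only through games played. That is why the historical OBP, OPS and OPS+ figures should be read as rough estimates.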
In [54]:
Teams_data_old = Teams_data[Teams_data['yearID'] < 2000]
Teams_data_old = Teams_data_old.set_index('yearID')
#Group Data and calculate Statistics
grouped_teams_old = Teams_data_old.groupby(Teams_data_old.index)
league_BA_old = grouped_teams_old['Batting_Average'].mean()
league_era_old = grouped_teams_old['ERA'].mean()
league_E_old = grouped_teams_old['E'].mean()
league_ops_old = grouped_teams_old['OPS'].mean()
league_obp_old = grouped_teams_old['OBP'].mean()
league_slg_old = grouped_teams_old['SLG'].mean()
league_DICE_old = grouped_teams_old['DICE'].mean()
league_FP_old = grouped_teams_old['FP'].mean()
Teams_data_old['OPSPlus'] = 100*(((Teams_data_old['OBP']/league_obp_old)+(Teams_data_old['SLG']/league_slg_old))-1)
Teams_data_old['ERAPlus'] = 100*(2-Teams_data_old['ERA']/(league_era_old*Teams_data_old['PPF']/100))
In [55]:
names = ['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']
#Select columns with the same list used for the tick labels so the heat map axes are labeled correctly
df_corr_advanced_old = Teams_data_old[names].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr_advanced_old, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,6,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Heat Map Pre-2000, Advanced Statistics Estimate')
plt.show()
print(df_corr_advanced_old['WinPercent'])
df_corr_groups_adv_old = grouped_teams_old[['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']].corr()
print(df_corr_groups_adv_old['WinPercent'])
One of the most interesting things we can find in the data here is that baseball in the early era was much more dependent on how well a team fielded the ball. This is at least in part due to the absence of gloves, and to what is known as the dead ball. The modern game sees the ball changed every few pitches. Early in the game's history a single ball would last the whole game, leading to softer-hit balls.
We can also run this analysis for the entire data set, remembering it will be tough to draw any meaningful conclusions from our estimated data.
In [65]:
#Group Data and calculate Statistics
Teams_data_all = Teams_data.set_index('yearID')
grouped_teams_all = Teams_data_all.groupby(Teams_data_all.index)
league_BA_all = grouped_teams_all['Batting_Average'].mean()
league_era_all = grouped_teams_all['ERA'].mean()
league_E_all = grouped_teams_all['E'].mean()
league_ops_all = grouped_teams_all['OPS'].mean()
league_obp_all = grouped_teams_all['OBP'].mean()
league_slg_all= grouped_teams_all['SLG'].mean()
league_DICE_all = grouped_teams_all['DICE'].mean()
league_FP_all= grouped_teams_all['FP'].mean()
Teams_data_all['OPSPlus'] = 100*(((Teams_data_all['OBP']/league_obp_all)+(Teams_data_all['SLG']/league_slg_all))-1)
Teams_data_all['ERAPlus'] = 100*(2-Teams_data_all['ERA']/(league_era_all*Teams_data_all['PPF']/100))
In [64]:
names = ['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']
#Select columns with the same list used for the tick labels so the heat map axes are labeled correctly
df_corr_advanced_all = Teams_data_all[names].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr_advanced_all, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,6,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Heat Map All Years, Advanced Statistics Estimate')
plt.show()
print(df_corr_advanced_all['WinPercent'])
df_corr_groups_adv_all = grouped_teams_all[['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']].corr()
print(df_corr_groups_adv_all['WinPercent'])