Wednesday, August 2, 2017

Playing with Jupyter Notebooks

I've been taking a few classes through my day job to start doing more data analysis for the company.  The first big project was to due some basic work on some baseball stats.  Right up my alley.  I figured I'd share the notebook here, since Jupyter makes that nice and easy to do.  Its after the jump, since it lays out a little funny on the blog.







Baseball Analysis

Analyzing Pitching, Fielding and Hitting as it relates to Win Percentages

Modern life is filled with numbers. Analytics is becoming inceasingly important to almost every aspect of life, and sports is no different. Baseball has long been one of the most studied sports in terms of data analytics, because the rules of the game set it up for straightforward analysis. Every pitch is a single experiment for analysis.
This paper will investigate how well some different statistical measures do to correlate to how well a team performs on the field. The correlation between some of the "basic" statistics will be the starting point, as well as some other "advanced" stats to see how well they embody whether teams are picking up wins on the field.
In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

import csv,random,time



#Load Data and calculate the required advanced statistics.  Not this file lcoation is for working locally, and will need
#to be updated for other systems.
Teams_data = pd.read_csv('Project 1 Baseball\\baseballdatabank-2017.1\\core\\Teams.csv')

#Replace Blanks as NaN
Teams_data = Teams_data.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

print(Teams_data.isnull().sum())
yearID               0
lgID                50
teamID               0
franchID             0
divID             1517
Rank                 0
G                    0
Ghome              399
W                    0
L                    0
DivWin            1545
WCWin             2181
LgWin               28
WSWin              357
R                    0
AB                   0
H                    0
2B                   0
3B                   0
HR                   0
BB                   0
SO                 120
SB                 144
CS                 859
HBP               2325
SF                2325
RA                   0
ER                   0
ERA                  0
CG                   0
SHO                  0
SV                   0
IPouts               0
HA                   0
HRA                  0
BBA                  0
SOA                  0
E                    0
DP                 317
FP                   0
name                 0
park                34
attendance         279
BPF                  0
PPF                  0
teamIDBR             0
teamIDlahman45       0
teamIDretro          0
dtype: int64
We see in the above summation that there is quite a bit of missing data from the set. This is a drawback to analyzing some of the data in this set, that many of the statistics were just not tracked early on in baseball history, or were not tracked due to various other outside impacts (The World War's for example). The two points of data we are missing first and foremost for our analysis are the HBP and SF (Hit by Pitch and Sacrifice Flies). These statistics were not tracked until the year 2000, so our analysis will folks on data after 1999. THis is well suited for this study though, since it presents a period of time that is all inside one era of baseball, and the games rules have been unified.
In [62]:
#Load Data and calculate the required advanced statistics.  Not this file lcoation is for working locally, and will need
#to be updated for other systems.
Teams_data = pd.read_csv('Project 1 Baseball\\baseballdatabank-2017.1\\core\\Teams.csv')


#Estimate Required Missing Data from other data
sacflies_per_game = (Teams_data['SF'][Teams_data['yearID'] > 1999]/Teams_data['G'][Teams_data['yearID'] > 1999]).mean()
hbp_per_game = (Teams_data['HBP'][Teams_data['yearID'] > 1999]/Teams_data['G'][Teams_data['yearID'] > 1999]).mean()

Teams_data['SF'].fillna(Teams_data['G']*sacflies_per_game,inplace=True)
Teams_data['HBP'].fillna(Teams_data['G']*hbp_per_game,inplace=True)


Teams_data['WinPercent'] = Teams_data['W']/Teams_data['G']
Teams_data['Batting_Average'] = Teams_data['H']/Teams_data['AB']
Teams_data['Singles'] = Teams_data['H']-(Teams_data['2B']+Teams_data['3B']+Teams_data['HR'])

#Plate Apearances are different from at bats, since they include walks and sacrifice flies.
Teams_data['PA_equiv'] = Teams_data['AB']+Teams_data['HBP']+Teams_data['BB']+Teams_data['SF']
Teams_data['OBP'] = (Teams_data['H']+Teams_data['BB']+Teams_data['HBP'])/Teams_data['PA_equiv']
Teams_data['SLG'] = (Teams_data['Singles']+2*Teams_data['2B']+3*Teams_data['3B']+4*Teams_data['HR'])/Teams_data['AB']
Teams_data['OPS'] = Teams_data['OBP']+Teams_data['SLG']

#DICE is a fielding independent measure of pitching developed by Bill James
Teams_data['DICE'] = 3.0+((13*Teams_data['HRA']+3*Teams_data['BBA']-2*Teams_data['SOA'])/Teams_data['IPouts'])


#Select Modern Data and re-index
Teams_data_modern = Teams_data[Teams_data['yearID'] > 1999]
Teams_data_modern = Teams_data_modern.set_index('yearID')


#Group Data and calculate Statistics
grouped_teams = Teams_data_modern.groupby(Teams_data_modern.index)

league_BA = grouped_teams['Batting_Average'].mean()
league_era = grouped_teams['ERA'].mean()
league_E = grouped_teams['E'].mean()

league_ops = grouped_teams['OPS'].mean()
league_obp = grouped_teams['OBP'].mean()
league_slg = grouped_teams['SLG'].mean()
league_DICE = grouped_teams['DICE'].mean()
league_FP = grouped_teams['FP'].mean()



Teams_data_modern['OPSPlus'] = 100*(((Teams_data_modern['OBP']/league_obp)+(Teams_data_modern['SLG']/league_slg))-1)
Teams_data_modern['ERAPlus'] = 100*(2-Teams_data_modern['ERA']/(league_era*Teams_data_modern['PPF']/100))
We will start by looking at just how several of our measured statistics have changed over the course of time in our analysis set. It should be noted we will not look at the average of OPS+ and ERA+, becaus they are standardized already to have an average of 100. A higher value than 100 is better.
In [48]:
#Produce Basic Box Plots for some averages
grouped_data = pd.concat([league_BA,league_ops,league_era,league_DICE,league_FP,league_E],axis =1)
grouped_data.plot(kind='line',subplots=True, layout=(3,2), sharex=False, sharey=False, figsize = (12,12),title='Yearly Averages For Various Statistics')
print("")
grouped_data.plot(kind='box', subplots=True, layout=(3,2), sharex=False, sharey=False, figsize = (12,12),title='Yearly Averages Distribution For Various Statistics')

Out[48]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000EE839B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000009FCB278>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000ED7F7F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000F1D92E8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000E9EE550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000105DA1D0>]], dtype=object)
We can see some interesting trends. In any given year the hitting statistics and pitching statistics tend to follow a similar path. This trend really shouldnt be so surprising, since these two statistics are really noting the matchup between the pitcher and the batter. Likewise, it should not surprise us that the number of errors committed has decreased as the average fielding percentage has increased. Looking at the box plot comparisons of the traditional statistics versus the advanced ones, we see Batting Average has had a bit larger spread than OPS. Pitcher ERA has seen more difference near its upper quartile, whereas DICE sees it at the bottom quartile.
Now that we have seen some oeverall trends in our statistics, lets look at how it correlates to team wins. We will look at the data in two ways here. First will be an analysis of the total time period, giving the whole correlation value. Second, the data will be broken down for each year in the period so the variance of it can be explored a bit.
In [49]:
names  = ['WinPercent','ERA','Batting_Average','E']


df_corr = Teams_data_modern[['WinPercent','ERA','Batting_Average','E']].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,4,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Matrix Heat Map, Regular Statistics')
plt.show()

print(df_corr['WinPercent'])

df_corr_groups = grouped_teams[['WinPercent','ERA','Batting_Average','E']].corr()

print(df_corr_groups['WinPercent'])
WinPercent         1.000000
ERA               -0.623678
Batting_Average    0.368824
E                 -0.371032
Name: WinPercent, dtype: float64
yearID                 
2000    WinPercent         1.000000
        ERA               -0.657605
        Batting_Average    0.363401
        E                 -0.127516
2001    WinPercent         1.000000
        ERA               -0.765989
        Batting_Average    0.484234
        E                 -0.498863
2002    WinPercent         1.000000
        ERA               -0.823924
        Batting_Average    0.692064
        E                 -0.455654
2003    WinPercent         1.000000
        ERA               -0.673311
        Batting_Average    0.555701
        E                 -0.469042
2004    WinPercent         1.000000
        ERA               -0.636416
        Batting_Average    0.562622
        E                 -0.454024
2005    WinPercent         1.000000
        ERA               -0.717419
        Batting_Average    0.314030
        E                 -0.542535
2006    WinPercent         1.000000
        ERA               -0.751892
        Batting_Average    0.395753
        E                 -0.264370
2007    WinPercent         1.000000
        ERA               -0.651304
                             ...   
2009    Batting_Average    0.465591
        E                 -0.409088
2010    WinPercent         1.000000
        ERA               -0.681075
        Batting_Average    0.464246
        E                 -0.634078
2011    WinPercent         1.000000
        ERA               -0.616283
        Batting_Average    0.349537
        E                 -0.409663
2012    WinPercent         1.000000
        ERA               -0.812985
        Batting_Average    0.272529
        E                 -0.362973
2013    WinPercent         1.000000
        ERA               -0.747713
        Batting_Average    0.585259
        E                 -0.444087
2014    WinPercent         1.000000
        ERA               -0.728931
        Batting_Average    0.288410
        E                 -0.098307
2015    WinPercent         1.000000
        ERA               -0.819131
        Batting_Average    0.084381
        E                 -0.092299
2016    WinPercent         1.000000
        ERA               -0.795748
        Batting_Average    0.344146
        E                 -0.586490
Name: WinPercent, dtype: float64
Not so surprising as the old adage goes, good pitching beats good hitting. Since 2000, teams with better ERAs have posted higher win percentages, with a Pearson Correficient of .624. This shows there is a decently strong correlation to pitching. Batting average though has not really seen such a strong corrlation, especially in the last few years. In 2014 the correlation was just .28, and it bottomed out in 2015 to just .08. It was still under its overall correlation score last year as well. The swing stat here is errors comitted. Some years the correlation is especially high, topping ERA for the highest score. Other years, like 2015 it scored just .08.
Do our advanced stats do a better job in correlating to wins?
In [50]:
names  = ['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']

df_corr_advanced = Teams_data_modern[['WinPercent','DICE','OPS','FP','ERAPlus','OPSPlus']].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr_advanced, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,6,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Matrix Heat Map, Advanced Statistics')
plt.show()

print(df_corr_advanced['WinPercent'])

df_corr_groups_adv = grouped_teams[['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']].corr()

print(df_corr_groups_adv['WinPercent'])
WinPercent    1.000000
DICE         -0.509477
OPS           0.490193
FP            0.380603
ERAPlus       0.715469
OPSPlus       0.591425
Name: WinPercent, dtype: float64
yearID            
2000    WinPercent    1.000000
        DICE         -0.490985
        FP            0.115591
        OPS           0.605958
        ERAPlus       0.588944
        OPSPlus       0.611980
2001    WinPercent    1.000000
        DICE         -0.702926
        FP            0.506159
        OPS           0.597049
        ERAPlus       0.689731
        OPSPlus       0.612912
2002    WinPercent    1.000000
        DICE         -0.752856
        FP            0.470150
        OPS           0.732204
        ERAPlus       0.844095
        OPSPlus       0.740936
2003    WinPercent    1.000000
        DICE         -0.748001
        FP            0.498525
        OPS           0.624097
        ERAPlus       0.747947
        OPSPlus       0.629883
2004    WinPercent    1.000000
        DICE         -0.649045
        FP            0.489051
        OPS           0.745977
        ERAPlus       0.535459
        OPSPlus       0.748465
                        ...   
2012    WinPercent    1.000000
        DICE         -0.681597
        FP            0.380137
        OPS           0.497550
        ERAPlus       0.803317
        OPSPlus       0.496829
2013    WinPercent    1.000000
        DICE         -0.703775
        FP            0.423762
        OPS           0.691613
        ERAPlus       0.592674
        OPSPlus       0.702402
2014    WinPercent    1.000000
        DICE         -0.579660
        FP            0.139803
        OPS           0.327054
        ERAPlus       0.724885
        OPSPlus       0.333830
2015    WinPercent    1.000000
        DICE         -0.687243
        FP            0.115086
        OPS           0.455378
        ERAPlus       0.826296
        OPSPlus       0.470414
2016    WinPercent    1.000000
        DICE         -0.651334
        FP            0.617207
        OPS           0.547176
        ERAPlus       0.824863
        OPSPlus       0.561053
Name: WinPercent, dtype: float64
In an interesting twist we see that DICE does not match up as well as ERA does. It still has what most statisticians would call a moderate correlation, but not as strong as regular ERA. OPS does do a much better job than just batting average, although it seems to have also seen its correlation dropquite a bit since the early 2000s to today where its seen much lower values. Fielding percentage is a little stronger than measuring errors, but its not really measurable. Fielding Percentage does seem to be a bit less swingy than analyzing errors though.
ERA+ and OPS+ are further improvements to the system, noting that both of them use the average value for the league and show how much better (or worse) a team is than that average.
The below analyis pulls the team with the best Winning Percentage in a given year, then finds the teams with the best OPS+ and ERA+ so we can see how often it happens the three of them line up.
In [51]:
#Uncomment this line if this block has already been run to reset the index to yearID
#Teams_data_modern = Teams_data_modern.set_index('yearID')

grouped_teams = Teams_data_modern.groupby(Teams_data_modern.index)
idwp = grouped_teams['WinPercent'].transform(max) == Teams_data_modern['WinPercent']
best_wins = Teams_data_modern[idwp]

idop = grouped_teams['OPSPlus'].transform(max) == Teams_data_modern['OPSPlus']
best_opsp = Teams_data_modern[idop]

ider = grouped_teams['ERAPlus'].transform(max) == Teams_data_modern['ERAPlus']
best_erap = Teams_data_modern[ider]

print(best_wins[['name','WinPercent','OPSPlus','ERAPlus']])
print("")
print(best_opsp[['name','WinPercent','OPSPlus']])
print("")
print(best_erap[['name','WinPercent','ERAPlus']])


combined_best = best_wins[['name','WinPercent','OPSPlus','ERAPlus']].merge(best_opsp[['name','WinPercent','OPSPlus']],right_index=True, left_index=True)
combined_bests = combined_best.merge(best_erap[['name','WinPercent','ERAPlus']],right_index=True, left_index=True)
                                 name  WinPercent     OPSPlus     ERAPlus
yearID                                                                   
2000             San Francisco Giants    0.598765  112.901601  103.910822
2001                 Seattle Mariners    0.716049  112.868920  113.744582
2002                 New York Yankees    0.639752  116.251425  108.495063
2003                   Atlanta Braves    0.623457  117.702263  106.726321
2004              St. Louis Cardinals    0.648148  110.283349  115.038233
2005              St. Louis Cardinals    0.617284  103.635432  118.546756
2006                 New York Yankees    0.598765  114.659084  101.484435
2006                    New York Mets    0.598765  102.490009  105.609114
2007                   Boston Red Sox    0.592593  112.866729  117.410047
2007                Cleveland Indians    0.592593  103.523171  109.247087
2008    Los Angeles Angels of Anaheim    0.617284   98.248047  109.379968
2009                 New York Yankees    0.635802  123.070411  104.127901
2010            Philadelphia Phillies    0.598765  104.738520  109.901800
2011            Philadelphia Phillies    0.629630   99.675796  125.519759
2012             Washington Nationals    0.604938  106.573670  117.766136
2013                   Boston Red Sox    0.598765  122.582958  103.821753
2013              St. Louis Cardinals    0.598765  105.816217  108.737536
2014    Los Angeles Angels of Anaheim    0.604938  107.885978   99.186168
2015              St. Louis Cardinals    0.617284   98.888143  126.424524
2016                     Chicago Cubs    0.635802  109.577388  119.052893

                     name  WinPercent     OPSPlus
yearID                                           
2000    Cleveland Indians    0.555556  114.025481
2001     Colorado Rockies    0.450617  119.906395
2002     New York Yankees    0.639752  116.251425
2003       Boston Red Sox    0.586420  124.626338
2004       Boston Red Sox    0.604938  117.955268
2005       Boston Red Sox    0.586420  116.564536
2006     New York Yankees    0.598765  114.659084
2007     New York Yankees    0.580247  118.592251
2008        Texas Rangers    0.487654  117.253472
2009     New York Yankees    0.635802  123.070411
2010       Boston Red Sox    0.549383  116.246321
2011       Boston Red Sox    0.555556  124.417190
2012     New York Yankees    0.586420  117.506162
2013       Boston Red Sox    0.598765  122.582958
2014     Colorado Rockies    0.407407  119.532341
2015    Toronto Blue Jays    0.574074  120.229587
2016       Boston Red Sox    0.574074  118.923208

                         name  WinPercent     ERAPlus
yearID                                               
2000           Atlanta Braves    0.586420  114.098640
2001           Atlanta Braves    0.543210  120.244554
2002           Atlanta Braves    0.627329  127.457633
2003      Los Angeles Dodgers    0.524691  123.522360
2004             Chicago Cubs    0.549383  116.217702
2005      St. Louis Cardinals    0.617284  118.546756
2006           Detroit Tigers    0.586420  115.075562
2007           Boston Red Sox    0.592593  117.410047
2008        Toronto Blue Jays    0.530864  116.650082
2009     San Francisco Giants    0.543210  119.323316
2010     San Francisco Giants    0.567901  118.328985
2011    Philadelphia Phillies    0.629630  125.519759
2012          Cincinnati Reds    0.598765  122.144280
2013           Atlanta Braves    0.592593  120.085109
2014     Washington Nationals    0.592593  120.530010
2015      St. Louis Cardinals    0.617284  126.424524
2016             Chicago Cubs    0.635802  119.052893
The team with the best winning percentage rarely lined up with the the best OPS+ list, but did do a bit better with best ERA+ list.
Overall this research has demonstrated that we can see that teams that win in baseball tend to also be the ones that pitch well and play enough defense to limit runs. It also is helpful to be able to hit well, but that only can be a bit less instructive. The advanced statistics do tend to produce a better correlation, but anything that does not specifically deal with run scoring or run allowing does not pair nearly as well as those that do. These correlations of course do not state that a team with a good ERA+ score will always have a bettwer win loss record, but rather that a team that has a good score is also fairly likely to have a good record.
A note on some of the limitations of this data set. Because the MLB only began tracking some statistics offically from the 2000 season onward, namely sacrifice flies and hit by pitch, it only makes sense to analyze from that point forward. Secondly, baseball is a game of eras. It does not make a lot sense to compare the game from over 100 years ago when players traded teams and leagues multiple times a season to the modern market where players are truely professional. These limitations do not limit the analysis laid out in this paper to this point though, but if someone wanted to measure how these different statistics macthed up throughout history, and try to draw some historical narrative they would be making any judgements based on estimations.
The final anaylisis will be to attempt to estimate the correlation for ERA+ and OPS+ for our historical data, despite the limitations stated above. We already dealt with the missing values by estimating them earlier in the set, so what's left to do is run our similar analyis with these estimated values.
In [40]:
 
In [54]:
Teams_data_old = Teams_data[Teams_data['yearID'] < 2000]
Teams_data_old = Teams_data_old.set_index('yearID')


#Group Data and calculate Statistics
grouped_teams_old = Teams_data_old.groupby(Teams_data_old.index)


league_BA_old = grouped_teams_old['Batting_Average'].mean()
league_era_old = grouped_teams_old['ERA'].mean()
league_E_old = grouped_teams_old['E'].mean()

league_ops_old = grouped_teams_old['OPS'].mean()
league_obp_old = grouped_teams_old['OBP'].mean()
league_slg_old = grouped_teams_old['SLG'].mean()
league_DICE_old = grouped_teams_old['DICE'].mean()
league_FP_old = grouped_teams_old['FP'].mean()


Teams_data_old['OPSPlus'] = 100*(((Teams_data_old['OBP']/league_obp_old)+(Teams_data_old['SLG']/league_slg_old))-1)
Teams_data_old['ERAPlus'] = 100*(2-Teams_data_old['ERA']/(league_era_old*Teams_data_old['PPF']/100))
In [55]:
names  = ['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']

df_corr_advanced_old = Teams_data_old[['WinPercent','DICE','OPS','FP','ERAPlus','OPSPlus']].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr_advanced_old, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,6,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Heat Map All Years, Advanced Statistics Estimate')
plt.show()

print(df_corr_advanced_old['WinPercent'])

df_corr_groups_adv_old = grouped_teams_old[['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']].corr()

print(df_corr_groups_adv_old['WinPercent'])
WinPercent    1.000000
DICE         -0.218249
OPS           0.492186
FP            0.226242
ERAPlus       0.703391
OPSPlus       0.689971
Name: WinPercent, dtype: float64
yearID            
1871    WinPercent    1.000000
        DICE         -0.212684
        FP            0.529982
        OPS           0.584916
        ERAPlus       0.259016
        OPSPlus       0.594085
1872    WinPercent    1.000000
        DICE         -0.492138
        FP            0.937680
        OPS           0.853405
        ERAPlus       0.880152
        OPSPlus       0.851231
1873    WinPercent    1.000000
        DICE         -0.894436
        FP            0.809613
        OPS           0.876936
        ERAPlus       0.756135
        OPSPlus       0.877484
1874    WinPercent    1.000000
        DICE         -0.438821
        FP            0.785365
        OPS           0.696880
        ERAPlus       0.952800
        OPSPlus       0.695293
1875    WinPercent    1.000000
        DICE         -0.433901
        FP            0.741148
        OPS           0.911715
        ERAPlus       0.826455
        OPSPlus       0.911652
                        ...   
1995    WinPercent    1.000000
        DICE         -0.555895
        FP            0.169191
        OPS           0.621075
        ERAPlus       0.612356
        OPSPlus       0.618917
1996    WinPercent    1.000000
        DICE         -0.552270
        FP            0.282332
        OPS           0.435972
        ERAPlus       0.575983
        OPSPlus       0.442604
1997    WinPercent    1.000000
        DICE         -0.496431
        FP            0.088919
        OPS           0.438382
        ERAPlus       0.577075
        OPSPlus       0.448630
1998    WinPercent    1.000000
        DICE         -0.705523
        FP            0.364780
        OPS           0.678672
        ERAPlus       0.612147
        OPSPlus       0.684970
1999    WinPercent    1.000000
        DICE         -0.669435
        FP            0.418054
        OPS           0.676266
        ERAPlus       0.749316
        OPSPlus       0.686470
Name: WinPercent, dtype: float64
One of the most interesting things we can fnd in the data here is that baseball in the early era was much more dependent on how well a team fielded the ball. This is at least in part due to the absence of gloves, and what is knwon as the dead ball. The modern game sees the ball being changed every few pitches. Early in the game the ball would last the whole game leading to softer hit balls.
We can also run this analysis for the entire data set, remembering it will be tough to draw any meaningful conclusions from our estimated data.
In [65]:
#Group Data and calculate Statistics
Teams_data_all = Teams_data.set_index('yearID')
grouped_teams_all = Teams_data_all.groupby(Teams_data_all.index)

league_BA_all = grouped_teams_all['Batting_Average'].mean()
league_era_all = grouped_teams_all['ERA'].mean()
league_E_all = grouped_teams_all['E'].mean()

league_ops_all = grouped_teams_all['OPS'].mean()
league_obp_all = grouped_teams_all['OBP'].mean()
league_slg_all= grouped_teams_all['SLG'].mean()
league_DICE_all = grouped_teams_all['DICE'].mean()
league_FP_all= grouped_teams_all['FP'].mean()

Teams_data_all['OPSPlus'] = 100*(((Teams_data_all['OBP']/league_obp_all)+(Teams_data_all['SLG']/league_slg_all))-1)
Teams_data_all['ERAPlus'] = 100*(2-Teams_data_all['ERA']/(league_era_all*Teams_data_all['PPF']/100))
       lgID teamID franchID divID  Rank    G  Ghome   W   L DivWin    ...     \
yearID                                                                ...      
2016     NL    PHI      PHI     E     4  162   81.0  71  91      N    ...      
2016     NL    PIT      PIT     C     3  162   81.0  78  83      N    ...      
2016     NL    SDN      SDP     W     5  162   81.0  68  94      N    ...      
2016     AL    SEA      SEA     W     2  162   81.0  86  76      N    ...      
2016     NL    SFN      SFG     W     2  162   81.0  87  75      N    ...      
2016     NL    SLN      STL     C     2  162   81.0  86  76      N    ...      
2016     AL    TBA      TBD     E     5  162   81.0  68  94      N    ...      
2016     AL    TEX      TEX     W     1  162   81.0  95  67      Y    ...      
2016     AL    TOR      TOR     E     2  162   81.0  89  73      N    ...      
2016     NL    WAS      WSN     E     1  162   81.0  95  67      Y    ...      

       teamIDlahman45 teamIDretro WinPercent  Batting_Average  Singles  \
yearID                                                                   
2016              PHI         PHI   0.438272         0.240155      878   
2016              PIT         PIT   0.481481         0.257308      964   
2016              SDN         SDN   0.419753         0.235283      815   
2016              SEA         SEA   0.530864         0.259001      955   
2016              SFN         SFN   0.537037         0.258221      973   
2016              SLN         SLN   0.530864         0.255047      859   
2016              TBA         TBA   0.419753         0.243204      797   
2016              TEX         TEX   0.586420         0.261719      951   
2016              TOR         TOR   0.549383         0.247855      843   
2016              MON         WAS   0.586420         0.255556      903   

        PA_equiv       OBP       SLG       OPS      DICE  
yearID                                                    
2016      5946.0  0.300538  0.384431  0.684970  3.363953  
2016      6220.0  0.332476  0.401660  0.734136  3.338925  
2016      5962.0  0.298893  0.390293  0.689186  3.380093  
2016      6202.0  0.326346  0.429876  0.756223  3.346145  
2016      6225.0  0.329478  0.398023  0.727501  3.171879  
2016      6185.0  0.325141  0.442141  0.767283  3.209896  
2016      6027.0  0.307118  0.425652  0.732770  3.347978  
2016      6071.0  0.321529  0.433303  0.754832  3.440517  
2016      6206.0  0.329520  0.425808  0.755327  3.259022  
2016      6153.0  0.325532  0.425865  0.751397  3.106645  

[10 rows x 55 columns]
In [64]:
names  = ['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']

df_corr_advanced_all = Teams_data_all[['WinPercent','DICE','OPS','FP','ERAPlus','OPSPlus']].corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(df_corr_advanced_all, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,6,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.title('Correlation Heat Map All Years, Advanced Statistics Estimate')
plt.show()

print(df_corr_advanced_all['WinPercent'])

df_corr_groups_adv_all = grouped_teams_all[['WinPercent','DICE','FP','OPS','ERAPlus','OPSPlus']].corr()

print(df_corr_groups_adv_all['WinPercent'])
WinPercent    1.000000
DICE         -0.246254
OPS           0.479051
FP            0.216819
ERAPlus       0.704330
OPSPlus       0.679646
Name: WinPercent, dtype: float64
yearID            
1871    WinPercent    1.000000
        DICE         -0.212684
        FP            0.529982
        OPS           0.584916
        ERAPlus       0.259016
        OPSPlus       0.594085
1872    WinPercent    1.000000
        DICE         -0.492138
        FP            0.937680
        OPS           0.853405
        ERAPlus       0.880152
        OPSPlus       0.851231
1873    WinPercent    1.000000
        DICE         -0.894436
        FP            0.809613
        OPS           0.876936
        ERAPlus       0.756135
        OPSPlus       0.877484
1874    WinPercent    1.000000
        DICE         -0.438821
        FP            0.785365
        OPS           0.696880
        ERAPlus       0.952800
        OPSPlus       0.695293
1875    WinPercent    1.000000
        DICE         -0.433901
        FP            0.741148
        OPS           0.911715
        ERAPlus       0.826455
        OPSPlus       0.911652
                        ...   
2012    WinPercent    1.000000
        DICE         -0.681597
        FP            0.380137
        OPS           0.497550
        ERAPlus       0.803317
        OPSPlus       0.496829
2013    WinPercent    1.000000
        DICE         -0.703775
        FP            0.423762
        OPS           0.691613
        ERAPlus       0.592674
        OPSPlus       0.702402
2014    WinPercent    1.000000
        DICE         -0.579660
        FP            0.139803
        OPS           0.327054
        ERAPlus       0.724885
        OPSPlus       0.333830
2015    WinPercent    1.000000
        DICE         -0.687243
        FP            0.115086
        OPS           0.455378
        ERAPlus       0.826296
        OPSPlus       0.470414
2016    WinPercent    1.000000
        DICE         -0.651334
        FP            0.617207
        OPS           0.547176
        ERAPlus       0.824863
        OPSPlus       0.561053
Name: WinPercent, dtype: float64
In [ ]:
 

No comments:

Post a Comment