Chicago Cubs’ legend Ernie Banks passed away this week at age 83, leaving behind a legacy of 12 All-Star appearances, two NL MVP awards, and 512 home runs over a 19-year Hall of Fame career. He also earned 67.5 career rWAR, and is considered one of the greatest shortstops in history. However, the most memorable moment in his career is arguably a single quote, before an otherwise irrelevant game:
It’s a great day for a ballgame; let’s play two!
In the days since his passing, “Let’s play two!” has been the popular refrain for anyone referencing his career. Hearing it said so many times, I had an idea – how well did Banks perform when he DID play two?
Unfortunately, nothing in this world is original, and at least one blogger, William Juliano of the Yankees-centric The Captain’s Blog, got to it back in 2011. In an effort to find my own spin on this idea, it seemed simple enough that I could figure these numbers out on my own using MySQL and R, and the Retrosheet database, and reasonably explain how others can do the same.
I’m not an expert in the area, despite having some past database experience, and I consider this as much a learning experience for myself as it is for anyone reading, so anyone with comments or suggestions, please let me know.
Someone attempting these queries and calculations should have some prior experience with SQL and relational databases, and some understanding of basic statistics. An installation of MySQL and the R statistics package are used by me, and I find programs like Sequel Pro and RStudio are incredibly helpful interfaces for these languages.
As far as the data itself, we’ll be using Retrosheet’s Event Files. Ascii-text file versions of these event logs can be found on their website, for free, linked above. However, the easier (and faster) solution would be to install the SQL dump generously maintained by Baseball Heat Maps, an absolutely awesome site. I’ll be using the schema in their version, but at worst only the column names would change slightly from version to version.
Finding the Games
In Retrosheet, we want to find all games played by Ernie Banks that were part of a double header. Fortunately, unique Retrosheet Game IDs are formatted in such a way as to help with this.
CHN195405221 -> CHN 1954 05 22 1
The prefix is home team’s Retrosheet ID, in this case the Chicago Cubs. The next eight digits are the date of the game, in YYYY/MM/DD format. The final number identifies the number of the game played in that stadium, on that day. On days where only one game is played, this final digit is a 0. In the first game of a double header, this number is 1, and 2 for the second game. This query returns all Game IDs of the relevant games:
SELECT DISTINCT e.GAME_ID FROM EVENTS as e WHERE SUBSTRING(e.GAME_ID, 12, 1) != 0 AND e.BAT_ID = 'banke101';
Next, we’ll want to gather a list of all individual ‘events’ (roughly, plate appearances) that occurred while Banks was at the plate, in each of these games. Since I’ll be calculating some basic stats (Batting Average, On-Base Percentage, Slugging Percentage, and OPS), I want to check if each event falls under a certain category (for these purposes, AB, H, 2B, 3B, HR, BB, IBB, HBP, SO, SF, SH, and Reaching on Defensive Indifference).
Almost all of these events can be determined by the ‘EVENT_CD’ field. It’s an integer value that relates to the data-use table found here. By checking if the EVENT_CD = 3, for instance, one can see if the plate appearance ended in a strikeout. Other fields used here are the ‘AB_FL’ (was this an at-bat?), ‘SF_FL’ (sacrifice fly), and ‘SH_FL’ (sacrifice hit).
Removing the ‘DISTINCT’ keyword from the previous query, and adding the necessary IF statements, this query returns the results of every Ernie Banks plate appearance in a double header.
SELECT SUBSTRING(e.GAME_ID, 4, 4) as Year, SUBSTRING(e.GAME_ID, 12, 1) as Game, IF(e.AB_FL='T', 1, 0) as AB, IF(e.EVENT_CD>=20,IF(e.EVENT_CD<=23,1,0),0) as H, IF(e.EVENT_CD=21,1,0) as 2B, IF(e.EVENT_CD=22,1,0) as 3B, IF(e.EVENT_CD=23,1,0) as HR, IF(e.EVENT_CD=14,1,0) as BB, IF(e.EVENT_CD=15,1,0) as IBB, IF(e.EVENT_CD=16,1,0) as HBP, IF(e.EVENT_CD=3,1,0) as SO, IF(e.SF_FL='T',1,0) as SF, IF(e.SH_FL='T',1,0) as SH, IF(e.EVENT_CD=5,1,0) as DI FROM EVENTS as e WHERE SUBSTRING(e.GAME_ID, 12, 1) != 0 AND e.BAT_ID = 'banke101';
For ease of use, we want to group these counting stats by year, and by game in the double header (either 1 or 2). Add a COUNT(*) selection to count the number of games played under each year/game combination, a SUM() around each IF statement, and a GROUP BY at the end of the query.
SELECT SUBSTRING(e.GAME_ID, 4, 4) as Year, SUBSTRING(e.GAME_ID, 12, 1) as Game, COUNT(DISTINCT e.GAME_ID) as Games, SUM(IF(e.AB_FL='T', 1, 0)) as AB, SUM(IF(e.EVENT_CD>=20,IF(e.EVENT_CD<=23,1,0),0)) as H, SUM(IF(e.EVENT_CD=21,1,0)) as 2B, SUM(IF(e.EVENT_CD=22,1,0)) as 3B, SUM(IF(e.EVENT_CD=23,1,0)) as HR, SUM(IF(e.EVENT_CD=14,1,0)) as BB, SUM(IF(e.EVENT_CD=15,1,0)) as IBB, SUM(IF(e.EVENT_CD=16,1,0)) as HBP, SUM(IF(e.EVENT_CD=3,1,0)) as SO, SUM(IF(e.SF_FL='T',1,0)) as SF, SUM(IF(e.SH_FL='T',1,0)) as SH, SUM(IF(e.EVENT_CD=5,1,0)) as DI FROM EVENTS as e WHERE SUBSTRING(e.GAME_ID, 12, 1) != 0 AND e.BAT_ID = 'banke101' GROUP BY Year, Game;
Export these results to ‘banks.csv’, and open RStudio to manipulate the data.
Once in RStudio, the first thing to do is install and import the “plyr” library, which will be used split and aggregate the data, and then import the data itself from ‘Banks.csv’:
install.packages("plyr") library("plyr") data <- read.csv( file = "Desktop/banks.csv", header=TRUE, sep="," )
A new column calculating the number of plate appearances in each row should be calculated for convenience:
data$PA <- data$AB + data$BB + data$IBB + data$HBP + data$SH + data$SF + data$DI
New data frames are needed, as we want to look at Banks’ year-by-year performance, his career performance in either the first or second game of double header, and the 19-year total. The ddply() function will be used to split by year or game and summarize that data, and summarize() will be used for the single career total.
year_stats <- ddply(data, .(Year), summarize, Games = sum(Games), PA = sum(PA), AB = sum(AB), H = sum(H), X2B = sum(X2B), X3B = sum(X3B), HR = sum(HR), BB = sum(BB), IBB = sum(IBB), HBP = sum(HBP), SO = sum(SO), SF = sum(SF), SH = sum(SH), DI = sum(DI))
game_stats <- ddply(data, .(Game), summarize, Games = sum(Games), PA = sum(PA), AB = sum(AB), H = sum(H), X2B = sum(X2B), X3B = sum(X3B), HR = sum(HR), BB = sum(BB), IBB = sum(IBB), HBP = sum(HBP), SO = sum(SO), SF = sum(SF), SH = sum(SH), DI = sum(DI))
total_stats <- summarize(data, Games = sum(Games), PA = sum(PA), AB = sum(AB), H = sum(H), X2B = sum(X2B), X3B = sum(X3B), HR = sum(HR), BB = sum(BB), IBB = sum(IBB), HBP = sum(HBP), SO = sum(SO), SF = sum(SF), SH = sum(SH), DI = sum(DI))
View()-ing each of these data frames, you’ll notice the aggregate of each counting stat category.
The final steps are to append the necessary rate stat columns to each of these new data frames. For simplicity’s sake, we’ll stick with the basics: BA, OBP, SLG, and OPS. The “By Year” stats and the “Double Header Game” stats are calculated in a slightly different way than the total line, due to having more than one row as a result.
Instead of “summarizing” this data again (which creates additional data frames), the “transform” function should be used to append new columns to existing frames:
year_stats <- ddply(year_stats, .(Year), transform, BA = H/AB, OBP = (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF), SLG = (H+X2B+(2*X3B)+(3*HR))/AB) year_stats$OPS <- year_stats$OBP + year_stats$SLG
game_stats <- ddply(game_stats, .(Game), transform, BA = H/AB, OBP = (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF), SLG = (H+X2B+(2*X3B)+(3*HR))/AB) game_stats$OPS <- game_stats$OBP + game_stats$SLG
As the “OBP” and “SLG” columns don’t exist at the time of calling the ddply function, the “OPS” column has to be created in a separate step in the code above.
As for the total data frame, we won’t use a function to do the transform. For ease of writing, I attach() the “total_stats” data frame to cut down on typing, and detach it before the final line.
attach(total_stats) total_stats$BA <- H/AB total_stats$OBP <- (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF) total_stats$SLG <- (H+X2B+(2*X3B)+(3*HR))/AB
detach(total_stats) total_stats$OPS <- total_stats$OBP + total_stats$SLG
At this point, stat lines have been calculated and are ready for viewing. Below are my results:
Double Header Game
It makes logical sense that a batter would be more fresh (AKA better) in the first game of a double header, as this data correlates. Ernie Banks was a career .274/.330/.500 hitter. In this surprisingly large sample (double headers were more common in the past), Banks was not too far off from his career line. However, he did in fact have a slightly higher OPS on days where he was able to “play two.”