Ernie Banks’ Performance In Double Headers (Using MySQL and R)

Chicago Cubs’ legend Ernie Banks passed away this week at age 83, leaving behind a legacy of 12 All-Star appearances, two NL MVP awards, and 512 home runs over a 19-year Hall of Fame career. He also earned 67.5 career rWAR, and is considered one of the greatest shortstops in history. However, the most memorable moment in his career is arguably a single quote, before an otherwise irrelevant game:

It’s a great day for a ballgame; let’s play two!

In the days since his passing, “Let’s play two!” has been the popular refrain for anyone referencing his career. Hearing it said so many times, I had an idea – how well did Banks perform when he DID play two?

Unfortunately, nothing in this world is original, and at least one blogger, William Juliano of the Yankees-centric The Captain’s Blog, got to it back in 2011. In an effort to find my own spin on this idea, it seemed simple enough that I could figure these numbers out on my own using MySQL and R, and the Retrosheet database, and reasonably explain how others can do the same.

I’m not an expert in the area, despite having some past database experience, and I consider this as much a learning experience for myself as it is for anyone reading, so anyone with comments or suggestions, please let me know.

Prerequisites

Someone attempting these queries and calculations should have some prior experience with SQL and relational databases, and some understanding of basic statistics. An installation of MySQL and the R statistics package are used by me, and I find programs like Sequel Pro and RStudio are incredibly helpful interfaces for these languages.

As far as the data itself, we’ll be using Retrosheet’s Event Files. Ascii-text file versions of these event logs can be found on their website, for free, linked above. However, the easier (and faster) solution would be to install the SQL dump generously maintained by Baseball Heat Maps, an absolutely awesome site. I’ll be using the schema in their version, but at worst only the column names would change slightly from version to version.

Finding the Games

In Retrosheet, we want to find all games played by Ernie Banks that were part of a double header. Fortunately, unique Retrosheet Game IDs are formatted in such a way as to help with this.

CHN195405221 -> CHN 1954 05 22 1

The prefix is home team’s Retrosheet ID, in this case the Chicago Cubs. The next eight digits are the date of the game, in YYYY/MM/DD format. The final number identifies the number of the game played in that stadium, on that day. On days where only one game is played, this final digit is a 0. In the first game of a double header, this number is 1, and 2 for the second game. This query returns all Game IDs of the relevant games:

SELECT DISTINCT e.GAME_ID
FROM EVENTS as e
WHERE SUBSTRING(e.GAME_ID, 12, 1) != 0
 AND e.BAT_ID = 'banke101';

Next, we’ll want to gather a list of all individual ‘events’ (roughly, plate appearances) that occurred while Banks was at the plate, in each of these games. Since I’ll be calculating some basic stats (Batting Average, On-Base Percentage, Slugging Percentage, and OPS), I want to check if each event falls under a certain category (for these purposes, AB, H, 2B, 3B, HR, BB, IBB, HBP, SO, SF, SH, and Reaching on Defensive Indifference).

Almost all of these events can be determined by the ‘EVENT_CD’ field. It’s an integer value that relates to the data-use table found here. By checking if the EVENT_CD = 3, for instance, one can see if the plate appearance ended in a strikeout. Other fields used here are the ‘AB_FL’ (was this an at-bat?),  ‘SF_FL’ (sacrifice fly), and ‘SH_FL’ (sacrifice hit).

Removing the ‘DISTINCT’ keyword from the previous query, and adding the necessary IF statements, this query returns the results of every Ernie Banks plate appearance in a double header.

SELECT SUBSTRING(e.GAME_ID, 4, 4) as Year, SUBSTRING(e.GAME_ID, 12, 1) as Game,
 IF(e.AB_FL='T', 1, 0) as AB,
 IF(e.EVENT_CD>=20,IF(e.EVENT_CD<=23,1,0),0) as H,
 IF(e.EVENT_CD=21,1,0) as 2B,
 IF(e.EVENT_CD=22,1,0) as 3B,
 IF(e.EVENT_CD=23,1,0) as HR,
 IF(e.EVENT_CD=14,1,0) as BB,
 IF(e.EVENT_CD=15,1,0) as IBB,
 IF(e.EVENT_CD=16,1,0) as HBP,
 IF(e.EVENT_CD=3,1,0) as SO,
 IF(e.SF_FL='T',1,0) as SF,
 IF(e.SH_FL='T',1,0) as SH,
 IF(e.EVENT_CD=5,1,0) as DI
FROM EVENTS as e
WHERE SUBSTRING(e.GAME_ID, 12, 1) != 0
 AND e.BAT_ID = 'banke101';

For ease of use, we want to group these counting stats by year, and by game in the double header (either 1 or 2). Add a COUNT(*) selection to count the number of games played under each year/game combination, a SUM() around each IF statement, and a GROUP BY at the end of the query.

SELECT SUBSTRING(e.GAME_ID, 4, 4) as Year, SUBSTRING(e.GAME_ID, 12, 1) as Game,
 COUNT(DISTINCT e.GAME_ID) as Games,
 SUM(IF(e.AB_FL='T', 1, 0)) as AB,
 SUM(IF(e.EVENT_CD>=20,IF(e.EVENT_CD<=23,1,0),0)) as H,
 SUM(IF(e.EVENT_CD=21,1,0)) as 2B,
 SUM(IF(e.EVENT_CD=22,1,0)) as 3B,
 SUM(IF(e.EVENT_CD=23,1,0)) as HR,
 SUM(IF(e.EVENT_CD=14,1,0)) as BB,
 SUM(IF(e.EVENT_CD=15,1,0)) as IBB,
 SUM(IF(e.EVENT_CD=16,1,0)) as HBP,
 SUM(IF(e.EVENT_CD=3,1,0)) as SO,
 SUM(IF(e.SF_FL='T',1,0)) as SF,
 SUM(IF(e.SH_FL='T',1,0)) as SH,
 SUM(IF(e.EVENT_CD=5,1,0)) as DI
FROM EVENTS as e
WHERE SUBSTRING(e.GAME_ID, 12, 1) != 0
 AND e.BAT_ID = 'banke101'
GROUP BY Year, Game;

Export these results to ‘banks.csv’, and open RStudio to manipulate the data.

Calculating Rates

Once in RStudio, the first thing to do is install and import the “plyr” library, which will be used split and aggregate the data, and then import the data itself from ‘Banks.csv’:

install.packages("plyr")
library("plyr")

data <- read.csv( file = "Desktop/banks.csv", header=TRUE, sep="," )

A new column calculating the number of plate appearances in each row should be calculated for convenience:

data$PA <- data$AB + data$BB + data$IBB + data$HBP + data$SH + data$SF + data$DI

New data frames are needed, as we want to look at Banks’ year-by-year performance, his career performance in either the first or second game of double header, and the 19-year total. The ddply() function will be used to split by year or game and summarize that data, and summarize() will be used for the single career total.

year_stats <- ddply(data, .(Year), summarize, Games = sum(Games), PA = sum(PA), AB = sum(AB), H = sum(H), X2B = sum(X2B), X3B = sum(X3B), HR = sum(HR), BB = sum(BB), IBB = sum(IBB), HBP = sum(HBP), SO = sum(SO), SF = sum(SF), SH = sum(SH), DI = sum(DI))
game_stats <- ddply(data, .(Game), summarize, Games = sum(Games), PA = sum(PA), AB = sum(AB), H = sum(H), X2B = sum(X2B), X3B = sum(X3B), HR = sum(HR), BB = sum(BB), IBB = sum(IBB), HBP = sum(HBP), SO = sum(SO), SF = sum(SF), SH = sum(SH), DI = sum(DI))
total_stats <- summarize(data, Games = sum(Games), PA = sum(PA), AB = sum(AB), H = sum(H), X2B = sum(X2B), X3B = sum(X3B), HR = sum(HR), BB = sum(BB), IBB = sum(IBB), HBP = sum(HBP), SO = sum(SO), SF = sum(SF), SH = sum(SH), DI = sum(DI))

View()-ing each of these data frames, you’ll notice the aggregate of each counting stat category.

The final steps are to append the necessary rate stat columns to each of these new data frames. For simplicity’s sake, we’ll stick with the basics: BA, OBP, SLG, and OPS. The “By Year” stats and the “Double Header Game” stats are calculated in a slightly different way than the total line, due to having more than one row as a result.

Instead of “summarizing” this data again (which creates additional data frames), the “transform” function should be used to append new columns to existing frames:

year_stats <- ddply(year_stats, .(Year), transform, BA = H/AB, OBP = (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF), SLG = (H+X2B+(2*X3B)+(3*HR))/AB) 
year_stats$OPS <- year_stats$OBP + year_stats$SLG
game_stats <- ddply(game_stats, .(Game), transform, BA = H/AB, OBP = (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF), SLG = (H+X2B+(2*X3B)+(3*HR))/AB) 
game_stats$OPS <- game_stats$OBP + game_stats$SLG

As the “OBP” and “SLG” columns don’t exist at the time of calling the ddply function, the “OPS” column has to be created in a separate step in the code above.

As for the total data frame, we won’t use a function to do the transform. For ease of writing, I attach() the “total_stats” data frame to cut down on typing, and detach it before the final line.

attach(total_stats)
total_stats$BA <- H/AB 
total_stats$OBP <- (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF)
total_stats$SLG <- (H+X2B+(2*X3B)+(3*HR))/AB
detach(total_stats)
total_stats$OPS <- total_stats$OBP + total_stats$SLG

The Results

At this point, stat lines have been calculated and are ready for viewing. Below are my results:

year-by-year

Year Games PA AB H X2B X3B HR BB IBB HBP SO SF SH DI BA OBP SLG OPS
1 1953 2 8 7 1 0 0 0 1 0 0 1 0 0 0 0.143 0.250 0.143 0.393
2 1954 46 189 178 52 9 2 6 7 1 0 11 1 2 0 0.292 0.321 0.466 0.787
3 1955 37 153 142 49 8 2 14 8 1 0 16 2 0 0 0.345 0.379 0.725 1.104
4 1956 47 196 176 55 15 1 10 15 5 0 17 0 0 0 0.312 0.383 0.580 0.962
5 1957 56 232 204 54 7 0 17 18 5 3 30 2 0 0 0.265 0.345 0.549 0.894
6 1958 39 171 153 44 6 3 7 12 3 1 21 2 0 0 0.288 0.351 0.503 0.854
7 1959 28 123 111 21 3 0 5 6 5 1 14 0 0 0 0.189 0.268 0.351 0.620
8 1960 38 164 142 36 11 0 7 12 7 1 15 2 0 0 0.254 0.341 0.479 0.820
9 1961 36 153 136 39 6 1 6 7 7 1 25 0 0 2 0.287 0.358 0.478 0.836
10 1962 36 146 138 42 8 2 10 5 1 2 18 0 0 0 0.304 0.342 0.609 0.951
11 1963 30 110 98 23 4 0 4 6 2 0 17 3 1 0 0.235 0.284 0.398 0.682
12 1964 36 145 135 42 8 2 7 4 3 2 16 0 1 0 0.311 0.354 0.556 0.910
13 1965 41 166 143 36 5 2 9 15 4 1 12 3 0 0 0.252 0.337 0.503 0.841
14 1966 34 129 123 25 4 1 1 2 2 0 13 2 0 0 0.203 0.225 0.276 0.501
15 1967 40 158 146 42 8 0 10 6 3 0 25 1 2 0 0.288 0.327 0.548 0.875
16 1968 33 134 119 37 7 0 8 7 1 1 14 1 5 0 0.311 0.357 0.571 0.928
17 1969 29 115 101 31 2 1 3 10 1 1 18 1 1 0 0.307 0.377 0.436 0.813
18 1970 12 33 32 9 1 0 2 1 0 0 5 0 0 0 0.281 0.303 0.500 0.803
19 1971 6 15 14 3 0 0 1 0 0 0 3 0 1 0 0.214 0.214 0.429 0.643

Double Header Game

Game Games PA AB H X2B X3B HR BB IBB HBP SO SF SH DI BA OBP SLG OPS
1 1 319 1317 1188 337 63 7 63 76 29 7 149 11 6 0 0.284 0.343 0.508 0.850
2 2 307 1223 1110 304 49 10 64 66 22 7 142 9 7 2 0.274 0.329 0.509 0.838

Total

Games PA AB H X2B X3B HR BB IBB HBP SO SF SH DI BA OBP SLG OPS
1 626 2540 2298 641 112 17 127 142 51 14 291 20 13 2 0.279 0.336 0.508 0.844

It makes logical sense that a batter would be more fresh (AKA better) in the first game of a double header, as this data correlates. Ernie Banks was a career .274/.330/.500 hitter. In this surprisingly large sample (double headers were more common in the past), Banks was not too far off from his career line. However, he did in fact have a slightly higher OPS on days where he was able to “play two.”

Ernie Banks’ Performance In Double Headers (Using MySQL and R)

One thought on “Ernie Banks’ Performance In Double Headers (Using MySQL and R)

Leave a comment