Problem Set 4

MLB Batting Statistics

In this problem set, you will practice using the pipe %>% and grouped calculations with batting statistics taken from the Lahman dataset. We will compute the same statistics as we did in Problem Set 3 and also standardize these statistics within year, year and league, and also historical era.

To load the batting data into R, we can do the following

> library(tidyverse)
> library(Lahman)
> batting <- as_tibble(Batting)

Unfortunately, some statistics like hit-by-pitch (HBP) were not recorded in the earlier decades of baseball. In the tbl, these missing values are designated NA. To see some of these, we can run the following code:

> Batting %>% 
+   select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB, GIDP)
        playerID yearID teamID  AB  BB HBP SH SF IBB GIDP
1      abercda01   1871    TRO   4   0  NA NA NA  NA   NA
2       addybo01   1871    RC1 118   4  NA NA NA  NA   NA
3      allisar01   1871    CL1 137   2  NA NA NA  NA   NA
4      allisdo01   1871    WS3 133   0  NA NA NA  NA   NA
5      ansonca01   1871    RC1 120   2  NA NA NA  NA   NA
6      armstbo01   1871    FW1  49   0  NA NA NA  NA   NA
7      barkeal01   1871    RC1   4   1  NA NA NA  NA   NA
8      barnero01   1871    BS1 157  13  NA NA NA  NA   NA
9      barrebi01   1871    FW1   5   0  NA NA NA  NA   NA
10     barrofr01   1871    BS1  86   0  NA NA NA  NA   NA
11      bassjo01   1871    CL1  89   3  NA NA NA  NA   NA
12     battijo01   1871    CL1   3   1  NA NA NA  NA   NA
13     bealsto01   1871    WS3  36   2  NA NA NA  NA   NA
14     beaveed01   1871    TRO  15   0  NA NA NA  NA   NA
15     bechtge01   1871    PH1  94   2  NA NA NA  NA   NA
16     bellast01   1871    TRO 128   9  NA NA NA  NA   NA
17     berkena01   1871    PH1   4   0  NA NA NA  NA   NA
18     berryto01   1871    PH1   4   0  NA NA NA  NA   NA
19     berthha01   1871    WS3  73   4  NA NA NA  NA   NA
20     biermch01   1871    FW1   2   1  NA NA NA  NA   NA
21      birdge01   1871    RC1 106   3  NA NA NA  NA   NA
22     birdsda01   1871    BS1 152   4  NA NA NA  NA   NA
23     brainas01   1871    WS3 134   7  NA NA NA  NA   NA
24     brannmi01   1871    CH1  14   0  NA NA NA  NA   NA
25     burrohe01   1871    WS3  63   1  NA NA NA  NA   NA
26     careyto01   1871    FW1  87   2  NA NA NA  NA   NA
27     carleji01   1871    CL1 127   8  NA NA NA  NA   NA
28      conefr01   1871    BS1  77   8  NA NA NA  NA   NA
29     connone01   1871    TRO  33   0  NA NA NA  NA   NA
30     cravebi01   1871    TRO 118   3  NA NA NA  NA   NA
31     cuthbne01   1871    PH1 150  10  NA NA NA  NA   NA
32     deaneha01   1871    FW1  22   2  NA NA NA  NA   NA
33     donnepe01   1871    FW1  34   1  NA NA NA  NA   NA
34     duffyed01   1871    CH1 121   3  NA NA NA  NA   NA
35     eggleda01   1871    NY2 147   4  NA NA NA  NA   NA
36     ewellge01   1871    CL1   3   0  NA NA NA  NA   NA
37     fergubo01   1871    NY2 158   3  NA NA NA  NA   NA
38     fishech01   1871    RC1 123   3  NA NA NA  NA   NA
39     fislewe01   1871    PH1 147   3  NA NA NA  NA   NA
40     fleetfr01   1871    NY2   6   0  NA NA NA  NA   NA
41     flowedi01   1871    TRO 105   4  NA NA NA  NA   NA
42     flynncl01   1871    TRO 142   4  NA NA NA  NA   NA
43     foleyto01   1871    CH1  84   3  NA NA NA  NA   NA
44     foranji01   1871    FW1  89   2  NA NA NA  NA   NA
45     forceda01   1871    WS3 162   4  NA NA NA  NA   NA
46     fulmech01   1871    RC1  63   5  NA NA NA  NA   NA
47     glennjo01   1871    WS3 120   3  NA NA NA  NA   NA
48     goldswa01   1871    FW1  88   4  NA NA NA  NA   NA
49     gouldch01   1871    BS1 151   3  NA NA NA  NA   NA
50      hallge01   1871    WS3 136   8  NA NA NA  NA   NA
51     halliji01   1871    FW1  25   2  NA NA NA  NA   NA
52       hamra01   1871    RC1 113   1  NA NA NA  NA   NA
53     hastisc01   1871    RC1 118   2  NA NA NA  NA   NA
54     hatfijo01   1871    NY2 168   4  NA NA NA  NA   NA
55     heubege01   1871    PH1  75   2  NA NA NA  NA   NA
56     highadi01   1871    NY2  94   2  NA NA NA  NA   NA
57     hodesch01   1871    CH1 130   7  NA NA NA  NA   NA
58     jackssa01   1871    BS1  76   1  NA NA NA  NA   NA
59     johnsca01   1871    CL1  67   0  NA NA NA  NA   NA
60     kellybi01   1871    FW1  67   6  NA NA NA  NA   NA
61     kimbage01   1871    CL1 131   3  NA NA NA  NA   NA
62      kingma01   1871    CH1 101   8  NA NA NA  NA   NA
63      kingst01   1871    TRO 144   1  NA NA NA  NA   NA
64     kohlehe01   1871    FW1  12   0  NA NA NA  NA   NA
65     lennobi01   1871    FW1  48   1  NA NA NA  NA   NA
66     leonaan01   1871    WS3 148   3  NA NA NA  NA   NA
67      mackde01   1871    RC1 122   8  NA NA NA  NA   NA
68     malonfe01   1871    PH1 134   9  NA NA NA  NA   NA
69     mathebo01   1871    FW1  89   2  NA NA NA  NA   NA
70     mcatebu01   1871    CH1 135   5  NA NA NA  NA   NA
71     mcbridi01   1871    PH1 132   7  NA NA NA  NA   NA
72     mcderjo01   1871    FW1   8   1  NA NA NA  NA   NA
73     mcgeami01   1871    TRO 148   6  NA NA NA  NA   NA
74     mcmuljo01   1871    TRO 136   8  NA NA NA  NA   NA
75     mcveyca01   1871    BS1 153   1  NA NA NA  NA   NA
76     meyerle01   1871    PH1 130   2  NA NA NA  NA   NA
77     millsch01   1871    NY2 146   1  NA NA NA  NA   NA
78     millsev01   1871    WS3 157   3  NA NA NA  NA   NA
79     minched01   1871    FW1  36   0  NA NA NA  NA   NA
80     nortofr01   1871    WS3   1   0  NA NA NA  NA   NA
81     paborch01   1871    CL1 142   1  NA NA NA  NA   NA
82     patteda01   1871    NY2 151   1  NA NA NA  NA   NA
83     pearcdi01   1871    NY2 163   4  NA NA NA  NA   NA
84     phelpne01   1871    FW1   3   1  NA NA NA  NA   NA
85      pikeli01   1871    TRO 130   5  NA NA NA  NA   NA
86     pinkhed01   1871    CH1  95  18  NA NA NA  NA   NA
87     prattal01   1871    CL1 130   1  NA NA NA  NA   NA
88     prattto01   1871    PH1   6   0  NA NA NA  NA   NA
89     questjo01   1871    CL1  13   1  NA NA NA  NA   NA
90     quinnpa02   1871    FW1  17   4  NA NA NA  NA   NA
91     radcljo01   1871    PH1 145   6  NA NA NA  NA   NA
92     reachal01   1871    PH1 133   5  NA NA NA  NA   NA
93     sagerpo01   1871    RC1  39   2  NA NA NA  NA   NA
94     schafha01   1871    BS1 149   3  NA NA NA  NA   NA
95     selmafr01   1871    FW1  65   4  NA NA NA  NA   NA
96     senseco01   1871    PH1 127   0  NA NA NA  NA   NA
97     simmojo01   1871    CH1 129   1  NA NA NA  NA   NA
98     smithch01   1871    NY2  72   1  NA NA NA  NA   NA
99     spaldal01   1871    BS1 144   8  NA NA NA  NA   NA
100    startjo01   1871    NY2 161   3  NA NA NA  NA   NA
 [ reached getOption("max.print") -- omitted 102716 rows ]

A common convention for dealing with the missing values when computing \(\text{PA}\) is to replace the NA with a 0. To do this, we can use the function replace_na() within a pipe as follows.

> Batting <- 
+   batting %>% 
+   replace_na(list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
> Batting %>% select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB)
# A tibble: 102,816 x 9
   playerID  yearID teamID    AB    BB   HBP    SH    SF   IBB
   <chr>      <int> <fct>  <int> <int> <dbl> <dbl> <dbl> <dbl>
 1 abercda01   1871 TRO        4     0     0     0     0     0
 2 addybo01    1871 RC1      118     4     0     0     0     0
 3 allisar01   1871 CL1      137     2     0     0     0     0
 4 allisdo01   1871 WS3      133     0     0     0     0     0
 5 ansonca01   1871 RC1      120     2     0     0     0     0
 6 armstbo01   1871 FW1       49     0     0     0     0     0
 7 barkeal01   1871 RC1        4     1     0     0     0     0
 8 barnero01   1871 BS1      157    13     0     0     0     0
 9 barrebi01   1871 FW1        5     0     0     0     0     0
10 barrofr01   1871 BS1       86     0     0     0     0     0
# ... with 102,806 more rows

The syntax for replace_na() is a bit involved but the basic idea is you have to specify the value you want to replace each NA. When using replace_na() it is very important to remember to include the list(...) bit.

Load the Lahman data and run the above code to create the tbl Batting, which has replaced all of the NA’ in the columns IBB, HBP, SH, SF and GIDP`.
Using the pipe %>%, mutate(), filter(), and select(), create a tbl batting by:

Adding columns (with mutate()) for plate appearances (PA), unintentional walks (uBB), singles (X1B), batting average (BA), on-base percentage (OBP), on-base plus slugging (OPS), and weighted On-Base Average (wOBA). Note that the formula for plate appearances is \(\text{PA} = \text{AB} + \text{BB} + \text{HBP} + \text{SH} + \text{SF}.\) Formulae for the remaining statistics are given in Problem Set 3.
Pulling out only those rows of players with at least 502 plate appearances and who played in either the AL or NL with filter()
Select the following columns: playerID, yearID, lgID, teamID, PA, BA, OBP, OPS, wOBA.

Standardize each of BA, OBP, OPS, and wOBA using data from all of the years. Who were the best and worst batters according to these four metrics? Name the columns containing these new standardizedvalues zBA_all, zOPB_all, etc. Remember, you must re-define the standardize() function we wrote in Module 4

> standardize <- function(x){ (x - mean(x))/sd(x)}
> batting <-
+   batting %>%
+   mutate(zBA_all = standardize(BA),
+          zOBP_all = standardize(OBP),
+          zOPS_all = standardize(OPS),
+          zwOBA_all = standardize(wOBA))

Group batting by year and compute the standardized BA, OBP, OPS, and wOBA within each year. Now who are the best and worst batters according to the four measures? Name the columns containing these new standardized values zBA_year, zOBP_year, etc.

> batting <- 
+   batting %>%
+   group_by(yearID) %>%
+   mutate(zBA_year = standardize(BA),
+          zOBP_year = standardize(OBP),
+          zOPS_year = standardize(OPS),
+          zwOBA_year = standardize(wOBA))

Remove the grouping by year and instead group by year and league. Once again, standardize OBP, OPS, and wOBA within each league-year combination. Are the best and worst batters still the same? Name the columns containing these new standardized valuse zBA_year_lg, zOBP_year_lg, etc.

> batting <- 
+   batting %>%
+   ungroup() %>% 
+   group_by(yearID) %>%
+   mutate(zBA_year_lg = standardize(BA),
+          zOBP_year_lg = standardize(OBP),
+          zOPS_year_lg = standardize(OPS),
+          zwOBA_year_lg = standardize(wOBA))

Remove the grouping you created in Problem 4. Bill James divided baseball history into several eras as follows:

Pioneer Era: 1871 – 1892
Spitball Era: 1893 – 1919
Landis Era: 1920 – 1946
Baby Boomer Era: 1947 – 1968
Artifical Turf Era: 1969 – 1992
Camden Yards Era: 1993 – present

> batting <-
+   batting %>%
+   ungroup() %>%
+   mutate(HIST_ERA = case_when(
+     1871 <= yearID & yearID <= 1892 ~ "Pioneer",
+     1893 <= yearID & yearID <= 1919 ~ "Spitball",
+     1920 <= yearID & yearID <= 1946 ~ "Landis",
+     1947 <= yearID & yearID <= 1968 ~ "Baby Boomer",
+     1969 <= yearID & yearID <= 1992 ~ "Artifical Turf",
+     1993 <= yearID ~ "Camden Yards")) %>%
+   group_by(HIST_ERA) %>%
+   mutate(zBA_hist = standardize(BA),
+          zOBP_hist = standardize(OBP),
+          zOPS_hist = standardize(OPS),
+          zwOBA_hist = standardize(wOBA))

Use mutate() and case_when() (just like we did in Module 3) to add a column called Hist_era to batting that records the historical era.

Group batting by Hist_era and standardize BA, OBP, OPS, and wOBA within historical era. Who are the best and worst batters now? Name the columns containing these new stanardized values zBA_hist, zOBP_hist, etc.
Remove the grouping you added in Problem 6.

MLB Payroll and Winnings

Recall from Problem Set 2, we plotted the relative payroll of MLB teams against their winning percentage. In that problem set, we read in a file that had included the relative payroll for each team as a separate column. To get some additional practice with dplyr, we will read in a different dataset and re-compute these relative payrolls.

Remember the function we wrote in Module 4 to standardize various statistics? It is reproduced below

> standardize <- function(x){
+   mu <- mean(x, na.rm = TRUE)
+   sigma <- sd(x, na.rm = TRUE)
+   return( (x - mu)/sigma)
+ }

We need to write another function in order to compute “relative payroll”. This function will take in a vector x, compute its median, and then divides every element of x by the median.

> relative <- function(x){
+   med <- median(x, na.rm = TRUE)
+   return( x/med )  
+ }

Read in the MLB Payroll Data and load it into a tibble called mlb_payrolls.

Parsed with column specification:
cols(
  Team = col_character(),
  GM = col_character(),
  Team_Payroll = col_integer(),
  Winning_Percentage = col_double(),
  Year = col_integer()
)

Using the pipe %>%, group_by(), and mutate(), add a column to mlb_payrolls that contains the relative payroll for each team.
Make a scatterplot of winning percentage against relative payrolls. Comment on the relationship. Your scatterplot should be identical to one you made in Problem Set 2.
Using the summarize() function, compute the average team payroll and relative payroll for each year. Save these results in a new tbl called payroll_avg.
Make a scatterplot that shows how team payrolls have evolved over the year. Similar to what we did in Module 4, add a line to this scatterplot that shows the average team payroll. Do the same thing for relative payroll. What do you notice about the average team payroll and relative payroll?
As you will see in coming lectures, correlation is a measure of the strength of the linear relationship between two variables. The closer to +1 or -1 the correlation between two variables is, the more predictable they are of each other. We can compute it using the cor() function. Using summary() and cor(), compute the correlation between relative payroll and winning percentage within each year. What do you notice about how the relationship between winning percentage and relative payroll changes year to year?

# A tibble: 18 x 2
    Year   cor
   <int> <dbl>
 1  1998 0.764
 2  1999 0.699
 3  2000 0.327
 4  2001 0.338
 5  2002 0.443
 6  2003 0.415
 7  2004 0.515
 8  2005 0.497
 9  2006 0.538
10  2007 0.495
11  2008 0.322
12  2009 0.504
13  2010 0.347
14  2011 0.408
15  2012 0.195
16  2013 0.330
17  2014 0.297
18  2015 0.281

A Challenge Question

Without running the code, work with your teammates to see if you can figure out what the code below is doing.

> batting_2014_2015 <-
+   batting %>%
+   filter(yearID %in% c(2014, 2015)) %>%
+   group_by(playerID) %>%
+   filter(n() == 2) %>% 
+   select(playerID, yearID, BA) %>%
+   arrange(playerID)

Run the code above and save the tbl batting_2014_2015.RData to the file “data/batting_2014_2015.RData”. We will return to this dataset in Module 6.