In this problem set, you will practice using the pipe %>%
and grouped calculations with batting statistics taken from the Lahman dataset. We will compute the same statistics as we did in Problem Set 3 and also standardize these statistics within year, year and league, and also historical era.
To load the batting data into R, we can do the following
> library(tidyverse)
> library(Lahman)
> batting <- as_tibble(Batting)
Unfortunately, some statistics like hit-by-pitch (HBP) were not recorded in the earlier decades of baseball. In the tbl, these missing values are designated NA
. To see some of these, we can run the following code:
> Batting %>%
+ select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB, GIDP)
playerID yearID teamID AB BB HBP SH SF IBB GIDP
1 abercda01 1871 TRO 4 0 NA NA NA NA NA
2 addybo01 1871 RC1 118 4 NA NA NA NA NA
3 allisar01 1871 CL1 137 2 NA NA NA NA NA
4 allisdo01 1871 WS3 133 0 NA NA NA NA NA
5 ansonca01 1871 RC1 120 2 NA NA NA NA NA
6 armstbo01 1871 FW1 49 0 NA NA NA NA NA
7 barkeal01 1871 RC1 4 1 NA NA NA NA NA
8 barnero01 1871 BS1 157 13 NA NA NA NA NA
9 barrebi01 1871 FW1 5 0 NA NA NA NA NA
10 barrofr01 1871 BS1 86 0 NA NA NA NA NA
11 bassjo01 1871 CL1 89 3 NA NA NA NA NA
12 battijo01 1871 CL1 3 1 NA NA NA NA NA
13 bealsto01 1871 WS3 36 2 NA NA NA NA NA
14 beaveed01 1871 TRO 15 0 NA NA NA NA NA
15 bechtge01 1871 PH1 94 2 NA NA NA NA NA
16 bellast01 1871 TRO 128 9 NA NA NA NA NA
17 berkena01 1871 PH1 4 0 NA NA NA NA NA
18 berryto01 1871 PH1 4 0 NA NA NA NA NA
19 berthha01 1871 WS3 73 4 NA NA NA NA NA
20 biermch01 1871 FW1 2 1 NA NA NA NA NA
21 birdge01 1871 RC1 106 3 NA NA NA NA NA
22 birdsda01 1871 BS1 152 4 NA NA NA NA NA
23 brainas01 1871 WS3 134 7 NA NA NA NA NA
24 brannmi01 1871 CH1 14 0 NA NA NA NA NA
25 burrohe01 1871 WS3 63 1 NA NA NA NA NA
26 careyto01 1871 FW1 87 2 NA NA NA NA NA
27 carleji01 1871 CL1 127 8 NA NA NA NA NA
28 conefr01 1871 BS1 77 8 NA NA NA NA NA
29 connone01 1871 TRO 33 0 NA NA NA NA NA
30 cravebi01 1871 TRO 118 3 NA NA NA NA NA
31 cuthbne01 1871 PH1 150 10 NA NA NA NA NA
32 deaneha01 1871 FW1 22 2 NA NA NA NA NA
33 donnepe01 1871 FW1 34 1 NA NA NA NA NA
34 duffyed01 1871 CH1 121 3 NA NA NA NA NA
35 eggleda01 1871 NY2 147 4 NA NA NA NA NA
36 ewellge01 1871 CL1 3 0 NA NA NA NA NA
37 fergubo01 1871 NY2 158 3 NA NA NA NA NA
38 fishech01 1871 RC1 123 3 NA NA NA NA NA
39 fislewe01 1871 PH1 147 3 NA NA NA NA NA
40 fleetfr01 1871 NY2 6 0 NA NA NA NA NA
41 flowedi01 1871 TRO 105 4 NA NA NA NA NA
42 flynncl01 1871 TRO 142 4 NA NA NA NA NA
43 foleyto01 1871 CH1 84 3 NA NA NA NA NA
44 foranji01 1871 FW1 89 2 NA NA NA NA NA
45 forceda01 1871 WS3 162 4 NA NA NA NA NA
46 fulmech01 1871 RC1 63 5 NA NA NA NA NA
47 glennjo01 1871 WS3 120 3 NA NA NA NA NA
48 goldswa01 1871 FW1 88 4 NA NA NA NA NA
49 gouldch01 1871 BS1 151 3 NA NA NA NA NA
50 hallge01 1871 WS3 136 8 NA NA NA NA NA
51 halliji01 1871 FW1 25 2 NA NA NA NA NA
52 hamra01 1871 RC1 113 1 NA NA NA NA NA
53 hastisc01 1871 RC1 118 2 NA NA NA NA NA
54 hatfijo01 1871 NY2 168 4 NA NA NA NA NA
55 heubege01 1871 PH1 75 2 NA NA NA NA NA
56 highadi01 1871 NY2 94 2 NA NA NA NA NA
57 hodesch01 1871 CH1 130 7 NA NA NA NA NA
58 jackssa01 1871 BS1 76 1 NA NA NA NA NA
59 johnsca01 1871 CL1 67 0 NA NA NA NA NA
60 kellybi01 1871 FW1 67 6 NA NA NA NA NA
61 kimbage01 1871 CL1 131 3 NA NA NA NA NA
62 kingma01 1871 CH1 101 8 NA NA NA NA NA
63 kingst01 1871 TRO 144 1 NA NA NA NA NA
64 kohlehe01 1871 FW1 12 0 NA NA NA NA NA
65 lennobi01 1871 FW1 48 1 NA NA NA NA NA
66 leonaan01 1871 WS3 148 3 NA NA NA NA NA
67 mackde01 1871 RC1 122 8 NA NA NA NA NA
68 malonfe01 1871 PH1 134 9 NA NA NA NA NA
69 mathebo01 1871 FW1 89 2 NA NA NA NA NA
70 mcatebu01 1871 CH1 135 5 NA NA NA NA NA
71 mcbridi01 1871 PH1 132 7 NA NA NA NA NA
72 mcderjo01 1871 FW1 8 1 NA NA NA NA NA
73 mcgeami01 1871 TRO 148 6 NA NA NA NA NA
74 mcmuljo01 1871 TRO 136 8 NA NA NA NA NA
75 mcveyca01 1871 BS1 153 1 NA NA NA NA NA
76 meyerle01 1871 PH1 130 2 NA NA NA NA NA
77 millsch01 1871 NY2 146 1 NA NA NA NA NA
78 millsev01 1871 WS3 157 3 NA NA NA NA NA
79 minched01 1871 FW1 36 0 NA NA NA NA NA
80 nortofr01 1871 WS3 1 0 NA NA NA NA NA
81 paborch01 1871 CL1 142 1 NA NA NA NA NA
82 patteda01 1871 NY2 151 1 NA NA NA NA NA
83 pearcdi01 1871 NY2 163 4 NA NA NA NA NA
84 phelpne01 1871 FW1 3 1 NA NA NA NA NA
85 pikeli01 1871 TRO 130 5 NA NA NA NA NA
86 pinkhed01 1871 CH1 95 18 NA NA NA NA NA
87 prattal01 1871 CL1 130 1 NA NA NA NA NA
88 prattto01 1871 PH1 6 0 NA NA NA NA NA
89 questjo01 1871 CL1 13 1 NA NA NA NA NA
90 quinnpa02 1871 FW1 17 4 NA NA NA NA NA
91 radcljo01 1871 PH1 145 6 NA NA NA NA NA
92 reachal01 1871 PH1 133 5 NA NA NA NA NA
93 sagerpo01 1871 RC1 39 2 NA NA NA NA NA
94 schafha01 1871 BS1 149 3 NA NA NA NA NA
95 selmafr01 1871 FW1 65 4 NA NA NA NA NA
96 senseco01 1871 PH1 127 0 NA NA NA NA NA
97 simmojo01 1871 CH1 129 1 NA NA NA NA NA
98 smithch01 1871 NY2 72 1 NA NA NA NA NA
99 spaldal01 1871 BS1 144 8 NA NA NA NA NA
100 startjo01 1871 NY2 161 3 NA NA NA NA NA
[ reached getOption("max.print") -- omitted 102716 rows ]
A common convention for dealing with the missing values when computing \(\text{PA}\) is to replace the NA
with a 0
. To do this, we can use the function replace_na()
within a pipe as follows.
> Batting <-
+ batting %>%
+ replace_na(list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
> Batting %>% select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB)
# A tibble: 102,816 x 9
playerID yearID teamID AB BB HBP SH SF IBB
<chr> <int> <fct> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 abercda01 1871 TRO 4 0 0 0 0 0
2 addybo01 1871 RC1 118 4 0 0 0 0
3 allisar01 1871 CL1 137 2 0 0 0 0
4 allisdo01 1871 WS3 133 0 0 0 0 0
5 ansonca01 1871 RC1 120 2 0 0 0 0
6 armstbo01 1871 FW1 49 0 0 0 0 0
7 barkeal01 1871 RC1 4 1 0 0 0 0
8 barnero01 1871 BS1 157 13 0 0 0 0
9 barrebi01 1871 FW1 5 0 0 0 0 0
10 barrofr01 1871 BS1 86 0 0 0 0 0
# ... with 102,806 more rows
The syntax for replace_na()
is a bit involved but the basic idea is you have to specify the value you want to replace each NA
. When using replace_na()
it is very important to remember to include the list(...)
bit.
Load the Lahman data and run the above code to create the tbl Batting
, which has replaced all of the NA
’ in the columns IBB, HBP, SH, SF
and GIDP`.
Using the pipe %>%
, mutate()
, filter()
, and select()
, create a tbl batting
by:
mutate()
) for plate appearances (PA), unintentional walks (uBB), singles (X1B), batting average (BA), on-base percentage (OBP), on-base plus slugging (OPS), and weighted On-Base Average (wOBA). Note that the formula for plate appearances is \(\text{PA} = \text{AB} + \text{BB} + \text{HBP} + \text{SH} + \text{SF}.\) Formulae for the remaining statistics are given in Problem Set 3.filter()
zBA_all
, zOPB_all
, etc. Remember, you must re-define the standardize()
function we wrote in Module 4> standardize <- function(x){ (x - mean(x))/sd(x)}
> batting <-
+ batting %>%
+ mutate(zBA_all = standardize(BA),
+ zOBP_all = standardize(OBP),
+ zOPS_all = standardize(OPS),
+ zwOBA_all = standardize(wOBA))
batting
by year and compute the standardized BA, OBP, OPS, and wOBA within each year. Now who are the best and worst batters according to the four measures? Name the columns containing these new standardized values zBA_year
, zOBP_year
, etc.> batting <-
+ batting %>%
+ group_by(yearID) %>%
+ mutate(zBA_year = standardize(BA),
+ zOBP_year = standardize(OBP),
+ zOPS_year = standardize(OPS),
+ zwOBA_year = standardize(wOBA))
zBA_year_lg
, zOBP_year_lg
, etc.> batting <-
+ batting %>%
+ ungroup() %>%
+ group_by(yearID) %>%
+ mutate(zBA_year_lg = standardize(BA),
+ zOBP_year_lg = standardize(OBP),
+ zOPS_year_lg = standardize(OPS),
+ zwOBA_year_lg = standardize(wOBA))
> batting <-
+ batting %>%
+ ungroup() %>%
+ mutate(HIST_ERA = case_when(
+ 1871 <= yearID & yearID <= 1892 ~ "Pioneer",
+ 1893 <= yearID & yearID <= 1919 ~ "Spitball",
+ 1920 <= yearID & yearID <= 1946 ~ "Landis",
+ 1947 <= yearID & yearID <= 1968 ~ "Baby Boomer",
+ 1969 <= yearID & yearID <= 1992 ~ "Artifical Turf",
+ 1993 <= yearID ~ "Camden Yards")) %>%
+ group_by(HIST_ERA) %>%
+ mutate(zBA_hist = standardize(BA),
+ zOBP_hist = standardize(OBP),
+ zOPS_hist = standardize(OPS),
+ zwOBA_hist = standardize(wOBA))
Use mutate()
and case_when()
(just like we did in Module 3) to add a column called Hist_era to batting
that records the historical era.
Group batting
by Hist_era and standardize BA, OBP, OPS, and wOBA within historical era. Who are the best and worst batters now? Name the columns containing these new stanardized values zBA_hist
, zOBP_hist
, etc.
Remove the grouping you added in Problem 6.
Recall from Problem Set 2, we plotted the relative payroll of MLB teams against their winning percentage. In that problem set, we read in a file that had included the relative payroll for each team as a separate column. To get some additional practice with dplyr, we will read in a different dataset and re-compute these relative payrolls.
> standardize <- function(x){
+ mu <- mean(x, na.rm = TRUE)
+ sigma <- sd(x, na.rm = TRUE)
+ return( (x - mu)/sigma)
+ }
We need to write another function in order to compute “relative payroll”. This function will take in a vector x
, compute its median, and then divides every element of x
by the median.
> relative <- function(x){
+ med <- median(x, na.rm = TRUE)
+ return( x/med )
+ }
mlb_payrolls
.Parsed with column specification:
cols(
Team = col_character(),
GM = col_character(),
Team_Payroll = col_integer(),
Winning_Percentage = col_double(),
Year = col_integer()
)
Using the pipe %>%
, group_by()
, and mutate()
, add a column to mlb_payrolls
that contains the relative payroll for each team.
Make a scatterplot of winning percentage against relative payrolls. Comment on the relationship. Your scatterplot should be identical to one you made in Problem Set 2.
Using the summarize()
function, compute the average team payroll and relative payroll for each year. Save these results in a new tbl called payroll_avg
.
Make a scatterplot that shows how team payrolls have evolved over the year. Similar to what we did in Module 4, add a line to this scatterplot that shows the average team payroll. Do the same thing for relative payroll. What do you notice about the average team payroll and relative payroll?
As you will see in coming lectures, correlation is a measure of the strength of the linear relationship between two variables. The closer to +1 or -1 the correlation between two variables is, the more predictable they are of each other. We can compute it using the cor()
function. Using summary()
and cor()
, compute the correlation between relative payroll and winning percentage within each year. What do you notice about how the relationship between winning percentage and relative payroll changes year to year?
# A tibble: 18 x 2
Year cor
<int> <dbl>
1 1998 0.764
2 1999 0.699
3 2000 0.327
4 2001 0.338
5 2002 0.443
6 2003 0.415
7 2004 0.515
8 2005 0.497
9 2006 0.538
10 2007 0.495
11 2008 0.322
12 2009 0.504
13 2010 0.347
14 2011 0.408
15 2012 0.195
16 2013 0.330
17 2014 0.297
18 2015 0.281
Without running the code, work with your teammates to see if you can figure out what the code below is doing.
> batting_2014_2015 <-
+ batting %>%
+ filter(yearID %in% c(2014, 2015)) %>%
+ group_by(playerID) %>%
+ filter(n() == 2) %>%
+ select(playerID, yearID, BA) %>%
+ arrange(playerID)
batting_2014_2015.RData
to the file “data/batting_2014_2015.RData”. We will return to this dataset in Module 6.