MLB Batting Statistics

In this problem set, we will gain more experience using the dplyr verbs we learned in Module 3 to analyze batting statistics of MLB players with at least 502.2 plate appearances. All of the data is contained in the file “data/hitting_qualified.csv”.

  1. Load the data into a tibble called hitting_qualified using read_csv().
Parsed with column specification:
cols(
  .default = col_integer(),
  playerID = col_character(),
  teamID = col_character(),
  lgID = col_character(),
  CS = col_character(),
  IBB = col_character(),
  SF = col_character(),
  GIDP = col_character()
)
See spec(...) for full column specifications.

The columns of this dataset include

  1. Use arrange() to find out the first and last season for which we have data. Hint: you may need to use desc() as well.

  2. Use summarize() to find out the first and last season for which we have data. Hint: you only need one line of code to do this.

  3. When you print out hitting_qualified, you’ll notice that some columns were read in as characters rather than integers or numerics. This can happen when the original CSV file has missing values. In this case, the columns IBB, HBP, SH, SF, and GIDP were read in as characters. We want to convert these to integers, which we can do using mutate() and the function as.integer().

> hitting_qualified <- mutate(hitting_qualified, IBB = as.integer(IBB), HBP = as.integer(HBP), 
+     SH = as.integer(SH), SF = as.integer(SF), GIDP = as.integer(GIDP))
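
To see why this conversion leaves NA values behind, here is a minimal sketch on a made-up character vector. The name raw_ibb and its values are hypothetical and are not taken from the data set. as.integer() parses strings that look like whole numbers and returns NA for anything it cannot parse, possibly with a coercion warning.

```r
# Hypothetical character column, similar to what read_csv() may produce
# when some entries are missing or not numeric (values are made up).
raw_ibb <- c("3", "12", NA, "n/a")

# Strings that look like integers are converted; everything else becomes NA.
# suppressWarnings() hides the coercion warning that unparseable strings can trigger.
ibb <- suppressWarnings(as.integer(raw_ibb))

print(ibb)         # integer vector with NA in the last two positions
print(is.na(ibb))  # flags which entries are missing after conversion
```

Checking is.na() after a conversion like this is a quick way to see how many entries could not be parsed.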
  1. Let’s take a look at some of the columns we just converted:
> select(hitting_qualified, playerID, yearID, AB, IBB, HBP, SH, SF, GIDP)
# A tibble: 12,043 x 8
   playerID  yearID    AB   IBB   HBP    SH    SF  GIDP
   <chr>      <int> <int> <int> <int> <int> <int> <int>
 1 ansonca01   1884   475    NA    NA    NA    NA    NA
 2 bradyst01   1884   485    NA     0    NA    NA    NA
 3 connoro01   1884   477    NA    NA    NA    NA    NA
 4 dalryab01   1884   521    NA    NA    NA    NA    NA
 5 farreja02   1884   469    NA    NA    NA    NA    NA
 6 gleasbi01   1884   472    NA    12    NA    NA    NA
 7 hinespa01   1884   490    NA    NA    NA    NA    NA
 8 hornujo01   1884   518    NA    NA    NA    NA    NA
 9 jonesch01   1884   472    NA    10    NA    NA    NA
10 nelsoca01   1884   432    NA     9    NA    NA    NA
# ... with 12,033 more rows

You’ll notice that many of these columns contain NA values, which indicate missing data. This makes sense, since many of these statistics were not recorded in the early years of baseball. A popular convention for dealing with these missing statistics is to impute the missing values with 0. That is, every place we see an NA, we replace it with a 0. We can do that with the replace_na() function as follows.

> hitting_qualified <- replace_na(hitting_qualified, list(IBB = 0, HBP = 0, SH = 0, 
+     SF = 0, GIDP = 0))

We will discuss the syntax for replace_na() later in lecture.
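
In the meantime, the following sketch shows the same call on a small made-up tibble so the effect is easy to see. The tibble toy and its values are hypothetical. The second argument is a named list: each name is a column, and each value is what replaces the NAs in that column.

```r
library(tibble)
library(tidyr)

# Hypothetical three-row tibble; the values are invented for illustration.
toy <- tibble(playerID = c("a01", "b01", "c01"),
              HBP = c(NA, 4L, NA),
              SF  = c(2L, NA, NA))

# Replace NAs column by column. Using 0L (rather than 0) keeps
# the columns stored as integers.
toy_filled <- replace_na(toy, list(HBP = 0L, SF = 0L))

print(toy_filled)
```

Columns that are not named in the list (here, playerID) are left untouched.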

  1. Use mutate() to add a column for the number of singles, which can be computed as \(\text{X1B} = \text{H} - \text{X2B} - \text{X3B} - \text{HR}\).
> hitting_qualified <- mutate(hitting_qualified, X1B = H - X2B - X3B - HR)
  1. The variable BB counts all walks, including intentional walks (IBB). Use mutate() to add a column to hitting_qualified that counts the number of unintentional walks (uBB). Be sure to save the resulting tibble as hitting_qualified.

  2. Use mutate() to add columns for the following offensive statistics, whose formulae are given below. We have also included links to pages on Fangraphs that define and discuss each of these statistics.

> hitting_qualified <- mutate(hitting_qualified,
+                   BBP = BB/PA,
+                   KP = SO/PA,
+                   OBP = (H + BB + HBP)/(AB + BB + HBP + SF),
+                   SLG = (X1B + 2*X2B + 3*X3B + 4*HR)/AB,
+                   OPS = OBP + SLG,
+                   wOBA = (0.687 * uBB + 0.718 * HBP + 0.81 * X1B + 1.256 * X2B + 
+                             1.594 * X3B + 2.065 * HR)/(AB + uBB + SF + HBP))
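
As a sanity check on the OBP, SLG, and OPS formulas above, here is a small worked example on a single made-up stat line. The tibble toy and all of its numbers are hypothetical. It also checks that the numerator of SLG (total bases) can be written equivalently as H + X2B + 2*X3B + 3*HR.

```r
library(tibble)
library(dplyr)

# Hypothetical season line for one player (numbers are invented).
toy <- tibble(AB = 500L, H = 150L, X2B = 30L, X3B = 5L, HR = 20L,
              BB = 60L, HBP = 5L, SF = 5L)

toy <- mutate(toy,
              X1B = H - X2B - X3B - HR,                    # singles: 150 - 30 - 5 - 20
              OBP = (H + BB + HBP)/(AB + BB + HBP + SF),   # times on base / opportunities
              TB  = X1B + 2*X2B + 3*X3B + 4*HR,            # total bases
              SLG = TB/AB,
              OPS = OBP + SLG)

print(toy)

# Total bases written without the singles column; should match TB.
tb_alt <- toy$H + toy$X2B + 2*toy$X3B + 3*toy$HR
print(tb_alt == toy$TB)
```

Note that mutate() lets later columns (such as OPS) refer to columns created earlier in the same call.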
  1. For most of the statistics in the previous question, Fangraphs has defined rating scales (to see these ratings, click on the linked page for each statistic in Question 6 and scroll down to the “Context” section of the page). Use mutate() and case_when() to add the ratings for walk percentage (BBP), strike-out percentage (KP), on-base percentage (OBP), on-base plus slugging (OPS), and wOBA. Call the columns “BBP_rating”, “KP_rating”, “OBP_rating”, “OPS_rating”, and “wOBA_rating”.

  2. Use filter() to subset the players who played between 2000 and 2015. Call the new tibble tmp_batting.

  3. Use select() to create a tibble called batting_recent containing all players who played between 2000 and 2015 with the following columns: playerID, yearID, teamID, lgID, and all of the statistics and ratings created in Problems 6 and 7.

  4. Using the tibble batting_recent, explore the distribution of some of the batting statistics introduced in Problem 6 with histograms. Then explore the relationships between some of these statistics with scatterplots.
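
For the rating question above, the general pattern of mutate() with case_when() can be sketched as follows. The tibble toy, the three labels, and the cutoffs 0.370 and 0.310 are all invented for illustration. They are not the Fangraphs scale, which you should look up on the linked pages. case_when() checks the conditions from top to bottom and uses the first one that is TRUE, so the order of the conditions matters.

```r
library(tibble)
library(dplyr)

# Hypothetical players and OBP values (made up for illustration).
toy <- tibble(playerID = c("a01", "b01", "c01", "d01"),
              OBP = c(0.400, 0.335, 0.290, NA))

# Cutoffs and labels below are placeholders, NOT the Fangraphs ratings.
toy <- mutate(toy,
              OBP_rating = case_when(
                is.na(OBP)   ~ NA_character_,  # keep missing values missing
                OBP >= 0.370 ~ "High",
                OBP >= 0.310 ~ "Middle",       # reached only if OBP < 0.370
                TRUE         ~ "Low"))         # everything else

print(toy)
```

Because the conditions are checked in order, the second condition does not need an explicit upper bound.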