In Module 6, we started building a model for predicting a baseball player’s 2015 batting average using his 2014 batting average. We found that some models, even though they fit the data quite well, appeared to overfit and may not predict future observations well. We will now switch gears a bit and discuss how to diagnose overfitting using data on field goals in the NFL.

Training and Testing Paradigm

Suppose we have fit a whole bunch of models to a given dataset. How should we choose which model is the best? One strategy might be to see how well the models predict the data we used to fit them. While this seems intuitive, we saw in Module 6 that it is possible to “over-learn” the patterns in this training data.

A common alternative is to instead split our original dataset into two parts: a training set and a testing set. We fit all of our models on the training set and then see how well they predict the values in the testing set.

Field Goal Success in the NFL

The file “nfl_fg_train.csv” contains a large dataset about field goals attempted in the NFL between 2005 and 2015. We will train several models of field goal success with this data and then evaluate their predictive performance using the data contained in “nfl_fg_test.csv”.

> fg_train <- read_csv("data/nfl_fg_train.csv")
Parsed with column specification:
cols(
  Team = col_character(),
  Year = col_integer(),
  GameMinute = col_integer(),
  Kicker = col_character(),
  Distance = col_integer(),
  ScoreDiff = col_integer(),
  Grass = col_logical(),
  Success = col_integer()
)
  1. The simplest forecast for field goal success probability is the overall average success rate. This forecast does not differentiate between players or attempt to adjust for distance or other game contexts. Add a new column to fg_train called “phat_all” which contains the overall average field goal success rate. This may be computed by taking the average of the column “Success”.
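One possible way to do this (a minimal sketch; any equivalent mutate() call works) is:
> fg_train <-
+   fg_train %>%
+   mutate(phat_all = mean(Success))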

  2. Overall, kickers make just over 83% of their attempts. However, we know that there are some truly elite kickers (e.g. the Cowboys’ Dan Bailey) who make well over 83% of their attempts. Instead of forecasting field goal success probabilities with the overall average, we could instead compute each individual kicker’s conversion rate. Using group_by() and mutate(), add a column to fg_train called “phat_kicker” which contains each kicker’s field goal conversion rate. Be sure to ungroup() the tbl when you’re done!

> fg_train <- 
+   fg_train %>% 
+   group_by(Kicker) %>%
+   mutate(phat_kicker = mean(Success)) %>%
+   ungroup()
  3. Intuitively, distance is one of the main determinants of whether a kicker makes a field goal. We will use the cut() function to bin the data according to distance and then compute the conversion rate (averaged over all kickers) within each bin. In our dataset, the shortest field goal attempt was 18 yards and the longest was 76. We will begin by binning our data into 10-yard increments, 10 – 20, 20 – 30, …, 70 – 80. We then save the bin label in a column called “Dist_10” and the predictions in a column called “phat_dist_10”.
> fg_train <-
+   fg_train %>%
+   mutate(Dist_10 = cut(Distance, breaks = seq(from = 10, to = 80, by = 10))) %>%
+   group_by(Dist_10) %>%
+   mutate(phat_dist_10 = mean(Success)) %>%
+   ungroup()

Create a scatter plot of distance and phat_dist_10. Since there are many attempts from certain yardages, use alpha-blending (see Module 2 for a refresher on this!) to change the transparency of the points according to their frequency. Do these predictions make sense intuitively?
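One way to make this plot with ggplot2 is sketched below; the fixed alpha value of 0.1 is an arbitrary choice, and it makes the more common yardages appear darker because many semi-transparent points are drawn on top of one another.
> ggplot(fg_train) +
+   geom_point(aes(x = Distance, y = phat_dist_10), alpha = 0.1)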

  4. Using code very similar to that in Question 3, bin the Distance into 5-yard increments, saving the bin labels in a column called “Dist_5”. Add a column to fg_train that computes the overall conversion rate within each of these 5-yard bins. Call this column “phat_dist_5”.
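A sketch of one possible solution, mirroring the code from Question 3:
> fg_train <-
+   fg_train %>%
+   mutate(Dist_5 = cut(Distance, breaks = seq(from = 10, to = 80, by = 5))) %>%
+   group_by(Dist_5) %>%
+   mutate(phat_dist_5 = mean(Success)) %>%
+   ungroup()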

  5. Repeat Question 4 but using 2-yard increments. Save the bin labels in a column called “Dist_2” and the probability forecasts in a column called “phat_dist_2”. When plotting phat_dist_2 against distance, what do you notice?

> fg_train <-
+   fg_train %>%
+   mutate(Dist_2 = cut(Distance, breaks = seq(from = 10, to = 80, by = 2))) %>%
+   group_by(Dist_2) %>%
+   mutate(phat_dist_2 = mean(Success)) %>%
+   ungroup()
  6. We now have several predictive models of field goal success: phat_all, phat_kicker, phat_dist_10, phat_dist_5, and phat_dist_2. Recall from Module 6 that to assess how well we are predicting a continuous outcome, we could use the RMSE. When predicting a binary outcome, we have a few more options. To set the stage, let \(y_{i}\) be the outcome of the \(i^{\text{th}}\) observation and let \(\hat{p}_{i}\) be the forecasted probability that \(y_{i} = 1.\)

The Brier score is defined as \[ BS = \frac{1}{n}\sum_{i = 1}^{n}{(y_{i} - \hat{p}_{i})^{2}} \] Looking at the formula, we see that the Brier score is just the mean squared error of our forecasts. Using code that is very nearly identical to that used to compute RMSE in Module 6, compute the Brier score of each of our prediction models.
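One possible way to compute all five Brier scores at once (a sketch; this simply writes out the mean squared error for each forecast column) is:
> fg_train %>%
+   summarize(phat_all = mean((Success - phat_all)^2),
+             phat_kicker = mean((Success - phat_kicker)^2),
+             phat_dist_10 = mean((Success - phat_dist_10)^2),
+             phat_dist_5 = mean((Success - phat_dist_5)^2),
+             phat_dist_2 = mean((Success - phat_dist_2)^2))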

# A tibble: 1 x 5
  phat_all phat_kicker phat_dist_10 phat_dist_5 phat_dist_2
     <dbl>       <dbl>        <dbl>       <dbl>       <dbl>
1    0.140       0.138        0.125       0.123       0.123

Which model has the lowest Brier score? Do you think this model over-fits the data?

  7. Before assessing how well our models predict out-of-sample, it will be useful to create tbls that summarize each model’s forecasts. For instance, when we grouped the data by kicker, every attempt by the same kicker received the same forecast, so we do not need multiple rows recording that prediction. Instead, we can create a tbl called phat_kicker which has one row per kicker and two columns, one for the kicker and one for the associated forecast:
> phat_kicker <- 
+   fg_train %>%
+   group_by(Kicker) %>%
+   summarize(phat_kicker = mean(Success))

Create similar tbls for phat_dist_10, phat_dist_5, and phat_dist_2.
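One possible version for the 10-yard bins is sketched below; phat_dist_5 and phat_dist_2 follow the same pattern with Dist_5 and Dist_2 as the grouping variables.
> phat_dist_10 <-
+   fg_train %>%
+   group_by(Dist_10) %>%
+   summarize(phat_dist_10 = mean(Success))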

  8. The file “nfl_fg_test.csv” contains data on additional field goals kicked between 2005 and 2015. Since we have not used the data in this file to train our models, we can get a sense of the out-of-sample predictive performance of our models by looking at how well they predict these field goals. We first load the data into a tbl called fg_test.
> fg_test <- read_csv("data/nfl_fg_test.csv")
Parsed with column specification:
cols(
  Team = col_character(),
  Year = col_integer(),
  GameMinute = col_integer(),
  Kicker = col_character(),
  Distance = col_integer(),
  ScoreDiff = col_integer(),
  Grass = col_logical(),
  Success = col_integer()
)

Using mutate() and cut(), add columns “Dist_10”, “Dist_5”, and “Dist_2” to fg_test that bin the data into 10-yard, 5-yard, and 2-yard increments.
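One possible way to do this is sketched below. Using exactly the same breaks as we used on fg_train ensures that the bin labels in the two tbls match, which matters when we join the forecasts in Question 11.
> fg_test <-
+   fg_test %>%
+   mutate(Dist_10 = cut(Distance, breaks = seq(from = 10, to = 80, by = 10)),
+          Dist_5 = cut(Distance, breaks = seq(from = 10, to = 80, by = 5)),
+          Dist_2 = cut(Distance, breaks = seq(from = 10, to = 80, by = 2)))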

  9. We are now ready to add our models’ forecasts to fg_test. For instance, we can add the forecast from “phat_all”, which is just the overall conversion rate averaged over all kickers.
> fg_test <-
+   fg_test %>%
+   mutate(phat_all = mean(fg_train[["Success"]]))

Notice that instead of computing the mean of the column “Success” from fg_test, we are computing the mean from fg_train.

  10. Joining is a powerful technique to combine data from two different tables based on a key. The key is what enables us to match rows between the tables. For instance, to add the forecasts from “phat_kicker” to the tbl fg_test we can work row-by-row. First, for each field goal in fg_test, we identify the kicker who attempted the field goal. Then we can go over to the tbl phat_kicker and find the row corresponding to that kicker. We can then take that kicker’s forecast from phat_kicker and append it to the row in fg_test. What we have just described is known as an “inner join”. In this situation, the key was the Kicker. The code to carry out these operations is given below:
> fg_test <-
+   fg_test %>%
+   inner_join(phat_kicker, by = "Kicker")

To verify that we have successfully performed this join, we can print out a few rows of fg_test:

> select(fg_test, Kicker, Success, phat_kicker)
# A tibble: 1,682 x 3
   Kicker  Success phat_kicker
   <chr>     <int>       <dbl>
 1 Akers         1       0.788
 2 Akers         1       0.788
 3 Bironas       1       0.850
 4 Bironas       1       0.850
 5 Bironas       1       0.850
 6 Brien         0       0.333
 7 Brown         0       0.829
 8 Brown         1       0.829
 9 Brown         1       0.829
10 Brown         1       0.829
# ... with 1,672 more rows

Let’s unpack the inner join code line-by-line: in the first two lines, we are telling R that we want to overwrite fg_test. Next, we pipe fg_test to the function inner_join(), which takes two more arguments. The last argument by = "Kicker" tells the function that the key we want to use is the kicker. The first argument phat_kicker tells the function which tbl contains the additional data corresponding to each key value.

  11. Mimic the code shown in Question 10 to add columns to fg_test for the remaining three predictive models: “phat_dist_10”, “phat_dist_5”, and “phat_dist_2”.
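A sketch of one possible approach, assuming the tbls phat_dist_10, phat_dist_5, and phat_dist_2 were created as in Question 7, is to chain three inner joins, one per key:
> fg_test <-
+   fg_test %>%
+   inner_join(phat_dist_10, by = "Dist_10") %>%
+   inner_join(phat_dist_5, by = "Dist_5") %>%
+   inner_join(phat_dist_2, by = "Dist_2")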

  12. Compute the out-of-sample Brier scores for each of our models. Which has the best out-of-sample performance?
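The computation is essentially the same as in Question 6, only applied to fg_test; one possible sketch:
> fg_test %>%
+   summarize(phat_all = mean((Success - phat_all)^2),
+             phat_kicker = mean((Success - phat_kicker)^2),
+             phat_dist_10 = mean((Success - phat_dist_10)^2),
+             phat_dist_5 = mean((Success - phat_dist_5)^2),
+             phat_dist_2 = mean((Success - phat_dist_2)^2))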

# A tibble: 1 x 5
  phat_all phat_kicker phat_dist_10 phat_dist_5 phat_dist_2
     <dbl>       <dbl>        <dbl>       <dbl>       <dbl>
1    0.134       0.133        0.120       0.118       0.118