Module 3: Wrangling Data

We will be using tools contained in the dplyr package, which is loaded automatically when we load the tidyverse. There are five main functions in dplyr, corresponding to the most common things you’ll want to do with your data. We will learn each of these.

Reordering the rows with arrange()
Identifying observations that satisfy certain conditions with filter()
Creating new variables that are functions of existing variables with mutate()
Picking a subset of variables by name with select()
Generating simple summaries of the data with summarise() (also spelled summarize())

Arranging Data

In Module 2, we looked at NBA shooting data over 20 seasons. When we visualized this data, we noticed that there were some players who took very few of a certain type of shot. To verify this, we can sort our tbl according to the number of field goals attempted. We’ll start by loading the NBA shooting data again into a tbl called raw_shooting (the reasons for this naming convention will be clearer soon).

> library(tidyverse)
> raw_shooting <- read_csv(file = "data/nba_shooting.csv")
Parsed with column specification:
cols(
  PLAYER = col_character(),
  SEASON = col_integer(),
  FGM = col_integer(),
  FGA = col_integer(),
  TPM = col_integer(),
  TPA = col_integer(),
  FTM = col_integer(),
  FTA = col_integer(),
  FGP = col_double(),
  TPP = col_double(),
  FTP = col_double()
)

The arrange() function works by taking a tbl and a set of column names and sorting the data according to the values in these columns.

> arrange(raw_shooting, FGA)
# A tibble: 7,447 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Dajuan Wa…   2007     1     1     1     1     1     2 1         1   0.5
 2 Tyson Whe…   1999     1     1     1     1     1     2 1         1   0.5
 3 Alvin Wil…   2007     0     2     0     1     2     4 0         0   0.5
 4 Donald Wh…   1998     1     2     0     1     0     2 0.5       0   0  
 5 Mustafa S…   2014     0     3     0     1     1     2 0         0   0.5
 6 John Luca…   2011     1     3     0     1     0     2 0.333     0   0  
 7 Roger Pow…   2007     0     3     0     1     2     2 0         0   1  
 8 Alvin Wil…   2006     0     3     0     2     1     2 0         0   0.5
 9 Rusty LaR…   2004     1     3     1     1     1     2 0.333     1   0.5
10 Dell Demps   1997     0     3     0     1     2     2 0         0   1  
# ... with 7,437 more rows

We see now that there were two players who attempted only one field goal. We could instead sort the data according to FGA but in descending order, using desc():

> arrange(raw_shooting, desc(FGA))
# A tibble: 7,447 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Kobe Brya…   2006   978  2173   180   518   696   819 0.450 0.347 0.850
 2 Allen Ive…   2003   804  1940    84   303   570   736 0.414 0.277 0.774
 3 Jerry Sta…   2001   774  1927   166   473   666   810 0.402 0.351 0.822
 4 Kobe Brya…   2003   868  1924   124   324   601   713 0.451 0.383 0.843
 5 Michael J…   1998   881  1893    30   126   565   721 0.465 0.238 0.784
 6 Michael J…   1997   920  1892   111   297   480   576 0.486 0.374 0.833
 7 LeBron Ja…   2006   875  1823   127   379   601   814 0.480 0.335 0.738
 8 Allen Ive…   2006   815  1822    72   223   675   829 0.447 0.323 0.814
 9 Allen Ive…   2005   771  1818   104   338   656   786 0.424 0.308 0.835
10 Tracy McG…   2003   829  1813   173   448   576   726 0.457 0.386 0.793
# ... with 7,437 more rows

When we specify more than one column, arrange() uses each additional column to break ties in the values of the preceding columns:

> arrange(raw_shooting, FGA, TPA, FTA)
# A tibble: 7,447 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Dajuan Wa…   2007     1     1     1     1     1     2 1         1   0.5
 2 Tyson Whe…   1999     1     1     1     1     1     2 1         1   0.5
 3 Donald Wh…   1998     1     2     0     1     0     2 0.5       0   0  
 4 Alvin Wil…   2007     0     2     0     1     2     4 0         0   0.5
 5 Mustafa S…   2014     0     3     0     1     1     2 0         0   0.5
 6 John Luca…   2011     1     3     0     1     0     2 0.333     0   0  
 7 Roger Pow…   2007     0     3     0     1     2     2 0         0   1  
 8 Rusty LaR…   2004     1     3     1     1     1     2 0.333     1   0.5
 9 Dell Demps   1997     0     3     0     1     2     2 0         0   1  
10 Alvin Wil…   2006     0     3     0     2     1     2 0         0   0.5
# ... with 7,437 more rows
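Tie-breaking is easier to see on a tiny example. The sketch below assumes a small made-up tbl called toy (not part of the NBA data) in which two rows are tied on FGA, and it also shows that desc() can be applied to a tie-breaking column.

```r
library(dplyr)

# Hypothetical toy data: players A and C are tied on FGA
toy <- tibble(
  PLAYER = c("A", "B", "C", "D"),
  FGA    = c(3, 1, 3, 2),
  TPA    = c(2, 1, 1, 1)
)

# Sort by FGA (ascending); break the A/C tie using TPA in descending order
sorted <- arrange(toy, FGA, desc(TPA))
sorted$PLAYER   # "B" "D" "A" "C"
```

Rows B and D come first because they have the fewest attempts; A is placed ahead of C only because its TPA is larger.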

Filtering Data

When we start computing advanced statistics like effective field goal percentage and true shooting percentage, we probably don’t want to consider those players for whom we have very little data. For instance, we probably do not want to include the players who took a very limited number of shots in any one season in our analysis. The function filter() is used to pull out subsets of observations that satisfy some logical condition like “FGA > 100” or “FGA > 100 and FTA > 50”.

To make such comparisons in R, we have the following operators available at our disposal:

== for “equal to”
!= for “not equal to”
< and <= for “less than” and “less than or equal to”
> and >= for “greater than” and “greater than or equal to”
&, |, ! for “AND”, “OR”, and “NOT”

The code below pulls out all of the players with more than 100 field goal attempts in a single season:

> filter(raw_shooting, FGA > 100)
# A tibble: 6,295 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 6,285 more rows
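The comparison and logical operators listed above work element by element on vectors, which is what filter() relies on when it decides which rows to keep. Here is a minimal sketch using two made-up vectors (the values are hypothetical, not taken from raw_shooting):

```r
# Hypothetical attempt counts for four player-seasons
FGA <- c(50, 100, 150, 400)
FTA <- c(10, 60, 40, 200)

FGA > 100              # FALSE FALSE  TRUE  TRUE
FGA >= 100             # FALSE  TRUE  TRUE  TRUE
FGA > 100 & FTA > 50   # FALSE FALSE FALSE  TRUE
FGA > 100 | FTA > 50   # FALSE  TRUE  TRUE  TRUE
!(FGA > 100)           #  TRUE  TRUE FALSE FALSE
```

Notice that the second player-season, with exactly 100 attempts, is kept by >= but dropped by >.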

When we run this code, you’ll notice that R prints out a tbl with 6,295 rows. However, it has not removed the players with 100 or fewer field goal attempts from the original tbl raw_shooting. In fact, dplyr functions never modify their input; they work by creating a copy and modifying that. So if we want to use the tbl consisting of just those players with at least 100 field goal attempts, we need to save the filtered copy of raw_shooting as a new tbl. Note that the code below uses >= rather than >, so players with exactly 100 attempts are now kept. Arranging this new tbl verifies that all observations in it have at least 100 field goal attempts.

> new_data <- filter(raw_shooting, FGA >= 100)
> arrange(new_data, FGA)
# A tibble: 6,306 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Jordan Fa…   2016    42   100    16    45    10    10  0.42 0.356 1    
 2 Jerry Sta…   2012    37   100    13    38    21    23  0.37 0.342 0.913
 3 Brian Car…   2011    43   100    42    87    17    18  0.43 0.483 0.944
 4 Kyrylo Fe…   2011    44   100     0     1    18    46  0.44 0     0.391
 5 Jonathan …   2010    40   100    14    39    24    26  0.4  0.359 0.923
 6 Ryan Bowen   2008    49   100     0     1    16    29  0.49 0     0.552
 7 Richie Fr…   2006    39   100    23    70     7    10  0.39 0.329 0.7  
 8 Lindsey H…   2006    37   100    11    43     2     4  0.37 0.256 0.5  
 9 Oliver Mi…   2004    53   100     0     1    15    23  0.53 0     0.652
10 Mark Jack…   2004    34   100     7    41    28    39  0.34 0.171 0.718
# ... with 6,296 more rows
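The same point can be checked on a small scale. This is a sketch under the assumption of a three-row made-up tbl called toy; the object name kept is also just for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(PLAYER = c("A", "B", "C"), FGA = c(40, 100, 250))

filter(toy, FGA >= 100)           # the result is printed, then discarded
nrow(toy)                         # 3: toy itself is unchanged

kept <- filter(toy, FGA >= 100)   # assign the result to keep it
nrow(kept)                        # 2
```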

We can also filter on more complicated conditions constructed using the AND, OR, and NOT operators: &, |, and !. For instance, to filter observations with at least 100 field goal attempts OR at least 50 three point attempts, we would run

> filter(raw_shooting, FGA >= 100 | TPA >= 50)
# A tibble: 6,328 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 6,318 more rows

We may combine these constraints by enclosing them in parentheses:

> filter(raw_shooting, (FGA >= 100 & TPA >= 50) | (FGP >= 0.45 & FGP <= 0.5))
# A tibble: 4,837 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 4,827 more rows

What if we wanted to pull out the observations corresponding to the 2015-16 and 2014-15 seasons? We could do something like filter(raw_shooting, (SEASON == 2016) | (SEASON == 2015)), which would be perfectly fine. However, what if we wanted data from 1998-99, 2011-12, and 2015-16? Typing a lot of expressions like SEASON == ... would be rather tedious. The %in% operator lets us avoid this tedium:

> filter(raw_shooting, SEASON %in% c(1999, 2012, 2016))
# A tibble: 1,150 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 1,140 more rows

We could also filter out data from the two lockout-shortened seasons, 1998-99 and 2011-12, using a combination of the NOT operator ! and %in%.

> filter(raw_shooting, !SEASON %in% c(1999, 2012))
# A tibble: 6,721 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 6,711 more rows
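To see what %in% and its negation return, it helps to try them on a plain vector first. The values below are made up for illustration. In R, %in% is evaluated before !, so !SEASON %in% ... means "not (SEASON in ...)".

```r
# Hypothetical vector of season labels
SEASON <- c(1998, 1999, 2012, 2016)

SEASON %in% c(1999, 2012)      # FALSE  TRUE  TRUE FALSE
!SEASON %in% c(1999, 2012)     #  TRUE FALSE FALSE  TRUE
!(SEASON %in% c(1999, 2012))   # same result, with the grouping made explicit
```

If the unparenthesized form feels hard to read, the explicit parentheses in the last line are a reasonable habit.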

For the remainder of this module, we will focus on the players who attempted at least 100 field goals, at least 100 free throws, and at least 50 three pointers in the non-lockout seasons.

> nba_shooting_orig <- filter(raw_shooting, FGA >= 100 & FTA >= 100 & TPA >= 50 & 
+     !SEASON %in% c(1999, 2012))
> nba_shooting_orig
# A tibble: 2,254 x 11
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 2,244 more rows

Creating New Variables from Old

In Module 1, we computed effective field goal percentage (eFGP), points scored (PTS), and true shooting percentage (TSP) from vectors containing the number of made and attempted field goals, three pointers, and free throws. Now that we have substantially more data stored in our tbl nba_shooting_orig, we would like to compute these statistics for all of the players and add new columns for them. We do this with mutate().

> mutate(nba_shooting_orig, eFGP = (FGM + 0.5 * TPM)/FGA)
# A tibble: 2,254 x 12
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 2,244 more rows, and 1 more variable: eFGP <dbl>

When we run the code above, we find that R prints out a tbl whose very last column is eFGP. However, if we try to print out nba_shooting_orig we no longer see this column! This is because dplyr functions never modify their input but work by creating a copy and modifying that. So if we wanted a new tbl that contains eFGP, we need to save it directly:

> nba_shooting_2 <- mutate(nba_shooting_orig, eFGP = (FGM + 0.5 * TPM)/FGA)
> nba_shooting_2
# A tibble: 2,254 x 12
   PLAYER     SEASON   FGM   FGA   TPM   TPA   FTM   FTA   FGP   TPP   FTP
   <chr>       <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
 1 Stephen C…   2016   805  1597   402   887   363   400 0.504 0.453 0.908
 2 James Har…   2016   710  1617   236   657   720   837 0.439 0.359 0.860
 3 Kevin Dur…   2016   698  1381   186   480   447   498 0.505 0.388 0.898
 4 DeMarcus …   2016   601  1332    70   210   476   663 0.451 0.333 0.718
 5 LeBron Ja…   2016   737  1416    87   282   359   491 0.520 0.309 0.731
 6 Damian Li…   2016   618  1474   229   610   414   464 0.419 0.375 0.892
 7 Anthony D…   2016   560  1137    35   108   326   430 0.493 0.324 0.758
 8 Russell W…   2016   656  1444   101   341   465   573 0.454 0.296 0.812
 9 DeMar DeR…   2016   614  1377    47   139   555   653 0.446 0.338 0.850
10 Paul Geor…   2016   605  1448   210   565   454   528 0.418 0.372 0.860
# ... with 2,244 more rows, and 1 more variable: eFGP <dbl>

We now have a new tbl in our environment called nba_shooting_2, and this new tbl has a column for eFGP. Recall the formulas for points scored (PTS) and true shooting percentage (TSP):

\[ \text{PTS} = \text{FTM} + 2\times \text{FGM} + \text{TPM} \]

\[ \text{TSP} = \frac{\text{PTS}}{2\times(\text{FGA} + 0.44\times \text{FTA})} \]

We can add both of them to our tbl using mutate():

> nba_shooting_3 <- mutate(nba_shooting_2, PTS = FTM + 2 * FGM + TPM)
> nba_shooting_4 <- mutate(nba_shooting_3, TSP = PTS/(2 * (FGA + 0.44 * FTA)))
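As a sanity check, we can work through these formulas by hand for a single row. The sketch below copies the numbers from the Stephen Curry 2016 row printed earlier and evaluates the same expressions on ordinary numbers rather than on a tbl.

```r
# Values copied from the Stephen Curry 2016 row shown above
FGM <- 805; FGA <- 1597; TPM <- 402; FTM <- 363; FTA <- 400

eFGP <- (FGM + 0.5 * TPM) / FGA            # (805 + 201) / 1597
PTS  <- FTM + 2 * FGM + TPM                # 363 + 1610 + 402
TSP  <- PTS / (2 * (FGA + 0.44 * FTA))     # 2375 / (2 * 1773)

PTS              # 2375
round(eFGP, 3)   # 0.63
round(TSP, 3)    # 0.67
```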

Compared to nba_shooting_orig, the tbl nba_shooting_4 now has three additional columns for eFGP, PTS, and TSP. In order to create this tbl, we created two intermediate tbls, nba_shooting_2 and nba_shooting_3. These are somewhat useless now, since any analyses we would want to do with them could be done using the richer dataset in nba_shooting_4.

To get rid of these objects, we can use the rm() function:

> rm(nba_shooting_2, nba_shooting_3)

rm() works by deleting the objects whose names are listed within the parentheses, separated by commas.

If you’re thinking that it was somewhat inefficient to create the two intermediate tbls nba_shooting_2 and nba_shooting_3 in order to arrive at nba_shooting_4, you’re correct. It turns out that we could have done it all in one shot, as follows:

> nba_shooting <- mutate(nba_shooting_orig, 
+                          eFGP = (FGM + 0.5*TPM)/FGA,
+                          PTS = FTM + 2*FGM + TPM,
+                          TSP = PTS/(2 * (FGA + 0.44 * FTA)))

You’ll notice in this code that we have separated each variable we’re creating onto its own line. This helps make the code readable.
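The single call works because mutate() creates the new columns in the order they are written, so the expression for TSP can use the PTS column defined just above it. Here is a small sketch of that behavior, assuming a made-up two-row tbl called toy:

```r
library(dplyr)

# Hypothetical toy data with the same column names as nba_shooting_orig
toy <- tibble(FGM = c(8, 5), FGA = c(16, 12), TPM = c(2, 1),
              FTM = c(4, 3), FTA = c(5, 4))

toy2 <- mutate(toy,
               PTS = FTM + 2 * FGM + TPM,
               TSP = PTS / (2 * (FGA + 0.44 * FTA)))  # uses PTS from the line above

toy2$PTS   # 22 14
```

Swapping the two lines would fail, since PTS would not exist yet when TSP is computed.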

When we print both nba_shooting_4 and nba_shooting, we see that the first ten rows are identical. To verify that the remaining 2,244 rows are identical, we can use R’s identical() function:

> identical(nba_shooting, nba_shooting_4)
[1] TRUE

Creating Categorical Variables

So far, we have used mutate() to compute numeric or continuous variables. Often in an analysis, however, we may want to bin these values into smaller buckets or categories. For instance, we may rather arbitrarily classify players based on their three-point shooting prowess as follows:

Hopeless: TPP < 20%
Below Average: 20% <= TPP < 30%
Average: 30% <= TPP < 35%
Above Average: 35% <= TPP < 40%
Elite: TPP >= 40%

In order to add a column to nba_shooting that includes these classifications, we can use the case_when() function:

> nba_shooting <- mutate(nba_shooting,
+                        Classification = case_when(
+                          TPP < 0.2 ~  "Hopeless",
+                          0.2 <= TPP & TPP < 0.3 ~ "Below Average",
+                          0.3 <= TPP & TPP < 0.35 ~  "Average",
+                          0.35 <= TPP & TPP < 0.4 ~ "Above Average",
+                          0.4 <= TPP ~ "Elite"))

Let’s take a minute to unpack the code above. Within mutate(), we started as we always do, with the name of the new variable on the left-hand side of an equal sign. Then we called the case_when() function. Within this function, we have a new line for each of the values of the new variable Classification. On each line we have an expression with a tilde (`~`). On the left of the `~`, we have put a logical expression, and on the right we have written the corresponding value of Classification.
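One detail worth knowing is that case_when() checks the conditions from top to bottom and uses the first one that is TRUE. This means the lower bounds can be left out, as in the sketch below, which applies the same cutoffs to a made-up vector of three point percentages. The final TRUE acts as a catch-all for everything not matched earlier.

```r
library(dplyr)

# Hypothetical three point percentages, one per category
TPP <- c(0.15, 0.25, 0.32, 0.38, 0.45)

labels <- case_when(
  TPP < 0.2  ~ "Hopeless",
  TPP < 0.3  ~ "Below Average",   # only reached when TPP >= 0.2
  TPP < 0.35 ~ "Average",
  TPP < 0.4  ~ "Above Average",
  TRUE       ~ "Elite"            # everything that is left
)
labels
```

Both styles give the same classification here. Writing out both bounds, as in the module's code, is more verbose but does not depend on the order of the lines.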

Summarizing Individual Columns

Among eligible players, what was the average field goal percentage in the 2015-16 season? To answer this, we can use filter() to create a new tbl containing the data only for this season. Then we can use the dplyr verb summarize() as follows:

> nba_shooting_2016 <- filter(nba_shooting, SEASON == 2016)
> summarize(nba_shooting_2016, FGP = mean(FGP))
# A tibble: 1 x 1
    FGP
  <dbl>
1 0.438
> summarize(nba_shooting_2016, FGP = mean(FGP), TPP = mean(TPP), FTP = mean(FTP))
# A tibble: 1 x 3
    FGP   TPP   FTP
  <dbl> <dbl> <dbl>
1 0.438 0.346 0.790

In the first example, we compute the average field goal percentage and in the second example, we compute the average field goal, three point, and free throw percentages. Of course, we are not limited to computing just the mean. The following functions are quite useful for summarizing several aspects of the distribution of the variables in our dataset:

Center: mean(), median()
Spread: sd(), IQR()
Range: min(), max()
Count: n(), n_distinct()

We will have much more to say about summarize() in Module 4 when we discuss grouped manipulations.
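Several of these summary functions can be combined in a single call to summarize(). The following is a sketch on a made-up four-row tbl called toy; the names on the left of each equal sign are arbitrary labels chosen for the output columns.

```r
library(dplyr)

# Hypothetical toy data: player A appears in two seasons
toy <- tibble(PLAYER = c("A", "A", "B", "C"),
              FGP    = c(0.40, 0.50, 0.45, 0.55))

s <- summarize(toy,
               mean_FGP   = mean(FGP),
               median_FGP = median(FGP),
               min_FGP    = min(FGP),
               max_FGP    = max(FGP),
               rows       = n(),                 # number of rows
               players    = n_distinct(PLAYER))  # number of different players
s
```

The result is a tbl with one row and six columns. Here rows is 4 while players is 3, because player A is counted only once by n_distinct().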

Selecting Columns

Oftentimes, the dataset you load into R contains many, many more columns than you need. We can use select() to pull out the columns we want to use in our subsequent analyses. For instance, we may want to focus only on the columns PLAYER, SEASON, FGP, TPP, FTP, eFGP, PTS, and TSP and ignore the rest of the columns.

> select(nba_shooting, PLAYER, SEASON, FGP, TPP, FTP, eFGP, PTS, TSP)
# A tibble: 2,254 x 8
   PLAYER            SEASON   FGP   TPP   FTP  eFGP   PTS   TSP
   <chr>              <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Stephen Curry       2016 0.504 0.453 0.908 0.630  2375 0.670
 2 James Harden        2016 0.439 0.359 0.860 0.512  2376 0.598
 3 Kevin Durant        2016 0.505 0.388 0.898 0.573  2029 0.634
 4 DeMarcus Cousins    2016 0.451 0.333 0.718 0.477  1748 0.538
 5 LeBron James        2016 0.520 0.309 0.731 0.551  1920 0.588
 6 Damian Lillard      2016 0.419 0.375 0.892 0.497  1879 0.560
 7 Anthony Davis       2016 0.493 0.324 0.758 0.508  1481 0.558
 8 Russell Westbrook   2016 0.454 0.296 0.812 0.489  1878 0.554
 9 DeMar DeRozan       2016 0.446 0.338 0.850 0.463  1830 0.550
10 Paul George         2016 0.418 0.372 0.860 0.490  1874 0.558
# ... with 2,244 more rows
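Two further behaviors of select() are handy in practice: the columns come back in the order we list them, and a minus sign in front of a name drops that column instead of keeping it. A minimal sketch, assuming a made-up one-row tbl called toy:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(PLAYER = "A", SEASON = 2016, FGM = 5, FGA = 10, FGP = 0.5)

picked <- select(toy, PLAYER, FGP, SEASON)   # keep three columns, in this order
names(picked)    # "PLAYER" "FGP" "SEASON"

dropped <- select(toy, -FGM, -FGA)           # keep everything except FGM and FGA
names(dropped)   # "PLAYER" "SEASON" "FGP"
```

As with the other verbs, toy itself still has all five columns afterwards.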

Saving our work

By this point, the nba_shooting tbl has much more information in it than the original data file we read in. While we can always re-run the commands used to produce this tbl from our script, when data analyses become more complicated, it is helpful to save these objects. R has its own special file format for efficiently saving data on your computer.

We will use the save() command.

> save(nba_shooting, file = "data/nba_shooting.RData")

When we want to load the data back into R, we can use the load() function:

> load("data/nba_shooting.RData")
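Note that load() restores the object under the name it had when it was saved, so there is no need to assign the result. The sketch below demonstrates a full round trip on a made-up data frame called toy, writing to a temporary file (the path comes from tempfile() and is only for illustration).

```r
# Hypothetical toy data and a temporary file location
toy  <- data.frame(PLAYER = c("A", "B"), FGP = c(0.45, 0.50))
path <- tempfile(fileext = ".RData")

save(toy, file = path)     # write toy to disk
original <- toy            # keep a copy for comparison
rm(toy)                    # remove toy from the environment

load(path)                 # toy reappears under its original name
identical(toy, original)   # TRUE
```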

Thinking Ahead for Tomorrow

Up to this point, we have only used the dplyr verbs mutate(), filter(), and arrange() one at a time. What if we wanted to do something a bit more complicated like:

1. Remove the lockout seasons and the players who had fewer than 100 field goal attempts, fewer than 100 free throw attempts, or fewer than 50 three point attempts.
2. Arrange them according to their three point percentage.
3. Add a classification of their three point shooting ability (as we did above with case_when()).
4. Compute the mean field goal percentage of all of the players within each of these categories. That is, separately compute the mean field goal percentage of the “Elite” three point shooters, the “Above Average” three point shooters, etc.

Using what we have already learned, you could accomplish steps 1 – 3 by creating lots of temporary tbls. In Module 4, we will learn how to string together several dplyr verbs to perform the above tasks without having to create temporary tbls. We will also learn how to perform grouped calculations.
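For reference, here is roughly what the temporary-tbl approach to the first three tasks looks like. This is a sketch on a made-up four-row tbl called toy; the names step1, step2, and step3 are hypothetical and exist only to hold the intermediate results.

```r
library(dplyr)

# Hypothetical toy data with the same column names as raw_shooting
toy <- tibble(
  PLAYER = c("A", "B", "C", "D"),
  SEASON = c(2016, 2012, 2016, 2015),
  FGA    = c(500, 600, 80, 300),
  FTA    = c(150, 200, 120, 110),
  TPA    = c(200, 100, 60, 90),
  TPP    = c(0.41, 0.36, 0.30, 0.28)
)

# Keep eligible player-seasons outside the lockout seasons
step1 <- filter(toy, FGA >= 100 & FTA >= 100 & TPA >= 50 &
                  !SEASON %in% c(1999, 2012))

# Sort by three point percentage
step2 <- arrange(step1, TPP)

# Add the three point classification
step3 <- mutate(step2,
                Classification = case_when(
                  TPP < 0.2 ~ "Hopeless",
                  0.2 <= TPP & TPP < 0.3 ~ "Below Average",
                  0.3 <= TPP & TPP < 0.35 ~ "Average",
                  0.35 <= TPP & TPP < 0.4 ~ "Above Average",
                  0.4 <= TPP ~ "Elite"))
step3
```

Player B is dropped for playing in a lockout season and player C for having too few field goal attempts, leaving D ("Below Average") and A ("Elite"). Each step needs its own named tbl, which is exactly the clutter the next module shows how to avoid.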