In Problem Set 0, we used R to save a single number and perform calculations.
While this functionality may be exciting, we’re ultimately going to be interested in analyzing large amounts of data. Consider the table below which lists different statistics for several basketball players from the 2015-16 NBA regular season. The statistics are:
PLAYER | FGM | FGA | TPM | TPA | FTM | FTA |
---|---|---|---|---|---|---|
Stephen Curry | 805 | 1597 | 402 | 887 | 363 | 400 |
John Wall | 572 | 1349 | 115 | 327 | 272 | 344 |
Jimmy Butler | 470 | 1034 | 64 | 206 | 395 | 475 |
James Harden | 710 | 1617 | 236 | 657 | 720 | 837 |
Kevin Durant | 698 | 1381 | 186 | 480 | 447 | 498 |
LeBron James | 737 | 1416 | 87 | 282 | 359 | 491 |
Kristaps Porzingis | 498 | 1112 | 126 | 342 | 250 | 280 |
Dirk Nowitzki | 373 | 886 | 81 | 243 | 201 | 240 |
Tim Duncan | 215 | 442 | 0 | 2 | 92 | 131 |
Andre Drummond | 552 | 1061 | 2 | 6 | 208 | 586 |
How are we going to load this data into R?
The fundamental unit of data storage in R is a vector, which is simply an ordered collection of data. We can input each column in the above table as a vector. Luckily, many of R’s functions act on these vectors, which allow us to do the same calculation on a lot of different values at once.
To create a vector containing the field goal makes (FGM), we use the combine function c()
and enter each number within the parantheses separated by commas. In the code example below, we create a vector called fgm
which contains the FGM statistics from each player in the table and then print it out.
> fgm <- c(805, 572, 470, 710, 698, 737, 498, 373, 215, 552)
> fgm
[1] 805 572 470 710 698 737 498 373 215 552
Exercise: Create vectors fga
, tpm
, tpa
, ftm
and fta
which contain the FGA, TMP, TPA, FTM, FTA statistics for each player respectively.
In R, the elements of a vector do not necessarily have to numbers. They can be character strings too. Below, we create a vector that contains the names of the players in the above table.
> players <- c("Stephen Curry", "John Wall", "Jimmy Butler", "James Harden", "Kevin Durant",
+ "LeBron James", "Kristaps Porzingis", "Dirk Nowitzki", "Tim Duncan", "Andre Drummond")
Vectors can also contain what are known as logicals, but we will rarely if ever encounter such vectors in this course. It is important to note, however, that every element of a vector must be of the same type. That is, you cannot have a vector whose first element is a number and whose second element is a character string. Let’s see what happens if we try:
> x <- c(23, "LeBron")
In this example, R did not return an error. However, if we print out x
, we see that it has coerced the number 23 into a character string ``23
’’.
Later in the course, we will sometimes need to create a vector of evenly spaced numbers. To do this, we use the seq()
command. If we need to create a vector of consecutive integers, we can use the :
syntax. Both are demonstrated below
> seq(from = 0, to = 1, by = 0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> seq(from = 0, to = 1, length = 11)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
We’re nearly done loading all of the data into the R. Now, we can start to summarize each column of the data and also start computing new statistics from the ones listed in the table above. Before proceeding, we have one more organizational task.
We know that the first element of the vector fgm
is Stephen Curry’s FGM while the ninth element of tpa
is the number of three point shots that Dirk Nowitzki attempted. When we print out fgm
or tpa
for instance, R does not tell us which values correspond to which players.
> fgm
[1] 805 572 470 710 698 737 498 373 215 552
> tpa
[1] 887 327 206 657 480 282 342 243 2 6
As we proceed with our analysis, it’ll be useful to have named the elements of the vector. We can do that with the names()
function.
> fgm
[1] 805 572 470 710 698 737 498 373 215 552
> names(fgm) <- players
> fgm
Stephen Curry John Wall Jimmy Butler
805 572 470
James Harden Kevin Durant LeBron James
710 698 737
Kristaps Porzingis Dirk Nowitzki Tim Duncan
498 373 215
Andre Drummond
552
When we use names()
, we put the vector whose elements we want to name in the parantheses and on the right-hand side of the <-
we put a character vector of the same length as the vector we want to name containing the element names. In the example above, we set the names of fgm
to be the elements of the vector players
.
Exercise Name the elements of fga
, tpm
, tpa
, ftm
and fta
.
Once we have created a vector, we can perform calculations on each element of the vector with very straightforward syntax:
> fgm + 5
Stephen Curry John Wall Jimmy Butler
810 577 475
James Harden Kevin Durant LeBron James
715 703 742
Kristaps Porzingis Dirk Nowitzki Tim Duncan
503 378 220
Andre Drummond
557
> fgm/5
Stephen Curry John Wall Jimmy Butler
161.0 114.4 94.0
James Harden Kevin Durant LeBron James
142.0 139.6 147.4
Kristaps Porzingis Dirk Nowitzki Tim Duncan
99.6 74.6 43.0
Andre Drummond
110.4
> fgm^5
Stephen Curry John Wall Jimmy Butler
3.380488e+14 6.123224e+13 2.293450e+13
James Harden Kevin Durant LeBron James
1.804229e+14 1.656827e+14 2.174390e+14
Kristaps Porzingis Dirk Nowitzki Tim Duncan
3.062998e+13 7.220116e+12 4.594014e+11
Andre Drummond
5.125018e+13
> sqrt(fgm)
Stephen Curry John Wall Jimmy Butler
28.37252 23.91652 21.67948
James Harden Kevin Durant LeBron James
26.64583 26.41969 27.14774
Kristaps Porzingis Dirk Nowitzki Tim Duncan
22.31591 19.31321 14.66288
Andre Drummond
23.49468
In each of these examples, we see that arithmetic proceeds elements-wise: fgm + 5
added 5 to every element of fgm
while sqrt(fgm)
took the square root of every element in fgm
.
In basketball, field goal percentage (FGP) is one of the simplest statistic we can use to summarize a player’s ability to shoot the ball well. FGP is defined as FGA/FGM. We can use the vectors fga
and fgm
to create a new vector fgp
that records each players’ field goal percentage.
> fgp <- fgm/fga
> fgp
Stephen Curry John Wall Jimmy Butler
0.5040701 0.4240178 0.4545455
James Harden Kevin Durant LeBron James
0.4390847 0.5054308 0.5204802
Kristaps Porzingis Dirk Nowitzki Tim Duncan
0.4478417 0.4209932 0.4864253
Andre Drummond
0.5202639
To evaluate fgm / fga
, R took the first element of fgm
and divided it by the other first element of fga
and so on. In order for this to work, fgm
and fga
had to be the same length. Technically, R is able to handle element-wise arithmetic when the vectors are not of the same length (this is known as recycling) but we will not get into it. For the purposes of this class, whenever you want to do element-wise arithmetic with two vectors, they need to be of the same length.
To determine the length of a vector we can use the function length()
as follows:
> length(fgm)
[1] 10
> length(fga)
[1] 10
> length(c(1, 2, 3))
[1] 3
Exercise Create vectors tpp
and ftp
which contain the three-point percentages (TPP) and free throw percentages (FTP) of the players in the table. If a player’s field goal percentage, three point percentage, and free throw percentage exeed 50%, 40%, and 90%, respectively, they are said to belong to the 50-40-90 club. Did any of the players in the table above qualify for the 50-40-90 club?
Up to this point, we have seen how to perform element-wise arithmetic with vectors. Now that we have created the vectors fgp
, ftp
, and tpp
we can start to study the distribution of these datasets. R has several functions that operate on entire vectors (i.e. single datasets) but not in an element-wise fashion. Below are a couple of examples. Exercise Try to describe what each of the functions below does
> sum(fgp)
[1] 4.723153
> mean(fgp)
[1] 0.4723153
> sd(fgp)
[1] 0.03931162
> min(fgp)
[1] 0.4209932
> max(fgp)
[1] 0.5204802
> median(fgp)
[1] 0.4704854
> range(fgp)
[1] 0.4209932 0.5204802
> length(fgp)
[1] 10
> sort(fgp)
Dirk Nowitzki John Wall James Harden
0.4209932 0.4240178 0.4390847
Kristaps Porzingis Jimmy Butler Tim Duncan
0.4478417 0.4545455 0.4864253
Stephen Curry Kevin Durant Andre Drummond
0.5040701 0.5054308 0.5202639
LeBron James
0.5204802
> quantile(fgp)
0% 25% 50% 75% 100%
0.4209932 0.4412740 0.4704854 0.5050907 0.5204802
> summary(fgp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4210 0.4413 0.4705 0.4723 0.5051 0.5205
Up to this point, we have created, manipulated, and summarized vectors. To do this, we took advantage of R’s ability to operate on entire vectors at once. But what if we wanted to look at only a few elements of the vectors? For example, what if we wanted to compute the difference in Stephen Curry’s field goal percentage and Dirk Nowitzki’s field goal percentage? To do this, we need a way to extract specific elements of a vector. This is known as subsetting and is an EXTREMELY important aspect of R since it allows us to extract and modify specific elements of an object.
To subset vectors, we will use the single-bracket notation [ ]
. Between the brackets we will specify which elements we want to extract. So to extract the first element of fgm
we would execute fgm[1]
and to extract the third element we would execute fgm[3]
:
> fgm[1]
Stephen Curry
805
> fgm[3]
Jimmy Butler
470
If you look at the output, you’ll see that when prints the value of fgm[1]
or fgm[3]
, the value is preceded by a [1]
. This is R’s way of letting you know that it is printing out a vector. So what is really happening under the hood when we subset a vector is that R is creating another temporary vector containing only the elements we wanted to subset. However, if we look in our environment pane, we see that there is no additional vector, which means that R is not saving that temporary vector. We, of course, could save it using the assignment operator and modify the new vector as much as we want:
> x <- fgm[1]
> y <- x + 3
Above, the first line of code subsets the first element of fgm
, creating a temporary vector of length 1 containing the first element of fgm,
and saves the result as x
. In the next line, it adds three to this value and saves the result as y
. Of course, if all we wanted to do was create a vector of length one containing only the value of three plus the first element of fgm
, we could have done y <- fgm[1] + 3
.
Here we see for the first time an example of R’s order of operations: subsetting happens before operations like addition. We will have a lot more to say about the order of operations a bit later.
We can also use our [ ]
syntax to modify vectors. For instance, suppose we made a mistake inputing the first number of fgm
, while the rest are fine. We can change this using subsetting:
> fgm[1] <- 105
Exercise: Change the first element of fgm
back to 805.
We are not limited to subsetting one element at a time. In order to subset multiple elements, we put a vector between the brackets:
> fgm[c(1, 3)]
Stephen Curry Jimmy Butler
805 470
> fgm[1:2]
Stephen Curry John Wall
805 572
> players[c(1, 3)]
[1] "Stephen Curry" "Jimmy Butler"
We can also use names to subset vectors. Instead of putting a number (or vector of numbers) between the bracket, we now put a character string (or vector of character strings) specifying the names of vector elements we want to subset.
> fgp["Stephen Curry"]
Stephen Curry
0.5040701
> fgp[c("Stephen Curry", "Dirk Nowitzki")]
Stephen Curry Dirk Nowitzki
0.5040701 0.4209932
> fgp["Stephen Curry"] - fgp["Dirk Nowitzki"]
Stephen Curry
0.0830769
Of the ten players in the table above, we see that LeBron James had the highest field goal percentage. As it turns out, during the 2015–16 season, DeAndre Jordan led the league in field goal percentage, making just over 70% of his shots. Meanwhile, Steph Curry made just over 50% of his shots. Does this mean that DeAndre Jordan is the better shooter than Curry?
One criticism of FGP is that it treats 2-point shots the same as 3-point shots. As a result, the league leader in FGP is usually a center whose shots mostly come from near the rim. effective Field Goal Percentage is a statistic that adjusts FGP to account for the fact that a made 3-point shots is worth 50% more than a made 2-point shot. The formula for eFGP is \[ \text{eFGP} = \frac{\text{FGM} + 0.5 \times \text{TPM}}{\text{FGA}}.\] We can compute the effective field goal percentage of all ten players as follows
> efgp <- (fgm + 0.5 * tpm)/fga
> efgp
Stephen Curry John Wall Jimmy Butler
0.6299311 0.4666420 0.4854932
James Harden Kevin Durant LeBron James
0.5120594 0.5727734 0.5512006
Kristaps Porzingis Dirk Nowitzki Tim Duncan
0.5044964 0.4667043 0.4864253
Andre Drummond
0.5212064
We see now that Curry has the highest effective field goal percentage. Before declaring Curry was the best shooter out of these ten players, though, we might pause and ask about free throws. Both field goal percentage and effective field goal percentage totally ignore free throws. One metric that accounts for all field goals, three pointers, and free throws is true shooting percentage, whose formula is given by \[ \text{TPS} = \frac{\text{PTS}}{2\times(\text{FGA} + 0.44\times \text{FTA})}, \] where \(\text{PTS} = \text{FTM} + 2 \times \text{FGM} + \text{TPM}\) is the number of points scored.
Exercise Create vectors pts
and tsp
for points scored (PTS) and true shooting percentage (TSP). Which player had the highest true shooting percentage?
Up to this point, we have been working strictly within the R console, proceeding line-by-line. As our commands become more and more complex, you’ll find that using the console can get pretty cramped. And if you make a mistake in entering your code, you’ll get an error and have to start all over again. Plus, when we start a new session of R Studio, the console is cleared. How can we save the commands we typed into R? We do so using an R Script. An R Script is a file type which R recognizes as storing R commands and is saved as a .R file. R Scripts are useful as we can edit our code before sending it to be run in the console.
We can start a new R Script by clicking on the top left symbol in R Studio and selecting “R Script”.
The untitled R Script will then appear in the top left hand box of R Studio.
In the R Script, type the following:
2 * 3
x <- 4
sqrt(x)
Now our code is just sitting in the R Script. To run the code (that is, evalutate it in the console) we click the “Run” button in the top right of the script. This will run one line of code at a time - whichever line the cursor is on. Place your cursor on the first line and click “Run”. Observe how 2 * 3
now appears in the console, as well as the output 6
.
If we want to run multiple lines at once, we highlight them all and click “Run”.
Note in the above that we had to run x <- 4
before sqrt(x)
. We need to define our variables first and run this code in the console before performing calculations with those variables. The console can’t “see” the script unless you run the code in the script.
One very nice thing about RStudio’s script editor is that it will highlight syntax errors with a red squiggly line and a red cross in the sidebar. If you move your mouse over the line, a pop-up will appear that can help you diagnose the potential problem.
Another advantage of R scripts is thta you can add comments to your code, which are preceded by the pound sign / hash sign #. Comments are useful because they allow you to explain what your code is doing.
Exercise Create a new R script called “module1.R”" which contains all of the code used to load the data from the table above and computing all of the vectors of shooting statistics. Be sure to add some comments!