27 Time Series

Previous chapters cover all but a few of the functions that appear in this final chapter. The examples in this chapter do not use all of the data provided in the sample files: the data tables include lagged variables, but the examples shown here avoid those in order to show how to make lagged variables as needed. Something else comes up in this chapter: missing values. Lagged explanatory variables introduce missing values when the prior values are unknown, and those missing values require the na.action option in lm. The novel functions illustrated here include:

  • subset, to restrict a regression to a portion of a data frame (see the sketch after this list)
  • durbinWatsonTest (from car), to compute this common diagnostic statistic
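
As a minimal sketch, assume a hypothetical data frame df with response y, predictor x, and a numeric year column:

# Hypothetical df, y, x, and year: two equivalent ways to fit on part of the data.
# na.action=na.exclude keeps residuals aligned with the rows of df when lagged
# variables introduce missing values.
fit <- lm(y ~ x, data=df[df$year < 2008,], na.action=na.exclude)       # bracket subset
fit <- lm(y ~ x, data=df, subset=(year < 2008), na.action=na.exclude)  # subset argument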

27.1 Analytics in R: Predicting Sales of New Cars

These data report quarterly counts of domestic sales of cars and light trucks in the US (in thousands) since 1990. The file includes a numerical date for plots.

Sales <- read.csv("Data/27_4m_car_sales.csv")
dim(Sales)
## [1] 105   4
head(Sales)
##       Year Quarter Car_Sales Light_Truck_Sales
## 1 1990.167      Q1    2310.1            1151.3
## 2 1990.417      Q2    2531.8            1258.6
## 3 1990.667      Q3    2358.2            1156.5
## 4 1990.917      Q4    2100.1             993.1
## 5 1991.167      Q1    1906.4             909.4
## 6 1991.417      Q2    2231.9            1119.4

The first rule when dealing with time series is to inspect the sequence plot, or timeplot, of the data. In this example, there’s a strong seasonal pattern in both series. Interesting trends are present as well, such as the sharp rise in “truck” sales; the best-selling vehicle in the US in many years has been the Ford F-150 pickup. The truck category also includes many SUVs.

plot(Car_Sales ~ Year, data=Sales, type='l', col='blue', ylab="Sales in 1000s")
lines(Light_Truck_Sales ~ Year, data=Sales, col='red')
legend(2001, 1400, legend=c("Car", "Truck"), lty=1, col=c('blue', 'red'))  # lty=1 matches the plotted lines

The analysis in the text extrapolates the pattern in car sales through the recession that began in 2008. The idea is to project the pre-recession trend and use the shortfall to estimate the loss in car sales attributable to the economic downturn.

Multiple regression provides an initial model, one that needs some improvement. A linear trend with a quarterly seasonal factor captures much of the pattern in car sales prior to 2008, and the regression is fit using only those data. (Notice in the summary of the regression that R omits the dummy variable for the first quarter rather than the fourth quarter as in the textbook version of this example; a sketch of how to match the textbook follows the summary.)

regr <- lm(Car_Sales ~ Year + Quarter, data=Sales[Sales$Year<2008,])
summary(regr)
## 
## Call:
## lm(formula = Car_Sales ~ Year + Quarter, data = Sales[Sales$Year < 
##     2008, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -186.797  -66.222   -1.583   59.563  279.053 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37901.011   4791.555   7.910 3.50e-11 ***
## Year          -17.983      2.397  -7.501 1.91e-10 ***
## QuarterQ2     325.474     35.184   9.251 1.36e-13 ***
## QuarterQ3     169.231     35.200   4.808 8.99e-06 ***
## QuarterQ4     -54.863     35.225  -1.557    0.124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105.5 on 67 degrees of freedom
## Multiple R-squared:  0.7518, Adjusted R-squared:  0.737 
## F-statistic: 50.74 on 4 and 67 DF,  p-value: < 2.2e-16
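
To match the textbook’s parameterization, relevel the quarter factor so that Q4 becomes the omitted reference category. A sketch, fit on a copy so the original data frame is left unchanged:

Sales2 <- Sales
Sales2$Quarter <- relevel(factor(Sales2$Quarter), ref="Q4")   # Q4 becomes the baseline
coef(lm(Car_Sales ~ Year + Quarter, data=Sales2[Sales2$Year<2008,]))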

The residuals show substantial autocorrelation, so the inferences in this summary are not reliable.

plot(residuals(regr) ~ Sales$Year[Sales$Year<2008], type='b') 
abline(h=0, col='gray')

We can test for the presence of autocorrelation using the Durbin-Watson test from the package car. This implementation uses simulation to compute a p-value (fast for this regression, though slow for larger models). The test rejects \(H_0: \rho=0\); the simulated p-value is essentially zero.

require(car)
durbinWatsonTest(regr)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.5365565     0.8735341       0
##  Alternative hypothesis: rho != 0

Although the model is inadequate for inference because of the autocorrelation, we can nonetheless use it to extrapolate this simple trend through the recession. Since the subset will be needed for the plot that follows, it is saved explicitly, and the predictions are added to the subset to keep the data together.

Subset <- Sales[2008 <= Sales$Year,]

Subset$preds <- predict(regr, newdata=Subset)
Subset$preds
##  [1] 1787.481 2108.458 1947.719 1719.131 1769.497 2090.475 1929.736
##  [8] 1701.147 1751.514 2072.492 1911.753 1683.164 1733.531 2054.508
## [15] 1893.769 1665.181 1715.547 2036.525 1875.786 1647.197 1697.564
## [22] 2018.542 1857.803 1629.214 1679.581 2000.558 1839.819 1611.231
## [29] 1661.597 1982.575 1821.836 1593.247 1643.614

This plot contrasts the predictions obtained from the subset of data (connected points) with the actual sales during the recession.

plot(Car_Sales ~ Year, data=Subset)
lines(preds ~ Year, col='blue', data=Subset)

With the predictions in the same data frame, it is easy to get a table that compares the pre-recession trend to the observed values.

Subset[,c("Year", "Car_Sales", "preds")]
##         Year Car_Sales    preds
## 73  2008.167    1730.7 1787.481
## 74  2008.417    2110.9 2108.458
## 75  2008.667    1724.7 1947.719
## 76  2008.917    1202.9 1719.131
## 77  2009.167    1093.3 1769.497
## 78  2009.417    1359.7 2090.475
## 79  2009.667    1655.1 1929.736
## 80  2009.917    1293.4 1701.147
## 81  2010.167    1282.8 1751.514
## 82  2010.417    1542.3 2072.492
## 83  2010.667    1469.0 1911.753
## 84  2010.917    1341.7 1683.164
## 85  2011.167    1505.6 1733.531
## 86  2011.417    1650.3 2054.508
## 87  2011.667    1469.2 1893.769
## 88  2011.917    1464.7 1665.181
## 89  2012.167    1805.0 1715.547
## 90  2012.417    1937.4 2036.525
## 91  2012.667    1788.7 1875.786
## 92  2012.917    1713.3 1647.197
## 93  2013.167    1863.5 1697.564
## 94  2013.417    2036.4 2018.542
## 95  2013.667    1926.1 1857.803
## 96  2013.917    1759.3 1629.214
## 97  2014.167    1781.6 1679.581
## 98  2014.417    2115.2 2000.558
## 99  2014.667    1980.6 1839.819
## 100 2014.917    1810.1 1611.231
## 101 2015.167    1786.5 1661.597
## 102 2015.417    2054.4 1982.575
## 103 2015.667    1922.9 1821.836
## 104 2015.917    1761.2 1593.247
## 105 2016.167    1694.3 1643.614

27.2 Analytics in R: Forecasting Unemployment

The data in this example record the monthly civilian unemployment rate in the US, dating back to January 1948. These are the unemployment rates that you might hear mentioned in the news. The data frame includes several lags of unemployment that we will not use in this example; rather, we will make them directly. The unemployment data are heavily rounded to a single decimal place.

Unemp <- read.csv("Data/27_4m_unemployment.csv")
dim(Unemp)
## [1] 819   4
head(Unemp)
##       Date Unemployment_Rate Lag_Unemployment_Rate Lag_6_Unemployment_Rate
## 1 1/1/1948               3.4                    NA                      NA
## 2 2/1/1948               3.8                   3.4                      NA
## 3 3/1/1948               4.0                   3.8                      NA
## 4 4/1/1948               3.9                   4.0                      NA
## 5 5/1/1948               3.5                   3.9                      NA
## 6 6/1/1948               3.6                   3.5                      NA

Next, remember the first rule of time series analysis: inspect the sequence plot. To do that, convert the date string in the first column into an R date using the function mdy from lubridate. That conversion provides a nice variable for the time axis in plots.

require(lubridate)
Unemp$Date <- mdy(Unemp$Date)
plot(Unemployment_Rate ~ Date, data=Unemp, pch=20)   # pch=20 picks a smaller filled point than pch=19

R includes the function lag for building lagged variables, but that function is designed for a special time series class (ts) and does not work with regular R vectors. Instead, we build the lagged variables directly by adding missing values to the start of a vector and trimming an equal number of values from the end.

n <- nrow(Unemp)

Unemp$Lag_1_Un <- c(NA,        Unemp$Unemployment_Rate[-n])              # shift down 1, drop last value
Unemp$Lag_2_Un <- c(NA, NA,    Unemp$Unemployment_Rate[-c(n-1,n)])       # shift down 2, drop last 2
Unemp$Lag_6_Un <- c(rep(NA,6), Unemp$Unemployment_Rate[-((n+1)-(1:6))])  # shift down 6, drop last 6
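
If you prefer a package, dplyr’s lag() performs the same padding on ordinary vectors (unlike stats::lag(), which expects ts objects). A quick check, assuming dplyr is installed:

require(dplyr)
all.equal(Unemp$Lag_1_Un, dplyr::lag(Unemp$Unemployment_Rate, 1))
all.equal(Unemp$Lag_6_Un, dplyr::lag(Unemp$Unemployment_Rate, 6))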

Now we can fit the autoregression of current unemployment on its lags using the same subset of data as in the text. Notice that this code uses a date condition to pick out cases; you need to convert “constant” date strings (e.g., “1/1/1980”) into date objects for that comparison to work.

Subset <- Unemp[ (mdy("1/1/1980")<=Unemp$Date) & (Unemp$Date < mdy("1/1/2016")), ]

regr <- lm(Unemployment_Rate ~ Lag_1_Un + Lag_6_Un, data=Subset)
summary(regr)
## 
## Call:
## lm(formula = Unemployment_Rate ~ Lag_1_Un + Lag_6_Un, data = Subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.61522 -0.10000 -0.00871  0.10009  0.56066 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.07709    0.03185   2.420   0.0159 *  
## Lag_1_Un     1.11761    0.01472  75.904   <2e-16 ***
## Lag_6_Un    -0.12978    0.01477  -8.785   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1595 on 429 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9904 
## F-statistic: 2.212e+04 on 2 and 429 DF,  p-value: < 2.2e-16

As with the series itself, always inspect the sequence plot of the residuals from a time series regression. If you omit the date, R plots a vector in observation order, which is handy for a quick check when you don’t need the dates (see the sketch after the next plot). There do seem to be some small “curves” of several adjacent points; these local patterns are artifacts of the rounding of the unemployment rate.

plot(residuals(regr) ~ Subset$Date, pch=20)
abline(h=0,col='gray')
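
The quick version without dates plots the residuals against the observation number; the shape of the sequence plot is the same.

plot(residuals(regr), pch=20)   # x-axis is just the observation number
abline(h=0, col='gray')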

The Durbin-Watson test (from car) is not valid here because this regression uses lagged versions of the response as explanatory variables. The function is still handy for computing the residual autocorrelation, but in this context the test is biased toward finding no problem. In this case it nonetheless finds a small amount of remaining autocorrelation; because of the lagged response, however, the test is not conclusive.

require(car)
durbinWatsonTest(regr)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.1126248      2.219147   0.024
##  Alternative hypothesis: rho != 0

The D-W statistic summarizes the autocorrelation, but it’s also good to plot the residuals on their lag. Just as we like to see a scatterplot of \(Y\) on \(X\) when relying on the correlation of two variables, this is the plot that goes with an autocorrelation. Note the use of missing values (NA) to shift the variables. (The diagonal lines are another artifact of the rounding mentioned previously.)

plot(c(NA, residuals(regr)), c(residuals(regr),NA), 
     xlab="Lagged Residual", ylab="Residual")
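
As a numerical companion to the plot, the correlation of the residuals with their lag should be close to the autocorrelation reported by durbinWatsonTest (the two are computed slightly differently).

r <- residuals(regr)
cor(r[-length(r)], r[-1])   # lag-1 autocorrelation, roughly -0.11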

The plot of residuals on fitted values appears okay, as does the normal quantile plot (at least until you get far into the tails).

plot(residuals(regr) ~ fitted.values(regr))

require(car)
qqPlot(residuals(regr))

Again, the version of the normal QQ plot from car claims much tighter confidence bands than our reference software (JMP).

We can compare the one-step-ahead predictions and intervals to the actual data in 2016.

preds <- predict(regr, newdata=Unemp[mdy("1/1/2016") <= Unemp$Date,], 
                 interval='prediction')
preds
##          fit      lwr      upr
## 817 4.977295 4.663174 5.291416
## 818 4.891490 4.577373 5.205607
## 819 4.891490 4.577373 5.205607

The actual values lie at the low side of, but well inside, these prediction intervals.

Unemp[mdy("1/1/2016") <= Unemp$Date,"Unemployment_Rate"]
## [1] 4.9 4.9 5.0
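
Since both pieces are at hand, it is easy to show the forecasts and the actual values in one table (the numbers repeat those shown above).

data.frame(preds, actual = Unemp$Unemployment_Rate[mdy("1/1/2016") <= Unemp$Date])
##          fit      lwr      upr actual
## 817 4.977295 4.663174 5.291416    4.9
## 818 4.891490 4.577373 5.205607    4.9
## 819 4.891490 4.577373 5.205607    5.0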

27.3 Analytics in R: Forecasting Profits

The data for the final case in this chapter record quarterly financial results for the retailer Best Buy.

BB <- read.csv("Data/27_4m_bestbuy.csv")
dim(BB)
## [1] 85 13
head(BB)
##       Time     Date Cal_Year Quarter Revenue Cost_Goods_Sold Gross_Profit
## 1 1995.083 19950228     1995       1 1947.11         1679.99       267.12
## 2 1995.333 19950531     1995       2 1274.70         1079.65       195.05
## 3 1995.583 19950831     1995       3 1437.91         1228.08       209.83
## 4 1995.833 19951130     1995       4 1929.28         1672.05       257.23
## 5 1996.083 19960229     1996       1 2575.56         2246.24       329.32
## 6 1996.333 19960531     1996       2 1637.18         1387.49       249.69
##   Pct_Change Lag_Pct_Change Lag_2_Pct_Change Lag_3_Pct_Change
## 1         NA             NA               NA               NA
## 2         NA             NA               NA               NA
## 3         NA             NA               NA               NA
## 4         NA             NA               NA               NA
## 5     23.284             NA               NA               NA
## 6     28.014          23.28               NA               NA
##   Lag_4_Pct_Change Lag_5_Pct_Change
## 1               NA               NA
## 2               NA               NA
## 3               NA               NA
## 4               NA               NA
## 5               NA               NA
## 6               NA               NA

Sales at Best Buy grew rapidly prior to 2010, even through the recession. The following plot also shows increasing variation; sales then level off and decline in the more recent data (probably due to increased competition from on-line retailers like Amazon).

plot(Gross_Profit ~ Time, type='b', pch=20,  data=BB)

The exercise in this case is to predict gross profits of Best Buy in 2012 from the prior data using the year-over-year percentage changes. Percentage changes are often simpler to model than the trend itself.
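
The file already includes the Pct_Change column, but it is easy to build the year-over-year percentage change directly from Gross_Profit. This sketch should reproduce the stored column up to small rounding differences in the stored figures (e.g., about 23.28 in row 5).

nBB <- nrow(BB)
pct <- c(rep(NA,4), 100*diff(BB$Gross_Profit, lag=4)/BB$Gross_Profit[1:(nBB-4)])
head(round(pct, 3))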

Subset <- BB[BB$Time<2012,]

plot(Pct_Change ~ Time, type='b', data=Subset)

The plot of the percentage change on its lag shows substantial autocorrelation; the slope in the first-order autoregression is about 0.81.

plot(Pct_Change ~ Lag_Pct_Change, data=Subset)

lm(Pct_Change ~ Lag_Pct_Change, data=Subset)
## 
## Call:
## lm(formula = Pct_Change ~ Lag_Pct_Change, data = Subset)
## 
## Coefficients:
##    (Intercept)  Lag_Pct_Change  
##         3.2492          0.8086

The eventual regression uses several lags and a time trend. It is important to include the na.action=na.exclude option when fitting this regression because of the missing values caused by taking lags. If you don’t use this option, you will have trouble aligning the residuals in later plots.

regr <- lm(Pct_Change ~ Lag_Pct_Change + Lag_4_Pct_Change + Lag_5_Pct_Change + Time,
           data=Subset, na.action=na.exclude)
summary(regr)
## 
## Call:
## lm(formula = Pct_Change ~ Lag_Pct_Change + Lag_4_Pct_Change + 
##     Lag_5_Pct_Change + Time, data = Subset, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0311  -4.3954  -0.1358   4.8837  19.3950 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1899.41753  579.08521   3.280  0.00182 ** 
## Lag_Pct_Change      0.71145    0.09524   7.470  7.1e-10 ***
## Lag_4_Pct_Change   -0.30908    0.12214  -2.530  0.01434 *  
## Lag_5_Pct_Change    0.23314    0.11571   2.015  0.04890 *  
## Time               -0.94413    0.28813  -3.277  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.995 on 54 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.7454, Adjusted R-squared:  0.7265 
## F-statistic: 39.52 on 4 and 54 DF,  p-value: 1.919e-15

Setting na.action=na.exclude causes R to put missing values in the fitted values and residuals. The residuals from the regression begin with 9 missing values because the model uses year-over-year changes (4 missing values plus 5 lags = 9 missing values).

residuals(regr)[1:10]
##         1         2         3         4         5         6         7 
##        NA        NA        NA        NA        NA        NA        NA 
##         8         9        10 
##        NA        NA -7.049886

These missing values mean that the series of residuals has the same length as the other variables in the data frame used in the regression; the quick check below confirms it. (Otherwise, the residuals would be much shorter and could not be used directly in the following plots.)
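
A quick check:

length(residuals(regr)) == nrow(Subset)   # na.exclude pads the residuals with NA
## [1] TRUE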

The sequence plot of residuals appears random.

plot(residuals(regr) ~ Subset$Time)
abline(h=0, col='gray')

The plot of residuals on fitted values is also okay, although there’s a hint of a quadratic pattern.

plot(residuals(regr) ~ fitted.values(regr))

The normal quantile plot is fine.

require(car)
qqPlot(residuals(regr))
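
The text stops with the diagnostics, but the forecasting step parallels the earlier examples. A sketch (not part of the original example): predict the percentage changes for 2012, then convert back to gross profit using the profit from the same quarter one year earlier.

New <- BB[(2012 <= BB$Time) & (BB$Time < 2013),]               # the four quarters of 2012
New$pred_pct <- predict(regr, newdata=New)                     # forecast percentage changes
prior <- BB$Gross_Profit[(2011 <= BB$Time) & (BB$Time < 2012)] # same quarters in 2011
New$pred_profit <- prior * (1 + New$pred_pct/100)              # convert back to dollars
New[,c("Time", "pred_pct", "pred_profit")]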