##### HOMEWORK 7, STAT 541, DUE: Mon, Nov 12, 2007, before class ##### Student Name: .... Instructions: You may change the extension of this file to '.txt' for editing. Edit this file by adding your solutions. E-mail to the usual class gmail address. (Leave all existing text in this file.) PROBLEM 1: Form a data frame 'dat' with 12 rows and 3 columns as follows: - 'x1': 1:12 - 'x2': factor(rep(letters[1:3],4)) - 'x3': factor(rep(LETTERS[1:4],3)) Assume these are three predictors in a linear regression. Using 'table()', check how many observations there are for each combination of levels of x2 and levels of x3. Does it surprise you that this is called a "balanced design"? SOLUTION: PROBLEM 2: Obtain the model matrix for the model formula '~x1+x2+x3' for the data frame 'dat' using 'model.matrix()'. Assign the model matrix to 'X' and show it here (without the attributes). Explain carefully the columns and the meanings of their coefficients. Use all precautions in your formulations. SOLUTION: PROBLEM 3: What are the degrees of freedom for the model given by '~x1+x2+x3' and the residuals? SOLUTION: PROBLEM 4: Same question as in Problem 2 for '~x1+x2*x3'. Use 'options(width=100)' to print the model matrix with one row per line. Explain only one column, 'x2b:x3C', for example, but do it two ways. Explain when these interaction columns are needed. SOLUTION: PROBLEM 5: a) Same question as in Problem 3 -- what are the degrees of freedom for the model given by '~x1+x2*x3' and the residuals? b) What can you say about the residual when fitting this model? What is the R2 value and the overall F-statistic? c) Is there a problem estimating the regression coefficients? d) Can it be solved by taking more than one observation for each combination of predictor values? (If one took two observations per combination, the model matrix would be 'rbind(dat,dat)'.) e) Would observing x1=1:24 instead of x1=c(1:12,1:12) help? form a data frame 'dat2' with x1=1:24 and rbind(dat[,-1],dat[,-1]). Obtain its model matrix ('X2') and check whether it is full rank. Determine model and residual dfs. SOLUTION: PROBLEM 6: In Problem 5 e), use the response y <- (1:24)+.1*(1:24)^2 and test whether there exist significant interactions between x2 and x3. Perform a SINGLE F-test for all interaction terms, i.e., test the null hypothesis that the coefficients of all columns 'x2..:x3..' are zero. Compute the F-statistic from first principles, building up your own orthonormal basis of predictor space in such a way that the null hypothesis can be tested. Use your function 'gs()' from the previous homework to find an appropriate orthonormal basis. Once you have the desired basis, project y, and compute/show the following quantities: dfint # degrees of freedom for interaction SSint # sum of squares for interaction dfres # residual degrees of freedom RSS # residual sum of squares F # the F-statistic for interaction p-val # the p-value associated with the F-statistic Finally, compare your values with those of the anova function: anova(lm(y ~ x1+x2*x3, data=dat2)) SOLUTION: PROBLEM 7: For the Boston housing data, compute and plot a 95% confidence band and a 95% prediction band for the response 'MEDV' as a function of the predictor variable 'LSTAT' for the following values of the other predictors: x <- c(CRIM=.25, ZN=0, INDUS=9.7, CHAS=0, NOX=.5380, RM=6, AGE=77.5, DIS=3.2, RAD=5, TAX=330, PTRATIO=19, B=390) For plotting the C- and P-bands, use the range LSTAT <- seq(0,100,by=1) as this variable is the percentage of 'lower status people' In other words, obtain the 95% CI and PI for xx=c(x,LSTAT[i]) for i in 1:101. Plot the upper and lower CI bounds as two connected curves in blue color, and the upper and lower PI bounds as two connected curves in red color, but first plot the boston[,'MEDV'] versus boston[,'LSTAT'] in gray. Attach the plot as a separate JPEG or PDF file to your solutions; the functions 'pdf(...)' or 'jpeg(...)' should be called before plotting and 'dev.off()' after plotting (to flush out buffers). Comment on the following: a) Which of the gray points is the prediction band supposed to catch? b) Does the width of the CIs and PIs as a function of LSTAT make sense? Why is the CI almost as wide as the PI on the right hand side? PS: You can use the following function to get s=RMSE from a model. RMSE <- function(model) sqrt( deviance(model) / model$df.resid ) PPS: You can also use 'solve()' to invert (X^T X). SOLUTION: