##### HOMEWORK 7, STAT 541, DUE: Monday, Dec 12, 2011, 5pm ##### Student Name: .... INSTRUCTIONS: Edit this txt file by inserting your solutions after each problem statement. Rename this file to hw07-Yourlastname-Yourfirstname.tst and email to the usual class gmail address stat541.at.wharton@gmail.com with "Homework 7, 2011" in the subject line. You can discuss the homework with each other in general terms, but not with previous years'' students of Stat 541. You must not consult solutions of similar homeworks of previous years. Finally, you must write your own solutions and not copy from anyone. Report here who you collaborated with and what sources you used: My collaborators: ... The complete list of my sources: ... ---------------------------------------------------------------- PROBLEM 1: Consider a dataframe containing three predictors for N=12 cases: dat <- data.frame(x1=1:12, x2=factor(rep(letters[1:3],4)), x3=factor(rep(LETTERS[1:4],3)) ) (a) What is the problem with fitting the following model ~x1+x2*x3? (b) Create the model matrix X and perform an eigendecomposition on X^T X. Explain how the eigendecomposition shows the problem. [Check 'help(model.matrix)' for a way to generate the model matrix from ~x1+x2*x3 and dat without using 'lm()'.] (c) Someone proposes to solve the problem by obtaining more data -- good idea, right? They go about it by observing each cell twice, so that the model matrix becomes X2 <- rbind(X,X) Does this solve the problem? If you don''t see the solution right away, see what happens to the eigendecomposition of X2^T X2 and go from there. You should be able to wring an insight from an argument in one of the first problems in Homework 2. ---------------- YOUR SOLUTION: ... (a) The problem is that "x2*x3" is already saturated, that is, the dfs for the predictors and the intercept is 12. With an additional predictor x1=1:12, the predictors become linearly dependent. (b) R code: X <- model.matrix(~x1+x2*x3, data=dat) eigen(crossprod(X))$values [1] 6.645005e+02 5.028641e+00 4.515400e+00 3.732051e+00 2.943258e+00 [6] 1.277772e+00 1.000000e+00 7.793209e-01 6.326880e-01 2.679492e-01 [11] 2.169448e-01 1.054389e-01 2.914216e-16 We see that the last eigenvalue is zero (to numerical precision), hence the columns are linearly dependent. (c) This does not solve the problem because X2^T X2 = 2 * X^T X which we saw to be singular. You can see this in R: X2 <- rbind(X,X) eigen(crossprod(X2))$values [1] 1.329001e+03 1.005728e+01 9.030801e+00 7.464102e+00 5.886515e+00 [6] 2.555544e+00 2.000000e+00 1.558642e+00 1.265376e+00 5.358984e-01 [11] 4.338896e-01 2.108778e-01 5.828431e-16 The eigenvalues are just twice those of X. ---------------------------------------------------------------- PROBLEM 2: R has a few functions that are popular as 'smoothers', that is, a a nonparametric curve fitting routines. These are Jerry Friedman''s 'supsmu()' and Bill Cleveland''s 'lowess()' and its newer sibling 'loess()'. We will use the older but faster 'lowess()'. Here is an example of their use: cl <- c(lowess="blue",loess="green",supsmu="red") N <- 30; x <- 1:N; y <- rnorm(N) + x^2/N^2*5 plot(x, y, pch=16) lines(lowess(x, y), lwd=2, col=cl["lowess"]) ord <- order(x); lines(x[ord], fitted(loess(y ~ x))[ord], lwd=2, col=cl["loess"]) lines(supsmu(x, y), lwd=2, col=cl["supsmu"]) legend(x="topleft", legend=names(cl), col=cl, lwd=2) Find out what the smoothers return. Use the above code as a template and augment the code of the function 'nonlinearity.plots()' shown below by adding the three smooths to each plot. Apply the augmented function to the Boston Housing data and attach a PDF of the result in your email as part of your solution (File > Save as > PDF...). Comment on which variables appear to have the strongest nonlinearities. ---------------- YOUR SOLUTION: nonlinearity.plots <- function(lm.model, extra=.1, cex=.5, cex.ident=.7, cex.axis=.8, cex.lab=.8, win.ncol=5, frame.sz=3.0, ...) { X <- model.matrix(lm.model)[,-1] # Assume an intercept r <- resid(lm.model) p <- ncol(X) N <- nrow(X) win.ncol <- min(p, win.ncol) win.nrow <- ceiling(p/win.ncol) windows(width=frame.sz*(win.ncol), height=frame.sz*(win.nrow)) par(mfrow=c(win.nrow,win.ncol), mgp=c(1.5,0.5,0), mar=c(2.5,2.5,.5,.5)) y.rg <- range(r); ylim <- y.rg + c(-1,1)*diff(y.rg)*extra for(j in 1:p) { xj <- X[,j] x.rg <- range(xj) xlim <- x.rg + c(-1,1)*diff(x.rg)*extra plot(x=xj, y=r, pch=16, xlim=xlim, ylim=ylim, xlab=colnames(X)[j], ylab="Res", cex=cex, ...) abline(h=0, col="gray") # Add here your code that draws the smooths in the above colors. # Add a legend only to the first leverage plot. # --------- cl <- c(lowess="blue",loess="green",supsmu="red") lines(lowess(xj, r), lwd=2, col=cl["lowess"]) ord <- order(xj); lines(xj[ord], fitted(loess(r ~ xj))[ord], lwd=2, col=cl["loess"]) lines(supsmu(x=xj, y=r), lwd=2, col=cl["supsmu"]) if(j==1) legend(x="topright", legend=names(cl), col=cl, lwd=2) # --------- } } # Save the following plot to a pdf file named 'boston-nonlinearities.pdf' # and submit with your solutions: nonlinearity.plots(lm(MEDV~., data=boston)) Your comment on nonlinearities: Apparently 'RM' and 'LSTAT' have the strongest nonlinearities. If you think other variables have nonlinearities, too, that''s ok. ---------------------------------------------------------------- PROBLEM 3: You have no idea how 'lowess()' is implemented, yet you should be able to implement an exact test to check whether the residuals contain a nonlinearity against any of the predictors, estimated by 'lowess()'. Write a function that implements such a test for a given model matrix X and response y, using as a test statistic the standard deviation of the fitted values returned by 'lowess()': sd(lowess(...)$y). [It would probably be better to use the RSS from the 'lowess()' fit as the test statistic, but this is a pain to compute in view of the structure returned by 'lowess()'.] The problem is simplified for you (and for the TA) with a template where you can fill in missing code. Replace all '__' with one-liners for allocating storage and for actual computations. Run the code on the Boston Housing data ---------------- YOUR SOLUTION: lowess.test <- function(lm.model, Nsim=199, ...) { # function to produce normalized residual vectors: nrm <- function(x) x/sqrt(sum(x^2)) # model matrix, but toss the intercept vector: X <- __ # silliness: the next line jitters X a little; # smoothers are not good at handling ties, so we use jitter to break ties randomly. X <- X + rnorm(length(X), s=rep(apply(X, 2, sd)/1000, N)) # from X: p <- __ N <- __ # actual residual vector from regressing y on X, normalized to norm=1: r.obs <- __ # allocate storage for p observed values of the test statistic: s.obs <- __ # calculate the observed values of the test statistic for all predictors: for(j in 1:p) s.obs[j] <- __ # allocate a p x Nsim (!) matrix of storage for null values of test statistics: s.null <- __ for(isim in 1:Nsim) { if(isim%%50==0) cat(isim," ") # null residual vector normalized to norm=1: r.null <- __ # calculate null values of test statistics for all predictors: for(j in 1:p) s.null[j,isim] <- __ }; cat("\n") # px1 vector of p-values from s.obs and s.null: # (a simple one-liner with vectorization, using that s.null is p x Nsim) pvals <- __ # column names of X (so you know which predictor goes with which p-value): names(pvals) <- __ pvals } options(width=120) # to get all pvalues on one line lowess.test(__) # show the results ---------------- When the instructor prepared this homework, he tried out all smoothers. Here are the solutions: lowess.test <- function(lm.model, Nsim=499, ...) { nrm <- function(x) x/sqrt(sum(x^2)) X <- model.matrix(lm.model)[,-1] X <- X + rnorm(length(X), s=rep(apply(X, 2, sd)/1000, N)) p <- ncol(X) N <- nrow(X) r.obs <- nrm(resid(lm.model)) s.obs <- rep(NA, p) for(j in 1:p) s.obs[j] <- sd(lowess(x=X[,j], y=r.obs, ...)$y) s.null <- matrix(NA, nrow=p, ncol=Nsim) for(isim in 1:Nsim) { if(isim%%50==0) cat(isim," ") r.null <- nrm(resid(lm(rnorm(N) ~ X))) for(j in 1:p) s.null[j,isim] <- sd(lowess(y=r.null, x=X[,j], ...)$y) }; cat("\n") pvals <- (1+apply(s.null > s.obs, 1, sum)) / (1+Nsim) names(pvals) <- colnames(X) pvals } boston.lowess.pvals <- lowess.test(boston.lm) boston.lowess.pvals loess.test <- function(lm.model, Nsim=199, ...) { nrm <- function(x) x/sqrt(sum(x^2)) X <- model.matrix(lm.model)[,-1] X <- X + rnorm(length(X), s=rep(apply(X, 2, sd)/1000, N)) p <- ncol(X) N <- nrow(X) r.obs <- nrm(resid(lm.model)) s.obs <- rep(NA, p) for(j in 1:p) s.obs[j] <- sd(fitted(loess(r.obs ~ X[,j], ...))) s.null <- matrix(NA, nrow=p, ncol=Nsim) for(isim in 1:Nsim) { if(isim%%50==0) cat(isim," ") r.null <- nrm(resid(lm(rnorm(N) ~ X))) for(j in 1:p) s.null[j,isim] <- sd(fitted(loess(r.null ~ X[,j], ...))) }; cat("\n") pvals <- (1+apply(s.null >= s.obs, 1, sum)) / (1+Nsim) names(pvals) <- colnames(X) pvals } boston.loess.pvals <- loess.test(boston.lm) boston.loess.pvals supsmu.test <- function(lm.model, Nsim=499, ...) { nrm <- function(x) x/sqrt(sum(x^2)) X <- model.matrix(lm.model)[,-1] p <- ncol(X) N <- nrow(X) X <- X + rnorm(length(X), s=rep(apply(X, 2, sd)/1000, N)) r.obs <- nrm(resid(lm.model)) s.obs <- rep(NA, p) for(j in 1:p) s.obs[j] <- sd(supsmu(x=X[,j], y=r.obs, ...)$y) s.null <- matrix(NA, nrow=p, ncol=Nsim) for(isim in 1:Nsim) { if(isim%%50==0) cat(isim," ") r.null <- nrm(resid(lm(rnorm(N) ~ X))) for(j in 1:p) s.null[j,isim] <- sd(supsmu(x=X[,j], y=r.null, ...)$y) }; cat("\n") pvals <- (1+apply(s.null > s.obs, 1, sum)) / (1+Nsim) names(pvals) <- colnames(X) pvals } boston.supsmu.pvals <- supsmu.test(boston.lm, Nsim=999) boston.supsmu.pvals cbind(lowess=boston.lowess.pvals, loess=boston.loess.pvals, supsmu=boston.supsmu.pvals) plot.plus(data.frame(lowess=boston.lowess.pvals, loess=boston.loess.pvals, loess=boston.supsmu.pvals), pch=16, xlim=c(0,1), ylim=c(0,1)) ] ---------------------------------------------------------------- PROBLEM 4: Write a function 'lm.boot()' that takes a linear model 'lm.model' as an argument and returns the matrix 'summary(lm.model)$coefficients$' augmented with two more columns called 'SE.boot' and 'SE.ratio'. These columns should contain bootstrap standard errors of the slopes and the ratios of bootstrap/traditional standard errors, respectively. ---------------- YOUR SOLUTION: lm.boot <- function(lm.model, Nboot=10000) { # the '$coefficients' matrix from the 'summary()' function: slope.mat <- __ # model matrix: X <- __ p <- __ N <- __ # 'lm.model' does not contain the response; # reconstruct it from 'fitted()' and 'resid()': y <- __ # allocate a matrix of size Nboot x p for bootstrap coefficients: slopes.bs <- matrix(NA, nrow=Nboot, ncol=p) for(iboot in 1:Nboot) { # selection vector for sampling with replacement: sel <- __ # for speed, use the triple-X formula to compute bootstrap coefficients: slopes.bs[iboot,] <- solve(crossprod(X[sel,])) %*% t(X[sel,]) %*% y[sel] if(iboot%%100==0) cat(iboot," ") }; cat("\n") # compute the bootstrap standard errors for the slopes: SE.boot <- __ # compute the ratio 'bootstrap/traditional' standard errors; # the latter are a column of 'slope.mat': SE.ratio <- __ # bootstrap t-ratio: t.boot <- slope.mat[,"Estimate"] / SE.boot # cbind 'SE.boot' and 'SE.ratio' and 't.ratio' to 'slope.mat': slope.mat <- __ # return value: slope.mat } boston.boot <- lm.boot(boston.lm) signif(boston.boot, 3) Your comparison of both types of SEs and t-ratios: ... ---------------- lm.boot <- function(lm.model, Nboot=10000) { slope.mat <- summary(lm.model)$coefficients X <- model.matrix(lm.model) y <- fitted(lm.model) + resid(lm.model) p <- ncol(X) N <- nrow(X) slopes.bs <- matrix(NA, nrow=Nboot, ncol=p) for(iboot in 1:Nboot) { sel <- sample(N, replace=T) ## slopes.bs[iboot,] <- coef(lm(y[sel] ~ X[sel,-1])) slopes.bs[iboot,] <- solve(crossprod(X[sel,])) %*% t(X[sel,]) %*% y[sel] if(iboot%%100==0) cat(iboot," ") }; cat("\n") SE.boot <- apply(slopes.bs, 2, sd) SE.ratio <- SE.boot / slope.mat[,"Std. Error"] t.boot <- slope.mat[,"Estimate"] / SE.boot slope.mat <- cbind(slope.mat, SE.boot=SE.boot, SE.ratio=SE.ratio, t.boot=t.boot) slope.mat } boston.boot <- lm.boot(boston.lm) signif(boston.boot, 3) # 'RM' and 'LSTAT' have bootstrap SEs twice the size of the # traditional SEs. Something going on here: These are also # the most nonlinear predictors. # The bootstrap t-ratios change the variable priorities: # Most significant is PTRATIO, followed by LSTAT, followed # by DIS, and then RAD, NOX, RM. # CHAS lost a lot and is barely significant. ---------------------------------------------------------------- PROBLEM 5: Style: The following text is Ex. 6.2.4 from the book. Revise it according to the principles of that chapter. Mucosal and vascular permeability altered by a toxin elaborated by the vibrio is a current hypothesis to explain this kind of severe condition. Changes in small capillaries located near the basal surface of the epithelial cells, and the appearance of numerous microvesicles in the cytoplasm of the mucosal cells are evidence in favor of this hypothesis. Hydrodynamic transport of fluid into the interstitial tissue and then through the mucosa into the lumen of the gut is believed to depend on altered capillary permeability. Your Solution: ... Revision 1: To explain this kind of severe condition, a current hypothesis holds that a toxin elaborated by the vibrio alters mucosal and vascular permeability. Evidence in favor of this hypothesis includes changes in small capillaries located near the basal surface of the epithelial cells, and the appearance of numerous microvesicles in the cytoplasm of the mucosal cells. It is believed that altered capillary permeability causes hydrodynamic transport of fluid into the interstitial tissue and then through the mucosa into the lumen of the gut. Revision 2: A current hypothesis is that a severe condition, altered mucosal an dvascular permeability, is caused by a toxin elaborated by the vibrio. Evidence in favor of this hypothesis includes changes in small capillaries located near the basal surface of the epithelial cells, and the appearance of numerous microvesicles in the cytoplasm of the mucosal cells. It is believed that altered capillary permeability causes hydrodynamic transport of fluid into the interstitial tissue and then through the mucosa into the lumen of the gut. Revision 3: ~ Emil Pitkin (2009) researched "vibrio" = bacterium of cholera Choleraic diarrhea is a severe condition. The current hypothesis on its origins contends that mucosal and vascular permeability are altered by a toxin produced by the vibrio bacterium. It is further hypothesized that the altered capillary permeability stimulates the hydrodynamic transport of fluid, first into the interstitial tissue, and the through the mucosa into the lumen of the gut. This hypothesis is supported by two observations. First, there are changes in small capillaries near the basal surface of the epithelial cells. Second, numerous microvesicles appear in the cytoplasm of the mucosal cells. Comment: - Read the instructions preceding the paragraph: The task is to put technical language at the end. - We have no idea whether the first sentence contains a description of the "severe condition" or not. Depending on this, the first or the second revision is appropriate. - Emil Pitkin in 2009 researched the "conditition". ----------------------------------------------------------------