##### HOMEWORK 7, STAT 541, DUE: Monday, Dec 12, 2011, 5pm ##### Student Name: .... INSTRUCTIONS: Edit this txt file by inserting your solutions after each problem statement. Rename this file to hw07-Yourlastname-Yourfirstname.tst and email to the usual class gmail address stat541.at.wharton@gmail.com with "Homework 7, 2011" in the subject line. You can discuss the homework with each other in general terms, but not with previous years'' students of Stat 541. You must not consult solutions of similar homeworks of previous years. Finally, you must write your own solutions and not copy from anyone. Report here who you collaborated with and what sources you used: My collaborators: ... The complete list of my sources: ... ---------------------------------------------------------------- PROBLEM 1: Consider a dataframe containing three predictors for N=12 cases: dat <- data.frame(x1=1:12, x2=factor(rep(letters[1:3],4)), x3=factor(rep(LETTERS[1:4],3)) ) (a) What is the problem with fitting the following model ~x1+x2*x3? (b) Create the model matrix X and perform an eigendecomposition on X^T X. Explain how the eigendecomposition shows the problem. [Check 'help(model.matrix)' for a way to generate the model matrix from ~x1+x2*x3 and dat without using 'lm()'.] (c) Someone proposes to solve the problem by obtaining more data -- good idea, right? They go about it by observing each cell twice, so that the model matrix becomes X2 <- rbind(X,X) Does this solve the problem? If you don''t see the solution right away, see what happens to the eigendecomposition of X2^T X2 and go from there. You should be able to wring an insight from an argument in one of the first problems in Homework 2. ---------------- YOUR SOLUTION: ... (a) ... (b) ... (c) ... ---------------------------------------------------------------- PROBLEM 2: R has a few functions that are popular as 'smoothers', that is, a a nonparametric curve fitting routines. These are Jerry Friedman''s 'supsmu()' and Bill Cleveland''s 'lowess()' and its newer sibling 'loess()'. We will use the older but faster 'lowess()'. Here is an example of their use: cl <- c(lowess="blue",loess="green",supsmu="red") N <- 30; x <- 1:N; y <- rnorm(N) + x^2/N^2*5 plot(x, y, pch=16) lines(lowess(x, y), lwd=2, col=cl["lowess"]) ord <- order(x); lines(x[ord], fitted(loess(y ~ x))[ord], lwd=2, col=cl["loess"]) lines(supsmu(x, y), lwd=2, col=cl["supsmu"]) legend(x="topleft", legend=names(cl), col=cl, lwd=2) Find out what the smoothers return. Use the above code as a template and augment the code of the function 'nonlinearity.plots()' shown below by adding the three smooths to each plot. Apply the augmented function to the Boston Housing data and attach a PDF of the result in your email as part of your solution (File > Save as > PDF...). Comment on which variables appear to have the strongest nonlinearities. ---------------- YOUR SOLUTION: nonlinearity.plots <- function(lm.model, extra=.1, cex=.5, cex.ident=.7, cex.axis=.8, cex.lab=.8, win.ncol=5, frame.sz=3.0, ...) { X <- model.matrix(lm.model)[,-1] # Assume an intercept r <- resid(lm.model) p <- ncol(X) N <- nrow(X) win.ncol <- min(p, win.ncol) win.nrow <- ceiling(p/win.ncol) windows(width=frame.sz*(win.ncol), height=frame.sz*(win.nrow)) par(mfrow=c(win.nrow,win.ncol), mgp=c(1.5,0.5,0), mar=c(2.5,2.5,.5,.5)) y.rg <- range(r); ylim <- y.rg + c(-1,1)*diff(y.rg)*extra for(j in 1:p) { xj <- X[,j] x.rg <- range(xj) xlim <- x.rg + c(-1,1)*diff(x.rg)*extra plot(x=xj, y=r, pch=16, xlim=xlim, ylim=ylim, xlab=colnames(X)[j], ylab="Res", cex=cex, ...) abline(h=0, col="gray") # Add here your code that draws the smooths in the above colors. # Add a legend only to the first leverage plot. # --------- ... # --------- } } # Save the following plot to a pdf file named 'boston-nonlinearities.pdf' # and submit with your solutions: boston.lm <- lm(MEDV~., data=boston) nonlinearity.plots(boston.lm) Your comment on nonlinearities: ... ---------------------------------------------------------------- PROBLEM 3: You have no idea how 'lowess()' is implemented, yet you should be able to implement an exact test to check whether the residuals contain a nonlinearity against any of the predictors, estimated by 'lowess()'. Write a function that implements such a test for a given model matrix X and response y, using as a test statistic the standard deviation of the fitted values returned by 'lowess()': sd(lowess(...)$y). [It would probably be better to use the RSS from the 'lowess()' fit as the test statistic, but this is a pain to compute in view of the structure returned by 'lowess()'.] The problem is simplified for you (and for the TA) with a template where you can fill in missing code. Replace all '__' with one-liners for allocating storage and for actual computations. Run the code on the Boston Housing data ---------------- YOUR SOLUTION: lowess.test <- function(lm.model, Nsim=199, ...) { # function to produce normalized residual vectors: nrm <- function(x) x/sqrt(sum(x^2)) # model matrix, but toss the intercept vector: X <- __ # silliness: the next line jitters X a little; # smoothers are not good at handling ties, so we use jitter to break ties randomly. X <- X + rnorm(length(X), s=rep(apply(X, 2, sd)/1000, N)) # from X: p <- __ N <- __ # actual residual vector from regressing y on X, normalized to norm=1: r.obs <- __ # allocate storage for p observed values of the test statistic: s.obs <- __ # calculate the observed values of the test statistic for all predictors: for(j in 1:p) s.obs[j] <- __ # allocate a p x Nsim (!) matrix of storage for null values of test statistics: s.null <- __ for(isim in 1:Nsim) { if(isim%%50==0) cat(isim," ") # null residual vector normalized to norm=1: r.null <- __ # calculate null values of test statistics for all predictors: for(j in 1:p) s.null[j,isim] <- __ }; cat("\n") # px1 vector of p-values from s.obs and s.null: # (a simple one-liner with vectorization, using that s.null is p x Nsim) pvals <- __ # column names of X (so you know which predictor goes with which p-value): names(pvals) <- __ pvals } options(width=120) # to get all pvalues on one line lowess.test(__) # show the results ---------------------------------------------------------------- PROBLEM 4: Write a function 'lm.boot()' that takes a linear model 'lm.model' as an argument and returns the matrix 'summary(lm.model)$coefficients$' augmented with two more columns called 'SE.boot' and 'SE.ratio'. These columns should contain bootstrap standard errors of the slopes and the ratios of bootstrap/traditional standard errors, respectively. ---------------- YOUR SOLUTION: lm.boot <- function(lm.model, Nboot=10000) { # the '$coefficients' matrix from the 'summary()' function: slope.mat <- __ # model matrix: X <- __ p <- __ N <- __ # 'lm.model' does not contain the response; # reconstruct it from 'fitted()' and 'resid()': y <- __ # allocate a matrix of size Nboot x p for bootstrap coefficients: slopes.bs <- matrix(NA, nrow=Nboot, ncol=p) for(iboot in 1:Nboot) { # selection vector for sampling with replacement: sel <- __ # for speed, use the triple-X formula to compute bootstrap coefficients: slopes.bs[iboot,] <- solve(crossprod(X[sel,])) %*% t(X[sel,]) %*% y[sel] if(iboot%%100==0) cat(iboot," ") }; cat("\n") # compute the bootstrap standard errors for the slopes: SE.boot <- __ # compute the ratio 'bootstrap/traditional' standard errors; # the latter are a column of 'slope.mat': SE.ratio <- __ # bootstrap t-ratio: t.boot <- slope.mat[,"Estimate"] / SE.boot # cbind 'SE.boot' and 'SE.ratio' and 't.ratio' to 'slope.mat': slope.mat <- __ # return value: slope.mat } boston.boot <- lm.boot(boston.lm) signif(boston.boot, 3) Your comparison of both types of SEs and t-ratios: ... ---------------------------------------------------------------- PROBLEM 5: Style: The following text is Ex. 6.2.4 from the book. Revise it according to the principles of that chapter. Mucosal and vascular permeability altered by a toxin elaborated by the vibrio is a current hypothesis to explain this kind of severe condition. Changes in small capillaries located near the basal surface of the epithelial cells, and the appearance of numerous microvesicles in the cytoplasm of the mucosal cells are evidence in favor of this hypothesis. Hydrodynamic transport of fluid into the interstitial tissue and then through the mucosa into the lumen of the gut is believed to depend on altered capillary permeability. YOUR SOLUTION: ... ----------------------------------------------------------------