##### HOMEWORK 8, STAT 541, DUE: Mon, Dec 3, 2007, before class ##### Student Name: .... Instructions: You may change the extension of this file to '.txt' for editing. Edit this file by adding your solutions. E-mail to the usual class gmail address. (Leave all existing text in this file.) In this homework we explore a few statistical concepts illustrated with the 'bodyfat' data: site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/" bodyfat <- read.table(paste(site,"bodyfat.dat",sep="")) names(bodyfat) <- scan(paste(site,"bodyfat.col",sep=""), w="") Visit www-stat.wharton.upenn.edu/~buja/STAT-541/bodyfat.README to learn more about the background of the data. In what follows we ignore the variable 'Density' and focus on 'PercBF' (Percent Body Fat) as the response. We will attempt to find a formula that approximates 'PercBF' based on the other variables. The idea is that these variables are easy and cheap, whereas actual measurement of bodyfat is laborious and expensive. Before we do regression, though, we should do some serious data exploration. We start with two series of plots which you should fine tune to publication quality. Email the four plots as PDF files with your solutions (this filed edited) in separate attachments. PROBLEM 1: Explore the variables with a scatterplot matrix ('plot(dataframe)'). - Color three outliers in 'Ankle' and one in 'Weight', and also show them with slightly larger plot character size. - The color argument 'col=...' can be given vectors of type 'character' containing things such as 'red', 'gray30'. - Similarly, the argument 'cex=...' (character expansion) can be given vectors of type 'numeric' to control the plot character size. Choose these values so one can see individual cases, but minimize overplotting in the stamp-sized plot frames. This is quite a balancing act and requires some experimenting. - Fine tune the plot by setting the following arguments in 'plot()': pch=..., gap=..., mgp=..., cex.axis=..., tcl=... Aesthetic requirments: . Choose a fixed plot character to your liking. . Shrink the gap between plot frames to zero. . Shrink the distance between axis tick numbers and axis similar to the plots produced in class. (The parameter 'mgp' can be set in 'par()' or in 'plot()'.) . Shrink the axis tick numbers so they are not distracting yet readable. . Shrink the size of tick marks so they are visible but not dominating. SOLUTION: PROBLEM 2: Make a 5x3 array of plots and fill 14 of them with normal Q-Q plots of the 14 variables (leaving the last frame blank). - Fine tune plot character type and size, and show again the outliers in color and extra plot character size as in Problem 1. - Aesthetic requirements: . Make sure the individual plots are roughly square shaped by calling 'windows()' with suitable 'width' and 'height' arguments. . Label the plots with the variable names ('main=...') but remove the x- and y-axis labels ('xlab', 'ylab'). . As we do in class, leave minimal white space in the margins by fine tuning the parameters 'mgp' and 'mar' in the call to 'par(...)'. . Set and fine tune the arguments 'cex.main', 'cex.axis'. Minimize ink where possible yet maintain visibility. SOLUTION: PROBLEM 3: Comment on what you see. Would you mark more or fewer outliers? Are there more outliers in other variables? SOLUTION: PROBLEM 4: Recreate the scatterplot matrix of Problem 1 and the normal Q-Q plot of Problem 2 one more time without the 4 outliers. (No more coloring.) Comment on how normal the variable distributions look, both bivariately and univariately. SOLUTION: PROBLEM 5: For the next steps, standardize the predictor variables 'Weight',...,'Wrist' (ignoring 'Age') and collect the columns in a matrix 'X': X <- scale(bodyfat[,4:15]) # (returns a matrix, not a dataframe) Check whether there are likely outliers in each variable by looking for cases that are outside +-3, say, after standardization (i.e., using X). Write a one-liner using 'apply()' to compute the number of cases outside the interval [-3,+3] for each variable. If the variables were approximately normally distributed, how many observations would you expect to fall outside [-3,+3]? SOLUTION: PROBLEM 6: Next we generalize to a multivariate form of outlier detection, such that even 'diagonal' and not just 'marginal' outliers can be found. This requires a couple of linear algebra steps. The goal is to 'sphere' the predictors contained in X, that is, mapping them linearly such that their covariance matrix is the identity. Once the data are sphered, they should spread out the same in all directions, and we should be able to detect multivariate outliers by simply looking for cases that have a large distance from the origin. To achieve sphering, one constructs a square matrix of size ncol(X) which, when multiplied with X, turns X into something whose covariance matrix is the identity. One way to go about it is to find an inverse square root of the covariance matrix of X because if {A^2}^{-1} = cov(X), then cov(X %*% A) = I. Your task is to construct an inverse square root of cov(X) using the eigendecomposition of X. (There are other ways, but this way gives us more fluency with eigendecompositions.) When you have found A, construct the sphered predictor matrix 'Z' from 'X' and 'A' and convince yourself that its covariance matrix is indeed the indentity up to numerical inaccuracy. SOLUTION: PROBLEM 7: Because the sphered data 'Z' are centered and have the same spread (SD) in all directions, we can use the simple distance from the origin as a measure of outlyingness. [Terminology: The Euclidean distance in sphered data is called the 'Mahalanobis distance'.] Therefore, compute a vector 'md2' of length 'nrow(Z)' which for each case contains the squared distance from the origin. This can be done with a one-liner in R. SOLUTION: PROBLEM 8: Show mathematically that up to a constant factor the squared Mahalanobis distance from the mean is the same as the self-influence values of a linear regression on X without intercept. Determine the proportionality factor. Also, knowing the bounds for the self-influence values, derive bounds for the squared Mahalanobis distances. [Assume as usual that X is full rank. The math proof consists of a few lines of translations that link the projection onto the columns space of X with cov(X), and X with Z. Use the notation 'xi' for the i''th row of X written as a column, and similarly 'zi' with regard to Z. If you like to use nice latex formula notation, you submit the proof in a separate PDF file; otherwise it is ok to use notation such as X (X^T X)^{-1} X^T and the like.] SOLUTION: PROBLEM 9: In what follows we take the multivariate normal distribution as a gauge to judge outlyingness: If the values in the matrix Z were iid N(0,1), then the values in the vector 'md2' would be chi-squared with p=ncol(Z) degrees of freedom. Therefore, let us construct a chi-squared Q-Q plot for the values in 'md2'. There is no canned function to this end, hence we must construct one from first principles. Make a plain 'plot()' with the following: y: the sorted values in 'md2' x: the quantiles of the chi-squared distribution with 'p' degrees of freedom for the probabilities (1:n)/(n+1), where n=length(md2). [Fine tune the plot with pch, cex, xlab, ylab,...] Produce the plot twice: once with all 252 values, and once after removing the largest 13 values of 'md2'. Put the plots on one page with par(mfrow=c(1,2),...). SOLUTION: PROBLEM 10: Here is a deep question: Why should the previous plot be made by removing the largest 13 quantiles from (1:252)/253, and not by recomputing quantiles for (1:239)/240? The question becomes more obvious by pondering the case of removing, say, the larger half of the data and assuming the data are perfectly normal. [If you recomputed quantiles in i), go back and fix it.] SOLUTION: PROBLEM 11: Fit linear models to the data without the 13 high self-influence cases. The response is 'PercBF' and the predictors are 'Age',..., 'Wrist'. You can be as creative as you please, but apart from checking significances, look also at leverage plots to see whether your models carry terms that are dependent on just a few cases (a special danger when you play with interaction terms which tend to yank a few cases into high-leverage places). You may want to use R2.adjusted to judge the overall quality of models. Think about whether the signs of the slopes make biological sense (as far as we amateurs can judge). Report a large model and discuss its failing; then report a final and parsimonious model you believe in. SOLUTION: ----------------------------------------------------------------