================================================================ LECTURE 1, 2011/09/07: * Syllabus: http://www.stat.wharton.upenn.edu/~buja/STAT-541/ * Mechanics of these lectures: - no slides, no power point - instead: online editing of a text file in an Emacs text buffer realtime R computations in an Emacs R buffer * Homeworks 1 and 2: - HW 1: R intro and practice, posted - HW 2: Linear algebra, to be posted - Also: Be prepared to answer quiz questions about R by being called on in class on Mon, 2011/09/12. * FUNDAMENTALS OF STATISTICAL INFERENCE: - How do we arrive at stderrs, CIs, statistical tests in the simplest case? - Imagine observing N ojects (e.g., sample students and measure height) X1,...,XN i.i.d. with existing expectation mu and variance sigma^2 => E[Xi] = mu, V[Xi] = sigma^2, C[Xi,Xj] = 0 (i!=j) - Interpret height data as a realization of (X1,...,XN). - The point is: If you repeated the data collection, you would get different values. Think through: What is the meaning of E[X1]? ... - Summary statistic: mean Xbar = mean(X1,...,XN) For a given dataset, you obtain one mean value. Simulation of a 'dataset': N <- 300 X <- rnorm(N, m=69, s=5) mean(X) sd(X) - But the mean would have been different from another dataset, => The mean is a random variable, even though observed only once. The point is: We could IN PRINCIPLE repeat the data collection to obtain another value of the mean. Simulate this idea by repeatedly executing the above four lines of R code. - Question: How much does the mean vary from dataset to dataset? => variance expansion V(Xbar) = ( sum_i Var(Xi) + sum_i!=j Cov(Xi,Xj) )/N^2 = sigma^2 / N => sigma(Xbar) = sigma/sqrt(N) true standard error: dispersion of a statistic ACROSS DATASETS/DATA COLLECTIONS - Statistical question: Can we estimate sigma(Xbar)? YES, because . sigma(Xbar) = sigma/sqrt(N) and . we can estimate sigma with s: s = sqrt(sum_i (Xi - Xbar)^2 / N) s <- sd(X) # in R => An estimate of sigma(Xbar) is s/sqrt(N) - Statistical inference consists of approximate probability statements about estimates, where the probabilities concern RELATIVE FREQUENCIES ACROSS DATASETS. The two components that typically make such statements possible are . standard error estimates and . the central limit theorem. - Central limit theorem: (Xbar - mu)/sigma(Xbar) gets ever more N(0,1) distributed as N-->Inf => mu +- 2*sigma(Xbar) contains Xbar for ~95% of datasets Xbar +- 2*sigma(Xbar) contains mu for ~95% of datasets !!!!!! - Strangeness of the CLT: In most cases it speaks about "fraction of datasets/data collections" because we usually see only one mean. (Exception: quality control, where we may see a new Xbar every day) ================================================================ LECTURE 2, 2011/09/12: * Organizational matters: - Homework 1: Problem 2 - typo "2 appears twice, 3 appears three times,..." Problem 15: There is no closed formula for the probability; compute it in R when co=1 and co=0. - Homework 2 will be posted shortly, due Fri, Sept 23, 5pm - Problem: The class on Monday 2011/09/19 needs to be rescheduled. What would be a good time to make it up? Any time on Friday, 2011/09/23? (except 3-5pm) FRIDAY 1:30-3pm tentative makeup !!!!!!!!!!!!!!!!!!!!!! * Roadmap: - recap of fundamentals of statistical inference with quiz questions - wind down with fundamentals of statistical inference - quiz questions about R - beginnings of linear models * Recap: Fundamentals of statistical inference - What are typical assumptions about a simple data collection? ... - What is the meaning of mu=E[X37]? of sigma=SD[X37]? ... - In what sense is Xbar random? ... - How can we simulate the randomness of Xbar computationally? ... - What is a measure of dispersion of Xbar? What is it called? ... - How do we estimate the dispersion of Xbar? ... - Do we know the distribution of Xbar? ... - Do we know the distribution of Xbar at least approximately? ... - Think it through: How is it that we seem to be able to know something about the variability of Xbar across datasets -- BASED ON ONE DATASET? What underlies this 'knowledge'? ... - How can it fail? ... - When we write down the probability P[ (Xbar-mu)/(sigma/sqrt(N)) < x ], what relative frequency does it refer to? ... * FUNDAMENTALS OF STATISTICAL INFERENCE (continued): - The two principal modes of frequentist statistical inference: . hypothesis tests . confidence statements - Accordingly distinguish: . Confidence intervals: 95% CI for the true mu [Xbar - 2*s/sqrt(N), Xbar + 2*s/sqrt(N)] contains the true mu for ~95% of datasets. ==> Reject H0: mu=mu0 at the ~5% significance level if mu0 is not in the 95% CI. . Retention intervals (RI) for a null hypothesis H0: mu=mu0 [mu0 - 2*stderr, mu0 + 2*stderr] contains Xbar for ~95% of datasets. ==> Reject H0 at the ~5% significance level if Xbar is not in the 95% coverage retention interval. - Confidence and significance level: parameters that determines the stringency of inference . 95% = 1-alpha 'confidence' for the CI . 5% = alpha 'significance' or Type I error probability - Connection between CIs and RIs: The 1-alpha CI contains all parameters mu0 that cannot be rejected when testing H0: mu=mu0 mu0 is in CI <==> xbar is in RI - Q: If P[ mu in CI ] ~ 0.95, do we ever know that the true mu is in the CI for our dataset? ... - History: Fisher (SE), Neyman (CI), ... a huge 20th century achievement . Fisher got CIs wrong with his 'fiducial intervals'. . Bayesians today compete with their 'posterior credible intervals'. - Generalization to parameters other than means and mu: . Standard errors and their estimates can be obtained analytically or computationally for most estimates. . A CLT holds ALMOST always. ==> Approximate asymptotic statistical inference with CIs and tests becomes available: CI = [theta.est +- 2*stderr(theta.est)] contains true theta for ~95% of datasets. RI = [mu0 +- 2*stderr(theta.est)] contains theta.est for ~95% of datasets if theta is true. . This raises the question: What is 'theta' in general? Examples: theta.est = median(X1,...,XN) theta.est = upper quartile(X1,...,XN) theta.est = sd(X1,...,XN) theta.est = MAD(X1,...,XN) What is the true 'theta' in each case? General 'loose' answer: theta = lim theta(X1,...,XN) when N-->Inf of ==> theta becomes a function/functional of the true underlying distribution: c.d.f.=F(X) theta = theta(F): so-called 'statistical functional' Examples: Think of X as a random variable with c.d.f. F. theta(F) = median(F) theta(F) = upper quartile(F) theta(F) = SD(F) = sqrt(E[(X-E[X])^2]) theta(F) = MAD(F) = median(|X - median(X)|) . Why do we need theta(F)? Objection to theta(F): It is a figment of our imagination. It is never observed. Q: So why do we need theta(F)? A: We need a target for estimates theta(X1,...,XN). Without a target we do not know what it means to say: 'The precision of theta(X1,...,XN) increases as N increases.' 'Our estimate gets better as we collect more data.' . Can you tell in terms of CIs and RIs how we gain precision as the sample size N increases? ... - Strange Case I: TIME SERIES -- Where is the randomness? Who as ever seen replications of Lehman stock, US GDP,...? Nobody, yet we think of stock return series as random and model them with random variables, to the point where some models win Nobel prizes! (ARCH/GARCH models) . Example: download from the webpage and put in subfolder 'DATA' URL <- "http://stat.wharton.upenn.edu/~buja/STAT-541" ftse <- read.csv(paste(URL,"FTSE-financials.csv",sep="/")) str(ftse) y <- ftse[,'fin.BARCLAYS'] plot(y, pch=".") # stock prices, not returns: some kind of random walk . Financial returns tend to look more random: returns <- function(y) (y[-1]-y[-length(y)])/y[-length(y)] plot(returns(y), pch=".") plot(returns(y), type="l") . Can they be considered truly i.i.d. random? If so, they would look indistinguishable from permuted data: par(mfrow=c(2,1)) plot(returns(y), type="l", ylab="", xlab="") plot(sample(returns(y)), type="l", ylab="", xlab="") Even though this time series of returns is not i.i.d. random, it looks unpredictable with little structure other than varying volatility. ==> Stochastic modeling of financial time series... . Back to the deeper question: What is it we are doing when there is no way to sample another dataset? ==> We are imagining possible worlds with different outcomes for Barclays returns. "POSSIBLE-WORLDS SEMANTICS" - Strange Case II: REGRESSION -- X-Conditional randomness! . Strangeness of regression: * The predictor values are considered fixed, not random. Regression is "conditional on x". * Only the response is considered random. . Hence "varibility across datasets" means datasets that have the same x-values but different y-values. (In a simulation, keep x values fixed, simulate only y values.) . Yet there is a further argument that says conditional inference is valid unconditionally (across datasets with varying x). . Stats students: Look out for the notion of "ancillarity" in math stat. . Conditional simulation in regression given x: Model: y = 1+2*x+eps, eps~N(0,1) N <- 100 x <- rnorm(N) # Once only! Now fix it -- CONDITIONAL ON X y1 <- 1 + 2*x + rnorm(N) y2 <- 1 + 2*x + rnorm(N) y3 <- 1 + 2*x + rnorm(N) y4 <- 1 + 2*x + rnorm(N) par(mfrow=c(2,2)) plot(x, y1) plot(x, y2) plot(x, y3) plot(x, y4) - [Brief critique of the above plots: They are not acceptable in homeworks and publications. Why? (1) The figure consist mostly of non-plotting area. ==> R has wasteful plotting defaults. ==> You need to know . how to shrink the margins to enlarge the plotting areas, and . how to make the axis labelings less wasteful. (2) The four plots should have the same axes, whereas the above plots create random vertical axes. ==> You need to know how to force specific axes in the plots. (3) The default circle glyph for points is deprives the scatterplot of visual impact. ==> You need to know how to select other glyphs. Solution: par(mfrow=c(2,2), mar=c(3,3,1,1), mgp=c(1.8,.5,0)) ylim <- range(y1,y2,y3,y4) plot(x, y1, pch=16, ylim=ylim) plot(x, y2, pch=16, ylim=ylim) plot(x, y3, pch=16, ylim=ylim) plot(x, y4, pch=16, ylim=ylim) ==> Study the documentation of the par() and plot() functions to acquaint yourself with the possibilities of customizing plots! ---------------------------------- Conclusion: Practice | iterative refinement of plots. | ---------------------------------- ] * R Quiz Questions: - base data types, notion of coercion - composite data types, their purposes - vector/matrix/array indexing - looping: ... - conditionals: ... - functions: types of arguments - vectorization: ... - other ways of avoiding loops: ... apropos("apply") - text functions: ... paste(), grep(), sub(), gsub(), regexp() - finding your way around among R functions: ... apropos(...), help(...), ? ================================================================ LECTURE 3, 2011/09/14: * REMINDER OF PREREQUISITES: an understanding of - probability and random variables ~ Stat 430 - statistical modeling, estimation and testing ~ Stat 431,432 - linear algebra of basis/coordinate changes, orthogonal projections, symmetric matrices and their eigen decompositions - programming in a high-level language such as R, S, Matlab, Python DO NOT ... DO NOT ... DO NOT ... DO NOT ... DO NOT ... ATTEMPT TO TAKE THIS COURSE WITHOUT ALL OF THE ABOVE!!! * ORG: - No class: Monday, 2011/09/19, 3-4:30 Makeup: Friday, 2011/09/23, 1:30-3, room tba - Homework 1 due this Friday, 2011/09/16, 5pm Follow these instructions for submission: . Submit by email to stat541.at.wharton@gmail.com . Subject line: Homework 1, 2011 . Attachment containing your solutions: filename hw01-Yourlastname-Yourfirstname.txt Example: hw01-Buja-Andreas.txt . The solutions file for HW1 must have extension '.txt'; not '.doc', not '.docx', not '.pdf', not '.tex', ... - Homework 2 is posted -- topic: linear algebra . Side effect: learn LaTex typesetting easy entry: template provided . This is the first of two Lin.Alg. homeworks. . Submit by email to stat541.at.wharton@gmail.com . Subject line: Homework 2, 2011 . Attachment containing your solutions: filename hw02-Yourlastname-Yourfirstname.pdf Example: hw02-Buja-Andreas.pdf . The solutions file for HW2 must have extension '.pdf'; not '.doc', not '.docx', not '.txt', not '.tex', ... - Homework 3 will be about text manipulation in R. * ROADMAP: - Recap of stats inference, fundamentals - Short discussion of further R features - New: . Linear models: beginnings . Linear models prerequisites: Vector expectations and covariance matrices . Linear models theory: First and second order properties of LS estimators ---------------- * Recap of stats inference: - Basic insight: The data and the estimates obtained from them could have been ... (Tukey) - A mathy way to formalize the insight: 1) A dataset is a draw (X1,X2,...,XN) of N i.i.d. random variables/vectors 2) A statistic T(dataset) = T(X1,X2,...,XN) is also a random variable: it varies ... (how?) However, in any data analysis situation we are given just ... value of T(X1,...,XN). - Statistical inference consists of -------------------------- | probability guarantees | -------------------------- for estimates and test statistics. - What is a missing ingredient for statistical inference? An estimate "estimates" something. ==> We need targets of estimation: parameters T(F) T(F) = value of T(...) for an infinite amount of data = figment of the imagination, idealization, never observed Trick question: Is T(F) random? ... - First type of guarantees: Testing theory How close is an estimate supposed to fall to a ... ...? ==> RI around T(F) with P[T(dataset) in RI] = 1-alpha - Second type of guarantees: Confidence intervals How close is the true parameter supposed to be to the ... ...? ==> CI around T(dataset) with P[T(F) in CI] = 1-alpha - What is the fundamental trade-off in statistical inference? ------------------------------ | Precision versus certainty | ------------------------------ Greater precision means the CI or RI is ... Greater certainty means the CI or RI coverage is ... CI coverage prob = confidence RI coverage prob = 1-significance - How to you improve precision and uncertainty? ... ---------------- * DISCUSSION OF R FEATURES: - Recall basic types of 'data atoms' - Composite data types: vectors, matrices, arrays lists, data frames ==> Recall their purposes!!! - An incredibly powerful feature of R: ----------------------------------------- | Indexing of vectors, matrices, arrays | ----------------------------------------- Toy example: vec <- c(1:4,100,1000,-10000,pi) . Indexing with positive integers: vec[c(5,6,rep(c(1,7),8))] # Note: repetitions allows! . Indexing with negative integers: vec[-(5:8)] . Indexing with logicals: "DB queries" vec[rep(c(T,F),4)] vec[vec<10] vec[vec %in% 1:100] . Associative array indexing: "hash tables" names(vec) <- c("Mark","Nancy","Lynda","Abba","Melissa","Trev","Rob","Nermin") vec # vec now has named entries vec[c("Trev","Lynda","Trev","Lynda")] Such indexing exists also for matrices, arrays, and rows of data frames - Functions are 'first class citizens': They can be passed around like data. . For the lazy R user who hates to type: abbreviate len <- length len(1:1000) . Looping over a vector of functions and passing them to 'apply()': N <- 100 mat <- cbind(var1=rnorm(N), var2=rt(N,df=5), var3=rcauchy(N)) for(fun in c(mean,median,sd)) print(apply(mat, 2, fun)) ---------------- * LINEAR MODELS: THE LS PROBLEM - Let''s start with no probabilistic assumptions. We are given (x,y) data and we want to fit a line to predict y from x. - LS criterion: RSS . In R: a numeric toy example of LS N <- 5 X <- cbind(rep(1,N), 1:N) y <- 1 + 2*X[,2] + rnorm(N,s=.5) # Very little 'error' in this case plot(X[,2], y, pch=16) RSS <- function(b) sum((y-X%*%b)^2) # (Bad programming: global variables inside functions) Getting at the coefficient estimates the hard way: nlm(RSS, rep(0,2)) # nlm() is a minimizer function in R: help(nlm) nlm(RSS, rep(0,2))$estimate # just get the estimate nlm(RSS, c(-10,1000))$estimate # initialize real badly: i'cept=-10, slope=1000 (Of course we have analytical formulas for this.) . In terms of math/lin.alg.: RSS(b) = |y - X b|^2 . LS estimate: b = argmin RSS(b) = minimizer of RSS(b) (Justifications of LS? ...) . Normal equations as stationary equations of RSS: RSS(b) = b^T X^T X b - 2* b^T X^T y + y^T y gradient RSS(b) = 2*( X^T X b - X^T y ) = 0 ==> (X^T X) b = X^T y . Explicit LS solution: b = (X^T X)^{-1} X^T y [Does not work when ...] In R: b <- solve(t(X) %*% X) %*% t(X) %*% y [NOT the numeric method of choice; stay tuned for a HW.] abline(b) # Add the fitted line to the scatterplot. - Residuals and fits: . Notation: yhat = X b r = y - yhat RSS = |r|^2 . Normal equations can be rewritten this way: X^T(y - Xb) = 0 X^T r = 0 ==> Uniquely characterizes the LS solution. Geometric meaning: ... . In R: yhat <- X %*% b r <- y - yhat - LS and orthogonal projections: yhat = X b = ( X (X^T X)^{-1} X^T ) y = H y r = y - yhat = ( I - H ) y where H and I-H are orthogonal projections in n-space (HW2). . Definition of orth. proj.: ... (HW2) - What are the statistical properties of 'b', 'yhat' and 'r'? . In what sense are they 'random vectors'? plot(X[,2], y, pch=16, ylim=c(1,11)) for(i in 1:10) { y <- 1 + 2*X[,2] + rnorm(N,s=.5) # Generate new response vectors. clr <- sample(colors(),1) # Pick a random color. points(X[,2], y, pch=16, col=clr) # Add points to plot. abline(solve(t(X) %*% X) %*% t(X) %*% y, col=clr, lwd=2) # Add LS line to plot. } . What are their expected values? . What are their variances, and covariances? - 1st order properties: need expectation vectors of random vectors E[y] = mu = X beta # Meaning? What is 'beta' in the simulation? E[yhat] = ... # You can guess this if the line model is correct. E[r] = ... # " E[b] = ... # " - 2nd order properties: need covariance matrices of random vectors V[y] = sigma^2 * I # I = identity of size nxn # What is the meaning of this? V[yhat] = ... # No, you can't guess this. You need lin.alg. V[r] = ... # " V[b] = ... # " - Reminder: (1) What is the meaning of E[y]? (2) What is the weirdness about X? - Beginnings of linear model theory: . Ingredients: * Design/predictor matrix: X of size N x (p+1) * Response vector: y of size N x 1 * Squared norm/length: |y|^2 = y1^2 + y2^2 + ... + yN^2 . LS minimization: min_b | y - X b |^2 . LS estimates: b = argmin_b | y - X b |^2 = (X^T X)^{-1} X^T y . Orthogonal projections: needed to generate yhat and r yhat = X (X^T X)^{-1} X^T y (When does this have problems?) |________H_______| = 'hat matrix' (puts hat on y, JWT) r = y - yhat = (I-H) y . Express the RSS and RMSE in terms of the above: RSS = |r|^2 RMSE = |r|^2 / (N-p-1) ---------------- * VECTOR EXPECTATIONS AND VARIANCE/COVARIANCE MATRICES - In preparation for linear models analysis we define vector expectations and variance/covariances matrices. Confusing fact: . In multivariate analysis, we sample the random vector Y repeatedly: Y = (Height, Weight, Age, ...) of a random person, sample multiple persons . In regression analysis, we see only one draw of the random vector Y: Y = (Height of person 1, Height of person 2, ..., Height of person N) but we know more about person 1, person 2, ... person N in terms of predictors. - Assume: Y ~ Ix1 random vector Z ~ Jx1 random vector with joint distribution ("observed on the same objects") Ex.: Y = (Height, Weight, Age, ...) of a random person Z = (English Grade, Math Grade, SAT score, ...) of the same random person The Y- and Z-variables will generally be 'associated'. - EXPECTATION: . Definition: E[Y] = vector of E[Yi] (i=1...I) . Estimation: In Multivariate Analysis one assumes N i.i.d. draws yn from Y, where yn ~ Ix1 = n''th sample. Estimate of E[Y]: ybar = (sum_n yn)/N ~ Ix1 Unfortunately in regression we only see one draw y of Y Do you see what''s confusing? ... Why can we estimate E[Y] in regression, when we only see one draw? ... . Affine transformation: if A mxn, c mx1, both constant, then E[AY + c] = A E[Y] + c Proof: Write out the components of AY and apply E[]. Q: Why do we need this? A: Because we will analyze objects such as b = (X^T X)^{-1} X^T y yhat = H y . Algebra: Assuming Y and Z are both Ix1, then E[a*Y + b*Z] = a*E[Y] + b*E[Z] This property is called linearity of E[...]. - COVARIANCE: . Definition: V[Y,Z] = E[ (Y-E[Y])(Z-E[Z])^T ] V[Y] = V[Y,Y] . Components: V[Y,Z]_ij = ... V[Y]_ij = ... V[Y]_ii = ... . What are the ranks of (Y-E[Y])(Z-E[Z])^T and (Y-E[Y])(Y-E[Y])^T ? Hence a covariance matrix is an average of ... . Sizes: V[Y,Z] is IxJ, V[Y] is IxI . Generalizing univariate V[X] = E[X^2] - E[X]^2: V[Y,Z] = E[ Y Z^T ] - E[Y] E[Z]^T V[Y] = E[ Y Y^T ] - E[Y] E[Y]^T . Algebra: Assuming Y, Z are both Ix1 and U is Kx1: V[Y,Z] = V[Z,Y]^T V[a*Y + b*Z,U] = a*V[Y,U] + b*V[Z,U] V[a*Y + b*Z] = a^2 V[Y] + b^2 V[Z] + 2*a*b*V[Y,Z] These properties are called ... and ... and ... 'symmetry' 'distributive law' 'binomial expansion' . Affine transformation: if A mxn, c mx1, both constant, then ---------------------------- | V[AY,BZ] = A V[Y,Z] B^T | | | | V[AY] = A V[Y] A^T | | | | V[Y+c,Z+d] = V[Y,Z] | ---------------------------- (Exercise: Specialize the middle to a linear form, where A, B are vectors, in particular to A = t(c(1,1,...,1)).) . Multivariate Analysis: Estimation of V[Y] + Given N i.i.d. samples yn drawn from Y Vhat[Y] = sum_i ( (yn-ybar)(yn-ybar)^T )/(N-1) + Affine transformation: zn = A yn + c Vhat[z] = A Vhat[y] A^T (Specialize A to a linear form: A ~ mx1) ================================================================= LECTURE 4, 2011/09/21: * ORG: - Homework 1 is graded, will be returned shortly. - Homework 2 due this Friday, 5pm - Homework 3 will be posted shortly, due 10 days from posting. - Friday is end of Course Selection Period. Select sufficient credit so you can afford to drop this course. (Drop Deadline: Friday, Oct 14) - Did everybody get the Style book? If not, get it by Monday! - Apologies -- a change of notation We denoted in Lecture 3 and HW 2: P = X (X^T X)^{-1} X^T This makes sense because P is an orthogonal projection. However, this is unfortunate because we use P for probability also. ==> From now on we use H = X (X^T X)^{-1} X^T where 'H' derives from John W. Tukey''s term 'hat matrix'. Why? This is JWT humour: H puts the hat on y, yhat = H y. ---------------- * ROADMAP: - Recaps of . some important R features . LS estimation basics . The meanings of randomness in multivariate analysis versus linear regression . Formula heaven for + vector expectations + variance/covariance matrices - New: . Linear model assumptions, 1st and 2nd order, NO normality/Gaussianity yet . Linear models geometry . 1st and 2nd order properties of the LS estimate b . 1st and 2nd order properties of yhat and r ---------------- * RECAP AND ELABORATIONS: - R features: . Vector/matrix/array indexing: 4 types ... . Functions are ... . What is the purpose of data frames as opposed to matrices? ... . What is the purpose of lists as opposed to data frames, matrices? ... - LS: Can be introduced purely as an approximation method. E.g., fitting a line, y = b0+b1*x given data (xi,yi): LS estimates for the coefficients b0, b1 = (b0,b1)^T = b = argmin_(b0,b1) [ (y_1-(b0+b1*x_1))^2 + (y_2-(b0+b1*x_2))^2 + ... + (y_N-(b0+b1*x_N))^2 ] = argmin_b | y - (b0*1 + b1*x) |^2 where y=(y_1,...yN)^T, x=(x_1,...x_N)^T, 1=(1,...1)^T = argmin_b | y - X b |^2 where X = cbind(1,x) of size Nx2 = argmin_b | r |^2 where r = y - X b = argmin_b RSS(b) = (X^T X)^{-1} X^T y the solution found by solving the normal equations = solve(t(X) %*% X) %*% t(X) %*% y in R (This obviously generalizes to more than one predictor.) - Preamble for linear models analysis: E[] vectors and V[,] matrices . Y = Ix1 random vector = vector of random variables with a joint distribution Q1: What does "joint distribution" mean? A1: At least in principle there is meaning in P[Y in A] for A subset of R^I, Q2: What is the meaning of P[Y in A]? A2: Using the LLN, if Y_1,...,Y_N are iid draws of Y, then |{i: Y_i in A}|/N --> P[Y in A]. _ 'Conceptual Example' 1: Multivariate analysis Y = (height, weight, age, income,... )^T of subjects These measurements have obviously a joint distribution. Observe Y on many i.i.d. subjects => Y_1,...,Y_N LLN applies: As the size N of the multivariate dataset grows, the relative frequencies converge to the probabilities of the joint distribution. _ 'Conceptual Example 2': Regression Y = (height of subject 1, height of subject 2, ..., height of subject N)^T = a structured random vector of dimension Nx1 where all values have the same units For these quantities we assume a trivial joint distribution: independence hence we only need to deal with the 'marginal' (one-at-a-time) distributions. Even though the component random variables (the various heights) are drawn from independent subjects, they are not i.i.d. Q1: Why? A1: The subjects are INDEPENDENT, BUT NOT IDENTICALLY distributed. Q2: Why? A2: Because we know more about the subjects, namely, predictors such as x = (age of subject 1, age of subject 2, ..., age of subject N)^T Q3: What is the distribution of the heights? A3: We hypothesize that the height of subject 'i' is function of age on average: E[height_i] = f(age_i) = average of i''th response across datasets equivalently: height = f(age) + eps, where E[eps]=0 That is, subjects of age_i have on average height f(age_i). - "Formula Heaven" for E[] and V[]: Definitions: . E[Y] := vector (E[Y_1],..., E[Y_I])^T . V[Y,Z] := matrix ( Cov(Y_i,Z_j) ) = E[ (Y-E[Y])(Z-E[Z])^T ] (average of rank-one matrices) . V[Y] := V[Y,Y] Facts: Assuming compatible dimensions everywhere, we have... . E[AY+BZ+c] = A E[Y] + B E[Z] + c Extended linearity of E[] . V[AY+BZ+c,U] = A V[Y,U] + B V[Z,U] Left distributive + location invariance . V[U,AY+BZ+c] = V[U,Y] A^T + V[U,Z] B^T Right distributive + location invariance . V[Y,Z] = V[Z,Y]^T Generalized symmetry . V[AY+c] = A V[Y] A^T Quadratic: (a*y)^2=a^2*y^2 . V[aY+bZ] = a^2 V[Y] + 2ab V[Y,Z] + b^2 V[Z] Quadratic expansion: (ay+bz)^2=... . V[Y]^T = V[Y] Symmetry ---------------- * LINEAR MODEL ASSUMPTIONS: - Convention: y is Nx1 (not Ix1) Random vector with component random variables that are independent but NOT identically distributed - Model in vector/matrix language: y_i = beta0 + beta1*x_i1 + beta2*x_i2 + ... + betap*x_ip + eps_i y = beta0*1 + beta1*x1 + beta2*x2 + ... + betap*xp + eps = X beta + eps where beta is (p+1)x1 where y, xj, eps are Nx1 y = (y_1,...,y_N)^T is Nx1 xj = (x_1j,...,x_Nj)^T is Nx1 X = (1,x1,x2,...,xp) is Nx(p+1) eps = (eps_1,...eps_N)^T is Nx1 Note: + The response column y is a random vector -- inspite of lower-case notation! + Predictor columns xj are NOT random but fixed and known (They are set in designed experiments, or sampled in observational data but conditioned on). + The error column 'eps' is a random vector. . Stochastic assumptions: -------------- | E[eps] = 0 | (1) 1st order linear model assumption -------------- Note: E[eps] is a constant Nx1 vector. Meaning: Across worlds, universes, datasets,.. ... the "errors" average to zero, case by case. ... the model is "correct". ... the model is "unbiased". [Distinguish: "unbiased estimator" -- "unbiased model"] E[b]=beta E[y]=X beta ------------------------ | V[eps] = sigma^2 * I | (2) 2nd order linear model assumption ------------------------ Note: V[eps] is a constant NxN matrix. Meaning: Uncorrelated errors between cases and constant variance across cases ("homoscedastic errors") . Equivalent to E[eps]=0 and V[eps]=sigma^2*I are these: ---------------------- | E[y] = X beta | 1st order | V[y] = sigma^2 * I | 2nd order ---------------------- [Prove the equivalence to (1)&(2) using y = X beta + eps.] . More rambles on 'HOLDING X FIXED': + Interpretation: When X is random, 'fixing X' does NOT mean the assertion that X IS fixed! Rather, 'fixing X' refers to the technique of conditioning in the sense of probability theory. ==> The linear model assumptions are assumptions about the CONDITIONAL DISTRIBUTION OF Y GIVEN X. + If X is fixed, as in a designed experiment, then X IS truly fixed, no artifice there. + In simulations (see below) we only sample/simulate eps-vectors, leaving X fixed, varying only y. . THE MEANING OF THE LINEAR MODEL ASSUMPTIONS -- one last time Q: What variability do E[...] and V[...] refer to? A: Dataset-to-dataset variability, varying ... but holding ... fixed. Example: Holding the observed ages fixed, what is the average height? In a simulation we consider the same set of ages but the heights can vary. How can the heights vary? (1) 1st order assumption: E[y_i] = beta0 + beta1*Age_i The 'mean' (expectation) of case i across datasets is a linear function of the age of case i. (2) 2nd order assumption: (2a) V[y_i] = sigma^2 (2b) C[y_i,y_j] = 0 (i != j) (2a) Across cases, the heights vary around their expectations with the same standard deviation sigma. (2b) For any two cases, the dataset-to-dataset variation of heights around their respective expectations is uncorrelated. ---------------- * LINEAR MODELS GEOMETRY: - Ingredients: . y = ... . X = ... . b = ... . yhat = ... . r = ... . H = ... . I-H = ... - Vector space geometry: . The response vector y is in Reals^N, and so are yhat and r. . The columns of X = (1,x1,...,xp) span a subspace of Reals^N: ---------------------------------------------------------------------- | V = columns space of X | | = span(1,x1,...xp) | | = { beta0 + beta1*x1+...+betap*xp | beta0,...,betap are in Reals } | | = { X beta | beta is in Reals^(p+1) } | | = 'space of fits' | ---------------------------------------------------------------------- dim(V) = ... if the vectors 1,x1,...,xp are NOT collinear. . The 1st order model assumption says that E[y] is in ... . In other words: The 'average' of the y vector across datasets is confined to a (p+1)-dimensional subspace. . Fundamental question: Q: How is it possible to estimate E[y] if we have only one sample/draw of the vector y? A: By assuming that E[y] is confined to a much lower-dimensional subspace V !!! --------------------------------------------------------- | Below we will see that as N-->Inf but p stays fixed, | | the LS estimator b of beta becomes arbitrarily precise. | --------------------------------------------------------- ==> There is some mysterious power in confining E[y] to a small-dimensional subspace. [Compare with the multivariate analysis (MA) approach: In MA one has N i.i.d. draws of the vector y. ==> One can estimate E[y] by the sample vector mean of draws.] - Program: In what follows we will derive . E[b] (p+1)x1 . V[b] (p+1)x(p+1) . E[yhat] Nx1 . E[r] Nx1 . V[yhat] NxN . V[r] NxN . V[yhat,r] NxN ---------------- * LINEAR MODEL ANALYSIS, STEP 1: VARIABILITY OF b - Reminder: The LS Estimate is b = (X^T X)^{-1} X^T y - Type of variability in b: ... b = (p+1)x1 random vector varying from response vector y to response vector y... but holding X fixed - Making the linear model assumptions, it follows: E[b] = ... E[ (X^T X)^{-1} X^T y ] = E[ (X^T X)^{-1} X^T (Xbeta +eps) ] = E[ (X^T X)^{-1} X^T X beta + (X^T X)^{-1} X^T eps ] = (X^T X)^{-1} X^T X beta = beta Actually, which part of the assumptions was used? ... only 1st order assumption => Unbiased estimator if the 1st order model assumptions are true. - Making the linear model assumptions, it follows: V[b] = ... V[ (X^T X)^{-1} X^T (X beta +eps) ] = V[ (X^T X)^{-1} X^T X beta + (X^T X)^{-1} X^T eps) ] = V[ (X^T X)^{-1} X^T eps) ] = (X^T X)^{-1} X^T V[ eps ] ((X^T X)^{-1} X^T)^T = (X^T X)^{-1} X^T (sigma^2 I) X (X^T X)^{-1} = (X^T X)^{-1} X^T X (X^T X)^{-1} sigma^2 = (X^T X)^{-1} sigma^2 Actually, which part of the assumptions was used? ... only second order model assumptions! Reminder: V[b] is conditional on X. - What variability does V[b] describe? ... - What is the interpretation of the diagonal elements in V[b]? ... - Could one hope to estimate V[b]? ... - What would such estimates be good for? ... - Simulation example: Simple linear regression with intercept ==> X is a 1-predictor matrix, hence of size Nx2. ## R: N <- 100 # Picking a sample size X <- cbind(x0=rep(1,N), x1=runif(N)) # Making up a predictor matrix sigma <- 1 # Assuming a 'true' error variance beta <- c(b0=1, b1=2) # Assuming a 'true' coeff vector y <- rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector b <- solve(t(X)%*%X)%*%t(X)%*%y; b # Estimating and printing coefficients ## Simulate and animate parallel universes/possible worlds/...: for(i in 1:1000) { y <- rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector b <- solve(t(X)%*%X)%*%t(X)%*%y; b plot(x=X[,2], y=y, pch=16, col="gray20", ylim=c(-2,6)) abline(beta, col="gray70", lwd=3); abline(b, col="red", lwd=2) for(j in 1:1000000) {} # Crude way to slow down the animation } # (Use ESC or ctrl-C to stop, whichever works.) ## ==> Dataset-to-dataset (sampling) variability conditional on X. ## Simulate and store many response vectors (possible worlds, parallel universes,...): Nsim <- 10000 bs <- matrix(NA, nrow=Nsim, ncol=length(beta)) # Storage for coeff estimates colnames(bs) <- paste("b",0:(length(beta)-1),sep="") # Cosmetics for(isim in 1:Nsim) { y <- rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector bs[isim,] <- solve(t(X)%*%X)%*%t(X)%*%y # LS estimate, store in 'isim' row of bs. if(isim%%500==0) cat(isim,"\n") } ## [ Control output is a problem for RGui users: ## It shows only after the code has completed! Two remedies: ## - "By default, R for Windows prints no messages while a command is running. ## Right-click on the 'R Console' window and de-select Buffered output." ## - Or: while running the code in the RGui, hit any key to see the output. ## ] ## What is meaning of 'bs'? ## ==> Plot the 'sampling' or 'dataset-to-dataset' variability of b: plot(bs, pch=16, cex=.5, xlab="b0: LS intercept", ylab="b1: LS slope") ## ==> Negative correlation between b0 and b1 !!! ## WHAT IS THE MEANING OF EACH POINT IN THE SCATTERPLOT??? ## ... ## Compare (1): colMeans(bs) # Same as: apply(bs, 2, mean), but faster. beta ## ==> 'colMeans(bs)' is our simulation estimate of the true beta. ## ## --------------------------------------------------- ## | beta = lim colMeans(bs) as Nsim-->Inf (LLN) | ## --------------------------------------------------- ## Compare (2): cov(bs) # The matrix of multivariate variance/covariance estimates of 'bs' solve(t(X)%*%X)*sigma^2 # What's this again? ## ==> 'cov(bs)' is our simulation estimate of the true V[b]. ## From the above theory we know that V[b] = (X^T X)^{-1} sigma^2 ## ## ---------------------------------------------- ## | V[b] = lim cov(bs) as Nsim-->Inf (LLN) | ## ---------------------------------------------- ## Some intuitions: draw some lines according to 'bs' and compare to plot of 'bs' windows(height=5, width=10) par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,1,1)) plot(cbind(x=X[,2],y=y), pch=16, cex=1, col="gray", ylim=c(-2,6)) # last y for(i in 1:50) abline(bs[i,]) # lines from first 50 simulations plot(bs, pch=16, cex=.5) # second plot: shows negative correlation ## How does the negative correlation on the right ## relate to the lines on the left? ## ... SEE-SAW effect generates a negative association ## between b0 and b1 ---------------- * LINEAR MODEL ANALYSIS, STEP 2: VARIABILITY OF yhat AND r - Preparation: . The hat matrix: H = X (X^T X)^{-1} X^T . Recall: H is an orthogonal projection, meaning: HH = H idempotence H^T=H symmetry . The hat matrix is exclusively built from X, hence it is considered [fixed/random?]. . The matrix Q=I-H [is/is not] an orthogonal projection. . The matrix Q maps the response vector y to the ... vector. - There is variability in 'b' ... => There is variability in 'yhat' and 'r'. because yhat = ... X b and r = ... y - Xb - Strangeness about the randomness in yhat and r: Even though both are N-dimensional, . yhat can only range in ... V . r can only range in ... the orthogonal complement of V = residual space - First order properties of yhat and R: E[yhat] = ... E[ H y ] = H E[y] = H X beta = X beta due to H xj = xj E[r] = ... E[ y - H y ] = E[y] - E[ H y ] = X beta - X beta = 0 What assumptions are used? ... only 1st order assumption What assumptions are not used? ... 2nd order [Looking ahead: Can we detect in real data whether these properties are likely to be violated? If so, how? ... ] - Second order properties of yhat and R: V[yhat] = ... V[ H y ] = H V[y] H^T = H (sigma^2 I) H = H sigma^2 V[r] = ... (I - H) sigma^2 V[yhat,r] = ... V[ H y, (I-H) y ] = H V[y,y] (I-H) = H (I-H) sigma^2 = 0 What assumptions are used? ... 2nd order only What assumptions are not used? ... 1st order [Looking ahead, same question: Can we detect in real data whether these properties are likely to be violated? If so, how? ... ] ================================================================ LECTURE 5, 2011/09/23: * ORG: - Homework 1 is graded, will be returned shortly. - Homework 2 due today, 5pm -- questions? - Homework 3 will be posted shortly, due 10 days from posting. - Today 4:30pm is the end of Course Selection Period. Select sufficient credit so you can afford to drop this course!!! (Drop Deadline: Friday, Oct 14) - Get the Style book by Monday! ---------------- * ROADMAP: - Recaps of . Linear model assumptions: 1st & 2nd order . N-dim geometry of linear models . Formal properties of the LS estimator b: 1st & 2nd order . Formal properties of yhat and r: 1st & 2nd order - New: . Implications of 1st & 2nd order properties for b . Implications of 1st & 2nd order properties for yhat, r ---------------- * RECAP: - Linear models geometry: V = { X beta | beta is in Reals^(p+1) } = column space of X = space of linear combinations of columns of X =: space of fits - "High-dim" plot of LS geometry: source("http://stat.wharton.upenn.edu/~buja/STAT-541/LS-geometry.R") After the first download and saving, you can call the following function instead: LS.geom() - 1st-2nd order assumptions for a linear model: y = X beta + eps (Nx1) functional form of the model: meaning? ... E[eps] = ... (Nx1) 1st order assumption: meaning? ... V[eps] = ... (NxN) 2nd order assumption: meaning? ... Equivalent: E[y] = ... V[y] = ... - New notation we will use: mu := E[y] !!!!!!!!!! 1st order lin model assumption: --------------- | mu = X beta | --------------- [Strictly speaking: There exists beta such that...] Equivalent: --------------- | mu is in V | --------------- - Connections between mu, yhat, beta, b: b estimates ... beta yhat estimates ... mu - Properties of LS estimates, LS fits, and LS residuals derived from lin model assumptions: E[b] = ... E[yhat] = ... (Compare with E[y] = ...) E[r] = ... V[b] = ... (X^T X)^{-1} sigma^2 V[yhat] = ... H sigma^2 V[r] = ... (I-H) sigma^2 V[yhat,r] = ... 0 Q: . Which linear model assumptions are used for which equations? . Meanings of these equations? Surprises? ... . Compare E[yhat] with E[y]; why and when is yhat 'better'? . What will be the use of V[b]? - Recall the simulation: . What did we simulate 'Nsim' times? ... . What did we store from each simulation run, 'isim=1,...,Nsim'? ... . Explain the following code snippets: apply(bs, 2, mean) beta and: cov(bs) solve(t(X)%*%X)*sigma^2 ... . In both cases, what happens as 'Nsim' --> Inf? ... . What do these plots illustrate? windows(height=5, width=10); par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,1,1)) plot(cbind(x=X[,2],y=y), pch=16, cex=1, col="gray", ylim=c(-2,6)) for(i in 1:100) abline(bs[i,]) plot(bs, pch=16, cex=.5) ... . In practical data analysis, do we usually get to see what amounts to multiple 'simulations'? ... with the exception of ... ---------------- * SAMPLING PROPERTIES OF THE PREDICTOR MATRIX X: - Recall: V[b] = (X^T X)^{-1} sigma^2 - Goals: (1) Understand what happens to (X^T X) and (X^T X)^{-1} when N-->Inf (2) Find where 'root-N' is hiding in LS estimation. [Most statistical estimates gain precision at the speed root-N.] - Difficulty: . Analysis and inference of linear models so far is conditional on X. . To understand what happens when N-->Inf, we need to make X random. - A technical nuisance: intercept We will see later that LS slope estimates b1,...,bp [excluding the intercept estimate b0] can be obtained by . centering the predictor columns x1,...,xp and . centering the response vector y and . obtaining LS estimates WITHOUT intercept: X. = (x1., ..., xp.) Nxp with centered columns b = (b1,...,bp)^T = (X.^T X.)^{-1} X.^T y. [Recall: X=(x0=1,x1,...,xp) is of size Nx(p+1).] [Confuser: mean(y) is NOT E[y] !!! mean(y) = (y_1+...y_N)/N = the sample average = a random variable, size 1x1 E[y] = long-run cross-dataset average of y = Nx1 fixed (non-random) vector 'mu' ] - Even though in regression we condition on X., here we consider the sampling properties of X., assuming the rows of X are iid samples from a multivariate distributions P(dx1,dx2,...,dxp). - Q: What is the meaning of (X.^T X.) in V[b] = sigma^2 (X^T X)^{-1} ??? A: Consider instead (X^T X) / (N-1) = sum_{i=1..N} (x_i x_i^T) / (N-1) = sample mean of x x^T up to N-1 versus N where x_i is the i''th row of X written as a px1 column vector. (Apologies for the messiness of this notation. We are really switching from regression conventions to multivariate analysis conventions.) Interpretation: Using the assumption that the columns of X are centered, sum_i x_i /N = 0, we have cov estimate of predictor variables = Vhat(x) = sample average of x x^T up to N-1 versus N = sum_{i=1..N} x_i x_i^T / (N-1) = (X^T X)/(N-1) . How does the interpretation jibe with conditionality on X? In reality even predictors are random (at least in observational data), hence covariance between predictors makes sense. We write V[x] (pxp) for the population covariance matrix of the rows of X (written as random column vectors), and Vhat(x) = sum_{i=1..N} x_i x_i^T / (N-1) . As N grows (N -> Inf), what happens to (X^T X) ? ... [converges/diverges...] what happens to (X^T X)^{-1} ? ... [converges/diverges...] what happens to X^T X/(N-1) ? ... converges/diverges ... based on ... what happens to ( X^T X/(N-1) )^{-1} ? ... converges/diverges ... . Q: What does this formula say about the precision of estimates of regression coefficients in relation to the distribution of the predictor vectors? A: We gain precision of estimation with more data at the rate 1/(N-1) ~ 1/N for dataset-to-dataset variances. Proof: V[b] = sigma^2 (X^T X)^{-1} = sigma^2 (sum_{i=1..N} x_i^T x_i)^{-1} = sigma^2 (sum_{i=1..N} x_i^T x_i / (N-1))^{-1} / (N-1) --> sigma^2 V[predictors] / (N-1) by the LLN . Memorize: X^T X diverges/explodes (Why? ...) Vhat(x) = (X^T X)/(N-1) converges to V[x]. (Why? ...) ==> V[b] = sigma^2 Vhat(x)^{-1] /(N-1) shrinks at the rate ~1/N. PS: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Test your conceptual thinking ability and try to fully understand the weirdness of this formula!!! The formula is correct, but V[b] is both a TRUE UNKNOWN PARAMETER MATRIX AND based on an ESTIMATE OF A MULTIVARIATE COVARIANCE MATRIX. This is possible because V[b] is a regression quantity where we condition on X, but in multivariate analysis X is random and used to estimate V[x]...) ---------------- * MORE ON 2ND ORDER PROPERTIES OF yhat AND r: - Express V[yhat] = H sigma^2 V[r] = (I-H) sigma^2 V[yhat,r] = 0 in terms of components and interpret them: . Diagonal elements: Var[yhat_i] = ... H_ii sigma^2 Var[r_i] = ... (1-H_ii) sigma^2 . Off-diagonal elements: Cov[yhat_i,yhat_j] = ... H_ij sigma^2 Cov[r_i,r_j] = ... -H_ij sigma^2 . Cross elements: Cov[yhat_i,r_j] = ... 0 - What variability is being described in the above equations? ... - Implications: . Residuals are generally [un?]correlated even when the errors are assumed [un?]correlated. ... correlated!!! . Residuals are generally [homo/hetero?]scedastic when the errors are assumed [homo/hetero?]scedastic. ... heteroskedastic !!!!! . Pairs of distinct fits and corresponding pairs of residuals are generally correlated with the [same/opposite?] sign. ... . The variance of residuals is generally [> inference... . How do we see sqrt-N shrinking of V[b]? ... - V[] of yhat and r: . V[yhat] = ... H sigma^2 . V[r] = ... (I-H) sigma^2 . V[yhat,r] = ... 0 . Diagnostic tool: diagonal of H Fact: 0 <= H_ii <= 1 (see HW 3) . Are residuals homoskedastic? Why? ... No, because usually the H_ii are not constant . Standardized residuals: How are they possible? ... ri* = ri / sqrt(1-Hii) . Are residuals uncorrelated? Why? ... No, because the H_ij are not generally zero . How do V[r_i] and V[yhat_j] relate to each other? ... They are always uncorrelated. . What is desirable: small or large V[yhat_i]? ... Small variance because yhat_i estimates mu_i . What would you say if V[yhat_i] > 0.5*sigma^2 for all cases i? ... There is a technical term for this: ------------------ | ".... fitting" | OVER !!!! ------------------ . What does "self-influence" mean and how is it measured? ... yhat_i = sum_k=1...N H_ik y_k = H_ii y_i + sum_k(!=i) H_ik y_k - See HW3 for more intuitions regarding H_ij in the case of simple linear regression through the origin. ---------------- * OFF-DIAGONAL USES OF V[b]: - Motivation: . If x1 and x2 are doses of a drug applied 1hr and 2hrs, resp., before measurement of y (blood pressure, say), you might ask whether beta1=beta2, i.e, H0: beta1-beta2 = 0. ==> Can you devise a SE for b1-b2? . If x1=dummy for math tutoring, x2=dummy for English tutoring, x3=dummy for science tutoring, x4=dummy for history tutoring, and if y=overall SAT score, and you assume the effects of x1, x2, x3, x4 are additive, you might want to estimate beta1+beta2+beta3+beta4. ==> Can you devise a SE for b1+b2+b3+b4? - Mathematize the core of these problems: . b1-b2 = (1,-1,0,0,...) (b1,b2,...)^T . b1+b2+b3+b4 = (1,1,1,1,0,0,...) (b1,b2,b3,b4,...)^T . In general: lin(b) = c^T b where c=(c1,c2,...)^T = constant col vector . Task: Find SE^2(c^T b) - Solution: . SE^2(c^T b) = V[c^T b] . We know V[b], hence: V[c^T b] = ... c^T V[b] c (butterfly formula, special case) SE[c^T b] = ... sqrt(c^T V[b] c) . Q: Would you now know how to (1) test c^T beta = 0? (2) form a CI for c^T beta? ---------------- * DEGREES OF FREEDOM IN LINEAR MODELS: - Matrix traces: (see HW 3 for background) . Def.: a purely mathematical/lin. alg. matter tr(A) = sum A_ii (sum of diagonal elements for square M) . Homework 3: The trace is invariant under basis/coordinate changes. tr(T A T^{-1}) = tr(A) . Facts: Assuming X is full rank, i.e., rank(X) = p+1, we have tr(H) = rank(H) = dim(colspace(H)) = dim(colspace(X)) = rank(X) = p+1 => tr(H) = p+1 'Trick': tr(H) = tr(T H T^{-1}) = tr(H diagonalized) = p+1 (HW 3) - Trace formulas or 'total variance' formulas: Def.: "TOTAL VARIANCE" of a random vector Z TotVar[Z] := sum_i V[Z_i] = tr(V[Z]) TotVar[y] = tr(V[y]) = tr(I sigma^2) = sigma^2 N TotVar[yhat] = tr(V[yhat] = sigma^2 tr(H) = sigma^2 sum Hii = sigma^2 (p+1) TotVar[r] = tr(V[r]) = sigma^2 tr(I-H) = sigma^2 sum (1-Hii) = sigma^2 (N-(p+1)) - Interpretation of the trace formulas: The more predictors the more total variance in ... [yhat,r?] The more predictors the less total variance in ... [yhat,r?] NOTE: . This is independent of whether the model is unbiased or not! We only used second order assumptions. . This is independent of whether we use good or bad predictors. Whether a predictor is good or bad, its inclusion moves one case worth of variation (sigma^2) from r to yhat (and b). - 'Degrees of freedom': dimensions of the subspaces in which yhat and r can range dfs(fits) = dfs(yhat) = p+1 = tr(H) dfs(resid) = dfs(r) = N-(p+1) = tr(I-H) dfs(resp) = dfs(y) = N = tr(I) - [Why might we not be interested in the total variance of b?] ... ---------------- * ESTIMATING THE ERROR VARIANCE sigma^2: - Recall from ~ Stat 102: RMSE = sigma.hat = sqrt( RSS / (N-p-1) ) RMSE^2 is an unbiased estimate of sigma^2 as we will now show. - If the model is correct, then E[ r_i^2 ] = V[r_i] = ... (*) [Which of 1st and 2nd order model assumptions are used??? Check the first equality carefully! ... ] Sum up (*) over i=1,...,N and use what you know about H: E[RSS] = ... sum_i=1...N E[r_i^2] = sum_i=1...N V[r_i] = (N-(p+1)) sigma^2 Solve for sigma^2: sigma^2 = ... E[ RSS/ (N-(p+1)) ] Hence an unbiased estimate of sigma^2 is sigma.hat^2 = ... RSS / (N-(p+1)) also called the 'RMSE^2', a misnomer. - Note: We would really like E[RMSE] = sigma, but this is not true, and there is no simple identity to connect left and right. Hence one resorts to what is mathematically doable: ---------------- * "ADJUSTMENT" AND ITS ALGEBRA: - Intuitive Example: 'Explain' student''s school performance. . Factors of interest: Measures of school quality (Class size, training of teachers, teacher salaries, physical plant,...) ==> Collect these predictors in a matrix X1. . Factors NOT of interest in the study but cannot be ignored: Demographics (Household income, home zip code, education level of parents, ...) ==> Collect these predictors in a matrix X2. . Outcome variables: SAT scores, grades, graduation vs dropping out, ... . Idea: Need to ADJUST the variables of interest for demographics. - Partitioning linear regression: Partition the predictor matrix: X = (X1,X2) (Nxp1, Nxp2) X1 = predictors of primary interest X2 = non-ignorable predictors not of primary interest Partition the LS estimator: b^T = (b1^T,b2^T) (p+1 = p1+p2) y = X b + r = X1 b1 + X2 b2 + r Compare with naive approach: y = X1 b1* + r* - Q: How are b1 and b1* related to each other? A: In general we only know that b1 and b1* are not the same. Reason: The following problems produce different b1 coefficients: min_b1,b2 | y - (X1 b1 + X2 b2) |^2 min_b1* | y - (X1 b1*) |^2 ================================================================ LECTURE 7, 2011/09/28: * ORG: - HW 3 is due Tue, 2011/10/04, 5pm - Style: Read the first two lessons. Be prepared to discuss examples in class on Monday, 2011/10/03 - Q: Do we need to record any lectures due to religious holidays? We need just one request to make it happen. ---------------- * ROADMAP: - Recap: . Off-diagonal uses of V[b] . Self-influence . Heteroskedasticity and correlation among residuals . Degrees of freedom, traces, and budgets of total variance . Overfitting and the variance of fits . Adjustment (beginnings) - New: . Adjustment and its implications ---------------- * RECAP: - Give a practical example where you are interested in differences or sums of regression slopes. ... - How can you derive a SE for a difference or sum of slopes? ... - Explain the notion of "self-influence" in linear regression. ... PS: Stay tuned for a more general notion of self-influence. PPS: Self-influence is sometimes also called 'leverage'. Thus 'high self-influence' and 'high leverage' mean the same. But we will also learn about influence/leverage on quantities other than yhat_i, above all on bj. - Compare errors and residuals with regard to variance. ... - Which is larger, V[eps_i] or V[r_i]? ... - Compare errors and residuals with regard to correlation. ... - What are the budgets of total variance in y, yhat, r? ... - How does overfit express itself in V[yhat_i]? ... - Is it possible to overfit case i but not case k? ... - What is the motivation behind the idea of adjustment? ... - For bj, does it matter whether xk is included in the regression? ... ---------------- * ADJUSTMENT (continued): a VERY important topic! - Define orthogonal (LS) projections according to the partitions: H = projection onto column space of X H1 = projection onto column space of X1 H2 = projection onto column space of X2 (H1, H2 are NOT partitions of H; all are NxN) Question: When is H = H1 + H2 ? (see HW2, problem 7, for a simple case) - Definition: residual operations I-H2 = "adjustment operator" for X2 y.2 = (I-H2) y is "y adjusted for X2" Intuitively: Removing the variation that can be "explained" by X2 "is accounted for" Practical example from above: X1 = predictors describing school quality X2 = predictors describing demographics y = performance on SAT y.2 = performance on SAT adjusted for demographics Note: y.2 is a residual vector in a model with predictors X2. Upcoming surprise: We will need X1 adjusted for X2 !!!! X1.2 = (I-H2) X1 = Collection of predictors of interest (school quality) adjusted for demographics (X2). [Yes, the X1 predictors will be treated like responses!] - Some higher trivialities: Assuming y = X1 b1 + X2 b2 + r is the LS decomposition based on both X1 and X2, note the following: r and the columns of X are orthogonal to each other, hence r and the columns of X1, X2 are ... ==> r is adjusted for both X1 and X2. Now for the 'trivialities': H X = ... X H X1 = ... X1 H X2 = ... X2 H r = ... 0 (I-H) r = ... r (I-H) X = ... 0 H1 r = ... 0 H2 r = ... 0 (I-H1) r = ... r (I-H2) r = ... r (I-H1) X1 = ... 0 (I-H2) X2 = ... 0 - Adjust the LS decomposition y = X1 b1 + X2 b2 + r (*) for X2 by applying I-H2 to both sides: (I-H2) y = (I-H2) X1 b1 + r (*****) ^^^^^^^^ ^^^^^^^^^ ^ X2-adjusted X2-adjusted Remains fixed under response y predictors X1 X2-adjustment An attempt to put it in words: -------------------------------------------------------- | For a subset of 'relevant' predictors, | | the adjusted LS regression produces the same slopes | (*****) | as the combined multiple regression | (*) | on the relevant predictors AND the adjustors. | -------------------------------------------------------- - Example: "Adjustment" for the mean . The intercept vector is usually the most irrelevant 'predictor', hence let us adjust for it. . Let X2 = x0 = (1,1,...,1)^T be the intercept vector alone. X1 = (x1,...,xp) all other predictors. . What is the hat matrix for the sole predictor x0=e:=(1,...,1)^T? We will sensibly write H0 instead of H2. H0 = ... E/N where E = x0 x0^T = all 1''s, size NxN H0 y = ... ybar x0 (I-H0)y = ... centered y vector I-H0 = ... centering operation = adjustment operation for the intercept vector . Write down the mean-adjusted LS decomposition starting from y = b0*x0 + b1*x1 + ... + bp*xp + r ... = ... y.0 = b1*x1.0 + ... + bp*xp.0 + r . In words: Center y, x1, ..., xp and run LS w/o intercept. ==> LS slopes b1, ..., bp as if x0 had been included. ---------------- * SINGLE PREDICTOR ADJUSTMENT: - The most important case of adjustment: X1 = single column xj Nx1 X2 = all columns of X except column xj Nxp (X is Nx(p+1)) => The adjusted equation (***) amounts to a simple linear regression without intercept!!! This we can do explicitly: recall HW2, regressing y on x, no intercept b = .... Simplified notation: y.adj = (I-H2) y (y adjusted for all xk except xj) xj.adj = (I-H2) xj (xj adjusted for all xk except xj) bj = slope of simple lin. regr. of "y adjusted for all but xj" onto xj adjusted for all other predictors -------------------------- | < y.adj, xj.adj> | | bj = ---------------- | ("adjustment formula" for bj) | |xj.adj|^2 | -------------------------- Note: bj = j''th MULTIPLE regression coefficient!!! - Elementary consequence: Does it matter for the value of bj if we add or drop some of the other variables? ... - Some weirdness: Adjusting y is actually not necessary. Adjusting xj is. Numerator of adjustment formula = < y.adj, xj.adj> = <(I-H2)y, (I-H2)xj> = (Why? Why? Why?) = ------------------------- | < y, xj.adj> | | bj = -------------- | (the "weird adjustment formula" for bj) | |xj.adj|^2 | ------------------------- - Loose ends: . The LS estimate bj is a linear function of y: bj = What is the coefficient vector 'a' in view of the above? ... a = xj.adj / |xj.adj|^2 . We also have the triple-X formula for the b vector of LS estimates: b = (X^T X)^{-1} X^T y What is the connection to the weird adjustment formula for bj? ... xj.adj / |xj.adj|^2 = j''th row in the triple-X matrix In other words: X (X^T X)^{-1} = (x0.adj / |x0.adj|^2, x1.adj / |x1.adj|^2, ..., xp.adj / |xp.adj|^2) ---------------- * "ADJUSTED VARIABLES PLOTS" OR "LEVERAGE PLOTS": - A plot of y.adj vs. xj.adj shows what the LS criterion "sees" when determining bj. - Example: Fetch data from class website; after downloading to folder 'Data'... URL <- "http://stat.wharton.upenn.edu/~buja/STAT-541/" cars <- read.table(paste(URL,"CarModels2003-4.TXT",sep=""), header=T, sep=",", na=".") names(cars); dim(cars) rownames(cars) <- paste(1:nrow(cars),as.character(cars[,"Model"])) apply(cars, 2, function(x) sum(is.na(x))) # there is an NA in "MPG.Highway" sel <- !is.na(cars[,"MPG.Highway"]) # to select non-NAs # multiple regression of response on two predictors: cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=cars[sel,]) # adjusted response and other predictor: lm() = lin. models fct, resid() obvious mpg.adj <- resid(lm(MPG.Highway ~ Horsepower, data=cars[sel,])) wgt.adj <- resid(lm(Weight.lbs ~ Horsepower, data=cars[sel,])) # compare coeff of Weight: same!! "proof" by example... lm(mpg.adj ~ wgt.adj); cars.lm plot(mpg.adj ~ wgt.adj, pch=16, cex=.5); abline(lm(mpg.adj ~ wgt.adj)) - "Leverage Plot" because it shows points that have high leverage for bj (every bj can have its own leverage points) - Canned function for leverage plots with optional point identification: source(paste(URL,"leverage-plots.R",sep="")) leverage.plots(model=cars.lm, ident=c(1,2), frame.sz=5) # stop with right click - Note to future TAs of Stats classes: The JMP software provides leverage plots with all regression outputs. ---------------- * Variance/STDERR OF b_j: - Note about terminology: We call 'standard error' or SE the true sdev of an estimate. We call 'standard error estimate' or SE.est the estimated sdev of an estimate. (Most people say 'standard error' and mean 'standard error estimate'.) - Goal: Obtain V[ b_j ] from the adjustment formula for b_j. Recall variance of a linear combination: V[ ] = V[a^T y] = ... Use a = xj.adj: V[ ] = sigma^2 |xj.adj|^2 To obtain V[bj], you need to divide this by |xj.adj|^4: ------------------------- | sigma^2 | | V[bj] = ------------ | | |xj.adj|^2 | ------------------------- ----------------------- | | | sigma | "Adjustment formula" | SE[bj] = ---------- | for standard errors | |xj.adj| | of regression coeffs | | ----------------------- - Loose ends: We also have the double-X formula for V[b], V[b] = (X^T X)^{-1} sigma^2 How does the above adjustment formula for V[bj] relate to the V[b] matrix? ... ================================================================ LECTURE 8, 2011/10/03: * ORG: - HW 3 is due tomorrow Tue, 2011/10/04, 5pm Last minute questions? PS: primes and double primes never refer to derivatives... - HW 4 to be posted shortly, check webpage An exercise in LS computations, implemented in R Q-R decomposition with modified Gram-Schmidt - Style: Read the Lessons 3 and 4 by 2011/10/12 Be prepared to discuss examples in class on Monday, 2011/10/12 ---------------- * ROADMAP: - Recap: . some random quizzing . ADJUSTMENT -- IT''S SO IMPORTANT!!!!! . adjusting in general . adjusting for the mean/intercept . adjusting for all else . no need to adjust y - New: . Winding down adjustment . ANOVA decompositions (Pythagorean, really) . Rsquare . Expectations of Sums of Squares (SS) ---------------- * RECAP/ELABORATIONS: - Residuals have same variance, Y/N? ... - Residuals are uncorrelated, Y/N? ... - What do the above mean? Var[r_i], Cov[r_i,r_k]? ... - Why are cases not interchangeable wrt to V[r]? ... - Assume the predictors are entirely useless. Is it then true that V[r_i] = sigma^2, C[r_i,r_k]=0 (i!= k)? ... - What does it mean to say "the predictors are entirely useless"? ... - Adjustment, general case: X1 (Nxp1) adjustees X2 (Nxp2) adjustors . What is the adjustment operator? ... I-H2 . Write down the general adjusted LS decomposition. ... (I-H2)y = (I-H2) X1 b1 + r y.2 = X1.2 b1 + r . New: Write down the triple-X formula for b1. ... b1 = (X1.2^T X1.2)^{-1} X1.2^T y.2 - Mean-adjustment: . What does it mean to 'adjust for the mean'? ... . Using the notation 'stuff.0', write down the mean-adjusted LS decomp. ... y.0 = b1*x1.0 + ... + bp*xp.0 + r . New: Write down the triple-X formula for (b1,...,bp)^T. ... . ATTENTION: you still have p+1 parameters/dfs/dimension it is only for LS estimation that we have fewer parameters - Adjustment for all else: . What is the j''th predictor adjusted for all else? ... xj.adj = (I-H\j) xj where H\j is from X with column j removed . Write down the all-else-adjusted LS decomposition. ... y.\j = bj*xj.adj + r . Write down the triple-X formula for bj. ... bj = (xj.adj^T xj.adj)^{-1} xj.adj^T y.\j = / |xj.adj|^2 - Use the all-else-adjusted formula to derive SE[bj]. ... SE[bj] = sigma / |xj.adj|^2 ---------------- * CONSEQUENCES OF ADJUSTMENT FOR STATISTICAL INFERENCE: - How does |xj.adj| compare to |xj|? ... |xj.adj| <= |xj| - When is |xj.adj| = |xj|? ... xj is orthogonal to all other predictors including x0 (i.e., xj is already centered) - Do we like small or large values of |xj.adj| ? ... large - Why? ... SE[bj] !!! - What does |xj.adj| mean statistically speaking? What is |xj.adj|/sqrt(N-1)? ... = SD[xj.adj] by the traditional SD formula - How do you "see" |xj.adj|/sqrt(N-1) in the leverage plot? ... SD[xj.adj] = |xj.adj|/sqrt(N-1) describes the spread of xj.adj on the horizontal axis of the leverage plot for xj - What happens to |xj.adj| if we add a predictor that is highly correlated with xj? ... the length 'collapses'/shortens/... - How would you measure how collinear xj is with all other predictors? ... 1 - |xj.adj|^2 / |xj.0|^2 (This is an Rsquare. Note xj.0, not xj!) - See Mosteller & Tukey (1977, p.326f) on the proxy problem Piling lots of well-intentioned but highly collinear predictors into a regression causes trouble if individual inference for these predictors is sought. ---------------- * STEPWISE ADJUSTMENT = GRAM-SCHMIDT ORTHOGONALIZATION - Stepwise adjustment: use dot-notation x0 (intercept vector) x1.0 = 'x1 demeaned' x2.01 = res from a regression of x2 onto x0, x1 x3.012 = res from a regression of x3 onto x0, x1, x2 x4.0123 = res from a regression of x4 onto x0, x1, x2, x3 ... xp.01... = res from a regression of x4 onto x0, x1, ..., x{p-1} . What is the geometric relation among them? ... They are pairwise orthogonal. . How can you obtain each from the precedings ones? ... Ex.: You can obtain x4.0123 by regressing out x0, x1.0, x2.01, x3.012 The advantage is that these vectors are orthogonal, hence we can regress them out of x4 one by one, without multiple regression of x4 onto x0,x1,x2,x3. . Why is this Gram-Schmidt orthogonalization? ... We successively orthogonalize x0,x2,...,xp to generate an orthogonal basis x0,x1.0,x2.01,...,xp.01...{p-1} of the column space of X = (x0,...,xp). - Modified Gram-Schmidt: . The above G-S procedure has numerical stability problems in cases of extreme collinearity. The following modification provides numerical stability that is competitive with the best LS algorithms. . For numerical stability, 'sweep' the active predictor from all remaining predictors instead of sweeping all previous predictors from the active predictor. ('sweeping' = 'orthogonalizing' = 'taking residuals') . Example: p=4 active: x0 sweep: x1.0, x2.0, x3.0, x4.0 active: x1, actually, use x1.0 sweep: x2.01, x3.01, x4.01 active: x2, actually, use x2.01 sweep: x3.012, x4.012 active: x3, actually, use x3.012 sweep: x4.0123 . This results in a 'Q-R decomposition' for numerically solving the LS problem; see HW 4. . Competitors to modified G-S: instead of orthogonalizations, use Householder reflections ('reflection' = flipping on a hyperplane) Givens/Jacobi rotations (in coordinate planes) - Meanings of ADJUSTMENT: intuitive, statistical, geometric . xj.k is xj adjusted for xk . xj.k is 'decorrelated' from xk (assumes both are centered) . xj.k is xj orthogonalized wrt xk [For the middle statement note that cor(xj,xk) = ------------- |xj.0| |xk.0| Therefore, cor(xj,xk)=0 iff xj.0 and xk.0 are orthogonal. ] ---------------- * ANOVA DECOMPOSITIONS AND R-SQUARE: - Q: How much "variation" does the model given by X "account for"? . Approach: Use sums of squares (SS) as measures of "variation" but first remove uninteresting variation: the mean. . Recall the "mean-adjusted" LS decomposition: (I-H0) y = (I-H0)x1*b1 + ... + (I-H0)xp*bp + r More convenient notation: y.0 = b1*x1.0 + ... + bp*xp.0 + r . Why the removal of the mean variation? Because unlike the real predictors the intercept is not "explanatory", Therefore, get rid of the intercept once for all. - Pythagorean decomposition of orthogonal projections: We now assume y and all xj are centered. H is the LS projection onto the CENTERED columns x1,...,xp !!!!!! y.0 = b1*x1.0 + ... + bp*xp.0 + r (The mean-adjusted LS decomp again) y.0 = yhat.0 + r Recall yhat.0 and r are orthogonal, hence: |y.0|^2 = |yhat.0|^2 + |r|^2 (Pythagorean theorem) ----------------------------------- | TSS = MSS + RSS | 'ANOVA decomp' (the simplest form) ----------------------------------- Total SS = Model SS + Residual SS - Definition of R Square ("R2"): MSS |yhat.0|^2 . R2 := ----- = -------- = "Fraction of variation accounted for by the model" TSS |y.0|^2 "--explained--" (Arrgh!) . Fact: ---------------------- | R2 = cor(y,yhat)^2 | ---------------------- + Note: 'cor()' implies mean-adjustment, that is, y.0 and yhat.0 are used. + Proof: cor(y,yhat)^2 = ^2 / = ^2 / (|y.0|^2 * |yhat.0|^2) = |yhat.0|^4 / (|y.0|^2 * |yhat.0|^2) = QED ================================================================ LECTURE 9, 2011/10/05: * ORG: - HW 4 is posted, due Fri, 2011/10/14, 5pm An exercise in LS computations, implemented in R Q-R decomposition with modified Gram-Schmidt - HW 5 may get posted sooner, though. - Style: Read Style Lessons 3 and 4 by Wed 2011/10/12 Be prepared to discuss examples in class. ---------------- * ROADMAP: - Recap: . Adjustment, for the last time . ANOVA/Pythagorean decompositions . Rsquare - New: . Expectations of Sums of Squares (SS) . Distribution theory for tests and CIs in linear models: + multivariate normal + t distributions for testing one slope (one linear fct of beta) + F distributions for testing multiple slope (multiple linear fctss of beta) (No optimality theory, though; see Math Stat) ---------------- * RECAP/ELABORATIONS: - Why do we like large |xj.adj|^2? ... makes SE[bj] small - What does large |xj.adj|^2 mean? Large compared to what? ... compare to |xj.0|^2 - How can we measure the degree of collinearity of xj with the rest of the predictors? (Twist to last time we asked the question: exclude x0; it''s uninteresting.) ... Rj.adj^2 = 1 - |xj.adj|^2/|xj.0|^2 - In a study of demographic factors for student''s school performance, experts would like to know whether it is 'household income' or 'father''s income' or 'household networth' that plays a greater role. What would you say if all three were put into the regression? ... - Derive the ANOVA decomposition for y.0 = yhat.0 + r ... - ANOVA is largely regression for designed experiments where X is 'designed', that is, chosen. Often the 'design' X is such that convenient orthogonalities arise (e.g.: balanced 2-way ANOVA). Therefore consider X=(X1,X2) such that the X1.0 columns are orthogonal to the X2.0 columns. Q1: What is a matrix statement for X1.0 _|_ X2.0 ? ... Q2: Derive the ANOVA decomposition for y.0 = X1.0 b1 + X2.0 b2 + r ... - Rsquare = cor(...,...)^2 ?? - New: Derive the exact relation between RMSE, Rsquare and SD(y) from the basic ANOVA decomposition TSS = MSS + RSS. ... - Solve this relation for Rsquare and suggest a 'better' Rsquare. It will turn out that this exists and is called 'adjusted Rsquare'. ... ---------------- * EXPECTED SUMS OF SQUARES: widely calculated in ANOVA - Can we calculate E[TSS], E[RSS] and E[MSS] theoretically? . We won''t have use for this in the future, but there are two reasons for doing this math exercise: + ANOVA (really the analysis of designed experiments with highly structured X) is permeated by calculations of expected values of sums of squares (E[SS]). + The calculations are insightful in their own right because they describe sources of variation between datasets. . Assume the model is 1st order correct and has uncorrelated homoscedastic errors: y = X beta + eps with E[eps]=0, V[eps] = sigma^2 I ==> E[y] = X beta =: mu . We need to allow the intercept term in the model, hence we write: TSS = |(I-H0)y|^2 = |y.0|^2 (total SS in the data other than the mean) MSS = |(H-H0)y|^2 = |yhat.0|^2 (SS captured by the model other than the intercept) RSS = |(I-H) y|^2 = |r|^2 (SS not captured by the model) where H is the projection onto the space spanned by x0,x1,...,xp. . Note: I-H0, H-H0, I-H are all orthogonal projections, with ranks N-1, p, N-p-1, respectively. Fact: H-H0 = H(I-H0) - Results: E[RSS] = sigma^2 (N-p-1) E[TSS] = | (I-H0) mu |^2 + sigma^2 (N-1) E[MSS] = | (H-H0) mu |^2 + sigma^2 p = | mu.0 |^2 + sigma^2 p # if we have 1st order correctness - Proofs: similar for all three (we did E[RSS] earlier) General situation: y = mu + eps (all in R^N) mu fixed, E[eps]=0 (N-vector), V[eps]=I_N sigma^2 Apply any fixed (non-random) orthogonal projection H. Problem: E[|H y|^2] = ??? General solution: E[|H y|^2] = E[|H mu + H eps|^2] = E[|H mu|^2 + |H eps|^2 + 2] = E[|H mu|^2 + |H eps|^2] # E[] = 0 = |H mu|^2 + E[|H eps|^2] E[|H eps|^2] = E[ eps^T H eps ] = E[ trace(eps^T H eps) ] = E[ trace(eps eps^T H) ] = trace(E[eps eps^T] H) = trace(V[eps] H) = trace(sigma^2 I H) = trace(H) sigma^2 ------------------------------------------------ | E[|H y|^2] = |H mu|^2 + trace(H) sigma^2 | ------------------------------------------------ ^^^^^^^^ ^^^^^^^^^^^^^^^^ signal noise (EE language) contribution contribution - Applications: (1) Replace H with I-H from the regression: I-H = residual projection E[RSS] = E[ |(I-H) y|^2 ] = |(I-H) mu|^2 + trace(I-H) sigma^2 = |(I-H) X beta|^2 + (N-p-1) sigma^2 # 1st term = 0 if mu = X beta = (N-p-1) sigma^2 # i.e., 1st order correctness (2) Replace H with the H-H0 from the regression: H-H0 = projection onto space of CENTERED fits E[MSS] = E[ |(H-H0) y|^2 ] = |(H-H0) mu|^2 + trace(H-H0) sigma^2 = |(H-H0) X beta|^2 + p sigma^2 # p, not p+1, due to centering = |(X beta).0|^2 + p sigma^2 # assuming 1st order correctness (3) Replace H with I-H0 (centering): E[TSS] = E[ |(I-H0) y|^2 ] = |(I-H0) mu|^2 + trace(I-H0) sigma^2 = |(X beta).0|^2 + (N-1) sigma^2 # assuming 1st order correctness - "Signal-to-Noise Ratio" or SN: Motivated by the decomposition of E[MSS]: How much 'signal' (model) and how much noise is in LS estimation? | (I-H0) mu |^2 / Ratio of model to noise variation SN = ------------------- = | in fitted values sigma^2 p \ excluding uninteresting mean variation Q: Do we know the SN in actual data analysis? ... - We calculated E[MSS] and E[TSS]. We defined R2 = MSS/TSS Q: Can we calculate E[R2] theoretically? ... ---------------- * THE MULTIVARIATE NORMAL DISTRIBUTIONS: Definitions and Facts - Def.: An N-dimensional 'standard normal random vector' is given by z = (z1,z2,...,zN)^T where zn are i.i.d. N(0,1) We write z ~ N(0,I) where 0 is in R^N and I is NxN. One draw from z, in R: z <- rnorm(N) - Fact: Rotation invariance of the multivariate standard normal distribution If z ~ N(0,I), then Qz ~ N(0,I) for any NxN orthogonal matrix Q. Proof: Use that the density of 'z' depends only on |z|^2, and det(R)=1. Let 'Event' be any subset of R^N. P[ Qz in Event ] = integral over Event of p(Qz) d(Qz) = integral over Event of p(z) d(z) det(Q) = integral over Event of p(z) d(z) = P[ z in Event ] - Def.: A general N-dimensional 'normal random vector' with mean vector 'mu' and covariance matrix 'C' is given by y = (y1,y2,...,yN)^T where ---------------------------------------------------- | | | y = A z + mu and C = A A^T and z ~ N(0,I) | | | ---------------------------------------------------- We write --------------- | y ~ N(mu,C) | where mu is in R^N and C is NxN (sym., >=0). --------------- IMPORTANT: The linear map A can be NON-FULL RANK!!!! We then say: The multivariate normal distribution is DEGENERATE. - Fact: The definition of N(mu,C) is independent of the representation C=AA^T. Proof idea: AA^T = BB^T <==> A = B Q for some orthogonal Q ('<==' is easy; '==>' less so) Then show that P[ Az in Event ] = P[ Bz in Event ] P[ Az in Event ] = P[ z in A^{-1} Event ] = P[ z in Q^{-1} B^{-1} Event ] = P[ Qz in B^{-1} Event ] = P[ z in B^{-1} Event ] = P[ Bz in Event ] - Fact: Multivariate normal distributions are uniquely characterized by their expectations and variance/covariance matrices (1st & 2nd moments). - Fact: Any linear (matrix) transformation of a multivariate normal distribution yields again a normal distribution: y ~ N(mu, C) ==> Gy+c ~ N(G mu + c, GCG^T) IMPORTANT: . G can be not only singular but even non-square with arbitrary target dimension. . This includes therefore linear forms with target dimension 1: y N-dimensional (mu in R^T, C is NxN) G = g^T 1xN (N-dim constant row vector) ==> Gy + c ~ N(g^T mu + c, g^T C g) is univariate normal - Fact: Two random vectors y1, y2 with a joint normal distribution are uncorrelated iff they are independent. V[y1,y2] = 0 <==> y1, y2 are independent Reason: y = c(y1,y2) (using R notation; math: y = (y1^T,y2^T)^T ) V[y] = block-diagonal with 1,2 block and 2,1 block zeroed out ==> y1 = A1 z1 + mu1, y2 = A2 z2 + mu2 where z=c(z1,z2) ~ N(0,I) Because z1 and z2 are independent, y1 and y2 are independent. - Fact: If y ~ N(mu, sigma^2 I), when are A^T y and B^T y uncorrelated (and hence ...)? V[A^T y, B^T y] = ... A^T V[y] B = (A^T B) sigma^2 ==> A^T y and B^T y are independent if the columns of A are orthogonal to the columns of B. (Column space of A is orthogonal to the column space of B.) - Special case: --------------------------------------------------------- | If: | | y ~ N(mu, sigma^2 I) Nx1 | | a, b Nx1, fixed | | Then: | | a^T y and b^T y are independent <==> a _|_ b | --------------------------------------------------------- Try to fully understand and memorize the above box! It is a recurring idea/method/proof technique. - Note: The definition of general normal distributions includes "degenerate" or "singular" cases where V[y] = C is not of full rank. - Note: Most of the above arguments work when the initial z is replaced with any circularly symmetric distribution in R^N. (An example is the circularly symmetric t-distribution.) That is, from a circularly symmetric distribution one can define an elliptic family of distributions. The only property that does not generalize is the equivalence of 'uncorrelated' and 'independent'; this property is special to the multivariate normal distribution, and in fact it characterizes it. - R: One draw from y is obtained by N <- 100 A <- t(cbind(rep(1,N), 1:N, (1:N)%%2)) mu <- 1:ncol(A) y <- A %*% rnorm(N) + mu - R: Given C, how do you find an A? C <- crossprod(matrix(rnorm(25),5,5)) # example of generating a pos def matrix A <- t(chol(C)) # Version 1: Cholesky decomp tcrossprod(A); C C.eig <- eigen(C) # Version 2: with eigen decomp A <- t(sqrt(C.eig$values) * t(C.eig$vectors)) tcrossprod(A); C ================================================================ LECTURE 10, 2011/10/12: * ORG: - HW 4 due this Fri, 2011/10/14, 5pm Any questions? - Style: Read Style, Lessons 5 and 6 by Wed 2011/10/19 Be prepared to discuss examples in class. ---------------- * ROADMAP: - Style exercises - Recap: . Expected values of SSs . Multivariate normal distributions - New: . t-distribution for testing one parameter at a time . F-distribution for testing multiple parameters at a tim ---------------- * Writing: Exercise x.x, p.xx -- revise the following sentences - The use of models in teaching prose style does not result in improvements of clarity and directness in student writing. - Precision in plotting the location of building foundations enhances the possibility of its accurate reconstruction. - Any departures by the members from established procedures may cause termination of membership by the Board. - A student''s lack of socialization into a field may lead to writing problems because of his insufficient understanding about arguments by professionals in that field. - The successful implementation of a new curriculum depends on the cooperation of faculty with students in setting achievable goals within a reasonable time. Your writing may not suffer from these defects, but when you see such prose in the future, you will know what is wrong about it and be able to parse and, if necessary, improve on it. ---------------- * RECAP: - Assume E[y]=mu, V[y]=I*sigma^2, and H is any orthogonal projection. . Draw schematic 3-D picture of the assumptions. . What is E[ |H y|^2 ] = ??? - Apply the preceding to E[TSS], E[MSS], E[RSS]. - Multivariate normal distributions: . How do you generate a sample from N(0_n, sigma^2 I_n)? ... . How do you generate a sample from N(mu_n, C)? ... - Comment on the connection between 'uncorrelated' and 'independent'. ... - If y ~ N(mu, sigma^2 I) (size nx1), when are and independent? ... - If y ~ N(mu, C) (size nx1), what are the variances and the covariance of and ? When are they independent? ... - If y ~ N(mu, C) (size nx1), what is the joint distribution of and <-a,y>? ... - New: If y ~ N(X beta, I*sigma^2), what distributions do yhat and r have? ... ---------------- * NEW: EXACT DISTRIBUTION THEORY FOR LINEAR MODELS - We will be examining the null distribution of t and F statistics. We proceed in small steps. - Q: If y ~ N(X beta, sigma^2 I), what is the distribution of 'b' and 'bj'? A: For the whole coef vector of estimates: b = (X^T X)^{-1} X^T y is a linear transformation of y E[b] = beta V[b] = sigma^2 (X^T X)^{-1} => b ~ N(beta, sigma^2 (X^T X)^{-1}) (multivariate normal) One coeff estimate at a time: bj = < y, xj.adj/|xj.adj|^2 > is a linear combination of y E[bj] = betaj (assuming 1st order correctness) V[bj] = sigma^2 / |xj.adj|^2 (assuming 2nd order correctness) => bj ~ N(betaj, sigma^2 / |xj.adj|^2) (univariate normal) => (bj - betaj) ---------------- ~ N(0, 1) (***) sigma / |xj.adj| - Q: Can we use (***) for testing H0: betaj=0, or for forming a CI for betaj? A: No, because we don''t know sigma. We estimate sigma^2 with s^2 = RSS/(n-p-1) = RMSE^2 SE.est[bj] = sigma.hat / |xj.ad|j = RMSE / |xj.adj| bj - betaj (bj - betaj) (bj - betaj)*|xj.adj| tj = -------------- = -------------- = --------------------- SE.est[bj] s/|xj.adj| s = t-statistic with the null distribution for H0: the true j''th slope is betaj (an assumed value) - Q: How is a t-distribution with k degrees of freedom defined? A: For any Z, Z1,...,Zk that are i.i.d. N(0,sigma^2) define Z # N(0,sigma^2=1) t := ------------------------ / Z1^2+...+Zk^2 \ sqrt(------------------) # independent estimate of sigma=1 \ k / Facts: . t has the same distribution if Z, Z1, ..., Zk ~ i.i.d. N(0,sigma^2). Why? ... . As k-->Inf we have t-->N(0,1). Why? ... . General principle used in the definition of the t-distribution: Start from a random variable/vector with a well-known distribution (here: (Z,Z1,...,Zk) ~ N(0, I) of size (k+1)x1), and define the new distribution in terms of a new random variable that is a function of the old random variable/vector (here: t = a function of (Z,Z1,...,Zk)). . This method of defining distributions allows us to simulate the new distribution if we know how to simulate the old distribution. This is in practice more useful than knowing densities and cdf''s. - Q: Why would 'tj' derived from 'bj' be distributed like a 't'-distribution? A: Some pretty linear algebra and geometry, and some smart basis/coordinate changes!!! Assume: y ~ N(X beta, I*sigma^2) Consequences: tj = (bj - betaj) * |xj.adj| / s Z := (bj - betaj)*|xj.adj| ~ N(0,sigma^2) Therefore: tj = Z / s Now, what would we like s^2 to look like? Do wishful thinking... (1) s^2 = (Z1^2 + Z2^2 + ... + Z{N-p-1}^2) / (N-p-1) for some Zj ~ iid N(0,sigma^2) (2) Z1, Z2, ..., Z{N-p-1} and Z should be independent. How do we go about it? Idea: r is independent of yhat = X b, hence of b !!! Let Q be any (!) Nx(N-p-1) matrix whose columns form an orthonormal basis of residual space: . Q^T X = 0 (The columns of Q are orthogonal to the columns of X.) . Q^T Q = I (The columns of Q are mutually orthogonal; I is (N-p-1)^2.) . Q Q^T = I-H (= the projection onto residual space; this I is N^2.) [If you had to, how would you compute a matrix Q in R? ...] Define the (N-p-1)-dimensional random vector z = (Z1,Z2,...,Z{N-p-1})^T = Q^T y Facts: . r = Q z (Proof: r = (I-H)y = Q Q^T y = Q z) (Interpretation: z contains the coordinates of r in the basis Q.) . z ~ N(0, sigma^2*I) where 'I' is (N-p-1)x(N-p-1) (Proof: z = Q^T y; y is normal, hence z is normal. E[z] = Q^T X beta = 0 size=(N-p-1)x1 V[z] = Q^T V[y] Q = sigma^2*I size=(N-p-1)^2 . |r|^2 = |z|^2 QED 1) above (Proof: |r|^2 = z^T Q^T Q z = |z|^2.) . z and bj are stochastically independent. QED 2) above (Proof: bj ~ xj.adj^T y, and xj.adj is orthogonal to the columns of Q.) ================================================================ LECTURE 11, 2011/10/17: * ORG: - HW 5 to be posted soon - Style: Read Style, Lessons 5 and 6 by Wed 2011/10/19 [wrong data given previously] Be prepared to discuss examples in class. ---------------- * ROADMAP: - Recap: . t-distribution: ideas - New: . Properties of t-distributions . F-distribution for testing multiple parameters at a tim . Confidence intervals for single parameters . Scheffe confidence ellipsoids for multiple parameters . Short intro to family-wise error control in multiple comparisons/simultaneous inference ---------------- * RECAP: - Define a t-distribution with k degrees of freedom ... - What is the t-statistic tj for bj? Start from the adjustment formula for bj and derive tj. ... - What are the three steps to show that tj is t-distributed? (1) ... (2) ... (3) ... - Argue the three steps without doing algebra. I.e., tell a mathematical/geometric story. ---------------- * PROPERTIES OF t-DISTRIBUTION with k degrees of freedom: - Bell-shaped: start with df=1, i.e., a Cauchy distribution x <- seq(-4,4,length=501) plot(x, dt(x,df=1), type="l", ylim=c(0,.45), col="blue", lwd=2, xlab="quantile", ylab="t-density") # dt(..,df=1) == dcauchy(..) abline(h=0) text(x=0, y=.25, lab="Cauchy", col="blue") text(x=0, y=.23, lab="df=1", col="blue") - As df increases, the t-distribution gets lighter in the tails: colrs <- gray.colors(100,0,.8) for(df in 2:100) lines(x, dt(x,df), col=colrs[df], lwd=.5) - As df -> Inf, the t-distribution approaches N(0,1) distribution: We proved this with the LLN applied to the denominator. We can also see it in the densities: lines(x, dnorm(x), col="red", lwd=2) text(x=0, y=.44, lab="Gauss", col="red") text(x=0, y=.42, lab=expression(paste("df=",infinity)), col="red") #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ math symbols # For math symbols in R plots see help(plotmath) demo(plotmath) ---------------- * NEW: F-TESTS for comparing two arbitrary nested models - Purpose: Testing groups of predictors jointly in particular: overall F-test - Def. of F-distribution: W1,...,Wq,Z1,...Zr i.i.d. N(0,sigma^2) ^^^^^ irrelevant but must be same (W1^2+...+Wq^2)/q F = ----------------- has a (central) F-distribution with q,r dfs (Z1^2+...+Zr^2)/r - Note: . Special case q=1: F = t^2 with r dfs . There is an element of convention in this definition. For example, one could have dropped the dfs and worked with scaled version of this F distribution. Also, one could have worked with a distribution gleaned from Rsquare instead; this would have resulted in a Beta distribution with equivalent inference: W1^2+...+Wq^2 ----------------------------- ~ Beta(...,...) Z1^2+...+Zr^2 + W1^2+...+Wq^2 . What happens to F as the denominator dfs grow large, r-->Inf? ... ==> If r is large what distribution does F have approximately? ... - General principle for Sums of Squares: . Given: y ~ N(0, sigma^2 I) (1) If H is any orthogonal rank-r projection: H = Q Q^T [Q=(q1,...qr), Q^T Q=I] Then: |H y|^2 = Z1^2+...+Zr^2 where Zj ~ iid N(0,sigma^2) (2) If H1 and H2 are projections onto orthogonal subspaces, H1 H2 = 0, then |H1 y|^2 and |H2 y|^2 are independent. - It is worthwhile to introduce chi^2 distributions: . If Zj ~ iid N(0,1), then Z1^2+...+Zr^2 is said to have a chi^2 distribution with r dfs: Z1^2+...+Zr^2 ~ chi^2(r) . If Zj ~ iid N(0,sigma^2), then Z1^2+...+Zr^2 is said to have a 'scaled' chi^2 distribution with r dfs: Z1^2+...+Zr^2 ~ chi^2(r)*sigma^2 . The above projection result says that if y ~ N(mu, sigma^2 I), then |H (y-mu)|^2 ~ chi^2(r)*sigma^2, r=rank(H) . F-variables are ratios of independent chi^2 variables with same sigma^2, scaled by the chi^2 dfs: sigma^2*chi^2(df1)/df1 F = ---------------------- where numerator and denominator are independent. sigma^2*chi^2(df2)/df2 . F-variables can be obtained from y ~ N(0,sigma^2 I) and two orthogonal projections onto orthogonal subspaces, H1 H2 = 0: |H1 y|^2 / rank(H1) F = ------------------- |H2 y|^2 / rank(H2) - F-ratios in linear models: . Purpose: Testing the null hypothesis that A COLLECTION OF SLOPES takes on particular values (usually zero: no effect). . Partition the linear model y = X beta + eps as when we introduced adjustment, but this time for the model, not the LS decomposition: y = X1 beta1 + X2 beta2 + eps (***) X = (X1,X2) X1 is Nxq, X2 is Nxr, q+r=p+1 beta1 is qx1, beta2 is rx1 . H0: We know beta1 E.g., beta1=(0,...,0)^T in R^q, but it works for any fixed assumed beta1. Is there evidence against H0, that is, the specific beta1? ==> We need a statistic and its null distribution. Method: Look for a SS that measures the variation accounted for by X1 but adjusted for X2, so we are excluding X2-variation. Manipulate (***): y - X1 beta1 = X2 beta2 + eps [We assume we know beta1 according to H0.] Adjust for X2: Apply I-H2 to both sides, where H2 = X2 (X2^T X2)^{-1} X2^T (I-H2)(y - X1 beta1) = (I-H2)eps Nice: We know that |(I-H2)eps|^2 ~ chi^2(N-r)*sigma^2 Not nice: We want something with q dfs, not N-r, because we are testing q parameters, not N-r. Solution: Use a more specific projection, namely: (H-H2) Properties of H-H2: + (H-H2) is also an orthogonal projection because the column space of X2 is contained in the column space of X=(X1,X2). This implies: H2 H = H H2 = H2 [Explain in English!] Which implies: (H-H2)(H-H2) = (H-H2) Symmetry is trivial anyway: (H-H2)^T = (H-H2) + (H-H2) peels off the variation in y accounted for by X1 but cannot be accounted for by X2. + Note that (H-H2) is NOT H1 = X1 (X1^T X1)^{-1} X1^T unless X1 and X2 span orthogonal subspaces. If this is the case, then X1 does not need to be adjusted for X2. We have, however, this messy formula: (H-H2) = X1.2 (X1.2^T X1.2)^{-1} X1.2^T Do you see it? [I have never proven it, but I know it must be true...] . Proposed F-ratio: | (H-H2)(y - X1 beta1) |^2 / q F = ------------------------------ RSS/(N-p-1) . We will show: Under H0, F has an F-distr with q,(N-p-1) dfs. . Program: (1) We know | (H-H2)(y - X1 beta1) |^2 ~ chi^2(q)*sigma^2 (2) We know RSS = | (I-H)y |^2 ~ chi^2(N-p-1)*sigma^2 [Explain to yourself that you don''t need to subtract X1 beta1 here.] (3) The two are independent because they stem from orthogonal projections (H-H2) and (I-H) that project onto orthogonal subspaces: (H-H2)(I-H) = 0 [Prove this using H H2 = H2 H = H2] Done. ---------------- * NEW: CONFIDENCE INTERVALS FROM t and CONFIDENCE ELLIPSOIDS FROM F - General principle for confidence intervals/sets: CI/S = interval/set of parameter values/vectors that are retained if tested Coverage Prob of CI/S = 1 - Prob of erroneous rejection - CI for Testing H0: betaj is the true j''th slope bj - betaj tj = ---------- where SE.est(bj) = s / |xj.adj| SE.est(bj) ME(bj,1-alpha) = margin of error for coverage 1-alpha = Type I error alpha = SE.est(bj)*t_(1-alpha/2,N-p-1) CI.j = { betaj : |bj-betaj| < ME(bj,1-alpha) } = { betaj : bj-ME(bj,1-alpha) < betaj < bj+ME(bj,1-alpha) } = [bj - ME(bj,1-alpha), bj + ME(bj,1-alpha)] - Testing H0: beta1 is the vector of q true slopes for q predictors according to the decomposition (***) above. [Apologies, our notations are tripping here: beta1 is a vector of true slopes, not the one slope of x1.] | (H-H2)(y - X1 beta1) |^2 / q | (H-H2)(y - X1 beta1) |^2 / q F = ------------------------------ = ------------------------------ RSS/(N-p-1) s^2 Invert to a confidence set, using an F-quantile F_(1-alpha,q,N-p-1). To this end you have to think of F as F=F(beta1): CS = { beta1 : F(beta1) < F(1-alpha,q,N-p-1) } = { beta1 : | (H-H2)(y - X1 beta1) |^2 < s^2*F_(1-alpha,q,N-p-1) } ==> Scheffe confidence ellipsoid consisting of beta1 vectors. Do you find this useful? ... What would you rather have as a confidence set for beta1? ... ---------------- * NEW: GLIMPSES OF SIMULTANEOUS INFERENCE - General problem: . Many parameters are being tested at the same time. . CIs are being sought for many parameters at the same time. - For each test we make sure: . Type I error prob = alpha P[ reject betaj | betaj is true ] = alpha . Coverage prob = 1-alpha P[ betaj in CIj | betaj is true ] = 1-alpha - But what about multiple tests and CIs? . Test for beta1,...,betap: P[ reject any betaj | all betaj are true ] = ??? . Coverage for beta1,...,betap simultaneously? P[ betaj in CIj for all j | all betaj are true ] = ??? - Lazy but powerful recipe for achieving simultaneous Type I error control and simultaneous coverage: Divide alpha by the number of parameters considered: ==> Bonferroni rule . Simultaneous Type I error prob = P[ union of rejections | all betaj are true ] <= sum_j P[ betaj rejected | all betaj are true ] <= sum_j P[ betaj rejected | betaj is true ] (xx) = q*(alpha/q) = alpha . Simultaneous CI coverage prob = P[ betaj in CIj for all j | all betaj are true ] = 1-P[ betaj rejected | all betaj are true ] >= 1-sum_j P[ betaj rejected | betaj is true ] = 1-q*(alpha/q) = 1-alpha - Curious detail in (xx) above: We switched the null hypothesis from multiple to single parameter. If we have equality in (xx), it is called "subset pivotality". In linear models we do have subset pivotality because the t-statistics are 'pivotal', that is, their distribution does not depend on the other parameters. ================================================================ LECTURE 12, 2011/10/19: * ORG: - HW 5 to be posted soon - Writing Homework 1: Exercises from chapter 4 of 'Style' The sentences are copied into a file for download from the class website. Send the edited 'txt' file with your solutions in an attachment to the usual HW email address. Due: this Monday, 2011/10/24, 3pm ---------------- * ROADMAP: - Writing summary of chapter 5. - Recap: . Sums of squares, the structure of their distributions . F distributions . Tests and CI/S from t- and F-tests . Multiple comparison/simultaneous inference - Next: . Loose end: prove that t^2=F . RIs, CIs and PIs for the response . Formula language in R . Interpretation of linear models . MBA statistics: logarithmically transformed x and y . Diagnostics for linear models . Inference for diagnostics ---------------- # Play Eric Schwartz' R code for Rosh Hashana a <- letters chi <- 18 reps <- 18*2 X <- matrix( runif(2*chi*reps), ncol=2) windows(); par(mfrow=c(1,1), mar=.1+c(1,1,1,1) ) for(i in 2:reps){ plot(X[((i-1)*chi):(i*chi), ], pch=11, cex=sample(1:5,size=chi,replace=T), col=4, ylab="",xlab="",xlim=c(0,1),ylim=c(0,1),axes=F, main=paste(a[19],a[8],a[1],a[14],a[1]," ",a[20],a[15],a[22],a[1],sep="") ) Sys.sleep(.5) } points(X, pch=11, cex=3, col=c(4,7) ) text(.5,.5,paste(a[19],a[8],a[1],a[14],a[1]," ",a[20],a[15],a[22],a[1],"!",sep=""),cex=3) text(.7,.01,"Eric Schwartz, 5772",cex=1.5) ---------------- * WRITING: some messages worth internalizing - Begin sentences with that which is familiar. I.e., a sentence should pick up on what was said before to create cohesion. - End sentences with information that could not be anticipated by the reader. - Use the active OR passive voice to position the familiar at the beginning of the sentence. - The principle of moving from the familiar to the new agrees with the principle of establishing characters in the subject. - Connectors to previous sentences: this, these, that, those, another, such, second, more - Cohesion vs coherence: . Cohesion: adjacent sentences fit together . Coherence: a larger unit (e.g., paragraph) establishes a larger picture or argument Coherence stems from consistency in the topic. - The subject usually contains the character, topic, the familiar, but exceptions exist; see p.73 for examples. - Avoid distractions at the beginning of the sentence. No 'throat clearing' noise; refer to the familiar briskly. - Do NOT follow the stylistic advice to vary the subject to avoid monotonicity. - Do not fake cohesion with 'thus', 'therefore', 'however', 'indeed',... ---------------- * RECAP: - Formulate a general principle for the distribution of sums of squares based on orthogonal projection. ... |H(y-mu)|^2 where y ~ N(mu, sigma^2 I) Distribution: sigma^2 * chi^2(rank(H)) - Formulate a general principle for the independence of two sums of squares based on orthogonal projections. ... |H1(y-mu)|^2/rank(H1) F = --------------------- where H1 _|_ H2 |H2(y-mu)|^2/rank(H2) i.e., H1 H2 = 0 - Partition the predictor matrix: X = (X1,X2), Create two orthogonal projections for testing a submodel H0: y = X2 beta2 + eps i.e.: beta1=0 against H1: y = X1 beta1 + X2 beta2 + eps ... Use Hy but adjusted for X2, that is: (I-H2)Hy = (H-H2)y SS = |(H-H2)y|^2 = |yhat - yhat2|^2 - Try to explain the meaning of the numerator SS. ... The SS needs to capture the variation in y that can be accounted for by the large model, i.e., Hy, but cannot be accounted by X2, i.e, (I-H2)Hy = (H-H2)y - How does inversion of tests work for the construction of confidence sets? Reconstruct for the F-test of several slopes. ... - Describe what is unsatisfactory about the Scheffe confidence ellipsoid? ... - What would be a more desirable shape of CS for parameters beta1, beta2, beta3? - What is its desired guarantee? ... ---------------- * LOOSE END: SHOW THAT t^2=F (it''s not trivial) - To show: tj^2(df(r)) = F(df1=1,df2=df(r)) - Critical geometric fact: Let H.no.j be the orthogonal projection for the model without xj ('no j'). Then: ------------------------------ | (H - H.no.j) y = bj*xj.adj | (*) ------------------------------ Draw a 2D picture to see (*). We know H-H.no.j is an orthogonal projection. What is its rank? ... What does it project onto? ... - Now prove tj^2 = F(df1=1,...): (bj - betaj)^2 |xj.adj|^2 tj^2 = ------------------------- s^2 |(H-H.no.j)(y - xj*betaj)|^2 F = ---------------------------- s^2 ---------------- * NEW: RIs, CIs AND PIs FOR THE RESPONSE - Inference for the response vs. prediction of the response: CIs vs PIs . Let xx be a ROW from the design matrix X WRITTEN AS A COLUMN, OR a new predictor vector of interest. Ex.: When modeling the car data with MPG = b0 + b1*WEIGHT + b2*HP we have xx=(1,WEIGHT,HP)^T We might want to know something about y=MpG at xx=(1, 3000lb, 200HP)^T . Point estimation: yhat(xx) = where b=(b0,b1,b2)^T. Ex: "It is thought in the automotive industry that cars with a weight of 3000lb and 200HP get ON AVERAGE 27 MGP on the highway." . Inference for the response is really inference about E[y(xx)] = + Testing H0: =27 RI: [ - halfwidth, + halfwidth] + Confidence interval: CI: [ - halfwidth, + halfwidth] Q: What is 'halfwidth'? How do we find it? . Prediction of the response itself: We look for an interval to capture future response values y(xx) at xx (e.g., for xx=(1,3000,200)^T). . Ex. of the use of prediction intervals: "It is thought in the automotive industry that MOST cars with a weight of 3000lb and 200HP get between 24 and 30 MGP on the highway." 'Most cars' might be rendered as '95% of car models'. . Form of the prediction interval PI: PI = [yhat(xx) - halfwidth, yhat(xx) + halfwidth] Its gambling guarantee should be to catch FUTURE response values y at xx, based on PAST training data with probability 95%. . In prediction problems, future AND past data are random, ==> The gambling guarantee has to account for randomness in both. . Reminder: to be covered by: = yhat(xx) est. ave of y at xx RI = E[y(xx)] true ave of y at xx CI y(xx) = E[y(xx)] + eps future y value at xx PI . Q: How do the widths of RIs, CIs, PIs compare with each other? ... . Q: How do we obtain 'halfwidths' of RIs, CIs, PIs? We calculate ... . Q: How do we obtain the 'gambling guarantees', that is, the coverage probabilities? P[ is in RI ] = 1-alpha P[ is in CI ] = 1-alpha P[ y(xx) is in PI ] = 1-alpha ... - Inference for E[y(xx)]=: RI(beta) and CI(b) . yhat(xx) = [LS estimate] . E[yhat(xx)] = = E[y(xx)] [if the model is 1st order correct] . Var[yhat(xx)] = Var[] = Var[xx^T (X^T X)^{-1} X^T y] [use what formula? ...] = sigma^2 (xx^T (X^T X)^{-1} xx) . Standard error estimate: stderr.est(yhat(xx)) = s * sqrt(xx^T (X^T X)^{-1} xx) Q: What is this for xx=xi, the observed i''th row? ... [Hint: H] . t-statistic: yhat(xx) - -------------------- is t-distributed with df=N-p-1 stderr.est(yhat(xx)) Why? ... . Can be used for RIs and inverted to CIs: RI = 95% interval to capture yhat(xx) if H0: E[y(xx)]= holds = [ +- t * s * sqrt(xx^T (X^T X)^{-1} xx)] CI = 95% random interval to capture the true = [ +- t * s * sqrt(xx^T (X^T X)^{-1} xx)] . Uses: + Testing H0: rarely done Ex.: H0: E[MPG(WEIGHT=3000,HP=200)] = mu0(xx) = 20 mpg => xx = (1,3000,200)^T t = (yhat(xx) - mu0(xx))/stderr.est(yhat(xx)) Reject at ~5% significance when |t|>2. + Confidence bands for regression lines: In simple linear regression with one predictor x, calculate stderr.est(xx) = stderr.est(x), plot the curves yhat(x)-t*stderr.est(x), yhat(x)+t*stderr.est(x) ==> confidence band for the line Issue: simultaneity -- coverage is only pointwise, not simultaneous... - Prediction intervals based on yhat(xx): . Let y(xx) be the random variable of FUTURE response values at xx whereas yhat(xx) is obtained from PAST data. . Assume the model is correct, wrt first and second moments and normality for past and future data. . Wanted: an interval around yhat(xx) that has a 95% chance of catching y(xx) . Approach: Find the distribution of y(xx)-yhat(xx). E[ y(xx) - yhat(xx) ] = 0 V[ y(xx) - yhat(xx) ] = V [ y(xx) ] + V[ yhat(xx) ] [Why???] = sigma^2 + stderr(yhat(xx))^2 = sigma^2 + sigma^2*(xx^T (X^T X)^{-1} xx) = sigma^2 * (1 + xx^T (X^T X)^{-1} xx) Assuming past and future response values are normally distributed: y(xx) - yhat(xx) t = -------------------------------- has a t-distribution with df=n-p-1 s*sqrt(1 + xx^T (X^T X)^{-1} xx) Reasons: The Gaussian assumption AND the numerator is independent of the denominator because - s is computed from the residuals of past data y - yhat(xx) is computed from the projection of y onto the X-space - y(xx) is future data at xx => s is independent of yhat(xx) and y(xx) (recall: V[r,yhat]=0 => under normality r and yhat are independent) . PI(xx) = [ yhat(xx) +- t*s*sqrt( 1 + xx^T (X^T X)^{-1} xx ) ] ================================================================ LECTURE 13, 2011/10/24: * ORG: - Writing Homework 1 was due today 3pm. - HW 5 to be posted soon (... huh!) ---------------- * ROADMAP: 'on the road to reality' - Recap of RIs, CIs, PIs for the response - R formula language - Interpretation of linear models: two things to remember forever... . The 'Estimated Average' clause!! . You NEVER CHANGE a predictor!! . The 'Ceteris Paribus' clause!! - MBA statistics: . log-models . Total Cost vs Average Cost models ---------------- * RECAP: - The loose end: t^2 = F Draw a 3D-picture of the crucial geometric fact. Use one dimension each for xj, all other xk''s, residuals. Explain to yourself: (H - H.no.j) y = bj*xj.adj - Response RIs, CIs, PIs: . H0: E[y(xx)]=mu0(xx) . Test stat: t = (yhat(xx) - mu0(xx))/SE(numerator) . Denominator of t-stat: V[yhat(xx)] ~ (X^T X)^{-1}-based distance^2 of xx from origin = ... xx^T (X^T X)^{-1} xx sigma^2 SE.est = ... sqrt( xx^T (X^T X)^{-1} xx) s . RI = [... - ..., ... + ...] [mu0 - t(alpha)*SE.est, mu0 + t(alpha)*SE.est] . CI = [... - ..., ... + ...] [yhat(xx) - t(alpha)*SE.est, yhat(xx) + t(alpha)*SE.est] . Predicting future y-values: yhat(xx) +- ... . Need a t-stat: t = (yfut(xx) - yhat(xx))/SD(numerator) . Denominator of t-stat: V[yfut(xx) - yhat(xx)] ~ 1 + (X^T X)^{-1}-based distance^2 of xx from origin = ... SD.est = ... . What critical insight makes this t-distributed? ... independence of future from past, and in the past: space of fits from resids . What model assumptions are needed? ... . PI = [... - ..., ... + ...] ---------------- * R: FORMULA LANGUAGE for modeling, with excursions - Create some toy 'data': N <- 1000 # (Programming: Do not hardcode constants such as this one!) x1 <- round(runif(N)*100) # Continuous Predictor x2 <- round(rnorm(N,m=71,s=3), dig=.5) # Continuous predictor x3 <- factor(sample(c("m","f","c"),size=N, # Character vector; not good for modeling replace=T, prob=c(.6,.3,.1))) # Categorical predictor eps <- rnorm(N,s=1) # 'Error' myd <- data.frame(x1=x1, x2=x2, x3=x3, y= round(10 + .1*x1 + 0*x2 + c("m"=0,"f"=-6,"c"=-10)[x3] + eps, dig=.01) ) - Excursion: Basic steps for getting acquainted with a new dataframe !!!!!!!!!!!!!!! # General overview: str(myd) # Equivalent of 'str()' in small steps: # Dataframe or matrix or what? class(myd) # How large is it? dim(myd) # What are the variable names? names(myd) # Look at the first 6 rows: head(myd) # Look at the last 6 rows: tail(myd) # Basic tabulation of all variables: source("http://stat.wharton.upenn.edu/~buja/STAT-541/tab-all.R") # Once tab.all(myd) # Tabulate all columns for a general overview # Generic scatterplot matrix: plot(myd, pch=16, cex=.8) # Works for 'factors()', not for 'is.character()' # Another scatterplot matrix function with nicer functionality: source("http://stat.wharton.upenn.edu/~buja/STAT-541/plot-plus.R") # Once source("Functions/plot-plus.R") # Once plot.plus(myd) plot.plus(myd, cex=.5, jitter=.3) # Like 'plot()' but (1) it jitters categorical variables, # (2) gives more margin space, and # (3) has nicer defaults for pch, cex. # To see the code and the arguments, say: plot.plus # Comment on plotting of categorical variables: # Categories are mapped to integers in lexicographical order of the labels. - Important note on character data versus factors: . Why factors and not just character data? A: (1) Not all levels might appear in the data (2) You might want to list the levels in a specific way. (3) plot() works for factors but not type character. . read.csv() turns character data by default into factors. If you want to suppress this feature, use the optional argument 'as.is=T'. . factor() sorts character levels alphabetically, which can be a pain: str(x3) but you can specify the levels and their order in 'factor()': x3 <- factor(sample(c("m","f","c"),size=N, # Character vector; not good for modeling replace=T, prob=c(.6,.3,.1)), # Categorical predictor levels=c("m","f","c","u") ) ==> In 'plot()' the axis values 1:4 correspond to c("m","f","c","u"). In regression, "m" will be the base group, and the effects will mean "f"-"m", "c"-"m", "u"-"m". - Formulas to express linear models: # Use 'lm()' to fit a linear model: lm(y ~ x1 + x2 + x3, data=myd) # What is this, though, !%@^&*%? # ^^^^^^^^^^^^^^^^ These variables refer to columns in 'myd'. summary(lm(y ~ x1 + x2 + x3, data=myd)) # Much better! mym <- lm(y ~ x1 + x2 + x3, data=myd) # IMPORTANT! ==> Fitted models are 'first class citizens' of the language. They "exist" because they are objects that can be assigned to variables. is.list(mym) # TRUE: Complex objects tend to be lists. summary(mym) # There is usually a 'method' for 'summary()' names(mym) # Print the 'class' of each component of the model: for(i in 1:length(mym)) print(c(names(mym)[i],class(mym[[i]]))) lapply(mym, class) # 'list apply' to loop over a list - 'Model formulas' are "first class citizens", too: myf <- y ~ x1 + x2 # y = beta0 + beta1*x1 + beta2*x2 + eps myf summary(lm(myf, data=myd)) - "Operator overloading" in 'model formulas': new meanings for +,-,1,*,/,^ to specify model predictors. '+' means 'include' '*' means 'include with interactions' ':' means 'include these specific interactions' '^' means 'include hierarchical ANOVA terms to power ...' '.' means 'include all predictors' (all variables other than the response) '-' means 'exclude following predictor' '1' means 'intercept' - Prevent overloading: I(...), e.g., I(x1*x2) - Linear models with interactions: # y = beta0 + beta1*x1 + beta2*x2 + beta12*x1*x2 + eps summary(lm(y ~ x1 * x2, data=myd)) summary(lm(y ~ x1 + x2 + x1:x2, data=myd)) # dito summary(lm(y ~ x1 + x2 + I(x1 * x2), data=myd)) # dito - Categorical variables also (e.g., x3 in myd) summary(lm(y ~ x3 -1, data=myd)) # Uses the alphabetically first category name as the reference group. # y = beta0 + beta.b*1[b] + beta.c*1[c] + eps : summary(lm(y ~ x1 + x3, data=myd)) # model?... - Interaction of continuous and categorical predictors: summary(lm(y ~ x1 * x3, data=myd)) # Sort the levels of x3 as desired: myd3 <- cbind(myd[,-3],x3=factor(myd[,3],levels=c("m","f","c","u"))) summary(lm(y ~ x1 * x3, data=myd3)) # (Note that zero-count levels are ignored.) Interpret: . Intercept: ... for males . Slope of x1: ... for males . Additive effects of x3: ... intercept corrections for group . Interactions of x1 and x3: ... slope corrections for group ---------------- * INTERPRETATION OF ESTIMATES AND TESTS: - Traps: . Causality: "When x1 is increased by one unit,..." . Observations versus averages: "... y increases by..." . Estimates vs parameters: averages ==> Observational data, where x1 cannot be manipulated, and where y is noisy ==> There are only averages, but even these averages are not "known", only estimated... ----------------------------------------------- - Rule 1: | You never 'change', 'increase' a predictor! | ----------------------------------------------- ----------------- Instead: You associate | 'DIFFERENCES' | in x and y. ----------------- - Rule 2: Intercepts, slopes and yhat-values do not refer to actual y-values. ------------------------ Instead: They refer to | 'Estimated Averages' | of y-values. ------------------------ - Rule 3: In multiple regression, each slope refers to differences in xj and y at fixed values of all other xk. (One tends to say 'holding all other predictors fixed', which sounds like experimental manipulation, but we only mean to mathematically fix them to compare yhat(x1,...,x{j-1},xj+1,x{j+1},...,xp) with yhat(x1,...,x{j-1},xj ,x{j+1},...,xp) ^^^^ --------------------------------------------- Best: Differences ... | 'at fixed values of all other predictors' | --------------------------------------------- Or get used to the latin clause common in econometrics: -------------------------- Differences ... | 'ceteris paribus' | -------------------------- | 'all else being equal' | -------------------------- - Summary proposals for formulating slopes: . "bj = est. ave. difference in y for a unit difference in xj when all other predictors are the constant." . "bj = est. ave. difference in y for two cases that ONLY differ by a unit in xj." . Wagging the finger one more time: "Unit change in xj" and "holding all other predictors fixed" are formulations often used but misleading because of their implication of ACTIVE EXPERIMENTATION. - Interpretation of other regression estimates: . Interpretation of b0? "est. ave. y at x=0" . Interpretation of stderr(b1)? !!!!!!!!!!!!!!!!!!!!!!!!!!! ("est. stdev of b1 varying from dataset to dataset") (caveat: conditioning on x) . Interpretation of tj? ("tj = bj/stderr(bj) = dist of bj from 0 in multiples of stderrs") . Interpretation of p-value(bj)? ("the probability (over datasets, under the assumption beta1=0) of observing a tj more extreme than the observed value") . Interpretation of s=RMSE? (R''s "Resid. stderr" is a bad term, too.) ("est. sdev of the errors") ("est. sdev of the noise") . Interpretation of R Square? ("variation of y accounted for by the model") ================================================================ LECTURE 14, 2011/10/26: * ORG: - HW 5 to be posted - HW 6: real estate project to follow right after ---------------- * ROADMAP: 'on the road to reality' - Recap: .R formula language, models . Interpretation of linear models: three things to remember forever... + The 'Estimated Average' clause!! + You NEVER CHANGE a predictor!! + The 'Ceteris Paribus' clause!! - New: MBA statistics . log-models . Total Cost vs Average Cost models . Diagnostics ---------------- * RECAP AND EXTENSIONS: - Two new 'first class citizens' in the R language: . ... formulas . ... models . These are in addition to dataframes. - What is 'operator overloading' in the formula language? Describe for each formula operation how it is 'overloaded'. . +: include predictors . *: include with all interactions . :: define a new predictor as an interaction of ...:... . -: exclude a certain predictor that would otherwise be included . ^: include all interactions to certain order, applied to (...+...+...) . .: used left of ~ and means to include all other variables as predictors - How do you avoid overloading? ... I() - What is the connection between 'data=myd' and the actual design matrix? Check: formula(mym) # forgot the formula you used in your model? ask it! mym$call # tells you also the modeling function and the dataframe mymm <- model.matrix(mym) # ask for the design matrix X actually used in lm! str(mymm) class(mymm) Interpret what you see in mymm. ... - Study the following help: help(lm) help(formula) help(model.matrix) Follow the 'See Also:' section !!!!!! You can learn a lot by following 'See Also:' in all help(). - Describe the meaning of the regression coeffs . in a model with two numeric predictors and interaction . in a 1-way ANOVA (i.e., 1 factor predictor) . in a model with 2 predictors: 1 factor, 1 numeric, and interaction - 3 Rules for interpreting regression coeffs carefully: . ... never 'change' a predictor (it''s 'differences', ...) . ... 'estimated average' in y . ... 'ceteris paribus' - An additional rule for slopes: . Slope = ... / unit difference in x . If the unit difference is not meaningful, scale to a different and more meaningful difference in x. Examples: + Diamonds: price as a fct of weight in carats, but carat range = [.2,.7] ==> A unit difference amounts to an extrapolation. Make it .1 carats: Slope = ... / 0.1 carats difference + Commercial real estate: rent/sqft ==> 1 sqft is not a realistic difference Make it 1000 or 10,000 sqft: Slope = .../1000 sqft ---------------- * NEW: MBA STATISTICS - Goal: Extending the reach of linear models to typical nonlinear situations found in business. - Variable scales: often dictate a natural transform Give examples for each of the following: . If a variable (x or y) + takes on only positive values, and + changes/differences are expressed in percentages/multiples as opposed to numeric amounts, ==> "RATIO SCALE" variable Ex.: ... income, net worth, bank balances, ... . If a variable + can take on positive and negative values and + changes/differences are expressed in numeric amounts, ==> "INTERVAL SCALE" Ex.: ... . If a variable + is confined to [0,1], ==> "PROBABILITY SCALE" Ex.: ... . If a variable + is confined to non-negative integers, ==> "COUNTING SCALE" Ex.: ... - Notes on ratio scales: . Changes/gains/losses/differences are EQUIVALENTLY expressed as + factor changes: 'doubling', 'tripling', 'halving',... "production is up by a factor 1.3" + percent changes: 'salary is up 5%', 'sales declined 20%', ... "production is up by 30%" . Undergrad trick questions: Translate back and forth between factor changes and percent changes + 'an increase of 100%' means what factor change? + 'the internet grew 140%' means what factor change? + 'tripling income' means what percent change? + 'stocks lost 3%' means what factor change? + 'losing a third' means what factor change? + 'losing a third' means what percentage change? - Logarithms: . Their principal role is to map RATIO SCALES to INTERVAL SCALES multiplicative/factor changes --> additive/amount changes . Linear Approximation of Logarithms: the "small change approximation" log(1+z) ~ z for small z (convention: small means |z|<.1) and NATURAL log !!! Consequence: log(y*(1+z)) = log(y) + log(1+z) ~ log(y) + z ------------------------------------------ | RULE: A difference in log(y) by .01 | | corresponds approximately to | | a difference in y by 1% | ------------------------------------------ - Scale this rule from -10% to +10% - This holds only for the NATURAL LOG!!!! . Caveats about logarithmic transforms of y: Means of logs do not correspond to means of unlogged: mean(log(y)) != log(mean(y)) [Is there an inequality here? ...] [This holds approximately, though, if ...] However: median(log(y)) == log(median(y)) ==> In general, if "ave. est." is interpreted as "median est.", one can arbitrarily transform with monotone functions. (For the normal distribution, mean=median, but not for skewed distributions.) #---------------- # # # - MODELS OF EXPONENTIAL GROWTH/DECAY/APPRECIATION/DEPRECIATION: # . Example from the crazy dotcom years: growth of number of web servers URL <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/web-servers.dat" webs <- read.table(URL, header=T) # space, not comma, delimited data webs <- read.table("data/web-servers.dat", header=T) # Place on website! str(webs) # 'data.frame': 100 obs. of 7 variables: # $ Web.Servers : int 1 10 50 130 204 228 623 2738 10022 23500 ... # $ Log.Web.Servers : num 0 2.3 3.91 4.87 5.32 ... # $ Yrs.since.Jan.97 : num -6 -5 -4 -3.5 -3.25 -3.17 -3 -2.5 -2 -1.5 ... # $ Time : num 1991 1992 1993 1994 1994 ... # $ Year : int 1990 1991 1992 1993 1993 1993 1993 1994 1994 1995 ... # $ Month : int 12 12 12 6 9 10 12 6 12 6 ... # $ Time.of.Exp.Growth: int 0 0 0 0 0 0 0 0 0 0 ... plot(webs[,c("Time","Web.Servers")], type="l", col="blue") points(webs[,c("Time","Web.Servers")],pch=16, cex=.5) # This is not all the way exponential, but it might be piecewise. # Extract the span in which exponential growth was supposed # to have happened, 1997 - end of 2000 webs.red <- webs[as.logical(webs[,"Time.of.Exp.Growth"]), c("Yrs.since.Jan.97","Web.Servers")] colnames(webs.red) <- c("year","nserv") plot(webs.red, type="b", pch=16, cex=.5) # Zero is Jan 1997. # Exponential growth calls for logging the response: plot(webs.red, type="b", log="y", pch=16, cex=.5) # Or: plot(webs.red[,1], log10(webs.red[,2]), ylab="Log10", type="b", xlab="yr", pch=16, cex=.5) # So we model log(y) ~ b0 + b1 * x + error abline(lm(log10(nserv) ~ year, data=webs.red), lwd=2) # # . Meaning: Multiplicative Model summary(lm(log(nserv) ~ year, data=webs.red)) ---- Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 13.408146 0.021395 626.7 <2e-16 *** year 0.900178 0.008834 101.9 <2e-16 *** ---- # Translation of an additive model on the log-scale # to a multiplicative model on the original scale: # log(nserv) ~ 13.4 + 0.9 * year + r # nserv ~ exp( 13.4 ) * exp( 0.9 * year ) * exp( r ) # ^^^^^^^^^^^ exp( 0.9 )^year * " # 660000 * 2.46^year * " plot(webs.red, pch=16, xlab="yr", ylab="nserv") lines(x=webs.red[,1], y=660000*2.46^webs.red[,1]) # Not perfect but adequate for now because we are more # interested in interpretation. # # . Interpretation: We start with 660,000 servers in Jan 1997. # From there on we have a yearly increase # by a factor of 2.46, or 146%. # # . Multiplicative errors: Make sense when error # is proportional to size of y(x) (or E[y(x)] = mu(x) really). # (The size of mice varies differently from the size of elephants.) # [Another way to accommodate size-dependent error would be # with direct modeling of the heteroscedasticity: # y = b0 + b1*x + eps, where eps ~ N(0,x^2*sigma^2) (e.g.) # But this does not limit y to positive values, # as do multiplicative errors.] # # . Using linear approximation for log(y)-models: # log(yhat) = b0 + b1*x # difference in x by dx # => difference in log(yhat) by b1*dx # => "difference" in yhat by b1*dx*100% if |b1*dx|<0.1 # # Web server data: # # b1 = .9 = log-increase in webservers per year # ==> dx = 1yr is much too large for linear approximation. # # Try dx = 1 month: b1*dx = b1/12 = 0.075 # ==> Small enough for linear approximation # b1/12*100 = 7.5% = (est. ave.) percentage increase per month # [Compounded this should amount to a 146% yearly increase, # but here we see the approximation error: # This monthly rate compounds to 1.075^12 [1] 2.381780 # instead of 2.46. The more exact rate would be 2.46^(1/12) [1] 1.077899 # or closer to a 7.8% increase per month # Still, linear approximation for log-differences |z|<.1 are commonly used. # # #---------------- # # # - MODELS OF LOGARITHMICALLY DIMINISHING RETURNS # # Example: Wine display space; the more space, the more sales, # but: going from 1 ft to 2 ft has probably a greater effect # than going from 9 ft to 10 ft URL <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/wine.dat" wine <- read.table(URL, header=T) str(wine) # 'data.frame': 47 obs. of 2 variables: # $ display.feet: int 1 6 7 4 6 7 5 2 5 7 ... # $ sales : num 29 358 382 280 400 ... # Parallel sort according to ft for plotting curves in order of predictor: wine <- wine[order(wine[,1]),] # Response: number of bottles sold summary(wine) # plot(wine[,1], wine[,2], pch=16, xlab="Display ft", ylab="Wine Sales") # # Idea: There may not be a constant increase in sales per unit increase # but per percentage increase in display space. # ==> take log(x). plot(log(wine[,1]), wine[,2], pch=16, xlab="log Display ft", ylab="Wine Sales") # Looks nearly ok. Ok enought to try a linear model: wine.lm <- lm(sales ~ log(display.feet), data=wine) abline(wine.lm) # # Meaning of the line: summary(wine.lm) # ---- # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 83.560 14.413 5.797 6.24e-07 *** # log(display.feet) 138.621 9.834 14.096 < 2e-16 *** # ---- # # Fitted Equation: (est. ave.) sales ~ 140 * log(display.feet) + 84 # plot(wine, pch=16, xlab="log Display ft", ylab="Wine Sales") lines(x=wine[,1], y=84+139*log(wine[,1])) # # Magic of logs again: # 10% difference in display.feet # => (est. ave.) difference in sales is about 140 * 0.1 = 14 bottles # (add a note about doubling or adding 50% ....) # # General: fit y ~ b0 + b1*log(x) # 10% difference in x => ~b1/10 difference in yhat. (est. ave.) # 1% difference in x => ~b1/100 difference in yhat. ( " ) # Outside the # 50% difference in x => ~b1*0.4 difference in yhat. ( " ) # # # #---------------- # # # 3) MODELS OF CONSTANT ELASTICITY: # # Example: Fedex has(had) a delivery service called "CourierPak". # At different prices, different numbers of CourierPak deliveries # were requested by customers. URL <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/Courier.dat" courier <- read.table(URL, header=T) str(courier) # $ Price : num 10.76 10.42 10.06 9.75 9.44 ... # $ Volume: int 1616 1321 1472 1538 1477 1460 1282 1249 1242 1245 ... plot(courier, pch=16) # One could fit a linear model for starters: courier.lm <- lm(Volume ~ Price, courier) summary(courier.lm) abline(courier.lm) # Looks quite good, but one doesn't like straight lines for a couple # of reasons: # 1) Lines cross into negatives, which must not happen (Vol>0). # The minimum volume is zero. # 2) More importantly, by economic theory it is not a unit increase # in Price that is associated with a constant decrease in Volume. # It is a PERCENTAGE increase in Price that causes a PERCENTAGE decrease # in volume. # courier.log.lm <- lm(log(Volume) ~ log(Price), courier) summary(courier.log.lm) # ---- # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.04930 0.20145 39.957 < 2e-16 *** # log(Price) -0.32732 0.07961 -4.112 0.000397 *** # ---- # The raw Price->Volume fit and the log(Price)->Log(Volume) perform # about the same, but the world prefers the latter. Interpretation: # # log(Volume) ~ 8 - 1/3*log(Price) + r # # . Constant elasticity as a power law: # Volume = exp(8 - 1/3*log(Price + r) # = exp(8) * Price^{-1/3} * exp(r) # = 2981 * Price^{-1/3} * exp(r) # In general: log(y) = b0 + b1*log(x) + r # y = exp(b0) * x^b1 * exp(r) # exp(b0) = est.median baseline at x=1 # # . Small change approximation: # Price difference of 1% ~ Volume difference of ... ? # Price difference of 1% # => log(Price) difference of about 0.01 # => log(Volume) difference of about (-1/3)*0.01 (|.|<0.1) # => Volume difference of about -1/3 % # # In general: 1% difference in x ~ difference in yhat of b1 % # This is called "constant price elasticity". # # (Assumes |b1|<0.1. If this is not the case, choose a smaller dx: # dx=dPrice=0.001 => dy=dVolume=b1*0.1% .) # # ---------------- * FIXED COSTS/VARIABLE COSTS: - Example: A manufacturer of 'blocks' gets orders of certain quantities and production costs. It is clear that total production cost is mostly driven by quantity. But the manufacturer would like to know what other factors make some orders more and others less expensive. A simple approach would be lm(Tot.Cost ~ Quantity + other factors) ==> Tot.Cost ~=~ b0 + b1*Quantity + ... + error Interpretation: - b0: FIXED COST for any order (usually setup cost) - b1: VARIABLE COST, or average production cost per unit - The error is constant for all sizes of orders. (How realistic is that?) One could argue, though, that whatever factors there are, they are more likely to act at the per-unit level: amount of material per block, difficulty of material, extra features ordered for the blocks, ... Their cost effects might multiply up with the number of units ordered. This suggests modeling the so-called "AVERAGE COST": ---------------------------------- | Ave.Cost = Tot.Cost / Quantity | ---------------------------------- as opposed to the Total Cost: lm(Ave.Cost ~ other factors) Total cost would then be obtained as Tot.Cost = Ave.Cost * Quantity But there is still a problem: The model might not contain all the factors that make up unit cost. For one thing, we should have a way to have fixed cost in the model as well. This seems like a drawback of choosing Average Cost as response. The problem can be fixed, though: introduce an additional predictor X = 1/Quantity. See what happens: lm(Ave.Cost ~ 1/Quantity + other factors) ==> Ave.Cost ~ b0 + b1/Quantity + ... + error A prediction equation for Tot.Cost is obtained by multiplying up with Quantity: ==> Tot.Cost ~ b0*Quantity + b1 + ...*Quantity + error*Quantity Interpretation: - b0: VARIABLE COST - b1: FIXED COST - The error increases with the size of an order. # Hence both models have interpretations of b0 and b1 as fixed and variable costs, in reverse order. But they differ in their error structure, and in how other factors affect the Total Cost of the order. # ================================================================ LECTURE 15, 2011/10/31: * ORG: - HW 5 is posted, due Monday, Nov 7, 2011, 3pm - HW 6: real estate project to be posted, due Monday Nov 16 ---------------- * ROADMAP: 'still on the road to reality' - Recap: MBA stats . measurement scales . log-models . Tot Cost vs Ave Cost models - New: . Conclusion of Tot Cost vs Ave Cost models . Protocol for stat consulting reports . Regression diagnostics ---------------- * RECAP and ELABORATIONS: - Measurement scales: list the four most prevalent ones and characterize them . ... interval scale . ... ratio scale . ... prob scale . ... counting scale . ... binary - Logarithms: . What is the point of logarithms? ... . What is special about ln()? ... . What is special about log10()? ... Ex.: What does a 6-figure salary mean in terms of log10? ... - Percentage changes <-> factor changes . Conversion: ... (make up examples and practice) . What is the beauty of factor changes? ... . FINANCIAL RETURNS -- a special type of %/factor changes: Given an investment with price series p(1), p(2),..., p(t), ... . 'Return' at time period i = % change from (i-1) to i = (price(t) - price(t-1))/(price(t-1) [* 100%] . 'Gross Return' = factor change = price(t)/price(t-1) . Gross Return from period 0 to I = p(I)/p(0) = product of Gross Returns from 1 to I = (p(1)/p(0))*(p(2)/p(1))*...*(p(I)/p(I-1)) - What are the three log-models and their business relevance? . ... . ... . ... - Cost concepts and models: . Explain and relate to each other + 'Fixed Cost': ... + 'Variable Cost': ... + 'Total Cost': ... + 'Average Cost': ... . What are the problems of Total Cost models? ... . What are the complications of Ave Cost models? ... . In the blocks example, discuss when a predictor is on a per-unit or on a per-order basis. Use examples. ... ---------------- * NEW: CONCLUDING TOTAL COST AND AVERAGE COST MODELS # - A data example: Cost is given as Ave.Cost. URL <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/blocks_red.dat" blocks <- read.table(URL, header=T) str(blocks) # 'data.frame': 200 obs. of 2 variables: # $ Ave.Cost: num 92.8 55.5 28.8 34.4 45.9 ... # $ Units : int 1100 150 100 110 140 230 600 180 230 140 ... plot(Ave.Cost ~ Units, data=blocks, pch=16) # Maybe there is a slightly negative slope: lower Ave.Cost # for more Quantity? Economies of scale? # As an aside, here is Tot.Cost, in $1000: plot(Ave.Cost*Units/1000 ~ Units, data=blocks, pch=16) # Certainly doesn't look like a constant error variance! # # There is a problem, though: an extremely small order # visible when plotting against 1/Quantity: plot(x=1/blocks[,"Units"], y=blocks[,"Ave.Cost"], pch=16) # Units = Quantity # Remove: sel <- 1/blocks[,"Units"] < .5 plot(x=1/blocks[sel,"Units"], y=blocks[sel,"Ave.Cost"], pch=16) # # Naive model for Tot.Cost=Ave.Cost*Units: summary(lm(Ave.Cost*Units ~ Units, blocks[sel,])) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4146.470 865.000 4.794 3.22e-06 *** ## Units 22.149 1.658 13.356 < 2e-16 *** # Model for Ave.Cost: summary(lm(Ave.Cost ~ I(1/Units), blocks[sel,])) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 31.082 2.273 13.673 < 2e-16 *** ## I(1/Units) 1502.211 416.821 3.604 0.000397 *** # Typically Tot.Cost models have much higher R^2 # than Ave.Cost models, but this is irrelevant: # The high R^2 stems from the trivial proportionality with Units, # which the Ave.Cost model factors out. # Often, Ave.Cost models are more plausible and # yield comparable predictions when multiplied up to Tot.Cost. # # Btw, the estimates of fixed and variable costs # differ wildly between the two approaches. # Which one would you believe? ---------------- * A PROTOCOL FOR REPORTING REGRESSION FINDINGS: - Goal: Provide a template for a consultant''s report - Target audience: non-statisticians (your manager, VP, CFO, CEO,...) . Principles: avoid technical terms, report relevant subject matter, keep stats and its qualifications at low volume, except when bearing bad news ... (which happens often in practice) - Example: URL <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/CarModels2003-4.TXT" cars <- read.table(URL, header=T, sep=",", na=".") my.cars <- cars[,c("Weight.lbs","Horsepower","MPG.Highway")] - Explain purpose and data: "The goal is to analyze gas mileage of 2003/2004 car models and describe to what extent it is associated with two major model characteristics: weight and horsepower." - Summarize the predictors and response: summary(my.cars) "Car models average about 3,500 lb, 200 HP, and 26 MPG on the highway. Some models are as light as 1,850 lb and as heavy as 6,400 lb. The engines range from 67 to 660 HP, and gas mileage from 12 to 68 MPG." [If it helps, use tabular presentation.] - Pick an imagined case with heavily rounded mean or median values as predictors. Example: apply(my.cars, 2, mean, na.rm=T) ==> "For reference, consider hypothetical car models roughly in the middle of the pack, with weight 3,500 lb and 200 HP." - Fit a model: my.cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=my.cars) and give a reference range (PI) of the response values for the reference models: my.cars.refy <- predict(my.cars.lm, newdata=data.frame(Weight.lbs=3.5, Horsepower=200, MPG.Highway=NA)) my.cars.refy # Closer to 27 than 26 because our reference values are low-balling the means. # Standard Gaussian PI: my.cars.refy + c(-2,2) * summary(my.cars.lm)$sigma # Quantile-based non-parametric PI: my.cars.refy + quantile(resid(my.cars.lm), prob=c(.025,.975)) # (Better: use standardized residuals) ==> "Such reference car models are predicted to range in mileage from as low as 20 MPG to as high as 32 MPG. Thus there is considerable variation in highway mileage even among models with the same weight and same horsepower." - Continue with general comments on the quality of the data and results: summary(my.cars.lm) ==> "This is compatible with the fact that weight and horsepower account for about two thirds (64%) of the variation in MPG. (Weight and horsepower are highly statistically significant as predictors of MPG, but, as is often the case, this does not translate to an equal degree of practical significance.)" - Translate the slopes as meaningfully as possible: "If we compare models that differ by 500 lb in weight but have the same horsepower, the estimated average difference in mileage is -2 MPG, quantifying the obvious fact that heavier cars have worse mileage." "If, on the other hand, we compare models that differ by 10 HP but have the same weight, the estimated average difference in mileage is -0.3 MPG, again quantifying the obvious fact that stronger cars have worse mileage." - Mention parenthetically the uncertainty in the slopes: "(The 95% margin of error for these numbers can be described by a range -2+-.7 MPG for the 500 lb weight difference and -.3+-.06 MPG for the 10 HP difference.)" - Complications: + Predictors can be ratio scale, hence you may have used log(x), hence you need to consider a percent difference in x. + Any other kind of transformation also causes problems. E.g., physics says to model "I(1/MPG.Highway)", which would require additional non-trivial translations of statement about differences, possibly replacing means with medians, and translating endpoints of ranges. + Another level of complications arises from interactions and the need to express them in intuitive ways. See the real estate homework for examples with some guidance. ---------------- * DIAGNOSTICS FOR LACK OF FIT - Three types of diagnostics: --------------------- | . lack of fit | | . case influence | | . identifiability | --------------------- MEMORIZE THEM As stated, they are universal in statistical modeling!!! This section is about LACK OF FIT - What is 'identifiability' called in linear regression? ... collinearity !!!! - Potential lack-of-fit problems in linear models: (1) ... E[y] != X beta for any beta (2) ... V[y] != sigma^2 I_N for any sigma (3) ... y !~ N(...) => Need the following: . Detection: diagnostics => graphical methods, combined with inferential ideas . Fixes: better models (may not be possible) - Fixes: . If one can assume that all relevant predictors are at hand, and yet the model does not fit, try + TRANSFORMATIONS and + adding more FUNCTIONAL TERMS such as interactions. . TRANSFORMATION: Box-Cox family of power transformas when y>0 Box & Cox proposed a power family that contains the logarithm as a special case: box.cox <- function(y, pow=1) if(pow != 0) return( (y^pow - 1)/pow ) else return( log(y) ) Illustration: y <- seq(0,3,length=201) plot(x=range(y), y=c(-3,1.5), type="n", xlab="y", ylab="box.cox(y)") abline(h=0, col="gray90"); abline(v=1, col="gray90") for(pow in seq(-2,2,by=.2)) { lines(y, box.cox(y,pow), col="gray50") points(0, box.cox(0,pow), pch=16, cex=.3, col="gray50") } pow <--1; lines(y, box.cox(y,pow), col="green", lwd=2) pow <- 0; lines(y, box.cox(y,pow), col="red", lwd=2) pow <- 1; lines(y, box.cox(y,pow), col="blue", lwd=2) pow <- 2; lines(y, box.cox(y,pow), col="orange", lwd=2) legend("bottomright", paste("pow=",(-1):2,sep=""), col=c("green","red","blue","orange"), lwd=2, merge=T) Properties: + ascending for all powers, including negative powers!!! + convex for pow>1, concave for pow<1 + f(1)=0 for all powers, just as for the log Ex.: Pick the transformation of y that yields the highest R2 Data example: Boston housing data (stay tuned for background) site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/" boston <- read.table(paste(site,"boston.dat",sep=""), header=T) rownames(boston) <- paste(scan(paste(site,"boston.geography",sep=""), w=""), 1:nrow(boston)) str(boston) ## 'data.frame': 506 obs. of 14 variables: ## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ... ## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ... ## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ... ## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ... ## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ... ## $ RM : num 6.58 6.42 7.18 7 7.15 ... ## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ... ## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ... ## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ... ## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ... ## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ... ## $ B : num 397 397 393 395 397 ... ## $ LSTAT : num 4.98 9.14 4.03 2.94 5.33 ... ## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ... Try a collection of Box-Cox parameters for(pow in c(2.0,1.0, 0.5, 0.0, -0.5, -1.0, -2.0)) cat("power =",pow,":\n R^2 =", summary(lm(box.cox(MEDV, pow) ~ ., data=boston))["r.squared"][[1]],"\n") Winner: pow=0 (but by how much?) (Could be used for transformations of predictors also: Box-Tiao) . The ultimate hammer for transformations: the ACE algorithm (Breiman-Friedman 1985, JASA) Stay tuned! . Non-linear model terms: When it is likely that y = f(x1,x2,...) + eps with f(...) nonlinear, try: + interactions: y ~ x1 * x2 = y ~ x1 + x2 + x1:x2 => b1*x1 + b2*x2 + b12*x1*x2 = b1*x1 + (b2 + b12*x1)*x2 = b2*x2 + (b1 + b12*x2)*x1 the "effect" of x1 on y depends on the level of x2 + Polynomial terms: y ~ x + I(x^2) amounts to a quadratic transformation of the predictor but with free parameters: y = b0 + b1*x + b2*x^2 + r (b2 is the "curvature parameter" for the transformation of x.) ================================================================ LECTURE 16, 2011/11/02: * ORG: - HW 5 is posted, due Monday, Nov 7, 2011, 3pm - HW 6: real estate project is posted, due Monday, Nov 14, 3pm ---------------- * ROADMAP: Diagnostics - Recap: . Cost models . Protocol for an expert report using regression . Diagnostics - New: . Diagnostics, diagnostics, diagnostics... ---------------- * RECAP and ELABORATIONS: - Cost models: . List the four types of costs . How are they linked to each other? . Discuss two types of cost models . Discuss their respective flaws and oddities - Protocol for an expert report using regression: . List steps of the report . Give tips for intuively conveying the messages of the analysis - Diagnostics: . What are the three universal types of diagnostics for statistical models? + ... + ... + ... . What is the special purpose name of one of the diagnostics in linear regression? ... . Methods for improving fit: + transforming ... + ... + ... . Reconstruct the Box-Cox family of transformations and its properties ... ---------------- * R STANDARD DIAGNOSTICS: for lack of fit AND case influence - R diagnostics plots: help(plot.lm) boston.lm <- lm(MEDV ~ ., data=boston) par(mfrow=c(2,2)) plot(boston.lm) Refined: par(mfrow=c(2,2), mgp=c(1.8,.5,0), mar=c(3,3,2.5,2)) plot(boston.lm, pch=16, cex=.5, cex.axis=.8, cex.lab=.8) + Residuals vs fitted values: to see violations of ... + Normal q-q plot of stdized residuals: to see violations of ... + S-L plot of sqrt(|stdized residuals|): to see violations of ... + Stdized residuals vs leverage values P_{ii}: to see ... outlyingness in predictor space and case influence - Missing diagnostics plots in R: + Residuals versus order: to see "lurking" variables (such as: time, space,...) par(mfrow=c(1,1), mgp=c(1.8,.5,0), mar=c(3,3,2.5,2)) plot(sample(resid(boston.lm)), xlab="order", ylab="residuals", type="l", pch=16, cex=.5, cex.axis=.8, cex.lab=.8); abline(h=0, col="gray50") lines(resid(boston.lm), col="gray"); points(resid(boston.lm), cex=.5, pch=16) + How would you assess whether the plot is random or not? ... - Leverage plots: plot of residuals versus adjusted predictors to see the influence of cases on ... source(paste(site,"leverage-plots.R",sep="")) leverage.plots(boston.lm, cex=.5) (Unlike JMP, this routine does not add the means back into the adjusted variables; this is why these plots are zero-centered.) leverage.plots(boston.lm) leverage.plots <- function(model=NULL, X=NULL, y=NULL, extra=.1, cex=.5, cex.ident=.7, cex.axis=.8, cex.lab=.8, win.ncol=5, frame.sz=3.0, ident=F, label.sel=NULL,...) { y.lab <- "Y" if(!is.null(model)) { X <- model.matrix(model)[,-1] # assume intercept y <- model.frame(model)[,1] y.lab <- colnames(model.frame(model))[[1]] } else { if(is.null(X) | is.null(y)) stop("function 'leverage.plots': if no model is given, give both X and y") } row.lab <- rownames(X) npred <- ncol(X) win.ncol <- min(npred, win.ncol) win.nrow <- ceiling(npred/win.ncol) windows(width=frame.sz*(win.ncol), height=frame.sz*(win.nrow)) par(mfrow=c(win.nrow,win.ncol), mgp=c(1.5,0.5,0), mar=c(3,3,1,1)) for(i in 1:npred) { xi.adj <- resid(lm(X[,i] ~ X[,-i])) y.adj <- resid(lm(y ~ X[,-i])) x.rg <- range(xi.adj); xlim <- x.rg + c(-1,1)*diff(x.rg)*extra y.rg <- range(y.adj); ylim <- y.rg + c(-1,1)*diff(y.rg)*extra plot(x=xi.adj, y=y.adj, pch=16, xlim=xlim, ylim=ylim, xlab=paste(colnames(X)[i],"(adj)"), ylab=y.lab, cex=cex, ...) abline(lm(y.adj ~ xi.adj), col="green") lines(supsmu(y=y.adj, x=xi.adj), col="red") if(!is.null(label.sel)) text(x=xi.adj[label.sel], y=y.adj[label.sel], lab=row.lab[label.sel]) if(i %in% ident) { cat("Identify by clicking near points; stop with R-click.\n") identify(x=xi.adj, y=y.adj, lab=row.lab, cex=cex.ident) } } } ---------------- * ROADMAP FOR INFERENTIAL IDEAS FOR LACK-OF-FIT DIAGNOSTICS: - Perennial question: What would residual plots look like if the model were true? ==> We need inference for lack-of-fit diagnostics plots. - Idea 1 -- simple and non-technical: "Parametric bootstrap" or "plug-in" Not an exact theory but universally applicable. - Idea 2 -- beautiful but technical: Conditioning on a "minimal sufficient statistic" Exact theory but not universally applicable. - In both ideas we fake (simulate) datasets as they should look if the model were true. - Some conceptual strangeness: . In diagnostics for lack of fit, the lack of fit concerns the large model. ==> The large model is the null hypothesis H0 !!! . The alternative H1 is any model feature that falls outside the large model, such as an additional predictor, a non-linearity, non-normality,... - More strangeness: "Bootstrap" is usually used in its non-parametric form for CIs, not in the parametric form for diagnostics !!! Hence we encounter Bootstrap in a non-standard form and application. ---------------- * INFERENCE FOR DIAGNOSTICS PLOTS: - General idea: Q: How can we get an idea whether data that follow the fitted model look anything like the observed data? A: Create datasets from estimated model ------------------------ | "parametric bootstrap" | ------------------------ ----------- | "plug-in" | ----------- Caveat: We really want to diagnose 'errors', not residuals, and for the true model, not the estimated model. Plugin/parametric bootstrap, however, is the next best. - Parametric bootstrap for linear model: ----------------------- | y* ~ N(X b, s^2 I) | ----------------------- - Execution: boston.lm <- lm(MEDV ~ ., data=boston) r <- resid(boston.lm) # Actual residual vector RMSE <- function(model) summary(model)$sigma X <- model.matrix(boston.lm) # Fixed known numbers; same across simulations yhat <- fitted(boston.lm) # = X b = estimate of mu=X beta s <- RMSE(boston.lm) # = estimate of sigma N <- nrow(X) rlim <- c(-1,1)*max(abs(r)) # Fix residual range for plotting # Bootstrap simulation: y.boot <- rnorm(N, m=yhat, s=s) # A simulated response vector boston.boot.lm <- lm(y.boot ~ X) r.boot <- resid(boston.boot.lm) # A simulated residual vector yhat.boot <- y.boot - r.boot # For usual residual plot s.boot <- RMSE(boston.boot.lm) # To standardize the residuals # Prepare a pair of side-by-side plots windows(wid=10, hei=5) par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,2,1)) # Compare residual plots: plot(yhat, r, main="Actual Data", pch=16, cex=.5, ylim=rlim); abline(h=0) plot(yhat.boot, r.boot, main="Bootstrap Data", pch=16, cex=.5, ylim=rlim); abline(h=0) # Compare normal quantile plots of the residuals: qqnorm(r/s, main="Actual Data", pch=16, cex=.5, ylim=c(-3,3)); abline(a=0,b=1) qqnorm(r.boot/s.boot, main="Bootstrap Data", pch=16, cex=.5, ylim=c(-3,3)); abline(a=0,b=1) # Compare plot of residuals versus order: plot(r, main="Actual Data", pch=16, cex=.5, ylim=rlim); abline(h=0) plot(r.boot, main="Bootstrap Data", pch=16, cex=.5, ylim=rlim); abline(h=0) - Standard R diagnostics for bootstrapped data: # BS comparison of standard R diagnostics: windows(); par(mfrow=c(2,2)) plot(boston.lm, pch=16, cex=.6) # Actual windows(); par(mfrow=c(2,2)) y.boot <- rnorm(N, m=yhat, s=s) # A simulated response vector boston.boot.lm <- lm(y.boot ~ X) plot(boston.boot.lm, pch=16, cex=.6) # Bootstrapped/plugged-in # (Animate the above.) - But is this statistical inference yet? Not in the strict sense, but it can be turned into formal tests based on ANY statistic computed from residuals. Example: T(r) = max(r) ('worst' upper residual) T(r) = min(r) ('worst' lower residual) T(r) = 95% percentile of residuals quantile(r,p=.95) T(r) = mean(r^3)/mean(r^2)^(3/2) (~ skewness) T(r) = mean(r^4)/mean(r^2)^2 (~ kurtosis to measure tail weight) T(r) = mean(r*yhat^2) (parabolic curvature in r vs yhat) ... Bootstrap simulations provide an approximate null distribution for H0: The fitted model is true. for any of these test statistics T(r). - Generality of parametric bootstrap sampling: The idea can be used . for other types of lack-of-fit checks, and . for any type of 'generative model'. - Caveat: If the model overfits the data, there might still be model defects, but they might not be visible. Example: Fitting a high-degree polynomial to heteroscedastic data. ================================================================ LECTURE 17, 2011/11/07: * ORG: - HW 5 was due today 3pm - HW 6: real estate project due Monday, Nov 14, 3pm Reminder: This is an exercise in interpretation and communication, NOT model selection! Keep in mind: STYLE !!!!!! ---------------- * ROADMAP: - Recap: Diagnostics (contd.) - Two principles for formalizing lack-of-fit model diagnostics. - Collinearity analysis with Principal Component Analysis (not today) - Principal Component Analysis for Dimension Reduction (neither) ---------------- * RECAP AND ELABORATIONS: - Model diagnostics: 3 types . ... lack of fit . ... case influence . ... identifiability (collinearity) - Three types of lack of fit in linear models (elaboration): . ... 1st order assumption: E[y] = X beta (for some beta) . ... 2nd order assumption: V[y] = sigma^2 I . ... distributional assumption: y ~ Normal(...) - Diagnostics for the three types of lack of fit in linear models: . ... 1st order: residual plot -- r vs yhat (nonlinearity in y) . ... leverage plot for predictors one at a time (nonlin in xj) . ... 2nd order: same, but also |r| vs yhat, vs xj . ... distrib: normal Q-Q plot Issue: residuals -- might use standardized residuals r* = r/sqrt(1-H_ii) V[r*] same everywhere = sigma^2 - Fixes that sometimes solve some of these problems: . ... transformation of y (Box-Cox family) . ... same for xj''s (R: + I(xj^2)) . ... add interaction terms to the equation . ... (for later: stay tuned for the ACE algorithm!!!!!!) - Diagnostics for case influence in linear models (2): . influence on slopes: ... leverage plots . influence on own fitted value: ... hat values - Problem we solve next: ----------------------------------------------------- | When does a diagnostic plot indicate lack of fit? | ----------------------------------------------------- . What 'trick' did we use to answer this question? ... parametric bootstrap sample y vectors from the assumed model at beta=b, sigma=s ---------------- * STATISTICAL INFERENCE FOR LACK-OF-FIT DIAGNOSTICS: - Meta-principle for lack-of-fit diagnostics: --------------------------------------------------- | Consider the full model as the null hypothesis. | | Any model violation should be interpreted as | | rejection of this null hypothesis. | --------------------------------------------------- ----------------------------------- - Q: | What are the test statistics??? | ----------------------------------- ... (see below, e.g., max res, min res,...) ------------------------------------ - Q: | What is the null distribution??? | ------------------------------------ ... for the residuals: N(0, sigma^2 (I-H)), where we set sigma=s ------------------------------------------- - Q: | What is a null distribution good for??? | ------------------------------------------- ... plot the null draws of the residuals (e.g.) ------------------------------------------- - Q: | But the null hypothesis is 'composite'! | ------------------------------------------- I.e., for what choices of beta and sigma should we simulate null data? ... parametric bootstrap: ... stay tuned for another way!!!!!! - Principle (1) for generating SINGLE null distributions when the parameter of interested needs to be estimated and there are 'nuisance parameters' (i.e., not of interest)? -------------------------------------------------------- | PARAMETRIC BOOTSTRAP, also called the PLUG-IN METHOD | -------------------------------------------------------- . Principle: worked out for the linear model Model: N(X beta, sigma^2 I) Plug-in: N(X b, s^2 I) i.e., plug in b for beta, s for sigma. . Idea: If the model (=H0) were true, then the data would look very similar to data simulated from the plug-in distribution. . For linear models, this involves sampling y vectors from N(X b, s^2 I) . Q: Is the parametric bootstrap distribution an exact null distribution? ... ---------------- * SIMULATING AN APPROXIMATE NULL DISTRIBUTION - Two uses of null distributions: . GRAPHICAL USE -- plot artificial/simulated/synthetic datasets to compare with a plot of the actual/observed dataset --> 'lineup' method of graphical inference . NUMERICAL USE: Choose a test statistic, evaluate it on many artificial/simulated/synthetic datasets, and check where the observed value of the test statistic falls among the simulated values. - EXAMPLE OF GRAPHICAL USE: Plots are the substitutes for test statistics. Last lecture: Make comparison plots to learn what 'good data' would look like. boston.lm <- lm(MEDV ~ ., data=boston) r <- resid(boston.lm) # Actual residual vector RMSE <- function(model) summary(model)$sigma X <- model.matrix(boston.lm) # Fixed known numbers; same across simulations yhat <- fitted(boston.lm) # = X b = estimate of mu=X beta s <- RMSE(boston.lm) # = estimate of sigma N <- nrow(X) rlim <- c(-1,1)*max(abs(r)) # Fix residual range for plotting # Bootstrap simulation: y.boot <- rnorm(N, m=yhat, s=s) # A simulated response vector boston.boot.lm <- lm(y.boot ~ X) r.boot <- resid(boston.boot.lm) # A simulated residual vector yhat.boot <- y.boot - r.boot # For usual residual plot s.boot <- RMSE(boston.boot.lm) # To standardize the residuals # --------------------------------------------------------------- # Version 1: | Compare normal quantile plots of the residuals side by side | # --------------------------------------------------------------- windows(wid=10, hei=5); par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,2,1)) qqnorm(r/s, main="Actual Data", pch=16, cex=.5, ylim=c(-3,3)); abline(a=0,b=1) qqnorm(r.boot/s.boot, main="Bootstrap Data", pch=16, cex=.5, ylim=c(-3,3)); abline(a=0,b=1) # ---------------------------------------------------------------------------- # Version 2: | Spaghetti plot -- multiple bootstrap simulations to create a 'null band' | # ---------------------------------------------------------------------------- windows(wid=8, hei=8); par(mgp=c(1.8,.5,0), mar=c(3,3,2,1)) xy <- qqnorm(sort(r), type="n") for(iboot in 1:100) # Plot multiple 'good' qq-traces (Apologies -- efficient but unreadable code!) lines(qqnorm(sort(resid(lm(rnorm(N, m=yhat, s=s) ~ X))), plot.it=F), col="gray", lwd=2) points(xy, cex=.5, pch=16); lines(xy, lwd=2) # Need to plot the actual data last # Conceptual implications: # . How many different (!) statisical tests are being performed? # ... N=506 (each residual can cause rejection) # . What statistical significance can you infer from the exercise? # ... don't know... - EXAMPLE OF NUMERICAL USE: Statistical testing by simulating H0 . We start by making self-contained code: boston.lm <- lm(MEDV ~ ., data=boston) r <- resid(boston.lm) # Observed residual vector yhat <- fitted(boston.lm) # = X b = estimate of mu=X beta RMSE <- function(model) summary(model)$sigma s <- RMSE(boston.lm) # = estimate of sigma N <- nrow(X) r.boot <- resid(lm(rnorm(N,m=yhat,s=s) ~ ., data=boston[,-14])) . Examples of test statistics that could measure lack of fit in the full model: test.stat <- max # ('worst' upper residual) test.stat <- min # ('worst' lower residual) test.stat <- function(r) quantile(r, probs=.95) # 95% percentile of residuals test.stat <- function(r) mean(r^3)/mean(r^2)^(3/2) # skewness test.stat <- function(r) mean(r^4)/mean(r^2)^2 # kurtosis to measure tail weight xdiag <- resid(lm(yhat^2 ~ X)) # (Preparation for next line; not quite kosher...) test.stat <- function(r) sum(r*xdiag)/sum(xdiag^2) # parabolic curvature in r vs yhat - Bootstrap simulation to generate an approximate null distribution for any of the above test statistics: N.boot <- 999 test.stat.null <- rep(NA, N.boot) for(i.boot in 1:N.boot) { y.boot <- rnorm(N,m=yhat,s=s) m.boot <- lm(y.boot ~ ., data=boston[,-14]) r.boot <- resid(m.boot) test.stat.null[i.boot] <- test.stat(r.boot) if(i.boot%%100==0) cat(i.boot," ") } # Approximate p-value: test.stat.obs <- test.stat(r) (1+sum(test.stat.null>test.stat.obs))/(1+N.boot) # Graphical comparison: much more striking than the p-value... hist(test.stat.null, col="gray", breaks=N.boot/25, xlim=range(c(test.stat.null,test.stat.obs))) abline(v=test.stat.obs, col="red", lwd=2) # mark the observed value of test.stat() - Comment on p-value calculations from simulated null data: . Use an odd number of simulations such as 99, 199, 499, 999, 9999,... . Reason: FENCE POST ISSUE 99 fence posts make how many spacings? Include the outside 'spacings'. . If the actual data is a draw from the null distribution, what is the probability of 'test.stat.obs' into any of the N.boo+1 spacings? ... . What is the range of possible p-values for a given N.boot? ... ---------------- * CONDITIONING ON A MINIMAL SUFFICIENT STATISTIC ('CS-MiSS') - This is the second principle for generating single null distributions from composite null hypotheses. It is not as universal as the parametric bootstrap, but it amounts to an exact theory that underlies much of classical optimality theory for statistical tests. - Purpose: Generating a single null distribution for a composite H0 with nuisance parameters - Sufficiency Theory (see math stat for a full understanding) . Introductory quizz questions: + What is the difference between probability theory and statistics? . Intuition: If we observe N data values, but the model has only one or two parameters, do we really need to know all N data values? I.e., shouldn''t there exist a reduction to one or two summaries (functions of y=(y1,...,yN)^T) that contain all the information about the parameters? . Example 1: yi iid N(mu,1) p(y|mu) = exp(sum((yi-mu)^2)/2) / (2*pi)^(N/2) ~ exp(sum(yi^2)/2 - sum(yi)*mu + N*mu^2/2) S(y) = sum(y) = "1-dim sufficient statistics" . Example 2: yi iid N(mu,sig^2), y=(y1,...,yN)^T p(y|mu,sig) = exp(sum((y-mu)^2)/(2*sig^2)) / (2*pi*sig)^(N/2) # valid R = exp((sum(y^2) - 2*sum(y)*mu + N*mu^2)/(2*sig^2) / (sqrt(2*pi)*sig)^N S(y) = c(sum(y), sum(y^2)) = "2-dim sufficient statistic" . Example 3: y ~ N(X beta, sig^2 I), p(y|beta,sig) = exp(sum((y - X%*% beta)^2)/(2*sig^2)) / (sqrt(2*pi)*sig)^N = exp((sum(y^2) - 2*t(y)%*%X%*%beta + t(beta)%*%t(X)%*%X%*%beta)/ (sqrt(2*pi)*sig)^N S(y) = c(t(X)%*%y, sum(y^2)) = "(p+2)-dim sufficient statistic" . Principle: Write down the density and collect the functions of the data that drive the estimation of the parameters (with ML, e.g.). ================================================================ LECTURE 18, 2011/11/09: * ORG: - HW 6: real estate project due Monday, Nov 14, 3pm Reminder: This is an exercise in interpretation and communication, NOT model selection! Keep in mind: STYLE !!!!!! ---------------- * ROADMAP: - Recap: Inference for Diagnostics (contd.) The 1st principle for formalizing lack-of-fit model diagnostics: ---------------------------------------------- | Parametric Bootstrap or Plug-In Simulation | ---------------------------------------------- - New: The 2nd principle for formalizing lack-of-fit model diagnostics -------------------------------------------------- | Conditioning on a minimal sufficient statistic | -------------------------------------------------- - Collinearity analysis with Principal Component Analysis (maybe today) - Principal Component Analysis for Dimension Reduction (not today) ---------------- * RECAP AND ELABORATIONS: - Principle for formalizing lack-of-fit diagnostics as statistical tests: ... think of the FULL model as H0 - Contrast the usual testing situations (e.g., testing slopes betaj=0) with lack-of-fit diagnostic tests. ... H0 is a submodel of the full model - What problem does the parametric bootstrap solve in view of the model N(X beta, sigma^2 I)? ... it reduces the composite H0 to a single distribution - Describe uses of parametric bootstrap for graphical diagnostics. ... e.g., make residual plots of simulated data from the plug-in distr. - Describe the use of parametric bootstrap for numeric diagnostics. ... generate a null distribution for T=max(residuals), e.g. - When you simulate a null distribution of a test statistic, how do you use it to obtain an approximate p-value? ... ---------------- * 2ND PRINCIPLE FOR REDUCING COMPOSITE TO SINGLE NULL HYPOTHESES: --------------------------------------------------- | Conditioning on a minimal sufficient statistics | --------------------------------------------------- - Reminders and Elaborations: . How are probability theory and statistics different? + Probability theory: concerned with deductive reasoning about one complex distribution [E.g., infinitely many iid variables ==> one product distribution] + Statistics: concerned with inductive reasoning about a family of distributions [I.e.: Statistics teaches how to guess distributions based on data.] . In a multiparameter model (=parametrized family of distributions) tests are usually about certain parameters of interest. ==> All other parameters are called 'nuisance parameters'. . What are the nuisance parameters when you test betaj=0? ... . What are the nuisance parameters in diagnostic tests? ... ALL parameters!!! . To obtain a unique null distribution under a null hypothesis, one must somehow reduce the many distributions in the model under H0 to ONE SINGLE DISTRIBUTION. ==> The problem is: "Elimination of nuisance parameters" . How does parametric bootstrap reduce a composite multiparameter H0 to a single distribution that can be simulated? ... plug-in of estimates . The next approach (conditioning on a minimal sufficient statistic) eliminates nuisance parameters in a different and more technical manner. Recall: End of last class we motivated the notion of sufficiency. . General intuition about the notion of sufficiency: If we estimate 5 parameters, say, there should exist a reduction of the N-dim data to 5-dim, that is, to 5 summary statistics. . To figure this out, you need to look at the densities: In general, look for summary statistics implicit in the model densities that 'connect' with the parameters of the model. . Example 1: yi ~ N(mu,1) ==> p(y1,...,yN) ~ exp(((y1-mu)^2+...+(yN-mu)^2)/2) = exp(((sum yi^2) - 2mu(sum yi) + N*mu^2)/2) Only (sum yi) 'connects' with mu: ^^^^^^^^^^ . "Under the hood": the Neyman criterion p(y|theta) = f(y)*g(S(y)|theta)*h(theta) <==> S(y) is sufficient . One more ingredient: If there exists one suff statistic, many more exist. We need to choose the one with minimal dimension, that is, the one that reduces the data the most. Example 1: (sum(y[1:10]),sum(y[11:N])) is also sufficient but not minimal. sum(y) is minimal sufficient (which can be proven) . Some more intuitions: If you need to test theta1 vs theta0, you only need to know S(x). Reason: the Neyman-Pearson lemma p(y|theta1)/p(y|theta0) is the optimal test statistic, and it involves only S(y) because f(y) cancels out. - Observations: + Equivalence of sufficient statistics: If S(y) is sufficient, so is any G(S(y)) if G(s) is 1-1. Example 1: sum(y) or mean(y) Example 2: (sum(y), sum(y^2)) or (mean(y), sd(y)) Example 3: (t(X)%*%y, sum(y^2)) or (b, RMSE) or (yhat, RSS) + Maximal reduction by sufficiency: "minimal sufficient statistics" + Factorization lemma: p(y|theta) = f(S(y)|theta)*h(y)*c(theta) ==> S(y) is sufficient (also called the Neyman criterion) + Heuristic: To discriminate between theta1 and theta0, the N-P Lemma only requires p(y|theta1)/p(y|theta0). S(y) is sufficient if this ratio depends on the data only through S(y) - DEEP FACT: (actually, the definition of sufficiency) ------------------------------------------------------------------ | The conditional distribution of y given a sufficient statistic | | is independent of the parameters. | ----------------------------------------------------------------- . Intuition: Hold the sufficient statistic S(y) fixed. Then ask: + What datasets get mapped to this same value of S(y)? Equiv(y) := { y* | S(y*) = S(y) } + What is the conditional distribution on such a set under p(y|theta)? . Equiv(y): + Example 1: Equiv(y) = { y* | mean(y*)=mean(y) } = mean(y)*e + orthogonal complement of e=(1,...,1)^T + Example 2: Equiv(y) = { y* | mean(y*)=mean(y) & sd(y*)=sd(y) } = intersection of orth complement of 'e' and the sphere centered at mean(y)*e with radius sd(y)*sqrt(n-1) + Example 3: Equiv(y) = { y* | yhat*=yhat and RSS(y*)=RSS(y) } = { y* | y*=yhat+r* and |r*|^2 = |r|^2 } = yhat + { r* | r*_|_ X and |r*|^2 = |r|^2 } = yhat + sphere in residual space with radius=sqrt(RSS) . Conditional distributions: operational descriptions We are really only interested in understanding the conditional distribution given a MiSS to the point that we can simulate it. + Example 1: Let y be an observed vector of response values We need to condition on S(y)... mu <- 1 ysim <- rnorm(N, m=mu, s=1) ystar <- mean(y) + (ysim-mean(ysim)) It doesn''t matter what mu we picked; could use mu=0. ==> degenerate normal distribution in 'contrast space' + Example 2: mu <- 0 sigma <- 1 ysim <- rnorm(N, m=mu*rep(1,N), s=sigma) ystar <- mean(y) + (ysim-mean(ysim))/sd(ysim)*sd(y) It doesn''t matter what mu and sigma^2 we picked; could use mu=0 and sigma=1. ==> uniform distribution on sphere of radius |y-mean(y)| in contrast space + Example 3: y <- boston[,"MEDV"] beta <- rep(0,ncol(X)) sigma <- 1 ysim <- rnorm(N, m=X%*%beta, s=sigma) y.lm <- lm(y ~ X) ysim.lm <- lm(ysim~X) ystar = fitted(y.lm) + resid(ysim.lm)/RMSE(ysim.lm)*RMSE(y.lm) It doesn''t matter what beta and sigma we picked; could use beta=rep(0,ncol(X)) and sigma=1. - PRACTICAL USE: . Simulation of the CONDITIONAL distribution to generate response vectors y* that are AS LIKELY AS THE OBSERVED RESPONSE vector y IF the linear model is correct. . The datasets we sample from this conditional distribution will share exactly the value of the sufficient statistic with its observed values! Recall: We condition on the sufficient statistic, that is, we are holding it fixed. - Canned functions for comparison plots based on sufficiency theory: If the data followed the model, their residual plots would look like this... site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/" source(paste(site,"residual-comparison.R",sep="")) res.vs.fits(boston.lm) # execute one statement at a time res.vs.qnorm(boston.lm) res.vs.order(boston.lm) - Why are the above ideas "inference"? They are not CIs or tests, or are they? ==> Turn the comparisons into 'retention bands' or 'null bands'. . Prepare: r <- resid(boston.lm) yhat <- fitted(boston.lm) X <- model.matrix(boston.lm) . Normal quantile band: hor <- qqnorm(r, pch=16, cex=.5)$x # set up the plot in proper dimensions for(i in 1:100) { r.null <- resid(lm(rnorm(N) ~ X)) r.null <- r.null / norm(r.null) * norm(r) points(x=sort(hor), y=sort(r.null), pch=16, cex=.3, col="gray") } points(x=sort(hor), y=sort(r), pch=16, cex=.5, col="red") => "NULL BAND" for quantile plot assuming the model is correct. . Standard residual plot: detect nonlinearities bw <- 1/3 plot(yhat, r, pch=16, cex=.5) # set up the plot in proper dimensions for(i in 1:100) { r.null <- resid(lm(rnorm(N) ~ X)) r.null <- r.null / norm(r.null) * norm(r) lines(lowess(r.null ~ yhat), col="gray", lwd=.5) } points(yhat, r, pch=16, cex=.5) lines(lowess(r ~ yhat), col="red", lwd=2) . Heteroscedasticity plot: abs(r) vs yhat bw <- 1/3 plot(yhat, abs(r), pch=16, cex=.5) # set up the plot in proper dimensions for(i in 1:100) { r.null <- resid(lm(rnorm(N) ~ X)) r.null <- r.null / norm(r.null) * norm(r) lines(lowess(abs(r.null) ~ yhat, f=bw), col="gray", lwd=.5) } points(yhat, abs(r), pch=16, cex=.5) lines(lowess(abs(r) ~ yhat, f=bw), col="red", lwd=2) . Lurking variables/auto-correlation plot: r vs order Not successful! However, a plot of the cdf of diff(resids) works: fun <- function(r) sort(diff(r)) z <- fun(r) plot(z, 1:length(z), pch=16, cex=.5, type="l") for(i in 1:100) { r.null <- resid(lm(rnorm(N) ~ X)) r.null <- r.null / norm(r.null) * norm(r) z.null <- fun(r.null) lines(z.null, 1:length(z), col="gray", pch=16, lwd=.5) } abline(v=0) lines(z, 1:length(z), col="red", lwd=2) ==> Adjacent residuals are generally too close (diffs cluster around zero). ==> Positive auto-correlation across order. Q: Where did we make use of the order of the data? ================================================================ * COLLINEARITY -- A SPECIAL CASE OF NON-IDENTIFIABILITY - Interpretation issues: . NOT a model problem, but causes interpretation issues . Commonality of collinearity and leverage/self-influence: + both are properties of the predictor distribution + neither is a model problem because neither involves y + both cause interpretation problems . Difference between collinearity and leverage/self-influence: + Leverage/self-influence has to do with "outliers" in (p+1)-dimensional predictor space. + Collinearity has to do with near-degeneracy of N-dimensional predictor vectors => The interpretation problems they cause are completely different. . Compare the following design matrices and comment on their properties and interpretation of coefficients: |1 0| |1 0| |1 1 0| |1 0| |1 0| |1 1 0| X = |0 1|, |1 1|, |1 0 1| |0 1| |1 1| |1 0 1| . In the third matrix, what is x0 adjusted for x1 and x2, or x1 adjusted for x0 and x2, or x2 adjusted for x0 and x1? ================================================================ LECTURE 19, 2011/11/14: * ORG: - HW 6: real estate project was due today, Monday, Nov 14, 3pm Comments? - General ed: Are you aware of 'aldaily.com'? Two very interesting articles: . Steve Jobs, the Tweaker (The New Yorker) . Michael Lewis on Danny Kahneman (Vanity Fair) ---------------- * ROADMAP: - Recap: Inference for Diagnostics (final) The two principles for formalizing lack-of-fit model diagnostics: -------------------------------------------------- | Parametric Bootstrap or Plug-In Simulation | |--------------------------------------------------| | Conditioning on a minimal sufficient statistic | -------------------------------------------------- - New: . Collinearity analysis with Principal Component Analysis . Principal Component Analysis for Dimension Reduction ---------------- * RECAP AND ELABORATIONS: some repeats of last recaps - Principle for formalizing lack-of-fit diagnostics as statistical tests: ... - Contrast the usual testing situations (e.g., testing slopes betaj=0) with lack-of-fit diagnostic tests. ... - What problem do the parametric bootstrap and conditioning on a MiSS solve? ... - When people talk about null distributions, they usually complete the expression to "null distribution of a test statistics". But this is not what our two principles produce. What do they really produce? ... - Compare the parametric bootstrap and conditioning on a MiSS. . ... . ... - Exercise: Design a test to check whether sigma increases with mu. Devise test statistics and null distributions. ... - Exercise: Apply the parametric bootstrap to testing beta1=0. .. - Exercise: Apply conditioning on a MiSS to testing beta1=0. ... ---------------- * COLLINEARITY: - What is the more general term that applies to all models? ... identifiability - Exact cases of collinearity do exist but are known prior to data analysis. Example: 5 level categorical predictor 5 dummies are collinear with intercept. ==> Solve the problem analytically - Collinearity is almost always approximate: alpha0*x0 + alpha1*x1 + ... + alphap*xp ~ 0 [Something is missing: if all alphaj~0, this is ~0 trivially. ...] - Predictor-specific collinearities: xj is collinear with the rest if xj ~ sum_{k!=j} alphak*xk Equivalent: xj.adj is small ~ 0 |xj.adj|^2 ~ 0 [Again, something is missing: If I switch the units of xj from inches to miles, I get very small numbers and hence |xj.adj|^2 ~ 0 ...] - Effect 1 of collinearity: variance (SE) inflation . VIF (variance inflaction factor): Quantification of loss of power in statistical inference . Recall: stderr.est(bj) = s / |xj.adj| . Collinearity of xj with other predictors causes |xj.adj| to be small hence stderr.est to be large . What would be a good definition of "variance inflation factor"? Compare |xj.adj| with ... VIFj = [ actual stderr.est(bj) / best possible stderr.est(bj) ]^2 = |xj.0|^2 / |xj.adj|^2 # Use xj.0, centered. = 1/(1-R2.j) # R2.j = R2 of regr of xj on all other predictors . Interpretation of the VIF in terms of CI-width? sqrt(VIFj) shows by how much the CI or RI width has been inflated by collinearity. - Effect 2 of collinearity: Simpson''s paradox . Practical example: "A marketing project identified a list of affluent customers for its new PDA. Should it focus on the younger or older members of this list?" "To answer this question, the marketing firm obtained a sample of 75 consumers and asked them to rate their 'likelihood of purchase' on a scale of 1 to 10." Data: pda <- read.csv(paste(site,"pda.dat",sep="")) pda <- read.csv("Data/pda.dat") # Alternate after separate download names(pda) dim(pda) plot(pda) summary(lm(Rating ~ Age, data=pda)) Observations: . For increasing Age we see on average increasing Rating which agrees with the simple regression of Rating on Age: summary(lm(Rating ~ Age, data=pda))$coef . However, adjusted for Income, we have a reversal of the slope for Age: summary(lm(Rating ~ Income + Age, data=pda))$coef . How is this possible? Collinearity of Age and Income: cor(pda[,c("Age","Income")]) This effect is possible but not necessary in the presence of collinearity. ---------------- * COLLINEARITY ANALYSIS WITH SMALLEST PRINCIPAL COMPONENTS: - Vectorized effects of collinearity on inference for the coeffs: . The (stderr-)variance matrix of b is V[b] = (X^T X)^{-1} sigma^2 . If the eigendecomposition of (X^T X) is: (X^T X) = lambda1 P1 + lambda2 P2 +... (p+1 terms, Pj = uj uj^T) then the eigendecomposition of (X^T X)^{-1} is: (X^T X)^{-1} = 1/lambda1 P1 + 1/lambda2 P2 +... [Side remark: if S is a symmetric matrix with eigendecomp S = sum lambdaj Pj and if f(lambda) is a numeric function, then f(S) = sum f(lambdaj) Pj ] Consequence: Linear combinations uj^T b = with small lambdaj are statistically ill-determined (dataset-to-dataset). V[uj^T b] = uj^T V[b] uj = 1/lambdaj = huge if lambdaj = small - Extreme case: perfect collinearity -- rank deficient design matrix X b = 0 <=> (X^T X) b = 0 <=> b^T (X^T X) b = 0 <=> |X b|^2 = 0 <=> b is eigenvector of (X^T X) with eigenvalue 0 determination of rank of X: eigen decomposition of X^T X rank deficiency: count the (near) zero eigenvalues each eigenvector for a zero eigenvalue is a 'b' with X b = 0 Example: boston.lm <- lm(MEDV ~ ., data=boston) X <- model.matrix(boston.lm) eigen(crossprod(X))$values # full rank XX <- cbind(X,x12=X[,1]+X[,2],apply(X,1,sum)) # Add a collinear column eigen(crossprod(XX))$values # rank deficient Look for eigenvalue several orders smaller than the one before (ex: .9, 8.4e-09) - Problem with the concept of collinearity: SCALING OF PREDICTORS XX <- cbind(X[,-13]*.000001,X[,13]*1000000) # example: rescale predictors eigen(crossprod(XX))$values # negative e'vals? => eigen() crashed!! Solution: Scale the predictors, use 'cor()' instead of 'crossprod()' but remove intercept vector, which is not of interest anyway (centering/adjusting-for-the-mean is implicit in correlations) eigen(cor(X[,-1]))$values XX <- cbind(X[,-1],x12=X[,2]+X[,3],apply(X,1,sum)) # Append collinear predictors eigen(cor(XX))$values - Standardizing the predictors: Often practiced in social sciences Slope of x = est. ave. difference in y for a 1 sdev difference in x 'holding all other predictors fixed' - Collinearity analysis with "smallest principal components": Use of smallest eigenvalues/vectors to detect near-collinearity PCA: Usually used for DIMENSION REDUCTION => largest eigenvalues/vectors but COLLINEARITY ANALYSIS requires: smallest eigenvalues/vectors Idea: Assume the columns of X are centered (intercept removed) LS: find b such that |y - X b|^2 = min PCA: find b such that var(X b) ~ |X b|^2 = max/min for collinearity analysis we are interested in var(X b) small But: PCA is ill-posed; b->Inf or b=0 solve the problem trivially. => PCA needs a constraint. Idea: Inspired by collinearity analysis, take the best possible case for comparison, which is: the predictors are pairwise ... IF this were the case: var(X b) = |x1|^2 b1^2 + ... + |xp|^2 bp^2 w.l.o.g.: |x1|^2 = ... = |xp|^2 = 1 var(X b) = b1^2 + ... + bp^2 = |b|^2 => var(X b) = max/min subject to |b|^2 = 1 => Eigenproblem: var(X b) = b^T cor(X) b = max/min subj. to |b|^2=1 Interpretation: Eigenvalues = variances (max/min/stationary) var(X b) ~=~ 0 => X b ~=~ 0 for small eigenvalues - Canned smallest eigendecomps. of the correlation matrix of X (w/o intercept) site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/" source(paste(site, "collinearity-pca.R", sep="")) source("Functions/collinearity-pca.R") X.scaled <- scale(X[,-1]) # standardize the columns, remove intercept collinearity.pca(X.scaled, nlin=3) $variances PC13 PC12 PC11 0.064 0.169 0.186 $coefficients PC13 PC12 PC11 CRIM -0.046 0.087 -0.110 ZN 0.081 -0.071 0.263 * INDUS 0.251 -0.113 -0.303 CHAS -0.036 -0.004 0.014 NOX -0.044 0.804 0.111 RM -0.046 0.153 0.053 AGE 0.039 -0.212 -0.459 DIS 0.018 0.391 -0.696 * RAD 0.633 -0.107 0.037 * TAX -0.720 -0.215 -0.105 PTRATIO -0.023 0.210 0.175 B 0.004 0.042 0.019 LSTAT -0.024 0.055 0.271 Interpretation of the smallest eigenvalue: The smallest PC has 6.4% of the variance of a single variable (var(xj)=1). Its standard deviation is sqrt(0.064)=0.253, or about a quarter of each xj Interpretation of its eigenvector: Round the coeffs first. INDUS: +1/4 RAD: +2/3 TAX: -3/4 Defines a linear combination with small variance, hence examine: 1/4*INDUS + 2/3*RAD - 3/4*TAX ~=~ 0 Equivalently: 3/4*TAX ~=~ 1/4*INDUS + 2/3*RAD Equivalently: TAX ~=~ 1/3*INDUS + 8/9*RAD Check with a plot: plot(TAX ~ I(1/3*INDUS + 8/9*RAD), data=as.data.frame(X.scaled), pch=16, cex=1) (Recall the formula language can also be used for plotting.) Suspicious plot: These are not 506 points... windows(width=6, height=12) par(mfrow=c(2,1), mgp=c(1.8,.5,0), mar=c(3,3,1,1), cex.lab=.8, cex.axis=.8) plot(TAX ~ I(1/3*INDUS + 8/9*RAD), data=as.data.frame(X.scaled), pch=16, cex=.5) Jitter the same: n <- nrow(X.scaled) jit <- .1 jit.x <- runif(n,-jit,jit) jit.y <- runif(n,-jit,jit) plot(TAX + jit.y ~ I(1/3*INDUS + 8/9*RAD + jit.x), data=as.data.frame(X.scaled), pch=16, cex=.5) ================================================================ LECTURE 20, 2011/11/16: * ORG: - HW 5, 6: Solutions posted - Writing HW 2: Style, rewrite 5.2.2 (p.77, top) Due: Monday, Nov 21, 10pm Download txt file from website and edit your solution into the file. Send to usual HW gmail. ---------------- * ROADMAP: - Recap: . Collinearity in general . PCA for collinearity analysis - New: PCA for dimension reduction ---------------- * RECAP: - Collinearity: a property of ..., not of ... - Compare collinearity with leverage/case influence ... - Collinearity''s influence on inference: . What is it? ... it makes inference more difficult . In what formula can you see it? ... inflation of SE^2(bj) = sigma^2/|xj.adj|^2 . How can you measure it? ... VIF = 1/(1-R2.j) - Simpson''s paradox: . Assuming x1 and x2 are orthogonal, why can''t b1 switch its sign when you add x2 to the regression? . If b1 changes its sign when you add x2, what can you infer about x1 and x2 in relation to each other? . Try to 'visualize': # higher age -> fainter plot(pda, col=gray(rank(pda[,"Age"])/nrow(pda)), pch=16) Identify the relevant plots to see 'Simpson'! - Global analysis of the nature of the collinearity: . In terms of linear algebra, collinearity is an approximate ... linear dependence! X b = b1*x1+...+bp*xp ~ 0 . How can you measure an approximate ...? ... Var(X b) ~ 0 . How can you 'immunize' the measure against differences in measurement scale? ... change the units to SD for each predictor !! . How can you make this a mathematically well-defined measure that does not have a trivial 0 solution? ... impose a constraint, e.g., |b|^2 = 1 . There are many ways to impose a ..., but what is a principled way of finding one? ... consider a comparison of Var(Xb) with Var(Xb) when X is totally 'uncollinear', i.e., orthogonal predictors ==> Var(Xb) = |b|^2 . Instead of imposing a ..., you can normalize your measure. It becomes what is called a ... quotient. 'Rayleigh' . What are the stationary solutions of a ... quotient? ... eigenvectors of X^T X . For collinearity analysis, are we interested in small or large values of the ... quotient? ... small!! . What is the problem with a linear dependence for fitted equations? ... they make the fitted equation/prediction of y non-unique . Recall the discussion last time: Does degeneracy matter for prediction? ... in principle, inference is the only thing damaged by collinearity, but prediction might also want to know about collinearities because future predictor combinations might no longer follows the collinearity, causing invisible extrapolation ... ---------------- * PCA -- grand summary for both its uses: - Intuition: PCA seeks linear combinations Xb of the columns of X that have ... or ... variance. - Uses of extreme PC dimensions: . Large variance: dimension reduction (this lecture) . Small variance: implicit equations (see last lecture) Intuitions: Consider p=2, variables X1, X2 only, sd(Xj)=1, cor(X1,X2) large We can describe the situation in two ways: (1) The joint distribution of (X1,X2) fall near span(c(1,1)) ==> Dimension reduction (a linear space near the data, span(c(1,1))) (2) The joint distribution of (X1,X2) nearly satisfies X1-X2=0 ==> Fitting an implicit equation (the same space described by X1-X2=0) When p is large (13 for the Boston Housing predictors), there is usually a vast gap between the two approaches 1 implicit equation ==> subspace dim=12 (codim=1) 1 reduced dimension ==> subsapce dim=1 - Problem: Var(Xb)=max/min is not well-defined (why? ...) Note: Var(Xb) is the sample variance of the N-dim column vector Xb. In R this is computed literally as follows: var(X %*% b) The spread as measured by the sample standard deviation would be sd(X %*% b) which is the same as sqrt(var(X %*% b)). - Solution: Use a normalizing constraint. - Principle for deriving a normalizing constraint: . Wlog standardize all columns of X to zero mean and unit variance. ==> Change of units from inches, miles, light years, millimeters, Angstrom to multiples of the respective standard deviations. ==> (1) Obtain "compatible" orders of magnitudes across the columns of X. (2) Prevent numerical problems due to differing orders of magnitudes of the X columns. . After standardization, derive the constraint from the following principle: Assume that the columns of X are uncorrelated ("absence of linear association"). Under this assumption derive Var(Xb): Var(Xb|corrs=0) = Var(x1)*b1^2 + ... + Var(xp)*bp^2 = b1^2 + ... + bp^2 where the first line follows from assuming Cov(xj,xj)=0, and the second line follows from standardization, Var(xj)=1. - Final PCA problem: Compare Var(Xb|actual) with Var(Xb|corrs=0) and look for extreme coefficient vectors 'b'. Var(Xb|actual) b^T Cov(X) b b^T (X^T X)/(N-1) b Rayleigh(b) = ----------------- = -------------- = --------------------- Var(Xb|corrs=0) b^T b b^T b ==> max_b Rayleigh(b) = largest eigenvalue of Cov(X) min_b Rayleigh(b) = smallest eigenvalue of Cov(X) Stationary solutions of Rayleigh(b) = all eigenvalues of Cov(X) Note: Textbooks throw in the constraint |b|^2=1 without justification. We gave a justification and obtained an interpretation of PCA: It looks for 'interesting' directions 'b' by comparing the actual spread of 'X b' with that which it would be if the columns of X had no linear association. - Interpretation of eigenvalues: Rayleigh(b) = Var(Xb) if 'b' is a unit vector ==> Eigenvalues are maximal/minimal/stationary variances of lin.combs. of X, if the columns of X have been standardized. - EXAMPLE: the Boston Housing data, symmetric analysis including the response X.std <- scale(boston) # Standardize all variables to mean=0, sd=1 X.cov <- cov(X.std) # Same as cor(X), but we'll need X.std below. X.eig <- eigen(X.cov) # The heart of PCA # Examine the PCA guts: names(X.eig) X.eig$values # Can we reconstitute the covariance matrix? Indeed: max(abs(X.cov - X.eig$vec %*% diag(X.eig$val) %*% t(X.eig$vec))) # ------------ ANALYSIS OF THE EIGENVALUES ------------- # Total budget of variance: sum(X.eig$val) # Why must this be so? Think 'trace'! # "FRACTION OF VARIANCE ACCOUNTED FOR BY EACH PC": round(X.eig$val / sum(X.eig$val) *100,1) # "CUMULATIVE FRACTION OF VARIANCE ACCOUNTED FOR BY EACH PC": round(cumsum(X.eig$val) / sum(X.eig$val) *100,1) # It is standard to report both. # REPORT: # PC1 accounts for almost half of the variance, 47%. # PC2 accounts for 12% of the variance, followed by # PC3 with less than 10%. # Together PC1-3 account for 68% fo the variance. # It is also standard to plot the eigenvalue profile: plot(X.eig$val, type="b", pch=16, ylab="PCA Eigenvalues", xlab="Order") # What would the eigenvalues be if there were no association? cov(X)=I ==> all evals=1 # 1 is also one variable worth of variance (they are all standardized: sd=1). # This motivates using 1 as the baseline that separates small and large: abline(h=1, col="gray") # So should we accept three large dimensions as "real"? # ==> Kaiser's eigenvalue 1 rule: for dim reduction, accept PCs with eval>1. # Suspicion: Could the third PC have arisen by chance? # What would eigenvalue profiles look like if there were no association? # Here is a principle we will look into late: permutation null distributions for(i in 1:1000) lines(eigen(cov(apply(X.std,2,sample)))$values, col="gray") # The third eigenvalue still sticks out above the null eigenvalue profiles # The method of accepting PCs with eigenvalues above the random level # is due to Horn (1964) and called "parallel analysis". # In the social sciences it may be one of the best selection methods. # It answers this question: # Does a given PC have statistically significantly more variance # than one variable? # The answer depends a little on the order of the PC as the null profiles show. # ---------------- ANALYSIS OF EIGENVECTORS ---------------- # The columns of X.eig$vec contain the eigenvectors: options(width=96) # Want long lines for matrix outputs. round(X.eig$vectors, 2) # Ooops, we lost the variable labels; put them back on: rownames(X.eig$vectors) <- colnames(X.std) colnames(X.eig$vectors) <- paste("PC",1:ncol(X.eig$vectors),sep="") round(X.eig$vectors, 2) # Better. # The first PC is typically an overall weighted mean of the variables, # after inverting some of the variables so they all line up the same way: X.std.flipped <- t((sign(X.eig$vectors[,1]))*t(X.std)) colnames(X.std.flipped) <- paste(ifelse(sign(X.eig$vectors[,1])==1,"+","-"),rownames(X.eig$vectors),sep="") round(cov(X.std.flipped), 2) # ==> Except for CHAS (Charles river), all variables are positively associated. # ==> PC1 is a weighted average of the re-oriented variables. # # ---------------- INTERPRETATION OF THE EIGENVECTORS ---------------- # Look for large values and their signs. # First, note the indeterminacy in the signs of the eigenvectors: # You should feel free to change their signs for interpretability. # In our case, PC1 seems to go in the "wrong" direction: # high crime, high pollution, small houses, high age, high PT ratio, poverty... # By inverting this eigenvector high values will reflect 'expensive': X.eig$vectors[,1] <- -X.eig$vectors[,1] # Turns out PC2 is also better inverted: X.eig$vectors[,2] <- -X.eig$vectors[,2] round(X.eig$vectors, 2) # For interpretation of the eigenvectors, we need the meanings of the variables: # Here is a listing of the variable meanings: # Variable description # ----------------------------------------------------------------------------------- # CRIM crime rate # ZN proportion of residential land zoned for lots over 25,000 sq. ft # INDUS proportion of non-retail business acres # CHAS Charles River dummy variable (=1 if tract bounds river; 0 otherwise) # NOX nitric oxides concentration, pphm # RM average number of rooms per dwelling # AGE proportion of owner-occupied units built prior to 1940 # DIS weighted distances to five Boston employment centers # RAD index of accessibility to radial highways # TAX full-value property tax rate per $10,000 # Ptratio pupil teacher ratio # B 1000*(Bk-0.63)^2 where Bk is the proportion of blacks # LSTAT percent lower status population # MEDV median value of owner occupied homes in $1000's # ---------------------------------------------------------------------------------- # ==> PC1 ~ low crime, large lots, little industry, low air pollution, # large homes, young homes, low taxes, low PT ratio, # far from employment centers, far from highways, # low ratial mix (an inverse measure), low fraction of poor people # --> expensive suburban tracts # ==> PC2 ~ not low on crime, small lots, near Charles River, more air pollution, # large homes, old homes, near employment, neutral taxes, low PT ratio, # neutral to ratial mix, few poor people # --> expensive city tracts # ==> PC4 ~ Charles River # ==> PC7 ~ Crime # ==> PC13: We interpreted it in terms of an implicit equation # CONVENTION: Eigenvectors are usually rescaled to reflect the eigenvalues # They are then called 'loadings': X.loadings <- t(t(X.eig$vectors)*(sqrt(X.eig$values))) round(X.loadings,2) # In the social sciences loadings >=0.35 are often considered as 'large' # and interpretable. # I recommend plotting the loadings against each other: source(paste(site,"pairs-plus.R",sep="")) # See earlier lectures for 'site'. source("Functions/pairs-plus.R") # (Locally on the instructor's computer) # All PC loadings: pairs.plus(X.loadings, gap=0, cex=.5, xlim=1.1*range(X.loadings), ylim=1.1*range(X.loadings)) # Just 4 PCs pairs.plus(X.loadings[,1:4], gap=0, xlim=1.1*range(X.loadings), ylim=1.1*range(X.loadings)) # Just the first 2, labeled for interpretation: plot(X.loadings[,1:2], xlim=1.1*range(X.loadings), ylim=1.1*range(X.loadings), pch=16) text(X.loadings[,1:2], lab=rownames(X.loadings), adj=c(0,0)) abline(h=0, col="gray"); abline(v=0, col="gray") # # PC SCORES: Projections # Next we project the standardized data matrix onto the eigenvectors, or, # equivalently, we change the basis of standardized data space to the PC basis: X.proj <- X.std %*% X.eig$vec rownames(X.proj) <- rownames(X.std); colnames(X.proj) <- colnames(X.loadings) # And plot the 'rotated' standardized data: pairs.plus(X.proj, gap=0, cex=.2, xlim=1.1*range(X.proj), ylim=1.1*range(X.proj)) # IMPORTANT! To see the shrinking sizes of successive PC dimensions, # we need equal scales on all plots. # PC7 is created to accommodate a few stragglers. # What causes them? Check for large coeffs in PC7: round(X.eig$vec[,7],3) # CRIM! The high-crime census tracts are isolated on this dimension. # Next just 4 PCs: pairs.plus(X.proj[,1:4], gap=0, cex=.2, xlim=1.1*range(X.proj), ylim=1.1*range(X.proj)) # The cluster is probably due to Charles River: mark CHAS=1 with red pairs.plus(X.proj[,1:4], gap=0, cex=.8, xlim=1.1*range(X.proj), ylim=1.1*range(X.proj), col=c("black","red")[boston[,"CHAS"]+1]) # # Finally, remind yourself of the view of the raw variables: windows(); pairs.plus(boston[,-14], gap=0, cex=.2, extra=.2) ================================================================ LECTURE 21, 2011/11/21: * ORG: - Writing HW 2: Style, rewrite 5.2.2 (p.77, top) Due today: Monday, Nov 21, 10pm Download txt file from website and edit your solution into the file. Send to usual HW gmail. - There will be class on Wed. - Next homework coming... ---------------- * ROADMAP: - Recap: PCA for dimension reduction - New: . PCA -- conclusion . Model selection in regression . Permutation tests ---------------- * RECAP: - Two uses of PCA: . ... collinearity analysis through implicit equation estimation . ... dimension reduction - How do PCA and regression differ? ... PCA is 'unsupervised', i.e., no response variable singled out ... Regression is 'supervised', i.e., driven by a response - What is the optimization criterion in PCA? ... constrained variance (of linear combinations of variables) - How can you make the criterion independent of units? ... standardization of variables, i.e., use a data-driven unit GENERAL PRINCIPLE: For any statistical method, the CONCLUSIONS -- NOT the numeric values of the results -- must be invariant under a change of units. - What is a good principle to derive the Rayleigh quotient form of the PCA criterion? ... Null comparison: Var(X a) = .. if cols of X were orthogonal? Var(Xa) = |a|^2 (cols if X standardized) - Describe the stationary solutions of the PCA criterion. ... Eigenequations ==> e''vecs, e''vals - Basic property of PCA eigenvalues? ... sum = p - Describe a protocol for analyzing PCA eigenvalues. In particular, address the 'number of factors' problem. ... ---------------- * PCA for Dimension reduction - Interpretation of eigenvectors/loadings: Go back to last part of previous Lecture. - Concluding remarks and outlook: . We described the version of PCA used in most applications with a heterogeneous mix of variables. . In 'functional PCA' the rows are evaluations of functions at discrete locations (multiple time series, spatial functions,genes,...) and hence all variables have the same units (e.g., temperature). In this case variables are not usually standardized. ==> Nonstandardized PCA responds not only to associations between variables but also to differences in marginal variance. Example: Locations where the temperature variance is low will contribute little to the large PCs. . In functional PCA one tries to impose smoothness or sparseness on the domain (time, space, genes). ==> Penalties are added to variance function. . Multivariate analysis, and PCA in particular, is an active research area. Examples: Zongming''s thesis was on sparse PCA. Dan Yang''s is on sparse SVD. ---------------- * MODEL SELECTION -- CONCLUSION OF LINEAR MODELS - PREAMBLE: ADJUSTED R SQUARE . R2 is biased: E[R2] > 'true R2' (because of LS optimization) . Attempt at de-biasing: R2 = 1 - RSS/TSS = 1 - (RSS/(n-1))/(TSS/(n-1)) RSS/(n-1) is downward biased as an estimate of sigma^2 TSS/(n-1) is unbiased as an estimate of sigma.y^2 => Estimate true R2 by R2.adj = 1 - s^2/s.y^2 < R2 - MODEL SELECTION: Play with models using 'add1', 'drop1' functions. . Stepwise forward: ... summary(lm(MEDV ~ RM, data=boston)) summary(lm(MEDV ~ RM + LSTAT + NOX, data=boston)) a clumsy function to ``help'': add1(lm(MEDV ~ 1, data=boston), scope= ~ RM + LSTAT + NOX) . Stepwise backward: ... summary(lm(MEDV ~ ., data=boston)) summary(lm(MEDV ~ . - AGE, data=boston)) summary(lm(MEDV ~ . - AGE - INDUS, data=boston)) Another clumsy function to 'help': drop1(lm(MEDV ~ ., data=boston)) - STEPWISE REGRESSION: . Forward: Starting from null model (mean only), successively enter best variable as long as its t-stat IS signif. . Backward: Starting from full model, successively remove worst variable as long as its t-stat is NOT signif. . Alternating forward/backward... - ALL-SUBSETS REGRESSION: . Software: install the following package once: install.packages(pkgs="leaps", repos="http://cran.us.r-project.org/") library(leaps) # needs to be called in every R session . 'regsubsets()' in the package finds the best subset of predictors for every size. Optional arguments: nbest=2 will find the two best for each size nvmax=13 will find them for all sizes up to 13 force.in=c(6,13) will force variables 6 and 13 in the model (RM, LSTAT) Example: summary(regsubsets(log(MEDV) ~., data=boston)) summary(regsubsets(log(MEDV) ~., data=boston, nbest=3, nvmax=13, force.in=c(6,13))) A graphical way of looking at these models: windows(width=10, height=12) plot(regsubsets(MEDV ~., data=boston, nbest=2, nvmax=13), scale="adjr2") . Important observation in help(regsubset): 'Since this function returns separate best models of all sizes up to 'nvmax' and since different model selection criteria such as AIC, BIC, CIC, DIC, ... differ only in how models of different sizes are compared, the results do not depend on the choice of cost-complexity tradeoff.' - PROBLEM: How high should we go with picking models? . Many criteria for stopping the inclusion of predictors: Mallows'' Cp, AIC, BIC, RIC,... . A helpful universal trick/idea: ADD A FEW RANDOM NUMBER PREDICTORS TO THE MODEL !!!!!!! and check when the regressions get trapped in them: for(i in 1:1000) { nnoise <- 10 # the number of random predictors we add boston.ext.dat <- # the data frame of actual and random data data.frame(boston, as.data.frame(matrix(rnorm(506*nnoise), ncol=nnoise))) summary(lm(log(MEDV) ~ ., data=boston.ext.dat)) # model with 'junk' ## Fortunately most random predictors are recognized as worthless, ## but occasionally there is one that sticks out by chance. ## Now we do all-subset search and check graphically: boston.ext.sub <- regsubsets(nvmax=20, nbest=2, really.big=T, log(MEDV) ~., data=boston.ext.dat) plot(boston.ext.sub, scale="adjr2") } => Models of size up to 9 would be ok by this method. [Animate the above graph by uncommenting the for loop and braces!!!] Questions: . How does the method depend on the number of included junk predictors? . Is this a good method for interpretation or prediction or both or neither? . Theory??? . For some progress, see Shaun Lysen''s Ph.D. thesis, and references therein. - Final remarks on model selection: . Modern Techniques: Lasso, elastic net, Dantzig selector, FDR, ... . Open issues (1): Model selection when p >> N (genetics) . Open issues (2): Inference for coefficients after model selection !!!!!!!!!!!! END OF LINEAR MODELS !!!!!!!!!!!!!!!! ================================================================ LECTURE 22, 2011/11/28: * ORG: - One more homework: Practice covering several areas ---------------- * ROADMAP: - Recap: Model Selection - New: . Permutation tests . Nonparametric bootstrap . Smoothing ---------------- * RECAP AND ELABORATIONS: - Basic model selection strategies: . forward stepwise (greedy, not globally optimal) . backward stepwise (greedy, not globally optimal) . all-subsets search - Criteria-driven model search: . Numeric criteria such as Cp, AIC, BIC, RIC,... . They are all RSS + penalty for model size . The penalty is an increasing function of model size to encourage parsimony. . Important observation: + For any model size m, there is one best model (the one with min RSS). + Criteria-driven searches differ only in the model size they pick. (Cp, AIC pick larger models, BIC smaller models.) - Killer principle for model selection: . Control false selection rate. Meaning: The rate with which we select "null predictors". . Technology: ----------------------------------------- | Use random predictors whose selection | | is false by construction. | ----------------------------------------- . Details to think through: + How many random predictors? Call their number r. Xnull <- matrix(rnorm(r*nrow(X)),ncol=r) Xnew <- cbind(X,Xnull) The more predictors, the more likely one finds random inclusion. ==> Control simultaneous false selection from r null predictors. + Random predictors = normal random numbers or = random permutations of existing predictors? + How about adding a full set of the predictors, jointly randomly permuted? Xnull <- X[sample(1:nrow(X)),] Xnew <- cbind(X,Xnull) ==> It seems there should be some insight from this version, but not clear what. Note the permuted predictor set maintains the full internal collinearity structure; it only breaks the association with y. + There is still something messy with the concept of false selection rate, though not with the method: If X contains r0 null predictors already, then the true number of null predictors is r0+r. The issue is that the r0 'real' null predictors compete with the r 'fake' null predictors, so we underestimate the true false selection rate. ==> Maybe one should add a full set of p null predictors in order to have chance >0.5 to have the artificial null predictors selected over 'real' null predictors. ---------------- * NEW CHAPTER: PERMUTATION TESTS - Idea: If the null hypothesis is independence of two variables X and Y, break the association between X and Y with random permutations. - Data example: places rated data (1986), 329 metropolitan areas of the US places <- read.table(paste(site,"places.dat",sep=""), header=T) dim(places) names(places) rownames(places) pairs.plus(places, gap=0, cex=.3) - Visual permutation test to assess presence/absence of association between two variables: xlab <- "Climate.Terrain"; ylab <- "Housing" # xlab <- "Education"; ylab <- "Crime" x <- places[,xlab]; y <- places[,ylab] windows(6,7.5) par(mfrow=c(5,4), mar=rep(.5,4), xaxt="n", yaxt="n") i.pos <- sample(1:20,size=1) # random position of the actual for(i in 1:(i.pos-1)) plot(x=x, y=sample(y), pch=16, cex=.5) plot(x=x, y=y, , pch=16, cex=.5) for(i in (i.pos+1):20) plot(x=x, y=sample(y), pch=16, cex=.5) # Which plot shows the actual data? # 19 of the 20 plots show draws from a 'random association model'. - Numerical permutation test: p-value computation with simulation . Test statistic? anything; use the correlation, for example test.stat <- function(x,y) cor(x,y) test.stat <- function(x,y) { sum(x>median(x) & y>median(y)) } # Alternate T.actual <- test.stat(x,y) T.actual . Simulation: sampling random permutations; the more samples, the more precision nsim <- 9999 # fence post problem: 9999 make 10000 equally likely spaces T.null <- rep(NA,nsim) # storage of "null values" for(i in 1:nsim) { if(i%%1000==0) print(i) # print a message every 1000 simulations T.null[i] <- test.stat(x, sample(y)) } # Equivalent one-liner (here the use of global variables inside a fct is unavoidable): T.null <- sapply(1:nsim, function(i) { if(i%%1000==0) print(i); test.stat(x,sample(y)) }) . Statistical significance: How extreme is 'T.actual' among 'T.null'? Judge visually: windows() hist(T.null, breaks=40, col="gray", xlim=range(c(T.null,T.actual))) lines(rep(T.actual,2), c(0,1000), lwd=2, col="red") . Non-rejection/retention/'acceptance' region at 5% and at 1% significance levels: quantile(T.null, c(.025, .975)) # retention interval for alpha=5% 2.5% 97.5% -0.1093858 0.1091154 # note symmetry quantile(T.null, c(.005, .995)) # retention interval for alpha=1% 0.5% 99.5% -0.1446682 0.1411944 # note symmetry this far out in the tail T.actual # not in the retention region by a long shot . Approximate normality of the simulated null distribution: qqnorm(T.null, pch=".") T.null.mean <- mean(T.null) T.null.sd <- sd(T.null) (where do you expect 'T.null.mean' to fall?) abline(a=T.null.mean, b=T.null.sd, col="red") abline(v=0, col="gray") Could therefore use the fitted normal distribution for non-rejection regions: 2-sided 5% level: T.null.mean+c(-1,1)*qnorm(.975)*T.null.sd Same thing, using the normal quantile function: qnorm(c(.025,.975), m=T.null.mean, s=T.null.sd) 2-sided 1% level: qnorm(c(.005,.995), m=T.null.mean, s=T.null.sd) [There exists theory to derive an asymptotic normal approximation for the permutation distribution.] . p-value: Fence post consideration T.actual has equal probability 1/(nsim+1) to fall into any spacing of T.null (under what probability assumptions?) ==> The simulation approximation to the p-value is: (sum(abs(T.null) >= abs(T.actual)) + 1) / (nsim+1) * 100 (What are its extreme values? ... Why is the lowest possible value not 0?) - Theory behind permutation tests: . Teaser: (x1,y1),...,(xn,yn) i.i.d. => y1,...,yn i.i.d. => x1,...,xn i.i.d. => x2,...,xn are independent of y1, but x1 is not generally independent y1. . Version (1): (xi,yi) are i.i.d. pairs i.e., (xi,yi) is indep. from (xj,yj) and the pairs have the same joint distributions (i!=j). They are i.i.d. draws from (X,Y). H0: In each pair, xi and yi are independent. Minimal sufficient statistic: S = ( x(1)<=...<=x(n); y(1)<=...<=y(n) ) = order statistics of x''s and y''s ~ empirical distributions of x''s and y''s Conditional distribution of (x,y) given S(y): Uniform distribution on ... permutations of x and y values . Version (2): yi independent random samples, xi fixed known numbers (i.e., the regression assumption with xi predictors, yi responses) H0: yi are i.i.d. The part in 'i.i.d.' that matters here is 'identically distributed'. The alternative is that the distribution yi is a fct of xi: G(yi|xi) H0 ==> E[yi], V[yi] are not dependent on xi Minimal sufficient statistic: S = (y(1)<=...<=y(n)) = order statistics of y''s ~ empirical distribution y''s Conditional distribution of yi''s given S(y): Uniform distribution on ... - Non-parametric nature of permutation tests: No assumptions made about the marginal distributions of F(x) and G(y) in (1) or G(y) in (2). - Special cases: . x: predictor vector; y: quantitative => MULTIPLE REGRESSION a permutation test of H0: y1,...,yn i.i.d. could be used as a non-parametric substitute for the overall F-test Natural test statistic: overall-F or R2 or RSS (All are equivalent under the permutation null distribution, meaning that they result in rejection at level alpha at the same time.) . x: binary (categorical) variable; y: quantitative response => 2-SAMPLE TEST Natural test statistic: 2-sample t-stat, or difference of medians,... (you name it) . x: categorical variable; y: quantitative response => 1-WAY ANOVA Natural test statistic: overall-F or Kruskal-Wallis or any pair-difference . x: binary; y: binary => FISHER''S EXACT TEST (no need to simulate; exact null distr. known) Natural test statistic: anything (any cell count, they are all equiv.) . x: categorical; y: categorical => CHI-SQUARE TEST OF INDEPENDENCE Natural test statistics: chi^2, or any cell count !!! . x: predictor vector; y binary => LOGISTIC REGRESSION or binary classification (boosting, bagging, SVMs,...) Natural test statistics: negative log-likelihood, misclassification rate,... . x: predictor vector; y categorical => POLYTOMOUS REGRESSION or MULTI-CLASS CLASSIFICATION Natural test statistics: '' . x: predictor vector; y response vector => CANONICAL CORRELATION ANALYSIS (CCA) Natural test statistics: largest canonical correlation or sum of k largest squared cancorrs,... . x: time; y time series => GENERALIZED WHITE NOISE TEST Natural test statistics: any time series model statistic (AR, ARMA, ARCH,...) . x: time; y = diff(time series) => GENERALIZED RANDOM WALK TEST Natural test statistics: any time series model statistic (AR, ARMA, ARCH,...) ==> If the question is whether any association exists between x of any kind and y of any kind, we can use the permutation distribution as a conditional null distribution of the data and use it to generate a conditional null distribution for ANY TEST STATISTIC at all. ==> Caution: The test statistic should be a measure of association between x and y. If it is just a function of the marginal distributions of x an y, nonsense results. Example: xbar + ybar is NOT a measure of association. What do you get if you permute xi against yi? ... - Drawback of permutation tests: No conversion to CIs? Yes and now... See last year''s qualifying exam. . Recall duality between CIs with 95% coverage and tests at 5% significance CI = { parameter values that can not be rejected if used as H0 } . In some cases the conversion is easy. Example: 2-sample test for location can be converted H0: y11, ..., y1m, y21-mu, ..., y2n-mu i.i.d. F => Permutation test applied to y11,...,y1m,y21+mu,...,y2n+mu can be efficiently excuted because location test statistics satisfy T(y+mu) = T(y) + mu (shift-equivariance) . More generally: simple linear regression H0: yi - beta1*xi i.i.d. (where the marginal distr. F is arbitrary) (Reason why one can drop the intercept: ...) CI = { beta1 | H0: 'yi-beta1*xi i.i.d.' is not rejected } - Other advantages of permutation tests: . Intuitive nature . Can be defended even for non-sampling data (such as the Places Rated data) ==> Play a what-if game of 'possible worlds'. . Freedom of choice of test statistic: use any measure of dependence . Non-parametric nature: marginal distribution F is arbitrary - Limitation: Good only for H0 that express independence between xi and yi or in which xi are fixed and yi are i.i.d. - Literature: . E. S. Edgington (1980) "Randomization Tests" (Dekker, Statistics: Textbooks and monographs, vol. 31) . Ph. I. Good (1994) "Permutation Tests : A Practical Guide to Resampling Methods for Testing Hypotheses" (Springer Series in Statistics) . Lehman & Romano (1986 or later, good for theory: sec. 4.3, 5.10ff) "Testing Statistical Hypotheses" (Wiley) ================================================================ * BOOTSTRAP -- A METHOD FOR ESTIMATING SE/VARIANCE, BIAS, AND CIs - Preliminaries: statistical functionals . Consider i.i.d. samples y1,...,yn ~ F Form the empirical distribution of y1,...,yn: Fn = 1/n sum_i delta_yi . 'Statistical functionals': (1) T(y1,...,yn) = T(Fn) (2) T(Fn) --> T(F) as n --> Inf (3) (T(Fn)-T(F))*sqrt(n) --> N(0,AV(F)) in distrib./weakly as n --> Inf . Examples for property (1): + T(Fn)=mean, T(F)=expectation + T(Fn)=empirical 90% quantile; T(F)=distributional 90% quantile Counter-examples: + T(Fn) = SSn/(n-1) However, the biased estimate T(Fn)=SSn/n is a statistical functional. + T(Fn) = y{n/2} is not a statistical functional. However, the median is a statistical functional. + T(Fn) = max(y1,...,yn) . Property (2) above is a generalized 'LLN' and a particular 'continuity property' because Fn --> F 'weakly'. . Property (3) is a generalized CLT called 'asymptotic normality'. A(F) is called the asymptotic variance. . This framework gives justification to asymptotic CIs: ----------------------------------------------------------- | CI = [T(Fn)-1.96*sqrt(AV(F)/n), T(Fn)+1.96*sqrt(AV(F)/n)] | ----------------------------------------------------------- Asymptotic coverage guarantee: ------------------------------------------- | P[T(F) in CI] --> 0.95 as n --> Inf | ------------------------------------------- . Problems: + A(F) is not known; it needs to be estimated. + Even though asymptotic theory says the bias of T(Fn) disappears faster than the variance as n-->Inf, the bias can still be substantial for n of interest. ================================================================ LECTURE 22, 2011/11/30: * ORG: - One more homework: Practice covering several areas ---------------- * ROADMAP: - Recap: Permutation tests, Intro to Bootstrap - New: . Nonparametric bootstrap (continued) ---------------- * RECAP AND ELABORATIONS: - Permutation tests: . What kinds of null hypotheses can be tested with permutation tests? Write them mathematically, not informally. ... . What is the theoretical basis for the validity of permutation tests? ... . Is this basis valid for finite samples or asymptotically only? ... . This theory gives you a conditional null distribution for ... ... . Give MANY examples of where permutation tests apply! ... - Preliminaries to Bootstrap: . Explain the concept of a statistical functional ... . Give examples of statistical functionals. ... . Explain why the following are NOT statistical functionals: + T(y1,...,yn) = s^2 = sum (yi - ybar)^2 / (n-1) ... + T(y1,...,yn) = y1 ... . Correction to last time (mea culpa): max is really a statistical functional! T(y1,...,yn) = max(y1,...,yn) But it fails to satisfy one of the propert listed above. Which one? ... . What is the goal of bootstrap? ... . Does the bootstrap have finite sample or asymptotic justification? ... ---------------------------------------------------------------- * BOOTSTRAP -- THE RECIPES AND THE MANY USES - Efron''s 'TRIVIAL IDEA' (1979, >30 year anniversary) . It derives from a plain understanding of the sampling paradigm of frequentist statistics. . It is really a generalized 'plug-in' principle. . Mother nature presents us with data y1,...,yn i.i.d. F. . We do not know F, but we can estimate it. Call the estimate 'Fhat'. (Fhat will be discussed later.) This is the 'plug-in' part: Use Fhat as if it were F. . Act as if you were mother nature and Fhat were the true F. That is, sample fake data from Fhat, but do this repeatedly: (y1*1,...,yn*1) # Fake dataset 1 (y1*2,...,yn*2) # Fake dataset 2 (y1*3,...,yn*3) # Fake dataset 3 ... (Repeat, for example, Nboot=1000 times.) Compute statistics T() from y1,...,yn and y1*j,...,yn*j: Simulated bootstrap distribution of T = { T(y1*1,...,yn*1), T(y1*2,...,yn*2), ... } . Uses of these values: + Their SD is used as an estimate of the SE of T(). + Form bootstrap CIs: CI = [ T(Fn)-2*SE, T(Fn)+2*SE ] = '[ T(Fn) +- 2*SE ]' + Caveat: This is the crudest and most assumption-loaded form. - Two ways of estimating Fhat: . Nonparametric Bootstrap: Use Fhat = Fn, the empirical distribution Sample y1*,...,yn* i.i.d. from Fn, that is, sample from (y1,...,yn) WITH REPLACEMENT. . Parametric Bootstrap: Assume a parametric model for the data, such as yi ~ iid N(mu,sigma^2). Estimate the parameters, e.g., m <- mean(y); s <- sd(y) Sample y1*,...,yn* i.i.d. from N(m,s^2). - Comparison of the parametric and nonparametric bootstrap: . Parametric bootstrap may not give you the proper idea of dataset-to-dataset variability if the fitted model overfits. (Think about it!) . Nonparametric bootstrap is vastly more popular when it applies, but the datasets it creates have dups and do not use some cases: P[case 'i' not used in n draws] = (1-1/n)^n -> exp(-1) => Approx fraction of cases used is (1-exp(-1)) ~=~ 0.6321206 or just under 2/3. (After many bootstrap samples, one will have used all data, though.) - Example of non-parametric bootstrap standard errors: linear regression site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/" bodyfat <- read.table(paste(site,"bodyfat.dat",sep="")) names(bodyfat) <- scan(paste(site,"bodyfat.col",sep=""), w="") colnames(bodyfat) plot.plus(bodyfat, gap=0, pch=".") # Note a collinearity, hence exclude the predictor 'Density' Regress 'PercBF' on all other predictors: bf.lm <- lm(PercBF ~ . - Density, data=bodyfat) bf.lm.smry <- summary(bf.lm) bf.lm.smry LS coefficients: bf.b <- bf.lm.smry$coeff[,"Estimate"] signif(bf.b,3) # Show the coeffs to 3 signif. digits Standard errors computed the usual way: bf.b.se <- bf.lm.smry$coeff[,"Std. Error"] signif(bf.b.se,3) Bootstrap simulation: ------------------------------------- N <- nrow(bodyfat) Nboot <- 100000 bf.b.bs <- matrix(NA, ncol=length(bf.b), nrow=Nboot) colnames(bf.b.bs) <- names(bf.b.se) y <- bodyfat[,"PercBF"] X <- model.matrix(bf.lm) for(iboot in 1:Nboot) { sel <- sample(N, replace=T) # check dups: table(table(sel)) # bf.b.bs[iboot,] <- coef(lm(PercBF ~ . - Density, data=bodyfat[sel,])) # so slow... bf.b.bs[iboot,] <- solve(crossprod(X[sel,])) %*% t(X[sel,]) %*% y[sel] # so fast... if(iboot%%500==0) print(iboot) } Bootstrap standard errors: -------------------------------------- bf.b.bs.se <- apply(bf.b.bs, 2, sd) Compare bootstrap with usual standard errors: signif( rbind(boot=bf.b.bs.se, classic=bf.b.se), 3) Ratio comparison: round( bf.b.bs.se / bf.b.se, 3) A non-event for many coeffs: Bootstrap SEs are very close to the usual SEs except for 'Height', 'Weight', 'Ankle', '(Intercept)': sort(round( bf.b.bs.se / bf.b.se, 3), decr=T) - Bootstrap CIs from bootstrap SEs: round( rbind( bf.b - 1.96*bf.b.bs.se, bf.b + 1.96*bf.b.bs.se), 3) - If we trust the Bootstrap CIs more than the traditional CIs, what would be their under/over-coverage? 1 - 2*pnorm(qnorm(.975)*bf.b.se, m=0, s=bf.b.bs.se, lower.tail=F) ^^^^^^^ ^^^^^^^^^^ SE est. 'true' SE PS: This comparison assumes that there is no significant bias in the estimates. If there is bias, then the intervals will be shifted against the truth, and 'm=0' would not be an appropriate assumption... - We can use the bootstrap distribution to diagnose how close our sampling distribution is to normality!!! --------------------------------------- | DIAGNOSTIC FOR ASYMPTOTIC NORMALITY | --------------------------------------- Now this is a curious concept, isn''t it? RECIPE: For each variable, make a normal quantile plot of BS estimates. Decoration: Draw the ideal line on which the quantiles for x=N(0,1) and y=N(b[j],SE^2(b[j])) would fall. par(mfrow=c(4,4), mgp=c(1.5,.5,0), mar=c(1.5,1.5,2,1)) for(j in 1:ncol(bf.b.bs)) { qqnorm(bf.b.bs[,j], pch=16, cex=.3, main=names(bf.b)[j], xlab="", ylab="", cex.main=1) co <- "gray"; lw=.5 abline(v=0, col=co, lwd=lw) # vertical: 0 to mark median abline(h=bf.b[j], col=co, lwd=lw) # horizontal: b[j] # show the normal distribution implied by classic theory: N(b[j], SE(b[j])^2) abline(a=bf.b[j], b=bf.b.se[j], col=co, lwd=lw) # slope: SE(b[j]) abline(h=0, col="red", lwd=lw) # horizontal: 0 to show significance } Note: . 'Height' has a problem with non-normality. . 'Ankle' has a problem with a non-normal lower tail. . 'Weight' has a minor problem in the upper tail. . 'Chest' has a minor problem in the lower tail. . Intercept looks nearly normal but wider than classical. - BIAS ESTIMATION with bootstrap: . Estimate bias = E[estimates|truth] - truth by bias.hat = E[boostrap estimates|estimate] - estimate ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E* = average of bootstrap estimates . Example of bootstrap bias estimates: for all regr coeffs bf.b.bs.bias <- apply(bf.b.bs, 2, mean) - bf.b signif( bf.b.bs.bias, 3) Compare bootstrap bias estimate and bootstrap SE: signif( bf.b.bs.bias / bf.b.bs.se, 3) Sizable biases: Weight (0.16), Chest (-0.19), Ankle (-0.15) Still, these biases are relatively small compared to the SE, which is natural because bias tends to disappear at the rate 1/n whereas the SE shrinks at the slower rate 1/sqrt(n). On the other hand, bias can be non-neglibible for small n. - What impact does bias have on mis-coverage of CIs? 1 - pnorm( qnorm(.975)*bf.b.se, m=-bf.b.bs.bias, s=bf.b.bs.se, lower.tail=T) - pnorm(-qnorm(.975)*bf.b.se, m=-bf.b.bs.bias, s=bf.b.bs.se, lower.tail=F) 1 - 2*pnorm(qnorm(.975)*bf.b.se, m=0, s=bf.b.bs.se, lower.tail=F) Apparently there is no big effect from bias on coverage. This is typical for 2-sided CIs. (Intuition: lose left, gain right) For 1-sided confidence half-axes, the story is different. ================================================================ LECTURE 23, 2011/12/05: * ORG: - Homework 7 is posted ---------------- * ROADMAP: - Recap: Bootstrap - New: . The ultimate bias-correcting, skewness-adapting quantile bootstrap . Smoothing/nonparametric curve estimation . Bias-Variance trade-off . Cross-validated bandwidth choice ---------------- * RECAP AND ELABORATIONS: - What is the main purpose of the bootstrap? ... SE estimation - What is the theoretical justification for the bootstrap? ... asy normality - Describe the nonparametric bootstrap procedure. ... create new datasets, can be applied to any statistics - It would appear that asymptotic normality is not one of those things that can be checked empirically. Yet there is something you can do to check approximate normality in a concrete dataset. Btw, this is approximate normality of what and in what sense? ... - How is bias defined? Bias of what? ... - Give a universal recipe for estimating bias. ... - How does bias affect CIs? ... coverage probs are ok to first order but substantive conclusions might suffer! - What do you have to say about using the non-parametric bootstrap for regression? ... paired bootstrap recreates the predictor variability as well counter to 'holding X fixed' - Practicalities of bootstrap for regression: Don''t use coef(lm(...)), use the triple-X formula instead. It''s much much faster! ---------------- * BOOTSTRAP - CONCLUSION - BOOTSTRAP BIAS CORRECTION/REDUCTION: Even though bias might not matter much for symmetric CIs, bias reduction is an important concept. . Bias-corrected/reduced estimate = estimate - bias.est = T(Fn) - BS.Bias.hat = T(Fn) - ( ave[T(Fn*)] - T(Fn) ) = 2*T(Fn) - ave[T(Fn*)] . This is the typical form of bias correction: ----------------------------------- | Bias-corrected/reduced estimate | | = 2*estimate - ave.est | ----------------------------------- . A related idea: Breiman''s 'bagging' Replace T(Fn) with ave[T(Fn*)] !!!! Why would you want to do this? ==> Makes more sense for tree-based methods. Still, an interesting concept, too. - Further potential problem: non-normality of T(Fn)-distribution Skewness, for example. Imagine T(Fn) is right skewed: T(F) ------|---------|-------------------------|--------- 2.5% 50% 97.5% |-----------------------------------| 95% range of T(Fn)-distribution Q: What interval should you put around T(Fn) to catch T(F)? A: An asymmetric interval that reaches farther left than right to compensate for the fact that T(Fn) tends to fall farther to the right than to the left. ------------------------- - Idea: | QUANTILE BOOTSTRAP CI | ------------------------- Construct CIs not symmetrically from T(Fn) +- 2*SE, but use the direct 2.5% and 97.5% BS quantiles to estimate lower and upper CI bounds. - Statement of the naive BS quantile (!) CI: round(cbind( CI.BS.lo= apply(bf.b.bs, 2, quantile, .025), # bootstrap quantile CI.BS.hi= apply(bf.b.bs, 2, quantile, .975), # " CI.lo = bf.b + qt(.025,df=238)*bf.b.se, # classical SE CI.hi = bf.b + qt(.975,df=238)*bf.b.se), 3) # " CI.BS.lo CI.BS.hi CI.lo CI.hi (Intercept) -57.462 25.190 -52.365 15.988 Age 0.003 0.124 -0.002 0.126 Weight -0.201 0.060 -0.194 0.017 Height -0.403 0.233 -0.259 0.120 Neck -0.928 0.007 -0.929 -0.013 Chest -0.262 0.144 -0.219 0.171 Abdomen 0.795 1.112 0.784 1.125 Hip -0.497 0.090 -0.495 0.080 Thigh -0.034 0.493 -0.048 0.520 Knee -0.486 0.450 -0.461 0.492 Ankle -0.414 0.578 -0.262 0.610 Biceps -0.129 0.515 -0.156 0.519 Forearm 0.063 0.910 0.060 0.844 Wrist -2.603 -0.562 -2.674 -0.567 The naive quantile bootstrap is applicable only if . bias is negligible . bootstrap distribution is approximately symmetric That is, it only solves a symmetric kurtosis problem: Heavier or lighter than normal tails. ==> If you face non-normality and/or bias, don''t use the naive quantile bootstrap. Instead, use the following correct version. --------------------------------------------------- - | A BIAS & SKEWNESS-CORRECTING QUANTILE BOOTSTRAP | --------------------------------------------------- . Approach: Glean from population calculations how to use bootstrap simulation to address both + bias of estimate and + non-normality of dataset-to-dataset ('sampling') distribution. . 'If we were nature, we would know the dataset-to-dataset distribution of b=T(Fn) and hence its upper/lower 2.5% percentiles (qb.lo, qb.hi):' (We use regression slopes b=T(Fn), beta=T(F) as our running example.) P[ qb.lo < b < qb.hi | beta ] = 0.95 ^^^^^ ^^^^^ 2.5% 97.5% quantiles of the dataset-to-dataset distribution of 'b' . Invert a coverage interval for 'b' to a CI for 'beta': P[ qb.lo - beta < b - beta < qb.hi - beta | beta ] = 0.95 Re-express in terms of lower and upper cut-offs c.lo and c.hi: P[ -c.lo < b - beta < c.hi | beta ] = 0.95 Invert the two inequalities: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! P[ b-c.hi < beta < b+c.lo | beta ] = 0.95 ==> Exact confidence interval: --------------------------------- | CI = [ b - c.hi, b + c.lo ] | --------------------------------- where c.hi = qb.hi - beta c.lo = beta - qb.lo ==> Note the inversion of UPPER<-->LOWER bounds between RIs and CIs!!! You don''t notice it when you assume symmetry of the RI. . Bootstrap plug-in estimation of c.hi and c.lo (replacing +-2*SE): beta <-- b qb.hi <-- qb*.hi (97.5% quantile of bootstrap values b*) qb.lo <-- qb*.lo ( 2.5% quantile of bootstrap values b*) c.hi <-- qb*.hi - b c.lo <-- b - qb*.lo ==> Bias-correcting and skewness-correcting BS quantile CI: ------------------------------------------------------------- | [ b - c.hi, b + c.lo ] = [ 2b - qb*.hi, 2b - qb*.lo ] | ------------------------------------------------------------- ^^^^^^^^^^^^^^^^^^^^^^^^^ Reminiscent of bias-corrected estimate RULE: Estimate upper and lower arms of the RI interval, then reverse the arms for the CI. . Grand comparison of three types of confidence bounds {upper, lower} x {BS.QUANTILE, BS.SE, CLASSICAL.SE} ndf <- summary(bf.lm)$df[2] # residual/numerator degrees of freedom options(width=120) # to print long lines round(cbind( q.biasc.l = 2*bf.b - apply(bf.b.bs, 2, quantile, .975), # lo, BS, bias/skew-corrected normal.l = bf.b + qt(.025,df=ndf)*bf.b.bs.se, # lo, BS, normal/t classic.l = bf.b + qt(.025,df=ndf)*bf.b.se, # lo, CLASSICAL, t q.biasc.h = 2*bf.b - apply(bf.b.bs, 2, quantile, .025), # hi, BS, bias/skew-corrected normal.h = bf.b + qt(.975,df=ndf)*bf.b.bs.se, # hi, BS, normal/t classic.h = bf.b + qt(.975,df=ndf)*bf.b.se), # hi, CLASSICAL, t 3) ^^^^^^^^^^^^^^^ ^^^^^ exact multipliers inversion!!! q.biasc.l normal.l classic.l q.biasc.h normal.h classic.h (Intercept) -62.741 -59.883 -52.365 22.457 23.506 15.988 Age 0.001 0.002 -0.002 0.121 0.122 0.126 Weight -0.235 -0.221 -0.194 0.032 0.044 0.017 Height -0.378 -0.346 -0.259 0.279 0.207 0.120 Neck -0.949 -0.930 -0.929 -0.033 -0.011 -0.013 Chest -0.211 -0.239 -0.219 0.216 0.191 0.171 Abdomen 0.793 0.785 0.784 1.132 1.125 1.125 Hip -0.500 -0.489 -0.495 0.062 0.074 0.080 Thigh -0.029 -0.031 -0.048 0.503 0.503 0.520 Knee -0.426 -0.451 -0.461 0.501 0.482 0.492 Ankle -0.259 -0.350 -0.262 0.784 0.698 0.610 Biceps -0.162 -0.149 -0.156 0.496 0.512 0.519 Forearm 0.016 0.030 0.060 0.870 0.874 0.844 Wrist -2.649 -2.653 -2.674 -0.593 -0.588 -0.567 We note: . 'Age' is insignificant w.r.t. the classical CI, but borderline significant (+) w.r.t. all BS CIs. . 'Height' requires a much wider CI under the corrected BS version. * Issues: - Peter Hall says to use the bootstrap distribution of pivotal quantities. - Efron says it''s not that simple: Much depends on whether T(F) is interval scale or ratio scale or anything else. ==> Invents intricate BS corrections. Hall''s use of pivotal quantities is fine for interval scale statistical functionals. * Lit. on bootstrap: - Efron, B. and and Tibshirani, R. (1993) An Introduction to the Bootstrap - Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. - AS papers by Efron and Hall ---------------------------------------------------------------- * CURVE ESTIMATION / SMOOTHING / BANDWIDTH SELECTION / CROSS-VALIDATION - Idea: given x-y data (quantitative-quantitative) . Drop assumption of linearity: y = (a + bx) + eps . Assume smoothness only: y = f(x) + eps, f() smooth . Smooth => locally linear f(x+h) ~=~ f(x) + Df(x)*h + O(h^2) . If we wish to infer f() from data (xi,yi) (i=1,...,n), we have to infer an estimate f(x) from (xi,yi) with xi near x . First try: fhat(x) = ave(yi; |xi-x| fhat(x)=NA Smooth is really not smooth . Variations: running median and quantiles for 50%, 90% prediction bands! plot(xy, pch=16, cex=.5) lines(smooth1(xy, func=median), col="black", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.25)), col="gray50", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.75)), col="gray50", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.05)), col="gray80", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.95)), col="gray80", lwd=3) . 'Real' smoothness: kernel smoothing smooth <- function(xy, bw=1/5, nloc=51) { x <- xy[,1]; y <- xy[,2] xmin <- min(x); xmax <- max(x) xs <- seq(xmin, xmax, length=nloc) fx <- rep(NA, nloc) act.bw <- diff(range(x))*bw # actual bandwidth for(i in 1:nloc) { w <- dnorm(x=(x-xs[i])/act.bw) # same as dnorm(x, m=xs[i], s=act.bw) w <- w/sum(w) # so we have: sum(w)==1 fx[i] <- sum(y*w) } return(cbind(x=xs, y=fx)) } . Apply to same with 4 bandwidths: bws <- c(2, .2, .03, .005) cols <- c("red","orange","green","blue") plot(xy, pch=16, cex=.5) for(i in 4:1) lines(smooth(xy, bw=bws[i]), col=cols[i], lwd=3) Q: What would be a principled way to select the bandwidth? (stay tuned...) . Simulation to show variability of smoothers: bws <- c(2, .2, .03, .005) N <- 100 x <- sort(runif(N)) Nsim <- 200 par(mfrow=c(2,2), mgp=c(1.5,.5,0), mar=c(3,3,1,1)) for(ibw in 1:4) { plot(x=c(0,1), y=c(-2,6), type="n", pch=16, cex=.5, xlab="", ylab="") xy <- cbind(x=x, y=4*x^2 + rnorm(N)) points(xy, pch=16, cex=.5, col="gray30") for(isim in 1:Nsim) { xy <- cbind(x=x, y=4*x^2 + rnorm(N)) lines(smooth(xy, bw=bws[ibw]), col=cols[ibw], lwd=1) } lines(x, 4*x^2, lwd=2) } ================================================================ LECTURE 24, 2011/12/07: * ORG: - Reminder: Homework 7 ---------------- * ROADMAP: - Recap: Bootstrap, Smoothing/Nonparametric Curve Fitting - New: . Bias-Variance trade-off . Cross-validated bandwidth choice . The BIG hammer: Breiman and Friedman''s ACE algorithm ---------------- * RECAP AND ELABORATIONS: - Bootstrap: . Main purpose of BS: ... . Main operational idea: .. . Compare BS-based inference with permutation-based inference: ... [Unfortunately, they often get put in one basket under the label 'resampling methods'.] . The ultimate BS quantile method: + What are the problems it solves? + Describe the idea. + Describe the derivation: Stage 1: ... Stage 2: ... . Other assorted other uses of BS: + ... + ... + ... . Something to think about: The resample size N* is usually assumed to be N. However, one can choose N* to be different from N, so you have a parameter to play with. BS with N* != N obviously produces inference for sample size N*, not N. How would you correct the BS results for N* to obtain inference for N? ... - Smoothing/Nonparametric Curve Fitting: . What is the setup we are considering? ... . Main idea of smoothing we considered? ... . Essential parameter of any smoothing technique? ... . How does this parameter affect the resulting smooths? ... . What is the basic trade-off? ... ---------------- * NEW: SMOOTHING/NONPARAMETRIC CURVE FITTING (contd.) - Lesson: BIAS-VARIANCE trade-off !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Really: bias-SE trade-off (Variance is dataset-to-dataset.) small bandwidth large bandwidth ==> ==> large variance (-) small variance (+) small bias (+) large bias (-) - Mean squared error and bias-variance decomposition: . Setup and definitions: consider one location 'x' of interest Model: y = f(x) + eps E[eps] = 0 BIAS: Bias(fhat(x)) = E[fhat(x)] - f(x) VARIANCE: V(fhat(x)) = E[ (fhat(x) - E[fhat(x)])^2 ] MSE: MSE(fhat(x)) = E[ (fhat(x) - f(x))^2 ] . Bias-variance decomposition of the MSE: --------------------------------------------------- | MSE(fhat(x)) = V(fhat(x)) + Bias(fhat(x))^2 | --------------------------------------------------- Proof: E[ (Z - c)^2 ] = V[Z] + ( E[Z] - c )^2 ==> MSE is an overall measure that trades off bias and variance but... MSE is not known Note: There is nothing specific to smoothing in this. The same identity can be written for arbitrary statistical functionsl T(F): MSE(T(Fn)) = V(T(Fn)) + Bias(T(Fn))^2 | Note, however, that the 'training data' that produce 'fhat(x)' are 2-D: (xi,yi) for i=1,...,N Thus the empirical distribution is Fn(dx,dy) = (1/n) sum delta_(xi,yi) and fhat(x) = T(Fn) where T(Fn) = ave(yi|xi near x) . PMSE or 'predictive mean squared error': PMSE(fhat(x)) = E[ (Y - fhat(X))^2 | X=x ] (Y,X) = future data fhat() = estimate based on past i.i.d. data (xi,yi) future and past data are assumed independent of each other Decomposition: PMSE(fhat(x)) = V[Y|X=x] + MSE(fhat(x)) = V[Y|X=x] + V(fhat(x)) + Bias(fhat(x))^2 ^^^^^^^^ variation around f(x) . Expectation across x-values: E[PMSE(fhat(X))] = E[V[Y|X]] + E[MSE(fhat(X))] ^^^^^^^^^ ^^^^^^^^^^^^^^^ independent criterion one of bandwidth wants to minimize over bandwidths => Minimize expected PMSE instead of MSE ??? how do we estimate the PMSE ??? . CROSS-VALIDATION: + Split data into training and holdout data (9/10 and 1/10, or 2/3 and 1/3). + Fit fhat(x) to training data. + Use fhat(x) to predict y on holdout data. + Estimate of E[PMSE]: ave((yi-fhat(xi))^2 | i in holdout data) + Repeat and average again. Machine Learning convention: Randomly partition data into 10 disjoint tenths; use each tenth as holdout; average over the ten runs. There is evidence that 1/10 is too little for holdout, and 10 repetitions are too few for stable averaging. Hence a probably better rule: Split data randomly into 2/3 and 1/3; use 1/3 as holdout; repeat 100 to 500 times and average. . R code for cross-validation of 'smooth()': cv.once <- function(bw, xy, holdout) { train <- !holdout # assumed logical, not enumeration xy.train <- xy[train,] # unpacking x.holdout <- xy[holdout,1] # " y.houldout <- xy[holdout,2] # " sm <- smooth(xy.train, bw=bw) # fhat on training f.houldout <- approx(sm, xout=x.holdout, rule=2)$y # fhat on holdout return(mean((y.houldout-f.houldout)^2)) # PMSE^ from holdout } xy <- as.matrix(boston[,c("LSTAT","MEDV")]) cv.once(.1, xy, c(rep(T,100),rep(F,406))) cv <- function(bw, xy, holdout=1/3, times=100, verbose=50) { pmses <- rep(NA, times) n <- nrow(xy); nholdout <- round(n*holdout) holdout.template <- c(rep(T,nholdout), rep(F,n-nholdout)) # to be permuted randomly for(icv in 1:times) { pmses[icv] <- cv.once(bw, xy, holdout=sample(holdout.template)) if(verbose>0) if(icv%%verbose==0) cat(icv,"") } return(pmses) } cv(bw=.1, xy) . Obtain CV estimates of the PMSE for a large number of bandwidths: bws <- 1.25^((-25):(+5)) pmse.bws <- rep(NA, length(bws)) for(i in 1:length(bws)) { pmse.bws[i] <- mean(cv(bw=bws[i], xy, verbose=0)) cat("----- done with bandwidth",bws[i],"(",i,")--------------\n") } Plot PMSEs as a function of bandwidth: par(mfrow=c(1,1)) plot(bws, pmse.bws, type="h"); lines(bws, pmse.bws) Need to take the log of the bandwidths or, equivalently, the order: plot(pmse.bws, type="h", xlab="log bw"); lines(pmse.bws) => In prediction, bias can kill, variance cannot The PMSE profile indicates that the 8th or 9th bandwidth was best: plot(xy, pch=16, cex=.5) lines(smooth(xy, bw=bws[6]), col="red", lwd=3) lines(smooth(xy, bw=bws[7]), col="orange", lwd=3) lines(smooth(xy, bw=bws[8]), col="brown", lwd=3) lines(smooth(xy, bw=bws[9]), col="green", lwd=3) lines(smooth(xy, bw=bws[10]), col="blue", lwd=3) lines(smooth(xy, bw=bws[11]), col="turquoise", lwd=3) But the 11th is probably better visually... => The bias-variance trade-off generated by PMSE minimization with CV is not visually optimal; follows data too closely => Good MSE asks for less bias/more variance than our visual judgment General conclusion: -------------------------------------------------- | Good performance in prediction (MSE) requires | | following the data more closely | | i.e., less bias | | whereas | | good interpretability requires | | less complexity | | i.e., less variance | -------------------------------------------------- Can be transferred to regression and model selection because CV and PMSE minimization can be applied to model search as well. General conclusion transferred to regression: - for prediction: when in doubt, keep it in - for interpretability: when in doubt, throw it out ---------------- * 2D-SMOOTHING: - Two predictors x1 and x2. - Local averaging: . Define 'windows' in the x1-x2 plane Dilemma: Not only what size, but what shape of window? ==> We use Euclidean neighborhoods after standardizing x1 and x2. ==> Can be done by using bandwidths prop. to sd(x1) and sd(x2). . Need double-grid of points for evaluation. . R code: smooth2d <- function(xxy, bw1=1/3, bw2=bw1, nloc1=21, nloc2=nloc1) { x1 <- xxy[,1]; x2 <- xxy[,2]; y <- xxy[,3] # unpack x1min <- min(x1); x1max <- max(x1) # for grid x2min <- min(x2); x2max <- max(x2) # dito x1s <- seq(x1min, x1max, length=nloc1) # x1 grid x2s <- seq(x2min, x2max, length=nloc2) # x2 grid fxx <- matrix(NA, nrow=nloc1, ncol=nloc2) # fits: nloc1*nloc2 act.bw1 <- sd(x1) * bw1 # x1 bandwidth act.bw2 <- sd(x2) * bw2 # x2 bandwidth for(i in 1:nloc1) { # loop over all combinations.. for(j in 1:nloc2) { # of x1 and x2 grid values w <- dnorm(x=x1-x1s[i],sd=act.bw1) * # weights: product of dnorm(x=x2-x2s[j],sd=act.bw2) # Gaussian kernels fxx[i,j] <- sum(y*w) / sum(w) } } return(list(x=x1s, y=x2s, z=fxx)) # return the two grids and the fits } # note: length(z)=length(x)*length(y) xxy <- boston[,c("RM","LSTAT","MEDV")] sm2d <- smooth2d(xxy) Perspective plots, first with hidden lines shown: not very good... persp(sm2d) With hidden lines removed and the surface colored and shaded: persp(sm2d, ltheta=0, lphi=45, shade=0.9, col="orange") Animate to search for informative views: for(iplt in 0:10000) { persp(sm2d, theta=0.1*iplt, phi=15+iplt*0.02, r=sqrt(3), shade=0.5, col="yellow", main=iplt, xlab="RM", ylab="LSTAT", zlab="MEDV") -> res points(trans3d(xxy[,1], xxy[,2], xxy[,3], pmat = res), col=2, pch=16, cex=.5) } We added the 3D points, not very successfully, because we can''t tell which are above and which below the surface. - [Fun project: Make this movie interactive with 'getGraphicsEvent()' by controlling 'theta' and 'phi' with 'onMouseDown=...' and maybe 'r' with 'onKeybd=...' to move in and out.] - Inference: How much dataset-to-dataset uncertainty? ==> BS xxy <- boston[,c("RM","LSTAT","MEDV")] rg.xxy <- apply(xxy, 2, range) N <- nrow(xxy) fac <- 10 iplt <- 1 # this and next line enables restart where we left off iplt <- 167 for(iplt in iplt:10000) { sm2d <- smooth2d(xxy[sample(N, replace=T),]) persp(sm2d, theta=fac*0.1*iplt, phi=15+fac*iplt*0.02, r=sqrt(3), shade=0.5, col="yellow", main=iplt, xlab="RM", ylab="LSTAT", zlab="MEDV") -> res points(trans3d(xxy[,1], xxy[,2], xxy[,3], pmat = res), col=2, pch=16, cex=.5) Sys.sleep(.05) } - Lessons: . 2D smoothing is somewhat less successful than 1D smoothing. . Beware of extrapolation in 2D: Lot''s of empty space . 3D smoothing: multiply the problems... ---------------------------------------------------------------- * ACE: ESTIMATING RESPONSE AND PREDICTOR TRANSFORMATIONS IN MULTIPLE REGRESSION - Big Hammer: ACE (Alternating Conditional Expectations) by Breiman and Friedman (JASA, 1985) - Idea 1: The concept -- optimize the following... cor(g(Y), f1(X1) + ... + fp(Xp)) = max_{g,f1,...,fp} - Idea 2: The computational realization -- optimize one at a time, then cycle... |[g(Y)- f2(X2) - ... - fp(Xp)] - f1(X1)|^2 = min_f1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^ partial residuals ... ... |[g(Y)- f1(X1) - ... - f{p-1}(X{p-1})] - fp(Xp)|^2 = min_fp |[f1(X1) + ... + fp(Xp)] - g(Y)|^2 = min_g Then cycle back and repeat... - Idea 3: Estimate f1, ..., fp, g non-parametrically with smoothing. - Odds and ends: The algorithm converges to ... Renormalizing g(Y) to sd(g(Y))=1 in each cycle prevents collapse and grants convergence (depending on the type of smoothers used). - Software: 'acepack' in R . Used to be discontinued at CRAN; available from the ARCHIVE. It is available again, but in case it isn''t, here is a local copy: http://stat.wharton.upenn.edu/~buja/STAT-541/acepack_1.3-2.2.zip Download in local directory where you run R, then install.packages(pkgs="acepack_1.3-2.2.zip") . In one swoop: install.packages("acepack_1.3-2.2.zip", repos=NULL, lib.loc="http://stat.wharton.upenn.edu/~buja/STAT-541/") . In each session, before you use 'acepack', call this: library(acepack) . Help: Old style help works, even though the Archive says it doesn''t. help(ace) - Data example: Boston Housing data boston.ace <- ace(x=boston[,-14], y=boston[,14]) [The code is old and predates the formula language in R/S, which is why you enter oldfashioned x and y arguments.] The returned data structure has the transformation evaluated at the observed predictor values: names(boston.ace) str(boston.ace) boston.ace$rsq The original data are in ...$x and ...$y while their transformations are in ...$tx and ...$ty. Here is a layout of the plots of the transformations: windows(width=3*3, height=5*3) par(mfrow=c(5,3), mgp=c(1.5,0.2,0), mar=c(2.5,2.5,.5,.5), tck=-.02) ylim <- range(boston.ace$tx) # shared range of the fi(Xi) pch <- 18; cex <- .7; cex.lab <- .7; cex.axis <- .7 # plot g(Y) plot(boston.ace$y, boston.ace$ty, xlab="MEDV", ylab="g(MEDV)", pch=pch, cex=cex, cex.lab=cex.lab, cex.axis=cex.axis) for(i in 1:13) plot(boston.ace$x[i,], boston.ace$tx[,i], # plot fi(Xi) ylim=ylim, xlab=dimnames(boston.ace$tx)[[2]][i], ylab="", pch=pch, cex=cex, cex.lab=cex.lab, cex.axis=cex.axis) [Note the idiosyncrasy of storing x as pxn but tx as nxp.] . Conclusions: + The response is untransformed for all practical purposes. This is in contrast to the log-transform suggested by the Box-Cox exercise, which, however, left the predictors untransformed! + Among the predictor transforms RM and LSTAT are dominant: RM has a curious transform: it seems to dip first, but only slightly, which is why it probably wants to stay flat first and then to shoot up starting with about 6.5 rooms. LSTAT has a convex transform, which we might want to model with a quadratic transformation. . A good use of ACE is to read off analytic forms for the transforms, then revert to linear modeling with the analytically transformed variables. Example: summary(lm(MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + I(NOX^2) + I(pmax(RM-6.5,0)) + DIS + RAD + TAX + PTRATIO + B + log(LSTAT), data=boston)) as opposed to: summary(lm(MEDV ~ ., data=boston)) The most creative transformation is for RM (average number of rooms): f(RM) = max(RM-6.5,0), which is a kink function: zero up to 6.5 and linear therafter. The ACE plot would suggest a slightly negative slope for RM<6.5, but I thought this might be spurious. The kink function has a clear interpretation: at the level of aggregation of census tracts, the average number of rooms RM does not matter for RM small; only for RM>6.5 do we see a linear effect. That is, if the houses are small on the average, the median house price is not affected by the degree of smallness; but if the houses are larger on the average, the degree of largeness matters. In census tracts with small houses, everything else matters more: LSTAT above all. Besides a kink transformation of RM, we added quadratic transformations of NOX and log of LSTAT, again based on the ACE plot. It turns out that NOX^2 is insignificant. I was first suspicious that NOX and NOX^2 are too collinear (correlated) with each other, which might mask their joint explanatory power. I decorrelated NOX^2 from NOX by subtracting the mean: I((NOX-mean(NOX))^2), but it didn''t help, still insignificant, so I drop this term in the next model. The linear model fitted above would suggest removing a few variables: ZN, INDUS, NOX^2. summary(lm(MEDV ~ CRIM + CHAS + NOX + I(pmax(RM-6.5,0)) + DIS + RAD + TAX + PTRATIO + B + log(LSTAT), data=boston)) Indeed, with adjusted R^2=0.83, we are pretty well off. Remaining Q: How much do the transforms solve the diagnostics problems? windows(width=12, height=12) par(mfrow=c(2,2), mgp=c(1.5,0.5,0), mar=c(3,3,2,1)) plot(lm(MEDV ~ CRIM + CHAS + NOX + I(pmax(RM-6.5,0)) + DIS + RAD + TAX + PTRATIO + B + LSTAT + I(LSTAT^2), data=boston), pch=16, cex=.5, cex.lab=1.2, cex.id=1.2) Even though we improved the fit, we still have the problem of Boston proper and the truncation at $50K. ================================================================