================================================================

Chapter 2: BASICS OF LINEAR MODELS

* LINEAR MODELS: THE OLS PROBLEM

- LS criterion: RSS

  . In what follows we will play with OLS from the ground up.
    This is a conceptual exercise, not a recommendation for practice.
    In most contexts you will use the function lm() for OLS fitting.
    However, when speed matters, you might want to use the linear algebra below.

  . In R: a numeric toy example of LS

      RSS <- function(b) { sum((y - X %*% b)^2) }   # What's bad about this code?
      N <- 5
      X <- cbind(rep(1,N), 1:N)
      y <- X[,1] + 2*X[,2]                          # No 'error' in this case

    Getting at the coefficient estimates the hard way:

      nlm(RSS, c(0,0))$estimate    # nlm() is a minimizer in R
      optim(c(0,0), RSS)$par       # optim() is another minimizer in R

  . This is not how it is done, of course, but general code has some nice
    aspects: it is easily cannibalized for different situations.  Such as:

      L1 <- function(b) sum(abs(y - X %*% b))
      nlm(L1, c(0,0))$estimate
      optim(c(0,0), L1)$par

    Seems to work even though the L1 criterion is not differentiable.

  . In math:  RSS(b) = |y - X b|^2

  . LS estimate:  b = argmin RSS(b) = minimizer of RSS(b)
    (Justifications of LS? ...)

  . NORMAL EQUATIONS as stationary equations of RSS:

      gradient RSS(b) = 2*( X^T X b - X^T y ) = 0

           -----------------------
      ==>  | (X^T X) b = X^T y   |
           -----------------------

    [This is a special case of a moment condition:
     (X^T X)/N and (X^T y)/N estimate 2nd moments: E[x x^T] and E[x y].
     ==> Moment estimators are defined by moment conditions.
     Other names for X^T X b - X^T y = 0:
       . score equations
         ['scores' are the negative gradients: -(X^T X b - X^T y)]
       . estimating equations ]

  . Explicit LS solution:

      b = (X^T X)^{-1} X^T y        [Does not work when ...]

    In R:

      b <- solve(t(X) %*% X) %*% t(X) %*% y
      b                             # Correct

    [NOT the numeric method of choice; stay tuned for a HW.]

    Alternative using ginv() in MASS:

      library(MASS)
      ginv(t(X) %*% X) %*% t(X) %*% y

    This should be numerically more stable; not a problem here.

- Residuals and fits:

  . Notation:  yhat = X b
               r    = y - yhat
               RSS  = |r|^2          # sum(r^2)

  . Normal equations can be rewritten this way:

      X^T r = 0        # 0 = X^T y - (X^T X) b = X^T (y - X b)

    ==> Uniquely characterizes the LS solution.
    Geometric meaning: ...           # see Homework 1

  . In R:

      yhat <- X %*% b
      r    <- y - yhat

- LS and orthogonal projections:

      yhat = X b = ( X (X^T X)^{-1} X^T ) y = H y
      r    = y - yhat = ( I - H ) y

  where H and I-H are orthogonal projections in n-space.

  . Definition of orth. proj.: ...   # see Homework 1

  . H is also called the "hat matrix" because it "puts the hat on y" (Tukey).
    [Notation: In math, one likes to use "P" as the symbol for projections,
     but for us "P" stands for probability, hence we prefer "H".]

- Visualization/animation of the geometry of LS: 2 predictors (p=1), n=3

      source("http://www-stat.wharton.upenn.edu/~buja/STAT-961/LS-geometry.R")

- What are the statistical properties of 'b', 'yhat' and 'r'?
  . What are their expected values?
  . What are their variances and covariances?

- 1st order properties: need expectation vectors of random vectors

      E[y] = mu = X beta    This is assumed!!!  What is the meaning of this? ...
      E[yhat] = ...
      E[r]    = ...
      E[b]    = ...

- 2nd order properties: need covariance matrices of random vectors

      V[y] = sigma^2 * I    (I = identity of size nxn)
                            Again, this is assumed!!!  What is the meaning of this? ...
      V[yhat] = ...
      V[r]    = ...
      V[b]    = ...

- The role of X:
  . We will follow convention in statistics and treat X as fixed.
  . Strictly speaking, the above expressions should be written
    E[y|X], E[yhat|X], E[b|X], V[y|X], V[yhat|X], ...
  . Fixed-X will allow us to do a lot of theory.
  . We mentioned an ancillarity argument to justify conditioning on X
    when X is truly random.
  . We also mentioned that the ancillarity argument depends on model correctness.
    ==> What we currently pursue is a 'model-trusting, assumption-rich theory'.
        There also exists a 'model-robust, assumption-lean theory'.  Stay tuned.
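- A minimal R sketch (reusing the toy X from above, with a little noise added so that
  r is not 0; object names are ad hoc) to check the projection facts numerically:
  H is symmetric and idempotent, H y reproduces X b, and X^T r = 0 up to rounding.

      N <- 5
      X <- cbind(rep(1,N), 1:N)
      y <- X[,1] + 2*X[,2] + rnorm(N)         # toy response with some 'error' this time
      b <- solve(t(X) %*% X) %*% t(X) %*% y   # explicit LS solution (not the numeric method of choice)
      H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
      yhat <- H %*% y
      r <- y - yhat
      all.equal(H, t(H))                      # H is symmetric
      all.equal(H %*% H, H)                   # H is idempotent (orthogonal projection)
      all.equal(c(yhat), c(X %*% b))          # H y = X b
      max(abs(t(X) %*% r))                    # normal equations: X^T r = 0 (up to rounding)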
================================================================

* VECTOR EXPECTATIONS AND VARIANCE/COVARIANCE MATRICES

- In preparation for linear models analysis we define vector expectations
  and variance/covariance matrices.

  Confusing fact:
  . In multivariate analysis, all variables are random.
      Data set = iid replications of a random vector
  . In linear models theory of regression, only the response is random,
    the regressors are not.
      Data set = cbind(random y, fixed xj vectors)

- Let us think the way of multivariate analysis for now, hence assume:

      Y ~ Ix1 random vector
      Z ~ Jx1 random vector
      with joint distribution ("observed on the same objects")

  Ex.:  Y = (Height, Weight, Age, ...) of a person
        Z = (English Grade, Math Grade, SAT score, ...) of the same person

  The Y- and Z-variables will generally be 'associated'
  (i.e., correlated, but not only in a linear sense).

- EXPECTATION:

  . Definition:  E[Y] = vector of E[Yi]  (i=1...I)

  . Estimation:  In Multivariate Analysis one assumes N i.i.d. draws yn from Y,
                 where yn ~ Ix1 = n'th sample.
                 Estimate of E[Y]:  ybar = (sum_n yn)/N ~ Ix1
                 Unfortunately, in regression we only see one draw y of Y.
                 Do you see what's confusing? ...

  . Affine transformation: if A ~ JxI, c ~ Jx1, both constant, then

        E[AY + c] = A E[Y] + c

    Proof: Write out the components of AY and apply E[].
    Q: Why do we need this?
    A: Because we will analyze objects such as
         b    = (X^T X)^{-1} X^T y
         yhat = H y
         r    = (I-H) y

  . Algebra: Assuming Y and Z are both Ix1, then

        E[a*Y + b*Z] = a*E[Y] + b*E[Z]

    This property is called linearity of E[...].

- COVARIANCE:

  . Definition:  V[Y,Z] := E[ (Y-E[Y])(Z-E[Z])^T ]
                 V[Y]   := V[Y,Y]

  . Components:  V[Y,Z]_ij = ...
                 V[Y]_ij   = ...
                 V[Y]_ii   = ...

  . What are the ranks of (Y-E[Y])(Z-E[Z])^T and (Y-E[Y])(Y-E[Y])^T ?
    Hence a covariance matrix is an average of ...
    Is it hence also a ...?  Why or why not? ...

  . Sizes:  V[Y,Z] is IxJ,  V[Y] is IxI

  . Generalizing the univariate formulas
    V[X] = E[X^2] - E[X]^2  and  V[X1,X2] = E[X1*X2] - E[X1]*E[X2] :

        V[Y,Z] = E[ Y Z^T ] - E[Y] E[Z]^T
        V[Y]   = E[ Y Y^T ] - E[Y] E[Y]^T

  . Algebra: Assuming Y, Z are both Ix1 and U is Kx1:

        V[Y+Z,U] = V[Y,U] + V[Z,U]

    Consequence:

        V[Y+Z] = V[Y] + V[Y,Z] + V[Z,Y] + V[Z]    # !!! Watch out !!!

    Can you continue and write the middle two terms as 2*V[Y,Z] ? ...

  . Affine transformation:
    Recall: Y (Ix1) and Z (Jx1) are random vectors.
    Assume fixed matrices A, B and fixed vectors c, d of legit sizes:
    A (mxI), B (nxJ), c (Ix1), d (Jx1)

        ------------------------------
        | V[AY,BZ]   = A V[Y,Z] B^T  |
        |                            |
        | V[AY]      = A V[Y] A^T    |   Butterfly or SANDWICH formula
        |                            |
        | V[Y+c,Z+d] = V[Y,Z]        |
        ------------------------------

    Exercises:
    ~ Specialize the middle formula to a linear form, i.e., A = a^T, where a (Ix1) is fixed.
    ~ Specialize further to a = rep(1/I,I).

  . Multivariate Analysis: Estimation of V[Y]
    * Given N i.i.d. vector samples yn drawn from the random vector Y:

          Vhat[Y] = sum_n ( (yn-ybar)(yn-ybar)^T ) / (N-1)

    * Affine transformation:  zn = A yn + c

          Vhat[z] = A Vhat[y] A^T
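- A minimal R sketch (made-up A, c, and simulated draws of a made-up random vector Y)
  to check the affine-transformation rules empirically:
  E[AY + c] = A E[Y] + c and the sandwich formula V[AY + c] = A V[Y] A^T.

      set.seed(1)
      I <- 3;  J <- 2;  Ndraws <- 100000
      Y <- matrix(rnorm(Ndraws*I), nrow=Ndraws) %*%
           matrix(c(1,.5,0, .5,1,.3, 0,.3,1), I, I)   # rows = draws of a correlated Ix1 random vector
      A  <- matrix(c(1,0, -1,2, 1,1), nrow=J, ncol=I) # fixed JxI matrix (made up)
      cc <- c(10, -5)                                 # fixed Jx1 shift (made up)
      Z <- t(A %*% t(Y) + cc)                         # rows = draws of AY + c
      colMeans(Z);  A %*% colMeans(Y) + cc            # expectations agree (up to simulation error)
      cov(Z);       A %*% cov(Y) %*% t(A)             # sandwich formula (up to simulation error)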
================================================================

* LINEAR MODELS:

- In the linear models theory of regression we observe only one single y-vector,
  but we will make assumptions about its expectation and covariance
  and examine what follows from them.
  [Whether the assumptions are realistic for a given dataset
   needs to be checked with diagnostics.]

- Linear model assumptions:

  . Convention: y is Nx1 (not Ix1)
    Compare asymptotics in
      Multivariate Analysis:   y1,...,yN are Ix1 and N-->Inf
      Linear Models Analysis:  y is Nx1, and N-->Inf

  . Model in vector/matrix language:

        y = beta0 + beta1*x1 + beta2*x2 + ... + betap*xp + eps
          = X beta + eps                   where beta is (p+1)x1

    Here y, xj, eps are Nx1 (response, predictors, errors, resp.)
    X = cbind(1,x1,x2,...,xp) is Nx(p+1)

    Note:
    + The response vector y is a random vector -- in spite of lower-case notation!
    + Predictors xj are NOT random but fixed and known
      (set experimentally, or observed but conditioned on).
    + The 'error' vector 'eps' is a random vector.
      ['Error' is a misleading term; there is nothing wrong with 'eps'.
       We often use the term 'noise' instead.]

  . Stochastic assumptions:

    --------------------------------------------------------------------------
    | E[eps] = 0          : On average [across possible worlds, universes,    |
    |                       datasets] the model is 1st order                  |
    |                       "correct"/"unbiased".                             |
    | V[eps] = sigma^2 * I: Uncorrelated errors [across ...],                 |
    |                       constant variances ("homoscedastic errors")       |
    |                       (2nd order correct)                               |
    --------------------------------------------------------------------------

  . Equivalently:

    ----------------------
    | E[y] = X beta      |   Why?
    | V[y] = sigma^2 * I |   ''
    ----------------------

    [Distinguish: "unbiased model" vs "unbiased estimator"; see below]

  . Recap fundamentals:
    Q: What variability do E[...] and V[...] refer to?
    A: Variability ...
    Spell it out: What is the variability of y[1], y[2], ..., y[N] ?

================================================================

* LINEAR MODEL ANALYSIS, STEP 1: VARIABILITY OF b

- Reminder: The LS estimate is  b = (X^T X)^{-1} X^T y

- Type of variability in b: ...
  b = (p+1)x1 random vector varying from response vector y to response vector y...

- Making the linear model assumptions, it follows:

      E[b] = ...

  Actually, which part of the assumptions was used? ...
  => 'Unbiased' estimator of beta

- Making the linear model assumptions, it follows:

      V[b] = ... = ...

  Actually, which part of the assumptions was used? ...
  NOTE: V[b] is conditional on X (if X is random...)

- What variability does V[b] describe? ...

- What is the interpretation of the diagonal elements of V[b]? ...

- Could one hope to estimate V[b]? ...

- What would such estimates be good for? ...

- [There exist estimators of V[b] that avoid making the assumption V[y] = sigma^2 I.
   Instead they estimate V[y] with diag(r^2), i.e., a diagonal matrix with the
   squared residuals in the diagonal.  This results in the 'sandwich estimator
   of standard error', which has become common in econometrics since
   Hal White's work in 1980+.]
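- A minimal R sketch of the idea in the last bullet (an HC0-type sandwich variance,
  written from the verbal description above; the function names are made up, and this
  is an illustration, not a recommended implementation):

      vcov.classical <- function(X, y) {          # assumes V[y] = sigma^2 I
        b  <- solve(t(X) %*% X, t(X) %*% y)
        r  <- y - X %*% b
        s2 <- sum(r^2) / (nrow(X) - ncol(X))
        s2 * solve(t(X) %*% X)
      }
      vcov.sandwich <- function(X, y) {           # replaces sigma^2 I by diag(r^2)
        XtXinv <- solve(t(X) %*% X)
        b <- XtXinv %*% t(X) %*% y
        r <- c(y - X %*% b)
        XtXinv %*% (t(X) %*% (r^2 * X)) %*% XtXinv   # 'bread * meat * bread'
      }
      ## Example use (made-up data):
      ## X <- cbind(1, runif(100));  y <- rnorm(100, X %*% c(1,2), 1)
      ## vcov.classical(X, y);  vcov.sandwich(X, y)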
================================================================

* SIMULATION EXAMPLE: simple linear regression with intercept
  ==> X is a 1-predictor matrix, hence of size Nx2.

  ## R:
  N <- 50                              # Picking a sample size
  X <- cbind(x0=rep(1,N), x1=runif(N)) # Making up a predictor matrix
  sigma <- 1                           # Assuming a 'true' error variance
  beta <- c(b0=1, b1=2)                # Assuming a 'true' coeff vector
  y <- rnorm(N, m=X%*%beta, s=sigma)   # Simulating a response vector
  b <- solve(t(X)%*%X)%*%t(X)%*%y;  b

  ## Simulate and animate parallel universes/possible worlds/datasets ...:
  for(i in 1:500) {
    y <- rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector
    b <- solve(t(X)%*%X) %*% t(X) %*% y
    plot(x=X[,2], y=y, pch=16, col="gray20", ylim=c(-2,6))
    abline(b, lwd=3, col="red")
    dev.flush()
    Sys.sleep(0.5)
  }
  ## ==> We see the DATASET-TO-DATASET (sampling) variability.

  ## Next, simulate many response vectors (possible worlds, parallel universes,...)
  ## and store the results:
  Nsim <- 10000
  bs <- matrix(NA, nrow=Nsim, ncol=length(beta))       # Storage for coeff estimates
  colnames(bs) <- paste("b",0:(length(beta)-1),sep="") # Cosmetics
  for(isim in 1:Nsim) {
    y <- rnorm(N, m=X%*%beta, s=sigma)                 # Simulating a response vector
    bs[isim,] <- solve(t(X)%*%X)%*%t(X)%*%y
    if(isim%%100==0) cat(isim,"...")
  }

  ## Plot slopes versus intercepts to illustrate the 'sampling' or
  ## 'dataset-to-dataset' variability of b:
  plot(bs, pch=16, cex=.5)
  ## ==> Negative correlation between b0 and b1 !!!

  ## Compare:
  solve(t(X)%*%X)*sigma^2   # V[b]
  cov(bs)                   # simulation estimate of the previous
  ## Interpretation:
  ## . cov(bs) is a simulation approximation to V[b].
  ## . solve(t(X)%*%X)*sigma^2 is the theoretically obtained formula for V[b].
  ## ==> They better be close!!!

  ## Further intuitions:
  ## Visualize the simulated lines and the simulated distribution of 'b':
  windows(height=5, width=10)
  par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,1,1))   # Set up two plot panels.
  # Panel 1: Show the last y only, faintly, because this is irrelevant.
  plot(cbind(x=X[,2],y=y), pch=16, cex=1, col="gray", xlim=c(-.5,1.5), ylim=c(-2,6))
  # Panel 1: Draw the estimated lines for 100 simulations.
  for(i in 1:100) abline(bs[i,])
  # Panel 2: Show a scatterplot of slopes versus intercepts:
  plot(bs, pch=16, cex=.5)

  ## Questions:
  ## . How does the negative correlation on the right
  ##   relate to the lines on the left?
  ##   ...
  ## . If we had done the simulation with a uniform
  ##   x-distribution on [-1,+1] instead of [0,1],
  ##   what would the correlation between intercepts and slopes be?
  ##   Alternative R code:  X <- cbind(x0=rep(1,N), x1=runif(N,-1,+1))
  ##   ...

================================================================

* LINEAR MODEL ANALYSIS, STEP 2: VARIABILITY OF yhat AND r

- There is variability in 'b' ... dataset-to-dataset
  => There is variability in 'yhat' and 'r'.

      yhat = X betahat = X b
      r    = y - yhat = y - X betahat = y - X b

- Strangeness about the randomness in yhat and r:
  Even though both are N-dimensional,
  . yhat can only range in ...   dim=...
  . r    can only range in ...   dim=...

- Let, as usual:  H = X (X^T X)^{-1} X^T

  . First order:   E[yhat] = ...
                   E[r]    = ...
    What assumptions are used? ...
    What assumptions are not used? ...

  . Second order:  V[yhat]   = ...
                   V[r]      = ...
                   V[yhat,r] = ...
    What assumptions are used? ...
    What assumptions are not used? ...

- Special cases:
  . Diagonal elements:      Var[yhati] = ...
                            Var[ri]    = ...
  . Off-diagonal elements:  Cov[yhati,yhatk] = ...
                            Cov[ri,rk]       = ...
  . Cov[yhati,rk] = ...

- What variability is being described in the above equations? ...

- Implications: Assume the errors (noise) in the Nx1 vector eps = y - X beta
  are homoskedastic and uncorrelated.
  . The residuals are generally ... (homo/heteroscedastic?)
  . The residuals are generally ... (un/correlated?)
  . Pairs of distinct fits and corresponding pairs of residuals
    are generally correlated with the ...[same/opposite?] sign.
  . The variance of residuals is ... [>/< sigma^2 ?]
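- A minimal R sketch (made-up X, beta, sigma) that simulates many response vectors and
  compares the simulated covariances of yhat and r with sigma^2 H and sigma^2 (I-H);
  the diagonal of H shows why the residuals are heteroscedastic even when the errors are not.

      set.seed(2)
      N <- 20;  sigma <- 1
      X <- cbind(1, runif(N));  beta <- c(1, 2)
      H <- X %*% solve(t(X) %*% X) %*% t(X)
      Nsim <- 20000
      yhats <- rs <- matrix(NA, Nsim, N)
      for (i in 1:Nsim) {
        y <- rnorm(N, m=X %*% beta, s=sigma)
        yhats[i,] <- H %*% y
        rs[i,]    <- y - yhats[i,]
      }
      max(abs(cov(yhats) - sigma^2 * H))              # small (simulation error only)
      max(abs(cov(rs)    - sigma^2 * (diag(N) - H)))  # small (simulation error only)
      range(diag(H))                                  # Var[r_i] = sigma^2*(1-H_ii) varies with i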
- Traces:

      tr(H)   = p+1 = sum H_ii     # ==> average size of H_ii = ...
      tr(I-H) = N-(p+1)
      tr(I)   = N

  'Trick': diagonalize H

- Trace formulas or 'total variance' formulas:
  Def.: total variance of a random vector Z = sum_i V[Z_i] = tr(V[Z])

      sum V[y_i]    = sigma^2 N    = sigma^2 tr(I)
      sum V[yhat_i] = tr(V[yhat])  = sigma^2 tr(H)    = sigma^2 (p+1)
      sum V[r_i]    = tr(V[r])     = sigma^2 tr(I-H)  = sigma^2 (N-(p+1))

- Interpretation of the trace formulas:
  The more predictors, the ...[more/less] variability in yhat.
  The more predictors, the ...[more/less] variability in r.
  NOTE:
  . This is independent of whether the model is 1st order correct/unbiased or not!
    We only used the second order assumptions: homoskedasticity and uncorrelated errors.
  . This is independent of whether we use good or bad predictors.
    Whether a predictor is good or bad, its inclusion moves one case worth of
    variation (sigma^2) from r to yhat (and hence b).

- 'Degrees of freedom': dimensions of the subspaces in which yhat and r can range

      dfs(fits)  = p+1
      dfs(resid) = N-(p+1)
      dfs(y)     = N

- [Quirky question: Unlike for y, yhat, r, why might we NOT be interested in the
   total variance of b?  Think like a physicist: What are the units of tr(V[b])
   compared to those of tr(V[y]), tr(V[yhat]), tr(V[r])? ... ]

================================================================

* ESTIMATING THE ERROR VARIANCE sigma^2:

- Recall from ~ Stat 102:

      RMSE = sigmahat = sqrt( RSS / (N-p-1) )

  RMSE^2 is an unbiased estimate of sigma^2,  E[RMSE^2] = sigma^2,
  as we will now show.

- Heuristic: RMSE^2 is RSS up to a factor 1/(N-p-1),
  hence look at E[r_i^2] first, then sum up.

- If the model is ... order correct, then

      E[ r_i ]   = ...
      E[ r_i^2 ] = ...

  Which of the 1st and 2nd order model assumptions are used??? ...

      E[RSS] = ...

  Therefore:  sigma^2 = ...
  Hence an unbiased estimate of sigma^2 is  sigmahat^2 = ...

- PS: We would really like E[RMSE] = sigma, but this is not true,
  and there is no way to generate an unbiased estimate of sigma.
  Hence one resorts to what is mathematically doable:
  showing that E[RMSE^2] = sigma^2.

  PPS: But there is an inequality: Jensen applied to f(x)=x^2 (convex)
  gives us the following:

      E[X^2] >= E[X]^2

  Apply this to X=RMSE:

      E[RMSE^2] >= E[RMSE]^2

  But E[RMSE^2] = sigma^2, hence

      sigma^2 >= E[RMSE]^2
      sigma   >= E[RMSE]

  ==> RMSE is an under-estimate of sigma !!!
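- A minimal R sketch (made-up X, beta, sigma) illustrating both statements:
  the average of RMSE^2 across simulated datasets is close to sigma^2,
  while the average of RMSE itself falls slightly but systematically below sigma.

      set.seed(3)
      N <- 20;  p <- 1;  sigma <- 1
      X <- cbind(1, runif(N));  beta <- c(1, 2)
      rmse <- replicate(10000, {
        y <- rnorm(N, m=X %*% beta, s=sigma)
        r <- resid(lm(y ~ X - 1))
        sqrt(sum(r^2) / (N - p - 1))
      })
      mean(rmse^2)   # close to sigma^2 = 1
      mean(rmse)     # slightly below sigma = 1   (Jensen)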
================================================================

* SAMPLING PROPERTIES OF THE 'FIXED' PREDICTOR MATRIX X:

- We derived:     V[b] = (X^T X)^{-1} sigma^2     (the 'double-X matrix')
  Compare with:   V[Xbar] = sigma^2 / N   for iid X1,...,XN   (Stat 101)
  Q: Why don't we see the increased precision in V[b] as N grows?

- Even though statisticians condition on X, we now consider the sampling
  properties of X, assuming the rows of X are iid samples from a random vector
  'xrow' with multivariate distribution P(dx1,dx2,...,dxp):

      X = rbind(xrow1,xrow2,...,xrowN)
      xrowi ~ P(dx1,dx2,...,dxp)   iid for i=1,...,N

  ==> Point of view of multivariate analysis applied to predictors/regressors.

- Interpretation of X^T X in terms of xrow1,...,xrowN:

      X^T X = sum of rank-one matrices formed from the xrowi   (HW1)
            = xrow1 xrow1^T + xrow2 xrow2^T + ... + xrowN xrowN^T
            = sum_i xrowi xrowi^T

  (We think of xrowi as a column, though it's a row of X.)

  Q: Does X^T X converge to something if we increase N?
  A: No, but it does if we divide it by ... !!!!

      (X^T X)/N = sum_{i=1...N} xrowi xrowi^T / N

  Reason: Vector/matrix version of the LLN.
  The rank-one matrices (xrowi xrowi^T) are iid (i=1...N),
  and the LLN applies to the (j,k) entries of the matrices.

  Q: What is the limit?
  A:  sum_i (xrowi xrowi^T) / N  -->  E[xrow xrow^T] =: C[xrow]

  Note that C[xrow] = E[xrow xrow^T] is NOT equal to
            V[xrow] = E[xrow xrow^T] - E[xrow] E[xrow]^T.
  Call it the "2nd-moment matrix" instead.
  Think of it as the "uncentered variance/covariance matrix" of the random vector 'xrow'.
  We assume that C[xrow] is non-singular.

- Conclusions:

        (X^T X / N)       -->  C[xrow]        (LLN)
    ==> (X^T X / N)^{-1}  -->  C[xrow]^{-1}
      = (X^T X)^{-1} * N  -->  C[xrow]^{-1}
    ==> (X^T X)^{-1}       ~   C[xrow]^{-1} / N

  Now we are seeing 1/N in V[b]:

      ---------------------------------------
      |                                     |
      |   V[b]  ~  sigma^2 C[xrow]^{-1} / N |
      |                                     |
      ---------------------------------------

  . Recap the thought process: As N -> Inf, what happens to the following?

        (X^T X)          --> ...
        (X^T X)^{-1}     --> ...
        X^T X / N        --> ...
        ( X^T X/N )^{-1} --> ...

  . Q: What does this formula say about the precision of estimates of regression
       coefficients in relation to the distribution of the predictor vectors?
    A: We gain precision of estimation with more data at the rate 1/N
       in terms of DATASET-TO-DATASET VARIANCES
       (and hence 1/sqrt(N) for dataset-to-dataset sderrs).

- The weirdness about V[b] = sigma^2 (X^T X)^{-1} :
  . Conditionally on X, V[b] is a fixed covariance matrix,
    but because of the conditioning it should be written 'V[b|X]'.
  . Unconditionally, V[b|X] = sigma^2 (X^T X)^{-1} is a random matrix
    if the X matrix is observational and hence random.
    ('Observational' means: If we realistically repeated the data collection,
     we would end up with a different X.  This is different from designed
     experiments where we control X and are able to set the values of X.
     In repeats of a designed experiment, X is the same.)
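- A minimal R sketch (x1 ~ Uniform(0,1), a made-up design) of the LLN argument:
  (X^T X)/N stabilizes as N grows, while X^T X itself does not.

      set.seed(4)
      Cxx <- function(N) { X <- cbind(1, runif(N));  t(X) %*% X / N }
      Cxx(100);  Cxx(10000);  Cxx(1000000)   # settling down as N grows
      rbind(c(1, 1/2), c(1/2, 1/3))          # the limit C[xrow] = E[xrow xrow^T] for this design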
================================================================

* "ADJUSTMENT" AND ITS ALGEBRA:

- Intuitive example: 'Explain' students' school performance.
  . Factors of interest: Measures of school quality
    (class size, training of teachers, teacher salaries, physical plant, ...)
    ==> Collect these predictors in a matrix X1.
  . Factors NOT of interest in the study but which cannot be ignored: Demographics
    (household income, home zip code, education level of parents, ...)
    ==> Collect these predictors in a matrix X2.
  . Outcome variables: SAT scores, grades, graduation vs dropping out, ...
  . Intuition: Need to ADJUST the variables of interest for demographics.

- Partitioning multiple linear regression:

  Partition the predictor matrix:   X = (X1,X2)    (Nxp1, Nxp2)
      X1 = predictors of primary interest
      X2 = non-ignorable predictors not of primary interest
  Partition the LS estimator:       b^T = (b1^T,b2^T)    (p+1 = p1+p2)

      y = X b + r = X1 b1 + X2 b2 + r

  Compare with the naive approach:

      y = X1 b1* + r*

- Q: How are b1 and b1* related to each other?
  A: In general we only know that b1 and b1* are not the same.
     Rough reason: The following problems produce different b1 coefficients:

         | y - (X1 b1 + X2 b2) |^2 = min_b1,b2
         | y - (X1 b1*) |^2        = min_b1*

- Define orthogonal (LS) projections according to the partitions:

      H  = projection onto the column space of X
      H1 = projection onto the column space of X1
      H2 = projection onto the column space of X2

  (H1, H2 are NOT partitions of H; all are NxN !!!)
  Question: When is H = H1 + H2 ?

- DEFINITION:  adjustment = residualizing operation
               I-H2 = "adjustment operator" for X2
                      ('regresses out' demographics, e.g.)

      ------------------------------------------------
      |                                              |
      |  y.2 := (I-H2) y   is  "y adjusted for X2"   |
      |                                              |
      ------------------------------------------------

  . Algebraically: y.2 is a residual vector in a model with predictors X2.
  . Intuitively:   y.2 has the variation due to X2 removed from y;
                   variation due to X2 ~ "explained"/"accounted for" by X2.
  . Practical example from above:
      X1  = predictors describing school quality
      X2  = predictors describing demographics
      y   = performance on SAT
      y.2 = performance on SAT, adjusted for demographics

- Some 'higher trivialities': Algebra and geometry of adjustment

  . If y = X1 b1 + X2 b2 + r is the LS decomposition based on both X1 and X2,
    note the following: r and the columns of X are orthogonal to each other,
    hence r and the columns of X1, X2 are ...
    ==> r is 'adjusted' for both X1 and X2.

  . Practice:
      H X        = ...  X
      H X1       = ...  X1
      H X2       = ...  X2
      H r        = ...  0
      (I-H) r    = ...  r
      (I-H) X    = ...  0
      H1 r       = ...  0
      H2 r       = ...  0
      (I-H1) r   = ...  r
      (I-H2) r   = ...  r
      (I-H1) X1  = ...  0
      (I-H2) X2  = ...  0
      (I-H2) X1  = ...

  . Upcoming surprise: We will need X1 adjusted for X2 !!!!

    DEFINITION:
    -------------------------------------------------------------------------
    |  X1.2 := (I-H2) X1                                                    |
    |        = Collection of predictors of interest X1 (school quality)     |
    |          adjusted for                                                 |
    |          non-ignorable but uninteresting predictors X2 (demographics).|
    -------------------------------------------------------------------------

  . r is orthogonal to X1.2:

      r^T X1.2 = row vector of inner products of r with the columns of X1.2
               = r^T (I-H2) X1
               = r^T (I-H2)^T X1
               = ((I-H2)r)^T X1
               = r^T X1
               = 0

    because r^T X = 0 is the normal equation, meaning that r is orthogonal
    to all columns of X, hence orthogonal to the subset of columns in X1.
    This should not be a surprise: Residualizing/orthogonalizing X1 wrt X2
    is an operation that stays entirely within the column space of X=(X1,X2).
    Related facts:
      cspan(X1.2) is a subspace of cspan(X1,X2)=cspan(X).
      cspan(X1.2) = orthogonal complement of cspan(X2) within cspan(X1,X2)=cspan(X).

- Adjust the LS decomposition y = X1 b1 + X2 b2 + r for X2
  by applying I-H2 to both sides:

      (I-H2) y    =  (I-H2) X1 b1  +  r            (*****)
      ^^^^^^^^       ^^^^^^^^^        ^
      X2-adjusted    X2-adjusted      Remains fixed under
      response y     predictors X1    X2-adjustment

  Recall the notation:
      y.2  = (I-H2) y    'y adjusted for X2'
      X1.2 = (I-H2) X1   'X1 adjusted for X2'

  From the orthogonality of X1.2 and r in (*****) we infer
  that this is an LS decomposition:

      -------------------------------------
      |                                   |
      |    y.2 = X1.2 b1 + r              |    (****)
      |                                   |
      |    where b1 is the LS estimator   |
      |    for the X1 coefficients in the |
      |    JOINT regression onto X=(X1,X2)|
      |                                   |
      -------------------------------------

  It solves

      | y.2 - X1.2 b1 |^2 = min_b1                 (***)

  The solution b1 of (***) is the same as that of

      | y - X1 b1 - X2 b2 |^2 = min_b1,b2          (**)

- Interpretation of b1: To get the LS regression coefficients b1
  in the joint regression of y on X=(X1,X2), one can do the following:

      --------------------------------------------------------
      |                                                      |
      |  Regress the X2-adjusted response,  y.2  = (I-H2)y,  |
      |  on the X2-adjusted X1 predictors,  X1.2 = (I-H2)X1. |
      |                                                      |
      --------------------------------------------------------

  Note: The adjustment formalism is not meant for computations
        but for conceptual understanding.  Stay tuned!
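- A minimal R sketch (made-up X1, X2, y) checking the boxed claim: the X1-coefficients
  from the joint regression equal the coefficients of y.2 regressed on X1.2.

      set.seed(5)
      N  <- 200
      X1 <- cbind(runif(N), runif(N))                 # 'predictors of interest' (made up)
      X2 <- cbind(1, rnorm(N), X1[,1] + rnorm(N))     # intercept + 'nuisance' predictors, correlated with X1
      y  <- drop(X1 %*% c(1,-1) + X2 %*% c(2,.5,.5) + rnorm(N))
      b.joint <- coef(lm(y ~ X1 + X2 - 1))            # joint regression on (X1,X2)
      y.2  <- resid(lm(y  ~ X2 - 1))                  # y adjusted for X2
      X1.2 <- resid(lm(X1 ~ X2 - 1))                  # columns of X1 adjusted for X2
      b.adj <- coef(lm(y.2 ~ X1.2 - 1))
      cbind(joint=b.joint[1:2], adjusted=b.adj)       # identical up to rounding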
================================================================

* APPLICATION 1 OF ADJUSTMENT: CENTERED REGRESSION

- We usually consider the intercept as a nuisance.
  With adjustment we can get rid of it:

      X1 = all non-intercept predictors ('substantive regressors')
      X2 = e = (1,...,1)^T, the intercept 'regressor'

- What does adjustment for e do?

      H2 = e e^T / |e|^2 = e e^T / N

      H2 y = e (e^T y)/N = mean(y) e
               ^^^^^^^^^
                mean(y)

      (I - H2) y = y - mean(y) e = centered y vector

- The adjusted LS decomposition (****) above,

      y.2 = X1.2 b1 + r ,

  means that the substantive LS estimates can be obtained from a
  'centered regression' with the intercept omitted.

================================================================

* APPLICATION 2 OF ADJUSTMENT: REDUCTION OF A MULTIPLE TO A SIMPLE REGR COEFF

- Consider: a single predictor adjusted for all else

      X1 = single column xj                    (Nx1)
      X2 = all columns of X except column xj   (Nxp)

  => The adjusted equation (****) amounts to a simple linear regression
     without intercept.

- Simplified notation:

      y.adj  = (I-H2) y    (y adjusted for all xk except xj)
      xj.adj = (I-H2) xj   (xj adjusted for all xk except xj)
      bj     = slope of the simple lin. regr. of "y adjusted for all but xj"
               onto "xj adjusted for all other predictors"

  [Reminder: The simple regr coeff w/o icept of y on x is  b = <y,x>/|x|^2 .]

      -------------------------------
      |                             |
      |        < y.adj, xj.adj >    |
      |  bj = ------------------    |   ("adjustment formula" for bj)
      |           |xj.adj|^2        |
      |                             |
      -------------------------------

  Reminder: bj = j'th MULTIPLE regression coefficient (estimate)

- Interpretation: bj is the simple lin regression coefficient
  when regressing y.adj on xj.adj w/o intercept.

- Elementary consequence: Does it matter for the value of bj
  if we add or drop some of the other regressors? ...

- Some weirdness: Adjusting y is actually not necessary.  Adjusting xj is.
  Here is how you see it:

      Numerator of the adjustment formula
        = < y.adj, xj.adj >
        = < (I-H2)y, (I-H2)xj >
        = < y, (I-H2)xj >              (Why? Why? Why?)
        = < y, xj.adj >

      ---------------------------
      |                         |
      |        < y, xj.adj >    |
      |  bj = --------------    |   (the "weird adjustment formula" for bj)
      |         |xj.adj|^2      |
      |                         |
      ---------------------------

- Interpretation 1: bj is the simple lin regression coefficient
  when regressing y on xj.adj w/o intercept.

- Interpretation 2: bj is a linear function of y !!!

      bj = a^T y     ("contrast")     where

             xj.adj
      a = ------------
           |xj.adj|^2

  If we condition on X, then 'a' is a constant vector.
  Can you smell a way to get at a standard error for bj?
  [See the section after next.]
  [We know this already because the whole vector b of coefficient estimates is
       b = (X^T X)^{-1} X^T y
   which is a linear mapping of y (in R^N) to b (in R^{p+1}).]
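- A minimal R sketch (made-up x1, x2, y) checking the "weird adjustment formula"
  and the contrast representation bj = a^T y for a single coefficient:

      set.seed(6)
      N  <- 100
      x1 <- runif(N);  x2 <- runif(N) + x1            # x2 correlated with x1 (made up)
      y  <- 1 + 2*x1 - 1*x2 + rnorm(N)
      b1.lm  <- coef(lm(y ~ x1 + x2))["x1"]           # multiple regression coefficient of x1
      x1.adj <- resid(lm(x1 ~ x2))                    # x1 adjusted for the intercept and x2
      b1.adj <- sum(y * x1.adj) / sum(x1.adj^2)       # <y, x1.adj> / |x1.adj|^2
      a      <- x1.adj / sum(x1.adj^2)                # the 'contrast' vector:  b1 = a^T y
      c(b1.lm, b1.adj, sum(a * y))                    # all three agree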
================================================================

* ADJUSTED VARIABLES PLOT OR "LEVERAGE PLOT":

- A plot of y.adj vs. xj.adj shows what the LS criterion "sees"
  when determining bj.

- Example: Fetch data from the class website; after downloading to folder 'Data'...

    cars <- read.table("http://www-stat.wharton.upenn.edu/~buja/STAT-961/CarModels2003-4-NA.TXT",
                       header=T, sep=",", na=".")
    names(cars);  dim(cars)
    rownames(cars) <- paste(1:nrow(cars), as.character(cars[,"Model"]))
    apply(cars, 2, function(x) sum(is.na(x)))  # there is an NA in "MPG.Highway"
    str(cars)                                  # Ooops
    cars[,"MPG.Highway"]
    cars[,"MPG.Highway"] <- as.numeric(cars[,"MPG.Highway"])

    # Select non-NAs:
    sel <- !is.na(cars[,"MPG.Highway"]) & !is.na(cars[,"Weight.lbs"]) & !is.na(cars[,"Horsepower"])

    # Multiple regression of the response on two predictors:
    cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=cars[sel,])

    # Adjusted response and other predictor:
    mpg.adj <- resid(lm(MPG.Highway ~ Horsepower, data=cars[sel,]))
    wgt.adj <- resid(lm(Weight.lbs  ~ Horsepower, data=cars[sel,]))

    # Compare coeff of Weight: same!!  "proof" by example...
    lm(mpg.adj ~ wgt.adj);  cars.lm
    plot(mpg.adj ~ wgt.adj, pch=16, cex=.5);  abline(lm(mpg.adj ~ wgt.adj))

- "Leverage Plot" because it shows points that have high leverage for bj
  (every bj can have its own leverage points).

- Canned function for leverage plots with optional point identification:

    source("http://www-stat.wharton.upenn.edu/~buja/STAT-961/leverage-plots.R")
    leverage.plots(model=cars.lm, ident=c(1,2), frame.sz=5)  # stop with right click

  [This function allows you to identify funny points; right-click to stop.]

- Another use of leverage plots: predictor variables shrink when adjusted.
  Leverage plots show the shrinkage in the horizontal width.
  One can then read off what estimated average difference in the response
  corresponds to the adjusted range.

- Note to future TAs of Stats classes:
  The JMP software provides leverage plots with all regression outputs.

- Ultimate final question: Why does mpg.adj take on negative values? ...

- Final question: What happens to the intercept vector under adjustment?
  In other words, is x0.adj still a constant column??? ...
  Example: the icept vector 'adjusted' for weight and horsepower

    hist(resid(lm(rep(1,nrow(cars)) ~ .-1, data=cars[,c("Weight.lbs","Horsepower")])),
         main="", xlab="", col="gray30", breaks=20)

================================================================

* VARIANCE/STDERR OF bj:

- Note about terminology:
  We call 'standard error' (SE) the true sdev of an estimate (across ...).
  We call 'standard error estimate' (SE.est) the estimated sdev of an estimate.
  (Most people say 'standard error' and mean 'standard error estimate'.)

- Reminder: We already know the SE of bj from V[b].

      V[bj] = ( (X^T X)^{-1} )_jj sigma^2

- Goal: Obtain Var[ bj ] from the adjustment formula for bj.
  Recall the variance of a linear combination: 'butterfly'/'sandwich' with A=a^T

      Var[ <a,y> ] = Var[ a^T y ] = ...

  Use a = xj.adj:

      Var[ <xj.adj,y> ] = ...

  To obtain V[bj]: Be careful and handle the denominator correctly.

      Var[ bj ] = ...

                      ...
      Var[bj] = ---------------
                      ...

  Finally, SE[bj] = sqrt(Var[bj]):

      -----------------------
      |                     |
      |            sigma    |   "Adjustment formula"
      |  SE[bj] = --------  |   for standard errors
      |           |xj.adj|  |   of regression coeffs
      |                     |
      -----------------------
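- A minimal R sketch (made-up data; sigma replaced by its estimate sigmahat = RMSE)
  comparing the adjustment formula for the standard error with the lm() output:

      set.seed(7)
      N  <- 100
      x1 <- runif(N);  x2 <- runif(N) + x1
      y  <- 1 + 2*x1 - 1*x2 + rnorm(N)
      fit      <- lm(y ~ x1 + x2)
      sigmahat <- summary(fit)$sigma                  # RMSE
      x1.adj   <- resid(lm(x1 ~ x2))                  # x1 adjusted for all other regressors
      c(formula = sigmahat / sqrt(sum(x1.adj^2)),     # sigmahat / |x1.adj|
        lm.out  = summary(fit)$coefficients["x1", "Std. Error"])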
- Consequences for statistical inference:
  . What does |xj.adj| mean statistically speaking?
    Try this for j>0 (exclude the x0.adj):
        |xj.adj|/sqrt(N-1) = ...
  . Use the previous sub-bullet to rewrite the above magic formula for SE[bj]
    and make the root-N law visible, assuming the regressor rows are randomly sampled:
        SE[bj] = ...
  . Do we like small or large values of |xj.adj| ? ...
  . How do you "see" |xj.adj|/sqrt(N-1) in the leverage plot? ...
  . What happens to |xj.adj| if we add a predictor/regressor
    that is highly correlated with xj? ...
  . See Mosteller & Tukey, p.326f, on the proxy problem:
    If we have mother's education in the regression equation,
    what do you think is going to happen to SE(b_mother)
    if we add father's education level to the equation? ...
  . How would you measure how collinear xj is with all other predictors?
    See next!

================================================================

* VARIANCE INFLATION FACTOR (VIF) -- BETTER: SE INFLATION FACTOR (SEIF)

- Prelude: When is there no adjustment effect?
  . General case: y = X1 b1 + X2 b2 + r
    When is X1.2 = X1 ?
    ...  X1 _|_ X2
  . Single case: X1 = xj, X2 = all other predictors
    When is xj.adj = xj.0 ?
    (We always adjust for the intercept, that is, centering.)
    ...  xj.0 is orthogonal to xk.0 for all k (!= j)
  . What inequality is there between |xj.adj| and |xj.0| ?
    |xj.adj| <= |xj.0|

- When is there the least loss in precision of bj from collinearity?
  ...  |xj.adj| = |xj.0|  <==>  xj.adj = xj.0

- What is SE[bj] in this case?  Call it SE.best[bj].
  ...  SE.best[bj] = sigma / |xj.0|

- Definition: Standard error and variance inflation factors

      SEIF[bj] := SE.actual[bj] / SE.best[bj]
      VIF[bj]  := SEIF[bj]^2

- Facts:

      SEIF[bj] = (sigma/|xj.adj|) / (sigma/|xj.0|)

      ------------------------------
      |                            |
      |              |xj.0|        |
      |  SEIF[bj] = ----------     |
      |             |xj.adj|       |
      |                            |
      ------------------------------

- Fact: The lower bound on SEIF is  SEIF[bj] >= ...

- AYT?  Is the SEIF (and VIF) dependent on the units of xj? ...
  If xj has units of Celsius, does SEIF[bj] change under conversion to Fahrenheit? ...

- SEIF in R: Ignoring the SEIF for the intercept,
  here is a way to get the SEIFs for all slopes:

      sqrt(diag(solve(cor(X))))
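- A minimal R sketch (made-up slope regressors; no intercept column in 'Xs',
  matching the use of cor() above) computing the SEIFs both ways:

      set.seed(8)
      N  <- 200
      x1 <- runif(N);  x2 <- runif(N) + x1;  x3 <- rnorm(N)
      Xs <- cbind(x1=x1, x2=x2, x3=x3)                # slope regressors only
      seif.direct <- sapply(colnames(Xs), function(j) {
        xj.0   <- Xs[,j] - mean(Xs[,j])                          # xj adjusted for the intercept only
        xj.adj <- resid(lm(Xs[,j] ~ Xs[, colnames(Xs) != j]))    # xj adjusted for intercept and all other xk
        sqrt(sum(xj.0^2) / sum(xj.adj^2))                        # |xj.0| / |xj.adj|
      })
      rbind(direct=seif.direct, via.cor=sqrt(diag(solve(cor(Xs)))))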
================================================================

* HAT DIAGONAL ELEMENTS AS MEASURES OF OUTLYINGNESS IN PREDICTOR SPACE

- General concept: Given a symmetric positive definite matrix S, think of it
  as defining a squared distance from the origin, or squared norm, as follows:

      |x|_S^2 = x^T S x

  Using a root matrix R of S, i.e., S = R^T R, you can define new coordinates
  xnew = R x, and you will find:

      |xnew|^2 = xnew^T xnew = x^T R^T R x = x^T S x = |x|_S^2
      ^^^^^^^^
      Euclidean

  So, yes, x^T S x is a squared norm modulo a linear distortion.

- Recall the hat matrix:  H = X (X^T X)^{-1} X^T

- For a row written as a column, xrowi = (x_i0=1, x_i1, x_i2, ..., x_ip)^T, we have:

      H_ii = xrowi^T (X^T X)^{-1} xrowi

  ==> H_ii is the squared norm of the i'th case in predictor space
      wrt the norm defined by S = (X^T X)^{-1}

- More intuition about "outlyingness" in regressor space:

  . Write:  X = (x0=e, x1, ..., xp) of size Nx(p+1)
            e = (1,...,1)^T of size Nx1

  . Form the "mean-adjusted matrix" Z:
        Z = (x1.0, ..., xp.0)   (Nxp) matrix of centered real regressors (j>0)
    AYT?  Can the range of column xj.0 be [100,200]?

  . Facts:
        H_e  = e e^T / N     [Homework 1: What is H_e x ?  H_e x = mean(x) e]
        xj.0 = (I - H_e) xj
        H_X  = H_e + H_Z     [Homework 1: e _|_ xj.0 for j=1,...,p]

  . Consequence: Let
        z{i} = row i of Z written as a column (in R^p)
        z{k} = row k of Z written as a column (in R^p)

        Hik = 1/N + z{i}^T (Z^T Z)^{-1} z{k}
        Hii = 1/N + z{i}^T (Z^T Z)^{-1} z{i}

  . Geometric interpretation: In the metric defined by (Z^T Z)^{-1},
    Hii - 1/N is the squared distance of z{i} from the origin.

- Consider a fixed location z0 in R^p and define a POSITIVE QUADRATIC FORM in z0 by:

      H(z0) := z0^T (Z^T Z)^{-1} z0

  Recall from HW1:  Z^T Z = sum_{i=1...N} z{i} z{i}^T

  As N --> Inf, what happens to:
  . H(z0)   --> ... ?   [LLN assuming the z{i} have a multivariate distribution]
  . N*H(z0) --> ... ?
  Does this make sense?  Think in terms of the 'self-influence' of location z
  as N-->Inf and formulate a rule for its decay rate. ...

================================================================

* ANOVA DECOMPOSITIONS, R SQUARE, AND E[SS]:

- Q: How much "variation" does the model given by X "account for"?

  . Approach: Use sums of squares (SS) as measures of "variation",
    but first remove the uninteresting variation: the mean.

  . Preliminaries: recall the "mean-adjusted" LS decomposition

        y.0 = x1.0 * b1 + ... + xp.0 * bp + r = yhat.0 + r

    obtained from the full equation by centering:

        y = x0 * b0 + x1 * b1 + ... + xp * bp + r

    [Note that r is centered, but yhat is not, hence we need yhat.0.]
    Unlike the real predictors, the intercept is not considered "explanatory".
    Therefore, get rid of the intercept once and for all.

- Pythagorean decomposition of orthogonal projections:

      y.0 = b1*x1.0 + ... + bp*xp.0 + r   (The mean-adjusted LS decomposition again)
      y.0 = yhat.0 + r                    (On the RHS are orthogonal vectors.)

  ==> |y.0|^2 = |yhat.0|^2 + |r|^2        Pythagorean theorem

      -----------------------------------
      |        TSS = MSS + RSS          |   'ANOVA decomposition' (the simplest form)
      -----------------------------------
      Total SS = Model SS + Residual SS

- Definition of RSquare ("R2"):

  .         MSS     |yhat.0|^2
    R2  := ----- = ------------ = "Fraction of variation accounted for by the model"
            TSS      |y.0|^2      "                     --explained--              "

  . Fact:
      ----------------------
      | R2 = cor(y,yhat)^2 |  = cor(y.0,yhat.0)^2
      ----------------------
    + Note: 'cor()' implies mean-adjustment and standardization.
    + Proof:
        cor(y,yhat)^2 = <y.0,yhat.0>^2 / (|y.0|^2 * |yhat.0|^2)
                      = <yhat.0,yhat.0>^2 / ...
                      = ...                                        QED

- Can we calculate E[TSS], E[RSS] and E[MSS] theoretically?

  . We won't have use for this in the future, but there are two reasons
    for doing this math exercise:
    + ANOVA (really the analysis of designed experiments with highly structured X)
      is permeated by calculations of expected values of sums of squares (E[SS]).
    + The calculations are insightful in their own right because they describe
      sources of variation between datasets.

  . Assume the model is unbiased and has uncorrelated homoscedastic errors:

        y = X beta + eps   with  E[eps]=0, V[eps] = sigma^2 I
        ==> E[y] = X beta

  . The smallest model is the mean model:  y = beta0 x0 + eps   (x0 = 1 (Nx1))
    We compare this model with the full model.

        TSS = |(I-H0)y|^2   (total SS in the data other than the mean)
        MSS = |(H-H0)y|^2   (SS captured by the model other than the intercept)
        RSS = |(I-H) y|^2   (SS not captured by the model)

    where H is the projection onto the space spanned by x0,x1,...,xp
    and H0 onto the vector x0.

  . Note: I-H0, H-H0, I-H are all orthogonal projections,
    with ranks N-1, p, N-p-1, respectively.
    Fact: H-H0 = H(I-H0)

  . We will reduce the derivations of E[TSS], E[MSS], E[RSS] to the following
    generic facts about projections of spherically distributed random vectors eps
    (to first and second order): E[eps] = 0 and V[eps] = sigma^2 I;  y = mu + eps.

  . FACT 1: For any orthogonal projection H we have the following.

        E[ |H eps|^2 ] = E[ eps^T H eps ]
                       = E[ sum_{ij} eps_i H_ij eps_j ]
                       = sum_{ij} H_ij E[ eps_i eps_j ]
                       = sum_i H_ii sigma^2
                       = sigma^2 tr(H)

  . FACT 2: Consider the vector 'eps' shifted away from the origin, y = mu + eps.

        E[ |H (mu + eps)|^2 ] = E[ |H mu|^2 + mu^T H eps + eps^T H mu + |H eps|^2 ]
                              = |H mu|^2 + sigma^2 tr(H)

    The neat aspect is that this holds for ANY orthogonal projection H
    and any shift vector 'mu'.
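  . A minimal R sketch (made-up X, mu, sigma) checking FACT 2 by simulation:
    E[ |H(mu+eps)|^2 ] should match |H mu|^2 + sigma^2 tr(H).

        set.seed(9)
        N <- 30;  sigma <- 1.5
        X  <- cbind(1, runif(N), rnorm(N))
        H  <- X %*% solve(t(X) %*% X) %*% t(X)     # any orthogonal projection will do
        mu <- drop(X %*% c(1, 2, -1)) + rnorm(N)   # an arbitrary shift vector
        sims <- replicate(20000, sum((H %*% (mu + rnorm(N, 0, sigma)))^2))
        c(simulated = mean(sims),
          theory    = sum((H %*% mu)^2) + sigma^2 * sum(diag(H)))   # tr(H) = p+1 = 3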
  . APPLICATION 1: To find E[TSS], use the projection I-H0.

        E[TSS] = E[ |(I-H0)y|^2 ] = |(I-H0) X beta|^2 + sigma^2 (N-1)

  . APPLICATION 2: To find E[MSS], use the projection H-H0,
    where H projects onto cspan(X).

        E[MSS] = E[ |(H-H0) y|^2 ] = |(I-H0) X beta|^2 + sigma^2 p

  . APPLICATION 3: To find E[RSS], use the projection (I-H).

        E[RSS] = E[ |(I-H) y|^2 ] = sigma^2 (N-p-1)     because (I-H) X = 0

  . CONCLUSIONS: If y = X beta + eps, then

        E[TSS] = | (I-H0)X beta |^2 + sigma^2 (N-1)
        E[MSS] = | (I-H0)X beta |^2 + sigma^2 p
        E[RSS] =                      sigma^2 (N-p-1)

- An Engineering View of Linear Regression: Signal-to-Noise Ratio in Estimation

  The estimation part is encapsulated in yhat.0; its 'energy' is MSS.
  We obtained a decomposition for E[MSS]:

        E[MSS] =  | (I-H0)X beta |^2  +  sigma^2 p
                  ^^^^^^^^^^^^^^^^^^     ^^^^^^^^^
      Sources:    model, 'signal'        error, 'noise'

  This motivates the definition of the "signal-to-noise ratio":

              | (I-H0) X beta |^2       / Ratio of model variation to noise variation
        SN = --------------------   =  |  in the fitted values,
                  sigma^2 p             \ excluding the uninteresting mean variation

  Q: Do we know the SN in actual data analysis? ...
  Q: Find estimates for the numerator and denominator!
     ... (denominator)  s^2 p              (unbiased? yes)
     ... (numerator)    | (I-H0) X b |^2   (unbiased? no)
     SNhat = ...        (ratio)            Is the ratio also unbiased? ... probably not

- We calculated E[MSS] and E[TSS]; we defined R2 = MSS/TSS.
  Q: Can we calculate E[R2] theoretically? ...
  Q: Can we express R2 in terms of SNhat, and vice versa? ...

================================================================