================================================================

LECTURE 7:


* ORG:

  - The video recording of Lecture 6 is worthless.
    Please study the posted notes: 'Lecture06.txt'

  - HW 3 due on Friday: analysis of the English dictionary
    . Re Problem 4: Answer exactly as stated -- don't be surprised
      if the plural makes no sense.
    . If you want to use 'grep()' to search for capitalized words,
      the following does not work because the match seems to be
      case-insensitive (character ranges such as [A-Z] are
      locale-dependent in R):
        grep("^[A-Z]", dict, value=T)
      This works, though:
        grep("^[[:upper:]]", dict, value=T)
      Another solution uses
        substring(dict,1,1)


* RECAP

  - First order properties of yhat and r:
      E[yhat] = ...
      E[r]    = ...

  - Second order properties of yhat and r:
      V[yhat]   = ...
      V[r]      = ...
      V[yhat,r] = ...

  - Pick special cases in terms of components of the above three equations.
    . Translate into English.
    . Point out surprising and meaningful cases: ...

  - Sampling properties of (X^T X):
    . Interest stems from V[b] = sigma^2 (X^T X)^{-1}
      Q: Where is the root-N law?  V[b] should shrink at the rate 1/N.
    . For more meaning, assume: mean(y)=0, mean(xj)=0 (j=1..p)
      and drop the intercept.
        ==> X is of size ...x...
    . Also assume the rows of X are iid samples from a joint distribution in R^p.
        ==> (X^T X) / (N-1)    -->    ...
            ^^^^^^^^^^^^^^^ estimates ^^^
            By what law? ...
    . Result:
        V[b] ~ sigma^2 (X^T X / (N-1))^{-1} / (N-1)  -->  ...
        at the rate ... as N-->Inf.
    . Why is there talk about 'root-N' convergence? ...
      (1/N for variances, 1/sqrt(N) for sdevs)


* ROADMAP:

  - Self-influence
  - Standardized residuals
  - Degrees of freedom in linear models and dimensions of subspaces
  - Estimating sigma^2
      Reminder: Two purposes for estimating sigma^2
        1) ... prediction
        2) ... stat inference: CIs, tests
  - The algebra of 'adjustment' and
    a magic formula for LS estimates of multiple regression coeffs


================================================================

* SELF-INFLUENCE AND THE LS PROJECTION:

  - Diagonal elements of P, P_ii, are called "self-influence"; why?

      yhat_i = sum_j Pij yj = Pii yi + sum_{j!=i} Pij yj
                              ^^^^^^
    Pii determines how much y_i determines yhat_i,
    i.e., how much the observation determines its own fit.

  - Properties of the self-influence values Pii:
      0 <= Pii <= 1    (See HW4)

  - When might Pii be large? (HW4) ...


================================================================

* STANDARDIZED RESIDUALS:

  - Residuals are generally heteroscedastic:
      V[ri] = (1-Pii) sigma^2

  - Criticism of standard residual plots:
      Plotting apples and oranges on top of each other
      because V[ri] is not the same for all i.

  - Standardized residuals:
      ri* = ri / sqrt(1-Pii)
      V[ri*] = sigma^2

  - Distinguish from 'studentized residuals':
      ri** = ri / sqrt((1-Pii)*sigmahat^2)
    (However, terminology does not seem to be uniform:
     Some authors call the latter 'standardized residuals'.)
    (Note: It is NOT true that E[ri**^2] = 1. Why? ...)


================================================================

* DEGREES OF FREEDOM IN LINEAR MODELS:

  - Total variance and degrees of freedom: (background in HW4)

    . Def.: tr(A) = sum A_ii   (sum of diagonal elements for square A)

    . Facts: Assuming X is full rank, rank(X) = p+1, we have
        tr(P) = rank(P) = dim(colspace(P)) = dim(colspace(X)) = rank(X) = p+1
        => tr(P) = p+1
      'Trick': tr(P) = tr(A P A^{-1}) = tr(P diagonalized) = p+1   (HW4)

    . Trace formulas or 'total variance' formulas:
        sum Var(yhati) = tr(V[yhat]) = sigma^2 tr(P)   = sigma^2 sum P_ii     = sigma^2 (p+1)
        sum Var(ri)    = tr(V[r])    = sigma^2 tr(I-P) = sigma^2 sum (1-P_ii) = sigma^2 (N-(p+1))

    . Interpretation of the trace formulas:
        The more predictors the more variability in ...
        The more predictors the less variability in ...
        Interesting: This is independent of whether we use good or bad predictors.

    . 'Degrees of freedom': dimensions of the subspaces in which yhat and r can range
        dfs(fits)  = p+1
        dfs(resid) = N-(p+1)


================================================================

* ESTIMATING THE ERROR VARIANCE sigma^2:

  - Recall from ~ Stat 102:
      RMSE = sigmahat = sqrt( RSS / (N-p-1) )
    RMSE^2 is an unbiased estimate of sigma^2 as we will now show.
    (Q: Why not unbiased estimation of sigma? ...)

  - If the model is true (1st+2nd order correct), then
      E[ ri^2 ] = V[ri] = ... sigma^2 (1-Pii)
      (Which of 1st and 2nd order model assumptions are used?)
      E[RSS] = ... sigma^2 (N - p-1)
    Therefore:
      sigma^2 = ... E[RSS/(N-p-1)]
    Hence an unbiased estimate of sigma^2 is
      sigmahat^2 = ... RSS/(N-p-1)
    [Note: We would really like E[RMSE] = sigma, but this is not true,
     and there is no simple formula to connect left and right.
     Hence one resorts to what is mathematically doable:
     Show that E[RMSE^2] = sigma^2.]


================================================================

* "ADJUSTMENT" AND ITS ALGEBRA:

  - Partitioning linear regression:
      partition the predictor matrix:  X = (X1,X2)          (Nxp1, Nxp2)
      partition the LS estimator:      b^T = (b1^T,b2^T)    (p+1 = p1+p2)
      y = X b + r = X1 b1 + X2 b2 + r

  - QUESTION: How is b1 different from a regression of y on X1?
      In  y = X1 b1 + r1
      both b1 and r1 are different from the above b1 and r!!!
      because the following problems produce different b1 coefficients:
        | y - (X1 b1 + X2 b2) |^2 = min_b1,b2
        | y - (X1 b1)         |^2 = min_b1

  - Define projections according to the partitions:
      P  = projection onto column space of X
      P1 = projection onto column space of X1
      P2 = projection onto column space of X2
    (P1, P2 are NOT partitions of P; all are NxN)
    Question: When is P = P1 + P2 ?  (see HW2, P7)

  - Definition: residual operations
      I-P1 = "adjustment operator" w.r.t. X1
      I-P2 = "adjustment operator" w.r.t. X2
      y.1 = (I-P1) y  is "y adjusted for X1"
      y.2 = (I-P2) y  is "y adjusted for X2"
    Intuitively: Removing the variation that can be "explained" by X1 or X2
                                                    "accounted for"
    Practical example:
      X1  = predictors describing education measures at school
      X2  = predictors describing demographics
      y   = performance on SAT
      y.2 = performance on SAT adjusted for demographics
    Upcoming surprise: We will need X1 adjusted for X2 !!!!
      X1.2 = (I-P2) X1

  - Note some higher trivialities:
      r and the columns of X are orthogonal to each other,
      hence r and the columns of X1, X2 are ...

      P X       = ... X
      P X1      = ... X1
      P X2      = ... X2
      P r       = ... 0
      (I-P) r   = ... r
      (I-P) X   = ... 0
      P1 r      = ... 0
      P2 r      = ... 0
      (I-P1) r  = ... r   (because r is orth to cols of X, hence to cols of X1)
      (I-P2) r  = ... r
      (I-P1) X1 = ... 0
      (I-P2) X2 = ... 0

......................................
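
  - Numerical check of the trivialities (a minimal sketch, not part of the
    derivation): The R code below verifies the identities listed above on
    simulated data. Everything specific in it is an assumption made for
    illustration only: Gaussian data, N = 50, the intercept column placed
    in X1, and the helper names 'proj' and 'max_abs', which are hypothetical.
    Each entry of 'checks' should be zero up to rounding error. The last
    lines compare b1 from the full regression with the coefficients from
    regressing y on X1 alone; on data like these the two generally differ.

```r
set.seed(1)                                 # arbitrary seed, for reproducibility only
N  <- 50                                    # assumed sample size
X1 <- cbind(1, rnorm(N))                    # assumed: intercept plus one predictor
X2 <- cbind(rnorm(N), rnorm(N))             # assumed: two more predictors
X  <- cbind(X1, X2)                         # X = (X1, X2)
y  <- rnorm(N)                              # any response will do for these identities

# Hypothetical helper: projection onto the column space of A,
# computed as A (A^T A)^{-1} A^T (A must have full column rank).
proj <- function(A) A %*% solve(crossprod(A), t(A))

P  <- proj(X)
P1 <- proj(X1)
P2 <- proj(X2)
I  <- diag(N)
r  <- drop((I - P) %*% y)                   # residuals of the full regression

# Hypothetical helper: largest absolute entry of a matrix or vector.
max_abs <- function(A) max(abs(A))

checks <- c(
  "P X = X"       = max_abs(P %*% X  - X),
  "P X1 = X1"     = max_abs(P %*% X1 - X1),
  "P X2 = X2"     = max_abs(P %*% X2 - X2),
  "P r = 0"       = max_abs(P %*% r),
  "(I-P) r = r"   = max_abs((I - P) %*% r - r),
  "(I-P) X = 0"   = max_abs((I - P) %*% X),
  "P1 r = 0"      = max_abs(P1 %*% r),
  "P2 r = 0"      = max_abs(P2 %*% r),
  "(I-P1) r = r"  = max_abs((I - P1) %*% r - r),
  "(I-P1) X1 = 0" = max_abs((I - P1) %*% X1),
  "(I-P2) X2 = 0" = max_abs((I - P2) %*% X2)
)
print(signif(checks, 2))

# b1 from the full regression versus coefficients from y on X1 alone.
b_full  <- solve(crossprod(X),  crossprod(X,  y))
b_short <- solve(crossprod(X1), crossprod(X1, y))
print(cbind(full = b_full[1:2], short = drop(b_short)))
```

    The explicit NxN matrices keep the code close to the notation of the
    notes; for larger N one would normally avoid forming P and use lm() or
    a QR decomposition instead.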