================================================================

LECTURE 9:

* ORG:

  - Solutions to HW 1-3 posted

  - Graded HW2 has been returned by the TA.
    (Note: emails from 'stat541.at.wharton' show as coming from
    'Andreas Buja' but will often be from the TA.)

  - Reminder: Consulting previous years' solutions is prohibited.

  - Q: Did everybody buy the required Style book?
    Next week we will start with writing practice.

  Reminder to myself: lecture 6 not posted (check HWs w extensions)


* RECAP in the form of a quiz and further development

  - But first an elaboration of Jose's question about overfit:

    Self-influence is the answer to diagnosing overfit.
    Why? Because overfit can be diagnosed locally in linear models.

    Q: When is case i overfit?
    A: If, say, Pii > 0.2.
       I.e., if yi contributes more than 20% to the variation of its own fit:

         Var[yhati] = Pii sigma^2
         Var[yi]    = sigma^2
         Var[yhati]/Var[yi] = Pii

    Thus Pii can be used as a diagnostic to answer the sharper question
    "Where does the model overfit?" as opposed to "Does the model overfit?".

  - Another face for the adjustment formula for one predictor:

    In the multiple regression of y on x1,...,xp with intercept
    let the following be the LS decomposition of y:

      y = b0*x0 + b1*x1 + b2*x2 + ... + bp*xp + r

    Let P` be the LS projection onto the predictors x2,...,xp with intercept.
    Derive the adjustment formula for the LS estimate b1:

      (I-P`) y = (I-P`)( b0*x0 + b1*x1 + b2*x2 + ... + bp*xp + r )
      (I-P`) y = (I-P`)( b1*x1 + r )    # (I-P`) annihilates x0,x2,...,xp
      (I-P`) y = (I-P`)( b1*x1 ) + r    # r is orthogonal to x0,x2,...,xp, so P` r = 0
      (I-P`) y = b1 (I-P`) x1 + r       # This is a LS decomp. for simple regr. w/o icpt

    Define  y.adj = (I-P`) y  and  x1.adj = (I-P`) x1

            <y.adj, x1.adj>     <y, x1.adj>
      b1 = ----------------- = -------------
              |x1.adj|^2        |x1.adj|^2

  - The LS estimate b1 is a linear function of y:

      b1 = <a, y>

    What is the coefficient vector 'a' in view of the above?

      a = ... x1.adj / |x1.adj|^2

  - Another formula for the coefficient vector b = (b0,b1,...,bp)^T is

      b = (X^T X)^{-1} X^T y  .....

    Problem: Identify the above vector 'a' in this formula.
    ...
    a^T is the second row of the triple-X matrix (X^T X)^{-1} X^T
    (the first row belongs to the intercept b0).

  - Last time we used the adjustment formula to derive Var[bj]:

                   sigma^2
      Var[b1] = ------------
                 |x1.adj|^2

    But we also have an older matrix formula for the var/cov matrix of b:

      V[b] = sigma^2 (X^T X)^{-1}

    Problem: Identify Var[bj] in this second equation.
    ...
      Var[b1] = sigma^2 * element (2,2) down the diagonal of (X^T X)^{-1}.
      In general, Var[bj] = sigma^2 * element (j+1,j+1).


* ROADMAP:

  - ANOVA/Pythagorean decompositions for linear models
  - R^2, SN ratios
  - Distribution theory for linear models
  - Inference for linear models

================================================================

* ANOVA DECOMPOSITIONS, R SQUARE AND SUMS OF SQUARES:

  - Q: How much "variation" does the model given by X "account for"?

    . Approach: Use sums of squares (SS) as measures of "variation"
      but first remove uninteresting variation: the mean.

  - New partition: X = (x0,X1)
    where x0 = (1,...,1)^T in R^N (intercept) is uninteresting
    and X1 is of interest.

    . Let P0 be the projection onto x0 (see HW4).
      ==> I-P0 does "adjustment for the intercept column x0",
          that is, it subtracts the mean (= it centers).

    . Full LS decomposition:

        y = x0*b0 + x1*b1 + ... + xp*bp + r

      "Mean-adjusted" decomposition:

        (I-P0) y = (I-P0) (x0*b0 + x1*b1 + ... + xp*bp + r)
                 = (I-P0) x1*b1 + ... + (I-P0) xp*bp + r

      ==> Center y, x1, ..., xp and run LS w/o intercept
          to obtain the multiple regression LS slopes b1, ..., bp.

    . Why this rigmarole?
      Because the intercept is not considered "explanatory",
      unlike the real predictors.
      Therefore, get rid of the intercept once and for all.

  - Pythagorean decomposition of orthogonal projections:

    We now assume y and all xj are centered.
    P is the LS projection onto the CENTERED columns x1,...,xp.

      y = b1*x1 + ... + bp*xp + r
      y = yhat + r
      ^   ^^^^         (both centered)

    Recall: yhat and r are orthogonal, hence:

      |y|^2    = |yhat|^2 + |r|^2      (Pythagorean theorem, ANOVA decomposition)
      TSS      = MSS      + RSS
      Total SS = Model SS + Residual SS

    (Proof: ...)

  - Definition of R Square ("R2"):

            MSS     |yhat|^2
    . R2 = ----- = ---------- = "Fraction of variation accounted for by the model"
            TSS      |y|^2

    . Fact: R2 = cor(y,yhat)^2
      (Proof: <y,yhat>^2 / (|y|^2*|yhat|^2) = ...)

  - Can we calculate E[TSS], E[RSS] and E[MSS] theoretically?

    We won't have use for this in the future, but there are two
    reasons for doing this math exercise:

      + ANOVA (really the analysis of designed experiments) is permeated
        by calculations of expected values of sums of squares (E[SS]).
      + The calculations are insightful in their own right because
        they describe sources of variation between datasets.

    . Assume the model is unbiased and with uncorrelated homoscedastic errors:

        y = X beta + eps   with E[eps]=0, V[eps] = sigma^2 I
        ==> E[y] = X beta

    . We need to allow the intercept term in the model, hence we write
      (P is again the projection onto all columns of X, including x0):

        TSS = |(I-P0) y|^2   (total SS in the data other than the mean)
        MSS = |(P-P0) y|^2   (SS captured by the model other than the intercept)
        RSS = |(I-P) y|^2    (SS not captured by the model)

    . Note: I-P0, P-P0, I-P are all orthogonal projections,
      with ranks N-1, p, N-p-1, respectively.

    . Expectations:

        E[RSS] =                         sigma^2 (N-p-1)
        E[TSS] = | (I-P0) X beta |^2  +  sigma^2 (N-1)
        E[MSS] = | (P-P0) X beta |^2  +  sigma^2 p

      The proofs are similar for all three (we did E[RSS] earlier).
      Here is the proof for E[MSS] as a template for the others.

      There is a useful technical silliness of the following form:

        E[ |P eps|^2 ] = E[ eps^T P eps ]
                       = E[ tr(eps^T P eps) ]   # This is the silly part: a scalar is its own trace
                       = E[ tr(eps eps^T P) ]   # HW 4: tr(AB)=tr(BA)
                       = tr( E[ eps eps^T ] P ) # Linearity of trace and of multiplication by P
                       = tr( (sigma^2 I) P )    # E[eps eps^T] = V[eps] if E[eps]=0
                       = sigma^2 tr(P)
                       = sigma^2 (p+1)

      We will actually use this with P-P0 rather than P:

        E[MSS] = E[ y^T (P-P0) y ]
               = E[ (X beta + eps)^T (P-P0) (X beta + eps) ]
               = E[   beta^T X^T (P-P0) X beta
                    + beta^T X^T (P-P0) eps
                    + eps^T (P-P0) X beta
                    + eps^T (P-P0) eps ]
               = | (P-P0) X beta |^2  +  2*0  +  E[ eps^T (P-P0) eps ]
               = | (P-P0) X beta |^2  +  sigma^2 p
                 ^^^^^^^^^^^^^^^^^^^     ^^^^^^^^^
        Sources: model, 'signal'         error, 'noise'

      [PS: Note (P-P0)X = (I-P0)X because PX=X; either means centering X.]

    . E[MSS] motivates the definition of "signal-to-noise ratio":

             | (I-P0) X beta |^2     / Ratio of model to noise variation
        SN = ------------------- =  |  in fitted values
                  sigma^2 p          \ excluding uninteresting mean variation

      Q: Do we know the SN in actual data analysis?
      ...
      No, it would have to be estimated, but there are problems with that.

    . We calculated E[MSS] and E[TSS].
      We defined R2 = MSS/TSS.
      Q: Can we calculate E[R2] theoretically?
      ...
      NOPE

================================================================
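
A numerical sanity check of the identities in this lecture may help when
reviewing. The sketch below is not part of the course materials. It uses
Python with NumPy on simulated data, and every concrete choice in it is an
assumption made for illustration: the sizes N and p, the values of sigma and
beta, the random seed, the helper proj(), and the name P_rest, which plays
the role of P`. It checks TSS = MSS + RSS, R2 = cor(y,yhat)^2, the ranks of
the three projections, the adjustment formula for b1 and its vector 'a',
the two formulas for Var[b1], and (by Monte Carlo) the formula for E[MSS].

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
N, p, sigma = 50, 3, 1.0                  # illustrative sizes (assumption)
beta = np.array([1.0, 0.5, -0.3, 0.2])    # hypothetical (beta0, ..., betap)

x0 = np.ones(N)                           # intercept column
X = np.column_stack([x0, rng.normal(size=(N, p))])
y = X @ beta + sigma * rng.normal(size=N)


def proj(A):
    """Orthogonal projection onto the column space of A (full column rank)."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T


I = np.eye(N)
P = proj(X)                               # projection onto all columns of X
P0 = proj(x0[:, None])                    # projection onto x0 (takes the mean)

# Self-influence (leverage) values Pii; they sum to tr(P) = p+1
leverage = np.diag(P)
flagged = np.where(leverage > 0.2)[0]     # cases with Pii > 0.2

# ANOVA decomposition with the mean removed
TSS = y @ (I - P0) @ y
MSS = y @ (P - P0) @ y
RSS = y @ (I - P) @ y
R2 = MSS / TSS
yhat = P @ y
cor2 = np.corrcoef(y, yhat)[0, 1] ** 2

# Ranks of the projections I-P0, P-P0, I-P, computed as traces
ranks = [np.trace(I - P0), np.trace(P - P0), np.trace(I - P)]

# Adjustment formula for b1: adjust x1 for x0, x2, ..., xp
P_rest = proj(np.delete(X, 1, axis=1))    # stands in for P`
x1_adj = (I - P_rest) @ X[:, 1]
a = x1_adj / (x1_adj @ x1_adj)            # b1 = <a, y>
XtX_inv = np.linalg.inv(X.T @ X)
triple_X = XtX_inv @ X.T                  # the "triple-X matrix"
b = triple_X @ y

# Monte Carlo estimate of E[MSS] over many simulated response vectors
reps = 20000
Y = X @ beta + sigma * rng.normal(size=(reps, N))
mss_mean = np.mean(np.sum((Y @ (P - P0)) * Y, axis=1))
mss_theory = np.sum(((P - P0) @ X @ beta) ** 2) + sigma**2 * p

print("TSS - (MSS + RSS):", TSS - (MSS + RSS))
print("R2, cor(y,yhat)^2:", R2, cor2)
print("ranks:", np.round(ranks, 6))
print("b1 via <a,y>, via triple-X:", a @ y, b[1])
print("E[MSS] simulated, theoretical:", mss_mean, mss_theory)
print("cases with Pii > 0.2:", flagged)
```

The projections are formed as explicit N x N matrices only to mirror the
notation of the notes; for large N one would work with the QR factors
directly instead.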