================================================================

LECTURE 10:

* ORG:

  - HW 4 due Friday noon
  - HW 5 will be posted soon: LS computations in R
  - Reading assignment for Wed, Oct 21: Check webpage
    . Assignment will be: Reading a chapter + Doing Exercises
      Bring Exercises on paper to class for collection.
      Be prepared for an in-class discussion.
  - Monday Oct 19: fall break, no class

* RECAP:

  - Algebra of ADJUSTMENT !!!!!!!!!!!!!!!!
    . In a multiple linear regression, the LS estimate bj can be described as
      the LS estimate in a SIMPLE linear regression of ... on ...
          y on xj adjusted for all other predictors
    . "Adjustment" for the intercept column means ...
          centering   (See HW 4 for details.)
    . Standard error of bj derived from the adjustment identity: ...
          sigma / |xj.adj|
    . Effect on inference of collinearity of xj with the other predictors: ...

  - ANOVA decompositions: Simplest example is   y.0 = yhat.0 + r
    (PS: In 'balanced' ANOVA designs, yhat.0 allows further orthogonal decomps.)
    . What do y.0 and yhat.0 mean? ...
    . Why not y and yhat? ...
    . Why not r.0? ...
    . ANOVA decomp:
          ... = ... + ...
          |y.0|^2 = |yhat.0|^2 + |r|^2
          ... = ... + ...
          TSS = MSS + RSS
    . Intuitive meanings of 'Sums of Squares' (SS): measures of ... in 'y'
    . R2 = ...
    . Interpretation of R2: ...

  - Calculation of expected values of TSS, MSS, RSS.
    . E[RSS] = ...  sigma^2 (N-p-1) + no systematic part (if the model is unbiased)
    . E[TSS] = ...  sigma^2 (N-1)   + |(I-P0) X beta|^2
    . E[MSS] = ...  sigma^2 p       + |(I-P0) X beta|^2
    Try to build sufficient intuitions to derive these facts from memory!
    The objects that matter are subspaces associated with the SSs
    and the projections of
    . the systematic part 'X beta' and
    . the error vector 'eps'
    on these subspaces.

* ROADMAP:

  - Distribution theory for inference in linear models

================================================================

* STATISTICAL INFERENCE IN LINEAR MODELS

  - Reminder: What is statistical inference?
    . Judging unknown/unobservable parameters in light of observable quantities/data.
    . Two principal approaches:
      + TESTING: Assume a parameter value and 'accept/reject' it in light of the observed data.
      + CIs: Collect all non-rejectable parameter values in a 'confidence interval/region'.
    . Quibbles with the logic of testing: 'Null hypotheses are never true.'
      Refutation of the quibble:
        This procedure uses a hypothetical (H0) as a yardstick.
        'Acceptance' of H0 is really 'Non-Rejection' (we never believe H0).
        The hypothetical does not need to be exactly true.
        It formalizes, e.g., the case of an uninformative situation
        where a predictor has no explanatory power at all.
        (Strictly: No explanatory power above and beyond the other predictors.)

  - Null hypotheses we wish to test: assumptions about individual slopes, e.g., beta1

        H0: beta1 = beta1.null    (usually: beta1.null=0)

    NOTE: This is a COMPOSITE NULL HYPOTHESIS!!!
    Under H0 with beta1.null=0 the model still contains unknown parameters:

        y = beta0 + beta2*x2 + ... + betap*xp + eps,   E[eps]=0, V[eps]=sigma^2*I
            ^^^^^   ^^^^^            ^^^^^                              ^^^^^^^
        'Nuisance parameters' in the null hypothesis: not focus of the test

    ==> Deep problem! How do we define significance levels if each choice of values for
        the nuisance parameters in H0 produces different acceptance/rejection probabilities?
    ==> Solution: Use 'pivotal' quantities as test statistics, that is,
        quantities whose null distribution does not depend on the nuisance parameters in H0.
    ??? Do pivotal test statistics exist? Try t-statistics! (Stay tuned.)

  - Two types of sources for statistical inference:
    . Exact distribution theory can be used for "exact" t-tests and CIs.
      They rely on exact distributional assumptions about the data,
      e.g., Gaussian errors in linear models.
    . Asymptotic theory based on some form of CLT can be used for approximate tests and CIs.
      They rely on very few distributional assumptions about the data,
      but they require 'large N'.

  - Present Goal: For linear models, justify
    . tests based on t = (bj-betaj)/stderr.est[bj],
    . CIs of the form bj +- 2*stderr.est[bj].

  - Let s = RMSE = sqrt( RSS / (N-p-1) ) be the estimate of sigma.
    Then we define the obvious estimates of standard error
    and corresponding 'standard variance' as:

      ----------------------------------
      |                        s       |
      | stderr.est[bj] = ------------  |
      |                    |xj.adj|    |
      ----------------------------------
      |                            s^2 |
      | V.est[b] = (X^T X /N)^{-1} --- |
      |                             N  |
      ----------------------------------

    We are estimating 'dataset-to-dataset variation' of 'bj' and 'b'.

  - Principles of statistical inference as applied to linear models:

    . The test statistic for betaj is

                 bj - betaj
           tj = ------------
                     sj

      where we abbreviate sj = stderr.est[bj] = s/|xj.adj|.

      Note: tj = tj(bj,betaj)
               = directional distance between bj and betaj in multiples of sj

      tj is not a function of the other betak and sigma^2,
      but its distribution might still depend on them. Or does it?
      Here goes next:

    . The statistic 'tj' has some special properties:
      + It is a 'pivotal' quantity, meaning its distribution does NOT depend on
        'nuisance parameters' that are not part of the null hypothesis
        (i.e., parameters betak for k!=j, and sigma).
        'Pivotality' of 'tj' holds
        > exactly under normal parametric assumptions,
        > asymptotically under minimal non-parametric assumptions.
      + 't' has a known finite-sample distribution for all N (in fact: Student's t)
        under normal parametric assumptions,
        enabling 'exact' finite-sample inference based on 't'.
      + 't' has a non-degenerate limiting distribution as N-->Inf (in fact: N(0,1))
        under minimal non-parametric assumptions,
        enabling asymptotic inference based on 't'.

    . Here is the kind of probability statement that is needed for statistical inference:

        P[ -c < tj < +c]           (for 2-sided inference, P[ tj < c] for 1-sided)
             =  1-alpha            (for 'exact' inference)
            --> 1-alpha as N-->Inf (for asymptotic inference)

      where c=c(alpha), independently of what the other betak and sigma^2 in H0 are.

    . USE FOR H0 TESTING:
      If H0: betaj=betaj.null is true, then for tj = (bj-betaj.null)/sj we have

        P[-c < tj < +c | betaj.null ]  '=' or '-->'  1-alpha

      We consider it as evidence against H0 at the significance level alpha
      if the observed value tj.obs falls outside (-c,c):

        Reject H0  :<==>  observed 't' is not in (-c,+c)

      (This is the definition of a rejection rule.)
      The gambling guarantee is:

        P[ false rejection ] = P[ reject H0 | H0 true ]
                             = P[ tj not in (-c,c) | H0 true ]
                             = alpha

    . USE FOR CIs:
      A CI with confidence level 1-alpha is the collection of parameter values betaj.null
      which cannot be rejected if tested at the significance level alpha.
      To find CI(1-alpha), invert -c < tj < +c so it becomes a condition for betaj:

        -c < tj < +c  <==>  bj - c*sj < betaj < bj + c*sj
                      <==>  betaj in (bj-c*sj, bj+c*sj)

        ==> CI(1-alpha) = (bj-c*sj, bj+c*sj)    (a random interval).

      The gambling guarantee is:
        The true value of betaj is in CI(1-alpha) with Probability = 1-alpha.

    . Comparison of CIs and acceptance intervals:

        tj in (-c,+c)  <==>  betaj in (bj - c*sj, bj + c*sj)
                                      ^^^^^^^^^ CI ^^^^^^^^^  centered at bj
                       <==>  bj in (betaj - c*sj, betaj + c*sj)
                                   ^^^ Acceptance Interval ^^^^  centered at betaj from H0
                       <==>  bj and betaj are no more than c*sj apart.

      There is only one true value of betaj, but each dataset will produce its own value of bj.
      The gambling guarantee is that the true betaj and the observed bj will be
      no more than c*sj apart for a fraction 1-alpha of datasets (exactly or approximately).

* ASYMPTOTIC DISTRIBUTION THEORY FOR INFERENCE IN LINEAR MODELS:

  - The basis of asymptotic inference about slopes: a generalized CLT for bj
    . As N -> Inf, p (= number of predictors) is held fixed, and cases are i.i.d. sampled:

               bj-betaj
         tj = ----------  ->  N(0,1)     "in distribution", "weak convergence"
                  sj

      where betaj is the true-but-unknown or hypothesized parameter value.

      Meaning:  P(t in [a,b])  ->  P(z in [a,b])  as N->Inf,  where z ~ N(0,1)

  - Comments:
    . Some strangeness: This CLT still conditions on the predictors,
      yet the predictor rows get sampled from a distribution.
    . With asymptotic inference, one does not need to check parametric assumptions
      such as normal errors. BUT one needs 'large N' !
    . 'N large enough' may mean, e.g., N > 20*(p+1)
      (i.e., more than 20 obs. per parameter).
      (In general, how large is enough depends on the problem.
       Sometimes N=5 is large enough, and sometimes N=10^6 is not large enough.
       Simulations of special cases often give some idea.)
    . More powerful asymptotic theory exists, where 'p' can increase slowly with 'N'.

================================================================
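* ILLUSTRATION: a simulation of one special case

  The remark that "simulations of special cases often give some idea" can be
  sketched as follows. This is a minimal illustration under assumptions that
  are NOT in the notes: a single predictor (p=1), N=200, predictors drawn once
  from Uniform(0,1) and then held fixed (conditioning on the predictors),
  skewed errors Exp(1)-1 (mean 0, variance 1, clearly non-Gaussian), and
  c=1.96 as the N(0,1) cutoff for alpha=0.05. The function names 'simple_ls'
  and 'coverage' are hypothetical helpers. Python is used here only for
  self-containedness; the same steps carry over to R.

```python
import math
import random


def simple_ls(x, y):
    """Return (b1, s1): LS slope and its estimated standard error
    for the simple regression y = b0 + b1*x + eps."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Adjusting x for the intercept column means centering.
    x_adj = [xi - xbar for xi in x]
    norm2 = sum(v * v for v in x_adj)            # |x.adj|^2
    b1 = sum(v * yi for v, yi in zip(x_adj, y)) / norm2
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))                 # RMSE with N-p-1 = N-2
    return b1, s / math.sqrt(norm2)              # s1 = s / |x.adj|


def coverage(n=200, beta0=1.0, beta1=0.5, c=1.96, n_datasets=2000, seed=0):
    """Fraction of simulated datasets with -c < t < +c, which is the same
    event as beta1 lying in the interval (b1 - c*s1, b1 + c*s1)."""
    rng = random.Random(seed)
    # Predictors are drawn once and held fixed across datasets.
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    hits = 0
    for _ in range(n_datasets):
        # Illustrative non-Gaussian errors: Exp(1) - 1 has mean 0, variance 1.
        y = [beta0 + beta1 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
        b1, s1 = simple_ls(x, y)
        t = (b1 - beta1) / s1
        if -c < t < c:
            hits += 1
    return hits / n_datasets


if __name__ == "__main__":
    # If the asymptotic approximation is adequate at this N,
    # the printed fraction should be close to 1 - alpha = 0.95.
    print(coverage())
```

  Rerunning 'coverage' with a much smaller n (say n=5 or n=10) or with
  heavier-tailed errors is one way to get a feeling for when 'N large enough'
  fails in a given special case.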