================================================================ LECTURE 11: * ORG: - HW 4 due Friday noon - Monday Oct 19 is fall break , no class - Writing assignment and HW 5 to be posted shortly ... * RECAP: - Adjustment formula yet again: bj = / |xj.adj|^2 Definitions: . is called a (scaled) 'weighted mean' of y if all aj>=0. . is called a 'contrast' in y if sum_j aj = 0. Q: Is 'bj' a (scaled) weighted mean or a contrast as a function of 'y'? A: ... xj.adj is orthogonal to e=(1,...,1)^T, hence a contrast - Review the reworked notes of Lecture 10: Take-two under above heading 'STATISTICAL INFERENCE IN LINEAR MODELS' Fundamentals are worked out in greater detail. * ROADMAP: - Multivariate normal distribution (reworked) - Derivation of the t-distribution for t-statistics (testing one slope) - Derivation of the F-distribution for F-statistics (testing multiple slopes) ================================================================ * THE MULTIVARIATE NORMAL DISTRIBUTIONS: Definitions and Facts - Def.: An N-dimensional 'standard normal random vector' is given by z = (z1,z2,...,zN)^T where zn are i.i.d. N(0,1) We write z ~ N(0,I) where 0 is in R^N and I is NxN. - Fact: If z ~ N(0,I), then Rz ~ N(0,I) for any NxN orthogonal matrix R. Proof: The density of 'z' depends only on |z|^2. - Def.: A general N-dimensional 'normal random vector' with mean vector 'mu' and covariance matrix 'C' is given by y = (y1,y2,...,yN)^T where 'y = A z + mu' and C = A A^T We write y ~ N(mu,C) where mu is in R^N and C is NxN (sym., >=0). - Fact: This definition is independent of the representation C=AA^T. Proof: AA^T = BB^T <==> A = B R for some orthogonal R Then: P[ y in Set ] = P[ z in A^{-1} Set ] = P[ z in R^{-1} B^{-1} Set ] = P[ Rz in B^{-1} Set ] = P[ z in B^{-1} Set ] - Fact: Multivariate normal distributions are uniquely characterized by their expectations and variance/covariance matrices (1st & 2nd moments). - Fact: Any linear (matrix) transformation of a multivariate normal distribution yields again a normal distribution: y ~ N(mu, C) ==> Gy+c ~ N(G mu + c, GCG^T) [Note: G can be non-square.] - Fact: Two random vectors Y, Z with a joint normal distribution are uncorrelated iff they are independent V[Y,Z] = 0 <==> Y, Z are independent - FACT: If 'y' is multivariate normal and V[y] = sigma^2 I, when are Ay and By uncorrelated (and hence ...)? V[A^T y, B^T y] = ... A^T V[y,y] B = A^T B sigma^2 = 0 ==> A^T y and B^T y are independent if the columns of A are orthogonal to the columns of B. (Column space of A is orthogonal to the column space of B.) ================================================================ * EXACT DISTRIBUTION THEORY FOR LINEAR MODELS: - Q: If y ~ N(X beta, sigma^2 I), what is the distribution of 'b' and 'bj'? A: For the whole coef vector of estimates: b = (X^T X)^{-1} X^T y is a linear transformation of y E[b] = beta V[b] = sigma^2 (X^T X)^{-1} => b ~ N(beta, sigma^2 (X^T X)^{-1}) (multivariate normal) One coeff estimate at a time: bj = < y, xj.adj/|xj.adj|^2 > is a linear combination of y E[bj] = betaj (assuming the model is unbiased) V[bj] = sigma^2 / |xj.adj|^2 (assuming the variance properties) => bj ~ N(betaj, sigma^2 / |xj.adj|^2) (univariate normal) - Q: Why do we arrive at a t-distribution and not a normal distribution? A: We estimate sigma^2 with s^2 = RSS/(n-p-1) (s=RMSE=RSE=...) bj - betaj (bj - betaj) (bj - betaj)*|xj.adj| t = -------------- = -------------- = --------------------- stderr.est[bj] s/|xj.adj| s - Q: How is a t-distribution with k degrees of freedom defined? A: For any Z, Z1,...,Zk that are i.i.d. N(0,sigma^2) define Z t := ------------------------ / Z1^2+...+Zk^2 \ sqrt(------------------) \ k / Facts: . This definition allows us to simulate t-distributions (not that we need to). . t is independent of sigma as long as it is the same for Z, Z1,...,Zk. . As k-->Inf we have t-->N(0,1). (Apply the LLN to the denominator.) - Q: Why would the 't' derived from 'bj' be distributed like a 't'-distribution? A: Some pretty linear algebra and geometry!!! t = (bj - betaj)*|xj.adj| / s^2 Z := (bj - betaj)*|xj.adj| ~ N(0,sigma^2), where betaj is the true slope. s^2 = |r|^2 / (N-p-1) = ????? To be shown: 1) s^2 = (Z1^2 + Z2^2 + ...) / (N-p-1) for some Zj iid N(0,sigma^2) 2) s^2 and Z are independent. Let Q be a Nx(N-p-1) matrix whose columns form an orthonormal basis of residual space: . Q^T X = 0 (The columns of Q are orthogonal to the columns of X.) . Q^T Q = I (The columns of Q are mutually orthogonal; I is (N-p-1)^2.) . Q Q^T = I-P (= the projection onto residual space; this I is N^2.) Define the (N-p-1)-dimensional random vector z = (Z1,Z2,...)^T = Q^T y Facts: . r = Q z (Proof: r = (I-P)y = Q Q^T y = Q z) (Interpretation: z contains the coordinates of r in the basis Q.) . z ~ N(0, sigma^2*I) where 'I' is (N-p-1)^2 (Proof: z = Q^T y; y is normal, hence z is normal. E[z] = Q^T X beta = 0 V[z] = Q^T V[y] Q = sigma^2*I ) . |r|^2 = |z|^2 QED 1) above (Proof: |r|^2 = z^T Q^T Q z = |z|^2.) . z and bj are stochastically independent. QED 2) above (Proof: bj ~ y^T xj.adj, and xj.adj is orthogonal to the columns of Q.) - Properties of the t-distribution with k degrees of freedom: . Bell-shaped: x <- seq(-4,4,length=500) plot(x, dt(x,df=1), type="l", ylim=c(0,.45), col="blue", xlab="quantile", ylab="t-density") # dt(..,df=1) == dcauchy(..) abline(h=0) text(x=0, y=.25, lab="Cauchy", col="blue") text(x=0, y=.23, lab="df=1", col="blue") . As df increases, the t-distribution gets lighter in the tails: colrs <- gray.colors(100,0,.8) for(df in 2:100) lines(x, dt(x,df), col=colrs[df], lwd=.5) . As df -> Inf, the t-distribution approaches N(0,1) distribution: lines(x, dnorm(x), col="red", lwd=2) text(x=0, y=.44, lab="Gauss", col="red") text(x=0, y=.42, lab=expression(paste("df=",infinity)), col="red") #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ math symbols # For math symbols in R plots see help(plotmath) demo(plotmath) ================================================================ * F-TESTS FOR COMPARING TWO ARBITRARY NESTED MODELS: - Purpose: testing groups of predictors jointly in particular: overall F-test - Def. of F-distribution: W1,...,Wq,Z1,...Zr i.i.d. N(0,sigma^2) ^^^^^ irrelevant but same (W1^2+...+Wq^2)/q F = ----------------- has a (central) F-distribution with q,r dfs (Z1^2+...+Zr^2)/r - Null hypotheses tested with F-ratios: H0: beta1 = beta2 = ... = betaq = 0 (I.e., model with remaining p-q predictors is sufficient.) H0: beta1 = beta2 = ... = betap = 0 (I.e., the model has no predictors with explanatory power at all.) This is the overall F-test. - All null hypotheses can be thought of as comparing a full model given by 'X' with 'p+1' predictors (incl. intercept), and a submodel given by 'Xsub' with 'q' fewer predictors. What matters really are the orthogonal projections 'P' and 'Psub', such that 'P-Psub' is also an orthogonal projection. What an F-test checks is whether the true expectation E[y]=mu has variation in the subspace onto which 'P-Psub' projects. What does this subspace exactly describe? It describes the variation in the 'q' predictors in question adjusted for the predictors in 'Xsub'. Example: 'y' = educational success, such as graduation rates at schools. 'q' predictors that describe educational measures at schools. 'p-q' predictors that describe demographics 'H0': educational measures explain nothing beyond demographics 'X': full model with educational measures and demographics 'Xsub': demographics only 'P-Psub' projects onto the orthogonal complement of 'Xsub' within the column space of 'X'. 'H0' says that 'E[y]=mu' satisfies the following equivalent statements: '(P-Psub) mu = 0' 'Psub mu = mu' (assuming the full model is correct: P) 'Psub mu = P mu' - In linear models F-ratios have numerators and denominators as follows: Numerator = 'Variation of y in subspace being tested' / dim(subspace) = 'Variation of y in subspace being tested' / dim(subspace) Denominator = RSS / df(res) . Assume predictors x1,...xq are all irrelevant: H0: beta1,...,betaq=0 . Numerator = | y projected onto x1,...,xq adjusted for all other preds. |^2 ^^^^^^^^ orthogonalized wrt other preds. 'Gram-Schmidted' wrt other preds. u1,...,uq, .... ^^^^ o.n. basis of space spanned by other predictors ^^^^^^^^^ o.n. basis of space spanned by x1,...,xq adjusted for other preds. Numerator = = ^2 +...+ ^2 = ^2 +...+ ^2 (Assumption H0: beta1,...,betaq=0) = W1^2 +...+ Wq^2 Reason: E[Wk] = 0, V[Wk] = sigma^2, Cov[Wj,Wk] = 0 Wk are normal, hence independent Denominator: Similar, but the u''s must form an o.n. basis of residual space. => The denominator u''s are orthogonal to the numerator u''s. => Numerator terms and denominator terms are independent because of orthogonality of the coeff. vectors (the u''s). . F-ratio for x1,...,xq: SS(x1,...xq|others)/q F = --------------------- has an F-distr with q,(n-p-1) dfs under H0 RSS/(n-p-1) - Special case of one predictor: F = t^2 (recall: t=Z/sqrt((Z1^2+...+Zr^2)/r) - Special case where q=p: overall F-test x1,...,xp get adjusted for ...? - Rules for F-distributions? . "2 for t": works for about n~dfs >= 25 qt(.975, df=1:30) . F-distribution: df2 <- 5:30 plot(x=range(df2), y=c(0,7), type="n", xlab="df2", ylab="F") for(df1 in 1:10) { lines(x=df2, y=qf(.95, df1=df1, df2=df2)) text(x=df2[1], y=qf(.95, df1=df1, df2=df2[1]), lab=df1, cex=.8, adj=1.2) } No apparent simple rule ================================================================