================================================================

Chapter 2: BASICS OF LINEAR MODELS

* LINEAR MODELS: THE OLS PROBLEM

- LS criterion: RSS

  . In what follows we will play with OLS from the ground up.
    This is a conceptual exercise, not a recommendation for practice.
    In most contexts you will use the function lm() for OLS fitting.
    However, when speed matters, you might want to use the linear algebra below.

  . In R: a numeric toy example of LS

      RSS <- function(b) { sum((y - X %*% b)^2) }   # What's bad about this code?
      N <- 5
      X <- cbind(rep(1,N), 1:N)
      y <- X[,1] + 2*X[,2]                          # No 'error' in this case

    Getting at the coefficient estimates the hard way:

      nlm(RSS, c(0,0))$estimate    # nlm() is a minimizer in R
      optim(c(0,0), RSS)$par       # optim() is another minimizer in R

  . This is not how it is done, of course, but general code has some nice
    aspects: it is easily cannibalized for different situations.  Such as:

      L1 <- function(b) sum(abs(y - X %*% b))
      nlm(L1, c(0,0))$estimate
      optim(c(0,0), L1)$par

    Seems to work even though the L1 criterion is not differentiable.

  . In math:  RSS(b) = |y - X b|^2

  . LS estimate:  b = argmin RSS(b) = minimizer of RSS(b)
    (Justifications of LS? ...)

  . NORMAL EQUATIONS as stationary equations of RSS:

      gradient RSS(b) = 2*( X^T X b - X^T y ) = 0

           -----------------------
      ==>  | (X^T X) b = X^T y   |
           -----------------------

    [This is a special case of a moment condition:
     (X^T X)/N and (X^T y)/N estimate 2nd moments: E[x x^T] and E[x y].
     ==> Moment estimators are defined by moment conditions.
     Other names for X^T X b - X^T y = 0:
       . score equations
         ['scores' are the negative gradients: -(X^T X b - X^T y)]
       . estimating equations ]

  . Explicit LS solution:

      b = (X^T X)^{-1} X^T y        [Does not work when ...]

    In R:

      b <- solve(t(X) %*% X) %*% t(X) %*% y
      b                             # Correct

    [NOT the numeric method of choice; stay tuned for a HW.]

    Alternative using ginv() in MASS:

      library(MASS)
      ginv(t(X) %*% X) %*% t(X) %*% y

    This should be numerically more stable; not a problem here.

- Residuals and fits:

  . Notation:  yhat = X b
               r    = y - yhat
               RSS  = |r|^2          # sum(r^2)

  . Normal equations can be rewritten this way:

      X^T r = 0        # 0 = X^T y - (X^T X) b = X^T (y - X b)

    ==> Uniquely characterizes the LS solution.
    Geometric meaning: ...           # see Homework 1

  . In R:

      yhat <- X %*% b
      r    <- y - yhat

- LS and orthogonal projections:

      yhat = X b = ( X (X^T X)^{-1} X^T ) y = H y
      r    = y - yhat = ( I - H ) y

  where H and I-H are orthogonal projections in n-space.

  . Definition of orth. proj.: ...   # see Homework 1

  . H is also called the "hat matrix" because it "puts the hat on y" (Tukey).
    [Notation: In math, one likes to use "P" as the symbol for projections,
     but for us "P" stands for probability, hence we prefer "H".]

- Visualization/animation of the geometry of LS: 2 predictors (p=1), n=3

      source("http://www-stat.wharton.upenn.edu/~buja/STAT-961/LS-geometry.R")

- What are the statistical properties of 'b', 'yhat' and 'r'?
  . What are their expected values?
  . What are their variances and covariances?

- 1st order properties: need expectation vectors of random vectors

      E[y] = mu = X beta    This is assumed!!!  What is the meaning of this? ...
      E[yhat] = ...
      E[r]    = ...
      E[b]    = ...

- 2nd order properties: need covariance matrices of random vectors

      V[y] = sigma^2 * I    (I = identity of size nxn)
                            Again, this is assumed!!!  What is the meaning of this? ...
      V[yhat] = ...
      V[r]    = ...
      V[b]    = ...

- The role of X:
  . We will follow convention in statistics and treat X as fixed.
  . Strictly speaking, the above expressions should be written
    E[y|X], E[yhat|X], E[b|X], V[y|X], V[yhat|X], ...
  . Fixed-X will allow us to do a lot of theory.
  . We mentioned an ancillarity argument to justify conditioning on X
    when X is truly random.
  . We also mentioned that the ancillarity argument depends on model correctness.
    ==> What we currently pursue is a 'model-trusting, assumption-rich theory'.
        There also exists a 'model-robust, assumption-lean theory'.  Stay tuned.
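- A minimal R sketch (reusing the toy X from above, with a little noise added so that
  r is not 0; object names are ad hoc) to check the projection facts numerically:
  H is symmetric and idempotent, H y reproduces X b, and X^T r = 0 up to rounding.

      N <- 5
      X <- cbind(rep(1,N), 1:N)
      y <- X[,1] + 2*X[,2] + rnorm(N)         # toy response with some 'error' this time
      b <- solve(t(X) %*% X) %*% t(X) %*% y   # explicit LS solution (not the numeric method of choice)
      H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
      yhat <- H %*% y
      r <- y - yhat
      all.equal(H, t(H))                      # H is symmetric
      all.equal(H %*% H, H)                   # H is idempotent (orthogonal projection)
      all.equal(c(yhat), c(X %*% b))          # H y = X b
      max(abs(t(X) %*% r))                    # normal equations: X^T r = 0 (up to rounding)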
================================================================

* VECTOR EXPECTATIONS AND VARIANCE/COVARIANCE MATRICES

- In preparation for linear models analysis we define vector expectations
  and variance/covariance matrices.

  Confusing fact:
  . In multivariate analysis, all variables are random.
      Data set = iid replications of a random vector
  . In linear models theory of regression, only the response is random,
    the regressors are not.
      Data set = cbind(random y, fixed xj vectors)

- Let us think the way of multivariate analysis for now, hence assume:

      Y ~ Ix1 random vector
      Z ~ Jx1 random vector
      with joint distribution ("observed on the same objects")

  Ex.:  Y = (Height, Weight, Age, ...) of a person
        Z = (English Grade, Math Grade, SAT score, ...) of the same person

  The Y- and Z-variables will generally be 'associated'
  (i.e., correlated, but not only in a linear sense).

- EXPECTATION:

  . Definition:  E[Y] = vector of E[Yi]  (i=1...I)

  . Estimation:  In Multivariate Analysis one assumes N i.i.d. draws yn from Y,
                 where yn ~ Ix1 = n'th sample.
                 Estimate of E[Y]:  ybar = (sum_n yn)/N ~ Ix1
                 Unfortunately, in regression we only see one draw y of Y.
                 Do you see what's confusing? ...

  . Affine transformation: if A ~ JxI, c ~ Jx1, both constant, then

        E[AY + c] = A E[Y] + c

    Proof: Write out the components of AY and apply E[].
    Q: Why do we need this?
    A: Because we will analyze objects such as
         b    = (X^T X)^{-1} X^T y
         yhat = H y
         r    = (I-H) y

  . Algebra: Assuming Y and Z are both Ix1, then

        E[a*Y + b*Z] = a*E[Y] + b*E[Z]

    This property is called linearity of E[...].

- COVARIANCE:

  . Definition:  V[Y,Z] := E[ (Y-E[Y])(Z-E[Z])^T ]
                 V[Y]   := V[Y,Y]

  . Components:  V[Y,Z]_ij = ...
                 V[Y]_ij   = ...
                 V[Y]_ii   = ...

  . What are the ranks of (Y-E[Y])(Z-E[Z])^T and (Y-E[Y])(Y-E[Y])^T ?
    Hence a covariance matrix is an average of ...
    Is it hence also a ...?  Why or why not? ...

  . Sizes:  V[Y,Z] is IxJ,  V[Y] is IxI

  . Generalizing the univariate formulas
    V[X] = E[X^2] - E[X]^2  and  V[X1,X2] = E[X1*X2] - E[X1]*E[X2] :

        V[Y,Z] = E[ Y Z^T ] - E[Y] E[Z]^T
        V[Y]   = E[ Y Y^T ] - E[Y] E[Y]^T

  . Algebra: Assuming Y, Z are both Ix1 and U is Kx1:

        V[Y+Z,U] = V[Y,U] + V[Z,U]

    Consequence:

        V[Y+Z] = V[Y] + V[Y,Z] + V[Z,Y] + V[Z]    # !!! Watch out !!!

    Can you continue and write the middle two terms as 2*V[Y,Z] ? ...

  . Affine transformation:
    Recall: Y (Ix1) and Z (Jx1) are random vectors.
    Assume fixed matrices A, B and fixed vectors c, d of legit sizes:
    A (mxI), B (nxJ), c (Ix1), d (Jx1)

        ------------------------------
        | V[AY,BZ]   = A V[Y,Z] B^T  |
        |                            |
        | V[AY]      = A V[Y] A^T    |   Butterfly or SANDWICH formula
        |                            |
        | V[Y+c,Z+d] = V[Y,Z]        |
        ------------------------------

    Exercises:
    ~ Specialize the middle formula to a linear form, i.e., A = a^T, where a (Ix1) is fixed.
    ~ Specialize further to a = rep(1/I,I).

  . Multivariate Analysis: Estimation of V[Y]
    * Given N i.i.d. vector samples yn drawn from the random vector Y:

          Vhat[Y] = sum_n ( (yn-ybar)(yn-ybar)^T ) / (N-1)

    * Affine transformation:  zn = A yn + c

          Vhat[z] = A Vhat[y] A^T
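- A minimal R sketch (made-up A, c, and simulated draws of a made-up random vector Y)
  to check the affine-transformation rules empirically:
  E[AY + c] = A E[Y] + c and the sandwich formula V[AY + c] = A V[Y] A^T.

      set.seed(1)
      I <- 3;  J <- 2;  Ndraws <- 100000
      Y <- matrix(rnorm(Ndraws*I), nrow=Ndraws) %*%
           matrix(c(1,.5,0, .5,1,.3, 0,.3,1), I, I)   # rows = draws of a correlated Ix1 random vector
      A  <- matrix(c(1,0, -1,2, 1,1), nrow=J, ncol=I) # fixed JxI matrix (made up)
      cc <- c(10, -5)                                 # fixed Jx1 shift (made up)
      Z <- t(A %*% t(Y) + cc)                         # rows = draws of AY + c
      colMeans(Z);  A %*% colMeans(Y) + cc            # expectations agree (up to simulation error)
      cov(Z);       A %*% cov(Y) %*% t(A)             # sandwich formula (up to simulation error)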
================================================================

* LINEAR MODELS:

- In the linear models theory of regression we observe only one single y-vector,
  but we will make assumptions about its expectation and covariance
  and examine what follows from them.
  [Whether the assumptions are realistic for a given dataset
   needs to be checked with diagnostics.]

- Linear model assumptions:

  . Convention: y is Nx1 (not Ix1)
    Compare asymptotics in
      Multivariate Analysis:   y1,...,yN are Ix1 and N-->Inf
      Linear Models Analysis:  y is Nx1, and N-->Inf

  . Model in vector/matrix language:

        y = beta0 + beta1*x1 + beta2*x2 + ... + betap*xp + eps
          = X beta + eps                   where beta is (p+1)x1

    Here y, xj, eps are Nx1 (response, predictors, errors, resp.)
    X = cbind(1,x1,x2,...,xp) is Nx(p+1)

    Note:
    + The response vector y is a random vector -- in spite of lower-case notation!
    + Predictors xj are NOT random but fixed and known
      (set experimentally, or observed but conditioned on).
    + The 'error' vector 'eps' is a random vector.
      ['Error' is a misleading term; there is nothing wrong with 'eps'.
       We often use the term 'noise' instead.]

  . Stochastic assumptions:

    --------------------------------------------------------------------------
    | E[eps] = 0          : On average [across possible worlds, universes,    |
    |                       datasets] the model is 1st order                  |
    |                       "correct"/"unbiased".                             |
    | V[eps] = sigma^2 * I: Uncorrelated errors [across ...],                 |
    |                       constant variances ("homoscedastic errors")       |
    |                       (2nd order correct)                               |
    --------------------------------------------------------------------------

  . Equivalently:

    ----------------------
    | E[y] = X beta      |   Why?
    | V[y] = sigma^2 * I |   ''
    ----------------------

    [Distinguish: "unbiased model" vs "unbiased estimator"; see below]

  . Recap fundamentals:
    Q: What variability do E[...] and V[...] refer to?
    A: Variability ...
    Spell it out: What is the variability of y[1], y[2], ..., y[N] ?

================================================================

* LINEAR MODEL ANALYSIS, STEP 1: VARIABILITY OF b

- Reminder: The LS estimate is  b = (X^T X)^{-1} X^T y

- Type of variability in b: ...
  b = (p+1)x1 random vector varying from response vector y to response vector y...

- Making the linear model assumptions, it follows:

      E[b] = ...

  Actually, which part of the assumptions was used? ...
  => 'Unbiased' estimator of beta

- Making the linear model assumptions, it follows:

      V[b] = ... = ...

  Actually, which part of the assumptions was used? ...
  NOTE: V[b] is conditional on X (if X is random...)

- What variability does V[b] describe? ...

- What is the interpretation of the diagonal elements of V[b]? ...

- Could one hope to estimate V[b]? ...

- What would such estimates be good for? ...

- [There exist estimators of V[b] that avoid making the assumption V[y] = sigma^2 I.
   Instead they estimate V[y] with diag(r^2), i.e., a diagonal matrix with the
   squared residuals in the diagonal.  This results in the 'sandwich estimator
   of standard error', which has become common in econometrics since
   Hal White's work in 1980+.]
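- A minimal R sketch of the idea in the last bullet (an HC0-type sandwich variance,
  written from the verbal description above; the function names are made up, and this
  is an illustration, not a recommended implementation):

      vcov.classical <- function(X, y) {          # assumes V[y] = sigma^2 I
        b  <- solve(t(X) %*% X, t(X) %*% y)
        r  <- y - X %*% b
        s2 <- sum(r^2) / (nrow(X) - ncol(X))
        s2 * solve(t(X) %*% X)
      }
      vcov.sandwich <- function(X, y) {           # replaces sigma^2 I by diag(r^2)
        XtXinv <- solve(t(X) %*% X)
        b <- XtXinv %*% t(X) %*% y
        r <- c(y - X %*% b)
        XtXinv %*% (t(X) %*% (r^2 * X)) %*% XtXinv   # 'bread * meat * bread'
      }
      ## Example use (made-up data):
      ## X <- cbind(1, runif(100));  y <- rnorm(100, X %*% c(1,2), 1)
      ## vcov.classical(X, y);  vcov.sandwich(X, y)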
================================================================

* SIMULATION EXAMPLE: simple linear regression with intercept
  ==> X is a 1-predictor matrix, hence of size Nx2.

  ## R:
  N <- 50                              # Picking a sample size
  X <- cbind(x0=rep(1,N), x1=runif(N)) # Making up a predictor matrix
  sigma <- 1                           # Assuming a 'true' error variance
  beta <- c(b0=1, b1=2)                # Assuming a 'true' coeff vector
  y <- rnorm(N, m=X%*%beta, s=sigma)   # Simulating a response vector
  b <- solve(t(X)%*%X)%*%t(X)%*%y;  b

  ## Simulate and animate parallel universes/possible worlds/datasets ...:
  for(i in 1:500) {
    y <- rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector
    b <- solve(t(X)%*%X) %*% t(X) %*% y
    plot(x=X[,2], y=y, pch=16, col="gray20", ylim=c(-2,6))
    abline(b, lwd=3, col="red")
    dev.flush()
    Sys.sleep(0.5)
  }
  ## ==> We see the DATASET-TO-DATASET (sampling) variability.

  ## Next, simulate many response vectors (possible worlds, parallel universes,...)
  ## and store the results:
  Nsim <- 10000
  bs <- matrix(NA, nrow=Nsim, ncol=length(beta))       # Storage for coeff estimates
  colnames(bs) <- paste("b",0:(length(beta)-1),sep="") # Cosmetics
  for(isim in 1:Nsim) {
    y <- rnorm(N, m=X%*%beta, s=sigma)                 # Simulating a response vector
    bs[isim,] <- solve(t(X)%*%X)%*%t(X)%*%y
    if(isim%%100==0) cat(isim,"...")
  }

  ## Plot slopes versus intercepts to illustrate the 'sampling' or
  ## 'dataset-to-dataset' variability of b:
  plot(bs, pch=16, cex=.5)
  ## ==> Negative correlation between b0 and b1 !!!

  ## Compare:
  solve(t(X)%*%X)*sigma^2   # V[b]
  cov(bs)                   # simulation estimate of the previous
  ## Interpretation:
  ## . cov(bs) is a simulation approximation to V[b].
  ## . solve(t(X)%*%X)*sigma^2 is the theoretically obtained formula for V[b].
  ## ==> They better be close!!!

  ## Further intuitions:
  ## Visualize the simulated lines and the simulated distribution of 'b':
  windows(height=5, width=10)
  par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,1,1))   # Set up two plot panels.
  # Panel 1: Show the last y only, faintly, because this is irrelevant.
  plot(cbind(x=X[,2],y=y), pch=16, cex=1, col="gray", xlim=c(-.5,1.5), ylim=c(-2,6))
  # Panel 1: Draw the estimated lines for 100 simulations.
  for(i in 1:100) abline(bs[i,])
  # Panel 2: Show a scatterplot of slopes versus intercepts:
  plot(bs, pch=16, cex=.5)

  ## Questions:
  ## . How does the negative correlation on the right
  ##   relate to the lines on the left?
  ##   ...
  ## . If we had done the simulation with a uniform
  ##   x-distribution on [-1,+1] instead of [0,1],
  ##   what would the correlation between intercepts and slopes be?
  ##   Alternative R code:  X <- cbind(x0=rep(1,N), x1=runif(N,-1,+1))
  ##   ...

================================================================

* LINEAR MODEL ANALYSIS, STEP 2: VARIABILITY OF yhat AND r

- There is variability in 'b' ... dataset-to-dataset
  => There is variability in 'yhat' and 'r'.

      yhat = X betahat = X b
      r    = y - yhat = y - X betahat = y - X b

- Strangeness about the randomness in yhat and r:
  Even though both are N-dimensional,
  . yhat can only range in ...   dim=...
  . r    can only range in ...   dim=...

- Let, as usual:  H = X (X^T X)^{-1} X^T

  . First order:   E[yhat] = ...
                   E[r]    = ...
    What assumptions are used? ...
    What assumptions are not used? ...

  . Second order:  V[yhat]   = ...
                   V[r]      = ...
                   V[yhat,r] = ...
    What assumptions are used? ...
    What assumptions are not used? ...

- Special cases:
  . Diagonal elements:      Var[yhati] = ...
                            Var[ri]    = ...
  . Off-diagonal elements:  Cov[yhati,yhatk] = ...
                            Cov[ri,rk]       = ...
  . Cov[yhati,rk] = ...

- What variability is being described in the above equations? ...

- Implications: Assume the errors (noise) in the Nx1 vector eps = y - X beta
  are homoskedastic and uncorrelated.
  . The residuals are generally ... (homo/heteroscedastic?)
  . The residuals are generally ... (un/correlated?)
  . Pairs of distinct fits and corresponding pairs of residuals
    are generally correlated with the ...[same/opposite?] sign.
  . The variance of residuals is ... [>/< sigma^2 ?]
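- A minimal R sketch (made-up X, beta, sigma) that simulates many response vectors and
  compares the simulated covariances of yhat and r with sigma^2 H and sigma^2 (I-H);
  the diagonal of H shows why the residuals are heteroscedastic even when the errors are not.

      set.seed(2)
      N <- 20;  sigma <- 1
      X <- cbind(1, runif(N));  beta <- c(1, 2)
      H <- X %*% solve(t(X) %*% X) %*% t(X)
      Nsim <- 20000
      yhats <- rs <- matrix(NA, Nsim, N)
      for (i in 1:Nsim) {
        y <- rnorm(N, m=X %*% beta, s=sigma)
        yhats[i,] <- H %*% y
        rs[i,]    <- y - yhats[i,]
      }
      max(abs(cov(yhats) - sigma^2 * H))              # small (simulation error only)
      max(abs(cov(rs)    - sigma^2 * (diag(N) - H)))  # small (simulation error only)
      range(diag(H))                                  # Var[r_i] = sigma^2*(1-H_ii) varies with i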
- Traces:

      tr(H)   = p+1 = sum H_ii     # ==> average size of H_ii = ...
      tr(I-H) = N-(p+1)
      tr(I)   = N

  'Trick': diagonalize H

- Trace formulas or 'total variance' formulas:
  Def.: total variance of a random vector Z = sum_i V[Z_i] = tr(V[Z])

      sum V[y_i]    = sigma^2 N    = sigma^2 tr(I)
      sum V[yhat_i] = tr(V[yhat])  = sigma^2 tr(H)    = sigma^2 (p+1)
      sum V[r_i]    = tr(V[r])     = sigma^2 tr(I-H)  = sigma^2 (N-(p+1))

- Interpretation of the trace formulas:
  The more predictors, the ...[more/less] variability in yhat.
  The more predictors, the ...[more/less] variability in r.
  NOTE:
  . This is independent of whether the model is 1st order correct/unbiased or not!
    We only used the second order assumptions: homoskedasticity and uncorrelated errors.
  . This is independent of whether we use good or bad predictors.
    Whether a predictor is good or bad, its inclusion moves one case worth of
    variation (sigma^2) from r to yhat (and hence b).

- 'Degrees of freedom': dimensions of the subspaces in which yhat and r can range

      dfs(fits)  = p+1
      dfs(resid) = N-(p+1)
      dfs(y)     = N

- [Quirky question: Unlike for y, yhat, r, why might we NOT be interested in the
   total variance of b?  Think like a physicist: What are the units of tr(V[b])
   compared to those of tr(V[y]), tr(V[yhat]), tr(V[r])? ... ]

================================================================

* ESTIMATING THE ERROR VARIANCE sigma^2:

- Recall from ~ Stat 102:

      RMSE = sigmahat = sqrt( RSS / (N-p-1) )

  RMSE^2 is an unbiased estimate of sigma^2,  E[RMSE^2] = sigma^2,
  as we will now show.

- Heuristic: RMSE^2 is RSS up to a factor 1/(N-p-1),
  hence look at E[r_i^2] first, then sum up.

- If the model is ... order correct, then

      E[ r_i ]   = ...
      E[ r_i^2 ] = ...

  Which of the 1st and 2nd order model assumptions are used??? ...

      E[RSS] = ...

  Therefore:  sigma^2 = ...
  Hence an unbiased estimate of sigma^2 is  sigmahat^2 = ...

- PS: We would really like E[RMSE] = sigma, but this is not true,
  and there is no way to generate an unbiased estimate of sigma.
  Hence one resorts to what is mathematically doable:
  showing that E[RMSE^2] = sigma^2.

  PPS: But there is an inequality: Jensen applied to f(x)=x^2 (convex)
  gives us the following:

      E[X^2] >= E[X]^2

  Apply this to X=RMSE:

      E[RMSE^2] >= E[RMSE]^2

  But E[RMSE^2] = sigma^2, hence

      sigma^2 >= E[RMSE]^2
      sigma   >= E[RMSE]

  ==> RMSE is an under-estimate of sigma !!!
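- A minimal R sketch (made-up X, beta, sigma) illustrating both statements:
  the average of RMSE^2 across simulated datasets is close to sigma^2,
  while the average of RMSE itself falls slightly but systematically below sigma.

      set.seed(3)
      N <- 20;  p <- 1;  sigma <- 1
      X <- cbind(1, runif(N));  beta <- c(1, 2)
      rmse <- replicate(10000, {
        y <- rnorm(N, m=X %*% beta, s=sigma)
        r <- resid(lm(y ~ X - 1))
        sqrt(sum(r^2) / (N - p - 1))
      })
      mean(rmse^2)   # close to sigma^2 = 1
      mean(rmse)     # slightly below sigma = 1   (Jensen)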
================================================================

* SAMPLING PROPERTIES OF THE 'FIXED' PREDICTOR MATRIX X:

- We derived:     V[b] = (X^T X)^{-1} sigma^2     (the 'double-X matrix')
  Compare with:   V[Xbar] = sigma^2 / N   for iid X1,...,XN   (Stat 101)
  Q: Why don't we see the increased precision in V[b] as N grows?

- Even though statisticians condition on X, we now consider the sampling
  properties of X, assuming the rows of X are iid samples from a random vector
  'xrow' with multivariate distribution P(dx1,dx2,...,dxp):

      X = rbind(xrow1,xrow2,...,xrowN)
      xrowi ~ P(dx1,dx2,...,dxp)   iid for i=1,...,N

  ==> Point of view of multivariate analysis applied to predictors/regressors.

- Interpretation of X^T X in terms of xrow1,...,xrowN:

      X^T X = sum of rank-one matrices formed from the xrowi   (HW1)
            = xrow1 xrow1^T + xrow2 xrow2^T + ... + xrowN xrowN^T
            = sum_i xrowi xrowi^T

  (We think of xrowi as a column, though it's a row of X.)

  Q: Does X^T X converge to something if we increase N?
  A: No, but it does if we divide it by ... !!!!

      (X^T X)/N = sum_{i=1...N} xrowi xrowi^T / N

  Reason: Vector/matrix version of the LLN.
  The rank-one matrices (xrowi xrowi^T) are iid (i=1...N),
  and the LLN applies to the (j,k) entries of the matrices.

  Q: What is the limit?
  A:  sum_i (xrowi xrowi^T) / N  -->  E[xrow xrow^T] =: C[xrow]

  Note that C[xrow] = E[xrow xrow^T] is NOT equal to
            V[xrow] = E[xrow xrow^T] - E[xrow] E[xrow]^T.
  Call it the "2nd-moment matrix" instead.
  Think of it as the "uncentered variance/covariance matrix" of the random vector 'xrow'.
  We assume that C[xrow] is non-singular.

- Conclusions:

        (X^T X / N)       -->  C[xrow]        (LLN)
    ==> (X^T X / N)^{-1}  -->  C[xrow]^{-1}
      = (X^T X)^{-1} * N  -->  C[xrow]^{-1}
    ==> (X^T X)^{-1}       ~   C[xrow]^{-1} / N

  Now we are seeing 1/N in V[b]:

      ---------------------------------------
      |                                     |
      |   V[b]  ~  sigma^2 C[xrow]^{-1} / N |
      |                                     |
      ---------------------------------------

  . Recap the thought process: As N -> Inf, what happens to the following?

        (X^T X)          --> ...
        (X^T X)^{-1}     --> ...
        X^T X / N        --> ...
        ( X^T X/N )^{-1} --> ...

  . Q: What does this formula say about the precision of estimates of regression
       coefficients in relation to the distribution of the predictor vectors?
    A: We gain precision of estimation with more data at the rate 1/N
       in terms of DATASET-TO-DATASET VARIANCES
       (and hence 1/sqrt(N) for dataset-to-dataset sderrs).

- The weirdness about V[b] = sigma^2 (X^T X)^{-1} :
  . Conditionally on X, V[b] is a fixed covariance matrix,
    but because of the conditioning it should be written 'V[b|X]'.
  . Unconditionally, V[b|X] = sigma^2 (X^T X)^{-1} is a random matrix
    if the X matrix is observational and hence random.
    ('Observational' means: If we realistically repeated the data collection,
     we would end up with a different X.  This is different from designed
     experiments where we control X and are able to set the values of X.
     In repeats of a designed experiment, X is the same.)
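- A minimal R sketch (x1 ~ Uniform(0,1), a made-up design) of the LLN argument:
  (X^T X)/N stabilizes as N grows, while X^T X itself does not.

      set.seed(4)
      Cxx <- function(N) { X <- cbind(1, runif(N));  t(X) %*% X / N }
      Cxx(100);  Cxx(10000);  Cxx(1000000)   # settling down as N grows
      rbind(c(1, 1/2), c(1/2, 1/3))          # the limit C[xrow] = E[xrow xrow^T] for this design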
================================================================

* "ADJUSTMENT" AND ITS ALGEBRA:

- Intuitive example: 'Explain' students' school performance.
  . Factors of interest: Measures of school quality
    (class size, training of teachers, teacher salaries, physical plant, ...)
    ==> Collect these predictors in a matrix X1.
  . Factors NOT of interest in the study but which cannot be ignored: Demographics
    (household income, home zip code, education level of parents, ...)
    ==> Collect these predictors in a matrix X2.
  . Outcome variables: SAT scores, grades, graduation vs dropping out, ...
  . Intuition: Need to ADJUST the variables of interest for demographics.

- Partitioning multiple linear regression:

  Partition the predictor matrix:   X = (X1,X2)    (Nxp1, Nxp2)
      X1 = predictors of primary interest
      X2 = non-ignorable predictors not of primary interest
  Partition the LS estimator:       b^T = (b1^T,b2^T)    (p+1 = p1+p2)

      y = X b + r = X1 b1 + X2 b2 + r

  Compare with the naive approach:

      y = X1 b1* + r*

- Q: How are b1 and b1* related to each other?
  A: In general we only know that b1 and b1* are not the same.
     Rough reason: The following problems produce different b1 coefficients:

         | y - (X1 b1 + X2 b2) |^2 = min_b1,b2
         | y - (X1 b1*) |^2        = min_b1*

- Define orthogonal (LS) projections according to the partitions:

      H  = projection onto the column space of X
      H1 = projection onto the column space of X1
      H2 = projection onto the column space of X2

  (H1, H2 are NOT partitions of H; all are NxN !!!)
  Question: When is H = H1 + H2 ?

- DEFINITION:  adjustment = residualizing operation
               I-H2 = "adjustment operator" for X2
                      ('regresses out' demographics, e.g.)

      ------------------------------------------------
      |                                              |
      |  y.2 := (I-H2) y   is  "y adjusted for X2"   |
      |                                              |
      ------------------------------------------------

  . Algebraically: y.2 is a residual vector in a model with predictors X2.
  . Intuitively:   y.2 has the variation due to X2 removed from y;
                   variation due to X2 ~ "explained"/"accounted for" by X2.
  . Practical example from above:
      X1  = predictors describing school quality
      X2  = predictors describing demographics
      y   = performance on SAT
      y.2 = performance on SAT, adjusted for demographics

- Some 'higher trivialities': Algebra and geometry of adjustment

  . If y = X1 b1 + X2 b2 + r is the LS decomposition based on both X1 and X2,
    note the following: r and the columns of X are orthogonal to each other,
    hence r and the columns of X1, X2 are ...
    ==> r is 'adjusted' for both X1 and X2.

  . Practice:
      H X        = ...  X
      H X1       = ...  X1
      H X2       = ...  X2
      H r        = ...  0
      (I-H) r    = ...  r
      (I-H) X    = ...  0
      H1 r       = ...  0
      H2 r       = ...  0
      (I-H1) r   = ...  r
      (I-H2) r   = ...  r
      (I-H1) X1  = ...  0
      (I-H2) X2  = ...  0
      (I-H2) X1  = ...

  . Upcoming surprise: We will need X1 adjusted for X2 !!!!

    DEFINITION:
    -------------------------------------------------------------------------
    |  X1.2 := (I-H2) X1                                                    |
    |        = Collection of predictors of interest X1 (school quality)     |
    |          adjusted for                                                 |
    |          non-ignorable but uninteresting predictors X2 (demographics).|
    -------------------------------------------------------------------------

  . r is orthogonal to X1.2:

      r^T X1.2 = row vector of inner products of r with the columns of X1.2
               = r^T (I-H2) X1
               = r^T (I-H2)^T X1
               = ((I-H2)r)^T X1
               = r^T X1
               = 0

    because r^T X = 0 is the normal equation, meaning that r is orthogonal
    to all columns of X, hence orthogonal to the subset of columns in X1.
    This should not be a surprise: Residualizing/orthogonalizing X1 wrt X2
    is an operation that stays entirely within the column space of X=(X1,X2).
    Related facts:
      cspan(X1.2) is a subspace of cspan(X1,X2)=cspan(X).
      cspan(X1.2) = orthogonal complement of cspan(X2) within cspan(X1,X2)=cspan(X).

- Adjust the LS decomposition y = X1 b1 + X2 b2 + r for X2
  by applying I-H2 to both sides:

      (I-H2) y    =  (I-H2) X1 b1  +  r            (*****)
      ^^^^^^^^       ^^^^^^^^^        ^
      X2-adjusted    X2-adjusted      Remains fixed under
      response y     predictors X1    X2-adjustment

  Recall the notation:
      y.2  = (I-H2) y    'y adjusted for X2'
      X1.2 = (I-H2) X1   'X1 adjusted for X2'

  From the orthogonality of X1.2 and r in (*****) we infer
  that this is an LS decomposition:

      -------------------------------------
      |                                   |
      |    y.2 = X1.2 b1 + r              |    (****)
      |                                   |
      |    where b1 is the LS estimator   |
      |    for the X1 coefficients in the |
      |    JOINT regression onto X=(X1,X2)|
      |                                   |
      -------------------------------------

  It solves

      | y.2 - X1.2 b1 |^2 = min_b1                 (***)

  The solution b1 of (***) is the same as that of

      | y - X1 b1 - X2 b2 |^2 = min_b1,b2          (**)

- Interpretation of b1: To get the LS regression coefficients b1
  in the joint regression of y on X=(X1,X2), one can do the following:

      --------------------------------------------------------
      |                                                      |
      |  Regress the X2-adjusted response,  y.2  = (I-H2)y,  |
      |  on the X2-adjusted X1 predictors,  X1.2 = (I-H2)X1. |
      |                                                      |
      --------------------------------------------------------

  Note: The adjustment formalism is not meant for computations
        but for conceptual understanding.  Stay tuned!
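- A minimal R sketch (made-up X1, X2, y) checking the boxed claim: the X1-coefficients
  from the joint regression equal the coefficients of y.2 regressed on X1.2.

      set.seed(5)
      N  <- 200
      X1 <- cbind(runif(N), runif(N))                 # 'predictors of interest' (made up)
      X2 <- cbind(1, rnorm(N), X1[,1] + rnorm(N))     # intercept + 'nuisance' predictors, correlated with X1
      y  <- drop(X1 %*% c(1,-1) + X2 %*% c(2,.5,.5) + rnorm(N))
      b.joint <- coef(lm(y ~ X1 + X2 - 1))            # joint regression on (X1,X2)
      y.2  <- resid(lm(y  ~ X2 - 1))                  # y adjusted for X2
      X1.2 <- resid(lm(X1 ~ X2 - 1))                  # columns of X1 adjusted for X2
      b.adj <- coef(lm(y.2 ~ X1.2 - 1))
      cbind(joint=b.joint[1:2], adjusted=b.adj)       # identical up to rounding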
================================================================

* APPLICATION 1 OF ADJUSTMENT: CENTERED REGRESSION

- We usually consider the intercept as a nuisance.
  With adjustment we can get rid of it:

      X1 = all non-intercept predictors ('substantive regressors')
      X2 = e = (1,...,1)^T, the intercept 'regressor'

- What does adjustment for e do?

      H2 = e e^T / |e|^2 = e e^T / N

      H2 y = e (e^T y)/N = mean(y) e
               ^^^^^^^^^
                mean(y)

      (I - H2) y = y - mean(y) e = centered y vector

- The adjusted LS decomposition (****) above,

      y.2 = X1.2 b1 + r ,

  means that the substantive LS estimates can be obtained from a
  'centered regression' with the intercept omitted.

================================================================

* APPLICATION 2 OF ADJUSTMENT: REDUCTION OF A MULTIPLE TO A SIMPLE REGR COEFF

- Consider: a single predictor adjusted for all else

      X1 = single column xj                    (Nx1)
      X2 = all columns of X except column xj   (Nxp)

  => The adjusted equation (****) amounts to a simple linear regression
     without intercept.

- Simplified notation:

      y.adj  = (I-H2) y    (y adjusted for all xk except xj)
      xj.adj = (I-H2) xj   (xj adjusted for all xk except xj)
      bj     = slope of the simple lin. regr. of "y adjusted for all but xj"
               onto "xj adjusted for all other predictors"

  [Reminder: The simple regr coeff w/o icept of y on x is  b = <y,x>/|x|^2 .]

      -------------------------------
      |                             |
      |        < y.adj, xj.adj >    |
      |  bj = ------------------    |   ("adjustment formula" for bj)
      |           |xj.adj|^2        |
      |                             |
      -------------------------------

  Reminder: bj = j'th MULTIPLE regression coefficient (estimate)

- Interpretation: bj is the simple lin regression coefficient
  when regressing y.adj on xj.adj w/o intercept.

- Elementary consequence: Does it matter for the value of bj
  if we add or drop some of the other regressors? ...

- Some weirdness: Adjusting y is actually not necessary.  Adjusting xj is.
  Here is how you see it:

      Numerator of the adjustment formula
        = < y.adj, xj.adj >
        = < (I-H2)y, (I-H2)xj >
        = < y, (I-H2)xj >              (Why? Why? Why?)
        = < y, xj.adj >

      ---------------------------
      |                         |
      |        < y, xj.adj >    |
      |  bj = --------------    |   (the "weird adjustment formula" for bj)
      |         |xj.adj|^2      |
      |                         |
      ---------------------------

- Interpretation 1: bj is the simple lin regression coefficient
  when regressing y on xj.adj w/o intercept.

- Interpretation 2: bj is a linear function of y !!!

      bj = a^T y     ("contrast")     where

             xj.adj
      a = ------------
           |xj.adj|^2

  If we condition on X, then 'a' is a constant vector.
  Can you smell a way to get at a standard error for bj?
  [See the section after next.]
  [We know this already because the whole vector b of coefficient estimates is
       b = (X^T X)^{-1} X^T y
   which is a linear mapping of y (in R^N) to b (in R^{p+1}).]
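- A minimal R sketch (made-up x1, x2, y) checking the "weird adjustment formula"
  and the contrast representation bj = a^T y for a single coefficient:

      set.seed(6)
      N  <- 100
      x1 <- runif(N);  x2 <- runif(N) + x1            # x2 correlated with x1 (made up)
      y  <- 1 + 2*x1 - 1*x2 + rnorm(N)
      b1.lm  <- coef(lm(y ~ x1 + x2))["x1"]           # multiple regression coefficient of x1
      x1.adj <- resid(lm(x1 ~ x2))                    # x1 adjusted for the intercept and x2
      b1.adj <- sum(y * x1.adj) / sum(x1.adj^2)       # <y, x1.adj> / |x1.adj|^2
      a      <- x1.adj / sum(x1.adj^2)                # the 'contrast' vector:  b1 = a^T y
      c(b1.lm, b1.adj, sum(a * y))                    # all three agree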
================================================================

* ADJUSTED VARIABLES PLOT OR "LEVERAGE PLOT":

- A plot of y.adj vs. xj.adj shows what the LS criterion "sees"
  when determining bj.

- Example: Fetch data from the class website; after downloading to folder 'Data'...

    cars <- read.table("http://www-stat.wharton.upenn.edu/~buja/STAT-961/CarModels2003-4-NA.TXT",
                       header=T, sep=",", na=".")
    names(cars);  dim(cars)
    rownames(cars) <- paste(1:nrow(cars), as.character(cars[,"Model"]))
    apply(cars, 2, function(x) sum(is.na(x)))  # there is an NA in "MPG.Highway"
    str(cars)                                  # Ooops
    cars[,"MPG.Highway"]
    cars[,"MPG.Highway"] <- as.numeric(cars[,"MPG.Highway"])

    # Select non-NAs:
    sel <- !is.na(cars[,"MPG.Highway"]) & !is.na(cars[,"Weight.lbs"]) & !is.na(cars[,"Horsepower"])

    # Multiple regression of the response on two predictors:
    cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=cars[sel,])

    # Adjusted response and other predictor:
    mpg.adj <- resid(lm(MPG.Highway ~ Horsepower, data=cars[sel,]))
    wgt.adj <- resid(lm(Weight.lbs  ~ Horsepower, data=cars[sel,]))

    # Compare coeff of Weight: same!!  "proof" by example...
    lm(mpg.adj ~ wgt.adj);  cars.lm
    plot(mpg.adj ~ wgt.adj, pch=16, cex=.5);  abline(lm(mpg.adj ~ wgt.adj))

- "Leverage Plot" because it shows points that have high leverage for bj
  (every bj can have its own leverage points).

- Canned function for leverage plots with optional point identification:

    source("http://www-stat.wharton.upenn.edu/~buja/STAT-961/leverage-plots.R")
    leverage.plots(model=cars.lm, ident=c(1,2), frame.sz=5)  # stop with right click

  [This function allows you to identify funny points; right-click to stop.]

- Another use of leverage plots: predictor variables shrink when adjusted.
  Leverage plots show the shrinkage in the horizontal width.
  One can then read off what estimated average difference in the response
  corresponds to the adjusted range.

- Note to future TAs of Stats classes:
  The JMP software provides leverage plots with all regression outputs.

- Ultimate final question: Why does mpg.adj take on negative values? ...

- Final question: What happens to the intercept vector under adjustment?
  In other words, is x0.adj still a constant column??? ...
  Example: the icept vector 'adjusted' for weight and horsepower

    hist(resid(lm(rep(1,nrow(cars)) ~ .-1, data=cars[,c("Weight.lbs","Horsepower")])),
         main="", xlab="", col="gray30", breaks=20)

================================================================

* VARIANCE/STDERR OF bj:

- Note about terminology:
  We call 'standard error' (SE) the true sdev of an estimate (across ...).
  We call 'standard error estimate' (SE.est) the estimated sdev of an estimate.
  (Most people say 'standard error' and mean 'standard error estimate'.)

- Reminder: We already know the SE of bj from V[b].

      V[bj] = ( (X^T X)^{-1} )_jj sigma^2

- Goal: Obtain Var[ bj ] from the adjustment formula for bj.
  Recall the variance of a linear combination: 'butterfly'/'sandwich' with A=a^T

      Var[ <a,y> ] = Var[ a^T y ] = ...

  Use a = xj.adj:

      Var[ <xj.adj,y> ] = ...

  To obtain V[bj]: Be careful and handle the denominator correctly.

      Var[ bj ] = ...

                      ...
      Var[bj] = ---------------
                      ...

  Finally, SE[bj] = sqrt(Var[bj]):

      -----------------------
      |                     |
      |            sigma    |   "Adjustment formula"
      |  SE[bj] = --------  |   for standard errors
      |           |xj.adj|  |   of regression coeffs
      |                     |
      -----------------------
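- A minimal R sketch (made-up data; sigma replaced by its estimate sigmahat = RMSE)
  comparing the adjustment formula for the standard error with the lm() output:

      set.seed(7)
      N  <- 100
      x1 <- runif(N);  x2 <- runif(N) + x1
      y  <- 1 + 2*x1 - 1*x2 + rnorm(N)
      fit      <- lm(y ~ x1 + x2)
      sigmahat <- summary(fit)$sigma                  # RMSE
      x1.adj   <- resid(lm(x1 ~ x2))                  # x1 adjusted for all other regressors
      c(formula = sigmahat / sqrt(sum(x1.adj^2)),     # sigmahat / |x1.adj|
        lm.out  = summary(fit)$coefficients["x1", "Std. Error"])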
- Consequences for statistical inference:
  . What does |xj.adj| mean statistically speaking?
    Try this for j>0 (exclude the x0.adj):
        |xj.adj|/sqrt(N-1) = ...
  . Use the previous sub-bullet to rewrite the above magic formula for SE[bj]
    and make the root-N law visible, assuming the regressor rows are randomly sampled:
        SE[bj] = ...
  . Do we like small or large values of |xj.adj| ? ...
  . How do you "see" |xj.adj|/sqrt(N-1) in the leverage plot? ...
  . What happens to |xj.adj| if we add a predictor/regressor
    that is highly correlated with xj? ...
  . See Mosteller & Tukey, p.326f, on the proxy problem:
    If we have mother's education in the regression equation,
    what do you think is going to happen to SE(b_mother)
    if we add father's education level to the equation? ...
  . How would you measure how collinear xj is with all other predictors?
    See next!

================================================================

* VARIANCE INFLATION FACTOR (VIF) -- BETTER: SE INFLATION FACTOR (SEIF)

- Prelude: When is there no adjustment effect?
  . General case: y = X1 b1 + X2 b2 + r
    When is X1.2 = X1 ?
    ...  X1 _|_ X2
  . Single case: X1 = xj, X2 = all other predictors
    When is xj.adj = xj.0 ?
    (We always adjust for the intercept, that is, centering.)
    ...  xj.0 is orthogonal to xk.0 for all k (!= j)
  . What inequality is there between |xj.adj| and |xj.0| ?
    |xj.adj| <= |xj.0|

- When is there the least loss in precision of bj from collinearity?
  ...  |xj.adj| = |xj.0|  <==>  xj.adj = xj.0

- What is SE[bj] in this case?  Call it SE.best[bj].
  ...  SE.best[bj] = sigma / |xj.0|

- Definition: Standard error and variance inflation factors

      SEIF[bj] := SE.actual[bj] / SE.best[bj]
      VIF[bj]  := SEIF[bj]^2

- Facts:

      SEIF[bj] = (sigma/|xj.adj|) / (sigma/|xj.0|)

      ------------------------------
      |                            |
      |              |xj.0|        |
      |  SEIF[bj] = ----------     |
      |             |xj.adj|       |
      |                            |
      ------------------------------

- Fact: The lower bound on SEIF is  SEIF[bj] >= ...

- AYT?  Is the SEIF (and VIF) dependent on the units of xj? ...
  If xj has units of Celsius, does SEIF[bj] change under conversion to Fahrenheit? ...

- SEIF in R: Ignoring the SEIF for the intercept,
  here is a way to get the SEIFs for all slopes:

      sqrt(diag(solve(cor(X))))
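- A minimal R sketch (made-up slope regressors; no intercept column in 'Xs',
  matching the use of cor() above) computing the SEIFs both ways:

      set.seed(8)
      N  <- 200
      x1 <- runif(N);  x2 <- runif(N) + x1;  x3 <- rnorm(N)
      Xs <- cbind(x1=x1, x2=x2, x3=x3)                # slope regressors only
      seif.direct <- sapply(colnames(Xs), function(j) {
        xj.0   <- Xs[,j] - mean(Xs[,j])                          # xj adjusted for the intercept only
        xj.adj <- resid(lm(Xs[,j] ~ Xs[, colnames(Xs) != j]))    # xj adjusted for intercept and all other xk
        sqrt(sum(xj.0^2) / sum(xj.adj^2))                        # |xj.0| / |xj.adj|
      })
      rbind(direct=seif.direct, via.cor=sqrt(diag(solve(cor(Xs)))))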
================================================================

* HAT DIAGONAL ELEMENTS AS MEASURES OF OUTLYINGNESS IN PREDICTOR SPACE

- General concept: Given a symmetric positive definite matrix S, think of it
  as defining a squared distance from the origin, or squared norm, as follows:

      |x|_S^2 = x^T S x

  Using a root matrix R of S, i.e., S = R^T R, you can define new coordinates
  xnew = R x, and you will find:

      |xnew|^2 = xnew^T xnew = x^T R^T R x = x^T S x = |x|_S^2
      ^^^^^^^^
      Euclidean

  So, yes, x^T S x is a squared norm modulo a linear distortion.

- Recall the hat matrix:  H = X (X^T X)^{-1} X^T

- For a row written as a column, xrowi = (x_i0=1, x_i1, x_i2, ..., x_ip)^T, we have:

      H_ii = xrowi^T (X^T X)^{-1} xrowi

  ==> H_ii is the squared norm of the i'th case in predictor space
      wrt the norm defined by S = (X^T X)^{-1}

- More intuition about "outlyingness" in regressor space:

  . Write:  X = (x0=e, x1, ..., xp) of size Nx(p+1)
            e = (1,...,1)^T of size Nx1

  . Form the "mean-adjusted matrix" Z:
        Z = (x1.0, ..., xp.0)   (Nxp) matrix of centered real regressors (j>0)
    AYT?  Can the range of column xj.0 be [100,200]?

  . Facts:
        H_e  = e e^T / N     [Homework 1: What is H_e x ?  H_e x = mean(x) e]
        xj.0 = (I - H_e) xj
        H_X  = H_e + H_Z     [Homework 1: e _|_ xj.0 for j=1,...,p]

  . Consequence: Let
        z{i} = row i of Z written as a column (in R^p)
        z{k} = row k of Z written as a column (in R^p)

        Hik = 1/N + z{i}^T (Z^T Z)^{-1} z{k}
        Hii = 1/N + z{i}^T (Z^T Z)^{-1} z{i}

  . Geometric interpretation: In the metric defined by (Z^T Z)^{-1},
    Hii - 1/N is the squared distance of z{i} from the origin.

- Consider a fixed location z0 in R^p and define a POSITIVE QUADRATIC FORM in z0 by:

      H(z0) := z0^T (Z^T Z)^{-1} z0

  Recall from HW1:  Z^T Z = sum_{i=1...N} z{i} z{i}^T

  As N --> Inf, what happens to:
  . H(z0)   --> ... ?   [LLN assuming the z{i} have a multivariate distribution]
  . N*H(z0) --> ... ?
  Does this make sense?  Think in terms of the 'self-influence' of location z
  as N-->Inf and formulate a rule for its decay rate. ...

================================================================

* ANOVA DECOMPOSITIONS, R SQUARE, AND E[SS]:

- Q: How much "variation" does the model given by X "account for"?

  . Approach: Use sums of squares (SS) as measures of "variation",
    but first remove the uninteresting variation: the mean.

  . Preliminaries: recall the "mean-adjusted" LS decomposition

        y.0 = x1.0 * b1 + ... + xp.0 * bp + r = yhat.0 + r

    obtained from the full equation by centering:

        y = x0 * b0 + x1 * b1 + ... + xp * bp + r

    [Note that r is centered, but yhat is not, hence we need yhat.0.]
    Unlike the real predictors, the intercept is not considered "explanatory".
    Therefore, get rid of the intercept once and for all.

- Pythagorean decomposition of orthogonal projections:

      y.0 = b1*x1.0 + ... + bp*xp.0 + r   (The mean-adjusted LS decomposition again)
      y.0 = yhat.0 + r                    (On the RHS are orthogonal vectors.)

  ==> |y.0|^2 = |yhat.0|^2 + |r|^2        Pythagorean theorem

      -----------------------------------
      |        TSS = MSS + RSS          |   'ANOVA decomposition' (the simplest form)
      -----------------------------------
      Total SS = Model SS + Residual SS

- Definition of RSquare ("R2"):

  .         MSS     |yhat.0|^2
    R2  := ----- = ------------ = "Fraction of variation accounted for by the model"
            TSS      |y.0|^2      "                     --explained--              "

  . Fact:
      ----------------------
      | R2 = cor(y,yhat)^2 |  = cor(y.0,yhat.0)^2
      ----------------------
    + Note: 'cor()' implies mean-adjustment and standardization.
    + Proof:
        cor(y,yhat)^2 = <y.0,yhat.0>^2 / (|y.0|^2 * |yhat.0|^2)
                      = <yhat.0,yhat.0>^2 / ...
                      = ...                                        QED

- Can we calculate E[TSS], E[RSS] and E[MSS] theoretically?

  . We won't have use for this in the future, but there are two reasons
    for doing this math exercise:
    + ANOVA (really the analysis of designed experiments with highly structured X)
      is permeated by calculations of expected values of sums of squares (E[SS]).
    + The calculations are insightful in their own right because they describe
      sources of variation between datasets.

  . Assume the model is unbiased and has uncorrelated homoscedastic errors:

        y = X beta + eps   with  E[eps]=0, V[eps] = sigma^2 I
        ==> E[y] = X beta

  . The smallest model is the mean model:  y = beta0 x0 + eps   (x0 = 1 (Nx1))
    We compare this model with the full model.

        TSS = |(I-H0)y|^2   (total SS in the data other than the mean)
        MSS = |(H-H0)y|^2   (SS captured by the model other than the intercept)
        RSS = |(I-H) y|^2   (SS not captured by the model)

    where H is the projection onto the space spanned by x0,x1,...,xp
    and H0 onto the vector x0.

  . Note: I-H0, H-H0, I-H are all orthogonal projections,
    with ranks N-1, p, N-p-1, respectively.
    Fact: H-H0 = H(I-H0)

  . We will reduce the derivations of E[TSS], E[MSS], E[RSS] to the following
    generic facts about projections of spherically distributed random vectors eps
    (to first and second order): E[eps] = 0 and V[eps] = sigma^2 I;  y = mu + eps.

  . FACT 1: For any orthogonal projection H we have the following.

        E[ |H eps|^2 ] = E[ eps^T H eps ]
                       = E[ sum_{ij} eps_i H_ij eps_j ]
                       = sum_{ij} H_ij E[ eps_i eps_j ]
                       = sum_i H_ii sigma^2
                       = sigma^2 tr(H)

  . FACT 2: Consider the vector 'eps' shifted away from the origin, y = mu + eps.

        E[ |H (mu + eps)|^2 ] = E[ |H mu|^2 + mu^T H eps + eps^T H mu + |H eps|^2 ]
                              = |H mu|^2 + sigma^2 tr(H)

    The neat aspect is that this holds for ANY orthogonal projection H
    and any shift vector 'mu'.
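  . A minimal R sketch (made-up X, mu, sigma) checking FACT 2 by simulation:
    E[ |H(mu+eps)|^2 ] should match |H mu|^2 + sigma^2 tr(H).

        set.seed(9)
        N <- 30;  sigma <- 1.5
        X  <- cbind(1, runif(N), rnorm(N))
        H  <- X %*% solve(t(X) %*% X) %*% t(X)     # any orthogonal projection will do
        mu <- drop(X %*% c(1, 2, -1)) + rnorm(N)   # an arbitrary shift vector
        sims <- replicate(20000, sum((H %*% (mu + rnorm(N, 0, sigma)))^2))
        c(simulated = mean(sims),
          theory    = sum((H %*% mu)^2) + sigma^2 * sum(diag(H)))   # tr(H) = p+1 = 3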
  . APPLICATION 1: To find E[TSS], use the projection I-H0.

        E[TSS] = E[ |(I-H0)y|^2 ] = |(I-H0) X beta|^2 + sigma^2 (N-1)

  . APPLICATION 2: To find E[MSS], use the projection H-H0,
    where H projects onto cspan(X).

        E[MSS] = E[ |(H-H0) y|^2 ] = |(I-H0) X beta|^2 + sigma^2 p

  . APPLICATION 3: To find E[RSS], use the projection (I-H).

        E[RSS] = E[ |(I-H) y|^2 ] = sigma^2 (N-p-1)     because (I-H) X = 0

  . CONCLUSIONS: If y = X beta + eps, then

        E[TSS] = | (I-H0)X beta |^2 + sigma^2 (N-1)
        E[MSS] = | (I-H0)X beta |^2 + sigma^2 p
        E[RSS] =                      sigma^2 (N-p-1)

- An Engineering View of Linear Regression: Signal-to-Noise Ratio in Estimation

  The estimation part is encapsulated in yhat.0; its 'energy' is MSS.
  We obtained a decomposition for E[MSS]:

        E[MSS] =  | (I-H0)X beta |^2  +  sigma^2 p
                  ^^^^^^^^^^^^^^^^^^     ^^^^^^^^^
      Sources:    model, 'signal'        error, 'noise'

  This motivates the definition of the "signal-to-noise ratio":

              | (I-H0) X beta |^2       / Ratio of model variation to noise variation
        SN = --------------------   =  |  in the fitted values,
                  sigma^2 p             \ excluding the uninteresting mean variation

  Q: Do we know the SN in actual data analysis? ...
  Q: Find estimates for the numerator and denominator!
     ... (denominator)  s^2 p              (unbiased? yes)
     ... (numerator)    | (I-H0) X b |^2   (unbiased? no)
     SNhat = ...        (ratio)            Is the ratio also unbiased? ... probably not

- We calculated E[MSS] and E[TSS]; we defined R2 = MSS/TSS.
  Q: Can we calculate E[R2] theoretically? ...
  Q: Can we express R2 in terms of SNhat, and vice versa? ...

================================================================