================================================================
Chapter 2: BASICS OF LINEAR MODELS
* LINEAR MODELS: THE OLS PROBLEM
 LS criterion: RSS
. In what follows we will play with OLS from the ground up.
This is a conceptual exercise, not a recommendation for practice.
In most contexts you will use the function lm() for OLS fitting.
However, when speed matters, you might want to use the linear algebra below.
. in R: a numeric toy example of LS
RSS < function(b) { sum((yX%*%b)^2) } # What's bad about this code?
N < 5
X < cbind(rep(1,N), 1:N)
y < X[,1] + 2*X[,2] # No 'error' in this case
Getting at the coefficient estimates the hard way:
nlm(RSS, c(0,0))$estimate # nlm() is a minimizer in R
optim(c(0,0), RSS)$par # optim() is another minimizer in R
. This is not how it is done, of course, but general code has some
nice aspects: it is easily cannibalized for different situations.
Such as:
L1 < function(b) sum(abs(yX%*%b))
nlm(L1, c(0,0))$estimate
optim(c(0,0), L1)$par
Seems to work even though the L1 criterion is not differentiable.
. In math:
RSS(b) = y  X b^2
. LS estimate:
b = argmin RSS(b) = minimizer of RSS(b)
(Justifications of LS? ...)
. NORMAL EQUATIONS as stationary equations of RSS:
gradient RSS(b) = 2*( X^T X b  X^T y ) = 0

==>  (X^T X) b = X^T y 

[This is a special case of a moment condition:
(X^T X)/N and X^T y/N
estimate 2nd moments: E[x x^T] and E[x y].
==> Moment estimators are defined by moment conditions.
Other names for X^T X b  X^T y = 0:
. score equations ['scores' are the negative gradients: (X^T X b  X^T y)]
. estimating equations
]
. Explicit LS solution:
b = (X^T X)^{1} X^T y
[Does not work when ...]
In R:
b < solve(t(X) %*% X) %*% t(X) %*% y
b # Correct
[NOT the numeric method of choice; stay tuned for a HW.]
Alternative using ginv() in MASS:
library(MASS)
ginv(t(X) %*% X) %*% t(X) %*% y
This should be numerically more stable; not a problem here.
 Residuals and fits:
. Notation: yhat = X b
r = y  yhat
RSS = r^2 # sum(r^2)
. Normal equations can be rewritten this way:
X^T r = 0 # 0 = X^T y  (X^T X) b = X^T (y  X b)
==> Uniquely characterizes the LS solution.
Geometric meaning: ... # see Homework 1
. In R:
yhat < X %*% b
r < y  yhat
 LS and orthogonal projections:
yhat = X b = ( X (X^T X)^{1} X^T ) y = H y
r = y  yhat = ( I  H ) y
where H and IH are orthogonal projections in nspace.
. Definition of orth. proj.: ... # see Homework 1
. H is also called the "hat matrix" because it "puts the hat on y" (Tukey).
[Notation: In math, one likes to use "P" as symbol for projections,
but for us "P" stands for probability, hence we prefer "H".]
 Visualization/animation of the geometry of LS: 2 predictors (p=1), n=3
source("http://wwwstat.wharton.upenn.edu/~buja/STAT961/LSgeometry.R")
 What are the statistical properties of 'b', 'yhat' and 'r'?
. What are their expected values?
. What are their variances, and covariances?
 1st order properties: need expectation vectors of random vectors
E[y] = mu = X beta This is assumed!!!
What is the meaning of this? ...
E[yhat] = ...
E[r] = ...
E[b] = ...
 2nd order properties: need covariance matrices of random vectors
V[y] = sigma^2 * I (I = identity of size nxn) Again, this is assumed!!!
What is the meaning of this? ...
V[yhat] = ...
V[r] = ...
V[b] = ...
 The role of X:
. We will follow convention in statistics and treat X as fixed.
. Strictly speaking, the above expressions should be written
E[yX], E[yhatX], E[bX], V[yX], V[yhatX], ...
. FixedX will allow us to do a lot of theory.
. We mentioned an ancillarity argument to justify conditioning on X
when X is truly random.
. We also mentioned the ancillarity argument depends on model correctness.
==> What we currently pursue is a 'modeltrusting, assumptionrich theory'.
There does exist 'modelrobust, assumptionlean theory' as well.
Stay tuned.
================================================================
* VECTOR EXPECTATIONS AND VARIANCE/COVARIANCE MATRICES
 In preparation for linear models analysis
we define vector expectations and variance/covariances.
Confusing fact:
. In multivariate analysis, all variables are random.
Data set = iid replications of a random vector
. In linear models theory of regression, only the response is random,
the regressors are not.
Data set = cbind(random y, fixed xj vectors)
 Let us think the way of multivariate analysis for now, hence assume:
Y ~ Ix1 random vector
Z ~ Jx1 random vector
with joint distribution ("observed on the same objects")
Ex.: Y = (Height, Weight, Age, ...) of a person
Z = (English Grade, Math Grade, SAT score, ...) of same person
The Y and Zvariables will generally be 'associated'
(i.e., correlated, but not only in a linear sense).
 EXPECTATION:
. Definition: E[Y] = vector of E[Yi] (i=1...I)
. Estimation: In Multivariate Analysis one assumes
N i.i.d. draws yn from Y, where yn ~ Ix1 = n''th sample.
Estimate of E[Y]: ybar = (sum_n yn)/N ~ Ix1
Unfortunately in regression we only see one draw y of Y
Do you see what''s confusing? ...
. Affine transformation: if A ~ JxI, c ~ Jx1, both constant, then
E[AY + c] = A E[Y] + c
Proof: Write out the components of AY and apply E[].
Q: Why do we need this?
A: Because we will analyze objects such as
b = (X^T X)^{1} X^T y
yhat = H y
r = (IH) y
. Algebra: Assuming Y and Z are both Ix1, then
E[a*Y + b*Z] = a*E[Y] + b*E[Z]
This property is called linearity of E[...].
 COVARIANCE:
. Definition: V[Y,Z] := E[ (YE[Y])(ZE[Z])^T ]
V[Y] := V[Y,Y]
. Components: V[Y,Z]_ij = ...
V[Y]_ij = ...
V[Y]_ii = ...
. What are the ranks of (YE[Y])(ZE[Z])^T
and (YE[Y])(YE[Y])^T ?
Hence a covariance matrix is an average of ...
Is it hence also a ...?
Why or why not? ...
. Sizes: V[Y,Z] is IxJ, V[Y] is IxI
. Generalizing univariate formulas
V[X] = E[X^2]  E[X]^2 and
V[X1,X2] = E[X1*X2]  E[X1]*E[X2] :
V[Y,Z] = E[ Y Z^T ]  E[Y] E[Z]^T
V[Y] = E[ Y Y^T ]  E[Y] E[Y]^T
. Algebra: Assuming Y, Z are both Ix1 and U is Kx1:
V[Y+Z,U] = V[Y,U] + V[Z,U]
Consequence:
V[Y+Z} = V[Y] + V[Y,Z] + V[Z,Y] + V[Z] # !!!!!!!!!!!!!! Watch out !!!!!
Can you continue and write the middle two terms as 2*V[Y,Z] ? ...
. Affine transformation:
Recall: Y (Ix1) and Z (Jx1) are random vectors.
Assume fixed matrices A, B and fixed vectors c, d of legit sizes:
A (mxI), B (nxJ), c (Ix1), d (Jx1)

 V[AY,BZ] = A V[Y,Z] B^T 
 
 V[AY] = A V[Y] A^T  Butterfly or SANDWICH formula
 
 V[Y+c,Z+d] = V[Y,Z] 

Exercises:
~ Specialize the middle to a linear form,
i.e., A=a^T, where a (Ix1) is fixed.
~ Specialize further to a = rep(1/I,I)
. Multivariate Analysis: Estimation of V[Y]
* Given N i.i.d. vector samples yn drawn from the random vector Y
Vhat[Y] = sum_i ( (ynybar)(ynybar)^T )/(N1)
* Affine transformation: zn = A yn + c
Vhat[z] = A Vhat[y] A^T
================================================================
* LINEAR MODELS:
 In the linear models theory of regression
we observe only one single yvector,
but we will make assumptions about its expectation
and covariance and examine what follows from them.
[Whether the assumptions are realistic for a given
dataset needs to be checked with diagnostics.]
 Linear model assumptions:
. Convention: y is Nx1 (not Ix1)
Compare asymptotics in
Multivariate Analysis: y1,...,yN are Ix1 and N>Inf
Linear Models Analysis: y is Nx1, and N>Inf
. Model in vector/matrix language:
y = beta0 + beta1*x1 + beta2*x2 + ... + betap*xp + eps
= X beta + eps where beta is (p+1)x1
Here y, xj, eps are Nx1 (response, predictors, errors, resp.)
X = cbind(1,x1,x2,...,xp) is Nx(p+1)
Note:
+ The response vector y is a random vector
 inspite of lowercase notation!
+ Predictors xj are NOT random but fixed and known
(set experimentally or observed but conditioned on).
+ The 'error' vector 'eps' is a random vector.
['Error' is a misleading term; there is nothing wrong with 'eps'.
We often use the term 'noise' instead.]
. Stochastic assumptions:

 E[eps] = 0 : On average [across possible worlds, universes, datasets] 
 the model is 1st order "correct"/"unbiased". 
 V[eps] = sigma^2 * I: Uncorrelated errors [across ...] 
 constant variances ("homoscedastic errors") 
 (2nd order correct) 

. Equivalently:

 E[y] = X beta  Why?
 V[y] = sigma^2 * I  ''

[Distinguish: "unbiased model" vs "unbiased estimator"; see below]
. Recap fundamentals:
Q: What variability do E[...] and V[...] refer to?
A: Variability ...
Spell it out: What is the variability of y[1], y[2], ... y[N] ?
================================================================
* LINEAR MODEL ANALYSIS, STEP 1: VARIABILITY OF b
 Reminder: The LS Estimate is b = (X^T X)^{1} X^T y
 Type of variability in b: ...
b = (p+1)x1 random vector
varying from response vector y to response vector y...
 Making the linear model assumptions, it follows:
E[b] = ...
Actually, which part of the assumptions was used?
...
=> 'Unbiased' estimator of beta
 Making the linear model assumptions, it follows:
V[b] = ...
= ...
Actually, which part of the assumptions was used?
...
NOTE: V[b] is conditional on X (if X is random...)
 What variability does V[b] describe?
...
 What is the interpretation of the diagonal elements in V[b]?
...
 Could one hope to estimate V[b]?
...
 What would such estimates be good for?
...
 [There exist estimators of V[b] that avoid making the assumption V[y] = sigma^2 I.
Instead they estimate V[y] with diag(r^2), i.e., a diagonal matrix with
squared residuals in the diagonal.
This results in the 'sandwich estimator of standard error',
which has become common in econometrics sind Hal White''s work in 1980+.]
================================================================
* SIMULATION EXAMPLE: simple linear regression with intercept
==> X is a 1predictor matrix, hence of size Nx2.
## R:
N < 50 # Picking a sample size
X < cbind(x0=rep(1,N), x1=runif(N)) # Making up a predictor matrix
sigma < 1 # Assuming a 'true' error variance
beta < c(b0=1, b1=2) # Assuming a 'true' coeff vector
y < rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector
b < solve(t(X)%*%X)%*%t(X)%*%y; b
## Simulate and animate parallel universes/possible worlds/datasets ...:
for(i in 1:500) {
y < rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector
b < solve(t(X)%*%X) %*% t(X) %*% y
plot(x=X[,2], y=y, pch=16, col="gray20", ylim=c(2,6))
abline(b, lwd=3, col="red")
dev.flush()
Sys.sleep(0.5)
}
## ==> We see the DATASETTODATASET (sampling) variability.
## Next, simulate many response vectors (possible worlds, parallel universes,...)
## and store the results:
Nsim < 10000
bs < matrix(NA, nrow=Nsim, ncol=length(beta)) # Storage for coeff estimates
colnames(bs) < paste("b",0:(length(beta)1),sep="") # Cosmetics
for(isim in 1:Nsim) {
y < rnorm(N, m=X%*%beta, s=sigma) # Simulating a response vector
bs[isim,] < solve(t(X)%*%X)%*%t(X)%*%y
if(isim%%100==0) cat(isim,"...")
}
## Plot slopes versus intercepts to illustrate the 'sampling' or
## 'datasettodataset' variability of b:
plot(bs, pch=16, cex=.5)
## ==> Negative correlation between b0 and b1 !!!
## Compare:
solve(t(X)%*%X)*sigma^2 # V[b]
cov(bs) # simulation estimate of previous
## Interpretation:
## . cov(bs) is a simulation approximation to V[b].
## . solve(t(X)%*%X)*sigma^2 is the theoretically obtained formula for V[b].
## ==> They better be close!!!
## Further intuitions:
## Visualize the simulated lines and the simulated distribution of 'b':
windows(height=5, width=10)
par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,1,1)) # Set up two plot panels.
# Panel 1: Show the last y only, faintly, because this is irrelevant.
plot(cbind(x=X[,2],y=y), pch=16, cex=1, col="gray", xlim=c(.5,1.5), ylim=c(2,6))
# Panel 1: Draw the estimated lines for 100 simulations.
for(i in 1:100) abline(bs[i,])
# Panel 2: Show a scatterplot of slopes versus intercepts:
plot(bs, pch=16, cex=.5)
## Questions:
## . How does the negative correlation on the right
## relate to the lines on the left?
## ...
## . If we had done the simulation with a uniform
## xdistribution on [1,+1] instead of [0,1],
## what would the correlation between intercept and slopes be?
## Alternative R code: X < cbind(x0=rep(1,N), x1=runif(N,1,+1)
## ...
#================================================================
* LINEAR MODEL ANALYSIS, STEP 2: VARIABILITY OF yhat AND r
 There is variability in 'b' ... datasettodataset
=> There is variability in 'yhat' and 'r'.
yhat = X betahat = X b
r = y  yhat = y  X betahat = y  X b
 Strangeness about the randomness in yhat and r:
Even though both are Ndimensional,
. yhat can only range in ... dim=...
. r can only range in ... dim=...
 Let as usual: H = X (X^T X)^{1} X^T
. First order:
E[yhat] = ...
E[r] = ...
What assumptions are used? ...
What assumptions are not used? ...
. Second order:
V[yhat] = ...
V[r] = ...
V[yhat,r] = ...
What assumptions are used? ...
What assumptions are not used? ...
 Special cases:
. Diagonal elements: Var[yhati] = ...
Var[ri] = ...
. Offdiagonal elements: Cov[yhati,yhatk] = ...
Cov[ri,rk] = ...
. Cov[yhati,rk] = ...
 What variability is being described in the above equations?
...
 Implications: Assume the errors (noise) in the Nx1 vector
eps = y  X beta
are homoskedastic and uncorrelated.
. The residuals are generally ... (homo/heteroscedastic?)
. The residuals are generally ... (un/correlated?)
. Pairs of distinct fits and corresponding pairs of residuals
are generally correlated with the ...[same/opposite?] sign.
. The variance of residuals is ... [>/ tr(H) = p+1 = sum H_ii # ==> average size of H_ii = ...
tr(IH) = N(p+1)
tr(I) = N
'Trick': diagonalize H
 Trace formulas or 'total variance' formulas:
Def.: total variance of a random vector Z = sum_i V[Z_i] = tr(V[Z])
sum V[y_i] = sigma^2 N = sigma^2 trace(I)
sum V(yhat_i) = tr(V[yhat])
= sigma^2 tr(H)
= sigma^2 (p+1)
sum V(r_i) = tr(V[r])
= sigma^2 tr(IH)
= sigma^2 (N(p+1))
 Interpretation of the trace formulas:
The more predictors the ...[more/less] variability in yhat.
The more predictors the ...[more/less] variability in r.
NOTE:
. This is independent of whether the model is 1st order correct/unbiased or not!
We only used second order assumptions: homoskedasticity and uncorrelated errors.
. This is independent of whether we use good or bad predictors.
Whether a predictor is good or bad, its inclusion moves
one case worth of variation (sigma^2) from r to yhat (and hence b).
 'Degrees of freedom': dimensions of the subspaces in which yhat and r can range
dfs(fits) = p+1
dfs(resid) = N(p+1)
dfs(y) = N
 [Quirky question: Unlike y, yhat, r, why might we NOT be interested
in the total variance of b?
Think like a physicist: What are the units of tr(V[b])
compared to the those of tr(V[y]), tr(V[yhat]), tr(V[r]).
...
]
================================================================
* ESTIMATING THE ERROR VARIANCE sigma^2:
 Recall from ~ Stat 102:
RMSE = sigmahat = sqrt( RSS / (Np1) )
RMSE^2 is an unbiased estimate of sigma^2,
E[RMSE^2] = sigma^2
as we will now show.
 Heuristic:
RMSE^2 is RSS up to a factor (Np1),
hence look at E[r_i^2] first, then sum up.
 If the model is ... order correct, then
E[ r_i ] = ...
E[ r_i^2 ] = ...
Which of 1st and 2nd order model assumptions are used??? ...
E[RSS] = ...
Therefore:
sigma^2 = ...
Hence an unbiased estimate of sigma^2 is
sigmahat^2 = ...
 PS: We would really like E[RMSE] = sigma,
but this is not true, and there is no way
to generate an unbiased estimate of sigma.
Hence one resorts to what is mathematically doable:
Showing that E[RMSE^2] = sigma^2.
PPS: But there is an inequality:
Jensen applied to f(x)=x^2 (convex) gives us the following:
E[X^2] >= E[X]^2
Apply this to X=RMSE:
E[RMSE^2] >= E[RMSE]^2
But E[RMSE^2] = sigma^2, hence
sigma^2 >= E[RMSE]^2
sigma >= E[RMSE]
==> RMSE is an underestimate of sigma !!!!!!!!!!!!!
================================================================
* SAMPLING PROPERTIES OF THE 'FIXED' PREDICTOR MATRIX X:
 We derived: V[b] = (X^T X)^{1} sigma^2 (the 'doubleX matrix')
Compare with: V[Xbar] = sigma^2 / N for iid X1,...,XN (Stat 101)
Q: Why don''t we see the increased precision in V[b] as N grows?
 Even though statisticians condition on X,
we now consider the sampling properties of X
assuming the rows of X are iid samples
from a random vector 'xrow' with
multivariate distribution P(dx1,dx2,...,dxp):
X = rbind(xrow1,xrow2,...,xrowN)
xrowi ~ P(dx1,dx2,...,dxp) iid for n=1,...,N
==> Point of view of multivariate analysis applied to predictors/regressors.
 Interpretation of X^T X in terms of xrow1,...,xrowN:
X^T X = sum of rankone matrices formed from xrowi (HW1)
= xrow1 xrow^T + xrow2 xrow2^T + ... + xrowN xrowN^T
= sum_n xrowi xrowi^T (We think of xrowi as a column,
though it''s a row of X.)
Q: Does X^T X converge to something if we increase N?
A: No, but it does if we divide it by ... !!!!
(X^T X)/N = sum_{i=1...N} xrowi xrowi^T / N
Reason: Vector/matrix version of the LLN
The rankone matrices (xrowi xrowi^T) are iid (n=1...N),
and the LLN applies to the (j,k) entries in the matrices.
Q: What is the limit?
A: sum_i (xrowi xrowi^T) / N > E[xrow xrow^T] =: C[xrow]
Note that C[xrow] = E[xrow xrow^T] is NOT equal to
V[xrow] = E[xrow xrow^T]  E[xrow] E[xrow]^T
Call it the "2ndmoment matrix" instead.
Think of it as the "uncentered variance/covariance matrix"
of the random vector 'xrow'.
We assume that C[row] is nonsingular.
 Conclusions: (X^T X / N) > C[xrow] (LLN)
==> (X^T X / N)^{1} > C[xrow]^{1}
= (X^T X )^{1} * N > C[xrow]^{1}
==> (X^T X )^{1} ~ C[xrow]^{1} / N
Now we are seeing 1/N in V[b]:

 
 V[b] ~ sigma^2 C[xrow]^{1} / N 
 

. Recap the thought process:
As N > Inf, what happens to the following?
(X^T X) > ...
(X^T X)^{1} > ...
X^T X/N > ...
( X^T X/N )^{1} > ...
. Q: What does this formula say about the precision of
estimates of regression coefficients in relation
to the distribution of the predictor vectors?
A: We gain precision of estimation with more data at the rate
1/N in terms of DATASETTODATASET VARIANCES
(and hence 1/sqrt(N) of datasettodataset sderrs).
 The weirdness about V[b] = sigma^2 (X^T X)^{1} :
. Conditionally on X, V[b] is a fixed covariance matrix,
but because of conditioning it should be written 'V[bX]'.
. Unconditionally, V[bX] = sigma^2 (X^T X)^{1} is a random matrix
if the X matrix is observational and hence random.
('Observational' means: If we realistically repeated
the data collection, we would end up with a different X.
This is different from designed experiments where we
control X and are able to set the values of X.
In repeats of a designed experiment, X is the same.)
================================================================
* "ADJUSTMENT" AND ITS ALGEBRA:
 Intuitive Example: 'Explain' student''s school performance.
. Factors of interest: Measures of school quality
(Class size, training of teachers, teacher salaries, physical plant,...)
==> Collect these predictors in a matrix X1.
. Factors NOT of interest in the study but cannot be ignored:
Demographics
(Household income, home zip code, education level of parents, ...)
==> Collect these predictors in a matrix X2.
. Outcome variables: SAT scores, grades, graduation vs dropping out, ...
. Intuition: Need to ADJUST the variables of interest for demographics.
 Partitioning multiple linear regression:
Partition the predictor matrix: X = (X1,X2) (Nxp1, Nxp2)
X1 = predictors of primary interest
X2 = nonignorable predictors not of primary interest
Partition the LS estimator: b^T = (b1^T,b2^T) (p+1 = p1+p2)
y = X b + r = X1 b1 + X2 b2 + r
Compare with naive approach:
y = X1 b1* + r*
 Q: How are b1 and b1* related to each other?
A: In general we only know that b1 and b1* are not the same.
Rough Reason: The following problems produce different b1 coefficients:
 y  (X1 b1 + X2 b2) ^2 = min_b1,b2
 y  (X1 b1*) ^2 = min_b1*
 Define orthogonal (LS) projections according to the partitions:
H = projection onto column space of X
H1 = projection onto column space of X1
H2 = projection onto column space of X2
(H1, H2 are NOT partitions of H; all are NxN !!!)
Question: When is H = H1 + H2 ?
 DEFINITION: adjustment = residualizing operation
IH2 = "adjustment operator" for X2 ('regresses out' demographics, e.g.)

 
 y.2 := (IH2) y is "y adjusted for X2" 
 

. Algebraically: y.2 is a residual vector in a model with predictors X2.
. Intuitively: y.2 has the variation due to X2 removed from y
variation due to X2 ~ "explained"/"accounted for" by X2.
. Practical example from above:
X1 = predictors describing school quality
X2 = predictors describing demographics
y = performance on SAT
y.2 = performance on SAT adjusted for demographics
 Some 'higher trivialities': Algebra and geometry of adjustment
. If y = X1 b1 + X2 b2 + r
is the LS decomposition based on both X1 and X2,
note the following:
r and the columns of X are orthogonal to each other, hence
r and the columns of X1, X2 are ...
==> r is 'adjusted' for both X1 and X2.
. Practice:
H X = ... X
H X1 = ... X1
H X2 = ... X2
H r = ... 0
(IH) r = ... r
(IH) X = ... 0
H1 r = ... 0
H2 r = ... 0
(IH1) r = ... r
(IH2) r = ... r
(IH1) X1 = ... 0
(IH2) X2 = ... 0
(IH2) X1 = ...
. Upcoming surprise: We will need X1 adjusted for X2 !!!!
DEFINITION:

 X1.2 := (IH2) X1 
 = Collection of predictors of interest X1 (school quality) 
 adjusted for 
 nonignorable but uninteresting predictors X2 (demographics). 

. r is orthogonal to X1.2:
r^T X1.2 = row vector of inner products of r with columns of X1.2
= r^T (IH2) X1
= r^T (IH2)^T X1
= ((IH2)r)^T X1
= r^T X1
= 0
because r^T X = 0 is the normal equation,
meaning that r is orthogonal to all columns of X,
hence orthogonal to the subset of columns in X1.
This should not be a surprise:
Residualizing/orthogonalizing X1 wrt X2
is an operation that is entirely within the
column space of X=(X1,X2).
Related facts:
cspan(X1.2) is a subspace of cspan(X1,X2)=cspan(X).
cspan(X1.2) = orthogonal complement of cspan(X2)
within cspan(X1,X2)=cspan(X).
 Adjust the LS decomposition
y = X1 b1 + X2 b2 + r
for X2 by applying IH2 to both sides:
(IH2) y = (IH2) X1 b1 + r (*****)
^^^^^^^^ ^^^^^^^^^ ^
X2adjusted X2adjusted Remains fixed under
response y predictors X1 X2adjustment
Recall notation:
y.2 = (IH2) y 'y adjusted for X2'
X1.2 = (IH2) X1 'X1 adjusted for X2'
From orthogonality of X1.2 and r in (*****)
we infoer that this is an LS decomposition:

 
 y.2 = X1.2 b1 + r  (****)
 
 where b1 is the LS estimator 
 for the X1 coefficients in the 
 JOINT regression onto X=(X1,X2). 
 

It solves
 y.2  X1.2 b1 ^2 = min_b1 (***)
The solution b1 of (***) is the same as that of
 y  X1 b1  X2 b2 ^2 = min_b1,b2 (**)
 Interpretation of b1:
To get the LS regression coefficients b1
in the joint regression y on X=(X1,X2),
one can do the following:

 
 Regress the X2adjusted response, y.2 = (IH2)y, 
 on the X2adjusted X1 predictors, X1.2 = (IH2)X1. 
 

Note: The adjustment formalism is not meant for computations but
for conceptual understanding. Stay tuned!
================================================================
* APPLICATION 1 OF ADJUSTMENT: CENTERED REGRESSION
 We usually consider the intercept as a nuisance.
With adjustment we can get rid of it:
X1 = all nonintercept predictors ('substantive regressors')
X2 = e = (1,...,1)^T the intercept 'regressors'
 What does adjustment for e do?
H2 = e e^T / e^2 = e e^T / N
H2 y = e (e^T y) /N = mean(y) e
^^^^^^^^^^
mean(y)
(I  H2) y = y  mean(y) e = centered y vector
 The adjusted LS decomposition (****) above
y.2 = X1.2 b1 + r
means that the substantive LS estimates can be
obtained from a 'centered regression'
with the intercept omitted.
================================================================
* APPLICATION 2 OF ADJUSTMENT: REDUCTION OF A MULTIPLE TO A SIMPLE REGR COEFF
 Consider: a single predictor adjusted for all else
X1 = single column xj (Nx1)
X2 = all columns of X except column xj (Nxp)
=> The adjusted equation (****) amounts to
a simple linear regression without intercept.
 Simplified notation:
y.adj = (IH2) y (y adjusted for all xk except xj)
xj.adj = (IH2) xj (xj adjusted for all xk except xj)
bj = slope of simple lin. regr. of "y adjusted for all but xj"
onto xj adjusted for all other predictors
[Reminder: The simple regr coeff w/o icept of y on x is b=/x^2 .]

 
 < y.adj, xj.adj> 
 bj =   ("adjustment formula" for bj)
 xj.adj^2 

 

Reminder: bj = j''th MULTIPLE regression coefficient (estimate)
 Interpretation: bj is the simple lin regression coefficient
when regressing y.adj on xj.adj w/o intercept.
 Elementary consequence:
Does it matter for the value of bj
if we add or drop some of the other regressors?
...
 Some weirdness: Adjusting y is actually not necessary.
Adjusting xj is.
Here is how you see it:
Numerator of adjustment formula
= < y.adj, xj.adj>
= <(IH2)y, (IH2)xj>
= (Why? Why? Why?)
=

 
 < y, xj.adj> 
 bj =   (the "weird adjustment formula" for bj)
 xj.adj^2 
 

 Interpretation 1: bj is the simple lin regression coefficient
when regressing y on xj.adj w/o intercept.
 Interpretation 2: bj is a linear function of y !!!
bj = a^T y ("contrast")
where
xj.adj
a = 
xj.adj^2
If we condition on X, then 'a' is a constant vector.
Can you smell a way to get at a standard error for bj?
[See the section after next.]
[We know this already because the whole vector b of coefficient estimates is
b = (X^T X)^{1} X^T y
which is a linear mapping of y (in R^N) to b (in R^{p+1}).
================================================================
* ADJUSTED VARIABLES PLOT OR "LEVERAGE PLOT":
 A plot of y.adj vs. xj.adj
shows what the LS criterion "sees" when determining bj.
 Example: Fetch data from class website; after downloading to folder 'Data'...
cars < read.table("http://wwwstat.wharton.upenn.edu/~buja/STAT961/CarModels20034NA.TXT",
header=T, sep=",", na=".")
names(cars); dim(cars)
rownames(cars) < paste(1:nrow(cars),as.character(cars[,"Model"]))
apply(cars, 2, function(x) sum(is.na(x))) # there is an NA in "MPG.Highway"
str(cars) # Ooops
cars[,"MPG.Highway"]
cars[,"MPG.Highway"] < as.numeric(cars[,"MPG.Highway"])
# Select nonNAs:
sel < !is.na(cars[,"MPG.Highway"]) & !is.na(cars[,"Weight.lbs"]) & !is.na(cars[,"Horsepower"])
# Multiple regression of response on two predictors:
cars.lm < lm(MPG.Highway ~ Weight.lbs + Horsepower, data=cars[sel,])
# Adjusted response and other predictor:
mpg.adj < resid(lm(MPG.Highway ~ Horsepower, data=cars[sel,]))
wgt.adj < resid(lm(Weight.lbs ~ Horsepower, data=cars[sel,]))
# Compare coeff of Weight: same!! "proof" by example...
lm(mpg.adj ~ wgt.adj); cars.lm
plot(mpg.adj ~ wgt.adj, pch=16, cex=.5); abline(lm(mpg.adj ~ wgt.adj))
 "Leverage Plot" because it shows points that have high leverage for bj
(every bj can have its own leverage points).
 Canned function for leverage plots with optional point identification:
source("http://wwwstat.wharton.upenn.edu/~buja/STAT961/leverageplots.R")
leverage.plots(model=cars.lm, ident=c(1,2), frame.sz=5) # stop with right click
[This function allows you to identify funny points; rightclick to stop.]
 Another use of leverage plots: predictor variables shrink when adjusted.
Leverage plots show the shrinkage in the horizontal width.
One can then read off what estimated average difference
in the response corresponds to the adjusted range.
 Note to future TAs of Stats classes:
The JMP software provides leverage plots with all regression outputs.
 Ultimate final question:
Why does mpg.adj take on negative values?
...
 Final question:
What happens to the intercept vector under adjustment?
In other words, is x0.adj still a constant column???
...
Example: the icept vector 'adjusted' for weight and horsepower
hist(resid(lm(rep(1,nrow(cars)) ~ .1, data=cars[,c("Weight.lbs","Horsepower")])),
main="", xlab="", col="gray30", breaks=20)
================================================================
* VARIANCE/STDERR OF bj:
 Note about terminology:
We call 'standard error' (SE) the true sdev of an estimate (across ...).
We call 'standard error estimate' (SE.est) the estimated sdev of an estimate.
(Most people say 'standard error' and mean 'standard error estimate'.)
 Reminder: We already know the SE of bj from V[b].
V[bj] = ( (X^T X)^{1} )_jj sigma^2
 Goal: Obtain Var[ bj ] from the adjustment formula for bj.
Recall variance of a linear combination: 'butterfly'/'sandwich' with A=a^T
Var[ ] = Var[ a^T y ] = ...
Use a = xj.adj:
Var[ ] = ...
To obtain V[bj]: Be careful and handle the denominator correctly.
Var[ bj ] = ...
...
Var[bj] = 
...
Finally, SE[bj] = sqrt(Var[bj]):

 
 sigma  "Adjustment formula"
 SE[bj] =   for standard errors
 xj.adj  of regression coeffs
 

 Consequences for statistical inference:
. What does xj.adj mean statistically speaking?
Try this for j>0 (exclude the x0.adj):
xj.adj/sqrt(N1) = ...
. Use the previous subbullet to rewrite the above magic formula for SE[bj]
and make the rootN law visible assuming regressor rows are randomly sampled:
SE[bj] = ...
. Do we like small or large values of xj.adj ?
...
. How do you "see" xj.adj/sqrt(N1) in the leverage plot?
...
. What happens to xj.adj if we add a predictor/regressor that is
highly correlated with xj?
...
. See Mosteller & Tukey, p.326f, on the proxy problem:
If we have mother''s education in the regression equation,
what do you think is going to happen to SE(b_mother) if we
add father''s education level to the equation?
...
. How would you measure how collinear xj is with all other predictors?
See next!
================================================================
* VARIANCE INFLATION FACTOR (VIF)  BETTER: SE INFLATION FACTOR (SEIF)
 Prelude: When is there no adjustment effect?
. General case: y = X1 b1 + X2 b2 +r
When is X1.2 = X1 ?
...
X1 __ X2
. Single case: X1 = xj, X2 = all other predictors
When is xj.adj = xj.0 ?
(We always adjust for the intercept, that is, centering.)
...
xj.0 is orthogonal to xk.0 for all k (!= j)
. What inequality is there between xj.adj and xj.0 ?
xj.adj <= xj.0
 When is there the least loss in precision of bj from collinearity?
...
xj.adj = xj.0 <==> xj.adj = xj.0
 What is SE[bj] in this case? Call it SE.best[bj].
...
SE.best[bj] = sigma / xj.0
 Definition: Standard error and variance inflation factors
SEIF[bj] := SE.actual[bj] / SE.best[bj]
VIF[bj] := SEIF[bj]^2
 Facts:
SEIF[bj] = (sigma/xj.adj) / (sigma/xj.0)

 
 xj.0 
 SEIF[bj] =  
 xj.adj 
 

 Fact: The lower bound on SEIF is SEIF[bj] >= ...
 AYT? Is the SEIF (and VIF) dependent on the units of xj? ...
If xj has units of Celsius, does SEIF[bj] change under conversion to Fahrenheit? ...
 SEIF in R: Ignoring the SEIF for the intercept,
here is a way to get the SEIFs for all slopes.
sqrt(diag(solve(cor(X))))
================================================================
* HAT DIAGONAL ELEMENTS AS MEASURES OF OUTLYINGNESS IN PREDICTOR SPACE
 General concept: Given a symmetric positive definite matrix S,
think of it as defining a squared distance from the origin
or squared norm as follows.
2
x = x^T S x
S
Using a root matrix R of S, i.e., S = R^T R,
you can define new coordinates xnew = R x, and you will find:
2
xnew^2 = xnew^T xnew = x^T R^T R x = x^T S x = x
^^^^^^^^ S
Euclidean
So, yes, x^T S x is a squared norm modulo a linear distortion.
 Recall the hat matrix:
H = X (X^T X)^{1} X^T
 For a row written as a column, xrowi = (x_i0=1,x_i1,x_i2,...,x_ip)^T,
we have:
H_ii = xrowi^T (X^T X)^{1} xrowi
==> H_ii is the squared norm of the i''th case in
predictor space wrt to the norm defined by S = (X^T X)^{1}
 More intuition about "outlyingness" in regressor space:
. Write:
X = (x0=e,x1,...,xp) of size Nx(p+1)
e = (1,...,1)^T of size Nx1
. Form the "meanadjusted matrix" Z:
Z = (x1.0,...,xp.0) (Nxp) matrix of centered real regressors (j>0).
AYT? Can the range of column xj.0 be [100,200]?
. Facts:
H_e = e e^T / N [Homework 1: What is H_e ? H_e x = mean(x) e]
xj.0 = (I  H_e) xj
H_X = H_e + H_Z [Homework 1: e __ xj.0 for j=1,...,p]
. Consequence:
Let z{i} = row i of Z written as a column (in R^p)
z{k} = row k of Z written as a column (in R^p)
Hik = 1/N + z{i}^T (Z^T Z)^{1} z{k}
Hii = 1/N + z{i}^T (Z^T Z)^{1} z{i}
. Geometric interpretation:
In the metric defined by (Z^T Z)^{1},
Hii  1/N is the squared distance of zi from the origin.
 Consider a fixed location z0 in R^p
and define a POSITIVE QUADRATIC FORM in z0 by:
H(z0) := z0^T (Z^T Z)^{1} z0
Recall from HW1: Z^T Z = sum_{i=1...N} z{i} z{i}^T
As N > Inf, what happens to:
. H(z0) > ... ? [LLN assuming z{i} have a multivariate distribution]
. N*H(z0) > ... ?
Does this make sense?
Think in terms of 'selfinfluence' of location z as N>Inf
and formulate a rule for its decay rate.
...
================================================================
* ANOVA DECOMPOSITIONS, R SQUARE, AND E[SS]:
 Q: How much "variation" does the model given by X "account for"?
. Approach: Use sums of squares (SS) as measures of "variation"
but first remove uninteresting variation: the mean.
. Preliminaries: recall the "meanadjusted" LS decomposition
y.0 = x1.0 * b1 + ... + xp.0 * bp + r = yhat.0 + r
obtained from the full equation by centering:
y = x0 * b0 + x1 * b1 + ... + xp * bp + r
[Note that r is centered, but yhat is not, hence we need yhat.0.]
Unlike the real predictors the intercept is not considered
"explanatory". Therefore, get rid of the intercept once for all.
 Pythagorean decomposition of orthogonal projections:
y.0 = b1*x1.0 + ... + bp*xp.0 + r (The meanadjusted LS decomposition again)
y.0 = yhat.0 + r (On the RHS are orthogonal vectors.)
==>
y.0^2 = yhat.0^2 + r^2 Pythagorean theorem

 TSS = MSS + RSS  'ANOVA decomposition' (the simplest form)

Total SS = Model SS + Residual SS
 Definition of RSquare ("R2"):
MSS yhat.0^2
. R2 :=  =  = "Fraction of variation accounted for by the model"
TSS y.0^2 "explained"
. Fact:

 R2 = cor(y,yhat)^2  = cor(y.0,yhat.0)^2

+ Note: 'cor()' implies meanadjustment and standardization.
+ Proof: cor(y,yhat)^2 = ^2 / (y.0^2*yhat.0^2) = ^2 / ... = QED
 Can we calculate E[TSS], E[RSS] and E[MSS] theoretically?
. We won''t have use for this in the future,
but there are two reasons for doing this math exercise:
+ ANOVA (really the analysis of designed experiments with highly structured X)
is permeated by calculations of expected values of sums of squares (E[SS]).
+ The calculations are insightful in their own right
because they describe sources of variation between datasets.
. Assume the model is unbiased and with uncorrelated homoscedastic errors:
y = X beta + eps with E[eps]=0, V[eps] = sigma^2 I
==> E[y] = X beta
. The smallest model is the mean model: y = beta0 x0 + eps (x0 = 1(Nx1))
We compare this model with the full model.
TSS = (IH0)y^2 (total SS in the data other than the mean)
MSS = (HH0)y^2 (SS captured by the model other than the intercept)
RSS = (IH) y^2 (SS not captured by the model)
where H is the projection onto the space spanned by x0,x1,...,xp
and H0 onto the vector x0.
. Note: IH0, HH0, IH are all orthogonal projections,
with ranks N1, p, Np1, respectively.
Fact: HH0 = H(IH0)
. We will reduce the derivations of E[TSS], E[MSS], E[RSS] to the
following generic facts about projections of spherically
distributed random vectors eps (to first and second order):
E[eps] = 0 and V[eps] = sigma^2 I:
y = mu + eps
. FACT 1: For any orthogonal projection H we have the following.
E[ H eps^2 ] = E[ eps^T H eps ]
= E[ sum_{ij} eps_i H_ij eps_j ]
= sum_{ij} H_ij E[ eps_i eps_j ]
= sum_i H_ii sigma^2
= sigma^2 tr(H)
. FACT 2: Consider the vector 'eps' shifted away from the origin, y = mu + eps.
E[ H (mu + eps)^2 ] = E[ H mu^2 + mu^T H eps + eps^T H mu + H eps^2 ]
= H mu^2 + sigma^2 tr(H)
The neat aspect is that this holds for ANY orthogonal projection H
and any shift vector 'mu'.
. APPLICATION 1: To find E[TSS], use the projection IH0.
E[TSS] = E[(IH0)y^2] = (IH0) X beta^2 + sigma^2 (N1)
. APPLICATION 2: To find E[MSS], use the projection HH0, where H projects onto cspan(X).
E[MSS] = E[ (HH0) y^2 ]
= (IH0) X beta^2 + sigma^2 p
. APPLICATION 3: To find E[RSS], use the projection (IH).
E[RSS] = E[ (IH) y^2 ]
= sigma^2 (Np1) because (IH) X = 0
. CONCLUSIONS: If y = X beta + eps, then
E[TSS] =  (IH0)X beta ^2 + sigma^2 (N1)
E[MSS] =  (IH0)X beta ^2 + sigma^2 p
E[RSS] = sigma^2 (Np1)
 An Engineering View of Linear Regression: Signal to Noise Ratio in Estimation
The estimation part is encapsulated in yhat.0; its 'energy' is MSS.
We obtained a decomposition for E[MSS]:
E[MSS] =  (IH0)X beta ^2 + sigma^2 p
^^^^^^^^^^^^^^^^^ ^^^^^^^^^
Sources: model, 'signal' error, 'noise'
This motivates the definition of "signaltonoise ratio":
 (IH0) X beta ^2 / Ratio of model variation to noise variation
SN =  =  in fitted values
sigma^2 p \ excluding uninteresting mean variation
Q: Do we know the SN in actual data analysis?
...
Q: Find estimates for numerator and denominator!
... (denominator) s^2 p (unbiased? yes)
... (numerator)  (IH0) X b ^2 (unbiased? no)
SNhat = ... (ratio)
Is the ratio also unbiased?
... probably not
 We calculated E[MSS] and E[TSS]; we defined R2 = MSS/TSS.
Q: Can we calculate E[R2] theoretically?
...
Q: Can we express R2 in terms of SNhat and vice versa?
...
================================================================