================================================================
* COLLINEARITY -- A SPECIAL CASE OF NON-IDENTIFIABILITY

- Interpretation issues:
  . NOT a model problem, but causes interpretation issues
  . Commonality of COLLINEARITY and leverage/self-influence:
    + both are properties of the predictor distribution
    + neither is a model problem because neither involves y
    + both cause interpretation problems
  . Difference between collinearity and leverage/self-influence:
    + Leverage/self-influence has to do with "outliers" in
      (p+1)-dimensional predictor space.
    + Collinearity has to do with near-degeneracy of N-dimensional
      predictor vectors.
    => The interpretation problems they cause are completely different.
  . Compare the following design matrices and comment on their properties
    and the interpretation of their coefficients:

          |1 0|    |1 0|    |1 1 0|
          |1 0|    |1 0|    |1 1 0|
      X = |0 1| ,  |1 1| ,  |1 0 1|
          |0 1|    |1 1|    |1 0 1|

  . In the third matrix, what is x0 adjusted for x1 and x2,
    or x1 adjusted for x0 and x2, or x2 adjusted for x0 and x1?

- EFFECTS OF COLLINEARITY 1: SIMPSON'S PARADOX
  . Practical example:
    "A marketing project identified a list of affluent customers for its
     new PDA.  Should it focus on the younger or older members of this list?"
    "To answer this question, the marketing firm obtained a sample of 75
     consumers and asked them to rate their 'likelihood of purchase' on a
     scale of 1 to 10."
    Data:
      site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/"
      pda <- read.csv(paste(site,"pda.dat",sep=""))   # read from the course site
      pda <- read.csv("Data/pda.dat")                 # or from a local copy
      names(pda)
      dim(pda)
      plot(pda, pch=16)
      summary(lm(Rating ~ Age, data=pda))
    Observations: ???
  . For increasing Age we see on average increasing Rating, which agrees
    with the simple regression of Rating on Age:
      summary(lm(Rating ~ Age, data=pda))$coef
  . However, adjusted for Income, the slope for Age reverses sign:
      summary(lm(Rating ~ Income + Age, data=pda))$coef
  . How is this possible?  Collinearity of Age and Income:
      cor(pda[,c("Age","Income")])
    ==> SIMPSON'S PARADOX -- REVERSAL OF SIGNS OF SLOPES UNDER ADJUSTMENT
    (This effect is possible but not frequent in the presence of collinearity.)

- EFFECTS OF COLLINEARITY 2: REDUCED SIGNIFICANCE OF SLOPES / LOSS OF POWER IN t-TESTS
  . Recall: "Variance Inflation Factor" VIF and
            "Standard Error Inflation Factor" SEIF = sqrt(VIF)
  . Critical formula:  SE.est(bj) = s / |xj.adj|
    Collinearity of xj with the other predictors causes |xj.adj| to be small,
    hence the estimated standard error to be large.
  . Interpretation: SEIF shows by how much the CI width has been inflated
    by collinearity.

----------------------------------------------------------------
* COLLINEARITY ANALYSIS WITH SMALLEST PRINCIPAL COMPONENTS:

- Vectorized effects of collinearity on inference for the coefficients:
  . The (stderr-)variance matrix of b is  V[b] = (X^T X)^{-1} sigma^2
  . If the eigendecomposition of (X^T X) is
      (X^T X) = lambda1 P1 + lambda2 P2 + ...      (p+1 terms, Pj = uj uj^T)
    then the eigendecomposition of (X^T X)^{-1} is
      (X^T X)^{-1} = 1/lambda1 P1 + 1/lambda2 P2 + ...
    [Side remark: Recall HW 3, spectral theory:
     If S is a symmetric matrix with eigendecomposition S = sum_j lambdaj Pj
     and f(lambda) is a numeric function, then f(S) = sum_j f(lambdaj) Pj.]
    Consequence: Linear combinations uj^T b with small lambdaj are
    statistically ill-determined (dataset-to-dataset):
      V[uj^T b] = uj^T V[b] uj = sigma^2 / lambdaj  =  huge if lambdaj is small
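    A quick numerical check of this identity -- a sketch, assuming the pda
    data frame read above is still loaded (the names fit.pda and X.pda are
    ad hoc, chosen to avoid clobbering the X used below):
      fit.pda <- lm(Rating ~ Income + Age, data=pda)
      X.pda   <- model.matrix(fit.pda)
      ev      <- eigen(crossprod(X.pda))       # lambdaj ($values) and uj (columns of $vectors)
      s2      <- summary(fit.pda)$sigma^2      # estimate of sigma^2
      diag(t(ev$vectors) %*% vcov(fit.pda) %*% ev$vectors)   # uj^T V[b] uj for each j
      s2 / ev$values                           # same numbers; largest where lambdaj is smallest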
- Extreme case: perfect collinearity -- a rank-deficient design matrix
    X b = 0  <==>  (X^T X) b = 0  <==>  b^T (X^T X) b = 0  <==>  |X b|^2 = 0
             <==>  b is an eigenvector of (X^T X) with eigenvalue 0
  Determination of the rank of X: eigendecomposition of X^T X.
  Rank deficiency: count the (near-)zero eigenvalues.
  Each eigenvector for a zero eigenvalue is a 'b' with X b = 0.
  Example:
    boston.lm <- lm(MEDV ~ ., data=boston)
    X <- model.matrix(boston.lm)
    eigen(crossprod(X))$values                         # full rank
    XX <- cbind(X, x12=X[,1]+X[,2], apply(X,1,sum))    # add collinear columns
    eigen(crossprod(XX))$values                        # rank deficient
  Look for eigenvalues several orders of magnitude smaller than the one
  before (ex: .9 followed by 8.4e-09).

- Problem: Scaling of the predictors
    XX <- cbind(X[,-13]*.000001, X[,13]*1000000)  # example: rescale predictors
    eigen(crossprod(XX))$values   # negative eigenvalues?! eigen() breaks down numerically
    eigen(cor(XX[,-1]))$values    # reasonable scaling of the predictors avoids the problem
  Solution: Scale the predictors, i.e., use 'cor()' instead of 'crossprod()',
  but remove the intercept vector, which is not of interest anyway
  (centering/adjusting for the mean is implicit in correlations):
    eigen(cor(X[,-1]))$values
    XX <- cbind(X[,-1], x12=X[,2]+X[,3], apply(X,1,sum))  # append collinear predictors
    eigen(cor(XX))$values
    round(eigen(cor(XX))$vectors[,14:15], 4)

- Standardizing the predictors: often practiced in the social sciences.
  Slope of x = estimated average difference in y for a 1-sdev difference in x
  'at fixed levels of all other predictors'.

- Collinearity analysis with "smallest principal components":
  Use the smallest eigenvalues/eigenvectors to detect near-collinearity.
  PCA is usually used for DIMENSION REDUCTION  => largest eigenvalues/vectors,
  but COLLINEARITY ANALYSIS requires the smallest eigenvalues/vectors.
  Idea: Assume the columns of X are centered (intercept removed).
    LS:   find b such that  |y - X b|^2 = min
    PCA:  find b such that  var(X b) ~ |X b|^2 = max/min;
          for collinearity analysis we are interested in var(X b) small.
  But: as stated, PCA is ill-posed; b -> Inf or b = 0 solve the problem trivially.
  => PCA needs a constraint.
  Idea: Inspired by collinearity analysis, take the best possible case for
        comparison, which is: the predictors are pairwise orthogonal
        (uncorrelated, since centered).  IF this were the case:
          var(X b) = |x1|^2 b1^2 + ... + |xp|^2 bp^2
        and, w.l.o.g., with |x1|^2 = ... = |xp|^2 = 1:
          var(X b) = b1^2 + ... + bp^2 = |b|^2
  => var(X b) = max/min  subject to  |b|^2 = 1
  => Eigenproblem:  var(X b) = b^T cor(X) b = max/min  subject to  |b|^2 = 1
  Interpretation: eigenvalues = variances (max/min/stationary);
                  var(X b) ~=~ 0  =>  X b ~=~ 0  for small eigenvalues.

- Canned smallest eigendecompositions of the correlation matrix of X (w/o intercept):
    site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/"
    source(paste(site, "collinearity-pca.R", sep=""))
    X.scaled <- scale(X[,-1])    # standardize the columns, remove the intercept
    collinearity.pca(X.scaled, nlin=3)

    $variances
        PC13    PC12    PC11
       0.064   0.169   0.186

    $coefficients
                 PC13    PC12    PC11
      CRIM     -0.046   0.087  -0.110
      ZN        0.081  -0.071   0.263
    * INDUS     0.251  -0.113  -0.303
      CHAS     -0.036  -0.004   0.014
      NOX      -0.044   0.804   0.111
      RM       -0.046   0.153   0.053
      AGE       0.039  -0.212  -0.459
      DIS       0.018   0.391  -0.696
    * RAD       0.633  -0.107   0.037
    * TAX      -0.720  -0.215  -0.105
      PTRATIO  -0.023   0.210   0.175
      B         0.004   0.042   0.019
      LSTAT    -0.024   0.055   0.271

  Interpretation of the smallest eigenvalue:
    The smallest PC has 6.4% of the variance of a single variable (var(xj)=1).
    Its standard deviation is sqrt(0.064) = 0.253, i.e., about a quarter of
    the standard deviation of each xj.
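  A direct check of the 'eigenvalue = variance' interpretation -- a sketch,
  assuming X and X.scaled from above are still in memory and that
  collinearity.pca() reports the smallest PCs of the correlation matrix:
    pc <- eigen(cor(X[,-1]))            # full eigendecomposition, smallest eigenvalue last
    pc$values[13]                       # smallest eigenvalue, ~ 0.064
    u  <- pc$vectors[,13]               # its eigenvector (sign is arbitrary)
    var(as.vector(X.scaled %*% u))      # variance of the PC13 score: the same number
    round(u, 2)                         # coefficients before rounding to simple fractions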
  Interpretation of its eigenvector: round the coefficients first:
      INDUS: +1/4     RAD: +2/3     TAX: -3/4
    This defines a linear combination with small variance, hence examine:
      1/4*INDUS + 2/3*RAD - 3/4*TAX ~=~ 0
    Equivalently:  3/4*TAX ~=~ 1/4*INDUS + 2/3*RAD
    Equivalently:    TAX ~=~ 1/3*INDUS + 8/9*RAD
  Check with a plot:
    plot(TAX ~ I(1/3*INDUS + 8/9*RAD),
         data=as.data.frame(X.scaled), pch=16, cex=1)
    (Recall that the formula language can also be used for plotting.)
  Suspicious plot: these are not 506 distinct points...
    windows(width=6, height=12)
    par(mfrow=c(2,1), mgp=c(1.8,.5,0), mar=c(3,3,1,1), cex.lab=.8, cex.axis=.8)
    plot(TAX ~ I(1/3*INDUS + 8/9*RAD),
         data=as.data.frame(X.scaled), pch=16, cex=.5)
  Jitter the same plot:
    n <- nrow(X.scaled)
    jit <- .1
    jit.x <- runif(n, -jit, jit)
    jit.y <- runif(n, -jit, jit)
    plot(TAX + jit.y ~ I(1/3*INDUS + 8/9*RAD + jit.x),
         data=as.data.frame(X.scaled), pch=16, cex=.5)
  Alternative: a fuzzy scatterplot of the raw (unstandardized) variables:
    smoothScatter(x=1/3*boston[,"INDUS"] + 8/9*boston[,"RAD"],
                  y=boston[,"TAX"],
                  xlim=c(0,30), ylim=c(100,800))
================================================================