================================================================
* COLLINEARITY -- A SPECIAL CASE OF NON-IDENTIFIABILITY

- Interpretation issues:
  . NOT a model problem, but causes interpretation issues
  . Commonality of COLLINEARITY and leverage/self-influence:
    + both are properties of the predictor distribution
    + neither is a model problem because neither involves y
    + both cause interpretation problems
  . Difference between collinearity and leverage/self-influence:
    + Leverage/self-influence has to do with "outliers" in
      (p+1)-dimensional predictor space.
    + Collinearity has to do with near-degeneracy of N-dimensional
      predictor vectors.
    => The interpretation problems they cause are completely different.
  . Compare the following design matrices and comment on their properties
    and the interpretation of their coefficients:

          |1 0|    |1 0|    |1 1 0|
          |1 0|    |1 0|    |1 1 0|
      X = |0 1| ,  |1 1| ,  |1 0 1|
          |0 1|    |1 1|    |1 0 1|

  . In the third matrix, what is x0 adjusted for x1 and x2,
    or x1 adjusted for x0 and x2, or x2 adjusted for x0 and x1?

- EFFECTS OF COLLINEARITY 1: SIMPSON'S PARADOX
  . Practical example:
    "A marketing project identified a list of affluent customers for its
     new PDA.  Should it focus on the younger or older members of this list?"
    "To answer this question, the marketing firm obtained a sample of 75
     consumers and asked them to rate their 'likelihood of purchase' on a
     scale of 1 to 10."
    Data:
      site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/"
      pda <- read.csv(paste(site,"pda.dat",sep=""))   # read from the course site
      pda <- read.csv("Data/pda.dat")                 # or from a local copy
      names(pda)
      dim(pda)
      plot(pda, pch=16)
      summary(lm(Rating ~ Age, data=pda))
    Observations: ???
  . For increasing Age we see on average increasing Rating, which agrees
    with the simple regression of Rating on Age:
      summary(lm(Rating ~ Age, data=pda))$coef
  . However, adjusted for Income, the slope for Age reverses sign:
      summary(lm(Rating ~ Income + Age, data=pda))$coef
  . How is this possible?  Collinearity of Age and Income:
      cor(pda[,c("Age","Income")])
    ==> SIMPSON'S PARADOX -- REVERSAL OF SIGNS OF SLOPES UNDER ADJUSTMENT
    (This effect is possible but not frequent in the presence of collinearity.)

- EFFECTS OF COLLINEARITY 2: REDUCED SIGNIFICANCE OF SLOPES / LOSS OF POWER IN t-TESTS
  . Recall: "Variance Inflation Factor" VIF and
            "Standard Error Inflation Factor" SEIF = sqrt(VIF)
  . Critical formula:  SE.est(bj) = s / |xj.adj|
    Collinearity of xj with the other predictors causes |xj.adj| to be small,
    hence the estimated standard error to be large.
  . Interpretation: SEIF shows by how much the CI width has been inflated
    by collinearity.

----------------------------------------------------------------
* COLLINEARITY ANALYSIS WITH SMALLEST PRINCIPAL COMPONENTS:

- Vectorized effects of collinearity on inference for the coefficients:
  . The (stderr-)variance matrix of b is  V[b] = (X^T X)^{-1} sigma^2
  . If the eigendecomposition of (X^T X) is
      (X^T X) = lambda1 P1 + lambda2 P2 + ...      (p+1 terms, Pj = uj uj^T)
    then the eigendecomposition of (X^T X)^{-1} is
      (X^T X)^{-1} = 1/lambda1 P1 + 1/lambda2 P2 + ...
    [Side remark: Recall HW 3, spectral theory:
     If S is a symmetric matrix with eigendecomposition S = sum_j lambdaj Pj
     and f(lambda) is a numeric function, then f(S) = sum_j f(lambdaj) Pj.]
    Consequence: Linear combinations uj^T b with small lambdaj are
    statistically ill-determined (dataset-to-dataset):
      V[uj^T b] = uj^T V[b] uj = sigma^2 / lambdaj  =  huge if lambdaj is small
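    A quick numerical check of this identity -- a sketch, assuming the pda
    data frame read above is still loaded (the names fit.pda and X.pda are
    ad hoc, chosen to avoid clobbering the X used below):
      fit.pda <- lm(Rating ~ Income + Age, data=pda)
      X.pda   <- model.matrix(fit.pda)
      ev      <- eigen(crossprod(X.pda))       # lambdaj ($values) and uj (columns of $vectors)
      s2      <- summary(fit.pda)$sigma^2      # estimate of sigma^2
      diag(t(ev$vectors) %*% vcov(fit.pda) %*% ev$vectors)   # uj^T V[b] uj for each j
      s2 / ev$values                           # same numbers; largest where lambdaj is smallest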
- Extreme case: perfect collinearity -- a rank-deficient design matrix
    X b = 0  <==>  (X^T X) b = 0  <==>  b^T (X^T X) b = 0  <==>  |X b|^2 = 0
             <==>  b is an eigenvector of (X^T X) with eigenvalue 0
  Determination of the rank of X: eigendecomposition of X^T X.
  Rank deficiency: count the (near-)zero eigenvalues.
  Each eigenvector for a zero eigenvalue is a 'b' with X b = 0.
  Example:
    boston.lm <- lm(MEDV ~ ., data=boston)
    X <- model.matrix(boston.lm)
    eigen(crossprod(X))$values                         # full rank
    XX <- cbind(X, x12=X[,1]+X[,2], apply(X,1,sum))    # add collinear columns
    eigen(crossprod(XX))$values                        # rank deficient
  Look for eigenvalues several orders of magnitude smaller than the one
  before (ex: .9 followed by 8.4e-09).

- Problem: Scaling of the predictors
    XX <- cbind(X[,-13]*.000001, X[,13]*1000000)  # example: rescale predictors
    eigen(crossprod(XX))$values   # negative eigenvalues?! eigen() breaks down numerically
    eigen(cor(XX[,-1]))$values    # reasonable scaling of the predictors avoids the problem
  Solution: Scale the predictors, i.e., use 'cor()' instead of 'crossprod()',
  but remove the intercept vector, which is not of interest anyway
  (centering/adjusting for the mean is implicit in correlations):
    eigen(cor(X[,-1]))$values
    XX <- cbind(X[,-1], x12=X[,2]+X[,3], apply(X,1,sum))  # append collinear predictors
    eigen(cor(XX))$values
    round(eigen(cor(XX))$vectors[,14:15], 4)

- Standardizing the predictors: often practiced in the social sciences.
  Slope of x = estimated average difference in y for a 1-sdev difference in x
  'at fixed levels of all other predictors'.

- Collinearity analysis with "smallest principal components":
  Use the smallest eigenvalues/eigenvectors to detect near-collinearity.
  PCA is usually used for DIMENSION REDUCTION  => largest eigenvalues/vectors,
  but COLLINEARITY ANALYSIS requires the smallest eigenvalues/vectors.
  Idea: Assume the columns of X are centered (intercept removed).
    LS:   find b such that  |y - X b|^2 = min
    PCA:  find b such that  var(X b) ~ |X b|^2 = max/min;
          for collinearity analysis we are interested in var(X b) small.
  But: as stated, PCA is ill-posed; b -> Inf or b = 0 solve the problem trivially.
  => PCA needs a constraint.
  Idea: Inspired by collinearity analysis, take the best possible case for
        comparison, which is: the predictors are pairwise orthogonal
        (uncorrelated, since centered).  IF this were the case:
          var(X b) = |x1|^2 b1^2 + ... + |xp|^2 bp^2
        and, w.l.o.g., with |x1|^2 = ... = |xp|^2 = 1:
          var(X b) = b1^2 + ... + bp^2 = |b|^2
  => var(X b) = max/min  subject to  |b|^2 = 1
  => Eigenproblem:  var(X b) = b^T cor(X) b = max/min  subject to  |b|^2 = 1
  Interpretation: eigenvalues = variances (max/min/stationary);
                  var(X b) ~=~ 0  =>  X b ~=~ 0  for small eigenvalues.

- Canned smallest eigendecompositions of the correlation matrix of X (w/o intercept):
    site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/"
    source(paste(site, "collinearity-pca.R", sep=""))
    X.scaled <- scale(X[,-1])    # standardize the columns, remove the intercept
    collinearity.pca(X.scaled, nlin=3)

    $variances
        PC13    PC12    PC11
       0.064   0.169   0.186

    $coefficients
                 PC13    PC12    PC11
      CRIM     -0.046   0.087  -0.110
      ZN        0.081  -0.071   0.263
    * INDUS     0.251  -0.113  -0.303
      CHAS     -0.036  -0.004   0.014
      NOX      -0.044   0.804   0.111
      RM       -0.046   0.153   0.053
      AGE       0.039  -0.212  -0.459
      DIS       0.018   0.391  -0.696
    * RAD       0.633  -0.107   0.037
    * TAX      -0.720  -0.215  -0.105
      PTRATIO  -0.023   0.210   0.175
      B         0.004   0.042   0.019
      LSTAT    -0.024   0.055   0.271

  Interpretation of the smallest eigenvalue:
    The smallest PC has 6.4% of the variance of a single variable (var(xj)=1).
    Its standard deviation is sqrt(0.064) = 0.253, i.e., about a quarter of
    the standard deviation of each xj.
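  A direct check of the 'eigenvalue = variance' interpretation -- a sketch,
  assuming X and X.scaled from above are still in memory and that
  collinearity.pca() reports the smallest PCs of the correlation matrix:
    pc <- eigen(cor(X[,-1]))            # full eigendecomposition, smallest eigenvalue last
    pc$values[13]                       # smallest eigenvalue, ~ 0.064
    u  <- pc$vectors[,13]               # its eigenvector (sign is arbitrary)
    var(as.vector(X.scaled %*% u))      # variance of the PC13 score: the same number
    round(u, 2)                         # coefficients before rounding to simple fractions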
  Interpretation of its eigenvector: round the coefficients first:
      INDUS: +1/4     RAD: +2/3     TAX: -3/4
    This defines a linear combination with small variance, hence examine:
      1/4*INDUS + 2/3*RAD - 3/4*TAX ~=~ 0
    Equivalently:  3/4*TAX ~=~ 1/4*INDUS + 2/3*RAD
    Equivalently:    TAX ~=~ 1/3*INDUS + 8/9*RAD
  Check with a plot:
    plot(TAX ~ I(1/3*INDUS + 8/9*RAD),
         data=as.data.frame(X.scaled), pch=16, cex=1)
    (Recall that the formula language can also be used for plotting.)
  Suspicious plot: these are not 506 distinct points...
    windows(width=6, height=12)
    par(mfrow=c(2,1), mgp=c(1.8,.5,0), mar=c(3,3,1,1), cex.lab=.8, cex.axis=.8)
    plot(TAX ~ I(1/3*INDUS + 8/9*RAD),
         data=as.data.frame(X.scaled), pch=16, cex=.5)
  Jitter the same plot:
    n <- nrow(X.scaled)
    jit <- .1
    jit.x <- runif(n, -jit, jit)
    jit.y <- runif(n, -jit, jit)
    plot(TAX + jit.y ~ I(1/3*INDUS + 8/9*RAD + jit.x),
         data=as.data.frame(X.scaled), pch=16, cex=.5)
  Alternative: a fuzzy scatterplot of the raw (unstandardized) variables:
    smoothScatter(x=1/3*boston[,"INDUS"] + 8/9*boston[,"RAD"],
                  y=boston[,"TAX"],
                  xlim=c(0,30), ylim=c(100,800))
================================================================