================================================================

* MODEL SELECTION

- PREAMBLE: ADJUSTED R SQUARE

  . R2 is biased: E[R2] > 'true R2' (because of the LS optimization).
  . Attempt at de-biasing:
      R2 = 1 - RSS/TSS = 1 - (RSS/(N-1)) / (TSS/(N-1))
      RSS/(N-1) is downward biased as an estimate of sigma^2
      TSS/(N-1) is unbiased as an estimate of sigma.y^2   (? really?)
    => Estimate the true R2 by
      R2.adj := 1 - s^2/s.y^2  <  R2
      where  s^2   = RSS / (N-p-1)
             s.y^2 = TSS / (N-1)

- MODEL SELECTION: Play with models using the 'add1' and 'drop1' functions.

  . Stepwise forward: ...
      summary(lm(MEDV ~ RM, data=boston))
      summary(lm(MEDV ~ RM + LSTAT + NOX, data=boston))
    A clumsy function to 'help':
      add1(lm(MEDV ~ 1, data=boston), scope= ~ RM + LSTAT + NOX)

  . Stepwise backward: ...
      summary(lm(MEDV ~ ., data=boston))
      summary(lm(MEDV ~ . - AGE, data=boston))
      summary(lm(MEDV ~ . - AGE - INDUS, data=boston))
    Another clumsy function to 'help':
      drop1(lm(MEDV ~ ., data=boston))

- STEPWISE REGRESSION:

  . Forward:  Starting from the null model (mean only), successively enter the
              best variable as long as its t-statistic IS significant.
  . Backward: Starting from the full model, successively remove the worst
              variable as long as its t-statistic is NOT significant.
  . Alternating forward/backward...
    Do these procedures produce 'optimal' submodels?  ...

- ALL-SUBSETS REGRESSION:

  . Minimize RSS / maximize RSquare for given model sizes.
    Report the 1 or 2 or ... best models for each model size.
    ==> Doable for models with up to 35 or so predictors.
        (Branch-and-bound algorithm from the 1970s.)
    Question: How to choose among model sizes (see below)?

  . Software: install the following package once:
      install.packages(pkgs="leaps", repos="http://cran.us.r-project.org/")
      library(leaps)   # needs to be called in every R session

  . 'regsubsets()' in the package finds the best subsets of predictors for every size.
    ## Optional arguments:
    ##   nbest=2           will find the two best for each size
    ##   nvmax=13          will find them for all sizes up to 13
    ##   force.in=c(6,13)  will force variables 6 and 13 into the model (RM, LSTAT)
    ## Examples:
    summary(regsubsets(log(MEDV) ~ ., data=boston))
    summary(regsubsets(log(MEDV) ~ ., data=boston, nbest=3, nvmax=13, force.in=c(6,13)))
    ## A graphical way of looking at these models:
    windows(width=10, height=12)   # opens a plot window on Windows; use x11()/quartz() elsewhere
    plot(regsubsets(MEDV ~ ., data=boston, nbest=2, nvmax=13), scale="adjr2")

- PROBLEM: How high should we go with picking models?

  . Many criteria for stopping the inclusion of predictors:
      Mallows' Cp, AIC, BIC, RIC, CIC, MIC, ...
    Note:
      ~ These criteria come into play only for deciding on the model size.
      ~ For a fixed model size, pick the lowest RSS (= highest Rsquare).

  . A helpful universal trick/idea:  ADD A FEW RANDOM PREDICTORS TO THE MODEL !!!!!!!
    # and check when the search gets trapped in them:
    ############## for(i in seq(1000)) {
    nnoise <- 10           # the number of random predictors we add
    boston.ext.dat <-      # the data frame of actual and random data
      data.frame(boston, as.data.frame(matrix(rnorm(506*nnoise), ncol=nnoise)))
    # Uncomment the following expressions temporarily:
    ## summary(lm(log(MEDV) ~ ., data=boston.ext.dat))   # model with 'junk'
    ## summary(lm(log(MEDV) ~ ., data=boston))           # model w/o 'junk'
    ## Fortunately most random predictors are recognized as worthless,
    ## but occasionally there is one that sticks out by chance.
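    ## Optionally, also count how many of the noise columns (named V1..V10 by
    ## as.data.frame()) appear 'significant' at the 5% level purely by chance:
    ## pv <- summary(lm(log(MEDV) ~ ., data=boston.ext.dat))$coefficients[, "Pr(>|t|)"]
    ## sum(pv[paste0("V", 1:nnoise)] < .05)   # on average about 0.5 of the 10 junk predictors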
    ## Now we do all-subset search and check graphically:
    boston.ext.sub <- regsubsets(nvmax=20, nbest=2, log(MEDV) ~ ., data=boston.ext.dat)
    plot(boston.ext.sub, scale="adjr2", ylim=c(.4,.8))
    ############### dev.flush(); Sys.sleep(.5) }
    # Animate the above by uncommenting the two lines marked with "############".

  . A better version is to scramble the existing X matrix:
    Break the association between y and X by permuting the rows of X,
    and cbind() the result to X.
    ==> The permuted X has the same approximate collinearities as X !!!
    Code:
      boston.ext.dat <-    # the data frame of actual and permuted data
        data.frame(boston, boston[sample(nrow(boston)), -14])
      boston.ext.sub <- regsubsets(nvmax=20, nbest=2, log(MEDV) ~ ., data=boston.ext.dat)
      plot(boston.ext.sub, scale="adjr2", ylim=c(.4,.8))
    => Models of size up to 9 would be ok by this method.

    Open questions:
    . How does the method depend on the number of included junk predictors?
    . Is this a good method for interpretation or prediction or both or neither?
    . Theory???
    . For some progress, see Shaun Lysen's Ph.D. thesis and references therein.
      Recent developments: Candes' "knock-off" method is a smart variation on this scheme.

- Mallows' Cp criterion for model size selection: estimating prediction error

  . Prediction error:  PE := E[ |yhat - mu|^2 ]

  . Facts: Assume y = mu + eps (in R^N), E[eps] = 0, V[eps] = sigma^2 I_N.
    I.e., allow an arbitrary mean vector mu -- NO assumption that mu is in cspan(X)!
    As usual, let H be the hat/projection matrix onto cspan(X), hence r = (I-H) y.

    (1)  E[ |r|^2 ] = E[ |(I-H)(mu + eps)|^2 ]
                    = |(I-H) mu|^2 + (N-p-1) sigma^2

    (2)  PE = E[ |H y - mu|^2 ]
            = E[ |(H y - H mu) - (I-H) mu|^2 ]
            = E[ |H (y-mu)|^2 ] + |(I-H) mu|^2     because < H (y-mu), (I-H) mu > = 0
            = E[ |H eps|^2 ] + |(I-H) mu|^2
            = (p+1) sigma^2 + |(I-H) mu|^2

    (1)-(2):  E[ RSS ] - PE = (N-p-1) sigma^2 - (p+1) sigma^2 = [N - 2*(p+1)] sigma^2

              -------------------------------------------
          ==> | PE = E[ RSS ] - [N - 2*(p+1)] sigma^2  |
              -------------------------------------------

    ==> If we have an unbiased estimate sigmahat^2 from a (larger) correct model,
        then the following is an unbiased estimate of the PE:

              ------------------------------------------
              | Cp := RSS - [N - 2*(p+1)] sigmahat^2  |
              ------------------------------------------

    ==> Search for a suitable submodel size by minimizing Cp.
        Obviously the term -N*sigmahat^2 is irrelevant, so we can just as well minimize
          RSS + 2*(p+1)*sigmahat^2

    [Mallows himself warns against taking too much guidance from Cp.
     He did not first publish Cp; it first appeared in the path-breaking book
     "Fitting Equations to Data" by Daniel and Wood, the first author being
     Cuthbert Daniel, a leading industrial statistician at the time.
     The 'C' in 'Cp' seems to stand for 'Cuthbert'.
     Mallows later wrote two Technometrics articles, "Some Comments on Cp" (1973)
     and "More Comments on Cp" (1995), with skeptical remarks about the strong
     use of Cp at the time.]

- Akaike's criterion, AIC (1973):

  For any MLE, use
      AIC     := -2*log Lik(parameters) + 2*p
      AIC.est := -2*log Lik(MLE) + 2*p
  where 'p' is the number of fitted parameters.
  This should be minimized over model sizes 'p'.
  AIC is based on information theory, where the information in the data is balanced
  against the complexity of the fitted model. It estimates the relative
  Kullback-Leibler information loss incurred when using one model versus another.
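  In practice both criteria are easy to compute for the Boston fits used above.
  A minimal sketch, assuming the 'leaps' package and the 'boston' data frame are
  loaded (summary.regsubsets() reports Mallows' Cp in its conventional scaling,
  RSS/sigmahat^2 - N + 2*(p+1), which has the same minimizer as the Cp defined
  above; base R's AIC() returns -2*log Lik + 2*(number of parameters), counting
  sigma^2 as a parameter):
      bos.all <- summary(regsubsets(log(MEDV) ~ ., data=boston, nvmax=13))
      round(cbind(size=1:13, Cp=bos.all$cp, adjR2=bos.all$adjr2), 2)
      which.min(bos.all$cp)                          # the model size minimizing Cp
      ## AIC for two individual fitted models:
      AIC(lm(log(MEDV) ~ RM + LSTAT, data=boston))
      AIC(lm(log(MEDV) ~ .,          data=boston))   # smaller AIC is preferred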
  For linear models,
      -2*log Lik(beta, sigma^2) = |y - X beta|^2 / sigma^2 + N*log(2*pi*sigma^2)
      AIC.est = RSS / sigmahat^2 + N*log(2*pi*sigmahat^2) + 2*(p+1)
  By the book one should use the MLE of sigma, but this is not done.
  As in Cp, one uses the best available estimate from a larger model.
  In this case sigmahat does not vary with 'p', hence
      AIC.est ~ RSS / sigmahat^2 + 2*(p+1)
  which is equivalent to Cp.

- BIC, or Schwarz' Bayesian information criterion (1978):
      BIC := -2*log Lik + p*log(N)
  The justification uses certain Bayes priors on model sizes.
  Note that this criterion is more stringent: log(N) > 2 for N >= 8.
  Note also that -log Lik grows at the rate N for independent data.
  BIC is known to be 'asymptotically correct': if there is a correct submodel,
  minimizing BIC will asymptotically find it.
  This is not so for Cp and AIC: they are too lenient.
  Reason: Cp, and hence AIC, minimize PE, which errs on the side of variance
  to avoid bias.

- RIC, Risk Inflation Criterion, by Foster and George (1994, in-house)
- CIC, Covariance Inflation Criterion, by Knight and Tibshirani (????)
- MIC, information-theoretic (Rissanen)
- DIC ... ? also info-theory?

- Cross-validation (CV):
  Leave out some data, predict them, measure the prediction error, repeat, average.

  Initially one used leave-one-out CV, which has a simple formula for linear models:
      CV1 := 1/N sum_i r_{-i}^2           where r_{-i} = y_i - yhat_{-i}
           = 1/N sum_i [r_i / (1-H_ii)]^2
  I.e., the leave-i-out prediction error at observation i is

      ----------------------------
      |                          |
      |  r_{-i} = r_i / (1-H_ii) |     (*)
      |                          |
      ----------------------------

  Today, leave-one-out has fallen out of favor.
  ML (Machine Learning) practice is 10-fold CV: leave out 1/10, 10 times, then average.
  There is some evidence that one should actually leave out more, for example 1/3.
  In this case one would repeatedly leave out a random 1/3 and average the
  prediction error estimates.
  Drawback: The more one leaves out, the more data is diverted from estimation to
  validation. Effectively, one estimates the performance of estimators based on
  smaller sample sizes than N.

  'Proof' of the leave-one-out formula -- a neat idea:
  Let the leave-one-out prediction at x_i be yhat_{-i} = x_i^T b_{-i},
  where {-i} means leaving out (x_i, y_i).
  Trick: Construct a modified response vector yvec_{-i} that is the same as y in all
  coordinates except that its i'th coordinate is yhat_{-i}:
      yvec_{-i} = y - (y_i - yhat_{-i})*e_i        (e_i = i'th basis vector)
  Lemma: H yvec_{-i} consists of the fits of the leave-one-out regression at all N
  points; in particular, its i'th coordinate is the leave-one-out prediction yhat_{-i}.
  Hence, looking at the i'th coordinates:
      (1)  (H yvec_{-i})_i = yhat_{-i} = sum_{k!=i} H_ik*y_k + H_ii*yhat_{-i}
      (2)  (H y)_i         = yhat_i    = sum_{k!=i} H_ik*y_k + H_ii*y_i
      (2)-(1):  yhat_i - yhat_{-i} = H_ii*(y_i - yhat_{-i}) = H_ii*r_{-i}
      The left side is also r_{-i} - r_i, so r_i = (1-H_ii)*r_{-i}  ==>  (*)

  Note: The Lemma holds even for penalized LS, i.e., generalized Ridge regression:
      minimize  |y - X b|^2 + lambda * b^T Omega b
      b = (X^T X + lambda*Omega)^{-1} X^T y
  where Omega is a non-negative definite symmetric matrix and lambda > 0 is a
  penalty parameter.
  CV can be used to choose lambda; in this case it is not model selection but
  finding a good trade-off between bias and variance.

- Generalized Cross-Validation (GCV): proposed by Craven and Wahba (1979).
  Simplifies leave-one-out CV by using the average of (1-H_ii) for all residuals:
      GCV := 1/N sum_i r_i^2 / (1 - tr(H)/N)^2
  This was initially thought of as a computational simplification, but it turned out
  to have advantages in that it generalizes to many other methods, including
  smoothing splines.
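  A minimal numerical check of the leave-one-out shortcut (*) and of GCV on the
  Boston fit used above (assumes the 'boston' data frame is loaded; the brute-force
  loop is only for illustration):
      fit <- lm(log(MEDV) ~ ., data=boston)
      r   <- residuals(fit)
      h   <- hatvalues(fit)                     # the diagonal elements H_ii of the hat matrix
      CV1 <- mean((r/(1-h))^2)                  # leave-one-out CV via the shortcut (*)
      ## Brute-force leave-one-out for comparison (slow but transparent):
      CV1.brute <- mean(sapply(1:nrow(boston), function(i)
        (log(boston$MEDV[i]) -
           predict(lm(log(MEDV) ~ ., data=boston[-i, ]), newdata=boston[i, ]))^2))
      GCV <- mean(r^2) / (1 - sum(h)/nrow(boston))^2   # each H_ii replaced by the average tr(H)/N
      c(CV1=CV1, CV1.brute=CV1.brute, GCV=GCV)         # CV1 and CV1.brute agree; GCV is close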
- Lasso: L_1 penalty

  For a given penalty strength lambda, minimize
      |y - X b|^2 + lambda * sum_j |b_j|  =  |y - X b|^2 + lambda * |b|_1
  This makes sense only if the predictor columns are comparable, e.g.,
      X = (x0, x1, ..., xp),   sd(xj) = 1
  The L_1 penalty |b|_1 = sum_j |b_j| has the property that it forces some b_j
  exactly to zero, which amounts to model selection.

  The Lasso has two different uses in practice:
    (1) Use b = argmin_b |y - X b|^2 + lambda * |b|_1 directly.
    (2) Check which of the Lasso coefficients satisfy b_j = 0,
        then refit OLS on the predictors with b_j != 0.
  Version (1) produces a biased estimate of beta.
  In principle this bias could be desirable.
  (A small code sketch of both uses appears at the end of these notes.)

  Compare with generalized Ridge regression:
      b = argmin_b |y - X b|^2 + lambda * b^T Omega b
  The solution is
      b = (X^T X + lambda*Omega)^{-1} X^T y
           ^^^^^^^^^^^^^^^^^^^^
           makes X^T X less ill-conditioned in the case of strong collinearity
           (the historic reason for inventing Ridge regression)
  It is known that the risk E[ |b - beta|^2 ] is lower for the Ridge estimator with
  some small lambda > 0 than for the OLS estimator (which uses lambda = 0) when
  p+1 >= 3.
  ==> Discovered by Charles Stein (Stanford; 'James-Stein estimator').
      Larry Brown knows everything about it.
  ==> Bias is not necessarily bad if the variance gets lowered by more.
  Main message: Don't rule out biased estimators!
  Unbiased estimators are not necessarily 'admissible'.

- Final remarks on model selection:

  . Open issues (1): Model selection when p >> N (genetics).
    Huge literature, some of it misguided by the notion of oracles:
    IF nature is sparse, and IF we knew which few coefficients are non-zero,
    we could perform well.  ==> Oracle performance.
    Oracle performance is achieved by an actual estimator if it performs
    asymptotically as well as the oracle estimator.
    But nature is never strictly sparse; it has a mix of (hopefully few) strong
    and (many) weak signals, but no signals that are exactly zero.

  . Open issues (2): Inference for coefficients after model selection.
    ==> Huge progress being made right now!
        + Taylor et al. (Stanford)
        + PoSI (in our dept)

  . Split-sample approach to inference:
    Always works for inference, but diverts a fraction of the data from estimation.
    Recipe: Do model selection on the training portion of the data,
            then do classical testing and CIs on the test portion.
    (A small code sketch appears at the end of these notes.)
    Issues? What do you expect to happen to the selected models depending on
    the data split?

================================================================
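As referenced in the Lasso section above: a minimal sketch of uses (1) and (2),
assuming the 'glmnet' package (not used elsewhere in these notes); glmnet
standardizes the predictor columns internally, and cv.glmnet() chooses lambda
by 10-fold CV:
    ## install.packages("glmnet")              # once, if not yet installed
    library(glmnet)
    x <- as.matrix(boston[, names(boston) != "MEDV"])    # predictor matrix
    y <- log(boston$MEDV)                                # response, as used above
    cv.fit <- cv.glmnet(x, y, alpha=1)          # alpha=1: Lasso penalty; lambda by 10-fold CV
    b <- coef(cv.fit, s="lambda.min")           # use (1): the (biased) Lasso coefficients
    b
    sel <- setdiff(rownames(b)[as.vector(b) != 0], "(Intercept)")   # predictors with b_j != 0
    summary(lm(log(MEDV) ~ ., data=boston[, c(sel, "MEDV")]))       # use (2): OLS refit on them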
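And a minimal sketch of the split-sample recipe from the final remarks, with a
50/50 random split of the Boston data and base R's step() standing in for
whatever selection method one prefers; rerunning with different seeds shows how
the selected model varies with the split:
    set.seed(1)                                  # only to make the example reproducible
    i.train      <- sample(nrow(boston), size=nrow(boston) %/% 2)
    boston.train <- boston[ i.train, ]
    boston.test  <- boston[-i.train, ]
    ## Model selection on the training half (backward stepwise by AIC):
    fit.sel  <- step(lm(log(MEDV) ~ ., data=boston.train), trace=0)
    ## Classical tests and CIs on the untouched test half, for the selected model only:
    fit.test <- lm(formula(fit.sel), data=boston.test)
    summary(fit.test)
    confint(fit.test)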