================================================================

* MODEL SELECTION

- PREAMBLE: ADJUSTED R SQUARE

  . R2 is biased: E[R2] > 'true R2' (because of the LS optimization).
  . Attempt at de-biasing:
      R2 = 1 - RSS/TSS = 1 - (RSS/(N-1)) / (TSS/(N-1))
      RSS/(N-1) is downward biased as an estimate of sigma^2
      TSS/(N-1) is unbiased as an estimate of sigma.y^2   (? really?)
    => Estimate the true R2 by
      R2.adj := 1 - s^2/s.y^2  <  R2
      where  s^2   = RSS / (N-p-1)
             s.y^2 = TSS / (N-1)

- MODEL SELECTION: Play with models using the 'add1' and 'drop1' functions.

  . Stepwise forward: ...
      summary(lm(MEDV ~ RM, data=boston))
      summary(lm(MEDV ~ RM + LSTAT + NOX, data=boston))
    A clumsy function to 'help':
      add1(lm(MEDV ~ 1, data=boston), scope= ~ RM + LSTAT + NOX)

  . Stepwise backward: ...
      summary(lm(MEDV ~ ., data=boston))
      summary(lm(MEDV ~ . - AGE, data=boston))
      summary(lm(MEDV ~ . - AGE - INDUS, data=boston))
    Another clumsy function to 'help':
      drop1(lm(MEDV ~ ., data=boston))

- STEPWISE REGRESSION:

  . Forward:  Starting from the null model (mean only), successively enter the
              best variable as long as its t-statistic IS significant.
  . Backward: Starting from the full model, successively remove the worst
              variable as long as its t-statistic is NOT significant.
  . Alternating forward/backward...
    Do these procedures produce 'optimal' submodels?  ...

- ALL-SUBSETS REGRESSION:

  . Minimize RSS / maximize RSquare for given model sizes.
    Report the 1 or 2 or ... best models for each model size.
    ==> Doable for models with up to 35 or so predictors.
        (Branch-and-bound algorithm from the 1970s.)
    Question: How to choose among model sizes (see below)?

  . Software: install the following package once:
      install.packages(pkgs="leaps", repos="http://cran.us.r-project.org/")
      library(leaps)   # needs to be called in every R session

  . 'regsubsets()' in the package finds the best subsets of predictors for every size.
    ## Optional arguments:
    ##   nbest=2           will find the two best for each size
    ##   nvmax=13          will find them for all sizes up to 13
    ##   force.in=c(6,13)  will force variables 6 and 13 into the model (RM, LSTAT)
    ## Examples:
    summary(regsubsets(log(MEDV) ~ ., data=boston))
    summary(regsubsets(log(MEDV) ~ ., data=boston, nbest=3, nvmax=13, force.in=c(6,13)))
    ## A graphical way of looking at these models:
    windows(width=10, height=12)   # opens a plot window on Windows; use x11()/quartz() elsewhere
    plot(regsubsets(MEDV ~ ., data=boston, nbest=2, nvmax=13), scale="adjr2")

- PROBLEM: How high should we go with picking models?

  . Many criteria for stopping the inclusion of predictors:
      Mallows' Cp, AIC, BIC, RIC, CIC, MIC, ...
    Note:
      ~ These criteria come into play only for deciding on the model size.
      ~ For a fixed model size, pick the lowest RSS (= highest Rsquare).

  . A helpful universal trick/idea:  ADD A FEW RANDOM PREDICTORS TO THE MODEL !!!!!!!
    # and check when the search gets trapped in them:
    ############## for(i in seq(1000)) {
    nnoise <- 10           # the number of random predictors we add
    boston.ext.dat <-      # the data frame of actual and random data
      data.frame(boston, as.data.frame(matrix(rnorm(506*nnoise), ncol=nnoise)))
    # Uncomment the following expressions temporarily:
    ## summary(lm(log(MEDV) ~ ., data=boston.ext.dat))   # model with 'junk'
    ## summary(lm(log(MEDV) ~ ., data=boston))           # model w/o 'junk'
    ## Fortunately most random predictors are recognized as worthless,
    ## but occasionally there is one that sticks out by chance.
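    ## Optionally, also count how many of the noise columns (named V1..V10 by
    ## as.data.frame()) appear 'significant' at the 5% level purely by chance:
    ## pv <- summary(lm(log(MEDV) ~ ., data=boston.ext.dat))$coefficients[, "Pr(>|t|)"]
    ## sum(pv[paste0("V", 1:nnoise)] < .05)   # on average about 0.5 of the 10 junk predictors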
    ## Now we do all-subset search and check graphically:
    boston.ext.sub <- regsubsets(nvmax=20, nbest=2, log(MEDV) ~ ., data=boston.ext.dat)
    plot(boston.ext.sub, scale="adjr2", ylim=c(.4,.8))
    ############### dev.flush(); Sys.sleep(.5) }
    # Animate the above by uncommenting the two lines marked with "############".

  . A better version is to scramble the existing X matrix:
    Break the association between y and X by permuting the rows of X,
    and cbind() the result to X.
    ==> The permuted X has the same approximate collinearities as X !!!
    Code:
      boston.ext.dat <-    # the data frame of actual and permuted data
        data.frame(boston, boston[sample(nrow(boston)), -14])
      boston.ext.sub <- regsubsets(nvmax=20, nbest=2, log(MEDV) ~ ., data=boston.ext.dat)
      plot(boston.ext.sub, scale="adjr2", ylim=c(.4,.8))
    => Models of size up to 9 would be ok by this method.

    Open questions:
    . How does the method depend on the number of included junk predictors?
    . Is this a good method for interpretation or prediction or both or neither?
    . Theory???
    . For some progress, see Shaun Lysen's Ph.D. thesis and references therein.
      Recent developments: Candes' "knock-off" method is a smart variation on this scheme.

- Mallows' Cp criterion for model size selection: estimating prediction error

  . Prediction error:  PE := E[ |yhat - mu|^2 ]

  . Facts: Assume y = mu + eps (in R^N), E[eps] = 0, V[eps] = sigma^2 I_N.
    I.e., allow an arbitrary mean vector mu -- NO assumption that mu is in cspan(X)!
    As usual, let H be the hat/projection matrix onto cspan(X), hence r = (I-H) y.

    (1)  E[ |r|^2 ] = E[ |(I-H)(mu + eps)|^2 ]
                    = |(I-H) mu|^2 + (N-p-1) sigma^2

    (2)  PE = E[ |H y - mu|^2 ]
            = E[ |(H y - H mu) - (I-H) mu|^2 ]
            = E[ |H (y-mu)|^2 ] + |(I-H) mu|^2     because < H (y-mu), (I-H) mu > = 0
            = E[ |H eps|^2 ] + |(I-H) mu|^2
            = (p+1) sigma^2 + |(I-H) mu|^2

    (1)-(2):  E[ RSS ] - PE = (N-p-1) sigma^2 - (p+1) sigma^2 = [N - 2*(p+1)] sigma^2

              -------------------------------------------
          ==> | PE = E[ RSS ] - [N - 2*(p+1)] sigma^2  |
              -------------------------------------------

    ==> If we have an unbiased estimate sigmahat^2 from a (larger) correct model,
        then the following is an unbiased estimate of the PE:

              ------------------------------------------
              | Cp := RSS - [N - 2*(p+1)] sigmahat^2  |
              ------------------------------------------

    ==> Search for a suitable submodel size by minimizing Cp.
        Obviously the term -N*sigmahat^2 is irrelevant, so we can just as well minimize
          RSS + 2*(p+1)*sigmahat^2

    [Mallows himself warns against taking too much guidance from Cp.
     He did not first publish Cp; it first appeared in the path-breaking book
     "Fitting Equations to Data" by Daniel and Wood, the first author being
     Cuthbert Daniel, a leading industrial statistician at the time.
     The 'C' in 'Cp' seems to stand for 'Cuthbert'.
     Mallows later wrote two Technometrics articles, "Some Comments on Cp" (1973)
     and "More Comments on Cp" (1995), with skeptical remarks about the strong
     use of Cp at the time.]

- Akaike's criterion, AIC (1973):

  For any MLE, use
      AIC     := -2*log Lik(parameters) + 2*p
      AIC.est := -2*log Lik(MLE) + 2*p
  where 'p' is the number of fitted parameters.
  This should be minimized over model sizes 'p'.
  AIC is based on information theory, where the information in the data is balanced
  against the complexity of the fitted model. It estimates the relative
  Kullback-Leibler information loss incurred when using one model versus another.
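  In practice both criteria are easy to compute for the Boston fits used above.
  A minimal sketch, assuming the 'leaps' package and the 'boston' data frame are
  loaded (summary.regsubsets() reports Mallows' Cp in its conventional scaling,
  RSS/sigmahat^2 - N + 2*(p+1), which has the same minimizer as the Cp defined
  above; base R's AIC() returns -2*log Lik + 2*(number of parameters), counting
  sigma^2 as a parameter):
      bos.all <- summary(regsubsets(log(MEDV) ~ ., data=boston, nvmax=13))
      round(cbind(size=1:13, Cp=bos.all$cp, adjR2=bos.all$adjr2), 2)
      which.min(bos.all$cp)                          # the model size minimizing Cp
      ## AIC for two individual fitted models:
      AIC(lm(log(MEDV) ~ RM + LSTAT, data=boston))
      AIC(lm(log(MEDV) ~ .,          data=boston))   # smaller AIC is preferred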
  For linear models,
      -2*log Lik(beta, sigma^2) = |y - X beta|^2 / sigma^2 + N*log(2*pi*sigma^2)
      AIC.est = RSS / sigmahat^2 + N*log(2*pi*sigmahat^2) + 2*(p+1)
  By the book one should use the MLE of sigma, but this is not done.
  As in Cp, one uses the best available estimate from a larger model.
  In this case sigmahat does not vary with 'p', hence
      AIC.est ~ RSS / sigmahat^2 + 2*(p+1)
  which is equivalent to Cp.

- BIC, or Schwarz' Bayesian information criterion (1978):
      BIC := -2*log Lik + p*log(N)
  The justification uses certain Bayes priors on model sizes.
  Note that this criterion is more stringent: log(N) > 2 for N >= 8.
  Note also that -log Lik grows at the rate N for independent data.
  BIC is known to be 'asymptotically correct': if there is a correct submodel,
  minimizing BIC will asymptotically find it.
  This is not so for Cp and AIC: they are too lenient.
  Reason: Cp, and hence AIC, minimize PE, which errs on the side of variance
  to avoid bias.

- RIC, Risk Inflation Criterion, by Foster and George (1994, in-house)
- CIC, Covariance Inflation Criterion, by Knight and Tibshirani (????)
- MIC, information-theoretic (Rissanen)
- DIC ... ? also info-theory?

- Cross-validation (CV):
  Leave out some data, predict them, measure the prediction error, repeat, average.

  Initially one used leave-one-out CV, which has a simple formula for linear models:
      CV1 := 1/N sum_i r_{-i}^2           where r_{-i} = y_i - yhat_{-i}
           = 1/N sum_i [r_i / (1-H_ii)]^2
  I.e., the leave-i-out prediction error at observation i is

      ----------------------------
      |                          |
      |  r_{-i} = r_i / (1-H_ii) |     (*)
      |                          |
      ----------------------------

  Today, leave-one-out has fallen out of favor.
  ML (Machine Learning) practice is 10-fold CV: leave out 1/10, 10 times, then average.
  There is some evidence that one should actually leave out more, for example 1/3.
  In this case one would repeatedly leave out a random 1/3 and average the
  prediction error estimates.
  Drawback: The more one leaves out, the more data is diverted from estimation to
  validation. Effectively, one estimates the performance of estimators based on
  smaller sample sizes than N.

  'Proof' of the leave-one-out formula -- a neat idea:
  Let the leave-one-out prediction at x_i be yhat_{-i} = x_i^T b_{-i},
  where {-i} means leaving out (x_i, y_i).
  Trick: Construct a modified response vector yvec_{-i} that is the same as y in all
  coordinates except that its i'th coordinate is yhat_{-i}:
      yvec_{-i} = y - (y_i - yhat_{-i})*e_i        (e_i = i'th basis vector)
  Lemma: H yvec_{-i} consists of the fits of the leave-one-out regression at all N
  points; in particular, its i'th coordinate is the leave-one-out prediction yhat_{-i}.
  Hence, looking at the i'th coordinates:
      (1)  (H yvec_{-i})_i = yhat_{-i} = sum_{k!=i} H_ik*y_k + H_ii*yhat_{-i}
      (2)  (H y)_i         = yhat_i    = sum_{k!=i} H_ik*y_k + H_ii*y_i
      (2)-(1):  yhat_i - yhat_{-i} = H_ii*(y_i - yhat_{-i}) = H_ii*r_{-i}
      The left side is also r_{-i} - r_i, so r_i = (1-H_ii)*r_{-i}  ==>  (*)

  Note: The Lemma holds even for penalized LS, i.e., generalized Ridge regression:
      minimize  |y - X b|^2 + lambda * b^T Omega b
      b = (X^T X + lambda*Omega)^{-1} X^T y
  where Omega is a non-negative definite symmetric matrix and lambda > 0 is a
  penalty parameter.
  CV can be used to choose lambda; in this case it is not model selection but
  finding a good trade-off between bias and variance.

- Generalized Cross-Validation (GCV): proposed by Craven and Wahba (1979).
  Simplifies leave-one-out CV by using the average of (1-H_ii) for all residuals:
      GCV := 1/N sum_i r_i^2 / (1 - tr(H)/N)^2
  This was initially thought of as a computational simplification, but it turned out
  to have advantages in that it generalizes to many other methods, including
  smoothing splines.
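  A minimal numerical check of the leave-one-out shortcut (*) and of GCV on the
  Boston fit used above (assumes the 'boston' data frame is loaded; the brute-force
  loop is only for illustration):
      fit <- lm(log(MEDV) ~ ., data=boston)
      r   <- residuals(fit)
      h   <- hatvalues(fit)                     # the diagonal elements H_ii of the hat matrix
      CV1 <- mean((r/(1-h))^2)                  # leave-one-out CV via the shortcut (*)
      ## Brute-force leave-one-out for comparison (slow but transparent):
      CV1.brute <- mean(sapply(1:nrow(boston), function(i)
        (log(boston$MEDV[i]) -
           predict(lm(log(MEDV) ~ ., data=boston[-i, ]), newdata=boston[i, ]))^2))
      GCV <- mean(r^2) / (1 - sum(h)/nrow(boston))^2   # each H_ii replaced by the average tr(H)/N
      c(CV1=CV1, CV1.brute=CV1.brute, GCV=GCV)         # CV1 and CV1.brute agree; GCV is close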
- Lasso: L_1 penalty

  For a given penalty strength lambda, minimize
      |y - X b|^2 + lambda * sum_j |b_j|  =  |y - X b|^2 + lambda * |b|_1
  This makes sense only if the predictor columns are comparable, e.g.,
      X = (x0, x1, ..., xp),   sd(xj) = 1
  The L_1 penalty |b|_1 = sum_j |b_j| has the property that it forces some b_j
  exactly to zero, which amounts to model selection.

  The Lasso has two different uses in practice:
    (1) Use b = argmin_b |y - X b|^2 + lambda * |b|_1 directly.
    (2) Check which of the Lasso coefficients satisfy b_j = 0,
        then refit OLS on the predictors with b_j != 0.
  Version (1) produces a biased estimate of beta.
  In principle this bias could be desirable.
  (A small code sketch of both uses appears at the end of these notes.)

  Compare with generalized Ridge regression:
      b = argmin_b |y - X b|^2 + lambda * b^T Omega b
  The solution is
      b = (X^T X + lambda*Omega)^{-1} X^T y
           ^^^^^^^^^^^^^^^^^^^^
           makes X^T X less ill-conditioned in the case of strong collinearity
           (the historic reason for inventing Ridge regression)
  It is known that the risk E[ |b - beta|^2 ] is lower for the Ridge estimator with
  some small lambda > 0 than for the OLS estimator (which uses lambda = 0) when
  p+1 >= 3.
  ==> Discovered by Charles Stein (Stanford; 'James-Stein estimator').
      Larry Brown knows everything about it.
  ==> Bias is not necessarily bad if the variance gets lowered by more.
  Main message: Don't rule out biased estimators!
  Unbiased estimators are not necessarily 'admissible'.

- Final remarks on model selection:

  . Open issues (1): Model selection when p >> N (genetics).
    Huge literature, some of it misguided by the notion of oracles:
    IF nature is sparse, and IF we knew which few coefficients are non-zero,
    we could perform well.  ==> Oracle performance.
    Oracle performance is achieved by an actual estimator if it performs
    asymptotically as well as the oracle estimator.
    But nature is never strictly sparse; it has a mix of (hopefully few) strong
    and (many) weak signals, but no signals that are exactly zero.

  . Open issues (2): Inference for coefficients after model selection.
    ==> Huge progress being made right now!
        + Taylor et al. (Stanford)
        + PoSI (in our dept)

  . Split-sample approach to inference:
    Always works for inference, but diverts a fraction of the data from estimation.
    Recipe: Do model selection on the training portion of the data,
            then do classical testing and CIs on the test portion.
    (A small code sketch appears at the end of these notes.)
    Issues? What do you expect to happen to the selected models depending on
    the data split?

================================================================
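As referenced in the Lasso section above: a minimal sketch of uses (1) and (2),
assuming the 'glmnet' package (not used elsewhere in these notes); glmnet
standardizes the predictor columns internally, and cv.glmnet() chooses lambda
by 10-fold CV:
    ## install.packages("glmnet")              # once, if not yet installed
    library(glmnet)
    x <- as.matrix(boston[, names(boston) != "MEDV"])    # predictor matrix
    y <- log(boston$MEDV)                                # response, as used above
    cv.fit <- cv.glmnet(x, y, alpha=1)          # alpha=1: Lasso penalty; lambda by 10-fold CV
    b <- coef(cv.fit, s="lambda.min")           # use (1): the (biased) Lasso coefficients
    b
    sel <- setdiff(rownames(b)[as.vector(b) != 0], "(Intercept)")   # predictors with b_j != 0
    summary(lm(log(MEDV) ~ ., data=boston[, c(sel, "MEDV")]))       # use (2): OLS refit on them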
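And a minimal sketch of the split-sample recipe from the final remarks, with a
50/50 random split of the Boston data and base R's step() standing in for
whatever selection method one prefers; rerunning with different seeds shows how
the selected model varies with the split:
    set.seed(1)                                  # only to make the example reproducible
    i.train      <- sample(nrow(boston), size=nrow(boston) %/% 2)
    boston.train <- boston[ i.train, ]
    boston.test  <- boston[-i.train, ]
    ## Model selection on the training half (backward stepwise by AIC):
    fit.sel  <- step(lm(log(MEDV) ~ ., data=boston.train), trace=0)
    ## Classical tests and CIs on the untouched test half, for the selected model only:
    fit.test <- lm(formula(fit.sel), data=boston.test)
    summary(fit.test)
    confint(fit.test)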