---------------------------------------------------------------- RECAP OF LECTURE 1: - Syllabus, requirements, homeworks, grades, topics, texts and resources - R introduction: basic operations and data structures, "Stat-541-R-intro.R" ---------------- LECTURE 2 ANNOUNCEMENTS: - HW 1 (R practice) was due today before class. Solutions posted. - HW 2 (linear algebra and LaTex practice) is posted, due Fri, Sept 21 - HW 3 (R Practice) is posted, due Mon, Sept 17 - HW 4 (linear models practice) will be posted Mon, Sept 17 - Note on file formats: files 'xxx.R' are plain '.txt' files Please, preserve the format. Convenience: Make your favorite editor (MS Word, Notepad,...) the default for opening 'xxx.R' files. Do this once: Open a file 'xxx.R' with 'Open with' > 'Choose Program' Check the box 'Always use the selected program to open...' Select your favorite editor. ---------------- LECTURE 2: - Some quiz questions about R. * What are the basic data types? * For those who know C, what is the difference in handling text data between R and C? * What is a vector? a matrix? an array? * What are the ways of indexing a vector? * What happens generally if two vectors are mismatched in length when they should be of the same length? * What happens when a vector gets indexed with a value outside 1:length(vector)? * Same question for matrices in either dimension. * What happens when a matrix or array gets indexed like a vector? * What happens if a 3-D array gets indexed like a matrix? * What happens generally when R expects a certain basic data type but is given the wrong type? As in paste(1,2, sep="") paste(T,2, sep="") "1" + "2" T + 3 3 | T * How would you program a discrete Brownian motion simulation that steps up/down by 1 with equal probability? (animation) plot(cumsum(sample(c(-1,1),size=10000,replace=T)), type="l") * How would you design a random process that is always the highest value seen to date in a discrete Brownian motion simulation? plot(cummax(cumsum(sample(c(-1,1),size=10000,replace=T))), type="l") - R: Remaining material of "Stat-541-R-intro.R" * For details, see 'help()' * Two more composite data types . Lists: 'list()' and 'unlist()' . Data frames: 'data.frame()' - purpose main use of lists: statistical models main use of data frames: statistical data tables with cols of variable basic data type - coercion: data frame <-> matrix - reading data frames: 'read.table()', 'read.csv()' - column and row names - implemented as lists of vectors * Flow control: . conditionals: 'if'-'else', but also the 'ifelse()' function . loops: for, repeat, while . implicit looping with - 'apply()' for looping over matrix and array dimensions - 'lapply()' for looping over lists - 'tapply()' for over 'ragged data' (unequal number of obs. per case) * String manipulations: needed for HW 3 - 'grep()', 'nchar()', 'paste()', 'strsplit()', 'substring()' - a use of 'grep()' in R: the 'apropos()' function apropos("^q") apropos("^r.+s$") * Distributions: . two types: - continuous: unif, norm, lnorm, t, cauchy, f, chisq, exp, gamma, beta, logis,... - discrete: binom, nbinom, pois, hyper, geom, multinom - most have r, q, p, d version: . pseudo-random numbers: rnorm(n=5, m=10, s=3) . quantiles: qnorm(p=.5, m=10, s=3) . probabilities: pnorm(q=5, m=6, s=1) . densities: dnorm(x=1, m=6, s=2) - do not forget: 'sample()' for sampling from discrete distributions with and w/o replacement - simulation of new distributions is reduced to simulation of one of the above. Recipe for sampling from a univariate distribution with cdf F? - reproducibility of simulations: setting the seed * Plotting: - principle: iterative debugging and refinement - Example: . Fine tuning a single plot: x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y) plot(x, y, pch=16) plot(x, y, pch=16, cex=.5) plot(x, y, pch=16, cex=.5, main="My first R plot") . Multiple plots: par(mfrow=c(2,3)) for(i in 1:6) { x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y, pch=16, cex=.5, main=i) } . New window of given size: windows(width=8, height=6) par(mfrow=c(2,3), mar=c(3,3,3,1), mgp=c(1.8,.5,0)) for(i in 1:6) { x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y, pch=16, cex=.5, main=paste("Plot",i)) } . PDF device driver for inclusion in papers: pdf(file="My First R Plot.pdf", width=6, height=4.5) par(mfrow=c(2,3), mar=c(3,3,3,1), mgp=c(1.8,.5,0)) for(i in 1:6) { x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y, pch=16, cex=.5, main=paste("Plot",i)) } graphics.off() - many more plotting functions to build up complex plots: 'text()', 'lines()', 'points()', 'polygon()', 'rect()', 'legend()', ... - Linear models in preparation of HW 4 ---------------------------------------------------------------- LECTURE 4: - Notes on HW 3: * explain what your code does * show the output of your code if it is not too long and non-trivial * write pretty code, nicely indented, spaced, no lines with 150 characters,... * in general, keep this reader in mind and try to make him happy - Fundamentals of statistical inference: How we arrive at stderrs, CIs, statistical tests in the simplest case. * Imagine observing n ojects (e.g., sample students and measure height) X1,...,Xn i.i.d. with existing expectation mu and variance sigma^2 => E[Xi] = mu, V[Xi] = sigma^2, C[Xi,Xj] = 0 (i!=j) * Realize: If you repeated the data collection, you would get different values. * Summary statistic: mean Xbar = mean(X1,...,Xn) For a given dataset, you obtain one mean value. * But the mean would have been different from another dataset, => The mean is a random variable, even though observed only once. * Question: How much does the mean vary from dataset to dataset? => variance expansion V(Xbar) = ( sum_i Var(Xi) + sum_i!=j Cov(Xi,Xj) )/N^2 = sigma^2 / N => sigma(Xbar) = sigma/sqrt(N) true standard error: dispersion across datasets * Statistical question: Can we estimate sigma(Xbar)? YES, because . sigma(Xbar) = sigma/sqrt(N) and . we can estimate sigma with s: s = sqrt(sum_i (Xi - Xbar)^2 / n) => estimate of sigma(Xbar) is s/sqrt(n) * Statistical inference consists of (approx.) probab. statements about estimates, where the probabilities concern relative frequencies across datasets. The two components that typically make such statements possible are . standard error estimates and . the central limit theorem. * Central limit theorem: (Xbar - mu)/sigma(Xbar) gets ever more N(0,1) distributed as n-->Inf => mu +- 2*sigma(Xbar) contains Xbar for 95% of datasets Xbar +- 2*sigma(Xbar) contains mu for 95% of datasets !!!!!! * Confidence intervals: estimate sigma(Xbar) with stderr = s/sqrt(n) Xbar +- 2*s/sqrt(n) contains the true mu for 95% of datasets * Testing: Is the assumption mu=mu0 reasonable in light of the data X1...Xn? "Acceptance/non-rejection region" mu0 +- 2*stderr contains Xbar for 95% of datasets If Xbar is outside, reject mu0 because it makes the observed Xbar too unlikely. * Generalization to parameters other than means and mu: . standard errors and their estimates can be obtained analytically or computationally for most estimates. . a CLT holds almost invariably => statistical inference with CIs and tests can be performed theta.est +- 2*stderr(theta.est) contains true theta for 95% of datasets. * Generalization to regression: . Strangeness of regression: * The predictor values are considered fixed, not random. (Regression is "conditional on x".) * Only the response is considered random. . Hence "varibility across datasets" means datasets that have the same x-values but different y-values. (In a simulation, keep x values fixed, simulate only y values.) . Yet there is a further argument that says conditional inference is valid unconditionally (across datasets with varying x). . Stats students: Look out for the notion of "ancillarity" in math stat. ---------------------------------------------------------------- LECTURE 5: - Recap: essence of statistical inference * "approximate probability statements about estimates" * "the probabilities refer to variation dataset to dataset" - Quiz/Recap: log transformations * Purpose of logarithms in statistical modeling? * Characterization of ratio scales? * How are factor changes/difference expressed in business (and elsewhere)? * What factor change does a yearly growth rate of 146% imply? * If a predictor variable is log-transformed, how do you interpret the slope? * If a response variable is log-transformed, how does it affect the interpretation of slopes? * What is elasticity? - New Material: * Cost models: notions of variable and fixed costs . See the blocks example in Stat-541-Notes.R . Response: total versus average cost models . Predictors: variable versus fixed cost factors * R: formula language for modeling . "operator overloading": new meanings for +,-,1,*,/,^ . prevent overloading: I(...), e.g., I(1/x) . formulas are "first class citizens": myf <- y ~ x1 + x2 # y = beta0 + beta1*x1 + beta2*x2 + eps myf myd <- data.frame(x1=rnorm(100), x2=rnorm(100), x3=sample(c("a","b","c"),size=100, replace=T), y=rnorm(100)) summary(lm(myf, data=myd)) . linear models without interactions: # y = beta0 + beta1*x1 + beta2*x2 + eps summary(lm(y ~ x1 + x2, data=myd)) . linear models with interactions: # y = beta0 + beta1*x1 + beta2*x2 + + beta12*x1*x2 + eps summary(lm(y ~ x1 * x2, data=myd)) summary(lm(y ~ x1 + x2 + I(x1 * x2), data=myd)) # dito summary(lm(y ~ x1 * x2 * x3, data=myd)) # dito . categorical variables also (e.g., Location in real estate data) summary(lm(y ~ x3, data=myd)) # Uses the alphabetically first category name as the reference group. # y = beta0 + beta.b*1[b] + beta.c*1[c] + eps summary(lm(y ~ x1 + x3, data=myd)) # model?... . interaction of continuous and categorical predictors: summary(lm(y ~ x1 * x3, data=myd)) . some toy code for real estate data: names(real) real[1:3,] summary(lm(Renttotal ~ Sqft, dat=real)) summary(lm(Renttotal/Sqft ~ I(1/Sqft), dat=real)) summary(lm(I(Renttotal/Sqft) ~ I(1/Sqft), dat=real)) # Same summary(lm(Renttotal/Sqft ~ Location, dat=real)) summary(lm(Renttotal/Sqft ~ I(1/Sqft) * Location, dat=real)) * Watch out for the interpretation of . categorical variables . interaction terms . I(1/Sqft) in the real estate data analysis problem ---------------------------------------------------------------- LECTURE 6: * Homework 2: solutions posted, with extensive explanations * Homework 3: - R: . Please, do not show me code you ran in the interpreter with '>' and '+' at the beginning of the line. Insert your code with Notepad or Wordpad into the downloaded file, nicely indented and readable, then copy/paste into the interpreter. . No need to introduce variables that are not re-used. x <- sample(10) y <- 2*x Use nested function calls instead. y <- 2*sample(10) Exception: x <- sample(10) x[x>5] is not the same as sample(10)[sample(10)>5] . You used a lot of lapply(). Keep in mind sapply() also. . Kai Zhang found some puzzling properties of table() output: It is not a vector! x <- c("a", "b", "abc", "c", "a", "b", "a", "a", "ac", "c") t <- table(x) t[1:3] # values are c(4,1,1) is.vector(t[1:3]) Here are two functions that help you along: class(t) attributes(t) . I delight if you go to the trouble and generate pretty output. I delight also if you make efforts to explain even simple things. - Problems 11 and 12 were not the same: There is a difference in logic between . number of words with n anagrams and table(dict.srt.siz-1) # Problem 11 . number of anagram sets of a given size. table(dict.srt.tab) # Problem 12 Recall: dict.srt.tab <- table(dict.srt) dict.srt.siz <- dict.srt.tab[dict.srt] * Homework 4: extended deadline -- due Thursday 5pm - Watch out for interpretations of regression coefficients! . "bj = est. ave. difference in y for a unit difference in xj when all other predictors are the same." . "bj = est. ave. difference in y for two cases that only differ by a unit in xj." . "Unit change in xj" and "holding all other predictors fixed" are formulations often used but misleading because of their implication of active experimentation.) "ceteris paribus" * New Material 1: The LS problem - LS criterion: RSS . in R: RSS <- function(b,X,y) { sum((y - X%*%b)^2) } # . in math: RSS(b) = |y - X b|^2 - Normal equations as stationary equations of RSS: (X^T X) b = X^T y - Formal LS solution: usually not recommended for computations b = (X^T X)^{-1} X^T y - b is the LS estimate of beta - Orthogonality interpretation of normal equations: X^T (...) = 0 => X-columns are orthogonal to the residual vector - Vectors of fits and residuals: yhat = Xb = ( X (X^T X)^{-1} X^T ) y = P y r = y - yhat = ( I - P ) y where P and I-P are orthogonal projections * New Material 2: Covariance properties of LS solutions - assume: Y ~ nx1 random vector Z ~ mx1 random vector with joint distribution ("observed on the same objects") - Expectation: . definition: E[Y] = vector of E[Yi] . estimation: given k i.i.d. draws yi from Y, Ehat[Y] = ybar = sum yi/k . affine transformation: if A mxn, c mx1, both constant, then E[AY + c] = A E[Y] + c - Covariance: . definition: V[Y,Z] = E[ (Y-E[Y])(Z-E[Z])^T ] V[Y] = V[Y,Y] components: V[Y,Z]_ij = Cov[Y_i,Z_j] V[Y]_ij = Cov[Y_i,Z_j] V[Y]_ii = Var[Y_i] . what type of matrix are (Y-E[Y])(Z-E[Z])^T and (Y-E[Y])(Y-E[Y])^T? hence a covariance matrix is an average of ... . facts: V[Y,Z] is nxm, V(Y) is nxn V[Y,Z] = E[ Y Z^T ] - E[Y] E[Z]^T V[Y] = E[ Y Y^T ] - E[Y] E[Y]^T . affine transformation: if A mxn, c mx1, both constant, then V(AY + c) = A V(Y) A^T (specialize to a linear form: Z with m=1) . estimation of V(Y): given k i.i.d. samples yi drawn from Y Vhat(Y) = sum ( (yi-ybar)(yi-ybar)^T )/(k-1) affine transformation: zi = A yi + c Vhat(z) = A Vhat(y) A^T (specialize to linear form: A ~ mx1) . in regression we observe only one single y-vector (k=1), but we will make assumptions about its covariance and see what follows from them. Whether the assumptions are realistic for a given dataset needs to be checked with diagnostics. - Linear model assumptions: . model in vector/matrix language: y = beta0 + beta1*x1 + beta2*x2 + ... + betap*xp + eps = X beta + eps where beta is (p+1)x1 where y, xj, eps are is Nx1 X = (1,x1,x2,...,xp) is Nx(p+1) . stochastic assumptions: E[eps] = 0 , meaning: on average the model is "correct" "model is unbiased" V[eps] = sigma^2 * I , meaning: uncorrelated errors, constant variance "homoscedastic errors" . equivalently: E[y] = X beta V[y] = sigma^2 * I . note conditionality: the assumption of fixed X simulation would only sample eps-vectors, leaving X fixed, varying only y (why is this weird? ...) . what variability do E[...] and V[...] refer to? dataset-to-dataset variability holding X fixed... - Linear model inference, step 1: variability of b b = (X^T X)^{-1} X^T y . making the linear model assumptions, it follows: E[b] = beta actually, which part of the assumptions was used? . making the linear model assumptions, it follows: V[b] = sigma^2 (X^T X)^{-1} actually, which part of the assumptions was used? used only the variance part . what variability does V[b] describe? dataset-to-dataset variation . what is the interpretation of the diagonal elements in V[b]? variance of the LS coefficient for predictor xj . could one hope to estimate V[b]? if we had an estimate of sigma^2 . what would such estimates be good for? statistical inference for b ---------------------------------------------------------------- LECTURE 7: * Homework 4: - NOT an exercise in model building - focus on explaining the model suggested in the problem statement - do not throw in all predictors, only those needed to answer the questions raised in the problem statement - you can add some suggestions for further analysis in the technical section, but you do not need to follow up on them - again, this is largely a communications exercise: are you able to convey the story contained in a complex regression equation to a non-statistical audience of some importance? - use scenario and business language: "for ... we expect an average price per sqft of..." "for 95% of properties with these characteristics ..." "... would be overpriced" "... would be a good deal or indicate something is wrong" "for ... we expect an average premium/discount of ..." - look out for a counter-intuitive outlier (check 1/Sqft) - deadline: extension to Friday 5pm * Homework 5 to be issued on Monday: more linear algebra and what it means statistically * Recap and supplements: LS regression - LS criterion RSS: RSS <- function(b) sum((y-X%*%b)^2) . toy example: N <- 5 X <- cbind(rep(1,N), 1:N) y <- 1 + 2*(1:N) . getting at the coefficient estimates the hard way: nlm(RSS, rep(0,2))$estimate # minimizing RSS solve(t(X) %*% X) %*% t(X) %*% y # using normal equations these are NOT the ways of computing regr coeffs, though! we will say more about LS computations later * New Material: - Recall: linear model inference, step 1 the variability of regression coefficients V[b] = sigma^2 (X^T X)^{-1} . what is (X^T X) ? what is (X^T X)/(n-1) ? it can be interpreted as uncentered covariance matrix of the predictors . how does this interpretation jibe with conditionality on X? in reality even predictors are random, so covariance makes sense . what does this formula say about the precision of estimates of regression coefficients in relation to the distribution of the predictor vectors? we gain precision of estimation with more data . as N grows (N -> Inf), what happens to (X^T X)^{-1} ? what happens to (X^T X / (n-1))^{-1} ? (use HW 2, Problem 1b) - Linear model inference, step 2: variability of yhat and r . there is variability in b => there is variability in yhat and r first order: what assumptions are used? E[yhat] = ... E[r] = ... second order: what assumptions are used? V[yhat] = sigma^2 P V[r] = ... V[yhat,r] = 0 . what variability is being described in these equations? dataset to dataset . special cases: diagonal elements: Var[yhati] = ... Var[ri] = ... off-diagonal elements: Cov[yhati,yhatj] = ... Cov[ri,rj] = ... . implications: fits and residuals are inversely correlated of each other at different locations residuals are generally correlated even when the errors are uncorrelated residuals are generally heteroscedastic even when the errors are homoscedastic the variance of residuals is generally ... than the variance of errors could one hope to standardize the residuals so they have equal variance? more variance in a fit means ... variance in a residual pairs of fits and pairs of residuals are ... correlated ---------------------------------------------------------------- LECTURE 8: * Homework 5 is posted * Quiz Qs: Yes or No... - X^T X converges to a limit as n -> Inf? - the variance of bj is the j-th diagonal element of sigma^2 (X^T X)? - if one compares two regressions with the same variables but one with smaller spread of the predictor distribution, is the smaller one preferable? - fits do not vary between datasets? - fits have zero variance? - if the model is true, residuals have constant variance? - residuals are uncorrelated? * Recap: - morals from formulas such as: V[b] = sigma^2 (X^T X)^{-1} V[yhat] = sigma^2 P V[r] = sigma^2 (I-P) great generality, but what is the meaning? * New Material - What does Pij mean in terms of X? P = X (X^T X)^{-1} X^T Let x1,...,xn be the rows of X written as column vectors. Pij = ...? => Pij is the inner product of ... w.r.t. ... - Trace of the projection and total variance: . def.: tr(A) = sum Aii (sum of diagonal elements for square A) . facts: tr(P) = r = rank(P) = rank(X) = dim(colspace(X)) = dim(colspace(P)) tr(P) = p+1 if X is full rank (r=p+1) (see Homework 5) . trace formulas: sum Var(yhati) = sigma^2 tr(P) = sigma^2 r = sigma^2 sum Pii sum Var(ri) = sigma^2 tr(I-P) = sigma^2 (n-r) = sigma^2 sum(1-Pii) . interpretation of the trace formulas: the more predictors the more variability in ... the more predictors the less variability in ... - a basis for estimating the error variance sigma^2 from the trace formula for residuals: . if the model is true (unbiased), then Var(ri) = E[ ri^2 ] . therefore: E[RSS] = ... . therefore: sigma^2 = ... . hence an unbiased estimate of sigma^2 is ... - Self-influence and the projection: . diagonal elements of P, Pii, are called "self-influence"; why? yhati = sum_j Pij yj = Pii yi + sum_{i!=j} Pij yj ^^^^^^ Pii determines how much yi determines yhati (recall: Pii>=0) (how much the observation determines its own fit) . properties of the self-influence values Pii: 0 =< Pii =< 0 (see homework 5) - "Adjustment" and its algebra: see notes, Stat-541-Notes.R, search "ADJUSTED VARIABLES" . partitioning linear regression: partition the predictor matrix: X = (X1,X2) (nxp1, nxp2) partition the LS estimator: b^T = (b1^T,b2^T) (p=p1+p2) y = X b + r = X1 b1 + X2 b2 + r QUESTION: how is b1 different from a regression of y on X1? in y = X1 b1 + r both b1 and r are different from the above!!! because the following problems produce different b1 coefficients: | y - (X1 b1 + X2 b2) |^2 = min_b1,b2 | y - (X1 b1) |^2 = min_b1 . define projections according to the partitions: P = projection onto column space of X P1 = projection onto column space of X1 P2 = projection onto column space of X2 (P1, P2 are NOT partitions of P; all are nxn) Question: is P = P1 + P2 ? (see HW2, P7) . definition: residual operations I-P1 = "adjustment operator" w.r.t. X1 I-P2 = "adjustment operator" w.r.t. X2 y.1 = (I-P1) y is "y adjusted for X1" y.2 = (I-P2) y is "y adjusted for X2" intuitively: removing the variation that can be "explained" by X1 or X2 . note some higher trivialities (for practice): r and the columns of X are orthogonal to each other, hence r and the columns of X1, X2 are ... (if X=(x0,x1,...,xp), then PX = (P x0, P x1,..., P xp)) P X = X P X1 = X1 P X2 = X2 P r = 0 (I-P) r = r (I-P) X = 0 P1 r = 0 P2 r = 0 (I-P1) r = r (I-P2) r = r (I-P1) X1 = 0 (I-P2) X2 = 0 . apply I-P1 and I-P2 to y = X1 b1 + X2 b2 + r : (I-P1) y = (I-P1) X2 b2 + r (I-P2) y = ... <(I-P2)X1,r> = - = 0 . interpretation of b1 and b2? to get the LS regression coefficients b2 for the joint regression y on X1 and X2, one can regress (I-P1)y on (I-P1)X2 i.e., one can regress y adjusted for X1 on X2 adjusted for X1 ---------------------------------------------------------------- LECTURE 9: * Questions about HW 5? (available Friday between 11 and 2:30 for short questions) HW 6 will be posted on Monday; due in a week; topic: LS computations (R) * Quiz questions: - harking back: what do the normal equations express geometrically? - '' : when is y = X b + r a LS decomposition? - what is |(I-P)y|^2 ? RSS - what happens to E[ |(I-P)y|^2 ] as we add more predictors? - in English: as we add one predictor to the model, we transfer one observation worth of variance from r to yhat - what is an unbiased estimate? - give an unbiased estimate of sigma^2 - under what assumptions is this estimate unbiased? - let X=(X1,X2) X2 may contain variables for life style; X1 may be experimental variables or otherwise variables of real interest) describe (I-P2) intuitively, in NYTimes language - keep in mind: if B=(b1,...,bp), then A B = (A b1,...,A bp) - higher trivialities: if y = X1 b1 + X2 b2 + r is the LS decomposition (I-P2) X2 = 0 (I-P2) X1 = (I-P2) r = r - the fundamental adjustment result: the LS estimates b1 in the joint LS decomposition y = X1 b1 + X2 b2 + r can be obtained by regressing (I-P2)y on (I-P2)X1 * New Material: - special case of adjustment: X1 is just column xj X2 is all of X but column xj => simple linear regression without intercept . simplified notation: y.adj = (I-P2) y (y adjusted for all xk except xj) xj.adj = (I-P2) xj (xj adjusted for all xk except xj) bj = slope of simple lin. regr. of "y adjusted for all but xj" on to xj < y.adj, x.adj> bj = --------------- ("adjustment formula" for bj) |x.adj|^2 reminder: bj = j''th multiple regr. coeff. . elementary consequence: does it matter for the value bj if we add or drop some of the other variables? YES - adjusted variables plot or "leverage plot": . a plot of y.adj vs. xj.adj shows what the LS criterion "sees" when determining bj . example: cars <- read.table("data/CarModels2003-4.TXT", header=T, sep=",", na=".") names(cars); dim(cars) rownames(cars) <- paste(1:nrow(cars),as.character(cars[,"Model"])) apply(cars, 2, function(x) sum(is.na(x))) # there is an NA in "MPG.Highway" sel <- !is.na(cars[,"MPG.Highway"]) # to select non-NAs # multiple regression of response on two predictors: cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=cars[sel,]) # adjusted response and other predictor: mpg.adj <- resid(lm(MPG.Highway ~ Horsepower, data=cars[sel,])) wgt.adj <- resid(lm(Weight.lbs ~ Horsepower, data=cars[sel,])) # compare coeff of Weight: same!! "proof" by example... lm(mpg.adj ~ wgt.adj); cars.lm plot(mpg.adj ~ wgt.adj, pch=16, cex=.5); abline(lm(mpg.adj ~ wgt.adj)) . "leverage plot" because it shows points that have high leverage for bj (every bj can have its own leverage points) . canned function for leverage plots with optional point identification: source("http://www-stat.wharton.upenn.edu/~buja/STAT-541/leverage-plots.R") leverage.plots(model=cars.lm, ident=c(1,2), frame.sz=5) # stop with right click - variance/stderr of bj: . goal: obtain Var[ bj ] from the adjustment formula for bj recall variance of a linear combination: Var[ ] = a^T V[y.adj] a = sigma^2 a^T (I-P2) a use a = xj.adj Var[ ] = sigma^2 xj.adj^T (I-P2) xj.adj = sigma^2 |xj.adj|^2 sigma^2 Var[bj] = ------------ |xj.adj|^2 ------------------------------- | | | sigma | "adjustment formula" | stderr[bj] = ------------ | for standard errors | |xj.adj| | of regression coeffs | | ------------------------------- . consequences for statistical inference: + what does |xj.adj| mean statistically speaking? + do we like small or large values of |xj.adj| ? + how do you "see" |xj.adj| in the leverage plot? + what happens to |xj.adj| if we add a predictor that is highly correlated with xj? + how would you measure how collinear xp is with all other predictors? + see Mosteller & Tukey, p.326f, on the proxy problem ---------------------------------------------------------------- LECTURE 10: * Homeworks: - HW6 will be posted tomorrow, due in 2 weeks (fall break): LS computation - "Style": read lessons 3 and 4, be prepared to give solutions for these exercises: 3.4, 3.6, 4.3, 4.6 date of in-class discussion: Wed, Oct 17 - HW5 solutions will be posted later today - questions about HW5? * Quiz questions: - today is proof of global warming, yes or no? - what are averages and what are contrasts? - according to the "adjustment formula", is bj an average or a contrast? < y.adj, xj.adj> bj = --------------- |xj.adj|^2 (imagine xj is time since 1850, the other predictors are season,...) - xj adjusted for all other predictors is obtained by taking residuals when regressing xj on all other predictors - the multiple linear regression coef bj is the coef of simple linear regression w/o intercept when regressing y.adj on xj.adj - |xj.adj|^2 is the RSS of this regression - |xj.adj|/sqrt(n-p) is the RMSE (R: RSE) of this regression - using the projection matrix P2 for regression on all predictors except xj, y.adj = (I-P2) y and xj.adj = (I-P2) xj, we find: < y.adj, xj.adj> = < (I-P2)y, (I-P2)xj> = < y, (I-P2)(I-P2)xj> = < y, (I-P2)xj> = < y, xj.adj > - does the stderr formula now make more sense? sigma stderr[bj] = -------- |xj.adj| - unconnected trick question: if y = X b + r is the LS decomposition for y and X, is there a multiple regression for which . r is the vector of fits, and . X b is the vector of residuals? * New Material: - R Square and sums of squares: Q: how much "variation" does the model given by X "account for"? . approach: use sums of squares (SS) as measures of "variation" but first remove uninteresting variation: the mean . new partition: X = (x0,X1) where x0 = (1,...1)^T in R^n (intercept) is uninteresting and X1 is of interest . projections: P from X, P0 from x0 see HW 5 for the meaning of P0 and I-P0 meaning of P-P0: projection onto the centered cols of X . Pythagorean decomposition of orthogonal projections: I-P0 = (I-P) + (P-P0) show: (I-P)(P-P0) = 0 . Pythagorean theorem or orthogonal decomposition of sums of squares: |(I-P0)y|^2 = |(I-P)y|^2 + |(P-P0)y|^2 TSS = RSS + MSS proof: <(I-P)y, (P-P0)y> = 0 . definition of R Square ("R2"): MSS R2 = ----- "fraction of variation accounted for by the model" TSS . can we calculate E[TSS], E[RSS] and E[MSS] theoretically? assuming the model is unbiased: E[y] = ... E[RSS] = sigma^2 (n-p-1) (we know this off the top of our heads) E[MSS] = E[ y^T (P-P0) y ] (difficulty: ...) = E[ (X beta + eps)^T (P-P0) (X beta + eps) ] = E[ beta^T X^T (P-P0) X beta + beta^T X^T (P-P0) eps + eps^T (P-P0) X beta + eps^T (P-P0) eps ] = | (I-P0) X beta |^2 + 2*0 + (assuming E[eps] = 0) E[ eps^T (P-P0) eps ] = E[ tr( (P-P0) eps eps^T ) ] + | (I-P0) X beta |^2 = tr( (P-P0) E[ eps eps^T ] ) + | (I-P0) X beta |^2 = tr( (P-P0) V[ eps ] ) + | (I-P0) X beta |^2 = tr( (P-P0) sigma^2 I ) + | (I-P0) X beta |^2 = sigma^2 tr(P-P0) + | (I-P0) X beta |^2 = sigma^2 p + | (I-P0) X beta |^2 E[TSS] = sigma^2 (n-1) + | (I-P0) X beta |^2 . interpretations: if the model is correct, E[RSS] arises only from error E[MSS] arises from both error AND the truth motivates definition of "signal-to-noise ratio" | (I-P0) X beta |^2 SN = ------------------- sigma^2 p Q: do we know the SN in actual data analysis? ... . can we calculate E[R2] theoretically? - statistical inference: . have we made use of eps ~ N(0, sigma^2 I) yet? . what is this assumption used for? . two types of sources for statistical inference: + exact distribution theory for "exact" t-tests and CIs uses Gaussian assumption + asymptotic theory based on the CLT for approximate tests and CIs does not use Gaussian assumption but requires n "large" . goal: justify + CIs such as bj +- 2*stderr.est[bj] + t-tests based on t = bj/stderr.est[bj] let s = RMSE = sqrt( RSS / (n-p-1) ) ---------------------------------- | s | | stderr.est[bj] = ------------ | | |xj.adj| | ---------------------------------- | s^2 | | V.est[b] = (X^T X /n)^{-1} --- | | n | ---------------------------------- . a generalized CLT for asymptotic inference: as n -> Inf, p is held fixed, and cases are i.i.d. sampled, bj-betaj -------------- ~->~ N(0,1) stderr.est[bj] (some strangeness: this CLT still conditions on the predictors, yet the predictors get sampled from a distribution as n -> Inf) ---------------------------------------------------------------- LECTURE 11: * Organization: - no class Mon, Oct 15. - reading assignment "Style" for Wed, Oct 17 - HW 6 not yet posted (check back tomorrow Thu) * Quiz: - what is xj.adj intuitively and technically? getting rid of variation due to other predictors taking residuals from the regr of xj on all other predictors - where and how does it play a role? bj, Var(bj), leverage plot - why does y not need to be adjusted in the formula for bj? - what is the MSS? |yhat centered|^2 - what does the ANOVA decomposition mean geometrically? Pythagorean decomposition in (n-1)-space - what do we know about E[RSS] and E[TSS]? - statistical inference: what are the two major approaches? practical: testing and CIs theoretical foundations: probability theory CLT -> asymptotic inference normal assumption -> finite sample inference - what are their relative advantages and drawbacks? asymp: good only for large n; do not need normal assumption finite sample: good for all n; need that normal assumption - what enables us to infer variation of estimates across datasets? sampling assumptions * New Material: Statistical inference for linear models: - asymptotic inference: . testing H0: betaj=betaj.null (most common: betaj.null=...) bj - betaj.null z = --------------- ~->~ N(0,1) if H0 is true (CLT, for n -> Inf) stderr.est[bj] hence reject H0 at 5% significance if |z| > 2 (e.g.) significance level = P[ rejection | H0 ] = P[ 'false rejection' ] = P[ |z| > 2 | H0 ] = .05 . confidence interval for betaj: bj - beta P[ -2 < -------------- < +2 ] ~->~ .95 as n -> Inf stderr.est[bj] invert and turn into an interval statement for beta: P[ betaj in (bj-2*stderr.est[bj], bj+2*stderr.est[bj]) ] ~->~ .95 as n->Inf what is random in this statement? what is not? the interval, betaj . note: if you rely on asymptotic inference you do not need to diagnose normality of errors (so when the heck do we need normality???? ...) - finite-sample inference based on normal assumptions: . recall: bj = /|xj.adj|^2 is a linear combination of ... Q: if y ~ N(X beta, sigma^2 I), what is the distribution of bj? . general properties of normal distributions: + multivariate normal distributions are uniquely characterized by their expectation and variance (1st & 2nd moments) + any matrix transformation of a (multivariate) normal distribution yields again a normal distribution . A: for the whole coef vector of estimates: b = (X^T X)^{-1} X^T y is a linear transformation of y E[b] = beta V[b] = sigma^2 (X^T X)^{-1} => b ~ N(beta, sigma^2 (X^T X)^{-1}) (multivariate normal) one coeff estimate at a time: bj = < y, xj.adj/|xj.adj|^2 > is a linear combination of y E[bj] = betaj (assuming the model is unbiased) V[bj] = sigma^2 / |xj.adj|^2 (assuming the variance properties) => bj ~ N(betaj, sigma^2 / |xj.adj|^2) (univariate normal) . Q: how do we arrive at a t-distribution? (we know we should!) A: we estimate sigma^2 with s^2 = RSS/(n-p-1) (s=RMSE=RSE=...) bj - betaj (bj - betaj) t = -------------- = -------------- stderr.est[bj] s/|xj.adj| . Q: how is a t-distribution with k degrees of freedom defined? A: for any Z, Z1,...,Zk that are i.i.d. N(0,sigma^2) ^^^^^^^ irrelevant but same Z t = ----------------------- / Z1^2+...+Zk^2 \ sqrt(------------------) \ k / . Q: why would the earlier t be distributed the same as this t? A: some pretty linear algebra and geometry Z = (bj - betaj)*|xj.adj| ~ N(0,sigma^2) (Sathya: thanks!) s^2 = |r|^2 / (n-p-1) = ??? let U be a nx(n-p-1) matrix whose columns + span residual space (i.e., a basis of the space orthogonal to X) + are orthonormal: U^T U = I note: PU = 0, hence (I-P)U = U we can express r in terms of U: r = Uz (z = coordinates of r in the basis U) z can be obtained easily from r: z = U^T r (z = linear mapping of r) and here is the punch line: |r|^2 = |z|^2 (show it!) the first and second moments of z are: E[z] = 0 V[z] = sigma^2 I where I is of size (n-p-1)x(n-p-1) but z is a linear transformation of r and hence of y: z ~ N(0, sigma^2 I) i.e., writing z = (Z1,Z2,...), Z1, Z2,... are i.i.d. N(0,sigma^2) !!!!!!!!! we still have to show that Z and (Z1,Z2,...) are uncorrelated: Cov[Z,Zk] = |xj.adj| Cov[ , ] = '' xj.adj^T V[y,r] uk = '' xj.adj^T V[r,r] uk = '' xj.adj^T (I-P) uk sigma^2 = '' sigma^2 = 0 because xj.adj is orthognal to residual space . properties of the t-distribution with k degrees of freedom: + bell-shaped x <- seq(-4,4,length=500) plot(x, dt(x,df=1), type="l", ylim=c(0,.45), col="blue", xlab="quantile", ylab="t-density") abline(h=0) text(x=0, y=.25, lab="Cauchy", col="blue") text(x=0, y=.23, lab="df=1", col="blue") + as df increases, the t-distribution gets lighter in the tails: colrs <- gray.colors(100,0,.8) for(df in 2:100) lines(x, dt(x,df), col=colrs[df], lwd=.5) + as df -> Inf, the t-distribution approaches N(0,1) distribution: lines(x, dnorm(x), col="red", lwd=2) text(x=0, y=.44, lab="Gauss", col="red") text(x=0, y=.42, lab=expression(paste("df=",infinity)), col="red") R: for math symbols in R plots see help(plotmath) demo(plotmath) ---------------------------------------------------------------- LECTURE 12: * Organization: - HW 6 due Monday - "Style", Lesson 4, next time * Writing: Exercise 3.7, p.48 -- revise the following sentences - The use of models in teaching prose style does not result in improvements of clarity and directness in student writing. When we use models to teach prose style, student writing does not improve in clarity and directness. - Precision in plotting the location of building foundations enhances the possibility of its accurate reconstruction. We are more likely to reconstruct buildings accurately if we precisely plot the location of their foundations. - Any departures by the members from established procedures may cause termination of membership by the Board. If members depart from established procedures, the Board may terminate membership. - A student''s lack of socialization into a field may lead to writing problems because of his insufficient understanding about arguments by professionals in that field. - The successful implementation of a new curriculum depends on the cooperation of faculty with students in setting achievable goals within a reasonable time. Your writing may not suffer from these defects, but when you see such prose in the future, you will know what is wrong about it and be able to parse and, if necessary, improve on it. * Quiz: - linear algebra: If Q = (Q1,Q2) is an nxn orthogonal matrix decomposed into two blocks, nx(p+1) and nx(n-p-1), what is Q1 Q1^T + Q2 Q2^T = I (nxn) - statistical inference: . two math approaches? finite sample inference and asymptotic inference . their pluses and minuses? finite sample: + "exact" for n, - normal assumption asymp. inf.: + no normal assumption, - "big n" and only approx. - fundamental statements for the two types of approaches in linear models? bj - betaj finite sample: has an exact t-distribution t = -------------- stderr.est[bj] asymp. inf.: has an approx. normal distr. - t-distribution in finite-sample "exact" inference is based on the normal assumption: y ~ N(X beta, sigma^2 I) . definition of the t-distribution: for any Z, Z1,...,Zk i.i.d. N(0,sigma^2) Z t = ----------------------- / Z1^2+...+Zk^2 \ sqrt(------------------) \ k / . why is the above ratio 't' t-distributed? bj - betaj ~ N(0, sigma^2 / |xj.adj|^2) stderr.est[bj] = s / |xj.adj| need to show: s^2 is of the form (Z1^2+Z2^2+...)/(n-p-1) and is independent of bj trick? (method, really) introduce a coordinate system in residual space -> Z1,Z2,... s^ = r^2/(n-p-1) is independent of bj because Cov[r,b] = sigma^2 (I-P) X (X^T X)^{-1} = 0 hence r and b are independent hence s and bj are independent * New Material: - LS computations, HW 6 -- ask questions now! - basics of Q-R decompositions: . introduce ANY o.n. basis of the column space of X (space of fits) . extend it to an o.n. basis of the orthogonal complement (residual space) collect the columns in Q = (Q1,Q2) and note that for an arbitrary b: | y - X b |^2 = | Q^T (y - X b) |^2 = | Q1^T (y - X b) |^2 + | Q2^T (y - X b) |^2 = ... - the art is to construct Q1 such that R = Q1^T X is simple - HW 6: use MODIFIED Gram-Schmidt to create a Q-R decomposition . G-S: successively adjust for all PREVIOUS vectors (which are o.n.) x1 (normalize) -> q1 x2 adjusted for q1 (and normalized) -> q2 x3 adjusted for q1, q2 (and normalized) -> q3 ... . modified G-S: successively adjust all SUBSEQUENT vectors copy X into Q, then do the following in place normalize q1 adjust q2, q3, ... for q1 normalize q2 adjust q3, q4, ... for q2 normalize q3 ... ---------------------------------------------------------------- - Alternative to G-S and adjustment: Householder reflections . H = I - 2 q q^T where q is a unit length vector . geometric meaning: for u orthogonal to q: H u = ... for u a multiple of q: H u = ... . eigendecomposition: lambda=... for ...; lambda=... for ... . reflection property: H^2 = ... . H is symmetric and orthogonal: H^T = H; H^{-1} = ... . fundamental property: if |u| = |v|, there exists H such that H u = v why? choose q = (...)/|...| . generating a Q-R decomposition with Householder reflections: if X=(x1,x2,...), choose H1 such that H1 x1 = (*,0,0,...)^T choose H2 such that H2 (H1 x2) = (*,*,0,...)^T ... after (p+1) steps, we have ...H3 H2 H1 X = upper triangular -> Q = H1 H2 H3... is nxn orthogonal and Q^T = ...H3 H2 H1 upper-triangularizes X -> z = Q^T y = (z1,z2) where z1 is (p+1)x1 X^T Q = (R^T,0) solve z1 = R b for b - An alternative to G-S and Householder: Givens rotations... Givens rotation G = 2x2 rotation in two coordinates, leaving the rest fixed use for Q-R decompositions: a serious of Givens rotations can annihilate all elements of X below the diagonal G1 G2 G3 ........... X = upper triangular ^^^^^^^^^^^^^^^^^^^^ = Q^T ---------------------------------------------------------------- LECTURE 13: * HW 6 was due today; solutions are posted HW 7 will be posted later this week; due no less than 7 days later * Quiz: - If Q^T Q = I, and Q=(q1,...qn), then sum qj qj^T = I? - If P^2 = P, then \lambda \in \{0,1\} - If P^2 = P and P^T=P, what is the matrix of P after a basis transformation where the basis consists of eigenvectors? - If R is square, upper triangular, and if u=(*,*,0,...,0)^T, then R u = (...,...,...,...)^ - If R is square, upper triangular, and non-singular, then R^{-1} is ... - If x is a predictor vector, and xbar its mean, write down the slope of x in the simple regression on x using the adjustment formula. - If a predictor is a categorical variable with three categories, what is the design matrix X? - If a predictor is a categorical variable with three categories, what is the projection matrix P? (Work without intercept.) - Why do the following regressions produce the same fits? lm(y ~ data.frame(x,x^2)) lm(y ~ data.frame(x,(x-mean(x))^2)) - Same question for the following: lm(y ~ data.frame(x,abs(x))) lm(y ~ data.frame(x,pmax(x,0))) * New Material: - special LS cases and interpretation of their coefficients: . one categorical predictor coeffs = means of categories, or mean of cat 1 and differences . two (or more) categorical predictors + additive lm(y ~ x1 + x2) coeffs = ... + with interaction lm(y ~ x1 * x2) + single vs multiple observations per cell single: no interactions, multiple: with interactions . one quantitative predictor simple lin regr . two (or more) quantitative predictors + additiv + with interaction . one categorical and one quantitative predictor + additive + with interaction - notion of degree of freedom . Def: dfs = dimension of space considered . Ex.: dfs in mean dfs in three-group categorical variable dfs in interaction between two categorical vars w 2 and 4 groups, resp. dfs in interaction between a quant var and a categ var 2 groups dfs in model dfs in residuals - F-tests for comparing two arbitrary nested models . Purpose: testing groups of predictors jointly in particular: overall F-test . Def. of F-distribution: W1,...,Wq,Z1,...Zr i.i.d. N(0,sigma^2) ^^^^^ irrelevant but same (W1^2+...+Wq^2)/q F = ----------------- has a (central) F-distribution with q,r dfs (Z1^2+...+Zr^2)/r . F-ratios in linear models: ratio of sums of squares divided by their dfs + Assume predictors x1,...xq are all irrelevant: H0: beta1=...=betaq=0 + SS(x1,...,xq) = | y projected onto x1,...,xq adj. for all other preds. |^2 distribution? y = lin.comb. of other predictors + eps u1,...,uq: o.n. basis of the space of x1,...,xq adj. for all others i.e.: u1,...,uq in space of ALL predictors but: = 0 y projected onto x1,...,xq adjusted = u1 +...+ uq = QQ^T y | y projected onto x1,...,xq adjusted... |^2 = ^2 +...+ ^2 = ^2 +...+ ^2 = W1^2 +...+ Wq^2 reason: E[]=0, E[^2]=sigma^2 ) + F-ratio for x1,...,xq: SS(x1,...xq)/q F = -------------- has an F-distr with q,(n-p-1) dfs under H0 RSS/(n-p-1) . special case of one predictor: F = t^2 (recall: t=Z/sqrt((Z1^2+...+Zr^2)/r) ---------------------------------------------------------------- LECTURE 14: * HW 5 and 6: solutions are posted, please, do study them!!! * From the webpage: "Style", lessons 3 and 4 be prepared to give solutions for these exercises: 3.4, 3.6, 4.3, 4.6 Exercise 4.3, p.67: 1. It is believed that a lack of understanding about the risks of alcohol is a cause of student binging. We/experts believe that students who do not understand the risks of alcohol tend to binge drink. 2. The model has been subjected to extensive statistical analysis. * From the news: testosterone and male mortality from YahooNews, Reuters, 2007/10/23 * Quiz: - When considering simultaneously predictors x1, x2, x3, the relevant subspace of R^n is not their span but ... - A categorical predictor with 3 categories generates 2 columns in the design matrix - Degrees of freedom are dimensions of relevant subspaces - An F-test in a linear model has the null hypothesis H0:... - The overall F-tests has the null hypothesis H0: y=mu+eps - An F-ratio is formed with numerator = | projection onto relevant subspace |^2/df and denominator = RSS/df(res) * New Material: linear models -- CIs and PIs for fitted values, diagnostics * rules for F-distributions? + "2 for t": works for about n~dfs >= 25 qt(.975, df=1:30) + F-distribution: df2 <- 5:30 plot(x=range(df2), y=c(0,7), type="n", xlab="df2", ylab="F") for(df1 in 1:10) { lines(x=df2, y=qf(.95, df1=df1, df2=df2)) text(x=df2[1], y=qf(.95, df1=df1, df2=df2[1]), lab=df1, cex=.8, adj=1.2) } no apparent simple rule - inference for fits vs. prediction: CIs vs PIs . let xx be a row (!) from the design matrix X written as a column, or a new set of predictor values e.g., when modeling the car data: MPG = b0 + b1*WEIGHT + b2*HP => xx=(1,WEIGHT,HP)^T yhat(xx) = where b=(b0,b1,b2) . inference for yhat(xx): if the model is unbiased, E[yhat(xx)] = then: Var[yhat(xx)] = Var[xx^T (X^T X)^{-1} X^T y] = sigma^2 (xx^T (X^T X)^{-1} xx) stderr.est(yhat(xx)) = s sqrt(xx^T (X^T X)^{-1} xx) uses: 95%CI = yhat(xx) +- 2 stderr.est(yhat(xx)) for true E[y(xx)]=mu(xx)= testing H0: E[MPG(WEIGHT=1000,HP=200)]=mu0(xx)=20 mpg (rarely done) => xx=(1,1000,200)^T t = (yhat - mu0(xx))/stderr.est(yhat) reject when |t|>2 '2' in CIs and tests should really be: qt(p=.975, df=n-p-1) ex.: simple linear regression with one predictor x, calculate (X^T X)^{-1} plot CI(x) as a function of x . prediction intervals based on yhat(xx): + let y(xx) be the random variable of FUTURE response values at xx whereas yhat(xx) is obtained from PAST data + assume the model is correct, wrt first and second moments and normality for past and future data + wanted: an interval around yhat(xx) that has a 95% chance of catching y(xx) + approach: find distribution of y(xx)-yhat(xx) E[ y(xx) - yhat(xx) ] = 0 V[ y(xx) - yhat(xx) ] = V [ y(xx) ] + V[ yhat(xx) ] = sigma^2 + stderr(yhat(xx))^2 = sigma^2 + sigma^2*(xx^T (X^T X)^{-1} xx) = sigma^2 * (1 + xx^T (X^T X)^{-1} xx) assuming past and future response values are normally distributed: y(xx) - yhat(xx) t = -------------------------------- has a t-distribution with df=n-p-1 s*sqrt(1 + xx^T (X^T X)^{-1} xx) reasons: the Gaussian assumption AND the numerator is independent of the denominator because - s is computed from the residuals of past data y - yhat(xx) is computed from the projection of y onto the X-space - y(xx) is future data at xx => s is independent of yhat(xx) and y(xx) (recall: V[r,yhat]=0 => under normality r and yhat are independent) + PI(xx) = yhat(xx) +- 2*s*sqrt( 1 + xx^T (X^T X)^{-1} xx ) ('2' should be ...) ---------------------------------------------------------------- LECTURE 15: * Style: study lesson 5 - Do exercise 5.2, 3. part, and hand in your solution ON PAPER. Due: Mon, Nov 5, in class. Be prepared to discuss this piece in class. * Quiz: - What is yhat(xx)? - How do you find a stderr and stderr.est for yhat(xx)? compute var(yhat(xx)), estimate sigma with s - What guarantees does a CI for yhat(xx) make? P[ in CI ] = .95 - What are the assumptions? normal linear model is correct (finite sample "exact" inference) N large (asymptotic approximate inference) - What is the conceptual framework for forming a PI? - How do you find the width of a PI? - What are the sources of variation a PI has to account for? dataset-to-dataset in yhat(xx) data variation in y(xx) * New Material: - problems with linear models: (1) E[y] != X beta for any beta (2) V[y] != sigma^2 I for any sigma (3) err !~ N(0,sigma^2) (4) leverage points: outlyingness in predictor space => need . detection: diagnostics => graphical methods, combined with inferential ideas . fixes: better models (may not be possible) - fixes: . if one can assume that all relevant predictors are at hand, and yet the model does not fit, try TRANSFORMATIONS and additional FUNCTIONAL TERMS such as interactions . Box-Cox family of power/log-transformation: assuming y>0 box.cox <- function(y, pow=1) if(pow != 0) return( (y^pow - 1)/pow ) else return( log(y) ) illustration: y <- seq(0,3,length=201) plot(x=range(y), y=c(-3,1.5), type="n", xlab="y", ylab="box.cox(y)") abline(h=0, col="gray90"); abline(v=1, col="gray90") for(pow in seq(-2,2,by=.2)) { lines(y, box.cox(y,pow), col="gray50") points(0, box.cox(0,pow), pch=16, cex=.3, col="gray50") } pow <--1; lines(y, box.cox(y,pow), col="green", lwd=2) pow <- 0; lines(y, box.cox(y,pow), col="red", lwd=2) pow <- 1; lines(y, box.cox(y,pow), col="blue", lwd=2) properties: ascending, convex for pow>1, concave for pow<1, BC(1)=0 ex.: pick the transformation of y that yields the highest R2 data example: Boston housing data (stay tuned for background) site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/" boston <- read.table(paste(site,"boston.dat",sep=""), header=T) rownames(boston) <- paste(scan(paste(site,"boston.geography",sep=""), w=""), 1:nrow(boston)) boston[1:5,] try a collection of Box-Cox parameters for(pow in c(2.0,1.0, 0.5, 0.0, -0.5, -1.0, -2.0)) cat("power =",pow,":\n R^2 =", summary(lm(box.cox(MEDV, pow) ~ ., data=boston))["r.squared"][[1]],"\n") winner: pow=0 (but by how much?) (could be used for transformations of predictors also: Box-Tiao) . the ultimate hammer for transformations: the ACE algorithm (Breiman-Friedman 1985, JASA) -- stay tuned! . non-linear model terms: when it is likely that y = f(x1,x2,...) + eps with f(...) nonlinear, try: + interactions: y ~ x1 * x2 = y ~ x1 + x2 + x1:x2 => b1*x1 + b2*x2 + b12*x1*x2 = b1*x1 + (b2 + b12*x1)*x2 = b2*x2 + (b1 + b12*x2)*x1 the "effect" of x1 on y depends on level of x2 + polynomial terms: y ~ x + I(x^2) amounts to a quadratic transformation of the predictor but with free parameters: y = b0 + b1*x + b2*x^2 + r (b2 is the "curvature parameter" for the transformation of x) - diagnostics: . R diagnostics plots: help(plot.lm) boston.lm <- lm(MEDV ~ ., data=boston) par(mfrow=c(2,2)) plot(boston.lm) refined: par(mfrow=c(2,2), mgp=c(1.8,.5,0), mar=c(3,3,2.5,2)) plot(boston.lm, pch=16, cex=.5, cex.axis=.8, cex.lab=.8) + residuals vs fitted values: to see violations of ... + normal q-q plot of stdized residuals: to see violations of ... + S-L plot of sqrt(|stdized residuals|): to see violations of ... + stdized residuals vs leverage values P_{ii}: to see ... . missing diagnostics plots in R: + residuals versus order: to see "lurking" variables (such as: time, space,...) par(mfrow=c(1,1), mgp=c(1.8,.5,0), mar=c(3,3,2.5,2)) plot(resid(boston.lm), xlab="order", ylab="residuals", pch=16, cex=.3, cex.axis=.8, cex.lab=.8); abline(h=0, col="gray50") + leverage plots: to see the influence of cases on ... source(paste(site,"leverage-plots.R",sep="")) leverage.plots(boston.lm, cex=.5) - INFERENCE FOR DIAGNOSTICS PLOTS: . general idea: create datasets from estimated model ------------------------ | "parametric bootstrap" | ------------------------ parametric bootstrap for linear model: y* ~ N(X b, s^2 I) r <- resid(boston.lm) # actual residual vector RMSE <- function(model) sqrt( deviance(model) / model$df.resid ) X <- model.matrix(boston.lm) yhat <- fitted(boston.lm) s <- RMSE(boston.lm) N <- nrow(X) y.boot <- rnorm(N, m=yhat, s=s) # a simulated response vector r.boot <- resid(lm(y.boot ~ X)) # a simulated residual vector windows(wid=10, hei=5) par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,2,1)) qqnorm(r, main="Actual Data", pch=16, cex=.5) qqnorm(r.boot, main="Bootstrap Data", pch=16, cex=.5) boston.boot.lm <- lm(y.boot ~ X) par(mfrow=c(2,2)) plot(boston.boot.lm) The idea can be used for other types of model checks as well. ---------------------------------------------------------------- LECTURE 16: * Organizational: - Reminder: Style, exercise 5.2, 3. part; hand in your solution ON PAPER. Due: Mon, Nov 5, in class. Be prepared to discuss this piece in class. - Homework 7 posted, due Mon, Nov 12 (warning: Homework 8 might get posted before Nov 12) * Quiz: - What are the three linear model assumptions that can be violated? 1st order: E[y] = X beta 2nd order: V[y] = sigma^2 I normality - What kind of model violation is high leverage? not a model violation! interpretational problem only keep or toss high leverage points? - Graphical diagnostics for linear model assumptions? plot r vs yhat plot |r standardized| vs yhat normal quantile plot of r - Graphical diagnostics for leverage? plot ri (stdized) vs Pii (self-influence) leverage plots, one per predictor y.adj vs xj.adj - Graphical diagnostics for lurking variables? r vs order - Idea for gauging whether model violations are present? parametric bootstrap: simulate the model, assuming the estimated parameters are the true parameters execution of the idea for linear models: y* ~ N(yhat=X b, s^2 I) then use for residual plots, normal q-q plot, anything... * New Material: - Model checking based on: SUFFICIENCY THEORY (see math stat for a full understanding) (a deeper idea then parametric bootstrap, but does not always apply) y ~ N(X beta, sigma^2 I) p(y|beta,sigma) = exp(|y - X beta|^2/(2*sigma^2)) / (sigma^N * sqrt(2 pi)^N) SUFFICIENT STATISTICS: |y|^2 and X^T y together the data density depends on the parameters beta and sigma only through them [another example: if sigma=1 is known, y ~ N(X beta, I) p(y|beta) = exp(|y - X beta|^2/2) / (sqrt(2 pi)^N) depends on beta only through X^T y, but not |y|^2 ] equivalent suff. stat.: b = (X^T X)^{-1} X^T y and RSS = |y - X b|^2 equivalent suff. stat.: yhat = P y = X b and RSS = |(I-P) y|^2 DEEP FACT: the conditional distribution of y given a sufficient statistic is independent of the parameters application: cond. distr. of y given yhat and RSS is indep. of beta, sigma what is the conditional distribution of y given yhat and RSS? conditioning on yhat (holding yhat fixed, knowing yhat) => y can only vary in residual space (yhat=Py is fixed, r=(I-P)y is free) conditioning on RSS (holding RSS fixed, knowing RSS) => y can only vary on a ball of radius sqrt(RSS) (=|(I-P)y|) on these balls the data density is constant because of spherical symmetry => cond. distr. of y given the suff. statistics = uniform distr. on sphere of radius sqrt(RSS) in residual space note: this distr. is indep. of the true, unknown parameters beta, sigma!!! PRACTICAL USE: simulation of the conditional distribution to generate response vectors y* that are AS LIKELY AS THE OBSERVED RESPONSE vector y IF the linear model is correct SIMULATION: generate residual vectors r* so that y* = yhat + r* start with z ~ N(0,I) in R^n regress the model out: r* <- (I-P)z force the proper length: r* <- r*/norm(z*)*sqrt(RSS) use r* for comparison with observed r: r should look like r* repeat... simulation of residual vectors rr that are as likely as the observed residual vector if the model is correct: norm <- function(x) sqrt(sum(x^2)) r <- resid(boston.lm) # actual residual vector X <- model.matrix(boston.lm) RSS <- deviance(boston.lm) N <- nrow(X) r.suff <- resid(lm(rnorm(N) ~ X)) r.suff <- r.suff/norm(r.suff)*sqrt(RSS) # simulated residual vector windows(wid=10, hei=5) par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,2,1)) qqnorm(r, main="Actual Data", pch=16, cex=.5) qqnorm(r.suff, main="Bootstrap Data", pch=16, cex=.5) . Canned functions for comparison plots based on sufficiency theory: if the data followed the model, their residual plots would look like this site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/" source(paste(site,"residual-comparison.R",sep="")) res.vs.fits(boston.lm) # execute one statement at a time res.vs.qnorm(boston.lm) res.vs.order(boston.lm) . Why are the above ideas "inference"? They are not CIs or tests, or are they? hor <- qqnorm(r, pch=16, cex=.5)$x # set up the plot in proper dimensions for(i in 1:100) { r.null <- resid(lm(rnorm(N) ~ X)) r.null <- r.null / norm(r.null) * norm(r) points(x=sort(hor), y=sort(r.null), pch=16, cex=.3, col="gray") } points(x=sort(hor), y=sort(r), pch=16, cex=.5, col="red") => "null band" for quantile plot assuming the model is correct . Simulation of distributions conditional on a sufficient statistic can be used for any test and model check of the model . Difference between parametric bootstrap and sufficiency-based simulation for residual analysis + parametric bootstrap uses r.boot <- resid(lm(rnorm(N) ~ X)) + sufficiency-based simulation: r.suff <- r.boot / norm(r.boot) * sqrt(RSS) The latter forces the simulated residuals to match the observed RSS. General differences: + parametric bootstrap works more often than sufficiency-based simulation + sufficiency-based simulation is theoretically exact; par. bootstrap is not + par. bootstrap can be used to asses the variability of the estimates b and s by computing b* and s* from bootstrap data y* whereas sufficiency-based simulation holds b and s fixed and only varies r ---------------------------------------------------------------- LECTURE 17: * Thanksgiving: - no class on Wed, Nov 21 - there will be an extensive 2-week homework * Homework: - reminder: Homework 7 posted, due Mon, Nov 12 - Style, exercise 5.2, 3. part; hand in your solution ON PAPER TODAY * Quiz: - what is parametric bootstrap? simulate data as if the estimates were the truth - what is the purpose of parametric bootstrap? model diagnostics - what is the idea behind "null-sufficient simulation"? simulate datasets that are equally likely if the model is correct - could you think of a catchy term for "null-sufficient simulation"? - what is the sufficient statistic for the normal linear model? yhat (or b) and RSS - intuition pump: if Z1,Z2,Z3 are independent, and if Z1 ~ N(mu,sigma^2), Z2 ~ N(0,sigma^2), Z3 ~ N(0,sigma^2) what is the conditional distr. of (Z1,Z2,Z3) given T=(Z1, Z2^2+Z3^2) ? => uniform distribution on the circle centered at (Z1,0,0) and with distance sqrt(Z2^2+Z3^2) from the origin * New Material (1): conditional inference continued - general def. of sufficiency: p(Y|theta) = f(T(Y)|theta) h(Y) => T(Y) is sufficient for p(Y|theta) T(Y) is "minimal sufficient" if T(Y) = S(U(Y)) for any other sufficient U(Y) intuition: T(Y) is sufficient to know the "likelihood ratio" p(Y|theta1)/p(Y|theta2) for any pair of parameters theta1 and theta2; according to the likelihood principle, one should not use anything other than likelihood ratios for inference about the parameter theta - so far we have used null-sufficient sampling for model diagnostics; now we will use them for formal statistical tests. - let beta=(beta0,...,betap) and H0: beta1=0 H0: y = beta0 + beta2*x2 + ... + betap*xp + eps H1: y = beta0 + beta1*x1 + beta2*x2 + ... + betap*xp + eps test statistic: t = b1 / stderr.est(b1) null distribution: the conditional distribution of t given T=(b0,b2,...,bp,RSS) where all quantities are estimated under H0 => simulate residual vectors r from H0 for each r, compute t => simulated null distribution of t reject H0 if the observed t is in either 2.5% tail of the simulated distr. SURPRISE! The null distribution of t is exactly the usual t-distribution - Example: Boston housing data; testing the coefficient of AGE boston.lm <- lm(MEDV ~ ., data=boston) summary(boston.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) AGE 6.922e-04 1.321e-02 0.052 0.958229 AGE is extremely insignificant simulate the conditional null distribution: n <- nrow(boston) Nrep <- 10000 t.null <- rep(NA,Nrep) # storage for simulated t-values, the 'null distr.' H0 <- lm(MEDV ~ . - AGE, data=boston) yhat.H0 <- fitted(H0) RSS.H0 <- deviance(H0) for(irep in 1:Nrep) { r.null <- resid(lm(rnorm(n) ~ . - AGE, data=boston)) r.null <- r.null / norm(r.null) * sqrt(RSS.H0) y.null <- yhat.H0 + r.null t.null[irep] <- summary(lm(y.null ~ ., data=boston))$coeff["AGE","t value"] if(irep%%100==0) cat(irep,"") } # note: the above two calls to lm() are ridiculously expensive plot(x=qt((1:Nrep)/(Nrep+1),df=(H0$df.resid+1)), y=sort(t.null), pch=16, cex=.3, xlim=c(-4,4), ylim=c(-4,4), xlab="t quantiles", ylab="empirical quantiles") abline(a=0, b=1, col="red") # btw, for 494 dfs, the t-distr is indistinguishable from normal modulo simulation error, this is a t-test - and the point is...? the above simulation works for arbitrary test statistics for comparison of two nested linear models H1 > H0, a plain t-test is fine, BUT if H1 is: "AGE should be introduced with some Box-Cox transformation" => test statistic = ... or: "RM should be transformed with some kink function" => test statistic = ... the point is: one can test against many alternatives simultaneously and account for the fact that one picked the most significant one ("simultaneous inference", "multiple comparisons",...) nobody knows the null distribution of max(|t1|,|t2|,|t3|,...) when these t-statistics are arbitrarily correlated * New Material (2): Collinearity . NOT a model problem, but causes interpretation issues . commonality of collinearity and leverage/self-influence: + both are properties of the predictor distribution + neither is a model problem because neither involves y + both cause interpretation problems . difference between collinearity and leverage/self-influence: + leverage/self-influence has to do with "outliers" in predictor space + collinearity has to do with predictors falling near a proper subspace => the interpretation problems they cause are completely different . extreme case: perfect collinearity -- rank deficient design matrix X b = 0 <=> (X^T X) b = 0 <=> b^T (X^T X) b = 0 <=> |X b|^2 = 0 <=> b is eigenvector of (X^T X) with eigenvalue 0 determination of rank of X: eigen decomposition of X^T X rank deficiency: count the (near) zero eigenvalues each eigenvector for a zero eigenvalue is a 'b' with X b = 0 example: boston.lm <- lm(MEDV ~ ., data=boston) X <- model.matrix(boston.lm) eigen(crossprod(X))$values # full rank XX <- cbind(X,x12=X[,1]+X[,2],apply(X,1,sum)) eigen(crossprod(XX))$values # rank deficient look for eigenvalue many orders smallers than the one before (ex: .9, 8.4e-09) . problem: scaling of predictors XX <- cbind(X[,-13]*.000001,X[,13]*1000000) # example: rescale predictors eigen(crossprod(XX))$values # negative e'vals? => eigen() crashed solution: scale the predictors, use 'cor()' instead of 'crossprod()' but remove intercept vector, which is not of interest anyway (centering/adjusting-for-the-mean is implicit in correlations) eigen(cor(X[,-1]))$values XX <- cbind(X[,-1],x12=X[,2]+X[,3],apply(X,1,sum)) eigen(cor(XX))$values round(eigen(cor(XX))$vector[,14:15], 4) . standardizing the predictors: often practiced in social sciences regr. coeff. of x = est. ave. difference in y for a 1 sdev difference in x 'holding' all other predictors constant . collinearity analysis with "smallest principal components": use of smallest eigenvalues/vectors to detect near-collinearity PCA: usually used for DIMENSION REDUCTION => largest eigenvalues/vectors but COLLINEARITY ANALYSIS requires: smallest eigenvalues/vectors idea: assume the columns of X are centered (intercept removed) LS: find b such that |y - X b|^2 = min PCA: find b such that var(X b) ~ |X b|^2 = max/min for collinearity analysis we are interested in var(X b) small but: PCA is ill-posed; b->Inf or b=0 solve trivially => PCA needs a constraint idea: inspired by collinearity analysis, take the best possible case for comparison, which is: the predictors are pairwise ... IF this were the case: var(X b) = |x1|^2 b1^2 + ... + |xp|^2 bp^2 w.l.o.g.: |x1|^2 = ... = |xp|^2 = 1 var(X b) = b1^2 + ... + bp^2 = |b|^2 => var(X b) = max/min subject to |b|^2 = 1 => eigenproblem: var(X b) = b^T cor(X) b = max/min subj. to |b|^2=1 interpretation: eigenvalues = variances (max/min/stationary) var(X b) ~=~ 0 => X b ~=~ 0 for small eigenvalues . canned smallest eigendecomps. of the correlation matrix of X (w/o intercept) site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/" source(paste(site, "collinearity-pca.R", sep="")) X.scaled <- scale(X[,-1]) # standardize the columns, remove intercept collinearity.pca(X.scaled, nlin=3) $variances PC13 PC12 PC11 0.064 0.169 0.186 $coefficients PC13 PC12 PC11 CRIM -0.046 0.087 -0.110 ZN 0.081 -0.071 0.263 INDUS 0.251 -0.113 -0.303 CHAS -0.036 -0.004 0.014 NOX -0.044 0.804 0.111 RM -0.046 0.153 0.053 AGE 0.039 -0.212 -0.459 DIS 0.018 0.391 -0.696 RAD 0.633 -0.107 0.037 TAX -0.720 -0.215 -0.105 PTRATIO -0.023 0.210 0.175 B 0.004 0.042 0.019 LSTAT -0.024 0.055 0.271 interpretation of the smallest eigenvalue: the smallest PC has 6.4% of the variance of a single variable (var(xj)=1). its standard deviation is sqrt(0.064)=0.253, or about a quarter of each xj interpretation of its eigenvector: round the coeffs first INDUS: +1/4 RAD: +2/3 TAX: -3/4 defines a linear combination with small variance, hence: 1/4*INDUS + 2/3*RAD - 3/4*TAX ~=~ 0 equivalently: 3/4*TAX ~=~ 1/4*INDUS + 2/3*RAD equivalently: TAX ~=~ 1/3*INDUS + 8/9*RAD check with a plot: plot(TAX ~ I(1/3*INDUS + 8/9*RAD), data=as.data.frame(X.scaled), pch=16, cex=1) (Recall the formula language can also be used for plotting.) suspicious plot: these are not 506 points... windows(width=6, height=12) par(mfrow=c(2,1), mgp=c(1.8,.5,0), mar=c(3,3,1,1), cex.lab=.8, cex.axis=.8) plot(TAX ~ I(1/3*INDUS + 8/9*RAD), data=as.data.frame(X.scaled), pch=16, cex=.5) jitter the same: n <- nrow(X.scaled) jit <- .1 jit.x <- runif(n,-jit,jit) jit.y <- runif(n,-jit,jit) plot(TAX + jit.y ~ I(1/3*INDUS + 8/9*RAD + jit.x), data=as.data.frame(X.scaled), pch=16, cex=.5) ---------------------------------------------------------------- LECTURE 18: * HW 7: typos in Problems 4 & 5; please, make sure you have the latest... * name for the null-sufficient sampling method for model diagnostics? precedents: bootstrap, ACE, MARS, PRIM, Jackknife, ... * Style, Exercise 5.2.3: please, send by email... "The importance of language skills in children's problem-solving ability was stressed by Jones (1985) in his paper on children's thinking. Improvement in nonverbal problem solving was reported to have occurred as a result of improvements in language skills. The use of previously acquired language habits for problem articulation and activation of knowledge previously learned through langage are thought to be the cause of better performance. Therefore, systematic practice in the verbal formulation of nonlinguistic problems prior to attempts at their solution might be an avenue for exploration in the enhancement of problem solving in general." One submitted homework: "In his paper on children's thinking, Jones (1985) stressed the importance of language skills in children's ability to solve problems. He reported that when children improved their language skills, they would improve their ability to solve nonverbal problems. He thinks that they perform better because they use previously acquired language habits to articulate problems and activate previously learned knowledge. Therefore, we should explore whether children can solve general problems better if they systematically practice verbal formulation of nonlinguistic problems before they attempt to solve these problems." * Quiz: - How do you find a sufficient statistic? - Properties of the conditional distribution of the data given a sufficient statistic? - In the linear model, is the distribution of the RSS dependent on the parameters? - In the linear model, what is the conditional distribution of |y|^2 given the sufficient statistics? - Compare the following design matrices and comment on their properties and interpretation of coefficients: |1 0| |1 0| |1 1 0| |1 0| |1 0| |1 1 0| X = |0 1|, |1 1|, |1 0 1| |0 1| |1 1| |1 0 1| - In the third matrix, what is x0 adjusted for x1 and x2? zero - What is a general purpose method for detecting exact collinearity? - What is a general purpose method for detecting near collinearity? * New Material (1): Collinearity (contd.) - effects of collinearity on inference for individual coefficients: . if x1 is very nearly linearly dependent on x0,x2,..., then stderr(b1) is large . why? stderr(b1) = sigma/|x1.adj| . as a consequence, b1 tends to be insignificant - effects of collinearity on inference for the coeff vector: . the (stderr-)variance of b is V[b] = sigma^2 (X^T X)^{-1} . if the eigendecomposition of (X^T X) is: (X^T X) = lambda1 P1 + lambda2 P2 +... (p+1 terms, Pj = uj uj^T) then the eigendecomposition of (X^T X)^{-1} is: (X^T X)^{-1} = 1/lambda1 P1 + 1/lambda2 P2 +... sideline: if S is a symmetric matrix with eigendecomp S = sum lambdaj Pj and if f(lambda) is a numeric function, then f(S) = sum f(lambdaj) Pj consequence: linear combinations uj^T b = with small lambdaj are statistically ill-determined (dataset-to-dataset) V[uj^T b] = uj^T V[b] uj = 1/lambdaj = huge if lambdaj = small - material effects of collinearity: Simpson''s paradox . practical example: "A marketing project identified a list of affluent customers for its new PDA. Should it focus on the younger or older members of this list?" "To answer this question, the marketing firm obtained a sample of 75 consumers and asked them to rate their 'likelihood of purchase' on a scale of 1 to10." data: pda <- read.table(paste(site,"pda.dat",sep=""), header=T, sep=",") names(pda) plot(pda) summary(lm(Rating ~ Age, data=pda)) => for increasing Age we see on average increasing Rating but: summary(lm(Rating ~ Income + Age, data=pda)) => adjusted for Income, we collinearity: cor(pda[,c("Age","Income")]) * New Material (2): Model selection - you know how to play with models: add, drop predictors . stepwise forward: ... summary(lm(MEDV ~ RM, data=boston)) summary(lm(MEDV ~ RM + LSTAT, data=boston)) a clumsy function to 'help': add1(lm(MEDV ~ 1, data=boston), scope= ~ RM + LSTAT + NOX) . stepwise backward: ... summary(lm(MEDV ~ ., data=boston)) summary(lm(MEDV ~ . - AGE, data=boston)) summary(lm(MEDV ~ . - AGE - INDUS, data=boston)) another clumsy function to 'help': drop1(lm(MEDV ~ ., data=boston)) - all-subsets regression: . software: install the follwoing package once: install.packages(pkgs="leaps", repos="http://cran.us.r-project.org/") library(leaps) # needs to be called in every R session . 'regsubsets()' finds the best subset of predictors for every size. optional arguments: nbest=2 will find the two best for each size nvmax=13 will find them for all sizes up to 13 force.in=c(6,13) will force variables 6 and 13 in the model (RM, LSTAT) example: summary(regsubsets(log(MEDV) ~., data=boston)) summary(regsubsets(log(MEDV) ~., data=boston, nbest=3, nvmax=13, force.in=c(6,13))) a graphical way of looking at these models: windows(width=6, height=12) plot(regsubsets(MEDV ~., data=boston, nbest=2, nvmax=13), scale="adjr2") - problem: how high should we go with picking models? a helpful trick/idea: ADD A FEW RANDOM NUMBER PREDICTORS TO THE MODEL !!!!!!! and check when the regressions get trapped in them: nnoise <- 10 # the number of random predictors we add boston.ext.dat <- # the data frame of actual and random data data.frame(boston, as.data.frame(matrix(rnorm(506*nnoise), ncol=nnoise))) summary(lm(log(MEDV) ~ ., data=boston.ext.dat)) # model with 'junk' Fortunately most random predictors are recognized as worthless, but occasionally there is one that sticks out by chance. Now we do all-subset search and check graphically: boston.ext.sub <- regsubsets(nvmax=20, nbest=2, log(MEDV) ~. - RM + I(pmax(RM-6.5,0)) - LSTAT + log(LSTAT), data=boston.ext.dat) plot(boston.ext.sub, scale="adjr2") => models of size up to 9 would be ok by this method questions: . how does the method depend on the number of included junk predictors? . is this a good method for interpretation or prediction or both or neither? . theory??? ---------------------------------------------------------------- LECTURE 19: * HW 7 was due today; solutions are posted * Style: - ignore writing recommendations from liberal arts; you are not writing novels but non-fiction; your concerns should not be aesthetics but readers'' needs - repetition and parallelism helps the reader - use characters where possible, but more importantly: maintain the basic topic for the duration of a paragraph. * Quiz: - stderr.est(bj) = s / |xj.adj| - collinearity of xj with other predictors causes |xj.adj| to be small hence stderr.est to be large - what would be a good definition of "variance inflation factor"? VIFj = [ actual stderr.est(bj) / best possible stderr.est(bj) ]^2 = [ |xj| / |xj.adj| ]^2 = 1/(1-R2.j) - interpretation of the VIF in terms of CI-width? sqrt(VIFj) shows by how much the CI width has been inflated - V[b] = sigma^2 (X^T X)^{-1} - what is the relevance of an eigendecomposition of V[b] for collinearity? the eigendecomposition identifies linear combinations of the regression coefficients that are ill determined: V[] = large - variable selection: forward stepwise: add predictors as long as y backward stepwise: remove ... mixed stepwise: add/remove ... => same final model? NO * Recap and New Material: variable selection - adjusted R square . R2 is biased: E[R2] > 'true R2' (because of LS optimization) . attempt at de-biasing: R2.true = 1 - sigma^2 / sigma.y^2 R2 = 1 - RSS/TSS = 1 - (RSS/(n-1))/(TSS/(n-1)) RSS/(n-1) is downward biased as an estimate of sigma^2 TSS/(n-1) is unbiased as an estimate of sigma.y^2 => estimate R2.true by R2.adj = 1 - s^2/s.y^2 < R2 - all-subset regression: library(leaps) # needs to be called in every R session summary(regsubsets(log(MEDV) ~., data=boston, nbest=3, nvmax=13, force.in=c(6,13))) windows(width=6, height=12) plot(regsubsets(MEDV ~., data=boston, nbest=2, nvmax=13), scale="adjr2") - trick/method/inference: adding noise predictors nnoise <- 13 # the number of random predictors we add boston.ext.dat <- # the data frame of actual and random data data.frame(boston, as.data.frame(matrix(rnorm(506*nnoise), ncol=nnoise))) summary(lm(log(MEDV) ~ ., data=boston.ext.dat)) # model with 'junk' boston.ext.sub <- regsubsets(nvmax=20, nbest=1, log(MEDV) ~., data=boston.ext.dat) plot(boston.ext.sub, scale="adjr2") (animate the above!) - problems with variable selection and significance? significance at 5% is no longer valid after running 1000 tests... 1000 predictors => end up with a model with 50 predictors even if there is no relation between y and x !!!!!!!!!!!! end of linear models !!!!!!!!!!!!!!!! ---------------------------------------------------------------- LECTURE 20: * Homework 8 will be posted tomorrow Thu, due on Friday, Nov 29, 5pm (there will be one more homework thereafter) * Style: study lesson 6 for Monday, Nov 19 * Recap and new material: Permutation Tests - idea: if the null hypothesis is independence of two variables X and Y, break the association between X and Y with random permutations - data example: places rated data (1986), 329 metropolitan areas of the US places <- read.table(paste(site,"places.dat",sep=""), header=T) dim(places) names(places) rownames(places) - visual permutation test to assess the relationship between two variables: xlab <- "Climate.Terrain"; ylab <- "Housing" xlab <- "Education"; ylab <- "Crime" x <- places[,xlab]; y <- places[,ylab] par(mfrow=c(4,4), mar=rep(.5,4), xaxt="n", yaxt="n") i.pos <- sample(2:15)[1] # random position of the actual for(i in 1:(i.pos-1)) plot(x=x, y=sample(y), pch=16, cex=.5) plot(x=x, y=y, , pch=16, cex=.5) for(i in (i.pos+1):16) plot(x=x, y=sample(y), pch=16, cex=.5) which plot shows the actual data? 15 of the 16 plots show draws from a 'random association model' - numerical permutation test: p-value computation with simulation . test statistic? anything; use the correlation, for example test.stat <- function(x,y) cor(x,y) test.stat <- function(x,y) { sum(x>median(x) & y>median(y)) } T.actual <- test.stat(x,y) T.actual . simulation: sampling random permutations; the more samples, the more precision nsim <- 9999 # fence post problem: 9999 make 10000 equally likely spaces T.null <- rep(NA,nsim) # storage of "null values" for(i in 1:nsim) { if(i%%2000==0) print(i) # print a message every 1000 simulations T.null[i] <- test.stat(x, sample(y)) } . statistical significance: how extreme is 'T.actual' among 'T.null'? judge visually: windows() hist(T.null, breaks=40, col="gray", xlim=range(c(T.null,T.actual))) lines(rep(T.actual,2), c(0,1000), lwd=2, col="red") . non-rejection ('acceptance') region: quantile(T.null, c(.025, .975)) 2.5% 97.5% -0.1093858 0.1091154 # note symmetry quantile(T.null, c(.005, .995)) 0.5% 99.5% -0.1446682 0.1411944 # note symmetry this far out in the tail T.actual # not in the non-rejection region by a long shot . approximate normality of the simulated null distribution: qqnorm(T.null, pch=".") T.null.mean <- mean(T.null); T.null.sd <- sd(T.null) (where do you expect 'T.null.mean' to fall?) abline(a=T.null.mean, b=T.null.sd, col="red") could therefore use the fitted normal distribution for non-rejection regions: 2-sided 5% level: T.null.mean+c(-1,1)*qnorm(.975)*T.null.sd same thing, using the normal quantile function: qnorm(c(.025,.975), m=T.null.mean, s=T.null.sd) 2-sided 1% level: qnorm(c(.005,.995), m=T.null.mean, s=T.null.sd) somewhere there will exist an article that derives the asymptotic distribution; such 'research' is no longer very viable because simulation is now so easy; still, it would be neat to see an asymptotic standard error formula. . p-value: fence post consideration T.actual has equal probability 1/(nsim+1) to fall into any spacing of T.null (under what probability assumptions?) hence the simulation approximation to the p-value is: (sum(abs(T.null) >= abs(T.actual)) + 1) / (nsim+1) * 100 (what are its extreme values? ...) - theory behind permutation tests . H0: y1,...,yn i.i.d. F, where F is unknown, also: indep. of x1,...xn, sufficient statistic: S(y) = (y(1),...,y(n)), order statistic conditional distribution of y given S(y): ... . brain teaser: (x1,y1),...,(xn,yn) i.i.d. => y1,...,yn i.i.d. but this does not imply xi is independent of yi. - non-parametric nature of permutation tests: no assumptions made about the marginal distribution F(y) !!!! - special cases: . x: predictor vector; y: quantitative => multiple regression a permutation test of H0: y1,...,yn i.i.d. of x1,...,xn could be used as a non-parametric substitute for the overall F-test natural test statistic: overall-F or RSS (equivalent) . x: binary (categorical) variable; y: quantitative response => 2-sample test natural test statistic: 2-sample t-stat or difference of medians,... . x: categorical variable; y: quantitative response => 1-way ANOVA natural test statistic: overall-F or Kruskal-Wallis or any pair-difference . x: binary; y: binary => Fisher''s exact test (no need to simulate; exact null distr. known) natural test statistic: anything (any cell count, they are all equiv.) . x: categorical; y: categorical => chi-square test natural test statistics: '' or any cell count . x: predictor vector; y binary => logistic regression or binary classification (boosting, bagging, SVMs,...) natural test statistics: negative log-likelihood, misclassification rate,... . x: predictor vector; y categorical => 'polytomous regression' or multi-class classification ('') natural test statistics: '' . x: predictor vector; y response vector => canonical correlation natural test statistics: largest canonical correlation, sum of k largest cancorrs . x: time; y time series (usually quantitative) => test for generalized 'white noise'; cumsum = generalized 'random walk' natural test statistics: any time series model statistic (AR, ARMA, ARCH,...) . x: experimental design (controlled treatment levels); y quantitative response => again a non-parametric substitute for overall F-test natural test statistic: overall-F stat or R2 - drawback of permutation tests: no conversion to CIs? yes and now . recall duality between CIs with 95% coverage and tests at 5% significance CI = { parameter values that can not be rejected if used as H0 } . in some cases the conversion is possible example: 2-sample test for location can be converted H0: y11, ..., y1m, y21-mu, ..., y2n-mu i.i.d. F => permutation test applied to y11,...,y1m,y21+mu,...,y2n+mu can be efficiently excuted because location test statistics satisfy T(y+mu) = T(y) + mu (shift-equivariance) . more generally: simple linear regression H0: yi - beta1*xi i.i.d. (where the marginal distr. F is arbitrary) (reason why one can drop the intercept: ...) CI = { beta1 | H0: 'yi-beta1*xi i.i.d.' is not rejected } - other advantages/disadvantages of permutation tests: . intuitive nature . can be defended even for non-sampling data (such as the Places Rated data) . freedom of choice of test statistic: use any measure of dependence . non-parametric nature: marginal distribution F is arbitrary . limitation: good only for H0 that express independence between xi and yi or in which xi are fixed and yi are i.i.d. - Literature: . E. S. Edgington (1980) "Randomization Tests" (Dekker, Statistics: Textbooks and monographs, vol. 31) . Ph. I. Good (1994) "Permutation Tests : A Practical Guide to Resampling Methods for Testing Hypotheses" (Springer Series in Statistics) . E. Lehman (1986 or later, good for theory: sec. 4.3, 5.10ff) "Testing Statistical Hypotheses" (Wiley) ---------------------------------------------------------------- LECTURE 21: * Style: Lesson 6 today -- next time * Homework 8 posted, due: Mon, Dec 3, before class * Quiz: - to what type of null hypothesis do permutation tests apply? ... or ... - list some advantages of permutation tests compared to conventional tests: ... - list some drawbacks of permutation tests: ... - list all the examples you can think of where permutation tests apply: ... - what is the theory underlying permutation tests? ... * New Material: Statistical Functionals / Numerical Summaries - given a series of univariate observations y1,...,yn, such as n <- 100 y <- rnorm(n); x <- rnorm(n) # (make up some numbers) idea: for y1,...,yn i.i.d. F, consider numeric characteristic that have limits as n -> Inf T(y1,...,yn) -> T(F) example: T = mean(y1,...,yn) -> E[Y] = integral y F(dy) = T(F) as n -> Inf reason: LLN empirical vs theoretical distribution: y1,...,yn i.i.d. F theoretical distribution => Fn = sum delta_yi / n empirical distribution write: T(Fn) := T(y1,...,yn) T(Fn) -> T(F) as n -> Inf - examples functionals: mean(y) -> E[Y] median(y) -> F^{-1}(1/2) # if unique for F mean(y, trim=0.1) -> # 10% trimmed mean quantile(y, c(0.1, 0.25, 0.5, 0.75, 0.9)) -> # list of quantiles IQR(y) # interquartile range quantile(y, 0.75) - quantile(y, 0.25) # same mad(y) # MAD = "Median Absolute Deviation" # scaled to estimate sdev at the normal distr. median(abs(y-median(y)))* 1.4826 # = MAD central versus non-central moments skewness = E[(Y-m)^3]/sigma^3 kurtosis = E[(Y-m)^4]/sigma^4 summary(lm(y ~ x))$r.squared # if (xi,yi) are i.i.d. samples of (X,Y) summary(lm(y ~ x))$coeff[2] cor(x,y) - counter examples: max(y) mean(y, trim=10/n) y[5] surprisingly these are not strictly statistical functionals: sd(y) # std deviation source(paste(site,"trimmed-corr-cov.R",sep="")) sdev(y, trim=0.1) # 10% trimmed sdev (from file "trimmed-corr-cov.R") var(y) # variance, square of a dispersion trvar(y, trim=0.1) # trimmed variance (from "trimmed-corr-cov.R") summary(lm(y ~ x))$adj.r.squared reason: they do not depend on the data ... only the biased versions of 'var()' and 'sd()' are statistical functionals: var(y)*(length(y)-1)/length(y) sd(y)*sqrt((length(y)-1)/length(y)) - some defining properties: . location: T(y+c) = T(y) + c T(c*y) = c T(y) . dispersion: S(y+c) = S(y) S(c*y) = |c| * S(y) . homogeneity of order r: H(c*y) = c^r H(y) . location and scale invariant: I(y+c) = I(y) I(c*y) = I(y) - asymptotic normality of most statistical functionals: . definition: ( T(Fn) - T(F) ) / sqrt(n) ~->~ N(0,AV) . asymptotic variance: AV note: AV = AV(F), hence not known, except for a few theoretical distributions and functionals . example: T = mean/expectation AV(F) = sigma^2(Y) . for test statistics of interest the literature offers the explicit AV so one can form asymptotic CIs: estimate AV(F) with AV(Fn) CI: T(Fn) +- 2*AF(Fn)*sqrt(n) - asymptotic normality and asymptotic variance illustrated for the MAD: . 100 MADs for each of . sample sizes n=10,100,...,100000 in each of . scenario: cauchy data nsim <- 100; nlog10 <- 1:6 mads <- matrix(NA, nrow=nsim, ncol=length(nlog10)) for(j in nlog10) for(i in 1:nsim) { # took ~75 sec in 2007/11 mads[i,j] <- mad(rcauchy(10^j)); if(i==1) print(j) } . see the distribution of MADs shrink with increasing sample size: windows(width=3.5, height=14) par(mfrow=c(length(nlog10),1), mar=c(2,2,2,1), mgp=c(1.5,.5,0), cex.axis=.8) for(j in nlog10) hist(mads[,j], xlim=c(0.2,5), col="gray", breaks=15, cex.main=.8, xlab="", main=paste("n =",10^j), ylab="") . see how the distribution of MADs become normally distributed quickly: par(mfrow=c(length(nlog10),1), mar=c(0,3,2,3), mgp=c(1.5,.5,0), cex.axis=.8) for(j in nlog10) qqnorm(mads[,j], pch=18, cex=.5, xlab="", ylab="", xaxt="n", yaxt="n", main=paste("n =",10^j), cex.main=.8) . at what rate is the standard deviation shrinking? plot them against sample size on log-log scale mads.sd <- apply(mads,2,sd) windows() par(mgp=c(1.5,.5,0), mar=c(3,3,1,1), cex.axis=.8, cex.lab=.8) plot(x=log10(10^nlog10), y=log10(mads.sd), xlab="log10(n)", ylab="log10(sdev(MAD))", pch=16) abline(a=.4, b=-.5) # i'cept=.4, slope=-0.5 => sd(mad) ~=~ ... * n^{...} AV(MAD|Cauchy) ~=~ 2.5 . Q: how do we estimate the AV for actual data? A: Bootstrap !!! * New Material (2): Bootstrap -- a method for estimating the AV - General concept: get a sense of . the variability in data and . the dataset-to-dataset variability of statistics computed from data - Efron''s 'trivial idea': . mother nature presents us with data y1,...,yn i.i.d. F . we do not know F, but we can estimate it call the estimate Fhat . act as if you were mother nature and Fhat were the truth that is, sample fake data from Fhat but do this repeatedly: (y1*1,...,yn*1), (y1*2,...,yn*2), ... and compute statistics you computed from y1,...,yn from y1*,...,yn* as well: bootstrap distribution of T = { T(y1*1,...,yn*1), T(y1*2,...,yn*2), ... } . uses of these values: + their SD is used as an estimate of SE of T + the bootstrap SD is used to form bootstrap CIs: CI = T(Fn) +- 2*SE - two ways of estimating Fhat: . parametric bootstrap: assumes a parametric model, e.g., N(mu,sigma^2) estimate the parameters, e.g., m <- mean(y); s <- sd(y) sample y1*,...,yn* i.i.d. from N(m,s^2) . nonparametric bootstrap: use Fhat = Fn, the empirical distribution sample y1*,...,yn* i.i.d. from Fn, that is, sample from (y1,...,yn) WITH REPLACEMENT - comparison of the parametric and nonparametric bootstrap: . parametric bootstrap may not give you the proper idea of dataset-to-dataset variability if the fitted model overfits . nonparametric bootstrap is vastly more popular when it applies (i.i.d.) but the datasets it creates have dups and do not use some cases P[case 'i' not used in n draws] = (1-1/n)^n -> exp(-1) => approx fraction of cases used is (1-exp(-1)) ~=~ 0.6321206 or just under 2/3 (After many bootstrap samples, one will have used all data, though.) - example of non-parametric bootstrap standard errors: linear regression site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/" bodyfat <- read.table(paste(site,"bodyfat.dat",sep="")) names(bodyfat) <- scan(paste(site,"bodyfat.col",sep=""), w="") dimnames(bodyfat) regress 'PercBF' on all predictors: bf.lm <- lm(PercBF ~ . - Density, data=bodyfat) bf.lm.smry <- summary(bf.lm) LS coefficients: bf.b <- bf.lm.smry$coeff[,"Estimate"] standard errors computed the usual way: bf.b.se <- bf.lm.smry$coeff[,"Std. Error"] bootstrap simulation: ------------------------------------- n <- nrow(bodyfat) nboot <- 1000 bf.b.bs <- matrix(NA, ncol=length(bf.b), nrow=nboot) colnames(bf.b.bs) <- names(bf.b.se) for(iboot in 1:nboot) { sel <- sample(1:n, replace=T) # check dups: sort(table(sel)) bf.b.bs[iboot,] <- coef(lm(PercBF ~ . - Density, data=bodyfat[sel,])) if(iboot%%200==0) print(iboot) } bootstrap standard errors: -------------------------------------- bf.b.bs.se <- apply(bf.b.bs, 2, sd) compare bootstrap with usual standard errors: round( rbind(boot=bf.b.bs.se, classic=bf.b.se), 3) ratio comparison: round( bf.b.bs.se / bf.b.se, 3) mostly a non-event!!! bootstrap SEs are very close to the usual SEs except for 'Height': classic=.096, bootstrap=0.140, ratio=1.45 - bootstrap CIs from bootstrap SEs: round( rbind( bf.b - 2*bf.b.bs.se, bf.b + 2*bf.b.bs.se), 3) - we applied non-parametric bootstrap to regression thus capturing the full variability of (xi,yi) sampling, whereas 'usual' inference conditions on the xi-values and considers variation in yi given xi only; why then are the results so similar? ... ('ancillarity' of the predictors) ---------------------------------------------------------------- LECTURE 22: * Reminder: Homework 8 due Mon, Dec 3, before class * Will return to 'Style' on Wed * Quiz: - what types of null hypotheses can be tested with permutation tests? - what is the theory underlying permutation tests? - a statistical functional is a function on distributions: T(F) how does one obtain a plug-in estimate of T(F)? - what continuity property does one require for T(F)? - if x1,x2,...,xn is an i.i.d. sample from F, does (x1+xn)/2 lead to a statistical functional? does mean(ximedian(X)) - mean(X +b - c.hi (*) -b + c.lo < - beta +b - c.lo > beta (**) >>> invert (*) and (**) >>> P[ b - c.hi < beta < b + c.lo | beta ] = 0.95 exact CI = (b - c.hi, b + c.lo) where c.hi = qb.hi - beta c.lo = beta - qb.lo . bootstrap plug-in estimation of c.hi and c.lo (replace +-2*SE): beta <-- b qb.hi <-- qb*.hi (97.5% quantile of bootstrap values b*) qb.lo <-- qb*.lo ( 2.5% quantile of bootstrap values b*) c.hi <-- qb*.hi - b c.lo <-- b - qb*.lo . bias-correcting BS quantile CI: (b - c.hi, b + c.lo) = (2b - qb*.hi, 2b - qb*.lo) compare with the naive BS quantile CI: (qb*.lo, qb*.hi) and with the classical interval: (b - 2SE, b + 2SE) options(width=120) # to print long lines round(cbind( qdbias.l = 2*bf.b - apply(bf.b.bs, 2, quantile, .975), qnaive.l = apply(bf.b.bs, 2, quantile, .025), normal.l = bf.b + qt(.025,df=238)*bf.b.bs.se, clssic.l = bf.b + qt(.025,df=238)*bf.b.se, qdbias.h = 2*bf.b - apply(bf.b.bs, 2, quantile, .025), qnaive.h = apply(bf.b.bs, 2, quantile, .975), normal.h = bf.b + qt(.975,df=238)*bf.b.bs.se, clssic.h = bf.b + qt(.975,df=238)*bf.b.se), 3) ^^^^^^^^^^^^^^^ exact, instead of 2 ---------------------------------------------------------------- LECTURE 23: * Reminder: Homework 8 due Mon, Dec 3, before class * Style: 'topic' and 'stress' location in sentences - establish the topic/actor in the beginning of the sentence - put the stress/news at the end of the sentence - tactics: . trim the end . shift peripheral ideas to the left . shift new information to the right, if necessary by using + passives ('BEES make honey' --> 'honey is made by BEES' + 'there' ('honey is on the table' --> 'there is honey on the table') + 'what' ('we need rain' --> 'what we need is rain') + 'it' ('it seems inevitable that we lose') + 'not only ... but ...' (<-- '... and ...') - a heavy exercise from an article on geology, separated into sentences: Large earthquakes along a given fault segment do not occur at random intervals because it takes time to accumulate the strain energy for the rupture. The rates at which tectonic plates move and accumulate strain at their boundaries are approximately uniform. Therefore, in first approximation, one may expect that large ruptures of the same fault segment will occur at approximately constant time intervals. If subsequent main shocks have different amounts of slip across the fault, then the recurrence time may vary, and the basic idea of periodic mainshocks must be modified. For great plate boundary ruptures the length and slip often vary by a factor of 2. Along the southern segment of the San Andreas fault the recurrence interval is 145 years with variations of several decades. The smaller the standard deviation of the average recurrence interval, the more specific could be the long term prediction of a future mainshock. * Instead of a quiz: reminder of BS - idea: nature presents i.i.d. samples x1,...,xn from F we mimic nature by repeatedly sampling i.i.d. x1*,...,xn* from Fhat => the BS distribution of T(x1*,...,xn*) gives us an idea of the sampling distribution of T(x1,...,xn) - BS SE: if T*1,...,T*R are BS values of T from R BS simulations, then SE.est = SD(T*1,...,T*R) - three types of BS CIs: . normal assumption: T +- 2*SE.est . naive quantile method: quantile(c(T*1,...,T*R), c(.025,.975)) . de-biased quantile method: 2*T - quantile(c(T*1,...,T*R), c(.025,.975)) the naive quantile method assumes neglibible bias and symmetry the de-biased quantile method allows bias and asymmetry note, however, that quantiles are less stably estimated than SEs !!! * New Material: - when is the bootstrap really indicated? . when there is no classical theory . nonlinear/nonparametric models... - problems with non-parametric bootstrap for regression: . predictors that are dummies of small groups can cause problems why? . leverage points can cause wild BS behavior: x <- c(100, rnorm(99)) y <- c(100, rnorm(99)) plot(x, y, pch=16, cex=.5) classical inference is driven by the leverage point: summary(lm(y ~ x)) compare with the unconditional inference with bootstrap: bs.b <- rep(NA,100) for(i in 1:100) { sel <- sample(1:100, replace=T) bs.b[i] <- coef(lm(y[sel] ~ x[sel]))[2] } qqnorm(bs.b) # normal? quantile(bs.b, c(.025,.975)) # naive BS quantile interval why? - further topics: . better BS CIs with better coverage probabilities example: bootstrap not the raw parameter estimates but studentized versions thereof . BS tests: estimate 'Fhat' under a null hypothesis and resample from it => the bootstrap intervals produce 'acceptance regions' . 'bagging' (Breiman, 1990s): when 'T(Fn)' has erratic behavior, replace it with the BS average 'ave[T(Fn*)]' (recall that 'ave[T(Fn*)]-T(Fn)' is the BS bias estimator, hence one replaces the raw estimator with a biased estimator) . BS for time series: block-BS . BS for experimental data with designed predictors ('experimental design'): bootstrap residuals? similar problem as permutation tests for residuals if the fitted model overfits, the resampled or permuted residuals are too small - Literature: . Efron and Tibshirani An Introduction to the Bootstrap Chapman & Hall (1993) . Davison and Hinkley Bootstrap Methods and their Applications Cambridge University Press (1997) . Politis, Romano, Wolf Subsampling Springer Series in Statistics the last is theoretical and emphasizes subsampling over bootstrap sampling (search "bootstrap", "resampling", "subsampling") * New Topic: Simultaneous inference, and another use of bootstrap - the CI guarantees are one at a time: P[ betaj in CI(bj) ] = 0.95 for j=0,1,2,...,p - what about simultaneous guarantees? P[ betaj in CI(bj) for all for j=0,1,2,...,p ] = ??? - Bonferroni lower bound for CIs: P[ betaj in CI(bj) for all for j=0,1,2,...,p ] >= 1 - 0.05*(p+1) => loose bound, P[...] >= 1-0.05*14 = 0.3 - Bonferroni adjustment for CIs: construct individual CIs with P[ betaj in CI(bj) ] = 1 - 0.05/(p+1) for j=0,1,2,...,p => too wide intervals due to the looseness of the bound - Can we construct better bounds than Bonferroni? YES -- with simulation - Use bootstrap to estimate actual simultaneous coverage: consider classical CIs CI.clssic.l <- bf.b + qt(.025,df=238)*bf.b.se CI.clssic.h <- bf.b + qt(.975,df=238)*bf.b.se how many resampled 'b' vectors are in the 14-D cube bounded by CI.clssic.l and CI.clssic.h? the following 2-liner computes the fraction of bootstrap coefficient vectors that fall completely inside the 2SE box: sum(apply(bf.b.bs, 1, function(x) all(CI.clssic.l <= x & x <= CI.clssic.h))) / nrow(bf.b.bs) [1] 0.4743 this estimate is valid assuming negligible bias and symmetry of the sampling (dataset-to-dataset) distribution; the estimate is less pessimistic than the Bonferroni lower bound 0.30 - Idea: use a larger multiplier than 'qt(.975,df=238)', i.e., enlarge the 14-D box in equal proportions on all sides till it contains, say, 90% of all bootstrap coefficient vectors. In the following computation we try factors 2.0, 2.1, ... 3.9, 4.0 for the classical SE and compute simultaneous coverage for each: for(s in seq(2,4,by=.1)) cat(s, # vvvvvvvvvvvvvvvvvvvvvvvv studentized bootstrap coeffs sum(apply((t(bf.b.bs)-bf.b)/bf.b.se, 2, function(x) all(-s <= x & x <= s))) / nrow(bf.b.bs), # ^^^ returns TRUE if ALL inputs are TRUE "\n") => use a factor 3.8 for 95% simultaneous coverage, and a factor 3.3 for 90% simultaneous coverage (instead of a factor 2) In the following computation we do the same but for the BS SE: for(s in seq(2,4,by=.1)) cat(s, # vv-- note the BS SEs sum(apply((t(bf.b.bs)-bf.b)/bf.b.bs.se, 2, function(x) all(-s <= x & x <= s))) / nrow(bf.b.bs), "\n") => use a factor 3.1 for 95% simultaneous coverage, and a factor 2.8 for 90% simultaneous coverage the BS SEs are wider and better attuned to the BS distributions than the classical SEs, hence require smaller factors here are the BS normal CIs with factor 3.1: cbind(lo=bf.b - 3.1*bf.b.bs.se, hi=bf.b + 3.1*bf.b.bs.se) inverting these CIs we can say: no set of coefficients that fall simultaneously between these bounds can be rejected by a test at the 5% significance level => the only coefficient for which 0 is excluded is 'Abdomen' ---------------------------------------------------------------- LECTURE 24: * Reminder: Homework 8 was due today before class (Apologies for Homework 7, will be graded shortly) * Recap: recent topics - permutation tests - bootstrap - simultaneous inference for CIs (also exists for testing: widen 'acceptance intervals') * New Topic: curve estimation / smoothing / bandwidth selection / cross-validation - idea: given x-y data (quantitative-quantitative) . drop assumption of linearity: y = (a + bx) + eps . assume smoothness only: y = f(x) + eps, f() smooth . smooth => locally linear f(x+h) ~=~ f(x) + Df(x)*h + O(h^2) . if we wish to infer f() from data (xi,yi) (i=1,...,n), we have to infer an estimate f(x) from (xi,yi) with xi near x . first try: fhat(x) = ave(yi; |xi-x| fhat(x)=NA smooth is really not smooth . variations: running median and quantiles for 50%, 90% prediction bands! plot(xy, pch=16, cex=.5) lines(smooth1(xy, func=median), col="black", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.25)), col="gray50", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.75)), col="gray50", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.05)), col="gray80", lwd=3) lines(smooth1(xy, func=function(y) quantile(y,.95)), col="gray80", lwd=3) . 'real' smoothness: kernel smoothing smooth <- function(xy, bw=1/5, nloc=51) { x <- xy[,1]; y <- xy[,2] xmin <- min(x); xmax <- max(x) xs <- seq(xmin, xmax, length=nloc) fx <- rep(NA, nloc) act.bw <- diff(range(x))*bw # actual bandwidth for(i in 1:nloc) { w <- dnorm(x=(x-xs[i])/act.bw) # same as dnorm(x, m=xs[i], s=act.bw) w <- w/sum(w) # so we have: sum(w)==1 fx[i] <- sum(y*w) } return(cbind(x=xs, y=fx)) } . apply to same with 4 bandwidths: bws <- c(2, .2, .03, .005) cols <- c("red","orange","green","blue") plot(xy, pch=16, cex=.5) for(i in 4:1) lines(smooth(xy, bw=bws[i]), col=cols[i], lwd=3) Q: what would be a principled way to select the bandwidth? (stay tuned...) . assess the variability of a fixed-bandwidth smoother: bootstrap n <- nrow(xy) par(mfrow=c(2,2), mgp=c(1.5,.5,0), mar=c(3,3,1,1)) for(i in 4:1) { plot(xy, pch=16, cex=.5) for(iboot in 1:200) lines(smooth(xy[sample(n,repl=T),], bw=bws[i]), col=cols[i], lwd=1) } . lesson: BIAS-VARIANCE trade-off !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! really: bias-SE trade-off (variance is dataset-to-dataset) small bandwidth large bandwidth --> --> large variance (-) small variance (+) small bias (+) large bias (-) . mean squared error and bias-variance decomposition: model: yi = f(xi) + epsi E[epsi] = 0 BIAS: Bias(fhat(x)) = E[fhat(x)] - f(x) VARIANCE: V(fhat(x)) = E[ (fhat(x) - E[fhat(x)])^2 ] MSE: MSE(fhat(x)) = E[ (fhat(x) - f(x))^2 ] bias-variance decomposition of the MSE: -------------------------------