================================================================ LECTURE 13: * RECAP: - A little more details for last time''s recap: "What is the distribution of ..." . Q: If y ~ N(mu*e, sigma^2 I), where e = (1,1,...,1)^T, why is the following a t-distributed random variable? sum_i (yi-mu)/sqrt(N) -------------------------------------- sqrt( sum_i (yi - mean(yi))^2 / (N-1)) A: 1) Numerator = =: Z, where a = (1,1,...,1)^T/sqrt(N) Note |a|^2 = 1, hence Var[Z] = sigma^2 Also E[Z] = 0 2) Denominator = sqrt( |y - P0 y|^2 / (N-1) ) Note |y - P0 y|^2 = |(I-P0) y|^2 = |Q Q^T y|^2 = |Q^T y|^2 where Q is of size Nx(N-1) forming an o.n. basis of the space _|_ 'a'; hence: = ^2 + ^2 + ... + ^2 = Z1^2 + Z2^2 + ... + Z{N-1}^2 where the Zi are iid N(0,sigma^2) 3) Numerator and denominator Z''s are independent: 'a' is orthogonal to 'q1',...,'q{N-1}' by construction, hence Z is uncorrelated from Z1,...,Z{N-1}, hence Z is independent from Z1,...,Z{N-1}. - BIG DEAL to remember: . If random vectors 'y1' and 'y2' of any dimension are independent, so are f(y1) and g(y2). . The same is NOT true if 'y1' and 'y2' are only uncorrelated: f(y1) and g(y2) are generally correlated. . Drastic example: If (y1,y2) are uniformly distributed on a circle with r=1 around 0, as in this R code: ang <- runif(1000000,0,2*pi); y1 <- cos(ang); y2 <- sin(ang) then cor(y1,y2) = ... but cor(y1^2,y2^2) = ... - CIs and PIs for the response: . A CI gives inference for mu(xx) = E[y(xx)] = . . A PI gives a prediction for future values of observations y(xx) at xx. . Derivation of 95% CI: center = yhat(xx) = width = 2* stderr.est(yhat(xx)) Var[y(xx)] = Var[xx^T b] = ... . Derivation of 95% PI: center = yhat(xx) = width = 2*estimated sdev(y(xx) - yhat(xx)) = 2*sqrt( ... + ... ) * ROADMAP: - R mechanics of fitting linear models - Interpretation of basic quantities in linear models - MBA Statistics * R: Formula language for modeling, with excursions - Create some toy 'data': N <- 1000 # (Programming: Do not hardcode constants such as this one!) x1 <- round(runif(N)*100) # Continuous Predictor x2 <- round(rnorm(N,m=71,s=3), dig=.5) # Continuous predictor x3 <- sample(c("m","f","c"),size=N, replace=T, prob=c(.6,.3,.1)) # Categorical predictor eps <- rnorm(N,s=1) # 'Error' myd <- data.frame(x1=x1, x2=x2, x3=x3, y= round(10 + .1*x1 + 0*x2 + c(m=0,f=-6,c=-10)[x3] + eps, dig=.01) ) - Excursion: Basic steps for getting acquainted with a new dataframe !!!!!!!!!!!!!!! # Dataframe or matrix or what? class(myd) # How large is it? dim(myd) # What are the variable names? names(myd) # Look at the first 6 rows: head(myd) # Look at the last 6 rows: tail(myd) # Basic tabulation of all variables: source("http://stat.wharton.upenn.edu/~buja/STATA-541/tab-all.R") # Once tab.all(myd) # Tabulate all columns for a general overview # Generic scatterplot matrix: plot(myd, pch=16, cex=.8) # If ncol(myd)>10, look at subsets of columns. # Another scatterplot matrix function with nicer functionality: source("http://stat.wharton.upenn.edu/~buja/STAT-541/plot-plus.R") # Once plot.plus(myd) # Like 'plot()' but (1) it jitters categorical variables, # (2) gives more margin space, and # (3) has nicer defaults for pch, cex. # To see the code and the arguments, say: plot.plus # Comment on plotting of categorical variables: # Categories are mapped to integers in lexicographical order of the labels. - Formulas to express linear models: # Use 'lm()' to fit a linear model: lm(y ~ x1 + x2 + x3, data=myd) # What is this, though, !%@^&*%? summary(lm(y ~ x1 + x2 + x3, data=myd)) # Much better! mym <- lm(y ~ x1 + x2 + x3, data=myd) # IMPORTANT! ==> Fitted models are 'first class citizens' of the language. They "exist" because they are objects that can be assigned to variables. summary(mym) names(mym) # Components of the list. for(i in 1:length(mym)) print(c(names(mym)[i],class(mym[[i]]))) sapply(mym, class) - Formulas are "first class citizens": myf <- y ~ x1 + x2 # y = beta0 + beta1*x1 + beta2*x2 + eps myf summary(lm(myf, data=myd)) - "Operator overloading": new meanings for +,-,1,*,/,^ '+' means 'include' '*' means 'include with interactions' ':' means 'include these specific interactions' '.' means 'include all predictors' '-' means 'exclude following predictor' '1' means 'intercept' - Prevent overloading: I(...), e.g., I(1/x) - Linear models with interactions: # y = beta0 + beta1*x1 + beta2*x2 + + beta12*x1*x2 + eps summary(lm(y ~ x1 * x2, data=myd)) summary(lm(y ~ x1 + x2 + x1:x2, data=myd)) summary(lm(y ~ x1 + x2 + I(x1 * x2), data=myd)) # dito summary(lm(y ~ x1 * x2 * x3, data=myd)) # dito - Categorical variables also (e.g., Location in real estate data) summary(lm(y ~ x3 -1, data=myd)) # Uses the alphabetically first category name as the reference group. # y = beta0 + beta.b*1[b] + beta.c*1[c] + eps summary(lm(y ~ x1 + x3, data=myd)) # model?... - Interaction of continuous and categorical predictors: summary(lm(y ~ x1 * x3, data=myd)) ================================================================