================================================================ LECTURE 1: - Welcome to STAT 541: STATISTICAL METHODS . Intended for 1st Year PhD students in STATISTICS . All others at their own risk... - Not an applied stats course! . See Paul Rosenbaum''s 500 level course. . We will use math extensively, in particular linear algebra. - Not an R course, but . we will spend the first two classes to learn R basics, and . R is our computational tool. - Prerequisites: . Basic statistical inference . Linear algebra . Some experience with programming (Visual Basic, Pearl, Python, R, Splus; C, Fortran,...) . Strong conceptual thinking - First steps into R: see the files 'Stat-541-R-intro.R' and 'Stat-541-R-crib.R' ================================================================ LECTURE 2: - RECAP: Syllabus . Prerequisites . Intended audience for this class . Grading based on HW alone, no exams . First 3 HWs are posted . Office hours after class and by appointment . First steps into R - ROADMAP: . Introductions . More steps into R - QUIZ QUESTIONS ABOUT R (assuming you studied for HW 1): * What are the basic data types? * What are the ways of indexing a vector? * What happens generally when R expects a certain basic data type but is given the wrong type? As in T + 3 3 | F "a" | T paste(1,2, sep="") paste(1,2) paste(T,2, sep="") "1" + "2" scan("gaga.txt") c(1,"a") * How can you ask an R object of what type it is? class(a) apropos("is[.]") * Explicit coercion? as.numeric("3") apropos("as[.]") * What are the type constraints on vectors, matrices, arrays? * What is the purpose of lists? * What is the purpose of data frames? * What do you obtain if you extract a row from a matrix? mat <- cbind(1:5,11:15) mat[2,] * What do you obtain if you extract a row from a data frame? dfr <- data.frame(1:5,letters[1:5]) dfr[2,] * Categorical variables are generated by the function 'factor()'. f1 <- factor(c("a","b","a","a","c","c")) f2 <- factor(c("a","b","a","a","c","c"), levels=letters) f1 f2 f1==f2 # Strip down to a plain vector and try again. plot(f1) plot(f2) Under the hood: attributes(f1) attributes(f2) levels(f1) levels(f2) class(f1) class(f2) c(f1) # Coercion to vectors c(f2) c(f1)==c(f2) as.integer(f1) as.integer(f2) Factors can be columns in dataframes! class(dfr[,2]) class(dfr[,1]) ================================================================ LECTURE 3: - RECAP: . Syllabus -- What is the purpose of this course? . Prerequisites: 1) ..., 2) ..., 3) ... . Intro to the R language - ROADMAP: . Winding down with R intro . Intuitions and fundamentals of statistical inference . Linear Model analysis with linear algebra - R INTRO (contd.): * What are the functions provided for (almost all) distributions? runif(100); punif(.5); qunif(.3); dunif(.2) unif, norm, chisq, f, t, cauchy, beta, gamma, binom, nbinom, (multinom) * Searching for patterns in string data: x <- paste(sample(letters, size=1000, repl=T), sample(letters, size=1000, repl=T), sep="") grep("a",x) grep("a",x,value=T) x[grep("a",x)] grep("[abc][xyz]",x,value=T) # Example of a regular expression. grep("[abc][abc]|[xyz][xyz]",x,value=T) # " * Looping: for(i in 1:100) plot(rnorm(1000),rnorm(1000), xlim=c(-4,4), ylim=c(-4,4)) # Refine the plot... par(mfrow=c(2,2)) # Refine for(i in 1:4) plot(rnorm(1000),rnorm(1000)) AVOID LOOPS AT ALL COST IF YOU CAN !!!!!!!!!!!!!!!!!!!!!!!!!!! Instead use implied loops such as: 'sum()', 'rowSums()', matrix multiplications '%*%' * Conditional execution: for(i in 1:20) if(runif(1)>.9) cat("You lucky guy!\n") else cat("Bummer!\n") while(runif(1)<.9) { cat("Not yet... \n") }; cat("Finally!\n") * How would you design a random process that is always the highest value seen to date in a discrete Brownian motion simulation? * Plotting: - Many plotting functions to build up complex plots: 'plot()', 'text()', 'lines()', 'points()', 'polygon()', 'rect()', 'legend()', 'grid()',... - Principle: iterative debugging and refinement - Example: . Fine tuning a single plot: x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y) # add pch, cex, main . Multiple plots: par(mfrow=c(2,3)) for(i in 1:6) { x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y, pch=16, cex=.5, main=paste("Plot",i)) } . New window of given size: windows(width=8, height=6) par(mfrow=c(2,3), mar=c(3,3,3,1), mgp=c(1.8,.5,0)) # Plot parameters for(i in 1:6) { x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y, pch=16, cex=.5, main=paste("Plot",i)) } . PDF device driver for inclusion in papers: pdf(file="My First R Plot.pdf", width=6, height=4.5) par(mfrow=c(2,3), mar=c(3,3,3,1), mgp=c(1.8,.5,0)) for(i in 1:6) { x <- runif(1000); y <- x^2 + rnorm(1000, s=.1) plot(x, y, pch=16, cex=.5, main=paste("Plot",i)) } graphics.off() - Before you plot, form an idea what to expect. Example: Scatterplot of the age of an autistic child against the age of a matched unaffected sibling - In standard R plots are not objects. To this end check Paul Murrell''s Grid package. - FUNDAMENTALS OF STATISTICAL INFERENCE: * How we arrive at stderrs, CIs, statistical tests in the simplest case? * Imagine observing n ojects (e.g., sample students and measure height) X1,...,Xn i.i.d. with existing expectation mu and variance sigma^2 => E[Xi] = mu, V[Xi] = sigma^2, C[Xi,Xj] = 0 (i!=j) * Realize: If you repeated the data collection, you would get different values. * Summary statistic: mean Xbar = mean(X1,...,Xn) For a given dataset, you obtain one mean value. Simulation of a 'dataset': n <- 100 X <- rnorm(n, m=69, s=5) mean(X) sd(X) * But the mean would have been different from another dataset, => The mean is a random variable, even though observed only once. * Question: How much does the mean vary from dataset to dataset? => variance expansion V(Xbar) = ( sum_i Var(Xi) + sum_i!=j Cov(Xi,Xj) )/N^2 = sigma^2 / N => sigma(Xbar) = sigma/sqrt(N) true standard error: dispersion of a statistic across datasets * Statistical question: Can we estimate sigma(Xbar)? YES, because . sigma(Xbar) = sigma/sqrt(N) and . We can estimate sigma with s: s = sqrt(sum_i (Xi - Xbar)^2 / n) => Estimate of sigma(Xbar) is s/sqrt(n) * Statistical inference consists of (approx.) probab. statements about estimates, where the probabilities concern RELATIVE FREQUENCIES ACROSS DATASETS. The two components that typically make such statements possible are . standard error estimates and . the central limit theorem. * Central limit theorem: (Xbar - mu)/sigma(Xbar) gets ever more N(0,1) distributed as n-->Inf => mu +- 2*sigma(Xbar) contains Xbar for 95% of datasets Xbar +- 2*sigma(Xbar) contains mu for 95% of datasets !!!!!! ** Strangeness of the CLT: In most cases it speaks about "fraction of worlds/datasets" because we usually see only one mean. (Exception: quality control) * Distinguish confidence intervals from 'acceptance' intervals: . Confidence intervals: estimate sigma(Xbar) with stderr = s/sqrt(n) Xbar +- 2*s/sqrt(n) contains the true mu for 95% of datasets . Testing: Is the assumption mu=mu0 reasonable in light of the data X1...Xn? "Acceptance/non-rejection region" mu0 +- 2*stderr contains Xbar for 95% of datasets If Xbar is outside, reject mu0 because it makes the observed Xbar too unlikely. * Generalization to parameters other than means and mu: . Standard errors and their estimates can be obtained analytically or computationally for most estimates. . A CLT holds ALMOST invariably => Statistical inference with CIs and tests can be performed: theta.est +- 2*stderr(theta.est) contains true theta for 95% of datasets. ** Standard error estimates are obtained from a single dataset. => We think we can say something about the variation of Xbar - variation from dataset to dataset - BASED ON A SINGLE DATASET! * History: Neyman, Fisher,... a huge 20th century achievement . Fisher got CIs wrong with his 'fiducial intervals'. . Bayesians today compete with their 'posterior credible intervals'. * Statistical inference for regression: . Strangeness of regression: * The predictor values are considered fixed, not random. (Regression is "conditional on x".) * Only the response is considered random. . Hence "varibility across datasets" means datasets that have the same x-values but different y-values. (In a simulation, keep x values fixed, simulate only y values.) . Yet there is a further argument that says conditional inference is valid unconditionally (across datasets with varying x). . Stats students: Look out for the notion of "ancillarity" in math stat. ** Strangeness of Time series: Where''s the randomness? This must be "possible-world semantics". Have we seen replications of Lehman stock, US GDP,...? ================================================================ * LINEAR MODELS: THE LS PROBLEM - LS criterion: RSS . in R: a numeric toy example of LS RSS <- function(b) sum((y-X%*%b)^2) N <- 5 X <- cbind(rep(1,N), 1:N) y <- 1 + 2*(1:N) # No 'error' in this case Getting at the coefficient estimates the hard way: nlm(RSS, rep(0,2))$estimate # nlm() is a minimizer in R . in math: RSS(b) = |y - X b|^2 . LS estimate: b = argmin RSS(b) = minimizer of RSS(b) (Justifications of LS? ...) . Normal equations as stationary equations of RSS: (X^T X) b = X^T y . Explicit LS solution: b = (X^T X)^{-1} X^T y [Does not work when ...] [NOT the numeric method of choice; stay tuned for a HW.] In R: b <- solve(t(X) %*% X) %*% t(X) %*% y - Residuals and fits: . Notation: yhat = X b r = y - yhat RSS = |r|^2 . Normal equations: X^T r = 0 [(p+1)-vectors] Uniquely characterizes the LS solution Geometric meaning: ... . In R: yhat <- X %*% b r <- y - yhat - LS and orthogonal projections: yhat = X b = ( X (X^T X)^{-1} X^T ) y = P y r = y - yhat = ( I - P ) y where P and I-P are orthogonal projections . Definition of orth. proj.: ... - What are the statistical properties of 'b', 'yhat' and 'r'? . What are their expected values? . What are their variances, and covariances? ================================================================