================================================================

* PERMUTATION TESTS

## - Idea: If the null hypothesis is independence of two variables X and Y,
##   break the association between X and Y with random permutations.

## - Data example: Places Rated data (1986), 329 metropolitan areas of the US

site <- "http://stat.wharton.upenn.edu/~buja/STAT-961/"
places <- read.table(paste(site, "places.dat", sep=""), header=T)
dim(places)
names(places)
rownames(places)
pairs.plus(places, gap=0, cex=.3)      # scatterplot matrix; pairs.plus() is not base R
                                       # (presumably a helper provided with the course code)

## - Visual permutation test to assess presence/absence of association
##   between two variables:

xlab <- "Climate.Terrain";  ylab <- "Housing"
## xlab <- "Education";  ylab <- "Crime"
x <- places[,xlab];  y <- places[,ylab]
windows(6, 7.5)                        # graphics window on Windows; use x11() or quartz() elsewhere
par(mfrow=c(5,4), mar=rep(.5,4), xaxt="n", yaxt="n")
i.pos <- sample(1:20, size=1)          # random position of the actual data among the 20 plots
for(i in seq(length=i.pos-1))          # (R issue: this works even if i.pos=1)
    plot(x=x, y=sample(y), pch=16, cex=.5)
plot(x=x, y=y, pch=16, cex=.5)         # the actual data
for(i in i.pos + seq(length=20-i.pos)) # (R: works even if i.pos=20)
    plot(x=x, y=sample(y), pch=16, cex=.5)
# Which plot shows the actual data?
# 19 of the 20 plots show draws from a 'random association model'.

## - Numerical permutation test: p-value computation with simulation
##   . Test statistic? Anything; use the correlation, for example.
##     (Each assignment below overwrites the previous one; keep the one you want to use.)

test.stat <- function(x,y) cor(x,y)                                   # Pearson correlation
test.stat <- function(x,y) cor(rank(x), rank(y))                      # Spearman's rank correlation
test.stat <- function(x,y) cor(x>median(x), y>median(y))              # dichotomized correlation
test.stat <- function(x,y) cancor(cbind(x,x^2), cbind(y,y^2))$cor[1]  # quadratic optimal correlation

##   . Simulation: sampling random permutations; the more samples, the more precision.

nsim <- 9999               # fence-post consideration: 9999 null values make 10000 equally likely spacings
T.null <- rep(NA, nsim)    # storage for the "null values"
for(i in 1:nsim) {
    if(i %% 2000 == 0) print(i)               # progress message every 2000 simulations
    T.null[i] <- test.stat(x, sample(y))
}

## Simpler, as a one-liner, using the 'replicate()' function:
T.null <- replicate(9999, test.stat(x, sample(y)))
## No intermediate progress output, though.

## Observed value:
T.actual <- test.stat(x, y)
T.actual

##   . Statistical significance: How extreme is 'T.actual' among 'T.null'?
##     Judge visually:
hist(T.null, breaks=40, col="gray", xlim=range(c(T.null, T.actual)))
lines(rep(T.actual, 2), c(0, 1000), lwd=2, col="red")

##   . Non-rejection/retention/'acceptance' region at the 5% and 1% significance levels:
##     (Results are shown for the Pearson correlation.)
quantile(T.null, c(.025, .975))    # retention interval for alpha=5%
##        2.5%      97.5%
##  -0.1093858  0.1091154          # note the symmetry
quantile(T.null, c(.005, .995))    # retention interval for alpha=1%
##        0.5%      99.5%
##  -0.1446682  0.1411944          # note the symmetry this far out in the tail
T.actual                           # not in the retention region by a long shot
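##   . A minimal sketch: the simulation and retention-interval steps above can be packaged
##     into a small reusable helper. The function name 'perm.null' and the demo data
##     'x.demo'/'y.demo' are made up for this sketch; with the Places Rated data one would
##     call it on the 'x' and 'y' defined above.
perm.null <- function(x, y, test.stat, nsim=9999) {
    T.null <- replicate(nsim, test.stat(x, sample(y)))   # permute y to break the x-y pairing
    list(T.null   = T.null,
         T.actual = test.stat(x, y),
         RI.95    = quantile(T.null, c(.025, .975)),     # retention interval, alpha=5%
         RI.99    = quantile(T.null, c(.005, .995)))     # retention interval, alpha=1%
}
## Usage on synthetic data with a weak linear association:
set.seed(1)
x.demo <- rnorm(329);  y.demo <- 0.2*x.demo + rnorm(329)
out <- perm.null(x.demo, y.demo, test.stat=function(x,y) cor(x,y))
out$T.actual;  out$RI.95;  out$RI.99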
##   . Amazing approximate normality of the simulated null distribution
##     for several choices of 'test.stat()':
qqnorm(T.null, pch=".")
T.null.mean <- mean(T.null)
T.null.sd   <- sd(T.null)
abline(a=T.null.mean, b=T.null.sd, col="red")     # line implied by a fitted normal distribution

## Could therefore use the fitted normal distribution for retention regions:
## 2-sided 5% level:
T.null.mean + c(-1,1)*qnorm(.975)*T.null.sd       # 95% RI based on normal theory
qnorm(c(.025,.975), m=T.null.mean, s=T.null.sd)   # same thing
quantile(T.null, c(.025, .975))                   # 95% RI from simulation quantiles
## 2-sided 1% level:
qnorm(c(.005,.995), m=T.null.mean, s=T.null.sd)   # 99% RI based on normal theory
quantile(T.null, c(.005, .995))                   # 99% RI from simulation quantiles
## [There exists theory to derive an asymptotic normal approximation
##  to the permutation distribution of many common statistics.]

##   . p-value: fence-post consideration
##     T.actual has equal probability 1/(nsim+1) of falling into any spacing of T.null
##     (under what probability assumptions?).
##     nsim = 9,999  ==>  10,000 spacings, all equally likely for T.actual under H0 (independence)
##     ==> The simulation approximation to the (two-sided) p-value is:
pval <- (sum(abs(T.null) >= abs(T.actual)) + 1) / (nsim+1)
pval     # The minimum achievable value with a null simulation of size 'nsim' is 1/(nsim+1).
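##   . A minimal sketch of using the normal approximation for the p-value as well:
##     compare the simulation-based two-sided p-value with one computed from a normal
##     fit to 'T.null'. It assumes 'T.null', 'T.actual', and 'nsim' as defined above;
##     the names 'pval.sim' and 'pval.norm' are introduced only for this sketch.
pval.sim  <- (sum(abs(T.null) >= abs(T.actual)) + 1) / (nsim+1)                   # simulation-based
pval.norm <- 2*pnorm(abs(T.actual - mean(T.null))/sd(T.null), lower.tail=FALSE)   # normal-theory
c(simulation = pval.sim, normal.approx = pval.norm)
##     The normal approximation can reach below 1/(nsim+1) in the tails, but it should
##     only be trusted after a check such as qqnorm(T.null).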
##----------------------------------------------------------------

- Theory for permutation tests:

  . Confuser:  (x1,y1),...,(xn,yn) i.i.d.  ==>  x1,...,xn i.i.d.  and  y1,...,yn i.i.d.,
    but this does NOT imply that xi is independent of yi.

  . Version (1) of permutation test theory:
    + Assume x and y are both random: (xi,yi) are i.i.d. pairs.
    + H0: In each pair (xi,yi), xi and yi are independent.
    + Minimal sufficient statistic:
        S = ( x[1]<=...<=x[n];  y[1]<=...<=y[n] )
          = order statistics of the x's and the y's
          ~ empirical distributions of the x's and the y's
          ~ values with their multiplicities, but ignoring their order
    + Conditional distribution of (x,y) given S(x,y):
        ----------------------------------------------
        | Random pairing of x[i] with y[k] using the |
        | uniform distribution on the permutations   |
        | of the x values against the y values.      |
        ----------------------------------------------
      [Important and vexing fact: Even though our null assumption is
       "X and Y are independent", conditional on the sufficient statistic S,
       X and Y are no longer independent... how come? something wrong? no!]

  . Version (2):
    + Assume: yi = an independent random sample;
              xi = fixed known numbers, like predictors in regression.
    + H0: The yi are i.i.d. and hence have a shared marginal distribution
      that is free of the x[i] values.
      [In particular: E[yi] and V[yi] are free of xi.]
    + Minimal sufficient statistic:
        S = ( y[1]<=...<=y[n] )
          = order statistics of the y's  ~  empirical distribution of the y's
    + Conditional distribution of the yi's given S(y):
        ------------------------------------------------------------
        | Uniform distribution on the permutations of the y values |
        | for random matching with the x values                    |
        ------------------------------------------------------------

- Non-parametric nature of permutation tests:
  . No assumptions are made about the marginal distributions F(x) and G(y) in (1),
    or about G(y) in (2).
  . This makes the permutation approach different from an overall F-test in linear
    models: there we assume normal errors, here we don't.

----------------

* Permutation Tests: Special cases

  . x: predictor vector;  y: quantitative   =>  multiple regression
    Two versions: fixed x and random x
    => Both frameworks (1) and (2) apply and result in the same conditional null distribution.
    Natural test statistic: overall F, or R2, or RSS (all equivalent)
    => A 'non-parametric' substitute for the overall F-test, WITHOUT NORMALITY ASSUMPTIONS.
       [The classical overall F-test assumes normal errors; its H0 is that all betaj = 0.]
    (A small code sketch of this case follows the list of special cases below.)

  . x: binary (categorical) variable;  y: quantitative response   =>  2-sample test
    Natural test statistic: 2-sample t-statistic, or difference of medians, ...
    Graphical: empirical quantile-quantile plot
      source("Functions/qqplot.R")
      sel <- grepl("Boston", rownames(boston))   # 'boston': assumed to be the Boston housing
                                                 # data frame used earlier in the course
      qqplot.null(x=boston[!sel,"MEDV"], y=boston[sel,"MEDV"],
                  xlab="Surroundings", ylab="Boston", nrep=500)
      ==> Null band

  . x: categorical variable;  y: quantitative response   =>  1-way ANOVA
    Natural test statistic: overall F, or Kruskal-Wallis, or any pair difference

  . x: binary;  y: binary   =>  Fisher's exact test
    (no need to simulate; the exact null distribution is known)
    Natural test statistic: anything (any cell count; they are all equivalent)
    R: fisher.test()

  . x: categorical;  y: categorical   =>  chi-square test of independence
    Natural test statistics: '' (ditto), or any cell count

  . x: predictor vector;  y: binary   =>  logistic regression or
    binary classification (boosting, bagging, SVMs, ...)
    Natural test statistics: negative log-likelihood, misclassification rate, ...
    Question answered: Do the predictors help lower the misclassification rate?

  . x: predictor vector;  y: categorical   =>  'polytomous regression' or
    multi-class classification ('')
    Natural test statistics: ''

  . x: predictor vector;  y: response vector   =>  'canonical correlation analysis' (CCA)
    Natural test statistics: largest canonical correlation, or sum of the k largest
    canonical correlations, ...

  . x: time;  y: time series   =>  test for generalized 'white noise'
    Natural test statistics: any time series model statistic (AR, ARMA, ARCH, ...)

  . x: equi-spaced time;  y = diff(time series)   =>  test for generalized 'random walk'
    Natural test statistics: any time series model statistic (AR, ARMA, ARCH, ...)

  . Think it through: PCA (for those who know Principal Components Analysis).
    What would be the null hypothesis?

  . Regression one more time: model selection and control of the 'false inclusion rate'
    Recall that in the chapter on model selection we proposed to add synthetic random
    predictors (rnorm(N)) to gauge when model selection starts to do 'false inclusion'.
    Proposal: Instead of adding random-number predictors, add a permuted copy of the
    predictor matrix:
        y ~ cbind(X, X[sample(N),])
    The idea is to run a model selection procedure for many draws of 'sample(N)' and
    use the information for an informed choice of the model size.
    Call the inclusion of a column from X[sample(N),] a 'false inclusion', and the
    proportion of false inclusions across many runs the 'false inclusion rate' or 'FIR'.
    Some specific proposals:
      ~ All-subsets selection:
        + For each model size, find the best (two best, ...) model(s) w.r.t. R-square.
        + For each model size, establish the FIR across the best (two best, ...) models.
        + Choose the largest model size whose FIR is below alpha (=0.01, e.g.).
      ~ Lasso search:
        + Fit the lasso on a grid of values of lambda.
        + For each lambda in the grid, establish the FIR.
        + Choose the smallest lambda whose FIR is below alpha (=0.01, e.g.).
    This is not strictly a permutation test but another creative use of the permutation
    paradigm. It answers the question: When does another predictor set, with exactly
    the same collinearities but no true association with y at all, start to contribute
    viable predictors by the lights of the selection algorithm?
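  . A minimal sketch of the multiple-regression case above: a permutation substitute
    for the overall F-test, using R^2 as the test statistic. The data 'X.demo'/'y.demo'
    are simulated purely for illustration; with real data, plug in your own predictor
    matrix and response.
      set.seed(2)
      n <- 100
      X.demo <- matrix(rnorm(n*3), n, 3)                       # three predictors
      y.demo <- drop(X.demo %*% c(.3, 0, 0)) + rnorm(n)        # only the first predictor matters
      r2 <- function(y, X) summary(lm(y ~ X))$r.squared        # test statistic: R^2 of the fit
      R2.actual <- r2(y.demo, X.demo)
      R2.null   <- replicate(1999, r2(sample(y.demo), X.demo)) # permute y, keep X fixed
      (sum(R2.null >= R2.actual) + 1) / (1999 + 1)             # one-sided p-value: large R^2 counts as extreme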
- Drawback of permutation tests: No conversion to CIs?  Yes and no...

  . Recall the duality between CIs with 95% coverage and tests at the 5% significance level:
      CI = { parameter values that cannot be rejected when used as H0 }

  . In some cases the conversion is possible (the instructor learned this from P. Rosenbaum):
    + In a linear regression, consider all H0: beta = beta0, for arbitrary beta0 in R^p,
      i.e., y = X beta0 + iid noise
      ==> Permutation test of y.working = (y - X beta0) against X.
      ==> CR (confidence region) = { beta0 | the permutation test does not reject }
      This approach provides a CR for the whole beta vector.
      It may be difficult to realize in practice unless p=1.
    + If interest is in just one predictor/coefficient, such as betaj for xj,
      we are at a loss: no exact permutation test exists, unless one assumes yet again
      normality of the errors. The point of permutation tests, however, would be to get
      rid of assumptions about the marginal distribution of the errors.

  . The instructor has recently moved away from permutation tests toward bootstrap-based
    inference. The main reason is the presence of missing values in the data he faces:
    Should we permute NAs across different variables? Probably not: missing values are
    often highly correlated, i.e., if an individual has an NA in one variable, the same
    individual is more likely to have NAs in other variables as well. But this is usually
    not of interest, so it would seem that one should condition on the NA patterns.
    That, however, gets messy...
    Bootstrap approaches are easier on NAs: NA patterns are retained under resampling.

- Other advantages of permutation tests:
  . Intuitive nature.
  . Can be defended even for non-sampling data (such as the Places Rated data):
    play a what-if game of 'possible worlds'.
  . Freedom of choice of test statistic: use any measure of dependence.
  . Non-parametric nature: the marginal distribution F is arbitrary.

- Limitations:
  . Good only for H0 that express independence between xi and yi, or in which the xi
    are fixed and the yi are i.i.d. with a distribution not involving the xi.
  . Like all simulations of null distributions, the smallest possible p-value is limited
    by the number of simulations: 1/(nsim+1)
    (unless we establish that the null distribution can be well approximated by some
    analytical distribution, in which case one can reach further out into the tails).

- Literature:
  . Lehmann & Romano (1986 or later; good for theory: sec. 4.3, 5.10ff)
    "Testing Statistical Hypotheses" (Wiley)
  Older, more practical:
  . E. S. Edgington (1980) "Randomization Tests"
    (Dekker, Statistics: Textbooks and Monographs, vol. 31)
  . Ph. I. Good (1994) "Permutation Tests: A Practical Guide to Resampling Methods
    for Testing Hypotheses" (Springer Series in Statistics)

================================================================