================================================================
* PERMUTATION TESTS
## - Idea: If the null hypothesis is independence of two variables X and Y,
## break the association between X and Y with random permutations.
## - Data example: places rated data (1986), 329 metropolitan areas of the US
site <- "http://stat.wharton.upenn.edu/~buja/STAT-961/"
places <- read.table(paste(site,"places.dat",sep=""), header=T)
dim(places)
names(places)
rownames(places)
pairs.plus(places, gap=0, cex=.3)
## - Visual permutation test to assess presence/absence of association
## between two variables:
xlab <- "Climate.Terrain"; ylab <- "Housing"
## xlab <- "Education"; ylab <- "Crime"
x <- places[,xlab]; y <- places[,ylab]
windows(6,7.5)
par(mfrow=c(5,4), mar=rep(.5,4), xaxt="n", yaxt="n")
i.pos <- sample(1:20,size=1) # random position of the actual
for(i in seq(length=i.pos-1)) # (R issue: this works even if i.pos=1)
plot(x=x, y=sample(y), pch=16, cex=.5)
plot(x=x, y=y, , pch=16, cex=.5)
for(i in i.pos+seq(length=20-i.pos))
plot(x=x, y=sample(y), pch=16, cex=.5) # (R: works even if i.pos=20)
# Which plot shows the actual data?
# 19 of the 20 plots show draws from a 'random association model'.
## - Numerical permutation test: p-value computation with simulation
## . Test statistic? anything; use the correlation, for example
test.stat <- function(x,y) cor(x,y) # Pearson correlation
test.stat <- function(x,y) cor(rank(x), rank(y)) # Spearman's rank correlation
test.stat <- function(x,y) cor(x>median(x), y>median(y)) # dichotomized correlation
test.stat <- function(x,y) (cancor(cbind(x,x^2), cbind(y,y^2))$cor[1]) # quadratic optimal corr
##
## . Simulation: sampling random permutations; the more samples, the more precision
nsim <- 9999 # fence post problem: 9999 make 10000 equally likely spaces
T.null <- rep(NA,nsim) # storage of "null values"
for(i in 1:nsim) {
if(i%%2000==0) print(i) # print a message every 1000 simulations
T.null[i] <- test.stat(x, sample(y))
}
## Simpler, as one-liner, using the 'replicate()' function:
T.null <- replicate(9999, { test.stat(x, sample(y)) } )
## No intermediate output, though.
##
## Observed values:
T.actual <- test.stat(x,y)
T.actual
##
## . Statistical significance: How extreme is 'T.actual' among 'T.null'?
## Judge visually:
hist(T.null, breaks=40, col="gray",
xlim=range(c(T.null,T.actual)))
lines(rep(T.actual,2), c(0,1000), lwd=2, col="red")
## . Non-rejection/retention/'acceptance' region at 5% and at 1% significance levels:
## (Results are shown for Pearson correlation.)
quantile(T.null, c(.025, .975)) # retention interval for alpha=5%
## 2.5% 97.5%
## -0.1093858 0.1091154 # note symmetry
quantile(T.null, c(.005, .995)) # retention interval for alpha=1%
## 0.5% 99.5%
## -0.1446682 0.1411944 # note symmetry this far out in the tail
T.actual # not in the retention region by a long shot
## . Amazing approximate normality of the simulated null distribution for several 'test.stat()':
qqnorm(T.null, pch=".")
T.null.mean <- mean(T.null)
T.null.sd <- sd(T.null)
abline(a=T.null.mean, b=T.null.sd, col="red")
## Could therefore use the fitted normal distribution for retention regions:
## 2-sided 5% level:
T.null.mean + c(-1,1)*qnorm(.975)*T.null.sd # 95% RI based on normal theory
qnorm(c(.025,.975), m=T.null.mean, s=T.null.sd) # same thing
quantile(T.null, c(.025, .975)) # 95% RI from simulation quantiles
## 2-sided 1% level:
qnorm(c(.005,.995), m=T.null.mean, s=T.null.sd) # 99% RI based on normal theory
quantile(T.null, c(.005, .995)) # 99% RI from simulation quantiles
## [There exists theory to derive an asymptotic normal approximation
## for the permutation distribution for many common statistics.]
## . p-value: Fence post consideration
## T.actual has equal probability 1/(nsim+1) to fall into any spacing of T.null
## (under what probability assumptions?)
## Nperm = 9,999 ==> 10,000 spacings, all equally likely for T.actual under H0 (independence)
## ==> The simulation approximation to the p-value is:
pval <- (sum(abs(T.null) >= abs(T.actual)) + 1) / (nsim+1)
pval # The minimum achievable with a null simulation of size 'nsim'.
##----------------------------------------------------------------
- Theory for permutation tests:
. Confuser: (x1,y1),...,(xn,yn) i.i.d.
==> x1,...,xn i.i.d.
y1,...,yn i.i.d.
but this does not imply xi is independent of yi.
. Version (1) of Permutation Test Theory:
+ Assume x and y are both random: (xi,yi) i.i.d. pairs
+ H0: In each pair (xi,yi), xi and yi are independent.
+ Minimal sufficient statistic:
S = ( x[1]<=...<=x[n]; y[1]<=...<=y[n] )
= order statistics of x''s and y''s
~ empirical distributions of x''s and y''s
~ values with their multiplicities, but ignoring their order
+ Conditional distribution of (x,y) given S(x,y):
----------------------------------------------
| Random pairing of x[i] with y[k] using the |
| uniform distributions on the permutations |
| of x values against the y values. |
----------------------------------------------
[Important and vexing fact: Even though our null assumption is
"X and Y are independent", conditional on the sufficient statistic S,
X and Y are no longer independent... how come? something wrong? no!]
. Version (2):
+ Assume: yi = independent random samples
xi = fixed known numbers like predictors in regression
+ H0: yi are i.i.d. and hence have a shared marginal distribution
that is free of the x[i] values.
[In particular: E[yi], V[yi] are free of xi.]
+ Minimal sufficient statistic:
S = (y[1]<=...<=y[n])
= order statistics of y''s
~ empirical distribution y''s
+ Conditional distribution of yi''s given S(y):
------------------------------------------------------------
| Uniform distribution on the permutations of the y values |
| for random matching with the x values |
------------------------------------------------------------
- Non-parametric nature of permutation tests:
. No assumptions made about the marginal distributions
of F(x) and G(y) in (1) or G(y) in (2).
. This makes the permutation approach different from an overall F test
in linear models: There we assume normal errors, here we don''t.
----------------
* Permutation Tests: Special cases
. x: predictor vector; y: quantitative
=> multiple regression
Two versions: fixed x and random x
=> Both frameworks (1) and (2) apply and result in the
same conditional null distribution.
Natural test statistic: overall-F or R2 or RSS (all equivalent)
=> A 'non-parametric' substitute for the overall F-test,
WITHOUT NORMAL ASSUMPTIONS.
[The overall F-test assumes normal errors and all betaj = 0.]
. x: binary (categorical) variable; y: quantitative response
=> 2-sample test
Natural test statistic: 2-sample t-stat, or difference of medians,...
Graphical: empirical quantile-quantile plot
source("Functions/qqplot.R")
sel <- grepl("Boston", rownames(boston))
qqplot.null(x=boston[!sel,"MEDV"], y=boston[sel,"MEDV"], xlab="Surroundings", ylab="Boston", nrep=500)
==> Null band
. x: categorical variable; y: quantitative response
=> 1-way ANOVA
Natural test statistic: overall-F or Kruskal-Wallis or any pair-difference
. x: binary; y: binary
=> Fisher''s exact test (no need to simulate; exact null distr. known)
Natural test statistic: anything (any cell count, they are all equiv.)
R: fisher.test()
. x: categorical; y: categorical
=> chi-square test of independence
Natural test statistics: '', or any cell count
. x: predictor vector; y binary
=> logistic regression or binary classification (boosting, bagging, SVMs,...)
Natural test statistics: negative log-likelihood, misclassification rate,...
Question answered: Do the predictors help lower the misclassification rate
. x: predictor vector; y categorical
=> 'polytomous regression' or multi-class classification ('')
Natural test statistics: ''
. x: predictor vector; y response vector
=> 'canonical correlation analysis' (CCA)
Natural test statistics: largest canonical correlation
or sum of k largest cancorrs,...
. x: time; y time series
=> test for generalized 'white noise'
Natural test statistics: any time series model statistic (AR, ARMA, ARCH,...)
. x: equi-spaced time; y = diff(time series)
=> test for generalized 'random walk'
Natural test statistics: any time series model statistic (AR, ARMA, ARCH,...)
. Think it through: PCA (for those who know Principal Components Analysis)
What would be the null hypothesis?
. Regression one more time: model selection and control of 'false inclusion rate'
Recall in the chapter on model selection we proposed to add synthetic random
predictors (rnorm(N)) to gauge when model selection starts to do 'false inclusion'.
Proposal: Instead of adding random number predictors, add a permuted predictor matrix.
y ~ cbind(X, X[sample(N),])
The idea is to run a model selection procedure for many draws of 'sample(N)'
and somehow use the information for informed selection of the model size.
Call the inclusion of a column from X[sample(N),] a 'false inclusion', and
the proportion of false inclusions across many runs the 'false inclusion rate'
or 'FIR'.
Some specific proposals:
~ All subsets selection:
+ For each model size, find the best (two best, ...) model(s) wrt RSquare.
+ For each model size, establish the FIR across the best (two best, ...) models.
+ Choose the largest model size whose FIR is below alpha (=0.01, e.g.).
~ Lasso search:
+ Fit lasso to a grid of values of lambda.
+ For each lambda in the grid establish the FIR
+ Choose the smallest lambda whose FIR is below alpha (=0.01, e.g.).
This is not strictly a permutation test but another creative use of the
permutation paradigm. It answers the question:
When does another predictor set with exactly the same collinearities
but no true association with y at all, start to contribute viable
predictors by the lights of the selection algorithm?
- Drawback of permutation tests: No conversion to CIs?
Yes and now...
. Recall duality between CIs with 95% coverage and tests at 5% significance
CI = { parameter values that can not be rejected if used as H0 }
. In some cases the conversion is possible (instructor learned this from P. Rosenbaum):
+ In a linear regression, consider all H0: beta = beta0, for arbitrary beta0 in R^p
I.e., y = X beta0 + iid noise
==> Permutation test of y.working = (y - X beta0) against X.
==> CR (confidence region) = { beta0 | permutation test does not reject }
This approach provides a CR for the whole beta vector.
It may be difficult to realize unless p=1.
+ If interest is in just one predictor/coefficient, such as betaj for xj,
we are at a loss. No exact permutation test exists, unless one assumes
yet again normality of the errors. The point of permutation tests, however,
would be to get rid of assumptions about the marginal distribution of errors.
. The instructor has recently move away from permutation tests toward bootstrap-based inference.
The main reason is the presence of missing values in the data he faces:
Should we permute NAs across different variables?
Probably not: Missing values are often highly correlated, i.e.,
if there is an NA in one variable of an individual,
it is more likely that the same individual has more NAs in other variables.
But this is usually not of interest, so it would seem that one should
condition on the NA patterns.
But this gets messy...
Bootstrap approaches are easier on NAs: NA patterns are retained under resampling.
- Other advantages of permutation tests:
. Intuitive nature
. Can be defended even for non-sampling data (such as the Places Rated data)
Play a what-if game of 'possible worlds'.
. Freedom of choice of test statistic: use any measure of dependence
. Non-parametric nature: marginal distribution F is arbitrary
- Limitations:
. Good only for H0 that express independence between xi and yi
or in which xi are fixed and yi are i.i.d. with a distribution not involving xi
. Like all simulations of null distributions, the smallest possible
p-value is limited by the number of simulations: 1/(nsim+1)
(unless we establish that the null distribution can be
well-approximated by some analytical distribution, in
which case one can reach further out into the tails).
- Literature:
. Lehman & Romano (1986 or later, good for theory: sec. 4.3, 5.10ff)
"Testing Statistical Hypotheses" (Wiley)
Older, more practical:
. E. S. Edgington (1980)
"Randomization Tests" (Dekker, Statistics: Textbooks and monographs, vol. 31)
. Ph. I. Good (1994)
"Permutation Tests : A Practical Guide to Resampling Methods
for Testing Hypotheses"
(Springer Series in Statistics)
================================================================