================================================================
* CURVE ESTIMATION / SMOOTHING / BANDWIDTH SELECTION / CROSS-VALIDATION
- Idea: given x-y data (quantitative-quantitative)
. Drop assumption of linearity: y = (a + bx) + eps
. Assume smoothness only: y = f(x) + eps, f() smooth
. Smooth => locally linear f(x+h) ~=~ f(x) + Df(x)*h + O(h^2)
. If we wish to infer y=f(x) from data (xi,yi) (i=1,...,n),
we have to infer an estimate f(x) from (xi,yi) with xi near x
. First try:
fhat(x) = ave(yi : |xi-x| < h)   (h = window half-width)
If no xi falls within the window => fhat(x)=NA
The resulting "smooth" is really not smooth: it jumps as points
enter and leave the window.
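Note: smooth1() in the next chunk is this windowed running-statistic
smoother with a pluggable summary function; it is not defined in these
notes, so here is a minimal sketch (the signature, default half-width h,
and grid size nloc are assumptions):
smooth1 <- function(xy, h=diff(range(xy[,1]))/20, nloc=101, func=mean) {
  x <- xy[,1]; y <- xy[,2]
  xs <- seq(min(x), max(x), length=nloc)   # evaluation grid
  fx <- sapply(xs, function(x0) {
    sel <- (abs(x - x0) < h)               # the window at x0
    if(any(sel)) func(y[sel]) else NA      # NA if the window is empty
  })
  cbind(x=xs, y=fx)
}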
. Variations: running median and quantiles for 50%, 90% prediction bands!
plot(xy, pch=16, cex=.5)
lines(smooth1(xy, func=median), col="black", lwd=3)
lines(smooth1(xy, func=function(y) quantile(y,.25)), col="gray50", lwd=3)
lines(smooth1(xy, func=function(y) quantile(y,.75)), col="gray50", lwd=3)
lines(smooth1(xy, func=function(y) quantile(y,.05)), col="gray80", lwd=3)
lines(smooth1(xy, func=function(y) quantile(y,.95)), col="gray80", lwd=3)
. 'Real' smoothness: kernel smoothing
smooth <- function(xy, bw=1/5, nloc=51) {
x <- xy[,1]; y <- xy[,2]
xmin <- min(x); xmax <- max(x)
xs <- seq(xmin, xmax, length=nloc)
fx <- rep(NA, nloc)
act.bw <- diff(range(x))*bw # actual bandwidth
for(i in 1:nloc) {
w <- dnorm(x=(x-xs[i])/act.bw) # proportional to dnorm(x, mean=xs[i], sd=act.bw); the constant cancels in the next line
w <- w/sum(w) # so we have: sum(w)==1
fx[i] <- sum(y*w)
}
return(cbind(x=xs, y=fx))
}
. Apply to the same data with 4 bandwidths:
bws <- c(2, .2, .03, .005)
cols <- c("red","orange","green","blue")
plot(xy, pch=16, cex=.5)
for(i in 1:4)
lines(smooth(xy, bw=bws[i]), col=cols[i], lwd=3)
Q: What would be a principled way to select the bandwidth? (stay tuned...)
. Simulation to show variability of smoothers: y = 4*x^2 + eps
bws <- c(2, .2, .03, .005)
N <- 100
x <- sort(runif(N))
Nsim <- 200
par(mfrow=c(2,2), mgp=c(1.5,.5,0), mar=c(3,3,1,1))
for(ibw in 1:4) {
plot(x=c(0,1), y=c(-2,6), type="n", xlab="", ylab="")
xy <- cbind(x=x, y=4*x^2 + rnorm(N))
points(xy, pch=16, cex=.5, col="gray30")
for(isim in 1:Nsim) {
xy <- cbind(x=x, y=4*x^2 + rnorm(N))
lines(smooth(xy, bw=bws[ibw]), col=cols[ibw], lwd=1)
}
lines(x, 4*x^2, lwd=2)
}
You should be looking at the widths of the color bands (variance!)
and how much the bands systematically deviate from the target (bias!).
The wiggliness of the functions is a side-effect of variance.
. Lesson: BIAS-VARIANCE trade-off !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Really: a bias-SE trade-off (variance measures dataset-to-dataset variability of fhat; SE is its square root, on the scale of the data).
small bandwidth large bandwidth
==> ==>
large variance (-) small variance (+)
small bias (+) large bias (-)
. Mean squared error and bias-variance decomposition:
Model: yi = f(xi) + epsi E[epsi] = 0
BIAS: Bias(fhat(x)) = E[fhat(x)] - f(x)
VARIANCE: V(fhat(x)) = E[ (fhat(x) - E[fhat(x)])^2 ]
MSE: MSE(fhat(x)) = E[ (fhat(x) - f(x))^2 ]
Bias-variance decomposition of the MSE:
---------------------------------------------------
| MSE(fhat(x)) = V(fhat(x)) + Bias(fhat(x))^2 |
---------------------------------------------------
Proof: E[ (Z - c)^2 ] = V[Z] + ( E[Z] - c )^2, applied with Z = fhat(x), c = f(x)
==> MSE is an overall measure that trades off bias and variance
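A quick Monte Carlo check of the decomposition at a single x, reusing
smooth() and the simulation model y = 4*x^2 + eps from above (x0, the
bandwidth, and the replication count are our choices):
set.seed(1)
N <- 100; Nsim <- 2000; x0 <- 0.5
x <- sort(runif(N))
fhat0 <- replicate(Nsim, {                 # fhat(x0) across fresh datasets
  xy <- cbind(x=x, y=4*x^2 + rnorm(N))
  approx(smooth(xy, bw=.03), xout=x0)$y
})
bias <- mean(fhat0) - 4*x0^2
vhat <- mean((fhat0 - mean(fhat0))^2)      # population-style variance
c(MSE=mean((fhat0 - 4*x0^2)^2), V.plus.Bias2=vhat + bias^2)
# The two entries agree exactly (algebraic identity); computing them
# was possible only because f is known here...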
but... MSE is not known
. PMSE or 'predictive mean squared error':
PMSE(fhat(x)) = E[ (Y - fhat(X))^2 | X=x ]
(Y,X) = future data
fhat() = estimate based on past i.i.d. data (xi,yi)
future and past data are assumed independent of each other
Decomposition, pointwise at one x:
PMSE(fhat(x)) = V[Y|X=x] + MSE(fhat(x))
= V[Y|X=x] + V(fhat(x)) + Bias(fhat(x))^2
^^^^^^^^
variation
around f(x)
. Expectation across x-values:
E[PMSE(fhat(X))] = E[V[Y|X]] + E[MSE(fhat(X))]
^^^^^^^^^ ^^^^^^^^^^^^^^^
independent criterion one
of bandwidth wants to minimize
over bandwidths
=> Minimize expected PMSE instead of MSE
??? how do we estimate the PMSE ???
. CROSS-VALIDATION:
+ Split data into training and holdout data (9/10 and 1/10, or 2/3 and 1/3).
+ Fit fhat(x) to training data.
+ Use fhat(x) to predict y on holdout data.
+ Estimate of E[PMSE]: ave((yi-fhat(xi))^2 | i in holdout data)
+ Repeat and average again.
Machine Learning convention:
Randomly partition data into 10 disjoint tenths;
use each tenth as holdout;
average over the ten runs.
There is evidence that 1/10 is too little for holdout,
and 10 repetitions are too few for stable averaging.
Hence a probably better rule:
Split data randomly into 2/3 and 1/3;
use 1/3 as holdout;
repeat 100 to 500 times and average.
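A minimal sketch of the index bookkeeping for both conventions
(sample()-based; the variable names are ours):
n <- 506                                      # e.g., nrow(boston)
# 10-fold convention: random disjoint tenths
folds <- sample(rep(1:10, length.out=n))
holdout.fold1 <- (folds == 1)                 # logical holdout vector, fold 1
# repeated-split convention: one random 2/3-1/3 split
nhold <- round(n/3)
holdout <- sample(c(rep(TRUE, nhold), rep(FALSE, n-nhold)))
The cv() function below uses the same template-shuffling trick.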
----------------
* Cross-validated bandwidth choice:
. R code for cross-validation of 'smooth()':
# Function to perform one holdout operation and return one CV value for PMSE
cv.once <- function(bw, xy, holdout, smooth) { # 'holdout' = logical vector
  train <- !holdout                            # assumed logical, not enumeration
  xy.train  <- xy[train,]                      # unpacking
  x.holdout <- xy[holdout,1]                   #     "
  y.holdout <- xy[holdout,2]                   #     "
  sm <- smooth(xy.train, bw=bw)                # fhat on training
  f.holdout <- approx(sm, xout=x.holdout, rule=2)$y # fhat on holdout
  return(mean((y.holdout-f.holdout)^2))        # PMSE^ from holdout
}
# Small difficulty: evaluating the PMSE requires interpolating fhat from the
# training grid to the holdout x-values (and extrapolating flat, via rule=2,
# when a holdout x falls outside the training range).
# This is achieved with the R function 'approx()'; see help(approx).
# Try on Boston housing data:
xy <- as.matrix(boston[,c("LSTAT","MEDV")])
cv.once(.1, xy, c(rep(T,100),rep(F,406)), smooth=smooth)
# Function to perform 100 random holdouts and return their PMSE values:
cv <- function(bw, xy, holdout=1/3, times=100, verbose=50, smooth) {
pmses <- rep(NA, times)
n <- nrow(xy)
nholdout <- round(n*holdout)
holdout.template <- c(rep(T,nholdout), rep(F,n-nholdout))
for(icv in 1:times) {
pmses[icv] <- cv.once(bw, xy, holdout=sample(holdout.template), smooth)
if(verbose>0) if(icv%%verbose==0) cat(icv,"")
}
return(pmses)
}
cv(bw=.1, xy, smooth=smooth)
pmses1 <- cv(bw=.1, xy, smooth=smooth)
hist(pmses1, col="gray")
# Obtain CV estimates of the PMSE for a large number of bandwidths:
# Exponential ladder of bandwidths, factor=1.25
bws <- 1.25^((-25):(+5))
pmse.bws <- rep(NA, length(bws))
for(i in 1:length(bws)) {
pmse.bws[i] <- mean(cv(bw=bws[i], xy, holdout=1/3, times=400, verbose=0, smooth=smooth))
# pmse.bws[i] <- median(cv(bw=bws[i], xy, holdout=1/3, times=400, verbose=0, smooth=smooth))
cat("----- done with bandwidth",bws[i],"(",i,")--------------\n")
}
# Plot PMSEs as a function of bandwidth:
par(mfrow=c(1,1))
plot(bws, pmse.bws, type="h"); lines(bws, pmse.bws)
# Plotting against the index is equivalent to plotting against log(bw), since the ladder is geometric:
plot(pmse.bws, type="h", xlab="log bw"); lines(pmse.bws)
# => In prediction, bias can kill, variance cannot
# The PMSE profile indicates where the best bandwidths lie; two near the minimum:
plot(xy, pch=16, cex=.5)
lines(smooth(xy, bw=bws[9]), col="red", lwd=3)
lines(smooth(xy, bw=bws[11]), col="blue", lwd=3)
# Seems, though, that more smoothness would please the eyes; for contrast, also the extremes of the ladder:
lines(smooth(xy, bw=bws[14]), col="black", lwd=3)
lines(smooth(xy, bw=bws[30]), col="black", lwd=3)
lines(smooth(xy, bw=bws[1]), col="black", lwd=3)
# => The bias-variance trade-off generated by PMSE minimization with CV
# is not visually optimal; it follows data too closely.
# => Good MSE asks for less bias/more variance than our visual judgment.
General conclusions:
--------------------------------------------------
| Good performance in prediction (MSE) requires |
| following the data more closely |
| i.e., with less bias, more variance |
| whereas |
| good interpretability requires |
| greater simplicity |
| i.e., more bias, less variance |
--------------------------------------------------
Which is worse wrt potentially catastrophic performance?
-------------------------------------------------
| Variance is more benign than bias because |
| bias can be unlimited, whereas variance |
| is limited to V[error] by following the data. |
-------------------------------------------------
CV-based choice can be transferred to regression and model
selection because CV and PMSE minimization can be applied to
model search in general.
General conclusion transferred to regression:
- for prediction: when in doubt, keep it in
- for interpretability: when in doubt, throw it out
* LOCAL LINEAR SMOOTHING:
- Problem with naive local ('locally constant') averaging: boundary bias
- Solution: Fit local lines in windows, evaluate at points of interest
(Call the above smooth1() and smooth() 'local constant smoothers'.)
- Needed: Weighted straight line fits
- Formulas for weighted line fit given weights wi >= 0, sum wi = 1:
Numerically stable version based on centered x and y.
xbar = sum(w*x) # Weighted mean of x values
ybar = sum(w*y) # Weighted mean of y values
x.y = sum(w*(x-xbar)*(y-ybar)) # Weighted inner product
x.x = sum(w*(x-xbar)^2) # Weighted squared norm
b = x.y/x.x # Weighted LS slope
yhat = ybar + b*(x-xbar) # Weighted straight line fit
This solves: sum_i wi*(yi - a - b*xi)^2 = min_{a,b}
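A quick numerical check of these formulas against R's lm() with weights
(toy data; the names xw, yw are ours):
set.seed(2)
xw <- runif(20); yw <- 1 + 2*xw + rnorm(20)   # toy data
w <- runif(20); w <- w/sum(w)                 # weights with sum(w) == 1
xbar <- sum(w*xw); ybar <- sum(w*yw)
b <- sum(w*(xw-xbar)*(yw-ybar)) / sum(w*(xw-xbar)^2)
c(a=ybar - b*xbar, b=b)                       # should match...
coef(lm(yw ~ xw, weights=w))                  # ...lm's weighted LS fit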
- R code:
smooth.lin <- function(xy, bw=1/5, nloc=51) {
x <- xy[,1]; y <- xy[,2] # unpack the data
xmin <- min(x); xmax <- max(x) # needed for the bandwidth
xs <- seq(xmin, xmax, length=nloc) # the grid points
fx <- rep(NA, nloc) # allocate space for the fits
act.bw <- bw * (xmax-xmin) # the actual bandwidth
for(i in 1:nloc) { # for all grid points ...
w <- dnorm(x=x-xs[i], sd=act.bw) # normal or Gaussian kernel weights
w <- w/sum(w) # force sum(w)=1
xbar <- sum(w*x) # see the formulas above...
ybar <- sum(w*y)
x.y <- sum(w*(x-xbar)*(y-ybar))
x.x <- sum(w*(x-xbar)*(x-xbar))
fx[i] <- x.y/x.x*(xs[i]-xbar) + ybar # evaluate the line at the grid point
}
xf <- cbind(xs,fx) # Return the grid and its fitted values
if(!is.null(colnames(xy))) colnames(xf) <- colnames(xy)
return(xf)
}
- Applied to Boston's MEDV and LSTAT:
xy <- boston[,c("LSTAT","MEDV")]
plot(xy, pch=16, cex=.5)
lines(smooth.lin(xy, bw=1/5), col="blue", lwd=4)
lines(smooth.lin(xy, bw=1/10), col="green", lwd=4)
lines(smooth.lin(xy, bw=1/15), col="brown", lwd=4)
lines(smooth(xy, bw=1/5), col="orange", lwd=4, lty=3)
lines(smooth(xy, bw=1/10), col="red", lwd=4, lty=3)
lines(smooth(xy, bw=1/20), col="purple", lwd=4, lty=3)
==> The local linear fit with bw=1/5 may have bias similar to
the local constant fit with bw=1/20,
but the latter has more variance (wiggles).
- CV with smooth.lin(): use larger bandwidths; shift the ladder upward (exponents (-20):15 instead of (-25):5)
bws <- 1.25^((-20):(+15))
pmse.bws <- rep(NA, length(bws))
for(i in 1:length(bws)) {
pmse.bws[i] <- mean(cv(bw=bws[i], xy, holdout=1/3, times=100, verbose=0, smooth=smooth.lin))
cat("----- done with bandwidth",bws[i],"(",i,")--------------\n")
}
# Plot PMSEs as a function of bandwidth:
par(mfrow=c(1,1))
plot(pmse.bws, type="h", xlab="log bw"); lines(pmse.bws)
# Even for large bandwidths the bias is smaller,
# about 40 compared to about 80 for the running mean 'smooth()'.
# What do the smooths near the PMSE bottom look like?
plot(xy, pch=16, cex=.5)
lines(smooth.lin(xy, bw=bws[6]), col="blue", lwd=2)
----------------
* 2D-SMOOTHING:
- Two predictors x1 and x2.
- Local averaging:
. Define 'windows' in the x1-x2 plane
Dilemma: Not only what size, but what shape of window?
==> We use Euclidean neighborhoods after standardizing x1 and x2.
==> Can be done by using bandwidths prop. to sd(x1) and sd(x2).
. Need double-grid of points for evaluation.
. R code:
smooth2d <- function(xxy, bw1=1/3, bw2=bw1, nloc1=21, nloc2=nloc1) {
x1 <- xxy[,1]; x2 <- xxy[,2]; y <- xxy[,3] # unpack
x1min <- min(x1); x1max <- max(x1) # for grid
x2min <- min(x2); x2max <- max(x2) # ditto
x1s <- seq(x1min, x1max, length=nloc1) # x1 grid
x2s <- seq(x2min, x2max, length=nloc2) # x2 grid
fxx <- matrix(NA, nrow=nloc1, ncol=nloc2) # fits: nloc1*nloc2
act.bw1 <- sd(x1) * bw1 # x1 bandwidth
act.bw2 <- sd(x2) * bw2 # x2 bandwidth
for(i in 1:nloc1) { # loop over all combinations..
for(j in 1:nloc2) { # of x1 and x2 grid values
w <- dnorm(x=x1-x1s[i],sd=act.bw1) * # weights: product of
dnorm(x=x2-x2s[j],sd=act.bw2) # Gaussian kernels
fxx[i,j] <- sum(y*w) / sum(w)
}
}
return(list(x=x1s, y=x2s, z=fxx)) # return the two grids and the fits
} # note: length(z)=length(x)*length(y)
xxy <- boston[,c("RM","LSTAT","MEDV")]
sm2d <- smooth2d(xxy)
Perspective plots, first with hidden lines shown: not very good...
persp(sm2d)
With hidden lines removed and the surface colored and shaded:
persp(sm2d, ltheta=0, lphi=45, shade=0.9, col="orange")
Animate to search for informative views:
for(iplt in 0:10000) {
persp(sm2d, theta=0.1*iplt, phi=15+iplt*0.02, r=sqrt(3),
shade=0.5, col="yellow",
main=iplt, xlab="RM", ylab="LSTAT", zlab="MEDV") -> res
points(trans3d(xxy[,1], xxy[,2], xxy[,3], pmat = res),
col=2, pch=16, cex=.5)
}
We added the 3D points, not very successfully, because we can't tell
which are above and which are below the surface.
- [Fun project: Make this movie interactive with 'getGraphicsEvent()'
by controlling 'theta' and 'phi' with 'onMouseDown=...'
and maybe 'r' with 'onKeybd=...' to move in and out.]
- Lessons:
. 2D smoothing is somewhat less successful than 1D smoothing.
. Beware of extrapolation in 2D: lots of empty space between data points
. 3D smoothing: multiply the problems...
----------------
* OTHER SMOOTHING PRINCIPLES:
- Basis expansions: expand a predictor x to a basis p1(x),...,pk(x)
and perform multiple regression onto X <- cbind(p1(x),...,pk(x))
Examples:
. Global polynomials (-):
|y - (b0 + b1*x + b2*x^2 + ... + bk*x^k)|^2 = min_b
=> multiple linear regression with monomial basis
X <- cbind(1, x, x^2, ..., x^k)
# Not numerically stable! Use orthogonal polynomials instead!
xy <- boston[,c("MEDV","LSTAT")]
xy <- cbind(xy, xy[,"LSTAT"]^2, xy[,"LSTAT"]^3, xy[,"LSTAT"]^4, xy[,"LSTAT"]^5, xy[,"LSTAT"]^6)
ord <- order(xy[,"LSTAT"])
xy <- xy[ord,]
plot(xy[,c("LSTAT","MEDV")], pch=16, cex=.5)
with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:3])), lwd=2, col="red"))
with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:4])), lwd=2, col="red"))
with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:5])), lwd=2, col="red"))
with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:6])), lwd=2, col="red"))
with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:7])), lwd=2, col="red"))
# ==> Surprisingly well-behaved. For higher degrees, don't expect this.
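# As flagged above, the raw-power basis is numerically risky; a sketch of
# the stable alternative with orthogonal polynomials via R's poly(),
# at degree 6 to mirror the highest fit above:
xy0 <- boston[order(boston[,"LSTAT"]), c("LSTAT","MEDV")]  # sorted for lines()
fit6 <- lm(MEDV ~ poly(LSTAT, 6), data=xy0)
plot(xy0, pch=16, cex=.5)
lines(xy0[,"LSTAT"], fitted(fit6), lwd=2, col="blue")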
. Piecewise polynomials: REGRESSION SPLINES
=> multiple linear regression with piecewise polynomial basis
+ 1st order: X <- cbind(1, x, pmax(0,x-t1), pmax(0,x-t2), ..., pmax(0,x-tk))
+ 2nd order: X <- cbind(1, x, x^2, pmax(0,x-t1)^2, pmax(0,x-t2)^2, ..., pmax(0,x-tk)^2)
+ 3rd order: X <- cbind(1, x, x^2, x^3, pmax(0,x-t1)^3, pmax(0,x-t2)^3, ..., pmax(0,x-tk)^3)
...
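A minimal sketch of a 1st-order regression spline on the Boston data
(the knots at the LSTAT quartiles are our choice):
x <- boston[,"LSTAT"]; y <- boston[,"MEDV"]
knots <- quantile(x, c(.25, .5, .75))                    # assumed knot placement
X <- cbind(x, sapply(knots, function(t) pmax(0, x-t)))   # truncated power basis
fit <- lm(y ~ X)                                         # lm() adds the intercept
ord <- order(x)
plot(x, y, pch=16, cex=.5, xlab="LSTAT", ylab="MEDV")
lines(x[ord], fitted(fit)[ord], lwd=2, col="brown")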
- Penalization: smoothing splines/kernelizing
based on generalized Ridge regression
. General:
fhat = argmin_f sum((y - f)^2) + lambda * t(f) %*% Q %*% f
where Q is symmetric and +semi-definite.
==> Generalized Ridge Regression with design matrix X = diag(n).
"Normal Equations": - y + f + lambda Q %*% f = 0
--------------------------------------------
| |
| fhat = solve(diag(n) + lambda*Q) %*% y |
| |
--------------------------------------------
. Two intuitions for generating penalties:
+ Penalize generalized differences of nearby locations
Examples, assuming x sorted and unique
x <- boston[,"LSTAT"]
y <- boston[,"MEDV"]
n <- length(x)
# parallel sort of (x,y) according to x:
ord <- order(x)
x <- x[ord]
y <- y[ord]
# make unique x values by averaging x-ties:
xu <- tapply(x, x, mean) # same as unique(x) for sorted x; check: max(abs(xu - unique(x))) == 0
yu <- tapply(y, x, mean)
nu <- length(xu)
# Dirty: We'll ignore the increased precision of yu-values at ties...
# We'd really have to use weighted penalized LS...
Example 1: Penalize first order differences
D1 <- cbind(diag(nu-1),0) - cbind(0,diag(nu-1)); D1[1:3,1:4]; dim(D1)
# Question: What is D1 %*% f ?
Q <- t(D1) %*% D1
# Question: What is t(f) %*% Q %*% f ?
# Now construct smooths according to the above formula:
lambda <- 1000
S <- solve(diag(nu) + lambda*Q)
fu.hat <- S %*% yu
plot(xu, yu, pch=16)
lines(xu, fu.hat, lwd=2, col="red")
# Try larger lambdas and repeat...
Example 2: Penalize second order differences
D2 <- cbind(0,D1[-1,-1]) - cbind(D1[-1,-1],0); D2[1:3,1:4]; dim(D2)
# Question: What is D2 %*% f ?
Q <- t(D2) %*% D2
# Question: What is t(f) %*% Q %*% f ?
# Now construct smooths according to the above formula:
lambda <- 100000
S <- solve(diag(nu) + lambda*Q)
fu.hat <- S %*% yu
plot(xu, yu, pch=16)
lines(xu, fu.hat, lwd=2, col="red")
# Try other lambdas and repeat...
Criticisms:
(1) We penalized small and large spacings equally.
This is not right.
==> Penalize smaller spacings more by weighting with 1/space^2.
I.e., penalize difference quotients, not differences.
Q <- t(D1) %*% diag(1/diff(xu)^2) %*% D1
(2) If the responses y are thought of as y = f(x) + eps
with independent errors eps, then differences are not
independent: y[2]-y[1] = f(x[2]) - f(x[1]) + eps[2] - eps[1]
y[3]-y[2] = f(x[3]) - f(x[2]) + eps[3] - eps[2]
are correlated: cor(eps[2]-eps[1], eps[3] - eps[2]) = -0.5
==> Decorrelate the difference quotients before penalizing
their sum of squares: Q <- t(D1) %*% solve(C) %*% D1,
where C is the correlation matrix of the differences.
Both criticisms are taken care of by solving the spline approach:
f = argmin_f sum (yi - f(xi))^2 + lambda * Integral (f''(x))^2 dx
==> Smoothing splines in Sobolev function spaces.
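For reference, R's built-in smooth.spline() fits this criterion; a quick
try on the x, y from above (defaults used, so the smoothing parameter is
chosen by generalized cross-validation):
ss <- smooth.spline(x, y)       # penalized LS with Integral (f'')^2 penalty
plot(x, y, pch=16, cex=.5)
lines(ss, lwd=2, col="red")
ss$df                           # equivalent degrees of freedom (see below)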
+ Kernelizing in ML (Machine Learning): Kernels as similarity functions
Principle: For objects/cases i and j devise a similarity measure K[i,j]
such that K is symmetric and +semi-definite.
Then use Q = K^{-1} as the penalty matrix and X = diag(n).
Example of Gaussian Kernel:
x, y as above (LSTAT, MEDV from Boston housing data)
Define a similarity measure by
K[i,j] ~ exp(-(x[i]-x[j])^2/(2*sig^2))
K[i,i] = max K; K[i,j] ~ 0 if x[i] and x[j] are far apart.
# Code:
s <- sd(x)*1 # Kernel width: a smoothing parameter
K <- dnorm( outer(x, x, FUN="-"), sd=s) # The kernel matrix
# Visualize the kernel as a function on the x-x plane:
xu <- unique(x)
nu <- length(xu)
sel <- match(xu,x)
Ku <- K[sel,sel]
plot(outer(xu, rep(1,nu)), outer(rep(1,nu),xu), xlab="x", ylab="x",
col=heat.colors(100)[1+Ku/max(Ku)*99], pch=16, cex=.3)
# Ok, with s = sd(x) we think we have a reasonable kernel width.
# We would like to form K^{-1}, but this would be catastrophic:
K.eig <- eigen(K); K.eig$val[1:20]
par(mfrow=c(4,3), mar=rep(1,4))
for(j in 1:12) {
plot(x, K.eig$vec[,j], type="b", pch=16, cex=.5, xaxt="n", yaxt="n", xlab="", ylab="")
text(par()$usr[1], par()$usr[3], lab=round(K.eig$val[j],6), adj=c(-1,-1)) }
# Select eigenvalues > 1E-8 of the total (= trace) and preproject onto this subspace:
sel <- (K.eig$val > 1E-8 * sum(diag(K))); sum(sel)
lambda <- .0001
S <- K.eig$vec[,sel] %*% diag(1/(1 + lambda/K.eig$val[sel])) %*% t(K.eig$vec[,sel])
f <- S %*% y
par(mfrow=c(1,1))
plot(x, y, pch=16, ylim=c(0,max(y)))
lines(x, f, lwd=2, col="red")
# ==> Results quite satisfactory, but require
# considerable tuning of two parameters:
# kernel width s and penalty multiplier lambda.
Remaining question:
Who says that Gaussian kernels are non-negative semi-definite?
They are, but why?
==> Kernelizing algebra !
. Degrees of freedom for smoother:
+ The following produces a more intuitive way of describing the
flexibility of the smoother
+ Observation:
For fixed bandwidth, many smoothers are linear in the sense that
fhat = S %*% y
for some nxn smoother matrix S.
Projection smoothers: S = X %*% solve(t(X)%*%X) %*% t(X)
where X is a basis expansion of x
Shrinking smoothers: S = solve(I + lambda*Q)
where Q = difference penalty
or inverse of a ML kernel
+ Generalized degrees of freedom for linear smoothers:
df := trace(S)
Example: Obtain df(S) above for various values of lambda
and show the corresponding dfs --- we can get the trace
cheaply by observing that the eigenvalues satisfy:
eval(S) = 1/(1 + lambda/eval(K))
Reason: S = (I + lambda*K^{-1})^{-1}
[Recall a homework with a section on "Spectral Theory".]
sapply(c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0), # grid of lambdas
       function(lambda) { sum(diag(1/(1 + lambda/K.eig$val[sel]))) } )
# E'vals of S(lambda):        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
## [1] 11.050651  9.866161  8.549137  7.057659  5.338530  3.333074
#  lambda =  0.00001   0.0001    0.001     0.01      0.1      1.0
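# Sanity check (our addition): forming S explicitly for one lambda
# reproduces the corresponding trace above:
lambda <- 0.001
S <- K.eig$vec[,sel] %*% diag(1/(1 + lambda/K.eig$val[sel])) %*% t(K.eig$vec[,sel])
sum(diag(S))                    # equals the third entry above, ~8.549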
- 'Trend Filtering': This is a recent development in smoothing.
Ryan Tibshirani (son of Rob Tibs) has excellent work on this.
It competes with the most advanced smoother we know of:
Sara van de Geer's and Enno Mammen's "Locally Adaptive Regression Splines" (1997)
which is computationally of order n^3.
- Some distinctions and confusions:
. Parzen kernels =/= RKHS/ML kernels
Our initial smoothers are "Parzen kernel smoothers".
They are not to be confused with kernels of RKHS theory!
+ Parzen kernels regularize by mimicking convolution.
Their principle is local averaging.
+ RKHS methods regularize through generalized Ridge regression.
Their principle is penalizing rough functions.
. Regression splines =/= smoothing splines
+ Regression splines: a basis expansion approach
Estimation: LS
Smoothing parameter: number of basis vectors
+ Smoothing/kernelizing splines: a penalty approach
Estimation: penalized LS
Smoothing parameter: penalty multiplier
+ The two approaches can be combined, as proposed by Paul Eilers and Brian Marx (P-splines):
Do not saturate the basis expansion (number of basis vectors < n)
but still penalize higher-order basis vectors
================================================================