STAT 926 - Recaps and Roadmaps

LECTURE 2, 2014/09/03
* RECAP:
 This course:
. Prerequisites: Stat 541
. Multivariate extension of Stat 541
 R infrastructure
 Data visualization / statistical graphics
. Taxonomy of multivariate data viz
. Graphics systems in R:
'base'
'lattice'
'ggplot2'
 Datasets:
. tips data
. earnings data
 Data viz methods:
. univariate (1D):
+ histograms: parameter = ...
MESSAGE: ...
functions: ...
[catch up: didn't do the 'lattice' fct histogram(); see Notes]
+ density plots: parameter = ...
+ violin plots
+ boxplots
. bivariate (2D):
+ scatterplots if x and y are both quantitative (R: numeric)
functions: ...
+ comparison boxplot if x is ... and y is ...
functions: ...
+ reverse: (see HW 1) if x is quantitative and y is categorical

* ROADMAP:
 Data viz (contd.)
 PCA recap: goal, technique, criterion, interpretation
================================================================
LECTURE 3, 2014/09/08
* ORG: Tonight Homework 1 will be posted.
Will be discussed today in this class.
* RECAP:
 This course: Multivariate extension of Stat 541
 Chapter 1: Data visualization / statistical graphics
. Taxonomy of multivariate data viz
. Graphics systems in R:
'base'
'lattice'
('ggplot2')

* ROADMAP:
 Data viz (contd.)
================================================================
LECTURE 4, 2014/09/10
* ORG: HW 1 is posted, due Mon, Sept 22
================================================================
LECTURE 5, 2014/09/15
* RECAP:
 Interactive dynamic graphics:
. linked brushing
. 3D rotations and higher-dimensional rotations (quantitative variables)
. rescaling of axes
==> ggobi from www.ggobi.org, standalone and R-linked
 R: function 'getGraphicsEvent()'
To learn about this function, see
help(getGraphicsEvent)
Here is a simple nonstatistical demo of the getGraphicsEvent() function:
source("http://stat.wharton.upenn.edu/~buja/STAT926/etchasketch.R")
Examine the code and cannibalize it to do something useful.
* ROADMAP:
 More examples of statistical inference for data viz
 Classical eigenvalue-based methods of multivariate analysis
================================================================
LECTURE 6, 2014/09/17
* RECAP:
 Multivariate analysis is characterized by ...
 PCA:
. Assumptions about samples?
. What is the object being optimized?
. What is the optimization criterion?
. What is the constraint?
. What is the issue with the constraint?
. What is PCA's Rayleigh quotient Ray(a)?
. What are the stationary equations for Ray(a)?
. What is the value of Ray(a) at stationary solutions?
. What is the interpretation of Ray(a) as variance?
. What is the sum of Ray(a) over stationary solutions?

* ROADMAP:
 PCA projections/scores - linear dimension reduction...
================================================================
LECTURE 7, 2014/09/22
* ORG: HW 1 due tonight
* RECAP:
 PCA:
First decide correlation- or covariance-based PCA:
X = scale(X, center=T, scale=F) # Covariance-based PCA
X = scale(X, center=T, scale=T) # Correlation-based PCA
Then:
cov(X) = t(X) %*% X / (n-1) # = cor(X) if X is standardized
cov(X) = V %*% diag(Lambda) %*% t(V) # with eigen()
V[,j] = j'th eigenvector of cov(X)
Lambda[j] = j'th eigenvalue of cov(X)
cov(X %*% V) = t(V) %*% cov(X) %*% V = diag(Lambda)

 cov(X %*% V) = diag(Lambda) 

Sd := sqrt(Lambda) principal sdevs
Lambda = Sd^2 principal variances
PCA budget:
sum(Lambda) = trace(cov(X)) = var(X[,1])+...+var(X[,p])
= p # if correlation-based
 PCA scores: coordinates of cases in e'vec basis V
S = X %*% V (Nxp) coord. transf. to V-basis
S[,j] = X %*% V[,j] (Nx1) j'th PC score column
S[i,] = X[i,,drop=F] %*% V (1xp) PC scores of i'th case
mean(S[,j]) = 0 because X is centered
var(S) = t(S) %*% S / (n-1)
= t(V) %*% t(X) %*% X %*% V / (n-1)
= t(V) %*% var(X) %*% V
= diag(Lambda)
==> sd(S[,j]) = sqrt(Lambda[j]) = Sd[j]
cor(S[,j],S[,k]) = 0 for j != k
Correct scores plot: identical axes !
Score vectors are ALWAYS uncorrelated,
even if their score plots look correlated !!!
==> Train your eyes...
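These score properties can be verified numerically; a minimal R sketch on simulated data (the matrix X here is hypothetical, not one of the course datasets):

```r
# Simulated centered data matrix (illustration only)
set.seed(1)
n <- 100; p <- 4
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
eig <- eigen(cov(X))
V <- eig$vectors; Lambda <- eig$values
S <- X %*% V                       # PC scores
colMeans(S)                        # ~ 0 because X is centered
round(cov(S) - diag(Lambda), 10)   # ~ 0: scores uncorrelated, variances = Lambda
```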
 PCA loadings: rescaled eigenvectors
. Loading vectors:
L = V %*% diag(Sd)
L[,j] = V[,j]*Sd[j]
. Motivating property of loadings:
cov(X,S) = t(X) %*% S / (n-1)
= t(X) %*% X %*% V / (n-1)
= (t(X) %*% X / (n-1)) %*% V
= cov(X) %*% V
= V %*% diag(Sd)^2 %*% t(V) %*% V
= V %*% diag(Sd)^2
= L %*% diag(Sd)
If X is standardized, then:
cor(X,S) = cov(X,S %*% diag(1/Sd)) = L
cor(X[,j],S[,k]) = L[j,k]
==> L[,k] is the eigenvector V[,k] rescaled so this holds:

 cor(X,S) = L  for correlation-based PCA (X standardized) 
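A quick numerical check of this fact on simulated standardized data (a sketch; the data are hypothetical):

```r
set.seed(2)
X <- scale(matrix(rnorm(200 * 3), 200, 3))  # standardized: correlation-based PCA
eig <- eigen(cor(X))
V <- eig$vectors; Sd <- sqrt(eig$values)
S <- X %*% V                                 # PC scores
L <- V %*% diag(Sd)                          # loadings = rescaled eigenvectors
max(abs(cor(X, S) - L))                      # ~ 0, confirming cor(X,S) = L
```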

 SVD:
. What does the SVD of X look like?
X = U %*% diag(D) %*% t(V)
. What is a natural criterion for SVDs?
R(u,v) = (t(u) %*% X %*% v)^2 / (|u|^2 * |v|^2)
. What are the stationary equations?
u/|u| = X %*% v / (d |v|)
v/|v| = t(X) %*% u / (d |u|)
where d = R(u,v)^{1/2}
. How do singular vectors and values of X relate to PCA of X?
cov(X) = t(X) %*% X / (n-1) = V %*% diag(D)^2 %*% t(V) / (n-1)
Lambda = D^2/(n-1)
Sd = D/sqrt(n-1)
V = V (left from eigen(), right from svd())
S = X %*% V = U %*% diag(D)
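The eigen()/svd() correspondences above can be confirmed directly in R (sketch on simulated, centered X):

```r
set.seed(3)
n <- 50; p <- 3
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
sv  <- svd(X)
eig <- eigen(cov(X))
max(abs(eig$values - sv$d^2 / (n - 1)))     # Lambda = D^2/(n-1)
max(abs(abs(sv$v) - abs(eig$vectors)))      # V agrees up to column signs
max(abs(X %*% sv$v - sv$u %*% diag(sv$d)))  # scores: X V = U D
```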
* ROADMAP
 PCA in practice
 PCA inference
================================================================
LECTURE 8, 2014/09/24
* ORG: HW 2 will be posted soon
* RECAP: PCA in practice, and inference for PCA
 Decision:
. covariance-based PC: center X (X <- scale(X, center=T, scale=F))
. correlation-based PC: standardize X (X <- scale(X, center=T, scale=T))
 PC variance = eigenvalue = V[ X v ], v = eigenvector of var(X)
. PC sdev = root of PC eigenvalue
. individual % of var accounted for = lambda.j / sum{k} lambda.k *100
. cumulative % of var accounted for = sum{k<=j} lambda.k / sum{k} lambda.k *100
. individual % relative to rest = lambda.j / sum{k=j...p} lambda.k *100
==> Plot eigenvalue profiles
==> Decide how many of the large PCs are 'real', 'significant'
==> Inference problem
Kaiser's eigenvalue-1 rule:
In correlationbased PCA, retain the PCs with eigenvalues > 1.
Reasoning: One variable has variance 1.
Adopt the convention of considering PCs as large
if they account for more than
one variable's worth of variance.
Hence retain PCs with lambda.j > 1.
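As an R sketch (hypothetical standardized data), Kaiser's rule is a one-liner:

```r
set.seed(4)
X <- scale(matrix(rnorm(100 * 6), 100, 6))  # standardized, p = 6
lambda <- eigen(cor(X))$values
sum(lambda)        # = p = 6: the correlation-based PCA budget
which(lambda > 1)  # PCs retained by Kaiser's eigenvalue-1 rule
```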
 PC score vector = X v, v = eigenvector of var(X)
==> projection of rows of X onto direction v
==> used for plotting
Keep in mind:
o Use identical axes for PC size comparison.
o By construction these variables are uncorrelated
even if they look highly structured.
 Loading vector = eigenvector scaled to have norm == PC sdev
==> L[j,k] = cor(X[,j],S[,k])
==> Shows how much X[,j] 'loads' on k'th PC
==> Interpret the 'meaning' of each PC
 Inference for eigenvalues:
. Test the hypothesis of independence of all variables:
Obtain conditional null data by randomly permuting within the columns.
Plot their PC eigenvalue profiles:
plot(eigen(cor(X))$val, type="n")
for(i in 1:100) lines(eigen(cor(apply(X, 2, sample)))$val, col="gray")
Compare with the observed eigenvalue profile:
lines(eigen(cor(X))$val, lwd=2)
Issues: This is a reasonable test for the largest and smallest e'val,
but not for the e'vals between.
Unreasonable because for e'vals 2...p-1 we probably
have other null hypotheses in mind.
Still, it's better than using Kaiser's e'val-1 rule.
See Notes for discussion of attempts at solutions.
. Confidence intervals for eigenvalues from nonparametric bootstrap:
Issue: Observed eigenvalues are biased estimates.
E[ lambda.1.estimate ]
== E[ max{|v|=1} v^T (X^T X)/(n-1) v ]
>= max{|v|=1} v^T E[ (X^T X)/(n-1) ] v
== lambda.1.population
Bias is substantial when p is a substantial fraction of N.
==> Use correctly inverted bias-corrected bootstrap intervals.
See Notes.
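A nonparametric-bootstrap sketch for eigenvalue intervals (simulated data; plain percentile intervals are shown for brevity, not the correctly inverted bias-corrected intervals the Notes recommend):

```r
set.seed(5)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)  # hypothetical data
boot.lam <- replicate(500,
  eigen(cor(X[sample(n, replace = TRUE), ]))$values)  # p x 500 matrix
apply(boot.lam, 1, quantile, probs = c(0.025, 0.975)) # naive per-eigenvalue CIs
```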
* ROADMAP:
 Finish up PCA in practice and inference
 Odds and ends
 Canonical correlation analysis (CCA)
================================================================
LECTURE 9, 2014/09/29
* ORG: HW 2 will be posted next week
* RECAP and DISCUSSION:
 Tibshirani's talk:
. Says they have a conditional test given previous eigenvalues and
eigenvectors.
. Hitch: It's based on normal theory and sensitive to nonnormality.
. Why does similar sensitivity to normality not occur in regression?
Zongming's insight: CLT effect in linear regression
. What would the Taylor-Tibs test achieve?
It would account for having chance-capitalized in the first PCs
even if all eigenvalues were identical.
 Odds and ends
* ROADMAP: CCA
================================================================
LECTURE 10, 2014/10/01
* ORG: HW 2 will be posted next week
* RECAP: CCA
 What is the CCA problem?
...
 Why is the CCA problem well-defined,
as opposed to the naive PCA problem?
...
 How unique are the CCA coefficient vectors?
...
 How does prewhitening work in CC?
...
What does it mean?
...
What does it accomplish for CCA?
...
 What is the geometric interpretation of CCA
in terms of certain data subspaces?
...

* ROADMAP:
 CCA ctd.
 GCA
================================================================
LECTURE 11, 2014/10/06
* ORG: HW 2 will be posted this week
* RECAP: CCA
 The criterion?
 Comparison of CCA and PCA?
 CCA via SVD after prewhitening?
 What is prewhitening statistically?
 What is prewhitening geometrically?
 A variance criterion for CCA?
 What does it mean if all cancorrs are 1?
 Is it possible that a cancorr is negative?
 Do you expect cancorr estimates to be biased?
 Inference for CCA: use the two universal hammers
 Reminders:
. Like regression, CCA suffers from block-internal collinearity.
. Unlike regression, CCA coefficients require some normalization.
What standardization would you propose to see coeff strength quickly?
. What is the R-square of the regression of a Y-canonical variate on X?
Vice versa?
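The prewhitening-plus-SVD route to CCA can be sketched and checked against base R's cancor() (simulated X and Y; an illustration, not the course's implementation):

```r
set.seed(6)
n <- 300
X <- matrix(rnorm(n * 3), n, 3)
Y <- X %*% matrix(rnorm(3 * 2), 3, 2) + matrix(rnorm(n * 2), n, 2)
X <- scale(X, scale = FALSE); Y <- scale(Y, scale = FALSE)  # center both blocks
Xw <- X %*% solve(chol(cov(X)))  # prewhiten: cov(Xw) = I
Yw <- Y %*% solve(chol(cov(Y)))  # prewhiten: cov(Yw) = I
d <- svd(t(Xw) %*% Yw / (n - 1))$d  # singular values = canonical correlations
max(abs(d - cancor(X, Y)$cor))      # ~ 0: agrees with base R's cancor()
```

Any whitening works here, because two whitenings of the same block differ only by an orthogonal rotation, which leaves singular values unchanged.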
* ROADMAP: GCA - Generalized Canonical Analysis
 GCA as an umbrella for PCA and CCA
 A dead-end approach to GCA
 The proper approach
================================================================
LECTURE 12, 2014/10/08
* ORG: HW 2 will be posted this week
* RECAP: GCA
 Criterion?
 Principle for criteria in MA?
 GCA as umbrella: two special cases
. PCA
. CCA
* ROADMAP: Fisher scoring, dummies, more about GCA criteria, LDA
================================================================
LECTURE 13, 2014/10/13
* ORG: HW 2 will be posted ...
* RECAP: Fisher Scoring and CCA
 Fisher scoring = ... CCA with 2 categorical variables
 What's the issue with these X and Y matrices?
...
 What's an alternative to centering?
...
 Recall Matt's observation about the meaning of X^T Y?
... 2D table of counts
 Further thoughts on this interpretation:
...
 Generalities:
. two symmetric CCA criteria?
...
. two asymmetric CCA criteria?
...
* ROADMAP: QDA and LDA
================================================================
LECTURE 14, 2014/10/15
* ORG: HW 2 will be posted ...
* RECAP:
 QDA: ...
 LDA: ...
 Connection with CCA: ...
* ROADMAP:
 Multiple Correspondence Analysis
 PRINCALS ("PCA with Alternating LS")
 Nonlinear MA
================================================================
LECTURE 15, 2014/10/20
* ORG: HW 2 will be posted ...
* RECAP:
 PRINCALS:
. Motivation? Compare with Multiple Correspondence Analysis (MCA)
...
. Approach? What SVD fact is used?
...
. Is PRINCALS "hierarchical" in the sense of PCA, CCA, GCA?
...
. What insights are possible regarding number of reduced dimensions?
...
 Linear smoothers:
. What is "smoothing"? What do theoretical statisticians call it?
...
. What were the smoothers introduced in Stat 541?
...
. Were these smoothers linear? What does "linear" mean?
...
. Smoother matrix of a linear smoother: If you only have
a procedure implementing a linear smoother, how can you
obtain the columns of its smoother matrix?
...
. If you plot the entries of a row of the smoother matrix
as a function of x, what does it convey?
Think "local averaging".
...
. How do you obtain the smoother matrix for a polynomial
smoother? What warnings are you aware of regarding
numerical instability? How do you solve them?
...
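One way to answer the smoother-matrix question, as a sketch: run the unit vectors e_j through the smoother and collect the outputs as columns. The running-mean smoother below is an assumption chosen for illustration:

```r
# A simple linear smoother: centered running mean with window 3
# (boundary values come out NA from filter(); interior rows are what we inspect)
run.mean <- function(y) as.numeric(stats::filter(y, rep(1/3, 3), sides = 2))
N <- 10
S <- sapply(1:N, function(j) { e <- numeric(N); e[j] <- 1; run.mean(e) })
# S[, j] is the smoother applied to e_j, i.e., the j'th column of the
# smoother matrix; row i of S gives the local-averaging weights on y_1..y_N
S[5, ]  # interior row: weight 1/3 on positions 4, 5, 6, summing to 1
```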
* ROADMAP:
 Linear smoothers (contd.)
 Transformational Multivariate Analysis
================================================================
LECTURE 16, 2014/10/22
* ORG: HW 2 will be posted ...
* RECAP:
 Linear smoothers: N-vector fhat = S y
Interpret:
. columns of S
. rows of S
. eigen/singular values of S
 Types of linear smoothers?
. ...
. ...
. ...
* ROADMAP:
 Linear smoothers (contd.)
 Transformational MA
================================================================
LECTURE 17, 2014/10/27
* ORG: HW 2 will be posted ...
* RECAP:
 Linear smoothers: N-vector fhat = S y
Interpret:
. columns of S
. rows of S
. eigen/singular values of S
 Types of linear smoothers?
. ...
. ...
. ...
* ROADMAP:
 Transformational MA, in particular ACE et al.
================================================================
LECTURE 18, 2014/10/29
* ORG: HW 2 will be posted ...
* RECAP:
 ACE: describe ...
 Alternating projections:
Show that the convergence results for the power algorithm
for symmetric pos.semidef. matrices apply.
* ROADMAP:
 Transformational MA (contd.)
================================================================
LECTURE 19, 2014/11/03
* RECAP: ACE and APCs
 Additive models occur as building blocks in ACE and APC.
Explain how.
 Construct a supersized Ridge regression problem
for penalized additive regression
 The two purposes of smallest APCs
 The criterion, unpenalized
 The constraint, unpenalized
 The criterion, penalized
 The constraint, penalized
 Algorithm?
 Interpretation: If an APC has the form
f1(X1) + f2(X2) + ... ~ 0
and both f1(x1) and f2(x2) are strong (i.e., sd(...)>>0) and
monotone increasing, does this imply a positive or negative
association between X1 and X2, all else being constant?
* ROADMAP:
 ICA - Independent Component Analysis: one particular form
 FMA - Functional multivariate analysis: one particular form
 Subsequently: k-means, MDS
================================================================
LECTURE 20, 2014/11/05
* RECAP: ICA
 What is the goal of ICA?
 Can one always achieve the goal?
 What is the issue of ICA on multivariate normal data?
 What is required for identifiability of ICA?
 What is the natural preprocessing before doing ICA,
and what is the consequence for the resulting estimation problem?
 Outline approaches to ICA using tools you know from this class.
 Could you imagine other approaches?
* ROADMAP:
 FMA - Functional multivariate analysis: one particular form
 k-means clustering
 multidimensional scaling - MDS
 Kernelizing and kernel PCA
================================================================
LECTURE 21, 2014/11/10
* RECAP: FMA
 What makes an MA functional?
(We haven't really answered this Q.)
 What kinds of smoothing do you want to achieve in FMA?
 What are the two large smoother classes again?
 Considered analyzing a functional data table with SVDs: fSVD.
Reminders about plain SVD:
Q: If you are given u and v, what is the optimal d?
A: ...
 How can you perform an fSVD...
. based on the first smoothing class?
. based on the second smoothing class?
* ROADMAP:
 FMA - Functional multivariate analysis (contd.)
 k-means clustering
 multidimensional scaling - MDS
 Kernelizing and kernel PCA
================================================================
LECTURE 22, 2014/11/12
* RECAP:
 What makes a data analysis situation 'functional'?
. We're estimating discretized functions/signals/processes on a domain.
. Typical domains: time, space, frequency
. We only consider the case where all cases are sampled
at the same locations in the domain.
Examples: daily stock returns (case=company)
midday temperatures at 200 locations across the US
 What makes data two-way functional?
A: In the data table both rows and columns have underlying domains.
Ex.: Daily midday temperatures at 200 locations for 20 years
rows: locations, cols: days
 Recall two-way smoothing of an SVD...
* ROADMAP:
 FMA - Functional multivariate analysis (contd.)
 k-means clustering
 multidimensional scaling - MDS
 Kernelizing and kernel PCA
================================================================
LECTURE 23, 2014/11/17
* ORG:
 Thanksgiving week is off; no Mon, no Wed class
 Homework: Examine the two-way Lasso-regularized SVD
Due Mon, Dec 15, 2014
You can do anything you like:
. Implement and experiment
. Play with optimization schemes
. Examine whether crossvalidation works for choice of lambdas
. Prove asymptotic minimaxity ...
Submit a PDF file from LaTeX
with name 'hw02YourlastnameYourfirstname.pdf'
as an attachment
to stat926.at.wharton@gmail.com
with Subject: HW 2, 2014, YourLastName, YourFirstName
* ROADMAP:
 k-means clustering
 multidimensional scaling - MDS
 Kernelizing and kernel PCA
================================================================
LECTURE 24, 2014/11/19
* ORG:
 Thanksgiving week is off; no Mon, no Wed class
 Homework: Examine the two-way Lasso-regularized SVD
Due Mon, Dec 15, 2014
Discussion? Some lit refs were added to the latex file.
* RECAP: k-means clustering
. What's the idea of k-means clustering?
. Criterion?
. Variable scaling issue?
. Algorithm?
. What idea does the term 'clustering' conjure in your mind?
Does k-means live up to this idea?
. Where do you expect the centers to fall
in relation to PCs?
. This suggests what kind of plot?
. Issues with the k-means criterion?
... (Tell a story.)
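A base-R k-means sketch on simulated two-cluster data, with the centers viewed against the first PCs (data and settings are hypothetical):

```r
set.seed(8)
X <- rbind(matrix(rnorm(100, mean = 0), 50, 2),
           matrix(rnorm(100, mean = 3), 50, 2))
km <- kmeans(X, centers = 2, nstart = 20)  # nstart guards against local optima
pc <- prcomp(X)
## plot(pc$x[, 1:2], col = km$cluster, asp = 1)  # identical axes (asp = 1)!
km$centers  # expect centers near (0,0) and (3,3)
```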
* ROADMAP:
 hierarchical clustering
 multidimensional scaling - MDS
 Kernelizing and kernel PCA
================================================================