STAT 926 -- Recaps and Roadmaps
----------------------------------------------------------------

LECTURE 2, 2014/09/03

* RECAP:
  - This course:
    . Prerequisite: Stat 541
    . Multivariate extension of Stat 541
  - R infrastructure
  - Data visualization / statistical graphics
    . Taxonomy of multivariate data viz
    . Graphics systems in R: 'base', 'lattice', 'ggplot2'
  - Datasets:
    . tips data
    . earnings data
  - Data viz methods:
    . univariate (1D):
      + histograms:
          parameter = ...
          MESSAGE: ...
          functions: ...
          [catch up: didn't do the 'lattice' fct histogram(); see Notes]
      + density plots: parameter = ...
      + violin plots
      + boxplots
    . bivariate (2D):
      + scatterplots if x and y are both quantitative (R: numeric)
          functions: ...
      + comparison boxplots if x is ... and y is ...
          functions: ...
      + the reverse case, if x is quantitative and y is categorical (see HW 1)
  ------------
* ROADMAP:
  - Data viz (contd.)
  - PCA recap: goal, technique, criterion, interpretation

================================================================

LECTURE 3, 2014/09/08

* ORG: Homework 1 will be posted tonight; it will be discussed in today's class.

* RECAP:
  - This course: Multivariate extension of Stat 541
  - Chapter 1: Data visualization / statistical graphics
    . Taxonomy of multivariate data viz
    . Graphics systems in R: 'base', 'lattice' ('ggplot2')
  ------------
* ROADMAP:
  - Data viz (contd.)

================================================================

LECTURE 4, 2014/09/10

* ORG: HW 1 is posted, due Mon, Sept 22

================================================================

LECTURE 5, 2014/09/15

* RECAP:
  - Interactive dynamic graphics:
    . linked brushing
    . 3D rotations and higher-dimensional rotations (quantitative variables)
    . rescaling of axes
    ==> ggobi from www.ggobi.org, stand-alone and R-linked
  - R: function 'getGraphicsEvent()'
    To learn about this function, see help(getGraphicsEvent).
    Here is a simple non-statistical demo of the getGraphicsEvent() function:
      source("http://stat.wharton.upenn.edu/~buja/STAT-926/etch-a-sketch.R")
    Examine the code and cannibalize it to do something useful.

* ROADMAP:
  - More examples of statistical inference for data viz
  - Classical eigenvalue-based methods of multivariate analysis

================================================================

LECTURE 6, 2014/09/17

* RECAP:
  - Multivariate analysis is characterized by ...
  - PCA: (a small numerical check of the Rayleigh-quotient facts follows after the roadmap)
    . Assumptions about samples?
    . What is the object being optimized?
    . What is the optimization criterion?
    . What is the constraint?
    . What is the issue with the constraint?
    . What is PCA's Rayleigh quotient Ray(a)?
    . What are the stationary equations for Ray(a)?
    . What is the value of Ray(a) at stationary solutions?
    . What is the interpretation of Ray(a) as variance?
    . What is the sum of Ray(a) over stationary solutions?

* ROADMAP:
  - PCA projections/scores -- linear dimension reduction...
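  A minimal numerical sketch (simulated data, not from the course notes) of the
  Rayleigh-quotient facts above: Ray(a) = t(a) %*% cov(X) %*% a / t(a) %*% a is
  stationary exactly at the eigenvectors of cov(X), its value there is the
  corresponding eigenvalue, and the eigenvalues sum to trace(cov(X)) (the PCA budget).

    # Simulated toy data; any centered numeric matrix X would do.
    set.seed(1)
    X   <- matrix(rnorm(200*5), 200, 5) %*% matrix(rnorm(25), 5, 5)
    X   <- scale(X, center=TRUE, scale=FALSE)
    C   <- cov(X)
    Ray <- function(a) as.numeric(t(a) %*% C %*% a) / sum(a^2)   # Rayleigh quotient
    ev  <- eigen(C)
    sapply(1:ncol(X), function(j) Ray(ev$vectors[,j]))   # matches ev$values
    c(sum(ev$values), sum(diag(C)))                       # same number: the PCA budget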
================================================================

LECTURE 7, 2014/09/22

* ORG: HW 1 due tonight

* RECAP:
  - PCA:
    First decide correlation- or covariance-based PCA:
      X = scale(X, center=T, scale=F)   # Covariance-based PCA
      X = scale(X, center=T, scale=T)   # Correlation-based PCA
    Then:
      cov(X) = t(X) %*% X / (n-1)                # = cor(X) if X is standardized
      cov(X) = V %*% diag(Lambda) %*% t(V)       # with eigen()
        V[,j]     = j'th eigenvector of cov(X)
        Lambda[j] = j'th eigenvalue  of cov(X)
      cov(X %*% V) = t(V) %*% cov(X) %*% V = diag(Lambda)
        -------------------------------
        | cov(X %*% V) = diag(Lambda) |
        -------------------------------
      Sd := sqrt(Lambda)   principal sdevs
      Lambda = Sd^2        principal variances
    PCA budget:
      sum(Lambda) = trace(cov(X)) = var(X[,1]) + ... + var(X[,p])
                  = p   # if correlation-based
  - PCA scores: coordinates of the cases in the eigenvector basis V
      S     = X %*% V               (Nxp)  coord. transf. to the V-basis
      S[,j] = X %*% V[,j]           (Nx1)  j'th PC score column
      S[i,] = X[i,,drop=F] %*% V    (1xp)  PC scores of the i'th case
      mean(S[,j]) = 0   because X is centered
      var(S) = t(S) %*% S / (n-1)
             = t(V) %*% t(X) %*% X %*% V / (n-1)
             = t(V) %*% var(X) %*% V
             = diag(Lambda)
      ==> sd(S[,j]) = sqrt(Lambda[j]) = Sd[j]
          cor(S[,j],S[,k]) = 0 for j != k
    Correct scores plot: identical axes!
    Score vectors are ALWAYS uncorrelated, even if their score plots look
    correlated!!!  ==> Train your eyes...
  - PCA loadings: rescaled eigenvectors
    . Loading vectors: L = V %*% diag(Sd), i.e., L[,j] = V[,j]*Sd[j]
    . Motivating property of loadings:
        cov(X,S) = t(X) %*% S / (n-1)
                 = t(X) %*% X %*% V / (n-1)
                 = (t(X) %*% X / (n-1)) %*% V
                 = cov(X) %*% V
                 = V %*% diag(Sd)^2 %*% t(V) %*% V
                 = V %*% diag(Sd)^2
                 = L %*% diag(Sd)
      If X is standardized, then:
        cor(X,S) = cov(X, S %*% diag(1/Sd)) = L
        cor(X[,j],S[,k]) = L[j,k]
      ==> L[,k] is the eigenvector V[,k] rescaled so this holds:
        ----------------
        | cor(X,S) = L |   for correlation-based PCA (X standardized)
        ----------------
  - SVD:
    . What does the SVD of X look like?
        X = U %*% diag(D) %*% t(V)
    . What is a natural criterion for SVDs?
        R(u,v) = (t(u) %*% X %*% v)^2 / (|u|^2 * |v|^2)
    . What are the stationary equations?
        X v / |v|    = d u/|u|
        t(X) u / |u| = d v/|v|      where d = R(u,v)^{1/2}
    . How do the singular vectors and values of X relate to the PCA of X?
        cov(X) = t(X) %*% X / (n-1) = V %*% diag(D)^2 %*% t(V) / (n-1)
        Lambda = D^2/(n-1)
        Sd     = D/sqrt(n-1)
        V from eigen(cov(X)) = V, the right singular vectors from svd(X)
        S = X %*% V = U %*% diag(D)

* ROADMAP:
  - PCA in practice
  - PCA inference

================================================================

LECTURE 8, 2014/09/24

* ORG: HW 2 will be posted soon

* RECAP: PCA in practice, and inference for PCA
  - Decision:
    . covariance-based PCA:  center X       (X <- scale(X, center=T, scale=F))
    . correlation-based PCA: standardize X  (X <- scale(X, center=T, scale=T))
  - PC variance = eigenvalue = Var[ X v ], v = eigenvector of var(X)
    . PC sdev = square root of the PC eigenvalue
    . individual % of var accounted for = lambda.j / sum{k} lambda.k * 100
    . cumulative % of var accounted for = sum{k<=j} lambda.k / sum{k} lambda.k * 100
    . individual % relative to rest     = lambda.j / sum{k=j...p} lambda.k * 100
    ==> Plot eigenvalue profiles (sketched below).
    ==> Decide how many of the large PCs are 'real', 'significant'
    ==> Inference problem
    Kaiser's eigenvalue-1 rule:
      In correlation-based PCA, retain the PCs with eigenvalues > 1.
      Reasoning: Each standardized variable has variance 1. Adopt the convention
      of calling a PC large if it accounts for more than one variable's worth of
      variance. Hence retain the PCs with lambda.j > 1.
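    A minimal sketch (toy data, not course code) of the eigenvalue profile, Kaiser's
    eigenvalue-1 threshold, and the % of variance figures just listed:

      set.seed(1)
      X <- matrix(rnorm(100*6), 100, 6)
      X[,1:3] <- X[,1:3] + rnorm(100)                        # toy data: first three variables correlated
      lambda <- eigen(cor(X))$values                         # correlation-based PCA
      plot(lambda, type="b", xlab="PC", ylab="eigenvalue")   # eigenvalue profile
      abline(h=1, lty=2)                                     # Kaiser's threshold
      sum(lambda > 1)                                        # number of PCs retained by Kaiser's rule
      round(lambda / sum(lambda) * 100, 1)                   # individual % of variance
      round(cumsum(lambda) / sum(lambda) * 100, 1)           # cumulative % of variance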
  - PC score vector = X v, v = eigenvector of var(X)
    ==> projection of the rows of X onto the direction v
    ==> used for plotting
    Keep in mind:
      o Use identical axes for PC size comparisons.
      o By construction these variables are uncorrelated, even if they look
        highly structured.
  - Loading vector = eigenvector rescaled to have norm == PC sdev
    ==> L[j,k] = cor(X[,j],S[,k])
    ==> Shows how much X[,j] 'loads' on the k'th PC
    ==> Interpret the 'meaning' of each PC
  - Inference for eigenvalues:
    . Test the hypothesis of independence of all variables:
      Obtain conditional null data by randomly permuting within the columns.
      Plot their PC eigenvalue profiles:
        plot(eigen(cor(X))$val, type="n")
        for(i in 1:100) lines(eigen(cor(apply(X, 2, sample)))$val, col="gray")
      Compare with the observed eigenvalue profile:
        lines(eigen(cor(X))$val, lwd=2)
      Issues: This is a reasonable test for the largest and smallest e'vals,
        but not for the e'vals in between. Unreasonable because for e'vals
        2...p-1 we probably have other null hypotheses in mind.
        Still, it's better than using Kaiser's e'val-1 rule.
        See Notes for a discussion of attempts at solutions.
    . Confidence intervals for eigenvalues from the nonparametric bootstrap:
      Issue: Observed eigenvalues are biased estimates.
        E[ lambda.1.estimate ] == E[ max{|v|=1} v^T (X^T X)/(n-1) v ]
                               >= max{|v|=1} v^T E[ (X^T X)/(n-1) ] v
                               == lambda.1.population
      The bias is substantial when p is a substantial fraction of N.
      ==> Use correctly inverted, bias-corrected bootstrap intervals. See Notes.

* ROADMAP:
  - Finish up PCA in practice and inference
  - Odds and ends
  - Canonical correlation analysis (CCA)

================================================================

LECTURE 9, 2014/09/29

* ORG: HW 2 will be posted next week

* RECAP and DISCUSSION:
  - Tibshirani's talk:
    . Says they have a conditional test given previous eigenvalues and eigenvectors.
    . Hitch: It's based on normal theory and is sensitive to non-normality.
    . Why does a similar sensitivity to normality not occur in regression?
      Zongming's insight: CLT effect in linear regression
    . What would the Taylor-Tibs test achieve?
      It would account for having chance-capitalized on the first PCs even if
      all eigenvalues were identical.
  - Odds and ends

* ROADMAP: CCA

================================================================

LECTURE 10, 2014/10/01

* ORG: HW 2 will be posted next week

* RECAP: CCA
  - What is the CCA problem? ...
  - Why is the CCA problem well-defined, as opposed to the naive PCA problem? ...
  - How unique are the CCA coefficient vectors? ...
  - How does pre-whitening work in CCA? ...
    What does it mean? ...
    What does it accomplish for CCA? ...
  - What is the geometric interpretation of CCA in terms of certain data subspaces? ...

* ROADMAP:
  - CCA ctd.
  - GCA

================================================================

LECTURE 11, 2014/10/06

* ORG: HW 2 will be posted this week

* RECAP: CCA
  - The criterion?
  - Comparison of CCA and PCA?
  - CCA via SVD after pre-whitening? (see the sketch at the end of this recap)
  - What is pre-whitening statistically?
  - What is pre-whitening geometrically?
  - A variance criterion for CCA?
  - What does it mean if all cancorrs are 1?
  - Is it possible that a cancorr is negative?
  - Do you expect cancorr estimates to be biased?
  - Inference for CCA: use the two universal hammers
  - Reminders:
    . Like regression, CCA suffers from block-internal collinearity.
    . Unlike regression, CCA coefficients require some normalization.
      What standardization would you propose to see coefficient strength quickly?
    . What is the Rsquare of the regression of a Y-canonical variate on X?
      Vice versa?
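  A minimal sketch (toy data blocks, not course code) of CCA via the built-in
  cancor() and, equivalently, via an SVD after pre-whitening each block:

    set.seed(1)
    X  <- matrix(rnorm(100*3), 100, 3)
    Y  <- X %*% matrix(rnorm(6), 3, 2) + matrix(rnorm(200), 100, 2)   # related toy blocks
    cc <- cancor(X, Y)
    cc$cor                                     # canonical correlations

    Xc <- scale(X, center=TRUE, scale=FALSE)
    Yc <- scale(Y, center=TRUE, scale=FALSE)
    Xw <- Xc %*% solve(chol(cov(Xc)))          # pre-whitened X: cov(Xw) = I
    Yw <- Yc %*% solve(chol(cov(Yc)))          # pre-whitened Y: cov(Yw) = I
    sv <- svd(t(Xw) %*% Yw / (nrow(Xc) - 1))   # SVD of cov(Xw, Yw)
    sv$d                                       # matches cc$cor up to numerical error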
* ROADMAP: GCA -- Generalized Canonical Analysis
  - GCA as an umbrella for PCA and CCA
  - A dead-end approach to GCA
  - The proper approach

================================================================

LECTURE 12, 2014/10/08

* ORG: HW 2 will be posted this week

* RECAP: GCA
  - Criterion?
  - Principle for criteria in MA?
  - GCA as umbrella: two special cases
    . PCA
    . CCA

* ROADMAP: Fisher scoring, dummies, more about GCA criteria, LDA

================================================================

LECTURE 13, 2014/10/13

* ORG: HW 2 will be posted ...

* RECAP: Fisher Scoring and CCA
  - Fisher scoring = ... CCA with 2 categorical variables
  - What's the issue with these X and Y matrices? ...
  - What's an alternative to centering? ...
  - Recall Matt's observation about the meaning of X^T Y? ... 2D table of counts
  - Further thoughts on this interpretation: ...
  - Generalities:
    . two symmetric CCA criteria? ...
    . two asymmetric CCA criteria? ...

* ROADMAP: QDA and LDA

================================================================

LECTURE 14, 2014/10/15

* ORG: HW 2 will be posted ...

* RECAP:
  - QDA: ...
  - LDA: ...
  - Connection with CCA: ...

* ROADMAP:
  - Multiple Correspondence Analysis
  - PRINCALS ("PCA with Alternating LS")
  - Nonlinear MA

================================================================

LECTURE 15, 2014/10/20

* ORG: HW 2 will be posted ...

* RECAP:
  - PRINCALS:
    . Motivation? Compare with Multiple Correspondence Analysis (MCA) ...
    . Approach? What SVD fact is used? ...
    . Is PRINCALS "hierarchical" in the sense of PCA, CCA, GCA? ...
    . What insights are possible regarding the number of reduced dimensions? ...
  - Linear smoothers:
    . What is "smoothing"? What do theoretical statisticians call it? ...
    . What were the smoothers introduced in Stat 541? ...
    . Were these smoothers linear? What does "linear" mean? ...
    . Smoother matrix of a linear smoother:
      If you only have a procedure implementing a linear smoother, how can you
      obtain the columns of its smoother matrix? ...
    . If you plot the entries of a row of the smoother matrix as a function of x,
      what does it convey? Think "local averaging". ...
    . How do you obtain the smoother matrix for a polynomial smoother?
      What warnings are you aware of regarding numerical instability?
      How do you solve them? ...

* ROADMAP:
  - Linear smoothers (contd.)
  - Transformational Multivariate Analysis

================================================================

LECTURE 16, 2014/10/22

* ORG: HW 2 will be posted ...

* RECAP:
  - Linear smoothers: N-vector fhat = S y
    Interpret:
    . columns of S
    . rows of S
    . eigen/singular values of S
  - Types of linear smoothers?
    . ...
    . ...
    . ...

* ROADMAP:
  - Linear smoothers (contd.)
  - Transformational MA

================================================================

LECTURE 17, 2014/10/27

* ORG: HW 2 will be posted ...

* RECAP:
  - Linear smoothers: N-vector fhat = S y
    Interpret:
    . columns of S
    . rows of S
    . eigen/singular values of S
  - Types of linear smoothers?
    . ...
    . ...
    . ...

* ROADMAP:
  - Transformational MA, in particular ACE et al.

================================================================

LECTURE 18, 2014/10/29

* ORG: HW 2 will be posted ...

* RECAP:
  - ACE: describe ...
  - Alternating projections: Show that the convergence results for the power
    algorithm for symmetric pos. semi-def. matrices apply.

* ROADMAP:
  - Transformational MA (contd.)

================================================================

LECTURE 19, 2014/11/03

* RECAP: ACE and APCs
  - Additive models occur as building blocks in ACE and APC. Explain how.
    (A small numerical illustration of ACE follows below.)
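    A minimal illustration of ACE on simulated data. It uses the CRAN package
    'acepack' (an assumption, not part of the course code): ace() alternates
    smoothing steps to estimate transformations theta(y), f1(x1), f2(x2) that
    maximize the correlation of theta(y) with f1(x1) + f2(x2).

      # install.packages("acepack")   # if not installed
      library(acepack)
      set.seed(1)
      n   <- 200
      x1  <- runif(n); x2 <- runif(n)
      y   <- exp(sin(2*pi*x1) + x2^2 + rnorm(n, sd=0.2))   # nonlinear, but additive after log
      fit <- ace(cbind(x1, x2), y)
      fit$rsq                         # R^2 between transformed y and the sum of transformed x's
      par(mfrow=c(1,3))
      plot(x1, fit$tx[,1])            # estimated transform of x1 (~ sine shape)
      plot(x2, fit$tx[,2])            # estimated transform of x2 (~ quadratic shape)
      plot(y,  fit$ty)                # estimated transform of y  (~ log shape)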
  - Construct a super-sized Ridge regression problem for penalized additive regression.
  - The two purposes of smallest APCs
  - The criterion, unpenalized
  - The constraint, unpenalized
  - The criterion, penalized
  - The constraint, penalized
  - Algorithm?
  - Interpretation: If an APC has the form f1(X1) + f2(X2) + ... ~ 0, and both
    f1(X1) and f2(X2) are strong (i.e., sd(...) >> 0) and monotone increasing,
    does this imply a positive or a negative association between X1 and X2,
    all else held constant?

* ROADMAP:
  - ICA -- Independent Component Analysis: one particular form
  - FMA -- Functional multivariate analysis: one particular form
  - Subsequently: k-means, MDS

================================================================

LECTURE 20, 2014/11/05

* RECAP: ICA
  - What is the goal of ICA?
  - Can one always achieve the goal?
  - What is the issue with ICA on multivariate normal data?
  - What is required for identifiability of ICA?
  - What is the natural pre-processing before doing ICA, and what is the
    consequence for the resulting estimation problem?
  - Outline approaches to ICA using tools you know from this class.
  - Could you imagine other approaches?

* ROADMAP:
  - FMA -- Functional multivariate analysis: one particular form
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 21, 2014/11/10

* RECAP: FMA
  - What makes an MA functional? (We haven't really answered this Q.)
  - What kinds of smoothing do you want to achieve in FMA?
  - What are the two large smoother classes again?
  - Considered analyzing a functional data table with SVDs: fSVD.
    Reminders about plain SVD:
      Q: If you are given u and v, what is the optimal d?
      A: ...
  - How can you perform an fSVD...
    . based on the first smoothing class?
    . based on the second smoothing class?

* ROADMAP:
  - FMA -- Functional multivariate analysis (contd.)
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 22, 2014/11/12

* RECAP:
  - What makes a data analysis situation 'functional'?
    . We're estimating discretized functions/signals/processes on a domain.
    . Typical domains: time, space, frequency
    . We only consider the case where all cases are sampled at the same
      locations in the domain.
      Examples: daily stock returns (case = company)
                mid-day temperatures at 200 locations across the US
  - What makes data two-way functional?
    A: In the data table both the rows and the columns have underlying domains.
    Ex.: Daily mid-day temperatures at 200 locations for 20 years
         rows: locations, cols: days
  - Recall two-way smoothing of an SVD ...

* ROADMAP:
  - FMA -- Functional multivariate analysis (contd.)
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 23, 2014/11/17

* ORG:
  - Thanksgiving week is off; no Mon, no Wed class
  - Homework: Examine the two-way Lasso-regularized SVD
    Due Mon, Dec 15, 2014
    You can do anything you like:
    . Implement and experiment
    . Play with optimization schemes
    . Examine whether cross-validation works for the choice of the lambdas
    . Prove asymptotic minimaxity
    ...
    Submit a PDF file produced from LaTeX, named 'hw02-Yourlastname-Yourfirstname.pdf',
    as an attachment to stat926.at.wharton@gmail.com with
    Subject: HW 2, 2014, YourLastName, YourFirstName

* ROADMAP:
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 24, 2014/11/19

* ORG:
  - Thanksgiving week is off; no Mon, no Wed class
  - Homework: Examine the two-way Lasso-regularized SVD
    Due Mon, Dec 15, 2014
    Discussion? Some lit refs were added to the LaTeX file.

* RECAP: k-means clustering
  . What's the idea of k-means clustering?
  . Criterion?
  . Variable scaling issue?
  . Algorithm?
  . What idea does the term 'clustering' conjure up in your mind?
    Does k-means live up to this idea?
  . Where do you expect the centers to fall in relation to the PCs?
  . This suggests what kind of plot? (See the sketch after the roadmap below.)
  . Issues with the k-means criterion? ... (Tell a story.)

* ROADMAP:
  - hierarchical clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================
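A minimal sketch (toy data, not course code) for the k-means recap of Lecture 24:
run k-means and plot the cases and the fitted centers in the first two principal
components, with identical axes.

  set.seed(1)
  X  <- rbind(matrix(rnorm(150*4), 150, 4),
              matrix(rnorm(150*4, mean=2), 150, 4))   # toy data: two clusters
  X  <- scale(X, center=TRUE, scale=TRUE)             # the variable-scaling choice matters for k-means
  km <- kmeans(X, centers=2, nstart=25)               # k chosen to match the toy data
  V  <- eigen(cov(X))$vectors
  S  <- X %*% V                                       # PC scores of the cases
  C  <- km$centers %*% V                              # cluster centers in the same PC basis
  plot(S[,1], S[,2], col=km$cluster, asp=1)           # asp=1: identical axes, as for any score plot
  points(C[,1], C[,2], pch=8, cex=2)                  # centers tend to spread along the large PCs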