STAT 926 -- Recaps and Roadmaps
----------------------------------------------------------------

LECTURE 2, 2014/09/03

* RECAP:
  - This course:
    . Prerequisite: Stat 541
    . Multivariate extension of Stat 541
  - R infrastructure
  - Data visualization / statistical graphics
    . Taxonomy of multivariate data viz
    . Graphics systems in R: 'base', 'lattice', 'ggplot2'
  - Datasets:
    . tips data
    . earnings data
  - Data viz methods:
    . univariate (1D):
      + histograms:
          parameter = ...
          MESSAGE: ...
          functions: ...
          [catch up: didn't do the 'lattice' fct histogram(); see Notes]
      + density plots: parameter = ...
      + violin plots
      + boxplots
    . bivariate (2D):
      + scatterplots if x and y are both quantitative (R: numeric)
          functions: ...
      + comparison boxplots if x is ... and y is ...
          functions: ...
      + the reverse case, if x is quantitative and y is categorical (see HW 1)
  ------------
* ROADMAP:
  - Data viz (contd.)
  - PCA recap: goal, technique, criterion, interpretation

================================================================

LECTURE 3, 2014/09/08

* ORG: Homework 1 will be posted tonight; it will be discussed in today's class.

* RECAP:
  - This course: Multivariate extension of Stat 541
  - Chapter 1: Data visualization / statistical graphics
    . Taxonomy of multivariate data viz
    . Graphics systems in R: 'base', 'lattice' ('ggplot2')
  ------------
* ROADMAP:
  - Data viz (contd.)

================================================================

LECTURE 4, 2014/09/10

* ORG: HW 1 is posted, due Mon, Sept 22

================================================================

LECTURE 5, 2014/09/15

* RECAP:
  - Interactive dynamic graphics:
    . linked brushing
    . 3D rotations and higher-dimensional rotations (quantitative variables)
    . rescaling of axes
    ==> ggobi from www.ggobi.org, stand-alone and R-linked
  - R: function 'getGraphicsEvent()'
    To learn about this function, see help(getGraphicsEvent).
    Here is a simple non-statistical demo of the getGraphicsEvent() function:
      source("http://stat.wharton.upenn.edu/~buja/STAT-926/etch-a-sketch.R")
    Examine the code and cannibalize it to do something useful.

* ROADMAP:
  - More examples of statistical inference for data viz
  - Classical eigenvalue-based methods of multivariate analysis

================================================================

LECTURE 6, 2014/09/17

* RECAP:
  - Multivariate analysis is characterized by ...
  - PCA: (a small numerical check of the Rayleigh-quotient facts follows after the roadmap)
    . Assumptions about samples?
    . What is the object being optimized?
    . What is the optimization criterion?
    . What is the constraint?
    . What is the issue with the constraint?
    . What is PCA's Rayleigh quotient Ray(a)?
    . What are the stationary equations for Ray(a)?
    . What is the value of Ray(a) at stationary solutions?
    . What is the interpretation of Ray(a) as variance?
    . What is the sum of Ray(a) over stationary solutions?

* ROADMAP:
  - PCA projections/scores -- linear dimension reduction...
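  A minimal numerical sketch (simulated data, not from the course notes) of the
  Rayleigh-quotient facts above: Ray(a) = t(a) %*% cov(X) %*% a / t(a) %*% a is
  stationary exactly at the eigenvectors of cov(X), its value there is the
  corresponding eigenvalue, and the eigenvalues sum to trace(cov(X)) (the PCA budget).

    # Simulated toy data; any centered numeric matrix X would do.
    set.seed(1)
    X   <- matrix(rnorm(200*5), 200, 5) %*% matrix(rnorm(25), 5, 5)
    X   <- scale(X, center=TRUE, scale=FALSE)
    C   <- cov(X)
    Ray <- function(a) as.numeric(t(a) %*% C %*% a) / sum(a^2)   # Rayleigh quotient
    ev  <- eigen(C)
    sapply(1:ncol(X), function(j) Ray(ev$vectors[,j]))   # matches ev$values
    c(sum(ev$values), sum(diag(C)))                       # same number: the PCA budget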
================================================================

LECTURE 7, 2014/09/22

* ORG: HW 1 due tonight

* RECAP:
  - PCA:
    First decide correlation- or covariance-based PCA:
      X = scale(X, center=T, scale=F)   # Covariance-based PCA
      X = scale(X, center=T, scale=T)   # Correlation-based PCA
    Then:
      cov(X) = t(X) %*% X / (n-1)                # = cor(X) if X is standardized
      cov(X) = V %*% diag(Lambda) %*% t(V)       # with eigen()
        V[,j]     = j'th eigenvector of cov(X)
        Lambda[j] = j'th eigenvalue  of cov(X)
      cov(X %*% V) = t(V) %*% cov(X) %*% V = diag(Lambda)
        -------------------------------
        | cov(X %*% V) = diag(Lambda) |
        -------------------------------
      Sd := sqrt(Lambda)   principal sdevs
      Lambda = Sd^2        principal variances
    PCA budget:
      sum(Lambda) = trace(cov(X)) = var(X[,1]) + ... + var(X[,p])
                  = p   # if correlation-based
  - PCA scores: coordinates of the cases in the eigenvector basis V
      S     = X %*% V               (Nxp)  coord. transf. to the V-basis
      S[,j] = X %*% V[,j]           (Nx1)  j'th PC score column
      S[i,] = X[i,,drop=F] %*% V    (1xp)  PC scores of the i'th case
      mean(S[,j]) = 0   because X is centered
      var(S) = t(S) %*% S / (n-1)
             = t(V) %*% t(X) %*% X %*% V / (n-1)
             = t(V) %*% var(X) %*% V
             = diag(Lambda)
      ==> sd(S[,j]) = sqrt(Lambda[j]) = Sd[j]
          cor(S[,j],S[,k]) = 0 for j != k
    Correct scores plot: identical axes!
    Score vectors are ALWAYS uncorrelated, even if their score plots look
    correlated!!!  ==> Train your eyes...
  - PCA loadings: rescaled eigenvectors
    . Loading vectors: L = V %*% diag(Sd), i.e., L[,j] = V[,j]*Sd[j]
    . Motivating property of loadings:
        cov(X,S) = t(X) %*% S / (n-1)
                 = t(X) %*% X %*% V / (n-1)
                 = (t(X) %*% X / (n-1)) %*% V
                 = cov(X) %*% V
                 = V %*% diag(Sd)^2 %*% t(V) %*% V
                 = V %*% diag(Sd)^2
                 = L %*% diag(Sd)
      If X is standardized, then:
        cor(X,S) = cov(X, S %*% diag(1/Sd)) = L
        cor(X[,j],S[,k]) = L[j,k]
      ==> L[,k] is the eigenvector V[,k] rescaled so this holds:
        ----------------
        | cor(X,S) = L |   for correlation-based PCA (X standardized)
        ----------------
  - SVD:
    . What does the SVD of X look like?
        X = U %*% diag(D) %*% t(V)
    . What is a natural criterion for SVDs?
        R(u,v) = (t(u) %*% X %*% v)^2 / (|u|^2 * |v|^2)
    . What are the stationary equations?
        X v / |v|    = d u/|u|
        t(X) u / |u| = d v/|v|      where d = R(u,v)^{1/2}
    . How do the singular vectors and values of X relate to the PCA of X?
        cov(X) = t(X) %*% X / (n-1) = V %*% diag(D)^2 %*% t(V) / (n-1)
        Lambda = D^2/(n-1)
        Sd     = D/sqrt(n-1)
        V from eigen(cov(X)) = V, the right singular vectors from svd(X)
        S = X %*% V = U %*% diag(D)

* ROADMAP:
  - PCA in practice
  - PCA inference

================================================================

LECTURE 8, 2014/09/24

* ORG: HW 2 will be posted soon

* RECAP: PCA in practice, and inference for PCA
  - Decision:
    . covariance-based PCA:  center X       (X <- scale(X, center=T, scale=F))
    . correlation-based PCA: standardize X  (X <- scale(X, center=T, scale=T))
  - PC variance = eigenvalue = Var[ X v ], v = eigenvector of var(X)
    . PC sdev = square root of the PC eigenvalue
    . individual % of var accounted for = lambda.j / sum{k} lambda.k * 100
    . cumulative % of var accounted for = sum{k<=j} lambda.k / sum{k} lambda.k * 100
    . individual % relative to rest     = lambda.j / sum{k=j...p} lambda.k * 100
    ==> Plot eigenvalue profiles (sketched below).
    ==> Decide how many of the large PCs are 'real', 'significant'
    ==> Inference problem
    Kaiser's eigenvalue-1 rule:
      In correlation-based PCA, retain the PCs with eigenvalues > 1.
      Reasoning: Each standardized variable has variance 1. Adopt the convention
      of calling a PC large if it accounts for more than one variable's worth of
      variance. Hence retain the PCs with lambda.j > 1.
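    A minimal sketch (toy data, not course code) of the eigenvalue profile, Kaiser's
    eigenvalue-1 threshold, and the % of variance figures just listed:

      set.seed(1)
      X <- matrix(rnorm(100*6), 100, 6)
      X[,1:3] <- X[,1:3] + rnorm(100)                        # toy data: first three variables correlated
      lambda <- eigen(cor(X))$values                         # correlation-based PCA
      plot(lambda, type="b", xlab="PC", ylab="eigenvalue")   # eigenvalue profile
      abline(h=1, lty=2)                                     # Kaiser's threshold
      sum(lambda > 1)                                        # number of PCs retained by Kaiser's rule
      round(lambda / sum(lambda) * 100, 1)                   # individual % of variance
      round(cumsum(lambda) / sum(lambda) * 100, 1)           # cumulative % of variance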
  - PC score vector = X v, v = eigenvector of var(X)
    ==> projection of the rows of X onto the direction v
    ==> used for plotting
    Keep in mind:
      o Use identical axes for PC size comparisons.
      o By construction these variables are uncorrelated, even if they look
        highly structured.
  - Loading vector = eigenvector rescaled to have norm == PC sdev
    ==> L[j,k] = cor(X[,j],S[,k])
    ==> Shows how much X[,j] 'loads' on the k'th PC
    ==> Interpret the 'meaning' of each PC
  - Inference for eigenvalues:
    . Test the hypothesis of independence of all variables:
      Obtain conditional null data by randomly permuting within the columns.
      Plot their PC eigenvalue profiles:
        plot(eigen(cor(X))$val, type="n")
        for(i in 1:100) lines(eigen(cor(apply(X, 2, sample)))$val, col="gray")
      Compare with the observed eigenvalue profile:
        lines(eigen(cor(X))$val, lwd=2)
      Issues: This is a reasonable test for the largest and smallest e'vals,
        but not for the e'vals in between. Unreasonable because for e'vals
        2...p-1 we probably have other null hypotheses in mind.
        Still, it's better than using Kaiser's e'val-1 rule.
        See Notes for a discussion of attempts at solutions.
    . Confidence intervals for eigenvalues from the nonparametric bootstrap:
      Issue: Observed eigenvalues are biased estimates.
        E[ lambda.1.estimate ] == E[ max{|v|=1} v^T (X^T X)/(n-1) v ]
                               >= max{|v|=1} v^T E[ (X^T X)/(n-1) ] v
                               == lambda.1.population
      The bias is substantial when p is a substantial fraction of N.
      ==> Use correctly inverted, bias-corrected bootstrap intervals. See Notes.

* ROADMAP:
  - Finish up PCA in practice and inference
  - Odds and ends
  - Canonical correlation analysis (CCA)

================================================================

LECTURE 9, 2014/09/29

* ORG: HW 2 will be posted next week

* RECAP and DISCUSSION:
  - Tibshirani's talk:
    . Says they have a conditional test given previous eigenvalues and eigenvectors.
    . Hitch: It's based on normal theory and is sensitive to non-normality.
    . Why does a similar sensitivity to normality not occur in regression?
      Zongming's insight: CLT effect in linear regression
    . What would the Taylor-Tibs test achieve?
      It would account for having chance-capitalized on the first PCs even if
      all eigenvalues were identical.
  - Odds and ends

* ROADMAP: CCA

================================================================

LECTURE 10, 2014/10/01

* ORG: HW 2 will be posted next week

* RECAP: CCA
  - What is the CCA problem? ...
  - Why is the CCA problem well-defined, as opposed to the naive PCA problem? ...
  - How unique are the CCA coefficient vectors? ...
  - How does pre-whitening work in CCA? ...
    What does it mean? ...
    What does it accomplish for CCA? ...
  - What is the geometric interpretation of CCA in terms of certain data subspaces? ...

* ROADMAP:
  - CCA ctd.
  - GCA

================================================================

LECTURE 11, 2014/10/06

* ORG: HW 2 will be posted this week

* RECAP: CCA
  - The criterion?
  - Comparison of CCA and PCA?
  - CCA via SVD after pre-whitening? (see the sketch at the end of this recap)
  - What is pre-whitening statistically?
  - What is pre-whitening geometrically?
  - A variance criterion for CCA?
  - What does it mean if all cancorrs are 1?
  - Is it possible that a cancorr is negative?
  - Do you expect cancorr estimates to be biased?
  - Inference for CCA: use the two universal hammers
  - Reminders:
    . Like regression, CCA suffers from block-internal collinearity.
    . Unlike regression, CCA coefficients require some normalization.
      What standardization would you propose to see coefficient strength quickly?
    . What is the Rsquare of the regression of a Y-canonical variate on X?
      Vice versa?
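  A minimal sketch (toy data blocks, not course code) of CCA via the built-in
  cancor() and, equivalently, via an SVD after pre-whitening each block:

    set.seed(1)
    X  <- matrix(rnorm(100*3), 100, 3)
    Y  <- X %*% matrix(rnorm(6), 3, 2) + matrix(rnorm(200), 100, 2)   # related toy blocks
    cc <- cancor(X, Y)
    cc$cor                                     # canonical correlations

    Xc <- scale(X, center=TRUE, scale=FALSE)
    Yc <- scale(Y, center=TRUE, scale=FALSE)
    Xw <- Xc %*% solve(chol(cov(Xc)))          # pre-whitened X: cov(Xw) = I
    Yw <- Yc %*% solve(chol(cov(Yc)))          # pre-whitened Y: cov(Yw) = I
    sv <- svd(t(Xw) %*% Yw / (nrow(Xc) - 1))   # SVD of cov(Xw, Yw)
    sv$d                                       # matches cc$cor up to numerical error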
* ROADMAP: GCA -- Generalized Canonical Analysis
  - GCA as an umbrella for PCA and CCA
  - A dead-end approach to GCA
  - The proper approach

================================================================

LECTURE 12, 2014/10/08

* ORG: HW 2 will be posted this week

* RECAP: GCA
  - Criterion?
  - Principle for criteria in MA?
  - GCA as umbrella: two special cases
    . PCA
    . CCA

* ROADMAP: Fisher scoring, dummies, more about GCA criteria, LDA

================================================================

LECTURE 13, 2014/10/13

* ORG: HW 2 will be posted ...

* RECAP: Fisher Scoring and CCA
  - Fisher scoring = ... CCA with 2 categorical variables
  - What's the issue with these X and Y matrices? ...
  - What's an alternative to centering? ...
  - Recall Matt's observation about the meaning of X^T Y? ... 2D table of counts
  - Further thoughts on this interpretation: ...
  - Generalities:
    . two symmetric CCA criteria? ...
    . two asymmetric CCA criteria? ...

* ROADMAP: QDA and LDA

================================================================

LECTURE 14, 2014/10/15

* ORG: HW 2 will be posted ...

* RECAP:
  - QDA: ...
  - LDA: ...
  - Connection with CCA: ...

* ROADMAP:
  - Multiple Correspondence Analysis
  - PRINCALS ("PCA with Alternating LS")
  - Nonlinear MA

================================================================

LECTURE 15, 2014/10/20

* ORG: HW 2 will be posted ...

* RECAP:
  - PRINCALS:
    . Motivation? Compare with Multiple Correspondence Analysis (MCA) ...
    . Approach? What SVD fact is used? ...
    . Is PRINCALS "hierarchical" in the sense of PCA, CCA, GCA? ...
    . What insights are possible regarding the number of reduced dimensions? ...
  - Linear smoothers:
    . What is "smoothing"? What do theoretical statisticians call it? ...
    . What were the smoothers introduced in Stat 541? ...
    . Were these smoothers linear? What does "linear" mean? ...
    . Smoother matrix of a linear smoother:
      If you only have a procedure implementing a linear smoother, how can you
      obtain the columns of its smoother matrix? ...
    . If you plot the entries of a row of the smoother matrix as a function of x,
      what does it convey? Think "local averaging". ...
    . How do you obtain the smoother matrix for a polynomial smoother?
      What warnings are you aware of regarding numerical instability?
      How do you solve them? ...

* ROADMAP:
  - Linear smoothers (contd.)
  - Transformational Multivariate Analysis

================================================================

LECTURE 16, 2014/10/22

* ORG: HW 2 will be posted ...

* RECAP:
  - Linear smoothers: N-vector fhat = S y
    Interpret:
    . columns of S
    . rows of S
    . eigen/singular values of S
  - Types of linear smoothers?
    . ...
    . ...
    . ...

* ROADMAP:
  - Linear smoothers (contd.)
  - Transformational MA

================================================================

LECTURE 17, 2014/10/27

* ORG: HW 2 will be posted ...

* RECAP:
  - Linear smoothers: N-vector fhat = S y
    Interpret:
    . columns of S
    . rows of S
    . eigen/singular values of S
  - Types of linear smoothers?
    . ...
    . ...
    . ...

* ROADMAP:
  - Transformational MA, in particular ACE et al.

================================================================

LECTURE 18, 2014/10/29

* ORG: HW 2 will be posted ...

* RECAP:
  - ACE: describe ...
  - Alternating projections: Show that the convergence results for the power
    algorithm for symmetric pos. semi-def. matrices apply.

* ROADMAP:
  - Transformational MA (contd.)

================================================================

LECTURE 19, 2014/11/03

* RECAP: ACE and APCs
  - Additive models occur as building blocks in ACE and APC. Explain how.
    (A small numerical illustration of ACE follows below.)
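    A minimal illustration of ACE on simulated data. It uses the CRAN package
    'acepack' (an assumption, not part of the course code): ace() alternates
    smoothing steps to estimate transformations theta(y), f1(x1), f2(x2) that
    maximize the correlation of theta(y) with f1(x1) + f2(x2).

      # install.packages("acepack")   # if not installed
      library(acepack)
      set.seed(1)
      n   <- 200
      x1  <- runif(n); x2 <- runif(n)
      y   <- exp(sin(2*pi*x1) + x2^2 + rnorm(n, sd=0.2))   # nonlinear, but additive after log
      fit <- ace(cbind(x1, x2), y)
      fit$rsq                         # R^2 between transformed y and the sum of transformed x's
      par(mfrow=c(1,3))
      plot(x1, fit$tx[,1])            # estimated transform of x1 (~ sine shape)
      plot(x2, fit$tx[,2])            # estimated transform of x2 (~ quadratic shape)
      plot(y,  fit$ty)                # estimated transform of y  (~ log shape)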
  - Construct a super-sized Ridge regression problem for penalized additive regression.
  - The two purposes of smallest APCs
  - The criterion, unpenalized
  - The constraint, unpenalized
  - The criterion, penalized
  - The constraint, penalized
  - Algorithm?
  - Interpretation: If an APC has the form f1(X1) + f2(X2) + ... ~ 0, and both
    f1(X1) and f2(X2) are strong (i.e., sd(...) >> 0) and monotone increasing,
    does this imply a positive or a negative association between X1 and X2,
    all else held constant?

* ROADMAP:
  - ICA -- Independent Component Analysis: one particular form
  - FMA -- Functional multivariate analysis: one particular form
  - Subsequently: k-means, MDS

================================================================

LECTURE 20, 2014/11/05

* RECAP: ICA
  - What is the goal of ICA?
  - Can one always achieve the goal?
  - What is the issue with ICA on multivariate normal data?
  - What is required for identifiability of ICA?
  - What is the natural pre-processing before doing ICA, and what is the
    consequence for the resulting estimation problem?
  - Outline approaches to ICA using tools you know from this class.
  - Could you imagine other approaches?

* ROADMAP:
  - FMA -- Functional multivariate analysis: one particular form
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 21, 2014/11/10

* RECAP: FMA
  - What makes an MA functional? (We haven't really answered this Q.)
  - What kinds of smoothing do you want to achieve in FMA?
  - What are the two large smoother classes again?
  - Considered analyzing a functional data table with SVDs: fSVD.
    Reminders about plain SVD:
      Q: If you are given u and v, what is the optimal d?
      A: ...
  - How can you perform an fSVD...
    . based on the first smoothing class?
    . based on the second smoothing class?

* ROADMAP:
  - FMA -- Functional multivariate analysis (contd.)
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 22, 2014/11/12

* RECAP:
  - What makes a data analysis situation 'functional'?
    . We're estimating discretized functions/signals/processes on a domain.
    . Typical domains: time, space, frequency
    . We only consider the case where all cases are sampled at the same
      locations in the domain.
      Examples: daily stock returns (case = company)
                mid-day temperatures at 200 locations across the US
  - What makes data two-way functional?
    A: In the data table both the rows and the columns have underlying domains.
    Ex.: Daily mid-day temperatures at 200 locations for 20 years
         rows: locations, cols: days
  - Recall two-way smoothing of an SVD ...

* ROADMAP:
  - FMA -- Functional multivariate analysis (contd.)
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 23, 2014/11/17

* ORG:
  - Thanksgiving week is off; no Mon, no Wed class
  - Homework: Examine the two-way Lasso-regularized SVD
    Due Mon, Dec 15, 2014
    You can do anything you like:
    . Implement and experiment
    . Play with optimization schemes
    . Examine whether cross-validation works for the choice of the lambdas
    . Prove asymptotic minimaxity
    ...
    Submit a PDF file produced from LaTeX, named 'hw02-Yourlastname-Yourfirstname.pdf',
    as an attachment to stat926.at.wharton@gmail.com with
    Subject: HW 2, 2014, YourLastName, YourFirstName

* ROADMAP:
  - k-means clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================

LECTURE 24, 2014/11/19

* ORG:
  - Thanksgiving week is off; no Mon, no Wed class
  - Homework: Examine the two-way Lasso-regularized SVD
    Due Mon, Dec 15, 2014
    Discussion? Some lit refs were added to the LaTeX file.

* RECAP: k-means clustering
  . What's the idea of k-means clustering?
  . Criterion?
  . Variable scaling issue?
  . Algorithm?
  . What idea does the term 'clustering' conjure up in your mind?
    Does k-means live up to this idea?
  . Where do you expect the centers to fall in relation to the PCs?
  . This suggests what kind of plot? (See the sketch after the roadmap below.)
  . Issues with the k-means criterion? ... (Tell a story.)

* ROADMAP:
  - hierarchical clustering
  - multidimensional scaling -- MDS
  - Kernelizing and kernel PCA

================================================================
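A minimal sketch (toy data, not course code) for the k-means recap of Lecture 24:
run k-means and plot the cases and the fitted centers in the first two principal
components, with identical axes.

  set.seed(1)
  X  <- rbind(matrix(rnorm(150*4), 150, 4),
              matrix(rnorm(150*4, mean=2), 150, 4))   # toy data: two clusters
  X  <- scale(X, center=TRUE, scale=TRUE)             # the variable-scaling choice matters for k-means
  km <- kmeans(X, centers=2, nstart=25)               # k chosen to match the toy data
  V  <- eigen(cov(X))$vectors
  S  <- X %*% V                                       # PC scores of the cases
  C  <- km$centers %*% V                              # cluster centers in the same PC basis
  plot(S[,1], S[,2], col=km$cluster, asp=1)           # asp=1: identical axes, as for any score plot
  points(C[,1], C[,2], pch=8, cex=2)                  # centers tend to spread along the large PCs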