Andreas Buja's Home Page at the University of Pennsylvania
The Liem Sioe Liong/First Pacific Company Professor Emeritus of Statistics
(as of July 2021)
Department of Statistics
The Wharton School
University of Pennsylvania
Philadelphia, PA 19104-6340
Senior Research Scientist (as of January 2020)
Center for Computational Mathematics
162 Fifth Avenue, New York, NY 10010
(a more fun alternative from
the 2004/5 MBA guide)
Past Teaching Materials:
Statistical Inference after Model Selection:
- Valid Post-Selection Inference
by Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao.
When predictors for statistical models are selected by looking at
the data, statistical inference based on these models is in danger
of being invalid. We show that confidence intervals may need to
be widened, sometimes considerably, to protect against
invalidation. This is a fundamental difficulty with statistical
inference that has implications all the way down to how we teach
statistics in introductory courses.
- Software for computing PoSI constants:
Play with the examples at the end of the help page,
and cannibalize them for your purposes.
- Short presentation [PDF]
at a workshop on "Reproducibility of Scientific Results", NAS, Feb 26/27, 2015.
Visualization of Large Correlation Tables:
Observations on Bagging
- A Tool for Mining Large Correlation Tables: 'Association Navigator'
This report (joint with Abba Krieger and Ed George) was written for the
Simons Foundation - Autism Research Initiative (SFARI).
Work under a SFARI grant motivated us to create an interactive tool for visualizing correlation tables
of many hundreds of variables. The report draws its examples from the 'Simons Simplex Collection' (SSC), a large
database of autism phenotype data.
- Association Navigator [R software]
written in the R language (currently only for MS Windows).
You can load the software into an R interpreter and run an example by executing the following expressions:
currencies <- read.csv("http://stat.wharton.upenn.edu/~buja/DATA/Currencies-2006-2016.csv")
currencies.nav <- a.nav.create(currencies)
Press 'h' for a help window. The help window must be closed before you continue in the navigator window.
Follow the instructions on page 34f of the above report to
apply the software to your own numeric data matrix.
Disclaimer: At this point this software has only been tested on MS Windows machines.
"A Visualization Tool for Mining Large Correlation Tables: The Association Navigator,"
Buja, A., Krieger, A., George, E. (2016),
in: Handbook of Big Data,
eds.: Peter Buhlmann, Petros Drineas, Mark van der Laan, Michael Kane.
Observations on Bagging
(Statistica Sinica, Special Issue on Machine Learning, 16 (2), 323-352 (2006))
A preliminary version and a companion paper which I keep posted because others have
started referring to them:
The Effect of Bagging on Variance, Bias, and Mean Squared Error
Smoothing Effects of Bagging: Von Mises Expansions of Bagged Functionals
Inference for EDA and Diagnostics: It is commonly thought that the
visualization methods used in exploratory data analysis and model diagnostics
are beyond, and even adverse to, statistical inference. This is not
so. With some simple protocols it is possible to assign p-values to
- "Graphical Inference for Infovis"
(Wickham, Cook, Hofmann, Buja, 2010;
IEEE Trans. on Visualization and Computer Graphics, Vol. 16, No. 6, Nov/Dec; Best Paper Award)
- "Statistical inference for exploratory data analysis and model diagnostics"
with supplementary materials
(Buja, Cook, Hofmann, Lawrence, Lee, Swayne, Wickham, 2009;
Philosophical Transactions of The Royal Society, A)
- A precursor talk given at the Joint Statistics Meetings 1999, with Di Cook
- A related topic is model checking with parametric bootstrap:
I brought this up back in 2004 in a discussion of a paper by Andrew Gelman
who does the same with a posterior predictive approach. Here are
Gelman's JCGS article,
followed by my discussion,
and his rejoinder.
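One of the simple protocols described in the papers above is the "lineup": the plot of the real data is hidden among plots of null data, and a viewer who has not seen the data is asked to pick the plot that looks most different. The following is a minimal sketch in base R, not the authors' software. The simulated data, the choice of permutation as the null-generating mechanism, and all variable names are assumptions made for illustration.

```r
# Lineup sketch: hide the real scatterplot among permutation nulls.
set.seed(1)                      # fixed seed so the example is reproducible
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)          # simulated data with a real x-y association

m <- 20                          # number of panels in the lineup
true_pos <- sample(m, 1)         # position of the real plot, hidden from the viewer

# Permuting y breaks any x-y association, so permuted panels
# are draws from the null hypothesis of independence.
panels <- lapply(seq_len(m), function(i) {
  if (i == true_pos) {
    data.frame(x = x, y = y)
  } else {
    data.frame(x = x, y = sample(y))
  }
})

# Draw the lineup as a 4 x 5 grid of small, numbered scatterplots.
op <- par(mfrow = c(4, 5), mar = c(1, 1, 2, 1))
for (i in seq_len(m)) {
  plot(panels[[i]]$x, panels[[i]]$y,
       axes = FALSE, xlab = "", ylab = "", main = i)
  box()
}
par(op)

# Under the null, the real plot is exchangeable with the others,
# so a single viewer picks it with probability 1/m.
p_value_if_identified <- 1 / m
```

If the viewer singles out panel `true_pos`, the discovery is significant at level 1/m (0.05 for m = 20). The viewer must not have seen the real data beforehand, otherwise the test is invalid.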
Lisha Chen's thesis
resulted in the following two articles:
Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing (Chen and Buja; JMLR 2013)
Local Multidimensional Scaling for Nonlinear Dimension Reduction, Graph Drawing and Proximity Analysis (Chen and Buja; JASA 2009)
Data Visualization With Multidimensional Scaling (Buja, Swayne, Littman, Dean, Hofmann and Chen; JCGS 2008)
Visualization Methodology for Multidimensional Scaling (Buja and Swayne; JoC 2002)
- Quasi-Darwinian Selection in Marketing Relationships
joint with N. Eyuboglu;
(Journal of Marketing, Oct 2007,
featured JM blog article
and a finalist for JM's 2007 Harold H. Maynard Award)
Along with the paper go a few scenario calculations that are not included in the article:
- Different Worlds: Do Recommender Systems Fragment Consumers' Interest?
In his Ph.D. thesis, Dan Fleder devised a scheme
whereby observational data about consumers of music before and
after joining a recommender service could be interpreted as describing
a natural experiment. As a consequence, he
got as close as conceivable to causal inference about
the effects of recommender systems on consumers
in a particular setting. Here is a short version in Knowledge@Wharton:
Penalized Singular Value Decompositions with Jianhua Huang and Haipeng Shen:
- The Analysis of Two-Way Functional Data Using Two-Way Regularized Singular Value Decompositions
(Huang, Shen and Buja, JASA, 2009)
with supplementary material
- Functional principal components analysis via penalized rank one approximation
(Huang, Shen and Buja; Electronic Journal of Statistics, 2008)[.pdf]
High-Dimensional Data Visualization with grand tours and guided tours,
joint with Deborah Swayne, Di Cook, Dan Asimov, and Catherine Hurley:
- Theory of Dynamic Projections in High-Dimensional Data Visualization
Describes the invariant Riemannian geometries
on Stiefel manifolds.
- Computational Methods for High-Dimensional Rotations in Data Visualization
Appeared in "Handbook of Statistics" (eds. E. Wegman, C. R. Rao; 2005).
(An older version that had both papers in one should be considered out of date.)
- Differential geometry for dynamic projections: invariant Riemannian geometries on Stiefel manifolds
- Computational Methods for High-Dimensional Rotations in Data Visualization
- Software architecture for interactive dynamic data visualization systems
- Methodology for viewing high-dimensional data with dynamic projections
- XGobi manual
- Grand tours and projection pursuit
- Projection pursuit indices
- "Prosections": Theory of the synthesis of projecting and sectioning multivariate data clouds
- After Asimov's seminal article on grand tours, here is the
first proposal for extending the idea to correlation and regression
Multivariate Analysis: A talk I gave at an econometrics workshop at Stanford.
It tries to answer the question of how to choose the "reference
metric" or the constraint in multivariate methods based on
Loss Functions for Binary Class Probability Estimation: Structure and Applications. (Former title: Degrees of Boosting)
This paper started out as work on boosting but turned into something different,
joint with Werner Stuetzle and Yi Shen.
Yi Shen's 2005 Ph.D. thesis on cost-weighted class probability estimation
Cost-Weighted Boosting with Jittering and Over/Under-Sampling: JOUS-Boost
(Journal of Machine Learning Research 8 (Mar), 409-439, 2007).
A simple modification of boosting, joint with David Mease and Adi Wyner.
Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with
Applications to Function Estimation and Functional Data
with Wolfgang Rolke.
Visual Comparison of Datasets using Mixture Decompositions
Alan Gous and Andreas Buja;
Journal of Computational and Graphical Statistics, 13 (1), 1-19 (2004).
(We are permitted to post the color version of this paper. The printed version is
black and white with gray-scale figures.)
Data Mining Criteria for Tree-Based Regression and Classification
A. Buja and Y.-S. Lee; Proceedings of KDD 2001, 27-36.
Here is an article everybody should read:
originally published in the
retyped and posted with permission.
- The Science of Scientific Writing
by Gopen and Swan
It's the single best piece on writing in the sciences, no exaggeration!