The Liem Sioe Liong/First Pacific Company Professor Emeritus of Statistics (as of July 2021)
Department of Statistics
The Wharton School
University of Pennsylvania

Current position:
Senior Research Scientist (as of January 2020)
Center for Computational Mathematics
Flatiron Institute
Simons Foundation
162 Fifth Avenue, New York, NY 10010

Curriculum vitae: [.pdf] (a more fun alternative from the 2004/5 MBA guide)

Interests:

• A model-free theory of parametric regression
• Models as Approximations I: Consequences Illustrated with Linear Regression, by Andreas Buja, Richard Berk, Lawrence Brown, Ed George, Emil Pitkin, Mikhail Traskin, Kai Zhang and Linda Zhao.
• Models as Approximations II: A Model-Free Theory of Parametric Regression, by Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Ed George and Linda Zhao.
• Assumption Lean Regression, by Richard Berk, Andreas Buja, Lawrence Brown, Edward George, Arun Kumar Kuchibhotla, Weijie Su and Linda Zhao.
• Talk at Larry Brown's Memorial Conference, UPenn, 2018/12/01
• When predictors are random, statisticians seem comfortable conditioning on them and treating them as fixed. The underlying argument is that the predictors form an ancillary statistic. This argument is flawed, however, because it assumes the correctness of the model before even examining it. We reconstruct in our own way a piece of econometric theory to sort out the effects of model violations in the presence of random predictors.
• An animation in R to demonstrate the conspiracy effect of nonlinearity and random X: Copy/paste the following line into an R interpreter:
`source("http://stat.wharton.upenn.edu/~buja/PAPERS/src-conspiracy-animation2.R")`
See what happens as deterministic responses Y, one linear and the other nonlinear, are fitted by a linear function of X, from dataset to dataset. The point: Y is error-free, only X has randomness.
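The same point can be sketched numerically without the animation. The following Python snippet is a minimal illustration, not the R script linked above: the sample size, the number of datasets, the seed, and the two response functions are all assumptions chosen for the example. It fits a straight line to error-free responses across many datasets in which only X is random, and compares how much the fitted slope moves.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y on x, with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n_datasets, n = 200, 50          # hypothetical sizes

slopes_linear, slopes_nonlinear = [], []
for _ in range(n_datasets):
    x = rng.normal(size=n)       # the only source of randomness
    y_lin = 1.0 + 2.0 * x        # error-free linear response (assumed form)
    y_non = x + x ** 2           # error-free nonlinear response (assumed form)
    slopes_linear.append(ols_slope(x, y_lin))
    slopes_nonlinear.append(ols_slope(x, y_non))

sd_linear = float(np.std(slopes_linear))
sd_nonlinear = float(np.std(slopes_nonlinear))
print("sd of slope, linear response:   ", sd_linear)
print("sd of slope, nonlinear response:", sd_nonlinear)
```

With a linear response the fitted slope is the same in every dataset up to rounding error, whereas with a nonlinear response it varies from dataset to dataset even though Y contains no noise at all.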

• Statistical Inference after Model Selection:
• Valid Post-Selection Inference [Article, pdf], [Supplement, pdf], [Talk, pdf],
by Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao.
When predictors for statistical models are selected by looking at the data, statistical inference based on these models is in danger of being invalid. We show that confidence intervals may need to be widened, sometimes considerably, to protect against invalidation. This is a fundamental difficulty with statistical inference that has implications all the way down to how we teach statistics in introductory courses.
• Software for computing PoSI constants: R package
Play with the examples at the end of the help page,
`    help(PoSI)`
and cannibalize them for your purposes.
• Short presentation [PDF] at a workshop on "Reproducibility of Scientific Results", NAS, Feb 26/27, 2015.
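Why selection can invalidate inference is easy to see in a small simulation. The Python sketch below is a simplified illustration under stated assumptions, not the PoSI procedure and not the R package: it uses pure-noise data, screens predictors one at a time, and keeps the one with the largest |t|. The sizes, the seed, and the 1.96 cutoff are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
n, p, n_sim = 100, 10, 2000      # hypothetical sizes
crit = 1.96                      # nominal two-sided 5% cutoff (normal approximation)

false_rejections = 0
for _ in range(n_sim):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)       # y is unrelated to every predictor

    # Simple regression of y on each predictor separately
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    ssx = (Xc ** 2).sum(axis=0)
    beta = Xc.T @ yc / ssx
    resid = yc[:, None] - Xc * beta
    se = np.sqrt((resid ** 2).sum(axis=0) / (n - 2) / ssx)
    t = beta / se

    # Data-driven selection: report only the most significant predictor,
    # then test it as if it had been chosen in advance
    if np.abs(t).max() > crit:
        false_rejections += 1

naive_error_rate = false_rejections / n_sim
print("nominal level: 0.05, actual error rate after selection:", naive_error_rate)
```

Although no predictor has any effect, the naive test on the selected predictor rejects far more often than the nominal 5%. Widening the intervals, as the article proposes, is what restores the guarantee.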

• Visualization of Large Correlation Tables:
• A Tool for Mining Large Correlation Tables: 'Association Navigator' [pdf]
This is a report (joint with Abba Krieger and Ed George) written for the Simons Foundation - Autism Research Initiative (SFARI). The work under a SFARI grant was the reason why we created an interactive tool for visualizing correlation tables for many hundreds of variables. The report draws its examples from the 'Simons Simplex Collection' (SSC), a large database of autism phenotype data.
• Association Navigator [R software] written in the R language (currently only for MS Windows). You can load the software into an R interpreter and run an example by executing the following expressions:
```
source("http://stat.wharton.upenn.edu/~buja/association-navigator.R")
currencies <- read.csv("http://stat.wharton.upenn.edu/~buja/DATA/Currencies-2006-2016.csv")
currencies.nav <- a.nav.create(currencies)
a.nav.run(currencies.nav)
```
Hit the letter 'h' for a help window. It needs to be closed before continuing in the navigator window.
Follow the instructions on page 34f of the above report to apply the software to your own numeric data matrix.
Disclaimer: At this point this software has only been tested on MS Windows machines.
• Publication:
"A Visualization Tool for Mining Large Correlation Tables: The Association Navigator,"
Buja, A., Krieger, A., George, E. (2016), in: Handbook of Big Data,
eds.: Peter Buhlmann, Petros Drineas, Mark van der Laan, Michael Kane.

• Observations on Bagging [pdf], joint with Werner Stuetzle
(Statistica Sinica, Special Issue on Machine Learning, 16 (2), 323--352 (2006))
A preliminary version and a companion paper which I keep posted because others have started referring to them:
The Effect of Bagging on Variance, Bias, and Mean Squared Error [.pdf] PPT slides,
Smoothing Effects of Bagging: Von Mises Expansions of Bagged Functionals [.pdf]

• Inference for EDA and Diagnostics: It is commonly thought that the visualization methods used in exploratory data analysis and model diagnostics are beyond, and even adverse to, statistical inference. This is not so. With some simple protocols it is possible to assign p-values to visual discoveries.
• "Graphical Inference for Infovis" [pdf] (Wickham, Cook, Hofmann, Buja, 2010; IEEE Trans. on Visualization and Computer Graphics, Vol. 16, No. 6, Nov/Dec; Best Paper Award)
• "Statistical inference for exploratory data analysis and model diagnostics" [pdf] with supplementary materials [pdf] (Buja, Cook, Hofmann, Lawrence, Lee, Swayne, Wickham, 2009; Philosophical Transactions of The Royal Society, A)
• A precursor talk given at the Joint Statistics Meetings 1999, with Di Cook [.pdf].
• A related topic is model checking with parametric bootstrap: I brought this up back in 2004 in a discussion of a paper by Andrew Gelman, who does the same with a posterior predictive approach. Here are Gelman's JCGS article, my discussion, and his rejoinder.
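One such protocol is the "lineup": the plot of the real data is hidden among m-1 plots of null data, and if a viewer singles out the real plot, the discovery is significant at level 1/m. The Python sketch below illustrates why this level holds. It is an assumption-laden stand-in, not code from the papers above: null panels are generated by permuting y, and the "viewer" is replaced by a rule that picks the panel with the largest absolute correlation. All names, sizes, and the seed are hypothetical.

```python
import numpy as np

def make_lineup(x, y, m, rng):
    """Hide the real (x, y) panel at a random position among m-1 permuted-y panels."""
    true_position = int(rng.integers(m))
    panels = [(x, y) if k == true_position else (x, rng.permutation(y))
              for k in range(m)]
    return panels, true_position

def most_striking_panel(panels):
    """Stand-in for a human viewer: choose the panel with the largest |correlation|."""
    strengths = [abs(np.corrcoef(px, py)[0, 1]) for px, py in panels]
    return int(np.argmax(strengths))

rng = np.random.default_rng(2)   # arbitrary seed
m, n, n_sim = 20, 30, 2000       # hypothetical sizes

hits = 0
for _ in range(n_sim):
    x = rng.normal(size=n)
    y = rng.normal(size=n)       # null situation: no association between x and y
    panels, true_position = make_lineup(x, y, m, rng)
    hits += int(most_striking_panel(panels) == true_position)

false_alarm_rate = hits / n_sim
print("chance of picking the real plot under the null:", false_alarm_rate, "vs 1/m =", 1 / m)
```

When there is nothing to discover, the real plot is picked about once in m lineups, so picking it in a single lineup carries a p-value of 1/m.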

• Multidimensional Scaling:
• Lisha Chen's thesis resulted in the following two articles:
• Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing (Chen and Buja; JMLR 2013) [.pdf]
• Local Multidimensional Scaling for Nonlinear Dimension Reduction, Graph Drawing and Proximity Analysis (Chen and Buja; JASA 2009) [.pdf]
• Data Visualization With Multidimensional Scaling (Buja, Swayne, Littman, Dean, Hofmann and Chen; JCGS 2008) [.pdf]
• Visualization Methodology for Multidimensional Scaling (Buja and Swayne; JoC 2002) [.pdf]

• Quasi-Darwinian Selection in Marketing Relationships [.pdf] joint with N. Eyuboglu;
(Journal of Marketing, Oct 2007, featured JM blog article and a finalist for JM's 2007 Harold H. Maynard Award)
Along with the paper go a few scenario calculations that are not included in the article: [.pdf]
• Different Worlds: Do Recommender Systems Fragment Consumers' Interest? [article] In his Ph.D. thesis, Dan Fleder devised a scheme whereby observational data about consumers of music before and after joining a recommender service could be interpreted as describing a natural experiment. As a consequence, he got as close as conceivable to causal inference about the effects of recommender systems on consumers in a particular setting. Here is a short version in Knowledge@Wharton:

• Penalized Singular Value Decompositions with Jianhua Huang and Haipeng Shen:
• The Analysis of Two-Way Functional Data Using Two-Way Regularized Singular Value Decompositions (Huang, Shen and Buja, JASA, 2009) [.pdf], with supplementary material [.pdf]
• Functional principal components analysis via penalized rank one approximation (Huang, Shen and Buja; Electronic Journal of Statistics, 2008)[.pdf]

• High-Dimensional Data Visualization with grand tours and guided tours, joint with Deborah Swayne, Di Cook, Dan Asimov, and Catherine Hurley:
• Theory of Dynamic Projections in High-Dimensional Data Visualization [pdf] Describes the invariant Riemannian geometries on Stiefel manifolds.
• Computational Methods for High-Dimensional Rotations in Data Visualization [pdf]
Appeared in "Handbook of Statistics" (eds. E. Wegman, C. R. Rao; 2005). (An older version that had both papers in one should be considered out of date.)
• Differential geometry for dynamic projections: invariant Riemannian geometries on Stiefel manifolds [.pdf]
• Computational Methods for High-Dimensional Rotations in Data Visualization [.pdf]
• Software architecture for interactive dynamic data visualization systems [.pdf]
• Methodology for viewing high-dimensional data with dynamic projections [.pdf]
• XGobi manual [.pdf]
• Grand tours and projection pursuit [.pdf]
• Projection pursuit indices [.pdf]
• "Prosections": Theory of the synthesis of projecting and sectioning multivariate data clouds [.pdf]
• After Asimov's seminal article on grand tours, here is the first proposal for extending the idea to correlation and regression tours [.pdf]

• Multivariate Analysis: A talk I gave at an econometrics workshop at Stanford. It tries to answer the question of how to choose the "reference metric" or the constraint in multivariate methods based on eigendecompositions. [.pdf]

• Loss Functions for Binary Class Probability Estimation: Structure and Applications. (Former title: Degrees of Boosting) [.pdf]
This paper started out as work on boosting but turned into something different, joint with Werner Stuetzle and Yi Shen.
Yi Shen's 2005 Ph.D. thesis on cost-weighted class probability estimation [pdf]

• Cost-Weighted Boosting with Jittering and Over/Under-Sampling: JOUS-Boost [pdf]
(Journal of Machine Learning Research 8 (Mar), 409-439, 2007).
A simple modification of boosting, joint with David Mease and Adi Wyner.

• Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data [.pdf, 1.7MB], with Wolfgang Rolke.

• Visual Comparison of Datasets using Mixture Decompositions [.pdf]
Alan Gous and Andreas Buja; Journal of Computational and Graphical Statistics, 13 (1), 1-19 (2004).
(We are permitted to post the color version of this paper. The printed version is b/w with gray-scale figures.)

• Data Mining Criteria for Tree-Based Regression and Classification [.ps.gz]
A. Buja and Y.-S. Lee; Proceedings of KDD 2001, 27--36.

On writing:

Here is an article everybody should read:
• The Science of Scientific Writing by Gopen and Swan [HTML] [.pdf]
originally published in the American Scientist, retyped and posted with permission.
It's the single best piece on writing in the sciences--no exaggeration!