The Liem Sioe Liong/First Pacific Company Professor of Statistics
Room: 471 JMHH
Doctoral Program in Statistics:
Faculty Positions:
Office: (215) 898-8222 (Leave a note with the administrator.)
Email: Click here for an image of the address. (The address 'buja@wharton...' is obsolete.)
Curriculum vitae:
[.pdf]
(a more fun alternative from
the 2004/5 MBA guide)
If you are interested in our Ph.D. program, please visit
our program website.
Once you decide to apply, start your application
at this website.
There will be no job opening this year in the Statistics Department.
Slides as of June 10, 2009.
Rapplet to create an animated display of proper scoring rules.
Either download the file and source it in R, or do this in one swoop:
source("http://stat.wharton.upenn.edu/~buja/proper-scoring-rapplet.R")
Then drag the mouse on the R plot window or hit 'h' for Help.
R script and text for regression trees
R script and text for principal component analysis
R script and text for k-means clustering
R script and text for interactive R programming
A preliminary version and a companion paper which I keep posted because others have
started referring to them:
In his Ph.D. thesis, Dan Fleder devised a scheme
whereby observational data about consumers of music before and
after joining a recommender service could be interpreted as describing
a natural experiment. As a consequence, he
got as close as conceivable to causal inference about
the effects of recommender systems on consumers
in a particular setting. Here is a short version in Knowledge@Wharton:
Different Worlds: Do Recommender Systems Fragment Consumers' Interest?
When predictors for statistical models are selected by looking at the data,
statistical inference based on these models is in danger of being invalid.
We show that confidence intervals may need to be widened considerably
to protect against invalidation. This is a fundamental difficulty with
statistical inference that has implications all the way down to how we teach
statistics in introductory courses.
This is a report (joint with Abba Krieger and Ed George) written for the
Simons Foundation - Autism Research Initiative (SFARI).
Work under a SFARI grant led us to create an interactive tool for visualizing correlation tables
for many hundreds of variables. The report draws its examples from the 'Simons Simplex Collection' (SSC), a large
database of autism phenotype data.
source("http://stat.wharton.upenn.edu/~buja/association-navigator.R")
Then follow the simple instructions on page 35 of the above report to
apply the software to your own numeric data matrix.
Here are
Gelman's JCGS article,
followed by my discussion,
and his rejoinder.
An older version that had both papers in one should be considered out of date.
Appeared in "Handbook of Statistics" (eds. E. Wegman, C. R. Rao; 2005).
Quasi-Darwinian Selection in Marketing Relationships
[.pdf]
Journal of Marketing, Oct 2007,
featured in a JM blog article
and a finalist for JM's 2007 Harold H. Maynard Award.
Along with the paper go a few scenario calculations that are not included in the article:
[.pdf]
Loss Functions for Binary Class Probability Estimation: Structure and Applications. (Former title: Degrees of Boosting)
[.pdf] (under revision)
Yi Shen's 2005 Ph.D. thesis on cost-weighted class probability estimation
[pdf]
Cost-Weighted Boosting with Jittering and Over/Under-Sampling: JOUS-Boost
[pdf]
(Journal of Machine Learning Research 8 (Mar), 409-439, 2007)
Observations on Bagging (Statistica Sinica, Special Issue on Machine Learning, 16 (2), 323--352 (2006))
[pdf]
The Effect of Bagging on Variance, Bias, and Mean Squared Error
[.pdf]
PPT slides
Smoothing Effects of Bagging
[.pdf]
Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with
Applications to Function Estimation and Functional Data
[.pdf, 1.7MB] (under revision)
Alan Gous and Andreas Buja;
Journal of Computational and Graphical Statistics, 13 (1), 1-19 (2004).
(We are permitted to post the color version of this paper. The printed version is
b/w with gray-scale figures.)
Data Mining Criteria for Tree-Based Regression and Classification
[.ps.gz]
A. Buja and Y.-S. Lee; Proceedings of KDD 2001, 27--36.