Room: 471 JMHH
Office: (215) 898-8222 (Leave a note with the administrator.)
Email: Click here for an image of the address. (The address 'buja@wharton...' is obsolete.)
Curriculum vitae: [.pdf] (a more fun alternative from the 2004/5 MBA guide)
source("http://stat.wharton.upenn.edu/~buja/STAT-101/src-probability.R")
Here are things that can be done:
X <- make.RV(1:6, rep("1/6",6))                  # Create a fair die (class: "RV")
Y <- make.RV(1:6, c(0.1,0.1,0.1,0.1,0.2,0.4))    # Create a loaded die ( '' )
P(X>3); P(Y>3)                                   # Probabilities of events
E(X); E(Y)                                       # Expected values
V(X); V(Y)                                       # Variances
SD(X); SD(Y)                                     # Standard deviations
par(mfrow=c(2,1)); plot(X); plot(Y)              # Plot as pin graphs
S <- SofI(X,Y); par(mfrow=c(1,1)); plot(S)       # Sum of two independent RVs
S10 <- SofIID(X,10); plot(S10)                   # Sum of 10 iid copies of X (works for many more => CLT)
qqnorm(S10)                                      # Normal quantile plot for RVs to check the CLT effect
X.sim <- rsim(1000, X)                           # Simulate from X (class: "RVsim")
plot(X.sim)                                      # Plot simulated data as pin graph
probs(X); props(X.sim)                           # Compare probabilities and simulated proportions
E(X); mean(X.sim)                                # Compare expected value and mean
SD(X); sd(X.sim)                                 # Compare theoretical and observed std.dev.
X2 <- X^2; X2; plot(X2)                          # univariate analytical transformation
Yexp <- exp(Y); Yexp; plot(Yexp)                 # ''
Yfair <- Y - E(Y); Yfair                         # Centering a RV: creates a fair game from a loaded die
Z <- (X - E(X))/SD(X); Z; plot(Z)                # z-scoring/standardizing a random variable
Ybern <- con(ifelse(Y>3,1,0))                    # Create a Bernoulli variable; 'con()' contracts values/probs
Check the header of the source file for more explanations and examples.
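For readers who want to see the arithmetic behind expected values, variances and sums of independent random variables without sourcing the file, here is a minimal base-R sketch. It is not the code in src-probability.R: the helpers 'rv', 'ev', 'vr' and 'sum_indep' are hypothetical names made up for this illustration, and the fair die uses numeric probabilities rather than the "1/6" strings shown above.

```r
# Base-R sketch: a discrete random variable as a list of values and probabilities.
# 'rv', 'ev', 'vr' and 'sum_indep' are hypothetical helpers, NOT functions
# from src-probability.R.
rv <- function(vals, probs) {
  stopifnot(length(vals) == length(probs), abs(sum(probs) - 1) < 1e-12)
  list(vals = vals, probs = probs)
}
ev <- function(X) sum(X$vals * X$probs)               # expected value
vr <- function(X) sum((X$vals - ev(X))^2 * X$probs)   # variance

# Distribution of X + Y for independent X and Y: multiply the probabilities
# over all pairs of values, then add up the pairs that give the same sum.
sum_indep <- function(X, Y) {
  s <- as.vector(outer(X$vals, Y$vals, "+"))
  p <- as.vector(outer(X$probs, Y$probs, "*"))
  agg <- tapply(p, s, sum)
  rv(as.numeric(names(agg)), as.vector(agg))
}

X <- rv(1:6, rep(1/6, 6))                       # fair die
Y <- rv(1:6, c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4))   # loaded die

ev(X); ev(Y)                # expected values: 3.5 and 4.4
sqrt(vr(X)); sqrt(vr(Y))    # standard deviations
sum(Y$probs[Y$vals > 3])    # P(Y > 3)
S <- sum_indep(X, Y)
ev(S)                       # equals ev(X) + ev(Y)
vr(S)                       # equals vr(X) + vr(Y) because X and Y are independent
```

Storing values and probabilities side by side keeps every quantity a one-line weighted sum; the convolution in 'sum_indep' is the only step that needs more than that.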
source("http://stat.wharton.upenn.edu/~buja/PAPERS/src-conspiracy-animation2.R")
See what happens as deterministic responses Y, one linear and the other nonlinear, are fitted by a linear function of X, from dataset to dataset. The point: Y is error-free, only X has randomness.
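The same point can be checked numerically with a few lines of base R. This is a rough sketch, not the animation in src-conspiracy-animation2.R: the sample size, the uniform distribution of X and the two response functions 2*x and x^2 are all assumptions chosen for illustration.

```r
# Base-R sketch, NOT the animation from src-conspiracy-animation2.R.
# Sample size, uniform X, and the responses 2*x and x^2 are assumptions.
set.seed(1)
n.datasets <- 200
n <- 20
slopes <- t(replicate(n.datasets, {
  x <- runif(n, 0, 3)      # only X is random
  y.lin <- 2 * x           # error-free linear response
  y.nonlin <- x^2          # error-free nonlinear response
  c(linear    = unname(coef(lm(y.lin ~ x))[2]),
    nonlinear = unname(coef(lm(y.nonlin ~ x))[2]))
}))
apply(slopes, 2, sd)       # spread of the fitted slopes across datasets
```

Under these assumptions the fitted slope for the linear response is the same in every dataset up to rounding, while the slope for the nonlinear response changes with each new draw of X even though Y carries no error.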
by Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao.
When predictors for statistical models are selected by looking at the data, statistical inference based on these models is in danger of being invalid. We show that confidence intervals may need to be widened, sometimes considerably, to protect against invalidation. This is a fundamental difficulty with statistical inference that has implications all the way down to how we teach statistics in introductory courses.
Install it in R by typing or copy/pasting the following line into an R interpreter:
install.packages("http://stat.wharton.upenn.edu/~buja/PAPERS/PoSI_1.0.tar.gz", repos=NULL, type="source")
Then play with the examples at the end of the help
and cannibalize them for your purposes.
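To get a feel for why data-driven selection can invalidate nominal intervals, the following base-R simulation may help. It is only a sketch and does not use the PoSI package or its method: the sample size, the number of predictors, the number of replications and the rule "pick the predictor with the largest |t| among one-predictor regressions" are assumptions made for this illustration.

```r
# Base-R simulation sketch; it does NOT use the PoSI package.
# n, p, n.rep and the selection rule are assumptions for illustration.
set.seed(1)
n <- 50; p <- 10; n.rep <- 500
covered <- replicate(n.rep, {
  X <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)                       # y is unrelated to every predictor
  # t statistic and nominal 95% interval from each one-predictor regression
  fits <- lapply(seq_len(p), function(j) {
    fit <- lm(y ~ X[, j])
    list(t = summary(fit)$coefficients[2, "t value"], ci = confint(fit)[2, ])
  })
  best <- which.max(abs(sapply(fits, function(f) f$t)))   # data-driven selection
  ci <- fits[[best]]$ci
  ci[1] <= 0 && 0 <= ci[2]            # does the interval cover the true slope 0?
})
mean(covered)   # coverage after selection, to compare with the nominal 0.95
```

Because the interval is reported only for the predictor that looked strongest, its coverage of the true slope falls well below the nominal level in this setup, which is the kind of shortfall that wider intervals are meant to repair.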
This is a report (joint with Abba Krieger and Ed George) written for the Simons Foundation - Autism Research Initiative (SFARI). Work under a SFARI grant led us to create an interactive tool for visualizing correlation tables for many hundreds of variables. The report draws its examples from the 'Simons Simplex Collection' (SSC), a large database of autism phenotype data.
Then follow the simple instructions on pages 34 and 36 of the above report to apply the software to your own numeric data matrix.
(Journal of Marketing, Oct 2007, featured JM blog article and a finalist for JM's 2007 Harold H. Maynard Award)
Along with the paper go a few scenario calculations that are not included in the article: [.pdf]
Appeared in "Handbook of Statistics" (eds. E. Wegman, C. R. Rao; 2005). (An older version that had both papers in one should be considered out of date.)
Yi Shen's 2005 Ph.D. thesis on cost-weighted class probability estimation [pdf]
(Journal of Machine Learning Research 8 (Mar), 409-439, 2007). On a simple modification of boosting, joint with David Mease and Adi Wyner.
(Statistica Sinica, Special Issue on Machine Learning, 16 (2), 323--352, 2006)
A preliminary version and a companion paper, which I keep posted because others have started referring to them:
The Effect of Bagging on Variance, Bias, and Mean Squared Error [.pdf] PPT slides
Smoothing Effects of Bagging [.pdf]
Alan Gous and Andreas Buja; Journal of Computational and Graphical Statistics, 13 (1), 1-19 (2004).
(We are permitted to post the color version of this paper. The printed version is b/w with gray-scale figures.)
A. Buja and Y.-S. Lee; Proceedings of KDD 2001, 27--36.