Room: 471 JMHH
Office: (215) 898-8222 (Leave a note with the administrator.)
Email: Click here for an image of the address. (The address 'buja@wharton...' is obsolete.)
Curriculum vitae: [.pdf] (a more fun alternative from the 2004/5 MBA guide)
source("http://stat.wharton.upenn.edu/~buja/STAT-101/src-probability.R")
Here are things that can be done:
X <- make.RV(1:6, rep("1/6",6))                  # Create a fair die (class: "RV")
Y <- make.RV(1:6, c(0.1,0.1,0.1,0.1,0.2,0.4))    # Create a loaded die ( '' )
P(X>3); P(Y>3)                                   # Probabilities of events
E(X); E(Y)                                       # Expected values
V(X); V(Y)                                       # Variances
SD(X); SD(Y)                                     # Standard deviations
par(mfrow=c(2,1)); plot(X); plot(Y)              # Plot as pin graphs
S <- SofI(X,Y); par(mfrow=c(1,1)); plot(S)       # Sum of two independent RVs
S10 <- SofIID(X,10); plot(S10)                   # Sum of 10 iid copies of X (works for many more => CLT)
qqnorm(S10)                                      # Normal quantile plot for RVs to check the CLT effect
X.sim <- rsim(1000, X)                           # Simulate from X (class: "RVsim")
plot(X.sim)                                      # Plot simulated data as pin graph
probs(X); props(X.sim)                           # Compare probabilities and simulated proportions
E(X); mean(X.sim)                                # Compare expected value and mean
SD(X); sd(X.sim)                                 # Compare theoretical and observed std.dev.
X2 <- X^2; X2; plot(X2)                          # univariate analytical transformation
Yexp <- exp(Y); Yexp; plot(Yexp)                 # ''
Yfair <- Y - E(Y); Yfair                         # Centering a RV: creates a fair game from a loaded die
Z <- (X - E(X))/SD(X); Z; plot(Z)                # z-scoring/standardizing a random variable
Ybern <- con(ifelse(Y>3,1,0))                    # Create a Bernoulli variable; 'con()' contracts values/probs
Check the header of the source file for more explanations and examples.
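For readers who want to see what such calculations amount to without loading the source file, here is a minimal base-R sketch. It is not the code in src-probability.R: the helper names (ev, vr, sdv, sum.indep) are made up for illustration, and a random variable is represented simply as a vector of values plus a vector of probabilities.

```r
# Hypothetical stand-ins for illustration; NOT the functions from src-probability.R.
vals     <- 1:6
p.fair   <- rep(1/6, 6)                       # fair die
p.loaded <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)   # loaded die

ev  <- function(x, p) sum(x * p)                  # expected value
vr  <- function(x, p) sum((x - ev(x, p))^2 * p)   # variance
sdv <- function(x, p) sqrt(vr(x, p))              # standard deviation

# Distribution of the sum of two independent discrete random variables:
# form all pairwise sums, multiply the probabilities, then add up the
# probabilities of pairs that give the same sum.
sum.indep <- function(x1, p1, x2, p2) {
  s   <- as.vector(outer(x1, x2, "+"))
  pr  <- as.vector(outer(p1, p2, "*"))
  agg <- tapply(pr, s, sum)
  list(values = as.numeric(names(agg)), probs = as.vector(agg))
}

sum(p.loaded[vals > 3])        # probability that the loaded die shows more than 3
ev(vals, p.fair)               # expected value of the fair die
ev(vals, p.loaded)             # expected value of the loaded die
sdv(vals, p.fair)              # standard deviation of the fair die

S <- sum.indep(vals, p.fair, vals, p.loaded)
ev(S$values, S$probs)          # equals the sum of the two expected values
vr(S$values, S$probs)          # equals the sum of the two variances
```

The last two lines illustrate that expected values always add, and variances add under independence.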
currencies <- read.csv("http://stat.wharton.upenn.edu/~buja/DATA/Currencies-2006-2016.csv")
currencies.nav <- a.nav.create(currencies)
When predictors are random, statisticians seem comfortable conditioning on them and treating them as fixed. The underlying argument is that the predictors form an ancillary statistic. This argument is flawed, however, because it assumes the correctness of the model before even examining it. We reconstruct in our own way a piece of econometric theory to sort out the effects of model violations in the presence of random predictors.
See what happens as deterministic responses Y, one linear and the
other nonlinear, are fitted by a linear function of X, from
dataset to dataset. The point: Y is error-free, only X has
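The kind of experiment described above can be imitated with a few lines of base R. This is only a sketch under assumptions of my choosing (standard normal X, 50 observations per dataset, the responses 1 + 2X and X^2); it is not the original demonstration.

```r
# Assumed setup for illustration: X ~ N(0,1), n = 50, 200 datasets.
set.seed(1)
slopes <- replicate(200, {
  x     <- rnorm(50)     # random predictor, redrawn for every dataset
  y.lin <- 1 + 2 * x     # deterministic linear response (no error term)
  y.non <- x^2           # deterministic nonlinear response (no error term)
  c(linear    = unname(coef(lm(y.lin ~ x))[2]),
    nonlinear = unname(coef(lm(y.non ~ x))[2]))
})

# Spread of the fitted slope across datasets for each response
apply(slopes, 1, sd)
```

For the linear response the fitted slope is the same in every dataset, up to rounding. For the nonlinear response it varies from dataset to dataset even though Y contains no noise at all: the randomness of X alone produces sampling variability once the linear model is wrong.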
(Statistica Sinica, Special Issue on Machine Learning, 16 (2), 323--352 (2006))
A preliminary version and a companion paper which I keep posted because others have started referring to them: The Effect of Bagging on Variance, Bias, and Mean Squared Error [.pdf] PPT slides,
Smoothing Effects of Bagging: Von Mises Expansions of Bagged Functionals [.pdf]
by Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao.
When predictors for statistical models are selected by looking at the data, statistical inference based on these models is in danger of being invalid. We show that confidence intervals may need to be widened, sometimes considerably, to protect against invalidation. This is a fundamental difficulty with statistical inference that has implications all the way down to how we teach statistics in introductory courses.
Install it in R by typing or copy/pasting the following line into an R interpreter:
install.packages("http://stat.wharton.upenn.edu/~buja/PAPERS/PoSI_1.0.tar.gz", repos=NULL, type="source")
Then play with the examples at the end of the help
and cannibalize them for your purposes.
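To see why selecting predictors by looking at the data can invalidate naive inference, here is a small base-R simulation. It illustrates the problem only; it does not use the package and does not compute the widened intervals. The setup is an assumption made for illustration: 10 pure-noise predictors, 50 observations, and a selection rule that keeps the single predictor with the largest |t| statistic.

```r
# Assumed setup for illustration; this is not the package's method.
set.seed(2)
n <- 50; p <- 10; nsim <- 2000

miss <- replicate(nsim, {
  X <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)                  # y is unrelated to every predictor: all true slopes are 0
  # t statistic of the slope in each simple regression of y on one column of X,
  # computed from the correlation r via t = r * sqrt((n - 2) / (1 - r^2))
  r     <- as.vector(cor(X, y))
  tstat <- r * sqrt((n - 2) / (1 - r^2))
  # Select the predictor with the largest |t|; its naive 95% interval
  # excludes the true value 0 exactly when |t| exceeds the t quantile.
  max(abs(tstat)) > qt(0.975, df = n - 2)
})

mean(miss)   # share of naive intervals that miss 0; nominal rate would be 0.05
```

The naive interval for the selected predictor misses the true slope far more often than 5% of the time, which is the sense in which intervals have to be widened after selection.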
(Journal of Marketing, Oct 2007, featured JM blog article and a finalist for JM's 2007 Harold H. Maynard Award)
Along with the paper go a few scenario calculations that are not included in the article: [.pdf]
Appeared in "Handbook of Statistics" (eds. E. Wegman, C. R. Rao; 2005). (An older version that had both papers in one should be considered out of date.)
Yi Shen's 2005 Ph.D. thesis on cost-weighted class probability estimation [pdf]
(Journal of Machine Learning Research 8 (Mar), 409-439, 2007). On a simple modification of boosting, joint with David Mease and Adi Wyner.
Alan Gous and Andreas Buja; Journal of Computational and Graphical Statistics, 13 (1), 1-19 (2004).
(We are permitted to post the color version of this paper. The printed version is b/w with gray-scale figures.)
A. Buja and Y.-S. Lee; Proceedings of KDD 2001, 27--36.