Robert A. Stine

Department of Statistics

444 Huntsman Hall

The Wharton School of the University of Pennsylvania

Philadelphia, PA 19104-6340



Research papers


Streaming Feature Selection
Streaming feature selection evaluates potential explanatory features for a regression-type model one at a time rather than all at once. This approach allows faster selection methods (such as VIF regression) and avoids the need to precompute every possible predictor at the start of modeling. Building variables on the fly is essential in database modeling and some types of image processing.

Auctions allow a blending of substantive insight with automatic searches for predictive features. I'll put a paper here one of these days that describes the auction process more completely; in the meantime, see these slides from a recent talk. The papers listed here are ingredients of the auction.

Foster, D. P. and Stine, R. A. (2013). Risk inflation of sequential tests controlled by alpha investing
This paper (submitted for publication) develops a computational method for finding the exact risk inflation of the estimator implied by a testing process that is controlled by alpha investing. The resulting feasible sets display all possible risks of one estimation procedure relative to another.
Foster, D. P. and Stine, R. A. (2008, JRSS-B). Alpha-investing: sequential control of expected false discoveries
This paper describes a procedure for testing multiple hypotheses while controlling the number of false discoveries. The key distinguishing attributes are that (a) it handles a stream of hypotheses, so you don't need all the p-values at once, and (b) it allows an investigator to formally incorporate domain knowledge into the testing procedure. We have used a variation of this procedure to pick predictors in the auction.
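To make the streaming mechanics concrete, here is a small R sketch of testing a sequence of p-values with an alpha-investing style rule. The bidding strategy and pay-out shown are illustrative stand-ins, not the exact rule analyzed in the paper.

    # Alpha-investing sketch: spend alpha-wealth on each test in a stream;
    # a rejection earns back a pay-out omega, so discoveries fund later tests.
    # (Illustrative bidding and pay-out; see the paper for the exact rule.)
    alpha_invest <- function(p_values, W0 = 0.05, omega = 0.05) {
      W <- W0
      reject <- logical(length(p_values))
      for (j in seq_along(p_values)) {
        alpha_j <- W / (2 * j)                   # one simple way to spread the wealth
        reject[j] <- p_values[j] <= alpha_j
        if (reject[j]) {
          W <- W + omega - alpha_j               # a discovery earns back wealth
        } else {
          W <- W - alpha_j / (1 - alpha_j)       # pay for a test that does not reject
        }
      }
      reject
    }

    set.seed(1)
    p <- c(runif(5, 0, 0.001), runif(95))        # 5 'real' signals followed by 95 nulls
    which(alpha_invest(p))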
Text Mining, Computational Linguistics
Data sets that are well matched to regression often come with supplemental textual information, such as written comments, open-ended responses, and annotations. Some data sets come with nothing but text. Generating regressors from this text can lead to more predictive models. There are also slides from a recent talk.
Foster, D. P., Liberman, M. and Stine, R. A. (2013). Featurizing text: Converting text into predictors for regression analysis
This draft manuscript (really more of a working paper) describes fast methods for constructing numerical regressors from text using spectral methods related to the singular value decomposition (SVD). An example uses these methods to build regression models for the price of Chicago real estate using nothing but the text of a property listing. Topic models (LDA) provide some explanation for why these methods work as well as they do. For example, our model for real estate explains some 70% of the variation in prices using just the text of the listing, with no attempt to use location or related demographics.
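As a toy illustration of the basic recipe, on a far smaller scale than the paper's, the R sketch below forms a document-term matrix for a few made-up listings and uses the leading singular vectors as regressors; the listings and the choice of three components are invented for the example.

    # Turn a tiny corpus into numeric regressors via the SVD of the
    # document-term matrix (a miniature, hypothetical version of the recipe).
    docs <- c("charming three bedroom home near park",
              "spacious condo with lake view",
              "fixer upper needs work great location",
              "renovated kitchen hardwood floors near park")
    terms <- sort(unique(unlist(strsplit(docs, " "))))
    dtm <- t(sapply(strsplit(docs, " "),
                    function(w) table(factor(w, levels = terms))))
    udv <- svd(scale(dtm, center = TRUE, scale = FALSE))   # SVD of centered counts
    X <- udv$u[, 1:3] %*% diag(udv$d[1:3])                  # 3 leading components as regressors
    # X could now join a regression such as lm(price ~ X), given sale prices.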

Statistics in Finance
It can be very hard to separate luck from skill when it comes time to evaluate the success of investors. We use a dice simulation, described in the following paper, to illustrate this point to students and to show them the role of portfolios in reducing the variance of an investment.

I'll soon put another paper here that offers one approach to making the distinction, but it's not ready yet. Here are the slides from a recent talk.

Foster, D. P. and Stine, R. A. (2005). Being Warren Buffett: A classroom simulation of risk and wealth when investing in the stock market
This paper describes the dice simulation, including the notion of volatility drag as a penalty for variation. This form can be used in class to organize the simulation.
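For readers who want to see volatility drag numerically, here is a small R sketch with a hypothetical die (the faces below are illustrative, not the dice used in the paper): the realized long-run growth rate falls below the average return by roughly half the variance.

    # Volatility drag, numerically: average return vs. realized compound growth.
    # The six gross returns below are hypothetical, not the paper's dice.
    set.seed(1)
    gross <- c(0.8, 0.9, 1.1, 1.1, 1.2, 1.4)
    mean(gross) - 1                                  # average return per period
    wealth <- cumprod(sample(gross, 1000, replace = TRUE))
    wealth[1000]^(1/1000) - 1                        # realized growth per period
    (mean(gross) - 1) - var(gross)/2                 # mean return less the volatility drag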

Foster, D. P. and Stine, R. A. (2005). Finding Warren Buffett: Separating knowledge from luck in investments.
This paper will detail our approach, which uses Bennett's inequality and Bonferroni adjustments to identify investments that do better than chance. The slides from a talk summarize the approach.

Foster, D.P., Stine, R.A., and Young, P (2011). A Markov test for alpha
This revised version of a previous manuscript avoids being fooled by a clever manager, a trick that slips past a maximal approach. The paper introduces a simple-to-perform test called the compound alpha test (CAT). The test has good power (it is tight for sequences of returns generated by "trickery") and is robust to numerous assumptions. The paper includes several illustrative examples using recent stock returns.

In a different vein, the following papers consider models for the forward curve or yield curve. The models decompose the forward curve into several components that isolate different aspects of the evolution of the curve over time.
Chua, C. T., Foster, D. P., Ramaswamy, K., and Stine, R. A. (2007). A dynamic model for the forward curve
Review of Financial Studies, 21, 265-310.
This paper proposes an arbitrage-free model for the temporal evolution of the forward curve. The paper includes a discussion of model estimation (using a Kalman filter), examples fit to Treasury data, and comparison to alternatives.

Chua, C. T., Ramaswamy, K., and Stine, R. A. (2008). Predicting short-term Eurodollar futures.
Journal of Fixed Income, to appear.
This manuscript adapts the methods used in the prior work for Treasuries to practical aspects of modeling Eurodollar futures.

Pooling Subjective Intervals
A work in progress that concerns the use of subjective confidence intervals for making business decisions. A current interactive tool based on 50% intervals is available by following this link.

Information Theory and Model Selection
Information theory (coding, in particular) provides motivation for the various types of model selection criteria in wide use today (e.g., AIC, BIC). It also leads to generalizations of these methods that allow comparison, for example, of non-nested models. These methods can also be used for 'feature selection' in data mining.

The use of information theory in model selection is not new. The AIC (Akaike information criterion) originated as an unbiased estimate of the relative entropy, a key notion in comparing the lengths of codes. More closely tied to coding are MDL and stochastic complexity, both proposed by Rissanen.

MDL (minimum description length) is typically used in an asymptotic form which assumes many observations (large n) and fixed parameters. In this setting, MDL agrees with BIC, the large sample approximation to Bayes factors. Both penalize the likelihood by (1/2) log n for each added parameter.
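A quick numerical check of that penalty, using R's built-in BIC for a fitted linear model (the simulated data are just for illustration):

    # BIC = -2*logLik + k*log(n), i.e., the log-likelihood penalized by (k/2) log n
    # for each of the k estimated parameters.
    set.seed(1)
    n <- 100
    x <- rnorm(n); y <- 2 + 0.5 * x + rnorm(n)
    fit <- lm(y ~ x)
    k <- attr(logLik(fit), "df")                  # counts the error variance as a parameter
    c(BIC(fit), -2 * as.numeric(logLik(fit)) + k * log(n))   # the two agree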

Foster, D. P. and Stine, R. A. (2005).
Polyshrink: An adaptive variable selection procedure that is competitive with Bayes experts
Submitted for publication.
This revised manuscript considers the following competitive analysis of the variable selection problem in regression. Suppose you are trying to fit a regression model and do not know which predictors to include. You decide to use the data to pick the predictors using a selection method. How well can you do this? For example, can you fit a model as well as someone else who knows the distribution of the regression coefficients? This paper gives a lower bound on how well your rival can do and provides a method that we call "Polyshrink" that approaches its performance.

This R package implements the Polyshrink estimator described in the paper.

Stine, R. A. (2004).
Model selection using information theory and the MDL principle.
Sociological Methods & Research , 33, 230-260.
This overview, designed for the social sciences, shows how information theory expands the scope of model selection to encompass the role of theory. It also shows how one can compare non-nested models as well as models of rather different forms. Examples illustrate the calculations, considering several models for Harrison and Rubinfeld's Boston housing data. The paper also introduces a variation on Foster and George's RIC that allows searches that follow the principle of marginality.

Foster, D. P. and Stine, R. A. (2004).
Variable selection in data mining: building a predictive model for bankruptcy.
J. Amer. Statistical Assoc., 99, 303-313.
This revision of a prior manuscript describes an application of variable selection in the context of predicting bankruptcy. The central theme is the attempt to find a selection criterion that picks the right number of variables, where "right" in this context means that it identifies the model that minimizes the out-of-sample error, without having to set aside a hold-back sample. The problem is hard in this example because we consider models with more than 67,000 predictors.

The prior manuscript is here as well, but is missing the figures and one or two references. The new version differs from the prior manuscript in many ways. For example, we no longer use subsampling, we use binomial variances, and we have included a 5-fold cross-validation that compares the predictions of stepwise regression to those of the tree-based classifiers C4.5 and C5.0.

For the truly adventurous, a compressed tar file has all of the source code used for fitting the big models in this paper (written in C and a bit of C++). To build the program, you need a unix system with gcc, but the build is pretty automatic (that is, if you have done this sort of thing -- see the README file). The software is distributed under the GNU General Public License (GPL). You can get a "sanitized" portion of the data in this compressed tar file (Be patient... the file is a bit more than 24 MB.) The data layout follows the format needed by C4.5. Each file represents a fold of 100,000 cases. Further instructions are at the top of the names file.

To see a collection of papers that consider credit modeling more generally, go to the Wharton Financial Institutions Center for proceedings from the Credit Risk Modeling and Decisioning conference which was held here at Wharton, May 29-30, 2002.

Foster, D. P., Stine, R. A., and Wyner, A. D. (2002).
Universal codes for finite sequences of integers drawn from a monotone distribution.
IEEE Trans. on Information Theory, 48, 1713-1720.
We show that you can compress a data series almost as well as if you knew the source distribution. The bounds on performance that we obtain are not asymptotic, but apply for all sequence lengths.

Stine, R. A. and Foster, D. P. (2001).
The competitive complexity ratio
Proceedings of 2001 Conf on Info Sci and Sys, WP8 1-6.
Stochastic complexity is a powerful concept, but its use requires that the associated model have bounded integrated Fisher information. Some models, like that for a Bernoulli probability, satisfy this condition, but others do not. In particular, the normal location model or regression model do not have bounded information. This leads one to bound the parameter space for the model in some fashion, and then compute how this bound affects a code length and the comparison of models.

Foster, D. P. and Stine, R. A. (1999).
Local asymptotic coding
IEEE Trans. on Information Theory, 45, 1289-1293.
Dean Foster and I show that the usual asymptotic characterization of MDL (i.e., (1/2) log n) is not uniform. It fails to hold near the crucial value of zero, where the MDL criterion leads to an "AIC-like" criterion.

Foster, D. P. and Stine, R. A. (2006).
Honest confidence intervals for the error variance in stepwise regression
To appear, at long last.
This paper describes the impact of variable selection on the confidence interval for the prediction error variance of a stepwise regression. When you pick a model by selecting from many factors, you need to widen the interval for s^2 to account for selection bias. The problem is particularly acute when one has more predictors (features) than observations, as often occurs in data mining. This text file has the monthly stock returns used in the example.
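A small simulation conveys the problem (this only illustrates the bias, not the interval proposed in the paper): with many candidate predictors and no true signal, the residual variance from the selected model falls noticeably below the true value of 1.

    # Selection bias in the estimated error variance: pick the 5 predictors most
    # correlated with y (out of 200 pure-noise candidates) and fit a regression.
    set.seed(1)
    n <- 50; p <- 200
    s2 <- replicate(200, {
      X <- matrix(rnorm(n * p), n, p)
      y <- rnorm(n)                                   # true error variance is 1
      best <- order(abs(cor(X, y)), decreasing = TRUE)[1:5]
      summary(lm(y ~ X[, best]))$sigma^2
    })
    mean(s2)                                          # well below 1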

Introductory Lectures
An introductory sequence of lecture notes on methods of model selection (from a tutorial I've given) is also available. The emphasis is on building the ideas needed to support the information theory point of view, so the coverage of some areas (like AIC) is less comprehensive.
  1. Overview
  2. Predictive risk
  3. Bayesian criteria
  4. Introduction to information theory and coding
  5. Information theory and model selection

Hidden Markov Models (HMM)
One paper deals with the problem of estimating the arrival rate and holding distribution of a queue. What makes it hard is that you do not get to see the arrivals, just the number in the queue. Fortunately, the covariances characterize both the arrival rate and holding distribution. With some approximations that give the queue a Markovian form, one can use dynamic programming to compute a likelihood (via a hidden Markov model).
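The dynamic-programming step is just the standard forward recursion for an HMM likelihood. The generic R sketch below shows that computation for a made-up two-state chain; the queueing application would build its transition and observation probabilities from the arrival rate and holding distribution.

    # Forward recursion for the log-likelihood of an HMM (with scaling).
    # P: K x K transition matrix; emit[k, y] = P(observe y | state k); init: start distribution.
    hmm_loglik <- function(obs, P, emit, init) {
      alpha <- init * emit[, obs[1]]
      loglik <- log(sum(alpha)); alpha <- alpha / sum(alpha)
      for (t in 2:length(obs)) {
        alpha <- as.vector(t(P) %*% alpha) * emit[, obs[t]]
        loglik <- loglik + log(sum(alpha)); alpha <- alpha / sum(alpha)
      }
      loglik
    }
    P    <- rbind(c(0.9, 0.1), c(0.2, 0.8))      # hypothetical two-state chain
    emit <- rbind(c(0.7, 0.3), c(0.1, 0.9))      # observations coded 1 and 2
    hmm_loglik(c(1, 1, 2, 2, 1), P, emit, init = c(0.5, 0.5))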

A paper co-authored with J. Pickands appears in Biometrika, 84, 295-308.

A second paper considers two issues: the covariance structure of HMMs (including multivariate HMMs) and the use of model selection methods based on these covariances to find the order of the model (that is, the dimension of the underlying Markov chain). The idea is to exploit the connection between the order of the HMM and the implied ARMA structure of the covariances.

Autocovariance structure of Markov regime switching models and model selection, written with Jing Zhang (who did the hard parts), is to appear in Journal of Time Series Analysis.

Statistical Computing Environments for Social Research
This collection of papers (published by Sage in 1996 and co-edited by John Fox and myself) describes and contrasts seven programming environments for doing statistical computing. Each of these environments is programmable, with a flexible data model and extensible command set.

Examples for each show how to do kernel density estimation, robust regression, and bootstrap resampling. Additional articles focus on three extensions of Lisp-Stat.

Graphical interpretation of a variance inflation factor,
The American Statistician, 49, Feb 1995.

This paper illustrates the use of an interactive plotting tool implemented in Lisp-Stat to reveal the simple relationship among partial regression plots, component plots, and the variance inflation factor. The data sets from the paper are in the files fighters.dat and wildcats.dat.

The basic idea is that the ratio of t-statistics associated with these two plots is the square root of the associated VIF. The interactive plot shows how collinearity affects the relationship presented in regression diagnostic plots. One uses a slider to control how much of the present collinearity appears in the plot.
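The static part of the story is easy to reproduce in R (simulated data; the interactive slider is of course not shown here): the VIF for a predictor is 1/(1 - R^2) from regressing that predictor on the other predictors.

    # VIF for x1 in a deliberately collinear pair of predictors.
    set.seed(1)
    n <- 100
    x1 <- rnorm(n)
    x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # correlation with x1 near 0.9
    y  <- 1 + x1 + x2 + rnorm(n)
    r2 <- summary(lm(x1 ~ x2))$r.squared
    1 / (1 - r2)                                  # VIF for x1, roughly 1/(1 - 0.81)
    # car::vif(lm(y ~ x1 + x2)) reports the same quantity for each predictor.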

Explaining normal quantile plots through animation,
To appear, The American Statistician, 2016.

This manuscript characterizes quantile-quantile plots as a comparison between water levels in two vases that are gradually filled with water. Imagine water filling two vases, each able to hold a liter, at the same rate. If the vases have the same shape, the water levels will match, and a graph of the level in one versus the level in the other traces out a diagonal line. That's the idea behind these animated QQ plots: a parametric plot of the water levels in gradually filling vases shaped as probability distributions motivates quantile plots as used in statistics.
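In base R the water-level idea reduces to a parametric plot of two quantile functions; the pairing of a normal with a t distribution below is just one example.

    # Water levels as quantiles: at fill fraction p, the level in each "vase"
    # is that distribution's quantile, so the curve below is the Q-Q plot.
    p <- seq(0.01, 0.99, by = 0.01)
    plot(qnorm(p), qt(p, df = 3), type = "l",
         xlab = "normal water level", ylab = "t(3) water level")
    abline(0, 1, lty = 2)                         # vases of the same shape fall on this line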

The R package qqvases implements this construction. The software shows an animation of the process, allowing you to choose different distributions. The plots are nicer with smooth populations, but you can show similar figures with samples; these don't look so good unless the sample sizes are fairly large. The implementation requires installing R and shiny on your system (not to mention knowing R). You can also try the procedure by following this link (thanks to Dean Foster for figuring out how to get Shiny running) to see an on-line version of the software in your browser, avoiding the need to install R and shiny on your own system.


Presentation slides


Institute for Research in Cognitive Science (U. Penn) and City University of New York. October, 2013.

Joint Statistics Meeting, Montreal. August, 2013.

34th New Jersey ASA Spring Symposium, New Brunswick, NJ. June, 2013.

Credit Scoring and Credit Control XII, Edinburgh, Scotland. August, 2011.

SIAM International Conference on Data Mining, Phoenix, AZ. April, 2011.

Modern Massive Data Sets, MMDS 2010, Stanford, CA. June, 2010.

Conference on Resampling Methods and High Dimensional Data, College Station, TX. March, 2010.

Wharton Commodities Club, Philadelphia, PA. December, 2009.

Joint Statistics Meeting, Washington, DC. August, 2009.

Philadelphia Chapter, ASA. February, 2009.

Wharton Commodities Club, Philadelphia, PA. November, 2008.

Credit Scoring and Control, Edinburgh. August, 2007.

Joint Statistics Meeting, Salt Lake City. August, 2007.

University of Southampton and University of Edinburgh. April, 2007.

Northern Illinois University. March, 2006.

AAAI 2005, Pittsburgh, PA. July, 2005.

Summer Program in Data Analysis (SPIDA), York University, Toronto. June, 2005.

Department of Statistics, University of Pennsylvania. April, 2005.

Department of Statistics, Rutgers University. February, 2005.

Data Mining Conference, The College of New Jersey. January, 2005.

INFORMS, Denver, CO. October 2004.

M2004 SAS Data Mining Conference, Las Vegas, NV. October 2004.

Credit Rating and Scoring Models, Washington DC. May 2004.

Credit Scoring and Credit Control, VIII, Edinburgh. September 2003.

Nonparametric Modeling, Crete. July 2002.

Profiting from Data Mining

Credit Scoring and Credit Control, VII, Edinburgh. September 2001.

Temple University and Univ. of Pennsylvania. November, 2000

ASA Annual Meeting, Indianapolis, IN. August, 2000

MSMESB Conference, Syracuse, NY. June, 2000

ASA Annual Meeting, Baltimore, MD. August, 1999

DSI Talks in the DASI track (formerly MSMESB)
Bob Andrews continues to do a great job organizing a track of sessions now named Data, Analytics, and Statistics Instruction (formerly Making Statistics More Effective in Schools of Business) at the annual national and SE DSI meetings. I managed to make it to several of these to give short talks, most often as part of panel discussions. The slides, with a little introduction, are listed below.


Computing


Mapping lambda functions in C++
[This was relevant before C++ got serious about lambda functions in C++11.] I'm an old APL/Lisp programmer, and I've always been annoyed that C++ and the STL are so clumsy to use when it comes to mapping a function over a range. The code given here handles that task without having to wait for the next version of the Boost code, C++0x, and a compiler that can handle any of that stuff. This tar file has the needed files. You will need gcc and the Boost library installed on your system. Unpack the tar file (tar -xvzf lambda_maps.tzg), move into that directory, and then build the application (make all). The application will run at the command line. To see the interesting stuff, have a look into the file function_iterators.test.cc and then wind your way farther back into the code. The 'interesting' line is this one:

std::cout << make_unary_range(ret<double>(_1+6.6), iz) << std::endl;

iz is a vector of doubles; this lambda function defines a new range whose elements are 6.6 plus those in the range defined by the container iz. Notice that the lambda function has to declare its return type explicitly.

Data analysis tools for Mathematica
I've defined some simple tools for analyzing data using Mathematica. These two files give you a notebook that illustrates the commands (using a small example with stock prices) and the accompanying package of definitions. The third file has the data used in the examples.

AXIS command interface for Lisp-Stat
AXIS provides a point-and-click, iconic interface to Lisp-Stat. Menus and dialogs provide access to an extended set of statistical analysis routines, particularly for bootstrap resampling. The interface is extensible and includes features to support various linked views. A zip file has the needed files.

Two useful files of documentation are axis.ps, which offers an overview of the use of the AXIS interface, and princomp.ps, which shows how to extend the interface by adding a command to perform principal components analysis. The associated Lisp files for adding commands are:

Further discussion of AXIS with emphasis on extending the interface appears in the collection of edited papers, Statistical Computing Environments for the Social Sciences (Sage, 1996).

Some sample data sets for use with AXIS are:

Automated simulation methods
This paper (published in the 1992 ASA proceedings and here as a pdf file) describes the use of Lisp-Stat to automate some of the more tedious aspects of Monte Carlo simulations.

3-D Rotation methods
This is work in progress that considers the use of various rotation methods for discovering problems in multiple regression models.

Interactive wavelet plots in Lisp-Stat.
This material was presented at the 1995 ASA Meeting in Orlando.


Teaching


Textbooks
Statistics for Business: Decision Making and Analysis (Third Edition, with Dean Foster)

News for R users. Like most books in the business-statistics market, our text features Excel, along with JMP and Minitab. I have a real fondness for the visualization capabilities of JMP, but R has become popular with the growing interest in data science. To help those interested in R, I have prepared an R companion for the 3rd edition of this textbook. The 3rd edition has Analytics in Excel embedded in the text chapters; this companion shows you how to do all of those examples using R instead. Each section of the on-line material shows how to do the "Analytics" applications from our textbook in R. This link takes you to the companion itself. The R examples use data and examples from the 3rd edition; those from the 2nd edition are similar, but not all the same. YMMV. If working through the examples on your own computer, you will need the data archive file and a few supplemental R functions defined in the file functions.R.

Every so often I will add useful classroom supplements from instructors or past students who send me links to related content.

  • Neil Desnoyers (Drexel): The Area Principle
    These notes illustrate how fancy 3-D perspective views of pie charts often mislead by violating the area principle (Section 3.3).
  • Wharton student (Stat 102): Spurious Correlation in Time Series
    If you search through enough data, you will find large correlations. As I say in class, statistics rewards persistence. This web site collects a large number of these and shows sequence plots of series that happen to have very large correlations. You can also manufacture your own spurious correlations, disguising a random walk as a trend (an exercise in Chapter 27, Time Series); a short R sketch of this follows the list.
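Here is the random-walk version in a few lines of R (rerun it a few times; the sample correlation between the two independent walks is frequently large in magnitude):

    # Two independent random walks often look strongly "related".
    walk1 <- cumsum(rnorm(200))
    walk2 <- cumsum(rnorm(200))
    cor(walk1, walk2)                             # often far from 0, purely by chance
    plot(walk1, type = "l"); lines(walk2, col = "red")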
Basic Business Statistics and Business Analysis using Regression (with Dean Foster and Richard Waterman)
These casebooks offer a collection of data analysis examples that motivate and illustrate key ideas of statistics, ranging from standard errors to regression diagnostics. The data used in the casebooks can be downloaded by following the above link.

MBA Statistics Concentration and Courses
Core MBA courses (Stat 603, Stat 604, Stat 608, Stat 621)
These are notes for older versions of the courses as offered up to 2001. Current versions of the materials for these are available from Web Cafe. For information about a concentration in Statistics in the MBA program, see this overview of the requirements.

Statistics 622. Data-driven statistical modeling
This six-week course is offered (usually) in the first and third quarters. It develops modeling ideas introduced in Stat 621, with a greater emphasis on decision making that takes costs into account. The course also introduces more recent developments in automated modeling and visualization. Topics include
  1. Automatic construction of regression models
  2. Calibrating predictions
  3. Models for classifying cases into groups
  4. Data mining techniques including neural networks and tree models
  5. Optimal ordering and cost minimization methods

This talk (ppt slides) looks at data mining from a business and modeling point of view, adding the comments of a statistician who builds models, along with some experiences from modeling business problems such as financial modeling and credit risk analysis. You can get a pdf version of the PowerPoint slides as well.

Statistics 712. Decision Making using Statistics
This course develops the role of statistical methods in the decision-making process. Rather than a formal decision theory course, the emphasis is on practical methods used in day-to-day work, including the reconciliation of judgment and quantitative summaries, the role of coincidence, and principles of classical utility theory.

Undergrad and Graduate Statistics Courses
Statistics 102. Introduction to Statistics (2001)
This course is my version of a one-semester introductory statistical methods course, covering hypothesis testing through regression circa 2001. The emphasis is on data analysis using regression. I'd do this course differently now. Here's a pdf file of slides that introduce fitting curves in regression. The data files covered in these slides are

Insurance 260. Introduction to Time Series Analysis (from 2009)
This six-week portion of INSR 260 is an introduction to the practical analysis of time series data, emphasizing data analysis and interpretation.

Statistics 430. Introduction to Probability
This course introduces students to probability theory.

Statistics 540. Statistical Computing
A course that combines the foundations of statistical computing, done using Tierney's Lisp-Stat, with the development of Web pages that describe research areas in statistics.

Statistics 910. Time Series Analysis and Forecasting
This course is a mixture of the theory and use of time series methods. Theoretical material focuses on the properties of stationary time series, emphasizing Hilbert space methods and state-space models. Applications blend theory with computing. Simulations are used to check various large-sample approximations.

SMMD for ISB, 2003
These are the lecture notes and data used in the ISB program for 2003.

Lectures
Bootstrap Resampling Lectures
This series of five lectures (and an overview lecture) introduces the ideas of bootstrap resampling. The presentation is mostly via examples of applications in statistics, emphasizing regression-type models. Additional topics include the construction of confidence intervals and applications in time series and structural equations.

Data Mining Lectures
These lectures introduce students from the social sciences to the ideas of data mining. The emphasis is on the big picture, with lots of examples using data from ICPSR, medical trials, and business applications.

Text Analytics Lectures
These lectures introduce the ideas of text analytics. The emphasis is on "featurizing" text, turning text data into the familiar numerical information used in, say, regression models.