- Streaming Feature Selection
- Streaming feature selection evaluates potential explanatory
features for a regression-type model one at a time rather than
all at once. This approach enables faster selection methods (such
as VIF regression) and avoids the need to precompute every
possible predictor at the start of modeling. Building variables
on the fly is essential in database modeling and some types of
image processing.
Auctions allow a blending of substantive insight with automatic searches
for predictive features. I'll put a paper here one of these days that describes
the auction process more completely, but in the meantime, see these
slides from a recent talk.
The papers that are here are ingredients needed for the auction.
- Foster, D. P. and Stine, R. A. (2013).
Risk inflation of sequential tests controlled by alpha investing
- This paper (submitted for publication) develops a
computational method for finding the exact risk inflation of
the estimator implied by a testing process that is controlled
by alpha investing. The resulting feasible sets display all
possible risks of one estimation procedure relative to
another.
- Foster, D. P. and Stine, R. A. (2008, JRSS-B).
Alpha-investing: sequential control of expected false discoveries
- This paper describes a
procedure for testing multiple hypotheses while controlling the number of
false discoveries. The key distinguishing attributes
are that (a) it handles a stream of hypotheses, so you don't need all the
p-values at once, and (b) it allows an investigator to formally incorporate
domain knowledge into the testing procedure. We have used a variation of this
procedure to pick predictors in the auction; a small sketch of the
alpha-wealth bookkeeping appears below.
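For intuition, here is a minimal sketch of that bookkeeping in R. The bidding rule and payout below are illustrative choices, not the paper's exact specification or its guarantees.

```r
# Minimal sketch of alpha-investing (illustrative bidding rule and payout;
# see the paper for the exact specification and its properties).
alpha_invest <- function(p_values, w0 = 0.05, payout = 0.05) {
  wealth  <- w0                                # initial alpha-wealth
  rejects <- logical(length(p_values))
  for (j in seq_along(p_values)) {
    if (wealth <= 0) break                     # out of wealth: stop testing
    alpha_j <- wealth / (1 + j)                # bid a fraction of current wealth
    if (p_values[j] <= alpha_j) {              # a discovery ...
      rejects[j] <- TRUE
      wealth <- wealth + payout                # ... earns back some wealth
    } else {
      wealth <- wealth - alpha_j / (1 - alpha_j)  # a failed test costs wealth
    }
  }
  rejects
}

# Example: a stream of p-values with a little signal up front
set.seed(1)
p <- c(runif(5, 0, 0.001), runif(50))
which(alpha_invest(p))
```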
- Text Mining, Computational Linguistics
- Data sets that are well matched to regression often come
with supplemental textual information, such as written comments,
open-ended responses, and annotations. Some data sets come with
nothing but text. Generating regressors from these can lead to
more predictive models. There are also
slides from a recent talk.
- Foster, D. P., Liberman, M. and Stine, R. A. (2013).
Featurizing text: Converting text into predictors for
regression analysis
- This draft manuscript (really more of a working paper)
describes fast methods for the construction of numerical
regressors from text using spectral methods related to the
singular value decomposition (SVD). An example uses these
methods to build regression models for the price of Chicago
real estate using nothing but the text of a property listing.
Topic models (LDA) provide some explanation for why these
methods work as well as they do. For example, our model for
real estate explains some 70% of the variation in prices using
just the text of the listing, with no attempt to use location
or related demographics. A rough sketch of the SVD construction appears below.
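As a rough illustration of the idea (not the paper's construction, which is more elaborate), the R sketch below builds a tiny document-term matrix and uses the leading singular vectors as numeric regressors; the listings and the choice of two components are made up.

```r
# Toy spectral featurization: document-term counts -> SVD -> a few regressors.
texts <- c("charming brick colonial near park",
           "sunny two bedroom condo with new kitchen",
           "spacious ranch with large yard near park")
words <- strsplit(tolower(texts), "\\s+")
vocab <- sort(unique(unlist(words)))
dtm   <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
sv    <- svd(scale(dtm, scale = FALSE))          # center, then decompose
X     <- sv$u[, 1:2] %*% diag(sv$d[1:2])         # leading spectral features
X   # these columns could now enter a regression such as lm(price ~ X)
```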
- Statistics in Finance
- It can be very hard to separate luck from skill when it comes time
to evaluate the success of investors. We use a dice simulation described
in the following paper to illustrate this point to students, as well as
to show them the role of portfolios in reducing the variance of an investment.
I'll soon put another paper here that offers one approach to making the
distinction, but it's not ready yet. Here are the
slides from a recent talk.
- Foster, D. P. and Stine, R. A. (2005).
Being Warren Buffett: A classroom simulation of risk and wealth when
investing in the stock market
- This paper describes the dice simulation, including the notion
of volatility drag as a penalty for variation. This
form can be used in
class to organize the simulation.
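Here is a toy R version of the exercise; the multipliers below are illustrative and are not the values printed on the classroom dice.

```r
# Toy illustration of volatility drag with two made-up "dice" of gross returns.
set.seed(1)
calm <- c(0.95, 1.00, 1.05, 1.05, 1.10, 1.15)  # low-variance investment
wild <- c(0.10, 0.50, 1.00, 1.20, 2.00, 3.00)  # high-variance investment
c(mean(calm), mean(wild))        # the wild die has the larger average multiplier

n_years <- 100
roll <- function(die) prod(sample(die, n_years, replace = TRUE))  # wealth from $1
c(calm = median(replicate(1000, roll(calm))),
  wild = median(replicate(1000, roll(wild))))
# Despite its larger mean, the wild die's typical terminal wealth collapses:
# the growth rate is roughly the mean return minus half the variance.
```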
- Foster, D. P. and Stine, R. A. (2005).
Finding Warren Buffett: Separating knowledge from luck in investments.
- This paper will detail our approach using Bennett's inequality and
Bonferroni to identify investments that do better than chance. The
slides from a talk
summarize the approach.
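For reference, the standard statement of Bennett's inequality is below; the paper's application to sequences of fund returns, combined with a Bonferroni adjustment across the funds tested, is of course more involved.

```latex
% Bennett's inequality: X_1,\dots,X_n independent, E X_i = 0, |X_i| \le c,
% and v = \sum_i \mathrm{Var}(X_i). Then for t \ge 0,
P\!\Big(\sum_{i=1}^n X_i \ge t\Big)
  \;\le\; \exp\!\Big(-\frac{v}{c^2}\, h\!\Big(\frac{ct}{v}\Big)\Big),
\qquad h(u) = (1+u)\log(1+u) - u .
```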
- Foster, D.P., Stine, R.A., and Young, P (2011).
A Markov test for alpha
- This revised version of a previous manuscript avoids being fooled by
a clever manager, a trick that slips by when one uses a maximal approach. The paper
introduces a simple-to-perform test called the compound alpha test (CAT).
The test has good power (it is tight for sequences of returns generated by
"trickery") and is robust to numerous assumptions. The paper includes several illustrative
examples using recent stock returns.
In a different vein, the following papers consider models for the
forward curve or yield curve. The models decompose the forward
curve into several components that isolate different aspects of
the evolution of the curve over time.
- Chua, C. T., Foster, D. P., Ramaswamy, K., and Stine, R. A. (2007).
A dynamic model for the forward curve.
Review of Financial Studies, 21, 265-310.
- This paper proposes an arbitrage-free model for the
temporal evolution of the forward curve. The paper includes a
discussion of model estimation (using a Kalman filter),
examples fit to Treasury data, and comparison to alternatives.
- Chua, C. T., Ramaswamy, K., and Stine, R. A. (2008).
Predicting short-term Eurodollar futures.
Journal of Fixed Income, to appear.
- This manuscript adapts the methods used in the prior work for
Treasuries to practical aspects of modeling Eurodollar futures.
- Pooling Subjective Intervals
- A work in progress that concerns the use of subjective confidence intervals for making business decisions. A current interactive tool based on 50% intervals is available by following this link.
- Information Theory and Model Selection
- Information theory (coding, in particular) provides motivation
for the various types of model selection criteria in wide use today
(e.g., AIC, BIC). It also leads to generalizations of these methods
which allow comparison, for example, of non-nested models. These methods
can also be used for 'feature selection' in data mining.
The use of information theory in model selection is not new. The
AIC (Akaike information criterion) originated as an unbiased
estimate of the relative entropy, a key notion in comparing the
lengths of codes. More closely tied to coding are MDL and
stochastic complexity, which were proposed by Rissanen.
MDL (minimum description length) is typically used in an asymptotic
form which assumes many observations (large n) and fixed parameters.
In this setting, MDL agrees with BIC, the large sample approximation to
Bayes factors. Both penalize the likelihood by (1/2) log n for each
added parameter.
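In symbols (these are the standard forms of the criteria), with L_k the maximized likelihood of a model having k estimated parameters and n observations:

```latex
\mathrm{AIC}(k) = -2\log L_k + 2k,
\qquad
\mathrm{BIC}(k) = -2\log L_k + k\,\log n .
```

On the log-likelihood scale, each added parameter thus costs (1/2) log n under BIC (and under MDL in its usual asymptotic form), versus a constant 1 under AIC.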
- Foster, D. P. and Stine, R. A. (2005).
Polyshrink: An adaptive variable selection procedure
that is competitive with Bayes experts.
Submitted for publication.
- This revised manuscript considers the following competitive analysis
of the variable selection problem in regression.
Suppose you are trying to fit a regression model and do not
know which predictors to include. You decide to
use the data to pick the predictors using a selection
method. How well can you do this? For example, can you
fit a model as well as someone else who knows the
distribution of the regression coefficients? This paper gives
a lower bound for how well your rival can do, and provides a
method that we call "Polyshrink" that approaches its performance.
This R package
implements the Polyshrink estimator described in the paper.
- Stine, R. A. (2004).
Model selection using information theory and the MDL principle.
Sociological Methods & Research, 33, 230-260.
- This overview designed for the social sciences shows how information
theory expands the scope of model selection to encompass the
role of substantive theory. It also shows how
one can compare non-nested models as well as models of rather
different form. Examples illustrate the calculations, considering
several models for Harrison and Rubinfeld's Boston housing data.
The paper also introduces a variation on Foster and George's RIC
that allows for searches that follow the principle of marginality.
- Foster, D. P. and Stine, R. A. (2004).
Variable selection in data
mining: building a predictive model for bankruptcy.
J. Amer. Statistical Assoc., 99, 303-313.
- This revision of a prior manuscript describes an
application of variable selection in the context of
predicting bankruptcy. The central theme is the attempt
to find a selection criterion that picks the right number
of variables, where "right" in this context means that it
identifies the model that minimizes the out-of-sample
error --- without having to set aside a hold-back sample
(a toy version of such a stepwise search appears at the end of this entry).
The problem is hard in this example because we consider
models with more than 67,000 predictors.
The prior
manuscript is here as well, but is missing the
figures and one or two references. The new version
differs from the prior manuscript in many ways. For
example, we no longer use subsampling, we use binomial
variances, and we have included a 5-fold cross-validation
that compares the predictions of stepwise to those of the
tree-based classifiers C4.5 and C5.0.
For the truly adventurous, a compressed
tar file has all of the
source code used for fitting the big models in this paper
(written in C and a bit of C++). To build the program,
you need a unix system with gcc, but the build is pretty
automatic (that is, if you have done this sort of thing
-- see the README file). The software is distributed
under the GNU General Public License (GPL). You can get a "sanitized"
portion of the data in this
compressed tar file (Be patient... the file is a bit more than 24 MB.)
The data layout follows the format needed by C4.5. Each file represents
a fold of 100,000 cases. Further instructions are at the top of the names file.
To see a collection of papers that consider credit modeling more
generally, go to the
Wharton Financial Institutions Center
for proceedings from the Credit Risk Modeling and Decisioning conference
which was held here at Wharton, May 29-30, 2002.
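The toy R example promised above shows the flavor of such a search on simulated data. It uses forward stepwise with a BIC-style penalty (k = log n); the paper's adaptive criterion is different, so treat this only as an illustration of selecting without a hold-back sample.

```r
# Forward stepwise on simulated data with many candidate predictors,
# using a BIC-style penalty rather than a hold-back sample.
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, 1] - X[, 2] + rnorm(n)          # only x1 and x2 matter
dat   <- data.frame(y, X)
upper <- as.formula(paste("~", paste(colnames(X), collapse = " + ")))
fit <- step(lm(y ~ 1, data = dat), scope = upper,
            direction = "forward", k = log(n), trace = 0)
names(coef(fit))                             # which predictors were picked
```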
- Foster, D. P., Stine, R. A., and Wyner, A. D. (2002).
Universal codes for finite sequences of integers drawn
from a monotone distribution.
IEEE Trans. on Information Theory, 48, 1713-1720.
- We show that you can compress a data series almost as well
as if you knew the source distribution. The bounds on performance
that we obtain are not asymptotic, but apply for all sequence
lengths.
- Stine, R. A. and Foster, D. P. (2001).
The competitive complexity ratio
Proceedings of 2001 Conf on Info Sci and Sys, WP8 1-6.
- Stochastic complexity is a powerful concept, but its use
requires that the associated model have bounded
integrated Fisher information. Some models, like that
for a Bernoulli probability, satisfy this condition, but
others do not. In particular, the normal location model
or regression model do not have bounded information.
This leads one to bound the parameter space for the
model in some fashion, and then compute how this bound
affects a code length and the comparison of models.
- Foster, D. P. and Stine, R. A. (1999).
Local asymptotic coding
IEEE Trans. on Information Theory, 45, 1289-1293.
- Dean Foster and I show that the usual asymptotic
characterization of MDL (i.e., (1/2) log n) is not
uniform. It fails to hold near the crucial value of
zero. Near zero, the MDL criterion leads to an
"AIC-like" criterion.
- Foster, D. P. and Stine, R. A. (2006).
Honest confidence intervals for the error variance in
stepwise regression
To appear, at long last.
- This paper describes the impact of variable selection on
the confidence interval for the prediction error variance
of a stepwise regression. When you pick a model by
selecting from many factors, you need to widen the
interval for s^2 to account for selection bias.
The problem is particularly acute when one has more
predictors (features) than observations, as often occurs
in data mining; a small simulation of the bias appears
at the end of this entry. This
text file
has the monthly stock returns used in the example.
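The small simulation promised above illustrates the bias in R: keep only the best of many pure-noise predictors and the usual residual variance estimate comes out too small, so a naive interval for s^2 is centered in the wrong place. The sample sizes here are arbitrary.

```r
# Selection bias in the residual variance: pick the best of many null
# predictors and the usual estimate of sigma^2 is biased downward.
set.seed(1)
n <- 50; p <- 200; reps <- 200
s2_selected <- replicate(reps, {
  y <- rnorm(n)                          # pure noise, true variance 1
  X <- matrix(rnorm(n * p), n, p)
  best <- which.max(abs(cor(X, y)))      # strongest-looking candidate
  sum(resid(lm(y ~ X[, best]))^2) / (n - 2)
})
mean(s2_selected)                        # noticeably below the true value of 1
```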
- Introductory Lectures
- An introductory sequence of lecture notes on methods of
model selection (from a tutorial I've given) is also
available. The emphasis is to build ideas needed to
support the information theory point of view, so the
coverage of some areas (like AIC) is less comprehensive.
- Overview
- Predictive risk
- Bayesian criteria
- Introduction to information theory and coding
- Information theory and model selection
- Hidden Markov Models (HMM)
- One paper deals with the problem of estimating the arrival rate
and holding distribution of a queue. What makes it hard is that
you do not get to see the arrivals, just the number in the queue.
Fortunately, the covariances characterize both the arrival rate
and holding distribution. With some approximations that give the
queue a Markovian form, one can use dynamic programming to compute
a likelihood (via a hidden Markov model); a sketch of that recursion
appears at the end of this section.
A paper co-authored with J. Pickands appears in
Biometrika, 84, 295-308.
A second paper considers two issues: the covariance structure
of HMMs (including multivariate HMMs) and the use of model selection
methods based on these covariances to find the order of the model
(that is, the dimension of the underlying Markov chain). The idea
is to exploit the connection between the order of the HMM and the
implied ARMA structure of the covariances.
Autocovariance structure of Markov regime switching models and model
selection, written with Jing Zhang (who did the hard parts),
is to appear in Journal of Time Series Analysis.
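The sketch referred to above shows the dynamic-programming (forward) recursion that produces an HMM log-likelihood in R. The two-state transition matrix, emission probabilities, and observation sequence are toy values, not the queueing model of the paper.

```r
# Scaled forward recursion for an HMM log-likelihood.
forward_loglik <- function(obs, init, trans, emit) {
  # obs: observed symbols (indices into the columns of emit)
  # init: initial state distribution; trans: state transition matrix
  # emit: states-by-symbols matrix of emission probabilities
  alpha  <- init * emit[, obs[1]]
  loglik <- log(sum(alpha)); alpha <- alpha / sum(alpha)
  for (t in seq_along(obs)[-1]) {
    alpha  <- as.vector(alpha %*% trans) * emit[, obs[t]]
    loglik <- loglik + log(sum(alpha))
    alpha  <- alpha / sum(alpha)               # rescale to avoid underflow
  }
  loglik
}

# Toy example: two hidden states, three observable symbols
trans <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
emit  <- matrix(c(0.6, 0.3, 0.1,
                  0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
forward_loglik(obs = c(1, 1, 2, 3, 3, 2), init = c(0.5, 0.5),
               trans = trans, emit = emit)
```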
- Statistical Computing Environments for Social Research
- This collection of papers (published by Sage in 1996 and co-edited
by myself and John Fox) describes and contrasts
seven programming environments for doing statistical computing:
- Data analysis using APL2 and APL2STAT
John Fox and Michael Friendly
- Data analysis using Gauss and Markov
J. Scott Long and Brian Noss
- Data analysis using Lisp-Stat
Luke Tierney
- Data analysis using Mathematica
Bob Stine (me)
- Data analysis using SAS
Charles Hallahan
- Data analysis using Stata
Lawrence Hamilton and Joe Hilbe
- Data analysis using S-plus
Dan Schulman, Alec Campbell, and Eric Kostello
- AXIS: an extensible graphical user interface for statistics
Bob Stine (me, again)
- The R-code: a graphical paradigm for regression analysis
Sandy Weisberg
- ViSta: a visual statistics system
Forrest Young and Carla Bann
Each of these environments is programmable, with a flexible data model and
extensible command set.
Examples for each show how to do kernel density estimation, robust regression,
and bootstrap resampling. Additional articles focus on three extensions of
Lisp-Stat.
- Graphical interpretation of a variance inflation factor,
- The American Statistician, 49, Feb 1995.
This paper illustrates the use of an
interactive plotting tool
implemented in Lisp-Stat to reveal the simple
relationship among partial regression plots, component plots,
and the variance inflation factor. The data sets from the paper are
in the files
fighters.dat and
wildcats.dat.
The basic idea is that the
ratio of t-statistics associated with these two plots is the
square root of the associated VIF; a small numerical check of
this identity appears below. The interactive plot shows
how collinearity affects the relationship presented in regression
diagnostic plots. One uses a slider to control how much of the
present collinearity appears in the plot.
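The numerical check mentioned above can be done in a few lines of R on made-up collinear data, taking the component plot to be the partial residual (component-plus-residual) plot; the interactive Lisp-Stat tool itself is not reproduced here.

```r
# Fit the two diagnostic plots' regressions for x1 and compare the ratio of
# their t-statistics with the square root of the VIF.
set.seed(1)
n  <- 100
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.6)            # x1 is collinear with x2
y  <- 1 + 2 * x1 - x2 + rnorm(n)
full <- lm(y ~ x1 + x2)

# Added-variable (partial regression) plot: residuals of y on x2 vs residuals of x1 on x2
av <- lm(resid(lm(y ~ x2)) ~ resid(lm(x1 ~ x2)))
# Component (partial residual) plot: full-model residual + b1 * x1 versus x1
cr <- lm(I(resid(full) + coef(full)["x1"] * x1) ~ x1)

t_av  <- summary(av)$coefficients[2, "t value"]
t_cr  <- summary(cr)$coefficients[2, "t value"]
vif_1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
c(t_ratio = t_cr / t_av, sqrt_vif = sqrt(vif_1))   # these two should agree
```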
- Explaining normal quantile plots through animation,
- To appear, The American Statistician, 2016.
This manuscript
characterizes quantile-quantile plots as a comparison
between water levels in two vases that are gradually
filled with water. Imagine water filling two vases,
each able to hold a liter of water. Assume the water
fills the vases at the same rate. If the vases have
the same shape, then the water levels will match. A
graph of the water level in one versus the water level
in the other would then trace out a diagonal line.
That's also the idea of these animated QQ plots. A
parametric plot of the water levels in gradually
filling vases shaped as probability distributions
motivates quantile plots as used in statistics.
The R package qqvases
implements this construction. The software shows
an animation of the process, allowing you to choose
different distributions. The plots are nicer with
smooth populations, but you can show similar
figures with samples; these don't look so good unless
the sample sizes are fairly large. (A bare-bones version of
the parametric plot appears below.) The implementation
requires installing R and shiny on your system (not to
mention, knowing R). You can also try the procedure by
following
this
link (thanks to Dean Foster for figuring out how
to get Shiny running) to see an on-line version of the
software in your browser (avoiding the need to install
R and shiny on your own system).
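A bare-bones, static version of the parametric construction takes only a few lines of base R; the qqvases package adds the animation and the vase shapes.

```r
# Parametric plot of two quantile functions at common "fill levels".
p <- seq(0.01, 0.99, by = 0.01)
plot(qnorm(p), qt(p, df = 3), type = "l",
     xlab = "Normal quantiles", ylab = "t(3) quantiles",
     main = "Two distributions compared at the same fill levels")
abline(0, 1, lty = 2)   # identical shapes would follow this diagonal
```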