ICPSR Summer Program Lecture Materials
These files of slides are the outlines that I use as guides to each lecture.
Each is given in PDF format. The software used for the computing is
JMP from SAS (many of the same tools are also in R, Stata, SPSS, and
SAS - though not the interactive graphics).
This syllabus summarizes the most
recent version of the course and lectures; the last few pages give an
annotated bibliography of approachable articles and books from the
statistics literature that cover these topics. The notes shown here
are from the 2014 edition of the lectures.
Some data sets to play with...
- Monday Introduction
This introductory lecture discusses roles for data mining
in the social sciences. The lecture also introduces JMP and
the data sets that follow in later examples, particularly the
ANES 2008 data from ICPSR. We'll start using JMP by exploring
this large data file in class, using JMP's plot linking to
explore voting behavior and the use of feeling thermometers.
We also use JMP to explore a more complex regression model that
has categorical predictors and interactions.
- Tuesday Regression
This lecture starts by reviewing the key contribution of
statistics to modeling data, namely the standard error of an
estimator. Bootstrapping adds to the interpretation. The
lecture then considers the use of regression as a data mining
tool, with discussion of methods for dealing with missing data.
The lecture also introduces the problem of over-fitting in the
context of stepwise linear regression in an example of modeling
stock market returns. The R files used include
missing_data.R. Here's a
reduced version of the data
used to overfit the stock market.
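The idea of attaching a standard error to an estimator by bootstrapping can be sketched in a few lines. This is a minimal illustration in Python, not the R or JMP code used in class; the data values and the name bootstrap_se are made up for the example.

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    """Estimate the standard error of `estimator` by resampling `data`."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # The spread of the replicates estimates the standard error.
    return statistics.stdev(replicates)

# Hypothetical sample, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 4.4]

se_boot = bootstrap_se(sample, statistics.mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5

print(f"bootstrap SE of the mean: {se_boot:.3f}")
print(f"textbook  SE of the mean: {se_formula:.3f}")
```

For the mean, the two numbers should be close. The payoff of the bootstrap is that the same function works for estimators such as the median, where no simple formula is at hand: pass statistics.median instead of statistics.mean.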
- Wednesday Model Selection
How does one avoid the problem of overfitting and find a model
that honestly reports its precision? Statistics offers a
variety of choices, ranging from an alphabet soup of criteria
(AIC, BIC, RIC) through shrinkage methods like the lasso to
cross-validation. This link shows the JMP script used for
cross-validating a regression. (We might skip that demo depending on time.)
The R script used to build the lasso model
is lasso.R. (Here is the
loan data mentioned in the lecture.)
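Cross-validation, the last of the choices listed above, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the JMP script or lasso.R: the data are simulated, and the helper names fit_line and cv_mse are hypothetical. It compares an intercept-only model with a one-predictor regression by their held-out squared error.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def cv_mse(xs, ys, k=5, use_slope=True, seed=0):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        if use_slope:
            intercept, slope = fit_line(train_x, train_y)
        else:
            intercept, slope = statistics.fmean(train_y), 0.0
        # Score the model only on cases it did not see while fitting.
        sq_errors.extend((ys[i] - (intercept + slope * xs[i])) ** 2 for i in fold)
    return statistics.fmean(sq_errors)

# Simulated data (an assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

mse_line = cv_mse(xs, ys, use_slope=True)
mse_mean = cv_mse(xs, ys, use_slope=False)
print(f"CV error, regression:     {mse_line:.2f}")
print(f"CV error, intercept only: {mse_mean:.2f}")
```

Because every case is predicted by a model that never saw it, the reported error is an honest estimate of precision rather than the optimistic in-sample fit.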
- Thursday Helen Newberry Lab Session
We will meet in the Michigan Lab on the State Street side of
Newberry during the usual class time period from 1:30-3:00 for
some hands-on time with JMP and the ANES. Time permitting,
we'll use R as well to fit a lasso model.
- Friday July 4th holiday
- Monday Logistic regression
Many data mining problems classify observations, such as
classifying the choices of voters. Linear regression is a poor
match to this problem, unless you calibrate it. Calibrated
linear regression is often just as good as a logistic
regression, depending on how you grade the models. We'll look
at likelihoods, confusion matrices and ROC curves -- all of
which are needed in data mining too. Logistic regression can be
a better match and may have a simpler form than a comparable
linear model because the multiplicative form of logistic models
avoids interactions in some cases. Calibration and variable
selection remain problems.
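Two of the grading tools mentioned above, the confusion matrix and the ROC curve, are easy to compute by hand. The sketch below is a small Python illustration; the labels and predicted probabilities are invented, and the function names are hypothetical. The area under the ROC curve is computed from its rank interpretation: the chance that a randomly chosen positive case gets a higher score than a randomly chosen negative case.

```python
def confusion_matrix(labels, probs, threshold=0.5):
    """Count outcomes when cases with prob >= threshold are classified as 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(labels, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts

def roc_auc(labels, probs):
    """Area under the ROC curve via pairwise comparisons (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented outcomes and fitted probabilities, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

cm = confusion_matrix(labels, probs)
auc = roc_auc(labels, probs)
print(cm)   # {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
print(auc)  # 0.9375
```

The confusion matrix depends on the threshold you pick, whereas the AUC summarizes the ranking of cases over all thresholds; that difference is one reason the choice of grading method matters.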
- Tuesday Neural networks and boosting
No more equations -- or at least visible equations. Neural
networks combine several logistic regressions fit
simultaneously. Is the added complexity worth it? To find out,
we'll compare fits from networks to those from logistic models.
Neural networks add another 'layer' of choice for the modeler
too. Not only do you have to pick features to offer the
network, but you also have to choose the number and arrangement
of nodes in the network.
Boosting is a general approach that iteratively refines the fit
of a model by building a sequence of models, each trying to
reduce the errors of the predecessor. JMP has a nice
implementation of this that we'll explore.
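The description of boosting above can be made concrete with a small sketch. This is one simple variant in Python (least squares boosting with one-split "stumps" and a shrinkage rate), chosen for illustration; it is not JMP's implementation, and the data, the rate of 0.3, and the function names are all assumptions.

```python
import statistics

def fit_stump(xs, residuals):
    """Best single-split step function for the residuals (least squares)."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=50, rate=0.3):
    """Build the fit as a sum of small stumps, each fit to current errors."""
    fit = [statistics.fmean(ys)] * len(xs)
    history = []
    for _ in range(rounds):
        # Each new model tries to explain what its predecessors missed.
        residuals = [y - f for y, f in zip(ys, fit)]
        split, lm, rm = fit_stump(xs, residuals)
        fit = [f + rate * (lm if x < split else rm) for x, f in zip(xs, fit)]
        history.append(sum((y - f) ** 2 for y, f in zip(ys, fit)))
    return fit, history

# Illustrative data: a smooth curve that no single stump can match.
xs = [i / 5 for i in range(30)]
ys = [(x - 3) ** 2 for x in xs]

ybar = statistics.fmean(ys)
sse_start = sum((y - ybar) ** 2 for y in ys)
fit, history = boost(xs, ys)
print(f"SSE before boosting: {sse_start:.2f}")
print(f"SSE after {len(history)} rounds: {history[-1]:.2f}")
```

The training error can only go down from round to round, which is exactly why boosting needs the same protection against overfitting (such as held-out data) as the other methods in this course.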
- Wednesday Classification trees and bagging
Classification trees are a very different approach to modeling
based on separating the data into homogeneous subsets rather
than forming equations. Nonetheless, there's still a
regression view of these models that reveals their weakness:
the fits don't easily borrow strength. Bagging (and
boosting), however, provide a remedy for that problem.
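To see how bagging helps, here is a minimal Python sketch under stated assumptions: the data are simulated, the "tree" is the smallest possible one (a single split), and the function names are hypothetical. A single split can only predict two values; averaging the splits found in many bootstrap resamples gives a fit that changes more gradually.

```python
import random
import statistics

def fit_stump(xs, ys):
    """One-split regression tree: returns a function predicting a group mean."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < split]
        right = [y for x, y in zip(xs, ys) if x >= split]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x < split else rm

def bagged_stumps(xs, ys, n_bags=200, seed=0):
    """Average the stumps fit to bootstrap resamples of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # skip a degenerate resample with no possible split
        models.append(fit_stump(bx, [ys[i] for i in idx]))
    return lambda x: statistics.fmean([m(x) for m in models])

# Simulated data (an assumption for illustration): a noisy linear trend.
rng = random.Random(3)
xs = [i / 10 for i in range(40)]
ys = [x + rng.gauss(0, 0.5) for x in xs]

single = fit_stump(xs, ys)
bagged = bagged_stumps(xs, ys)
print("distinct predictions, single stump:", len({single(x) for x in xs}))
print("distinct predictions, bagged fit:  ",
      len({round(bagged(x), 6) for x in xs}))
```

The split point moves from one resample to the next, so the averaged fit takes many values along the trend instead of jumping between two group means.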
- Thursday Helen Newberry Lab Session
We will meet in the Michigan lab in Newberry during the usual
class time period for some hands-on time with JMP and the
- Friday Comparisons and Intro to Text Mining
We'll wrap things up with a peek at text mining and a
discussion of how the various methods fit together to form a
powerful modeling toolkit. The lecture slides include a list
of 10 things every data miner should do.
This file holds the source for
the R commands used in the text modeling. The links for the
data files for the two examples of PCA in this lecture appear
in the list below.
The lecture summaries shown below are copies of the transparencies shown
on the computer and discussed in class. You can also get the software
that accompanies these lectures below. If you cannot find something, take
a second look and if it's still not there, send me an e-mail.
This overview summarizes most of what is
covered at a more leisurely pace in the following lectures. The linked PDF
file gives the slides that I used in summarizing bootstrap methods in a
seminar at UNC in April 2000. An extra file
has the double bootstrap figure used in this overview.
My paper "An Introduction to Bootstrap
Methods" (which appeared in Sociological Methods & Research
back in 1989) introduces you to the ideas of bootstrap resampling
through a variety of examples. The paper includes examples in
regression and illustrates situations in which the bootstrap does not
give the answer you'd like.
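One standard textbook case where the bootstrap misbehaves is the sample maximum; whether the paper uses this particular example is not stated here, so treat the sketch below as an assumption-laden illustration in Python rather than a summary of the paper. The data are simulated and the function name is hypothetical.

```python
import random

def share_of_ties_with_max(sample, n_boot=5000, seed=0):
    """Fraction of bootstrap resamples whose maximum equals the sample maximum."""
    rng = random.Random(seed)
    n = len(sample)
    top = max(sample)
    hits = 0
    for _ in range(n_boot):
        resample_max = max(sample[rng.randrange(n)] for _ in range(n))
        hits += resample_max == top
    return hits / n_boot

# Simulated sample of 50 uniform draws, for illustration only.
rng = random.Random(42)
sample = [rng.random() for _ in range(50)]

share = share_of_ties_with_max(sample)
expected = 1 - (1 - 1 / 50) ** 50  # chance a resample includes the largest value
print(f"share of resamples reproducing the sample maximum: {share:.3f}")
print(f"theoretical value: {expected:.3f}")
```

Nearly two thirds of the bootstrap replicates are identical to the observed maximum, so the bootstrap distribution is a lumpy caricature of the true sampling distribution rather than a smooth approximation to it.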
The lecture notes are in PDF format, so you will need
Adobe Acrobat to view, search, and print the files.
- The syllabus presents a brief overview of what happens in
each class, along with some review questions. This syllabus also
appears in the introductory program information given to you if
you attended the ICPSR Summer Program.
- This annotated list of references is not comprehensive;
rather, it is representative of what is available on a
wide variety of topics, ranging from how to handle complex
surveys to the methods for time series. Like the syllabus, this list
also appears in the information distributed to program participants.
- Lecture Notes
- The file for each lecture is a printed version of the Word files
that I use for each class. I may clean up some errors if I find them,
but they are pretty close to what was used in class. To use the
R scripts that accompany the lecture notes, you'll need to have
R installed on your own system.
( Lecture1.R )
I lost the data on the sample proportions when class
ended on Monday. Darn, and sorry.
- Exploring the Bootstrap
( Lecture2.R )
- The Bootstrap in Simple Regression
( Lecture3.R )
Here are the data sets that we used in this class:
Computing and more sophisticated estimators like robust regression
come up in this class. Fortunately, John Fox discusses both in the
on-line appendices to his book An R and S-Plus Companion to Applied
Regression. Look for the relevant Web appendices for the book.
- Multiple Regression
( Lecture4.R )
We used Duncan's data
on occupational prestige for some of these examples.
- More Methods, Flaws, and Intervals
( Lecture5.R )
If we used files of commands with a lecture, those files appear
above with the lecture notes. Otherwise, look here.
- In addition to the "raw lisp" software used in class, I also used
the AXIS interface functions. You can get a zip file of the needed
programs and further information about AXIS and Lisp-Stat on my
main web page.
- The official source for Lisp-Stat is the
software archive prepared by Luke Tierney (author of Lisp-Stat). The
archive is available from the
University of Minnesota Statistics Department.
- The source for R is available for various systems,
including Unix and Windows. The R archives also offer quite a
few supplements and documentation.
- References to Bootstrap Resampling
- In addition to the above, I list references to bootstrap
resampling used in social science applications. I don't see too
many of these journals, so any suggestions on your part are
appreciated. No one seems to want to do this, but I'll post them
if you do. Just send me a mail message.
- Dalgleish, L. I. (1995). Discriminant analysis: statistical
inference using the jackknife and bootstrap procedures.
Psychological Bulletin, Vol. 116. (Shows some SAS
routines for testing the size of coefficients.)
- Follow this
link to see web pages describing recent work using
the bootstrap to assess goodness-of-fit measures.