Here is the file of R text utilities (text_utils.R). It includes functions to generate a Zipf plot, to perform the most common types of Good-Turing smoothing, and to simplify some tasks when simulating a topic model.
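To give a flavor of what a Zipf plot is, here's a minimal base-R sketch: rank the word counts and plot log frequency against log rank. The counts below are made up for illustration; text_utils.R has its own, richer implementation.

    # Zipf plot: log frequency versus log rank of word counts.
    # Illustrative only; the word counts here are made up.
    zipf_plot <- function(counts) {
      freq <- sort(counts, decreasing = TRUE)
      plot(log(seq_along(freq)), log(freq),
           xlab = "log rank", ylab = "log frequency", main = "Zipf plot")
    }

    words <- table(c("the", "the", "the", "of", "of", "data", "model"))
    zipf_plot(as.numeric(words))

Under a Zipf-like law, the points fall near a straight line with slope close to -1.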
And here are several sample data files used in the lectures.
Scripts for doing the analyses in R. The first few are more generic, covering things like regular expressions, getting data from certain sources, and basic NLP.
Some data sets to play with...
This introductory lecture discusses roles for data mining in the social sciences. It also introduces JMP and the data sets that appear in later examples, particularly the ANES 2008 data from ICPSR. We'll start using JMP by exploring this large data file in class, using JMP's plot linking to examine voting behavior and the use of feeling thermometers. We also use JMP to explore a more complex regression model that has categorical predictors and interactions.
This lecture starts by reviewing the key contribution of statistics to modeling data, namely the standard error of an estimator. Bootstrapping adds to the interpretation. The lecture then considers the use of regression as a data mining tool, with discussion of methods for dealing with missing data. The lecture also introduces the problem of over-fitting in the context of stepwise linear regression, in an example of modeling stock market returns. The R files used are bootstrap.R and missing_data.R. Here's a reduced version of the data used to overfit the stock market.
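As a reminder of what the bootstrap computes, here is a minimal sketch (not the contents of bootstrap.R): resample the data with replacement, recompute the estimator each time, and take the standard deviation of the replicates as the standard error.

    # Bootstrap standard error of a sample mean; placeholder data.
    set.seed(123)
    x <- rnorm(100, mean = 5, sd = 2)

    B <- 1000
    boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

    sd(boot_means)           # bootstrap estimate of the standard error
    sd(x) / sqrt(length(x))  # textbook formula, for comparison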
How does one avoid the problem of overfitting and find a model that honestly reports its precision? Statistics offers a variety of choices, ranging from an alphabet soup of criteria (AIC, BIC, RIC) and shrinkage methods like the lasso to cross-validation. This link shows the JMP script used for cross-validating a regression in JMP. (We might skip that demo, depending on time.) The R script used to build the lasso model is lasso.R. (Here is the loan data mentioned in the script.)
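For those working in R, a lasso fit typically looks like the following sketch. It uses the glmnet package and assumes the predictors sit in a numeric matrix x with the response in y; lasso.R may organize things differently.

    # Lasso via glmnet; cross-validation chooses the penalty lambda.
    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(100 * 20), nrow = 100)  # placeholder predictors
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)     # placeholder response

    cvfit <- cv.glmnet(x, y)
    coef(cvfit, s = "lambda.min")  # coefficients at the chosen penalty; most are zero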
We will meet in the Michigan Lab on the State Street side of Newberry during the usual class period, 1:30-3:00, for some hands-on time with JMP and the ANES. Time permitting, we'll use R as well to fit a lasso model.
Many data mining problems classify observations, such as classifying the choices of voters. Linear regression is a poor match to this problem, unless you calibrate it. Calibrated linear regression is often just as good as a logistic regression, depending on how you grade the models. We'll look at likelihoods, confusion matrices and ROC curves -- all of which are needed in data mining too. Logistic regression can be a better match and may have a simpler form than a comparable linear model because the multiplicative form of logistic models avoids interactions in some cases. Calibration and variable selection remain problems.
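Here is a generic sketch of the logistic side of this comparison, using simulated 0/1 choices rather than the class data: fit with glm, then tabulate predictions against outcomes. Varying the 0.5 cutoff below is what traces out the ROC curve.

    # Logistic regression and a confusion matrix, in base R.
    set.seed(2)
    n <- 500
    income <- rnorm(n)
    vote   <- rbinom(n, 1, plogis(0.5 + 1.2 * income))  # simulated choices

    fit  <- glm(vote ~ income, family = binomial)
    phat <- predict(fit, type = "response")

    table(predicted = phat > 0.5, actual = vote)  # confusion matrix at cutoff 0.5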
No more equations -- or at least no visible equations. Neural networks combine several logistic regressions fit simultaneously. Is the added complexity worth it? For that, we'll compare fits from networks to those from logistic models. Neural networks add another 'layer' of choice for the modeler too. Not only do you have to pick the features to offer the network, but you also have to choose the number and arrangement of its hidden nodes.
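If you want to experiment outside JMP, the nnet package fits single-hidden-layer networks of this kind. A sketch with simulated data follows (the interaction is something a plain logistic fit misses); size sets the number of hidden nodes, one of those extra choices the modeler faces.

    # One-hidden-layer network via nnet; 'size' = number of hidden nodes.
    library(nnet)

    set.seed(3)
    n  <- 500
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- factor(rbinom(n, 1, plogis(x1 * x2)))  # interaction a logistic fit misses
    d  <- data.frame(y, x1, x2)

    net <- nnet(y ~ x1 + x2, data = d, size = 4, decay = 0.01, maxit = 500)
    mean(predict(net, d, type = "class") == d$y)  # in-sample accuracy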
Boosting is a general approach that iteratively refines the fit of a model by building a sequence of models, each trying to reduce the errors of its predecessor. JMP has a nice implementation of this that we'll explore.
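The idea is easy to sketch by hand in R. This is a toy version of boosting with small rpart trees, not JMP's implementation: each round fits a weak tree to the residuals of the current fit and adds a shrunken copy of its predictions.

    # Boosting by hand: fit small trees to the residuals, take small steps.
    library(rpart)

    set.seed(4)
    n <- 300
    x <- runif(n, -3, 3)
    y <- sin(x) + rnorm(n, sd = 0.3)
    d <- data.frame(x, y)

    fit    <- rep(mean(y), n)  # start from the overall mean
    shrink <- 0.1
    for (m in 1:100) {
      d$resid <- y - fit                                           # current errors
      tree    <- rpart(resid ~ x, data = d, maxdepth = 2, cp = 0)  # weak learner
      fit     <- fit + shrink * predict(tree, d)                   # small correction
    }
    mean((y - fit)^2)  # training error falls as the rounds accumulate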
Classification trees are a very different approach to modeling, based on separating the data into homogeneous subsets rather than forming equations. Nonetheless, there's still a regression view of these models that reveals their weakness: the fits don't easily borrow strength. Bagging (and boosting), however, provides a remedy for that problem.
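Bagging is simple enough to sketch directly: fit a tree to each of many bootstrap samples and let the trees vote. A minimal illustration with rpart and simulated data, not the class example:

    # Bagging classification trees: bootstrap, fit, majority vote.
    library(rpart)

    set.seed(5)
    n  <- 300
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- factor(ifelse(x1 + x2 + rnorm(n) > 0, "yes", "no"))
    d  <- data.frame(y, x1, x2)

    B <- 50
    votes <- replicate(B, {
      boot <- d[sample(n, replace = TRUE), ]
      tree <- rpart(y ~ x1 + x2, data = boot, method = "class")
      predict(tree, d, type = "class") == "yes"
    })
    bagged <- ifelse(rowMeans(votes) > 0.5, "yes", "no")  # majority vote
    mean(bagged == d$y)                                   # in-sample accuracy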
We will meet in the Michigan Lab in Newberry during the usual class time period for some hands-on time with JMP and the ANES.
We'll wrap things up with a peek at text mining and a discussion of how the various methods fit together to form a powerful modeling toolkit. The lecture slides include a list of 10 things every data miner should do. This file holds the source for the R commands used in the text modeling. The links to the data files for the two PCA examples in this lecture appear in the list below.
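If you'd like to try PCA on your own before class, base R's prcomp covers the basics; this sketch uses simulated correlated columns rather than the linked data files.

    # PCA with prcomp; scale. = TRUE standardizes the columns first.
    set.seed(6)
    x <- matrix(rnorm(200 * 5), nrow = 200)
    x[, 2] <- x[, 1] + rnorm(200, sd = 0.2)  # build in some correlation

    pca <- prcomp(x, scale. = TRUE)
    summary(pca)        # proportion of variance explained by each component
    head(pca$x[, 1:2])  # scores on the first two components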
My paper "An Introduction to Bootstrap Methods" (which appeared in Sociological Methods & Research back in 1989) introduces you to the ideas of bootstrap resampling through a variety of examples. The paper includes examples in regression and illustrates situations in which the bootstrap does not give the answer you'd like.