ICPSR Summer Program Lecture Materials


Data Mining

These files are the outlines that I use as guides to each lecture. Each is given in PDF format. The software used for computing is JMP from SAS (many of the same tools are also available in R, Stata, SPSS, and SAS, though not the interactive graphics). This syllabus summarizes the course and lectures.
  1. Introduction

    This introductory lecture discusses the place for data mining in the social sciences. The lecture also introduces JMP and the data sets that follow in later examples, particularly the ANES 2008 data from ICPSR. We'll start using JMP by exploring this large data file in class, using JMP's plot linking to explore voting behavior and the use of feeling thermometers.

  2. Regression

    This lecture considers the use of regression as a data mining tool, with methods for dealing with missing data. The lecture also introduces the problem of over-fitting.

  3. Validation

    How does one avoid the problem of over-fitting and find a model that honestly reports its precision? Statistics offers a variety of choices, ranging from an alphabet soup of criteria (AIC, BIC, RIC) to cross-validation. This link shows the JMP script used in class for cross-validating a regression.

  4. Helen Newberry Lab Session

    We will meet in the Michigan lab in Newberry during the usual class time period for some hands-on time with JMP and the ANES.

  5. Logistic regression

    Many data mining problems classify observations, such as classifying the choices of voters. Linear regression is a poor match to this problem, unless you calibrate it. Logistic regression is often a better match - though not without its own problems.

  6. Neural networks and boosting

  7. Classification trees and bagging

  8. July 4th holiday

  9. Helen Newberry Lab Session

    We will meet in the Michigan lab in Newberry during the usual class time period for some hands-on time with JMP and the ANES.

  10. Comparisons and Opportunities
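
To make the validation idea from lecture 3 concrete, here is a minimal sketch of k-fold cross-validation for a simple regression. It is not the JMP script used in class: it is plain Python, the data are simulated, and the function names (fit_line, mse, kfold_mse) are made up for this illustration. The point is only that the in-sample error of a fitted model tends to look better than the error measured on cases held out of the fit.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(a, b, xs, ys):
    """Mean squared prediction error of the line a + b*x on (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def kfold_mse(xs, ys, k, seed=0):
    """k-fold cross-validated MSE: each case is predicted by a fit that excluded it."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return total / len(xs)

# Simulated data (an assumption for this sketch, not the ANES data).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

a, b = fit_line(xs, ys)
in_sample = mse(a, b, xs, ys)
cv = kfold_mse(xs, ys, k=5)
print("in-sample MSE:", round(in_sample, 3))
print("5-fold CV MSE:", round(cv, 3))
```

Setting k equal to the number of cases gives leave-one-out cross-validation, which for least squares can never report a smaller error than the in-sample fit.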

Some data sets to play with...


Bootstrap Resampling

The lecture summaries shown below are copies of the transparencies shown on the computer and discussed in class. You can also get the software that accompanies these lectures below. If you cannot find something, take a second look, and if it's still not there, send me an e-mail at stine@wharton.upenn.edu.

Overview

This overview summarizes most of what is covered at a more leisurely pace in the following lectures. The linked PDF file gives the slides that I used in summarizing bootstrap methods in a seminar at UNC in April 2000. An extra postscript file has the double bootstrap figure used in this overview.

My paper "An Introduction to Bootstrap Methods" (which appeared in Sociological Methods & Research back in 1989) introduces you to the ideas of bootstrap resampling through a variety of examples. The paper includes examples in regression and illustrates situations in which the bootstrap does not give the answer you'd like.
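
For readers who have not seen the basic idea, here is a minimal sketch of bootstrap resampling for a standard error. It is not taken from the paper or the lecture scripts: it is plain Python, the sample is simulated, and bootstrap_se is a name made up for this illustration. The sketch resamples the data with replacement, recomputes the statistic each time, and uses the spread of those replicates as the standard error.

```python
import math
import random
import statistics

def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Bootstrap standard error of stat: SD of the statistic over resamples."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate is the statistic computed on n draws with replacement.
    reps = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(reps)

# Simulated sample (an assumption for this sketch).
rng = random.Random(42)
sample = [rng.gauss(50, 10) for _ in range(50)]

# For the mean, the bootstrap answer can be checked against s / sqrt(n).
boot_se = bootstrap_se(sample, statistics.mean)
formula_se = statistics.stdev(sample) / math.sqrt(len(sample))
print("bootstrap SE of mean:", round(boot_se, 3))
print("formula SE of mean:  ", round(formula_se, 3))

# For the median there is no equally simple formula, but the same code applies.
median_se = bootstrap_se(sample, statistics.median)
print("bootstrap SE of median:", round(median_se, 3))
```

The mean is used only because its standard error has a familiar formula to compare against; the same function works for any statistic that can be computed from a list.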

Lectures

The lecture notes are in PDF format, so you will need Adobe Acrobat to view, search, and print the files.

Syllabus
The syllabus presents a brief overview of what happens in each class, along with some review questions. This syllabus also appears in the introductory program information given to you if you attended the ICPSR Summer Program.

Bibliography
This annotated list of references is not comprehensive; rather, it is representative of what is available on a wide variety of topics, ranging from how to handle complex surveys to the methods for time series. Like the syllabus, this list is also in the information distributed to program participants.

Lecture Notes
The file for each lecture is a printed version of the Word files that I use for each class. I may clean up some errors if I find them, but they are pretty close to what was used in class. To use the R scripts that accompany the lecture notes, you'll need to have installed R on your own system.
  1. Introduction ( Lecture1.R )
    I lost the data on the sample proportions when class ended on Monday. Darn, and sorry.
  2. Exploring the Bootstrap ( Lecture2.R )
  3. The Bootstrap in Simple Regression and Correlation ( Lecture3.R )
    Here are the data sets that we used in this class:
    Computing and more sophisticated estimators, such as robust regression, come up in this class. Fortunately, John Fox discusses both in the on-line appendices to his book An R and S-Plus Companion to Applied Regression. Look for the relevant Web appendices for the book.
  4. Multiple Regression ( Lecture4.R )
    We used Duncan's data on occupational prestige for some of these examples.
  5. More Methods, Flaws, and Intervals ( Lecture5.R )
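
As a rough companion to the lectures on correlation and intervals, here is a minimal sketch of a bootstrap percentile interval for a correlation, obtained by resampling (x, y) pairs. It does not reproduce the R scripts listed above: it is plain Python, the data are simulated, and the names correlation and percentile_interval are made up for this illustration.

```python
import math
import random

def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def percentile_interval(pairs, stat, n_boot=2000, level=0.90, seed=0):
    """Percentile interval: resample whole cases, then take quantiles of the replicates."""
    rng = random.Random(seed)
    n = len(pairs)
    reps = sorted(stat(rng.choices(pairs, k=n)) for _ in range(n_boot))
    alpha = (1 - level) / 2
    lo = reps[int(alpha * n_boot)]
    hi = reps[int((1 - alpha) * n_boot) - 1]
    return lo, hi

# Simulated pairs (an assumption for this sketch, not Duncan's data).
rng = random.Random(7)
pairs = []
for _ in range(30):
    x = rng.gauss(0, 1)
    pairs.append((x, 0.7 * x + rng.gauss(0, 0.7)))

r_hat = correlation(pairs)
lo, hi = percentile_interval(pairs, correlation)
print("sample correlation:", round(r_hat, 3))
print("90% percentile interval:", (round(lo, 3), round(hi, 3)))
```

Resampling whole (x, y) cases keeps each pair together, which is what makes the replicates behave like new samples from the same population. The percentile interval is only the simplest of the bootstrap intervals.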

Software

If we used files of commands with a lecture, those files appear above with the lecture notes. Otherwise, look here.

AXIS
In addition to the "raw lisp" software used in class, I also used the AXIS interface functions. You can get a zip file of the needed programs and further information about AXIS and Lisp-Stat on my main web page.

Lisp-Stat
The official source for Lisp-Stat is the software archive prepared by Luke Tierney (author of Lisp-Stat). The archive is available from the University of Minnesota Statistics Department.

R
The CRAN archives have the source for R for various systems, including Unix and Windows. The archives also offer quite a few supplements and documentation.

Other Things

References to Bootstrap Resampling
In addition to the bibliography mentioned above, I list references to bootstrap resampling used in social science applications. I don't see too many of these journals, so any suggestions on your part are appreciated. No one seems to want to do this, but I'll post them if you do. Just send me a mail message.