This introductory lecture discusses the place of data mining in the social sciences. The lecture also introduces JMP and the data sets used in later examples, particularly the ANES 2008 data from ICPSR. We'll start using JMP by exploring this large data file in class, using JMP's plot linking to explore voting behavior and the use of feeling thermometers.
This lecture considers the use of regression as a data mining tool, with methods for dealing with missing data. The lecture also introduces the problem of over-fitting.
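To see why over-fitting matters when regression is used to search through many variables, here is a minimal Python sketch (not the JMP analysis from class). It simulates a response and 200 candidate predictors that are all pure noise, then picks the predictor with the largest absolute correlation. The sample size, number of predictors, seed, and the approximate 5% cutoff of 0.28 for n = 50 are all assumptions made for illustration.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2008)          # arbitrary seed, chosen for repeatability
n_obs, n_predictors = 50, 200      # illustrative sizes, not from the ANES

# Response and predictors are independent noise: no predictor truly matters.
y = [rng.gauss(0, 1) for _ in range(n_obs)]
predictors = [[rng.gauss(0, 1) for _ in range(n_obs)]
              for _ in range(n_predictors)]

abs_r = [abs(correlation(x, y)) for x in predictors]
best_abs_r = max(abs_r)

# Approximate two-sided 5% cutoff for |r| when n = 50 (an assumption
# computed from the t distribution with 48 degrees of freedom).
cutoff = 0.28
n_flagged = sum(r > cutoff for r in abs_r)

print(f"largest |r| among noise predictors: {best_abs_r:.3f}")
print(f"predictors that look 'significant': {n_flagged} of {n_predictors}")
```

Even though nothing is related to the response, searching over many predictors tends to turn up several that pass a nominal significance test. The best of them would look convincing if reported without mention of the search.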
How does one avoid the problem of over-fitting and find a model that honestly reports its precision? Statistics offers a variety of choices, ranging from an alphabet soup of criteria (AIC, BIC, RIC) to cross-validation. The JMP script used in class for cross-validating a regression in JMP is available here ( link ).
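The idea behind cross-validation can be sketched in a few lines of Python. This is not the JMP script used in class, only an illustration under assumptions: the data are simulated, the model is a one-predictor least squares line, and the helper names (`fit_line`, `k_fold_mse`) are made up for this example.

```python
import random

def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def k_fold_mse(xs, ys, k, rng):
    """Average squared prediction error over k held-out folds."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds
    total_sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score the fit only on cases it did not see.
        total_sq_err += sum((ys[i] - (b0 + b1 * xs[i])) ** 2
                            for i in held_out)
    return total_sq_err / len(xs)

rng = random.Random(7)                      # arbitrary seed
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 2) for x in xs]   # simulated data

b0, b1 = fit_line(xs, ys)
train_mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
cv_mse = k_fold_mse(xs, ys, k=5, rng=rng)

print(f"in-sample MSE:       {train_mse:.3f}")
print(f"5-fold CV estimate:  {cv_mse:.3f}")
```

The in-sample error is optimistic because the same cases are used to fit and to judge the model. The cross-validated error is the more honest report of precision, and the gap between the two grows as the model becomes more flexible.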
We will meet in the Michigan lab in Newberry during the usual class time for some hands-on time with JMP and the ANES.
Many data mining problems classify observations, such as classifying the choices of voters. Linear regression is a poor match for this problem unless you calibrate it. Logistic regression is often a better match, though not without its own problems.
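A small Python sketch shows one reason an uncalibrated linear regression fits a binary response poorly. The data are simulated (not the ANES), the logistic fit uses plain gradient ascent rather than whatever algorithm JMP uses, and the step size and iteration count are assumptions chosen for this example.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

rng = random.Random(11)                            # arbitrary seed
xs = [-4 + 8 * i / 199 for i in range(200)]        # grid on [-4, 4]
ys = [1 if rng.random() < sigmoid(2 * x) else 0 for x in xs]  # binary choice

# Linear regression of the 0/1 response on x (a linear probability model).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
lin_slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
lin_intercept = my - lin_slope * mx

# Logistic regression fit by gradient ascent on the mean log-likelihood.
b0, b1 = 0.0, 0.0
for _ in range(2000):
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
    b0 += 0.5 * g0
    b1 += 0.5 * g1

x_extreme = 4.0
linear_pred = lin_intercept + lin_slope * x_extreme
logistic_pred = sigmoid(b0 + b1 * x_extreme)

print(f"linear 'probability' at x = 4:   {linear_pred:.3f}")
print(f"logistic probability at x = 4:   {logistic_pred:.3f}")
```

The linear fit predicts a "probability" above 1 at the edge of the data, while the logistic fit stays between 0 and 1 by construction.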
We will meet in the Michigan lab in Newberry during the usual class time for some hands-on time with JMP and the ANES.
My paper "An Introduction to Bootstrap Methods" (which appeared in Sociological Methods & Research back in 1989) introduces you to the ideas of bootstrap resampling through a variety of examples. The paper includes examples in regression and illustrates situations in which the bootstrap does not give the answer you'd like.
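For a feel of what bootstrap resampling does in regression, here is a minimal Python sketch. It is not taken from the paper: the data are simulated, the resampling scheme (drawing whole cases with replacement) is one of several possible choices, and the number of bootstrap replications and the seed are assumptions.

```python
import random
import statistics

def slope(xs, ys):
    """Least squares slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

rng = random.Random(1989)                   # arbitrary seed
n_obs, n_boot = 60, 1000                    # illustrative sizes
xs = [rng.uniform(0, 10) for _ in range(n_obs)]
ys = [2.0 + 0.7 * x + rng.gauss(0, 1.5) for x in xs]   # simulated data

slope_hat = slope(xs, ys)

# Resample cases (x, y pairs) with replacement and refit each time.
boot_slopes = []
for _ in range(n_boot):
    picks = rng.choices(range(n_obs), k=n_obs)
    boot_slopes.append(slope([xs[i] for i in picks],
                             [ys[i] for i in picks]))

boot_se = statistics.stdev(boot_slopes)     # bootstrap standard error
boot_slopes.sort()
lower = boot_slopes[n_boot * 25 // 1000]        # about the 2.5th percentile
upper = boot_slopes[n_boot * 975 // 1000 - 1]   # about the 97.5th percentile

print(f"estimated slope:        {slope_hat:.3f}")
print(f"bootstrap SE:           {boot_se:.3f}")
print(f"percentile interval:    ({lower:.3f}, {upper:.3f})")
```

The spread of the refitted slopes stands in for the sampling variation of the estimate, with no formula for the standard error required.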