Here is the file of R text utilities (text_utils.R). It includes functions to generate a Zipf plot, to perform the most common types of Good-Turing smoothing, and to simplify some tasks when simulating a topic model.
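To give a flavor of what a Zipf plot is, here's a minimal base-R sketch: rank the word counts and plot log frequency against log rank. The counts below are made up for illustration; text_utils.R has its own, richer implementation.

    # Zipf plot: log frequency versus log rank of word counts.
    # Illustrative only; the word counts here are made up.
    zipf_plot <- function(counts) {
      freq <- sort(counts, decreasing = TRUE)
      plot(log(seq_along(freq)), log(freq),
           xlab = "log rank", ylab = "log frequency", main = "Zipf plot")
    }

    words <- table(c("the", "the", "the", "of", "of", "data", "model"))
    zipf_plot(as.numeric(words))

Under a Zipf-like law, the points fall near a straight line with slope close to -1.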
And here are several sample data files used in the lectures.
Scripts for doing the analyses in R. The first few are more generic, covering things like regular expressions, getting data from certain sources, and basic NLP.
Some data sets to play with...
This introductory lecture discusses roles for data mining in the social sciences. It also introduces JMP and the data sets that appear in later examples, particularly the ANES 2008 data from ICPSR. We'll start using JMP by exploring this large data file in class, using JMP's plot linking to examine voting behavior and the use of feeling thermometers. We also use JMP to explore a more complex regression model that has categorical predictors and interactions.
This lecture starts by reviewing the key contribution of statistics to modeling data, namely the standard error of an estimator. Bootstrapping adds to the interpretation. The lecture then considers the use of regression as a data mining tool, with discussion of methods for dealing with missing data. The lecture also introduces the problem of over-fitting in the context of stepwise linear regression, in an example of modeling stock market returns. The R files used are bootstrap.R and missing_data.R. Here's a reduced version of the data used to overfit the stock market.
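As a reminder of what the bootstrap computes, here is a minimal sketch (not the contents of bootstrap.R): resample the data with replacement, recompute the estimator each time, and take the standard deviation of the replicates as the standard error.

    # Bootstrap standard error of a sample mean; placeholder data.
    set.seed(123)
    x <- rnorm(100, mean = 5, sd = 2)

    B <- 1000
    boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

    sd(boot_means)           # bootstrap estimate of the standard error
    sd(x) / sqrt(length(x))  # textbook formula, for comparison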
How does one avoid the problem of overfitting and find a model that honestly reports its precision? Statistics offers a variety of choices, ranging from an alphabet soup of criteria (AIC, BIC, RIC) and shrinkage methods like the lasso to cross-validation. This link shows the JMP script used for cross-validating a regression in JMP. (We might skip that demo, depending on time.) The R script used to build the lasso model is lasso.R. (Here is the loan data mentioned in the script.)
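For those working in R, a lasso fit typically looks like the following sketch. It uses the glmnet package and assumes the predictors sit in a numeric matrix x with the response in y; lasso.R may organize things differently.

    # Lasso via glmnet; cross-validation chooses the penalty lambda.
    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(100 * 20), nrow = 100)  # placeholder predictors
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)     # placeholder response

    cvfit <- cv.glmnet(x, y)
    coef(cvfit, s = "lambda.min")  # coefficients at the chosen penalty; most are zero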
We will meet in the Michigan Lab on the State Street side of Newberry during the usual class period, 1:30-3:00, for some hands-on time with JMP and the ANES. Time permitting, we'll use R as well to fit a lasso model.
Many data mining problems classify observations, such as classifying the choices of voters. Linear regression is a poor match to this problem, unless you calibrate it. Calibrated linear regression is often just as good as a logistic regression, depending on how you grade the models. We'll look at likelihoods, confusion matrices and ROC curves -- all of which are needed in data mining too. Logistic regression can be a better match and may have a simpler form than a comparable linear model because the multiplicative form of logistic models avoids interactions in some cases. Calibration and variable selection remain problems.
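Here is a generic sketch of the logistic side of this comparison, using simulated 0/1 choices rather than the class data: fit with glm, then tabulate predictions against outcomes. Varying the 0.5 cutoff below is what traces out the ROC curve.

    # Logistic regression and a confusion matrix, in base R.
    set.seed(2)
    n <- 500
    income <- rnorm(n)
    vote   <- rbinom(n, 1, plogis(0.5 + 1.2 * income))  # simulated choices

    fit  <- glm(vote ~ income, family = binomial)
    phat <- predict(fit, type = "response")

    table(predicted = phat > 0.5, actual = vote)  # confusion matrix at cutoff 0.5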
No more equations -- or at least no visible equations. Neural networks combine several logistic regressions fit simultaneously. Is the added complexity worth it? For that, we'll compare fits from networks to those from logistic models. Neural networks add another 'layer' of choice for the modeler too. Not only do you have to pick the features to offer the network, but you also have to choose the number and arrangement of its hidden nodes.
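If you want to experiment outside JMP, the nnet package fits single-hidden-layer networks of this kind. A sketch with simulated data follows (the interaction is something a plain logistic fit misses); size sets the number of hidden nodes, one of those extra choices the modeler faces.

    # One-hidden-layer network via nnet; 'size' = number of hidden nodes.
    library(nnet)

    set.seed(3)
    n  <- 500
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- factor(rbinom(n, 1, plogis(x1 * x2)))  # interaction a logistic fit misses
    d  <- data.frame(y, x1, x2)

    net <- nnet(y ~ x1 + x2, data = d, size = 4, decay = 0.01, maxit = 500)
    mean(predict(net, d, type = "class") == d$y)  # in-sample accuracy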
Boosting is a general approach that iteratively refines the fit of a model by building a sequence of models, each trying to reduce the errors of its predecessor. JMP has a nice implementation of this that we'll explore.
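The idea is easy to sketch by hand in R. This is a toy version of boosting with small rpart trees, not JMP's implementation: each round fits a weak tree to the residuals of the current fit and adds a shrunken copy of its predictions.

    # Boosting by hand: fit small trees to the residuals, take small steps.
    library(rpart)

    set.seed(4)
    n <- 300
    x <- runif(n, -3, 3)
    y <- sin(x) + rnorm(n, sd = 0.3)
    d <- data.frame(x, y)

    fit    <- rep(mean(y), n)  # start from the overall mean
    shrink <- 0.1
    for (m in 1:100) {
      d$resid <- y - fit                                           # current errors
      tree    <- rpart(resid ~ x, data = d, maxdepth = 2, cp = 0)  # weak learner
      fit     <- fit + shrink * predict(tree, d)                   # small correction
    }
    mean((y - fit)^2)  # training error falls as the rounds accumulate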
Classification trees are a very different approach to modeling, based on separating the data into homogeneous subsets rather than forming equations. Nonetheless, there's still a regression view of these models that reveals their weakness: the fits don't easily borrow strength. Bagging (and boosting), however, provides a remedy for that problem.
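Bagging is simple enough to sketch directly: fit a tree to each of many bootstrap samples and let the trees vote. A minimal illustration with rpart and simulated data, not the class example:

    # Bagging classification trees: bootstrap, fit, majority vote.
    library(rpart)

    set.seed(5)
    n  <- 300
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- factor(ifelse(x1 + x2 + rnorm(n) > 0, "yes", "no"))
    d  <- data.frame(y, x1, x2)

    B <- 50
    votes <- replicate(B, {
      boot <- d[sample(n, replace = TRUE), ]
      tree <- rpart(y ~ x1 + x2, data = boot, method = "class")
      predict(tree, d, type = "class") == "yes"
    })
    bagged <- ifelse(rowMeans(votes) > 0.5, "yes", "no")  # majority vote
    mean(bagged == d$y)                                   # in-sample accuracy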
We will meet in the Michigan Lab in Newberry during the usual class time period for some hands-on time with JMP and the ANES.
We'll wrap things up with a peek at text mining and a discussion of how the various methods fit together to form a powerful modeling toolkit. The lecture slides include a list of 10 things every data miner should do. This file holds the source for the R commands used in the text modeling. The links to the data files for the two PCA examples in this lecture appear in the list below.
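If you'd like to try PCA on your own before class, base R's prcomp covers the basics; this sketch uses simulated correlated columns rather than the linked data files.

    # PCA with prcomp; scale. = TRUE standardizes the columns first.
    set.seed(6)
    x <- matrix(rnorm(200 * 5), nrow = 200)
    x[, 2] <- x[, 1] + rnorm(200, sd = 0.2)  # build in some correlation

    pca <- prcomp(x, scale. = TRUE)
    summary(pca)        # proportion of variance explained by each component
    head(pca$x[, 1:2])  # scores on the first two components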
My paper "An Introduction to Bootstrap Methods" (which appeared in Sociological Methods & Research back in 1989) introduces you to the ideas of bootstrap resampling through a variety of examples. The paper includes examples in regression and illustrates situations in which the bootstrap does not give the answer you'd like.