Logistic Regression

Classification and Prediction

Part 1. Logistic Regression.

Today's class. Logistic regression chooses a model of the form
logit(P(y = 1)) = beta0 + beta1 X1 + beta2 X2 + ... + betap Xp
From the predicted logit we can recover the predicted probability, since P(y = 1) = exp(logit)/(1 + exp(logit)). We then apply a classification rule to the predicted probabilities (for example, predict y = 1 whenever the predicted probability exceeds 0.5).
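
For concreteness, here is a small Python sketch of that arithmetic. The coefficients, the single observation, and the 0.5 cutoff are all made up for illustration; they are not taken from the class example.

    import numpy as np

    def inv_logit(z):
        # Invert the logit: turn a predicted logit into a predicted probability.
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical fitted coefficients: intercept beta0 plus two slopes.
    beta = np.array([-1.2, 0.8, -0.3])

    # One new observation, with a leading 1 for the intercept term.
    x = np.array([1.0, 2.0, 5.0])

    pred_logit = x @ beta              # beta0 + beta1*X1 + beta2*X2
    p_hat = inv_logit(pred_logit)      # predicted P(y = 1)
    y_hat = int(p_hat >= 0.5)          # classification rule: cutoff at 0.5
    print(p_hat, y_hat)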

A simple assessment of the value of the model is the proportion of correctly classified observations.
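
Continuing in Python with hypothetical observed and predicted classes, that proportion is just the average of the agreement indicator:

    import numpy as np

    y_obs  = np.array([1, 0, 0, 1, 1, 0])   # hypothetical observed classes
    y_pred = np.array([1, 0, 1, 1, 0, 0])   # hypothetical predicted classes

    # Proportion of correctly classified observations.
    prop_correct = np.mean(y_obs == y_pred)
    print(prop_correct)                      # 4 of 6 agree, so about 0.667 here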

Out-of-sample prediction

Using the same set of data to fit and validate the model is NOT a good idea. Models fit in this way tend to overfit; that is, they too zealously model some of the noise as though it were signal.

A very practical model selection and validation paradigm is the Training set/Validation set approach. Split your data into two independent pieces: fit the model on the training set, then assess its predictions on the validation set.
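
A minimal Python sketch of the paradigm, using scikit-learn purely for illustration; the simulated data and the 50/50 split are placeholders, not part of the class example.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Simulated stand-in data: 1000 cases, 3 predictors, 0/1 response
    # generated from a true logistic model.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    true_logit = 0.5 + 1.0 * X[:, 0] - 0.5 * X[:, 1]
    y = (rng.uniform(size=1000) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

    # Split the data into two independent pieces.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.5, random_state=0)

    # Fit on the training piece only ...
    model = LogisticRegression().fit(X_train, y_train)

    # ... and judge the model by the proportion correctly classified on the
    # held-out validation piece (an honest, out-of-sample figure).
    print("in-sample proportion correct:     ", model.score(X_train, y_train))
    print("out-of-sample proportion correct: ", model.score(X_valid, y_valid))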

Example: the Internet demographics data set. I have extracted the following variables:
 [1] "Newbie"               "Age"                 
 [3] "Gender"               "Household.Income"    
 [5] "Sexual.Preference"    "Country"             
 [7] "Education.Attainment" "Major.Occupation"    
 [9] "Marital.Status"       "Years.on.Internet"   
The objective is to classify users into the Newbie category, that is, those who have been on the Internet for less than a year, based on a set of demographic indicators.

uva.txt The data set (1.6Megs).

uva.jmp The data in JMP format (2.3Megs).

uva.zip The data (zipped) in JMP format (0.18Megs).
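
A hedged sketch of how the Newbie model might be fit in Python with pandas and statsmodels. The delimiter of uva.txt, the 0/1 coding of Newbie, and the particular demographic predictors used are assumptions, not part of the original handout.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumption: uva.txt is a tab-delimited text file whose header row
    # carries the variable names listed above; adjust sep= if it is not.
    uva = pd.read_csv("uva.txt", sep="\t")

    # The formula syntax dislikes dots in names, so swap them for underscores.
    uva.columns = [c.replace(".", "_") for c in uva.columns]

    # Assumption: Newbie is coded 0/1 (1 = on the Internet less than a year);
    # recode it first if it is stored as Yes/No. The predictors below are
    # illustrative, not the definitive model.
    fit = smf.logit(
        "Newbie ~ Age + C(Gender) + C(Household_Income) + C(Education_Attainment)",
        data=uva).fit()
    print(fit.summary())

    # Predicted probabilities, the classification rule, and the (in-sample)
    # proportion correctly classified; in practice this last step belongs on
    # a held-out validation set, as described above.
    p_hat = fit.predict(uva)
    y_hat = (p_hat >= 0.5).astype(int)
    print("proportion correctly classified:", (y_hat == uva["Newbie"]).mean())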



Richard Waterman
Tue Nov 4 00:16:39 EST 1997