Class 17 Stat701 Fall 1997

Classification and Prediction

Part 1. Logistic Regression.

Today's class.

Classification is a frequently used technique.
Many decisions are naturally recognized as classification problems.
- Bankruptcy prediction.
- Insurance company rating.
- Credit approval.
- Debt risk assessment -- bond rating/credit scoring.
- Purchase/Don't purchase.
- Survival at 5 years.
- Discharge from hospital within 2 days.
Basic concept: Y is a dichotomous variable, coded 1 or 0. We wish to predict it from a set of X's.
That is, we want to classify each observation as either a 1 or a 0.
Make a model for the probability that Y = 1. Use the classification rule: if P(Y = 1) > 0.5 then predict 1, otherwise predict 0.

Logistic regression chooses a model of the form

logit(P(Y = 1)) = beta0 + beta1 X1 + beta2 X2 + ... + betap Xp
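
For reference, the logit is the log of the odds, and inverting it recovers the probability. The symbols p and eta below are shorthand introduced here and are not in the original notes: p stands for P(Y = 1) and eta for the linear predictor on the right-hand side of the model.

```latex
\operatorname{logit}(p) = \log\!\left(\frac{p}{1-p}\right),
\qquad
\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p,
\qquad
p = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
```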

From the predicted logit we can find the predicted probability. Then use the classification rule on the predicted probabilities.
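
The two steps above (logit to probability, then probability to class) can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficient values are made up for the example and do not come from the course data or from the SPlus code.

```python
import math

def inverse_logit(eta):
    """Map a predicted logit (log-odds) to a probability in [0, 1]."""
    # Two algebraically equivalent forms, chosen so exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def predicted_logit(x, beta0, betas):
    """Linear predictor: beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

def classify(x, beta0, betas, cutoff=0.5):
    """Classification rule: predict 1 if P(Y = 1) > cutoff, otherwise 0."""
    p = inverse_logit(predicted_logit(x, beta0, betas))
    return 1 if p > cutoff else 0

# Hypothetical coefficients, for illustration only.
beta0, betas = -1.0, [0.5, 0.25]
print(classify([2.0, 1.0], beta0, betas))  # logit = 0.25, so probability is above 0.5
print(classify([0.0, 0.0], beta0, betas))  # logit = -1.0, so probability is below 0.5
```

With a cutoff of 0.5 the rule is the same as predicting 1 exactly when the predicted logit is positive.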

A simple assessment of the value of the model is the proportion of correctly classified observations.
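
As a small sketch of that assessment, the proportion correctly classified is the number of agreements between actual and predicted classes divided by the number of observations. The function name and the two short lists are invented for the example.

```python
def proportion_correct(actual, predicted):
    """Fraction of observations whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# Made-up classes for five observations: three of the five agree.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
print(proportion_correct(actual, predicted))
```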

Out of sample prediction

Using the same set of data to fit and validate the model is NOT a good idea. Models fit and judged on the same data tend to overfit; that is, they zealously model some of the noise, so the in-sample classification rate is too optimistic.

A very practical model selection and validation paradigm is the Training set/Validation set approach. Split your data into two independent pieces.

On one set fit the model -- the training set.
On the other one validate the model -- do out of sample prediction.
This idea can also be used for model selection: pick the candidate model that does the best out of sample prediction.
This requires a definition of "best out of sample prediction".
For dichotomous classification, the proportion of correctly classified observations in the validation set can define "best".
Note: no more R-squared, significance of coefficients, etc. The focus is entirely on prediction -- explanation is not the objective.
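
The training set/validation set paradigm can be sketched end to end as follows. Everything here is an assumption made for illustration: the data are simulated (not the internet demographics data), the fit uses plain gradient ascent instead of the SPlus fitting routine, and all function and variable names are hypothetical.

```python
import math
import random

def inverse_logit(eta):
    """Map a logit to a probability without overflowing exp()."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def prob(beta, x):
    """Predicted P(Y = 1) for one row x; beta[0] is the intercept."""
    return inverse_logit(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(X, y, steps=500, rate=0.5):
    """Fit a logistic regression by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - prob(beta, xi)
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta

def proportion_correct(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def split_rows(n, train_fraction, rng):
    """Randomly split row indices into two non-overlapping pieces."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * train_fraction)
    return idx[:cut], idx[cut:]

# Simulated data: Y depends on the first column only; the second is pure noise.
rng = random.Random(701)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
y = [1 if rng.random() < inverse_logit(-0.5 + 2.0 * x1) else 0 for x1, _ in X]

train, valid = split_rows(len(X), 0.5, rng)

# Candidate models, each described by the columns it uses.
candidates = {"x1 only": [0], "x1 and x2": [0, 1]}
scores = {}
for name, cols in candidates.items():
    X_train = [[X[i][c] for c in cols] for i in train]
    X_valid = [[X[i][c] for c in cols] for i in valid]
    beta = fit_logistic(X_train, [y[i] for i in train])       # fit on training set
    predicted = [1 if prob(beta, x) > 0.5 else 0 for x in X_valid]
    scores[name] = proportion_correct([y[i] for i in valid], predicted)  # validate

best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```

The validation rows play no part in the fitting, so the scores compare the candidates on data they have not seen.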

Example: the internet demographics data set. I have extracted the variables:

 [1] "Newbie"               "Age"                 
 [3] "Gender"               "Household.Income"    
 [5] "Sexual.Preference"    "Country"             
 [7] "Education.Attainment" "Major.Occupation"    
 [9] "Marital.Status"       "Years.on.Internet"

The objective is to classify users into the Newbie category, that is, those who have been on the Internet for less than a year, based on a set of demographic indicators.

uva.txt: The data set (1.6 MB).

uva.ssc: Some SPlus code for analysis.

Richard Waterman
Tue Nov 4 00:16:39 EST 1997