Class 17 Stat701 Fall 1997
Classification and Prediction
Today's class.
- Classification is a frequently used technique.
- Many decisions are naturally recognized as classification problems.
- Bankruptcy prediction.
- Insurance company rating.
- Credit approval.
- Debt risk assessment -- bond rating/credit scoring.
- Purchase/Don't purchase.
- Survival at 5 years.
- Discharge from hospital within 2 days.
- Basic concept: Y is a dichotomous variable. We wish to predict it
from a set of X's.
- That is to classify Y as either a 1 or a 0.
- Make a model for the probability that Y = 1. Use the
classification rule:
If P(Y=1) > .5 then predict 1, otherwise predict 0.
Logistic regression chooses a model of the form
logit(P(Y = 1)) = beta0 + beta1 X1 + beta2 X2 + ... + betap Xp
where logit(p) = log(p / (1 - p)) is the log odds.
From the predicted logit we can find the predicted probability.
Then use the classification rule on the predicted probabilities.
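
The step from predicted logit to predicted class can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficient values are made up for the example and are not taken from the course code.

```python
import math

def inverse_logit(eta):
    """Map a logit (log odds) to a probability in (0, 1)."""
    # Two branches so that exp() never overflows for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def predict_probability(intercept, coefs, x):
    """Predicted P(Y = 1) for one observation x under a logistic model."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return inverse_logit(eta)

def classify(prob, cutoff=0.5):
    """Rule from the notes: predict 1 if P(Y = 1) > cutoff, otherwise 0."""
    return 1 if prob > cutoff else 0

# Hypothetical coefficients and observation, chosen only for illustration.
intercept = -1.0
coefs = [0.8, -0.5]
x = [2.0, 1.0]

p = predict_probability(intercept, coefs, x)  # logit = -1.0 + 1.6 - 0.5 = 0.1
label = classify(p)
print(round(p, 4), label)
```

Since the inverse logit is increasing and equals .5 at 0, the rule "P(Y=1) > .5" is the same as "predicted logit > 0".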
A simple assessment of the value of the model is the proportion of
correctly classified observations.
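
As a small sketch of this assessment, the proportion correctly classified is just the number of matches between actual and predicted classes divided by the number of observations. The data values below are invented for the example.

```python
def proportion_correct(actual, predicted):
    """Fraction of observations whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

# Made-up outcomes and predicted probabilities, for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]

# Apply the .5 classification rule, then score the predictions.
predicted = [1 if p > 0.5 else 0 for p in predicted_prob]
rate = proportion_correct(actual, predicted)
print(rate)  # 6 of 8 correct -> 0.75
```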
Out of sample prediction
Using the same set of data to fit and validate the model is NOT
a good idea. Models fit in this way tend to overfit, that is, to model
some of the noise too zealously.
A very practical model selection and validation paradigm is the
Training set/Validation set approach.
Split your data into two independent pieces.
- On one set fit the model -- the training set.
- On the other one validate the model -- do out of sample prediction.
- Can use this idea for model selection -- choose the model that does the
best out of sample prediction.
- Need to define "best out of sample prediction".
- For dichotomous classification can use proportion of correctly classified
observations to define "best".
- Note: no more R-squared, significance of coefficients, etc. The focus is entirely on prediction -- explanation is not the objective.
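
The training set/validation set paradigm above might be sketched as follows. Everything here is an assumption made for illustration: the data are simulated, the helper names are hypothetical, and the simple gradient-ascent fitter is only a stand-in for a proper logistic regression routine such as the one in a statistics package.

```python
import math
import random

def sigmoid(eta):
    """Inverse logit, written to avoid overflow in exp()."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def split_train_validation(rows, train_fraction=0.5, seed=0):
    """Randomly split rows into a training piece and a validation piece."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_logistic(rows, columns, steps=500, rate=0.1):
    """Fit by plain gradient ascent on the log-likelihood; beta[0] is the intercept."""
    beta = [0.0] * (len(columns) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in rows:
            feats = [1.0] + [x[c] for c in columns]
            err = y - sigmoid(sum(b * f for b, f in zip(beta, feats)))
            for j, f in enumerate(feats):
                grad[j] += err * f
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta

def proportion_correct(rows, columns, beta):
    """Proportion correctly classified using the P(Y=1) > .5 rule."""
    hits = 0
    for x, y in rows:
        feats = [1.0] + [x[c] for c in columns]
        p = sigmoid(sum(b * f for b, f in zip(beta, feats)))
        hits += int((1 if p > 0.5 else 0) == y)
    return hits / len(rows)

# Simulated data (not real): Y depends on x[0] only; x[1] is pure noise.
rng = random.Random(1)
data = []
for _ in range(200):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    y = 1 if rng.random() < sigmoid(2.0 * x[0]) else 0
    data.append((x, y))

train, valid = split_train_validation(data, train_fraction=0.5, seed=42)

# Candidate models, each defined by the columns of x it uses.
candidates = {"x0 only": [0], "x0 and x1": [0, 1]}
scores = {}
for name, cols in candidates.items():
    beta = fit_logistic(train, cols)                       # fit on training set
    scores[name] = proportion_correct(valid, cols, beta)   # score out of sample

best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```

The key design point is that the validation rows are never seen by the fitting step, so the scores used for model selection are genuinely out of sample.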
Example: the Internet demographics data set.
I have extracted the variables:
[1] "Newbie" "Age"
[3] "Gender" "Household.Income"
[5] "Sexual.Preference" "Country"
[7] "Education.Attainment" "Major.Occupation"
[9] "Marital.Status" "Years.on.Internet"
The objective is to attempt to classify users into the Newbie category,
that is, those who have been on the Internet for less than a year, based
on a set of demographic indicators.
uva.txt -- The data set (1.6 MB).
uva.ssc -- Some S-Plus code for analysis.
Richard Waterman
Tue Nov 4 00:16:39 EST 1997