Class 11. Regression for a categorical response
What you need to have learnt from Class 10: comparing the mean across two categorical variables.

- Two basic models:
  - No interaction: the impact of X1 on Y does not depend on the level of X2.
  - Interaction: the impact of X1 on Y depends on the level of X2.
- Practical consequences:
  - If NO interaction, then you can investigate the impact of each X by itself.
  - If there is interaction (consider practical importance as well as statistical significance), then you must consider both X1 and X2 together (see the sketch after this list).
- Know and check the assumptions for ANOVA.
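
As a quick refresher in code, a minimal sketch of the interaction check, assuming Python with pandas and statsmodels; the two-factor data are synthetic and generated with no true interaction, so the C(x1):C(x2) p-value should come out large:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Synthetic two-factor data, invented for illustration: additive effects only.
df = pd.DataFrame({
    "x1": rng.choice(["a", "b"], size=120),
    "x2": rng.choice(["low", "high"], size=120),
})
df["y"] = 5 + 2 * (df.x1 == "b") + 3 * (df.x2 == "high") + rng.normal(0, 1, 120)

# Fit the interaction model; the C(x1):C(x2) row tests the interaction.
fit = ols("y ~ C(x1) * C(x2)", data=df).fit()
print(anova_lm(fit, typ=2))
```
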
New material for today: regression for a categorical response (logistic regression).

- Objective: model a categorical (two-group) response.
- Example: how do gender and income affect the probability that a customer purchases a product?
- Problem: linear regression does not respect the range of the response data (it is categorical).
- Solution: model the probability that Y = 1, i.e. P(Y = 1 | X), in a special way.
- Transform P(Y = 1) with the "logit" transform.
- Now fit a straight-line regression to the logit of the probabilities (this respects the range of the data). A fitting sketch in code follows this bullet.
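
The notes read this fit off JMP-style output, but it can be reproduced in code. A minimal sketch, assuming Python with numpy and statsmodels; the income/purchase data are synthetic, invented only to have something to fit:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic illustration: income (in $000s) and a 0/1 purchase indicator.
income = rng.uniform(20, 120, size=200)
true_logit = -4 + 0.05 * income                 # assumed "true" relationship
purchase = rng.binomial(1, np.exp(true_logit) / (1 + np.exp(true_logit)))

# Fit logit(p) = B0 + B1 * income by maximum likelihood.
X = sm.add_constant(income)
fit = sm.Logit(purchase, X).fit(disp=False)
print(fit.params)   # estimates of B0 and B1
print(fit.pvalues)  # analogous to JMP's Prob>ChiSq column
```
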
- On the original scale (the probabilities), the transform traces an S-shaped curve: for each fixed value of X, the curve gives the probability that Y = 1. A sketch that draws the curve follows.
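
A minimal sketch to draw that curve, assuming Python with matplotlib; the intercept and slope are made-up values chosen only to show the shape:

```python
import numpy as np
import matplotlib.pyplot as plt

b0, b1 = -4.0, 0.05                  # illustrative values, not from any output
x = np.linspace(0, 160, 400)
p = np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

plt.plot(x, p)                       # S-shaped curve, trapped between 0 and 1
plt.xlabel("X")
plt.ylabel("P(Y = 1 | X)")
plt.title("Probability that Y = 1 for a fixed value of X")
plt.show()
```
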
- The logit is defined as logit(p) = ln(p/(1-p)). Example: logit(0.25) = ln(0.25/(1 - 0.25)) = ln(1/3) = -1.099.
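
Checking that example numerically, a tiny Python sketch:

```python
from math import log

def logit(p):
    """The logit: natural log of the odds p/(1 - p)."""
    return log(p / (1 - p))

print(logit(0.25))  # ln(1/3) = -1.0986...
```
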
- The three worlds of logistic regression:
  - The probabilities: this is where most people live. A probability lies in (0, 1).
  - The odds: this is where the gamblers live. Odds lie in (0, Infinity).
  - The logit: this is where the model lives. A logit lies in (-Infinity, Infinity). Lines also range over (-Infinity, Infinity), so fit a line to the logit.
- Must feel comfortable moving between the three worlds.
- Rules for moving between the worlds, writing p for P(Y = 1 | X) for simplicity (a code sketch follows this list):
  - logit(p) = ln(p/(1-p))
  - p = exp(logit(p))/(1 + exp(logit(p))) *** Key to get back to the real world.
  - odds(p) = p/(1-p)
  - odds(p) = exp(logit(p)) *** Key for interpretation.
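
The four rules translate directly into code. A minimal sketch in Python; the helper names are mine, not the course's:

```python
from math import exp, log

def logit(p):
    """Probability world -> logit world."""
    return log(p / (1 - p))

def inv_logit(ell):
    """Logit world -> probability world: the key to get back."""
    return exp(ell) / (1 + exp(ell))

def odds(p):
    """Probability world -> odds world."""
    return p / (1 - p)

p = 0.25
print(inv_logit(logit(p)))   # 0.25: the round trip recovers p
print(odds(p))               # 0.333...: odds of 1 to 3
print(exp(logit(p)))         # 0.333...: odds(p) = exp(logit(p))
```
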
- Interpreting the output:
  - P-values are under the Prob>ChiSq column.
  - Main equation: logit(p) = B0 + B1 X.
  - B1 = 0: no relationship between X and p.
  - B1 > 0: as X goes up, p goes up.
  - B1 < 0: as X goes up, p goes down.
  - B1: for every one-unit change in X, the ODDS that Y = 1 change by a multiplicative factor of exp(B1).
  - At X = -B0/B1 there is a 50% chance that Y = 1 (see the sketch after this list).
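
A quick numeric illustration of the last two bullets, with hypothetical coefficients (B0 = -3 and B1 = 0.5 are made up for illustration):

```python
from math import exp

b0, b1 = -3.0, 0.5   # hypothetical fitted coefficients

# Odds interpretation: each one-unit increase in X multiplies the
# odds that Y = 1 by exp(B1).
print(exp(b1))       # 1.6487...: the odds go up about 65% per unit of X

# 50% point: logit(p) = 0 when B0 + B1*X = 0, i.e. X = -B0/B1.
print(-b0 / b1)      # 6.0: at X = 6 there is a 50% chance that Y = 1
```
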
- Key calculation: from the logistic regression output, calculate a probability. Example: Challenger output on p. 306.
  - logit(p) = 15.05 - 0.23 Temp.
  - Find the probability that Y = 1 (at least one O-ring failure) at a temperature of 31 degrees.
  - logit(p) = 15.05 - 0.23 * 31
  - logit(p) = 7.92
  - p = exp(logit(p))/(1 + exp(logit(p)))
  - p = exp(7.92)/(1 + exp(7.92)) = 0.99964
  - There is roughly a 99.96 percent chance of at least one failure at a temperature of 31 degrees. The sketch below reproduces the calculation.
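
The same calculation in a few lines of Python; a sketch to check the arithmetic, using the coefficients as printed above rather than re-estimating them:

```python
from math import exp

b0, b1 = 15.05, -0.23    # Challenger output, p. 306
temp = 31                # forecast launch temperature (degrees F)

ell = b0 + b1 * temp     # logit(p) = 7.92
p = exp(ell) / (1 + exp(ell))
print(ell, p)            # 7.92 0.99964...

# The 50-50 temperature is at Temp = -B0/B1.
print(-b0 / b1)          # 65.43...: above about 65 degrees the failure odds drop below even
```
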
Richard Waterman
Wed Oct 9 22:50:58 EDT 1996