Class 5

What you need to have learnt from Class 4: Comparing group means.

* Know the objectives of ANOVA and multiple comparisons.
* Understand why we can't compare each pair of groups with lots of two-sample t-tests (the many comparisons inflate the overall error rate).
* Be able to interpret the tables that come with Hsu's and Tukey's comparison procedures.
* Know and check the assumptions for ANOVA.

Two-way ANOVA

* Two basic models:
* No interaction: the impact of X1 on Y does not depend on the level of X2.
* Interaction: the impact of X1 on Y depends on the level of X2.

* Practical consequences (a model-fitting sketch follows this section):
* If there is NO interaction, then you can investigate the impact of each X by itself.
* If there is an interaction (consider practical importance as well as statistical significance), then you must consider X1 and X2 together.

* Know and check the assumptions for ANOVA.
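A short Python sketch of the two models. The class output comes from JMP; this uses the statsmodels formula interface on a small made-up data set, so the variable names and numbers are illustrative only.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical two-factor data: y measured at each x1-by-x2 combination.
    df = pd.DataFrame({
        "x1": ["a", "a", "b", "b"] * 4,
        "x2": ["lo", "hi"] * 8,
        "y":  [5.1, 7.2, 6.0, 9.8, 4.9, 7.0, 6.2, 10.1,
               5.3, 7.4, 5.8, 9.6, 5.0, 6.9, 6.1, 10.0],
    })

    # No-interaction (additive) model: the effect of x1 does not depend on x2.
    additive = smf.ols("y ~ C(x1) + C(x2)", data=df).fit()

    # Interaction model: the effect of x1 may depend on the level of x2.
    interact = smf.ols("y ~ C(x1) * C(x2)", data=df).fit()

    # The interaction row of the ANOVA table tests whether that extra term is needed.
    print(sm.stats.anova_lm(interact, typ=2))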

New material for today: Regression for a categorical response (logistic regression)

* Objective: model a categorical (2-group) response.
* Example: how do gender and income affect the probability that a customer purchases a product?
* Problem: linear regression does not respect the range of the response data (it's categorical).
* Solution: model the probability that Y = 1, i.e. P(Y = 1 | X), in a special way.
* Transform P(Y = 1) with the "logit" transform.
* Now fit a straight-line regression to the logit of the probabilities (this respects the range of the data).
* On the original scale (probabilities) the transform gives the S-shaped logistic curve. [Figure: the logistic curve on the probability scale]
* The logit is defined as logit(p) = ln(p/(1-p)). Example: logit(0.25) = ln(0.25/(1 - 0.25)) = ln(1/3) = -1.099.
*
The three worlds of logistic regression.
*
The probabilities: this is where most people live.
*
The odds: this is where the gamblers live.
*
The logit: this is where the model lives.

* You must feel comfortable moving between the three worlds.
* Rules for moving between the worlds (write p for P(Y = 1 | X) for simplicity; a code sketch follows this list):
* logit(p) = ln(p/(1-p))
* p = exp(logit(p))/(1 + exp(logit(p))) *** Key to get back to the real world.
* odds(p) = p/(1-p)
* odds(p) = exp(logit(p)) *** Key for interpretation.
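A minimal Python sketch of these conversion rules (the function names are mine, not from the Bulk Pack):

    import math

    def logit(p):
        """Probability -> logit: ln(p / (1 - p))."""
        return math.log(p / (1 - p))

    def inv_logit(x):
        """Logit -> probability: exp(x) / (1 + exp(x))."""
        return math.exp(x) / (1 + math.exp(x))

    def odds(p):
        """Probability -> odds: p / (1 - p); also equals exp(logit(p))."""
        return p / (1 - p)

    # Check the worked example above: logit(0.25) = ln(1/3).
    print(logit(0.25))          # -1.0986...
    print(inv_logit(-1.0986))   # back to roughly 0.25
    print(odds(0.25))           # 1/3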

* Interpreting the output:
* P-values are under the Prob>ChiSq column.
* Main equation: logit(p) = B0 + B1 X.
* B1: for every one-unit change in X, the ODDS that Y = 1 change by a multiplicative factor of exp(B1).
* B1 = 0: no relationship between X and p.
* B1 > 0: as X goes up, p goes up.
* B1 < 0: as X goes up, p goes down.
* At X = -B0/B1 there is a 50% chance that Y = 1 (since logit(0.5) = 0, set B0 + B1 X = 0 and solve for X).

* Key calculation - based on the logistic regression output, calculate a probability. Example: Challenger output on p.282.

Orings.jmp p279.
* logit(p) = 15.05 - 0.23 Temp.
* Find the probability that Y = 1 (at least one O-ring failure) at a temperature of 31 degrees.
* logit(p) = 15.05 - 0.23 * 31 = 7.92.
* p = exp(logit(p))/(1 + exp(logit(p)))
* p = exp(7.92)/(1 + exp(7.92)) = 0.9996.
* There is a 99.96 percent chance of at least one failure.
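The same calculation as a quick Python check, using the rounded coefficients above:

    import math

    b0, b1 = 15.05, -0.23   # rounded coefficients from the JMP output
    temp = 31

    log_odds = b0 + b1 * temp                        # logit(p) = 7.92
    p = math.exp(log_odds) / (1 + math.exp(log_odds))
    print(round(log_odds, 2), round(p, 4))           # 7.92 0.9996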


Juice.jmp p285.
* Can you parse the output? Bulk Pack p.296. [Figures: JMP logistic regression output for the juice data]
* 1. The overall test in logistic regression: is anything going on, i.e. is any combination of the predictors useful in predicting Y (the logits of the probabilities)? Here the small p-value indicates that this is the case.
* 2. Is a specific coefficient significant (useful) after having controlled for the other variables in the model? The small p-value says this is indeed the case.
* 3. What does the 2.82 tell you?
* For every 1-unit change in price diff, the logit of the probability of buying CH changes by 2.82 (controlling for Loyal CH and Store 7).
* BETTER: for every one-unit (i.e. a dollar) change in price diff, the odds of buying CH change by a multiplicative factor of exp(2.82) = 16.8.

* Key calculation. At Loyal CH of 0.8, a price diff of 20 cents, and the product sold in Store 7, what is the predicted probability of buying CH?
 1. Find the logit: logit = -3.06 + 6.32 * 0.8 + 2.82 * 0.2 + 0.35 = 2.91.
 2. Probability = exp(logit)/(1 + exp(logit)) = exp(2.91)/(1 + exp(2.91)) = 0.948.
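Again as a quick Python check (the 0.35 is the Store 7 term, as in the hand calculation above):

    import math

    # Coefficients from the JMP output: intercept, Loyal CH, price diff, Store 7.
    b0, b_loyal, b_price, b_store7 = -3.06, 6.32, 2.82, 0.35

    log_odds = b0 + b_loyal * 0.8 + b_price * 0.20 + b_store7 * 1
    p = math.exp(log_odds) / (1 + math.exp(log_odds))
    print(round(log_odds, 2), round(p, 3))   # 2.91 0.948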


Regression for time series.

* Objective: model a time series.
* Example: model default rates on mortgages as a function of interest rates.
* Problem: time series often have autocorrelated error terms, which violates the standard assumption of independence.
* Definition: autocorrelation - successive error terms are dependent (see p.47 of the Bulk Pack).
* Diagnostics (a code sketch follows this list):
* Key graphic - residuals plotted against time; look for tracking in the residual plot.
* Look at the Durbin-Watson statistic: values below 1.5 or above 2.5 suggest a problem.
* The correlation of successive residuals is roughly 1 - DW/2.
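A small numpy sketch of the Durbin-Watson diagnostic on simulated autocorrelated residuals (the simulation is illustrative, not from the Bulk Pack):

    import numpy as np

    def durbin_watson(resid):
        """DW = sum of squared successive differences / sum of squared residuals."""
        resid = np.asarray(resid, dtype=float)
        return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

    # Simulate residuals with positive autocorrelation (AR(1) with coefficient 0.7).
    rng = np.random.default_rng(0)
    e = np.zeros(100)
    for t in range(1, 100):
        e[t] = 0.7 * e[t - 1] + rng.normal()

    dw = durbin_watson(e)
    print(dw)            # well below 1.5 -> evidence of positive autocorrelation
    print(1 - dw / 2)    # rough estimate of the lag-1 residual correlation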

* Consequences of positive autocorrelation:
* Over-optimistic about the information content in the data.
* Standard errors for slopes are too small, so confidence intervals are too narrow.
* You think variables are significant when really they are not.
* A false sense of precision.

* Fix-ups (a differencing sketch follows this list):
* Use differences of both Y and X, not the raw data (p.325).
* Include lagged residuals in the model (p.308).
* Include lagged Y in the model as an X-variable (p.332).

* Benefits of differencing:
* Often reduces autocorrelation.
* Can reduce collinearity between X-variables.
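A minimal numpy sketch of the first fix-up, differencing both series before regressing (the series here are made up):

    import numpy as np

    # Hypothetical time series: y (e.g. default rate), x (e.g. interest rate).
    y = np.array([2.1, 2.3, 2.8, 3.0, 3.6, 3.9, 4.5])
    x = np.array([5.0, 5.2, 5.6, 5.9, 6.4, 6.7, 7.3])

    # Difference BOTH series, then regress dy on dx instead of y on x.
    dy, dx = np.diff(y), np.diff(x)

    # Least-squares slope and intercept for dy ~ dx.
    slope, intercept = np.polyfit(dx, dy, 1)
    print(slope, intercept)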



Cellular.jmp p305. CompSale.jmp p317.

Richard Waterman
Thu Aug 21 20:52:16 EDT 1997