Class 5

What you need to have learnt from Class 4: Comparing group means.

* Know the objectives of ANOVA and multiple comparisons.
* Understand why we can't compare each pair of groups with lots of two-sample t-tests (the many comparisons inflate the overall error rate).
* Be able to interpret the tables that come with Hsu's and Tukey's comparison procedures.
* Know and check the assumptions for ANOVA.

Two-way ANOVA

* Two basic models:
* No interaction: the impact of X1 on Y does not depend on the level of X2.
* Interaction: the impact of X1 on Y depends on the level of X2.

* Practical consequences (a model-fitting sketch follows this section):
* If there is NO interaction, then you can investigate the impact of each X by itself.
* If there is an interaction (consider practical importance as well as statistical significance), then you must consider X1 and X2 together.

* Know and check the assumptions for ANOVA.
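A short Python sketch of the two models. The class output comes from JMP; this uses the statsmodels formula interface on a small made-up data set, so the variable names and numbers are illustrative only.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical two-factor data: y measured at each x1-by-x2 combination.
    df = pd.DataFrame({
        "x1": ["a", "a", "b", "b"] * 4,
        "x2": ["lo", "hi"] * 8,
        "y":  [5.1, 7.2, 6.0, 9.8, 4.9, 7.0, 6.2, 10.1,
               5.3, 7.4, 5.8, 9.6, 5.0, 6.9, 6.1, 10.0],
    })

    # No-interaction (additive) model: the effect of x1 does not depend on x2.
    additive = smf.ols("y ~ C(x1) + C(x2)", data=df).fit()

    # Interaction model: the effect of x1 may depend on the level of x2.
    interact = smf.ols("y ~ C(x1) * C(x2)", data=df).fit()

    # The interaction row of the ANOVA table tests whether that extra term is needed.
    print(sm.stats.anova_lm(interact, typ=2))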

New material for today: Regression for a categorical response (logistic regression)

* Objective: model a categorical (2-group) response.
* Example: how do gender and income affect the probability that a customer purchases a product?
* Problem: linear regression does not respect the range of the response data (it's categorical).
* Solution: model the probability that Y = 1, i.e. P(Y = 1 | X), in a special way.
* Transform P(Y = 1) with the "logit" transform.
* Now fit a straight-line regression to the logit of the probabilities (this respects the range of the data).
* On the original scale (probabilities) the transform gives the S-shaped logistic curve. [Figure: the logistic curve on the probability scale]
* The logit is defined as logit(p) = ln(p/(1-p)). Example: logit(0.25) = ln(0.25/(1 - 0.25)) = ln(1/3) = -1.099.
*
The three worlds of logistic regression.
*
The probabilities: this is where most people live.
*
The odds: this is where the gamblers live.
*
The logit: this is where the model lives.

* You must feel comfortable moving between the three worlds.
* Rules for moving between the worlds (write p for P(Y = 1 | X) for simplicity; a code sketch follows this list):
* logit(p) = ln(p/(1-p))
* p = exp(logit(p))/(1 + exp(logit(p))) *** Key to get back to the real world.
* odds(p) = p/(1-p)
* odds(p) = exp(logit(p)) *** Key for interpretation.
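A minimal Python sketch of these conversion rules (the function names are mine, not from the Bulk Pack):

    import math

    def logit(p):
        """Probability -> logit: ln(p / (1 - p))."""
        return math.log(p / (1 - p))

    def inv_logit(x):
        """Logit -> probability: exp(x) / (1 + exp(x))."""
        return math.exp(x) / (1 + math.exp(x))

    def odds(p):
        """Probability -> odds: p / (1 - p); also equals exp(logit(p))."""
        return p / (1 - p)

    # Check the worked example above: logit(0.25) = ln(1/3).
    print(logit(0.25))          # -1.0986...
    print(inv_logit(-1.0986))   # back to roughly 0.25
    print(odds(0.25))           # 1/3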

* Interpreting the output:
* P-values are under the Prob>ChiSq column.
* Main equation: logit(p) = B0 + B1 X.
* B1: for every one-unit change in X, the ODDS that Y = 1 change by a multiplicative factor of exp(B1).
* B1 = 0: no relationship between X and p.
* B1 > 0: as X goes up, p goes up.
* B1 < 0: as X goes up, p goes down.
* At X = -B0/B1 there is a 50% chance that Y = 1 (since logit(0.5) = 0, set B0 + B1 X = 0 and solve for X).

* Key calculation - based on the logistic regression output, calculate a probability. Example: Challenger output on p.282.

Orings.jmp p279.
* logit(p) = 15.05 - 0.23 Temp.
* Find the probability that Y = 1 (at least one O-ring failure) at a temperature of 31 degrees.
* logit(p) = 15.05 - 0.23 * 31 = 7.92.
* p = exp(logit(p))/(1 + exp(logit(p)))
* p = exp(7.92)/(1 + exp(7.92)) = 0.9996.
* There is a 99.96 percent chance of at least one failure.
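The same calculation as a quick Python check, using the rounded coefficients above:

    import math

    b0, b1 = 15.05, -0.23   # rounded coefficients from the JMP output
    temp = 31

    log_odds = b0 + b1 * temp                        # logit(p) = 7.92
    p = math.exp(log_odds) / (1 + math.exp(log_odds))
    print(round(log_odds, 2), round(p, 4))           # 7.92 0.9996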


Juice.jmp p285.
* Can you parse the output? Bulk Pack p.296. [Figures: JMP logistic regression output for the juice data]
* 1. The overall test in logistic regression: is anything going on, i.e. is any combination of the predictors useful in predicting Y (the logits of the probabilities)? Here the small p-value indicates that this is the case.
* 2. Is a specific coefficient significant (useful) after having controlled for the other variables in the model? The small p-value says this is indeed the case.
* 3. What does the 2.82 tell you?
* For every 1-unit change in price diff, the logit of the probability of buying CH changes by 2.82 (controlling for Loyal CH and Store 7).
* BETTER: for every one-unit (i.e. a dollar) change in price diff, the odds of buying CH change by a multiplicative factor of exp(2.82) = 16.8.

* Key calculation. At Loyal CH of 0.8, a price diff of 20 cents, and the product sold in Store 7, what is the predicted probability of buying CH?
 1. Find the logit: logit = -3.06 + 6.32 * 0.8 + 2.82 * 0.2 + 0.35 = 2.91.
 2. Probability = exp(logit)/(1 + exp(logit)) = exp(2.91)/(1 + exp(2.91)) = 0.948.
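Again as a quick Python check (the 0.35 is the Store 7 term, as in the hand calculation above):

    import math

    # Coefficients from the JMP output: intercept, Loyal CH, price diff, Store 7.
    b0, b_loyal, b_price, b_store7 = -3.06, 6.32, 2.82, 0.35

    log_odds = b0 + b_loyal * 0.8 + b_price * 0.20 + b_store7 * 1
    p = math.exp(log_odds) / (1 + math.exp(log_odds))
    print(round(log_odds, 2), round(p, 3))   # 2.91 0.948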


Regression for time series.

* Objective: model a time series.
* Example: model default rates on mortgages as a function of interest rates.
* Problem: time series often have autocorrelated error terms, which violates the standard assumption of independence.
* Definition: autocorrelation - successive error terms are dependent (see p.47 of the Bulk Pack).
* Diagnostics (a code sketch follows this list):
* Key graphic - residuals plotted against time; look for tracking in the residual plot.
* Look at the Durbin-Watson statistic: values below 1.5 or above 2.5 suggest a problem.
* The correlation of successive residuals is roughly 1 - DW/2.
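A small numpy sketch of the Durbin-Watson diagnostic on simulated autocorrelated residuals (the simulation is illustrative, not from the Bulk Pack):

    import numpy as np

    def durbin_watson(resid):
        """DW = sum of squared successive differences / sum of squared residuals."""
        resid = np.asarray(resid, dtype=float)
        return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

    # Simulate residuals with positive autocorrelation (AR(1) with coefficient 0.7).
    rng = np.random.default_rng(0)
    e = np.zeros(100)
    for t in range(1, 100):
        e[t] = 0.7 * e[t - 1] + rng.normal()

    dw = durbin_watson(e)
    print(dw)            # well below 1.5 -> evidence of positive autocorrelation
    print(1 - dw / 2)    # rough estimate of the lag-1 residual correlation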

* Consequences of positive autocorrelation:
* Over-optimistic about the information content in the data.
* Standard errors for slopes are too small, so confidence intervals are too narrow.
* You think variables are significant when really they are not.
* A false sense of precision.

* Fix-ups (a differencing sketch follows this list):
* Use differences of both Y and X, not the raw data (p.325).
* Include lagged residuals in the model (p.308).
* Include lagged Y in the model as an X-variable (p.332).

* Benefits of differencing:
* Often reduces autocorrelation.
* Can reduce collinearity between X-variables.
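A minimal numpy sketch of the first fix-up, differencing both series before regressing (the series here are made up):

    import numpy as np

    # Hypothetical time series: y (e.g. default rate), x (e.g. interest rate).
    y = np.array([2.1, 2.3, 2.8, 3.0, 3.6, 3.9, 4.5])
    x = np.array([5.0, 5.2, 5.6, 5.9, 6.4, 6.7, 7.3])

    # Difference BOTH series, then regress dy on dx instead of y on x.
    dy, dx = np.diff(y), np.diff(x)

    # Least-squares slope and intercept for dy ~ dx.
    slope, intercept = np.polyfit(dx, dy, 1)
    print(slope, intercept)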



Cellular.jmp p305. CompSale.jmp p317.

Richard Waterman
Thu Aug 21 20:52:16 EDT 1997