Class 2
Review from Class 1.
- Understand, interpret and distinguish
the regression summaries:
- R-squared.
- Root Mean Squared Error (RMSE).
- The interpretation and the benefits of using a confidence
interval for the slope.
- Two types of prediction, each with its own interval (range of
feasible values); both are computed in the sketch after this list:
- Estimate a typical observation (conf curve: fit).
- Predict a single new observation (conf curve: indiv).
- The dangers of extrapolating outside the range of your data
(three sources of uncertainty):
- The uncertainty in our estimate of the true regression line.
- The uncertainty due to the inherent variation of the data
about the line.
- The uncertainty due to the fact that maybe we should not be
using a line in the first place (model misspecification)!
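A minimal sketch of these Class 1 summaries in Python with
statsmodels, on made-up data (the coefficients, seed and x-value
below are hypothetical; any simple regression data would do):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)                 # simulated data
    x = np.linspace(0, 10, 50)
    y = 3.0 + 2.0 * x + rng.normal(scale=2.0, size=x.size)

    fit = sm.OLS(y, sm.add_constant(x)).fit()      # add_constant: intercept

    print(fit.rsquared)                            # R-squared
    print(np.sqrt(fit.mse_resid))                  # RMSE
    print(fit.conf_int(alpha=0.05))                # 95% CIs; second row is the slope

    # The two kinds of interval at a new x inside the data range:
    new = sm.add_constant(np.array([5.0]), has_constant="add")
    pred = fit.get_prediction(new)
    print(pred.conf_int())                         # typical observation (conf curve: fit)
    print(pred.conf_int(obs=True))                 # one new observation (conf curve: indiv)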
New material for Class 2.
- Making more realistic models with many X-variables - multiple
regression analysis.
- The fundamental differences between simple and multiple regression.
- The X-variables may be related (correlated) with one another.
- Consequence: looking at one X-variable at a time may present
a misleading picture of the true relationship between Y and the
X-variables.
- The difference between marginal and partial
slopes. Marginal: the slope of the regression line for one
X-variable, ignoring the impact of all the others. Partial: the
slope of the regression line for one X-variable, taking into
account all the others. (A simulated demonstration follows this
list.)
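A sketch of how marginal and partial slopes can disagree, using
simulated correlated X-variables (all names and coefficients here
are invented for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)       # x2 correlated with x1
    y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

    marginal = sm.OLS(y, sm.add_constant(x1)).fit()
    both = np.column_stack([x1, x2])
    partial = sm.OLS(y, sm.add_constant(both)).fit()

    print(marginal.params[1])   # about -0.4: ignoring x2 even flips the sign
    print(partial.params[1])    # near the true 2.0: x2 held fixed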
- Key graphics for multiple regression (both are sketched in code
after this list).
- The leverage plot. A "partial tool": the analog of the
scatterplot for simple regression. It lets you look at a large
multiple regression one variable at a time in a legitimate way
(it controls for the other X-variables). Potential uses:
- Spot leveraged points.
- Identify large residuals.
- Diagnose systematic lack of fit, i.e. spot curvature which may
suggest transformations.
- Identify heteroscedasticity.
- The scatterplot matrix. A "marginal tool": presents all the
two-variable (bivariate) relationships. Potential uses:
- Identify collinearity (correlation) between X-variables.
- Identify marginal non-linear relationships between Y and X-variables.
- Determine which X-variables are marginally most
significant (thin ellipses).
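A sketch of both graphics with pandas and statsmodels, on a
hypothetical DataFrame (the column names Y, X1, X2, X3 and the
simulated data are made up; statsmodels calls the leverage plot a
partial-regression, or added-variable, plot):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)                 # simulated data
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["X1", "X2", "X3"])
    df["Y"] = df["X1"] + 0.5 * df["X2"] + rng.normal(size=100)

    pd.plotting.scatter_matrix(df)                 # marginal tool: all bivariate plots

    fit = smf.ols("Y ~ X1 + X2 + X3", data=df).fit()
    sm.graphics.plot_partregress_grid(fit)         # partial tool: one leverage plot per X
    plt.show()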
- Facts to know.
- R-squared never decreases (in practice, always increases) as you
add variables to the model.
- RMSE does not have to decrease as variables are added, because
it adjusts for the degree of freedom each variable costs. (Both
facts are demonstrated in the sketch below.)
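A sketch of both facts, comparing a model before and after adding a
pure-noise X-variable (data simulated for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 30
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    noise = rng.normal(size=n)                     # a useless extra variable

    small = sm.OLS(y, sm.add_constant(x)).fit()
    big = sm.OLS(y, sm.add_constant(np.column_stack([x, noise]))).fit()

    print(big.rsquared >= small.rsquared)          # always True
    print(np.sqrt(small.mse_resid), np.sqrt(big.mse_resid))
    # RMSE divides by n - (number of slopes) - 1, so it can go UP when
    # an added variable explains less than the degree of freedom it costs.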
- Model building philosophy in this course.
- Keep it as simple as possible (parsimony).
- Make sure everything is interpretable (especially any
transformations).
- After having met the above criteria, go for the biggest
R-squared, the smallest RMSE, and the model that makes the most
sense (sensible signs on the regression slopes).
Collinearity and Hypothesis testing
- Collinearity
- Definition: correlation between the X-variables.
- Consequence: it is difficult to establish which of the
X-variables are most important (they all carry nearly the same
information). Visually, the regression plane becomes very unstable.
- Diagnostics:
- Thin ellipses in the scatterplot matrix. (High correlation.)
- Counter-intuitive signs on the slopes.
- Large standard errors on the slopes (there's little
information about each slope individually).
- Collapsed leverage plots.
- High Variance Inflation Factors (VIFs): the factor by which the
variance of a slope estimate is inflated due to collinearity.
(Computed in the sketch after this list.)
- Insignificant t-statistics even though the overall regression
is significant (ANOVA F-test).
- Fix ups:
- Ignore it. OK if the sole objective is prediction within the
range of the data.
- Combine collinear variables in a meaningful way.
- Delete variables. OK if they are extremely correlated with the
ones you keep.
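A sketch of the VIF diagnostic with statsmodels, on simulated
collinear data (the names and the degree of correlation are made up):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(4)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)       # strongly collinear with x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    for j in (1, 2):                               # skip the intercept column
        print(variance_inflation_factor(X, j))
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on
    # the other X-variables; values far above 1 flag serious collinearity.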
- Hypothesis testing in multiple regression. Three flavors. They
all test whether slopes are equal to zero or not; they differ in
the number of slopes we are looking at simultaneously. (All three
are computed in the sketch at the end of this section.)
- Test a single regression coefficient (slope).
- Look for the t-statistic.
- The hypothesis test in English: does this variable add any
explanatory power to the model that already includes all the
other X-variables?
- Small p-value says YES, big p-value says NO.
- Test all the regression coefficients at once.
- Look for the F-statistic in the ANOVA table.
- The hypothesis test in English: do any of the X-variables in
the model explain any of the variability in the Y-variable?
- Small p-value says YES, big p-value says NO.
- Note that the test does not identify which variables are
important.
- If the answer here is NO then it's back to the drawing
board - none of your variables are any good!
- Test a subset of the regression coefficients (more than one,
but not all of them - the Partial F-test).
- You won't find this one on the output; you have to
calculate it yourself. See the formula on p.154 of the Bulk Pack.
- The test in English: do any of the X-variables in the
subset under consideration explain any of the variability in
Y?
- We use a rule of thumb for this one (because we are not using
F-tables). If the partial F is less than 1, you can be sure
that the answer is NO. If it is greater than 4, you can be sure
that the answer is YES. If it is between 1 and 4, it is a
judgment call.
- Must be able to answer this question: "why not do a whole
bunch of t-tests rather than one partial F-test?" Answer: running
many separate t-tests inflates the chance of a false positive,
whereas the partial F-test is an honest simultaneous test.
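A sketch of all three tests with statsmodels, on a hypothetical
DataFrame with columns Y, X1, X2, X3 (data simulated for
illustration; the subset tested here is X2 and X3):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(5)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["X1", "X2", "X3"])
    df["Y"] = 2.0 * df["X1"] + rng.normal(size=100)

    full = smf.ols("Y ~ X1 + X2 + X3", data=df).fit()

    print(full.tvalues, full.pvalues)     # 1) one slope at a time (t-tests)
    print(full.fvalue, full.f_pvalue)     # 2) all slopes at once (ANOVA F)

    # 3) a subset of slopes (X2 and X3): the partial F-test, via a
    #    comparison of the reduced model with the full model.
    reduced = smf.ols("Y ~ X1", data=df).fit()
    print(anova_lm(reduced, full))        # F in the last row is the partial F

    q = 2                                 # number of slopes under test
    partial_F = ((reduced.ssr - full.ssr) / q) / full.mse_resid
    print(partial_F)                      # same number, by the textbook formula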
Examples
- Car89.jmp, p.111.
- Stocks.jmp, p.140.
Richard Waterman
Sun Aug 17 22:24:25 EDT 1997