Class 2

From Class 1.

* Understand, interpret and distinguish the regression summaries:
  * R-squared.
  * Root Mean Squared Error (RMSE).
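
A minimal sketch of where these two summaries come from, using statsmodels on made-up data (all variable names here are hypothetical, not from the Bulk Pack):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 3 + 2 * x + rng.normal(0, 1.5, 50)   # a true line plus noise

    model = sm.OLS(y, sm.add_constant(x)).fit()
    print(model.rsquared)              # R-squared: fraction of variation in y explained
    print(np.sqrt(model.mse_resid))    # RMSE: typical size of a residual, in units of y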

* The interpretation and the benefits of using a confidence interval for the slope.
* Two types of prediction and their intervals (ranges of feasible values), illustrated in the sketch below:
  * Estimate a typical observation (confidence curve for the fit).
  * Predict a single new observation (confidence curve for an individual).
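
A sketch of the slope interval and the two kinds of prediction intervals, assuming the same made-up data and statsmodels fit as above:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 3 + 2 * x + rng.normal(0, 1.5, 50)
    model = sm.OLS(y, sm.add_constant(x)).fit()

    print(model.conf_int()[1])   # 95% CI for the slope: plausible values of the true slope

    x_new = sm.add_constant(np.array([5.0]), has_constant='add')
    pred = model.get_prediction(x_new).summary_frame(alpha=0.05)
    print(pred[['mean_ci_lower', 'mean_ci_upper']])   # fit: interval for a typical observation
    print(pred[['obs_ci_lower', 'obs_ci_upper']])     # indiv: interval for one new observation

The "indiv" interval is always wider: it carries both the uncertainty about the line and the scatter of points about the line.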

* The dangers in extrapolating outside the range of your data (three sources):
  * The uncertainty in our estimate of the true regression line.
  * The uncertainty due to the inherent variation of the data about the line.
  * The uncertainty due to the fact that maybe we should not be using a line in the first place (model misspecification)!
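
For simple regression, the first two sources are the two terms in the standard variance of a prediction at a new point (this is the textbook decomposition, not a quote from the Bulk Pack):

    \[
    \operatorname{Var}(y_0 - \hat{y}_0)
      = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right)
      + \sigma^2
    \]

The first term (uncertainty about the line) grows with (x_0 - \bar{x})^2, so it explodes as x_0 moves away from the bulk of the data; the second term is the inherent variation about the line. The third source, misspecification, appears in neither term, so the printed intervals can look reassuringly tight even when a line is the wrong model.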

New material for Class 2.

* Making more realistic models with many X-variables: multiple regression analysis.
* The fundamental differences between simple and multiple regression:
  * The X-variables may be related (correlated) with one another.
  * Consequence: looking at one X-variable at a time may present a misleading picture of the true relationship between Y and the X-variables.
  * The difference between marginal and partial slopes, contrasted in the sketch below. Marginal: the slope of the regression line for one X-variable, ignoring the impact of all the others. Partial: the slope of the regression line for one X-variable, taking into account all the others.
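
A sketch contrasting the two slopes on made-up data where x2 is correlated with x1 (names hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 tags along with x1
    y = 1 + 2 * x1 - x2 + rng.normal(size=n)   # true partial slope of x1 is 2

    marginal = sm.OLS(y, sm.add_constant(x1)).fit()
    both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(marginal.params[1])   # marginal slope: near 1.2, because x1 drags x2 along
    print(both.params[1])       # partial slope: near 2, holding x2 fixed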

* Key graphics for multiple regression:
  * The leverage plot. A "partial tool": the analog of the scatterplot for simple regression. It lets you look at a large multiple regression one variable at a time, in a legitimate way (it controls for the other X-variables). Potential uses (see the sketch after this list):
    * Spot leveraged points.
    * Identify large residuals.
    * Diagnose systematic lack of fit, i.e. spot curvature, which may suggest transformations.
    * Identify heteroscedasticity.
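
A leverage plot (often called an added-variable plot) can be built by hand: regress Y on all the other X-variables, regress the X of interest on those same others, and plot the two sets of residuals against each other; the slope of that cloud is exactly the partial slope. A sketch on the same made-up data as above (statsmodels also offers sm.graphics.plot_partregress):

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=200)
    y = 1 + 2 * x1 - x2 + rng.normal(size=200)

    others = sm.add_constant(x2)             # everything except x1
    ry = sm.OLS(y, others).fit().resid       # y, cleaned of x2
    rx = sm.OLS(x1, others).fit().resid      # x1, cleaned of x2
    plt.scatter(rx, ry)                      # slope of this cloud = partial slope of x1
    plt.xlabel('x1 residuals'); plt.ylabel('y residuals')
    plt.show()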

  * The scatterplot matrix. A "marginal tool": presents all the two-variable (bivariate) relationships. Potential uses (see the sketch after this list):
    * Identify collinearity (correlation) between X-variables.
    * Identify marginal non-linear relationships between Y and the X-variables.
    * Determine which X-variables are marginally most significant (thin ellipses).
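
A scatterplot matrix takes one line once the data are in a pandas DataFrame; a sketch on the same made-up data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=200)
    y = 1 + 2 * x1 - x2 + rng.normal(size=200)

    df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})
    pd.plotting.scatter_matrix(df)   # the thin x1-x2 ellipse flags collinearity
    plt.show()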

* Facts to know:
  * R-squared never decreases (in practice it always increases) as you add variables to the model.
  * RMSE does not have to decrease as variables are added to the model.
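
Both facts follow from the definitions (k is the number of X-variables in the model):

    \[
    R^2 = 1 - \frac{SSE}{SST},
    \qquad
    RMSE = \sqrt{\frac{SSE}{n - k - 1}}
    \]

Adding a variable can only reduce SSE, so R-squared can only rise; but it also reduces n - k - 1, so RMSE rises whenever the drop in SSE is too small to pay for the lost degree of freedom.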

* Model building philosophy in this course:
  * Keep it as simple as possible (parsimony).
  * Make sure everything is interpretable (especially any transformations).
  * After the above criteria are met, go for the biggest R-squared, the smallest RMSE, and the model that makes the most sense (signs on the regression slopes).

Collinearity and hypothesis testing

* Collinearity.
  * Definition: correlation between the X-variables.
  * Consequence: it is difficult to establish which of the X-variables are most important (they all look the same). Visually, the regression plane becomes very unstable.
  * Diagnostics:
    * Thin ellipses in the scatterplot matrix (high correlation).
    * Counter-intuitive signs on the slopes.
    * Large standard errors on the slopes (there is little information about them).
    * Collapsed leverage plots.
    * High Variance Inflation Factors (VIFs): the factor by which the variance of a slope estimate is inflated by collinearity. (A sketch computing VIFs follows this list.)
    * Insignificant t-statistics even though the overall regression is significant (ANOVA F-test).
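
A sketch of a VIF calculation in statsmodels on made-up, nearly collinear data; recall VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the other X's:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=100)
    x2 = x1 + 0.1 * rng.normal(size=100)           # nearly a copy of x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    for j in (1, 2):                               # skip the constant column
        print(variance_inflation_factor(X, j))     # huge values here, around 100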

* Fix-ups:
  * Ignore it. OK if the sole objective is prediction within the range of the data.
  * Combine collinear variables in a meaningful way.
  * Delete variables. OK if they are extremely correlated.

* Hypothesis testing in multiple regression. Three flavors: they all test whether slopes are equal to zero or not, and they differ in the number of slopes we look at simultaneously. The first two are printed on standard regression output (sketched below); the third you must compute yourself.
  * Test a single regression coefficient (slope).
    * Look for the t-statistic.
    * The hypothesis test in English: does this variable add any explanatory power to a model that already includes all the other X-variables?
    * A small p-value says YES; a big p-value says NO.

  * Test all the regression coefficients at once.
    * Look for the F-statistic in the ANOVA table.
    * The hypothesis test in English: do any of the X-variables in the model explain any of the variability in the Y-variable?
    * A small p-value says YES; a big p-value says NO.
    * Note that the test does not identify which variables are important.
    * If the answer is NO, then it's back to the drawing board: none of your variables are any good!
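
Both of these tests appear directly on the regression output; a sketch on made-up data (the middle slope is truly zero):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 3))
    y = 1 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)

    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(model.summary())                  # one t-statistic and p-value per slope
    print(model.fvalue, model.f_pvalue)     # overall ANOVA F-test for all slopes at once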

  * Test a subset of the regression coefficients (more than one, but not all of them): the partial F-test.
    * It's no use looking for this one in the output; you have to calculate it yourself. See the formula on p. 154 of the Bulk Pack and the sketch after this list.
    * The test in English: do any of the X-variables in the subset under consideration explain any of the variability in Y?
    * We use a rule of thumb for this one (because we are not using F-tables): if the partial F is less than 1, you can be sure the answer is NO; if it is greater than 4, you can be sure the answer is YES; in between 1 and 4, it is a judgment call.
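
A sketch of the calculation, comparing a full model against a reduced one that drops the slopes under test (this follows the standard form of the partial F; check it against the Bulk Pack, p. 154):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 3))
    y = 1 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)

    full = sm.OLS(y, sm.add_constant(X)).fit()
    reduced = sm.OLS(y, sm.add_constant(X[:, [0]])).fit()   # drop the last two X's

    q = 2    # number of slopes being tested
    partial_F = ((reduced.ssr - full.ssr) / q) / (full.ssr / full.df_resid)
    print(partial_F)     # rule of thumb: < 1 means NO, > 4 means YES

statsmodels will also do this in one call with full.compare_f_test(reduced).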

* Be able to answer this question: "Why not do a whole bunch of t-tests rather than one partial F-test?" Answer: the partial F-test is an honest simultaneous test; running many separate t-tests inflates the chance that at least one looks significant by luck alone.

Examples

Car89.jmp (Bulk Pack p. 111).

Stocks.jmp (Bulk Pack p. 140).



Richard Waterman
Sun Aug 17 22:24:25 EDT 1997