Stat 601, Fall 2000, Class 7
- Understand the uses of curve fitting
- Know the interpretation of the Least Squares fit
- Understand residuals
- Know the regression assumptions
- Classify unusual points
- R-squared
- RMSE
- Confidence and prediction intervals
- Introduction to multiple regression
- R-squared.
- The proportion of variability in Y explained by
the regression model.
- Answers the question ``how good is the fit''?
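As a concrete illustration (made-up numbers, not course data), a
minimal Python sketch of the R-squared computation:

    import numpy as np

    # Made-up illustrative data; any paired sample would do.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Least squares fit: yhat = b0 + b1*x
    # (polyfit returns the highest-degree coefficient first).
    b1, b0 = np.polyfit(x, y, 1)
    yhat = b0 + b1 * x

    # R-squared = 1 - (unexplained variation) / (total variation in Y).
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_resid = np.sum((y - yhat) ** 2)
    print(1 - ss_resid / ss_total)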
- Root mean squared error (RMSE).
- The spread of the points about the fitted model.
- Answers the question ``can you do good prediction''?
- Write the variance of the errors $\epsilon$ as $\sigma^2$; then
RMSE estimates $\sigma$.
- Only a meaningful measure with respect to the range of Y.
- A rule-of-thumb 95% prediction interval: the fitted line +/-
2 RMSE (only works within the range of the data).
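Continuing the same made-up numbers, a sketch of the RMSE
computation and the rule-of-thumb interval; the division by n - 2
reflects the two estimated parameters (slope and intercept):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)

    # RMSE estimates sigma; divide by n - 2 since two parameters
    # were estimated.
    n = len(y)
    rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))

    # Rule-of-thumb 95% prediction interval at x0 = 3
    # (trustworthy only within the range of the data).
    x0 = 3.0
    center = b0 + b1 * x0
    print(center - 2 * rmse, center + 2 * rmse)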
- The confidence interval for the slope.
- Answers the question ``is there any point in it all''?
- If the CI contains 0, then 0 is a feasible value for the
slope, i.e. the line may be flat; that is, X tells you nothing
about Y.
- The p-value associated with the slope tests the hypothesis
Slope $= 0$ vs. Slope $\neq 0$ (see the sketch below).
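A sketch of the interval and test, assuming the Python
statsmodels package and invented data; conf_int() reports the 95%
CI and pvalues the two-sided test of Slope = 0:

    import numpy as np
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 2.9, 4.2, 4.1, 5.8, 6.1])

    # add_constant supplies the intercept column.
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    print(fit.conf_int())   # rows: intercept, slope; 95% CI by default
    print(fit.pvalues)      # two-sided tests of coefficient = 0 vs != 0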
- Confidence interval: estimating an average, ``where is the
regression line''? The range of feasible values should reflect
uncertainty in the true regression line.
- Prediction interval: predicting a new observation, ``where's a
new point going to be''? The range of feasible values should
reflect uncertainty in the true regression line AND the
variability of the points about the line. (Both intervals are
computed in the sketch below.)
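A sketch of both intervals side by side, again assuming
statsmodels (mean_ci_* is the confidence interval for the line;
obs_ci_* is the wider prediction interval):

    import numpy as np
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 2.9, 4.2, 4.1, 5.8, 6.1])
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Intervals at x0 = 3.5: mean_ci_* brackets the regression line,
    # obs_ci_* brackets a new observation (always wider).
    x_new = np.column_stack([np.ones(1), [3.5]])
    frame = fit.get_prediction(x_new).summary_frame(alpha=0.05)
    print(frame[["mean_ci_lower", "mean_ci_upper",
                 "obs_ci_lower", "obs_ci_upper"]])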
- The outlier example (from pages 90 and 94):

    Measure      With outlier   Without outlier
    R-squared    0.78           0.075
    RMSE         3570           3634
    Slope        9.75           6.14
    SE(slope)    1.30           5.56
If someone comes to you with a great R-squared, it does not mean
they have a great model; maybe there is just a highly leveraged
point that the regression fits well (the simulation below
illustrates this).
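A small simulation (invented here; these are not the pages 90 and
94 data) showing how a single leveraged point can manufacture a
large R-squared out of pure noise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=30)
    y = rng.normal(size=30)   # no real relationship between x and y

    def r2(x, y):
        b1, b0 = np.polyfit(x, y, 1)
        resid = y - (b0 + b1 * x)
        return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

    print(r2(x, y))                                    # essentially 0
    print(r2(np.append(x, 20.0), np.append(y, 20.0)))  # one leveraged
                                                       # point inflates it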
- Making more realistic models with many X-variables - multiple
regression analysis.
- The fundamental differences between simple and multiple regression.
- The X-variables may be related to (correlated with) one another.
- Consequence: looking at one X-variable at a time may present
a misleading picture of the true relationship between Y and the
X-variables.
- The difference between marginal and partial
slopes. Marginal: the slope of the regression line for one
X-variable ignoring the impact of all the others. Partial: the
slope of the regression line for one X-variable taking into
account all the others. Recall the death penalty example (and
see the sketch below).
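A simulated sketch of the distinction (a made-up model, not the
death penalty data): Y falls with x2 once x1 is accounted for,
yet the marginal slope on x2 comes out positive because x2 is
correlated with x1.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.5, size=200)      # x2 correlated with x1
    y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=200)

    # Marginal: regress Y on x2 alone, ignoring x1.
    marginal = sm.OLS(y, sm.add_constant(x2)).fit()
    # Partial: regress Y on both; x2's slope now controls for x1.
    partial = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    print(marginal.params[1])   # positive: the misleading marginal view
    print(partial.params[2])    # near -1: x2's slope after accounting for x1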
- Key graphics for multiple regression.
- The leverage plot. A ``partial tool'': the analog of the
scatterplot for simple
regression. It lets you look at a large multiple regression one
variable at a time, in a legitimate way (controls for other
X-variables). Potential uses:
- Spot leveraged points.
- Identify large residuals.
- Diagnose systematic lack of fit, i.e. spot curvature which may
suggest transformations.
- Identify heteroscedasticity.
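A sketch of a standard analog, the added-variable (partial
regression) plot in the Python statsmodels package, which
likewise shows each X-variable against Y controlling for the
others (invented data):

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=100)
    x2 = rng.normal(size=100)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=100)

    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    # One panel per X-variable, each controlling for the others; look
    # for leveraged points, big residuals, curvature, and uneven spread.
    sm.graphics.plot_partregress_grid(fit)
    plt.show()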
- The scatterplot matrix. A ``marginal tool'': presents all the
two-variable (bivariate) relationships. Potential uses:
- Identify collinearity (correlation) between X-variables.
- Identify marginal non-linear relationships between Y and X-variables.
- Determine which X-variables are marginally most
significant (thin ellipses).
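A sketch using pandas' scatter_matrix on invented data; any
scatterplot-matrix tool would serve:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    x1 = rng.normal(size=100)
    x2 = 0.8 * x1 + rng.normal(scale=0.6, size=100)   # collinear with x1
    y = x1 + x2 + rng.normal(size=100)

    # Every pairwise relationship at once: a thin ellipse in an X-X
    # panel flags collinearity; a thin ellipse in a Y-X panel flags a
    # strong marginal relationship.
    pd.plotting.scatter_matrix(pd.DataFrame({"y": y, "x1": x1, "x2": x2}))
    plt.show()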
- Facts to know.
- R-squared never decreases (in practice it always increases) as you
add variables to the model.
- RMSE does not have to decrease as variables are added to the model.
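Both facts can be checked by simulation (invented data): adding a
junk variable cannot lower R-squared, while RMSE, whose
denominator loses a degree of freedom, can move either way.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 40
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    junk = rng.normal(size=n)   # a variable unrelated to y

    small = sm.OLS(y, sm.add_constant(x)).fit()
    big = sm.OLS(y, sm.add_constant(np.column_stack([x, junk]))).fit()

    print(small.rsquared, big.rsquared)   # never goes down
    # RMSE = sqrt(SSE / (n - p)); the df penalty means it may go up.
    print(np.sqrt(small.mse_resid), np.sqrt(big.mse_resid))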
- Model building philosophy in this course.
- Keep it as simple as possible (parsimony).
- Make sure everything is interpretable (especially any
transformations).
- After having met the above criteria, go for the biggest R-squared,
the smallest RMSE, and the model that makes the most sense (signs on
regression slopes).
2000-11-09