Class 3
What you need to have learnt from Class 2.
- The second rule of data analysis: always check the residuals.
- The regression assumptions.
- Consequences of assumption violations.
- Diagnosing violations through residual plots.
- Categorizing unusual points: leverage, residuals, and influence.
- Impact of unusual points on the regression.
- How important is a point? Remove it and see how your decision changes.
New material for Class 3.
- Understanding almost all the regression output.
  - R-squared.
    - The proportion of variability in Y explained by the regression model.
    - Answers the question "how good is the fit?"
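
As a concrete sketch (the data below are made up for illustration, not taken from the class example), R-squared can be computed directly from two sums of squares: the total variability in Y, and the variability left over in the residuals.

```python
# Minimal sketch of R-squared for simple regression (illustrative data).

def fit_line(x, y):
    """Least-squares slope and intercept for simple regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, ybar - slope * xbar

def r_squared(x, y):
    slope, intercept = fit_line(x, y)
    ybar = sum(y) / len(y)
    ss_total = sum((yi - ybar) ** 2 for yi in y)   # total variability in Y
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2
                   for xi, yi in zip(x, y))        # unexplained variability
    return 1 - ss_resid / ss_total                 # proportion explained

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(r_squared(x, y), 3))  # a nearly straight-line cloud: close to 1
```
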
  - Root mean squared error (RMSE).
    - The spread of the points about the fitted model.
    - Answers the question "can you do good prediction?"
    - Write the variance of the points about the line as σ²; then RMSE estimates σ.
    - Only a meaningful measure with respect to the range of Y.
    - A rule-of-thumb 95% prediction interval: the fitted line +/- 2 RMSE (only works within the range of the data).
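
A sketch of RMSE and the +/- 2 RMSE rule of thumb, on a toy dataset made up for illustration:

```python
import math

def rmse_fit(x, y):
    """Fit the least-squares line and return (RMSE, slope, intercept).
    RMSE = sqrt(SSE / (n - 2)); the n - 2 reflects the two estimated
    coefficients (intercept and slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2)), slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
s, slope, intercept = rmse_fit(x, y)

# Rule-of-thumb 95% prediction interval at x0 = 3 (inside the data range):
x0 = 3
fit = intercept + slope * x0
low, high = fit - 2 * s, fit + 2 * s
print(round(low, 2), round(high, 2))
```
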
  - Confidence interval for the slope.
    - Answers the question "is there any point in it all?"
    - If the CI contains 0, then 0 is a feasible value for the slope, i.e. the line may be flat, so X tells you nothing about Y.
    - The p-value associated with the slope tests the hypothesis Slope = 0 vs. Slope ≠ 0.
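
A sketch of the slope inference on made-up data. Note the shortcut: a rough 95% CI of slope +/- 2 standard errors is used here, where real regression output uses a t quantile.

```python
import math

def slope_inference(x, y):
    """Slope, SE(slope), a rough 95% CI, and the t-statistic for
    testing Slope = 0 vs. Slope != 0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))
    se = rmse / math.sqrt(sxx)                     # SE(slope)
    return slope, se, (slope - 2 * se, slope + 2 * se), slope / se

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
slope, se, ci, t = slope_inference(x, y)
print(ci[0] > 0)  # CI excludes 0: X really does tell us about Y
```
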
- Two types of prediction (concentrate on the second).
  - Estimating an average: "where is the regression line?" The range of feasible values should reflect uncertainty in the true regression line.
  - Predicting a new observation: "where is a new point going to be?" The range of feasible values should reflect uncertainty in the true regression line AND the variability of the points about the line.
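
The two standard errors can be compared directly on a toy dataset (made up for illustration). The interval for the average response reflects only uncertainty about the line; the interval for a new observation adds the scatter of points about the line, via the extra "1 +" term.

```python
import math

# Fit a least-squares line to illustrative data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
rmse = math.sqrt(sse / (n - 2))

x0 = 4
# "Where is the regression line?" -- uncertainty about the line only:
se_mean = rmse * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
# "Where is a new point going to be?" -- line uncertainty PLUS scatter:
se_pred = rmse * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
print(se_mean < se_pred)  # the prediction interval is always the wider one
```
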
- What a difference a leveraged point can make. From pages 96 and 100:

| Measure    | With outlier | Without outlier |
|------------|--------------|-----------------|
| R-squared  | 0.78         | 0.075           |
| RMSE       | 3570         | 3634            |
| Slope      | 9.75         | 6.14            |
| SE(slope)  | 1.30         | 5.56            |
- If someone comes to you with a great R-squared, it does not mean they have a great model; maybe there is just a highly leveraged point that is well fit by the regression line.
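
The same effect shown in the table can be reproduced with any small made-up dataset: a cloud with essentially no X-Y relationship, plus one far-out point that the fitted line chases (the numbers here are illustrative, not the book's pages 96 and 100 example).

```python
def r_squared(x, y):
    """R-squared for a simple least-squares regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_resid / ss_total

# A cloud with essentially no X-Y relationship...
x, y = [1, 2, 3, 4, 5], [3, 1, 4, 1, 5]
# ...plus one highly leveraged point that the line will chase.
x_out, y_out = x + [50], y + [60]

print(round(r_squared(x, y), 2))          # low: X explains little
print(round(r_squared(x_out, y_out), 2))  # high: driven by one point
```
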
Richard Waterman
Wed Sep 11 23:19:07 EDT 1996