What you need to have learnt from Class 1.

*
The cardinal rule of data analysis: always, always plot your data.
*
Fitting lines and curves to data.
*
Thinking about a model to summarize data and exploiting the fit.
*
Our model: E(Y|X) = beta0 + beta1 X.
*
Interpretation of the coefficients in the model; in particular, the slope in a ln(x)-transformed model can be interpreted as a percentage change (see p.24 of the bulkpack).
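As a quick numeric check of the percentage-change reading (the coefficients here are made up for illustration, not the bulkpack example): in a model Y = b0 + b1 ln(x), a 1% increase in x changes the fitted Y by about b1/100.

```python
import math

# Hypothetical fitted ln(x) model (illustrative numbers):
#   Y-hat = b0 + b1 * ln(x)
b0, b1 = 2.0, 5.0

def yhat(x):
    return b0 + b1 * math.log(x)

x = 50.0
change = yhat(x * 1.01) - yhat(x)   # effect of a 1% increase in x
approx = b1 / 100.0                 # the "slope over 100" rule of thumb
print(change, approx)               # both are about 0.05
```

The agreement holds because yhat(1.01 x) - yhat(x) = b1 ln(1.01), and ln(1.01) is approximately 0.01.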
*
The ``best line'', the one that minimizes the sum of squared vertical distances, is the Least Squares Line.
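The least squares line can be computed directly from the standard closed-form formulas; a minimal sketch, with made-up (x, y) data:

```python
# Least squares line for made-up data:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1 * xbar
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
# This (b0, b1) minimizes the sum of squared vertical distances (residuals).
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```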

New material for Class 2.

*
Regression assumptions and how to check for them.
*
So far we have thought of our model as the rule that relates the average value of Y to X, that is, E(Y|X) = beta0 + beta1 X.
*
The process that generates the data, the actual Y-values themselves, is often modeled as

    Y = beta0 + beta1 X + epsilon

*
That is, we think of the data as coming from a two-part process, a systematic part plus a random part (signal plus noise). The noise part is sometimes due to measurement error, and at other times interpreted as the effect of all the important variables that should have been in the model but were left out.
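This two-part process is easy to simulate; a minimal sketch, with made-up values for the betas and the noise standard deviation:

```python
import random

random.seed(0)

# Signal-plus-noise data generation (illustrative parameter values):
beta0, beta1, sigma = 1.0, 2.0, 0.5

def generate(n):
    data = []
    for i in range(n):
        x = i / 10.0
        signal = beta0 + beta1 * x          # systematic part
        noise = random.gauss(0.0, sigma)    # random part: mean 0, sd sigma
        data.append((x, signal + noise))
    return data

sample = generate(100)
```

Plotting such a sample against the true line beta0 + beta1 x makes the signal-versus-noise decomposition visible, which is exactly the plot-your-data rule from Class 1.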
*
The regression assumptions:
*
The errors epsilon_i are independent.
*
The errors epsilon_i have mean zero and constant variance: E(epsilon_i) = 0 and Var(epsilon_i) = sigma^2 for all i.
*
The errors epsilon_i are approximately normally distributed.

*
Consequences of violation of the assumptions:
*
If there is positive autocorrelation, then we are over-optimistic about the information content of the data: we think there is less noise than there really is, so confidence intervals come out too narrow.
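A small simulation (my own illustration, not from the class notes) makes the over-optimism concrete: with positively autocorrelated AR(1) errors, the true variance of the sample mean is several times larger than the sigma^2/n that the independence assumption would report.

```python
import random

random.seed(1)

# AR(1) errors with rho = 0.6, scaled so each error has marginal sd 1,
# versus the naive "independent errors" variance sigma^2 / n.
rho, n, reps = 0.6, 50, 2000
innov_sd = (1 - rho ** 2) ** 0.5   # keeps the marginal variance at 1

means = []
for _ in range(reps):
    e = random.gauss(0.0, 1.0)
    total = e
    for _ in range(n - 1):
        e = rho * e + random.gauss(0.0, innov_sd)   # positive autocorrelation
        total += e
    means.append(total / n)

actual_var = sum(m * m for m in means) / reps   # Monte Carlo variance of the mean
naive_var = 1.0 / n                             # what independence would claim
print(actual_var, naive_var)  # actual is several times larger than naive
```

Confidence intervals built from the naive variance would be far too narrow, which is the point of the bullet above.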
*
Non-constant variance:
*
Incorrectly quantify the true uncertainty.
*
Prediction intervals are inaccurate.
*
Least squares is inefficient: if you understood the structure of Var(epsilon_i) better, you could get better estimates of beta0 and beta1.
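The efficiency point can be sketched in a simulation (all numbers made up): when the error standard deviation grows with x, weighting each point by 1/Var(epsilon_i), i.e. weighted least squares with the correct weights, gives a visibly less variable slope estimate than ordinary least squares.

```python
import random

random.seed(2)

# Non-constant variance: sd of epsilon_i grows with x_i (illustrative setup).
beta0, beta1, n, reps = 0.0, 1.0, 40, 1000
xs = [1.0 + i * 0.25 for i in range(n)]

def slope(ws, ys):
    # weighted least squares slope with weights ws (ws = 1 gives ordinary LS)
    sw = sum(ws)
    xb = sum(w * x for w, x in zip(ws, xs)) / sw
    yb = sum(w * y for w, y in zip(ws, ys)) / sw
    num = sum(w * (x - xb) * (y - yb) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - xb) ** 2 for w, x in zip(ws, xs))
    return num / den

ols, wls = [], []
for _ in range(reps):
    ys = [beta0 + beta1 * x + random.gauss(0.0, x) for x in xs]  # sd = x
    ols.append(slope([1.0] * n, ys))
    wls.append(slope([1.0 / x ** 2 for x in xs], ys))  # weight = 1 / variance
var_ols = sum((b - beta1) ** 2 for b in ols) / reps
var_wls = sum((b - beta1) ** 2 for b in wls) / reps
print(var_ols, var_wls)  # the weighted estimate is noticeably less variable
```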

*
If the epsilon_i are symmetric, then the normality assumption is not a big deal. If the epsilon_i are really skewed and you only have a small amount of data, then it is all up the creek.

*
Since the assumptions are on the error terms, the epsilon_i, we have to check them by looking at the ``estimated errors'', the residuals.
*
Note that epsilon_i is the distance from the point to the true line.
*
But the residual is the distance from the point to the estimated line.
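The distinction shows up in a short simulation (all numbers made up): the residuals estimate the true errors but are not equal to them, and, by construction of the least squares fit, the residuals sum to essentially zero even though the true errors need not.

```python
import random

random.seed(3)

# True errors epsilon_i (distance to the TRUE line) versus residuals
# (distance to the ESTIMATED line). Illustrative parameter values.
beta0, beta1, sigma, n = 1.0, 2.0, 1.0, 30
xs = [i / 3.0 for i in range(n)]
eps = [random.gauss(0.0, sigma) for _ in range(n)]
ys = [beta0 + beta1 * x + e for x, e in zip(xs, eps)]

# least squares fit
xb, yb = sum(xs) / n, sum(ys) / n
b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) / \
     sum((x - xb) ** 2 for x in xs)
b0 = yb - b1 * xb

resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(sum(resid))   # essentially zero, a property of least squares residuals
```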

*
Identifying unusual points; residuals, leverage and influence.
*
Points extreme in the X-direction are called points of high leverage.
*
Points extreme in the Y-direction are points with BIG residuals.
*
Points to watch out for are those that are high leverage and with a BIG residual. These are called influential points. Deleting them from the regression can change everything.
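A sketch with made-up data, using the standard simple-regression leverage formula h_i = 1/n + (x_i - xbar)^2 / Sxx: the last point is extreme in the X-direction (high leverage) and sits far from the line the other points suggest, so deleting it changes the slope dramatically, the signature of an influential point.

```python
# Illustrative data: the last point is extreme in x AND off the pattern.
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [1.0, 2.0, 3.0, 4.0, 1.0]

def fit(xv, yv):
    n = len(xv)
    xb, yb = sum(xv) / n, sum(yv) / n
    b1 = sum((x - xb) * (y - yb) for x, y in zip(xv, yv)) / \
         sum((x - xb) ** 2 for x in xv)
    return yb - b1 * xb, b1

def leverage(xv):
    n = len(xv)
    xb = sum(xv) / n
    sxx = sum((x - xb) ** 2 for x in xv)
    return [1.0 / n + (x - xb) ** 2 / sxx for x in xv]

b0_all, b1_all = fit(xs, ys)
b0_del, b1_del = fit(xs[:-1], ys[:-1])   # refit with the suspect point deleted
h = leverage(xs)
print(h[-1], b1_all, b1_del)  # huge leverage; the slope changes dramatically
```

Here the last point's leverage is 0.92 (out of a maximum of 1), and deleting it moves the fitted slope from slightly negative to exactly 1, which is what ``deleting them from the regression can change everything'' means in practice.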



Richard Waterman
Mon Sep 9 23:32:59 EDT 1996