Next: 3. Up: Stat701 Previous: 1.

2. The regression set-up

The model for the mean relationship:


\begin{displaymath}Av(Y\vert x) = \beta_0 + \beta_1 x.\end{displaymath}

The model for the raw data:


\begin{displaymath}Y_i = \beta_0 + \beta_1 x_i + \epsilon_i.\end{displaymath}

This is the straight line or linear model.

Assumptions are on the $\epsilon_i$.

* Independent.
* Constant variance. Mean zero.
* Approximately normally distributed.

Biggest problems: dependence, skewness and non-constant variance.

* Call the $\epsilon_i$ the "true error terms".
* Each is the vertical distance from a point to the true line: $y_i - (\beta_0 + \beta_1 x_i)$.
* We don't know them, as we don't know the true regression line.
* Substitute the "residuals", the estimated error terms.
* Each is the vertical distance from a point to the estimated regression line: $y_i - (\hat\beta_0 + \hat\beta_1 x_i) = y_i - \hat y_i$.
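The difference between true errors and residuals can be sketched with a small simulation. This is only an illustration: the values of `beta0`, `beta1` and `sigma`, the choice of x values, and the helper name `fit_line` are assumptions made for the example, not part of the notes. Only in a simulation do we get to see the true errors at all.

```python
import random

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the straight-line model."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

random.seed(0)
beta0, beta1, sigma = 1.0, 2.0, 0.5      # assumed "true" values, for illustration
x = [i / 2 for i in range(20)]
eps = [random.gauss(0, sigma) for _ in x]            # true error terms
y = [beta0 + beta1 * xi + e for xi, e in zip(x, eps)]

b0, b1 = fit_line(x, y)                              # estimated line
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Compare the (normally unknown) true errors with the residuals.
for e, r in list(zip(eps, residuals))[:5]:
    print(f"true error {e:+.3f}   residual {r:+.3f}")
```

The residuals track the true errors closely but are not identical to them, because the estimated line differs from the true line.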

ALWAYS check assumptions on the residuals.

Why so important?

* Standard least squares regression is sensitive to individual data points.
* A single point can dominate the regression.
* Everything you say and conclude may be driven by a single data point.
* Residual plots are one of the tools available to help identify these points.
* Even if you keep such a point, it is important to know that it is there.
* Inference (p-values, CIs, etc.) is only valid if the assumptions hold.
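How a single point can dominate the fit can be sketched with made-up data. The ten points below lie exactly on a line with slope 0.5; the extra point at x = 30 and the helper name `fit_line` are assumptions chosen for illustration only.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the straight-line model."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Ten made-up points lying exactly on y = 2 + 0.5 x.
x = list(range(1, 11))
y = [2 + 0.5 * xi for xi in x]
b0_clean, b1_clean = fit_line(x, y)

# Add one point far from the rest in x and well below the line.
x_all = x + [30]
y_all = y + [0.0]
b0_all, b1_all = fit_line(x_all, y_all)

print(f"slope without the extra point: {b1_clean:.3f}")
print(f"slope with the extra point:    {b1_all:.3f}")
```

With these numbers the one added point is enough to turn a positive slope into a negative one, so any conclusion about the direction of the relationship would rest on that point alone.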

Key diagnostics.

1. Residual plot. Good plots lack structure.

2. Normal scores plot of the residuals.
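The coordinates of a normal scores plot can be computed as in the sketch below. The residual values are made up, the function names are hypothetical, and the plotting positions $(i - 0.375)/(n + 0.25)$ are one common convention (Blom's) assumed here; statistical packages may use slightly different positions.

```python
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected standard normal order statistics (Blom positions)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def normal_scores_pairs(residuals):
    """Pairs (normal score, sorted residual): the points of a normal scores plot."""
    ordered = sorted(residuals)
    return list(zip(normal_scores(len(ordered)), ordered))

# Made-up residuals, for illustration only.
residuals = [-0.8, 0.3, -0.1, 1.2, 0.0, -0.5, 0.6, -0.7]
pairs = normal_scores_pairs(residuals)

for score, res in pairs:
    print(f"normal score {score:+.3f}   residual {res:+.3f}")
```

If the residuals are approximately normal, these pairs fall close to a straight line when plotted; curvature suggests skewness or heavy tails.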


Richard Waterman
1999-09-13