Next: 3. Up: Stat701 Previous: 1.

2. The regression set-up

The model for the mean relationship:

$$\mathrm{Av}(Y \mid x) = \beta_0 + \beta_1 x.$$

The model for the raw data:

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i.$$

This is the straight line or linear model.
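To make the two models concrete, here is a minimal sketch that simulates raw data from the straight line model. The parameter values, the seed and the choice of normally distributed errors with constant spread are illustrative assumptions, not part of the notes.

```python
import random

random.seed(701)  # arbitrary seed so the sketch is reproducible

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = [float(i) for i in range(1, 21)]

# Model for the mean relationship: Av(Y | x) = beta0 + beta1 * x
mean_line = [beta0 + beta1 * xi for xi in x]

# Error terms: assumed independent, normal, constant variance.
eps = [random.gauss(0.0, sigma) for _ in x]

# Model for the raw data: Y_i = beta0 + beta1 * x_i + eps_i
y = [m + e for m, e in zip(mean_line, eps)]
```

Each simulated $y_i$ is the point on the line plus its own error term, which is exactly the decomposition in the second displayed equation.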

The assumptions are on the $\epsilon_i$.

Biggest problems: dependence, skewness and non-constant variance.

: Call the $\epsilon_i$ the "true error terms".
: Each is the distance from a point to the true line: $y_i - (\beta_0 + \beta_1 x_i)$.
: We don't know them because we don't know the true regression line.
: Substitute the "residuals", the estimated error terms.
: Each is the distance from a point to the estimated regression line: $y_i - (\hat\beta_0 + \hat\beta_1 x_i) = y_i - \hat y_i$.
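The residuals can be computed directly from their definition. The sketch below fits the line by ordinary least squares and then forms $y_i - \hat y_i$; the data values and the helper names `fit_line` and `residuals` are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residuals(x, y, b0, b1):
    """Estimated error terms: y_i minus the fitted value b0 + b1 * x_i."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
res = residuals(x, y, b0, b1)
```

When the model includes an intercept, the least squares residuals sum to zero (up to rounding), so it is their pattern, not their average, that carries the diagnostic information.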

ALWAYS check assumptions on the residuals.

Why so important?

: Standard least squares regression is sensitive to individual data points.
: A single point can dominate the regression.
: Everything you say and conclude may be driven by a single data point.
: Residual plots are one of the tools available to help identify these points.
: Even if you keep such a point, it is important to know that it is there.
: Inference (p-values, CIs, etc.) is only valid if the assumptions hold.
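A small numerical sketch of how one point can dominate: fit the line with and without a single point that lies far from the rest. The data are invented for illustration, and `fit_line` is a hypothetical helper, not a library function.

```python
from statistics import mean

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Five made-up points with essentially no trend...
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.2, 0.9, 1.1, 1.0]

# ...plus one extra point far from the others in both x and y.
x_all = x + [20.0]
y_all = y + [10.0]

_, slope_without = fit_line(x, y)
_, slope_with = fit_line(x_all, y_all)

print(f"slope without the extra point: {slope_without:.3f}")
print(f"slope with the extra point:    {slope_with:.3f}")
```

Here the five original points give a slope close to zero, while adding the one distant point produces a clearly positive slope, so any conclusion about a trend would rest on that single observation.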

Key diagnostics.

1. Residual plot. Good plots lack structure.

2. Normal scores plot of the residuals.
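Both diagnostics reduce to pairs of coordinates that can be handed to any plotting tool. The sketch below computes them with the standard library only. The data are made up, `fit_line` and `normal_scores` are hypothetical helpers, and the plotting positions $(i - 0.375)/(n + 0.25)$ are one common approximation (Blom's); statistical packages may use slightly different ones.

```python
from statistics import NormalDist, mean

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def normal_scores(n):
    """Approximate expected standard normal order statistics (assumed: Blom's positions)."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
res = [yi - fi for yi, fi in zip(y, fitted)]

# 1. Residual plot: residuals against fitted values; look for lack of structure.
residual_plot = list(zip(fitted, res))

# 2. Normal scores plot: sorted residuals against normal scores;
#    points close to a straight line are consistent with normal errors.
scores = normal_scores(len(res))
qq = list(zip(scores, sorted(res)))

for score, r in qq:
    print(f"{score:6.2f}  {r:6.2f}")
```

Plotting `residual_plot` and `qq` as scatter plots gives the two displays listed above.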


Richard Waterman
1999-09-13