MODULE 2: THE SIMPLE REGRESSION MODEL (SRM)



  • MODEL CHECKING [slides 2-11...2-23]

    1. Types of model checks:
      • Checks of assumptions: any of the assumptions in
        y | x ~ N(b0 + b1*x, s^2), independent
        can be violated:
        - the true mean m(x) might not be b0+b1*x:
        for example, m(x) might be a curve, not a straight line (figure 1 below);
        - the true SD s might not be constant:
        for example, the SD might depend on x, s = s(x);
        s(x) might increase as x increases (figure 2 below);
        - the error distribution might not be normal:
        for example, it might be skewed (figure 3 below) or have occasional outliers (figure 4 below).
        - the errors might not be independent:
        usually the case in time series data, where errors might be correlated when they are close in time.
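        The first violation above, curvature, can be made concrete with a small simulation. The sketch below (illustrative, not part of the course materials) generates data whose true mean is a curve, fits a straight line anyway, and shows that the residuals still carry a systematic pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data where the true mean is a curve, not a straight line
# (a deliberate violation of the SRM's linearity assumption).
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x**2 + rng.normal(0, 1, size=x.size)

# Least-squares straight-line fit: b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# If the model were correct, the residuals would be patternless noise;
# here they still track the quadratic term, signalling curvature.
pattern = np.corrcoef(resid, (x - x.mean())**2)[0, 1]
print(round(pattern, 2))
```

        A correlation near 1 between the residuals and the centered squared predictor is exactly the kind of pattern a residual-vs-x plot would reveal visually.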

      • Checks of sensitivity: leverage and influence
        It can occur that a single observation determines the slope. This happens when there is an observation (xi,yi) that is far out in x, that is, a horizontal outlier, as illustrated by the plots below. In the second plot the y-value of the leverage point is compatible with the majority of the data; in the third plot it isn't.

        In general we call an observation influential if leaving it out changes at least one of the following quantities substantially: b1, b0, R2, RMSE.
        (What is "substantial change"? We're being vague. Most of the time you know it when you see it.)
        A leverage point is typically influential: it moves the slope b1 and also affects R2 (driving it up when its y-value follows the trend of the majority, down when it doesn't).
        The best way to learn about influence and leverage is by playing with the leverage applet created by our colleague at the University of Chicago.
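        The same experiment can be run numerically: fit the line with and without a single horizontal outlier and compare the estimates. The sketch below uses made-up data (one leverage point whose y-value is not compatible with the majority) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 well-behaved points following y = 3 + 1.0*x, plus one horizontal
# outlier at x = 30 whose y-value is far below the line.
x = np.append(rng.uniform(0, 10, 20), 30.0)
y = np.append(3 + 1.0 * x[:20] + rng.normal(0, 1, 20), 3.0)

def fit(x, y):
    """Least-squares line; returns (b0, b1, R2)."""
    b1, b0 = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return b0, b1, r**2

b0_all, b1_all, r2_all = fit(x, y)            # with the leverage point
b0_out, b1_out, r2_out = fit(x[:-1], y[:-1])  # leaving it out

print(f"with point:    b1 = {b1_all:.2f}, R2 = {r2_all:.2f}")
print(f"without point: b1 = {b1_out:.2f}, R2 = {r2_out:.2f}")
```

        Leaving out the single leverage point changes b1 substantially, which is precisely the definition of an influential observation given above.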

    2. Tools for model checking:

      • scatterplots of the raw data (x1,y1), (x2,y2),..., (xn,yn)
      • residual plots:
        - scatterplot ei against xi
        - histogram and normal quantile plot of ei
        Recall that the residuals ei estimate the unobservable true errors; plotting residuals is therefore the closest we can get to plotting the errors.

      • Residuals in JMP:
        - Residuals ei can be saved:
        Fit Y by X > Fit Line or Fit Special > red diamond > Save Residuals
        This forms an additional column in the data spreadsheet.
        Create a normal quantile plot of the residuals:
        Analyze > Distribution > X:Residuals > red diamond > Normal Quantile Plot.
        - Plotting residuals ei versus xi:
        Fit Y by X > Fit Line or Fit Special > red diamond > Plot Residuals
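        The JMP steps above (save residuals, then make a normal quantile plot) can be mimicked numerically. The sketch below is an illustrative Python analog, not a JMP feature: it computes ei = yi - (b0 + b1*xi) and pairs the sorted residuals with standard-normal quantiles, which is what a normal quantile plot displays:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

# Simulated (x, y) data that actually satisfy the SRM, for illustration.
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + rng.normal(0, 1.5, 50)

# "Save Residuals": fit the line, then form ei = yi - (b0 + b1*xi).
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Normal quantile plot, numerically: sorted residuals vs standard-normal
# quantiles. Near-normal residuals fall close to a straight line.
n = resid.size
theo = np.array([NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])
qcorr = np.corrcoef(theo, np.sort(resid))[0, 1]
print(round(qcorr, 3))
```

        A correlation near 1 corresponds to the straight-line appearance of the normal quantile plot; skewness or heavy tails would bend the plot and lower this correlation.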

    3. Examples of model checking:

      Work through the details of slides 2-12...2-23. They have excellent examples.


  • SUMMARY:

    1. We developed the idea of a model as a data-generation process that allows us to forge new data. Good forgery requires good understanding of the data.

    2. It is essential that a model also mimic the randomness or unexplained variation ("error") in the data. Much of statistics models randomness with normal distributions.

    3. The Simple Regression Model (SRM) is yi = b0+b1*xi+ei, where ei ~ i.i.d. N(0, s^2). This can be readily simulated in JMP for assumed values of b0, b1, s. See the simulated diamond ring prices.
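    The same simulation can be sketched outside JMP. The values below (b0 = 0.2, b1 = 3.0, s = 0.3, with x playing the role of diamond weights in carats) are assumed for illustration, not taken from the course data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed true model numbers (hypothetical, in the spirit of the
# diamond-ring example): b0 = 0.2, b1 = 3.0, s = 0.3.
b0_true, b1_true, s_true = 0.2, 3.0, 0.3
x = rng.uniform(0.1, 0.5, 48)          # e.g. weights in carats
eps = rng.normal(0, s_true, x.size)    # ei ~ i.i.d. N(0, s^2)
y = b0_true + b1_true * x + eps        # the SRM data-generation process

# Least squares recovers estimates of b0, b1, and of s (via the RMSE).
b1_hat, b0_hat = np.polyfit(x, y, 1)
resid = y - (b0_hat + b1_hat * x)
rmse = np.sqrt(np.sum(resid**2) / (x.size - 2))

print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}, RMSE = {rmse:.2f}")
```

    Running the generation step repeatedly "forges" new datasets from the same model, which is exactly the data-generation view of point 1 above.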

    4. LS gives us the estimates b0, b1, RMSE, and the residuals ei, which estimate the true model quantities b0, b1, s, and the errors ei, respectively.

    5. Any of the assumptions of the SRM can be wrong:
      • curvature: m(x) is not b0+b1*x
      • heteroscedasticity: s is not constant
      • non-normality: the error distribution is skewed or has heavy tails
      • dependence: the errors are not independent; they may be correlated

    6. Model checks based on residual plots allow us to pin down these violations:
      • plot of residuals ei against xi (-> curvature, heteroscedasticity)
      • normal quantile plot of the residuals ei (-> non-normality)
      • plot of residuals against time (-> correlated errors close in time; only meaningful when there is a time order)

    7. Models are also problematic if they are too sensitive to individual observations:
      • influential points: removal affects parameter estimates strongly
      • outliers: far from the majority of the data, in x and/or y
      • leverage points: outliers in x
      • response outliers: outliers in y
      Remove outliers if they are expected to be atypical of future observations.