MODULE 2: THE SIMPLE REGRESSION MODEL (SRM)



  • MODEL CHECKING [slides 2-11...2-23]

    1. Types of model checks:
      • Checks of assumptions: any of the assumptions in
        y | x ~ N(b0 + b1*x, s^2), independent
        can be violated:
        - the true mean m(x) might not be b0+b1*x:
        for example, m(x) might be a curve, not a straight line (figure 1 below);
        - the true SD s might not be constant:
        for example, the SD might depend on x, s = s(x);
        s(x) might increase as x increases (figure 2 below);
        - the error distribution might not be normal:
        for example, it might be skewed (figure 3 below) or have occasional outliers (figure 4 below).
        - the errors might not be independent:
        usually the case in time series data, where errors might be correlated when they are close in time.
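        The first violation above, curvature, can be made concrete with a small simulation. The sketch below (illustrative, not part of the course materials) generates data whose true mean is a curve, fits a straight line anyway, and shows that the residuals still carry a systematic pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data where the true mean is a curve, not a straight line
# (a deliberate violation of the SRM's linearity assumption).
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x**2 + rng.normal(0, 1, size=x.size)

# Least-squares straight-line fit: b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# If the model were correct, the residuals would be patternless noise;
# here they still track the quadratic term, signalling curvature.
pattern = np.corrcoef(resid, (x - x.mean())**2)[0, 1]
print(round(pattern, 2))
```

        A correlation near 1 between the residuals and the centered squared predictor is exactly the kind of pattern a residual-vs-x plot would reveal visually.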

      • Checks of sensitivity: leverage and influence
        It can occur that a single observation determines the slope. This happens when there is an observation (xi,yi) that is far out in x, that is, a horizontal outlier, as illustrated by the plots below. In the second plot the y-value of the leverage point is compatible with the majority of the data; in the third plot it isn't.

        In general we call an observation influential if leaving it out changes at least one of the following quantities substantially: b1, b0, R2, RMSE.
        (What is "substantial change"? We're being vague. Most of the time you know it when you see it.)
        A leverage point is typically influential: it moves the slope b1 and also affects R2 (driving it up when its y-value follows the trend of the majority, down when it doesn't).
        The best way to learn about influence and leverage is by playing with the leverage applet created by our colleague at the University of Chicago.
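        The same experiment can be run numerically: fit the line with and without a single horizontal outlier and compare the estimates. The sketch below uses made-up data (one leverage point whose y-value is not compatible with the majority) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 well-behaved points following y = 3 + 1.0*x, plus one horizontal
# outlier at x = 30 whose y-value is far below the line.
x = np.append(rng.uniform(0, 10, 20), 30.0)
y = np.append(3 + 1.0 * x[:20] + rng.normal(0, 1, 20), 3.0)

def fit(x, y):
    """Least-squares line; returns (b0, b1, R2)."""
    b1, b0 = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return b0, b1, r**2

b0_all, b1_all, r2_all = fit(x, y)            # with the leverage point
b0_out, b1_out, r2_out = fit(x[:-1], y[:-1])  # leaving it out

print(f"with point:    b1 = {b1_all:.2f}, R2 = {r2_all:.2f}")
print(f"without point: b1 = {b1_out:.2f}, R2 = {r2_out:.2f}")
```

        Leaving out the single leverage point changes b1 substantially, which is precisely the definition of an influential observation given above.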

    2. Tools for model checking:

      • scatterplots of the raw data (x1,y1), (x2,y2),..., (xn,yn)
      • residual plots:
        - scatterplot ei against xi
        - histogram and normal quantile plot of ei
        Recall that the residuals ei estimate the unobservable true errors; plotting residuals is therefore the closest we can get to plotting the errors.

      • Residuals in JMP:
        - Residuals ei can be saved:
        Fit Y by X > Fit Line or Fit Special > red diamond > Save Residuals
        This forms an additional column in the data spreadsheet.
        Create a normal quantile plot of the residuals:
        Analyze > Distribution > X:Residuals > red diamond > Normal Quantile Plot.
        - Plotting residuals ei versus xi:
        Fit Y by X > Fit Line or Fit Special > red diamond > Plot Residuals
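        The JMP steps above (save residuals, then make a normal quantile plot) can be mimicked numerically. The sketch below is an illustrative Python analog, not a JMP feature: it computes ei = yi - (b0 + b1*xi) and pairs the sorted residuals with standard-normal quantiles, which is what a normal quantile plot displays:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

# Simulated (x, y) data that actually satisfy the SRM, for illustration.
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + rng.normal(0, 1.5, 50)

# "Save Residuals": fit the line, then form ei = yi - (b0 + b1*xi).
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Normal quantile plot, numerically: sorted residuals vs standard-normal
# quantiles. Near-normal residuals fall close to a straight line.
n = resid.size
theo = np.array([NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)])
qcorr = np.corrcoef(theo, np.sort(resid))[0, 1]
print(round(qcorr, 3))
```

        A correlation near 1 corresponds to the straight-line appearance of the normal quantile plot; skewness or heavy tails would bend the plot and lower this correlation.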

    3. Examples of model checking:

      Work through the details of slides 2-12...2-23. They have excellent examples.


  • SUMMARY:

    1. We developed the idea of a model as a data-generation process that allows us to forge new data. Good forgery requires good understanding of the data.

    2. It is essential that a model also mimic the randomness or unexplained variation ("error") in the data. Much of statistics models randomness with normal distributions.

    3. The Simple Regression Model (SRM) is yi = b0+b1*xi+ei, where ei ~ i.i.d. N(0, s^2). This can be readily simulated in JMP for assumed values of b0, b1, s. See the simulated diamond ring prices.
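    The same simulation can be sketched outside JMP. The values below (b0 = 0.2, b1 = 3.0, s = 0.3, with x playing the role of diamond weights in carats) are assumed for illustration, not taken from the course data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed true model numbers (hypothetical, in the spirit of the
# diamond-ring example): b0 = 0.2, b1 = 3.0, s = 0.3.
b0_true, b1_true, s_true = 0.2, 3.0, 0.3
x = rng.uniform(0.1, 0.5, 48)          # e.g. weights in carats
eps = rng.normal(0, s_true, x.size)    # ei ~ i.i.d. N(0, s^2)
y = b0_true + b1_true * x + eps        # the SRM data-generation process

# Least squares recovers estimates of b0, b1, and of s (via the RMSE).
b1_hat, b0_hat = np.polyfit(x, y, 1)
resid = y - (b0_hat + b1_hat * x)
rmse = np.sqrt(np.sum(resid**2) / (x.size - 2))

print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}, RMSE = {rmse:.2f}")
```

    Running the generation step repeatedly "forges" new datasets from the same model, which is exactly the data-generation view of point 1 above.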

    4. LS gives us the estimates b0, b1, RMSE, and the residuals ei, which estimate the true model quantities b0, b1, s, and the errors ei, respectively.

    5. Any of the assumptions of the SRM can be wrong:
      • curvature: m(x) is not b0+b1*x
      • heteroscedasticity: s is not constant
      • non-normality: the error distribution is skewed or has heavy tails
      • dependence: the errors are not independent; they may be correlated

    6. Model checks based on residual plots allow us to pin down these violations:
      • plot of residuals ei against xi (-> curvature, heteroscedasticity)
      • normal quantile plot of the residuals ei (-> non-normality)
      • plot of residuals against time (-> correlated errors close in time; only meaningful when there is a time order)

    7. Models are also problematic if they are too sensitive to individual observations:
      • influential points: removal affects parameter estimates strongly
      • outliers: far from the majority of the data, in x and/or y
      • leverage points: outliers in x
      • response outliers: outliers in y
      Remove outliers if they are expected to be atypical of future observations.