Notes, STAT 621, MBA Program, Wharton

MODULE 3: INFERENCE ABOUT THE SRM

QUESTIONS OF STATISTICAL INFERENCE: SIGNIFICANCE, PRECISION, PREDICTION
1. Q: Does x matter to predict or explain y?
  - Diamond ring prices: No question, weight is the major factor that drives prices.
  - Wine displays: Sure, feet of display space seems to have a large influence on sales.
  - Used cars: Of course, age is the driving factor of price.
  One can perform statistical tests to see whether x matters, but in these examples we know the answers without tests. Tests are needed when there is doubt because the sample size n is small. In simple regression with just one predictor, even a small sample size of n=20 suffices to achieve significance in most meaningful data. Still, in this module we'll go through the motions of statistical testing to learn the tool.
  (Testing becomes a lot more important in multiple regression with more than one predictor variable. It will help us weed out unimportant variables.)
2. Q: How precise are our estimates?
  A: Confidence intervals (CIs)
  => We want CIs for b₁, b₀, and mean response values.
  Example: Diamond ring prices.
  We will see that the average price $3,720 of an additional carat will have the 95% CI = (3720-160, 3720+160) = (3560, 3880). More realistically, the CI for an additional 1/10 carat is (356, 388).
  This seems like a useful thing to know, for example, during sales negotiations: If a buyer proposes 1) to buy a larger ring with 2/10 more carat instead of a smaller one, and 2) to pay 400 more and throw in an heirloom ring of his that we estimate to be worth 220, do we go for the deal? Maybe not: Our CI for an additional 2/10 carats is (712, 776), so 400+220=620 falls short of the range where we think the average price of an additional 2/10 carats should be.
  (We'll get to know other ways of answering such questions with prediction intervals that might be more satisfactory.)
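  The negotiation arithmetic above can be sketched in a few lines. This is only an illustration: the slope 3720 and half-width 160 are the values quoted in the notes, while the function name `scaled_ci` is made up for this example.

```python
# Values quoted in the notes: slope estimate and half-width of its 95% CI,
# both in dollars per additional carat.
slope = 3720
half_width = 160

def scaled_ci(tenths_of_carat):
    """95% CI for the average price of the given number of 1/10 carats."""
    lower = (slope - half_width) * tenths_of_carat / 10
    upper = (slope + half_width) * tenths_of_carat / 10
    return lower, upper

lo, hi = scaled_ci(2)          # 2/10 carat more
offer = 400 + 220              # cash plus estimated value of the heirloom ring
print(f"CI for 2/10 carat: ({lo:.0f}, {hi:.0f}); offer = {offer}")
print("offer falls below the CI" if offer < lo else "offer is inside or above the CI")
```

  Scaling both endpoints by the same factor works because the CI is for the slope, and the average price of 2/10 carat is just 0.2 times the slope.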
3. Q: In what range should we expect future observations to fall?
  A: Prediction intervals (PIs)
  Example: In what range should we expect prices to fall for 0.4 carat diamonds? We will see that (1156, 1301) is a reasonable guess for a 95% prediction interval.
INFERENCE ABOUT SLOPE AND INTERCEPT
1. We approach things from a practical point of view and give underpinnings later. We argue by analogy to inference for means: like means, slopes and intercepts have standard errors, CIs, and statistical tests, and they work very much the same way. We elaborate for the slope, the parameter of main interest:
  - Standard error, SE(slope)
  - Confidence intervals, CI(slope): CI = (b₁ - 2*SE, b₁ + 2*SE)
  - t-ratio for the slope: t=b₁/SE(slope)
  - p-value
  We explain these quantities in a minute, but first we show where to find them in JMP.
2. JMP output: We use (as usual) the Diamond Ring data. We are familiar with the linear equation with intercept and slope, as well as R² and RMSE. The estimated intercept b₀ and slope b₁ can be found a second time in a section with title "Parameter Estimates". In this section we also find a column with SEs ("Std Error"), a column with t-statistics ("t Ratio"), and a column with p-values ("Prob>|t|").
  (Ignore other sections such as "Lack of Fit" and "Analysis of Variance" for now.)
3. We elaborate on the meaning and use of these quantities:
  - The SE(slope) is the standard deviation of the estimates b₁ under sample-to-sample (dataset-to-dataset) variability. Standard errors measure the uncertainty of estimates due to sample-to-sample variability.
  - The CI(slope) is a random interval that will catch the true slope b₁ 95% of the time; in other words, the true b₁ is in CI(slope) for 95% of samples (=datasets).
  - Before interpreting t-ratios and p-values, we must understand that the conventional null hypothesis H₀ is: b₁=0.
    Why? Because if the true slope is b₁=0, then the predictor x is irrelevant for predicting or explaining the response y (diamond weight would be irrelevant for price, which it isn't).
    As we said earlier, this conventional null hypothesis is mostly not very meaningful right now because, with a single predictor, even weak effects tend to be significant at small sample sizes.
  - We reject H₀: b₁=0 at the 5% significance level if any and all of the following equivalent conditions are satisfied:
    - |b₁| > 2*SE, that is, either b₁ > 2*SE or b₁ < -2*SE
    - 0 is not in the 95% CI
    - |t|>2, that is, either t>2 or t<-2
    - p-value < 0.05
    These conditions all describe the situation that the observed estimate b₁ and the assumed parameter value b₁=0 are too far apart to be compatible: b₁ is too unlikely to be observed under the assumption b₁=0. The yard stick for measuring distance between b₁ and anything else is the SE, because the SE measures the uncertainty in b₁.
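    The most confusing among the above conditions for rejection of H₀ is the one in terms of the p-value. Recall: the p-value is the probability, computed under the assumption that H₀ is true, of seeing an estimate at least as far from the hypothesized value as the one we observed. A small p-value therefore means the data are hard to reconcile with H₀. In order to reject H₀, we want the p-value to be small, namely, less than 0.05 (a convention corresponding to 0.95 confidence for CIs).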
    Slides 3-3 and 3-5 show how to test any other slope value also, such as b₁=$3,800 for the diamond data. We just check whether the estimate b₁=3721 and the hypothesized value 3800 are further apart than two SEs (SE=82). They are not: |3721-3800|=79 < 2*82=164. Hence an assumed price of 3800 per additional carat could not be rejected based on the data.
  - Why is the intercept rarely tested? It's not often of interest. Testing the intercept can be of interest if y=cost/price and x=quantity, in which case intercept=fixed cost. One might ask: Is there fixed cost at all? This suggests testing the null hypothesis that true fixed cost is zero, b₀=0. By analogy with the slope, we reject if |b₀| > 2*SE(intercept).
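  The rejection rules for the slope can be checked with a small sketch. It uses the estimate 3721 and SE 82 quoted above; the helper names `t_ratio` and `rejects` are invented for this illustration, and the cutoff 2 is the notes' rule of thumb rather than an exact t quantile.

```python
b1, se = 3721.0, 82.0   # slope estimate and its SE, as quoted in the notes

def t_ratio(estimate, se, hypothesized=0.0):
    """Distance between estimate and hypothesized value, in units of SE."""
    return (estimate - hypothesized) / se

def rejects(estimate, se, hypothesized=0.0, cutoff=2.0):
    """Two-sided test at roughly the 5% level using the 2*SE rule."""
    return abs(t_ratio(estimate, se, hypothesized)) > cutoff

ci = (b1 - 2 * se, b1 + 2 * se)
print("95% CI for the slope:", ci)

for h0 in (0.0, 3800.0):
    inside = ci[0] <= h0 <= ci[1]
    print(f"H0: slope = {h0:.0f}  t = {t_ratio(b1, se, h0):.2f}  "
          f"in CI: {inside}  reject: {rejects(b1, se, h0)}")
```

  The loop shows the equivalence stated in the list: a hypothesized value is rejected exactly when it lies outside the CI, which is the same as |t| > 2.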
4. Analysis of the standard error of slopes:
  Preliminary remark: From now on s_e = RMSE is our new notation for the estimate of s, the spread of the response values around the true line:
  s_e = RMSE = [ (e₁²+...+e_n²)/(n-2) ]^(1/2)
  - There exists an explicit formula for the SE of the slope, shown on slide 3-2:
```
             SE(b₁) = s_e / ( n^(1/2) * s_X )
```
    We will never use the above formula for actual calculation of SE(b₁) since JMP does that. But we find the formula insightful.
    First a reminder: s_X is the simple standard deviation of the x-values.
    - s_X = [ ((x₁-x̄)²+...+(x_n-x̄)²)/(n-1) ]^(1/2)
      a measure of horizontal spread on the x-axis (x̄ is the mean of the x-values)
    - s_e = [ (e₁²+...+e_n²)/(n-2) ]^(1/2)
      a measure of vertical spread off the fitted line
  - Interpretation of the formula for the standard error of the slope:
    - It is good to have a large sample size n:
      as n increases, the SE goes to zero like 1/n^(1/2), just like the SE of the mean in pre-term. Again, we need 4 times as much data to cut the SE by a factor of 2.
    - It is good to have a small standard deviation s_e around the line. All other things being equal, SE(slope) is directly proportional to s_e.
    - It is good to have a large spread of the x-values: the SD of the x-values in the denominator shows that, all things being equal, a larger s_X entails a smaller SE(b₁).
    To illustrate the last point, go to slide 3-6 and recreate the analysis of the Diamond Ring data with the right-most 20 observations removed: s_X shrinks, and SE(b₁) jumps from 82 to 430.
  - Implications for outlier analysis:
    - We don't like vertical outliers: they inflate s_e.
      s_e is the main quantity affected by vertical outliers.
      We hope we can remove them from the data based on subject-matter judgement, such as the coincidence of an inventory sale with a particular mass mailing as in the Direct Mail data used in slides 2-17 and 2-18.
    - We like horizontal outliers, that is, leverage points: they inflate s_X, which drives down SE(b₁).
      When we spot a leverage point, we hope it is compatible with the majority of the data in the sense that the slope b₁ does not change drastically when the point is removed. We also hope to find arguments that the leverage point is likely to be typical for future data with x-values that are similarly extreme.
    To illustrate the last point, analyze the Cottage data according to slide 3-16.
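  The effect of the x-spread on SE(b₁) can be tried out on made-up data. The sketch below is an assumption-laden illustration, not the Diamond Ring analysis: the data are simulated, the function `se_slope` is a hypothetical helper, and it implements the formula s_e/(n^(1/2)*s_X) as stated in these notes.

```python
import math
import random

def se_slope(xs, ys):
    """SE(b1) via the notes' formula s_e / (sqrt(n) * s_X)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))   # RMSE: vertical spread off the fitted line
    s_x = math.sqrt(sxx / (n - 1))   # SD of the x-values: horizontal spread
    return s_e / (math.sqrt(n) * s_x)

rng = random.Random(0)               # fixed seed; simulated data only
n = 40
noise = [rng.gauss(0, 30) for _ in range(n)]
x_narrow = [rng.uniform(0.15, 0.25) for _ in range(n)]
x_wide = [0.2 + 4 * (x - 0.2) for x in x_narrow]   # same pattern, 4 times the spread

y_narrow = [-250 + 3700 * x + e for x, e in zip(x_narrow, noise)]
y_wide = [-250 + 3700 * x + e for x, e in zip(x_wide, noise)]

se_narrow = se_slope(x_narrow, y_narrow)
se_wide = se_slope(x_wide, y_wide)
print(f"SE(b1) with narrow x-range: {se_narrow:.1f}")
print(f"SE(b1) with wide x-range:   {se_wide:.1f}")
print(f"ratio narrow/wide:          {se_narrow / se_wide:.2f}")
```

  Both datasets share the same n and the same errors, so s_e is identical and only s_X differs. Stretching the x-values by a factor of 4 therefore cuts SE(b₁) by a factor of 4.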
CONFIDENCE INTERVALS FOR THE REGRESSION LINE (CURVE)
1. The estimated line yhat_x = b₀ + b₁*x approximates the true regression line m(x) = β₀ + β₁*x. It therefore stands to reason that there is an SE for yhat_x, and hence a CI around yhat_x that catches the true m(x) about 95% of the time (sample-to-sample). Indeed:
```
         SE(yhat_x) = s_e / n^(1/2) * ( 1 + (x-x̄)²/s_X² )^(1/2)
```
  where we use the more intuitive abbreviation s_e = RMSE. The 95% CI is
  (yhat_x - 2*SE(yhat_x), yhat_x + 2*SE(yhat_x))
  Why would we show this arcane formula for SE(yhat_x) above?
  Not for calculations (JMP does that). Again, it's for a qualitative insight:
  - SE(yhat_x) shrinks at the rate of 1/n^(1/2).
  - SE(yhat_x) gets larger as x moves away from x̄.
  The first point is familiar from pre-term where the SE of the mean was
  SE(ȳ) = s_Y/n^(1/2). An implication is that in order to double the precision of ȳ, i.e., to cut SE(ȳ) in half, we need 4 times as much data. The same holds for yhat_x.
  The second point is new: SE(yhat_x) = s_e/n^(1/2) only for x=x̄. As x moves away from x̄, SE(yhat_x) grows, that is, the CI widens! The growth is slow, though, as we can convince ourselves in examples.
  In summary, x=x̄ is the fulcrum where the estimated lines wobble the least under sample-to-sample variability.
  Fine print: The above CI holds only if the SRM is correct. If there is undiscovered curvature or heteroscedasticity, the CI for yhat_x is not valid, meaning, the coverage of the CI will not be 95%.
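  To see the widening of the CI away from x̄ in numbers, here is a minimal sketch on simulated data. The data, the seed, and the helper names `fit_summary` and `line_ci` are assumptions made for this example; the SE is computed with the formula as given in these notes, and JMP's output may differ slightly.

```python
import math
import random

def fit_summary(xs, ys):
    """Least-squares line plus the summaries used in SE(yhat_x)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return {"n": n, "xbar": xbar, "b0": b0, "b1": b1,
            "s_e": math.sqrt(sse / (n - 2)), "s_x": math.sqrt(sxx / (n - 1))}

def line_ci(fit, x):
    """Fitted value, SE(yhat_x), and the +-2*SE interval at x."""
    se = (fit["s_e"] / math.sqrt(fit["n"])
          * math.sqrt(1 + (x - fit["xbar"]) ** 2 / fit["s_x"] ** 2))
    yhat = fit["b0"] + fit["b1"] * x
    return yhat, se, (yhat - 2 * se, yhat + 2 * se)

rng = random.Random(1)                          # simulated data, fixed seed
xs = [rng.uniform(1, 10) for _ in range(30)]
ys = [20 - 1.5 * x + rng.gauss(0, 2) for x in xs]
fit = fit_summary(xs, ys)

for shift in (0, 3, 6):                         # distance of x from the mean of x
    yhat, se, (lo, hi) = line_ci(fit, fit["xbar"] + shift)
    print(f"x = xbar + {shift}: yhat = {yhat:6.2f}  SE = {se:.3f}  CI = ({lo:.2f}, {hi:.2f})")
```

  At x = x̄ the SE equals s_e/n^(1/2); the printed SEs grow as x moves away from x̄.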
  
  SE(yhat_x) in JMP:
  Fit Y by X > pick X and Y > Fit Line or Special > click red diamond "Linear Fit" or "Transformed Fit..." > Confid Curves Fit
  Examples: Used Cars data (left) and the Philadelphia Crime data (right)
  
  Note that the two plots above show linear fits (Fit Line) to transformed variables that have been computed with JMP formulas. If we instead start from the untransformed variables and let "Fit Special" do the transformation, the CI band is shown on the plot of the original, untransformed variables.
PREDICTING INDIVIDUAL VALUES WITH A REGRESSION
1. Recall the distinction between CIs and PIs, confidence intervals and prediction intervals (from pre-term 603, slide 9-3):
  - CIs contain the true parameter 95% of the time under sample-to-sample variability.
    CIs are of the form "estimate +- 2*SE".
  - PIs contain 95% of future observations.
    In regression this means: 95% of future y's at a given x
    (95% of future diamond ring prices for a given fixed weight in carats).
    PIs are of the form "estimate +- 2*SD".
  PIs are wider than CIs because SD > SE. (Uncertainty about observations is greater than uncertainty about estimates: SE = SD/n^(1/2) for means.)
2. In Deliverables (4) of the Individual Project (Installment 1) you are asked for the second kind of interval, PIs: produce two intervals that capture 95% of Total Costs of future orders, given order sizes of 200 Units and 4000 Units, respectively.
3. We said once that "regression line +- 2*s" contains 95% of the data. That's correct for the true regression line (m(x)=b₀+b₁*x) and the true SD of the errors (s):
  (m(x) - 2*s, m(x) + 2*s)
  contains 95% of future observations y at x.
  Replacing the truth with estimates (yhat(x) = b₀ + b₁*x) works ok if we're not extrapolating outside the range of observed x-values:
  (yhat(x) - 2*s_e, yhat(x) + 2*s_e)
  will contain about 95% of future observations y at x.
  (Recall s_e = RMSE.)
  If we're extrapolating, we should account for the sample-to-sample variability of yhat_x, which might be sizable at distant x-values for which we have no experience. From the previous section on SE(yhat_x), we know that the variability of yhat_x can be substantial if x is far from x̄. The formula for the prediction band that adjusts for the sample-to-sample variability of yhat_x is as follows: First define the prediction error estimate
  PE(y_x) = [ SE(yhat_x)² + s_e² ]^(1/2)
  Then the 95% prediction interval at x is:
  ( yhat_x - 2*PE(y_x), yhat_x + 2*PE(y_x) )
  See slide 3-9, bottom. We won't use this formula for actual calculations either (JMP will do it). We find the formula interesting, though, because it shows how the naive band based on +-2*s_e gets adjusted for the obvious fact that the more distant x is from x̄, the less we know about where the responses are going to fall.
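  The adjustment can be made concrete with a short sketch that compares the naive half-width 2*s_e with the adjusted half-width 2*PE(y_x). The summary statistics below are hypothetical values chosen for illustration, not JMP output, and the function names are made up.

```python
import math

def se_fit(s_e, n, xbar, s_x, x):
    """SE(yhat_x) using the formula from the notes."""
    return s_e / math.sqrt(n) * math.sqrt(1 + (x - xbar) ** 2 / s_x ** 2)

def pred_error(s_e, n, xbar, s_x, x):
    """PE(y_x) = sqrt( SE(yhat_x)^2 + s_e^2 )."""
    return math.sqrt(se_fit(s_e, n, xbar, s_x, x) ** 2 + s_e ** 2)

# Hypothetical summary statistics (assumed for this example only).
s_e, n, xbar, s_x = 32.0, 48, 0.20, 0.06

for x in (0.20, 0.30, 0.40, 0.60):
    pe = pred_error(s_e, n, xbar, s_x, x)
    print(f"x = {x:.2f}: naive half-width = {2 * s_e:.1f}, "
          f"adjusted half-width = {2 * pe:.1f}")
```

  Near x̄ the two half-widths nearly coincide, which is why the naive band works when we are not extrapolating. Far from x̄ the SE(yhat_x) term takes over and the adjusted band is visibly wider.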
  
  JMP, method 1: reading values off the scatterplot with the cross-hair cursor
  
  Analyze > Fit Y by X > pick X, Y; OK > red diamond: Fit Special... > pick transformations; OK; red diamond next to "Transformed Fit..." > Confid Curves Indiv (This is JMP's unfortunate name for PIs at all x-values, drawn as two curves.)
  Click "+" next to the lens, then press and hold the mouse button on the plot: the cross-hair appears
  Read off the upper and lower 95% PI limits at x=200 and x=4000.
  If necessary, zoom in or out, by changing the min and max of the axes: right-click on axis numbers > Axis Settings > ... change Minimum and Maximum.
  
  JMP, method 2:
  
  First create columns with transformed variables, such as logs. This method works directly only on the scale on which a straight line is fitted.
  Add two rows: Rows (at top) > Add rows... > 2
  Fill in the values 200 and 4000 in the new rows of the Units column.
  Analyze > Fit Model > pick Y; click x-variable, click "Add"; "Run Model" > red diamond: Save Columns > Indiv Confidence Interval (again JMP's unfortunate name for PI)
  Two new columns will have appeared in the spreadsheet, containing upper and lower PI bounds for all rows in the spreadsheet, including the two additional rows.
  Back-transform if you used a y-transform. To this end form two new columns with the formula "Exp(...)", if you used a log(y) transform.
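  The back-transformation step amounts to applying Exp to both PI bounds. A minimal sketch, assuming a natural-log transform of y; the log-scale bounds below are hypothetical placeholders, not values from the project data.

```python
import math

def back_transform_interval(lower_log, upper_log):
    """Map a PI computed for log(y) back to the original y scale."""
    return math.exp(lower_log), math.exp(upper_log)

# Hypothetical PI bounds on the log scale (placeholders for illustration).
lower_log, upper_log = 6.9, 7.5

lower, upper = back_transform_interval(lower_log, upper_log)
print(f"PI on the log scale:      ({lower_log}, {upper_log})")
print(f"PI on the original scale: ({lower:.0f}, {upper:.0f})")
```

  Because Exp is increasing, a future y lies between the back-transformed bounds exactly when log(y) lies between the log-scale bounds, so the coverage carries over. The back-transformed interval is no longer symmetric around the back-transformed fit.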
THE INTERPRETATION OF R² AS "FRACTION OF VARIATION EXPLAINED"
1. We often hear R² described as "proportion of total variation explained by the regression" or, more sloppily, "fraction of explained variance". What is this based on?
  Part of learning a new field is also learning about its conventions.
2. Q: If R² is a fraction, what is the denominator?
  A: s_Y² = [(y₁-ȳ)²+...+(y_n-ȳ)²]/(n-1),
  which describes the total variation of the y-values ignoring the x-values.
  To reconstruct the interpretation of R² as "fraction of variation explained by the regression", we interpret
  - s_e² as the "variance unexplained by the regression". But if
  - s_Y² is "total variance", then
  - s_Y²-s_e² is the "variation explained by the regression", hence
  - [ s_Y²-s_e² ]/s_Y² is the "fraction of variation explained by the regression".
  Now here is an "amazing identity": If s_e² were computed not with 1/(n-2) but with 1/(n-1) instead:
  s_e² = (e₁²+...+e_n²)/(n-1)
  then this "fraction of variation explained by the regression" would be exactly R²:
```
          R² = ( s_Y² - s_e² ) / s_Y²
```
  This modification of s_e is not desirable, which is why the right hand quantity with the actual, unmodified s_e² is called adjusted R², which you recall as the second number in JMP's regression outputs. The adjusted R² is slightly more realistic as an estimate of R² for the population (which is a "sample with n=infinity").
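  The "amazing identity" and its relation to adjusted R² can be verified numerically. The sketch below uses simulated data with a fixed seed (an assumption for illustration); R² is computed independently as the squared correlation between x and y, so the comparison is not circular.

```python
import random

rng = random.Random(2)                     # simulated data, fixed seed
n = 25
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [3 + 2 * x + rng.gauss(0, 4) for x in xs]

xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))   # e1^2 + ... + en^2

r2 = sxy ** 2 / (sxx * syy)                # squared correlation of x and y

s_y2 = syy / (n - 1)                       # total variance of y
s_e2_modified = sse / (n - 1)              # residual variance with 1/(n-1)
s_e2 = sse / (n - 2)                       # actual s_e^2 = RMSE^2

frac_modified = (s_y2 - s_e2_modified) / s_y2   # should reproduce R^2
adjusted_r2 = (s_y2 - s_e2) / s_y2              # same fraction with unmodified s_e^2

print(f"R^2 (squared correlation):     {r2:.6f}")
print(f"fraction with 1/(n-1) in s_e^2: {frac_modified:.6f}")
print(f"adjusted R^2 (1/(n-2)):         {adjusted_r2:.6f}")
```

  The first two printed numbers agree, while the adjusted R² is slightly smaller because dividing by n-2 makes s_e² slightly larger.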