Model building strategies
Rule 1. There are NO rules.
Some strategies:
- Understand the variables/get to know your data first.
- Univariate summaries. Look at histograms. Identify potential
problems. Extreme skewness will often lead to outliers in the
regression.
- Bivariate summaries. Look at the scatterplot matrix. Recognize
correlations (strong linear relationships) and curvature (non-linear
relationships). Remember that no correlation does not mean no
relationship, just NO LINEAR RELATIONSHIP.
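
The point that zero correlation only rules out a linear relationship can be checked with a small made-up example. The sketch below assumes NumPy is available; the data (y equal to x squared on a grid symmetric about zero) are invented for illustration and are not from any course dataset.

```python
import numpy as np

# Hypothetical data: y is completely determined by x, but not linearly.
x = np.linspace(-3, 3, 61)   # grid symmetric about zero
y = x ** 2

# Correlation between x and y: essentially zero, despite a perfect relationship.
r_linear = np.corrcoef(x, y)[0, 1]

# Correlation after transforming x: the relationship becomes linear.
r_transformed = np.corrcoef(x ** 2, y)[0, 1]

print(f"corr(x, y)   = {r_linear:.3f}")
print(f"corr(x^2, y) = {r_transformed:.3f}")
```

A scatterplot of y against x would show the curvature immediately, which is why the plot should be looked at alongside the correlation.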
- Ways to cut down time spent at the computer.
- Think about likely associations between the variables before you
start your analysis.
- Are there any a priori sensible transformations, for example
LOG(INCOME)?
- Every minute of thought before you get to the computer is worth 10
minutes at the computer.
- Let your initial thinking generate about five to ten sensible
questions, then go to the computer and answer them by looking at the
data and output.
- Avoid going to the computer for long periods of time with no clear
objective.
- Keep an audit trail of your findings as you go along.
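
As a rough illustration of why LOG(INCOME) is a sensible a priori transformation, the sketch below compares the skewness of a right-skewed variable before and after taking logs. The incomes are simulated from a lognormal distribution, which is an assumption made purely for the example, and the `skewness` helper is written here rather than taken from a library.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score."""
    z = (values - values.mean()) / values.std()
    return np.mean(z ** 3)

# Simulated incomes (assumption: lognormal, fixed seed for repeatability).
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=1.0, size=5000)

skew_raw = skewness(income)
skew_log = skewness(np.log(income))

print(f"skewness of INCOME:      {skew_raw:.2f}")
print(f"skewness of LOG(INCOME): {skew_log:.2f}")
```

The raw variable has a long right tail, the kind of extreme skewness that tends to produce outliers in a regression; the logged variable is close to symmetric.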
Priority: obtain a model with a reasonable fit before you put too much
time into residual analysis and commentary. Don't dwell on bad
residual plots for models you are not going to choose anyway.
Build some models.
- Strategies:
- 1. Start small and build up.
- Look for which new variable increases R-squared the most, or reduces
RMSE.
- Check scatterplot matrix and leverage plots for curvature to
suggest transformations.
- 2. Start big and knock down.
- Remove variables that have insignificant partial slopes.
- Combine highly collinear variables.
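
The "start small and build up" strategy can be sketched as a simple forward selection loop. This is a minimal sketch under stated assumptions, not a recommended automatic procedure: the data are simulated, the names `x1` to `x4` are hypothetical, and the stopping threshold `min_gain` is an arbitrary choice made for the example.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least squares fit of y on X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, names, min_gain=0.01):
    """Add, one at a time, the variable that raises R-squared the most.

    Stops when the best available gain is below min_gain
    (a hypothetical threshold chosen for this sketch).
    """
    chosen, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # R-squared of the current model plus each candidate variable.
        trial = {j: r_squared(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(trial, key=trial.get)
        if trial[j_best] - best_r2 < min_gain:
            break
        chosen.append(j_best)
        remaining.remove(j_best)
        best_r2 = trial[j_best]
    return [names[j] for j in chosen], best_r2

# Simulated data: only x1 and x3 actually drive y.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

selected, r2 = forward_select(X, y, ["x1", "x2", "x3", "x4"])
print("selected:", selected)
print(f"R-squared: {r2:.3f}")
```

The loop only automates the R-squared comparison. Checking the scatterplot matrix and leverage plots for curvature at each step still has to be done by eye.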
Typical attributes of a chosen model.
- Lean and mean.
- All t-stats are significant.
- Hard to improve R-squared significantly by playing around.
- Not too much collinearity (VIFs < 10, say).
- If there are serious outliers, give some justification for their
inclusion or removal.
- If you see any structure in the residual plots then the job is not
finished.
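
The collinearity check can be made concrete by computing variance inflation factors directly. The sketch below uses the standard definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The three predictors are simulated for illustration, with `b` deliberately built as a near copy of `a`.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept).
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated predictors (assumption): b is almost a copy of a, c is unrelated.
rng = np.random.default_rng(2)
n = 300
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)
c = rng.normal(size=n)

vifs = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.1f}")
```

Here `a` and `b` land far above the rule-of-thumb cutoff of 10 and would be candidates for combining, while `c` stays near 1.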