#================================================================
# TREE-BASED REGRESSION
#
# We make an excursion into the very opposite of smoothing:
#     CART -- Classification and Regression Trees
# It might as well be called "roughing".
# In what follows we consider the regression version only.
# This is our first example of "multiple regression", apart from
# the bivariate smoother at the end of the previous section.
# Tree-based regression is simpler than multiple linear regression,
# but more unusual for those who grew up with the latter.
#
# The idea: When data are heterogeneous, clustered, messy, it might not
# be the best idea to fit smooth functions; in particular, it might not
# be a good idea to fit linear functions, as is done in linear regression.
# Instead, one might consider breaking the data up into subsets that are
# homogeneous.  Once we have homogeneous subsets, we might do the simplest
# possible thing: fit a constant, that is, a mean.
# Therefore, tree-based regression does local averaging, like smoothers,
# but this time on non-overlapping subsets.
#
# Q: How do we arrive at homogeneous subsets?
# A: By repeatedly breaking the data up along axis splits.
#
# But how would we choose such axis splits?
# Example: In the following plot, where would we reasonably place a split?
dim(boston)
pairs.plus(boston, pch=16, cex=.1, gap=0)
plot(boston[,"LSTAT"], boston[,"MEDV"], pch=16,
     xlab="LSTAT", ylab="MEDV", cex=.5)
# Here is a hand-chosen split at LSTAT=10:
spl <- 10
sel.left  <- boston[,"LSTAT"] <  spl
sel.right <- boston[,"LSTAT"] >= spl
plot(boston[,"LSTAT"], boston[,"MEDV"], xlab="LSTAT", ylab="MEDV",
     col=c("red","green")[sel.left+1], pch=16, cex=.5)
# We said we will fit only the simplest possible model, a
# mean in each subset:
lines(c(spl,spl), c(0,50))
lines(c(0,spl),  rep(mean(boston[sel.left, "MEDV"]), 2), lwd=2)
lines(c(spl,40), rep(mean(boston[sel.right,"MEDV"]), 2), lwd=2)
#
# How do we formalize a rule for selecting the axis splits?
# That's the big question.
# Here is the recipe that is used most often:
#
# For a given splitting variable "xj" (="LSTAT" in the plot)
# and threshold "th" (=10 in the plot),
# evaluate a criterion, such as the residual sum of squares
# after fitting a separate mean left and right of the threshold:
sum((boston[sel.left, "MEDV"] - mean(boston[sel.left, "MEDV"]))^2) +
  sum((boston[sel.right,"MEDV"] - mean(boston[sel.right,"MEDV"]))^2)
# which should be the same as
var(boston[sel.left, "MEDV"]) * (sum(sel.left) -1) +
  var(boston[sel.right,"MEDV"]) * (sum(sel.right)-1)
# Indeed!
#
# Then:
#   For all possible splits (505 for "boston")
#   of all x-variables     ( 13 for "boston"),
#   evaluate the criterion,
#   and pick the split that has the lowest value of the criterion:
lstat.vals <- sort(unique(boston[,"LSTAT"]))
split.vals <- (lstat.vals[-1] + lstat.vals[-length(lstat.vals)])/2   # midpoints
crit.vals  <- rep(NA, length(split.vals))
for(i in 1:length(split.vals)) {
  spl <- split.vals[i]
  sel.left  <- boston[,"LSTAT"] < spl;  sel.right <- !sel.left
  crit.vals[i] <-
    sum((boston[sel.left, "MEDV"] - mean(boston[sel.left, "MEDV"]))^2) +
    sum((boston[sel.right,"MEDV"] - mean(boston[sel.right,"MEDV"]))^2)
}
plot(split.vals, crit.vals, pch=16, cex=.2)
# Look at that: our hand-chosen spl=10 was near the minimum of the criterion values!
# Here is how you get the minimum location:
split.vals[crit.vals==min(crit.vals)]
## [1] 9.725
which(crit.vals==min(crit.vals))
#
# Then:
#   Repeat on both sides of the split,
#   and repeat on and on and on..., recursively.
#   Stop when the subsets get too small.
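# The loop above searched splits of LSTAT only.  Here is a sketch of the full
# exhaustive search described above: all predictors, all candidate thresholds.
# It assumes, as before, that "boston" is a data frame of numeric columns with
# response "MEDV"; the function name "best.split" is made up for this
# illustration, and the code is slow but transparent (for real use one would
# update the sums of squares incrementally).
best.split <- function(dat, yname="MEDV") {
  xnames <- setdiff(colnames(dat), yname)
  best   <- list(crit=Inf, var=NA, threshold=NA)
  for(xj in xnames) {
    xvals <- sort(unique(dat[,xj]))
    if(length(xvals) < 2) next                        # nothing to split on
    ths <- (xvals[-1] + xvals[-length(xvals)])/2      # midpoints, as above
    for(th in ths) {
      sel  <- dat[,xj] < th
      crit <- sum((dat[sel, yname] - mean(dat[sel, yname]))^2) +
              sum((dat[!sel,yname] - mean(dat[!sel,yname]))^2)
      if(crit < best$crit) best <- list(crit=crit, var=xj, threshold=th)
    }
  }
  best
}
best.split(boston)    # the best single first split over all predictors;
                      # compare with the first split of the CART tree fit below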
#
# There are many complications to the idea, but let's leave things simple.
# We are going to use trees not for prediction but for interpretation.
# [Trees for prediction would need sophisticated cross-validation.
#  See the book by Breiman, Friedman, Olshen and Stone (1984).]
#
# There exists an implementation of tree-based regression and classification
# authored by Clark and Pregibon, documented in the "White Book",
# "Statistical Models in S", edited by Chambers and Hastie.
# You may have to download and install it:
# * To this end, make sure you are connected to the internet!
# * In the Windows version of R you can do that with the pull-down
#   menu "Packages".
# * In the Linux and Emacs versions of R you can do it with the following function:
install.packages("tree")   # doesn't work anymore in R 2.4.1
# In every session, before using functions from the package, you have to call
library(tree)
#
# We will use another implementation, though, from the PhD thesis
# of a student of mine.  The implementation is slow because
# it is all written in R, not C or Fortran,
# but it has some unique ways of constructing trees that are
# often more interpretable than standard trees.
# The trick is in the choice of the splitting criterion.
# [If you are curious about details, you can download
#  the Ph.D. thesis from the class web page.]
#
# First, download the file "trees.R" from the class web page
# and source it:
source("http://www-stat.wharton.upenn.edu/~buja/STAT-961/trees.R")
#
# Here is the conventional "CART" method:
boston.CART <- regression.tree(boston, method="CART", minsize=23,
                               title="Boston Housing Data")
show.tree(boston.CART, type="u", cex=.7)
# [Stretch this window as much as possible to avoid much of the overplotting.]
#
# The tree looks interesting and we should stare at it for a while
# to get a sense of what it might tell us:
# The first split is on RM, the average number of rooms:
#     85% with RM < 6.94,       15% with RM > 6.94
#   [small/normal homes]        [large homes]
# The 85% of smaller homes is split on LSTAT, the % of lower-status people:
#     51% with LSTAT < 14.55,   15% with LSTAT > 14.55
#   [^^few poor people^^^]      [^^more poor people^^]
# The 15% of larger homes is split on RM again:
#     9.1% with RM < 7.42,      5.9% with RM > 7.42
#   [medium large homes]        [largest homes]
# This makes some sense:
#   among large homes, size drives the price;
#   among small/normal homes, the social environment drives the price.
# We had found something like this when studying scatterplots of MEDV, RM, LSTAT.
# The variable of interest was really NOX, but it is only of
# low importance in the tree.
#
# Let's move on to one of our new methods:
# It tries to repeatedly peel off subsets that have
# HIGH response values:
boston.XM <- regression.tree(boston, method="XM", minsize=23,
                             title="Boston Housing Data")
show.tree(boston.XM, type="u", cex=.7)
# The story becomes clearer: LSTAT is the most important variable, but
# to predict where the highest prices are, one has to look at RM.
#
# And here is another one of our new methods:
# It does the opposite by trying to repeatedly peel off subsets with
# LOW response values:
boston.NM <- regression.tree(boston, method="NM", minsize=23,
                             title="Boston Housing Data")
show.tree(boston.NM, type="u", cex=.7)
# Another story emerges, complementary to the first story, and
# more complete, more intuitive!!
# In order to ascend housing prices, from lowest to highest,
#   peel off high-crime areas,
#   then racially segregated areas,
#   then descend down the % of lower-status people,
#   then climb up the number of rooms.
# Among the three trees we constructed, this one makes the most sense.
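# As a cross-check on regression.tree()'s CART fit, one can also grow a
# conventional regression tree with the standard "tree" package loaded above.
# This is only a sketch: it assumes "boston" is a data frame with response
# "MEDV", and the control settings below are arbitrary choices, so the
# resulting tree need not match the one above exactly.
boston.tree <- tree(MEDV ~ ., data=boston,
                    control=tree.control(nobs=nrow(boston), minsize=23))
plot(boston.tree, type="u")
text(boston.tree, cex=.7)
# Compare its first split with the RM < 6.94 split found above.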
#
#
# Another data example: the Swiss fertility data
#   47 French-speaking provinces of Switzerland around 1888
#
#   Fertility        - standardized fertility measure for each province
#   Agriculture      - % of males involved in agriculture as occupation
#   Examination      - % of draftees receiving the highest mark on the army exam
#   Education        - % with education beyond primary school (measured on draftees)
#   Catholic         - % of the population who are catholic
#   Infant.Mortality - % of live births who live less than 1 year
#
# The interest is to relate Fertility (the response) to the other variables
# (the predictors).
# The hypothesis is that the predictors are good proxies for the true causes
# of high and low fertility.
# Note our cautious formulation: these variables are probably not direct
# causes of fertility but part of a cultural mix that overall determines
# fertility rates.
data(swiss)          # The dataset comes with R.
options(width=100)   # I like a longer line for a wide window.
swiss                # It's a small dataset, so we can look at it.
names(swiss)         # The variables.
dim(swiss)           # The size of the data.
# For my own interest, I wanted to know the names of some of these provinces,
# especially those that are high in Education and Fertility.
# To this end, we obtain an index vector "ord" with the "order" function
# to sort the data, first according to Education, then Fertility:
# - High Education:
ord <- rev(order(swiss[,"Education"]))
rownames(swiss)[ord][1:7]     # 7 highest on education
# ...includes Geneva, Neuchatel, Lausanne, Vevey, which are "larger" cities
# in French-speaking Switzerland.  I believe Rive Gauche and Rive Droite
# are suburbs of Geneva (left and right of the river Rhone), but I'm not sure.
# - High Fertility:
ord <- rev(order(swiss[,"Fertility"]))
rownames(swiss)[ord][1:10]    # 10 highest on fertility
# ...names that seem rural or alpine to me (I forgot a lot about Switzerland).
# Look at a map of Switzerland.
# - Low Fertility:
ord <- order(swiss[,"Fertility"])
rownames(swiss)[ord][1:10]
# ...includes the larger cities.
#
# Now some basic plots: response (Fertility) versus predictors:
windows(width=25, height=5)
par(pch=16, mfrow=c(1,5), mar=c(2,2,1.5,0), mgp=c(1.5,.5,0),
    cex=1, cex.axis=.5, cex.main=.6)
for(i in 2:6) {
  plot(swiss[,i], swiss[,1], main=colnames(swiss)[i], xlab="", ylab="", cex=.5)
  # lines(lowess(swiss[,i], swiss[,1]), col="red")
}
# If these marginal plots are any indication, we can expect
# Education and its proxy, Examination, to be negatively correlated
# with Fertility.
# On the other hand, Agriculture and Infant.Mortality seem to be
# positively correlated with it, which may make sense because these are
# generally indicators of a low developmental state ("backwardness").
# An interesting connection exists with Catholic: highly catholic
# provinces have higher fertility than those that are highly protestant,
# with the exception of three provinces that are highly mixed, and probably
# more cosmopolitan: they are Geneva, Rive Gauche and Rive Droite:
windows()
plot(swiss[,"Catholic"], swiss[,"Fertility"], pch=16, xlim=c(-20,120))
identify(swiss[,c("Catholic","Fertility")], labels=rownames(swiss))
# Click the three bottom points in the plot, then end the identification
# (right-click or Esc, depending on the device).
#
# These marginal findings call for an investigation of "collinearities",
# that is, dependencies among the predictors, to answer questions such as
# whether Examination and Education are indeed proxies for each other.
# The easiest approach is with a scatterplot matrix:
pairs(swiss, pch=16, cex.labels=1.5, gap=.1)
# (What a nuisance!  The extreme points fall virtually on the borders.
#  Here is a version that allocates some space between
#  the points and the borders: download it from the class website.)
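# While we are at it, a quick numerical companion to the scatterplot matrix:
# the plain correlation matrix (nothing assumed here beyond the built-in
# "swiss" data; the rounding is only for readability).  The signs should
# match the visual impressions discussed below.
round(cor(swiss), 2)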
pairs.plus(swiss, pch=16, cex.labels=1.5, gap=.1)
# Right away we see that Examination and Education are indeed
# positively correlated, that Agriculture correlates negatively
# with both, although more so with Examination, and that
# Infant.Mort does not correlate with anything very much, except for a
# positive correlation with Catholic.
#
# Finally, here is an application of tree() from the standard package:
library(tree)
swiss.Scart <- tree(Fertility ~ ., data=swiss,
                    control=tree.control(47, mincut=3, minsize=6))
show.tree(swiss.Scart, type="u")
# Observations:
# - The top split breaks out a group of 7 provinces (=14.9% of 47) with
#   high education levels and low fertility.
# - In the low Education bucket, the tree splits on Catholic with the
#   expected direction: the high Catholic bucket has higher fertility.
# - The next level shows splits on Infant.Mort on both sides, with
#   the expected direction: higher Infant.Mort, higher Fertility.
#   (This latter relation need not be causal from left to right;
#    it could equally well run right to left, or both ways.)
# - The last split on Agriculture is not very believable, as it has the
#   unexpected direction: higher Agriculture, lower Fertility.
#   But then, the difference is the smallest of any split in the tree,
#   which adds to the evidence that the split is probably not "real".
# Overall, though, the standard tree function gave us in this case a
# very interpretable tree.  This may be due to the small size of the dataset.
#
# Our own CART routine does not exactly reproduce the tree generated
# by the standard library (there are differences in the code):
# (Columns 2:5 are the predictors, Fertility is placed last;
#  Infant.Mortality is not included.)
swiss.CART <- regression.tree(swiss[,c(2:5,1)], method="CART", minsize=5,
                              title="Swiss Fertility Data")
show.tree(swiss.CART, type="u", cex=1)
# - The splits on Education, Catholic, and Examination all follow
#   what we know:  more catholic        ==> higher fertility
#                  more highly educated ==> lower fertility
# - The tree is interesting because it seems to suggest that
#   Agriculture does in fact have an unexpected correlation with
#   fertility after adjustment for Education and Catholic!
#   ==> a case of Simpson's paradox:
#       Marginally, the higher agriculture, the higher fertility,
#       but conditionally on things like education and religion,
#       it's the reverse: the higher agriculture, the lower fertility.
#
# Now we use another option in our routine to look for high Fertility buckets:
swiss.XM <- regression.tree(swiss[,c(2:5,1)], method="XM", minsize=5,
                            title="Swiss Fertility Data")
show.tree(swiss.XM, type="u", cex=1, height=10)
# - For finding high Fertility, one should look for low Examination
#   above all and high Catholic next.
#   Recall that Examination is an education proxy,
#   so it is surprising that deeper down Education also appears.
#   Overall we learn that for high Fertility we should look for the
#   lowest levels of the education proxies Examination and Education.
# - An interesting fact appears on the right-hand side:
#   The bucket where catholics exceed 90.57% has a reversal
#   of association with Examination:  Among very highly catholic
#   provinces, the higher education bucket in terms of Examination
#   has the HIGHER fertility: 85.10 versus 73.32.
# - Agriculture, far down, has again a negative association with
#   Fertility: 63.12 in the high-agriculture bucket versus 70.83 in the other.
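# A rough numerical check of the reversal noted above among the very catholic
# provinces (the 90.57% cutoff is taken from the XM tree; a correlation is of
# course a cruder summary than the tree's comparison of bucket means):
sel.cath <- swiss[,"Catholic"] > 90.57
cor(swiss[sel.cath,"Examination"], swiss[sel.cath,"Fertility"])  # within the catholic bucket
cor(swiss[,"Examination"], swiss[,"Fertility"])                  # overall, for comparison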
#
# Next we look for low fertility (bottom fishing):
swiss.NM <- regression.tree(swiss[,c(2:5,1)], method="NM", minsize=5,
                            title="Swiss Fertility Data")
show.tree(swiss.NM, type="u", cex=1)
row.names(swiss)[swiss[,"Education"] > 23.2]
# - Again, at the top a high Education bucket is broken off,
#   followed by a high Examination bucket,
#   followed by two buckets with low Catholic.
# - There are two splits on Agriculture, once again both in the unexpected
#   direction: high Agriculture - low Fertility.
#   We may have to conclude that, other things being equal (education, religion),
#   agriculture in 1888 was such a taxing environment that it depressed fertility.
#
# Looking ahead, let's use a linear model fit to see whether the
# overall qualitative behaviors are confirmed, in particular that
# of Agriculture:
summary(lm(Fertility ~ ., data=swiss))
# (We'll say more about the linear model function "lm" later.)
#------------------------ output begin
## ...
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture       -0.17211    0.07030  -2.448  0.01873 *  
## Examination       -0.25801    0.25388  -1.016  0.31546    
## Education         -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic           0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality   1.07705    0.38172   2.822  0.00734 ** 
## ...
#------------------------ output end
# Yes, Agriculture has a significant negative slope!
# Note another feature of the linear model, though:
# Examination is insignificant!
# After looking at the trees and the plots we know better than that.
# Here is why Education towers over Examination:
# Education isolates the Geneva area.
plot(Fertility ~ Education, data=swiss)
swiss.geneva <- with(swiss, 40 < Catholic & Catholic < 60)
rownames(swiss)[swiss.geneva]
## [1] "V. De Geneve" "Rive Droite"  "Rive Gauche"
summary(lm(Fertility ~ . + I(Catholic>40 & Catholic<60), data=swiss))
#
# Let us remind ourselves of a couple of plots:
windows(width=7, height=14)
par(mfrow=c(2,1), mgp=c(1.5,.5,0), mar=c(2.5,2.5,1,1))
plot(swiss[,c("Examination","Fertility")], pch=16)
plot(swiss[,c("Education","Fertility")], pch=16)
# Education comes in so strong because of the 7 high-Education provinces.
# By comparison, Examination has a weaker but more robust correlation
# with Fertility: no group of 7 provinces would account for the correlation.
#
#----------------
# FURTHER PERSPECTIVES ON TREES:
#
# - The name 'CART' (Breiman et al. 1984) stands for
#   'Classification And Regression Trees'.
#   We only considered regression trees, where the response is quantitative.
#   Classification trees address the case of categorical responses.
#   Actually, there are two types of trees addressing this case:
#   (1) CLASSIFICATION TREES:     They 'estimate' actual class labels.
#   (2) CLASS PROBABILITY TREES:  They estimate class probabilities.
#   Note: 'Class' = one of the category labels of the categorical response.
#   You can always create a classification tree from a class probability tree
#   by picking the class with the highest class probability as the class estimate.
#   The reverse is not possible: a class label is insufficient information
#   to derive a set of class probabilities.
#   There is one more difference between classification and class probability
#   trees: the way in which their performance is measured.
#   (1) Classification trees are judged by misclassification rates
#       (possibly cost-weighted).
#   (2) Class probability trees are judged by measures of class purity
#       in the leaves.
#       'Class purity' can be something like the
#       'Gini index', which is p*(1-p) in the 2-class case, or
#       'entropy', which is -p*log(p)-(1-p)*log(1-p).
#       Both look very similar as a function of p in [0,1]:
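#       (A quick sketch of this claim, using only base R; the rescaling to a
#        common maximum of 1 is just to make the shapes easier to compare.)
p    <- seq(0.001, 0.999, length=200)
gini <- p*(1-p)
entr <- -p*log(p) - (1-p)*log(1-p)
matplot(p, cbind(gini/max(gini), entr/max(entr)), type="l", lty=1:2, col=1:2,
        xlab="p", ylab="impurity (rescaled)")
legend("bottom", legend=c("Gini","entropy"), lty=1:2, col=1:2, bty="n")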
#
# - Note that non-cascading CART trees really discover interactions
#   only.  Every time there is a switch from one splitting variable
#   to the next descending down the tree, one forms implicitly a
#   product of a dummy with two other dummies:
#       products of predictors = interactions
#   Cascading trees that re-use the same predictor down a cascade
#   avoid this and instead create a dose-response effect in that
#   predictor.
#
# - We focused on the interpretability aspect of trees:
#   By using unconventional tree-splitting criteria,
#   we can create surprisingly interpretable unbalanced
#   cascading trees that capture 'dose-response' effects.
#
# - If the game is prediction, then raw trees may not be the best
#   approach.  Here is Breiman's proposal:
#       'BOOTSTRAP AGGREGATION' or 'BAGGING'
#   Simply put:
#     . Generate Nboot=500 (e.g.) trees from bootstrap datasets.
#     . Average their predictions.
#   (A tiny illustration follows at the end of this section.)
#   Q: Why would this be successful?
#   A: Because it creates smoother prediction functions.
#   See an AoS article by Bühlmann & Yu, 2002, "Analyzing Bagging".
#
# - Another development, due to Breiman and building on work by
#   Yali Amit and Donald Geman, is
#       RANDOM FORESTS
#   The idea is to tweak bagging by making the bootstrap trees
#   more independent of each other: at each split of each bootstrap
#   tree, allow only a random subset of the predictors to be considered.
#   This de-correlates the trees somewhat, with apparently
#   beneficial effects.
#   Breiman recommends growing the trees all the way to the
#   bottom, so each leaf contains just one observation.
#   The smoothing is achieved by the averaging across trees.
#   Breiman also uses the fact that each bootstrap sample
#   leaves out a fraction of about 1/e (~37%) of the data each time for
#   a form of cross-validation of the forest performance.
#   The left-out cases are called 'out-of-bag' or OOB.
#
# - Yet another development, by Schapire and Freund, is
#       BOOSTING
#   This was initially meant for classification, not regression.
#   Here the idea is to grow many shallow trees and average them,
#   but unlike bagging, the cases are reweighted at each step.
#   Schapire and Freund's idea was to upweight the misclassified
#   cases so the next tree can get them right.
#   Because this is about classification, each tree produces
#   a class label in the leaves.  Their final classifier
#   is a 'voting' scheme: classify a case according to the
#   (weighted) majority vote of the trees.
#
# - Random forests work with many overfitting trees, averaging
#   them to reduce variance.
#   Boosting works with many underfitting, biased trees, creating
#   many predictors that jointly reduce bias.
#
# - In 2000, Friedman, Hastie and Tibshirani published an article
#   in AoS, 28 (2), 337-407, with the title
#   "Additive logistic regression: a statistical view of boosting".
#   They showed that boosting can be interpreted as a form of
#   'stagewise' logistic regression, using not the logistic loss
#   but a so-called 'exponential loss'.
#   'Stagewise' is in contradistinction to 'stepwise':
#   In stagewise regression we add a new predictor
#   WITHOUT UPDATING THE PREVIOUS PREDICTORS!
#   That is, we form a residual and fit to the residual.
#
# - Note on classification: .. deep learning
#================================================================
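#----------------
# Postscript: a minimal illustration of the bagging idea sketched above,
# using the "tree" package from earlier in this section and the Boston data.
# This is only a rough sketch of "bootstrap + average" (the number of trees
# and the in-sample plot are arbitrary choices here), not Breiman's exact
# procedure.
Nboot <- 100
preds <- matrix(NA, nrow(boston), Nboot)
for(b in 1:Nboot) {
  idx <- sample(nrow(boston), replace=TRUE)    # bootstrap sample of the rows
  fit <- tree(MEDV ~ ., data=boston[idx,])     # one tree per bootstrap sample
  preds[,b] <- predict(fit, newdata=boston)    # its predictions on the original data
}
medv.bagged <- rowMeans(preds)                 # bagged prediction = average over trees
plot(boston[,"MEDV"], medv.bagged, pch=16, cex=.5,
     xlab="observed MEDV", ylab="bagged prediction")
abline(0, 1)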