Lab 7 - Decision Trees¶

This lab covers decision tree classifiers and regressors in sklearn. We will also explore using cross-validation to compare the performance of decision trees with $k$-nearest neighbors.

In [1]:
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Some sklearn routines (e.g., KNN) raise FutureWarnings about upcoming changes, but we'll suppress these
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

Part 1 - Decision Tree Classifiers¶

Classification examples in this lab will use the Wisconsin breast cancer data set introduced in a few of our previous labs.

Recall that these data were collected in a study that explored using machine learning to diagnose breast cancer tissue samples as cancerous (a label of 'M' for malignant) or non-cancerous (a label of 'B' for benign). The predictive features in the data set are the average characteristics (radius, symmetry, etc.) of the cells collected in each tissue sample.

We'll begin by performing an 80-20 training-testing split:

In [2]:
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)

## Separate the target from the predictors
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)

Notice that we re-coded the labels as 1 (malignant) or 0 (benign) using the map() method. As we'll soon see, this helps us keep track of the predicted classes in our decision tree, whose generic plotting method labels classes by integer index.

For classification tasks, we use the DecisionTreeClassifier() function to fit/train a decision tree. The example below fits a tree with a maximum depth of 2 to our training data:

In [3]:
## Train the tree
from sklearn.tree import DecisionTreeClassifier
my_tree = DecisionTreeClassifier(max_depth = 2).fit(train_X, train_y)

Compared with other machine learning algorithms, decision trees are highly interpretable. We can see the structure of the tree using the plot_tree() function:

In [4]:
## Display the fitted tree's structure
from sklearn.tree import plot_tree
plot_tree(my_tree, feature_names=train_X.columns, class_names=True)
plt.show()
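If the default figure is hard to read, the same tree can be re-drawn on a larger canvas with nodes shaded by their majority class. This is just an optional variant of the call above (filled and figsize are standard arguments of plot_tree() and plt.figure()):

## Optional: a larger, color-filled version of the same tree
plt.figure(figsize=(10, 5))
plot_tree(my_tree, feature_names=train_X.columns, class_names=True, filled=True)
plt.show()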

In this visualization, the top line of each non-terminal node shows the splitting rule used to define its two child nodes:

  • The first split uses the feature 'Concave' to split the data into two segments, the left segment being characterized by Concave <= 0.051
    • The samples with Concave <= 0.051 (left branch) are further split using the rule Radius <= 15.05
    • The samples with Concave > 0.051 (right branch) are further split using the rule Texture <= 16.795

The visualization also includes information on the Gini impurity, the total number of samples present, and the class distribution of each node. For example:

  • The left-most terminal node has a Gini impurity of 0.06, indicating a low level of impurity (or a high level of purity); a quick check of this value is sketched below
    • We can see that there are 260 data points in this node, of which 252 belong to the class y[0], or benign
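As a quick check of the 0.06 figure, recall that Gini impurity is $1 - \sum_k p_k^2$, where $p_k$ is the proportion of a node's samples belonging to class $k$. A minimal sketch using the counts shown in that terminal node (252 benign and 8 malignant out of 260):

## Gini impurity of the left-most terminal node, computed from its class counts
p_benign = 252 / 260
p_malignant = 8 / 260
print(1 - (p_benign**2 + p_malignant**2))   # roughly 0.06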

Question 1: Use the tree visualization provided above to answer the following questions.

  • Part A: Calculate the Gini gain resulting from the tree's first split. Show your work.
  • Part B: What is the highest predicted probability for the class "malignant" that can be assigned by this tree? Hint: you can calculate this from the information contained in the tree visualization, but you can confirm it using the predict_proba() method and np.unique().
  • Part C: Consider the first sample in the test set. What is the predicted probability that this sample is "malignant"?

Part 2 - Hyperparameter Tuning¶

Decision trees are prone to overfitting, making hyperparameter tuning a very important consideration when training a tree that will be used on new data.

The two most important hyperparameters (in regard to potential overfitting) are:

  • max_depth - the maximum depth of the tree (i.e., the maximum number of consecutive splits allowed along any single branch).
  • min_samples_split - the minimum number of observations that must be present in a node for it to be eligible for splitting.

In sklearn, the default value of max_depth is None, so the algorithm will continue splitting until every terminal node is pure or contains fewer observations than min_samples_split. The default value of min_samples_split is 2, so the default settings learn the deepest possible tree, which is rarely ideal for performance on new data.
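To see this concretely, here is a minimal sketch (assuming the training data from Part 1 are still in memory) that fits a tree with the default settings and reports how large it grows; the exact numbers will depend on your data and split:

## Fit a tree using the default hyperparameters and inspect its size
default_tree = DecisionTreeClassifier().fit(train_X, train_y)
print(default_tree.get_depth())     # depth of the fully grown tree
print(default_tree.get_n_leaves())  # number of terminal nodes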

An alternative to controlling the complexity of the tree using max_depth and min_samples_split is tuning via the min_impurity_decrease parameter, the minimum decrease in Gini impurity required for a split to occur. You should recognize the following before relying upon this parameter:

  1. The final tree will be very data dependent in the sense that you cannot reliably anticipate its depth/complexity ahead of time.
  2. Tuning this parameter is more context sensitive in the sense that a "hard" classification task will require a more relaxed (smaller) value to produce a tree with any depth, as illustrated below.
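As a rough illustration of this second point, the sketch below fits trees at two arbitrary candidate values of min_impurity_decrease and compares their depths; the values themselves are not recommendations, only a demonstration of how sensitive the tree's size is to this parameter:

## Illustration only: tree depth can change sharply with min_impurity_decrease
for mid in [0.001, 0.01]:
    tree_mid = DecisionTreeClassifier(min_impurity_decrease=mid).fit(train_X, train_y)
    print(mid, tree_mid.get_depth())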

Regardless of which parameters you choose to tune, cross-validation (using either grid or random search) should be used to find suitable values:

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline 

pipe = Pipeline([
    ('model', DecisionTreeClassifier())
])

parms = {'model__max_depth': [2, 3, 4],
         'model__min_samples_split': [10, 20, 40]}

grid_res = GridSearchCV(pipe, parms, cv=5, scoring = 'accuracy').fit(train_X, train_y)
print(grid_res.best_estimator_)
print(grid_res.best_score_)
Pipeline(steps=[('model',
                 DecisionTreeClassifier(max_depth=3, min_samples_split=10))])
0.9032967032967033

Here we see that using a maximum depth of 3 and requiring at least 10 samples in a node for it to be split yields a tree with a cross-validated classification accuracy of 90.33%.
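If you'd like to see more than the single best combination, the full set of cross-validated scores is stored in the cv_results_ attribute; a small sketch of one way to inspect it:

## Optional: inspect the cross-validated scores for every parameter combination
cv_table = pd.DataFrame(grid_res.cv_results_)
print(grid_res.best_params_)
print(cv_table[['param_model__max_depth',
                'param_model__min_samples_split',
                'mean_test_score']])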

Question 2: Continue using the Wisconsin breast cancer data for the following questions.

  • Part A: Use a cross-validated grid search to tune the parameter min_impurity_decrease. Report the classification accuracy of the best estimator.
  • Part B: Use plot_tree() to visually compare the best estimator you found in Part A to the best estimator found in the example above. How are these trees similar or different?
  • Part C: Add standardization to your data preparation pipeline, then use a cross-validated grid search to select an appropriate $k$-nearest neighbors model. You should explore a few different choices of hyperparameters including the number of neighbors and weighting scheme.
  • Part D: Use GridSearchCV() to compare the cross-validated classification accuracy of your best decision tree from Part A with the best KNN approach found in Part C. You should note that standardization has no impact on the decision tree algorithm, so you can include it as a preprocessing step without causing any harm.
  • Part E: Find the classification accuracy of each model you considered in Part D on the test set. Are your findings consistent with the results of Part D?

Part 3 - Decision Tree Regressors¶

The regression examples in this lab will use the Iowa City home sales data set that we've also worked with in several of our previous labs.

For illustration purposes, we'll consider only two features, 'area.living' and 'bedrooms', and explore how they can be used to predict a home's sale price:

In [6]:
## Read IC home sales data and split into training/test sets
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)

## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = train_ic[['area.living','bedrooms']]

To begin, you should recognize that the DecisionTreeRegressor() function behaves almost exactly the same as DecisionTreeClassifier(), but there are a few key differences that we can see in the visualization created by plot_tree():

In [7]:
## Train the tree
from sklearn.tree import DecisionTreeRegressor
my_tree = DecisionTreeRegressor(max_depth = 2).fit(train_ic_X, train_ic_y)

## Visualize tree
plt.figure(figsize=(8,4))
plot_tree(my_tree, feature_names=train_ic_X.columns)
plt.show()

First, we see that the predicted value in each terminal node is the simple average of the outcome variable's values for the data-points contained in that node.

Next, we see that splitting rules are learned based upon the drop in squared error (not the drop in Gini impurity). This can be consequential for hyperparameter tuning because the units of squared error depend on the units of the outcome variable. However, we can avoid this concern by tuning our trees using the parameters max_depth and min_samples_split instead of the parameter min_impurity_decrease.
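A quick way to convince yourself of the first point is to use the fitted tree's apply() method, which returns the terminal node each observation falls into, and compare the average sale price within each terminal node to the tree's distinct predicted values; a minimal sketch, assuming my_tree is the regressor fit above:

## Check: each terminal node's prediction is the mean outcome of its training points
leaf_ids = my_tree.apply(train_ic_X)            # terminal node index for each home
print(train_ic_y.groupby(leaf_ids).mean())      # average sale price within each terminal node
print(np.unique(my_tree.predict(train_ic_X)))   # the tree's distinct predictions (same values)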

Question 3: Be sure that you're using the Iowa City home sales data for the following questions.

  • Part A: Using the training data (train_ic_X and train_ic_y), create a scatterplot displaying a home's living area on the x-axis and its sale amount on the y-axis.
  • Part B: Notice that all splitting rules in the example tree given above involve the predictor area.living. For this question, use hlines() with the argument color = 'red' to display this tree's predicted values for different values of area.living. Hint: your graph should display several horizontal lines, similar to the "resting metabolism" example in our lecture slides. You should consult the hlines documentation for additional information on how to use this function.
  • Part C: Simple linear regression (using only area.living as the predictor) would produce predicted values that follow the straight line that best fits these data. Using your plot from Part B, what advantages and disadvantages do you see in the decision tree model when compared with simple linear regression?

Part 4 - Gain-based Variable Importance¶

Because trees are easily interpretable models, it's worth considering how to quantify the importance of individual predictors in a tree. In this context, variable importance can be assessed by calculating the total reduction in Gini impurity (or reduction in squared error for numeric outcomes) achieved by the splits involving a given variable. These reductions are then rescaled to sum to 1 to ensure a uniform meaning across different applications.

As an example, notice that "area.living" is the only predictor involved in any of the splits in our most recent tree. Thus, we should expect 100% of this model's reduction in impurity to be attributed to "area.living".

We can confirm this by printing the feature_importances_ attribute of the fitted tree:

In [8]:
## Feature importance
print(my_tree.feature_importances_)  

## Column names (which use the same index ordering as feature importance)
print(train_ic_X.columns)  
[1. 0.]
Index(['area.living', 'bedrooms'], dtype='object')
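For the curious, these importances can be reproduced by hand from the arrays stored in the fitted tree's tree_ attribute. The sketch below is not sklearn's exact implementation (which also accounts for sample weights), but it follows the same logic: sum each split's weighted impurity decrease by feature, then rescale so the importances sum to 1.

## Rough sketch: accumulate each split's weighted impurity decrease by feature
t = my_tree.tree_
importances = np.zeros(train_ic_X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:   # terminal node: no split here
        continue
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += decrease
print(importances / importances.sum())   # should match feature_importances_ above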

If we consider a more complex tree, such as one using max_depth=6, we see that both "area.living" and "bedrooms" make contributions to the tree, although "area.living" makes a much larger contribution:

In [9]:
## What about a deeper tree?
deep_tree = DecisionTreeRegressor(max_depth = 6).fit(train_ic_X, train_ic_y)
print(deep_tree.feature_importances_)
[0.8979018 0.1020982]
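Since feature_importances_ is returned as a bare array in column order, it can be convenient to label each value with its column name; a small sketch:

## Label the importances with their column names for readability
print(pd.Series(deep_tree.feature_importances_, index=train_ic_X.columns))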

Question 4:

  • Part A: Create a new predictor matrix for the Iowa City home sales data that contains the predictors 'built', 'bedrooms', 'area.base', 'area.living', 'area.garage1', and 'area.lot'.
  • Part B: Use 5-fold cross-validation to tune the max_depth and min_samples_split parameters of a decision tree model that uses the predictors from Part A to predict sale price.
  • Part C: Find the importance of each predictor in the best decision tree model identified in Part B.
  • Part D: Conduct a small-scale grid search to find a reasonable $k$-nearest neighbors model for this application. Be sure to include a standardization step.
  • Part E: Compare the performance of the best KNN model from Part D with the best decision tree model from Part B using the test set and root mean squared error.
  • Part F: Write a short paragraph summarizing the main results from Parts A-E. Your paragraph should be written in a professional tone that would be suitable for a scientific paper. You should mention all important details, including your model building steps/comparisons, the strengths/weaknesses of your best model, and any additional supporting details or limitations.