This lab covers decision tree classifiers and regressors in sklearn. We will also explore using cross-validation to compare the performance of decision trees with $k$-nearest neighbors.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# KNN will warn you about future updates, but we'll turn these off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Classification examples in this lab will use the Wisconsin breast cancer data set introduced in a few of our previous labs. Recall that these data were collected in a study that explored using machine learning to diagnose breast cancer tissue samples as cancerous (a label of 'M' for malignant) or non-cancerous (a label of 'B' for benign). The predictive features in the data set are the average characteristics (radius, symmetry, etc.) of the cells collected in each tissue sample.
We'll begin by performing an 80-20 training-testing split:
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)
## Separate the target from the predictors
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)
Notice that we re-coded the labels as 1 (malignant) or 0 (benign) using the map() method. As we'll soon see, this is to help us keep track of the predicted classes in our decision tree, which uses a generic plotting method that labels the predicted classes using integers.
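If you'd like to confirm that the recoding worked as intended, one quick optional check is to tabulate the new labels, as sketched below:
## Optional sanity check: tabulate the recoded training labels (1 = malignant, 0 = benign)
print(train_y.value_counts())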
For classification tasks, we use the DecisionTreeClassifier() function to fit/train a decision tree. The example below fits a tree with a maximum depth of 2 to our training data:
## Train the tree
from sklearn.tree import DecisionTreeClassifier
my_tree = DecisionTreeClassifier(max_depth = 2).fit(train_X, train_y)
Compared with other machine learning algorithms, decision trees are highly interpretable. We can see the structure of the tree using the plot_tree() function:
## Display the fitted tree's structure
from sklearn.tree import plot_tree
plot_tree(my_tree, feature_names=train_X.columns, class_names=True)
plt.show()
In this visualization, the top line of each non-terminal node shows the splitting rule used to define the subsequent child nodes:
- The samples in the root node are first split using the rule Concave <= 0.051
- Samples where Concave <= 0.051 (left branch) are further split using the rule Radius <= 15.05
- Samples where Concave > 0.051 (right branch) are further split using the rule Texture <= 16.795
The visualization also includes information on the Gini impurity, the total number of samples present, and the class distribution of each node. For example, the class listed in each node is its majority class; because we set class_names=True, these are displayed as y[0] and y[1], where y[0] corresponds to a label of 0, or benign.

Question 1: Use the tree visualization provided above to answer the following questions. Hint: you may find the predict_proba() method and np.unique() helpful.
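If the plotted tree is difficult to read, one alternative (not required for this lab) is sklearn's export_text() function, which prints the same splits as indented text. A minimal sketch using the tree fit above:
## Optional: print a text summary of the fitted tree's splits
from sklearn.tree import export_text
print(export_text(my_tree, feature_names=list(train_X.columns)))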
Decision trees are prone to overfitting, making hyperparameter tuning a very important consideration when training a tree that will be used on new data.
The two most important hyperparameters (in regard to potential overfitting) are:
- max_depth - the maximum depth of the tree (i.e., the most consecutive splits allowed in a single branch).
- min_samples_split - how many observations must be present in a node for it to be eligible for splitting.

In sklearn, the default value of max_depth is None, so the algorithm will continue splitting until every terminal node is pure or contains fewer observations than the value of min_samples_split. The default value of min_samples_split is 2, which leads to the default settings learning the deepest possible tree (which is not ideal for performance on new data).
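To see why the defaults tend to overfit, one option is to fit a tree with the default settings and compare its size to the shallow tree from earlier. A sketch along those lines is shown below; the name default_tree is just for illustration, and the exact numbers you see will depend on the training data:
## Illustration: fit a tree using the default hyperparameters (max_depth=None, min_samples_split=2)
default_tree = DecisionTreeClassifier(random_state=7).fit(train_X, train_y)
## Compare the size of this tree to the max_depth=2 tree fit earlier
print(default_tree.get_depth(), default_tree.get_n_leaves())
print(my_tree.get_depth(), my_tree.get_n_leaves())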
An alternative to controlling the complexity of the tree using max_depth and min_samples_split is tuning via the min_impurity_decrease parameter, which is the minimum improvement in Gini impurity needed for a split to occur. Before relying upon this parameter, you should recognize that its scale depends on the impurity measure being used; this becomes especially relevant for regression trees later in this lab.
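As a brief illustration, a tree governed by this parameter can be fit as sketched below; the name imp_tree and the threshold of 0.01 are arbitrary choices for illustration, not recommendations:
## Illustration only: require each split to improve impurity by at least 0.01 (arbitrary threshold)
imp_tree = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(train_X, train_y)
print(imp_tree.get_depth(), imp_tree.get_n_leaves())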
Regardless of which parameters you choose to tune, cross-validation (using either grid or random search) should be used to find suitable values:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([
('model', DecisionTreeClassifier())
])
parms = {'model__max_depth': [2,3,4],
'model__min_samples_split': [10,20,40]}
grid_res = GridSearchCV(pipe, parms, cv=5, scoring = 'accuracy').fit(train_X, train_y)
print(grid_res.best_estimator_)
print(grid_res.best_score_)
Pipeline(steps=[('model', DecisionTreeClassifier(max_depth=3, min_samples_split=10))])
0.9032967032967033
Here we see that using a maximum depth of 3 and requiring at least 10 samples to be present in a node before it can be split yields a tree with a cross-validated classification accuracy of 90.33%.
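After tuning, it's natural to check how the selected tree performs on data it has never seen. One way to do this, sketched below, uses the refit model stored inside grid_res (GridSearchCV refits the best estimator on the full training set by default):
## Evaluate the best estimator found by the grid search on the held-out test set
print(grid_res.best_estimator_.score(test_X, test_y))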
Question 2: Continue using the Wisconsin breast cancer data for the following questions.
- Part A: Use cross-validation to tune a decision tree using the parameter min_impurity_decrease. Report the classification accuracy of the best estimator.
- Part B: Use plot_tree() to visually compare the best estimator you found in Part A to the best estimator found in the example above. How are these trees similar or different?
- Part C: Use GridSearchCV() to find the best-performing KNN approach for these data.
- Part D: Use GridSearchCV() to compare the cross-validated classification accuracy of your best decision tree from Part A with the best KNN approach found in Part C. You should note that standardization has no impact on the decision tree algorithm, so you can include it as a preprocessing step without causing any harm.

The regression examples in this lab will use the Iowa City home sales data set that we've also worked with in several of our previous labs.
For illustration purposes, we'll consider only two features, 'area.living' and 'bedrooms', and explore how they can be used to predict a home's sale price:
## Read IC home sales data and split into training/test sets
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = train_ic[['area.living','bedrooms']]
To begin, you should recognize that the DecisionTreeRegressor() function behaves almost exactly the same as DecisionTreeClassifier(), but there are a few key differences that we can see in the visualization created by plot_tree():
## Train the tree
from sklearn.tree import DecisionTreeRegressor
my_tree = DecisionTreeRegressor(max_depth = 2).fit(train_ic_X, train_ic_y)
## Visualize tree
plt.figure(figsize=(8,4))
plot_tree(my_tree, feature_names=train_ic_X.columns)
plt.show()
First, we see that the predicted value in each terminal node is the simple average of the outcome variable's values for the data-points contained in that node.
Next, we see that splitting rules are learned based upon the drop in squared error (not the drop in Gini impurity). This can be consequential for hyperparameter tuning because the units of squared error are related to the units of the outcome variable. However, we can avoid needing to worry about this by tuning our trees using the parameters max_depth and min_samples_split instead of the parameter min_impurity_decrease.
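To convince yourself that each terminal node simply predicts the training average of the homes it contains, one option is the fitted tree's apply() method, which reports the terminal node that each observation lands in. A minimal sketch (the name leaf_ids is just for illustration):
## Identify the terminal node (leaf) that each training observation falls into
leaf_ids = my_tree.apply(train_ic_X)
## Average sale price within each leaf -- these should match the values shown by plot_tree()
print(train_ic_y.groupby(leaf_ids).mean())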
Question 3: Be sure that you're using the Iowa City home sales data for the following questions.
- Part A: Using the data stored in train_ic_X, create a scatterplot displaying a home's living area on the x-axis and its sale amount on the y-axis.
- Part B: Notice that the decision tree we fit above only uses splits involving area.living. For this question, use hlines() with the argument color = 'red' to display this tree's predicted values for different values of area.living. Hint: your graph should display several horizontal lines, similar to the "resting metabolism" example in our lecture slides. You should consult the hlines documentation for additional information on how to use this function.
- Part C: A simple linear regression model (using area.living as the predictor) would produce predicted values that follow the straight line that best fits these data. Using your plot from Part B, what advantages and disadvantages do you see in the decision tree model when compared with simple linear regression?

Because trees are easily interpretable models, it's worth considering how to quantify the importance of individual predictors in a tree. In this context, variable importance can be assessed by calculating the total reduction in Gini impurity (or reduction in squared error for numeric outcomes) achieved by the splits involving a given variable. These reductions are then rescaled to sum to 1 to ensure a uniform meaning across different applications.
As an example, notice that 'area.living' is the only predictor involved in any of the splits in our most recent tree. Thus, we should expect that 100% of the reduction in impurity in this model is attributed to 'area.living'.
We can confirm this by printing the feature_importances_ attribute of the fitted tree:
## Feature importance
print(my_tree.feature_importances_)
## Column names (which use the same index ordering as feature importance)
print(train_ic_X.columns)
[1. 0.]
Index(['area.living', 'bedrooms'], dtype='object')
If we consider a more complex tree, such as one using max_depth=6, we see that both 'area.living' and 'bedrooms' make contributions to the tree, although 'area.living' makes a much larger contribution:
## What about a deeper tree?
deep_tree = DecisionTreeRegressor(max_depth = 6).fit(train_ic_X, train_ic_y)
print(deep_tree.feature_importances_)
[0.8979018 0.1020982]
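If you prefer a visual summary of these importance scores, one option is a simple bar chart; a minimal sketch using matplotlib:
## Visualize the importance scores from the deeper tree
plt.bar(train_ic_X.columns, deep_tree.feature_importances_)
plt.ylabel('Importance (proportion of impurity reduction)')
plt.show()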
Question 4:
- Part A: Create a new set of predictors from the Iowa City home sales data containing the variables 'built', 'bedrooms', 'area.base', 'area.living', 'area.garage1', and 'area.lot'.
- Part B: Use cross-validation to tune the max_depth and min_samples_split parameters of a decision tree model that uses the predictors from Part A to predict sale price.