This lab covers implementations of decision tree models in sklearn, including classification trees, regression trees, tree visualization, tuning parameters, and variable importance measures.
As usual, we'll begin by loading several familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
Our classification examples will use the SMS spam dataset (introduced in Lab 3 part 2, "Feature Engineering Challenge"):
## Note that the text file containing these data uses a tab delimiter to separate the label and message
sms = pd.read_csv("https://remiller1450.github.io/data/sms_spam.txt", sep='\t', names=['Label','Message'])
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(sms, test_size=0.2, random_state=8)
## Split outcome from predictors
train_y = (train['Label'] == 'spam').astype(int)
train_msg = train['Message']
Applying some of the code used in Lab 3 part 2, we'll create two features to use as predictors of whether a message is spam:
## Function to measure the proportion of digit characters in a message
def get_num(text):
    return sum(map(str.isdigit, text))/len(text)
## Define "first_word" function (first word of a message, lowercased, with '!' removed)
def first_word(text):
    return text.split(sep=' ')[0].lower().replace('!','')
## Recreate train_X using the new features
d = {'prop_num': train_msg.apply(get_num),
     'first_word': train_msg.apply(first_word)}
train_X = pd.DataFrame(d)
## Apply one-hot encoding to the 'first_word' column and keep only "urgent" and "free"
from pandas import get_dummies
train_X_ohe = get_dummies(train_X, columns=['first_word'])
train_X = train_X_ohe[['prop_num','first_word_urgent','first_word_free']]
Classification trees are fit using DecisionTreeClassifier. The example below fits a decision tree with a maximum depth of 2 to our training data:
from sklearn.tree import DecisionTreeClassifier
my_tree = DecisionTreeClassifier(max_depth = 2).fit(train_X, train_y)
A strength of decision trees is that they are easy to understand and explain. We can use the plot_tree
function to plot the structure of our fitted tree:
## Plot
from sklearn.tree import plot_tree
plot_tree(my_tree, feature_names=train_X.columns, class_names=True)
plt.show()
In this visualization, the top line of each non-terminal node displays the splitting rule used to define the corresponding child nodes. In this example, we can see that every rule used the prop_num variable. Observations where a rule evaluates to True are sent to the left, while observations where a rule evaluates to False are sent to the right.
For example, the first rule is prop_num <= 0.057. When this rule is applied to the full training set, which consists of 4457 observations, it creates a subset of 3939 observations with prop_num <= 0.057 (left branch) and another subset of 518 observations with prop_num > 0.057 (right branch).
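If you'd like to verify these counts yourself, a short sketch like the one below should work; it pulls the exact (unrounded) threshold of the root split from the fitted tree's tree_ attribute rather than using the rounded value 0.057 shown in the plot:
## Count observations on each side of the root split using the exact stored threshold
root_threshold = my_tree.tree_.threshold[0]
print((train_X['prop_num'] <= root_threshold).sum())  ## left branch (3939 in our split)
print((train_X['prop_num'] > root_threshold).sum())   ## right branch (518 in our split)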
We are also given information about the Gini impurity and distribution of outcomes in each node. For example, the bottom-right node contains 24 non-spam (y = 0) and 433 spam (y = 1) messages and has a Gini impurity of approximately 0.1 (indicating a low level of impurity, or a high level of purity).
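As a quick sanity check, this Gini value can be reproduced by hand from the node's counts (24 and 433), since the Gini impurity of a node is 1 minus the sum of the squared class proportions:
## Recompute the Gini impurity of the bottom-right node by hand
p_spam = 433 / (24 + 433)
print(1 - (p_spam**2 + (1 - p_spam)**2))  ## roughly 0.0995, which plot_tree rounds to 0.1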
The value information shown in each node of the tree can be used to verify the predicted probabilities produced by predict_proba:
## Code for Part C
pred_probs = my_tree.predict_proba(train_X)
np.unique(pred_probs, axis=0)
array([[0.05251641, 0.94748359], [0.40983607, 0.59016393], [0.8119469 , 0.1880531 ], [0.98910238, 0.01089762]])
When thinking about decision trees in comparison to other modeling approaches, you should keep in mind that they recursively partition the feature space.
Because this tree only involves a single variable, it's easy to represent its partitions visually:
## Plot the training data (with random vertical jitter) and the tree's split points
plt.scatter(train_X['prop_num'],0.1*np.random.randn(len(train_y)), c = train_y, cmap='viridis', alpha = 0.3)
plt.vlines(x=[0.057, 0.012, 0.08], ymin=-0.5, ymax=0.5, color = ["red","green","blue"])
plt.show()
In this plot, the y-values are artificial (random noise) used to spread out the data-points enough that we can see them. The splits are drawn as vertical lines. According to our model, any data-point to the right of the blue line has a ~94.7% estimated probability of being spam. Similarly, data-points between the red and blue lines have a ~59% estimated probability of being spam, etc.
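If you'd like to confirm these region-level probabilities without reading them off the tree, one option (a small sketch, using the split points read from the plot and arbitrary representative values inside each region) is to pass a small grid of hypothetical messages to predict_proba, with the two first-word dummy columns simply set to 0:
## One representative prop_num value per region (regions defined by the splits 0.012, 0.057, 0.08)
region_grid = pd.DataFrame({'prop_num': [0.005, 0.03, 0.07, 0.10],
                            'first_word_urgent': 0,
                            'first_word_free': 0})
print(my_tree.predict_proba(region_grid)[:, 1])  ## estimated spam probability for each region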
For Question #2, you should use the code provided below, which recreates the train_X data frame so that it includes a new feature measuring the proportion of alphanumeric characters in a message.

Question #2:
Part A: Fit a classification tree to the recreated train_X with a maximum depth of 2. Then visualize the tree to obtain its splitting rules.
Part B: Create a scatterplot with prop_num on the x-axis and prop_alpha on the y-axis, with each data-point colored by whether or not it is spam. Then, use vlines and hlines to draw the splitting rules from the tree fit in Part A onto your scatterplot. Your lines should divide the scatterplot into 4 distinct regions.

## Function to measure the proportion of alphanumeric characters in a message
def alpha_percent(text):
    return sum(map(str.isalnum, text))/len(text)
## Recreate train_X using this new feature (dropping the first-word indicators)
d = {'prop_num': train_msg.apply(get_num),
     'prop_alpha': train_msg.apply(alpha_percent)}
train_X = pd.DataFrame(d)
Decision trees are prone to overfitting, making hyperparameter tuning an important consideration before fitting a tree and applying it to new data.
The two most important tuning parameters (in regard to overfitting) are:
max_depth - which controls the depth of the tree
min_samples_split - which controls how many observations must be present in a node for it to be eligible for splitting
In sklearn, the default value of max_depth is None, which will result in the tree continuing to split the data until all terminal nodes are pure or contain fewer observations than the value of min_samples_split. This can be problematic, since the default for min_samples_split is 2.
Alternatively, you may choose to protect against overfitting using the min_impurity_decrease argument, which allows you to specify the minimum improvement in Gini impurity needed for a split to occur. However, a drawback to this approach is that the relationship between this parameter and the structure of the final tree is highly data-dependent, meaning you won't be able to anticipate the depth/complexity of your tree using this parameter.
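To illustrate this, the short sketch below fits a tree using an arbitrary value of min_impurity_decrease (0.001 is chosen purely for illustration) and then checks the depth and number of terminal nodes that resulted, since these can't be anticipated in advance:
## Fit a tree controlled only by min_impurity_decrease, then inspect its complexity
tree_mid = DecisionTreeClassifier(min_impurity_decrease=0.001).fit(train_X, train_y)
print(tree_mid.get_depth())     ## resulting depth (data-dependent)
print(tree_mid.get_n_leaves())  ## resulting number of terminal nodes (data-dependent)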
Regardless of which parameters you choose to tune, cross-validation (and grid/random search) should be used to find suitable values:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([
('tree', DecisionTreeClassifier())
])
parms = {'tree__max_depth': [2,3,4,5,6],
'tree__min_samples_split': [20,40,60,80,100]}
grid_res = GridSearchCV(pipe, parms, cv=5, scoring = 'f1').fit(train_X, train_y)
print(grid_res.best_estimator_)
print(grid_res.best_score_)
Pipeline(steps=[('tree', DecisionTreeClassifier(max_depth=4, min_samples_split=80))])
0.8667592885318941
In this example, we are able to achieve a cross-validated F1-score of approximately 0.87, or slightly worse than the benchmark (of 0.90) given in Lab 3 using KNN. However, it's important to recognize that our decision tree model is much easier to explain and visualize, and it requires much less information to be stored in order to generate predictions on new data.
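As a rough illustration of the storage point (a sketch; node_count is simply the total number of nodes in the fitted tree), we can compare the size of the selected tree to the number of training observations a KNN model would need to retain:
## Number of nodes in the selected tree vs. number of training rows stored by KNN
best_tree = grid_res.best_estimator_.named_steps['tree']
print(best_tree.tree_.node_count)  ## nodes needed to describe the fitted tree
print(len(train_X))                ## training rows a KNN model must keep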
Question #3: Use cross-validation to tune min_impurity_decrease over the values [0.00001, 0.0001, 0.001, 0.01]. Report the optimal value of min_impurity_decrease and the corresponding model's cross-validated F1-score. How does this compare with the performance of the tuned max_depth we found earlier?

Similar to k-nearest neighbors, the decision tree framework is suitable for both classification and regression tasks. For these examples we'll use the Iowa City home sales data with only two predictors, the home's living area and number of bedrooms, and we'll aim to predict a home's sale price.
## Read IC home sales data and split into training/test
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = train_ic[['area.living','bedrooms']]
The DecisionTreeRegressor function is nearly identical to DecisionTreeClassifier, with a few distinctions that can be seen in the output given by plot_tree:
from sklearn.tree import DecisionTreeRegressor
my_tree = DecisionTreeRegressor(max_depth = 2).fit(train_ic_X, train_ic_y)
plt.figure(figsize=(8,4))
plot_tree(my_tree, feature_names=train_ic_X.columns)
plt.show()
First, we see that the predicted value in each terminal node is a simple average of the outcome variable for the data-points belonging to that node (not the majority class).
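A quick way to convince yourself of this (a sketch using the apply method, which returns the terminal node that each observation lands in) is to compare the model's predictions with the average sale price within each terminal node:
## Predictions should match the mean outcome of each observation's terminal node
leaf_ids = my_tree.apply(train_ic_X)
leaf_means = train_ic_y.groupby(leaf_ids).transform('mean')
print(np.allclose(my_tree.predict(train_ic_X), leaf_means))  ## expect True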
Additionally, we see that splitting rules were learned based upon the drop in squared error (not the drop in Gini impurity). This is particularly consequential for hyperparameter tuning, as squared error has units that depend upon the units of the outcome variable. A practical implication is that we should refrain from using the parameter min_impurity_decrease
when tuning a regression tree.
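Consequently, when tuning a regression tree it's more natural to search over max_depth and/or min_samples_split with an error metric whose units we understand. A hedged sketch of that setup (the scoring string and parameter grid below are illustrative choices, not values used elsewhere in this lab) might look like:
## Sketch: cross-validated tuning of a regression tree on the Iowa City data
reg_parms = {'max_depth': [2,3,4,5,6],
             'min_samples_split': [20,40,60,80,100]}
reg_grid = GridSearchCV(DecisionTreeRegressor(), reg_parms, cv=5,
                        scoring='neg_root_mean_squared_error').fit(train_ic_X, train_ic_y)
print(reg_grid.best_params_)
print(-reg_grid.best_score_)  ## cross-validated RMSE of the best combination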
Question #4: Earlier, we used vlines and hlines to draw lines displaying splitting rules. For this question, I'd like you to add horizontal lines displaying the predictions of the tree we fit in the previous example to a scatterplot of living area and sale price. Use the argument color = 'red' to display the model more prominently. Does this tree appear to fit the data reasonably well (considering it effectively only uses area.living as a predictor)?

Because trees are easily interpretable models, it's worth considering how to quantify the importance of individual predictors in a tree. In this context, variable importance can be assessed by calculating the total reduction in Gini impurity (or reduction in squared error for numeric outcomes) achieved by the splits involving a given variable. These reductions are then rescaled to sum to 1 to ensure a uniform meaning across different applications.
As an example, notice that "area.living" is the only predictor involved in any of the splits in our most recent tree. Thus, we should expect that 100% of the reductions in impurity achieved by this model are attributed to "area.living".
We can confirm this by printing the feature_importances_
attribute of the fitted tree:
print(my_tree.feature_importances_) ## Feature importance
print(train_ic_X.columns) ## Column names (same ordering as feature importance)
[1. 0.]
Index(['area.living', 'bedrooms'], dtype='object')
If we consider a more complex tree, such as one using max_depth=6
, we see that both "area.living" and "bedrooms" contribute (though area.living makes a much larger contribution):
## Create a deeper tree
deep_tree = DecisionTreeRegressor(max_depth = 6).fit(train_ic_X, train_ic_y)
print(deep_tree.feature_importances_)
[0.89595436 0.10404564]
You should note that the order of feature importances in the output shown above aligns with the order that these predictors appear in the training data matrix:
train_ic_X.columns
Index(['area.living', 'bedrooms'], dtype='object')
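A convenient way to keep the importances paired with their names is to wrap them in a pandas Series indexed by the column names (a small sketch):
## Display feature importances labeled by predictor name
print(pd.Series(deep_tree.feature_importances_, index=train_ic_X.columns))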
For the final question, use cross-validation to tune the max_depth and min_samples_split parameters of a decision tree model that uses the predictors from Part A to predict sale price.