Lab #6 (part 1) - Decision Trees¶

This lab covers implementations of decision tree models in sklearn, including classification trees, regression trees, tree visualization, tuning parameters, and variable importance measures.

As usual, we'll begin by loading several familiar libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

Our classification examples will use the SMS spam dataset (introduced in Lab 3 part 2, "Feature Engineering Challenge"):

In [2]:
## Note that the text file containing these data uses a tab delimiter to separate the label and message
sms = pd.read_csv("https://remiller1450.github.io/data/sms_spam.txt", sep='\t', names=['Label','Message'])

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(sms, test_size=0.2, random_state=8)

## Split outcome from predictors
train_y = (train['Label'] == 'spam').astype(int)
train_msg = train['Message']

Applying some of the code used in Lab 3 part 2, we'll create two features to use as predictors of whether a message is spam:

In [3]:
## Function to measure the proportion of numeric characters in a message
def get_num(text):
    return sum(map(str.isdigit, text))/len(text)

## Define "first_word" function
def first_word(text):
    return text.split(sep=' ')[0].lower().replace('!','')

## Recreate train_X using the new feature
d = {'prop_num': train_msg.apply(get_num),
    'first_word': train_msg.apply(first_word)}
train_X = pd.DataFrame(d)

## Apply OHE to the 'first_word' column and keep only "urgent" and "free"
from pandas import get_dummies
train_X_ohe = get_dummies(train_X, columns=['first_word'])
train_X = train_X_ohe[['prop_num','first_word_urgent','first_word_free']]

Part 1 - Classification Trees¶

Classification trees are fit using DecisionTreeClassifier. The example below fits a decision tree with a maximum depth of 2 to our training data:

In [4]:
from sklearn.tree import DecisionTreeClassifier
my_tree = DecisionTreeClassifier(max_depth = 2).fit(train_X, train_y)

A strength of decision trees is that they are easy to understand and explain. We can use the plot_tree function to plot the structure of our fitted tree:

In [5]:
## Plot
from sklearn.tree import plot_tree
plot_tree(my_tree, feature_names=train_X.columns, class_names=True)
plt.show()

In this visualization, the top line of each non-terminal node displays the splitting rule used to define the corresponding child nodes. In this example, we can see that every rule used the prop_num variable. Observations where a rule evaluates to True are sent to the left, while observations where a rule evaluates to False are sent to the right.

For example, the first rule is prop_num <= 0.057. When this rule is applied to the full training set, which consists of 4457 observations, it creates a subset of 3939 observations with prop_num <= 0.057 (left branch) and another subset of 518 observations with prop_num > 0.057 (right branch).
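
If you prefer to read the splitting rules as text rather than as a diagram, sklearn also provides an export_text function. A minimal sketch using the same fitted tree and feature names as above:

## Print the tree's splitting rules as indented text
from sklearn.tree import export_text
print(export_text(my_tree, feature_names=list(train_X.columns)))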

We are also given information about the Gini impurity and the distribution of outcomes in each node. For example, the bottom right node contains 24 non-spam (y = 0) and 433 spam (y = 1) messages and has a Gini impurity of 0.1 (indicating a low level of impurity, or a high level of purity).
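
As a quick check of how these impurity values arise, we can compute the Gini impurity of the bottom right node directly from its class counts (24 non-spam and 433 spam, as reported in the plot):

## Gini impurity of the bottom right node: 1 minus the sum of squared class proportions
counts = np.array([24, 433])
props = counts / counts.sum()
print(1 - np.sum(props**2))  ## approximately 0.10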

Question #1¶

  • Part A) Notice that the final prediction for the two terminal nodes on the left side of the tree is "not spam", and the prediction for the rightmost two nodes is "spam". Does this mean that the second round of splits is inconsequential? Briefly explain.
  • Part B) Verify that the Gini impurity before any splits have occurred is 0.2303647 (which was rounded in the plotted tree) by performing the calculation yourself using the value information in the tree.
  • Part C) The code given below finds the tree's predicted probabilities (of "not spam", "spam") for each observation in the training data, then prints the unique rows of the resulting array. Briefly explain how the printed information relates to the information in the tree.
In [6]:
## Code for Part C
pred_probs = my_tree.predict_proba(train_X)
np.unique(pred_probs, axis=0)
Out[6]:
array([[0.05251641, 0.94748359],
       [0.40983607, 0.59016393],
       [0.8119469 , 0.1880531 ],
       [0.98910238, 0.01089762]])

When thinking about decision trees in comparison to other modeling approaches, you should keep in mind that they recursively partition the feature space.

Because this tree only involves a single variable, it's easy to represent its partitions visually:

In [7]:
plt.scatter(train_X['prop_num'],0.1*np.random.randn(len(train_y)), c = train_y,  cmap='viridis', alpha = 0.3)
plt.vlines(x=[0.057, 0.012, 0.08], ymin=-0.5, ymax=0.5, color = ["red","green","blue"])
Out[7]:
<matplotlib.collections.LineCollection at 0x26689b0c670>

In this plot, the y-values are artificial (random noise) to spread out the data-points enough that we can see them. The splits are drawn as vertical lines. According to our model, any data-point to the right of the blue line has a ~94.7% estimated probability of being spam. Similarly, data-points between the red and blue lines have a ~59% estimated probability of being spam, etc.
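
To connect the plotted regions back to the model, we can pass a few hypothetical messages through predict_proba, using one prop_num value per region and setting the first-word indicators to zero. The rows of the result should match the region probabilities described above. A small sketch, where the chosen prop_num values are made up purely for illustration:

## Hypothetical points, one in each region of the partition (values chosen for illustration)
example_X = pd.DataFrame({'prop_num': [0.005, 0.03, 0.07, 0.10],
                          'first_word_urgent': [0, 0, 0, 0],
                          'first_word_free': [0, 0, 0, 0]})
my_tree.predict_proba(example_X)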

Question #2¶

For Question #2, you should use the code provided below, which recreates the train_X data frame so that it includes a new feature measuring the proportion of alphanumeric characters in a message.

  • Part A - Fit a decision tree using the new version of train_X with a maximum depth of 2. Then visualize the tree to obtain its splitting rules.
  • Part B - Create a scatter plot displaying the predictor prop_num on the x-axis and prop_alpha on the y-axis with each data-point colored by whether or not it is spam. Then, use vlines and hlines to draw the splitting rules from the tree fit in Part A onto your scatterplot. Your lines should divide the scatterplot into 4 distinct regions.
In [8]:
## Function to measure the proportion of alphanumeric characters in a message
def alpha_percent(text):
    return sum(map(str.isalnum, text))/len(text)

## Recreate train_X using this new feature (dropping the first words)
d = {'prop_num': train_msg.apply(get_num),
    'prop_alpha': train_msg.apply(alpha_percent)}
train_X = pd.DataFrame(d)

Part 2 - Tuning Parameters¶

Decision trees are prone to overfitting, making hyperparameter tuning an important consideration before fitting a tree and applying it to new data.

The two most important tuning parameters (in regard to overfitting) are:

  • max_depth - which controls the depth of the tree
  • min_samples_split - which controls how many observations must be present in a node for it to be eligible for splitting

In sklearn, the default value of max_depth is None, which will result in the tree continuing to split the data until all terminal nodes contain fewer observations than the value of min_samples_split. This can be problematic, since the default for min_samples_split is 2.
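
To see why these defaults can be problematic, we can fit a tree with no constraints and inspect how large it grows. A quick sketch (the exact depth and leaf count will depend on the data):

## Fit an unconstrained tree using the default settings
full_tree = DecisionTreeClassifier().fit(train_X, train_y)
print(full_tree.get_depth())     ## depth of the fully grown tree
print(full_tree.get_n_leaves())  ## number of terminal nodes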

Alternatively, you may also choose to protect against overfitting using the min_impurity_decrease argument, which allows you to specify the minimum improvement in gini impurity needed for a split to occur. However, a drawback to this approach is that the relationship between this parameter and the structure of the final tree is highly data-dependent, meaning you won't be able to anticipate the depth/complexity of your tree using this parameter.

Regardless of which parameters you choose to tune, cross-validation (and grid/random search) should be used to find suitable values:

In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline 

pipe = Pipeline([
    ('tree', DecisionTreeClassifier())
])

parms = {'tree__max_depth': [2,3,4,5,6],
         'tree__min_samples_split': [20,40,60,80,100]}

grid_res = GridSearchCV(pipe, parms, cv=5, scoring = 'f1').fit(train_X, train_y)
print(grid_res.best_estimator_)
print(grid_res.best_score_)
Pipeline(steps=[('tree',
                 DecisionTreeClassifier(max_depth=4, min_samples_split=80))])
0.8667592885318941

In this example, we are able to achieve a cross-validated F1-score of approximately 0.87, or slightly worse than the benchmark (of 0.90) given in Lab 3 using KNN. However, it's important to recognize that our decision tree model is much easier to explain and visualize, and it requires much less information to be stored to generate predictions on new data.
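
For instance, we can pull the tuned tree out of the best pipeline and visualize it with plot_tree, just as we did earlier. A brief sketch using the grid search results from above:

## Extract the fitted tree from the best pipeline and plot it
best_tree = grid_res.best_estimator_.named_steps['tree']
plot_tree(best_tree, feature_names=train_X.columns, class_names=True)
plt.show()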

Question #3¶

  • Part A - Using the same pipeline defined in the previous example, perform a cross-validated grid search using the tuning parameter min_impurity_decrease over the values [0.00001,0.0001,0.001,0.01]. Report the optimal value of min_impurity_decrease and the corresponding model's cross-validated F1-score.
  • Part B - Display the fitted model from Part A. How does the depth of this tree compare to the optimal value of max_depth we found earlier?

Part 3 - Regression Trees¶

Similar to k-nearest neighbors, the decision tree framework is suitable for both classification and regression tasks. For these examples we'll use the Iowa City home sales data with only two predictors, the home's living area and number of bedrooms, and we'll aim to predict a home's sale price.

In [10]:
## Read IC home sales data and split into training/test
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)

## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = train_ic[['area.living','bedrooms']]

The DecisionTreeRegressor function is nearly identical to DecisionTreeClassifier, with a few distinctions that can be seen in the output given by plot_tree:

In [11]:
from sklearn.tree import DecisionTreeRegressor
my_tree = DecisionTreeRegressor(max_depth = 2).fit(train_ic_X, train_ic_y)
plt.figure(figsize=(8,4))
plot_tree(my_tree, feature_names=train_ic_X.columns)
plt.show()

First, we see that the predicted value in each terminal node is a simple average of the outcome variable for the data-points belonging to that node (not the majority class).
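
We can verify this directly: the apply method returns the terminal node that each training observation lands in, and averaging the sale prices within each node should reproduce the tree's predicted values. A quick sketch:

## Leaf membership for each training observation
leaf_ids = my_tree.apply(train_ic_X)

## Average sale price within each terminal node (should match the leaf predictions)
print(train_ic_y.groupby(leaf_ids).mean())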

Additionally, we see that splitting rules were learned based upon the drop in squared error (not the drop in Gini impurity). This is particularly consequential for hyperparameter tuning, as squared error has units that depend upon the units of the outcome variable. A practical implication is that we should refrain from using the parameter min_impurity_decrease when tuning a regression tree.
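
To make the unit-dependence concrete, the impurity reported at the root of this tree is simply the variance of the training sale prices, measured in squared dollars. Rescaling the outcome (say, to thousands of dollars) would shrink every impurity value by a factor of 1,000², so a fixed min_impurity_decrease threshold would not carry the same meaning across scales. A quick check, assuming the squared-error criterion (the default for DecisionTreeRegressor):

## Root-node impurity vs. the variance of the outcome (both in squared dollars)
print(my_tree.tree_.impurity[0])
print(np.var(train_ic_y))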

Question #4¶

  • Part A - Using the training data, create a scatterplot displaying a home's living area on the x-axis and its sale amount on the y-axis.
  • Part B - Earlier in the lab, we used hlines to draw lines displaying splitting rules. For this question, I'd like you to add horizontal lines displaying predictions for the tree we fit in the previous example. Use the argument color = 'red' to display the model more prominently.
  • Part C - Looking at the plot you created in Part B, what are strengths/weaknesses of this decision tree model compared with a simple linear regression model (which considers only area.living as a predictor)?
  • Part D - Could you come up with a linear regression model that exhibits similar behavior (ie: generates predictions using similar rules)? If so, explain how.

Part 4 - Variable Importance¶

Because trees are easily interpretable models, it's worth considering how to quantify the importance of individual predictors in a tree. In this context, variable importance can be assessed by calculating the total reduction in Gini impurity (or reduction in squared error for numeric outcomes) achieved by the splits involving a given variable. These reductions are then rescaled to sum to 1 to ensure a uniform meaning across different applications.

As an example, notice that "area.living" is the only predictor involved in any of the splits in our most recent tree. Thus, we should expect that 100% of the reductions in impurity achieved by this model are attributed to "area.living".

We can confirm this by printing the feature_importances_ attribute of the fitted tree:

In [12]:
print(my_tree.feature_importances_)  ## Feature importance
print(train_ic_X.columns)  ## Column names (same ordering as feature importance)
[1. 0.]
Index(['area.living', 'bedrooms'], dtype='object')

If we consider a more complex tree, such as one using max_depth=6, we see that both "area.living" and "bedrooms" contribute (though area.living makes a much larger contribution):

In [13]:
## create a deeper tree
deep_tree = DecisionTreeRegressor(max_depth = 6).fit(train_ic_X, train_ic_y)
print(deep_tree.feature_importances_)
[0.89595436 0.10404564]

You should note that the order of the feature importances in the output shown above matches the order in which these predictors appear in the training data matrix:

In [14]:
train_ic_X.columns
Out[14]:
Index(['area.living', 'bedrooms'], dtype='object')
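
A convenient way to keep the importances and predictor names paired (and to confirm they sum to 1 after rescaling) is to store them in a pandas Series, for example:

## Pair each importance with its column name and sort from most to least important
imp = pd.Series(deep_tree.feature_importances_, index=train_ic_X.columns)
print(imp.sort_values(ascending=False))
print(imp.sum())  ## importances are rescaled to sum to 1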

Question #5¶

  • Part A - Create a new predictor matrix for the Iowa City home sales data that contains the predictors 'built', 'bedrooms', 'area.base', 'area.living', 'area.garage1', and 'area.lot'.
  • Part B - Use 5-fold cross-validation to tune the max_depth and min_samples_split parameters of a decision tree model that uses the predictors from Part A to predict sale price.
  • Part C - Find the importance of each predictor in the best decision tree identified in Part B.
  • Part D - Evaluate the performance of the best decision tree identified in Part B on the test set using root mean squared error.
  • Part E - Write a short paragraph summarizing the best fitting decision tree model you identified during this question. You should aim to write your paragraph in a professional tone that would be suitable for a scientific paper. You should mention all important details, including how you determined the best fitting model, how well it performs, which variables it uses, and which variables are most important (and how you know they're important).