Lab 8 - Ensemble Models and Feature Engineering¶

This lab will cover a few different ensemble approaches, including the random forest algorithm and custom ensembles that combine multiple base models of different types.

It will also introduce the topic of "feature engineering" which is the processing of generating predictive features from non-standard data formats (text, images, time-series, etc.) so that the data conforms to the "spreadsheet format" that is typically expected for the models we've encountered so far in the course.

To begin, you should load our standard set of libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

You'll also need two different data sets, "clickbait" and "activity":

In [2]:
## Load data sets
clickbait = pd.read_csv("https://remiller1450.github.io/data/clickbait_headlines.csv", encoding='latin-1')
act_ts = pd.read_csv("https://remiller1450.github.io/data/activity.csv")

The "clickbait" data set contains a total of $n=6000$ headlines that were collected from two types of websites, 3000 headlines come from websites considered to be clickbait publishers (Upworthy, Buzzfeed, Thatscoop, ViralNova, Scoopwhoop, and ViralStories) and the remaining 3000 come from websites that were deemed not to publish clickbait (New York Times, WikiNews, The Guardian, and The Hindu).

The goal for these data is to use the headline text, Headline, to predict whether or not the article came from a source of clickbait (Label = 1 denoting articles from clickbait websites).

The "activity" data set contains accelerometer measures for $n=15$ participants performing 7 different activities. The data are recorded at 52 Hz (meaning there are 52 measurements for each second of data collection). Our analysis will focus on classifying 10-second samples of movement based upon time-series measurements from the accelerometer. These 10-second samples have already been defined using the variable Sample_Num.

Part 1 - Feature Engineering Textual Data¶

When analyzing textual data your natural instinct might be to tokenize the text and use the relative frequencies of each unique word within the text as predictive features, an approach known as bag of words. However, this creates a very high-dimensional feature space containing many features that aren't useful for prediction. So, unless you have a massive amount of data, it can be more effective to create your own custom features.

To begin, let's create separate training and testing sets, which is straightforward since the data already have each sample that we're seeking to classify as its own row:

In [3]:
## Train-Test split
from sklearn.model_selection import train_test_split
train_clickbait, test_clickbait = train_test_split(clickbait, test_size=0.2, random_state=7)
train_clickbait_y = train_clickbait['Label']
train_clickbait_X = train_clickbait['Headline']
test_clickbait_y = test_clickbait['Label']
test_clickbait_X = test_clickbait['Headline']

The training testing split is particularly important when feature engineering because we can explore the patterns in the training data to generate predictive features while still getting an unbiased estimate of how well these features perform on new data.

With this in mind, a good place to start is looking at a few clickbait and non-clickbait examples and trying to identify patterns:

In [4]:
## Examples of clickbait
train_clickbait_X[train_clickbait_y == 1].sample(15, random_state=9)
Out[4]:
3327    What's Your Craziest Bachelorette Party Experi...
3623                                 Spike Surprises Fans
3436    15 Brilliant And Creative Ways To Use Spaghett...
4026    19 Things You'll Just Get If You're Obsessed W...
5605    How Well Do You Know These Random, 16th Centur...
3595    Kanye West Auditioned For "American Idol" This...
4936    This Korean Pop Star Can't Stop Falling On Sta...
5563           Who Said It: Gene Belcher Or Chris Griffin
3639      Are You More Bailee Madison Or Hailee Steinfeld
4095     For Anyone Who Had A Crush On Sheik From "Zelda"
4518          Send This To The Person Who Gives Bad Gifts
5871    This Canadian Woman Opened Up A Can Of Thanksg...
4513    21 Gifts To Give Yourself When Adulting Is Too...
5534    17 Pairs Of Actually Cute Sweatpants That'll K...
4654        12 Charts Only Sephora Lovers Will Understand
Name: Headline, dtype: object
In [5]:
## Examples of non-clickbait
train_clickbait_X[train_clickbait_y == 0].sample(15, random_state=9)
Out[5]:
873     Both sides of Kenya's constitution dispute are...
2456        Italy Seizes Millions in Assets of Four Banks
2603    Truck carrying explosives crashes, explodes in...
1348    Caloundra win Sunshine Coast, Australia cricke...
2159         Southern Gaza hit by new Israeli air strikes
832               Posted deadlines for Christmas delivery
340     Rumsfeld memo recognizes need for 'major adjus...
220                              Horse flu damage spreads
269                     Officials Identify Alabama Gunman
2619         MS-13 gang threatens the Arizona "minutemen"
2710    18 illegal Afghan and Burmese immigrants kille...
1855    Financial Safety Net of Nonprofit Organization...
1317    United Nations requests US$700 million in aid ...
77      English "Lady in the Lake" killer found dead i...
2059               Tomb discovered in Valley of the Kings
Name: Headline, dtype: object

One pattern that you might notice is that clickbait headlines seem more likely to be framed as questions that start with words like "Would", "What", "Which", "How", and "Can" than non-clickbait headlines. We could generate a predictive feature representing this trend using the following function:

In [6]:
def first_word_q(input_text):
    first_word = input_text.str.split(' ').str[0]
    return (first_word == "What") | (first_word == "Which") | (first_word == "Would") | (first_word == "How") | (first_word == "Can") | (first_word == "Are")

Applying this function we can see that less than 1% of non-clickbait headlines start with a question word, but a significant percentage of clickbait headlines start with one of these words:

In [7]:
## Proportion of clickbait starting w/ a question word
np.average(first_word_q(train_clickbait_X)[train_clickbait_y == 1])
Out[7]:
0.15651085141903173
In [8]:
## Proportion of non-clickbait starting w/ a question word
np.average(first_word_q(train_clickbait_X)[train_clickbait_y == 0])
Out[8]:
0.0024958402662229617

A final thing you should be aware of when engineering textual features is the apply() method of pandas DataFrames and Series, which can be useful when you want to apply functions that are not vectorized. An example is shown below:

In [9]:
## Function to count appearances of "US" or "UK" or "EU" in a given string
def count_US(text):
    return text.count('US') + text.count('U.S.') + text.count('UK') + text.count('U.K.') +  text.count('EU') + text.count('E.U.') 

## Print how many of each in the training data
train_clickbait_X.apply(count_US).value_counts()
Out[9]:
0    4562
1     235
2       3
Name: Headline, dtype: int64

Finally, it's beyond the scope of this course to comprehensively cover string processing in Python, but this reference provides a nice list of string methods to you might find useful in this lab and on Homework 3. This is also one area of the course where you are welcome to rely upon resources outside of our lectures and labs so long as you cite them.

Question #1:

  • Part A: Create a feature that records whether the first "word" in a headline is a number (ie: "18" or "15"). Based upon the proportion of clickbait and non-clickbait headlines with values of True for this feature, does the feature seem like it would be useful for prediction?
  • Part B: Create a feature that records the proportion of characters in the headline that are numeric. Using either a data visualization or descriptive statistics, does this feature seem like it might be useful for prediction?
  • Part C: Create a Data Frame containing four engineered features: the results from Parts A and B, as well as the two features from this section's examples.
  • Part D: Fit a decision tree classifier to training data (the Data Frame you created in Part C) and display the fitted model using a tree diagram (plot_tree).
  • Part E: Evaluate the decision tree classifier on the test data using classification accuracy as the performance metric. Note that you will need to create a data frame containing the engineered features for the test set in order to evaluate the model.

Part 2 - Feature Engineering Time-Series Data¶

The "activity" data set is structured so that each row represents a measurement at a particular time step for a study participant, which poses a problem for the typical way we'd create a training-testing split using train_test_split(). As an alternative, we might opt to use data from Subject 1 for development and training, and data from another subject (Subject 7 here) for model evaluation. This has the added benefit of giving us a clear indication of how well features generated for a certain individual might generalize to other individuals.

In [10]:
## Training Data
train_act_X = act_ts[act_ts['Subject'] == 1].drop(['Label'], axis = 1)
train_act_y = act_ts[act_ts['Subject'] == 1]['Label']

## Validation Data
test_act_X = act_ts[act_ts['Subject'] == 7].drop(['Label'], axis = 1)
test_act_y = act_ts[act_ts['Subject'] == 7]['Label']

To make things even more simple, we'll also focus on two of the activity types that were labeled in the study, "Work", which indicates the subject was working at their computer, and "Talk", which indicates the subject was talking while standing.

Here are what a few of the variables look like for these two labels:

In [11]:
## Create plots of select variables over time and color by activity (label)
my_vars = ['X_accel', 'Y_accel', 'Z_accel']
for cur_var in my_vars:
    plt.figure().set_figheight(2)
    plt.scatter(x = range(len(train_act_X[cur_var])), y=train_act_X[cur_var], c = train_act_y.map({'Work': 1, 'Talk': 0}))
    plt.title(cur_var)
    plt.show()

There are many possible feature engineering ideas that we migh glean from these visualizations, but two that stand out are:

  1. In general, working tends to have a lower amount of variability in the accelerometer's values across all axes.
  2. Working tends to have higher average values, particularly in the X and Z axes.

To translate these patterns into predictive features we can use grouped summarization:

In [12]:
## Get the mean and std dev of each selected variable for each sample
train_X_FE = pd.concat([train_act_X.groupby('Sample_Num')[my_vars].mean(),
                        train_act_X.groupby('Sample_Num')[my_vars].std()], axis=1)
train_X_FE.columns = ['Mean_X', 'Mean_Y', 'Mean_Z', 'SD_X', 'SD_Y', 'SD_Z']
train_X_FE.head(5)
Out[12]:
Mean_X Mean_Y Mean_Z SD_X SD_Y SD_Z
Sample_Num
1 1897.030769 2292.657692 2064.600000 170.098461 143.059786 125.358252
2 1957.184615 2379.226923 2108.601923 11.067517 7.863383 16.358731
3 1962.921154 2378.396154 2112.584615 5.970345 3.332676 5.728319
4 1969.034615 2375.501923 2120.165385 14.418689 7.394212 15.416865
5 1976.882692 2373.157692 2130.376923 7.419165 6.416962 7.635497

Notice that we grouped by Sample_Num because we would like to make activity predictions for each 10-second sample in the training and testing sets.

A simple visual analysis suggests these features will be highly predictive:

In [14]:
## Plot two of the engineered features and color by the target variable (label)
train_y_FE = act_ts[act_ts['Subject'] == 1].groupby('Sample_Num')['Label'].first()
plt.scatter(train_X_FE['Mean_X'], train_X_FE['SD_X'], c = train_y_FE.map({'Work': 1, 'Talk': 0}))
plt.show()

Notice how we also had to remove the time-series component of the outcome variable, Label, in order to make this visualization.

Question #2:

  • Part A: Train a support vector machine classifier using the 6 engineered features from this section's example to predict whether the subject's action is working or talking. Choose an appropriate kernel function based upon a visual assessment of the data, and make sure you standardize the data prior to fitting the model. Report the cross-validated classification accuracy of this approach.
  • Part B: Evaluate your best performing method on the test set and briefly comment upon how successful you were.

Question #3:

Question 3: The data given below are from a different portion of the activity recognition study. For this question you'll use the data from three subjects and two different activities: walking alone, or 'Walking', and talking to someone else while walking, or 'Walking_Talking'.

https://remiller1450.github.io/data/activity_v2.csv

  • Part A: Split the data to use the subjects #5 and #7 as training data and subject #1 as test data.
  • Part B: Create at least one data visualization and use it to engineer at least two features derived from the variables X_accel, Y_accel, and Z_accel. Your answer to this question should include your visualization and code that creates a version of these data with 1 row per sample with your engineered features as the columns.
  • Part C: Create a pipeline involving at least one preprocessing step (ie: standardization/re-scaling, normalizing transformation, etc.) followed by a classifier step. Use a cross-validated grid search to tune at least one hyperparameter involved in this pipeline. You may choose any classification model we've used so far in the course (ie: KNN, decision tree, SVM, etc.)
  • Part D: Evaluate your best performing method on the test set and briefly comment upon how successful you were. It's okay if your approach didn't produce a high level of accuracy so long as you followed a logical step of modeling steps.

Part 3 - Random Forest¶

Recall that the random forest algorithm uses the predictions from a large set of decision trees to make an overall prediction. The success of this algorithm is based upon the creation of a diverse set of trees that each capture different patterns present in the training data. To achieve this, the following hyperparameters are most important:

  • max_depth - the maximum depth of the individual trees in the ensemble
  • min_samples_split - the minimum samples in a node for it to be eligible for splitting (for a single tree in the ensemble)
  • max_features - typically given as an int representing the number of randomly selected predictors to be considered in an individual tree in the ensemble.

You should also be aware of the n_estimators parameter, which governs the number of trees in the forest. If this parameter is set too low, the forest may not be sufficiently flexible to capture all of the meaningful patterns in the data. However, there is no benefit to choosing a very large number of trees as the performance of a random forest tends to stabilize with a certain number of trees and any more will only add computational burden.

Below is a quick demonstration of RandomForestClassifier(), and you should note that there is an analogous function RandomForestRegressor() that is used for regression tasks.

In [15]:
## Some example data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
train, test = train_test_split(wbc, test_size=0.2, random_state=7)
train_wbc_y = train['Label'].map({'M': 1, 'B': 0})
train_wbc_y = test['Label'].map({'M': 1, 'B': 0})
train_wbc_X = train.drop(['ID','Label'], axis = 1)
train_wbc_X = test.drop(['ID','Label'], axis = 1)

from sklearn.ensemble import RandomForestClassifier
my_forest = RandomForestClassifier(max_depth=3, min_samples_split=10, 
                                   max_features=2, n_estimators=200, 
                                   random_state=0, oob_score=True)

fitted_forest = my_forest.fit(train_wbc_X, train_wbc_y)
print(fitted_forest.oob_score_) ## out-of-sample classification accuracy
0.9385964912280702

Something you should know is that bootstrapping will naturally exclude some of the data-points from the training set of each tree in the forest. Thus, we can use these excluded data-points to calculate "out-of-bag" performance. Out-of-bag metrics are out-of-sample measures of performance, so we can use and interpret them in the same way that we might cross-validated measures of performance.

Setting oob_score = True will use classification accuracy as the scoring metric. You can also provide any callable function that has the arguments y_true and y_pred (such as f1_score(), etc.)

Unfortunately, there's no easy way to rely upon oob_score_ to efficiently tune hyperparameters, so you'll still need to rely upon GridSearchCV() despite its computational efficiencies with random forests.

Question #4: For this question you should use the features engineered for the "activity" data set in Part 2 of the lab (these features were used in Question #2).

  • Part A: Train a random forest model containing 50 trees with a maximum depth of 3 and a maximum number of features of 2. Report the out-of-bag classification accuracy of this model.
  • Part B: Now use a while loop to continually increment the number of trees by 25 until the out-of-bag classification accuracy changes by less than 0.01 from the previous model. Store each model's out-of-bag accuracy and report the number of trees used in the final model. All other parameters should use the values specified in Part A.

Part 4 - Custom Ensembles¶

The core idea of an ensemble learner is to combine the prediction rules from several different models/estimators to produce a composite model with superior performance on new data. This can work well if the different models used in the ensemble are diverse enough to be able to learn different things from the training data. Put differently, ensembles are most effective when each base model is prone to making different types of errors (or finding different types of patterns) with the training data.

There are two main approaches to creating an ensemble learner:

  • Aggregation (the random forest algorithm is an example) - Train several models independently and aggregate their predictions using voting or simple/weighted averaging.
  • Boosting (covered in our next lab) - Train several models sequentially such that each additional model attempts to correct the errors of earlier models.

We can build our own aggregation ensemble using the VotingClassifier() function:

In [16]:
## Import each base model plus a few extras
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler

## Defining the individual models
model1 = Pipeline([('scaler', StandardScaler()),
                  ('model', SVC(kernel = 'poly', probability=True))])
model2 = DecisionTreeClassifier(max_depth=5)
model3 = Pipeline([('scaler', StandardScaler()),
                  ('model', KNeighborsClassifier())])
                  
## Create the ensemble
my_ensemble = VotingClassifier(estimators=[('svm', model1),('tree', model2),('knn', model3)], voting='soft')

The voting classifier in this example combines the predictions of three different base models, a support vector machine, a decision tree, and a KNN model using a "soft" voting scheme.

  • Setting voting = 'soft' will sum the predicted probabilities for each outcome class produced by each model in the ensemble with the final prediction being the class with the largest sum.
  • Setting voting = 'hard' will use a simple majority vote where the final prediction is the most commonly predicted class. Unfortunately, if there's a tie the final prediction will be assigned in ascending order.

Thus, we'll generally prefer soft voting unless our scenario involves a sufficient number of models and outcome classes that make ties unlikely to occur.

Once it is set up, our voting classifier, my_ensemble has the same capabilities as the individual model classes we've seen so far in sklearn. For example, we can find the ensemble's cross-validated F1-score and compare it to those of its individual base models:

In [17]:
## A few more imports
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import GridSearchCV

## Pipeline to compare models 
model_pipe = Pipeline([('model', SVC())])
candidate_models = {'model': [my_ensemble, model1, model2, model3]}

## Cross-validated F1 scores
grid = GridSearchCV(model_pipe, candidate_models, cv=5, scoring = 'f1').fit(train_wbc_X, train_wbc_y)
pd.DataFrame(grid.cv_results_).sort_values('mean_test_score', ascending=False)[['param_model', 'mean_test_score']]
Out[17]:
param_model mean_test_score
3 (StandardScaler(), KNeighborsClassifier()) 0.923235
0 VotingClassifier(estimators=[('svm',\n ... 0.891429
2 DecisionTreeClassifier(max_depth=5) 0.855798
1 (StandardScaler(), SVC(kernel='poly', probabil... 0.837216

Tuning the hyperparameters of each base model in the ensemble requires us to be careful about how we access each relevant parameter using the double underscore to identify its location within another named step:

In [18]:
## Some tuning parameters to search over
params = {'svm__model__kernel': ['poly','linear'], 
          'tree__max_depth': [4,5,6],
         'knn__model__n_neighbors': [5,10,15],
         'knn__model__weights': ['distance','uniform'],
         'voting': ['soft','hard']}

## Perform the grid search
grid = GridSearchCV(my_ensemble, param_grid=params, cv=5, scoring = 'f1').fit(train_wbc_X, train_wbc_y)
grid.best_estimator_
Out[18]:
VotingClassifier(estimators=[('svm',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               SVC(kernel='linear',
                                                   probability=True))])),
                             ('tree', DecisionTreeClassifier(max_depth=4)),
                             ('knn',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               KNeighborsClassifier(n_neighbors=15,
                                                                    weights='distance'))]))],
                 voting='soft')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
VotingClassifier(estimators=[('svm',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               SVC(kernel='linear',
                                                   probability=True))])),
                             ('tree', DecisionTreeClassifier(max_depth=4)),
                             ('knn',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               KNeighborsClassifier(n_neighbors=15,
                                                                    weights='distance'))]))],
                 voting='soft')
StandardScaler()
SVC(kernel='linear', probability=True)
DecisionTreeClassifier(max_depth=4)
StandardScaler()
KNeighborsClassifier(n_neighbors=15, weights='distance')

Question #5: For this question you should use the "clickbait" data set introduced in Part 1 of the lab (used in Question #1).

  • Part A: Create a "soft" voting ensemble involving five base models: a linear SVM, an 'rbf' SVM, a decision tree with a max depth of 4, a KNN classifier using $k=5$, and another KNN classifier using $k=20$. Include a standardization step as part of the SVM and KNN base models. Next, conduct a cross-validated grid search using the default scoring metric (classification accuracy) to optimize the following parameters:
    • The regularization or "slack" in each SVM base model as either C = 1 or C = 0.1
    • The weighting scheme in each KNN base model as either inverse-distance weighting or uniform weighting.
  • Part B: Fit a random forest containing 200 trees with a maximum depth of 3 that each use 2 randomly selected features. Report the out-of-bag performance (classification accuracy) of this model. Does it seem superior to the voting ensemble from Part A?
  • Part C: Display ROC curves for each of the two models considered in Parts A and B using the test data. Note that you'll need to generate the features of interest for the test set in order to do this. Based upon these curves, which model do you prefer?