This lab will cover a few different ensemble approaches, including the random forest algorithm and custom ensembles that combine multiple base models of different types.
It will also introduce the topic of "feature engineering", which is the process of generating predictive features from non-standard data formats (text, images, time-series, etc.) so that the data conform to the "spreadsheet format" typically expected by the models we've encountered so far in the course.
To begin, you should load our standard set of libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
You'll also need two different data sets, "clickbait" and "activity":
## Load data sets
clickbait = pd.read_csv("https://remiller1450.github.io/data/clickbait_headlines.csv", encoding='latin-1')
act_ts = pd.read_csv("https://remiller1450.github.io/data/activity.csv")
The "clickbait" data set contains a total of $n=6000$ headlines that were collected from two types of websites, 3000 headlines come from websites considered to be clickbait publishers (Upworthy, Buzzfeed, Thatscoop, ViralNova, Scoopwhoop, and ViralStories) and the remaining 3000 come from websites that were deemed not to publish clickbait (New York Times, WikiNews, The Guardian, and The Hindu).
The goal for these data is to use the headline text, Headline
, to predict whether or not the article came from a source of clickbait (Label = 1
denoting articles from clickbait websites).
The "activity" data set contains accelerometer measures for $n=15$ participants performing 7 different activities. The data are recorded at 52 Hz (meaning there are 52 measurements for each second of data collection). Our analysis will focus on classifying 10-second samples of movement based upon time-series measurements from the accelerometer. These 10-second samples have already been defined using the variable Sample_Num
.
When analyzing textual data your natural instinct might be to tokenize the text and use the relative frequencies of each unique word within the text as predictive features, an approach known as bag of words. However, this creates a very high-dimensional feature space containing many features that aren't useful for prediction. So, unless you have a massive amount of data, it can be more effective to create your own custom features.
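For illustration, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer (this is not the approach we'll take in this lab, and it uses raw word counts rather than relative frequencies); notice how many columns the resulting feature matrix has:
## Minimal bag-of-words sketch (illustration only)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(clickbait['Headline'])
## One column per unique word appearing in the headlines
print(word_counts.shape)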
To begin, let's create separate training and testing sets, which is straightforward since the data already have each sample that we're seeking to classify as its own row:
## Train-Test split
from sklearn.model_selection import train_test_split
train_clickbait, test_clickbait = train_test_split(clickbait, test_size=0.2, random_state=7)
train_clickbait_y = train_clickbait['Label']
train_clickbait_X = train_clickbait['Headline']
test_clickbait_y = test_clickbait['Label']
test_clickbait_X = test_clickbait['Headline']
The training-testing split is particularly important when feature engineering because we can explore patterns in the training data to generate predictive features while still getting an unbiased estimate of how well those features perform on new data.
With this in mind, a good place to start is looking at a few clickbait and non-clickbait examples and trying to identify patterns:
## Examples of clickbait
train_clickbait_X[train_clickbait_y == 1].sample(15, random_state=9)
3327    What's Your Craziest Bachelorette Party Experi...
3623    Spike Surprises Fans
3436    15 Brilliant And Creative Ways To Use Spaghett...
4026    19 Things You'll Just Get If You're Obsessed W...
5605    How Well Do You Know These Random, 16th Centur...
3595    Kanye West Auditioned For "American Idol" This...
4936    This Korean Pop Star Can't Stop Falling On Sta...
5563    Who Said It: Gene Belcher Or Chris Griffin
3639    Are You More Bailee Madison Or Hailee Steinfeld
4095    For Anyone Who Had A Crush On Sheik From "Zelda"
4518    Send This To The Person Who Gives Bad Gifts
5871    This Canadian Woman Opened Up A Can Of Thanksg...
4513    21 Gifts To Give Yourself When Adulting Is Too...
5534    17 Pairs Of Actually Cute Sweatpants That'll K...
4654    12 Charts Only Sephora Lovers Will Understand
Name: Headline, dtype: object
## Examples of non-clickbait
train_clickbait_X[train_clickbait_y == 0].sample(15, random_state=9)
873     Both sides of Kenya's constitution dispute are...
2456    Italy Seizes Millions in Assets of Four Banks
2603    Truck carrying explosives crashes, explodes in...
1348    Caloundra win Sunshine Coast, Australia cricke...
2159    Southern Gaza hit by new Israeli air strikes
832     Posted deadlines for Christmas delivery
340     Rumsfeld memo recognizes need for 'major adjus...
220     Horse flu damage spreads
269     Officials Identify Alabama Gunman
2619    MS-13 gang threatens the Arizona "minutemen"
2710    18 illegal Afghan and Burmese immigrants kille...
1855    Financial Safety Net of Nonprofit Organization...
1317    United Nations requests US$700 million in aid ...
77      English "Lady in the Lake" killer found dead i...
2059    Tomb discovered in Valley of the Kings
Name: Headline, dtype: object
One pattern you might notice is that clickbait headlines seem much more likely than non-clickbait headlines to be framed as questions that start with words like "Would", "What", "Which", "How", "Can", and "Are". We could generate a predictive feature representing this pattern using the following function:
def first_word_q(input_text):
    first_word = input_text.str.split(' ').str[0]
    return (first_word == "What") | (first_word == "Which") | (first_word == "Would") | (first_word == "How") | (first_word == "Can") | (first_word == "Are")
Applying this function, we can see that fewer than 1% of non-clickbait headlines start with a question word, whereas roughly 16% of clickbait headlines start with one of these words:
## Proportion of clickbait starting w/ a question word
np.average(first_word_q(train_clickbait_X)[train_clickbait_y == 1])
0.15651085141903173
## Proportion of non-clickbait starting w/ a question word
np.average(first_word_q(train_clickbait_X)[train_clickbait_y == 0])
0.0024958402662229617
Another thing you should be aware of when engineering textual features is the apply()
method of pandas
DataFrames and Series, which can be useful when you want to apply functions that are not vectorized. An example is shown below:
## Function to count appearances of "US" or "UK" or "EU" in a given string
def count_US(text):
    return text.count('US') + text.count('U.S.') + text.count('UK') + text.count('U.K.') + text.count('EU') + text.count('E.U.')
## Print how many of each in the training data
train_clickbait_X.apply(count_US).value_counts()
0    4562
1     235
2       3
Name: Headline, dtype: int64
Finally, it's beyond the scope of this course to comprehensively cover string processing in Python, but this reference provides a nice list of string methods that you might find useful in this lab and on Homework 3. This is also one area of the course where you are welcome to rely upon resources outside of our lectures and labs, so long as you cite them.
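As a starting point, below is a small sketch of a few candidate features built from vectorized string methods; these particular features (headline length, word count, presence of a digit) are illustrative guesses rather than features known to be predictive:
## A few candidate features built with vectorized string methods (illustrative only)
headline_length = train_clickbait_X.str.len()             ## number of characters
word_count = train_clickbait_X.str.split(' ').str.len()   ## number of words
has_number = train_clickbait_X.str.contains(r'\d')        ## contains a digit
## Compare the average of a candidate feature across the two classes
print(headline_length.groupby(train_clickbait_y).mean())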
Question #1:
Given the proportion of headlines that are True for this feature, does the feature seem like it would be useful for prediction? (You may find it helpful to visualize a fitted decision tree with plot_tree.)
).The "activity" data set is structured so that each row represents a measurement at a particular time step for a study participant, which poses a problem for the typical way we'd create a training-testing split using train_test_split()
. As an alternative, we might opt to use data from Subject 1 for development and training, and data from another subject (Subject 7 here) for model evaluation. This has the added benefit of giving us a clear indication of how well features generated for a certain individual might generalize to other individuals.
## Training Data
train_act_X = act_ts[act_ts['Subject'] == 1].drop(['Label'], axis = 1)
train_act_y = act_ts[act_ts['Subject'] == 1]['Label']
## Validation Data
test_act_X = act_ts[act_ts['Subject'] == 7].drop(['Label'], axis = 1)
test_act_y = act_ts[act_ts['Subject'] == 7]['Label']
To simplify things further, we'll also focus on two of the activity types that were labeled in the study: "Work", which indicates the subject was working at their computer, and "Talk", which indicates the subject was talking while standing.
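If your copy of the data includes rows for other activities, you could restrict the training and testing sets to these two labels with something like the sketch below (assuming the Label column stores the strings 'Work' and 'Talk', as the plotting code that follows implies):
## Optionally keep only the 'Work' and 'Talk' rows (a no-op if no other labels are present)
work_talk = train_act_y.isin(['Work', 'Talk'])
train_act_X, train_act_y = train_act_X[work_talk], train_act_y[work_talk]
work_talk_test = test_act_y.isin(['Work', 'Talk'])
test_act_X, test_act_y = test_act_X[work_talk_test], test_act_y[work_talk_test]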
Here are what a few of the variables look like for these two labels:
## Create plots of select variables over time and color by activity (label)
my_vars = ['X_accel', 'Y_accel', 'Z_accel']
for cur_var in my_vars:
    plt.figure().set_figheight(2)
    plt.scatter(x = range(len(train_act_X[cur_var])), y=train_act_X[cur_var], c = train_act_y.map({'Work': 1, 'Talk': 0}))
    plt.title(cur_var)
    plt.show()
There are many possible feature engineering ideas that we might glean from these visualizations, but two that stand out are:
- the average acceleration values on each axis appear to differ between the two activities, and
- the amount of moment-to-moment variability in the acceleration measurements also appears to differ between the two activities.
To translate these patterns into predictive features we can use grouped summarization:
## Get the mean and std dev of each selected variable for each sample
train_X_FE = pd.concat([train_act_X.groupby('Sample_Num')[my_vars].mean(),
train_act_X.groupby('Sample_Num')[my_vars].std()], axis=1)
train_X_FE.columns = ['Mean_X', 'Mean_Y', 'Mean_Z', 'SD_X', 'SD_Y', 'SD_Z']
train_X_FE.head(5)
Sample_Num | Mean_X | Mean_Y | Mean_Z | SD_X | SD_Y | SD_Z
---|---|---|---|---|---|---
1 | 1897.030769 | 2292.657692 | 2064.600000 | 170.098461 | 143.059786 | 125.358252
2 | 1957.184615 | 2379.226923 | 2108.601923 | 11.067517 | 7.863383 | 16.358731
3 | 1962.921154 | 2378.396154 | 2112.584615 | 5.970345 | 3.332676 | 5.728319
4 | 1969.034615 | 2375.501923 | 2120.165385 | 14.418689 | 7.394212 | 15.416865
5 | 1976.882692 | 2373.157692 | 2130.376923 | 7.419165 | 6.416962 | 7.635497
Notice that we grouped by Sample_Num
because we would like to make activity predictions for each 10-second sample in the training and testing sets.
A simple visual analysis suggests these features will be highly predictive:
## Plot two of the engineered features and color by the target variable (label)
train_y_FE = act_ts[act_ts['Subject'] == 1].groupby('Sample_Num')['Label'].first()
plt.scatter(train_X_FE['Mean_X'], train_X_FE['SD_X'], c = train_y_FE.map({'Work': 1, 'Talk': 0}))
plt.show()
Notice how we also had to collapse the outcome variable, Label, to a single value per 10-second sample (every row within a sample shares the same label) in order to make this visualization.
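To evaluate models built from these features later on, you'll want to apply the same grouped summaries to the test subject's data. A minimal sketch mirroring the training-set code above is shown below (the names test_X_FE and test_y_FE are just one naming choice):
## Apply the same feature engineering steps to the test subject's data
test_X_FE = pd.concat([test_act_X.groupby('Sample_Num')[my_vars].mean(),
                       test_act_X.groupby('Sample_Num')[my_vars].std()], axis=1)
test_X_FE.columns = ['Mean_X', 'Mean_Y', 'Mean_Z', 'SD_X', 'SD_Y', 'SD_Z']
## Collapse the test labels to one value per 10-second sample
test_y_FE = test_act_y.groupby(test_act_X['Sample_Num']).first()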
Question #2:
Question #3: The data given below are from a different portion of the activity recognition study. For this question you'll use the data from three subjects and two different activities: walking alone, or 'Walking'
, and talking to someone else while walking, or 'Walking_Talking'
.
Recall that the random forest algorithm uses the predictions from a large set of decision trees to make an overall prediction. The success of this algorithm is based upon the creation of a diverse set of trees that each capture different patterns present in the training data. To achieve this, the following hyperparameters are most important:
- max_depth - the maximum depth of the individual trees in the ensemble
- min_samples_split - the minimum number of samples a node must contain to be eligible for splitting (within a single tree in the ensemble)
- max_features - typically given as an int representing the number of randomly selected predictors considered at each split within an individual tree in the ensemble
You should also be aware of the n_estimators parameter, which governs the number of trees in the forest. If this parameter is set too low, the forest may not be sufficiently flexible to capture all of the meaningful patterns in the data. However, there is little benefit to choosing a very large number of trees: the performance of a random forest tends to stabilize once the forest is large enough, and additional trees beyond that point only add computational burden.
Below is a quick demonstration of RandomForestClassifier()
, and you should note that there is an analogous function RandomForestRegressor()
that is used for regression tasks.
## Some example data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
train, test = train_test_split(wbc, test_size=0.2, random_state=7)
train_wbc_y = train['Label'].map({'M': 1, 'B': 0})
test_wbc_y = test['Label'].map({'M': 1, 'B': 0})
train_wbc_X = train.drop(['ID','Label'], axis = 1)
test_wbc_X = test.drop(['ID','Label'], axis = 1)
from sklearn.ensemble import RandomForestClassifier
my_forest = RandomForestClassifier(max_depth=3, min_samples_split=10,
max_features=2, n_estimators=200,
random_state=0, oob_score=True)
fitted_forest = my_forest.fit(train_wbc_X, train_wbc_y)
print(fitted_forest.oob_score_) ## out-of-bag (out-of-sample) classification accuracy
0.9385964912280702
Something you should know is that bootstrapping will naturally exclude some of the data points from the training set of each tree in the forest. Thus, we can use these excluded data points to calculate "out-of-bag" performance. Out-of-bag metrics are out-of-sample measures of performance, so we can use and interpret them in the same way that we might use cross-validated measures of performance.
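Because out-of-bag accuracy is an out-of-sample measure, it also offers a quick (if informal) way to see the stabilization behavior mentioned earlier. The sketch below refits the demonstration forest with a few values of n_estimators, reusing train_wbc_X and train_wbc_y from above:
## Informal check: OOB accuracy tends to level off once the forest is large enough
for n_trees in [50, 100, 200, 500]:
    forest = RandomForestClassifier(max_depth=3, min_samples_split=10, max_features=2,
                                    n_estimators=n_trees, random_state=0, oob_score=True)
    print(n_trees, forest.fit(train_wbc_X, train_wbc_y).oob_score_)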
Setting oob_score = True
will use classification accuracy as the scoring metric. You can also provide any callable function that has the arguments y_true
and y_pred
(such as f1_score()
, etc.).
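For example, assuming your scikit-learn version is recent enough to accept a callable here, you could use the F1 score as the out-of-bag metric:
## Use F1 as the out-of-bag metric (assumes a scikit-learn version where oob_score accepts a callable)
from sklearn.metrics import f1_score
forest_f1 = RandomForestClassifier(max_depth=3, min_samples_split=10, max_features=2,
                                   n_estimators=200, random_state=0, oob_score=f1_score)
print(forest_f1.fit(train_wbc_X, train_wbc_y).oob_score_)  ## out-of-bag F1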
Unfortunately, there's no easy way to rely upon oob_score_
to efficiently tune hyperparameters, so you'll still need to rely upon GridSearchCV()
despite its computational inefficiency with random forests.
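For reference, a minimal sketch of such a grid search over the demonstration forest is shown below (the candidate values are arbitrary choices for illustration, not recommendations):
## Small illustrative grid search over a couple of random forest hyperparameters
from sklearn.model_selection import GridSearchCV
forest_params = {'max_depth': [2, 3, 5], 'max_features': [2, 4, 8]}
forest_grid = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                           param_grid=forest_params, cv=5, scoring='f1')
print(forest_grid.fit(train_wbc_X, train_wbc_y).best_params_)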
Question #4: For this question you should use the features engineered for the "activity" data set in Part 2 of the lab (these features were used in Question #2).
The core idea of an ensemble learner is to combine the prediction rules from several different models/estimators to produce a composite model with superior performance on new data. This can work well if the different models used in the ensemble are diverse enough to be able to learn different things from the training data. Put differently, ensembles are most effective when each base model is prone to making different types of errors (or finding different types of patterns) with the training data.
There are two main approaches to creating an ensemble learner: aggregation, where each base model is fit independently and their individual predictions are combined (for example, by voting or averaging), and stacking, where the base models' predictions are used as inputs to another model that learns how best to combine them.
We can build our own aggregation ensemble using the VotingClassifier()
function:
## Import each base model plus a few extras
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
## Defining the individual models
model1 = Pipeline([('scaler', StandardScaler()),
('model', SVC(kernel = 'poly', probability=True))])
model2 = DecisionTreeClassifier(max_depth=5)
model3 = Pipeline([('scaler', StandardScaler()),
('model', KNeighborsClassifier())])
## Create the ensemble
my_ensemble = VotingClassifier(estimators=[('svm', model1),('tree', model2),('knn', model3)], voting='soft')
The voting classifier in this example combines the predictions of three different base models (a support vector machine, a decision tree, and a KNN model) using a "soft" voting scheme.
- voting = 'soft' will sum the predicted probabilities for each outcome class produced by each model in the ensemble, with the final prediction being the class with the largest sum.
- voting = 'hard' will use a simple majority vote, where the final prediction is the most commonly predicted class. Unfortunately, if there's a tie, the final prediction is assigned in ascending order of the class labels.
Thus, we'll generally prefer soft voting unless our scenario involves enough models and outcome classes that ties are unlikely to occur.
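As a small illustration of how the two schemes can disagree, suppose three base models produce class-1 probabilities of 0.9, 0.4, and 0.45 for a single observation (numbers made up for illustration):
## Made-up class-1 probabilities from three base models for one observation
probs = np.array([0.9, 0.4, 0.45])
## Soft voting: compare the summed (or averaged) probabilities
print(probs.mean() > 0.5)          ## True  -> soft vote predicts class 1
## Hard voting: each model votes for its most probable class, majority wins
print((probs > 0.5).sum() >= 2)    ## False -> two of three models vote for class 0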
Once it is set up, our voting classifier, my_ensemble
has the same capabilities as the individual model classes we've seen so far in sklearn
. For example, we can find the ensemble's cross-validated F1-score and compare it to those of its individual base models:
## A few more imports
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
## Pipeline to compare models
model_pipe = Pipeline([('model', SVC())])
candidate_models = {'model': [my_ensemble, model1, model2, model3]}
## Cross-validated F1 scores
grid = GridSearchCV(model_pipe, candidate_models, cv=5, scoring = 'f1').fit(train_wbc_X, train_wbc_y)
pd.DataFrame(grid.cv_results_).sort_values('mean_test_score', ascending=False)[['param_model', 'mean_test_score']]
 | param_model | mean_test_score
---|---|---
3 | (StandardScaler(), KNeighborsClassifier()) | 0.923235
0 | VotingClassifier(estimators=[('svm', ... | 0.891429
2 | DecisionTreeClassifier(max_depth=5) | 0.855798
1 | (StandardScaler(), SVC(kernel='poly', probabil... | 0.837216
Tuning the hyperparameters of each base model in the ensemble requires us to be careful about how we reference each relevant parameter, using double underscores to identify its location within a named step:
## Some tuning parameters to search over
params = {'svm__model__kernel': ['poly','linear'],
'tree__max_depth': [4,5,6],
'knn__model__n_neighbors': [5,10,15],
'knn__model__weights': ['distance','uniform'],
'voting': ['soft','hard']}
## Perform the grid search
grid = GridSearchCV(my_ensemble, param_grid=params, cv=5, scoring = 'f1').fit(train_wbc_X, train_wbc_y)
grid.best_estimator_
VotingClassifier(estimators=[('svm', Pipeline(steps=[('scaler', StandardScaler()), ('model', SVC(kernel='linear', probability=True))])), ('tree', DecisionTreeClassifier(max_depth=4)), ('knn', Pipeline(steps=[('scaler', StandardScaler()), ('model', KNeighborsClassifier(n_neighbors=15, weights='distance'))]))], voting='soft')
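If you're ever unsure which parameter names (and double-underscore paths) are available for tuning, you can list all of the ensemble's parameters:
## List every tunable parameter name for the ensemble (note the double underscores)
print(my_ensemble.get_params().keys())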
Question #5: For this question you should use the "clickbait" data set introduced in Part 1 of the lab (used in Question #1).