Lab 11 - Feature Engineering¶

This lab covers the topic of "feature engineering" which is a broad term referring to the creation or extraction of predictive features from raw data to be used when training a downstream model.

In [1]:
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

This lab will use several different data sets for the purposes of providing a diverse range of examples. Each data set will be introduced in the part of the lab where it is used.

Part 1 - One-hot Encoding¶

Up until now we've been avoiding the use of categorical predictors in our models; however, categorical predictors can be included in any model using a strategy known as "one-hot encoding" (or "dummy variables" in the language of statistical modeling).

The example below provides a quick demonstration of one-hot encoding on the 'style' and 'ac' variables in the Iowa City home sales data:

In [2]:
## Load the Iowa City home sales data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## one hot encoding for style and ac, dropping other categorical vars
from pandas import get_dummies
encode_list = ['style','ac']
ic_ohe = get_dummies(ic[['sale.amount','assessed','style','ac']], columns=encode_list)
ic_ohe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   sale.amount              777 non-null    int64
 1   assessed                 777 non-null    int64
 2   style_1 1/2 Story Frame  777 non-null    uint8
 3   style_1 Story Brick      777 non-null    uint8
 4   style_1 Story Condo      777 non-null    uint8
 5   style_1 Story Frame      777 non-null    uint8
 6   style_2 Story Brick      777 non-null    uint8
 7   style_2 Story Condo      777 non-null    uint8
 8   style_2 Story Frame      777 non-null    uint8
 9   style_Split Foyer Frame  777 non-null    uint8
 10  style_Split Level Frame  777 non-null    uint8
 11  ac_No                    777 non-null    uint8
 12  ac_Yes                   777 non-null    uint8
dtypes: int64(2), uint8(11)
memory usage: 20.6 KB

Notice how the numeric variables, 'sale.amount' and 'assessed', were unchanged, but the variables 'style' and 'ac' were expanded into a collection of binary indicator variables, one for each unique value.

For some models, it's problematic to use linearly dependent sets of predictors, so we might opt to use "reference coding" via the drop_first argument to make the first category of each of the encoded variables the reference category, thereby ensuring the new columns are not linearly dependent:

In [3]:
## Reference coding
ic_ohe = get_dummies(ic[['sale.amount','assessed','style','ac']], columns=encode_list, drop_first=True)
ic_ohe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   sale.amount              777 non-null    int64
 1   assessed                 777 non-null    int64
 2   style_1 Story Brick      777 non-null    uint8
 3   style_1 Story Condo      777 non-null    uint8
 4   style_1 Story Frame      777 non-null    uint8
 5   style_2 Story Brick      777 non-null    uint8
 6   style_2 Story Condo      777 non-null    uint8
 7   style_2 Story Frame      777 non-null    uint8
 8   style_Split Foyer Frame  777 non-null    uint8
 9   style_Split Level Frame  777 non-null    uint8
 10  ac_Yes                   777 non-null    uint8
dtypes: int64(2), uint8(9)
memory usage: 19.1 KB

Note that the style '1 1/2 Story Frame' and the value 'No' for the variable 'ac' are now encoded as reference categories, in the sense that a home in these categories corresponds to all of the associated dummy variables being zero.
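As a quick check (this snippet is illustrative and not part of the original example), homes whose style is '1 1/2 Story Frame' should now have zeros in every remaining style dummy column:

## Rows whose original style is the reference category
ref_rows = ic_ohe[ic['style'] == '1 1/2 Story Frame']

## All of the remaining style dummies should be zero for these homes
ref_rows.filter(like='style_').head(3)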

Using one-hot encoding in a data processing pipeline is somewhat complicated, as it requires us to define different actions for numeric and categorical predictors:

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

## Set up training data for this example
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
train_ic_y = train_ic['sale.amount']
train_ic_X = train_ic.drop('sale.amount', axis = 1)

## Separate pipelines for numeric vs. categorical vars
num_transformer = Pipeline([("scaler", StandardScaler())])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])

## Get names of numeric and categorical columns
num_col_names = train_ic_X.select_dtypes(exclude=['object']).columns.tolist()
cat_col_names = train_ic_X.select_dtypes(include=['object']).columns.tolist()

## Preprocessing transformer allowing different actions for numeric and categorical vars
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_col_names),
    ('cat', cat_transformer, cat_col_names)
])

## Combine the preprocessor and model into a final pipeline
final_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", KNeighborsClassifier())
])

## We can fit this pipeline
fitted_pipe = final_pipe.fit(train_ic_X, train_ic_y)

A few things about this example that you should note:

  1. The default behavior of OneHotEncoder() is to return a sparse matrix, which is fine for some models but unnecessary for what we're doing.
  2. A major purpose of pipelines is to handle new data, and new data can contain categories that were not present in the training data. The argument handle_unknown='ignore' instructs the encoder to ignore unseen categories (encoding them as all zeros) rather than raising an error; a brief demonstration is given below.
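To see handle_unknown='ignore' in isolation, here is a minimal sketch (the category 'Window unit' is made up purely for illustration):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

## Fit an encoder that has only seen the categories 'No' and 'Yes'
enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
enc.fit(pd.DataFrame({'ac': ['No', 'Yes']}))

## An unseen category is encoded as a row of all zeros rather than raising an error
enc.transform(pd.DataFrame({'ac': ['Window unit']}))   # -> array([[0., 0.]])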

In the next section of the lab you'll need to use one-hot encoding to derive features from textual data. Importantly, you'll need to use one-hot encoding within a pipeline on this unit's homework.

Part 2 - Feature Engineering for Text Data¶

The data we'll use for the next portion of the lab come from the UC-Irvine Machine Learning Repository and contain 5574 SMS text messages that are labeled as either "spam" or "ham" (not spam).

In [5]:
## Note that the text file containing these data uses a tab delimiter to separate the label and message
sms = pd.read_csv("https://remiller1450.github.io/data/sms_spam.txt", sep='\t', names=['Label','Message'])
sms.head(5)
Out[5]:
Label Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

These data have a clear outcome, Label, but the predictive features must be derived from the message.

You might immediately want to jump to tokenizing each unique word in a message into its own predictive feature, but this brute-force approach can often be beaten by domain knowledge and careful feature engineering. Tokenization creates a high-dimensional feature space in which many of the features are useless for prediction, so unless you have a massive amount of data, overfitting is a major problem with such an approach.
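To get a rough sense of the scale of the problem, here is a quick illustrative check (not part of the lab's modeling pipeline) of how many columns a naive word tokenization of these messages would create; the exact count depends on the tokenizer, but it is typically several thousand:

from sklearn.feature_extraction.text import CountVectorizer

## One column per unique token in the corpus
vec = CountVectorizer()
word_counts = vec.fit_transform(sms['Message'])
print(word_counts.shape)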

To see what domain knowledge can offer, let's start by creating a training set and looking at a few messages that are labeled "spam":

In [6]:
## Train/test split
train_sms, test_sms = train_test_split(sms, test_size=0.2, random_state=8)

## Separate outcome from message
train_sms_y = (train_sms['Label'] == 'spam').astype(int)
train_sms_X = train_sms['Message']

## Print some examples of 'spam'
train_sms_X[train_sms_y ==1].sample(10)
Out[6]:
3620    8007 25p 4 Alfie Moon's Children in Need song ...
4258    important information 4 orange user . today is...
1307    Enjoy the jamster videosound gold club with yo...
635     Dear Voucher Holder, 2 claim this weeks offer,...
5370    dating:i have had two of these. Only started a...
576     You have won ?1,000 cash or a ?2,000 prize! To...
709     To review and KEEP the fantastic Nokia N-Gage ...
1196    You have 1 new voicemail. Please call 08719181503
1518    Our brand new mobile music service is now live...
4967    URGENT! We are trying to contact U. Todays dra...
Name: Message, dtype: object

Inspecting these spam messages provides several ideas for feature engineering. For example, we can notice that spam messages tend to contain a lot of numbers and may also contain an inordinate number of exclamation marks.

Based upon these observations we might engineer two features: one measuring the proportion of characters in a message that are numeric, and another counting the number of exclamation marks in the message.

In [7]:
## Function to measure percentage of numbers
def get_num(text):
    return sum(map(str.isdigit, text))/len(text)

## Function to count exclamation marks
def count_exclamation_points(text):
    return text.count('!')

## Create a dictionary and convert to a Pandas dataframe
d = {'prop_num': train_sms_X.apply(get_num), 'num_exclam': train_sms_X.apply(count_exclamation_points)}
train_X = pd.DataFrame(d)
train_X.head(6)
Out[7]:
prop_num num_exclam
3922 0.017857 1
2559 0.000000 0
2672 0.000000 0
4282 0.006579 0
987 0.000000 0
3323 0.000000 1

We can see how these features might be useful in classifying each message:

In [8]:
## Scatter plot colored by label
plt.figure().set_figheight(3)
plt.scatter(x = train_X['prop_num'], y = train_X['num_exclam'], c = train_sms_y)
plt.show()

It's important to recognize that these are just two simple examples of features you might engineer from these messages. There are countless possibilities that could be even better predictors of "spam".

Additionally, it's beyond the scope of this course to thoroughly cover the topic of string processing in Python, but I encourage you to consult this reference for guidance and examples when trying to write your own functions to engineer features from text.
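As a small illustration of the kind of helper you might write (the function name and regular expression below are just examples, not something required by the lab), the re library makes it easy to count pattern matches within a message:

import re

## Example helper: count currency symbols appearing in a message
def count_currency_symbols(text):
    return len(re.findall(r'[$£€]', text))

count_currency_symbols('WINNER!! You have won £1000 cash!')   # returns 1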

Question 1: In this question you'll continue working with the SMS spam data. You are encouraged to use this as a starting point on this unit's homework assignment.

  • Part A: Import the "regular expression operations" library using import re, then use the findall() function with the regular expression r'[A-Z]' to create a function that computes the percentage of a message's characters that are capital letters.
  • Part B: The function first_word() that is defined below will return the first word in a message. It works by splitting the string at the character ' ' (space), keeping only the first word (position 0), converting this word to lower case, and then removing any exclamation points by replacing them with '' (an empty string). Use this function and the function you created in Part A to assemble a data frame containing two engineered features named "prop_cap" and "first_word".
  • Part C: Determine two words that appear in a large share of "spam" messages but do not appear in a similarly large share of "ham" messages. Hint: Use the value_counts() method on separate segments of the training data. You might also want to use the head() method to avoid printing a large table of counts.
  • Part D: Apply one-hot encoding to the "first_word" column in the data frame from Part B, then keep only the columns that encode the presence of the two words you identified in Part C.
  • Part E: Train a classification model of your choosing on a data frame containing 3 predictive features: the percentage of capital letters (Part A) and the two one-hot encoded first-word columns (Part D). Use a cross-validated grid search to find the optimal values of any hyperparameters. Report the best model and its cross-validated F1-score.
  • Part F: Provide a short explanation for why the F1-score is a good way to evaluate model performance in this application.
In [9]:
## Define "first_word" function
def first_word(text):
    return text.split(sep=' ')[0].lower().replace('!','')

Part 3 - Feature Engineering for Time Series Classification¶

The data below were collected using a chest-mounted accelerometer in a research study on activity recognition. The original study collected data on 15 participants performing 7 activities, but for demonstration purposes we will only use data for 2 study participants and 2 activities: "working at a computer", or 'Work', and "talking while standing", or 'Talk'.

These data are uncalibrated acceleration measurements recorded at a rate of 52 Hz (so 52 measurements are recorded each second). In our analysis we will look to classify 10-second samples of data, or collections of 520 inputs.

We will use the data from one participant (#1) for training, and we will evaluate the generalizability of our models using data from the other participant (#7) as a test set.

In [10]:
## Dataset
act_ts = pd.read_csv("https://remiller1450.github.io/data/activity.csv")

## Training Data
train_X = act_ts[act_ts['Subject'] == 1].drop(['Label'], axis = 1)
train_y = act_ts[act_ts['Subject'] == 1]['Label']

## Validation Data
test_X = act_ts[act_ts['Subject'] == 7].drop(['Label'], axis = 1)
test_y = act_ts[act_ts['Subject'] == 7]['Label']
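As a quick sanity check of the structure described above (each 10-second sample should contain roughly 520 rows at 52 Hz), we might tally the number of rows in each sample of the training data; this snippet is illustrative and not part of the original example:

## Number of accelerometer readings in each sample
train_X.groupby('Sample_Num').size().describe()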

Successful feature engineering begins with a thoughtful examination of the training data in order to find predictive patterns. It is essential that this exploration only involves the training data to avoid data leakage.

Below we'll create a few simple graphs of the three available predictors over time, colored by the label:

In [11]:
my_vars = ['X_accel', 'Y_accel', 'Z_accel']
for cur_var in my_vars:
    plt.figure().set_figheight(2)
    plt.scatter(x = range(len(train_X[cur_var])), y=train_X[cur_var], c = train_y.map({'Work': 1, 'Talk': 0}))
    plt.title(cur_var)
    plt.show()

There are many feature engineering ideas we can glean from this exploration; two that stand out are:

  1. In general, working tends to show a lower amount of variability in the accelerometer's values across all axes.
  2. The accelerometer tends to have higher values for working in the X and Z axes.

To translate these ideas into usable features we can use grouped summarization:

In [12]:
train_X_FE = pd.concat([train_X.groupby('Sample_Num')[my_vars].mean(),
                        train_X.groupby('Sample_Num')[my_vars].std()], axis=1)
train_X_FE.columns = ['Mean_X', 'Mean_Y', 'Mean_Z', 'SD_X', 'SD_Y', 'SD_Z']
train_X_FE.head(5)
Out[12]:
Mean_X Mean_Y Mean_Z SD_X SD_Y SD_Z
Sample_Num
1 1897.030769 2292.657692 2064.600000 170.098461 143.059786 125.358252
2 1957.184615 2379.226923 2108.601923 11.067517 7.863383 16.358731
3 1962.921154 2378.396154 2112.584615 5.970345 3.332676 5.728319
4 1969.034615 2375.501923 2120.165385 14.418689 7.394212 15.416865
5 1976.882692 2373.157692 2130.376923 7.419165 6.416962 7.635497

We can confirm that these simple features are incredibly predictive, at least on the training data:

In [13]:
train_y_FE = act_ts[act_ts['Subject'] == 1].groupby('Sample_Num')['Label'].first()
plt.scatter(train_X_FE['Mean_X'], train_X_FE['SD_X'], c = train_y_FE.map({'Work': 1, 'Talk': 0}))
plt.show()

Question 2: For this question you should continue using the data introduced in this section's example.

  • Part A: Train a support vector machine classifier using the 6 engineered features described in the example given above. Choose an appropriate kernel based upon the relationships you can see visually in the data, and verify that your choice is also justified by cross-validation. Report the cross-validated classification accuracy of this approach.
  • Part B: Add a standardization step to the approach you explored in Part A and adjust any tuning parameters as necessary. Does this change make any difference in the classification accuracy achieved on the test data? Briefly explain what you observe.

Question 3: The data below are a different portion of the activity recognition study described at the beginning of this section, covering 3 subjects and two different activities: "Walking", or 'Walking', and "Walking and Talking to Someone", or 'Walking_Talking'.

  • Part A: Split the data so that subjects #5 and #7 are used as training data and subject #1 as test data.
  • Part B: Create at least one data visualization and use it to engineer at least two features derived from the variables X_accel, Y_accel, and Z_accel. Your answer to this question should include your visualization and code that creates a version of your data with 1 row per sample with your engineered features as columns (similar to the format of the example given in this part of the lab).
  • Part C: Create a pipeline involving at least one preprocessing step (i.e., standardization/re-scaling, normalizing transformation, dimension reduction, etc.) followed by a classifier step. Use a cross-validated grid search to tune at least one hyperparameter involved in this pipeline.
  • Part D: Evaluate your best performing method on the test set and briefly comment upon how successful you were.

Hint: Classifying these activities is significantly more difficult than the earlier example distinguishing working from talking. The purpose of this exercise is for you to explore the creation of more sophisticated features (beyond simple means/standard deviations). Some possible ideas for features are rolling averages, ranges, rates of change, etc.; a minimal sketch of one such feature appears after the data-loading cell below. You should not be discouraged if you aren't able to achieve a very high degree of classification accuracy.

In [14]:
## Question 3 Dataset
act_ts_v2 = pd.read_csv("https://remiller1450.github.io/data/activity_v2.csv")
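As a starting point for the feature ideas mentioned in the hint above, here is a minimal, illustrative sketch (assuming act_ts_v2 has the same Subject, Sample_Num, and accelerometer columns as the earlier data set; the function name, the window of 26 observations, i.e. half a second at 52 Hz, and the use of X_accel are arbitrary choices, not a required solution). It summarizes each sample by the range of a rolling average of one axis:

## Illustrative sketch only: range of a rolling mean of one axis, computed per sample
def rolling_mean_range(df, col='X_accel', window=26):
    ## Smooth the signal within each sample using a rolling average
    smoothed = df.groupby('Sample_Num')[col].transform(lambda s: s.rolling(window, min_periods=1).mean())
    ## Summarize each sample by the range (max - min) of the smoothed signal
    return smoothed.groupby(df['Sample_Num']).agg(lambda s: s.max() - s.min())

## Example usage on one subject's data
rolling_mean_range(act_ts_v2[act_ts_v2['Subject'] == 5]).head()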

Part 4 - Handling Missing Data¶

While not directly related to the topic of feature engineering, it's worth knowing that pipelines provide an opportunity to handle missing data without introducing data leakage. This section of the lab introduces two different missing data imputation methods, simple imputation and nearest-neighbors imputation, but many other strategies exist.

The goal of imputation is to "fill in" the missing values in a data set so that all of its rows can be used during model training/evaluation; many machine learning algorithms in sklearn cannot handle data points with missing values.
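As a toy illustration of what mean imputation does (the array below is made up and not part of the Happy Planet data), each missing value is replaced by the mean of the observed values in its column:

import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [7.0, np.nan]])

## Column means of the observed values are 4.0 and 3.0
SimpleImputer(strategy='mean').fit_transform(toy)   # -> [[1., 2.], [4., 4.], [7., 3.]]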

To demonstrate imputation, we'll load the "Happy Planet" data set, which contains various measures for countries around the world. We're primarily interested in the fact that two countries have missing values for the variables GDPperCapita and HDI:

In [15]:
## Import data set w/ missing values
hp = pd.read_csv("https://remiller1450.github.io/data/HappyPlanet.csv")

## Tally number of missing values (by variable)
hp.isna().sum()
Out[15]:
Country           0
Region            0
Happiness         0
LifeExpectancy    0
Footprint         0
HLY               0
HPI               0
HPIRank           0
GDPperCapita      2
HDI               2
Population        0
dtype: int64

Now let's consider trying to predict the happiness score of a country using its region, life expectancy, and human development index (HDI).

In [16]:
## Outcome
hp_y = hp['Happiness']

## Predictors
hp_X = hp[['Region','LifeExpectancy', 'HDI']]

Below we create two different pipelines. The first imputes missing values using the mean of the variable containing the missing data; theoretically speaking, simple mean imputation is only a valid approach when missing values are "missing completely at random", or when the missing values appear entirely at random in the data with no relation to the values of other observed variables or the identity of the involved rows. The second imputes values using $k$-nearest neighbors, an approach that is valid when missing values are "missing at random", or when their occurrence can be explained using other variables observed in the data.

  • Under the "missing completely at random" (MCAR) definition, the United States and Afghanistan would be equally likely to have missing data (which probably isn't true).
  • Under the "missing at random" (MAR) definition, missing information is predictable using other observed variables, so the United States and Afghanistan can differ in their propensity for missing data so long as that difference can be explained by other observed variables.
  • There is another possibility, "missing not at random" (MNAR), where missing information is only predictable using unobserved factors. Under these circumstances imputing missing data isn't justifiable from a statistical perspective, and it might cause problems for the performance of machine learning models because the imputed values will be biased.

Determining whether your missing values are MCAR, MAR, or MNAR generally requires domain-specific knowledge about your data and how it was collected, thereby complicating the decision whether to include missing data imputation as part of a pre-processing pipeline.

In [17]:
## Import imputation functions
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsRegressor

simp_imp_pipe = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("model", KNeighborsRegressor())
])

simp_fit = simp_imp_pipe.fit(hp_X, hp_y)
print(simp_fit.score(hp_X, hp_y))

knn_imp_pipe = Pipeline([
    ("imputer", KNNImputer(missing_values=np.nan, n_neighbors=4, weights='distance')),
    ("model", KNeighborsRegressor())
])

kp_fit = knn_imp_pipe.fit(hp_X, hp_y)

print(kp_fit.score(hp_X, hp_y))
0.8130099967617596
0.8128904952418757

Because there were only two missing values in this example, there's seemingly not much difference in the training scores ($R^2$) between these two strategies.

However, something to recognize is that we treated the "Region" variable as numeric, when it actually uses integer values to encode different categories. Additionally, we did not perform any standardization or scaling in our pipeline, which will impact KNNImputer().

Question 4: In this question you will improve upon the example pipeline from this section.

  • Part A: Modify the example pipeline using KNN imputation to include the following changes:
    • Handle categorical and numerical variables separately within the pipeline:
      • The categorical variable "Region" should be transformed using one-hot encoding
      • The numeric predictors "LifeExpectancy" and "HDI" should be standardized
    • Impute any missing values before using a support vector regressor as the final model
    • You may set any tuning parameters in these steps yourself so long as they do not lead to any computational issues
  • Part B: Fit your pipeline to the full training set and compare the method's score ($R^2$, the default metric) with the example. Does this suggest that your new approach is superior? Hint: Think about how both of these evaluations have used the training data.