This lab covers "feature engineering," a broad term referring to the creation or extraction of predictive features from raw data for use in training a downstream model.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
This lab uses several different data sets to provide a diverse range of examples. Each data set will be introduced in the part of the lab where it is used.
Up until now we've avoided using categorical predictors in our models; however, categorical predictors can be included in any model using a strategy known as "one-hot encoding" (or "dummy variables" in the language of statistical modeling).
The example below provides a quick demonstration of one-hot encoding on the 'style' and 'ac' variables in the Iowa City home sales data:
## Load the Iowa City home sales data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## one hot encoding for style and ac, dropping other categorical vars
from pandas import get_dummies
encode_list = ['style','ac']
ic_ohe = get_dummies(ic[['sale.amount','assessed','style','ac']], columns=encode_list)
ic_ohe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   sale.amount              777 non-null    int64
 1   assessed                 777 non-null    int64
 2   style_1 1/2 Story Frame  777 non-null    uint8
 3   style_1 Story Brick      777 non-null    uint8
 4   style_1 Story Condo      777 non-null    uint8
 5   style_1 Story Frame      777 non-null    uint8
 6   style_2 Story Brick      777 non-null    uint8
 7   style_2 Story Condo      777 non-null    uint8
 8   style_2 Story Frame      777 non-null    uint8
 9   style_Split Foyer Frame  777 non-null    uint8
 10  style_Split Level Frame  777 non-null    uint8
 11  ac_No                    777 non-null    uint8
 12  ac_Yes                   777 non-null    uint8
dtypes: int64(2), uint8(11)
memory usage: 20.6 KB
Notice how the numeric variables, 'sale.amount' and 'assessed', were unchanged, but the variables 'style' and 'ac' were expanded into a collection of binary indicator variables, one for each unique value.
For some models it's problematic to use linearly dependent sets of predictors, so we might opt to use "reference coding" via the drop_first argument, which makes the first category of each encoded variable the reference category and thereby ensures the new columns are not linearly dependent:
## Reference coding
ic_ohe = get_dummies(ic[['sale.amount','assessed','style','ac']], columns=encode_list, drop_first=True)
ic_ohe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   sale.amount              777 non-null    int64
 1   assessed                 777 non-null    int64
 2   style_1 Story Brick      777 non-null    uint8
 3   style_1 Story Condo      777 non-null    uint8
 4   style_1 Story Frame      777 non-null    uint8
 5   style_2 Story Brick      777 non-null    uint8
 6   style_2 Story Condo      777 non-null    uint8
 7   style_2 Story Frame      777 non-null    uint8
 8   style_Split Foyer Frame  777 non-null    uint8
 9   style_Split Level Frame  777 non-null    uint8
 10  ac_Yes                   777 non-null    uint8
dtypes: int64(2), uint8(9)
memory usage: 19.1 KB
Note that the style '1 1/2 Story Frame' and the value 'No' for the variable 'ac' are now encoded as reference categories, in the sense that this category of home corresponds with all of the dummy variables being zero.
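As a quick illustrative check (not part of the original lab), we can verify that the homes whose 'style' dummies are all zero are exactly those in the reference category:
## Illustrative check: all-zero style dummies should correspond to the reference category
style_dummies = ic_ohe.filter(like='style_')
print((style_dummies.sum(axis=1) == 0).sum())      # homes with every style dummy equal to zero
print((ic['style'] == '1 1/2 Story Frame').sum())  # homes in the reference category (counts should match)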
Using one-hot encoding in a data processing pipeline is somewhat complicated, as it requires us to define different actions for numeric and categorical predictors:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
## Set up training data for this example
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
train_ic_y = train_ic['sale.amount']
train_ic_X = train_ic.drop('sale.amount', axis = 1)
## Separate pipelines for numeric vs. categorical variables
num_transformer = Pipeline([("scaler", StandardScaler())])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))])
## Get names of numeric and categorical columns
num_col_names = train_ic_X.select_dtypes(exclude=['object']).columns.tolist()
cat_col_names = train_ic_X.select_dtypes(include=['object']).columns.tolist()
## Preprocessing transformer allowing different actions for numeric and categorical vars
preprocessor = ColumnTransformer([
('num', num_transformer, num_col_names),
('cat', cat_transformer, cat_col_names)
])
## Combine the preprocessor and model into a final pipeline
final_pipe = Pipeline([
("preprocessor", preprocessor),
("model", KNeighborsRegressor())
])
## We can fit this pipeline
fitted_pipe = final_pipe.fit(train_ic_X, train_ic_y)
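Once fitted, the pipeline behaves like any other sklearn estimator. As a brief illustration (a sketch using the test split created above, not part of the original example), we can generate predictions for the held-out homes:
## Illustrative: predict sale amounts for the held-out test homes
test_ic_y = test_ic['sale.amount']
test_ic_X = test_ic.drop('sale.amount', axis=1)
print(fitted_pipe.predict(test_ic_X)[:5])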
A few things about this example that you should note:
- The default behavior of OneHotEncoder() is to return a sparse matrix, which is fine for some models but unnecessary for what we're doing (hence sparse_output=False).
- handle_unknown='ignore' instructs the pipeline to ignore any categories that weren't present in the training data.

In the next section of the lab you'll need to use one-hot encoding to derive features from textual data. Importantly, you'll need to use one-hot encoding within a pipeline on this unit's homework.
The data we'll use for the next portion of the lab come from the UC-Irvine Machine Learning Repository and contain 5574 SMS text messages that are labeled as either "spam" or "ham" (not spam).
## Note that the text file containing these data uses a tab delimiter to separate the label and message
sms = pd.read_csv("https://remiller1450.github.io/data/sms_spam.txt", sep='\t', names=['Label','Message'])
sms.head(5)
|   | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
These data have a clear outcome, Label, but the predictive features must be derived from the message.
You might immediately want to jump to tokenizing each unique word in the messages to serve as a predictive feature, but this brute-force approach can often be beaten using domain knowledge and feature engineering. Tokenization creates a high-dimensional feature space where many of the features are useless for prediction, so unless you have a massive amount of data, overfitting is a major problem with such an approach.
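For reference, the brute-force tokenization approach might look like the following sketch using sklearn's CountVectorizer (illustrative only, and fit on the full data purely to show the dimensionality, which is not something you'd do before a train/test split):
## Illustrative only: tokenize every unique word into a count feature
from sklearn.feature_extraction.text import CountVectorizer
token_counts = CountVectorizer().fit_transform(sms['Message'])
print(token_counts.shape)   # one column per unique token -- thousands of features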
To see this, let's start by creating a training set and looking at a few messages that are labeled "spam":
## Train/test split
train_sms, test_sms = train_test_split(sms, test_size=0.2, random_state=8)
## Separate outcome from message
train_sms_y = (train_sms['Label'] == 'spam').astype(int)
train_sms_X = train_sms['Message']
## Print some examples of 'spam'
train_sms_X[train_sms_y ==1].sample(10)
3620    8007 25p 4 Alfie Moon's Children in Need song ...
4258    important information 4 orange user . today is...
1307    Enjoy the jamster videosound gold club with yo...
635     Dear Voucher Holder, 2 claim this weeks offer,...
5370    dating:i have had two of these. Only started a...
576     You have won ?1,000 cash or a ?2,000 prize! To...
709     To review and KEEP the fantastic Nokia N-Gage ...
1196    You have 1 new voicemail. Please call 08719181503
1518    Our brand new mobile music service is now live...
4967    URGENT! We are trying to contact U. Todays dra...
Name: Message, dtype: object
Inspecting these spam messages provides several ideas for feature engineering. For example, spam messages tend to contain a lot of numbers and may also contain an inordinate number of exclamation marks.
Based upon these observations we might engineer two features: one measuring the proportion of characters in a message that are numeric, and another counting the number of exclamation marks in the message.
## Function to measure percentage of numbers
def get_num(text):
return sum(map(str.isdigit, text))/len(text)
## Function to count exclamation marks
def count_exclamation_points(text):
return text.count('!')
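## Quick sanity check of the helper functions (illustrative, made-up strings)
print(get_num('Call 555 now'))             # 0.25 -- 3 of 12 characters are digits
print(count_exclamation_points('Win!!!'))  # 3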
## Create a dictionary and convert it to a pandas DataFrame
d = {'prop_num': train_sms_X.apply(get_num), 'num_exclam': train_sms_X.apply(count_exclamation_points)}
train_X = pd.DataFrame(d)
train_X.head(6)
|   | prop_num | num_exclam |
|---|---|---|
| 3922 | 0.017857 | 1 |
| 2559 | 0.000000 | 0 |
| 2672 | 0.000000 | 0 |
| 4282 | 0.006579 | 0 |
| 987 | 0.000000 | 0 |
| 3323 | 0.000000 | 1 |
We can see how these features might be useful in classifying each message:
## Scatter plot colored by label
plt.figure().set_figheight(3)
plt.scatter(x = train_X['prop_num'], y = train_X['num_exclam'], c = train_sms_y)
plt.show()
It's important to recognize that these are just two simple examples of features you might engineer from these messages. There are countless possibilities that could be even better predictors of "spam".
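For instance, the overall length of a message is another easily engineered candidate (purely illustrative, not used later in the lab):
## Another illustrative candidate feature: message length in characters
msg_len = train_sms_X.apply(len)
print(msg_len.head(3))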
Additionally, it's beyond the scope of this course to thoroughly cover string processing in Python, but I encourage you to consult this reference for guidance and examples when writing your own functions to engineer features from text.
Question 1: In this question you'll continue working with the SMS spam data. You are encouraged to use this as a starting point on this unit's homework assignment.
Part A: Import the regular expressions module with import re, then use the findall() function with the regular expression r'[A-Z]' to create a function that calculates the percentage of a message's characters that are capital letters.

Part B: The function first_word() defined below returns the first word in a message. It works by splitting the string at the character ' ' (space), keeping only the first word (position 0), converting this word to lower case, and then removing any exclamation points by replacing them with '', an empty string. Use this function and the function you created in Part A to assemble a data frame containing two engineered features named "prop_cap" and "first_word".

Part C: Explore how these features relate to the outcome by applying the value_counts() method on separate segments of the training data. You might also want to use the head() method to avoid printing a large table of counts.

## Define "first_word" function
def first_word(text):
return text.split(sep=' ')[0].lower().replace('!','')
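For example (illustrative):
## Example behavior of first_word
print(first_word('WINNER!! You have won'))   # prints 'winner'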
The data below were collected using a chest-mounted accelerometer in a research study on activity recognition. The original study collected data on 15 participants performing 7 activities, but for demonstration purposes we will only use data from 2 study participants and 2 activities: "working at a computer", or 'Work', and "talking while standing", or 'Talk'.
These data are uncalibrated acceleration measurements recorded at a rate of 52 Hz (so 52 measurements are recorded each second). In our analysis we will look to classify 10-second samples of data, or collections of 520 measurements per axis.
We will use the data from one participant (#1) for training, and we will evaluate the generalizability of our models using data from the other participant (#7) as a test set.
## Dataset
act_ts = pd.read_csv("https://remiller1450.github.io/data/activity.csv")
## Training Data
train_X = act_ts[act_ts['Subject'] == 1].drop(['Label'], axis = 1)
train_y = act_ts[act_ts['Subject'] == 1]['Label']
## Validation Data
test_X = act_ts[act_ts['Subject'] == 7].drop(['Label'], axis = 1)
test_y = act_ts[act_ts['Subject'] == 7]['Label']
Successful feature engineering begins with a thoughtful examination of the training data in order to find predictive patterns. It is essential that this exploration only involves the training data to avoid data leakage.
Below we'll create a few simple graphs of the three available predictors over time, colored by the label:
my_vars = ['X_accel', 'Y_accel', 'Z_accel']
for cur_var in my_vars:
plt.figure().set_figheight(2)
plt.scatter(x = range(len(train_X[cur_var])), y=train_X[cur_var], c = train_y.map({'Work': 1, 'Talk': 0}))
plt.title(cur_var)
plt.show()
There are many feature engineering ideas we can glean from this exploration; below are two that stand out:

- The overall level (mean) of each acceleration variable within a sample appears to differ between the two activities.
- The variability (standard deviation) of each acceleration variable within a sample also appears to differ between the two activities.

To translate these ideas into usable features we can use grouped summarization:
train_X_FE = pd.concat([train_X.groupby('Sample_Num')[my_vars].mean(),
train_X.groupby('Sample_Num')[my_vars].std()], axis=1)
train_X_FE.columns = ['Mean_X', 'Mean_Y', 'Mean_Z', 'SD_X', 'SD_Y', 'SD_Z']
train_X_FE.head(5)
| Sample_Num | Mean_X | Mean_Y | Mean_Z | SD_X | SD_Y | SD_Z |
|---|---|---|---|---|---|---|
| 1 | 1897.030769 | 2292.657692 | 2064.600000 | 170.098461 | 143.059786 | 125.358252 |
| 2 | 1957.184615 | 2379.226923 | 2108.601923 | 11.067517 | 7.863383 | 16.358731 |
| 3 | 1962.921154 | 2378.396154 | 2112.584615 | 5.970345 | 3.332676 | 5.728319 |
| 4 | 1969.034615 | 2375.501923 | 2120.165385 | 14.418689 | 7.394212 | 15.416865 |
| 5 | 1976.882692 | 2373.157692 | 2130.376923 | 7.419165 | 6.416962 | 7.635497 |
We can confirm that these simple features are incredibly predictive, at least on the training data:
train_y_FE = act_ts[act_ts['Subject'] == 1].groupby('Sample_Num')['Label'].first()
plt.scatter(train_X_FE['Mean_X'], train_X_FE['SD_X'], c = train_y_FE.map({'Work': 1, 'Talk': 0}))
plt.show()
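Although not shown in the original example, the same grouped summarization would eventually need to be applied to the subject-7 test data before evaluating a model; a sketch mirroring the code above:
## Illustrative: matching engineered features for the subject-7 test set
test_X_FE = pd.concat([test_X.groupby('Sample_Num')[my_vars].mean(),
                       test_X.groupby('Sample_Num')[my_vars].std()], axis=1)
test_X_FE.columns = ['Mean_X', 'Mean_Y', 'Mean_Z', 'SD_X', 'SD_Y', 'SD_Z']
test_y_FE = act_ts[act_ts['Subject'] == 7].groupby('Sample_Num')['Label'].first()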
Question 2: For this question you should continue using the data introduced in this section's example.
Question 3: The data below come from a different portion of the activity recognition study described at the beginning of this section, covering 3 subjects and two different activities: "Walking", or 'Walking', and "Walking and Talking to Someone", or 'Walking_Talking'.

Explore the training data and engineer your own features from the accelerometer variables X_accel, Y_accel, and Z_accel. Your answer to this question should include your visualization and code that creates a version of your data with 1 row per sample and your engineered features as columns (similar to the format of the example given in this part of the lab).

Hint: Classifying these activities is significantly more difficult than the example distinguishing between working and talking. The purpose of this exercise is for you to explore the creation of more sophisticated features (beyond simple means/standard deviations). Some possible ideas for features are rolling averages, ranges, rates of change, etc. You should not be discouraged if you aren't able to achieve a very high degree of classification accuracy.
## Question 3 Dataset
act_ts_v2 = pd.read_csv("https://remiller1450.github.io/data/activity_v2.csv")
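To get you started, the sketches below show two of the feature types mentioned in the hint. They are illustrations rather than required features, and they assume act_ts_v2 uses the same column names as the earlier activity data ('Sample_Num', 'X_accel', etc.):
## Illustrative sketch: per-sample range of X_accel (column names assumed)
range_X = act_ts_v2.groupby('Sample_Num')['X_accel'].agg(lambda s: s.max() - s.min())
## Illustrative sketch: variability of a 1-second (52-point) rolling average within each sample
roll_sd_X = act_ts_v2.groupby('Sample_Num')['X_accel'].apply(lambda s: s.rolling(52).mean().std())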
While not directly related to the topic of feature engineering, it's worth knowing that pipelines provide an opportunity to handle missing data without introducing data leakage. This section of the lab introduces two missing-data imputation methods, simple imputation and nearest-neighbors imputation, but many other strategies exist.
The goal of imputation is to "fill in" the missing values in a data set so that all of its rows can be used during model training/evaluation. Many machine learning algorithms in sklearn cannot handle data points with missing values.
To demonstrate imputation, we'll load the "Happy Planet" data set, which contains various measures of different countries around the world. We're primarily interested in the fact that two countries have missing values for the variables GDPperCapita and HDI:
## Import data set w/ missing values
hp = pd.read_csv("https://remiller1450.github.io/data/HappyPlanet.csv")
## Tally number of missing values (by variable)
hp.isna().sum()
Country           0
Region            0
Happiness         0
LifeExpectancy    0
Footprint         0
HLY               0
HPI               0
HPIRank           0
GDPperCapita      2
HDI               2
Population        0
dtype: int64
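As a quick illustrative check (not in the original lab), we can view the rows containing the missing values directly:
## View the rows that contain missing values
hp[hp.isna().any(axis=1)]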
Now let's consider trying to predict the happiness score of a country using its region, life expectancy, and human development index (HDI).
## Outcome
hp_y = hp['Happiness']
## Predictors
hp_X = hp[['Region','LifeExpectancy', 'HDI']]
Below we create two different pipelines. The first imputes missing values using the mean value of the variable containing the missing data. Theoretically speaking, simple mean imputation is only a valid approach when missing values are "missing completely at random" (MCAR), meaning the missing values appear entirely at random in the data, with no relation to the values of other observed variables or the identity of the involved rows. The second pipeline imputes values using $k$-nearest neighbors, an approach that is valid when missing values are "missing at random" (MAR), meaning their occurrence can be explained using other variables observed in the data.
Determining whether your missing values are MCAR, MAR, or MNAR ("missing not at random") generally requires domain-specific knowledge about your data and how they were collected, which complicates the decision of whether to include missing data imputation as part of a pre-processing pipeline.
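To make the mechanics concrete, here's mean imputation applied to a tiny made-up array (a minimal illustration, not the lab's data):
## Illustrative: SimpleImputer replaces each NaN with its column's mean
from sklearn.impute import SimpleImputer
demo = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
print(SimpleImputer(strategy='mean').fit_transform(demo))
## the NaN in the first column becomes 2.0; in the second, 15.0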
## Import imputation functions
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
simp_imp_pipe = Pipeline([
("imputer", SimpleImputer(missing_values=np.nan, strategy='mean')),
("model", KNeighborsRegressor())
])
simp_fit = simp_imp_pipe.fit(hp_X, hp_y)
print(simp_fit.score(hp_X, hp_y))
knn_imp_pipe = Pipeline([
("imputer", KNNImputer(missing_values=np.nan, n_neighbors=4, weights='distance')),
("model", KNeighborsRegressor())
])
kp_fit = knn_imp_pipe.fit(hp_X, hp_y)
print(kp_fit.score(hp_X, hp_y))
0.8130099967617596
0.8128904952418757
Because there were only two missing values in this example, there's seemingly not much difference in the training scores ($R^2$) between these two strategies.
However, something to recognize is that we treated the "Region" variable as numeric, when it actually uses integer values to encode different categories. Additionally, we did not perform any standardization or scaling in our pipeline, which will impact KNNImputer().
Question 4: In this question you will improve upon the example pipeline from this section.