This lab will conclude our study of fundamental concepts in machine learning. The primary focus will be on feature engineering, or the process of deriving/creating effective predictors from the available data.
For context, it's quite common for the top-performing teams in many machine learning competitions to use similar algorithms/methods, with the distinguishing factor being feature engineering. Andrew Ng, a well-known machine learning expert, has commented that most uses of applied machine learning are essentially feature engineering problems. I encourage you to read this article for additional perspective and background.
We'll begin by loading many familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
# Unfortunately, kNN functions prompt "future warnings"; the commands below turn them off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
This lab will utilize the SMS spam data set from the UCI Machine Learning Repository, which contains 5574 text messages that are labeled as either "spam" or "ham" (not spam).
## Note that the text file containing these data uses a tab delimiter to separate the label and message
sms = pd.read_csv("https://remiller1450.github.io/data/sms_spam.txt", sep='\t', names=['Label','Message'])
sms.head(5)
|   | Label | Message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
These data have a clear outcome variable, but the predictors must be derived from the message text.
Your first instinct might be to tokenize each unique word into a feature and use these word counts as the basis for a model; however, it's often possible to build a simpler and more efficient model that's just as accurate by combining domain-specific knowledge with artifacts found in the data.
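For perspective, here's a minimal sketch of that bag-of-words approach using scikit-learn's `CountVectorizer` (purely illustrative; we won't use these features in this lab):
## Illustration only: one count column per unique token in the corpus
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
word_counts = vec.fit_transform(sms['Message'])
print(word_counts.shape)  # thousands of columns, most of which are rarely non-zero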
Before we introduce too much data leakage, let's begin by performing a training-testing split:
from sklearn.model_selection import train_test_split
train, test = train_test_split(sms, test_size=0.2, random_state=8)
## Split outcome from predictors
train_y = (train['Label'] == 'spam').astype(int)
train_msg = train['Message']
Feature engineering describes the process of using domain knowledge to extract or construct features that can be used by a machine learning algorithm.
To begin, let's do some basic exploration by printing out several spam messages and looking for possible identifying characteristics:
## Show 10 randomly selected spam messages
train_msg[train_y == 1].sample(10)
4798    Santa calling! Would your little ones like a c...
5292    Urgent! Please call 09061213237 from landline....
3862    Free Msg: Ringtone!From: http://tms. widelive....
1022    Guess what! Somebody you know secretly fancies...
4162    Had your mobile 11 months or more? U R entitle...
3954    Refused a loan? Secured or Unsecured? Can't ge...
4676    Hi babe its Chloe, how r u? I was smashed on s...
3968    YOU HAVE WON! As a valued Vodafone customer ou...
4297    Please CALL 08712402578 immediately as there i...
3598    Congratulations YOU'VE Won. You're a Winner in...
Name: Message, dtype: object
Simple inspection of these messages is very informative. We'll first engineer a single feature using what we see, and later in the lab you'll have an opportunity to construct your own set of features.
To begin, you might notice that spam messages seem to contain more numbers than normal conversations do. So, let's try engineering a feature to capture this:
## Function returning the proportion of numeric characters in a string
def get_num(text):
return sum(map(str.isdigit, text))/len(text)
## Create a dictionary and convert to a Pandas dataframe
d = {'prop_num': train_msg.apply(get_num)}
train_X = pd.DataFrame(d)
train_X.head(6)
|      | prop_num |
|------|----------|
| 3922 | 0.017857 |
| 2559 | 0.000000 |
| 2672 | 0.000000 |
| 4282 | 0.006579 |
| 987  | 0.000000 |
| 3323 | 0.000000 |
In this example we create a function named `get_num()` that returns the proportion of numeric characters in each string it is given. The `map` function iteratively applies a function to the elements of an iterable object (such as a single string, a dictionary, or a list). In this example, the `str.isdigit` method is applied to each character in the string. The results for the string are then summed and scaled by the length of the string, so the function ultimately returns the proportion of digits in a given text message.

We then use the `apply` method to apply the function `get_num` to each message contained in `train_msg`. We first store these results in a dictionary that we then convert to a `pandas` DataFrame.
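As a quick sanity check, we can call `get_num()` on a made-up message:
## get_num() on a hypothetical message: 15 of its 31 characters are digits
print(get_num("Win $1000 now! Call 09061213237"))  # prints roughly 0.484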
**Question #1**: The `prop_num` column in `train_X` has a lot of zeros. How might this impact the way you choose to scale these data? Briefly explain.

So far, all of our examples (home sales, handwritten digits, and spam classification) have relied exclusively on numeric predictors. Categorical predictors can be included in these models using a strategy known as one-hot encoding (or dummy variables in statistical modeling).
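Before turning to a real data set, here's a toy illustration (with made-up values) of what one-hot encoding produces:
## Toy example: one 0/1 indicator column per category
toy = pd.Series(['brick', 'frame', 'brick'], name='style')
print(pd.get_dummies(toy))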
The code below demonstrates this using the get_dummies
function on the Iowa City Home Sales data:
## Load the Iowa City home sales data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## one hot encoding for style, bsmt and ac, dropping other categorical vars
from pandas import get_dummies
encode_list = ['style','bsmt','ac']
ic_ohe = get_dummies(ic, columns=encode_list)
## drop non-numeric vars
ic_new = ic_ohe.select_dtypes("number")
print(ic_new.dtypes) # Check out the new columns
sale.amount                int64
built                      int64
bedrooms                   int64
area.base                  int64
area.add                   int64
area.bsmt                  int64
area.garage1               int64
area.garage2               int64
area.living                int64
area.lot                   int64
lon                        float64
lat                        float64
assessed                   int64
style_1 1/2 Story Frame    uint8
style_1 Story Brick        uint8
style_1 Story Condo        uint8
style_1 Story Frame        uint8
style_2 Story Brick        uint8
style_2 Story Condo        uint8
style_2 Story Frame        uint8
style_Split Foyer Frame    uint8
style_Split Level Frame    uint8
bsmt_1/2                   uint8
bsmt_3/4                   uint8
bsmt_Crawl                 uint8
bsmt_Full                  uint8
bsmt_None                  uint8
ac_No                      uint8
ac_Yes                     uint8
dtype: object
Notice how each category within the variables "style", "bsmt", and "ac" was coded into its own 0/1 indicator column.

For models like $k$-nearest neighbors, it's fine if there are columns of linearly dependent dummy variables; but for many statistical models, the convention is to reserve one category as a reference category (to ensure identifiability). So, if reference coding is desired, the `drop_first` argument can be used to make the first category (dummy column) the reference category.
## Reference coding
ic_ohe = get_dummies(ic, columns=encode_list, drop_first=True)
ic_new = ic_ohe.select_dtypes("number")
print(ic_new.dtypes) # Check out the new columns
sale.amount                int64
built                      int64
bedrooms                   int64
area.base                  int64
area.add                   int64
area.bsmt                  int64
area.garage1               int64
area.garage2               int64
area.living                int64
area.lot                   int64
lon                        float64
lat                        float64
assessed                   int64
style_1 Story Brick        uint8
style_1 Story Condo        uint8
style_1 Story Frame        uint8
style_2 Story Brick        uint8
style_2 Story Condo        uint8
style_2 Story Frame        uint8
style_Split Foyer Frame    uint8
style_Split Level Frame    uint8
bsmt_3/4                   uint8
bsmt_Crawl                 uint8
bsmt_Full                  uint8
bsmt_None                  uint8
ac_Yes                     uint8
dtype: object
Let's now use one-hot encoding with the SMS spam data to help us create new predictors derived from the first word in a message.
To help guide this process, you might begin by noticing that spam messages disproportionately begin with the words "free" and "urgent" (with some variation in capitalization).
## Define "first_word" function
def first_word(text):
return text.split(sep=' ')[0].lower().replace('!','')
## Recreate train_X using the new feature
d = {'prop_num': train_msg.apply(get_num),
'first_word': train_msg.apply(first_word)}
train_X = pd.DataFrame(d)
train_X.head(6)
|      | prop_num | first_word |
|------|----------|------------|
| 3922 | 0.017857 | do |
| 2559 | 0.000000 | some |
| 2672 | 0.000000 | that's |
| 4282 | 0.006579 | wn |
| 987  | 0.000000 | i'm |
| 3323 | 0.000000 | ok |
The code above defines a function named `first_word` that accepts a character string. This function splits the string at the space character `' '`, keeps only the first word (position 0), converts that word to lower case, and removes any exclamation points by replacing them with `''` (an empty string).
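A quick check on a made-up message confirms this behavior:
## first_word() on a hypothetical message
print(first_word("FREE! Entry to our weekly comp"))  # prints 'free'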
We can now apply one-hot encoding to this column and keep only the columns corresponding to the words we identified as important ("urgent" and "free"):
## Apply OHE to the 'first_word' column
train_X_ohe = get_dummies(train_X, columns=['first_word'])
train_X_ohe.shape
(4457, 1174)
At this step, we see that there are over a thousand different first words, and we don't expect most of these words to be predictive of spam. So, we'll only keep variables relating to the words "urgent" and "free" for the reasons previously discussed.
## Keep indicators for the words 'urgent' and 'free'
train_X2 = train_X_ohe[['prop_num','first_word_urgent','first_word_free']]
## Print the frequency of "1" for each of our new variables (note: printing a set, so the order of the two counts is arbitrary):
print({sum(train_X2['first_word_urgent']), sum(train_X2['first_word_free'])})
{43, 28}
Undoubtedly this process could be improved with more sophisticated use of regular expressions and string-processing tools; however, you'll see that these simplistic features are enough to build a fairly strong classifier.
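For reference, one such refinement might look like this regex-based first-word extractor (a hypothetical helper, not used elsewhere in this lab), which is a bit more robust to leading punctuation:
import re

## Hypothetical regex-based alternative to first_word()
def first_word_re(text):
    match = re.search(r"[A-Za-z']+", text)  # first run of letters/apostrophes
    return match.group(0).lower() if match else ''

print(first_word_re("URGENT!! You have won..."))  # prints 'urgent'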
**Question #2**:

- **Part A**: Use the `value_counts()` method with the argument `ascending=False` to find the most commonly occurring first words in spam messages. Based upon this list, determine another word to keep as an indicator variable from `train_X_ohe` (as created in the examples above). Hint: you may want to use the `head()` method to print more than just the top few words.
- **Part B**: Fit a classifier using the indicator variables `first_word_urgent`, `first_word_free`, and the word you identified in Part A, in addition to the predictor `prop_num` from Question #1.

It's worth noting that it's possible to include one-hot encoding as a step within a preprocessing pipeline. Unfortunately, doing so requires carefully setting up different preprocessing steps for numeric and categorical predictors.
The code below demonstrates this using the data we created in the previous section and a $k$-nearest neighbors classifier:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
## Separate pipelines for numeric vs. cat vars
num_transformer = Pipeline([("scaler", StandardScaler())])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])
## Get names of numeric and categorical columns
num_cols = train_X.select_dtypes(exclude=['object']).columns.tolist()
cat_cols = train_X.select_dtypes(include=['object']).columns.tolist()
## Preprocessing transformer allowing different actions for numeric and categorical vars
preprocessor = ColumnTransformer([
('num', num_transformer, num_cols),
('cat', cat_transformer, cat_cols)
])
## Combine the preprocessor and model into a final pipeline
final_pipe = Pipeline([
("preprocessor", preprocessor),
("model", KNeighborsClassifier())
])
fitted = final_pipe.fit(train_X, train_y)
print(fitted.score(train_X, train_y))
0.971729863136639
A few things to note:

- The `OneHotEncoder` function by default returns a sparse matrix object (which isn't what we want as an input to our model); we prevent this with `sparse=False`.
- The argument `handle_unknown='ignore'` instructs the pipeline to ignore any categories that weren't present in the training data.
- The `ColumnTransformer` is given the column names stored in `num_cols` and `cat_cols`. These names are how the pipeline knows which preprocessing routine to use for each column in the input data.
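To see the second point in action, the fitted pipeline from above can score a message whose first word never appeared in the training data (a made-up example); the unseen category is simply encoded as all zeros instead of raising an error:
## Hypothetical message whose first word was never seen during training
new_msg = pd.DataFrame({'prop_num': [0.0], 'first_word': ['zzzunseen']})
print(fitted.predict(new_msg))  # works because handle_unknown='ignore'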
Pipelines also provide an opportunity to handle missing data without introducing data leakage. In this section of the lab, we'll briefly demonstrate two different imputation methods, simple imputation and nearest-neighbors imputation, that can be used in a pipeline.

To begin, we'll load the "Happy Planet" data set, which contains various measures of different countries around the world. We're primarily interested in the fact that two countries have missing values for the variables `GDPperCapita` and `HDI`.
## Import data set w/ missing values
hp = pd.read_csv("https://remiller1450.github.io/data/HappyPlanet.csv")
## Tally number of missing values
hp.isna().sum()
Country           0
Region            0
Happiness         0
LifeExpectancy    0
Footprint         0
HLY               0
HPI               0
HPIRank           0
GDPperCapita      2
HDI               2
Population        0
dtype: int64
Before performing imputation, let's prepare these data for a simple modeling task: predicting "Happiness" using a country's Region, LifeExpectancy, and HDI (which contains missing values).
## create outcome
hp_y = hp['Happiness']
## create predictors
hp_X = hp[['Region','LifeExpectancy', 'HDI']]
Next we'll create a couple of pipelines that each contain an imputation step:
## Import imputation functions
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
simp_imp_pipe = Pipeline([
("imputer", SimpleImputer(missing_values=np.nan, strategy='mean')),
("model", KNeighborsRegressor())
])
simp_fit = simp_imp_pipe.fit(hp_X, hp_y)
print(simp_fit.score(hp_X, hp_y))
knn_imp_pipe = Pipeline([
("imputer", KNNImputer(missing_values=np.nan, n_neighbors=4, weights='distance')),
("model", KNeighborsRegressor())
])
kp_fit = knn_imp_pipe.fit(hp_X, hp_y)
print(kp_fit.score(hp_X, hp_y))
0.8130099967617596
0.8128904952418757
Because there were only two missing values in this example, there's seemingly not much difference between these two strategies.
However, something to recognize is that we treated the "Region" variable as numeric data, when it actually uses integer values to encode different categories. Additionally, we did not perform any standardization or scaling in our pipeline, which will impact the step involving `KNNImputer`.
To finish up this example, let's create a pipeline that integrates standardization, imputation, and one-hot-encoding:
## Separate pipelines for numeric vs. cat vars
num_transformer = Pipeline([("scaler", StandardScaler()),
("imputer", KNNImputer(missing_values=np.nan, n_neighbors=4, weights='distance'))
])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])
## Get names of numeric and categorical columns
num_cols = ['LifeExpectancy', 'HDI']
cat_cols = ['Region']
## Preprocessing transformer allowing different actions for numeric and categorical vars
preprocessor = ColumnTransformer([
('num', num_transformer, num_cols),
('cat', cat_transformer, cat_cols)
])
## Combine the preprocessor and model into a final pipeline
knn_final_pipe = Pipeline([
("preprocessor", preprocessor),
("model", KNeighborsRegressor())
])
## Fit new pipeline and print the score
kpn_fit = knn_final_pipe.fit(hp_X, hp_y)
print(kpn_fit.score(hp_X, hp_y))
0.8525635060743124
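Keep in mind that the scores above were computed on the same data used to fit each pipeline, so they are optimistic. Because imputation, scaling, and encoding all live inside the pipeline, we can obtain an honest estimate without data leakage; here is a quick sketch using scikit-learn's `cross_val_score` with 5-fold splitting:
from sklearn.model_selection import cross_val_score

## Each fold's imputation/scaling is fit using only that fold's training portion
cv_scores = cross_val_score(knn_final_pipe, hp_X, hp_y, cv=5)
print(cv_scores.mean())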
For this question you should use the "Colleges 2019" data set, which contains information on all primarily undergraduate colleges in the United States during the 2019-20 academic year. You should notice that several variables include missing data. For simplicity, I'll ask you to create a model-fitting pipeline aimed at predicting the median salary of a college's alumni 10 years after graduation, `Salary10yr_median`, using the school's cost, average faculty salary, region, and median ACT score. The data needed for this pipeline are created below.
**Question #3**: Create a pipeline that performs appropriate preprocessing (including imputation of the missing values) and then fits a $k$-nearest neighbors regression model with `n_neighbors = 10` and uniform weighting. Print the cross-validated score (using the default metric of $R^2$ and 5-fold cross-validation) of the model using this pipeline.

## Import data set w/ missing values
c19 = pd.read_csv("https://remiller1450.github.io/data/Colleges2019.csv")
## Remove rows with missing outcome
c19 = c19[c19.Salary10yr_median.notnull()]
## Separate predictors and outcome
c19_y = c19['Salary10yr_median']
c19_X = c19[['Cost','Avg_Fac_Salary','Region','ACT_median']]
## Note that some data are missing
c19_X.isna().sum()
Cost               39
Avg_Fac_Salary      6
Region              0
ACT_median        432
dtype: int64
For the final part of this lab, you are to build your own kNN classifier using feature engineering and preprocessing strategies of your choosing. You should optimize your classifier's performance, measured by the F1-score, using cross-validation, and report the final performance of your classifier on the test set.
As a benchmark, preprocessing with max-absolute scaling before fitting a kNN classifier with 15 neighbors using the predictors defined earlier in this lab, along with two additional features related to capitalization and non-alphanumeric characters, achieves a cross-validated F1-score of approximately 0.90 (and a cross-validated accuracy of roughly 97%). To get full credit for this question, you must beat this benchmark F1-score (assessed via cross-validation on the training set, since you should only use the test set for final evaluation).
Hint: I encourage you to look at this list of string methods. In particular, you might find `isalpha`, `isupper`, and `endswith('!')` to be useful in reaching the benchmark.
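As a starting point, here is a minimal sketch of one possible capitalization feature (an illustrative assumption, not a prescribed solution):
## Hypothetical feature: proportion of alphabetic characters that are upper case
def prop_upper(text):
    letters = [ch for ch in text if ch.isalpha()]
    return sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0

## Could be added to train_X via train_msg.apply(prop_upper)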