Lab #3 (part 2) - Feature Engineering and Missing Data

This lab will conclude our study of fundamental concepts in machine learning. The primary focus will be on feature engineering, or the process of deriving/creating effective predictors from available data.

For context, it's quite common for the top performing teams in many machine learning competitions to use similar algorithms/methods, with the distinguishing factor being feature engineering. Andrew Ng, a well-known machine learning expert, has commented that most uses of applied machine learning are essentially feature engineering problems. I encourage you to read this article for additional perspective and background.

We'll begin by loading many familiar libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

# Unfortunately, kNN functions prompt "future warnings"; the commands below turn them off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

This lab will utilize the SMS spam data set from the UCI Machine Learning Repository, which contains 5574 text messages that are labeled as either "spam" or "ham" (not spam).

In [2]:
## Note that the text file containing these data uses a tab delimiter to separate the label and message
sms = pd.read_csv("https://remiller1450.github.io/data/sms_spam.txt", sep='\t', names=['Label','Message'])
sms.head(5)
Out[2]:
Label Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

These data have a clear outcome variable, but the predictors must be derived from the message.

Your first instinct might be to tokenize each unique word into a feature and use these as the basis for a model; however, it's often possible to build a simpler and more efficient model that's just as accurate by combining domain-specific knowledge with artifacts found in the data.

Before we explore the data any further (and risk introducing data leakage), let's begin by performing a training-testing split:

In [3]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(sms, test_size=0.2, random_state=8)

## Split outcome from predictors
train_y = (train['Label'] == 'spam').astype(int)
train_msg = train['Message']

Part 1 - Feature Engineering

Feature engineering describes the process of using domain knowledge to extract or construct features that can be used in a machine learning process.

To begin, let's do some basic exploration by printing out several spam messages and looking for possible identifying characteristics:

In [4]:
## Show 10 randomly selected spam messages
train_msg[train_y ==1].sample(10)
Out[4]:
4798    Santa calling! Would your little ones like a c...
5292    Urgent! Please call 09061213237 from landline....
3862    Free Msg: Ringtone!From: http://tms. widelive....
1022    Guess what! Somebody you know secretly fancies...
4162    Had your mobile 11 months or more? U R entitle...
3954    Refused a loan? Secured or Unsecured? Can't ge...
4676    Hi babe its Chloe, how r u? I was smashed on s...
3968    YOU HAVE WON! As a valued Vodafone customer ou...
4297    Please CALL 08712402578 immediately as there i...
3598    Congratulations YOU'VE Won. You're a Winner in...
Name: Message, dtype: object

Simple inspection of these messages is very informative. We'll first engineer a single feature using what we see, and later in the lab you'll have an opportunity to construct your own set of features.

To begin, you might notice that spam messages seem to contain more numbers than normal conversations. So, let's try engineering a feature to capture this:

In [5]:
## Function to measure numbers
def get_num(text):
    return sum(map(str.isdigit, text))/len(text)

## Create a dictionary and convert to a Pandas dataframe
d = {'prop_num': train_msg.apply(get_num)}
train_X = pd.DataFrame(d)
train_X.head(6)
Out[5]:
prop_num
3922 0.017857
2559 0.000000
2672 0.000000
4282 0.006579
987 0.000000
3323 0.000000

In this example we create a function named get_num() that returns the proportion of characters in a given string that are digits.

The map function applies a function to each element of an iterable object (such as a string, a list, or a dictionary). In this example, the str.isdigit method is applied to each character in the string, producing True or False for each one. These results are then summed (True counts as 1) and divided by the length of the string, so the function ultimately returns the proportion of digits in a given text message.

We then use the apply method to apply the function get_num to each message contained in train_msg. We first store these results in a dictionary that we then convert to a pandas DataFrame.
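
As a quick check of how get_num behaves, the sketch below applies it to a few made-up strings. The example messages are hypothetical and are not drawn from the SMS data; note also that an empty string would raise a ZeroDivisionError, which this simple version does not guard against.

```python
def get_num(text):
    # str.isdigit gives True/False per character; True counts as 1 when summed
    return sum(map(str.isdigit, text)) / len(text)

# Hypothetical messages used only for illustration
examples = ["Call 0906 now", "see you soon", "2 4 1"]

for msg in examples:
    # Print each message alongside its proportion of digit characters
    print(repr(msg), round(get_num(msg), 3))
```

Spaces and punctuation count toward the length of the string, so the proportion is relative to all characters rather than only letters and digits.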

Question #1

  • Part A) Notice the prop_num column in train_X has a lot of zeros. How might this impact the way you choose to scale these data? Briefly explain.
  • Part B) Create a pipeline that uses the scaling method you chose in Part A and a kNN classifier with $k=60$, Euclidean distance, and inverse distance weighting.
  • Part C) Use your pipeline to find a cross-validated F1-score (using 5-fold cross-validation) of this classifier using the single predictor we created. In addition to reporting the F1-score, briefly explain why the F1-score metric is a good choice for this application.

Part 2 - One Hot Encoding

So far, all of our examples (home sales, handwritten digits, and spam classification) have relied exclusively on numeric predictors. Categorical predictors can be included in these models using a strategy known as one hot encoding (or dummy variables in statistical modeling).

The code below demonstrates this using the get_dummies function on the Iowa City Home Sales data:

In [6]:
## Load the Iowa City home sales data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## one hot encoding for style, bsmt and ac, dropping other categorical vars
from pandas import get_dummies
encode_list = ['style','bsmt','ac']
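## Note: pandas 2.0+ creates bool dummy columns by default, which the select_dtypes("number")
## step below would drop; passing dtype=int to get_dummies keeps them numeric
ic_ohe = get_dummies(ic, columns=encode_list)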

## drop non-numeric vars
ic_new = ic_ohe.select_dtypes("number")
print(ic_new.dtypes) # Check out the new columns
sale.amount                  int64
built                        int64
bedrooms                     int64
area.base                    int64
area.add                     int64
area.bsmt                    int64
area.garage1                 int64
area.garage2                 int64
area.living                  int64
area.lot                     int64
lon                        float64
lat                        float64
assessed                     int64
style_1 1/2 Story Frame      uint8
style_1 Story Brick          uint8
style_1 Story Condo          uint8
style_1 Story Frame          uint8
style_2 Story Brick          uint8
style_2 Story Condo          uint8
style_2 Story Frame          uint8
style_Split Foyer Frame      uint8
style_Split Level Frame      uint8
bsmt_1/2                     uint8
bsmt_3/4                     uint8
bsmt_Crawl                   uint8
bsmt_Full                    uint8
bsmt_None                    uint8
ac_No                        uint8
ac_Yes                       uint8
dtype: object

Notice how each category within the variables "style", "bsmt", and "ac" was coded into an integer column.

For models like $k$-nearest neighbors, it's fine if there are columns of linearly dependent dummy variables; but for many statistical models, the convention is to reserve one category as a reference category (to ensure identifiability). So, if reference coding is desired, the drop_first argument can be used to make the first category (dummy column) the reference category.

In [7]:
## Reference coding
ic_ohe = get_dummies(ic, columns=encode_list, drop_first=True)
ic_new = ic_ohe.select_dtypes("number")
print(ic_new.dtypes) # Check out the new columns
sale.amount                  int64
built                        int64
bedrooms                     int64
area.base                    int64
area.add                     int64
area.bsmt                    int64
area.garage1                 int64
area.garage2                 int64
area.living                  int64
area.lot                     int64
lon                        float64
lat                        float64
assessed                     int64
style_1 Story Brick          uint8
style_1 Story Condo          uint8
style_1 Story Frame          uint8
style_2 Story Brick          uint8
style_2 Story Condo          uint8
style_2 Story Frame          uint8
style_Split Foyer Frame      uint8
style_Split Level Frame      uint8
bsmt_3/4                     uint8
bsmt_Crawl                   uint8
bsmt_Full                    uint8
bsmt_None                    uint8
ac_Yes                       uint8
dtype: object
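
To make the difference between full one hot encoding and reference coding concrete, here is a minimal pure-Python sketch. The helper one_hot and the toy values are hypothetical (they are not part of pandas or of the home sales data); the helper only mimics the general idea behind get_dummies and its drop_first argument.

```python
# Hypothetical values of a categorical variable (not taken from the home sales data)
ac = ["Yes", "No", "Yes", "Yes"]

def one_hot(values, drop_first=False):
    # Sorted unique categories become the dummy columns
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the first category becomes the reference
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

full_cols, full_rows = one_hot(ac)
ref_cols, ref_rows = one_hot(ac, drop_first=True)

# Full coding: every row sums to 1, so the columns are perfectly collinear with an intercept
print(full_cols, full_rows)
# Reference coding: a row of all zeros now represents the reference category ("No")
print(ref_cols, ref_rows)
```

With reference coding, no information is lost: the dropped category is implied whenever all of the remaining dummy columns are zero.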

Let's now use one-hot encoding with the SMS spam data to help us create new predictors derived from the first word in a message.

To help guide this process, you might begin by noticing that spam messages disproportionately begin with the words "free" and "urgent" (with some variation on the capitalization).

In [8]:
## Define "first_word" function
def first_word(text):
    return text.split(sep=' ')[0].lower().replace('!','')

## Recreate train_X using the new feature
d = {'prop_num': train_msg.apply(get_num),
    'first_word': train_msg.apply(first_word)}
train_X = pd.DataFrame(d)
train_X.head(6)
Out[8]:
prop_num first_word
3922 0.017857 do
2559 0.000000 some
2672 0.000000 that's
4282 0.006579 wn
987 0.000000 i'm
3323 0.000000 ok

The code above defines a function named first_word that accepts a character string. This function splits the string at the character ' ' (space), keeps only the first word (position 0), converts that word to lower case, and removes any exclamation points by replacing them with '' (an empty string).
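
The sketch below runs first_word on a few made-up messages (hypothetical examples, not rows from the data) to show what the function does and does not clean up.

```python
def first_word(text):
    # Split on spaces, keep the first piece, lower-case it, and strip '!' characters
    return text.split(sep=' ')[0].lower().replace('!', '')

# Hypothetical messages used only for illustration
examples = ["URGENT! You have won", "Free entry in 2", "Urgent: call now", "Ok lar..."]

results = [first_word(msg) for msg in examples]
print(results)
```

Only exclamation points are removed, so a first word such as "Urgent:" keeps its colon and would be treated as a different category from "urgent". This is one place where additional string processing could improve the feature.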

We can now apply one hot encoding to this column and keep only the columns corresponding to the words we identified as important (urgent and free):

In [9]:
## Apply OHE to the 'first_word' column
train_X_ohe = get_dummies(train_X, columns=['first_word'])
train_X_ohe.shape
Out[9]:
(4457, 1174)

At this step, we see that there are over a thousand different first words, and we don't expect most of these words to be predictive of spam. So, we'll only keep variables relating to the words "urgent" and "free" for the reasons previously discussed.

In [10]:
## Keep indicators for the words 'urgent' and 'free'
train_X2 = train_X_ohe[['prop_num','first_word_urgent','first_word_free']]

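## Print the frequency of "1" for each of our new variables
## (note: {...} builds a set, which is unordered, so the printed order need not match the order written here)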
print({sum(train_X2['first_word_urgent']), sum(train_X2['first_word_free'])})
{43, 28}

Undoubtedly this process could be improved with sophisticated use of regular expressions and string processing tools; however, you'll see that these simplistic features are enough to build a fairly strong classifier.

Question #2

  • Part A) Use the value_counts() method with the argument ascending=False to find the most commonly occurring first words in spam messages. Based upon this list, determine another word to keep as an indicator variable from train_X_ohe (as created in the examples above). Hint: you may want to use the head() method to print more than just the top few words.
  • Part B) Use the pipeline you created in Question #1 to find the cross-validated F1-score when the data are augmented to contain the new predictors (i.e., first_word_urgent, first_word_free, and the word you identified in Part A) in addition to the predictor prop_num from Question #1.

Part 3 - Pipelines and One-Hot-Encoding

It's worth noting that it's possible to include one-hot-encoding as a step within a preprocessing pipeline. Unfortunately, doing so requires carefully setting up different preprocessing steps for numeric and categorical predictors.

The code below demonstrates this using the data we created in the previous section and a $k$-nearest neighbors classifier:

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier

## Separate pipelines for numeric vs. cat vars
num_transformer = Pipeline([("scaler", StandardScaler())])
## Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])

## Get names of numeric and categorical columns
num_cols = train_X.select_dtypes(exclude=['object']).columns.tolist()
cat_cols = train_X.select_dtypes(include=['object']).columns.tolist()

## Preprocessing transformer allowing different actions for numeric and categorical vars
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, cat_cols)
])

## Combine the preprocessor and model into a final pipeline
final_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", KNeighborsClassifier())
])

fitted = final_pipe.fit(train_X, train_y)
print(fitted.score(train_X, train_y))
0.971729863136639

A few things to note:

  1. By default, OneHotEncoder returns a sparse matrix. We prevent this with sparse=False because we want a dense array as the input to our model. (In scikit-learn 1.2 and later, this argument is named sparse_output.)
  2. Pipelines are generally created so that preprocessing can be applied to new data, but it's possible that the new data contain categories that weren't present in the training data. The argument handle_unknown='ignore' instructs the encoder to ignore such categories, encoding them as all zeros rather than raising an error.
  3. This pipeline is built for predictor data frames that contain columns with the names in num_cols and cat_cols. These names are how the pipeline knows which preprocessing routine to use for each column in the input data.
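
The second note above can be illustrated with a small pure-Python sketch. The functions fit_categories and transform are hypothetical stand-ins written for this illustration (they are not scikit-learn functions); they only mimic the idea of learning categories from training data and encoding an unseen category as all zeros.

```python
def fit_categories(values):
    # "Fit": remember the categories seen in the training data
    return sorted(set(values))

def transform(values, categories):
    # "Transform": one 0/1 column per remembered category;
    # an unseen category matches none of them, so its row is all zeros
    return [[int(v == c) for c in categories] for v in values]

# Hypothetical first words, for illustration only
train_words = ["free", "ok", "urgent", "ok"]
new_words = ["free", "hello"]   # "hello" never appeared in training

cats = fit_categories(train_words)
encoded = transform(new_words, cats)
print(cats)
print(encoded)
```

Because the set of columns is fixed at fit time, new data always produce the same number of columns as the training data, which is what allows a fitted model to make predictions on them.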

Part 4 - Missing Data Imputation

Pipelines also provide an opportunity to handle missing data without introducing data leakage. In this section of the lab, we'll briefly demonstrate two different imputation methods, simple imputation and nearest neighbors imputation, that can be used in a pipeline.

To begin, we'll load the "Happy Planet" data set, which contains various measures of different countries around the world. We're primarily interested in the fact that two countries have missing values for the variables GDPperCapita and HDI.

In [12]:
## Import data set w/ missing values
hp = pd.read_csv("https://remiller1450.github.io/data/HappyPlanet.csv")

## Tally number of missing values
hp.isna().sum()
Out[12]:
Country           0
Region            0
Happiness         0
LifeExpectancy    0
Footprint         0
HLY               0
HPI               0
HPIRank           0
GDPperCapita      2
HDI               2
Population        0
dtype: int64

Before performing imputation, let's prepare these data for a simple modeling task: predicting "Happiness" using a country's Region, LifeExpectancy, and HDI (which contains missing values).

In [13]:
## create outcome
hp_y = hp['Happiness']

## create predictors
hp_X = hp[['Region','LifeExpectancy', 'HDI']]

Next we'll create a couple of pipelines that each contain an imputation step:

In [14]:
## Import imputation functions
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsRegressor

simp_imp_pipe = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("model", KNeighborsRegressor())
])

simp_fit = simp_imp_pipe.fit(hp_X, hp_y)
print(simp_fit.score(hp_X, hp_y))

knn_imp_pipe = Pipeline([
    ("imputer", KNNImputer(missing_values=np.nan, n_neighbors=4, weights='distance')),
    ("model", KNeighborsRegressor())
])

kp_fit = knn_imp_pipe.fit(hp_X, hp_y)
print(kp_fit.score(hp_X, hp_y))
0.8130099967617596
0.8128904952418757

Because there were only two missing values in this example, there's seemingly not much difference between these two strategies.
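
For intuition about what the simple (mean) strategy does inside a pipeline, here is a minimal pure-Python sketch. The helpers fit_mean and impute and the toy values are hypothetical; the point is only that the fill value is learned from the training data and then reused on new data, which is how a pipeline avoids data leakage.

```python
def fit_mean(column):
    # Learn the mean from the observed (non-missing) training values only
    observed = [v for v in column if v is not None]
    return sum(observed) / len(observed)

def impute(column, fill_value):
    # Replace each missing entry with the learned fill value
    return [fill_value if v is None else v for v in column]

# Hypothetical values; None marks a missing entry
train_col = [0.5, None, 0.75, 1.0]
new_col = [None, 0.25]

mean_train = fit_mean(train_col)            # computed from training data only
train_filled = impute(train_col, mean_train)
new_filled = impute(new_col, mean_train)    # new data reuse the training mean
print(train_filled)
print(new_filled)
```

Nearest neighbors imputation follows the same fit-then-apply pattern, but fills each missing entry using the most similar training rows instead of a single overall mean.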

However, something to recognize is that we treated the "Region" variable as numeric data, when it actually uses integer values to encode different categories. Additionally, we did not perform any standardization or scaling in our pipeline, which will impact the step involving KNNImputer.

To finish up this example, let's create a pipeline that integrates standardization, imputation, and one-hot-encoding:

In [15]:
## Separate pipelines for numeric vs. cat vars
num_transformer = Pipeline([("scaler", StandardScaler()),
                           ("imputer", KNNImputer(missing_values=np.nan, n_neighbors=4, weights='distance'))
                           ])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])

## Get names of numeric and categorical columns
num_cols = ['LifeExpectancy', 'HDI']
cat_cols = ['Region']

## Preprocessing transformer allowing different actions for numeric and categorical vars
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, cat_cols)
])

## Combine the preprocessor and model into a final pipeline
knn_final_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", KNeighborsRegressor())
])

## Fit new pipeline and print the score
kpn_fit = knn_final_pipe.fit(hp_X, hp_y)
print(kpn_fit.score(hp_X, hp_y))
0.8525635060743124

Question #3 (pipelines and imputation)

For this question you should use the "Colleges 2019" data set, which contains information on all primarily undergraduate colleges in the United States during the 2019-20 academic year. You should notice that several variables include missing data. For simplicity, I'll ask you to create a model fitting pipeline aimed at predicting the median salary of a college's alumni 10 years after graduation, Salary10yr_median, using the school's cost, average faculty salary, region, and median ACT score. The data needed for this pipeline are created below.

  • Create a pipeline similar to the example immediately prior to this question that standardizes numeric predictors, applies simple mean imputation to numeric predictors with missing data, and applies one-hot-encoding to the predictor 'Region' before fitting a $k$-nearest neighbors model with n_neighbors = 10 and uniform weighting. Print the cross-validated score (using the default metric of $R^2$ and 5-fold cross-validation) of the model using this pipeline.
In [16]:
## Import data set w/ missing values
c19 = pd.read_csv("https://remiller1450.github.io/data/Colleges2019.csv")

## Remove rows with missing outcome
c19 = c19[c19.Salary10yr_median.notnull()]

## Separate predictors and outcome
c19_y = c19['Salary10yr_median']
c19_X = c19[['Cost','Avg_Fac_Salary','Region','ACT_median']]

## Note that some data are missing
c19_X.isna().sum()
Out[16]:
Cost               39
Avg_Fac_Salary      6
Region              0
ACT_median        432
dtype: int64

Part 5 - Application/Challenge

Question #4 (challenge)

For the final part of this lab you are to build your own kNN classifier using feature engineering and preprocessing strategies of your choosing. You should optimize your classifier's performance, measured by the F1-score, using cross-validation, and then report the final performance of your classifier on the test set.
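
As a reminder of what the F1-score measures, the sketch below computes it by hand from a small set of made-up labels and predictions. The function name f1_score_manual and the toy data are hypothetical, and this simple version does not handle the case where there are no predicted or actual positives (which would cause a division by zero).

```python
def f1_score_manual(y_true, y_pred):
    # Count true positives, false positives, and false negatives for class 1
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)   # share of predicted spam that really is spam
    recall = tp / (tp + fn)      # share of actual spam that was caught
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and predictions (1 = spam, 0 = ham)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

f1 = f1_score_manual(y_true, y_pred)
print(f1)
```

In this toy example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1-score is also 2/3.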

As a benchmark, preprocessing with max-absolute scaling before fitting a kNN classifier with 15 neighbors using the predictors defined earlier in this lab, along with two additional features related to capitalization and non-alphanumeric characters, achieves a cross validated F1-score of approximately 0.90 (and a cross-validated accuracy of roughly 97%). To get full credit for this question, you must beat this benchmark F1-score (assessed via cross-validation on the training set, since you should only use the test set for final evaluation).
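
For reference, max-absolute scaling divides each column by its largest absolute value. The sketch below is a hand-written illustration with made-up numbers (the helper max_abs_scale is hypothetical and assumes the column is not entirely zero).

```python
def max_abs_scale(column):
    # Divide by the largest absolute value; zeros remain exactly zero
    largest = max(abs(v) for v in column)
    return [v / largest for v in column]

# Hypothetical prop_num-like values with many zeros
prop_num = [0.0, 0.0, 0.02, 0.0, 0.08]

scaled = max_abs_scale(prop_num)
print(scaled)
```

The scaled values fall between -1 and 1, and entries that were zero before scaling are still zero afterward.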

Hint: I encourage you to look at this list of string methods. In particular, you might find isalpha, isupper, and endswith('!') to be useful in reaching the benchmark.
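
As a starting point, the sketch below shows what these string methods return for a single made-up message (a hypothetical example, not a row from the data). It demonstrates the methods only; how to turn them into useful features is left to you.

```python
# Hypothetical message used only for illustration
msg = "YOU HAVE WON! Call now!"

n_upper = sum(map(str.isupper, msg))   # number of uppercase characters
n_alpha = sum(map(str.isalpha, msg))   # number of alphabetic characters
ends_bang = msg.endswith('!')          # True when the message ends with '!'

print(n_upper, n_alpha, ends_bang)
```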