This lab will build upon topics from last week, namely pipelines and cross-validation in sklearn, by introducing several additional tools and methods for analyzing classification performance.
We'll begin by loading many familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
# Unfortunately, some kNN functions raise "future warnings"; the commands below suppress them
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
This lab's examples will use the cervical cancer dataset from Sobar (2016), which was briefly introduced in today's lecture.
sobar = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv')
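Before going further, it can help to glance at the dimensions of these data and the balance of the outcome variable, ca_cervix. The sketch below is an optional exploratory step, not part of the core workflow:
## Dimensions of the full dataset
print(sobar.shape)
## Class balance of the outcome variable
print(sobar['ca_cervix'].value_counts())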
Our work will always begin with a training/testing split to avoid data leakage:
from sklearn.model_selection import train_test_split
train, test = train_test_split(sobar, test_size=0.2, random_state=4)
To learn about confusion matrices in sklearn, we must first build a classifier and use it to obtain predictions.
We'll use this as an opportunity to review pipelines:
## Define outcome variable
train_y = train['ca_cervix']
## Define predictor matrix
train_X = train.drop('ca_cervix',axis=1)
## Imports used in the pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.neighbors import KNeighborsClassifier
## Pipeline: normalize -> scale -> fit classifier
pipe = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier(n_neighbors = 8))
])
## Model fit
pf = pipe.fit(train_X, train_y)
## Predictions (using 5-fold CV)
from sklearn.model_selection import cross_val_predict
train_y_pred = cross_val_predict(pf, train_X, train_y, cv=5)
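As an aside, if we only wanted an overall cross-validated accuracy estimate rather than the individual predictions, cross_val_score works with the same pipeline. This is a brief sketch using the pipe object defined above, not part of the original lab code:
from sklearn.model_selection import cross_val_score
## Average accuracy across the 5 folds for the full pipeline
print(cross_val_score(pipe, train_X, train_y, cv=5).mean())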
We can compare these predictions with the observed outcomes in the training data using a confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true = train_y, y_pred = train_y_pred)
array([[41,  0],
       [12,  4]], dtype=int64)
The confusion_matrix function sorts the class labels alpha-numerically, so the rows of this matrix correspond to the true labels "0" and "1" in that order, and the columns correspond to the predicted labels in the same order. Note that most sklearn metrics treat the label "1" as the "positive" class by default.
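To make this layout concrete, here is a minimal sketch (using the train_y and train_y_pred objects from above) that unpacks the four cells and computes sensitivity and specificity by hand:
## Unpack the four cells: rows are true labels (0, 1), columns are predicted labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true = train_y, y_pred = train_y_pred).ravel()
## Sensitivity (true positive rate) and specificity (true negative rate)
print('Sensitivity:', tp / (tp + fn))
print('Specificity:', tn / (tn + fp))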
ROC analysis allows for a nuanced view of the trade-off between true positives and false positives by considering different classification thresholds (for mapping model estimates to class labels).
The first step in an ROC analysis is obtaining classification scores (usually predicted probabilities for the positive class):
## Returns predicted probabilities from the model fitted via our pipeline (and 5-fold CV)
train_y_probs = cross_val_predict(pf, train_X, train_y, cv=5, method='predict_proba')
## Select only the 2nd column (predicted probs of label=1)
train_y_score = train_y_probs[:,1]
## ROC curve plotting function:
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(train_y, train_y_score, pos_label=1)
[ROC curve plot produced by RocCurveDisplay]
One thing to note: these scores were obtained using cross_val_predict, so these results should generalize to the test data fairly well.

If we want more precise details about the TPR and FPR at various classification thresholds, we can use the roc_curve function:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(train_y, train_y_score, pos_label=1)
print(pd.DataFrame({'t': thresholds, 'TPR': tpr,'FPR': fpr}))
       t     TPR       FPR
0  1.875  0.0000  0.000000
1  0.875  0.1875  0.000000
2  0.750  0.2500  0.000000
3  0.500  0.5000  0.000000
4  0.375  0.6875  0.048780
5  0.250  0.8125  0.170732
6  0.125  0.8750  0.365854
7  0.000  1.0000  1.000000
Note: the code above creates a pandas DataFrame from a dictionary object. Dictionaries are created using curly braces and key: value combinations.
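If a single-number summary of the ROC curve is desired, the area under the curve can be computed from the same cross-validated scores, and the fpr, tpr, and thresholds arrays above can be used to locate the threshold that maximizes TPR minus FPR. This is a small illustrative sketch, not part of the original lab code:
from sklearn.metrics import roc_auc_score
## Area under the ROC curve using the cross-validated scores
print('AUC:', roc_auc_score(train_y, train_y_score))
## Threshold that maximizes TPR - FPR (Youden's J statistic)
best = np.argmax(tpr - fpr)
print('Best threshold:', thresholds[best])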
Some applications are highly focused on the positive class, which makes Precision (the fraction of predicted positives that are actually positive) an important quantity. The functions used to perform Precision-Recall analysis are similar to the functions for ROC analysis described in the previous section:
# Functions for PR-analysis
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
PrecisionRecallDisplay.from_predictions(train_y, train_y_score, pos_label=1)
pre, rec, thresholds = precision_recall_curve(train_y, train_y_score, pos_label=1)
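As with ROC-AUC, the PR curve can be summarized in a single number, the average precision. A brief sketch using the same cross-validated scores:
from sklearn.metrics import average_precision_score
## Average precision summarizes the PR curve in a single number
print('Average precision:', average_precision_score(train_y, train_y_score))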
Our previous labs foreshadowed the use of different performance measures in a pipeline. We'll briefly revisit that topic here since we've now learned a few new performance metrics.
The code below demonstrates how to use the F1 score in a cross-validated grid-search.
## Simple pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier())
])
## Parameters to try via grid search
from sklearn.preprocessing import RobustScaler, MaxAbsScaler
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
'classifier__n_neighbors': [4,6,8],
'classifier__weights': ['uniform','distance'],
'classifier__p': [1,2]
}
## Conduct grid search and print best estimator/score
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, parms, cv=5, scoring='f1').fit(train_X, train_y)
print(grid.best_estimator_)
print(grid.best_score_)
Pipeline(steps=[('scaler', MaxAbsScaler()),
                ('classifier',
                 KNeighborsClassifier(n_neighbors=4, weights='distance'))])
0.6933333333333334
A complete list of common scoring metrics that are compatible with GridSearchCV is available in the scikit-learn documentation.
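Swapping in a different metric only requires changing the scoring argument. For example, the sketch below reuses the pipe and parms objects defined above to rank the candidate models by cross-validated ROC-AUC (an illustration, not part of the original lab):
## Same grid search, but ranking candidate models by cross-validated ROC-AUC
grid_auc = GridSearchCV(pipe, parms, cv=5, scoring='roc_auc').fit(train_X, train_y)
print(grid_auc.best_estimator_)
print(grid_auc.best_score_)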
In the final section of this lab we'll briefly look at performance metrics for multi-class applications. To do so, we'll revisit the MNIST data from our previous lab:
### Read flattened, processed MNIST data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")
### Separate the label column (outcome)
label = mnist['label']
mnist = mnist.drop(['label'], axis="columns")
### Convert to numpy array and reshape to 28 by 28
mnist_unflattened = mnist.to_numpy()
mnist_unflattened = mnist_unflattened.reshape(6000,28,28)
### Import grayscale color map
import matplotlib.cm as cm
## Plot the first five samples (to refresh your memory)
fig, axs = plt.subplots(ncols=5)
for i in range(5):
axs[i].imshow(mnist_unflattened[i], cmap=cm.Greys)
axs[i].title.set_text(f'label={label[i]}')
plt.show()
Recall that these data have already been preprocessed such that the pixel grayscale intensities (the only type of feature in these data) are already on a standardized measurement scale.
For illustrative purposes, let's fit a simple kNN model and display its confusion matrix:
## Set up knn model
knn_mnist = KNeighborsClassifier(n_neighbors=20)
## Use 5-fold CV to get predicted labels
mnist_label_pred = cross_val_predict(knn_mnist, mnist, label, cv=5)
## Display confusion matrix
confusion_matrix(label, mnist_label_pred)
array([[604,   1,   2,   1,   0,   2,   7,   0,   1,   0],
       [  0, 693,   1,   0,   0,   0,   1,   1,   0,   0],
       [ 10,  25, 510,   6,   4,   2,   3,  17,   8,   3],
       [  3,  15,   3, 548,   1,   7,   2,   4,   8,   7],
       [  0,  17,   0,   0, 538,   0,   4,   5,   0,  28],
       [  5,  20,   0,  23,   3, 437,   7,   1,   3,  12],
       [ 10,  10,   0,   0,   0,   5, 558,   0,   0,   0],
       [  0,  26,   0,   0,   3,   1,   0, 596,   0,  17],
       [  8,  32,   0,  25,   7,  18,   4,   5, 459,  22],
       [  4,   7,   0,   9,   6,   0,   1,  17,   1, 546]], dtype=int64)
Because this confusion matrix is relatively large, we might choose to visualize it:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(label, mnist_label_pred)
[Heatmap visualization of the confusion matrix produced by ConfusionMatrixDisplay]
The confusion matrix itself is useful in this application, as we can see that 5's and 8's are the most difficult digits to classify. This information might prompt us to collect more data belonging to those categories, or to devise new strategies to improve performance.
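For per-class precision, recall, and F1 values to accompany these counts, sklearn's classification_report can be applied to the same cross-validated predictions. A short sketch:
from sklearn.metrics import classification_report
## Per-digit precision, recall, and F1, plus overall averages
print(classification_report(label, mnist_label_pred))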
For multi-class classification tasks, we can still calculate performance metrics like ROC-AUC or the F1 score using the functions introduced earlier in the lab, but we'll need to add a few additional arguments:

- average controls whether we'd like calculations to use micro-averaging (which gives each observation equal weight) or macro-averaging (which gives each class equal weight).
- multi_class controls whether we'd like calculations to use one-vs-rest or one-vs-one comparison schemes.

from sklearn.metrics import f1_score, roc_auc_score
## Macro-averaging F1
f1_score(label, mnist_label_pred, average='macro')
## Micro-averaging F1
f1_score(label, mnist_label_pred, average='micro')
## Macro-averaging ROC-AUC w/ one-vs-one
mnist_label_prob = cross_val_predict(knn_mnist, mnist, label, cv=5, method='predict_proba')
roc_auc_score(label, mnist_label_prob, average='macro', multi_class='ovo')
## Macro-averaging ROC-AUC w/ one-vs-rest
roc_auc_score(label, mnist_label_prob, average='macro', multi_class='ovr')
0.992861360171459
Recognize that some performance metrics can only be calculated for certain combinations of the average and multi_class arguments. For example, the F1 score is inherently a one-vs-rest metric (by virtue of how precision and recall are defined), and ROC-AUC does not have an implementation for average='micro' in multi-class problems.
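One additional option worth knowing is average='weighted', which weights each class's score by its number of true instances. A minimal sketch using the objects defined above:
## Weighted-average F1: per-class F1 scores weighted by each class's frequency
print(f1_score(label, mnist_label_pred, average='weighted'))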
Can something as simple as your cell phone's Wi-Fi signal strength readings accurately predict your location within a building?
Researchers collected data inside a Pittsburgh office building where they stood at various locations in four different rooms and measured the Wi-Fi signal strength of the office's seven wireless routers on an Android cell phone. Their goal was to develop methods to predict the phone's location (room) based upon these measurements. You can view the paper at this link: Rohra et al 2017.
For your reference, Wi-Fi signal strength is measured in decibel milliwatts (dBm), with values typically ranging from -30 dBm, a "perfect" signal, to -90 dBm, effectively zero signal (and a common threshold for "not connected").
## Read data
wifi = pd.read_csv("https://remiller1450.github.io/data/wifi.csv")
wifi.head(5)
|   | S1  | S2  | S3  | S4  | S5  | S6  | S7  | Room |
|---|-----|-----|-----|-----|-----|-----|-----|------|
| 0 | -64 | -56 | -61 | -66 | -71 | -82 | -81 | 1    |
| 1 | -68 | -57 | -61 | -65 | -71 | -85 | -85 | 1    |
| 2 | -63 | -60 | -60 | -67 | -76 | -85 | -84 | 1    |
| 3 | -61 | -60 | -68 | -62 | -77 | -90 | -80 | 1    |
| 4 | -63 | -65 | -60 | -63 | -77 | -81 | -87 | 1    |
Use ConfusionMatrixDisplay to create a visualization of the confusion matrix for the classifier you found in Part B. Report the room number that is most commonly misclassified.