Lab #3 (part 1) - Other Performance Metrics

This lab will build upon topics from last week, namely pipelines and cross-validation in sklearn, by introducing several additional tools and methods for analyzing classification performance.

We'll begin by loading many familiar libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

# Unfortunately, some of the kNN functions we'll use raise FutureWarnings; the commands below suppress them
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

This lab's examples will use the cervical cancer dataset from Sobar (2016), which was briefly introduced in today's lecture.

In [2]:
sobar = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv')

Our work will always begin with a training/testing split to avoid data leakage:

In [3]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(sobar, test_size=0.2, random_state=4)

Part 1 - Confusion Matrices

To learn about confusion matrices in sklearn, we must first build a classifier and use it to obtain predictions.

We'll use this as an opportunity to review pipelines:

In [4]:
## Define outcome variable
train_y = train['ca_cervix']

## Define predictor matrix
train_X = train.drop('ca_cervix',axis=1)

## Imports used in the pipeline
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.neighbors import KNeighborsClassifier

## Pipeline: normalize -> scale -> fit classifier
pipe = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier(n_neighbors = 8))
])

## Model fit
pf = pipe.fit(train_X, train_y)

## Predictions (using 5-fold CV)
from sklearn.model_selection import cross_val_predict
train_y_pred = cross_val_predict(pf, train_X, train_y, cv=5)

We can compare these predictions with the observed outcomes in the training data using a confusion matrix:

In [5]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true = train_y, y_pred = train_y_pred)
Out[5]:
array([[41,  0],
       [12,  4]], dtype=int64)

The confusion_matrix function sorts the class labels, so the rows of this matrix (the true classes) and its columns (the predicted classes) both correspond to the labels "0" and "1", in that order.
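If you'd rather see a particular class first, the optional labels argument to confusion_matrix reorders the rows and columns explicitly. Below is a small optional sketch (not part of the lab's required steps) that places the label "1" in the first position:

## Optional sketch: put the label '1' in the first row/column
confusion_matrix(y_true = train_y, y_pred = train_y_pred, labels = [1, 0])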

Question #1 (confusion matrices)

  • Part A) Suppose we denote having cervical cancer as the positive class. Using the confusion matrix above, identify the number of true positives and false positives.
  • Part B) Calculate the true positive rate (TPR) and false positive rate (FPR) using the information you identified in Part A.

Part 2 - ROC Analysis

ROC analysis allows for a nuanced view of the trade-off between true positives and false positives by considering different classification thresholds (for mapping model estimates to class labels).

The first step in an ROC analysis is obtaining classification scores (usually predicted probabilities for the positive class):

In [6]:
## Cross-validated predicted probabilities from our pipeline (the pipeline is refit within each of the 5 folds)
train_y_probs = cross_val_predict(pf, train_X, train_y, cv=5, method='predict_proba')

## Select only the 2nd column (predicted probs of label=1)
train_y_score = train_y_probs[:,1] 
In [7]:
## ROC curve plotting function:
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(train_y, train_y_score, pos_label=1)
Out[7]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x1cf04b81be0>

A couple of things to note:

  • We used predicted probabilities from cross_val_predict, so these results should generalize to the test data fairly well.
  • The cross-validated AUC is close to 1.0, indicating this classifier is reasonably good (a sketch for computing the exact value appears below).
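If you'd like the AUC as a number rather than reading it from the plot's legend, the roc_auc_score function can compute it from the same cross-validated scores. A brief optional sketch:

## Optional sketch: numeric cross-validated AUC
from sklearn.metrics import roc_auc_score
roc_auc_score(train_y, train_y_score)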

If we want more precise details about the TPR and FPR at various classification thresholds, we can use the roc_curve function:

In [8]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(train_y, train_y_score, pos_label=1)
print(pd.DataFrame({'t': thresholds, 'TPR': tpr,'FPR': fpr}))
       t     TPR       FPR
0  1.875  0.0000  0.000000
1  0.875  0.1875  0.000000
2  0.750  0.2500  0.000000
3  0.500  0.5000  0.000000
4  0.375  0.6875  0.048780
5  0.250  0.8125  0.170732
6  0.125  0.8750  0.365854
7  0.000  1.0000  1.000000

Note: the code above creates a pandas DataFrame from a dictionary object. Dictionaries are created using curly braces and key: value pairs.
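To see how a particular threshold maps these scores to hard labels, you can dichotomize the scores yourself. The sketch below uses an arbitrary cutoff of 0.5 purely for illustration (it is not a recommendation for Question #2):

## Optional sketch: convert scores to labels at an arbitrary cutoff of 0.5
labels_at_05 = (train_y_score >= 0.5).astype(int)

## The resulting confusion matrix corresponds to the TPR/FPR reported at that threshold
confusion_matrix(train_y, labels_at_05)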

Question #2 (ROC analysis)

  • Consider the practical differences between false negatives and false positives in the context of this application (cervical cancer diagnosis). What classification threshold (for a label of '1') would you recommend? Report the TPR and FPR at this threshold and a brief rationale for it.

Part 3 - Precision-Recall Analysis

Some applications are highly focused on the positive class, which makes Precision (the fraction of predicted positives that are actually positive) an important quantity. The functions used to perform Precision-Recall analysis are similar to the functions for ROC analysis described in the previous section:

In [9]:
# Functions for PR-analysis
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
PrecisionRecallDisplay.from_predictions(train_y, train_y_score, pos_label=1)
pre, rec, thresholds = precision_recall_curve(train_y, train_y_score, pos_label=1)
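As a reminder, the F1 score is the harmonic mean of precision and recall, $F_1 = 2PR/(P + R)$, so it can be computed element-wise from the arrays returned above. A small optional sketch (keep in mind that thresholds contains one fewer entry than pre and rec):

## Optional sketch: element-wise F1 from the precision and recall arrays
## (assumes no point has precision and recall both equal to zero)
f1_values = 2 * pre * rec / (pre + rec)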

Question #3 (PR analysis)

  • Calculate the F1 score at the smallest classification threshold with a cross-validated Precision of 1. Use the exact values (rather than trying to infer them from the PR curve displayed above). Hint: I encourage you to view the documentation for the precision_recall_curve function.

Part 4 - Performance Metrics and Pipelines/Grid Search

Our previous labs foreshadowed the use of different performance measures in a pipeline. We'll briefly revisit that topic here since we've now learned a few new performance metrics.

The code below demonstrates how to use the F1 score in a cross-validated grid-search.

In [10]:
## Simple pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier())
])

## Parameters to try via grid search
from sklearn.preprocessing import RobustScaler, MaxAbsScaler
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
         'classifier__n_neighbors': [4,6,8],
         'classifier__weights': ['uniform','distance'],
         'classifier__p': [1,2]
        }

## Conduct grid search and print best estimator/score
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, parms, cv=5, scoring='f1').fit(train_X, train_y)
print(grid.best_estimator_)
print(grid.best_score_)
Pipeline(steps=[('scaler', MaxAbsScaler()),
                ('classifier',
                 KNeighborsClassifier(n_neighbors=4, weights='distance'))])
0.6933333333333334

A complete list of common scoring metrics that are compatible with GridSearchCV is available here.
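Other built-in metrics can be requested by name through the scoring argument, and make_scorer lets you wrap a metric function with non-default settings. The sketch below is illustrative only; the F-beta scorer is an example, not something this lab requires:

## Optional sketch: alternative scoring choices for GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

## A different built-in metric, requested by name
grid_bacc = GridSearchCV(pipe, parms, cv=5, scoring='balanced_accuracy')

## A custom scorer that weights recall more heavily than precision (beta = 2)
f2_scorer = make_scorer(fbeta_score, beta=2)
grid_f2 = GridSearchCV(pipe, parms, cv=5, scoring=f2_scorer)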

Part 5 - Multiple Classes

In the final section of this lab we'll briefly look at performance metrics for multi-class applications. To do so, we'll revisit the MNIST data from our previous lab:

In [11]:
### Read flattened, processed MNIST data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")

### Separate the label column (outcome)
label = mnist['label']
mnist = mnist.drop(['label'], axis="columns")

### Convert to numpy array and reshape to 28 by 28
mnist_unflattened = mnist.to_numpy()
mnist_unflattened = mnist_unflattened.reshape(6000,28,28)

### Import grayscale color map
import matplotlib.cm as cm

## Plot the first five samples (to refresh your memory)
fig, axs = plt.subplots(ncols=5)
for i in range(5):
    axs[i].imshow(mnist_unflattened[i], cmap=cm.Greys)
    axs[i].title.set_text(f'label={label[i]}')
plt.show()

Recall that these data have already been preprocessed such that the pixel grayscale intensities (the only type of feature in these data) are already on a standardized measurement scale.

For illustrative purposes, let's fit a simple kNN model and display its confusion matrix:

In [12]:
## Set up knn model
knn_mnist = KNeighborsClassifier(n_neighbors=20)

## Use 5-fold CV to get predicted labels
mnist_label_pred = cross_val_predict(knn_mnist, mnist, label, cv=5)

## Display confusion matrix
confusion_matrix(label, mnist_label_pred)
Out[12]:
array([[604,   1,   2,   1,   0,   2,   7,   0,   1,   0],
       [  0, 693,   1,   0,   0,   0,   1,   1,   0,   0],
       [ 10,  25, 510,   6,   4,   2,   3,  17,   8,   3],
       [  3,  15,   3, 548,   1,   7,   2,   4,   8,   7],
       [  0,  17,   0,   0, 538,   0,   4,   5,   0,  28],
       [  5,  20,   0,  23,   3, 437,   7,   1,   3,  12],
       [ 10,  10,   0,   0,   0,   5, 558,   0,   0,   0],
       [  0,  26,   0,   0,   3,   1,   0, 596,   0,  17],
       [  8,  32,   0,  25,   7,  18,   4,   5, 459,  22],
       [  4,   7,   0,   9,   6,   0,   1,  17,   1, 546]], dtype=int64)

Because this confusion matrix is relatively large, we might choose to visualize it:

In [13]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(label, mnist_label_pred)
Out[13]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1cf7f2fadf0>

The confusion matrix itself is useful in this application, as we can see that 5's and 8's are the most difficult digits to classify. This information might prompt us to collect more data belonging to those categories, or to devise new strategies to improve performance.
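One way to quantify which digits are hardest is to compute per-class recall directly from the confusion matrix, since each row corresponds to a true digit. A brief optional sketch:

## Optional sketch: per-class recall (diagonal divided by row totals)
cm = confusion_matrix(label, mnist_label_pred)
print(pd.Series(cm.diagonal() / cm.sum(axis=1), index=range(10)))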

For multi-class classification tasks, we can still calculate performance metrics like ROC-AUC or the F1 score using the functions introduced earlier in the lab, but we'll need to supply a few additional arguments:

  1. average controls whether we'd like calculations to use micro-averaging (which gives each observation equal weight) or macro-averaging (which gives each class equal weight).
  2. multi_class controls whether we'd like calculations to use one-vs-rest or one-vs-one comparison schemes.
In [14]:
from sklearn.metrics import f1_score, roc_auc_score

## Macro-averaging F1
f1_score(label, mnist_label_pred, average='macro')

## Micro-averaging F1
f1_score(label, mnist_label_pred, average='micro')

## Macro-averaging ROC-AUC w/ one-vs-one
mnist_label_prob = cross_val_predict(knn_mnist, mnist, label, cv=5, method='predict_proba')
roc_auc_score(label, mnist_label_prob, average='macro', multi_class='ovo')

## Macro-averaging ROC-AUC w/ one-vs-rest
roc_auc_score(label, mnist_label_prob, average='macro', multi_class='ovr')
Out[14]:
0.992861360171459

Recognize that some performance metrics can only be calculated for certain combinations of the average and multi_class arguments. For example, the F1 score is inherently a one-vs-rest metric (by virtue of how precision and recall are defined), while ROC-AUC does not accept average='micro' in multi-class problems.
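You can also skip averaging entirely and inspect each class separately, either by passing average=None or by using classification_report. A short optional sketch:

## Optional sketch: per-class F1 scores (no averaging)
f1_score(label, mnist_label_pred, average=None)

## A formatted per-class summary of precision, recall, and F1
from sklearn.metrics import classification_report
print(classification_report(label, mnist_label_pred))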

Part 6 - Application

Can something as simple as your cell phone's Wi-Fi signal strength readings accurately predict your location within a building?

Researchers collected data inside a Pittsburgh office building where they stood at various locations in four different rooms and measured the Wi-Fi signal strength of the office's seven wireless routers on an Android cell phone. Their goal was to develop methods to predict the phone's location (room) based upon these measurements. You can view the paper at this link: Rohra et al 2017.

For your reference, Wi-Fi signal strength is measured in decibel milliwatts (dBm), with values typically ranging from -30 dBm, a "perfect" signal, to -90 dBm, effectively zero signal (and a common threshold for "not connected").

In [15]:
## Read data
wifi = pd.read_csv("https://remiller1450.github.io/data/wifi.csv")
wifi.head(5)
Out[15]:
S1 S2 S3 S4 S5 S6 S7 Room
0 -64 -56 -61 -66 -71 -82 -81 1
1 -68 -57 -61 -65 -71 -85 -85 1
2 -63 -60 -60 -67 -76 -85 -84 1
3 -61 -60 -68 -62 -77 -90 -80 1
4 -63 -65 -60 -63 -77 -81 -87 1

Question #4 (application)

  • Part A) Perform an 80-20 training/testing split. Separate the predictors and the outcome. Then set up a pipeline that will standardize/scale the data before applying a $k$-nearest neighbors classifier.
  • Part B) Use 5-fold cross-validation and randomized search, grid search, or a combination of the two search techniques to find a combination of scaler, number of neighbors, weighting, and distance measure ($p$) that achieves a high level of classification accuracy. Report this combination along with the cross-validated accuracy.
  • Part C) Use ConfusionMatrixDisplay to create a visualization of the confusion matrix for the classifier you found in Part B. Report the room number that is most commonly misclassified.
  • Part D) Re-run your tuning parameter search (Part B) using the macro-averaged F1 score as the model performance metric. Did the optimal combination of tuning parameters change?
  • Part E) Evaluate the final model from Part D on the test set and report its overall classification accuracy and macro-averaged F1 score. Do these values appear to indicate overfitting to the training data? Briefly explain.