Lab 4 - Classifier Performance¶

This lab covers various ways to assess and report upon the performance of classification models.

Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).

To begin, you'll need the following libraries:

In [1]:
## Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

The examples during the lab will use the "Wisconsin breast cancer" data set from the UC Irvine machine learning data repository:

In [2]:
wisc_bc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")

Our analysis of these data will build models to classify cells as either malignant (cancerous) or benign (non-cancerous), which is recorded in the variable Label. The predictive features are derived from images of the cell nuclei.

We will begin by performing an 80-20 train-test split, separating the outcome variable, then removing the ID column:

In [3]:
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wisc_bc, test_size=0.2, random_state=7)

## Separate the target from the predictors
train_y = train['Label']
test_y = test['Label']
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)

Part 1 - Scorers and Confusion Matrices¶

In our previous lab we saw that the scoring argument of GridSearchCV() allowed us to change how models were evaluated. The complete list of strings that can be given as scoring metrics is found at this link.

The default scoring metric used by GridSearchCV() is classification accuracy. For imbalanced data, accuracy tends to favor models that predict the majority class, since such models can score well without correctly identifying much of the minority class.

A simple way to investigate this favoritism is the confusion matrix, which can be created using the confusion_matrix() function:

In [4]:
## Function imports
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

## Pipeline
pipe = Pipeline([
 ('scaler', StandardScaler()),
 ('classifier', KNeighborsClassifier(n_neighbors = 8))
])

## Get out-of-sample predictions using 5-fold CV
train_y_pred = cross_val_predict(estimator = pipe, X = train_X, y = train_y, cv = 5, method = 'predict')

## Confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true = train_y, y_pred = train_y_pred, labels = ['M','B'])
Out[4]:
array([[150,  22],
       [ 11, 272]], dtype=int64)

There are a few things you should pay attention to in this example:

  1. We wanted to use out-of-sample predictions in the confusion matrix, but we aren't ready to use the test set yet, so we use the cross_val_predict() function to get cross-validated predictions.
  2. We did not perform any hyperparameter tuning in this example, but if we did we could provide the best_estimator_ from GridSearchCV() as the estimator argument in cross_val_predict() (see the sketch after this list).
  3. The default argument method = 'predict' returns predicted classes based upon the highest probability outcome. However, it is useful to know that method = 'predict_proba' can be used to get the predicted probabilities themselves.
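
To make point 2 concrete, here is a minimal sketch of that workflow (the parameter grid below is hypothetical and only for illustration):

## Hypothetical grid search over the KNN step of the pipeline
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid = {'classifier__n_neighbors': [5, 10, 15]}, cv = 5)
grid.fit(train_X, train_y)

## Cross-validated predictions from the best pipeline found by the search
best_pred = cross_val_predict(estimator = grid.best_estimator_, X = train_X, y = train_y, cv = 5, method = 'predict')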

You should also notice that the raw confusion matrix output isn't very visually appealing or easy to read. The ConfusionMatrixDisplay() function provides a nicer output:

In [5]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(train_y, train_y_pred, labels = ['M','B'])
plt.show()

Many of the scoring functions used for classifiers expect the data to use a label of 1 for the positive class and 0 for the negative class. The Wisconsin breast cancer data set does not use this convention, so we might opt to create our own scoring function using make_scorer(), which allows us to provide our own label for the positive class:

In [6]:
from sklearn.metrics import f1_score, make_scorer
f1_scorer = make_scorer(f1_score, average='binary', pos_label='M') 

This scorer can then be given to functions like GridSearchCV() in place of the string name referencing a pre-built scorer.
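
For instance, a grid search that evaluates candidate models with this custom scorer might look like the following sketch (the tuning grid here is hypothetical and only for illustration):

## Hypothetical grid search scored with our custom F1 scorer
from sklearn.model_selection import GridSearchCV
params = {'classifier__n_neighbors': [3, 7, 11]}
grid = GridSearchCV(pipe, param_grid = params, scoring = f1_scorer, cv = 5)
grid.fit(train_X, train_y)
print(grid.best_params_, round(grid.best_score_, 3))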

Question #1:

  • Part A: Set up a parameter grid that considers either standardization or robust scaling, values of n_neighbors ranging from 5 to 25 by increments of 5, and uniform or distance weighting.
  • Part B: Display a confusion matrix showing the errors made by the best model identified in Part A.
  • Part C: Considering 'M' (malignant) to be the positive class, calculate the true positive and false positive rates from the confusion matrix you created in Part B.
  • Part D: Modify your grid search from Part B to use the F1 score as your performance metric. Make sure that you're using the label 'M' as the positive class. Report the cross-validated F1 score of the best model.

Part 2 - Receiver Operating Characteristic (ROC) Analysis¶

Recall that the ROC curve displays the trade-off between the true positive rate and the false positive rate across the range of thresholds used to map the predicted probability of the positive class to a predicted label. As alluded to earlier in the lab, we will need to use the argument method = 'predict_proba' in cross_val_predict() as the first step in an ROC analysis:

In [7]:
train_y_pred_prob = cross_val_predict(estimator = pipe, X = train_X, y = train_y, cv = 5, method = 'predict_proba')

It is essential to recognize that train_y_pred_prob is an array with two columns, each representing the predicted probability of one category of the outcome variable (arranged in alphabetical order, so the first column corresponds to 'B' and the second to 'M'). Because we want malignant observations to be the positive class, we will only provide the second column to RocCurveDisplay, which plots the ROC curve:

In [8]:
# ROC curve plot
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(train_y, train_y_pred_prob[:,1], pos_label='M')
plt.show()
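
If you're ever unsure which column of the probability array corresponds to which class, one quick check (our own addition, not part of the original example) is to fit the pipeline and inspect the classifier's classes_ attribute, whose order matches the columns returned by predict_proba:

## The order of classes_ matches the columns of the predicted probability array
pipe.fit(train_X, train_y)
print(pipe.named_steps['classifier'].classes_)  ## should be ['B' 'M'], so column 1 corresponds to 'M'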

Unfortunately, it is relatively cumbersome to display the ROC curves of multiple models on the same plot using RocCurveDisplay. Consequently, you should know how to create an ROC curve on your own, which makes it easy to add the curves of as many models as you want to display:

In [9]:
## Functions to get TPR, FPR, and Thresholds for curves
from sklearn.metrics import roc_curve, roc_auc_score

## Get ROC curve components and AUC for model 1
fpr, tpr, thresh = roc_curve(train_y, train_y_pred_prob[:,1], pos_label = 'M')
auc = roc_auc_score(train_y, train_y_pred_prob[:,1], labels = ['M','B'])

## Same for model 2, but let's assume our 2nd model always guesses the positive class (for illustration)
fpr2, tpr2, thresh2 = roc_curve(train_y, np.ones(len(train_y)), pos_label = 'M')
auc2 = roc_auc_score(train_y, np.ones(len(train_y)), labels = ['M','B'])

## Create the plot
plt.plot(fpr,tpr,label="Model 1, auc="+str(round(auc,3)))
plt.plot(fpr2,tpr2,label="Guessing, auc="+str(auc2))
plt.legend(loc=0)
plt.show()

It's worthwhile taking a minute to see the values that went into making these ROC curves so that you can better understand their nuances:

In [10]:
fpr, tpr, thresh = roc_curve(train_y, train_y_pred_prob[:,1], pos_label = 'M')
print(pd.DataFrame({'t': thresh, 'TPR': tpr,'FPR': fpr}))
       t       TPR       FPR
0    inf  0.000000  0.000000
1  1.000  0.691860  0.003534
2  0.875  0.784884  0.014134
3  0.750  0.848837  0.021201
4  0.625  0.872093  0.038869
5  0.500  0.918605  0.088339
6  0.375  0.947674  0.113074
7  0.250  0.965116  0.173145
8  0.125  0.988372  0.325088
9  0.000  1.000000  1.000000

Question #2:

  • Part A: Display the ROC curves of two different models using their cross-validated predictions. The first should be a KNN model with 15 neighbors and distance weighting using standardization to re-scale the data. The second should be a decision tree with a maximum depth of 4. You may use default arguments for any other parameters.
  • Part B: Print a Data Frame showing the decision thresholds and corresponding TPR/FPR of the KNN model you used in Part A. How does the number of rows (threshold values) compare to the example from this section? Briefly explain why the number of rows differs from the example. Hint: Think about the number of neighbors used by each.
  • Part C: Use the Data Frame you printed in Part B to determine the highest true positive rate that can be achieved with a false positive rate no larger than 5%.

Part 3 - Precision-Recall (PR) Analysis¶

While ROC analysis tends to be broadly applicable, some applications involve a rare positive class whose accurate identification is highly important. In these situations, precision-recall (PR) analysis is often more informative because it focuses on the positive class and ignores true negatives.

The functions used in PR analysis are direct analogues to those introduced in the previous section:

In [11]:
## Display a PR curve from predictions
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
PrecisionRecallDisplay.from_predictions(train_y, train_y_pred_prob[:,1], pos_label = 'M')
Out[11]:
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x219bd457ee0>
In [12]:
## Get precision/recall information to make our own PR curve or investigate different thresholds
pre, rec, thresholds = precision_recall_curve(train_y, train_y_pred_prob[:,1], pos_label = 'M')
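
A common single-number summary of a PR curve is the average precision, which is closely related to the area under the curve. One way to compute it directly (an aside we've added here, not part of the original example) is the average_precision_score() function:

## Average precision: a single-number summary of the PR curve
from sklearn.metrics import average_precision_score
ap = average_precision_score(train_y, train_y_pred_prob[:,1], pos_label = 'M')
print(round(ap, 3))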

Question #3:

  • Part A: Using code from the ROC analysis section as a template, try creating a Data Frame containing the threshold, precision, and recall values given by precision_recall_curve() in the example provided above. You should receive an error message explaining that the involved components have different lengths. Read the function documentation to learn why these lengths differ, then use this information to create and print a data frame showing the precision and recall at every threshold.
  • Part B: Use the data and model from this section's example to create another PR curve considering the other possible outcome, benign or 'B', as the positive class. Does the area under this curve differ compared to when malignant or 'M' was considered the positive class? Briefly explain.
  • Part C: Similar to Part B, modify code from the example from the previous section (ROC analysis) to create an ROC curve that uses benign or 'B' as the positive class. Does the area under this curve differ compared to when malignant or 'M' was considered the positive class? Briefly explain how this might influence your decision to use ROC analysis instead of PR analysis.

Part 4 - Multiclass Classification¶

Part 1 of this lab introduced scoring functions such as f1_score() in its discussion of custom scorers made using the make_scorer() function. Most of these scoring functions can accommodate multiclass outcomes using one or both of the following arguments:

  • multi_class - which dictates whether a one-vs-rest or one-vs-one comparison scheme should be used.
  • average - which dictates whether calculations should use micro-averaging (giving each observation equal importance) or macro-averaging (giving each class equal importance).

It is important to recognize that not every metric supports every combination of these arguments. For example, the F1 score is intrinsically a one-vs-rest metric because precision and recall are defined separately for each class, so f1_score() does not have a multi_class argument.

Below are a few different examples of how these arguments can be used for different scoring functions:

In [13]:
## Macro-averaging F1
a = f1_score(train_y, train_y_pred, labels = ['M','B'], average='macro')

## Micro-averaging F1
b = f1_score(train_y, train_y_pred, labels = ['M','B'], average='micro')

## Macro-averaging ROC-AUC w/ one-vs-one
c = roc_auc_score(train_y, train_y_pred_prob[:,1], average='macro', multi_class='ovo')

## Macro-averaging ROC-AUC w/ one-vs-rest
d = roc_auc_score(train_y, train_y_pred_prob[:,1], average='macro', multi_class='ovr')

## Neatly printed results
print(f"Macro F1: {a} | Micro F1: {b} | ROC-AUC (macro, ovo): {c} | ROC-AUC (macro, ovr): {d}")
Macro F1: 0.9218542632754072 | Micro F1: 0.9274725274725275 | ROC-AUC (macro, ovo): 0.974248089407511 | ROC-AUC (macro, ovr): 0.974248089407511
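
Because the outcome in this example only has two classes, the one-vs-one and one-vs-rest ROC-AUC values above are identical. For a genuinely multiclass outcome (like the rooms in Part 5), roc_auc_score() needs the full matrix of predicted probabilities rather than a single column. A minimal sketch, assuming hypothetical objects multi_y (the class labels) and multi_y_pred_prob (an array with one probability column per class, ordered like the classifier's classes_), might look like:

## Hypothetical multiclass ROC-AUC: pass every probability column, not just one
macro_ovr_auc = roc_auc_score(multi_y, multi_y_pred_prob, average = 'macro', multi_class = 'ovr')
print(round(macro_ovr_auc, 3))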

Part 5 - Application¶

In a 2017 paper by Rohra et al., researchers collected data in a large office building by standing at various locations in four different rooms and measuring the Wi-Fi signal strength from the office's seven wireless routers using an Android cell phone. Their goal was to be able to use these signal strength measurements to accurately classify the room in which the phone was located.

The first few rows of the data are shown below. You should note that Wi-Fi signal strength is recorded in decibel-milliwatts (dBm), with values generally ranging from -30 dBm (a "perfect" signal) to -90 dBm (a common threshold at which devices begin to display "not connected").

In [14]:
## The Wi-Fi positioning data
wifi = pd.read_csv("https://remiller1450.github.io/data/wifi.csv")
wifi.head(5)
Out[14]:
    S1   S2   S3   S4   S5   S6   S7  Room
0  -64  -56  -61  -66  -71  -82  -81     1
1  -68  -57  -61  -65  -71  -85  -85     1
2  -63  -60  -60  -67  -76  -85  -84     1
3  -61  -60  -68  -62  -77  -90  -80     1
4  -63  -65  -60  -63  -77  -81  -87     1

Question #5:

  • Part A: Perform an 80-20 training/testing split using random state 12. Then perform a brief exploratory analysis to assess whether the data contain any unexpected values or highly skewed distributions.
  • Part B: Perform a cross-validated grid search that considers KNN classifiers with at least 3 different values of $k$, uniform or distance weighting, and at least 2 different rescaling methods, as well as decision tree classifiers with at least 4 different maximum depths. Use classification accuracy as the scoring metric, and print a Data Frame showing the top-5 best performing methods.
  • Part C: Display the confusion matrix summarizing the performance of the best classification method you identified in Part B. Briefly describe when the performance of the classifier was best and worst based upon what you see in the confusion matrix.
  • Part D: Change the scoring metric in your grid search from accuracy to macro-averaged F1 score and print a Data Frame showing the top-5 best performing methods. Briefly comment upon whether the change in scorer influenced which models performed the best.
  • Part E: Report the classification accuracy and macro-averaged F1 score of the best performing model from Part D on the test set.