This lab covers various approaches to understanding the performance of classification models.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# KNN will warn you about future updates, but we'll turn these off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
The examples in this lab will use the Wisconsin breast cancer data introduced in Lab 4. These data were collected in a study that explored using machine learning to diagnose breast cancer tissue samples as cancerous (a label of 'M'
for malignant) or non-cancerous (a label of 'B'
for benign). The features in this data set describe the average characteristics (radius, symmetry, etc.) of the cells collected in each tissue sample.
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
To begin, we'll perform an 80-20 training-testing split, separate the target variable, and drop the ID
column.
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)
## Separate the target from the predictors
train_y = train['Label']
test_y = test['Label']
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)
To generate a confusion matrix, we need a set of predicted labels to compare with the actual labels present in the data. Because we'd like the confusion matrix we're analyzing to reflect the performance of our methods on new data, we should use out-of-sample predictions found using cross-validation, which we can obtain using the cross_val_predict() function:
## Function imports
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
## Pipeline: normalize -> scale -> fit classifier
pipe = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier(n_neighbors = 8))
])
## Get out-of-sample predictions using 5-fold CV
train_y_pred = cross_val_predict(estimator = pipe, X = train_X, y = train_y, cv = 5)
You should note that cross_val_predict() will accept any estimator that has fit() and predict() methods, so we could provide a pipeline whose final step is a model, or just a model on its own.
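For example, here is a brief sketch (using the same training data and the functions imported above) that supplies a bare KNeighborsClassifier instead of a pipeline; without the transformation and scaling steps the resulting predictions will generally differ:
## Sketch: a bare model (no pipeline) also works with cross_val_predict()
knn_only = KNeighborsClassifier(n_neighbors = 8)
train_y_pred_knn_only = cross_val_predict(estimator = knn_only, X = train_X, y = train_y, cv = 5)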
The simplest way to produce a confusion matrix in sklearn is the confusion_matrix() function:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true = train_y, y_pred = train_y_pred, labels = ['M','B'])
array([[154,  18],
       [ 10, 273]], dtype=int64)
The labels argument governs the order of the rows and columns in the confusion matrix. In this example we provided the list ['M','B'] because it's most natural to view malignant tumors, or 'M', as the "positive" class. If labels is not provided, the rows and columns are ordered using the values that appear at least once in y_true or y_pred, sorted in alpha-numeric order.
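If you need the individual cell counts for calculations such as the true positive rate or true negative rate, the array returned by confusion_matrix() can be unpacked directly. Below is a small sketch that assumes the ['M','B'] ordering used above, where rows are true labels and columns are predicted labels:
## Unpack the cell counts (assumes labels = ['M','B'])
cm = confusion_matrix(y_true = train_y, y_pred = train_y_pred, labels = ['M','B'])
TP, FN, FP, TN = cm[0,0], cm[0,1], cm[1,0], cm[1,1]
print('TPR:', TP/(TP + FN), 'TNR:', TN/(TN + FP))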
Another useful tool is ConfusionMatrixDisplay, whose from_predictions() method will create a graphical display of the confusion matrix:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(train_y, train_y_pred, labels = ['M','B'])
plt.show()
Question 1:
- Using 'M' as the positive class, compare the true positive rate of the approach outlined in Part A with the true positive rate achieved by the example pipeline used in this section.
- Using 'M' as the positive class, compare the true negative rate of the approach outlined in Part A with the true negative rate achieved by the example pipeline used in this section.

ROC curve analysis provides a more nuanced view of the performance of a classification model. Since the fundamental idea behind an ROC curve is to evaluate the trade-off between false positives and false negatives at various probability thresholds, our first step is to obtain predicted probabilities instead of predicted class labels. We can do this by setting the method argument in cross_val_predict() to 'predict_proba':
train_y_pred_prob = cross_val_predict(estimator = pipe, X = train_X, y = train_y, cv = 5, method = 'predict_proba')
It's important to recognize that this will produce an array containing a column for each category of the outcome (arranged in alpha-numeric order). Because of this, we'll use the second column (i.e., index position 1) when creating our ROC curve.
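If you're ever unsure of this ordering, you can check it directly; a quick sketch using numpy:
## Confirm the alpha-numeric class ordering ('B' comes first, so column 1 holds the probability of 'M')
print(np.unique(train_y))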
# ROC curve plotting:
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(train_y, train_y_pred_prob[:,1], pos_label='M')
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x1fde6c8ec10>
Unfortunately RocCurveDisplay()
is not easily amenable to showing multiple ROC curves on the same plot, which is a common task when summarizing the results of several competing approaches. Thus, it's worthwhile to know how to create a similar graphic yourself:
from sklearn.metrics import roc_curve, roc_auc_score
## Get ROC curve components and AUC for model 1
fpr, tpr, thresh = roc_curve(train_y, train_y_pred_prob[:,1], pos_label = 'M')
auc = roc_auc_score(train_y, train_y_pred_prob[:,1], labels = ['M','B'])
## Same for model 2, but let's assume our 2nd model is just random guessing (for illustration)
fpr2, tpr2, thresh2 = roc_curve(train_y, np.ones(len(train_y)), pos_label = 'M')
auc2 = roc_auc_score(train_y, np.ones(len(train_y)), labels = ['M','B'])
## Create the plot
plt.plot(fpr,tpr,label="Model 1, auc="+str(auc))
plt.plot(fpr2,tpr2,label="Guessing, auc="+str(auc2))
plt.legend(loc=0)
plt.show()
Finally, you should note that the values used to create these ROC curves can be accessed and explored numerically:
fpr, tpr, thresh = roc_curve(train_y, train_y_pred_prob[:,1], pos_label = 'M')
print(pd.DataFrame({'t': thresh, 'TPR': tpr,'FPR': fpr}))
       t       TPR       FPR
0    inf  0.000000  0.000000
1  1.000  0.715116  0.000000
2  0.875  0.802326  0.010601
3  0.750  0.854651  0.017668
4  0.625  0.895349  0.035336
5  0.500  0.930233  0.084806
6  0.375  0.947674  0.102473
7  0.250  0.976744  0.169611
8  0.125  0.982558  0.314488
9  0.000  1.000000  1.000000
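These arrays also make it straightforward to search for a threshold with desirable properties. As an illustration (not a required step in this lab), the sketch below finds the threshold that maximizes the gap between the true positive rate and false positive rate (Youden's J statistic):
## Find the threshold maximizing TPR - FPR (Youden's J)
best_idx = np.argmax(tpr - fpr)
print('Threshold:', thresh[best_idx], 'TPR:', tpr[best_idx], 'FPR:', fpr[best_idx])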
Question 2:
Some applications are highly focused on the positive class, which makes Precision, the fraction of predicted positives that are actually positive, an important quantity to explore at various probability thresholds. This metric can be graphed against Recall, another name for the true positive rate, to create a Precision-Recall (PR) curve. The functions used for this are directly analogous to those we used for ROC curve analysis:
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
PrecisionRecallDisplay.from_predictions(train_y, train_y_pred_prob[:,1], pos_label = 'M')
pre, rec, thresholds = precision_recall_curve(train_y, train_y_pred_prob[:,1], pos_label = 'M')
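A common single-number summary of a PR curve is average precision. As a brief sketch analogous to our earlier use of roc_auc_score(), it can be computed with average_precision_score():
## Average precision summarizes the PR curve with a single number
from sklearn.metrics import average_precision_score
print(average_precision_score(train_y, train_y_pred_prob[:,1], pos_label = 'M'))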
Question 3: Using the example results given above, calculate the F1 score at the smallest classification threshold with a cross-validated Precision above 0.97. Use the exact values rather than trying to infer them from the PR curve displayed above. Hint: I encourage you to view the documentation for precision_recall_curve() to learn about what this function returns, which might help you reconcile differences in object lengths that could lead to errors or misunderstandings.
A nice feature of pipelines and functions like GridSearchCV() is their compatibility with different model performance criteria. The code below demonstrates how to find the best modeling pipeline using the cross-validated F1 score as the performance criterion:
## Simple pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier())
])
## Parameters to try via grid search
from sklearn.preprocessing import RobustScaler
parms = {'scaler': [StandardScaler(), RobustScaler()],
'classifier__n_neighbors': [6,10],
'classifier__weights': ['uniform','distance']
}
## Conduct grid search and print best estimator/score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
my_score = make_scorer(f1_score, pos_label = 'M')
grid = GridSearchCV(pipe, parms, cv=5, scoring=my_score).fit(train_X, train_y)
## Print results
print(grid.best_estimator_)
print(grid.best_score_)
Pipeline(steps=[('scaler', StandardScaler()),
                ('classifier', KNeighborsClassifier(n_neighbors=10, weights='distance'))])
0.9114487539813068
Something to note in this example was the use of make_scorer(), which is required if you'd like to deviate from the default arguments associated with the built-in scoring methods in GridSearchCV(). This was necessary for our application because the label 'M' denotes the positive class, but the first alpha-numeric label in the data, which is the default positive class for F1 scoring, is 'B'.
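The same pattern applies to other metrics whose defaults need adjusting. For instance, a sketch of re-running the same grid search using recall for the 'M' class as the scoring criterion might look like this:
## Sketch: score the same grid search by recall for the 'M' class
from sklearn.metrics import recall_score
recall_scorer = make_scorer(recall_score, pos_label = 'M')
grid_recall = GridSearchCV(pipe, parms, cv=5, scoring=recall_scorer).fit(train_X, train_y)
print(grid_recall.best_estimator_)
print(grid_recall.best_score_)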
Question 4:
Many of the same functions introduced earlier in this lab are compatible with multi-class prediction tasks; however, their usage is slightly more complicated when compared to binary classification.
- The average argument dictates whether we'd like calculations to use micro-averaging (giving each observation equal weight) or macro-averaging (giving each class equal weight).
- The multi_class argument dictates whether calculations use a one-vs-rest or a one-vs-one comparison scheme.

It's essential to recognize that some performance metrics can only be calculated for certain combinations of the average and multi_class arguments. For example, the F1 score is inherently a one-vs-rest metric due to how precision and recall are defined, so functions like f1_score() do not involve a multi_class argument.
Below are a few examples of how these arguments can be used:
## Macro-averaging F1
a = f1_score(train_y, train_y_pred, labels = ['M','B'], average='macro')
## Micro-averaging F1
b = f1_score(train_y, train_y_pred, labels = ['M','B'], average='micro')
## Macro-averaging ROC-AUC w/ one-vs-one
c = roc_auc_score(train_y, train_y_pred_prob[:,1], average='macro', multi_class='ovo')
## Macro-averaging ROC-AUC w/ one-vs-rest
d = roc_auc_score(train_y, train_y_pred_prob[:,1], average='macro', multi_class='ovr')
## Results
print(a, b, c, d)
0.9339430894308942 0.9384615384615385 0.9771653381543266 0.9771653381543266
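Because our outcome has only two classes, the multi_class argument has no real effect in the example above (notice that c and d are identical). The sketch below uses a small synthetic three-class data set, purely for illustration, to show these arguments in a genuinely multi-class setting; note that multi-class ROC-AUC requires the full matrix of predicted probabilities (one column per class):
## Synthetic 3-class data (for illustration only; not part of this lab's data sets)
from sklearn.datasets import make_classification
Xm, ym = make_classification(n_samples = 300, n_classes = 3, n_informative = 5, random_state = 7)
## Cross-validated labels and probabilities from a KNN classifier
ym_pred = cross_val_predict(KNeighborsClassifier(), Xm, ym, cv = 5)
ym_prob = cross_val_predict(KNeighborsClassifier(), Xm, ym, cv = 5, method = 'predict_proba')
## Macro-averaged F1, plus one-vs-rest and one-vs-one macro-averaged ROC-AUC
print(f1_score(ym, ym_pred, average = 'macro'))
print(roc_auc_score(ym, ym_prob, multi_class = 'ovr', average = 'macro'))
print(roc_auc_score(ym, ym_prob, multi_class = 'ovo', average = 'macro'))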
Researchers collected data in a large Pittsburgh office building by standing in various locations inside four different rooms and measuring the Wi-Fi signal strength of the office's seven wireless routers using an Android cell phone. Their goal was to develop methods to predict the phone's location (room) based upon these measurements. You can view the paper at this link: Rohra et al 2017
For your reference, Wi-Fi signal strength is measured in decibel milliwatts (dBm), with values typically ranging from -30 dBm, a "perfect" signal, to -90 dBm, which is effectively zero signal and a common threshold for a device to display "not connected".
## The Wi-Fi positioning data
wifi = pd.read_csv("https://remiller1450.github.io/data/wifi.csv")
wifi.head(5)
|   | S1 | S2 | S3 | S4 | S5 | S6 | S7 | Room |
|---|----|----|----|----|----|----|----|------|
| 0 | -64 | -56 | -61 | -66 | -71 | -82 | -81 | 1 |
| 1 | -68 | -57 | -61 | -65 | -71 | -85 | -85 | 1 |
| 2 | -63 | -60 | -60 | -67 | -76 | -85 | -84 | 1 |
| 3 | -61 | -60 | -68 | -62 | -77 | -90 | -80 | 1 |
| 4 | -63 | -65 | -60 | -63 | -77 | -81 | -87 | 1 |
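Before fitting any models you'll want to separate the room labels from the signal-strength predictors. A minimal sketch, assuming the column names shown above (note that Room stores an integer code for a categorical outcome):
## Separate the target (Room) from the signal-strength predictors
wifi_y = wifi['Room']
wifi_X = wifi.drop('Room', axis = 1)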
Question 5: For the parts that follow you should use the wifi data described above.
- Use ConfusionMatrixDisplay() to create a visualization of the confusion matrix for the classifier you found in Part B. Be sure to use cross-validated predictions. Briefly describe the main highlights (i.e., successes and shortcomings) of the best model from Part B.