Lab 8 - Support Vector Machines¶

This lab covers the implementation of support vector machines (SVM) in sklearn, paying particular attention to the choice of kernel function, the degree of regularization (or "slack"), and the importance given to individual data-points when using non-linear kernel functions (gamma).

In [1]:
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

We'll continue working with the Wisconsin breast cancer classification data set introduced in a few of our previous labs. As a reminder, the predictive features were derived from cell measurements for malignant and benign tissue samples.

In [2]:
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)

## Separate the target from the predictors and re-label the target
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]
test_X = test.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]

To better understand how SVMs learn decision boundaries, we'll mostly work with just two of the available predictors: 'Radius' and 'Texture'.

Shown below is a scatter plot of these two features, colored by the outcome:

In [3]:
## Scatter plot of radius vs. texture by label
plt.scatter(train_X['Radius'], train_X['Texture'], c = train_y)
plt.show()

A substantial portion of this lab is devoted to visually understanding how different tuning parameters affect the behavior of SVM classifiers, so we will now define a couple of helper functions to plot the prediction boundaries of a classifier more efficiently:

In [4]:
## Function to create a grid spanning the min/max values of two variables
def make_grid(x1, x2, s=1):
    x1_min, x1_max = x1.min() - 1, x1.max() + 1  # The +/- will make visualizations look nicer
    x2_min, x2_max = x2.min() - 1, x2.max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, s), np.arange(x2_min, x2_max, s))
    return xx1, xx2

## Function to plot a classifier's predictions over a grid of x1, x2 values as a filled contour
def plot_surface(ax, clf, xx1, xx2, **params):
    Z = clf.predict(np.c_[xx1.ravel(), xx2.ravel()])
    Z = Z.reshape(xx1.shape)
    plot = ax.contourf(xx1, xx2, Z, **params)
    return plot

Part 1 - Choice of kernel function¶

The kernel function is generally the most important tuning parameter involved in support vector machines, as it determines the underlying shape of the model's decision boundaries. Recall that the kernel function provides a mapping of the original features into a higher dimensional space where linear separation can be found more easily. When this boundary is expressed in the original feature space it will be non-linear.

The SVC() class in sklearn is used to train SVM classifiers, and it supports linear, polynomial, radial basis function, and sigmoid kernels. Shown below is a simple example that trains a linear SVM and visualizes its prediction boundary:

In [5]:
## Train a linear SVM w/ C=1 
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='linear').fit(train_XS, train_y)

## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)

## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()

You should note that SVMs are sensitive to the scale of our data, so we standardized the data prior to fitting.

Additionally, you should recognize that the core methods of SVC() are consistent with other models in sklearn. That is, we'll primarily use the fit(), predict(), and predict_proba() methods to train and evaluate SVM classifiers and involve them in pipelines.
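For reference, here is a minimal sketch (not part of the original lab) of these methods in action. One detail worth knowing is that SVC() only provides predict_proba() when the model is created with probability=True, which calibrates probabilities using an internal cross-validation step and therefore slows training somewhat.

## A minimal sketch (not part of the original lab) of the core SVC() methods
## Note: predict_proba() requires the model to be created with probability=True
sketch_model = SVC(kernel='linear', probability=True).fit(train_XS, train_y)
print(sketch_model.predict(train_XS[:5]))        # predicted class labels
print(sketch_model.predict_proba(train_XS[:5]))  # estimated class probabilities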

That said, there are a couple of attributes unique to SVC() that are worth knowing about:

  • support_ - index positions of the support vectors
  • support_vectors_ - an array of the support vectors themselves

These attributes can help us understand the margin of a classifier:

In [6]:
## Show prediction surface, w/ support vectors plotted in blue
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.show()

You should note that a "hard margin" SVM would have support vectors that are exactly equidistant from the decision boundary.

However, these data are not perfectly separable and we are using a "soft margin" SVM. Thus, points on the edge of the margin, points inside the margin, and misclassified points on the wrong side of the decision boundary all help determine the decision boundary and are considered support vectors.
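If you'd like to inspect the support vectors directly, here is a small sketch (not part of the original lab) that uses the attributes listed above, along with n_support_, which reports the number of support vectors belonging to each class:

## A small sketch (not part of the original lab) inspecting the support vectors of the fitted linear SVM
print(fitted_model.support_[:10])           # index positions of the first few support vectors
print(fitted_model.n_support_)              # number of support vectors in each class
print(fitted_model.support_vectors_.shape)  # the support vectors themselves (rows of the standardized training data)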

Question 1:

  • Part A: Modify the example given above to use a polynomial kernel function with a degree of 3 (the default). Qualitatively, does it seem like the added complexity of this kernel function (relative to a linear kernel function) is worthwhile? Briefly explain.
  • Part B: Modify the example given above again to use a radial basis function (RBF) kernel. Qualitatively, how does this kernel function compare to the polynomial and linear kernel functions in terms of effectiveness?
  • Part C: Create a pipeline that performs standardization followed by the fitting of a support vector machine classifier. Then, use this pipeline to conduct a cross-validated grid search to determine whether an RBF, polynomial, or linear kernel yields the best cross-validated classification accuracy. You should hold the regularization (or "slack") parameter "C" constant at 1 during your search.
  • Part D: Create a visualization of the best model found in Part C that displays its support vectors.

Part 2 - Regularization or "slack"¶

The tuning parameter C is a regularization parameter that controls the "error budget" or the amount of "slack" that is given to a soft-margin classifier. According to the sklearn documentation:

The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy.

We can better understand the importance of C by comparing the decision boundaries corresponding to a couple of different choices of C using a polynomial kernel:

In [7]:
## Train a poly SVM w/ C=0.5 
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', degree=3, C=0.5).fit(train_XS, train_y)

## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)

## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.title("C=0.5")

## Train a poly SVM w/ C=1000
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', degree=3, C=1000).fit(train_XS, train_y)

## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)

## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.title("C=1000")
plt.show()

Here we see that C=1000 provides very little regularization, thereby leading to a more irregular decision boundary that does well on the training data but might not generalize very well to new data.
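To check this intuition numerically, here is a rough sketch (not part of the original lab) comparing training and test accuracy for these two choices of C; it assumes the test predictors are transformed using a scaler fit to the training data:

## A rough sketch (not part of the original lab) comparing training vs. test accuracy for two choices of C
scaler = StandardScaler().fit(train_X)
for c in [0.5, 1000]:
    m = SVC(kernel='poly', degree=3, C=c).fit(scaler.transform(train_X), train_y)
    print("C =", c,
          "| train accuracy:", round(m.score(scaler.transform(train_X), train_y), 3),
          "| test accuracy:", round(m.score(scaler.transform(test_X), test_y), 3))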

One important thing to note is that, by default, classification errors in each class are penalized equally, which can be undesirable for data with substantial class imbalances. This can be changed using the argument class_weight = 'balanced'. The diagram below illustrates the impact of balancing by class weight for data with a highly imbalanced class distribution:

In [8]:
## Comparison of weighted vs. non-weighted with imbalanced classes
from IPython.display import Image
Image("C:\\Users\\millerry\\OneDrive - Grinnell College\\Documents\\STA-395_Intro_ML\\Spring24\\Labs\\class_weight.PNG")
Out[8]:
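Although the breast cancer training data are not this badly imbalanced, here is a brief sketch (not part of the original lab) showing how the class_weight argument is specified:

## A brief sketch (not part of the original lab) of how class weights are specified
weighted_model = SVC(kernel='poly', degree=3, C=0.5, class_weight='balanced').fit(train_XS, train_y)
print(weighted_model.n_support_)  # number of support vectors in each class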

Question 2:

  • Part A: Using a polynomial kernel function, perform a cross-validated grid search to find the optimal values of the tuning parameters C and class_weight. Consider the values [0.1, 1, 100, 10000] for C, and either 'balanced' or None for class_weight.
  • Part B: Visualize the prediction surface and decision boundary of the best model from your search in Part A, displaying the data and highlighting the support vectors in a manner similar to the examples in Part 1 of the lab.

Part 3 - Choice of gamma¶

For the polynomial and radial basis function kernels, the parameter gamma controls how much influence individual data-points exert on the shape of the decision boundary. When gamma is larger, the boundary is free to bend around individual data-points; when gamma is smaller, each point's influence is spread more broadly, yielding a smoother boundary.

While their roles might sound similar, gamma is fundamentally different from C, which governs the width of the classifier's margin via how much error is deemed acceptable. We can see the effects of different values of gamma in our polynomial kernel SVM with C=2:

In [9]:
## Train a poly SVM w/ gamma = 10
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', C=2, gamma = 10).fit(train_XS, train_y)

## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)

## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.title("gamma = 10")

## Train a poly SVM w/ gamma = 0.1
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', C=2, gamma = 0.1).fit(train_XS, train_y)

## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)

## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.title("gamma = 0.1")
plt.show()

For larger gamma our polynomial-kernel boundary looks more irregular because it bends to accommodate individual data-points. For smaller gamma we see the opposite; the boundary appears much smoother.

For additional insight into the roles of the gamma and C parameters, I strongly encourage you to read this sklearn documentation page.

Question 3:

  • Part A: Starting with your best model from Question 2, try a few different values of gamma and visualize their decision boundaries. As your answer to this question, include a visualization of one of the values of gamma you found to produce a reasonable boundary.
  • Part B: Starting with your best model from Question 2 and using the insights you gained in Part A, tune gamma using a cross-validated grid search involving at least 4 sensibly chosen values.
  • Part C: Use a cross-validated grid search to identify a reasonable KNN model for these data (i.e., using just the two predictors of the tissue label that we've been working with). Record the tuning parameters of this model.
  • Part D: Use a cross-validated grid search to identify a reasonable decision tree model for these data. Record the tuning parameters of this model.
  • Part E: Compare the classification accuracy of the models in Parts B, C, and D on the test set. Which method seemed to perform best?

Part 4 - Support Vector Regression¶

A fundamental aspect of SVMs is that the decision boundary is determined by a subset of data-points, the support vectors. Data-points that are beyond the margin do not contribute to the model. This idea can be adapted to regression tasks by constructing a cost function that ignores data-points whose predictions are sufficiently close to their observed values.

Below is a simple illustration of support vector regression using the variable 'area.living' in the Iowa City home sales data to predict 'sale.amount':

In [10]:
## Read IC home sales data and split into training/test sets
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)

## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = StandardScaler().fit_transform(train_ic[['area.living']])

## Train an SVR model
from sklearn.svm import SVR
ic_reg = SVR(kernel = 'linear', C=100000).fit(train_ic_X, train_ic_y)

## Grid of values used to generate the prediction line
xx = np.linspace(np.min(train_ic_X), np.max(train_ic_X), 100)
yy = ic_reg.predict(xx.reshape(-1, 1))

## Scatterplot w/ prediction line
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()

Because the amount of regularization applied is inversely proportional to the value of C, we'd need to use an extremely large value of C to produce a model resembling ordinary least squares regression. In this example you can see that even with C=100000 the line is "shrunken" to have a slope closer to zero.
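To see how much shrinkage is occurring, here is a quick sketch (not part of the original lab) comparing the SVR slope to an ordinary least squares fit on the same data; the coef_ attribute is available here because we used a linear kernel:

## A quick sketch (not part of the original lab) comparing the SVR slope to ordinary least squares
from sklearn.linear_model import LinearRegression
ols_reg = LinearRegression().fit(train_ic_X, train_ic_y)
print("OLS slope:", ols_reg.coef_[0])    # slope from ordinary least squares
print("SVR slope:", ic_reg.coef_[0][0])  # slope from the linear-kernel SVR fit above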

A nice feature of SVR() is that you can easily fit non-linear models by changing the kernel function:

In [11]:
## Non-linear kernel (RBF)
ic_reg = SVR(kernel = 'rbf', C=100000).fit(train_ic_X, train_ic_y)

## Grid of values used to generate the prediction line
xx = np.linspace(np.min(train_ic_X), np.max(train_ic_X), 100)
yy = ic_reg.predict(xx.reshape(-1, 1))

## Scatterplot w/ prediction line
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()
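Returning to the idea that SVR ignores data-points whose predictions fall close to their observed values: in sklearn, the width of this insensitive region is controlled by the epsilon argument of SVR(). Here is a short sketch (not part of the original lab) showing that as epsilon grows, more points fall inside the insensitive region and fewer points remain support vectors. Because 'sale.amount' is measured in dollars, epsilon needs to be on a dollar scale for the effect to be noticeable.

## A short sketch (not part of the original lab) of the effect of epsilon on the number of support vectors
for eps in [1000, 10000, 50000]:
    m = SVR(kernel='linear', C=100000, epsilon=eps).fit(train_ic_X, train_ic_y)
    print("epsilon =", eps, "-> number of support vectors:", len(m.support_))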

Question 4:

  • Part A: Modify the amount of regularization, or "slack", in the 'rbf' kernel SVR example shown above (In [11]). Include an example of a model you'd consider to have unacceptably high variance and a model you'd consider to have unacceptably high bias. Clearly label each of these models.
  • Part B: Explore a few values of gamma in the model you deemed "high bias" in part A. Are you able to use gamma to better balance the bias-variance trade-off for this choice of C? Briefly explain, using an example to justify your explanation.

Part 5 - Comparing many models¶

In this section you'll use the entire set of predictors in the Wisconsin breast cancer data set instead of just 'Radius' and 'Texture'. Your goal will be to find the best performing classifier and associated data preparation steps when considering three different classifiers: KNN, decision tree, and SVM.

In [12]:
## Overwrite train/test X to now include all predictors
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)

Question 5:

  • Part A: Use the function pd.plotting.scatter_matrix() to create a scatter plot matrix showing the relationship between every pair of predictors in the training data. Use the argument c = train_y to color the points by the target variable. Use what you see in this plot to help you determine reasonable modeling steps in Parts B, C, and D.
  • Part B: Use a cross-validated grid search with accuracy as the scoring metric to determine a reasonable machine learning pipeline that uses a KNN classifier as the final step. You might consider standardization, normalization, dimension reduction, and KNN tuning parameters in your search. Report the best estimator and its performance.
  • Part C: Use a cross-validated grid search with accuracy as the scoring metric to determine a reasonable machine learning pipeline that uses a decision tree classifier as the final step. You should consider data preparation steps and tuning parameters in your search. Report the best estimator and its performance.
  • Part D: Use a cross-validated grid search with accuracy as the scoring metric to determine a reasonable machine learning pipeline that uses an SVM classifier as the final step. You should consider data preparation steps and tuning parameters in your search. Report the best estimator and its performance.
  • Part E: Now conduct a final cross-validated grid search that compares the three approaches you identified in Parts B, C, and D. Report the best performing method and its cross-validated classification accuracy.
  • Part F: Create an ROC curve displaying the performance of the best model identified in Part E using its predictions on the test data. Report the AUC and comment upon the effectiveness of this classifier.