Lab 7 - Support Vector Machines¶

In this lab you'll explore support vector machine (SVM) models in sklearn, paying particular attention to the impact of the kernel function, amount of regularization (slack), and weighting.

You'll need our standard set of libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

We'll again work with the Wisconsin Breast Cancer data set that was first introduced in Lab 4 when discussing classification performance metrics. Recall that the goal behind these data was to use predictive features derived from cell measurements to accurately classify malignant (cancerous) or benign (non-cancerous) tissue samples.

In [2]:
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)

## Separate the target from the predictors and re-label the target
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]
test_X = test.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]

To allow us to visualize our classification models we'll focus on two features, Radius (the mean distance of the central point within each cell to its boundary) and Texture (the standard deviation of grayscale pixel values).

As shown in the scatter plot below, Radius is a strong predictor of malignancy, with larger cells being indicative of an increased likelihood that a tissue sample is malignant. Texture is a weaker predictor, but higher values of this feature also correspond with an increased likelihood of a sample being malignant:

In [3]:
## Scatter plot of radius vs. texture by label
plt.scatter(train_X['Radius'], train_X['Texture'], c = train_y)
plt.xlabel('Radius')
plt.ylabel('Texture')
plt.show()

A focal point of this lab will be understanding the role of different hyperparameters in SVM classifiers. Consequently, we'll use a couple of helper functions to more easily plot the prediction boundaries of SVM models:

In [4]:
## Function to make a grid spanning the min/max values of two variables
def make_grid(x1, x2, s=1):
    x1_min, x1_max = x1.min() - 1, x1.max() + 1  # The +/- 1 makes visualizations less tight
    x2_min, x2_max = x2.min() - 1, x2.max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, s), np.arange(x2_min, x2_max, s))
    return xx1, xx2

## Function to get predictions from a classifier over a grid and plot them as a filled contour
def plot_surface(ax, mod, xx1, xx2):
    Z = mod.predict(np.c_[xx1.ravel(), xx2.ravel()])
    Z = Z.reshape(xx1.shape)
    plot = ax.contourf(xx1, xx2, Z)
    return plot

Part 1 - Kernel Choice¶

The kernel function is the most important hyperparameter involved in developing an SVM classifier as it is used to determine the underlying shape of the model's decision boundary.

The SVC() function, which is used to train SVM classifiers, supports the following kernel functions:

  • linear - linear decision boundary
  • poly - polynomial kernel
  • rbf - radial basis function
  • sigmoid - sigmoid kernel

Some of these kernels have additional associated parameters. For example, degree with poly, and gamma with rbf, poly, and sigmoid.
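
For instance, here's a minimal sketch (not yet applied to our data) of how these kernels and their associated parameters might be specified when creating a classifier:

## Illustrative only -- specifying kernels and their associated parameters
from sklearn.svm import SVC
svm_poly = SVC(kernel='poly', degree=2)        # polynomial kernel of degree 2
svm_rbf = SVC(kernel='rbf', gamma=0.5)         # radial basis function kernel
svm_sig = SVC(kernel='sigmoid', gamma=0.5)     # sigmoid kernel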

Below is an example of a simple SVM with a linear decision boundary:

In [5]:
## Train a linear SVM w/ C=1 
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='linear').fit(train_XS, train_y)

## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)

## Prediction boundary
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2) # Show the surface
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)      # Add the data points on top of the surface
plt.show()

Recall that SVMs are sensitive to the scale of the data, so we included a standardization step prior to fitting the SVC() model.

The SVC() function provides most of the same methods and attributes as other classification models in sklearn, including the fit(), predict(), and predict_proba() methods. Note, however, that the SVM algorithm doesn't intrinsically compute probability estimates, so you must set probability=True for SVC() to produce predicted probabilities. Additionally, there are two attributes specific to SVMs that are worth being aware of:

  • support_ - indices of the support vectors
  • support_vectors_ - an array of the support vectors themselves
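
For example, here's a minimal sketch (reusing the standardized training data created above) of requesting predicted probabilities and inspecting these attributes:

## Illustrative sketch -- probability estimates require probability=True
prob_model = SVC(kernel='linear', probability=True).fit(train_XS, train_y)
print(prob_model.predict_proba(train_XS[:5]))  # predicted probabilities for the first 5 training points
print(prob_model.support_[:5])                 # indices of a few support vectors
print(prob_model.support_vectors_.shape)       # one row per support vector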

We can use these attributes to further understand the margin of a classifier:

In [6]:
## Show prediction surface, w/ support vectors plotted in blue
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.show()

Note that a "hard margin" SVM would have support vectors that are exactly equidistant from the decision boundary, but these data are not perfectly separable using a linear kernel so we end up needing to use a "soft margin" SVM. Thus, the decision boundary is influenced by points within the margin and those which were incorrectly classified outside of the margin.
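
As a rough check on this idea, here's a minimal sketch (applicable only to the linear kernel, which exposes the coef_ attribute) that computes the margin's half-width as 1/||w|| and counts the support vectors:

## Illustrative sketch -- margin half-width and support vector count for the linear SVM above
w = fitted_model.coef_[0]                          # weights of the separating hyperplane
print('Margin half-width:', 1/np.linalg.norm(w))   # distance from the boundary to each margin edge
print('Number of support vectors:', len(fitted_model.support_))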

Question #1:

  • Part A: Modify the examples from this section to display the decision boundary of an SVM using a polynomial kernel function with a degree of 3 (the default). Based upon a visual assessment, does the added complexity of this kernel function seem worthwhile?
  • Part B: Now use a radial basis function kernel with the default value of gamma. Based upon a visual assessment, does this kernel seem to be more appropriate than the linear kernel? Does it seem to be better than the polynomial kernel?
  • Part C: Use a cross-validated grid search to assess the performance (classification accuracy) of the kernel functions you considered in Parts A and B as well as a linear kernel. Briefly comment upon how these kernel functions seem to compare to one another in terms of the bias-variance trade off.

Part 2 - Regularization (slack)¶

The hyperparameter C is a regularization parameter that controls the "error budget" or amount of slack that a soft margin SVM uses. More precisely, the sklearn documentation states:

The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy.

A more succinct practical description is given later on in the documentation:

A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly.

We can better understand the parameter by trying out a few examples:

In [7]:
## Example #1 - lower value of C
fitted_model = SVC(kernel='poly', C = 0.1).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()
In [8]:
## Example #2 - higher value of C
fitted_model = SVC(kernel='poly', C = 100).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()

We can see that the lower value of C leads to a smoother polynomial boundary that doesn't classify as many of the training data-points correctly, but might generalize better to new data.
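
One way to quantify this difference is to count the support vectors used by each model, since a lower C generally allows more points to fall inside the margin. A minimal sketch refitting the two models from above:

## Illustrative sketch -- number of support vectors at low and high values of C
for c in [0.1, 100]:
    m = SVC(kernel='poly', C=c).fit(train_XS, train_y)
    print('C =', c, '->', len(m.support_), 'support vectors')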

An important concept related to assessing different choices of C is whether to weight errors equally regardless of their class or to adjust for imbalanced classes. The argument class_weight = 'balanced' weights the error contribution of each data point inversely proportional to its class's frequency in the input data. This weights errors involving the minority class more heavily, leading to a decision boundary that is more strongly influenced by the minority class.
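
To see what "inversely proportional to the class frequencies" means in practice, here's a minimal sketch using sklearn's compute_class_weight() helper, which applies the same inverse-frequency formula, n_samples / (n_classes * count), described in the SVC() documentation:

## Illustrative sketch -- the per-class weights implied by class_weight='balanced'
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=train_y)
print(weights)  # the less frequent class receives the larger weight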

Question #2:

  • Part A: Using the rbf kernel function, perform a cross-validated grid search to determine the best value of C from the candidates [0.1,10,100,1000] and the best class weighting scheme from the options 'balanced' or None. Print the results of the grid search sorted by mean_test_score using classification accuracy as the scoring metric.
  • Part B: Plot the decision boundary of the best performing model using the examples given earlier in the lab (you might need to refit it rather than relying upon the best_estimator_).
  • Part C: Consider the value of C you selected in Part A and the visualization of this model's decision boundary in Part B. How do the bias and variance of this model compare with the rbf kernel SVM you visualized in Part B of Question #1?

Part 3 - Choosing gamma¶

For poly and rbf kernels the parameter gamma controls the influence that individual training samples exert on the shape of the decision boundary. The sklearn documentation states:

The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.

Thus, if gamma is very large, then each data-point will have a small area of influence, so the decision boundary will only be influenced by data-points that are very close to it, thereby leading to overfitting. Conversely, if gamma is too small, then each data-point will have a large area of influence and the decision boundary will be influenced by all or most of the training set, leading to a model that is overly smooth and will struggle to capture complexity.
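
To make the "radius of influence" idea concrete, here's a minimal sketch that evaluates the rbf kernel, exp(-gamma * d^2), for two points a fixed distance apart; larger values of gamma make the similarity (and hence a support vector's influence) decay much faster with distance:

## Illustrative sketch -- rbf similarity between two points 2 units apart
dist = 2.0
for g in [0.01, 0.1, 1, 10]:
    print('gamma =', g, '-> similarity =', math.exp(-g * dist**2))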

Let's look at the impact of a couple of different values of gamma on a polynomial kernel SVM:

In [10]:
## Example #1 - lower value of gamma
fitted_model = SVC(kernel='poly', gamma = 0.1).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()
In [11]:
## Example #2 - higher value of gamma
fitted_model = SVC(kernel='poly', gamma = 10).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()

As you can see, when gamma is larger the decision boundary looks more irregular, while when gamma is smaller the decision boundary is much smoother due to it being influenced by a larger share of the training data.
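
A quick way to see the overfitting tendency is to compare training accuracy at the two values of gamma used above; the more flexible boundary will typically fit the training data more closely. A minimal sketch:

## Illustrative sketch -- training accuracy at the two gamma values shown above
for g in [0.1, 10]:
    m = SVC(kernel='poly', gamma=g).fit(train_XS, train_y)
    print('gamma =', g, '-> training accuracy =', m.score(train_XS, train_y))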

Question #3:

  • Part A: Using the rbf kernel function, perform a cross-validated grid search to determine the optimal combination of C and gamma. Consider the values [0.1,10,100,1000] for C and [0.01, 0.1, 1, 10] for gamma.
  • Part B: Plot the decision boundary of the best fitting model from Part A (again you might need to refit this model rather than using best_estimator_). How do the bias and variance of this model compare with the best model you identified and explored in Parts B and C of Question #2?
  • Part C: Report the classification accuracy of the best model from Part A on the test set.

Part 4 - Support Vector Regression¶

While SVMs were developed for classification tasks, they fundamentally operate by finding a prediction boundary determined by a subset of the training data (the support vectors), with points outside of a certain margin not contributing to that boundary. These same principles can be applied to regression tasks.

Below is a simple illustration of support vector regression using the variable area.living and the outcome sale.amount in the Iowa City Homes data set:

In [12]:
## Read IC home sales data and split into training/test sets
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)

## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = StandardScaler().fit_transform(train_ic[['area.living']])

## Train an SVR model
from sklearn.svm import SVR
ic_reg = SVR(kernel = 'linear', C=100000).fit(train_ic_X, train_ic_y)

## Grid of values used to generate the prediction line
xx = np.linspace(np.min(train_ic_X), np.max(train_ic_X), 100)
yy = ic_reg.predict(xx.reshape(-1, 1))

## Scatterplot w/ prediction line
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()

You should notice that I used a very large value of C in this example, so very little regularization was applied and the model looks similar to ordinary least squares regression (though the slope is still a bit shallower than it might otherwise be).
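
As a rough check on that claim, here's a minimal sketch comparing the fitted SVR slope against an ordinary least squares slope on the same standardized feature:

## Illustrative sketch -- compare the linear SVR slope with an OLS slope
from sklearn.linear_model import LinearRegression
ols = LinearRegression().fit(train_ic_X, train_ic_y)
print('SVR slope:', ic_reg.coef_[0][0])
print('OLS slope:', ols.coef_[0])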

A nice feature of SVR() is that you can easily fit non-linear models by changing the kernel function. Recall that with LinearRegression() you'd have to set up a feature expansion transformation to do this.

In [13]:
## Polynomial model
ic_reg = SVR(kernel = 'poly', C=100000).fit(train_ic_X, train_ic_y)
yy = ic_reg.predict(xx.reshape(-1, 1))  # Re-compute predictions using the polynomial model
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()
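
For comparison, here's a minimal sketch of the feature expansion approach mentioned above, combining PolynomialFeatures with LinearRegression in a pipeline (degree 3 to match the default degree of the poly kernel):

## Illustrative sketch -- the feature expansion alternative using LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(train_ic_X, train_ic_y)
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, poly_reg.predict(xx.reshape(-1, 1)))
plt.show()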

Question #4:

  • Part A: Using the rbf kernel function, try a few values of C until you find a model that you consider to have unacceptably high bias. Display this model on a scatter plot similar to the ones shown in this section of the lab.
  • Part B: For the value of C you identified in Part A, try modifying the gamma parameter to reduce the model's bias. Briefly comment upon whether you are able to obtain a reasonable fit to the training data by manipulating the gamma parameter when the amount of regularization is high.
  • Part C: Now find a value of C that you consider to have unacceptably high variance. Display this model on a scatter plot.
  • Part D: For the value of C you identified in Part C, try modifying the gamma parameter to reduce the model's variance. Briefly comment upon whether you are able to obtain a reasonable fit to the training data.
  • Part E: Perform a cross-validated grid search using the values of C you identified in Parts A and C as well as 3 values that fall in between them. You may use the default scoring metric. Display the best fitting model on a scatter plot.