In this lab you'll explore support vector machine (SVM) models in sklearn, paying particular attention to the impact of the kernel function, the amount of regularization (slack), and class weighting.
You'll need our standard set of libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
We'll again work with the Wisconsin Breast Cancer data set that was first introduced in Lab 4 when discussing classification performance metrics. Recall that the goal behind these data was to use predictive features derived from cell measurements to accurately classify malignant (cancerous) or benign (non-cancerous) tissue samples.
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)
## Separate the target from the predictors and re-label the target
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]
test_X = test.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]
To allow us to visualize our classification models we'll focus on two features, Radius (the mean distance from the central point of each cell to its boundary) and Texture (the standard deviation of grayscale pixel values).
As shown in the scatter plot below, Radius is a strong predictor of malignancy, with larger cells being indicative of an increased likelihood that a tissue sample is malignant. Texture is a weaker predictor, but higher values of this feature also correspond with an increased likelihood of a sample being malignant:
## Scatter plot of radius vs. texture by label
plt.scatter(train_X['Radius'], train_X['Texture'], c = train_y)  # Color points by label (1 = malignant, 0 = benign)
plt.xlabel('Radius')
plt.ylabel('Texture')
plt.show()
A focal point of this lab will be understanding the role of different hyperparameters in SVM classifiers. Consequently, we'll use a couple of helper functions to more easily plot the prediction boundaries of SVM models:
## Function to make a grid spanning the min/max values of two variables
def make_grid(x1, x2, s=1):
    x1_min, x1_max = x1.min() - 1, x1.max() + 1  # The +/- 1 makes visualizations less tight
    x2_min, x2_max = x2.min() - 1, x2.max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, s), np.arange(x2_min, x2_max, s))
    return xx1, xx2
## Function to get predictions from a classifier over a grid of x,y
def plot_surface(ax, mod, xx1, xx2):
    Z = mod.predict(np.c_[xx1.ravel(), xx2.ravel()])
    Z = Z.reshape(xx1.shape)
    plot = ax.contourf(xx1, xx2, Z)
    return plot
The kernel function is the most important hyperparameter involved in developing an SVM classifier, as it is used to determine the underlying shape of the model's decision boundary. The SVC() function, which is used to train SVM classifiers, supports the following kernel functions:
linear - linear decision boundary
poly - polynomial kernel
rbf - radial basis function
sigmoid - sigmoid kernel

Some of these kernels have additional associated parameters. For example, degree with poly, and gamma with rbf, poly, and sigmoid.
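As a quick sketch of how these kernel arguments are supplied (the particular values shown here are arbitrary and only for illustration):
## Specifying kernels and their associated parameters (illustrative values only)
from sklearn.svm import SVC
svm_poly = SVC(kernel='poly', degree=3)         # polynomial kernel of degree 3
svm_rbf = SVC(kernel='rbf', gamma=0.5)          # radial basis function kernel
svm_sigmoid = SVC(kernel='sigmoid', gamma=0.5)  # sigmoid kernel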
Below is an example of a simple SVM with a linear decision boundary:
## Train a linear SVM w/ C=1
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='linear').fit(train_XS, train_y)
## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)
## Prediction boundary
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2) # Show the surface
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y) # Add the data points on top of the surface
plt.show()
Recall that SVMs are sensitive to the scale of the data, so we included a standardization step prior to fitting the SVC() model.
The SVC() function provides most of the same methods and attributes as other classification models in sklearn, including the fit(), predict(), and predict_proba() methods. Note, however, that the SVM algorithm doesn't intrinsically compute probability estimates, so you must set probability=True in order for SVC() to provide predicted probabilities.
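For example, here is a brief sketch (the prob_model name is introduced here just for illustration) of refitting the linear-kernel SVM so that predicted probabilities are available:
## Refit the linear SVM with probability=True so that predict_proba() is available
## (note: this adds an internal cross-validation step, so it is slower to fit)
prob_model = SVC(kernel='linear', probability=True).fit(train_XS, train_y)
print(prob_model.predict_proba(train_XS[:5,:]))  # Columns correspond to classes 0 (benign) and 1 (malignant)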
Additionally, there are two other attributes specific to SVMs that are worth being aware of:

support_ - indices of the support vectors
support_vectors_ - an array of the support vectors themselves

We can use these attributes to further understand the margin of a classifier:
## Show prediction surface, w/ support vectors plotted in blue
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.show()
Note that a "hard margin" SVM would have support vectors that are exactly equidistant from the decision boundary, but these data are not perfectly separable using a linear kernel so we end up needing to use a "soft margin" SVM. Thus, the decision boundary is influenced by points within the margin and those which were incorrectly classified outside of the margin.
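As a rough check (a small addition that isn't part of the original lab), we can use the decision_function() method, whose values are -1 and +1 at the margin edges, to count how many training points fall on or inside the margin:
## Signed decision-function values for the training data (margin edges are at -1 and +1)
dists = fitted_model.decision_function(train_XS)
print("Number of support vectors:", len(fitted_model.support_))
print("Training points on or inside the margin:", np.sum(np.abs(dists) <= 1))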
Question #1:
Part A: Fit an SVM classifier to the standardized training data using the poly kernel (with its default settings) and visualize its decision boundary.
Part B: Now fit an SVM classifier using the rbf kernel and its default value of gamma, and visualize its decision boundary. Based upon a visual assessment, does this kernel seem to be more appropriate than the linear kernel? Does it seem to be better than the polynomial kernel?
The hyperparameter C is a regularization parameter that controls the "error budget", or amount of slack, that a soft margin SVM uses. More precisely, the sklearn documentation states:
The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy.
A more succinct practical description is given later on in the documentation:
A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly.
We can better understand the parameter by trying out a few examples:
## Example #1 - lower value of C
fitted_model = SVC(kernel='poly', C = 0.1).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()
## Example #2 - higher value of C
fitted_model = SVC(kernel='poly', C = 100).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()
We can see that the lower value of C leads to a smoother polynomial boundary that doesn't classify as many of the training data-points correctly, but might generalize better to new data.
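One way to judge which of these settings actually generalizes better is cross-validation. Below is a minimal sketch (not part of the original example) that uses GridSearchCV to compare a few candidate values of C for the polynomial kernel:
## Cross-validated comparison of several values of C (candidate values chosen arbitrarily)
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(kernel='poly'), param_grid, scoring='accuracy', cv=5)
grid.fit(train_XS, train_y)
print(pd.DataFrame(grid.cv_results_)[['param_C', 'mean_test_score']])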
An important consideration when assessing different choices of C is whether to weight errors equally regardless of their class, or to adjust for imbalanced classes. The argument class_weight = 'balanced' will weight the error contributions of each data point inversely proportional to the class frequencies in the input data. This weights errors involving the minority class more heavily and leads to a decision boundary that is more strongly influenced by the minority class.
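As a brief sketch (the kernel and value of C here are chosen only for illustration), balanced class weights can be requested when creating the model:
## Fit a polynomial-kernel SVM with balanced class weights and plot its decision boundary
weighted_model = SVC(kernel='poly', C=100, class_weight='balanced').fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = weighted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()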
Question #2:
Part A: Using the rbf kernel function, perform a cross-validated grid search to determine the best value of C from the candidates [0.1,10,100,1000] and the best class weighting scheme from the options 'balanced' or None. Print the results of the grid search sorted by mean_test_score, using classification accuracy as the scoring metric.
Part B: Visualize the decision boundary of the best model found by your grid search (best_estimator_).
Part C: Consider the value of C you selected in Part A and the visualization of this model's decision boundary in Part B. How do the bias and variance of this model compare with the rbf kernel SVM you visualized in Part B of Question #1?

gamma
For poly and rbf kernels, the parameter gamma controls the influence that individual training samples exert on the shape of the decision boundary. The sklearn documentation states:
The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.
Thus, if gamma is very large, then each data-point will have a small area of influence, so the decision boundary will only be influenced by data-points that are very close to it, thereby leading to overfitting. Conversely, if gamma is too small, then each data-point will have a large area of influence and the decision boundary will be influenced by all or most of the training set, leading to a model that is overly smooth and will struggle to capture complexity.
Let's look at the impact of a couple of different values of gamma on a polynomial kernel SVM:
## Example #1 - lower value of gamma
fitted_model = SVC(kernel='poly', gamma = 0.1).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()
## Example #2 - higher value of gamma
fitted_model = SVC(kernel='poly', gamma = 10).fit(train_XS, train_y)
fig, ax = plt.subplots()
plot_surface(ax, mod = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()
As you can see, when gamma is larger the decision boundary looks more irregular, while when gamma is smaller the decision boundary is much smoother due to it being influenced by a larger share of the training data.
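One rough way to gauge how many training points influence each boundary (a small check that isn't part of the original lab) is to count the support vectors for each setting:
## Compare the number of support vectors for the two values of gamma used above
for g in [0.1, 10]:
    m = SVC(kernel='poly', gamma=g).fit(train_XS, train_y)
    print("gamma =", g, "-> number of support vectors:", len(m.support_))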
Question #3:
Part A: Using the rbf kernel function, perform a cross-validated grid search to determine the optimal combination of C and gamma. Consider the values [0.1,10,100,1000] for C and [0.01, 0.1, 1, 10] for gamma.
Part B: Visualize the decision boundary of the best model found by your grid search (best_estimator_). How do the bias and variance of this model compare with the best model you identified and explored in Parts B and C of Question #2?

While SVMs were developed for classification tasks, they fundamentally operate by finding a prediction boundary that is determined by a subset of the training data (the support vectors), with data outside of a certain margin not contributing to the model's prediction boundary. These same principles can be applied to regression tasks.
Below is a simple illustration of support vector regression using the variable area.living and the outcome sale.amount in the Iowa City Homes data set:
## Read IC home sales data and split into training/test sets
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = StandardScaler().fit_transform(train_ic[['area.living']])
## Train an SVR model
from sklearn.svm import SVR
ic_reg = SVR(kernel = 'linear', C=100000).fit(train_ic_X, train_ic_y)
## Grid of values used to generate the prediction line
xx = np.linspace(np.min(train_ic_X), np.max(train_ic_X), 100)
yy = ic_reg.predict(xx.reshape(-1, 1))
## Scatterplot w/ prediction line
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()
You should notice that I used a very large value of C in this example, so very little regularization was applied, causing the model to look similar to ordinary least squares regression (though you might notice the slope is still a bit shallower than it might otherwise be).
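To see the comparison directly, one option (a small sketch using LinearRegression, which is not part of the original example) is to overlay an ordinary least squares fit on the same plot:
## Overlay an ordinary least squares fit for comparison with the linear-kernel SVR
from sklearn.linear_model import LinearRegression
ols = LinearRegression().fit(train_ic_X, train_ic_y)
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, ic_reg.predict(xx.reshape(-1, 1)), label = 'SVR (linear kernel)')
plt.plot(xx, ols.predict(xx.reshape(-1, 1)), label = 'OLS')
plt.legend()
plt.show()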
A nice feature of SVR() is that you can easily fit non-linear models by changing the kernel function. Recall that with LinearRegression() you'd have to set up a feature expansion transformation to do this.
## Polynomial model
ic_reg = SVR(kernel = 'poly', C=100000).fit(train_ic_X, train_ic_y)
yy = ic_reg.predict(xx.reshape(-1, 1))  # Re-compute predictions using the polynomial model
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()
Question #4:
Part A: Using the rbf kernel function, try a few values of C until you find a model that you consider to have unacceptably high bias. Display this model on a scatter plot similar to the ones shown in this section of the lab.
Part B: Using the value of C you identified in Part A, try modifying the gamma parameter to reduce the model's bias. Briefly comment upon whether you are able to obtain a reasonable fit to the training data by manipulating the gamma parameter when the amount of regularization is high.
Part C: Next, try a few values of C until you find a model that you consider to have unacceptably high variance. Display this model on a scatter plot.
Part D: Using the value of C you identified in Part C, try modifying the gamma parameter to reduce the model's variance. Briefly comment upon whether you are able to obtain a reasonable fit to the training data.
Part E: Finally, perform a cross-validated grid search over the values of C you identified in Parts A and C, as well as 3 values that fall in between them. You may use the default scoring metric. Display the best fitting model on a scatter plot.