This lab covers the implementation of support vector machines (SVM) in sklearn, paying particular attention to the choice of kernel function, the degree of regularization (or "slack"), and the importance given to individual data-points when using non-linear kernel functions (gamma).
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
We'll continue working with the Wisconsin breast cancer classification data set introduced in a few of our previous labs. As a reminder, the predictive features were derived from cell measurements for malignant and benign tissue samples.
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)
## Separate the target from the predictors and re-label the target
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]
test_X = test.drop(['ID','Label'], axis = 1)[['Radius', 'Texture']]
To better understand how SVMs learn decision boundaries, we'll mostly work with just two of the available predictors: 'Radius' and 'Texture'.
Shown below is a scatter plot of how these features relate to the outcome:
## Scatter plot of radius vs. texture by label
plt.scatter(train_X['Radius'], train_X['Texture'], c = train_y)
plt.show()
A substantial portion of this lab will be devoted to visually understanding the role of different tuning parameters in the behavior of SVM classifiers, so we will now define a couple of helper functions to plot the prediction boundaries of a classifier more efficiently:
## Function to create a grid spanning the min/max values of two variables
def make_grid(x1, x2, s=1):
    x1_min, x1_max = x1.min() - 1, x1.max() + 1  # The +/- will make visualizations look nicer
    x2_min, x2_max = x2.min() - 1, x2.max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, s), np.arange(x2_min, x2_max, s))
    return xx1, xx2

## Function to get predictions from a classifier over a grid of x,y
def plot_surface(ax, clf, xx1, xx2, **params):
    Z = clf.predict(np.c_[xx1.ravel(), xx2.ravel()])
    Z = Z.reshape(xx1.shape)
    plot = ax.contourf(xx1, xx2, Z, **params)
    return plot
The kernel function is generally the most important tuning parameter involved in support vector machines, as it determines the underlying shape of the model's decision boundaries. Recall that the kernel function provides a mapping of the original features into a higher-dimensional space where a linear separation can be found more easily. When this boundary is expressed in the original feature space, it will generally be non-linear (unless the kernel itself is linear).
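To make this idea concrete, here is a small sketch (using arbitrary example vectors, not the lab data) verifying that a degree-2 polynomial kernel equals an ordinary dot product computed after an explicit feature expansion:
## Sketch: the "kernel trick" with a degree-2 polynomial kernel, K(x, z) = (x . z)^2
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
k_direct = (x @ z) ** 2  # kernel value computed entirely in the original 2-D space
## The same value via an explicit mapping into the 3-D space (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])
k_mapped = phi(x) @ phi(z)
print(k_direct, k_mapped)  # both equal 121.0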
The SVC() function in sklearn is used to train SVM classifiers, and it supports linear, polynomial, radial basis function, and sigmoid kernels. Shown below is a simple example that trains a linear SVM and visualizes its prediction boundary:
## Train a linear SVM w/ C=1
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='linear').fit(train_XS, train_y)
## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)
## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.show()
You should note that SVMs are sensitive to the scale of our data, so we standardized the data prior to fitting.
Additionally, you should recognize that the core methods of SVC() are consistent with other models in sklearn. That is, we'll primarily use the fit(), predict(), and predict_proba() methods to train and evaluate SVM classifiers and involve them in pipelines.
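For example, a minimal sketch of that workflow (assuming the test set is standardized with the scaler fit to the training data) might look like the following; note that predict_proba() is only available when the model is constructed with probability=True:
## Sketch: the usual fit / predict / score workflow with SVC
scaler = StandardScaler().fit(train_X)
train_XS = scaler.transform(train_X)
test_XS = scaler.transform(test_X)
## probability=True enables predict_proba() at some extra computational cost
svm_clf = SVC(kernel='linear', probability=True).fit(train_XS, train_y)
print(svm_clf.predict(test_XS)[:5])        # predicted classes for the first few test cases
print(svm_clf.predict_proba(test_XS)[:5])  # estimated class probabilities
print(svm_clf.score(test_XS, test_y))      # test-set accuracy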
That said, there are a couple of attributes unique to SVC() that are worth knowing about:
support_ - the index positions of the support vectors
support_vectors_ - an array of the support vectors themselves
These attributes can help us understand the margin of a classifier:
## Show prediction surface, w/ support vectors plotted in blue
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.show()
You should note that a "hard margin" SVM would have support vectors that lie exactly on the edge of the margin, equidistant from the decision boundary.
However, these data are not perfectly separable, so we are using a "soft margin" SVM. Thus, points on the edge of the margin, points inside the margin, and incorrectly classified points on the wrong side of the decision boundary all help determine the decision boundary and are considered support vectors.
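If you'd like to inspect these quantities directly, a short sketch like the one below prints the support vector information stored in the fitted linear SVM (n_support_ gives the count in each class):
## Sketch: inspecting the support vectors of the fitted linear SVM
print(fitted_model.support_[:10])           # index positions of the first few support vectors
print(fitted_model.support_vectors_.shape)  # (number of support vectors, number of features)
print(fitted_model.n_support_)              # number of support vectors in each class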
Question 1:
The tuning parameter C is a regularization parameter that controls the "error budget" or the amount of "slack" that is given to a soft-margin classifier. According to the sklearn documentation:
The C parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy.
We can better understand the importance of C by comparing the decision boundaries corresponding to a couple of different choices of C using a polynomial kernel:
## Train a poly SVM w/ C=0.5
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', degree=3, C=0.5).fit(train_XS, train_y)
## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)
## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.title("C=0.5")
## Train a poly SVM w/ C=1000
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', degree=3, C=1000).fit(train_XS, train_y)
## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)
## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
ax.scatter(fitted_model.support_vectors_[:,0], fitted_model.support_vectors_[:,1])
plt.title("C=1000")
plt.show()
Here we see that C=1000 provides very little regularization, thereby leading to a more irregular decision boundary that does well on the training data but might not generalize very well to new data.
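One way to check this suspicion (a sketch, assuming the test set is standardized using the scaler fit to the training data) is to compare training and test accuracy at the two values of C:
## Sketch: training vs. test accuracy for a small and a large value of C
scaler = StandardScaler().fit(train_X)
train_XS, test_XS = scaler.transform(train_X), scaler.transform(test_X)
for c in [0.5, 1000]:
    mod = SVC(kernel='poly', degree=3, C=c).fit(train_XS, train_y)
    print(f"C={c}: train acc = {mod.score(train_XS, train_y):.3f}, test acc = {mod.score(test_XS, test_y):.3f}")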
One important thing to note is that, by default, errors in each class are treated equally, which can be undesirable for data with substantial class imbalances. This can be changed by using the argument class_weight = 'balanced'. The diagram below illustrates the impact of balancing by class weight for data with a highly imbalanced class distribution:
## Comparison of weighted vs. non-weighted with imbalanced classes
from IPython.display import Image
Image("C:\\Users\\millerry\\OneDrive - Grinnell College\\Documents\\STA-395_Intro_ML\\Spring24\\Labs\\class_weight.PNG")
Question 2:
Experiment with different combinations of C and class_weight. Consider the values [0.1, 1, 100, 10000] for C, and either 'balanced' or None for class_weight.
For the polynomial and radial basis function kernels, the parameter gamma controls the influence of individual data-points on the shape of the decision boundary. When gamma is larger, individual data-points exert more influence on the shape of the boundary near them, allowing the boundary to bend around more of the training points; when gamma is smaller, individual points exert less influence and the boundary is smoother.
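For reference, gamma enters the kernel computation directly. The small sketch below (using sklearn.metrics.pairwise and arbitrary example vectors) shows where it appears in the radial basis function and polynomial kernels:
## Sketch: where gamma appears in the RBF and polynomial kernels
## RBF:        K(x, z) = exp(-gamma * ||x - z||^2)
## Polynomial: K(x, z) = (gamma * <x, z> + coef0)^degree
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
print(rbf_kernel(x, z, gamma=10))                            # larger gamma -> similarity drops off quickly with distance
print(polynomial_kernel(x, z, degree=3, gamma=10, coef0=1))  # gamma rescales the inner product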
While their roles might sound similar, gamma is fundamentally different from C, which governs the classifier's margin via how much error is deemed acceptable. We can see the effects of different values of gamma in our polynomial kernel SVM with C=2:
## Train a poly SVM w/ gamma = 10
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', C=2, gamma = 10).fit(train_XS, train_y)
## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)
## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.title("gamma = 10")
## Train a poly SVM w/ gamma = 0.1
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
train_XS = StandardScaler().fit_transform(train_X)
fitted_model = SVC(kernel='poly', C=2, gamma = 0.1).fit(train_XS, train_y)
## Make a grid spanning our two predictors
xg1, xg2 = make_grid(train_XS[:,0], train_XS[:,1], s=0.05)
## Show prediction surface
fig, ax = plt.subplots()
plot_surface(ax, clf = fitted_model, xx1 = xg1, xx2 = xg2)
ax.scatter(train_XS[:,0], train_XS[:,1], c = train_y)
plt.title("gamma = 0.1")
plt.show()
For larger gamma our polynomial kernel's boundary looks more irregular, due to the stronger influence individual points exert on it. For smaller gamma we see the opposite: the boundary appears much smoother.
For additional insight into the roles of the gamma and C parameters, I strongly encourage you to read this sklearn documentation page.
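If you want a starting point for tuning these parameters jointly, a sketch of a cross-validated grid search over C and gamma (with purely illustrative candidate values) might look like the following:
## Sketch: cross-validated grid search over C and gamma for an RBF-kernel SVM
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}  # illustrative values only
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
search.fit(train_XS, train_y)
print(search.best_params_, search.best_score_)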
Question 3:
Part A: Try several different values of gamma and visualize their decision boundaries. As your answer to this question, include a visualization of one of the values of gamma you found to produce a reasonable boundary.
Part B: Tune gamma using a cross-validated grid search involving at least 4 sensibly chosen values.

A fundamental aspect of SVMs is that the decision boundary is determined by a subset of data-points, the support vectors. Data-points that lie beyond the margin do not contribute to the model. This idea can be adapted to regression tasks by constructing a cost function that ignores data-points whose prediction is close to their observed value.
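In sklearn this tolerance is controlled by the epsilon argument of SVR(); residuals smaller than epsilon contribute nothing to the loss, as the small sketch below illustrates:
## Sketch: the epsilon-insensitive loss used by support vector regression
## Residuals smaller than epsilon cost nothing; larger residuals are penalized linearly
def eps_insensitive_loss(y, y_hat, epsilon):
    return np.maximum(0, np.abs(y - y_hat) - epsilon)
print(eps_insensitive_loss(np.array([10.0, 10.0, 10.0]),
                           np.array([10.3, 9.0, 12.5]), epsilon=0.5))  # [0.  0.5 2. ]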
Below is a simple illustration of support vector regression using the variable 'area.living' in the Iowa City home sales data to predict 'sale.amount':
## Read IC home sales data and split into training/test sets
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Create X and y
train_ic_y = train_ic['sale.amount']
train_ic_X = StandardScaler().fit_transform(train_ic[['area.living']])
## Train an SVR model
from sklearn.svm import SVR
ic_reg = SVR(kernel = 'linear', C=100000).fit(train_ic_X, train_ic_y)
## Grid of values used to generate the prediction line
xx = np.linspace(np.min(train_ic_X), np.max(train_ic_X), 100)
yy = ic_reg.predict(xx.reshape(-1, 1))
## Scatterplot w/ prediction line
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()
Because the tuning parameter C applies regularization inversely proportional to its value, we'd need to use an extremely large value of C to approximate an unregularized fit such as ordinary least squares regression. In this example you can see that even with C=100000 the line is "shrunken" to have a slope closer to zero.
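To see how much shrinkage remains, a sketch like the following compares the slope of the linear SVR fit to the slope of an ordinary least squares fit on the same standardized predictor:
## Sketch: compare the linear SVR slope with the ordinary least squares slope
from sklearn.linear_model import LinearRegression
ols = LinearRegression().fit(train_ic_X, train_ic_y)
print("SVR slope:", ic_reg.coef_)  # coef_ is available because the kernel is linear
print("OLS slope:", ols.coef_)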
A nice feature of SVR() is that you can easily fit non-linear models by changing the kernel function:
## Non-linear kernel (RBF)
ic_reg = SVR(kernel = 'rbf', C=100000).fit(train_ic_X, train_ic_y)
## Grid of values used to generate the prediction line
xx = np.linspace(np.min(train_ic_X), np.max(train_ic_X), 100)
yy = ic_reg.predict(xx.reshape(-1, 1))
## Scatterplot w/ prediction line
plt.scatter(train_ic_X, train_ic_y)
plt.plot(xx, yy)
plt.show()
Question 4:
Part A: Experiment with several different values of C (for the 'rbf' kernel SVM). Include an example of a model you'd consider to have unacceptably high variance and a model you'd consider to have unacceptably high bias. Clearly label each of these models.
Part B: Experiment with gamma in the model you deemed "high bias" in Part A. Are you able to use gamma to better balance the bias-variance trade-off for this choice of C? Briefly explain, using an example to justify your explanation.

In this section you'll use the entire set of predictors in the Wisconsin breast cancer data set instead of just 'Radius' and 'Texture'. Your goal will be to find the best performing classifier and associated data preparation steps when considering three different classifiers: KNN, decision tree, and SVM.
## Overwrite train/test X to now include all predictors
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)
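One reasonable way to organize this work (a template rather than a prescribed solution, with purely illustrative parameter values) is to put the preprocessing steps and the classifier into a single Pipeline so the same cross-validation scheme can be applied to each of the three methods:
## Sketch: a Pipeline template combining preprocessing with a classifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', SVC())])  # swap in KNeighborsClassifier() or DecisionTreeClassifier() to compare methods
params = {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']}  # illustrative values only
search = GridSearchCV(pipe, params, cv=5, scoring='accuracy').fit(train_X, train_y)
print(search.best_params_, search.best_score_)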
Question 5:
Part A: Use pd.plotting.scatter_matrix() to create a scatter plot matrix showing the pairwise relationship between every pair of predictors in the training data. Use the argument c = train_y to color the points by the target variable. Use what you see in this plot to help you determine reasonable modeling steps in Parts B, C, and D.