
COMP90051 Workshop 6b
Support Vector Machines

In this section, we’ll explore how the SVM hyperparameters (i.e. the penalty parameter, the kernel, and any kernel parameters) affect the decision surface.

In [13]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from timeit import default_timer as timer
sns.set_style('darkgrid')
plt.rcParams['figure.dpi'] = 108

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV

1. Data set
To make visualisation and training easy, we’ll consider a small binary classification data set called cats.csv (available from the LMS).
It contains observations for 144 cats.
There are two features: heart and body weight measured in kilograms.
The target variable is the sex of the cat (we encode ‘male’ as -1 and ‘female’ as +1).

[Note: the data set originates from the following paper: R. A. Fisher (1947) The analysis of covariance method for the relation between a part and the whole, Biometrics 3, 65–68]

Ensure that cats.csv is located in the same directory as this notebook, then run the following code block to read the CSV file using pandas.

In [14]:

full_df = pd.read_csv('cats.csv')
full_df.SEX = full_df.SEX.map({'M': -1, 'F': 1})
full_df.head()

Out[14]:

   HWT  BWT  SEX
0  2.0  7.0    1
1  2.0  7.4    1
2  2.0  9.5    1
3  2.1  7.2    1
4  2.1  7.3    1
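
Optionally, you can sanity-check the feature ranges with a quick pandas summary (this cell is a small optional sketch using only the DataFrame loaded above):

In [ ]:

# Summary statistics (count, mean, std, quartiles) for the two features
full_df.drop('SEX', axis=1).describe()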

Let’s split the data into train/test sets so that we can evaluate our trained SVM.
(Note that this is likely to be unreliable for such a small data set.)

In [15]:

train_df, test_df = train_test_split(full_df, test_size=0.2, random_state=1)

Since SVMs incorporate a penalty term for the weights (proportional to $\|\mathbf{w}\|_2^2$), it’s usually beneficial to standardise the features so that they vary on roughly the same scale.

Exercise: Complete the code block below to standardise the features, so that each feature has zero mean and unit variance.

Hint: use StandardScaler imported above.

In [16]:

scaler = StandardScaler()
X_train = scaler.fit_transform(train_df.drop('SEX', axis=1)) # fill in
y_train = train_df.SEX # fill in

X_test = scaler.transform(test_df.drop('SEX', axis=1)) # fill in
y_test = test_df.SEX # fill in
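
As a quick check, the standardised training features should now have (approximately) zero mean and unit variance:

In [ ]:

# Means should be ~0 and standard deviations ~1 after standardisation
print("means:", X_train.mean(axis=0))
print("std devs:", X_train.std(axis=0))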

In [17]:

X_train

Out[17]:

array([[-0.47767243,  0.09678416],
       [ 0.52474234, -0.10126172],
       [ 0.72522529,  0.73053096],
       ...,
       [-0.27718947, -1.17070945],
       [ 2.12860597,  1.64154199],
       [-0.27718947, -0.22008924]])

Let’s plot the training data. Notice that it’s not linearly separable.

In [18]:

plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1], label="Female ($y=1$)", c='r')
plt.scatter(X_train[y_train==-1,0], X_train[y_train==-1,1], label="Male ($y=-1$)", c='b')
plt.xlabel("Heart weight")
plt.ylabel("Body weight")
plt.legend()
plt.show()
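
As a quick check of this claim, here’s a minimal sketch: a linear SVM with a very large C approximates a hard-margin SVM, so if the data were linearly separable it should achieve perfect training accuracy (the C value here is an arbitrary large constant).

In [ ]:

# A linear SVM with a very large C approximates a hard-margin SVM;
# training accuracy below 1 indicates the classes overlap
linear_svm = SVC(kernel='linear', C=1e6).fit(X_train, y_train)
print("Linear SVM training accuracy: {:.3g}".format(linear_svm.score(X_train, y_train)))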

2. Parameter grid search
Since the data is clearly not linearly separable, we’re going to fit a kernelised SVM.
To do this, we’ll use the sklearn.svm.SVC class, which is a wrapper for the popular LIBSVM library.
[Aside: LIBSVM solves the dual problem using a variant of the sequential minimal optimisation (SMO) algorithm.]
The corresponding primal problem is as follows:

$$
\begin{align}
\min_{\mathbf{w}, b, \xi} \phantom{=} & \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i = 1}^{n} \xi_i \\
\mathrm{subject~to} \phantom{=} & y_{i}(\mathbf{w}^T \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i \\
\phantom{=} & \xi_i \geq 0 \ \forall i
\end{align}
$$

Here $C$ is the penalty parameter, $\mathbf{w}$ are the weights, $b$ is the bias, and $\phi$ is a mapping to a higher dimensional space, related to the kernel through $K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$.
For now, we’ll use the radial basis function (RBF) kernel, which is parameterised in terms of $\gamma$ as follows:

$$
K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)
$$

Returning to our classification problem: it’s unclear how to set appropriate values for $C$ and $\gamma$ (named C and gamma in sklearn).
A simple way around this is to do an exhaustive cross validation grid search.
Below we define an evenly-spaced grid in log-space.

In [19]:

C_range = np.logspace(-2, 5, 8)
gamma_range = np.logspace(-6, 1, 16)

# Visualise the grid
xx, yy = np.meshgrid(C_range, gamma_range)
plt.plot(xx, yy, 'ko')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('$C$')
plt.ylabel(r'$\gamma$')
plt.show()

To do the grid search, we’ll use the built-in sklearn.model_selection.GridSearchCV class.
It evaluates the model for each combination of parameter values using cross validation, and selects the combination with the best score.

We’ll use StratifiedShuffleSplit for cross validation (it repeatedly draws random train/validation splits of the training data, while preserving the class ratio in each split).
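
To see what these splits look like, here’s a small sketch that draws a few splits and checks the class balance in each (the n_splits and random_state values here are arbitrary):

In [ ]:

# Each split is a random 90/10 partition that preserves the class ratio
demo_cv = StratifiedShuffleSplit(n_splits=3, test_size=0.1, random_state=0)
for train_idx, val_idx in demo_cv.split(X_train, y_train):
    frac_female = np.mean(y_train.iloc[train_idx] == 1)
    print("train size: {}, validation size: {}, fraction female: {:.2f}".format(
        len(train_idx), len(val_idx), frac_female))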

In [20]:

cv = StratifiedShuffleSplit(n_splits=30, test_size=0.1, random_state=1)
grid = GridSearchCV(SVC(kernel='rbf'), param_grid={'gamma': gamma_range, 'C': C_range}, cv=cv)
grid.fit(X_train, y_train)
print("The best parameters are {0.best_params_} with an accuracy of {0.best_score_:.3g}".format(grid))

The best parameters are {'C': 10.0, 'gamma': 0.04641588833612782} with an accuracy of 0.828

Question: Why aren’t we using k-fold cross validation?

Answer: Because our training data is small: it contains only 115 instances, so each fold of a k-fold split would hold very few instances. By drawing a large number of random train/validation splits instead, we hope to get a more precise estimate of the accuracy.

Below we visualise the cross validation accuracy over the grid of parameters.

In [21]:

scores = grid.cv_results_['mean_test_score'].reshape(C_range.size, gamma_range.size)

plt.figure(figsize=(8, 6))
plt.imshow(scores, cmap='viridis')
plt.colorbar(shrink=0.7)
plt.xticks(np.arange(len(gamma_range)), ["%.2e" % gamma for gamma in gamma_range], rotation=90)
plt.yticks(np.arange(len(C_range)), ["%1.e" % C for C in C_range])
plt.title('Cross validation accuracy')
plt.xlabel(r'$\gamma$')
plt.ylabel('$C$')
plt.show()

Question: Interpret this plot. Is there a clear winning combination of parameters?

Answer: There are a few parameter combinations that achieve high cross validation accuracy (the yellow squares).
With such a small training set, there’s likely to be significant variance (noise) in these estimates.
Hence, we can’t be confident that our chosen parameters are truly “the best”. If faced with a tie or near-tie, it makes sense to break it in a conservative direction, i.e., by using a smaller gamma (less non-linearity, as the RBF kernel decays more slowly with distance) and a smaller C (a weaker penalty on margin violations relative to weight shrinkage).
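
One way to gauge this variance is to inspect the cross validation results directly. The sketch below lists the top parameter combinations along with the standard deviation of their scores across the 30 splits (using the grid object fitted above):

In [ ]:

# Rank parameter combinations by mean CV accuracy; std_test_score shows
# how noisy each estimate is across the splits
results = pd.DataFrame(grid.cv_results_)
cols = ['param_C', 'param_gamma', 'mean_test_score', 'std_test_score']
print(results.sort_values('mean_test_score', ascending=False)[cols].head(10))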

Now that we’ve found the “best” parameters, let’s fit the SVM on the entire training set (without cross-validation).

(Note: we actually fit all parameter combinations, as they’re needed for a plot generated below.)

In [22]:

classifiers = {(C, gamma): SVC(C=C, gamma=gamma, kernel='rbf').fit(X_train, y_train)
               for C in C_range
               for gamma in gamma_range}

Below we evaluate the “best” classifier on the test set.

In [23]:

best_params = (grid.best_params_["C"], grid.best_params_["gamma"])
best_svm = classifiers[best_params]
best_train_acc = best_svm.score(X_train, y_train)
best_test_acc = best_svm.score(X_test, y_test)
print("The SVM with parameters C={0[0]:.3g}, gamma={0[1]:.3g} has training accuracy {1:.3g} and test accuracy {2:.3g}.".format(best_params, best_train_acc, best_test_acc))

The SVM with parameters C=10, gamma=0.0464 has training accuracy 0.817 and test accuracy 0.724.

Question: How does the test accuracy compare to the training accuracy?

Answer: The test accuracy is significantly lower than the training accuracy. This could be explained by:

Overfitting
Lack of data (high variance in estimated metrics)
Lack of informative features
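
To make the second point concrete, here’s a back-of-the-envelope sketch: treating the test errors as i.i.d. Bernoulli trials, the test accuracy estimated from only 29 test instances carries considerable uncertainty.

In [ ]:

# Normal-approximation standard error of the test accuracy:
# sqrt(p * (1 - p) / n) is roughly 0.08 for p ~ 0.72 and n = 29
n_test = len(y_test)
se = np.sqrt(best_test_acc * (1 - best_test_acc) / n_test)
print("test accuracy: {:.3f} +/- {:.3f} (one standard error)".format(best_test_acc, se))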

Below we visualise the decision functions for all parameter combinations (double-click the output to expand it to 100%).

In [24]:

fig, ax = plt.subplots(C_range.size, gamma_range.size, figsize=(50,20))
border = 0.2

# Build meshgrid over the feature space
X_min = np.amin(X_train, axis=0)
X_max = np.amax(X_train, axis=0)
xx, yy = np.meshgrid(np.linspace(X_min[0] - border, X_max[0] + border, 100),
                     np.linspace(X_min[1] - border, X_max[1] + border, 100))

# Plot training data + decision function for all parameter combinations
for (i, C) in enumerate(C_range):
    for (j, gamma) in enumerate(gamma_range):
        clf = classifiers[(C, gamma)]
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        ax[i,j].set_title("gamma={0.gamma:.3g}; C={0.C:.3g}".format(clf),
                          size='medium')

        ax[i,j].pcolormesh(xx, yy, -Z, cmap=plt.cm.RdBu)
        ax[i,j].scatter(X_train[y_train==1,0], X_train[y_train==1,1], c='r', edgecolors='k')
        ax[i,j].scatter(X_train[y_train==-1,0], X_train[y_train==-1,1], c='b', edgecolors='k')
        ax[i,j].set_xticks([])
        ax[i,j].set_yticks([])
        ax[i,j].axis('tight')
plt.show()

Question: Explain how gamma and C affect the decision surface qualitatively.

Answer:

Larger values of gamma imply less smoothing (a "spiky" decision boundary). This can result in overfitting.
Larger values of C imply less regularisation (large weights are not penalised as heavily). This can also result in overfitting.
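
To see the effect of gamma directly, the sketch below plots the RBF kernel value as a function of the distance between two points: larger gamma means the similarity decays faster, so each training point influences a smaller neighbourhood (the gamma values and distance range here are arbitrary choices for illustration).

In [ ]:

# Larger gamma => the kernel (similarity) drops off faster with distance
dists = np.linspace(0, 3, 100)
for gamma in [0.01, 0.1, 1, 10]:
    plt.plot(dists, np.exp(-gamma * dists**2), label=r'$\gamma = {}$'.format(gamma))
plt.xlabel(r'$\|\mathbf{x}_i - \mathbf{x}_j\|$')
plt.ylabel(r'$K(\mathbf{x}_i, \mathbf{x}_j)$')
plt.legend()
plt.show()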

Extension activity: Re-run this section using a different kernel (e.g. the built-in polynomial kernel or a custom kernel).
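
For example, a possible starting point (a sketch only; the C and degree values are arbitrary, and the custom kernel simply reimplements the linear kernel for illustration):

In [ ]:

# Built-in polynomial kernel
poly_svm = SVC(kernel='poly', degree=3, C=10).fit(X_train, y_train)
print("Polynomial SVM test accuracy: {:.3g}".format(poly_svm.score(X_test, y_test)))

# A custom kernel is any callable mapping (X, Y) to the Gram matrix K(X, Y)
def linear_kernel(X, Y):
    return X @ Y.T

custom_svm = SVC(kernel=linear_kernel, C=10).fit(X_train, y_train)
print("Custom-kernel SVM test accuracy: {:.3g}".format(custom_svm.score(X_test, y_test)))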