Support Vector Machines¶
You can use the sklearn API for this homework.
In [1]:
# used for manipulating directory paths
import os
# Scientific and vector computation for python
import numpy as np
# Import regular expressions to process emails
import re
# Plotting library
import matplotlib.pyplot as plt
# Optimization module in scipy
from scipy import optimize
# will be used to load MATLAB mat datafile format
from scipy.io import loadmat
# tells matplotlib to embed plots within the notebook
%matplotlib inline
import nltk, nltk.stem.porter
from sklearn import svm
1 Support Vector Machines (50 pts)¶
In the first half of this exercise, you will be using support vector machines (SVMs) with various example 2D datasets. Experimenting with these datasets will help you gain an intuition of how SVMs work and how to use a Gaussian kernel with SVMs. In the next half of the exercise, you will be using support vector machines to build a spam classifier.
1.1 Example Dataset 1 (20 pts)¶
We will begin with a 2D example dataset which can be separated by a linear boundary. In this dataset, the positions of the positive examples (indicated with x) and the negative examples (indicated with o) suggest a natural separation indicated by the gap. However, notice that there is an outlier positive example x on the far left at about (0.1, 4.1). As part of this exercise, you will also see how this outlier affects the SVM decision boundary.
In [2]:
# Load from ex6data1
# You will have X, y as keys in the dict data
data = loadmat(os.path.join('Data', 'ex6data1.mat'))
X = data['X']
y = data['y'].flatten()
m = y.size
1.1.1 Plot Data (10 pts)¶
Define a function below to plot the data.
In [3]:
def plot_data(X, y):
    plt.figure()
    # ===================== Your Code Here =====================
    # Instructions : Plot the positive and negative examples on a
    #                2D plot, using marker="+" for the positive
    #                examples and marker="o" for the negative examples
    #
    # ===========================================================
In [ ]:
plot_data(X, y)
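For reference, one possible way to fill in plot_data (a minimal sketch; any approach that distinguishes the two classes with the two markers works, and the colors here are arbitrary choices):
In [ ]:
def plot_data(X, y):
    plt.figure()
    pos = y == 1
    neg = y == 0
    # Positive examples as "+", negative examples as "o"
    plt.plot(X[pos, 0], X[pos, 1], 'k+', markersize=7)
    plt.plot(X[neg, 0], X[neg, 1], 'yo', markersize=7, markeredgecolor='k')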
In this part of the exercise, you will try using different values of the $C$ parameter with SVMs. Informally, the $C$ parameter is a positive value that controls the penalty for misclassified training examples. A large $C$ parameter tells the SVM to try to classify all the examples correctly. $C$ plays a role similar to $1/\lambda$, where $\lambda$ is the regularization parameter that we were using previously for logistic regression.
When $C=1$, you should find that the SVM puts the decision boundary in the gap between the two datasets and misclassifies the data point on the far left.
Your task is to try different values of $C$ on this dataset. Specifically, you should change the value of $C$ in the next cell to $C = 100$ and run the SVM training again. When $C = 100$, you should find that the SVM now classifies every single example correctly, but has a decision boundary that does not appear to be a natural fit for the data.
In [5]:
# You should try to change the C value below and see how the decision
# boundary varies (e.g., try C = 1000)
C = 1
clf = svm.SVC(C=C, kernel='linear', tol=1e-3)
clf.fit(X, y)
Out[5]:
SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
1.1.2 Plot the decision boundary (10 pts)¶
Define a function below to plot the decision boundary.
In [6]:
def visualize_boundary(clf, X, x_min, x_max, y_min, y_max):
    # ===================== Your Code Here =====================

    # ===========================================================
In [ ]:
plot_data(X, y)
visualize_boundary(clf, X, 0, 4.5, 1.5, 5)
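One possible sketch of visualize_boundary: evaluate the trained classifier on a dense grid covering the plot area and draw the class boundary as a contour (the grid resolution of 500 is an arbitrary choice):
In [ ]:
def visualize_boundary(clf, X, x_min, x_max, y_min, y_max):
    # Evaluate the classifier over a grid covering the plot area
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # The boundary is where the prediction switches between 0 and 1
    plt.contour(xx, yy, Z, levels=[0.5], colors='b')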
1.2 SVM with Gaussian Kernels (30 pts)¶
In this part of the exercise, you will be using SVMs to do non-linear classification. In particular, you will be using SVMs with Gaussian kernels on datasets that are not linearly separable.
1.2.1 Gaussian Kernel (10 pts)¶
To find non-linear decision boundaries with the SVM, we need to first implement a Gaussian kernel. You can think of the Gaussian kernel as a similarity function that measures the “distance” between a pair of examples, ($x^{(i)}$, $x^{(j)}$). The Gaussian kernel is also parameterized by a bandwidth parameter, $\sigma$, which determines how fast the similarity metric decreases (to 0) as the examples are further apart. You should now complete the code in gaussianKernel to compute the Gaussian kernel between two examples, ($x^{(i)}$, $x^{(j)}$). The Gaussian kernel function is defined as:
$$ K_{\text{gaussian}} \left( x^{(i)}, x^{(j)} \right) = \exp \left( - \frac{\left\lVert x^{(i)} - x^{(j)} \right\rVert^2}{2\sigma^2} \right) = \exp \left( -\frac{\sum_{k=1}^n \left( x_k^{(i)} - x_k^{(j)} \right)^2}{2\sigma^2} \right)$$
In [8]:
def gaussianKernel(x1, x2, sigma):
    """
    Computes the radial basis function
    Returns a radial basis function kernel between x1 and x2.

    Parameters
    ----------
    x1 : numpy ndarray
        A vector of size (n, ), representing the first datapoint.
    x2 : numpy ndarray
        A vector of size (n, ), representing the second datapoint.
    sigma : float
        The bandwidth parameter for the Gaussian kernel.

    Returns
    -------
    sim : float
        The computed RBF between the two provided data points.

    Instructions
    ------------
    Fill in this function to return the similarity between `x1` and `x2`
    computed using a Gaussian kernel with bandwidth `sigma`.
    """
    sim = 0

    # ====================== YOUR CODE HERE ======================

    # =============================================================
    return sim
Once you have completed the function gaussianKernel, the following cell will test your kernel function on two provided examples.
In [ ]:
x1 = np.array([1, 2, 1])
x2 = np.array([0, 4, -1])
sigma = 2
sim = gaussianKernel(x1, x2, sigma)
print('Gaussian Kernel between x1 = [1, 2, 1], x2 = [0, 4, -1], sigma = %0.2f:'
      '\n\t%f' % (sigma, sim))
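For reference, a direct translation of the formula above (with these test inputs the kernel evaluates to exp(-9/8) ≈ 0.324652, which you can use to check your implementation):
In [ ]:
def gaussianKernel(x1, x2, sigma):
    # exp(-||x1 - x2||^2 / (2 * sigma^2))
    diff = x1 - x2
    return np.exp(-diff.dot(diff) / (2 * sigma**2))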
1.2.2 Example Dataset 2 (5 pts)¶
The next part in this notebook will load and plot dataset 2.
In [ ]:
# Load from ex6data2
# You will have X, y as keys in the dict data
data = loadmat(os.path.join('Data', 'ex6data2.mat'))
X = data['X']
y = data['y'].flatten()
m = y.size
# Plot training data
plot_data(X, y)
From the figure, you can observe that there is no linear decision boundary that separates the positive and negative examples for this dataset. However, by using the Gaussian kernel with the SVM, you will be able to learn a non-linear decision boundary that performs reasonably well on this dataset. If you have correctly implemented the Gaussian kernel function, the following cell will proceed to train the SVM with the Gaussian kernel on this dataset.
You should get a decision boundary like the one in the plot produced by the next cell. The SVM with a Gaussian kernel is able to separate most of the positive and negative examples correctly and follows the contours of the dataset well.
In [ ]:
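# A sketch, not the only approach: sklearn's SVC does not take sigma
# directly, but its built-in RBF kernel with gamma = 1 / (2 * sigma**2)
# matches the Gaussian kernel defined above. C = 1 and sigma = 0.1 are
# assumed values that work reasonably for this dataset, and the axis
# limits below are chosen roughly to cover the data.
C = 1
sigma = 0.1

clf = svm.SVC(C=C, kernel='rbf', gamma=1 / (2 * sigma**2))
clf.fit(X, y)

plot_data(X, y)
visualize_boundary(clf, X, 0, 1, 0.4, 1)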
1.2.3 Example Dataset 3 (15 pts)¶
In this part of the exercise, you will gain more practical skills on how to use an SVM with a Gaussian kernel. The next cell will load and display a third dataset.
You will be using the SVM with the Gaussian kernel with this dataset. In the provided dataset, ex6data3.mat, you are given the variables X, y, Xval, yval.
In [14]:
# Load from ex6data3
# You will have X, y, Xval, yval as keys in the dict data
data = loadmat(os.path.join('Data', 'ex6data3.mat'))
X = data['X']
y = data['y'].flatten()
m = y.size
Your task is to use the cross validation set Xval, yval to determine the best $C$ and $\sigma$ parameter to use. For both $C$ and $\sigma$, we suggest trying values in multiplicative steps (e.g., 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30). Note that you should try all possible pairs of values for $C$ and $\sigma$ (e.g., $C = 0.3$ and $\sigma = 0.1$). For example, if you try each of the 8 values listed above for $C$ and for $\sigma^2$, you would end up training and evaluating (on the cross validation set) a total of $8^2 = 64$ different models.
In [ ]:
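# A sketch of the grid search over all (C, sigma) pairs, scored on the
# cross validation set. Assumes Xval and yval are stored under those keys
# in ex6data3.mat, as noted in the comment above.
Xval = data['Xval']
yval = data['yval'].flatten()

values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
best_score, best_C, best_sigma = -1, None, None

for C in values:
    for sigma in values:
        clf = svm.SVC(C=C, kernel='rbf', gamma=1 / (2 * sigma**2))
        clf.fit(X, y)
        score = clf.score(Xval, yval)  # accuracy on the cross validation set
        if score > best_score:
            best_score, best_C, best_sigma = score, C, sigma

print('Best C = %g, best sigma = %g (validation accuracy = %g)'
      % (best_C, best_sigma, best_score))

# Retrain with the best parameters and visualize the boundary
# (axis limits chosen roughly to cover the data)
clf = svm.SVC(C=best_C, kernel='rbf', gamma=1 / (2 * best_sigma**2))
clf.fit(X, y)
plot_data(X, y)
visualize_boundary(clf, X, -0.6, 0.3, -0.7, 0.6)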
2 Spam Classification (50 pts)¶
Many email services today provide spam filters that are able to classify emails into spam and non-spam email with high accuracy. In this part of the exercise, you will use SVMs to build your own spam filter.
You will be training a classifier to classify whether a given email, $x$, is spam ($y = 1$) or non-spam ($y = 0$). In particular, you need to convert each email into a feature vector $x \in \mathbb{R}^n$ . The following parts of the exercise will walk you through how such a feature vector can be constructed from an email.
The dataset included for this exercise is based on a subset of the SpamAssassin Public Corpus. For the purpose of this exercise, you will only be using the body of the email (excluding the email headers).
2.1 Preprocessing Emails (20 pts)¶
Before starting on a machine learning task, it is usually insightful to take a look at examples from the dataset. The figure below shows a sample email that contains a URL, an email address (at the end), numbers, and dollar amounts.
[Figure: sample email containing a URL, an email address, numbers, and dollar amounts]
While many emails would contain similar types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the specific URL or specific dollar amount) will be different in almost every email. Therefore, one method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we could replace each URL in the email with the unique string “httpaddr” to indicate that a URL was present.
This has the effect of letting the spam classifier make a classification decision based on whether any URL was present, rather than whether a specific URL was present. This typically improves the performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of seeing any particular URL again in a new piece of spam are very small.
In the function processEmail below, we have implemented the following email preprocessing and normalization steps:
• Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
• Stripping HTML: All HTML tags are removed from the emails. Many emails often come with HTML formatting; we remove all the HTML tags, so that only the content remains.
• Normalizing URLs: All URLs are replaced with the text “httpaddr”.
• Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.
• Normalizing Numbers: All numbers are replaced with the text “number”.
• Normalizing Dollars: All dollar signs ($) are replaced with the text “dollar”.
• Word Stemming: Words are reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes, the Stemmer actually strips off additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.
• Removal of non-words: Non-words and punctuation have been removed. All white space (tabs, newlines, spaces) has been trimmed to a single space character.
The result of these preprocessing steps is shown in the figure below.
[Figure: the sample email after preprocessing]
While preprocessing has left word fragments and non-words, this form turns out to be much easier to work with for performing feature extraction.
2.1.1 Vocabulary List (20 pts)¶
After preprocessing the emails, we have a list of words for each email. The next step is to choose which words we would like to use in our classifier and which we would want to leave out.
For this exercise, we have chosen only the most frequently occurring words as our set of words considered (the vocabulary list). Since words that occur rarely in the training set are only in a few emails, they might cause the model to overfit our training set. The complete vocabulary list is in the file vocab.txt (inside the Data directory for this exercise) and also shown in the figure below.
[Figure: a portion of the vocabulary list in vocab.txt]
Our vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used. Given the vocabulary list, we can now map each word in the preprocessed emails into a list of word indices that contains the index of the word in the vocabulary dictionary. The figure below shows the mapping for the sample email. Specifically, in the sample email, the word "anyone" was first normalized to "anyon" and then mapped onto the index 86 in the vocabulary list.
[Figure: word indices for the sample email]
Your task now is to complete the code in the function processEmail to perform this mapping. In the code, you are given a string word which is a single word from the processed email. You should look up the word in the vocabulary list vocabList. If the word exists in the list, you should add the index of the word into the word_indices variable. If the word does not exist, and is therefore not in the vocabulary, you can skip the word.
In [18]:
def get_vocab_list():
    vocab_dict = {}
    with open(os.path.join('Data', 'vocab.txt')) as f:
        for line in f:
            (val, key) = line.split()
            vocab_dict[int(val)] = key

    return vocab_dict
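Since get_vocab_list returns a dict mapping index to word, it can help to build the reverse mapping once and use it for the lookups in processEmail below. This is a sketch, not part of the starter code:
In [ ]:
# Reverse mapping: word -> index, for O(1) lookups
word_to_index = {word: idx for idx, word in get_vocab_list().items()}

# Inside processEmail you could then write, for each stemmed word:
# if word in word_to_index:
#     word_indices.append(word_to_index[word])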
In [19]:
def processEmail(email_contents, verbose=True):
    """
    Preprocesses the body of an email and returns a list of indices
    of the words contained in the email.

    Parameters
    ----------
    email_contents : str
        A string containing one email.
    verbose : bool
        If True, print the resulting email after processing.

    Returns
    -------
    word_indices : list
        A list of integers containing the index of each word in the
        email which is also present in the vocabulary.

    Instructions
    ------------
    Fill in this function to add the index of word to word_indices
    if it is in the vocabulary. At this point of the code, you have
    a stemmed word from the email in the variable word.
    You should look up word in the vocabulary list (vocab_list).
    If a match exists, you should add the index of the word to the
    word_indices list. Concretely, if word = 'action', then you should
    look up the vocabulary list to find where in vocab_list 'action'
    appears. For example, if vocab_list[18] = 'action', then you should
    add 18 to the word_indices list (e.g., word_indices.append(18)).

    Notes
    -----
    - vocab_list[idx] returns the word with index idx in the vocabulary.
    - Since get_vocab_list returns a dict mapping index -> word, you can
      find the index of a word by searching the dict values, or by first
      building a reverse mapping from word to index.
    """
    # Load Vocabulary
    vocab_list = get_vocab_list()

    # Init return value
    word_indices = []

    # ========================== Preprocess Email ===========================

    # Find the Headers ( \n\n and remove )
    # Uncomment the following lines if you are working with raw emails with the
    # full headers
    # hdrstart = email_contents.find(chr(10) + chr(10))
    # email_contents = email_contents[hdrstart:]

    # Lower case
    email_contents = email_contents.lower()

    # Strip all HTML: look for any expression that starts with < and ends
    # with > (and does not contain < or > inside) and replace it with a space
    email_contents = re.compile('<[^<>]+>').sub(' ', email_contents)

    # Handle Numbers: look for one or more characters between 0-9
    email_contents = re.compile('[0-9]+').sub(' number ', email_contents)

    # Handle URLs: look for strings starting with http:// or https://
    email_contents = re.compile(r'(http|https)://[^\s]*').sub(' httpaddr ', email_contents)

    # Handle Email Addresses: look for strings with @ in the middle
    email_contents = re.compile(r'[^\s]+@[^\s]+').sub(' emailaddr ', email_contents)

    # Handle $ sign
    email_contents = re.compile('[$]+').sub(' dollar ', email_contents)

    # Get rid of any punctuation
    email_contents = re.split(r'[ @$/#.-:&*+=\[\]?!(){},\'">_<;%\n\r]', email_contents)

    # Remove any empty word strings
    email_contents = [word for word in email_contents if len(word) > 0]

    # Stem the email contents word by word
    stemmer = nltk.stem.porter.PorterStemmer()
    processed_email = []
    for word in email_contents:
        # Remove any remaining non-alphanumeric characters in word
        word = re.compile('[^a-zA-Z0-9]').sub('', word).strip()
        word = stemmer.stem(word)
        processed_email.append(word)
        if len(word) < 1:
            continue

        # Look up the word in the dictionary and add to word_indices if found
        # ====================== YOUR CODE HERE ======================

        # =============================================================

    if verbose:
        print('----------------')
        print('Processed email:')
        print('----------------')
        print(' '.join(processed_email))
    return word_indices
Once you have implemented processEmail, the following cell will run your code on the email sample and you should see an output of the processed email and the indices list mapping.
In [ ]:
# To use an SVM to classify emails into spam vs. non-spam, you first need
# to convert each email into a vector of features. In this part, you will
# implement the preprocessing steps for each email. You should
# complete the code in processEmail to produce a word indices vector
# for a given email.

# Extract Features
with open(os.path.join('Data', 'emailSample1.txt')) as fid:
    file_contents = fid.read()

word_indices = processEmail(file_contents)

# Print Stats
print('-------------')
print('Word Indices:')
print('-------------')
print(word_indices)
2.2 Extracting Features from Emails (10 pts)¶
You will now implement the feature extraction that converts each email into a vector in $\mathbb{R}^n$. For this exercise, you will be using n = # words in the vocabulary list. Specifically, the feature $x_i \in \{0, 1\}$ for an email corresponds to whether the $i^{th}$ word in the dictionary occurs in the email. That is, $x_i = 1$ if the $i^{th}$ word is in the email and $x_i = 0$ if the $i^{th}$ word is not present in the email.
Thus, for a typical email, this feature would look like:
$$ x = \begin{bmatrix} 0 & \dots & 1 & 0 & \dots & 1 & 0 & \dots & 0 \end{bmatrix}^T \in \mathbb{R}^n $$
You should now complete the code in the function emailFeatures to generate a feature vector for an email, given the word_indices.
In [21]:
def emailFeatures(word_indices):
    """
    Takes in a word_indices vector and produces a feature vector from the word indices.

    Parameters
    ----------
    word_indices : list
        A list of word indices from the vocabulary list.

    Returns
    -------
    x : numpy ndarray
        The computed feature vector.

    Instructions
    ------------
    Fill in this function to return a feature vector for the given email
    (word_indices). To help make it easier to process the emails, we have
    already pre-processed each email and converted each word in the email
    into an index in a fixed dictionary (of 1899 words). The variable
    `word_indices` contains the list of indices of the words which occur
    in one email.

    Concretely, if an email has the text:

        The quick brown fox jumped over the lazy dog.

    Then, the word_indices vector for this text might look like:

        60 100 33 44 10 53 60 58 5

    where, we have mapped each word onto a number, for example:

        the   -- 60
        quick -- 100
        ...

    Note
    ----
    The above numbers are just an example and are not the actual mappings.

    Your task is to take one such `word_indices` vector and construct a
    binary feature vector that indicates whether a particular word occurs
    in the email. That is, x[i] = 1 when word i is present in the email.
    Concretely, if the word 'the' (say, index 60) appears in the email,
    then x[60] = 1. The feature vector should look like:

        x = [0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ...]
    """
    # Total number of words in the dictionary
    n = 1899

    # You need to return the following variables correctly.
    x = np.zeros(n)

    # ===================== YOUR CODE HERE ======================

    # ===========================================================
    return x
Once you have implemented emailFeatures, the next cell will run your code on the email sample.
In [ ]:
# Extract Features
with open(os.path.join('Data', 'emailSample1.txt')) as fid:
    file_contents = fid.read()

word_indices = processEmail(file_contents)
features = emailFeatures(word_indices)

# Print Stats
print('\nLength of feature vector: %d' % len(features))
print('Number of non-zero entries: %d' % sum(features > 0))
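The body of emailFeatures reduces to setting the entries indexed by word_indices to 1. A minimal sketch; whether the indices are treated as 0-based or 1-based must match what your processEmail produced (vocab.txt numbers its words starting from 1):
In [ ]:
def emailFeatures(word_indices):
    n = 1899
    x = np.zeros(n)
    for idx in word_indices:
        # Shift by one if your word indices are 1-based, as in vocab.txt
        x[idx - 1] = 1
    return x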
2.3 Training SVM for Spam Classification (20 pts)¶
In the following section we will load a preprocessed training dataset that will be used to train an SVM classifier. The file spamTrain.mat (within the Data folder for this exercise) contains 4000 training examples of spam and non-spam email, while spamTest.mat contains 1000 test examples. Each original email was processed using the processEmail and emailFeatures functions and converted into a vector $x^{(i)} \in \mathbb{R}^{1899}$.
After loading the dataset, the next cell proceeds to train a linear SVM to classify between spam ($y = 1$) and non-spam ($y = 0$) emails.
In [ ]:
# Load the Spam Email dataset
# You will have X, y in your environment
data = loadmat(os.path.join('Data', 'spamTrain.mat'))
X = data['X']
y = data['y'].flatten()
In [ ]:
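# A sketch of the training step (C = 0.1 is an assumed value, not
# prescribed by the exercise). spamTest.mat is assumed to store its
# examples under the keys Xtest and ytest.
C = 0.1
clf = svm.SVC(C=C, kernel='linear')
clf.fit(X, y)
print('Training accuracy: %.2f%%' % (clf.score(X, y) * 100))

# Evaluate on the held-out test set
data_test = loadmat(os.path.join('Data', 'spamTest.mat'))
Xtest = data_test['Xtest']
ytest = data_test['ytest'].flatten()
print('Test accuracy: %.2f%%' % (clf.score(Xtest, ytest) * 100))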
2.4 Optional (ungraded) exercise: Try your own emails¶
Now that you have trained a spam classifier, you can start trying it out on your own emails. In the starter code, we have included two email examples (emailSample1.txt and emailSample2.txt) and two spam examples (spamSample1.txt and spamSample2.txt). You should now try the examples we have provided and see if the classifier gets them right. You can also try your own emails by replacing the examples (plain text files) with your own emails.
In [ ]:
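# A sketch of classifying one of the provided samples end to end.
# predict expects a 2D array, hence the reshape.
with open(os.path.join('Data', 'spamSample1.txt')) as fid:
    file_contents = fid.read()

word_indices = processEmail(file_contents, verbose=False)
features = emailFeatures(word_indices)
pred = clf.predict(features.reshape(1, -1))
print('Spam classification: %d (1 = spam, 0 = not spam)' % pred[0])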