Classifying with k-Nearest Neighbors
This chapter covers
■ The k-Nearest Neighbors classification algorithm
■ Parsing and importing data from a text file
■ Creating scatter plots with Matplotlib
■ Normalizing numeric values

Have you ever seen movies categorized into genres? What defines these genres, and who says which movie goes into what genre? The movies in one genre are similar, but based on what? I'm sure if you asked the people involved with making the movies, they wouldn't say that their movie is just like someone else's movie, but in some way you know they're similar. What makes an action movie similar to another action movie and dissimilar to a romance movie? Do people kiss in action movies, and do people kick in romance movies? Yes, but there's probably more kissing in romance movies and more kicking in action movies. Perhaps if you measured kisses, kicks, and other things per movie, you could automatically figure out what genre a movie belongs to. I'll use movies to explain some of the concepts of k-Nearest Neighbors; then we will move on to other applications.
In this chapter, we’ll discuss our first machine-learning algorithm: k-Nearest
Neighbors. k-Nearest Neighbors is easy to grasp and very effective. We’ll first discuss
the theory and how you can use the concept of a distance measurement to classify
items. Next, you’ll see how to easily import and parse data from text files using
Python. We’ll address some common pitfalls when working with distance calculations
and data coming from numerous sources. We’ll put all of this into action in examples
for improving results from a dating website and recognizing handwritten digits.
2.1 Classifying with distance measurements
The first machine-learning algorithm we’ll look at is k-Nearest Neighbors (kNN). It
works like this: we have an existing set of example data, our training set. We have
labels for all of this data—we know what class each piece of the data should fall into.
When we're given a new piece of data without a label, we compare that new piece of data to every piece of existing data. We then take the most similar
pieces of data (the nearest neighbors) and look at their labels. We look at the top k
most similar pieces of data from our known dataset; this is where the k comes from. (k
is an integer and it’s usually less than 20.) Lastly, we take a majority vote from the k
most similar pieces of data, and the majority is the new class we assign to the data we
were asked to classify.
k-Nearest Neighbors
Pros: High accuracy, insensitive to outliers, no assumptions about data
Cons: Computationally expensive, requires a lot of memory
Works with: Numeric values, nominal values

Let's run through a quick example classifying movies into romance or action movies. Someone watched a lot of movies and counted the number of kicks and kisses in each movie. I've plotted six movies by the number of kisses and kicks in each movie in figure 2.1. Now, you find a movie you haven't seen yet and want to know if it's a romance movie or an action movie. To determine this, we'll use the kNN algorithm.

Figure 2.1 Classifying movies by plotting the number of kicks and kisses in each movie
We find the movie in question and see how many kicks and kisses it has. It’s plotted as
a large question mark along with a few other movies in figure 2.1. These values are
listed in table 2.1.
We don't know what type of movie the question mark movie is, but we have a way of figuring that out. First, we calculate the distance from the unknown movie to all the other movies. I've calculated the distances and shown them in table 2.2. (Don't worry about how I did these calculations right now. We'll get into that in a few minutes.)
Now that we have all the distances to our unknown movie, we find the k nearest movies by sorting the distances in increasing order. Let's assume k=3. Then the three closest movies are He's Not Really into Dudes, Beautiful Woman, and California Man. The kNN algorithm says to take the majority vote from these three movies to determine the class of the mystery movie. Because all three movies are romances, we forecast that the mystery movie is a romance movie.
We’ll work through a real machine learning algorithm in this chapter, and along
the way I'll introduce the Python tools and machine learning terminology. First, however, we'll go over a simple example of the kNN algorithm to make sure we're using
the algorithm correctly.
Table 2.1 Movies with the number of kicks and number of kisses shown for each movie, along with our assessment of the movie type

Movie title                  # of kicks   # of kisses   Type of movie
California Man               3            104           Romance
He's Not Really into Dudes   2            100           Romance
Beautiful Woman              1            81            Romance
Kevin Longblade              101          10            Action
Robo Slayer 3000             99           5             Action
Amped II                     98           2             Action
?                            18           90            Unknown

Table 2.2 Distances between each movie and the unknown movie

Movie title                  Distance to movie "?"
California Man               20.5
He's Not Really into Dudes   18.7
Beautiful Woman              19.2
Kevin Longblade              115.3
Robo Slayer 3000             117.4
Amped II                     118.9
General approach to kNN
1. Collect: Any method.
2. Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
3. Analyze: Any method.
4. Train: Does not apply to the kNN algorithm.
5. Test: Calculate the error rate.
6. Use: This application needs to get some input data and output structured numeric values. Next, the application runs the kNN algorithm on this input data and determines which class the input data should belong to. The application then takes some action on the calculated class.

2.1.1 Prepare: importing data with Python
First, we’ll create a Python module called kNN.py, where we’ll place all the code used
in this chapter. You can create your own file and enter code as we progress, or you can
copy the file kNN.py from the book’s source code. The best way to learn is to start with
a blank module and enter code as it’s used.
First, let’s create kNN.py or copy it from the source code repository. We’ll create a
few support functions before we create the full kNN algorithm. Add the following
lines to kNN.py:
from numpy import *
import operator

def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels
In this code, we import two modules. The first one is NumPy, which is our scientific
computing package. The second module is the operator module, which is used later
in the kNN algorithm for sorting; we’ll get to that shortly.
The function createDataSet() is there for your convenience. It creates the dataset and labels, as shown in figure 2.2. Let's try this out: save kNN.py, change to the directory where you've stored kNN.py, and launch a Python interactive session. To get started, open a terminal on Linux/Mac OS or a command prompt on Windows. On Linux or a Mac, type python at the command line; on Windows, refer to the Python program directly, such as c:\Python26\python.exe, unless you have it aliased.
Once you've started Python, to load your module you need to type
>>> import kNN
This will load the kNN module. To make sure that we’re looking at the same dataset, I
created a function called createDataSet. Type the following at the Python command
prompt:
>>> group,labels = kNN.createDataSet()
This creates two variables called group and labels. To inspect each variable, type its
name at the Python command prompt:
>>> group
array([[ 1. , 1.1],
[ 1. , 1. ],
[ 0. , 0. ],
[ 0. , 0.1]])
>>> labels
[‘A’, ‘A’, ‘B’, ‘B’]
Here we have four pieces of data. Each piece of data has two attributes or features, things
we know about it. In the group matrix each row is a different piece of data. Think of it
as a different measurement or entry in some sort of log. As humans, we can visualize
things in one, two, or sometimes three dimensions, but that’s about the limit of our
brains; to keep things easy to visualize, we’ll use only two features for each data point.
The labels vector carries the labels we've given to each of the data points. There
should be as many items in this vector as there are rows in the group matrix. We
assigned the data point (1,1.1) to the class A, and similarly we assigned the data point
(0,0.1) to the class B. The values in this example are arbitrarily chosen for the purpose
of illustration, and the axes are unlabeled. The four data points with class labels are
plotted in figure 2.2.
Now that you have an idea of how to parse and load data into Python, and you have
an idea of how the kNN algorithm works, let’s put it all together and do some classification.
Figure 2.2 The four data points of our very simple kNN example
2.1.2 Putting the kNN classification algorithm into action
In this section we’ll build a function, shown in listing 2.1, to run the kNN algorithm on
one piece of data. I’ll first show the function in pseudocode and then in actual
Python, followed by a detailed explanation of what everything in the code does.
Remember, the goal of this function is to use the kNN algorithm to classify one piece
of data called inX. Pseudocode for this function would look like this:
For every point in our dataset:
    calculate the distance between inX and the current point
sort the distances in increasing order
take k items with lowest distances to inX
find the majority class among these items
return the majority class as our prediction for the class of inX
The Python code for the classify0() function is in the following listing.
Listing 2.1 k-Nearest Neighbors algorithm
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet      #B Distance calculation
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                  #C Voting with lowest k distances
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.iteritems(),   #D Sort dictionary
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
The function classify0() takes four inputs: the input vector to classify called inX,
our full matrix of training examples called dataSet, a vector of labels called labels,
and, finally, k, the number of nearest neighbors to use in the voting. The labels vector
should have as many elements in it as there are rows in the dataSet matrix. You calculate the distances (B) using the Euclidean distance, where the distance between two vectors xA and xB, each with two elements, is given by

d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2)

For example, the distance between points (0,0) and (1,2) is calculated by

sqrt((1 - 0)^2 + (2 - 0)^2)

If we are working with four features, the distance between points (1,0,0,1) and (7,6,9,4) would be calculated by

sqrt((7 - 1)^2 + (6 - 0)^2 + (9 - 0)^2 + (4 - 1)^2)
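As a quick check, here's a small NumPy sketch (not part of kNN.py) that uses this formula to recompute one of the distances from table 2.2, using the kick and kiss counts from table 2.1:

from numpy import array, sqrt

unknown = array([18, 90])                      # kicks, kisses for the "?" movie
california_man = array([3, 104])
d = sqrt(((unknown - california_man)**2).sum())
print d                                        # roughly 20.5, matching table 2.2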
Following the distance calculation, the distances are sorted from least to greatest (this is argsort()'s default). Next (C), the lowest k distances are used to vote on the class of inX. The input k should always be a positive integer. Lastly (D), you take the classCount dictionary and decompose it into a list of tuples, and then sort the tuples by the second item in each tuple using the itemgetter method from the operator module imported in the second line of the program. This sort is done in reverse so you have largest to smallest. Finally, you return the label of the item occurring most frequently.
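Here's that sort step in isolation (a minimal Python 2 shell example; the counts are made up for illustration):

>>> import operator
>>> classCount = {'A': 1, 'B': 2}
>>> sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
[('B', 2), ('A', 1)]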
To predict the class, type the following text at the Python prompt:
>>> kNN.classify0([0,0], group, labels, 3)
The result should be B. Try to change the [0,0] entry to see how the answer changes.
Congratulations, you just made your first classifier! You can do a lot with this sim-
ple classifier. Things will only get easier from here on out.
2.1.3 How to test a classifier
We built the kNN algorithm and saw that it was giving us answers we would expect. You
may be asking yourself, “At what point does this break?” or “Is it always right?” No, it’s
not always right. There are different ways of exploring how often a classifier is right.
Also, there are different things that impact the performance of a classifier, such as settings of the classifier and the dataset. Different algorithms perform differently on different datasets. That's why we have six chapters on classification.
To test out a classifier, you start with some known data so you can hide the answer
from the classifier and ask the classifier for its best guess. You can add up the number
of times the classifier was wrong and divide it by the total number of tests you gave it.
This will give you the error rate, which is a common measure to gauge how well a classifier is doing on a dataset. An error rate of 0 means you have a perfect classifier, and
an error rate of 1.0 means the classifier is always wrong. You’ll see this in action with
some solid data later.
The example in this section worked, but it wasn’t useful. We’re going to put kNN to
use in real-world examples in the next two sections. First, we’ll look at improving the
results from a dating site with kNN, and then we’ll look at an impressive handwriting
recognition example. We’ll employ testing in the handwriting recognition example to
see if this algorithm is working.
2.2 Example: improving matches from a dating site with kNN
My friend Hellen has been using some online dating sites to find different people to
go out with. She realized that despite the site’s recommendations, she didn’t like
everyone she was matched with. After some introspection, she realized there were
three types of people she went out with:
■ People she didn’t like
■ People she liked in small doses
■ People she liked in large doses
After discovering this, Hellen couldn’t figure out what made a person fit into any of
these categories. They all were recommended to her by the dating site. The people
whom she liked in small doses were good to see Monday through Friday, but on the
weekend she’d rather spend time with the people she liked in large doses. Hellen has
asked us to help her filter future matches to categorize them. In addition, Hellen has
collected some data that isn’t recorded by the dating site, but she feels it’s useful in
selecting people to go out with.
Example: using kNN on results from a dating site
1. Collect: Text file provided.
2. Prepare: Parse a text file in Python.
3. Analyze: Use Matplotlib to make 2D plots of our data.
4. Train: Doesn't apply to the kNN algorithm.
5. Test: Write a function to use some portion of the data Hellen gave us as test examples. The test examples are classified against the non-test examples. If the predicted class doesn't match the real class, we'll count that as an error.
6. Use: Build a simple command-line program Hellen can use to predict whether she'll like someone based on a few inputs.

2.2.1 Prepare: parsing data from a text file
The data Hellen collected is in a text file called datingTestSet.txt. Hellen has been collecting this data for a while and has 1,000 entries. A new sample is on each line, and Hellen has recorded the following features:
■ Number of frequent flyer miles earned per year
■ Percentage of time spent playing video games
■ Liters of ice cream consumed per week
Before we can use this data in our classifier, we need to change it to the format that our classifier accepts. In order to do this, we'll add a new function to kNN.py called file2matrix. This function takes a filename string and outputs two things: a matrix of training examples and a vector of class labels.
Add the following code to your kNN.py.

Listing 2.2 Text record to NumPy parsing code
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())         #B Get number of lines in file
    returnMat = zeros((numberOfLines,3))        #C Create NumPy matrix to return
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')         #D Parse line to a list
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector
This code is a great place to demonstrate how easy it is to process text with Python. Initially, you'd like to know how many lines are in the file (B): it reads in the file and counts the number of lines. Next (C), you create a NumPy matrix (actually, it's a 2D array, but don't worry about that now) to populate and return. I've hard-coded its size to be numberOfLines x 3, but you could add some code to make this adaptable to the various inputs. Finally (D), you loop over all the lines in the file and strip off the return line character with line.strip(). Next, you split the line into a list of elements delimited by the tab character, '\t'. You take the first three elements and shove them into a row of your matrix, and you use the Python feature of negative indexing to get the last item from the list to put into classLabelVector. You have to explicitly tell the interpreter that you'd like the integer version of the last item, or it will give you the string version. You'd normally need the same kind of conversion for the three feature values as well, but NumPy takes care of that detail for you when you assign the strings into a row of the float matrix.
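That last conversion detail is worth a quick look in the shell. A tiny sketch (it assumes the string-to-float casting on assignment that the book's code relies on; the numbers match the first row of datingDataMat shown below):

>>> from numpy import zeros
>>> row = zeros((1,3))
>>> row[0,:] = ['72917', '7.106273', '0.2236']   # strings are cast to floats on assignment
>>> row
array([[  7.29170000e+04,   7.10627300e+00,   2.23600000e-01]])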
To use this, type the following at the Python prompt:
>>> reload(kNN)
>>> datingDataMat,datingLabels = kNN.file2matrix('datingTestSet.txt')
Make sure that the file datingTestSet.txt is in the same directory you’re working from.
Note that before executing the function, I reloaded the kNN.py module. When you
change a module, you need to reload that module or you’ll still be using the old version.
After successfully importing the datingTestSet.txt file, take a minute to explore the
data in Python. You should get something similar to the following.
>>> datingDataMat
array([[ 7.29170000e+04, 7.10627300e+00, 2.23600000e-01],
[ 1.42830000e+04, 2.44186700e+00, 1.90838000e-01],
[ 7.34750000e+04, 8.31018900e+00, 8.52795000e-01],
...,
[ 1.24290000e+04, 4.43233100e+00, 9.24649000e-01],
[ 2.52880000e+04, 1.31899030e+01, 1.05013800e+00],
[ 4.91800000e+03, 3.01112400e+00, 1.90663000e-01]])
>>> datingLabels[0:20]
['didntLike', 'smallDoses', 'didntLike', 'largeDoses', 'smallDoses',
'smallDoses', 'didntLike', 'smallDoses', 'didntLike', 'didntLike',
'largeDoses', 'largeDoses', 'largeDoses', 'didntLike', 'didntLike',
'smallDoses', 'smallDoses', 'didntLike', 'smallDoses', 'didntLike']
Now that you have the data imported and properly formatted, let's take a look at it and see if we can make any sense of it. "Take a look" can mean many things. It can mean look at the values in a text file or look at a plot of the values. We'll next use some of Python's tools to make plots of the data. If we make a plot, we may be able to distinguish some patterns.
2.2.2 Analyze: creating scatter plots with Matplotlib
Let’s look at the data in further detail by making some scatter plots of the data from
Matplotlib. This isn’t hard to do. From the Python console, type the following:
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
>>> plt.show()
You should see something like figure 2.3. We've plotted the second and third columns from the datingDataMat matrix. These are all of our values for the features "Percentage of time spent playing video games" and "Liters of ice cream consumed weekly" for all the classes put together.

Figure 2.3 Dating data without class labels. From this plot it's difficult to discern which dot belongs to which group.

NumPy Array and Python's Array
We'll be using the NumPy array extensively in this book. In your Python shell you can import this using from numpy import array, or it will be imported when you import all of NumPy. There's another array type that comes with Python that we won't be using. Don't make the mistake of importing that array, because the NumPy array methods won't work on it.
It’s hard to see any patterns in this data, but we have additional data we haven’t
used yet—the class values. If we can plot these in color or use some other markers, we
can get a better understanding of the data. The Matplotlib scatter function has additional inputs we can use to customize the markers. Type the previous code again, but
this time use the following for a scatter function:
>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
15.0*array(datingLabels), 15.0*array(datingLabels))
I provided a different marker size and color that depend on the class labels we have in datingLabels. You should see a plot similar to the one in figure 2.4. It's still difficult to make much sense of the data in figure 2.4, but if you plot columns 1 and 0 from our matrix, you'll see a plot like the one in figure 2.5. From that image, you can make out three regions where the different classes lie.
Now that you can plot data using Matplotlib, you can get a better idea of exactly
what’s going on with our data. From figure 2.5, you can identify some regions where
the different classes lie.
Figure 2.4 Dating data with markers changed by class label. It’s easier to identify the
different classes, but it’s difficult to draw conclusions from looking at this data.
2.2.3 Prepare: normalizing numeric values
If you were to calculate the distance between person 3 and person 4 in table 2.3, you would have

sqrt((0 - 67)^2 + (20,000 - 32,000)^2 + (1.1 - 0.1)^2)

Which term in this equation do you think is going to make the most difference? The largest term, the number of frequent flyer miles earned per year, will have the most effect. The frequent flyer term will dominate even though, relative to their own ranges, the percentage of time spent playing video games and the liters of ice cream consumed weekly differ far more between these two people. Why should frequent flyer miles be so important just because its values are large? It shouldn't have any extra importance, unless we want it to, but Hellen believes these features are equally important.

Table 2.3 Sample of data from improved results on a dating site

     Percentage of time spent     Number of frequent flyer     Liters of ice cream     Category
     playing video games          miles earned per year        consumed weekly
1    0.8                          400                          0.5                     1
2    12                           134,000                      0.9                     3
3    0                            20,000                       1.1                     2
4    67                           32,000                       0.1                     2

Figure 2.5 Dating data with frequent flier miles versus percentage of time spent playing video games plotted. The dating data has three features, and these two features show areas where the three different classes lie.
When dealing with values that lie in different ranges, it’s common to normalize them.
Common ranges to normalize them to are 0 to 1 or -1 to 1. To scale everything from 0
to 1, you need to apply the following formula:
newValue = (oldValue-min)/(max-min)
In the normalization procedure, the variables min and max are the smallest and largest
values in the dataset. This scaling adds some complexity to our classifier, but it’s worth
it to get good results. Let’s create a new function in the file kNN.py called autoNorm()
to automatically normalize the data to values between 0 and 1.
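To see what the formula does, here's a quick sketch in the Python shell using just the frequent-flyer column from table 2.3 (four values, purely for illustration):

>>> from numpy import array
>>> miles = array([400.0, 134000.0, 20000.0, 32000.0])
>>> (miles - miles.min()) / (miles.max() - miles.min())
array([ 0.        ,  1.        ,  0.14670659,  0.23652695])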
The autoNorm() function is given in the following listing.
Listing 2.3 Data-normalizing code
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m,1))
    normDataSet = normDataSet/tile(ranges, (m,1))    #B Element-wise division
    return normDataSet, ranges, minVals
In the autoNorm() function, you get the minimum values of each column and place
this in minVals; similarly, you get the maximum values. The 0 in dataSet.min(0)
allows you to take the minimums from the columns, not the rows. Next, you calculate
the range of possible values seen in our data and then create a new matrix to return.
To get the normalized values, you subtract the minimum values and then divide
by the range. The problem with this is that our matrix is 1000×3, while the minVals
and ranges are 1×3. To overcome this, you use the NumPy tile() function to create
a matrix the same size as our input matrix and then fill it up with many copies,
or tiles. Note that this is element-wise division (B). In other numeric software packages, the / operator can be used for matrix division, but in NumPy division with / happens element by element; for matrix division you need to use linalg.solve(matA,matB).
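If tile() is new to you, here's a quick look at what it produces (a minimal shell example with a made-up row):

>>> from numpy import tile, array
>>> tile(array([1, 2, 3]), (2, 1))
array([[1, 2, 3],
       [1, 2, 3]])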
To try out autoNorm, reload kNN.py, execute the function, and inspect the results
at the Python prompt:
>>> reload(kNN)
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.33060119, 0.58918886, 0.69043973],
[ 0.49199139, 0.50262471, 0.13468257],
[ 0.34858782, 0.68886842, 0.59540619],
…,
[ 0.93077422, 0.52696233, 0.58885466],
[ 0.76626481, 0.44109859, 0.88192528],
[ 0.0975718 , 0.02096883, 0.02443895]])
>>> ranges
array([ 8.78430000e+04, 2.02823930e+01, 1.69197100e+00])
>>> minVals
array([ 0. , 0. , 0.001818])
You could have returned just normMat, but you need the ranges and minimum values
to normalize test data. You’ll see this in action next.
2.2.4 Test: testing the classifier as a whole program
Now that you have the data in a format you can use, you’re ready to test our classifier.
After you test it, you can give it to our friend Hellen to use. One common task in machine
learning is evaluating an algorithm’s accuracy. One way you can use the existing data is
to take some portion, say 90%, to train the classifier. Then you’ll take the remaining 10%
to test the classifier and see how accurate it is. There are more advanced ways of doing
this, which we’ll address later, but for now let’s use this method. The 10% to be held back
should be randomly selected. Our data isn’t stored in a specific sequence, so you can
take the first 10% or last 10% without upsetting any statistics professors.
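In code, holding out the first 10% is just array slicing; here's a small sketch of the split (testMat and trainMat are illustrative names, and the same slicing appears inside datingClassTest() below):

>>> hoRatio = 0.10
>>> m = normMat.shape[0]
>>> numTestVecs = int(m*hoRatio)
>>> testMat = normMat[0:numTestVecs, :]        # held out for testing
>>> trainMat = normMat[numTestVecs:m, :]       # used as the known, labeled examples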
Earlier, I mentioned that you can measure the performance of a classifier with the
error rate. In classification, the error rate is the number of misclassified pieces of data
divided by the total number of data points tested. An error rate of 0 means you have a
perfect classifier, and an error rate of 1.0 means the classifier is always wrong. In our
code, you’ll measure the error rate with a counter that’s incremented every time a
piece of data is misclassified. The total number of errors divided by the total number
of data points tested will give you the error rate.
To test the classifier, you’ll create a new function in kNN.py called datingClassTest.
This function is self-contained, so don’t worry if you closed your Python shell earlier. You
won’t have to go back and type the code again. Enter the code in the following listing
into kNN.py.
Listing 2.4 Classifier testing code for dating site
def datingClassTest():
    hoRatio = 0.10
    datingDataMat,datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:],
                                     datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" \
              % (classifierResult, datingLabels[i])
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print "the total error rate is: %f" % (errorCount/float(numTestVecs))
The datingClassTest function is shown in listing 2.4. This uses file2matrix and
autoNorm() from earlier to get the data into a form you can use. Next, the number of
test vectors is calculated, and this is used to decide which vectors from normMat will be
used for testing and which for training. The two parts are then fed into our original
kNN classifier, classify0. Finally, the error rate is calculated and displayed. Note that
you’re using the original classifier; you spent most of this section manipulating the
data so that you could apply it to a simple classifier. Getting solid data is important and
will be the subject of chapter 20.
To execute this, reload kNN and then type kNN.datingClassTest() at the Python
prompt. You should get results similar to the following example:
>>> kNN.datingClassTest()
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
.
.
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the total error rate is: 0.024000
The total error rate for this classifier on this dataset with these settings is 2.4%. Not
bad. You can experiment with different hoRatios and different values of k inside the
datingClassTest function. How does the error change as hoRatio is increased? Note
that the results will vary by algorithm, dataset, and settings.
The example showed that we could predict the class with only a 2.4% error. To our
friend Hellen, this means that she can enter a new person's information, and our system will predict whether she'll dislike or like the person in large or small doses.
2.2.5 Use: putting together a useful system
Now that you’ve tested the classifier on our data, it’s time to use it to actually classify
people for Hellen. We'll provide Hellen with a small program. Hellen will find someone on the dating site and enter his information. The program predicts how much
she’ll like this person.
Add the code from the following listing to kNN.py and reload kNN.
Listing 2.5 Dating site predictor function
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(raw_input(
        "percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat,datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print "You will probably like this person:", resultList[classifierResult - 1]
The code in listing 2.5 mostly uses things you saw earlier. The only new code is the
function raw_input(). This gives the user a text prompt and returns whatever the
user enters. To see the program in action, type in the following:
>>> kNN.classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses
You’ve seen how to create a classifier with some data. All of the data is easily read by a
human, but how could you use a classifier on data that isn’t easily read by a human?
The next section contains another example, this time showing how you can apply kNN
to things as diverse as images where the data is in binary form.
2.3 Example: a handwriting recognition system
We’re going to work through an example of handwriting recognition with our kNN
classifier. We'll be working only with the digits 0–9. Some examples are shown in figure 2.6. These digits were processed through image-processing software to make them all the same size and color.¹ They're all 32×32 black and white. The binary images
were converted to text format to make this example easier, although it isn’t the most
efficient use of memory.
¹ The dataset is a modified version of the "Optical Recognition of Handwritten Digits Data Set" by E. Alpaydin and C. Kaynak, Department of Computer Engineering at Bogazici University, 80815 Istanbul, Turkey, retrieved from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) on October 3, 2010.

Example: using kNN on a handwriting recognition system
1. Collect: Text file provided.
2. Prepare: Write a function to convert from the image format to the list format used in our classifier, classify0().
3. Analyze: We'll look at the prepared data in the Python shell to make sure it's correct.
4. Train: Doesn't apply to the kNN algorithm.
5. Test: Write a function to use some portion of the data as test examples. The test examples are classified against the non-test examples. If the predicted class doesn't match the real class, you'll count that as an error.
6. Use: Not performed in this example. You could build a complete program to extract digits from an image, such as a system used to sort mail in the United States.

2.3.1 Prepare: converting images into test vectors
The images are stored in two directories in the chapter 2 source code. The trainingDigits directory contains about 2,000 examples similar to those in figure 2.6. There are roughly 200 samples from each digit. The testDigits directory contains about 900 examples. We'll use the trainingDigits directory to train our classifier and testDigits to test it. There'll be no overlap between the two groups. Feel free to take a look at the files in those folders.
We’d like to use the same classifier that we used in the previous two examples, so
we’re going to need to reformat the images to a single vector. We’ll take the 32×32
matrix that is each binary image and make it a 1×1024 vector. After we do this, we can
apply it to our existing classifier.
The following code is a small function called img2vector, which converts the
image to a vector. The function creates a 1×1024 NumPy array, then opens the given
file, loops over the first 32 lines in the file, and stores the integer value of the first 32
characters on each line in the NumPy array. This array is finally returned.
def img2vector(filename):
    returnVect = zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect
Try out the img2vector code with the following commands in the Python shell, and
compare the results to a file opened with a text editor:
>>> testVector = kNN.img2vector('testDigits/0_13.txt')
>>> testVector[0,0:31]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0.])
>>> testVector[0,32:63]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0.])
Figure 2.6 Examples of the handwritten digits dataset
2.3.2 Test: kNN on handwritten digits
Now that you have the data in a format that you can plug into our classifier, you’re
ready to test out this idea and see how well it works. The function shown in listing 2.6,
handwritingClassTest(), is a self-contained function that tests out our classifier. You
can add it to kNN.py. Before you add it, make sure to add from os import listdir to
the top of the file. This imports one function, listdir, from the os module, so that
you can see the names of files in a given directory.
Listing 2.6 Handwritten digits testing code
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')         #B Get contents of directory
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])          #C Process class num from filename
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest,
                                     trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" \
              % (classifierResult, classNumStr)
        if (classifierResult != classNumStr): errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))
In listing 2.6, you get the contents of the trainingDigits directory (B) as a list. Then you see how many files are in that directory and call this m. Next, you create a training matrix with m rows and 1024 columns to hold each image as a single row. You parse out the class number from the filename (C). The filename is something like 9_45.txt, where 9 is the class number and it is the 45th instance of the digit 9. You then put this class number in the hwLabels vector and load the image with the function img2vector discussed previously. Next, you do something similar for all the files in the testDigits directory, but instead of loading them into a big matrix, you test each vector individually with our classify0 function. You didn't use the autoNorm() function from section 2.2 because all of the values were already between 0 and 1.
To execute this from the Python shell, type kNN.handwritingClassTest() at the
Python prompt. It’s quite interesting to watch. Depending on your machine’s speed, it
will take some time to load the dataset. Then, when the function begins testing, you
can see the results as they come back. You should have output that's similar to the following example:
>>> kNN.handwritingClassTest()
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
.
.
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 6, the real answer is: 8
.
.
the classifier came back with: 9, the real answer is: 9
the total number of errors is: 11
the total error rate is: 0.011628
Using the kNN algorithm on this dataset, you were able to achieve an error rate of 1.2%.
You can vary k to see how this changes. You can also modify the handwritingClassTest
function to randomly select training examples. That way, you can vary the number of
training examples and see how that impacts the error rate.
Depending on your computer’s speed, you may think this algorithm is slow, and
you'd be right. For each of our 900 test cases, you had to do 2,000 distance calculations on a 1024-entry floating point vector. Additionally, our test dataset size was 2 MB.
Is there a way to make this smaller and take fewer computations? One modification to
kNN, called kD-trees, allows you to reduce the number of calculations.
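For a taste of what that looks like, here's a sketch using SciPy's cKDTree, reusing the variables from handwritingClassTest(); this assumes SciPy is installed, and note that in 1,024 dimensions a k-d tree won't necessarily beat brute force, so this is only to show the interface:

from scipy.spatial import cKDTree

tree = cKDTree(trainingMat)                     # build the tree once from the training matrix
dists, idx = tree.query(vectorUnderTest, k=3)   # 3 nearest neighbors of one test vector
votes = [hwLabels[i] for i in idx[0]]           # their labels; take the majority vote as before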
2.4 Summary
The k-Nearest Neighbors algorithm is a simple and effective way to classify data. The
examples in this chapter should be evidence of how powerful a classifier it is. kNN is
an example of instance-based learning, where you need to have instances of data close
at hand to perform the machine learning algorithm. The algorithm has to carry
around the full dataset; for large datasets, this implies a large amount of storage. In
addition, you need to calculate the distance measurement for every piece of data in
the database, and this can be cumbersome.
An additional drawback is that kNN doesn’t give you any idea of the underlying
structure of the data; you have no idea what an “average” or “exemplar” instance from
each class looks like. In the next chapter, we’ll address this issue by exploring ways in
which probability measurements can help you do classification.