Text Classification: Name Gender
Here is a partition of the name gender data into training, dev-test, and test data as shown in the lectures:
In [1]:
import nltk
nltk.download('names')
from nltk.corpus import names
m = names.words('male.txt')
f = names.words('female.txt')
[nltk_data] Downloading package names to /home/diego/nltk_data…
[nltk_data] Package names is already up-to-date!
In [2]:
import random
random.seed(1234) # Set the random seed to allow replicability
names = ([(name, 'male') for name in m] +
         [(name, 'female') for name in f])
random.shuffle(names)
train_names = names[1000:]
devtest_names = names[500:1000]
test_names = names[:500]
And here is one of the classifiers given in the lectures.
In [3]:
def gender_features2(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}
train_set2 = [(gender_features2(n), g) for n, g in train_names]
devtest_set2 = [(gender_features2(n), g) for n, g in devtest_names]
classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
nltk.classify.accuracy(classifier2, devtest_set2)
Out[3]:
0.77
Exercise: Using more information
Define a new function gender_features5 that uses, as features, the suffixes of length 1, 2, 3, 4, and 5. Examine the accuracy results. What can you conclude from this new classifier?
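As a possible starting point, the sketch below builds suffix features for every length up to a maximum. The name suffix_features and the parameter max_len are assumptions made for illustration and do not come from the lectures. Note that for names shorter than the requested length, slicing simply returns the whole name, so several features can end up with the same value.

```python
def suffix_features(word, max_len=5):
    """Map a word to its suffixes of length 1..max_len (hypothetical helper)."""
    # word[-k:] returns the whole word when the word has fewer than k characters.
    return {f"suffix{k}": word[-k:] for k in range(1, max_len + 1)}


print(suffix_features("Diego"))
print(suffix_features("Al", max_len=3))
```

A function of this shape can be used in place of gender_features2 when building the training and dev-test sets.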
In [ ]:
Exercise: Plot the impact of the training size
The following code plots the classifier accuracy on the training and dev-test sets as we increase the training size. Examine the plot and answer the following questions:
1. From what amount of training data would you judge that the system stops over-fitting?
2. From what amount of training data would you say that there is no need to add more training data?
In [6]:
train_accuracy2 = []
devtest_accuracy2 = []
nsamples = range(10, 500, 5)
for n in nsamples:
    classifier2 = nltk.NaiveBayesClassifier.train(train_set2[:n])
    train_accuracy2.append(nltk.classify.accuracy(classifier2, train_set2[:n]))
    devtest_accuracy2.append(nltk.classify.accuracy(classifier2, devtest_set2))
In [7]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.plot(nsamples, train_accuracy2, label='Train')
plt.plot(nsamples, devtest_accuracy2, label='Devtest')
plt.xlabel('Training size')
plt.ylabel('Accuracy')
plt.title('Classifier 2')
plt.legend()
Out[7]:
[Figure: "Classifier 2", accuracy on the Train and Devtest sets plotted against training size]
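The two questions can also be explored numerically rather than by eye. The sketch below is one way to do that: the first helper looks for the training size from which the gap between training and dev-test accuracy stays small, and the second looks for the size after which dev-test accuracy no longer improves noticeably. Both function names and both thresholds (max_gap, tolerance) are assumptions chosen for illustration, and the toy numbers are made up so that the block runs on its own.

```python
def size_where_gap_closes(sizes, train_acc, dev_acc, max_gap=0.05):
    """Return the smallest training size from which the train/dev-test
    accuracy gap stays at or below max_gap, or None if it never does."""
    answer = None
    for size, tr, dev in zip(sizes, train_acc, dev_acc):
        if tr - dev <= max_gap:
            if answer is None:
                answer = size
        else:
            answer = None  # the gap reopened, so start looking again
    return answer


def size_where_devtest_plateaus(sizes, dev_acc, tolerance=0.01):
    """Return the smallest training size after which dev-test accuracy
    never rises by more than tolerance above its value at that size."""
    for i, size in enumerate(sizes):
        if max(dev_acc[i:]) - dev_acc[i] <= tolerance:
            return size
    return None


# Made-up numbers purely to illustrate the helpers (not real results).
toy_sizes = [10, 20, 30, 40, 50]
toy_train = [0.95, 0.88, 0.79, 0.78, 0.78]
toy_dev = [0.60, 0.70, 0.75, 0.77, 0.77]

print(size_where_gap_closes(toy_sizes, toy_train, toy_dev))
print(size_where_devtest_plateaus(toy_sizes, toy_dev))
```

In the notebook, the same helpers could be called with nsamples, train_accuracy2, and devtest_accuracy2. The thresholds are a matter of judgement, so it is worth trying a few values and comparing the answers with what the plot suggests.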
Exercise: Repeat the analysis using scikit-learn
The lectures show how to use scikit-learn (sklearn) to implement the name classifier. Replicate the work in this workshop and try to answer the same questions as above.
1. Is it better to use the last two characters, or the last 5 characters?
2. From what amount of training data would you judge that the system stops overfitting?
3. From what amount of training data would you say that there is no need to add more training data?
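As a reminder of the general shape of a scikit-learn solution, here is a minimal sketch that feeds dictionary features into a Naive Bayes model. It is not necessarily the approach used in the lectures: the choice of DictVectorizer with MultinomialNB in a pipeline is an assumption, and the tiny hand-made name lists exist only so that the block runs on its own.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def gender_features2(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}


# Tiny hand-made sample (illustrative only); in the workshop,
# use train_names and devtest_names instead.
toy_train = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
             ('John', 'male'), ('Peter', 'male'), ('Mark', 'male')]
toy_devtest = [('Sofia', 'female'), ('Oscar', 'male')]

X_train = [gender_features2(n) for n, g in toy_train]
y_train = [g for n, g in toy_train]
X_dev = [gender_features2(n) for n, g in toy_devtest]
y_dev = [g for n, g in toy_devtest]

# DictVectorizer one-hot encodes the string-valued features;
# feature values not seen during fit are ignored at prediction time.
model = make_pipeline(DictVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_dev)
devtest_accuracy = model.score(X_dev, y_dev)
print(predictions, devtest_accuracy)
```

To answer the questions above, the same pipeline can be refitted on growing slices of the training data, recording model.score on both the slice and the dev-test set at each step.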
In [ ]: