Deep Learning and Text Analytics II
References:
• General introduction
▪ http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
• Word vector:
▪ https://code.google.com/archive/p/word2vec/
• Keras tutorial
▪ https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
• CNN
▪ http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
1. Agenda¶
• Introduction to neural networks
• Word/Document Vectors (vector representation of words/phrases/paragraphs)
• Convolutional neural network (CNN)
• Application of CNN in text classification
4. Word2Vec (a.k.a. word embedding) and Doc2Vec¶
4.1. Word2Vec¶
• Vector representation of words (i.e. word vectors) learned using neural network
▪ e.g. “apple”: [0.35, -0.2, 0.4, …], “mango”: [0.32, -0.18, 0.5, …]
▪ Interesting properties of word vectors:
◦ Words with similar semantics have close word vectors 
◦ https://www.kdnuggets.com/2017/04/cartoon-word2vec-espresso-cappuccino.html
◦ Composition: e.g. vector(“woman”)+vector(“king”)-vector(‘man’) $\approx$ vector(“queen”)
• Models:
▪ CBOW (Continuous Bag of Words): Predict a target word based on context
◦ e.g. the fox jumped over the lazy dog
◦ Assuming a symmetric context of window size 3 (one word on each side of the target), this sentence yields training samples such as the following (a short sketch that generates these pairs follows this list):
◦ ([-, fox], the)
◦ ([the, jumped], fox)
◦ ([fox, over], jumped)
◦ ([jumped, the], over)
◦ …
◦ source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
▪ Skip-gram: predict the context words based on the target word

▪ source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
▪ Negative Sampling:
◦ When training a neural network, for each sample, all weights are adjusted slightly so that it predicts that training sample more accurately.
◦ CBOW and skip-gram models have a tremendous number of weights, all of which would be slightly updated by every one of billions of training samples!
◦ Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them.
◦ e.g. when training with sample ([fox, over], jumped), update output weights connected to “jumped” along with a small number of other “negative words” sampled randomly
◦ For details, check http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/
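For intuition, here is a minimal plain-Python sketch (no neural network, no library assumptions) that generates the ([context], target) CBOW training pairs listed above, using a symmetric context of one word on each side:
tokens = "the fox jumped over the lazy dog".split()
pairs = []
for i, target in enumerate(tokens):
    # one context word on each side; "-" marks a missing neighbor
    left = tokens[i-1] if i > 0 else "-"
    right = tokens[i+1] if i < len(tokens)-1 else "-"
    pairs.append(([left, right], target))
print(pairs[0:4])
# [(['-', 'fox'], 'the'), (['the', 'jumped'], 'fox'),
#  (['fox', 'over'], 'jumped'), (['jumped', 'the'], 'over')]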
In [5]:
# set up interactive shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [2]:
# Exercise 4.1.1 Train your word vector
import pandas as pd
import nltk,string
# Load data
data=pd.read_csv('../../../dataset/amazon_review_large.csv', header=None)
data.columns=['label','text']
data.head()
# tokenize each document into a list of unigrams
# strip punctuations and leading/trailing spaces from unigrams
# only unigrams with 2 or more characters are taken
sentences=[ [token.strip(string.punctuation).strip() \
             for token in nltk.word_tokenize(doc.lower()) \
             if token not in string.punctuation and \
             len(token.strip(string.punctuation).strip())>=2] \
            for doc in data["text"]]
print(sentences[0:2])
In [ ]:
# Train your own word vectors using gensim
# gensim.models is the package for word2vec
# check https://radimrehurek.com/gensim/models/word2vec.html
# for detailed description
from gensim.models import word2vec
import logging
import pandas as pd
# print out tracking information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', \
                    level=logging.INFO)
# min_count: words with total frequency lower than this are ignored
# size: the dimension of word vector
# window: context window, i.e. the maximum distance
# between the current and predicted word
# within a sentence (i.e. the length of ngrams)
# workers: # of parallel threads in training
# for other parameters, check https://radimrehurek.com/gensim/models/word2vec.html
wv_model = word2vec.Word2Vec(sentences, \
min_count=5, size=200, \
window=5, workers=4 )
In [ ]:
# test word2vec model
print("Top 5 words similar to 'sound'")
wv_model.wv.most_similar('sound', topn=5)
print("Top 5 words similar to 'sound' and 'music' but not relevant to 'film'")
wv_model.wv.most_similar(positive=['sound','music'], \
                         negative=['film'], topn=5)
print("Similarity between 'movie' and 'film':")
wv_model.wv.similarity('movie','film')
print("Similarity between 'movie' and 'city':")
wv_model.wv.similarity('movie','city')
print("Word that does not match the others in the list \
['sound', 'music', 'graphics', 'actor', 'book']:")
wv_model.wv.doesnt_match(["sound", "music", \
                          "graphics", "actor", "book"])
print("Word vector for 'movie':")
wv_model.wv['movie']
4.2. Pretrained Word Vectors¶
• Google published pre-trained 300-dimensional vectors for 3 million words and phrases that were trained on Google News dataset (about 100 billion words)(https://code.google.com/archive/p/word2vec/)
• GloVe (Global Vectors for Word Representation): pretrained word vectors from different data sources, provided by Stanford: https://nlp.stanford.edu/projects/glove/ (a loading sketch follows Exercise 4.2.1 below)
• FastText by Facebook https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
In [ ]:
# Exercise 4.2.1: Use pretrained word vectors
# download the bin file for pretrained word vectors
# from above links, e.g. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
# Warning: the bin file is very big (over 2G)
# You need a powerful machine to load it
import gensim
model = gensim.models.KeyedVectors.\
        load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.most_similar(positive=['woman','king'], \
                   negative=['man'])
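The GloVe vectors from Section 4.2 can be used the same way once converted to word2vec format; a sketch assuming a GloVe file such as glove.6B.100d.txt has been downloaded from the Stanford link above (file names here are illustrative):
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
# convert the GloVe text format into word2vec text format
glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.w2v.txt')
# load the converted vectors (text format, so binary=False)
glove_model = KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt', binary=False)
glove_model.most_similar('movie', topn=5)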
4.3. Sentence/Paragraph/Document Vectors¶
• So far we learned vector representation of words
• A lot of times, our samples are sentences, paragraphs, or documents
• How to create vector representations of sentences, paragraphs, or documents?
▪ Weighted average of word vectors (however, word order is lost as “bag of words”)
▪ Concatenation of word vectors (large space)
▪ ??
• Paragraph Vector: A distributed memory model (PV-DM)
▪ Word vectors are shared across paragraphs
▪ The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs
▪ Both paragraph vectors and word vectors are returned
▪ Paragraph vectors can be used for document retrieval or as features for classification or clustering
▪ Source: Le Q. and Mikolov, T. Distributed Representations of Sentences and Documents https://arxiv.org/pdf/1405.4053v2.pdf
In [ ]:
# Exercise 4.3.1 Train your own doc2vec (paragraph) vectors
# We have tokenized sentences
# Label each sentence with a unique tag
from gensim.models.doc2vec import TaggedDocument
docs=[TaggedDocument(sentences[i], [str(i)]) for i in range(len(sentences)) ]
docs[0]
In [3]:
from random import shuffle
# package for doc2vec
from gensim.models import doc2vec
# for more parameters, check
# https://radimrehurek.com/gensim/models/doc2vec.html
# initialize the model without documents
# distributed memory model is used (dm=1)
model = doc2vec.Doc2Vec(dm=1, min_count=5, window=5, size=200, workers=4)
# build the vocabulary using the documents
model.build_vocab(docs)
# train the model for 30 epochs
# You may need to increase the number of epochs
for epoch in range(30):
    # shuffle the documents in each epoch
    shuffle(docs)
    # in each epoch, all samples are used once
    model.train(docs, total_examples=len(docs), epochs=1)
In [ ]:
# Inspect paragraph vectors and word vectors
# the paragraph vector of the first document
model.docvecs['0']
# the word vector of 'movie'
model.wv['movie']
In [ ]:
# Check word similarity
print("Top 5 words similar to 'sound'")
model.wv.most_similar('sound', topn=5)
print("Top 5 words similar to 'sound' and 'music' but not relevant to 'film'")
model.wv.most_similar(positive=['sound','music'], negative=['film'], topn=5)
print("Similarity between 'movie' and 'film':")
model.wv.similarity('movie','film')
print("Similarity between 'movie' and 'city':")
model.wv.similarity('movie','city')
In [4]:
# Inspect document similarity
model.docvecs.most_similar('0')
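For a document that was not in the training set, doc2vec can infer a paragraph vector; a short sketch, assuming the Doc2Vec model trained above is available as model and the new document is tokenized the same way as sentences:
# infer a paragraph vector for an unseen, tokenized document
new_doc = ['the', 'sound', 'quality', 'of', 'this', 'speaker', 'is', 'great']
new_vec = model.infer_vector(new_doc)
# find the training documents most similar to the new one
model.docvecs.most_similar([new_vec], topn=5)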
5. Convolutional Neural Networks (CNN)¶
References (highly recommended):
• http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
• https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
• CNNs are widely used in computer vision
• CNNs were responsible for major breakthroughs in image recognition and are at the core of most computer vision systems, such as automated photo tagging and self-driving cars
• Recently, CNNs have also been applied to NLP tasks and achieved good performance.
5.1. Convolution¶
• Convolution is a technique for extracting distinguishing features from a feature space
• Example: feature detection from image pixels
▪ Feature space: a matrix of pixels of 0 (black) or 1 (white)
▪ Filter/kernel/feature detector: a function applied to every fixed-size subset of the feature matrix (a small NumPy sketch of this sliding computation follows this list)
◦ e.g. a 3×3 filter (a 3×3 matrix $\begin{vmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{vmatrix}$ ) slides through every area of the matrix sequentially, multiplies its values element-wise with the original matrix, then sums them up
◦ e.g. a filter (e.g. $\begin{vmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{vmatrix}$ ) that takes the difference between a pixel and its neighbors -> detects edges
• Typically, a larger number of filters in different sizes will be used
• Configuration of filters
▪ filter size ($h \times w$)
▪ stride size (how much to shift a filter in each step) ($s$)
▪ number of filters (depth) ($d$)
• Questions:
▪ With a 5×5 feature space, after applying a 3×3 filter with stride 2, what will be the size of the result?
▪ Formula to calculate the size?
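A minimal NumPy sketch of the sliding computation described above, applying the 3×3 edge-detection filter from the example to a small 5×5 image with stride 1 (the pixel values are illustrative):
import numpy as np
# a 5x5 "image": a bright square on a dark background
image = np.array([[0, 0, 0, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 1, 1, 1, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 0, 0, 0]])
# the edge-detection filter from the example above
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]])
h, w = kernel.shape
result = np.zeros((image.shape[0]-h+1, image.shape[1]-w+1))
for i in range(result.shape[0]):
    for j in range(result.shape[1]):
        # multiply the filter element-wise with the patch, then sum
        result[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
print(result)   # nonzero only around the edges of the square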
5.2. Pooling Layer¶
• Pooling layers are typically applied after the convolutional layers.
• A pooling layer subsamples its input.
• The most common way to do pooling is to apply a max operation to the result of each filter (a.k.a 1-max pooling).
▪ e.g. 1-max pooling over a feature map such as [1, -2, 8, 0, 3] gives 8 (see the sketch after this list)
▪ If 100 filters have been used, then we get 100 numbers
• Pooling can be applied over a window (e.g. 2×2) 
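A tiny NumPy sketch of 1-max pooling, keeping only the largest value produced by one filter (illustrative numbers, matching the feature map mentioned above):
import numpy as np
# feature map produced by one filter (one value per filter position)
feature_map = np.array([1, -2, 8, 0, 3])
# 1-max pooling keeps only the largest activation
print(np.max(feature_map))    # 8
# with 100 filters, 1-max pooling keeps 100 numbers, one per feature map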
5.3. What are CNNs¶
• A CNN consists of several layers of convolutions with nonlinear activation functions such as ReLU or tanh

• A CNN typically contains:
▪ A convolution layer (not dense layer) connected to the input layer
◦ Each convolution layer applies different filters.
◦ Typically hundreds or thousands of filters are used.
◦ The results of filters are concatenated.
▪ A pooling layer is used to subsample the result of convolution layer
▪ There may be multiple convolution and pooling layers stacked, e.g. in image recognition:
◦ 1st layer: detect edges
◦ 2nd layer: detect shape, e.g. round, square
◦ 3rd layer: wheels, doors etc.
▪ Then each result out of convolution-pooling is connected to a neuron in the output layer (local connections). Such results are high-level features used by the classification algorithm.
• During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform.
• Powerful capabilities of CNN:
▪ Location Invariance: CNN extracts distinguishing features by convolution-pooling and it does not care where these features are. So images can still be recognized after rotation and scaling.
▪ Compositionality: Each filter composes a local patch of lower-level features into higher-level representation. E.g., detect edges from pixels, shapes from edges, and more complex objects from shapes.
• If you’re interested in how CNNs are used in image recognition, follow the classic MNIST handwritten digit recognition tutorial (a minimal Keras sketch of the conv-pool-dense stack follows this list)
• Play with it! http://scs.ryerson.ca/~aharley/vis/conv/flat.html
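For reference, a minimal Keras sketch of the convolution-pooling-dense stack described above for 28×28 grayscale digit images (as in the MNIST tutorial); the layer sizes are illustrative, not tuned, and the Sequential API is used here for brevity:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

img_model = Sequential()
# convolution layer: 32 filters of size 3x3 over a 28x28x1 image
img_model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                     input_shape=(28, 28, 1)))
# pooling layer: subsample each feature map over 2x2 windows
img_model.add(MaxPooling2D(pool_size=(2, 2)))
# flatten the pooled feature maps and classify into 10 digit classes
img_model.add(Flatten())
img_model.add(Dense(10, activation='softmax'))
img_model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
img_model.summary()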
5.4. Application of CNN in Text Classification¶
• Assume $m$ samples, each of which is a sentence with $n$ words (short sentences can be padded)
• Embedding: In each sentence, each word can be represented as its word vector of dimension $d$ (pretrained or to be trained)
• Convolution: Apply filters to n-grams of different lengths (e.g. unigram, bigrams, …).
▪ E.g. A filter can slide through every 2 words (bigram)
▪ So, the filter size (i.e. region size) can be $1 \times d$ (unigram), $2 \times d$ (bigram), $3 \times d$ (trigram), …
• At pooling layer, 1-max pooling is applied to the result of each filter. Then all results after pooling are concatenated as the input to the output layer
▪ This is equivalent to selecting the words or phrases that are most discriminative with regard to the classification goal

Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states. Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
• Questions:
▪ How many parameters are there in total in the convolution layer? (a small Keras sketch you can use to check your answer follows this list)
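One way to check your answer is to build a single Conv1D layer with illustrative sizes and read the parameter count from model.summary(); a small Keras sketch (the sizes here are assumptions, not taken from the figure above):
from keras.models import Sequential
from keras.layers import Conv1D
# illustrative sizes: documents of 500 words, 100-dim word vectors,
# 64 bigram filters (kernel_size=2)
check = Sequential()
check.add(Conv1D(filters=64, kernel_size=2, activation='relu',
                 input_shape=(500, 100)))
check.summary()   # the "Param #" column shows the number of weights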
5.5. How to deal with overfitting – Regularization & Dropout¶
• Deep neural nets with a large number of parameters can easily suffer from overfitting
• Typical approaches to overcome overfitting
▪ Regularization
▪ Dropout (which is also a kind of regularization technique)
• What is dropout?
▪ During training, randomly remove units in the hidden layer from the network. Update parameters as normal, leaving dropped-out units unchanged
▪ No dropout during testing
▪ Typically, each hidden unit is set to 0 with probability 0.1~0.5 
▪ https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
• Why dropout?
▪ Hidden units cannot co-adapt with other units since a unit may not always be present
▪ Sample data usually come with noise. Dropout constrains network adaptation to the data at training time
▪ After training, the most useful neurons end up with high weights (a toy NumPy sketch of dropout follows this list)
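A toy NumPy sketch of (inverted) dropout during training: each hidden unit is zeroed with probability p and the survivors are rescaled so the expected activation is unchanged; at test time nothing is dropped (the activations below are made up):
import numpy as np
np.random.seed(1)
hidden = np.array([0.5, 1.2, -0.3, 0.8, 0.1])   # activations of 5 hidden units
p = 0.5                                          # dropout probability
# training: zero out each unit with probability p, rescale the survivors
mask = (np.random.rand(hidden.shape[0]) > p) / (1 - p)
print(hidden * mask)   # some units are exactly 0 in this pass
# testing: no dropout, all units are used
print(hidden)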
5.6. Example: Use CNN for Sentiment Analysis (Single-Label Classification)¶
• Dataset: IMDB review
• 25,000 movie reviews, positive or negative
• Benchmark performance is 80-90% with CNN (https://arxiv.org/abs/1408.5882)
• We’re going to create a CNN with the following:
▪ Word embedding trained as part of CNN
▪ filters in 3 sizes:
◦ unigram (Conv1D, kernel_size=1)
◦ bigram (Conv1D, kernel_size=2)
◦ trigram (Conv1D, kernel_size=3)
▪ Maxpooling for each convolution layer
▪ Dropout 
In [ ]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [ ]:
# Exercise 5.1: Load data
import pandas as pd
import nltk,string
from gensim import corpora
data=pd.read_csv("../../../dataset/imdb_reviews.csv", header=0, delimiter="\t")
data.head()
len(data)
# if your computer does not have enough resource
# reduce the dataset
data=data.loc[0:8000]
In [ ]:
# Exercise 5.2 Preprocessing data: tokenize and pad sentences
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
# set the maximum number of words to be used
MAX_NB_WORDS=10000
# set sentence/document length
MAX_DOC_LEN=500
# get a Keras tokenizer
# https://keras.io/preprocessing/text/
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(data["review"])
# convert each document into a sequence of word indexes
sequences = tokenizer.\
            texts_to_sequences(data["review"])
# pad all sequences into the same length
# if a sentence is longer than maxlen, truncate it on the right
# if a sentence is shorter than maxlen, pad it on the right
padded_sequences = pad_sequences(sequences, \
                                 maxlen=MAX_DOC_LEN, \
                                 padding='post', \
                                 truncating='post')
print(padded_sequences[0])
In [ ]:
# get the mapping between word and its index
tokenizer.word_index['film']
# get the count of each word
tokenizer.word_counts['film']
In [ ]:
# Split data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(\
                padded_sequences, data['sentiment'],\
                test_size=0.3, random_state=1)
In [ ]:
# Exercise 5.3: Create CNN model
from keras.layers import Embedding, Dense, Conv1D, MaxPooling1D, \
Dropout, Activation, Input, Flatten, Concatenate
from keras.models import Model
# The dimension for embedding
EMBEDDING_DIM=100
# define the input layer, where a sentence is represented
# as a 1-dimensional array of word indexes (integers)
main_input = Input(shape=(MAX_DOC_LEN,), \
                   dtype='int32', name='main_input')
# define the embedding layer
# input_dim is the number of words +1
# where 1 is for the padding symbol
# output_dim is the word vector dimension
# input_length is the max. length of a document
# input to the embedding layer is the "main_input" layer
embed_1 = Embedding(input_dim=MAX_NB_WORDS+1, \
                    output_dim=EMBEDDING_DIM, \
                    input_length=MAX_DOC_LEN,\
                    name='embedding')(main_input)
# define a 1D convolution layer
# 64 filters are used
# each filter slides through one word at a time (kernel_size=1)
# input to this layer is the embedding layer
conv1d_1= Conv1D(filters=64, kernel_size=1, \
                 name='conv_unigram',\
                 activation='relu')(embed_1)
# define a 1-dimensional MaxPooling layer
# to take the output of the previous convolution layer
# the convolution layer produces
# MAX_DOC_LEN-1+1 values as output (why?)
pool_1 = MaxPooling1D(MAX_DOC_LEN-1+1, \
                      name='pool_unigram')(conv1d_1)
# The pooling layer creates output
# of shape (# of samples, 1, 64)
# remove the middle dimension since its size is 1
flat_1 = Flatten(name='flat_unigram')(pool_1)
# following the same logic, define
# filters for bigrams
conv1d_2= Conv1D(filters=64, kernel_size=2, \
                 name='conv_bigram',\
                 activation='relu')(embed_1)
pool_2 = MaxPooling1D(MAX_DOC_LEN-2+1, name='pool_bigram')(conv1d_2)
flat_2 = Flatten(name='flat_bigram')(pool_2)
# filters for trigrams
conv1d_3= Conv1D(filters=64, kernel_size=3, \
                 name='conv_trigram', activation='relu')(embed_1)
pool_3 = MaxPooling1D(MAX_DOC_LEN-3+1, name='pool_trigram')(conv1d_3)
flat_3 = Flatten(name='flat_trigram')(pool_3)
# Concatenate the flattened outputs
z=Concatenate(name='concate')([flat_1, flat_2, flat_3])
# Create a dropout layer
# In each iteration only 50% of the units are kept
drop_1=Dropout(rate=0.5, name='dropout')(z)
# Create a dense layer
dense_1 = Dense(192, activation='relu', name='dense')(drop_1)
# Create the output layer
preds = Dense(1, activation='sigmoid', name='output')(dense_1)
# create the model with the input layer
# and the output layer
model = Model(inputs=main_input, outputs=preds)
In [ ]:
# Exercise 5.4: Show model configuration
model.summary()
#model.get_config()
#model.get_weights()
#from keras.utils import plot_model
#plot_model(model, to_file='cnn_model.png')
In [ ]:
# Exercise 5.4: Compile the model
model.compile(loss="binary_crossentropy", \
              optimizer="adam", \
              metrics=["accuracy"])
In [ ]:
# Exercise 5.5: Fit the model
BATCH_SIZE = 64
NUM_EPOCHES = 10
# fit the model and save the fitting history to "training"
training=model.fit(X_train, y_train, \
batch_size=BATCH_SIZE, \
epochs=NUM_EPOCHES,\
validation_data=[X_test, y_test], \
verbose=2)
In [ ]:
# Exercise 5.6. Investigate the training process
import matplotlib.pyplot as plt
import pandas as pd
# the fitting history is saved as a dictionary
# convert the dictionary to a dataframe
df=pd.DataFrame.from_dict(training.history)
df.columns=["train_acc", "train_loss", \
            "val_acc", "val_loss"]
df.index.name='epoch'
print(df)
# plot training history
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8,3));
df[["train_acc", "val_acc"]].plot(ax=axes[0]);
df[["train_loss", "val_loss"]].plot(ax=axes[1]);
plt.show();
Observations from training history:
• As training goes on, training accuracy/loss keeps improving
• Testing accuracy/loss improves at the beginning, then gets worse
• This indicates that the model overfits and stops generalizing after a certain point
• Thus, we should stop training the model when testing accuracy/loss starts to get worse.
• This analysis can be used to determine hyperparameter NUM_EPOCHES
• Fortunately, this can be done automatically by “Early Stopping”
In [ ]:
# Exercise 5.6: Use early stopping to find the best model
from keras.callbacks import EarlyStopping, ModelCheckpoint
# the file path to save the best model
BEST_MODEL_FILEPATH="best_model"
# define early stopping based on validation loss
# if validation loss does not improve in an epoch
# compared with the previous one,
# stop training (i.e. patience=0)
# mode='min' indicates the loss needs to decrease
earlyStopping=EarlyStopping(monitor='val_loss', \
                            patience=0, verbose=2, \
                            mode='min')
# define a checkpoint to save the best model,
# i.e. the one with max. validation accuracy
checkpoint = ModelCheckpoint(BEST_MODEL_FILEPATH, \
                             monitor='val_acc', \
                             verbose=2, \
                             save_best_only=True, \
                             mode='max')
# compile the model
model.compile(loss="binary_crossentropy", \
              optimizer="adam", metrics=["accuracy"])
# fit the model with early stopping and checkpointing
# as callbacks (functions Keras invokes at the end of each epoch)
model.fit(X_train, y_train, \
          batch_size=BATCH_SIZE, epochs=NUM_EPOCHES, \
          callbacks=[earlyStopping, checkpoint],
          validation_data=[X_test, y_test],\
          verbose=2)
In [ ]:
# Exercise 5.7: Load the best model
# load the best model from the saved weights file
model.load_weights("best_model")
# predict
pred=model.predict(X_test)
print(pred[0:5])
# evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
In [ ]:
# Exercise 5.8: Put everything together as a function
from keras.layers import Embedding, Dense, Conv1D, MaxPooling1D, \
Dropout, Activation, Input, Flatten, Concatenate
from keras.models import Model
from keras.regularizers import l2
from keras.callbacks import EarlyStopping, ModelCheckpoint

def cnn_model(FILTER_SIZES,                 # filter sizes as a list
              MAX_NB_WORDS,                 # total number of words
              MAX_DOC_LEN,                  # max words in a doc
              EMBEDDING_DIM=200,            # word vector dimension
              NUM_FILTERS=64,               # number of filters for each size
              DROP_OUT=0.5,                 # dropout rate
              NUM_OUTPUT_UNITS=1,           # number of output units
              NUM_DENSE_UNITS=100,          # number of units in the dense layer
              PRETRAINED_WORD_VECTOR=None,  # pretrained embedding matrix, if any
              LAM=0.0):                     # regularization coefficient

    main_input = Input(shape=(MAX_DOC_LEN,), \
                       dtype='int32', name='main_input')

    if PRETRAINED_WORD_VECTOR is not None:
        # use pretrained word vectors;
        # the word vectors can be further tuned (trainable=True);
        # set trainable=False to use static word vectors
        embed_1 = Embedding(input_dim=MAX_NB_WORDS+1, \
                            output_dim=EMBEDDING_DIM, \
                            input_length=MAX_DOC_LEN, \
                            weights=[PRETRAINED_WORD_VECTOR],\
                            trainable=True,\
                            name='embedding')(main_input)
    else:
        embed_1 = Embedding(input_dim=MAX_NB_WORDS+1, \
                            output_dim=EMBEDDING_DIM, \
                            input_length=MAX_DOC_LEN, \
                            name='embedding')(main_input)

    # add a convolution-pooling-flatten block for each filter size
    conv_blocks = []
    for f in FILTER_SIZES:
        conv = Conv1D(filters=NUM_FILTERS, kernel_size=f, \
                      activation='relu', name='conv_'+str(f))(embed_1)
        conv = MaxPooling1D(MAX_DOC_LEN-f+1, name='max_'+str(f))(conv)
        conv = Flatten(name='flat_'+str(f))(conv)
        conv_blocks.append(conv)

    if len(conv_blocks)>1:
        z=Concatenate(name='concate')(conv_blocks)
    else:
        z=conv_blocks[0]

    drop=Dropout(rate=DROP_OUT, name='dropout')(z)
    dense = Dense(NUM_DENSE_UNITS, activation='relu',\
                  kernel_regularizer=l2(LAM), name='dense')(drop)
    preds = Dense(NUM_OUTPUT_UNITS, activation='sigmoid', name='output')(dense)

    model = Model(inputs=main_input, outputs=preds)
    model.compile(loss="binary_crossentropy", \
                  optimizer="adam", metrics=["accuracy"])

    return model
5.7. Use CNN for multi-label classification¶
• In multi-label classification, a document can be classified into multiple classes
• We can use multiple output units, each responsible for predicting one class
• For multi-label classification ($K$ classes), do the following:
1. Represent the labels as an indicator matrix (a toy sketch follows this list)
▪ e.g. three classes [‘econ’,’biz’,’tech’] in total,
▪ sample 1: ‘econ’ only -> [1, 0, 0]
▪ sample 2: [‘econ’,’biz’] -> [1, 1, 0]
2. Accordingly, set output layer to have K output units
▪ each responsible for one class
▪ each unit gives the probability of one class
• Example: Yahoo News Ranked Multilabel Learning dataset (http://research.yahoo.com)
▪ A subset is selected
▪ 4 classes, 6426 samples
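A toy sketch of the indicator matrix for the [‘econ’, ‘biz’, ‘tech’] example above, using the same MultiLabelBinarizer that Exercise 5.7.2 applies to the Yahoo data:
from sklearn.preprocessing import MultiLabelBinarizer
toy_labels = [['econ'], ['econ', 'biz']]
# fixing the class order makes the columns econ, biz, tech
toy_mlb = MultiLabelBinarizer(classes=['econ', 'biz', 'tech'])
print(toy_mlb.fit_transform(toy_labels))
# [[1 0 0]
#  [1 1 0]]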
In [ ]:
# Exercise 5.7.1: Load and process the data
import json
from sklearn.preprocessing import MultiLabelBinarizer
from numpy.random import shuffle
# load the data
data=json.load(open("../../../dataset/ydata.json",'rb'))
#data=json.load(open("ydata.json",'r'))
# shuffle the data
shuffle(data)
# split into text and label
text,labels=zip(*data)
text=list(text)
labels=list(labels)
text[1]
labels[1]
In [ ]:
# Exercise 5.7.2: create indicator matrix for labels
mlb = MultiLabelBinarizer()
Y=mlb.fit_transform(labels)
# check size of indicator matrix
Y.shape
# check classes
mlb.classes_
# check # of samples in each class
np.sum(Y, axis=0)
In [ ]:
# Exercise 5.7.3: Tokenize and pad the documents
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
# get a Keras tokenizer
MAX_NB_WORDS=8000
# documents are quite long in the dataset
MAX_DOC_LEN=1000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(text)
voc=tokenizer.word_index
# convert each document to a list of word index as a sequence
sequences = tokenizer.texts_to_sequences(text)
# get the mapping between words and word indexes
# pad/truncate all sequences to the same length (MAX_DOC_LEN)
padded_sequences = pad_sequences(sequences, \
                                 maxlen=MAX_DOC_LEN, \
                                 padding='post', truncating='post')
#print(padded_sequences[0])
In [ ]:
# Exercise 5.7.4: Fit the model using the function
from sklearn.model_selection import train_test_split
EMBEDDING_DIM=100
FILTER_SIZES=[2,3,4]
# set the number of output units
# as the number of classes
output_units_num=len(mlb.classes_)
num_filters=64
# set the dense units
dense_units_num= num_filters*len(FILTER_SIZES)
BATCH_SIZE = 64
NUM_EPOCHES = 20
# split the dataset into train (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(\
                padded_sequences, Y, test_size=0.2, random_state=0)
model=cnn_model(FILTER_SIZES, MAX_NB_WORDS, \
                MAX_DOC_LEN, \
                NUM_FILTERS=num_filters,\
                NUM_OUTPUT_UNITS=output_units_num, \
                NUM_DENSE_UNITS=dense_units_num)
earlyStopping=EarlyStopping(monitor='val_loss', patience=0, verbose=2, mode='min')
checkpoint = ModelCheckpoint(BEST_MODEL_FILEPATH, monitor='val_loss', \
                             verbose=2, save_best_only=True, mode='min')
training=model.fit(X_train, Y_train, \
                   batch_size=BATCH_SIZE, epochs=NUM_EPOCHES, \
                   callbacks=[earlyStopping, checkpoint],\
                   validation_data=[X_test, Y_test], verbose=2)
In [ ]:
# Exercise 5.7.5: predict using the best model
# and calculate performance
# load the best model
model.load_weights("best_model")
pred=model.predict(X_test)
pred[0:5]
In [ ]:
# Exercise 5.7.6: Generate performance report
from sklearn.metrics import classification_report
pred=np.where(pred>0.5, 1, 0)
print(classification_report(Y_test, pred,\
target_names=mlb.classes_))
5.8. Use Pretrained Word Vectors¶
• If the labeled sample set is small, it’s better to use pretrained word vectors
▪ e.g. google or facebook pretrained word vectors
▪ or you can train your own word vectors from relevant (even unlabeled) text using gensim
• Procedure:
1. Obtain/train pretrained word vectors (see Section 4.1 and Exercise 4.1.1)
2. Look up the word vector for each word in the vocabulary and create an embedding matrix, where row $i$ holds the vector of the word with index $i$
3. Initialize the embedding layer with the embedding matrix and set it to be non-trainable (static word vectors); a minimal sketch follows this list.
• With well-trained word vectors, often a small sample set can also achieve good performance
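Step 3 amounts to passing the precomputed matrix as the initial weights of the Keras Embedding layer and freezing it; a minimal sketch, assuming an embedding_matrix like the one built in the exercise below (note that cnn_model in Exercise 5.8 instead leaves trainable=True, so the vectors are fine-tuned):
from keras.layers import Embedding
# embedding layer initialized with a precomputed matrix and frozen
embed_static = Embedding(input_dim=MAX_NB_WORDS+1, \
                         output_dim=EMBEDDING_DIM, \
                         input_length=MAX_DOC_LEN, \
                         weights=[embedding_matrix], \
                         trainable=False, \
                         name='embedding_static')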
In [ ]:
# Exercise 5.8.1: Load full yahoo news dataset
# to train the word vector
# note this data can be unlabeled. only text is used
import json
data=json.load(open("../../../dataset/ydata_full.json",'r'))
text,labels=zip(*data)
text=list(text)
sentences=[ [token.strip(string.punctuation).strip() \
for token in nltk.word_tokenize(doc) \
if token not in string.punctuation and \
len(token.strip(string.punctuation).strip())>=2]\
for doc in text]
In [ ]:
# Exercise 5.8.2: Train word vector using
# the large data set
from gensim.models import word2vec
import logging
import pandas as pd
# print out tracking information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', \
                    level=logging.INFO)
EMBEDDING_DIM=200
# min_count: words with total frequency lower than this are ignored
# size: the dimension of word vector
# window: is the maximum distance
# between the current and predicted word
# within a sentence (i.e. the length of ngrams)
# workers: # of parallel threads in training
# for other parameters, check https://radimrehurek.com/gensim/models/word2vec.html
wv_model = word2vec.Word2Vec(sentences, \
min_count=5, \
size=EMBEDDING_DIM, \
window=5, workers=4 )
In [ ]:
# get word vector for all words in the vocabulary
# see reference at https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py
EMBEDDING_DIM=200
MAX_NB_WORDS=8000
# tokenizer.word_index provides the mapping
# between a word and word index for all words
NUM_WORDS = min(MAX_NB_WORDS, len(tokenizer.word_index))
# “+1” is for padding symbol
embedding_matrix = np.zeros((NUM_WORDS+1, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    # if the word index is above the max number of words, ignore it
    if i >= NUM_WORDS:
        continue
    if word in wv_model.wv:
        embedding_matrix[i]=wv_model.wv[word]
In [ ]:
# Exercise 5.8.3: Fit model using pretrained word vectors
from sklearn.model_selection import train_test_split
EMBEDDING_DIM=200
FILTER_SIZES=[2,3,4]
# set the number of output units
# as the number of classes
output_units_num=len(mlb.classes_)
#Number of filters for each size
num_filters=64
# set the dense units
dense_units_num= num_filters*len(FILTER_SIZES)
BATCH_SIZE = 32
NUM_EPOCHES = 100
# With well-trained word vectors, the sample size can be reduced
# Assume we only have 500 labeled samples
# split the dataset into train (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(\
                padded_sequences[0:500], Y[0:500], \
                test_size=0.2, random_state=0, \
                shuffle=True)
# create the model with the embedding matrix
model=cnn_model(FILTER_SIZES, MAX_NB_WORDS, \
                MAX_DOC_LEN, \
                NUM_FILTERS=num_filters,\
                NUM_OUTPUT_UNITS=output_units_num, \
                NUM_DENSE_UNITS=dense_units_num,\
                PRETRAINED_WORD_VECTOR=embedding_matrix)
earlyStopping=EarlyStopping(monitor='val_loss', patience=1, verbose=2, mode='min')
checkpoint = ModelCheckpoint(BEST_MODEL_FILEPATH, monitor='val_loss', \
                             verbose=2, save_best_only=True, mode='min')
training=model.fit(X_train, Y_train, \
                   batch_size=BATCH_SIZE, epochs=NUM_EPOCHES, \
                   callbacks=[earlyStopping, checkpoint],\
                   validation_data=[X_test, Y_test], verbose=2)
In [ ]:
# Exercise 5.8.4: check model configuration
# Note: in cnn_model the pretrained embedding layer is created
# with trainable=True, so the word vectors are fine-tuned;
# set trainable=False there to keep them static
model.summary()
In [ ]:
# Exercise 5.8.5: Performance evaluation
# Let’s use samples[500:1000]
# as an evaluation set
from sklearn.metrics import classification_report
pred=model.predict(padded_sequences[500:1000])
Y_pred=np.copy(pred)
Y_pred=np.where(Y_pred>0.5,1,0)
Y_pred[0:10]
Y[500:510]
print(classification_report(Y[500:1000], \
Y_pred, target_names=mlb.classes_))
Observations:
• Note that we only trained the model with 500 samples
• The performance is only slightly lower than that of the model trained with about 6,000 samples in Section 5.7
• This shows that pretrained word vectors can effectively improve classification performance when the labeled dataset is small
5.9. How to select hyperparameters?¶
• Fitting a neural network is a very empirical process
• See Section 3 of “Practical Recommendations for Gradient-Based Training of Deep Architectures” (https://arxiv.org/abs/1206.5533) for detailed discussion
• The following are some useful heuristics for setting
▪ MAX_NB_WORDS: the max number of words to be included in the word embedding
◦ Use the word frequency histogram to include words that appear at least $n$ times
▪ MAX_DOC_LEN: the max length of documents
◦ Use the document length histogram to keep as many documents complete (untruncated) as possible
In [ ]:
# Exercise 5.9.1 Set MAX_NB_WORDS to
# include words that appear at least K times
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# get count of each word
df=pd.DataFrame.from_dict(tokenizer.word_counts, \
                          orient="index")
df.columns=['freq']
print(df.head())
# get the histogram of word counts
df=df['freq'].value_counts().reset_index()
df.columns=['word_freq','count']
# sort by word frequency
df=df.sort_values(by='word_freq')
# convert absolute counts to percentage
df['percent']=df['count']/len(tokenizer.word_counts)
# get cumulative percentage
df['cumsum']=df['percent'].cumsum()
print(df.head())
df.iloc[0:50].plot(x='word_freq', y='cumsum');
plt.show();
# if set min count for word to 10,
# what % of words can be included?
# how many words will be included?
# This is the parameter MAX_NB_WORDS
# tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
In [ ]:
# Exercise 5.9.2 Set MAX_DOC_LEN to
# include complete sentences as many as possible
# create a series based on the length of all sentences
sen_len=pd.Series([len(item) for item in sequences])
# create histogram of sentence length
# the “index” is the sentence length
# “counts” is the count of sentences at a length
df=sen_len.value_counts().reset_index().sort_values(by='index')
df.columns=['sent_length','counts']
# sort by sentence length
# get percentage and cumulative percentage
df['percent']=df['counts']/len(sen_len)
df['cumsum']=df['percent'].cumsum()
print(df.head(3))
# From the plot, about 90% of sentences have length < 500,
# so it makes sense to set MAX_DOC_LEN=400~500
df.plot(x='sent_length', y='cumsum');
plt.show();
# what will be the minimum sentence length
# such that 99% of sentences will not be truncated?
6. Next, where to go?¶
• Recurrent Neural Networks (RNN)
• Applications of RNN:
▪ Language Modeling and Generating Text: given a sequence of words, we want to predict the probability of each word given the previous words
▪ Machine translation
▪ Reference: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
In [ ]: