COMP3220 — Document Processing and the Semantic Web
Week 05 Lecture 1: Processing Text Sequences
Diego Mollá
Department of Computer Science, Macquarie University
COMP3220 2021H1
Programme
1 Challenges of Text for Machine Learning
2 Word Embeddings
3 Text Sequences
Reading
Deep Learning book, chapter 6.
Understanding LSTM Networks, https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Additional Reading
Jurafsky & Martin, Chapter 9 (9.4 will be introduced in week 6)
Programme
1 Challenges of Text for Machine Learning
2 Word Embeddings
3 Text Sequences
Words as Arbitrary Symbols
Words are encoded as arbitrary symbols.
Within one language there is no clear correspondence between a word symbol and its meaning.
“dig” vs. “dog”: similar symbols, but unrelated meanings.
“car” vs. “automobile”: different symbols, but the same meaning.
Different languages may use different representations of the same word.
https://en.wikipedia.org/wiki/File:Hello_in_different_languages_word_cloud.jpeg
Ambiguities Everywhere
Language features ambiguity at multiple levels.
Lexical Ambiguity
Example from Google’s dictionary:
bank (n): the land alongside or sloping down a river or lake.
bank (n): financial establishment that uses money deposited by customers for investment, . . .
bank (v): form into a mass or mound.
bank (v): build (a road, railway, or sports track) higher at the outer edge of a bend to facilitate fast cornering.
…
So many words!
Any language features a large number of distinct words. New words are coined.
Words change their use in time.
There are also names, numbers, dates… an infinite number.
https://trends.google.com
Long-distance Dependencies
Sentences are sequences of words.
Words close in the sentence are often related.
But sometimes there are relations between words far apart.
grammatical: “The man living upstairs is very cheerful” / “The people living upstairs are very cheerful”
knowledge: “I was born in France and I speak fluent French”
reference: “I bought a book from the shopkeeper and I liked it”
Programme
1 Challenges of Text for Machine Learning
2 Word Embeddings
3 Text Sequences
Word Embeddings
First introduced in 2013, word embeddings are nowadays one of the most common ingredients in text processing systems.
Word embeddings address the problem of word representation by mapping each word to a dense vector of real numbers.
Words that appear in similar contexts are mapped to similar vectors.
Embeddings are learnt from large amounts of unlabelled training data.
https://www.tensorflow.org/tutorials/representation/word2vec
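To see this concretely, here is a small sketch that is not part of the lecture materials: it uses the gensim library and its downloadable pre-trained GloVe vectors, and assumes gensim is installed and the vectors can be downloaded.

import gensim.downloader as api

# Download 50-dimensional GloVe vectors trained on Wikipedia + Gigaword.
wv = api.load("glove-wiki-gigaword-50")

print(wv.similarity("car", "automobile"))  # high: the words appear in similar contexts
print(wv.similarity("dig", "dog"))         # lower, despite the similar spelling
print(wv.most_similar("bank", topn=3))     # nearest neighbours in the embedding space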
One-hot vs. word embeddings
One-hot
Sparse
Binary values (typically)
High-dimensional
Hard-coded
Word embeddings
Dense
Continuous values
Lower-dimensional
Learned from data
Two Ways to Obtain Word Embeddings
1 Learn the word embeddings jointly with the task you care about (e.g. document classification).
2 Use pre-trained word embeddings.
Learning Word Embeddings
You can add a dense layer as the first layer of your network and let the system learn the optimal weights.
This approach is so useful and common that many deep learning frameworks define an “embedding” layer that facilitates this.
The input to the “embedding” layer is the word index. The output is the word embedding.
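For example, a minimal Keras sketch where the embedding layer is learnt jointly with a binary document classifier; the vocabulary size, embedding dimension and sequence length below are arbitrary choices for illustration.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    tf.keras.Input(shape=(100,), dtype="int32"),      # sequences of 100 word indices
    layers.Embedding(input_dim=10000, output_dim=8),  # 10,000-word vocabulary, 8-dim embeddings
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),            # e.g. positive vs negative document
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

The embedding weights are updated by backpropagation together with the classifier weights.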
Embedding Layer as a Dense Layer
The input of the dense layer is the one-hot encoding of the word
[Figure: A Dense Layer. A one-hot input vector (0, 0, 1, 0) is fed into a dense layer with units h11, h12, h13, producing the outputs o1 and o2.]
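A minimal NumPy sketch of this equivalence, with made-up dimensions and weights and omitting the bias and activation: applying a dense layer to a one-hot vector simply selects one row of the weight matrix, which is exactly the lookup an embedding layer performs.

import numpy as np

vocab_size, embedding_dim = 5, 3
W = np.random.rand(vocab_size, embedding_dim)  # dense-layer weights = embedding matrix

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

dense_output = one_hot @ W      # dense layer (no bias, no activation) on the one-hot vector
lookup_output = W[word_index]   # row lookup, i.e. what an embedding layer does

assert np.allclose(dense_output, lookup_output)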
Embedding Layer in Keras
The input to a Keras embedding layer is a sequence of word indices, which are internally treated as their one-hot representations and passed through the dense layer.
[Figure: A Keras Embedding Layer (for one word). The input is the word index 2, which is internally treated as the one-hot vector (0, 0, 1, 0) and passed through the dense units h11, h12, h13 to produce the output embedding (o1, o2).]
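A small sketch of this behaviour in Keras, with arbitrary vocabulary size and embedding dimension: the layer is given word indices directly, not one-hot vectors.

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=5, output_dim=2)
print(embedding(tf.constant([2])))          # the 2-dimensional embedding of word index 2
print(embedding(tf.constant([[2, 0, 3]])))  # a sequence of indices: shape (1, 3, 2), one vector per word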
Processing Sequences of Words in Keras
The input of a Keras embedding layer is a sequence of word indices; the output is a sequence of word embeddings.
Since the layer processes a batch of samples at a time, all sequences in a batch must have the same number of words.
Keras provides a function, pad_sequences, to trim sequences of word indices or pad them to a fixed length.
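A small sketch of pad_sequences (the index sequences are made up): shorter sequences are padded with zeros and longer ones are truncated so that every sequence in the batch has the same length.

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[4, 10, 7], [2], [9, 3, 8, 1, 6]]
padded = pad_sequences(sequences, maxlen=4)
print(padded)
# [[ 0  4 10  7]
#  [ 0  0  0  2]
#  [ 3  8  1  6]]

By default, padding and truncation happen at the start of each sequence; this can be changed with the padding and truncating arguments.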
Using pre-trained word embeddings
The Problem: Data Sparsity
Sometimes we have so little training data that many words are poorly represented.
Often, words in the test data do not occur in the training data.
For these unseen words we would not be able to learn the embeddings.
A Solution: Pre-training
Pre-trained word embeddings (e.g. word2vec, GloVe), computed over large vocabularies from very large data sets, are publicly available.
We can then use these pre-trained embeddings to map from the word index to the word embedding.
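A hedged sketch of how this is typically done in Keras, in the spirit of the pre-trained-embeddings notebook: it assumes you have downloaded the GloVe file glove.6B.100d.txt, and the small word_index dictionary below stands in for the one a Keras Tokenizer would produce.

import numpy as np
import tensorflow as tf

max_words, embedding_dim = 10000, 100
word_index = {"the": 1, "cat": 2, "sat": 3}   # hypothetical; normally tokenizer.word_index

# Parse the pre-trained GloVe vectors into a dictionary: word -> vector.
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build a matrix whose i-th row is the pre-trained vector of the word with index i.
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

# Initialise an Embedding layer with the matrix and freeze it so training does not change it.
embedding_layer = tf.keras.layers.Embedding(
    max_words, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)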
Using Word Embeddings in Keras
The following notebook is based on the Jupyter notebooks provided by the Deep Learning book: https://github.com/fchollet/deep-learning-with-python-notebooks
Using word embeddings.
The notebook illustrates how you can use an embeddings layer for text classification, and how to load pre-trained word embeddings.
This notebook is important because it also illustrates Keras’ text tokenisation techniques.
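A quick sketch of the Keras tokenisation utilities used in the notebook (the two example texts are made up):

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["I can kick the can", "The people living upstairs are very cheerful"]

tokenizer = Tokenizer(num_words=1000)             # keep at most the 1,000 most frequent words
tokenizer.fit_on_texts(texts)                     # build the word -> index mapping
sequences = tokenizer.texts_to_sequences(texts)   # each text becomes a list of word indices

print(tokenizer.word_index)
print(sequences)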
Programme
1 Challenges of Text for Machine Learning
2 Word Embeddings
3 Text Sequences
Handling Text Sequences
A document is a sequence of words.
Many document representations are based on a bag-of-words approach.
Word order is ignored.
The context around a word is ignored.
Even word embeddings ignore word order.
Why context matters
“I can1 kick the can2”
The meaning of “can1” is different from that of “can2”.
“can1” and “can2” should have different word embeddings. We can tell the meaning because of the context:
“I can kick …” “…kick the can”
Recurrent Neural Networks
A Recurrent Neural Network (RNN) is designed to process sequences.
An RNN is a neural network composed of RNN cells. Each RNN cell takes two pieces of information as input:
1 A vector representing an item xi in the sequence.
2 The state resulting from processing the previous items.
The output of the RNN cell is a state that can be fed to the next cell in the sequence.
All cells in an RNN chain share the same parameters.
In a sense, the same cell is applied to every word in the sequence, but now the context (the incoming state) also matters.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Example: Dense layer vs RNN
[Figure: a dense layer processes each input vector (x1, x2) independently through the same hidden units, whereas the RNN feeds the input vectors to its cells (rnn1, rnn2) one at a time, with each cell passing its state on to the next.]
A Simple Recurrent Neural Network
A simple RNN cell (a “vanilla RNN”) has just a dense layer with an activation function (the hyperbolic tangent, “tanh”).
Vanilla RNN cells have been used since the 1990s.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
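A minimal NumPy sketch of such a cell (dimensions and random weights are made up for illustration): at every step the same weights are applied to the current input vector and the previous state.

import numpy as np

def vanilla_rnn(inputs, W, U, b):
    state = np.zeros(U.shape[0])                # initial state
    for x in inputs:                            # one step per item in the sequence
        state = np.tanh(W @ x + U @ state + b)  # dense layer over input and previous state
    return state                                # state after the last item

rng = np.random.default_rng(0)
input_dim, state_dim, seq_len = 4, 3, 5
W = rng.normal(size=(state_dim, input_dim))
U = rng.normal(size=(state_dim, state_dim))
b = np.zeros(state_dim)

inputs = rng.normal(size=(seq_len, input_dim))  # e.g. five word embeddings
print(vanilla_rnn(inputs, W, U, b))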
LSTMs and GRUs
Vanilla RNN cells are too simple: they do not handle long-distance dependencies well.
More complex RNN cells have been designed specifically to address this issue.
The most popular RNN cells currently are:
LSTM: Long Short-Term Memory (pictured in the blog post linked below).
GRU: Gated Recurrent Unit, a more recent and simpler cell.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
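In Keras the three cell types are drop-in replacements for each other; a small sketch with arbitrary dimensions:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 10, 8))       # batch of 2 sequences, 10 steps, 8-dimensional inputs

print(layers.SimpleRNN(32)(x).shape)   # vanilla RNN            -> (2, 32)
print(layers.LSTM(32)(x).shape)        # Long Short-Term Memory -> (2, 32)
print(layers.GRU(32)(x).shape)         # Gated Recurrent Unit   -> (2, 32)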
RNNs in Practice
Most deep learning frameworks include special layers for RNNs.
When you use an RNN layer, you have the option to specify the type of RNN cell.
You often have the option to use the state of the last cell, or the state of all cells.
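In Keras, for example, this choice is made with the return_sequences argument; a sketch with arbitrary dimensions:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 10, 8))                        # 2 sequences of 10 input vectors
last_state = layers.LSTM(32)(x)                         # state of the last cell only
all_states = layers.LSTM(32, return_sequences=True)(x)  # state of every cell

print(last_state.shape)   # (2, 32)
print(all_states.shape)   # (2, 10, 32)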
Recurrent Neural Networks in Keras
The following notebook is based on the Jupyter notebooks provided by the Deep Learning book: https://github.com/fchollet/deep-learning-with-python-notebooks
Understanding Recurrent Neural Networks.
The notebook illustrates how you can use recurrent layers, such as SimpleRNN and LSTM, for text classification.
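A minimal sketch in the spirit of that notebook, with arbitrary vocabulary size, embedding dimension and number of units: an embedding layer followed by an LSTM and a sigmoid output for binary text classification.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),     # sequences of word indices, any length
    layers.Embedding(input_dim=10000, output_dim=32),
    layers.LSTM(32),                                  # only the final state is passed on
    layers.Dense(1, activation="sigmoid"),            # e.g. positive vs negative review
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()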
Final Note: Contextualised Word Embeddings!
Recent research has devised ways to combine RNNs and word embeddings to produce context-dependent word embeddings. The resulting systems are beating the state of the art in many applications!
http://jalammar.github.io/illustrated-bert/
Take-home Messages
1 Explain some of the fundamental challenges that plain text poses for machine learning.
2 Apply word embeddings in deep learning.
3 Use recurrent neural networks for text classification.
What’s Next
Week 6
Advanced topics in deep learning
Reading: Deep Learning book, chapter 8.1
Additional reading: Jurafsky & Martin, Chapter 9