COMP5046
Natural Language Processing
Lecture 2: Word Embeddings and Representation
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
LECTURE PLAN
Lecture 2: Word Embeddings and Representation
1. Lab Info
2. Previous Lecture Review
   1. Word Meaning and WordNet
   2. Count based Word Representation
3. Prediction based Word Representation
   1. Introduction to the concept 'Prediction'
   2. Word2Vec
   3. FastText
   4. GloVe
4. Next Week Preview
Info: Lab Exercise
What do we do during Labs?
In Labs, Students will use Google Colab
Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.
Info: Lab Exercise
Submissions
How to Submit
Students should submit the ".ipynb" file (download it from "File" > "Download .ipynb") to Canvas.
When and Where to Submit
Students must submit Lab 1 (for Week 2) by Week 3 Monday, 11:59 PM.
LECTURE PLAN
Lecture 2: Word Embeddings and Representation
1. Lab Info
2. Count-based Word Representation
   1. Word Meaning
   2. Limitations
3. Prediction based Word Representation
   1. Introduction to the concept 'Prediction'
   2. Word2Vec
   3. FastText
   4. GloVe
4. Next Week Preview
WORD REPRESENTATION
How to represent the meaning of the word?
Definition: meaning (Collins dictionary).
• the idea that it represents, and which can be explained using other words.
• the thoughts or ideas that are intended to be expressed by it.
Commonest linguistic way of thinking of meaning:
signifier (symbol) ⟺ signified (idea or thing) = denotation
e.g. "Computer" ⟺ the idea of a computer; "Apple" ⟺ the idea of an apple
(as symbols, the words are just character sequences: "computer", "apple" encoded in Unicode / UTF-8)
COUNT based WORD REPRESENTATION
Problem with one-hot vectors
Problem #1. No word similarity representation
Example: in web search, if a user searches for "Sydney motel", we would also like to match documents containing "Sydney Inn".

motel = [0 0 0 0 0 0 0 1 0 0 0 0 0 … 0]
hotel = [0 0 0 0 0 0 1 0 0 0 0 0 0 … 0]
Inn   = [0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]

There is no natural notion of similarity for one-hot vectors!

Problem #2. Inefficiency
Vector dimension = number of words in the vocabulary.
Each representation has only a single '1', with all remaining entries 0.
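To make Problem #1 concrete, here is a minimal NumPy sketch (the toy vocabulary is purely illustrative) showing that the dot product, and hence the cosine similarity, between any two distinct one-hot vectors is always zero:

import numpy as np

vocab = ["sydney", "motel", "hotel", "inn"]   # toy vocabulary (illustrative only)

def one_hot(word, vocab):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

motel, hotel = one_hot("motel", vocab), one_hot("hotel", vocab)

# The dot product of two different one-hot vectors is always 0,
# so cosine similarity gives no notion of relatedness between words.
print(np.dot(motel, hotel))   # 0.0
print(np.dot(motel, motel))   # 1.0 (a word only "matches" itself)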
COUNT based WORD REPRESENTATION
Problem with BoW (Bag of Words)
• The intuition is that documents are similar if they have similar content, and that from the content alone we can learn something about the meaning of the document.
• Discarding word order ignores context, and in turn the meaning of the words in the document (semantics). Context and meaning offer a lot to a model; if they were modeled, the model could tell the difference between the same words arranged differently ("this is interesting" vs "is this interesting").
S1 = "I love you but you hate me"
S2 = "I hate you but you love me"
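A quick sketch of this limitation, using Python's collections.Counter as a stand-in for a full bag-of-words pipeline: because word order is discarded, S1 and S2 get identical representations.

from collections import Counter

s1 = "I love you but you hate me"
s2 = "I hate you but you love me"

# Bag-of-words counts ignore word order entirely,
# so the two (very different) sentences get identical representations.
print(Counter(s1.lower().split()) == Counter(s2.lower().split()))   # True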
COUNT based WORD REPRESENTATION
Limitation of Term Frequency-Inverse Document Frequency (TF-IDF)

w_ij = tf_ij × log( N / (1 + df_i) )

w_ij = weight of term i in document j
tf_ij = number of occurrences of term i in document j
N = total number of documents
df_i = number of documents containing term i
• It computes document similarity directly in the word-count space, which may be slow for large vocabularies.
• It assumes that the counts of different words provide independent evidence of similarity.
• It makes no use of semantic similarities between words.
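A minimal sketch of the weighting formula above (assuming the natural logarithm; the base only changes the scale, and the toy numbers are illustrative):

import math

def tfidf_weight(tf_ij, N, df_i):
    """w_ij = tf_ij * log(N / (1 + df_i)), following the definitions above."""
    return tf_ij * math.log(N / (1 + df_i))

# Toy numbers: the term occurs 3 times in the document,
# the corpus has 1000 documents, and 9 of them contain the term.
print(tfidf_weight(tf_ij=3, N=1000, df_i=9))   # ≈ 13.8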
COUNT based WORD REPRESENTATION
Sparse Representation
With COUNT based word representation (especially the one-hot vector), linguistic information was represented with sparse representations (high-dimensional features).

motel = [0 0 0 0 0 0 0 1 0 0 0 0 0 … 0]
hotel = [0 0 0 0 0 0 1 0 0 0 0 0 0 … 0]
Inn   = [0 0 0 0 0 0 0 0 0 0 0 0 0 … 1]

A Significant Improvement Required!
1. How to get a low-dimensional vector representation
2. How to represent word similarity

Maybe a low-dimensional vector? Can we use a list of fixed numbers (properties) to represent the word?
LECTURE PLAN
Lecture 2: Word Embeddings and Representation
1. Lab Info
2. Previous Lecture Review
   1. Word Meaning and WordNet
   2. Count based Word Representation
3. Prediction based Word Representation
   1. Word Embedding
   2. Word2Vec
   3. FastText
   4. GloVe
4. Next Week Preview
Prediction based Word representation
How to Represent the Word Similarity!
• How to represent word similarity with a dense vector
• Try this with word2vec
Reference: http://turbomaze.github.io/word2vecjson/
Prediction based Word representation
Let’s make the word representation
We need to…
1. Have a fixed, low-dimensional vector representation
2. Represent word similarity
Maybe a low-dimensional vector? What if we use a list of fixed numbers (properties) to represent the word?
Prediction based Word representation
Let’s get familiar with using vectors to represent things
Assume that you are taking a personality test (the Big Five Personality Traits test):
1) Openness, 2) Agreeableness, 3) Conscientiousness, 4) Negative emotionality, 5) Extraversion
[Figure: Jane's Openness score, about 40 on a 0-100 scale]
https://openpsychometrics.org/tests/IPIP-BFFM/
Prediction based Word representation
Let's get familiar with using vectors to represent things
Assume that you are taking a personality test (the Big Five Personality Traits test):
1) Openness, 2) Agreeableness, 3) Conscientiousness, 4) Negative emotionality, 5) Extraversion
[Figure: Jane plotted on two axes, Openness (about 40) and Agreeableness (about 70), each on a 0-100 scale]
Prediction based Word representation
Let’s get familiar with using vectors to represent things
Assume that you are taking a personality test (the Big Five Personality Traits test):
1) Openness, 2) Agreeableness, 3) Conscientiousness, 4) Negative emotionality, 5) Extraversion

Normalised scores (Openness, Agreeableness):
Jane = [0.4, 0.7]
Mark = [0.3, 0.2]
Eve  = [0.4, 0.6]
[Figure: Jane, Mark and Eve plotted in the 2-D Openness-Agreeableness space]
Prediction based Word representation
Let’s get familiar with using vectors to represent things
Which of the two people (Mark or Eve) is more similar to Jane?
Cosine Similarity
A measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.
[Figure: Jane, Mark and Eve in the 2-D Openness-Agreeableness space]
Prediction based Word representation
Let’s get familiar with using vectors to represent things
Which of the two people (Mark or Eve) is more similar to Jane?

Jane = [0.4, 0.7]
Mark = [0.3, 0.2]
Eve  = [0.4, 0.6]

cos(Jane, Mark) ≈ 0.89
cos(Jane, Eve) ≈ 0.99

https://onlinemschool.com/math/assistance/vector/angl/
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
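As a quick check on these numbers, here is a minimal cosine-similarity sketch (NumPy only; the 2-D vectors are the Openness/Agreeableness scores read off the figure above):

import numpy as np

def cosine(a, b):
    """cos(a, b) = a·b / (||a|| * ||b||)"""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

jane, mark, eve = [0.4, 0.7], [0.3, 0.2], [0.4, 0.6]
print(f"{cosine(jane, mark):.3f}")   # 0.894 (slide: ≈ 0.89)
print(f"{cosine(jane, eve):.3f}")    # 0.998 (slide: ≈ 0.99)  -> Eve is more similar to Jane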
Prediction based Word representation
Let’s get familiar with using vectors to represent things
We need all five major factors to represent the personality:

Jane = [0.4, 0.7, 0.5, 0.2, 0.1]
Mark = [0.3, 0.2, 0.3, 0.7, 0.2]
Eve  = [0.4, 0.6, 0.4, 0.3, 0.5]

With these embeddings, we can:
1. Represent things as vectors of fixed numbers!
2. Easily calculate the similarity between vectors
Prediction based Word representation
Remember? The Word2Vec Demo!
This is a word embedding for the word "king"
* Trained on Wikipedia data; 50-dimensional GloVe vector
king ≈ [0.50451, 0.68607, -0.59517, -0.022801, 0.60046, …]  (50 values in total, visualised as a colour-coded band on the slide)
Prediction based Word representation
Remember? The Word2Vec Demo!
Compare with Woman, Man, King, and Queen
[Figure: the colour-coded vectors for woman, man, king and queen side by side]
Prediction based Word representation
Remember? The Word2Vec Demo!
Compare with Woman, Man, King, Queen, and Water
[Figure: the same comparison with water added for contrast]
Prediction based Word representation
Remember? The Word2Vec Demo!
Word Algebra
[Figure: the vectors for woman, man, king, queen, and king - man + woman]
king - man + woman ≈ queen?
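This word algebra can be tried directly with pretrained vectors. A sketch using gensim's downloader API and its "glove-wiki-gigaword-50" vectors (one of gensim's distributed pretrained sets; the exact ranking depends on the vectors used):

import gensim.downloader as api

# Downloads the 50-d GloVe vectors (trained on Wikipedia + Gigaword) on first use.
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is expected to appear at or near the top of the list.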
Prediction based Word representation
How to make dense vectors for word representation
Prof. John Rupert Firth
Distributional Hypothesis
“You shall know a word by the company it keeps”
— (Firth, J. R. 1957:11)
Prof. Firth is noted for drawing attention to the context-dependent nature of meaning with his notion of ‘context of situation’, and his work on collocational meaning is widely acknowledged in the field of distributional semantics.
Prediction based Word representation
Word Representations in context
• When a word w appears in a text, its context is the set of words that appear nearby.
• Use the surrounding contexts of w to build up a representation of w.
These context words will represent Sydney
Prediction based Word representation
How can we teach word representations to a machine?
Neural Networks! (Machine Learning)
These context words will represent Sydney
Brief in Machine Learning!
Machine Learning
How do we classify this with a machine?
[Figure: a photo of a cat, labelled "Object: CAT"]
Brief in Machine Learning!
Computer System

def prediction(image):
    # ...program (hand-written rules)...
    return result

Data → Computer System → Result ("CAT!!")
Brief in Machine Learning!
Can we classify this with the computer system?
[Figure: three images, each labelled "Object: ???"]
Brief in Machine Learning!
Computer System VS Machine Learning

Computer System:
    Data → def prediction(image): ...program... → Result

Machine Learning:
    Data + Result → training → Pattern
    e.g. Image 1: Dog, Image 2: Cat, Image 3: Dog, Image 4: Cat, Image 5: Dog, …

x_i (Input): words (indices or vectors), sentences, documents, etc.
y_i (class): what we try to classify/predict
Brief in Machine Learning!
Neural Network and Deep Learning
Neuron and Perceptron
[Figure: a biological neuron alongside a perceptron, which maps weighted inputs to an output]
NOTE: The detailed neural network and deep learning concepts will be covered in Lecture 3.
Prediction based Word representation
Neural Network and Deep Learning in Word Representation
“You shall know a word by the company it keeps”
(Firth, J. R. 1957:11)
Why don't we train a word by the company it keeps?
Why don't we represent a word by the company it keeps?
[Figure: a perceptron with "the company it keeps" as Input and "a word" as Output]
Prediction based Word representation
Neural Network and Deep Learning in Word Representation
Wikipedia: “Sydney is the state capital of NSW…”
[Figure: a perceptron with "the company it keeps" (context) as Input, "a word" as Output, and learned weights in between]
Prediction based Word representation
Neural Network and Deep Learning in Word Representation
Wikipedia: “Sydney is the state capital of NSW…”
[Figure: the learned weights linking "the company it keeps" to "a word" become the word representation]
Prediction based Word representation
Neural Network and Deep Learning in Word Representation
Wikipedia: “Sydney is the state capital of NSW…”
Word2Vec
[Figure: context words and the centre word; the learned weights give the word representation]
Prediction based Word representation
Word2Vec
Word2vec can utilize either of two model architectures to produce a distributed representation of words:
1. Continuous Bag of Words (CBOW): predict the center word from (a bag of) context words.
2. Continuous Skip-gram: predict context ("outside") words given the center word.
Prediction based Word representation
Word2Vec with Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”
Aim
• Predict the center word
Setup
• Window size: assume that the window size is 2
[Figure: a window of size 2 sliding across "Sydney is the state capital of NSW"; at each position, the center word and its context ("outside") words are highlighted]
Prediction based Word representation
Word2Vec with Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”
Using window slicing, develop the training data (window size = 2):

Center word                  Context ("outside") words
Sydney  [1,0,0,0,0,0,0]      is [0,1,0,0,0,0,0], the [0,0,1,0,0,0,0]
is      [0,1,0,0,0,0,0]      Sydney [1,0,0,0,0,0,0], the [0,0,1,0,0,0,0], state [0,0,0,1,0,0,0]
the     [0,0,1,0,0,0,0]      Sydney [1,0,0,0,0,0,0], is [0,1,0,0,0,0,0], state [0,0,0,1,0,0,0], capital [0,0,0,0,1,0,0]
state   [0,0,0,1,0,0,0]      is [0,1,0,0,0,0,0], the [0,0,1,0,0,0,0], capital [0,0,0,0,1,0,0], of [0,0,0,0,0,1,0]
capital [0,0,0,0,1,0,0]      the [0,0,1,0,0,0,0], state [0,0,0,1,0,0,0], of [0,0,0,0,0,1,0], NSW [0,0,0,0,0,0,1]
of      [0,0,0,0,0,1,0]      state [0,0,0,1,0,0,0], capital [0,0,0,0,1,0,0], NSW [0,0,0,0,0,0,1]
NSW     [0,0,0,0,0,0,1]      capital [0,0,0,0,1,0,0], of [0,0,0,0,0,1,0]
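The same window slicing can be sketched in a few lines of Python (illustrative only; the vocabulary indices simply follow the word order of the example sentence):

import numpy as np

sentence = "Sydney is the state capital of NSW".split()
vocab = {w: i for i, w in enumerate(sentence)}   # each word occurs once here

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def cbow_pairs(tokens, window=2):
    """Yield (context one-hot vectors, center one-hot vector) pairs by window slicing."""
    for c, center in enumerate(tokens):
        context = tokens[max(0, c - window):c] + tokens[c + 1:c + 1 + window]
        yield [one_hot(w) for w in context], one_hot(center)

for context, center in cbow_pairs(sentence):
    print(sentence[int(center.argmax())], "<-", [sentence[int(v.argmax())] for v in context])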
Prediction based Word representation
CBOW – Neural Network Architecture
Predict the center word from (a bag of) context words.
Sentence: "Sydney is the state capital of NSW"

Input layer: the context words is, the, capital, of (one-hot vectors)
Projection layer: shared weight matrix W (V×N)
Output layer: weight matrix W′ (N×V) → the centre word, state (one-hot vector)
Prediction based Word representation
CBOW – Neural Network Architecture
Predict the center word from (a bag of) context words.
Sentence: "Sydney is the state capital of NSW"

Projection layer: N dimensions, where N = the dimension of the word embedding (representation) you would like to set up.
• Each context word x is projected with the shared matrix W (V×N), e.g. v_the = x_the W
• The projections are averaged:
  v̂ = (v_is + v_the + v_capital + v_of) / 2m    (2m = number of context words, window size m = 2)
Prediction based Word representation
CBOW – Neural Network Architecture
Predict center word from (bag of) context words
Sentence: “Sydney is the state capital of NSW”
Input layer → Projection layer → Output layer

z = W′ × v̂    (one score per vocabulary word)
Predicted result: compared with the centre word, state (one-hot vector)

Softmax: outputs a vector that represents a probability distribution (summing to 1) over a list of potential outcomes.
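A minimal sketch of the softmax described above (NumPy; the scores are arbitrary toy values):

import numpy as np

def softmax(z):
    """Turn a vector of scores into a probability distribution (sums to 1)."""
    e = np.exp(z - z.max())    # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])
print(softmax(scores), softmax(scores).sum())   # probabilities, then 1.0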
Prediction based Word representation
CBOW – Neural Network Architecture
Predict the center word from (a bag of) context words.
Sentence: "Sydney is the state capital of NSW"

The predicted result ŷ = softmax(W′ × v̂) is compared with the true centre word, state (one-hot vector), using a loss function.

Cross Entropy: can be used as the loss function when optimizing classification models.
The error is then propagated back through the network to update W and W′ (BACK PROPAGATION).
*Back propagation and the optimization function will be covered in more detail in Lecture 3.
Prediction based Word representation
CBOW – Neural Network Architecture
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
1. Initialise each word in one-hot vector form:
   x_k = [0, …, 0, 1, 0, …, 0] ∈ R^|V|
2. Use the 2m context words (window size = m) as input to the Word2Vec-CBOW model:
   (x_{c-m}, x_{c-m+1}, …, x_{c-1}, x_{c+1}, …, x_{c+m-1}, x_{c+m}), each ∈ R^|V|
3. The model has two parameter matrices:
   1) Parameter matrix from the input layer to the hidden/projection layer: W ∈ R^{V×N}
   2) Parameter matrix to the output layer: W′ ∈ R^{N×V}
Prediction based Word representation
CBOW – Neural Network Architecture
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
4. The initial words are one-hot vectors, so multiplying a one-hot vector with W (V×N) gives a 1×N (embedded word) vector:
   e.g. [0 1 0 0] ×
        [[10,  2, 18],
         [15, 22,  3],
         [25, 11, 19],
         [ 4,  7, 22]]  =  [15, 22, 3]
   (v_{c-m} = x_{c-m} W, …, v_{c+m} = x_{c+m} W), each ∈ R^N
5. Average those 2m embedded vectors to calculate the value of the hidden layer:
   v̂ = (v_{c-m} + v_{c-m+1} + … + v_{c+m}) / 2m
Prediction based Word representation
CBOW – Neural Network Architecture
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
6. Calculate the score vector for the output layer (a higher score is produced when words are closer):
   z = W′ × v̂ ∈ R^|V|
7. Calculate the probabilities using softmax:
   ŷ = softmax(z) ∈ R^|V|
8. Train the parameter matrices using the objective function (cross entropy), focusing on minimising its value:
   H(ŷ, y) = − Σ_{j=1}^{|V|} y_j log(ŷ_j)
   Because y is a one-hot vector (one 1, the rest 0), only the term for the true centre word index j survives:
   H(ŷ, y) = − y_j log(ŷ_j) = − log(ŷ_j)
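Putting steps 1-8 together, a minimal NumPy sketch of one CBOW forward pass and its cross-entropy loss (random weights, a toy vocabulary, and no back propagation, which is left to Lecture 3):

import numpy as np

rng = np.random.default_rng(0)

sentence = "Sydney is the state capital of NSW".split()
vocab = {w: i for i, w in enumerate(sentence)}
V, N = len(vocab), 10                      # vocabulary size, embedding dimension

W = rng.normal(scale=0.1, size=(V, N))     # input layer  -> projection layer
W_out = rng.normal(scale=0.1, size=(N, V)) # projection   -> output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One training example: centre word "state", context words (is, the, capital, of)
center = vocab["state"]
context = [vocab[w] for w in ("is", "the", "capital", "of")]

v_hat = W[context].mean(axis=0)   # average of the 2m projected context vectors
z = v_hat @ W_out                 # one score per vocabulary word
y_hat = softmax(z)                # predicted probability of each candidate centre word
loss = -np.log(y_hat[center])     # cross entropy against the one-hot target
print(loss)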
Prediction based Word representation
CBOW – Neural Network Architecture
Predict center word from (bag of) context words.
Summary of CBOW Training (Review your understanding with equations)
8-1. The optimization objective function can then be written as this cross-entropy loss, minimised over all (context, centre) training pairs.
*This optimization objective will be covered in more detail in Lecture 3.
ARE WE DONE YET?
Prediction based Word representation
Skip Gram
Predict context (“outside”) words (position independent) given center word
Sentence: “Sydney is the state capital of NSW”
P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)
… Sydney is the [state] capital of NSW …    (centre word w_t = "state", window size 2)
Prediction based Word representation
Skip Gram
Predict context (“outside”) words (position independent) given center word
Sentence: “Sydney is the state capital of NSW”
Input layer: the centre word, state (one-hot vector)
Projection layer: W (V×N)
Output layer: W′ (N×V) → the context words is, the, capital, of (one-hot vectors)
Prediction based Word representation
Skip Gram – Neural Network Architecture
Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)
1. Initialise the centre word in one-hot vector form:
   x_k = [0, …, 0, 1, 0, …, 0], x ∈ R^|V|
2. The model has two parameter matrices:
   1) Parameter matrix from the input layer to the hidden/projection layer: W ∈ R^{V×N}
   2) Parameter matrix to the output layer: W′ ∈ R^{N×V}
Prediction based Word representation
Skip Gram – Neural Network Architecture
Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)
3. The initial word is a one-hot vector, so multiplying it with W (V×N) gives a 1×N (embedded word) vector:
   e.g. [0 1 0 0] ×
        [[10,  2, 18],
         [15, 22,  3],
         [25, 11, 19],
         [ 4,  7, 22]]  =  [15, 22, 3]
   v_c = x_c W ∈ R^N    (there is only one input, the centre word)
4. Calculate the score vector for the output layer by multiplying by the parameter matrix W′:
   z = W′ v_c
Prediction based Word representation
Skip Gram – Neural Network Architecture
Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)
5. Calculate the probabilities using softmax:
   ŷ = softmax(z)
6. Calculate 2m probability vectors, as we need to predict 2m context words:
   ŷ_{c-m}, …, ŷ_{c-1}, ŷ_{c+1}, …, ŷ_{c+m}
   and compare them with the ground truth (one-hot vectors):
   y_{c-m}, …, y_{c-1}, y_{c+1}, …, y_{c+m}
Prediction based Word representation
Skip Gram – Neural Network Architecture
Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)
8. As in CBOW, we use an objective function to evaluate the model. A key difference here is that we invoke a Naïve Bayes assumption to break up the probabilities: a strong, naïve conditional-independence assumption that, given the centre word, all output (context) words are completely independent.
*This optimization objective will be covered in more detail in Lecture 3.
Prediction based Word representation
Skip Gram – Neural Network Architecture
Predict context (“outside”) words (position independent) given center word
Summary of Skip Gram Training (Review your understanding with equations)
8-1. With this objective function, we can compute the gradients with respect to the unknown parameters and, at each iteration, update them via Stochastic Gradient Descent (SGD).
*Stochastic Gradient Descent will be covered in detail in Lecture 3.
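For comparison with the CBOW sketch earlier, a minimal NumPy sketch of the skip-gram loss for one training position (toy vocabulary, random weights; each context word contributes its own cross-entropy term under the independence assumption above):

import numpy as np

rng = np.random.default_rng(0)
sentence = "Sydney is the state capital of NSW".split()
vocab = {w: i for i, w in enumerate(sentence)}
V, N = len(vocab), 10
W = rng.normal(scale=0.1, size=(V, N))
W_out = rng.normal(scale=0.1, size=(N, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Skip-gram: predict each context word independently from the centre word.
center = vocab["state"]
context = [vocab[w] for w in ("is", "the", "capital", "of")]

y_hat = softmax(W[center] @ W_out)              # one distribution over the vocabulary
loss = -sum(np.log(y_hat[o]) for o in context)  # sum of per-context-word cross entropies
print(loss)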
Prediction based Word representation
CBOW vs Skip-gram Overview
CBOW: predict the center word from (a bag of) context words.
Skip-gram: predict context words given the center word.
Prediction based Word representation
Key Parameter (1) for Training methods: Window Size
Different tasks are served better by different window sizes.
Smaller window sizes (2-15) lead to embeddings where a high similarity score between two embeddings indicates that the words are interchangeable.
Larger window sizes (15-50, or even more) lead to embeddings where similarity is more indicative of the relatedness of the words.
Prediction based Word representation
Key Parameter (2) for Training methods: Negative Samples
Note that the summation over |V| is computationally huge!
We add negative samples to our dataset - samples of words that are not neighbors.

Negative samples: 2
Input word | Output word | Target
eat        | mango       | 1
eat        | exam        | 0
eat        | tobacco     | 0

Negative samples: 5
Input word | Output word | Target
eat        | mango       | 1
eat        | exam        | 0
eat        | tobacco     | 0
eat        | pool        | 0
eat        | supervisor  | 0
…

*1 = appeared (a real neighbor), 0 = did not appear

The original paper prescribes 5-20 as a good number of negative samples. It also states that 2-5 seems to be enough when you have a large enough dataset.
Prediction based Word representation
Key Parameter (2) for Training methods: Negative Samples
The number of negative samples is another factor of the training process (see the table above: one positive pair plus 2 or 5 negative pairs per training example).

How to select the Negative Samples?
The "negative samples" are selected using a "unigram distribution", where more frequent words are more likely to be selected as negative samples.
The probability of picking a word w_i is the number of times w_i appears in the corpus, divided by the total number of word occurrences in the corpus.
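A sketch of this sampling scheme (standard-library Python; the toy corpus and the "eat"/"mango" pair are illustrative, and the released word2vec implementation additionally raises the counts to the 3/4 power, a detail omitted here):

import random
from collections import Counter

corpus = "i eat mango i eat apple you eat mango we study for the exam".split()
counts = Counter(corpus)
words, freqs = zip(*counts.items())

def with_negatives(positive_pair, k=2):
    """Return the positive (input, output, 1) pair plus k negative pairs whose
    output words are drawn from the unigram distribution (probability
    proportional to corpus frequency), as described on the slide."""
    inp, true_out = positive_pair
    negatives = []
    while len(negatives) < k:
        w = random.choices(words, weights=freqs)[0]
        if w not in (inp, true_out):            # skip the input word and the real neighbour
            negatives.append((inp, w, 0))       # target 0 = not a real neighbour
    return [(inp, true_out, 1)] + negatives

print(with_negatives(("eat", "mango"), k=2))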
Prediction based Word representation
Word2Vec Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
Idea:
• Have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
Prediction based Word representation
Let's try some Word2Vec!
Gensim: https://radimrehurek.com/gensim/models/word2vec.html
Resources: https://wit3.fbk.eu/ https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
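A minimal gensim training sketch to get started (toy sentences; keyword arguments follow the gensim 3.x API used in the lab examples below, where in gensim 4.0+ "size" was renamed to "vector_size"):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (in the lab you would use a real corpus,
# e.g. the TED transcripts available from https://wit3.fbk.eu/).
sentences = [
    ["sydney", "is", "the", "state", "capital", "of", "nsw"],
    ["canberra", "is", "the", "capital", "of", "australia"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram.
model = Word2Vec(sentences=sentences, size=50, window=2, min_count=1, sg=0)
print(model.wv.most_similar("sydney", topn=3))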
Prediction based Word representation
Limitation of Word2Vec
Issue #1: Cannot cover morphological similarity
• Word2vec represents every word as an independent vector, even though many words are morphologically similar, like: teach, teacher, teaching.
Issue #2: Hard to embed rare words
• Word2vec is based on the distributional hypothesis: it works well for frequent words, but does not embed rare words well (a similar concept to under-fitting in machine learning).
Issue #3: Cannot handle Out-of-Vocabulary (OOV) words
• Word2vec does not work at all if a word is not included in the vocabulary.
Prediction based Word representation
FastText
• Deals with these Word2Vec limitations
• Another way to transform WORDS into VECTORS
• FastText is a library for learning word embeddings and text classification, created by Facebook's AI Research lab. The model allows one to create unsupervised or supervised learning algorithms for obtaining vector representations of words.
• An extension of Word2Vec
• Instead of feeding individual words into the neural network, FastText breaks words into several character n-grams (sub-words).
https://fasttext.cc/
Prediction based Word representation
FastText with N-gram Embeddings
• N-grams are simply all combinations of adjacent words or letters of length n that you can find in your source text. For example, given the word apple, all 2-grams (or "bigrams") are ap, pp, pl, and le.
• The tri-grams (n=3) for the word apple are app, ppl, and ple (ignoring the starting and ending word boundaries). The word embedding vector for apple will be the sum of the vectors of all these n-grams.
  apple → app, ppl, ple
• After training the neural network (with either skip-gram or CBOW), we will have word embeddings for all the n-grams seen in the training dataset.
• Rare words can now be properly represented, since it is highly likely that some of their n-grams also appear in other words.
https://fasttext.cc/
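A sketch of the sub-word decomposition (this simplified version ignores word-boundary markers, whereas FastText itself adds '<' and '>' symbols and uses several n-gram lengths):

def char_ngrams(word, n=3):
    """All character n-grams of a word, ignoring boundary markers as in the
    simplified description above (FastText itself also adds '<' and '>'
    boundary symbols and uses a range of n-gram lengths)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple", 2))   # ['ap', 'pp', 'pl', 'le']
print(char_ngrams("apple", 3))   # ['app', 'ppl', 'ple']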
Prediction based Word representation
Word2Vec VS FastText

Find synonyms with Word2Vec

import pprint
from gensim.models import Word2Vec

# result = a list of tokenised sentences prepared earlier in the lab
cbow_model = Word2Vec(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)
a = cbow_model.wv.most_similar("electrofishing")
pprint.pprint(a)

Find synonyms with FastText

from gensim.models import FastText

FT_model = FastText(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)
a = FT_model.wv.most_similar("electrofishing")
pprint.pprint(a)

If "electrofishing" is rare enough to be missing from the Word2Vec vocabulary, most_similar raises a KeyError, while FastText can still build a vector for it from its character n-grams.

https://fasttext.cc/
Prediction based Word representation
Global Vectors (GloVe)
• Deals with another Word2Vec limitation:
"Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts." (Pennington et al., 2014)
• Focus on co-occurrence
  e.g. P(k | i), where k = context word, i = centre word
https://nlp.stanford.edu/projects/glove/
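A toy sketch of the global co-occurrence counts that GloVe starts from (window size and corpus chosen for illustration; GloVe then fits word vectors whose dot products approximate the logarithm of these co-occurrence statistics):

from collections import Counter, defaultdict

corpus = [["sydney", "is", "the", "state", "capital", "of", "nsw"],
          ["canberra", "is", "the", "capital", "of", "australia"]]
window = 2

cooc = defaultdict(Counter)   # cooc[i][k] = X_ik, the global co-occurrence count
for sent in corpus:
    for idx, centre in enumerate(sent):
        for ctx in sent[max(0, idx - window):idx] + sent[idx + 1:idx + 1 + window]:
            cooc[centre][ctx] += 1

def p(k, i):
    """P(k | i) = X_ik / X_i, the co-occurrence probability GloVe builds on."""
    return cooc[i][k] / sum(cooc[i].values())

print(p("capital", "state"))   # 0.25 in this toy corpus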
Prediction based Word representation
Limitation of Prediction based Word Representation
• e.g. "I like apple", "I like banana", "I like fruit"
• The training dataset is reflected in the word representation result: the word similarity of 'software' learned from the Google News corpus can be different from the one learned from Twitter.
https://nlp.stanford.edu/projects/glove/
NEXT WEEK PREVIEW…
Word Embeddings
• Finalisation!
Machine Learning / Deep Learning for Natural Language Processing
Reference
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
• Manning, C. (2017). Introduction and Word Vectors, Natural Language Processing with Deep Learning, lecture notes, Stanford University.
• Images: http://jalammar.github.io/illustrated-word2vec/
• Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4(1), 26.
Word2vec
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
FastText
• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
• Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.