Lab 02
Word2Vec
In [ ]:
import pprint
import re
# For parsing our XML data
from lxml import etree
# For data processing
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
# For implementing the word2vec family of algorithms
from gensim.models import Word2Vec
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Download data from Google Drive
For today’s lab we will download and use the TED script data from Google Drive.
Google Drive Access Setup
Running the following code generates a link and a field for entering a verification code.
Click the link; it will take you to the Google Sign-In page. Sign in with your own Google account by following the instructions on the page.
Then copy the generated verification code from that page into the verification code field and press Enter.
In [ ]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
Downloading TED Scripts from Google Drive
Click the "Files" tab on the left-hand side to check that the file has been downloaded successfully.
In [ ]:
id = '1B47OiEiG2Lo1jUY6hy_zMmHBxfKQuJ8-'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('ted_en-20160408.xml')
Data Preprocessing
In [ ]:
targetXML = open('ted_en-20160408.xml', 'r', encoding='UTF8')
# Getting the contents of the <content> tags in the XML file
target_text = etree.parse(targetXML)
parse_text = '\n'.join(target_text.xpath('//content/text()'))
# Removing "sound-effect labels" using a regular expression (regex), e.g. (Audio), (Laughter)
content_text = re.sub(r'\([^)]*\)', '', parse_text)
# Tokenising the sentence to process it by using NLTK library
sent_text=sent_tokenize(content_text)
# Removing punctuation and changing all characters to lower case
normalized_text = []
for string in sent_text:
    tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
    normalized_text.append(tokens)
# Tokenising each sentence into individual words
sentences = [word_tokenize(sentence) for sentence in normalized_text]
# Print only the first 10 (tokenised) sentences
print(sentences[:10])
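To sanity-check the sound-effect regex used above, you can run it on a small made-up string (the example line below is purely illustrative):
In [ ]:
# Quick illustrative check of the sound-effect regex on a made-up line
sample = "Thank you so much. (Applause) (Laughter) Let's begin."
print(re.sub(r'\([^)]*\)', '', sample))  # the parenthesised labels are removed, leaving extra spaces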
Word2Vec – Continuous Bag-Of-Words (CBOW)
For more details about gensim.models.word2vec, you can refer to the API for Gensim Word2Vec.
In [ ]:
# Initialize and train a word2vec model with the following parameters:
# sentences: iterable of iterables, i.e. the list of lists of tokens from our data
# size: dimensionality of the word vectors
# window: window size
# min_count: ignores all words with total frequency lower than the specified count value
# workers: Use specified number of worker threads to train the model (=faster training with multicore machines)
# sg: training algorithm, 0 for CBOW, 1 for skip-gram
wv_cbow_model = Word2Vec(sentences=sentences, size=100, window=5, min_count=5, workers=2, sg=0)
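If you want to reuse the trained model later without retraining, gensim models can be saved to disk and reloaded; a minimal sketch (the filename cbow_ted.model is arbitrary). Note also that if you are running gensim 4.x, the size parameter above is called vector_size.
In [ ]:
# Optional: persist the trained CBOW model and reload it later
# (the filename "cbow_ted.model" is arbitrary)
wv_cbow_model.save("cbow_ted.model")
reloaded_model = Word2Vec.load("cbow_ted.model")
print(reloaded_model)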
In [ ]:
# The trained word vectors are stored in a KeyedVectors instance as model.wv
# Get the top 10 similar words to ‘man’ by calling most_similar()
# most_similar() computes cosine similarity between a simple mean of the vectors of the given words and the vectors for each word in the model
similar_words = wv_cbow_model.wv.most_similar("man")  # topn=10 by default
pprint.pprint(similar_words)
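Since the vectors are stored in model.wv, you can also inspect a single embedding directly; for example:
In [ ]:
# Retrieve the embedding for one in-vocabulary word and check its dimensionality
man_vector = wv_cbow_model.wv["man"]
print(man_vector.shape)  # (100,), matching size=100 used above
print(man_vector[:5])    # first few components of the vector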
Word2Vec – Skip Gram
In [ ]:
# Now we switch to a Skip Gram model by setting parameter sg=1
wv_sg_model = Word2Vec(sentences=sentences, size=100, window=5, min_count=5, workers=2, sg=1)
In [ ]:
similar_words = wv_sg_model.wv.most_similar("man")
pprint.pprint(similar_words)
Word2Vec vs FastText
The Word2Vec Skip Gram model cannot find words similar to "electrofishing", because "electrofishing" is not in its vocabulary; the cell below therefore raises a KeyError.
In [ ]:
similar_words = wv_sg_model.wv.most_similar("electrofishing")
pprint.pprint(similar_words)
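The error above is easier to interpret if you check the vocabulary directly; a small sketch (membership testing with in works on the model's KeyedVectors):
In [ ]:
# Confirm that "electrofishing" never made it into the Word2Vec vocabulary:
# it occurs fewer than min_count=5 times in the TED data, so it was filtered out
print("man" in wv_sg_model.wv)             # True: frequent word, in vocabulary
print("electrofishing" in wv_sg_model.wv)  # False: out of vocabulary, so most_similar() fails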
FastText – Skip Gram
In [ ]:
from gensim.models import FastText
In [ ]:
# Now we initialize and train FastText with Skip Gram architecture (sg=1)
ft_sg_model = FastText(sentences, size=100, window=5, min_count=5, workers=2, sg=1)
In [ ]:
# As we can see, FastText allows us to obtain word vectors for out-of-vocabulary words
result = ft_sg_model.wv.most_similar("electrofishing")
pprint.pprint(result)
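FastText can handle the out-of-vocabulary word because it composes a vector from the word's character n-grams; a quick check that a vector of the expected dimensionality really is produced:
In [ ]:
# FastText builds the embedding for the unseen word from its character n-grams,
# so a vector can be retrieved even though "electrofishing" was never seen during training
oov_vector = ft_sg_model.wv["electrofishing"]
print(oov_vector.shape)  # (100,), the same dimensionality as in-vocabulary words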
FastText – Continuous Bag-Of-Words (CBOW)
In [ ]:
# Now we initialize and train FastText with CBOW architecture (sg=0)
ft_cbow_model = FastText(sentences, size=100, window=5, min_count=5, workers=2, sg=0)
In [ ]:
# Again, FastText allows us to obtain word vectors for out-of-vocabulary words
result = ft_cbow_model.wv.most_similar("electrofishing")
pprint.pprint(result)
King – Man + Woman = ?
Try both the CBOW and Skip Gram models to calculate "King – Man + Woman = ?".
In [ ]:
# We can specify the positive/negative word list with the positive/negative parameters
# Top N most similar words can be specified with the topn parameter
result = wv_cbow_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
In [ ]:
result = wv_sg_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
In [ ]:
result = ft_cbow_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
In [ ]:
result = ft_sg_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
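Under the hood, most_similar() with positive and negative word lists combines the (normalised) word vectors and ranks every vocabulary word by cosine similarity to the result. Here is a rough hand-rolled sketch of the same idea, using raw vectors and a few hand-picked candidate words, so the ranking may differ slightly from most_similar():
In [ ]:
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-rolled version of the analogy king - man + woman, using the skip-gram vectors
wv = wv_sg_model.wv
target = wv["king"] - wv["man"] + wv["woman"]
for candidate in ["queen", "king", "woman", "prince", "throne"]:
    if candidate in wv:  # skip candidates that are not in the vocabulary
        print(candidate, cosine(target, wv[candidate]))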
In each case the top answer is not what we expected; there is probably not enough training data here for the models to answer "Queen".
Let's try with a much bigger dataset in the following section (Google has already trained Word2Vec on the Google News data).
Using Pretrained word embeddings with Gensim
1. Download and load the pretrained Word2Vec binary file from Google
Link to Project
In [ ]:
# Download the pre-trained vectors trained on part of Google News dataset (about 100 billion words)
# Beware, this file is big (3.39GB), so the download might take a while!
id2 = '0B7XkCwpI5KDYNlNUTTlSS21pQmM'
downloaded = drive.CreateFile({'id': id2})
downloaded.GetContentFile('GoogleNews-vectors-negative300.bin.gz')
In [ ]:
# Uncompress the downloaded file
!gzip -d /content/GoogleNews-vectors-negative300.bin.gz
Note: you may encounter a session crash with the pretrained word2vec code below due to out-of-memory issues. If that happens, you can start again directly from this section.
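One optional way to reduce memory pressure before loading the 3+ GB binary is to delete the models trained earlier (only run this if those models still exist in your current session and you no longer need them):
In [ ]:
# Optional: free the models trained earlier in this session to make room
# for the large pretrained vectors (skip this cell if you still need them)
import gc
del wv_cbow_model, wv_sg_model, ft_cbow_model, ft_sg_model
gc.collect()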
In [ ]:
from gensim.models import KeyedVectors
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Load the pretrained vectors into a KeyedVectors instance (this might take a while!)
filename = 'GoogleNews-vectors-negative300.bin'
gn_wv_model = KeyedVectors.load_word2vec_format(filename, binary=True)
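If memory is still tight, load_word2vec_format() also accepts a limit argument that reads only the first N vectors from the file (the Google News file is roughly frequency-sorted, so this keeps the most common words), at the cost of dropping rarer words; an alternative load, for example:
In [ ]:
# Alternative: load only the first 500,000 vectors to reduce RAM usage
# (rarer words will then be missing from the model)
gn_wv_model = KeyedVectors.load_word2vec_format(filename, binary=True, limit=500000)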
In [ ]:
# Now we can try to calculate “King – Man + Woman = ?” again
result = gn_wv_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
2. Load a pretrained word embedding model using the Gensim API
The following code illustrates another way of loading pretrained word embeddings with Gensim. Here we try the GloVe embeddings trained on Twitter data.
In [ ]:
import gensim.downloader as api
# download the model and return as object ready for use
model = api.load("glove-twitter-25")
# The similarity() function calculates the cosine similarity between two given words
print(model.similarity("cat", "dog"))
# The distance() function is another way of comparing two given words; it returns 1 - cosine similarity instead
print(model.distance("cat", "dog"))
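A quick check of the relationship stated above, i.e. that distance() returns 1 minus the cosine similarity:
In [ ]:
# Verify that distance(a, b) equals 1 - similarity(a, b) (up to floating-point error)
sim = model.similarity("cat", "dog")
dist = model.distance("cat", "dog")
print(sim + dist)  # should be (approximately) 1.0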
[Tips] Play with Colab Form Fields
Colab forms support multiple types of fields, including input fields and dropdown menus.
In Lab 1 E1 we already used input fields; let's try a few more now. You can edit this section by double-clicking it.
Get familiar with forms by changing the value in each input field (on the right) and checking how the code changes (on the left), and vice versa.
In [ ]:
#@title Example form fields
#@markdown please put description
string = 'examples' #@param {type: "string"}
slider_value = 111 #@param {type: "slider", min: 100, max: 200}
number = 102 #@param {type: "number"}
date = '2020-01-05' #@param {type: "date"}
pick_me = "monday" #@param ['monday', 'tuesday', 'wednesday', 'thursday']
select_or_input = "apples" #@param ["apples", "bananas", "oranges"] {allow-input: true}
# Print the output
print("string is", string)
print('slider_value', slider_value)
Exercise
Please complete the following two questions, E1 and E2, and submit your "ipynb" file to Canvas. (You can download it using "File" > "Download .ipynb".)
E1. What are the advantages of Facebook’s FastText over Google’s Word2Vec?
Please write down your answer below with a supporting example, using your own words.
In [ ]:
#@title Lab02 - E1
Answer = "Please refer to Lecture 2 slides (recording), pages 67-70, where we discussed the limitations of Word2Vec and how FastText can deal with them" #@param {type:"raw"}
E2. Let’s find synonyms
Let's assume that the cosine similarity (or distance) between two word embedding vectors indicates whether the words are semantically similar to each other. In this exercise, you will implement a function called find_synonym(), in which:
1. A list of 6 words is given.
2. You need to implement your own algorithm that, for each word in the list, finds its synonym among the remaining 5 words, i.e. the word with the highest cosine similarity (or equivalently the smallest distance). (Using the .similarity() or .distance() function from the Load a pretrained word embedding model using the Gensim API section above may help.)
3. Print out the synonyms found.
Please use the pretrained 50-dimensional GloVe word embeddings trained on the Wikipedia and Gigaword corpora. (You can load them with gensim.downloader by passing 'glove-wiki-gigaword-50' to the .load() function; refer to the Load a pretrained word embedding model using the Gensim API section above.)
Before writing the function, you may need to import any required libraries.
In [ ]:
print(model.similarity("upset", "angry"))
In [ ]:
# Complete the following function based on the requirements above
# The list of words to find synonyms for
words = ["beautiful", "smart", "clever", "stupid", "lovely", "foolish"]
# Load GloVe
def find_synonym(word):
    # Find the synonym for the given word from the words list and return it
    pass
# Call the function to get the synonyms and print out the synonyms for each word
E2. Sample Solution
As long as your solution covers the requirements and your output shows the expected results, you should be able to get all the marks.
In [ ]:
# Complete the following function based on the requirements above
import gensim.downloader as api
import pprint
# The list of words to find synonyms for
words = ["beautiful", "smart", "clever", "stupid", "lovely", "foolish"]
# Load the glove-wiki-gigaword-50 word embedding model
model = api.load("glove-wiki-gigaword-50")
def find_synonym(word):
    # Find the synonym for the given word from the words list and return it
    similarities = [(other_word, model.similarity(word, other_word)) for other_word in words if other_word != word]
    similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
    return (word, similarities[0][0])
# Call the function to get the synonyms and print out the synonyms for each word
words_syn = [find_synonym(word) for word in words]
pprint.pprint(words_syn)
Extension
Word Embedding Visual Inspector (WEVI)
If you would like to visualise how Word2Vec learns, the following link is useful: https://ronxin.github.io/wevi/