CSC485/2501 A2
TA:
Assignment 2
● Is now available.
● Due date: 23:59 on Friday, November 5.
Assignment 2
● Word Sense Disambiguation.
● After A2, you will be familiar with:
NLTK, WordNet, the Lesk algorithm, and using Word2Vec/BERT word embeddings.
NLTK Package
● WordNet.
● Tokenizer.
WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> wn.synset('car.n.01').lemma_names()
['car', 'auto', 'automobile', 'machine', 'motorcar']
>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
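Synsets also expose glosses and lexical relations; these standard NLTK WordNet calls (with their actual outputs) will matter later for Lesk:

>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').examples()
['he needs a car to get to work']
>>> wn.synset('car.n.01').hypernyms()
[Synset('motor_vehicle.n.01')]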
Tokenizer
● Tokenize a string to split off punctuation other than periods.
● Input:
s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
● Output:
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
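The output above is what NLTK's standard word tokenizer produces; a minimal sketch of the call:

>>> from nltk import word_tokenize
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']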
Tokenizer
● "$3.88": ["$3.88"] or ["$", "3", ".", "88"]
● sometimes: ["sometimes"] or ["some", "times"]
Stopwords
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
...
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
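A typical use is dropping stopwords from a token list before scoring overlaps; a small sketch (the token list is just an example):

>>> sw = set(stopwords.words('english'))
>>> tokens = ['Please', 'buy', 'me', 'two', 'of', 'them']
>>> [t for t in tokens if t.lower() not in sw]
['Please', 'buy', 'two']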
Lesk Algorithm
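The idea: pick the sense whose gloss overlaps most with the context words. NLTK ships a baseline implementation in nltk.wsd, handy as a sanity check (this example is from the NLTK docs):

>>> from nltk.wsd import lesk
>>> sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
>>> lesk(sent, 'bank', 'n')
Synset('savings_bank.n.02')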
Score
● Count(overlap).
● Bag of words vs. set of words.
● Vector representation.
    ○ Vector with counts.
    ○ Embedding.
● Vector similarity.
    ○ Euclidean distance.
    ○ Cosine similarity.
    ○ Dot product.
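A small NumPy sketch of the three similarity measures above (the vectors are made-up toy values):

import numpy as np

v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 1.0, 1.0])

dot = v1 @ v2                           # dot product
dist = np.linalg.norm(v1 - v2)          # Euclidean distance (smaller = closer)
cos = dot / (np.linalg.norm(v1) * np.linalg.norm(v2))  # cosine similarity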
Word2Vec
● Pretrained word vectors learned from large amounts of text.
● Words are mapped to vectors of real numbers: word2vec(word: str) -> vector: np.ndarray
● Each word has only one fixed vector representation, even though it may have multiple senses.
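A minimal lookup sketch, assuming the gensim package and one of its downloadable pretrained models:

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')  # pretrained KeyedVectors (large download)
vec = wv['car']                            # one fixed 300-d vector per word
wv.most_similar('car', topn=3)             # nearest neighbours in vector space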
Contextual Word Embedding
[Figure: fixed vs. contextual word embeddings]
BERT: pretrained language model
BERT: Next Sentence Prediction
BERT
● Feed the model token indices instead of strings.
● Input embedding layers + 12 hidden layers.
● Dimension: each layer outputs 768-dimensional vectors (bert-base).
● Realigning: BERT tokenizer vs. NLTK tokenizer.
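A minimal sketch of getting contextual vectors, assuming the HuggingFace transformers package; realignment back to NLTK tokens is only sketched in the comments:

import torch
from transformers import BertTokenizerFast, BertModel

tok = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

enc = tok('The bank raised interest rates.', return_tensors='pt')
with torch.no_grad():
    out = model(**enc)             # indices in, hidden states out

hidden = out.last_hidden_state     # shape (1, num_wordpieces, 768)
# BERT's WordPiece tokenizer can split one NLTK token into several
# pieces (e.g. "playing" -> "play", "##ing"); to realign, map each
# NLTK token to its wordpiece span (enc.word_ids() gives the mapping)
# and pool those vectors, e.g. by averaging.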
Do not use loops!
Consider matrix multiplication:

for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
            C[i][j] += A[i][k] * B[k][j];

Too slow! Use a vectorized matrix multiply instead:
>>> mat1 = torch.randn(2, 3)
>>> mat2 = torch.randn(3, 3)
>>> torch.mm(mat1, mat2)
tensor([[ 0.4851, 0.5037, -0.3633],
[-0.0760, -3.6705, 2.4784]])
Questions?