
CSC485/2501 A2
TA:

Assignment 2
● Is now available.
● Due date: 23:59 on Friday, November 5.

Assignment 2
● Word Sense Disambiguation.
● After A2, you will be familiar with:
NLTK, WordNet, the Lesk algorithm, and using Word2Vec/BERT word embeddings.

NLTK Package
● WordNet.
● Tokenizer.

WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> wn.synset('car.n.01').lemma_names()
['car', 'auto', 'automobile', 'machine', 'motorcar']
>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
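Each synset also carries a gloss and usage examples, which Lesk-style methods use as the sense signature. A quick look (gloss text may differ slightly across WordNet versions):

>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').examples()
['he needs a car to get to work']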

Tokenizer
● Tokenize a string to split off punctuation other than periods.
● Input:
s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
● Output:
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
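This behaviour matches NLTK's default word_tokenize (a sketch; check the assignment handout for the required tokenizer):

>>> from nltk import word_tokenize
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']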

Tokenizer
● "$3.88": ["$3.88"] or ["$", "3", ".", "88"]
● sometimes: ["sometimes"] or ["some", "times"]

Stopwords
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
...
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
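Stopwords are typically filtered out of the context before computing overlap; a minimal sketch, reusing the string s from above:

>>> sw = set(stopwords.words('english'))
>>> [t for t in word_tokenize(s) if t.lower() not in sw]
['Good', 'muffins', 'cost', '$', '3.88', 'New', 'York', '.', 'Please', 'buy', 'two', '.', 'Thanks', '.']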

Lesk Algorithm
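The idea in brief: pick the sense whose dictionary gloss (plus examples) shares the most words with the target word's context. A minimal sketch of simplified Lesk using NLTK's WordNet interface (the function name and signature are illustrative, not the required A2 interface):

from nltk import word_tokenize
from nltk.corpus import wordnet as wn

def simplified_lesk(word, sentence):
    # Context: the words surrounding the target occurrence.
    context = {w.lower() for w in word_tokenize(sentence)}
    best_sense, best_score = None, -1
    for sense in wn.synsets(word):
        # Signature: gloss words plus example-sentence words.
        signature = {w.lower() for w in word_tokenize(sense.definition())}
        for example in sense.examples():
            signature |= {w.lower() for w in word_tokenize(example)}
        score = len(signature & context)  # Count(overlap)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense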

Score
● Count(overlap).
● Bag of words vs. set of words.
● Vector representation.
○ Vector with counts.
○ Embedding.
● Vector similarity (see the sketch below).
○ Euclidean distance.
○ Cosine similarity.
○ Dot product.
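A minimal sketch of the three measures, assuming dense NumPy vectors:

import numpy as np

def dot(u, v):
    # Unnormalized: grows with vector magnitude.
    return np.dot(u, v)

def cosine(u, v):
    # Dot product of unit vectors; ignores magnitude.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    # A distance, not a similarity: smaller means more alike.
    return np.linalg.norm(u - v)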

Word2Vec
● Pretrained word vectors learned from large amounts of text.
● Words are mapped to vectors of real numbers: conceptually, word2vec(word: str) -> np.ndarray.
● Each word has only one fixed vector representation, even though it may have multiple senses.
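One common way to load pretrained vectors is gensim's downloader (an assumption; the assignment may ship its own vectors, and the model name below is just one standard choice):

import gensim.downloader as api

model = api.load('word2vec-google-news-300')  # large download on first use
vec = model['car']         # a 300-dimensional np.ndarray
model.most_similar('car')  # nearest neighbours in embedding space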

Contextual Word Embedding
● Fixed (e.g., Word2Vec): one vector per word type, regardless of context.
● Contextual (e.g., BERT): the vector depends on the surrounding sentence, so different senses get different vectors.

BERT: pretrained language model

BERT: Next Sentence Prediction

BERT
● Feed the model token indices instead of strings.
● Input embedding layer + 12 hidden layers (BERT-base).
● Hidden dimension: 768 per token (BERT-base).
● Realigning: the BERT tokenizer splits words into subwords, so its tokens must be realigned with NLTK's word-level tokens.
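A minimal sketch with the Hugging Face transformers library (assuming that is the interface used in the handout; the model name is the standard base checkpoint):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Strings -> subword indices; return_tensors='pt' gives PyTorch tensors.
inputs = tokenizer("The bank raised its rates.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per subword token.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])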

Do not use loops!
● Consider matrix multiplication written with explicit loops:

for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
            C[i][j] += A[i][k] * B[k][j];

Too slow! Use a vectorized operation instead:

>>> mat1 = torch.randn(2, 3)
>>> mat2 = torch.randn(3, 3)
>>> torch.mm(mat1, mat2)
tensor([[ 0.4851, 0.5037, -0.3633],
[-0.0760, -3.6705, 2.4784]])
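The same principle applies when scoring senses. For example (a hypothetical helper, not part of the assignment API), all pairwise cosine similarities between two batches of embeddings can be computed with a single matrix product instead of nested loops:

import torch

def pairwise_cosine(A, B):
    # A: (m, d), B: (n, d) -> (m, n) matrix of cosine similarities,
    # from row-normalizing both inputs and taking one matrix product.
    A = A / A.norm(dim=1, keepdim=True)
    B = B / B.norm(dim=1, keepdim=True)
    return A @ B.T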

Questions?