
Lab06

POS Tagging
POS tagging is the process of marking up each word in a corpus with its corresponding part-of-speech tag, based on its context and definition. The task is not straightforward, as a particular word may take a different part of speech depending on the context in which it is used.
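As a quick illustration of this ambiguity (not part of the original lab), NLTK's off-the-shelf tagger can be applied to a sentence in which "race" appears first as a noun and then as a verb; this sketch assumes the averaged_perceptron_tagger model is available for download.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize

# "race" is a noun in the first clause and a verb in the second.
print(nltk.pos_tag(word_tokenize("This race is awesome, I want to race too")))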

Regular Expression Tagger
The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in -ed is the past participle of a verb, and any word ending with 's is a possessive noun. We can express these as a list of regular expressions:
In [ ]:
import nltk

# Downloading required corpus
nltk.download('punkt')
nltk.download('brown')

from nltk import word_tokenize
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package brown to /root/nltk_data…
[nltk_data] Unzipping corpora/brown.zip.
In [ ]:
# Define regular expression patterns
patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                     # nouns (default)
]
In [ ]:
# Build regular expression tagger using the defined patterns
regexp_tagger = nltk.RegexpTagger(patterns)

# Print one of the sentences
print(brown_sents[3])
# Print one of the tagged sentences
print(regexp_tagger.tag(brown_sents[3]))

['``', 'Only', 'a', 'relative', 'handful', 'of', 'such', 'reports', 'was', 'received', "''", ',', 'the', 'jury', 'said', ',', '``', 'considering', 'the', 'widespread', 'interest', 'in', 'the', 'election', ',', 'the', 'number', 'of', 'voters', 'and', 'the', 'size', 'of', 'this', 'city', "''", '.']
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'), ('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'), ('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'), ("''", 'NN'), ('.', 'NN')]
In [ ]:
# Evaluate the tagger (Calculate the accuracy/performance)
regexp_tagger.evaluate(brown_tagged_sents)
Out[ ]:
0.20326391789486245
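The accuracy is low because most words fall through to the default NN pattern. A common remedy (a sketch, not part of the original lab) is to use the regexp tagger as a backoff behind a unigram tagger trained on labelled data; here the last 500 tagged sentences are held out so the evaluation is not on training data.

# Sketch: regexp tagger as backoff for a unigram tagger, using the variables defined above.
train_sents = brown_tagged_sents[:-500]
eval_sents = brown_tagged_sents[-500:]
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=regexp_tagger)
print(unigram_tagger.evaluate(eval_sents))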
In [ ]:
raw = 'This race is awesome, I want to race too'
tokens = word_tokenize(raw)

print(regexp_tagger.tag(tokens))

[('This', 'NNS'), ('race', 'NN'), ('is', 'NNS'), ('awesome', 'NN'), (',', 'NN'), ('I', 'NN'), ('want', 'NN'), ('to', 'NN'), ('race', 'NN'), ('too', 'NN')]

Hidden Markov Models
A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model.
In [ ]:
# Hidden Markov Models in Python
# Katrin Erk, March 2013 updated March 2016
#
# This HMM addresses the problem of part-of-speech tagging. It estimates
# the probability of a tag sequence for a given word sequence as follows:
#
# Say words = w1 .. wN
# and tags = t1 .. tN
#
# then
# P(tags | words) is_proportional_to product P(ti | t{i-1}) P(wi | ti)
#
# To find the best tag sequence for a given sequence of words,
# we want to find the tag sequence that has the maximum P(tags | words)
import nltk
import sys
nltk.download('brown')

from nltk.corpus import brown
from nltk.corpus import treebank

[nltk_data] Downloading package brown to /root/nltk_data…
[nltk_data] Package brown is already up-to-date!
In [ ]:
# Estimating P(wi | ti) from corpus data using Maximum Likelihood Estimation (MLE):
# P(wi | ti) = count(wi, ti) / count(ti)
#
# We add an artificial "START" tag at the beginning of each sentence, and
# an artificial "END" tag at the end of each sentence.
# So we start out with the brown tagged sentences,
# add the two artificial tags,
# and then make one long list of all the tag/word pairs.

brown_tags_words = []
brown_tagged_sents = brown.tagged_sents()

for sent in brown_tagged_sents:
    # sent is a list of word/tag pairs
    # add START/START at the beginning
    brown_tags_words.append(("START", "START"))
    # then all the tag/word pairs for the word/tag pairs in the sentence.
    # shorten tags to 2 characters each
    brown_tags_words.extend([(tag[:2], word) for (word, tag) in sent])
    # then END/END
    brown_tags_words.append(("END", "END"))

# conditional frequency distribution
cfd_tagwords = nltk.ConditionalFreqDist(brown_tags_words)
# conditional probability distribution
cpd_tagwords = nltk.ConditionalProbDist(cfd_tagwords, nltk.MLEProbDist)

print("The probability of an adjective (JJ) being 'new' is", cpd_tagwords["JJ"].prob("new"))
print("The probability of a verb (VB) being 'duck' is", cpd_tagwords["VB"].prob("duck"))

# Estimating P(ti | t{i-1}) from corpus data using Maximum Likelihood Estimation (MLE):
# P(ti | t{i-1}) = count(t{i-1}, ti) / count(t{i-1})
brown_tags = [tag for (tag, word) in brown_tags_words]

# make conditional frequency distribution:
# count(t{i-1} ti)
cfd_tags = nltk.ConditionalFreqDist(nltk.bigrams(brown_tags))
# make conditional probability distribution, using
# maximum likelihood estimate:
# P(ti | t{i-1})
cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist)

print("If we have just seen 'DT', the probability of 'NN' is", cpd_tags["DT"].prob("NN"))
print("If we have just seen 'VB', the probability of 'DT' is", cpd_tags["VB"].prob("DT"))
print("If we have just seen 'VB', the probability of 'NN' is", cpd_tags["VB"].prob("NN"))

The probability of an adjective (JJ) being 'new' is 0.01472344917632025
The probability of a verb (VB) being 'duck' is 6.042713350943527e-05
If we have just seen 'DT', the probability of 'NN' is 0.5057722522030194
If we have just seen 'VB', the probability of 'DT' is 0.016885067592065053
If we have just seen 'VB', the probability of 'NN' is 0.10970977711020183
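Before turning to Viterbi, here is a minimal sketch (not part of the original lab) of how the two distributions above combine to score one candidate tag sequence for a sentence. The tag sequence used here is just an illustrative guess.

# Score P(tags, words) for one candidate tag sequence of "I want to race"
# under the HMM, including the START and END transitions.
candidate_words = ["I", "want", "to", "race"]
candidate_tags = ["PP", "VB", "TO", "VB"]   # illustrative guess, using the shortened 2-character tags

prob = cpd_tags["START"].prob(candidate_tags[0])            # transition out of START
for i, (word, tag) in enumerate(zip(candidate_words, candidate_tags)):
    prob *= cpd_tagwords[tag].prob(word)                    # emission P(wi | ti)
    if i + 1 < len(candidate_tags):
        prob *= cpd_tags[tag].prob(candidate_tags[i + 1])   # transition P(t{i+1} | ti)
prob *= cpd_tags[candidate_tags[-1]].prob("END")            # transition into END
print(prob)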

Viterbi Algorithm
In [ ]:
#####
# Viterbi:
# If we have a word sequence, what is the best tag sequence?
#
# The method above lets us determine the probability for a single tag sequence.
# But in order to find the best tag sequence, we need the probability
# for _all_ tag sequences.
# What Viterbi gives us is just a good way of computing all those many probabilities
# as fast as possible.

# what is the list of all tags?
distinct_tags = set(brown_tags)

sentence = ["This", "race", "is", "awesome", ",", "I", "want", "to", "race", "too"]
#sentence = ["I", "saw", "her", "duck"]
sentlen = len(sentence)

# viterbi:
# for each step i in 1 .. sentlen,
# store a dictionary
# that maps each tag X
# to the probability of the best tag sequence of length i that ends in X
viterbi = []

# backpointer:
# for each step i in 1 .. sentlen,
# store a dictionary
# that maps each tag X
# to the previous tag in the best tag sequence of length i that ends in X
backpointer = []

first_viterbi = {}
first_backpointer = {}
for tag in distinct_tags:
    # don't record anything for the START tag
    if tag == "START": continue
    first_viterbi[tag] = cpd_tags["START"].prob(tag) * cpd_tagwords[tag].prob(sentence[0])
    first_backpointer[tag] = "START"

print(first_viterbi)
print(first_backpointer)

viterbi.append(first_viterbi)
backpointer.append(first_backpointer)

currbest = max(first_viterbi.keys(), key=lambda tag: first_viterbi[tag])
print("Word", "'" + sentence[0] + "'", "current best two-tag sequence:", first_backpointer[currbest], currbest)
# print("Word", "'" + sentence[0] + "'", "current best tag:", currbest)

for wordindex in range(1, len(sentence)):
    this_viterbi = {}
    this_backpointer = {}
    prev_viterbi = viterbi[-1]

    for tag in distinct_tags:
        # don't record anything for the START tag
        if tag == "START": continue

        # if this tag is X and the current word is w, then
        # find the previous tag Y such that
        # the best tag sequence that ends in X
        # actually ends in Y X
        # that is, the Y that maximizes
        # prev_viterbi[ Y ] * P(X | Y) * P( w | X)
        # The following command has the same notation
        # that you saw in the sorted() command.
        best_previous = max(prev_viterbi.keys(),
                            key=lambda prevtag: prev_viterbi[prevtag] * cpd_tags[prevtag].prob(tag) * cpd_tagwords[tag].prob(sentence[wordindex]))

        # Instead, we can also use the following longer code:
        # best_previous = None
        # best_prob = 0.0
        # for prevtag in distinct_tags:
        #     prob = prev_viterbi[prevtag] * cpd_tags[prevtag].prob(tag) * cpd_tagwords[tag].prob(sentence[wordindex])
        #     if prob > best_prob:
        #         best_previous = prevtag
        #         best_prob = prob
        #
        this_viterbi[tag] = prev_viterbi[best_previous] * \
            cpd_tags[best_previous].prob(tag) * cpd_tagwords[tag].prob(sentence[wordindex])
        this_backpointer[tag] = best_previous

    currbest = max(this_viterbi.keys(), key=lambda tag: this_viterbi[tag])
    print("Word", "'" + sentence[wordindex] + "'", "current best two-tag sequence:", this_backpointer[currbest], currbest)
    # print("Word", "'" + sentence[wordindex] + "'", "current best tag:", currbest)

    # done with all tags in this iteration
    # so store the current viterbi step
    viterbi.append(this_viterbi)
    backpointer.append(this_backpointer)

# done with all words in the sentence.
# now find the probability of each tag
# to have "END" as the next tag,
# and use that to find the overall best sequence
prev_viterbi = viterbi[-1]
best_previous = max(prev_viterbi.keys(),
                    key=lambda prevtag: prev_viterbi[prevtag] * cpd_tags[prevtag].prob("END"))

prob_tagsequence = prev_viterbi[best_previous] * cpd_tags[best_previous].prob("END")

# best tagsequence: we store this in reverse for now, will invert later
best_tagsequence = ["END", best_previous]
# invert the list of backpointers
backpointer.reverse()

# go backwards through the list of backpointers
# (or in this case forward, because we have inverted the backpointer list)
# in each case:
# the following best tag is the one listed under
# the backpointer for the current best tag
current_best_tag = best_previous
for bp in backpointer:
    best_tagsequence.append(bp[current_best_tag])
    current_best_tag = bp[current_best_tag]

best_tagsequence.reverse()
print("The sentence was:", end=" ")
for w in sentence: print(w, end=" ")
print("\n")
print("The best tag sequence is:", end=" ")
for t in best_tagsequence: print(t, end=" ")
print("\n")
print("The probability of the best tag sequence is:", prob_tagsequence)

{‘WR’: 0.0, ‘NR’: 0.0, ‘WD’: 0.0, ‘,’: 0.0, ‘FW’: 0.0, ‘HV’: 0.0, ‘RN’: 0.0, ‘WP’: 0.0, ‘“’: 0.0, ‘NN’: 0.0, ‘.’: 0.0, ‘:’: 0.0, ‘(‘: 0.0, ‘END’: 0.0, ‘OD’: 0.0, ‘TO’: 0.0, ‘RB’: 0.0, ‘WQ’: 0.0, ‘JJ’: 0.0, ‘AP’: 0.0, ‘AT’: 0.0, ‘*’: 0.0, ‘DT’: 0.0033218181276236437, ‘NP’: 0.0, ‘RP’: 0.0, ‘VB’: 0.0, ‘IN’: 0.0, ‘AB’: 0.0, ‘.-‘: 0.0, ‘CS’: 0.0, ‘–‘: 0.0, ‘PP’: 0.0, ‘QL’: 0.0, ‘CD’: 0.0, ‘PN’: 0.0, ‘CC’: 0.0, ‘NI’: 0.0, ‘(-‘: 0.0, “””: 0.0, ‘MD’: 0.0, ‘:-‘: 0.0, ‘*-‘: 0.0, ‘)’: 0.0, ‘BE’: 0.0, ‘EX’: 0.0, ‘)-‘: 0.0, ‘,-‘: 0.0, “‘”: 0.0, ‘UH’: 0.0, ‘DO’: 0.0}
{‘WR’: ‘START’, ‘NR’: ‘START’, ‘WD’: ‘START’, ‘,’: ‘START’, ‘FW’: ‘START’, ‘HV’: ‘START’, ‘RN’: ‘START’, ‘WP’: ‘START’, ‘“’: ‘START’, ‘NN’: ‘START’, ‘.’: ‘START’, ‘:’: ‘START’, ‘(‘: ‘START’, ‘END’: ‘START’, ‘OD’: ‘START’, ‘TO’: ‘START’, ‘RB’: ‘START’, ‘WQ’: ‘START’, ‘JJ’: ‘START’, ‘AP’: ‘START’, ‘AT’: ‘START’, ‘*’: ‘START’, ‘DT’: ‘START’, ‘NP’: ‘START’, ‘RP’: ‘START’, ‘VB’: ‘START’, ‘IN’: ‘START’, ‘AB’: ‘START’, ‘.-‘: ‘START’, ‘CS’: ‘START’, ‘–‘: ‘START’, ‘PP’: ‘START’, ‘QL’: ‘START’, ‘CD’: ‘START’, ‘PN’: ‘START’, ‘CC’: ‘START’, ‘NI’: ‘START’, ‘(-‘: ‘START’, “””: ‘START’, ‘MD’: ‘START’, ‘:-‘: ‘START’, ‘*-‘: ‘START’, ‘)’: ‘START’, ‘BE’: ‘START’, ‘EX’: ‘START’, ‘)-‘: ‘START’, ‘,-‘: ‘START’, “‘”: ‘START’, ‘UH’: ‘START’, ‘DO’: ‘START’}
Word 'This' current best two-tag sequence: START DT
Word 'race' current best two-tag sequence: DT NN
Word 'is' current best two-tag sequence: NN BE
Word 'awesome' current best two-tag sequence: BE JJ
Word ',' current best two-tag sequence: JJ ,
Word 'I' current best two-tag sequence: , PP
Word 'want' current best two-tag sequence: PP VB
Word 'to' current best two-tag sequence: VB TO
Word 'race' current best two-tag sequence: IN NN
Word 'too' current best two-tag sequence: VB QL
The sentence was: This race is awesome , I want to race too

The best tag sequence is: START DT NN BE JJ , PP VB TO VB RB END

The probability of the best tag sequence is: 3.9954320581626204e-33

The Viterbi code above was written by Katrin Erk.
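A practical note (not part of the original code): the best-path probability above is around 1e-33, and for longer sentences the product of many small probabilities eventually underflows to 0.0. A common fix is to run Viterbi in log space, replacing products by sums of logs; a minimal sketch of the idea:

import math

def safe_log(p):
    # log(0) is undefined, so map zero probabilities to -infinity
    return math.log(p) if p > 0 else float("-inf")

# In log space the Viterbi update
#     prev_viterbi[Y] * P(X | Y) * P(w | X)
# becomes
#     prev_log_viterbi[Y] + safe_log(cpd_tags[Y].prob(X)) + safe_log(cpd_tagwords[X].prob(w))
# and max() still picks the same argmax, because log is monotonic.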

Train HMM Tagger with NLTK HMM Trainer
In [ ]:
# Pretagged training data
brown_tagged_sents = brown.tagged_sents()

print(brown_tagged_sents)

[[(‘The’, ‘AT’), (‘Fulton’, ‘NP-TL’), (‘County’, ‘NN-TL’), (‘Grand’, ‘JJ-TL’), (‘Jury’, ‘NN-TL’), (‘said’, ‘VBD’), (‘Friday’, ‘NR’), (‘an’, ‘AT’), (‘investigation’, ‘NN’), (‘of’, ‘IN’), (“Atlanta’s”, ‘NP$’), (‘recent’, ‘JJ’), (‘primary’, ‘NN’), (‘election’, ‘NN’), (‘produced’, ‘VBD’), (‘“’, ‘“’), (‘no’, ‘AT’), (‘evidence’, ‘NN’), (“””, “””), (‘that’, ‘CS’), (‘any’, ‘DTI’), (‘irregularities’, ‘NNS’), (‘took’, ‘VBD’), (‘place’, ‘NN’), (‘.’, ‘.’)], [(‘The’, ‘AT’), (‘jury’, ‘NN’), (‘further’, ‘RBR’), (‘said’, ‘VBD’), (‘in’, ‘IN’), (‘term-end’, ‘NN’), (‘presentments’, ‘NNS’), (‘that’, ‘CS’), (‘the’, ‘AT’), (‘City’, ‘NN-TL’), (‘Executive’, ‘JJ-TL’), (‘Committee’, ‘NN-TL’), (‘,’, ‘,’), (‘which’, ‘WDT’), (‘had’, ‘HVD’), (‘over-all’, ‘JJ’), (‘charge’, ‘NN’), (‘of’, ‘IN’), (‘the’, ‘AT’), (‘election’, ‘NN’), (‘,’, ‘,’), (‘“’, ‘“’), (‘deserves’, ‘VBZ’), (‘the’, ‘AT’), (‘praise’, ‘NN’), (‘and’, ‘CC’), (‘thanks’, ‘NNS’), (‘of’, ‘IN’), (‘the’, ‘AT’), (‘City’, ‘NN-TL’), (‘of’, ‘IN-TL’), (‘Atlanta’, ‘NP-TL’), (“””, “””), (‘for’, ‘IN’), (‘the’, ‘AT’), (‘manner’, ‘NN’), (‘in’, ‘IN’), (‘which’, ‘WDT’), (‘the’, ‘AT’), (‘election’, ‘NN’), (‘was’, ‘BEDZ’), (‘conducted’, ‘VBN’), (‘.’, ‘.’)], …]
In [ ]:
# Import HMM module
from nltk.tag import hmm

# Set up a trainer with default (None) values
# and train it with the data
trainer = hmm.HiddenMarkovModelTrainer()
trained_tagger = trainer.train_supervised(brown_tagged_sents)

print(trained_tagger)
# Prints basic information about the tagger

tokens = word_tokenize("This race is awesome, I want to race too")
print(trained_tagger.tag(tokens))


[('This', 'DT'), ('race', 'NN'), ('is', 'BEZ'), ('awesome', 'JJ'), (',', ','), ('I', 'PPSS'), ('want', 'VB'), ('to', 'TO'), ('race', 'VB'), ('too', 'QL')]
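Note that the tagger above was trained on the full Brown corpus, so tagging a toy sentence says little about generalisation. A sketch (not in the original lab) of evaluating on held-out sentences instead; in newer NLTK versions .evaluate is renamed .accuracy.

# Sketch: train on all but the last 1000 Brown sentences and evaluate on the held-out part.
heldout_trainer = hmm.HiddenMarkovModelTrainer()
heldout_tagger = heldout_trainer.train_supervised(brown_tagged_sents[:-1000])
print(heldout_tagger.evaluate(brown_tagged_sents[-1000:]))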

Bi-LSTM based POS Tagger (PyTorch)
In this example, we train a POS tagger using a bidirectional LSTM (Bi-LSTM).

In [ ]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Training data
In [ ]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize

nltk.download('treebank')
from nltk.corpus import treebank

import numpy as np
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package punkt to /root/nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data…
[nltk_data] Unzipping corpora/treebank.zip.
In [ ]:
# Retrieve tagged sentences from treebank corpus
tagged_sentences = nltk.corpus.treebank.tagged_sents()

print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))
#tagged_words(): list of (str,str) tuple

[(‘Pierre’, ‘NNP’), (‘Vinken’, ‘NNP’), (‘,’, ‘,’), (’61’, ‘CD’), (‘years’, ‘NNS’), (‘old’, ‘JJ’), (‘,’, ‘,’), (‘will’, ‘MD’), (‘join’, ‘VB’), (‘the’, ‘DT’), (‘board’, ‘NN’), (‘as’, ‘IN’), (‘a’, ‘DT’), (‘nonexecutive’, ‘JJ’), (‘director’, ‘NN’), (‘Nov.’, ‘NNP’), (’29’, ‘CD’), (‘.’, ‘.’)]
Tagged sentences: 3914
Tagged words: 100676
In [ ]:
sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    # zip(*...) unpairs the (word, tag) tuples: the first items of all tuples are grouped
    # together, the second items are grouped together, etc.
    sentence, tags = zip(*tagged_sentence)
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))

print(sentences[5])
print(sentence_tags[5])

[‘Lorillard’ ‘Inc.’ ‘,’ ‘the’ ‘unit’ ‘of’ ‘New’ ‘York-based’ ‘Loews’
‘Corp.’ ‘that’ ‘*T*-2’ ‘makes’ ‘Kent’ ‘cigarettes’ ‘,’ ‘stopped’ ‘using’
‘crocidolite’ ‘in’ ‘its’ ‘Micronite’ ‘cigarette’ ‘filters’ ‘in’ ‘1956’
‘.’]
[‘NNP’ ‘NNP’ ‘,’ ‘DT’ ‘NN’ ‘IN’ ‘JJ’ ‘JJ’ ‘NNP’ ‘NNP’ ‘WDT’ ‘-NONE-‘ ‘VBZ’
‘NNP’ ‘NNS’ ‘,’ ‘VBD’ ‘VBG’ ‘NN’ ‘IN’ ‘PRP$’ ‘NN’ ‘NN’ ‘NNS’ ‘IN’ ‘CD’
‘.’]
In [ ]:
(train_sentences,
test_sentences,
train_tags,
test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2, random_state = 42)

Making vocabularies with special tokens
PAD: padding token
OOV: out-of-vocabulary token
In [ ]:
words, tags = set([]), set([])

for s in train_sentences:
    for w in s:
        words.add(w.lower())

for ts in train_tags:
    for t in ts:
        tags.add(t)

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # The special value used for padding
word2index['-OOV-'] = 1  # The special value used for OOVs

tag2index = {t: i + 2 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0  # The special value used to tag padding
tag2index['-OOV-'] = 1  # The special value used to tag OOVs
In [ ]:
def encode_sentences(sentences):
    res = []
    for sent in sentences:
        temp = [word2index[word.lower()] if word.lower() in word2index else word2index['-OOV-'] for word in sent]
        res.append(temp)
    return res

train_sentences_encoded = encode_sentences(train_sentences)
test_sentences_encoded = encode_sentences(test_sentences)

train_tags_y, test_tags_y = [], []

def tag_to_index(tags_list):
    res = []
    for tags in tags_list:
        temp = [tag2index[tag] if tag in tag2index else tag2index['-OOV-'] for tag in tags]
        res.append(temp)
    return res

train_tags_y = tag_to_index(train_tags)
test_tags_y = tag_to_index(test_tags)

Padding
In [ ]:
# Pad to max_length
max_length = len(max(train_sentences_encoded, key=len))
print(max_length)

271
In [ ]:
def pad_sequence(seq_list, max_length, index_dict):
    res = []
    for seq in seq_list:
        temp = seq[:]
        if len(seq) > max_length:
            res.append(temp[:max_length])
        else:
            temp += [index_dict['-PAD-']] * (max_length - len(seq))
            res.append(temp)
    return np.array(res)

train_sentences_encoded_pad = pad_sequence(train_sentences_encoded, max_length, word2index)
test_sentences_encoded_pad = pad_sequence(test_sentences_encoded, max_length, word2index)
train_tags_y_pad = pad_sequence(train_tags_y, max_length, tag2index)
test_tags_y_pad = pad_sequence(test_tags_y, max_length, tag2index)

Build Dataset and Dataloader for training data
In [ ]:
from torch.utils.data import TensorDataset
#More detailed info about the TensorDataset, https://pytorch.org/docs/1.1.0/_modules/torch/utils/data/dataset.html#TensorDataset
train_data = TensorDataset(torch.from_numpy(train_sentences_encoded_pad), torch.from_numpy(train_tags_y_pad))

from torch.utils.data import DataLoader
#More detailed info about the dataLoader, https://pytorch.org/docs/1.1.0/_modules/torch/utils/data/dataloader.html
batch_size = 128
train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
# shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).
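As a quick sanity check (not part of the original lab), we can pull one batch from the loader and inspect its shape; with the settings above, each batch is (batch_size, max_length) for both the words and the tags.

# Sketch: inspect one batch from the DataLoader defined above.
example_words, example_tags = next(iter(train_loader))
print(example_words.shape, example_tags.shape)  # expected: torch.Size([128, 271]) for both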

Model
In [ ]:
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds)
        tag_space = self.hidden2tag(lstm_out)
        return tag_space

EMBEDDING_DIM = 128
HIDDEN_DIM = 256

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word2index), len(tag2index)).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
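One optional refinement (an assumption on my part, not in the original lab): because every sequence is padded to max_length, the loss above also counts the -PAD- positions. PyTorch's CrossEntropyLoss can be told to skip them:

# Sketch: ignore PAD positions when computing the loss.
loss_function = nn.CrossEntropyLoss(ignore_index=tag2index['-PAD-'])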
In [ ]:
from sklearn.metrics import accuracy_score

number_epochs = 20

for epoch in range(number_epochs):
    loss_now = 0.0
    correct = 0

    for sentence, targets in train_loader:
        sentence = sentence.to(device)
        targets = targets.to(device)

        temp_batch_size = sentence.shape[0]

        model.train()
        optimizer.zero_grad()
        tag_space = model(sentence)
        loss = loss_function(tag_space.view(-1, tag_space.shape[-1]), targets.view(-1))
        loss.backward()
        optimizer.step()

        loss_now += loss.item() * temp_batch_size
        predicted = torch.argmax(tag_space, -1)
        # Note: the training accuracy here is calculated with "PAD" positions included,
        # which results in a relatively higher accuracy.
        correct += accuracy_score(predicted.view(-1).cpu().numpy(), targets.view(-1).cpu().numpy()) * temp_batch_size

    print('Epoch: %d, training loss: %.4f, training accuracy: %.2f%%' % (epoch + 1, loss_now / len(train_data), 100 * correct / len(train_data)))

Epoch: 1, training loss: 0.7916, training accuracy: 87.24%
Epoch: 2, training loss: 0.3046, training accuracy: 93.24%
Epoch: 3, training loss: 0.2356, training accuracy: 94.30%
Epoch: 4, training loss: 0.1938, training accuracy: 95.05%
Epoch: 5, training loss: 0.1605, training accuracy: 95.72%
Epoch: 6, training loss: 0.1342, training accuracy: 96.44%
Epoch: 7, training loss: 0.1136, training accuracy: 96.97%
Epoch: 8, training loss: 0.0973, training accuracy: 97.37%
Epoch: 9, training loss: 0.0841, training accuracy: 97.73%
Epoch: 10, training loss: 0.0729, training accuracy: 98.03%
Epoch: 11, training loss: 0.0638, training accuracy: 98.28%
Epoch: 12, training loss: 0.0559, training accuracy: 98.50%
Epoch: 13, training loss: 0.0493, training accuracy: 98.68%
Epoch: 14, training loss: 0.0436, training accuracy: 98.85%
Epoch: 15, training loss: 0.0387, training accuracy: 98.99%
Epoch: 16, training loss: 0.0344, training accuracy: 99.11%
Epoch: 17, training loss: 0.0308, training accuracy: 99.22%
Epoch: 18, training loss: 0.0275, training accuracy: 99.31%
Epoch: 19, training loss: 0.0247, training accuracy: 99.38%
Epoch: 20, training loss: 0.0221, training accuracy: 99.45%

Test with the test set
In [ ]:
model.eval()
sentence = torch.from_numpy(test_sentences_encoded_pad).to(device)
tag_space = model(sentence)
predicted = torch.argmax(tag_space, -1)
predicted = predicted.cpu().numpy()

# cut off the PAD part
test_len_list = [len(s) for s in test_sentences_encoded]
actual_predicted_list = []
for i in range(predicted.shape[0]):
    actual_predicted_list += list(predicted[i])[:test_len_list[i]]

# get actual tag list
actual_tags = sum(test_tags_y, [])

print('Test Accuracy: %.2f%%' % (accuracy_score(actual_predicted_list, actual_tags) * 100))

Test Accuracy: 88.15%

Exercise

E1. When using an HMM to solve the POS tagging problem, what is (1) the hidden state, (2) an observation, (3) an example of a transition probability, (4) an example of an emission probability?
In the HMM example from the lecture, the hidden state is the weather and an observation is the clothing the person wears; an example transition probability is the probability that today is rainy given that yesterday was cloudy, and an example emission probability is the probability that the person wears a shirt given that it is cloudy.

Your answer:

E2. Testing with new sentences
In this exercise, you will predict part-of-speech (POS) tags for user-defined sentences using the Bi-LSTM model trained just before this exercise. (You can call the functions and use the variables directly; assume this part follows the Bi-LSTM based POS tagger code above.) Complete the function so that it returns the list of POS tag sequences for the input. Note: your output should be cut off to the actual length of each sentence.
In [ ]:
test_samples = [
    word_tokenize("This race is awesome, I want to race too."),
    word_tokenize("That race is silly, I do not want to race.")
]

def test_model(test_samples):
    token_sequences = []

    # Please complete this part

    return token_sequences

print(test_samples)
print(test_model(test_samples))

E2 Sample Solution
In [ ]:
test_samples = [
    word_tokenize("This race is awesome, I want to race too."),
    word_tokenize("That race is silly, I do not want to race.")
]

def test_model(test_samples):

    test_samples_encoded = encode_sentences(test_samples)
    test_samples_encoded_pad = pad_sequence(test_samples_encoded, max_length, word2index)

    model.eval()
    sentences = torch.from_numpy(test_samples_encoded_pad).to(device)
    tag_space = model(sentences)
    predictions = torch.argmax(tag_space, -1).cpu().numpy()

    index2tag = {i: t for t, i in tag2index.items()}

    token_sequences = []
    for i in range(len(predictions)):
        pred_list = predictions[i]
        length_temp = len(test_samples_encoded[i])
        token_sequence = [index2tag[pred_list[j]] for j in range(length_temp)]
        token_sequences.append(token_sequence)

    return token_sequences

print(test_samples)
print(test_model(test_samples))

[['This', 'race', 'is', 'awesome', ',', 'I', 'want', 'to', 'race', 'too', '.'], ['That', 'race', 'is', 'silly', ',', 'I', 'do', 'not', 'want', 'to', 'race', '.']]
[['DT', 'NN', 'VBZ', 'RB', ',', 'PRP', 'VBP', 'TO', 'VB', 'RB', '.'], ['IN', 'NN', 'VBZ', 'RB', ',', 'PRP', 'VBP', 'RB', 'VB', 'TO', 'NN', '.']]