
Preprocessing with NLTK

First, if you haven't used IPython notebooks before: to run the code in this workbook, use the run commands in the Cell menu, or press shift-enter when an individual code cell is selected. Generally, you will have to run the cells in order for them to work properly. The output for a given cell (if any) will appear below the code after it has completed running. To make sure things are working, run the cell below:
In [1]:
print("hello world")

hello world

Okay, now let's do some simple preprocessing on this snippet of HTML from the class website:
In [6]:
# snippet from the class website (the markup shown here is a minimal reconstruction)
text = '''
<div class="page-header">
<h3>COMP90042 Natural Language Processing</h3>
</div>

The aims for this subject is for students to develop an understanding of the main algorithms used in natural
language processing, for use in a diverse range of applications including text classification, machine
translation, and question answering. Topics to be covered include part-of-speech tagging, n-gram language
modelling, syntactic parsing and deep learning. The programming language used is Python, see
the detailed configuration instructions for more information on its use in the
workshops, assignments and installation at home.
'''

First, let's remove the HTML markup using regular expressions:
In [7]:
import re

text = re.sub("<[^>]+>", "", text).strip()
print(text)

COMP90042 Natural Language Processing

The aims for this subject is for students to develop an understanding of the main algorithms used in natural
language processing, for use in a diverse range of applications including text classification, machine
translation, and question answering. Topics to be covered include part-of-speech tagging, n-gram language
modelling, syntactic parsing and deep learning. The programming language used is Python, see
the detailed configuration instructions for more information on its use in the
workshops, assignments and installation at home.
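
A regular expression is good enough for a snippet this small, but it can be fooled by markup edge cases (e.g. a literal > inside an attribute value). For real pages, a proper parser is safer. Here is a minimal sketch using the standard library's html.parser, fed a small sample of raw markup (our own example, since the variable text has already been stripped at this point):
In [ ]:
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the markup itself."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

extractor = TextExtractor()
extractor.feed('<h3>COMP90042 Natural Language Processing</h3>')
print("".join(extractor.chunks).strip())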

Looking at the extracted text, we can see that there are three newline characters between the title and the main text, as well as some newlines within the text. Our sentence tokenizer won't be able to handle the title properly, so let's remove it, and change the other newlines to spaces.
In [8]:
text = text.split("\n\n\n")[1].replace("\n", " ")
print(text)

The aims for this subject is for students to develop an understanding of the main algorithms used in natural language processing, for use in a diverse range of applications including text classification, machine translation, and question answering. Topics to be covered include part-of-speech tagging, n-gram language modelling, syntactic parsing and deep learning. The programming language used is Python, see the detailed configuration instructions for more information on its use in the workshops, assignments and installation at home.

Next let’s segment the text into sentences. Though a simple method like splitting on periods would work well enough in this case, let’s try a sentence segmenter from NLTK, which would be able to handle harder cases if they appeared in our text.
In [9]:
import nltk
nltk.download('punkt')
sent_segmenter = nltk.data.load('tokenizers/punkt/english.pickle')

sentences = sent_segmenter.tokenize(text)
print(sentences)

['The aims for this subject is for students to develop an understanding of the main algorithms used in natural language processing, for use in a diverse range of applications including text classification, machine translation, and question answering.', 'Topics to be covered include part-of-speech tagging, n-gram language modelling, syntactic parsing and deep learning.', 'The programming language used is Python, see the detailed configuration instructions for more information on its use in the workshops, assignments and installation at home.']

[nltk_data] Downloading package punkt to /Users/laujh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
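
Though it isn't needed for our text, a quick sanity check shows why a trained segmenter beats naive period splitting. The sentence pair below is our own example, not from the workbook: the Punkt model has learned that abbreviations like "Dr." rarely end a sentence, whereas a naive split breaks there.
In [ ]:
# a made-up harder case: the naive split breaks after the abbreviation "Dr.",
# but the pre-trained Punkt model should keep the first sentence intact
hard_text = "Dr. Smith teaches the subject. The lectures cover parsing."
print(hard_text.split(". "))
print(sent_segmenter.tokenize(hard_text))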

NLTK also has word tokenizers. For the second sentence, let's compare a naive split on spaces with the NLTK regex tokenizer:
In [16]:
word_tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()

tokenized_sentence = word_tokenizer.tokenize(sentences[1])
print(tokenized_sentence)
print(sentences[1].split(" "))

['Topics', 'to', 'be', 'covered', 'include', 'part', '-', 'of', '-', 'speech', 'tagging', ',', 'n', '-', 'gram', 'language', 'modelling', ',', 'syntactic', 'parsing', 'and', 'deep', 'learning', '.']
['Topics', 'to', 'be', 'covered', 'include', 'part-of-speech', 'tagging,', 'n-gram', 'language', 'modelling,', 'syntactic', 'parsing', 'and', 'deep', 'learning.']

The NLTK tokenizer correctly splits off commas and periods from the ends of words. It also splits up the hyphenated word “part-of-speech”, which might be the right behavior for some applications, but not for others.
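If hyphenated terms should stay together, NLTK's Treebank tokenizer (the one behind nltk.word_tokenize) is a common alternative: it splits off surrounding punctuation but leaves internal hyphens alone. A minimal sketch on the same sentence:
In [ ]:
# the Treebank tokenizer keeps "part-of-speech" and "n-gram" as single tokens,
# while still separating the commas and the final period
treebank_tokenizer = nltk.tokenize.TreebankWordTokenizer()
print(treebank_tokenizer.tokenize(sentences[1]))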
Let's try out lemmatization. NLTK has a lemmatizer, though using it requires that we know the part of speech of the word. Here, we'll first try verb lemmatization, and if that doesn't change the word, fall back to noun lemmatization.
In [17]:
nltk.download('wordnet')
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatize(word):
    # try the verb lemma first; if the word is unchanged, fall back to the noun lemma
    lemma = lemmatizer.lemmatize(word, 'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, 'n')
    return lemma

print([lemmatize(token) for token in tokenized_sentence])

[nltk_data] Downloading package wordnet to /Users/laujh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

['Topics', 'to', 'be', 'cover', 'include', 'part', '-', 'of', '-', 'speech', 'tag', ',', 'n', '-', 'gram', 'language', 'model', ',', 'syntactic', 'parse', 'and', 'deep', 'learn', '.']
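
The part-of-speech argument genuinely matters: the same surface form can have different lemmas as a verb and as a noun. A small illustration with a word of our own choosing:
In [ ]:
# "leaves" is lemmatized differently depending on the part of speech
print(lemmatizer.lemmatize("leaves", "v"))  # leave
print(lemmatizer.lemmatize("leaves", "n"))  # leaf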

Compare this to the result of stemming using the Porter Stemmer:
In [18]:
stemmer = nltk.stem.porter.PorterStemmer()
print([stemmer.stem(token) for token in tokenized_sentence])

['topic', 'to', 'be', 'cover', 'includ', 'part', '-', 'of', '-', 'speech', 'tag', ',', 'n', '-', 'gram', 'languag', 'model', ',', 'syntact', 'pars', 'and', 'deep', 'learn', '.']
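
A quick side-by-side makes the difference concrete: stems are truncated strings that need not be real words, whereas lemmas are valid dictionary forms. This sketch reuses our lemmatize helper on three words of our own choosing:
In [ ]:
# stems can be non-words ("languag", "includ"); lemmas remain real words
for word in ["language", "included", "parsing"]:
    print(word, "->", stemmer.stem(word), "vs", lemmatize(word))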