Text Pre-Processing — 1

Faculty of Information Technology, Monash University, Australia

FIT5196 week 4

(Monash) FIT5196 1 / 14

Outline

1 Basic Tasks in Text Preprocessing
Tokenisation
Case Normalisation
Stopping — Remove Stop Words
Stemming & Lemmatisation
Sentence Segmentation

Text is everywhere!

A large amount of text data is available in different forms. For example,
• e-books & e-magazines
• online newspapers
• blogs & tweets
• online product reviews
• emails
• medical reports and articles
• research papers published by different conferences

Various text resources
• Online data repositories: UCI Machine Learning Repository, Linguistic Data Consortium.
• NLTK: the built-in datasets
• Web: crawl text data yourself!

How can we use automatic approaches to analyse text syntactically and
semantically?

The goal of text analysis: provide an understanding of the text by processing
it automatically, without having a human read it.

The capability of computers
• What a computer can do:
  − Examine the individual characters in each word, and how those words are arranged.
• What a computer cannot do:
  − Know what information the text communicates, syntactically and semantically.

Syntax vs. Semantics
• Syntax: the structure of language, e.g., grammar rules
  − How individual words are composed to make well-formed sentences and paragraphs.
• Semantics: the meaning of the individual words within the surrounding context
  − Understand the theme of a given text fragment.

Examples of Text Analysis Tasks

Analyse sentence structure — e.g., syntactic parsing and dependency
parsing

Figure: Figure from the NLTK book

Examples of Text Analysis Tasks

Topic modelling

Examples of Text Analysis Tasks

Learn topical phrases: topical collocation

Topic model with collocations

• Combines PCFG topic model and segmentation adaptor grammar

Sentence → Doc_j           j ∈ 1, …, m
Doc_j → _j                 j ∈ 1, …, m
Doc_j → Doc_j Topic_i      i ∈ 1, …, ℓ; j ∈ 1, …, m
Topic_i → Words            i ∈ 1, …, ℓ
Words → Word
Words → Words Word
Word → w                   w ∈ V

Figure: example parse tree. Sentence → Doc_3; Topic_5 → Words spanning “polynomial size”; Topic_15 → Words spanning “threshold circuits”.

Is “white house” a topical collocation?
• In a real-estate context: compositional phrase
• In a political context: topical collocation

Examples of Text Analysis Tasks

Break down document into topically coherent chunks — Text segmentation

Figure: A 21-paragraph science news article, called Stargazers, from Hearst,
1997. The main topic is the existence of life on earth and other planets.

Text Data in an Unstructured Form

Text data usually appears in an unstructured form.

Text Data in a Structured Form

Goal: manipulate free-form natural-language text and convert it into a structured form.

Basic Tasks in Text Preprocessing

Tokenisation

Case Normalisation

Stopping

Stemming & Lemmatisation

Sentence segmentation

Basic Tasks in Text Preprocessing —Tokenisation

Tokenisation: the process of breaking a stream of text into tokens.
• Computers usually represent text as a sequence of characters.
  “A data wrangler is the person performing the wrangling tasks.”
• Most natural language processing (NLP) and text mining algorithms can only
  operate on tokens.
  [“A”, “data”, “wrangler”, “is”, “the”, “person”, “performing”, “the”, “wrangling”, “tasks”]

Challenging issues:
• Periods in abbreviations
  − Common acronyms with periods: U.K., U.N., etc.
  − Other abbreviations with a similar pattern: P.M., A.M., i.e., etc.
• Currency and percentages
  − Different currencies: $10,000.00, £10,000,000.00, AUD100, EUR$10.555 and CNY555.55.
  − Percentages: 23%, 23.23% and 100.00%
• Hyphens and apostrophes
  − Hyphens: “co-operate”, “co-education” and “pre-process”
  − Apostrophes: “don’t”, “she’ll”
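
The challenging cases above can be handled, to a degree, with a hand-written regular expression. The following is a minimal sketch using only Python's standard re module. The names TOKEN_PATTERN and tokenise are illustrative and are not taken from the unit's notebook, and the rules cover only the kinds of examples listed on this slide.

```python
import re

# Illustrative pattern (an assumption, not the unit's reference solution).
# Alternatives are tried left to right, so the more specific rules come first.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}                    # abbreviations with periods: U.K., P.M., i.e.
  | [$£]?\d+(?:,\d{3})*(?:\.\d+)?%?       # numbers, currency amounts and percentages
  | \w+(?:[-'’]\w+)*                      # words, keeping internal hyphens and apostrophes
""", re.VERBOSE)


def tokenise(text):
    """Return the list of tokens found in text; other punctuation is dropped."""
    return TOKEN_PATTERN.findall(text)


print(tokenise("A data wrangler is the person performing the wrangling tasks."))
print(tokenise("The U.K. charged $10,000.00, up 23.23%; don't co-operate."))
```

A pattern like this cannot tell whether the final period in “U.K.” also ends the sentence, which is one reason tokenisation and sentence segmentation are usually considered together.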

See the Jupyter Notebook!

Basic Tasks in Text Preprocessing — Case Normalisation

Capitalisation helps readers differentiate, for example, between common nouns and
proper nouns.
• Common nouns: writer, teacher, cookies, ...
• Proper nouns: Herman Melville, Snoopy, University of Melbourne, ...

Case normalisation: convert all the words into either uppercase or lowercase.
• “data” vs. “Data”

Case normalisation is not always needed.
• Information Retrieval: “Data Wrangler” vs. “data wrangler”
• Named Entity Recognition: it is better to leave capitalised words capitalised.

One function that performs case normalisation: Python's str.lower()
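
As a small illustration of the effect of case normalisation, the sketch below lowercases a list of tokens with str.lower() and compares the vocabulary size before and after. The helper name normalise_case and the sample tokens are assumptions made for this example.

```python
def normalise_case(tokens):
    """Lowercase every token so that "Data" and "data" become the same word."""
    return [token.lower() for token in tokens]


tokens = ["Data", "wrangler", "data", "Wrangler"]
normalised = normalise_case(tokens)

# The vocabulary shrinks because case variants collapse into one entry.
print(len(set(tokens)), "distinct tokens before")
print(len(set(normalised)), "distinct tokens after")
```

For tasks such as named entity recognition, this step would typically be skipped, since the capitalisation itself is a useful signal.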

Basic Tasks in Text Preprocessing — Removing Stop Words

Stop words: words that are extremely common and carry little lexical
content. They are often function words in English. For example,
• articles (e.g., “a”, “the”, and “an”),
• pronouns (e.g., “he”, “him”, and “they”),
• particles (e.g., “well”, “however” and “thus”)

Stop words usually refer to the most common words in a language. The
general strategy for determining whether a word is a stop word is to
compute its total number of appearances in a corpus.

Why should we remove stop words?
• Stop words usually carry little value and have little impact on the
  final results, as their presence in a text document does not really
  help distinguish it from other documents.
• Failing to remove those common words could lead to skewed analysis results.
  For example,
  − Email analysis: remove headers (e.g., “Subject”, “To”, and “From”), remove
    a lengthy legal disclaimer, ...
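
The frequency-based strategy described above can be sketched with collections.Counter from the standard library. The function names, the cut-off top_n, and the tiny corpus are assumptions for illustration. In practice the cut-off would be chosen by inspecting the counts, and a curated stop word list is often used instead.

```python
from collections import Counter


def most_frequent_words(documents, top_n):
    """Count each word over the whole corpus and return the top_n most frequent."""
    counts = Counter(token for doc in documents for token in doc)
    return {word for word, _ in counts.most_common(top_n)}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]


# A toy tokenised, lowercased corpus (hypothetical data).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "bone"],
    ["a", "cat", "and", "the", "dog"],
]

stop_words = most_frequent_words(corpus, top_n=1)
print(stop_words)
print(remove_stop_words(corpus[0], stop_words))
```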

Stemming & Lemmatisation

Should we keep word forms like “educate”, “educated”, “educating”, and
“educates” separate, or collapse them?

Stemming: the process of identifying and removing prefixes, suffixes,
and pluralisation, which leaves you with a stem.
• “watches” → “watch”
• “parties” → “party”
• “carrying” → “carry”
• “loving” → “lov”
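
To make the idea of suffix stripping concrete, here is a deliberately simplified sketch that reproduces the four examples above. The function toy_stem and its rules are assumptions written for this illustration. It is not the Porter or Lancaster algorithm, and real stemmers apply many more rules and may give different stems.

```python
def toy_stem(word):
    """Strip a few common English suffixes; a toy rule set, not a real stemmer."""
    word = word.lower()
    # "parties" -> "party": plural of a word ending in "y"
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    # "watches" -> "watch": "es" added after a sibilant sound
    if word.endswith("es") and word[:-2].endswith(("ch", "sh", "s", "x", "z")):
        return word[:-2]
    # "carrying" -> "carry", "loving" -> "lov": drop "ing" without restoring "e"
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    # plain plural "s", but leave words such as "class" alone
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


for w in ["watches", "parties", "carrying", "loving"]:
    print(w, "->", toy_stem(w))
```

Note that “lov” is not a dictionary word: a stem only needs to be a consistent key for related word forms.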

Lemmatisation: a more advanced form of stemming that makes use of, for
example, the context surrounding the words, an existing vocabulary,
morphological analysis of words and other grammatical information (e.g.,
part-of-speech tags) to determine the basic or dictionary form of a word,
which is known as the lemma. In NLTK, use “from nltk.stem import WordNetLemmatizer”.
• “meeting” + (POS = ‘v’) → “meet”
• “meeting” + (POS = ‘n’) → “meeting”
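
The dependence of the lemma on the part-of-speech tag can be illustrated without any external data. The sketch below uses a hand-made lookup table, so LEMMA_TABLE and toy_lemmatise are hypothetical stand-ins. NLTK's WordNetLemmatizer performs a comparable lookup against the WordNet vocabulary through its lemmatize(word, pos) method, which requires the WordNet corpus to be downloaded first.

```python
# Hypothetical mini-vocabulary keyed by (word, part-of-speech tag).
LEMMA_TABLE = {
    ("meeting", "v"): "meet",     # "We are meeting tomorrow."
    ("meeting", "n"): "meeting",  # "The meeting was long."
}


def toy_lemmatise(word, pos="n"):
    """Look up the dictionary form; fall back to the lowercased word itself."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatise("meeting", pos="v"))
print(toy_lemmatise("meeting", pos="n"))
```

Unlike the stem “lov”, a lemma is always a valid dictionary word, at the cost of needing a vocabulary and a POS tag.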

Basic Tasks in Text Preprocessing — Sentence Segmentation

Sentence segmentation: a challenging problem in natural language
processing, which is about deciding where sentences begin and end.

Challenge: punctuation marks are often ambiguous
• Does every sentence end with one of the punctuation marks “.”, “!” or “?”?
• Does a period always indicate a sentence boundary?
  − Some periods occur as part of abbreviations, monetary amounts, percentages,
    decimal points, or email addresses.
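
The ambiguity described above can be seen in a small rule-based splitter. This is a sketch under stated assumptions: the ABBREVIATIONS set is a hypothetical hand-made list, and the rule "terminal punctuation followed by whitespace and a capital letter" is one simple cue among many. It is not how NLTK's Punkt tokenizer works.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"u.k.", "u.n.", "p.m.", "a.m.", "i.e.", "etc."}

# Candidate boundary: ., ! or ? followed by whitespace and a capital, or by the end.
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-Z]|\s*$)")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "The fee is $10.50 at 5 P.M. Monday. Email a.b@example.com now! Done?"
for sentence in split_sentences(text):
    print(sentence)
```

The decimal point and the email address are skipped because no whitespace follows the period, and “P.M.” is skipped by the abbreviation list. The rule still fails when an abbreviation really does end a sentence, which is why learned models are preferred.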

NLTK’s Punkt Sentence Tokenizer was designed to split text into
sentences “by using an unsupervised algorithm to build a model for
abbreviation words, collocations, and words that start sentences.”

Are there any other cues that can be used to identify a sentence boundary?

Summary: what do you need to do this week?

Download and read the materials provided on Moodle, and also read the
recommended reading materials associated with each chapter.
