COMP5046
Natural Language Processing
Lecture 5: Assignment 1 and Language Fundamentals
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
LECTURE PLAN
Lecture 5: Assignment 1 and Language Fundamentals
1. RNN/LSTM, Dealing Context Review
2. Assignment 1 Discussion
3. Sentiment Analysis
   1. Sentiment Analysis Overview
   2. Assignment Specification
4. Language Fundamentals
   ▪ Phonology, Morphology, Syntax, Semantics, Pragmatics
5. Text Preprocessing
   1. Tokenization
   2. Cleaning and Normalisation
   3. Stemming and Lemmatisation
   4. Stopword Removal
   5. Regular Expressions
RNN/LSTM Review
Neural Network + Memory = Recurrent Neural Network
[Figure: a recurrent cell; the input layer x_t and the previous hidden state h_{t-1} feed the hidden layer, which produces the new hidden state h_t]
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
New hidden state h_t = a function (with parameters W) of the previous state h_{t-1} and the current input x_t
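As a minimal sketch in NumPy (the dimensions and data are made up purely for illustration), one step of this recurrence follows directly from the formula:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b_h):
    """One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Toy dimensions, purely illustrative
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(d_h, d_h))
W_xh = rng.normal(size=(d_h, d_in))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                       # h_0
for x_t in rng.normal(size=(5, d_in)):  # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_hh, W_xh, b_h)
print(h)  # hidden state after the whole sequence
```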
RNN/LSTM Review
LSTM (Long Short-Term Memory) – Forget Gate
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
Decides what information should be thrown away or kept
Information from the previous hidden state and information from the current input is passed through the sigmoid function. Values come out between 0 and 1. The closer to 0 means to forget, and the closer to 1 means to keep.
RNN/LSTM Review
LSTM (Long Short-Term Memory) – Input Gate
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)
1. Pass the previous hidden state and current input into a sigmoid function
2. Pass the hidden state and current input into the tanh function to squish values between -1 and 1 to help regulate the network
3. Multiply the tanh output with the sigmoid output
*sigmoid output will decide which information is important to keep from the tanh output
RNN/LSTM Review
LSTM (Long Short-Term Memory) – Cell States
C_t = f_t * C_{t-1} + i_t * C̃_t
• The previous cell state C_{t-1} is pointwise multiplied by the forget vector f_t
• The output of the input gate, i_t * C̃_t, is added pointwise, updating the cell state with the new values the network finds relevant
• The result is the new cell state C_t
RNN/LSTM Review
LSTM (Long Short-Term Memory) – Output Gate
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
The output gate decides what the next hidden state should be:
• pass the previous hidden state and the current input into a sigmoid function
• pass the newly modified cell state to the tanh function
• multiply the tanh output with the sigmoid output to decide what information the hidden state should carry
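Putting the four gate equations together, one LSTM step can be sketched in NumPy as follows (the weight matrices act on the concatenation [h_{t-1}, x_t]; shapes and names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step, following the gate equations above."""
    z = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)         # forget gate: what to keep from C_{t-1}
    i = sigmoid(W_i @ z + b_i)         # input gate: what new info to admit
    C_tilde = np.tanh(W_C @ z + b_C)   # candidate cell state
    C = f * C_prev + i * C_tilde       # new cell state
    o = sigmoid(W_o @ z + b_o)         # output gate
    h = o * np.tanh(C)                 # new hidden state
    return h, C
```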
Dealing Context: Review
V to V’ – Projection with Context (1)
Dealing Context: Review
V to V’ – Projection with Context (2)
Dealing Context: Review
V to V’ with Context – Linear Algebra
Dealing Context: Review
V to V’ with Context – Linear Algebra (Simplified)
Dealing Context: Review
V → V’ → 1
LECTURE PLAN
Lecture 5: Assignment 1 and Language Fundamentals
1. RNN/LSTM, Dealing Context Review
2. Assignment 1 Discussion
3. Sentiment Analysis
   1. Sentiment Analysis Overview
   2. Assignment Specification
4. Language Fundamentals
   ▪ Phonology, Morphology, Syntax, Semantics, Pragmatics
5. Text Preprocessing
   1. Tokenization
   2. Cleaning and Normalisation
   3. Stemming and Lemmatisation
   4. Stopword Removal
   5. Regular Expressions
Assignment 1 Discussion
Vs → V’s → V’
[Figure: a network unrolled over Data 1, Data 2 and Data 3; shared weights W_c carry the past forward, so step 2 reflects Data 2 (the present) and the past data, step 3 reflects Data 3 (the present) and the past data, and the final output reflects all information]
Assignment 1 Discussion
Vs → V’s → V’
[Figure: the same unrolled network with named weights: each input Data 1/2/3 is projected by W_xh, and the hidden states h1 → h2 → h3 are carried forward by W_hh, so the final state reflects all information]
Assignment 1 Discussion
RNN
N to 1 Task
[Figure: the sentence "I am crazy in love" as inputs x_1 ... x_5, each projected by W_xh into tanh hidden states h_1 ... h_5 chained from h_0 by W_hh; the final hidden state feeds a softmax that outputs Positive/Negative]
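A minimal PyTorch sketch of this N-to-1 setup (all names and sizes are illustrative assumptions, not the assignment's required interface):

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """N-to-1: run an RNN over the sequence, classify from the last hidden state."""
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)  # tanh by default
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):        # (batch, seq_len)
        x = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        _, h_n = self.rnn(x)             # h_n: (1, batch, hidden_dim)
        return self.out(h_n.squeeze(0))  # logits for Positive/Negative

model = RNNClassifier(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (1, 5)))  # e.g. "I am crazy in love"
probs = torch.softmax(logits, dim=-1)
```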
Assignment 1 Discussion
Bi-RNN
N to 1 Task
[Figure: a forward chain and a backward chain of hidden states h_1 ... h_n over x_1 ... x_n ("I am crazy in love"); their final states feed a softmax layer that outputs Positive/Negative]
Assignment 1 Discussion
Bi-RNN
N to 1 Task
[Figure: the same network drawn with RNN cells, one forward chain and one backward chain over x_1 ... x_n, feeding a softmax layer]
Assignment 1 Discussion
Bi-LSTM
N to 1 Task
[Figure: forward and backward chains of LSTM cells over x_1 ... x_n ("I am crazy in love"), feeding a softmax layer that outputs Positive/Negative]
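A hedged sketch of the bidirectional variant, using PyTorch's built-in bidirectional flag (again, the names and sizes are assumptions):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """N-to-1 with a bidirectional LSTM: concatenate the final forward
    and backward hidden states, then classify."""
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)               # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward ++ backward
        return self.out(h)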
Assignment 1 Discussion
Bi-LSTM
N to N Task
[Figure: the same bidirectional LSTM over x_1 ... x_n ("I am crazy in love"), but with a softmax at every position, emitting one PoS tag per token: PRP VBP JJ IN NN]
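For the N-to-N tagging version, the only change from the N-to-1 sketch above is to classify every timestep's output instead of the final state (an illustrative sketch, with made-up names):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """N-to-N: one tag distribution per token (e.g. PoS tags)."""
    def __init__(self, vocab_size, num_tags, emb_dim=50, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        outputs, _ = self.lstm(self.embed(token_ids))  # (batch, seq, 2*hidden)
        return self.out(outputs)                       # tag logits per token
```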
Assignment 1 Discussion
Let’s discuss our Assignment 1
Assignment 1 Discussion – Topic
Sentiment Analysis using Recurrent Neural Networks!
Assignment 1 Discussion – Model
RNN
N to 1 Task
[Figure: the N-to-1 RNN again, but each token of "I am crazy in love" now enters as a word embedding e_1 ... e_5]
e_n = word embedding
Assignment 1 Discussion – Model
RNN
N to 1 Task
[Figure: the same model with a lexicon embedding l_1 ... l_5 attached to each word embedding e_1 ... e_5]
e_n = word embedding, l_n = lexicon embedding
Assignment 1 Discussion – Model
RNN
N to 1 Task
[Figure: each input x_n is now the combination of word embedding e_n and lexicon embedding l_n]
e_n = word embedding, l_n = lexicon embedding
Assignment 1 Discussion – Model
RNN
N to 1 Task
[Figure: the combined inputs x_1 ... x_5 fed through a chain of RNN cells into a softmax layer that outputs Positive/Negative]
e_n = word embedding, l_n = lexicon embedding
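One common way to combine the two embeddings is concatenation (summation also works if the dimensions match). A hedged sketch, where the lexicon embedding is assumed to be a small vector of lexicon-derived features per word:

```python
import torch
import torch.nn as nn

emb_dim, lex_dim, vocab_size = 50, 4, 10_000
word_embed = nn.Embedding(vocab_size, emb_dim)

def combine(token_ids, lexicon_feats):
    """x_n = [e_n ; l_n]: concatenate word and lexicon embeddings per token.

    token_ids:     (batch, seq_len) int tensor
    lexicon_feats: (batch, seq_len, lex_dim) float tensor, e.g. scores
                   looked up from a sentiment lexicon (assumed precomputed)
    """
    e = word_embed(token_ids)                     # (batch, seq, emb_dim)
    return torch.cat([e, lexicon_feats], dim=-1)  # (batch, seq, emb_dim+lex_dim)
```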
Assignment 1 Discussion – Model
Assignment 1 Specification can be found in
https://github.com/usydnlp/COMP5046
LECTURE PLAN
Lecture 5: Assignment 1 and Language Fundamentals
1. RNN/LSTM, Dealing Context Review
2. Assignment 1 Discussion
3. Sentiment Analysis
   1. Sentiment Analysis Overview
   2. Assignment Specification
4. Language Fundamentals
   ▪ Phonology, Morphology, Syntax, Semantics, Pragmatics
5. Text Preprocessing
   1. Tokenization
   2. Cleaning and Normalisation
   3. Stemming and Lemmatisation
   4. Stopword Removal
   5. Regular Expressions
The NLP Big Picture
The purpose of Natural Language Processing: Overview
[Figure: the NLP stack feeding applications]
• NLP stack (bottom up):
  • Tokenisation: "How is the weather today" → [How] [is] [the] [weather] [today]
  • Stemming: Drinking, Drank, Drunk → Drink
  • PoS Tagging: "She sells seashells" → [she/PRP] [sells/VBZ] [seashells/NNS]
  • Parsing: "Claudia sat on a stool"
  • Entity Extraction: "When Sebastian Thrun ..."
• Applications: Understanding, Searching, Dialog, Translation, Sentiment Analysis, Topic Classification, Topic Modelling, Search, ...
Sentiment Analysis
Movie Review – Positive or Negative
Too easy?
Sentiment Analysis
What is Sentiment Analysis?
Sentiment Analysis
What is Sentiment Analysis?
“Sentiment analysis is the operation of understanding the intent or emotion behind a given piece of text. It is part of text classification, but it is useful for extracting structured information”
Different names for ‘Sentiment Analysis’:
• Opinion extraction
• Opinion mining
• Sentiment mining
• Subjectivity analysis
Sentiment Analysis
Sentiment Analysis
What is Sentiment Analysis?
Typology of Affective States (Scherer et al. 2006): Emotion, Mood, Interpersonal stances, Attitudes, Personality traits
Attitudes: enduring, affectively colored beliefs, dispositions towards objects/persons
• liking, loving, hating, valuing, desiring
Scherer, K., Dan, E., & Flykt, A. (2006). What determines a feeling’s position in affective space? A case for appraisal. Cognition & Emotion, 20(1), 92-113
Sentiment Analysis
Sentiment Analysis: Examples
[Figure: example reviews annotated with their aspects]
Sentiment Analysis
Sentiment Analysis: Sentiment Viz
https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
Sentiment Analysis
Sentiment Analysis: Examples
Twitter mood predicts the stock market (Bollen et al. 2011)
Sentiment Analysis
Sentiment Analysis Tasks
• Movie: Is this review positive or negative?
• Products: what do people think about the new phone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
Sentiment Analysis
What is considered when analysing sentiment?
Sentiment analysis = the detection of Attitudes
Enduring, affectively colored beliefs, dispositions towards objects/persons
Main Factors
• Target Object: an entity that can be a product, person, event, organisation, or topic (e.g. iPhone)
• Attribute: an object usually has two types of attributes
– Components (e.g. touch screen, battery)
– Properties (e.g. size, weight, colour, voice quality)
– Explicit and implicit attributes:
• Explicit attributes: appearing in the attitude (e.g. “the battery life of this phone was not long”)
• Implicit attributes: not appearing in the attitude (e.g. “this phone is too expensive” – the property price)
• Attitude Holder: the person or organisation that expresses the opinion (e.g. my mother was mad with me)
• Type of attitude: positive, negative, or neutral or set of types (e.g. happy)
• Time: the time at which the opinion is expressed
Sentiment Analysis
What is Sentiment Analysis?
• Basic task: Is the attitude of this text positive or negative?
• More complex task: Rank the attitude of this text on a 1-to-5 Likert scale
• Advanced task: Detect the target, source, or complex attitude types
Sentiment Analysis
Finding aspect/attribute/target of sentiment
Title: Sharp, Solid, but Harder to Hold than iPhone 7
– By Tristan on March 13, 2017
“my thoughts on the iPhone 7 are:
1) Retina display is awesome. Everything looks more defined and sharper. There is much color and clarity out there… or should I say, in those digital images and videos… needless to say, the camera as well captures great images.
…….”
Attribute-based Summary and Visualisation
[Figure: attribute-based sentiment visualisation for Smart Phone 1]
• Attribute 1: display – Positive
  1. Retina display is awesome
  2. There is much color and clarity out there
  3. ...
• Attribute 2: camera – Positive
  1. the camera as well captures great images
  2. ...
Sentiment Analysis
Feature Vectors: a bird’s eye view
• Word ngrams (up to 4), skip ngrams w/ 1 missing word
• Character ngrams up to 5
• All caps: number of words in capitals
• Number of consecutive punctuation marks (exclamation, question, or mixed), and whether the last character is one of these
• Presence of emoticons
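A hedged sketch of extracting a few of these feature families (the pattern choices and names here are illustrative assumptions, not a full system):

```python
import re
from collections import Counter

def extract_features(text):
    """Tiny illustrative feature extractor for sentiment classification."""
    tokens = text.split()
    feats = Counter()
    # word unigrams and bigrams (the slide goes up to 4-grams)
    for i, tok in enumerate(tokens):
        feats[f"w={tok.lower()}"] += 1
        if i + 1 < len(tokens):
            feats[f"w2={tok.lower()}_{tokens[i+1].lower()}"] += 1
    # all-caps words
    feats["n_allcaps"] = sum(1 for t in tokens if t.isupper() and len(t) > 1)
    # runs of ! / ? punctuation, and whether the text ends with one
    feats["n_punct_runs"] = len(re.findall(r"[!?]{2,}", text))
    feats["ends_punct"] = int(text.rstrip()[-1:] in "!?")
    # crude emoticon presence
    feats["has_emoticon"] = int(bool(re.search(r"[:;]-?[)(DP]", text)))
    return feats

print(extract_features("I LOVE this movie!!! :)"))
```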
Classifying sentiment is a classification problem
• Typically people have used Naïve Bayes or Support Vector Machines (SVM) in the past [Mohammad et al. 2013]
• Artificial Neural Nets are also becoming more popular now [Nogueira dos Santos & Gatti, 2014]
Sentiment Analysis
Useful Sentiment Lexicons

The General Inquirer
http://www.wjh.harvard.edu/~inquirer
http://www.wjh.harvard.edu/~inquirer/homecat.htm
http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
Categories:
• Positiv (1915 words) and Negativ (2291 words)
• Strong vs Weak, Active vs Passive, Overstated vs Understated
• Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc.
Free to use

LIWC (Linguistic Inquiry and Word Count)
http://www.liwc.net/
2300 words in fewer than 70 classes
Affective Processes:
• negative emotion (bad, weird, hate, problem, tough)
• positive emotion (love, nice, sweet)
Cognitive Processes:
• Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
• Pronouns, Negation (no, never), Quantifiers (few, many)
$30 or $90 fee

MPQA Subjectivity Cues Lexicon
http://www.cs.pitt.edu/mpqa/subj_lexicon.html
Each word annotated for intensity (strong, weak)
6885 words from 8221 lemmas: 2718 positive, 4912 negative
GNU GPL (widely used free software licence)

Opinion Lexicon
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
6786 words: 2006 positive, 4783 negative
Free to use

SentiWordNet
http://swn.isti.cnr.it/
All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness
• [estimable(J,3)] "may be computed or estimated": Pos 0, Neg 0, Obj 1
• [estimable(J,1)] "deserving of respect or high regard": Pos .75, Neg 0, Obj .25
Free to use
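As a minimal illustration of how such a lexicon can be used, a hedged word-counting scorer (the tiny word lists here are made-up stand-ins; a real system would load one of the resources above):

```python
# Toy lexicon-based sentiment scorer. The word lists are illustrative
# stand-ins for a real resource such as the Opinion Lexicon.
POSITIVE = {"love", "nice", "sweet", "awesome", "great"}
NEGATIVE = {"bad", "weird", "hate", "problem", "tough"}

def lexicon_score(text):
    """Return (#positive - #negative) matches; > 0 suggests positive."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(lexicon_score("I love this awesome phone"))     # 2
print(lexicon_score("the battery is a bad problem"))  # -2
```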
Sentiment Analysis
Can you build the sentiment lexicon by yourself?
Bootstrap style: Semi-supervised learning of lexicons
• Use a small amount of information
• A few labeled examples
• A few hand-built patterns
• Bootstrapping a lexicon
Sentiment Analysis
Assignment 1: Sentiment Analysis
LECTURE PLAN
Lecture 5: Assignment 1 and Language Fundamentals
1. RNN/LSTM, Dealing Context Review
2. Assignment 1 Discussion
3. Sentiment Analysis
   1. Sentiment Analysis Overview
   2. Assignment Specification
4. Language Fundamentals
   ▪ Phonology, Morphology, Syntax, Semantics, Pragmatics
5. Text Preprocessing
   1. Tokenization
   2. Cleaning and Normalisation
   3. Stemming and Lemmatisation
   4. Stopword Removal
   5. Regular Expressions
Language Fundamentals
Levels of Natural Language Processing
• Phonology: all sounds, sound systems
• Morphology: forms and words
• Syntax: clauses and sentences
• Semantics: meanings of various kinds
• Pragmatics: language use
Language Fundamentals
We know the sounds of our language
Which sounds are in our language and which sounds are not
• For example, English speakers know the [ŋ] sound (in sing) does not appear at the beginning of a word
• Does this mean that [ŋ] cannot appear at the beginning of words in all human languages?
No! Consider the names Nguyen Tran and Andrew Ng.
Language Fundamentals
We know how sounds can combine
Often shown when a word from one language is borrowed into another:
• McDonalds: English allows consonant clusters such as [mk] and [ldz]; borrowed into languages where consonant clusters are not allowed, it becomes...
マクドナルド (Makudonarudo), 麦当劳 (Màidāngláo), 맥도날드 (Maegdonaldeu)
Language Fundamentals
Morphology: Pieces of Words
• A field of linguistics focused on the study of the forms and formation of words in a language
• Words in a language consist of one or more elements of meaning, which are morphemes
  – Morphemes are the pieces of words: bases, roots and affixes (prefix, suffix)
Example: Unacceptable
• un: prefix (a type of affix), a group of letters placed before the root word
• accept: root word, the central morpheme and key element
• able: suffix (a type of affix), a group of letters placed after the root word
Language Fundamentals
Morphology: Pieces of Words
• A field of linguistics focused on the study of the forms and formation of words in a language
• Words in a language consist of one or more elements of meaning, which are morphemes
  – Morphemes are the pieces of words: bases, roots and affixes.
• walk, walked, walking, walks → walk, walk + -ed, walk + -ing, walk + -s
Language Fundamentals
Levels of Natural Language Processing
• Phonology/Morphology: the structure of words
– Unusually is composed of a prefix un-, a stem usual, and an affix -ly. Learned is learn plus the inflectional affix -ed
• Syntax: the way words are used to form phrases
– It is part of English syntax that a determiner such as the will come before a noun, and also that determiners are obligatory with certain singular nouns.
• Semantics: Compositional and lexical semantics
– Compositional semantics: the construction of meaning based on syntax
– Lexical semantics: the meaning of individual words
• Pragmatics: meaning in context
– Do you have the time? – means ‘can you tell me what time it is now?’
LECTURE PLAN
Lecture 5: Assignment 1 and Language Fundamentals
1. RNN/LSTM, Dealing Context Review
2. Assignment 1 Discussion
3. Sentiment Analysis
   1. Sentiment Analysis Overview
   2. Assignment Specification
4. Language Fundamentals
   ▪ Phonology, Morphology, Syntax, Semantics, Pragmatics
5. Text Preprocessing
   1. Tokenization
   2. Cleaning and Normalisation
   3. Stemming and Lemmatisation
   4. Stopword Removal
   5. Regular Expressions
Text Preprocessing
• Every NLP task needs to do text pre-processing
• Segmenting/tokenizing words in running text
• Normalizing word formats
• Segmenting sentences in running text
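For instance, a minimal sketch with NLTK (assuming the ‘punkt’ tokenizer models have been downloaded):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models, needed once

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Dr. Han teaches COMP5046. The lecture covers text preprocessing."
print(sent_tokenize(text))  # sentence segmentation, handles "Dr." correctly
print(word_tokenize(text))  # word tokenization
```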
Text Preprocessing
How many words?
• Type: an element of the vocabulary
• Token: an instance of that type in running text
• How many of them are in this sentence?
they lay back on the Sydney grass and looked at the stars and their
– 14 tokens
– 13 (or 12?) (or 11?) types
• N = number of tokens
• V = vocabulary = set of types
• |V| is the size of the vocabulary
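A quick way to check such counts (plain whitespace tokenization, just for this example):

```python
text = "they lay back on the Sydney grass and looked at the stars and their"
tokens = text.split()  # naive whitespace tokenization
types = set(tokens)    # the vocabulary V

print(len(tokens))  # N = 14 tokens
print(len(types))   # 12 distinct strings ("the" and "and" repeat); the
                    # slide's 13/12/11 depends on how case and lemmas are treated
```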
Text Preprocessing
How many words?
• N = number of tokens
• V = vocabulary = set of types
• |V| is the size of the vocabulary

Corpus                           Tokens = N    Types = |V|
Switchboard phone conversations  2.4 million   20 thousand
Shakespeare                      884,000       31 thousand
Google N-grams                   1 trillion    13 million
Text Preprocessing
Tokenization: language issues
• French
  • L’ensemble → one token or two?
    • L? L’? Le?
    • Want l’ensemble to match with un ensemble
    • Until 2003, Google could not make this work
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • ‘life insurance company employee’
  • German information retrieval needs a compound splitter
Text Preprocessing
Tokenization: language issues
• Chinese has no spaces between words:
  • 悉尼大学位于澳大利亚悉尼
  • 悉尼大学 位于 澳大利亚 悉尼
  • University of Sydney is located in Sydney, Australia
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats
  • フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) (mixing Katakana, Hiragana, Kanji and Romaji)
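A hedged sketch of Chinese word segmentation using the jieba library (one common choice among several; the exact output depends on its dictionary):

```python
import jieba  # pip install jieba

sentence = "悉尼大学位于澳大利亚悉尼"
print(" ".join(jieba.cut(sentence)))
# expected along the lines of: 悉尼大学 位于 澳大利亚 悉尼
```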
Text Preprocessing
Tokenization: language issues
• Arabic (and Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
[Figure: an Arabic example sentence, read right to left starting from the right]
• ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
• With Unicode, the order of characters in files matches the conceptual order, and the reversal of displayed characters is handled by the rendering system.
Text Preprocessing
Normalization
• Need to “normalize” terms
  • Information Retrieval: indexed text and query terms must have the same form.
  • We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
  • e.g., deleting periods in a term
• Alternative: asymmetric expansion:
  • Enter: window  → Search: window, windows
  • Enter: windows → Search: Windows, windows, window
  • Enter: Windows → Search: Windows
• Potentially more powerful, but less efficient
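A tiny sketch of the equivalence-class approach (deleting periods and lowercasing; purely illustrative):

```python
def normalize(term):
    """Map a term to its equivalence class, e.g. U.S.A. -> usa."""
    return term.replace(".", "").lower()

print(normalize("U.S.A.") == normalize("USA"))  # True: same equivalence class
```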
Text Preprocessing
Case Folding
• Applications like IR: convert all letters to lower case
• Since users tend to use lower case
• Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• For sentiment analysis, Machine Translation, Information extraction
• Case is helpful (US versus us is important)
Text Preprocessing
Lemmatization
• Reduce inflections or variant forms to base form
  • am, are, is → be
  • car, cars, car’s, cars’ → car
  • the boy’s cars are different colors → the boy car be different color
• Lemmatization: have to find the correct dictionary headword form
Machine translation
• Spanish quiero (‘I want’) and quieres (‘you want’) have the same lemma as querer ‘want’
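A minimal sketch with NLTK’s WordNet lemmatizer (requires the ‘wordnet’ data; note it needs a part-of-speech hint to do well on verbs):

```python
import nltk
nltk.download("wordnet", quiet=True)  # lemma dictionary, needed once

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("cars"))          # car  (noun is the default POS)
print(lem.lemmatize("are", pos="v"))  # be
print(lem.lemmatize("is", pos="v"))   # be
```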
Text Preprocessing
Morphology
• Morphemes:
• The small meaningful units that make up words
• Stems: The core meaning-bearing units
• Affixes: Bits and pieces that adhere to stems
• Often with grammatical functions
Text Preprocessing
Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
  • language dependent
  • e.g., automate(s), automatic, automation all reduced to automat
• Original: for example compressed and compression are both accepted as equivalent to compress
• Stemmed: for exampl compress and compress ar both accept as equival to compress
Text Preprocessing
Porter’s algorithm: the most common English stemmer
[Figure: the cascaded rule steps of the Porter Stemming Algorithm]
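NLTK ships an implementation; a quick sketch reproducing the example above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ("for example compressed and compression "
         "are both accepted as equivalent to compress").split()
print(" ".join(stemmer.stem(w) for w in words))
# roughly: for exampl compress and compress ar both accept as equival to compress
```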
Text Preprocessing
Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
  • Turkish
• Uygarlastiramadiklarimizdanmissinizcasina
• `(behaving) as if you are among those whom we could not civilize’
• Uygar `civilized’ + las `become’
• + tir `cause’ + ama `not able’
• + dik `past’ + lar ‘plural’
• + imiz ‘p1pl’ + dan ‘abl’
• + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Text Preprocessing
Sentence Segmentation
• !, ? are relatively unambiguous
• Period “.” is quite ambiguous
• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
• Build a binary classifier
• Looks at a “.”
• Decides EndOfSentence/NotEndOfSentence
• Classifiers: hand-written rules, regular expressions, or machine-learning
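As one hedged sketch of such a classifier in its simplest rule-based form (the abbreviation list is a made-up sample, not a complete resource):

```python
ABBREVIATIONS = {"Dr.", "Inc.", "Mr.", "Mrs.", "Prof."}  # sample list only

def is_sentence_boundary(text, i):
    """Decide EndOfSentence / NotEndOfSentence for the '.' at position i."""
    word = text[:i + 1].split()[-1]  # token ending at this period
    nxt = text[i + 1:i + 2]          # character right after the period
    if word in ABBREVIATIONS:
        return False                 # abbreviation, e.g. Dr. or Inc.
    if nxt.isdigit():
        return False                 # decimal point in a number like 4.30
    return True

text = "Dr. Han arrived at 4.30. The lecture began."
boundaries = [i for i, ch in enumerate(text)
              if ch == "." and is_sentence_boundary(text, i)]
print(boundaries)  # positions of the two real sentence boundaries
```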
Text Preprocessing
Sentence Segmentation using a Decision Tree
[Figure: a decision tree asking questions about the words around the “.” to decide EndOfSentence vs NotEndOfSentence]
Text Preprocessing
Implementing Decision Trees or other classifiers
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
  • Hand-building is only possible for very simple features and domains
  • For numeric features, it’s too hard to pick each threshold by hand
• Instead, the structure is usually learned by machine learning from a training corpus
• The same features could be exploited by any kind of classifier
• Logistic regression
• SVM
• Neural Nets
• etc.
Text Preprocessing
Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
1. woodchuck
2. woodchucks
3. Woodchuck
4. Woodchucks
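In Python’s re module, one pattern covers all four, combining the disjunction and optionality operators introduced on the following slides:

```python
import re

pattern = r"[wW]oodchucks?"  # optional 's', upper or lower 'w'
text = "woodchuck Woodchuck woodchucks Woodchucks"
print(re.findall(pattern, text))
# ['woodchuck', 'Woodchuck', 'woodchucks', 'Woodchucks']
```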
Text Preprocessing
Regular Expressions: Disjunctions
• Letters inside square brackets []
  • [wW]oodchuck → matches Woodchuck, woodchuck
  • [1234567890] → matches any digit
• Ranges [A-Z]
  • [A-Z] → an upper case letter, e.g. in “Drenched Blossoms”
  • [a-z] → a lower case letter, e.g. in “my beans were impatient”
  • [0-9] → a single digit, e.g. in “Chapter 1: Down the Rabbit Hole”
Text Preprocessing
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when it is the first symbol in []
  • [^A-Z] → not an upper case letter, e.g. in “Oyfn pripetchik”
  • [^Ss] → neither ‘S’ nor ‘s’, e.g. in “I have no exquisite reason”
  • [^e^] → neither ‘e’ nor ‘^’, e.g. in “Look here”
  • a^b → the literal pattern ‘a caret b’, e.g. in “Look up a^b now”
Text Preprocessing
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
  • groundhog|woodchuck
  • yours|mine → matches yours, mine
  • a|b|c = [abc]
  • [gG]roundhog|[Ww]oodchuck
Text Preprocessing
Regular Expressions: ? * + .
• colou?r → optional previous char: color, colour
• oo*h! → 0 or more of previous char: oh! ooh! oooh! ooooh!
• o+h! → 1 or more of previous char: oh! ooh! oooh! ooooh!
• baa+ → baa, baaa, baaaa, baaaaa
• beg.n → . matches any char: begin, begun, began, beg3n
• * and + are called Kleene * and Kleene +, after Stephen C. Kleene
Text Preprocessing
Regular Expressions: Anchors ^ $
• ^[A-Z] → an upper case letter at the start: “Palo Alto”
• ^[^A-Za-z] → a non-letter at the start: 1, “Hello”
• \.$ → a literal period at the end: “The end.”
• .$ → any character at the end: “The end?”, “The end!”
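A short demonstration of these operators in Python (the strings are illustrative):

```python
import re

print(re.search(r"colou?r", "colour").group())    # 'colour' (? = optional)
print(re.findall(r"o+h!", "oh! ooh! oooh!"))      # ['oh!', 'ooh!', 'oooh!']
print(re.findall(r"beg.n", "begin begun began"))  # ['begin', 'begun', 'began']
print(bool(re.search(r"^[A-Z]", "Palo Alto")))    # True (^ anchors the start)
print(bool(re.search(r"\.$", "The end.")))        # True ($ anchors the end)
```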
Text Preprocessing
Summary
• Regular expressions play a surprisingly large role
  • Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
• But regular expressions are used as features in the classifiers
• Can be very useful in capturing generalizations