CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 2:
Finite-state methods
for morphology
A bit more admin…
HW 0
HW0 will come out later today
(check the syllabus.html page on the website)
We will assume Python 3.5.2 for our assignments
(you shouldn’t have to load any additional modules or libraries
besides the ones we provide)
You get 2 points for HW0
(HW1–HW4 have 10 points each)
1 point for uploading something to Compass
1 point for uploading a tar.gz file with the correct name
and file structure
Compass and enrollment…
We won’t be able to grade more than 100
assignments (and HW0 is only worth 2 points)
-Lecture slides and the PDFs for the assignments
will always be posted on the class website.
-You don’t need to be on Compass to get access.
-Piazza is also available to everybody.
If you are planning to drop this class, please do so
ASAP, so that others can take your spot.
If you just got into the class, it may take up to 24
hours for you to get access to Compass.
DRES accommodations
If you need any disability-related accommodations,
talk to DRES (http://disability.illinois.edu,
disability@illinois.edu, phone 333-4603).
If you are concerned that you have a disability-related condition
that is impacting your academic progress, academic screening
appointments are available on campus that can help diagnose a
previously undiagnosed disability. To sign up, visit the DRES
website and select “Sign-Up for an Academic Screening”
at the bottom of the page.
Come and talk to me as well, especially once you
have a letter of accommodation from DRES.
Do this early enough so that we can take your requirements
into account for exams and assignments.
Last lecture
The NLP pipeline:
Tokenization → POS tagging → Syntactic parsing
→ Semantic analysis → Coreference resolution
Why is NLP difficult?
Ambiguity
Coverage
Today’s lecture
What is the structure of words?
(in English, Chinese, Arabic,…)
Morphology: the area of linguistics that deals with this.
How can we identify the structure of words?
We need to build a morphological analyzer (parser).
We will use finite-state transducers for this task.
Finite-State Automata and Regular Languages
(Review)
NB: No probabilities or machine learning yet.
We’re thinking about (symbolic) representations today.
Morphology:
What is a word?
A Turkish word
uygarlaştıramadıklarımızdanmışsınızcasına
uygar_laş_tır_ama_dık_lar_ımız_dan_mış_sınız_casına
“as if you are among those whom we were not able to civilize
(= cause to become civilized)”
uygar: civilized
_laş: become
_tır: cause somebody to do something
_ama: not able
_dık: past participle
_lar: plural
_ımız: 1st person plural possessive (our)
_dan: among (ablative case)
_mış: past
_sınız: 2nd person plural (you)
_casına: as if (forms an adverb from a verb)
(K. Oflazer, p.c. to Jurafsky & Martin)
Basic word classes
(parts of speech)
Content words (open-class):
Nouns: student, university, knowledge,…
Verbs: write, learn, teach,…
Adjectives: difficult, boring, hard, …
Adverbs: easily, repeatedly,…
Function words (closed-class):
Prepositions: in, with, under,…
Conjunctions: and, or,…
Determiners: a, the, every,…
Words aren’t just defined by blanks
Problem 1: Compounding
“ice cream”, “website”, “web site”, “New York-based”
Problem 2: Other writing systems have no blanks
Chinese: [Chinese characters not rendered] =
“I start(ed) writing novel(s)”
Problem 3: Clitics
English: “doesn’t”, “I’m”
Italian: “dirglielo” = dir + gli(e) + lo
tell + him + it
�11 CS447: Natural Language Processing (J. Hockenmaier)
How many words are there?
Of course he wants to take the advanced course too.
He already took two beginners’ courses.
This is a bad question. Did I mean:
How many word tokens are there?
(16 to 19, depending on how we count punctuation)
How many word types are there?
(i.e. how many different words are there?
Again, this depends on how you count, but it’s
usually much smaller than the number of tokens)
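As a quick illustration (not part of the original slides), we can count tokens and types in Python with a crude regex tokenizer; the tokenizer itself is an assumption, and the counts shift with different tokenization choices:

import re

text = ("Of course he wants to take the advanced course too. "
        "He already took two beginners' courses.")
# Crude tokenizer: words (keeping a trailing apostrophe) and punctuation
tokens = re.findall(r"\w+'?\w*|[.,!?]", text.lower())
print(len(tokens))        # number of word tokens
print(len(set(tokens)))   # number of word types (distinct tokens)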
How many words are there?
Of course he wants to take the advanced course too.
He already took two beginners’ courses.
The same (underlying) word can take different forms:
course/courses, take/took
We distinguish concrete word forms (take, taking)
from abstract lemmas or dictionary forms (take).
Different words may be spelled/pronounced the same:
of course vs. advanced course
two vs. too
How many different words are there?
Inflection creates different forms of the same word:
Verbs: to be, being, I am, you are, he is, I was, …
Nouns: one book, two books
Derivation creates different words from the same lemma:
grace ⇒ disgrace ⇒ disgraceful ⇒ disgracefully
Compounding combines two words into a new word:
cream ⇒ ice cream ⇒ ice cream cone ⇒ ice cream cone bakery
Word formation is productive:
New words are subject to all of these processes:
Google ⇒ Googler, to google, to ungoogle, to misgoogle, googlification,
ungooglification, googlified, Google Maps, Google Maps service,…
Inflectional morphology in English
Verbs:
Infinitive/present tense: walk, go
3rd person singular present tense (s-form): walks, goes
Simple past: walked, went
Past participle (ed-form): walked, gone
Present participle (ing-form): walking, going
Nouns:
Common nouns inflect for number:
singular (book) vs. plural (books)
Personal pronouns inflect for person, number, gender, case:
I saw him; he saw me; you saw her; we saw them; they saw us.
Derivational morphology
Nominalization:
V + -ation: computerization
V + -er: killer
Adj + -ness: fuzziness
Negation:
un-: undo, unseen, …
mis-: mistake,…
Adjectivization:
V + -able: doable
N + -al: national
Morphemes: stems, affixes
dis-grace-ful-ly
prefix-stem-suffix-suffix
Many word forms consist of a stem plus a number of
affixes (prefixes or suffixes)
Infixes are inserted inside the stem.
Circumfixes (German ge-seh-en ‘seen’) surround the stem.
Morphemes: the smallest (meaningful/grammatical)
parts of words.
Stems (grace) are often free morphemes.
Free morphemes can occur by themselves as words.
Affixes (dis-, -ful, -ly) are usually bound morphemes.
Bound morphemes have to combine with others to form words.
Morphemes and morphs
There are many irregular word forms:
Plural nouns add -s to singular: book-books,
but: box-boxes, fly-flies, child-children
Past tense verbs add -ed to infinitive: walk-walked,
but: like-liked, leap-leapt
One morpheme (e.g. for plural nouns) can be
realized as different surface forms (morphs):
-s/-es/-ren
Allomorphs: two different realizations (-s/-es/-ren)
of the same underlying morpheme (plural)
Morphological parsing and generation
Morphological parsing
disgracefully
dis     grace    ful     ly
prefix  stem     suffix  suffix
NEG     grace+N  +ADJ    +ADV
Morphological generation
We cannot enumerate all possible English words,
but we would like to capture the rules that define
whether a string could be an English word or not.
That is, we want a procedure that can generate
(or accept) possible English words…
grace, graceful, gracefully
disgrace, disgraceful, disgracefully,
ungraceful, ungracefully,
undisgraceful, undisgracefully,…
without generating/accepting impossible English words
*gracelyful, *gracefuly, *disungracefully,…
NB: * is linguists’ shorthand for “this is ungrammatical”
Overgeneration
[Diagram: the set of strings accepted by our model, overlaid on the set of actual English words.
Overgeneration: the model accepts strings that are not English, e.g. foobar, gracelyful, disungracefully, grclf, …
Undergeneration: the model fails to accept actual English words, e.g. google, misgoogle, ungoogle, googler, …
Correctly covered: grace, disgrace, disgraceful, …]
Review: Finite-State Automata and Regular Languages
Formal languages
An alphabet Σ is a set of symbols:
e.g. Σ = {a, b, c}
A string ω is a sequence of symbols, e.g. ω = abcb.
The empty string ε consists of zero symbols.
The Kleene closure Σ* (‘sigma star’) is the (infinite)
set of all strings that can be formed from Σ:
Σ* = {ε, a, b, c, aa, ab, ba, aaa, …}
A language L ⊆ Σ* over Σ is also a set of strings.
Typically we only care about proper subsets of Σ* (L ⊂ Σ*).
Automata and languages
An automaton is an abstract model of a computer.
It reads an input string symbol by symbol.
It changes its internal state depending on
the current input symbol and its current internal state.
[Diagram: the automaton reads an input string (e.g. a b a c d e) one symbol at a time; given the current state q and the current input symbol a, it moves to a new state q′.]
Automata and languages
The automaton either accepts or rejects
the input string.
Every automaton defines a language
(the set of strings it accepts).
[Diagram: the automaton reads the input string and either accepts it (the string is in the language) or rejects it (the string is NOT in the language).]
Automata and languages
Different types of automata define
different language classes:
– Finite-state automata define regular languages
– Pushdown automata define context-free languages
– Turing machines define recursively enumerable
languages
The Chomsky Hierarchy
[Diagram: nested classes of languages:
Regular ⊂ Context-free ⊂ Mildly context-sensitive ⊂ Context-sensitive ⊂ Recursively enumerable]
The structure of English words can be described
by a regular (= finite-state) grammar.
Finite-state automata
A (deterministic) finite-state automaton (FSA) consists of:
- a finite set of states Q = {q0, …, qN}, including a start state q0
and one (or more) final (= accepting) states (say, qN)
- a (deterministic) transition function
δ(q, w) = q′ for q, q′ ∈ Q, w ∈ Σ
[Diagram: an example FSA with start state q0 and final state q4 (note the double line); e.g. the automaton moves from state q2 to state q4 if it reads ‘y’.]
[Diagram: an FSA with transitions q0 -b-> q1, q1 -a-> q2, q2 -a-> q3 and a loop q3 -a-> q3; q3 is the final state. It reads the input ‘b a a a’ step by step.]
Start in q0.
Accept!
We’ve reached the end of the string,
and are in an accepting state.
Rejection: Automaton does not end up in accepting state
[Diagram: the same FSA reading the input ‘b’: it ends in state q1.]
Start in q0.
Reject!
(q1 is not a final state)
Rejection: Transition not defined
[Diagram: the same FSA reading the input ‘b a c’: after ‘b a’ it is in state q2, which has no transition labeled ‘c’.]
Start in q0.
Reject!
(There is no transition labeled ‘c’)
Finite State Automata (FSAs)
A finite-state automaton M = 〈Q, Σ, q0, F, δ〉 consists of:
– A finite set of states Q = {q0, q1, …, qn}
– A finite alphabet Σ of input symbols (e.g. Σ = {a, b, c, …})
– A designated start state q0 ∈ Q
– A set of final states F ⊆ Q
– A transition function δ:
– For a deterministic (D)FSA: δ: Q × Σ → Q
δ(q, w) = q′ for q, q′ ∈ Q, w ∈ Σ
If the current state is q and the current input is w, go to q′.
– For a nondeterministic (N)FSA: δ: Q × Σ → 2^Q
δ(q, w) = Q′ for q ∈ Q, Q′ ⊆ Q, w ∈ Σ
If the current state is q and the current input is w, go to any q′ ∈ Q′.
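To make the definition concrete, here is a minimal DFA recognizer sketched in Python (an illustration, not part of the original slides; the transition table is an assumed encoding of the b-a-a automaton from the walkthrough above):

def accepts(delta, q0, final_states, string):
    """Run a DFA on `string`; accept iff we end in a final state."""
    q = q0
    for w in string:
        if (q, w) not in delta:
            return False              # Reject: transition not defined
        q = delta[(q, w)]
    return q in final_states          # Reject: not in an accepting state

delta = {('q0', 'b'): 'q1', ('q1', 'a'): 'q2',
         ('q2', 'a'): 'q3', ('q3', 'a'): 'q3'}
print(accepts(delta, 'q0', {'q3'}, 'baaa'))  # True: accept
print(accepts(delta, 'q0', {'q3'}, 'b'))     # False: q1 is not final
print(accepts(delta, 'q0', {'q3'}, 'bac'))   # False: no transition labeled 'c'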
Finite State Automata (FSAs)
Every NFA can be transformed into an equivalent DFA:
Recognition of a string w with a DFA is linear in the length of w.
Finite-state automata define the class of regular languages:
L1 = { aⁿbᵐ } = {ab, aab, abb, aaab, abbb, …} is a regular language;
L2 = { aⁿbⁿ } = {ab, aabb, aaabbb, …} is not (it’s context-free).
You cannot construct an FSA that accepts all the strings in L2 and nothing else.
[Diagram: example automata for L1.]
Regular Expressions
Regular expressions can also be used to define a
regular language.
Simple patterns:
-Standard characters match themselves: ‘a’, ‘1’
-Character classes: ‘[abc]’, ‘[0-9]’, negation: ‘[^aeiou]’
(Predefined: \s (whitespace), \w (alphanumeric), etc.)
-Any character (except newline) is matched by ‘.’
Complex patterns: (e.g. ^[A-Z]([a-z])+\s )
-Group: ‘(…)’
-Repetition: 0 or more times: ‘*’, 1 or more times: ‘+’
-Disjunction: ‘…|…’
-Beginning of line ‘^’ and end of line ‘$’
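These patterns carry over directly to Python’s re module; a quick sketch (not from the original slides):

import re

# The complex pattern from above: a capitalized word followed by whitespace
print(bool(re.match(r'^[A-Z]([a-z])+\s', 'Grace is a noun')))   # True
print(bool(re.match(r'^[A-Z]([a-z])+\s', 'grace is a noun')))   # False

# Character classes, repetition, disjunction
print(re.findall(r'[0-9]+', 'HW1 is worth 10 points'))   # ['1', '10']
print(bool(re.search(r'cats|dogs', 'cats and dogs')))    # True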
Finite-state methods
for morphology
Finite state automata for morphology
grace: q0 -stem-> q1
dis-grace: q0 -prefix-> q1 -stem-> q2
grace-ful: q0 -stem-> q1 -suffix-> q2
dis-grace-ful: q0 -prefix-> q1 -stem-> q2 -suffix-> q3
Union: merging automata
One automaton for grace, dis-grace, grace-ful, dis-grace-ful:
[Diagram: q0 -prefix-> q1, plus q0 -ε-> q1 (the prefix is optional); q1 -stem-> q2 (final); q2 -suffix-> q3 (final).]
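The union automaton above can equivalently be written as a regular expression over the morphs; a minimal sketch in Python (the concrete morphs here are assumptions for illustration):

import re

# Optional prefix 'dis', stem 'grace', optional suffix 'ful'
word = re.compile(r'(dis)?grace(ful)?')
for w in ['grace', 'disgrace', 'graceful', 'disgraceful', 'gracelyful']:
    print(w, bool(word.fullmatch(w)))   # the first four match; the last doesn't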
Stem changes
Some irregular words require stem changes:
Past tense verbs:
teach-taught, go-went, write-wrote
Plural nouns:
mouse-mice, foot-feet, wife-wives
FSAs for derivational morphology
[Diagram: an FSA for a fragment of English derivational morphology, built over the word classes
noun1 = {fossil, mineral, …}
noun2 = {nation, form, …}
noun3 = {natur, structur, …}
adj1 = {equal, neutral}
adj2 = {minim, maxim}
with suffix transitions such as -iz, -e, -ation, -er, -able and -al,
licensing e.g. fossil-iz-e, fossil-iz-ation, equal-iz-e, nation-al, natur-al, minim-al.]
Recognition vs. Analysis
FSAs can recognize (accept) a string, but they don’t
tell us its internal structure.
What we need is a machine that maps (transduces)
the input string into an output string that encodes
its structure:
Input (surface form): c a t s
Output (lexical form): c a t +N +pl
Finite-state transducers
A finite-state transducer T = 〈Q, Σ, Δ, q0, F, δ, σ〉 consists of:
– A finite set of states Q = {q0, q1, …, qn}
– A finite alphabet Σ of input symbols (e.g. Σ = {a, b, c, …})
– A finite alphabet Δ of output symbols (e.g. Δ = {+N, +pl, …})
– A designated start state q0 ∈ Q
– A set of final states F ⊆ Q
– A transition function δ: Q × Σ → 2^Q
δ(q, w) = Q′ for q ∈ Q, Q′ ⊆ Q, w ∈ Σ
– An output function σ: Q × Σ → Δ*
σ(q, w) = ω for q ∈ Q, w ∈ Σ, ω ∈ Δ*
If the current state is q and the current input is w, write ω.
(NB: Jurafsky & Martin define σ: Q × Σ* → Δ*. Why is this equivalent?)
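A minimal sketch of a (nondeterministic) transducer in Python (an illustration, not the lecture’s code); for convenience it folds δ and σ into a single arc table mapping (state, input symbol) to a set of (next state, output string) pairs:

def transduce(arcs, q0, final_states, string):
    """Return the set of output strings the FST can write while reading `string`."""
    configs = {(q0, '')}                    # (state, output written so far)
    for w in string:
        configs = {(q2, out + omega)
                   for (q, out) in configs
                   for (q2, omega) in arcs.get((q, w), set())}
    return {out for (q, out) in configs if q in final_states}

# Assumed toy analyzer for 'cat' and 'cats':
arcs = {('q0', 'c'): {('q1', 'c')},
        ('q1', 'a'): {('q2', 'a')},
        ('q2', 't'): {('q3', 't+N+sg'), ('q4', 't+N')},
        ('q4', 's'): {('q5', '+pl')}}
print(transduce(arcs, 'q0', {'q3', 'q5'}, 'cats'))   # {'cat+N+pl'}
print(transduce(arcs, 'q0', {'q3', 'q5'}, 'cat'))    # {'cat+N+sg'}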
Finite-state transducers
An FST T = L_in × L_out defines a relation between two
regular languages L_in and L_out:
L_in = {cat, cats, fox, foxes, …}
L_out = {cat+N+sg, cat+N+pl, fox+N+sg, fox+N+pl, …}
T = { ⟨cat, cat+N+sg⟩,
⟨cats, cat+N+pl⟩,
⟨fox, fox+N+sg⟩,
⟨foxes, fox+N+pl⟩ }
Some FST operations
Inversion T⁻¹:
The inversion (T⁻¹) of a transducer
switches input and output labels.
This can be used to switch from parsing words
to generating words.
Composition T∘T′ (cascade):
Two transducers T = L1 × L2 and T′ = L2 × L3 can be
composed into a third transducer T″ = L1 × L3.
Sometimes intermediate representations are useful.
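If we view a transducer simply as the set of ⟨input, output⟩ string pairs it defines (a simplification for illustration; real FST algorithms operate on the transition graphs, not enumerated pairs), inversion and composition are one-liners:

def invert(T):
    """Swap input and output: turns an analyzer into a generator."""
    return {(out, inp) for (inp, out) in T}

def compose(T1, T2):
    """Chain T1 and T2 through their shared intermediate strings."""
    return {(a, c) for (a, b) in T1 for (b2, c) in T2 if b == b2}

# Assumed toy cascade: lexical -> intermediate -> surface (cf. the slides below)
T_lex = {('cat+N+PL', 'cat^s#'), ('fox+N+PL', 'fox^s#')}
T_e   = {('cat^s#', 'cats'), ('fox^s#', 'foxes')}
print(compose(T_lex, T_e))   # {('cat+N+PL', 'cats'), ('fox+N+PL', 'foxes')}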
English spelling rules
Peculiarities of English spelling (orthography)
The same underlying morpheme (e.g. plural-s)
can have different orthographic “surface realizations”
(-s, -es)
This leads to spelling changes at morpheme
boundaries:
E-insertion: fox +s = foxes
E-deletion: make +ing = making
Side note: “Surface realization”?
This terminology comes from Chomskyan
Transformational Grammar.
Dominant early approach in theoretical linguistics, superseded
by other approaches (“minimalism”).
Not computational, but has some historical influence on
computational linguistics (e.g. Penn Treebank)
“Surface” = the actually observed language (English, Chinese, Hindi, etc.).
“Surface string” = a written sequence of characters or words
vs. “Deep”/“Underlying” structure/representation:
A more abstract representation.
Might be the same for different sentences with the same
meaning.
Intermediate representations
English plural -s: cat ⇒ cats, dog ⇒ dogs
but: fox ⇒ foxes, bus ⇒ buses, buzz ⇒ buzzes
We define an intermediate representation to capture
morpheme boundaries (^) and word boundaries (#):
Lexicon: cat+N+PL fox+N+PL
⇒ Intermediate representation: cat^s# fox^s#
⇒ Surface string: cats foxes
Intermediate-to-Surface Spelling Rule:
If plural ‘s’ follows a morpheme ending in ‘x’,‘z’ or ‘s’, insert ‘e’.
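The intermediate-to-surface step can be sketched as a string rewrite in Python (an illustration only; the lecture implements it as the transducer Te-insert below):

import re

def intermediate_to_surface(s):
    """Apply the e-insertion rule, then erase the boundary symbols ^ and #."""
    # If plural 's' follows a morpheme ending in 'x', 'z' or 's', insert 'e':
    s = re.sub(r"([xzs])\^s#", r"\1^es#", s)
    return s.replace('^', '').replace('#', '')

print(intermediate_to_surface('cat^s#'))   # cats
print(intermediate_to_surface('fox^s#'))   # foxes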
FST composition/cascade:
[Diagram: the lexical level is mapped to the surface level by a cascade of two transducers.]
Tlex: Lexical to intermediate level
[Diagram: Tlex maps lexical forms such as cat+N+PL to intermediate forms such as cat^s#.]
Te-insert: intermediate to surface level
[Diagram: the e-insertion transducer. Ordinary characters are copied (a:a, …, w:w, y:y); morpheme and word boundaries are deleted (^:ε, #:ε); but after a morpheme-final s, x or z, a morpheme boundary followed by plural s:s is rewritten as e (^:e).
^ = morpheme boundary
# = word boundary
ε = empty string]
Intermediate-to-Surface Spelling Rule:
If plural ‘s’ follows a morpheme ending in ‘x’, ‘z’ or ‘s’, insert ‘e’.
Dealing with ambiguity
book: book +N +sg or book +V?
Generating words is generally unambiguous,
but analyzing words often requires disambiguation.
We need a nondeterministic FST.
– Efficiency problem: Not every nondeterministic
FST can be translated into a deterministic one!
We also need a scoring function to identify which
analysis is more likely.
– We may need to know the context in which the
word appears: (I read a book vs. I book flights)
Other applications
of FSTs
A computer therapist?
Computer: TELL ME MORE ABOUT YOUR FAMILY
Human: My mother takes care of me.
Computer: WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
Human: My father.
Computer: YOUR FATHER
Human: You are like my father in some ways.
Computer: WHAT RESEMBLANCE DO YOU SEE
Human: You are not very aggressive but I think you don’t want me
to notice that.
Computer: WHAT MAKES YOU THINK I AM NOT VERY
AGGRESSIVE
Human: You don’t argue with me.
Computer: WHY DO YOU THINK I DON’T ARGUE WITH YOU
Human: You are afraid of me.
Weizenbaum (1966), ELIZA.
ELIZA as an FST cascade
Human: You don’t argue with me.
Computer: WHY DO YOU THINK I DON’T ARGUE WITH YOU
1. Replace you with I and me with you:
I don’t argue with you.
2. Replace <...> with Why do you think <...>:
Why do you think I don’t argue with you.
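A toy version of these two steps in Python (illustrative only; the real ELIZA used a much larger set of pattern-action rules):

import re

def eliza(utterance):
    # 1. Replace you with I and me with you:
    s = re.sub(r'\byou\b', 'I', utterance, flags=re.IGNORECASE)
    s = re.sub(r'\bme\b', 'you', s, flags=re.IGNORECASE)
    # 2. Wrap the result in the response template:
    return 'WHY DO YOU THINK ' + s.rstrip('.').upper()

print(eliza("You don't argue with me."))   # WHY DO YOU THINK I DON'T ARGUE WITH YOU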
What about other NLP tasks?
Could we write an FST for machine translation?
What about compounds?
Semantically, compounds have hierarchical structure:
(((ice cream) cone) bakery)
not (ice ((cream cone) bakery))
((computer science) (graduate student))
not (computer ((science graduate) student))
We will need context-free grammars to capture this
underlying structure.
Today’s key concepts
Morphology (word structure): stems, affixes
Derivational vs. inflectional morphology
Compounding
Stem changes
Morphological analysis and generation
Finite-state automata
Finite-state transducers
Composing finite-state transducers
Today’s reading
This lecture follows closely
Chapters 3.1–3.7 in Jurafsky & Martin (2008).
Optional readings (see website):
Karttunen and Beesley (2005), Mohri (1997), the Porter stemmer, Sproat et al. (1996)