lecture05.pptx
LECTURE 5
Spelling Correcton
Arkaitz Zubiaga, 22
nd
January, 2018
2
Applicaton of spelling correcton.
The spelling correcton task.
The noisy channel model for spelling correcton.
Non-word spelling correcton.
Real word spelling correcton.
Other factors for spelling correcton.
LECTURE 5: CONTENTS
3
Web search, writnggeditng teet.
SPELLING CORRECTION: APPLICATIONS
4
Correctng typos in our corpora.
SPELLING CORRECTION: APPLICATIONS
5
Spelling error detecton:
Spelling error correcton:
Autocorrect (for common mistakes, e.g. teh → the)
Suggest a correcton
Suggeston lists
SPELLING TASKS
6
26%: Web queries Wang et al. 2003
13%: Retyping, no backspace: Whitelaw et al. English&German
7%: Words corrected retyping on phone-sized organizer
2%: Words uncorrected on organizer Soukoref &&acKenzie 2003
1-2%: Retyping: Kane and Wobbrock 2007, Gruden et al. 1983
RATES OF SPELLING ERRORS
7
Non-word errors: e.g. impresive → impressive
Real-word errors:
Typographical errors , e.g. three → there
Cognitie errors, e.g. too → two, your → you’re
TYPES OF SPELLING ERRORS
THE NOISY CHANNEL MODEL FOR
NON-WORD SPELLING
CORRECTION
9
Error detecton by using a dictonary io fnd ihose noi lisied .
The larger the dictonary, the beter.
To correci them:
We need to generaie candidaie correctons, based on:
Shoriesi weighied edii disiance .
Highesi noisy channel probabiliiy .
NON-WORD SPELLING ERRORS
10
INTUITION OF THE NOISY CHANNEL
11
For a misspelled word x that we observe,
we aim to fnd the dictonary word w that maeimises P()
THE NOISY CHANNEL
ŵ=argmax
wÎV
P(w| x)
=argmax
wÎV
P(x |w)P(w)
P(x)
=argmax
wÎV
P(x|w)P(w)
It’s constant
12
acress
NON-WORD SPELLING ERROR EXA&PLE
13
Words with similar spelling.
Small edit distance to error.
Words with similar pronunciaton.
Small edit distance of pronunciaton to error.
CANDIDATE GENERATION
14
Minimum edii disiance between two strings
is the minimum number of editng operatons:
Inserton
Deleton
Substtuton
Needed to transform one into the other
CO&PUTING THE &INI&U& EDIT DISTANCE
15
If each operaton has cosi of 1:
Distance between these is 5
Leienshiein disiance: substtutons cost 2 (s t d + i)
Distance between them is 8
CO&PUTING THE EDIT DISTANCE
d = deletion
i = insertion
s = substitution
16
&inimum edit distance between two strings, edits being:
Inserton.
Deleton.
Substtuton.
Transpositon of two adjacent leters.
DA&ERAU-LEVENSHTEIN EDIT DISTANCE
17
WORDS WITHIN 1 EDIT OF ACRESS
Error Candidaie
Correcton
Correci
Leter
Error
Leter
Type
acress actress t – deleton
acress cress – a inserton
acress caress ca ac transpositon
acress access c r substtuton
acress across o e substtuton
acress acres – s inserton
acress acres – s inserton
18
NOTE: non-alphabetc characiers need io be considered in the
correcton too, e.g.:
goodmorning → good morning
inlaw → in-law
CANDIDATE GENERATION
19
80% of errors are within edit distance 1.
Almost all errors within edit distance 2.
CANDIDATE GENERATION
20
However, besides edit distance,
we need to compute the error probabilites.
e.g.:
P(the | teh) → high
P(the | tze) → low
CO&PUTING ERROR PROBABILITIES
21
Counts from 404,253,213 words in Corpus of Contemporary
English (COCA).
CO&PUTING ERROR PROBABILITIES
word Frequency of word P(word)
actress 9,321 .0000230573
cress 220 .0000005442
caress 686 .0000016969
access 37,038 .0000916207
across 120,844 .0002989314
acres 12,874 .0000318463
22
del[e,y]: count(ey typed as e)
ins[e,y]: count(e typed as ey)
sub[e,y]: count(e typed as y)
trans[e,y]: count(ey typed as ye)
CO&PUTING ERROR PROBABILITIES
23
CONFUSION &ATRIX OF ERRORS
24
Kernighan, Church, Gale 1990
CHANNEL &ODEL PROBABILITY
25
NOISY CHANNEL PROBABILITY FOR ACRESS
Candidaie
Correcton
Correci
Leter
Error
Leter
x|w P(x|word) P(word) 109 *P(e|w)P(w)
actress t – c|ct .000117 .0000231 2.7
cress – a a|# .00000144 .000000544 .00078
caress ca ac ac|ca .00000164 .00000170 .0028
access c r r|c .000000209 .0000916 .019
across o e e|o .0000093 .000299 2.8
acres – s es|e .0000321 .0000318 1.0
acres – s ss|s .0000342 .0000318 1.0
26
NOISY CHANNEL PROBABILITY FOR ACRESS
Candidaie
Correcton
Correci
Leter
Error
Leter
x|w P(x|word) P(word) 109 *P(e|w)P(w)
actress t – c|ct .000117 .0000231 2.7
cress – a a|# .00000144 .000000544 .00078
caress ca ac ac|ca .00000164 .00000170 .0028
access c r r|c .000000209 .0000916 .019
across o e e|o .0000093 .000299 2.8
acres – s es|e .0000321 .0000318 1.0
acres – s ss|s .0000342 .0000318 1.0
This can be extended to consider bigrams:
assume word before/after acress is correct)
Use probabilities of bigrams (which of
actress/across is more likely to co-occur with
those previous/next words?)
THE NOISY CHANNEL MODEL FOR
REAL WORD SPELLING
CORRECTION
28
…leaving in ffeen minueis to go to her house.
The design an constructon of the system…
Can they laie him my messages?
As many as 25-40% of spelling errors are real word errors (Kukich
1992)
REAL WORD SPELLING ERRORS
29
For eiery word w, generate candidate set:
Candidates with similar spelling.
Candidates with similar pronunciaton.
We need to consider w iiself as a candidate.
Choose besi candidaie:
Noisy channel.
Classifer.
REAL WORD SPELLING ERRORS
30
Given a sentence w
1
, w
2
, w
3
, …, w
n
Generate a set of candidates for each word w
i
:
candidate(w
1
) t {w
1
, w
1
’, w
1
’’, w
1
’’’,…}
candidate(w
2
) t {w
2
, w
2
’, w
2
’’, w
2
’’’,…}
…
candidate(w
n
) t {w
n
, w
n
’, w
n
’’, w
n
’’’,…}
Choose sequence W that maeimises P(W)
REAL WORD SPELLING CORRECTION
31
REAL WORD SPELLING CORRECTION
two of thew
to threw
on
thawofftao
thetoo
oftwo thaw
…
32
REAL WORD SPELLING CORRECTION
Choose sequence W that maeimises P(W).
two of thew
to threw
on
thawofftao
thetoo
oftwo thaw
…
33
To compute P(W):
Language models (e.g. probabilites of trigrams)
Noisy channel (probabilites of misspellings)
We need a probabiliiy for no error, i.e. P(w | w)
HOW DO WE CO&PUTE P(W)?
34
Only out of all possible sequences wiih 1 word replaced:
w
1
, w
2
’’, w
3
two of thew
w
1
, w
2
, w
3
’ two of the
w
1
’’’, w
2
, w
3
too of thew
…
Choose the sequence W that maeimises P(W)
SI&PLIFY: &AX ONE ERROR PER SENTENCE
OTHER FACTORS USED IN
SPELLING CORRECTION
36
Incorporate error probabilites owing to homophony:
bigger ihen → bigger ihan
a peace of paper → a piece of paper
I have io computers → I have iwo computers
HO&OPHONES: SI&ILAR PRONUNCIATION
37
Errors afer typing nearby keys:
threw → three (QWERTY keyboard)
PROBABILITIES BASED ON KEYBOARD LAYOUT
38
Other factors that could infuence p(misspelling | word):
The sourcegtarget leter.
Right-hand and lef-hand keyboard leters.
The positon in the word.
Formatng (e.g. space afer period).
…
OTHER FACTORS
39
What to do when our system fnds an error?
Very confdent about error → autocorrect
Less confdent → suggest the possible correcton
Less confdent → suggest list of correctons
Unconfdent → just fag the error
Or only check for errors when the user runs the spell checker.
DESIGNING SPELLING CORRECTION SYSTE&S
40
RESOURCES
Peter Norvig’s list of errors:
http://norvig.com/ngrams/spell-errors.txt
41
RESOURCES
Wikipedia’s list of common misspelllings:
https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines
42
RESOURCES
Free and open source software:
GNU Aspell: http://aspell.net/
Hunspell: http://hunspell.github.io/
Used by LibreOffice, OpenOffice.org, Mozilla Firefox &
Thunderbird, among others.
43
REFERENCES
Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context
based spelling correction. Information Processing and Management,
23(5), 517–522.
Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990.
A spelling correction program based on a noisy channel model.
Proceedings of COLING 1990, 205-210.
44
REFERENCES
Survey paper:
Kukich, K. (1992). Techniques for automatically correcting words in
text. ACM Computing Surveys (CSUR), 24(4), 377-439.
http://dc-pubs.dbs.uni-leipzig.de/files/Kukich1992Techniqueforautomatically.pdf
45
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapter 5.
Bird Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 2
Section 4.