LECTURE 5
Spelling Correction
Arkaitz Zubiaga, 22nd January, 2018
Applications of spelling correction.
The spelling correction task.
The noisy channel model for spelling correction.
Non-word spelling correction.
Real-word spelling correction.
Other factors for spelling correction.
LECTURE 5: CONTENTS
Web search, writing/editing text.
SPELLING CORRECTION: APPLICATIONS
Correcting typos in our corpora.
SPELLING CORRECTION: APPLICATIONS
Spelling error detection.
Spelling error correction:
Autocorrect (for common mistakes, e.g. teh → the)
Suggest a correction
Suggestion lists
SPELLING TASKS
26%: Web queries (Wang et al. 2003)
13%: Retyping, no backspace (Whitelaw et al., English & German)
7%: Words corrected retyping on phone-sized organizer
2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
1-2%: Retyping (Kane and Wobbrock 2007; Grudin et al. 1983)
RATES OF SPELLING ERRORS
Non-word errors: e.g. impresive → impressive
Real-word errors:
Typographical errors, e.g. three → there
Cognitive errors, e.g. too → two, your → you’re
TYPES OF SPELLING ERRORS
THE NOISY CHANNEL MODEL FOR NON-WORD SPELLING CORRECTION
Error detection by using a dictionary to find those words not listed.
The larger the dictionary, the better.
To correct them:
We need to generate candidate corrections, based on:
Shortest weighted edit distance.
Highest noisy channel probability.
NON-WORD SPELLING ERRORS
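A minimal sketch of dictionary-based detection in Python; the word-list path below is illustrative (it exists on many Unix systems, but any large one-word-per-line dictionary file works):

def load_dictionary(path="/usr/share/dict/words"):
    # Hypothetical path: substitute your own word list if needed.
    with open(path) as f:
        return {line.strip().lower() for line in f}

def non_word_errors(text, dictionary):
    # Flag alphabetic tokens that the dictionary does not list.
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    return [t for t in tokens if t.isalpha() and t not in dictionary]

print(non_word_errors("an impresive result", load_dictionary()))
# ['impresive']  (assuming the dictionary lacks the misspelling)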
INTUITION OF THE NOISY CHANNEL
For a misspelled word x that we observe,
we aim to find the dictionary word w that maximises P(w | x).
THE NOISY CHANNEL
ŵ = argmax_{w ∈ V} P(w | x)
  = argmax_{w ∈ V} P(x | w) P(w) / P(x)
  = argmax_{w ∈ V} P(x | w) P(w)   (P(x) is constant across candidates, so it can be dropped)
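A minimal sketch of this decision rule in Python; the priors and channel probabilities here are taken from the acress example later in the lecture, and the function names are just placeholders:

def noisy_channel_correct(x, vocabulary, P_w, channel):
    # w-hat = argmax over w in V of P(x|w) * P(w);
    # P(x) is dropped because it is identical for every candidate w.
    return max(vocabulary, key=lambda w: channel(x, w) * P_w(w))

priors = {"across": 0.000299, "actress": 0.0000231}       # P(w)
chan = {("acress", "across"): 0.0000093,                  # P(x|w)
        ("acress", "actress"): 0.000117}
print(noisy_channel_correct("acress", priors, priors.get,
                            lambda x, w: chan.get((x, w), 0.0)))
# 'across' (2.8e-9 narrowly beats actress's 2.7e-9)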
acress
NON-WORD SPELLING ERROR EXAMPLE
Words with similar spelling.
Small edit distance to error.
Words with similar pronunciation.
Small edit distance of pronunciation to error.
CANDIDATE GENERATION
Minimum edit distance between two strings
is the minimum number of editing operations:
Insertion
Deletion
Substitution
needed to transform one into the other.
COMPUTING THE MINIMUM EDIT DISTANCE
If each operation has a cost of 1:
distance between these two strings is 5
Levenshtein distance: substitutions cost 2 (s = d + i):
distance between them is 8
COMPUTING THE EDIT DISTANCE
(d = deletion, i = insertion, s = substitution)
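A sketch of the standard dynamic-programming computation. The example pair is not shown on the slide; intention → execution is the usual textbook pair (presumably what the original figure showed), and it gives 5 under unit costs and 8 under the substitutions-cost-2 variant:

def edit_distance(source, target, sub_cost=1):
    # d[i][j] = distance between source[:i] and target[:j]
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                       # delete all of source[:i]
    for j in range(1, m + 1):
        d[0][j] = j                       # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (0 if same else sub_cost))
    return d[n][m]

print(edit_distance("intention", "execution"))              # 5
print(edit_distance("intention", "execution", sub_cost=2))  # 8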
Minimum edit distance between two strings, edits being:
Insertion.
Deletion.
Substitution.
Transposition of two adjacent letters.
DAMERAU-LEVENSHTEIN EDIT DISTANCE
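The same dynamic program extends with one extra case for adjacent transpositions (this is the "restricted" Damerau-Levenshtein variant); a sketch:

def damerau_edit_distance(source, target):
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # transposition of two adjacent letters, e.g. ac -> ca
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]

print(damerau_edit_distance("acress", "caress"))  # 1 (one transposition)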
WORDS WITHIN 1 EDIT OF ACRESS

Error    Candidate    Correct   Error    Type
         Correction   Letter    Letter
acress   actress      t         –        deletion
acress   cress        –         a        insertion
acress   caress       ca        ac       transposition
acress   access       c         r        substitution
acress   across       o         e        substitution
acress   acres        –         s        insertion
acress   acres        –         s        insertion
NOTE: non-alphabetic characters need to be considered in the
correction too, e.g.:
goodmorning → good morning
inlaw → in-law
CANDIDATE GENERATION
80% of errors are within edit distance 1.
Almost all errors within edit distance 2.
CANDIDATE GENERATION
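This motivates generating every string within one edit and keeping those found in the dictionary, in the style of Peter Norvig's spell corrector (whose error list is cited in the resources at the end); a sketch:

import string

def edits1(word):
    # All strings one insert/delete/substitute/transpose away from `word`.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # Edit distance 2: apply edits1 twice (covers almost all errors).
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

print({"actress", "cress", "caress", "access", "across", "acres"}
      <= edits1("acress"))  # True: all candidates from the table above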
However, besides edit distance,
we need to compute the error probabilities.
e.g.:
P(the | teh) → high
P(the | tze) → low
COMPUTING ERROR PROBABILITIES
Counts from 404,253,213 words in the Corpus of Contemporary
American English (COCA).
COMPUTING ERROR PROBABILITIES
Word      Frequency of word   P(word)
actress   9,321               .0000230573
cress     220                 .0000005442
caress    686                 .0000016969
access    37,038              .0000916207
across    120,844             .0002989314
acres     12,874              .0000318463
del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(x typed as y)
trans[x,y]: count(xy typed as yx)
COMPUTING ERROR PROBABILITIES
CONFUSION MATRIX OF ERRORS
Kernighan, Church, Gale 1990
CHANNEL MODEL PROBABILITY
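The formula from Kernighan, Church & Gale (1990) turns the confusion counts above into a channel probability: P(x|w) = del[w_{i-1}, w_i] / count(w_{i-1} w_i) for a deletion, ins[w_{i-1}, x_i] / count(w_{i-1}) for an insertion, sub[x_i, w_i] / count(w_i) for a substitution, and trans[w_i, w_{i+1}] / count(w_i w_{i+1}) for a transposition. A sketch with illustrative counts (the table names here are hypothetical, not a real API):

# Illustrative count tables; real ones come from an error corpus.
del_counts = {("c", "t"): 117}           # "ct" typed as "c"
ins_counts, sub_counts, trans_counts = {}, {}, {}
unigram_counts = {"c": 1_000_000}        # letter counts in the corpus
bigram_counts = {("c", "t"): 1_000_000}  # letter-pair counts

def channel_prob(edit_type, x, y):
    # P(x|w) for a single edit, following Kernighan et al. (1990).
    if edit_type == "deletion":          # "xy" typed as "x"
        return del_counts.get((x, y), 0) / bigram_counts.get((x, y), 1)
    if edit_type == "insertion":         # "x" typed as "xy"
        return ins_counts.get((x, y), 0) / unigram_counts.get(x, 1)
    if edit_type == "substitution":      # "y" typed as "x"
        return sub_counts.get((x, y), 0) / unigram_counts.get(y, 1)
    if edit_type == "transposition":     # "xy" typed as "yx"
        return trans_counts.get((x, y), 0) / bigram_counts.get((x, y), 1)
    raise ValueError(edit_type)

print(channel_prob("deletion", "c", "t"))  # .000117, as in the acress table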
NOISY CHANNEL PROBABILITY FOR ACRESS

Candidate    Correct   Error    x|w     P(x|word)    P(word)      10⁹ × P(x|w)P(w)
Correction   Letter    Letter
actress      t         –        c|ct    .000117      .0000231     2.7
cress        –         a        a|#     .00000144    .000000544   .00078
caress       ca        ac       ac|ca   .00000164    .00000170    .0028
access       c         r        r|c     .000000209   .0000916     .019
across       o         e        e|o     .0000093     .000299      2.8
acres        –         s        es|e    .0000321     .0000318     1.0
acres        –         s        ss|s    .0000342     .0000318     1.0
This can be extended to consider bigrams
(assume the word before/after acress is correct):
use bigram probabilities, i.e. which of actress/across
is more likely to co-occur with those previous/next words?
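A minimal sketch of that bigram extension, assuming a bigram language model is available as a table of conditional probabilities; the context words and all numbers below are made up for illustration:

bigram_prob = {
    ("versatile", "actress"): 2.1e-5, ("actress", "whose"): 1.0e-3,
    ("versatile", "across"): 2.1e-6, ("across", "whose"): 6.0e-6,
}

def context_score(prev_word, cand, next_word, channel_p):
    # channel probability times the two bigrams the candidate takes part in
    return (channel_p
            * bigram_prob.get((prev_word, cand), 1e-10)
            * bigram_prob.get((cand, next_word), 1e-10))

# "...versatile acress whose..." : context now favours actress over across
print(context_score("versatile", "actress", "whose", 0.000117))
print(context_score("versatile", "across", "whose", 0.0000093))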
THE NOISY CHANNEL MODEL FOR REAL WORD SPELLING CORRECTION
…leaving in fifteen minuets to go to her house.
The design an construction of the system…
Can they lave him my messages?
As many as 25-40% of spelling errors are real word errors (Kukich
1992)
REAL WORD SPELLING ERRORS
For every word w, generate candidate set:
Candidates with similar spelling.
Candidates with similar pronunciation.
We need to consider w itself as a candidate.
Choose best candidate:
Noisy channel.
Classifier.
REAL WORD SPELLING ERRORS
Given a sentence w₁, w₂, w₃, …, wₙ
Generate a set of candidates for each word wᵢ:
candidate(w₁) = {w₁, w₁′, w₁″, w₁‴, …}
candidate(w₂) = {w₂, w₂′, w₂″, w₂‴, …}
…
candidate(wₙ) = {wₙ, wₙ′, wₙ″, wₙ‴, …}
Choose sequence W that maximises P(W)
REAL WORD SPELLING CORRECTION
REAL WORD SPELLING CORRECTION
Example: two of thew
two → {two, to, too, tao, …}
of → {of, off, on, …}
thew → {thew, the, threw, thaw, …}
REAL WORD SPELLING CORRECTION
Choose sequence W that maximises P(W).
To compute P(W):
Language models (e.g. probabilities of trigrams)
Noisy channel (probabilities of misspellings)
We need a probability for no error, i.e. P(w | w)
HOW DO WE COMPUTE P(W)?
Out of all possible sequences with only 1 word replaced:
w₁, w₂″, w₃      two of thew
w₁, w₂, w₃′      two of the
w₁‴, w₂, w₃      too of thew
…
Choose the sequence W that maximises P(W)
SIMPLIFY: MAX ONE ERROR PER SENTENCE
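A sketch of this simplified search, assuming a candidates() generator (e.g. edits1 filtered by a dictionary), a sentence log-probability log_prob() from a language model, and a channel model channel_logp(); all three names are placeholders, and the P(w|w) value is an assumed constant:

import math

P_NO_ERROR = 0.95  # assumed P(w|w): probability a word was typed correctly

def correct_sentence(words, candidates, log_prob, channel_logp):
    # Score every sequence with at most one word replaced; keep the best.
    best = list(words)
    best_score = log_prob(words) + len(words) * math.log(P_NO_ERROR)
    for i, w in enumerate(words):
        for cand in candidates(w):
            if cand == w:
                continue
            seq = list(words)
            seq[i] = cand
            # channel term for the one changed word; the rest assumed correct
            score = (log_prob(seq) + channel_logp(w, cand)
                     + (len(words) - 1) * math.log(P_NO_ERROR))
            if score > best_score:
                best, best_score = seq, score
    return best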
OTHER FACTORS USED IN SPELLING CORRECTION
Incorporate error probabilities owing to homophony:
bigger then → bigger than
a peace of paper → a piece of paper
I have to computers → I have two computers
HOMOPHONES: SIMILAR PRONUNCIATION
Errors caused by typing nearby keys:
threw → three (QWERTY keyboard)
PROBABILITIES BASED ON KEYBOARD LAYOUT
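One way to realise this is to give most of the substitution probability mass to physically adjacent keys. A crude sketch; the adjacency map is partial and illustrative, and the 0.8 mass split is an assumption:

QWERTY_NEIGHBOURS = {
    "e": "wsdr", "w": "qase", "r": "edft", "o": "iklp",
    # ...remaining keys omitted for brevity
}

def keyboard_sub_prob(intended, typed, p_adjacent=0.8):
    # Most of the mass on neighbouring keys, the rest spread over all letters.
    neighbours = QWERTY_NEIGHBOURS.get(intended, "")
    if typed in neighbours:
        return p_adjacent / len(neighbours)
    return (1 - p_adjacent) / 26

print(keyboard_sub_prob("e", "w"))  # adjacent keys: high
print(keyboard_sub_prob("e", "o"))  # far apart: low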
Other factors that could influence P(misspelling | word):
The source/target letter.
Right-hand and left-hand keyboard letters.
The position in the word.
Formatting (e.g. space after period).
…
OTHER FACTORS
What to do when our system finds an error?
Very confident about the error → autocorrect
Less confident → suggest the most likely correction
Less confident still → suggest a list of corrections
Unconfident → just flag the error
Or only check for errors when the user runs the spell checker.
DESIGNING SPELLING CORRECTION SYSTEMS
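A toy sketch of such a confidence-based dispatch; the thresholds and the candidate → probability table are invented for illustration:

def handle_error(word, scored):
    # scored: dict mapping candidate corrections to their probability
    best = max(scored, key=scored.get)
    p = scored[best]
    if p > 0.95:
        return ("autocorrect", best)
    if p > 0.7:
        return ("suggest", best)
    if p > 0.4:
        return ("suggest_list", sorted(scored, key=scored.get, reverse=True)[:5])
    return ("flag", word)

print(handle_error("teh", {"the": 0.98, "ten": 0.01, "tech": 0.01}))
# ('autocorrect', 'the')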
RESOURCES
Peter Norvig’s list of errors:
http://norvig.com/ngrams/spell-errors.txt
RESOURCES
Wikipedia’s list of common misspellings:
https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines
RESOURCES
Free and open source software:
GNU Aspell: http://aspell.net/
Hunspell: http://hunspell.github.io/
Used by LibreOffice, OpenOffice.org, Mozilla Firefox &
Thunderbird, among others.
REFERENCES
Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context
based spelling correction. Information Processing and Management,
27(5), 517–522.
Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990.
A spelling correction program based on a noisy channel model.
Proceedings of COLING 1990, 205-210.
REFERENCES
Survey paper:
Kukich, K. (1992). Techniques for automatically correcting words in
text. ACM Computing Surveys (CSUR), 24(4), 377-439.
http://dc-pubs.dbs.uni-leipzig.de/files/Kukich1992Techniqueforautomatically.pdf
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. 3rd edition. Chapter 5.
Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language
Processing with Python. O’Reilly Media. Chapter 2, Section 4.