
LECTURE 5

Spelling Correction

Arkaitz Zubiaga, 22nd January, 2018

LECTURE 5: CONTENTS

 Applications of spelling correction.
 The spelling correction task.
 The noisy channel model for spelling correction.
 Non-word spelling correction.
 Real-word spelling correction.
 Other factors for spelling correction.

SPELLING CORRECTION: APPLICATIONS

 Web search, writing/editing text.

SPELLING CORRECTION: APPLICATIONS

 Correcting typos in our corpora.

SPELLING TASKS

 Spelling error detection.
 Spelling error correction:
   Autocorrect (for common mistakes, e.g. teh → the)
   Suggest a correction
   Suggestion lists

RATES OF SPELLING ERRORS

 26%: Web queries (Wang et al. 2003)
 13%: Retyping, no backspace (Whitelaw et al., English & German)
 7%: Words corrected while retyping on a phone-sized organizer
 2%: Words left uncorrected on the organizer (Soukoreff & MacKenzie 2003)
 1-2%: Retyping (Kane & Wobbrock 2007, Grudin 1983)

TYPES OF SPELLING ERRORS

 Non-word errors: e.g. impresive → impressive
 Real-word errors:
   Typographical errors, e.g. three → there
   Cognitive errors, e.g. too → two, your → you’re

THE NOISY CHANNEL MODEL FOR NON-WORD SPELLING CORRECTION

NON-WORD SPELLING ERRORS

 Error detection: use a dictionary and flag words that are not listed (sketched below).
 The larger the dictionary, the better.
 To correct them, we need to generate candidate corrections, based on:
   Shortest weighted edit distance.
   Highest noisy channel probability.
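A minimal sketch of dictionary-based detection. The word-list path "words.txt" is a placeholder (one word per line); any dictionary file works the same way.

```python
def load_dictionary(path="words.txt"):
    """Load a plain-text word list (placeholder path) into a set."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def find_nonwords(tokens, dictionary):
    """Flag tokens that are not listed in the dictionary."""
    return [t for t in tokens if t.lower() not in dictionary]

# find_nonwords("an impresive result".split(), load_dictionary())
# -> ['impresive']
```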

INTUITION OF THE NOISY CHANNEL

[Figure: an original word is distorted by a noisy channel into the observed misspelling; the decoder guesses the most likely source word.]

THE NOISY CHANNEL

 For a misspelled word x that we observe, we aim to find the dictionary word w that maximises P(w | x):

\hat{w} = \operatorname*{argmax}_{w \in V} P(w \mid x)
        = \operatorname*{argmax}_{w \in V} \frac{P(x \mid w)\,P(w)}{P(x)}
        = \operatorname*{argmax}_{w \in V} P(x \mid w)\,P(w)

 P(x) is constant across candidates, so it can be dropped.
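A minimal sketch of this decision rule, assuming two helper functions: channel(x, w) returning P(x|w) and prior(w) returning P(w); P(x) is never needed.

```python
def correct(x, candidates, channel, prior):
    """Return the candidate w maximising P(x|w) * P(w)."""
    return max(candidates, key=lambda w: channel(x, w) * prior(w))
```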

NON-WORD SPELLING ERROR EXAMPLE

 acress

CANDIDATE GENERATION

 Words with similar spelling:
   Small edit distance to the error.
 Words with similar pronunciation:
   Small edit distance of pronunciation to the error.

COMPUTING THE MINIMUM EDIT DISTANCE

 The minimum edit distance between two strings is the minimum number of editing operations needed to transform one into the other:
   Insertion
   Deletion
   Substitution

COMPUTING THE EDIT DISTANCE

 If each operation has a cost of 1:
   The distance between intention and execution is 5.
 Levenshtein distance: substitutions cost 2 (s = d + i):
   The distance between them is 8.

(d = deletion, i = insertion, s = substitution; a sketch of the computation follows.)
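A dynamic-programming sketch of the computation: sub_cost=1 gives the unit-cost distance, sub_cost=2 the Levenshtein variant above where a substitution counts as a deletion plus an insertion.

```python
def min_edit_distance(source, target, sub_cost=1):
    n, m = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                  # deletions only
    for j in range(1, m + 1):
        d[0][j] = j                  # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                              # deletion
                d[i][j - 1] + 1,                              # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost),  # substitution
            )
    return d[n][m]

# min_edit_distance("intention", "execution")              -> 5
# min_edit_distance("intention", "execution", sub_cost=2)  -> 8
```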

DAMERAU-LEVENSHTEIN EDIT DISTANCE

 Minimum edit distance between two strings, the edits being:
   Insertion.
   Deletion.
   Substitution.
   Transposition of two adjacent letters (see the sketch below).
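A sketch of the Damerau extension: the same dynamic program with one extra case allowing a transposition of two adjacent letters at cost 1 (the restricted Damerau-Levenshtein variant).

```python
def damerau_levenshtein(s, t):
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # extra case: adjacent letters swapped
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

# damerau_levenshtein("acress", "caress") -> 1  (ac <-> ca transposed)
```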

WORDS WITHIN 1 EDIT OF ACRESS

Error   Candidate    Correct  Error   Type
        Correction   Letter   Letter
acress  actress      t        –       deletion
acress  cress        –        a       insertion
acress  caress       ca       ac      transposition
acress  access       c        r       substitution
acress  across       o        e       substitution
acress  acres        –        s       insertion
acress  acres        –        s       insertion
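A minimal candidate-generation sketch in the spirit of Norvig's spell checker: enumerate every string within one edit (deletion, transposition, substitution, insertion) and filter against a dictionary.

```python
import string

def edits1(word):
    """All strings within one edit of word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | transposes | substitutes | inserts

def candidates(word, dictionary):
    return sorted(edits1(word) & dictionary)

# candidates("acress", {"actress", "cress", "caress",
#                       "access", "across", "acres"})
# -> ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```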

CANDIDATE GENERATION

 NOTE: non-alphabetic characters need to be considered in the correction too, e.g.:
   goodmorning → good morning
   inlaw → in-law

CANDIDATE GENERATION

 80% of errors are within edit distance 1.
 Almost all errors are within edit distance 2.
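Given those statistics, a one-line sketch of widening the search to edit distance 2 by applying edits1 (sketched above) twice, as Norvig's approach does:

```python
def edits2(word):
    """All strings within two edits of word (reuses edits1 above)."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
```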

COMPUTING ERROR PROBABILITIES

 However, besides edit distance, we need to compute the error probabilities, e.g.:

   P(the | teh) → high
   P(the | tze) → low

COMPUTING ERROR PROBABILITIES

 Counts from the 404,253,213 words in the Corpus of Contemporary American English (COCA).

word     Frequency of word  P(word)
actress  9,321              .0000230573
cress    220                .0000005442
caress   686                .0000016969
access   37,038             .0000916207
across   120,844            .0002989314
acres    12,874             .0000318463
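A tiny sketch of the unigram prior behind this table: P(w) = count(w) / N, using the COCA counts above.

```python
N = 404_253_213  # total words in the corpus

counts = {"actress": 9_321, "cress": 220, "caress": 686,
          "access": 37_038, "across": 120_844, "acres": 12_874}

def prior(word):
    """Unigram prior P(word) = count(word) / N."""
    return counts[word] / N

# prior("across") -> 0.0002989..., matching the table.
```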

COMPUTING ERROR PROBABILITIES

 del[x,y]: count(xy typed as x)
 ins[x,y]: count(x typed as xy)
 sub[x,y]: count(x typed as y)
 trans[x,y]: count(xy typed as yx)

CONFUSION MATRIX OF ERRORS

CHANNEL MODEL PROBABILITY

 Kernighan, Church & Gale (1990)
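Using the matrices as defined two slides above, the Kernighan, Church & Gale (1990) estimates take one case per edit type (here written with intended word w and typed word x):

P(x \mid w) =
\begin{cases}
\dfrac{\mathrm{del}[w_{i-1}, w_i]}{\mathrm{count}[w_{i-1} w_i]} & \text{if deletion}\\
\dfrac{\mathrm{ins}[w_{i-1}, x_i]}{\mathrm{count}[w_{i-1}]} & \text{if insertion}\\
\dfrac{\mathrm{sub}[w_i, x_i]}{\mathrm{count}[w_i]} & \text{if substitution}\\
\dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}[w_i w_{i+1}]} & \text{if transposition}
\end{cases}

A minimal sketch applying these definitions, assuming the four confusion matrices and character-sequence counts have already been collected from an error corpus (all names here are illustrative, not a specific library API):

```python
def channel_prob(kind, x, y, conf, counts):
    """P(typed | intended) for one edit, per the matrix definitions above."""
    if kind == "del":    # intended "xy", typed "x"
        return conf["del"][(x, y)] / counts[x + y]
    if kind == "ins":    # intended "x", typed "xy"
        return conf["ins"][(x, y)] / counts[x]
    if kind == "sub":    # intended "x", typed "y"
        return conf["sub"][(x, y)] / counts[x]
    if kind == "trans":  # intended "xy", typed "yx"
        return conf["trans"][(x, y)] / counts[x + y]
    raise ValueError(kind)
```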

NOISY CHANNEL PROBABILITY FOR ACRESS

Candidate    Correct  Error   x|w    P(x|word)   P(word)      10^9 × P(x|w)P(w)
Correction   Letter   Letter
actress      t        –       c|ct   .000117     .0000231     2.7
cress        –        a       a|#    .00000144   .000000544   .00078
caress       ca       ac      ac|ca  .00000164   .00000170    .0028
access       c        r       r|c    .000000209  .0000916     .019
across       o        e       e|o    .0000093    .000299      2.8
acres        –        s       es|e   .0000321    .0000318     1.0
acres        –        s       ss|s   .0000342    .0000318     1.0

 Under this model, across (2.8) narrowly outranks the intended actress (2.7).

NOISY CHANNEL PROBABILITY FOR ACRESS

 (Same table as the previous slide.)

 This can be extended to consider bigrams:
   Assume the word before/after acress is correct.
   Use bigram probabilities: which of actress/across is more likely to co-occur with those previous/next words? (See the sketch below.)
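A hedged sketch of that bigram extension: rescore each candidate w by P(x|w) · P(w | prev) · P(next | w). channel_p and bigram_p are assumed helpers (bigram_p(u, v) standing in for P(v | u) from any bigram language model), and the context "versatile acress whose" is purely illustrative.

```python
def contextual_score(x, w, prev, nxt, channel_p, bigram_p):
    """Channel probability reweighted by the surrounding bigrams."""
    return channel_p(x, w) * bigram_p(prev, w) * bigram_p(w, nxt)

# Compare, for an illustrative context "versatile acress whose":
#   contextual_score("acress", "actress", "versatile", "whose", ...)
#   contextual_score("acress", "across",  "versatile", "whose", ...)
# "versatile actress whose" should now win over "versatile across whose".
```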

THE NOISY CHANNEL MODEL FOR REAL WORD SPELLING CORRECTION

REAL WORD SPELLING ERRORS

 …leaving in fifteen minuets to go to her house.
 The design an construction of the system…
 Can they lave him my messages?

 As many as 25-40% of spelling errors are real-word errors (Kukich 1992).

REAL WORD SPELLING ERRORS

 For every word w, generate a candidate set:
   Candidates with similar spelling.
   Candidates with similar pronunciation.
   We need to consider w itself as a candidate.
 Choose the best candidate:
   Noisy channel.
   Classifier.

REAL WORD SPELLING CORRECTION

 Given a sentence w_1, w_2, w_3, …, w_n
 Generate a set of candidates for each word w_i:
   candidate(w_1) = {w_1, w_1', w_1'', w_1''', …}
   candidate(w_2) = {w_2, w_2', w_2'', w_2''', …}
   …
   candidate(w_n) = {w_n, w_n', w_n'', w_n''', …}
 Choose the sequence W that maximises P(W)

REAL WORD SPELLING CORRECTION

[Candidate lattice for the observed sentence "two of thew":
 two  → {to, too, tao, two, …}
 of   → {off, on, of, …}
 thew → {the, thaw, threw, thew, …}]

REAL WORD SPELLING CORRECTION

 Choose the sequence W that maximises P(W).

[Same lattice as above, with the highest-probability path, e.g. "two of the", selected.]

HOW DO WE COMPUTE P(W)?

 To compute P(W):
   Language models (e.g. probabilities of trigrams).
   Noisy channel (probabilities of misspellings).
 We also need a probability for no error, i.e. P(w | w) (scoring sketched below).
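A minimal scoring sketch under stated assumptions: lm_p(u, v, w) returns the trigram probability P(w | u, v), channel_p(x, w) returns P(x | w), and alpha is the no-error probability P(w | w); the value 0.95 is illustrative, not from the lecture.

```python
def sentence_score(observed, W, lm_p, channel_p, alpha=0.95):
    """P(W) * P(observed | W) for one candidate sequence W."""
    padded = ["<s>", "<s>"] + list(W)
    score = 1.0
    for i, (x, w) in enumerate(zip(observed, W)):
        score *= lm_p(padded[i], padded[i + 1], padded[i + 2])  # LM term
        score *= alpha if x == w else channel_p(x, w)           # channel term
    return score
```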

SIMPLIFY: MAX ONE ERROR PER SENTENCE

 Out of all candidate sequences, consider only those with at most one word replaced:
   w_1, w_2'', w_3    two of thew
   w_1, w_2, w_3'     two of the
   w_1''', w_2, w_3   too of thew
   …
 Choose the sequence W that maximises P(W) (search sketched below).
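A sketch of the search under this simplification: score the observed sentence plus every sequence that replaces exactly one word with one of its candidates, and keep the best. candidates_for and score are assumed helpers (score could wrap sentence_score above).

```python
def correct_sentence(observed, candidates_for, score):
    """Best sequence with at most one word replaced."""
    best, best_p = list(observed), score(observed, observed)
    for i, word in enumerate(observed):
        for cand in candidates_for(word):
            if cand == word:
                continue
            W = list(observed)
            W[i] = cand              # exactly one replacement
            p = score(observed, W)
            if p > best_p:
                best, best_p = W, p
    return best

# correct_sentence(["two", "of", "thew"], candidates_for, score)
# might return ["two", "of", "the"]
```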

OTHER FACTORS USED IN SPELLING CORRECTION

HOMOPHONES: SIMILAR PRONUNCIATION

 Incorporate error probabilities owing to homophony:
   bigger then → bigger than
   a peace of paper → a piece of paper
   I have to computers → I have two computers

PROBABILITIES BASED ON KEYBOARD LAYOUT

 Errors after typing nearby keys:
   threw → three (w and e are adjacent on a QWERTY keyboard)
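An illustrative sketch (not from the lecture) of boosting substitution probabilities for physically adjacent QWERTY keys; the adjacency map and boost factor are invented for demonstration.

```python
# Partial QWERTY adjacency map, for demonstration only.
QWERTY_NEIGHBOURS = {
    "e": set("wrsd"),
    "w": set("qesa"),
    "r": set("etdf"),
    # ... remaining keys omitted
}

def keyboard_weight(intended, typed, boost=5.0):
    """Weight a substitution more heavily when the keys are adjacent."""
    return boost if typed in QWERTY_NEIGHBOURS.get(intended, set()) else 1.0

# keyboard_weight("w", "e") -> 5.0   (adjacent: threw -> three)
# keyboard_weight("w", "p") -> 1.0
```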

OTHER FACTORS

 Other factors that could influence P(misspelling | word):
   The source/target letter.
   Right-hand and left-hand keyboard letters.
   The position in the word.
   Formatting (e.g. space after period).
   …

DESIGNING SPELLING CORRECTION SYSTEMS

 What to do when our system finds an error?
   Very confident about the error → autocorrect.
   Less confident → suggest the most likely correction.
   Less confident still → suggest a list of corrections.
   Unconfident → just flag the error.
 Or only check for errors when the user runs the spell checker.
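A purely illustrative dispatch over those confidence tiers; the thresholds are invented for demonstration, not from the lecture.

```python
def act(error, suggestions, confidence):
    """Map correction confidence to one of the actions above."""
    if confidence > 0.95:
        return ("autocorrect", suggestions[0])
    if confidence > 0.8:
        return ("suggest", suggestions[0])
    if confidence > 0.5:
        return ("suggest_list", suggestions[:5])
    return ("flag", error)
```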

RESOURCES

 Peter Norvig’s list of errors:
   http://norvig.com/ngrams/spell-errors.txt

RESOURCES

 Wikipedia’s list of common misspellings:
   https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines

RESOURCES

 Free and open source software:
   GNU Aspell: http://aspell.net/
   Hunspell: http://hunspell.github.io/
     Used by LibreOffice, OpenOffice.org, Mozilla Firefox & Thunderbird, among others.

REFERENCES

 Mays, Eric, Fred J. Damerau, and Robert L. Mercer. 1991. Context based spelling correction. Information Processing & Management, 27(5), 517-522.

 Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205-210.

REFERENCES

 Survey paper: Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4), 377-439.
   http://dc-pubs.dbs.uni-leipzig.de/files/Kukich1992Techniqueforautomatically.pdf

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition. Chapter 5.

 Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 2009. Chapter 2, Section 4.
