
LECTURE 5

Spelling Correction

Arkaitz Zubiaga, 22nd January, 2018


• Applications of spelling correction.

• The spelling correction task.

• The noisy channel model for spelling correction.

• Non-word spelling correction.

• Real-word spelling correction.

• Other factors for spelling correction.

LECTURE 5: CONTENTS


• Web search, writing/editing text.

SPELLING CORRECTION: APPLICATIONS


• Correcting typos in our corpora.

SPELLING CORRECTION: APPLICATIONS


• Spelling error detection.

• Spelling error correction:

• Autocorrect (for common mistakes, e.g. teh → the)

• Suggest a correction

• Suggestion lists

SPELLING TASKS


 26%: Web queries Wang et al. 2003

 13%: Retyping, no backspace: Whitelaw et al. English&German

 7%: Words corrected retyping on phone-sized organizer

 2%: Words uncorrected on organizer Soukoref &&acKenzie 2003

 1-2%: Retyping: Kane and Wobbrock 2007, Gruden et al. 1983

RATES OF SPELLING ERRORS


• Non-word errors: e.g. impresive → impressive

• Real-word errors:

• Typographical errors, e.g. three → there

• Cognitive errors, e.g. too → two, your → you’re

TYPES OF SPELLING ERRORS

THE NOISY CHANNEL MODEL FOR
NON-WORD SPELLING
CORRECTION


• Error detection: use a dictionary to find those not listed.

• The larger the dictionary, the better.

• To correct them:

• We need to generate candidate corrections, based on:

• Shortest weighted edit distance.

• Highest noisy channel probability.

NON-WORD SPELLING ERRORS


INTUITION OF THE NOISY CHANNEL


• For a misspelled word x that we observe, we aim to find the dictionary word w that maximises P(w | x)

THE NOISY CHANNEL

\hat{w} = \underset{w \in V}{\operatorname{argmax}}\; P(w \mid x)
        = \underset{w \in V}{\operatorname{argmax}}\; \frac{P(x \mid w)\,P(w)}{P(x)}
        = \underset{w \in V}{\operatorname{argmax}}\; P(x \mid w)\,P(w)

(P(x) is constant across candidates, so it can be dropped.)
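A minimal Python sketch of this decision rule, using the channel and prior probabilities from the acress example later in this lecture:

    # Noisy-channel decision rule: pick the w maximising P(x|w) * P(w).
    # The probabilities below are the ones from the "acress" slides.
    def correct(x, candidates, channel, prior):
        """Return the candidate w maximising P(x|w) * P(w)."""
        return max(candidates, key=lambda w: channel[(x, w)] * prior[w])

    channel = {("acress", "actress"): .000117,   # P(x|w)
               ("acress", "across"):  .0000093,
               ("acress", "acres"):   .0000321}
    prior = {"actress": .0000231, "across": .000299, "acres": .0000318}

    print(correct("acress", ["actress", "across", "acres"], channel, prior))
    # -> "across" (2.8 vs 2.7, in units of 10^-9)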


acress

NON-WORD SPELLING ERROR EXAMPLE


• Words with similar spelling.

• Small edit distance to error.

• Words with similar pronunciation.

• Small edit distance of pronunciation to error.

CANDIDATE GENERATION


• Minimum edit distance between two strings
is the minimum number of editing operations:

• Insertion

• Deletion

• Substitution

needed to transform one into the other.

COMPUTING THE MINIMUM EDIT DISTANCE


• If each operation has a cost of 1:

• Distance between these is 5

• Levenshtein distance: substitutions cost 2 (s = d + i)

• Distance between them is 8

COMPUTING THE EDIT DISTANCE

d = deletion
i = insertion
s = substitution
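A dynamic-programming sketch of this computation. The strings compared on this slide appear only in the original figure, so the classic pair from the associated reading (Jurafsky & Martin), intention → execution, which gives exactly these distances, is used below:

    def min_edit_distance(source, target, sub_cost=1):
        """Minimum edit distance by dynamic programming.
        Insertions and deletions cost 1; substitutions cost sub_cost
        (use 2 for the Levenshtein variant where s = d + i)."""
        n, m = len(source), len(target)
        # d[i][j] = distance between source[:i] and target[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i                           # deletions only
        for j in range(1, m + 1):
            d[0][j] = j                           # insertions only
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if source[i-1] == target[j-1] else sub_cost
                d[i][j] = min(d[i-1][j] + 1,      # deletion
                              d[i][j-1] + 1,      # insertion
                              d[i-1][j-1] + cost) # substitution / match
        return d[n][m]

    print(min_edit_distance("intention", "execution"))              # 5
    print(min_edit_distance("intention", "execution", sub_cost=2))  # 8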


• Minimum edit distance between two strings, edits being:

• Insertion.

• Deletion.

• Substitution.

• Transposition of two adjacent letters.

DAMERAU-LEVENSHTEIN EDIT DISTANCE
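A sketch of the restricted Damerau-Levenshtein variant, extending the dynamic program above with a unit-cost transposition case:

    def damerau_levenshtein(s, t):
        """Restricted Damerau-Levenshtein distance: unit-cost insertion,
        deletion and substitution, plus adjacent-letter transposition."""
        n, m = len(s), len(t)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if s[i-1] == t[j-1] else 1
                d[i][j] = min(d[i-1][j] + 1,       # deletion
                              d[i][j-1] + 1,       # insertion
                              d[i-1][j-1] + cost)  # substitution / match
                # transposition of two adjacent letters
                if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                    d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
        return d[n][m]

    print(damerau_levenshtein("acress", "caress"))  # 1 (ac -> ca transposed)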


WORDS WITHIN 1 EDIT OF ACRESS

Error     Candidate    Correct letter   Error letter   Type
acress    actress      t                –              deletion
acress    cress        –                a              insertion
acress    caress       ca               ac             transposition
acress    access       c                r              substitution
acress    across       o                e              substitution
acress    acres        –                s              insertion
acress    acres        –                s              insertion


• NOTE: non-alphabetic characters need to be considered in the correction too, e.g.:

• goodmorning → good morning

• inlaw → in-law

CANDIDATE GENERATION


• 80% of errors are within edit distance 1.

• Almost all errors are within edit distance 2.

CANDIDATE GENERATION
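Since almost all errors fall within edit distance 2, candidates can be generated by enumerating every string one edit away and keeping only dictionary words, in the style of Peter Norvig's spell corrector (the toy dictionary below is just the acress candidates):

    import string

    def edits1(word):
        """All strings within one edit (deletion, transposition,
        substitution, insertion) of word."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    dictionary = {"actress", "cress", "caress", "access", "across", "acres"}
    print(sorted(edits1("acress") & dictionary))
    # ['access', 'acres', 'across', 'actress', 'caress', 'cress']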


• However, besides edit distance, we need to compute the error probabilities, e.g.:

P(the | teh) → high
P(the | tze) → low

COMPUTING ERROR PROBABILITIES


• Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA).

COMPUTING ERROR PROBABILITIES

word      Frequency of word   P(word)
actress   9,321               .0000230573
cress     220                 .0000005442
caress    686                 .0000016969
access    37,038              .0000916207
across    120,844             .0002989314
acres     12,874              .0000318463


• del[x,y]: count(xy typed as x)

• ins[x,y]: count(x typed as xy)

• sub[x,y]: count(x typed as y)

• trans[x,y]: count(xy typed as yx)

COMPUTING ERROR PROBABILITIES


CONFUSION MATRIX OF ERRORS


• Kernighan, Church, Gale 1990

CHANNEL MODEL PROBABILITY
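For reference, the channel model probability defined in Kernighan, Church & Gale (1990) is estimated from the confusion-matrix counts above, where x is the typo, w the intended word, and i the position where they differ:

    P(x \mid w) =
    \begin{cases}
      \dfrac{\mathrm{del}[w_{i-1}, w_i]}{\mathrm{count}[w_{i-1} w_i]}   & \text{if deletion} \\[2ex]
      \dfrac{\mathrm{ins}[w_{i-1}, x_i]}{\mathrm{count}[w_{i-1}]}       & \text{if insertion} \\[2ex]
      \dfrac{\mathrm{sub}[x_i, w_i]}{\mathrm{count}[w_i]}               & \text{if substitution} \\[2ex]
      \dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}[w_i w_{i+1}]} & \text{if transposition}
    \end{cases}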


NOISY CHANNEL PROBABILITY FOR ACRESS

Candidate   Correct letter   Error letter   x|w     P(x|word)    P(word)       10^9 · P(x|w)P(w)
actress     t                –              c|ct    .000117      .0000231      2.7
cress       –                a              a|#     .00000144    .000000544    .00078
caress      ca               ac             ac|ca   .00000164    .00000170     .0028
access      c                r              r|c     .000000209   .0000916      .019
across      o                e              e|o     .0000093     .000299       2.8
acres       –                s              es|e    .0000321     .0000318      1.0
acres       –                s              ss|s    .0000342     .0000318      1.0



• This can be extended to consider bigrams (assume the words before/after acress are correct):

• Use bigram probabilities: which of actress/across is more likely to co-occur with those previous/next words? (See the sketch below.)
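A small sketch of that bigram rescoring; the context and probability values are illustrative toy numbers, loosely following the worked example in Jurafsky & Martin ("versatile acress whose"):

    # Rescore candidates for "acress" in the context "versatile ___ whose".
    # Bigram probabilities here are illustrative, not corpus estimates.
    bigram = {("versatile", "actress"): .000021, ("actress", "whose"): .0010,
              ("versatile", "across"):  .000021, ("across", "whose"):  .000006}

    def context_score(prev, cand, nxt):
        """P(cand | prev) * P(nxt | cand): how well the candidate fits."""
        return bigram.get((prev, cand), 0.0) * bigram.get((cand, nxt), 0.0)

    for cand in ("actress", "across"):
        print(cand, context_score("versatile", cand, "whose"))
    # actress 2.1e-08, across 1.26e-10 -> the context favours "actress"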

THE NOISY CHANNEL MODEL FOR
REAL WORD SPELLING
CORRECTION


• …leaving in fifteen minuets to go to her house.

• The design an construction of the system…

• Can they lave him my messages?

• As many as 25-40% of spelling errors are real-word errors (Kukich 1992)

REAL WORD SPELLING ERRORS


• For every word w, generate a candidate set:

• Candidates with similar spelling.

• Candidates with similar pronunciation.

• We need to consider w itself as a candidate.

• Choose the best candidate:

• Noisy channel.

• Classifier.

REAL WORD SPELLING ERRORS


• Given a sentence w_1, w_2, w_3, …, w_n

• Generate a set of candidates for each word w_i:

• candidate(w_1) = {w_1, w_1’, w_1’’, w_1’’’, …}

• candidate(w_2) = {w_2, w_2’, w_2’’, w_2’’’, …}

• …

• candidate(w_n) = {w_n, w_n’, w_n’’, w_n’’’, …}

• Choose the sequence W that maximises P(W)

REAL WORD SPELLING CORRECTION


REAL WORD SPELLING CORRECTION

[Figure: candidate lattice for the observed sentence “two of thew”, with candidates generated per word: two → {to, too, tao, two}, of → {off, on, of}, thew → {threw, thaw, the, thew}.]

• Choose the sequence W that maximises P(W).


• To compute P(W):

• Language models (e.g. probabilities of trigrams)

• Noisy channel (probabilities of misspellings)

• We need a probability for no error, i.e. P(w | w)

HOW DO WE COMPUTE P(W)?


• Only consider sequences with 1 word replaced:

• w_1, w_2’’, w_3    two off thew

• w_1, w_2, w_3’    two of the

• w_1’’’, w_2, w_3    too of thew

• …

• Choose the sequence W that maximises P(W)

SIMPLIFY: MAX ONE ERROR PER SENTENCE
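A sketch of this simplification on the "two of thew" example; the candidate sets and bigram scores are hypothetical stand-ins for a real dictionary and language model:

    def candidates(word):
        """Toy candidate sets: the word itself plus nearby dictionary words
        (hypothetical; a real system generates these by edit distance)."""
        table = {"two":  {"two", "to", "too"},
                 "of":   {"of", "off", "on"},
                 "thew": {"thew", "the", "threw", "thaw"}}
        return table.get(word, {word})

    def sequences_one_error(words):
        """All sequences with at most one word replaced by a candidate."""
        yield tuple(words)
        for i, w in enumerate(words):
            for c in candidates(w) - {w}:
                yield tuple(words[:i]) + (c,) + tuple(words[i+1:])

    # Toy bigram model standing in for P(W).
    bigram = {("<s>", "two"): .01, ("two", "of"): .05, ("of", "the"): .2}

    def lm_score(seq):
        score, prev = 1.0, "<s>"
        for w in seq:
            score *= bigram.get((prev, w), 1e-6)  # tiny default for unseen
            prev = w
        return score

    print(max(sequences_one_error(["two", "of", "thew"]), key=lm_score))
    # -> ('two', 'of', 'the')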

OTHER FACTORS USED IN
SPELLING CORRECTION


• Incorporate error probabilities owing to homophony:

• bigger then → bigger than

• a peace of paper → a piece of paper

• I have to computers → I have two computers

HOMOPHONES: SIMILAR PRONUNCIATION


• Errors after typing nearby keys:

• threw → three (QWERTY keyboard)

PROBABILITIES BASED ON KEYBOARD LAYOUT
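One way to realise this, as a sketch only (real systems estimate these weights from typo corpora rather than hard-coding the geometry): treat substitutions between physically adjacent QWERTY keys as more probable.

    # QWERTY rows; two keys count as neighbours if their row and
    # column offsets each differ by at most one.
    ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

    def key_pos(c):
        for r, row in enumerate(ROWS):
            if c in row:
                return r, row.index(c)
        return None

    def adjacent_keys(a, b):
        """True if keys a and b are neighbours on a QWERTY keyboard."""
        pa, pb = key_pos(a), key_pos(b)
        if pa is None or pb is None or a == b:
            return False
        return abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1

    print(adjacent_keys("w", "e"))  # True: the threw -> three slip
    print(adjacent_keys("p", "e"))  # False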


• Other factors that could influence P(misspelling | word):

• The source/target letter.

• Right-hand and left-hand keyboard letters.

• The position in the word.

• Formatting (e.g. space after period).

• …

OTHER FACTORS


• What to do when our system finds an error?

• Very confident about the error → autocorrect

• Less confident → suggest the possible correction

• Less confident → suggest a list of corrections

• Unconfident → just flag the error

• Or only check for errors when the user runs the spell checker.

DESIGNING SPELLING CORRECTION SYSTEMS


RESOURCES

• Peter Norvig’s list of errors:
http://norvig.com/ngrams/spell-errors.txt


RESOURCES

• Wikipedia’s list of common misspellings:
https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines


RESOURCES

• Free and open source software:

• GNU Aspell: http://aspell.net/

• Hunspell: http://hunspell.github.io/

Used by LibreOffice, OpenOffice.org, Mozilla Firefox &
Thunderbird, among others.


REFERENCES

• Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 27(5), 517–522.

• Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205–210.


REFERENCES

• Survey paper:

Kukich, K. (1992). Techniques for automatically correcting words in
text. ACM Computing Surveys (CSUR), 24(4), 377-439.
http://dc-pubs.dbs.uni-leipzig.de/files/Kukich1992Techniqueforautomatically.pdf


ASSOCIATED READING

• Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition. Chapter 5.

• Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 2009. Chapter 2, Section 4.