Text processing – string match
School of Computing and Information Systems
@University of Melbourne 2022
Copyright By PowCoder代写 加微信 powcoder
Exact string match
• Given a string, is some substring contained within it? • Given a string, find all occurrences of some substring.
For example, find Exxon in:
In exes for foxes rex dux mixes a pox of waxed luxes.
An axe, and an axon, to exo Exxon max oxen. Grexit or
Brexit as quixotic haxxers with buxom rex taxation.
COMP20008 Elements of Data Processing
Approximate string match
Find exon in:
In exes for foxes rex dux mixes a pox of waxed luxes.
An axe, and an axon, to exo Exxon max oxen. Grexit or
Brexit as quixotic haxxers with buxom rex taxation.
Not present!
…But what is the “closest” or “best” match?
COMP20008 Elements of Data Processing
Approximate string matching
Find approximate match(es) for exon in:
In exes for foxes rex dux mixes a pox of waxed luxes.
An axe, and an axon, to exo Exxon max oxen.
Grexit or Brexit as quixotic haxxers with buxom rex taxation.
exon → Exxon Insert x exon → exo Delete n
exon → axon Replace e with a
exon → oxen Transpose e and o (not covered)
COMP20008 Elements of Data Processing
Why do string matching?
Application – spelling correction
Need the notion of a dictionary:
• Here, a list of words (entries) that are “correct” with respect to our
(expectations of our) language
• We can break our input into words (substrings) that we wish to match, and compare each of them against the entries in the dictionary
• A word (item) in the input that doesn’t appear in the dictionary is misspelled
• A word (item) in the input that does appear in the dictionary might be correctly spelled or might be misspelled (beyond the scope of this subject)
COMP20008 Elements of Data Processing
Why do string matching?
Application – Spelling Correction
Therefore, the problem here:
Given some item of interest — which does not appear in our dictionary — which entry from the dictionary was truly intended?
Depends on the person who wrote the original string!
COMP20008 Elements of Data Processing
Why do string matching?
Application – Detecting Neologisms
Word Blending:
• forming novel words by “blending” two other words
• not simple concatenation (e.g., football is not a blend word) breakfast + lunch → brunch
fork + spoon → spork
Britain + exit → Brexit
• Language changes continuously
• New terms are often coined in colloquial language (e.g., Twitter) • Social media is fertile grounds for linguistic innovation.
COMP20008 Elements of Data Processing
Why do string matching?
Application – Detecting Equivalence
street and place name conventions
boulevard|blvd|bd|bde|blv|bl|blvde|blvrd|boulavard|boul|bvd
apartment|apt|ap|aprt|aptmnt
village|vil|vge|vill|villag|villg|vlg|vlge|vllg
• Data cleaning (e.g., deduplication) • Query repair
COMP20008 Elements of Data Processing
String match – minimum edit distance
School of Computing and Information Systems
@University of Melbourne 2022
Minimum edit distance
Transform a source string into the target string through three basic character operations.
• Insert, Delete, Replace
• Each operation incurs a cost or (a mismatch score);
“… proceed if you can see no ther option … “
other … otter … their … there
COMP20008 Elements of Data Processing
Levenshtein distance
A minimum edit distance algorithm.
Operations ther ther
• delete • replace
|||| |||| other otter
COMP20008 Elements of Data Processing
Levenshtein distance
A minimum edit distance algorithm.
• Insert →1
• Delete →1
• Replace →1
Alternative
• Insert →1
• Delete →1
• Replace →2
|||| |||| other otter
distance = 1
distance = 2
COMP20008 Elements of Data Processing
Levenshtein distance
• Efficient algorithm based on prefix comparison • Demonstrated with a matrix.
COMP20008 Elements of Data Processing
Levenshtein distance
(m = 4 n=5)
COMP20008 Elements of Data Processing
Levenshtein distance
(m = 4 n=5)
COMP20008 Elements of Data Processing
Levenshtein distance
A minimum edit distance algorithm.
• Insert →1
• Delete →1
• Replace →1
Alternative
• Insert →1
• Delete →1
• Replace →2
|||| |||| other otter
distance = 1
distance = 2
COMP20008 Elements of Data Processing
COMP20008 Elements of Data Processing
Edit-distance to similarity
• Normalised distance: the distance divided by the length of the longer string ! “1,”2
max “1 , “2
• Similarity: 1 – normalised distance ‘()#$%& ‘1,’2 =1.0 − ! “1,”2
• ther → otter?
‘()#$%& /h12, 3//12 = 1.0 − 25 = 0.6
max “1 , “2
COMP20008 Elements of Data Processing
N-gram distance and similarities
School of Computing and Information Systems
@University of Melbourne 2022
Another method for finding approximate string match
What is an (character) n-gram? A substring of length n
• bi-grams of crat: [#c, cr, ra, at, t#]
• tri-grams of crat: [##c, #cr, cra, rat, at#, t##] ‘#’ is a padding character
A sequence of characters is converted into a list of n-grams.
COMP20008 Elements of Data Processing
N-gram distance
N-gram distance between x and y:
|Gn(x)|+|Gn(y)| − 2×|Gn(x) ∩ Gn(y)|
G2(crat) = [#c, cr, ra, at, t#] G2(cart) = [#c, ca, ar, rt, t#] |G2(crat)|= 5,
|G2(cart)|= 5,
|G2(crat) ∩ G2(cart)|= 2
Bi-gram distance between crat and cart
= |G2(crat)|+|G2(cart)| − 2×|G2(crat) ∩ G2(cart)|= 5+5 −2×2 = 6
COMP20008 Elements of Data Processing
N-gram distance
• Less sensitive to relative ordering of strings (matches can be anywhere!)
• More sensitive to long substring matches,
• Quite useless for very long strings and/or very small alphabets
• Potentially useful for (approximate) prefixes / suffixes, e.g., Street → St; or smog → smoke
COMP20008 Elements of Data Processing
Jaccard similarity
&!∩&” &!∪&”
S1: G2(crat) = [#c, cr, ra, at, t#] S2: G2(cart) = [#c, ca, ar, rt, t#]
7$∪7% =8 2
456!”# = G2(crat),G2(cart) = 8 = 0.25 COMP20008 Elements of Data Processing
456!”# 7$,7% =
Sørensen-Dice similarity
%× &!∩&” &! – &”
S1: G2(crat) = [#c, cr, ra, at, t#] S2: G2(cart) = [#c, ca, ar, rt, t#]
7$ =5 7% =5
7$∩7% =2 2×7$∩7% 4 456)*#+ G2(crat),G2(cart) = 7$ + 7% = 10 = 0.4
456)*#+ 7$,7% =
COMP20008 Elements of Data Processing
Cosine similarity
X: G2(crat) = [#c, cr, ra, at, t#] Y: G2(cart) = [#c, ca, ar, rt, t#]
cos G2(crat),G2(cart) =
X⋅Y= 23!4!
= 1+0+0+0+0+0+0+1=2
3 = ∑$ 3% !”# !
= 1%+1%+1%+1%+1% = 5,
4 = ∑$ 4% !”# !
= 1%+1%+1%+1%+1% = 5
COMP20008 Elements of Data Processing
Jaro-Winkler similarity
• Based on edit-distance (also character-based similarity)
• Give more weight to strings with matching prefixes
• Useful for matching (approximate) prefixes / suffixes.
• The details of the algorithm are not covered in the subject
COMP20008 Elements of Data Processing
Pattern Matching in Text – Regular Expressions
School of Computing and Information Systems
@University of Melbourne 2022
Patterns in Text
• Scenario: we have a large collection of unstructured text data. You need to write wrangling code in order to
• Check whether it contains any IP addresses (e.g. 128.250.65.5) • Find all of the IP addresses
• Requirements
• Do it succinctly
• Do it unambiguously
• Have maintainable code
• Specify patterns in text – regular expressions
• Good for calculating statistics (count occurrences of items in text)
• Checking for integrity, filtering, substitutions … COMP20008 Elements of Data Processing
Python regular expressions (re)
Simple match – characters match themselves exactly • The pattern hello will match the string hello
• Hello will match Hello
Metacharacters – special rules for matching
COMP20008 Elements of Data Processing
RE: wildcard
.: matching any character
– Forexample,a.cmatchesa/c,abc,a5c – Tomatch‘.’asaliteral,
escape with ‘\.’ a\.c matches a.c That is, ‘\’ is also a metacharacter
COMP20008 Elements of Data Processing
RE: character sets
[ ]: matching any character from a set (class) of characters; example: [abc], [a-zA-Z]
• [^] : Complementing the set
add ‘^’ as the first character in the class ([^z]anything but z) What does the pattern [z^] match?
• Use ‘\’ to escape special characters‘[’,‘]’inside []. [\[]matches the special character: ‘[’
COMP20008 Elements of Data Processing
RE: escapes
\: backslash character is used to
• escape metacharacters, e.g.,
– Match ‘.’ as a literal: a\.c matches a.c – Match ‘\’ as a literal: a\\c matches a\c
• back-reference a sequence captured by a sub-expression.
• escape the name of a character class
– Match any character that is a member of that character class
– Theclassofdecimaldigitsisnamedd;\dmatchesanydecimaldigit.
* When working with re, use raw string notation r”…”! COMP20008 Elements of Data Processing
RE: escapes – cont.
• Predefined named special character sets: e.g.,
• \d any decimal digit == [0-9]
• \w any alphanumeric character == [a-zA-Z0-9_]
• \W any non-alphanumeric character == [^a-zA-Z0-9_]
• Special character sets are at https://docs.python.org/2/howto/regex.html COMP20008 Elements of Data Processing
RE: alternation
| “OR” operator
• Alternatives
• Given two patterns P1 and P2, P1 | P2 will match P1 or P2
COMP20008 Elements of Data Processing
RE: anchors
^ $: Anchoring
• ^ : start of string
• ^from will match from only at the start of the string, e.g., ‘from a to b’ • ^from will not match ‘I am from Melbourne’
• $ : end of string
• To match ‘^’or ‘$’as a literal
• Escapewith‘\^’or‘\$’
• Put it in character class [$^](note special meaning if ‘^’ is placed as the first character)
COMP20008 Elements of Data Processing
RE: repeats
* + ? {m,n}: repeat a pattern • : zero or more repetitions
• : one or more repetitions
• : zero or once
• {m,n} : at least m and at most n repetitions.
• Pattern search is greedy
COMP20008 Elements of Data Processing
RE: marked sub-expression and substitution
( ): as in math notation, group patterns within the section • Grouped patterns are captured and numbered
• You can specify the contents by back-references
COMP20008 Elements of Data Processing
ELIZA and Doctor
ELIZA: a computer psychotherapist
• I’m (depressed|sad|unhappy)
• Why do you say you are \1
• Is it because \1 that you come to me?
It is now called Doctor in emacs
COMP20008 Elements of Data Processing
Re: metacharacters
The complete list of metacharacters
.^$*+?{}[]\|()
COMP20008 Elements of Data Processing
RE: Lookahead assertions
Specify what comes immediately after a match
• (?= ): positive lookahead assertion. The enclosed expression must also be matched if the preceding pattern can be a match.
• (!= ): negative lookahead assertion. The enclosed expression must not match if the preceding pattern can be a match.
• They are not part of the matched pattern.
\d+(?=AUD): matches amounts only if it IS followed by AUD \d+(!=AUD): matches amounts only if it IS NOT followed by AUD
COMP20008 Elements of Data Processing
re.match()
re.search()
re.split()
p = re.compile(regular expression)
https://docs.python.org/3/library/re.html
COMP20008 Elements of Data Processing
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com