COMP20008
Elements of Data Processing
Semester 1 2021
Lecture 4, Part 1 – Part 4: Unstructured Data – Text Preprocessing
© University of Melbourne 2021
Text (string) search
String Matching and String Comparison
© University of Melbourne 2021
Exact string search
• Given a string, is some substring contained within it? • Given a string, find all occurrences of some substring.
For example, find Exxon in:
In exes for foxes rex dux mixes a pox of waxed luxes. An
axe, and an axon, to exo Exxon max oxen. Grexit or
Brexit as quixotic haxxers with buxom rex taxation.
© University of Melbourne 2021
Approximate string search/match
Find exon in:
In exes for foxes rex dux mixes a pox of waxed luxes. An
axe, and an axon, to exo Exxon max oxen. Grexit or
Brexit as quixotic haxxers with buxom rex taxation.
Not present!
…But what is the “closest” or “best” match?
© University of Melbourne 2021
Approximate string search/matching
Find approximate match(es) for exon in:
In exes for foxes rex dux mixes a pox of waxed luxes.
An axe, and an axon, to exo Exxon max oxen.
Grexit or Brexit as quixotic haxxers with buxom rex taxation.
exon → Exxon Insert x exon → exo Delete n
exon → axon Replace e with a
exon → oxen Transpose e and o (not covered)
© University of Melbourne 2021
Why do string matching?
Application – spelling correction
Need the notion of a dictionary:
• Here, a list of words (entries) that are “correct” with respect to our
(expectations of our) language
• We can break our input into words (substrings) that we wish to match, and compare each of them against the entries in the dictionary
• A word (item) in the input that doesn’t appear in the dictionary is misspelled
• A word (item) in the input that does appear in the dictionary might be
correctly spelled or might be misspelled (beyond the scope of this subject) © University of Melbourne 2021
Why do string matching?
Application – Spelling Correction
Therefore, the problem here:
Given some item of interest — which does not appear in our dictionary — which entry from the dictionary was truly intended?
Depends on the person who wrote the original string!
© University of Melbourne 2021
Why do string matching?
Application – Detecting Neologisms
Word Blending:
• forming novel words by “blending” two other words
• not simple concatenation (e.g., football is not a blend word) breakfast + lunch → brunch
fork + spoon → spork
Britain + exit → Brexit
• Language changes continuously
• New terms are often coined in colloquial language (e.g., Twitter) • Social media is fertile grounds for linguistic innovation.
© University of Melbourne 2021
Why do string matching?
Application – Detecting Equivalence
street and place name conventions
boulevard|blvd|bd|bde|blv|bl|blvde|blvrd|boulavard|boul|bvd
apartment|apt|ap|aprt|aptmnt
village|vil|vge|vill|villag|villg|vlg|vlge|vllg
• Data cleaning (e.g., deduplication) • Query repair
© University of Melbourne 2021
Break
String match and comparison –
cont.
Minimum Edit Distance
© University of Melbourne 2021
Minimum edit distance
Transform a source string into the target string through three basic character operations.
• Insert, Delete, Replace
• Each operation incurs a cost or (a mismatch score);
“… proceed if you can see no ther option … “
other … otter … their … there
© University of Melbourne 2021
Levenshtein distance
A minimum edit distance algorithm.
Operations ther ther
• insert
• delete • replace
|||| |||| other otter
© University of Melbourne 2021
Levenshtein distance
A minimum edit distance algorithm.
Cost
• Insert →1
• Delete →1
• Replace →1
Alternative
• Insert →1
• Delete →1
• Replace →2
ther ther
|||| |||| other otter
distance = 1
distance = 2
© University of Melbourne 2021
Levenshtein distance
• Efficient algorithm based on prefix comparison • Demonstrated with a matrix.
#
o
t
t
e
r
#
0
1
2
3
4
5
t
1
1
1
2
3
4
h
2
2
2
2
3
4
e
3
3
3
3
2
3
r
4
4
4
4
3
2
© University of Melbourne 2021
Levenshtein distance
j
(m = 4 n=5)
#
o
t
t
e
r
#
0
1
2
3
4
5
t
1
2, 2, 1
h
2
e
3
r
4
i
© University of Melbourne 2021
Levenshtein distance
j
(m = 4 n=5)
#
o
t
t
e
r
#
0
1
2
3
4
5
t
1
2, 2, 1
3, 2, 1
3, 2, 2
5, 3, 4
6, 5, 4
h
2
2, 3, 2
2, 3, 3
3, 3, 2
4, 3, 3
5, 4, 4
e
3
3, 4, 3
3, 4, 3
3, 4, 3
4, 4, 2
5, 3, 3
r
4
4, 5, 4
4, 5, 4
4, 5, 4
3, 5, 4
4, 4, 2
i
© University of Melbourne 2021
Levenshtein distance
A minimum edit distance algorithm.
Cost
• Insert →1
• Delete →1
• Replace →1
Alternative
• Insert →1
• Delete →1
• Replace →2
ther ther
|||| |||| other otter
distance = 1
distance = 2
© University of Melbourne 2021
Try it!
#
l
e
c
t
u
r
e
#
0
1
2
3
4
5
6
7
r
1
2, 2, 1
e
2
2, 3, 2
c
3
o
4
r
5
d
6
© University of Melbourne 2021
Edit-distance to similarity
Normalised distance: The distance is divided by the length of the
! “!,”” $%& “!,””
𝑠𝑖𝑚𝑒𝑑𝑖𝑡 𝑠1, 𝑠2 = 1.0 −
• ther → otter?
longer string
! “!,”” $%& “!,””
t
h
e
r
o
t
t
e
r
𝑠𝑖𝑚𝑒𝑑𝑖𝑡 𝑡h𝑒𝑟, 𝑜𝑡𝑡𝑒𝑟 = 1.0 − 25 = 0.6
Break
String match and comparison –
cont.
N-gram Distance
© University of Melbourne 2021
N-grams
Another method for finding approximate string match
What is an (character) n-gram? A (character) substring of length n • bi-grams of crat: cr, ra, at; or
• bi-grams of crat: #c, cr, ra, at, t# (sometimes)
• tri-grams of crat: ##c, #cr, cra, rat, at#, t##
‘#’ is a padding character
A sequence of characters is converted into a set of n-grams.
N-gram distance
N-gram distance between x and y: |Gn(x)| + |Gn(y)| − 2 × |Gn(x) ∩ Gn(y)|
crat and cart:
|G2(crat)| + |G2(cart)| − 2 × |G2(crat) ∩ G2(cart)| = 5+5−2×2= 6
crat and arts:
|G2(crat)| + |G2(art)| − 2 × |G2(crat) ∩ G2(art)| = 5+5−2×0= 10
• bi-grams of crat: G2(crat) = #c, cr, ra, at, t# • bi-grams of cart: G2(cart) = #c, ca, ar, rt, t# • bi-grams of arts: G2(arts) = #a, ar, rt, ts, s#
N-gram distance
• Less sensitive to relative ordering of strings (matches can be anywhere!)
• More sensitive to long substring matches,
• Quite useless for very long strings and/or very small alphabets (Why?)
• Potentially useful for (approximate) prefixes / suffixes, e.g., Street → St; or smog → smoke
Jaccard similarity (set-based)
&!∩&” &!∪&”
S1: G2(crat) = {#c, cr, ra, at, t#} S2: G2(cart) = {#c, ca, ar, rt, t#}
𝑆’∩𝑆( =2 𝑆’∪𝑆( =8
𝑠𝑖𝑚!”## 𝑆$,𝑆% =
𝑠𝑖𝑚)*++ 𝑐𝑟𝑎𝑡, 𝑐𝑎𝑟𝑡 = 28 = 0.25
Sørensen-Dice similarity (set-based)
%× &!∩&” &! – &”
S1: G2(crat) = {#c, cr, ra, at, t#} S2: G2(cart) = {#c, ca, ar, rt, t#}
𝑆’ =5 𝑆( =5
𝑆’∩𝑆( =2 2×𝑆’∩𝑆( 4 𝑠𝑖𝑚!,+- 𝑐𝑟𝑎𝑡,𝑐𝑎𝑟𝑡 = 𝑆’ + 𝑆( = 10 = 0.4
𝑠𝑖𝑚)*#+ 𝑆$,𝑆% =
Cosine similarity (vector-based)
bi-gram(crat) = (#c, cr, ra, at, t#) bi-gram(cart) = (#c, ca, ar, rt, t#) vector representation
#c
ar
at
ca
cr
ra
rt
t#
𝑋
1
0
1
0
1
1
0
1
𝑌
1
1
0
1
0
0
1
1
𝑐𝑜𝑠𝑋,𝑌 =
𝑋⋅𝑌 𝑋! 𝑌”
X⋅𝑌=∑$ 𝑋𝑌 = 1+0+0+0+0+0+0+1=2 !”# ! !
𝑋=∑$ 𝑋%= 1%+1%+1%+1%+1%=5, !”# !
𝑌 = ∑$ 𝑌% = 1%+1%+1%+1%+1% = 5 !”# !
𝑐𝑜𝑠𝑋,𝑌 =
2 =0.4 5× 5
Jaro-Winkler similarity
• Based on edit-distance (also character-based similarity)
• Give more weight to strings with matching prefixes
• Useful for matching (approximate) prefixes / suffixes.
• The details of the algorithm are not covered in the subject
Break
Pattern Matching in Text
Regular Expressions
© University of Melbourne 2021
Patterns in Text
• Scenario: we have a large collection of unstructured text data. You need to write wrangling code in order to
• Check whether it contains any IP addresses (e.g. 128.250.65.5) • Find all of the IP addresses
• Requirements
• Do it succinctly
• Do it unambiguously
• Have maintainable code
• Specify patterns in text – regular expressions
• Good for calculating statistics (count occurrences of items in text)
• Checking for integrity, filtering, substitutions … © University of Melbourne 2021
Regular Expressions (re)
Simple match – characters match themselves exactly • The pattern hello will match the string hello
• Hello will match Hello
RegExp/regex/re define patterns
• Metacharacters –special rules for matching
© University of Melbourne 2021
RE: metacharacters
.: matching any character
– Forexample,a.cmatchesa/c,abc,a5c – Tomatch‘.’asaliteral,
escape with ‘\.’ a\.c matches a.c That is, ‘\’ is also a metacharacter
© University of Melbourne 2021
RE: metacharacters
\: backslash character is used to
– escape metacharacters or other special characters, e.g.: – match ‘.’ as a literal: a\.c matches a.c
– match ‘\’ as a literal: a\\c matches a\c
It also indicate special forms, e.g., special character set \d for any decimal digit
© University of Melbourne 2021
RE: metacharacters
[ ]: matching a set (class) of characters; example: [abc], [a-zA-Z]
• [^] : Complementing the set
add ‘^’ as the first character in the class ([^z]anything but z) What does the pattern [z^] match?
• Use ‘\’ to escape special characters‘[’,‘]’inside []. [\[]matches the special character: ‘[’
© University of Melbourne 2021
RE: metacharacters
[ ]: avoiding the use of square brackets
• Predefined special character set: e.g.,
• \d any decimal digit == [0-9]
• \w any alphanumeric character == [a-zA-Z0-9_]
• \W any non-alphanumeric character == [^a-zA-Z0-9_]
• Special character sets are at https://docs.python.org/2/howto/regex.html © University of Melbourne 2021
RE: metacharacters
| “OR” operator
• Alternatives
• Given two patterns P1 and P2, P1 | P2 will match P1 or P2
• Create a regex that produces the same result without using ‘|’ © University of Melbourne 2021
RE: metacharacters
^ $: Anchoring
• ^ : start of string
• ^from will match from only at the start of the string, e.g. ‘from a to b’ • ^from will not match ‘I am from Melbourne’
• $ : end of string
• To match ‘^’or ‘$’as a literal
• Escapewith‘\^’or‘\$’
• Put it in character class [$^](note special meaning if ‘^’ is placed as the first character)
© University of Melbourne 2021
RE: metacharacters
* + ? {m,n}: repeat a pattern • * : zero or more repetitions
• + : one or more repetitions
• ?: zero or once
• {m,n}: at least m and at most n repetitions.
• Pattern search is greedy
© University of Melbourne 2021
More complex regular expression
What do you think this pattern is for?
• [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+ As a task to do for the next live Zoom, please improve this pattern!
© University of Melbourne 2021
RE: metacharacters substitution & capturing groups
( ): as in math notation, group patterns with the metacharacters() • Grouped patterns are captured and numbered
• You can specify the contents by back-references
© University of Melbourne 2021
Re: metacharacters
The complete list of metacharacters
.^$*+?{}[]\|()
© University of Melbourne 2021
ELIZA and Doctor
ELIZA: a computer psychotherapist
• I’m (depressed|sad|unhappy)
• Why do you say you are \1
• Is it because \1 that you come to me?
It is now called Doctor in emacs
© University of Melbourne 2021
Regular Expressions in Python
python re
import re
re.match()
re.search()
re.sub()
re.split()
p = re.compile(regular expression)
p.match()
Practice in the workshop
© University of Melbourne 2021
Break
From the time before, 2019
© University of Melbourne 2021