2021S2-workshop-week5-lab
Elements Of Data Processing (2021S2) – Week 5
Regular Expressions (Regex)
Regular expressions let you match patterns in strings rather than exact sequences of characters.
For example, to find all phone numbers of the form (03) xxxx xxxx, where x is any digit, you could use the regular expression \(03\) \d\d\d\d \d\d\d\d
or, more compactly, \(03\) \d{4} \d{4}, where \d{4} matches exactly four digits in a row.
Here’s a good tutorial on Python regex: https://www.w3schools.com/python/python_regex.asp
and a website for testing your regular expressions against test cases: https://regex101.com/
Regex with Python
The re module in Python's standard library provides support for regular expressions.
Methods of note are:
.search() (searches a string for the first match of a pattern)
.findall() (finds all non-overlapping substrings that match a pattern)
.sub() (replaces every matched substring with a given replacement)
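As a minimal sketch of the three methods listed above, the snippet below applies each of them to the same pattern. The sample text and the "[phone]" placeholder are made up for illustration.

```python
import re

# Hypothetical sample text containing two phone numbers
text = "Call (03) 9923 1123 or (03) 8344 0000 after 5pm"
pattern = r"\(03\) \d{4} \d{4}"

# search() returns a match object for the first match, or None if there is none
first = re.search(pattern, text)
first_number = first.group() if first else None

# findall() returns a list of every non-overlapping match
all_numbers = re.findall(pattern, text)

# sub() returns a new string with every match replaced
masked = re.sub(pattern, "[phone]", text)

print(first_number)  # (03) 9923 1123
print(all_numbers)   # ['(03) 9923 1123', '(03) 8344 0000']
print(masked)        # Call [phone] or [phone] after 5pm
```

Note that search() only reports the first match, so findall() is the better choice when a string may contain several.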
Regex Quantifiers
?: zero or one occurrence of the preceding element
*: zero or more occurrences of the preceding element
+: one or more occurrences of the preceding element
{n}: the preceding element is matched exactly n times
{,n}: the preceding element is matched at most n times (from 0 up to n inclusive)
{n,}: the preceding element is matched at least n times
{m,n}: the preceding element is matched at least m times and at most n times (inclusive)
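One way to check the quantifiers above is to test each of them against strings of repeated "a" characters. This is only an illustrative sketch: the helper name matches and the candidate strings are assumptions, and re.fullmatch is used so that the whole string has to match the pattern.

```python
import re

def matches(pattern, s):
    """Return True if the whole string s matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, s) is not None

candidates = ["", "a", "aa", "aaa", "aaaa"]
patterns = [r"a?", r"a*", r"a+", r"a{2}", r"a{,2}", r"a{2,}", r"a{2,3}"]

# For each quantifier, collect the candidate strings it accepts
results = {p: [s for s in candidates if matches(p, s)] for p in patterns}

for p, accepted in results.items():
    print(p, accepted)
```

For instance, a{2,3} accepts only "aa" and "aaa", while a{,2} also accepts the empty string.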
Escaping Special Characters
Just as Python strings use a backslash for special characters (e.g. \n), regex uses a backslash to escape characters that have a special meaning.
For example, to match a literal bracket ( you have to write \(, because in regex ( and ) delimit a group of characters to capture.
Consider the phone number from the example above.
In [4]:
import re
string = r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number'
# this is the regex pattern we want
# notice that we need to "escape" the brackets
pattern = r'\(03\) \d{4} \d{4}'
if re.search(pattern, string):
    print("Phone number found")
    print(re.findall(pattern, string))
else:
    print("Not found")
Phone number found
['(03) 9923 1123']
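To see why the escape matters, the sketch below compares an unescaped and an escaped version of the area-code pattern. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text: one bracketed "(03)" and one bare "03"
text = "(03) 9923 1123, room 03"

# Unescaped: the parentheses form a group around "03", so the pattern
# matches the two characters "03" wherever they occur, brackets or not.
unescaped = re.findall(r"(03)", text)

# Escaped: \( and \) match the literal bracket characters.
escaped = re.findall(r"\(03\)", text)

print(unescaped)  # ['03', '03']
print(escaped)    # ['(03)']
```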
Exercise 1
Modify the example above so that it also finds phone numbers starting with (03) that:
have missing brackets;
use hyphens, backslashes, and/or spaces as separators instead of a single space.
Your pattern should match the phone number in every element of strings in the code segment below.
In [5]:
# This example looks for phone numbers that match the format above
import re
strings = [
    r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number',
    r'Name: John, ph: 03-9923-1123, comments: this might be an old number',
    r'Name: Sara, phone: (03)-9923-1123, comments: there are data quality issues, so far, three people sharing the same number',
    r'Name: Christopher, ph: (03)\-9923 -1123, comments, is this the same Chris in the first record?'
]
# change this line
pattern = r'\(03\) \d{4} \d{4}'
for s in strings:
    if re.search(pattern, s):
        print("Phone number found")
        print(re.findall(pattern, s))
    else:
        print("Not found")
Phone number found
['(03) 9923 1123']
Not found
Not found
Not found
Exercise 2
Write a program that removes all leading zeros from each part of an IP address.
For example, 0216.08.094.102 should become 216.8.94.102
Your program should work on ip_addr in the code segment below.
In [ ]:
# Exercise 2: Write a program that will remove all leading zeros from an IP address
# For example, 0216.08.094.102 should become 216.8.94.102
import re
ip_addr = '0216.08.094.102'
# change these lines
pattern = ...
replace = ...
revised_addr = re.sub(pattern, replace, ip_addr)
print(revised_addr)
Jaccard Similarity
Jaccard similarity is a set-based measure of the similarity between two sets, here two sets of $n$-grams.
Let $A$ and $B$ be two sets of $n$-grams. Then the Jaccard similarity can be computed as:
$$
\text{sim} = J(A, B) = \frac{|A\cap B|}{|A\cup B|}
$$
where:
$|A\cap B|$ is the number of elements common to both sets;
and $|A\cup B|$ is the number of distinct elements that appear in either set.
For example, if I had two sets of numbers:
$A = \{0,1,2,5,6\}, B = \{0,2,3,4,5,7,9\}$
Then $A\cap B = \{0, 2, 5\}$ and $A\cup B = \{0, 1, 2, 3, 4, 5, 6, 7, 9\}$
Therefore, $J(A, B) = 3 / 9 \approx 0.33$
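The worked example above can be checked with Python's built-in set operations. This is a minimal sketch: the function name jaccard and the choice to return 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

print(sorted(A & B))            # [0, 2, 5]
print(sorted(A | B))            # [0, 1, 2, 3, 4, 5, 6, 7, 9]
print(round(jaccard(A, B), 2))  # 0.33
```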
Exercise 3
Use nltk.util.ngrams to produce the bi-grams of each of the two strings (you may need to download the punkt package for nltk).
Then calculate the Jaccard similarity between the two sets of bi-grams.
In [8]:
from nltk.util import ngrams
import nltk
sent1 = "crat"
sent2 = "cart"
Case Folding
Case folding removes all case distinctions present in a string. It is used for caseless matching, i.e. comparing strings while ignoring case.
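As a small illustration of caseless matching, Python strings provide a casefold() method that is more aggressive than lower(). The German word below is just an example chosen to show the difference.

```python
a = "Straße"
b = "STRASSE"

# lower() leaves the "ß" untouched, so the two strings still differ
same_with_lower = a.lower() == b.lower()

# casefold() maps "ß" to "ss", so the two strings compare equal
same_with_casefold = a.casefold() == b.casefold()

print(same_with_lower)     # False
print(same_with_casefold)  # True
```

For plain ASCII text, lower() and casefold() give the same result.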
Exercise 4
Use appropriate functions to convert the string "Whereof one cannot speak, thereof one must be silent." to
(1) Lower case
(2) Upper case
In [4]:
s = "Whereof one cannot speak, thereof one must be silent."
Natural Language Processing
The nltk library provides you with tools for natural language processing, including tokenizing, stemming and lemmatization.
In [ ]:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
# if running for the first time and you get errors, download the required data:
#nltk.download('punkt')
#nltk.download('stopwords')
porterStemmer = PorterStemmer()
speech = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate — we can not consecrate — we can not hallow — this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us — that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion — that we here highly resolve that these dead shall not have died in vain — that this nation, under God, shall have a new birth of freedom — and that government of the people, by the people, for the people, shall not perish from the earth.'
# split the speech into word and punctuation tokens
wordList = nltk.word_tokenize(speech)
stopWords = set(stopwords.words('english'))
# drop English stop words (the comparison is case-sensitive)
filteredList = [w for w in wordList if w not in stopWords]
# count how often each stem occurs
wordDict = {}
for word in filteredList:
    stemWord = porterStemmer.stem(word)
    if stemWord in wordDict:
        wordDict[stemWord] = wordDict[stemWord] + 1
    else:
        wordDict[stemWord] = 1
# sort by count, most frequent first
wordDict = {k: v for k, v in sorted(wordDict.items(), key=lambda item: item[1], reverse=True)}
for key in wordDict:
    print(key, wordDict[key])
Exercise 5
Modify the example above to use a WordNet lemmatizer instead of a Porter stemmer.
Comment on the differences.
In [ ]: