Summarisation and Visualisation
Compare
(Figure source: Gholizadeh, A., Saberioon, M., Carmon, N., Boruvka, L., & Ben-Dor, E. (2018). Examining the Performance of PARACUDA-II Data-Mining Engine versus Selected Techniques to Model Soil Carbon from Reflectance Spectra. Remote Sensing, 10, 1172. doi:10.3390/rs10081172)
Histograms vs boxplots
Which is the corresponding boxplot?
Histograms vs boxplots
(Two histograms, labelled A and B.) Is this the boxplot of A or B?
Outlier detection using histogram
• Figure shows the histogram of purchase amounts in transactions
• A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
Data Formats
Exercise 1
Represent the following information in JSON
Exercise 1
Check it is well formed: http://jsonlint.com
In your group:
a) discuss whether it's valid / what changes to make it valid,
b) discuss the representation and how it could be improved
Data Linkage and Privacy – Approximate matching
• The hash-based technique using a 3rd party can only compute exact matches between strings in a privacy-preserving manner.
• What if we wish to compute an approximate match between two strings in a privacy-preserving manner? This needs:
– representing a record field (string) in a privacy-preserving manner (so that it is difficult to reverse engineer its value)
– computing the approximate similarity of two record fields (strings) that have been represented in a privacy-preserving manner.
• One method: set-based string similarity + bloom filters.
• Data structure: a 'bit array' (or a 'bit string')
– An array of binary bits (0s or 1s).
• A bloom filter is a 'bit array' with I bits
– Initialise bits to 0s
– Turn bits on using k hash functions H₁ ⋯ Hₖ
– Each hash function maps an input to an index in the bit array.
– Index range 0 ⋯ I − 1
Example bit arrays:
0 0 0 0 0 0 0 0 (initialised)
1 0 1 1 0 0 1 0 (after some bits are turned on)
Bloom Filters for storing strings
• Storing a string x in a bloom filter
– choose hash functions H₁ ⋯ Hₖ that take a string as input
– each Hᵢ(x) returns an index into the bloom filter (bit array)
– set the value at that index to 1
• Example: a bloom filter with I = 25, k = 2, string s = '…'
• sim(x, y) = 1 − d(x, y) / I = 1 − 10/25 = 0.6, where d(x, y) is the number of bit positions in which the two filters differ
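A minimal Python sketch of the idea; the salted SHA-1 construction of the k hash functions and the Hamming-style similarity are illustrative assumptions, not the slide's exact choices:

```python
import hashlib

I, K = 25, 2  # bits in the filter, number of hash functions

def indices(token, num_bits=I, num_hashes=K):
    # each "hash function" H_i is SHA-1 salted with i, reduced mod I
    return [int(hashlib.sha1(f"{i}|{token}".encode()).hexdigest(), 16) % num_bits
            for i in range(num_hashes)]

def bloom(tokens):
    # store a set of tokens (e.g. the 2-grams of a string) in a bit array
    bits = [0] * I
    for t in tokens:
        for ix in indices(t):
            bits[ix] = 1
    return bits

def sim(x, y):
    # 1 - normalised Hamming distance between two equal-length bit arrays
    diff = sum(a != b for a, b in zip(x, y))
    return 1 - diff / len(x)

# e.g. compare the (made-up) 2-gram sets of two similar names
print(sim(bloom({"sm", "mi", "it", "th"}), bloom({"sm", "my", "yt", "th"})))
```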
• Jaro-Winkler similarity
• Based on edit distance
• Favours matches at the start of the strings, e.g. aligning:
v a l u a t i o n
r e v o l u – t i o n
Scoring similarity
Example: two records with titles 'come fly with me' and 'michael jordan come fly with me' are compared attribute by attribute, giving a similarity feature vector (0.82, 15, 0, 0, 21); a scoring function f: ℝᵈ → [0, 1] maps the vector to a final score, e.g. 0.1.
Score record pairs for similarity: need a good f.
Combine the set of similarity scores → final score
Pipeline: Prep → Block → Score → Match → Merge
More on score combination
Idea 1: sum up feature scores
Idea 2: add similarities, subtract dissimilarities
Idea 3: weighted sum
Idea 4: label examples of matching/non-matching record pairs, and train a classifier using machine learning; this will learn the weights (a sketch of Idea 3 follows below)
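A minimal sketch of Idea 3, assuming a logistic squashing function so that f maps ℝᵈ into [0, 1]; the weights here are made up for illustration:

```python
import numpy as np

def score(features, weights):
    # weighted sum of per-attribute similarities, squashed into [0, 1]
    return 1 / (1 + np.exp(-np.dot(weights, features)))

feats = np.array([0.82, 15, 0, 0, 21])     # the slide's example feature vector
w = np.array([2.0, -0.1, 1.0, 1.0, -0.1])  # hypothetical learned weights
print(round(float(score(feats, w)), 2))    # a final score in [0, 1]
```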
Match 'sufficiently' similar records
Threshold θ: match when final score > θ
e.g. with threshold θ = 0.5, the pairs whose final score > 0.5 are the sufficiently similar pairs among all pairs compared.
Merge matched records
• Also needs to resolve conflicting attributes
• False positives and false negatives still exist
Evaluation of record linkage results
False positives (fp): # non-matching pairs that are incorrectly classified as matches.
False negatives (fn): # true matching pairs that are incorrectly classified as non-matches.
True positives (tp): # true matching pairs that are classified as matches.
True negatives (tn): # non-matching pairs that are classified as non-matches.
• Precision: tp / (tp + fp), the proportion of pairs classified as matches that are true matches.
• Recall: tp / (tp + fn), the proportion of true matching pairs that are classified as matches.
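These two measures are direct to compute; a small sketch with made-up counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# hypothetical linkage result: 80 true matches found, 20 wrong, 40 missed
print(precision(tp=80, fp=20))  # 0.8
print(recall(tp=80, fn=40))     # 0.666...
```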
Acknowledgements
� Based on presentation materials created by Ben Rubinstein, Pauline
Lin, James Bailey and others
Example: Bing's "Movies Vertical"
Bing Movies adding entity actions to entity cards
Need to link movie records from Bing and Netflix
An easy problem only if there's an attribute with a unique value per record
• If there's a "key" then use a "database join"
Not so fast! The linkage challenge
No keys
Noisy values
Missing attributes
"Unrelated works with same titles": http://www.imdb.com/list/ls000397020/
Usually remove diacritics (why?), e.g. déjà → deja
Concatenate "Title, Year": surely a key?
"Deja Vu,2006" ← several unrelated works share this key
"Deja Vu,1997" ←
Pre-processing
Title | Year | Directors | Cast | Runtime | …
come fly with me | 2004 | peter kagan | michael buble | 63 | …
michael jordan come fly with me | 1989 | | michael jordan, jay thomas | 42 | …
Simple pre-processing to clean records
Pairwise comparison
• Need to compare two records to determine if they are to be linked
• Calculate similarity between a pair of records; if similar: link the pair
• High complexity: we will need to compare ("score") many pairs of records for similarity.
Complexity – pairwise comparison
• Two databases A (m records) and B (n records)
• How many comparisons? m × n
• One MSFT problem had 1b records
• Complexity O(n²)
• If m == n == 1b = 10⁹ records:
• 10⁹ × 10⁹ = 10¹⁸ comparisons;
• it will take a long time to compute 10¹⁸ pairs
• Naïve whole-database all-pairs comparison doesn't scale. Can we do better?
Blocking – scalable record linkage
• Maps complex records to simple values (blocks).
• The simple values are blocking-keys.
• The same blocking methods are applied to both A and B.
• Each record is allocated to one or more of the blocks.
(Diagram: records a1–a5 from database A and b1–b5 from database B are assigned to shared blocks, e.g. one block holds {b1, b5, a4}, another {b1, b2, b3, a2, a5}, another {b3, a2, a5}, another {b1, b4, a1, a3}.)
Blocking – cont.
• Record-pairs assigned to the same block are potential matches.
• Within-block pairwise comparisons only.
• For example, b5 is only compared with a4, and no other records in A.
(Diagram: the four blocks above yield the candidate pairs
{b1, b5} × {a4}: b1 – a4, b5 – a4 (2 × 1 pairs)
{b1, b2, b3} × {a2, a5}: b1 – a2, b1 – a5, b2 – a2, b2 – a5, b3 – a2, b3 – a5 (3 × 2 pairs)
{b3} × {a2, a5}: b3 – a2, b3 – a5 (1 × 2 pairs)
{b1, b4} × {a1, a3}: b1 – a1, b1 – a3, b4 – a1, b4 – a3 (2 × 2 pairs).)
Scalable record linkage with blocking
Represent complex records as simple values (blocks); only score records with a simple value (block) in common.
Example: the titles 'come fly with me' and 'michael jordan come fly with me' share the token blocks 'come', 'fly', 'with', 'me', so the two records are scored against each other; 'michael' and 'jordan' are blocks containing only the second record.
Blocking example
A movie record: <"ghost busters", 2016>
• One blocking function, release year: results in one block
• 2016
• One blocking function, concatenating release year and one title-word: results in two blocks
• 2016 ghost
• 2016 busters
• To match movies whose release year differs by one, three functions (release year, release year + 1, release year − 1) result in three blocks:
• 2015, 2016, and 2017
• Redundant blocks, why?
Common blocking methods for strings
• Token/word ('come fly with me')
• {come, fly, with, me}
• Phrases/n-words ('come fly with me', n=2)
• {'_ come', 'come fly', 'fly with', 'with me', 'me _'}
• N-grams ('Jamie', n=2)
• {_j, ja, am, mi, ie, e_}
• the blocking-keys are the 2-grams; 'jamie' is in 6 blocks
• Prefix, suffix
• Soundex
• Pauline: P450, Paulina: P450
• Chris: C620
• Kris: K620
(a sketch of token and n-gram keys follows)
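A sketch of two of these key generators in Python (the padding character and lower-casing are assumptions):

```python
def token_keys(s):
    # token/word blocking keys: 'come fly with me' -> {'come','fly','with','me'}
    return set(s.lower().split())

def ngram_keys(s, n=2, pad="_"):
    # character n-gram keys: 'jamie' -> {'_j','ja','am','mi','ie','e_'} (6 blocks)
    s = pad + s.lower() + pad
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(token_keys("come fly with me"))
print(sorted(ngram_keys("Jamie")))
```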
Blocking methods – design
• Blocking methods should be efficient, e.g. hashing
• What would be a bad blocking method?
• Trade-off between block sizes: true matches being missed vs computational efficiency
• Can filter out large blocks
• Blocks can overlap, but avoid redundant blocks
• Need to ensure that recall does not suffer
Measuring blocking methods
Given the ground truth (matching record-pairs):
False positives (fp): non-matching pairs in the same block.
False negatives (fn): matching pairs never in the same block.
True positives (tp): matching pairs in the same block.
True negatives (tn): non-matching pairs never in the same block.
Total number of pairs with no blocking: m × n (= fp + fn + tp + tn)
Blocking measures (sketched in code below):
• Pair-completeness (PC): tp / (tp + fn)
• Reduction-ratio (RR): 1 − (fp + tp) / (fp + fn + tp + tn)
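The two measures in code; the counts are hypothetical:

```python
def pair_completeness(tp, fn):
    # PC: share of true matches that blocking keeps as candidates
    return tp / (tp + fn)

def reduction_ratio(fp, tp, total_pairs):
    # RR: share of the m*n comparisons that blocking avoids
    return 1 - (fp + tp) / total_pairs

print(pair_completeness(tp=95, fn=5))                         # 0.95
print(reduction_ratio(fp=900, tp=95, total_pairs=1_000_000))  # ~0.999
```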
Today
• What is data linkage, when and why is it needed, and by whom?
• What are some challenges?
• How to define similarity
• Scalability
• How to merge/group records
• Thanks to Dr Ben Rubinstein for use of lecture materials on the movies example
Data linkage: what is it?
• We collect information about entities such as people, products, images, songs, …
• Combining, grouping, and matching electronic records of the same real-world entities.
• Example: two hospitals H1 and H2 want to know if the same patients visited both hospitals. (Example from the Data Matching book by Christen)
Applications: security
Match data about people scheduled to fly to Australia by plane with information across different databases, to identify high-risk passengers before boarding. Databases with information such as:
• Previous visits to Australia
• Previous visa applications/cancellations
• Crime databases …
Senator Ted Kennedy told the Senate Judiciary Committee in 2004 that he had been stopped and interrogated on at least five occasions as he attempted to board flights at several different airports. A Bush administration official explained to the Washington Post that Kennedy had been held up because the name "T. Kennedy" had become a popular pseudonym among terror suspects.
http://edition.cnn.com/2015/12/07/politics/no-fly-mistakes-cat-stevens-ted-kennedy-john-lewis/
Applications: business
• Two businesses collaborate with each other for a marketing campaign. They need a combined database of individuals to target.
• Bob moves into a new home and wishes to be connected to an electricity provider. For verification, the provider matches the address Bob supplies against its "master" list of street addresses. Not always reliable!
• Online shopping comparison
• Is product X in Store A the same as product Y in Store B?
Data linkage applications – cont.
(Image source: K. Raymond, CC 4.0)
Centrelink: "Robo-debt" collection
• Data matching using Centrelink data and Tax Office data
• System checks for "discrepancies" in income
• Example data matching issue:
• A welfare recipient reports to Centrelink working for a company under its trading name; Tax Office records show a different registered company name.
• Failure to match the two names triggered the conclusion that some income was not being declared
• Automated notice …
Examples of problematic matching
• Centrelink robo-debt (above)
• The "T. Kennedy" no-fly matches described earlier: Senator Kennedy was stopped and interrogated on at least five occasions because the name had become a popular pseudonym among terror suspects.
Centrelink income:
Jane Doe
May '16: Maccas $7,000
June '16: Maccas $4,000
ATO income:
Jane Doe
2015–16: McDonald's $11,000
Discrepancy detected: potential undeclared income
• Automated process triggered → letter to Jane Doe
Record linkage – terminology
Combine related/equivalent records across sources (sometimes within a single source).
Studied across communities, with different terminology:
• Statistics: record linkage [Dunn'46]
• Databases: entity resolution, data integration, deduplication
• Natural Language Processing: coreference resolution, named-entity recognition
… meaning and scope varies
Semester 2, 2021
Lecture 5: Data quality and pre-processing
Why is pre-processing needed?
Name | Age | Date of Birth
"Henry" | 20.2 | 20 years ago
Katherine | Forty-one | 20/11/66
Michelle | 37 | 5/20/79
Oscar@!! | "5" | 13th Feb. 2019
– | 42 | –
Mike___Moore | 669 | –
巴拉克奥巴橓 | 52 | 1961年8月4日
Why is pre-processing needed?
Measuring data quality
• Accuracy
• Correct or wrong, accurate or not
• Completeness
• Not recorded, unavailable
• Consistency
• E.g. discrepancies in representation
• Timeliness
• Updated in a timely way
• Believability
• Do I trust the data is true?
• Interpretability
• How easily can I understand the data?
(The Date of Birth column from the table above, rows 1–7, illustrates several of these dimensions at once.)
Inconsistent data
• Different date formats ("3/4/2016" versus "3rd April 2016")
• Age=20, Birthdate="1/1/1971"
• Two students with the same student id
• Outliers (e.g. 62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999)
• No good if it is a list of ages of hospital patients
• Might be OK for a list of people's numbers of contacts on LinkedIn, though
• Can use automated techniques, but also need domain knowledge
Missing or incomplete data
• Lacking feature values
• Name=“”
• Age=null
• Some types of missing data (Rubin 1976)
• Missing completely at random
• Missing at random
• Missing not at random
Missing completely at random (MCAR)
Missing data are MCAR when
the probability of missing data
on a variable is unrelated to any
other measured variable and is
unrelated to the variable with
missing values itself.
Missing at random (MAR)
Missing data are MAR when the probability of missing data on a variable is related to other fully measured variables.
Gender | Weight
Male | 81
Female | 50
Male |
Male |
Female | 66
Male |
Male | 75
Male |
Female |
(Here weight is more often missing for males, so missingness depends on the fully observed Gender variable.)
Missing not at random (MNAR)
Data are MNAR when the missing
values on a variable are related to
the values of that variable itself, even
after controlling for other variables.
For example, when data are missing
on IQ and only the people with low
IQ values have missing observations
for this variable.
Too much missing data!
Simple Missing-Data Strategies
Simple strategies to retain all data instances/records
• Statistical Measurement Imputation
• Mean
• Median
• Mode
Simple Strategies with sklearn
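The original slide showed code that is not reproduced here; a minimal sketch of what it likely demonstrated, using sklearn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[81.0], [50.0], [np.nan], [66.0], [np.nan]])  # weights with gaps

imp = SimpleImputer(strategy="mean")   # also: "median", "most_frequent"
print(imp.fit_transform(X))            # NaNs replaced by the column mean (65.67)
```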
Simple Missing-Data Strategies
Pro: easy to compute and administer; no loss of records
Con: biases other statistical measurements
• variance
• standard deviation
Multivariate Strategies
Multivariate (using more than one variable):
1. Logical rules (e.g. infer a missing value from related fields)
2. Model the missing variable using the other variables (e.g. regression)
Disguised missing data
• Everyone's birthday is January 1st?
• Email address is …
• Adriaans and Zantige: "Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead."
• How to handle: look for "unusual" or suspicious values in the dataset, using knowledge about the domain
Major data preprocessing activities
(Figure from Data Mining: Concepts and Techniques, Han et al., 2012)
Data cleaning – process
• Many tools exist (Google Refine, Kettle, Talend, …)
• Data scrubbing
• Data discrepancy detection
• Data auditing
• ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
• Our emphasis will be on understanding some of the methods employed by typical tools
• Domain knowledge is important
Break
Semester 2, 2021
Lecture 4, Part 8: Web Crawling
WWW – A Repository of Data
A large amount of text data is on
the Web.
Web crawling is a method to get
the data from the web.
Resources
• APIs
• Twitter: https://developer.twitter.com/en/docs/api-reference-index
• Data Dumps
• Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get_it?
• Yelp: https://www.yelp.com/dataset
• IMDB: https://www.imdb.com/interfaces/
• Shared Tasks
• Geolocation Prediction: https://www.aclweb.org/anthology/W16-3928/
• Lexical Normalisation of Tweets: https://noisy-text.github.io/2015/norm-shared-task.html
Web Crawling
• Web crawlers are also known as spiders, robots, and bots.
• Crawlers attempt to visit every page of interest and retrieve them for processing and indexing.
• Basic challenge: there is no central index of URLs of interest.
• Secondary challenges:
• the same content appearing under a new URL
• pages that never return status 'done' on access
• websites that are not intended to be crawled
• content generated on-the-fly from databases → costly for the content provider → excessive visits unwelcome
• some content has a short lifespan
Crawling
• The web is a highly linked graph; if a web page is of interest, there will be a link to it from another page.
• In principle:
• Create a prioritised list L of URLs to visit (seed URLs).
• Create a list V of URLs that have been visited and when.
• Repeat forever:
• Choose a URL u from L and fetch the page p(u) at location u.
• Parse and index p(u); extract URLs {u′} from p(u).
• Add u to V and remove it from L.
• Add {u′} − V to L.
• Process V to move expired or 'old' URLs to L.
(a minimal crawler sketch follows)
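A minimal sketch of this loop using requests and BeautifulSoup; it ignores prioritisation, politeness beyond a fixed delay, and robots.txt, all of which a real crawler must handle:

```python
import time
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)  # L: URLs to visit
    visited = set()              # V: URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()                 # choose u from L
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)  # fetch p(u)
        except requests.RequestException:
            continue
        visited.add(url)                         # add u to V
        soup = BeautifulSoup(page.text, "html.parser")  # parse p(u)
        for a in soup.find_all("a", href=True):         # extract {u'}
            u2, _ = urldefrag(urljoin(url, a["href"]))
            if u2.startswith("http") and u2 not in visited:
                frontier.append(u2)              # add {u'} - V to L
        time.sleep(1)                            # politeness delay
    return visited
```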
Crawling
• In a web crawl (for the purpose of indexing), every page is visited eventually.
• Synonym URLs are disregarded.
• Significant or dynamic pages are visited more frequently.
• The crawler mustn't cycle indefinitely in a single web site.
Crawling (site or the web) vs Scraping
• Crawling starts with seed URLs
• Crawling visits all sites within a linked graph
• Scraping is the process of extracting data
• Scraping is targeted (given the information that it has to extract)
Anatomy of a URL: Targeted Scraping
Crawling – challenges
• Crawler traps are surprisingly common. For example, a 'next month' link on a calendar can potentially be followed until the end of time.
• The Robots Exclusion Standard: a protocol that all crawlers are supposed to observe. It allows website managers to restrict access to crawlers while allowing web browsing.
• robots.txt (inclusions, exclusions, sitemaps, etc.)
• See: https://developers.google.com/search/reference/robots.txt
• Ethical considerations and Terms of Use
Python scrapy
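The slide's scrapy code is not reproduced here; a minimal spider sketch, where the start URL and CSS selectors are placeholders:

```python
import scrapy

class MovieSpider(scrapy.Spider):
    name = "movies"
    start_urls = ["https://example.com/movies"]  # placeholder seed URL

    def parse(self, response):
        # scrape targeted fields from the current page
        for title in response.css("h2.title::text").getall():
            yield {"title": title.strip()}
        # follow links to keep crawling within the site
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

# run with: scrapy runspider movie_spider.py -o movies.json
```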
Parsing
Once a document has been fetched, it must be parsed.
• Web documents can usually be segmented into discrete zones such as title, anchor text, headings, and so on.
• Data can be extracted from specific zones.
• Information such as links and anchors can be analysed; formats such as PDF or Postscript or Word can be translated; and so on.
• Python BeautifulSoup (sketch below)
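A minimal BeautifulSoup sketch of zone extraction; the HTML snippet is made up:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Come Fly With Me</title></head>
<body><h1>Movie</h1><a href="/cast">Cast</a> <a href="/reviews">Reviews</a></body></html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)      # title zone
print(soup.h1.get_text())     # heading zone
print([(a.get_text(), a["href"]) for a in soup.find_all("a")])  # anchors + links
```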
Acknowledgement
Some contents of the slides are based on materials created by Lea Frermann, Justin Zobel, and Karin Verspoor, CIS.
Break
From the time before, 2019
Semester 2, 2021
Lecture 4, Part 7: Unstructured Data – Text Representations
Raw Frequencies
• What are the problems?
• What are the alternatives?
Example: 'play' and 'grace' occur in both SPORTS and ARTS documents; 'crowd' is distinctive of SPORTS and 'audience' of ARTS.
TF-IDF
Example document collection:
• Discourse on Floating Bodies – Galileo Galilei
• Treatise on Light – Christiaan Huygens
• Experiments with Alternate Currents of High Potential and High Frequency – Nikola Tesla
• Relativity: The Special and General Theory – Albert Einstein
TF-IDF
• TF-IDF stands for Term Frequency-Inverse Document Frequency
• Each text document is represented as a numeric vector
• each dimension is a specific word from the corpus
• A combination of two metrics to weight a term (word):
• term frequency (tf): how often a given word appears within a document
• inverse document frequency (idf): down-weights words that appear in many documents
• Main idea: reduce the weight of frequent terms and increase the weight of rare and indicative ones.
TF-IDF
Term frequency (TF):
• tf(t, d) = the raw count of term t in document d.
Inverse Document Frequency (IDF):
• idf(t) = ln(N / df_t) + 1, or the smoothed variant idf(t) = ln((1 + N) / (1 + df_t)) + 1
• N is the number of documents in the collection,
• df_t is the document frequency: the number of documents containing the term t.
TF-IDF (L2 normalised):
• tf_idf(t, d) = v_t / sqrt(Σ_{t∈d} v_t²), where v_t = tf(t, d) × idf(t)
Example TF-IDF
Two documents (stop words removed):
A: 'the car is driven on the road'
B: 'the truck is driven on the highway'
word | tf(t,A) | tf(t,B) | idf(t) = ln((1+N)/(1+df_t)) + 1 | v_t(A) | v_t(B) | tf_idf(t,A) | tf_idf(t,B)
car | 1 | 0 | ln(3/2) + 1 = 1.405 | 1.405 | 0 | 0.632 | 0
driven | 1 | 1 | ln(3/3) + 1 = 1 | 1 | 1 | 0.449 | 0.449
road | 1 | 0 | ln(3/2) + 1 = 1.405 | 1.405 | 0 | 0.632 | 0
truck | 0 | 1 | ln(3/2) + 1 = 1.405 | 0 | 1.405 | 0 | 0.632
highway | 0 | 1 | ln(3/2) + 1 = 1.405 | 0 | 1.405 | 0 | 0.632
L2 norm: sqrt(Σ v_t²) = 2.225 for A and for B.
Example TF-IDF – cont.
• Two documents, A and B (stop words removed):
A: 'the car is driven on the road'
B: 'the truck is driven on the highway'
• Text features for machine learning:
  | car | driven | road | truck | highway
A | 0.632 | 0.449 | 0.632 | 0 | 0
B | 0 | 0.449 | 0 | 0.632 | 0.632
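sklearn's TfidfVectorizer uses this formulation by default (smoothed idf, L2 normalisation), so the numbers above can be reproduced; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]

vec = TfidfVectorizer(stop_words="english")  # defaults: smooth idf, L2 norm
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # ['car' 'driven' 'highway' 'road' 'truck']
print(X.toarray().round(3))         # row A: 0.632, 0.449, 0, 0.632, 0
```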
TRY THIS!
• 3 documents (stop words removed):
A: 'the car is driven on roads'
B: 'the truck is driven on a highway'
C: 'a bike can not be ridden on a highway'
Fill in the table, using idf(t) = ln((1+N)/(1+df_t)) + 1, v_t = tf(t,d) × idf(t), and tf_idf(t,d) = v_t / sqrt(Σ v_t²) per document:
word | tf(t,A) | tf(t,B) | tf(t,C) | idf(t)
car | 1 | 0 | 0 | ln(4/2) + 1 = …
driven | 1 | 1 | 0 |
road | 1 | 0 | 0 |
truck | 0 | 1 | 0 |
highway | 0 | 1 | 1 |
bike | 0 | 0 | 1 |
ridden | 0 | 0 | 1 |
Features from unstructured text
Features for structured data vs features for unstructured text:
   | car | driven | road | truck | highway
d1 | 0.632 | 0.449 | 0.632 | 0 | 0
d2 | 0 | 0.449 | 0 | 0.632 | 0.632
Rank document similarity to a query
• Query q = 'I saw a car and a truck on the highway'
• Query terms = ['car', 'truck', 'highway']
• Query vector (1, 0, 0, 1, 1); unit vector v_q = [0.577, 0, 0, 0.577, 0.577]
• Cosine similarity to rank documents:
cos(v_q, d1) = 0.36, cos(v_q, d2) = 0.73
   | car | driven | road | truck | highway
d1 | 0.632 | 0.449 | 0.632 | 0 | 0
d2 | 0 | 0.449 | 0 | 0.632 | 0.632
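A minimal sketch of the ranking computation with numpy:

```python
import numpy as np

d1 = np.array([0.632, 0.449, 0.632, 0.0, 0.0])  # A: car, driven, road, truck, highway
d2 = np.array([0.0, 0.449, 0.0, 0.632, 0.632])  # B
q = np.array([1.0, 0.0, 0.0, 1.0, 1.0])
q = q / np.linalg.norm(q)                       # [0.577, 0, 0, 0.577, 0.577]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cos(q, d1), 2), round(cos(q, d2), 2))  # 0.36 0.73 -> rank d2 first
```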
Break
Features
Sepal length | Sepal width | Petal length | Petal width | Species (label)
4.9 | 3.0 | 1.4 | 0.2 | Iris setosa
7.0 | 3.2 | 4.7 | 1.4 | Iris versicolor
5.4 | 3.7 | 1.5 | 0.2 | Iris setosa
6.3 | 3.3 | 6.0 | 2.5 | Iris virginica
How To Represent Text?
Text Features
• Part-of-speech tagging
• She saw a bear. (bear: NOUN)
• Your efforts will bear fruit. (bear: VERB)
• bear_NN; bear_VB
• n-grams
Tokens: we value curiosity , passion and a desire to learn
POS: PRON VERB ADJ PUNC ADJ CONJ DET NOUN TO VERB
2-word n-grams: we_value value_curiosity curiosity_, ,_passion passion_and and_a a_desire desire_to to_learn
Text Representation – BoW
• Bag-of-words: the simplest vector space representational model for text
• Disregards word order and grammatical features such as POS
• Each text document is a numeric vector
• each dimension is a specific word from the corpus
• the value could be its frequency in the document or an occurrence indicator (1 or 0)
Prepare for BoW
• Word tokenisation
• Case-folding
• Abstraction of numbers (#num#, #year#)
• Stop word removal
• Stemming
• How would this look different if we lemmatised instead?
• Remove punctuation
• Word counts
Bag of Words
       | tranform | school | world | year | futur | foster | mind | cheese | transform | …
doc001 | 2 | 4 | 1 | 4 | 3 | 0 | 0 | 0 | 0 | …
doc002 | 1 | 3 | 3 | 0 | 2 | 2 | 2 | 0 | 1 | …
doc003 | 0 | 3 | 4 | 0 | 3 | 3 | 0 | 4 | 2 | …
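A BoW matrix like this can be produced with sklearn's CountVectorizer; a minimal sketch on the earlier two-document example:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]

vec = CountVectorizer(stop_words="english")  # raw counts, word order ignored
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # ['car' 'driven' 'highway' 'road' 'truck']
print(X.toarray())                  # [[1 1 0 1 0], [0 1 1 0 1]]
```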
Term Frequency
• What if a rare word occurs in a document?
• e.g. 'catarrh' is less common than 'mucus' or 'stuffiness'
• What if a word occurs in many documents?
• Maybe we want to avoid raw counts?
• Raw frequencies vary with document length
• Raw counts don't capture the important (rare) features that would be telling of a type of document
Break
Semester 2, 2021
Lecture 4, Part 5: Unstructured Data – Preprocessing
Text Preprocessing – Tokenisation
• Granularity of a token
• Sentence
• Word
• Token separators
• "The speaker did her Ph.D. in Germany. She now works at UniMelb."
• "The issue—and there are many—is that text is not consistent."
Text Preprocessing – Tokenisation
• Split continuous text into a list of individual tokens
• English words are often separated by white spaces, but not always
• Tokens can be words, numbers, hashtags, etc.
• Can use regular expressions (sketch below)
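A minimal regular-expression tokeniser; this is one possible pattern, and real tokenisation rules are task dependent:

```python
import re

text = "The speaker did her Ph.D. in Germany. She now works at UniMelb."
# words, optionally with internal dots (Ph.D.), or bare numbers
tokens = re.findall(r"[A-Za-z]+(?:\.[A-Za-z]+\.?)*|\d+", text)
print(tokens)  # keeps 'Ph.D.' as one token, drops sentence-final periods
```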
Text Preprocessing – Case folding
• Convert text to consistent case
• Simple and effective for many tasks
• Reduces sparsity (many variants map to the same lower-case form)
• Good for search
'I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three!'
→ 'i had an amazing trip to italy, coffee is only 2 bucks, sometimes three!'
Preprocessing – Stemming
• Words in English are derived from a root or stem: inexpensive → in+expense+ive
• Stemming attempts to undo the processes that lead to word formation
• Remove and replace word suffixes to arrive at a common root form
• The result does not necessarily look like a proper 'word'
• Porter stemmer: one of the most widely used stemming algorithms
• suffix stripping rules (Porter stemmer), e.g.:
• sses → ss
• ies → i
• tional → tion
• tion → t
Preprocessing – Stemming
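The slide's code is not reproduced here; a minimal sketch with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["caresses", "ponies", "relational", "running", "inexpensive"]:
    print(w, "->", ps.stem(w))
# e.g. caresses -> caress (sses -> ss), ponies -> poni (ies -> i)
```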
Preprocessing – Lemmatization
• Remove inflections and map a word to its proper root form (lemma)
• It does not just strip suffixes, it transforms words to valid roots:
running → run
runs → run
ran → run
• Python NLTK provides a WordNet Lemmatizer that uses the WordNet database to look up lemmas of words (sketch below).
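A minimal sketch; lemmatisation needs the part of speech, and 'v' marks a verb here:

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-off download of the WordNet data
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
for w in ["running", "runs", "ran"]:
    print(w, "->", wnl.lemmatize(w, pos="v"))  # all map to 'run'
```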
Stop Word Removal
• Stop words are 'function' words that structure sentences; they are low-information words and some of them are very common
• 'the', 'a', 'is', …
• Excluding them from processing helps to reduce the number of features/words
• Commonly applied for search, text classification, topic modelling, topic extraction, etc.
• A stopword list can be custom-made for a specific context/domain
Stop Word Removal
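The slide's code is not reproduced here; a minimal sketch with NLTK's English stopword list:

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = "the car is driven on the road".split()
print([t for t in tokens if t not in stops])  # ['car', 'driven', 'road']
```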
Text Normalisation
• Transforming a text into a canonical (standard) form
• Important for noisy text, e.g., social media comments, text messages
• Used when there are many abbreviations, misspellings and out-of-vocabulary (OOV) words
• E.g.
2moro → tomorrow
2mrw → tomorrow
tomrw → tomorrow
B4 → before
Noise Removal
• Remove unnecessary spacing
• Remove punctuation and special characters (regular expressions)
• Unify numbers
• Highly domain dependent (sketch below)
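A minimal sketch combining the three steps with regular expressions; the order of the substitutions matters, since numbers are abstracted after punctuation is stripped so the '#' marker survives:

```python
import re

def clean(text):
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation / special chars
    text = re.sub(r"\d+", "#num#", text)      # unify numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse spacing

print(clean("Coffee is only 2 bucks!!  (sometimes three)"))
# 'Coffee is only #num# bucks sometimes three'
```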
So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing: regular expressions, tokenisation, case folding, stemming, lemmatization, stop word removal, text normalization, noise removal
• Document representation and text features (BoW, TF-IDF)
• Crawling & scraping
Break
From the time before, 2019
Semester 2 2021
Week 3 Live Lecture
Exercises
Text Search
Jaccard Similarity
• sim(S₁, S₂) = |S₁ ∩ S₂| / |S₁ ∪ S₂|
• Calculate the Jaccard similarity of LECTURE and RELEASE using bigrams with padding (sketch below)
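A minimal sketch of the exercise computation, assuming '#' as the padding character:

```python
def bigrams(s, pad="#"):
    s = pad + s.lower() + pad
    return {s[i:i + 2] for i in range(len(s) - 1)}

a, b = bigrams("LECTURE"), bigrams("RELEASE")
print(sorted(a & b))                      # shared padded bigrams
print(round(len(a & b) / len(a | b), 3))  # |intersection| / |union|
```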
Exam 2021 SM1
Alice has a habit of writing a travel diary; she saves her diary into a text file at the end of each year.
In her diary, Alice has a habit of writing down items she purchased and roughly how much in Australian dollars she spent on the purchases. After conversion from the foreign currency into Australian dollars, if the amount is about 100 dollars, she ignores the decimals and writes it in her diary as '$100' or 'A100'; sometimes she places an additional colon ':' before the numeric value, for example 'A:100'.
Alice would like to work out how much in Australian dollars she spent in the whole year. She does not wish to manually go through her diary and add up the individual amounts.
(a) Write a regular expression for the pattern representing the amount in Australian dollars in Alice's diary.
(b) Explain how you can extract the numeric value from the extracted RE pattern so that the numeric value can be added up by a program supporting regular expressions.
(one possible approach is sketched below)
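One possible answer, sketched in Python; the diary line is invented, and the pattern could be tightened, e.g. to avoid matching a capital 'A' followed by digits inside other text:

```python
import re

diary = "Bought boots for $100, a coat for A:150 and lunch A20."  # invented
pattern = r"(?:\$|A):?(\d+)"          # (a) '$' or 'A', optional ':', digits
amounts = [int(m) for m in re.findall(pattern, diary)]
print(amounts, sum(amounts))          # (b) the capture group isolates the value
```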
More complex regular expression
What do you think this pattern is for?
[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+
As a task to do for the next live Zoom, please improve this pattern!
Try It!
(Exercise grid: columns #LECTURE, rows #RECORD; fill in the comparison table for the two padded strings.)