
Summarisation
and
Visualisation

Compare

Gholizadeh, A., Saberioon, M., Carmon, N., Boruvka, L., & Ben-Dor, E. (2018). Examining the Performance of PARACUDA-II Data-Mining Engine versus Selected Techniques to Model Soil Carbon from Reflectance Spectra. Remote Sensing, 10, 1172. doi:10.3390/rs10081172

Histograms vs boxplots

Which is the corresponding boxplot?

Histograms vs boxplots
A

B

Is this the boxplot of A or B?

Outlier detection using histogram

• Figure shows the histogram of purchase amounts in transactions
• A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000

Data Formats

Exercise 1
Represent the following information in JSON


Homer
Simpson

Grandpa
Marge
Lisa
Bart

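One possible representation for the exercise, checked with Python's `json` module (the field names and the relationships, Grandpa as Homer's father, Marge as spouse, Lisa and Bart as children, are assumptions for illustration):

```python
import json

# A candidate JSON document for the Simpson family information.
# Field names and relationships are one possible design, not the answer key.
doc = """
{
  "firstName": "Homer",
  "lastName": "Simpson",
  "father": "Grandpa",
  "spouse": "Marge",
  "children": ["Lisa", "Bart"]
}
"""

record = json.loads(doc)   # raises an error if the document is not well formed
print(record["children"])  # ['Lisa', 'Bart']
```

`json.loads` is a quick local alternative to jsonlint for checking well-formedness; whether the representation is a *good* one (e.g. should family members be objects rather than strings?) is the discussion point in (b).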

Exercise 1

Check it is well formed: http://jsonlint.com

In your group:
a) discuss whether it’s valid / what changes to make
it valid,
b) discuss the representation and how it could be
improved

Data Linkage and Privacy – Approximate matching

• The hash-based technique using the 3rd party can only compute exact matches between strings in a privacy-preserving manner.

• What if we wish to compute approximate matches between two strings in a privacy-preserving manner?
  – Representing a record field (string) in a privacy-preserving manner (so that it is difficult to reverse engineer its value)
  – Computing approximate similarity of two record fields (strings) that have been represented in a privacy-preserving manner.

• One method:
  – Set-based string similarity + bloom filters.

• Data structure: a ‘bit array’ (or a ‘bit string’)
  – An array of binary bits (0s or 1s).

• A bloom filter is a ‘bit array’ with 𝐼 bits
  – Initialise bits to 0s
  – Turn bits on using k hash functions 𝐻₁ ⋯ 𝐻ₖ
  – Each hash function maps input to an index in the bit array.
  – Index range 0 ⋯ 𝐼 − 1

0 0 0 0 0 0 0 0   (initialised)
1 0 1 1 0 0 1 0   (after some bits are turned on)

Bloom Filters for storing strings

• Storing a string x in a bloom filter
  – choose 𝐻₁ ⋯ 𝐻ₖ that take a string as input
  – each 𝐻ᵢ(𝑥) returns an index into the bloom filter (bit array)
  – set the value at that index to 1

• Example: a bloom filter with 𝐼 = 25, 𝑘 = 2. How similar is the string ‘valuation’ to “revolution”?
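The storage step can be sketched in a few lines; this is a minimal illustration, not the lecture's code. The choice of MD5 and SHA-1 (reduced modulo 𝐼) as the k = 2 hash functions is an assumption, and the Dice-style bit-array similarity at the end previews the "set-based similarity + bloom filters" idea:

```python
import hashlib

I, K = 25, 2  # bit-array length and number of hash functions (from the slide)

def indexes(s):
    # k = 2 hash functions; MD5/SHA-1 mod I are arbitrary illustrative choices
    h1 = int(hashlib.md5(s.encode()).hexdigest(), 16) % I
    h2 = int(hashlib.sha1(s.encode()).hexdigest(), 16) % I
    return [h1, h2]

def store(s, bits=None):
    # turn on the bit at each hash index
    bits = bits if bits is not None else [0] * I
    for i in indexes(s):
        bits[i] = 1
    return bits

def dice_sim(x_bits, y_bits):
    # set-based similarity of two filters: 2|x AND y| / (|x| + |y|)
    common = sum(a & b for a, b in zip(x_bits, y_bits))
    return 2 * common / (sum(x_bits) + sum(y_bits))

bf = store("revolution")
```

In practice each record field would be broken into n-grams and every n-gram stored separately, so that similar strings light up overlapping bits.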

• 𝑠𝑖𝑚(𝑥, 𝑦) = 1 − edit_dist(𝑥, 𝑦) / max(|𝑥|, |𝑦|)

  = 1 − 4/10 = 0.6

• Jaro-Winkler similarity
  – Based on edit distance
  – Favours matches at the start of the strings.

v a l u a t i o n

r e v o l u – t i o n
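The 0.6 above can be checked with the standard dynamic-programming edit distance (a sketch, not the lecture's implementation):

```python
def edit_distance(x, y):
    # classic Levenshtein DP: d[i][j] = distance between x[:i] and y[:j]
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

def sim(x, y):
    # normalised similarity, as in the formula above
    return 1 - edit_distance(x, y) / max(len(x), len(y))

print(sim("valuation", "revolution"))  # 0.6
```

The alignment on the slide corresponds to 4 edits (two insertions, one substitution, one deletion/insertion pair), so sim = 1 − 4/10 = 0.6.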

Scoring similarity

Title …

come fly
with me

michael
jordan come
fly with me

feature vector (0.82, 15, 0, 0, 21) → score 0.1

f: ℝᵈ → [0, 1]

Score record pairs for similarity

Need a
good f

Combine the set of similarity scores → final score

Prep Block Score Match Merge

More on score combination

Idea 1: sum up feature scores

Idea 2: +similarities, -dissimilarities

Idea 3: weighted sum

Idea 4: label examples of non/matching
record pairs, train a classifier using
machine learning
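Idea 3 can be sketched as follows; the weights (and the logistic squashing used to map the weighted sum into [0, 1]) are illustrative assumptions, not values from the lecture. Idea 4 amounts to learning exactly these weights from labelled record pairs (e.g. with logistic regression):

```python
import math

def score(features, weights, bias=0.0):
    # Idea 3: weighted sum of per-attribute similarity scores,
    # squashed into [0, 1] with a logistic function (f: R^d -> [0, 1]).
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

# hypothetical feature vector from the slide, with made-up weights
features = [0.82, 15, 0, 0, 21]
weights  = [2.0, -0.1, 1.0, 1.0, -0.05]
final = score(features, weights)
```

Negative weights implement Idea 2: dissimilarity features pull the final score down.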

Will learn the weight

feature vector (0.82, 15, 0, 0, 21) → score 0.1

f: ℝᵈ → [0, 1]

Prep Block Score Match Merge

Match ‘sufficiently’ similar records

Prep Block Score Match Merge

Threshold θ
Match when final score > θ
e.g.
threshold θ = 0.5
final score > 0.5

pairs compared
sufficiently similar pairs

Merge matched records

Prep Block Score Match Merge

matched pairs

Merge matched records

� Also needs to resolve conflicting attributes
� False positives and false negatives still exist

Prep Block Score Match Merge

Evaluation of record linkage results

10

False positives (fp): # non-matched pairs that are incorrectly classified as matches.
False negatives (fn): # true matching pairs that are incorrectly classified as non-matches.
True positives (tp): # true matching pairs that are classified as matches.
True negatives (tn): # non-matched pairs that are classified as non-matches.

• Precision: 𝑡𝑝 / (𝑡𝑝 + 𝑓𝑝)
  Proportion of pairs classified as matches that are true matches.

• Recall: 𝑡𝑝 / (𝑡𝑝 + 𝑓𝑛)
  Proportion of true matching pairs that are classified as matches.
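The two measures as code, on hypothetical counts (80 true matches found, 20 false alarms, 20 misses):

```python
def precision(tp, fp):
    # proportion of predicted matches that are true matches
    return tp / (tp + fp)

def recall(tp, fn):
    # proportion of true matches that are predicted as matches
    return tp / (tp + fn)

p, r = precision(80, 20), recall(80, 20)
print(p, r)  # 0.8 0.8
```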

Acknowledgements

• Based on presentation materials created by Ben Rubinstein, Pauline Lin, James Bailey and others

Example: Bing’s “Movies Vertical”

Bing Movies adding
entity actions to entity cards

Need to link movie records from
Bing and Netflix

Easy problem only if there’s an
attribute with unique value per
record

• If there’s a “key” then use “database join”

Not so fast! The linkage challenge

No keys

Noisy values

Missing attributes

“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/

Usually remove
diacritics (why?)

déjà → deja

Concatenate “Title, Year”: surely a key?

“Deja Vu, 2006”

“Deja Vu, 1997”

“Unrelated works with same titles” http://www.imdb.com/list/ls000397020/

Concatenate “Title, Year”: surely a key?

http://www.imdb.com

Pre-processing

Prep Block Score Match Merge

Title Year Directors Cast Runtime …

come fly
with me

2004 peter kagan michael buble 63 …

michael
jordan come
fly with me

1989
michael jordan,

jay thomas
42 …

Simple pre-processing to clean records

Pairwise comparison

• Need to compare two records to determine if they are to be linked
• Calculate similarity between a pair of records

If similar:
link pair

• High complexity – we will need to compare/“score” many pairs of records for similarity.

Prep Block Score Match Merge

Complexity – pairwise comparison
• Two databases A (m records) and B (n records)

• How many comparisons? 𝑚 × 𝑛
• One MSFT problem had 1b records

• Complexity 𝑂(𝑛²)
• If 𝑚 = 𝑛 = 1b = 10⁹ records:
• 10⁹ × 10⁹ = 10¹⁸ comparisons;
• It will take a long time to compute 10¹⁸ pairs

• Naïve whole-database all-pairs comparison doesn’t scale.

Can we do better?

Blocking – scalable record linkage

• Maps complex records to simple values (blocks).

• The simple values are blocking-keys.

• Same blocking methods applied to both A and B.

• Each record is allocated to one or more of the blocks.

a1

a2

a3

a4

a5

b1
b5 a4

b1
b2
b3

a2
a5

b3 a2a5

b1
b4

a1
a3

b1

b2

b3

b4

b5

B A

Prep Block Score Match Merge

Blocking – cont.

• Record-pairs assigned to the same blocks are potential matches.

• Within-block pairwise comparisons only.

• For example, b5 is only compared with a4, and no other records in A.

b1
b5

a4

b1
b2
b3

a2
a5

b3
a2
a5

b1
b4

a1
a3

b1 – a4
b5 – a4

b1 – a2
b1 – a5
b2 – a2
b2 – a5
b3 – a2
b3 – a5

b3 – a2
b3 – a5

b1 – a1
b1 – a3
b4 – a1
b4 – a3

2 × 1

2 × 2

3 × 2

1 × 2

Scalable record linkage with blocking

Title …

come fly
with me

michael
jordan come
fly with me

michael

Blocking example

Prep Block Score Match Merge

Represent complex records as simple values (blocks);
Only score records with simple value (block) in common.

jordan   come fly

Blocking example
A movie record: <“ghost busters”, 2016>

• One blocking function (release year): results in one block
  • 2016

• One blocking function (release year concatenated with one title word): results in two blocks
  • 2016 ghost
  • 2016 busters

• To match movies whose release years differ by one, three functions: release year, release year + 1, release year − 1; results in three blocks:
  • 2015, 2016, and 2017
  • Redundant blocks, why?
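The blocking functions above can be sketched directly (a minimal illustration; function names are mine, not from the lecture):

```python
def year_blocks(title, year):
    # three blocking functions: year - 1, year, year + 1
    return {str(year - 1), str(year), str(year + 1)}

def year_word_blocks(title, year):
    # one blocking function: release year concatenated with each title word
    return {f"{year} {w}" for w in title.split()}

print(sorted(year_blocks("ghost busters", 2016)))       # ['2015', '2016', '2017']
print(sorted(year_word_blocks("ghost busters", 2016)))  # ['2016 busters', '2016 ghost']
```

Two records end up in the same block whenever their key sets intersect, so "ghost busters, 2016" and "ghost busters, 2017" share the block "2016" under the year ± 1 scheme.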

Common blocking methods for strings
• Token/word (‘come fly with me’)
  • {come, fly, with, me}

• Phrases/n-words (‘come fly with me’, n=2)
  • {‘_ come’, ‘come fly’, ‘fly with’, ‘with me’, ‘me _’}

• N-grams (‘Jamie’, n=2)
  • {_j, ja, am, mi, ie, e_}
  • blocking-keys are the 2-grams; ‘jamie’ is in 6 blocks

• Prefix, suffix
• Soundex
  • Pauline: P450, Paulina: P450
  • Chris: C620
  • Kris: K620
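Character n-gram blocking keys can be generated with a one-liner (a sketch; the `_` padding convention follows the slide's example):

```python
def ngram_keys(s, n=2):
    # character n-grams with '_' padding; each n-gram is a blocking key
    padded = "_" + s.lower() + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngram_keys("Jamie"))  # ['_j', 'ja', 'am', 'mi', 'ie', 'e_']
```

So ‘jamie’ lands in 6 blocks, and any name sharing a bigram with it (e.g. ‘jamee’) will share at least one block.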

Blocking methods– design

• Blocking methods should be efficient, e.g. hashing
• What would be a bad blocking method?

• Trade-off between block sizes: true matches being missed vs computational efficiency

• Can filter out large blocks
• Blocks can overlap, but avoid redundant blocks
• Need to ensure that recall does not suffer

Measuring blocking method

Given the ground truth (matching record-pairs):
False positives (fp): non-matching pairs in the same block.
False negatives (fn): matching pairs never in the same block.
True positives (tp): matching pairs in the same block.
True negatives (tn): non-matching pairs never in the same block.
Total number of pairs with no blocking: 𝑚 × 𝑛 (= 𝑓𝑝 + 𝑓𝑛 + 𝑡𝑝 + 𝑡𝑛)

Blocking measures:
• Pair-completeness (PC): 𝑡𝑝 / (𝑡𝑝 + 𝑓𝑛)
• Reduction-ratio (RR): 1 − (𝑓𝑝 + 𝑡𝑝) / (𝑓𝑝 + 𝑓𝑛 + 𝑡𝑝 + 𝑡𝑛)

Today

• What is data linkage, when and why is it needed and by whom?

• What are some challenges?
• How to define similarity
• Scalability
• How to merge/group records

• Thanks to
• Dr Ben Rubinstein for use of lecture materials on movies example

Data linkage: what is it?
• We collect information about entities such as people, products, images, songs, …
• Combining, grouping, matching electronic records of the same real-world entities.
• Example: two hospitals H1 and H2 want to know if the same patients visited both hospitals.

Example from Data Matching book by Christen

Applications: security

Match data about people scheduled to fly to Australia by plane,
with information across different databases, to identify high risk
passengers before boarding. Databases with information such as

• Previous visits to Australia
• Previous visa applications/cancellations
• Crime databases …

In 2004, Sen. Ted Kennedy told the Senate Judiciary Committee that he
had been stopped and interrogated on at least five occasions as he
attempted to board flights at several different airports. A Bush
administration official explained to the Washington Post that Kennedy
had been held up because the name “T. Kennedy” had become a
popular pseudonym among terror suspects.

http://edition.cnn.com/2015/12/07/politics/no-fly-mistakes-cat-stevens-ted-kennedy-john-lewis/

Applications: business

• Two businesses collaborate with each other for a marketing campaign.
Need a combined database of individuals to target

• Bob moves into a new home and wishes to be connected to electricity
provider. For verification, provider matches the address Bob supplies
against its “master” list of street addresses. Not always reliable!

• Online shopping comparison
• Is product X in Store A the same as product Y in Store B?

Data linkage applications – cont.

Src: K. Raymond CC4.0

Centrelink: “Robo-debt” collection

• Data matching using Centrelink data and Tax office data
• System checks for “discrepancies” in income
• Example data matching issue

• Welfare recipient reports to Centrelink working for a company with its trading name.
Tax office records show a different registered company name.

• Failure to match between the two names triggered conclusion
that some income was not being declared

• Automated notice …

Examples of problematic matching
• Centrelink

• The “T. Kennedy” no-fly list story above

Centrelink Income:
Jane Doe

May’16: Maccas $7,000
June’16: Maccas $4,000

ATO Income:
Jane Doe

2015-16: McDonald’s $11,000

Discrepancy detected – potential undeclared income
• Automated process triggered → letter to Jane Doe

Record linkage – terminology

Combine related/equivalent records across sources (sometimes within
a single source)

Studied across communities – different terminology
• Statistics: Record linkage [Dunn’46]
• Databases: Entity resolution, data integration, deduplication
• Natural Language Processing: coreference resolution, named-entity

recognition

…meaning and scope varies

Semester 2, 2021
Lecture 5: Data quality and pre-processing

Why is pre-processing needed?

Name         | Age       | Date of Birth
“Henry”      | 20.2      | 20 years ago
Katherine    | Forty-one | 20/11/66
Michelle     | 37        | 5/20/79
Oscar@!!     | “5”       | 13th Feb. 2019
–            | 42        | –
Mike___Moore | 669       | –
巴拉克奥巴马  | 52        | 1961年8月4日

Why is pre-processing needed?
Measuring data quality

• Accuracy
• Correct or wrong, accurate or not

• Completeness
• Not recorded, unavailable

• Consistency
• E.g. discrepancies in representation

• Timeliness
• Updated in a timely way

• Believability
• Do I trust the data is true?

• Interpretability
• How easily can I understand the data?


Inconsistent data

• Different date formats (“3/4/2016” versus “3rd April 2016”)
• Age=20, Birthdate=“1/1/1971”
• Two students with the same student id
• Outliers (e.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999)

• No good if it is a list of ages of hospital patients
• Might be OK for a list of people’s numbers of LinkedIn contacts, though
• Can use automated techniques, but also need domain knowledge

Missing or incomplete data

• Lacking feature values
• Name=“”
• Age=null

• Some types of missing data (Rubin 1976)
• Missing completely at random
• Missing at random
• Missing not at random

Missing completely at random (MCAR)

Missing data are MCAR when
the probability of missing data
on a variable is unrelated to any
other measured variable and is
unrelated to the variable with
missing values itself.

Missing at random (MAR)

Missing data are MAR when the
probability of missing data on a
variable is related to other fully
measured variables.

Gender | Weight
Male   | 81
Female | 50
Male   |
Male   |
Female | 66
Male   |
Male   | 75
Male   |
Female |

Missing not at random (MNAR)

Data are MNAR when the missing
values on a variable are related to
the values of that variable itself, even
after controlling for other variables.

For example, when data are missing
on IQ and only the people with low
IQ values have missing observations
for this variable.

Too much missing data!

Simple Missing-Data Strategies

Simple strategies to retain all data instances/records

• Statistical Measurement Imputation
• Mean
• Median
• Mode

Simple Strategies with sklearn
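The slide's screenshots referenced scikit-learn; here is a stdlib-only sketch of the same mean/median/mode imputation (the `impute` helper is mine, for illustration):

```python
import statistics

def impute(values, strategy="mean"):
    # fill None entries with the mean / median / mode of the observed values
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]
print(impute(ages, "mean"))  # fills both missing ages with 30
```

With scikit-learn this is roughly `from sklearn.impute import SimpleImputer; SimpleImputer(strategy="mean").fit_transform(X)`, which handles whole feature matrices column by column.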

Simple Missing-Data Strategies

Simple strategies to retain all data instances/records

• Statistical Measurement Imputation
• Mean
• Median
• Mode

Pro: easy to compute and administer; no loss of records
Con: biases other statistical measurements

• variance
• standard deviation

Multivariate Strategies

Multivariate (more than one variable)
1. Logical rules
2. Model

Disguised missing data

• Everyone’s birthday is January 1st?
• Email address is …?
• Adriaans and Zantinge:

  • “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”

• How to handle:
  • Look for “unusual” or suspicious values in the dataset, using knowledge about the domain


Major data preprocessing activities

Data mining concepts and techniques, Han et al 2012

Data cleaning – process

• Many tools exist (Google Refine, Kettle, Talend, …)
• Data scrubbing
• Data discrepancy detection
• Data auditing
• ETL (Extract Transform Load) tools: users specify transformations via a

graphical interface

• Our emphasis will be to understand some of the methods employed
by typical tools

• Domain knowledge is important

Break

Semester 2, 2021
Lecture 4, Part 8: Web Crawling

WWW – A Repository of Data

A large amount of text data is on
the Web.

Web crawling is a method to get
the data from the web.

Resources

• APIs
  • Twitter: https://developer.twitter.com/en/docs/api-reference-index

• Data Dumps
  • Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get_it?
  • Yelp: https://www.yelp.com/dataset
  • IMDB: https://www.imdb.com/interfaces/

• Shared Tasks
  • Geolocation Prediction: https://www.aclweb.org/anthology/W16-3928/
  • Lexical Normalisation of Tweets: https://noisy-text.github.io/2015/norm-shared-task.html


Web Crawling

• Web crawlers are also known as spiders, robots, and bots.
• Crawlers attempt to visit every page of interest and retrieve them for processing and indexing.
• Basic challenge: there is no central index of URLs of interest.
• Secondary challenges:
  • the same content appearing at a new URL
  • pages that never return status ‘done’ on access
  • websites that are not intended to be crawled
  • content generated on-the-fly from databases → costly for the content provider → excessive visits unwelcome
  • some content has a short lifespan

Crawling

• The web is a highly linked graph
• If a web page is of interest, there will be a link to it from another page.
• In principle:
  • Create a prioritised list L of URLs to visit (seed URLs)
  • Create a list V of URLs that have been visited and when.
  • Repeat forever:
    • Choose a URL u from L and fetch the page p(u) at location u.
    • Parse and index p(u); extract URLs {u′} from p(u).
    • Add u to V and remove it from L.
    • Add {u′} − V to L.
    • Process V to move expired or ‘old’ URLs to L.
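The loop above as a runnable sketch. Fetching and link extraction are injected as functions so it runs against a tiny in-memory "web" instead of the network; prioritisation and revisiting of old URLs are omitted:

```python
from collections import deque

def crawl(seeds, fetch, extract_urls, max_pages=100):
    # L: (prioritised) list of URLs to visit; V: set of visited URLs
    L = deque(seeds)
    V = set()
    index = {}
    while L and len(index) < max_pages:
        u = L.popleft()                 # choose a URL u from L
        if u in V:
            continue
        p = fetch(u)                    # fetch the page p(u) at location u
        index[u] = p                    # stand-in for "parse and index p(u)"
        V.add(u)                        # add u to V (it is already off L)
        for v in extract_urls(p):       # add {u'} - V to L
            if v not in V:
                L.append(v)
    return index

# toy link graph standing in for real fetching/parsing
web = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
crawled = crawl(["a"], fetch=lambda u: u, extract_urls=lambda p: web[p])
```

A real crawler would fetch with `urllib`/`requests`, respect robots.txt, and rate-limit per host; `max_pages` here plays the role of a politeness/termination bound.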

Crawling

• In a web crawl (for the purpose of indexing), every page is visited eventually.
• Synonym URLs are disregarded.
• Significant or dynamic pages are visited more frequently.
• The crawler mustn’t cycle indefinitely in a single web site.

Crawling (site or the web) vs Scraping
• Crawling starts with seed URLs
• Crawling visits all sites within a linked graph
• Scraping is the process of extracting data
• Scraping is targeted (given the information that it has to extract)

Anatomy of a URL: Targeted Scraping

12.2 Internet 101: Understanding How the Internet Works

Crawling – challenges

• Crawler traps are surprisingly common. For example, a ‘next month’ link on a calendar can potentially be followed until the end of time.

• The Robots Exclusion Standard: a protocol that all crawlers are supposed to observe. It allows website managers to restrict access to crawlers while allowing web browsing.

• robots.txt (inclusions, exclusions, sitemaps, etc.)
• See: https://developers.google.com/search/reference/robots.txt

• Ethics and Terms of Use
• Python: scrapy

Parsing

Once a document has been fetched, it must be parsed.
• Web documents can usually be segmented into discrete zones such as title, anchor text, headings, and so on.
• Data can be extracted from specific zones.
• Information such as links and anchors can be analysed; formats such as PDF or Postscript or Word can be translated; and so on.
• Python: BeautifulSoup
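The slide names BeautifulSoup; as a self-contained illustration of zone extraction, here is the same idea with the stdlib `html.parser` (the `ZoneExtractor` class is mine):

```python
from html.parser import HTMLParser

class ZoneExtractor(HTMLParser):
    """Pull out two simple 'zones': the <title> text and anchor hrefs."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = '<html><head><title>Demo</title></head><body><a href="/next">next</a></body></html>'
parser = ZoneExtractor()
parser.feed(page)
print(parser.title, parser.links)  # Demo ['/next']
```

With BeautifulSoup the equivalent is roughly `soup.title.string` and `soup.find_all("a")`, and it is far more forgiving of malformed HTML.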

Acknowledgement

Some contents of the slides are based on materials created by
Lea Frermann and Justin Zobel and Karin Verspoor, CIS

Break

From the time before, 2019

Semester 2, 2021
Lecture 4, Part 7: Unstructured Data – Text Representations

Raw Frequencies

• What are the problems?
• What are the alternatives?

Raw Frequencies

• What are the problems?
• What are the alternatives?

SPORTS: play, grace, crowd
ARTS: play, grace, audience

TF-IDF
Discourse on Floating Bodies

– Galileo Galilei
Treatise on Light

– Christiaan Huygens

Experiments with Alternate Currents of High
Potential and High Frequency

– Nikola Tesla
Relativity: The Special and General Theory

– Albert Einstein

TF-IDF

• TF-IDF stands for Term Frequency–Inverse Document Frequency
• Each text document is represented as a numeric vector
  • each dimension is a specific word from the corpus
• A combination of two metrics to weight a term (word):
  • term frequency (tf): how often a given word appears within a document
  • inverse document frequency (idf): down-weights words that appear in many documents.

• Main idea: reduce the weight of frequent terms and increase the weight of rare and indicative ones.

TF-IDF
Term frequency (TF):
• 𝑡𝑓(𝑡, 𝑑) = the raw count of term t in document d.

Inverse Document Frequency (IDF):
• 𝑖𝑑𝑓(𝑡) = ln(𝑁 / 𝑑𝑓ₜ) + 1, or smoothed: 𝑖𝑑𝑓(𝑡) = ln((1 + 𝑁) / (1 + 𝑑𝑓ₜ)) + 1
• N is the number of documents in the collection,
• 𝑑𝑓ₜ is the document frequency: the number of documents containing the term t.

TF-IDF (L2 normalised):
• 𝑡𝑓_𝑖𝑑𝑓(𝑡, 𝑑) = 𝑣ₜ / ‖𝑣‖₂, where 𝑣ₜ = 𝑡𝑓(𝑡, 𝑑) × 𝑖𝑑𝑓(𝑡) and ‖𝑣‖₂ = √(∑ₜ 𝑣ₜ²)

Example TF-IDF

Two documents (stop words removed):
A – ‘the car is driven on the road’
B – ‘the truck is driven on the highway’

word    | tf(A) | tf(B) | idf(t) = ln((1+N)/(1+df_t)) + 1 | v_t(A) | v_t(B) | tf_idf(A) | tf_idf(B)
car     | 1     | 0     | ln(3/2) + 1 = 1.405             | 1.405  | 0      | 0.632     | 0
driven  | 1     | 1     | ln(3/3) + 1 = 1                 | 1      | 1      | 0.449     | 0.449
road    | 1     | 0     | ln(3/2) + 1 = 1.405             | 1.405  | 0      | 0.632     | 0
truck   | 0     | 1     | ln(3/2) + 1 = 1.405             | 0      | 1.405  | 0         | 0.632
highway | 0     | 1     | ln(3/2) + 1 = 1.405             | 0      | 1.405  | 0         | 0.632

‖v(A)‖₂ = ‖v(B)‖₂ = 2.225
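The table's values can be reproduced in a few lines of plain Python (a sketch; scikit-learn's `TfidfVectorizer` uses this same smoothed idf and L2 norm by default):

```python
import math

docs = {"A": ["car", "driven", "road"],        # stop words already removed
        "B": ["truck", "driven", "highway"]}
N = len(docs)
vocab = sorted({t for d in docs.values() for t in d})

def idf(t):
    df = sum(t in d for d in docs.values())
    return math.log((1 + N) / (1 + df)) + 1    # smoothed idf

def tfidf(doc):
    v = [doc.count(t) * idf(t) for t in vocab]
    norm = math.sqrt(sum(x * x for x in v))    # L2 normalisation
    return [round(x / norm, 3) for x in v]

print(vocab)             # ['car', 'driven', 'highway', 'road', 'truck']
print(tfidf(docs["A"]))  # [0.632, 0.449, 0.0, 0.632, 0.0]
```

Note the dimensions here are in alphabetical order, whereas the table lists them in first-seen order; the weights are the same.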

Example TF-IDF

Example TF-IDF – cont.

• Two documents, A and B:
  A. ‘the car is driven on the road’
  B. ‘the truck is driven on the highway’

• Text features for machine learning

car driven road truck highway

0.632 0.449 0.632 0 0

0 0.449 0 0.632 0.632

* stop words removed

TRY THIS!
• 3 documents (stop words removed):
  A: ‘the car is driven on roads’
  B: ‘the truck is driven on a highway’
  C: ‘a bike can not be ridden on a highway’

word    | tf(A) | tf(B) | tf(C) | idf(t) = ln((1+N)/(1+df_t)) + 1 | v_t(A) | v_t(B) | v_t(C) | tf_idf
car     | 1     | 0     | 0     | ln(4/2) + 1 =                   |        |        |        |
driven  | 1     | 1     | 0     |                                 |        |        |        |
road    | 1     | 0     | 0     |                                 |        |        |        |
truck   | 0     | 1     | 0     |                                 |        |        |        |
highway | 0     | 1     | 1     |                                 |        |        |        |
bike    | 0     | 0     | 1     |                                 |        |        |        |
ridden  | 0     | 0     | 1     |                                 |        |        |        |

Features from unstructured text

Features for structured data

Features for unstructured text

car driven road truck highway

0.632 0.449 0.632 0 0

0 0.449 0 0.632 0.632

Rank document similarity to a query

• Query q = ‘I saw a car and a truck on the highway’
• Query terms = [‘car’, ‘truck’, ‘highway’]
• Query vector (1, 0, 0, 1, 1); unit vector 𝑣_q = [0.577, 0, 0, 0.577, 0.577]
• Cosine similarity to rank documents:
  cos(𝑣_q, 𝑑₁), cos(𝑣_q, 𝑑₂) : 0.36, 0.73

   | car   | driven | road  | truck | highway
d1 | 0.632 | 0.449  | 0.632 | 0     | 0
d2 | 0     | 0.449  | 0     | 0.632 | 0.632
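The 0.36 / 0.73 ranking scores can be reproduced directly (a sketch; since the tf-idf vectors are already L2-normalised, cosine similarity reduces to a dot product):

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(u, v):
    # plain dot product: both vectors are assumed unit-length
    return sum(a * b for a, b in zip(u, v))

# dimensions: car, driven, road, truck, highway
d1 = [0.632, 0.449, 0.632, 0.0, 0.0]
d2 = [0.0, 0.449, 0.0, 0.632, 0.632]
q = unit([1, 0, 0, 1, 1])  # query terms: 'car', 'truck', 'highway'

scores = [round(cos(q, d), 2) for d in (d1, d2)]
print(scores)  # [0.36, 0.73]
```

d2 ranks higher because it shares two of the three query terms (truck, highway) versus d1's one (car).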

Break

Features

Sepal length | Sepal width | Petal length | Petal width | Species (label)
4.9          | 3.0         | 1.4          | 0.2         | Iris setosa
7.0          | 3.2         | 4.7          | 1.4         | Iris versicolor
5.4          | 3.7         | 1.5          | 0.2         | Iris setosa
6.3          | 3.3         | 6.0          | 2.5         | Iris virginica

How To Represent Text?

Text Features
• Part-of-speech tagging
  • She saw a bear. (bear: NOUN)
  • Your efforts will bear fruit. (bear: VERB)
  • bear_NN; bear_VB

• n-grams

We value curiosity , passion and a desire to learn

PRON VERB ADJ PUNC ADJ CONJ DET NOUN TO VERB

we value curiosity , passion and a desire to learn

we_value value_curiosity curiosity_, ,_passion passion_and and_a a_desire desire_to to_learn

Text Representation – BoW

• Bag-of-words: the simplest vector-space representational model for text
• Disregards word order and grammatical features such as POS
• Each text document is represented as a numeric vector
  • each dimension is a specific word from the corpus
  • the value can be its frequency in the document, or its occurrence (denoted by 1 or 0).

Prepare for BoW
• Word tokenisation
• Case-folding
• Abstraction of numbers (#num#, #year#)

Prepare for BoW

• Stop word removal

Prepare for BoW

• Stop word removal

Prepare for BoW

• Stemming
  • How would this look different if we lemmatised instead?

• Removed punctuation
• Word counts

Bag of Words

       | tranform | school | world | year | futur | foster | mind | cheese | transform | …
doc001 | 2        | 4      | 1     | 4    | 3     | 0      | 0    | 0      | 0         | …
doc002 | 1        | 3      | 3     | 0    | 2     | 2      | 2    | 0      | 1         | …
doc003 | 0        | 3      | 4     | 0    | 3     | 3      | 0    | 4      | 2         | …
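A count matrix like the one above can be built with `collections.Counter` (a sketch; the toy documents below are made up, and assume tokenisation and stemming have already been applied):

```python
from collections import Counter

docs = {"doc001": "school world school futur year school".split(),
        "doc002": "school mind cheese futur".split()}
vocab = sorted({t for toks in docs.values() for t in toks})

# one count vector per document, over a shared vocabulary
bow = {name: [Counter(toks)[t] for t in vocab] for name, toks in docs.items()}
print(vocab)           # ['cheese', 'futur', 'mind', 'school', 'world', 'year']
print(bow["doc001"])   # [0, 1, 0, 3, 1, 1]
```

scikit-learn's `CountVectorizer` does the same job (tokenisation included) at scale, returning a sparse matrix.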

Term Frequency

• What if a rare word occurs in a document?
  • e.g. ‘catarrh’ is less common than ‘mucus’ or ‘stuffiness’

• What if a word occurs in many documents?
• Maybe we want to avoid raw counts?

• Raw frequencies vary with document length
• Raw counts don’t capture important (rare) features that would be telling of a type of document

Break

Semester 2, 2021
Lecture 4, Part 5: Unstructured Data – Preprocessing

Text Preprocessing – Tokenisation

• Granularity of a token
  • Sentence
  • Word

• Token separators
  • “The speaker did her Ph.D. in Germany. She now works at UniMelb.”
  • “The issue—and there are many—is that text is not consistent.”

Text Preprocessing – Tokenisation

• Split continuous text into a list of individual tokens
• English words are often separated by white spaces, but not always
• Tokens can be words, numbers, hashtags, etc.
• Can use regular expressions
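A minimal regex tokeniser (one possible pattern, not the only sensible one):

```python
import re

def tokenise(text):
    # runs of word characters, or any single non-space punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenise("Text is not consistent!"))  # ['Text', 'is', 'not', 'consistent', '!']
```

Note how it trips over the slide's hard case: `tokenise("Ph.D.")` yields `['Ph', '.', 'D', '.']`, illustrating why token separators are not as simple as "split on whitespace and punctuation".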

Text Preprocessing – Case folding

• Convert text to consistent case
• Simple and effective for many tasks
• Reduces sparsity (many forms map to the same lower-case form)
• Good for search

I had an AMAZING trip to
Italy, Coffee is only 2
bucks, sometimes three!

i had an amazing trip to
italy, coffee is only 2
bucks, sometimes three!

Preprocessing – Stemming

• Words in English are derived from a root or stem
  inexpensive → in + expense + ive

• Stemming attempts to undo the processes that lead to word formation
• Remove and replace word suffixes to arrive at a common root form
• The result does not necessarily look like a proper ‘word’
• Porter stemmer: one of the most widely used stemming algorithms; suffix-stripping rules such as:
  • sses → ss
  • ies → i
  • tional → tion
  • tion → t
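The four rules above as a toy stemmer (the real Porter algorithm has many more rules and conditions; NLTK's `PorterStemmer` implements the full thing):

```python
# rule order matters: 'tional' must be tried before 'tion'
RULES = [("sses", "ss"), ("ies", "i"), ("tional", "tion"), ("tion", "t")]

def stem(word):
    # apply the first suffix rule that matches, if any
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    return word

print(stem("caresses"), stem("ponies"), stem("conditional"), stem("adoption"))
# caress poni condition adopt
```

‘poni’ shows that stemmed output need not be a real word; that is fine as long as variants of the same word collapse to the same string.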

Preprocessing – Stemming

Preprocessing – Lemmatization

• To remove inflections and map a word to its proper root form (lemma)

• It does not just strip suffixes; it transforms words to valid roots:
  running → run
  runs → run
  ran → run

• Python NLTK provides a WordNet Lemmatizer that uses the WordNet database to look up lemmas of words.

Stop Word Removal

• Stop words are ‘function’ words that structure sentences; they are low-information words and some of them are very common
  • ‘the’, ‘a’, ‘is’, …
• Excluding them from processing helps to reduce the number of features/words
• Commonly applied for search, text classification, topic modelling, topic extraction, etc.
• A stop-word list can be custom-made for a specific context/domain

Stop Word Removal

Text Normalisation

• Transforming a text into a canonical (standard) form
• Important for noisy text, e.g., social media comments, text messages
• Used when there are many abbreviations, misspellings and out-of-vocabulary (OOV) words
• E.g.
  2moro → tomorrow
  2mrw → tomorrow
  tomrw → tomorrow
  B4 → before

Noise Removal

• Remove unnecessary spacing
• Remove punctuation and special characters (regular expressions)
• Unify numbers
• Highly domain dependent

So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing
• Document representation and text features (BoW, TF-IDF)
• Crawling & scraping

– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal

Break

From the time before, 2019

Semester 2 2021
Week 3 Live Lecture

Exercises
Text Search

Jaccard Similarity

• 𝑠𝑖𝑚(𝑆₁, 𝑆₂) = |𝑆₁ ∩ 𝑆₂| / |𝑆₁ ∪ 𝑆₂|

• Calculate the Jaccard similarity of LECTURE and RELEASE using bigrams with padding
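A sketch for checking your answer, assuming `#` padding at both ends (the padding character is a convention; some courses use `_`):

```python
def bigrams(word, pad="#"):
    # character bigrams over the padded word, as a set
    s = pad + word.upper() + pad
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

x, y = bigrams("LECTURE"), bigrams("RELEASE")
print(sorted(x & y))            # ['E#', 'LE', 'RE']
print(round(jaccard(x, y), 3))  # 0.231
```

Each word yields 8 distinct bigrams; they share 3 (LE, RE, E#), so the similarity is 3/13 ≈ 0.231.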

Exam 2021 SM1

Alice has a habit of writing a travel diary; she saves her diary into a text file at the end of each year.
In her diary, Alice writes down items she purchased and roughly how much in Australian dollars she spent on the purchases. After conversion from the foreign currency into Australian dollars, if the amount is, say, 100 dollars, she ignores the decimals and writes it in her diary as ‘$100’ or ‘A100’; sometimes she places an additional colon ‘:’ before the numeric value, for example ‘A:100’.
Alice would like to work out how much in Australian dollars she spent in the whole year. She does not wish to manually go through her diary and add up the individual amounts.
(a) Write a regular expression for the pattern representing an amount in Australian dollars in Alice’s diary.
(b) Explain how you can extract the numeric value from the extracted RE pattern so that the numeric values can be added by a program supporting regular expressions.
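One possible answer, sketched in Python (the diary text is made up, and the pattern assumes amounts only appear as `$100`, `A100`, or `A:100`):

```python
import re

diary = "Bought a scarf $100, lunch A:100 and a hat A100."  # made-up diary text

# (a) '$' or 'A', then an optional ':', then the digits;
#     the capturing group isolates the numeric value
pattern = r"[$A]:?(\d+)"

# (b) findall returns just the captured digit strings, which can be summed
amounts = [int(m) for m in re.findall(pattern, diary)]
print(sum(amounts))  # 300
```

Part (b) is the key move: because the digits are in a capturing group, `re.findall` hands back only the numbers, ready for `int()` and `sum()`.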

More complex regular expression

What do you think this pattern is for?

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+

As a task to do for the next live Zoom, please improve this pattern!
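A quick way to probe the pattern and find things to improve (the test addresses are made up):

```python
import re

EMAIL = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+"

for candidate in ["jane.doe+spam@example.com",  # accepted, as hoped
                  "no-at-sign.example.com",     # rejected, as hoped
                  ".@x.y"]:                     # accepted - a flaw to fix!
    print(candidate, bool(re.fullmatch(EMAIL, candidate)))
```

Two starting points for the improvement task: the local part may consist entirely of dots, and `re.fullmatch` rejects multi-label domains like `unimelb.edu.au` because the pattern allows only one dot after the `@`.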

Try It!

   #  L  E  C  T  U  R  E
#
R
E
C
O
R
D