COMP20008
Elements of data processing
Semester 1 2020
Lecture 5: Web crawling and text search
Web – the repository of data
A large amount of text data is on the Web.
Web crawling is a method to get the data from the web.
We will cover:
• Web crawling and scraping; getting text data from the web
Then, we will cover common techniques for text data processing:
• Search and matching in text
• Text preprocessing (next lecture)
Web crawling
• The data to be searched needs to be gathered from the Web
• Web crawlers are also known as spiders, robots, and bots.
• Crawlers attempt to visit every page of interest and retrieve them for processing and indexing
• Basic challenge: there is no central index of URLs of interest.
• Secondary challenges:
• Some pages present the same content under a new URL
• Some URLs never return status ‘done’ on access
• Some websites are not intended to be crawled
• Some content is generated on-the-fly from databases → costly for the content provider → excessive visits are unwelcome
• Some content has a short lifespan
Crawling
The web is a highly linked graph
If a web page is of interest, there will be a link to it from another page.
In principle:
• Create a prioritised list L of URLs to visit (seed URLs)
• Create a list V of URLs that have been visited and when.
Repeat forever:
1. Choose a URL u from L and fetch the page p(u) at location u.
2. Parse and index p(u)
Extract URLs {u′} from p(u).
3. Add u to V and remove it from L. Add {u′} − V to L.
4. Process V to move expired or ‘old’ URLs to L.
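A minimal sketch of this loop in Python. The helper names, the use of the requests library, and the regex-based link extraction are illustrative choices, not part of the lecture material; step 4 (recrawling expired URLs) is only indicated as a comment.
```python
import re
import time
import requests  # third-party HTTP library, used here only for illustration

def extract_urls(html):
    """Very rough link extraction; a real crawler would use an HTML parser."""
    return set(re.findall(r'href="(https?://[^"]+)"', html))

def index(url, page):
    """Placeholder for step 2: parse and index p(u)."""
    pass

def crawl(seed_urls, max_pages=100):
    frontier = list(seed_urls)        # L: prioritised list of URLs to visit
    visited = {}                      # V: URL -> time it was fetched

    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)         # 1. choose a URL u from L
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        index(url, page)              # 2. parse and index p(u)
        new_urls = extract_urls(page) #    extract URLs {u'} from p(u)
        visited[url] = time.time()    # 3. add u to V (u was already removed from L)
        frontier.extend(u for u in new_urls if u not in visited)  # add {u'} - V to L
        # 4. (not shown) periodically move expired or 'old' URLs from V back to L
    return visited
```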
Crawling
• Every page is visited eventually.
• Synonym URLs are disregarded.
• Significant or dynamic pages are visited more frequently.
• The crawler must not cycle indefinitely within a single web site.
Crawling vs Scraping
• Crawling starts with seed URLs; scraping specifies a fixed list of URLs
• Crawling visits all of a linked graph; scraping extracts data from the fixed list
Crawling – challenges
• Crawler traps are surprisingly common. For example, a ‘next month’ link on a calendar can potentially be followed until the end of time.
• The Robots Exclusion Standard: protocol that all crawlers are supposed to observe. It allows website managers to restrict access to crawlers while allowing web browsing.
Simple crawlers are available, for example Python’s scrapy.
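The Robots Exclusion Standard is expressed in a site’s robots.txt file. A minimal sketch of honouring it with Python’s standard-library urllib.robotparser; the site URL and the user-agent string are illustrative.
```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # illustrative site
rp.read()

# A polite crawler checks before fetching each URL
if rp.can_fetch("MyCrawler", "https://example.com/calendar/2020/05"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")
```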
Parsing
Once a document has been fetched, it must be parsed.
• Web documents can usually be segmented into discrete zones such as
title, anchor text, headings, and so on.
• Data can be extracted from specific zones.
• Information such as links and anchors can be analysed, formats such as PDF or Postscript or Word can be translated, and so on.
• Python BeautifulSoup
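A minimal sketch of extracting zones (title, headings, anchor text and links) with BeautifulSoup; the HTML snippet is made up for illustration.
```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>A heading</h1>
    <p>Some body text with a <a href="https://example.com/next">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                            # zone: title
print([h.get_text() for h in soup.find_all("h1")])  # zone: headings
print([(a.get_text(), a.get("href"))                # anchor text and link targets
       for a in soup.find_all("a")])
```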
Parsing – cont.
• Preprocessing for text – more details on this in the next lecture
• The result is data in a more structured format:
• XML
• JSON
• CSV
Practice in workshop
Basic crawling and scraping with BeautifulSoup in the workshop.
Text (string) search
Exact string search
• Given a string, is some substring contained within it?
• Given a string, find all occurrences of some substring.
For example, find Exxon in:
In exes for foxes rex dux mixes a pox of waxed luxes. An
axe, and an axon, to exo Exxon max oxen. Grexit or
Brexit as quixotic haxxers with buxom rex taxation.
Approximate string search
Find exon in:
In exes for foxes rex dux mixes a pox of waxed luxes. An
axe, and an axon, to exo Exxon max oxen. Grexit or
Brexit as quixotic haxxers with buxom rex taxation.
Not present!
…But what is the “closest” or “best” match?
Application – spelling correction
Need the notion of a dictionary:
• Here, a list of words (entries) that are “correct” with respect to our
(expectations of our) language
• We can break our input into words (substrings) that we wish to match, and compare each of them against the entries in the dictionary
• A word (item) in the input that doesn’t appear in the dictionary is misspelled
• A word (item) in the input that does appear in the dictionary might be correctly spelled or might be misspelled (beyond the scope of this subject)
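A minimal sketch of this dictionary check; the toy dictionary and input text are illustrative (a real system would use a full word list).
```python
# Toy dictionary and input, both illustrative
dictionary = {"proceed", "if", "you", "can", "see", "no", "other", "option", "the"}

text = "proceed if you can see no ther option"

for word in text.lower().split():
    if word not in dictionary:
        print(f"'{word}' is not in the dictionary: possibly misspelled")
```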
Application – spelling correction
Therefore, the problem here:
Given some item of interest — which does not appear in our dictionary — which entry from the dictionary was truly intended?
Depends on the person who wrote the original string!
Application – detecting novel words
Word Blending:
• forming novel words by “blending” two other words
• not simple concatenation (e.g., football is not a blend word)
breakfast + lunch → brunch
fork + spoon → spork
Britain + exit → Brexit
• Language changes continuously
• New terms are often coined in colloquial language (e.g., on Twitter)
• Social media is fertile ground for linguistic innovation.
Other applications
Name matching
street and place name conventions
boulevard|blvd|bd|bde|blv|bl|blvde|blvrd|boulavard|boul|bvd
apartment|apt|ap|aprt|aptmnt
village|vil|vge|vill|villag|villg|vlg|vlge|vllg
• Data cleaning (e.g., deduplication)
• Query repair
Approximate string search/matching
What’s a “best” match?
Find approximate match(es) for exon in:
In exes for foxes rex dux mixes a pox of waxed luxes.
An axe, and an axon, to exo Exxon max oxen.
Grexit or Brexit as quixotic haxxers with buxom rex taxation.
exon → Exxon: insert x
exon → exo: delete n
exon → axon: replace e with a
exon → oxen: transpose e and o (not covered)
Method – neighbourhood search
For a given string w of interest:
• Generate all variants of w that utilise at most k changes
(Insertions/Deletions/Replacements), called neighbours
• Check whether the generated variants exist in the dictionary
• All variants found in the dictionary are returned (a sketch follows below)
For example:
… proceed if you can see no ther option …
Variants of ther: the, their, there, tier, other, mther, tnher, thpr, … (only those found in the dictionary are returned)
Unix command-line utility agrep is an efficient tool for finding these.
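A minimal sketch of the generate-and-check approach for k = 1; the lowercase alphabet, the toy dictionary, and the function names are illustrative choices.
```python
import string

def neighbours(word, alphabet=string.ascii_lowercase):
    """All variants of `word` reachable with one insertion, deletion, or replacement."""
    variants = set()
    for i in range(len(word) + 1):
        for c in alphabet:                        # insertions
            variants.add(word[:i] + c + word[i:])
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])     # deletions
        for c in alphabet:                        # replacements
            variants.add(word[:i] + c + word[i + 1:])
    variants.discard(word)
    return variants

dictionary = {"the", "their", "there", "tier", "other"}   # toy dictionary
print(sorted(neighbours("ther") & dictionary))
# ['other', 'the', 'their', 'there', 'tier']
```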
Method – edit distance
Levenshtein distance (a type of edit distance)
Scan through each dictionary entry looking for the “best” match
Intuition:
• Transform the string of interest into each dictionary entry
• Operations: Insert, Delete, Replace, and Match
• Each Insert, Delete, Replace operation incurs a score;
• Best match is the dictionary entry with best (lowest) aggregate mismatch score;
The details of the algorithm are not covered in the subject
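The algorithm itself is not examinable, but for reference, a minimal dynamic-programming sketch of Levenshtein distance with unit costs for Insert, Delete, and Replace (Match costs nothing):
```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions, and replacements turning s1 into s2."""
    prev = list(range(len(s2) + 1))           # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                            # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1       # Match costs 0, Replace costs 1
            curr.append(min(prev[j] + 1,      # delete c1
                            curr[j - 1] + 1,  # insert c2
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]

print(levenshtein("exon", "axon"))   # 1 (replace e with a)
print(levenshtein("exon", "exxon"))  # 1 (insert x)
```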
Edit-distance to similarity
The edit distance is divided by the length of the longer string to obtain a similarity:
sim_edit(s1, s2) = 1.0 − dist_edit(s1, s2) / max(|s1|, |s2|)
• cart → arts?
Alignment: delete c; match a, r, t; insert s ⇒ dist_edit(cart, arts) = 2
sim_edit(cart, arts) = 1.0 − 2/4 = 0.5
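A quick check of the worked example in Python, taking the edit distance of 2 between cart and arts (delete c, insert s) as given:
```python
# Worked example: sim_edit(cart, arts)
dist_edit = 2                                  # delete 'c', insert 's'
sim_edit = 1.0 - dist_edit / max(len("cart"), len("arts"))
print(sim_edit)   # 0.5
```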
Jaro-Winkler similarity
• Based on edit-distance (also character-based similarity)
• Give more weight to strings with matching prefixes
• Useful for matching (approximate) prefixes / suffixes.
• The details of the algorithm are not covered in the subject
Method – N-gram distance
Another method for finding best approximate string match
What is a (character) n-gram? A substring of length n
• 2-grams of crat: cr, ra, at; or
• 2-grams of crat: #c, cr, ra, at, t# (sometimes, with padding)
• 3-grams of crat: ##c, #cr, cra, rat, at#, t##
A sequence of characters is converted into a set of n-grams; it becomes a vector in a vector space.
N-gram distance – example
• 2-grams of crat: G2(crat) = #c, cr, ra, at, t#
• 2-grams of cart: G2(cart) = #c, ca, ar, rt, t#
• 2-grams of arts: G2(arts) = #a, ar, rt, ts, s#
N-gram distance between Gn(x) and Gn(y): |Gn(x)| + |Gn(y)| − 2 × |Gn(x) ∩ Gn(y)|
crat and cart:
|G2(crat)| + |G2(cart)| − 2 × |G2(crat) ∩ G2(cart)| = 5 + 5 − 2 × 2 = 6
crat and arts:
|G2(crat)| + |G2(arts)| − 2 × |G2(crat) ∩ G2(arts)| = 5 + 5 − 2 × 0 = 10
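A minimal sketch of padded 2-gram extraction and the n-gram distance above; the function names are illustrative.
```python
def ngrams(s, n=2, pad="#"):
    """Set of character n-grams of s, padded with n-1 pad symbols on each side."""
    s = pad * (n - 1) + s + pad * (n - 1)
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_distance(x, y, n=2):
    gx, gy = ngrams(x, n), ngrams(y, n)
    return len(gx) + len(gy) - 2 * len(gx & gy)

print(ngram_distance("crat", "cart"))   # 6
print(ngram_distance("crat", "arts"))   # 10
```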
N-gram distance
• More sensitive to long substring matches, less sensitive to relative ordering of strings (matches can be anywhere!)
• Despite its simplicity, comparing against an entire dictionary takes roughly the same time as edit distance
• Quite useless for very long strings and/or very small alphabets (Why?)
• Potentially useful for (approximate) prefixes / suffixes, e.g., Street → St; or smog → smoke
N-gram distance to similarity
• Cosine similarity
cos(S1, S2) = (S1 · S2) / (‖S1‖ × ‖S2‖)
• Jaccard similarity
sim_jacc(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
• Sørensen-Dice similarity
sim_dice(S1, S2) = 2 × |S1 ∩ S2| / (|S1| + |S2|)
N-gram distance to similarity
S1: G2(crat) = {#c, cr, ra, at, t#}
S2: G2(cart) = {#c, ca, ar, rt, t#}
Vector space over the union of 2-grams: {#c, cr, ra, at, t#, ca, ar, rt}
crat: [1, 1, 1, 1, 1, 0, 0, 0]
cart: [1, 0, 0, 0, 1, 1, 1, 1]
• |S1| = 5
• |S2| = 5
• |S1 ∪ S2| = 8
• |S1 ∩ S2| = 2
• S1 · S2 = 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 = 2
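Continuing the example, a minimal sketch computing the three similarities from the 2-gram sets; the sets are interpreted as binary vectors, so the dot product equals |S1 ∩ S2| and the vector norms are √|S1| and √|S2| (an assumption about how the cosine is meant to be evaluated here).
```python
import math

s1 = {"#c", "cr", "ra", "at", "t#"}   # G2(crat)
s2 = {"#c", "ca", "ar", "rt", "t#"}   # G2(cart)

inter = len(s1 & s2)                  # 2, also the dot product of the binary vectors
union = len(s1 | s2)                  # 8

cosine  = inter / (math.sqrt(len(s1)) * math.sqrt(len(s2)))   # 2 / 5  = 0.4
jaccard = inter / union                                       # 2 / 8  = 0.25
dice    = 2 * inter / (len(s1) + len(s2))                     # 4 / 10 = 0.4
print(cosine, jaccard, dice)
```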
Acknowledgement
Some contents of these slides are based on materials created by Lea Frermann, Justin Zobel, and Karin Verspoor, CIS.