
COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 4, Part 5: Unstructured Data – Preprocessing
© University of Melbourne 2021

Text Preprocessing – Tokenisation
• Granularity of a token
  • Sentence
  • Word
• Token separators
  • “The speaker did her Ph.D. in Germany. She now works at UniMelb.”
  • “The issue—and there are many—is that text is not consistent.”

Text Preprocessing – Tokenisation
• Split continuous text into a list of individual tokens
• English words are often separated by white spaces, but not always
• Tokens can be words, numbers, hashtags, etc.
• Can use regular expressions
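The last bullet can be sketched with Python's `re` module. The pattern below is an illustrative choice, not the one used in the lecture: `\w+` grabs runs of word characters and `[^\w\s]` grabs single punctuation marks, so the em-dashes in the earlier example become their own tokens.

```python
import re

# Whitespace splitting keeps the em-dashes glued to surrounding words,
# so a regex tokeniser is used instead.
text = "The issue—and there are many—is that text is not consistent."

# \w+ matches runs of word characters; [^\w\s] matches a single
# punctuation/symbol character, so "—" and "." become their own tokens.
tokens = re.findall(r"\w+|[^\w\s]", text)
```

Compare with `text.split()`, which produces tokens like `issue—and`.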

Text Preprocessing – Case folding
• Convert text to a consistent case
• Simple and effective for many tasks
• Reduces sparsity (many variants map to the same lower-case form)
• Good for search
I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three!
i had an amazing trip to italy, coffee is only 2 bucks, sometimes three!
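In Python, case folding is a single call; `str.casefold()` is a more aggressive variant that also normalises characters such as the German "ß".

```python
# Lower-casing the slide's example sentence.
s = "I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three!"
folded = s.lower()
```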

Preprocessing – Stemming
• Words in English are derived from a root or stem: inexpensive → in+expense+ive
• Stemming attempts to undo the processes that lead to word formation
• Remove and replace word suffixes to arrive at a common root form
• Result does not necessarily look like a proper ‘word’
• Porter stemmer: one of the most widely used stemming algorithms
• Suffix stripping (Porter stemmer):
  • sses → ss
  • ies → i
  • tional → tion
  • tion → t
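A toy sketch of suffix stripping using only the four rules listed above. This is a hypothetical illustration, not the full Porter algorithm, which applies rules in phases with extra conditions on the remaining stem (NLTK's `nltk.stem.PorterStemmer` implements the real thing).

```python
# Suffix-stripping rules from the slide: (suffix, replacement).
RULES = [("sses", "ss"), ("ies", "i"), ("tional", "tion"), ("tion", "t")]

def strip_suffix(word):
    # Try the longest matching suffix first, as Porter's phases do.
    for suffix, replacement in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word
```

For example, `caresses → caress`, `ponies → poni`, `conditional → condition`.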

Preprocessing – Stemming
https://text-processing.com/demo/stem/
troubles → troubl
troubled → troubl
trouble → troubl

Preprocessing – Lemmatization
• Remove inflections and map a word to its proper root form (lemma)
• It does not just strip suffixes; it transforms words to valid roots:
running → run
runs → run
ran → run
• Python NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up lemmas of words.
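A dictionary-backed sketch of lemmatisation. The `LEMMAS` table here is a made-up stand-in for the WordNet database that NLTK's `WordNetLemmatizer` queries; lookup (rather than suffix stripping) is what lets a lemmatiser map the irregular form "ran" to "run".

```python
# Hypothetical mini lemma table standing in for WordNet.
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

def lemmatise(word):
    # Look the word up; fall back to the word itself if unknown.
    return LEMMAS.get(word.lower(), word)
```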

Stop Word Removal
• Stop words are ‘function’ words that structure sentences; they carry little information and some are very common
• ‘the’, ‘a’, ‘is’, …
• Excluding them from processing helps to reduce the number of features/words
• Commonly applied for search, text classification, topic modelling, topic extraction, etc.
• A stop word list can be custom-made for a specific context/domain
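Stop word removal is a simple filter over the token list. The small set below is a custom example; in practice a list such as NLTK's `stopwords` corpus, or a domain-specific one, would be used.

```python
# A tiny custom stop word list (illustrative only).
STOPWORDS = {"the", "a", "an", "is", "are", "and", "to", "of"}

tokens = ["the", "coffee", "is", "a", "treat"]
# Keep only the content-bearing tokens.
content = [t for t in tokens if t not in STOPWORDS]
```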

Stop Word Removal

Text Normalisation
• Transforming text into a canonical (standard) form
• Important for noisy text, e.g., social media comments, text messages
• Used when there are many abbreviations, misspellings and out-of-vocabulary (OOV) words
• E.g.
2moro → tomorrow
2mrw → tomorrow
tomrw → tomorrow
B4 → before
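A dictionary-based normalisation sketch covering the slide's examples. Real systems combine lookup tables like this with spelling-correction or character-level models; the table here is illustrative.

```python
# Hypothetical normalisation table for the slide's examples.
NORMALISE = {"2moro": "tomorrow", "2mrw": "tomorrow",
             "tomrw": "tomorrow", "b4": "before"}

def normalise(token):
    # Map a noisy token to its canonical form; pass unknown tokens through.
    return NORMALISE.get(token.lower(), token)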

Noise Removal
• Remove unnecessary spacing
• Remove punctuation and special characters (regular expressions)
• Unify numbers
• Highly domain dependent
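The three bullets above can be chained as regular-expression substitutions. The `#num#` placeholder follows the number-abstraction convention mentioned later in the BoW preprocessing slide; the order of the substitutions is a design choice of this sketch.

```python
import re

def remove_noise(text):
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation and special characters
    text = re.sub(r"\d+", "#num#", text)       # unify numbers
    return re.sub(r"\s+", " ", text).strip()   # collapse unnecessary spacing

cleaned = remove_noise("Coffee   is only 2 bucks!!")
```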

So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
• Document representation and text features (BoW, TF-IDF)
• Crawling & scraping

Break
From the time before, 2019

Lecture 4, Part 6: Unstructured Data – Text Representations

Features
Sepal length  Sepal width  Petal length  Petal width  Species (label)
4.9           3.0          1.4           0.2          Iris setosa
7.0           3.2          4.7           1.4          Iris versicolor
5.4           3.7          1.5           0.2          Iris setosa
6.3           3.3          6.0           2.5          Iris virginica

How To Represent Text?
https://cis.unimelb.edu.au/about/school/

Text Features
• Part-of-speech tagging
  • She saw a bear.  →  bear: NOUN (bear_NN)
  • Your efforts will bear fruit.  →  bear: VERB (bear_VB)

We    value  curiosity  ,     passion  and   a    desire  to  learn
PRON  VERB   ADJ        PUNC  ADJ      CONJ  DET  NOUN    TO  VERB

• ngrams (here, bigrams)

we  value  curiosity  ,  passion  and  a  desire  to  learn
we_value  value_curiosity  curiosity_,  ,_passion  passion_and  and_a  a_desire  desire_to  to_learn
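The n-gram row above can be generated with a short helper: adjacent tokens are joined with an underscore, matching the `we_value` style of the table.

```python
def ngrams(tokens, n=2):
    # Slide a window of width n over the token list and join with "_".
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["we", "value", "curiosity", ",", "passion"])
```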

Text Representation – BoW
• Bag-of-words: simplest vector space representational model for text
• Disregards word order and grammatical features such as POS
• Each text document as a numeric vector
• each dimension is a specific word from the corpus
• the value could be the word’s frequency in the document, or its occurrence (denoted by 1 or 0)
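A minimal BoW sketch using only the standard library: build a shared vocabulary across documents, then represent each document as a vector of counts. The example documents are made up for illustration.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    # Counter returns 0 for absent words, so every document gets a
    # vector over the same vocabulary.
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

docs = {"doc1": "the cat sat on the mat".split(),
        "doc2": "the dog sat".split()}
vocab = sorted({w for toks in docs.values() for w in toks})
bow = {name: bow_vector(toks, vocab) for name, toks in docs.items()}
```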

Prepare for BoW
• Word tokenisation
• Case-folding
• Abstraction of numbers (#num#, #year#)

Prepare for BoW
• Stop word removal


Prepare for BoW
• Stemming
• How would this look different if we lemmatised instead?
• Removed punctuation
• Word counts

Bag of Words
         tranform  school  world  year  futur  foster  mind  cheese  transform
doc001   2         4       1      4     3      0       0     0       0
doc002   1         3       3      0     2      2       2     0       1
doc003   0         3       4      0     3      3       0     4       2

Term Frequency
• What if a rare word occurs in a document?
  • e.g. ‘catarrh’ is less common than ‘mucus’ or ‘stuffiness’
• What if a word occurs in many documents?
• Maybe we want to avoid raw counts?
  • Raw frequencies vary with document length
  • They don’t capture important (rare) features that would be telling of a type of document

Break

Lecture 4, Part 7: Unstructured Data – Text Representations

Raw Frequencies
• What are the problems?
• What are the alternatives?

Raw Frequencies
• What are the problems?
• What are the alternatives?
SPORTS: play, grace, crowd
ARTS:   play, grace, audience

TF-IDF
Example documents:
• Discourse on Floating Bodies – Galileo Galilei
• Treatise on Light – Christiaan Huygens
• Experiments with Alternate Currents of High Potential and High Frequency – Nikola Tesla
• Relativity: The Special and General Theory – Albert Einstein

TF-IDF
• TF-IDF stands for Term Frequency-Inverse Document Frequency
• Each text document as a numeric vector
• each dimension is a specific word from the corpus
• A combination of two metrics to weight a term (word)
• term frequency (tf): how often a given word appears within a document
• inverse document frequency (idf): down-weights words that appear in many documents.
• Main idea: reduce the weight of frequent terms and increase the weight of rare and indicative ones.

TF-IDF
Term frequency (TF):
• tf(t, d) = the raw count of term t in document d.
Inverse document frequency (IDF):
• idf(t) = ln((1 + N)/(1 + df_t)) + 1, or idf(t) = ln(N/df_t) + 1
• N is the number of documents in the collection,
• df_t is the document frequency, the number of documents containing the term t.
TF-IDF (L2 normalised):
• tf_idf(t, d) = v_t / √(Σ_{t′∈d} v_{t′}²), where v_t = tf(t, d) × idf(t)
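The formulas can be sketched directly in Python. This uses the smoothed idf variant, idf(t) = ln((1+N)/(1+df_t)) + 1, and L2 normalisation, which matches the worked example on the next slide (and scikit-learn's `TfidfVectorizer` defaults).

```python
import math

def tfidf(docs):
    """docs: list of token lists (already preprocessed).
    Returns (vocab, one L2-normalised tf-idf vector per document)."""
    N = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # Document frequency and smoothed idf: ln((1 + N)/(1 + df_t)) + 1.
    df = {t: sum(t in doc for doc in docs) for t in vocab}
    idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        v = [doc.count(t) * idf[t] for t in vocab]   # v_t = tf * idf
        norm = math.sqrt(sum(x * x for x in v))      # L2 normalisation
        vectors.append([x / norm for x in v])
    return vocab, vectors

# The lecture's two documents, stop words removed:
A = ["car", "driven", "road"]
B = ["truck", "driven", "highway"]
vocab, (vec_a, vec_b) = tfidf([A, B])
```

Running this reproduces the 0.632 / 0.449 weights in the worked example.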

Example TF-IDF
Two documents:
A – ‘the car is driven on the road’
B – ‘the truck is driven on the highway’
* stop words removed

word     tf(t,A)  tf(t,B)  idf(t) = ln((1+N)/(1+df_t)) + 1  v_t(A)  v_t(B)  tf_idf(t,A)  tf_idf(t,B)
car      1        0        ln(3/2) + 1 = 1.405              1.405   0       0.632        0
driven   1        1        ln(3/3) + 1 = 1                  1       1       0.449        0.449
road     1        0        ln(3/2) + 1 = 1.405              1.405   0       0.632        0
truck    0        1        ln(3/2) + 1 = 1.405              0       1.405   0            0.632
highway  0        1        ln(3/2) + 1 = 1.405              0       1.405   0            0.632

L2 norms: √(Σ_{t′∈A} v_{t′}²) = 2.225 and √(Σ_{t′∈B} v_{t′}²) = 2.225


Example TF-IDF – cont.
• Two documents, A and B:
A. ‘the car is driven on the road’
B. ‘the truck is driven on the highway’
* stop words removed
• Text features for machine learning:

     car    driven  road   truck  highway
A    0.632  0.449   0.632  0      0
B    0      0.449   0      0.632  0.632

TRY THIS!
• 3 documents:
A: ‘the car is driven on roads’
B: ‘the truck is driven on a highway’
C: ‘a bike can not be ridden on a highway’
* stop words removed

word     tf(t,A)  tf(t,B)  tf(t,C)  idf(t) = ln((1+N)/(1+df_t)) + 1  v_t(A,B,C)  tf_idf(t,A,B,C)
car      1        0        0        ln(4/2) + 1 =
driven   1        1        0
road     1        0        0
truck    0        1        0
highway  0        1        1
bike     0        0        1
ridden   0        0        1

(Fill in the remaining idf, v_t and L2-normalised tf_idf values for A, B and C.)

Features from unstructured text
Features for structured data (e.g. the iris measurements)
Features for unstructured text (e.g. TF-IDF values):

     car    driven  road   truck  highway
A    0.632  0.449   0.632  0      0
B    0      0.449   0      0.632  0.632

Rank document similarity to a query
• Query q = ‘I saw a car and a truck on the highway’
• Query terms = [‘car’, ‘truck’, ‘highway’]
• Query vector [1, 0, 0, 1, 1], unit vector v_q = [0.577, 0, 0, 0.577, 0.577]
• Cosine similarity to rank documents:
  cos(v_q, d1) = 0.36, cos(v_q, d2) = 0.73

     car    driven  road   truck  highway
d1   0.632  0.449   0.632  0      0
d2   0      0.449   0      0.632  0.632
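The ranking step can be reproduced with a plain cosine-similarity function over the vectors from the table:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Vectors from the table (dimensions: car, driven, road, truck, highway).
q  = [0.577, 0, 0, 0.577, 0.577]   # unit query vector
d1 = [0.632, 0.449, 0.632, 0, 0]
d2 = [0, 0.449, 0, 0.632, 0.632]
```

d2 shares two query terms with q, so it ranks above d1.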


Break

Lecture 4, Part 8: Web Crawling

WWW – A Repository of Data
A large amount of text data is available on the Web.
Web crawling is a method for getting that data from the web.

Resources
• APIs
• Twitter: https://developer.twitter.com/en/docs/api-reference-index
• Data Dumps
• Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get_it?
• Yelp: https://www.yelp.com/dataset
• IMDB: https://www.imdb.com/interfaces/
• Shared Tasks
• Geolocation Prediction: https://www.aclweb.org/anthology/W16-3928/
• Lexical Normalisation of Tweets: https://noisy-text.github.io/2015/norm-shared-task.html

Web Crawling
• Web crawlers are also known as spiders, robots, and bots.
• Crawlers attempt to visit every page of interest and retrieve them for processing and indexing.
• Basic challenge: there is no central index of URLs of interest.
• Secondary challenges:
  • the same content may appear under a new URL
  • some pages never return status ‘done’ on access
  • some websites are not intended to be crawled
  • content generated on-the-fly from databases → costly for the content provider → excessive visits unwelcome
  • some content has a short lifespan

Crawling
• The web is a highly linked graph: if a web page is of interest, there will be a link to it from another page.
• In principle:
  • Create a prioritised list L of URLs to visit (seed URLs).
  • Create a list V of URLs that have been visited, and when.
  • Repeat forever:
    • Choose a URL u from L and fetch the page p(u) at location u.
    • Parse and index p(u); extract URLs {u′} from p(u).
    • Add u to V and remove it from L.
    • Add {u′} − V to L.
    • Process V to move expired or ‘old’ URLs back to L.
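The loop above can be sketched in a few lines. `fetch` and `parse_links` are hypothetical callables supplied by the caller (in practice an HTTP client and an HTML parser); this sketch uses a FIFO frontier rather than a real priority scheme and omits the revisiting of old URLs.

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, max_pages=1000):
    frontier = deque(seed_urls)   # L: list of URLs to visit
    visited = set()               # V: URLs already visited
    while frontier and len(visited) < max_pages:
        u = frontier.popleft()    # choose a URL u from L
        if u in visited:
            continue
        page = fetch(u)           # fetch p(u); parsing/indexing happens here
        visited.add(u)            # add u to V
        for link in parse_links(page):          # extract {u'} from p(u)
            if link not in visited and link not in frontier:
                frontier.append(link)           # add {u'} - V to L
    return visited

# Toy link graph standing in for the web:
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = crawl(["a"], fetch=lambda u: u, parse_links=lambda p: links[p])
```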

Crawling
• In a web crawl (for the purpose of indexing), every page is visited eventually.
• Synonym URLs are disregarded.
• Significant or dynamic pages are visited more frequently.
• The crawler mustn’t cycle indefinitely in a single web site.

Crawling (site or the web) vs Scraping
• Crawling starts with seed URLs
• Crawling visits all sites within a linked graph
• Scraping is the process of extracting data
• Scraping is targeted (given the information that it has to extract)

Anatomy of a URL: Targeted Scraping

[Figure: 12.2 Internet 101: Understanding How the Internet Works]

Crawling – challenges
• Crawler traps are surprisingly common. For example, a ‘next month’ link on a calendar can potentially be followed until the end of time.
• The Robots Exclusion Standard: protocol that all crawlers are supposed to observe. It allows website managers to restrict access to crawlers while allowing web browsing.
• robots.txt (inclusions, exclusions, sitemaps, etc.)
• See: https://developers.google.com/search/reference/robots.txt
• Ethical use and Terms of Use
• Python: scrapy
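Python's standard library can check a URL against robots.txt rules via `urllib.robotparser`; the rules below are a made-up example of the exclusion protocol described above.

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt (normally fetched from the site).
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

allowed_public  = rp.can_fetch("*", "https://example.org/index.html")
allowed_private = rp.can_fetch("*", "https://example.org/private/page")
```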

Parsing
Once a document has been fetched, it must be parsed.
• Web documents can usually be segmented into discrete zones such as title, anchor text, headings, and so on.
• Data can be extracted from specific zones.
• Information such as links and anchors can be analysed, formats such as PDF or Postscript or Word can be translated, and so on.
• Python: BeautifulSoup
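Link extraction from a fetched page can be sketched with the standard library's `html.parser`; BeautifulSoup, named above, offers a far more convenient API for the same zone-and-link extraction.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                      # anchor tags carry links
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

extractor = LinkExtractor()
extractor.feed('<p>See <a href="https://example.org/a">this</a> page.</p>')
```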

Acknowledgement
Some contents of the slides are based on materials created by Lea Frermann, Justin Zobel, and Karin Verspoor, CIS.
