EECS 485 Lecture 17
Scaling Web Search
John Kloosterman
Learning Objectives
• Describe how web crawling works
• Use Jaccard similarity to detect duplicate pages
• Describe the MinHash algorithm and how it can be used to speed up deduplication
Web Crawling
Web crawling
• Review: last two lectures
• IR1: Finding matching documents
• IR2: Using link graph
• To search the Internet, we need to build an index and a link graph
• Download the entire Internet
• Problem: no list of all the pages
Web crawling
• Solution: do a traversal of the link graph
Web crawling: graph traversal
• Start from a list of “seed pages” and follow all links, like a BFS/DFS
• Most of the Internet is a connected component
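A minimal sketch of this traversal in Python (my illustration, not the lecture’s code; it assumes the third-party requests and beautifulsoup4 packages). A real crawler also needs the politeness, robots.txt, and deduplication fixes from the next slides.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)  # FIFO queue => BFS order
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue  # skip pages that fail to download
        pages[url] = html
        # Follow every link on the page, visiting each URL at most once
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages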
Problem 1: Client-side dynamic sites
• Pages that render their content with client-side JavaScript look empty to a crawler that only fetches the raw HTML, so the crawler must run the scripts (e.g. in a headless browser) or miss the content
Problem 2: Not nice to websites
• Crawlers generate a lot of traffic
• CPU and bandwidth cost money
• Website owners don’t like it if you crawl too fast, and they will block you
• Solution: limit # requests per site per second
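A hedged sketch of one way to enforce that limit (the class name and the one-request-per-second default are my assumptions): remember when each site was last requested, and sleep until enough time has passed.

import time
from urllib.parse import urlparse

class PerSiteRateLimiter:
    def __init__(self, max_per_sec=1.0):
        self.min_interval = 1.0 / max_per_sec
        self.last_request = {}  # domain -> time of most recent request

    def wait(self, url):
        """Block until it is polite to request this URL's site again."""
        domain = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_request.get(domain, 0.0) + self.min_interval
        if now < earliest:
            time.sleep(earliest - now)
        self.last_request[domain] = time.monotonic()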
Problem 3: Not all sites want to be indexed
• Websites generally like being on search engines
• Reasons not to:
• Personal creations: e.g. artwork on Google Images
• Parts of dynamic websites: e.g. search engines
• Solution: robots.txt
robots.txt
https://github.com/robots.txt
User-agent: Googlebot
Disallow: /*/download
Disallow: /*/revisions
Disallow: /*/*/issues/new
Disallow: /*/*/issues/search
Disallow: /*/*/commits/*/*
Disallow: /*/*/commits/*?author
Disallow: /*/*/commits/*?path
…
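Python’s standard library can apply rules like these for a crawler; a small sketch (my example, pointed at GitHub’s live robots.txt). Note that urllib.robotparser implements the original robots.txt spec and may not honor wildcard patterns like those above.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://github.com/robots.txt")
rp.read()  # download and parse the rules
# True/False: may this user agent fetch this URL under the parsed rules?
print(rp.can_fetch("Googlebot", "https://github.com/explore"))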
Problem 4: Deduplication
• Many web pages are exact copies of other web pages
• Only want to index the shared content once, not once per copy
• We’ll learn how to solve this in the next segments
Deduplication
Problem to solve
• Many web pages are duplicates or near-duplicates of each other
• News websites: many have exactly the same articles
• Indexing takes CPU time, so only want to run the index algorithm once for all duplicates
• How to detect duplicate pages in an efficient way?
Straightforward algorithm
• Define a hash of a document
• Hash(“It grows in dry places…”) = 8d2f908…
• When crawling a document:
• compute its hash
• look that up in a hash table
• if present, it’s a duplicate
• if not, put it in the hash table
• How many comparisons?
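A sketch of the straightforward algorithm in Python (my code; SHA-256 is an assumed choice of hash function). Because the lookup is a hash-table membership test, each new document costs O(1) comparisons on average.

import hashlib

seen_hashes = set()  # the hash table of documents crawled so far

def is_exact_duplicate(text):
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True  # some earlier document hashed to the same value
    seen_hashes.add(digest)
    return False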
Why not do this?
• Documents can be nearly identical without being exact copies, so their hashes won’t match
Quantifying similarity
• Property of pairs of documents A and B
• Treat documents as sets of words
• Jaccard Similarity: J(A, B) = |A ∩ B| / |A ∪ B|
• Because this is a property of pairs, we have to compute it across all pairs
• How many pairs of documents are there? (n(n − 1)/2 = O(n^2) for n documents)
Part 1 of solution: shingles
• Instead of words, use “shingles”: sequences of k words
• Can also use sequences of characters
• Example: the 3-shingles of “this is some text” are (“this”, “is”, “some”) and (“is”, “some”, “text”)
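A minimal sketch of word shingling (my code, assuming whitespace tokenization):

def shingles(text, k=3):
    """Return the set of k-word shingles of text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

# shingles("this is some text", 3)
# => {("this", "is", "some"), ("is", "some", "text")}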
Combining Jaccard similarity and shingles
• Initial algorithm, still O(n^2)
• Crawl document
• Compute 3-shingles
• Compute Jaccard similarity with all previous documents using sets of shingles
• If similar enough, it’s a duplicate
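A sketch of this initial algorithm with Python’s set operators (the 0.9 threshold is my assumed cutoff):

def jaccard(a, b):
    # a & b is the intersection, a | b is the union
    return len(a & b) / len(a | b)

THRESHOLD = 0.9  # assumed value for "similar enough"

def is_near_duplicate(new_shingles, previous_shingle_sets):
    # O(n) Jaccard computations per new document => O(n^2) overall
    return any(jaccard(new_shingles, old) >= THRESHOLD
               for old in previous_shingle_sets)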
Say hi to partner
• jkloosterman.net/485
• Exercise 1: use shingles and Jaccard similarity
• a | b => set union
• a & b => set intersection
• Exercise 2: compute similarities across documents
Improving efficiency
• Jaccard similarity requires computing the size of set intersection and union
• Repeat this work for every pair of pages
• Goal: Pre-compute some information about every page which makes computing Jaccard similarity faster
• Non-goal: reducing the O(n^2) number of comparisons. That needs more algorithms than we’ll cover in 485
Trick: compute hashes
• Given a hash function
• Compute the hash of each shingle for documents A and B
shingles_a = { (“large”, “country”), (“is”, “a”), … }
shingles_b = { (“large”, “country”), (“not”, “a”), … }
Trick: find minimum hash
shingle_hashes_a = {0x0cba, 0xb3f4, 0x499a, …}
shingle_hashes_b = {0x39ff, 0x93cc, 0x0cba, …}
• If the minimum hashes are equal, that’s a sign of similarity
• What is the probability that the minimums will be equal? (Exactly the Jaccard similarity J(A, B), as the next two slides show)
• We’re switching from word shingles to character shingles for the next few slides
• so a 3-shingle looks like “ahc” not (“word”, “word”, “word”)
Why this works: Perspective 1
a = { “oun”, “ada”, “ntr” }
b = { “int”, “arg”, “oun” }
a U b = { “oun”, “ada”, “ntr”, “int”, “arg” }
• The minimums agree exactly when the smallest-hashed shingle in a U b lands in the intersection: 1 of the 5 union elements (“oun”), so the probability is 1/5 = J(a, b)
Why this works: perspective 2
a = { “oun”, “ada”, “ntr” }
b = { “int”, “arg”, “oun” }
{ “oun”, “ada”, “ntr”, “int”, “arg” }
• The hash function puts the union in a random order; each of the 5 elements is equally likely to hash lowest, and the minimums agree only when that element is in both sets
Many hash functions
a = { “oun”, “ada”, “ntr” }
b = { “ntr”, “arg”, “oun” }
• Each run is either 1 or 0 (for T/F)
• To converge on a value, repeat it with different hash functions for different random orderings
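A sketch of that repetition (my construction, not the slides’: salting one base hash yields a family of hash functions, each inducing a different random ordering):

import hashlib

def salted_hash(shingle, salt):
    # Different salts behave like independent hash functions
    data = f"{salt}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_estimate(shingles_a, shingles_b, k=100):
    agree = 0
    for salt in range(k):  # one 1-or-0 run per hash function
        min_a = min(salted_hash(s, salt) for s in shingles_a)
        min_b = min(salted_hash(s, salt) for s in shingles_b)
        agree += int(min_a == min_b)
    return agree / k  # converges to the Jaccard similarity as k grows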
MinHash signatures
• Ahead of time, we can compute signatures for documents
• Given hash functions h1, h2, h3, …
• Signature of document A: [min h1, min h2, min h3, …], each minimum taken over A’s shingle hashes
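A sketch of precomputing a signature, reusing the salted-hash family from the earlier sketch (so h1, h2, … correspond to salts 0, 1, …):

import hashlib

def salted_hash(shingle, salt):  # same assumed family as before
    data = f"{salt}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def signature(shingle_set, num_hashes=3):
    # Position i is the minimum of hash function i over all shingles
    return [min(salted_hash(s, salt) for s in shingle_set)
            for salt in range(num_hashes)]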
Comparing using signatures
# document signatures
[0x0b1a, 0x3dda, 0x43cf]
[0x6c4f, 0x3222, 0x232a]
# comparison document signature
[0x0b1a, 0x3222, 0x232a]
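A sketch of the comparison (my code, reusing the slide’s numbers): the similarity estimate is the fraction of signature positions that match.

def signature_similarity(sig_a, sig_b):
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_signatures = [[0x0b1a, 0x3dda, 0x43cf], [0x6c4f, 0x3222, 0x232a]]
query = [0x0b1a, 0x3222, 0x232a]
print([signature_similarity(query, sig) for sig in doc_signatures])
# => [0.333..., 0.666...]: the second document looks more similar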
• jkloosterman.net/485
• Exercise 1: print more hashes
• Exercise 2: use MinHash
Learning Objectives
• Describe how web crawling works
• Use Jaccard similarity to detect duplicate pages
• Describe the MinHash algorithm and how it can be used to speed up deduplication