
CIS 455/555: Internet and Web Systems
Crawling
October 12, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania

Plan for today
n Data interchange
n Extensible Markup Language (XML)
n XML Schema; DOM
n XPath
n Example queries
n Axes
n Basic crawling
n Mercator
n Publish/subscribe
n XFilter

Visualization
[Figure: tree view of the dblp XML document. The root has an <?xml …?> processing instruction and the dblp element as children. dblp contains a mastersthesis (attributes mdate="2002…", key="ms/Brown92"; child elements author, title, year, school) and an article (attributes mdate, key; child elements editor, title, journal, volume, year, and two ee elements). Text leaves include "Kurt P.…", "PRPL…", "1992", "Univ.…", "Paul R.…", "The…", "Digital…", "SRC…", "1997", "db/labs/dec", "tr/dec/…", "http://www.…". The legend distinguishes root, element, attribute, text, and processing-instruction (p-i) nodes.]
Which XPath query returns this element?

More XPath query examples
n Square brackets to further specify elements
  n /article/journal[1]              First journal child of article
  n /article/journal[last()]         Last journal child of article
n Predicates can filter the nodes that are returned
  n //journal[@issn='123']           All journals with this issn
n count() counts selected elements
  n //*[count(ee)=2]                 All elements with 2 ee children
  n //*[count(*)=3]                  All elements with 3 children
n Other functions available
  n //*[contains(name(),'ACM')]      All elements whose name contains ACM
  n //*[string-length(name())>3]     Elements whose name is longer than 3 characters
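These queries can also be evaluated from a program. A minimal sketch using Java's built-in javax.xml.xpath API; the file name dblp.xml is only a placeholder for an XML document like the one visualized earlier:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathDemo {
        public static void main(String[] args) throws Exception {
            // Parse the XML file into a DOM tree (dblp.xml is a placeholder file name)
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse("dblp.xml");

            // Compile and evaluate one of the queries from this slide
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(
                    "//*[count(ee)=2]", doc, XPathConstants.NODESET);

            // Print the element name of every match
            for (int i = 0; i < matches.getLength(); i++) {
                System.out.println(matches.item(i).getNodeName());
            }
        }
    }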

Context nodes and relative paths
n XPath has a notion of a context node
  n Analogous to the current working directory under Unix
  n XPath is evaluated relative to that specific node; defaults to the document root
n '.' represents the context node
n '..' represents the parent node
n We can express relative paths:
  n foo/bar/../.. gets us back to the context node
n Example: Suppose we are at the 'author' child of the mastersthesis node and we want to query the title: the relative path ../title selects it

More complex traversals with axes
n So far, we have seen XPath queries that go down the tree (and up one step)
n But we can go up, left, right, etc.
n This is expressed with so-called axes:
  n self::path-step
  n child::path-step
  n descendant::path-step
  n descendant-or-self::path-step
  n parent::path-step
  n ancestor::path-step                 All ancestors (parent, grandparent, …) of the current node
  n ancestor-or-self::path-step
  n preceding-sibling::path-step
  n following-sibling::path-step
  n preceding::path-step
  n following::path-step                Everything after the closing tag of the current node
n The XPaths we've seen so far were in 'abbreviated form':
  n /AAA/BBB is equivalent to /child::AAA/child::BBB
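For instance (using element names from the dblp example), abbreviated and full axis syntax can be mixed freely:

    //ee                              is shorthand for /descendant-or-self::node()/child::ee
    author/following-sibling::title   selects title siblings that come after an author child of the context node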

Recap: XPath
n A query language for XML
n Queries select a set of nodes
n Some selection criteria: Type of node, attributes, position in the tree (relative or absolute), number of children, …
n Can traverse the tree in all directions: up, down, left, right
n A building block in many other technologies, e.g., XSLT
n For HW2
n XPath-like searches through HTML

Plan for today
n XPath
n Basic crawling
n Mercator
n Publish/subscribe
n XFilter

Motivation
n Suppose you want to build a search engine
  n Need a large corpus of web pages
  n How can we find these pages?
n Idea: crawl the web
n What else can you crawl?
  n For example, social networks, publication networks, …

Crawling: The basic process
n What state do we need?
  n Q := Queue of URLs to visit
  n P := Set of pages already crawled
n Basic process (see the sketch below):
  1. Initialize Q with a set of seed URLs (very important!)
  2. Pick the first URL from Q and download the corresponding page
  3. Extract all URLs from the page (<a> tags, anchor links, CSS, schemas, scripts, optionally image links)
  4. Append to Q any URLs that a) meet our criteria, and b) are not already in P
  5. If Q is not empty, repeat from step 2
n Can one machine crawl the entire web?
  n Of course not! We need to distribute crawling across many machines.
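A minimal single-machine sketch of this loop in Java; the class name is illustrative, and download, extractUrls, and meetsCriteria are hypothetical helpers that a real crawler would have to implement carefully:

    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    public class BasicCrawler {
        private final Queue<URI> frontier = new ArrayDeque<>();   // Q: URLs to visit
        private final Set<URI> crawled = new HashSet<>();         // P: pages already crawled

        public void crawl(List<URI> seeds) {
            frontier.addAll(seeds);                           // 1. initialize Q with seed URLs
            while (!frontier.isEmpty()) {                     // 5. repeat until Q is empty
                URI url = frontier.remove();                  // 2. pick the first URL from Q
                if (!crawled.add(url)) continue;              //    skip it if it is already in P
                String page = download(url);                  //    ... and download the page
                for (URI link : extractUrls(url, page)) {     // 3. extract all URLs from the page
                    if (meetsCriteria(link) && !crawled.contains(link)) {
                        frontier.add(link);                   // 4. append acceptable new URLs to Q
                    }
                }
            }
        }

        // Hypothetical helpers; robust implementations are the hard part
        private String download(URI url) { return ""; }
        private List<URI> extractUrls(URI base, String html) { return List.of(); }
        private boolean meetsCriteria(URI url) { return true; }
    }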

Crawling visualized
[Figure: the Web as regions around the seed URLs: pages currently being crawled; the "frontier" of URLs that will eventually be crawled; URLs already crawled; URLs that do not fit our criteria (too deep, etc.); and the unseen web.]

The structure of the web
n Connectivity of the Web
n One can pass from any node of IN through SCC to any node of OUT.
n Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC.
n It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE: i.e., a passage from a portion of IN to a portion of OUT without touching SCC
Broder et al., “Graph structure in the web”, Proc. WWW, 2000

Crawling complications
n What order to traverse in?
  n Polite to do BFS – why?
n Malicious pages
  n Spam pages / SEO
  n Spider traps (incl. dynamically generated ones)
n General messiness
  n Cycles, site mirrors, duplicate pages, aliases
  n Varying latency and bandwidth to remote servers
  n Broken HTML, broken servers, …
n Web masters' stipulations
n How deep to crawl? How often to crawl?
  n Continuous crawling; freshness
Need to be robust!

SEO: “White-hat” version
n There are several ways you can make your web page easier to crawl and index:
  n Choose a good title
  n Use <meta> tags, e.g., description
  n Use meaningful URLs with actual words
      n BAD:  http://xyz.com/BQF/17823/bbq
      n GOOD: http://xyz.com/cars/ferrari/458-spider.html
  n Use mostly text for navigation (not Flash, JavaScript, …)
  n Descriptive file names, anchor texts, ALT tags
  n Have a fast / secure / mobile-friendly website!
n More information from search engines, e.g.:
  n http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf
  n https://www.microsoft.com/web/seo

SEO: “Black-hat” version
n Tries to trick the search engine into ranking pages higher than it normally would:
n Doorway pages
n Keyword stuffing
n Hidden or invisible text
n Link farms
n Scraper sites
n Page hijacking
n Blog/comment/wiki spam
n Article spinning
n Cloaking
n Shadow domains
n …

Normalization; eliminating duplicates
n Some of the extracted URLs are relative URLs
  n Example: /~cis455/
  n Normalize it: http://www.cis.upenn.edu:80/~cis455/
n Duplication is widespread on the web
  n If the fetched page is already in the index, do not process it
  n Can verify using document fingerprint (hash) or shingles
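A small sketch of how a crawler can do this normalization with java.net.URI; the base URL here is only an assumed example of the page on which the relative link was found:

    import java.net.URI;

    public class Normalize {
        public static void main(String[] args) {
            // Page on which the relative link /~cis455/ was found (assumed example)
            URI base = URI.create("http://www.cis.upenn.edu:80/index.html");

            // Resolve the relative URL against the base, then remove ./ and ../ segments
            URI abs = base.resolve("/~cis455/").normalize();
            System.out.println(abs);   // prints http://www.cis.upenn.edu:80/~cis455/
        }
    }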

Crawler etiquette
n Explicit politeness
n Look for meta tags; for example, ignore pages that carry a robots meta tag such as <meta name="robots" content="noindex,nofollow">
n Implement the robot exclusion protocol; for example, look for, and respect, robots.txt
n Implicit politeness
n Even if no explicit specifications are present, do not hit the
same web site too often

Robots.txt
n What should be in robots.txt?
  n See http://www.robotstxt.org/robotstxt.html
n To exclude all robots from a server:
      User-agent: *
      Disallow: /
n To exclude one robot from two directories:
      User-agent: BobsCrawler
      Disallow: /news/
      Disallow: /tmp/
n Example (http://www.cis.upenn.edu/robots.txt):
      User-agent: *
      Disallow: /_mm/
      Disallow: /_notes/
      Disallow: /_baks/
      Disallow: /MMWIP/

      User-agent: googlebot
      Disallow: *.csi

      User-agent: *
      Crawl-delay: 5
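Before fetching a URL, the crawler has to check files like these. A deliberately simplified sketch of such a check (it ignores Allow lines, wildcards, Crawl-delay, and the finer points of user-agent grouping):

    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {
        // Returns true if 'path' may be fetched by 'agent' according to the given robots.txt text
        public static boolean allowed(String robotsTxt, String agent, String path) {
            List<String> disallowed = new ArrayList<>();
            boolean applies = false;
            for (String line : robotsTxt.split("\n")) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    String ua = line.substring("user-agent:".length()).trim();
                    applies = ua.equals("*") || agent.toLowerCase().contains(ua.toLowerCase());
                } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                    String prefix = line.substring("disallow:".length()).trim();
                    if (!prefix.isEmpty()) disallowed.add(prefix);   // empty Disallow means "allow everything"
                }
            }
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) return false;
            }
            return true;
        }
    }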

Recap: Crawling
n How does the basic process work?
n What are some of the main challenges?
  n Duplicate elimination
  n Politeness
  n Malicious pages / spider traps
  n Normalization
  n Scalability

Plan for today
n XPath
n Basic crawling
n Mercator
n Publish/subscribe
n XFilter

Mercator: A scalable web crawler
n Written entirely in Java
n Expands a “URL frontier”
n Avoids re-crawling same URLs
n Also considers whether a document has been
seen before
n Same content, different URL [when might this occur?]
n Every document has signature/checksum info computed as it’s crawled
Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW’99)

Mercator architecture
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it's new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
Source: Mercator paper

Mercator’s polite frontier queues
n Tries to go beyond breadth-first approach
n Goal is to have only one crawler thread per server
n What does this mean for the load caused by Mercator?
n Distributed URL frontier queue:
n One subqueue per worker thread
n The worker thread is determined by hashing the hostname of the URL
n Thus, only one outstanding request per web server
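A tiny sketch of that assignment rule; the number of worker threads is an assumption:

    import java.net.URI;

    public class FrontierRouter {
        private static final int NUM_WORKERS = 64;   // assumed number of worker threads

        // All URLs with the same host map to the same subqueue, so at most one
        // worker thread (and one outstanding request) ever talks to a given server
        static int subqueueFor(URI url) {
            return Math.floorMod(url.getHost().hashCode(), NUM_WORKERS);
        }
    }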

Mercator’s HTTP fetcher
n First, needs to ensure robots.txt is followed
n Caches the contents of robots.txt for various web sites as it
crawls them
n Designed to be extensible to other protocols
n Had to write own HTTP requestor in Java – their Java version didn’t have timeouts
n Today, can use setSoTimeout()
n Could use Java non-blocking I/O:
n But they use multiple threads and synchronous I/O
n Multi-threaded DNS resolver
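A minimal sketch of the timeout support that is available today (the host and the 10-second values are arbitrary examples):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class TimeoutDemo {
        public static void main(String[] args) throws Exception {
            Socket sock = new Socket();
            // Give up if the TCP connection cannot be established within 10 seconds
            sock.connect(new InetSocketAddress("www.cis.upenn.edu", 80), 10_000);
            // Reads now throw SocketTimeoutException if the server stalls for more than 10 seconds
            sock.setSoTimeout(10_000);
            sock.close();
        }
    }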

Other caveats
n Infinitely long URL names (good way to get a buffer overflow!)
n Aliased host names
n Alternative paths to the same host
n Can catch most of these with signatures of document data (e.g., MD5; see the sketch below)
n Comparison to Bloom filters
n Crawler traps (e.g., CGI scripts that link to themselves using a different name)
n May need to have a way for a human to override certain URL paths
n Checkpointing!!
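A sketch of computing such a document signature with Java's MessageDigest; MD5 is used because the slide mentions it, though a stronger hash would work the same way:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class Fingerprint {
        // Two pages with the same fingerprint are treated as duplicates
        static String md5(String pageContent) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(pageContent.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));   // hex-encode the 128-bit digest
        }
    }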

Mercator document statistics
PAGE TYPE      PERCENT
text/html        69.2%
image/gif        17.9%
image/jpeg        8.1%
text/plain        1.5%
pdf               0.9%
audio             0.4%
zip               0.4%
postscript        0.3%
other             1.4%

[Figure: histogram of document sizes, over 60M pages]

Further considerations
n May want to prioritize certain pages as being most worth crawling
n Focused crawling tries to prioritize based on relevance
n May need to refresh certain pages more often
n General approach: Data partitioning
n By domain, to coordinate activity with many machines
n The same trick works on each server: separate subqueue for each domain
n Then, can use priority queues or round-robin to visit pages
while respecting politeness

Where to go from here
n You will need to build a crawler for HW2 MS1 and for the final project
n Please learn from others’ experiences!
n Several crawling-related papers are linked from the reading
list – e.g., the Google paper, the Mercator paper, …
n Reading these papers carefully before you begin will save you a lot of time
n Get a sense of what to expect
n Avoid common problems and bottlenecks
n Identify designs that won't work well