程序代写代做代考 crawler file system CIS 455/555: Internet and Web Systems

CIS 455/555: Internet and Web Systems
Publish/subscribe October 21, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
1

Plan for today
n XPath
n Basic crawling
n Mercator
n Publish/subscribe n XFilter
n Google File System
NEXT
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
2

The publish/subscribe model
Events
???
n Each publisher produces events
n Example: Web page update, stock quote, announcement, …
n Each subscriber wants a subset of the events n But usually not all
Interests
n How do we implement this efficiently? © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
3

Example: RSS
n Web server publishes XML file with events n Clients periodically request the file to see if
there are new events n Is this a good solution?
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
4

Pub/Sub: great for event-driven architectures
Backend API
Storefront Website
n Fits well with the event driven paradigm
n Used for:
n News publication, announcements, … n Microservices / serverless compute
n Stream Processing
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
5
Account Service
Inventory Service
Shipping Service

More generally: Stream processing
n Stream engines take one item at a time, feed
them to a set of queries that process each
item as it comes
n The queries may maintain state from item to item n They may output results, possibly as streams
n Example: Crawling to extract data
Lists of interests
Crawler Crawler
“spout”
HW2MS1
Filter on
© 2020 A. Haeberlen, Z. Ives, V. Liu
University of Pennsylvania
6
Filter on documents
documents
“bolt”
HW2MS2

Plan for today
n Publish/subscribe
n XFilter
n Google File System
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
7
NEXT

XPath
(here used to match
documents, not nodes)
XML-Based information dissemination
n Basic model:
n Users are interested in data relating to a particular topic, and
know the schema /politics/usa//body
n A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
8

Engine for XFilter [Altinel & Franklin]
(2) (1)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
(3)
9

How does it work?
n Each XPath segment is basically a subset of regular expressions over element tags
n Convert into finite state automata
n Parse data as it comes in
n Match against finite state machines
n Most of these systems use modified FSMs because they want to match many patterns at the same time
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
10

XML -> Events Parser
Simple API for XML (SAX):
© 2020 A. Haeberlen, Z. Ives, V. Liu
University of Pennsylvania
11
StartElement: rss
StartElement: channel
StartElement: title
StartElement: link
Text: https://www.nytimes.com/…
EndElement: link Text: NYT > Technology EndElement: title

Path nodes and FSMs
n XPath parser decomposes XPath expressions into a set of path nodes
n These nodes act as the states of corresponding FSM n Path nodes are not generated for ”*” nodes
n Simple FSM for /politics[@topic=“president”]/usa//body: //
politics
Q1_1
usa body
Q1_2 Q1_3
© 2020 A. Haeberlen, Z. Ives, V. Liu
University of Pennsylvania
12

Decomposing into path nodes
Q1=/politics[@topic=“president”]/usa//body Query ID
Q1
1
0
1
Q1
2
1
0
Q1
3
-1
-1
Q1-1 Q1-2 Q1-3
Q2=//usa/*/body/p
Position in state machine
Relative Position (RP) in tree:
0 for root node if it’s not preceded by “//”
-1 for any node preceded by “//”
Else =1+ (no of “*” nodes from predecessor node)
Level:
If current node has fixed distance from root, then 1+ distance
Else if RP = –1, then –1, else 0
Finally, NextPathNodeSet points to next node
Q2
1
-1
-1
Q2
2
2
0
Q2
3
1
0
Q2-1 Q2-2 Q2-3
13
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania