CIS 455/555: Internet and Web Systems
Spring 2022
Homework 2: Web Crawling and Stream Processing
Milestone 1 due March 15, 2022, at 10:00pm ET
Milestone 2 due March 25, 2022, at 10:00pm ET
1. Background
In this assignment, you will explore several more Web technologies and continue to build components useful towards your course project, by building a topic-specific Web crawler. A topic-specific crawler looks for documents or data matching certain phrases or content in the header or body. Similar tasks are handled by firewalls with various data-driven filter rules, and by stream processing systems looking for specific sequences of messages.
This assignment will entail, in Milestone 1:
● Expanding a dynamic Web application, which runs on the framework you built for Assignment 1
(or the Spark Java framework) and allows users to (1) create user identities, (2) define topic-specific “channels” consisting of a set of XPath expressions, and (3) display documents that match a channel;
● Implementing and expanding a persistent data store (using Oracle Berkeley DB) to hold retrieved HTML/XML documents and channel definitions.
● Fleshing out a crawler that traverses the Web, looking for HTML and XML documents that match one of the patterns.
In Milestone 2:
● Refactoring the crawler to fit into a stream processing system’s basic abstractions;
● Routing documents from the crawler through a stream engine for processing one at a time;
● Writing a state machine-based pattern matcher that determines if an HTML or XML document
matches one of a set of patterns;
The resulting application will be very versatile, able to filter documents into categories with specified keywords. Assignment 2 can build on your application server from Assignment 1. However, if you are not confident that your web server is working well, please use the Spark Java framework (http://www.sparkjava.com/) to run your application. (Using your own server will earn you +5% extra credit. You are allowed to make fixes to it as necessary.)
2. Developing and running your code
You should fork, clone, and import the framework code for HW2 using the same process as for HW0 and HW1 (fork the provided framework repository to your own private repository, then clone from your repository to your VirtualBox/Vagrant instance). You should, of course, regularly commit code as you make changes so you can revert; and you should periodically push to your own repository on Bitbucket, in case your computer crashes.
Initially you will be using the Spark Java framework for this assignment, but for extra credit you can run it using your own HW1 framework (see Section 6).
Carefully read the entire assignment (both milestones) from front to back and make a list of the features you need to implement. There are many important details that are easily overlooked! Spend at least some time thinking about the design of your solution. What classes will you need? How many threads will there be? What will their interfaces look like? Which data structures need to be synchronized? And so on.
We strongly recommend that you regularly check the discussions on Piazza for clarifications and solutions to common problems. We will try to include a “rollup” post with tips.
3. Milestone 1: Crawler Manager, B+ Tree Storage, and Crawler
For the first milestone, your task is to crawl and store Web documents. These will ultimately be fed into a pattern matching engine for Milestone 2.
3.1 Routes-Based Web Interface / Crawler Manager
In preparation for Milestone 2, we will have a Web interface for login. Your main class should be called edu.upenn.cis455.crawler.WebInterface and it should register routes for various behaviors. We have given you a partial implementation of the login handler so you can get started.
1. If the user makes a request for a page and is not logged in (i.e., has no Session), the server should output the login form, login-form.html. This form should submit a POST message as described below under “log in to existing account.”
2. If the user is logged in (has a Session), requests to the root URL (or /index.html) should present a simple login page showing “Welcome ” followed by the user’s username. (We will add more functionality in Milestone 2.)
You will see in the provided code how Filters allow you to make decisions about whether the user’s request should proceed, or the user should be redirected to the login form. Thus, some of the above should already be present.
Beyond the above, you should build additional routes to support the following functions:
● Create a new account. This should take a POST to URL localhost:45555/register with two parameters, username and password. Upon success it should return an appropriate success code with a link to the main page and its login screen. On failure, it should return an appropriate error.
● Log into an existing account (multiple users should be able to log in at the same time). This should take a POST to URL localhost:45555/login with two parameters, username and password. Upon success it should create a new Session with the user’s info and return the user to the main page, which should show the logged-in info as above. The Session should time out after 5 minutes of inactivity.
● Log out of the account that is currently logged in, via /logout. Upon success it should redirect the user to the login page.
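The 5-minute inactivity timeout described above can be implemented by stamping each session with its last access time and checking it on every request. A minimal sketch (the `Session` class and method names here are illustrative, not part of the provided framework):

```java
import java.time.Duration;
import java.time.Instant;

public class Session {
    private static final Duration TIMEOUT = Duration.ofMinutes(5);
    private Instant lastAccessed;

    public Session(Instant now) {
        lastAccessed = now;
    }

    // Call this on every request made by this session's user.
    public void touch(Instant now) {
        lastAccessed = now;
    }

    // True once more than 5 minutes have passed since the last request.
    public boolean isExpired(Instant now) {
        return Duration.between(lastAccessed, now).compareTo(TIMEOUT) > 0;
    }
}
```

In practice, a Filter in front of your routes would call `touch()` on each authenticated request and redirect to the login form when `isExpired()` returns true.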
Note that you may take advantage of Sessions and the various other capabilities of the Spark Java framework and/or the ones you developed in Homework 1 Milestone 2. Some of the above functionality is provided, so please look carefully to understand what is and is not there.
3.2. Storage of Document and User Credentials in a B+ Tree
We will use Berkeley DB Java Edition (http://www.oracle.com/technetwork/database/berkeleydb/ overview/index.html), which may be downloaded freely from their website, to implement a disk-backed data store. Berkeley DB is a popular embedded database, and it is relatively easy to use as a key-value store; there is ample documentation and reference information available on its Web site, e.g., the Getting Started Guide. Berkeley DB has several different ways of storing and retrieving data; perhaps the simplest is through the Java Collections interface, although the Direct Persistence Layer (DPL) has slightly more power. You may choose whichever BerkeleyDB interface method you prefer.
Your store will hold (at least):
● the usernames and encrypted passwords of registered users (see below),
● (in Milestone 2) information about user channels
● and the raw content of HTML or XML files retrieved from the Web, as well as the time the file
was last checked by the crawler.
If you use the Collections interface, you will create objects representing your data that implement java.io.Serializable, and store them in objects like StoredSortedMaps. User passwords should be stored as SHA-256 hashes; no cleartext passwords should be saved.
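Java's built-in MessageDigest supports SHA-256 directly, so no external library is needed for the password hashing. A possible helper (the class name is just an example):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PasswordHasher {
    // Returns the lowercase hex SHA-256 digest of the given password.
    public static String sha256(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every conforming JVM, so this cannot happen.
            throw new RuntimeException(e);
        }
    }
}
```

Store `sha256(password)` at registration time and compare hashes (never plaintext) at login.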
The WebInterface program, when run from the command-line, should take as the first argument a path for the BerkeleyDB data storage instance, and as a second argument, a path to your static HTML files. Your code should create a data storage directory if it does not already exist.
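Creating the data storage directory on demand is a one-liner with java.nio; a sketch (the class name is illustrative, and opening the actual BerkeleyDB Environment on this path is omitted here):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StorageSetup {
    // Ensures the BerkeleyDB environment directory exists, creating it
    // (and any missing parents) if necessary; returns the resolved path.
    public static Path ensureEnvDir(String dir) throws IOException {
        Path p = Paths.get(dir);
        Files.createDirectories(p); // no-op if the directory already exists
        return p;
    }
}
```

Your WebInterface main method would call this on its first argument before constructing the BerkeleyDB Environment.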
3.3. Basic Web Crawler
Your web crawler will initially be a Java application that can be run in Eclipse by creating a Run Configuration (as in HW0) with the goal “clean install”. From the command line, you can also run the corresponding mvn command. In both cases, the crawler will take the following command-line arguments (in this specific order; the first three are required):
1. The URL of the Web page at which to start. Note that there are several ways to open the URL.
a. For plain HTTP URLs you will probably get the best performance by just opening a socket to the port (we’ve provided the URLInfo class to help parse the pieces out). It is also acceptable to use Java’s HttpURLConnection.
b. For HTTPS URLs you may want to use java.net.URL’s openConnection() method and cast to javax.net.ssl.HttpsURLConnection. This in turn has input and output streams as usual. Here you can keep relying on the sample code.
c. Your crawler should recursively follow links from the page it starts on.
2. The directory containing the BerkeleyDB database environment that holds your store (this will match the path the WebInterface takes). As specified above, the directory should be created if it does not already exist.
3. The maximum size, in megabytes, of any document to be retrieved from a Web server. You should discard the document once you’ve reached that limit.
4. An optional argument indicating the number of files (HTML and XML) to retrieve before ending and exiting. This will be useful for testing!
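One way to collect these four arguments is a small config object that validates them up front; a sketch (the class and field names are illustrative):

```java
public class CrawlerArgs {
    public final String startUrl;
    public final String envDir;
    public final double maxSizeMb;
    public final int maxFiles; // -1 means "no limit"

    // Parses: <start-url> <bdb-env-dir> <max-size-mb> [max-files]
    public CrawlerArgs(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                "usage: Crawler <start-url> <bdb-env-dir> <max-size-mb> [max-files]");
        }
        startUrl = args[0];
        envDir = args[1];
        maxSizeMb = Double.parseDouble(args[2]);
        maxFiles = args.length > 3 ? Integer.parseInt(args[3]) : -1;
    }
}
```

Failing fast with a usage message makes the optional fourth argument easy to exercise during testing.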
The crawler is intended to be run periodically, either by hand or from an automated tool like the cron command. It is, therefore, not necessary to build a connection from the Web interface to the crawler, except that the two will share a common database. Note also that BerkeleyDB does not like to share database instances across concurrently executing programs, so it’s okay to assume only one runs at a time.
The crawler traverses links in HTML documents. You can extract these using an HTML parser, such as JSoup (https://jsoup.org/, included with your Maven package), or simply by searching the HTML document for occurrences of the pattern href="URL" and its subtle variations.
If a link points to another HTML document, it should be retrieved and scanned for links as well. The same is true if it points to an XML or RSS document. Don’t bother crawling images or trying to extract links from XML files. All retrieved HTML and XML documents should be stored in the BerkeleyDB database (so that the crawler does not have to retrieve them again if they do not change before the next crawl). The crawler must be careful not to search the same page multiple times during a given crawl, and it should exit when it has no more pages to crawl. You’ll need to understand what parts of functionality are provided and where you need to supplement.
Redundant documents and cyclic crawls: Your crawler should compute an MD5 hash of every document that is fetched, and store this in a “content seen” table in BerkeleyDB. If you crawl a
document with a matching hash during the same crawler run, you should not index it or traverse its outgoing links.
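The content-seen check is a thin wrapper around an MD5 digest keyed set. The sketch below keeps the table in memory for illustration; your crawler should persist it in the BerkeleyDB “content seen” table as described:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ContentSeen {
    // In-memory for illustration; the real crawler stores this in BerkeleyDB.
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a given document body is seen, false afterwards.
    public boolean firstTimeSeen(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return seen.add(sb.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```

When `firstTimeSeen` returns false, skip indexing the document and do not traverse its outgoing links.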
When your crawler is processing a new HTML or XML page, it should print a short status report to the Apache Log4J logger, using the “info” status level. At the very least, you should print: “http://xyz.com/index.html: downloading” (if the page is actually downloaded) or “http://abc.com/def.html: not modified” (if the page is not downloaded because it has not changed). Make sure you follow the above format to comply with the autograder’s assumptions.
3.4. Politeness
Your crawler must be a considerate Web citizen. First, it must respect the robots.txt file, as described in A Standard for Robot Exclusion (http://www.robotstxt.org/robotstxt.html). It must support the Crawl-Delay directive (see http://en.wikipedia.org/wiki/Robots.txt) and “User-agent: *”, but it need not support wildcards in the Disallow: paths. Second, it must always send a HEAD request first to determine the type and size of a file on a Web server. If the file has the type text/html or one of the XML MIME types:
● text/xml
● application/xml
● Any MIME type that ends with +xml
and if the file is less than or equal to a specified maximum size, then the crawler should retrieve the file and process it; otherwise it should ignore it and move on. For more details on XML media types, see RFC 3023 (http://www.ietf.org/rfc/rfc3023.txt). Your crawler should also not retrieve the file if it has not been modified since the last time it was crawled, but it should still process unchanged files (i.e., match them against XPaths and extract links from them) using the copy in its local database.
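The MIME-type test on the HEAD response can be isolated into a small predicate; note that Content-Type headers often carry parameters such as charset that must be stripped first. A sketch (the class name is illustrative):

```java
public class MimeFilter {
    // Returns true if the given Content-Type (from a HEAD response) is one the
    // crawler should download: text/html, text/xml, application/xml, or any *+xml.
    public static boolean isCrawlable(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Strip parameters such as "; charset=utf-8" before comparing.
        String type = contentType.split(";")[0].trim().toLowerCase();
        return type.equals("text/html")
            || type.equals("text/xml")
            || type.equals("application/xml")
            || type.endsWith("+xml");
    }
}
```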
We have given you some “helper” classes in the crawler.utils package, which might be useful to store information about URLs and robots.txt.
Certain web content, such as the papers in ACM’s Digital Library, normally costs money to download but is free from Penn’s campus network. If your crawler accidentally downloads a lot of this content, this will cause a lot of trouble. To prevent this, you must send the header User-Agent: cis455crawler in each request.
3.5. Test cases
You must develop at least two JUnit tests for the storage system and two more for the crawler.
You should next add a Route to enable retrieval of documents from the BDB store, using the following form:
localhost:45555/lookup?url=…
that takes a GET request with a URL as parameter url, and returns the stored document corresponding to that URL. Think of this as the equivalent of Google’s cache. If the document was not crawled, your server should return a 404 error. We will use this interface for testing.
3.6. Milestone 1 Submission
Submit a zip file on Gradescope as before. Additionally, as with Homework 1, to be eligible for submitting a “patch” for regrading:
1. Share your Homework 2 repo with
2. Commit your current code.
3. git tag hw2m1 to tag this version as your HW2M1 version.
4. Push to Bitbucket (you may need to use git push --tags).
4. Milestone 2: Streaming Crawler and Matching Engine
The next milestone will consist of an evaluator for streams of document content. You will extend your Milestone 1 project to run on a stream processing engine that enables multithreaded execution. You will also extend your application and storage system to enable users to register “subscription channels” with the system. Finally, you will build a stream processing component that checks documents against the various “channels” and outputs results per user.
4.1 Rework the Crawler as a “Spout,” “Bolt,” and Shared Modules in StormLite
Your Milestone 1 project had a simple execution model, in which you controlled the execution of the crawler, presumably in a crawler loop. Now we want to break it into smaller units of work that can be parallelized.
To do this, we’ll be using a CIS 455/555-custom emulation of the Apache Storm stream engine, which we call StormLite (it should already be in your HW2 repo). Please see the document on StormLite, and see TestWordCount (in the test directory) as an example of a multithreaded stream job. Storm has spouts that produce data one result at a time, and bolts that process data one result at a time. As with our emulation of the Spark framework, you should be able to use examples of Apache Storm code to understand how StormLite works.
You should refactor your Milestone 1 crawl task to run within StormLite, as follows. Note that StormLite supports multiple worker threads but you can control the number of “executors” (and start with 1).
1. You will maintain (or update) your frontier/crawl queue of URLs from Milestone 1. However, you want to place it in a “spout” (implement IRichSpout) so it sends URLs one at a time to the crawler via the nextTuple interface.
2. You will maintain your BerkeleyDB storage system from Milestone 1. This will also be a shared object, at least across some aspects of your Milestone 2 implementation. Again, you may want to use a “singleton factory” pattern.
3. The crawler should be placed in a bolt – its execute method gets called once for each URL from the crawler queue. The crawler should output documents one at a time to its output stream. See the IRichBolt interface and the example code.
4. Now, in our suggested (but not mandatory) architecture, there should be two “downstream” bolts that take documents. (It is perfectly possible to send an output stream to two destinations.)
5. Lower branch:
a. One bolt should have an execute method that takes a document, writes it to the BerkeleyDB storage system, and outputs a stream of extracted URLs.
b. Next, there should be a bolt that filters URLs (using appropriate techniques and data structures) and updates the shared frontier queue.
6. Upper branch:
a. A second (parallel) bolt should take a document and parse it using JSoup or another parser that takes element structure into account. It should send streams of OccurrenceEvents. This bolt will traverse the entire DOM tree using a standard tree traversal. You will send an OccurrenceEvent each time you traverse to an element node from its parent (ElementOpen), each time you traverse to a text node (Text), and each time you traverse back up to an element node from its children (ElementClose).
b. Finally, there should be a bolt that checks for matches to channels using the streams of events. When there is a match, it should update the BerkeleyDB store accordingly.
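To illustrate the event stream produced in step 6a, here is a self-contained traversal using the JDK's built-in DOM parser instead of JSoup (events are rendered as plain strings for illustration; in your bolt they would be OccurrenceEvent tuples emitted to the output stream):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class EventEmitter {
    // Emits "open:tag", "text:...", "close:tag" strings in document order,
    // mirroring the ElementOpen / Text / ElementClose events described above.
    public static List<String> emit(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> events = new ArrayList<>();
        walk(doc.getDocumentElement(), events);
        return events;
    }

    private static void walk(Node node, List<String> events) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            events.add("open:" + node.getNodeName());     // ElementOpen
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                walk(c, events);
            }
            events.add("close:" + node.getNodeName());    // ElementClose
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);               // Text
            }
        }
    }
}
```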
Based on the sample code in test.upenn.cis.stormlite, the StormLite document, and (in fact) the documentation on Storm available from Stack Overflow and the Web, you should be able to assemble a stream dataflow like the one described above.
You can assume a single crawler queue bolt, but should look at how the fieldGrouping, allGrouping, and shuffleGrouping specifiers allow you to specify how data gets distributed when there are multiple “executors” such as multiple copies of the crawler, parser, etc.
4.2 Extended Routes-based Web Interface
For Milestone 2, you will also enhance the crawler’s Web application to support the following functions for logged in users.
4.2.1. Channels
Now that you have users and HTML or XML, we want to “connect” users with “interesting” content. To do this, any logged in user will be able to create channels. A channel is a named pattern describing a class
of documents. An example of a channel definition would be
sports : /rss/channel/title[contains(text(),”sports”)]
and you can see an example of content that would match the channel at:
http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml
Assume that channels and their names are global.
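While Milestone 2 asks you to build your own stream-based matcher, you can sanity-check channel definitions like the one above using the JDK's built-in XPath engine. A sketch (the class name is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ChannelMatcher {
    // Returns true if the XML document matches the channel's XPath pattern.
    public static boolean matches(String xpathPattern, String xml) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        InputSource src = new InputSource(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // Wrapping in boolean(...) yields true iff the node-set is non-empty.
        return (Boolean) xp.evaluate(
                "boolean(" + xpathPattern + ")", src, XPathConstants.BOOLEAN);
    }
}
```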
You should implement an interface to create a channel, as a GET call:
localhost:45555/create/{name}?xpath={xpath-pattern}
Of course, a brand new channel will have no content, but we expect that the channel will get populated
after the next crawl.
4.2.2. Updated Login Screen
As before, you should have a login/registration form at localhost:45555/register. Once a user is logged in, you should have a “home page” at localhost:45555/.
● List all channels available on the system, and
● Include, for each channel, a link to its matching documents, which triggers a lookup service for the application at localhost:45555/show?channel={name}
Obviously, you will need to add some logic to the BerkeleyDB storage system to store which documents correspond to a channel (see Section 3.2 for how this will be populated). How you implement most of the functionality of the Web interface is entirely up to you; we are just constraining the URL interfaces. To make things consistent across assignments, we are specifying how the channel must be displayed by the “show” request.
For each channel:
● For each HTML or XML (e.g., RSS) document that matched an XPath in the channel:
○ The string “Crawled on: ” followed by a date in the same format as 2019-10-31T17:45:48, i.e. YYYY-MM-DDThh:mm:ss, where the T is a separator between the day and the time.
○ The string “Location: ” followed by the URL of the document.
○ The document contents.
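The timestamp format above maps directly onto java.time's ISO local date-time pattern; a small formatting helper (the class name is illustrative):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CrawlTime {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss");

    // Formats a timestamp as YYYY-MM-DDThh:mm:ss, e.g. 2019-10-31T17:45:48.
    public static String format(LocalDateTime t) {
        return FMT.format(t);
    }
}
```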
We expect this application to run on your application server from HW1. If you did not complete HW1, or for some other reason do not want to continue to use the application server that you wrote, you may continue to use the Spark Java framework with no penalty.
4.3 Pattern Engine