
Homework 1: Crawling
INF 558 BUILDING KNOWLEDGE GRAPH

DUE DATE: Friday, 08/31/2018 @ 11:59pm on Blackboard.

Ground Rules

This homework must be done individually. You can ask others for help with the tools, but the submitted homework has to be your own work.

Summary

In this homework, you will create/use Web crawlers to collect webpages from Smithsonian American Art Museum (SAAM). A Web crawler is a program that explores the Web, typically for the purpose of Web indexing. It starts with a list of seed URLs to visit, and as it visits each webpage, it finds the links in that web page, and then visits those links and repeats the entire process.

The two Web crawlers you have to use in this homework are:
– ACHE (https://github.com/ViDA-NYU/ache)
– Scrapy (https://scrapy.org)
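
For illustration only, the crawl loop described in the summary can be sketched in plain Python. The snippet below is not part of the assignment (the assignment requires ACHE and Scrapy); it uses the third-party requests library, and the seed list and page limit are placeholder parameters:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests  # third-party: pip install requests


class LinkParser(HTMLParser):
    """Collect href values from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: visit a page, collect its links, then visit those links."""
    queue = deque(seed_urls)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        parser = LinkParser()
        parser.feed(response.text)
        for href in parser.links:
            next_url = urljoin(url, href)
            if next_url.startswith(("http://", "https://")):
                queue.append(next_url)
    return visited
```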

Task 1 (3 points)

Crawl at least 3000 webpages of artworks in SAAM. A sample artwork webpage is shown in Figure 1. You have to submit two sets of webpages, obtained from ACHE and Scrapy respectively.

We provide a list of artwork URLs (mandatory_artworks.txt), and your result set must include these webpages. All of the collected webpages should be artwork pages. We will sample the webpages to check whether they are correct. For example, the page shown in Figure 2 should not be in your result set since it is not an artwork page.
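
With Scrapy, one way to keep only artwork pages is a CrawlSpider whose link-extraction rules match artwork-like URLs. This is only a sketch: the americanart.si.edu domain, the seed URL, and the /artwork/ URL pattern are assumptions that you should verify against the actual SAAM site before using them.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ArtworkSpider(CrawlSpider):
    """Sketch of a spider that saves only artwork pages while following other links."""

    name = "saam_artworks"
    allowed_domains = ["americanart.si.edu"]      # assumed SAAM domain
    start_urls = ["https://americanart.si.edu/"]  # assumed seed URL
    custom_settings = {"DOWNLOAD_DELAY": 0.5}     # be polite to the server

    rules = (
        # Save pages whose URL looks like an artwork page (assumed pattern).
        Rule(LinkExtractor(allow=r"/artwork/"), callback="parse_artwork", follow=True),
        # Follow everything else on the site to discover more artwork links,
        # but do not save those pages.
        Rule(LinkExtractor(), follow=True),
    )

    def parse_artwork(self, response):
        # The remaining CDR attributes (doc_id, timestamp_crawl) can be added
        # later when building the CDR file (see Task 4).
        yield {"url": response.url, "raw_content": response.text}
```

The same spider can be adapted for Task 2 by changing the allow pattern to match artist pages instead of artwork pages.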

Task 2 (3 points)

Similar to Task 1, you will crawl at least 5000 webpages of artists in SAAM using ACHE and Scrapy. A sample artist webpage is shown in Figure 3. The list of required artists is in mandatory_artists.txt.

All of the collected webpages should be artist pages. We will sample the webpages to check whether they are correct. For example, the page shown in Figure 2 should not be in your result set since it is not an artist page.

Task 3 (2 points)

Answer the following questions (maximum 2 sentences for each question):

  • What seed URL(s) did you use?
  • How did you manage to only collect artwork/artist pages? How did you discard irrelevant pages?
  • If you were not able to collect 3000/5000 pages, describe and explain your issues.

Task 4 (2 points)

Store your crawled webpages into CDR files. CDR files follow the JSON Lines (http://jsonlines.org/) format: each line in a CDR file is a valid JSON object that represents the information about one crawled webpage. The JSON object has the following attributes:

– doc_id: unique id for the webpage
– url: url of the webpage
– raw_content: html content of the webpage
– timestamp_crawl: when the webpage was crawled

You can check the attached file sample_cdr.jl to see what the CDR format looks like. You should validate the JSON objects in your CDR file to ensure they have the correct format, especially the string value of the “raw_content” attribute.
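
For example, here is a minimal sketch of writing and validating CDR lines in Python. The attribute names come from the list above, but the doc_id scheme and the timestamp format are assumptions; check sample_cdr.jl for the exact expected values.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_cdr_record(url, raw_html):
    """Build one CDR record with the four required attributes."""
    return {
        # Assumption: a hash of the URL serves as the unique id.
        "doc_id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "url": url,
        "raw_content": raw_html,
        # Assumption: ISO-8601 UTC timestamp; check sample_cdr.jl for the expected format.
        "timestamp_crawl": datetime.now(timezone.utc).isoformat(),
    }


def write_cdr_file(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_cdr_file(path):
    """Check that every line is valid JSON and has the required attributes."""
    required = {"doc_id", "url", "raw_content", "timestamp_crawl"}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            obj = json.loads(line)  # raises ValueError if the line is not valid JSON
            missing = required - obj.keys()
            if missing:
                raise ValueError(f"line {line_no} is missing attributes: {missing}")
```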

After you build your CDR files, use the provided script post_processing.py to reduce the size of your files. The script takes one argument <cdr_path>, which is the path to your CDR file, and outputs a new CDR file at <cdr_path>.processed. For example, python post_processing.py /home/users/sample_cdr.jl will output /home/users/sample_cdr.jl.processed. Refer to post_processing_usage.pdf for more information.

You will submit the new CDR files instead of your original files.

Submission Instructions

You must package the following files/folders in a single .zip archive named Firstname_Lastname_hw1.zip and submit it via Blackboard:

  • Firstname_Lastname_hw1_report.pdf: A PDF file containing your answers to the questions in Task 3.
  • CDR files containing all the web pages that you crawled:

o Firstname_Lastname_artist_ache_cdr.jl.processed: Artist web pages you got from ACHE.

o Firstname_Lastname_artist_scrapy_cdr.jl.processed: Artist web pages you got from Scrapy.

o Firstname_Lastname_artwork_ache_cdr.jl.processed: Artwork web pages you got from ACHE.

o Firstname_Lastname_artwork_scrapy_cdr.jl.processed: Artwork web pages you got from Scrapy.

• source: This folder includes all the code you wrote to accomplish Tasks 1, 2, and 4, for example your Scrapy crawler, or your script/program to eliminate unwanted pages and store webpages in CDR format.

Figure 1. An artwork in SAAM

Figure 2. An irrelevant page

Figure 3. An artist in SAAM