In this project, you are going to implement the core of a Web crawler, and then you are going to crawl the following URLs (to be considered as domains for the purposes of this assignment) and paths:
• *.ics.uci.edu/*
• *.cs.uci.edu/*
• *.informatics.uci.edu/*
• *.stat.uci.edu/*
• today.uci.edu/department/information_computer_sciences/*
As a concrete deliverable of this project, besides the code itself, you must submit a report containing answers to the following questions:
1. How many unique pages did you find? Uniqueness for the purposes of this assignment is established ONLY by the URL, discarding the fragment part. So, for example, http://www.ics.uci.edu#aaa and http://www.ics.uci.edu#bbb are the same URL. Even if you implement additional methods for textual similarity detection, please keep using this definition of unique pages when counting them for this assignment.
2. What is the longest page in terms of number of words? (HTML markup doesn’t count as words)
3. What are the 50 most common words in the entire set of pages? (Ignore English stop words, which can be found, for example, in the stop-word list linked from the assignment page.) Submit the list of common words ordered by frequency.
4. How many subdomains did you find in the ics.uci.edu domain? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. The content of this list should be lines containing URL, number, for example:
 http://vision.ics.uci.edu, 10 (not the actual number here)
(A minimal bookkeeping sketch covering these four report questions follows this list.)
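For reference, the bookkeeping needed to answer these four questions can stay quite small. The sketch below is only one possible approach using Python's standard library; the counter names, the record_page helper, and the tiny placeholder stop-word set are hypothetical and not part of the starter code.

```python
from collections import Counter
from urllib.parse import urldefrag, urlparse

unique_pages = set()          # Q1: unique URLs, fragment removed
longest_page = ("", 0)        # Q2: (url, word_count)
word_frequencies = Counter()  # Q3: counts of non-stop-words
ics_subdomains = Counter()    # Q4: pages per ics.uci.edu subdomain

# A real stop-word list must be supplied separately (e.g. the list linked
# in question 3); this small set is only a placeholder.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def record_page(url, words):
    """Update all report counters for one successfully parsed page.
    `words` is the list of tokens extracted from the page text,
    with HTML markup already stripped."""
    global longest_page
    defragged, _ = urldefrag(url)              # Q1: drop the #fragment
    if defragged in unique_pages:
        return
    unique_pages.add(defragged)

    if len(words) > longest_page[1]:           # Q2: longest page by words
        longest_page = (defragged, len(words))

    word_frequencies.update(                   # Q3: skip stop words
        w.lower() for w in words if w.lower() not in STOP_WORDS)

    host = urlparse(defragged).netloc.lower()  # Q4: ics.uci.edu subdomains
    if host == "ics.uci.edu" or host.endswith(".ics.uci.edu"):
        ics_subdomains[host] += 1
```

With counters like these the report answers fall out directly: len(unique_pages) for question 1, longest_page for question 2, word_frequencies.most_common(50) for question 3, and the alphabetically sorted items of ics_subdomains for question 4.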
What to submit: a zip file containing your modified crawler code and the report.

Specifications
To get started, fork or get the crawler code from: https://github.com/Mondego/spacetime-crawler4py
Read the instructions in the README.md file up to, and including, the section “Execution”. This is enough to implement the simple crawler for this project. In short, this is the minimum amount of work that you need to do:
1. Install the dependencies
2. Set the USERAGENT variable in Config.ini so that it contains the student's ID (the ID is 91435458). If you fail to do this properly, your crawler will not appear in the server's log, which will put your grade for this project at risk.
3. (This is the meat of the crawler) Implement the scraper function in scraper.py. The scraper function receives a URL and the corresponding Web response (for example, the first one will be “http://www.ics.uci.edu” and the Web response will contain the page itself). Your task is to parse the Web response, extract enough information from the page (if it is a valid page) to be able to answer the questions for the report, and finally return the list of URLs scraped from that page (a minimal sketch of this function appears after the numbered list below). Some important notes:
1. Make sure to return only URLs that are within the domains and paths mentioned above! (see the is_valid function in scraper.py; you need to change it)
2. Make sure to defragment the URLs, i.e. remove the fragment part.
3. You can use whatever libraries make your life easier to parse things. Optional dependencies you might want to look at: BeautifulSoup, lxml (nudge, nudge, wink, wink!)
4. Optionally, in the scraper function, you can also save the URL and the web page on your local disk.
4. Run the crawler from your laptop/desktop or from an ICS openlab machine. Note that this will take several hours, possibly a day! It may even never end if you are not careful with your implementation! Note that you need to be inside the campus network, or you won’t be able to crawl. If your computer is outside UCI, use the VPN.
5. Monitor what your crawler is doing. If you see it trapped in a Web trap, or malfunctioning in any way, stop it, fix the problem in the code, and restart it. Sometimes, you may need to restart from scratch. In that case, delete the frontier file (frontier.shelve), or move it to a backup location, before restarting the crawler.
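To make step 3 concrete, here is a minimal sketch of what scraper.py could look like. It assumes BeautifulSoup (with lxml, or the built-in html.parser) is installed and that the response object exposes resp.status, resp.raw_response.url, and resp.raw_response.content, as the comments in the starter code describe; check the actual skeleton in the repository before relying on these names.

```python
import re
from urllib.parse import urljoin, urldefrag, urlparse

from bs4 import BeautifulSoup  # optional dependency suggested in the spec

ALLOWED_DOMAINS = (".ics.uci.edu", ".cs.uci.edu",
                   ".informatics.uci.edu", ".stat.uci.edu")

def scraper(url, resp):
    links = extract_next_links(url, resp)
    return [link for link in links if is_valid(link)]

def extract_next_links(url, resp):
    # Only parse pages that actually came back with content.
    if resp.status != 200 or resp.raw_response is None:
        return []
    soup = BeautifulSoup(resp.raw_response.content, "lxml")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(resp.raw_response.url, anchor["href"])
        defragged, _ = urldefrag(absolute)   # spec: discard the fragment
        links.append(defragged)
    return links

def is_valid(url):
    try:
        parsed = urlparse(url)
        if parsed.scheme not in {"http", "https"}:
            return False
        host = parsed.netloc.lower()
        # Allowed domains, plus the single domain/path combination.
        in_domains = any(host == d.lstrip(".") or host.endswith(d)
                         for d in ALLOWED_DOMAINS)
        in_today = (host == "today.uci.edu" and parsed.path.startswith(
            "/department/information_computer_sciences"))
        if not (in_domains or in_today):
            return False
        # Skip obvious non-HTML resources, in the same style as the
        # extension filter shipped with the starter scraper.py.
        return not re.match(
            r".*\.(css|js|bmp|gif|jpe?g|ico|png|tiff?|mid|mp2|mp3|mp4"
            r"|wav|avi|mov|mpeg|ram|m4v|mkv|ogg|ogv|pdf"
            r"|ps|eps|tex|ppt|pptx|doc|docx|xls|xlsx|names"
            r"|data|dat|exe|bz2|tar|msi|bin|7z|psd|dmg|iso"
            r"|epub|dll|cnf|tgz|sha1|thmx|mso|arff|rtf|jar|csv"
            r"|rm|smil|wmv|swf|wma|zip|rar|gz)$", parsed.path.lower())
    except (TypeError, ValueError):
        return False
```

You will almost certainly extend both the extension filter and the domain checks as you discover problematic URLs during the test period, and this is also the natural place to call whatever report-bookkeeping and trap-avoidance helpers you write.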
Crawler Behavior Requirements
In this project, we are looking for text in Web pages so that we can search it later on. The following is a list of what a “correct crawl” entails in this context:
• Honor the politeness delay for each site
• Crawl all pages with high textual information content
• Detect and avoid infinite traps
• Detect and avoid sets of similar pages with no information
• Detect and avoid dead URLs that return a 200 status but no data
• Detect and avoid crawling very large files, especially if they have low information value
For most of these requirements, the only way to detect the problems is to monitor where your crawler is going and then adjust its behavior to stay away from problematic pages. A few illustrative heuristics are sketched below.
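There is no single correct set of rules here, but the heuristics crawlers typically end up with look something like the sketch below. Every threshold in it is an arbitrary starting point, not something the spec prescribes, and should be tuned by watching your crawler's log.

```python
from urllib.parse import urlparse, parse_qs

def looks_like_trap(url):
    """Cheap URL-only checks for likely traps; tune the numbers yourself."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Very long URLs or deeply nested paths are a common trap symptom.
    if len(url) > 200 or len(segments) > 10:
        return True
    # Repeated path segments (e.g. /a/b/a/b/a/b/...) suggest a loop.
    if any(segments.count(s) > 2 for s in set(segments)):
        return True
    # Calendars and endless query-string permutations are classic traps.
    if len(parse_qs(parsed.query)) > 3:
        return True
    return False

def has_enough_text(word_count, content_length):
    """Filter dead 200-status pages and very large, low-density files."""
    if word_count < 50:                 # 200 OK but (almost) no text
        return False
    if content_length > 1_000_000 and word_count < 500:
        return False                    # very large file, little text
    return True
```

A common design is to still count such a page as visited for the report, but to skip extracting its links (or to skip downloading it at all when the URL alone looks suspicious).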
Test Period and Deployment Period
Due to the nature of this project, the time allocated to it is divided into two parts:
Test: until 23 April, 23h59. During this time, your crawler can make all sorts of mistakes — try to crawl outside allowed domains, be impolite, etc. No penalties while you are figuring things out!
Deployment: from April 24 until April 28, 23h59. This is the real crawl. During this time, your crawler is expected to behave correctly. Even if you finish your project earlier, you must operate your crawler during this time period.
Note: The cache server may die for a few hours during these periods due to the load created by impolite crawlers. We will be monitoring the server closely, and it will be back online after (at most) ~8 hours during the Test period, and after (at most) ~2 hours during the Deployment period (unless it happens to die during the night, roughly 1 am to 7 am in Irvine).
Extra credit:
(+1 point) Implement checks and usage of the robots and sitemap files. (A minimal robots.txt sketch appears after the numbered list below.)
(+2 points) Implement exact and near webpage similarity detection using the methods discussed in lecture. Your implementation must be made from scratch; no libraries are allowed.
(+7 points) Make the crawler multithreaded. However, your multithreaded crawler MUST obey the politeness rule: two or more requests to the same domain, possibly from separate threads, must have a delay of 500ms (this is more tricky than it seems!). In order to do this part of the extra credit, you should read the “Architecture” section of the README.md file. Basically, to make a multithreaded crawler you will need to:
1. Reimplement the Frontier so that it’s thread-safe and so that it makes politeness per domain easy to manage

2. Reimplement the Worker thread so that it’s politeness-safe

3. Set the THREADCOUNT variable in Config.ini to whatever number of threads you want

4. If your multithreaded crawler is knocking down the server, you may be penalized, so make sure you keep it polite (and note that it makes no sense to use too large a number of threads, given the politeness rule that you MUST obey). A minimal sketch of one way to enforce the per-domain delay follows this list.
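For the multithreading extra credit, the part that usually needs the most care is the 500 ms per-domain delay. The following is a minimal sketch of one way to enforce it with a lock per domain; the function name and module layout are hypothetical, and in the real project this logic belongs inside your thread-safe Frontier/Worker rework rather than in a standalone helper.

```python
import threading
import time
from urllib.parse import urlparse

POLITENESS_DELAY = 0.5  # seconds, per the 500 ms rule

_bookkeeping_lock = threading.Lock()  # guards the two dicts below
_last_request = {}   # domain -> time.time() of the last request to it
_domain_locks = {}   # domain -> lock serializing requests to that domain

def wait_for_politeness(url):
    """Block the calling worker thread until it is polite to fetch `url`."""
    domain = urlparse(url).netloc.lower()
    with _bookkeeping_lock:
        domain_lock = _domain_locks.setdefault(domain, threading.Lock())
    with domain_lock:
        elapsed = time.time() - _last_request.get(domain, 0.0)
        if elapsed < POLITENESS_DELAY:
            time.sleep(POLITENESS_DELAY - elapsed)
        _last_request[domain] = time.time()
```

Because each worker sleeps while holding only its own domain's lock, threads fetching different domains only contend briefly on the bookkeeping lock, while requests to the same domain are serialized with at least 500 ms between them; this is also why adding many more threads than there are distinct domains buys you nothing.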
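For the robots.txt extra credit, Python's standard library already includes a parser (urllib.robotparser). The sketch below caches one parser per host; note that it fetches robots.txt directly, so depending on how strictly your traffic must go through the cache server you may need to adapt the fetching part. The function name and the permissive fallback are assumptions, not requirements from the spec.

```python
from urllib import robotparser
from urllib.parse import urlparse

_robot_parsers = {}  # host -> RobotFileParser (or None if fetching failed)

def allowed_by_robots(url, user_agent):
    """Return True if robots.txt for the URL's host permits fetching it."""
    host = urlparse(url).netloc
    if host not in _robot_parsers:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{urlparse(url).scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None  # robots.txt unreachable; this sketch defaults to permissive
        _robot_parsers[host] = rp
    rp = _robot_parsers[host]
    return rp is None or rp.can_fetch(user_agent, url)
```

On Python 3.8+, RobotFileParser.site_maps() also returns any Sitemap URLs listed in robots.txt, which you can feed back into your frontier as extra seeds.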

Grading criteria
1. Are the analytics in your report within the expected range? (10%)
2. Did your crawler operate correctly? — we’ll check our logs (50%)
1. Does it exist in Prof. Lopes’ Web cache server logs? (if it’s not in the ICS logs, it didn’t happen: you will get 0)
2. Was it polite? (penalties for impolite crawlers)
3. Did you crawl ALL domains and paths mentioned in the spec? (penalties for missing domains and paths)
4. Did it crawl ONLY the domains and paths mentioned in the spec? (penalties for attempts to crawl outside)
5. Did it avoid traps? (penalties for falling in traps)
6. Did it avoid sets of pages with low information value? (penalties for crawling useless families of pages; you must decide and discuss within your group a reasonable definition of a low-information-value page)
3. Are you able to answer the questions about your code and the operation of your crawler? (40%)
Technical Details
In order not to disrupt the ICS network, your crawler uses a Web cache that is specifically designed for this project and that runs on one of Prof. Lopes’ servers. The following picture illustrates the architecture of this project:

If you use the crawler code properly, the cache is largely invisible to you, but you should be aware that it is there. At certain points you may receive errors that are sent specifically by the cache server to your crawler when it is doing something it shouldn’t; they are in the 600 range of status codes. DO NOT ATTEMPT TO BYPASS THE CACHE! Doing so may seriously disrupt the ICS network, which would pose additional problems during this already complex quarter.