COMP30023 Project 1 – Web crawler
Worth 15% of the final mark. Due: 23:59:59 Friday 3 April, 2020.
1 Background
How do web search engines know what is on the web? They have to look for it the slow way, by following every link on the web. They use sophisticated algorithms to search efficiently. For example, they don’t follow each link equally often; content that changes often is followed more often.
In this project you will write simple code to crawl a web site. This will teach you about socket programming, which is fundamental to writing all internet applications, and also about the HTTP application layer protocol. The crawler must be written in C and cannot use any existing HTTP libraries.
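For example, a minimal sketch of the connection step in C (not part of the specification; error handling abbreviated, and the host name is whatever host the crawled URL names):

#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

/* Connect to the given host on port 80; returns a socket fd, or -1 on failure. */
int connect_http(const char *host) {
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* IPv4 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(host, "80", &hints, &res) != 0)
        return -1;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}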
2 Crawling
The program will be given a URL on the command line, pointing to an HTML page. The program will be called crawler and it might, for example, be invoked as:
crawler http://web1.comp30023
The program will first fetch the indicated HTML page. The program will then recursively follow all hypertext links (href="..." attributes in anchor tags) and fetch the indicated pages if all but the first components of the host match the host of the URL on the command line.
For example, if the original host is web1.comp30023 then a link to web2.comp30023 should be followed, but a link to comp30023 or 128.250.106.72 or unimelb.edu.au should not.
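One way to read this rule, sketched in C (an interpretation, not the only valid one): compare everything after the first "." in each host name, and treat a host with no "." as never matching. Host names are compared case-insensitively.

#include <string.h>
#include <strings.h>

/* Returns 1 if hosts a and b match in all but their first component. */
int hosts_match(const char *a, const char *b) {
    const char *ra = strchr(a, '.');
    const char *rb = strchr(b, '.');
    if (ra == NULL || rb == NULL)   /* e.g. a bare "comp30023" never matches */
        return 0;
    return strcasecmp(ra + 1, rb + 1) == 0;
}

With this reading, hosts_match("web1.comp30023", "web2.comp30023") is 1, while both "comp30023" and "128.250.106.72" fail against "web1.comp30023".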
For the purpose of this project, it can be assumed that pages which need to be crawled will always have the MIME-Type of text/html. When the MIME-Type header is missing, the crawler is not required to guess the media type of the entity-body and does not have to parse the page.
Note that filename extensions are not the same as file types. Some files may end in .html or .htm but the header indicates that the MIME-type is text/plain. On the other hand, a file may have an extension .txt, or no extension, but have a MIME-type of text/html. The MIME-type defines the true type of the file.
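A minimal check of the Content-Type value, assuming the header value has already been extracted (parameters such as charset may follow the media type):

#include <strings.h>

/* Returns 1 if a Content-Type header value names text/html,
 * e.g. "text/html" or "text/html; charset=UTF-8". */
int is_html(const char *value) {
    if (strncasecmp(value, "text/html", 9) != 0)
        return 0;
    char c = value[9];
    return c == '\0' || c == ';' || c == ' ' || c == '\r';
}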
3 Project specification
The program must perform several tasks, and can expect certain things of the testing setup. These are described as follows, grouped by task.
3.1 Crawling
The code should crawl as described in the previous section.
No page should be fetched more than once. Two pages are considered to be "the same page" if the URLs indicate that they are the same. That is different from "having the same URL", because multiple URLs can point to the same page. For example, in a document at http://www.comp30023.example/a.html, the URLs http://www.comp30023.example/b.html and /b.html refer to the same page. Relative URLs also exist, like b.html. However, pages one/b.html and two/b.html are different pages, despite ending in the same file name. The equivalence of pages is governed by the specification of URLs, given at the end of this document.
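As an illustration only, a simplified resolver covering just the three cases above (the authoritative rules are in the URL specification referenced at the end of this document):

#include <stdio.h>
#include <string.h>

/* Resolve 'link' against the current page's host and directory.
 * base_dir is the directory of the current page, e.g. "/" for a.html. */
void resolve(const char *host, const char *base_dir,
             const char *link, char *out, size_t n) {
    if (strncmp(link, "http://", 7) == 0)
        snprintf(out, n, "%s", link);                           /* absolute */
    else if (link[0] == '/')
        snprintf(out, n, "http://%s%s", host, link);            /* host-relative */
    else
        snprintf(out, n, "http://%s%s%s", host, base_dir, link); /* relative */
}

Here resolve("www.comp30023.example", "/", "b.html", buf, sizeof buf) and the same call with "/b.html" both yield http://www.comp30023.example/b.html.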
The program should print a log of the URLs it attempts to fetch, whether successfully or unsuccessfully. The log should be sent to stdout. Print one URL per line, with no other characters.
The program should send no other output to stdout; other output may be sent to stderr, which will be ignored for assessment.
The order of fetching pages does not matter.
You do not need to fetch more than 100 distinct pages (but there may be more than 100 URLs). (Hint: How can you use this knowledge to simplify your code?)
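For instance, the 100-page bound means a small fixed-size table of canonicalized URLs is enough to detect repeats; a sketch (the 1000-byte URL limit is from Section 3.4):

#include <string.h>

#define MAX_PAGES 100
#define MAX_URL   1001

static char seen[MAX_PAGES][MAX_URL];
static int  nseen = 0;

/* Returns 1 the first time a canonical URL is recorded, 0 otherwise. */
int mark_seen(const char *url) {
    for (int i = 0; i < nseen; i++)
        if (strcmp(seen[i], url) == 0)
            return 0;
    if (nseen == MAX_PAGES)        /* at the fetch limit; stop adding */
        return 0;
    strncpy(seen[nseen++], url, MAX_URL - 1);
    return 1;
}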
3.2 HTTP
The servers hosting the URLs you are to crawl will conform to HTTP 1.1 and all content will be served over port 80.
You must provide a User-Agent header consisting of your username (only), such as
User-Agent: joeb
This will be used to observe which pages you fetch. Without this, you risk getting no marks.
Note that all requests must also contain the Host header (see https://tools.ietf.org/html/rfc2616#section-14.23).
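Putting these together, a request for http://web1.comp30023/index.html would look like the following, with joeb standing in for your username, every line terminated by \r\n, and a blank line ending the headers:

GET /index.html HTTP/1.1
Host: web1.comp30023
User-Agent: joeb
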
The server may fail to serve a page that you request, and either indicate that with an error code or simply send a file shorter than the length in the header. You should keep reading data until the connection is closed by the server or Content-Length bytes have been read. For transient (non-permanent) errors, you can re-request such pages without penalty, but note that the transient error condition may last for quite a while.
No server response will be longer than 100,000 bytes.
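A sketch of the read loop this implies, using the 100,000-byte bound as a fixed buffer size (checking Content-Length against what actually arrived is left to the caller):

#include <sys/types.h>
#include <sys/socket.h>

#define MAX_RESPONSE 100000

/* Read until the server closes the connection or the buffer is full;
 * returns the number of bytes received. */
ssize_t read_response(int fd, char *buf) {
    ssize_t total = 0, n;
    while (total < MAX_RESPONSE &&
           (n = recv(fd, buf + total, MAX_RESPONSE - total, 0)) > 0)
        total += n;
    return total;
}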
3.3 Parsing HTML
The HTML files will be valid HTML (unlike nearly all real web pages; web browsers silently fix many mistakes).
The file will not contain the characters "<" or ">" except at the start and end of tags. No anchor tags will be split across lines.
There will be no <base> tags or canonical link elements. Both inline and external JavaScript should be ignored.
Submissions may (optionally) use the pcre library for pattern manipulation, or the standard functions declared in regex.h. Please read the submission instructions carefully if you plan to use pcre.
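For example, a sketch using regex.h to pull href values out of a line; the pattern is deliberately simple and assumes double-quoted attribute values:

#include <regex.h>
#include <stdio.h>

/* Print every double-quoted href value found in 'line'. */
void print_hrefs(const char *line) {
    regex_t re;
    regmatch_t m[2];
    regcomp(&re, "href[ ]*=[ ]*\"([^\"]*)\"", REG_EXTENDED | REG_ICASE);
    while (regexec(&re, line, 2, m, 0) == 0) {
        printf("%.*s\n", (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so);
        line += m[0].rm_eo;   /* continue searching after this match */
    }
    regfree(&re);
}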
3.4 Parsing URLs
There will be no URLs using protocols other than http. In particular, there will be no https.
Note that URLs are partially case sensitive; see the references at the end of this document. HTML tags and field names are case-insensitive.
You can ignore URLs containing "." and ".." path segments (nominally "same directory" and "parent directory"), character codes of the form %XY, or the characters # and ?. URLs of these forms will exist in the data, and you may parse them if you wish; a sketch of a filter for them follows the examples below.
You do not need to parse URLs longer than 1000 bytes. Here are some examples of URLs which will not appear:
• http://web1.comp30023:8080 (not port 80)
• https://web1.comp30023 (protocol not http)
• mailto:no-reply@unimelb.edu.au (protocol not http)
Here are some examples of URLs which may be ignored:
• ./a1 and bar/../
• http://web1.comp30023/assignments/./
• http://web1.comp30023/assignments/a2/../a1
• http://web1.comp30023/search?q=comp30023 (contains ?)
• http://web1.comp30023/assignments#a1 (contains #)
• http://web1.comp30023/%20 (contains URL encoded character)
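A sketch of a filter implementing the permitted ignores above (it treats any "%" as the start of an encoded character):

#include <string.h>

/* Returns 1 if the URL may be skipped under Section 3.4. */
int may_ignore(const char *url) {
    size_t n = strlen(url);
    if (strpbrk(url, "?#%") != NULL)                  /* ?, # or %XY */
        return 1;
    if (strncmp(url, "./", 2) == 0 || strncmp(url, "../", 3) == 0)
        return 1;                                     /* leading . or .. */
    if (strstr(url, "/./") != NULL || strstr(url, "/../") != NULL)
        return 1;                                     /* inner . or .. */
    if ((n >= 2 && strcmp(url + n - 2, "/.") == 0) ||
        (n >= 3 && strcmp(url + n - 3, "/..") == 0))
        return 1;                                     /* trailing . or .. */
    return 0;
}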
3.5 Extensions
There are several versions of this project, each more challenging than the previous and each with a higher possible mark.
The more challenging extensions are intended as options. If you are finding the project too big, stick to a simple version.
3.5.1 Minimum requirement
Only considers
• anchor tags that are the first tag on the line;
• anchor tags of the form <a href="...">;
• whether the status code is 200 (success) or not 200 (treated as a permanent failure);
• pages whose length header is correct (no truncated pages).
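For the status check, a sketch that reads the code straight off the status line:

#include <stdio.h>

/* Returns the status code from a response beginning "HTTP/1.1 200 OK",
 * or -1 if the status line cannot be parsed. */
int status_code(const char *response) {
    int code;
    if (sscanf(response, "HTTP/%*d.%*d %d", &code) == 1)
        return code;
    return -1;
}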
3.5.2 Basic Extension
Correctly omits (or re-fetches) truncated pages.
Only parses pages that have MIME-type text/html (irrespective of the extension).
Considers all anchor tags on a line. Considers anchor tags with other fields between the <a and the href attribute.