COMP3310/6331 2020 – Assignment 2 – Crawling the Web
Intro:
• This assignment is worth 15% of the final mark
• It is due by 23:59 Sunday 17 May AEST – note: CANBERRA TIME
• Late submissions will not be accepted, except in special circumstances
o Extensions must be requested well before the due date, via the course convenor, with appropriate evidence.
Assignment 2
This is a coding assignment, to enhance and check your network programming skills. The main focus is on native socket programming, and your ability to understand and implement the key elements of an application protocol from its RFC specification.
A web-crawler, or ‘spider’ is a tool that scans/indexes/searches a website by opening some “first” page, parsing it for further links, and then recursively working its way down the set of links it finds to download some (usually bounded) number of pages. Most search engines and archives run their own spiders.
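To illustrate that loop, below is a minimal sketch in Java (the assignment allows C or Java) of a queue-plus-visited-set crawl. The fetch and extractLinks methods are placeholders only, so the sketch compiles; the real work of issuing requests and parsing pages is yours to do, and is sketched separately further on. The starting URL is the assessment server described later in this document.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrawlLoop {
    public static void main(String[] args) {
        ArrayDeque<String> queue = new ArrayDeque<>();    // pages still to visit
        Set<String> seen = new HashSet<>();               // every distinct URL found so far

        String start = "http://comp3310.ddns.net:7880/";  // the "first" page
        queue.add(start);
        seen.add(start);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = fetch(url);                     // placeholder: must be a raw-socket HTTP/1.0 request
            for (String link : extractLinks(html)) {      // placeholder: must parse the page for links
                if (seen.add(link)) {                     // add() returns false for URLs already seen
                    queue.add(link);
                }
            }
        }
        System.out.println("Distinct URLs found: " + seen.size());
    }

    // Placeholders only, so the sketch compiles; the assignment requires you to write both yourself.
    static String fetch(String url) { return ""; }
    static List<String> extractLinks(String html) { return List.of(); }
}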
For this assignment, you need to write a web crawler in C or Java, without the use of any external (especially web/http/html-related) libraries. Your code should compile standalone on the CSIT Lab machines. It MUST open sockets in the standard way, as per the tutorial exercises, make appropriate and correctly-formed HTTP/1.0 (RFC1945) requests to a webserver on its own, and capture/interpret the results on its own.
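As a starting point, the sketch below shows one way to issue a correctly-formed HTTP/1.0 GET over a plain socket in Java, using only the standard library. The host, port and path are those of the assessment server named later; treat the response handling as a sketch only, since readLine() is adequate for the status line, headers and HTML, but a real crawler should read non-HTML bodies (e.g. images) as raw bytes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HttpGet {
    public static void main(String[] args) throws Exception {
        String host = "comp3310.ddns.net";   // assessment server, as given below
        int port = 7880;
        String path = "/";

        try (Socket sock = new Socket(host, port)) {
            // RFC 1945 request: request line, headers, then a blank line, all CRLF-terminated.
            String request = "GET " + path + " HTTP/1.0\r\n"
                           + "Host: " + host + ":" + port + "\r\n"
                           + "\r\n";
            OutputStream out = sock.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // The server sends the status line, headers, a blank line and the body, then closes.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}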
Your code must report the following items against a specified site:
Item | What to report                                                                                                           | What to report against this item
-----+--------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------
1    | The total number of distinct URLs found on the site (including any errors and redirects)                                | A number
2    | The number of html pages and the number of non-html objects on the site (e.g. images)                                   | Two numbers
3    | The smallest and largest html pages, and their sizes                                                                    | Two URLs and their page sizes (in bytes)
4    | The oldest and the most-recently modified page, and their date/timestamps                                               | Two URLs and their page timestamps
5    | A list of invalid URLs (not) found (404)                                                                                | A number
6    | A list of on-site redirected URLs found (30x) and where they redirect to                                                | A table of URL->URL
7    | A list of off-site URLs found (either 30x redirects or html references), and whether those sites are valid webservers   | A table of URL->URL with a valid/invalid flag in each case
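Most of these items come straight out of the HTTP response: the status line gives the 404s and 30x redirects (with the Location header saying where a redirect goes), Content-Type distinguishes html pages from other objects, Content-Length gives page sizes, and Last-Modified gives the timestamps. The sketch below shows one plausible way to pull those fields out of a captured response; the sample response string is invented purely to exercise the parsing.

import java.util.HashMap;
import java.util.Map;

public class ResponseInfo {
    public static void main(String[] args) {
        // A tiny hand-written response, used only to exercise the parsing below.
        String response = "HTTP/1.0 200 OK\r\n"
                        + "Content-Type: text/html\r\n"
                        + "Content-Length: 123\r\n"
                        + "Last-Modified: Tue, 01 Oct 2019 01:23:45 GMT\r\n"
                        + "\r\n"
                        + "<html>...</html>";

        String[] parts = response.split("\r\n\r\n", 2);                  // headers, then body
        String[] headerLines = parts[0].split("\r\n");
        int status = Integer.parseInt(headerLines[0].split(" ")[1]);     // "HTTP/1.0 200 OK" -> 200

        Map<String, String> headers = new HashMap<>();
        for (int i = 1; i < headerLines.length; i++) {
            int colon = headerLines[i].indexOf(':');
            if (colon > 0) {
                headers.put(headerLines[i].substring(0, colon).trim().toLowerCase(),
                            headerLines[i].substring(colon + 1).trim());
            }
        }

        boolean isHtml = headers.getOrDefault("content-type", "").startsWith("text/html");  // item 2
        String size = headers.get("content-length");                                        // item 3
        String modified = headers.get("last-modified");                                     // item 4
        String redirectTo = headers.get("location");                                        // items 6 and 7, for 30x responses

        System.out.println(status + " html=" + isHtml + " size=" + size
                           + " modified=" + modified + " location=" + redirectTo);
    }
}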
Your code will be tested against a friendly assessment server at http://comp3310.ddns.net:7880 which is running now with a set of sample pages. There will be fewer than 100 URLs on the final assessment site. The HTML on the assessment server is guaranteed to be minimalist, i.e. no JS, no CSS. Links between resources will use the standard html <a href="url">label</a> form.
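Given that guarantee, link extraction can be as simple as scanning each html body for href attributes. One possible approach is sketched below; it is written for the minimalist pages described above, not for real-world HTML.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag; enough for minimalist HTML.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));    // the URL inside the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/page2.html\">next page</a></body></html>";
        System.out.println(extractLinks(html));    // prints [/page2.html]
    }
}

Remember that links on a page may be relative, so they need to be resolved against the page's own URL before being queued or counted as distinct.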
Your code also needs to behave ‘nicely’ – it must not make more than 1 request per 2 seconds. The server may generate a 503 error if your code exceeds that rate, resulting in lost marks. We’ll also be checking the server logs and timing your code.
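One simple way to stay under that limit is to record when the last request was sent and sleep before issuing the next one, along these lines (the paths here are placeholders, and the println stands in for the real request):

import java.util.List;

public class Pacer {
    private long lastRequest = 0;

    // Block until at least 2 seconds have passed since the previous request.
    void pace() throws InterruptedException {
        long wait = (lastRequest + 2000) - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastRequest = System.currentTimeMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        Pacer p = new Pacer();
        for (String path : List.of("/", "/a.html", "/b.html")) {   // placeholder paths
            p.pace();
            System.out.println("would fetch " + path);             // the real HTTP request goes here
        }
    }
}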
It is your choice how you want to crawl the site and undertake the above analysis. Remember to be conservative in what you send and reasonably liberal in what you accept. Check your packets with Wireshark, especially if your code is not working, and compare them to a working web transaction made with, e.g., your browser. Beware of poorly-formed responses, and do not spider other hosts despite any offers by the site to do so.
You need to submit your code, together with a Makefile (or Java equivalent) that has 'make' and 'make run' targets (or other brief instructions) to compile and execute your code against the above assessment server. Your submission must be a zip file, packaging everything as needed, and submitted through the appropriate link on Wattle.
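For a Java submission, a Makefile along the following lines would satisfy the two targets. The file name Crawler.java and the command-line argument are only examples; adjust them to match your own source, and note that recipe lines must begin with a hard tab.

# Example Makefile for a Java crawler (file and class names are illustrative only).
JAVAC = javac
JAVA  = java

all:
	$(JAVAC) Crawler.java

run: all
	$(JAVA) Crawler http://comp3310.ddns.net:7880/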
There are many existing web-crawling tools and libraries out there, many of them with source available. While studying them may be educational, the assessors know they exist and will be checking your code against them, and against other submissions from this class.
Your code will be assessed on
• correctness (the HTTP queries it sends, and what the summary reports),
• performance (with the pacing constraint above),
• code correctness, clarity, and style, and
• documentation (comments).
You should be able to test your code against any website you like, although a lot of sites have complex html/js pages that can make parsing harder. If you do, be careful that you do not crawl an entire site, and don't query it too often – some sites will block you!