COMPSCI4021 Coursework: Professor Web Scraper
Jeremy Singer
Handout: 16 Nov 2020
Current version ace0de9, dated 2020-11-16 12:25:35 +0000
1 Introduction
This exercise will give you some experience in developing and testing a complete, small-scale application in the Haskell language, while making use of standard build tools and third party Haskell libraries. You will begin by producing a library that scrapes a Glasgow University staff listings web page and counts the number of Professors listed on that page. You will go on to identify the number of Professors for several different Schools in the University, and you will display this information in a programmatically generated visual form. Finally you will write a short report to summarize your experience with functional programming during this development exercise.
1.1 Intended Learning Outcomes
The key learning outcomes we cover in this coursework exercise are:
• Demonstrate understanding of how to structure programs using monads, how to use the most common standard monads (including IO, Maybe, and State), and how to use a monad transformer.
• Develop substantial software applications including GUIs and system interaction.
However, developing your own Haskell code should consolidate all the other learning outcomes from earlier in the course. The best way to learn a programming language is to read and write code in that language.
2 Tasks
There are three distinct tasks you should complete for this assessed exercise.
2.1 Task 1: Professor Scraping Library
You should use the Scalpel library (introduced in lecture 12) to build a module named ProfScrapeLib with an exported function:
numProfessors :: String -> IO (Maybe Int)
where the String parameter is the name of a School in the University of Glasgow, e.g. “computing” or “chemistry”, and the wrapped integer value is the number of distinct Professors listed on the corresponding staff page for that School. The staff page for School s is constructed via the URL https://www.gla.ac.uk/schools/ concatenated with s, and then the string /staff/ is appended, e.g. https://www.gla.ac.uk/schools/computing/staff or https://www.gla.ac.uk/schools/chemistry/staff.
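As a minimal sketch of the URL construction just described, the helper below concatenates the prefix, the School name and the /staff/ suffix. The names schoolsBase and staffUrl are hypothetical internal helpers, not part of the required interface. The sketch follows the stated rule and keeps the trailing slash, whereas the example URLs above omit it, so adjust the suffix if the page requires it.

```haskell
-- Base address shared by every School page (taken from the handout text).
schoolsBase :: String
schoolsBase = "https://www.gla.ac.uk/schools/"

-- Build the staff page URL for a School name such as "computing".
-- staffUrl is a hypothetical internal helper, not a required name.
staffUrl :: String -> String
staffUrl school = schoolsBase ++ school ++ "/staff/"
```

A helper like this would stay internal to ProfScrapeLib, and its result could be passed to whichever Scalpel fetching function you choose.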
The function numProfessors will return Nothing (within the IO monad) when the URL for the specified School staff page cannot be fetched. When an HTML page can be fetched, the function parses the text in the body of the page and counts the number of distinct Professors listed for that School. Note this is not as simple as counting the frequency of the word ‘Professor’ since sometimes the word is mentioned twice for one person, e.g.
Calder, Professor Muffy (Professor of Formal Methods)
Also, you should not count individuals with the job title ‘Assistant Professor’ in your count of Professors, e.g.
Seow, Dr Chee Kiat (Assistant Professor)
Also, you should not count anyone in the Professor list who is only in the ‘Honorary & Visiting’ section of the staff listing, since these are not formal staff of the University.
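The counting rules above can be sketched as a pure function over staff entries. This is only an illustration under stated assumptions: each person has already been extracted from the page as one String, and entries from the ‘Honorary & Visiting’ section have been dropped before the call. The names isProfessor and countProfessors are hypothetical helpers, and the substring test is a simple heuristic rather than a definitive rule.

```haskell
import Data.List (isInfixOf, nub)

-- True when the entry mentions "Professor" and does not mention
-- "Assistant Professor" anywhere. (Heuristic; hypothetical helper.)
isProfessor :: String -> Bool
isProfessor entry =
  "Professor" `isInfixOf` entry
    && not ("Assistant Professor" `isInfixOf` entry)

-- Count distinct Professor entries. Each entry is counted once, however
-- often the word "Professor" appears in it, and duplicate entries are
-- removed with nub. Assumes Honorary & Visiting entries are already excluded.
countProfessors :: [String] -> Int
countProfessors = length . nub . filter isProfessor
```

Keeping this logic pure means it can be unit tested without any network access. The IO part of numProfessors then only has to fetch the page and extract the entries.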
You may develop internal helper functions in your ProfScrapeLib module. However only the numProfessors function should be visible externally.
You may use types, typeclasses and functions from Prelude, Scalpel, Text.String, Data.List, etc — reasonable use of appropriate base utilities and other Hackage libraries is encouraged. You should document these dependencies in your YAML config files.
This task is worth 40% of the overall mark.
2.2 Task 2: Information Representation Program
Write an application (with a top-level main) that uses your ProfScrapeLib module to visualize the number of listed Professors in the staff pages corresponding to the following seven Schools:
1. chemistry
2. computing
3. engineering
4. ges
5. mathematicsstatistics
6. physics
7. psychology
You should access the relevant page with the URL prefix and postfix as outlined in the previous section.
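One possible way to gather the seven counts is sketched below. The names schools and collectCounts are hypothetical. The lookup action is passed in as a parameter so that the sketch compiles without ProfScrapeLib; in your program it would be numProfessors.

```haskell
-- The seven School names from the list above, as used in the staff page URLs.
schools :: [String]
schools =
  [ "chemistry", "computing", "engineering", "ges"
  , "mathematicsstatistics", "physics", "psychology" ]

-- Run a lookup action for every School, pairing each name with its result.
-- A Nothing result (page could not be fetched) is kept rather than dropped,
-- so the caller can decide how to display it.
collectCounts :: (String -> IO (Maybe Int)) -> IO [(String, Maybe Int)]
collectCounts lookupCount =
  mapM (\s -> fmap (\n -> (s, n)) (lookupCount s)) schools
```

In a main function this might be used as results <- collectCounts numProfessors, once numProfessors is imported from your library.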
Your visualization should be an output file from the program — it could be a TXT file, a PDF file, or an image file like a JPG or PNG. You can use any libraries on stackage to help generate your visualization. Example visualizations and their corresponding libraries might include:
• A PDF containing a table generated with the pandoc or HaTeX libraries
• A neatly typeset PDF containing a well-rendered tag cloud or similar, using the Pango library
• A static HTML webpage containing the data, generated with the hakyll library
• A CSV file output containing the data, generated with a CSV processing library like cassava
• A graph (of some kind) generated with haskell-chart
• something else . . .
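As a deliberately plain illustration of the TXT option, the sketch below renders the counts as a text bar chart using only Prelude functions. The names renderRow and renderChart are hypothetical, and it assumes the results are held as (School name, Maybe Int) pairs. Treat it as a baseline to improve on rather than a recommended design.

```haskell
-- Render one line per School: the name, a bar of '#' characters and the count.
-- A School whose page could not be fetched (Nothing) is marked "unavailable".
renderRow :: (String, Maybe Int) -> String
renderRow (school, Just n)  = school ++ " " ++ replicate n '#' ++ " " ++ show n
renderRow (school, Nothing) = school ++ " (unavailable)"

-- Join the rows into the full text of the output file, one row per line.
renderChart :: [(String, Maybe Int)] -> String
renderChart = unlines . map renderRow
```

The resulting String could be written out with writeFile, for example to a file called professors.txt (the file name is an assumption).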
There is real scope for creativity and innovation here. However, you should be aware of some constraints. First, I want to be able to build and execute your code via stack on a networked Virtual Box Linux instance, so you will need to have appropriately configured project files (i.e. project.yaml and stack.yaml).
Second, this part of the coursework is (only) worth 20% of the overall mark.
2.3 Task 3: Reflective Report
Write a short (500 word max) report, as a plaintext file called report.txt. In this document, you should summarize your experiences of developing this project code in Haskell. How is it similar to more mainstream languages like Python or Java? How is it different? Are there any Haskell-specific features you really like? Why? Are there any features you intensely dislike? Why?
This task is worth 40% of the overall mark.
3 How to Proceed
Clone (or fork) the github repository for this project from https://github.com/jeremysinger/prof-scraper. This will give you an initial working project setup, configured for the Stack tool. You will need to do a stack build to get things downloaded etc. Note I have included some skeleton source code and unit tests for you already. Run the unit tests with a stack test. You should try to keep the same project structure as much as possible, since it will make things easier for me to mark. In particular, I have some ‘hidden’ unit tests that rely on the name of the module and its exported function remaining the same.
Since the majority of the marks are available for tasks 1 and 3, you are recommended to focus on these tasks. Task 2 (the open-ended coding exercise) should be considered an ‘optional extra’ if you are enjoying Haskell.
Overall, this coursework is worth 20% of the final mark for COMPSCI 4021 (Functional Programming). As such, we expect you will spend no more than 20 hours working on this exercise. The exam makes up the remaining 60% of the final mark, since you have already completed 20% of the assessments for this course.
4 What to Hand in
Please submit your coursework via moodle, on the Scraping Coursework Assignment submission slot on the FP(H) moodle page. Please submit a single zip file named according to your matriculation number, e.g. 2123456x.zip – when I unzip this file I want to see a top-level directory called prof-scraper that is a stack project directory.
The stack project should be cleaned and purged – i.e. I don’t want any hidden directories containing massive binaries. I want you to submit Haskell source files (for tasks 1 and 2), YAML config files, an output file from your task 2, and a report.txt file from your task 3. You should also submit a status.txt file explaining how far you got on each sub-task, what you did for task 2, whether and where the sample output file is included in the zip, and whether there are any issues I should be aware of during marking. There are skeleton .txt files in the git repo you will clone/fork when you start the project — edit these and leave them in the same locations. The diagram below gives the expected directory hierarchy layout inside the zip file.
prof-scraper
    project.yaml
    stack.yaml
    prof-scraper.cabal
    report.txt
    status.txt
    stopwords.txt
    Setup.hs
    README.md
    app
        Main.hs
        (other haskell files from Task 2)
    src
        ProfScrapeLib.hs
        (other haskell files from Task 1)
    test
        Spec.hs
        (other unit test files, optional)
Please adhere to these submission guidelines. Include your name and matriculation number in the status.txt file and in the project.yaml file. Email me if anything is unclear. Rapid and happy marking will take place if it’s straightforward for me to assess your work. Slow and grumpy marking will take place otherwise!
5 Marking Guidelines
5.1 Task 1: Professor Scraping Library
This task is worth 40% of the overall mark for this coursework.
A: Code compiles with no errors, minimal warnings. Hidden unit tests pass ok. Idiomatic Haskell code. Highly efficient implementation.
B: Code compiles with no errors. Substantial majority of hidden unit tests pass. Fairly idiomatic Haskell code. Efficient implementation.
C: Code may compile, given minimal intervention. Majority of hidden unit tests pass. Some idiomatic Haskell code. Somewhat efficient implementation.
D: Code may compile, given some intervention. Some hidden unit tests pass. Potentially idiomatic Haskell code. Potential for some inefficiencies.
E–F: Code may not compile. Few hidden unit tests pass. Little or no idiomatic Haskell code. Code is inefficient or incorrect.
G–H: Code does not compile. No hidden unit tests pass. No idiomatic Haskell code. No attention given to efficiency of code execution.
5.2 Task 2: Information Representation Program
This task is worth 20% of the overall mark for this coursework. It should be viewed as an ‘optional extra’.
A: Novel library. Striking output. Clean, compilable code. Idiomatic Haskell. Intelligent use of library and utility functions. Efficient code.
B: Appropriate library. Sensible output. Fairly clean, compilable code. Moderately idiomatic Haskell. Some appropriate use of library and utility functions. Moderately efficient code.
C: Simple but relevant library. Some meaningful output. Code compiles with minimal intervention. Attempt to write in functional style. Minimal use of library and utility functions. Perhaps inefficient code.
D: Simple but somewhat relevant library. Attempt to produce meaningful output. Code might compile, with some intervention. Attempt to write in functional style. Possible use of library and utility functions. Inefficient code.
E–F: Trivial library. Simple output. Code submitted may not compile. Non-idiomatic Haskell.
G–H: No meaningful engagement with third-party Haskell modules. Small codebase. No serious attempt to generate useful output.
5.3 Task 3: Reflective Report
This task is worth 40% of the overall mark for this coursework.
A: Highly readable. Excellent contrasts between Haskell and other languages/toolchains. Intelligent assessment of strengths and weaknesses of different language paradigms.
B: Readable. Very good contrasts between Haskell and other languages/toolchains. Pragmatic assessment of strengths and weaknesses of different language paradigms.
C: Moderately readable. Good contrasts between Haskell and other languages/toolchains. Some assessment of strengths and weaknesses of different language paradigms.
D: Somewhat disjointed. Some contrasts between Haskell and other languages/toolchains. Attempted assessment of strengths and weaknesses of different language paradigms.
E–F: Disjointed. Minimal contrasts between Haskell and other languages/toolchains. Minimal assessment of strengths and weaknesses of different language paradigms.
G–H: Largely unreadable. Little contrast between Haskell and other languages/toolchains. No meaningful assessment of strengths and weaknesses of different language paradigms.
6 Frequently Asked Questions
Q: When is the submission deadline? Check on the FP moodle page in case of changes. At the time of writing, the deadline is set at 4:30pm on Friday 4th December 2020.
Q: Can I implement my own unit tests? Yes, this might be sensible although it is not mandatory. However be aware that I will assess the correctness of your code with my unit tests which I haven’t released to you.
Q: Can I have multiple attempts at Task 2, with different libraries to generate different outputs?
You could have several attempts at Task 2, but this might not be an efficient use of your time. I will only mark one of your attempts, so you must indicate what you want me to mark . . . make this clear in your status.txt file.
Q: My code doesn’t compile. Should I still submit what I have done? Yes. Partial credit is definitely awarded for code that does not compile or run. It’s always better to submit something rather than nothing. Mention your difficulties in your status.txt file when you submit.
Q: Does it matter if my report.txt is longer than 500 words? I don’t promise to read it all, if it is too long. So you need to either (a) front-load all the interesting comments into the first 500 words, (b) trim it to 500 words, or (c) hope that I am hooked by your fascinating findings and will read on past the first 500 words.
Q: I really like my solution to the coursework. Can I post it on github? Please refer to the School policy on uploading your coursework to github. Inah or Gethin can give you more info on this.
Q: I am very confused about Haskell. What should I do? Contact the course lecturer and tutor via the Moodle coursework help forum for advice or pointers about how to get started. Don’t worry, everyone feels like this at some stage!
Q: Can I use a non-Haskell library or service to generate my beautiful visualization for Task 2?
Well that is possible, but remember that marks are only awarded for Haskell code. So I would have to assess the elegance of your Haskell foreign function calls or proxy stub code, which might not be enough to get you an excellent grade.
Q: Can we work in groups to develop our code? No, code should be developed individually. It’s fine to chat to other people about your ideas, but please don’t share code. The School of Computing Science has strict policies on plagiarism. This restriction also applies to reusing verbatim code you copy from online sources.