Assignment 2.0 (Due 23:59 CDT on Sep 27, 2021)
Assignment 2.0 – Building a Digital Library
Table of Contents
Overview
For the following two weeks, you will be building a Digital Library. This week the goal is to gather data and store it properly. There will therefore be three major parts to this assignment: (1) scraping the data, (2) cleaning and processing the data, and (3) storing the data.
For this week, we will primarily be using Python 3 (3.5+). If you want to use an alternative language, please send an email to the TAs for approval.
You don't need to do peer review for this assignment, so please merge the branch to master while keeping the source branch.
Motivation and Goals
There are many methods of data collection in the rapidly evolving world of information and technology, but web scraping is among the most popular and accurate. In layman's terms, web scraping is the act of using bots to extract specific content and data from a website. Web scraping is especially useful because it can convert non-tabular, poorly structured data into something usable, both in format and in content. Web scraping is also championed for its ability to acquire previously inaccessible data.
However, web scraping is not about mere acquisition: it can also help you track changes, analyze trends, and keep tabs on certain patterns in specific fields.
The purpose of this particular assignment is to introduce you to the real-world application of web scraping, as well as to get you thinking about the creative process that accompanies the tasks you are assigned. You will have to solve problems like these both in this assignment and when you graduate and break into industry-standard workplaces, so keep this in mind as you work. Web scraping may be the focus of this particular assignment, but it may well be an approach you use in real life in the future.
For this practice assignment, we will be using Goodreads as our web source. Goodreads is a website that collects information on books as well as reviews from the community. Think of it as an IMDb for books. In this assignment, we will not be scraping reviews and users, as that is disallowed by the site's robots.txt. Also, please be mindful and avoid requesting a large volume of pages in a short period of time when writing the scraper.
Programming Language and IDE
We will be using Python for this assignment. If you want to use another language (e.g., Ruby, Go, JavaScript...), please provide the TAs with equivalent tools that fulfill the criteria in the rubrics for approval.
PyDev for Eclipse (http://pydev.org/)
PyCharm (http://www.jetbrains.com/pycharm/)
Visual Studio Code (with plugins)
Part 0: Reading
Before you begin web scraping, make sure to read the following links:
To avoid harming the website, please avoid extensive scraping. Be a responsible scraper!
Part I: Web Scraping
Your program should be able to:
Gather information on Authors and Books from Goodreads.
You should gather information from a large number of book pages (>200) and author pages (>50). In other words, you need to collect information on at least 200 books and 50 authors. You should not go beyond 2000 books or authors.
The starting page must be a book page. The starting page should be a variable and should not be hard-coded (for example, starting from Clean Code: https://www.goodreads.com/book/show/3735293-clean-code). The order of traversal doesn't matter: for example, you can find the next books to scrape by visiting all books that the same authors have written, or you can just use the similar books listed on the Goodreads website.
Report progress and exceptions (e.g., books without an ISBN). There is no restriction on how this should be implemented.
Represent Books and Authors efficiently. There is no single required structure for this assignment, and there are no type constraints on fields. However, you are required to scrape at least the following information:
For Authors, you need to store:
name: the name of the author
author_url: the URL of the author's page
author_id: a unique identifier of the author
rating: the rating of the author
rating_count: the number of ratings the author received
review_count: the number of reviews the author received
image_url: a URL of the author's image
related_authors: a list of authors related to the author
author_books: a list of books by the author
For Books, you need to store:
book_url: the URL of the book
title: the name of the book
book_id: a unique identifier of the book
ISBN: the ISBN of the book
author_url: the URL of the author of the book
author: the author of the book
rating: the rating of the book
rating_count: the number of ratings the book received
review_count: the number of reviews the book received
image_url: a URL of the book's image
similar_books: a list of books similar or related to the book
You cannot use the Goodreads API or existing Goodreads scrapers on GitHub. You are allowed to use generic scraping libraries:
BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/; documentation: https://beautiful-soup-4.readthedocs.io/en/latest/) or Scrapy (https://scrapy.org/)
Your web scraper should not run for an extended period of time (e.g., over half an hour).
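To make this concrete, here is a minimal sketch of fetching and parsing a single book page with requests and BeautifulSoup. The function names and the selectors are our own illustrative assumptions, not required structure; Goodreads' HTML changes over time, so inspect the live page and adjust before relying on this.

```python
# A minimal scraping sketch using requests + BeautifulSoup.
# NOTE: the selectors below are assumptions for illustration only;
# inspect the live Goodreads HTML and adjust them.
import time

import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Fetch a page politely, raising on HTTP errors."""
    response = requests.get(url, headers={"User-Agent": "CS242-scraper"})
    response.raise_for_status()
    time.sleep(1)  # throttle so we do not hammer the server
    return BeautifulSoup(response.text, "html.parser")

def parse_book(soup, url):
    """Extract a subset of the required book fields (selectors are assumptions)."""
    title_tag = soup.find("h1")
    return {
        "book_url": url,
        "title": title_tag.get_text(strip=True) if title_tag else None,
        # ... ISBN, rating, similar_books, etc. would be extracted here
    }

if __name__ == "__main__":
    start = "https://www.goodreads.com/book/show/3735293-clean-code"
    print(parse_book(fetch_page(start), start))
```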
Part II: Data Storage in an External Database
Very often, we do not want to manually manage huge files or store large data structures in memory. In such cases, we can utilize databases that reside in the cloud (or on a server), just like a repository.
In this part of the assignment, you will use the same scraper you built in Part I and store the data into a database. We do not impose any constraint on the type of database you use. Here are some suggestions:
Important Notice
You should be aware of the number of requests you are making when designing the scraper. Servers can have protective mechanisms that prevent users from abusing them.
Some general methods to avoid this include, but are not limited to:
(1) Use fewer than 5 books and authors to test and design your scraper
(2) Download the HTML of the relevant links first to finish the parser
(3) Make use of try/except blocks and record failed instances (for instance, via logging); see the sketch below
You will be in a bad position to finish this assignment if you get blocked.
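For example, a minimal sketch of point (3), assuming you log failures to a file so they can be retried later (the log file name and the fetch_or_log helper are hypothetical):

```python
# Sketch of recording failed fetches instead of crashing (point 3 above).
# The log file name and helper function are hypothetical examples.
import logging

import requests

logging.basicConfig(filename="scraper.log", level=logging.INFO)

def fetch_or_log(url, failed_urls):
    """Return the page HTML, or None after logging the failure for a later retry."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logging.warning("failed to fetch %s: %s", url, exc)
        failed_urls.append(url)  # keep the URL so we can retry it later
        return None
```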
MongoDB (including MongoDB Atlas: https://www.mongodb.com/cloud/atlas/lp/general/try) + PyMongo (https://api.mongodb.com/python/current/atlas.html)
Firebase
SQLite
MySQL
Your program should be able to:
Store your data into the database while scraping
Storing everything after scraping is done is not allowed
Again, you can use any database for this task.
Some helpful resources: on MongoDB (https://scotch.io/tutorials/getting-started-with-python-and-mongodb), on Firebase (https://medium.com/@cbrannen/importing-data-into-firestore-using-python-dce2d6d3cd51)
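As one possible approach (assuming you choose MongoDB with PyMongo; the connection string, database name, and save_book helper are hypothetical), writing each record as soon as it is parsed might look like this:

```python
# Sketch of storing each record as it is scraped (MongoDB + PyMongo assumed).
# The MONGO_URI variable, database name, and helper are hypothetical examples.
import os

from pymongo import MongoClient

client = MongoClient(os.environ["MONGO_URI"])  # credentials come from the environment
db = client["goodreads"]

def save_book(book):
    """Upsert a book document keyed on its unique book_id."""
    db.books.update_one({"book_id": book["book_id"]}, {"$set": book}, upsert=True)

# Inside your scraping loop, call save_book(book) right after parsing each page,
# rather than accumulating everything in memory and writing at the end.
```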
Part III: Command Line Interface
To make scrapers more versatile, we can write a simple command-line interface that allows other users to configure the behavior of the program. In this assignment, you are only required to take user input through command-line arguments (however, you will need to make the program interactive by the next assignment). Python, for instance, has several built-in/external packages to assist with this process (for example, the argparse library: https://docs.python.org/3.7/library/argparse.html), which can handle and complete simple preprocessing of the arguments a user provides to the program.
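A minimal argparse sketch covering the requirements below might look like this (the flag names and defaults are illustrative assumptions, not required):

```python
# Sketch of a command-line interface with argparse; flag names and defaults
# are illustrative assumptions, not part of the assignment spec.
import argparse

parser = argparse.ArgumentParser(description="Goodreads scraper")
parser.add_argument("--url", required=True, help="starting book page URL")
parser.add_argument("--books", type=int, default=200, help="number of books to scrape")
parser.add_argument("--authors", type=int, default=50, help="number of authors to scrape")
args = parser.parse_args()

# Basic validation of the starting URL and the requested counts.
if "goodreads.com/book/show/" not in args.url:
    parser.error("starting URL must point to a Goodreads book page")
if args.books > 200 or args.authors > 50:
    print("Warning: more than 200 books / 50 authors requested")
```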
Your program should be able to:
Accept any valid starting URL
Check if the URL is valid, if it points to Goodreads, and if it potentially represents a book
Accept an arbitrary number of books and authors to scrape
Print a warning for numbers greater than 200 books and 50 authors
Read from JSON files to create new books/authors or update existing books/authors
Print an error for an invalid JSON file (e.g., a syntax error) or a malformed data structure (e.g., what if the JSON is an array, or the object doesn't have an id?). Discuss your design choices during discussion sections.
Print which entries are updated or created
Export existing books/authors into JSON files
The output must be valid JSON
What is JSON?
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.
– json.org
JSON is most commonly used in web applications as a lightweight format for transferring data asynchronously between a client and a server. For instance, Facebook transmits status updates to your news feed using JSON so that new posts appear without having to reload the page. But JSON is not limited to web applications: parsers for JSON are available for just about every language out there. See http://www.json.org for parsers for some of the most common programming languages. It would be wise to pick a language for this assignment that already has a well-developed parser. For instance, JSON parsing is built into Python's core library.
You can read more about JSON on Wikipedia (https://secure.wikimedia.org/wikipedia/en/wiki/JSON).
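For instance, a minimal sketch of reading, validating, and exporting JSON with Python's built-in json module (the expected shape, a single object with an id field, is an assumption based on the Part III requirements above):

```python
# Sketch of loading, validating, and exporting JSON with the built-in json module.
# The expected shape (a single object with an "id" field) is an assumption
# drawn from the Part III requirements above.
import json
import sys

def load_record(path):
    """Load one book/author record, exiting with an error message if invalid."""
    try:
        with open(path) as f:
            data = json.load(f)
    except json.JSONDecodeError as exc:
        sys.exit("Error: {} is not valid JSON: {}".format(path, exc))
    if not isinstance(data, dict) or "id" not in data:
        sys.exit("Error: {} must contain an object with an 'id' field".format(path))
    return data

def export_records(records, path):
    """Write records back out; json.dump always emits valid JSON."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```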
Part IV: Miscellaneous Requirements
Testing
As usual, we require that you write extensive unit tests for each part of this assignment. We understand that it can be difficult to test web scraping; however, make sure to exhaustively test all other parts of your code. If your language does not have a testing framework (unlikely), you will need to implement your own test runner and utilities to accomplish this part of the assignment. In order to test your web scraper, your moderator will ask you to scrape a book page of their choice in section, so be prepared. You should also be prepared to demo your data storage.
Please use Python's standard unittest library (https://docs.python.org/3/library/unittest.html) for testing. You are welcome to use other Python testing tools to implement the tests.
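A minimal unittest sketch might look like the following, assuming a parse_book function like the hypothetical one sketched in Part I, tested against a saved HTML snippet so the test never touches the network:

```python
# Minimal unittest sketch. "scraper" and parse_book are hypothetical names
# from the Part I sketch; the test uses an inline HTML snippet, not the network.
import unittest

from bs4 import BeautifulSoup

from scraper import parse_book  # hypothetical module/function

class TestParseBook(unittest.TestCase):
    def test_parses_title_from_saved_html(self):
        soup = BeautifulSoup("<html><h1>Clean Code</h1></html>", "html.parser")
        book = parse_book(soup, "https://example.com/book")
        self.assertEqual(book["title"], "Clean Code")

if __name__ == "__main__":
    unittest.main()
```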
Linting
A linter is a tool that checks for programming errors, bugs, stylistic errors, and so on. Python has a comprehensive style standard, and many linters have been developed to help us understand how our code is written.
For this task, we ask you to utilize one or more linters of your choice to help you follow the PEP 8 standard (https://www.python.org/dev/peps/pep-0008/). Most IDEs have plugins that support these linters. You should provide a report from the linter for your project. Moderators will be using pylint to grade this task. A tutorial is available here: http://pylint.pycqa.org/en/latest/tutorial.html
(Screenshot of part of the pylint report)
Environment Variables
Since we are using a database in this assignment, most of these systems will require some sort of username and password for read/write permissions. It is always a bad idea to leave this information inside your code, not to mention committing it into git. Environment variables are values stored outside of your code, like a key or secret. Most languages support the use of environment variables or a .env file. In this task, you are required to handle environment variables using external packages. Again, there is no limitation on which library you use to read in environment variables.
Some resources to complete this task: an easy-to-read tutorial on using dotenv (https://preslav.me/2019/01/09/dotenv-files-python/), and a demonstration of multiple methods (https://help.pythonanywhere.com/pages/environment-variables-for-web-apps/).
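As a sketch, assuming you pick python-dotenv (the MONGO_URI variable name is just an example):

```python
# Sketch of reading credentials from a .env file with python-dotenv.
# MONGO_URI is an example name; keep the .env file out of git (.gitignore it).
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into os.environ
uri = os.environ["MONGO_URI"]
```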
Part V: Bonus (1-2 pt)
This is a 1-2 point bonus toward the overall assignment. Please utilize a network library to build a graph demonstrating the relationship between Authors and Books. For instance, NetworkX (https://networkx.github.io/) provides some ready-to-use components.
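A minimal sketch with networkx and matplotlib (the sample data below is made up):

```python
# Sketch of an author-book graph with networkx; the sample data is made up.
import matplotlib.pyplot as plt
import networkx as nx

graph = nx.Graph()
graph.add_edge("Robert C. Martin", "Clean Code")        # author-book edge
graph.add_edge("Robert C. Martin", "Clean Architecture")

nx.draw(graph, with_labels=True, node_color="lightblue")
plt.savefig("network.png")  # a static image of the network (bonus point 2)
```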
Summary
Reading
Code Complete, Chapter 7: High-Quality Routines
Optional: The Pragmatic Programmer, Chapter 5, Section 26: Decoupling and the Law of Demeter
Submission
Moderators are asked to grade style according to the PEP 8 standard (https://www.python.org/dev/peps/pep-0008/), so please follow it. If you do follow an alternative style guide, please provide it to the moderator and TAs and demonstrate that you are following that convention. This requirement is meant to maintain grading standards across sections. (Alternatives: Google Python Style Guide http://google.github.io/styleguide/pyguide.html, Airbnb Ruby Style Guide https://github.com/airbnb/ruby, Airbnb JavaScript Style Guide https://github.com/airbnb/javascript)
Please use Python 3.5+.
This assignment is due at 23:59 CDT on Sep 27, 2021. Please be sure to submit in GitLab, grant the correct access, and ask your moderator or TA before the deadline if you have any questions.
Please make sure you follow the repo naming conventions listed on Piazza. You have to create a new repository for this assignment.
Please make sure that you create a branch with the name assignment-2.0 and merge it back to master (while keeping the branch).
Objectives
The readings are due before lectures every Friday.
Learn about responsible web scraping
Learn to work with data in JSON format
Learn to work with databases
Learn how to effectively decompose your code into reusable modules
Learn to use different tools to assist in writing better code
Resources
JSON (http://www.json.org/)
Python (http://www.python.org/)
Grading (see the Grading Policy for the full rubric: https://wiki.illinois.edu/wiki/x/yCOAKg)
Basic Requirements (Total: 22)
Code Submission (3 pts): Same as listed in the Grading Policy. DO NOT commit env files.
Decomposition/Overall Design (4 pts): -1 pt per infraction:
- Any piece of the project must execute its core functionality well.
- The project should be well decomposed into classes and methods.
- The project is adequately decomposed into abstract classes and classes. Abstract classes and methods with shared functionality from different parts should be generalized and reusable.
- Functions are not duplicated, and you make an effort to ensure that duplicate pieces of code are refactored into methods.
Style guide (3 pts): Follow the PEP 8 guideline up to Comments (anything up to where the link points).

Functional Requirements (Total: 15)
Web Scraping (5 pts):
- 0 pt: Lacks any form of a web scraper, or utilizes a non-scraping library to fetch information
- -1 pt: Web scraper only semi-functions (may crash on certain web pages) or does not handle exceptions and errors gracefully
- +1 pt: Scraped some authors
- +1 pt: Scraped some books
- +1 pt: Completed the books requirement (>200)
- +1 pt: Completed the authors requirement (>50)
- +1 pt: Reports scraping progress and errors effectively
- -1 pt: Scraping >2000 in one go; your program should stop early.
Database Setup (2.5 pts):
- +0.5 pt: System is set up with a database
- +1 pt: System can write to the database without errors
- +1 pt: System is connected to the database and can read from it without errors
Command Line Interface (5.5 pts):
- +0.5 pt: Configurable starting URL
- +0.5 pt: Starting URL error checking
- +0.5 pt: Configurable book and author counts
- +0.5 pt: Counts error checking
- +2 pt: JSON to database for books and authors, supporting create and update
- +0.5 pt: JSON to DB error checking
- +1 pt: Database to JSON
- -0.5 pt: Invalid JSON output
Linter (1 pt): Shows a score of 8.5/10 or above in the pylint report
Environment Variables (1 pt): Utilizes environment variables for the database and other sensitive/dynamic values

Testing Requirements (Total: 10)
Unit tests (5 pts):
- 0 pts: No unit tests, or not using Python testing libraries/tools
- For every 2 unit tests, gain 1 point (at most 5 points)
- -1 pt per infraction:
  - A single unit test tests multiple cases or unrelated things
  - Missing obvious test cases or edge cases
  - Tests are hard to understand: the test name should indicate the purpose of the test, the correctness of the test should be easy to verify, and there should be no significant code smells that make a test hard to understand
Manual Test Plan (5 pts):
- 0 pts: No manual test plan
- 1 pt: The test plan includes only environment setup OR scenario descriptions
- 2 pts: The test plan contains only some of the content and can be further improved (~8 pages)
- 4 pts: The test plan contains most of the content (~10 pages)
- 5 pts: A well-composed test plan covering all aspects of the system (~12 pages)

Bonus (Total: 3)
Network (2 pts):
- +1 pt: Constructing the network and displaying it in the terminal
- +1 pt: Visualizing a static image of the network
CI (1 pt): Utilize continuous integration on GitLab to run your tests