
Assignment 2.0 – Scraping the Web

Overview

This week, you will be scraping Wikipedia and storing
information about actors and movies into a data structure of your design. You
will also write a graph library and a function for converting your data to a
graph. Then you will store the relevant information as a JSON file so you can
load it again without re-scraping the website. Finally, you should be able to
provide basic information from your data structure through console output.

Programming Language

Unlike past weeks, this week you should implement your
project in a programming language that you have not used before. For more
information, see the Programming Language Selection section.

Motivation and Goals

There are many methods of data collection in the rapidly evolving
world of information and technology, but web scraping is among the most popular
and accurate. In layman's terms, web scraping is the act of using bots to
extract specific content and data from a website. Web scraping is especially
useful because it can convert non-tabular, nonsensical and
poorly constructed data into something usable in both format and content. Web
scraping is also championed for its ability to acquire previously inaccessible
data. However, web scraping is not about mere acquisition; it can also help
you track changes, analyze trends and keep tabs on certain patterns in
specific fields.

The purpose of this assignment is to introduce
you to a real-world application of web scraping, as well as to get you
thinking about the creative process that accompanies the tasks you are
assigned. You will have to solve open-ended problems like these both
in this assignment and after you graduate and move into
industry, so keep this in mind as you work. Web scraping is the focus of
this assignment, but it may well be an approach you use in the future.

For this practice assignment, we will be using Wikipedia as
our web source, for a number of reasons. Although Wikipedia provides database
dumps for everything, it is the best source to use for this exercise because
not only does it have fairly up-to-date information, it is also legal to scrape
Wikipedia without ramifications or complicated restrictions.

Programming Language Selection

Whatever language you choose, you
should use an IDE of your choice (suggestions for Ruby and Python below):

Python: Consider using PyDev for Eclipse or PyCharm (from the makers of
IntelliJ)

Ruby: Consider using a plugin for Eclipse or RubyMine
(from the makers of IntelliJ)

JavaScript

You can also select a language you would like to learn
(ideally something not too obscure), and contact your moderator or the TAs to
ensure that this language is appropriate to use.

Language Selection

Be aware that the TAs are not familiar with every
programming language out there, so we may or may not be able to provide in-depth
help with the language you choose. If we cannot help you, post your question to
Piazza since there may
be other people in the course that can help with your problem.

Background

Logging

We are asking you to perform logging as part of your web
scraping code.

What is logging?

When a function is called, I log that
it has been called along with any passed in arguments, the time it was called,
and in Javascript, the name of the function that
called it. I also log before a function returns, along with the time and any
relevant return data. I log when asynchronous calls return. I log when
significant logic changes happen. -Erik Hazzard, How
Logging Made me a Better Developer

In short, logging is adding output or log statements
to your code that indicate the status of what is going on
while your code is running. Log statements are helpful because they decrease
code complexity, help debugging, increase visibility and assist communication
with other developers.

There are several different log levels.

You are not required to use all of these log levels when
writing your code, but we expect you to use at least 3 of the available log
levels in the language you are using. For example, in Python you could
log a WARNING level message when you encounter a
strangely formatted page that cannot be understood by your scraper. In
addition, we expect you to write meaningful log messages. In other
words, your log message should not be “logging.warning('Some error')” or “logging.warning(i)”;
if we see vague or meaningless messages like this we will take off points.
Using the above example, an acceptable message could be “logging.warning('[page] was not able to be parsed because
filmography section was not found.')”. Since the pages you are
scraping will not be uniform in their HTML markup, it is very likely that you
will encounter errors and edge cases as you scrape. Take advantage of logging
for these cases!
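As a rough illustration of the expectations above, here is a minimal Python sketch that writes three log levels (DEBUG, INFO, WARNING) to a file using the standard `logging` module. The names `build_logger` and `describe_page`, the file name, and the `has_filmography` flag are hypothetical and only stand in for whatever your scraper actually does.

```python
import logging


def build_logger(log_path="scraper.log"):
    """Return a logger that writes timestamped messages to a file."""
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.DEBUG)          # let every level through
    logger.handlers.clear()                 # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False                # keep messages out of the root logger
    return logger


def describe_page(logger, title, has_filmography):
    """Log the outcome of parsing one (hypothetical) page; True on success."""
    logger.debug("Fetching page '%s'", title)
    if not has_filmography:
        logger.warning(
            "'%s' was not able to be parsed because filmography "
            "section was not found.", title
        )
        return False
    logger.info("Parsed filmography for '%s'", title)
    return True
```

Each message names the page and the reason, which is what separates a meaningful log line from “Some error”.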

Here are some resources to get you started:

Python: Python 2 Logging, Python 3 Logging

Ruby: Logging with Ruby

Javascript: Logging with Javascript

You are free to use any library you like for logging in
addition to those suggested above.

Graphs

A graph, at least to computer scientists, is a set of vertices and a set of edges that
connect them. Each vertex in a (directed) graph has a set of outgoing edges and
a set of incoming edges. There are many ways to represent a graph in memory; it
is up to you to determine which representation you would like to use. See the
image below for a graphical representation of what your graph might look like:

For this assignment, you should be able to represent each
actor and movie as a vertex in your graph. The weight of an edge should
represent the amount of money the actor made from the movie. However, since
this information is not readily available on the Wikipedia page,
we leave the choice of how to represent this up to you. You should not
assign the same weight to all your edges for a given vertex. Here are some
examples of what you can do:

Represent the weight of
an edge as the grossing value of the movie, with actors at the top of the
cast list having more weight than those at the bottom of the list.

Represent the weight of
an edge as the grossing value of the movie, with young actors being
weighted more than older actors.

Represent the weight of
an edge as the grossing value of the movie, with actors who have been in
more movies having more weight than those who have been in fewer movies.

Also, your graph will need to be augmented to hold extra
data about each movie. You will have to adapt your implementation to handle
this extra data. Note that in our case the graph is undirected, because if an actor is in a movie, then the movie definitely contains that actor.
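To make the first weighting idea concrete, here is one possible sketch in Python. It assumes (this is not specified by the assignment) that the cast list is ordered from top billing to bottom and that an actor at position i of n receives a share proportional to n - i, so the shares add up to the movie's gross. The function name `cast_order_weights` is hypothetical.

```python
def cast_order_weights(gross, cast):
    """Split a movie's gross across its cast, favoring actors billed earlier.

    `cast` is assumed to be ordered from the top of the cast list to the
    bottom. The actor at index 0 gets n shares, the next n - 1, and so on.
    """
    n = len(cast)
    if n == 0:
        return {}
    total_shares = n * (n + 1) / 2  # n + (n - 1) + ... + 1
    return {
        actor: gross * (n - rank) / total_shares
        for rank, actor in enumerate(cast)
    }
```

With three cast members the shares are 3/6, 2/6 and 1/6 of the gross, so no two edges of the same movie carry the same weight.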

JSON

We are asking you to store your scraped data as a JSON
file.

What is JSON?

JSON (JavaScript Object Notation) is a lightweight
data-interchange format. It is easy for humans to read and write. It is easy
for machines to parse and generate. – json.org

JSON is most commonly used in web applications, as a
lightweight format to transfer data asynchronously between a client and server.
For instance, Facebook transmits status updates to your newsfeed using JSON so
that new posts appear without having to reload the page. But, JSON is not
limited to web applications. Parsers for JSON are available for just about
every language out there. See http://www.json.org
for parsers for some of the most common programming languages. It would be wise
for you to pick a language for this assignment that already has a well-developed
parser. For instance, JSON parsing is built into Python's standard library.

You can read more about JSON on Wikipedia.

Part 0: Reading

Before you begin web scraping, make sure to read the
following links:

How to Scrape Websites Without Getting Blocked

Don’t get Blacklisted. Go for Legitimate Web Scraping.

Some Traps to know and Avoid in Web Scraping

To
avoid harming the website, please avoid extensive scraping. Be a responsible
scraper!
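One simple way to keep scraping from becoming excessive is to enforce a minimum pause between requests. The sketch below is only an illustration: the class name `PoliteThrottle` and the default one-second interval are assumptions, not values taken from the readings. The clock and sleep functions are injectable so the behavior can be tested without actually waiting.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` immediately before each page fetch spaces the requests out regardless of how fast the parsing code runs.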

Part I: Web Scraping

Here’s what you will need to do:

Write a web scraper to
gather information from pages on Wikipedia. You should gather information
from a large number of movie pages (>125) and actor pages (>250).
The choice of the starting page is up to you, but feel free to start
here: https://en.wikipedia.org/wiki/Morgan_Freeman

Store grossing
information for movies and age for actors.

Log messages to a log
file as you scrape. You should log at least three different levels of
messages; messages should be appropriate and meaningful.

You cannot use
the Wikipedia API; however, feel free to use whatever scraping library your
language supports. Here are some of our recommended libraries:

Python: BeautifulSoup (Documentation),
Scrapy

Ruby: Nokogiri

JavaScript: Node.js, Scraping with Node.js

Your web scraper should not run for longer than 10 minutes
with the requirements we have given you. For your live testing in section, if
your scraper takes longer than your time slot (20 minutes in a 6-person section), you will not receive credit for that part
of the rubric!
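A core step in any scraper is pulling the links to other article pages out of the HTML you have already downloaded. The sketch below does this with Python's standard-library `html.parser` rather than one of the recommended libraries, purely to stay dependency-free; BeautifulSoup or Scrapy would do the same job with less code. The names `WikiLinkParser` and `extract_wiki_links` are hypothetical, and the rule of skipping any link containing a colon is an assumption about how namespaced pages (for example `/wiki/File:...`) appear.

```python
from html.parser import HTMLParser


class WikiLinkParser(HTMLParser):
    """Collect internal article links (/wiki/...) from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep article links only; skip namespaced pages and duplicates.
        if href.startswith("/wiki/") and ":" not in href and href not in self.links:
            self.links.append(href)


def extract_wiki_links(html):
    """Return the unique /wiki/ links in `html`, in order of appearance."""
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links
```

Because the function works on a string, it can be unit tested against small saved HTML snippets without touching the network.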

Part II: Graph Library

Next, you will need to create a general graph library for
your application.

You cannot use
any existing graph library to handle the representation of your graph in
memory. That is, you need to write your own graph
implementation for this assignment. For this section, implement a general graph
library that will hold the data you parse and interact with
in the following sections.
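As one possible starting point, here is a minimal sketch of an undirected, weighted graph stored as a dictionary of adjacency dictionaries, with a second dictionary for per-vertex data such as a movie's gross or an actor's age. The class and method names are hypothetical, and this is only one of several reasonable representations.

```python
class Graph:
    """Undirected weighted graph stored as a dict of adjacency dicts."""

    def __init__(self):
        self._adj = {}   # vertex key -> {neighbor key: edge weight}
        self._data = {}  # vertex key -> dict of extra attributes

    def add_vertex(self, key, **attrs):
        """Add a vertex if missing and merge any attributes into it."""
        self._adj.setdefault(key, {})
        self._data.setdefault(key, {}).update(attrs)

    def add_edge(self, u, v, weight):
        """Connect u and v in both directions, creating vertices as needed."""
        self.add_vertex(u)
        self.add_vertex(v)
        self._adj[u][v] = weight
        self._adj[v][u] = weight

    def neighbors(self, key):
        """Return a copy of {neighbor: weight} for the given vertex."""
        return dict(self._adj[key])

    def attributes(self, key):
        """Return a copy of the attribute dict for the given vertex."""
        return dict(self._data[key])
```

Storing each edge in both directions makes neighbor lookups constant time from either the actor or the movie side, at the cost of writing each weight twice.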

Testing

Just as with past assignments, you should test your code
thoroughly using unit tests. Your tests should be automated and perform
asserts similar to JUnit, using a unit test library
available in your language of choice. Contact the course staff or post on Piazza if you have trouble
finding a test suite in your language.

You are required to be able to convert your web scraped data into a graph
structure. This will be important for next week. However, how you choose
to represent your web scraped data in order to do the following parts (III and
IV) is up to you.

Part III: JSON Storage and Retrieval

You should be able to:

Store your data
structure as a JSON file

Load your data structure
from a JSON file

You can use any existing JSON
library you wish for this task.
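For reference, a store/load round trip with Python's built-in `json` module can be as short as the sketch below. The function names `save_data` and `load_data` are hypothetical, and the data is assumed to consist only of dictionaries, lists, strings, numbers, booleans and `None`.

```python
import json


def save_data(data, path):
    """Write a JSON-serializable structure to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def load_data(path):
    """Read back a structure previously written by save_data."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Keep in mind that JSON object keys are always strings and that tuples come back as lists, so a structure that relies on integer keys or tuples will not survive the round trip unchanged.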

Part IV: Graph Queries

Your code should have methods that allow the user to:

Find how much a movie
has grossed

List which movies an
actor has worked in

List which actors worked
in a movie

List the top X actors
with the most total grossing value

List the oldest X actors

List all the movies for
a given year

List all the actors for
a given year

All of this information should be calculated at query time
and not be hard-coded into your source. Be prepared to run these methods and
produce console output for your moderator in section.
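A few of the queries above can be sketched as follows. The data layout is an assumption made for this example only: `actors` maps a name to a dict with `age` and `movies`, and `movies` maps a title to a dict with `gross` and `year`. Your own structure (or graph) may look quite different, and the function names are hypothetical.

```python
import heapq


def oldest_actors(actors, x):
    """Return the names of the x oldest actors, oldest first."""
    return heapq.nlargest(x, actors, key=lambda name: actors[name]["age"])


def top_grossing_actors(actors, movies, x):
    """Return (name, total gross) pairs for the x highest-grossing actors."""
    totals = {
        name: sum(
            movies[title]["gross"]
            for title in info["movies"]
            if title in movies          # ignore movies that were never scraped
        )
        for name, info in actors.items()
    }
    return heapq.nlargest(x, totals.items(), key=lambda pair: pair[1])


def movies_in_year(movies, year):
    """Return the titles released in `year`, sorted alphabetically."""
    return sorted(title for title, info in movies.items() if info["year"] == year)
```

Each function computes its answer from the loaded data at call time, which is what the requirement about query-time calculation asks for.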

Testing

As usual, we require that you write extensive unit tests
for each part of this assignment. We understand that it can be difficult to
test for web scraping. However, make sure to exhaustively test all other parts
of your code. If your language does not have a testing framework (unlikely),
you will need to implement your own test runner and utilities to accomplish
this part of the assignment. In order to test your web scraper, your moderator
will ask you to scrape an actor/movie page of their choice in section,
so be prepared.

Summary


Reading

Code Complete Chapter 7: High-Quality Routines

Optional: The Pragmatic Programmer Chapter 5-26: Decoupling and the Law of Demeter

The readings
are due before lectures every Friday.

Submission

This assignment is due at the beginning
of your discussion section the week of October 8th, 2018. Please be
sure to submit in GitLab, and ask your moderator or
TA before the deadline if you have any questions.

Objectives

Learn about
responsible web scraping

Learn a new
programming language

Learn about
graphs and how to implement them

Learn to work
with data in JSON format

Learn how to
effectively decompose your code into reusable modules

Resources

Python

Ruby

Javascript

JSON Homepage

JSON Wikipedia

Graphs

Grading
