
Homework – Graph analysis with SQL
In this assignment, we will leverage SQL to analyze a graph dataset.

Two common tasks in analyzing a graph are

finding connected components
computing PageRank

In the first part of this homework, your task is to write an SQL function that finds connected
components of the graph.

In the second part of this homework, your task is to write an SQL function to compute the PageRank
of each of the nodes of the graph.

Grading

The assignment is worth 100 points.

Connected components

65 points – Connected Components implementation and execution
5 points – Comments. Write comments that help both you and the grader understand what you
are doing and why

PageRank

25 points – PageRank implementation and execution
5 points – Comments. Write comments that help both you and the grader understand what you
are doing and why

Points in each task are dedicated to designing a test case (a simple but comprehensive graph),
hand-calculating the expected output of each task on your test graph, running your implemented
functions on that graph, and examining the results. In a markdown cell, describe whether the
output matches what you expected from your hand calculation.

Partial credit may be given at the discretion of the grader.

A note on speed

It is very important that you do as much as possible declaratively. Looping through the contents
of a table using a cursor is necessarily going to be slow. Do as much as you can with declarative
SQL queries, and use loops and conditionals only to guide the overall control flow, when there is
clearly no way to express what you want in declarative SQL. On this assignment, there is often a
100× or more difference in performance between well-written code that mostly uses declarative
queries and code written with a lot of loops. Not to mention that declarative queries are easier
to write and debug!
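To illustrate the difference, here is a minimal sketch using a hypothetical table scores(id, val) (not part of the assignment): both snippets do the same work, but the first is a single set-based statement while the second issues one UPDATE per row.

-- Declarative: one statement; the database handles the iteration.
UPDATE scores SET val = val * 2;

-- Imperative: a PL/pgSQL cursor-style loop doing the same thing row by
-- row; typically dramatically slower.
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT id FROM scores LOOP
        UPDATE scores SET val = val * 2 WHERE id = r.id;
    END LOOP;
END $$;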


Turnin


Fill in this notebook and download it as a notebook (.ipynb). By the assignment due date and time,
submit this document electronically to Canvas. Answers in the comments section of Canvas are
typically not read (they don't download with your assignment); if you have comments, please email
the TA and professor. Be sure to turn in the file with your answers in it, not just the empty
assignment notebook.

Academic Honesty
The following level of collaboration is allowed on this assignment: you may discuss the assignment
with your classmates at a high level, and discussing any issues getting PostgreSQL running is
totally fine. What is not allowed is direct examination of anyone else's SQL code (on a computer,
email, whiteboard, etc.) or allowing anyone else to see your SQL code. You may not discuss query
results with your classmates.

You may use the search engine of your choice to look up the syntax for SQL commands, but you may
not use it to find answers to the queries. Be sure to cite your sources, if you used any.

Autograder information
In order for our autograder to work, we need to connect your solution to a local database.

DO NOT RUN CELLS THAT START WITH THE COMMENT # LOCALDB and do not remove
those cells.

For cells that contain your answer, replace the SELECT 1 SQL code with your solution. Do not
change the first line of the cell or remove the following cell.

Cells that start with %%sql py_var_x << redirect the output of that cell to the variable named py_var_x. Note that if the next thing in such a cell is a comment, you will get an error: put some SQL code first, then add your comment. We then convert this variable to a dataframe and record that as your answer for the question. You may add other cells to the notebook, but be sure that your answer is in the cell that copies the result to the appropriate Python variable.

We are using an autograder for the first pass of grading. For the autograder to work, you need to complete your work in the appropriate cells, and name and order the attributes in your results as directed. It is critical that you use the attribute names and sort order as directed. If attribute name(s) are not specified, use the name in the source table.

In [ ]:

# LOCALDB
import psycopg2
from configparser import ConfigParser

def config(filename='.pg_service.conf', section='postgresql'):
    # create a parser
    parser = ConfigParser()
    # read config file
    parser.read(filename)
    # get section, default to postgresql
    db = {}
    if parser.has_section(section):
        params = parser.items(section)
        for param in params:
            db[param[0]] = param[1]
    else:
        raise Exception('Section {0} not found in the {1} file'.format(section, filename))
    return db

params = config()

# build the connection string (standard SQLAlchemy URL: user:password@host:port/database)
def make_conn_str(params):
    return f"postgresql+psycopg2://{params['user']}:{params['password']}@{params['host']}:{params['port']}/{params['database']}"

# connect to the database
conn_str = make_conn_str(params)
%load_ext sql
%sql $conn_str
%config SqlMagic.displaylimit=100

Initialization

The next cell needs to be run each time you start up Google Colab to start up PostgreSQL.

In [ ]:

# install
!pip install --upgrade pip
!pip install SQLAlchemy==1.3.23
!pip install psycopg2-binary
!apt install postgresql postgresql-contrib &>log
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"
# set connection
%load_ext sql
%config SqlMagic.autolimit=100
# Limit queries to 100 results. Increase this value if needed, but recognize that your notebook will grow larger.
%sql postgresql+psycopg2://@/postgres
!sudo -u postgres createdb ricedb

Data

Our graph dataset contains citation information for 4,560 papers from the Arxiv high-energy physics theory paper archive. The dataset has 14,411 citations between those papers.

The dataset is comprised of two database tables:

node (paperID, paperTitle);
edge (paperID, citedPaperId);

The node table gives a unique paper identifier, as well as the paper title. The edge table indicates citations between the papers. Note that citations have a direction: the paper with paperID cites each citedPaperId. In other words, the direction is from paperID to citedPaperId.

The two database tables, node and edge, are populated by two .sql files, node.sql and edge.sql, respectively.

Create and populate the data tables

In [ ]:

%%capture
!psql -d postgres -f node.sql
!psql -d postgres -f edge.sql

Verify that the load worked and take a look at the data

In [ ]:

%%sql
select * from node limit 10;

In [ ]:

%%sql
select * from edge limit 10;

PostgreSQL automatically creates an index for each PRIMARY KEY.

Let’s create an additional index to help your code run faster. You only need to run this cell ONCE
(ever, unless you drop the tables).

In [ ]:

%%sql
CREATE INDEX IF NOT EXISTS citedPaperID ON edge(citedPaperId);

Task 1 – finding connected components
To find connected components of a graph, you should treat the graph as undirected (that is, do
not worry about the direction of the citations).

A connected component is a subgraph such that there exists a path between each pair of nodes in
the subgraph. Such a subgraph must be maximal in the sense that it is not possible to add any
additional nodes that are connected to any node in the subgraph.

The standard method for computing a connected component is a simple breadth-first search (BFS).
For this assignment, you will find connected components using BFS.

Pick a random starting node, and then search for all nodes reachable from the starting node, then
search for all nodes reachable from all of those nodes, and then search for all of the nodes
reachable from those nodes, and so on, until no new nodes are found. The entire set of discovered
nodes forms one connected component.

If there are some nodes left that are not part of any connected component analyzed so far, then pick
one of those nodes randomly, and restart the computation.

You are done when all of the nodes are part of exactly one connected component.
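To make the flavor of this concrete, here is a minimal sketch of a single BFS expansion step in declarative SQL. It assumes scratch tables reached(paperID) and frontier(paperID) that you create and seed yourself (both names are illustrative, not part of the assignment), and it follows edges in both directions so the graph is treated as undirected.

-- One BFS expansion step: collect unvisited neighbors of the current
-- frontier, following edges in both directions (undirected graph).
CREATE TEMP TABLE next_frontier AS
SELECT nbr.paperID
FROM (
    SELECT e.citedPaperId AS paperID
    FROM edge e JOIN frontier f ON e.paperID = f.paperID
    UNION
    SELECT e.paperID
    FROM edge e JOIN frontier f ON e.citedPaperId = f.paperID
) AS nbr
WHERE nbr.paperID NOT IN (SELECT paperID FROM reached);

-- Mark the new nodes as reached and advance the frontier.
INSERT INTO reached SELECT paperID FROM next_frontier;
DROP TABLE frontier;
ALTER TABLE next_frontier RENAME TO frontier;

In a complete solution, a loop in a PL/pgSQL function would repeat this step until the new frontier comes back empty, then label everything in reached with that component's identifier.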

For the first part of this homework, your first task is to

Write one or more functions in SQL to find all connected components of the graph using BFS.
Identify each connected component by a unique integer.
Print out connected components that have at least 5 nodes (papers) and no more than 8 nodes
(papers).
Print out the components in order by size, from smallest to largest.
When you print out each connected component, print the connected component identifier, the
paperID, as well as the title. Within each component, order the papers by paperID.

Test cases
Be sure to test your code on some simple networks. Some reasonable test cases might include:

An empty network
A network with a single connected component A-B-C-D-A
A network with 3 components
A network with only single nodes (no edges) A B C
A network with a fully connected component and an unconnected node A-B-C-A D
A chain network A-B-C-D (with or without unconnected other nodes)
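For instance, one combination of these cases can be staged in scratch tables and checked by hand; the names test_node and test_edge below are illustrative, not required by the assignment.

-- A small hand-checkable graph with three components:
-- a cycle A-B-C-D-A, an edge E-F, and an isolated node G.
CREATE TEMP TABLE test_node (paperID int, paperTitle text);
CREATE TEMP TABLE test_edge (paperID int, citedPaperId int);

INSERT INTO test_node VALUES
  (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E'), (6, 'F'), (7, 'G');

INSERT INTO test_edge VALUES
  (1, 2), (2, 3), (3, 4), (4, 1),  -- cycle A-B-C-D-A
  (5, 6);                          -- edge E-F
-- Expected components: {A,B,C,D}, {E,F}, {G}.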



In [ ]:

%%sql py_var_task1_result <<
SELECT 1 -- YOUR CODE HERE

In [ ]:

df_findcc = py_var_task1_result.DataFrame() # Save the result of this task for grading purposes
df_findcc

Task 2 – computing PageRank
PageRank is a standard graph metric that is well-known as the basis for Google’s original search
engine. The idea behind PageRank is simple: we want a metric that rewards web pages (or in our
case, papers) that are often pointed to by other pages (or in our case, often cited by other papers).
The more popular the page, the greater the PageRank.

To accomplish this, PageRank models a web surfer who starts at a random page and randomly
clicks links. There is a probability $d$ that the surfer follows a link from the current page, and a
probability $1 - d$ that the surfer jumps to a random page. $d$ is called the damping factor. A
standard value for $d$ is 0.85 (everyone should use this value in this homework to expect the same
result). Given this setup, the so-called PageRank of a web page (or a paper) is the probability that
when the user stops clicking (or following references), s/he will land on that page.

Since so-called sinks (those pages that don’t link anywhere else) would accumulate all of this
probability under the simplest model, it is assumed that those pages with no out-links instead link
with equal probability to everyone else.

There are many ways to compute the PageRank of every page (or node) in a dataset. The simplest
is an iterative computation.

Let $PR_i(\mathrm{node}_j)$ denote the estimated PageRank of $\mathrm{node}_j$ at iteration $i$.

Assume that there are $n$ papers in our dataset. We start out with $PR_0(\mathrm{paper}_j) = \frac{1}{n}$ for all $j$.

Then, at iteration $i$, we simply set:

$$PR_i(\mathrm{paper}_j) = \frac{1 - d}{n} + d \sum_{k \in \{\text{papers citing } \mathrm{paper}_j\}} \frac{PR_{i-1}(\mathrm{paper}_k)}{\text{num papers cited by } \mathrm{paper}_k}$$

This iterative process is continued until there is only a small movement in probability across
iterations. In our case, we'll continue as long as:

$$0.01 < \sum_j \left| PR_i(\mathrm{paper}_j) - PR_{i-1}(\mathrm{paper}_j) \right|$$
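As a sketch of how one iteration can be done declaratively, the following assumes working tables pagerank(paperID, rank) holding the previous iteration's values and outdeg(paperID, cnt) holding each paper's out-citation count (both hypothetical names you would build yourself); it omits the sink adjustment described above, which your real solution still needs to handle.

-- One PageRank iteration with d = 0.85: each paper receives (1-d)/n plus
-- d times the sum of rank/outdegree over the papers that cite it.
CREATE TEMP TABLE new_pagerank AS
SELECT nd.paperID,
       (1 - 0.85) / (SELECT count(*) FROM node)
         + 0.85 * coalesce(sum(pr.rank / od.cnt), 0) AS rank
FROM node nd
LEFT JOIN edge e      ON e.citedPaperId = nd.paperID  -- papers citing nd
LEFT JOIN pagerank pr ON pr.paperID = e.paperID
LEFT JOIN outdeg od   ON od.paperID = e.paperID
GROUP BY nd.paperID;

-- Convergence check: keep iterating while total movement exceeds 0.01.
SELECT sum(abs(np.rank - pr.rank)) > 0.01 AS keep_going
FROM new_pagerank np
JOIN pagerank pr ON np.paperID = pr.paperID;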

For the second part of this homework, your task is to

write one or more functions in SQL to compute the PageRank of each of the nodes in the graph
print out the top 10 nodes with the greatest PageRank values
when you print out the top 10 nodes, print their paperID, title, and PageRank value, in
descending order of PageRank value.

In [ ]:

%%sql py_var_task2_result <<
SELECT 1 -- YOUR CODE HERE

In [ ]:

df_pagerank = py_var_task2_result.DataFrame() # Save the result of this task for grading purposes
df_pagerank

Add code to drop any custom views or new tables that you created

Be sure to put the drop statements in the correct order! (-5 points if this code is missing and you created custom views or tables)

In [ ]:

%%sql
SELECT 1;