
DSC 450: Database Processing for Large-Scale Analytics
Take-home Final

Part 1

We will use a full day's worth of tweets as input (there are a total of 4.4M tweets in this file, but we will intentionally use fewer tweets):
http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/OneDayOfTweets.txt
Execute the following tasks with 1,000 tweets, 50,000 tweets, 100,000 tweets, and 500,000 tweets. Collect and plot the runtimes in part 1-e.

a) Use Python to download tweets from the web and save them to a local text file (not into a database yet, just to a text file).

NOTE: Do not call read() or readlines(). Those calls will attempt to read the entire file, which is too much data; read the response line by line instead.
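
For example, a download loop along these lines would work (a sketch; the file name tweets.txt and the constant NUM_TWEETS are illustrative, not required names):

    import urllib.request

    NUM_TWEETS = 1000        # repeat with 50,000 / 100,000 / 500,000
    URL = 'http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/OneDayOfTweets.txt'

    response = urllib.request.urlopen(URL)
    with open('tweets.txt', 'wb') as out:
        for i, line in enumerate(response):   # stream line by line, never read()/readlines()
            if i >= NUM_TWEETS:
                break
            out.write(line)                   # each line holds one JSON-encoded tweet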

b) Repeat what you did in part-a, but instead of saving the tweets to a file, populate the 3-table schema that you created in SQLite. Be sure to execute commit and verify that the data has been loaded successfully (report the loaded row counts for each of the 3 tables).
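
One possible loader (a sketch; it assumes a schema along the lines of Tweet(id, created_at, text, user_id), User(id, name, screen_name), and Geo(tweet_id, type, latitude, longitude), so substitute your own column lists from the previous assignment):

    import json
    import sqlite3
    import urllib.request

    NUM_TWEETS = 1000
    URL = 'http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/OneDayOfTweets.txt'

    conn = sqlite3.connect('tweets.db')
    cur = conn.cursor()

    for i, line in enumerate(urllib.request.urlopen(URL)):
        if i >= NUM_TWEETS:
            break
        try:
            tweet = json.loads(line.decode('utf-8'))
        except ValueError:
            continue                          # skip malformed/empty lines
        user = tweet['user']
        cur.execute('INSERT OR IGNORE INTO User VALUES (?,?,?)',
                    (user['id'], user['name'], user['screen_name']))
        geo = tweet.get('geo')
        if geo and geo.get('coordinates'):    # most tweets have no geo data
            cur.execute('INSERT OR IGNORE INTO Geo VALUES (?,?,?,?)',
                        (tweet['id'], geo['type'],
                         geo['coordinates'][0], geo['coordinates'][1]))
        cur.execute('INSERT INTO Tweet VALUES (?,?,?,?)',
                    (tweet['id'], tweet['created_at'], tweet['text'], user['id']))

    conn.commit()                             # commit before counting
    for table in ('Tweet', 'User', 'Geo'):
        print(table, cur.execute('SELECT COUNT(*) FROM ' + table).fetchone()[0])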

c) Use your locally saved tweet file (created in part-a) to repeat the database population step from part-b. That is, load the tweets into the 3-table database from your saved file (do not read the Twitter data from the URL here).
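
The loop is the same as in part-b, except that it iterates over the local file; insert_tweet below is a hypothetical helper that wraps the per-tweet INSERT statements from part-b:

    import json
    import sqlite3

    conn = sqlite3.connect('tweets.db')
    cur = conn.cursor()
    with open('tweets.txt', encoding='utf-8') as f:
        for line in f:                        # again line by line, not readlines()
            try:
                tweet = json.loads(line)
            except ValueError:
                continue
            insert_tweet(cur, tweet)          # hypothetical helper with the part-b INSERTs
    conn.commit()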

d) Re-run the previous step with a batch size of 1,000 (i.e., by inserting 1,000 rows at a time with executemany), similar to the posted sample code. How does the runtime compare?
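
One way to batch (a sketch; only the Tweet table is shown, and the User and Geo batches work the same way):

    import json
    import sqlite3

    BATCH_SIZE = 1000
    conn = sqlite3.connect('tweets.db')
    cur = conn.cursor()

    batch = []
    with open('tweets.txt', encoding='utf-8') as f:
        for line in f:
            try:
                t = json.loads(line)
            except ValueError:
                continue
            batch.append((t['id'], t['created_at'], t['text'], t['user']['id']))
            if len(batch) == BATCH_SIZE:
                cur.executemany('INSERT INTO Tweet VALUES (?,?,?,?)', batch)
                batch = []
    if batch:                                 # flush the final, partial batch
        cur.executemany('INSERT INTO Tweet VALUES (?,?,?,?)', batch)
    conn.commit()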

e) Plot the resulting runtimes (# of tweets versus runtimes) using matplotlib for parts a-c.
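
A plotting skeleton (the zeros are placeholders; substitute your own measured runtimes):

    import matplotlib.pyplot as plt

    sizes = [1000, 50000, 100000, 500000]
    runtimes = {                              # placeholder zeros, not real measurements
        'a: save to file':   [0, 0, 0, 0],
        'b: load from URL':  [0, 0, 0, 0],
        'c: load from file': [0, 0, 0, 0],
    }
    for label, times in runtimes.items():
        plt.plot(sizes, times, marker='o', label=label)
    plt.xlabel('# of tweets')
    plt.ylabel('runtime (seconds)')
    plt.legend()
    plt.savefig('part1e.png')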

Part 2

a) Write and execute a SQL query to find the average longitude and latitude values for each user name.
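
For example (a sketch that assumes the Part 1 schema, where coordinates live in the Geo table; adjust the join to match your own tables):

    import sqlite3

    conn = sqlite3.connect('tweets.db')
    query = '''SELECT u.name, AVG(g.longitude), AVG(g.latitude)
               FROM User u
               JOIN Tweet t ON t.user_id = u.id
               JOIN Geo g ON g.tweet_id = t.id
               GROUP BY u.name'''
    for row in conn.execute(query):
        print(row)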

b) Re-execute the query from part 2-a 10 times and 100 times, and measure the total runtime (just re-run the same exact query multiple times using a for-loop). Does the runtime scale linearly? (i.e., does it take 10X and 100X as much time?)
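
A timing loop along these lines works for both repetition counts, and the same pattern can time part 2-d:

    import sqlite3
    import time

    conn = sqlite3.connect('tweets.db')
    query = '''SELECT u.name, AVG(g.longitude), AVG(g.latitude)
               FROM User u
               JOIN Tweet t ON t.user_id = u.id
               JOIN Geo g ON g.tweet_id = t.id
               GROUP BY u.name'''            # the part 2-a query
    start = time.time()
    for _ in range(10):                       # then repeat with range(100)
        conn.execute(query).fetchall()        # fetch so the query actually runs
    print('total runtime:', time.time() - start, 'seconds')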

c) Write the equivalent of the 2-a query in Python (without using SQL) by reading the tweet data from the file with 500,000 tweets.
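
One SQL-free version (a sketch; it skips tweets without coordinates, mirroring the join in 2-a):

    import json

    totals = {}                               # name -> [lon_sum, lat_sum, count]
    with open('tweets.txt', encoding='utf-8') as f:
        for line in f:
            try:
                t = json.loads(line)
            except ValueError:
                continue
            geo = t.get('geo')
            if not geo or not geo.get('coordinates'):
                continue                      # geo-tagged tweets only
            lat, lon = geo['coordinates']     # the 'geo' field stores [lat, lon]
            entry = totals.setdefault(t['user']['name'], [0.0, 0.0, 0])
            entry[0] += lon
            entry[1] += lat
            entry[2] += 1

    averages = {name: (lon / n, lat / n) for name, (lon, lat, n) in totals.items()}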

d) Re-execute the query from part 2-c 10 times and 100 times and measure the total runtime. Does the runtime scale linearly?

Part 3

a) Export the contents of the largest (500k tweets, although the User table will contain fewer rows) User table from SQLite into a sequence of INSERT SQL statements. This is very similar to what you already did in a previous assignment; however, you have to add a new ID column, which has to be a string (you cannot use digits).
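
One way to generate the statements (a sketch; string_id is a hypothetical helper that converts the row number into a digits-free, letters-only ID, and every value is quoted as text for simplicity):

    import sqlite3
    import string

    def string_id(n):                         # hypothetical: base-26, letters only
        s = ''
        while True:
            s = string.ascii_lowercase[n % 26] + s
            n //= 26
            if n == 0:
                return 'id_' + s

    conn = sqlite3.connect('tweets.db')
    with open('user_inserts.sql', 'w', encoding='utf-8') as out:
        for n, row in enumerate(conn.execute('SELECT * FROM User')):
            vals = ', '.join("'" + str(v).replace("'", "''") + "'" for v in row)
            out.write("INSERT INTO User VALUES ('%s', %s);\n" % (string_id(n), vals))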

b) Export all three tables (Tweet, User, and Geo) from the largest (500k records) database into |-separated text files, one file per table. Do not generate SQL INSERT statements, just raw |-separated text data. Each exported file gets one extra column, as described below; a combined sketch follows this list.

For the Geo table, add a new column with the relative distance from a fixed point, the location of CDM (41.878668, -87.625555). You can treat it as a Euclidean distance (although feel free to compute a real distance in miles).

For the Tweet table, add a new column with data from the User table (“screen_name”) in addition to the existing columns. You do not need to modify the original table in SQLite, although you can.

For the User table file, add a column that specifies how many tweets by that user are currently in the database. You do not need to modify the User table itself, just the exported User file.
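
A combined sketch for the three exports (it assumes the Part 1 schema and column order; the distance is computed in Python because not every SQLite build provides a sqrt() function):

    import math
    import sqlite3

    conn = sqlite3.connect('tweets.db')
    CDM_LAT, CDM_LON = 41.878668, -87.625555

    def dump(filename, rows):                 # write rows as |-separated lines
        with open(filename, 'w', encoding='utf-8') as out:
            for row in rows:
                out.write('|'.join(str(v) for v in row) + '\n')

    # Geo: append the Euclidean distance from CDM
    geo_rows = []
    for row in conn.execute('SELECT * FROM Geo'):
        lat, lon = row[-2], row[-1]           # assumed column positions
        dist = math.sqrt((lat - CDM_LAT) ** 2 + (lon - CDM_LON) ** 2)
        geo_rows.append(row + (dist,))
    dump('geo.txt', geo_rows)

    # Tweet: append the author's screen_name via a join
    dump('tweet.txt', conn.execute(
        'SELECT t.*, u.screen_name FROM Tweet t JOIN User u ON t.user_id = u.id'))

    # User: append each user's tweet count, leaving the table itself unchanged
    dump('user.txt', conn.execute(
        'SELECT u.*, COUNT(t.id) FROM User u '
        'LEFT JOIN Tweet t ON t.user_id = u.id GROUP BY u.id'))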