Midterm_2022_Spring-No-Solu (1)
Inspirational Quotes
Text is a popular form of data. To switch up the mood, let’s examine some inspirational quotes for our exam!
Exam Rules
This is an open-book, open-note, open-internet exam.
You may not consult another intelligent being during this exam, nor may you distribute this exam on any platform.
You need to submit a .ipynb file AND the corresponding HTML file for your exam.
The following can incur a penalty on your exam:
Attempting to print out all of the data (unless explicitly asked to).
Hardcoding solutions (unless you were instructed to do so).
Asking the following types of questions:
Questions about the differences between your solution and the backup dataset.
Questions about partial credit or specific grade distributions within a problem.
Questions about installing Jupyter Notebook or packages used in class.
The backup datasets are not meant to validate your answers, and steps have been taken to prevent you from doing so. Using a backup dataset will result in a small penalty for the question that was meant to generate that dataset. If you later solve the problem and remove the use of the backup dataset, there will be no penalty.
If you use an external source to solve a problem on this exam, please cite the URL in a comment so we know you took it from an online source and not from each other.
Please read the entire exam before asking questions.
Question 0 – Honor Code
We, the students of Columbia University, hereby pledge to value the integrity of our ideas and the ideas of others by honestly presenting our work, respecting authorship, and striving not simply for answers but for understanding in the pursuit of our common scholastic goals. In this way, we seek to build an academic community governed by our collective efforts, diligence, and Code of Honor.
I affirm that I will not plagiarize, use unauthorized materials, or give or receive illegitimate help on assignments, papers, or examinations. I will also uphold equity and honesty in the evaluation of my work and the work of others. I do so to sustain a community built around this Code of Honor.
Have you read the honor code, and do you agree to it?
To answer this, please create a variable called “i_will_follow_the_honor_code” and assign the appropriate True or False value to it.
You must also create a string variable called “UNI” that contains your UNI.
WARNING: You will receive 0 points for the exam if you do not get this problem correct.
i_will_follow_the_honor_code =
Q1 – Getting the data (8 pts)
Quotable is an open-source API that serves quotes.
Please read its documentation to obtain all quotes with the tag “famous-quotes” (please keep the metadata associated with each quote) and assign them to a variable called “fqs”. The metadata related to the query itself can optionally be discarded.
Please print out the length of “fqs” in a human-readable message.
If this API crashes due to the volume of our queries, please wait 30 seconds before calling the API again, or take a screenshot showing both the error message and the time on your screen, then use the backup dataset and move forward. You will not be penalized if your code is correct but the API is not responsive.
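Collecting every page of a paginated listing can be sketched as follows. This is only an illustration under stated assumptions: the helper `collect_all_pages` and the offline `fake_fetch` are hypothetical, and the response fields `results` and `totalPages` are assumptions about the response shape that should be checked against Quotable's documentation.

```python
from typing import Callable, Dict, List


def collect_all_pages(fetch_page: Callable[[int], Dict]) -> List[Dict]:
    """Call fetch_page(1), fetch_page(2), ... and concatenate the results.

    Assumes each payload has a "results" list and a "totalPages" count.
    """
    items: List[Dict] = []
    page = 1
    while True:
        payload = fetch_page(page)
        items.extend(payload["results"])
        if page >= payload["totalPages"]:
            break
        page += 1
    return items


# Hypothetical stand-in for a real HTTP call so the sketch runs offline.
def fake_fetch(page: int) -> Dict:
    data = {1: ["a", "b"], 2: ["c"]}
    return {
        "page": page,
        "totalPages": 2,
        "results": [{"content": c} for c in data[page]],
    }


quotes = collect_all_pages(fake_fetch)
print(f"Collected {len(quotes)} quotes.")
```

In a real run, `fetch_page` would wrap an HTTP GET against the endpoint and query parameters described in the documentation, and would return the decoded JSON for that page.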
Q2 – Exploring the data (6 pts)
How many unique tags are within our dataset? Be sure to use print() with a human-readable message for your answer.
Which author has the most quotes in our dataset? Be sure to use print() with a human-readable message for your answer.
Please print out the longest quote, i.e. the quote with the greatest number of characters. Be sure to use print() with a human-readable message for your answer.
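The three questions above can be illustrated on a tiny made-up list. The records below are invented, and the field names `author`, `content`, and `tags` are assumptions about how each quote is stored; check them against the actual data.

```python
from collections import Counter

# Invented records; the field names are assumptions about the quote format.
toy_quotes = [
    {"author": "A", "content": "Short.", "tags": ["famous-quotes", "wisdom"]},
    {"author": "B", "content": "A somewhat longer quote.", "tags": ["famous-quotes"]},
    {"author": "A", "content": "Mid-length.", "tags": ["famous-quotes", "life"]},
]

# A set removes duplicate tags across quotes.
unique_tags = {tag for q in toy_quotes for tag in q["tags"]}

# Counter tallies quotes per author; most_common(1) returns the top pair.
top_author, top_count = Counter(q["author"] for q in toy_quotes).most_common(1)[0]

# max with a key picks the quote whose text has the most characters.
longest = max(toy_quotes, key=lambda q: len(q["content"]))

print(f"There are {len(unique_tags)} unique tags.")
print(f"{top_author} has the most quotes ({top_count}).")
print(f"The longest quote is: {longest['content']}")
```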
Q3 – Searching the quotes (6 pts)
Please assign the unique words that follow the word “great” or “greatest” to a variable called “great_followers”. For example, in “Greatest weather…” the word following “greatest” is “weather”, and in “it’s great to…” the word following “great” is “to”. Note that if a quote contains two or more matches, you only need to grab one of the words that follows “great” or “greatest”.
Please use a regular expression in your solution.
Please print out “great_followers” AFTER sorting them according to Python’s natural ordering of strings.
Please print out the length of “great_followers”.
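The two examples from the prompt can be checked with a small regular-expression sketch. The sample sentences are made up, and several choices here are assumptions rather than requirements of the question: matching is case-insensitive, a "word" is taken to be a run of `\w` characters, and results are lowercased.

```python
import re

# Made-up sentences built around the two examples in the prompt.
samples = [
    "Greatest weather we have had all year.",
    "It's great to be here.",
    "This one is greater than the rest.",  # "greater" should not match
]

# \bgreat(?:est)? matches "great" or "greatest" at a word start,
# \s+ requires whitespace, and (\w+) captures the following word.
pattern = re.compile(r"\bgreat(?:est)?\s+(\w+)", re.IGNORECASE)

followers = set()
for text in samples:
    match = pattern.search(text)  # search() returns the first match only
    if match:
        followers.add(match.group(1).lower())

print(sorted(followers))
```

Because `search` stops at the first occurrence, a quote with several matches contributes one word, which is all the note above requires.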
Q4 – Wrangling the data (8 pts)
Please wrangle the data such that we have a data frame where each row represents a different author and the columns are titled:
name: The author’s name
authorSlug: authorSlug in the original data
totalQuotes: The number of quotes in the dataset associated with this author
percModified: The percent of quotes that were modified since being added
avgModifiedDays: The average number of days between a quote’s addition and its modification, across the quotes from this author. Note that an unmodified quote should be recorded as 0 days, given that “dateModified” is set to “dateAdded” by default.
avgQuoteLength: The average number of characters in the quotes from this author (ignore translation issues for the exam please).
Assign the data frame to a variable called “df”
Print out a description of the “shape” of this data frame with a human-readable message.
Please show the 3 rows with the highest “avgModifiedDays” in this data frame.
# Starter code: load the quotes from Q1 into a data frame and preview the first 2 rows.
import pandas as pd
weird_df = pd.DataFrame(fqs)
weird_df.head(2)
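The column definitions above can be made concrete for a single author using only the standard library. This sketch illustrates the arithmetic, not the data-frame construction. The records are made up, and it assumes that the dates are ISO-formatted `YYYY-MM-DD` strings and that a quote counts as modified when the two dates differ by at least one day.

```python
from datetime import date
from statistics import mean

# Made-up quotes for one hypothetical author; the date format is an assumption.
toy = [
    {"content": "abcde", "dateAdded": "2020-01-01", "dateModified": "2020-01-01"},
    {"content": "abcdefghij", "dateAdded": "2020-01-01", "dateModified": "2020-01-11"},
]


def modified_days(quote: dict) -> int:
    """Days between addition and modification (0 for an unmodified quote)."""
    added = date.fromisoformat(quote["dateAdded"])
    modified = date.fromisoformat(quote["dateModified"])
    return (modified - added).days


days = [modified_days(q) for q in toy]

total_quotes = len(toy)
perc_modified = 100 * sum(d > 0 for d in days) / total_quotes
avg_modified_days = mean(days)
avg_quote_length = mean(len(q["content"]) for q in toy)

print(total_quotes, perc_modified, avg_modified_days, avg_quote_length)
```

Here one of the two quotes was modified, 10 days after it was added, so the author's row would show 50 percent modified and an average of 5 days.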
Q5 – Visualization (6 pts)
Please visualize the relationship between the average length of quotes and the chance of a quote being modified, across authors, using your answer from Q4. Please be sure to label your axes and have a descriptive title. The title does not have to be insightful.
Please articulate, in 2 or fewer sentences, why calculating a correlation between these two variables using our data frame could be flawed (no code).
Q6 – Data science Question (no code – 3 pts)
We will assume that a modified quote happens only due to typos in the quote or attributing the quote to the wrong author, excluding minor modifications like adding tags or adding a period at the end of the quote.
In this scenario, a large value for avgModifiedDays could imply that the quotes are not reliable even after a reasonable amount of time, i.e. bad data quality. However, this could also create a bad incentive, where Quotable is discouraged from modifying any quotes in order to keep this metric low.
Please propose a solution, with an explanation of no more than 5 sentences, that could solve this problem. You can propose a new metric, add an additional metric, or change definitions of the data. Your solution should be simple and cannot rely on additional data sources.