F21_APS1070_Project_3
Project 3, APS1070 Fall 2021
PCA [10 marks]
Deadline: Nov 5th, 21:00
Academic Integrity
This project is individual: it is to be completed on your own. If you have questions, please post your query in the APS1070 Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
Please fill out the following:
Student number:
In this project we work on a Covid-19 dataset that reports the number of cases for different countries at the end of each day.
Part 1: Getting started [1 Mark]
import pandas as pd

# Load the confirmed-cases table; the first column becomes the row index.
cases_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/aps1070-2019/datasets/master/confirmed-june21.csv',
    index_col=0,
    thousands=','
)
Write a function to do the following: [0.25]
Takes the dataframe, and your country list as inputs (US, China, Canada, …)
Plots time-series for the input list (it is best to plot each country in a separate graph (subplot), so you can easily compare them.)
Apply StandardScaler to the data. Each day should have a mean of zero and a standard deviation of 1. [0.25]
Run the function in step 1 on the standardized dataset for the US, China, and Canada. [0.25]
Discuss the trends in the standardized time-series for the US, Canada, and China. What does it mean if the curve goes up or down (is the number of Covid-19 cases negative)? What does the sign of the values indicate? [0.25]
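As a rough illustration of the standardization step, the sketch below applies StandardScaler to a small made-up table instead of the real dataset. The frame `toy`, its country labels, and its numbers are assumptions for illustration only; the layout (countries as rows, days as columns) follows the description above. Plotting is left out.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for cases_raw: rows are countries, columns are days (made-up numbers).
toy = pd.DataFrame(
    {
        "day1": [1.0, 2.0, 3.0, 10.0],
        "day2": [2.0, 4.0, 6.0, 40.0],
        "day3": [3.0, 8.0, 9.0, 90.0],
    },
    index=["A", "B", "C", "D"],
)

scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
# to keep the country index and the day labels.
toy_std = pd.DataFrame(
    scaler.fit_transform(toy), index=toy.index, columns=toy.columns
)

# Each column (day) now has mean 0 and population standard deviation 1.
print(toy_std.round(2))
print(toy_std.mean(axis=0).round(6))
print(toy_std.std(axis=0, ddof=0).round(6))
```

Note that StandardScaler works column by column, so a standardized value is measured relative to the other countries on the same day. Country "A" is below the daily mean on every day here, so all of its standardized values are negative even though its raw counts are positive.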
### YOUR CODE HERE ###
Part 2: Applying PCA [2 Marks]
Compute the covariance matrix of the dataframe. Hint: The dimensions of your covariance matrix should be (511, 511). [0.25]
Write a function get_sorted_eigen(df_cov) that gets the covariance matrix of dataframe df (from step 1), and returns sorted eigenvalues and eigenvectors using np.linalg.eigh. [0.25]
Show the effectiveness of your principal components in covering the variance of the dataset with a scree plot. [0.25]
How many PCs do you need to cover 99% of the dataset’s variance? [0.25]
Plot the first 16 principal components (eigenvectors) as time series: 16 subplots, with dates on the x-axis and the value of the PC element on the y-axis. [0.5]
Compare the first few PCs with the rest of them. Do you see any difference in their trend? [0.5]
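One possible shape for `get_sorted_eigen`, together with the cumulative explained variance behind a scree plot, is sketched below on random data. The matrix `X`, its size, and the helper name `n_components_99` are assumptions for illustration; sorting in descending order is also an assumption, since the task only asks for "sorted" output.

```python
import numpy as np


def get_sorted_eigen(df_cov):
    # eigh is intended for symmetric matrices and returns eigenvalues
    # in ascending order, so reorder them from largest to smallest.
    eigen_values, eigen_vectors = np.linalg.eigh(df_cov)
    order = np.argsort(eigen_values)[::-1]
    return eigen_values[order], eigen_vectors[:, order]


# Toy data: 50 samples, 6 features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

# rowvar=False treats columns as variables, giving a (6, 6) matrix here.
cov = np.cov(X, rowvar=False)
eigen_values, eigen_vectors = get_sorted_eigen(cov)

# Cumulative fraction of variance covered by the first k components.
explained = np.cumsum(eigen_values) / np.sum(eigen_values)

# Smallest k whose cumulative fraction reaches 99%.
n_components_99 = int(np.searchsorted(explained, 0.99) + 1)
print(explained.round(4), n_components_99)
```

With the real data, the same `rowvar=False` convention is what yields one row and column per day in the covariance matrix.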
### YOUR CODE HERE ###
Part 3: Data reconstruction [3 Marks]
Create a function that:
Accepts a country and the original dataset as inputs.
Calls useful functions that you designed in previous parts to compute eigenvectors and eigenvalues.
Plots 4 figures:
The original time-series for the specified country. [0.5]
The incremental reconstruction of the original (not standardized) time-series for the specified country in a single plot. [1.5]
You should show at least 5 curves in a figure for incremental reconstruction. For example, you can pick the following (or any other combination that you think is reasonable):
Reconstruction with only PC1
Reconstruction with both PC1 and PC2
Reconstruction with PC1 to PC4 (First 4 PCs)
Reconstruction with PC1 to PC8 (First 8 PCs)
Reconstruction with PC1 to PC16 (First 16 PCs)
Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized form) using the StandardScaler inverse_transform help…
The residual error for your best reconstruction with respect to the original time-series. [0.5] Hint: You are plotting the error that we have for reconstructing each day (df - df_reconstructed). On the x-axis, you have dates, and on the y-axis, the residual error.
The RMSE of the reconstruction as a function of the number of included components (x-axis is the number of components and y-axis is the RMSE). Sweep x-axis from 1 to 10 (this part is independent from part 3.2.) [1]
Test your function using the US, Canada, and China as inputs. [0.5]
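The reconstruction hint above can be sketched on toy data as follows: project the standardized rows onto the first k eigenvectors, map back, then undo the scaling with `inverse_transform`. The array `X`, the helper `reconstruct`, and the chosen `row` are assumptions for illustration, and no plotting is done.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Toy data: 30 "countries" by 8 "days"; cumulative sums make each row
# look like a growing case count (made-up numbers).
X = np.cumsum(rng.uniform(0, 100, size=(30, 8)), axis=1)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Eigenvectors of the covariance matrix, sorted by descending eigenvalue.
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigen_values)[::-1]
eigen_vectors = eigen_vectors[:, order]


def reconstruct(k):
    # Keep the first k principal directions, project, and map back.
    W = eigen_vectors[:, :k]
    X_std_hat = X_std @ W @ W.T
    # Undo the standardization to return to the original units.
    return scaler.inverse_transform(X_std_hat)


row = 0  # index of the "country" to inspect
rmse = [
    np.sqrt(np.mean((X[row] - reconstruct(k)[row]) ** 2))
    for k in range(1, X.shape[1] + 1)
]
print(np.round(rmse, 4))
```

When all components are kept, the projection matrix is the identity, so the last RMSE value should be zero up to floating-point error.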
def plot_country_figures(original_df, country_name):
    ### YOUR CODE HERE ###
Part 4: SVD [2 Marks]
Modify your code in part 3 to use SVD instead of PCA for extracting the eigenvectors. [1]
Explain if standardization or covariance computation is required for this part.
Repeat part 3 and compare your PCA and SVD results. [1]
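For comparing the two routes, the sketch below checks on random data that the right singular vectors of a centered data matrix match the eigenvectors of its covariance matrix up to sign, and that the squared singular values divided by n - 1 match the eigenvalues. The matrix `X` and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data with correlated columns: 40 samples, 5 features.
X = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)  # center each column

# Route 1: eigendecomposition of the covariance matrix.
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigen_values)[::-1]
eigen_values, eigen_vectors = eigen_values[order], eigen_vectors[:, order]

# Route 2: SVD of the centered data matrix itself.
# Singular values come back in descending order; rows of Vt are the directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigen_values = S ** 2 / (Xc.shape[0] - 1)

# Compare eigenvalues directly, and directions up to a sign flip.
print(np.allclose(svd_eigen_values, eigen_values))
print(np.allclose(np.abs(Vt.T), np.abs(eigen_vectors), atol=1e-6))
```

In this sketch the SVD route never forms the covariance matrix explicitly; the data is centered first so that both routes describe the same quantity.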
### YOUR CODE HERE ###
Part 5: Let’s collect a more recent dataset! [2 Marks]
Create a more recent dataset similar to the one provided in your handout using the raw information provided here. [1]
You need to manipulate the data to organize it in the desired format. You are free to use any tools you like, from Excel to Python!
In the end, you should have a new CSV file with more dates (features) compared to the provided dataset.
Upload your new dataset (in CSV format) to your colab notebook and repeat part 4. [1]
Don’t forget to add your new CSV file to your GitHub repo. The code below helps you to upload your new CSV file to your colab session.
# Upload your new CSV file to the Google Colab session
from google.colab import files
uploaded = files.upload()
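If the raw data comes in long format (one row per date and region), one way to reshape it into the handout's layout of one row per country and one column per date is a pandas pivot. The sketch below uses a tiny inline table; the column names `Date`, `Country`, `Province`, and `Confirmed` and all values are assumptions, not the actual schema of the raw source.

```python
import io
import pandas as pd

# Hypothetical long-format extract (made-up values, assumed column names).
raw_csv = io.StringIO(
    "Date,Country,Province,Confirmed\n"
    "2021-10-01,Canada,Ontario,10\n"
    "2021-10-01,Canada,Quebec,5\n"
    "2021-10-02,Canada,Ontario,12\n"
    "2021-10-02,Canada,Quebec,7\n"
    "2021-10-01,US,,100\n"
    "2021-10-02,US,,130\n"
)
long_df = pd.read_csv(raw_csv)

# One row per country, one column per date; provinces are summed per country.
wide_df = long_df.pivot_table(
    index="Country", columns="Date", values="Confirmed", aggfunc="sum"
)

# Serialize to CSV text; pass a file path to to_csv to write a file instead.
csv_text = wide_df.to_csv()
print(csv_text)
```

Summing over provinces is a design choice that fits cumulative counts reported per region; a source that already reports country totals would not need it.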
References
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/
Snippets from: https://plot.ly/ipython-notebooks/principal-component-analysis/

3.7 Principal Component Analysis


Covid Data:
https://www.worldometers.info/coronavirus/
https://datahub.io/core/covid-19#resource-time-series-19-covid-combined