CIS 545 Homework 4 : Machine Learning¶
Due April 10th, 10pm EST¶
Worth 100 points in total¶
Hopefully everyone is safe and doing well! We hope to continue to equip your data science toolkit with new skills throughout the remainder of the semester.
This homework will give you hands-on experience with machine learning in sklearn and scalable machine learning with Spark ML!
Since most of us are isolated in the comfort and safety of our homes, our biggest source of entertainment is through online media platforms like Netflix, Prime Video, Hulu and YouTube. We will be exploring what makes videos successful on these platforms focusing on YouTube’s data.
PLEASE READ THE FAQ as you do this assignment! It’s pinned on Piazza, and we TAs work hard to keep it updated with everything you might need to know and anything we may have failed to specify. Writing these homeworks and test cases gets tricky, since students often implement solutions we did not anticipate and therefore could not have prepared the grader for.
Libraries and Setup Jargon!¶
Run the following cells to set up the notebook. When prompted for a selection, select the number associated with Java 8.
In [0]:
! sudo apt install openjdk-8-jdk
! sudo update-alternatives --config java
In [0]:
%%capture
!pip3 install penngrader
from penngrader.grader import *
VERY IMPORTANT: Enter your 8-digit Penn ID in the student_id field below.
PLEASE NOTE: There are some questions, for example making plots, that do not have test cases. All questions without an autograder attached will be manually graded.
In [0]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHO TO ASSIGN POINTS TO IN OUR BACKEND
STUDENT_ID =  # YOUR PENN-ID GOES HERE AS AN INTEGER
In [0]:
grader = PennGrader(homework_id = 'CIS545_Spring_2020_HW4', student_id = STUDENT_ID)
In [0]:
import numpy as np
import pandas as pd
import json
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from datetime import datetime
import glob
import seaborn as sns
import re
import os
In [0]:
import boto3
from botocore import UNSIGNED
from botocore.config import Config
s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
s3.Bucket('penn-cis545-files').download_file('youtube_data.zip', 'youtube_data.zip')
!unzip /content/youtube_data.zip
Section 1 : Machine Learning with Sklearn (45 points)¶
1.1 Data loading and Preprocessing (5 pts)¶
The dataset we will be using is a daily record of the top trending YouTube videos.
To determine the year’s top-trending videos, YouTube uses a combination of factors, including measures of user interactions (number of views, shares, comments and likes). “Note that they’re not the most-viewed videos overall for the calendar year.” Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well known for.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for numerous countries, with up to 200 listed trending videos per day.
Each region’s data is in a separate file. Data includes:
• Video Title
• Channel title
• Publish time
• Tags
• Views
• Likes
• Dislikes
• Description
• Comment count
The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the five regions in the dataset.
For more information on specific columns in the dataset refer to the column metadata.
1.1.1: Combining Multiple CSVs (2 pts.)¶
There are multiple CSV files in the dataset, each corresponding to a specific country. As a first step, read these CSV files and combine them into a single dataframe. Use 'video_id' as your index.
While combining them, also create a 'country' column and fill it in the final dataframe. The country name can be extracted from the filename itself.
Name your dataframe 'combined_data'.
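For reference, a minimal hedged sketch of one possible approach (it assumes the filenames look like USvideos.csv, so the first two characters give the country code; adapt as needed):
# Hedged sketch: one possible way, not the required solution.
# Assumes filenames like /content/youtube_data/USvideos.csv; some regional
# files may also need an explicit encoding (e.g. encoding='latin-1').
all_dataframes = []
for csv_path in sorted(glob.glob('/content/youtube_data/*.csv')):
    country = os.path.basename(csv_path)[:2]           # e.g. 'US'
    df = pd.read_csv(csv_path, index_col='video_id')   # use video_id as the index
    df['country'] = country                            # tag every row with its country
    all_dataframes.append(df)
combined_data = pd.concat(all_dataframes)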
In [0]:
# Import all the csv files
files = sorted(glob.glob('/content/youtube_data/*.csv'))

# TODO: Combine all into a single dataframe "combined_data" and add a 'country' column.
all_dataframes = list()
for csv in files:
    pass  # TODO: read each csv, add its 'country', and append the dataframe to all_dataframes
combined_data = pd.concat(all_dataframes)
In [0]:
# Grader cell
# 2 pts
grader.grade('check_combined_dataframe', (combined_data.shape))
grader.grade('check_country', (combined_data.country))
1.1.2: Map category IDs to categories (2 pts)¶
Read the category_id.json file and map the category_ids in the dataframe to category names.
Use json.load to read the JSON file into a Python dictionary, then map each category id in the dataframe to the corresponding category name from the JSON file.
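A hedged sketch, assuming the JSON follows the standard YouTube API layout (an 'items' list whose entries have an 'id' and a 'snippet' with a 'title'); adjust the filename to the one in your dataset:
# Hedged sketch: the file name US_category_id.json is an assumption.
with open('/content/youtube_data/US_category_id.json') as f:
    categories = json.load(f)
id_to_name = {item['id']: item['snippet']['title'] for item in categories['items']}
combined_data.insert(4, 'category', combined_data['category_id'].map(id_to_name))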
In [0]:
combined_data['category_id'] = combined_data['category_id'].astype(str)
# Your code goes here
combined_data.insert(4, 'category', ...)  # TODO: replace ... with the category names mapped from category_id
In [0]:
# Grader Cell 2 pts
grader.grade('check_category_mapping', (combined_data.category))
1.1.3: Fix datetime format and remove rows with NAs (1 pt)¶
The 'publish_time' and 'trending_date' features are not stored as datetimes, so use pandas' to_datetime() API to convert them into the right format.
Once that is done, remove all rows that contain NAs.
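A hedged sketch; the trending_date format string below is typical for this dataset but should be verified against your own data:
# Hedged sketch: trending_date usually looks like '17.14.11', i.e. '%y.%d.%m'.
combined_data['trending_date'] = pd.to_datetime(combined_data['trending_date'], format='%y.%d.%m')
combined_data['publish_time'] = pd.to_datetime(combined_data['publish_time'])
combined_data = combined_data.dropna()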
In [0]:
# Your code goes here
combined_data['trending_date'] =
combined_data['publish_time'] =
# Code to remove NA’s
combined_data =
In [0]:
# Grader cell 1 pt
grader.grade('validate_na', (combined_data.shape))
1.2 EDA and Feature Engineering (20 pts)¶
EDA: Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Feature Engineering: Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.
1.2.1: Mean, standard deviation, min and max. (1 pt)¶
Compute some simple statistics (mean, standard deviation, min and max) for each of the numerical features in the dataset and store them in lists, in the order [views, likes, dislikes, comment_count].
means = [views_mean, likes_mean, dislikes_mean, comment_count_mean], and similarly for mins, maxs and stds.
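A minimal sketch:
# Compute each statistic in the order [views, likes, dislikes, comment_count].
numeric_cols = ['views', 'likes', 'dislikes', 'comment_count']
means = [combined_data[c].mean() for c in numeric_cols]
stds = [combined_data[c].std() for c in numeric_cols]
mins = [combined_data[c].min() for c in numeric_cols]
maxs = [combined_data[c].max() for c in numeric_cols]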
In [0]:
# Your code goes here
maxs =
mins =
stds =
means =
In [0]:
# Grader cell 1 pt
grader.grade('check_min_max_mean_std', ([maxs, mins, stds, means]))
1.2.2: Rescale features (1 pt)¶
As you can observe from the computation above, the values span a very wide range. To avoid numerical instability issues, rescale likes, views, dislikes and comment_count to a log scale (base e) and store them in the dataframe as likes_log, views_log, dislikes_log and comment_log.
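A hedged sketch; the +1 offset below guards against log(0) for videos with zero likes, dislikes or comments, but check which convention the autograder expects:
# Hedged sketch: natural log with a +1 offset (an assumption, not a requirement).
combined_data['views_log'] = np.log(combined_data['views'] + 1)
combined_data['likes_log'] = np.log(combined_data['likes'] + 1)
combined_data['dislikes_log'] = np.log(combined_data['dislikes'] + 1)
combined_data['comment_log'] = np.log(combined_data['comment_count'] + 1)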
In [0]:
# Your code goes here
combined_data['likes_log'] =
combined_data['views_log'] =
combined_data['dislikes_log'] =
combined_data['comment_log'] =
In [0]:
# Grader cell 1 pt
grader.grade('check_feature_rescaling', ([np.mean(combined_data['likes_log']), np.mean(combined_data['views_log']), np.mean(combined_data['dislikes_log']),
np.mean(combined_data['comment_log'])]))
1.2.3: Plot the distribution (2 pt)¶
Plot the distribution for the newly created log features. They should look like normal distribution curves.
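One hedged way to draw these with seaborn (already imported above):
# Hedged sketch: one histogram/KDE per log feature.
log_cols = ['views_log', 'likes_log', 'dislikes_log', 'comment_log']
fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, col in zip(axes, log_cols):
    sns.distplot(combined_data[col], ax=ax)
plt.show()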
In [0]:
# Your code goes here
# Plots will be manually graded
1.2.4: Comparing views, likes, dislikes against categories (3 pt)¶
As a next step, try to gain insights into the data using categories, views, likes and dislikes.
Draw three plots:
1.) How many videos are there for each category?
2.) What is the distribution of views against categories? (Use boxplot and views on log scale)
3.) What is the distribution of dislikes against categories? (Use boxplot and dislikes on log scale)
For extra credit: You can try to gain more insights into the dataset by drawing interesting plots. Some ideas include:
• How long does a video trend in a country?
• What are some videos which got popular because they were disliked?
Think of such interesting things and add them here. We will award points based on the creativity of the insights you find.
In [0]:
# Your code goes here
# Plots will be manually graded
1.2.5: Feature Engineering (8 pts)¶
a. Processing tags (1 pt)¶
The tags feature in the dataset uses a delimiter. Use that delimiter to count the number of tags, create a feature called num_tags, and add it to the dataset.
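A hedged sketch, assuming the tags are separated by the '|' character (verify against a few rows of your data first):
# Hedged sketch: '|' as the delimiter is an assumption; videos with no tags may need special handling.
combined_data['num_tags'] = combined_data['tags'].apply(lambda t: len(str(t).split('|')))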
In [0]:
# Your code goes here
combined_data["num_tags"] =  # TODO: number of tags
b. Processing description and title (2 pts.)¶
Compute the length of the description and the title and add them as features to the dataset.
In [0]:
# Your code goes here
combined_data["desc_len"] =  # TODO: description length
In [0]:
# Your code goes here
combined_data["len_title"] =  # TODO: title length
In [0]:
# Grader cell 3 pts.
grader.grade('check_tags_title_description', ([combined_data['num_tags'].describe(), combined_data['desc_len'].describe(), combined_data['len_title'].describe()]))
c. Processing publish_time. (4 pts.)¶
Split the 'publish_time' feature into three parts: time, date, and weekday, where time contains the time component of the original feature, and date and weekday store the corresponding date and weekday number, respectively. Number weekdays starting with 1 for Monday and ending with 7 for Sunday.
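A hedged sketch using the pandas .dt accessors; note that publish_time should be overwritten last, after the date and weekday have been extracted from it:
# Hedged sketch: dt.dayofweek is 0 for Monday, so add 1 to get 1 (Monday) .. 7 (Sunday).
combined_data['publish_date'] = combined_data['publish_time'].dt.date
combined_data['publish_weekday'] = combined_data['publish_time'].dt.dayofweek + 1
combined_data['publish_time'] = combined_data['publish_time'].dt.time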
In [0]:
import random

# Your code goes here
combined_data['publish_time'] =
combined_data['publish_date'] =
# day of the week on which the video was published
combined_data['publish_weekday'] =
random_index = random.randint(0, combined_data.shape[0] - 1)
In [0]:
# Grader cell 4 pts
grader.grade('check_date_time_processing', ([combined_data['publish_time'].iloc[random_index], combined_data['publish_date'].iloc[random_index], sorted(list(combined_data["publish_weekday"].value_counts()))]))
d. Number of videos per weekday (1 pt)¶
Compute the number of videos published per day of the week. Which day of the week do people publish most videos?
In [0]:
# Your code goes here
# Plots will be manually graded
1.2.6: Drop all non numeric columns (1 pt.)¶
Drop all the non-numeric columns, since we have already processed them and captured their information as numeric features. Also drop the original views, likes, dislikes and comment_count columns, as you have already log-transformed them and stored them as separate features.
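A hedged sketch; the exact column list depends on your earlier steps, so treat the names below as placeholders rather than the definitive answer. The idea is to drop free-text and datetime columns whose information has already been distilled into numeric features, while keeping the categorical columns still needed for one-hot encoding in 1.2.7:
# Hedged sketch: cols_to_drop is an assumed list; adjust it to your dataframe.
cols_to_drop = ['title', 'channel_title', 'tags', 'description', 'thumbnail_link',
                'trending_date', 'publish_time', 'publish_date',
                'views', 'likes', 'dislikes', 'comment_count']
combined_data = combined_data.drop(columns=[c for c in cols_to_drop if c in combined_data.columns])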
In [0]:
# Your code goes here
1.2.7: Convert categorical features in the dataset into one hot vectors. (3 pts)¶
There are three categorical features remaining in the dataset; identify them and convert them into one-hot vectors.
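A hedged sketch, assuming the three remaining categorical columns are category, country and publish_weekday (verify this against your dataframe):
# Hedged sketch: pd.get_dummies expands each listed column into indicator columns.
combined_data = pd.get_dummies(combined_data, columns=['category', 'country', 'publish_weekday'])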
In [0]:
combined_data.publish_weekday = combined_data.publish_weekday.astype('category')
combined_data.country = combined_data.country.astype('category')
# Hint: Use pd.get_dummies()
In [0]:
# Grader cell 3 pts.
grader.grade('check_final_df', (combined_data.shape))
Let’s write out the modified data we created to a file so that we can reuse it in Section 2.
In [0]:
combined_data_sec_2 = combined_data.copy()
combined_data_sec_2.rename(columns = {'views_log': 'label'}, inplace = True)
combined_data_sec_2.to_csv('combined_data.csv')
1.2.8: Split into x and y (1 pt)¶
Split the data into features and label; in this case, the features are everything except views_log and the label is views_log.
In [0]:
# Your code goes here
label =
features =
In [0]:
# Grader cell 1 pt
grader.grade('check_x_y_split', ([features.shape, label.describe()]))
1.3 : Machine Learning using sklearn (15 pts)¶
Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
You can find the documentation here
Now we will train some machine learning models using sklearn to predict views. Rather than predicting views directly, we will predict views_log to avoid numerical instability issues.
1.3.1 : Split data into train and test (1 pt)¶
Use sklearn's train_test_split function to split the data into train and test sets. The split should be 80-20, meaning 80% for training and the rest for testing.
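A minimal sketch (the random_state value is arbitrary; it just makes the split reproducible):
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=42)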
In [0]:
from sklearn.model_selection import train_test_split
# Your code goes here
In [0]:
# Grader cell 1 pt.
grader.grade('check_data_split', [x_train.shape, x_test.shape, y_train.shape, y_test.shape])
1.3.2: Train Machine Learning Models.¶
1.3.2.1 Linear Regression (3 pts)¶
In this step we will train a linear regression model using sklearn. Train it on the training data, then make predictions on the test set, and report the mean squared error obtained on both the train and test sets.
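A hedged sketch of the usual fit / predict / score flow (the names y_pred and mse_test match what the grader cell below expects):
# Hedged sketch.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred_train = lr.predict(x_train)
y_pred = lr.predict(x_test)
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred)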
In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Your code goes here
mse_test =
In [0]:
grader.grade('check_lr', (np.sqrt(mean_squared_error(y_test, y_pred))))
1.3.2.2 Dimensionality reduction with PCA (6 pts)¶
Step 1: Use principal component analysis to reduce the number of dimensions of the dataset. As a first step, fit a PCA model on your train set, then plot the explained_variance_ratio_ against the number of components to decide how many components you should keep. (3 pts)¶
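A hedged sketch of the explained-variance plot (assuming x_train is still the un-transformed training matrix at this point):
# Hedged sketch: fit a PCA with all components, then plot the cumulative explained variance.
from sklearn.decomposition import PCA
pca_full = PCA()
pca_full.fit(x_train)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.axhline(0.95, linestyle='--')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()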
In [0]:
import numpy as np
from sklearn.decomposition import PCA
# Your code goes here
Step 2: Use the plot to decide the number of components to keep; choose a number that explains at least 95% of the variance in the dataset. Then fit your PCA on the training set and transform it using the number of components you decided on. (1 pt)¶
Remember that your PCA should be fit on the training set only; the test set should only be transformed with that fitted PCA.
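A hedged sketch of that pattern; passing 0.95 keeps just enough components to explain 95% of the variance, but you can instead pass the explicit integer you read off your plot:
# Hedged sketch: the n_components value is a placeholder for your own choice.
pca = PCA(n_components=0.95)
x_train = pca.fit_transform(x_train)   # fit on the training set only...
x_test = pca.transform(x_test)         # ...then transform the test set with the same model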
In [0]:
# Your code goes here
In [0]:
# Grader cell 3 pts.
grader.grade('check_pca', (x_train[:50, :]))
1.3.2.3 Random Forest. (10 pts)¶
Step 1: Use grid search to train a random forest model on the transformed train dataset. Tune the available hyperparameters, such as max depth and number of estimators, using grid search, and select the best hyperparameters. (4 pts)¶
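A hedged sketch; the parameter grid below is purely illustrative, so widen or shrink it depending on how much compute you have:
# Hedged sketch: an assumed, small grid over n_estimators and max_depth.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 20]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           scoring='neg_mean_squared_error', cv=3, n_jobs=-1)
grid_search.fit(x_train, y_train)
print(grid_search.best_params_)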
In [0]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Your code goes here
Step 2: Fit the random forest on the training data using the best parameters you found above. Then make predictions on the test set and report the root mean squared error for the test set. (3 pts)¶
In [0]:
# Your code goes here
In [0]:
# Grader cell 10 pts
grader.grade('check_rf', (np.sqrt(mean_squared_error(y_test, y_pred))))
A Blissful Break¶
Well done! Almost halfway there 🙂
Take a well-deserved break! Talk to your friends (on Zoom; social distancing is important! :P), scroll through your Instagram feed or watch a video. Nothing better than cute dog videos! Here is our recommendation; we promise that the link works this time 🙂
Section 2 : Distributed Machine Learning with Spark (55 Points)¶
Apache Spark ML is Spark’s machine learning library. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
Why Spark ML?
Moving into the big data era requires heavy iterative computations on very large datasets. Standard implementations of machine learning algorithms require very powerful machines to run. Depending on high-end machines is not advantageous due to their high price and the high cost of scaling up. The idea of using distributed computing engines is to distribute the calculations across multiple low-end machines (commodity hardware) instead of a single high-end one. This speeds up the learning phase and allows us to create better models.
Read more about it with the python documentation here
Initializing Spark Connection – Boring setup stuff again¶
In [0]:
!apt install libkrb5-dev
!wget https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install findspark
!pip install sparkmagic
!pip install pyspark
!pip install pyspark --user
!pip install seaborn --user
!pip install plotly --user
!pip install imageio --user
!pip install folium --user
In [0]:
!apt update
!apt install gcc python-dev libkrb5-dev
In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
import os
spark = SparkSession.builder.appName('ml-hw4').getOrCreate()
In [0]:
%load_ext sparkmagic.magics
In [0]:
#graph section
import networkx as nx
# SQLite RDBMS
import sqlite3
# Parallel processing
# import swifter
import pandas as pd
# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure
import os
os.environ['SPARK_HOME'] = '/content/spark-2.4.5-bin-hadoop2.7'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
import pyspark
from pyspark.sql import SQLContext
In [0]:
try:
    if spark is None:
        spark = SparkSession.builder.appName('Initial').getOrCreate()
    sqlContext = SQLContext(spark)
except NameError:
    spark = SparkSession.builder.appName('Initial').getOrCreate()
    sqlContext = SQLContext(spark)
2.1 Data for Spark ML (20 points)¶
We have the Spark setup ready. Now we need the fuel for our ML algorithms, i.e., the data. We will use the data you processed in Section 1, but in Spark.
Read in the csv that you created into a spark dataframe. Make sure to set the “inferSchema” flag to True when you do this so that the columns are the correct datatypes and not all strings.
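A hedged sketch, assuming the CSV you wrote out in Section 1 ended up at /content/combined_data.csv:
# Hedged sketch: header=True keeps the column names, inferSchema=True infers numeric dtypes.
train_sdf = spark.read.csv('/content/combined_data.csv', header=True, inferSchema=True)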
In [0]:
# Your code goes here
train_sdf =
Just make sure everything looks good
In [0]:
train_sdf.show()
In [0]:
## Grader cell, worth 5 points
to_grade = train_sdf.toPandas()
grader.grade('check_spark_load', (to_grade.size, to_grade[:50]))
Print out the dataframe schema and verify the datatypes
In [0]:
#TODO: Print the dataframe schema and verify
# Your code goes here
Great job, we now have the processed data. For Spark ML, we need a single feature column that has all the features concatenated, and a single column for labels, which we already have!
We will use VectorAssembler() to create a feature vector from all the categorical and numerical features, and we will call the final vector “features”.
First, list all the columns in the data and store them in a list named all_columns.
In [0]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
In [0]:
# Your code goes here
all_columns = # TODO
Create a list of the columns which you don’t want to include in your features, i.e. the label and probably other columns which don’t help the machine learning model. Name this list drop_columns.
In [0]:
# Your code goes here
drop_columns = # TODO
In [0]:
columns_to_use = [i for i in all_columns if i not in drop_columns]
Create a VectorAssembler object with the columns you want to use. Name your output column 'features' (since these are the features you will use later), and name your VectorAssembler object 'assembler'.
In [0]:
# Your code goes here
In [0]:
# Grader cell, worth 5 points
grader.grade('check_assembler', (str(assembler.params), columns_to_use))
Now we will create a pipeline. A pipeline can have many stages; for this data we just need a single stage with the assembler, but you could add earlier stages that perform operations on the data, such as converting categorical strings in the features to numeric values or doing feature scaling.
In this step, create a pipeline with a single stage, the assembler. Fit the pipeline to your data, create the transformed dataframe, and name it 'modified_data_sdf'.
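A hedged sketch, assuming your VectorAssembler is named 'assembler' and the loaded dataframe is 'train_sdf' as above:
# Hedged sketch: a one-stage pipeline that just runs the assembler.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler])
modified_data_sdf = pipeline.fit(train_sdf).transform(train_sdf)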
In [0]:
from pyspark.ml import Pipeline
# Your code goes here
In [0]:
#Grader cell, worth 8 points
to_grade_df = pd.DataFrame(modified_data_sdf.take(5), columns=modified_data_sdf.columns)
grader.grade('check_pipeline', (to_grade_df.columns.values, to_grade_df['features'][0].size))
Now that we have the data in the format we need, we will create our train and test sets. Split the data in an 80-20 ratio between the train and test sets, and name these 'train_sdf' and 'test_sdf'.
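A minimal sketch (the seed is arbitrary; it just makes the split reproducible):
train_sdf, test_sdf = modified_data_sdf.randomSplit([0.8, 0.2], seed=42)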
In [0]:
# Your code goes here
In [0]:
#Grader cell, worth 2 points
grader.grade('check_split', (train_sdf.count(), test_sdf.count()))
2.2 Linear regression using Spark ML (15 points)¶
Time to do the cool stuff: let’s fit a linear regression model to our data and try to predict the views again! This time, we will use “big data” tools. Using Spark ML’s linear regression, create a model and fit it to the training data. We will then look at the model’s summary stats: the RMSE, the R2 score and any other information you find useful. Look up the documentation online and work out how to implement this.
First, train a model without any regularization!
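A hedged sketch, assuming the feature and label columns are named 'features' and 'label' as built above:
# Hedged sketch.
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol='features', labelCol='label')
lr_model = lr.fit(train_sdf)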
In [0]:
from pyspark.ml.regression import LinearRegression
# Your code goes here
lr_model =
In [0]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
In [0]:
#Grader cell, worth 4 points
grader.grade('check_lr_train', (trainingSummary.rootMeanSquaredError, trainingSummary.r2))
Now, find out how good the model actually is and see whether it overfits the training data. Predict the views for your test data (hint: this is called 'transform' in Spark ML) and evaluate the performance using the 'RegressionEvaluator' object from Spark ML's evaluation module. Your prediction column should be named 'prediction'.
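A hedged sketch of the transform-then-evaluate pattern:
# Hedged sketch.
from pyspark.ml.evaluation import RegressionEvaluator
predictions = lr_model.transform(test_sdf)
evaluator = RegressionEvaluator(labelCol='label', predictionCol='prediction', metricName='rmse')
test_rmse_orig = evaluator.evaluate(predictions)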
In [0]:
# Your code goes here
predictions =
In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
# TODO: Compute root mean squared error on the test set
test_rmse_orig =
In [0]:
#Grader cell, worth 4 points
predictions_to_grade = predictions.toPandas()
answer = [test_rmse_orig, predictions_to_grade['prediction'][0:50], predictions_to_grade['label'][0:50]]
grader.grade('check_lr_test', answer)
Now we will add regularization to avoid overfitting. Play around with different regularization settings: try out L1, L2 and elastic net (a combination of L1 and L2) with different regularization hyperparameters. Create a table comparing these with each other and with the non-regularized regression done above.
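A hedged sketch; regParam=0.1 below is just an assumed starting point to experiment from, and elasticNetParam mixes L1 and L2 (1.0 is pure L1/lasso, 0.0 is pure L2/ridge, values in between give elastic net):
# Hedged sketch: fit three regularized variants, then transform/evaluate them as above.
l1_model = LinearRegression(featuresCol='features', labelCol='label',
                            regParam=0.1, elasticNetParam=1.0).fit(train_sdf)
l2_model = LinearRegression(featuresCol='features', labelCol='label',
                            regParam=0.1, elasticNetParam=0.0).fit(train_sdf)
elastic_net_model = LinearRegression(featuresCol='features', labelCol='label',
                                     regParam=0.1, elasticNetParam=0.5).fit(train_sdf)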
In [0]:
# Your code goes here
In [0]:
# Your code goes here
# Compute predictions using each of the models
l1_predictions =
l2_predictions =
elastic_net_predictions =
# TODO: Compute root mean squared error on test set for each of your models
test_rmse_l1 =
test_rmse_l2 =
test_rmse_elastic =
In [0]:
# Grader cell, worth 7 points
answer = [test_rmse_l1, test_rmse_l2, test_rmse_elastic]
grader.grade('check_lr_all', answer)
2.3 Random Forest Regression (10 points)¶
As a data scientist looking to win competitions, you definitely must know about random forests, boosted trees, and other ensemble methods. These methods generalize well and work surprisingly well for a lot of classification problems, and sometimes for regression problems too. So let’s give one a go on our problem. Just like the linear regression model, create a random forest regressor model, fit it to the training data, and evaluate it using RegressionEvaluator. Compare the performance on the test set with the linear regression model.
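A hedged sketch; numTrees and maxDepth below are illustrative starting values, not tuned choices:
# Hedged sketch: fit the forest, then reuse the transform/RegressionEvaluator pattern
# from the linear regression section to compute the train and test RMSE.
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(featuresCol='features', labelCol='label', numTrees=50, maxDepth=10)
rf_model = rf.fit(train_sdf)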
In [0]:
from pyspark.ml.regression import RandomForestRegressor
# Your code goes here
rf_model =
train_rmse_rf = #TODO: calculate the training rmse
In [0]:
# Your code goes here
predictions = #TODO : Predictions on the test set
In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
# Your code goes here
rmse_rf = #TODO: calculate the rmse on the test set
In [0]:
#Grader cell, worth 10 points
predictions_to_grade = predictions.toPandas()
answer = [train_rmse_rf, predictions_to_grade['prediction'][0:50], predictions_to_grade['label'][0:50], rmse_rf]
grader.grade('check_rf_spark', answer)
2.4 Dimensionality Reduction using PCA (10 points)¶
We will again use PCA to reduce the dimensions and project the data onto a lower-dimensional space, and then do linear regression on the projected data. Choose an appropriate value for the number of dimensions; you already found this number in Section 1!
Steps for this section
1. Initialize a PCA model
2. Fit the model using the training data
3. Get the PCA feature from the trained model
4. Train a linear regression model using the PCA features
5. Evaluate the performance on the test set
Note that this section deliberately gives you less starter code, so that you have to go through the documentation and implement things yourself, as you will as future data scientists! A hedged sketch of the overall flow is given below.
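To make the flow concrete, here is a hedged sketch; it assumes your assembled train/test dataframes are train_sdf and test_sdf, and k=10 is only a placeholder for the component count you settled on in Section 1:
# Hedged sketch: project onto k principal components, then regress on the projected features.
from pyspark.ml.feature import PCA as SparkPCA
from pyspark.ml.regression import LinearRegression
pca = SparkPCA(k=10, inputCol='features', outputCol='pca_features')
pca_fitted = pca.fit(train_sdf)
train_pca_sdf = pca_fitted.transform(train_sdf)
test_pca_sdf = pca_fitted.transform(test_sdf)
lr_pca = LinearRegression(featuresCol='pca_features', labelCol='label')
pca_model = lr_pca.fit(train_pca_sdf)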
In [0]:
# Your code goes here
pca_model =
training_rmse_pca =
In [0]:
# Your code goes here
predictions = #TODO: Get predictions on the test set
test_rmse_pca = #TODO: Get RMSE for test data
In [0]:
# Your code goes here
predictions_to_grade = predictions.toPandas()
answer = [training_rmse_pca, predictions_to_grade['prediction'][0:50], predictions_to_grade['label'][0:50], test_rmse_pca]
grader.grade('check_pca_spark', answer)
2.5 Extra Credit (Up to 5 points)¶
Since we didn’t code an actual model for HW3… you have the chance to make one now for extra credit!
1. Load in the created training data from the LinkedIn dataset in section 3.5 of Homework 3.
2. Split the data into training, validation and test sets using the ratio 70-15-15.
3. Use Spark ML to train a random forest classifier and tune your hyperparameters.
Report what hyperparameters you tuned. Also report the maximum accuracy you achieved.
In [0]:
# TODO if you are hungry for points like Maggie
2.6 Yet Another Extra Credit Question (Up to 5 points)¶
Think about how you would design a distributed K-means clustering algorithm. Assume you have a cluster with N processors and you are trying to cluster a very, very big dataset. Design a distributed K-means algorithm that can perform this clustering.
For this question, a theoretical explanation of your design (+ visual descriptions if you want) is sufficient. You don’t need to code it up!
Answer:
HW Submission¶
Double check that you have the correct PennID (all numbers) in the autograder.
Go to the “File” tab at the top left, and click “Download .ipynb”. Zip it (name doesn’t matter) and submit it to OpenSubmit.
You must submit your notebook to receive credit.
On OpenSubmit, go to Settings and make sure to set your Student ID to your PennID (all numbers).