COMP2420/COMP6420 – Introduction to Data Management,
Analysis and Security
Assignment – 2
Grading¶
| Item | Details |
|---|---|
| Maximum Marks | 100 |
| Weight | 20% of the Total Course Grade |
| Submission deadline | 7:00 PM, Friday, May 24 |
| Submission mode | Electronic, using GitLab (one submission per group) |
| Estimated time | 20 hours |
| Penalty | 100% after the deadline |
Submission¶
You need to submit the notebook Assignment-2.ipynb, and any additional files you may have created / hyperlinked in this notebook, as part of your submission by pushing it to your forked GitLab repository. You need to add your group details below. Make sure your group works on, and submits, only one fork of the assignment repository.
Note:¶
• It is strongly advised to read the whole assignment before attempting it, and to have at least a cursory glance at the dataset, in order to gauge the requirements and understand the bigger picture of what you need to do.
• For answers requiring free form written text, use the designated cells denoted by YOUR ANSWER HERE — double click on the cell to write inside them.
• For all coding questions please write your code after the comment YOUR CODE HERE.
• In the process of testing your code, you can insert extra cells or use print statements for debugging, but remember to remove these cells and calls before submitting your file.
• You will be marked on the correctness and readability of your code; if your marker cannot understand your code, marks may be deducted.
Group Number : 000¶
Student IDs: uXXXXXXX, uXXXXXXX (and uXXXXXXX)¶
Introduction¶
In [ ]:
# IMPORTING FREQUENTLY USED PYTHON MODULES
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import psycopg2
import os
plt.style.use('seaborn-notebook')
%matplotlib inline
# JUST TO MAKE SURE SOME WARNINGS ARE IGNORED
import warnings
warnings.filterwarnings("ignore")
In [ ]:
# IMPORT ANY OTHER REQUIRED MODULES IN THIS CELL
Section A – Database Management (15 Marks)¶
You have been given the following database containing 15 tables relating to DVD Movie Rentals. The data contained in these tables is as follows –
| Table | Data Description |
|---|---|
| actor | actor data including first name and last name |
| film | film data such as title, release year, length, rating, etc. |
| film_actor | the relationships between films and actors |
| category | film category data |
| film_category | the relationships between films and categories |
| store | store data including manager staff and address |
| inventory | inventory data |
| rental | rental data |
| payment | customer payments |
| staff | staff data |
| customer | customer data |
| address | address data for staff and customers |
| city | city data |
| country | country data |
The relationships between these tables are visualised in the E-R diagram below.

Based on your understanding of the relationships between these tables, answer the following questions by writing SQL queries to get the required data rows from the database and display them as a Pandas dataframe.
NOTE:¶
For the following questions, in the CSIT labs you will be able to connect directly to the database using the below code within the notebook. From outside of the CSIT labs, you will need to perform SQL queries in your terminal by using partch. Once you have the correct query, you may just fill in the boxes below.
Partch Instructions
1. Connect to partch as per the above hyperlink
2. In your terminal, enter psql to access the sql database
3. Enter \c dvdrental to ensure you are accessing the assignment database (or simply specify dvdrental when connecting to the database server — psql dvdrental).
4. (Sanity Check) Enter SELECT * FROM actor;. If the first row returned is "Penelope Guiness", you should be good to go!
In [ ]:
# Connect using psycopg2
conn = psycopg2.connect(host="/var/run/postgresql", database="dvdrental")
# Activate connection cursor
curr = conn.cursor()
In [ ]:
def execute_sql(query):
    # Execute the query using the connection cursor
    curr.execute(query)
    # Fetch all rows from the result of the query
    rows = curr.fetchall()
    # Get the column names for the result set
    colnames = [desc[0] for desc in curr.description]
    # Convert into a readable pandas dataframe
    df_result = pd.DataFrame(rows, columns=colnames)
    return df_result
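To illustrate what execute_sql returns, here is a self-contained sketch of the same rows-plus-column-names pattern, with hard-coded values standing in for a live cursor (the sample rows are illustrative only, not pulled from the database):

```python
import pandas as pd

# Hard-coded stand-ins for curr.fetchall() and curr.description, as if
# the query were "SELECT actor_id, first_name, last_name FROM actor LIMIT 2;"
rows = [(1, "Penelope", "Guiness"), (2, "Nick", "Wahlberg")]
colnames = ["actor_id", "first_name", "last_name"]

# Same conversion step as execute_sql: rows + column names -> dataframe
df_result = pd.DataFrame(rows, columns=colnames)
print(df_result)
```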
1. Write an SQL query to determine the total sales from the rentals of the film ‘Affair Prejudice’ from the rental store with store_id = 2. Your query should result in a single column called Total Rental Cost with the value of the total cost of all these rentals.¶
[3 marks]
In [ ]:
# YOUR QUERY HERE
query = ""
execute_sql(query)
2. Write an SQL query to retrieve the names of all customers who rented the film ‘Affair Prejudice’ from the store with store_id = 1. The result of your query should display each customer’s first name and last name.¶
[3 marks]
In [ ]:
# YOUR QUERY HERE
query = ""
execute_sql(query)
3. Write an SQL query to retrieve the names of all those customers who have a total of more than 100 dollars in recorded payments. Your query’s result should include each customer’s first name, last name and customer ID.¶
[4 marks]
In [ ]:
# YOUR QUERY HERE
query = ""
execute_sql(query)
4. Write an SQL query to retrieve the names of all customers who have rented the movie ‘Angels Life’ from both stores with store_id = 1 and store_id = 2. Your query’s result should include each customer’s first name, last name and customer ID.¶
[5 marks]
In [ ]:
# YOUR QUERY HERE
query = ""
execute_sql(query)
Section B – Data Acquisition (20 Marks)¶
In this section, you need to read the data from the database into Python and make it ready for the machine learning algorithms in the next section. Your task is to load the required columns from the appropriate tables in the database, providing the data needed to train your machine learning models in Section C. You can load this data into one or more Pandas dataframes, to suit the needs of the different machine learning models.
If you fail to do so, or an error in the previous section is preventing you from doing so, we can provide you with a CSV of the required data. If you choose to use this CSV, you will not receive any marks for this section, but you’ll be able to do the following section without having to spend time on this section.
Note: While you are provided the .csv files to use for development of the later questions at home, you must submit a copy of the code that can read the database to receive marks for this section.
[20 marks]
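As a sketch of the loading pattern, the example below uses an in-memory SQLite table as a stand-in for the PostgreSQL database; against dvdrental you would pass the psycopg2 conn from Section A instead, and query the real tables. The table and rows here are illustrative assumptions, not the actual data.

```python
import sqlite3
import pandas as pd

# Stand-in: an in-memory SQLite database with a tiny film table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE film (film_id INTEGER, title TEXT, rental_rate REAL)"
)
demo_conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [(1, "Academy Dinosaur", 0.99), (2, "Ace Goldfinger", 4.99)],
)

# pd.read_sql turns a query result straight into a dataframe,
# ready for the machine learning models in Section C.
films = pd.read_sql("SELECT film_id, title, rental_rate FROM film", demo_conn)
print(films)
```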
In [ ]:
# YOUR CODE HERE
Section C – Machine Learning (55 Marks)¶
Note: We have provided the .csv outputs of the files to use in these questions if you are working on the assignment at home. Please note that if any of your answers to these questions refer to those CSVs, rather than the data from Section B, in your submitted assignment, you will receive no marks for Section B.
In [ ]:
## Import csv files for home development here as necessary.
homework_actor = pd.read_csv("data/dvd_rental/actor.csv")
# Read in other CSVs as necessary.
Part 1 – Clustering (25 Marks)¶
Part 1.1 – Clustering Short Answers (8 Marks)¶
1. Can a Decision Tree be used to perform clustering? If so, explain how. If not, provide a counterexample showing how it is not suitable for the function.¶
[4 marks]
Your answer here: ¶
**
2. Will a K-means Clustering Algorithm generate the same results each time? Provide examples on how this may or may not be the case.¶
[4 marks]
Your answer here: ¶
**
Part 1.2 – K-Means Clustering Implementation (17 Marks)¶
Clustering helps visualise a dataset based on attributes the data scientist and/or reader considers important. Using the dataset acquired in Section B, implement a K-Means clustering algorithm to cluster the customers on the basis of the movies they have rented from the DVD rental stores. Movie attributes such as rating, release year and rental_rate may be useful for this exercise; another interesting attribute is the fulltext of the movie. In terms of customer details, you can look at which movies a customer has rented, and the total number of movies they have rented. After you have prepared your learning model, plot a visualisation showing the different clusters. If you used more than two features for your clustering, you are still expected to provide a visualisation by reducing the dimensions to a 2D graph.
It is up to you to decide how many clusters to incorporate in your model. You are expected to justify all aspects of your implementation, including the reasoning behind the choice of the number of clusters and the number of iterations.
Note: You are only allowed to use packages that are within the Anaconda distribution.
[15 marks total: 10 marks model, 5 marks justification & commentary]
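A minimal sketch of this workflow is shown below, assuming scikit-learn (part of the Anaconda distribution). It uses randomly generated data as a stand-in for the customer feature matrix you build in Section B; the feature values, cluster count and n_init are placeholder assumptions that you would replace and justify with your own choices (e.g. via an elbow plot over a range of k).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer feature matrix from Section B
# (e.g. counts of rentals per film rating, total rentals per customer).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 4)),
    rng.normal(loc=3.0, scale=0.5, size=(50, 4)),
])

# Scale features so no single attribute dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters and n_init are placeholders here -- justify your own choices.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Reduce to two dimensions for the required visualisation.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
# plt.scatter(X_2d[:, 0], X_2d[:, 1], c=kmeans.labels_)
```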
In [ ]:
# YOUR CODE HERE
Your justifications & commentary here:
**
What do your resulting clusters represent? Explain the distinguishing characteristics of each cluster.¶
[2 marks]
Your answer here: ¶
**
Part 2 – Classification (30 Marks)¶
The rental company has decided that they want to use simple machine learning to allocate price tags to their movies. The tags are as follows:
| Classification | Requirement |
|---|---|
| Cheap | Rental Rate is 0.99 |
| Moderate | Rental Rate is 2.99 |
| Expensive | Rental Rate is 4.99 |
Your task is to implement a Classification Algorithm (such as K-Nearest Neighbours) that can predict the Price Label of a movie. You are required to perform the following tasks:
1. Create usable dataset/s by manually determining the 'truth values' for existing data (where the rental rate falls within the classification system defined above)
2. Implement an algorithm that can predict the classification as per the above classifications.
3. Perform independent testing of the model and provide statistical metrics outlining the performance of your model. Splitting the dataset into testing and training subsets will assist with this.
You are welcome to use any features within the dataset, except the Rental Rate of the film. Various attributes relating to a movie in the tables film, film_actor, actor and film_category can be helpful when building the algorithm. If required, you can also create new compound attributes that may help increase the accuracy of your model. You are expected to justify all aspects of your answer, including the features used, the metrics provided and the validation system employed. Provide commentary on the strengths and potential pitfalls of the model.
Note: You are only allowed to use packages that are within the Anaconda distribution. This means packages such as Keras, Tensorflow etc are not available for use.
[25 marks total: 18 marks model, 7 marks justification & commentary]
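The three tasks above can be sketched as follows, assuming scikit-learn (within the Anaconda distribution). The feature matrix and labels here are hypothetical synthetic stand-ins for the film features and the Cheap/Moderate/Expensive labels you would derive from rental_rate in step 1; k=5 neighbours and the 70/30 split are placeholder choices to justify yourself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for film features (e.g. length, release year,
# replacement cost) with labels derived from rental_rate (step 1).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) + np.repeat(
    [[0.0] * 3, [3.0] * 3, [6.0] * 3], 100, axis=0
)
y = np.repeat(["Cheap", "Moderate", "Expensive"], 100)

# Step 3: hold out an independent test set before fitting anything.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 2: fit the classifier and predict price labels on unseen films.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```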
In [ ]:
# YOUR CODE HERE
Your justifications & commentary here:
**
Would you be able to get a better result if you had used Clustering as a pre-processing step before Classification? Justify your answer.¶
[5 marks]
Your answer here: ¶
**
Section D – Decision Trees (10 Marks)¶
The following is a small synthetic dataset about weather conditions. We are going to use decision trees to predict whether or not it will rain on a given day.
| Temperature | Cloudy | UV Index | Humidity | Rain |
|---|---|---|---|---|
| 25 | No | Low | Low | No |
| 29 | No | Low | High | No |
| 26 | No | Low | Medium | No |
| 26 | No | Medium | Medium | No |
| 27 | No | Medium | High | No |
| 28 | No | High | High | No |
| 25 | No | High | Low | No |
| 29 | Yes | Low | Low | Yes |
| 28 | No | Medium | High | Yes |
| 28 | Yes | Medium | High | Yes |
| 26 | No | Low | Low | Yes |
| 27 | Yes | Low | High | Yes |
Note:
• You can treat temperature as a continuous variable and split it on a range of values (to convert it to a categorical variable, for example).
• Attribute selection in the tree uses information gain.
• You can use LaTeX and/or markdown to format your equations.
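The entropy and information-gain formulas used for attribute selection can be sketched as generic helper functions; the worked checks below use neutral labels, not the weather data, so the question answers remain yours to calculate.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropy of the subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# Worked checks on generic labels (not the weather data):
print(entropy(["a", "b", "a", "b"]))                   # -> 1.0 (50/50 split)
print(information_gain(["a", "a", "b", "b"],
                       [["a", "a"], ["b", "b"]]))      # -> 1.0 (perfect split)
```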
1. What is the initial entropy of Cloudy?¶
[3 marks]
YOUR ANSWER HERE¶
2. Which attribute would the decision-tree building algorithm choose at the root of the tree?¶
[3 marks]
YOUR ANSWER HERE¶
3. Calculate and specify the information gain of the attribute you chose to split on in the previous question¶
[4 marks]
YOUR ANSWER HERE¶