Assignment_2_2022-checkpoint
COMP2420/COMP6420 – Introduction to Data Management,
Analysis and Security
Assignment – 2 (2022)
Maximum Marks 100 for COMP2420 and 120 for COMP6420 students
Weight 15% of the Total Course Grade
Submission deadline 11.59PM, Tuesday, May 24th
Submission mode Electronic, Using GitLab
Penalty 100% after the deadline
Learning Outcomes¶
The following learning outcomes apply to this assignment:
LO1 – Demonstrate a conceptual understanding of database systems and architecture, data models and declarative query languages
LO2 – Define, query and manipulate a relational database
LO3 – Demonstrate basic knowledge and understanding of descriptive and predictive data analysis methods, optimization and search, and knowledge representation.
LO4 – Formulate and extract descriptive and predictive statistics from data
LO5 – Analyse and interpret results from descriptive and predictive data analysis
LO6 – Apply their knowledge to a given problem domain and articulate potential data analysis problems
LO7 – Identify potential pitfalls, and social and ethical implications of data science
LO8 – Explain key security concepts and the use of cryptographic techniques, digital signatures and PKI in security
Submission¶
You need to submit the following items:
The notebook Assignment_2_2022_uXXXXXXX.ipynb (where uXXXXXXX is your uid) [Please note that if you have to pull from the upstream repo you will need to change the name back first and add your uid after the pull again]
A completed statement-of-originality.md, found in the root of the forked gitlab repo.
Submissions are performed by pushing to your forked GitLab assignment repository. For a refresher on forking and cloning repositories, please refer to Lab 1. Issues with your Git repo (with the exception of a CECS/ANU wide Gitlab failure) will not be considered as grounds for an extension. Any variation of this will result in a zero mark.
It is strongly advised to read the whole assignment before attempting it, and to have at least a cursory glance at the datasets, in order to gauge the requirements and understand the bigger picture.
Backup your assignment to your Gitlab repo often.
Extra reading and research will be required. Make sure you include all references in your Statement of Originality. If you do not, at best marks will be deducted; otherwise, academic misconduct processes will be followed.
For answers requiring free form written text, use the designated cells denoted by YOUR WRITTEN ANSWER HERE — double click on the cell to write inside them. You can change these to Markdown cells where necessary (e.g. you may need to include some figures in your answer).
For all coding questions please write your code after the comment YOUR CODE HERE.
In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively. You are welcome to add additional cells to the final submission, provided they add value to the overall piece.
Your code answers will be marked on the correctness and readability of your code; if your marker cannot understand your code, marks may be deducted.
Your written answers will be marked on the correctness, depth and clarity of your writing. If your marker cannot understand your answer, marks may be deducted.
Before submitting, restart the kernel and re-run all cells. This ensures the namespace has not kept any old variables, as these will not carry across in submission and your code will not run without them. Skipping this step could cost you a significant number of marks.
Introduction¶
This introduction has been split into three sections, based on the datasets you will be interacting with: CVSS Data, BitcoinHeist, and Sakila.
CVSS Data¶
Sound familiar? We hope so. You are being provided with the same dataset from Assignment 1, with an additional csv file. For a refresher, check the about.md file. The provided files are:
cvss_data_complete.csv
cve_configurations_mapping.csv
cwe_capec_mapping.csv
BitcoinHeist¶
We will be using a subset of data from a UCI Machine Learning Repository dataset that contains parsed Bitcoin transaction graphs from January 2009 to December 2018. The BitcoinHeist dataset contains address features on the heterogeneous Bitcoin network to identify ransomware payments. We will be using this dataset to predict whether a given record is white (i.e., not known to be ransomware) or not (i.e., known to be ransomware).
Sakila¶
To test your SQL muscles, we have provided a database (and the creation scripts) for a sample SQL database called Sakila, made available by MySQL under a BSD licence. The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory, among others. We have provided the database diagram in Q3 for reference.
Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/
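As a warm-up for the SQL questions, here is a minimal sketch of querying an SQLite database from Python. The in-memory table and rows below are a simplified, invented stand-in, not the real Sakila schema:

```python
import sqlite3

# In-memory stand-in; this single table is a simplified assumption,
# not the full Sakila schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO film (title) VALUES (?)",
                 [("ACADEMY DINOSAUR",), ("ACE GOLDFINGER",)])

# Run a query and fetch the single-row result
(n_films,) = conn.execute("SELECT COUNT(*) FROM film").fetchone()
print(n_films)
conn.close()
```

For the real database you would connect to the provided Sakila database file instead of `:memory:`.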
In summary, you have three (3) datasets to work with in this assignment, broken down as follows:¶
Question 1 & 4 – CVSS data
Question 2 – BitcoinHeist data
Question 3 – Sakila database
Note: the CVSS dataset is sizable, so it is wise to consider your code in terms of complexity to ensure it doesn’t take 30 minutes to run a single line.
# Code Imports
# Every import is here, you may need to uncomment additional items as necessary.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from matplotlib import cm
#import seaborn as sns
import sqlite3
from sqlite3 import Error
#from scipy import stats
#from sklearn.linear_model import LogisticRegression # Logistic Regression
#from sklearn.neighbors import KNeighborsClassifier # k-Nearest Neighbours
#from sklearn.preprocessing import LabelEncoder # encoding variables
#from sklearn.preprocessing import StandardScaler # scaling variables
#from sklearn.model_selection import train_test_split # testing our models
#from sklearn.preprocessing import OneHotEncoder # nominal variable
#from sklearn.metrics import confusion_matrix # scoring
#from sklearn.tree import DecisionTreeClassifier # decision trees
#from sklearn.tree import DecisionTreeRegressor # decision trees
#from sklearn import tree # decision trees
#from sklearn.decomposition import PCA # PCA
#from sklearn.cluster import KMeans # KMeans Clustering
#from sklearn import metrics # metrics
# Import additional modules here as required
# It is unlikely that you will need any additional modules; however, we have added space here just in case you feel
# extras are required. Note that some justification as to WHY you are using them should be provided.
# Note that only modules in the standard Anaconda distribution are allowed. If you need to install it manually, it is not an accepted package.
Q1: Unsupervised Learning (25 marks for COMP2420, 32 marks for COMP6420)¶
Building off the dataset you initially encountered in Assignment 1, the CVSS dataset has made a return for the following question. Note that while the dataset holds the same schema, the last 4 columns are populated for you in this assignment.
The following question is designed to get you to load and process data, and implement a clustering model for the given scenario below. You have been introduced to KMeans clustering in the lectures and labs and this would therefore be the assumed clustering method, although you are welcome to supplement this with other clustering methods from the sklearn package as you desire.
Your scenario is as follows:
Once again, you have assumed the role of a member of a cyber-security team interested in the vulnerability of products. Your procurement team has asked you to develop an unsupervised modelling system (ie: clustering model) to identify software that can be grouped based on their CVSS metrics. You must decide how many groupings (Severity Levels) are appropriate for training the model. The procurement team does not wish to use the Base Scores, Sub Scores and Base Severity as clustering metrics.
Based on this clustering model, the procurement team will make risk-based decisions to determine whether it is safe to introduce a product to the system.
You will first be asked to import and pre-process the data ready to implement a clustering model. Then, you are on your own in the world of clustering. Good luck!
1.1 Preprocessing¶
To start off, bring in the data and get it ready for clustering. Your tasks are:
Import the Data. The dataset is available in the location data/cvss/CVSS_data_complete.csv [1 mark]
Prepare the data for a clustering task. This time you may encounter some missing or invalid values, hence you may have to decide how to tackle these as well.
Note: While not every part of this section is directly assessed, this preprocessing section includes a number of tasks that will help you in your clustering and data analysis in future questions.
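As a rough illustration only (not the required solution), handling missing or invalid values with pandas might look like the sketch below. The column names and values here are invented stand-ins, not the real CVSS schema:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the CVSS data; these column names are assumptions,
# not the real schema.
df = pd.DataFrame({
    "attackVector": ["NETWORK", "LOCAL", None, "NETWORK"],
    "baseScore": [9.8, np.nan, 5.3, 7.5],
})

# Inspect missingness before deciding how to handle it
missing = df.isna().sum()

# One simple option: drop incomplete rows (imputation is an alternative)
clean = df.dropna()
print(len(clean))  # 2 rows survive in this toy frame
```

Whether you drop, impute, or flag missing values is your decision to justify in your answer.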
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
1.2 Building a Clustering Model¶
Clustering helps visualise a dataset based on attributes considered important to the data scientist and/or reader. Using the CVSS Dataset above, implement a K-Means clustering algorithm to cluster the dataset by using either all or a subset of the available features (except the Base Scores, Sub Scores and Base Severity). You are expected to reduce the dataset to either 2 or 3 dimensions. After you have prepared your learning model, plot a 2D or 3D visualisation showing the different clusters.
It is up to you to decide how many clusters you would like to incorporate in your model. You are expected to verbally and visually justify your implementation, including the reasoning behind the choice of the number of clusters in your model.
[10 marks]
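For orientation, a minimal scale-reduce-cluster pipeline on synthetic data might look like the sketch below. The feature matrix is random stand-in data, and the choice of k = 3 is illustrative only, not a recommendation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # random stand-in for encoded CVSS metrics

X_scaled = StandardScaler().fit_transform(X)  # scale before PCA/KMeans
X_2d = PCA(n_components=2).fit_transform(X_scaled)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
print(X_2d.shape, len(set(km.labels_)))
```

The `km.inertia_` attribute across a range of k values is one common input to an elbow-style justification of the cluster count.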
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
1.3 Analysing the Clusters¶
With your clustering model complete, analyse the outputs in preparation for showing the results to the procurement team. Your tasks are as follows:
1.3.1 Comparison of summary statistics between clusters¶
Display and compare the mean of the CVSS Base Score, between all clusters in your model.
Identify and explain the largest differences in metrics (including the mean of the CVSS Base Scores) between the clusters using a suitable summary statistic(s). (i.e. what are the defining characteristics of a cluster compared to other cluster(s)?)
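One way to compare per-cluster means is a simple groupby, sketched here on a hypothetical toy frame (the column names are assumptions):

```python
import pandas as pd

# Toy frame with hypothetical cluster labels attached
df = pd.DataFrame({
    "cluster":   [0, 0, 1, 1, 2],
    "baseScore": [9.0, 8.0, 5.0, 4.0, 2.0],
})

# Mean Base Score per cluster
cluster_means = df.groupby("cluster")["baseScore"].mean()
print(cluster_means.to_dict())
```

`df.groupby("cluster").describe()` gives a broader set of summary statistics for the comparison.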
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
1.3.2 Visual exploration of the distribution of Base Scores between clusters¶
(a) Visualise the distribution of the CVSS Base Score between each cluster.
(b) Answer the following: Without the use of a hypothesis test, does the CVSS Base Score appear to differ significantly between these clusters? (You may reference the previous question (a) rather than reproducing answers)
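A per-cluster boxplot is one option for part (a). This sketch uses invented data and a headless backend so it runs anywhere:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1, 1],
    "baseScore": [9.0, 8.5, 8.8, 4.0, 3.5, 4.2],
})

ax = df.boxplot(column="baseScore", by="cluster")
ax.set_ylabel("CVSS Base Score")
plt.suptitle("")        # drop the automatic "Boxplot grouped by" title
plt.savefig("clusters_boxplot.png")
```

Histograms or violin plots per cluster are equally valid choices.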
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
1.4 Comparing our model prediction based on Base Severity values¶
Do the cluster labels accurately categorise the Base Severity? Since you know the actual Base Severity classification values for this dataset, can you now compare the model predictions with the actual Base Severity values?
HINT: Present the Base Severity counts of the clusters found by the model like we did in Lab 06
OR plot the actual Base Severity groups and predicted clusters side by side
NOTE: The number of unique Base Severity groups and the number of clusters in your trained model may be different.
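`pd.crosstab` is one way to tabulate actual Base Severity groups against predicted clusters, as in this toy sketch (all labels invented):

```python
import pandas as pd

# Invented labels: actual Base Severity vs predicted cluster ids
actual = pd.Series(["HIGH", "HIGH", "LOW", "MEDIUM", "LOW"], name="severity")
pred   = pd.Series([0, 0, 1, 1, 1], name="cluster")

# Rows: severity groups; columns: cluster ids
table = pd.crosstab(actual, pred)
print(table)
```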
# YOUR WRITTEN ANSWER HERE
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
1.5 Convincing the Procurement Team (for COMP6420 students only)¶
(a) With your evidence and analysis points well defined, it is now time to present your model and findings to the procurement team. Based on your findings, do you recommend the model in its current state or not? Provide your reasoning.
(b) Explain how this model could be used when they are determining what software should be purchased. Include an example of one or more imaginary products in your explanation, and how you can use this product’s features to make a prediction from your trained model.
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
Q2: Classification using k-nearest neighbors and decision trees (20 marks for COMP2420, 28 marks for COMP6420)¶
In this question we are using a subset of data from a UCI Machine Learning Repository dataset that contains parsed Bitcoin transaction graphs from 2009 January to 2018 December. The BitcoinHeist dataset contains address features on the heterogeneous Bitcoin network to identify ransomware payments. The dataset contains the following features.
Column Name Description
address (String) Bitcoin address
year (Integer) Year
day (Integer) Day of the year. 1 is the first day, 365 is the last day
length (Integer) Length is designed to quantify mixing rounds on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin.
weight (Float) Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address.
count (Integer) Similar to weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount (what percent of these transactions output?) of transactions.
looped (Integer) Loop is intended to count how many transactions i) split their coins; ii) move these coins in the network by using different paths; and finally iii) merge them in a single address. Coins at this final address can then be sold and converted to fiat currency.
neighbors (Integer) Indicates the number of neighbors a transaction had.
income (Integer) Satoshi amount (1 bitcoin = 100 million satoshis).
label (Category String) Name of the ransomware family (e.g., CryptXXX, CryptoLocker) or white (i.e., not known to be ransomware)
Reference: Akcora, Cuneyt G., Yitao Li, Yulia R. Gel, and Murat Kantarcioglu. BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain. arXiv preprint, 2019.
In this question, you are asked to predict whether a given record is ransom (i.e., known to be ransomware) or white (i.e., not known to be ransomware). For this purpose, we’ll ask you to train two classification models (namely, k-nearest neighbors and decision trees). You are also expected to do an experiment on a range of hyperparameters for your models. Later, you will be comparing the performance of the two models.
For comparing two models on the same dataset, we’ll usually keep the same train-test split of the dataset.
You are also strongly encouraged to use cross validation to choose the hyperparameters of the two models.
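The workflow suggested above, one fixed train-test split plus cross-validated hyperparameter search, can be sketched as follows. This uses `GridSearchCV` on synthetic stand-in data; the hyperparameter grid is illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)     # synthetic binary labels

# One fixed split, reused for both models, keeps the comparison fair
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Cross-validated search over a small hyperparameter grid
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 2))
```

The same `X_tr`/`X_te` split would then be reused when fitting the decision tree.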
2.1: Load and process the data¶
Your first task is to load and pre-process the data for the classification task. Remember that we are asking you to predict whether a given record is ransom (i.e., known to be ransomware).
The data to be loaded is found in data/BitcoinHeistDataSample.csv.
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
2.2 Data Analysis and Visualisation¶
We are interested in looking at how the volume of ransom and white data points differ across the years in this sample dataset. Your task is to come up with a suitable plot to visualise this and comment on your observations.
NOTE: The dataset is a sub-set of the original dataset that is not representative of the real spread between the actual labels. Hence, the conclusions we can reach based on the above plot may not hold for the full dataset, which is out of scope for this question.
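One possible shape for such a plot, on invented stand-in rows (the real labels are family names plus white, which this sketch collapses to a two-level `kind` column first):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Invented rows standing in for BitcoinHeist records
df = pd.DataFrame({
    "year":  [2014, 2014, 2015, 2015, 2015],
    "label": ["white", "ransom", "white", "white", "ransom"],
})
# Collapse any non-white family name to a single "ransom" category
df["kind"] = df["label"].where(df["label"].eq("white"), "ransom")

counts = df.groupby(["year", "kind"]).size().unstack(fill_value=0)
counts.plot(kind="bar")
plt.ylabel("number of records")
plt.savefig("volume_by_year.png")
```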
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
2.3: Classification using k-Nearest Neighbors¶
Train a KNN model using this dataset to predict whether a given record is ransom (i.e., known to be ransomware).
How did you prepare the dataset for this task?
How many neighbors, k, did you use for your model? What are the values of the other hyperparameters of your model?
How did you decide on the hyperparameters of your model?
Print the validation, training and testing scores for your model
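One possible shape for the steps above, on synthetic stand-in data: scale the features inside a pipeline (KNN is distance-based, so scaling matters), report a cross-validated mean as the validation score, then the training and test scores. The choice of k = 5 here is illustrative only:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling inside the pipeline avoids leaking test statistics into training
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
val_score = cross_val_score(knn, X_tr, y_tr, cv=5).mean()
knn.fit(X_tr, y_tr)
print(round(val_score, 2),
      round(knn.score(X_tr, y_tr), 2),
      round(knn.score(X_te, y_te), 2))
```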
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
2.4: Classification using Decision trees¶
Train a Decision Tree model using this dataset to predict whether a given record is ransom (i.e., known to be ransomware).
How did you prepare the dataset for this task?
What are the hyperparameters of your model?
How did you decide on the hyperparameters for your model?
Visualise the decision tree that you trained (display the tree in the notebook, you may optionally save this as a figure; If you save the tree as a figure, you should include the figure in the img folder, and import it into a markdown box in your notebook. You can use the command ![tree.png](./img/tree.png) where tree.png is the name of your figure.)
Print the validation, training and testing scores for your model
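A minimal decision-tree sketch covering training and in-notebook visualisation via `sklearn.tree.plot_tree`, plus saving the figure for the img folder. The data is synthetic and `max_depth=3` is illustrative, not a recommended setting:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# max_depth is illustrative; choose yours via cross validation
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

plot_tree(clf, feature_names=["f0", "f1"],
          class_names=["white", "ransom"], filled=True)
plt.savefig("tree.png")
print(clf.get_depth())
```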
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
2.5: Model comparison¶
Compare the performance of the two models that you trained in 2.3 and 2.4.
Include the Confusion Matrix, Accuracy, Precision, Recall and F1-scores in your comparison.
Which model is better based on the metrics you obtained?
Discuss the performance of the two models in terms of False Negatives and False Positives (assume that False Negatives (incorrectly classifying a ransom record as white) are more harmful than False Positives (incorrectly classifying a white record as ransom)).
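All of the requested metrics are available in `sklearn.metrics`. The toy sketch below assumes 1 encodes ransom and 0 encodes white (an assumption, match it to your own encoding), and reads the False Negative / False Positive counts off the confusion matrix:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Toy predictions; this sketch assumes 1 = ransom, 0 = white
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
fn = cm[1, 0]   # ransom misclassified as white -- the costly case here
fp = cm[0, 1]   # white misclassified as ransom

print(fn, fp,
      round(accuracy_score(y_true, y_pred), 2),
      round(precision_score(y_true, y_pred), 2),
      round(recall_score(y_true, y_pred), 2),
      round(f1_score(y_true, y_pred), 2))
```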
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR WRITTEN ANSWER HERE
2.6: Dealing with unbalanced datasets (for COMP6420 students only)¶
In the original dataset, we have a larger number of white records than ransom, as expected. To be exact, 2875284 of the 2916697 rows in the original dataset represented records not known to be ransomware.
Briefly state how you would deal with such unbalanced datasets in classification problems. Give at least two techniques that can be applied. You may refer to techniques that apply to other classification models as well (i.e. you don’t have to limit your answer to techniques for k-nearest neighbors and decision trees). Don’t forget to give your references.
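One technique you might mention, class re-weighting, can be sketched as below on a synthetic imbalanced toy set; resampling (over-sampling the minority class or under-sampling the majority class) is another common family of options:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.array([0] * 190 + [1] * 10)   # synthetic, heavily imbalanced labels

# class_weight="balanced" reweights classes inversely to their frequency,
# so errors on the rare class cost more during training
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)
print(sorted(clf.classes_.tolist()))
```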
# YOUR WRITTEN ANSWER HERE
Q3: Serious SQL (20 marks for COMP2420, 25 marks for COMP6420)¶
Consider the following scenario.
You are applying for a job as a database developer for an unnamed wrestling company. Part of the job description includes creating an automation system for running SQL queries. During the hiring process, the interviewers want to ensure you understand the SQL language. They have provided a set of questions to be answered by you, and your responses will later be reviewed by them. They are unwilling to give you access to their real database (which is