COMP2420/COMP6420 – Introduction to Data Management, Analysis and Security

Assignment_1_2022.ipynb · master · comp2420 / 2022 / comp2420-2022-ass1 · GitLab https://gitlab.cecs.anu.edu.au/comp2420/2022/comp2420-2022-ass1/-/blob/master/Ass…
Update Assignment_1_2022.ipynb
Line Kan John authored 1 week ago
Assignment_1_2022.ipynb 32.4 KB


COMP2420/COMP6420 – Introduction to Data Management, Analysis and Security
Assignment – 1 (2022)
Maximum Marks:
Weight: 15% of the Total Course Grade
Submission deadline:
Submission mode: Electronic, Using GitLab
Late penalty: 100% after the deadline
Learning Outcomes
The following learning outcomes apply to this piece:
LO3 – Demonstrate basic knowledge and understanding of descriptive and predictive data analysis methods, optimization and search, and knowledge representation.
LO4 – Formulate and extract descriptive and predictive statistics from data
LO5 – Analyse and interpret results from descriptive and predictive data analysis
LO6 – Apply their knowledge to a given problem domain and articulate potential data analysis problems
Submission
You need to submit the following items:
The notebook Assignment_1_2022_uXXXXXXX.ipynb (where uXXXXXXX is your uid)
A completed statement-of-originality.md, found in the root of the forked GitLab repo.
Submissions are performed by pushing to your forked GitLab assignment repository. For a refresher on forking and cloning repositories, please refer to Lab 1. Issues with your Git repo (with the exception of a CECS/ANU-wide GitLab failure) will not be considered as grounds for an extension. You will also need to add your details below. Any variation of this will result in a zero mark.
It is strongly advised to read the whole assignment before attempting it and have at least a cursory glance at the dataset in order to gauge the requirements and understand what you need to do as a bigger picture.
Back up your assignment to your GitLab repo often.
Extra reading and research will be required. Make sure you include all references in your Statement of Originality. If this does not occur, at best marks will be deducted. Otherwise, academic misconduct processes will be followed.
For answers requiring free form written text, use the designated cells denoted by YOUR WRITTEN ANSWER HERE — double click on the cell to write inside them.
For all coding questions please write your code after the comment YOUR CODE HERE. Remember to document your code using comments and docstrings as appropriate.
In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively. You are welcome to add additional cells to the final submission, provided they add value to the overall piece.
You will be marked on the correctness and readability of your code. If your marker can’t understand your code, marks may be deducted.
Comment your code.
Before submitting, restart the kernel in Jupyter Lab and re-run all cells. This ensures the namespace has not kept any old variables, which won’t come across in submission and would stop your code from running. Without this, you could lose a significant number of marks.
Page 1 of 8 2022/3/19 17:00

Credit: This assignment is based on previous work by in COMP2420/6420. We thank Alex for allowing us to use his work and build on it.
Enter your Student ID below:
Context
You have been hired as a data scientist on a cybersecurity consulting team. Your team has been tasked with advising government on the risk and impact of recent cybersecurity threats.
What is cybersecurity and why do we care?
“Cybersecurity is the practice of protecting critical systems and sensitive information from digital attacks” (IBM, 2022). Attackers normally target vulnerabilities in software and hardware systems in order to either bring down a system or steal sensitive or personal information. Common cyber threats include malware, phishing, ransomware and distributed denial of service (DDoS). You can read more about cyber threats on the Australian Cyber Security Centre web page.
Cyber-attacks can have a significant negative impact on individuals, businesses and society at large. They can lead to loss of privacy and money, cause disruption in key services and even cause death (some examples are given in this article on the impact of cyber-attacks). Moreover, dealing with cyber-attacks is expensive. IBM reports that the cost of a data breach in 2020 was USD $3.85 million globally (IBM, 2022).
How could you start your underlying investigation?
As a data scientist, you first need to understand what problem you are trying to solve. In this particular case, you are trying to assess the risk and impact of recent cybersecurity threats. To do so, you need to know what those threats are and have a method to carry out this assessment. Where can you find this information? There are various sources you could draw from. To get started, your team has identified a few relevant systems for your investigation, as described next.
The Common Vulnerability and Exposures (CVE) system
The CVE system is like a database that holds a number of the publicly known vulnerabilities that exist for software. It is the de-facto identifying system for publicly exposed vulnerabilities in systems, used by big tech companies such as Apple, Microsoft, Google, Red Hat, etc. The CVE is a schema that allows the consistent storing of information regarding vulnerabilities. More reading on the CVE is here
The CVE system was developed by The MITRE Corporation almost 20 years ago, and is now the de-facto system for providing identifiers for vulnerabilities in various systems.
CVE defines a vulnerability as, “A weakness in the computational logic (e.g., code) found in software and hardware components that, when exploited, results in a negative impact to confidentiality, integrity, or availability”. A CVE can affect multiple products and multiple software versions of a product.
However, the CVE system alone is incomplete, and is extended by organisations such as the National Vulnerability Database (NVD).
The Common Weakness Enumeration (CWE) system
There is another system related to CVE called the Common Weakness Enumeration (CWE), also developed by MITRE. CWE categorises types of software vulnerabilities, whereas CVE is just a list of currently known vulnerabilities regarding specific systems and products (Camacho, 2021). Each CWE identifier relates to a specific type of weakness with its own unique characteristics, rather than to specific instances of vulnerabilities within products or systems. CWEs are broadly viewed in three categories:
by Software Development
by Hardware Design
by Research Concepts
The Common Vulnerability Scoring System (CVSS)
CVSS is the de facto scoring system for determining the impact of vulnerabilities in the CVE system. It is developed and maintained by the Forum of Incident Response and Security Teams (FIRST) and is now in its third major iteration (version 3). The National Vulnerability Database (NVD) publishes CVSS scores for vulnerabilities, and all vulnerabilities in the NVD have been assigned a CVE identifier.
The Assignment Dataset: based on the Common Vulnerability Scoring System (CVSS) data
The assignment dataset is derived from a subset of the Common Vulnerability Scoring System (CVSS) data for the year 2020 available from the National Vulnerability Database (NVD).
Note that while over 1000 CWE identifiers exist, only a small subset will be present within our dataset. This is due to the NVD using their own subset of them, which can be found on the NVD website.
We have further filtered the 2020 CVSS dataset by retaining only the records that relate to the Software Development viewpoint. In our dataset, each unique CVE is mapped to one or more CWEs and is given a vulnerability score assigned by the CVSS scoring system.
What should I do next?
Good question! Now that you have some background, you can work with the given CVE-based dataset as a starting point to explore a number of questions to help in your investigation. You can draw on your Python, data analysis and basic machine learning skills to work towards the goal that your team has been tasked with.
References
IBM. 2022. What is Cybersecurity? | IBM. [online] Available at: https://www.ibm.com/au-en/topics/cybersecurity. (Accessed 3 March 2022).
Camacho, R. 2021. All about CWE: Common Weakness Enumeration. Parasoft. [online] Available at: https://www.parasoft.com/blog/what-is-cwe/#:~:text=In%20short%3A%20the%20difference%20between,regarding%20specific%20systems%20and%20products.
Data Description
We have a sizable dataset to give you (in the form of 2 files), so it is wise to consider your code in terms of complexity to ensure it doesn’t take 30 minutes to run a single line.
The below tables provide an outline of the data, broken down into the columns of the dataset features.
The CVSS data table
Column Name | Description
(not shown) | The CVE identifier for the vulnerability
(not shown) | The entity who assigned the CVE
description | A description of the vulnerability
(not shown) | The CWE identifiers of the vulnerability. Note that there can be multiple cwe_id’s attached to one cve_id
(not shown) | URL links to the initial postings of the vulnerability
(not shown) | Other information that provides more reference about the CVE
ref_sources | Other information that provides more reference about the CVE
(not shown) | Other information that provides more reference about the CVE
v3_attackVector | CVSSv3 field, identifier for how the vulnerability would be used in an attack
v3_attackComplexity | CVSSv3 field, identifier for the difficulty of performing an attack using the vulnerability
v3_privilegesRequired | CVSSv3 field, an identifier for the privileges required in the system to use the vulnerability successfully
v3_userInteraction | CVSSv3 field, an identifier for whether a user needs to actively interact for the vulnerability to be exploited or not
(not shown) | CVSSv3 field, an identifier for whether the scope of an item changes when using the vulnerability, e.g. whether a regular user becomes a superuser
v3_confidentialityImpact | CVSSv3 field, identifier for the impact upon the confidentiality of information in the product/service after using the vulnerability
v3_integrityImpact | CVSSv3 field, identifier for the impact upon the integrity of information in the product/service after using the vulnerability
v3_availabilityImpact | CVSSv3 field, identifier for the impact upon the availability of information in the product/service after using the vulnerability
v3_exploitabilityScore | The Exploitability Score is a sub score of the CVSS Base Score
v3_impactScore | The Impact Score is a sub score of the CVSS Base Score
v3_baseScore | The CVSS score (out of 10) given to the vulnerability based on CVSS v3.1
v3_baseSeverity | A textual representation of the numeric Base Score
We use only the Base Metrics from the CVSS metrics. Additional metrics can be applied, but most are variants of the base metrics. The column names starting with ‘v3_’ are CVSS v3.1 metrics. Refer to the specification document CVSSv3.1 Guide for more information on the metrics.
Note: While this dataset has 20 columns, the data in the last four columns have been purposely omitted (see Question 2 of the Assignment).
The CVE to Configurations mapping table
Column Name | Description
(not shown) | The CVE identifier for the vulnerability
(not shown) | The name of the vendor who produces the product
product_name | The name of the affected product
(not shown) | List of the affected product versions
Recall that a CVE can affect multiple products and multiple software versions of a product.
Package Imports
# Common Imports
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
plt.style.use('seaborn')
%matplotlib inline
# Import additional modules here as required
# It is unlikely that you would need any additional modules, however we have added space here just in case
# extras are required. Note that justification as to WHY you are using them MUST be provided.
# Note that only modules in the standard Anaconda distribution are allowed. If you need to install it
# manually, it is not an accepted package.
Q1: Loading and Processing the Data
Your first step in any data analysis and visualisation task is to load the data and make it usable. Note that the data consists of various types (Categorical, Numerical, Text, etc.). Also the dataset may not be perfect; it may contain missing data or invalid values at some places. It would be wise to perform some pre-processing to make the data easier to work with.
(Q1.a) You need to load the following two files available in the ‘./data’ folder into a suitable data structure:
cvss_dataset.csv
cve_configurations_mapping.csv
Please write out the code you would use to load those files and the code you would use to perform some pre-processing. (2 marks)
(Q1.b) You also need to briefly outline your steps and justify your decisions.
This is an open-ended question, and marks will be awarded for logical processing of data. (3 marks)
HINTS –
You might need to split some columns into two or combine two columns into one to make them more useful from an analysis point of view.
You might need to rename some columns.
It may be worth recoding the CVSS data to the numerical values required for Q2.
You are welcome to drop unwanted columns (but don’t remove or impute values for the last four columns as you will be asked to recreate these columns in Q2).
If you wish, you may combine the data available in both files.
# YOUR CODE HERE (Q1.a)
# (ADD ANY ADDITIONAL CELLS AS REQUIRED)
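As an illustration of the kind of loading and light pre-processing Q1 describes, here is a minimal sketch. It reads from small inline stand-ins rather than the real files, and the column names `cve_id`, `cwe_ids` and `vendor_name`, as well as the `;` separator inside `cwe_ids`, are assumptions that need to be checked against the actual CSV headers. The helper name `load_tables` is hypothetical.

```python
import io
import pandas as pd

# Tiny inline stand-ins for the two CSV files so the sketch runs anywhere.
# The column names and the ";" separator are assumptions, not the real schema.
CVSS_CSV = io.StringIO(
    "cve_id,cwe_ids,v3_attackVector,v3_baseScore\n"
    "CVE-2020-0001,CWE-79;CWE-89,NETWORK,\n"
    "CVE-2020-0002,,LOCAL,\n"
    "CVE-2020-0002,,LOCAL,\n"
)
CONFIG_CSV = io.StringIO(
    "cve_id,vendor_name,product_name\n"
    "CVE-2020-0001,google,chrome\n"
    "CVE-2020-0001,google,android\n"
)


def load_tables(cvss_src, config_src):
    """Load both tables and apply light, non-destructive clean-up."""
    cvss = pd.read_csv(cvss_src)
    configs = pd.read_csv(config_src)
    # Exact duplicate rows carry no extra information.
    cvss = cvss.drop_duplicates().reset_index(drop=True)
    configs = configs.drop_duplicates().reset_index(drop=True)
    # Turn the multi-valued CWE string into a list; missing values become [].
    cvss["cwe_list"] = cvss["cwe_ids"].fillna("").apply(
        lambda s: [part for part in s.split(";") if part]
    )
    return cvss, configs


cvss_df, config_df = load_tables(CVSS_CSV, CONFIG_CSV)
print(cvss_df.dtypes)        # check which columns are categorical or numeric
print(cvss_df.isna().sum())  # count missing values per column
```

With the real data, the two sources would be the paths `./data/cvss_dataset.csv` and `./data/cve_configurations_mapping.csv`. The sketch deliberately leaves the empty score columns untouched, in line with the hint about not imputing the last four columns.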
Q2: Recreating Missing Data
While the dataset that has been provided is thorough, you may have already noticed that the last four columns (i.e. ‘v3_exploitabilityScore’, ‘v3_impactScore’, ‘v3_baseScore’, ‘v3_baseSeverity’) are empty. These are related to the CVSSv3.1 base score and are well documented in the specification documents CVSSv3.1 Guide and CVSS calculator.
Your task is as follows:
(Q2.a) Implement a CVSSv3.1 Base Score calculator and recalculate values for the last four columns for each applicable entry in the dataset. (5 marks)
(Q2.b) Explain how you performed the calculations. Provide the Equations that you used. (5 marks)
Additional Questions for COMP6420 students: [worth extra 5 marks]
(Q2.c) Please explain how you would validate that your calculations are correct. (3 marks)
(Q2.d) Provide some evidence that you have validated your calculations. (2 marks)
# YOUR CODE HERE
# (ADD ANY ADDITIONAL CELLS AS REQUIRED)
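The base score equations in the CVSS v3.1 specification can be sketched as follows. The metric weights and the round-up rule come from the specification. The upper-case value labels (for example `ADJACENT_NETWORK`) are an assumption about how the dataset spells each metric, and rounding the two sub scores to one decimal place is an assumption made to match how the NVD publishes them. The function names are hypothetical.

```python
import math

# Metric weights from the CVSS v3.1 specification. The upper-case labels
# are an assumption about how the dataset spells each value.
AV = {"NETWORK": 0.85, "ADJACENT_NETWORK": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}
PR_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
PR_CHANGED = {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.5}


def roundup(value):
    """Round up to one decimal place using the integer method from the spec."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def severity(score):
    """Map a numeric base score to its qualitative severity rating."""
    if score == 0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


def base_metrics(av, ac, pr, ui, scope, c, i, a):
    """Return (exploitability, impact, base score, severity) for one record."""
    changed = scope == "CHANGED"
    pr_weight = (PR_CHANGED if changed else PR_UNCHANGED)[pr]
    # Impact Sub-Score: ISS = 1 - (1 - C)(1 - I)(1 - A)
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    if impact <= 0:
        base = 0.0
    elif changed:
        base = roundup(min(1.08 * (impact + exploitability), 10))
    else:
        base = roundup(min(impact + exploitability, 10))
    # Sub scores rounded to one decimal: an assumption matching NVD output.
    return round(exploitability, 1), round(impact, 1), base, severity(base)


print(base_metrics("NETWORK", "LOW", "NONE", "NONE",
                   "UNCHANGED", "HIGH", "HIGH", "HIGH"))
```

One way to validate a calculator like this is to compare a handful of vectors against the official CVSS calculator or against scores published by the NVD for the same CVEs.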
Q3: Data Exploration
[10 marks]
In this question you are asked to explore the given data and present information in a suitable manner. You are required to present the information both visually (using plots) and using descriptive statistics.
There is an unverified claim that most of the CVEs reported in 2020 were of MEDIUM severity. How would you check that claim and present the conclusion? Please explain your approach and implement it in code.
# YOUR CODE HERE
# (ADD ANY ADDITIONAL CELLS AS REQUIRED)
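One possible starting point for checking the MEDIUM severity claim is to tabulate the severity labels and decide what "most" should mean. The sketch below uses hypothetical labels in place of the recomputed `v3_baseSeverity` column, and it treats "most" as "more than half", which is an assumption you would need to state.

```python
import pandas as pd

# Hypothetical labels standing in for the recomputed v3_baseSeverity column.
severity = pd.Series(
    ["MEDIUM", "HIGH", "MEDIUM", "CRITICAL", "LOW", "MEDIUM", "HIGH", "MEDIUM"],
    name="v3_baseSeverity",
)

counts = severity.value_counts()                # absolute frequencies
shares = severity.value_counts(normalize=True)  # relative frequencies

most_common = counts.idxmax()
# "Most" read as a strict majority (> 50%): an assumption, state it explicitly.
is_majority = shares[most_common] > 0.5

print(counts)
print(f"Most common: {most_common}, share = {shares[most_common]:.0%}, "
      f"majority = {is_majority}")
```

In this toy data MEDIUM is the most frequent label but covers exactly half of the records, so the answer depends on the reading of "most". A bar chart of `counts` would present the same information visually.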
What are the top 5 CWEs that are mentioned in the data? Why did you present this information in the way you chose? [5 marks]
# YOUR CODE HERE
# (ADD ANY ADDITIONAL CELLS AS REQUIRED)
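Because one CVE can carry several CWE identifiers, counting CWEs usually means splitting the multi-valued field first. A minimal sketch, assuming a column named `cwe_ids` with `;` as the separator (both are assumptions about the real file) and using hypothetical records:

```python
import pandas as pd

# Hypothetical records; the column name and ";" separator are assumptions.
df = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B", "CVE-C", "CVE-D", "CVE-E"],
    "cwe_ids": ["CWE-79", "CWE-79;CWE-89", "CWE-20", None, "CWE-89;CWE-79"],
})

cwe_counts = (
    df["cwe_ids"]
    .dropna()          # CVEs without a CWE cannot be counted
    .str.split(";")    # one list of identifiers per CVE
    .explode()         # one row per (CVE, CWE) pair
    .str.strip()
    .value_counts()    # sorted from most to least frequent
)
top5 = cwe_counts.head(5)
print(top5)
```

A horizontal bar chart of `top5` is one suitable way to present the ranking, since the categories are unordered labels.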
Google products are commonly used. Your team wants to know how cyber-threats are affecting google users. Find all the CVEs associated with the Vendor google and present the distribution of CVSS Base Scores for google in a suitable manner. Please also explain your steps.
# YOUR CODE HERE
# (ADD ANY ADDITIONAL CELLS AS REQUIRED)
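The steps for the vendor question can be sketched as: select the vendor's rows in the configurations table, reduce them to unique CVE identifiers, then look those up in the scores table. The column names `cve_id` and `vendor_name` and all values below are hypothetical.

```python
import pandas as pd

# Hypothetical score and configuration tables; column names are assumptions.
scores = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B", "CVE-C"],
    "v3_baseScore": [9.8, 6.1, 7.5],
})
configs = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-A", "CVE-B", "CVE-C"],
    "vendor_name": ["google", "google", "google", "microsoft"],
    "product_name": ["chrome", "android", "chrome", "windows"],
})

# A CVE can affect several products, so de-duplicate the identifiers first;
# otherwise the same score would be counted once per affected product.
google_cves = configs.loc[
    configs["vendor_name"].str.lower() == "google", "cve_id"
].drop_duplicates()

google_scores = scores[scores["cve_id"].isin(google_cves)]
summary = google_scores["v3_baseScore"].describe()
print(summary)
```

A histogram or box plot of `google_scores["v3_baseScore"]` would show the distribution that `summary` describes numerically.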
Find the top 5 vendors that are most affected (i.e. that have the most rows in the configurations table) and present the distribution of CVSS Base Scores for these top 5 vendors using a suitable visualization. Please also explain your steps.
# YOUR CODE HERE
# (ADD ANY ADDITIONAL CELLS AS REQUIRED)
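A minimal sketch of the top-vendor comparison, ranking vendors by row count in the configurations table as the question defines it. The helper name `scores_for_top_vendors`, the column names and the data are all hypothetical, and the toy call uses `n=2` only because the example has three vendors.

```python
import pandas as pd

# Hypothetical tables; column names are assumptions about the real files.
scores = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B", "CVE-C"],
    "v3_baseScore": [9.8, 6.1, 7.5],
})
configs = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-A", "CVE-B", "CVE-C", "CVE-B", "CVE-C"],
    "vendor_name": ["google", "google", "google",
                    "microsoft", "microsoft", "apple"],
    "product_name": ["chrome", "android", "chrome", "windows", "edge", "ios"],
})


def scores_for_top_vendors(configs, scores, n=5):
    """Return one row per (vendor, CVE) pair for the n vendors with most rows."""
    top = configs["vendor_name"].value_counts().head(n).index
    pairs = configs.loc[
        configs["vendor_name"].isin(top), ["cve_id", "vendor_name"]
    ].drop_duplicates()  # count each CVE once per vendor
    return pairs.merge(scores, on="cve_id", how="inner")


merged = scores_for_top_vendors(configs, scores, n=2)
summary = merged.groupby("vendor_name")["v3_baseScore"].describe()
print(summary)
```

Side-by-side box plots, one per vendor, would be one suitable visualization of `merged`.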
Q4: Identifying Data Analysis Problems
CVEs and real world issues
[10 marks]
We mentioned that the dataset is from 2020 and that we have only given you the records that relate to the Software Development viewpoint. Do you remember any major software vulnerabilities that came to light in 2020? This article claims that the top 2 vulnerabilities found in 2020 are:
CVE-2020-8515: Draytek Vigor Command Injection
CVE-2020-5722: HTTP: Grandstream UCM6200 SQL Injection
However, there may be various viewpoints. For example, this article mentions another 10 CVEs. Your task is as follows:
Find the vulnerabilities mentioned in the above two articles in the given dataset and present them in a tabular format. You may not find all 12 CVEs. What are possible reasons for this? (5 marks)
Examine the properties of the CVEs that you found above. (At a minimum you should consider the data available in the ‘./data/cvss_dataset.csv’ file.) Present a justification as to why some of the given CVEs may have been considered a “large” bug. This should include references to the amount of damage a vulnerability caused, or could potentially have caused. (5 marks)
Additional question for COMP6420 students: [worth extra 10 marks]
If you were given the task of identifying the top-10 most critical CVEs in the given data, how would you tackle the problem? Give a brief list of initial analysis you would perform. (7 marks)
How would you go about implementing your proposed approach? (3 marks)
References are highly recommended for this question (both parts a and b) so that you can evidence your argument. DO NOT forget to list your references, including in your statement of originality document. Please note that failure to reference or improper referencing constitutes a case for plagiarism, which can have serious consequences for you. So make sure you use references appropriately. Please familiarise yourself with the university’s academic integrity rules here if you have not done so already.
# YOUR CODE HERE
# (ADD ANY ADDITIONAL CELLS AS REQUIRED)
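Looking up a list of CVE identifiers and reporting which ones are absent can be sketched as below. The two identifiers are the ones named in the question. The table contents are hypothetical and say nothing about what the real dataset contains, and the column name `cve_id` is an assumption.

```python
import pandas as pd

# The two CVE identifiers named in the question text.
wanted = ["CVE-2020-8515", "CVE-2020-5722"]

# Hypothetical table: which CVEs actually appear in the real data is unknown.
df = pd.DataFrame({
    "cve_id": ["CVE-2020-8515", "CVE-2020-0001"],
    "v3_attackVector": ["NETWORK", "LOCAL"],
})

found = df[df["cve_id"].isin(wanted)]                      # rows to tabulate
missing = sorted(set(wanted) - set(found["cve_id"]))       # not in the table

print(found)
print("Not in dataset:", missing)
```

Listing `missing` explicitly gives a concrete starting point for discussing why some CVEs are absent, for example the Software Development filter described earlier.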
Q5: Data Analysis
In this section, you will be provided with a question or statement that you are required to prove or disprove. For each question, you are to provide a statement outlining your answer, using evidence from the dataset as your justification. You are expected to draw upon not only your visualisation skills, but also your hypothesis testing skills where required. That means you are expected to justify your answer based on both statistical and visual evidence.
Don’t forget to state any assumptions you make in the questions in order to clarify your argument. Use the following as a guide to assess the statements given below:
How would you assess the given
