COMP2420-FE-1

COMP2420/COMP6420 – Introduction to Data Management,
Analysis and Security

Final Exam 1 – Semester 1, 2020

Instructions
Maximum Marks: 100
Weightage: 50% of the Total Course Grade
Duration: 15 min Reading + 180 min Typing
Permitted Material: Open Book

General Instructions
Save, Commit (and Push) your changes frequently, so that you do not lose your work. Do not change the names of the directories or of the files.

Code Instructions
You can import any additional Python modules you may need for your analysis in the first code block. DO NOT install any modules other than those present in the Anaconda distribution.
For all coding questions please write your code after the comment YOUR CODE HERE.
In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively.
You will be marked on the correctness and readability of your code. If your marker cannot understand your code, marks may be deducted.

Written Answer Questions
You will be marked on the correctness, depth and clarity of your written answers. If your marker cannot understand your answer, marks may be deducted.
Avoid long-winded answers and give precise answers. Answers should be clear and legible.

# Important Imports for the question/s
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans, MeanShift
from sklearn.decomposition import PCA
import sklearn.metrics as skm
from scipy.stats import ttest_ind
import sqlite3
from sqlite3 import Error
plt.style.use('seaborn-notebook')
## inline figures
%matplotlib inline

# Getting rid of warnings
import warnings
warnings.filterwarnings("ignore")

# Add imports as necessary.
# You are only allowed to use what is in the standard Anaconda installation

Question 1: Security [20 marks]
The following questions will cover the topic of security covered in the lectures
and labs. Please provide your written answers in the raw text boxes provided.
Any external information you use must be referenced. Provide references
in the raw textbox and statement of originality.

Q1.1: Super Secret
Consider the following scenario:

Alex has designed a question for the final exam and would like to send it to Ramesh. However, he is worried about the possibility of an unauthorized person getting access to the question and leaking it before the exam. Because of this concern, he’d like to ensure that the question can only be read by Ramesh.

Assuming Alex and Ramesh haven’t had any secure communications earlier, answer the following questions.

Q1.1.1: What options do they have to ensure the confidentiality of the exam question? Which option would you recommend they use? Explain why.

# Your answer here

Q1.1.2: What information do Alex and Ramesh have to exchange before the exam question can be transferred securely? Explain the secure transfer from Alex to Ramesh step by step, including what kind of encryption algorithms and keys are used.

# Your answer here

Q1.2: Replaced?
Given established and secure public key systems like RSA, are secret key systems like AES at all useful? Briefly explain.

# Your answer here

Q1.3: Hot Hashing
Tom has used a hash function to sign a message that he sends to Alice. Trudy gets hold of the signed hash. She knows that Tom has used a hash function that is not preimage resistant. What can Trudy do with this information?

# Your answer here

Question 2: Databases & SQL [20 marks]
The following questions will cover the topics of Databases & SQL covered in the
lectures and labs. Please provide your written answers in the raw text boxes
provided, and code answers in the code boxes provided.
Any external information you use must be referenced. Provide references
in the raw textbox or code box as appropriate, and statement of originality.

Q2.1: Structured Sanity
Consider the following scenario:

Alex has recently taken up a position in a government organisation tracking
CAVOD-27 cases in Australia. Alex is about to publish an application that can
track when people have come into contact with others who have had the virus,
and is attempting to determine how to collect and store this data.

How structured is the data that Alex is looking to collect? Discuss what data
might be present in this process and how it could be structured, if at all.

# Your answer here

Q2.2: Nauseous North Winds
The following question is designed to test your SQL skills. You have been provided with a database, over which you will be required to run a number of queries. You should provide each query and its output in an appropriate format. You may use the Pandas SQL-to-DataFrame functionality, or regular sqlite3 cursor functionality.

Your database looks as follows:

Note: Due to the database setup, some tables have whitespace in their names. To avoid any issues, use quotation marks around the table name (example below).

-- Example of use of quotation marks for handling table name
SELECT * FROM "ORDER DETAILS";
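As a minimal sketch of the quoting rule above, the snippet below builds a throwaway in-memory database containing a hypothetical "ORDER DETAILS" table and queries it with a plain sqlite3 cursor. The column names and rows are invented for illustration and are not taken from the exam database.

```python
import sqlite3

# Throwaway in-memory database; the table layout and rows are hypothetical.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# Double quotes let SQLite accept an identifier that contains a space.
demo_cur.execute('CREATE TABLE "ORDER DETAILS" (OrderID INTEGER, Quantity INTEGER)')
demo_cur.executemany(
    'INSERT INTO "ORDER DETAILS" VALUES (?, ?)',
    [(1, 5), (2, 12)],
)

# The same quoting is needed whenever the table is read back.
demo_rows = demo_cur.execute(
    'SELECT * FROM "ORDER DETAILS" ORDER BY OrderID'
).fetchall()
print(demo_rows)

demo_conn.close()
```

The same query string can also be passed to pandas with pd.read_sql_query(query, conn), which returns the result as a DataFrame.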

# THIS IS YOUR CONNECTION BLOCK, DO NOT MODIFY THIS.
# OTHERWISE, YOU WILL NOT BE ABLE TO READ THE DATABASE
def create_connection(db_file):
    """ Connect to the specified SQLite database, if not exist, create a new one (in memory);
    :db_file: location of db to connect to
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        print("Connection established!")
    except Error as e:
        print("Error Connecting to Database")
    return conn

dbfile_nw = "./data/q2.db"
conn = create_connection(dbfile_nw)
cur = conn.cursor()
# remember to close the connection when everything is done
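One possible way to follow the reminder about closing the connection is sketched below. It uses a separate in-memory database rather than the exam connection, and the names demo_conn and demo_value are illustrative only.

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() is called when the block exits,
# even if a query raises an exception part-way through.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_value = demo_conn.execute("SELECT 1 + 1").fetchone()[0]

# Any further use of the closed connection raises sqlite3.ProgrammingError.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(demo_value, still_open)
```

For a connection created once at the top of a notebook, calling conn.close() in the final cell achieves the same result.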

Q2.2.1: Dubious Discounts
Specify how many products received a discount of more than 10% when sold. In this instance, you can assume there is a 1-1 relationship between orders and products sold in an order.

# Your Code Here

Q2.2.2:
Assuming commission is based on 5% of the unit price of an order/productID pair, determine the amount of commission (in dollars) made by the employee with employeeid 1 (one). (You may assume that the unit price has already had the discount applied).

# Your Code Here

Q2.2.3: Multinational
Provide a list of all the countries that are listed as customer addresses, and the number of customers in each country. Order the result by the number of customers in each country, in descending order. You may include null values.

# Your Code Here

Question 3: Data Analysis [20 marks]
The following questions will cover the topic of Data Analysis covered in the
lectures and labs. Please provide your written answers in the raw text boxes
provided, and code answers in the code boxes provided.
Any external information you use must be referenced. Provide references
in the raw textbox or code box as appropriate, and statement of originality.

3.1: Modeling the Ideal Model
Consider the following scenario:

While you’re working at a consulting firm, one of your colleagues, Afzal, created a machine learning model for a real estate agent client to predict the price of a house.

The dataset initially contains records for 10,000 houses. Each house has information such as:

number of bedrooms
number of bathrooms
size of the land in square meters ($m^2$)
the year the house was built
suburb name (eg. “Acton”, “Gungahlin”, “Holder”)
the price in thousands of dollars ($000s)
3 other relevant details (you may assume all of these other values are numerical)

Some of the data is missing – for example, for many old houses, the year that the house was built is unknown so the entry is “NaN” for this value.

Afzal creates the model by:

Dropping all rows where any of the values are “NaN”. This reduces the number of houses to 6,740.
Recoding the suburb name alphabetically using Label Encoding (eg. Acton as 1, Aranda as 2, …, Woden as 46, Wright as 47, etc).
Using all of the data (except for the house price column, of course!), Afzal fits the model using a linear regression model.
He reports the coefficients for the model – some are positive and some are negative. In particular, the intercept is a negative number.
He also reports the score of his model using all of the data, and gets an R^2 score of 0.923 and a Mean-Squared Error (MSE) of 48,030. He shows these scores to your boss and explains that a score of 0.923 is great!

Your boss is very impressed by Afzal’s model, and is just about to present it to the client. However, you suspect that there’s something wrong here.

Clearly describe each issue in the process of this model creation, why this is an issue in a general context when creating a machine learning model (not just in this scenario), and what you would do instead in this scenario.

If you find that there are more than 3 issues, only discuss the biggest 3 issues. For each issue, use a new raw text box and provide your response regarding that issue in that text box. Be sure to state all assumptions.
[10 marks]

# YOUR ANSWER HERE

# YOUR ANSWER HERE
# Issue 2 (if you find a second issue)

# YOUR ANSWER HERE
# Issue 3 (if you find a third issue)

3.2: Rough Time in the Rumble
The Royal Rumble Dataset from Assignment 2 has made a return. As a reminder, you have a number of csvs in the following format:

| Field | Description |
| --- | --- |
| Draw | The number of the entrant |
| Entrant | The entrant's name |
| Brand | Brand division; only exists in files for 2003-2011 and 2017 or later |
| Order | The position in which the entrant was eliminated. A winner will have the order of – |
| Eliminated by | Name of the entrant who eliminated them. This may be multiple entrants, in which case it is semicolon-delimited (';'). If this is "Winner", then this indicates that the entrant won the event (never eliminated) |
| Time | The amount of time (in minutes) the entrant was in the match. If this is '00:00', don't include them as a participant in the match; the reason will be given in Eliminated by |
| Eliminations | The number of entrants they eliminated |
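To make the field conventions above concrete, here is a small sketch that applies them to a few invented rows. It is an illustration under assumed data, not the exam dataset: the wrestler names, times and the free-text reason are all hypothetical.

```python
# Hypothetical rows in the documented format; names and values are invented.
sample_rows = [
    {"Entrant": "Wrestler A", "Eliminated by": "Wrestler B;Wrestler C", "Time": "12:30"},
    {"Entrant": "Wrestler B", "Eliminated by": "Winner", "Time": "45:10"},
    {"Entrant": "Wrestler D", "Eliminated by": "Injured before entering", "Time": "00:00"},
]

# A time of '00:00' marks someone who never took part in the match.
participants = [row for row in sample_rows if row["Time"] != "00:00"]

# Split the semicolon-delimited field into one name per eliminator.
eliminators = {
    row["Entrant"]: [name.strip() for name in row["Eliminated by"].split(";")]
    for row in participants
    if row["Eliminated by"] != "Winner"
}

# "Winner" in the Eliminated by field marks the entrant who was never eliminated.
winners = [row["Entrant"] for row in participants if row["Eliminated by"] == "Winner"]
print(eliminators, winners)
```

With pandas, the equivalent split on a column is available through the .str.split(';') accessor.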

The following questions are designed to test your data analysis skills.

3.2.1: The Entrances
Import all the datasets (available in ./data/q3), providing two dataframes of the entire rumble history: one for female matches and one for male matches. Add a column that specifies the year of the row. At the end of this question, your two dataframes should match the following schema (one for males, one for females):

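| Draw | Entrant | Brand/Status | Order | Eliminated by | Time | Elimination(s) | Year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | | Raw | 14 | Intyre | 26:24 | 13 | 2020 |
| 1 | Bret Hart | No Data | 8 | | 25:42 | 1 | 1988 |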

Note: For any column where you do not have the data, you may fill it in as "No Data". You also don't have to sort by Draw; the rows above simply show an example of a "No Data" column.
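The note about "No Data" can be illustrated with a short pandas sketch. The frame below is hypothetical (invented names and values, and an assumed year) and only shows one way a missing column could be added and filled; it is not the required solution.

```python
import pandas as pd

# Target column order taken from the schema shown above.
schema = ["Draw", "Entrant", "Brand/Status", "Order",
          "Eliminated by", "Time", "Elimination(s)", "Year"]

# Hypothetical single-year frame with no brand column; the values are invented.
demo_year = pd.DataFrame({
    "Draw": [1, 2],
    "Entrant": ["Wrestler A", "Wrestler B"],
    "Order": [8, 3],
    "Eliminated by": ["Wrestler C", "Wrestler A"],
    "Time": ["25:42", "10:05"],
    "Elimination(s)": [1, 0],
})
demo_year["Year"] = 1988  # assumed to come from the file being read

# reindex adds any column missing from the frame and fills it with "No Data".
aligned = demo_year.reindex(columns=schema, fill_value="No Data")
print(aligned)
```

Frames aligned this way share one column order, so they can be stacked with pd.concat.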

# Your Code Here

3.2.2: Number One & Two
Naturally, you would think positions (denoted as draw) 1 (one) & 2 (two) would be the worst to have, although anything can happen. Show the 3 positions that have yielded the most winners in the Men's Royal Rumble Match.

# Your Code Here

3.2.3: Scared of the Ropes
Provide the 5 participants who have eliminated the fewest people in the Men's Royal Rumble. Only consider people who have participated in more than 2 events.

# Your Code Here

3.2.4: Battling Baszler & Belair
Show each participant that was eliminated by either Baszler or Belair (or both), what spot they entered in, and who eliminated them.

# Your Code Here

Q4 – Machine Learning [30 Marks]
The following questions will cover the topic of Machine Learning covered in the
lectures and labs. Please provide your written answers in the raw text boxes
provided, and code answers in the code boxes provided.
Any external information you use must be referenced. Provide references
in the raw textbox or code box as appropriate, and statement of originality.

Q4.1: Serious about Soccer
The data is in the location /data/q4.csv. Your data looks as follows:

| Name | Description |
| --- | --- |
| ID | Player ID |
| Age | Age of player |
| Stamina | Stamina skill value of player |
| Strength | Strength skill value of player |
| Aggression | Aggression skill value of player |
| Interceptions | Interceptions skill value of player |
| Penalties | Penalty skill value of player |
| Composure | Composure skill value of player |

In preparation for further data analysis at a later date, Ben wants to separate the soccer players into distinct groups based on their characteristics.

He has recruited you to find these groups for him, because he’s heard you’re good at this sort of thing. In fact, he trusts your judgement so much that he’s given you (almost) total control over what sort of groups you create. His only requests are:

Consider all the data he’s giving you – try not to throw away data that he’s spent time to gather, unless you’ve got a good reason for it.
He doesn’t want too many groups; if there’s a large number of groups then it’s hard for him to do group-specific data analysis later.
While he trusts your judgement, he also wants to know what your judgement is. When you present the groups to him, he wants a description of each group and why it differs from the rest.

Q4.1.1: Starting Off
Start by importing the data and pre-processing it as you wish.

# Your Code Here

Q4.1.2: Stick Together
Now get to work on creating your grouping. Include a visualisation of your groupings as well, so Ben can easily see how these players’ stats differ on a graph.

Note: To ensure your analysis is consistent during marking, you should set a random seed before grouping.
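As a hedged illustration of why a fixed seed makes results repeatable, the sketch below uses NumPy only. The helper pick_initial_centres, the data and the seed value are hypothetical stand-ins for the random initialisation step that many grouping algorithms perform.

```python
import numpy as np

def pick_initial_centres(points, k, seed):
    """Pick k rows at random as starting centres, using a seeded generator."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(points), size=k, replace=False)
    return points[chosen]

# Hypothetical 2-D data, invented for illustration.
demo_points = np.arange(20, dtype=float).reshape(10, 2)

first_run = pick_initial_centres(demo_points, k=3, seed=2420)
second_run = pick_initial_centres(demo_points, k=3, seed=2420)

# Same seed, same random choices, so repeated runs agree.
print(np.array_equal(first_run, second_run))
```

scikit-learn's KMeans accepts a random_state argument that serves the same purpose.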

# Your Code Here

Q4.1.3: Presentation Prep
Before you present your groupings to Ben, you should justify your findings.

First, print relevant statistics on each grouping and ensure that the output is easy to read and compare – you may find it useful to use a visualisation, but this isn’t necessary for full marks.
Second, for each group, give a brief description (~30 words or less each) on what differentiates it from the rest of the groups.
Finally, give a brief justification as to why you settled on this grouping. If you excluded any data, explain why.

# Your Code Here

# Your answer here

Q4.1.4: Not so Close!
Finally, based on your visualisation, statistical output and descriptions, find the two groupings that look most alike – if there are many that are similar, choose any of these pairs.

To justify to Ben that this pair of groups should be separate and that they shouldn’t be combined into one group, show that these groups have a statistically significant difference between them in one of the player attributes.

Side Note: A hypothesis test on one attribute doesn’t indicate that the clusters as a whole should be separate, but Ben is happy to use this as justification.

# Your Code Here

# Your answer here

Q4.2: Troublesome Theory

Q4.2.1: Tumultuous Testing
Consider the following scenario:

Alex is currently using a Machine Learning model to predict whether a student will pass or fail, based on their lab attendance. However, depending on how his data is split when he uses train_test_split, the accuracy of the model varies widely.

What is the problem here and how could he account for this? Provide an example of a tool or procedure he could use to solve this issue and ensure the accuracy is representative of the dataset.

# Your answer here

Q4.2.2: Silly Scoring
Consider the following scenario:

Ben is currently working for Blume, using items such as Name, Date of Birth, employment, and income to predict how likely someone is to ask for a payrise.

Disregarding the potential ethical issues with this, should Ben be using a classification or regression model? What would be the impact of using the incorrect model?

# Your answer here

Q5 – Ethics [10 Marks]
The following questions will cover topics of ethics covered in the lectures
and labs. Please provide your written answers in the raw text boxes provided.
Any external information you use must be referenced. Provide references
in the raw textbox and statement of originality.

Q5.1: Ethical Issues
Discuss two economic ethical issues (as defined in the lectures)
that may arise from the use of Data Analytics, and what impact this may have
on workers, or the wider community. Provide an example and outline the impact
this example could have.

# Your answer here

Q5.2: The Value of Data
Discuss two ways the value of Confidentiality can be harmed by
Data Analytics. Provide an example of each.

# Your answer here

Congratulations, you have made it to the end. Don’t forget to save, fill out your statement of originality, commit & finally push your work to your repo.