2019-SampleExam-Draft

COMP2420/COMP6420 – Introduction to Data Management,
Analysis and Security

Sample Exam – Semester 1, 2019

Instructions
Maximum Marks: 100
Weightage: 50% of the Total Course Grade
Duration: 15 min Reading + 180 min Typing
Permitted Material: One A4 page with (written or printed) notes on both sides

There are five questions. All answers are to be submitted using the plain text or notebook files found in directories named Q1 through Q5 in the finalexam directory on your Desktop.
Save your changes frequently, so that you do not lose your work. Do not change the names of the directories or of the files.
You can import any additional Python modules you may need for your analysis in the first code block. DO NOT try to install any modules other than those present in the standard Anaconda installation.
For all coding questions please write your code after the comment YOUR CODE HERE.
In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively.
You will be marked on correctness and readability of your code. If your marker can’t understand your code, marks may be deducted. Don’t forget to leave comments in your code to explain the workflow.
Check that your work is being saved regularly.
Use Jupyter only for the questions that need it. Ideally, close one question before opening the next one. Save your work and check that it is being saved using another application, such as a document viewer.
Give precise answers and avoid long-winded ones. Answers should be clear and legible.

# Important Imports for the question/s
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans, MeanShift
import sklearn.metrics as skm
plt.style.use('seaborn-notebook')
## inline figures
%matplotlib inline

# Getting rid of warnings
import warnings
warnings.filterwarnings("ignore")

# Add imports as necessary. You are only allowed to use what is in the standard Anaconda installation in the exam as others will not be installed on the lab computers!

Question 1: Data Analysis [20 Marks]

Q1.1 – Appropriate Graphs [5 Marks]

For each of the following scenarios, determine which plot type is most appropriate to reveal the distribution of and/or the relationships between the variable(s) referred to in each sub question.

Select (with justification) only one plot type from the ones listed below.

side-by-side boxplots
scatter plot
stacked bar plot

Sale price and number of bedrooms (assume integer) for houses sold in Canberra in 2010.

Answer here

Sale price and date of sale for houses sold in Canberra between 2000 and 2017.

Answer here

Time taken by ANU employees in minutes to reach university for year 2017.

Answer here

Country of nationality for students admitted to ANU in 2016, assuming you can combine countries with fewer than 100 students together.

Answer here

The percentage of female students admitted to ANU each year from 1950 to 2017.

Answer here

Q1.2 – Visualisations [15 Marks]

The data set ./data/Economist_pensions.csv records, for various OECD countries, the percentage of GDP spent on pension benefits and the proportion of the population aged 65 and over (also as a percentage). Countries differ in the age at which people become entitled to claim pension benefits. The GDP figures represent expenditure on all pension claims allowed under each country's pension law.

Using the dataset, perform the following tasks:

Load the DataFrame. Provide the mean and mode of the government spending on pension benefits as a percentage of each country's GDP.

ecp = pd.read_csv('./data/Economist_pensions.csv')
ecp.head()

Using a Scatter Plot, visualise the relationship between the percentage of a country’s population over the age of 65 and the government spending on pension benefits as a percentage of the GDP. [3 marks]
Building upon your previous graph, highlight the data points for the following countries and regenerate your graph below the original one. The countries are: Mexico, Turkey, Brazil, Poland, France, Italy, Greece, Japan, United States of America & South Korea. [3 marks]
Which visual attribute (in the scatterplot function) from matplotlib is more appropriate for highlighting the previous countries: Colour or Alpha? Justify your answer using an appropriate example. [3 marks]
Identify the country with the highest ratio of pension benefits (as a % of GDP) to the share of the population aged 65+ (as a %). Use a distinctive visual feature to highlight it. [3 marks]

# Your Code here

Question 2: Machine Learning [40 Marks]

Q2.1 – Togepi’s Theory Questions [12 Marks]

If your regression line perfectly fits all the points, what is the value of the $R^2$ score?

Your Answer Here

If your regression equation looks like $\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$, is this still a linear model? Explain your reasoning.

Your Answer Here

What does the statement “independent and identically distributed” (or i.i.d.) define in Machine Learning? Why is it important?

Your Answer Here

Explain the difference between Flat and Hierarchical clustering

Your Answer Here

What is a loss function? Give an example of a loss function and how it is used in a Machine Learning Algorithm

Your Answer Here

How does the Minimum Sum of Squares (also known as Least Squares or Ordinary Least Squares) optimise a Linear Regression line of fit? Does it require multiple passes on the data or is it a single calculation? Provide a technical explanation of how it works.

Your Answer Here
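
For reference, the closed-form calculation for simple linear regression (one predictor) can be sketched as below. This is a minimal illustration under stated assumptions, not a model answer: the function names `fit_line` and `sum_squared_residuals` and the data values are hypothetical, and only the single-predictor case $\hat{y} = \beta_0 + \beta_1 x$ is covered.

```python
def fit_line(xs, ys):
    """Closed-form least squares estimates for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)
# Nudging either coefficient away from the estimate increases the loss.
worse = sum_squared_residuals(xs, ys, b0, b1 + 0.1)
print(b0, b1, best, worse)
```

The estimates come from sums over the data evaluated once, with no iterative updating of the coefficients.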

Q2.2 – Cubone’s Classifications [15 Marks]

poke_data = pd.read_csv("./data/pokemon.csv", index_col=0)
poke_data.head()

Using the Pokemon dataset, you will be implementing the Decision Tree and KNN classifiers using sklearn to predict the Legendary status of a pokemon. Perform the following tasks:

Complete the following function such that it will return the best combination of 2 columns to use to predict the Legendary status of a pokemon for a KNN classifier when n_neighbors = 10. You should check every pair of numerical columns and return the best pair (with the names as a list) based on accuracy.

Note: You should only use numerical columns.

def pairChecker(data):
    """
    Input/s: Dataframe of every column in the dataset
    Expected Output/s: The names of the best performing pair of columns based on accuracy of KNN classifier as a list.

    Expected steps: Determine the pairs of columns that can be used, for each pair implement a KNN classifier and
    check the accuracy score. Return the column names of the pair with the best accuracy score as a list.
    """
    # Your Code Here

# Tester block
print("Your best pair was: ", pairChecker(poke_data))
if isinstance(pairChecker(poke_data), list):
    print("Output is correct type")
else:
    print("Output is not returned as a list of the column names")

Using the best pair you found above, implement a KNN classifier with 10 neighbours and a DecisionTreeClassifier and provide the prediction accuracy score and F1 score for each model.

# Your Code Here

What do the above metrics tell you about the performance of your models? Discuss how the metrics show that the models classify differently. Which would you rather use and why?

Your Answer Here

Q2.3 – Charmander’s Clustering [13 Marks]

Using the Pokemon dataset, implement the KMeans clustering algorithm with k=5 and k=3 using sklearn or equivalent packages. Provide graphical representations of your clustering outputs and provide the accuracy score of your algorithms.

# Your Code here

Which k value is better for your clustering (or is neither appropriate)? Explain how different k values could be more (or less) appropriate.

Your Answer Here

Explain the limitations of KMeans and give examples of how and when it would be unsuitable.

Your Answer Here

Q3: Databases and Relational Algebra [20 Marks]

Q3.1 Short Answers [10 Marks]

What is the difference between the flat file and hierarchical database models? Provide an example of each.

Answer Here

What are the typical layers in a 2-tier Architecture model? How can an Architecture be extended to n-layers?

Answer Here

What is the difference between a Data-definition and Data-manipulation language?

Answer Here

What are the features of a 2NF normalised database?

Answer Here

What is the difference between a well-formed and a valid XML document? Why would you want to check if an XML document is valid?

Answer Here

Q3.2 SQL [10 Marks]
(NOTE: This needs to be performed on Lab Computers unless you have psql installed on your computer)

Load the provided dataset into psql (open terminal, navigate to the directory with the script, invoke psql, then \i spj.sql). These questions refer to the full database provided in the .sql file (NOT the subset used in question 1).

Please answer the following questions. Provide both the answer, and the SQL you used to achieve the answer.

Find the names of all suppliers providing parts for the Console job.

Answer Here

Which Part/s weigh the most?

Answer Here

Which part is most stocked by suppliers?

Answer Here

Find the s_id and names of suppliers supplying parts to jobs where the city of the supplier, the part, and the job are all different to each other.

Answer Here

Find all (suppliers, parts) where the supplier doesn’t have enough parts to fill their requests.

Answer Here

Q4: Security [20 Marks]

4.1: Security Multiple Choice [10 Marks]

A fabrication attack is an attack on:

Availability
Confidentiality
Authenticity

Answer Here

An RSA public key is given by (e,n) = (3,33). Suppose we want to encrypt the number 7 using this key. Then, the encrypted number is:

Answer Here
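
As a reminder of the mechanics, textbook RSA encrypts a message $m$ as $c = m^e \bmod n$ and decrypts as $m = c^d \bmod n$. The sketch below uses the public key from the question but a hypothetical plaintext (`m = 4`). The factors `p = 3` and `q = 11` follow from $n = 33$; the helper names are illustrative, and no padding is applied, so this is not how RSA is used in practice.

```python
def rsa_encrypt(m, e, n):
    """Textbook RSA encryption: c = m^e mod n."""
    return pow(m, e, n)


def rsa_decrypt(c, d, n):
    """Textbook RSA decryption: m = c^d mod n."""
    return pow(c, d, n)


# Public key from the question.
e, n = 3, 33

# Private exponent: n = 3 * 11, so phi = (3 - 1) * (11 - 1) = 20,
# and d is the inverse of e modulo phi.
p, q = 3, 11
phi = (p - 1) * (q - 1)
d = next(k for k in range(1, phi) if (e * k) % phi == 1)

m = 4  # hypothetical plaintext, chosen for illustration
c = rsa_encrypt(m, e, n)
recovered = rsa_decrypt(c, d, n)
print(d, c, recovered)
```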

Which technology is considered to be able to break current public-key encryption algorithms in the future?

Probabilistic Computation
Neural Networks
Quantum Computing

Answer Here

Which of the following encryption algorithms is considered unbreakable?

One-Time Pad

Answer Here

In creating a digital signature, the sender encrypts the document with:

The sender’s public key
The receiver’s public key
The sender’s private key
The receiver’s private key

Answer Here

4.2: Security Short Answers [10 Marks]

A good hash function needs to be collision resistant. What does this mean? Why is this important?

Answer Here
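
To make the idea of a collision concrete, the sketch below deliberately weakens SHA-256 by keeping only the first 2 bytes of its digest, then searches for two different inputs with the same shortened digest. The truncation, the helper names `short_hash` and `find_collision`, and the choice of inputs (decimal strings of successive integers) are all assumptions made for illustration; nothing here finds a collision in full SHA-256.

```python
import hashlib
from itertools import count


def short_hash(data: bytes, n_bytes: int = 2) -> bytes:
    """First n_bytes of the SHA-256 digest (a deliberately weak hash)."""
    return hashlib.sha256(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 2):
    """Return two distinct messages sharing the same truncated digest."""
    seen = {}  # truncated digest -> first message that produced it
    for i in count():
        msg = str(i).encode()
        h = short_hash(msg, n_bytes)
        if h in seen:
            return seen[h], msg, h
        seen[h] = msg


first, second, digest = find_collision()
print(first, second, digest.hex())
```

With only 16 bits of output there are 65,536 possible digests, so a repeat is guaranteed within 65,537 distinct inputs and typically appears far sooner. A collision-resistant hash makes this kind of search computationally infeasible.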

What are the three key principles of the information security triad? Briefly describe each one of these three principles.

Answer Here

Bob wants to send a signed, confidential message to Alice. Discuss how he could go about doing this using cryptographic techniques? What are the pros and cons of your suggested approach?

Answer Here