COMP2420/COMP6420 – Introduction to Data Management,
Analysis and Security

Mid-Semester Exam – Sample Questions

# IMPORTING FREQUENTLY USED PYTHON MODULES
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-notebook')
%matplotlib inline

# JUST TO MAKE SURE SOME WARNINGS ARE IGNORED
import warnings
warnings.filterwarnings("ignore")

# IMPORT ANY PACKAGES YOU WOULD LIKE TO USE BELOW

Section A – Data Analysis
Question 1: [30 marks]
You are given the following dataset from the Seaborn library in Anaconda.

1) Do some exploratory data analysis on this dataset and find out three facts that jump out from this data.

df_diamonds = sns.load_dataset('diamonds')
df_diamonds.head()

# DO SOME EXPLORATORY DATA ANALYSIS ON THE DATASET
# AND STATE 3 FACTS THAT JUMP OUT FROM THE BASIC STAT
# ANALYSIS OF THE DATASET.

2) Find the maximum weight (in carats) and the most valuable stone(s).

# YOUR CODE HERE

3) Are stones of max weight also the dearest stones? Justify your answer with appropriate data or visualisation.

# YOUR CODE HERE
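
One possible way to approach parts 2) and 3) is sketched below. It runs on a tiny hand-made table rather than the real dataset: the rows in `toy` are invented for illustration and are not diamonds data; only the column names `carat` and `price` are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows (NOT real diamonds data); column names follow the dataset.
toy = pd.DataFrame({
    "carat": [0.3, 1.2, 5.0, 2.3, 5.0],
    "price": [400, 6000, 15000, 18800, 17000],
})

max_carat = toy["carat"].max()
max_price = toy["price"].max()

# Keep every row tied for the maximum, not just the first one.
heaviest = toy[toy["carat"] == max_carat]
dearest = toy[toy["price"] == max_price]

# Do the heaviest stones and the dearest stones share any rows?
overlap = heaviest.index.intersection(dearest.index)
print(max_carat, max_price, len(overlap))
```

Filtering on equality with the maximum, instead of calling `idxmax()`, keeps ties, which matters because the question asks for the stone(s). A scatter plot of `price` against `carat` would be one way to support the answer visually.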

4) Find the average price per carat for the different cut (quality) groups. Do the cut quality group names correlate with the average price? Plot the results (using a bar plot or pie chart).

[20 marks]

# YOUR CODE HERE
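
A minimal sketch of the group-wise average, again on invented rows rather than the real data. The `toy` table and the `price_per_carat` column name are assumptions made for this example; `cut`, `carat` and `price` are the dataset's column names.

```python
import pandas as pd

# Hypothetical toy rows (NOT real diamonds data).
toy = pd.DataFrame({
    "cut":   ["Fair", "Fair", "Good", "Ideal", "Ideal"],
    "carat": [1.0, 2.0, 1.0, 0.5, 1.0],
    "price": [3000, 8000, 4000, 2000, 5000],
})

# Price per carat for each stone, then the mean within each cut group.
toy["price_per_carat"] = toy["price"] / toy["carat"]
avg_ppc = toy.groupby("cut")["price_per_carat"].mean()
print(avg_ppc)
```

This averages the per-stone ratios; dividing total price by total carats within each group is a different statistic and can give a different ranking. With matplotlib available, `avg_ppc.plot(kind="bar")` would draw the bar plot.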

Question 2: [15 marks]
Researcher A claims to have quantified the drink consumption habits of Coffee Grounds’ customers. He claims
that the choice of each customer is a random sample from a distribution over five outcomes: Smoothie, Coffee,
Milk Tea, Classic Tea, and Sparkling Juice, with probabilities of 0.2, 0.15, 0.2, 0.4, and 0.05, respectively.

Another researcher, B, rejects one part of A’s claim: he can’t believe customers choose Classic Tea with a 40% probability (he doesn’t care at all about the probabilities of the other choices). State the null and alternative hypotheses that he should use to investigate this issue.

YOUR ANSWER HERE
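
A hedged sketch of how such a pair could be written, taking $p$ (a symbol introduced here, not in the question) to be the true probability that a customer chooses Classic Tea, and assuming a two-sided alternative because B doubts the value 40% without naming a direction:

$$H_{0}: p = 0.4 \qquad \text{versus} \qquad H_{1}: p \neq 0.4$$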

Now B needs a sample of customer choices. Each beverage cup contains a mark describing its original contents. Should he look in the garbage can outside of Coffee Grounds at the end of the day and count the proportion of cups that contained Classic Tea? Why or why not?

YOUR ANSWER HERE

Alternatively, Coffee Grounds offers to give B a uniform random sample of 10 orders from its database of all past orders. He replies, “That’s not enough, I need a large random sample.” They ask why. How should he respond to justify his request of a large random sample?

YOUR ANSWER HERE
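
The effect of sample size can be illustrated with a small simulation. This is a sketch, not the expected exam answer: it assumes A's claimed probability of 0.4 is true and compares how much the sample proportion of Classic Tea varies between samples of 10 and samples of 1000. The seed, the trial count and the function name `simulated_proportions` are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
p_claimed = 0.4                 # researcher A's claimed probability for Classic Tea

def simulated_proportions(n, trials=10_000):
    """Sample proportions of Classic Tea in `trials` samples of size n, if A is right."""
    return rng.binomial(n, p_claimed, size=trials) / n

# Standard deviation of the sample proportion for a small and a large sample.
spread_small = simulated_proportions(10).std()
spread_large = simulated_proportions(1000).std()
print(spread_small, spread_large)
```

The spread shrinks roughly like $\sqrt{p(1-p)/n}$, so with only 10 orders the observed proportion can land far from 0.4 even when the claim is true, which makes it hard to tell a wrong claim from sampling noise.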

Section B – Data Visualization
Question 3: [40 marks]
In the world of Pokémon, there are many different pokémon, each holding a unique power inside. Ash, the pokémon trainer, has collected a large dataset of pokémon throughout his many adventures and wants to know more about the composition of the different generations of pokémon. Complete the function plot_pokemon_composition() to help Ash visualise the composition.
[10 marks]

# TODO: COMPLETE THE BELOW FUNCTION
def plot_pokemon_composition(df_pokemon):
    # YOUR CODE HERE
    return False

# RUN THIS BLOCK TO TEST YOUR CODE
df_pokemon = pd.read_csv('pokemon.csv')
plot_pokemon_composition(df_pokemon)

Briefly justify why the graph you chose in 1 is suitable for the intended purpose.

YOUR ANSWER HERE

Professor Oak, a specialist in the study of Pokémon, wants to discover how Attack and Defense relate to the HP of Pokémon. He asks you to create a graph depicting this relationship, with one additional requirement: make the data points for Legendary Pokémon (extremely powerful, one-of-a-kind Pokémon) stand out from the others. Complete the function plot_attack_defense() to create this graph.
[15 marks]

# TODO: COMPLETE THE BELOW FUNCTION
def plot_attack_defense(df_pokemon):
    # YOUR CODE HERE
    return False

# RUN THIS CELL TO TEST YOUR FUNCTION
df_pokemon = pd.read_csv('pokemon.csv')
plot_attack_defense(df_pokemon)

Pokémon Trainers Ally and Cat Erpie are having an argument. Ally believes that Psychic type pokémon are superior to Bug type pokémon because they have a higher Special Attack; however, Cat disagrees. Complete the function plot_psychic_vs_bug() to help Ally and Cat Erpie compare the distribution of Special Attack for Psychic and Bug type pokémon.

Note: A pokémon p belongs to Type t if p’s Type 1 = t OR p’s Type 2 = t.
[10 marks]
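
The membership rule in the note can be sketched as a boolean mask. The column names `Type 1`, `Type 2`, `Sp. Atk` and `Name` are assumptions about pokemon.csv, the rows in `toy` are invented, and `of_type` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical toy rows; the column names are assumed, not confirmed by the question.
toy = pd.DataFrame({
    "Name":    ["A", "B", "C", "D"],
    "Type 1":  ["Psychic", "Bug", "Water", "Bug"],
    "Type 2":  [None, "Psychic", "Bug", None],
    "Sp. Atk": [100, 60, 50, 20],
})

def of_type(df, t):
    """Rows whose Type 1 or Type 2 equals t."""
    return df[(df["Type 1"] == t) | (df["Type 2"] == t)]

psychic = of_type(toy, "Psychic")
bug = of_type(toy, "Bug")
print(list(psychic["Name"]), list(bug["Name"]))
```

A pokémon with both types, such as row B here, appears in both groups under this rule. The two `Sp. Atk` columns could then be passed to a histogram, box plot or density plot to compare the distributions.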

def plot_psychic_vs_bug(df_pokemon):
    # YOUR CODE HERE
    return False

# RUN THIS CELL TO TEST YOUR FUNCTION
df_pokemon = pd.read_csv('pokemon.csv')
plot_psychic_vs_bug(df_pokemon)

Section C – Introductory Machine Learning
Question 4: [15 marks]
You have been given a file named admission.csv, which contains a dataset of university admission chances and some related factors. The attributes in this dataset include:

GRE Scores ( out of 340 )
TOEFL Scores ( out of 120 )
University Rating ( out of 5 )
Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
Undergraduate GPA ( out of 10 )
Research Experience ( either 0 or 1 )
Chance of Admit ( ranging from 0 to 1 )

With our dataset, we would like to predict the chance that a student will be admitted to a university, given some of his/her characteristics and previous performance.

1) The first thing to do is to load the data into a Pandas DataFrame and make it ready for use for training our linear regression model.

2) Then you will have to complete the linear_regression() function. This function takes the X and Y you constructed in 1) as input and generates an array $\boldsymbol{\beta}$, which contains the parameters of the fitted function. You are also expected to complete the predict() function, which returns the predicted value $\hat{y}_{i}$ for a given point $x_{i}$ and the parameters $\boldsymbol{\beta}$.

# TODO: COMPLETE THIS FUNCTION
def load_data(data_path):
    # YOUR CODE HERE
    return X, Y

X, Y = load_data('./admission.csv')
print('The shape of data points X and true value Y are:', X.shape, Y.shape)

# TODO: COMPLETE THIS FUNCTION
def linear_regression(X, Y):
    """
    Inputs: X, Y are numpy arrays containing the training data points and the true values.
    Outputs: beta is a numpy array containing the parameters of the fitted function.
    """
    beta = None

    return beta

# TODO: COMPLETE THIS FUNCTION
def predict(beta, x):
    """
    Inputs: x is a numpy array of a point in the dataset (padded with a 1 on the left).
            beta is a numpy array containing the parameters of the fitted function.
    Outputs: predicted is the predicted value y_hat for the given data point x and parameter set beta.
    """
    predicted = None

    return predicted

beta = linear_regression(X,Y)
predicted_y = predict(beta, X[1,:])

print('Fitted model parameters are:', beta)
print('The predicted value of', X[1,:], 'is:', predicted_y)
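
The fitting step described in 2) can be sketched with NumPy's least-squares solver. This is one possible approach under stated assumptions, not the required solution: `fit_least_squares` and `predict_one` are hypothetical names, the data are synthetic and noise-free rather than the admission data, and X is assumed to already carry a leading column of 1s.

```python
import numpy as np

def fit_least_squares(X, Y):
    """Least-squares parameters beta for Y ~ X @ beta (X has a leading column of 1s)."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta

def predict_one(beta, x):
    """Predicted y_hat for a single padded point x."""
    return x @ beta

# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2 (not the admission data).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X_demo = np.column_stack([np.ones(len(features)), features])
Y_demo = X_demo @ np.array([1.0, 2.0, -3.0])

beta_demo = fit_least_squares(X_demo, Y_demo)
print(beta_demo)
```

When $X^{T}X$ is invertible, this gives the same result as the normal-equation solution $\boldsymbol{\beta} = (X^{T}X)^{-1}X^{T}Y$; `np.linalg.lstsq` is used here because it avoids forming the inverse explicitly.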

3) If your linear model does not perform well on your data, it sometimes means that the relationship in your data is only weakly linear. There are many ways of measuring the strength of a linear relationship; we will use the $R^{2}$ measure, which is defined by:

$R^{2} = 1 - \dfrac{\sum_{i}(y_{i}-\hat{y}_{i})^2}{\sum_{i}(y_{i}-\tilde{y})^2}$
where, $\tilde{y}$ is the mean of true y values.
Your task here is to implement the r_square() function which calculates $R^{2}$ score for the given X and Y.

# TODO: COMPLETE THIS FUNCTION
def r_square(X, Y):
    """
    Inputs: X, Y are numpy arrays containing the training data points and the true values.
    Outputs: r2 is the R-squared value of the input X and Y.
    """
    r2 = None

    return r2

r2 = r_square(X, Y)
print('The R2 value of this dataset is:', r2)
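
The $R^{2}$ definition above translates almost term by term into NumPy. The sketch below uses a hypothetical helper `r_squared(y_true, y_pred)` that takes predictions directly, unlike the `r_square(X, Y)` template, which would first have to fit a model and predict from X. The numbers are made up to show the two boundary cases.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions leave no residual error.
perfect = r_squared(y_true, y_true)
# Always predicting the mean explains none of the variation.
mean_only = r_squared(y_true, np.full(4, y_true.mean()))
print(perfect, mean_only)
```

Perfect predictions give a score of 1 and always predicting the mean gives 0, which is a quick sanity check for any implementation of this formula.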
