
Question Set 1

COMP2420/COMP6420 – Introduction to Data Management,
Analysis and Security


Mid-Semester Exam (Sample 2)

Instructions¶
Maximum Marks: 100
Weightage: 18% of the Total Course Grade
Duration: 15 min Reading + 90 min Typing
Permitted Material: This is an open book exam. Any course or online material can be used.

There are four questions. All answers are to be submitted via gitlab before the end of the exam time period.
Save your changes frequently, so that you do not lose your work! Do not change the names of the directories or of the files.
You can import any additional Python modules you may need for your analysis in the first code block. DO NOT try to install any modules other than those present in the Anaconda distribution.
For all coding questions please write your code after the comment YOUR CODE HERE.
In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively.
You will be marked on correctness and readability of your code/analysis/explanation. If your marker can’t understand your code/analysis/explanation, then marks may be deducted.

# Feel free to import other modules, provided they are a part of the standard conda distribution.
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from scipy import stats
from itertools import combinations
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

Question 1: Short Answer [10 marks]¶
Answer the following questions in the raw cell left below each question.

1.0) Consider the process of creating a test/train split for a given data set. When creating the test set, is it better to take a random sample of rows from the data set or a section of contiguous rows from the start/end of the data set?¶
Be sure to justify your answer with an explanation.
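For concreteness, a toy illustration of the two approaches (the dataframe is made up, purely for reference):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(100)})                    # toy data, for illustration only
train_r, test_r = train_test_split(df, test_size=0.2)   # random sample (shuffled by default)
test_block = df.iloc[-20:]                              # contiguous block taken from the end
train_block = df.iloc[:-20]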

1.1) Explain the difference between Python comments that begin with the # symbol and those enclosed in ''' symbols. Provide an example of each.¶
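For reference, a minimal illustration of the two styles (the function name is only for demonstration):

# A hash comment is ignored by the interpreter and runs to the end of the line
x = 1  # it can also follow code on the same line

def add(a, b):
    '''A triple-quoted string is technically a string literal, not a comment.
    Placed first in a module, function or class it becomes the docstring;
    elsewhere it is simply evaluated and discarded.'''
    return a + b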

1.2) Consider the following graphs. For each graph, state the problem with how the data is represented and an alternative plotting method that would better convey the information.¶
[4 marks: 2 per graph]

Question 2: Data Analysis & Visualisation [40 marks]¶
All the COMP2420 tutors have been enjoying the recent popular Netflix series “The Last Dance”, a basketball documentary. Due to this, many disagreements and discussions have been had around basketball in recent tutor meetings. Your task for this section is to use statistics taken from the site Basketball Reference to resolve some of these disagreements.

The below table is a description of the dataset:

Field Description
player_id Unique identifier of a player
season_id Season year the statistics are from
player_age Age of the player during that season
min Total number of minutes spent on the court
fg_pct Percentage of field goals made
fg3_pct Percentage of 3 points shots made
ft_pct Percentage of free throws made
reb Total number of rebounds
ast Total number of assists
stl Total number of steals
blk Total number of shots blocked
tov Total number of turnovers
pts Total number of points scored

The questions are as follows:

2.0) Import the Data¶
The data file is called player_data.csv in the data directory.
The resulting variable containing the dataframe should be called player_data for testing purposes.
Do not use player_id as an index. Instead, allow for pandas to create a default index.
Print out the first 5 rows of the data set along with the names of the columns.

# YOUR CODE HERE
player_data = pd.read_csv("data/player_data.csv")
player_data = player_data.dropna()
player_data.head()

player_id season_id player_age min fg_pct fg3_pct ft_pct reb ast stl blk tov pts
0 147 1997-98 25 1706.0 0.478 0.342 0.728 195.0 155 56.0 14.0 132.0 771
1 147 1998-99 26 1238.0 0.403 0.262 0.791 154.0 93 50.0 15.0 72.0 542
2 147 1999-00 27 2978.0 0.471 0.393 0.827 387.0 320 84.0 49.0 188.0 1457
3 147 2000-01 28 2943.0 0.457 0.339 0.828 359.0 435 65.0 43.0 211.0 1478
4 147 2001-02 29 3156.0 0.455 0.362 0.839 373.0 355 78.0 45.0 201.0 1696

2.1) Good Ol’ Days¶
Each player is listed along with the year of the season they played in. A common sentiment amongst older tutors is that players who played before 1990 are better than those who played after 1990.

Visualise the average number of ast and tov for players pre and post the 1990 season to determine who is correct.

Note: For the category ast a higher value is better however for tov a lower value is better.
[10 marks]

# YOUR CODE HERE
pre1990_rec = player_data[player_data['season_id'] < '1990']
pre1990_player_id = pre1990_rec.groupby('player_id').mean().index
# A player with at least one pre-1990 season counts as a "pre-1990" player
pre1990_player = player_data[player_data['player_id'].isin(pre1990_player_id)]
post1990_player = player_data[~player_data['player_id'].isin(pre1990_player_id)]
pre1990_perf = pre1990_player.mean()[["ast", "tov"]]
post1990_perf = post1990_player.mean()[["ast", "tov"]]
# Side-by-side bars: pre-1990 on the left of each tick, post-1990 on the right
plt.bar(["ast", "tov"], [pre1990_perf["ast"], pre1990_perf["tov"]], align='edge', width=-0.2, label='pre-1990')
plt.bar(["ast", "tov"], [post1990_perf["ast"], post1990_perf["tov"]], align='edge', width=0.2, label='post-1990')
plt.legend()

Written answer here

2.2) The Worm¶
Dennis ‘The Worm’ Rodman (player_id: 23) was a very popular player in the ’90s and was known as a very talented rebounder. In more than 10 seasons of his career, he had more total rebounds than points, an incredible feat.

How many players have also gotten more reb than pts for at least 10 of the seasons they have played?

[10 marks]

# YOUR CODE HERE
reb_pts = player_data[player_data['reb'] > player_data['pts']]
# Number of seasons per player in which rebounds exceeded points
season_counts = reb_pts.groupby('player_id').size()
# Players with at least 10 such seasons
print((season_counts >= 10).sum())

# player_data[“reb>pts”] = 0
# player_data.loc[player_data[“reb”]>player_data[“pts”],”reb>pts”] = 1

# target = player_data.groupby(“player_id”).sum()
# len(target[target[“reb>pts”]>=10][“reb>pts”])

Written answer here

2.3) Hogging the Court¶
Players with a free throw percentage ft_pct > 80% get more minutes min on court than those who shoot under 80% on free throws. – Skip Bayless (probably)

This is a very interesting claim and the answer is not immediately obvious from looking at the data. We would however like to fact check this to ensure no misinformation is being spread around the basketball statistics community. To test this claim, perform a Hypothesis Test (T-test) for the above statement, stating your hypotheses, the results, and the final acceptance or rejection statements.

[15 marks]

H0: Players with a free throw percentage ft_pct > 80% get less than or equal minutes min on court compared to those who shoot under 80% on free throws.
HA: Players with a free throw percentage ft_pct > 80% get more minutes min on court than those who shoot under 80% on free throws.

# YOUR CODE HERE
players1 = player_data[player_data["ft_pct"] > 0.8]
players1 = players1.dropna()
players2 = player_data[player_data["ft_pct"] <= 0.8]
players2 = players2.dropna()
print(players1.mean()["min"])
print(players2.mean()["min"])
# One-tailed test: ttest_ind returns a two-tailed p-value, so halve it
t, p = stats.ttest_ind(players1["min"], players2["min"])
print("p-value is: ", p / 2)

1355.1180217937972
1061.4892564224226
p-value is:  1.0496958809138981e-98

Written answer here

Question 3: Classification [30 Marks]¶
Afzal has come into a large sum of money and is entering a new Canberra team into the Australian 'NBL' Basketball League. He has decided to take a statistically sound approach to choosing players for his new team. He wants to use the above data to determine which attributes of a given player he should be looking for when selecting new talent, and we are going to help him out!

The following questions all use the same basketball statistics dataset as Question 2:

3.0) Import the Data, Prepare for Classification¶
Import the data from the player_data.csv file in the data directory to a new variable classification_player_data. This is both for testing purposes and in case any changes were made to the dataframe in the previous question.

Afzal likes fast-paced games where lots of points are scored, so we are going to divide our players into three categories:

Category Number of points
Low scoring players pts <= 200
Mid scoring players pts > 200 and pts <= 800
High scoring players pts > 800

Your task is to split the data into these three categories, ensuring that the category each player falls into is based on that player's total pts in a single season. Note that this means some players may fall into different categories in different seasons.

Hint: It may be useful to use numerical categories instead of the string names.
Hint: Make sure to handle any NaN or null values within the data.

# YOUR CODE HERE
classification_player_data = pd.read_csv("data/player_data.csv")
classification_player_data = classification_player_data.dropna()

# Replace raw point totals with numerical category labels:
# 0 = low (pts <= 200), 1 = mid (200 < pts <= 800), 2 = high (pts > 800)
classification_player_data.loc[classification_player_data.pts <= 200, 'pts'] = 0
classification_player_data.loc[(classification_player_data.pts > 200) & (classification_player_data.pts <= 800), 'pts'] = 1
classification_player_data.loc[classification_player_data.pts > 800, 'pts'] = 2
classification_player_data.sort_values(by=['player_id'], inplace=True)

# Features: drop identifiers and the unused statistics, keeping only numeric
# candidate columns; target: the scoring category
X = classification_player_data.drop(columns=["player_id", "season_id", "min", "fg3_pct", "ft_pct", "stl", "pts"])
y = classification_player_data["pts"]


3.1) Implementing kNN¶
You will now implement a classifier to predict which scoring category a player will be in for a given season. Afzal, however, only wants to use two columns from the dataset to predict which category a player will fall into. He has narrowed the best 2 columns down to some combination of the following:



Your task is now to:

Iterate over each 2-tuple, fitting a kNN classification model (where k=3) using those columns to predict which scoring category a player is in.
Print a table of mean accuracy scores for each 2-tuple grouping, and note the best accuracy along with the two columns that produce it.


Hint: combinations(my_list, n) returns a list of all n length tuples from a list my_list.

[12 marks]

# YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit a k=3 kNN model for every 2-column combination and report its accuracy
model = {}
for comb in combinations(X_train.columns, 2):
    model[comb] = KNeighborsClassifier(n_neighbors=3)
    model[comb].fit(X_train[list(comb)], y_train)
    print(f"{comb}: {model[comb].score(X_test[list(comb)], y_test).round(3)}")

3.2) Model Evaluation¶
For your model above calculate and show:

# YOUR CODE HERE
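# A minimal sketch, assuming the required output includes a confusion matrix
# and accuracy (confusion_matrix is already imported in the first cell).
# best_comb is the best-scoring column pair from 3.1, recomputed here.
best_comb = max(model, key=lambda c: model[c].score(X_test[list(c)], y_test))
y_pred = model[best_comb].predict(X_test[list(best_comb)])
print("Best columns:", best_comb)
print("Accuracy:", round(model[best_comb].score(X_test[list(best_comb)], y_test), 3))
print(confusion_matrix(y_test, y_pred))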

3.3) Finding Optimal K¶
For the best 2-tuple of columns used in your kNN model for question 3.1, find the value of k which gives the model the highest Accuracy. You should assume that the optimal k value is between 1 and 51.

State your findings and what they show.

# YOUR CODE HERE
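# A minimal sketch: try every k from 1 to 51 on the best column pair from
# 3.1/3.2 (best_comb, defined in the previous cell) and keep the most accurate.
accuracies = {}
for k in range(1, 52):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train[list(best_comb)], y_train)
    accuracies[k] = knn.score(X_test[list(best_comb)], y_test)

best_k = max(accuracies, key=accuracies.get)
print(f"Best k: {best_k}, accuracy: {accuracies[best_k]:.3f}")

# Optional: visualise accuracy as a function of k
plt.plot(list(accuracies.keys()), list(accuracies.values()))
plt.xlabel("k")
plt.ylabel("accuracy")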

Written answer here

Question 4: Linear Regression [20 Marks]¶

For this question, your dataset is one of housing prices in two suburbs of Canberra. Note that this is not a real dataset and as such should not be used if you intend to actually buy a house in Canberra.

Description of the dataset:

Column Description
price Sale price of the property
suburb Suburb the property is located in
rooms Number of rooms in the property
bathrooms Number of bathrooms in the property
sqm Size of the property in square meters
pool Whether the property has a pool
garage Whether the property has a car space/garage
garden Whether the property has a garden

NOTE: In the following questions you will implement a linear regression model to predict the price of a property based on its features. Marks will be awarded for the steps taken in the development of the model as well as the accuracy of your final model (tested against an identically formatted but previously unseen data set).

4.0) Import and Split the Data¶
Import the data from the housing_prices.csv file in the data directory. Change any columns for use within your model as appropriate.

Split the data into two variables housing_prices_testing and housing_prices_training. You may choose whatever split you wish, but make sure to explicitly state what test/train split you have chosen.

Hint: Make sure to handle any NaN or null values within the data.

# YOUR CODE HERE
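# A minimal sketch, assuming suburb and the yes/no columns (pool, garage,
# garden) are stored as strings and need integer encoding; adjust to the
# actual file contents. A 75/25 train/test split is used (and stated, as asked).
housing_prices = pd.read_csv("data/housing_prices.csv")
housing_prices = housing_prices.dropna()

for col in ["suburb", "pool", "garage", "garden"]:
    housing_prices[col] = LabelEncoder().fit_transform(housing_prices[col])

housing_prices_training, housing_prices_testing = train_test_split(
    housing_prices, test_size=0.25, random_state=42)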

4.1) Selecting Features¶
To decide which columns to use in our model we will find those with the highest correlation with the price of a property. Calculate this correlation with price for each non-categorical attribute and print the column names in order.

Hint: ‘Correlation’ in this question refers to pairwise correlation. You may use an alternative but be sure to note this and explain your reasoning.

# YOUR CODE HERE
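# A minimal sketch using pandas' pairwise (Pearson) correlation on the
# training data. The non-categorical attributes are assumed to be rooms,
# bathrooms and sqm, per the dataset description above.
numeric_cols = ["rooms", "bathrooms", "sqm"]
correlations = housing_prices_training[numeric_cols + ["price"]].corr()["price"].drop("price")
print(correlations.abs().sort_values(ascending=False).index.tolist())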

4.2) Regression Implementation¶
We will now use training data within housing_prices_training to create a model for predicting housing prices based off our findings in the previous question.

You may use as many columns as you wish as long as you state clearly which ones you are using. Ensure you are not over fitting to the dataset as the testing data that will be used in the final evaluation is not part of the csv you have been supplied with.

Complete the following tasks:

[6 marks] Implement a Linear Regression model based on your chosen columns.
[6 marks] Test your model against housing_prices_testing using an evaluation method of your choice. State why you have chosen this method of testing and any observations you have based on the results.
[3 marks] Choose one column you have used in the construction of your regression model and plot a regression line of price against this column. On the same axes, show a scatter plot of price against your chosen column for the data points from vals_to_plot.csv. Comment on the distribution of these data points compared to your regression line.

[15 marks]

# YOUR CODE HERE
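# A minimal sketch, assuming sqm, rooms and bathrooms were the strongest
# correlates from 4.1 and that vals_to_plot.csv lives in the data directory
# like the other files. LinearRegression is not in the first import cell,
# so it is imported here.
from sklearn.linear_model import LinearRegression

features = ["sqm", "rooms", "bathrooms"]  # assumed strongest correlates from 4.1
reg = LinearRegression()
reg.fit(housing_prices_training[features], housing_prices_training["price"])

# R^2 on the held-out split: interpretable as the proportion of price
# variance the model explains
r2 = reg.score(housing_prices_testing[features], housing_prices_testing["price"])
print(f"R^2 on test split: {r2:.3f}")

# Regression line of price against one chosen column (sqm), with the
# vals_to_plot points scattered on the same axes
vals_to_plot = pd.read_csv("data/vals_to_plot.csv")
line_model = LinearRegression().fit(housing_prices_training[["sqm"]], housing_prices_training["price"])
xs = pd.DataFrame({"sqm": np.linspace(housing_prices_training["sqm"].min(),
                                      housing_prices_training["sqm"].max(), 100)})
plt.plot(xs["sqm"], line_model.predict(xs), color="red", label="regression line")
plt.scatter(vals_to_plot["sqm"], vals_to_plot["price"], label="vals_to_plot points")
plt.xlabel("sqm")
plt.ylabel("price")
plt.legend()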

Written answer here
