DS3000A – DS9000A Midterm Exam
Student ID #: XXXXXXXXX
Grade: __ / 100
General Comments
This exam integrates knowledge and skills acquired in the first half of the semester. You are allowed to use any document and source on your computer and look up documents on the internet, but you are NOT allowed to share documents, post questions to online forums, or communicate in any way with people inside or outside the class.
Having any document sharing or communication tool open (e.g. Discord, Teams, Outlook, Google Drive etc.), either web-based or app-based, on your laptop (or having one running in the background) is considered an act of cheating and you will receive 0 pts for the exam.
To finish the midterm in the allotted time, you will have to work efficiently. Read the entirety of each question carefully.
You need to submit the midterm by 6:15PM on OWL to the Test and Quizzes section, where you downloaded the notebook and data. Late submission will be scored with 0 pts. To avoid technical difficulties, start your submission, at the latest, five to ten minutes before the deadline.
Some questions demand a written answer – answer these in a full English sentence in their allocated markdown cells.
For your figures, ensure that all axes are labeled in an informative way. To make a figure easier to interpret, you may need to limit the x-axis and/or y-axis to zoom in.
Ensure that your code runs correctly by choosing “Kernel -> Restart and Run All” before submitting to OWL.
Additional Guidance
If at any point you are asking yourself “are we supposed to…”, write your assumptions clearly and proceed according to those assumptions.
If you have no clue how to approach a question, skip it and move on. Revisit the skipped one(s) after you are done with other questions.
Preliminaries
Feel free to add stuff to Preliminaries. However, be mindful of every question’s restrictions as some may exclude use of some functions.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sc
import seaborn as sns
from math import pi
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_validate, StratifiedKFold
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, make_scorer, r2_score, roc_curve, auc, plot_roc_curve, RocCurveDisplay, roc_auc_score, accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
seed = 0  # assumed value; the definition of `seed` is missing from the source
np.random.seed(seed)
Question 1 – Hardcode Linear Regression [10 marks]
Q 1.1 – [0.5] – Load q1q2.csv as pandas dataframe (name it df1) and show its first 5 rows.
df1 = pd.read_csv("q1q2.csv")
df1.head()
         x1         y1         x2         y2
0 -5.000000 -24.598243  10.000000  16.078992
1 -4.924623 -12.399073  10.100503  11.607512
2 -4.849246  -7.797257  10.201005  15.694078
3 -4.773869 -33.047786  10.301508  17.470684
4 -4.698492 -12.686326  10.402010  14.867190
Q 1.2 – [0.5] – Use an appropriate plotting command to see the relationship between x1 and y1.
plt.scatter(df1.x1, df1.y1)
plt.show()
Q 1.3 – [5] – Hardcode OLS linear regression. You need at least two functions:
A loss function (name it Loss): Takes (b, X, y) as input arguments, calculates OLS cost function, and returns the calculated values. Note that this function must return only one variable which is a 1D array consisting of the calculated values. [2]
A fitting function (name it Fit): Takes (X, y, lossfcn=Loss) as input arguments and returns two variables i.e. estimated betas and R-squared. This function must use scipy.optimize.minimize to minimize the Loss function. [3]
def Loss(b, X, y):
    predY = np.dot(X, b)
    res = y - predY
    c = np.sum(res**2)
    return c  # must only return the calculated values of the cost function (here called `c`)
def Fit(X, y, lossfcn=Loss):
    _, ncols = X.shape
    betas = np.zeros(ncols)  # 1-D start vector; scipy.optimize.minimize requires x0.ndim == 1
    RES = sc.optimize.minimize(lossfcn, betas, args=(X, y), jac=False)
    estimated_betas = RES.x
    res = y - np.mean(y)
    TSS = np.sum(res**2)
    RSS = lossfcn(estimated_betas, X, y)
    R2 = 1 - RSS/TSS
    return (estimated_betas, R2)
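One way to sanity-check a hardcoded fit like this is to compare the betas from the numerical optimizer with the closed-form least-squares solution. The following is a minimal sketch on synthetic data; the data, the coefficients 2 and 3, and the lowercase helper names `loss` and `fit` are illustrative assumptions and not part of the exam.

```python
import numpy as np
from scipy.optimize import minimize

def loss(b, X, y):
    # Residual sum of squares for coefficient vector b
    res = y - X @ b
    return np.sum(res**2)

def fit(X, y):
    # Minimize the RSS numerically, starting from a 1-D zero vector
    b0 = np.zeros(X.shape[1])
    out = minimize(loss, b0, args=(X, y))
    rss = loss(out.x, X, y)
    tss = np.sum((y - y.mean())**2)
    return out.x, 1 - rss / tss

# Synthetic data: y = 2 + 3x + noise (illustrative values only)
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 200)
y = 2 + 3 * x + rng.normal(scale=1.0, size=x.size)
X = np.c_[np.ones(x.size), x]

b_num, r2 = fit(X, y)
b_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_num, b_closed, r2)
```

If the two coefficient vectors agree to a few decimals, the loss and the optimizer call are probably wired up correctly.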
Q 1.4 – [1] – Do these:
Construct a target y using y1, and a feature matrix X_a using x1 without any feature transformation.[0.25]
Use X_a and y and call Fit to fit your model. [0.25]
What is training R-squared of this model? [0.5]
X_a = np.c_[np.ones(df1.x1.size), df1.x1]
y = df1.y1.values
betas_a, R2 = Fit(X_a, y)
print("R-squared:", R2.round(4))
R-squared: 0.5715
Q 1.5 – [2] – Do these:
Construct a new feature matrix X_b using x1 but this time also include a transformation of x1 that you deem to best describe the relationship between x1 and y1. Take a look at your plot of Q 1.2 and try to identify the relationship. [1]
Use X_b and y and call Fit to fit your model [0.5]
What is training R-squared of this model? [0.5]
# guessed transformation
def gt(x):
    return x**2
X_b = np.c_[np.ones(df1.x1.size), gt(df1.x1)]
betas_b, R2 = Fit(X_b, y)
print("R-squared:", R2.round(4))
R-squared: 0.9169
Q 1.6 – [1] – Use the given xnew to construct two test design matrices i.e., one without any transformation (name it X_new_a) and one with the transformation identified in Q 1.5 (name it X_new_b). Make predictions for both. Plot these predictions and the original data points all together in one plot window.
xnew = np.linspace(df1.x1.min(), df1.x1.max(), 100)
X_new_a = np.c_[np.ones(xnew.size), xnew]
X_new_b = np.c_[np.ones(xnew.size), gt(xnew)]
y_pred_a = X_new_a @ betas_a
y_pred_b = X_new_b @ betas_b
plt.scatter(df1.x1, df1.y1, c='blue', label='Original Data Point')
plt.plot(xnew, y_pred_a, c='green', label='Fit 1')
plt.plot(xnew, y_pred_b, c='orange', label='Fit 2')
plt.legend()
plt.show()
Question 2 – Hardcode Maximum Likelihood Regression [20 marks]
Q 2.1 – [15] – Code an OLS regression log likelihood using this probability density function:
$f_Y(y|X=x)=\dfrac{\pi}{3\sigma_{\epsilon}\sqrt{2\pi}}e^{-\dfrac{1}{2}\dfrac{(y-\mu_{Y})^2}{\sigma_{\epsilon}^2\sqrt{\pi}}}$
where you can assume $\sigma_{\epsilon}$ to be the standard deviation of the noise in the data (hint: you can assume an arbitrary value, but it must be a valid one). You need to calculate the log likelihood of the PDF (i.e., $l(\mu_{Y},\sigma_{\epsilon}^2;y_1,\ldots,y_n)$), and then choose the form of $\mu_{Y}$ using some assumption that you make. You can start to code once you know the final form of the $l$ equation. You need to code at least two functions:
one which takes in (beta, X, y), calculates and returns $-l$. [12]
another one which takes in (X, y, thePreviousFunction) and makes use of scipy.optimize.minimize to minimize the previous function. This function returns the betas which maximize $l$. [3]
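Taking the log of the density above and summing over $n$ observations, assumed independent, gives one possible final form of $l$. The linear mean $\mu_{Y}=x_i^{\top}\beta$ used here is an assumption; the question leaves the form of $\mu_{Y}$ to the student.

$$l(\mu_{Y},\sigma_{\epsilon}^2;y_1,\ldots,y_n)=n\ln\pi-n\ln 3-n\ln\sigma_{\epsilon}-\frac{n}{2}\ln(2\pi)-\frac{1}{2\sigma_{\epsilon}^2\sqrt{\pi}}\sum_{i=1}^{n}\left(y_i-x_i^{\top}\beta\right)^2$$

Only the last term depends on $\beta$, so for any fixed $\sigma_{\epsilon}>0$, maximizing $l$ over $\beta$ is equivalent to minimizing the residual sum of squares.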
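def RegNegLogLikelihood(beta, X, y):
    n = y.size
    sigma = 1.0  # assumed noise standard deviation (any valid, i.e. positive, value)
    mu = X @ beta  # assumed linear form of the mean: mu_Y = X beta
    l = n*np.log(pi) - n*np.log(3) - n*np.log(sigma) - (n/2)*np.log(2*pi) - (1/(2*np.sqrt(pi)*sigma*sigma))*np.sum((y-mu)**2)
    return -l  # negative log likelihood, to be minimized
# Function to maximize regression log likelihood
def maximumRegLikelihood(X, y, negloglik=RegNegLogLikelihood):
    _, ncols = X.shape
    betas = np.zeros(ncols)  # initialize beta as a 1-D vector, as scipy.optimize.minimize requires
    RES = sc.optimize.minimize(negloglik, betas, args=(X, y), method="Powell", tol=1e-8)
    return RES.x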
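Because $\sigma_{\epsilon}$ only scales and shifts the log likelihood, the betas that maximize it should not depend on which valid value is chosen, and they should match ordinary least squares. The following is a minimal sketch of that check on synthetic data; the function name `neg_log_lik`, the explicit `sigma` argument, and the data are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(beta, X, y, sigma):
    # Negative log likelihood for the density in Q 2.1 with mean X @ beta
    n = y.size
    sse = np.sum((y - X @ beta)**2)
    l = (n*np.log(np.pi) - n*np.log(3) - n*np.log(sigma)
         - (n/2)*np.log(2*np.pi) - sse/(2*np.sqrt(np.pi)*sigma**2))
    return -l

# Synthetic data: y = 1.5 - 0.8x + noise (illustrative values only)
rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 150)
y = 1.5 - 0.8*x + rng.normal(scale=2.0, size=x.size)
X = np.c_[np.ones(x.size), x]

# Fit with two different assumed noise levels
fits = {}
for sigma in (0.5, 2.0):
    out = minimize(neg_log_lik, np.zeros(2), args=(X, y, sigma),
                   method="Powell", tol=1e-8)
    fits[sigma] = out.x

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(fits, b_ols)
```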
Q 2.2 – [0.5] – Load q1q2.csv as pandas dataframe (name it df2) and show its first 5 rows.
df2 = pd.read_csv("q1q2.csv")
df2.head()
         x1         y1         x2         y2
0 -5.000000 -24.598243  10.000000  16.078992
1 -4.924623 -12.399073  10.100503  11.607512
2 -4.849246  -7.797257  10.201005  15.694078
3 -4.773869 -33.047786  10.301508  17.470684
4 -4.698492 -12.686326  10.402010  14.867190
Q 2.3 – [3] – Take x2 as feature and y2 as target. Fit the model. Use the given x_new (below) to make new predictions. Plot the original data points and the new predictions in the same plot window.
x_train = df2.x2.values
X_train = np.c_[np.ones(x_train.size), x_train]
y_train = df2.y2.values
betas = maximumRegLikelihood(X_train, y_train)
x_new = np.linspace(x_train.min(), x_train.max(), 200)
X_new = np.c_[np.ones(x_new.size), x_new]
y_predicted = X_new @ betas  # right-hand side is missing in the source; reconstructed by analogy with Q 1.6
plt.scatter(x_train, y_train, c='blue', label='Training Data')
plt.plot(x_new, y_predicted, c='red', label='Fit')
plt.legend()
plt.xlabel("$X_1$")
plt.ylabel("$Y$")
plt.show()
Q 2.4 – [1.5] – Calculate training R-squared of your Maximum Likelihood Regression model.
# R-squared
r2 = r2_score(y_train, X_train @ betas)  # second argument is missing in the source; training predictions assumed
print('R-squared:', r2.round(4))
R-squared: 0.9084
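For reference, `r2_score` computes the same quantity as the manual $1-\mathrm{RSS}/\mathrm{TSS}$ used in Question 1. The following is a minimal sketch of that equivalence; the two small arrays are made-up illustrative values, not exam data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_hat = np.array([2.8, 5.4, 7.0, 9.3, 10.9])

rss = np.sum((y_true - y_hat)**2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean())**2)  # total sum of squares
r2_manual = 1 - rss / tss
r2_sklearn = r2_score(y_true, y_hat)
print(r2_manual, r2_sklearn)
```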
Question 3: An End-to-End DS Project [45 marks]
You are going to work on a newly published dataset which lists soccer players participating in the FIFA World Cup 2022 – Qatar. Our ultimate goal is to train a linear regression model to predict monetary values of the players.
Q 3.1 – [0.5] – Load the dataset q3.csv as pandas dataframe and take a look at its first 5 rows.
data = pd.read_csv("q3.csv")
data.head()
ID Name Age Photo Nationality Flag Overall Potential Club Club Logo … Real Face Position Joined Loaned From Contract Valid Until Height Weight Release Clause Kit Number Rating
0 209658 L. Goretzka 27 https://cdn.sofifa.net/players/209/658/23_60.png Germany https://cdn.sofifa.net/flags/de.png 87 88 FC Bayern München https://cdn.sofifa.net/teams/21/30.png … Yes SUB Jul 1, 2018 NaN 2026 189cm 82kg €157M 8.0 NaN
1 212198 27 https://cdn.sofifa.net/players/212/198/23_60.png Portugal https://cdn.sofifa.net/flags/pt.png 86 87 Manchester United https://cdn.sofifa.net/teams/11/30.png … Yes LCM Jan 30, 2020 NaN 2026 179cm 69kg €155M 8.0 NaN
2 224334 M. Acuña 30 https://cdn.sofifa.net/players/224/334/23_60.png Argentina https://cdn.sofifa.net/flags/ar.png 85 85 C https://cdn.sofifa.net/teams/481/30.png … No LB Sep 14, 2020 NaN 2024 172cm 69kg €97.7M 19.0 NaN
3 192985 K. 31 https://cdn.sofifa.net/players/192/985/23_60.png Belgium https://cdn.sofifa.net/flags/be.png 91 91 Manchester City https://cdn.sofifa.net/teams/10/30.png … Yes RCM Aug 30, 2015 NaN 2025 181cm 70kg €198.9M 17.0 NaN
4 224232 N. Barella 25 https://cdn.sofifa.net/players/224/232/23_60.png Italy https://cdn.sofifa.net/flags/it.png 86 89 Inter https://cdn.sofifa.net/teams/44/30.png … Yes RCM Sep 1, 2020 NaN 2026 172cm 68kg €154.4M 23.0 NaN
5 rows × 29 columns
Q 3.2 – [1] – What is the Nationality, Wage, Value, Skill Moves, Overall of the player?
player = ' '
data[data.Name.str.contains(player)].get(["Name", "Nationality", "Wage", "Value", "Skill Moves", "Overall"])
Name Nationality Wage Value Skill Moves Overall
100 Portugal €220K €41M 5.0 90
Q 3.3 – [4] – The feature Overall indicates player’s overall performance score, which normally ranges from 0 to 100. However, it seems the participating players in this FIFA World Cup all have Overall $>40$. Plot the smoothed distribution of Overall. Your plot must also include three vertical lines: one for mean, one for median, and one for 99.94th percentile of the distribution. Your plot must have a legend.
sns.displot(data.Overall, kind='kde', fill=True, rug=True)
plt.axvline(data.Overall.mean(), color='green', label='Mean')
plt.axvline(data.Overall.median(), color='orange', label='Median')
percentile = 99.94
quantile = percentile/100
plt.axvline(data.Overall.quantile(quantile), color='r', ls='--', label=str(percentile)+'th percentile')
plt.legend(loc='upper right')
plt.show()
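The percentile-to-quantile conversion can be checked on its own: the 99.94th percentile is the 0.9994 quantile, and keeping only values strictly above it leaves at most about 0.06% of the entries (fewer when there are ties at the threshold). The following is a minimal sketch on synthetic scores; the normal distribution, its parameters, and the sample size are assumptions for illustration and do not describe the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic integer-valued scores clipped to [40, 99] (illustrative only)
rng = np.random.default_rng(2)
scores = pd.Series(rng.normal(65, 7, size=20000).round().clip(40, 99))

percentile = 99.94
threshold = scores.quantile(percentile / 100)  # quantile expects a fraction in [0, 1]
top = scores[scores > threshold]               # strictly above the threshold
share = top.size / scores.size
print(threshold, top.size, share)
```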
Q 3.4 – [2] – What are the Name, Nationality, Wage, Value, Skill Moves, and Overall of the top 0.06% of players in the Overall distribution?
print('Top players are:')
percentile = 100-0.06
quantile = percentile/100
data[data.Overall > data.Overall.quantile(quantile)].get(["Name", "Age", "Nationality", "Wage", "Value", "Position", "Overall"])
Top players are:
Name Age Nationality Wage Value Position Overall
3 K. 31 Belgium €350K €107.5M RCM 91
25 M. Salah 30 Egypt €270K €115.5M RW 90
41 R. Lewandowski 33 Poland €420K €84M ST 91
56 L. Messi 35 Argentina €195K €54M RW 91
75 K. Mbappé 23 France €230K €190.5M ST 91
100 37 Portugal €220K €41M ST 90
124 K. Benzema 34 France €450K €64M CF 91
192 V. van Dijk 30 Netherlands €230K €98M LCB 90
9151 M. Neuer 36 Germany €72K €13.5M GK 90
14357 T. Courtois 30 Belgium €250K €90M GK 90
Q 3.5 – [0.5] – What features are in the dataset?
data.columns
Index(['ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall',
       'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Joined', 'Loaned From', 'Contract Valid Until', 'Height', 'Weight',
       'Release Clause', 'Kit Number', ' Rating'],
      dtype='object')
Q 3.6 – [1] – Some features such as ID and Kit Number are obviously irrelevant for predicting player Value, and one would drop them. However, to save time, let's drop some more, i.e., the following columns:
ID, Name, Nationality, Photo, Flag, Club, Club Logo, Real Face, Joined, Loaned From, Contract Valid Until, Kit Number, Work Rate, Special, Release Clause, Rating
Show the first 5 rows after the drop.
data.drop(['ID', 'Name', 'Nationality', 'Photo', 'Flag', 'Club', 'Club Logo', 'Real Face',
           'Joined', 'Loaned From', 'Contract Valid Until', 'Kit Number',
           'Work Rate', 'Special', 'Release Clause', ' Rating'], axis=1, inplace=True)
data.head()
Age Overall Potential Value Wage Preferred Foot International Reputation Weak Foot Skill Moves Body Type Position Height Weight
0 27 87 88 €91M €115K Right 4.0 4.0 3.0 Unique SUB 189cm 82kg
1 27 86 87 €78.5M €190K Right 3.0 3.0 4.0 Unique LCM 179cm 69kg
2 30 85 85 €46.5M €46K Left 2.0 3.0 3.0 Stocky (170-185) LB 172cm 69kg
3 31 91 91 €107.5M €350K Right 4.0 5.0 4.0 Unique RCM 181cm 70kg
4 25 86 89 €89.5M €110K Right 3.0 3.0 3.0 Normal (170-) RCM 172cm 68kg
data.info()
# note that we do see some nulls; let them be for now.
RangeIndex: 17660 entries, 0 to 17659
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 17660 non-null int64
1 Overall 17660 non-null int64
2 Potential 17660 non-null int64
3 Value 17660 non-null object
4 Wage 17660 non-null object
5 Preferred Foot 17660 non-null object
6 International Reputation 17660 non-null float64
7 Weak Foot 17660 non-null float64
8 Skill Moves 17660 non-null float64
9 Body Type 17622 non-null object
10 Position 17625 non-null object
11 Height 17660 non-null object
12 Weight 17660 non-null object
dtypes: float64(3), int64(3), object(7)
memory usage: 1.8+ MB
Q 3.7 – [2] – Now let’s do some data cleaning on Height and Weight:
Height and Weight are categorical. But we need numbers. First, remove any possible white spaces from their entries using data['column_name'] = data['column_name'].str.replace(' ', '_'). Second, eliminate "cm" and "kg" from their entries. Third, convert their types to numerical.
data['Height'] = data['Height'].str.replace(' ', '_')
data['Height'] = data['Height'].str.replace('cm', '').astype(int)
data['Weight'] = data['Weight'].str.replace(' ', '_')
data['Weight'] = data['Weight'].str.replace('kg', '').astype(int)
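The same cleaning can be written as one small helper applied to both columns. The following is a minimal sketch on made-up entries; the helper name `strip_unit` and the sample values are hypothetical. Note one deliberate departure from the snippet in the question: whitespace is replaced by an empty string rather than '_', so that an entry containing a stray space can still be cast to an integer.

```python
import pandas as pd

# Made-up entries, including one with a stray space (illustrative only)
heights = pd.Series(["189cm", "179cm", "172 cm"])
weights = pd.Series(["82kg", "69kg", "68kg"])

def strip_unit(col, unit):
    # Drop whitespace and the unit suffix, then cast to int
    cleaned = (col.str.replace(" ", "", regex=False)
                  .str.replace(unit, "", regex=False))
    return cleaned.astype(int)

height_cm = strip_unit(heights, "cm")
weight_kg = strip_unit(weights, "kg")
print(height_cm.tolist(), weight_kg.tolist())
```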
Q 3.8 – [4] – Now let’s do some data cleaning on Value and Wage:
Remove white spaces
Drop the “€” symbol from their entries
You should have realized that some contain a “K” and some contain an “M”. We need to first store their corresponding indices before removing “K” and “M”. Because after removing and conversion to numerical, you then want to multiply “K” entries by 1e+3 and “M” entries by 1e+6.
There are many ways to achieve these steps, and it does not matter how you do it as long as the outputs are correct. For example, for the multiplication part one can do:
data.loc[K_indices, ColumnName] = data[ColumnName].apply(lambda x: x*1e+3)
monetary = ['Value', 'Wage']
for c in monetary:
    data[c] = data[c].str.replace(' ', '_')
    data[c] = data[c].str.replace('€', '')
    ind_K = data[data[c].str.contains('K', regex=False)].index
    ind_M = data[data[c].str.contains('M', regex=False)].index
    data[c] = data[c].str.replace('K', '')
    data[c] = data[c].str.replace('M', '')
    data[c] = data[c].astype(float)
    data.loc[ind_K, c] = data[c].apply(lambda x: x*1e+3)
    data.loc[ind_M, c] = data[c].apply(lambda x: x*1e+6)
data.head()
Age Overall Potential Value Wage Preferred Foot International Reputation Weak Foot Skill Moves Body Type Position Height Weight
0 27 87 88 91000000.0 115000.0 Right 4.0 4.0 3.0 Unique SUB 189 82
1 27 86 87 78500000.0 190000.0 Right 3.0 3.0 4.0 Unique LCM 179 69
2 30 85 85 46500000.0 46000.0 Left 2.0 3.0 3.0 Stocky (170-185) LB 172 69
3 31 91 91 107500000.0 350000.0 Right 4.0 5.0 4.0 Unique RCM 181 70
4 25 86 89 89500000.0 110000.0 Right 3.0 3.0 3.0 Normal (170-) RCM 172 68
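An alternative to storing the "K" and "M" indices is to split each entry into a number and a suffix and map the suffix to a multiplier. The following is a minimal sketch of that idea on made-up entries; the helper name `to_euros`, the regular expression, and the sample values are assumptions for illustration, not the approach the exam prescribes.

```python
import pandas as pd

# Made-up monetary strings (illustrative only)
raw = pd.Series(["€91M", "€78.5M", "€115K", "€0"])

def to_euros(col):
    # Split each entry into its numeric part and an optional K/M suffix
    parts = (col.str.replace("€", "", regex=False)
                .str.extract(r"^\s*([\d.]+)\s*([KM]?)\s*$"))
    number = parts[0].astype(float)
    # Map the suffix to a multiplier; entries without a suffix keep factor 1
    factor = parts[1].map({"K": 1e3, "M": 1e6, "": 1.0}).fillna(1.0)
    return number * factor

euros = to_euros(raw)
print(euros.tolist())
```

This avoids mutating the column in several passes, at the cost of relying on the entries matching the assumed pattern.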
data.info()
RangeIndex: 17660 entries, 0 to 17659
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 17660 non-null int64
1 Overall 17660 non-null int64
2 Potential 17660 non-null int64
3 Value 17660 non-null float64
4 Wage 17660 non-null float64
5 Preferred Foot 17660 non-null object
6 International Reputation 17660 non-null float64
7 Weak Foot 17660 non-null float64
8 Skill Moves 17660 non-null float64
9 Body Type 17622 non-null object
10 Position 17625 non-null object
11 Height 17660 non-null int64
12 Weight 17660 non-null int64
dtypes: float64(5), int64(5), object(3)
memory usage: 1.8+ MB
Q 3.9 – [2] – Now use the pandas describe() to get a stat