1 Assignment 1
October 20, 2021
In this assignment, we will be predicting the prices of houses using features like the house’s age, distance to public transportation, and latitude/longitude coordinates. The data that we are using is the Real Estate Valuation Data Set from https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set. Download the file data.csv from the course website (not from the URL, since we made minor changes to the data format).
We will be exploring both k Nearest Neighbour models and Linear Regression models.
For this entire assignment, you may not add loops that are not provided in the starter code.
1.1 Question 1: Data, Indexing, and Vectorized Code
Before beginning a machine learning task, one of the first things you should do is understand the data that you are working with. That is where we should start: with the data. Along the way, we will illustrate how to use Python’s numpy package to vectorize computation.
import numpy as np
import numpy.random as rnd
# Read the data
# Download "data.csv" from the course website
data = np.genfromtxt('data.csv', delimiter=',', skip_header=1)
# Display the *shape* of the data matrix
print(data.shape)
# Please leave these print statements, to help your TAs grade quickly.
print('\n\nQuestion 1')
print('----------')
Question 1
———-
1.1.1 Part (a)
Print the first column and the first 10 rows of the data. Recall that you should not add loops that are not already in the starter code.
Note that data is a 2D numpy array (i.e. a matrix), and its elements can be indexed. For example, data[0, 0] indexes the first row and column. Additionally, similar to Python lists, numpy arrays support slicing: e.g. data[1:3, 0] and data[200:, :].
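As a minimal sketch of these indexing patterns on a toy array (the values below are illustrative, not the housing data; np is the numpy module imported above):

m = np.arange(12).reshape(3, 4)  # a 3x4 matrix of the integers 0..11
print(m[0, 0])    # element in the first row, first column
print(m[1:3, 0])  # rows 1 and 2 of the first column
print(m[2:, :])   # every row from index 2 onward, all columns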
print('\nQuestion 1(a):') # Please leave print statements like these
print(data[:10, 0]) # SOLUTION
Question 1(a):
[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
1.1.2 Part (b)
Print the second column and the first 10 rows of the data.
print('\nQuestion 1(b):')
print(data[:10, 1]) # SOLUTION
Question 1(b):
[2012.917 2012.917 2013.583 2013.5 2012.833 2012.667 2012.667 2013.417
2013.5 2013.417]
1.1.3 Part (c)
What do you think the columns in parts (a) and (b) represent? Find the answer by reading the data set information and “Attribute Information” in https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set. You should understand the meaning of the remaining fields as well.
We will be predicting the housing price (last column) using some of the remaining features.
Sample Solution:
Part (a) shows the house index (row number), which is not listed on the “Attribute Information” page. Part (b) shows the transaction date.
1.1.4 Part (d)
Remove the first column from the data, and overwrite the variable data with the result. Print the shape of the resulting matrix data.
print('\nQuestion 1(d):')
data = data[:, 1:] # SOLUTION
print(data.shape)
Question 1(d):
1.1.5 Part (e)
We will first separate the data into training, validation, and test sets. Rather than choosing a random percentage of data points to leave out in our test set, we will instead place the most recent data points in our test set. In particular, any data point with date larger than 2013.417 will be placed in our test set. The code to select the test set elements is written for you below. Pay attention to the way a boolean numpy array like data[:, 0] > 2013.417 can be used to index elements of another numpy array.
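As a quick illustration of this boolean-mask indexing on toy values (not the real data):

dates = np.array([2012.9, 2013.5, 2013.4, 2013.6])
mask = dates > 2013.417   # array([False,  True, False,  True])
print(dates[mask])        # keeps exactly the entries where the mask is True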
Explain why this is a better strategy than randomly selecting data points in our test set.
test = data[data[:, 0] > 2013.417]
Sample Solution:
There are several possible reasons. We expect one that is well articulated.
One possible reason is that this selection of the test set is most similar to how we would deploy the model in real life. We would train the model on historical data, and use the model to predict the price of a new house at a future date. Since the purpose of the test set is to estimate how our model will perform in real life, it makes sense to construct the test set as similarly as we can to reality.
A related reason is that housing-market trends change over time, so these data points are not actually i.i.d. With a random split, the model could learn information about future trends from the training set, information it would not have had when predicting prices of houses sold at a later date.
1.1.6 Part (f)
Create a matrix train_valid that contains the data points that will be in the training or validation set. Then, print the shape of both new matrices from parts (e) and (f).
print(‘\nQuestion 1(f):’)
train_valid = data[data[:, 0] <= 2013.417] # SOLUTION
print(test.shape)
print(train_valid.shape)
Question 1(f):
1.1.7 Part (g)
We will use the variable randarray, given below, to separate our training and validation sets. This array assigns a random integer (0, 1, 2, 3, or 4) to every row of the train_valid array.
For each data point in train_valid, if its corresponding value in randarray is 0, place that data point in the validation set. Otherwise, place that data point in the training set. To earn credit, you should do this without using any loops. (Hint: consider the way we indexed numpy arrays in parts e and f)
Print the shape of the training and validation matrices.
print('\nQuestion 1(g):')
# Below array was generated by calling
# randarray = rnd.randint(0, 5, train_valid.shape[0])
# Do NOT uncomment the above line of code. Instead, we are including
# the values you should use below.
randarray = np.array([2, 0, 1, 3, 0, 0, 0, 3, 2, 3, 1, 1, 2, 0, 4, 4, 0, 2, 1, 2, 2, 2,
                      4, 1, 3, 2, 0, 1, 2, 0, 3, 0, 3, 1, 3, 0, 4, 1, 4, 4, 0, 0, 1, 2,
                      4, 0, 0, 1, 1, 1, 2, 3, 4, 4, 3, 3, 0, 0, 0, 0, 2, 2, 3, 0, 0, 1,
                      4, 1, 4, 2, 2, 4, 4, 2, 0, 4, 0, 3, 2, 0, 4, 3, 1, 1, 0, 0, 0, 0,
                      1, 1, 4, 0, 3, 4, 2, 0, 0, 4, 4, 4, 4, 3, 3, 0, 0, 2, 2, 1, 3, 2,
                      4, 1, 2, 2, 3, 4, 1, 4, 3, 1, 1, 3, 0, 4, 4, 4, 0, 3, 3, 0, 4, 0,
                      0, 4, 3, 4, 1, 2, 2, 4, 4, 1, 2, 1, 1, 0, 4, 4, 4, 2, 0, 4, 2, 0,
                      4, 4, 1, 4, 0, 4, 0, 0, 1, 4, 3, 2, 4, 3, 1, 4, 1, 3, 4, 1, 0, 0,
                      4, 4, 2, 0, 4, 4, 4, 2, 3, 3, 4, 1, 0, 1, 2, 3, 1, 0, 1, 3, 4, 0,
                      0, 1, 2, 2, 2, 2, 4, 3, 1, 1, 4, 4, 1, 4, 2, 4, 0, 2, 4, 1, 3, 0,
                      4, 2, 3, 0, 4, 2, 3, 2, 2, 0, 2, 0, 2, 3, 2, 3, 4, 2, 4, 2, 2, 4,
                      3, 4, 0, 4, 4, 0, 1, 4, 2, 4, 2, 4, 0, 3, 4, 2, 1, 1, 3, 0, 1, 0,
                      3, 1, 3, 2, 4, 3, 1, 3, 0, 3, 0, 4, 2, 1, 2, 3, 2, 2, 4, 1, 4, 2,
                      1, 3, 1, 2, 2, 3, 1, 4, 2, 2, 4, 4, 1, 3, 2, 4, 1, 2, 4, 4, 0, 3,
                      1, 2, 1, 3, 3, 3, 2, 1, 3, 0, 2, 4, 2, 0, 3, 1, 0, 4, 2, 4, 1, 0,
                      4, 2, 2, 0, 1, 4, 3, 4, 0, 3, 2, 0, 2, 0])
train = train_valid[randarray != 0] # SOLUTION
valid = train_valid[randarray == 0] # SOLUTION
print(train.shape)
print(valid.shape)
Question 1(g):
1.1.8 Part (h)
Separate the input features and target values for each of the train, valid, and test sets. In particular, we will use the following columns as features: 1, 2, 3, 4, 5 (but not the date column 0). We will predict the housing price, which is in column 6.
We will refer to the five feature columns as x, and the housing price as t. We will need training, validation and testing versions of both x and t, for a total of 6 arrays. You should build these 6 arrays using the starter code below.
Print the first 2 rows of each of these six new numpy arrays.
print('\nQuestion 1(h):')
train_x = train[:, 1:6] # SOLUTION
train_t = train[:, 6]   # SOLUTION
valid_x = valid[:, 1:6] # SOLUTION
valid_t = valid[:, 6]   # SOLUTION
test_x = test[:, 1:6]   # SOLUTION
test_t = test[:, 6]     # SOLUTION
print(train_x[:2])
print(train_t[:2])
print(valid_x[:2])
print(valid_t[:2])
print(test_x[:2])
print(test_t[:2])
Question 1(h):
[[ 32.       84.87882  10.       24.98298 121.54024]
 [  5.      390.5684    5.       24.97937 121.54245]]
[37.9 43.1]
[[ 19.5     306.5947    9.       24.98034 121.53951]
 [   ...    623.4731    7.       24.97933 121.53642]]
[42.2 40.3]
[[ 13.3     561.9845    5.       24.98746 121.54391]
 [ 13.3     561.9845    5.       24.98746 121.54391]]
[47.3 54.8]
1.1.9 Part (i)
Compute the mean and standard deviation of each column in data. Then, compute the mean and standard deviation of each column in train_x, saving the results in x_mean and x_std. Print both sets of means and standard deviations.
You may find the functions np.mean and np.std helpful. Find and read their documentations, and pay particular attention to the parameter axis.
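As a small illustration of what axis does (toy matrix, not the housing data):

tmp = np.array([[1.0, 2.0],
                [3.0, 6.0]])
print(np.mean(tmp, axis=0))  # mean of each column: [2. 4.]
print(np.mean(tmp, axis=1))  # mean of each row:    [1.5 4.5]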
print('\nQuestion 1(i):')
print(np.mean(data, axis=0)) # SOLUTION
print(np.std(data, axis=0)) # SOLUTION
x_mean = np.mean(train_x, axis=0) # SOLUTION
x_std = np.std(train_x, axis=0) # SOLUTION
print(x_mean) # SOLUTION
print(x_std) # SOLUTION
Question 1(i):
[2013.14897101 17.71256039 1083.88568891 4.0942029 24.96903007
121.53336109 37.98019324]
[2.81626494e-01 1.13787172e+01 1.26058439e+03 2.94200221e+00
1.23951994e-02 1.53286366e-02 1.35900448e+01]
[ 17.7018315 1029.6309178 4.05494505 24.96959267 121.53340674]
[1.15249195e+01 1.21117367e+03 2.84275045e+00 1.14181823e-02
1.50324998e-02]
1.1.10 Part (j)
For some of the models in this assignment, we will work with a normalized version of the features. In other words, we subtract the mean and divide by the standard deviation, so that each feature has zero mean and unit variance. Explain why using normalized data may be useful for some models, like the k-nearest neighbour model.
Sample Solution:
For the kNN model, normalization effectively changes the distance metric that we use. In other words, normalization changes the weighting of different features.
For example, if we have one feature with an extremely large scale (e.g. distance to MRT station), then that feature will dominate: the distance between two data points will be determined primarily by the difference in their respective distances to the MRT stations.
Normalization is a way to ensure that every feature contributes to the distance metric.
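A small sketch of this effect with made-up numbers (the feature values, means, and stds below are illustrative only):

a = np.array([100.0, 1.0])  # e.g. [distance to MRT, number of stores]
b = np.array([900.0, 2.0])
print(np.sqrt(np.sum((a - b) ** 2)))  # ~800.0: the large-scale feature dominates
# After normalizing each feature, both contribute on a comparable scale.
mean, std = np.array([500.0, 1.5]), np.array([400.0, 0.5])
na, nb = (a - mean) / std, (b - mean) / std
print(np.sqrt(np.sum((na - nb) ** 2)))  # ~2.83: both features now matter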
1.1.11 Part (k)
Explain why we should compute the mean and standard deviation using the training data, rather than across the entire labeled data (including the validation/test sets).
Sample Solution:
We compute the mean and std using the training data only, so that the validation and test sets behave like new data that the model has never seen. If the normalization statistics were computed across all of the labeled data, information about the validation/test sets would leak into training, and our error estimates would be overly optimistic.
1.1.12 Part (l)
This part is meant to help you understand broadcasting, a method that numpy uses to perform vectorized computation. You will need to use broadcasting to complete part (m). Consider the following computation:
print('\nQuestion 1(l):')
tmp_a = np.array([[1.4, 2.5, 3.0], [9.1, 3.4, 2.3]])
tmp_b = np.sum(tmp_a, axis=0)
print(tmp_a)
print(tmp_b)
print(tmp_a - tmp_b)
Question 1(l):
[[1.4 2.5 3. ]
[9.1 3.4 2.3]]
[10.5 5.9 5.3]
[[-9.1 -3.4 -2.3]
[-1.4 -2.5 -3. ]]
Explain what computation was done to obtain the result.
Sample Solution:
The column sum tmp_b[j] is subtracted from each element tmp_a[i, j] of tmp_a. So the computations that were performed were:
1.4 − 10.5 = −9.1    2.5 − 5.9 = −3.4    3.0 − 5.3 = −2.3
9.1 − 10.5 = −1.4    3.4 − 5.9 = −2.5    2.3 − 5.3 = −3.0
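Broadcasting works whenever trailing dimensions match or are 1. For example, here is a sketch of centering each row instead of each column (keepdims keeps the summed axis as size 1 so the shapes line up):

row_means = np.mean(tmp_a, axis=1, keepdims=True)  # shape (2, 1)
print(tmp_a - row_means)  # the (2, 1) array broadcasts across the 3 columns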
1.1.13 Part (m)
Using broadcasting (which you learned in the previous part), create the numpy array norm_train_x, which is the normalized version of the training data. Each column of this new matrix should have zero mean and unit variance.
Print the mean of the columns of the new matrix.
Print the first 2 rows of norm_train_x.
print('\nQuestion 1(m):')
norm_train_x = (train_x - x_mean) / x_std # SOLUTION
print(np.mean(norm_train_x)) # SOLUTION
print(norm_train_x[:2]) # SOLUTION
Question 1(m):
-1.7899170139545e-13
[[ 1.24063066 -0.78003025 2.09130384 1.17245685 0.45456578]
[-1.10211889 -0.52763904 0.33244386 0.85629444 0.60158059]]
1.1.14 Part (n)
Consider the computation below, which is an alternative way of computing norm_train_x that uses loops rather than vectorized code. How much slower is this code compared to your code in the previous part? Include your response in your writeup.
print('\nQuestion 1(n):')
import time
nonvec_before = time.time()
norm_train_x_loop = np.zeros_like(train_x)
for i in range(train_x.shape[0]):
    for j in range(train_x.shape[1]):
        norm_train_x_loop[i, j] = (train_x[i, j] - x_mean[j]) / x_std[j]
nonvec_after = time.time()
print("Non-vectorized time: ", nonvec_after - nonvec_before)
vec_before = time.time()
# TODO: Add your code here
norm_train_x = (train_x - x_mean) / x_std # SOLUTION
vec_after = time.time()
print("Vectorized time: ", vec_after - vec_before)
# Include your response in your writeup
Question 1(n):
Non-vectorized time: 0.0010249614715576172
Vectorized time: 5.078315734863281e-05
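A single time.time() pair can be noisy. For a more stable estimate, one option (a sketch using the standard-library timeit module) is to average over many repetitions:

import timeit
avg = timeit.timeit(lambda: (train_x - x_mean) / x_std, number=1000) / 1000
print("Vectorized, averaged over 1000 runs:", avg)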
1.2 Question 2: Nearest Neighbour for Regression
In class we discussed nearest neighbours for classification; here we will use nearest neighbours for regression. In particular, we will use the nearest neighbour method to predict the housing prices, given the other features. Instead of taking a majority vote of the discrete target among the nearest neighbours, we will take the average of the continuous target (the housing prices). We will explore using both the normalized and unnormalized features.
For this question, you may not add loops that are not already in the starter code.
1.2.1 Part (a)
First, let’s consider using a 1-nearest neighbour approach to predict the house price of the first data point v in the validation set. We will use the unnormalized version of the dataset.
Without using loops, compute the Euclidean distance between v and every data point in the training set train_x. (The sample solution works with squared Euclidean distances, which give the same ordering and hence the same nearest neighbours.) Save the result in the numpy array distances. Print the first 10 rows of distances.
Then, find the index n with the minimum value of distances. Print the row train_x[n], which is the closest data point to v, and the prediction train_t[n]. (There are several ways to do this!)
# Please leave these print statements, to help your TAs grade quickly.
print('\n\nQuestion 2')
print('----------')
print('\nQuestion 2(a):')
v = valid_x[0] # should be np.array([19.5, 306.5947, 9., 24.98034, 121.53951])
distances = np.sum((train_x - v) ** 2, 1) # SOLUTION (squared Euclidean distances)
print(distances[:10]) # SOLUTION
n = np.argmin(distances) # SOLUTION
print(train_x[n]) # SOLUTION
print(train_t[n]) # SOLUTION
Question 2
----------
Question 2(a):
[4.93151815e+04 7.27783230e+03 3.49124023e+06 1.00237381e+04
4.68901518e+04 3.45191975e+04 4.67881241e+06 5.74125724e+02
2.02590444e+03 8.05369578e+04]
[ 16.4 289.3248 5. 24.98203 121.54348]
1.2.2 Part (b)
Now, let’s consider using a 3-nearest neighbour model to make a prediction for the same v from part (a), again using unnormalized data. In other words, we find the 3 smallest elements of distances, and average their corresponding values in train_t to obtain a prediction. Print this prediction.
You may want to consider sorting the list of distances, if you didn’t do this in part (a). Again, there are several ways to do this sorting. You will also want to keep track of the indices (or the corresponding t values) as you sort the distances.
(Since this portion is unrelated to machine learning per se, how to do this is up to you to figure out. If this part is challenging, you may benefit from further computer science preparation before taking this course—i.e. any course that requires you to practise writing code to solve problems.)
print('\nQuestion 2(b):')
results = sorted(zip(distances, train_t), key=lambda x: x[0]) # SOLUTION
pred = np.mean([r[1] for r in results[:3]]) # SOLUTION
print(pred) # SOLUTION
Question 2(b):
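An equivalent way to do this sorting, sketched with np.argsort (either approach is fine):

idx = np.argsort(distances)           # indices that sort distances ascending
pred_alt = np.mean(train_t[idx[:3]])  # average the targets of the 3 nearest
print(pred_alt)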
1.2.3 Part (c)
Complete the function unnorm_knn(v, k) that takes a feature vector v, and uses the k-nearest neighbour algorithm to make a prediction. Your code should be nearly identical to those from part (b), except k is now a parameter.
Print the prediction for v=valid_x[1], with k=5.
print('\nQuestion 2(c):')
def unnorm_knn(v, k, features=train_x, labels=train_t):
    """
    Returns the k Nearest Neighbour prediction of housing prices for an input
    vector v.

    Parameters:
        v        - The input vector to make predictions for
        k        - The hyperparameter "k" in kNN
        features - The input features of the training data; a numpy array of
                   shape [N, D] (By default, `train_x` is used)
        labels   - The target labels of the training data; a numpy array of
                   shape [N] (By default, `train_t` is used)
    """
    euclidean_distances = np.sum((features - v) ** 2, 1) # SOLUTION
    results = sorted(zip(euclidean_distances, labels), key=lambda x: x[0]) # SOLUTION
    return np.mean([r[1] for r in results[:k]]) # SOLUTION
print(unnorm_knn(v=valid_x[1], k=5))
Question 2(c):
1.2.4 Part (d)
We wrote most of the function compute_mse for you below. This function takes a parameter predict, which is itself a function that makes a prediction given a feature vector. This function is intended to compute the mean squared error of the predictions made using the predict method, across the provided dataset (by default, the entire unnormalized validation set is supplied).
Complete the Mean Square Error (MSE) computation. The Mean Square Error is another term for the average square loss across a data set. When you have done so, the code below will print the training and validation MSE for a (very simple) model that always predicts the average house price across the entire training set. We will call this a baseline model. Such a model is often used for sanity checking, and as a point of comparison. Your kNN model should be better than this baseline model.
print('\nQuestion 2(d):')
def compute_mse(predict, data_x=valid_x, data_t=valid_t):
    """
    Returns the Mean Squared Error of a model across a dataset

    Parameters:
        predict - A Python *function* that takes an input vector and produces
                  a prediction for that vector.
        data_x  - The input features of the data set to make predictions for
                  (By default, `valid_x` is used)
        data_t  - The target labels of the dataset to make predictions for
                  (By default, `valid_t` is used)
    """
    errors = []
    for i in range(data_t.shape[0]):
        y = predict(data_x[i])  # SOLUTION
        t = data_t[i]           # SOLUTION
        error = (y - t) ** 2    # SOLUTION
        errors.append(error)
    return np.mean(errors)

def baseline(v):
    """
    Returns the average housing price given an input vector v.
    """
    return np.mean(train_t)
# compute and print the training and validation MSE
print(compute_mse(baseline, data_x=train_x, data_t=train_t))
print(compute_mse(baseline, data_x=valid_x, data_t=valid_t))
Question 2(d):
174.77563525607482
200.6520555038695
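As a sanity check, the baseline’s MSE can also be computed without the loop: since the baseline predicts the constant np.mean(train_t), its training MSE is exactly the variance of the training targets. A vectorized sketch:

baseline_pred = np.mean(train_t)
print(np.mean((train_t - baseline_pred) ** 2))  # equals np.var(train_t)
print(np.mean((valid_t - baseline_pred) ** 2))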
1.2.5 Part (e)
For each choice of k (1, 2, up to 30), compute the MSE on the training and validation sets for the corresponding kNN model. Store these values in the two lists train_mse and valid_mse. Print these two lists.
We include code below that plots the values in these two lists. Include the plot in your writeup. From the plot, what is the optimal value of k?
print('\nQuestion 2(e):')
train_mse = []
valid_mse = []
for k in range(1, 31):
    # create a temporary function `predict_fn` that computes the knn
    # prediction for the current value of the loop variable `k`
    def predict_fn(new_v):
        return unnorm_knn(new_v, k)

    # compute the training and validation MSE for this kNN model
    mse = compute_mse(predict_fn, data_x=train_x, data_t=train_t)
    train_mse.append(mse)
    mse = compute_mse(predict_fn, data_x=valid_x, data_t=valid_t)
    valid_mse.append(mse)
print(train_mse) # SOLUTION
print(valid_mse) # SOLUTION
from matplotlib import pyplot as plt
plt.plot(range(1, 31), train_mse)
plt.plot(range(1, 31), valid_mse)
plt.xlabel("k")
plt.ylabel("MSE")
plt.title("Unnormalized kNN")
plt.legend(["Training", "Validation"])
plt.show()
Question 2(e):
[2.4539194139194143, 23.00464285714286, 33.72310134310134, 38.824844322344326,
43.87213479853481, 46.965049857549864, 51.38078044404575, 54.25775297619048,
57.21188486410709