APS1070¶
Linear Regression – Project 4¶
Please fill out the following:
• Your name:
• Your student number:
How to use this Notebook¶
1. Run the cells to get a bird’s eye view of what’s going on.
2. Interact with the code to better understand how it works – add cells, plot values, change parameters, etc.
3. (Optional) Replicate Linear Regression code on a blank notebook, without viewing this code. This is called retrieval practice.
4. Complete exercises.
Linear Regression – Introduction¶
Linear Regression¶
In statistics, linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables. Let X be the independent variable and Y be the dependent variable. We will define a linear relationship between these two variables as follows:
$y = mX + c$
This is the equation of a line that you studied in high school: m is the slope of the line and c is the y-intercept. We will use this equation to train our model on a given dataset and predict the value of Y for any given value of X. Our challenge is to determine the values of m and c such that the corresponding line best fits the data, i.e., gives the minimum error.
Are we always fitting a line to the data? What if we have more than one feature (independent variable)?
The vectorized form of the above equation is written as $y = Xw$, where y and w are vectors and X is a matrix.
Where is the c term? It is included within the X matrix.
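As a minimal sketch (toy numbers, for illustration only), the intercept is absorbed by prepending a column of ones to X, so that $w = [c, m]^T$:
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.vstack((np.ones_like(x), x)).T   # first column of ones carries the intercept
w = np.array([4.0, 2.0])                # w = [c, m]
y = X @ w
print(y)                                # [ 4.  6.  8. 10.]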
Hypothesis of Linear Regression
The linear regression model can be represented by the following equation:
$Y = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
• Y is the predicted value ($h_\theta(x)$)
• θ₀ is the bias term.
• θ₁,…,θₙ are the model parameters
• x₁, x₂,…,xₙ are the feature values.
LR – Procedure¶
Our loss function for linear regression is the sum of squared errors, which gives the cost function:
$E = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y}_i)^2$
Here $y_i$ is the actual value and $\bar{y}_i$ is the predicted value. Let’s substitute the value of $\bar{y}_i$ from $y = mX + c$:
$E = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (mx_i + c)\big)^2$
The partial derivative of the above equation with respect to m is denoted $D_m$:
$D_m = \frac{-2}{n}\sum_{i=1}^{n} x_i(y_i - \bar{y}_i)$
while the partial derivative with respect to c is denoted $D_c$:
$D_c = \frac{-2}{n}\sum_{i=1}^{n} (y_i - \bar{y}_i)$
Analytical Approach for Linear Regression¶
Can you derive the Analytical Solution for Linear Regression?
We arrive at the analytical solution by setting the partial derivatives with respect to the parameters to zero. Why?
Because at the point where the cost function is at its minimum with respect to the parameters (m and c), the derivatives of the cost function with respect to those parameters are zero.
$D_m=0$ and $D_c=0$
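For the general vectorized case $y = Xw$, the same reasoning can be written in matrix form (a brief sketch):
$J(w) = (y - Xw)^T(y - Xw)$
$\nabla_w J(w) = -2X^T(y - Xw) = 0 \;\Rightarrow\; X^TXw = X^Ty$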
The solution for the general case comes out to be:
$w = (X^TX)^{-1}X^Ty$
With $L_2$ regularization, the cost becomes:
$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m(h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n\theta^2_j\right]$
and the closed-form solution becomes:
$w = (X^TX + \lambda I)^{-1}X^Ty$
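As a hedged sketch (synthetic data, illustrative names), the regularized closed form can be computed directly with NumPy and compared against sklearn’s Ridge; note that sklearn does not penalize the intercept, so the two differ slightly:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.standard_normal(20)

lam = 1.0
Xb = np.hstack([np.ones((X.shape[0], 1)), X])                      # prepend bias column
w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

ridge = Ridge(alpha=lam).fit(X, y)                                 # for comparison only
print(w)                                                           # [bias, w1, w2, w3]
print(ridge.intercept_, ridge.coef_)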
Gradient Descent Solution¶
Gradient descent uses the gradient equations derived above to determine the direction in which to adjust the parameters m and c:
$m = m - L \cdot D_m$
$c = c - L \cdot D_c$
Here the term $L$ is the learning rate.
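A minimal sketch of this update loop on synthetic data (toy values, illustrative only):
import numpy as np

rng = np.random.default_rng(0)
x = 2 * rng.random(50)
y = 2 * x + 4 + rng.standard_normal(50)   # true m = 2, c = 4, plus noise

m, c, L = 0.0, 0.0, 0.01                  # L is the learning rate
for _ in range(5000):
    yPred = m * x + c
    D_m = (-2 / len(x)) * np.sum(x * (y - yPred))
    D_c = (-2 / len(x)) * np.sum(y - yPred)
    m -= L * D_m
    c -= L * D_c
print(m, c)                               # approximately the true slope 2 and intercept 4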
Any other solution methods?¶
Other solution methods include QR decomposition and singular value decomposition (SVD); a small check follows the questions below.
• Do these methods give exact solutions?
• Do these methods scale well with large datasets?
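As a small sketch (synthetic data, illustrative only), NumPy’s np.linalg.lstsq, which relies on an SVD-based LAPACK routine, gives essentially the same answer as the normal equations but is more stable when $X^TX$ is ill-conditioned:
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((30, 1)), 2 * rng.random((30, 1))])
y = X @ np.array([4.0, 2.0]) + rng.standard_normal(30)

w_normal = np.linalg.inv(X.T @ X) @ X.T @ y        # normal-equation solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # SVD-based least-squares solver
print(w_normal, w_lstsq)                           # essentially identical here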
Let’s Start¶
Single Feature, 5 data points.
Analytical Approach in Code¶
Let’s get our hands dirty.
In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
In [0]:
n = 5 # number of points
m = 2 # slope of line
c = 4 # y intercept of line
r = 2 # range of data on x-axis
np.random.seed(2)
x = r * np.random.rand(n)
x.sort()
yPerf = m*x + c # perfect world: no noise
y = m*x + c + np.random.randn(n) # imperfect world: we get noisy data
In [0]:
x
Out[0]:
array([0.05185246, 0.8407356 , 0.87064479, 0.8719898 , 1.09932496])
Spoiler Alert: We are in an Imperfect World.
In [0]:
def rmse(yPred, y):
    return np.sqrt(mean_squared_error(yPred, y))
In [0]:
plt.plot(x, y, 'o', label='Data')
plt.plot(x, yPerf, 'o-', label='Underlying Distribution')
plt.legend()
plt.show()

In [0]:
x = np.vstack((np.ones(np.shape(x)), x)).T
y = y.reshape(-1, 1)
In [0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))
# prediction
yPred = np.dot(x, w)
In [0]:
w[0], w[1]
Out[0]:
(array([2.67741212]), array([3.35314632]))
• How similar are these to the values we had set initially?
• Will they be the same as those set initially if there were no noise?
In [0]:
plt.plot(x[:, 1], y, 'o', label='Data')
plt.plot(x[:, 1], yPerf, 'o-', label='Underlying Distribution')
plt.plot(x[:, 1], yPred, 'o-', label='Predicted')
plt.legend()
plt.show()

In [0]:
print('RMSE: ', rmse(yPred, y))
RMSE: 0.8321853483360432
Complex Dependent Variable¶
It’s complicated.
In [0]:
n = 20 # number of points
m = 3 # slope of line
c = 10 # y intercept of line
r = 2 # range of data on x-axis
np.random.seed(10)
X = xD = r * np.random.randn(n) # points also stored in xD (xData). will be useful later.
X.sort()
yPerf = X – 2 * (X ** 2) + 0.5 * (X ** 3)
np.random.seed(10)
y = X – 2 * (X ** 2) + 0.5 * (X ** 3) + np.random.normal(0, 5, n) # imperfect world: we get noisy data
In [0]:
plt.plot(X, y, 'o', label='Data')
plt.plot(X, yPerf, 'o-', label='Underlying Distribution')
plt.legend()
plt.show()

In [0]:
X = np.vstack((np.ones(np.shape(X)), X)).T
y = y.reshape(-1, 1)
In [0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
# prediction
yPredLinear = yPred = np.dot(X, w)
w
Out[0]:
array([[-5.45917353],
[ 3.82202654]])
In [0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')
plt.legend()
plt.show()

In [0]:
print('RMSE: ', rmse(yPred, y))
RMSE: 6.839531223681816
What to do next?¶
Can we add more features?
In [0]:
X = np.vstack((X.T, xD**2, xD**3)).T
In [0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
# prediction
yPred4Feature = yPred = np.dot(X, w)
In [0]:
w
Out[0]:
array([[ 0.04177438],
[ 2.58652514],
[-1.87142105],
[ 0.19655185]])
In [0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')
plt.legend()
plt.show()

In [0]:
print('RMSE: ', rmse(yPred, y))
RMSE: 3.929493007090521
But how do we know when to stop? We have no obvious criterion for when to stop adding features to X.
In [0]:
X = np.vstack((X.T, xD**4, xD**5, xD**6, xD**7)).T
In [0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
# prediction
yPred8Feature = yPred = np.dot(X, w)
In [0]:
w
Out[0]:
array([[-2.64611130e-01],
[ 8.28612458e+00],
[-2.07703327e+00],
[-6.35661778e+00],
[-1.30720676e-02],
[ 1.64293567e+00],
[-2.08380043e-03],
[-1.09229711e-01]])
In [0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')
plt.legend()
plt.show()

In [0]:
print('RMSE: ', rmse(yPred, y))
RMSE: 3.3097843019929725
Even though the RMSE here (with many more features) is lower than in the previous case, the predicted line is quite far from the underlying distribution. This is called overfitting.
When does the model overfit? When we have too many features, or comparatively little data, for the model.
What exactly is overfitting? The model pays too much attention to the noise in the provided data, essentially memorizing it rather than generalizing.
Since we don’t know when to stop adding features, what can be done?
• The solution is to work with a model or feature set that can slightly overfit your data, and then use techniques to prevent overfitting. The alternative is underfitting, which cannot be fixed without modifying the feature set or model.
There are many options to prevent overfitting; the most widely used are:
• Using a validation set
• Regularization: add penalty on weights
In [0]:
λ = 10 # lambda: the regularization parameter
f = 8 # number of features
In [0]:
# analytical solution
w = np.dot(np.linalg.inv(np.dot(X.T, X) + (λ)*np.identity(f)), np.dot(X.T, y))
# prediction
yPred8FeatRegu = yPred = np.dot(X, w)
w
Out[0]:
array([[ 0.21326186],
[ 0.40727946],
[-0.46745243],
[-0.31891582],
[-0.51080301],
[ 0.38864889],
[ 0.03613141],
[-0.03367748]])
In [0]:
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPred, 'o-', label='Predicted')
plt.legend()
plt.show()

Can we tell from the plot whether the value of λ is optimal?
Somewhat, but not exactly.
To find the best value of lambda, split the dataset into training and test sets, then cycle over multiple values of lambda. The optimal lambda is the one that gives the lowest test error.
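A hedged sketch of that procedure on synthetic data (the variable names here are illustrative, not part of the cells above):
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.standard_normal((100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_lam, best_err = None, np.inf
for lam in [0, 0.01, 0.1, 1, 10, 100]:
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)
    err = np.sqrt(np.mean((X_te @ w - y_te) ** 2))   # test RMSE for this lambda
    if err < best_err:
        best_lam, best_err = lam, err
print(best_lam, best_err)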
What does low test error represent?
All models together:
In [0]:
plt.figure(figsize=(12, 8))
plt.plot(xD, y, 'o', label='Data')
plt.plot(xD, yPerf, 'o-', label='Underlying Distribution')
plt.plot(xD, yPredLinear, 'o-', label='2 Feature')
plt.plot(xD, yPred4Feature, 'o-', label='4 Feature')
plt.plot(xD, yPred8Feature, 'o-', label='8 Feature')
plt.plot(xD, yPred8FeatRegu, 'o-', label='8 Feature Regularized')
plt.legend()
plt.show()

Further Complex Multivariate Data¶
We will use the Housing dataset, which contains information about different houses in Boston. This data was originally part of the UCI Machine Learning Repository and has since been removed; we can also access it from the scikit-learn library. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices using the given features.
In [0]:
from sklearn.datasets import load_boston
dataset = load_boston()
In [0]:
print(dataset.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
In [0]:
print(dataset.DESCR)
.. _boston_dataset:
Boston house prices dataset
—————————
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
– CRIM per capita crime rate by town
– ZN proportion of residential land zoned for lots over 25,000 sq.ft.
– INDUS proportion of non-retail business acres per town
– CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
– NOX nitric oxides concentration (parts per 10 million)
– RM average number of rooms per dwelling
– AGE proportion of owner-occupied units built prior to 1940
– DIS weighted distances to five Boston employment centres
– RAD index of accessibility to radial highways
– TAX full-value property-tax rate per $10,000
– PTRATIO pupil-teacher ratio by town
– B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
– LSTAT % lower status of the population
– MEDV Median value of owner-occupied homes in $1000’s
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic
prices and the demand for clean air’, J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics
…’, Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
– Belsley, Kuh & Welsch, ‘Regression diagnostics: Identifying Influential Data and Sources of Collinearity’, Wiley, 1980. 244-261.
– Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
In [0]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df.head()
Out[0]:
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
We can see that the target value, MEDV, is missing from the dataframe. It is present in dataset.target, so let’s add it to df.
In [0]:
df['MEDV'] = dataset.target
Are there any null values in the dataset?
In [0]:
df.isnull().sum()
Out[0]:
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
Even though there are none, it’s a good practice to check.
Exploratory Data Analysis¶
Here we use the distplot method from seaborn, another popular plotting library for Python.
In [0]:
sns.set(rc={'figure.figsize': (12, 9)})
sns.distplot(df['MEDV'], bins=30)
plt.show()

We can see that it is mostly normally distributed, with a few outliers.
Linear Regression – Assignment¶
We’ll work with the above dataset for the assignment below. Create a blank notebook for the assignment.
Question 1¶
1. Normalize the data using StandardScaler from sklearn. [0.5]
2. Split the multivariate data into training/validation data (20% validation data) and plot train error and validation error as a function of the lambda regularizer. [1]
3. What is the best lambda where the best fit occurs? Use this best lambda from here on out. [Note: You may be surprised by the best lambda here – this won’t be the case for all datasets.] [1]
In [0]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
In [0]:
#load the data
dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
In [3]:
df.head()  # 506 rows × 13 columns
Out[3]:
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
In [0]:
df['MEDV'] = dataset.target
In [0]:
X = df.iloc[:, :-1]
scale = StandardScaler().fit_transform(X)
y = np.array(dataset.target)
bias = np.ones((X.shape[0], 1))
X_std = np.concatenate((bias, scale), axis=1)  # 506 x 14: bias column + 13 standardized features
In [0]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(X_std, y, test_size=0.2, random_state=1)

def rmse(yPred, y):
    return np.sqrt(mean_squared_error(yPred, y))
In [7]:
from sklearn.model_selection import train_test_split

train_rmse = []
val_rmse = []
λs = np.arange(0, 100, 0.1)   # alternative: [0, 0.01, 0.1, 1, 5, 10, 20, 100]
f = x_train.shape[1]          # number of features (including the bias column)
for λ in λs:
    # ridge closed-form solution on the training split
    w = np.dot(np.linalg.inv(np.dot(x_train.T, x_train) + λ * np.identity(f)), np.dot(x_train.T, y_train))
    y_train_pred = np.dot(x_train, w)
    y_val_pred = np.dot(x_val, w)
    train_rmse.append(rmse(y_train_pred, y_train))
    val_rmse.append(rmse(y_val_pred, y_val))

plt.figure(figsize=(12, 8))
plt.plot(λs, train_rmse, label='train_rmse')
plt.plot(λs, val_rmse, label='val_rmse')
plt.legend()
plt.show()

Thus, the best λ is 0.
Question 2¶
1. Derive the gradient of the $L_2$ regularized objective for Linear Regression, $\frac{\partial}{\partial \theta_j}J(\theta)$. Show your work (either by using in-line equations like we do here, or by taking a picture of your work and embedding it in the notebook). [1]
$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m(h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n\theta^2_j\right]$, where $h_\theta(x^{(i)}) = \theta^Tx^{(i)}$.
$\frac{\partial}{\partial \theta_j}J(\theta)=\frac{1}{2m}\left[2\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}_j + 2\lambda\theta_j\right]=\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}_j + \frac{\lambda}{m}\theta_j$
$\theta_j := \theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)=\left(1-\frac{\alpha\lambda}{m}\right)\theta_j-\frac{\alpha}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}_j$
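As a quick sanity check on this derivation, the analytic gradient can be compared to a finite-difference approximation on random data (a sketch; for simplicity it also penalizes $\theta_0$, unlike the sum over $j \ge 1$ above):
import numpy as np

def cost(theta, X, y, lam):
    m = len(y)
    err = X @ theta - y
    return (err @ err + lam * theta @ theta) / (2 * m)

def grad(theta, X, y, lam):
    m = len(y)
    return (X.T @ (X @ theta - y) + lam * theta) / m

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
theta = rng.standard_normal(4)
lam, eps = 0.5, 1e-6

num_grad = np.array([
    (cost(theta + eps * np.eye(4)[j], X, y, lam) -
     cost(theta - eps * np.eye(4)[j], X, y, lam)) / (2 * eps)
    for j in range(4)
])
print(np.max(np.abs(num_grad - grad(theta, X, y, lam))))   # should be tiny (~1e-8 or smaller)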
Question 3¶
1. Implement batch gradient descent for linear regression using a fixed learning rate $\eta$ of 1e-2. [1.5]
2. Run the gradient descent for a fixed number of epochs (based on convergence) and verify that the results closely approximate the matrix solution with the same lambda. [1]
In [25]:
def gradient_descent(alpha, λ, x, y, numIterations):
    # Note: λ is accepted for consistency but not used here (it is called with λ = 0).
    m = x.shape[0]  # number of samples
    theta = np.zeros(14)
    b = 0
    x_transpose = x.transpose()
    costplot = []
    for iter in range(0, numIterations):
        hypothesis = np.dot(x, theta) + b
        loss = hypothesis - y
        cost = (1 / (2 * m)) * loss.T @ loss
        costplot.append(cost)
        gradient = np.dot(x_transpose, loss) / m
        theta = theta - alpha * gradient  # update
        if iter % 1000 == 0:
            print("Epoch:", int(iter / 1000 + 1))
            print("W:", theta)
            print("cost:", cost)
            print("\n")
    return theta, costplot

η = 0.01
theta_, costplot = gradient_descent(η, 0, X_std, y, 10000)
plt.figure(figsize=(12, 8))
plt.plot(range(10000), costplot, 'o-', label='cost')
plt.legend()
plt.show()
Epoch: 1
W: [ 0.22532806 -0.03567747 0.03311776 -0.04444472 0.01610293 -0.03926228
0.06388975 -0.03463463 0.02296348 -0.03506386 -0.04304914 -0.0466555
0.03063842 -0.06777654]
cost: 296.07345849802374
Epoch: 2
W: [22.53184328 -0.78118429 0.81225774 -0.27296022 0.742528 -1.57768576
2.88911968 -0.10337755 -2.74309864 1.46066488 -0.88375756 -1.95231035
0.87219238 -3.64295466]
cost: 11.082223583143334
Epoch: 3
W: [22.53280628 -0.87243548 0.98512752 -0.09431146 0.71500795 -1.93870211
2.73741056 -0.02849154 -3.05537045 2.0495178 -1.41256803 -2.02454326
0.85005571 -3.71537434]
cost: 10.97700639802383
Epoch: 4
W: [ 2.25328063e+01 -9.02897407e-01 1.03782470e+00 1.47606690e-02
6.99636589e-01 -2.02039368e+00 2.70038833e+00 -2.10039626e-03
-3.10194293e+00 2.34192952e+00 -1.71869388e+00 -2.04599478e+00
8.47868812e-01 -3.73080258e+00]
cost: 10.955463258630404
Epoch: 5
W: [ 2.25328063e+01 -9.15536228e-01 1.05954170e+00 7.40231518e-02
6.91270023e-01 -2.04242336e+00 2.68691330e+00 8.77783084e-03
-3.10695830e+00 2.49331793e+00 -1.88588052e+00 -2.05381422e+00
8.48227952e-01 -3.73705349e+00]
cost: 10.94966597972041
Epoch: 6
W: [ 2.25328063e+01 -9.21601950e-01 1.07008302e+00 1.05492362e-01
6.86797232e-01 -2.05007088e+00 2.68075049e+00 1.39480346e-02
-3.10628176e+00 2.57287969e+00 -1.97541927e+00 -2.05719928e+00
8.48665728e-01 -3.74017178e+00]
cost: 10.94804692596907
Epoch: 7
W: [ 2.25328063e+01 -9.24703373e-01 1.07551394e+00 1.22150174e-01
6.84420574e-01 -2.05336878e+00 2.67764967e+00 1.65713257e-02
-3.10535030e+00 2.61491388e+00 -2.02303919e+00 -2.05883888e+00
8.48939998e-01 -3.74180080e+00]
cost: 10.947592776318182
Epoch: 8
W: [ 2.25328063e+01 -9.26326519e-01 1.07836602e+00 1.30968866e-01
6.83160242e-01 -2.05497568e+00 2.67603552e+00 1.79381510e-02
-3.10475733e+00 2.63716174e+00 -2.04830200e+00 -2.05967731e+00
8.49092780e-01 -3.74266038e+00]
cost: 10.947465320555178
Epoch: 9
W: [ 2.25328063e+01 -9.27182880e-01 1.07987281e+00 1.35639165e-01
6.82492331e-01 -2.05580098e+00 2.67518548e+00 1.86577213e-02
-3.10442576e+00 2.64894438e+00 -2.06169223e+00 -2.06011567e+00
8.49175068e-01 -3.74311513e+00]
cost: 10.947429548304674
Epoch: 10
W: [ 2.25328063e+01 -9.27635927e-01 1.08067036e+00 1.38113015e-01
6.82138448e-01 -2.05623339e+00 2.67473606e+00 1.90380404e-02
-3.10424703e+00 2.65518590e+00 -2.06878732e+00 -2.06034680e+00
8.49218908e-01 -3.74335592e+00]
cost: 10.947419508247867

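As a brief, hedged check of the second part of Question 3 (assuming the variables X_std, y, and theta_ defined in the cells above):
# sanity check: with λ = 0 the gradient-descent weights should closely match the closed form
w_closed = np.linalg.solve(X_std.T @ X_std, X_std.T @ y)
print(np.max(np.abs(theta_ - w_closed)))   # small value => gradient descent has (nearly) converged to the analytical solution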
Question 4¶
1. Plot training error performance vs. number of epochs as a function of learning rate $\eta$ for a range of values {1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1} with each $\eta$ being a different line. [1]
For very low $\eta$, gradient descent should be slow to converge. For very high $\eta$, the error might be very unstable (i.e., not decreasing on every epoch).
In [40]:
def gradient_descentQ4(alpha, λ, x, y, numIterations):
    m = x.shape[0]  # number of samples
    theta = np.zeros(14)
    b = 0
    x_transpose = x.transpose()
    costplot = []
    for iter in range(0, numIterations):
        hypothesis = np.dot(x, theta) + b
        loss = hypothesis - y
        cost = (1 / (2 * m)) * loss.T @ loss
        costplot.append(cost)
        gradient = np.dot(x_transpose, loss) / m
        theta = theta - alpha * gradient  # update
    return theta, costplot

theta1, error1 = gradient_descentQ4(1e-6, 0, x_train, y_train, 10000)
theta2, error2 = gradient_descentQ4(1e-5, 0, x_train, y_train, 10000)
theta3, error3 = gradient_descentQ4(1e-4, 0, x_train, y_train, 10000)
theta4, error4 = gradient_descentQ4(1e-3, 0, x_train, y_train, 10000)
theta5, error5 = gradient_descentQ4(1e-2, 0, x_train, y_train, 10000)
theta6, error6 = gradient_descentQ4(1e-1, 0, x_train, y_train, 10000)

plt.figure(figsize=(12, 8))
plt.plot(range(10000), error1, '-', label='1e-6')
plt.plot(range(10000), error2, '-', label='1e-5')
plt.plot(range(10000), error3, '-', label='1e-4')
plt.plot(range(10000), error4, '-', label='1e-3')
plt.plot(range(10000), error5, '-', label='1e-2')
plt.plot(range(10000), error6, '-', label='1e-1')
plt.legend(loc='upper right')
plt.show()

Question 5¶
Implement a mini-batch version with a batch_size constant B. When B=1, it is stochastic gradient descent. When B=#train data points, it is full-batch. Anywhere in between is mini-batch. (A sketch of the loop structure is given after this question.)
1. Choose B=1, B=16, B=128, B=256, and B=#data points and plot training error as a function of “# of gradient updates” and also separately as a function of “wall-clock time” for each value of B. [2]
2. All lines should be on the same plot. Which B leads to fastest convergence in terms of #gradient updates and in terms of “wall-clock time”? [1]
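A hedged sketch of the mini-batch loop structure only (not a complete answer; it assumes x_train and y_train from the cells above and reuses the squared-error gradient):
import time
import numpy as np

def minibatch_gd(x, y, eta=1e-2, B=16, epochs=50):
    n, d = x.shape
    theta = np.zeros(d)
    rmses, clock, start = [], [], time.time()
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for s in range(0, n, B):              # B = 1 -> SGD, B = n -> full batch
            batch = idx[s:s + B]
            g = x[batch].T @ (x[batch] @ theta - y[batch]) / len(batch)
            theta -= eta * g
            rmses.append(np.sqrt(np.mean((x @ theta - y) ** 2)))  # train RMSE per gradient update
            clock.append(time.time() - start)                     # wall-clock time per update
    return theta, rmses, clock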
Question 6¶
1. Switch to absolute error for the loss but still use $L_2$ regularization. Derive the gradient for this case, $\frac{\partial}{\partial \theta_j}J(\theta)$. [1.5]
2. Tune $\eta$ and B to obtain good performance (fast convergence to the minimum error, but not noisy). Is it easier to train with squared error or absolute error? Why? Show some plots that help you demonstrate this point and explain. [1]
Question 7¶
1. Does the model perform better if polynomial features are added? (Use sklearn.preprocessing.PolynomialFeatures to test this.)
2. Does the optimal value of λ (the regularization constant) change when this new set of features is added?
References¶
Linear Regression:
1. https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
Gradient Descent (images and equations):
1. https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931
2. https://towardsdatascience.com/linear-regression-using-python-b136c91bf0a2
Author: Parth Jaggi