LinearRegression-Workshop2
Data Science Workshop: Week 4¶
This week we will use the linear regression theory that we have covered over the last two weeks. We will recap derivatives, introduce vectors and matrices in Python, and derive and implement the least squares solution. Subsequently, we will look at polynomial regression using the least squares solution derived before. Here, we will also make interesting observations about model selection and see the over- and under-fitting problem directly in our data set.
Please read over the whole notebook. It contains several exercises (12) that you have to complete.
Vectors and Matrices in Python¶
We can create a vector by:
In [1]:
import numpy as np
a = np.array([1,0,2])
b = np.array([3,2,1])
A vector only has one dimension in Python (it is not directly treated as a matrix). We can see this by calling shape: the second dimension is left free
In [2]:
a.shape
Out[2]:
(3,)
We can add vectors and compute the inner product with the dot function
In [3]:
c = a + b
d = a.dot(c)
d
Out[3]:
10
A matrix is created by:
In [4]:
A = np.array([[2, 1, 3], [1, 1 ,2]])
B = np.array([[2, 1], [1, 2], [5 ,2]])
A.shape, B.shape
Out[4]:
((2, 3), (3, 2))
where we have created a $2\times 3$ and a $3\times 2$ matrix.
Vectors can also be stacked into matrices
In [5]:
X = np.column_stack((a,b))
Y = np.row_stack((a,b))
X,Y
Out[5]:
(array([[1, 3],
[0, 2],
[2, 1]]), array([[1, 0, 2],
[3, 2, 1]]))
We can add, transpose and multiply matrices
In [6]:
C = A + B.transpose()
D = C.dot(A.transpose()) # matrix product C * A^T
C,D
Out[6]:
(array([[4, 2, 8],
[2, 3, 4]]), array([[34, 22],
[19, 13]]))
And we can multiply matrices with vectors
In [7]:
e = A.dot(a) # this corresponds to A * a
f = a.dot(B) # this corresponds to a^T * B
e, f
Out[7]:
(array([8, 5]), array([12, 5]))
The inverse of a matrix can be computed by
In [8]:
import numpy.linalg as linalg
AA = A.dot(A.transpose()) # A * A^T ... we can only invert square matrices
AAinv = linalg.inv(AA)
AA, AAinv
Out[8]:
(array([[14, 9],
[ 9, 6]]), array([[ 2. , -3. ],
[-3. , 4.66666667]]))
Multiplying a matrix with its inverse must result in the identity matrix (from both sides)
In [9]:
AA.dot(AAinv), AAinv.dot(AA)
Out[9]:
(array([[ 1., 0.],
[ 0., 1.]]), array([[ 1.00000000e+00, -3.55271368e-15],
[ 1.42108547e-14, 1.00000000e+00]]))
Note: Computing the inverse of a matrix is tricky and it is hard to get a numerically accurate solution. Whenever we need to compute the inverse of a matrix times another matrix ($\boldsymbol{A}^{-1}\boldsymbol{B}$), it is better to use specifically tailored methods for this which are numerically more stable.
In [10]:
import numpy.linalg as linalg
# compute A^-1*b in a more stable way using linalg.solve.
b = np.array([1, 2])
out1 = linalg.solve(AA, b)
out1
Out[10]:
array([-4. , 6.33333333])
Exercise 1:¶
Compute:
$(\boldsymbol{A} \boldsymbol a - \boldsymbol b)^T(\boldsymbol A \boldsymbol a - \boldsymbol b)$,
$(\boldsymbol{C} \boldsymbol b)^T\boldsymbol C$,
$(\boldsymbol{C}^T \boldsymbol C)^{-1}\boldsymbol C^T \boldsymbol a$,
where the matrices and vectors are defined below. Check your result also in terms of the dimensionalities of the resulting matrices. That's an easy way of spotting an error. Always use the linalg.solve method instead of the linalg.inv method if possible.
In [11]:
A = np.array([[1, 0, 1], [2, 3, 1]])
C = np.array([[1, 0], [2, 3], [1, 5]])
a = np.array([1,2,1])
b = np.array([2,2])
sol1 = # … put your code here
sol2 = # … put your code here
sol3 = # … put your code here
sol1, sol2, sol3
Linear Regression in Matrix Form¶
We want to find a linear function (line) that best fits the data
Linear Functions in vector form¶
More formally, we want to fit a linear function of the following form:
$$\hat y = \beta_0 + \sum_i \beta_i x_i $$
$\beta_0$ is the offset and $\beta_i$ defines the slope for the ith input.
We can also write the output $\hat{y}$ in vector form
$$\hat{y} = \beta_0 + \boldsymbol x^T\boldsymbol \beta = \tilde{\boldsymbol x}^T \tilde{\boldsymbol \beta}, \textrm{ with } \tilde{\boldsymbol \beta} = \left[\begin{array}{c}\beta_0 \\ \vdots \\ \beta_d \end{array} \right] \textrm{ and } \tilde{\boldsymbol x} = \left[\begin{array}{c}1 \\ x_1 \\ \vdots \\ x_d \end{array} \right]$$
Note that we prepended a one to the $\boldsymbol x$-vector which will multiply with the offset $\beta_0$ when computing the scalar product.
Matrices for multiple outputs¶
We will now consider multiple samples $\boldsymbol x_i$, where we again prepend a $1$ to create the $\tilde{\boldsymbol x}_{i} = \left[\begin{array}{c}1 \\ x_{i,1} \\ \vdots \\ x_{i,d} \end{array} \right]$
vector. We can stack all $\tilde{\boldsymbol x}_{i}$ (as rows) in a matrix $\tilde{\boldsymbol X} = \left[\begin{array}{c}\tilde{\boldsymbol x}_{1}^T\\ \vdots \\ \tilde{\boldsymbol x}_{n}^T \end{array} \right].$
The output $\hat y_i$ for each sample can also be subsumed in a vector
$\hat{\boldsymbol y} = \left[\begin{array}{c}\hat{y}_{1}\\ \vdots \\ \hat{y}_{n} \end{array} \right] = \left[\begin{array}{c} \tilde{\boldsymbol x}_1^T \tilde{\boldsymbol \beta} \\ \vdots \\ \tilde{\boldsymbol x}_n^T \tilde{\boldsymbol \beta} \end{array} \right] = \tilde{\boldsymbol X} \tilde{\boldsymbol{\beta}}.$ Hence, the computation of all output values can be written as a matrix-vector product.
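To make this concrete, here is a minimal sketch with hand-picked toy numbers (the values and variable names below are purely illustrative): we prepend a column of ones to a small data matrix and obtain all predictions at once with a single matrix-vector product.
In [ ]:
import numpy as np
# three made-up 2-dimensional samples (toy values, for illustration only)
X_toy = np.array([[0.5, 1.0],
                  [1.5, 2.0],
                  [2.5, 3.0]])
beta_tilde_toy = np.array([1.0, 2.0, -1.0])   # [beta_0, beta_1, beta_2], hand-picked
# prepend a column of ones to obtain the X-tilde matrix
Xtilde_toy = np.column_stack((np.ones(X_toy.shape[0]), X_toy))
# all predicted outputs at once as a matrix-vector product
yhat_toy = Xtilde_toy.dot(beta_tilde_toy)
yhat_toy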
Let's do it in Python…¶
We will consider a 1-dimensional problem as illustrated below. We are given 10 training samples and we want to fit
a line to these samples. Our line has 2 parameters, $\beta_0$ and $\beta_1$. Let's first look at the data and how we can compute a prediction using hand-picked
$\beta_0$ and $\beta_1$ values.
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
# pandas' DataFrame.from_csv was removed in newer versions; read_csv with index_col=0 is the equivalent
data_train = pd.read_csv('regression_train.csv', index_col=0)
data_test = pd.read_csv('regression_test.csv', index_col=0)
data_train
Get the training data as numpy arrays¶
In [ ]:
# .as_matrix() was removed from pandas; .to_numpy() returns the same numpy arrays
x_train = data_train['x'].to_numpy()
y_train = data_train['y'].to_numpy()
x_test = data_test['x'].to_numpy()
y_test = data_test['y'].to_numpy()
Plot the training data¶
In [ ]:
%matplotlib notebook
## get input output vectors from the data frame and plot the data
import matplotlib.pyplot as plt
plt.clf()
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.legend(('training points', 'ground truth'))
# plt.hold() was removed from matplotlib; overlaying plots is now the default behavior
plt.savefig('trainingdata.png')
plt.show()
Preparing the data matrix¶
As a first step, let's construct the $\tilde{\boldsymbol{X}}$ matrix
In [ ]:
Xtilde = np.column_stack((np.ones(x_train.shape), x_train))
Xtilde
Exercise 2 – Prediction with arbitrary betas.¶
We now want to extend the plot with the prediction for hand-picked betas.
First, compute the data matrix Xtest_tilde for the test set (see code above how it is done for the training set).
Reuse the plotting code from above and add the predicted line for $\beta_0 = 7$ and $\beta_1 = 6$ to the plot (use a red line color and extend your legend)
In [ ]:
beta_0 = 7
beta_1 = 6
betatilde = np.array([beta_0, beta_1])
Xtest_tilde = # put your code here…
ytest_hat = # put your code here…
plt.figure()
plt.plot(x_test, ytest_hat, 'r')
plt.plot(x_train, y_train, 'bo')
plt.plot(x_test, y_test, 'g')
plt.legend(('predictions', 'training points', 'ground truth'), loc='lower right')
plt.savefig('regression_randomPrediction.png')
Error computation¶
Well, not too bad but still a large error. We can do better. To assess the quality of our prediction, we compute the error as the difference to the training labels, i.e.,
$$\boldsymbol e = \left[\begin{array}{c}y_1 \\ \vdots \\ y_n \end{array} \right] - \left[\begin{array}{c}\hat y_1 \\ \vdots \\ \hat y_n \end{array} \right].$$
In [ ]:
yhat = Xtilde.dot(betatilde)
error = y_train - yhat
error
The summed squared error can be computed by the scalar product of the error vector with itself, i.e.,
$$SSE = \boldsymbol{e}^T \boldsymbol{e} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
In [ ]:
SSE = error.dot(error) # The scalar product is also implemented with the dot function (no need for transpose)
SSE
Exercise 3 – The SSE error function¶
The SSE is a function of the $\tilde{\boldsymbol{\beta}}$ vector. We want to minimize the SSE, i.e., find the
$\tilde{\boldsymbol{\beta}}$ with minimal summed squared error. In this exercise, you should visualize the error function by testing different $\beta_0$ and $\beta_1$ values and creating a 3D plot.
First, implement a function SSE that takes the betaTilde vector, the training input x and the training output y and computes the sum of squared errors for the given betas.
In [ ]:
def SSE(beta, x, y):
    error = # Put your code here..
    SSE = # Put your code here..
    return SSE
We now compute the SSE for a grid from $-200$ to $200$, using $50$ grid values per dimension, i.e., in total we get
$2500$ evaluations of the SSE function. If your SSE function executes without errors, you just have to run the cell.
In [ ]:
import numpy as np
# specify data points for beta0 and beta1 (from -200 to 200, using 50 uniformly distributed points)
beta0Array = np.linspace(-200, 200, 50)
beta1Array = np.linspace(-200, 200, 50)
SSEarray = np.zeros((50,50))
for i in range(0,50):
    for j in range(0,50):
        beta = np.array([beta0Array[i], beta1Array[j]])
        SSEarray[i,j] = SSE(beta, x_train, y_train)
SSEarray
The cell outputs the SSE for every grid position between -200 and 200 in both dimensions. For a better visualization, we can create a 3D plot. Run the following cell to do so.
In [ ]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') is deprecated in newer matplotlib
beta0Grid, beta1Grid = np.meshgrid(beta0Array, beta1Array)
# Plot the surface.
surf = ax.plot_surface(beta0Grid, beta1Grid, SSEarray, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.xlabel('beta0')
plt.ylabel('beta1')
plt.savefig('errorfunction.png')
plt.show()
Minimization of the SSE¶
We have seen that we need to minimize a quadratic function that depends on 2 or more parameters ($\beta_0$ to $\beta_d$). We could now simply take the best $\boldsymbol \beta$ values that we have seen on the grid. However, this is highly inefficient for high-dimensional problems and also inaccurate. In order to do this minimization properly, we first have to talk about derivatives and vector derivatives.
We can now compute the optimal beta values on our grid
In [ ]:
minIndex = np.argmin(SSEarray)
index1, index2 = np.unravel_index(minIndex, (50,50))
beta0 = beta0Array[index1]
beta1 = beta1Array[index2]
beta0, beta1
Exercise 4 – Plot the line¶
Using the beta values found before, plot the resulting line in a similar plot as for exercise 2 and compute the SSE. Is it better than our hand-picked values?
In [ ]:
# Put your code here…
Derivatives¶
Definition: “The derivative of a function of a real variable measures the sensitivity to change of a quantity (a function value or dependent variable) which is determined by another quantity (the independent variable)”.
Derivative Rules¶
In order to compute a derivative $\partial f(x) / \partial x $ of a function $f(x)$, we can use the following rules:
Constants:
$$ \frac{\partial a}{\partial x} = 0$$
Linear term:
$$ \frac{\partial ax}{\partial x} = a$$
Quadratic terms:
$$ \frac{\partial x^2}{\partial x} = 2x$$
Derivative Rules cont.¶
Linearity:
$$ \frac{\partial a\cdot f(x) + b \cdot g(x)}{\partial x} = a\frac{\partial f(x)}{\partial x} + b \frac{\partial g(x)}{\partial x}$$
Chain Rule: For a composition of functions, i.e., $y = f(g(x))$, we can introduce an auxiliary variable $u = g(x)$. The derivative of $$y = f(g(x))$$ is then given by
$$ \frac{\partial f(g(x)) }{\partial x} = \frac{\partial f(u)}{\partial u} \frac{\partial u}{\partial x},$$
i.e., we have to multiply the derivative of $f$ with the derivative of $u = g(x)$.
Example¶
Compute the derivative of $h(x) = 2 (x^2 - 4)^2$. This function can be decomposed into $f(u) = 2 u^2$ and $u = g(x) = x^2 - 4$, with $h(x) = f(g(x))$.
Compute derivative of $f$: $$ \frac{\partial f(u)}{\partial u} = \frac{\partial (2 u^2)}{\partial u}= 4 u.$$
Compute derivative of $u$: $$ \frac{\partial u}{\partial x} = \frac{\partial (x^2 - 4)}{\partial x} = 2x.$$
Final Result:
$$ \frac{\partial f(g(x)) }{\partial x} = \frac{\partial f(u)}{\partial u} \frac{\partial u}{\partial x} = 4 u \cdot 2 x = 8 \underbrace{(x^2 - 4)}_u x$$
Example in Python¶
In [ ]:
# Define the function and its derivative
def h(x):
    return 2 * (x**2 - 4) ** 2

def hd(x):
    return 8 * (x**2 - 4) * x
In [ ]:
# plot it
import numpy as np
import matplotlib.pyplot as plt
plt.figure()
x = np.linspace(-3, 3, num=100)
plt.subplot(2,1,1)
plt.plot(x, h(x))
plt.ylabel('h(x)')
plt.subplot(2,1,2)
plt.plot(x, hd(x))
plt.xlabel('x')
plt.ylabel('hd(x)')
plt.savefig('function_derivatives.png')
We can see that the function has 3 extrema (2 minima and 1 maximum), which occur where its derivative is 0 (at $x = -2, 0, 2$)
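As a quick numerical sanity check (assuming h and hd from the cell above are still defined), the derivative indeed vanishes at these three points:
In [ ]:
hd(np.array([-2.0, 0.0, 2.0]))   # should return array([0., 0., 0.])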
Exercise 5: Derivatives¶
Compute the derivative of $v(x) = (x + 4)^2 + 3 x^2 - (x - 1)^2$. Plot the function and its derivative!
In [ ]:
# Put your code here
Computing the minima of a function¶
We have seen that an extremum of a function occurs where its derivative is 0. For certain types of functions, e.g.,
quadratic functions, we can directly compute where the extremum (i.e., maximum or minimum) is. Note that this only works if the function is convex (or concave),
i.e., there is only one minimum (or maximum).
Example: We want to compute the minimum of the following function: $f(x) = 2 x^2 - 2x + 1$
Derivative:
$$\frac{\partial f(x)}{\partial x} = 4x - 2 $$
Compute minimum:
\begin{align} \frac{\partial f(x)}{\partial x} &= 0 \\
4 x - 2 &= 0 \\
x &= \frac{1}{2}
\end{align}
Let's confirm that with a plot…¶
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
x = np.linspace(-3, 3, num=100)
plt.figure()
plt.plot(x, 2 * x ** 2 - 2 * x + 1)
plt.savefig('function_minimum.png')
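We can also confirm this numerically on the grid defined above (a quick check, limited by the grid resolution): the smallest function value occurs close to $x = 1/2$.
In [ ]:
f = 2 * x ** 2 - 2 * x + 1
x[np.argmin(f)]   # should be close to 0.5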
Exercise 6: Computing the minimum¶
Compute the minimum of the function defined in exercise 5. Confirm your result in the plot from exercise 5.
Matrix calculus¶
Now, we want to compute the derivatives of functions $f(\boldsymbol{x})$ that take several input values, i.e., a vector $\boldsymbol{x}$. As we have several input values $x_i$, we also need to differentiate the function with respect to each of them. The derivative of $f$ is therefore not a single value, but a vector that contains the derivative w.r.t. each variable, i.e.,
$$\frac{\partial f(\boldsymbol x)}{\partial \boldsymbol x} = \left[\frac{\partial f(\boldsymbol x)}{\partial x_1}, \dots, \frac{\partial f(\boldsymbol x)}{\partial x_d}\right].$$
The rules for differentiation are similar to the scalar rules, with small differences as listed below:
Constants:
$$ \frac{\partial \boldsymbol a}{\partial \boldsymbol x} = \boldsymbol 0$$
Linear term:
$$ \frac{\partial \boldsymbol A \boldsymbol x}{\partial \boldsymbol x} = \boldsymbol A$$
Quadratic terms:
$$ \frac{\partial \boldsymbol x^T \boldsymbol x}{\partial \boldsymbol x} = 2 \boldsymbol x^T$$
$$ \frac{\partial \boldsymbol x^T \boldsymbol A \boldsymbol x}{\partial \boldsymbol x} = 2 \boldsymbol x^T \boldsymbol A,$$
if $\boldsymbol A$ is a symmetric matrix.
Linearity: Still holds…
Chain Rule: The chain rule is easy to generalize to the vector case.
For a composition of functions, i.e., $y = \boldsymbol f(\boldsymbol g( \boldsymbol x))$, we can again introduce an auxiliary variable $\boldsymbol u = \boldsymbol g( \boldsymbol x)$. The derivative of $y = f(\boldsymbol g( \boldsymbol x))$ is then given by
$$ \frac{\partial f(\boldsymbol g(\boldsymbol x)) }{\partial \boldsymbol x} = \frac{\partial f(\boldsymbol u)}{\partial \boldsymbol u} \frac{\partial \boldsymbol u}{\partial \boldsymbol x}.$$
Example: We want to compute the derivative of $$ (\boldsymbol B \boldsymbol x)^T(\boldsymbol B \boldsymbol x).$$ This function can be decomposed into $f(\boldsymbol u) = \boldsymbol u^T \boldsymbol u$ and $\boldsymbol u =\boldsymbol g( \boldsymbol x) = \boldsymbol B \boldsymbol x$, with $h(\boldsymbol x) = f(\boldsymbol g(\boldsymbol x))$.
Compute derivative of $f$: $$ \frac{\partial f(\boldsymbol u)}{\partial \boldsymbol u} = \frac{\partial (\boldsymbol u^T \boldsymbol u)}{\partial \boldsymbol u}= 2 \boldsymbol u^T.$$
Compute derivative of $\boldsymbol u$: $$ \frac{\partial \boldsymbol u}{\partial \boldsymbol x} = \frac{\partial (\boldsymbol B \boldsymbol x)}{\partial \boldsymbol x} = \boldsymbol B.$$
Final Result:
$$ \frac{\partial f(\boldsymbol g( \boldsymbol x)) }{\partial \boldsymbol x} = \frac{\partial f( \boldsymbol u)}{\partial \boldsymbol u} \frac{\partial \boldsymbol u}{\partial \boldsymbol x} = 2 \boldsymbol u^T \cdot \boldsymbol B = 2 \underbrace{\boldsymbol x^T \boldsymbol B^T}_{u^T} \boldsymbol B$$
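We can verify this result numerically with a small finite-difference check. The matrix and vector below (B_toy, x_toy) are hand-picked toy values, not part of the exercises:
In [ ]:
import numpy as np
B_toy = np.array([[1.0, 2.0],
                  [0.0, 3.0],
                  [1.0, 1.0]])
x_toy = np.array([0.5, -1.0])
f = lambda x: B_toy.dot(x).dot(B_toy.dot(x))       # f(x) = (Bx)^T (Bx)
grad_analytic = 2 * x_toy.dot(B_toy.T).dot(B_toy)  # 2 x^T B^T B
# central finite differences along each coordinate
eps = 1e-6
grad_numeric = np.array([(f(x_toy + eps * e) - f(x_toy - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
grad_analytic, grad_numeric   # both should agree up to numerical precision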
Exercise 7¶
Using the chain rule, compute the derivative of
$$E(\boldsymbol x) = (\boldsymbol a - 5\boldsymbol x)^T \boldsymbol A(\boldsymbol{a} - 5 \boldsymbol x).$$ Set the derivative to zero and compute the minimum.
Using similar code as in exercise 3, plot $E(\boldsymbol x)$ and $\frac{\partial E(\boldsymbol x)}{\partial \boldsymbol x}$ as 3D plots. For $x_0$ and $x_1$, use an interval of $[-5, 5]$ for the plot using $51$ partitions for each dimension. Confirm your finding of the minimum in the plot. Note that $\boldsymbol x$ is a $2 \times 1$ vector in this equation.
In [ ]:
A = np.array([[1, 0.5], [0.5, 1]])
a = np.array([[1], [0]])
# specify data points for x0 and x1 (from -5 to 5, using 51 uniformly distributed points)
x0Array = np.linspace(-5, 5, 51)
x1Array = np.linspace(-5, 5, 51)
# Put your code here…
Minimizing the SSE function¶
In order to compute the minimum of the function, we need to compute its derivative and set it to zero.
The SSE function is given by: $$SSE(\tilde{\boldsymbol{\beta}}) = (\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}})^T (\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}}).$$
We can again use the chain rule and introduce the auxiliary variable $\boldsymbol u = (\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}})$. The SSE is then given by
$$SSE(\tilde{\boldsymbol \beta}) = f(\boldsymbol g(\tilde{\boldsymbol \beta})),$$
with $f(\boldsymbol{u}) = \boldsymbol{u}^T\boldsymbol{u}$ and $\boldsymbol{u} = \boldsymbol{g}(\tilde{\boldsymbol{\beta}})= (\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}})$
Compute derivative of $f$: $$ \frac{\partial f(\boldsymbol u)}{\partial \boldsymbol u} = \frac{\partial (\boldsymbol u^T \boldsymbol u)}{\partial \boldsymbol u}= 2 \boldsymbol u^T.$$
Compute derivative of $\boldsymbol u$: $$ \frac{\partial \boldsymbol u}{\partial \tilde{\boldsymbol{\beta}}} = \frac{\partial (\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}}) }{\partial \tilde{\boldsymbol{\beta}}} = - \tilde{\boldsymbol{X}}.$$
Final Result:
$$ \frac{\partial f(\boldsymbol g( \tilde{\boldsymbol{\beta}})) }{\partial \tilde{\boldsymbol{\beta}} } = \frac{\partial f( \boldsymbol u)}{\partial \boldsymbol u} \frac{\partial \boldsymbol u}{\partial \tilde{\boldsymbol{\beta}}} = 2 \boldsymbol u^T \cdot (-\tilde{\boldsymbol{X}}) = - 2(\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}})^T \tilde{\boldsymbol{X}}. $$
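The same kind of finite-difference check as before also works for this gradient. The small data matrix, targets and betas below are made up purely for the check:
In [ ]:
import numpy as np
Xtilde_chk = np.array([[1.0, 0.0],
                       [1.0, 1.0],
                       [1.0, 2.0]])
y_chk = np.array([1.0, 2.0, 2.5])
beta_chk = np.array([0.5, 0.5])

def sse_chk(beta):
    e = y_chk - Xtilde_chk.dot(beta)
    return e.dot(e)

grad_analytic = -2 * (y_chk - Xtilde_chk.dot(beta_chk)).dot(Xtilde_chk)
eps = 1e-6
grad_numeric = np.array([(sse_chk(beta_chk + eps * e) - sse_chk(beta_chk - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
grad_analytic, grad_numeric   # should match closely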
Solving for the parameters¶
We now need to set the derivative to $0$, i.e.,
$$\frac{\partial SSE( \tilde{\boldsymbol{\beta}})}{\partial \tilde{\boldsymbol{\beta}}} = \boldsymbol 0^T.$$
Hence, the derivative w.r.t every dimension should be zero. Now, we solve for $\tilde{\boldsymbol{\beta}}$ using the following steps:
Cancel constant factors:
$$(\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}})^T \tilde{\boldsymbol{X}} = \boldsymbol 0^T. $$ Transpose on both sides (using $(\boldsymbol{A} \boldsymbol{B})^T = \boldsymbol{B}^T \boldsymbol{A}^T$):
$$\tilde{\boldsymbol{X}}^T(\boldsymbol{y} - \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}}) = \boldsymbol 0 $$ Multiply out the brackets:
$$\tilde{\boldsymbol{X}}^T\boldsymbol{y} - \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}} = \boldsymbol 0 $$ Bring $\tilde{\boldsymbol{X}}^T\boldsymbol{y}$ to the other side:
$$\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}\tilde{\boldsymbol{\beta}} = \tilde{\boldsymbol{X}}^T\boldsymbol{y}$$ Multiply both sides by the inverse of $\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}$ (multiply on the left)
$$\tilde{\boldsymbol{\beta}} = (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}})^{-1} \tilde{\boldsymbol{X}}^T\boldsymbol{y}$$
The least squares solution¶
The least squares solution is given by:
$$\tilde{\boldsymbol{\beta}} = (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}})^{-1} \tilde{\boldsymbol{X}}^T\boldsymbol{y}$$
We have now obtained the general least squares solution. Compare it with the linear regression algorithm from semester A.
Can you find similarities? Why should this one be preferred? Is the solution from Semester A applicable to multiple input variables?
Note: The term $(\tilde{\boldsymbol{X}}^T\tilde{\boldsymbol{X}})^{-1}\tilde{\boldsymbol{X}}^T$ is called the pseudo-inverse of the matrix $\tilde{\boldsymbol{X}}$. It is used to 'invert' non-square matrices that could not be inverted otherwise.
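As a small illustration (with a made-up data matrix, not our training data), the pseudo-inverse via numpy's linalg.pinv and the recommended linalg.solve route give the same betas:
In [ ]:
import numpy as np
import numpy.linalg as linalg
Xt_demo = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [1.0, 2.0],
                    [1.0, 3.0]])
y_demo = np.array([0.9, 2.1, 2.9, 4.2])
beta_solve = linalg.solve(Xt_demo.T.dot(Xt_demo), Xt_demo.T.dot(y_demo))  # numerically preferred
beta_pinv = linalg.pinv(Xt_demo).dot(y_demo)                              # explicit pseudo-inverse
beta_solve, beta_pinv   # both should be (almost) identical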
Exercise 8 – Implementation in python¶
We now want to implement the least squares solution given above in Python. First, construct the Xtilde matrix.
Subsequently, implement the least squares solution. Try to avoid the linalg.inv method and use the numerically more stable linalg.solve
method instead.
Plot again the training data, ground truth and prediction with the optimal least squares solution.
What is the SSE of the optimal solution?
Compare your solution to the solution we found in exercise 4 with the grid search. How much better do we do now?
In [ ]:
# Check whether we still have Xtilde (otherwise we need to rerun all scripts)
Xtilde
In [ ]:
import numpy.linalg as linalg
betaTilde = # put your code here (you can also define additional variables if you want…)
betaTilde
In [ ]:
# Lets plot our function
import matplotlib.pyplot as plt
Xtest = np.column_stack((np.ones(x_test.shape), x_test))
ytest_predicted = # Put your code here…
plt.figure()
plt.plot(x_test, y_test, 'g')
plt.plot(x_test, ytest_predicted, 'r')
plt.plot(x_train, y_train, 'bo')
plt.legend(('ground truth', 'prediction', 'training points'), loc='lower right')
plt.savefig('regression_LSS.png')
plt.show()
We can now compute the error on the training set
In [ ]:
error = # Put your code here
SSE = # Put your code here
SSE
In comparison to the hand-picked line parameters from above, our error is approximately halved. However, we can still see that the approximation is far from perfect. The ground truth cannot be represented by a line, so we will always have a rather large approximation error.
Optional: Polynomial Regression¶
Instead of fitting a line, we can also fit a polynomial. We want to fit $d$th order polynomials which are given by
$$\hat y = \beta_0 + \sum_{i = 1}^d \beta_i x^i$$
Note that, while $\hat y$ is now non-linear in $x$, it is still linear in the parameters $\beta_i$. Hence, we can still apply linear regression here!
Setting up the data matrix¶
We can still describe $\hat{y}$ as a scalar product, i.e.,
$$\hat{y} = \beta_0 + \sum_{i = 1}^d \beta_i x^i = \tilde{\boldsymbol{x}}^T\tilde{\boldsymbol{\beta}}, \textrm{ with } \tilde{\boldsymbol{x}} = \left[\begin{array}{c} 1 \\ x^1 \\ x^2 \\ \vdots \\ x^d \end{array}\right] \textrm{ and } \tilde{\boldsymbol{\beta}} = \left[\begin{array}{c} \beta_0 \\ \vdots \\ \beta_{d} \end{array}\right]$$
Setting up the data matrix in python¶
In python, we write a small function that does the feature expansion up to a certain degree for a given data set x.
In [ ]:
def getPolynomialDataMatrix(x, degree):
    Xtilde = np.ones(x.shape)
    for i in range(1, degree + 1):
        Xtilde = np.column_stack((Xtilde, x ** i))
    return Xtilde
print(getPolynomialDataMatrix(x_train, 4))
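As an optional cross-check, numpy's built-in Vandermonde matrix builds the same feature expansion; the following comparison should evaluate to True:
In [ ]:
np.allclose(getPolynomialDataMatrix(x_train, 4),
            np.vander(x_train, 4 + 1, increasing=True))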
Exercise 9: Fit Polynomials with different degrees¶
We now want to test different polynomials and see which one fits our data best. First, implement a function
that computes the optimal beta values given the input data x, output data y and the desired degree of the polynomial. Reuse the getPolynomialDataMatrix given above in your function.
In [ ]:
import numpy.linalg as linalg
def getBetaForPolynomialFit(x, y, degree):
    Xtilde = getPolynomialDataMatrix(x, degree)
    betaTilde = # Put your code here
    return betaTilde
Given your getBetaForPolynomialFit function, plot the fitted functions for polynomials of 1st, 2nd, 3rd and 4th degree.
Can we now fit the structure of the function better?
In [ ]:
# Lets plot our polynomials function
import matplotlib.pyplot as plt
plt.figure()
plt.plot(x_test, y_test, 'g')
plt.plot(x_train, y_train, 'bo')
ytest1 = # Put your code for 1st degree predictions
plt.plot(x_test, ytest1, 'r')
ytest2 = # Put your code for 2nd degree predictions
plt.plot(x_test, ytest2, 'g')
ytest3 = # Put your code for 3rd degree predictions
plt.plot(x_test, ytest3, 'm')
ytest4 = # Put your code for 4th degree predictions
plt.plot(x_test, ytest4, 'c')
plt.legend(('ground truth', 'training points', '$x$', '$x^2$', '$x^3$', '$x^4$'), loc='lower right')
plt.savefig('polynomial.png')
As we can see, the predictions for the 3rd and 4th order polynomials are already quite good. Let's see what happens if we increase
the order of the polynomial. Repeat the same plots for, for example, 7th, 10th and 12th order polynomials.
In [ ]:
# Lets plot our polynomials function
import matplotlib.pyplot as plt
plt.figure()
plt.plot(x_test, y_test, 'g')
plt.plot(x_train, y_train, 'bo')
ytest7 = # Put your code for 7th degree predictions
plt.plot(x_test, ytest7, 'r')
ytest10 = # Put your code for 10th degree predictions
plt.plot(x_test, ytest10, 'c')
ytest12 = # Put your code for 12th degree predictions
plt.plot(x_test, ytest12, 'm')
plt.ylim((-200, 200))
plt.legend(('ground truth', 'training points', '$x^{7}$', '$x^{10}$', '$x^{12}$'), loc='lower right')
plt.savefig('polynomial1.png')
We can clearly see that the prediction performance of our polynomials degrades. Why is that?
This effect is called overfitting. We have too many feature dimensions (parameters) and too little data to fit the polynomial reliably. Overfitting can have 2 bad effects:
We fit the noise in the data.
The function 'does what it wants' between the data points: it is underspecified what the function should do between the data points.
However, on the training data, the fit is actually almost perfect. That's another characteristic of overfitting!
Exercise 10: Evaluating the Models¶
We now want to evaluate the learned models to see which one works best. To do so, we compare polynomials of degree
1 to 11. Compute the squared error on the training and on the test set for each of these polynomials and plot them
as a function of the degree of the polynomial. Due to the huge differences in the error, use a log scale for the y-axis (see pyplot.semilogy). What are your observations?
In [ ]:
SSEtrain = np.zeros((11,1))
SSEtest = np.zeros((11,1))
# Feel free to use the functions getBetaForPolynomialFit and getPolynomialDataMatrix
for i in range(1,12):
    Xtildetrain = # Put your code here
    Xtildetest = # Put your code here
    betaTilde = # Put your code here
    SSEtrain[i - 1] = # Put your code here
    SSEtest[i - 1] = # Put your code here
In [ ]:
# Execute this…
import matplotlib.pyplot as plt
plt.figure();
plt.semilogy(range(1,12), SSEtrain)
plt.semilogy(range(1,12), SSEtest)
plt.legend(('SSE on training set', 'SSE on test set'))
plt.savefig('polynomial_evaluation.png')
plt.show()
Overfitting / Under-Fitting¶
We can see that, while the training error decreases with a larger degree of the polynomial, the test-set error
(which is the one we are actually interested in) increases significantly. This is typical behavior for overfitting.
We can draw the following conclusions:
In order to fit the data well, we have to find the right model complexity.
Over-fitting: The model complexity is too high (degree > 4). We fit the noise and not the underlying function.
Under-fitting: The model complexity is too low (degree < 3). We cannot represent the data well enough.
For choosing the optimal model, we always have to consider the error on an independent test set, not on the training set.
On the training set, the error can be arbitrarily good. This only indicates that the algorithm has learned the examples by heart, not that it can generalize to new samples.
Optional: Ridge Regression¶
A typical problem with overfitting is that the function gets too steep, with 'crazy' slopes.
An efficient way to prevent this is to force the weights $\beta_i$ to be small. We can achieve this by adding a quadratic
punishment term to the SSE function, i.e.,
$$\textrm{SSE}_{\textrm{reg}} = \sum_{i=1}^n (y_i - \hat y_i)^2 + \alpha \sum_{j = 0}^d \beta_j^2.$$ Minimizing this error function is also called ridge regression.
Ridge Regression in Matrix Form¶
We can again write the $\textrm{SSE}_{\textrm{reg}}$ function in matrix form, i.e.,
$$\textrm{SSE}_{\textrm{reg}} = (\boldsymbol y - \tilde{\boldsymbol X}\tilde{\boldsymbol \beta} )^T(\boldsymbol y - \tilde{\boldsymbol X}\tilde{\boldsymbol \beta} ) + \alpha \tilde{\boldsymbol \beta}^T \tilde{\boldsymbol \beta}$$
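As a quick sketch with made-up toy numbers (the names below are purely illustrative), the matrix form gives the same value as the element-wise sum above:
In [ ]:
import numpy as np
Xt_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([1.0, 2.0, 2.5])
beta_toy = np.array([0.5, 0.8])
alpha_toy = 0.1
e_toy = y_toy - Xt_toy.dot(beta_toy)
sse_reg_matrix = e_toy.dot(e_toy) + alpha_toy * beta_toy.dot(beta_toy)
sse_reg_sum = np.sum((y_toy - Xt_toy.dot(beta_toy)) ** 2) + alpha_toy * np.sum(beta_toy ** 2)
sse_reg_matrix, sse_reg_sum   # identical values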
Exercise 11 (Optional): Deriving ridge regression¶
Using the same transformations as for linear regression, derive the solution for the ridge regression error function
$\textrm{SSE}_{\textrm{reg}}$. You will need an additional trick, which is given by
$$\boldsymbol A \boldsymbol x + \alpha \boldsymbol x = (\boldsymbol A + \alpha \boldsymbol I) \boldsymbol x,$$
where $\boldsymbol{I}$ is the identity matrix. State your solution.
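The trick is easy to verify numerically (A_chk, x_chk and alpha_chk below are hand-picked values, for illustration only):
In [ ]:
import numpy as np
A_chk = np.array([[2.0, 1.0], [1.0, 3.0]])
x_chk = np.array([1.0, -2.0])
alpha_chk = 0.5
lhs = A_chk.dot(x_chk) + alpha_chk * x_chk
rhs = (A_chk + alpha_chk * np.eye(2)).dot(x_chk)
lhs, rhs   # both sides are equal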
Optimal solution for ridge regression¶
The optimal solution for ridge regression is given by
$$\tilde{\boldsymbol{\beta}} = (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} + \alpha \boldsymbol I)^{-1} \tilde{\boldsymbol{X}}^T\boldsymbol{y}.$$
Note that for $\alpha = 0$, we obtain the original least squares solution.
Note that almost all approaches use ridge regression instead of standard regression as it is much more stable.
Hence, we typically have to set an appropriate $\alpha$ value. Typical values range between $10^{-8}$ and $10^{-3}$.
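For orientation, here is a minimal sketch of the ridge solution on made-up toy data (variable names and values are only illustrative); as before, linalg.solve is used instead of an explicit inverse:
In [ ]:
import numpy as np
import numpy.linalg as linalg
Xt_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_demo = np.array([0.9, 2.1, 2.9, 4.2])
alpha_demo = 0.1
beta_ridge = linalg.solve(Xt_demo.T.dot(Xt_demo) + alpha_demo * np.eye(Xt_demo.shape[1]),
                          Xt_demo.T.dot(y_demo))
beta_ridge   # approaches the least squares solution as alpha goes to 0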
Exercise 12 (Optional): Evaluating Ridge Regression¶
We now want to evaluate ridge regression when using a high-degree polynomial. From before, we have seen that standard regression does not work very well in this case. Similar to exercise 9, first write a function getBetaForPolynomialRidgeFit which computes the optimal ridge regression solution. The identity matrix is obtained by the function call numpy.eye.
In [ ]:
import numpy.linalg as linalg
def getBetaForPolynomialRidgeFit(x, y, degree, alpha):
    Xtilde = getPolynomialDataMatrix(x, degree)
    betaTilde = # Put your code here ...
    return betaTilde
Plot the fitted function for $\alpha = 10^{-4}$, $\alpha = 10^{-1}$, $\alpha = 10^2$ and $\alpha = 10^5$ using
similar code as in exercise 9. Use a polynomial of degree 6.
In [ ]:
# Lets plot our polynomials function
import matplotlib.pyplot as plt
plt.figure()
plt.clf()
plt.plot(x_test,y_test, 'g')
plt.plot(x_train,y_train, 'bo')
degree = 6
y_test1 = # Put your code here for alpha = 10^-4
plt.plot(x_test, y_test1, 'r')
y_test2 = # Put your code here for alpha = 10^-1
plt.plot(x_test, y_test2, 'k')
y_test3 = # Put your code here for alpha = 10^2
plt.plot(x_test, y_test3, 'm')
y_test4 = # Put your code here for alpha = 10^5
plt.plot(x_test, y_test4, 'c')
plt.legend(('ground truth', 'training points', 'alpha = $10^{-4}$', 'alpha = $10^{-1}$', 'alpha = $10^2$', 'alpha = $10^5$'), loc='upper left')
plt.xlim((-5, 5))
plt.ylim((-200, 200))
plt.savefig('ridge_regression.png')
As we can see, the fit is poor for small $\alpha$ but gets much better for larger $\alpha$. We now want to evaluate different alphas and compute the error on the training set and on the test set. Use the $\alpha$ values below and a polynomial of degree 6.
In [ ]:
alphaArray = np.logspace(-4, 10, 10)
alphaArray
In [ ]:
SSEtrain_ridge = np.zeros((10,1))
SSEtest_ridge = np.zeros((10,1))
for i in range(0,10):
    Xtildetrain = # Put your code here...
    Xtildetest = # Put your code here...
    betaTilde = # Put your code here...
    SSEtrain_ridge[i] = # Put your code here...
    SSEtest_ridge[i] = # Put your code here...
Plot the training error and the test error for the different alphas. Use the function pyplot.loglog, which produces
a plot with a log-scale x-axis and a log-scale y-axis. This is necessary as the alpha values differ over several orders of magnitude.
In [ ]:
# Execute this...
import matplotlib.pyplot as plt
plt.figure();
plt.loglog(alphaArray, SSEtrain_ridge)
plt.loglog(alphaArray, SSEtest_ridge)
plt.legend(('SSE on training set', 'SSE on test set'))
plt.savefig('ridge_regression_evaluation.png')
plt.show()
We can again see a similar behavior as for the polynomial. Again, the error on the training set is very low. However, the error
on the test set is the one that really matters.
Here, the optimal value is around $10^6$.
For smaller $\alpha$ values, the beta-vector is not constrained enough and we again have an overfitting problem.
For larger $\alpha$ values, the beta-vector is too much constrained and we under-fit.
In [ ]: