
Live Coding Wk3 – Lecture 6 – Linear Regression¶

In this lecture, we introduced our first machine learning model: linear regression. Let's see how to use it with the sklearn Python package. We will also look at different ways we might try to improve the performance of this simple model.


### Imports and data you will need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as skl
import sklearn.metrics  # makes skl.metrics available for the evaluation cells below
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
%matplotlib inline

data = pd.read_csv('data/weight_loss.csv', index_col=0)

Let's look at the data first¶
Do you remember how to have a peek at the dataset?

# For you to do
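One possible answer (a minimal sketch using standard pandas methods):

data.head()      # first five rows
data.describe()  # summary statistics for each numeric column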

Remember, we should split the data into a training and testing subset.

train_data, test_data = train_test_split(data, train_size = 0.8, random_state=2) # Fix the random state so everyone gets the same split

It's also a good idea to plot the data when we can. Try finishing the plot below, with Days on the x-axis and Weight on the y-axis.

# Whats given
plt.figure(figsize=[12,9])
plt.scatter(train_data['Days'], train_data['Weight'], label='train')
plt.scatter(test_data['Days'], test_data['Weight'], label='test')

# For you to do
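# A possible completion (a sketch): label the axes and add a legend
plt.xlabel('Days')
plt.ylabel('Weight')
plt.title('Days vs Weight')
plt.legend()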

plt.show()

Discussion: Notice anything about the data points? Anything that would go against a linear model?

Discussion points here

Simple Linear Regression¶
Now that we have seen the data, let's try to fit a straight line to it. Our goal will be to use the [variable] value and [variable] value to predict the [variable]. As such, we are trying to find the best parameters for the straight-line equation
\begin{equation}
\hat{y}_{\text{Weight}} = \beta_{0} + \beta_{1} x_{\text{Day}}.
\end{equation}

So how do we do this in Python? Luckily, sklearn provides a linear model out of the box, so let's use that.

LinearRegression?

Init signature:
LinearRegression(
    fit_intercept=True,
    normalize=False,
    copy_X=True,
    n_jobs=None,
)
Docstring:
Ordinary least squares Linear Regression.

Parameters
———-
fit_intercept : boolean, optional, default True
whether to calculate the intercept for this model. If set
to False, no intercept will be used in calculations
(e.g. data is expected to be already centered).

normalize : boolean, optional, default False
This parameter is ignored when ``fit_intercept`` is set to False.
If True, the regressors X will be normalized before regression by
subtracting the mean and dividing by the l2-norm.
If you wish to standardize, please use
:class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on
an estimator with ``normalize=False``.

copy_X : boolean, optional, default True
If True, X will be copied; else, it may be overwritten.

n_jobs : int or None, optional (default=None)
The number of jobs to use for the computation. This will only provide
speedup for n_targets > 1 and sufficient large problems.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.

Attributes
———-
coef_ : array, shape (n_features, ) or (n_targets, n_features)
Estimated coefficients for the linear regression problem.
If multiple targets are passed during the fit (y 2D), this
is a 2D array of shape (n_targets, n_features), while if only
one target is passed, this is a 1D array of length n_features.

intercept_ : array
Independent term in the linear model.

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_ # doctest: +ELLIPSIS
>>> reg.predict(np.array([[3, 5]]))
array([16.])

From the implementation point of view, this is just plain Ordinary
Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.
File: ~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py
Type: ABCMeta
Subclasses:

# Let's fit the model
lm = LinearRegression()
# We need to do this as sklearn expects an array of feature vectors
train_input1 = np.array(train_data['Days']).reshape(-1, 1)

model1 = lm.fit(train_input1, train_data['Weight'])

# retrieve the slope and intercept of the line
# fit.coef_ returns an array - useful for when you have multiple features!
print('Coefficients:', model1.coef_)
print('Intercept:', model1.intercept_)

Coefficients: [-0.26980993]
Intercept: 175.0125736864023
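These two numbers are the fitted $ \beta_{1} $ (slope) and $ \beta_{0} $ (intercept) from the equation above. As a quick sanity check (a sketch; the day value 10 is arbitrary), a manual prediction using them should agree with model1.predict:

day = 10
manual = model1.intercept_ + model1.coef_[0] * day
print(manual, model1.predict(np.array([[day]]))[0])  # the two values should match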

Now we can calculate our predictions for the specific days. Check out the predict method for our model.

model1.predict?

Signature: model1.predict(X)
Docstring:
Predict using the linear model

Parameters
———-
X : array_like or sparse matrix, shape (n_samples, n_features)
Samples.

Returns
-------
C : array, shape (n_samples,)
Returns predicted values.
File: ~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py
Type: method

# Your code, remember to make sure the test input is in the correct dimension!
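# A possible solution (a sketch): reshape the test Days column into a column
# vector, just as we did for the training data, then predict
test_input1 = np.array(test_data['Days']).reshape(-1, 1)
pred_weight1 = model1.predict(test_input1)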

We can also see how our fit looks graphically. We just need to add the fitted line to our original plot.

# More Plots!
plt.figure(figsize=[12,9])

plt.scatter(test_data['Days'], test_data['Weight'])
plt.xlabel('Days')
plt.ylabel('Weight')
plt.title('Days vs Weight')

plt.plot(test_input1, pred_weight1, 'o-', c='r')

plt.show()

To evaluate quantitatively how well we do, we can use the $ R^{2} $ score and the Mean Squared Error (MSE).

# R-squared and MSE are two of many metrics used to evaluate models
print('R-squared:', skl.metrics.r2_score(test_data['Weight'], pred_weight1))
print('Mean Squared Error (MSE):', skl.metrics.mean_squared_error(test_data['Weight'], pred_weight1))

R-squared: 0.9360729569829845
Mean Squared Error (MSE): 35.00670805688385
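As a sanity check (a sketch), the MSE is just the mean of the squared residuals, so computing it directly with numpy should reproduce the number above:

residuals = np.array(test_data['Weight']) - pred_weight1
print(np.mean(residuals ** 2))  # should match the MSE printed above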

Discussion: So how well did we do? What do these numbers mean? [Note: weight ranges from 110 to 180]

Discussion points here

More than linear¶
Although lines are great, we can do a bit better. How about adding a quadratic component? We now want our regression equation to be
\begin{equation}
\hat{y}_{\text{Weight}} = \beta_{0} + \beta_{1} x_{\text{Day}} + \beta_{2} x_{\text{Day}}^{2}.
\end{equation}

As LinearRegression only fits straight lines in the features it is given, we need to add squared features to our current representation. Let's do this by adding the square of the Days value to the input when fitting.

train_sq_days = train_input1 ** 2
train_input2 = np.concatenate((train_input1, train_sq_days), axis=1)
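An equivalent way to build these inputs (a sketch; train_input2_alt is just an illustrative name) is sklearn's PolynomialFeatures transformer, which generates the powers for us:

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
train_input2_alt = poly.fit_transform(train_input1)  # columns: Days, Days squared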

Now we can fit a new 'linear' model using LinearRegression just like we did before.

lm = LinearRegression()
model2 = lm.fit(train_input2, train_data['Weight'])

# retrieve the slope and intercept of the line
# fit.coef_ returns an array - useful for when you have multiple features!
print('Coefficients:', model2.coef_)
print('Intercept:', model2.intercept_)

# Calculate the prediction
test_sq_days = test_input1 ** 2
test_input2 = np.concatenate((test_input1, test_sq_days), axis=1)
pred_weight2 = model2.predict(test_input2)

Coefficients: [-0.4695216 0.00080324]
Intercept: 183.3703612273582

And another plot of our fit:

# More Plots!
plt.figure(figsize=[12,9])

plt.scatter(test_data['Days'], test_data['Weight'])
plt.xlabel('Days')
plt.ylabel('Weight')
plt.title('Days vs Weight')

sorted_idx = np.argsort(test_input2[:, 0]) # This is just for plotting purposes

plt.plot(test_input2[sorted_idx, 0], pred_weight2[sorted_idx], 'o-', c='r')

plt.show()

And our evaluated $ R^{2} $ and MSE values:

# R-squared and MSE are two of many metrics used to evaluate models
print('R-squared:', skl.metrics.r2_score(test_data['Weight'], pred_weight2))
print('Mean Squared Error (MSE):', skl.metrics.mean_squared_error(test_data['Weight'], pred_weight2))

R-squared: 0.979974713023908
Mean Squared Error (MSE): 10.965928374643868

Discussion: Did we do better?

Discussion points here.

A better loss¶
Regardless of the regression equation we use, when we train our model we are trying to minimise the sum of squared errors as a loss function,
\begin{equation}
\mathcal{L}(x, y) = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
\end{equation}

However, sometimes when fitting our models with this loss function we obtain large weight values. That is, our $ \beta_{\cdot} $ values become very large.

Discussion: Why might this be problematic?

Discussion points here.

Instead of using this loss function, we can additionally penalise the size of our weights using regularization. One way to do this is to add a squared penalty to our loss function:
\begin{align}
\mathcal{L}(x, y)
&= \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} + \alpha \sum_{j}\beta_{j}^{2} \\
&= \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} + \alpha \Vert \mathbf{\beta} \Vert^{2}
\end{align}
Here $ \alpha \geq 0 $ determines how strongly we penalise the size of our weights.

This has a number of names when used for regression: $L2$ regularisation, Tikhonov regularization, or Ridge regression.

Let's try linear regression with this regularisation instead, using Ridge from sklearn. Let's also use the inputs with our squared Days values.

lm = Ridge(alpha=1.0)
model3 = lm.fit(train_input2, train_data['Weight'])

# retrieve the slope and intercept of the line
# fit.coef_ returns an array - useful for when you have multiple features!
print('Coefficients:', model3.coef_)
print('Intercept:', model3.intercept_)

pred_weight3 = model3.predict(test_input2)

Coefficients: [-0.46948965 0.00080312]
Intercept: 183.36884379027958
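If you want to pick $ \alpha $ rather than guessing 1.0, sklearn also provides RidgeCV, which selects it by cross-validation over a grid you supply (a sketch; lm_cv, model_cv and the alpha grid are illustrative choices):

from sklearn.linear_model import RidgeCV
lm_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
model_cv = lm_cv.fit(train_input2, train_data['Weight'])
print('Chosen alpha:', model_cv.alpha_)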

Let's see how all our models did now:

print('R-squared')
print('Base: {:.3f}'.format(skl.metrics.r2_score(test_data['Weight'], pred_weight1)))
print('NonLinear: {:.3f}'.format(skl.metrics.r2_score(test_data['Weight'], pred_weight2)))
print('NonLinear + L2 Reg.: {:.3f}'.format(skl.metrics.r2_score(test_data['Weight'], pred_weight3)))

Base: 0.936
NonLinear: 0.980
NonLinear + L2 Reg.: 0.980

print('MSE')
print('Base: {:.3f}'.format(skl.metrics.mean_squared_error(test_data['Weight'], pred_weight1)))
print('NonLinear: {:.3f}'.format(skl.metrics.mean_squared_error(test_data['Weight'], pred_weight2)))
print('NonLinear + L2 Reg.: {:.3f}'.format(skl.metrics.mean_squared_error(test_data['Weight'], pred_weight3)))

Base: 35.007
NonLinear: 10.966
NonLinear + L2 Reg.: 10.967

Discussion: Why do you think the L2 regularisation doesn’t improve performance?

Here’s another plot summarising everything we have done.

# More Plots!
plt.figure(figsize=[12,9])

plt.scatter(test_data['Days'], test_data['Weight'])
plt.xlabel('Days')
plt.ylabel('Weight')
plt.title('Days vs Weight')

sorted_idx = np.argsort(test_input2[:, 0]) # This is just for plotting purposes

plt.plot(test_input1, pred_weight1, 'o-', c='k', label='Base')
plt.plot(test_input2[sorted_idx, 0], pred_weight2[sorted_idx], 'o-', c='r', label='NonLinear')
plt.plot(test_input2[sorted_idx, 0], pred_weight3[sorted_idx], 'o--', c='y', label='NonLinear + L2 Reg.')

plt.legend()

plt.show()
