REGRESSION – CONCEPTS (PART 1)
Machine Learning for Financial Data
January 2021
Contents
◦ Supervised Learning
◦ Linear Regression
◦ Polynomial Regression
◦ Regularized Regression
◦ Time Series
Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. Each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
Supervised learning can be used in regression problems and classification problems
Supervised learning algorithms are provided with historical data and asked to find the relationship that has the best predictive power
Regression predicts a value along a continuous range of possible outcomes, while classification predicts the most probable category among a finite set of categories
In the context of finance, supervised learning models represent one of the most widely used classes of machine learning models.
Many algorithms that are widely applied in algorithmic trading rely on supervised learning models because they can be efficiently trained, they are relatively robust to noisy financial data, and they have strong links to the theory of finance.
Some machine learning models are used to solve both regression and classification problems
Linear Regression
Linear regression is a linear model that assumes a linear relationship between the input variables (x) and the single output variable (y). The goal of linear regression is to train a linear model to predict a new y given a previously unseen x with as little error as possible.
Hyperplane
◦ A linear function $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i$
◦ $\beta_0$ represents the intercept with the y-axis
◦ $\beta_1, \ldots, \beta_i$ are the coefficients of the regression
◦ The coefficients are estimated by minimizing the sum of the squared deviations between the observed y and the predicted y, the Residual Sum of Squares (RSS)
◦ $RSS = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{m} \beta_j x_{ij} \right)^2$
Residuals refer specifically to the differences between the observed values of the dependent variable and the estimates produced by the linear regression
RSS is a parabolic (convex) function of the coefficients, and the best hyperplane occurs at the bottom of the parabola
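A minimal sketch of this minimization with NumPy (synthetic data; the true intercept and slope below are illustrative assumptions) – np.linalg.lstsq returns the coefficients that minimize RSS:

# A minimal sketch: estimating the coefficients that minimize RSS on synthetic data
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=(n, 1))
y = 3.0 + 2.5 * x[:, 0] + rng.normal(0, 1, size=n)   # true intercept 3.0, slope 2.5

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), x])

# Least-squares solution minimizing RSS = ||y - X beta||^2
beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and slope:", beta)      # close to [3.0, 2.5]
print("residual sum of squares:", rss)   # RSS at the minimum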
Different measures of error are used with linear regression
Score | Formula | Remarks
Mean absolute error | $MAE = \frac{1}{n}\sum_{i=1}^{n} |predicted_i - actual_i|$ | Average of the absolute errors of all the data points.
Mean squared error | $MSE = \frac{1}{n}\sum_{i=1}^{n} (predicted_i - actual_i)^2 = \frac{RSS}{n}$ | Average of the squares of the errors of all the data points. A good practice is to keep the MSE low and the R2 score high.
Median absolute error | $MedAE = \mathrm{median}(|predicted_i - actual_i|)$ | The median of all the errors. Robust to outliers.
Explained variance score | $ExpVar = 1 - \frac{\mathrm{Variance}(actual_i - predicted_i)}{\mathrm{Variance}(actual_i)}$ | Measures how well the model can account for the variation in the dataset. A good practice is to keep the MSE low and the R2 score high.
R2 score (coefficient of determination) | $R^2 = 1 - \frac{\sum_{i=1}^{n}(predicted_i - actual_i)^2}{\sum_{i=1}^{n}\left(actual_i - \frac{1}{n}\sum_{i=1}^{n} actual_i\right)^2}$ | Measures how well the unknown sample will be predicted by the model. A score near 1 means that the model is able to predict the data very well.
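All of the measures above are available in scikit-learn. A minimal sketch, assuming two small illustrative arrays of actual and predicted values:

# A minimal sketch of computing the error measures above with scikit-learn;
# `actual` and `predicted` are illustrative arrays, not data from the slides
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, explained_variance_score,
                             r2_score)

actual = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
predicted = np.array([2.8, 5.4, 2.9, 6.5, 4.6])

print("MAE   :", mean_absolute_error(actual, predicted))
print("MSE   :", mean_squared_error(actual, predicted))
print("MedAE :", median_absolute_error(actual, predicted))
print("ExpVar:", explained_variance_score(actual, predicted))
print("R^2   :", r2_score(actual, predicted))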
Linear regression with Python (1)
# Import the relevant libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
# Load the Boston house prices dataset as an array of features and an array of targets
# Note: load_boston was removed in scikit-learn 1.2; this example requires an older version
features, target = load_boston(return_X_y=True)

# Keep only column 12 (LSTAT, % lower status of the population) as the single feature
features = features[:, 12:13]
# Show first 5 rows of features
features[0:5]
# Show first 5 rows of target
target[0:5]
Linear regression with Python (2)
# Create a linear regressor
regression = LinearRegression()
# Fit the linear regressor and return the model
model = regression.fit(features, target)
# Show the y-intercept of the regression line
model.intercept_
# Show the coefficients of the regression line – the slope
# Since only one feature is involved, the array stores only one coefficient
model.coef_
Linear regression with Python (3)
# Make predictions using the trained model
prediction = model.predict(features)
# Display the data points and the hyperplane
plt.figure(figsize=(12, 8))
plt.scatter(features, target, alpha=0.5, color='blue')
plt.plot(features, prediction, color='red')
plt.title('Home Price vs Lower-Status Population Percentage (LSTAT)')
plt.xlabel('% Lower Status of the Population by Town', fontsize=12)
plt.ylabel('Home Price in US$1000', fontsize=12)
plt.show()
Linear regression with Python (4) – the plot produced by the code above: the data points and the fitted regression line (hyperplane)
Linear regression with Python (5)
# Show the MSE of the hyperplane
print('Mean Squared Error: %.2f'
      % mean_squared_error(target, prediction))

# Show the R2 score of the hyperplane
# The score suggests that 54% of the variation in home price is explained by LSTAT
print('Coefficient of Determination (R^2 Score): %.2f'
      % r2_score(target, prediction))
R² Score
◦ It represents the proportion of the variance in a dependent variable that can be explained by an independent variable or variables
$R^2 = 1 - \frac{SS_{residual}}{SS_{total}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
◦ To what extent the variance of one variable explains the variance of the second variable
◦ If R2 of a model is 0.50, approximately half of the observed variation can be explained by the model’s inputs
◦ It determines how well data will fit the regression model
The R² score measures the change in the variation of the dependent variable after the inclusion of an independent variable
[Figure: two scatter plots comparing the variation of y around its mean, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$, with the variation of y around the regression line, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$]
x is an independent variable that determines the value of the dependent variable y; the variation of y around its mean is the variation of the dependent variable without considering any independent variable
$R^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
R² represents the percentage change of the variation with the introduction of the variable x
Polynomial Regression
Polynomial regression should be applied where the relationship is curvilinear
▪ The linear model (dotted line) is not the best fit to the data points; the polynomial regression model (solid line) provides a better fit
▪ Polynomial combinations of the same or different features – feature interaction – can return new features that capture the curviness
▪ A simple & easy way to model curves, without needing to create big non-linear models
Polynomial Regression
With polynomial regression, the prediction model may have independent variables appearing in degrees equal to or greater than two to fit the data with a curved hyperplane. Polynomial regression is usually used when the relationship between variables looks curved.
As the curviness (degree) of the model increases, it fits the training data more closely
Polynomial regression with Python (1)
# Import the relevant libraries
from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the Boston house prices dataset as an array of features and an array of targets
features, target = load_boston(return_X_y=True)
features = features[:, 12:13]

# Construct new features via feature interaction
interaction = PolynomialFeatures(degree=2, include_bias=False,
                                 interaction_only=False)
features_interaction = interaction.fit_transform(features)
Polynomial regression with Python (2)
# Create a linear regressor
regression = LinearRegression()

# Fit the linear regressor and return the model
model = regression.fit(features_interaction, target)

# Make predictions using the new model
prediction = model.predict(features_interaction)

# Show the MSE and R-squared score of the hyperplane
print('Mean Squared Error: %.2f'
      % mean_squared_error(target, prediction))
print('Coefficient of Determination (R^2 Score): %.2f'
      % r2_score(target, prediction))
Regularized Regression
Linear regression attempts to remain unbiased but leads to greater variance and suboptimal prediction accuracy
[Figure: two scatter plots with fitted regression lines contrasting an unbiased fit disrupted by outliers with an optimal, more biased fit]
▪ Linear regression attempts to remain unbiased by taking every single data point into consideration
▪ Linear regression is sensitive to outliers – the outliers tend to contribute a lot to the overall error and disrupt the entire model, making it suboptimal (i.e. less accurate)
▪ The optimal model has less variance at the expense of higher bias
Linear regression with poorly selected coefficients may result in overfitting that hurts model robustness
▪ A higher-degree model and large coefficients increase the variance significantly (while keeping bias low), leading to overfitting
▪ The optimal model needs to be robust, meaning it predicts well on the training data as well as on testing data (data the model has not seen during training)
▪ The overfitting during training is what prevents the model from being robust
Regularization
▪ To avoid overfitting and increase model robustness, model training needs to be regularized (constrained)
▪ Shrink the coefficients (weights) of model features
▪ Get rid of high-degree polynomial features
▪ Linear models are typically regularized by constraining the weights
▪ Polynomial models are regularized by reducing the polynomial degrees
▪ A penalty term together with a regularization hyperparameter (λ) regulates the strength of the constraint and hence the amount of bias introduced into the model
L2 Regularization / Ridge Regression
$\text{Cost Function} = RSS + \lambda \cdot \sum_{j=1}^{p} \beta_j^2$
Ridge regression adds a factor of the sum of the square of coefficients to the (RSS) cost function for linear regression. It shrinks the coefficients and helps reduce variance by introducing bias. Ridge regression can shrink the coefficients asymptotically close to 0, and therefore cannot reduce the number of variables. When sample sizes are relatively small, it can improve predictions made from new data by making the predictions less sensitive to the training data.
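A minimal sketch of L2/ridge regression with scikit-learn follows; the synthetic dataset and the standardization step are assumptions added for illustration, and the alpha argument plays the role of the regularization parameter λ above.

# A minimal sketch of L2/ridge regression with scikit-learn; alpha plays the
# role of lambda. Standardizing the features first is an assumption commonly
# made so the penalty treats all features equally.
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

features, target = make_regression(n_samples=200, n_features=5,
                                    noise=10.0, random_state=0)
features_standardized = StandardScaler().fit_transform(features)

ridge = Ridge(alpha=0.5)
model = ridge.fit(features_standardized, target)
print(model.coef_)   # coefficients are shrunk towards (but not exactly to) zero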
L2 regularization constrains the determination of coefficients
[Figure: ridge-regularized predictions for several values of λ]
◦ A plain Ridge model leads to linear predictions
◦ Polynomial regression with L2/ridge regularization flattens the predictions and reduces variance while increasing bias
L1 Regularization / Lasso Regression
$\text{Cost Function} = RSS + \lambda \cdot \sum_{j=1}^{p} |\beta_j|$
Lasso regression adds a factor of the sum of the absolute value of coefficients to the (RSS) cost function for linear regression. The larger the value of the regularization parameter λ, the more coefficients are shrunk towards zero, even all the way to 0. Effectively, it makes predictions with new data less sensitive to the training dataset. Lasso regression not only helps in reducing overfitting, but also can help in feature selection.
[Figure: lasso-regularized predictions for several values of λ]
◦ A plain Lasso model leads to linear predictions
◦ L1/lasso regression tends to eliminate the weights of the least important features – effectively performing feature selection
Lasso regression with Python (1)
# Import the relevant libraries
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
# Load the Boston house prices dataset as an array of features and an array of target
boston = load_boston()
features = boston.data
target = boston.target
# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
Lasso regression with Python (2)
# Create a lasso regressor
# Set the regularization parameter using the alpha value
regression = Lasso(alpha=0.5)

# Fit the lasso regressor
model = regression.fit(features_standardized, target)

# View the coefficients
model.coef_
Lasso regression with Python (3)
# Setting alpha to a high value results in none of the features being used
regression_a10 = Lasso(alpha=10)
model_a10 = regression_a10.fit(features_standardized, target)
model_a10.coef_
The practical benefit of this effect is that it means that we could include 100 features in our feature matrix and then, through adjusting lasso’s α hyperparameter, produce a model that uses only 10 (for instance) of the most important features. This lets us reduce variance while improving the interpretability of our model (since fewer features is easier to explain).
Combinations of β1 and β2 that produce the same error can be represented using colored contours in a 2D space
[Figure: the loss surface and its 2D contour representation]
◦ Gradient descent represented as a 3D surface in the space capturing β1, β2 and the error J(β), with the global minimum error at the bottom of the surface
◦ Turning the 3D representation into a 2D representation by eliminating the error dimension
◦ Gradient descent represented as contours in the 2D space capturing β1 and β2, with the global minimum error marked
The (increase of the) penalty term shifts the non-regularized loss function "bowl" upwards and its minimum towards the origin
[Figure: contours of the regularized loss function, with low loss shown in blue]
The contours and the penalty become denser and heavier as the value of λ increases

Two forces are at work when determining the minimum loss for the regularized model
▪ Let's use a simplified situation in which the model uses a hard constraint
▪ The hard constraint is $\beta_1^2 + \beta_2^2 < 1^2$ and is represented by the cross-section of the cylinder
▪ The height of the cylinder corresponds to λ
▪ The minimum loss is the point where the non-regularized loss "bowl" meets the cylinder
▪ The larger the value of λ, the less the bowl contributes to the loss, thus pushing it up
[Figure: the loss surface J(β) over the (β1, β2) plane intersected by the constraint cylinder – annotations: the loss contours of different combinations of β1 and β2; the (β1, β2) resulting in minimum loss; the ridge regression coefficient]
L1 encourages zero coefficients but not L2
▪ L1 regularization encourages zero coefficients but not L2 regularization
▪ For L1, the optimal point is at the diamond tip, any movement away from this point increases the loss
▪ For L2, the optimal point is non-zero and not on the axis but can be very close to the axis
L1 regularization encourages zero coefficients
green: a loss function minimum with a zero regularized coefficient
blue: a loss function minimum with a non-zero regularized coefficient
To strike a balance between Ridge and Lasso’s penalty functions, Elastic Net can be used
▪ Both L2/ridge and L1/lasso regression can penalize large or complex models by including coefficient values in the loss function that is to be minimized
▪ As a very general rule of thumb, L2/ridge regularization often produces slightly better predictions than L1/lasso regularization, but L1/lasso regularization produces more interpretable models
▪ To strike a balance between L2/ridge and L1/lasso’s penalty functions, Elastic Net can be used, which is simply a regression model with both penalties included
Elastic Net

$\text{Cost Function} = RSS + \lambda \cdot \left[ (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right]$

Elastic Nets add regularization terms to the model that are a combination of both L1 and L2 regularization. In addition to setting and choosing a λ value, an elastic net also allows us to tune the α parameter, where α = 0 corresponds to ridge and α = 1 to lasso. Therefore, we can choose an α value between 0 and 1 to optimize the elastic net. Effectively, this will shrink some coefficients and set some to 0 for sparse selection.
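A minimal sketch of an Elastic Net with scikit-learn follows; note that scikit-learn parameterizes the penalty slightly differently, with alpha acting as the overall strength (λ) and l1_ratio as the mixing parameter between the L1 and L2 terms. The synthetic data is an illustrative assumption.

# A minimal sketch of Elastic Net with scikit-learn on synthetic data;
# alpha is the overall regularization strength, l1_ratio the L1/L2 mix
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

features, target = make_regression(n_samples=200, n_features=10,
                                    n_informative=3, noise=5.0, random_state=0)

elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)
model = elastic_net.fit(features, target)
print(model.coef_)   # some coefficients are shrunk, some set exactly to zero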
So when to use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net?
▪ It is almost always preferable to have at least a little bit of regularization, so generally plain Linear Regression should be avoided
▪ Ridge is a good default
▪ If it is suspected that only a few features are useful, Lasso or Elastic Net would be preferred because they tend to reduce the useless features’ weights down to zero
▪ In general, Elastic Net is preferred over Lasso because Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated
A very different way to regularize is to stop training as soon as the validation error reaches a minimum – early stopping
▪ As the epochs go by, the algorithm learns and its prediction error on the training set goes down, along with its prediction error on the validation set
▪ After a while though, the validation error stops decreasing and starts to go back up
▪ This indicates that the model has started to overfit the training dataset
▪ With early stopping, training is stopped as soon as the validation error reaches its minimum, as sketched below
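A minimal sketch of early stopping, assuming an SGDRegressor trained incrementally with warm_start so the validation error can be tracked by hand; the dataset and hyperparameters are illustrative. scikit-learn's SGDRegressor also offers a built-in early_stopping=True option that holds out a validation fraction automatically.

# A minimal sketch of early stopping: train one epoch at a time, track the
# validation error, and keep the model with the lowest validation error
from copy import deepcopy
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=20, noise=15.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                   learning_rate='constant', eta0=0.0005)

best_error, best_model = float('inf'), None
for epoch in range(1000):
    sgd.fit(X_train, y_train)                  # continues where it left off
    error = mean_squared_error(y_valid, sgd.predict(X_valid))
    if error < best_error:                     # keep the best model seen so far
        best_error, best_model = error, deepcopy(sgd)

print('best validation MSE:', best_error)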
Time Series
Besides the historical stock price, other features are generally useful for stock price prediction
Correlated Assets
▪ An organization depends on and interacts with many external factors, including its competitors, clients, the global economy, the geopolitical situation, fiscal and monetary policies, access to capital, and so on
▪ Hence, its stock price may be correlated not only with the stock price of other companies but also with other assets such as commodities, FX, broad-based indices, or even fixed income securities
Technical Indicators
▪ A lot of investors follow technical indicators
▪ Moving average, exponential moving average, and momentum are the most popular indicators
Fundamental Analysis
▪ Two primary data sources to glean features that can be used in fundamental analysis
▪ Performance reports
▪ Annual and quarterly reports of companies can be used to extract or determine key metrics, such as ROE (Return on Equity) and P/E (Price-to-Earnings)
▪ News
▪ News can indicate upcoming events that can potentially move the stock price in a certain direction.
Time Series
A time series is a sequence of numbers that are ordered by a time index. It can be broken down into a trend component (deterministic or stochastic), a seasonal component (representing seasonal or cyclical variation), and a residual component.
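A minimal sketch of such a decomposition, assuming statsmodels' seasonal_decompose (a reasonably recent version that accepts the period argument) and a synthetic monthly series:

# A minimal sketch of decomposing a synthetic monthly series into trend,
# seasonal and residual components with statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
index = pd.date_range('2015-01-01', periods=96, freq='M')
values = (0.5 * np.arange(96)                              # trend
          + 10 * np.sin(2 * np.pi * np.arange(96) / 12)    # yearly seasonality
          + rng.normal(0, 2, 96))                          # noise
series = pd.Series(values, index=index)

result = seasonal_decompose(series, model='additive', period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())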
A time series can be broken down into trend, seasonal, and residual components
Trend Component
▪ A consistent directional movement
▪ Either deterministic or stochastic
▪ The former provides an underlying rationale for the trend
▪ The latter is a random feature of a series
▪ Trends often appear in financial series, and many trading models use sophisticated trend identification algorithms
Seasonal Component
▪ Many time series contain seasonal variation
▪ This is particularly true in series representing business sales or climate levels
▪ In quantitative finance we often see seasonal variation, particularly in series related to holiday seasons or annual temperature variation (such as natural gas)
Residual Component
▪ The residual component is what is left over when the seasonal and trend components have been subtracted from the data
Autocorrelation
$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t$
There are many situations in which consecutive elements of a time series exhibit correlation - the behavior of sequential points in the series affect each other in a dependent manner. Autocorrelation is the similarity between observations as a function of the time lag between them. Such relationships can be modeled using an autoregression model. The term autoregression indicates that it is a regression of the variable against itself.
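A minimal sketch of fitting such an autoregression, assuming statsmodels' AutoReg class and a simulated AR(2) series so the true lag coefficients are known:

# A minimal sketch of fitting an AR(2) model to a simulated series
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()   # AR(2) process

model = AutoReg(y, lags=2).fit()
print(model.params)                                   # intercept and lag coefficients
print(model.predict(start=len(y), end=len(y) + 4))    # forecast 5 steps ahead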
Most statistical models require the time series to be stationary to make effective and precise predictions
▪ A time series is said to be stationary if its statistical properties (mean, variance, covariance) do not change over time
▪ Non-stationary series, as a rule, are unpredictable and cannot be modeled or forecasted
▪ A white noise (𝜖) series is considered stationary and has a mean of zero
▪ A stationary series must have a constant mean, variance & covariance
▪ Non-stationary series are converted to a stationary series before using time series forecast models
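One common way to check stationarity (an assumption beyond what the slide shows) is the Augmented Dickey-Fuller test from statsmodels; a minimal sketch on a simulated random walk:

# A minimal sketch of the ADF stationarity test: a random walk is
# non-stationary, while its first difference is stationary
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))        # non-stationary series

adf_stat, p_value, *_ = adfuller(random_walk)
print('random walk p-value:', p_value)               # large: cannot reject non-stationarity

adf_stat, p_value, *_ = adfuller(np.diff(random_walk))
print('differenced p-value:', p_value)               # small: stationary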
Generally speaking, forecasting models cannot be applied directly to pricing time series data
▪ On the left is the stock price data for Microsoft over a period of 4 years
▪ The time series is obviously non-stationary, so it cannot be used directly with forecasting models
▪ Mean and variance are not constant
▪ The question therefore is how to transform this non-stationary time series into a stationary time series that we can make predictions with
Differencing
$y'_t = y_t - y_{t-1}$
◦ Using non-stationary time series data in financial models produces unreliable and spurious results and leads to poor understanding and forecasting
◦ Differencing computes the difference of consecutive terms in a time series
◦ It can help stabilise the mean of a time series by removing changes in the level and therefore eliminating (or reducing) trend and seasonality
◦ The disadvantage of differencing is that it loses one observation each time the difference is taken
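A minimal sketch of differencing with pandas, using an illustrative price series rather than the Microsoft data referred to earlier:

# A minimal sketch of differencing an illustrative price series with pandas
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.5, 103.0, 106.0, 105.0])
differenced = prices.diff()      # y'_t = y_t - y_{t-1}
print(differenced)
# The first value is NaN - one observation is lost each time the difference
# is taken, as noted above
differenced = differenced.dropna()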
Time series data need to be reorganized before using supervised learning models
Original time series:

Time Step | Value
1 | 10
2 | 11
3 | 18
4 | 15
5 | 20

Reorganized as a supervised learning problem (X is the value at the previous time step, Y is the value to predict):

Time Step | X | Y
1 | – | 10
2 | 10 | 11
3 | 11 | 18
4 | 18 | 15
5 | 15 | 20
6 | 20 | –

Dropping the rows with missing values reduces the data from 5 to 4 usable data points.
In Python, the main function to help transform time series data into a supervised learning problem is the shift() function
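A minimal sketch of this reorganization with shift(), reproducing the small table above:

# A minimal sketch of turning the series above into a supervised learning
# problem with pandas' shift(): X is the previous value, Y the value to predict
import pandas as pd

values = pd.Series([10, 11, 18, 15, 20], index=[1, 2, 3, 4, 5])
frame = pd.DataFrame({'X': values.shift(1), 'Y': values})
print(frame)                   # the first row has a missing X, as in the table above

supervised = frame.dropna()    # from 5 to 4 usable data points
print(supervised)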
Traditional time series statistical models work primarily with linear functions and do not tolerate corrupt or missing data
▪ The traditional time series models such as ARIMA are well understood and effective on many problems
▪ However, these traditional methods also suffer from several limitations
▪ They are linear functions, or simple transformations of linear functions, and they require manually diagnosed parameters, such as time dependence, and do not perform well with corrupt or missing data
▪ Recurrent neural networks (RNNs) have gained increasing attention in recent years
▪ These methods can identify structure and patterns such as nonlinearity, can seamlessly model problems with multiple input variables, and are relatively robust to missing data
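As a hedged illustration only (Keras is not covered on these slides), a minimal sketch of a small LSTM recurrent network for one-step forecasting on a synthetic series; the windowing and layer sizes are assumptions:

# A minimal sketch of an LSTM for one-step forecasting on a synthetic series
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
series = np.sin(np.arange(400) / 10.0) + rng.normal(0, 0.1, 400)

# Reorganize the series into supervised samples: 20 past values -> next value
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                      # shape (samples, time steps, 1)

model = keras.Sequential([
    keras.layers.LSTM(16, input_shape=(window, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:]))                # forecast the next value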
Conclusion
Regression should be deployed with regularization
1. Feature Data Types – Numeric data. Variable encoding is therefore necessary for categorical data. Normalised data is advised. For time series data, differencing is performed to make the time series stationary, and a number of differences (referred to as the order of integration) may be performed depending on the lag time. The train and test split should be done on sequential samples.
2. Target Data Types – Numeric data.
3. Key Principles – Linear regression uses a linear function to estimate the data points. Polynomial regression relies on feature interaction to different degrees to derive a polynomial function to estimate the data points. Regularization constrains the coefficients of the model functions using a loss function together with a penalty term. Lasso regression may induce some coefficients to zero, effectively performing feature selection. Elastic Net uses both the ridge and lasso regression penalty terms.
4. Hyperparameters – The regularization parameter specifies the influence of the penalty term. When its value is very large, the regularization effect dominates the squared-loss function and the coefficients shrink towards or to zero. When the regularization parameter tends towards zero, the regularized loss function tends towards the ordinary least-squares loss and the coefficients can exhibit big oscillations.
5. Data Assumptions – Non-parametric – no assumption about data distribution. All data are used.
6. Performance – N/A
7. Accuracy – Ridge regularization performs better.
8. Explainability – N/A
References
"Hands-On Machine Learning with Scikit-Learn and TensorFlow", Aurelien Geron, O'Reilly Media, Inc., 2017
"Regression Analysis with Python", Luca Massaron & Alberto Boschetti, Packt Publishing, 2016
"Machine Learning and Data Science Blueprints for Finance", Hariom Tatsat, Sahil Puri & Brad Lookabaugh, O'Reilly Media, Inc., 2020
References
▪ "LinearRegression"(https://en.wikipedia.org/wiki/Linear_regression)
▪ "IntroductiontoLinearRegression",DavidM.Lane(http://onlinestatbook.com/2/regression/intro.html)
▪ "A Deep Dive into Regularization", Divakar Kapil, 2018 (https://medium.com/uwaterloo-voice/a-deep-dive-into- regularization-eec8ab648bce)
▪ "FromLinearRegressiontoRidgeRegression,theLasso,andtheElasticNet",RobbySneiderman,2020 (https://towardsdatascience.com/from-linear-regression-to-ridge-regression-the-lasso-and-the-elastic-net-4eaecaf5f7e6)
▪ "A Visual Explanation for Regularization of Linear Models", Terence Parr, (https://explained.ai/regularization/index.html)
THANK YOU