REGRESSION – CONCEPTS (PART 2)
Machine Learning for Financial Data
February 2021
Contents
◦ Support Vector Machine (SVM)
◦ Support Vector Regression (SVR)
◦ Hyperparameter Optimization
◦ K-fold Cross Validation
Support Vector Machine (SVM)
SVM is a powerful and versatile ML model, capable of performing linear or non-linear classification, regression, and even outlier detection. It is one of the more complex but accurate families of models, making it one of the most popular models in ML despite being a black-box technique. SVMs are particularly well suited to the classification of complex, small- or medium-sized datasets.
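A minimal sketch, using scikit-learn estimators and illustrative toy data, of the three task types just mentioned:

# Classification, regression and outlier detection with the SVM family (toy data)
import numpy as np
from sklearn.svm import SVC, SVR, OneClassSVM

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                                     # 100 samples, 2 features
y_class = (X[:, 0] + X[:, 1] > 1).astype(int)            # toy binary labels
y_reg = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)    # toy continuous target

clf = SVC(kernel='rbf').fit(X, y_class)                  # classification
reg = SVR(kernel='rbf').fit(X, y_reg)                    # regression
det = OneClassSVM(kernel='rbf').fit(X)                   # outlier detection
print(clf.predict(X[:3]), reg.predict(X[:3]), det.predict(X[:3]))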
Advantages
◦ Effective in high dimensional spaces
◦ Still effective in cases where the number of features is greater than the number of samples
◦ Uses only a subset of the training points in the decision function (the support vectors), so it is also memory efficient
◦ Versatile as different kernel functions, including customised kernel functions, can be specified for the decision function
Disadvantages
◦ If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial
◦ SVMs do not directly provide probability estimates
Support Vector Regression (SVR)
Like ordinary regression, SVR's goal is to discover a hyperplane that minimizes error while obtaining a minimum-margin interval that contains the maximum number of data points. The key difference lies in the cost function. The cost function of ordinary regression considers all data points in the dataset and uses regularization to introduce bias and constrain complexity, whereas the cost function of SVR considers only a subset of the training dataset: data points that fall inside the margin are excluded from the cost calculation.
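A minimal sketch of this ε-insensitive cost idea, with an illustrative helper function and toy values: residuals that stay within the margin of width ε contribute zero cost.

# Epsilon-insensitive loss: zero inside the epsilon tube, linear beyond it
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - epsilon)   # points inside the tube cost nothing

y_true = np.array([1.00, 1.05, 1.50])
y_pred = np.array([1.02, 1.00, 1.00])
print(epsilon_insensitive_loss(y_true, y_pred))  # [0.  0.  0.4] - only the last point is penalised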
SVR finds the best hyperplane with the maximum number of data points captured within the margin
The accompanying figure labels the decision hyperplane, the decision boundaries on either side of it, the support vectors, and the surrounding data points.
The cost covers only data points beyond the decision boundaries at +𝜀 and −𝜀 distance from the decision hyperplane. The decision boundaries reflect the tolerance level and lie at 𝜀 distance from the decision hyperplane. Error (e.g. ξ and ξ∗) is measured only for data points that fall outside the margin.
Kernel functions transform data points into a higher dimensional feature space to make them linearly separable
Radial Basis Function (RBF)
φγ(x, l) = exp(−γ ‖x − l‖²)
where ‖x − l‖ is the distance of point x from landmark l, and γ = 1/(2σ²) is a shaping parameter
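A minimal sketch of the RBF computation, with illustrative values chosen for the landmark l and for γ:

# RBF similarity between a point x and a landmark l
import numpy as np

def rbf(x, l, gamma):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(l)) ** 2))

print(rbf(x=[1.0], l=[1.0], gamma=0.3))   # 1.0   - x sits exactly on the landmark
print(rbf(x=[4.0], l=[1.0], gamma=0.3))   # ~0.067 - x is far from the landmark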
Radial Basis Function (RBF) introduces a new feature having values in (0,1)
x₂ is a new feature obtained by applying φγ(x, l₁) over the existing data points
x₃ is a new feature obtained by applying φγ(x, l₂) over the existing data points (see the sketch below)
▪ The RBF is a bell-shaped function measuring the similarity between a landmark point (i.e. l) and any existing data point (e.g. x)
▪ φγ(x, l) ≈ 0 indicates the data point x is far from the landmark point l
▪ φγ(x, l) = 1 indicates the data point x is at the landmark point l
▪ γ is a hyperparameter and can be perceived as the inverse of the radius of influence of the data points selected by the model as support vectors
▪ It can also be perceived as deciding how much curvature we want in a decision boundary (a high γ means more curvature)
The illustration shows the original input x₁ in a 1D feature space together with the two landmarks l₁ and l₂.
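A minimal sketch, with illustrative landmark positions and γ, of lifting the 1D input x₁ into the new features x₂ and x₃:

# Transform a 1D feature into a 2D RBF feature space using two landmarks
import numpy as np

x1 = np.linspace(-4, 4, 9).reshape(-1, 1)   # original 1D feature
l1, l2 = -2.0, 2.0                           # assumed landmark positions
gamma = 0.3

x2 = np.exp(-gamma * (x1 - l1) ** 2)         # similarity to landmark l1, values in (0, 1]
x3 = np.exp(-gamma * (x1 - l2) ** 2)         # similarity to landmark l2, values in (0, 1]
X_new = np.hstack([x2, x3])                  # the transformed 2D feature space
print(X_new.round(3))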
Some data points will end up outside the decision boundaries introduced by RBF
Samples falling between the boundary lines incur no cost (i.e. loss is 0)
The strength of the regularization is inversely proportional to the regularization hyperparameter C
▪ C is a hyperparameter for SVR
▪ A low value applies stronger regularization: the model tolerates more training error but keeps its complexity down
▪ A high value applies weaker regularization: the model fits the training data more closely but is more likely to overfit
▪ Reducing C therefore regularizes the model and helps avoid overfitting
Epsilon 𝜀 specifies the epsilon-tube within which no penalty is associated in the training loss function
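A minimal sketch, on synthetic data with illustrative settings, of how widening the ε-tube leaves more points inside it and therefore yields fewer support vectors:

# Wider epsilon tube -> fewer points outside the tube -> fewer support vectors
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(3 * rng.rand(60, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(60)

for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel='rbf', C=1.0, epsilon=eps).fit(X, y)
    print('epsilon=%.2f -> %d support vectors' % (eps, len(svr.support_)))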
SVR with Python (1)
# Import the relevant libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
# Generate sample data
X = np.sort(3 * np.random.rand(60, 1), axis=0)
y = np.sin(X).ravel()
# Add noise to targets
y[::3] += 3 * (0.6 - np.random.rand(20))
SVR with Python (2)
# Create regression models
svr_rbf1 = SVR(kernel='rbf', C=0.1, gamma=0.1, epsilon=0.1)
svr_rbf2 = SVR(kernel='rbf', C=1, gamma=0.1, epsilon=0.1)
svr_rbf3 = SVR(kernel='rbf', C=10, gamma=0.1, epsilon=0.1)
# Specify parameters to use for visualization
svrs = [svr_rbf1, svr_rbf2, svr_rbf3]
kernel_label = ['RBF (C=0.1)', 'RBF (C=1)', 'RBF (C=10)']
model_color = ['c', 'g', 'm']
SVR with Python (3)
# Display the 3 models, one after another, with the MSE
for ix, svr in enumerate(svrs):
    plt.figure(figsize=(12, 8))
    plt.ylabel('y')
    plt.xlabel('x')
    plt.title('Support Vector Regression')
    plt.plot(X, svr.fit(X, y).predict(X), color=model_color[ix],
             label='{} model'.format(kernel_label[ix]))
    plt.scatter(X, y, facecolor="none", edgecolor=model_color[ix], s=50,
                label='{} actual'.format(kernel_label[ix]))
    plt.legend()
    plt.show()
    print('MSE %0.3f' % mean_squared_error(y, svr.predict(X)))
SVR with Python (4)
Hyperparameter Optimization
The technique of identifying an ideal set of hyperparameters for a learning algorithm (e.g. C, γ and ε for SVR) that provides the optimum performance. The search iteratively trains and evaluates models over a pre-defined set of hyperparameter values to learn which combination gives the better performance.
Grid Search
◦ Take each hyperparameter of interest and select a set of values
∙ e.g. the epsilon 𝜀 hyperparameter in an SVR model taking the values 0.1, 0.3 and 0.5, and the gamma 𝛾 hyperparameter taking the values 0.001 and 0.0001
◦ Train a model for every combination of the potential hyperparameter values
◦ Identify the best performing model and the corresponding hyperparameter values
◦ Computationally costly for a grid of fine granularity
◦ Might miss optimal hyperparameter values
Random Search
◦ Train models using combinations of potential hyperparameter values chosen at random
◦ May try more values per hyperparameter than would be feasible with grid search
◦ Identify the best performing model and the corresponding hyperparameter values
◦ May find the optimal combination of hyperparameter values by chance or may miss the optimal points altogether
◦ Often preferable when the hyperparameter search space is large, as in the sketch below
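A minimal sketch of random search using scikit-learn's RandomizedSearchCV on the diabetes dataset; the sampling distributions and n_iter below are illustrative choices, and scipy's loguniform is assumed to be available.

# Random search over SVR hyperparameters
from scipy.stats import loguniform
from sklearn.svm import SVR
from sklearn.datasets import load_diabetes
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)
param_distributions = {'C': loguniform(1e-1, 1e2),
                       'gamma': loguniform(1e-4, 1e-1),
                       'epsilon': [0.1, 0.5]}
random_search = RandomizedSearchCV(SVR(kernel='rbf'), param_distributions,
                                   n_iter=20, cv=5, random_state=0)
random_search.fit(X, y)
print(random_search.best_params_)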
Random search turns out to be a surprisingly effective technique
The reason random search turns out to work so well is due to two key properties:
▪ The hyperparameter space has a low effective dimensionality
◦ Some hyperparameters matter much more than others when it comes to finding good settings
▪ The optimal combination of hyperparameter values varies according to the dataset
◦ One cannot just find the two most important hyperparameters for some model architecture and then always optimize based on just those
Grid Search with Python (1)
# Import the relevant libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Load the diabetes dataset
data = datasets.load_diabetes()

# Partition the dataset into training and testing datasets
# (the testing dataset being the last 40 samples)
num_test = 40
X_train = data.data[:-num_test, :]
y_train = data.target[:-num_test]
X_test = data.data[-num_test:, :]
y_test = data.target[-num_test:]
Grid Search with Python (2)
# Create a regression model
model = SVR()
# Specify hyperparameter values to perform the search with
param_grid = [
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.001, 0.0001], 'epsilon': [0.1]},
    {'kernel': ['rbf'], 'C': [10, 100], 'gamma': [0.001, 0.0001], 'epsilon': [0.5]}
]
"param_grid" tells the algorithm to first evaluate 3×2 = 6 combinations of hyperparameter values (the potential values for "C" and "gamma", with epsilon fixed at 0.1), and then another 2×2 = 4 combinations. Altogether, grid search will evaluate 10 combinations.
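As a quick check, scikit-learn's ParameterGrid can confirm the number of combinations the grid above generates:

# Count the hyperparameter combinations produced by param_grid
from sklearn.model_selection import ParameterGrid
print(len(ParameterGrid(param_grid)))   # 10 = 3x2x1 + 2x2x1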
Grid Search with Python (3)
# Set up grid search and use 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5)

# Perform the grid search
grid_search.fit(X_train, y_train)

# Show the overall cross-validation results
grid_search.cv_results_
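Optionally, continuing from the grid search above, the results can be tabulated and ranked by mean test score; this is a sketch that assumes pandas is available as pd.

# Tabulate the cross-validation results and rank the combinations
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score'))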
Grid Search with Python (4)
Grid Search with Python (5)
# Show the hyperparameter values for the best performing model
grid_search.best_params_
# Get hold of the best performing model
best_model = grid_search.best_estimator_
# Use the best performing model to make predictions for the testing dataset
pred = best_model.predict(X_test)
# Show the RMSE and R^2 score of the prediction
print('RMSE: %0.3f' % mean_squared_error(y_test, pred, squared=False))
print('R^2 Score: %0.3f' % r2_score(y_test, pred))
K-fold Cross Validation
Machine Learning Validation
◦ Validation is the process of making sure that the model generalizes well
◦ Generalization means that a model built using one set of data performs well on a completely different set of data
◦ The validation dataset is used to fine-tune hyperparameters and also serves as an intermediary testing dataset
◦ It is sometimes referred to as the hold-out validation set; a minimal hold-out split is sketched below
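A minimal sketch, using scikit-learn's train_test_split and an illustrative 60/20/20 split, of carving a hold-out validation set out of the training data in addition to the final test set:

# Split data into training, validation (hold-out) and test sets
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%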
K-fold Cross Validation
◦ Evaluates the model across the entire training set
◦ Divides the training set into K folds and then trains the model K times
◦ Each time a different fold is left out of the training data and used instead as a validation dataset
◦ The performance metric is averaged across all K tests
◦ Once the best hyperparameter combination has been found, the model is retrained on the full dataset (a minimal K-fold loop is sketched below)
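A minimal sketch of the procedure, using scikit-learn's KFold with illustrative SVR settings: the model is trained K = 5 times, each time validated on the held-out fold, and the metric is averaged.

# Manual 5-fold cross-validation loop
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = SVR(kernel='rbf', C=10).fit(X[train_idx], y[train_idx])   # train on K-1 folds
    fold_scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))  # validate on the held-out fold
print('mean R^2 across 5 folds: %0.3f' % np.mean(fold_scores))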
All data in the hold-out dataset can be used for both training and testing through k-fold cross-validation
▪ Free of selection bias
▪ Generalizes well
▪ Matters less how the data gets divided
▪ However, it has higher computational cost as training has to be done k times
In the illustration, a 10,000-sample dataset is divided into five folds of 2,000 samples each; every fold serves once as the test data, yielding evaluation metrics of 0.867, 0.884, 0.901, 0.879 and 0.896 across the five iterations, which are then averaged.
K-fold Cross Validation (1)
# Import the relevant libraries
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_validate
import plotly.graph_objects as go
import plotly.express as px
# Load data
# https://www.kaggle.com/quantbruce/real-estate-price-prediction?select=Real+estate.csv
data = pd.read_csv('Real estate.csv', encoding='utf-8')
K-fold Cross Validation (2)
# Identify features and target to use
X = data['X3 distance to the nearest MRT station'].values.reshape(-1, 1)
y = data['Y house price of unit area'].values
# Create an SVR
C = 100
epsilon = 0.1
gamma = 0.001
svr = SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
# Cross-validate the SVR
scores = cross_validate(svr, X, y, cv=5,
                         scoring=('r2', 'neg_mean_squared_error'),
return_train_score=True)
K-fold Cross Validation (3)
# Show scores collected during cross-validation
scores
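To summarise the per-fold results, the test scores can be averaged; this sketch relies on the "test_<metric>" key names that cross_validate uses for the scoring choices above.

# Average the per-fold scores returned by cross_validate
import numpy as np
print('mean test R^2 : %0.3f' % np.mean(scores['test_r2']))
print('mean test RMSE: %0.3f' % np.mean(np.sqrt(-scores['test_neg_mean_squared_error'])))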
Conclusion
Support Vector Machine (SVM) Models in a Nutshell
1. Feature Data Types: Requires feature scaling of the data.
2. Target Data Types: Numeric values.
3. Key Principles: Introduce a margin on either side of the regression plane such that data points falling within the margin do not contribute to the loss function calculation. The goal is to minimise the margin and the loss while fitting as many data points into the margin as possible.
4. Hyperparameters: With a linear or polynomial kernel, the C hyperparameter (cost of misclassification) is needed but gamma (curvature weight of the decision boundary) is not. With the Gaussian RBF kernel, both gamma and C are needed.
5. Data Assumptions: No distributional requirement on the data.
6. Performance: Fairly robust against overfitting, especially in higher dimensional spaces. Handles non-linear relationships quite well, with many kernels to choose from. Can be inefficient to train and memory-intensive to run and tune. Does not perform well with large datasets.
7. Accuracy: Generally performs better than linear and polynomial regression.
8. Explainability: Black-box technique.
THANK YOU