
Machine Learning for Financial Data
January 2021
DEEP LEARNING (PART 2)

Contents
◦ Optimizers
◦ Shortcoming of RNNs
◦ Long Short-Term Memory (LSTM)
◦ Information Regulation using the Gating Mechanism
◦ PyTorch LSTM
◦ LSTM Layers
◦ Loss Function
◦ Learning Rate

Optimizers

Neural Network Optimizers
A neural network typically involves a large number of nodes and connections, so the weights of each perceptron's connections cannot be tuned manually; this task is performed by an optimizer. An optimizer is an algorithm that adjusts the attributes of the neural network, such as its weights and learning rate, to reduce the loss. Optimizers solve the training problem by minimizing the loss function.
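As a rough sketch of where an optimizer sits in practice, the PyTorch loop below uses a placeholder linear model, random data, and an arbitrary learning rate; all of these are illustrative only, not part of the slides.

```python
import torch
import torch.nn as nn

# Placeholder model and data, purely for illustration.
model = nn.Linear(10, 1)               # any nn.Module would do here
inputs = torch.randn(32, 10)           # a batch of 32 samples with 10 features
targets = torch.randn(32, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()              # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)
    loss.backward()                    # compute gradients of the loss w.r.t. the weights
    optimizer.step()                   # the optimizer adjusts the weights to reduce the loss
```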

How do neural network optimizers work?
▪ For a useful mental model, you can think of a hiker trying to get down a mountain with a blindfold on
▪ It is impossible to know which direction to go in, but there is one thing the hiker can know: whether he/she is going down (making progress) or going up (losing progress)
▪ Eventually, if the hiker keeps taking steps that lead him/her downwards, he/she will reach the base
▪ Similarly, it’s impossible to know what your model’s weights should be right from the start
▪ But with some trial and error based on the loss function, you can end up getting there eventually
▪ How you should change your weights or learning rates to reduce the losses is defined by the optimizers you use
▪ Optimization algorithms are responsible for reducing the losses and for providing the most accurate results possible

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function
A gradient measures how much the output of a function changes if you change the inputs a little bit
The size of the steps gradient descent takes towards the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.
For gradient descent to reach the local minimum, the learning rate must be set to an appropriate value that is neither too low nor too high.
If the steps are too big, gradient descent may never reach the local minimum because it bounces back and forth across the valley of the convex function. If the steps are too small, gradient descent will eventually reach the local minimum, but it may take a very long time.
[Figure: two descent paths on a convex loss curve, contrasting a high learning rate that overshoots and bounces across the minimum with a low learning rate that takes many small steps towards it.]
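As a toy illustration (not from the slides), the snippet below runs plain gradient descent on f(w) = (w − 3)², whose gradient is 2(w − 3), with three different learning rates.

```python
def gradient_descent(learning_rate, steps=25, w=0.0):
    """Minimise f(w) = (w - 3)**2 with plain gradient descent."""
    for _ in range(steps):
        gradient = 2 * (w - 3)              # df/dw
        w = w - learning_rate * gradient    # step in the direction of the negative gradient
    return w

print(gradient_descent(0.1))      # close to the minimum at w = 3
print(gradient_descent(1.5))      # learning rate too high: the iterates bounce and diverge
print(gradient_descent(0.001))    # learning rate too low: still far from 3 after 25 steps
```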

Neural Network Optimizers
◦ Gradient Descent
◦ Stochastic Gradient Descent (SGD)
◦ Mini-Batch SGD (MB-SGD)
◦ SGD with Momentum
◦ Nesterov Accelerated Gradient (NAG)
◦ Adaptive Gradient (AdaGrad)
◦ AdaDelta
◦ RMSprop
◦ Adaptive Moment Estimation (Adam)
◦ … and many others (most of these map directly to optimizer classes in torch.optim, as sketched below)
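The sketch below shows how each of the listed optimizers would be constructed in PyTorch; the model and learning rates are placeholders, and NAG is obtained from SGD via nesterov=True.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                  # placeholder model
params = list(model.parameters())        # reuse the same parameter list below

sgd      = torch.optim.SGD(params, lr=0.01)                               # (mini-batch) SGD
momentum = torch.optim.SGD(params, lr=0.01, momentum=0.9)                 # SGD with Momentum
nag      = torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True)  # NAG
adagrad  = torch.optim.Adagrad(params, lr=0.01)                           # AdaGrad
adadelta = torch.optim.Adadelta(params)                                   # AdaDelta
rmsprop  = torch.optim.RMSprop(params, lr=0.001)                          # RMSprop
adam     = torch.optim.Adam(params, lr=0.001)                             # Adam
```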

Adaptive Moment Estimation (Adam)
◦ Calculates adaptive learning rates for each parameter
◦ Can be considered a combination of RMSprop and Stochastic Gradient Descent (SGD) with momentum
◦ Like RMSprop, Adam keeps an exponentially decaying average of past squared gradients; like momentum, it also keeps an exponentially decaying average of past gradients
◦ Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface
◦ Straightforward to implement, computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, requires no stationary objective, and is well suited to problems with large data/parameters or sparse gradients (a construction sketch with PyTorch's defaults follows below)
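In PyTorch, Adam is available as torch.optim.Adam. The snippet below spells out its default hyperparameters, where the two beta values control the decaying averages of past gradients and past squared gradients; the model is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)          # placeholder model, purely for illustration

# betas[0] controls the decaying average of past gradients (the momentum part);
# betas[1] controls the decaying average of past squared gradients (the RMSprop part).
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                     # PyTorch's default learning rate for Adam
    betas=(0.9, 0.999),          # default decay rates for the two moment estimates
    eps=1e-8,                    # small constant added for numerical stability
)
```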

Adam is the Preferred Optimizer
◦ SGD (red) gets stuck at a saddle point
◦ SGD is therefore only suitable for shallow networks
◦ All the other algorithms eventually converge one after the other, with AdaDelta the fastest, followed by the momentum algorithms
◦ AdaGrad and AdaDelta can be used for sparse data
◦ Momentum and NAG work well for most cases but are slower
◦ Adam is not shown, but it is the fastest algorithm to converge to the minimum and is considered the best of all the algorithms included

Shortcoming of RNNs

RNNs carry information over different time steps rather than keeping all the inputs independent of each other
▪ A significant shortcoming that plagues the typical RNN is the problem of vanishing / exploding gradients
▪ These problems arise when back-propagating through the RNN during training, especially for networks with deeper layers
▪ The gradients have to go through continuous matrix multiplications during the back-propagation process due to the chain rule, causing the gradient to either shrink exponentially (vanish) or blow up exponentially (explode), as the toy illustration below shows
▪ Having a gradient that is too small prevents the weights from updating and learning, whereas extremely large gradients cause the model to be unstable
▪ RNNs are therefore unable to work with longer sequences and hold on to long-term dependencies, making them suffer from "short-term memory"
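The toy loop below (not from the slides) multiplies a gradient of 1.0 by a fixed factor 100 times, mimicking what the chain rule does across 100 time steps of back-propagation: a factor below 1 vanishes, a factor above 1 explodes.

```python
grad_small, grad_large = 1.0, 1.0
for _ in range(100):
    grad_small *= 0.9        # factor < 1: the gradient shrinks exponentially (vanishes)
    grad_large *= 1.1        # factor > 1: the gradient grows exponentially (explodes)

print(grad_small)            # ~2.7e-05
print(grad_large)            # ~1.4e+04
```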

Long Short-Term Memory (LSTM)

RNNs are unable to remember information from much earlier – the context is lost
[Figure: an unrolled RNN reading "I have a dog named Sam", then "I enjoy all kinds of sports", then "However, Sam, my pet ___". Only short-term / working memory h1 … hn is carried forward, so by the final time step the prediction is p(cat) = p(dog) = p(hamster) = 0.33; the early context identifying Sam as a dog has been lost.]

LSTM is a kind of RNN; its support for long-term memory and its use of a gating mechanism are what set it apart
[Figure: side-by-side block diagrams. A recurrent neural network connects the input layer x to the output layer y through working memory only; a long short-term memory network adds long-term memory alongside the working memory.]

LSTM can retain earlier information through its long-term memory
[Figure: the same unrolled sequence processed by an LSTM. A long-term memory runs alongside the short-term memory across the time steps, carrying the fact that Sam is a dog from "I have a dog named Sam" through "I enjoy all kinds of sports" to "However, Sam, my pet ___".]

In a normal RNN cell, the input at a time step and the hidden state from the previous time step are passed through a tanh activation function to produce a new hidden state and output
[Figure: a vanilla RNN cell. The current input x and the previous hidden state (short-term / working memory) are combined and passed through a tanh activation, producing the output y and the new hidden state.]
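A minimal sketch of this update rule, with illustrative weight names and sizes (PyTorch's nn.RNNCell implements the same idea with its own parameterisation):

```python
import torch

def rnn_cell_step(x_t, h_prev, W_x, W_h, bias):
    """One vanilla RNN time step: the new hidden state also serves as the output."""
    return torch.tanh(x_t @ W_x + h_prev @ W_h + bias)

# Illustrative sizes: 4 input features, hidden state of size 8.
x_t    = torch.randn(1, 4)
h_prev = torch.zeros(1, 8)
W_x, W_h, bias = torch.randn(4, 8), torch.randn(8, 8), torch.zeros(8)
h_new = rnn_cell_step(x_t, h_prev, W_x, W_h, bias)   # shape (1, 8)
```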

An LSTM cell has a slightly more complex structure and takes in 3 inputs: the current input data, the short-term memory from the previous time step, and the long-term memory
[Figure: an LSTM cell. The cell receives (input, (hidden state, cell state)), that is, the current input x, the short-term / working memory, and the long-term memory (cell state), and, via tanh and the gates, produces (output, (hidden state, cell state)): the output y, the new hidden state, and the new cell state.]
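This tuple structure mirrors PyTorch's nn.LSTM interface; a minimal sketch with arbitrary sizes (2 samples, 5 time steps, 4 features, hidden size 8):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=1, batch_first=True)

x  = torch.randn(2, 5, 4)    # (batch_size, seq_length, input_size)
h0 = torch.zeros(1, 2, 8)    # (num_layers, batch_size, hidden_size): short-term memory
c0 = torch.zeros(1, 2, 8)    # (num_layers, batch_size, hidden_size): long-term memory

output, (hn, cn) = lstm(x, (h0, c0))   # (output, (hidden state, cell state))
print(output.shape)          # torch.Size([2, 5, 8])
```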

Information Regulation using the Gating Mechanism

A gate can be seen as a filter that lets relevant information through and discards irrelevant information
▪ An LSTM cell uses gates to regulate the information to be kept or discarded at each time step before passing the long-term and short-term information on to the next cell
▪ Ideally, these gates selectively remove any irrelevant information while holding on only to the useful information
▪ These gates need to be trained to accurately filter what is useful and what is not

LSTM uses Input Gate, Forget Gate, and Output Gate to regulate information flowing across the network

LSTM Input Gate, Forget Gate, and Output Gate

Input Gate
i_1 = σ(W_{i1} · [H_{t−1}, x_t] + bias_{i1})
i_2 = tanh(W_{i2} · [H_{t−1}, x_t] + bias_{i2})
i_input = i_1 · i_2

Forget Gate
f = σ(W_{forget} · [H_{t−1}, x_t] + bias_{forget})
C_t = C_{t−1} · f + i_input

Output Gate
O_1 = σ(W_{output1} · [H_{t−1}, x_t] + bias_{output1})
O_2 = tanh(W_{output2} · C_t + bias_{output2})
H_t, O_t = O_1 · O_2
It should be noted that the output is actually a tuple containing both the hidden state and the prediction value
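A minimal sketch of one LSTM time step following the gate equations above. The weight names and sizes are illustrative; PyTorch's nn.LSTMCell folds these per-gate matrices into combined weight tensors, and the textbook formulation usually takes O_2 = tanh(C_t) without an extra weight, whereas the slide applies W_{output2} to C_t, which the sketch follows.

```python
import torch

def lstm_cell_step(x_t, H_prev, C_prev, W, b):
    """One LSTM time step following the gate equations on these slides.
    W and b are dictionaries of illustrative per-gate weight matrices and biases."""
    z = torch.cat([H_prev, x_t], dim=-1)                # [H_{t-1}, x_t]

    # Input gate: decide what new information to write into the cell state.
    i1 = torch.sigmoid(z @ W["i1"] + b["i1"])
    i2 = torch.tanh(z @ W["i2"] + b["i2"])
    i_input = i1 * i2

    # Forget gate: decide how much of the previous cell state to keep.
    f = torch.sigmoid(z @ W["forget"] + b["forget"])
    C_t = C_prev * f + i_input                          # new cell state / long-term memory

    # Output gate: decide what part of the cell state becomes the new hidden state.
    O1 = torch.sigmoid(z @ W["out1"] + b["out1"])
    O2 = torch.tanh(C_t @ W["out2"] + b["out2"])        # the slide applies W_output2 to C_t
    H_t = O1 * O2                                       # new hidden state, also the output
    return H_t, C_t

# Illustrative sizes: 4 input features, hidden/cell state of size 8.
hidden, inputs = 8, 4
W = {k: torch.randn(hidden + inputs, hidden) for k in ("i1", "i2", "forget", "out1")}
W["out2"] = torch.randn(hidden, hidden)
b = {k: torch.zeros(hidden) for k in W}
H_t, C_t = lstm_cell_step(torch.randn(1, inputs),
                          torch.zeros(1, hidden), torch.zeros(1, hidden), W, b)
```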

LSTM Layers

Input Layer
◦ Every neural network has one input layer
◦ The number of perceptrons/nodes in this layer is completely and uniquely determined by the number of features used to make predictions (i.e. the input size or input dimension)
◦ For time series data
◦ One row of features represents the input at one time step
◦ The number of time steps represents the sequence length
◦ Data along a series of time steps form a batch of data
◦ The number of samples is referred to as batch size
◦ For some problems, the prediction is made using several time steps (e.g. a moving average); the number of time steps used is referred to as a window (see the sliding-window sketch below)
[Figure: a batch of time-series inputs. Rows are samples (batch size, i.e. no. of samples), columns are the time steps x_{t−1}, x_t, x_{t+1}, … (sequence length), with a sliding window moving along the time axis.]
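A minimal sketch (not from the slides) of turning a univariate series into sliding-window samples with the (batch size, sequence length, no. of features) layout described above; the series and window length are placeholders.

```python
import numpy as np
import torch

def make_windows(series, window):
    """Slice a univariate series into (num_samples, window, 1) inputs and the
    value immediately after each window as the target."""
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])
        y.append(series[start + window])
    X = torch.tensor(np.array(X), dtype=torch.float32).unsqueeze(-1)  # (batch, seq_length, 1)
    y = torch.tensor(np.array(y), dtype=torch.float32).unsqueeze(-1)  # (batch, 1)
    return X, y

prices = np.sin(np.linspace(0, 10, 200))   # stand-in for a price series
X, y = make_windows(prices, window=20)
print(X.shape, y.shape)                    # torch.Size([180, 20, 1]) torch.Size([180, 1])
```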

Output Layer
◦ Every neural network has exactly one output layer
◦ The number of perceptrons/nodes in this layer is determined by the number of predictions to make (e.g. 4 perceptrons when predicting a bounding box)
◦ For classification problems, the output layer can comprise a single node, unless the softmax function is applied, in which case the output layer has one node per class label
◦ For regression problems, the output layer typically comprises a single node, but multiple values are also possible

Hidden Layers
◦ Linear data does not need any hidden layer!
◦ One hidden layer is sufficient for the large majority of problems
◦ There is a consensus that there are very few situations in which a 2nd or 3rd hidden layer would improve performance
◦ There are however counter examples that cannot directly be learned via a single one-hidden-layer MLP or require an infinite number of nodes
◦ Increasing the number of hidden layers also increases the complexity of the model, and choosing many hidden layers (e.g. 8, 9, or double digits) may sometimes lead to overfitting
◦ In general, neither the number of layers nor the number of nodes per layer can be calculated analytically
[Figure: a network with a 1st, 2nd, and 3rd hidden layer.]

Hidden Perceptrons
◦ In general, using the same number of perceptrons for all hidden layers will suffice
◦ Usually, more performance boost can be gained from adding more layers than adding more perceptrons in each layer
◦ If the number of layers/perceptrons is too small, the network might not be able to learn the underlying patterns in the data and thus become useless
◦ Some suggested estimations:
To prevent over-fitting, the number of hidden perceptrons should be
< number of samples / (scaling factor · (input size + output size)),
where the scaling factor is usually between 2 and 10
◦ Between the size of the input layer and the size of the output layer
◦ (2/3) · input size + output size
◦ √(input size · output size)
◦ < 2 · input size

LSTM uses forward propagation to provide the prediction and the loss function & optimizer to drive backward propagation
◦ The input is a Tensor(batch size, sequence length, no. of features) when batch_first=True
◦ The hidden state and cell state are each a Tensor(no. of layers, batch size, hidden size)
◦ The size of the hidden state represents the dimension of the vector used to capture the state of a time step; it can be chosen at will
[Figure: a batch of sliding-window sequences (batch size × sequence length, time steps x_{t−1}, x_t, x_{t+1}, …) flows through the LSTM in forward propagation to produce a prediction; the prediction is compared with the target, and backward propagation updates the weights; the hidden state and cell state sizes determine the output size.]

Input, hidden state, and cell state are captured as tensors of specific dimensions
◦ Input sequential data: torch.Tensor(batch_size, seq_length, input_size)*
◦ Hidden state / short-term memory: torch.Tensor(num_layers, batch_size, hidden_size), typically initialised with torch.zeros(num_layers, batch_size, hidden_size)
◦ Cell state / long-term memory: torch.Tensor(num_layers, batch_size, hidden_size), typically initialised with torch.zeros(num_layers, batch_size, hidden_size)
◦ Output data: torch.Tensor(batch_size, seq_length, output_size)
* when batch_first=True; otherwise [seq_length, batch_size, input_size]

RNN can be perceived as a stack of layers, with each layer representing one unfolding of the RNN
[Figure: a stacked RNN with two hidden layers and readout layers, unfolded over three time steps (1st, 2nd, and 3rd unfolding).]

Stacking is established through the interactions in the gating mechanism

Loss Function

Loss Function
◦ For regression problems, Mean Squared Error (MSE) is the most common loss function to use
  MSE = (1/n) · Σ_{i=1..n} (predicted_i − actual_i)²
◦ If there are a significant number of outliers, Mean Absolute Error (MAE) or the Huber loss function can be used
  MAE = (1/n) · Σ_{i=1..n} |predicted_i − actual_i|
◦ For classification problems, cross-entropy will serve well in most cases

Learning Rate

Learning Rate
◦ To find the best learning rate, start with a very low value (e.g. 10⁻⁶) and slowly multiply it by a constant until it reaches a very high value (e.g. 10)
◦ Measure the model performance (against the learning rate) to determine which rate served the problem well
◦ The model can then be retrained using this optimal learning rate
◦ The best learning rate is usually half of the learning rate that causes the model to diverge
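Tying the preceding slides together, the sketch below combines nn.LSTM with a linear read-out layer, the MSE loss, and the Adam optimizer; all sizes, data, and hyperparameters are illustrative placeholders, not a prescribed configuration.

```python
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    """LSTM layers followed by a linear read-out layer for one-step-ahead prediction."""
    def __init__(self, input_size=1, hidden_size=32, num_layers=2, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.readout = nn.Linear(hidden_size, output_size)

    def forward(self, x):                       # x: (batch_size, seq_length, input_size)
        output, (h_n, c_n) = self.lstm(x)       # hidden and cell states default to zeros
        return self.readout(output[:, -1, :])   # predict from the last time step

model = LSTMRegressor()
loss_fn = nn.MSELoss()                                       # regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # start with a small learning rate

X = torch.randn(180, 20, 1)    # placeholder windows: (batch, seq_length, features)
y = torch.randn(180, 1)        # placeholder targets

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)    # forward propagation and loss computation
    loss.backward()                # backward propagation
    optimizer.step()               # weight update by the optimizer
```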
Conclusion

RNN/LSTM is designed for sequential data predictions
1. Feature Data Types: Numeric data. Variable encoding is therefore necessary for categorical data. Normalised data is advised. For time series data, differencing is performed to make the series stationary, and a number of differences (referred to as the order of integration) may be performed depending on the lag time. The train and test split should be done on sequential samples.
2. Target Data Types: Numeric data. Univariate/multivariate predicted values. Univariate/multivariate predicted class labels.
3. Key Principles: Random weights are assigned initially. Multiple layers and non-linear activation functions are used to capture complex patterns. The loss function is used to compute the accuracy and drives the backward propagation that updates the gradients and therefore the weights. The number of hidden states reflects the complexity that can be captured by the RNN. The learning rate often begins with a low value, e.g. 10⁻⁶.
4. Hyperparameters: Number of hidden layers, number of output layers, number of perceptrons per layer, number of hidden states, loss function, optimizer, learning rate, number of epochs. There are no hard and fast rules to determine the number of hidden/output layers, the number of perceptrons per layer, or the number of hidden states.
5. Data Assumptions: Non-parametric – no assumption about the data distribution. All data are used.
6. Performance: Training time is generally high. The use of GPUs, TPUs, and various distributed platforms should provide better performance.
7. Accuracy: Accuracy has more correlation with the quality and size of the dataset than with the number of layers or the number of perceptrons per layer.
8. Explainability: Poor.

THANK YOU