Machine Learning and Data Mining in Business
Lecture 12: Recurrent Neural Networks
Discipline of Business Analytics
Lecture 12: Recurrent Neural Networks
Learning objectives
• Recurrent neural networks
• Gated recurrent units (GRU)
• Long short-term memory (LSTM)
Lecture 12: Recurrent Neural Networks
1. Sequence models
2. Text data
3. Recurrent neural networks
4. Modern recurrent neural networks
5. Embedding layers
Sequence models
Sequence data
• So far, we’ve encountered two types of data: tabular data and image data.
• Implicitly, we've been assuming that our samples are independent and identically distributed (i.i.d.). This is not realistic for many kinds of data.
• Many important applications concern sequential data such as text, audio, video, time series, and longitudinal data.
Example: speech recognition
Image credit: NVIDIA Developer Blog
Example: natural language processing
Image credit: Google AI blog, Pathways Language Model
Example: time series forecasting
Image credit: Hyndman, R.J., & Athanasopoulos, G. (2021)
Example: DNA sequence analysis
Image credit: Zou, J., Huss, M., Abid, A. et al. (2019)
Sequence models
Suppose that we want to predict the next value in the sequence from past information. That is, we want to model the distribution
x_t ∼ p(x_t | x_{t-1}, …, x_1).
A challenge in this setting is that the set of inputs x_{t-1}, …, x_1 grows with t.
Autoregressive models
In an autoregressive model, we assume that x_t only depends directly on the past p observations:
x_t ∼ p(x_t | x_{t-1}, …, x_{t-p}).
In the context of time series forecasting, you are familiar with models
such as the AR(p) model:
x_t = c + φ_1 x_{t-1} + φ_2 x_{t-2} + ··· + φ_p x_{t-p} + ε_t.
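A rough sketch of how an AR(p) model could be estimated by ordinary least squares on lagged values (the function names and the NumPy-based fitting approach are illustrative, not the method prescribed in the lecture):

import numpy as np

def fit_ar(x, p):
    """Fit x_t = c + phi_1 x_{t-1} + ... + phi_p x_{t-p} + eps_t by least squares."""
    x = np.asarray(x, dtype=float)
    # Each row of the design matrix holds the p most recent past values of the series.
    X = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(X)), X])          # column of ones for the intercept c
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)   # coef = [c, phi_1, ..., phi_p]
    return coef

def forecast_ar(x, coef):
    """One-step-ahead forecast from the last p observations (most recent first)."""
    x = np.asarray(x, dtype=float)
    p = len(coef) - 1
    return coef[0] + coef[1:] @ x[-1:-p - 1:-1]

For example, forecast_ar(x, fit_ar(x, p=3)) produces the one-step-ahead prediction from an AR(3) fit.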
Latent autoregressive models
In a latent autoregressive model, we keep a summary h_{t-1} of past observations and make predictions according to:
x_t ∼ p(x_t | h_t),
h_t = f(x_{t-1}, h_{t-1}).
Example: exponential smoothing
In simple exponential smoothing, we forecast the next value in a time series as:
x̂_t = h_t,
h_t = α x_{t-1} + (1 − α) h_{t-1}.
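A minimal sketch of this recursion in plain Python (initialising the state at the first observation is an assumed convention, not something specified on the slide):

def ses_forecast(x, alpha, h0=None):
    """Simple exponential smoothing: h_t = alpha * x_{t-1} + (1 - alpha) * h_{t-1}.
    Returns h_{T+1}, the forecast of the value following the observed series x."""
    h = x[0] if h0 is None else h0     # assumed initialisation of the state
    for x_t in x:
        h = alpha * x_t + (1 - alpha) * h
    return h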
State space models
In a state space model, we assume that x_t depends on a latent state h_t:
x_t ∼ p(x_t | h_t),
h_t ∼ p(h_t | h_{t-1}).
We then compute the distribution p(x_t | x_{t-1}, …, x_1), which is a nontrivial task.
Sequence models
More generally, suppose that we want to model the full sequence
x_1, …, x_T ∼ p(x_1, …, x_T). From the chain rule of probability,
p(x_1, …, x_T) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1}).
Therefore, a model of p(x_t | x_1, …, x_{t-1}) is suitable for many applications.
Sequence models
More generally, we want to predict an output variable y or an output sequence y_1, …, y_τ given an input variable x or an input sequence x_1, …, x_T:
Image credit:
One to many example: image captioning
Image credit: JalFaizy at
Many to one example: sentiment analysis
Image credit: https://monkeylearn.com/sentiment-analysis/
Many to many example: machine translation
Image credit: https://lilianweng.github.io/posts/2018-06-24-attention/
Example: visual question answering
Text preprocessing
We need to implement the following steps to process text for deep learning:
1. Load the text data as a string or collection of strings.
2. Split the strings into tokens, such as words, characters, or sub-words.
3. Build a lookup table of vocabulary that maps the tokens to numerical ids.
4. Convert the sequence of tokens into sequences of numerical token ids.
Text preprocessing
Tokenisation:
Text preprocessing
Counting the token frequencies:
Text preprocessing
Building the vocabulary:
Text preprocessing
Constructing sequences of token ids:
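The lecture illustrates these steps with code; as a stand-in, here is a minimal sketch covering word-level tokenisation, frequency counting, vocabulary construction, and conversion to token ids (the corpus string, the regular expression, and the reserved <unk> token are illustrative choices):

import collections
import re

raw_text = "the time machine by h g wells ..."       # illustrative corpus string

# 1.-2. Tokenise: lower-case, replace non-letters with spaces, split into word tokens.
tokens = re.sub(r"[^a-z]+", " ", raw_text.lower()).split()

# Count the token frequencies.
counter = collections.Counter(tokens)

# 3. Build the vocabulary: map each token to an integer id, reserving 0 for unknown tokens.
vocab = {"<unk>": 0}
for token, _freq in counter.most_common():
    vocab[token] = len(vocab)

# 4. Convert the token sequence into a sequence of numerical token ids.
token_ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]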
Language models
• Let x_1, …, x_T be a sequence of tokens. A language model is a model of the joint distribution
p(x_1, …, x_T).
• Autoregressive language models such as GPT-3 model the conditional distribution p(x_t | x_1, …, x_{t-1}) and are applicable to text generation.
• In masked language modelling (MLM), we train models to predict masked tokens from the rest of the text.
Illustration: GPT-3
Thought experiments generated by GPT-3:
Image credit: https://gpt3demo.com/apps/10-thought-experiments
Illustration: Github Copilot
Image credit: https://copilot.github.com/
Language models
• We can train language models by constructing the corresponding supervised learning tasks directly from natural text, an approach called self-supervised learning.
• Language models provide text representations for downstream tasks such as classification and question answering.
Recurrent neural networks
Recurrent neural networks
Recurrent neural networks (RNNs) are a class of models designed to process a sequence of vectors x_1, …, x_T into a sequence of hidden states:
h_t = f(x_t, h_{t-1}).
Recurrent neural networks
A basic RNN is
h_t = tanh(b_h + W_h h_{t-1} + U_h x_t)
o_t = b_o + W_o h_t
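A minimal PyTorch sketch of this recurrence (the class name, sizes, and the use of nn.Linear layers to hold each weight matrix and bias are illustrative):

import torch

class BasicRNNCell(torch.nn.Module):
    """One step of the vanilla RNN: h_t = tanh(b_h + W_h h_{t-1} + U_h x_t), o_t = b_o + W_o h_t."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.W_h = torch.nn.Linear(hidden_size, hidden_size)              # W_h h_{t-1} + b_h
        self.U_h = torch.nn.Linear(input_size, hidden_size, bias=False)   # U_h x_t
        self.W_o = torch.nn.Linear(hidden_size, output_size)              # W_o h_t + b_o

    def forward(self, x_t, h_prev):
        h_t = torch.tanh(self.W_h(h_prev) + self.U_h(x_t))
        o_t = self.W_o(h_t)
        return o_t, h_t

# Unrolling over a sequence X of shape (T, N, input_size):
# h = torch.zeros(N, hidden_size)
# for x_t in X:
#     o_t, h = cell(x_t, h)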
Example: character-level language model
Image credit: Zhang et al (2021), Dive Into Deep Learning.
Recurrent neural network
Figure from Deep Learning by Goodfellow, Bengio, and Courville.
The challenge of long-term dependencies
• RNNs have difficulty learning long-term dependencies.
• In particular, the gradient of the cost function with respect to the RNN parameters involves long products of matrices, which can lead to vanishing or exploding gradients.
• Techniques such as gradient clipping and truncation can alleviate this problem (see the sketch below), but are not sufficient to overcome the limitations of traditional RNNs.
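For instance, gradient clipping is a one-line addition to a standard training step. The training step below is a hypothetical sketch; torch.nn.utils.clip_grad_norm_ rescales the gradients before the parameter update:

import torch

def training_step(model, loss_fn, optimizer, x_batch, y_batch, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    # Rescale the gradients so their global norm is at most max_norm,
    # which limits the damage from exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()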
Modern recurrent neural networks
Modern recurrent neural networks
The next models incorporate the following ideas:
• An early observation may be highly significant for predicting all future observations, so the network should be able to store this information.
• There may be structural breaks between parts of the sequence, so the network should be able to reset the state.
• Some observations may contain no useful information, therefore the network should have a mechanism to skip inconsequential information.
Notation: element-wise multiplication
Let a = (a_1, a_2, …, a_n) and b = (b_1, b_2, …, b_n). Then
a ⊙ b = (a_1 b_1, a_2 b_2, …, a_n b_n).
Gated recurrent units (GRU)
Gated recurrent units (GRU) introduce the following mechanisms:
• An update gate to control when the network should update the hidden state.
• A reset gate to control when the network should reset the hidden state.
GRU: update and reset gates
z_t = σ(b_z + W_z h_{t-1} + U_z x_t)   (update gate)
r_t = σ(b_r + W_r h_{t-1} + U_r x_t)   (reset gate)
Image credit: Zhang et al (2021), Dive Into Deep Learning.
GRU: candidate hidden state
g_t = tanh(b_g + W_g (r_t ⊙ h_{t-1}) + U_g x_t)
Image credit: Zhang et al (2021), Dive Into Deep Learning.
GRU: hidden state
h_t = z_t ⊙ h_{t-1} + (1 − z_t) ⊙ g_t
Image credit: Zhang et al (2021), Dive Into Deep Learning.
Gated recurrent units (GRU)
z_t = σ(b_z + W_z h_{t-1} + U_z x_t)   (update gate)
r_t = σ(b_r + W_r h_{t-1} + U_r x_t)   (reset gate)
g_t = tanh(b_g + W_g (r_t ⊙ h_{t-1}) + U_g x_t)   (candidate hidden state)
h_t = z_t ⊙ h_{t-1} + (1 − z_t) ⊙ g_t   (hidden state)
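A minimal PyTorch sketch of these equations (built-in implementations such as torch.nn.GRUCell follow the same gating idea but use a slightly different parameterisation, so this is illustrative rather than a drop-in replacement):

import torch

class GRUCellFromSlides(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_z = torch.nn.Linear(hidden_size, hidden_size)              # W_z h_{t-1} + b_z
        self.U_z = torch.nn.Linear(input_size, hidden_size, bias=False)
        self.W_r = torch.nn.Linear(hidden_size, hidden_size)              # W_r h_{t-1} + b_r
        self.U_r = torch.nn.Linear(input_size, hidden_size, bias=False)
        self.W_g = torch.nn.Linear(hidden_size, hidden_size)              # W_g (r_t * h_{t-1}) + b_g
        self.U_g = torch.nn.Linear(input_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev):
        z_t = torch.sigmoid(self.W_z(h_prev) + self.U_z(x_t))       # update gate
        r_t = torch.sigmoid(self.W_r(h_prev) + self.U_r(x_t))       # reset gate
        g_t = torch.tanh(self.W_g(r_t * h_prev) + self.U_g(x_t))    # candidate hidden state
        return z_t * h_prev + (1 - z_t) * g_t                       # new hidden state h_t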
Gated recurrent unit (GRU)
Image credit: Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron.
Long short-term memory (LSTM)
• The long short-term memory (LSTM) is an earlier model that shares many of the properties of GRU, but has a more complex design.
• The LSTM model has a memory cell to keep track of important information and a range of gates to control the memory cell and hidden state.
LSTM: input, forget, and output gates
i_t = σ(b_i + W_i h_{t-1} + U_i x_t)   (input gate)
f_t = σ(b_f + W_f h_{t-1} + U_f x_t)   (forget gate)
o_t = σ(b_o + W_o h_{t-1} + U_o x_t)   (output gate)
Image credit: Zhang et al (2021), Dive Into Deep Learning.
LSTM: candidate memory cell
g_t = tanh(b_g + W_g h_{t-1} + U_g x_t)
Image credit: Zhang et al (2021), Dive Into Deep Learning.
LSTM: memory cell
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
Image credit: Zhang et al (2021), Dive Into Deep Learning.
LSTM: hidden state
h_t = o_t ⊙ tanh(c_t)
Image credit: Zhang et al (2021), Dive Into Deep Learning.
Long short-term memory (LSTM)
Combining all the components, the LSTM model is:
i_t = σ(b_i + W_i h_{t-1} + U_i x_t)   (input gate)
f_t = σ(b_f + W_f h_{t-1} + U_f x_t)   (forget gate)
o_t = σ(b_o + W_o h_{t-1} + U_o x_t)   (output gate)
g_t = tanh(b_g + W_g h_{t-1} + U_g x_t)   (candidate memory cell)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (memory cell)
h_t = o_t ⊙ tanh(c_t)   (hidden state)
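In practice one would typically call a library implementation rather than coding the cell by hand. Below is a hypothetical many-to-one setup using torch.nn.LSTM, with all sizes chosen purely for illustration:

import torch

lstm = torch.nn.LSTM(input_size=32, hidden_size=64, num_layers=1, batch_first=True)
head = torch.nn.Linear(64, 2)           # e.g. two sentiment classes

x = torch.randn(8, 20, 32)              # a batch of sequences: (N=8, L=20, features=32)
outputs, (h_n, c_n) = lstm(x)           # outputs: (N, L, 64); h_n, c_n: (num_layers, N, 64)
logits = head(h_n[-1])                  # classify from the final hidden state of the last layer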
Long short-term memory (LSTM)
Image credit: Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron.
Image credit: Zhang et al (2021), Dive Into Deep Learning.
Image credit: Deep Learning by Goodfellow, Bengio, and Courville.
Bidirectional RNNs
Image credit: Zhang et al (2021), Dive Into Deep Learning.
Embedding layers
• As we’ve seen earlier, the inputs to our models for text data are sequences of token ids.
• We pass a tensor X with shape (N, L) to the model, where N is the batch size and L is the sequence length.
• Before using RNNs and other architectures, we need to further process the input.
• One option is to one-hot encode the token ids and pass the sequence of one-hot vectors as the input to the next layer.
• The one-hot encoded input would then have shape (N, L, V ), where V is the vocabulary size.
• One-hot encoding is generally not a good idea since V is large in most applications. Furthermore, one-hot vectors do not capture the similarity between words.
Embedding layer
We can define an embedding layer in two equivalent ways:
• As a lookup table that maps each possible input id to a vector e ∈ R^H, where H is the embedding dimension.
• As one-hot encoding of the input id followed by a linear layer.
For sequences, the output of the embedding layer has shape (N, L, H), where H is typically much smaller than the vocabulary size (see the sketch below).
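A minimal PyTorch sketch (the vocabulary size, embedding dimension, and batch shape are illustrative):

import torch

vocab_size, H = 10_000, 64
embedding = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=H)

token_ids = torch.randint(0, vocab_size, (32, 50))   # a batch X of token ids with shape (N=32, L=50)
embedded = embedding(token_ids)                      # shape (32, 50, 64), i.e. (N, L, H)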
Embeddings
• The resulting embeddings represent the meaning of different words relative to each other. For example, words that have a similar meaning should have word vectors that are close.
• Word embeddings may encode approximate relations such as e_canberra − e_australia + e_china ≈ e_beijing (see the sketch below this list).
• Even though embeddings originated in NLP, entity embeddings of categorical features are also used for tabular tasks.
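A sketch of how such an analogy can be checked with cosine similarity, assuming a trained embedding matrix emb of shape (V, H) (e.g. embedding.weight.detach()) and a corresponding token list vocab; both are hypothetical here:

import torch
import torch.nn.functional as F

def analogy(emb, vocab, a, b, c):
    """Return the token whose embedding is closest to e_b - e_a + e_c (e.g. 'beijing'
    for a='australia', b='canberra', c='china' if the embeddings encode the relation)."""
    idx = {w: i for i, w in enumerate(vocab)}
    query = emb[idx[b]] - emb[idx[a]] + emb[idx[c]]
    sims = F.cosine_similarity(query.unsqueeze(0), emb, dim=1)   # similarity to every word vector
    sims[[idx[a], idx[b], idx[c]]] = -1.0                        # exclude the query words themselves
    return vocab[int(sims.argmax())]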
Illustration: word vectors
Image credit: and