Lecture 5: Recurrent Neural Network
Instructor:
Outline of this lecture
Why Recurrent Neural Networks (RNNs)
How RNN works
Why Long Short-term Memory (LSTM) Network?
LSTM and Forecasting
Case Study
Recap: Feedforward Neural Network
Each input sample is represented by a fixed-length vector of features
No time-wise context information is modeled
Motivations
Some machine learning problems have time-wise inputs/outputs, i.e., they work on time-series data
E.g., predicting the daily values of a stock
E.g., forecasting the inflation rate of a country
E.g., forecasting the exchange rate between two currencies
E.g., predicting if one student will show up for a class meeting
E.g., predicting for a city (e.g., Los Angeles) the number of tourists for each month in the next 12 months
Problem: Forecasting the number of passengers on a monthly basis
Method A: learning a supervised learning model from a training set of (x, y) pairs
x: features of a given month, including weather conditions, GDP, interest rates, and whether it is a peak month
y: output labels (regression)
This approach fails to model the strong dependence between monthly observations
Recurrent Neural Network (RNNs)
RNN is designed to model the relationship between a sequence of data points and a sequence of output labels, i.e., it is a sequence-to-sequence model
For a given time t, a neuron or a neural network is employed as the prediction function
The output or hidden states of the network at time t are used as inputs to the network at time t+1.
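A minimal NumPy sketch of this recurrence (not from the slides; the dimensions and the tanh activation are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                        # assumed sizes for illustration
W_x = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights (shared over time)
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights (shared over time)
b = np.zeros(n_hidden)

x_seq = rng.normal(size=(5, n_in))           # a toy input sequence of 5 time steps
h = np.zeros(n_hidden)                       # initial hidden state
for x_t in x_seq:
    # the state at time t+1 depends on the input at t+1 and the state at time t
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h)                                     # hidden state after the last time step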
Recurrent Neural Network (RNNs)
Recurrent Neural Network (RNNs)
The network weights are shared over time
An RNN consists of multiple copies of an RNN cell unrolled over time, where each copy processes the input at a different time step
Case 1: RNN for sentiment classification
Task: classify a movie review as positive or negative
Inputs: multiple words or sentences (a sequence of words)
Outputs: binary labels (positive vs. negative)
E.g., ‘The food is really delicious’
Case 1: RNN for sentiment classification
In this case, only the last cell’s output is used for making the prediction
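A minimal Keras sketch of this many-to-one setup (an illustration under assumptions, not the lecture's code; the vocabulary size is made up):

import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 10000                            # assumed vocabulary size
model = tf.keras.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=32),  # word indices -> vectors
    layers.SimpleRNN(32),                     # returns only the last cell's output
    layers.Dense(1, activation='sigmoid'),    # positive vs. negative
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()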
RNN Variants: Input-output
Single input, single output: e.g., a feedforward network
Single input, multiple outputs: e.g., forecasting inflation rates
Multiple inputs, single output: e.g., sentiment classification
Multiple inputs, multiple outputs: e.g., speech recognition
Loss for a single output: squared(H(x_i) - y_i)
Loss for k outputs: sum over j = 1..k of squared(H(x_i)[j] - y_i[j])
RNN: Learning
Most cost functions developed for other networks are applicable to RNNs as well
Optimizer (Gradient descent)
Backpropagation through time (BPTT)
The network of RNN cells is considered as a big feed-forward network
A single computational graph is constructed to propagate gradients back through time
The network parameters must be kept identical across all RNN cells: gradients are computed for each copy and then combined (averaged) to update the shared weights
RNN: Limitations
Just as a product of many real numbers can shrink to zero or explode to infinity, the product of many matrices (arising from backpropagation through time) leads to two issues for gradient descent:
Gradient shrinkage (vanishing gradients): gradients become zero or close to zero
Gradient explosion: gradients become extremely large
A plain RNN can only model time-wise relationships over a limited, effectively fixed span
This span might be too short or too long for a given problem
It is not adaptive
From RNN to LSTM
Long Short-Term Memory (LSTM) network is a special design of RNN.
LSTM uses the idea of constant error flow for RNNs to ensure that gradients do not decay or explode
The key component of LSTM is a memory cell that works like an accumulator over time
A new state is obtained through additive operations over the previous state, instead of multiplicative operations, to ensure gradient based methods behave well.
From RNN Cell to LSTM Cell
h(t) is the output or activation of the cell at time t
x(t) is the input data at time t
g() is the activation function
From RNN Cell to LSTM Cell
LSTM cells
Diagram legend: point-wise operation, network layer, vector flow
h(t) is the output or activation of the cell at time t
x(t) is the input data at time t
g() is the tanh() activation function, with output range (-1, 1)
σ() is the sigmoid activation function, with output range (0, 1)
LSTM Cell: Architecture
Diagram labels: cell state, next cell state, hidden state, output (hidden state), forget gate, input gate, output gate
LSTM cells
LSTM Cell: Architecture
Diagram labels: cell state, next cell state
LSTM Cell: Architecture
Forget gate: based on the input data and the previous hidden state, this layer decides which information in the cell state is kept and which is discarded
LSTM Cell: Architecture
Input gate
Information from the input data is selected to be added to the cell state
LSTM Cell: Architecture
Update Cell State
The new state is now determined by the previous state and the information from the input gate!
LSTM Cell: Architecture
Output Gate
The updated cell state is used to generate the hidden state or output state
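A NumPy sketch of one LSTM cell step following the gates described above (dimensions and random weights are illustrative assumptions, not the lecture's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the four layers: (f)orget, (i)nput, candidate (g), (o)utput
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell update
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c = f * c_prev + i * g                                 # additive cell-state update
    h = o * np.tanh(c)                                     # hidden state / output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                                         # assumed sizes
W = {k: rng.normal(size=(n_hid, n_in)) for k in 'figo'}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in 'figo'}
b = {k: np.zeros(n_hid) for k in 'figo'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)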
LSTM Cell: Architecture
LSTM Cell: A different architecture
Cho et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. 2014
Outline of this lecture
Why Recurrent Neural Networks (RNNs)
How RNN works
Why Long Short-term Memory (LSTM) Network?
LSTM and Forecasting
Time-series data
Case Study
Objectives of Time-series Analysis
Interpretation
Forecasting
Hypothesis Testing
Simulation
Example: Anti-diabetic Drug Sales
A public dataset
Forecasting: Random walk
Assuming the time-series y(t), t=1,2,3…, is generated from a stochastic model.
The prediction for every horizon is simply set to be the last observed value: y(t+w|t)=y(t) where w is the horizon
A variant is to assume the time-series has a seasonal component with period T, and y(t+w|t)=y(t+w-T).
This model is often used as a benchmark
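A small NumPy sketch of these benchmarks (illustrative only; the toy series and the period are assumptions):

import numpy as np

def naive_forecast(y, horizon):
    # random walk: every future value is the last observed value
    return np.repeat(y[-1], horizon)

def seasonal_naive_forecast(y, horizon, T):
    # y(t+w|t) = y(t+w-T): reuse the last full seasonal cycle
    last_cycle = y[-T:]
    return np.array([last_cycle[w % T] for w in range(horizon)])

y = np.arange(24, dtype=float)                # toy monthly series over two years
print(naive_forecast(y, 3))                   # [23. 23. 23.]
print(seasonal_naive_forecast(y, 3, T=12))    # [12. 13. 14.]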
Forecasting: Seasonal Decomposition
Decompose a time series with seasonality (e.g., weekly, monthly, etc.) into the sum of three components:
y(t) = Season(t) + Trend(t) + Remainder(t)
This method is called additive decomposition
Forecasting: Seasonal Decomposition
Additive decomposition
statsmodels.tsa.seasonal.seasonal_decompose(x, model='additive', filt=None, period=None, two_sided=True, extrapolate_trend=0)
Forecasting: Seasonal Decomposition
Multiplicative decomposition:
y(t) = Season(t) * Trend(t) * Remainder(t)
statsmodels.tsa.seasonal.seasonal_decompose(x, model='multiplicative', filt=None, period=None, two_sided=True, extrapolate_trend=0)
statsmodels.tsa.seasonal.seasonal_decompose — statsmodels
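A short usage sketch (assuming y is a monthly pandas Series with a DatetimeIndex, such as the drug-sales series used in the case study):

from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(y, model='additive', period=12)   # period=12 for monthly data
result.plot()                                                 # observed, trend, seasonal, residual
trend, seasonal, resid = result.trend, result.seasonal, result.resid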
Forecasting: Exponential Smoothing
Classical forecasting method: the forecasts are a weighted average of past observations, with weights that decrease exponentially the further we move back from time t:
y(t+1|t) = a*y(t) + a*(1-a)*y(t-1) + a*(1-a)^2*y(t-2) + ..., where the smoothing parameter a is between 0 and 1
It is possible to extend this basic method to deal with trends/seasonalities.
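A minimal statsmodels sketch (assuming y is a pandas Series of observations; the smoothing level 0.2 and the additive Holt-Winters settings are illustrative choices):

from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

ses = SimpleExpSmoothing(y).fit(smoothing_level=0.2, optimized=False)
print(ses.forecast(3))                       # forecasts for the next 3 periods

# extension with additive trend and seasonality (Holt-Winters)
hw = ExponentialSmoothing(y, trend='add', seasonal='add', seasonal_periods=12).fit()
print(hw.forecast(12))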
Forecasting: ARIMA
Autoregressive (AR) model: the forecast at time t is a linear combination of past values of the variable.
Moving average (MA) model: the forecast at time t is a linear combination of past forecast errors:
y(t) = c + e(t) + θ_1*e(t-1) + ... + θ_q*e(t-q), where q is the model order and e(t) is white noise.
ARIMA: AutoRegressive Integrated Moving Average
Integrating: differencing the time-series to make it stationary.
Combining the above two models
Three hyperparameters: p (order of the autoregressive part), d (degree of differencing) and q (order of the moving average part)
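A minimal statsmodels sketch (assuming y is a pandas Series; the order (2, 1, 1) is an arbitrary illustration, not a tuned choice):

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(2, 1, 1))            # p=2 AR terms, d=1 difference, q=1 MA term
fit = model.fit()
print(fit.summary())
print(fit.forecast(steps=12))                # forecast the next 12 periods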
Other forecasting models
Dynamic linear models: at each time t the forecast is given by a linear model with time-varying coefficients
TBATS: standing for Trigonometric seasonality, Box-Cox transform, ARMA errors, Trend and Seasonal components;
deals with multiple seasonalities by modeling each seasonality with a trigonometric representation based on Fourier series
Prophet: able to deal with multiple seasonalities (developed by Facebook)
The forecast is represented as a combination of trend, seasonality and holiday effects
Formulated in the Bayesian framework
NNETAR: neural network autoregression
Inputs: the most recent lagged values of the sequence up to time t
Outputs: the forecasted value at time t+1
LSTM for forecasting
An LSTM model consists of multiple LSTM cells unrolled over time steps and can model the time-wise dependencies of a time series.
An LSTM model may need a large number of training samples
Case Study: Two examples
LSTM Example 1: Toy Data
LSTM Example 2 : Monthly Anti-Diabetic Drug Sales
LSTM Example 1: Toy Data
Inputs: a sequence of data [10,20,30,40,50,60,70,80,90]
Objective: forecast the next number
Method: develop an LSTM model with window length n_len = 3
Step 1: sample multiple sub-sequences of length n_len; for each sub-sequence, use the next number in the series as the output label, which yields multiple (X, y) pairs
X=[10 20 30], y=[40]
X=[20 30 40], y=[50]
X=[30 40 50], y=[60]
Step 2: Train an LSTM model
Step 3: Apply the trained model to a test sequence X = [70, 80, 90] to obtain its prediction
LSTM Example 1: Toy Data
Preparing training data for LSTM
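A sketch of this step (based on the toy series from the slides; the variable names are assumptions):

import numpy as np

series = [10, 20, 30, 40, 50, 60, 70, 80, 90]
n_len = 3

X, y = [], []
for i in range(len(series) - n_len):
    X.append(series[i:i + n_len])            # e.g. [10, 20, 30]
    y.append(series[i + n_len])              # e.g. 40
X = np.array(X).reshape(-1, n_len, 1)        # Keras LSTM expects (samples, time steps, features)
y = np.array(y)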
LSTM Example 1: Toy Data
Network architecture for the LSTM model
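A minimal Keras sketch of such a network (the 50 hidden units are an assumed choice, not necessarily the lecture's exact architecture):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_len, 1)),        # n_len time steps, one feature per step
    layers.LSTM(50),                         # LSTM layer over the input window
    layers.Dense(1),                         # regression output: the next value
])
model.compile(optimizer='adam', loss='mse')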
LSTM Example 1: Toy Data
Training and Evaluating the LSTM model
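A sketch of training and prediction, assuming X, y and model from the previous sketches (the number of epochs is arbitrary):

model.fit(X, y, epochs=300, verbose=0)

x_test = np.array([70, 80, 90]).reshape(1, n_len, 1)
print(model.predict(x_test))                 # expected to be close to 100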
LSTM Example 2
Data: anti-diabetic drug sales in Australia from 1991 to 2008
LSTM Example 2
Idea: remove the seasonal component and model only the trend and remainder components.
LSTM Example 2
Step 1: Load data
Step 2: Removing the seasonal factors
Step 3: Preparing training and testing samples
Step 4: Create a LSTM model
Step 5: Train and Evaluate the model
LSTM Example 2
Step 1: Load data
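A sketch of this step, assuming the public dataset is stored as a CSV with a date column and a value column (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv('anti_diabetic_drug_sales.csv', parse_dates=['date'], index_col='date')
y = df['value']                              # monthly sales, 1991 to 2008
print(y.head())
y.plot(title='Monthly anti-diabetic drug sales')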
LSTM Example 2
Step 1: Load data
Step 2: Removing the seasonal factors
Decomposition plots of the original sequence (blue: trend; green: seasonality; yellow: remainder; yellow: trend + remainder)
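A sketch of this step, assuming y from Step 1: decompose the series and keep only trend + remainder, so the LSTM does not have to learn the seasonal pattern:

from statsmodels.tsa.seasonal import seasonal_decompose

decomp = seasonal_decompose(y, model='additive', period=12)
seasonal = decomp.seasonal
deseasonalized = y - seasonal                # trend + remainder only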
LSTM Example 2
Step 1: Load data
Step 2: Removing the seasonal factors
Step 3: Preparing training and testing samples
LSTM Example 2
Step 1: Load data
Step 2: Removing the seasonal factors
Step 3: Preparing training and testing samples
LSTM Example 2
Step 1: Load data
Step 2: Removing the seasonal factors
Step 3: Preparing training and testing samples
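A sketch of this step, assuming the deseasonalized series from Step 2; scaling with scikit-learn's MinMaxScaler and the window length n_len = 12 are assumed choices:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

n_len = 12
values = deseasonalized.values.reshape(-1, 1)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(values).flatten()

X, y_out = [], []
for i in range(len(scaled) - n_len):
    X.append(scaled[i:i + n_len])            # one year of history as input
    y_out.append(scaled[i + n_len])          # the following month as the label
X = np.array(X).reshape(-1, n_len, 1)
y_out = np.array(y_out)

split = int(0.8 * len(X))                    # keep the time order: first 80% for training
X_train, X_test = X[:split], X[split:]
y_train, y_test = y_out[:split], y_out[split:]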
LSTM Example 2
Step 1: Load data
Step 2: Removing the seasonal factors
Step 3: Preparing training and testing samples
Step 4: Create a LSTM model
LSTM Example 2
Step 1: Load data
Step 2: Removing the seasonal factors
Step 3: Preparing training and testing samples
Step 4: Create a LSTM model
Step 5: Train and Evaluate the model
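A sketch covering Steps 4 and 5, assuming the arrays, scaler and seasonal component from the previous sketches (architecture and epochs are assumed choices); the seasonal factors are added back to obtain the final forecast:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_len, 1)),
    layers.LSTM(50),
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=100, verbose=0)

pred_scaled = model.predict(X_test).flatten()
pred = scaler.inverse_transform(pred_scaled.reshape(-1, 1)).flatten()

test_index = deseasonalized.index[split + n_len:]      # months covered by the test windows
forecast = pred + seasonal.loc[test_index].values      # add the seasonal factors back
mae = np.mean(np.abs(forecast - y.loc[test_index].values))
print('test MAE:', mae)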
Last, but most important:
Visualizing the predictions
Last, but most important:
Visualizing the predictions
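A plotting sketch, assuming y, test_index and forecast from the previous sketches:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(y, label='actual sales')
plt.plot(test_index, forecast, label='LSTM forecast')
plt.legend()
plt.title('Monthly anti-diabetic drug sales: actual vs. predicted')
plt.show()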
More Layers in TensorFlow
Core layers
Dense layers
Activation layers
Embedding layers
Masking layers
Lambda layers
Convolution Layers
Pooling layers
Recurrent layers
Preprocessing layers
Normalization layers
Regularization layers
Attention layers
Reshaping layers
Merging layers
Locally-connected layers
Activation layers
https://keras.io/api/
More Loss Functions in TensorFlow
Cross-entropy loss
Poisson:
loss = y_pred - y_true * log(y_pred)
KLDivergence:
loss = y_true * log(y_true / y_pred)
MeanSquaredError:
loss = square(y_true - y_pred)
MeanAbsoluteError:
loss = abs(y_true - y_pred)
MeanAbsolutePercentageError:
loss = 100 * abs(y_true - y_pred) / y_true
CosineSimilarity:
loss = -sum(l2_norm(y_true) * l2_norm(y_pred))
Hinge loss:
loss = maximum(1 - y_true * y_pred, 0)
Squared Hinge loss:
loss = square(maximum(1 - y_true * y_pred, 0))
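A short sketch of selecting these losses in Keras, either as classes or by string name (model is assumed to be an already-built tf.keras model):

import tensorflow as tf

model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())
# equivalently by name, e.g. for mean absolute error:
model.compile(optimizer='adam', loss='mean_absolute_error')
# distribution-style losses work the same way:
model.compile(optimizer='adam', loss=tf.keras.losses.KLDivergence())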
More optimizers in TensorFlow
SGD: Stochastic Gradient Descent
Mini-batch Gradient Descent
Momentum-based Gradient Descent
V = constant * V - learning_rate * gradient, where the constant is the momentum coefficient; the weights are then updated by W = W + V
ADAM: Adaptive Moment Estimation
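A short sketch of configuring these optimizers in Keras (the learning rates and momentum value are illustrative defaults; model is an assumed tf.keras model):

import tensorflow as tf

sgd = tf.keras.optimizers.SGD(learning_rate=0.01)                         # plain (stochastic) gradient descent
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # momentum-based variant
adam = tf.keras.optimizers.Adam(learning_rate=0.001)                      # adaptive moment estimation

# mini-batch gradient descent is obtained by choosing batch_size in model.fit(...)
model.compile(optimizer=adam, loss='mse')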
Outline of this lecture
Why Recurrent Neural Networks (RNNs)
How RNN works
Why Long Short-term Memory (LSTM) Network?
LSTM and Forecasting
Case Study
https://arxiv.org/pdf/1412.6980.pdf