QBUS 6840 Lecture 12
Predictive Analytics with Neural Networks and Deep Learning III
The University of Sydney Business School
Neural Networks for Time Series
Recurrent Neural Networks (RNNs)
Neural Network Autoregression
Long Short-Term Memory (LSTM)
Other Variants of Recurrent Neural Networks
An Application Example
Online textbook Section 11.3
https://otexts.com/fpp2/nnetar.html;
Teaching Slides; and
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Objectives
Know Neural Network Autoregression for Time Series and how
seasonality is handled
Be able to organise time series data to train and test a neural network
Understand some basic concepts of Recurrent Neural
Networks (RNNs)
Be able to explain exponential smoothing (ES) as an RNN model
Be able to explain training issues in RNN models
Know Long Short-Term Memory (LSTM)
Online Resource
The neural forecasting website http://www.neural-forecasting-competition.com/index.htm
Particularly the motivation page http://www.neural-forecasting-competition.com/motivation.htm
A beginner's guide: http://neuroph.sourceforge.net/TimeSeriesPredictionTutorial.html
https://arxiv.org/abs/1407.5949
Time series with LSTM: Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras,
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-
Neural Networks for Time Series: Autoregression
For time series, we use lagged values of the series as inputs to a neural network, and the network output is the prediction.
This means, in general, the number of input neurons equals the number of lags used, and there is only one output neuron.
We will first consider feed-forward networks with one hidden layer. An example is sketched below.
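As a minimal sketch (illustrative only, not the lecture's code; the names nnar_forward, W1, b1, w2 and b2 are assumptions), the forward pass of such a one-hidden-layer network on p lagged inputs can be written in NumPy:

import numpy as np

def nnar_forward(lags, W1, b1, w2, b2):
    # One-hidden-layer feed-forward pass for NNAR(p, k).
    # lags : the p most recent observations (y_{t-1}, ..., y_{t-p})
    # W1   : (k, p) input-to-hidden weights, b1 : (k,) hidden biases
    # w2   : (k,)   hidden-to-output weights, b2 : scalar output bias
    hidden = np.tanh(W1 @ lags + b1)   # k hidden neurons with tanh activation
    return w2 @ hidden + b2            # single linear output neuron (the prediction)

# Example: NNAR(4, 3) with randomly initialised (untrained) weights
rng = np.random.default_rng(0)
p, k = 4, 3
W1, b1 = rng.normal(size=(k, p)), np.zeros(k)
w2, b2 = rng.normal(size=k), 0.0
print(nnar_forward(np.array([112.0, 118.0, 132.0, 129.0]), W1, b1, w2, b2))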
Neural Networks for Time Series: Autoregression
We denote by NNAR(p, k) a neural network with p lagged inputs, k neurons in the hidden layer, and one output as the forecast.
For example, an NNAR(12, 50) model is a neural network that uses the last 12 observations (yt−1, yt−2, ..., yt−12) to fit yt at any time step t, with 50 neurons in the hidden layer.
An NNAR(p, 0) model is equivalent to an ARIMA(p, 0, 0) model without the restrictions on the parameters that ensure stationarity.
For an NNAR(1, 0) model, we may use φ1Yt−1 to predict Yt for any φ1; however, for an AR(1) model Yt = c + φ1Yt−1 + εt to be stationary we require −1 < φ1 < 1.
Note that for now we focus on one-step-ahead forecasts.

Neural Networks for Time Series: Training
Recall how we learn an ARIMA(p, 0, 0) model
yt = c + φ1yt−1 + φ2yt−2 + · · · + φpyt−p + εt
Suppose the given time series is T = {y1, y2, y3, · · · , yT−1, yT}.
We form the training data as
y1, y2, ..., yp → yp+1 : εp+1 = yp+1 − c − φ1yp − φ2yp−1 − · · · − φpy1
y2, y3, ..., yp+1 → yp+2 : εp+2 = yp+2 − c − φ1yp+1 − φ2yp − · · · − φpy2
y3, y4, ..., yp+2 → yp+3 : εp+3 = yp+3 − c − φ1yp+2 − φ2yp+1 − · · · − φpy3
...
yT−p, yT−p+1, ..., yT−1 → yT : εT = yT − c − φ1yT−1 − φ2yT−2 − · · · − φpyT−p
To find all the coefficients c, φ1, ..., φp, we minimise
E(c, φ1, ..., φp) = ε²p+1 + ε²p+2 + ε²p+3 + · · · + ε²T
over c, φ1, ..., φp.

Neural Networks for Time Series: Training
Consider the neural network NNAR(p, k). For a given section of the time series yt−p+1, yt−p+2, ..., yt, denote the output from NNAR(p, k) by
ŷt+1 = F(yt−p+1, yt−p+2, ..., yt ; W)
where W collects all the weights in the network. We can form the same kind of training data:
y1, y2, ..., yp → yp+1 : εp+1 = yp+1 − F(y1, y2, ..., yp ; W)
y2, y3, ..., yp+1 → yp+2 : εp+2 = yp+2 − F(y2, y3, ..., yp+1 ; W)
y3, y4, ..., yp+2 → yp+3 : εp+3 = yp+3 − F(y3, y4, ..., yp+2 ; W)
...
yT−p, yT−p+1, ..., yT−1 → yT : εT = yT − F(yT−p, yT−p+1, ..., yT−1 ; W)
To find all the weights W, we minimise
E(W) = ε²p+1 + ε²p+2 + ε²p+3 + · · · + ε²T
The BP (backpropagation) algorithm can be applied to minimise this objective.
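As a sketch of how these lagged training pairs can be built in practice (illustrative code, not the lecture's script; the function name make_patterns is an assumption):

import numpy as np

def make_patterns(y, p):
    # Turn a series y_1, ..., y_T into NNAR(p, k) training pairs:
    # (y_1, ..., y_p) -> y_{p+1}, (y_2, ..., y_{p+1}) -> y_{p+2}, ...
    # Each row of X holds p consecutive lagged values; target holds the next observation.
    y = np.asarray(y, dtype=float)
    X = np.array([y[i:i + p] for i in range(len(y) - p)])
    target = y[p:]
    return X, target

y = [10.0, 12.0, 13.0, 12.5, 14.0, 15.5, 16.0]
X, target = make_patterns(y, p=3)
print(X)       # rows: (y1,y2,y3), (y2,y3,y4), (y3,y4,y5), (y4,y5,y6)
print(target)  # y4, y5, y6, y7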
Neural Networks for Seasonal Time Series
In addition to the lagged values as inputs, for seasonal time series it is useful to also add the last observed values from the same season as inputs.
The notation NNAR(p, P, k)m means a model with inputs (yt−1, yt−2, ..., yt−p, yt−m, yt−2m, ..., yt−Pm) and k neurons in the hidden layer.
What is the input for NNAR(3, 2, 10)6?
An NNAR(p, P, 0)m model is similar to a special ARIMA(Pm, 0, 0) model without the restrictions on the parameters that ensure stationarity.
This time, how do you organise your training data? [See an example on Slide 14.]

Modeling Time Series with Neural Networks
The Python package keras implements most neural networks in a user-friendly way.
Please note keras relies on a backend, either Theano or TensorFlow, for the heavy symbolic computation, e.g. automatically creating the BP algorithm.
Normally we can simply use theano, which can be easily installed in Anaconda. In fact, we don't need to understand how theano works behind the scenes.
If there is no C++ compiler or compiled binary library on your desktop/laptop, the speed could be very slow. I hope you are ready.

A simple Recipe
1 Exploratory data analysis: Apply some of the traditional time series analysis methods to estimate the lag dependence in the data (e.g. auto-correlation and partial auto-correlation plots, transformations, differencing). For example, the lag may be 4.
2 Split your data into two main sections: a training (in-sample) section {Y1, Y2, ..., Yn} and a test (out-of-sample) section {Yn+1, Yn+2, ..., YT}. For example, for a time series with three years of data, you may use the data in the first two years for training and the data in the third year for testing.
3 Define the neural network architecture: Although you have freedom to decide how many hidden layers (depth) and how many neurons on each hidden layer, picking appropriate numbers for them is not easy. This is actually part of the model selection.

A simple Recipe
4 Create the training patterns: Each training pattern contains p + 1 values, with the first p (in this example p = 4) corresponding to the input neurons and the last one defining the prediction at the output node. Here are all the training patterns for the training data {Y1, Y2, ..., Yn} for an NNAR(4, k):
Training Pattern 1: Y1, Y2, Y3, Y4 → Y5
Training Pattern 2: Y2, Y3, Y4, Y5 → Y6
Training Pattern 3: Y3, Y4, Y5, Y6 → Y7
...
Training Pattern n − 4: Yn−4, Yn−3, Yn−2, Yn−1 → Yn

A simple Recipe
5 Train the neural network on these training patterns.
6 Test/predict the network on the test data {Yn+1, Yn+2, ..., YT}: here you pass in four values as the input layer and see what the output node gives. Forecasts are marked with a hat.
Test Pattern 1: Yn−3, Yn−2, Yn−1, Yn → Ŷn+1
Test Pattern 2: Yn−2, Yn−1, Yn, Yn+1 → Ŷn+2
Test Pattern 3: Yn−1, Yn, Yn+1, Yn+2 → Ŷn+3
Test Pattern 4: Yn, Yn+1, Yn+2, Yn+3 → Ŷn+4
Test Pattern 5: Yn+1, Yn+2, Yn+3, Yn+4 → Ŷn+5
Test Pattern 6: Yn+2, Yn+3, Yn+4, Yn+5 → Ŷn+6
Test Pattern 7: Yn+3, Yn+4, Yn+5, Yn+6 → Ŷn+7

A Sample Recipe: Training Patterns for NNAR(3, 2, k)4
1 This is a network with 5 inputs for a seasonal time series of period M = 4: three inputs from ordinary lagged observations and two from seasonal lags.
2 As we use the seasonal lags for prediction, we must look back at least P × M = 2 × 4 = 8 time units. That means the first training pattern is to predict Y9:
Training Pattern 1: Y1, Y5, Y6, Y7, Y8 → Y9
3 Here t = 9, so the inputs (Yt−1, Yt−2, Yt−3, Yt−M, Yt−2M) become (Y8, Y7, Y6, Y5, Y1).
4 The other patterns are
Training Pattern 2: Y2, Y6, Y7, Y8, Y9 → Y10
Training Pattern 3: Y3, Y7, Y8, Y9, Y10 → Y11
Training Pattern 4: Y4, Y8, Y9, Y10, Y11 → Y12
...
Last Training Pattern: Yn−8, Yn−4, Yn−3, Yn−2, Yn−1 → Yn

The Example
Dataset: International Airport Arriving Passengers (csv file).
Prepare the data: Xtrain, Ytrain, Xtest and Ytest with time lag 4.
Define the NN architecture:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(30, input_dim=time_lag, activation='relu'))
model.add(tf.keras.layers.Dense(1))
This defines a network with 30 neurons in the hidden layer. See Lecture12_Example01.py.

The example: Forecasting
[Figure: forecasts from networks with Hidden = 3, Hidden = 30, Hidden = 100 and Hidden = 500 neurons in the hidden layer.]

Multi-step-ahead (dynamic) forecast with NN
In the real world we do not have access to the future values, so for multi-step-ahead forecasting we have to perform a so-called dynamic forecast.
This forecast strategy uses observations up to some time point. From that point onwards we append our latest forecast to the list and use a combination of real values and forecast values as input features.
Eventually, if the forecast length exceeds the window of values used, we will be using forecast values only.
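A minimal sketch of this dynamic forecasting loop (illustrative only; it assumes a Keras model already trained on lag-p input patterns as in the example above, and the function name dynamic_forecast is not from the lecture code):

import numpy as np

def dynamic_forecast(model, history, p, steps):
    # Recursive multi-step-ahead forecast with a lag-p network.
    # history : observed series up to the forecast origin
    # steps   : number of steps ahead to forecast
    # Each new forecast is appended to the window, so later inputs mix observed
    # values with earlier forecasts (and eventually use forecasts only).
    window = list(np.asarray(history, dtype=float)[-p:])   # last p observations
    forecasts = []
    for _ in range(steps):
        x = np.array(window[-p:]).reshape(1, p)            # shape (1, p) for Keras
        yhat = float(model.predict(x, verbose=0)[0, 0])    # one-step-ahead forecast
        forecasts.append(yhat)
        window.append(yhat)                                # feed the forecast back in
    return forecasts

# Usage sketch (assumes model, Ytrain and time_lag as in Lecture12_Example01.py):
# future = dynamic_forecast(model, Ytrain, p=time_lag, steps=12)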
What are RNNs?
Ordinary NNs can be trained for time series forecasting, but the dependence information across time is totally ignored, resulting in less accurate predictions or forecasts.
One approach is to represent time effects explicitly via some simple functions, often linear functions, of the lagged values of the time series; this is the mainstream time series data analysis approach in the statistics literature.
Well-known models: AR or ARMA, etc.
The idea behind recurrent NNs (RNNs) is to make use of sequential information.

RNNs: How?
Recurrent neural networks (RNNs) are an approach representing time effects implicitly via latent variables (also called hidden states).
The hidden states are designed to store the memory of the dynamics in the data. They are updated in a recurrent manner, using the information carried over by their values from the previous time steps and the information from the data at the current time step.
The RNN was first developed in cognitive science and has been successfully used in computer science and other fields.

Recurrent neural network (RNN)
Let the time series data be {Dt = (xt, yt), t = 1, 2, ...} where xt is the vector of inputs and yt the output.
E.g., yt: sales at time t; xt,1 = yt−1; xt,2: ads hours at time t − 1; xt,3: consumption expenditure index at time t − 1, etc.
For ease of understanding, it might be useful to think of xt as a scalar; however, RNNs are often efficiently used to model multivariate time series.
If the time series of interest has the form {yt : t = 1, 2, ...}, it can be written as {(xt, yt) : t = 2, 3, ...} with xt = yt−1, or {(xt, yt) : t = p + 1, p + 2, ...} with xt = (yt−1, yt−2, ..., yt−p).
The goal is to estimate the prediction E(yt | xt, D1:t−1).

Recurrent neural network (RNN)
First, let's use a feedforward neural network (FNN) to transform the raw input data xt into a set of hidden units (states) ht (for the purpose of predicting yt). But we need to take into account the dynamics/serial correlation of the time series data.
The main idea behind the RNN is to let the set of hidden units ht feed itself using its value ht−1 from the previous time step.
Hence, the RNN can be best thought of as an FNN that allows a connection of the hidden units to their value from the previous time step, which enables the network to possess memory.

Recurrent neural network (RNN)
Mathematically, this basic RNN model is written as
ht = f(Uxt + Wht−1 + b)
ηt = β0 + β1ht
yt = ηt + εt
where εt is white noise with mean 0 and variance σ².
As an example, for time series we may take xt = yt−1.
U, W, b, β0 and β1 are model parameters (to be estimated), and f(·) is a non-linear activation function (e.g. the tanh or sigmoid function).
Usually we can set h0 = 0, i.e. the neural network initially doesn't have any memory.

Graphical Representation
[Figure: graphical representation of the basic RNN model; the black square indicates the delay of one time step.]

Recurrent neural network (RNN): A Special Case
Special case: if the time series of interest has the form {yt : t = 1, 2, ...} and we take xt = yt−1 as the input:
ht = f(Uyt−1 + Wht−1 + b)
ηt = β0 + β1ht
yt = ηt + εt
where the εt are white noise with mean 0 and variance σ².
Forecast and variance:
ŷt|1:t−1 = E(yt | y1:t−1) = ηt
V(yt | y1:t−1) = σ²

An Example
A sample RNN model for {yt}:
ht = σ(0.4yt−1 + 0.3ht−1 + 0.01)
ηt = 0.02 + 0.5ht
yt = ηt + εt
where σ(x) = 1/(1 + exp(−x)) is the sigmoid activation and the εt are white noise with mean 0 and variance σ².
Forecast and variance:
ŷt|1:t−1 = E(yt | y1:t−1) = ηt
V(yt | y1:t−1) = σ²
Suppose we know ŷt|1:t−1 = 0.7; given yt = 0.9, what is ŷt+1|1:t?
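To make the recursion concrete, here is a minimal sketch of this example model in Python (illustrative only; the starting value h = 0 follows the convention h0 = 0, and the short input series is made up):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(y_prev, h_prev):
    # One step of the example RNN:
    #   h_t   = sigmoid(0.4 * y_{t-1} + 0.3 * h_{t-1} + 0.01)
    #   eta_t = 0.02 + 0.5 * h_t, which is also the forecast yhat_{t|1:t-1}
    h = sigmoid(0.4 * y_prev + 0.3 * h_prev + 0.01)
    eta = 0.02 + 0.5 * h
    return h, eta

# Run the recursion over a short series, starting from h_0 = 0
h = 0.0
for y_prev in [0.5, 0.9, 0.7]:
    h, forecast = rnn_step(y_prev, h)
    print(f"h = {h:.3f}, one-step-ahead forecast = {forecast:.3f}")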
Recall: Exponential Smoothing
Simple Exponential Smoothing (SES): we update the level according to
lt = αyt + (1 − α)lt−1, and yt+1 = lt + εt+1
This is the basic RNN model with ht = lt−1, U = α, W = 1 − α, b = 0, f(z) = z, β0 = 0 and β1 = 1. Here we consider xt = yt−1 to match the general model on Slide 22.
The output at step t could, depending on the task, be a vector or a scalar; we set it as a scalar for time series. In our case it is computed as ηt = β0 + β1ht, a normal linear mapping.
We see that for SES, β0 = 0 and β1 = 1.

Outputs in RNNs (optional)
Holt's Linear Smoothing (Trend Corrected Exponential Smoothing):
lt = αyt + (1 − α)lt−1 + (1 − α)bt−1
bt = βαyt − αβlt−1 + (1 − αβ)bt−1
Stacking the level and trend into the hidden state ht+1 = (lt, bt)ᵀ, the two updates can be written as
ht+1 = Uyt + Wht
with
U = (α, αβ)ᵀ
W = [ 1 − α    1 − α
      −αβ     1 − αβ ]
b = 0 and f(z) = z, with xt+1 = yt to match the general model on Slide 22.
In this case
yt+1 = lt + bt + εt+1 = 0 + [1 1]ht+1 + εt+1 = β0 + β1ht+1 + εt+1
so β0 = 0 and β1 = [1 1].
It seems that the output ηt at time step t is calculated solely from the memory ht at time t, but it is implicitly determined by all the past values of the time series.

The Structure
The overall network looks like a feed-forward neural network unrolled along time; however, the information is used recurrently.
Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters (U, W and β above) across all time steps.
The total number of parameters to be learned is therefore greatly reduced compared with a traditional deep neural network.
The previous RNN diagram has outputs at each time step, but depending on the task this may not be necessary. For time series forecasting, we only care about the last output as the forecast for time step t + 1.

Recurrent neural network (RNN): training
Given a training data set of time series {Dt = (xt, yt) : t = 1, 2, ..., T}, the sum of squared errors is
SSE(θ) = Σt (yt − ŷt|1:t−1)²
The model parameters θ = (U, W, b, β0, β1) are estimated by minimizing the SSE.

Training RNNs
Training an RNN is similar to training a traditional neural network, using the so-called backpropagation algorithm, but in a revised way.
The architecture shows that the parameters are shared across all time steps, so the gradient should be computed by taking into account all the previous time steps. For example, the gradient at time step t = 4 should be backpropagated 3 steps to accumulate the contributions from time steps 3, 2 and 1.
In many cases, the calculation involves recurrent multiplication of parameter matrices. This is called Backpropagation Through Time (BPTT).

An Example
Consider the case in which we use two hidden states for each time step, so that ht = (ht1, ht2)ᵀ.
The input xt (a scalar) is from a time series (i.e. xt = yt−1). Then the parameters are in matrix form: U is 2 × 1, W is 2 × 2, b is 2 × 1, β0 is a scalar and β1 = [β11, β12].
Please note the dimension of each parameter matrix. The total number of parameters is 2 + 4 + 2 + 3 = 11 (U, W, b, and β0 together with β11, β12). The training becomes estimating these 11 parameters by minimising, e.g., the mean squared errors.

An Example
Mathematically this defines the model
ht = f(Uxt + Wht−1 + b)
ηt = β0 + [β11, β12]ht
yt = ηt + εt
with, e.g., xt = yt−1.
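A minimal sketch of this two-hidden-state example (illustrative only: the random parameter values, the tanh activation and the toy series are assumptions), showing the parameter shapes and the unrolled SSE objective:

import numpy as np

rng = np.random.default_rng(1)

# Parameter shapes for the two-hidden-state example (11 parameters in total):
U = rng.normal(size=(2, 1))    # 2 parameters: input -> hidden
W = rng.normal(size=(2, 2))    # 4 parameters: hidden -> hidden
b = np.zeros(2)                # 2 parameters: hidden bias
beta0 = 0.0                    # 1 parameter : output intercept
beta1 = rng.normal(size=2)     # 2 parameters: [beta11, beta12]

def sse(y, U, W, b, beta0, beta1):
    # Unroll the RNN over the series (with x_t = y_{t-1}) and return
    # the sum of squared one-step-ahead errors sum_t (y_t - eta_t)^2.
    h = np.zeros(2)                          # h_0 = 0: no memory initially
    total = 0.0
    for t in range(1, len(y)):
        x = np.array([y[t - 1]])             # x_t = y_{t-1}
        h = np.tanh(U @ x + W @ h + b)       # h_t = f(U x_t + W h_{t-1} + b)
        eta = beta0 + beta1 @ h              # eta_t = beta0 + [beta11, beta12] h_t
        total += (y[t] - eta) ** 2
    return total

y = np.array([0.5, 0.6, 0.55, 0.7, 0.65])
print(sse(y, U, W, b, beta0, beta1))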
The activation function f introduces nonlinearity.
The training is to minimise the following objective function with respect to the parameters U, W, b, β0, β1:
L(U, W, b, β) = Σt (yt − ηt)² = Σt (yt − β0 − [β11, β12] f(Uxt + Wht−1 + b))²

Another Example
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Basic Architectures
The "many-to-one" is the most suitable architecture for time series forecasting.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Successful Application in NLP (Natural Language Processing)
Given a sequence of words, we want to predict the probability of each word given the previous words.
You can produce "new" Shakespeare dramas after training an RNN with Shakespeare's works.
Google is doing Machine Translation with RNNs.
Image Source: http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf

Issues with the basic RNN
The unfolded graph suggests that the hidden state at time t is a composite function
ht = g(xt, g(xt−1, ..., g(x1, h0)))
where
g(xt, ht−1) = f(Uxt + Wht−1 + b)
Training a statistical model based on a many-layer composite function is often challenging: its gradient is either vanishing or exploding.

Issues with the basic RNN (optional)
The gradient of SSE(θ) with respect to a model parameter, say W, obtained by the chain rule contains products over time steps of factors involving f′(Uxt + Wht−1 + b) and W, where f′(·) is the derivative of the activation function f(·), which is always between 0 and 1 if f(·) is the tanh or sigmoid activation function.
Consequently, the gradient might either explode or vanish if T is sufficiently large and W is not equal to 1.
The exploding gradient problem: the gradient gets too large, making the optimization highly unstable.
The vanishing gradient problem: the gradient is close to zero, making the learning process too slow.

Long Short-Term Memory (LSTM)