Liyuan Xing, Gleb Sizov, Odd Erik Gundersen
Time Series Forecasting
Stand in the present and forecast the future
Lecture 3
Introduction to time series forecasting (Liyuan)
Data exploration by time series graphics (Liyuan)
Statistical methods (Liyuan)
• Time series decomposition
• Exponential smoothing
• ARIMA models
• Time series regression models
Evaluation (Liyuan)
• (Nested) cross validation
• Forecast error types and measures
Lecture 7
Machine learning methods (Gleb)
• Statistical methods vs machine learning, M4 competition
• Recurrent neural networks
• Hybrid models
• Global vs local models
Uncertainty and forecast distribution (Liyuan)
• Prediction interval, quantile
• Quantile models
Case study: Projects in TrønderEnergi (Odd Erik)
• Wind power production forecasting
• Grid loss forecasting
Machine learning methods
• Statistical methods vs machine learning, M4 competition
• Recurrent neural networks
• Hybrid models
• Global vs local models
Uncertainty and probability forecast
Point forecast
• Statistical methods: regression, ETS, ARIMA
• Machine learning methods: NN, DL, regression trees, SVR
Probability forecast: prediction interval, quantile
• Parametric models: Johnson's Su, skew-t distribution
• Quantile models: regression models, deep learning, tree-based models
Motivation
Why is a point forecast NOT enough, and why is uncertainty estimation important?
• Real-world problems often extend beyond predicting means
• Capacity planning problems
• Example: capacity in Google's data centers to support spikes in compute workloads
• Forecasting the capacity to support up to the 95th or higher percentiles of our forecasts
• It is more important to get the upper quantiles correct than to actually get the point forecasts correct
• Supply chain planning
• Price prediction for trading
• …
Prediction intervals are just as important as the point forecast itself
• The difference in prediction intervals results in two very different forecasts
• The second forecast calls for much higher capacity reserves to allow for the possibility of a large increase in demand
• The further ahead we forecast, the more uncertain we are
Uncertainty
The thing we are trying to forecast is unknown (or we would not be forecasting it), and so we can think of it as a random variable
• Example: the total sales for next month could take a range of possible values, and until we add up the actual sales at the end of the month, we don't know what the value will be; until then, it is a random quantity
• Predictions are never absolute, and it is imperative to know the potential variations
At least four sources of uncertainty in forecasting using time series models
• The random error term
• The parameter estimates
• The choice of model for the historical data
• The continuation of the historical data generating process into the future
Predictions are themselves random variables with a distribution
Point forecasts
• Only represent the expected prediction
• Do not model uncertainty
• The middle of the range of possible values the random variable could take
Probability forecasts
• Represent the prediction distribution
• Model the uncertainty of the forecasting error
• A prediction interval gives a range of values the random variable could take with relatively high probability
Probability forecasting methods
Parametric models
• The forecast is given by a full parameterization of the probability distribution of a random variable p, defined by its cumulative distribution function (CDF) F(p)
• Limited by the distribution assumption
Quantile models
• F(p) is approximated by building quantile models qα(θ, x)
• The quantile qα of p is the value such that the probability of p being less than or equal to it is α, i.e. α = F(qα)
• More general: no distributional assumptions needed
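A minimal sketch contrasting the two approaches on synthetic data, assuming a normal fit for the parametric case (numpy and scipy are used for convenience; all names are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
errors = rng.normal(loc=0.0, scale=2.0, size=1000)  # synthetic forecast errors
alpha = 0.9

# Parametric model: assume a distribution family, fit its parameters,
# then invert the fitted CDF to obtain the quantile.
mu, sigma = errors.mean(), errors.std()
q_parametric = stats.norm.ppf(alpha, loc=mu, scale=sigma)

# Quantile model (simplest form): read off the empirical quantile directly,
# with no distributional assumption.
q_empirical = np.quantile(errors, alpha)

print(q_parametric, q_empirical)  # both estimate the 90% quantile of the errors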
Prediction interval
A prediction interval is an estimate of an interval [Q(1−α)/2, Q(1+α)/2] in which a future observation will fall with a certain probability α, 0 < α < 1
If the forecast error follows a normal distribution
• The prediction intervals for normal distributions are easily calculated from the ML estimates of the mean μ and the standard deviation σ
• α = 80%: [μ − 1.28σ, μ + 1.28σ]
• α = 95%: [μ − 1.96σ, μ + 1.96σ]
• Normality is assumed by linear regression, ETS, and ARIMA
If the forecast error is not normally distributed, but uncorrelated
• Bootstrapped prediction intervals
• Assuming future errors will be similar to past errors, replace the future error by sampling from the collection of errors we have seen in the past (i.e., the residuals)
• Compute prediction intervals by calculating percentiles
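A numpy sketch of both cases, assuming an array resid of historical one-step residuals from a fitted model is already available (function and variable names are illustrative):

import numpy as np

def normal_interval(point_forecast, resid, z=1.28):
    # 80% interval under the normality assumption: mu +/- 1.28 * sigma
    # (use z=1.96 for a 95% interval)
    sigma = np.std(resid, ddof=1)
    return point_forecast - z * sigma, point_forecast + z * sigma

def bootstrap_interval(point_forecast, resid, alpha=0.8, n_sim=10000, seed=0):
    # Assume future errors resemble past errors: sample residuals with
    # replacement and read the interval off the simulated percentiles.
    rng = np.random.default_rng(seed)
    simulated = point_forecast + rng.choice(resid, size=n_sim, replace=True)
    lower = np.percentile(simulated, 100 * (1 - alpha) / 2)
    upper = np.percentile(simulated, 100 * (1 + alpha) / 2)
    return lower, upper

resid = np.random.default_rng(1).normal(0.0, 2.0, size=500)  # stand-in residuals
print(normal_interval(100.0, resid))
print(bootstrap_interval(100.0, resid))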
Quantiles
Quantiles are a generalization of prediction intervals and no assumption about distribution is needed
• Example: a 90% prediction interval equals the interval [q5, q95] of quantiles
Cut points
• dividing the range of a probability distribution into continuous intervals with equal probabilities
• dividing the observations in a sample in the same way
A quantile is the value below which a fraction of observations in a group falls
• Example: a prediction for quantile 0.9 should over-predict 90% of the time
Empirical quantiles
Empirical quantile distribution
• A step function that jumps up by 1/n at each of the n data points
Point prediction plus quantile function of errors
• 1. Consider past point forecasts at hour h
• 2. Compute historical forecasting errors
• 3. Compute the empirical quantile distribution of the errors
• 4. Quantile function of price at hour h: point forecast plus the error quantiles
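A sketch of these four steps on synthetic data (the prices, forecasts, and the new point forecast of 52.0 are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
actuals = rng.normal(50.0, 5.0, size=365)             # actual price at hour h, one year
past_forecasts = actuals + rng.normal(0.0, 2.0, 365)  # 1. past point forecasts at hour h

errors = actuals - past_forecasts                     # 2. historical forecasting errors

alphas = np.arange(0.05, 1.0, 0.05)
error_quantiles = np.quantile(errors, alphas)         # 3. empirical quantile distribution of errors

new_point_forecast = 52.0
price_quantiles = new_point_forecast + error_quantiles  # 4. quantile function of price at hour h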
Quantile loss
[Figure: the quantile loss as a function of the error, with weight q on under-estimates (positive error) and weight 1 − q on over-estimates (negative error)]
np.maximum(q * e, (q - 1) * e), e = y-y_hat
For true values y, predicted values y_hat, and a desired quantile q
• Quantile 50% weights under-estimates equally to over-estimates
• The closer the desired quantile is to 0%, the more this loss function penalizes estimates above the true value, i.e. it assigns more loss to over-estimates/negative error than to under-estimates/positive error
• The closer the desired quantile gets to 100%, the more the loss function penalizes estimates below the true value, i.e. it assigns more loss to under-estimates/positive error than to over-estimates/negative error
This quantile loss can be used to calculate prediction intervals for
• Regression, neural networks, tree-based models
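A runnable version of the one-line snippet above, as a sketch (the arrays are made-up examples); note how at q = 0.9 an under-estimate costs nine times as much as an over-estimate of the same size:

import numpy as np

def quantile_loss(y, y_hat, q):
    # Pinball loss: e = y - y_hat is weighted by q for under-estimates
    # (positive error) and by 1 - q for over-estimates (negative error).
    e = y - y_hat
    return np.mean(np.maximum(q * e, (q - 1) * e))

y = np.array([10.0, 12.0, 8.0])
print(quantile_loss(y, y - 1, q=0.9))  # under-estimates: loss 0.9 per unit of error
print(quantile_loss(y, y + 1, q=0.9))  # over-estimates: loss 0.1 per unit of error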
Linear quantile regression
OLS regressions minimize the squared-error loss function to predict the mean
• Prediction intervals are calculated based on standard errors and the inverse normal CDF
Quantile regressions minimize the quantile loss to predict a certain quantile
• Optimizing this loss function results in an estimated linear relationship between yi and xi where a portion of the data, α, lies below the line and the remaining portion, 1 − α, lies above the line
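A possible sketch using statsmodels' QuantReg on synthetic heteroscedastic data (the data and the chosen quantiles are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(0.0, 1.0 + 0.3 * x)  # noise grows with x

X = sm.add_constant(x)
for q in (0.05, 0.5, 0.95):
    fit = sm.QuantReg(y, X).fit(q=q)
    print(q, fit.params)  # intercept and slope of the fitted q-quantile line

Because the noise is heteroscedastic, the 0.05 and 0.95 lines fan out around the median line rather than running parallel to it.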
[Figure source: https://towardsdatascience.com/quantile-regression-from-linear-models-to-trees-to-deep-learning-af3738b527c3]
Deep learning and quantiles
Deep learning vs traditional machine learning
• Large and deep neural networks
• Feature learning, such as automatic feature extraction
Quantiles are predicted in DL by training with the quantile loss function instead of RMSE or MAE
• Keras: each quantile must be trained separately (see the sketch after this list)
• TensorFlow: one model can leverage patterns common to the quantiles
• Co-learning across the quantiles in its predictions, where the model learns a common kink rather than separate ones for each quantile
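A minimal sketch of the Keras route, one model per quantile, with the quantile loss passed as a custom loss function (the architecture is an arbitrary placeholder):

import tensorflow as tf

def make_quantile_loss(q):
    def loss(y_true, y_pred):
        e = y_true - y_pred
        return tf.reduce_mean(tf.maximum(q * e, (q - 1) * e))
    return loss

def build_model(q):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss=make_quantile_loss(q))
    return model

# One model trained separately per quantile
models = {q: build_model(q) for q in (0.1, 0.5, 0.9)}

The TensorFlow co-learning variant typically uses a single network with one output per quantile, summing the quantile losses over the outputs.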
[Figure source: https://towardsdatascience.com/quantile-regression-from-linear-models-to-trees-to-deep-learning-af3738b527c3]
Quantile regression on gradient boosting
Using a second-order approximation of the quantile loss function
• Tree boosting vs gradient boosting in the splitting procedure
• every (dimension, cutoff) pair vs a single pass per dimension (next-largest cutpoint)
• Point forecasts vs forecast distribution
Implemented differently, but all support explicit quantile prediction
• Scikit-learn’s implementation GradientBoostingRegressor
• LightGBM
• http://jmarkhou.com/lgbqr/
• Xgboost
• https://towardsdatascience.com/regression-prediction-intervals-with-xgboost-428e0a018b
• Catboost
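For instance, scikit-learn's GradientBoostingRegressor exposes this via loss="quantile", with alpha selecting the quantile; LightGBM's analogue is objective="quantile" with parameter alpha. A sketch on synthetic data (quantile choices are illustrative):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 1.0 + 0.3 * X[:, 0])

# One booster per quantile; together they form a forecast distribution
boosters = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    for q in (0.05, 0.5, 0.95)
}
lower = boosters[0.05].predict(X[:3])
upper = boosters[0.95].predict(X[:3])  # [lower, upper] gives a 90% prediction interval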
[Figure source: https://towardsdatascience.com/quantile-regression-from-linear-models-to-trees-to-deep-learning-af3738b527c3]
Bootstrapping
Doesn’t explicitly predict quantiles, but treat each model as a possible value, and calculate quantiles using its empirical CDF
• 1. Generate datasets obtained via resampling with replacement • Samples split, blocked bootstrap
• 2. Estimate a point forecast model for each dataset or time series • Tree in random forest, ETS
• 3. Use models to estimate quantiles of model errors • Model and parameter uncertainty
• 4. Use to estimate quantiles of process error • Randomerrorterm
• Quantilefunctionofpriceathourh
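A toy sketch of the resampling loop, assuming an i.i.d. bootstrap and a trivial fit_and_forecast stand-in (a real setup would use a blocked bootstrap to preserve autocorrelation and an actual forecasting model such as ETS):

import numpy as np

def bootstrap_quantile_forecast(series, fit_and_forecast, n_boot=200, seed=4):
    # Treat each bootstrapped model's forecast as one possible value and
    # read quantiles off their empirical distribution.
    rng = np.random.default_rng(seed)
    n = len(series)
    forecasts = []
    for _ in range(n_boot):
        sample = series[rng.integers(0, n, size=n)]  # 1. resample with replacement
        forecasts.append(fit_and_forecast(sample))   # 2. fit a point model per dataset
    return np.quantile(forecasts, [0.05, 0.5, 0.95])  # 3./4. quantiles of the forecasts

series = np.random.default_rng(5).normal(100.0, 10.0, size=300)
print(bootstrap_quantile_forecast(series, fit_and_forecast=np.mean))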
Which did best?
[Figures: method comparison on small datasets with 1 feature and on large datasets with 12 features]
Error measures
Proper scoring rules for probabilistic forecasting
Pinball loss function
• L(q_a, y) = (y − q_a) · a/100 if y ≥ q_a, and (q_a − y) · (1 − a/100) otherwise
• y is the observation used for forecast evaluation
• a = 1, 2, ..., 99, and q1, q2, ..., q99 are the 1st, 2nd, ..., 99th percentiles
Other error measures: MAE, RMSE, MAPE, sMAPE, MASE, K-S statistic
• To evaluate the full predictive densities, this score L is averaged over all target quantiles for all time periods over the forecast horizon
• A lower score indicates a better forecast
Continuous ranked probability score (CRPS)
• CRPS(F, x) = ∫ (F(z) − 𝟙(z − x))² dz, integrating over the real line
• x is the observation
• F is the CDF associated with an empirical probabilistic forecast
• 𝟙 is the Heaviside step function along the real line, attaining
• the value 1 if the real argument is positive or zero
• the value 0 otherwise
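A sketch of both scores on synthetic samples; pinball_score follows the definition above, while crps_from_samples uses the sample identity CRPS = E|S − x| − ½ · E|S − S′| (an equivalent form chosen here for convenience, not taken from the slides):

import numpy as np

def pinball_score(y, quantile_forecasts):
    # Average pinball loss over the percentile forecasts q1..q99
    # for one observation y; lower is better.
    a = np.arange(1, 100) / 100.0
    q = np.asarray(quantile_forecasts)
    return np.mean(np.where(y >= q, (y - q) * a, (q - y) * (1 - a)))

def crps_from_samples(samples, x):
    # CRPS of an empirical forecast CDF against observation x.
    s = np.asarray(samples, dtype=float)
    return np.abs(s - x).mean() - 0.5 * np.abs(s[:, None] - s[None, :]).mean()

rng = np.random.default_rng(6)
samples = rng.normal(0.0, 1.0, size=500)
qs = np.quantile(samples, np.arange(1, 100) / 100.0)
print(pinball_score(0.3, qs))
print(crps_from_samples(samples, 0.3))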
Case study: Projects in TrønderEnergi
• Wind power production forecasting
• Grid loss forecasting