7CCMFM18 MACHINE LEARNING
Dr Blanka Horvath
Department of Mathematics, King’s College London
Lecture notes
Updated: 2 March 2020
Contents

Outline of the module and practical matters  3

1 Introduction  5
  1.1 A helicopter tour of deep learning  5
  1.2 Deep learning in a broader context  10
  1.3 A very brief history of deep learning  12

2 Feedforward neural networks and their properties  17
  2.1 General construction  17
  2.2 Activation functions  20
  2.3 Convolutional and pooling layers  24
  2.4 Universal approximation property  28

3 Training feedforward neural networks  31
  3.1 Loss function  31
  3.2 Stochastic gradient descent  34
  3.3 Backpropagation  37

4 Practical implementation and applications  41
  4.1 Training neural networks in Keras  41
  4.2 Case study: deep hedging  44

Bibliography  48
Outline of the module and practical matters

Overview and intended learning outcomes
The aim of this module is to provide a concise introduction to deep learning, with a view to applications in quantitative finance. In particular, this module will develop a thorough understanding of the properties of feedforward neural networks and how they are trained. After successfully completing this module, the student is able to:
⋆ explain the structure and components of feedforward neural networks,

⋆ recognise the universal approximation property of neural networks,

⋆ derive the gradient of the loss function of a neural network and evaluate it using backpropagation,

⋆ explain the training of neural networks using stochastic gradient descent,

⋆ implement and train neural networks using Keras,

⋆ recognise how neural networks can be used to solve computational and statistical problems in quantitative finance.
Prerequisites
Good command of undergraduate linear algebra and multivariate calculus. On the programming side, basic Python and NumPy skills are expected.
Lectures and problem classes
Weekly lectures are on:
⋆ Mondays, 13:00–15:00 in S-1.27
The first lecture is on Monday, 13 January 2020 and the final lecture on Wednesday, 25 March 2020. The lectures will be recorded.
During the module, three non-assessed problem sheets will be released. The solutions to the problems will be discussed in the problem classes, scheduled for
⋆ Tuesdays, 9:00–10:00 in S2.28

⋆ Fridays, 10:00–11:00 in S0.11

starting in the week of 20 January.
Module material and literature
The module is based on these self-contained lecture notes. They are updated and expanded periodically — please check the KEATS page of the course for the most recent version.
Although there is no need to study any literature beyond these lecture notes, for further reading on deep learning in general I can recommend the textbook Goodfellow et al. (2016) and the survey article Higham and Higham (2018), on which the module is loosely based. Please see Chollet (2018) for an introduction to deep learning with Keras.
Feedback
Your marked coursework will be returned to you with comments. Additionally, you can obtain feedback by meeting with me during my office hour (see below).
Any feedback to me is welcome. For example, if you think you have found misprints in these lecture notes, please do not hesitate to let me know. Informal mid-term feedback will be collected and you will be invited to respond to the official SOLE survey towards the end of the term.
1 Introduction
1.1 A helicopter tour of deep learning
Deep learning is a flourishing subfield of machine learning that employs neural networks to tackle a host of difficult problems within diverse areas, including prediction, classification, image recognition, speech recognition and synthesis, simulation and optimal decision making. Since the year 2015, it has become virtually impossible to miss news reports on the various breakthroughs achieved by deep learning — perhaps the most prominent of which has so far been AlphaGo (Silver et al., 2016), a computer program developed by Google DeepMind in London in 2015 that has consistently been able to beat the best human players in the classical Chinese board game Go, which was long thought to be far too complex for computers to master.
During the deep learning “boom” of the past five years, unsurprisingly, the financial industry has also become increasingly interested in the new deep learning methodology. Machine learning has traditionally been applied in retail banking, for multiple decades already, to detect fraud, machine-read cheques and perform credit scoring, say. The recent focus has been chiefly on the automation of various tasks using modern machine learning, ranging from customer service in retail banking to trading and portfolio construction in investment banking and asset management, respectively. While machine learning, and deep learning in particular, is bound to make banking more efficient and perhaps to “disrupt” it to some degree, concerns have also been voiced about the limited explainability of deep learning and the black-box nature of neural networks. As in finance even small algorithmic glitches may translate to significant monetary losses in production use,1 it is essential that deep learning practitioners working in finance have a solid understanding of the theoretical underpinnings of deep learning and its limitations.
In very broad terms, the problem deep learning endeavours to solve entails “designing” a function, most often non-linear,

    f = (f_1, …, f_O) : R^I → R^O
1 The mistakes that led in August 2012 to the demise of Knight Capital Group, a major electronic market maker in US equities, are a prime example, albeit not related to machine learning per se.
that turns I ∈ N (:= {1, 2, …}) inputs x_1, …, x_I into O ∈ N outputs

    f_1(x_1, …, x_I), …, f_O(x_1, …, x_I)

in a way that is optimal in some sense. Implicit in this conceptualisation is, of course, that both the inputs and outputs are in numerical form, which nowadays is not any kind of a limitation since modern computers store all information (text, images, video, sound, etc) in bits,2 that is, numbers. The optimality of the function f is quantified through a loss function, the choice of which depends on what kind of task the function f should accomplish.
Example 1.1. A key example in machine learning is the (possibly non-linear) regression problem. We observe N ∈ N samples of our input variables, “features”,

    x_1^1  ⋯  x_I^1
     ⋮          ⋮
    x_1^N  ⋯  x_I^N

and matching N samples of an R-valued variable we are trying to predict,

    y^1, …, y^N,

typically called “labels” in machine learning parlance. In financial applications, the labels could, for example, be prices or returns that we are trying to predict, while the features would then be relevant predictors: past prices, trading volume etc. Given the inputs x_1^i, …, x_I^i, for any i = 1, …, N, the output f(x_1^i, …, x_I^i) ∈ R should be as close as possible to y^i. Quantifying closeness by a per-sample loss function l, which is typically the squared loss l(ŷ, y) = (ŷ − y)^2, we seek f that minimises the average loss (also called empirical risk)

    L(f) := (1/N) ∑_{i=1}^N l( f(x_1^i, …, x_I^i), y^i ).    (1.2)
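To make (1.2) concrete, the empirical risk under the squared loss can be computed in a couple of lines of NumPy. The following is only an illustrative sketch; the toy data and the simple predictor are assumptions made for the example, not part of the notes.

import numpy as np

def empirical_risk(f, X, y):
    # X has shape (N, I): N samples of I features; y has shape (N,)
    predictions = np.array([f(x) for x in X])       # f applied sample by sample
    return np.mean((predictions - y) ** 2)          # average squared loss, cf. (1.2)

# toy check with the predictor f(x) = x_1 + ... + x_I
X = np.random.rand(100, 3)
y = X.sum(axis=1) + 0.1 * np.random.randn(100)
print(empirical_risk(lambda x: x.sum(), X, y))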
Example 1.3. The binary classification problem is a close relative of the regression problem (Example 1.1). The inputs here are analogous to regression, but what we are trying to predict consists of binary (zero–one) labels y^1, …, y^N ∈ {0, 1}. In classification, it is often more convenient3 to seek a function f that predicts the (conditional) probability of observing label 1, rather than the zero–one label directly. (This probability can then be mapped to {0, 1}, if necessary, using some threshold-based rule, say.) Financial applications of the binary classification problem include predicting the direction of the next price change (up or down) in high-frequency financial data and predicting whether a debtor defaults in the context of credit risk analytics. While the squared loss, as used in regression, could be used to characterise prediction performance also in binary classification, it is more common (and typically better) to replace it by the binary cross-entropy loss

    l(ŷ, y) := −y log ŷ − (1 − y) log(1 − ŷ),

while retaining the overall form (1.2) of the average loss function L(f).

2 a contraction of binary digit
To make the above concepts and line of thought — involving the so-far completely unspecified function f — actually work in practice, we face two key problems:
Problem 1. How can we construct a class of functions from which to select f ? The class should be:

(i) rich in the sense that it encompasses “almost any” reasonable functional relationship between the outputs and inputs,

(ii) parameterised by a finite set of parameters, so that we can actually work with it numerically,

(iii) able to cope with high-dimensional inputs and outputs.

Problem 2. Once a “good” class of functions has been developed, what kind of a training procedure helps us find an optimal f ? The procedure should be:

(iv) implementable numerically,

(v) efficient enough to be able to cope with large numbers of samples,
(vi) able to avoid the pitfall of overfitting, that is, producing a function f that performs well with the training data but poorly with other data.

3 Any non-trivial function from R^I to {0, 1} cannot enjoy any continuity or differentiability properties.

Figure 1: Graphical representation of a neural network with r = 3, I = d_0 = 4, d_1 = 6, d_2 = 5 and O = d_3 = 3.
Deep learning solves Problem 1 by employing neural networks, sometimes also called artificial neural networks: functions constructed by composing alternatingly affine4 and (simple) non-linear functions. More mathematically, a neural network f : R^I → R^O can be expressed as

    f = σ_r ∘ L_r ∘ ⋯ ∘ σ_1 ∘ L_1,    (1.4)

where, for any i = 1, …, r, L_i : R^{d_{i−1}} → R^{d_i} is an affine function, such that d_0 = I and d_r = O, and σ_i : R^{d_i} → R^{d_i} is the component-wise application of a function σ_i : R → R, which is called an activation function, that is,

    σ_i(x) := ( σ_i(x_1), …, σ_i(x_{d_i}) ),    x = (x_1, …, x_{d_i}) ∈ R^{d_i}.

The structure of a neural network can be visualised graphically, as exemplified in Figure 1, which motivates the term “network”. Intuitively, the affine function L_i is seen to transmit d_{i−1} signals to d_i units, or neurons, indicated by the vertices of the graph in Figure 1, where the signals are transformed by the activation function σ_i. Originally, neural networks were inspired by the nervous system, and they were seen rather literally as a model of the connections between neurons in the nervous system, but nowadays, with increased understanding of the complexity of biological neural networks, this justification is typically de-emphasised by deep learning researchers, since these networks are not very realistic models of their biological counterparts. In fact, neural networks are sometimes simply called “networks” or “nets”. In this module, we will still call them neural networks, for clarity, but we will speak about “units” instead of “neurons”.

4 Recall that a function L : R^n → R^m is affine if it can be written as L(x) = Wx + b, x ∈ R^n, where W ∈ R^{m×n} and b ∈ R^m — or, in a nutshell, “linear plus a constant.”
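To make the composition (1.4) concrete, the following NumPy sketch evaluates a forward pass through a network of the form σ_r ∘ L_r ∘ ⋯ ∘ σ_1 ∘ L_1. The layer widths (those of Figure 1), the random weights and the choice of ReLU activations are illustrative assumptions only.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, activations):
    # weights[i] has shape (d_{i+1}, d_i) and biases[i] has shape (d_{i+1},)
    a = x
    for W, b, sigma in zip(weights, biases, activations):
        a = sigma(W @ a + b)        # affine map L_i followed by activation sigma_i
    return a

# network with I = 4, hidden widths 6 and 5 and O = 3, as in Figure 1
rng = np.random.default_rng(0)
dims = [4, 6, 5, 3]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
print(forward(rng.standard_normal(4), weights, biases, [relu, relu, lambda z: z]))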
The four columns of units in Figure 1 are called the layers of the neural network, the leftmost being the input layer (with four units) and the rightmost the output layer (with three units). The layers between the input and output layers are called hidden layers. Deep learning specifically refers to the use of deep neural networks with multiple hidden layers.5
Since an affine function is parameterised by a matrix and a vector, and is naturally definable for any input and output dimensions, it is evident that the construction (1.4) of a neural network satisfies the requirements (ii) and (iii) in Problem 1 — at least in principle. We will learn in Chapter 2 that neural networks satisfy (i) as well, in an approximative sense, thanks to their universal approximation property.
In deep learning, Problem 2 is solved by training the neural network using a method called stochastic gradient descent (SGD), introduced in detail in Chapter 3. SGD entails first working out the gradient of the loss function L(f) with respect to the parameters of the neural network f, that is, the matrices and vectors that parameterise its layers. However, instead of using all samples to compute L(f) and its gradient, a randomly drawn subset of samples, called a minibatch, is used, which helps immensely with the requirement (v).6 The random minibatch explains the modifier “stochastic” in SGD. In terms of numerics, the gradient is computed using a form of algorithmic differentiation, called backpropagation, which is exact and efficient, fulfilling (iv). The parameters are then adjusted towards the direction opposite of the gradient. The above step of updating parameters is then repeated, each time using a new minibatch, until convergence is deemed to be attained. The use of minibatches provides the first line of defence against overfitting (vi), while further methods of overfit prevention will be discussed in Chapter 3.
5 Deep learning is not a precise concept, however, and there is no consensus over the minimum number of layers that qualifies as “deep”.

6 With a large number of samples, computing the “full” gradient would typically be prohibitively time consuming.
Figure 2: Venn diagram representing deep learning and related fields.
1.2 Deep learning in a broader context
It is helpful to discuss where deep learning stands in the broader context of fields such as statistics, data science and artificial intelligence. How these areas are related is sketched in the form of a Venn diagram in Figure 2, which we shall now briefly unpack.
As mentioned in Section 1.1, deep learning is a subfield of machine learning. Machine learning in itself is a field that takes an algorithmic approach to data analysis, processing and prediction. Deep learning is a proper subset of machine learning, as there are also machine learning methods that are not founded on neural networks, such as random forests, support vector machines and Gaussian processes. The relationship between machine learning and statistics has occasionally been subject to some controversy. In his famous, but slightly provocative article (Breiman, 2001), Leo Breiman (1928–2005, American probabilist and statistician) — the inventor of random forests — contrasts the (in his view more pragmatic) algorithmic approach of machine learning, which is content with just developing an algorithm that produces good predictions or extracts useful information from data to solve a practical problem, without worrying too much about the mechanism that generated the data, to that of statistics, which conventionally indeed endeavours to estimate the “true” data generating model or process. A great deal of Breiman’s criticism is also directed at statistical theory, or mathematical statistics, which he felt was out of touch with the demands of practical applied statistical modelling, due to its unrealistic assumptions and misplaced emphasis on mathematical rigour and elegance.
Nowadays, statistics as a field is more inclusive and open to ideas that do not follow the classical statistical inference paradigm — perhaps partly due to the rise of (computational) Bayesian and non-parametric approaches to statistics. Thus, the “rift” between machine learning and statistics hardly exists anymore, which is why in Figure 2, machine learning is depicted as a subset of statistics. Granted, some cultural differences still remain. For instance, in terms of how new research results are communicated, machine learning favours conferences7 and relatively short8 conference papers, while in statistics longer 9 journal articles are standard. Moreover, there are differences in terminology.10
Machine learning is naturally part of the emerging field of data science, which also encompasses data visualisation and various aspects of computing involving the management of large databases and computer clusters. Moreover, machine learning fundamentally embodies the statistical approach to artificial intelligence (AI). AI is the study of intelligent agents, which autonomously perform tasks, while observing their environment, in order to achieve a particular goal — for instance, an autonomous vehicle trying to reach its destination while keeping the car on the road and avoiding collisions with other cars and obstacles. It should be pointed out, however, that while machine learning is powering many of the most impressive recent advances in AI, there are also non-statistical approaches to AI, such as symbolic AI, which is based on logic through knowledge representation and reasoning.
Finally, we briefly discuss the distinction between supervised and unsupervised learning (within machine learning). Supervised learning refers to machine learning tasks where the output of the algorithm (a neural network, say) is trained by comparing it to known labels. Regression (Example 1.1) and classification (Example 1.3) are basic examples of supervised learning.
7 The major annual conferences in machine learning include the Conference on Neural Information Processing Systems (NeurIPS), the International Conference on Machine Learning (ICML) and the International Conference on Learning Representations (ICLR).

8 typically 10 pages, not including appendices

9 roughly 20–30 pages

10 E.g., “learning” in machine learning is comparable to “estimation” in statistics.
Unsupervised learning is, in contrast, performed without labels. Then the optimality of the algorithm cannot be measured directly through the distance of its outputs from the labels, say, but instead needs to be established in some other way. Examples of unsupervised learning methods include clustering and principal component analysis. The following is an example of unsupervised learning inspired by finance.
Example 1.5. In the context of algorithmic trading, we are typically interested in developing an algorithm, or a function f, that takes price data x as inputs and produces an optimal trading decision f(x) as the output — more concretely, f(x) ∈ R could be the number of shares to buy or sell. Usually, we do not know a priori how to trade optimally — it is the very problem we are trying to solve — so there are no labels to fit f to. Instead, we can proceed in an unsupervised manner, whereby we backtest f with historical or simulated data, resulting in a sample of profit-and-loss (P&L) scenarios, whose empirical distribution we then try to optimise by adjusting f.
It should be stressed that deep learning is a fully generic framework that applies to both types of learning. Indeed, the latest triumphs of deep learning include both supervised (e.g., image recognition via convolutional neural networks) and unsupervised (e.g., image synthesis via generative adversarial networks, GANs) learning tasks.
1.3 A very brief history of deep learning
To understand the background and evolution of the ideas that power deep learning, we provide a very brief historical overview of deep learning. This account is inevitably incomplete and does not do justice to all major developments in this area. More comprehensive surveys are given, e.g., by Goodfellow et al. (2016, Section 1.2) and Schmidhuber (2015), on which this section is loosely based.
The history of neural networks is usually thought to begin from the artificial neuron model of McCulloch and Pitts (1943), which can be formulated in terms of real functions as
    f(x) := 1 if ∑_{i=1}^I x_i ≥ a,  and  f(x) := 0 if ∑_{i=1}^I x_i < a.

[…]

Table 1: One-dimensional activation functions and their properties.
The identity activation Id should not be used simultaneously in all layers, as the resulting neural network simply reduces to an affine function (exercise, PROBLEM SHEET 3), defeating the purpose of neural networks and deep learning — non-linearity!
The Heaviside function H is included here just as a reminder of the perceptron heritage of neural networks (Section 1.3). Since the derivative of H is either zero or undefined, it cannot be used with any gradient-based learning methods, like the stochastic gradient descent to be introduced in Chapter 3, which is why the function is of limited use in modern deep learning.
The sigmoid (σ) and the hyperbolic tangent (tanh) have traditionally been popular activation functions, in particular in hidden layers. These activation functions are said to be saturating, which means that their output is bounded; this is nowadays seen as an undesirable property for hidden-layer activations, as it can cause problems with gradient-based learning. However, these functions remain useful when used in the output layer to produce bounded output values in some particular range.19
In hidden layers, the sigmoid and the hyperbolic tangent have largely been superseded by the (non-saturating) rectified linear unit (ReLU), which was popularised by Glorot et al. (2011). ReLU and its derivative are mathematically very simple, making them numerically efficient. Strictly speaking, the derivative of ReLU is undefined at zero, but we can usually substitute it with the value 1 without significant issues. The (inevitable) lack of continuity of the derivative at zero is sometimes considered problematic, but for example Glorot et al. (2011) found in their experiments that ReLU works in hidden layers at least equally well, if not better than, its continuously differentiable (but more complicated) counterpart, the softplus.20
A more serious limitation of ReLU is that it does not let any negative values pass through it, so that its derivative on the negative real line is zero. This leads to the so-called dead ReLU problem, where a layer of ReLU activations receives only negative values, thereby producing constant output, which may freeze gradient-based learning algorithms. To mitigate the problem, Maas et al. (2013) and He et al. (2015) proposed the leaky ReLU (LReLU) and parametric ReLU (PReLU), respectively. Both of them set the slope of the function to α > 0 for negative values, letting them thereby “leak” through the function. In PReLU, α is a learnable parameter, while LReLU fixes α = 0.01. Moreover, the exponential linear unit (ELU), proposed by Clevert et al. (2016), is a similar substitute for ReLU, which can however achieve continuous differentiability.
The above activation functions are all monotonic, but this is not a strict requirement, as networks with non-monotonic activations can be trained with gradient-based learning methods as well. An example of a non-monotonic activation function with good properties, albeit not very often used, is the Gaussian function.
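For concreteness, the one-dimensional activation functions discussed above can be written in NumPy as follows. This is only a sketch, with α = 0.01 for the leaky ReLU as in the text and α = 1 as an arbitrary illustrative default for the ELU.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    return np.log1p(np.exp(x))          # log(1 + e^x), the smooth counterpart of ReLU

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x), leaky_relu(x), elu(x), softplus(x), sigmoid(x), sep="\n")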
Finally, we discuss two multi-dimensional activation functions, described in detail in Table 2. To build a multiclass classifier, the sigmoid activation function needs to be extended so that it outputs a probability distribution over an arbitrary number of classes.

19 For example, the sigmoid is the key output activation function in binary classification.

20 Glorot et al. (2011) observe that learning with ReLU often leads to sparse internal representations, that is, networks where most weights are zero, unlike with softplus. They give several arguments why sparsity is a desirable property, similar to what LASSO can achieve in linear regression.
Table 2: Multi-dimensional activation functions and their properties. Adapted from Wikipedia (2019).

Softmax:  g_i(x) = e^{x_i} / ∑_{j=1}^d e^{x_j},  i = 1, …, d;
          derivative ∂g_i(x)/∂x_i = g_i(x)(1 − g_i(x));  range (0, 1);  smooth (C^∞).

Maxout:   g(x) = max{x_1, …, x_d};
          derivative ∂g(x)/∂x_i = 1 if x_i > max_{j≠i} x_j, and 0 if x_i < max_{j≠i} x_j;  range R.
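As a sketch of the softmax entry of Table 2 in NumPy: subtracting the maximum before exponentiating is a standard numerical-stability device and does not change the mathematical definition.

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # shift by max(x) for numerical stability
    return z / z.sum()

def maxout(x):
    return np.max(x)

x = np.array([1.0, 2.0, 3.0])
p = softmax(x)
print(p, p.sum())               # a probability vector summing to one
print(maxout(x))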
stride s = 1, 2, … is a function C_{k,s} : R^d → R^{d''} given by

    C_{k,s}(x)_i := C_k(x)_i,    i ∈ { sj + 1 : j ∈ Z, 0 ≤ j ≤ (d − 1)/s },

where d'' := #{ sj + 1 : j ∈ Z, 0 ≤ j ≤ (d − 1)/s }.
When the inputs to a convolutional layer form a time series, it may also be convenient to introduce a constraint that the convolution is backward-looking, whereby the i-th convolution output depends on the inputs only up to index i, as follows:

Definition 2.15. A convolutional or padded convolutional layer is said to be causal if l = 0. (See Figure 5(d) for an illustration.)

Remark 2.16. Definition 2.15 is analogous to the notion of a causal process, encountered in time series analysis.

Remark 2.17. It is evident that all convolutional layers are linear functions. In PROBLEM SHEET 1 we will look into how to represent them in matrix form.

Example 2.18. While the kernel weights k_{−l}, …, k_l are typically learnable parameters, it is useful to look into a few examples of what kind of concrete operations can be represented with convolutional layers. For instance,

    k_i := 1/(l + l′ + 1),    i = −l′, …, l,

defines a (two-sided) equally-weighted moving average, which can be applied to smoothen the input data. Moreover, (k_0, k_{−1}) := (1, −1) and (k_0, k_{−1}, k_{−2}) := (1, −2, 1) implement first- and second-order differencing, respectively.
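Since Definition 2.12 is not reproduced above, the following NumPy sketch uses np.convolve in “valid” mode as a stand-in for the convolution operation, followed by subsampling for the stride; it illustrates the kernels of Example 2.18 rather than implementing the definitions verbatim.

import numpy as np

def conv1d_strided(x, kernel, stride=1):
    # "valid" convolution followed by keeping every stride-th output
    full = np.convolve(x, kernel, mode="valid")
    return full[::stride]

x = np.arange(10, dtype=float)
moving_average = np.ones(3) / 3.0             # equally weighted kernel, cf. Example 2.18
first_difference = np.array([1.0, -1.0])      # first-order differencing kernel

print(conv1d_strided(x, moving_average, stride=1))
print(conv1d_strided(x, first_difference, stride=2))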
Figure 6: Schemata of pooling layers with pool size l = 2 and stride s = 2: (a) a max pooling layer mapping the input (3, 1, 0, 4, 2, 2, 3, 1) to (3, 4, 2, 3), and (b) an average pooling layer mapping the same input to (2, 2, 2, 2).
Remark 2.19. In practical applications, especially in image recognition, convolutional layers are extended to contain multiple kernels that are learned simultaneously. What this means in our one-dimensional context is that the output of such an extended layer would consist of a matrix, whose column vectors are outputs of convolutional layers in the sense of Definitions 2.12 and 2.14 with different kernels.

A convolutional layer with stride greater than one is important in the sense that it allows us to downsample the output from the layer, in pursuit of reduced dimensionality and removal of redundancy. Pooling layers play a similar role, by aggregating data in a sliding window with either a maximum or an averaging operation.
Definition 2.20. A max pooling layer with d inputs, pool size l = 2, 3, … and stride s = 1, 2, … is a function P^max_{l,s} : R^d → R^{d'} given by

    P^max_{l,s}(x)_i := max{ x_{si+1}, …, x_{si+l} },    i ∈ { j ∈ Z : 0 ≤ j ≤ (d − l)/s },

where d' := #{ j ∈ Z : 0 ≤ j ≤ (d − l)/s }. An average pooling layer with d inputs, pool size l = 2, 3, … and stride s = 1, 2, … is a function P^avg_{l,s} : R^d → R^{d'} given by

    P^avg_{l,s}(x)_i := ( x_{si+1} + ⋯ + x_{si+l} ) / l,    i ∈ { j ∈ Z : 0 ≤ j ≤ (d − l)/s },

with d' as above.
The definition is illustrated in Figure 6.25 Typically, the stride s is chosen to be equal to the window length l to avoid overlapping pooling windows.
Remark 2.21. The average pooling layer is a special case of the convolutional layer (with stride), cf. Example 2.18. The average pooling layer is nevertheless usually offered as a separate function in deep learning libraries, as the philosophy is that the kernel weights in a convolutional layer are learnable parameters, while the average pooling layer involves no learnable parameters.

25 The max pooling layer is not an affine function, so it does not strictly conform to our definition of a layer in Definition 2.1. Due to its similarity with the maxout function, one could even argue that it is an activation function.
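A minimal NumPy sketch of the pooling operations of Definition 2.20 (with zero-based indexing of the windows); with pool size l = 2 and stride s = 2 it reproduces the outputs shown in Figure 6.

import numpy as np

def pool1d(x, l, s, op=np.max):
    # apply op to each window of length l, moving s steps at a time
    starts = range(0, len(x) - l + 1, s)
    return np.array([op(x[i:i + l]) for i in starts])

x = np.array([3.0, 1.0, 0.0, 4.0, 2.0, 2.0, 3.0, 1.0])    # the input of Figure 6
print(pool1d(x, l=2, s=2, op=np.max))     # max pooling:     [3. 4. 2. 3.]
print(pool1d(x, l=2, s=2, op=np.mean))    # average pooling: [2. 2. 2. 2.]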
2.4 Universal approximation property
Aside from their neuroscience background, the use of neural networks to model complex functional relationships can (mathematically) be motivated by their universal approximation property, whereby any “reasonable” function can be approximated by a suitable neural network — recall criterion (i) of Problem 1 in Section 1.1.
The universal approximation property is conventionally proved for feedforward neural networks with a single hidden layer — we will discuss this assumption below in Remark 2.24. We can assume, without loss of generality, that there is only one output, O = 1, though. The earliest results on the universal approximation property were proved by Cybenko (1989), Hornik et al. (1989) and Hornik (1991) under rather restrictive assumptions (from a modern-day perspective) on the hidden-layer activation function.26 We shall focus here on a more general result by Leshno et al. (1993, Theorem 1), which applies to any practically relevant activation function.
To measure the precision of approximation by a neural network, we define two norms for functions. Let K ⊂ R^I be compact.27 For any measurable28 f : R^I → R, we introduce the sup norm

    ‖f‖_{sup,K} := sup_{x∈K} |f(x)|

and, for any p ≥ 1, the L^p norm

    ‖f‖_{L^p(K)} := ( ∫_K |f(x)|^p dx )^{1/p}.

Moreover, we denote by L^p(K, R) the class of measurable functions f : K → R such that ‖f‖_{L^p(K)} < ∞.
The following is a slight reformulation of Leshno et al. (1993, Theorem 1 and
Proposition 1).
26 ruling out, for example, ReLU

27 Recall that K ⊂ R^I is compact if it is both closed and bounded.

28 We invite any reader unfamiliar with measure theory to simply ignore this attribute here and in what follows. Non-measurable functions are pathological and never encountered in practical applications.
Theorem 2.22 (Universal approximation property). Let g : R → R be a measurable function such that

(a) g is not a polynomial function,

(b) g is bounded on any finite interval,

(c) the closure of the set of all discontinuity points of g in R has zero Lebesgue measure.

Moreover, let K ⊂ R^I be compact and ε > 0.

(i) For any u ∈ C(K, R), there exist d ∈ N and f ∈ N_2(I, d, 1; g, Id) such that

    ‖u − f‖_{sup,K} < ε.

(ii) Let p ≥ 1. For any v ∈ L^p(K, R), there exist d' ∈ N and h ∈ N_2(I, d', 1; g, Id) such that

    ‖v − h‖_{L^p(K)} < ε.
We will not discuss the proof, given in Leshno et al. (1993, Section 6), of this result in the general case. However, in PROBLEM SHEET 2 we shall derive a constructive proof of statement (i) in the special case I = 1 and g = ReLU.

Remark 2.23. All of the one-dimensional activation functions discussed in Section 2.2, apart from Id, satisfy the assumptions (a)–(c) in Theorem 2.22 on the hidden-layer activation function g. The result of Leshno et al. (1993, Theorem 1) says that assumption (a) is in fact necessary for the universal approximation property to hold.

Remark 2.24. While Theorem 2.22 holds for networks with a single hidden layer, standard folklore in deep learning is that the universal approximation property usually extends to deeper networks. For networks with ReLU activations in hidden layers, this follows from Theorem 2.22 and Problem 3 in PROBLEM SHEET 1, if unbounded network width is allowed. A very recent result of Kidger and Lyons (2019, Theorem 3.2) shows that the universal approximation property holds for deep networks with bounded width but unbounded depth. Their result holds for general activation functions, not just for ReLU.
While theoretically elegant and reassuring, Theorem 2.22 is however of limited use in practical deep learning. The main caveat is that the result is non-constructive — it does not tell what the approximating neural networks f and h look like, as it merely guarantees their existence. Similarly, it is non-quantitative — it does not tell how many hidden units, that is, how much width, we need to realise these networks. Intuitively, one expects that the smaller the tolerance ε > 0 is and the wigglier u and v are, the larger the number of units needed is. (The solution to PROBLEM SHEET 2 illustrates this intuition in a simple special case.) However, recent work by Yarotsky (2017), Petersen and Voigtlaender (2018) and Gühring et al. (2019) on deep ReLU networks is now paving the way for understanding how width and depth requirements crucially depend on the approximation tolerance and the properties of the function to be approximated.
3 Training feedforward neural networks

3.1 Loss function
We have seen in Chapter 2 that feedforward neural networks give rise to a rich class of functions that are able, at least theoretically, to approximate any reasonable functional relationship. As argued in Chapter 1, the objective of deep learning is to find a function that accomplishes a given task by turning inputs into outputs in a way that is optimal in some sense. We will now start our journey towards finding an optimal function.
The notions of task and optimality are not embodied in the neural network f : R^I → R^O, f ∈ N_r(I, d_1, …, d_{r−1}, O) per se, but they are characterised by a loss function

    l : R^O × R^O → R.

Given input x ∈ R^I and reference value y ∈ R^O,29 we compute the realised loss as l(f(x), y). In supervised learning, the value y ∈ R^O would be the label, but our framework is more general and applies to unsupervised learning as well.
Example 3.1. In unsupervised learning, an autoencoder (Goodfellow et al., 2016, Chapter 14) is a feedforward neural network f with I = O that aims to reconstruct its input x with the constraint that the data must pass through a hidden layer with d units, where d ≪ I . Thus, the network performs dimensionality reduction by seeking a d-dimensional internal representation of the data, from which the data can be reconstructed as accurately as possible. The autoencoder should thus minimise l( f (x ), x ), that is y = x , for some loss function l that measures the distance between its two vector arguments. Autoencoding can be understood as a non-linear extension of principal component analysis (Goodfellow et al., 2016, Section 14.1).
If x and y are a realisation of a joint random vector (X, Y), we could theoretically try to seek optimal f by minimising the risk

    E[ l(f(X), Y) ].    (3.2)

In practice, we typically do not know the distribution of (X, Y), however, so we need to resort to empirical methods with samples x^1, …, x^N of x and y^1, …, y^N of y for some N ∈ N. As an empirical proxy of (3.2), we then work with the empirical risk

    L(f) := (1/N) ∑_{i=1}^N l( f(x^i), y^i ).    (3.3)

29 The assumption that the dimensionality of y equals O is not always convenient, but it is a convention that for example Keras follows.
For future purposes, we extend and modify the definition (3.3) to allow for averaging over any subset, minibatch, B ⊂ {1, …, N} of samples, while also recognising that, once the architecture and activation functions have been fixed, f^θ = f ∈ N_r(I, d_1, …, d_{r−1}, O) is fully determined by the parameter vector

    θ := (W^1, …, W^r; b^1, …, b^r) ∈ R^{d_1×d_0} × ⋯ × R^{d_r×d_{r−1}} × R^{d_1} × ⋯ × R^{d_r} ≃ R^{∑_{i=1}^r d_i(d_{i−1}+1)},

where W^1, …, W^r are the weights and b^1, …, b^r the biases. That is, we define the minibatch risk

    L_B(θ) := (1/#B) ∑_{i∈B} l( f^θ(x^i), y^i )    (3.4)

as a function of θ.
Before delving into the minimisation of empirical risk, which is the topic of the next section, we discuss the choice of the loss function l. We have already seen two loss functions, squared loss (Example 1.1) and binary cross-entropy (Example 1.3), in the one-dimensional case O = 1, which apply to regression and binary classification, respectively. Four common one-dimensional loss functions, including the above two, are presented in Figure 7. In regression applications, absolute loss is a potential alternative to squared loss. To understand the distinction between these two losses, consider a random variable Y such that E[Y^2] < ∞. It is not hard to show that the unique solution to the risk minimisation problem

    min_{ŷ∈R} E[ l(ŷ, Y) ]

is ŷ = E[Y] for l(ŷ, y) = (ŷ − y)^2, while ŷ equals the median of Y for l(ŷ, y) = |ŷ − y|.30 Thus, training with squared loss targets the mean, while training with absolute loss targets the median.

Squared loss however has the shortcoming that it significantly amplifies the loss of a prediction ŷ far from the true value y, due to its quadratic growth. Thus, outliers in the data may have a disproportionate impact on the training result. To mitigate this problem, a classical solution by Huber (1964) is to retain the (desirable) quadratic behaviour of ŷ ↦ l(ŷ, y) in a δ-neighbourhood (for small δ > 0) of y, while applying linear extrapolation elsewhere, which gives rise to the Huber loss in Figure 7(c). Another similar loss function is the log–cosh loss, given by l(ŷ, y) := log(cosh(ŷ − y)), ŷ, y ∈ R, which also behaves almost quadratically near the true value and almost linearly far from the true value, but is twice continuously differentiable, unlike the Huber loss.

30 See, e.g., http://www.stat.umn.edu/geyer/f11/5101/slides/s4a.pdf.

Figure 7: One-dimensional loss functions: (a) absolute loss l(ŷ, y) = |ŷ − y|; (b) squared loss l(ŷ, y) = (ŷ − y)^2; (c) Huber loss l(ŷ, y) = (1/2)(ŷ − y)^2 for |ŷ − y| ≤ δ and δ(|ŷ − y| − δ/2) for |ŷ − y| > δ; (d) binary cross-entropy l(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ), ŷ ∈ (0, 1), y ∈ {0, 1}.
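The Huber and log–cosh losses are easy to implement; the sketch below uses δ = 1 purely as an illustrative choice.

import numpy as np

def huber_loss(y_hat, y, delta=1.0):
    r = np.abs(y_hat - y)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def log_cosh_loss(y_hat, y):
    return np.log(np.cosh(y_hat - y))

y_hat = np.linspace(-3.0, 3.0, 7)
print(huber_loss(y_hat, 0.0))      # quadratic near the true value, linear in the tails
print(log_cosh_loss(y_hat, 0.0))   # smooth loss with similar behaviour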
In the multidimensional setting, O ≥ 2, we can often define a loss function as a weighted sum of one-dimensional losses,

    l(ŷ, y) := ∑_{i=1}^O λ_i l_i(ŷ_i, y_i),    ŷ = (ŷ_1, …, ŷ_O), y = (y_1, …, y_O) ∈ R^O,    (3.5)

with some weights λ_1, …, λ_O > 0 and one-dimensional loss functions l_1, …, l_O.
Example 3.6 (Continuation of Remark 2.9). Consider a network with O = 2 outputs, the first of which does classification and the second regression. We could train such a network by minimising the sum of binary cross-entropy (applied to the classifier output) and squared loss (applied to the regression output).
There are also relevant multidimensional loss functions that are not defined via (3.5). Perhaps the most important example of such a loss function is (multiclass) categorical cross-entropy, which is used to train multiclass classifiers. Categorical cross-entropy over O classes is defined as

    l(ŷ, y) = −∑_{i=1}^O 1_{{y_i = 1}} log ŷ_i,    (3.7)

where ŷ = (ŷ_1, …, ŷ_O) ∈ (0, 1)^O, such that ∑_{i=1}^O ŷ_i = 1, is the predicted (non-degenerate) probability distribution on {1, …, O} and y = (y_1, …, y_O) ∈ {0, 1}^O is the one-hot encoding of the label, that is, a label in class i = 1, …, O is encoded with y_i := 1 and y_j := 0 for j ≠ i. Minimising categorical cross-entropy amounts to minimising the corresponding Kullback–Leibler divergence,31 which is an information-theoretic measure of the discrepancy between the empirical distribution of labels and the distribution predicted by the network. In the case O = 2, the definition (3.7) agrees with that of binary cross-entropy.
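A sketch of categorical cross-entropy (3.7) for a single sample, with the label supplied in one-hot form; the softmax step shows how a network output would typically be turned into the probability vector ŷ and is an assumption of the example.

import numpy as np

def categorical_cross_entropy(y_hat, y_one_hot):
    # (3.7): only the predicted probability of the true class contributes
    return -np.sum(y_one_hot * np.log(y_hat))

logits = np.array([2.0, 0.5, -1.0])
y_hat = np.exp(logits) / np.exp(logits).sum()    # softmax output over O = 3 classes
y_one_hot = np.array([1.0, 0.0, 0.0])            # label in class 1
print(categorical_cross_entropy(y_hat, y_one_hot))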
3.2 Stochastic gradient descent
Feedforward neural networks are commonly trained by minimising empirical risk using stochastic gradient descent (SGD). To motivate SGD, consider first the problem of minimising a generic differentiable objective function F : R^d → R. The usual textbook approach would be to work out the gradient ∇F of F, solve ∇F(x) = 0 for x ∈ R^d and argue that at least one of the solutions is indeed a minimiser using the Hessian of F.
However, in practice this approach is often infeasible because ∇F may be hard to compute or ∇F(x) = 0 hard to solve. In such cases we can nevertheless develop procedures that approximate a minimiser (if it exists). The differential equation

    dx(t)/dt = −∇F(x(t)),    t > 0,    (3.8)

with initial condition x(0) ∈ R^d, if solvable, defines the so-called gradient flow (x(t))_{t≥0} of F (Santambrogio, 2016). Under certain assumptions on F, which guarantee the existence of a unique minimiser and are stronger than mere convexity, it can be shown that x(t) tends to the minimiser as t → ∞ (Santambrogio, 2016, Remark 2.1). To make this idea practically useful, we need to discretise the differential equation (3.8), however. For step size η > 0, we approximate

    ( x(t + η) − x(t) ) / η ≈ −∇F(x(t)),

31 https://en.wikipedia.org/wiki/Kullback-Leibler_divergence
Figure 8: An illustration of gradient descent applied to the Rosenbrock banana function F(x_1, x_2) := (a − x_1)^2 + b(x_2 − x_1^2)^2, (x_1, x_2) ∈ R^2, with a := 0.5 and b := 100. Gradient descent is run for 600 steps from the initial condition (0.7, −0.8) with learning rate η ∈ {0.002, 0.0045}. With the lower learning rate η = 0.002, gradient descent largely avoids the initial missteps made with the higher learning rate η = 0.0045, but fails to reach the unique minimiser (a, a^2) in 600 steps.
whereby

    x(t + η) ≈ x(t) − η∇F(x(t)).

This Euler approximation motivates (ordinary) gradient descent, which is an iterative algorithm that progressively seeks a minimiser with gradient updates

    x_new := x_old − η∇F(x_old),

given some initial condition x_0. In the deep learning context, the step size η > 0 is called the learning rate. The application of gradient descent to the Rosenbrock banana function, which is known to be difficult to minimise numerically and is often used to challenge optimisation algorithms, is exemplified in Figure 8.
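A sketch of the gradient-descent iteration applied to the Rosenbrock function of Figure 8, with the gradient written out by hand; the learning rate, the number of steps and the initial condition mirror the figure.

import numpy as np

def rosenbrock_grad(x, a=0.5, b=100.0):
    x1, x2 = x
    dx1 = -2.0 * (a - x1) - 4.0 * b * x1 * (x2 - x1 ** 2)
    dx2 = 2.0 * b * (x2 - x1 ** 2)
    return np.array([dx1, dx2])

x = np.array([0.7, -0.8])       # initial condition as in Figure 8
eta = 0.002                     # learning rate
for _ in range(600):
    x = x - eta * rosenbrock_grad(x)
print(x)   # moves towards the minimiser (0.5, 0.25), but need not reach it in 600 steps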
In the context of neural networks, it may seem tempting to just try to directly minimise the empirical risk L(f^θ), as defined in (3.3), with respect to θ using gradient descent. However, computing the gradient of L(f^θ) can be computationally costly with large N, while gradient descent applied to L(f^θ) may also lead to an overfitted network f^θ. For these reasons, stochastic gradient descent (SGD) is preferred in the training of neural networks.

In SGD, we randomly split the training data, indexed by {1, …, N}, into minibatches that are used successively to compute gradient updates. To this end, we
fix a minibatch size m ≪ N, typically such that N is divisible by m, that is, N = km for some k ∈ N. We then sample uniformly minibatches B_1, …, B_k ⊂ {1, …, N}, such that #B_i = m for any i = 1, …, k, without replacement,32 that is, B_1, …, B_k are disjoint with ∪_{i=1}^k B_i = {1, …, N}. Starting from an initial condition θ_0, the parameter vector θ is updated via

    θ_i := θ_{i−1} − η ∇_θ L_{B_i}(θ_{i−1}),    i = 1, …, k,    (3.9)

where L_{B_i}(θ) is the minibatch empirical risk corresponding to B_i, as defined in (3.4). If the training data consist of iid samples, we may intuitively interpret the minibatch gradient ∇_θ L_{B_i}(θ_{i−1}) as a noisy but unbiased estimate of the full gradient ∇_θ L_{{1,…,N}}(θ_{i−1}). One pass through the entire training data set by (3.9) constitutes a training epoch. This procedure is then repeated over multiple epochs, with new minibatches, while initialising θ with the last value of the previous epoch. For clarity, we summarise the entire SGD algorithm in pseudocode as follows:
Algorithm 3.1 Stochastic gradient descent

Require: minibatch size m ∈ N such that N = km for some k ∈ N
Require: number of epochs E ∈ N
Require: initial condition θ_0
Require: learning rate η > 0

1:  for e = 1, …, E do
2:      sample disjoint B_1^e, …, B_k^e ⊂ {1, …, N} such that #B_j^e = m for any j = 1, …, k
3:      if e = 1 then
4:          θ_0^e := θ_0
5:      else
6:          θ_0^e := θ_k^{e−1}                                  ◃ initial condition from previous epoch
7:      end if
8:      for i = 1, …, k do
9:          θ_i^e := θ_{i−1}^e − η ∇_θ L_{B_i^e}(θ_{i−1}^e)     ◃ minibatch gradient update
10:     end for
11: end for
12: return θ_k^E
Remark 3.10. The initial condition, the “initialiser” θ_0, can be specified manually or it can also be derived from unsupervised pre-training. Currently, the standard approach is however to use a random initialiser. This approach has been advocated, in particular, by Glorot and Bengio (2010). The Glorot (also called Xavier) initialiser of weights is drawn from a zero-mean normal distribution with variance depending on the number of units in the next layer.

32 Sampling without replacement is not necessary, but it is the standard choice.

Remark 3.11. In the case of a convex objective function with a unique minimiser, a proof of convergence to the minimiser could be given for SGD using the theory of Robbins–Monro stochastic approximation,33 assuming that the learning rate η is gradually adjusted to zero. However, the loss function L(f^θ) of a neural network f^θ is rarely, if ever, convex, whereby we cannot really give any guarantees of convergence to anything other than a local minimiser. A common school of thought in deep learning is that we do not even want to find a global minimiser of L(f^θ) (if it exists), as it might overfit. Arguably the greatest unsolved theoretical problem in deep learning is the question why SGD nevertheless works in practice (at least usually), despite the non-convexity of the objective function and the high dimensionality of the parameter space.

Remark 3.12. It is not hard to see that taking a step in the direction −∇L in gradient descent or SGD need not be optimal for any η > 0 — Figure 8 is a clear indication of this. Both gradient descent and SGD may overshoot or zigzag if the learning rate η is too high, while a low η may lead to very slow convergence otherwise. For this reason, several refinements of SGD have been proposed, to adaptively tune η or to seek update directions better than −∇L, and this remains an active area of research. The most prominent refinements of SGD include the Nesterov momentum method (Nesterov, 1983; Sutskever et al., 2013), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2015), the last of which is perhaps the most popular optimisation method in deep learning at the moment. In Adam, the parameter update direction is determined by an exponentially weighted moving average (EWMA) of previous gradients, while the learning rate is adaptively tuned by an EWMA of squares of previous gradients.
3.3 Backpropagation
The gradient ∇_θ L_B(θ) of the minibatch empirical risk L_B is an important ingredient of SGD, which we need to be able to compute numerically. In numerical analysis, derivatives are sometimes approximated by finite differences,

    F′(x) ≈ ( F(x + Δ/2) − F(x − Δ/2) ) / Δ,

for small Δ > 0. This finite-difference approximation can be poor, however, if F is highly non-linear. An alternative to finite differences is symbolic differentiation, where a symbolic expression for F′(x) is derived using computer algebra, instantaneously applicable to any argument x. The downside of symbolic differentiation is that it becomes onerous to apply if F is a complicated function, a composition of a large number of functions, say. Algorithmic differentiation (sometimes also called automatic differentiation), which applies to functions F built by composing simple functions, is the middle ground — it provides a way to efficiently compute F′(x) exactly, but for chosen x only.34

33 https://en.wikipedia.org/wiki/Stochastic_approximation
Backpropagation (or backprop for short) is a special case of (backward-mode) algorithmic differentiation, which computes the gradient of empirical risk for a feedforward neural network. We discuss here the key ideas, adapting Higham and Higham (2018, Section 5). Suppose that we are training by SGD a feedforward neural network f^θ ∈ N_r(I, d_1, …, d_{r−1}, O; σ_1, …, σ_r), where we assume, for simplicity,35 that σ_i is the component-wise application of a one-dimensional activation function g_i : R → R, denoted by g_i, for any i = 1, …, r. Similarly, we denote by g_i′ the component-wise derivative of g_i, that is, g_i′ = (g_i′, …, g_i′) : R^{d_i} → R^{d_i}. Note that, by linearity,

    ∇_θ L_B(θ) = (1/#B) ∑_{i∈B} ∇_θ l( f^θ(x^i), y^i ),

which is why it suffices to work with the per-sample gradient ∇_θ l( f^θ(x), y ).
To set up backpropagation, we next introduce some helpful recursive notation. For x ∈ R^I,

    a^0 := x,
    z^i = (z_1^i, …, z_{d_i}^i) := L_i(a^{i−1}) = W^i a^{i−1} + b^i,    i = 1, …, r,
    a^i = (a_1^i, …, a_{d_i}^i) := g_i(z^i),    i = 1, …, r,

so that f^θ(x) = a^r and l(f^θ(x), y) = l(a^r, y). (It is worth recalling that θ := (W^1, …, W^r; b^1, …, b^r).) Additionally, we introduce the so-called adjoint δ^i = (δ_1^i, …, δ_{d_i}^i) ∈ R^{d_i} by

    δ_j^i := ∂l / ∂z_j^i,    j = 1, …, d_i,

for any i = 1, …, r.
34 https://en.wikipedia.org/wiki/Automatic_differentiation

35 this is not in general necessary, but simplifies some notations a lot
Backpropagation (and algorithmic differentiation in general) is based on the chain rule, which we now recall. For differentiable G : R^d → R and F = (F_1, …, F_d) : R^{d′} → R^d, define H(x) = G(y) with y = (y_1, …, y_d) = F(x), that is, H = G ∘ F : R^{d′} → R. Then

    ∂H/∂x_i (x) = ∑_{j=1}^d ∂G/∂y_j (y) · ∂F_j/∂x_i (x),    x = (x_1, …, x_{d′}),

or, more succinctly (and informally),

    ∂H/∂x_i = ∑_{j=1}^d ∂G/∂y_j · ∂y_j/∂x_i.
Using the chain rule, we can now derive a backward recursive procedure to work out the components of ∇_θ l( f^θ(x), y ) using adjoints.

Proposition 3.13 (Backpropagation). We have

    δ^r = g_r′(z^r) ⊙ ∇_ŷ l(a^r, y),    (3.14)
    δ^i = g_i′(z^i) ⊙ (W^{i+1})′ δ^{i+1},    i = 1, …, r − 1,    (3.15)
    ∂l/∂b_j^i = δ_j^i,    i = 1, …, r, j = 1, …, d_i,    (3.16)
    ∂l/∂W_{j,k}^i = δ_j^i a_k^{i−1},    i = 1, …, r, j = 1, …, d_i, k = 1, …, d_{i−1},    (3.17)

where ⊙ stands for the component-wise Hadamard product of vectors.
Proof. To prove (3.14), note that l = l(a^r, y) = l(g_r(z^r), y), so by the chain rule,

    δ_j^r = ∂l/∂z_j^r = ∑_{s=1}^O ∂l/∂ŷ_s (a^r, y) · ∂a_s^r/∂z_j^r = g_r′(z_j^r) ∂l/∂ŷ_j (a^r, y),    j = 1, …, O,    (3.18)

since

    ∂a_s^r/∂z_j^r = g_r′(z_j^r) if s = j, and 0 if s ≠ j.

Equation (3.14) is simply (3.18) in vector form.

Proceeding on to (3.15), viewing l as a function of z^{i+1}, and z^{i+1} as a function of z^i, and applying the chain rule,

    δ_j^i = ∂l/∂z_j^i = ∑_{s=1}^{d_{i+1}} ∂l/∂z_s^{i+1} (z^{i+1}) · ∂z_s^{i+1}/∂z_j^i = ∑_{s=1}^{d_{i+1}} δ_s^{i+1} ∂z_s^{i+1}/∂z_j^i,    j = 1, …, d_i.
Since

    z_s^{i+1} = ∑_{u=1}^{d_i} W_{s,u}^{i+1} g_i(z_u^i) + b_s^{i+1},    where g_i(z_u^i) = a_u^i,    (3.19)

we find that

    ∂z_s^{i+1}/∂z_j^i = W_{s,j}^{i+1} g_i′(z_j^i),

whereby

    δ_j^i = g_i′(z_j^i) ∑_{s=1}^{d_{i+1}} δ_s^{i+1} W_{s,j}^{i+1} = g_i′(z_j^i) ∑_{s=1}^{d_{i+1}} (W^{i+1})′_{j,s} δ_s^{i+1},    j = 1, …, d_i,

which becomes (3.15) in vector/matrix form.

Finally, we can understand the dependence of l on W^i and b^i through z^i using the relationship

    z_s^i = ∑_{u=1}^{d_{i−1}} W_{s,u}^i a_u^{i−1} + b_s^i,    s = 1, …, d_i,

that is, (3.19) with i + 1 replaced by i, which yields

    ∂z_s^i/∂b_j^i = 1 if s = j and 0 if s ≠ j,    ∂z_s^i/∂W_{j,k}^i = a_k^{i−1} if s = j and 0 if s ≠ j.

Now by the chain rule,

    ∂l/∂b_j^i = ∑_{s=1}^{d_i} ∂l/∂z_s^i · ∂z_s^i/∂b_j^i = δ_j^i,    ∂l/∂W_{j,k}^i = ∑_{s=1}^{d_i} ∂l/∂z_s^i · ∂z_s^i/∂W_{j,k}^i = δ_j^i a_k^{i−1},

which proves both (3.16) and (3.17).
Remark 3.20. To apply Proposition 3.13 in practice to compute ∇_θ l( f^θ(x), y ) at a fixed θ (given by the previous gradient update, say), we proceed in two steps. We first evaluate l( f^θ(x), y ) in a forward pass

    x = a^0 → z^1 → a^1 → ⋯ → z^r → a^r → l(a^r, y) = l( f^θ(x), y ),

storing the intermediate values a^1, …, a^r and z^1, …, z^r. If we additionally have symbolic expressions for g_1′, …, g_r′ and if we can evaluate ∇_ŷ l(a^r, y) either symbolically or by general algorithmic differentiation, we can then perform a backward pass using Proposition 3.13 to evaluate ∇_θ l( f^θ(x), y ) at θ.
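Purely for illustration, the following sketch carries out the forward pass and the backward recursion of Proposition 3.13 for a small ReLU network with squared loss, and compares one partial derivative with a finite-difference approximation as a sanity check; the architecture and data are arbitrary assumptions.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0).astype(float)

def forward(x, Ws, bs):
    # forward pass storing z^i and a^i; hidden layers use ReLU, the output layer the identity
    a, zs, acts = x, [], [x]
    for i, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = z if i == len(Ws) - 1 else relu(z)
        zs.append(z)
        acts.append(a)
    return zs, acts

def backprop(x, y, Ws, bs):
    zs, acts = forward(x, Ws, bs)
    r = len(Ws)
    delta = 2.0 * (acts[-1] - y)                 # (3.14) with identity output and squared loss
    grads_W, grads_b = [None] * r, [None] * r
    for i in range(r - 1, -1, -1):
        grads_b[i] = delta                       # (3.16)
        grads_W[i] = np.outer(delta, acts[i])    # (3.17)
        if i > 0:
            delta = relu_prime(zs[i - 1]) * (Ws[i].T @ delta)    # (3.15)
    return grads_W, grads_b

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((5, 3)), rng.standard_normal((1, 5))]
bs = [np.zeros(5), np.zeros(1)]
x, y = rng.standard_normal(3), np.array([0.7])
gW, _ = backprop(x, y, Ws, bs)
loss = lambda Ws_: np.sum((forward(x, Ws_, bs)[1][-1] - y) ** 2)
Wp = [W.copy() for W in Ws]
Wp[0][2, 1] += 1e-6
print(gW[0][2, 1], (loss(Wp) - loss(Ws)) / 1e-6)   # the two values should agree closely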
Remark 3.21. The backbone of any major deep learning library (such as PyTorch, TensorFlow and Theano) consists of an implementation of algorithmic differentiation, which makes backpropagation effectively an automatic process, so that one does not actually have to code up Proposition 3.13 in practice.
4 Practical implementation and applications

4.1 Training neural networks in Keras
We will now look into the process of implementing and training feedforward neural networks. TensorFlow is a deep learning software library, developed by Google Brain initially for internal use at Google, but released as open source in 2015. Alongside PyTorch, it is arguably the most popular deep learning library at the moment. TensorFlow is typically used via its Python API.36 TensorFlow is also able to utilise GPUs and Google’s proprietary Tensor Processing Units (TPUs) to accelerate computations.
Keras, developed by Google engineer François Chollet, is a high-level deep learning library, with the objective of making it user-friendly and fast to implement and train neural networks. A thorough introduction to Keras is given in Chollet (2018). Keras is simply a front end that uses a lower-level deep learning library (TensorFlow, Theano or Microsoft Cognitive Toolkit) to do the “hard work”. Keras has been part of the TensorFlow core library since version 1.10.0, and we use here the version of Keras supplied with TensorFlow, sometimes dubbed “tf.keras”.
With TensorFlow installed,37 we can import Keras by:

import tensorflow.keras as keras
If our aim is to specify and train a basic feedforward neural network with standard layers, we can easily accomplish this by using the Sequential model of Keras. Suppose now that we want to create, say, a network

    f ∈ N_3(2, 100, 100, 1; ReLU, ReLU, Id).

We can specify f with the Sequential model as follows:

f = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(2,)),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="linear")
])

36 application programming interface

37 An easy way to get started with TensorFlow, without manually installing anything, is to use Google Colaboratory: https://colab.research.google.com/
We can inspect the specified network by
f.summary()
Before we can train f , we need to specify our optimisation algorithm and loss function by “compiling” f . Let us say we want to use Adam and squared loss:
f.compile(optimizer="adam", loss="mean_squared_error")
If our training data x1,…,xN ∈ R2 and y1,…,yN ∈ R, with N = 100000, say, are supplied in NumPy arrays X and Y with shapes (100000,2) and (100000,1), respectively, we can start training with minibatches of size 100, over 5 epochs, by:
f.fit(X, Y, batch_size=100, epochs=5)
After training, we can use f to predict values from new inputs in X_new, a NumPy array with shape (1000,2), say, by:
Y_hat = f.predict(X_new)
Remark 4.1. The Sequential model is too restrictive for certain applications, for example those involving networks that have shared layers or non-standard routing. Such cases can be handled with the Functional API of Keras, which is an alternative to the Sequential model.
Example 4.2 (Regression, Jupyter notebook DL_2020_Regression.ipynb will be available from KEATS under folder “Lecture Notebook 24/02”). In this example, we test neural networks in a task that involves non-linear regression with artificial data. We generate data y^1, …, y^N ∈ R and x^1, …, x^N ∈ R with N = 1000000 as follows. We first sample x^1, …, x^N independently from the Uniform(0, 1) distribution. Then we set

    y^i := g(x^i) + ε^i,    i = 1, …, N,

where g(x) := sin(10x), x ∈ R, and ε^1, …, ε^N are independent N(0, 0.1)-distributed variates.

Our aim is now to learn the function g from the data by supervised learning. To this end, we train

    ĝ ∈ N_4(1, 100, 100, 100, 1; ReLU, ReLU, ReLU, Id)
Figure 9: The trained network ĝ compared to the true function g in Example 4.2.
using squared loss as the loss function l and the Adam optimisation algorithm with minibatches of size 100, over 5 epochs. We then compare the trained ĝ to g. The result is given in Figure 9. With such a large number of observations, the function g is not too hard to learn!
To demonstrate the potential problem of overfitting, we now make the problem harder by using only a small subset of the data — the first 200 samples. In fact, we use only the first 100 for training; the rest we retain as a validation set. We train a blatantly overparameterised network

    g_over ∈ N_4(1, 1000, 1000, 1000, 1; ReLU, ReLU, ReLU, Id)

with the data. This network has more than 2 million parameters! We train the network using Adam, but now with full batches of 100 samples, over 500 epochs. While training this network, we may observe that the training loss plunges below the model error variance 0.1, while the validation loss remains considerably higher. This is a telltale sign of overfitting. We can also concretely confirm this by assessing the trained g_over, which is plotted in the left panel of Figure 10.
There are several strategies to combat overfitting. Firstly, we could simply stop the training early — that is, run it over fewer epochs. Secondly, we could apply a regularisation method. As an example of regularisation, we briefly ex- periment with dropout, introduced by Srivastava at al. (2014). The idea behind dropout is to introduce a dropout layer, where each input gets randomly replaced during training by zero value with fixed probability p ∈ [0,1]. We introduce a dropout layer with p = 0.5 after each hidden layer in gover, giving rise to a new
Figure 10: The trained overparameterised network $g_{\mathrm{over}}$ (without dropout, left panel) versus its dropout variant $g_{\mathrm{dropout}}$ (with dropout, right panel) in Example 4.2.
We train $g_{\mathrm{dropout}}$ with the other settings unchanged. In training, we note that the training loss typically remains above the model error variance 0.1, while the validation loss ends up close to the training loss, which is a good sign. We can confirm the result visually by plotting the final $g_{\mathrm{dropout}}$, displayed in the right panel of Figure 10. While the fit is not perfect, due to the small sample size, the overfitting problem has effectively vanished.
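In Keras, the dropout variant amounts to inserting a Dropout layer with rate 0.5 after each hidden layer; a sketch, continuing the notation of the previous snippets:

```python
g_dropout = keras.Sequential([
    keras.layers.Dense(1000, activation="relu", input_shape=(1,)),
    keras.layers.Dropout(0.5),  # each hidden output is zeroed with probability 0.5 during training
    keras.layers.Dense(1000, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1000, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation="linear"),
])
g_dropout.compile(optimizer="adam", loss="mean_squared_error")
g_dropout.fit(X_train, Y_train, batch_size=100, epochs=500,
              validation_data=(X_val, Y_val))
```

Note that Keras applies dropout only during training; at prediction time the Dropout layers act as the identity.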
As another example, one of the questions on the next problem sheet asks you to carry out a similar exercise involving binary classification.
4.2 Case study: deep hedging
We now study an application in quantitative finance. Deep hedging (Buehler et al., 2019) is an unsupervised-learning-based approach to determining optimal hedging strategies for options and other derivatives. It was originally devised by researchers at JP Morgan and ETH Zurich in 2017. The approach is based on the idea of representing the hedging strategy as a neural network whose inputs are current and past market data. The network is then trained to optimise the hedger’s profit and loss (PnL), according to some performance measure, e.g., hedging error, a risk measure or a utility function, using a set of price paths.
The approach is in principle model-free, although the number of price paths needed in training is typically so large that there simply are not enough historical price paths available, so the training data need to be generated by a model (calibrated to market data) or some other market simulator. A particular
strength of deep hedging is that various market imperfections, such as transaction costs, are relatively straightforward to incorporate into the framework, while analytically they would be very hard to tackle.
To demonstrate how deep hedging works, we look into a very simple example, where we learn the classical Black–Scholes delta hedging strategy using the deep hedging approach.
Example 4.3 (Deep hedging, Jupyter notebook DL_2019_Deep_Hedging.ipynb available from KEATS under folder “Lecture Notebook 02/03 Deep Hedging”). Deep hedging works in discrete time, so we set up a model that approximates the continuous-time Black–Scholes model. To this end, we construct a price process $S = (S_t)_{t=0}^{T}$ as follows:
$$S_t = S_0 \exp\Bigl(\mu \tfrac{t}{T} + \sigma \sum_{i=1}^{t} \xi_i\Bigr), \qquad t = 0, 1, \dots, T,$$
where $\mu > 0$, $\sigma > 0$ and $S_0 > 0$ are constants, and $\xi_1,\dots,\xi_T$ are mutually independent $N(0, \tfrac{1}{T})$-distributed random variables. When $T \to \infty$, the process $S$, under rescaling of time and interpolation, indeed tends to the continuous-time Black–Scholes price process
$$S_t^{\mathrm{BS}} = S_0 \exp(\mu t + \sigma W_t), \qquad t \in [0,1],$$
where $W = (W_t)_{t \in [0,1]}$ is a standard Brownian motion. We use the following parameter values:
$$\mu = 0.1, \qquad \sigma = 0.5, \qquad T = 100, \qquad S_0 = 1.$$
For training, we generate N = 100000 independent samples of the price path S.
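A possible NumPy sketch of the path simulation is given below; the parameter values follow the text, while the variable names and the random seed are illustrative.

```python
import numpy as np

mu, sigma, T, S0 = 0.1, 0.5, 100, 1.0
N_paths = 100_000
rng = np.random.default_rng(1)

# xi_1, ..., xi_T are i.i.d. N(0, 1/T) for each path
xi = rng.normal(0.0, np.sqrt(1.0 / T), size=(N_paths, T))

# S_t = S_0 exp(mu t/T + sigma sum_{i<=t} xi_i), t = 0, ..., T
t = np.arange(T + 1)
cum_noise = np.concatenate([np.zeros((N_paths, 1)), np.cumsum(xi, axis=1)], axis=1)
S = S0 * np.exp(mu * t / T + sigma * cum_noise)   # shape (N_paths, T + 1)
```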
Our aim is to hedge the call option
$$(S_T - K)^+,$$
written on $S$; that is, we develop an adapted and self-financing trading strategy in the underlying stock so that its terminal wealth matches the option payoff as closely as possible. Such a trading strategy is specified by its initial wealth $x \in \mathbb{R}$ and its position $\gamma_t$ in the underlying stock at time $t$ for any $t = 0, 1, \dots, T-1$. By adaptedness, $\gamma_t$ must be a function of the past prices $S_t, S_{t-1}, \dots, S_0$ only. Assuming zero interest rate, by the self-financing property, the terminal wealth of the strategy can be expressed as
$$V_T = x + \sum_{t=1}^{T} \gamma_{t-1}\,(S_t - S_{t-1}).$$
Then the option hedger’s PnL is
$$\mathrm{PnL} = V_T - (S_T - K)^+.$$
Note that, with $x$ and $S_0$ fixed, we can view PnL as a function of the trading strategy $\gamma_0, \gamma_1, \dots, \gamma_{T-1}$ and the price increments $S_1 - S_0, S_2 - S_1, \dots, S_T - S_{T-1}$, since
$$\mathrm{PnL}(\gamma_0, \gamma_1, \dots, \gamma_{T-1};\, S_1 - S_0, S_2 - S_1, \dots, S_T - S_{T-1}) = x + \sum_{t=1}^{T} \gamma_{t-1}(S_t - S_{t-1}) - \Bigl(S_0 + \sum_{t=1}^{T} (S_t - S_{t-1}) - K\Bigr)^+.$$
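To make the definition concrete, here is a small NumPy sketch evaluating PnL for one price path and one strategy; the function and variable names are illustrative.

```python
import numpy as np

def pnl(x, gamma, S, K):
    """PnL = x + sum_{t=1}^T gamma_{t-1} (S_t - S_{t-1}) - (S_T - K)^+,
    where gamma holds gamma_0, ..., gamma_{T-1} and S holds S_0, ..., S_T."""
    increments = np.diff(S)                 # S_t - S_{t-1} for t = 1, ..., T
    V_T = x + np.sum(gamma * increments)    # terminal wealth of the strategy
    return V_T - max(S[-1] - K, 0.0)        # subtract the call payoff
```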
We now represent the trading strategy $\gamma$ as a neural network, whose inputs are the available market data and whose output is the hedging position, that is,
$$\gamma_t = f_t(S_t, S_{t-1}, \dots, S_0),$$
where $f_t$ is a neural network for any $t = 0, 1, \dots, T-1$. Here, since we know that $S$ is a Markov process, we can simplify the problem slightly (although this is not necessary in general) by seeking a single network $f : [0,1] \times \mathbb{R} \to \mathbb{R}$ so that
$$\gamma_t = f\Bigl(\frac{t}{T}, S_t\Bigr), \qquad t = 0, 1, \dots, T-1.$$
However, to evaluate PnL, we need all of the values $f(0, S_0), f(\tfrac{1}{T}, S_1), \dots, f(\tfrac{T-1}{T}, S_{T-1})$, so we need to create a large hedging network $F$ by concatenating $f(\tfrac{t}{T}, S_t)$ over $t = 0, 1, \dots, T-1$, so that our feedforward network is actually the map
$$F : \Bigl((0, S_0), \bigl(\tfrac{1}{T}, S_1\bigr), \dots, \bigl(\tfrac{T-1}{T}, S_{T-1}\bigr)\Bigr) \mapsto \Bigl(f(0, S_0), f\bigl(\tfrac{1}{T}, S_1\bigr), \dots, f\bigl(\tfrac{T-1}{T}, S_{T-1}\bigr)\Bigr).$$
This network is not fully connected and it has shared layers (since $f$ is repeated),
so we need to use the Functional API in Keras to build it. For $f$, we specify
$$f \in \mathcal{N}_4(2, 100, 100, 100, 1; \mathrm{ReLU}, \mathrm{ReLU}, \mathrm{ReLU}, \mathrm{Sigmoid}),$$
where we choose Sigmoid due to the financial intuition that the hedging position should be between 0 and 1. (This choice helps training, but is not really necessary.)
We train $f$, and thereby $F$, so that the quadratic hedging error, that is, $\mathrm{PnL}^2$, is empirically minimised. To this end, we define the loss function
$$l\bigl((\hat y_0, \hat y_1, \dots, \hat y_{T-1}), (y_0, y_1, \dots, y_{T-1})\bigr) := \mathrm{PnL}(\hat y_0, \hat y_1, \dots, \hat y_{T-1};\, y_0, y_1, \dots, y_{T-1})^2.$$
We fix $x$ as the corresponding Black–Scholes call price $\mathrm{BS}(S_0, K, 1)$. The loss function $l$ is a custom one, so it needs to be implemented separately in Keras.
Figure 11: Comparison between the deep hedging strategy $f(\tfrac{t}{T}, S_t)$, learned in Example 4.3, and the corresponding Black–Scholes delta hedging strategy $\partial_S \mathrm{BS}(S_t, K, 1 - \tfrac{t}{T})$, plotted against the spot price $S_t$ at times $\tfrac{t}{T} \in \{0.5, 0.6, 0.7, 0.8\}$ with $K = 1$.
We train the network $F$ using Adam with minibatch size 100, over 4 epochs. In training, for the $i$-th sample,
$$\hat y_t = f\Bigl(\frac{t}{T}, S_t^i\Bigr), \qquad y_t = S_{t+1}^i - S_t^i, \qquad t = 0, 1, \dots, T-1.$$
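The sketch below shows one way this construction could be set up in Keras with the Functional API: a shared network f applied at every time step, concatenated into F, with the squared-PnL loss implemented as a custom loss function. It is not the implementation used in the module’s notebook; the Lambda-based slicing, the value K = 1 (the strike shown in Figure 11) and all variable names are our own assumptions, and SciPy is used only for the normal CDF in the Black–Scholes price.

```python
import numpy as np
import tensorflow as tf
from scipy.stats import norm
from tensorflow import keras

T, S0, K, sigma = 100, 1.0, 1.0, 0.5

# Initial wealth x: the Black-Scholes call price BS(S0, K, 1) under zero interest rate
d1 = (np.log(S0 / K) + 0.5 * sigma**2) / sigma
x0 = S0 * norm.cdf(d1) - K * norm.cdf(d1 - sigma)

# Shared single-step network f: (t/T, S_t) -> hedging position in [0, 1]
step_in = keras.Input(shape=(2,))
h = keras.layers.Dense(100, activation="relu")(step_in)
h = keras.layers.Dense(100, activation="relu")(h)
h = keras.layers.Dense(100, activation="relu")(h)
step_out = keras.layers.Dense(1, activation="sigmoid")(h)
f = keras.Model(step_in, step_out)

# Hedging network F: apply f (with shared weights) at each time step
path_in = keras.Input(shape=(T,))                       # prices S_0, ..., S_{T-1}
positions = []
for t in range(T):
    S_t = keras.layers.Lambda(lambda s, t=t: s[:, t:t + 1])(path_in)
    t_over_T = keras.layers.Lambda(lambda s, t=t: tf.ones_like(s) * (t / T))(S_t)
    positions.append(f(keras.layers.Concatenate()([t_over_T, S_t])))
F = keras.Model(path_in, keras.layers.Concatenate()(positions))

# Custom loss: squared PnL, with x, S_0 and K held fixed
def hedging_loss(y_true, y_pred):
    # y_true: increments S_{t+1} - S_t; y_pred: positions f(t/T, S_t)
    gains = tf.reduce_sum(y_pred * y_true, axis=1)
    payoff = tf.nn.relu(S0 + tf.reduce_sum(y_true, axis=1) - K)
    return tf.square(x0 + gains - payoff)

F.compile(optimizer="adam", loss=hedging_loss)
# With simulated paths S of shape (N_paths, T + 1) as in the earlier sketch:
# F.fit(S[:, :T], np.diff(S, axis=1), batch_size=100, epochs=4)
```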
Since for large $T$ the price process $S$ is close to the Black–Scholes price process $S^{\mathrm{BS}}$, the hedging strategy $\gamma_t = f(\tfrac{t}{T}, S_t)$ should be close to the (continuous-time) Black–Scholes delta hedging strategy
$$\gamma_t^{\mathrm{BS}} = \frac{\partial}{\partial S} \mathrm{BS}\Bigl(S_t, K, 1 - \frac{t}{T}\Bigr),$$
which amounts to perfect replication, $\mathrm{PnL} = 0$. With the trained $f$, we verify that there is indeed good agreement between these two strategies, as illustrated in Figure 11.
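To check the agreement numerically, the trained f can be compared against the closed-form Black–Scholes delta under zero interest rate; a minimal sketch, reusing the single-step network f from the previous snippet (names again illustrative):

```python
import numpy as np
from scipy.stats import norm

def bs_delta(S, K, tau, sigma):
    """Black-Scholes delta of a call, zero interest rate, time to maturity tau."""
    d1 = (np.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    return norm.cdf(d1)

# Compare at, e.g., t/T = 0.5 over a grid of spot prices
spots = np.linspace(0.5, 2.0, 50).reshape(-1, 1)
inputs = np.concatenate([np.full_like(spots, 0.5), spots], axis=1)  # (t/T, S_t)
learned = f.predict(inputs)                          # deep hedging position
exact = bs_delta(spots, K=1.0, tau=0.5, sigma=0.5)   # Black-Scholes delta
```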
We have “derived” the Black–Scholes delta hedge by deep learning! Note that this is indeed unsupervised learning: we did not tell the network $f$ (or $F$) what the Black–Scholes delta hedge is; it learned it through PnL optimisation.
Bibliography
C. M. Bishop (2006): Pattern Recognition and Machine Learning. Springer, New York, NY.
L. Breiman (2001): Statistical modeling: the two cultures. Statistical Science 16(3), 199–231.
H. Buehler, L. Gonon, J. Teichmann and B. Wood (2019): Deep hedging. Quantitative Finance 19(8), 1271–1291.
F. Chollet (2018): Deep Learning with Python. Manning, Shelter Island, NY.
D.-A. Clevert, T. Unterthiner and S. Hochreiter (2016): Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR 2016).
G. Cybenko (1989): Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), 303–314.
X. Glorot and Y. Bengio (2010): Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington (eds.): Proceedings of Machine Learning Research 9, pp. 249–256.
X. Glorot, A. Bordes and Y. Bengio (2011): Deep sparse rectifier neural networks. In G. Gordon, D. Dunson and M. Dudík (eds.): Proceedings of Machine Learning Research 15, pp. 315–323.
I. Goodfellow, Y. Bengio and A. Courville (2016): Deep Learning. MIT Press, Cambridge, MA.
I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville and Y. Bengio (2013): Maxout networks. In S. Dasgupta and D. McAllester (eds.): Proceedings of Machine Learning Research 28(3), pp. 1319–1327.
I. Gühring, G. Kutyniok and P. Petersen (2019): Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms. Preprint, available at: https://arxiv.org/abs/1902.07896.
K. He, X. Zhang, S. Ren and J. Sun (2015): Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Preprint, available at: https://arxiv.org/abs/1502.01852.
C. F. Higham and D. J. Higham (2018): Deep learning: an introduction for applied mathematicians. Preprint, available at: https://arxiv.org/abs/1801.05894.
G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath and B. Kingsbury (2012): Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29(6), 82–97.
S. Hochreiter and J. Schmidhuber (1997): Long short-term memory. Neural Computation 9(8), 1735–1780.
K. Hornik (1991): Approximation capabilities of multilayer feedforward neural networks. Neural Networks 4(2), 251–257.
K. Hornik, M. Stinchcombe and H. White (1989): Multilayer feedforward neural networks are universal approximators. Neural Networks 2(5), 359–366.
P. J. Huber (1964): Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101.
P. Kidger and T. Lyons (2019): Universal approximation with deep narrow networks. Preprint, available at: https://arxiv.org/abs/1905.08539.
D. P. Kingma and J. L. Ba (2015): Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015).
A. Krizhevsky, I. Sutskever and G. E. Hinton (2012): ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger (eds.): Advances in Neural Information Processing Systems 25.
M. Leshno, V. Ya. Lin, A. Pinkus and S. Schocken (1993): Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6(6), 861–867.
S. Linnainmaa (1970): Algoritmin kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirheiden Taylor-kehitelmänä (in Finnish, translated title: The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors). MSc thesis, University of Helsinki.
A. L. Maas, A. Y. Hannun and A. Y. Ng (2013): Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing.
W. S. McCulloch and W. Pitts (1943): A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5(4), 115–133.
M. L. Minsky and S. A. Papert (1969): Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.
Y. Nesterov (1983): A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady 27, 372–376.
P. Petersen and F. Voigtlaender (2018): Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks 108, 296–330.
F. Rosenblatt (1958): The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65(6), 386–408.
D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986): Learning representations by back-propagating errors. Nature 323, 533–536.
F. Santambrogio (2016): {Euclidean, metric, and Wasserstein} gradient flows: an overview. Preprint, available at: https://arxiv.org/abs/1609.03890.
J. Schmidhuber (2015): Deep learning in neural networks: an overview. Neural Networks 61, 85–117.
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel and D. Hassabis (2016): Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014): Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958.
I. Sutskever, J. Martens, G. Dahl and G. Hinton (2013): On the importance of initialization and momentum in deep learning. In S. Dasgupta and D. McAllester (eds.): Proceedings of Machine Learning Research 28(3), pp. 1139–1147.
T. Tieleman and G. Hinton (2012): RMSProp. COURSERA — Neural Networks for Machine Learning, Lecture 6.5.
P. J. Werbos (1974): Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
Wikipedia (2019): Activation function. Accessed: 13 November 2019.
D. Yarotsky (2017): Error bounds for approximations with deep ReLU networks. Neural Networks 94, 103–114.