Machine learning with neural networks: An introduction for scientists and engineers
ACKNOWLEDGEMENTS
This textbook is based on lecture notes for the course Artificial Neural Networks that I have given at Gothenburg University and at Chalmers University of Technology in Gothenburg, Sweden. When I prepared my lectures, my main source was Introduction to the theory of neural computation by Hertz, Krogh, and Palmer [1]. Other sources were Neural Networks: a comprehensive foundation by Haykin [2], Horner’s lecture notes [3] from Heidelberg, Deep learning by Goodfellow, Bengio & Courville [4], and the online book Neural Networks and Deep Learning by Nielsen [5].
I thank Martin Čejka for typesetting the first version of my hand-written lecture notes, Erik Werner and Hampus Linander for their help in preparing Chapter 8, Kristian Gustafsson for his detailed feedback on Chapter 11, Nihat Ay for his comments on Section 4.5, and Mats Granath for discussions about autoencoders. I would also like to thank Juan Diego Arango, Oleksandr Balabanov, Anshuman Dubey, Johan Fries, Phillip Gräfensteiner, Navid Mousavi, Marina Rafajlovic, Jan Schiffeler, Ludvig Storm, and Arvid Wenzel Wartenberg for implementing algorithms described in this book. Many Figures are based on their results. Oleksandr Balabanov, Anshuman Dubey, Jan Meibohm, and in particular Johan Fries and Marina Rafajlovic contributed exam questions that became exercises in this book. Finally, I would like to express my gratitude to Stellan Östlund for his encouragement and criticism. Last but not least, a large number of colleagues and students – past and present – pointed out misprints and errors, and suggested improvements. I thank them all.
The cover image shows an input pattern designed to maximise the output of neurons corresponding to one feature map in a given convolution layer of a deep convolutional neural network (see Section 8.7 and the references cited there). Image by Hampus Linander. Reproduced with permission.
Contents

Acknowledgements

1 Introduction
  1.1 Neural networks
  1.2 McCulloch-Pitts neurons
  1.3 Activation functions
  1.4 Asynchronous updates
  1.5 Summary
  1.6 Further reading

I Hopfield networks

2 Deterministic Hopfield networks
  2.1 Pattern recognition
  2.2 Hopfield networks and Hebb's rule
  2.3 The cross-talk term
  2.4 One-step error probability
  2.5 Energy function
  2.6 Summary
  2.7 Exercises

3 Stochastic Hopfield networks
  3.1 Stochastic dynamics
  3.2 Order parameters
  3.3 Mean-field theory
  3.4 Critical storage capacity
  3.5 Beyond mean-field theory
  3.6 Correlated and non-random patterns
  3.7 Summary
  3.8 Further reading
  3.9 Exercises

4 The Boltzmann distribution
  4.1 Convergence of the stochastic dynamics
  4.2 Monte-Carlo simulation
  4.3 Simulated annealing
  4.4 Boltzmann machines
  4.5 Restricted Boltzmann machines
  4.6 Summary
  4.7 Further reading
  4.8 Exercises

II Supervised learning

5 Perceptrons
  5.1 A classification problem
  5.2 Iterative learning algorithm
  5.3 Gradient descent for linear units
  5.4 Classification capacity
  5.5 Multi-layer perceptrons
  5.6 Summary
  5.7 Further reading
  5.8 Exercises

6 Stochastic gradient descent
  6.1 Chain rule and error backpropagation
  6.2 Stochastic gradient-descent algorithm
  6.3 Preprocessing the input data
  6.4 Overfitting and cross validation
  6.5 Adaptation of the learning rate
  6.6 Summary
  6.7 Further reading
  6.8 Exercises

7 Deep learning
  7.1 How many hidden layers?
  7.2 Vanishing and exploding gradients
  7.3 Rectified linear units
  7.4 Residual networks
  7.5 Outputs and energy functions
  7.6 Regularisation
  7.7 Summary
  7.8 Further reading
  7.9 Exercises

8 Convolutional networks
  8.1 Convolution layers
  8.2 Pooling layers
  8.3 Learning to read handwritten digits
  8.4 Coping with deformations of the input distribution
  8.5 Deep learning for object recognition
  8.6 Summary
  8.7 Further reading
  8.8 Exercises

9 Supervised recurrent networks
  9.1 Recurrent backpropagation
  9.2 Backpropagation through time
  9.3 Vanishing gradients
  9.4 Recurrent networks for machine translation
  9.5 Reservoir computing
  9.6 Summary
  9.7 Further reading
  9.8 Exercises

III Learning without labels

10 Unsupervised learning
  10.1 Oja's rule
  10.2 Competitive learning
  10.3 Self-organising maps
  10.4 K-means clustering
  10.5 Radial basis functions
  10.6 Autoencoders
  10.7 Summary
  10.8 Further reading
  10.9 Exercises

11 Reinforcement learning
  11.1 Associative reward-penalty algorithm
  11.2 Temporal difference learning
  11.3 Q-learning
  11.4 Summary
  11.5 Further reading
  11.6 Exercises
1 Introduction
The term neural networks historically refers to networks of neurons in the mam- malian brain. Neurons are its fundamental units of computation, and they are connected together in networks to process data. This can be a very complex task, and the dynamics of such neural networks in response to external stimuli is there- fore often quite intricate. Inputs and outputs of each neuron vary as functions of time in the form of spike trains, but also the network itself changes over time: we learn and improve our data-processing capacities by establishing new connections between neurons.
Neural-network algorithms for machine learning are inspired by the architecture and the dynamics of networks of neurons in the brain. The algorithms use highly idealised neuron models. Nevertheless, the fundamental principle is the same: artificial neural networks learn by changing the connections between their neurons. Such networks can perform a multitude of information-processing tasks.
Neural networks can for instance learn to recognise structures in a set of “training” data and, to some extent, generalise what they have learnt. A training set contains a list of input patterns together with a list of corresponding labels, or target values, that encode the properties of the input patterns the network is supposed to learn. Artificial neural networks can be trained to classify such data very accurately by ad- justing the connection strengths between their neurons, and can learn to generalise the result to other data sets – provided that the new data is not too different from the training data. A prime example for a problem of this type is object recognition in images, for instance in the sequence of camera images taken by a self-driving car. Recent interest in machine learning with neural networks is driven in part by the success of neural networks in visual object recognition.
Another task at which neural networks excel is machine translation with dynamical, or recurrent, networks. Such networks take sentences as inputs. As one feeds word after word, the network outputs the words in the translated sentence. Recurrent networks can be efficiently trained on large training sets of input sentences and their translations. Google Translate works in this way. Recurrent networks have also been used with considerable success to predict chaotic dynamics. These are all examples of supervised learning, where the networks are trained to associate certain targets, or labels, with each input.
Artificial neural networks are also good at analysing large sets of unlabeled, often high-dimensional data – where it may be difficult to determine a priori which questions are most relevant and rewarding to ask. Unsupervised-learning algorithms organise the unlabeled input data in different ways: they can for instance detect familiarity and similarity (clusters) of input patterns and other structures in the input data.
Figure 1.1: Logical OR function represented by three neurons. Neuron 3 fires actively if at least one of the neurons 1 and 2 is active. After Fig. 1b by McCulloch and Pitts [6].
Unsupervised-learning algorithms work well when there is redundancy in the input data, and they are particularly useful for large, high-dimensional data sets, where it may be a challenge to detect clusters or other data structures by inspection.
Many problems lie between these two extremes of supervised and unsupervised learning. Consider how an agent may learn to navigate a complex environment, in order to get from one location to another as quickly as possible, or expending as little energy as possible. The method of reinforcement learning allows the agent to do just that, by optimising its behaviour in response to environmental cues in the form of penalties and rewards. In short, the agent learns to act in such a way that it receives positive feedback (reward) more often than a penalty.
The tools for machine learning with neural networks were developed long ago, most of them during the second half of the last century. In 1943, McCulloch and Pitts [6] analysed how networks of neurons can process information. Using an abstract model for a neuron, they demonstrated how such units can be coupled together to represent logical functions (Figure 1.1). Their analysis and conclusions are formulated using Carnap’s logical syntax [7], not in terms of algebraic equations as we are used to today. Nevertheless, their neuron model is essentially the binary threshold unit, closely related to the fundamental building block of most neural-network algorithms for machine learning to this date. In this book we therefore refer to this model as the McCulloch-Pitts neuron. The purpose of this early research on neural networks was to explain neuro-physiological mechanisms [8]. Perhaps the most significant advance was Hebb’s learning principle, describing how neural networks learn by strengthening connections between neurons that are active simultaneously. The principle is described in Hebb’s book The organization of behavior: A neuropsychological theory [9], published in 1949.
About ten years later, research in artificial neural networks had intensified, sparked by the influential work of Rosenblatt. In 1958 he formulated a learning rule for the McCulloch-Pitts neuron, related to Hebb’s rule, and demonstrated that the rule
converges to the correct solution for all problems this model can solve [10]. He coined the term perceptron for layered networks of McCulloch-Pitts neurons, and showed that such neural networks could in principle solve tasks that a single McCulloch-Pitts neuron could not. However, there was no general learning rule for perceptrons at the time. The work of Minsky and Papert [11] emphasised the geometric aspects of learning. This allowed them to prove which kinds of problems perceptrons could solve, and which they could not. In 1969 they summarised these results in their elegant book Perceptrons. An Introduction to Computational Geometry.
Perceptrons are classifiers that output a label for each input pattern. A perceptron represents a mathematical function, an input-output mapping. A breakthrough in perceptron learning was the paper by Rumelhart et al. [12]. The authors demonstrated in 1986 that perceptrons can be trained by gradient descent. This means that the connection strengths between the neurons are iteratively changed in small steps, to eventually minimise the output error. For a single McCulloch-Pitts neuron, this gives essentially Hebb’s rule. The important point is that gradient descent allows one to efficiently train perceptrons with many layers (backpropagation for multi-layer perceptrons). A second contribution of Rumelhart et al. is the idea of using local feature maps for object recognition with neural networks. The corresponding mathematical operation is a convolution. Therefore such architectures are now known as convolutional networks.
The work of Hopfield popularised an entirely different approach, also based on Hebb’s rule. In 1982, Hopfield analysed the properties of a dynamical, or recurrent, network that functions as a memory [13]: the dynamics is designed to find stored patterns by converging to a corresponding steady state. Such Hopfield networks were especially popular amongst physicists because there are close connections to the statistical physics of spin glasses that made it possible to derive a precise mathematical understanding of such artificial neural networks. Hopfield networks are the basis for important developments in computer science. More general recurrent networks, for example, are trained like perceptrons for language processing. Hopfield networks with hidden neurons, so-called Boltzmann machines [14], are generative models that make it possible to sample from a distribution that the neural network has learned. The training algorithm for Boltzmann machines with many hidden layers [15], published in 1986, is one of the first algorithms for training networks with many layers, so-called deep networks.
An important problem in behavioural psychology is to understand how we learn from experience. One hypothesis is that desirable behaviours are reinforced by positive feedback. Around the same time as researchers began to analyse percep- trons, a different line of neural-network research emerged: to find learning rules that allow neural networks to learn by reinforcement. In many problems of this kind, the positive feedback or reward is not immediate but is received at some time in the
future, as in a board game for example. Therefore it is necessary to understand how to estimate the future reward for a certain behaviour, and how to find strategies that optimise the future reward. Reinforcement learning [16] is designed for this purpose. In 1995, an early application of this method demonstrated how a neural network could learn to play the board game backgammon [17].
A related research field originated from the neuro-physiological question: how do we learn to map visual or sensory stimuli to spatio-temporal patterns of neural activity in the brain? In 1992 Kohonen described a self-organising map [18] that successfully explains how neurons might create meaningful geometric representa- tions of inputs. At the same time, Kohonen’s algorithm is one of the first methods for non-linear dimensionality reduction for large data sets.
There are many connections between neural-network algorithms for machine learning and methods used in mathematical statistics, such as for instance Markov-chain Monte-Carlo algorithms and simulated-annealing methods. Certain unsupervised learning algorithms are related to principal-component analysis, others to clustering algorithms such as K-means clustering. Supervised learning with deep networks is essentially regression analysis, trying to fit an input-output function to the training data. In other words this is just function fitting – and usually with a very large number of fitting parameters. Recent convolutional neural networks have millions of parameters. To determine so many parameters requires very large and accurate data sets. This makes it clear that neural networks are not a solution for every problem. One of the difficult problems is to understand when machine learning with neural networks is called for, and when not. Therefore we need a detailed understanding of how the algorithms work, and in particular when and how they fail.
There were some early successes of machine learning with neural networks, but these methods were not widely used in the last century. During the past decade, by contrast, machine learning with neural networks has become increasingly successful and popular. For many applications, neural-network based algorithms are now regarded as the method of choice, for example for predicting how proteins fold [19]. What caused this paradigm shift? After all, the methods are essentially those developed forty or more years ago. A reason for the new success is perhaps that industry, in acute need of machine-learning algorithms for large data sets, has invested time and effort into generating larger and more accurate training sets than previously available. Computer hardware has improved too, so that networks with many layers containing many neurons can now be efficiently trained, making the recent progress possible.
This book is based on notes for lectures on artificial neural networks I have given during the last fifteen years at Gothenburg University and Chalmers University of Technology in Gothenburg, Sweden. When I prepared these lectures, my primary source was Introduction to the theory of neural computation by Hertz, Krogh, and Palmer [1].
Figure 1.2: Neurons in the cerebral cortex, a part of the mammalian brain. Drawing by Santiago Ramón y Cajal, the Spanish neuroscientist who received the Nobel Prize in Physiology and Medicine in 1906 together with Camillo Golgi ‘in recognition of their work on the structure of the nervous system’ [20]. Courtesy of the Cajal Institute, ‘Cajal Legacy’, Spanish National Research Council (CSIC), Madrid, Spain.
The material is organised into three parts: Hopfield networks, supervised learning of labeled data, and learning for unlabeled data sets (unsupervised and reinforcement learning). One reason for starting with Hopfield networks is that there is an elegant mathematical theory that describes how these neural networks learn, making it possible to understand their strengths and weaknesses from first principles. This is not the case for most of the other algorithms discussed in this book. The analysis of Hopfield networks sets the scene for the later parts of the book. Part II describes supervised learning with multilayer perceptrons and convolutional neural networks, starting from the simple geometrical picture emphasised by Minsky and Papert, and leading to the recent successes of convolutional networks in object recognition, and trained recurrent networks in language processing. Part III explains what neural networks can learn about data that is not labeled, with
particular emphasis on reinforcement learning. The overall goal is to explain the fundamental principles that allow neural networks to learn, emphasising ideas and concepts that are common to all three parts.
1.1 Neural networks
Different regions in the mammalian brain perform different tasks. The cerebral cortex is the outer layer of the mammalian brain. We can think of it as a thin sheet (about 2 to 5 mm thick) that folds upon itself to form a layered structure with a large surface area. The cortex is the largest and best developed part of the human brain. It contains large numbers of nerve cells, neurons. The human cerebral cortex contains about $10^{10}$ neurons. They are linked together by nerve strands (axons) that branch and end in synapses. These synapses are the connections to other neurons. The synapses connect to dendrites, branches extending from the neural cell body that are designed to receive input from other neurons in the form of electrical signals. A neuron in the human brain may have thousands of synaptic connections with other neurons. The resulting network of connected neurons in the cerebral cortex is responsible for processing of visual, audio, and sensory data.
Figure 1.2 shows neurons in the cerebral cortex. This drawing was made by Santiago Ramón y Cajal more than 100 years ago. Using a microscope, he studied the structure of neural networks in the brain and documented his observations by ink-on-paper drawings like the one reproduced in Figure 1.2. One can distinguish the cell bodies of the neural cells, their axons (f), and their dendrites. The axons of some neurons connect to the dendrites of other neurons, forming a neural network (see Ref. [21] for a slightly more detailed description of this drawing).
A schematic image of a neuron is drawn in Figure 1.3. Information is processed from left to right. On the left are the dendrites that receive signals and connect to the cell body of the neuron where the signal is processed. The right part of the Figure shows the axon, through which the output is sent to the other neurons. The axon connects to their dendrites via synapses.
Information is transmitted as an electrical signal. Figure 1.4 shows an example of the time series of the electric potential for a pyramidal neuron in fish [22]. The time series consists of an intermittent series of electrical-potential spikes. Quiescent periods without spikes occur when the neuron is inactive; during spike-rich periods we say that the neuron is active.
Figure 1.3: Schematic image of a neuron. Dendrites receive input in the form of electrical signals, via synapses. The signals are processed in the cell body of the neuron. The cell nucleus is shown in white. The output travels from the neural cell body along the axon, which connects via synapses to other neurons.
Figure 1.4: Spike train in electro-sensory pyramidal neuron of a fish. The time series is from Ref. [22]. It is reproduced by permission of the publisher. The labels were added.
1.2 McCulloch-Pitts neurons
In artificial neural networks, the ways in which information is processed and signals are transferred are highly idealised. McCulloch and Pitts [6] modelled the neuron, the computational unit of the neural network, as a binary threshold unit. It has only two possible outputs, or states: active or inactive. To compute the output, the unit sums the weighted inputs. If the sum exceeds a given threshold, the state of the neuron is said to be active, otherwise inactive. A slightly more general model than the original one is illustrated in Figure 1.5. The model performs repeated computations in discrete time steps t = 0, 1, 2, 3, . . .. The state of neuron number j at time step t is denoted by
$$ s_j(t) = \begin{cases} -1 & \text{inactive}, \\ +1 & \text{active}. \end{cases} \tag{1.1} $$
Figure 1.5: Schematic diagram of a McCulloch-Pitts neuron. The index of the neuron is i , it receives inputs from N neurons. The strength of the connection from neuron j to neuron i is denoted by wi j . The threshold value for neuron i is denoted by θi . The index t = 0, 1, 2, 3, . . . labels the discrete time sequence of computation steps, and sgn(b ) stands for the signum function [Figure 1.6 and Equation (1.3)].
Given the states $s_j(t)$, neuron number $i$ computes
$$ s_i(t+1) = \mathrm{sgn}\Big( \sum_{j=1}^{N} w_{ij} s_j(t) - \theta_i \Big) \equiv \mathrm{sgn}\big[ b_i(t) \big]. \tag{1.2} $$
Here $\mathrm{sgn}(b)$ is the signum function (Figure 1.6):
$$ \mathrm{sgn}(b) = \begin{cases} -1, & b < 0, \\ +1, & b \geq 0. \end{cases} \tag{1.3} $$
The argument of the signum function,
$$ b_i(t) = \sum_{j=1}^{N} w_{ij} s_j(t) - \theta_i, \tag{1.4} $$
is called the local field. We see that the neuron performs a weighted average of the inputs $s_j(t)$. The parameters $w_{ij}$ are called weights. Here the first index, $i$, refers to the neuron that does the computation, and $j$ labels the neurons that connect to neuron $i$. In general, weights between different pairs of neurons assume different numerical values, reflecting different strengths of the synaptic couplings. Weights can be positive or negative, and we say that there is no connection when $w_{ij} = 0$.
In this book we refer to the model described in Figure 1.5 as the McCulloch- Pitts neuron, although their original model had some additional constraints on the weights. The threshold1 for neuron i is denoted by θi .
1In the deep-learning literature [4], the thresholds are called biases, defined as the negative of θi , with a plus sign in Equation (1.4). In this book we use the convention (1.4), with the minus sign.
Figure 1.6: Signum function [Equation (1.3)].
Finally, note that the computation (1.2) is performed for all neurons $i$ in parallel, given the states $s_j(t)$ at time step $t$. The outputs $s_i(t+1)$ are the inputs to all neurons at the next time step. Therefore the outputs have the time argument $t+1$. These steps are repeated many times, resulting in time series of the activity levels of all neurons in the network.
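As an illustration, here is a minimal Python sketch of one synchronous update (1.2). The weights, thresholds, and initial state are arbitrary example values, and the sign convention (1.3), sgn(0) = +1, is coded explicitly because numpy.sign returns 0 for a zero argument.

```python
import numpy as np

def sgn(b):
    # Signum function with the convention (1.3): sgn(b) = +1 for b >= 0, -1 otherwise.
    return np.where(b >= 0.0, 1, -1)

def synchronous_step(s, w, theta):
    # One parallel update (1.2): all neurons are updated from the stored state s(t).
    b = w @ s - theta              # local fields b_i(t), Equation (1.4)
    return sgn(b)                  # new states s_i(t+1)

# Arbitrary example with N = 3 neurons.
rng = np.random.default_rng(seed=1)
w = rng.normal(size=(3, 3))        # weights w_ij
theta = np.zeros(3)                # thresholds theta_i
s = np.array([1, -1, 1])           # initial states s_j(t = 0)

for t in range(5):
    s = synchronous_step(s, w, theta)
    print(t + 1, s)
```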
1.3 Activation functions
The McCulloch-Pitts model approximates the patterns of spiking activity in Figure 1.4 in terms of two states, −1 and +1, representing the inactive and active periods shown in the Figure. For many computation tasks this is sufficient, and for our purposes it does not matter that the dynamics of electrical signals in the cortex is quite different in detail. The aim after all is not to model the neural dynamics in the brain, but to construct computation models inspired by real neural dynamics.
It will become apparent later that the simplest model described above must be generalised somewhat for certain tasks and questions. For example, the jump in the signum function at b = 0 may cause large fluctuations in the activity levels of a network of neurons, caused by infinitesimal changes of the local fields across b = 0. To dampen this effect, one allows the neuron to respond continuously to its inputs, replacing Eq. (1.2) by
$$ s_i(t+1) = g\Big( \sum_j w_{ij} s_j(t) - \theta_i \Big). \tag{1.5} $$
Here g (b ) is a continuous activation function. It could just be a linear function, g (b ) ∝ b . But we shall see that many tasks require non-linear activation functions, such as tanh(b ) (Figure 1.7). When the activation function is continuous, the neuron states assume continuous values too, not just the discrete values −1 and +1 given in Equation (1.1).
Alternatively one may use a piecewise linear activation function (Figure 1.8).
Figure 1.7: Continuous activation function $g(b) = \tanh(b)$.
This is motivated in part by the response curve of the leaky integrate-and-fire neuron, a model for the relation between the electrical current $I$ through the cell membrane into the neuron cell, and the membrane potential $U$. The simplest model for the dynamics of the membrane potential represents the neuron as a capacitor. In the leaky integrate-and-fire neuron, leakage is added by a resistor $R$ in parallel with the capacitor $C$, so that
$$ I = \frac{U}{R} + C\,\frac{\mathrm{d}U}{\mathrm{d}t}. \tag{1.6} $$
For a constant current, the membrane potential grows from zero as a function of time, $U(t) = RI[1 - \exp(-t/\tau)]$, where $\tau = RC$ is the time constant of the model. One says that the neuron produces a spike when the membrane potential exceeds a critical value, $U_c$. Immediately after, the membrane potential is set to zero (and begins to grow again). In this model, the firing rate $f(I)$ is thus given by $t_c^{-1}$, where $t_c$ is the solution of $U(t_c) = U_c$. It follows that the firing rate exhibits a threshold behaviour. In other words, the system works like a rectifier:
$$ f(I) = \begin{cases} 0 & \text{for } I \leq U_c/R, \\[6pt] \Big[ \tau \log\Big( \dfrac{RI}{RI - U_c} \Big) \Big]^{-1} & \text{for } I > U_c/R. \end{cases} \tag{1.7} $$
Figure 1.8: (a) Firing rate of a leaky integrate-and-fire neuron as a function of the electrical current I through the cell membrane, Equation (1.7) for τ = 25 and Uc/R = 2. (b) Piecewise linear activation function, g (b ) = max{0, b }.
This response curve is illustrated in Figure 1.8 (a). The main point is that there is a threshold below which the response is strictly zero. The response function looks qualitatively like the piecewise linear function
$$ g(b) = \max\{0, b\}, \tag{1.8} $$
shown in panel (b). Neurons with this activation function are called rectified linear units, and the activation function (1.8) is called the ReLU function.
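As a brief illustration, the following Python sketch evaluates the firing rate (1.7) and the ReLU function (1.8); the parameter values τ = 25 and U_c/R = 2 are those quoted in the caption of Figure 1.8, and the sample inputs are arbitrary.

```python
import numpy as np

def firing_rate(I, tau=25.0, Uc_over_R=2.0):
    # Firing rate f(I) of the leaky integrate-and-fire neuron, Equation (1.7).
    # Below the threshold, I <= Uc/R, the neuron never spikes and the rate is zero.
    I = np.asarray(I, dtype=float)
    f = np.zeros_like(I)
    above = I > Uc_over_R
    f[above] = 1.0 / (tau * np.log(I[above] / (I[above] - Uc_over_R)))
    return f

def relu(b):
    # Piecewise linear activation function g(b) = max{0, b}, Equation (1.8).
    return np.maximum(0.0, b)

print(firing_rate([1.0, 2.5, 5.0, 10.0]))
print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))
```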
1.4 Asynchronous updates

Equations (1.2) and (1.5) are called synchronous update rules, because all neurons are updated in parallel: at time step $t$ all inputs $s_j(t)$ are stored. Then all neurons $i$ are simultaneously updated using the stored inputs. An alternative is to update only a single neuron per iteration, the one with index $m$ say:
$$ s_i(t+1) = \begin{cases} g\big( \sum_j w_{mj} s_j(t) - \theta_m \big) & \text{for } i = m, \\ s_i(t) & \text{otherwise}. \end{cases} \tag{1.9} $$
This is called an asynchronous update rule. Different schemes for choosing neurons are used in asynchronous updating. One possibility is to arrange the neurons into a two-dimensional array and to update them one by one, in a certain order. In the typewriter scheme, for example, one updates the neurons in the top row of the array first, from left to right, then the second row from left to right, and so forth. A second possibility is to choose randomly which neuron to update.
If there are $N$ neurons, then one synchronous step corresponds to $N$ asynchronous steps, on average. This difference in time scales is not the only difference between synchronous and asynchronous updating. The asynchronous dynamics can be shown to converge to a definite state in certain cases, while the synchronous dynamics may fail to do so, resulting in periodic cycles that persist forever.
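For concreteness, here is a small Python sketch of one asynchronous sweep in the typewriter scheme and with randomly chosen neurons; the network itself (weights, thresholds, initial state) is an arbitrary example.

```python
import numpy as np

def asynchronous_sweep(s, w, theta, order):
    # One sweep of asynchronous updates (1.9): neurons are updated one at a time,
    # each update immediately using the already-updated states of the other neurons.
    for m in order:
        b_m = w[m] @ s - theta[m]
        s[m] = 1 if b_m >= 0 else -1
    return s

rng = np.random.default_rng(seed=2)
N = 6
w = rng.normal(size=(N, N))            # arbitrary weights
theta = np.zeros(N)                    # thresholds
s = rng.choice([-1, 1], size=N)        # random initial state

typewriter_order = np.arange(N)        # fixed order 0, 1, ..., N-1
random_order = rng.permutation(N)      # a random order, drawn anew for each sweep

s = asynchronous_sweep(s, w, theta, typewriter_order)
s = asynchronous_sweep(s, w, theta, random_order)
print(s)
```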
1.5 Summary
Artificial neural networks use a highly idealised model for the fundamental compu- tation unit: the McCulloch-Pitts neuron (Figure 1.5) is a binary threshold unit, very similar to the model introduced originally by McCulloch and Pitts [6]. The units are linked together by weights wi j , and each unit computes a weighted average of its in- puts. The network performs these computations in sequence. Most neural-network algorithms are built using the model described in this Chapter.
1.6 Further reading
Two accounts of the history of artificial neural networks are especially recommended. First, the early history of the field is summarised in the Introduction to the second edition of Perceptrons. An introduction to computational geometry by Minsky and Papert [11], which came out in 1988. This book also contains a concise bibliography of the important papers, with comments by Minsky and Papert. Second, in a short note, Kanal [23] reviews the work of Rosenblatt and puts it into context.
PART I HOPFIELD NETWORKS
Hopfield networks [13, 24] are artificial neural networks that can recognise or reconstruct images, for instance the binary images of digits shown in Figure 2.1. The images are stored in the network by choosing the weights $w_{ij}$ according to Hebb’s rule [9]. One feeds a distorted digit (Figure 2.2) to the network by assigning the initial states of its neurons to the bits of the distorted digit. The idea is that the neural-network dynamics converges to the closest stored digit. In this way the network can recognise the input as a distorted version of the correct pattern and retrieve the correct digit. Hopfield networks recognise patterns with many bits quite efficiently, and in the past such networks were used to perform pattern-recognition tasks. Today there are more efficient algorithms for this purpose (Chapter 8).
Nevertheless, Hopfield networks exemplify fundamental principles of machine learning with neural networks. For a start, most neural-network algorithms discussed in this book are built from similar building blocks and use learning rules related to Hebb’s rule. Hopfield networks are examples of recurrent networks: their neurons are updated following a dynamical rule. Widely used algorithms for machine translation and time-series prediction are based on this principle.
Furthermore, restricted Boltzmann machines are closely related to Hopfield net- works. These machines use hidden neurons to learn distributions of input patterns. This makes it possible to generate image textures and to complete partially obscured images [25]. Generalisations of these machines, deep-belief networks, are examples of the first deep network architectures for machine learning. Restricted Boltzmann machines have been developed into more efficient generative models, Helmholtz machines, to sample new patterns similar to those in a given data distribution. The training algorithm for recent generative models, variational autoencoders, is based on the same principles as the learning rule for Helmholtz machines.
The dynamics of Hopfield networks is closely related to stochastic Markov-chain Monte-Carlo algorithms used in a wide range of problems in the natural sciences. Hopfield networks highlight the role of stochasticity in neural-network dynamics. A certain degree of noise, not too much, can substantially improve the performance of the Hopfield network. In engineering problems it is usually better to avoid stochasticity, when it is due to multiplicative or additive noise that diminishes the performance of the system. In neural-network dynamics, by contrast, stochasticity is often helpful, because it allows the network to explore a wider range of configurations or actions and thus helps the dynamics to converge to a better solution. In general it is challenging to analyse the stochastic dynamics of neural networks. But for the Hopfield network much is known. The reason is that Hopfield networks are closely related to stochastic systems studied in statistical physics, so-called spin glasses [26, 27].
Figure 2.1: Binary representation of the digits 0 to 4. Each pattern has 10 × 16 pixels. Adapted from Figure 14.17 in Ref. [2]. The slightly peculiar shapes help the Hopfield network to distinguish the patterns [28].
2 Deterministic Hopfield networks

2.1 Pattern recognition
As an example for a pattern-recognition task, consider $p$ images (patterns). The patterns could for instance be the letters in the alphabet, or the digits shown in Figure 2.1. The different patterns are indexed by $\mu = 1, \ldots, p$. Here $p$ is the number of patterns ($p = 5$ in Figure 2.1). The bits of pattern $\mu$ are denoted by $x_i^{(\mu)}$. The index $i = 1, \ldots, N$ labels the different bits of a given pattern, and $N$ is the number of bits per pattern ($N = 160$ in Figure 2.1). The bits are binary: they can take only the values $-1$ and $+1$. To determine the generic properties of the algorithm, one often turns to random patterns where each bit $x_i^{(\mu)}$ is chosen randomly, taking either value with probability equal to $\frac{1}{2}$. It is convenient to gather the bits of a pattern in a column vector
$$ \boldsymbol{x}^{(\mu)} = \begin{bmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{bmatrix}. \tag{2.1} $$
In this book, vectors are written in bold math font.
The task of the neural network is to recognise distorted patterns, to determine for
instance that the pattern on the right in Figure 2.2 is a perturbed version of the digit on the left in this Figure. To this end, one stores p patterns in the network, presents it with a distorted version of one of these patterns, and asks the network to find the stored pattern that is most similar to the distorted one.
The formulation of the problem makes it necessary to define how similar a distorted pattern $\boldsymbol{x}$ is to any of the stored patterns, $\boldsymbol{x}^{(\nu)}$ say.
Figure 2.2: Binary image ($N = 160$) of the digit 0, and a distorted version of the same image. There are $N = 160$ bits $x_i$, $i = 1, \ldots, N$; a filled pixel stands for $x_i = +1$ while an empty pixel denotes $x_i = -1$.
One possibility is to quantify the distance $d_\nu$ between the patterns $\boldsymbol{x}$ and $\boldsymbol{x}^{(\nu)}$ in terms of the mean squared error
$$ d_\nu = \frac{1}{4N} \sum_{i=1}^{N} \big( x_i - x_i^{(\nu)} \big)^2. \tag{2.2} $$
The prefactor is chosen so that the distance equals the fraction of bits by which two ±1-patterns differ. Note that the distance (2.2) does not refer to distortions by translations, rotations, or shearing. An improved measure of distance might take the minimum distance between the patterns subject to all possible translations, rotations, and so forth.
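As a quick illustration, the following Python sketch evaluates the distance (2.2) for two short example patterns (chosen arbitrarily) and confirms that it equals the fraction of differing bits.

```python
import numpy as np

def distance(x, x_nu):
    # Mean-squared-error distance (2.2) between two patterns with +/-1 bits.
    x, x_nu = np.asarray(x), np.asarray(x_nu)
    return np.sum((x - x_nu) ** 2) / (4.0 * x.size)

x_nu = np.array([1, 1, -1, 1, -1, -1, 1, 1])    # example stored pattern
x = np.array([1, -1, -1, 1, -1, -1, 1, -1])     # distorted version, two bits flipped

print(distance(x, x_nu))       # 0.25
print(np.mean(x != x_nu))      # fraction of differing bits, also 0.25
```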
2.2 Hopfield networks and Hebb’s rule
Hopfield networks [13, 24] are networks of McCulloch-Pitts neurons designed to solve the pattern-recognition task described in the previous Section. The bits of the patterns to be recognised are encoded in a particular choice of weights called Hebb’s rule, as explained in the following.
All possible states of the McCulloch-Pitts neurons in the network,
$$ \boldsymbol{s} = \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_N \end{bmatrix}, \tag{2.3} $$
form the configuration space of the network. The components of the state $\boldsymbol{s}$ are updated either with the synchronous rule (1.2),
$$ s_i(t+1) = \mathrm{sgn}\big[ b_i(t) \big] \quad \text{with local field} \quad b_i(t) = \sum_{j=1}^{N} w_{ij} s_j(t) - \theta_i, \tag{2.4} $$
or with the asynchronous rule
$$ s_i(t+1) = \begin{cases} \mathrm{sgn}\big[ b_m(t) \big] & \text{for } i = m, \\ s_i(t) & \text{otherwise}. \end{cases} \tag{2.5} $$
To recognise a distorted pattern, one feeds its bits $x_i$ into the network by assigning the initial states of the neurons to the pattern bits,
$$ s_i(t=0) = x_i. \tag{2.6} $$
The idea is to choose a set of weights $w_{ij}$ so that the network dynamics (2.4) or (2.5) converges to the correct stored pattern. The choice of weights must depend on all $p$ patterns, $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(p)}$. We say that these patterns are stored in the network by assigning appropriate weights. For example, if $\boldsymbol{x}$ is a distorted version of $\boldsymbol{x}^{(\nu)}$ (Figure 2.2), then we want the network to converge to this pattern:
$$ \text{if } \boldsymbol{s}(t=0) = \boldsymbol{x} \approx \boldsymbol{x}^{(\nu)} \text{ then } \boldsymbol{s}(t) \to \boldsymbol{x}^{(\nu)} \text{ as } t \to \infty. \tag{2.7} $$
Equation (2.7) means that the network corrects the errors in the distorted pattern x . If this works, the stored pattern x (ν) is said to be an attractor of the network dynamics.
But convergence is not guaranteed. If the initial distortion is too large, the network may converge to another pattern, or to some other state that bears no or little relation to the stored patterns, or it may not converge at all. The region around pattern x (ν) in which all patterns x converge to x (ν) is called the region of attraction of x (ν). The size of this region depends in an intricate way upon the ensemble of stored patterns, and there is no general convergence proof.
Therefore we ask a simpler question first: if one feeds one of the undistorted patterns, for instance $\boldsymbol{x}^{(\nu)}$, does the network recognise it as one of the stored, undistorted ones? The network should not make any changes to $\boldsymbol{x}^{(\nu)}$ because all bits are correct:
$$ \text{if } \boldsymbol{s}(t=0) = \boldsymbol{x}^{(\nu)} \text{ then } \boldsymbol{s}(t) = \boldsymbol{x}^{(\nu)} \text{ for all } t = 0, 1, 2, \ldots. \tag{2.8} $$
How can we choose weights and thresholds to ensure that Equation (2.8) holds? Let us consider a simple case first, where there is only one pattern, x (1), to recognise. In this case a suitable choice of weights and thresholds is
$$ w_{ij} = \frac{1}{N} x_i^{(1)} x_j^{(1)} \quad \text{and} \quad \theta_i = 0. \tag{2.9} $$
This learning rule is reminiscent of a relation between activation patterns of neurons and their coupling, postulated by Hebb [9] more than 70 years ago:
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.
This is a mechanism for learning through establishing connections: the coupling between neurons tends to increase if they are active at the same time. Equation (2.9) expresses an analogous principle. Together with Equation (2.6), it says that the coupling $w_{ij}$ between two neurons is positive if they are both active ($s_i = s_j = 1$); if their states differ, the coupling is negative. Therefore the rule (2.9) is called Hebb’s rule. Hopfield networks are networks of McCulloch-Pitts neurons with weights determined by Hebb’s rule.
Does a Hopfield network recognise the pattern $\boldsymbol{x}^{(1)}$ stored in this way? To check that the rule (2.9) does the trick, we feed the pattern to the network by assigning $s_j(t=0) = x_j^{(1)}$, and evaluate Equation (2.4):
$$ \sum_{j=1}^{N} w_{ij} x_j^{(1)} = \frac{1}{N} \sum_{j=1}^{N} x_i^{(1)} x_j^{(1)} x_j^{(1)} = \frac{1}{N} \sum_{j=1}^{N} x_i^{(1)} = x_i^{(1)}. \tag{2.10} $$
The second equality follows because $x_j^{(1)}$ can only take the values $\pm 1$, so that $[x_j^{(1)}]^2 = 1$. Using $\mathrm{sgn}(x_i^{(1)}) = x_i^{(1)}$, we obtain
$$ \mathrm{sgn}\Big( \sum_{j=1}^{N} w_{ij} x_j^{(1)} \Big) = x_i^{(1)}. \tag{2.11} $$
Comparing Equation (2.11) with the update rule (2.4) shows that the bits $x_i^{(1)}$ of the pattern $\boldsymbol{x}^{(1)}$ remain unchanged under the update, as required by Equation (2.8). The network recognises the pattern as a stored one, so Equation (2.9) does what we asked for. Note that we obtain the same result if we leave out the factor of $N^{-1}$ in Equation (2.9).
But does the network correct errors? In other words, is the pattern x (1) an attractor [Eq. (2.7)]? This question cannot be answered in general. Yet in practice Hopfield networks work often quite well. It is a fundamental insight that neural networks may perform well although no proof exists that their dynamics converges to the correct solution.
To illustrate the difficulties consider an example, a Hopfield network with $N = 4$ neurons (Figure 2.3). Store the pattern $\boldsymbol{x}^{(1)}$ shown in Figure 2.3 by assigning the weights $w_{ij}$ using Equation (2.9).
Figure 2.3: Hopfield network with $N = 4$ neurons. (a) Network layout. Neuron $i$ is represented as a circle labelled $i$. Arrows indicate symmetric connections. (b) Pattern $\boldsymbol{x}^{(1)}$.
Now feed a distorted pattern $\boldsymbol{x}$ that has a non-zero distance to $\boldsymbol{x}^{(1)}$ [Figure 2.4]:
$$ d_1 = \frac{1}{16} \sum_{i=1}^{4} \big( x_i - x_i^{(1)} \big)^2 = \frac{1}{4}. \tag{2.12} $$
To feed the pattern to the network, we set s j (t = 0) = x j . Then we iterate the dynamics using the synchronous rule (2.4). Results for different distorted patterns are shown in Figure 2.4. We see that the first two distorted patterns converge to the stored pattern, cases (a) and (b). But the third distorted pattern does not [case (c)].
To understand this behaviour, we analyse the synchronous dynamics (2.4) using the weight matrix
$$ \mathbb{W} = \frac{1}{N} \boldsymbol{x}^{(1)} \boldsymbol{x}^{(1)\mathsf{T}}. \tag{2.13} $$
Here $\boldsymbol{x}^{(1)\mathsf{T}}$ denotes the transpose of the column vector $\boldsymbol{x}^{(1)}$, so that $\boldsymbol{x}^{(1)\mathsf{T}}$ is a row vector. The standard rules for matrix multiplication apply also to column and row vectors; they are just $N \times 1$ and $1 \times N$ matrices. So the product on the r.h.s. of Equation (2.13) is an $N \times N$ matrix.
Figure 2.4: Reconstruction of a distorted pattern. Under synchronous updating (2.4) the first two distorted patterns (a) and (b) converge to the stored pattern $\boldsymbol{x}^{(1)}$, but pattern (c) does not.
Figure 2.5: Reproduced from xkcd.com/1838 under the creative commons attribution-noncommercial 2.5 license.
In the following, matrices with elements $A_{ij}$ or $B_{ij}$ are written as $\mathbb{A}$, $\mathbb{B}$, and so forth. The product in Equation (2.13) is also referred to as an outer product. If we change the order of $\boldsymbol{x}^{(1)}$ and $\boldsymbol{x}^{(1)\mathsf{T}}$ in the product, we obtain a number instead:
$$ \boldsymbol{x}^{(1)\mathsf{T}} \boldsymbol{x}^{(1)} = \sum_{j=1}^{N} [x_j^{(1)}]^2 = N. \tag{2.14} $$
The product (2.14) is called a scalar product. It is also denoted by $\boldsymbol{a} \cdot \boldsymbol{b} = \boldsymbol{a}^{\mathsf{T}} \boldsymbol{b}$ and equals $|\boldsymbol{a}||\boldsymbol{b}| \cos\varphi$, where $\varphi$ is the angle between the vectors $\boldsymbol{a}$ and $\boldsymbol{b}$, and $|\boldsymbol{a}|$ is the magnitude of $\boldsymbol{a}$. We use the same notation for multiplying a transposed vector with a matrix from the left: $\boldsymbol{x} \cdot \mathbb{W} = \boldsymbol{x}^{\mathsf{T}} \mathbb{W}$. An excellent source for those not familiar with these terms from linear algebra (Figure 2.5) is Chapter 6 of Mathematical methods of physics by Mathews and Walker [29].
Using Equation (2.14) we see that $\mathbb{W}$ projects onto the vector $\boldsymbol{x}^{(1)}$,
$$ \mathbb{W} \boldsymbol{x}^{(1)} = \boldsymbol{x}^{(1)}. \tag{2.15} $$
In addition, it follows from Equation (2.14) that the matrix (2.13) is idempotent:
$$ \mathbb{W}^t = \mathbb{W} \quad \text{for } t = 1, 2, 3, \ldots. \tag{2.16} $$
Equations (2.15) and (2.16) together with $\mathrm{sgn}(x_i^{(1)}) = x_i^{(1)}$ imply that the network recognises the pattern $\boldsymbol{x}^{(1)}$ as the stored one. This example illustrates the general proof, Equations (2.10) and (2.11).
Now consider the distorted pattern (a) in Figure 2.4. We feed this pattern to the network by assigning
$$ \boldsymbol{s}(t=0) = \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \end{bmatrix}. \tag{2.17} $$
To compute one step in the synchronous dynamics (2.4) we apply $\mathbb{W}$ to $\boldsymbol{s}(t=0)$. This is done in two steps, using the outer-product form (2.13) of the weight matrix. We first multiply $\boldsymbol{s}(t=0)$ with $\boldsymbol{x}^{(1)\mathsf{T}}$ from the left,
$$ \boldsymbol{x}^{(1)\mathsf{T}} \boldsymbol{s}(t=0) = \begin{bmatrix} 1, & -1, & -1, & 1 \end{bmatrix} \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \end{bmatrix} = 2, \tag{2.18} $$
and then we multiply this result with $\boldsymbol{x}^{(1)}$. This results in:
$$ \mathbb{W} \boldsymbol{s}(t=0) = \tfrac{1}{2} \boldsymbol{x}^{(1)}. \tag{2.19} $$
The signum of the $i$-th component of the vector $\mathbb{W}\boldsymbol{s}(t=0)$ yields $s_i(t=1)$:
$$ s_i(t=1) = \mathrm{sgn}\Big( \sum_{j=1}^{N} w_{ij} s_j(t=0) \Big) = x_i^{(1)}. \tag{2.20} $$
We conclude that the state of the network converges to the stored pattern, in one synchronous update. Since $\mathbb{W}$ is idempotent, the network stays there: the pattern $\boldsymbol{x}^{(1)}$ is an attractor. Case (b) in Figure 2.4 works in a similar way.
Now look at case (c), where the network fails to converge to the stored pattern. We feed this pattern to the network by assigning $\boldsymbol{s}(t=0) = [-1, 1, -1, -1]^{\mathsf{T}}$. For one iteration of the synchronous dynamics we evaluate
$$ \boldsymbol{x}^{(1)\mathsf{T}} \boldsymbol{s}(0) = \begin{bmatrix} 1, & -1, & -1, & 1 \end{bmatrix} \begin{bmatrix} -1 \\ 1 \\ -1 \\ -1 \end{bmatrix} = -2. \tag{2.21} $$
It follows that
$$ \mathbb{W} \boldsymbol{s}(t=0) = -\tfrac{1}{2} \boldsymbol{x}^{(1)}. \tag{2.22} $$
Using the update rule (2.4) we find
$$ \boldsymbol{s}(t=1) = -\boldsymbol{x}^{(1)}. \tag{2.23} $$
Equation (2.16) implies that
$$ \boldsymbol{s}(t) = -\boldsymbol{x}^{(1)} \quad \text{for } t \geq 1. \tag{2.24} $$
Thus the network shown in Figure 2.3 has two attractors, the pattern $\boldsymbol{x}^{(1)}$ as well as the inverted pattern $-\boldsymbol{x}^{(1)}$. As we shall see in Section 2.5, this is a general property of Hopfield networks: if $\boldsymbol{x}^{(1)}$ is an attractor, then the pattern $-\boldsymbol{x}^{(1)}$ is an attractor too. In the next Section we discuss what happens when more than one pattern is stored in the Hopfield network.
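The example of Figures 2.3 and 2.4 is easily reproduced numerically. The following Python sketch stores the pattern $\boldsymbol{x}^{(1)} = [1,-1,-1,1]^{\mathsf{T}}$ with Hebb's rule (2.9) and iterates the synchronous dynamics (2.4) for the distorted states used in Equations (2.17) and (2.21); case (b) of Figure 2.4 is not repeated here.

```python
import numpy as np

def sgn(b):
    # Sign convention (1.3): sgn(0) = +1.
    return np.where(b >= 0.0, 1, -1)

x1 = np.array([1, -1, -1, 1])        # stored pattern x^(1), N = 4 neurons
N = x1.size
W = np.outer(x1, x1) / N             # Hebb's rule (2.9), outer-product form (2.13)

def synchronous_dynamics(s, W, steps=5):
    # Iterate the synchronous rule (2.4) with zero thresholds.
    for _ in range(steps):
        s = sgn(W @ s)
    return s

s_a = np.array([-1, -1, -1, 1])      # distorted pattern of Equation (2.17), case (a)
s_c = np.array([-1, 1, -1, -1])      # distorted pattern of case (c)

print(synchronous_dynamics(s_a, W))  # converges to x^(1)
print(synchronous_dynamics(s_c, W))  # converges to the inverted pattern -x^(1)
```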
2.3 The cross-talk term
When there are more patterns than just one, we need to generalise Equation (2.9). One possibility is to simply sum Equation (2.9) over the stored patterns [13]:
$$ w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)} \quad \text{and} \quad \theta_i = 0. \tag{2.25} $$
Equation (2.25) generalises Hebb’s rule to $p$ patterns. Because of the sum over $\mu$, the relation to Hebb’s learning hypothesis is less clear, but we nevertheless refer to Equation (2.25) as Hebb’s rule. At any rate, we see that the weights are proportional to the second moments of the pattern bits. It is plausible that a neural network based upon the rule (2.25) can recognise properties of the patterns $\boldsymbol{x}^{(\mu)}$ that are encoded in the two-point correlations of their bits.
Note that the weights are symmetric, $w_{ij} = w_{ji}$. Also, note that the prefactor $N^{-1}$ in Equation (2.25) is not important. It is chosen to simplify the large-$N$ analysis of the model (Chapter 3). An alternative version of Hebb’s rule [2, 13] sets the diagonal weights to zero:
$$ w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} x_i^{(\mu)} x_j^{(\mu)} \ \text{ for } i \neq j, \quad w_{ii} = 0, \quad \text{and} \quad \theta_i = 0. \tag{2.26} $$
In this Section we use the form (2.26) of Hebb’s rule. If we assign the weights in this way, does the network recognise the stored patterns? The question is whether
$$ \mathrm{sgn}\Big( \underbrace{\frac{1}{N} \sum_{j \neq i} \sum_{\mu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}}_{= b_i^{(\nu)}} \Big) \stackrel{?}{=} x_i^{(\nu)}. \tag{2.27} $$
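As an illustration of the rule (2.26) and of condition (2.27), the following Python sketch stores $p$ random patterns and counts how many of their bits would be changed by a single update; the values of $N$ and $p$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
N, p = 100, 10                           # arbitrary network size and number of patterns
x = rng.choice([-1, 1], size=(p, N))     # p random patterns with +/-1 bits

# Hebb's rule (2.26): sum over patterns, zero diagonal weights, zero thresholds.
W = (x.T @ x) / N
np.fill_diagonal(W, 0.0)

# Check condition (2.27) for every bit of every stored pattern.
errors = 0
for nu in range(p):
    b = W @ x[nu]                        # local fields b_i^(nu)
    s_new = np.where(b >= 0.0, 1, -1)
    errors += np.sum(s_new != x[nu])
print(errors / (N * p))                  # fraction of bits that an update would change
```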
To check whether Equation (2.27) holds or not, we repeat the calculation described in the previous Section. As a first step we evaluate the local field
$$ b_i^{(\nu)} = \Big( 1 - \frac{1}{N} \Big) x_i^{(\nu)} + \frac{1}{N} \sum_{j \neq i} \sum_{\mu \neq \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}. \tag{2.28} $$
Here we split the sum over the patterns into two contributions. The first term corresponds to μ = ν, where ν refers to the pattern that was fed to the network, the one we want the network to recognise. The second term in Equation (2.28) contains the sum over the remaining patterns. Condition (2.27) is satisfied if the second term in (2.28) does not affect the sign of the r.h.s. of this Equation. This second term is called the cross-talk term.
Whether adding the cross-talk term to $x_i^{(\nu)}$ affects $\mathrm{sgn}(b_i^{(\nu)})$ or not depends on
the stored patterns. Since the cross-talk term contains a sum over μ = 1, . . . , p , we expect that this term does not matter if p is small enough. The fewer patterns we store, the more likely it is that all of them are recognised. Furthermore, by analogy with the example described in the previous Section, it is plausible that the stored patterns are then also attractors, so that slightly distorted patterns converge to the correct stored pattern.
For a more quantitative analysis of the effect of the cross-talk term, we store patterns with random bits (random patterns). Different bits are assigned $\pm 1$ independently with equal probability:
$$ \text{Prob}(x_i^{(\nu)} = \pm 1) = \tfrac{1}{2}. \tag{2.29} $$
This means in particular that different patterns are uncorrelated, because their covariance vanishes:
$$ \langle x_i^{(\mu)} x_j^{(\nu)} \rangle = \delta_{ij} \delta_{\mu\nu}. \tag{2.30} $$
Here $\langle \cdots \rangle$ denotes an average over many realisations of random patterns, and $\delta_{ij}$ is the Kronecker delta, equal to unity if $i = j$ but zero otherwise. Note that $\langle x_j^{(\mu)} \rangle = 0$. This follows from Equation (2.29).
Given an ensemble of random patterns, what is the probability that the cross-talk term changes $\mathrm{sgn}(b_i^{(\nu)})$? In other words, what is the probability that the network produces a wrong bit in one asynchronous update, if all bits were initially correct? The magnitude of the cross-talk term does not matter when it has the same sign as $x_i^{(\nu)}$. If it has a different sign, then the cross-talk term matters if its magnitude is larger than unity (the magnitude of $x_i^{(\nu)}$). To simplify the analysis, one defines
$$ C_i^{(\nu)} \equiv - x_i^{(\nu)} \underbrace{\frac{1}{N} \sum_{j \neq i} \sum_{\mu \neq \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)}}_{\text{cross-talk term}}. \tag{2.31} $$
If $C_i^{(\nu)} < 0$ then the cross-talk term has the same sign as $x_i^{(\nu)}$, so that the cross-talk term does not make a difference: adding this term does not change the sign of $x_i^{(\nu)}$. If $0 < C_i^{(\nu)} < 1$, the cross-talk term has the opposite sign, but it is too small to change the sign of the local field (we approximated $1 - 1/N \approx 1$ in Equation (2.28), assuming that $N$ is large). The network produces an error in bit $i$ of pattern $\nu$ if $C_i^{(\nu)} > 1$.
2.4 One-step error probability
The one-step error probability $P_{\text{error}}^{t=1}$ is defined as the probability that an error occurs in one attempt to update a bit, given that initially all bits were correct:
$$ P_{\text{error}}^{t=1} = \text{Prob}(C_i^{(\nu)} > 1). \tag{2.32} $$
Since patterns and bits are identically distributed, $\text{Prob}(C_i^{(\nu)} > 1)$ does not depend on $i$ or $\nu$. Therefore $P_{\text{error}}^{t=1}$ does not carry any indices.
How does $P_{\text{error}}^{t=1}$ depend on the parameters of the problem, $p$ and $N$? When both $p$ and $N$ are large, we can use the central-limit theorem [29, 30] to answer this question. Since different bits/patterns are independent, we can think of $C_i^{(\nu)}$ as a sum of independent random numbers $c_m$ that take the values $-1$ and $+1$ with equal probabilities,
$$ C_i^{(\nu)} = -\frac{1}{N} \sum_{j \neq i} \sum_{\mu \neq \nu} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\nu)} x_i^{(\nu)} = -\frac{1}{N} \sum_{m=1}^{(N-1)(p-1)} c_m. \tag{2.33} $$
There are M = (N − 1)(p − 1) terms in the sum on the r.h.s. because terms with μ = ν are excluded, and also those with j = i [Equation (2.26)]. If we use the rule (2.25) instead, then there is a correction to Equation (2.33) from the diagonal weights. For p ≪ N this correction is small.
When $p$ and $N$ are large, the sum $\sum_{m=1}^{M} c_m$ contains a large number of independent, identically distributed random numbers with mean zero and variance unity. It follows from the central-limit theorem [29, 30] that $\sum_{m=1}^{M} c_m$ is Gaussian distributed with mean zero and variance $M$.
Since the central-limit theorem plays an important role in the analysis of neural-network algorithms, it is worth discussing this theorem in a little more detail. To begin with, note that the sum $\sum_{m=1}^{M} c_m$ equals $2k - M$, where $k$ is the number of occurrences of $c_m = +1$ in the sum. Choosing $c_m$ randomly to equal either $-1$ or $+1$ is called a Bernoulli trial [30], and the probability $P_{k,M}$ of drawing $k$ times $+1$ and $M - k$ times $-1$ is given by the binomial distribution [30]. In our case the probability of $c_m = \pm 1$ equals $\frac{1}{2}$, so that
$$ P_{k,M} = \binom{M}{k} \Big( \frac{1}{2} \Big)^{k} \Big( \frac{1}{2} \Big)^{M-k}. \tag{2.34} $$
Figure 2.6: Gaussian distribution of the quantity $C$ defined in Equation (2.31).

Here $\binom{M}{k} = M!/[k!\,(M-k)!]$ denotes the number of ways in which $k$ occurrences of $+1$ can be distributed over $M$ places.
We want to show that $P_{k,M}$ approaches a Gaussian distribution for large $M$, with mean zero and with variance $M$. Since the variance diverges as $M \to \infty$, it is convenient to use the variable $z = (2k - M)/\sqrt{M}$. The central-limit theorem implies that $z$ is Gaussian with mean zero and unit variance in the limit of large $M$. To prove that this is the case, we substitute $k = \frac{M}{2} + \frac{\sqrt{M}}{2} z$ into Equation (2.34) and take the limit of large $M$ using Stirling’s approximation
$$ n! \approx e^{\,n \log n - n + \frac{1}{2} \log 2\pi n}. \tag{2.35} $$
Expanding $P_{k,M}$ to leading order in $M^{-1}$, assuming that $z$ remains of order unity, gives $P_{k,M} = \sqrt{2/(\pi M)}\, \exp(-z^2/2)$. Now one changes variables from $k$ to $z$. This stretches local neighbourhoods $\mathrm{d}k$ to $\mathrm{d}z$. Conservation of probability implies that $P(z)\,\mathrm{d}z = P(k)\,\mathrm{d}k$. It follows that $P(z) = (\sqrt{M}/2)\, P(k)$, so that $P(z) = (2\pi)^{-1/2} \exp(-z^2/2)$. In other words, the distribution of $z$ is Gaussian with zero mean and unit variance, as we intended to show.
Returning to Equation (2.33), we conclude that $C_i^{(\nu)}$ is Gaussian distributed,
$$ P(C) = (2\pi\sigma_C^2)^{-1/2} \exp[-C^2/(2\sigma_C^2)], \tag{2.36} $$
with zero mean, as illustrated in Figure 2.6, and with variance
$$ \sigma_C^2 = \frac{M}{N^2} \approx \frac{p}{N}. \tag{2.37} $$
Here we used $M \approx N p$ for large $N$ and $p$.
Another way to compute this variance is to square Equation (2.33) and to average over random patterns:
$$ \sigma_C^2 = \frac{1}{N^2} \Big\langle \Big( \sum_{m=1}^{M} c_m \Big)^2 \Big\rangle = \frac{1}{N^2} \sum_{n=1}^{M} \sum_{m=1}^{M} \langle c_n c_m \rangle. \tag{2.38} $$
Figure 2.7: Dependence of the one-step error probability on the storage capacity $\alpha$ according to Equation (2.39).
Here $\langle \cdots \rangle$ denotes the average over random realisations of $c_m$. Since the random numbers $c_m$ are independent for different indices and because $\langle c_m^2 \rangle = 1$, we have that $\langle c_n c_m \rangle = \delta_{nm}$. So only the diagonal terms in the double sum contribute, summing to $M \approx N p$. This yields Equation (2.37).
To determine $P_{\text{error}}^{t=1}$ [Equation (2.32)] we must integrate the distribution of $C$ from 1 to $\infty$:
$$ P_{\text{error}}^{t=1} = \frac{1}{\sqrt{2\pi}\,\sigma_C} \int_{1}^{\infty} \mathrm{d}C\, e^{-\frac{C^2}{2\sigma_C^2}} = \frac{1}{2} \bigg[ 1 - \mathrm{erf}\bigg( \sqrt{\frac{N}{2p}} \bigg) \bigg]. \tag{2.39} $$
Here erf is the error function defined as [31]
$$ \mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} \mathrm{d}x\, e^{-x^2}. \tag{2.40} $$
Since $\mathrm{erf}(z)$ increases monotonically as $z$ increases, we conclude that $P_{\text{error}}^{t=1}$ increases as $p$ increases, or as $N$ decreases. This is expected: it is more difficult for the network to distinguish stored patterns when there are more of them. On the other hand, it is easier to differentiate stored patterns if they have more bits. We also see that the one-step error probability depends on $p$ and $N$ only through the combination
$$ \alpha \equiv \frac{p}{N}. \tag{2.41} $$
The parameter $\alpha$ is called the storage capacity of the network. Figure 2.7 shows how $P_{\text{error}}^{t=1}$ depends on the storage capacity. For $\alpha = 0.2$, for example, the one-step error probability is slightly larger than 1%.
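Equation (2.39) is easily evaluated numerically. The short Python sketch below does this for a few values of the storage capacity; for $\alpha = 0.2$ it gives about 0.013, consistent with the statement above.

```python
from math import erf, sqrt

def p_error_one_step(alpha):
    # One-step error probability, Equation (2.39), with alpha = p/N.
    return 0.5 * (1.0 - erf(sqrt(1.0 / (2.0 * alpha))))

for alpha in (0.05, 0.1, 0.2, 0.4):
    print(alpha, p_error_one_step(alpha))
```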
In the derivation of Equation (2.39) we assumed that the stored patterns are random with independent bits. Realistic patterns are not random. We nevertheless expect that $P_{\text{error}}^{t=1}$ describes the typical one-step error probability of the Hopfield network when $p$ and $N$ are large. However, it is straightforward to construct
counterexamples. Consider for example orthogonal patterns:
$$ \boldsymbol{x}^{(\mu)} \cdot \boldsymbol{x}^{(\nu)} = 0 \quad \text{for } \mu \neq \nu. \tag{2.42} $$
For such patterns, the cross-talk term vanishes in the limit of large $N$ (Exercise 2.2), so that $P_{\text{error}}^{t=1} = 0$.
More importantly, the error probability defined in this Section refers only to
the initial update, the first iteration. What happens in the next iteration, and after
many iterations? Numerical experiments show that the error probability can be
much higher in later iterations, because an error tends to increase the probability
of making another error later on. So the estimate $P_{\text{error}}^{t=1}$ is only a lower bound for the
probability of observing errors in the long run.
2.5 Energy function
Consider the long-time limit t → ∞. Does the Hopfield dynamics converge, as required by Equation (2.7)? This is an important question in the analysis of neural- network algorithms, because an algorithm that does not converge to a meaningful solution is useless.
The standard way of analysing convergence of neural-network algorithms is to define an energy function H (s ) that has a minimum at the desired solution, s = x (ν) say. We monitor how the energy function changes as we iterate, and keep track of the smallest values of H encountered, to find the minimum. If we store only one pattern, p = 1, then a suitable energy function is
$$ H = -\frac{1}{2N} \Big( \sum_{i=1}^{N} s_i x_i^{(1)} \Big)^2. \tag{2.43} $$
This function is minimal when $\boldsymbol{s} = \boldsymbol{x}^{(1)}$, i.e., when $s_i = x_i^{(1)}$ for all $i$. It is customary to insert the factor $1/(2N)$; this does not change the fact that $H$ is minimal at $\boldsymbol{s} = \boldsymbol{x}^{(1)}$. A crucial point is that the asynchronous McCulloch-Pitts dynamics (2.5) converges to the minimum [13]. This follows from the fact that $H$ cannot increase under the update rule (2.5). To prove this important property, we begin by evaluating the expression on the r.h.s. of Equation (2.43):
$$ H = -\frac{1}{2} \sum_{ij}^{N} \frac{1}{N} x_i^{(1)} x_j^{(1)} s_i s_j. \tag{2.44} $$
Using Hebb’s rule (2.9) we find that the energy function (2.43) becomes
$$ H = -\frac{1}{2} \sum_{ij} w_{ij} s_i s_j. \tag{2.45} $$
This function has the same form as the energy function (or Hamiltonian) for certain physical models of magnetic systems consisting of interacting spins [32], where the interaction energy between spins $s_i$ and $s_j$ is $\frac{1}{2}(w_{ij} + w_{ji}) s_i s_j$. Note that Hebb’s rule (2.9) yields symmetric weights, $w_{ij} = w_{ji}$, and $w_{ii} > 0$. Note also that setting the diagonal weights to zero does not change the fact that $H$ is minimal at $\boldsymbol{s} = \boldsymbol{x}^{(1)}$, because $s_i^2 = 1$. The diagonal weights just give a constant contribution to $H$, independent of $\boldsymbol{s}$.
The second step is to show that $H$ cannot increase under the asynchronous McCulloch-Pitts dynamics (2.5). In this case we say that the energy function is a Lyapunov function, or loss function. To demonstrate that the energy function is a Lyapunov function, choose a neuron $m$ and update it according to Equation (2.5). We denote the updated state of neuron $m$ by $s_m'$:
s′ =sgnw s . (2.46) m mjj
j
All other neurons remain unchanged. There are two possibilities, either sm′ = sm orsm′ =−sm.InthefirstcaseHremainsunchanged,H′=H.HereH′referstothe value of the energy function after the update (2.46). When sm′ = −sm , by contrast, the energy function changes by the amount
$$ \begin{aligned} H' - H &= -\tfrac12\sum_{j\neq m}(w_{mj}+w_{jm})\big(s_m' s_j - s_m s_j\big) - \tfrac12\, w_{mm}\big(s_m' s_m' - s_m s_m\big) \\ &= \sum_{j\neq m}(w_{mj}+w_{jm})\, s_m s_j . \end{aligned} \qquad (2.47) $$
The sum goes over all neurons $j$ that are connected to the neuron $m$, the one to be updated in Equation (2.46). Now if the weights are symmetric, $H'-H$ equals
$$ H' - H = 2\sum_{j\neq m} w_{mj}\, s_m s_j = 2\sum_{j} w_{mj}\, s_m s_j - 2\, w_{mm} . \qquad (2.48) $$
If $w_{mm} \ge 0$, it follows that
$$ H' - H < 0 , \qquad (2.49) $$
since the sign of $\sum_j w_{mj} s_j$ is that of $s_m' = -s_m$. In other words, the value of $H$ must decrease when the state of neuron $m$ changes, $s_m' \neq s_m$. In summary,¹ $H$ either remains constant under the asynchronous McCulloch-Pitts dynamics ($s_m' = s_m$), or its value decreases ($s_m' \neq s_m$). Note that this does not hold for the synchronous dynamics (2.4), see Exercise 2.9. Since the energy function cannot increase under the asynchronous McCulloch-Pitts dynamics, the dynamics must converge to minima of the energy function. For the energy function (2.43) this implies that the dynamics must either converge to the stored pattern or to its inverse. Both are attractors.
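As a rough illustration, this monotonicity is easy to check numerically. The following minimal Python sketch (pattern size, number of updates, and the convention sgn(0) = 1 are arbitrary illustrative choices) stores a single random pattern with Hebb's rule and verifies that the energy (2.45) never increases along a sequence of asynchronous McCulloch-Pitts updates:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100
x = rng.choice([-1, 1], size=N)        # one stored pattern x^(1)
W = np.outer(x, x) / N                 # Hebb's rule; diagonal weights w_ii = 1/N >= 0

def energy(s, W):
    """Energy function (2.45), H = -1/2 sum_ij w_ij s_i s_j."""
    return -0.5 * s @ W @ s

s = rng.choice([-1, 1], size=N)        # distorted initial state
H_old = energy(s, W)
for _ in range(10 * N):                # asynchronous updates (2.5)
    m = rng.integers(N)
    s[m] = 1 if W[m] @ s >= 0 else -1  # sgn with the convention sgn(0) = 1
    H_new = energy(s, W)
    assert H_new <= H_old + 1e-12      # H never increases (Lyapunov property)
    H_old = H_new
```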
We assumed the thresholds to vanish, but the proof also works when the thresholds are not zero, in this case for the energy function
$$ H = -\frac12\sum_{ij} w_{ij}\, s_i s_j + \sum_i \theta_i s_i \qquad (2.50) $$
in conjunction with the update rule $s_m' = \mathrm{sgn}\big(\sum_j w_{mj} s_j - \theta_m\big)$.
Up to now we considered only one stored pattern, p = 1. If we store more than
one pattern [Hebb's rule (2.25)], the proof that (2.45) cannot increase under the McCulloch-Pitts dynamics works in the same way because no particular form of the weights $w_{ij}$ was assumed, only that they must be symmetric, and that the diagonal weights must not be negative. Therefore it follows in this case too that the minima of the energy function must correspond to attractors, as illustrated schematically in Figure 2.8. The configuration space of the network, corresponding to all possible choices of $s = [s_1,\ldots,s_N]^{\mathsf T}$, is drawn as a single axis, the $x$-axis. But when $N$ is large, the configuration space is really very high dimensional.
However, some stored patterns may not be attractors when $p > 1$. This follows from our analysis of the cross-talk term in Section 2.3. If the cross-talk term causes errors for a certain stored pattern, then this pattern is not located at a minimum of the energy function. Another way to see this is to combine Equations (2.25) and (2.45) to give:
$$ H = -\frac{1}{2N}\sum_{\mu=1}^{p}\Big(\sum_{i=1}^{N} s_i\, x_i^{(\mu)}\Big)^2 . \qquad (2.51) $$
While the energy function defined in Equation (2.43) has a minimum at $x^{(1)}$, Equation (2.51) need not have a minimum at $x^{(1)}$ (or at any other stored pattern), because a maximal value of $\big(\sum_{i=1}^{N} s_i x_i^{(1)}\big)^2$ may be compensated by terms stemming from other patterns. This happens rarely when $p$ is small (Section 2.3).
¹The derivation outlined here did not use the specific form of Hebb's rule (2.9), only that the weights are symmetric, and that $w_{mm} \ge 0$. However, the derivation fails when $w_{mm} < 0$. In this case it is still true that $H$ assumes a minimum at $s = x^{(1)}$, but $H$ can increase under the update rule, so that convergence is not guaranteed. We therefore require that the diagonal weights are not negative.
Figure 2.8: Minima of the energy function are attractors in configuration space, the space of all network states. Not all minima correspond to stored patterns ($x^{(\text{mix})}$ is a mixed state, see text), and stored patterns need not correspond to minima.
Conversely there may be minima that do not correspond to stored patterns. Such states are referred to as spurious states. The network may converge to spurious states. This is undesirable but it occurs even when there is only one stored pattern, as we saw in Section 2.2: the McCulloch-Pitts dynamics may converge to the inverted pattern. This follows also from Equation (2.51): if $s = x^{(1)}$ is a local minimum of $H$, then so is $s = -x^{(1)}$. This is a consequence of the invariance of $H$ under $s \to -s$. There are other types of spurious states besides inverted patterns. Examples are mixed states, superpositions of an odd number $2n+1$ of patterns [1]. For $n = 1$, for example, the bits of a mixed state read:
$$ x_i^{(\text{mix})} = \mathrm{sgn}\big(\pm x_i^{(1)} \pm x_i^{(2)} \pm x_i^{(3)}\big) . \qquad (2.52) $$
The number of distinct mixed states increases as $n$ increases. There are $2^{2n+1}\binom{p}{2n+1}$ mixed states that are superpositions of $2n+1$ out of $p$ patterns, for $n = 1, 2, \ldots$
(Exercise 2.4). Mixed states such as (2.52) are sometimes recognised by the network
(Exercise 2.5), therefore it may happen that the network converges to these states.
Finally, there are spurious states that are not related in any way to the stored patterns
$x_j^{(\mu)}$. Such spin-glass states are discussed in detail in Refs. [27, 33, 34], and also in the book by Hertz, Krogh and Palmer [1].
2.6 Summary
Hopfield networks are networks of McCulloch-Pitts neurons that recognise, or retrieve, patterns (Algorithm 1). Their layout is defined by connection strengths, or weights, chosen according to Hebb's rule. The weights $w_{ij}$ are symmetric, and the network is in general fully connected. Hebb's rule ensures that stored patterns are recognised, at least most of the time if the number of patterns is not too large.
Algorithm 1 pattern recognition with deterministic Hopfield network
  store patterns $x^{(\mu)}$ using Hebb's rule;
  feed distorted pattern $x$ into the network by assigning $s(t=0) \leftarrow x$;
  for $t = 1, \ldots, T$ do
    choose a value of $m$ and update $s_m(t) \leftarrow \mathrm{sgn}\big(\sum_{j=1}^{N} w_{mj}\, s_j(t-1)\big)$;
  end for
  read out pattern $s(T)$;
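A minimal Python sketch of Algorithm 1 might look as follows (pattern sizes, the number of iterations, and the convention sgn(0) = 1 are arbitrary illustrative choices):

```python
import numpy as np

def hebb_weights(patterns):
    """Hebb's rule (2.25): w_ij = (1/N) sum_mu x_i^(mu) x_j^(mu)."""
    p, N = patterns.shape
    return patterns.T @ patterns / N

def recall(W, x_distorted, T, rng):
    """Algorithm 1: asynchronous deterministic updates of a distorted pattern."""
    s = x_distorted.copy()                 # feed distorted pattern, s(t=0) <- x
    N = len(s)
    for _ in range(T):
        m = rng.integers(N)                # choose a neuron m at random
        s[m] = 1 if W[m] @ s >= 0 else -1  # s_m <- sgn(sum_j w_mj s_j)
    return s                               # read out pattern s(T)

rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(5, 200))    # p = 5 random patterns, N = 200 bits
W = hebb_weights(patterns)

x = patterns[0].copy()
flip = rng.choice(200, size=10, replace=False)   # distort 10 bits
x[flip] *= -1
s = recall(W, x, T=2000, rng=rng)
print("bits recovered:", int(np.sum(s == patterns[0])), "of 200")
```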
Convergence of the McCulloch-Pitts dynamics is analysed in terms of an energy function, which cannot increase under this dynamics.
A single-step estimate for the error probability of the network dynamics was derived in Section 2.4. If one iterates several steps, the error probability is usually much larger, but it is difficult to evaluate in general. For stochastic Hopfield networks the steady-state error probability can be estimated more easily, because the dynamics converges to a steady state.
2.7 Exercises
2.1 Modified Hebb's rule. Show that the modified Hebb's rule (2.26) satisfies Equation (2.8) if we store only one pattern, $p = 1$.
2.2 Orthogonal patterns. For Hebb's rule (2.25), show that the cross-talk term vanishes for orthogonal patterns, so that $P_{\text{error}}^{t=1} = 0$. For the modified Hebb's rule (2.26), show that the cross-talk term is non-zero for orthogonal patterns, but that it becomes negligible in the limit of large $N$.
2.3 Cross-talk term. Expression (2.33) for the cross-talk term was derived using
the modified Hebb’s rule Equation (2.26). How does Equation (2.33) change if you
use the rule (2.25) instead? Show that the distribution of $C_i^{(\nu)}$ then acquires a non-
zero mean, obtain an estimate for this mean value, and compute the one-step error probability. Show that your result approaches (2.39) for small values of α. Explain why your result is different from (2.39) for large α.
2.4 Mixed states. Explain why there are no mixed states that are superpositions of an even number of stored patterns. Show that there are $2^{2n+1}\binom{p}{2n+1}$ mixed states that are superpositions of $2n+1$ out of $p$ patterns, for $n = 1, 2, \ldots$.
2.5 Recognising mixed states. Store p random patterns in a Hopfield network
Figure 2.9: Two neurons with asymmetric connections (Exercise 2.6).
Figure 2.10: Heaviside function (Exercise 2.8).
with N = 50 and 100 neurons using Hebb’s rule (2.25). Using computer simulations,
determine the probability that the network recognises bit $x_i^{(\text{mix})}$ of the mixed state $x^{(\text{mix})}$ with bits
$$ x_i^{(\text{mix})} = \mathrm{sgn}\big(x_i^{(1)} + x_i^{(2)} + x_i^{(3)}\big) . \qquad (2.53) $$
Show that the one-step error probability tends to zero as $\alpha\to 0$ in the limit of large $N$, by analysing under which circumstances $\mathrm{sgn}\big(\frac{1}{N}\sum_{\mu=1}^{p}\sum_{j=1}^{N} x_i^{(\mu)} x_j^{(\mu)} x_j^{(\text{mix})}\big) = x_i^{(\text{mix})}$ holds. Hint: think of $\frac{1}{N}\sum_j$ as an average of $x_j^{(\mu)} x_j^{(\text{mix})}$ over random bits and evaluate this average. Then apply the signum function.
2.6 Energy function. Figure 2.9 shows a network with two neurons with asymmetric weights, $w_{12} = 2$ and $w_{21} = -1$. Show that the energy function $H = -\frac{w_{12}+w_{21}}{2}\, s_1 s_2$ can increase under the asynchronous McCulloch-Pitts rule.
2.7 Higher-order Hopfield networks. Determine under which conditions the energy function $H = -\frac12\sum_{ij} w_{ij}^{(2)} s_i s_j - \frac16\sum_{ijk} w_{ijk}^{(3)} s_i s_j s_k$ is a Lyapunov function for the asynchronous dynamics $s_m' = \mathrm{sgn}(b_m)$ with $b_m = \partial H/\partial s_m$.
2.8 Hebb's rule and energy function for 0/1 units. Suppose that the state of a neuron takes the values 0 (inactive) and 1 (active). The corresponding asynchronous update rule is $n_m' = \theta_{\rm H}\big(\sum_j w_{mj} n_j - \mu_m\big)$ with threshold $\mu_m$. The activation function $\theta_{\rm H}(b)$ is the Heaviside function, equal to 0 if $b < 0$ and equal to 1 if $b \ge 0$ (Figure 2.10). Write down Hebb's rule for such 0/1 units and show that if one stores only one pattern, then this pattern is recognised. Show that $H = -\frac12\sum_{ij} w_{ij}\, n_i n_j + \sum_i \mu_i n_i$ cannot increase under the asynchronous update rule (it is assumed that the weights are symmetric, and that $w_{ii} \ge 0$). See Ref. [13].
Figure 2.11: The pattern $x^{(1)}$ has $N = 4$ bits, $x_1^{(1)} = 1$, and $x_i^{(1)} = -1$ for $i = 2, 3, 4$. Exercise 2.11.
2.9 Energy function and synchronous dynamics. Analyse how the energy func- tion (2.45) changes under the synchronous dynamics (2.4). Show that the energy function can increase, even though the weights are symmetric and the diagonal weights are zero.
2.10 Continuous Hopfield network. Hopfield [35] also analysed a version of his model with continuous-time dynamics. Here we use $\tau\frac{\mathrm{d}}{\mathrm{d}t} n_i = -n_i + g\big(\sum_j w_{ij} n_j - \theta_i\big)$ with $g(b) = (1+e^{-b})^{-1}$ (this dynamical equation is slightly different from the one used by Hopfield [35]). Show that the energy function $E = -\frac12\sum_{ij} w_{ij}\, n_i n_j + \sum_i \theta_i n_i + \sum_i\int_0^{n_i}\mathrm{d}n\, g^{-1}(n)$ cannot increase under the network dynamics if the weights are symmetric. It is not necessary to assume that $w_{ii} \ge 0$.
2.11 Hopfield network with four neurons. The pattern shown in Fig. 2.11 is stored in a Hopfield network using Hebb's rule $w_{ij} = \frac{1}{N}\, x_i^{(1)} x_j^{(1)}$. There are $2^4$ four-bit patterns. Apply each of these to the Hopfield network, and perform one synchronous update. List the patterns you obtain and discuss your results.
2.12 Recognising letters with a Hopfield network. The five patterns in Figure 2.12
each have $N = 32$ bits. Store the patterns $x^{(1)}$ and $x^{(2)}$ in a Hopfield network using Hebb's rule $w_{ij} = \frac{1}{N}\sum_{\mu=1}^{2} x_i^{(\mu)} x_j^{(\mu)}$. Which of the patterns in Figure 2.12 remain unchanged after one synchronous update with $s_i' = \mathrm{sgn}\big(\sum_{j=1}^{N} w_{ij} s_j\big)$? Hint: read off $\sum_{j=1}^{N} x_j^{(\mu)} x_j^{(\nu)}$ from the Hamming distance between the two patterns, equal to the number of bits by which the patterns differ. Use this quantity to express the local fields $b_i^{(\mu)}$ as linear combinations of $x_i^{(1)}$ and $x_i^{(2)}$.
2.13 XOR function. The Boolean XOR function takes two binary inputs. For the inputs [−1, −1] and [1, 1] the function evaluates to −1, for the other two to +1. Try to encode the XOR function in a Hopfield network with three neurons by storing the patterns [−1, −1, −1], [1, 1, −1], [−1, 1, 1], and [1, −1, 1] using Hebb's rule. Test whether the patterns are recognised or not. Discuss your findings.
Figure 2.12: Each of the five patterns consists of 32 bits $x_i^{(\mu)}$. A black pixel $i$ in pattern $\mu$ corresponds to $x_i^{(\mu)} = 1$, a white one to $x_i^{(\mu)} = -1$. Exercise 2.12.
2.14 Distance as a measure of convergence. The distance $d = \frac{1}{4N}\sum_i\big(s_i - x_i^{(1)}\big)^2$ [Equation (2.2)] has a minimum at $s = x^{(1)}$. How are $d$ and $H$ [Equation (2.43)]
related? What is the advantage of using H instead of d as a measure of convergence?
3 Stochastic Hopfield networks
Two related problems became apparent in the previous Chapter. First, the Hopfield dynamics may get stuck in spurious minima. In fact, if there is a local minimum downhill from a given initial state, between this state and the correct attractor, then the dynamics arrests in the local minimum, so that the algorithm fails to converge to the correct attractor. Second, the energy function usually is a strongly varying function over a high-dimensional configuration space. Therefore it is difficult to predict the first local minimum encountered by the down-hill dynamics of the network.
Both problems are solved by introducing an element of stochasticity into the dynamics. This is a trick that works for many neural-network algorithms. In general, however, it is quite challenging to analyse the stochastic dynamics. For the Hopfield network, by contrast, much is known. The reason is that the stochastic Hopfield network is closely related to systems studied in statistical mechanics, so-called spin glasses. Like these systems – and many other physical systems – the stochastic Hop- field network exhibits an order-disorder transition. This transition becomes sharp in the limit of a large number of neurons. This has important consequences. Suppose that the network produces satisfactory results for a given number of patterns with a certain number of bits. If one tries to store just one more pattern, the network may fail to recognise anything. The goal of this Chapter is to explain why this occurs, and how it can be avoided.
3.1 Stochastic dynamics
The asynchronous update rule (2.5) is called deterministic, because a given set of states $s_j$ determines the outcome of the update of neuron $m$. To introduce noise, one replaces the rule (2.5) by an asynchronous stochastic rule [36]:
$$ s_m' = \begin{cases} +1 & \text{with probability } p(b_m) , \\ -1 & \text{with probability } 1 - p(b_m) . \end{cases} \qquad (3.1a) $$
Here $b_m = \sum_j w_{mj} s_j - \theta_m$ is the local field, and the probability $p(b)$ is given by:
$$ p(b) = \frac{1}{1 + e^{-2\beta b}} . \qquad (3.1b) $$
A neuron with update rule (3.1a) is called a binary stochastic neuron.
The function $p(b)$ is plotted in Figure 3.1. The parameter $\beta$ is the noise parameter. When $\beta$ is large, the noise level is small. As $\beta$ tends to infinity, the function $p(b)$ approaches zero if $b$ is negative, and it tends to unity if $b$ is positive. So for $\beta\to\infty$, the stochastic update rule (3.1) converges to the deterministic rule (2.5). In the opposite limit, when $\beta = 0$, the update probability $p(b)$ simply equals $\frac12$. In this case $s_i$ is updated to −1 or +1 randomly, with equal probability. The dynamics does not depend upon the stored patterns contained in the local field $b$.
Figure 3.1: Probability function (3.1b) used in the definition of the stochastic rule (3.1), plotted for $\beta = 10$ and $\beta = 0$.
The idea is to keep a small but finite noise level. Then the network dynamics is very similar to the deterministic Hopfield dynamics analysed in the previous Chapter. But the noise allows the system to escape spurious minima. However, since the dynamics is stochastic, we must rephrase the convergence criterion (2.7). This is discussed next.
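A binary stochastic neuron is straightforward to implement. The following minimal Python sketch (function and variable names are illustrative) realises the update rule (3.1) for one chosen neuron:

```python
import numpy as np

def p_of_b(b, beta):
    """Probability (3.1b): p(b) = 1 / (1 + exp(-2*beta*b))."""
    return 1.0 / (1.0 + np.exp(-2.0 * beta * b))

def stochastic_update(s, W, beta, m, rng, theta=0.0):
    """Asynchronous stochastic rule (3.1) applied to neuron m."""
    b_m = W[m] @ s - theta                                  # local field
    s[m] = 1 if rng.random() < p_of_b(b_m, beta) else -1
    return s
```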
3.2 Order parameters
If we feed one of the stored patterns, x (1) for example, then we want the stochastic
dynamics to stay in the vicinity of x (1) . This can only work if the noise is weak enough,
and even then it is not guaranteed. At time step $t$, bit $i$ is correct if $s_i(t)\, x_i^{(1)} = 1$. All bits are correct when $\sum_{i=1}^{N} s_i(t)\, x_i^{(1)} = N$, otherwise the sum takes a value smaller than $N$. One measures success by averaging $\frac{1}{N}\sum_{i=1}^{N} s_i(t)\, x_i^{(1)}$ over the asynchronous stochastic dynamics of the network from $t = 0$ to $t = T$, for given bits $x_i^{(\mu)}$:
$$ m_\mu(T) = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N}\sum_{i=1}^{N} s_i(t)\, x_i^{(\mu)} . \qquad (3.2a) $$
If we feed pattern x (1) to the network, we have m1(t =0) = 1 initially. We want that
m1(t ) remains close to unity, so that the network recognises the pattern x (1). In
practice, the quantity $\frac{1}{N}\sum_{i=1}^{N} s_i(t)\, x_i^{(1)}$ settles into a steady state, where it fluctuates around a mean value with a definite distribution that becomes independent of the iteration number $t$. If the network works well, the finite-time average $m_1(T)$ converges to a value of order unity after an initial transient (Figure 3.2). The limiting value
$$ m_1 \equiv \lim_{T\to\infty} m_1(T) \equiv \frac{1}{N}\sum_{i=1}^{N}\langle s_i\rangle\, x_i^{(1)} \qquad (3.2b) $$
is called the order parameter. Since there is noise, the order parameter $m_1$ is usually smaller than unity. The last equality in Equation (3.2b) defines the time average $\langle s_i\rangle$ over the stochastic network dynamics.
Figure 3.2: Illustrates how the finite-time average $m_1(T)$ depends upon the total iteration time $T$. The light gray lines show results for $m_1(T)$ for different realisations of random patterns stored in the network, at a large but finite value of $N$. The black line is the average of $m_1(T)$ over the different realisations of random patterns.
Figure 3.2 also illustrates a subtlety. For finite values of $N$, the order parameter $m_1$ depends upon the stored patterns. Different realisations $x^{(1)},\ldots,x^{(p)}$ of random patterns yield different values of $m_1$. In the limit of $N\to\infty$ this problem does not occur, the order parameter $m_1$ becomes independent of the stored patterns. We say that the system is self-averaging in this limit.
To obtain a definite value for the order parameter, one usually averages $m_1$ over different realisations of random patterns stored in the network (thick solid line in Figure 3.2). The dashed line in Figure 3.2 shows $\langle m_1\rangle$.
The other components, $m_\mu = \lim_{T\to\infty} m_\mu(T)$ for $\mu > 1$, are expected to be small. This is certainly true for random patterns with many independent bits. If $s_i(t) \approx x_i^{(1)}$, the individual terms in the sum over $i$ in Equation (3.2b) cancel approximately upon summation, because the bits of the patterns $x^{(2)}$ to $x^{(p)}$ are independent from those of $x^{(1)}$. In summary, if we feed pattern $x^{(1)}$ and if the network works well, we expect in the limit of large $N$:
$$ m_\mu \approx \begin{cases} 1 & \text{if } \mu = 1 , \\ 0 & \text{otherwise.} \end{cases} \qquad (3.3) $$
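Before discussing when Equation (3.3) holds, here is a rough Python sketch of how the finite-time averages (3.2a) can be estimated in a simulation. The parameter values are arbitrary illustrative choices; the run uses small α and moderate noise, so (3.3) is expected to hold approximately:

```python
import numpy as np

def order_parameters(patterns, W, beta, T, rng):
    """Feed x^(1), iterate the stochastic dynamics (3.1), and return the
    finite-time averages m_mu(T) of Equation (3.2a)."""
    p, N = patterns.shape
    s = patterns[0].copy()                          # start at the stored pattern x^(1)
    m_sum = np.zeros(p)
    for _ in range(T):
        m = rng.integers(N)
        b = W[m] @ s
        s[m] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * beta * b)) else -1
        m_sum += patterns @ s / N                   # overlaps (1/N) sum_i s_i x_i^(mu)
    return m_sum / T

rng = np.random.default_rng(2)
p, N = 3, 500
patterns = rng.choice([-1, 1], size=(p, N))
W = patterns.T @ patterns / N                       # Hebb's rule
print(order_parameters(patterns, W, beta=4.0, T=20_000, rng=rng))
# expected: the first component close to 1, the others close to 0
```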
Whether this is the case or not depends on the values of $p$, $N$, and $\beta$. In the next Sections we determine how $m_1$ depends on these parameters.
3.3 Mean-field theory
The order parameter is defined as an average over the stochastic dynamics of the network in its steady state (Figure 3.2). It is a challenging task to compute this average because all neurons interact with each other in a nonlinear fashion. Consider neuron number $i$. The fate of $s_i$ is determined by its local field $b_i$, through Equation (3.1). The difficulty is that the local field in turn depends on the states $s_j$ of all other neurons in the network:¹
$$ b_i(t) = \sum_{j=1}^{N} w_{ij}\, s_j(t) . \qquad (3.4) $$
When N is large, we may assume that bi (t ) remains essentially constant in the steady state, independent of t , because fluctuations of sj (t ) average out when summing over j:
$$ b_i(t) = \langle b_i\rangle + \text{fluctuations} . \qquad (3.5) $$
Since $b_i(t)$ is given by a sum over many random numbers, we appeal to the central-limit theorem and argue that the fluctuations of $b_i(t)$ are of order $\sqrt{N}$, while $\langle b_i(t)\rangle \sim N$. We therefore ignore the fluctuations in the limit of large $N$ and write
$$ b_i(t) \approx \langle b_i\rangle = \sum_{j=1}^{N} w_{ij}\langle s_j\rangle = \frac{1}{N}\sum_{\mu}\sum_{j\neq i} x_i^{(\mu)} x_j^{(\mu)}\langle s_j\rangle , \qquad (3.6) $$
using Hebb's rule (2.26) for given patterns $x^{(\mu)}$. The time-averaged local field $\langle b_i\rangle$ is called the mean field. Theories that neglect the fluctuations in Equation (3.5) are called mean-field theories. They require a self-consistent solution, because the average $\langle s_j\rangle$ on the r.h.s. of Equation (3.6) depends on the mean field. Using the stochastic update rule (3.1) we find:
$$ \langle s_i\rangle = \mathrm{Prob}(s_i = +1) - \mathrm{Prob}(s_i = -1) = p(\langle b_i\rangle) - \big[1 - p(\langle b_i\rangle)\big] = \tanh(\beta\langle b_i\rangle) . \qquad (3.7) $$
Equations (3.6) and (3.7) yield a set of $N$ non-linear self-consistent equations for $\langle s_i\rangle$,
$$ \langle s_i\rangle = \tanh(\beta\langle b_i\rangle) \quad\text{with}\quad \langle b_i\rangle = \frac{1}{N}\sum_{\mu}\sum_{j\neq i} x_i^{(\mu)} x_j^{(\mu)}\langle s_j\rangle . \qquad (3.8) $$
Recall that the averages 〈· · · 〉 are time averages, evaluated for given patterns x (μ) .
1We set the thresholds to zero, as assumed in Hebb’s rule (2.26).
An equivalent yet slightly different derivation of the mean-field equations (3.8) is this: suppose we average $s_i$ over the dynamics (3.1) at fixed $s_j$, $j\neq i$, and then we average all $s_j$ over the dynamics. This gives $\langle s_i\rangle = \langle\tanh(\beta b_i)\rangle$. Comparing with Equation (3.8), we see that the mean-field approximation corresponds to approximating $\langle\tanh(\beta b_i)\rangle \approx \tanh(\beta\langle b_i\rangle)$.
Now, in order to calculate the order parameters (3.9),
$$ m_\mu = \frac{1}{N}\sum_{j=1}^{N}\langle s_j\rangle\, x_j^{(\mu)} , \qquad (3.9) $$
we must solve the mean-field equations (3.8) to obtain the time averages 〈si 〉 in Equation (3.9). To this end we express the mean field 〈bi 〉 in terms of the order parameters mμ:
$$ \langle b_i\rangle = \frac{1}{N}\sum_{\mu=1}^{p}\sum_{j\neq i} x_i^{(\mu)} x_j^{(\mu)}\langle s_j\rangle \approx \sum_{\mu=1}^{p} x_i^{(\mu)}\, m_\mu . \qquad (3.10) $$
The last equality is only approximate because the j -sum in the definition of mμ contains the term j = i . Whether or not to include this term makes only a small difference in the limit of large N .
Let us first calculate m1 assuming Equation (3.3), neglecting terms with μ ≠ 1 in Equation (3.10). To make sure that these small μ ≠ 1-terms do not add up to a substantial correction to the first term, we must assume that the storage capacity is small enough. For large values of N , the condition is [37]:
$$ \alpha = \frac{p}{N} \ll \frac{\log N}{N} . \qquad (3.11) $$
In this case it is sufficient to keep only the first term on the r.h.s. of Equation (3.10). This approximation yields together with Equation (3.8):
$$ \langle s_i\rangle = \tanh(\beta\langle b_i\rangle) \approx \tanh\big(\beta m_1 x_i^{(1)}\big) . \qquad (3.12) $$
Applying the definition (3.9) of the order parameter, one finds
$$ m_1 = \frac{1}{N}\sum_{i=1}^{N}\tanh\big(\beta m_1 x_i^{(1)}\big)\, x_i^{(1)} . \qquad (3.13) $$
Using that $\tanh(z) = -\tanh(-z)$ as well as the fact that the bits $x_i^{(\mu)}$ can only assume the values ±1, one obtains:
$$ m_1 = \tanh(\beta m_1) . \qquad (3.14) $$
Figure 3.3: Solutions of the mean-field equation (3.14), solid lines. The critical noise level is $\beta_c = 1$. The dashed line corresponds to an unstable solution.
This is a self-consistent equation for m1. For β → 0, it has the solution m1 = 0. This is not the desired solution, because m1 = 0 means that x (1) is not recognised. For β → ∞, by contrast, there are three solutions, m1 = 0, ±1. Figure 3.3 shows results of the numerical evaluation of Equation (3.14) for intermediate values of β . For β larger than the critical value
βc =1, (3.15)
the three solutions persist. The solution m1 = 0 is unstable (this can be shown by computing the derivatives of the free energy of the Hopfield network [1]). In other words, if we start with an initial condition that corresponds to m1 = 0, the network dynamics does not stay there. The other two solutions are stable: when the network is initialised close to x (1), then it converges to m1 = O (1).
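The stable solution of Equation (3.14) is easily found by fixed-point iteration, for example with the following short Python sketch (the starting value and iteration count are arbitrary choices):

```python
import numpy as np

def solve_m1(beta, m0=0.9, iterations=200):
    """Solve the mean-field equation (3.14), m1 = tanh(beta*m1), by fixed-point iteration."""
    m = m0
    for _ in range(iterations):
        m = np.tanh(beta * m)
    return m

for beta in [0.5, 1.0, 1.5, 2.0]:
    print(beta, solve_m1(beta))
# below beta_c = 1 the iteration collapses to m1 = 0; above it, a non-zero solution appears
```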
The symmetry of the problem dictates that there must also be a solution with −m1. This solution corresponds to the inverted pattern −x (1) (Section 2.5). If we start in the vicinity of x (1) , then the network is unlikely to converge to −x (1) , provided that N is large enough. The probability of the dynamical transition x (1) → −x (1) vanishes very rapidly as N increases and as the noise level decreases. If this transition happens in a simulation in this limit, the network then stays near −x (1) for a very long time. Consider the limit where T tends to ∞ at a finite but large value of N . Then the network jumps back and forth between x (1) and −x (1) at a very small rate. As a result, the order parameter averages to zero. This shows that the limits of large N and large T do not commute:
$$ \lim_{T\to\infty}\lim_{N\to\infty} m_1(T) \neq \lim_{N\to\infty}\lim_{T\to\infty} m_1(T) . \qquad (3.16) $$
In practice the interesting limit is the left one, that of a large network run for a time T much longer than the initial transient, but not infinite. This is precisely where the mean-field theory applies. It corresponds to taking the limit N → ∞ first, at finite
but large T . This describes simulations where the transition x (1) → −x (1) does not occur.
In summary, Equation (3.14) predicts that the order parameter converges to a
definite value, m1, independent of the stored patterns in the limit N → ∞. When
N is finite, the limiting value of the order parameter does depend on the stored
patterns (Figure 3.2). In this case one averages also over different realisations of the
stored patterns, as mentioned above. The value of this average, 〈m1〉, determines
the average error probability $P^{t=\infty}_{\text{error}}$ in the steady state, the average fraction of wrong bits. The steady-state average number of correct bits is given by
$$ \frac12\sum_{i=1}^{N}\Big\langle 1 + \langle s_i\rangle\, x_i^{(1)}\Big\rangle = \frac{N}{2}\big(1 + \langle m_1\rangle\big) , \qquad (3.17) $$
because $\frac12\big(1 + s_i x_i^{(1)}\big) = 1$ if bit $i$ is correct, and equal to zero otherwise. The outer average is over different realisations of random patterns (the inner average is over the network dynamics). The second equality follows from Equation (3.2b). Since the l.h.s. of Equation (3.17) equals $N$ times $1 - P^{t=\infty}_{\text{error}}$, we deduce that
$$ P^{t=\infty}_{\text{error}} = \tfrac12\big(1 - \langle m_1\rangle\big) . \qquad (3.18) $$
Since $m_1\to 1$ as $\beta\to\infty$, the steady-state error probability tends to zero in this limit. This is expected since the stored patterns $x^{(\mu)}$ are recognised for small enough values of $\alpha$ in the deterministic limit, when the cross-talk term is negligible. But note that the stochastic dynamics slows down as the noise level tends to zero. The lower the noise level, the longer the network remains stuck in local minima, so that it takes longer to reach the steady state, and to sample the steady-state statistics of $H$. In the opposite limit, $\beta\to 0$, the steady-state error probability tends to $\frac12$, because $m_1\to 0$. In this noise-dominated limit the stochastic network ceases to function. If one were to assign $N$ bits entirely randomly, then half of them would be correct, on average, $P^{t=\infty}_{\text{error}} = \frac12$.
It is important to note that noise can also help, because mixed states have lower
critical noise levels than the stored patterns x (μ). This can be seen as follows [1, 33]. To derive the above mean-field result we assumed m1 ≈ 1 and mμ ≈ 0 for μ ̸= 1. Mixed states correspond to solutions where an odd number of components of m is non-zero, for example:
$$ m^{(\text{mix})} = \begin{bmatrix} m \\ m \\ m \\ 0 \\ \vdots \end{bmatrix} . \qquad (3.19) $$
Neglecting the cross-talk term, the mean-field equation reads
$$ \langle s_i\rangle = \tanh\Big(\beta\sum_{\mu=1}^{p} m_\mu^{(\text{mix})} x_i^{(\mu)}\Big) . \qquad (3.20) $$
In the limit of $\beta\to\infty$, the averages $\langle s_i\rangle$ converge to the mixed states (2.52) when $m^{(\text{mix})}$ is given by Equation (3.19). Averaging over the bits of the random patterns one finds:
$$ m_\mu^{(\text{mix})} = \Big\langle x_i^{(\mu)}\tanh\Big(\beta\sum_{\nu=1}^{p} m_\nu^{(\text{mix})} x_i^{(\nu)}\Big)\Big\rangle . \qquad (3.21) $$
The numerical solution of Equation (3.21) shows that there is a non-zero solution
for $\beta^{-1} < \beta_c^{-1} = 1$. Yet this solution is unstable for $0.46 < \beta^{-1} < 1$ [33]. In other words,
the mixed states have a lower critical noise level than the stored patterns, equal to 0.46. For noise levels larger than that, but still smaller than unity, the network can recognise the stored patterns, and it does not converge to mixed states.
However, these results were obtained assuming that only one (or a few) order parameters are not zero. This corresponds to the limit of α = p /N → 0, where the cross-talk term (Section 2.3) is negligible. The next Section describes a mean-field theory that remains valid for larger values of α.
3.4 Critical storage capacity
The analysis in the preceding Section replaced the sum (3.10) by its first term, $x_i^{(1)} m_1$. This can only work when $p/N$ is small enough. Now we discuss how to proceed when $p/N$ is not small.
Note that the analysis in Section 2.4 did not assume that $p/N$ is small, but it yielded only the one-step error probability $P^{t=1}_{\text{error}}$, and we discussed the storage
capacity α = p/N in relation to the one-step error probability. As the network
dynamics is iterated, however, the number of errors tends to increase, at least when
α is large enough so that the cross-talk term matters. Now we describe how to
compute $P^{t=\infty}_{\text{error}}$ for general values of the storage capacity α, in order to demonstrate
how the errors multiply when α is larger, causing the network to fail.
As before, we store p patterns in the network using Hebb’s rule (2.26) and feed pattern x (1) to the network. The aim is to determine the order parameter m1 and the corresponding error probability in the steady state for p ∼ N , so that α remains finite as N → ∞. In this case we can no longer approximate the sum in Equation (3.10) just by its first term, because the other terms for μ > 1 may sum up to a contribution that is of the same order as m1. Instead we must evaluate all mμ to compute the
mean field 〈bi 〉.
The relevant calculation is summarised in Chapter 4 of Ref. [38]. It is also outlined in Section 2.5 of Hertz, Krogh and Palmer [1]. The remainder of this Section follows this outline quite closely. One starts by rewriting the mean-field equations (3.8) in terms of the order parameters mμ. Using
$$ \langle s_i\rangle = \tanh\Big(\beta\sum_\mu x_i^{(\mu)}\, m_\mu\Big) \qquad (3.22) $$
we find
$$ m_\nu = \frac{1}{N}\sum_i x_i^{(\nu)}\langle s_i\rangle = \frac{1}{N}\sum_i x_i^{(\nu)}\tanh\Big(\beta\sum_\mu x_i^{(\mu)}\, m_\mu\Big) . \qquad (3.23) $$
This coupled set of p non-linear equations is equivalent to the mean-field equations (3.8).
Now feed pattern x (1) to the network. We assume that the network stays close to the pattern x (1) in the steady state, so that m1 remains of order unity. The other mμ remain small. When p is large, however, we cannot simply approximate the sum over μ on the r.h.s. of Equation (3.23) by its first term only, because the sum of the remaining (small) terms might not be negligible. Therefore we need to estimate these terms, the other order parameters mμ for μ ̸= 1.
The trick is to assume that the pattern bits are random, uncorrelated with mean zero [Equations (2.29) and (2.30)]. In this case the order parameters $m_\mu$, $\mu = 2,\ldots,p$, become random numbers that fluctuate around zero with variance $\langle m_\mu^2\rangle$ (this average is over random patterns). We use Equation (3.23) to compute the variance approximately.
In the μ-sum on the r.h.s of Equation (3.23) we must treat the term μ = ν sepa-
rately, because the index ν appears also on the l.h.s. of this equation. Also the term
μ = 1 must be treated separately, as before, because μ = 1 is the index of the pattern
that is fed to the network. As a consequence, the calculations of m1 and mν for ν ̸= 1
proceed slightly differently. We begin with the first case. Using that $x_i^{(\mu)} = \pm 1$, and that $\tanh(z)$ is an odd function, Equation (3.23) simplifies to:
$$ m_1 = \frac{1}{N}\sum_i\tanh\Big(\beta m_1 + \beta\sum_{\mu\neq 1} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) . \qquad (3.24) $$
The next steps are similar to the analysis of the cross-talk term in Section 2.3. One
assumes that the patterns are random, that their bits $x_i^{(\mu)} = \pm 1$ are independently and identically distributed. In the limit of large $N$ and $p$, the sums in Equation (3.24) can then be estimated using the central-limit theorem. For random patterns, the variable
$$ z \equiv \sum_{\mu\neq 1} x_i^{(\mu)} x_i^{(1)} m_\mu \qquad (3.25) $$
is a sum of many independent, identically distributed random numbers with mean zero and finite variance. The variable $z$ is therefore approximately Gaussian distributed, with mean zero. As a consequence, the distribution of $z$ is entirely determined by its variance $\sigma_z^2$, and it is independent of $i$.
Returning to Equation (3.24), one approximates the sum $\frac{1}{N}\sum_i$ as an average over the Gaussian distributed variable $z$. This yields:
$$ m_1 = \int\frac{\mathrm{d}z}{\sqrt{2\pi\sigma_z^2}}\; e^{-\frac{z^2}{2\sigma_z^2}}\tanh(\beta m_1 + \beta z) . \qquad (3.26) $$
Equation (3.26) is the desired result, a self-consistent equation for m1 replacing the mean-field equation (3.14).
In order to determine $m_1$, we need to estimate the variance $\sigma_z^2$ featuring in Equation (3.26). To this end, one squares Equation (3.25), and averages the resulting double sum over pattern indices. Since the bits $x_i^{(\mu)}$ and $x_i^{(\mu')}$ are independent when $\mu\neq\mu'$, only the diagonal terms in this double sum contribute to the average:
$$ \sigma_z^2 = \sum_{\mu\neq 1}\langle m_\mu^2\rangle \approx p\,\langle m_\mu^2\rangle \quad\text{for any}\ \mu\neq 1 . \qquad (3.27) $$
Here we assumed that p is large, and approximated p − 1 ≈ p . To evaluate the variance further, it is necessary to estimate the remaining order parameters. One starts again from Equation (3.23) and writes for ν ̸= 1
$$ \begin{aligned} m_\nu &= \frac{1}{N}\sum_i x_i^{(\nu)}\tanh\Big(\beta x_i^{(1)} m_1 + \beta x_i^{(\nu)} m_\nu + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} m_\mu\Big) \\ &= \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)}\tanh\Big(\underbrace{\beta m_1}_{\text{①}} + \underbrace{\beta x_i^{(1)} x_i^{(\nu)} m_\nu}_{\text{②}} + \underbrace{\beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu}_{\text{③}}\Big) . \end{aligned} \qquad (3.28) $$
Consider the three terms in the argument of $\tanh(\ldots)$. The term ① is of order unity, it is independent of $N$. The term ③ may be of the same order, because the sum over $\mu$ contains $\sim p \sim N$ terms. The term ②, by contrast, is small for large values of $N$. Therefore it is a good approximation to Taylor-expand as follows:
$$ \tanh\big(\text{①}+\text{②}+\text{③}\big) \approx \tanh\big(\text{①}+\text{③}\big) + \text{②}\,\tfrac{\mathrm{d}}{\mathrm{d}x}\tanh(x)\big|_{x=\text{①}+\text{③}} + \ldots . \qquad (3.29) $$
Using $\frac{\mathrm{d}}{\mathrm{d}x}\tanh(x) = 1 - \tanh^2(x)$ one obtains
$$ \begin{aligned} m_\nu &= \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)}\tanh\Big(\beta m_1 + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) \\ &\quad + \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)}\,\beta x_i^{(1)} x_i^{(\nu)} m_\nu\Big[1 - \tanh^2\Big(\beta m_1 + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu\Big)\Big] . \end{aligned} \qquad (3.30) $$
Using the fact that $x_i^{(\mu)} = \pm 1$ and thus $[x_i^{(\mu)}]^2 = 1$, this expression simplifies:
$$ m_\nu = \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)}\tanh\Big(\beta m_1 + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) + \beta m_\nu\,\frac{1}{N}\sum_i\Big[1 - \tanh^2\Big(\beta m_1 + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu\Big)\Big] . \qquad (3.31) $$
The goal is now to solve for $m_\nu$. Approximating the sum $\frac{1}{N}\sum_i$ in the second line as an average over the Gaussian distributed variable $z$ gives:
$$ \beta m_\nu\int_{-\infty}^{\infty}\frac{\mathrm{d}z}{\sqrt{2\pi\sigma_z^2}}\; e^{-\frac{z^2}{2\sigma_z^2}}\Big[1 - \tanh^2\big(\beta m_1 + \beta z\big)\Big] . \qquad (3.32) $$
Defining the parameter $q$,
$$ q \equiv \int_{-\infty}^{\infty}\frac{\mathrm{d}z}{\sqrt{2\pi\sigma_z^2}}\; e^{-\frac{z^2}{2\sigma_z^2}}\tanh^2\big(\beta m_1 + \beta z\big) , \qquad (3.33) $$
one can write Equation (3.32) as
$$ \beta m_\nu\Big[1 - \int_{-\infty}^{\infty}\frac{\mathrm{d}z}{\sqrt{2\pi\sigma_z^2}}\; e^{-\frac{z^2}{2\sigma_z^2}}\tanh^2\big(\beta m_1 + \beta z\big)\Big] \equiv \beta m_\nu\,(1-q) . \qquad (3.34) $$
Returning to Equation (3.31), we see that it takes the form
$$ m_\nu = \frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)}\tanh\Big(\beta m_1 + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu\Big) + (1-q)\,\beta m_\nu . \qquad (3.35) $$
Solving for $m_\nu$ one finds for $\nu\neq 1$:
$$ m_\nu = \frac{\frac{1}{N}\sum_i x_i^{(\nu)} x_i^{(1)}\tanh\Big(\beta m_1 + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu\Big)}{1 - \beta(1-q)} . \qquad (3.36) $$
This expression allows us to compute the variance $\sigma_z^2$, defined by Equation (3.27). Equation (3.36) shows that the average $\langle m_\nu^2\rangle$ contains a double sum over the bit index, $i$. Since the bits are independent, only the diagonal terms contribute, so that
$$ \langle m_\nu^2\rangle \approx \frac{\frac{1}{N^2}\sum_i\tanh^2\Big(\beta m_1 + \beta\sum_{\substack{\mu\neq 1\\ \mu\neq\nu}} x_i^{(\mu)} x_i^{(1)} m_\mu\Big)}{\big[1 - \beta(1-q)\big]^2} \qquad (3.37) $$
for $\nu\neq 1$, but otherwise independent of $\nu$. The numerator is just $q/N$, from Equation (3.33). So the variance evaluates to
$$ \sigma_z^2 = \frac{\alpha q}{\big[1 - \beta(1-q)\big]^2} . \qquad (3.38) $$
In summary there are three coupled equations, for m1, q , and σz , Equations (3.26), (3.34), and (3.38). They must be solved together to determine how m1 depends on β and α.
In order to compare with the results described in Section 2.4, we must take the deterministic limit, $\beta\to\infty$. In this limit, $q$ approaches unity, yet $\beta(1-q)$ remains finite [1]. Setting $q = 1$ in Equation (3.38) but retaining $\beta(1-q)$ one finds:
$$ \sigma_z^2 = \frac{\alpha}{\big[1 - \beta(1-q)\big]^2} . \qquad (3.39a) $$
The deterministic limits of Equations (3.34) and (3.26) become [1]:
$$ \beta(1-q) = \sqrt{\frac{2}{\pi\sigma_z^2}}\; e^{-\frac{m_1^2}{2\sigma_z^2}} , \qquad (3.39b) $$
$$ m_1 = \mathrm{erf}\Big(\frac{m_1}{\sqrt{2\sigma_z^2}}\Big) . \qquad (3.39c) $$
Recall expression (3.18) for the steady-state error probability. Inserting Equation (3.39c) for $m_1$ into this expression we find in the deterministic limit:
$$ P^{t=\infty}_{\text{error}} = \frac12\Big[1 - \mathrm{erf}\Big(\frac{m_1}{\sqrt{2\sigma_z^2}}\Big)\Big] . \qquad (3.40) $$
Compare this with Equation (2.39) for the one-step error probability in the deterministic limit. That equation was derived for only one step of the network dynamics, while Equation (3.40) describes the limit of many steps, the long-time or steady-state limit.
Figure 3.4: Error probability as a function of the storage capacity α in the deterministic limit. The one-step error probability $P^{t=1}_{\text{error}}$ [Equation (2.39)] is shown as a dashed line, the steady-state error probability $P^{t=\infty}_{\text{error}}$ [Equation (3.40)] is shown as a solid line. In the hashed region, error avalanches increase the error probability. After Figure 1 in Ref. [34].
Yet it turns out that Equation (3.40) reduces to (2.39) in the limit of $\alpha\to 0$. To see this, one solves the set of Equations (3.39) by introducing the variable $y = m_1/\sqrt{2\sigma_z^2}$. One obtains the following one-dimensional equation for $y$ [1, 34]:
$$ y\Big(\sqrt{2\alpha} + \tfrac{2}{\sqrt{\pi}}\, e^{-y^2}\Big) = \mathrm{erf}(y) . \qquad (3.41) $$
The relevant solutions are those satisfying $0 \le \mathrm{erf}(y) \le 1$, because the order parameter is restricted to this range (transitions to $-m_1$ do not occur in the limit $N\to\infty$). Figure 3.4 shows the steady-state error probability obtained from Equations (3.40) and (3.41). Also shown is the one-step error probability
$$ P^{t=1}_{\text{error}} = \frac12\Big[1 - \mathrm{erf}\Big(\frac{1}{\sqrt{2\alpha}}\Big)\Big] $$
derived in Section 2.4. As stated above, $P^{t=\infty}_{\text{error}}$ approaches $P^{t=1}_{\text{error}}$ for small α. We conclude: in this limit, for small α, the error probability does not increase significantly as one iterates the network dynamics. Errors in earlier iterations have little effect on the probability that later errors occur.
The situation is different at larger values of α. In that case, $P^{t=1}_{\text{error}}$ significantly underestimates the steady-state error probability. In the hashed region, errors in the dynamics increase the probability of errors in subsequent steps, giving rise to error avalanches. Figure 3.4 illustrates that the steady-state error probability tends to $\frac12$ as the parameter α increases beyond a critical value, $\alpha_c$. Equation (3.41) yields
$$ \alpha_c \approx 0.1379 \qquad (3.42) $$
for the critical storage capacity $\alpha_c$. When $\alpha > \alpha_c$, the network produces just noise. When $\alpha < \alpha_c$, by contrast, the network works well. The smaller the storage capacity, the better the network performs.
Figure 3.5: Phase diagram of the Hopfield network in the limit of large $N$ (schematic). The region with $P^{t=\infty}_{\text{error}} < \frac12$ is the ordered phase, the region with $P^{t=\infty}_{\text{error}} = \frac12$ is the disordered phase. After Figure 2 in Ref. [34].
Figure 3.4 shows that the steady-state error probability changes very abruptly near αc . Suppose we store 137 patterns with 1000 bits in a Hopfield network. Figure 3.4 demonstrates that the network can recognise the patterns with a comparatively small error probability. However, if we try to store one or two more patterns, the network fails to produce output meaningfully related to the stored patterns. This rapid change is an example of a phase transition. In many physical systems one observes similar transitions between ordered and disordered phases [32].
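The value (3.42) can be reproduced by solving Equation (3.41) numerically, for instance by checking for which α a non-trivial solution $y > 0$ still exists. A possible Python sketch (it assumes scipy is available; the bracketing interval and the grid of y-values are arbitrary choices):

```python
import numpy as np
from scipy.special import erf

def has_nontrivial_solution(alpha, y=np.linspace(1e-3, 5, 20_000)):
    """Check whether Equation (3.41) admits a solution with y > 0 (i.e. m1 > 0)."""
    g = erf(y) - y * (np.sqrt(2 * alpha) + (2 / np.sqrt(np.pi)) * np.exp(-y**2))
    return np.any(g > 0)

lo, hi = 0.05, 0.3                  # bracket for the critical storage capacity
for _ in range(60):                 # bisection on alpha
    mid = 0.5 * (lo + hi)
    if has_nontrivial_solution(mid):
        lo = mid
    else:
        hi = mid
print("alpha_c ≈", 0.5 * (lo + hi))   # should be close to the value in Equation (3.42)
```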
What happens at higher noise levels? The numerical solution of Equations (3.34), (3.26), and (3.38) shows that the critical storage capacity $\alpha_c$ decreases as the noise level increases (smaller values of β). This is shown schematically in Figure 3.5. Below the solid line the error probability is smaller than $\frac12$, so that the network operates reliably (although less so as one approaches the phase-transition boundary). Outside this region the error probability equals $\frac12$. In this region the network fails. In the limit of small α the critical noise level is $\beta_c = 1$. In this regime the network is described by the theory explained in Section 3.3, Equation (3.14).
Alternatively, these two different phases of the Hopfield network are characterised in terms of the order parameter $m_1$. We see that $m_1\neq 0$ below the solid line, while $m_1 = 0$ above it, in the limit of large $N$.
3.5 Beyond mean-field theory
The theory summarised in this Chapter rests on a mean-field approximation for the local field, Equation (3.6). The main result is the phase diagram shown in Figure 3.5, derived in the limit N → ∞. For smaller values of N one expects the transition to be less sharp, so that m1 is non-zero also for values of α larger than the critical storage capacity αc.
But even for large values of N , the question remains how reliable the mean-field theory really is. To answer this question, one uses a more accurate theory, based on the so-called replica trick. One starts from the steady-state distribution of s for fixed patterns x (μ) . In Chapter 4 we will see that the steady-state distribution for the McCulloch-Pitts dynamics is the Boltzmann distribution
$$ P_B(s) = Z^{-1} e^{-\beta H(s)} \qquad (3.43) $$
(the proof in Chapter 4 assumes that the diagonal weights are set to zero). The normalisation factor $Z$ is called the partition function,
$$ Z = \sum_{s} e^{-\beta H(s)} . \qquad (3.44) $$
In order to compute the order parameter, one adds a threshold term to the energy function (2.45),
$$ H = -\tfrac12\sum_{ij} w_{ij}\, s_i s_j + \sum_\mu\lambda_\mu\sum_i x_i^{(\mu)} s_i . \qquad (3.45) $$
Then the order parameter $m_\mu$ is obtained by taking a derivative w.r.t. $\lambda_\mu$:
$$ m_\mu = \frac{1}{N}\sum_i x_i^{(\mu)}\langle n_i\rangle = -\frac{1}{N\beta}\frac{\partial}{\partial\lambda_\mu}\log Z . \qquad (3.46) $$
The outer average is over different realisations of random patterns. The logarithm of Z is averaged using the replica trick. The idea is to represent the average of the logarithm as
$$ \langle\log Z\rangle = \lim_{n\to 0}\frac{1}{n}\big(\langle Z^n\rangle - 1\big) . \qquad (3.47) $$
The function $Z^n$ looks like the partition function of $n$ copies of the system, hence the name replica trick. If one assumes that all copies yield the same order parameter, one obtains the mean-field solution described in Section 3.4. If one allows different copies to have different order parameters (replica-symmetry breaking), one obtains a more accurate solution for the critical storage capacity [39],
αc =0.138187. (3.48)
The mean-field result (3.42) differs only slightly from Equation (3.48). The most precise Monte-Carlo simulations (Section 4.2) for finite values of N [40] yield upon extrapolation to N = ∞
αc =0.143±0.002. (3.49)
This is close to, yet significantly different from the best theoretical estimate, Equation (3.48), and also different from the mean-field result (3.42).
To put these results into context, note that for other systems mean-field theories tend to give results much worse than here. Usually, mean-field theories yield at best a qualitative description of a phase transition. For the Hopfield network, by contrast, the mean-field theory works very well because every neuron is connected with every other neuron. This helps to average out the fluctuations in Equation (3.6). In physical systems with local interactions, mean-field theories tend to work better in higher dimensions, because there are more neighbours to average over (Exercise 3.5).
3.6 Correlated and non-random patterns
In the two previous Sections we assumed that the stored patterns are random with independently identically distributed bits. This allowed us to calculate the storage capacity of the Hopfield network using the central-limit theorem. The hope is that the result describes what happens for typical, non-random patterns, or for random patterns with correlated bits. Correlations affect the distribution of the cross-talk term, and thus the storage capacity of the Hopfield network. It has been argued that the storage capacity increases when the patterns are more strongly correlated, while others have claimed that the capacity decreases in this limit (see Ref. [41] for a discussion).
For a set of definite patterns (no randomness to average over), the situation seems to be even more challenging. Yet there is a way of modifying Hebb’s rule to deal with this problem, at least when the patterns are linearly independent. The recipe is explained by Hertz, Krogh, and Palmer [1]. One simply incorporates the overlaps
$$ Q_{\mu\nu} = \frac{1}{N}\, x^{(\mu)}\cdot x^{(\nu)} \qquad (3.50) $$
into Hebb's rule. To this end one defines the $p\times p$ overlap matrix $\mathbb{Q}$ with elements $Q_{\mu\nu}$ and writes:
$$ w_{ij} = \frac{1}{N}\sum_{\mu\nu} x_i^{(\mu)}\big[\mathbb{Q}^{-1}\big]_{\mu\nu}\, x_j^{(\nu)} . \qquad (3.51) $$
For orthogonal patterns ($Q_{\mu\nu} = \delta_{\mu\nu}$), this modified Hebb's rule is identical to Equation (2.25). For non-orthogonal patterns, the rule (3.51) ensures that all patterns are recognised. Equation (3.51) requires that the matrix $\mathbb{Q}$ is invertible: its columns must be linearly independent (and this implies that the rows are linearly independent too). This limits the number of patterns one can store with the rule (3.51), because $p > N$ implies linear dependence.
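A small Python sketch of the rule (3.51) (names and sizes are illustrative choices), together with a check that every linearly independent stored pattern is then a fixed point of the deterministic dynamics:

```python
import numpy as np

def overlap_rule_weights(patterns):
    """Weights from the modified Hebb's rule (3.51), w = (1/N) X^T Q^{-1} X,
    with overlap matrix Q_{mu nu} = (1/N) x^(mu) . x^(nu), Equation (3.50)."""
    p, N = patterns.shape
    Q = patterns @ patterns.T / N                  # p x p overlap matrix
    return patterns.T @ np.linalg.solve(Q, patterns) / N

rng = np.random.default_rng(3)
p, N = 20, 100
patterns = rng.choice([-1, 1], size=(p, N))        # linearly independent with high probability
W = overlap_rule_weights(patterns)

for x in patterns:                                 # each stored pattern is reproduced exactly
    assert np.all(np.sign(W @ x) == x)
```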
For linearly independent patterns one can find the weights $w_{ij}$ iteratively, by successive improvement from an arbitrary starting point. We can say that the network learns the task through a sequence of weight changes. This is the idea used to solve classification tasks with perceptrons (Part II).
3.7 Summary
In this Chapter we analysed the stochastic dynamics of Hopfield networks. We asked under which circumstances the network dynamics can reliably recognise stored patterns. If the stored patterns are random, the performance of the Hopfield network depends on their number, on the number of bits per pattern, and upon the noise level. The storage capacity α equals the ratio of the number of stored patterns to the number of bits per pattern. The network operates reliably when this ratio is small, and provided the noise level is not too large. A mean-field analysis of the $N\to\infty$ limit shows that there is a phase transition in the parameter plane of the Hopfield network (Figure 3.5): when α exceeds the critical storage capacity $\alpha_c$, the network ceases to function.
Hopfield networks share many properties with the networks discussed later on in this book. The most important point is perhaps that introducing noise in the dynamics makes it possible to study the convergence and performance of the network: in the presence of noise there is a well-defined steady state that can be analysed. Without noise, in the deterministic limit, the network dynamics arrests in local minima of the energy function, and may not reach the stored patterns. Naturally the noise must be small enough for the network to function reliably. Finally, the building blocks of Hopfield networks are McCulloch-Pitts neurons and Hebb's rule for the weights. Many of the algorithms discussed in the coming Chapters use these elements in some form.
3.8 Further reading
The statistical mechanics of Hopfield networks is explained in Introduction to the theory of neural computation by Hertz, Krogh, and Palmer [1]. Starting from the Boltzmann distribution, Chapter 10 in this book summarises how to compute the order parameters, and how to evaluate the stability of the corresponding solutions.
Figure 3.6: The Ising model is a model for ferromagnetism. It describes $N$ spins that can either point up (↑) or down (↓), arranged on a lattice (here shown in one spatial dimension), interacting with their nearest neighbours with interaction strength $J$, and subject to an external magnetic field $h$. The state of spin $i$ is described by the variable $s_i$, with $s_i = 1$ for ↑ and $s_i = -1$ for ↓.
For more details on the replica trick, see the books by Müller, Reinhardt, and Strickland [37] and by Engel and van den Broeck [42], as well as the review article [43].
3.9 Exercises
3.1 Mixed states. Write a computer program that implements the stochastic dynamics of a Hopfield model. Compute the order parameter for mixed states that are superpositions of the bits of three stored patterns. Determine how it depends on the noise level for 0.5 ≤ β ≤ 2.5, small α, and large N. Solve the mean-field equation (3.21) numerically, and compare the results of this mean-field theory with those of your computer simulations. Repeat the exercise for mixed states that consist of superpositions of the bits of five stored patterns.
3.2 Deterministic limit. Derive the deterministic limit (3.39) of the three coupled Equations (3.34), (3.38), and (3.26) for m1, q , and σz .
3.3 Phase diagram of the Hopfield network. Derive Equation (3.41) from Equa- tion (3.39). Numerically solve (3.41) to find the critical storage capacity αc in the deterministic limit. Quote your result with three-digit accuracy. To determine how the critical storage capacity depends on the noise level, numerically solve the three coupled Equations (3.26), (3.33), and (3.38). Compare your result with the schematic Figure 3.5.
3.4 Non-orthogonal patterns. Show that the rule (3.51) ensures that all patterns are recognised, for any set of non-orthogonal patterns that gives rise to an invertible matrix $\mathbb{Q}$. Demonstrate this by showing that the cross-talk term evaluates to zero, assuming that $\mathbb{Q}^{-1}$ exists.
3.5 Ising model. The Ising model is a model for ferromagnetism: $N$ spins $s_i = \pm 1$ are arranged on a $d$-dimensional hypercubic lattice as shown in Figure 3.6. The energy function for the Ising model is $H = -J\sum_{i,\, j=\mathrm{nn}(i)} s_i s_j - h\sum_i s_i$. Here $J$ is the ferromagnetic coupling between nearest-neighbour spins, $h$ is an external magnetic field, and $\mathrm{nn}(i)$ denotes the nearest neighbours of site $i$ on the lattice. In equilibrium at temperature $T$, the states are distributed according to the Boltzmann distribution with $\beta = 1/(k_{\rm B}T)$ where $k_{\rm B}$ is the Boltzmann constant. Derive a mean-field approximation for the magnetisation of the system, $m = \big\langle\frac{1}{N}\sum_i s_i\big\rangle$, assuming that $N$ is large enough that the contribution of the boundary spins can be neglected. Derive an expression for the critical temperature below which mean-field theory predicts ferromagnetism, $m\neq 0$. Discuss how the critical temperature depends on the dimension $d$. Note: mean-field theory fails for the one-dimensional Ising model, but its predictions become more accurate as $d$ increases.
3.6 Storage capacity. Derive the condition (3.11) that allows one to neglect the cross-talk terms in Equation (3.10).
4 The Boltzmann distribution
In Chapter 2 we saw that the deterministic dynamics (2.5) of Hopfield networks admits the Lyapunov function
$$ H = -\frac12\sum_{ij} w_{ij}\, s_i s_j + \sum_i\theta_i s_i , \qquad (4.1) $$
if the weights $w_{ij}$ are symmetric, and $w_{ii} \ge 0$. In this Chapter¹ we show that the asynchronous stochastic McCulloch-Pitts dynamics (3.1) converges to a steady state where the state vector $s$ follows the Boltzmann distribution
$$ P_B(s) = Z^{-1} e^{-\beta H(s)} \quad\text{with normalisation}\quad Z = \sum_{s} e^{-\beta H(s)} . \qquad (4.2) $$
The stochastic dynamics (3.1) is closely related to that of Markov-chain Monte-Carlo algorithms, designed to efficiently sample from the Boltzmann distribution. In this Chapter we also discuss how to solve optimisation tasks by Monte-Carlo simulation: one assigns a suitable energy H to each configuration s , so that the function H (s ) has global minimum for the optimal configuration s min . The stochastic dynamics finds low-energy configurations (but not necessarily s min ), in particular if one iteratively decreases the noise level by increasing β (simulated annealing [44]).
Last but not least we look at Boltzmann machines [14, 15, 45–47], stochastic Hopfield networks with hidden neurons that are neither used for input nor for output. Boltzmann machines can be trained to learn the properties of a distribution $P_{\rm data}(x)$ of binary input patterns $x$. The idea is to iteratively change the weights in Equation (4.1) until the Boltzmann distribution represents the input distribution. This idea, to iterate the weights until the network learns the input distribution $P_{\rm data}$, is used in a slightly different form in supervised learning (Part II). Boltzmann machines are closely related to Hopfield networks. Without hidden neurons, both models learn to represent two-point correlations $\langle x_i^{(\mu)} x_j^{(\mu)}\rangle$ of pattern bits.
When important information about the inputs is encoded in higher-order cor- relations, one can use hidden neurons to represent these correlations. Generally Boltzmann machines are hard to train, in particular if they have many hidden neu- rons. Restricted Boltzmann machines are neural networks with hidden neurons, but with fewer connections: only those between visible and hidden neurons are allowed. These neural networks can be fairly efficiently trained and can solve a number of different tasks. Apart from learning a distribution of input patterns, they can for instance be trained to recognise incomplete input patterns, and to classify inputs [25].
1In this Chapter we set the diagonal weights to zero.
4.1 Convergence of the stochastic dynamics
We begin by showing that the stochastic dynamics (3.1) has a steady state where s is distributed according to the Boltzmann distribution (4.2). To this end, we consider an alternative yet equivalent formulation of the network dynamics. It consists of two parts. First, choose a neuron randomly, number m say. Second, update sm to sm′ ̸= sm with probability
$$ \mathrm{Prob}(s_m \to s_m') = \frac{1}{1 + e^{\beta\Delta H_m}} , \qquad (4.3a) $$
with
$$ \Delta H_m = H(\ldots, s_m', \ldots) - H(\ldots, s_m, \ldots) . \qquad (4.3b) $$
To explore the relation between the stochastic rules (4.3) and (3.1), we use that
$$ \Delta H_m = -b_m\,(s_m' - s_m) \qquad (4.4) $$
with local field $b_m = \sum_j w_{mj} s_j - \theta_m$. To derive Equation (4.4), we must assume that the weights are symmetric, and that the diagonal weights vanish. The result is obtained with a calculation similar to the one leading to Equation (2.48), except that we have non-zero thresholds here. Now we break the rule (4.3) up into different cases. The state of neuron $m$ changes with probability
$$ \text{if } s_m = -1 \text{ obtain } s_m' = 1 \text{ with prob. } \tfrac{1}{1+e^{-2\beta b_m}} = p(b_m) , \qquad (4.5a) $$
$$ \text{if } s_m = 1 \text{ obtain } s_m' = -1 \text{ with prob. } \tfrac{1}{1+e^{2\beta b_m}} = 1 - p(b_m) . \qquad (4.5b) $$
In the second row we used that $1 - p(b) = 1 - \frac{1}{1+e^{-2\beta b}} = \frac{1+e^{-2\beta b}-1}{1+e^{-2\beta b}} = \frac{1}{1+e^{2\beta b}}$. The state remains unchanged with probability:
$$ \text{if } s_m = -1 \text{ obtain } s_m' = -1 \text{ with prob. } \tfrac{1}{1+e^{2\beta b_m}} = 1 - p(b_m) , \qquad (4.5c) $$
$$ \text{if } s_m = 1 \text{ obtain } s_m' = 1 \text{ with prob. } \tfrac{1}{1+e^{-2\beta b_m}} = p(b_m) . \qquad (4.5d) $$
Comparing with Equation (3.1) we conclude that the two schemes (3.1) and (4.3) are
equivalent under the assumptions made ($w_{ij} = w_{ji}$ and $w_{ii} = 0$). Note that Equation (4.3) is more general than the stochastic Hopfield dynamics, because it does not require the energy function to be of the form (4.1). In particular, it requires neither that the weights are symmetric, nor that the diagonal weights vanish. Equations (3.1) and (4.3) are not equivalent if these conditions are not satisfied (Exercise 4.1).
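The equivalence at the level of the single-neuron flip probabilities (4.5) can be checked numerically, as in the following short Python sketch (parameter values are arbitrary illustrative choices):

```python
import numpy as np

def p(b, beta):
    """p(b) from Equation (3.1b)."""
    return 1.0 / (1.0 + np.exp(-2.0 * beta * b))

def flip_prob_43(s_m, b_m, beta):
    """Flip probability of rule (4.3): 1/(1 + exp(beta*DeltaH)), DeltaH = -b_m (s'_m - s_m)."""
    dH = -b_m * (-s_m - s_m)                  # s'_m = -s_m
    return 1.0 / (1.0 + np.exp(beta * dH))

beta, b = 1.7, np.linspace(-3, 3, 13)
assert np.allclose(flip_prob_43(-1, b, beta), p(b, beta))        # matches (4.5a)
assert np.allclose(flip_prob_43(+1, b, beta), 1 - p(b, beta))    # matches (4.5b)
```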
The rule (4.3) defines a Markov chain of states
$$ s_{t=0} \to s_{t=1} \to s_{t=2} \to \ldots \qquad (4.6) $$
As before, the index $t$ counts the iteration steps. A Markov chain is a memoryless random sequence of states defined by transition probabilities $p(s'|s)$ from state $s$ to $s'$ [48]. The transition probability $p(s'|s)$ connects arbitrary states. One distinguishes between local moves where only one neuron may change, as above, and global moves where many neurons may change their states in a single step.
In both cases, an update consists of two parts. First, a new state $s'$ is suggested with probability $q(s'|s)$. Second, the new state $s'$ is accepted with acceptance probability
$$ p_a(s'|s) = \frac{1}{1 + e^{\beta\Delta H}} \quad\text{with}\quad \Delta H = H(s') - H(s) . \qquad (4.7) $$
As a result, the transition probability is given by a product of two factors:
$$ p(s'|s) = q(s'|s)\, p_a(s'|s) . \qquad (4.8) $$
These steps are repeated many times, creating the chain of states (4.6).
The Markov chain defined by the transition probability (4.8) has the Boltzmann distribution (4.2) as a steady-state distribution if the detailed-balance condition is
satisfied:
$$ p(s'|s)\, P_B(s) = p(s|s')\, P_B(s') . \qquad (4.9) $$
Note that this is a sufficient condition, not a necessary one [49]. There are Markov chains that do not satisfy detailed balance but still have a steady state (Exercise 4.4). Usually detailed balance implies not only that the Markov chain has PB (s ) as a steady-state distribution, but also that the distribution of states generated by the sequence (4.6) converges to PB (s ), see Ref. [48] for details.
To prove that the detailed-balance condition (4.9) holds for the transition probability (4.8), assume that a single neuron is picked randomly with uniform probability
$$ q = N^{-1} , \qquad (4.10) $$
where $N$ is the number of neurons in the network. Since $q$ does not depend on either $s$ or $s'$, the probability of suggesting a new state is clearly symmetric. Equations (4.2), (4.7) then imply:
$$ \frac{q\, e^{-\beta H(s)}}{1 + e^{\beta[H(s') - H(s)]}} = \frac{q}{e^{\beta H(s')} + e^{\beta H(s)}} = \frac{q\, e^{-\beta H(s')}}{1 + e^{\beta[H(s) - H(s')]}} . \qquad (4.11) $$
This demonstrates that the detailed-balance condition (4.9) holds for the Markov chain defined by (4.7), (4.8), and (4.10). As a consequence, the Boltzmann distribution is a steady state of the Markov chain. If the simulation converges to the steady state (as it usually does), then states visited by the Markov chain are distributed according to the Boltzmann distribution. This also means that the steady-state distribution for the Hopfield model is the Boltzmann distribution, as stated in Section 3.5.
It is important to stress that Equation (4.9) is a condition for the transition probability $p(s'|s) = q(s'|s)\, p_a(s'|s)$, not just for the acceptance probability $p_a(s'|s)$. For the local moves discussed above, $q$ is a constant, so that $p(s'|s) \propto p_a(s'|s)$. In this case it is sufficient to check the detailed-balance condition for the acceptance probability. In general, and in particular for global moves, it is necessary to include $q(s'|s)$ in the detailed-balance check [50].
4.2 Monte-Carlo simulation
The Markov chain described in the previous Section is the basis for the Markov-chain Monte-Carlo algorithm. This method is widely used in statistical physics and in mathematical statistics. It is therefore important to understand the connections between the different formulations.
The Boltzmann distribution describes the probabilities of observing configurations of a large class of physical systems in their steady states [32]. The statistical mechanics of systems with energy function (also called Hamiltonian) $H$ shows that their configurations are distributed according to the Boltzmann distribution in thermodynamic equilibrium at a given temperature $T$ (in this context $\beta^{-1} = k_{\rm B}T$ where $k_{\rm B}$ is the Boltzmann constant), and free from any other constraints. If we denote the configuration of a system by the vector $s$, then the Boltzmann distribution takes the form (4.2). The normalisation factor $Z = \sum_s e^{-\beta H(s)}$ is also called partition function. For systems with a large number of interacting degrees of freedom, the partition function can be very expensive to compute, because the sum over $s$ contains many terms. Therefore, instead of computing the distribution directly one generates a Markov chain of states with a suitable transition probability, for instance (4.3).
In practice one often uses a slightly different form of the transition probabilities (Metropolis algorithm [51]). Assuming that q is constant, one takes:
$$ p(s'|s) = q\,\begin{cases} e^{-\beta\Delta H} & \text{when } \Delta H > 0 , \\ 1 & \text{when } \Delta H \le 0 , \end{cases} \qquad (4.12) $$
with ∆H = H(s′) − H(s) as before. That the Metropolis rates obey the detailed-balance condition (4.9) can be seen as follows:
$$\begin{aligned} p(\mathbf{s}'|\mathbf{s})P_B(\mathbf{s}) &= q Z^{-1} e^{-\beta H(\mathbf{s})} \begin{cases} e^{-\beta[H(\mathbf{s}')-H(\mathbf{s})]} & \text{if } H(\mathbf{s}') > H(\mathbf{s})\\ 1 & \text{otherwise} \end{cases}\\ &= q Z^{-1} e^{-\beta \max\{H(\mathbf{s}),H(\mathbf{s}')\}} \qquad\qquad\qquad\qquad (4.13)\\ &= q Z^{-1} e^{-\beta H(\mathbf{s}')} \begin{cases} e^{-\beta[H(\mathbf{s})-H(\mathbf{s}')]} & \text{if } H(\mathbf{s}) > H(\mathbf{s}')\\ 1 & \text{otherwise} \end{cases}\\ &= p(\mathbf{s}|\mathbf{s}')P_B(\mathbf{s}'). \end{aligned}$$
The Metropolis algorithm is summarised in Algorithm 2. It provides an elegant way of computing the average 〈A〉 of an observable A(s ) over the Boltzmann distribution
of s:
$$\langle A\rangle = Z^{-1}\sum_{\mathbf{s}} A(\mathbf{s})\, e^{-\beta H(\mathbf{s})} \approx \frac{1}{T}\sum_{t=1}^{T} A(\mathbf{s}_t). \qquad (4.14)$$
This particular way of evaluating the average 〈A〉 is a special case of the more general
method of importance sampling [52]. The central-limit theorem implies that the error of this estimate for 〈A〉 decreases ∝ T −1/2 as T increases. The prefactor is determined by the correlations between subsequent terms in the sum (4.14): the states in the sequence (4.6) are correlated, in particular when the moves are local, because then subsequent configurations are similar. Generating many quite strongly correlated samples from a distribution is not a very efficient way of sampling this distribution. Sometimes it may be more efficient to suggest global moves instead, in order to avoid that subsequent states in the Markov chain are similar. But it is not guaranteed that global moves lead to weaker correlations. For global moves, ∆H may be more likely to assume large positive values, so that fewer suggested moves are accepted. As a consequence the Markov chain may stay in certain states for a long time, increasing correlations in the sequence. Usually a compromise is most efficient, moves that are neither local nor global. In summary, the convergence of Monte-Carlo sampling is quite slow. This motivated Sokal to begin his lecture notes on Monte-Carlo simulation with the warning [49]
Monte Carlo is an extremely bad method; it should be used only when all alternative methods are worse.
Monte-Carlo algorithms are very widely used, and the original reference for the Metropolis algorithm [51] is generally considered one of the most significant scien- tific papers in computational physics. Sokal’s point is of course that many problems cannot be solved in any other way, so that Monte-Carlo simulation is the only op- tion. But we should be aware of the shortcomings of the method. The same caution applies more generally to the topic of this book, machine-learning algorithms with neural networks.
Algorithm 2 Metropolis algorithm for symmetric q(s′|s)
  initialise s = s0;
  for t = 1,...,T do
    suggest a new state s′ with probability q(s′|s);
    compute ∆H = H(s′) − H(s);
    if ∆H ≤ 0 then
      accept the new state: s = s′;
    else
      draw a random number r uniformly distributed in [0, 1];
      if r < exp(−β∆H) then
        accept the new state: s = s′;
      else
        reject s′;
      end if
    end if
    sample st = s and At = A(st);
  end for
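To make the procedure concrete, here is a minimal Python sketch of Algorithm 2. The Hopfield-type toy energy, the single-spin-flip proposal, and the observable A(s) = mean of s are illustrative assumptions, not part of the algorithm itself.

import numpy as np

def metropolis(H, s0, propose, beta, T, A, rng):
    """Algorithm 2: Metropolis sampling with a symmetric proposal q(s'|s).
    Returns the time series A_t used in the estimate (4.14)."""
    s = s0.copy()
    samples = np.empty(T)
    for t in range(T):
        s_new = propose(s, rng)                  # suggest a new state s'
        dH = H(s_new) - H(s)                     # energy change Delta H
        if dH <= 0 or rng.random() < np.exp(-beta * dH):
            s = s_new                            # accept, otherwise keep s
        samples[t] = A(s)                        # sample A_t = A(s_t)
    return samples

# toy example: N spins s_i = +-1 with a symmetric random coupling matrix
rng = np.random.default_rng(1)
N = 20
W = rng.normal(size=(N, N)) / N
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0.0)                         # symmetric weights, zero diagonal
H = lambda s: -0.5 * s @ W @ s                   # Hopfield-type energy function
def flip_one(s, rng):                            # local move: flip a single spin
    s_new = s.copy()
    s_new[rng.integers(len(s))] *= -1
    return s_new
s0 = rng.choice([-1.0, 1.0], size=N)
A_t = metropolis(H, s0, flip_one, beta=2.0, T=10_000, A=np.mean)
print("estimate of <A>:", A_t[1000:].mean())     # discard the initial transient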
4.3 Simulated annealing
Combinatorial optimisation problems admit $2^k$ or $k!$ configurations – too many to find the optimal one by complete enumeration when k is large. An alternative strategy is to assign an energy H(s) to each configuration s so that H is minimal at the optimal configuration smin. One minimises H(s) by Monte-Carlo simulation, using that the Monte-Carlo dynamics tends to decrease H when the temperature
Figure 4.1: Schematic. Simulated annealing (arrows) tends to reduce the energy function. Noise helps to prevent the dynamics from arresting in a local minimum.
kBT = β−1 is low, Figure 4.1. A common strategy is to lower the temperature on the fly. In the beginning of the simulation, the temperature is high, so that the dynamics first explores the rough features of the energy landscape. When the temperature is lowered, the dynamics perceives finer and finer features of H (s ). The hope is that it ends up in the global minimum Hmin = H (s min ) at zero temperature, where PB(s) = 0 when H(s) > Hmin and PB(s) > 0 only for H(s) = Hmin. This method is called simulated annealing [44], see also Section 10.9 in Numerical Recipes [53]. Slowly lowering the temperature during the simulation mimics the slow cooling of a physical system. It passes through a sequence of quasi-equilibrium Boltzmann distributions with lower and lower temperatures, until the system hopefully finds the global minimum Hmin.
For a number of combinatorial optimisation problems one can write down energy
functions that have the same form as Equation (2.50) with symmetric weights [54].
Since $s_j^2 = 1$, one can always assume that the diagonal weights $w_{jj}$ vanish, because they
make only a constant contribution to H . In short, one can use the Hopfield dynamics (3.1) to minimise H . The travelling-salesman problem has been solved in this way [1, 54], gradually reducing the noise level as one iterates the stochastic dynamics. It is by no means necessary to use a Hopfield model for this purpose. Instead we can just use the stochastic dynamics (4.3) or the Metropolis algorithm (4.12) to solve combinatorial optimisation problems by simulated annealing. Nevertheless, a crucial step is to find a suitable energy function.
As an example, consider the double-digest problem. It arose when sequencing the human genome [55, 56]. The human genome sequence was first assembled by piecing together overlapping DNA segments in the right order by making sure that overlapping segments share the same DNA sequence. To this end it is necessary to uniquely identify the DNA segments. The actual DNA sequence of a segment is a unique identifier. But it is sufficient and more efficient to identify a DNA segment by a fingerprint, for example the sequence of restriction sites. These are short subse- quences (four or six base pairs long) that are recognised by enzymes that cut (digest) the DNA strand precisely at these sites. A DNA segment is identified by the types and locations of restriction sites that it contains, the so-called restriction map.
When a DNA segment is cut by two different enzymes one can experimentally determine the lengths of the resulting fragments. Is it possible to determine how the cuts were ordered in the DNA sequence of the segment from the fragment lengths, to find the restriction map? This is the double-digest problem [55]. In a double- digest experiment, a given DNA sequence is first digested by one enzyme (A say). Assume that this results in n fragments with lengths ai (i = 1,…,n). Second, the DNA sequence is digested by another enzyme, B . In this case m fragments are found, with lengths b1 , b2 , . . . , bm . Third, the DNA sequence is digested with both enzymes A and B, yielding l fragments with lengths c1,…,cl , see Table 4.1 for examples.
L = 10000
a = [5976, 1543, 1319, 1120, 42]
b = [4513, 2823, 2057, 607]
c = [4513, 1543, 1319, 1120, 607, 514, 342, 42]
L = 20000
a = [8479, 4868, 3696, 2646, 169, 142]
b = [11968, 5026, 1081, 1050, 691, 184]
c = [8479, 4167, 2646, 1081, 881, 859, 701, 691, 184, 169, 142]
L = 40000
a = [9979, 9348, 8022, 4020, 2693, 1892, 1714, 1371, 510, 451]
b = [9492, 8453, 7749, 7365, 2292, 2180, 1023, 959, 278, 124, 85]
c = [7042, 5608, 5464, 4371, 3884, 3121, 1901, 1768, 1590, 959, 899, 707, 702, 510, 451, 412,
278, 124, 124, 85]
Table 4.1: Example configurations for the double-digest problem [55] for three different chromosome lengths L. For each example, three ordered fragment sets are given, corresponding to the result of digestion with A, with B, and with both A and B.
The task is now to determine all possible orderings of the a- and b-cuts that result in l fragments with lengths c1,c2,…,cl . Since the solutions of the double-digest problem are degenerate, an important question is to determine how many distinct solutions there are (Exercise 4.5).
To write down an energy function, denote the ordered set of fragment lengths produced by digesting with enzyme A by $a = \{a_1,\ldots,a_n\}$, where $a_1 \ge a_2 \ge \ldots \ge a_n \ge 1$. Similarly $b = \{b_1,\ldots,b_m\}$ ($b_1 \ge b_2 \ge \ldots \ge b_m \ge 1$) for fragment lengths produced by enzyme B, and $c = \{c_1,\ldots,c_l\}$ ($c_1 \ge c_2 \ge \ldots \ge c_l \ge 1$) for fragment lengths produced by digesting first with A and then with B. Permutations σ and μ of the sets a and b result in a set of c-fragments that we call $\hat{c}(\sigma,\mu)$. Solutions of the double-digest problem correspond to permutations [σ, μ] that yield $\hat{c}(\sigma,\mu) = c$. A suitable energy function is therefore [56]
$$H(\sigma,\mu) = \sum_j c_j^{-1}\big[c_j - \hat{c}_j(\sigma,\mu)\big]^2, \qquad (4.15)$$
and configuration space is the space of all permutation pairs s = [σ, μ]. Local moves in configuration space correspond to inversions of short subsequences of σ and/or μ. One can show that the corresponding q (s ′|s ) is symmetric (Exercise 4.5). As mentioned above, this is necessary for the stochastic dynamics to converge in its simplest form, Equation (4.12) and Algorithm 2.
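For concreteness, the energy (4.15) can be sketched in Python as follows. The helper names, and the padding of ĉ with zeros when it contains a different number of fragments than c, are assumptions made only for illustration.

import numpy as np

def double_digest_fragments(a_perm, b_perm):
    """Fragment lengths c_hat(sigma, mu) obtained by cutting with both enzymes,
    for given orderings of the a- and b-fragments."""
    cuts_a = np.cumsum(a_perm)[:-1]                  # interior cut positions of enzyme A
    cuts_b = np.cumsum(b_perm)[:-1]                  # interior cut positions of enzyme B
    L = sum(a_perm)                                  # total length, equals sum(b_perm)
    cuts = np.unique(np.concatenate(([0], cuts_a, cuts_b, [L])))
    return np.sort(np.diff(cuts))[::-1]              # double-digest fragments, descending

def energy(a_perm, b_perm, c):
    """Energy function (4.15)."""
    c = np.sort(np.asarray(c, dtype=float))[::-1]
    c_hat = double_digest_fragments(a_perm, b_perm)
    n = max(len(c), len(c_hat))                      # pad with zeros if lengths differ
    cp, chp = np.zeros(n), np.zeros(n)
    cp[:len(c)], chp[:len(c_hat)] = c, c_hat
    return float(np.sum((cp - chp) ** 2 / np.where(cp > 0, cp, 1.0)))

# first example of Table 4.1; the given ordering need not be a solution
a = [5976, 1543, 1319, 1120, 42]
b = [4513, 2823, 2057, 607]
c = [4513, 1543, 1319, 1120, 607, 514, 342, 42]
print(energy(a, b, c))

Local moves then amount to reversing short subsequences of a_perm or b_perm, and the Metropolis step of Algorithm 2 decides whether to accept them.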
For the simulation, one chooses a larger temperature kBT = β−1 to begin with, so
Figure 4.2: Boltzmann machine with five neurons. All weights are symmetric, the diagonal weights are set to zero. The states of the neurons are denoted by si = ±1. This neural network has no hidden units. It looks like a Hopfield network, but the weights are not given by Hebb’s rule.
that the stochastic dynamics explores the rough features of the energy landscape at first. As the simulation proceeds, the temperature is gradually reduced. This allows the dynamics to learn finer features of the landscape, as described above.
4.4 Boltzmann machines
Boltzmann machines are generalised Hopfield networks that can learn to approxi- mate data distributions of binary input patterns. Boltzmann machines differ from Hopfield networks in two essential ways. First, instead of using Hebb’s rule, the weights are adjusted until the Boltzmann machine approximates the data distribu- tion precisely. The weights are iteratively refined to minimise the difference between the data distribution and the model (the Boltzmann distribution). Nevertheless, this procedure is closely related to Hebb’s rule, as we shall see. Second, to repre- sent higher-order correlations between bits of input patterns, Boltzmann machines employ hidden neurons.
We begin with Boltzmann machines without hidden neurons (Figure 4.2), be- cause they are simpler to analyse. Then we discuss why hidden neurons are neces- sary to learn the properties of general input distributions Pdata(x ) of binary inputs x . The training algorithm for Boltzmann machines with hidden neurons is described in Section 4.5.
The goal of the training algorithm is to find weights so that the Boltzmann distribution
$$P_B(\mathbf{s}=\mathbf{x}) = Z^{-1}\exp\Big(\tfrac{1}{2}\sum_{i\ne j} w_{ij}\, x_i x_j\Big) \qquad (4.16)$$
approximates the distribution Pdata(x) as precisely as possible. Here and in the remainder of this Chapter we set β = 1. The input patterns have N binary bits
[Equation (2.1)] with values ±1. The weight matrix is symmetric, $w_{ij} = w_{ji}$, and its diagonal elements are set to zero, $w_{ii} = 0$. In this Section we also set the thresholds to zero.
The Boltzmann machine is trained by iteratively adjusting the weights $w_{ij}$, using a sequence of input patterns $\mathbf{x}^{(\mu)}$ ($\mu = 1,\ldots,p$), independently sampled from the data distribution $P_{\rm data}(\mathbf{x})$. This is achieved by maximising the likelihood $\mathscr{L} = \prod_{\mu=1}^{p} P_B(\mathbf{s}=\mathbf{x}^{(\mu)})$ that the Boltzmann machine produces the sequence $\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(p)}$ of input patterns. Any pattern may appear more than once in the sequence, with frequency proportional to $P_{\rm data}(\mathbf{x})$. Maximising $\mathscr{L}$ therefore corresponds to approximating the data distribution as accurately as possible. Usually one maximises the logarithm of the likelihood, the log-likelihood function
$$\log\mathscr{L} = \log\prod_{\mu=1}^{p} P_B(\mathbf{s}=\mathbf{x}^{(\mu)}) = \sum_{\mu=1}^{p}\log P_B(\mathbf{s}=\mathbf{x}^{(\mu)}). \qquad (4.17)$$
The logarithm is a monotonic function, so the log-likelihood has its maximum at the
same weight values as the likelihood. Taking the logarithm simplifies the analysis
of the learning algorithm, because $\log P_B(\mathbf{s}=\mathbf{x}^{(\mu)})$ is simply a quadratic function of $x_j^{(\mu)}$. Also, a learning algorithm based on the log-likelihood is usually more stable numerically.
A different reasoning behind maximising the log-likelihood starts from the Kullback-
Leibler divergence, defined as
$$D_{\rm KL} = \sum_{\mu=1}^{p} P_{\rm data}(\mathbf{x}^{(\mu)})\,\log\!\big[P_{\rm data}(\mathbf{x}^{(\mu)})/P_B(\mathbf{s}=\mathbf{x}^{(\mu)})\big]. \qquad (4.18)$$
Terms in the sum with $P_{\rm data}(\mathbf{x}^{(\mu)}) = 0$ are set to zero, and DKL is defined to equal infinity when there are patterns for which $P_B = 0$ but $P_{\rm data}\neq 0$. The Kullback-Leibler divergence is a measure of the difference between the two distributions: DKL is non-negative, and it assumes its global minimum DKL = 0 for $P_{\rm data}(\mathbf{x}^{(\mu)}) = P_B(\mathbf{s}=\mathbf{x}^{(\mu)})$, see Exercise 4.6. We infer from Equation (4.18) that minimising DKL corresponds to maximising $\log\mathscr{L}$.
To find the global maximum of the log-likelihood, we use gradient ascent: we repeatedly change the weights by adding increments
$$w'_{mn} = w_{mn} + \delta w_{mn} \quad\text{with}\quad \delta w_{mn} = \eta\,\frac{\partial \log\mathscr{L}}{\partial w_{mn}}. \qquad (4.19)$$
The small parameter η > 0 is the learning rate. The gradient points in the steepest uphill direction of $\log\mathscr{L}$. The idea is to take many small uphill steps until one hopefully (but not necessarily) reaches the global maximum. Since the likelihood is a product
of many possibly quite small factors, $\mathscr{L}$ can become very small. This can lead to numerical instabilities. Maximising $\log\mathscr{L}$ instead of $\mathscr{L}$ can be more stable because it yields an additional factor $\mathscr{L}^{-1}$ in the gradient: $\partial\log\mathscr{L}/\partial w_{mn} = \mathscr{L}^{-1}\,\partial\mathscr{L}/\partial w_{mn}$.
To evaluate the gradient of $\log\mathscr{L}$ we start from Eq. (4.17):
$$\log\mathscr{L} = \sum_{\mu=1}^{p}\Big(-\log Z + \tfrac{1}{2}\sum_{i\ne j} w_{ij}\, x_i^{(\mu)} x_j^{(\mu)}\Big). \qquad (4.20)$$
Here we used that the diagonal weights vanish. The first step is to evaluate the
derivative of
$$\log Z = \log\!\!\sum_{s_1=\pm1,\ldots,s_N=\pm1}\!\!\exp\Big(\tfrac{1}{2}\sum_{i\ne j} w_{ij}\, s_i s_j\Big). \qquad (4.21)$$
To compute $\partial\log Z/\partial w_{mn}$, we use the chain rule together with
$$\frac{\partial w_{ij}}{\partial w_{mn}} = \delta_{im}\delta_{jn} + \delta_{jm}\delta_{in}. \qquad (4.22)$$
This relation is valid for symmetric weights and provided that i ≠ j and m ≠ n. In Equation (4.22), $\delta_{kl}$ is the Kronecker delta, $\delta_{kl} = 1$ if $k = l$ and zero otherwise (Chapter 2). In particular, $\delta_{im}\delta_{jn} = 1$ only if $i = m$ and $j = n$. Otherwise the product of Kronecker deltas equals zero. Equation (4.22) is illustrated by the following story (a modification of a well-known maths joke):
The linear function, x , and the constant function are going for a walk. When they suddenly see the derivative approaching, the constant func- tion gets worried. “I’m not worried” says the function x confidently, “I’m not put to zero by the derivative.” When the derivative comes closer, it says “Hi! I’m ∂ /∂ y . How are you?”
The moral is: since x and y are independent variables, ∂ x /∂ y = 0. Equation (4.22) reflects the same principle: the weights wi j and wmn are independent variables unless their indices agree. Equation (4.22) is valid for off-diagonal weights, and there are two terms on the r.h.s. because the weights are symmetric.
Returning to the derivative of $\log Z$ with respect to $w_{mn}$, one finds using Equation (4.22):
$$\frac{\partial\log Z}{\partial w_{mn}} = \sum_{s_1=\pm1,\ldots,s_N=\pm1} s_m s_n\, P_B(\mathbf{s}) \equiv \langle s_m s_n\rangle_{\rm model}. \qquad (4.23)$$
The last equality defines the two-point correlations of the model, 〈sm sn 〉model, com- puted using the steady-state distribution (4.16) of the Boltzmann machine. Evaluat- ing the derivative of the second term in Equation (4.20) gives:
$$\frac{\partial}{\partial w_{mn}}\,\tfrac{1}{2}\sum_{i\ne j} w_{ij}\, x_i^{(\mu)} x_j^{(\mu)} = x_m^{(\mu)} x_n^{(\mu)}. \qquad (4.24)$$
In summary,
$$\frac{\partial\log\mathscr{L}}{\partial w_{mn}} = \sum_{\mu=1}^{p}\Big(x_m^{(\mu)} x_n^{(\mu)} - \langle s_m s_n\rangle_{\rm model}\Big) = p\,\big(\langle x_m x_n\rangle_{\rm data} - \langle s_m s_n\rangle_{\rm model}\big). \qquad (4.25)$$
Here $\langle x_m x_n\rangle_{\rm data} = p^{-1}\sum_{\mu=1}^{p} x_m^{(\mu)} x_n^{(\mu)}$ is the two-point correlation of the input data.
Using (4.19), the learning rule becomes:
$$\delta w_{mn} = \eta\,\big(\langle x_m x_n\rangle_{\rm data} - \langle s_m s_n\rangle_{\rm model}\big), \qquad (4.26)$$
where we dropped a factor of p that only affects the numerical value of the learning rate η. The weight increments are determined by the two-point pattern correlations, just like Hebb’s rule (2.25). The first term on the r.h.s. of Eq. (4.26) has precisely the same form as Equation (2.25), a sum over two-point correlations of the input patterns. The second average is over the steady-state distribution (4.16) of the Boltzmann machine. The learning rule takes the form of the difference between two-point correlations because the task is to minimise the difference between two distributions. It is plausible that the learning rule may converge because the weight increments vanish when the model correlations equal the data correlations.
The average 〈sm sn 〉model can be approximated by numerical simulation of the McCulloch-Pitts dynamics
$$s_i' = \begin{cases} +1 & \text{with probability } p(b_i),\\ -1 & \text{with probability } 1-p(b_i), \end{cases} \quad\text{with}\quad b_i = \sum_j w_{ij}\, s_j \;\text{ and }\; p(b_i) = \frac{1}{1+e^{-2b_i}}. \qquad (4.27)$$
One must iterate Equation (4.27) until the system has reached its steady state, long enough so that any initial transient becomes negligible.
The training algorithm can be summarised as follows. One initialises all weights and computes 〈xm xn〉data from the given sequence of input patterns. One estimates 〈sm sn〉model by numerical simulation of the dynamics of the Boltzmann machine, and changes the weights using (4.26). This step is iterated, either with a sequence of new inputs, or with the same inputs but in permuted sequence. In each iteration one must compute 〈sm sn〉model again, because the weights changed. This procedure is quite slow, because it usually takes long simulations to estimate 〈sm sn〉model accurately, in each iteration of the learning algorithm.
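Schematically, the procedure may be written as in the following Python sketch, where the random data, the number of Monte-Carlo steps, and the burn-in length are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, p, eta = 5, 20, 0.05
x = rng.choice([-1.0, 1.0], size=(p, N))            # binary input patterns x^(mu)
C_data = (x.T @ x) / p                              # <x_m x_n>_data
np.fill_diagonal(C_data, 0.0)

def model_correlations(w, n_steps=2000, n_burn=500):
    """Estimate <s_m s_n>_model by iterating the stochastic dynamics (4.27)."""
    s = rng.choice([-1.0, 1.0], size=N)
    C = np.zeros((N, N))
    for t in range(n_steps):
        i = rng.integers(N)                         # asynchronous update of one neuron
        b = w[i] @ s
        s[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * b)) else -1.0
        if t >= n_burn:
            C += np.outer(s, s)
    C /= n_steps - n_burn
    np.fill_diagonal(C, 0.0)
    return C

w = np.zeros((N, N))                                # symmetric weights, zero diagonal
for iteration in range(100):
    w += eta * (C_data - model_correlations(w))     # learning rule (4.26)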
There is a more fundamental problem. Like Hebb’s rule, the learning rule (4.26) relies entirely upon two-point correlations of the input bits. This means that the Boltzmann machine cannot learn higher-order correlations between inputs. How- ever, two-point correlations may not be sufficient to represent the information
encoded in the input data. To illustrate this point, consider the Boolean XOR function (Exercise 2.13 and Chapter 5). It can be encoded in the four patterns [−1, −1, −1], [1, 1, −1], [−1, 1, 1], and [1, −1, 1]. The first two components represent the input to the XOR function. The third component represents the output, which depends on both input variables as prescribed by the XOR function. Let us define an input distribution that reflects these three-point correlations by assigning $P_{\rm data} = \tfrac{1}{4}$ to the four patterns, and setting $P_{\rm data} = 0$ otherwise. A Boltzmann machine with three neurons cannot represent this input distribution, because there is no energy function of the form (4.1) that has four minima at these patterns. So the three-point correlations encoded in the four patterns cannot be represented in terms of a Boltzmann machine in its simplest form.
Also Hopfield networks fail for the XOR function: the four states are not attractors of a Hopfield network with three neurons (Exercise 2.13). One could consider neural networks with third- or higher-order interactions [47],
$$H = -\tfrac{1}{2}\sum_{ij} w_{ij}^{(2)} s_i s_j - \tfrac{1}{6}\sum_{ijk} w_{ijk}^{(3)} s_i s_j s_k + \ldots \qquad (4.28)$$
(Exercise 2.7). But the number of weights proliferates as the order increases, render- ing the training very slow.
An alternative is to use Boltzmann machines with hidden neurons, that are neither input nor output units. The idea is that the hidden neurons can learn to represent such correlations [47]. The learning rule for the Boltzmann machines with hidden neurons is very similar to Equation (4.26), but when the number of hidden neurons is large, the Boltzmann machine is very slow to train. It is more efficient to remove all weights between visible neurons, and between hidden neurons. This is described in the next Section.
4.5 Restricted Boltzmann machines
Restricted Boltzmann machines [57] consist of visible and hidden neurons arranged in an undirected bipartite graph (Figure 4.3): the only connections are between neurons of different kinds, there are no connections between visible neurons, no connections between hidden neurons either. So the energy function for a restricted Boltzmann machine for N visible neurons vj and M hidden neurons hi takes the form
$$H = -\sum_{i=1}^{M}\sum_{j=1}^{N} w_{ij}\, h_i v_j + \sum_{j=1}^{N}\theta_j^{(v)} v_j + \sum_{i=1}^{M}\theta_i^{(h)} h_i, \qquad (4.29)$$
Figure 4.3: Restricted Boltzmann machine with three visible neurons, vj, and four hidden neurons, hi.
with weights $w_{ij}$ and thresholds $\theta_j^{(v)}$ and $\theta_i^{(h)}$. The McCulloch-Pitts dynamics reads
$$h_i' = \begin{cases} +1 & \text{with probability } p(b_i^{(h)}),\\ -1 & \text{with probability } 1-p(b_i^{(h)}), \end{cases} \quad\text{with}\quad b_i^{(h)} = \sum_{j=1}^{N} w_{ij}\, v_j - \theta_i^{(h)}, \qquad (4.30a)$$
and
$$v_j' = \begin{cases} +1 & \text{with probability } p(b_j^{(v)}),\\ -1 & \text{with probability } 1-p(b_j^{(v)}), \end{cases} \quad\text{with}\quad b_j^{(v)} = \sum_{i=1}^{M} h_i\, w_{ij} - \theta_j^{(v)}. \qquad (4.30b)$$
The diagonal weights are assumed to vanish, but the weight matrix is not required to be symmetric. Since most often M ≫ N, it is usually not even a square matrix.
The learning rule for the weights of the restricted Boltzmann machine is derived using gradient ascent on the log-likelihood for a single pattern $\mathbf{x}^{(\mu)}$:
$$\log P(\mathbf{x}^{(\mu)}) = \log\!\!\sum_{h_1=\pm1,\ldots,h_M=\pm1}\!\! P_B(\mathbf{v}=\mathbf{x}^{(\mu)},\mathbf{h}). \qquad (4.31)$$
Proceeding as in the previous Section one finds:
$$\delta w_{mn}^{(\mu)} = \eta\,\big(\langle h_m x_n^{(\mu)}\rangle_{\rm data} - \langle h_m v_n\rangle_{\rm model}\big). \qquad (4.32)$$
The first average,
$$\langle h_m x_n^{(\mu)}\rangle_{\rm data} = \sum_{h_1=\pm1,\ldots,h_M=\pm1} h_m\, x_n^{(\mu)}\prod_{i=1}^{M} P(h_i|\mathbf{v}=\mathbf{x}^{(\mu)}), \qquad (4.33)$$
can be evaluated further, using the fact that there are no connections between the hidden units. Making use of the update rule (4.30a) we find
$$\sum_{h_m=\pm1} h_m\, P(h_m|\mathbf{v}=\mathbf{x}^{(\mu)}) = p(b_m^{(h)}) - \big[1-p(b_m^{(h)})\big] = \tanh(b_m^{(h)}), \qquad (4.34)$$
just like Equation (3.7). For the other sums in Equation (4.33) we use the normalisa-
tion condition $1 = \sum_{h_k=\pm1} P(h_k|\mathbf{v}=\mathbf{x}^{(\mu)})$ to obtain:
$$\langle h_m x_n^{(\mu)}\rangle_{\rm data} = \tanh(b_m^{(h)})\, x_n^{(\mu)} = \tanh\Big(\sum_{j=1}^{N} w_{mj}\, x_j^{(\mu)} - \theta_m^{(h)}\Big)\, x_n^{(\mu)}.$$
The second average on the r.h.s. of Equation (4.32) simplifies to
$$\langle h_m v_n\rangle_{\rm model} = \Big\langle \tanh\Big(\sum_{j=1}^{N} w_{mj}\, v_j - \theta_m^{(h)}\Big)\, v_n\Big\rangle_{\rm model}. \qquad (4.35)$$
The average $\langle\cdots\rangle_{\rm model}$ is computed by Monte-Carlo sampling, using the McCulloch-Pitts dynamics (4.30) to generate the sequence
$$\mathbf{v}_{t=0}\to\mathbf{h}_{t=0}\to\mathbf{v}_{t=1}\to\mathbf{h}_{t=1}\to\mathbf{v}_{t=2}\to\cdots. \qquad (4.36)$$
In the limit t → ∞, the steady state of this sequence is distributed according to the model distribution, the Boltzmann distribution with energy function (4.29). In general only the asynchronous McCulloch-Pitts dynamics can be proven to converge (Sections 2.5 and 4.1). Here, however, the Markov chain can be generated more efficiently by updating all hidden neurons $\mathbf{h}_t$ at the same time, given $\mathbf{v}_t$, because the components of $\mathbf{h}_t$ are independent from each other since there are no connections between them. In the same way the visible neurons $\mathbf{v}_t$ are updated in parallel. To speed up the computation further, one usually only iterates for a finite number of steps, up to t = k say, and initialises the chain with $\mathbf{v}_{t=0} = \mathbf{x}^{(\mu)}$. After k steps one approximates
$$\Big\langle \tanh\Big(\sum_{j=1}^{N} w_{mj}\, v_j - \theta_m^{(h)}\Big)\, v_n\Big\rangle_{\rm model} \approx \tanh\Big(\sum_{j=1}^{N} w_{mj}\, v_{j,t=k} - \theta_m^{(h)}\Big)\, v_{n,t=k}. \qquad (4.37)$$
This algorithm is called contrastive-divergence or CD-k algorithm (Algorithm 3). Since the average over the model distribution is approximated [Equation (4.37)], this algorithm does not precisely correspond to gradient ascent. In summary,
$$\delta w_{mn} = \eta\Big[\tanh\Big(\sum_j w_{mj}\, v_{j,t=0} - \theta_m^{(h)}\Big)\, v_{n,t=0} - \tanh\Big(\sum_j w_{mj}\, v_{j,t=k} - \theta_m^{(h)}\Big)\, v_{n,t=k}\Big]. \qquad (4.38)$$
Figure 4.4: Pattern completion for the bars-and-stripes data set [47]. (a) All patterns in the 3 × 3 bars-and-stripes data set (□ corresponds to −1, ■ to +1). (b) The three visible units [v1, v2, v3] corresponding to the first row are clamped to [+1, −1, +1] and remain fixed to these values. The remaining units are initially set to 0 (gray bits), and their states are allowed to change while sampling from the restricted Boltzmann machine. After a short transient of the McCulloch-Pitts dynamics, the pattern is correctly completed. Schematic, after Figure 7 in Ref. [25].
The analogous learning rules for the thresholds read:
$$\delta\theta_n^{(v)} = -\eta\,\big(v_{n,t=0} - v_{n,t=k}\big), \qquad (4.39a)$$
$$\delta\theta_m^{(h)} = -\eta\Big[\tanh\Big(\sum_j w_{mj}\, v_{j,t=0} - \theta_m^{(h)}\Big) - \tanh\Big(\sum_j w_{mj}\, v_{j,t=k} - \theta_m^{(h)}\Big)\Big]. \qquad (4.39b)$$
The derivation of Equation (4.39) is left as an exercise (Exercise 4.10). Usually re- stricted Boltzmann machines have 0/1 neurons with state values 0 and 1 instead of −1 and 1. For 0/1 neurons, the CD-k algorithm is slightly different (Exercise 4.11).
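As an illustration, one CD-k update for a single ±1 pattern could be sketched in Python as follows; the array shapes, function names, and the toy initialisation are assumptions, not the reference implementation.

import numpy as np

rng = np.random.default_rng(0)

def p(b):
    return 1.0 / (1.0 + np.exp(-2.0 * b))            # p(b) for beta = 1

def sample_pm1(prob):
    return np.where(rng.random(prob.shape) < prob, 1.0, -1.0)

def cd_k_update(w, th_v, th_h, x, k=1, eta=0.1):
    """One CD-k step, Equations (4.30), (4.38) and (4.39), for one pattern x."""
    v0 = x.copy()
    b_h0 = w @ v0 - th_h                             # hidden local fields at t = 0, Eq. (4.30a)
    h = sample_pm1(p(b_h0))
    v, b_h = v0, b_h0
    for t in range(k):                               # alternate updates, sequence (4.36)
        v = sample_pm1(p(h @ w - th_v))              # Eq. (4.30b)
        b_h = w @ v - th_h
        h = sample_pm1(p(b_h))
    dw = eta * (np.outer(np.tanh(b_h0), v0) - np.outer(np.tanh(b_h), v))   # Eq. (4.38)
    dth_v = -eta * (v0 - v)                          # Eq. (4.39a)
    dth_h = -eta * (np.tanh(b_h0) - np.tanh(b_h))    # Eq. (4.39b)
    return w + dw, th_v + dth_v, th_h + dth_h

# toy usage: N = 3 visible and M = 8 hidden neurons
N, M = 3, 8
w = 0.1 * rng.normal(size=(M, N))
th_v, th_h = np.zeros(N), np.zeros(M)
w, th_v, th_h = cd_k_update(w, th_v, th_h, np.array([1.0, -1.0, 1.0]), k=10)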
Figure 4.4 illustrates how a restricted Boltzmann machine can learn to complete patterns, using the bars-and-stripes data set [25, 47] as an example. To begin with, the restricted Boltzmann machine is trained using the CD-k algorithm. Then consider a partially obscured pattern. Assume for instance that only the upper row of its bits is known: v1 = +1, v2 = −1, and v3 = +1. The remaining bits v4,…,v9 are obscured, their states are set to zero as shown in Figure 4.4(b). To complete the pattern, one samples from the Boltzmann distribution PB(v4,…,v9 | v1 = +1, v2 = −1, v3 = +1) keeping v1 = +1, v2 = −1, v3 = +1 fixed (clamping these neurons), and iterates the McCulloch-Pitts dynamics for the remaining ones. Panel (b) shows how the machine outputs the correct completed pattern.
Figure 4.5: Restricted-Boltzmann-machine learning for the XOR problem [panel (a)], see Section 4.4. Panel (b) shows numerical estimates of DKL versus the number M of hidden neurons, in comparison with the upper bound (4.40). Schematic, based on simulations performed by Arvid Wenzel Wartenberg using the CD-k algorithm for k = 100, with learning rate η = 0.1, averaging over 500 realisations.
This requires hidden neurons, because three-point correlations are needed to discriminate between bar and stripe patterns. In general a restricted Boltzmann machine can approximate a distribution $P_{\rm data}$ of binary input data better with more hidden neurons. How many are needed [58, 59]? The answer is not known in general, but it is plausible that $M\sim 2^N$ hidden neurons are sufficient, because each hidden neuron can encode one of the binary input patterns (winning neuron, Section 7.1). More precisely, it can be shown that $M = 2^N/2 - 1$ hidden neurons are sufficient to reach arbitrarily small Kullback-Leibler divergence [60]. For binary data, an upper bound for the Kullback-Leibler divergence was derived in Ref. [61, 62]:
$$D_{\rm KL} \le \begin{cases} \log 2\,\Big[N - \lfloor\log_2(M+1)\rfloor - \dfrac{M+1}{2^{\lfloor\log_2(M+1)\rfloor}}\Big] & M < 2^{N-1}-1,\\[2mm] 0 & M \ge 2^{N-1}-1. \end{cases} \qquad (4.40)$$
Here ⌊· · · ⌋ denotes the integer part. Figure 4.5 illustrates this result. It demonstrates how well a restricted Boltzmann machine approximates the XOR distribution in- troduced in Section 4.4. The Figure shows how the Kullback-Leibler divergence depends on the number of hidden neurons (Exercise 4.7). In this example there are N = 3 inputs. We see that three hidden neurons are sufficient to allow the restricted Boltzmann machine to approximate the data distribution very precisely, consistent with Equation (4.40). In general, however, the CD-k algorithm is not guaranteed to converge to the optimal solution corresponding to the estimate (4.40).
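For reference, the bound (4.40) is easily evaluated numerically; the following short Python sketch (the function name is an assumption) reproduces the values relevant for the XOR example with N = 3.

import numpy as np

def dkl_upper_bound(N, M):
    """Upper bound (4.40) for a restricted Boltzmann machine with
    N visible and M hidden +-1 neurons."""
    if M >= 2 ** (N - 1) - 1:
        return 0.0
    j = int(np.floor(np.log2(M + 1)))                # integer part of log2(M+1)
    return np.log(2) * (N - j - (M + 1) / 2 ** j)

for M in range(9):
    print(M, round(dkl_upper_bound(3, M), 3))        # the bound vanishes for M >= 3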
Restricted Boltzmann machines are generative models, they can be used to sample from a distribution the machine has learned [25]. In this way, the machine can complete missing information, as illustrated in Figure 4.4. Restricted Boltzmann machines can also learn to classify patterns, by learning a distribution of binary inputs together with their labels. To this end one splits the visible neurons into input
neurons and output neurons with labels or targets. This is a supervised-learning task, the subject of Part II. Recently, restricted Boltzmann machines were used to represent and analyse ground-state wave functions of quantum many-body systems [63].
4.6 Summary
This Chapter dealt with the Boltzmann distribution. Two main points are, first, that the stochastic McCulloch-Pitts dynamics (3.1) has the Boltzmann distribution as a steady state. Second, the update rule (3.1) is a special case of the Markov-chain Monte-Carlo algorithm, for Hopfield models with the energy function (2.45). Since this algorithm tends to decrease the energy function, it can be used to solve complex optimisation problems. In simulated annealing one gradually reduces the noise level as the simulation proceeds. This mimics the slow cooling of a physical system, usually an efficient way of bringing the system into its global optimum.
Boltzmann machines are generalisations of Hopfield networks that can learn distributions of binary data by iteratively changing the weights and thresholds until the corresponding Boltzmann distribution approximates the data distribution. The learning rule is derived using gradient ascent on a target function, in this case the log-likelihood. A related idea is used for training deep neural networks with stochas- tic gradient descent (Part II). To learn general input distributions of binary patterns requires hidden neurons, also this is a central topic of Part II. Since Boltzmann ma- chines with many hidden neurons are hard to train, one removes connections that are not needed. Restricted Boltzmann machines have connections only between visible and hidden neurons.
4.7 Further reading
Older but still good references for Monte-Carlo methods in statistical physics are the book Monte Carlo methods in Statistical Physics edited by Binder [52], and Sokal’s lecture notes [49]. Some historical notes are found in Ref. [64].
For a concise introduction to Boltzmann machines, refer to Information the- ory, inference and learning algorithms by MacKay [47], or to Machine learning: a probabilistic perspective by Murphy [65]. Ref. [66] is a more mathematical review of restricted Boltzmann machines.
How many hidden neurons should one allow for in a restricted Boltzmann machine? Little is known apart from the upper bound (4.40) for the Kullback-Leibler divergence, and simulations [60] show that one can get very precise approximations of Pdata with fewer hidden neurons than stipulated by Equation (4.40).
Figure 4.6: Markov chain with three states. The transition probability from state l to state k is denoted by p(k|l) (Exercise 4.4).
Deep-belief networks consist of layers of restricted Boltzmann machines [2]. Contrastive-divergence training for such deep architectures (networks with many layers) is one of the first examples of deep-learning algorithms [67] (Chapter 7).
Helmholtz machines [68, 69] are generalisations of Boltzmann machines designed as more efficient generative models. They consist of two networks, encoder and decoder, just like variational autoencoders (Section 10.6). The encoder (called recognition model in Ref. [69]) generates a compressed representation of the data distribution, and the decoder (generative model) generates patterns from the com- pressed representation.
4.8 Exercises
4.1 Asymmetric weights. Show that Equations (3.1) and (4.3) are not equivalent for the network shown in Figure 2.9. The reason is that the weights are not symmet- ric.
4.2 Stochastic dynamics with 0/1-neurons. Derive the equivalent of the stochastic dynamics (3.1) for 0/1-neurons with states $n_j = 0$ or 1. Write down the equivalent form (4.3) with energy function $H = -\tfrac{1}{2}\sum_{ij} w_{ij} n_i n_j + \sum_i \mu_i n_i$. Why are the two formulations not equivalent if the weights are asymmetric, or if some $w_{ii}$ are positive?
4.3 Metropolis algorithm. Use the Metropolis algorithm to generate a Markov chain that samples the exponential distribution P (x ) = exp(−x ).
4.4 Markov chain. Figure 4.6 illustrates the transition probabilities pl →k for a Markov chain on a state space with three states. Find the steady state of this Markov chain. Does this chain satisfy detailed balance?
4.5 Double-digest problem. Implement simulated annealing for the double-digest problems given in Table 4.1. Use the energy function (4.15). Configuration space is the space of all permutation pairs [σ, μ]. Local moves correspond to inversions
of short subsequences of σ and/or μ. Check that the scheme of suggesting new states is symmetric. Using simulated annealing, determine the degeneracy of the solutions for the fragment sets shown in Table 4.1.
4.6 Kullback-Leibler divergence. Show that the Kullback-Leibler divergence DKL, Equation (4.18), is non-negative, and that it assumes its global minimum DKL = 0 at Pdata(x (μ)) = PB(s = x (μ)). Show that minimising DKL is equivalent to maximising the log-likelihood (4.17).
4.7 XOR function. Program a restricted Boltzmann machine to learn the XOR function, by approximating the following data distribution over three-bit binary patterns: $P_{\rm data} = \tfrac{1}{4}$ for [−1,−1,−1], [1,1,−1], [−1,1,1], and [1,−1,1], and $P_{\rm data} = 0$ otherwise. Plot the Kullback-Leibler divergence as a function of iteration number for different numbers of hidden neurons: 0, 2, 4, and 8.
4.8 Shifter ensemble. Explain why the shifter ensemble [15, 47] cannot be ap- proximated by a Boltzmann machine without hidden neurons.
4.9 McCulloch-Pitts dynamics for restricted Boltzmann machine. Write down the deterministic analogue of the update rule (4.30) and show that the energy func- tion of the restricted Boltzmann machine cannot increase under this rule. Note that it is not required that the weight matrix is symmetric, or that the diagonal elements are non-positive.
4.10 Thresholds in restricted Boltzmann machines. Derive the learning rule (4.39) for the thresholds for a restricted Boltzmann machine.
4.11 Restricted Boltzmann machine with 0/1 neurons. Derive the contrastive divergence algorithm for training a restricted Boltzmann machine with 0/1 neurons.
4.12 Bars-and-stripes data set. Train a restricted Boltzmann machine with Algorithm 3 to learn the bars-and-stripes data set (Figure 4.4). After training, sample the model distribution for M = 2, 4, 8, 16 and 32 hidden neurons and use the numerical results to estimate the Kullback-Leibler divergence DKL as a function of M. Compare with the theoretical upper bound (4.40).
Algorithm 3 contrastive divergence CD-k for ±1 neurons
  initialise weights and thresholds;
  for ν = 1,...,νmax do
    sample p0 patterns from the data distribution (p0 ≤ p);
    for μ = 1,...,p0 do
      initialise v(0) ← x(μ);
      update all hidden neurons: b(h)(0) ← w v(0) − θ(h);
      for i = 1,...,M do
        hi(0) ← +1 with probability p(b(h)i(0)), otherwise hi(0) ← −1;
      end for
      for t = 1,...,k do
        update all visible neurons: b(v)(t−1) ← h(t−1)·w − θ(v);
        for j = 1,...,N do
          vj(t) ← +1 with probability p(b(v)j(t−1)), otherwise vj(t) ← −1;
        end for
        update all hidden neurons: b(h)(t) ← w v(t) − θ(h);
        for i = 1,...,M do
          hi(t) ← +1 with probability p(b(h)i(t)), otherwise hi(t) ← −1;
        end for
      end for
      compute weight and threshold increments:
        δwmn ← η[tanh(b(h)m(0)) vn(0) − tanh(b(h)m(k)) vn(k)];
        δθ(v)n ← −η[vn(0) − vn(k)];
        δθ(h)m ← −η[tanh(b(h)m(0)) − tanh(b(h)m(k))];
    end for
    for μ = 1,...,p0 do
      adjust weights and thresholds;
    end for
  end for
PART II SUPERVISED LEARNING
The Hopfield network described in Part I recognises patterns stored using Hebb’s rule. Its neurons act as inputs and outputs. After feeding a distorted pattern into the network, the network dynamics runs until it reaches a steady state which hopefully corresponds to the stored pattern closest to the distorted one. In this case, the network classifies the distorted pattern by associating it with the closest one amongst the stored patterns.
Part II describes supervised learning, a different way of solving classification tasks with neural networks using labeled data sets. The machine-learning repository [70] at the University of California Irvine contains a number of such data sets. An example is the iris data set which lists certain properties of 150 iris plants. For each plant, four attributes are given (Figure 5.1): its sepal length, sepal width, petal length, and petal width. Each entry in the data set contains a label (or target) that says which class the plant belongs to: iris setosa, iris versicolor, or iris virginica. This data set was described by the geneticist R. A. Fisher [71].
The machine-learning task is to adjust weights and thresholds of a neural network so that it correctly determines the class of each plant from its attributes. To this end one uses a training set of labeled data. Each set of attributes is an input pattern to the network. The neural network is supposed to output the correct label (or target), in this case whether the plant is an iris setosa, iris versicolor, or iris virginica. One compares the network output with the corresponding target, for all input patterns in the training set, and changes the weights and thresholds until the network computes the correct output for each input pattern. The crucial question is whether the trained network can generalise: does it find the correct labels for an input pattern not contained in the training set?
The networks used for supervised learning are called perceptrons [10]. They consist of layers of McCulloch-Pitts neurons: a number of layers of hidden neurons, and an output layer. We briefly discussed the idea of hidden neurons in connection with restricted Boltzmann machines (Section 4.5), but perceptrons have different layouts, and they are trained in a different way. The layers are usually arranged from the left (input) to the right (output). All connections are one-way, from neurons in one layer to neurons in the layer immediately to the right. There are no connections between neurons in a given layer, or back to layers on the left. This arrangement ensures convergence of the training algorithm (stochastic gradient descent). During training with this algorithm, the network parameters are changed iteratively. In each step, an input is applied, and weights and thresholds of the network are updated to reduce the output error. Loosely speaking, each step corresponds to adding a little bit of Hebb’s rule to the weights. This is repeated until the network classifies the training set correctly.
Stochastic gradient descent for multilayer perceptrons has received much atten- tion recently, after it was realised that networks with many hidden layers can be
sepal length  sepal width  petal length  petal width   classification
6.3           2.5          5.0           1.9           virginica
5.1           3.5          1.4           0.2           setosa
5.5           2.6          4.4           1.2           versicolor
4.9           3.0          1.4           0.2           setosa
6.1           3.0          4.6           1.4           versicolor
6.5           3.0          5.2           2.0           virginica

Figure 5.1: Left: petals and sepals of the iris flower. Right: six entries of the iris data set [70]. All lengths in cm. The whole data set contains 150 entries.
Figure 5.2: Feed-forward network with one hidden layer. The input terminals are coloured black. We use the notation of Ref. [1]: $W_{ij}$ for the weights connecting to the output neuron $O_i$ (with threshold $\Theta_i$), and $w_{jk}$ for the weights connecting to the hidden neuron $V_j$ (with threshold $\theta_j$).
trained to reliably recognise and classify image data (deep learning).
5 Perceptrons
In 1958 Rosenblatt [10] suggested to connect McCulloch-Pitts neurons into layered feed-forward networks to process information. He referred to these networks as perceptrons. The layout is illustrated in Figure 5.2. The leftmost layer consists of input terminals, drawn in black in Figure 5.2. To the right follow two layers of McCulloch-Pitts neurons. The rightmost layer consists of output neurons. The intermediate layer is a hidden layer, the states of its neurons are not read out. All
connections are one-way: every neuron feeds forward, only to neurons in the layer immediately to the right. There are no connections within layers, no back connec- tions, no connections that skip a layer. There are N input terminals. As in Part I, we denote the input patterns by
$$\mathbf{x}^{(\mu)} = \begin{pmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \\ \vdots \\ x_N^{(\mu)} \end{pmatrix}. \qquad (5.1)$$
The index μ = 1,...,p labels the different input patterns. The hidden neurons compute
$$V_j = g(b_j) \quad\text{with}\quad b_j = \sum_k w_{jk}\, x_k - \theta_j, \qquad (5.2)$$
with weights $w_{jk}$ and thresholds $\theta_j$. The function g(b) is an activation function, and its argument is called local field (Section 1.2). The output neurons of the network shown in Figure 5.2 perform the computation
$$O_i = g(B_i) \quad\text{with}\quad B_i = \sum_j W_{ij}\, V_j - \Theta_i. \qquad (5.3)$$
The index i = 1, . . . , M labels the output neurons with weights Wi j , and with thresh- olds Θi .
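As an illustration, the computations (5.2) and (5.3) for a single input pattern can be written in a few lines of Python; the layer sizes and the random weights below are assumptions made only for this sketch.

import numpy as np

def forward(x, w, theta, W, Theta, g=np.sign):
    """Forward pass of the network in Figure 5.2, Eqs. (5.2) and (5.3).
    Note that np.sign(0) = 0; this edge case is ignored in the sketch."""
    V = g(w @ x - theta)             # hidden states V_j, Eq. (5.2)
    return g(W @ V - Theta)          # outputs O_i, Eq. (5.3)

# toy dimensions: N = 4 inputs, 3 hidden neurons, M = 2 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=4)
w, theta = rng.normal(size=(3, 4)), np.zeros(3)
W, Theta = rng.normal(size=(2, 3)), np.zeros(2)
print(forward(x, w, theta, W, Theta))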
A classification problem is given by a training set of input patterns x (μ) and the corresponding target vectors
$$\mathbf{t}^{(\mu)} = \begin{pmatrix} t_1^{(\mu)} \\ t_2^{(\mu)} \\ \vdots \\ t_M^{(\mu)} \end{pmatrix}. \qquad (5.4)$$
The idea is to choose all weights and thresholds so that the network produces the
desired output:
$$O_i^{(\mu)} = t_i^{(\mu)} \quad\text{for all } i \text{ and } \mu. \qquad (5.5)$$
In the Hopfield networks described in Part I, the weights were assigned using Hebb’s rule (2.26). Perceptrons, by contrast, are trained by iteratively updating their weights and thresholds until Equation (5.5) is satisfied. This is achieved by repeatedly adding small multiples of Hebb’s rule to the weights (Section 5.2). An alternative approach is to define an energy function, a function of the weights of the network, that has a global minimum when Equation (5.5) is satisfied. The network is trained by taking small steps in weight space that reduce the energy function (gradient descent, Section 5.3).
Figure 5.3: Classification problem with two-dimensional real-valued inputs and targets equal to ±1. The gray solid line is the decision boundary. Legend: ■ corresponds to $t^{(\mu)} = 1$, and □ to $t^{(\mu)} = -1$.
5.1 A classification problem
To illustrate how perceptrons can solve classification problems, consider the simple example shown in Figure 5.3. There are ten patterns, each has two real-valued components:
$$\mathbf{x}^{(\mu)} = \begin{pmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \end{pmatrix}. \qquad (5.6)$$
In Figure 5.3 the patterns are drawn as points in the x1-x2 plane, the input plane. There are two classes of patterns, with targets ±1:
$t^{(\mu)} = 1$ for ■ and $t^{(\mu)} = -1$ for □. (5.7)
A single neuron suffices to classify these patterns, a binary threshold unit with activation function g (b ) = sgn(b ), consistent with the possible target values. Since there is only one neuron, we can arrange the weights into a weight vector
$$\mathbf{w} = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}. \qquad (5.8)$$
The network performs the computation
$$O = \mathrm{sgn}(w_1 x_1 + w_2 x_2 - \theta) = \mathrm{sgn}(\mathbf{w}\cdot\mathbf{x} - \theta). \qquad (5.9)$$
Here $\mathbf{w}\cdot\mathbf{x} = w_1 x_1 + w_2 x_2$ is the scalar product between the vectors $\mathbf{w}$ and $\mathbf{x}$ (Chapter 2).
This very simple example allows us to find a geometrical interpretation of the classification problem. We see in Figure 5.3 that the patterns fall into two clusters:
Figure 5.4: The perceptron classifies the patterns correctly for the weight vector $\mathbf{w}$ shown, orthogonal to the decision boundary (gray solid line). Legend: ■ corresponds to $t^{(\mu)} = 1$, and □ to $t^{(\mu)} = -1$.
to the left and to the right. We can classify the patterns by drawing a line that separates the two clusters, so that everything to the right of the line has t = 1, while the patterns to the left of the line have t = −1. This line is called the decision boundary. To find the geometrical significance of Equation (5.9), let us put the threshold to zero for a moment, so that
$$O = \mathrm{sgn}(\mathbf{w}\cdot\mathbf{x}). \qquad (5.10)$$
Then the classification problem takes the form
$$\mathrm{sgn}\big(\mathbf{w}\cdot\mathbf{x}^{(\mu)}\big) = t^{(\mu)}. \qquad (5.11)$$
To evaluate the scalar product, we write the vectors as
$$\mathbf{w} = |\mathbf{w}|\begin{pmatrix} \cos\alpha \\ \sin\alpha \end{pmatrix} \quad\text{and}\quad \mathbf{x} = |\mathbf{x}|\begin{pmatrix} \cos\beta \\ \sin\beta \end{pmatrix}. \qquad (5.12)$$
Here $|\mathbf{w}| = \sqrt{w_1^2+w_2^2}$ denotes the norm of the vector $\mathbf{w}$, and α and β are the angles of the vectors with the $x_1$-axis. Then $\mathbf{w}\cdot\mathbf{x} = |\mathbf{w}||\mathbf{x}|\cos(\alpha-\beta) = |\mathbf{w}||\mathbf{x}|\cos\varphi$, where φ is the angle between the two vectors. When φ is between −π/2 and π/2, the scalar product is positive, otherwise negative. As a consequence, the network classifies the patterns in Figure 5.3 correctly if the weight vector is orthogonal to the decision boundary, as shown in Figure 5.4.
What is the role of the threshold θ ? Equation (5.9) implies that the decision boundary is parameterised by w · x = θ , or
x2 = −(w1/w2) x1 + θ /w2 . (5.13)
Figure 5.5: Decision boundaries without and with threshold.
Therefore the threshold determines the intersection of the decision boundary with the x2-axis (equal to θ /w2). This is illustrated in Figure 5.5.
The decision boundary – the straight line orthogonal to w – should divide inputs with positive and negative targets. If such a line can be found, then the problem can be solved with a single neuron. We say that the problem is linearly separable. Conversely, if no such line exists, the problem not linearly separable. This can occur onlywhenp>N. Figure5.6showstwoproblems.Theleftoneislinearlyseparable, the right one is not.
Other examples are Boolean functions. A Boolean function takes N binary inputs and has one binary output. The Boolean AND function (two inputs) is illustrated in Figure 5.7. The value table of the function is shown on the left. The graphical representation is shown on the right of the Figure (□ corresponds to t = −1 and ■ to t = +1). Also shown is the decision boundary of a binary threshold unit and its weight vector $\mathbf{w}$. It is important to note that the decision boundary is not unique, neither are the weight vector and threshold value that solve the problem. The norm of the weight vector, in particular, is arbitrary. Figure 5.8 shows that the Boolean
Figure 5.6: Linearly separable and non-separable data in two-dimensional input space.
x1  x2   t
 0   0  −1
 0   1  −1
 1   0  −1
 1   1  +1
Figure 5.7: Boolean AND function: value table (left) and geometrical representation in the input plane (right). Legend: ■ corresponds to $t^{(\mu)} = 1$, and □ to $t^{(\mu)} = -1$.
XOR function is not linearly separable [11]. There are 16 different Boolean functions of two variables. Only two are not linearly separable (Exercise 5.2), XOR and XNOR.
Up to now we discussed only one single neuron. If the classification problem requires several output neurons, each has its own weight vector $\mathbf{w}_i$ and threshold $\theta_i$. We can group the weight vectors into a weight matrix as in Part I, so that the row vectors $\mathbf{w}_i^{\rm T}$ are the rows of the weight matrix.
5.2 Iterative learning algorithm
In the previous Section we determined the weights and threshold for the Boolean AND function by inspection (Figure 5.7). Now we discuss an algorithm that finds the weights iteratively. It is illustrated in Figure 5.9. In panel (a), the pattern $\mathbf{x}^{(8)}$ ($t^{(8)} = 1$) is on the wrong side of the decision boundary. In order to correct this error, one
x1  x2   t
 0   0  −1
 0   1  +1
 1   0  +1
 1   1  −1
Figure 5.8: The Boolean XOR function is not linearly separable. Legend: ■ corresponds to $t^{(\mu)} = 1$, and □ to $t^{(\mu)} = -1$.
Figure 5.9: Illustration of the learning algorithm. In panel (a) the t = 1 pattern $\mathbf{x}^{(8)}$ is on the wrong side of the decision boundary (solid red line). To correct the error the weight must be rotated anti-clockwise [panel (b)]. In panel (c) the t = −1 pattern $\mathbf{x}^{(4)}$ is on the wrong side of the decision boundary. To correct the error the weight must be rotated anti-clockwise [panel (d)].
turns the decision boundary anti-clockwise. To this end, one adds a small multiple
of the pattern vector x (8) to the weight vector
w′=w+δw with δw=ηx(8). (5.14)
The parameter η > 0 is called the learning rate. It must be small, so that the decision boundary is not rotated too far. The result is shown in panel (b). Panel (c) shows another case, where pattern x (4) (t (4) = −1) is on the wrong side of the decision boundary. In order to turn the decision boundary in the right way, anti-clockwise, one subtracts a small multiple of x (4) :
$$\mathbf{w}' = \mathbf{w} + \delta\mathbf{w} \quad\text{with}\quad \delta\mathbf{w} = -\eta\,\mathbf{x}^{(4)}. \qquad (5.15)$$
These two learning rules combine to the learning rule of Rosenblatt [10]:
$$\mathbf{w}' = \mathbf{w} + \delta\mathbf{w}^{(\mu)} \quad\text{with}\quad \delta\mathbf{w}^{(\mu)} = \eta\, t^{(\mu)}\mathbf{x}^{(\mu)}. \qquad (5.16)$$
For more than one neuron, the rule reads
$$w'_{ij} = w_{ij} + \delta w_{ij}^{(\mu)} \quad\text{with}\quad \delta w_{ij}^{(\mu)} = \eta\, t_i^{(\mu)} x_j^{(\mu)}. \qquad (5.17)$$
This rule is reminiscent of Hebb’s rule (2.9), except that here inputs and outputs are associated with distinct units. Therefore we have $t_i^{(\mu)} x_j^{(\mu)}$ instead of $x_i^{(\mu)} x_j^{(\mu)}$. One applies (5.17) iteratively for a sequence of randomly chosen patterns μ, until the problem is solved. This corresponds to adding a little bit of Hebb’s rule in each iteration. To ensure that the algorithm stops when the problem is solved, one can use the learning rule
$$\delta w_{ij}^{(\mu)} = \eta\,\big(t_i^{(\mu)} - O_i^{(\mu)}\big)\, x_j^{(\mu)}. \qquad (5.18)$$
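A minimal Python sketch of this iterative scheme for a single binary threshold unit follows; the threshold kept at zero, the random separable data, and the stopping criterion are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, p, eta = 2, 10, 0.1
x = rng.normal(size=(p, N))                     # random input patterns
w_true = rng.normal(size=N)
t = np.sign(x @ w_true)                         # targets of a separable problem

w = np.zeros(N)                                 # weight vector, threshold theta = 0
for epoch in range(100):
    errors = 0
    for mu in rng.permutation(p):               # randomly chosen patterns
        O = np.sign(w @ x[mu])
        if O != t[mu]:
            w += eta * (t[mu] - O) * x[mu]      # learning rule (5.18)
            errors += 1
    if errors == 0:                             # all patterns classified correctly
        break
print("weights:", w, " epochs:", epoch + 1)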
5.3 Gradient descent for linear units
In this Section, the learning algorithm (5.18) is derived in a different way, by min- imising an energy function using gradient descent. This requires differentiation, therefore we must choose a differentiable activation function. The simplest choice is a linear activation function, g (b ) = b . We set θ = 0, so that the network computes:
$$O_i^{(\mu)} = \sum_k w_{ik}\, x_k^{(\mu)}. \qquad (5.19)$$
A neuron with a linear activation function is called a linear unit. The outputs $O_i^{(\mu)}$ assume continuous values, but not necessarily the targets $t_i^{(\mu)}$. For linear units, the classification problem
$$O_i^{(\mu)} = t_i^{(\mu)} \quad\text{for } i = 1,\ldots,N \text{ and } \mu = 1,\ldots,p \qquad (5.20)$$
has the formal solution
$$w_{ik} = \frac{1}{N}\sum_{\mu\nu} t_i^{(\mu)}\,\big(\mathbb{Q}^{-1}\big)_{\mu\nu}\, x_k^{(\nu)}. \qquad (5.21)$$
This can be verified by inserting Equation (5.21) into (5.19). Here $\mathbb{Q}$ is the overlap matrix with elements
$$Q_{\mu\nu} = \frac{1}{N}\,\mathbf{x}^{(\mu)}\cdot\mathbf{x}^{(\nu)} \qquad (5.22)$$
(Section 3.6). For the solution (5.21) to exist, the matrix $\mathbb{Q}$ must be invertible. As mentioned in Section 3.6, this requires that p ≤ N, because otherwise the input-pattern vectors are linearly dependent, and thus also the columns (and rows) of $\mathbb{Q}$. If the matrix $\mathbb{Q}$ has linearly dependent columns or rows, it cannot be inverted.
Let us assume that the input patterns are linearly independent, so that the solu- tion (5.21) exists. In this case we can find the solution iteratively. To this end one defines the energy function
$$H = \frac{1}{2}\sum_{i\mu}\big(t_i^{(\mu)} - O_i^{(\mu)}\big)^2. \qquad (5.23)$$
This function is non-negative, and it vanishes when all outputs equal the corresponding targets, for all patterns.
The energy function (5.23) is regarded as a function of the weights $w_{ij}$, unlike the energy function in Part I which is a function of the state-variables of the neurons. The goal is now to find weights that minimise H. If the input patterns are linearly independent, H vanishes at the global minimum, corresponding to the desired solution of the problem (Exercise 5.1). Let us use gradient descent to minimise H,
$$w'_{mn} = w_{mn} + \delta w_{mn} \quad\text{with weight increments}\quad \delta w_{mn} = -\eta\,\frac{\partial H}{\partial w_{mn}}, \qquad (5.24)$$
with learning rate η > 0. This is analogous to Equation (4.19), apart from the minus sign. In Section 4.4 the goal was to maximise the target function, here we want to minimise H by taking many downhill steps in search of the global minimum. The derivatives in Equation (5.24) are evaluated with the chain rule, together with Equation (4.22) which takes the form
$$\frac{\partial w_{ij}}{\partial w_{mn}} = \delta_{im}\delta_{jn} \qquad (5.25)$$
for asymmetric weights. This yields the weight increments
$$\delta w_{mn} = \eta\sum_{\mu}\big(t_m^{(\mu)} - O_m^{(\mu)}\big)\, x_n^{(\mu)}. \qquad (5.26)$$
This learning rule is very similar to Equation (5.18). One difference is that Equation (5.26) contains a sum over all patterns. It is important to keep in mind also that the activation functions are different: while Equation (5.18) was derived for g(b) = sgn(b), the learning rule (5.26) was derived for g(b) = b. An advantage of the rule (5.26) is that it is derived from an energy function. This helps to analyse the convergence of the algorithm, as we have seen in Chapter 2.
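For linearly independent patterns, rule (5.26) is easy to try out numerically; the data, learning rate, and number of steps in this Python sketch are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, p, eta = 5, 3, 0.02                          # p <= N: linearly independent patterns
x = rng.normal(size=(p, N))                     # input patterns x^(mu)
t = rng.choice([-1.0, 1.0], size=p)             # targets for a single linear unit

w = np.zeros(N)
for step in range(5000):
    O = x @ w                                   # linear unit, Eq. (5.19)
    w += eta * (t - O) @ x                      # batch gradient step, Eq. (5.26)

print("energy H =", 0.5 * np.sum((t - x @ w) ** 2))   # Eq. (5.23), close to zero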
Linear units [Equation (5.19)] are special. The Boolean AND problem (Figure 5.7) does not admit the solution (5.21), even though the problem is linearly separable. Since the pattern vectors x (μ) are linearly dependent, the solution (5.21) does not exist. Shifting the patterns or introducing a threshold does not change this fact.
In Section 5.5 we discuss how to solve problems that are not linearly separable using a hidden layer of neurons with non-linear activation functions. Note that introducing hidden layers with linear units does not help, because the resulting input-output mapping is still linear if all neurons have linear activation functions, so that only problems with p ≤ N can be solved. This is the main reason for using hidden layers with non-linear activation functions.
There are four points to keep in mind. First, if the patterns are linearly
independent, then we can use gradient descent to determine suitable weights (and
thresholds) of linear units. Second, hidden layers with non-linear units are required,
because a single neuron with a continuous and monotonous activation function can
only solve problems with linearly independent patterns (Exercise 5.11). Third, for
gradient descent for non-linear units we must require that the activation function
g (b ) is differentiable, or at least piecewise differentiable. Fourth, in this case we
calculate the gradients using the chain rule, resulting in factors of derivatives $\frac{\mathrm{d}}{\mathrm{d}b}g(b)$.
This is the origin of the vanishing-gradient problem (Chapter 7).
5.4 Classification capacity
In Chapter 3 we analysed the storage capacity of Hopfield networks. The analo- gous question for the classification problem described in Section 5.1 is: how many patterns can a single neuron with activation function g (b ) = sgn(b ) classify? As in the case of Hopfield networks, one can find a general answer for random binary classification problems.
Consider p points with coordinate vectors x (μ) in N -dimensional input space,
Figure 5.10: Left: 5 points in general position in the plane. Right: these points are not in general position because three points lie on a straight line.
Figure 5.11: Probability (5.29) of separability as a function of α = p/N for three different values of the dimension N of input space. Note the pronounced threshold near α = 2, for large values of N.
and assign random targets:
$$t^{(\mu)} = \begin{cases} +1 & \text{with probability } \tfrac{1}{2},\\ -1 & \text{with probability } \tfrac{1}{2}. \end{cases} \qquad (5.27)$$
This random classification problem is homogeneously linearly separable if we can find an N -dimensional weight vector w , so that w ·x = 0 is a valid decision boundary that goes through the origin:
$$\mathbf{w}\cdot\mathbf{x}^{(\mu)} > 0 \;\text{ if } t^{(\mu)} = 1 \quad\text{and}\quad \mathbf{w}\cdot\mathbf{x}^{(\mu)} < 0 \;\text{ if } t^{(\mu)} = -1. \qquad (5.28)$$
So homogeneously linearly separable problems are binary classification problems that are linearly separable by a hyperplane that contains the origin. Problems with this property can be solved by a binary threshold unit with threshold θ = 0.
Now assume that the points (including the origin) are in general position (Fig- ure 5.10). In this case Cover’s theorem [72] gives an expression for the probability that the random binary classification problem of p patterns in dimension N is
Figure 5.12: The XOR problem can be solved by embedding it into a three-dimensional input space.
homogeneously linearly separable:
$$P(p,N) = \begin{cases} \big(\tfrac{1}{2}\big)^{p-1}\displaystyle\sum_{k=0}^{N-1}\binom{p-1}{k} & \text{for } p > N,\\ 1 & \text{otherwise}. \end{cases} \qquad (5.29)$$
Here $\binom{l}{k} = \frac{l!}{(l-k)!\,k!}$ are the binomial coefficients, for $l \ge k \ge 0$. Equation (5.29) is proven
by recursion, starting from a set of p − 1 points in general position. Assume that the
number C (p − 1, N ) of homogeneously linearly separable classification problems given these points is known. After adding one more point, one can compute the C (p , N ) in terms of C (p − 1, N ), and recursion yields Equation (5.29). Figure 5.11 shows this result as a function of α = p/N for different values of N . For p ≤ N , any random classification problem is homogeneously linearly separable. In this case the pattern vectors are linearly independent, so that the problem can also be solved by a linear unit (Section 5.3). But a neuron with activation function sgn(b ) can classify problems with more than N patterns. In the limit of N → ∞, the function P (αN , N ) approaches a step function θH (2 − α) (Exercise 5.12). In this limit, the maximal classification capacity is therefore αmax = 2.
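Equation (5.29) is straightforward to evaluate numerically. The short Python sketch below (the function name is an assumption) illustrates the threshold near α = 2.

from math import comb

def cover_probability(p, N):
    """Probability (5.29) that p random points in N dimensions are
    homogeneously linearly separable."""
    if p <= N:
        return 1.0
    return 0.5 ** (p - 1) * sum(comb(p - 1, k) for k in range(N))

N = 100
for alpha in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(alpha, cover_probability(int(alpha * N), N))   # drops from ~1 to ~0 near alpha = 2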
What is the expected classification capacity for finite values of N? To answer this question, consider a random sequence of patterns $\mathbf{x}^{(1)}, \mathbf{x}^{(2)},\ldots$ and targets $t_1, t_2,\ldots$ and ask [72]: what is the distribution of the largest integer n so that the problem $\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(n)}$ is separable in dimension N, but $\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(n)},\mathbf{x}^{(n+1)}$ is not? P(n,N) is the probability that n patterns are linearly separable in N-dimensional input space. We can write P(n+1,N) = q(n+1|n) P(n,N) where q(n+1|n) is the conditional probability that n+1 patterns are linearly separable if the n patterns are.
Then the probability that n + 1 patterns are not separable (but n patterns are) reads $[1-q(n+1|n)]\,P(n,N) = P(n,N) - P(n+1,N)$. We can interpret the right-hand side of this Equation as a distribution $p_n$ of the random variable n, the maximal number of separable patterns in dimension N:
$$p_n = P(n,N) - P(n+1,N) = \big(\tfrac{1}{2}\big)^{n}\binom{n-1}{N-1} \quad\text{for } n = 0,1,2,\ldots \qquad (5.30)$$
It follows that the expected maximal number of separable patterns is
$$\langle n\rangle = \sum_{n=0}^{\infty} n\, p_n = 2N. \qquad (5.31)$$
So the expected classification capacity is twice the input dimension:
$$\langle\alpha_{\max}\rangle = 2.$$
This quantifies the notion that it is easier to separate patterns in higher-dimensional input space. As an illustration, consider the XOR problem which is not linearly separable in two-dimensional input space. The problem becomes separable when we embed the points in three-dimensional space, for instance by assigning x3 = 0 to the t = +1 patterns and x3 = 1 to the t = −1 patterns (Figure 5.12).
5.5 Multi-layer perceptrons
In Sections 5.1 and 5.2 we discussed how to solve linearly separable problems [Figure 5.13(a)]. The aim of this Section is to show that non-separable problems like the one in Figure 5.13(b) can be solved by a perceptron with one hidden layer. A network that does the trick for the classification problem in Figure 5.13(b) is depicted in Figure 5.14. As in the previous Section, all neurons have the signum function as activation function, with possible outputs ±1:
$$V_j^{(\mu)} = \mathrm{sgn}\big(b_j^{(\mu)}\big) \quad\text{with}\quad b_j^{(\mu)} = \sum_k w_{jk}\, x_k^{(\mu)} - \theta_j,$$
$$O_1^{(\mu)} = \mathrm{sgn}\big(B_1^{(\mu)}\big) \quad\text{with}\quad B_1^{(\mu)} = \sum_j W_{1j}\, V_j^{(\mu)} - \Theta_1. \qquad (5.32)$$
Each of the three neurons in the hidden layer has its own decision boundary. The idea is to choose the weights $w_{jk}$ and the thresholds $\theta_j$ in such a way that the three decision boundaries partition the input plane into distinct regions, so that each region contains either only t = −1 patterns or t = 1 patterns [3].
Figure 5.13: (a) Linearly separable problem. (b) Problems that are not linearly separable can be solved by a piecewise linear decision boundary. Legend: ■ corresponds to $t^{(\mu)} = 1$, and □ to $t^{(\mu)} = -1$.
How this construction works is shown in Figure 5.15. The left part of the Figure shows the three decision boundaries with their weight vectors, and how they divide the input plane into different regions which contain either only ■ or only □. Each region bears a three-digit code made out of the symbols + and −. The codes are determined by the states of the hidden neurons. A + sign in the j-th entry of the code means that Vj = +1. So the region in question is on the weight-vector side of the decision boundary j. A − sign, by contrast, corresponds to Vj = −1. In this case the region is on the other side of the decision boundary, the one opposite the weight vector. The value table shows the targets associated with each region, together with
Figure 5.14: Hidden-layer perceptron to solve the problem shown in Figure 5.13(b).
V1 V2 V3 | target
 −  −  − | −1
 +  −  − | (does not exist)
 −  +  − | −1
 −  −  + | −1
 +  +  − | +1
 +  −  + | +1
 −  +  + | +1
 +  +  + | +1

Figure 5.15: Left: decision boundaries, regions, and the corresponding binary codes determined by the states of the hidden neurons. Legend: one marker corresponds to t(μ) = 1, the other to t(μ) = −1. Right: encoding of the regions and corresponding targets. The region + − − does not exist.
the code of the region.
The weights W_{1j} and the threshold Θ_1 of the output neuron are chosen so that
it associates the correct target value with each region. A graphical representation of the output problem is shown in Figure 5.16. This problem is linearly separable (Exercise 5.3). The following function computes the correct output for each region:
O_1^{(μ)} = sgn( V_1^{(μ)} + V_2^{(μ)} + V_3^{(μ)} ) .   (5.33)
This solves the binary classification problem described in Figure 5.15, but note that the solution is not unique. There is a range of different weights and thresholds that solve the problem, and there are other solutions based on different network layouts. Nevertheless, the solution illustrates how non-linearly separable classification prob- lems can be solved by adding a hidden layer to the network layout. The neurons in the hidden layer define segments of a piecewise linear decision boundary. More hidden neurons are needed if the decision boundary is very wiggly.
Figure 5.17 shows another example: how to solve the Boolean XOR problem with a perceptron that has two neurons in a hidden layer, with activation functions sgn(b), thresholds 1/2 and 3/2, and all weights equal to unity. The output neuron has weights
+1 and −1 and unit threshold:
O_1 = sgn(V_1 − V_2 − 1) .   (5.34)
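This hand-wired solution can be checked directly. The sketch below assumes 0/1 inputs and ±1 hidden and output neurons, as in Figure 5.17; the helper names are illustrative.

    import numpy as np

    sgn = lambda b: np.where(b > 0, 1, -1)      # signum activation

    w, theta = np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([0.5, 1.5])   # hidden layer
    W, Theta = np.array([1.0, -1.0]), 1.0                                 # output neuron, Eq. (5.34)

    def output(x):
        V = sgn(w @ x - theta)                  # states of the two hidden neurons
        return sgn(W @ V - Theta)               # O1 = sgn(V1 - V2 - 1)

    for x, t in [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]:
        print(x, int(output(np.array(x, dtype=float))), t)   # output matches target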
Minsky and Papert [11] proved in 1969 that all Boolean functions can be represented by multilayer perceptrons, but that at least one hidden neuron must be connected
Figure 5.16: Graphical representation of the output problem for the classification problem shown in Figure 5.15.
V1 V2 | t
 −  − | −1
 +  − | +1
 −  + | (does not exist)
 +  + | −1

Figure 5.17: Boolean XOR function: geometrical representation, network layout, and value table for the output neuron. The region − + does not exist. All neurons assume two possible states, +1 or −1. Legend for the geometrical representation: one marker corresponds to t(μ) = 1, the other to t(μ) = −1.
to all input terminals. This means that not all neurons in the network are locally connected (the neurons have only a few incoming weights). Since fully connected networks are much harder to train than locally connected ones, this was considered a shortcoming at the time. Now, almost 50 years later, the perspective has changed. Convolutional networks (Chapter 8) have only local connections to the inputs and can be trained to recognise objects in images with high accuracy.
In summary, perceptrons are trained on a training set [x(μ), t(μ)], μ = 1, …, p, by moving the decision boundaries into the correct positions. This is achieved by repeatedly applying Hebb's rule to adjust all weights. A related learning algorithm is obtained using gradient descent on the energy function (5.23). Also, we have not discussed how to update the thresholds yet, but it is clear that they too can be
Figure 5.18: (a) Result of training the network on a training set. Legend: one marker corresponds to t(μ) = 1, the other to t(μ) = −1. (b) Classification of a validation set. One pattern is wrongly classified.
updated with gradient-descent learning.
Once all decision boundaries are in the right places we must ask: what happens
if we apply the trained network to a new dataset? Does it classify the new inputs correctly? In other words, can the network generalise? An example is shown in Figure 5.18. Panel (a) shows the result of training the network on a training set. The decision boundary separates t = −1 patterns from t = 1 patterns, so that the network classifies all patterns in the training set correctly. In panel (b) the trained network is applied to patterns in a validation set. We see that most patterns are correctly classified, save for one error. This means that the energy function (5.23) is not exactly zero for the validation set. Nevertheless, the network does quite a good job. Usually it is not a good idea to try to precisely classify all patterns near the decision boundary, because real-world data sets are subject to noise. It is a futile effort to try to learn and predict noise (Section 6.4).
5.6 Summary
Perceptrons are layered feed-forward networks that can learn to classify data in a training set [x(μ), t(μ)]. For each input pattern x(μ), the network finds the correct target vector t(μ). We discussed the learning algorithm for a simple example: real-valued patterns with just two components, and one binary target. This allowed us to represent the classification problem graphically, and to see how linearly separable classification problems can be solved by a simple perceptron. There are three different ways of understanding how the perceptron learns. First, geometrically, the
perceptron learns by moving decision boundaries into the correct locations. Second, this can be achieved by repeatedly adding a little bit of Hebb's rule. Third, these rules are similar to the learning rule derived from gradient descent on the energy function (5.23). Cover's theorem quantifies the capacity of a simple perceptron to separate patterns with binary targets. Finally, we discussed how to solve non-linearly separable classification problems with perceptrons with a hidden layer.
5.7 Further reading
As mentioned in the Introduction, a short account of the history of perceptron research is the review by Kanal [23]. The remarkable book by Minsky and Papert explains the geometry of perceptron learning in great depth, and in a very elegant fashion. For a proof of Cover’s theorem see Ref. [73].
5.8 Exercises
5.1 Boolean AND problem. The Boolean AND problem (Figure 5.7) cannot be solved with a linear unit with weights w and threshold θ. To show this, solve ∂H/∂w = 0 and ∂H/∂θ = 0 for w and θ. Using the resulting weights and thresholds, demonstrate that O(μ) ≠ t(μ) for some μ. Hint: express the linear system to solve in terms of ⟨x x^T⟩, ⟨x⟩, ⟨t x⟩, and ⟨t⟩, where ⟨···⟩ is an average over patterns. Relate your findings to Equation (5.21) by computing the Moore-Penrose inverse of the corresponding matrix using its singular-value decomposition (see Section 2.6 in Ref. [53]).
5.2 Boolean functions. How many Boolean functions with two-dimensional in- puts are there? How many of them are linearly separable? How many Boolean functions with three-dimensional inputs are there? How many of them are linearly separable? Solve the problem graphically, considering its point-group symmetries. See also Exercise 5.13.
5.3 Output problem for binary classification. The binary classification problem shown in Figure 5.15 can be solved with a network with one hidden layer and one output neuron. Figure 5.16 shows the problem that the output neuron has to solve. Show that such output problems are linearly separable if the decision boundaries corresponding to the hidden units partition the input plane into distinct regions that contain either only t = 1 or only t = −1 patterns.
5.4 Piecewise linear decision boundary. Find an alternative solution for the clas- sification problem shown in Figure 5.15, where the weight vectors are chosen as
Figure 5.19: Alternative solution of the classification problem shown in Figure 5.15. Exercise 5.4.
depicted in Figure 5.19.
5.5 Three-dimensional parity function. The three-dimensional parity function is illustrated in Figure 5.20. The input bits x_k^{(μ)} for k = 1, 2, 3 are either +1 or −1. The output O^{(μ)} of the network is +1 if there is an odd number of positive bits in x^{(μ)}, and −1 if the number of positive bits is even. One representation of this function uses a hidden layer with eight neurons. The state V_j^{(μ)} of hidden neuron j is computed as

V_j^{(μ)} = { 1 if −θ_j + Σ_k w_{jk} x_k^{(μ)} > 0 ,
              0 if −θ_j + Σ_k w_{jk} x_k^{(μ)} ≤ 0 .   (5.35)

Weights and thresholds are given by w_{jk} = x_k^{(j)} and θ_j = 2 (j = 1, …, 8). The network output is computed as O^{(μ)} = Σ_j W_j V_j^{(μ)} (linear unit). Determine the weights W_j.
5.6 Linearly inseparable problem. A classification problem is given in Figure 5.21. Inputs x^{(μ)} inside the gray triangle have targets t^{(μ)} = 1, inputs outside the triangle have targets t^{(μ)} = 0. The problem can be solved by a perceptron with one hidden layer with three neurons V_j^{(μ)} = θ_H(−θ_j + Σ_{k=1}^{2} w_{jk} x_k^{(μ)}), for j = 1, 2, 3. The network output is computed as O^{(μ)} = θ_H(−Θ + Σ_{j=1}^{3} W_j V_j^{(μ)}). Here θ_H(b) is the Heaviside function (Figure 2.10). Find weights w_{jk}, W_j and thresholds θ_j, Θ that solve the classification problem.
5.7 Perceptron with one hidden layer. A perceptron has one layer of hidden neu- rons, and a single output neuron. It receives two-dimensional input patterns
Figure 5.20: Three-dimensional parity function, with targets t(μ) = 1 and t(μ) = −1 (marked by different symbols). Exercise 5.5.
Figure 5.21: Classification problem. Exercise 5.6.
Figure 5.22: Left: input plane with decision boundaries of hidden neurons V_j (gray lines). The boundaries partition input space into nine regions labeled by the binary code V1 V2 V3 V4. Right: same, but for a different labeling. Exercise 5.7.
x^{(μ)} = [x_1^{(μ)}, x_2^{(μ)}]^T. They are mapped to four hidden neurons V_i^{(μ)} as

V_j^{(μ)} = { 1 if −θ_j + Σ_k w_{jk} x_k^{(μ)} > 0 ,
              0 if −θ_j + Σ_k w_{jk} x_k^{(μ)} ≤ 0 .   (5.36)

The network output is computed as

O^{(μ)} = { 1 if −Θ + Σ_j W_j V_j^{(μ)} > 0 ,
            0 if −Θ + Σ_j W_j V_j^{(μ)} ≤ 0 ,   (5.37)

with W1 = W3 = W4 = 1, W2 = −1, and Θ = 1/2. Figure 5.22(left) shows how input space is mapped to the hidden neurons. Draw the decision boundary of the network. Give values for w_{ij} and θ_i that yield the pattern in Figure 5.22(left). Show that one cannot map the input space to the space of hidden neurons as in Figure 5.22(right).

5.8 Multilayer perceptron. A classification problem is shown in Figure 5.23. It can be solved by a multilayer perceptron with two inputs, three hidden neurons V_j^{(μ)} = θ_H(Σ_{k=1}^{2} w_{jk} x_k^{(μ)} − θ_j), and one output O^{(μ)} = θ_H(Σ_{j=1}^{3} W_j V_j^{(μ)} − Θ), where θ_H(b) is the Heaviside function (Figure 2.10). A possible solution is illustrated in Fig. 5.23. Compute weights w_{jk} and thresholds θ_j of the hidden neurons that determine the three decision boundaries (gray lines). Draw a representation of the problem in the space with axes V1, V2, and V3. Find output weights W_j and threshold Θ that solve the problem.

5.9 Expected maximal number of separable patterns. Show that the sum in Equation (5.30) evaluates to 2N.
x1   x2    t
0.1  0.95  0
0.2  0.85  0
0.2  0.9   0
0.3  0.75  1
0.4  0.65  1
0.4  0.75  1
0.6  0.45  0
0.8  0.25  0
0.1  0.65  1
0.2  0.75  1
0.7  0.2   1

Figure 5.23: Inputs and targets for a classification problem. The targets are either t = 0 or t = 1. The three decision boundaries (gray lines) illustrate a solution to the problem using a multilayer perceptron. Exercise 5.8.
Figure 5.24: Cover’s theorem for p = 3 and m = 2. Examples for problems that are homogeneously linearly separable, (b) and (c), and for problems that are not homogeneously linearly separable, (a) and (d). Exercise 5.10.
x1 x2 x3 | t
 0  0  0 | 0
 0  0  1 | 1
 0  1  0 | 1
 1  0  0 | 1
 0  1  1 | 0
 1  0  1 | 0
 1  1  0 | 0
 1  1  1 | 1

Figure 5.25: Value table for a three-dimensional Boolean function. Exercise 5.14.
5.10 Cover’s theorem. Prove that P (3, 2) = 34 by complete enumeration of all cases. Some cases (not all) are shown in Figure 5.24.
5.11 Non-linear activation function. Consider a single neuron with continuous, non-linear, and monotonically increasing activation function g(b), and with N input components x_1, …, x_N. Show that this neuron cannot solve binary classification problems [x(μ), t(μ)] (μ = 1, …, p) if p > N.
5.12 Random classification problem. The probability P(p, N) that a random binary classification problem with p patterns in input dimension N is homogeneously linearly separable is given in Equation (5.29). Show that [1]

P(p, N) ∼ (1/2) [ 1 + erf( √(αN/2) (2/α − 1) ) ]   (5.38)

in the limit of N → ∞ at fixed α = p/N.
5.13 Boolean functions with n-dimensional inputs. How many linearly separable Boolean functions with n-dimensional inputs are there? Write a computer program that attempts to solve n-dimensional Boolean functions using the learning rule (5.18) for a McCulloch-Pitts neuron with g(b) = sgn(b). Try the program out on as many four- and five-dimensional Boolean functions as possible, to estimate the counts for n = 4 and n = 5. Hint: for n = 2, 3, 4, 5 there are 14, 104, 1882, and 94572 linearly separable Boolean functions, respectively (sequence A000609 in the online encyclopedia of integer sequences [74]).
5.14 Three-dimensional Boolean function. Figure 5.25 shows the value table for a three-dimensional Boolean function. Demonstrate that the function is not linearly separable by drawing it in three-dimensional input space. Construct a network with hidden layers that represents this function. Hint: one possibility is to wire together several two-dimensional XOR networks.
6 Stochastic gradient descent
In Chapter 5 we discussed how a hidden layer helps to classify problems that are not linearly separable. We explained how the decision boundary in Figure 5.15 is represented in terms of the weights and thresholds of the hidden neurons, and introduced a training algorithm based on gradient descent. In this Section, the training algorithm is discussed in more detail.
Figure 5.2 shows the layout of the network to be trained. There are p input patterns x (μ) with N components each, as before. The output of the network has M components:
O^{(μ)} = [ O_1^{(μ)}, O_2^{(μ)}, …, O_M^{(μ)} ]^T ,   (6.1)

to be matched to the target vector t^{(μ)}. The network shown in Figure 5.2 computes

V_j^{(μ)} = g(b_j^{(μ)})   with   b_j^{(μ)} = Σ_{k=1}^{N} w_{jk} x_k^{(μ)} − θ_j ,   (6.2a)
O_i^{(μ)} = g(B_i^{(μ)})   with   B_i^{(μ)} = Σ_j W_{ij} V_j^{(μ)} − Θ_i .   (6.2b)

Equation (6.2) shows that outputs are computed in terms of nested activation functions g(b). They must be differentiable (or at least piecewise differentiable). Apart from that there is no need to specify them further at this point.

6.1 Chain rule and error backpropagation

The network in Figure 5.2 is trained by gradient-descent learning in the same way as in Section 5.3. The weight increments are given by

δW_{mn} = −η ∂H/∂W_{mn}   and   δw_{mn} = −η ∂H/∂w_{mn} ,   (6.3)

with energy function

H = (1/2) Σ_{μ,i} [ t_i^{(μ)} − O_i^{(μ)} ]² .   (6.4)
The small parameter η > 0 in Equation (6.3) is the learning rate, as in Section 5.3. The derivatives of the energy function are evaluated with the chain rule. For the
weights connecting to the output layer we apply the chain rule once,

∂H/∂W_{mn} = −Σ_{μ,i} ( t_i^{(μ)} − O_i^{(μ)} ) ∂O_i^{(μ)}/∂W_{mn} ,   (6.5a)

and then once more, using Equation (5.25):

∂O_i^{(μ)}/∂W_{mn} = g′(B_i^{(μ)}) δ_{im} V_n^{(μ)} .   (6.5b)

Here g′(B) = dg/dB is the derivative of the activation function with respect to the local field B, and δ_{im} is the Kronecker delta: δ_{im} = 1 if i = m, and zero otherwise.

An important point is that the states V_j of the neurons in the hidden layer do not depend on W_{mn}, because these neurons do not have incoming connections with these weights, a consequence of the feed-forward layout of the network. In summary, we obtain for the increments of the weights connecting to the output layer:

δW_{mn} = −η ∂H/∂W_{mn} = η Σ_{μ=1}^{p} ( t_m^{(μ)} − O_m^{(μ)} ) g′(B_m^{(μ)}) V_n^{(μ)} ≡ η Σ_{μ=1}^{p} Δ_m^{(μ)} V_n^{(μ)} .   (6.6a)

The quantity

Δ_m^{(μ)} = ( t_m^{(μ)} − O_m^{(μ)} ) g′(B_m^{(μ)})   (6.6b)

is a weighted output error: it vanishes when O_m^{(μ)} = t_m^{(μ)}. The weights connecting to the hidden layer are adjusted in a similar fashion, by applying the chain rule four times:
∂H/∂w_{mn} = −Σ_{μ,i} ( t_i^{(μ)} − O_i^{(μ)} ) ∂O_i^{(μ)}/∂w_{mn} ,   (6.7a)
∂O_i^{(μ)}/∂w_{mn} = Σ_l [ ∂O_i^{(μ)}/∂V_l^{(μ)} ] [ ∂V_l^{(μ)}/∂w_{mn} ] ,   (6.7b)
∂O_i^{(μ)}/∂V_l^{(μ)} = g′(B_i^{(μ)}) W_{il} ,   (6.7c)
∂V_l^{(μ)}/∂w_{mn} = g′(b_l^{(μ)}) δ_{lm} x_n^{(μ)} .   (6.7d)

Here we used Equation (5.25). With the definition of the output error Δ_i^{(μ)}, Equation (6.3) yields:

δw_{mn} = η Σ_{μ,i} Δ_i^{(μ)} W_{im} g′(b_m^{(μ)}) x_n^{(μ)} ≡ η Σ_μ δ_m^{(μ)} x_n^{(μ)} .   (6.8)
The last equality defines weighted errors,
δ_m^{(μ)} = Σ_i Δ_i^{(μ)} W_{im} g′(b_m^{(μ)}) ,   (6.9)

associated with the hidden layer. Note that the δ_m^{(μ)} vanish when the output errors Δ_i^{(μ)} are zero. Equation (6.9) shows that the errors are determined recursively. The neuron states are also updated recursively, Equation (6.2), but there is an important difference between Equations (6.9) and (6.2). The feed-forward structure of the layered network implies that the neurons are updated from left to right. Equation (6.9), by contrast, says that the errors are updated from the right to the left, from the output layer to the hidden layer. The term backpropagation refers to this difference: the neurons are updated forward, the errors are updated backwards.

In terms of the errors Δ_m^{(μ)} and δ_m^{(μ)}, the weight increments have the same form for both layers:

δW_{mn} = η Σ_{μ=1}^{p} Δ_m^{(μ)} V_n^{(μ)}   and   δw_{mn} = η Σ_{μ=1}^{p} δ_m^{(μ)} x_n^{(μ)} .   (6.10)

The rule (6.10) is also called δ-rule [1]. The thresholds are adjusted in a similar way:

δΘ_m = −η ∂H/∂Θ_m = η Σ_{μ=1}^{p} ( t_m^{(μ)} − O_m^{(μ)} ) [ −g′(B_m^{(μ)}) ] = −η Σ_{μ=1}^{p} Δ_m^{(μ)} ,   (6.11a)
δθ_m = −η ∂H/∂θ_m = η Σ_{μ=1}^{p} Σ_i Δ_i^{(μ)} W_{im} [ −g′(b_m^{(μ)}) ] = −η Σ_{μ=1}^{p} δ_m^{(μ)} .   (6.11b)

So, the general form for the threshold increments is analogous to Equation (6.10),

δΘ_m = −η Σ_{μ=1}^{p} Δ_m^{(μ)}   and   δθ_m = −η Σ_{μ=1}^{p} δ_m^{(μ)} ,   (6.12)

but without the state variables of the neurons (or the inputs). A way to remember the difference between Equations (6.10) and (6.12) is to note that the formula for the threshold increments looks like the one for the weight increments if one sets the state values of the neurons to −1. This follows from Equation (6.2).

The backpropagation rules (6.10) and (6.12) contain sums over patterns. This corresponds to feeding all patterns at the same time to compute the increments of weights and thresholds (batch training). Alternatively one may choose a single pattern, update the weights by backpropagation, and then continue to iterate these
training steps many times (sequential training). One iteration corresponds to feeding a single pattern, p iterations are called one epoch (in batch training, one iteration corresponds to one epoch). If one chooses the patterns randomly, then sequential training results in stochastic gradient descent:
δW_{mn} = η Δ_m^{(μ)} V_n^{(μ)}   and   δw_{mn} = η δ_m^{(μ)} x_n^{(μ)} ,   (6.13a)
δΘ_m = −η Δ_m^{(μ)}   and   δθ_m = −η δ_m^{(μ)} .   (6.13b)
Since the sum over patterns is absent, the steps do not necessarily decrease the energy function. Their directions fluctuate, but the average weight increment (averaged over all patterns) points downhill. The result is a stochastic path through parameter space, less prone to getting stuck in local minima (but see Section 7.8).
6.2 Stochastic gradient-descent algorithm
The stochastic gradient-descent formulae of the previous Section were derived for a network with one hidden layer. This Section describes the details of the stochastic-gradient algorithm for deep networks with many hidden layers. To this end we need to adapt our notation, as described in Figure 6.1. We label the layers by the index l. The layer of input terminals has label l = 0, while layer l = L denotes the layer of output neurons. The state variables for the neurons in layer l are V_j^{(l)}, the weights connecting into these neurons from the left are w_{jk}^{(l)}, and the errors associated with layer l are denoted by δ_k^{(l)}. In this notation, Equations (6.2) read:

V_j^{(l)} = g( Σ_k w_{jk}^{(l)} V_k^{(l−1)} − θ_j^{(l)} ) .   (6.14)
Repeating the steps outlined in the previous Section, we arrive at the update formulae

δw_{mn}^{(l)} = η δ_m^{(l)} V_n^{(l−1)}   and   δθ_m^{(l)} = −η δ_m^{(l)} ,   (6.15)

with errors

δ_j^{(l−1)} = Σ_i ( t_i − V_i^{(L)} ) [ ∂V_i^{(L)}/∂V_j^{(l−1)} ] g′(b_j^{(l−1)}) ,   (6.16)

where b_j^{(l)} = Σ_k w_{jk}^{(l)} V_k^{(l−1)} − θ_j^{(l)} is the local field of V_j^{(l)}. It involves the matrix-vector product between the weight matrix w^{(l)} and the vector V^{(l−1)}. Evaluating the gradients ∂V_i^{(L)}/∂V_j^{(l−1)} with the chain rule, one obtains the recursion

δ_j^{(l−1)} = Σ_i δ_i^{(l)} w_{ij}^{(l)} g′(b_j^{(l−1)}) ,   (6.17)
Figure 6.1: Illustrates the notation used in Algorithm 4.
Algorithm 4 stochastic gradient descent
  initialise weights w_{mn}^{(l)} to random numbers, thresholds to zero, θ_m^{(l)} = 0;
  for ν = 1, …, ν_max do
    choose a value of μ and apply pattern x^{(μ)} to input layer, V^{(0)} ← x^{(μ)};
    for l = 1, …, L do
      propagate forward: V_j^{(l)} ← g( Σ_k w_{jk}^{(l)} V_k^{(l−1)} − θ_j^{(l)} );
    end for
    compute errors for output layer: δ_i^{(L)} ← g′(b_i^{(L)}) ( t_i − V_i^{(L)} );
    for l = L, …, 2 do
      propagate backward: δ_j^{(l−1)} ← Σ_i δ_i^{(l)} w_{ij}^{(l)} g′(b_j^{(l−1)});
    end for
    for l = 1, …, L do
      change weights and thresholds: w_{mn}^{(l)} ← w_{mn}^{(l)} + η δ_m^{(l)} V_n^{(l−1)} and θ_m^{(l)} ← θ_m^{(l)} − η δ_m^{(l)};
    end for
  end for
with initial condition δ_i^{(L)} = ( t_i − V_i^{(L)} ) g′(b_i^{(L)}). For one hidden layer, Equation (6.17) is equivalent to (6.9). The result of the recursion (6.17) is a vector δ^{(l−1)} with components δ_j^{(l−1)}, obtained by component-wise multiplication of [ w^{(l)T} δ^{(l)} ]_j with g′(b_j^{(l−1)}).
Component-wise multiplication of vectors is sometimes called Schur or Hadamard product [75], denoted by a ⊙ b = [a_1 b_1, …, a_N b_N]^T. It does not have a geometric meaning like the scalar product or the cross product of vectors, and therefore there is little point in using it. Also, note that the vector δ^{(l)} is multiplied by the transpose of the weight matrix, w^{(l)T}, rather than by the weight matrix itself. We return to this point in Section 9.1.
The stochastic-gradient algorithm is summarised in Algorithm 4. One feeds an input x (ν) , updates the weights using (6.15), and iterates these steps until the energy function (5.23) is deemed sufficiently small. Note that the resulting weights and thresholds are not unique. In Figure 5.17 all weights for the Boolean XOR function are equal to ±1. But the training algorithm (6.10) corresponds to repeatedly adding weight increments. This may cause the weights to grow.
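For concreteness, here is a minimal numpy sketch of the recursion of Algorithm 4, assuming tanh activations (6.19b) and one particular choice of Gaussian weight initialisation; the function and variable names (train_sgd, layers, and so on) are illustrative, not part of the text.

    import numpy as np

    def g(b):  return np.tanh(b)             # activation function, Eq. (6.19b)
    def gp(b): return 1.0 - np.tanh(b)**2    # its derivative, Eq. (6.20)

    def train_sgd(x, t, layers, eta=0.05, nu_max=10_000, rng=np.random.default_rng(1)):
        """Stochastic gradient descent for a fully connected network (Algorithm 4).
        x: p-by-N inputs, t: p-by-M targets, layers: e.g. [N, 10, 10, M]."""
        W  = [rng.normal(0, 1/np.sqrt(n_in), size=(n_out, n_in))
              for n_in, n_out in zip(layers[:-1], layers[1:])]   # weights w^(l)
        TH = [np.zeros(n_out) for n_out in layers[1:]]           # thresholds θ^(l)
        p = x.shape[0]
        for _ in range(nu_max):
            mu = rng.integers(p)                   # pick a pattern at random
            V, B = [x[mu]], []
            for w, th in zip(W, TH):               # propagate forward, Eq. (6.14)
                B.append(w @ V[-1] - th)
                V.append(g(B[-1]))
            deltas = [gp(B[-1]) * (t[mu] - V[-1])]       # output errors δ^(L)
            for l in reversed(range(1, len(W))):         # propagate backward, Eq. (6.17)
                deltas.insert(0, (W[l].T @ deltas[0]) * gp(B[l-1]))
            for l, d in enumerate(deltas):               # update, Eq. (6.15)
                W[l]  += eta * np.outer(d, V[l])
                TH[l] -= eta * d
        return W, TH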
In practice, the stochastic gradient-descent dynamics may be too noisy. In this case it is better to average over a small number of randomly chosen patterns. Such a set is called mini batch, of size m_B say. In stochastic gradient descent with mini batches one replaces Equations (6.10) and (6.12) by

δW_{mn} = η Σ_{μ=1}^{m_B} Δ_m^{(μ)} V_n^{(μ)}   and   δΘ_m = −η Σ_{μ=1}^{m_B} Δ_m^{(μ)} ,   (6.18)
δw_{mn} = η Σ_{μ=1}^{m_B} δ_m^{(μ)} x_n^{(μ)}   and   δθ_m = −η Σ_{μ=1}^{m_B} δ_m^{(μ)} .

Sometimes the mini-batch rule is quoted with prefactors of m_B^{−1} before the sums. The factors m_B^{−1} can just be absorbed in the learning rate, but when comparing learning rates for different implementations one needs to check whether or not there are factors of m_B^{−1} in front of the sums in Equation (6.18).

How does one select which inputs to include in a mini batch? This is discussed below, in Section 6.3: at the beginning of each epoch, one randomly shuffles the sequence of the input patterns in the training set. Then the first mini batch contains patterns μ = 1, …, m_B, and so forth.
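A minimal sketch of this shuffling scheme; the data arrays below are placeholders standing in for a real training set.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = rng.normal(size=(1000, 4))            # placeholder inputs
    t_train = rng.integers(0, 2, size=(1000, 1))    # placeholder targets
    p, m_B, n_epochs = x_train.shape[0], 32, 5      # m_B: mini-batch size

    for epoch in range(n_epochs):
        order = rng.permutation(p)                  # reshuffle at the start of each epoch
        for start in range(0, p, m_B):
            batch = order[start:start + m_B]        # indices of one mini batch
            x_b, t_b = x_train[batch], t_train[batch]
            # accumulate the increments of Eq. (6.18) over x_b, t_b, then update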
Common choices for the activation functions g(b) are the sigmoid function or tanh:

g(b) = 1/(1 + e^{−b}) ≡ σ(b) ,   (6.19a)
g(b) = tanh(b) .   (6.19b)
Figure 6.2: Saturation of the activation functions (6.19). The derivatives g′(b) ≡ dg(b)/db of both activation functions tend to zero for large values of |b|.
In both cases, the derivatives can be expressed in terms of the function itself:

dσ(b)/db = σ(b) [1 − σ(b)] ,   d tanh(b)/db = 1 − tanh²(b) .   (6.20)

The second equality was used in Section 3.4. The following short-hand notation for the derivative of the activation function g(b) is common: g′(b) ≡ dg(b)/db.
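The relations (6.20) are easy to check numerically; the short sketch below also illustrates the saturation of g′(b) at large |b| shown in Figure 6.2.

    import numpy as np

    def sigma(b):                      # sigmoid, Eq. (6.19a)
        return 1.0 / (1.0 + np.exp(-b))

    b = np.linspace(-10, 10, 7)
    dsigma = sigma(b) * (1.0 - sigma(b))     # Eq. (6.20), first relation
    dtanh  = 1.0 - np.tanh(b)**2             # Eq. (6.20), second relation
    print(dsigma)                            # tends to zero for large |b| (saturation)
    print(dtanh)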
As illustrated in Figure 6.2, the activation functions (6.19) saturate at large values of |b |: their derivatives g ′(b ) tend to zero. Since the backpropagation rule (6.16) contains factors of g ′(b ), this implies that the algorithm slows down if |b | becomes too large. For this reason, the initial weights and thresholds should be chosen so that the local fields b are not too large in magnitude, to avoid that g ′(b ) becomes too small. A standard procedure is to take all weights to be initially randomly dis- tributed, for example Gaussian with zero mean, and with a suitable variance. The performance of networks with many hidden layers (deep networks) can be sensitive to the initialisation of the weights (Section 7.2).
It is sometimes argued that the initial values of the thresholds are not so critical. The idea is that they are learned more rapidly than the weights, at least initially, and a common choice is to initialise the thresholds to zero. Section 7.2 summarises a mean-field argument that comes to a different conclusion.
6.3 Preprocessing the input data
It can be useful to preprocess the input data, although any preprocessing may remove information from the data. Nevertheless, it is usually advisable to shift the data so the mean of each component over all p patterns vanishes:
⟨x_k⟩ = (1/p) Σ_{μ=1}^{p} x_k^{(μ)} = 0 .   (6.21)
Figure 6.3: Shift and scale the input data to achieve zero mean and unit variance.
There are several reasons for this. First, large mean values can cause steep gradients in the energy function (Exercise 6.9) that are difficult to navigate with gradient descent. Different input-data variances in different directions have a similar effect. Therefore one scales the inputs so that the input-data distribution has the same variance in all directions (Figure 6.3), equal to unity for instance:
σ_k² = (1/p) Σ_{μ=1}^{p} [ x_k^{(μ)} − ⟨x_k⟩ ]² = 1 .   (6.22)
Second, to avoid that the neurons connected to the inputs saturate, their local fields must not be too large (Section 6.2). If one initialises the weights to Gaussian random numbers with mean zero and unit variance, large activations are quite likely if the distribution of input patterns has a large mean or a large variance. Third, enforcing zero input mean by shifting the input data avoids that the weights of the neurons in the first hidden layer must decrease or increase together [76]. Equation (6.18) shows that the components of δw_m ∝ δ_m x into hidden neuron m are likely to have the same signs if the input data has a large mean. This makes it difficult for the network to learn to differentiate. In summary, it is advisable to shift and scale the input-data distribution so that it has mean zero and unit variance, as illustrated in Figure 6.3. The same transformation (using the mean values and scaling factors determined for the training set) should be applied to any new data set that the network is supposed to classify after it has been trained on the training set.
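A minimal sketch of this standardisation step; the function name and the synthetic data are illustrative, not part of the text.

    import numpy as np

    def standardise(x_train, x_new):
        """Shift and scale each input component to zero mean and unit variance,
        Eqs. (6.21) and (6.22); the same transformation is applied to new data."""
        mean = x_train.mean(axis=0)
        std  = x_train.std(axis=0)
        return (x_train - mean) / std, (x_new - mean) / std

    rng = np.random.default_rng(2)
    x_train = rng.normal(loc=5.0, scale=3.0, size=(200, 2))   # placeholder data
    x_val   = rng.normal(loc=5.0, scale=3.0, size=(50, 2))
    x_train_s, x_val_s = standardise(x_train, x_val)
    print(x_train_s.mean(axis=0), x_train_s.std(axis=0))       # ≈ 0 and ≈ 1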
Figure 6.4 shows a distribution of inputs that falls into two distinct clusters. The difference between the clusters is sometimes called covariate shift, here covariate is just another term for input. Imagine feeding just inputs from one of the clusters
to the network. It will learn local properties of the decision boundary, instead of its global features. Such global properties are efficiently learned if the network is more frequently confronted with unfamiliar data. For sequential training (stochastic gradient descent) this is not a problem, because the sequence of input patterns presented to the network is random. However, if one trains with mini batches, the mini batches should contain randomly chosen patterns in order to avoid covariate shifts. To this end one randomly shuffles the sequence of the input patterns in the training set, at the beginning of each epoch.
It is also recommended [76] to observe the output errors during training. If the errors are similar for a number of subsequent learning steps, the corresponding inputs appear familiar to the network. Larger errors correspond to unfamiliar inputs, and Ref. [76] suggests to feed such inputs more often.
When the input data is very high dimensional, many input terminals are needed. This usually means that one should use many neurons in the hidden layers. This can be problematic because it increases the risk of overfitting the input data. To avoid this as far as possible, one can reduce the dimensionality of the input data by principal-component analysis. This method makes it possible to project high-dimensional data onto a lower-dimensional subspace (Figure 6.5).
The data shown on the left of Figure 6.5 falls approximately onto a straight line, the principal direction u1. We see that the coordinate orthogonal to the principal direction is not useful in classifying the data. Consequently this coordinate can be disregarded, reducing the dimensionality of the data set. The idea of principal component analysis is to rotate the basis in input space so that the variance of the data along the first axis of the new coordinate system, u1, is maximal. One keeps the input components corresponding to u 1 , discarding those corresponding to u 2 (Figure 6.5).
To determine the maximal-variance direction, consider the data variance along
Figure 6.4: When the input data falls into clusters as shown in this Figure, one should randomly pick data from either cluster. The decision boundary is shown as a solid gray line. It has different slopes for the two clusters.
Figure 6.5: Principal-component analysis (schematic). The data set on the left can be classified keeping only the principal component u_1 of the data. This is not true for the data set on the right.
a direction v:

σ_v² = ⟨(x·v)²⟩ − ⟨x·v⟩² = v · ℂ v .   (6.23)

Here

ℂ = ⟨δx δx^T⟩   with   δx = x − ⟨x⟩   (6.24)

is the data covariance matrix. The variance σ_v² is maximal when v points in the direction of the leading eigenvector of the covariance matrix ℂ. This can be seen as follows. The covariance matrix is symmetric, therefore its eigenvectors u_1, …, u_N form an orthonormal basis of input space. This allows us to express the matrix as

ℂ = Σ_{α=1}^{N} λ_α u_α u_α^T .   (6.25)

The eigenvalues λ_α are non-negative. This follows from Equation (6.24) and the eigenvalue equation ℂ u_α = λ_α u_α. We arrange the eigenvalues by magnitude, λ_1 ≥ λ_2 ≥ … ≥ λ_N ≥ 0. Using Equation (6.25) we can write for the variance

σ_v² = Σ_{α=1}^{N} λ_α v_α²   (6.26)

with v_α = v · u_α. We want to show that σ_v² is maximal for v = ±u_1 subject to the constraint

Σ_{α=1}^{N} v_α² = 1 .   (6.27)

To ensure that this constraint is satisfied as the v_α are varied, one introduces a Lagrange multiplier λ (Exercises 6.10 and 6.11). The constraint (6.27) is multiplied with λ and added to the target function (6.26). The function to maximise reads

ℒ = Σ_α λ_α v_α² − λ ( 1 − Σ_α v_α² ) .   (6.28)
To find the maximum of ℒ, we determine its singular points, defined by ∂ℒ/∂v_β = 0. This yields v_β (λ_β + λ) = 0. The maximum of ℒ is obtained for λ = −λ_1, where λ_1 is the maximal eigenvalue of ℂ with eigenvector u_1. So all components v_β must vanish, except v_1 which must equal unity. This shows that the variance σ_v² is maximised by the principal direction.
In more than two dimensions there is commonly more than one direction along which the data varies significantly. These k principal directions correspond to the k eigenvectors of ℂ with the largest eigenvalues. This can be shown recursively. One projects the data to the subspace orthogonal to u_1 by applying the projection matrix ℙ_1 = 𝟙 − u_1 u_1^T. Then one repeats the procedure outlined above, and finds that the data varies maximally along u_2. Upon iteration, one obtains the k principal directions u_1, …, u_k. Often there is a gap between the k largest eigenvalues and the small ones (all close to zero). Then one can safely project the data onto the subspace spanned by the k principal directions. If there is no gap then it is less clear what to do.
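A compact sketch of principal-component analysis along these lines, using the eigendecomposition of the covariance matrix (6.24); the function name and the synthetic data are illustrative.

    import numpy as np

    def principal_directions(x, k):
        """Return the k leading eigenvectors of the data covariance matrix (6.24)
        and the data projected onto them."""
        dx = x - x.mean(axis=0)
        C = dx.T @ dx / x.shape[0]            # covariance matrix
        lam, u = np.linalg.eigh(C)            # eigenvalues ascending, eigenvectors in columns
        order = np.argsort(lam)[::-1]         # sort so that λ1 ≥ λ2 ≥ …
        u_k = u[:, order[:k]]                 # principal directions u_1, …, u_k
        return u_k, dx @ u_k

    rng = np.random.default_rng(3)
    s = rng.normal(size=500)
    x = np.column_stack((s, 0.5 * s + 0.05 * rng.normal(size=500)))  # data close to a line
    u1, proj = principal_directions(x, 1)
    print(u1.ravel())                         # points along the direction of maximal variance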
The data set shown on the right of Figure 6.5 illustrates another problem. This data set is much harder to classify if we use only the principal component alone. In this case we lose important information by projecting the data on its principal component.
6.4 Overfitting and cross validation
The goal of supervised learning is to generalise from a training set to new data. Only general properties of the training set are of interest, not specific ones that are particular to the training set in question. A neural network with more neurons may classify the input data better, because it more accurately represents all specific features of the given data set. But a different set of patterns from the same input distribution could look quite different in detail, so that the decision boundary does not classify the new data very well (Figure 6.6). In other words, the network fits too fine details that have no general meaning, for instance noise in the training set. This problem, illustrated in Figure 6.6, is also referred to as overfitting.
As a consequence, we must look for a compromise: between accurate classifica- tion of the training set and the ability of the network to generalise. The tendency to overfit is larger for networks with more neurons. One way of avoiding overfitting is to use cross validation and early stopping. One splits the data into two sets: a training set and a validation set. The idea is that these sets share the general features to be learnt. But although training and validation sets are drawn from the same distribution, they may differ in details that are not of interest.
While the network is trained on the training set, one monitors not only the energy
Figure 6.6: Overfitting. Left: accurate representation of the decision boundary in the training set, for a network with a single hidden layer with 15 neurons. Right: this new data set differs from the first one just by a little bit of noise. The points in the vicinity of the decision boundary are not correctly classified. Legend: one marker corresponds to t(μ) = 1, the other to t(μ) = −1.
function for the training set, but also the energy function evaluated using the vali- dation data. As long as the network learns general features of the input distribution, both training and validation energies decrease. But when the network starts to learn specific features of the training set, then the validation energy saturates, or may start to increase. At this point the training is stopped. The scheme is illustrated in Figure 6.7.
Often the possible state values of the output neurons are continuous while the targets assume only discrete values. In this case one may also monitor the classification error of the validation set. The definition of the classification error depends on the type of the classification problem. For one single output neuron with targets t = 0/1, the classification error is defined as

C = (1/p) Σ_{μ=1}^{p} | t^{(μ)} − θ_H( O^{(μ)} − 1/2 ) | .   (6.29a)

If, by contrast, the targets take the values t = ±1, then the classification error reads:

C = (1/(2p)) Σ_{μ=1}^{p} | t^{(μ)} − sgn( O^{(μ)} ) | .   (6.29b)

As a third example, consider a classification problem where inputs must be classified into M mutually exclusive classes, such as the MNIST data set of hand-written digits (Section 8.3) where M = 10. Another example is given in Table 6.1, with M = 3
Figure 6.7: Progress of training and validation errors. The plot is schematic, and the data is smoothed. Based on simulations performed by Oleksandr Balabanov. Shown is the natural logarithm of the energy functions for the training set (solid line) and the validation set (dashed line) as a function of the number of training iterations. The training is stopped when the validation energy begins to increase.
classes. In both cases, one of the targets equals unity while all others equal zero. As a consequence, the targets sum to unity: Σ_{i=1}^{M} t_i^{(μ)} = 1. Now assume that the network has sigmoid outputs, O_i^{(μ)} = σ(b_i^{(μ)}). To classify input x^{(μ)} from the network outputs O_i^{(μ)} we define

y_i^{(μ)} = { 1 if O_i^{(μ)} is the largest of all outputs i = 1, …, M ,
             0 otherwise.   (6.30a)

Then the classification error can be computed as

C = (1/(2p)) Σ_{μ=1}^{p} Σ_{i=1}^{M} | t_i^{(μ)} − y_i^{(μ)} | .   (6.30b)
In all cases, the classification accuracy is defined as (1 − C)·100%; it is usually quoted in percent.
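A short sketch of the classification error (6.30) for mutually exclusive classes, checked against the outputs of Table 6.1 (entered here as a small array); names are illustrative.

    import numpy as np

    def classification_error(O, t):
        """Classification error for M mutually exclusive classes, Eqs. (6.30a,b).
        O and t are p-by-M arrays of outputs and one-hot targets."""
        y = np.zeros_like(O)
        y[np.arange(O.shape[0]), O.argmax(axis=1)] = 1.0   # winner-takes-all, Eq. (6.30a)
        return np.abs(t - y).sum() / (2 * O.shape[0])       # Eq. (6.30b)

    O = np.array([[0.4, 0.5, 0.4], [0.4, 0.3, 0.5], [0.6, 0.5, 0.4]])   # outputs of Table 6.1
    t = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
    C = classification_error(O, t)
    print(C, "accuracy:", (1 - C) * 100, "%")    # all three inputs classified correctly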
The classification error determines the fraction of inputs that are classified wrongly. However, it contains less information than the energy function, which is in fact a mean-squared error of the outputs. This is illustrated in Table 6.1. All three inputs are classified correctly, but there is a substantial mean-squared error. This indicates that the classification is not very reliable.
6.5 Adaptation of the learning rate
It is tempting to choose larger learning rates, because they enable the network to escape more efficiently from shallow minima. But this can lead to problems when
μ | output O^{(μ)}    | target t^{(μ)} | classification | correct?
1 | [0.4, 0.5, 0.4]   | [0, 1, 0]      | versicolor     | yes
2 | [0.4, 0.3, 0.5]   | [0, 0, 1]      | setosa         | yes
3 | [0.6, 0.5, 0.4]   | [1, 0, 0]      | virginica      | yes
Table 6.1: Illustrates the difference between energy function and classification error. The table shows network outputs for three different inputs from the iris data set, as well as the correct classifications. All inputs are classified correctly, but the difference between outputs and targets is substantial.
the energy function varies rapidly, causing the training to fail. To avoid this, one uses an adaptive learning rule, such as:
δw_{mn}^{(t)} = −η ∂H/∂w_{mn} |_{{w_{ij}} = {w_{ij}^{(t)}}} + α δw_{mn}^{(t−1)} .   (6.31)

Here t = 1, 2, …, T labels the iteration number. We see that the increment at step t depends not only on the instantaneous gradient, but also on the weight change δw_{mn}^{(t−1)} of the previous iteration. We say that the dynamics becomes inertial, the weights gain momentum. The parameter α ≥ 0 is called momentum constant. It determines how strong the inertial effect is. We see that α = 0 corresponds to the usual backpropagation rule. When α is positive, then how does inertia change the learning process? Iterating Equation (6.31) yields

δw_{mn}^{(T)} = −η Σ_{t=0}^{T} α^{T−t} ∂H/∂w_{mn}^{(t)} .   (6.32)

Here and in the following we use the short-hand notation ∂H/∂w_{mn}^{(t)} ≡ ∂H/∂w_{mn} |_{{w_{ij}} = {w_{ij}^{(t)}}}.

Equation (6.32) shows that δw_{mn}^{(T)} is a weighted average of the gradients encountered during training. Now assume that the training is stuck in a shallow minimum. Then the gradient ∂H/∂w_{mn}^{(t)} remains roughly constant through many time steps. To illustrate what happens, let us assume that ∂H/∂w_{mn}^{(t)} = ∂H/∂w_{mn}^{(0)} for t = 1, …, T. In this case we can write

δw_{mn}^{(T)} ≈ −η ∂H/∂w_{mn}^{(0)} Σ_{t=0}^{T} α^{T−t} = −η (α^{T+1} − 1)/(α − 1) ∂H/∂w_{mn}^{(0)} .   (6.33)
Figure 6.8: (a) Momentum method (6.31). The gray arrow represents the increment −η (∂H/∂w_{mn}) |_{{w_{ij}^{(t)}}}. (b) Nesterov's accelerated gradient method (6.35). The gray arrow represents −η (∂H/∂w_{mn}) |_{{w_{ij}^{(t)} + α_{t−1} δw_{ij}^{(t−1)}}}. The location of w^{(t+1)} (gray point) is closer to the minimum (black point) than in panel (a).
In this situation, convergence is accelerated when α is close to unity. We also see that it is necessary that α < 1 for the sum in Equation (6.33) to converge.
The other limit to consider is that the gradient changes rapidly from iteration to iteration. How is the learning rule modified in this case? As an example, let us assume that the gradient remains of the same magnitude, but that its sign oscillates, ∂H/∂w_{mn}^{(t)} = (−1)^t ∂H/∂w_{mn}^{(0)} for t = 1, …, T. Inserting this into Equation (6.32), we obtain:

δw_{mn}^{(T)} ≈ −η ∂H/∂w_{mn}^{(0)} Σ_{t=0}^{T} (−1)^t α^{T−t} = −η [ α^{T+1} + (−1)^T ] / (α + 1) ∂H/∂w_{mn}^{(0)} .   (6.34)

Here the increments are much smaller compared with those in Equation (6.33). This shows that introducing inertia can substantially accelerate convergence without sacrificing accuracy. The disadvantage is, of course, that there is yet another parameter to choose, namely the momentum constant α.

Nesterov's accelerated gradient method [77] is another way of implementing momentum. The algorithm was developed for smooth optimisation problems, but it has been suggested to use the method when training deep neural networks with gradient descent [78]:

δw_{mn}^{(t)} = −η ∂H/∂w_{mn} |_{{w_{ij}^{(t)} + α_{t−1} δw_{ij}^{(t−1)}}} + α_{t−1} δw_{mn}^{(t−1)} .   (6.35)

A suitable sequence of coefficients α_t is defined by recursion [78]. The coefficients α_t approach unity from below as t increases.
Nesterov’s accelerated-gradient method is more accurate than the simple mo- mentum method, because the accelerated-gradient method evaluates the gradient at an extrapolated point, not at the initial point. Figure 6.8 illustrates a situation where Nesterov’s method converges more rapidly. Nesterov’s method is not much more difficult to implement than Equation (6.31), and it is not much more expensive in terms of computational cost.
There are other ways of adapting the learning rate during training, described in Section 4.10 in Haykin’s book [2]. Finally, the learning rate need not be the same for all neurons. If the weights of neurons in different layers change at very different speeds (Section 7.2), it may be advantageous to define a layer-dependent learning rate ηl that is larger for neurons with smaller gradients.
6.6 Summary
Backpropagation is an efficient algorithm for stochastic gradient-descent on the energy function (6.4) in weight space, because it refers only to quantities that are local to the weight to be updated. Networks with many hidden neurons have many free parameters (their weights and thresholds). This increases the risk of overfitting, which reduces the power of the network to generalise. Deep networks with many hidden layers are particularly prone to overfitting (Chapter 7). The tendency of networks to overfit can be reduced by cross validation (Section 6.4).
6.7 Further reading
The backpropagation algorithm is explained in Section 6.1 of Hertz, Krogh and Palmer [1], and in Chapter 4 of Haykin’s book [2]. The paper [76] by LeCun et al. predates deep learning, but it is still a very nice collection of recipes for making backpropagation more efficient.
One of the first papers on error backpropagation is the one by Rumelhart et al. [12] from 1986. The authors provide an elegant explanation and summary of the backpropagation algorithm. They also describe results of different numerical experiments, and one of them introduces convolutional networks (Chapter 8) to learn to tell the difference between the letters T and C (Figure 6.9).
6.8 Exercises
6.1 Covariance matrix. Show that the eigenvalues of the data covariance matrix defined in Equation (6.24) are real and non-negative.
Figure 6.9: Patterns detected by the convolutional network of Ref. [12]. After Fig. 13 in Ref. [12].
Figure 6.10: The principal direction of this data set is u1.
6.2 Principal-component analysis. Compute the data covariance matrix for the example shown in Figure 6.10 and determine the principal direction. Determine the principal direction for the data shown in Figure 6.11.
6.3 Nesterov’s accelerated-gradient method. The version (6.35) of Nesterov’s al- gorithm is slightly different from the original formulation [77]. This point is dis- cussed in Ref. [78]. Show that both versions are equivalent.
6.4 Momentum. Section 6.5 describes how to speed up gradient descent by introducing momentum. To explain how this works, consider the one-dimensional energy function shown in Figure 6.12. Iterate Equation (6.31) for α = 1/2, and determine how many iteration steps it takes to get from w_A to w_B. Compare with the corresponding result for α = 0. Then consider what happens after w_B. How many steps does it take to reach w_C?
6.5 Backpropagation. Derive stochastic gradient-descent learning rules for the weights of a network with the layout shown in Figure 5.2. Assume that all activation functions are of sigmoid form, σ(b ) = 1/(1 + e−b ), hidden thresholds are denoted by θj , and those of the output neurons by Θi . The energy function is (Section 7.5)
H = −Σ_{i,μ} [ t_i^{(μ)} log O_i^{(μ)} + (1 − t_i^{(μ)}) log(1 − O_i^{(μ)}) ], where log is the natural logarithm, and t_i^{(μ)} = 0/1 are the targets.
6.6
Stochastic gradient descent. To train a multi-layer perceptron using stochas- tic gradient descent one needs update formulae for the weights and thresholds in
the network. Derive these update formulae for sequential training using backprop-
agation for the network shown in Fig. 6.13. The weights for the first and second
hidden layer, and for the output layer are denoted by w_{jk}^{(1)}, w_{mj}^{(2)}, and W_{im}. The corresponding thresholds are denoted by θ_j^{(1)}, θ_m^{(2)}, and Θ_i, and the activation function by g(···). The target value for input pattern x^{(μ)} is t_i^{(μ)}, and the pattern index μ ranges from 1 to p. The energy function is H = (1/2) Σ_{i=1}^{M} Σ_{μ=1}^{p} ( t_i^{(μ)} − O_i^{(μ)} )².
6.7 Multi-layer perceptron. A perceptron has hidden layers l = 1, …, L − 1 and output layer l = L. Neuron j in layer l computes V_j^{(l)} = g(b_j^{(l)}) with b_j^{(l)} = −θ_j^{(l)} + Σ_k w_{jk}^{(l)} V_k^{(l−1)}, where w_{jk}^{(l)} are weights, θ_j^{(l)} are thresholds, g(b) is the activation function, V_i^{(L)} = O_i = g(b_i^{(L)}), and V_k^{(0)} = x_k. Draw this network. Indicate where the elements x_k, O_i belong, as well as b_j^{(l)}, V_j^{(l)}, w_{jk}^{(l)} and θ_j^{(l)} for l = 0, …, L. Determine how the derivatives ∂V_i^{(l)}/∂w_{mn}^{(l′)} depend upon the derivatives ∂V_j^{(l−1)}/∂w_{mn}^{(l′)} for l′ < l.
To show how this construction works, consider the Boolean XOR function as an example. To confirm that only the corresponding winning neuron gives a positive signal, consider pattern x^{(1)} = [−1, −1]^T for example. It activates the first neuron in the hidden layer (j = 0). To see this, compute the local fields of the hidden neurons:

b_0^{(1)} = 2δ − 2(δ − 1) = 2 ,
b_1^{(1)} = −2(δ − 1) = 2 − 2δ ,
b_2^{(1)} = −2(δ − 1) = 2 − 2δ ,
b_3^{(1)} = −2δ − 2(δ − 1) = 2 − 4δ .   (7.7)
If we choose δ > 1 then the output of the first hidden neuron gives a positive output (V0 > 0), the other neurons produce negative outputs, Vj < 0 for j = 1,2,3. In
Figure 7.6: Shows how the XOR network depicted in Figure 7.5 partitions the input plane. Target values are encoded as in Figure 5.8: one marker corresponds to t = −1, the other to t = +1.
conclusion, hidden neuron j = 0 is the winning neuron for this pattern. Now consider x^{(3)} = [−1, +1]^T. In this case

b_0^{(3)} = −2(δ − 1) = 2 − 2δ ,
b_1^{(3)} = −2δ − 2(δ − 1) = 2 − 4δ ,
b_2^{(3)} = 2δ − 2(δ − 1) = 2 ,
b_3^{(3)} = −2(δ − 1) = 2 − 2δ .   (7.8)
Now the third hidden neuron gives a positive output, while the others yield negative values. It works in the same way for the other two patterns, x^{(2)} and x^{(4)}. In summary, there is a unique winning neuron for each pattern.¹ Figure 7.6 shows how the four decision boundaries corresponding to V_j partition the input plane.
According to the scheme outlined above, the output neuron computes
O1 =sgn(−V1 +V2 +V3 −V4) (7.9)
with Θ = Σ_j W_{1j} = 0. For x^{(1)} and x^{(4)} we find the correct result O_1 = −1. The same is true for x^{(2)} and x^{(3)}; in this case we obtain O_1 = 1. In summary, this example
illustrates how an N-dimensional Boolean function is represented by a network with one hidden layer with 2^N neurons. The problem is of course that this network is expensive to train for large values of N because the number of hidden neurons is very large.
There are more efficient layouts if one uses more than one hidden layer. As an example, consider the parity function for N -dimensional binary inputs with bits equal to 0 or 1. It measures the parity of the sequence of input bits. The function
1That pattern μ = k gives the winning neuron j = k −1 is of no importance, it is just a consequence of how the patterns are ordered in the value table in Figure 7.5
Figure 7.7: Solution of the parity problem for N-dimensional inputs. The network is built from XOR units (Figure 5.17, here with 0/1 neurons). Only the states of the inputs and outputs of the XOR units are shown, not those of the hidden neurons. In total, the whole network has only O(N) neurons. After Fig. 2 in Ref. [81].
evaluates to unity if there is an odd number of ones in the input; otherwise, it evaluates to zero. A construction similar to the above yields a network layout with 2^N neurons in the hidden layer. If one instead wires together the XOR networks, one can solve the parity problem with O(N) neurons [81] (Figure 7.7). When N is a power of two, this network has 3(N − 1) neurons. To see this, set the number of inputs to N = 2^k. Figure 7.7 shows that the number 𝒩_k of neurons satisfies the recursion 𝒩_{k+1} = 2𝒩_k + 3 with 𝒩_1 = 3. The solution of this recursion is 𝒩_k = 3(2^k − 1).
This example also illustrates a second reason why it may be useful to have more than one hidden layer. To design a neural network for a certain task it is often convenient to build the network from well-studied building blocks. One wires them together, often in a hierarchical fashion. In Figure 7.7 there is only one building block, the XOR network from Figure 5.17. Another example is convolutional networks for image analysis (Chapter 8). Here the fundamental building blocks are so-called feature maps; they recognise different geometrical features in the image, such as edges or corners (Chapter 8).
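A sketch of this building-block idea: a hand-wired XOR unit with 0/1 Heaviside neurons, chained to compute the parity of an input string (a simple chain rather than the balanced tree of Figure 7.7; the helper names are illustrative).

    import numpy as np

    theta_H = lambda b: (b > 0).astype(float)    # Heaviside activation, 0/1 neurons

    def xor_unit(x1, x2):
        """Hand-wired XOR building block (three neurons, cf. Figure 5.17 with 0/1 neurons)."""
        V1 = theta_H(x1 + x2 - 0.5)              # OR
        V2 = theta_H(x1 + x2 - 1.5)              # AND
        return theta_H(V1 - V2 - 0.5)            # XOR = OR and not AND

    def parity(bits):
        """Parity of a string of 0/1 inputs, obtained by wiring XOR units together."""
        out = bits[0]
        for x in bits[1:]:
            out = xor_unit(out, x)
        return out

    for bits in [[0, 1, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0]]:
        print(bits, parity(np.array(bits, dtype=float)))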
7.2 Vanishing and exploding gradients
This Section describes an inherent instability in the training of deep networks with stochastic gradient descent, the vanishing- or exploding-gradient problem.
In Chapter 6, we saw that learning slows down when the factors g′(b) in the recursion (6.17) become small. When the network has several hidden layers, like the one shown in Figure 7.8, the potentially small factors of g ′(b ) are multiplied, aggravating the problem. As a consequence, the weights of hidden neurons close to the input layer change only by small amounts, the smaller the more hidden layers the network has. This is the vanishing-gradient problem.
Figure 7.9 quantifies the problem. The Figure shows that the r.m.s. errors averaged over different realisations of random initial weights, δ_rms^{(l)} ≡ ⟨ N^{−1} Σ_{j=1}^{N} [δ_j^{(l)}]² ⟩^{1/2},
tend to be very small during initial training. To explain this phenomenon, consider
the very simple example discussed in Ref. [5], a long chain of neurons with only one neuron per layer (Figure 7.10). The output V (L) is given by nested activation functions
V^{(L)} = g( w^{(L)} g( w^{(L−1)} ··· g( w^{(2)} g( w^{(1)} x − θ^{(1)} ) − θ^{(2)} ) ··· − θ^{(L−1)} ) − θ^{(L)} ) .   (7.10)

Let us compute the errors δ^{(l)} using Equation (6.16). The partial derivative in (6.16)
Figure 7.8: Fully connected deep network with four hidden layers.
is evaluated using the chain rule:

∂V^{(L)}/∂V^{(L−1)} = g′(b^{(L)}) w^{(L)} ,
∂V^{(L)}/∂V^{(L−2)} = [ ∂V^{(L)}/∂V^{(L−1)} ] [ ∂V^{(L−1)}/∂V^{(L−2)} ] = g′(b^{(L)}) w^{(L)} g′(b^{(L−1)}) w^{(L−1)} ,
  ⋮   (7.11)

where b^{(k)} = w^{(k)} V^{(k−1)} − θ^{(k)} is the local field for neuron k. This yields the following expression for ∂V^{(L)}/∂V^{(l)}:

∂V^{(L)}/∂V^{(l)} = Π_{k=L}^{l+1} [ g′(b^{(k)}) w^{(k)} ] .   (7.12)

Inserting this result into Equation (6.16), we find:

δ^{(l)} = [ t − V^{(L)}(x) ] g′(b^{(L)}) Π_{k=L}^{l+1} [ w^{(k)} g′(b^{(k−1)}) ] .   (7.13)

One can also obtain this result by applying the recursion from Algorithm 4, δ^{(l)} = δ^{(l+1)} w^{(l+1)} g′(b^{(l)}).

Now consider the early stages of training. For the activation functions (6.19), the maximum of g′(b) is 1/4 and 1, respectively, and g′(b) becomes exponentially small if |b| is large. If one initialises the weights as described in Section 6.2, to Gaussian random variables with mean zero and variance σ_w² = 1 [5], then the factors w^{(k)} g′(b^{(k−1)}) tend to be smaller than unity. In this case, Equation (7.12) implies that
Figure 7.9: Vanishing-gradient problem for a network with four fully connected hidden layers. The Figure illustrates schematically how the r.m.s. error δ_rms^{(l)} in layer l depends on the number of training epochs. During phase I, the vanishing-gradient problem is severe, during phase II the network starts to learn, phase III is the convergence phase where the errors decline. Schematic, based on simulations performed by Ludvig Storm, training a network with four hidden layers and N = 30 neurons per layer on the MNIST data set.
Figure 7.10: Chain of neurons used to illustrate the vanishing-gradient problem [5], with neurons V^{(l)}, weights w^{(l)}, and thresholds θ^{(l)}.
the error or gradient δ(l) vanishes quickly as l decreases. The reason is simply that the number of small factors in the product (7.12) increases when l becomes smaller, and multiplying many small numbers gives a very small product. As a result, the training slows down. As mentioned above, this is the vanishing-gradient problem (phase I in Figure 7.9).
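The decay of the product (7.12) is easy to observe numerically. The sketch below propagates a single input through a random chain as in Figure 7.10 (thresholds set to zero for simplicity, an assumption made here only for brevity) and accumulates the factors |w^{(k)} g′(b^{(k−1)})|.

    import numpy as np

    rng = np.random.default_rng(4)
    L = 30                                     # number of layers in the chain (Figure 7.10)
    w = rng.normal(0.0, 1.0, size=L + 1)       # weights w^(1), …, w^(L), σ_w² = 1
    g, gp = np.tanh, lambda b: 1 - np.tanh(b)**2

    V, b = rng.normal(), np.zeros(L + 1)
    for l in range(1, L + 1):                  # forward pass, Eq. (7.10)
        b[l] = w[l] * V                        # thresholds set to zero here
        V = g(b[l])

    factor = 1.0
    for k in range(L, 1, -1):                  # factors accumulate multiplicatively, Eq. (7.12)
        factor *= abs(w[k] * gp(b[k - 1]))
        print(f"layer {k-1:2d}: |prod| = {factor:.3e}")   # typically decays exponentially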
What happens at later times? Figure 7.9 indicates that the network continues to learn slowly. For the particular example shown in Figure 7.9, the effect persists for about 20 epochs. Then the first layers begin to learn faster (phase II). There is to date no mathematical theory describing how this transition occurs. Much later in training, the errors decay as the learning converges (phase III in Figure 7.9).
There is a second, equivalent, point of view [5]: the learning is slow in a layer far from the output because the output is not very sensitive to the state of these neurons. The effect of a given neuron on the output is measured by Equation (7.12), which describes how the output of the network changes when changing the state of
a neuron in a particular layer. At any rate, Equation (7.13) demonstrates that hidden layers far from the output layer learn slowly, at least initially when the weights are still random.
Suppose we try to combat the vanishing-gradient problem by increasing the weight variance σ2w . The problem is that this may cause the factors w (k )g ′(b (k −1)) to become larger than unity. As a consequence, the gradients increase exponentially instead (exploding gradients). In conclusion, the training dynamics is fundamentally unstable. This is due to the multiplicative nature of the recursion for the errors. Taking the logarithm of the product in Equation (7.12) and assuming that the weights are independently distributed random numbers, the central-limit theorem (Chapter 2) implies that the distribution of log δ(l) is Gaussian. In other words, the distribution of the errors is lognormal, implying that very small and very large values of δ(l) occur with high probability.
In networks like the one shown in Figure 7.8 the principle is the same. Assume that all layers l = 1,...,L have the same number N of neurons. When N > 1, one multiplies N × N matrices instead of numbers. The product (7.12) of random numbers becomes a product of random matrices. Using the chain rule we find:
∂V_i^{(L)}/∂V_j^{(l)} = Σ_{m=1}^{N} Σ_{n=1}^{N} ··· Σ_{p=1}^{N} [ ∂V_i^{(L)}/∂V_m^{(L−1)} ] [ ∂V_m^{(L−1)}/∂V_n^{(L−2)} ] ··· [ ∂V_p^{(l+1)}/∂V_j^{(l)} ] .   (7.14)

With the update rule

V_m^{(k)} = g( Σ_{j=1}^{N} w_{mj}^{(k)} V_j^{(k−1)} − θ_m^{(k)} )   (7.15)

we can evaluate each factor:

∂V_m^{(k)}/∂V_n^{(k−1)} = g′(b_m^{(k)}) w_{mn}^{(k)} .   (7.16)

Substituting this result into Equation (7.14), we see that the partial derivatives ∂V_i^{(L)}/∂V_j^{(l)} can be computed in the form of a matrix product. The matrix 𝕁_{L−l} with elements [𝕁_{L−l}]_{ij} = ∂V_i^{(L)}/∂V_j^{(l)} is given by:

𝕁_{L−l} = 𝔻^{(L)} 𝕎^{(L)} 𝔻^{(L−1)} 𝕎^{(L−1)} ··· 𝔻^{(l+1)} 𝕎^{(l+1)} .   (7.17)

Here 𝕎^{(k)} is the matrix of weights feeding into layer k, and

𝔻^{(k)} = diag[ g′(b_1^{(k)}), …, g′(b_N^{(k)}) ]   (7.18)

is the diagonal matrix with entries D_{jj}^{(k)} = g′(b_j^{(k)}). The matrix product (7.17) determines the error dynamics, just like Equation (7.13):

δ^{(l)} = δ^{(L)} 𝕁_{L−l} .   (7.19)
Does the magnitude |δ^{(l)}|² = δ^{(l)} · δ^{(l)} of the errors shrink or grow as they propagate through the layers? This is determined by the eigenvalues of the left Cauchy-Green matrix 𝕁_p 𝕁_p^T, with p = L − l. This matrix is symmetric, and its eigenvalues are non-negative. Their square roots are the singular values of 𝕁_p:

Λ_1^{(p)} ≥ Λ_2^{(p)} ≥ ··· ≥ Λ_N^{(p)} ≥ 0 .   (7.20)

It is customary to sort the singular values by their magnitudes, as in Equation (7.20). When there are many layers, the number p = L − l of factors in Equation (7.17) is large. In this case, the maximal singular value either decreases or increases exponentially as a function of p [82]. The corresponding rate

λ_1 = lim_{p→∞} (1/p) log Λ_1^{(p)}   (7.21)

is called the maximal Lyapunov exponent. A negative maximal Lyapunov exponent indicates that the errors vanish exponentially. The eigenvectors of 𝕁_p 𝕁_p^T are called backward Lyapunov vectors. They describe how the errors change as they propagate through the network. How small differences between the inputs change is determined by the forward Lyapunov vectors, the eigenvectors of 𝕁_p^T 𝕁_p. Since 𝕁_p^T 𝕁_p and 𝕁_p 𝕁_p^T have the same eigenvalues, the rate of decay or increase of the magnitude of input differences is the same as that of the error magnitudes.
The concept of a maximal Lyapunov exponent is borrowed from chaos theory
[83–85], where λ1 > 0 implies that small perturbations of the initial conditions grow
exponentially as a function of time. The iterated map (7.10) is a dynamical system.
The transition in Figure 7.9 is triggered by a change of the Lyapunov exponent from
negative values to λ1 ≈ 0 [86]. In summary, the unstable-gradient problem in deep
networks is due to the fact that the maximal singular value Λ_1^{(p)} either increases or decreases exponentially as one moves away from the output layer, depending on whether the maximal Lyapunov exponent is positive or negative.
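A numerical sketch of this picture: multiply the matrices of Equation (7.17) for random weights and read off the maximal singular value; the variance choice σ_w² = 1/N used here is one common option (close to the prescription discussed below), and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)
    N, L = 20, 50
    sigma_w = 1.0 / np.sqrt(N)                   # σ_w² = 1/N
    gp = lambda b: 1 - np.tanh(b)**2

    V = rng.normal(size=N)
    J = np.eye(N)                                # product of matrices D^(k) W^(k), Eq. (7.17)
    for _ in range(L):
        W = rng.normal(0.0, sigma_w, size=(N, N))
        b = W @ V                                # local fields (thresholds zero here)
        V = np.tanh(b)
        J = np.diag(gp(b)) @ W @ J

    Lambda1 = np.linalg.svd(J, compute_uv=False)[0]   # maximal singular value Λ1^(p)
    print("lambda_1 ≈", np.log(Lambda1) / L)          # estimate of Eq. (7.21)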
Pennington et al. [87] suggested to combat the unstable-gradient problem by ini- tialising weights and thresholds in such a way that the maximal Lyapunov exponent is close to zero, in order to make sure that the errors neither grow nor shrink expo- nentially. Consider the network shown in Figure 7.8, with N neurons per hidden layer, and initialise the weights to independent Gaussian random numbers with mean zero and variance σ2w . The thresholds are initialised in the same way, with
variance σθ2 . In the limit of N → ∞ one can use a mean-field theory [87], just as in Chapter 3, to estimate the maximal Lyapunov exponent.
Following Ref. [87], the first step is to compute how the errors propagate through the network. We assume uncorrelated random input patterns, Equation (2.29), and random weights with mean zero and variance $\langle w_{ij} w_{kl}\rangle = \sigma_w^2\,\delta_{ik}\delta_{jl}$. When $N\to\infty$, the errors are sums of many random numbers [Equation (6.17)]. Invoking the central-limit theorem (Chapter 2), one concludes that the errors are approximately Gaussian distributed, with mean zero and with variance

$$\langle[\delta_j^{(l-1)}]^2\rangle = \sum_{i,k=1}^{N}\big\langle \delta_i^{(l)}\delta_k^{(l)}\, w_{ij}^{(l)} w_{kj}^{(l)}\big\rangle\,[g'(b_j^{(l-1)})]^2 \approx \sigma_w^2 \sum_{i=1}^{N}\langle[\delta_i^{(l)}]^2\rangle\,\langle[g'(b_j^{(l-1)})]^2\rangle\,. \tag{7.22}$$

The last approximation neglects possible correlations with the local fields. The variance $\langle[\delta_i^{(l)}]^2\rangle$ does not depend on $i$, so that the sum just gives a factor of $N$. The central-limit theorem ensures that the local fields $b_j^{(l)}$ are Gaussian distributed too, in the limit $N\to\infty$, with mean zero and variance

$$\sigma_l^2 = \frac{1}{N}\sum_{j=1}^{N}\big[b_j^{(l)}\big]^2\,. \tag{7.23}$$

This allows us to estimate

$$\langle[g'(b_j^{(l)})]^2\rangle \sim \int \mathrm{d}z\, \frac{\mathrm{e}^{-z^2/2\sigma_l^2}}{\sqrt{2\pi\sigma_l^2}}\,[g'(z)]^2 \equiv F(\sigma_l)\,, \tag{7.24}$$

independent of $j$. Equation (7.23) describes how the distribution of local fields $b_j^{(l)}$ narrows or broadens as one iterates. It was shown in Ref. [87] that $\sigma_l$ approaches a fixed point, $\sigma_* = \lim_{l\to\infty}\sigma_l$, under certain conditions on the activation function and on the variances of the weights and thresholds, $\sigma_w^2$ and $\sigma_\theta^2$. If $\sigma_l$ is well approximated by $\sigma_*$, Equation (7.22) simplifies to $\delta_{\rm rms}^{(l-1)} \approx \delta_{\rm rms}^{(l)}\sqrt{\sigma_w^2\, N\, F(\sigma_*)}$. This results in a mean-field estimate of the maximal Lyapunov exponent,

$$\lambda_1 \sim \log\big[\delta_{\rm rms}^{(l-1)}/\delta_{\rm rms}^{(l)}\big] \approx \tfrac{1}{2}\log\big[\sigma_w^2\, N\, F(\sigma_*)\big]\,. \tag{7.25}$$

The network parameters should be adjusted so that this exponent is as close to zero as possible. This means, in particular, that one should take

$$\sigma_w^2 \propto N^{-1}\,, \tag{7.26}$$

see also Refs. [88, 89]. But we must keep in mind that Equation (7.25) relies on taking the limit $N\to\infty$. It is expected that the assumptions underlying Equation (7.25) break down when $N$ is finite, causing the mean-field theory to fail. The tails of the error distribution, for example, are expected to become heavier as $N$ decreases, as indicated by the results for $N = 1$ described above. Note also that $\mathbb{J}_p$ assumes rank zero with a small but non-zero probability when $N$ is finite. In this case $\lambda_1 = -\infty$.
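A minimal numerical sketch of this analysis (not from Ref. [87]; the tanh activation, the network size, and all parameter values are illustrative choices): it builds the matrix product (7.17) for a random network and estimates $\lambda_1$ from the growth rate of the largest singular value, Equation (7.21).

```python
# Sketch: estimate the maximal Lyapunov exponent of Eq. (7.21) for a deep
# tanh network with random Gaussian weights (variance sigma_w^2) and
# thresholds (variance sigma_theta^2).  Parameter values are illustrative.
import numpy as np

def max_lyapunov(N=100, L=50, sigma_w2=0.01, sigma_theta2=0.001, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.normal(size=N)                       # random input pattern
    J = np.eye(N)                                # accumulated product, Eq. (7.17)
    log_sv = []
    for _ in range(L):
        W = rng.normal(scale=np.sqrt(sigma_w2), size=(N, N))
        theta = rng.normal(scale=np.sqrt(sigma_theta2), size=N)
        b = W @ V - theta                        # local fields
        V = np.tanh(b)                           # forward pass, g = tanh
        D = np.diag(1.0 - np.tanh(b) ** 2)       # D^(k) of Eq. (7.18)
        J = D @ W @ J                            # one more factor in Eq. (7.17)
        log_sv.append(np.log(np.linalg.svd(J, compute_uv=False)[0]))
    p = np.arange(1, L + 1)
    return np.polyfit(p, log_sv, 1)[0]           # slope ~ lambda_1, Eq. (7.21)

for sigma_w2 in (0.5 / 100, 1.0 / 100, 2.0 / 100):    # sigma_w^2 of order 1/N
    print(f"sigma_w^2 = {sigma_w2:.3f}:  lambda_1 ~ {max_lyapunov(sigma_w2=sigma_w2):+.3f}")
```

For $\sigma_w^2$ below the critical value the estimated exponent comes out negative (vanishing gradients), above it positive (exploding gradients), in line with Equation (7.26).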
There are a number of other tricks that help to cope with unstable gradients in practice, to some extent at least. First, it is sometimes argued that activation functions which do not saturate at large $b$, such as the ReLU function, help against the vanishing-gradient problem. Second, batch normalisation (Section 7.6.5) may reduce the unstable-gradient problem. Third, introducing connections that skip layers (residual networks) can alleviate the unstable-gradient problem. This is discussed in Section 7.4.
Finally, there is an important aspect that we did not discuss: unstable gradients limit the extent to which information can propagate through the network in a meaningful way. This is explained in Ref. [90].
In this Section we assumed all along that the weights are random numbers. When the network starts to learn, this is no longer the case. The question is how the singular values of $\mathbb{J}_p$ change when correlations between different factors in the product (7.17) develop.
7.3 Rectified linear units
Glorot et al. [91] suggested to use a piecewise linear activation function, the ReLU function$^2$ max{0, b} (Chapter 1). What is the point of using ReLU neurons? When training a deep network with ReLU activation functions, many of the hidden neurons produce output zero. This means that the network of active neurons (non-zero output) is sparsely connected. It is sometimes argued that sparse networks have desirable properties; at least sparse representations of a classification problem tend to be easier to learn because they are more likely to be linearly separable (Section 5.4). Figure 7.11 illustrates that for a given input pattern, only a certain fraction of hidden neurons is active. For these neurons the computation is linear, yet different input patterns give different sets of active neurons. The product in Equation (7.17) acquires a particularly simple structure: the matrices $\mathbb{D}^{(k)}$ are diagonal with 0/1 entries. But while the weight matrices are independent initially, they become correlated as the training proceeds. Also the $\mathbb{D}^{(k)}$-matrices develop correlations: which diagonal elements vanish depends on which pattern is clamped to the input terminals.
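The following NumPy sketch (illustrative, not from the book) propagates a random pattern through a ReLU network and records, for each layer, the diagonal 0/1 matrix $\mathbb{D}^{(k)}$ and the fraction of active neurons; all sizes and scales are made-up examples.

```python
# Sketch: sparse activity in a ReLU network and the 0/1 matrices D^(k).
import numpy as np

rng = np.random.default_rng(1)
N, L = 50, 6                                   # neurons per layer, number of layers
V = rng.normal(size=N)                         # a random input pattern

for k in range(L):
    W = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))
    theta = rng.normal(scale=0.1, size=N)
    b = W @ V - theta                          # local fields
    V = np.maximum(0.0, b)                     # ReLU: max{0, b}
    D = np.diag((b > 0).astype(float))         # diagonal 0/1 matrix entering Eq. (7.17)
    print(f"layer {k + 1}: fraction of active neurons = {np.mean(b > 0):.2f}")
```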
Figure 7.11: Sparse network of active neurons with ReLU activation functions. The black paths correspond to active neurons with positive local fields.

A hidden layer with only one or very few active neurons might act as a bottleneck preventing efficient backpropagation of output errors, which could in principle slow down the training. For the examples given in Ref. [91], this does not occur. To describe information propagation through a network with ReLU neurons, one should compute the probability that a given number of singular values of the matrix $\mathbb{J}_p$ vanish (Section 7.2).

$^2$ Since the derivative of the ReLU function is discontinuous at b = 0, a common convention is to set the derivative to zero at b = 0.
The ReLU function is unbounded for large positive local fields. Therefore, the vanishing-gradient problem (Section 7.2) is thought to be less severe in networks made of rectified linear units. However, since the ReLU function does not saturate, the weights tend to increase. Glorot et al. [91] suggested to employ L1-regularisation (Section 7.6.1) to make sure that the weights do not grow.
Finally, using ReLU functions instead of sigmoid functions speeds up the training, because the ReLU function has piecewise constant derivatives. Such function calls are faster to evaluate than non-linear activation functions and their derivatives.
7.4 Residual networks
One way of reducing the vanishing-gradient problem is to introduce short cuts, connections that skip layers [92]. Empirical evidence shows that networks with such short cuts are easier to train than standard multilayer perceptrons. The likely reason is that the vanishing-gradient problem is less severe in networks with short cuts, because error propagation is determined by the matrix product with the smallest number of factors.
Figure 7.12: Schematic illustration of a network with a short cut that skips one layer (gray arrow). After Fig. 1 from Ref. [93].

Figure 7.13: Chain of neurons with short cuts (gray arrows) that skip one neuron.

This Section explains how to train feed-forward networks with short cuts [93]. The layout is illustrated schematically in Figure 7.12. Black arrows stand for usual feed-forward connections, and the gray arrow indicates a connection that skips a layer. The notation in Figure 7.12 differs somewhat from that of Algorithm 4. The weights from layer $l-1$ to $l$ are denoted by $w_{jk}^{(l,l-1)}$, and those from layer $l-1$ to $l+1$ by $w_{ij}^{(l+1,l-1)}$ (gray arrow in Figure 7.12). Note that the superscripts are ordered in the same way as the subscripts: the right index refers to the layer on the left. Neuron $j$ in layer $l$ computes

$$V_j^{(l)} = g\Big(\sum_k w_{jk}^{(l,l-1)} V_k^{(l-1)} - \theta_j^{(l)} + \sum_n w_{jn}^{(l,l-2)} V_n^{(l-2)}\Big)\,. \tag{7.27}$$

As usual, the argument of the activation function is the local field $b_j^{(l)}$. The weights of all connections are trained in the usual fashion, by stochastic gradient descent. To illustrate the structure of the resulting formulae, consider a chain of neurons, just one per layer, with short cuts that skip one neuron (Figure 7.13). We calculate the weight increments using Equations (6.15) and (6.16). The recursion (6.17) applies only to standard feed-forward networks without skipping layers. In order to determine how to update the weights for the network shown in Figure 7.13, we need to evaluate the gradients $\partial V^{(L)}/\partial V^{(l)}$. To begin with, consider the learning rule for $w^{(L,L-1)}$. Using Equations (6.15) and (6.16) one finds
$$\delta w^{(L,L-1)} = \eta\,\delta^{(L)} V^{(L-1)} \quad\text{with}\quad \delta^{(L)} = \big(t - V^{(L)}\big)\,g'(b^{(L)})\,, \tag{7.28}$$

as in Algorithm 4. In the same way one obtains

$$\delta w^{(L,L-2)} = \eta\,\delta^{(L)} V^{(L-2)} \quad\text{with}\quad \delta^{(L)} = \big(t - V^{(L)}\big)\,g'(b^{(L)})\,. \tag{7.29}$$

Now consider the learning rule for $w^{(L-1,L-2)}$. Using $\partial V^{(L)}/\partial V^{(L-1)} = g'(b^{(L)})\,w^{(L,L-1)}$ gives

$$\delta w^{(L-1,L-2)} = \eta\,\delta^{(L-1)} V^{(L-2)} \quad\text{with}\quad \delta^{(L-1)} = \delta^{(L)} w^{(L,L-1)} g'(b^{(L-1)})\,, \tag{7.30}$$

as before. But the update for $w^{(L-2,L-3)}$ is different, because now the short cuts come into play. The connection from layer $L-2$ to $L$ gives rise to an extra term:

$$\frac{\partial V^{(L)}}{\partial V^{(L-2)}} = \frac{\partial V^{(L)}}{\partial V^{(L-1)}}\,\frac{\partial V^{(L-1)}}{\partial V^{(L-2)}} + g'(b^{(L)})\,w^{(L,L-2)}\,. \tag{7.31}$$

Evaluating the partial derivatives yields:

$$\delta w^{(L-2,L-3)} = \eta\,\delta^{(L-2)} V^{(L-3)} \quad\text{with}\quad \delta^{(L-2)} = \delta^{(L-1)} w^{(L-1,L-2)} g'(b^{(L-2)}) + \delta^{(L)} w^{(L,L-2)} g'(b^{(L-2)})\,. \tag{7.32}$$

Iterating further in this way, one finds the following error-backpropagation rule:

$$\delta^{(l-1)} = \delta^{(l)} w^{(l,l-1)} g'(b^{(l-1)}) + \delta^{(l+1)} w^{(l+1,l-1)} g'(b^{(l-1)})\,, \tag{7.33}$$

with $w^{(l+1,l-1)} = 0$ for $l \geq L-1$. The first term is the same as in Algorithm 4. The second term is due to the skipping connections. These connections reduce the vanishing-gradient problem. To see this, note that we can write the error $\delta^{(l)}$ as

$$\delta^{(l)} = \delta^{(L)} \sum_{l_1, l_2, \ldots, l_n} w^{(L,l_n)} g'(b^{(l_n)}) \cdots w^{(l_2,l_1)} g'(b^{(l_1)})\, w^{(l_1,l)} g'(b^{(l)})\,, \tag{7.34}$$

where the sum is over all paths $L > l_n > l_{n-1} > \cdots > l_1 > l$ back through the network. The structure of the general formula, for networks with more than only one neuron per layer, is analogous to Equation (7.34). According to this equation, the smallest errors, or gradients, in networks with many layers are dominated by the product corresponding to the path with the smallest number of steps (factors). Therefore short cuts tend to increase the small gradients.
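A minimal sketch (illustrative; the weights, local fields, and the tanh derivative are made up) of the recursion (7.33) for the chain of Figure 7.13: it propagates the output error backwards with and without the skipping connections, to show how short cuts increase small gradients.

```python
# Sketch: error backpropagation in a chain with short cuts, Eq. (7.33).
import numpy as np

rng = np.random.default_rng(2)
L = 20                                         # number of layers (one neuron each)
w1 = rng.normal(scale=0.5, size=L + 2)         # w^(l,l-1), feed-forward weights
w2 = rng.normal(scale=0.5, size=L + 2)         # w^(l+1,l-1), skipping weights
gprime = 1.0 - np.tanh(rng.normal(size=L + 2)) ** 2   # g'(b^(l)) for g = tanh

def backprop(skip=True):
    delta = np.zeros(L + 2)
    delta[L] = 1.0                             # output error delta^(L)
    for l in range(L, 1, -1):                  # recursion (7.33)
        delta[l - 1] = delta[l] * w1[l] * gprime[l - 1]
        if skip and l + 1 <= L:                # skip term only where the connection exists
            delta[l - 1] += delta[l + 1] * w2[l + 1] * gprime[l - 1]
    return abs(delta[1])

print("error magnitude with short cuts:   ", backprop(skip=True))
print("error magnitude without short cuts:", backprop(skip=False))
```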
Finally, the network described in Ref. [92] used unit weights for the skipping connections. In this case, the local field of $V_j^{(l+1)}$ takes the form

$$b_j^{(l+1)} = \sum_k w_{jk}^{(l+1,l)} V_k^{(l)} - \theta_j^{(l+1)} + V_j^{(l-1)} \equiv F + V_j^{(l-1)}\,, \tag{7.35}$$

assuming that all hidden layers have the same number of neurons. Here $F$ is a residual contribution to the local field (when $F = 0$, the inputs $V_j^{(l-1)}$ are passed right through to $b_j^{(l+1)}$). Therefore such networks are called residual networks. But note that the networks described in Ref. [92] use convolution layers (Section 8.1).

7.5 Outputs and energy functions
Up to now we discussed networks that have the same activation functions for all neurons in all layers, either sigmoid or tanh activation functions [Equation (6.19)], or ReLU functions (Sections 1.3 and 7.3). In the output layer one often uses neurons with a different activation function, so-called softmax outputs:
$$O_i = \frac{\mathrm{e}^{\alpha b_i^{(L)}}}{\sum_{k=1}^{M} \mathrm{e}^{\alpha b_k^{(L)}}}\,. \tag{7.36}$$

Here $b_i^{(L)} = \sum_j w_{ij}^{(L)} V_j^{(L-1)} - \theta_i^{(L)}$ are the local fields in the output layer. In the limit $\alpha\to\infty$, we see that $O_i = \delta_{i i_0}$ where $i_0$ is the index of the winning output neuron, the one with the largest value $b_i^{(L)}$ (Chapter 10). For $\alpha = 1$, Equation (7.36) is a soft version of this maximum criterion, thus the name softmax. We set $\alpha$ to unity from now on.
Two important properties of softmax outputs are, first, that $0 \leq O_i \leq 1$. Second, the values of the outputs sum to unity,

$$\sum_{i=1}^{M} O_i = 1\,. \tag{7.37}$$
Therefore the outputs of softmax units can be interpreted as probabilities. Consider classification problems where the inputs must be assigned to one of $M$ classes. In this case, the output $O_i^{(\mu)}$ of softmax unit $i$ is assumed to represent the probability that the input $\boldsymbol{x}^{(\mu)}$ is in class $i$ (in terms of the targets: $t_i^{(\mu)} = 1$ while $t_k^{(\mu)} = 0$ for $k \neq i$). If $O_i^{(\mu)} \approx 1$, we assume that the network is quite certain that input $\boldsymbol{x}^{(\mu)}$ is in class $i$. On the other hand, if all $O_k^{(\mu)} \approx M^{-1}$, we interpret the network output as uncertain. But note that neural networks may fail like humans sometimes do: their output can be very certain yet wrong (Section 8.6).
Softmax units are used in conjunction with a different energy function,

$$H = -\sum_{i\mu} t_i^{(\mu)}\log O_i^{(\mu)}\,. \tag{7.38}$$

Here and in the following log stands for the natural logarithm. The function (7.38) is minimal when $O_i^{(\mu)} = t_i^{(\mu)}$ (Exercise 7.5). To find the correct backpropagation formula for the energy function (7.38), we need to evaluate

$$\frac{\partial H}{\partial w_{mn}} = -\sum_{i\mu}\frac{t_i^{(\mu)}}{O_i^{(\mu)}}\,\frac{\partial O_i^{(\mu)}}{\partial w_{mn}}\,. \tag{7.39}$$

Here the labels denoting the output layer were omitted, and in the following equations the index $\mu$ that refers to the input pattern is dropped as well. Using the identities

$$\frac{\partial O_i}{\partial b_l} = O_i\big(\delta_{il} - O_l\big) \quad\text{and}\quad \frac{\partial b_l}{\partial w_{mn}} = \delta_{lm} V_n\,, \tag{7.40}$$

one obtains

$$\frac{\partial O_i}{\partial w_{mn}} = \sum_l \frac{\partial O_i}{\partial b_l}\,\frac{\partial b_l}{\partial w_{mn}} = O_i\big(\delta_{im} - O_m\big)\,V_n\,. \tag{7.41}$$

So

$$\delta w_{mn} = -\eta\frac{\partial H}{\partial w_{mn}} = \eta\sum_{i\mu} t_i^{(\mu)}\big(\delta_{im} - O_m^{(\mu)}\big) V_n^{(\mu)} = \eta\sum_{\mu}\big(t_m^{(\mu)} - O_m^{(\mu)}\big) V_n^{(\mu)}\,, \tag{7.42}$$

since $\sum_{i=1}^{M} t_i^{(\mu)} = 1$ for the type of classification problem where each input belongs to precisely one class. The corresponding learning rule for the thresholds reads

$$\delta\theta_m = -\eta\frac{\partial H}{\partial\theta_m} = -\eta\sum_{\mu}\big(t_m^{(\mu)} - O_m^{(\mu)}\big)\,. \tag{7.43}$$
Equations (7.42) and (7.43) highlight a further advantage of softmax output neurons (apart from the fact that they allow the output to be interpreted in terms of probabilities). The weight and threshold increments for the output layer derived in Chapter 6 [Equations (6.6a) and (6.11a)] contain factors of derivatives $g'(B_m^{(\mu)})$. As noted earlier, these derivatives tend to zero when the activation function saturates, slowing down the learning. But here the rate at which the neuron learns is simply proportional to the output error, $(t_m^{(\mu)} - O_m^{(\mu)})$, without any possibly small factor $g'(b)$. Softmax units are normally only used in the output layer, because the learning speedup is coupled to the use of the energy function (7.38), and because it is customary to avoid dependencies between the neurons within a hidden layer.
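The following NumPy sketch (illustrative, not the book's code; the layer sizes and random values are made up) implements the softmax output (7.36) with $\alpha = 1$ and checks numerically that the gradient of the negative log-likelihood (7.38) for a single pattern reduces to the simple form in Equation (7.42).

```python
# Sketch: softmax outputs and the learning rule (7.42) for one pattern.
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 10                         # number of classes, inputs to the output layer
W = rng.normal(size=(M, N))          # output weights w^(L)
theta = rng.normal(size=M)           # output thresholds theta^(L)
V = rng.normal(size=N)               # outputs of layer L-1
t = np.eye(M)[1]                     # one-hot target: the pattern is in class 2

b = W @ V - theta                    # local fields b^(L)
O = np.exp(b) / np.exp(b).sum()      # softmax outputs, Eq. (7.36) with alpha = 1

dH_dW = -np.outer(t - O, V)          # dH/dw_mn = -(t_m - O_m) V_n, from Eq. (7.42)

# numerical check of one gradient component
eps, m, n = 1e-6, 0, 3
W2 = W.copy(); W2[m, n] += eps
b2 = W2 @ V - theta
O2 = np.exp(b2) / np.exp(b2).sum()
H = -np.dot(t, np.log(O))
H2 = -np.dot(t, np.log(O2))
print(dH_dW[m, n], (H2 - H) / eps)   # the two numbers should agree
```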
There is an alternative form of the energy function that is very similar to the above, but works with sigmoid activation functions and 0/1 targets. Instead of Equation (7.38) one chooses:
$$H = -\sum_{i\mu}\Big[ t_i^{(\mu)} \log O_i^{(\mu)} + \big(1 - t_i^{(\mu)}\big)\log\big(1 - O_i^{(\mu)}\big)\Big]\,, \tag{7.44}$$

with $O_i = \sigma(b_i)$, $i = 1,\ldots,M$, and where $\sigma$ denotes the sigmoid function (6.19a). To compute the weight increments, we apply the chain rule:

$$\frac{\partial H}{\partial w_{mn}} = -\sum_{i\mu}\bigg[\frac{t_i^{(\mu)}}{O_i^{(\mu)}} - \frac{1 - t_i^{(\mu)}}{1 - O_i^{(\mu)}}\bigg]\frac{\partial O_i}{\partial w_{mn}} = -\sum_{i\mu}\frac{t_i^{(\mu)} - O_i^{(\mu)}}{O_i^{(\mu)}\big(1 - O_i^{(\mu)}\big)}\,\frac{\partial O_i}{\partial w_{mn}}\,. \tag{7.45}$$

Using Equation (6.20) we obtain

$$\delta w_{mn} = \eta \sum_{\mu}\big(t_m^{(\mu)} - O_m^{(\mu)}\big) V_n^{(\mu)}\,, \tag{7.46}$$

identical to Equation (7.42). The thresholds are adjusted in an analogous fashion, Equation (7.43). But now the interpretation of the outputs is slightly different, since the values of the softmax units in the output layer sum to unity, while those with sigmoid activation functions do not. In either case one can use the definition (6.30) for the classification error.
To conclude this Section, we briefly discuss the meaning of the energy functions (7.38) and (7.44). In Chapter 6 we saw how deep neural networks are trained to fit input-output functions (Section 7.1) by minimising the quadratic energy function (6.4). This is reminiscent of regression analysis in mathematical statistics, where the predictive accuracy of a model is improved by minimising the sum over the squared errors. Now consider the energy function (7.44) for a single sigmoid output with targets $t = 0$ and $t = 1$. In this case, the network output is interpreted as the probability $O^{(\mu)} = \mathrm{Prob}(t^{(\mu)} = 1\,|\,\boldsymbol{x}^{(\mu)})$ of observing $t^{(\mu)} = 1$. The corresponding likelihood is the joint probability of observing the outcomes $t^{(\mu)}$ for $p$ independent inputs $\boldsymbol{x}^{(\mu)}$:

$$\mathscr{L} = \prod_{\mu=1}^{p} \big[O^{(\mu)}\big]^{t^{(\mu)}}\big[1 - O^{(\mu)}\big]^{1 - t^{(\mu)}}\,, \tag{7.47}$$

under the model determined by the weights and thresholds of the network. Minimising the negative log-likelihood $-\log\mathscr{L}$ (Section 4.4) corresponds to minimising (7.44). This is just binary logistic regression [94] to predict a binary outcome $t = 0$ or $1$. The case $M > 1$ corresponds to a multivariate regression problem [94] with $M$ possibly correlated outcome variables $t_1,\ldots,t_M$.
When the targets describe $M$ mutually exclusive categorical outcomes, $t_i = 0,1$ with $\sum_{i=1}^{M} t_i = 1$, the softmax output $O_i$ is interpreted as the probability of observing $t_i = 1$. An example is the problem of classifying hand-written digits (Section 8.3). Training the network then corresponds to multinomial regression [94] with the log-likelihood (7.38). Note that Equation (7.44), for $M = 1$, is equivalent to (7.38) for $M = 2$, because $O_2 = 1 - O_1$ and $t_2 = 1 - t_1$. At any rate, these remarks motivate why the energy functions (7.38) and (7.44) are sometimes called log-likelihoods. Equation (7.44) is also referred to as cross entropy, because it has the same form as the cross entropy [65] characterising the difference between two Bernoulli distributions: the network output $O_i$, and the target $t_i$.
7.6 Regularisation
Deeper networks have more neurons, so the problem of overfitting (Figure 6.6) tends to be more severe. Regularisation schemes limit the tendency to overfit. Apart from cross validation (Section 6.4), a number of other regularisation schemes have proved useful for deep networks: weight decay, pruning, drop out, expansion of the training set, and batch normalisation. This Section summarises the most important aspects of these methods.
7.6.1 Weight decay
Recall Figure 5.17 which shows a solution of the classification problem defined by the Boolean XOR function. In the solution illustrated in this Figure, all weights equal ±1, and also the thresholds are of order unity. If one uses the backpropagation algorithm to find a solution to this problem, one may find that the weights continue to grow during training. As mentioned above, this can be problematic because it may imply that the local fields become so large that the activation functions saturate. Then training slows down, as explained in Section 7.2.
To prevent the weights from growing, one can reduce them by some factor during training, either at each iteration or in regular intervals, $w_{ij} \to (1-\varepsilon)w_{ij}$ for $0 < \varepsilon < 1$, or

$$\delta w_{mn} = -\varepsilon\, w_{mn} \quad\text{for}\quad 0 < \varepsilon < 1\,. \tag{7.48}$$

This is achieved by adding a term to the energy function, such as

$$H = \underbrace{\tfrac{1}{2}\sum_{i\mu}\big(t_i^{(\mu)} - O_i^{(\mu)}\big)^2}_{\equiv H_0} + \frac{\gamma}{2}\sum_{ij} w_{ij}^2\,. \tag{7.49}$$

Gradient descent on $H$ gives:

$$\delta w_{mn} = -\eta\frac{\partial H_0}{\partial w_{mn}} - \varepsilon\, w_{mn}\,, \tag{7.50}$$
with ε = ηγ. The scheme summarised here is sometimes called L2-regularisation. An alternative scheme is L1-regularisation. It amounts to
$$H = \tfrac{1}{2}\sum_{i\mu}\big(t_i^{(\mu)} - O_i^{(\mu)}\big)^2 + \frac{\gamma}{2}\sum_{ij}|w_{ij}|\,. \tag{7.51}$$

This gives the learning rule

$$\delta w_{mn} = -\eta\frac{\partial H_0}{\partial w_{mn}} - \varepsilon\,\mathrm{sgn}(w_{mn})\,. \tag{7.52}$$
The discontinuity of the learning rule at $w_{mn} = 0$ is cured by defining $\mathrm{sgn}(0) = 0$. Comparing Equations (7.50) and (7.52), we see that L1-regularisation puts more weights to zero, compared with the L2-scheme [5].
An alternative to these two methods is max-norm regularisation [95], where the weights are constrained to remain smaller than a given constant: $|w_{ij}| \leq c$. If a $|w_{ij}|$ exceeds the positive constant $c$, then $w_{ij}$ is rescaled so that $|w_{ij}| = c$.
These weight-decay schemes are referred to as regularisation schemes because they tend to help against overfitting. How does this work? Weight decay adds a constraint to the problem of minimising the energy function. The result is a compromise between a small value of H and small weight values [5]. The idea is that a network with smaller weights is more robust to the effect of noise. When the weights are small, then small changes in some of the patterns do not give a substantially different training result. When the network has large weights, by contrast, it may happen that small changes in the input yield significant differences in the training result that are difficult to generalise.
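A minimal sketch (illustrative; the learning rate, decay parameter, and the stand-in gradient are placeholders) of the update rules (7.50) and (7.52):

```python
# Sketch: one stochastic-gradient step with L2 or L1 weight decay.
import numpy as np

def sgd_step(w, grad_H0, eta=0.01, eps=1e-4, scheme="L2"):
    """w: weight matrix; grad_H0: gradient of the unregularised energy H0."""
    if scheme == "L2":
        return w - eta * grad_H0 - eps * w              # Eq. (7.50)
    if scheme == "L1":
        return w - eta * grad_H0 - eps * np.sign(w)     # Eq. (7.52), sgn(0) = 0
    raise ValueError(scheme)

# usage with a made-up gradient standing in for dH0/dw from backpropagation
rng = np.random.default_rng(4)
w = rng.normal(size=(5, 5))
g = rng.normal(size=(5, 5))
w = sgd_step(w, g, scheme="L1")
```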
7.6.2 Pruning
The term pruning refers to removing unnecessary weights or neurons from the network, to improve its efficiency. The simplest approach is weight elimination by weight decay [96]. Weights that tend to remain very close to zero during training are removed by setting them to zero and not updating them anymore. Neurons that have zero weights for all incoming connections are effectively removed (pruned). Pruning is a regularisation method: by removing unnecessary weights, one reduces the risk of overfitting. As opposed to drop out (Section 7.6.3), where hidden neurons are only temporarily ignored, pruning refers to permanently removing hidden neurons.
The idea is to train a large network, and then to prune a large fraction of neurons to obtain a much smaller network. It is usually found that such pruned networks generalise better than small networks that were trained without pruning. Up to 90% of the hidden neurons can be removed in some cases. In general, pruning is an excellent way to create efficient classifiers for real-time applications.
An efficient pruning algorithm is based on the idea to remove weights in such a way that the effect upon the energy function is as small as possible [97]. The idea is to find the optimal weight, to remove it, and to change the other weights in such a way that the energy function increases as little as possible. The algorithm works as follows. Assume that the network was trained, so that it reached a (local) minimum of the energy function H . One expands the energy function around this minimum. To second order, the expansion of H reads:
$$H = H_{\min} + \tfrac{1}{2}\,\delta\boldsymbol{w}\cdot\mathbb{M}\,\delta\boldsymbol{w} + \text{higher orders in }\delta\boldsymbol{w}\,. \tag{7.53}$$

The term linear in $\delta\boldsymbol{w}$ vanishes because we expand around a local minimum. The matrix $\mathbb{M}$ is the Hessian, the matrix of second derivatives of the energy function.
For the next step it is convenient to adopt the following notation [97]. One groups all weights in the network into a long weight vector $\boldsymbol{w}$ (as opposed to grouping them into a weight matrix as we did in Chapter 2). A particular component $w_q$ is extracted from the vector $\boldsymbol{w}$ as follows:

$$w_q = \hat{\boldsymbol{e}}_q\cdot\boldsymbol{w}\,, \quad\text{where}\quad \hat{\boldsymbol{e}}_q = [0,\ldots,0,1,0,\ldots,0]^{\mathsf T} \text{ with the 1 in position } q\,. \tag{7.54}$$

Here $\hat{\boldsymbol{e}}_q$ is the Cartesian unit vector in the direction $q$, with components $[\hat{\boldsymbol{e}}_q]_j = \delta_{qj}$. In this notation, the elements of $\mathbb{M}$ are $M_{pq} = \partial^2 H/\partial w_p\partial w_q$. Now, eliminating the weight $w_q$ amounts to setting

$$\delta w_q = -w_q\,. \tag{7.55}$$
To minimise the damage to the network, we should eliminate the weight that has least effect upon H , changing the other weights at the same time so that H increases as little as possible (Figure 7.14). This is achieved by minimising
$$\min_{q}\;\min_{\delta\boldsymbol{w}}\big\{\tfrac{1}{2}\,\delta\boldsymbol{w}\cdot\mathbb{M}\,\delta\boldsymbol{w}\big\} \quad\text{subject to the constraint}\quad \hat{\boldsymbol{e}}_q\cdot\delta\boldsymbol{w} + w_q = 0\,. \tag{7.56}$$
The constant term $H_{\min}$ was dropped because it does not matter. Now we first minimise $H$ w.r.t. $\delta\boldsymbol{w}$, for a given value of $q$. The linear constraint is incorporated using a Lagrange multiplier as in Section 6.3, to form the Lagrangian

$$\mathscr{L} = \tfrac{1}{2}\,\delta\boldsymbol{w}\cdot\mathbb{M}\,\delta\boldsymbol{w} + \lambda\big(\hat{\boldsymbol{e}}_q\cdot\delta\boldsymbol{w} + w_q\big)\,. \tag{7.57}$$
Figure 7.14: Pruning algorithm (schematic). The minimum of $H$ is located at $[w_1, w_2]^{\mathsf T}$. The contours of the quadratic approximation to $H$ are represented as solid black lines. The weight change $\delta\boldsymbol{w} = [-w_1, 0]^{\mathsf T}$ (gray arrow) leads to a smaller increase in $H$ than $\delta\boldsymbol{w} = [0, -w_2]^{\mathsf T}$. The black arrow represents the optimal $\delta\boldsymbol{w}^*_q$ which leads to an even smaller increase in $H$.
A necessary condition for a minimum $[\delta\boldsymbol{w},\lambda]$ satisfying the constraint is

$$\frac{\partial\mathscr{L}}{\partial\,\delta\boldsymbol{w}} = \mathbb{M}\,\delta\boldsymbol{w} + \lambda\hat{\boldsymbol{e}}_q = 0 \quad\text{and}\quad \frac{\partial\mathscr{L}}{\partial\lambda} = \hat{\boldsymbol{e}}_q\cdot\delta\boldsymbol{w} + w_q = 0\,. \tag{7.58}$$

We denote the solution of these Equations by $\delta\boldsymbol{w}^*$ and $\lambda^*$. It is obtained by solving the linear system

$$\begin{bmatrix}\mathbb{M} & \hat{\boldsymbol{e}}_q\\ \hat{\boldsymbol{e}}_q^{\mathsf T} & 0\end{bmatrix}\begin{bmatrix}\delta\boldsymbol{w}^*\\ \lambda^*\end{bmatrix} = \begin{bmatrix}0\\ -w_q\end{bmatrix}\,. \tag{7.59}$$

If $\mathbb{M}$ is invertible, then the top rows of Eq. (7.59) give

$$\delta\boldsymbol{w}^* = -\mathbb{M}^{-1}\hat{\boldsymbol{e}}_q\,\lambda^*\,. \tag{7.60}$$

Inserting this result into $\hat{\boldsymbol{e}}_q^{\mathsf T}\delta\boldsymbol{w}^* + w_q = 0$ we find

$$\delta\boldsymbol{w}^* = -\mathbb{M}^{-1}\hat{\boldsymbol{e}}_q\, w_q\,\big(\hat{\boldsymbol{e}}_q^{\mathsf T}\mathbb{M}^{-1}\hat{\boldsymbol{e}}_q\big)^{-1} \quad\text{and}\quad \lambda^* = w_q\,\big(\hat{\boldsymbol{e}}_q^{\mathsf T}\mathbb{M}^{-1}\hat{\boldsymbol{e}}_q\big)^{-1}\,. \tag{7.61}$$

We see that $\hat{\boldsymbol{e}}_q\cdot\delta\boldsymbol{w}^* = -w_q$, so that the weight $w_q$ is eliminated. The other weights are also changed (black arrow in Figure 7.14). The final step is to find the optimal $q$ by minimising

$$\mathscr{L}(\delta\boldsymbol{w}^*,\lambda^*; q) = \tfrac{1}{2}\, w_q^2\,\big(\hat{\boldsymbol{e}}_q^{\mathsf T}\mathbb{M}^{-1}\hat{\boldsymbol{e}}_q\big)^{-1}\,. \tag{7.62}$$
The Hessian of the energy function is expensive to evaluate, and so is the inverse of this matrix. Usually one resorts to an approximate expression for $\mathbb{M}^{-1}$ [97]. One possibility is to set the off-diagonal elements of $\mathbb{M}$ to zero [98]. But in this case the other weights are not adjusted, because $\hat{\boldsymbol{e}}_{q'}\cdot\delta\boldsymbol{w}^*_q = 0$ for $q' \neq q$ if $\mathbb{M}$ is diagonal. In this case it is necessary to retrain the network after weight elimination.
The algorithm is summarised in Algorithm 5. It succeeds better than elimination by weight decay in removing the unnecessary weights in the network [97]. Weight decay eliminates the smallest weights. One obtains weight elimination of the smallest weights by substituting $\mathbb{M} = \mathbb{1}$ in the algorithm described above [Equation (7.62)]. Since small weights are often needed to achieve a small training error, this is usually not a good approximation.
To illustrate the effect of pruning for neural networks with hidden layers, consider the XOR function. Recall that it can be represented by a hidden layer with two neurons (Figure 5.17). For random initial weights, backpropagation takes a long time to find a valid solution, and networks with many more hidden neurons tend to perform much better [99]. The numerical experiments in Ref. [99] indicate that with two hidden neurons, only about 49% of the networks learned the task in 10 000 training steps of stochastic gradient descent, and networks with more neurons in the hidden layer learn more easily (98.5% for n = 10 hidden neurons). The data in Ref. [99] also show that pruned networks, initially trained with n = 10 hidden neurons, still show excellent training success (83.3% if only n = 2 hidden neurons remain). The networks were pruned iteratively during training, removing the neurons with the largest average magnitude. After training, the weights and thresholds were reset to their initial values, the values before training began.
One can draw three conclusions from the numerical experiments described in Ref. [99]. First, iterative pruning during training singles out neurons in the hidden layer that had initial weights and thresholds resulting in the correct decision boundaries (lottery-ticket effect [99]). Second, the pruned network with two hidden neurons has much better training success than the network that was trained with only two hidden neurons. Third, despite pruning more than 50% of the hidden neurons, the network with n = 4 hidden neurons performs almost as well as the one with n = 10 hidden neurons (97.9% training success). When training deep networks it is common to start with many neurons in the hidden layers, and to prune up to 90% of them. This results in small trained networks that can classify efficiently and reliably.

Algorithm 5  pruning least important weight
  1: train the network to reach $H_{\min}$;
  2: compute $\mathbb{M}^{-1}$ approximately;
  3: determine $q^*$ as the value of $q$ for which $\mathscr{L}(\delta\boldsymbol{w}^*,\lambda^*;q)$ is minimal;
  4: if $\mathscr{L}(\delta\boldsymbol{w}^*,\lambda^*;q^*) \ll H_{\min}$ then
  5:   adjust all weights using $\delta\boldsymbol{w} = -w_{q^*}\,\mathbb{M}^{-1}\hat{\boldsymbol{e}}_{q^*}\big(\hat{\boldsymbol{e}}_{q^*}^{\mathsf T}\mathbb{M}^{-1}\hat{\boldsymbol{e}}_{q^*}\big)^{-1}$;
  6:   goto 2;
  7: else end;
  8: end if
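A toy sketch (illustrative, with a made-up quadratic energy function; the exact Hessian is used only because the example is small, whereas in practice $\mathbb{M}^{-1}$ is approximated [97]) of one elimination step of Algorithm 5:

```python
# Sketch: eliminate the weight with the smallest saliency, Eqs. (7.61)-(7.62).
import numpy as np

rng = np.random.default_rng(5)
n = 6
A = rng.normal(size=(n, n))
M = A @ A.T + np.eye(n)              # stands in for the Hessian at the minimum
w = rng.normal(size=n)               # trained weights, grouped into one vector

Minv = np.linalg.inv(M)
saliency = 0.5 * w**2 / np.diag(Minv)          # L(dw*, lambda*; q), Eq. (7.62)
q = np.argmin(saliency)                        # least important weight

dw = -w[q] * Minv[:, q] / Minv[q, q]           # optimal change, Eq. (7.61)
w = w + dw                                     # w[q] is now (numerically) zero
print(q, w)
```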
7.6.3 Drop out
In this regularisation scheme some hidden neurons are ignored during training [95]. In each step of the training algorithm (for each mini batch, or for each individual pattern) one disregards at random a fraction q of neurons from each hidden layer by setting their outputs to zero, and by updating only the weights and thresholds of the remaining neurons. For the next step in the training algorithm, the ignored neurons are put back, and another set of hidden neurons is removed. Once the training is completed, all hidden neurons are activated. Their outputs are multiplied by 1 − q to ensure that the local fields are independent of q, on average.
This method is motivated by noting that the performance of machine-learning algorithms is usually improved by combining the results of several learning attempts [5, 95], for instance by separately training networks with different layouts, and aver- aging over their outputs. For deep networks this is computationally very expensive. Drop out is an attempt to achieve the same goal more efficiently. The idea is that dropout corresponds to effectively training a large number of different networks. If there are k hidden neurons, then there are 2k different combinations of neurons that are turned on or off. The hope is that the network learns more robust features of the input data in this way, and that this reduces overfitting. In practice the method is applied together with max-norm regularisation (Section 7.6.1).
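A sketch (illustrative) of the drop-out rule described above: during training a random fraction q of hidden outputs is set to zero; after training all neurons are active and their outputs are scaled by 1 − q.

```python
# Sketch: drop out applied to the outputs of one hidden layer.
import numpy as np

rng = np.random.default_rng(6)
q = 0.5                                   # fraction of neurons ignored per step

def hidden_outputs(V, training):
    if training:
        mask = rng.random(V.shape) >= q   # keep a neuron with probability 1 - q
        return V * mask                   # ignored neurons output zero
    return V * (1.0 - q)                  # after training: rescale all outputs

V = rng.normal(size=8)
print(hidden_outputs(V, training=True))
print(hidden_outputs(V, training=False))
```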
7.6.4 Expanding the training set
If one trains a network with a fixed number of hidden neurons on larger training sets, one observes that the network generalises with higher accuracy (better classification success). The reason is that overfitting is reduced when the training set is larger. Thus, a way of avoiding overfitting is to expand or augment the training set. It is sometimes argued that the recent success of deep neural networks in image recognition and object recognition is in large part due to larger training sets. One example is ImageNet, a database of more than $10^7$ hand-classified images, sorted into more than 20 000 categories [100]. It is expensive to expand training sets in this way because it requires manual annotation. An alternative is to expand the training set artificially. For digit recognition (Figure 2.1), one could create more input patterns by shifting, rotating, and shearing the digits, or by adding noise.
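A sketch (illustrative; the shift range and noise level are arbitrary choices, and the random array stands in for a real digit image) of such artificial expansion for 28 × 28 digit images; rotations and shears can be added in the same spirit.

```python
# Sketch: augment a training set of images by random shifts and noise.
import numpy as np

rng = np.random.default_rng(7)

def augment(image, n_copies=5):
    """image: 2d array, e.g. a 28 x 28 digit; returns perturbed copies."""
    copies = []
    for _ in range(n_copies):
        dx, dy = rng.integers(-2, 3, size=2)          # shift by up to 2 pixels
        shifted = np.roll(np.roll(image, dx, axis=0), dy, axis=1)
        noisy = shifted + rng.normal(scale=0.05, size=image.shape)
        copies.append(np.clip(noisy, 0.0, 1.0))
    return copies

digit = rng.random((28, 28))          # stands in for a real MNIST digit
extra = augment(digit)
print(len(extra), extra[0].shape)
```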
7.6.5 Batch normalisation
Batch normalisation [101] can significantly speed up the training of deep networks
with backpropagation. The idea is to shift and normalise the input data for each
hidden layer, not only for the input patterns (Section 6.3). This is done separately
for each mini batch (Section 6.2), and for each component of the inputs into the
layer (Algorithm 6). Denoting the states of the neurons feeding into the layer in question by $V_j^{(\mu)}$, $j = 1,\ldots,j_{\max}$, one calculates the average and variance over each mini batch,

$$\overline{V}_j = \frac{1}{m_B}\sum_{\mu=1}^{m_B} V_j^{(\mu)} \quad\text{and}\quad \sigma_B^2 = \frac{1}{m_B}\sum_{\mu=1}^{m_B}\big(V_j^{(\mu)} - \overline{V}_j\big)^2\,, \tag{7.63}$$

subtracts the mean from the $V_j^{(\mu)}$, and divides by $\sqrt{\sigma_B^2 + \varepsilon}$. The parameter $\varepsilon > 0$ is added to the denominator to avoid division by zero when $\sigma_B^2$ evaluates to zero. There are two additional parameters in Algorithm 6, $\gamma_j$ and $\beta_j$. They are learnt by backpropagation, just like the weights and thresholds. In general the new parameters are allowed to differ from layer to layer, $\gamma_j^{(l)}$ and $\beta_j^{(l)}$.
Batch normalisation was originally motivated by arguing that it reduces possible covariate shifts faced by hidden neurons in layer l: as the parameters of the neurons in the preceding layer l − 1 change, their outputs shift, thus forcing the neurons in layer l to adapt. However, in Ref. [102] it was argued that batch normalisation does not reduce the internal covariate shift, but that it speeds up the training by effectively smoothing the energy landscape.
Batch normalisation helps to combat the vanishing-gradient problem because it prevents the local fields of hidden neurons from growing. This makes it possible to use sigmoid functions in deep networks, because the distribution of inputs remains normalised.
It is sometimes argued that batch normalisation has a regularising effect, and it has been suggested [101] that batch normalisation can replace drop out (Section 7.6.3). It is also argued that batch normalisation may help the network to generalise better, in particular if each mini batch contains randomly picked inputs. Then batch normalisation corresponds to randomly transforming the inputs to each hidden neuron (by the randomly changing means and variances). This may help to make the learning more robust. There is no theory that proves either of these claims, but it is an empirical fact that batch normalisation often speeds up the training.
Algorithm 6  batch normalisation
  for $j = 1,\ldots,j_{\max}$ do
    calculate mean $\overline{V}_j \leftarrow \frac{1}{m_B}\sum_{\mu=1}^{m_B} V_j^{(\mu)}$
    calculate variance $\sigma_B^2 \leftarrow \frac{1}{m_B}\sum_{\mu=1}^{m_B}\big(V_j^{(\mu)} - \overline{V}_j\big)^2$
    normalise $\hat{V}_j^{(\mu)} \leftarrow \big(V_j^{(\mu)} - \overline{V}_j\big)/\sqrt{\sigma_B^2 + \varepsilon}$
    calculate outputs as: $g\big(\gamma_j\hat{V}_j^{(\mu)} + \beta_j\big)$
  end for
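A NumPy sketch (illustrative) of Algorithm 6 for one mini batch; here $\gamma_j$ and $\beta_j$ are fixed to one and zero, whereas in practice they are learnt by backpropagation.

```python
# Sketch: batch normalisation of the inputs into one layer, Algorithm 6.
import numpy as np

rng = np.random.default_rng(8)
m_B, j_max = 32, 10                       # mini-batch size, neurons feeding in
V = rng.normal(loc=2.0, scale=3.0, size=(m_B, j_max))
gamma = np.ones(j_max)                    # in practice learnt by backpropagation
beta = np.zeros(j_max)
eps = 1e-8

mean = V.mean(axis=0)                     # V-bar_j, averaged over the mini batch
var = V.var(axis=0)                       # sigma_B^2
V_hat = (V - mean) / np.sqrt(var + eps)   # normalised inputs
out = np.tanh(gamma * V_hat + beta)       # g(gamma_j V-hat_j + beta_j), g = tanh
print(V_hat.mean(axis=0).round(3), V_hat.std(axis=0).round(3))
```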
7.7 Summary
Neural networks with many layers of hidden neurons are called deep networks. Error backpropagation in deep networks suffers from the vanishing-gradient problem. It can be reduced by using ReLU units, by initialising the weights in certain ways, and with networks containing connections that skip layers. Yet vanishing or exploding gradients remain a fundamental difficulty, slowing learning down in the initial phase of training. Nevertheless, convolutional neural networks have become immensely successful in object recognition, outperforming other algorithms significantly.
Since deep networks contain many free parameters, they tend to overfit the training data. Apart from cross validation, there are other ways of regularising the problem: weight decay, drop out, pruning, and data-set augmentation.
7.8 Further reading
Deep networks suffer from catastrophic forgetting: when we train a network on a new input distribution that is quite different from the one the network was originally trained on, then the network tends to forget what it learned initially. A good starting point for further reading is Ref. [103].
The stochastic-gradient descent algorithm (with or without minibatches) samples the input-data distribution uniformly randomly. As mentioned in Section 6.3, it may be advantageous to sample those inputs more frequently that initially cause larger output errors. More generally, the algorithm may use other criteria to choose certain input data more often, with the goal to speed up learning. It may even suggest how to augment a given training set most efficiently, by asking to specifically label certain types of input data (active learning) [104].
Another question concerns the structure of the energy landscape for multilayer perceptrons. It seems that local minima are perhaps less important for deep networks than for Hopfield networks, because the energy functions of deep networks tend to have more saddle points than minima [105], just
like Gaussian random functions [106]. A recent study explores the relation between the multilayer layout of the perceptron network and the properties of the energy landscape [107].
Finally, training tends to work best when all input patterns appear with roughly the same frequency in the training set. Unlike humans, neural networks tend to struggle with rare input patterns. Special techniques, however, allow networks to recognise rare patterns, by comparison with features of the input distribution that are well represented (few-shot learning [108]). Standard algorithms for few-shot learning use elements of unsupervised learning (Chapter 10).
7.9 Exercises
7.1 Pruning. Show that the expression (7.61) for the weight increment $\delta\boldsymbol{w}^*$ minimises the Lagrangian (7.57) subject to the constraint (7.55).
7.2 Decision boundaries for XOR problem. Figure 7.5 shows the layout of a net- work that solves the Boolean XOR problem. Draw the decision boundaries for the four hidden neurons in the input plane, and label the boundaries and the regions as in Figure 5.15.
7.3 Vanishing-gradient problem. Train the network shown in Figure 7.8 on the iris data set, available from the Machine learning repository of the University of California Irvine. Measure the effect upon the neurons in the different layers by numerically evaluating the derivative of the energy function H w.r.t. their thresholds.
7.4 Residual network. Derive Equation (7.34) for the error δ(l) in layer l of the residual ‘network’ shown in Figure 7.13.
7.5 Log-likelihood. The log-likelihood function (7.38) is an energy function for softmax output neurons. Show that the function has a global minimum at $O_i^{(\mu)} = t_i^{(\mu)}$.
7.6 Cross-entropy function. The cross-entropy function (7.44) is an energy function for sigmoid output neurons. Write down a cross-entropy function for tanh output neurons, and show that it has a global minimum at $O_i^{(\mu)} = t_i^{(\mu)}$, where the function takes the value zero.
7.7 Softmax outputs. Consider a network with $L$ layers with softmax outputs $O_i^{(\mu)}$. Compute the derivative of $O_i^{(\mu)}$ with respect to the local field $b_m^{(L,\mu)}$ of output neuron $m$. The network is trained by gradient descent on the negative log-likelihood function $H = -\sum_{i\mu} t_i^{(\mu)}\log O_i^{(\mu)}$. The targets $t_i^{(\mu)}$ satisfy the constraint $\sum_i t_i^{(\mu)} = 1$, for all patterns $\mu$. Derive the stochastic gradient-descent learning rule for the weights $w_{mn}^{(L)}$ in layer $L$.
7.8 Generalised XOR function. The parity function can be viewed as a generalisation of the XOR function to N > 2 input dimensions, because it becomes the XOR function for N = 2. Another way to generalise the XOR function to N > 2-dimensional inputs is to define a Boolean function that gives unity if exactly one of its inputs equals unity. Otherwise the function evaluates to zero. Construct networks that represent this function, for N = 3 and N = 4.
Figure 8.1: Images of iris flowers. From left to right: iris setosa (copyright T. Monto), iris versicolor (copyright R. A. Nonenmacher), and iris virginica (copyright A. Westermoreland). All images are copyrighted under the creative commons license.
8 Convolutional networks
Convolutional networks have been around since the 1980’s. They became widely used after Krizhevsky et al. [109] won the ImageNet challenge (Section 8.5) with a convolutional net. One reason for the recent success of convolutional networks is that they have fewer connections than fully connected networks with the same number of neurons. This has two advantages. Firstly, such networks are obviously cheaper to train. Secondly, as pointed out above, reducing the number of connections regularises the network: it reduces the risk of overfitting.
Convolutional neural networks are designed for object recognition and image classification. They take images as inputs (Figure 8.1), not just a list of attributes (Figure 5.1). Convolutional networks have important properties in common with networks of neurons in the visual cortex of the human brain [4]. First, there is a spatial array of input terminals. For image analysis this is the two-dimensional array of bits shown in Figure 8.2(a). Second, neurons are designed to detect local features of the image (such as edges or corners for instance). The maps learned by such neurons, from inputs to output, are referred to as feature maps. Since these features occur in different parts of the image, one uses the same kernel (or filter) everywhere in the image, always with the same weights and thresholds. Since these kernels are local, and since they act in a translational-invariant way, the number of neurons from the two-dimensional input array is greatly reduced, compared with fully connected networks. Feature maps are obtained by convolution of the kernel with the input image. Therefore, layers consisting of a number of feature maps corresponding to different kernels are also referred to as convolution layers, Figure 8.2(b).
Convolutional networks can have many convolution layers. The idea is that the additional layers can learn more abstract features (Section 8.7). Apart from feature maps, convolutional networks contain other types of layers. Pooling layers perform local averages of the output of convolution layers, to speed up learning by reducing
Figure 8.2: (a) Feature map, kernel, and receptive field (schematic). A feature map (the 8 × 8 array of hidden neurons) is obtained by translating a kernel (filter) with a 3 × 3 receptive field over the input image, a 10 × 10 array of pixels. (b) A convolution layer usually consists of several feature maps, each corresponding to a kernel that detects a certain feature in the input image. After a figure in Ref. [5].
the number of variables. Convolutional networks may also contain several fully connected layers.
8.1 Convolution layers
Figure 8.2(a) illustrates how a feature map is obtained by convolution of the input image with a kernel which reads a 3 × 3 part of the input image [5]. In analogy with the terminology used in neuroscience, this 3×3 array is called the local receptive field of the kernel. The outputs of the kernel from different parts of the input image make up the feature map, here an 8×8 array of hidden neurons: neuron V11 connects to the 3 × 3 area in the upper left-hand corner of the input image. Neuron V12 connects to a shifted area, as illustrated in Figure 8.2(a), and so forth. Since the input has 10 × 10 pixels, the dimension of the feature map is 8 × 8 in this example. The important point is that the neurons V11 and V12, and all other neurons in this convolution layer, share their weights and the threshold. In the example shown in Figure 8.2(a) there are thus only nine independent weights, and one threshold. Since the different hidden neurons share weights and thresholds, their computation rule is a discrete
convolution [4]:
$$V_{ij} = g\Big(\sum_{p=1}^{3}\sum_{q=1}^{3} w_{pq}\, x_{p+i-1,\,q+j-1} - \theta\Big)\,. \tag{8.1}$$
In Figure 8.2(a) the local receptive field is shifted by one pixel at a time. Sometimes it is useful to use a larger stride $[s_1, s_2]$, to shift the receptive field by $s_1$ pixels horizontally and by $s_2$ pixels vertically. Also, the local receptive regions need not have size 3 × 3. If we assume that their size is $P\times Q$, and that $s_1 = s_2 = s$, the rule (8.1) takes the form

$$V_{ij} = g\Big(\sum_{p=1}^{P}\sum_{q=1}^{Q} w_{pq}\, x_{p+s(i-1),\,q+s(j-1)} - \theta\Big)\,. \tag{8.2}$$
Figure 8.2(a) depicts a two-dimensional input array. For colour images there are three colour channels; in this case the input array is three-dimensional, and the input bits are labeled by three indices: two for position and the last one for colour, $x_{pqr}$. Usually one connects several feature maps with different kernels to the input layer, as shown in Figure 8.2(b). The different kernels detect different features of the input image, one detects edges for example, and another one detects corners, and so forth. To account for these extra dimensions, one groups weights (and thresholds) into higher-dimensional arrays (tensors). The convolution takes the form:

$$V_{ijk} = g\Big(\sum_{p=1}^{P}\sum_{q=1}^{Q}\sum_{r=1}^{R} w_{pqrk}\, x_{p+s(i-1),\,q+s(j-1),\,r} - \theta_k\Big) \tag{8.3}$$

(see Figure 8.3). All neurons in a given feature map have the same threshold $\theta_k$. The software package TensorFlow [110] is designed to efficiently perform tensor operations as in Equation (8.3).
If one couples several convolution layers together, the number of neurons in these layers decreases as one moves to the right. To avoid this, one can pad the image (and the convolution layers) by adding rows and columns of bits equal to zero [4]. In Figure 8.2(a), for example, one obtains a convolution layer of the same dimension as the original image by adding one column each on the left-hand and right-hand sides of the image, as well as two rows, one at the bottom and one at the top. In general, the numbers of rows and columns need not be equal, so the amount of padding is specified by four numbers, [p1,p2,p3,p4].
Figure 8.3: Illustration of summation in Equation (8.3). Each feature map has a receptive field of dimension P × Q × R. There are K feature maps, each of dimension I × J.

Convolution layers are trained with backpropagation. Consider the simplest case, Equation (8.1). As usual, we use the chain rule to evaluate the gradients:

$$\frac{\partial V_{ij}}{\partial w_{mn}} = g'(b_{ij})\,\frac{\partial b_{ij}}{\partial w_{mn}}\,, \tag{8.4}$$

with local field $b_{ij} = \sum_{pq} w_{pq}\, x_{p+i-1,\,q+j-1} - \theta$. The derivative of $b_{ij}$ is evaluated by applying rule (5.25):

$$\frac{\partial b_{ij}}{\partial w_{mn}} = \sum_{pq}\delta_{mp}\delta_{nq}\, x_{p+i-1,\,q+j-1}\,. \tag{8.5}$$

In this way one can train networks with several stacked convolution layers too. It is important to keep track of the summation boundaries. To that end it helps to pad out the image and the convolution layers, so that the upper bounds remain the same in different layers.
Details aside, the fundamental principle of feature maps is that the map is applied in the same form to different parts of the image (translational invariance). In this way, each weight in a given feature map is trained on different parts of the image. This effectively increases the training set for the feature map and combats overfitting.
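A NumPy sketch (illustrative, not from the book) of the discrete convolution (8.2) for one feature map, with receptive field P × Q, stride s, and no padding; the array sizes follow the example of Figure 8.2(a).

```python
# Sketch: one feature map computed with the rule (8.2).
import numpy as np

def feature_map(x, w, theta, s=1, g=np.tanh):
    """x: input image, w: P x Q kernel, theta: threshold, s: stride."""
    P, Q = w.shape
    I = (x.shape[0] - P) // s + 1          # height of the feature map
    J = (x.shape[1] - Q) // s + 1          # width of the feature map
    V = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            patch = x[s * i:s * i + P, s * j:s * j + Q]
            V[i, j] = g(np.sum(w * patch) - theta)
    return V

rng = np.random.default_rng(9)
x = rng.random((10, 10))                   # 10 x 10 input image, as in Fig. 8.2(a)
w = rng.normal(size=(3, 3))                # 3 x 3 kernel
print(feature_map(x, w, theta=0.1).shape)  # (8, 8) feature map
```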
8.2 Pooling layers
Pooling layers process the output of convolution layers. A neuron in a pooling layer takes the outputs of several neighbouring feature maps and compresses the outputs into a single number [5]. There are no weights or thresholds associated with pooling layers. Max-pooling units, for example, take the maximum over several nearby feature-map outputs. Alternatively, one may compute the root-mean square of the map values (L2-pooling). Other ways of pooling are discussed in Ref. [4]. Just as for convolution layers, we need to specify stride and padding for pooling layers.
Pooling is performed independently on each feature map [5]. The network layout looks like the one shown schematically in Figure 8.4. In this Figure, the pooling layers connect to a number of fully connected hidden layers that connect to the output neurons. There are as many output neurons as there are classes to be recognised. This layout is similar to the layout used by Krizhevsky et al. [109] in the ImageNet challenge (see Section 8.5).

Figure 8.4: Layout of a convolutional neural network for object recognition and image classification (schematic). The inputs are stored in a 10 × 10 array. They feed into a convolution layer with four different feature maps with 3 × 3 kernels, stride [1,1], and zero padding. Each convolution layer connects to its own max-pooling layer (each pooling unit takes the maximum over a 2 × 2 array of feature-map outputs), with stride [2,2] and zero padding. Between these and the output layer are a couple of fully connected hidden layers. After a figure in Ref. [5].

Figure 8.5: Examples of digits from the MNIST data set of handwritten digits [111]. The images were produced using MATLAB. Copyright for the data set: Y. LeCun and C. Cortes.
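A sketch (illustrative) of max-pooling with a 2 × 2 window and stride [2, 2], applied to one feature map, as in Figure 8.4:

```python
# Sketch: max-pooling of an 8 x 8 feature map with 2 x 2 windows, stride 2.
import numpy as np

def max_pool(V, size=2, stride=2):
    I = (V.shape[0] - size) // stride + 1
    J = (V.shape[1] - size) // stride + 1
    out = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            out[i, j] = V[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

rng = np.random.default_rng(10)
V = rng.random((8, 8))                     # output of one feature map
print(max_pool(V).shape)                   # (4, 4)
```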
8.3 Learning to read handwritten digits
Figure 8.5 shows patterns from the MNIST data set of handwritten digits [111]. The data set derives from a data set compiled by the National Institute of Standards and Technology (NIST), of digits handwritten by high-school students and employees of the United States Census Bureau. The data contains 60 000 images of digits, each with 28 × 28 pixels, and a test set of 10 000 digits. The images are grayscale with 8-bit
resolution, so each pixel contains a value ranging from 0 to 255. The images in the database were preprocessed. The procedure is described on the MNIST home page. Each original binary image from the National Institute of Standards and Technology was represented as a 20×20 gray-scale image, preserving the aspect ratio of the digit. The resulting image was placed in a 28×28 image so that the centre-of-mass of the image coincided with its geometrical centre. These preprocessing steps improve the performance of the algorithm.
The goal of this Section is to show how the principles introduced so far allow neural networks to learn the MNIST data with low classification error, following Ref. [5]. As described in Chapter 6, one divides the data set into a training set and a validation set, with 50 000 digits and 10 000 digits, respectively [5]. The validation set is used for cross validation. The test data set allows to measure the classification error after training. For this purpose, one must use a data set that was not involved in the training. As described in Section 6.3, the inputs are preprocessed further by subtracting the mean image averaged over the whole training set from each input image [Equation (6.21)].
To find good parameter values and network layouts is one of the main difficulties when training a neural network, and it usually requires experimenting. There are recipes for finding certain parameters [112], but the general approach is still trial and error [5]. Consider first a network with one hidden layer with ReLU activation functions (Section 7.3), and a softmax output layer (Section 7.5) with ten outputs Oi and energy function (7.38). Output Oi is interpreted as the probability that the pattern fed to the network falls into category i . The network is trained with stochastic gradient descent with momentum, Equation (6.31). The learning rate is set to η = 0.001, and the momentum constant to α = 0.9. The mini-batch size [Equation (6.18)] equals 8192. Cross validation and early stopping is implemented as follows: during training, the algorithm keeps track of the smallest validation error observed so far. Training stops when the validation error was larger than the minimum for a specified number of times, equal to 5 in this case.
Figure 8.6: Energy functions for the MNIST training set (solid lines) and for the validation set (dashed lines) for a fully connected hidden layer with 30 neurons, and for a similar algorithm, but with 100 neurons in the hidden layer. The data was smoothed and the plot is schematic. The x-axis shows iterations. One iteration corresponds to feeding one minibatch of patterns. One epoch consists of 50000/8192 ≈ 6 iterations. Based on simulations performed by Oleksandr Balabanov.

Figure 8.6 shows how the training and the validation energies decrease during training, for networks with 30 and 100 hidden neurons [5]. One epoch corresponds to applying p patterns, or p/m_B = 50000/8192 iterations (Section 6.1). The energies are a little lower for the network with 100 hidden neurons. But one observes overfitting in both cases: after many training steps the validation energy is much higher than the training energy. Early stopping caused the training of the larger network to abort after 135 epochs; this corresponds to 824 iterations. The resulting classification accuracy is about 97.2% for the network with 100 hidden neurons. It is difficult to increase the classification accuracy by adding more hidden layers, most likely because the network overfits the data (Section 6.4). This problem becomes more acute as one adds more hidden neurons. The tendency of the network to overfit is reduced by regularisation (Section 7.6). For the network with one hidden layer with 100 ReLU neurons, L2-regularisation improves the classification accuracy to almost 98%.
Convolutional networks can be optimised to yield higher classification accuracies than those quoted above. A convolutional network with one convolution layer with 20 feature maps, a max-pooling layer, and a fully connected hidden layer with 100 ReLU neurons, similar to the network shown schematically in Figure 8.7, gives classification accuracy only slightly above 98% after training for 60 epochs. Adding a second convolution layer and batch normalisation (Section 7.6.5) gives a classification accuracy of 98.99% after 30 epochs (this layout is similar to a layout described in MathWorks [114]). The accuracy can be improved further by tuning parameters and network layout, and by using ensembles of convolutional neural networks [111]. The best classification accuracy found in this way is 99.77% [113]. Several of the MNIST digits are difficult to recognise for humans too (Figure 8.8). It is not surprising that the network fails on these digits. The above examples show also that it takes some experimenting to find the right parameters and network layout, as well as long training times to reach the best classification accuracies. It could be argued that one reaches a stage of diminishing returns as the classification error falls below a fraction of a percent.
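A sketch of such a network in Keras (illustrative, not the book's code; the layer sizes follow the description above, the remaining settings are arbitrary defaults, and the data pipeline is omitted):

```python
# Sketch: convolutional network for MNIST, similar to Figure 8.7 (illustrative).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(20, kernel_size=3, activation="relu"),  # 20 feature maps
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),    # fully connected hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # softmax outputs, Eq. (7.36)
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
              loss="categorical_crossentropy",        # log-likelihood (7.38)
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=8192, epochs=60,
#           validation_data=(x_val, y_val))           # data loading not shown
```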
Figure 8.7: Convolutional network that classifies the handwritten digits in the MNIST data set (schematic).
8.4 Coping with deformations of the input distribution
How well does a MNIST-trained convolutional network classify your own hand-written digits? Figure 8.9(a) shows examples of digits drawn by colleagues at the University of Gothenburg, preprocessed in the same way as the MNIST data. Using a MNIST-trained convolutional network on these digits yields a classification accuracy of about 90%, substantially lower than the accuracies quoted in the previous Section.
Figure 8.8: Some hand-written digits from the MNIST test set, misclassified by a convolutional network that achieved an overall classification accuracy of 98%. Target (top right), network output (bottom right). Data from Oleksandr Balabanov. After a figure in Ref. [5], see also Fig. 2(c) in Ref. [113].

Figure 8.9: (a) Non-MNIST hand-written digits, preprocessed like the MNIST digits. (b) Same digits, except that the thickness of the stroke was normalised (see text). Data from Oleksandr Balabanov.

A possible cause is that the digits in Figure 8.9(a) have a more slender stroke than those in Figure 8.5. It was suggested in Ref. [115] that differences in line thickness can confuse algorithms designed to read hand-written text [116]. There are different methods for normalising the line thickness of hand-written text. Applying the method proposed in Ref. [116] to our digits results in Figure 8.9(b). The algorithm has a free parameter, T, that specifies the line thickness. In Figure 8.9(b) it was taken to be T = 10, close to the average line thickness of the MNIST digits, which is approximately T ≈ 9.7. If we run a MNIST-trained convolutional network on a data set of 60 digits with normalised line thickness, it fails on only two digits. This corresponds to a classification accuracy of roughly 97%, not so bad – yet not as good as the best results in Section 8.3. But note that this estimate of the classification accuracy is not very precise, because the test set had only 60 digits. To obtain a better estimate, more test digits are needed.
A question is of course whether there are other significant differences between our non-MNIST hand-written digits and those in the MNIST data. At any rate, the results of this Section raise a point of fundamental importance. We have seen that convolutional networks can be trained to represent a distribution of input patterns with very high accuracy. But the network may not work as well on a data set with a slightly different input distribution, perhaps because the patterns were preprocessed differently, or because they were slightly deformed in other ways.
Figure 8.10: Object recognition using a deep convolutional network. Shown is a frame from a movie recorded by a data-collection vehicle of the company Zenseact. The neural net recognises pedestrians, cars, and lorries, and localises them in the image by bounding boxes. Copyright © Zenseact AB 2020. Reproduced with permission.
8.5 Deep learning for object recognition
Deep learning has become so popular in the last few years because deep convo- lutional networks are good at recognising objects in images. Figure 8.10 shows a frame from a movie taken by a data-collection vehicle. A convolutional network was trained to recognise objects, and to localise them in the image by means of bounding boxes around the objects.
Convolutional networks excel at this task, as demonstrated by the ImageNet large-scale visual recognition challenge (ILSVRC) [117], a competition for object recognition and localisation in images, based upon the ImageNet database [100]. The challenge is based on a subset of ImageNet. The training set contains more than $10^6$ images manually classified into one of 1000 classes. There are approximately 1000 images for each class. The validation set contains 50 000 images.
The ILSVRC challenge consists of several tasks. One task is image classification, to list the object classes found in the image. A common measure for accuracy is the so-called top-5 error for this classification task. The algorithm lists the five object classes with highest softmax outputs. The result is defined to be correct if
Figure 8.11: Smallest classification error for the ImageNet challenge [117]. The data up to 2014 comes from Ref. [117]. The data for 2015 comes from Ref. [92], for 2016 from Ref. [118], and for 2017 from Ref. [119]. From 2012 onwards the smallest error was achieved by convolutional neural networks. After Fig. 1.12 in Goodfellow et al. [4].
the annotated class is among these five. The error equals the fraction of incorrectly classified images. Why does one not simply judge whether the most probable class is the correct one? The reason is that the images in the ImageNet database are annotated by a single-class identifier. Often this is not unique. The image in Figure 8.5, for example, shows not only a car but also trees, yet the image is annotated with the class label car. The resulting classification ambiguity is reduced by considering the top five softmax outputs, and checking whether the annotated class is among them.
The tasks in the ILSVRC challenge are significantly more difficult than the digit recognition described in Section 8.3. One reason is that the ImageNet classes are organised into a deep hierarchy of subclasses. This results in highly specific subclasses that can be very difficult to distinguish. The algorithm must be very sensitive to small differences between similar subclasses. We say that the algorithm must have high inter-class variability [120]. Different images in the same subclass, on the other hand, may look quite different. The algorithm should nevertheless recognise them as similar, belonging to the same class; the algorithm should have small intra-class variability [120].
Since 2012, algorithms based on deep convolutional networks won the ILSVRC challenge. Figure 8.11 shows that the error has significantly decreased until 2017, the last year of the challenge in the form described above. We saw in previous Sections that deep networks are difficult to train. So how can these algorithms work so well? It is generally argued that the recent success of deep convolutional networks is mainly due to three factors.
First, there are now much larger and better annotated training sets available.
Figure 8.12: Reproduced from xkcd.com/1897 under the creative commons attribution-noncommercial 2.5 license.
ImageNet is an example. Excellent training data is now recognised as one of the most important factors. Companies developing software for self-driving cars and systems that help to avoid accidents recognise that good training sets are indispensable. At the same time, it is a challenge to create high-quality training data, because one must manually collect and annotate the data (Figure 8.12). This is costly, also because it is important to have as large data sets as possible, in order to reduce overfitting. In addition one must aim for a large variability in the collected data. Second, the hardware is much better today. Deep networks are nowadays implemented on single or multiple GPUs; there are even dedicated chips. Third, improved regularisation techniques (Section 7.6) help to fight overfitting, and skipping connections (Section 7.4) render the networks less susceptible to the vanishing-gradient problem (Section 7.2).
The winning algorithm for 2012 was based on a network with five convolution layers and three fully connected layers, using dropout, ReLU units, and data-set augmentation [109]. The algorithm was implemented on GPU processors. The 2013 ILSVRC challenge was also won by a convolutional network [121], with 22 layers. Nevertheless, the network has substantially fewer free parameters (weights and thresholds) than the 2012 network: 4 × 10⁶ instead of 60 × 10⁶. In 2015, the winning algorithm [92] had 152 layers. One significant new element in the layout was the idea to allow connections that skip layers (Section 7.4). The best algorithms in 2016 [122] and 2017 [119] used ensembles of convolutional networks, where the classification is based on the ensemble average of the outputs.
8.6 Summary
Convolutional networks can be trained to recognise objects in images with high accuracy. An advantage of convolutional networks is that they have fewer weights than fully connected networks with the same number of neurons, and that the weights of a given feature map are trained on different parts of the input images, effectively increasing the size of the training set. This helps against overfitting. Another view is that the hidden neurons are forced to agree on a particular choice of weights, they must compromise. This yields a more robust training result.
It is sometimes stated that convolutional networks are now better than humans, in that they recognise objects with lower classification errors than humans [123]. This and similar statements refer to an experiment showing that the human classification error in recognising objects in the ImageNet database is about 5.1% [124], worse than the most recent convolutional neural-network algorithms (Figure 8.11).
This notion is not unproblematic, for several reasons. To begin with, the article [123] refers to the 2015 ILSVRC competition, where the scores of the best algorithms were quite similar, and it has been debated whether interpreting the rules of the competition in different ways allowed competitors to gain an advantage. Second, and more importantly, it is clear that these algorithms learn in quite a different way from humans. The algorithms can detect local features, but since these convo- lutional networks rely on translational invariance, they do not easily understand global features, and can mistake a leopard-patterned sofa for a leopard [125]. It may help to include more leopard-patterned sofas in the training set, but the essential difficulty remains: translational invariance imposes constraints on what convolu- tional networks can learn [125]. More fundamentally one may argue that humans learn differently, by abstraction instead of going through very large training sets.
We have also seen that convolutional networks are sensitive to small changes in the input data. Convolutional networks excel at learning the properties of a given input distribution, but they may have difficulties in recognising patterns sampled from a slightly different distribution, even if the two distributions appear to be very similar to the human eye. Note also that this problem cannot be solved by cross validation, because training and validation sets are drawn from the same input distribution, but here we are concerned with what happens when the network is applied to an input distribution different from the one it was trained on.
Here is another example illustrating this point: the authors of Ref. [126] trained a convolutional network on perturbed grayscale images from the ImageNet data base, adding a little bit of noise independently to each pixel (white noise) before training. This network failed to recognise images that were weakly perturbed in a different way, by setting a small number of pixels to white or black. But when we look at the images, we have no difficulties seeing through the noise.
Refs. [127, 128] illustrate intriguing failures of convolutional networks [5]. Szegedy et al. [127] demonstrate that the way convolutional networks partition input space can lead to unexpected results. The authors took an image that the network classifies correctly with high confidence, and perturbed it slightly. The perturbation was not random, but specifically designed to push the input pattern over a decision boundary. The difference between the original and perturbed images (adversarial images) is undetectable to the human eye, yet the network misclassifies the perturbed image with high confidence [127]. This reflects the fact that decision boundaries are always close in high-dimensional input space.
Figure 1 in Ref. [128] shows images that are completely unrecognisable to the human eye. Yet a convolutional network classifies these images with high confidence. This illustrates that there is no telling what a network may do if the input is far away from the training distribution. Unfortunately the network can sometimes be highly confident yet wrong. Nevertheless, despite these problems, deep convolutional networks have enjoyed tremendous success in image classification during the past years, and they have found widespread use in industry and science.
Finally, the fundamental mechanisms of deep learning are fairly well understood, but many open questions remain. It is fair to say that the theory of deep learning has lagged behind the practical successes, although some progress has been made in recent years.
8.7 Further reading
The online book of Nielsen [5] is an excellent introduction to convolutional neu- ral networks, and guides the reader through all the steps required to program a convolutional network to recognise hand-written digits.
What do the hidden layers in a convolutional network actually compute? Feature maps that are directly coupled to the inputs detect local features, such as edges or corners. Yet it is unclear precisely how hidden convolutional layers help the network to learn. To which input features do the neurons of a certain hidden layer react most strongly? Input patterns chosen to maximise the outputs of neurons in a given layer [129, 130] reveal intricate geometric structures that defy straightforward interpretation. An example is shown on the cover image of this book, see also Exercise 8.7.
It has been suggested that more general models, normally used for natural- language processing, may outperform convolutional nets [131] in image-processing tasks when there is enough data. An advantage is that these models do not rely on translational invariance, unlike convolutional networks.
Figure 8.13: Layout of the convolutional network for Exercise 8.1.

8.8 Exercises
8.1 Number of parameters of a convolutional network. A convolutional network has the following layout (Figure 8.13): an input layer of size 21 × 21 × 3, a convolutional layer with ReLU activations with 16 kernels with local receptive fields of size 2 × 2, stride [1, 1], and padding [0, 0, 0, 0], a max-pooling layer with local receptive field of size 2 × 2, stride [2, 2], and padding [0, 0, 0, 0], a fully connected layer with 20 neurons with sigmoid activations, and a fully connected output layer with 10 neurons. In one or two sentences, explain the function of each of the layers. Enter the values of the parameters x₁, y₁, z₁, x₂, …, y₅ into Figure 8.13 and determine the number of trainable parameters (weights and thresholds) for the connections into each layer of the network.
8.2 Convolutional network. Figure 8.4 shows the schematic layout of a convolutional net. Explain how a convolution layer works. In your discussion, refer to the terms convolution, colour channel, receptive field, feature map, stride, and explain the meaning of the parameters in the computation rule
$$V_{ij} = g\Bigg(\sum_{p=1}^{P}\sum_{q=1}^{Q} w_{pq}\, x_{p+s(i-1),\,q+s(j-1)} - \theta\Bigg). \qquad (8.6)$$
Explain how a pooling layer works, and why it is useful.
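For orientation, a minimal NumPy sketch of the computation rule (8.6) for a single feature map with one input channel, ReLU activation, and stride s (no padding; all names are illustrative):

import numpy as np

def feature_map(x, w, theta, s=1):
    # x: input image, w: P-by-Q kernel, theta: threshold, s: stride
    P, Q = w.shape
    I = (x.shape[0] - P) // s + 1    # output height
    J = (x.shape[1] - Q) // s + 1    # output width
    V = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            # local field: kernel applied to the local receptive field, Eq. (8.6)
            b = np.sum(w * x[s*i:s*i+P, s*j:s*j+Q]) - theta
            V[i, j] = max(b, 0.0)    # ReLU activation
    return V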
8.3 Feature map. The two patterns shown in Figure 8.14(a) are processed by a very simple convolutional network that has one convolution layer with one single 3 × 3 kernel with ReLU neurons, zero threshold, weights as given in Figure 8.14(b), and stride [1,1]. The resulting feature map is fed into a 3 × 3 max-pooling layer
Figure 8.14: (a) Input patterns with 0/1 bits (one symbol in the Figure corresponds to x_i = 0, the other to x_i = 1). (b) 3 × 3 kernel of a feature map. ReLU neurons, zero threshold, weights either 0 or 1 (one symbol corresponds to w = 0, the other to w = 1). (Exercise 8.3).
with stride [1,1]. Finally there is a fully connected classification layer with two output neurons with Heaviside activation functions (binary threshold units). For both patterns, determine the resulting feature map and the output of the max-pooling layer. Find weights and thresholds of the classification layer that allow one to classify the two patterns into different classes.
8.4 Distorted MNIST digits. Train a convolutional network on the MNIST data set. Distort the patterns in the test set by adding noise, in two different ways. First, choose q pixels randomly and make them black (or leave them black). Second, choose q pixels randomly and make them white (or leave them white). Vary q and investigate the performance of the convolutional network for both noisy test sets.
8.5 CIFAR-10 data set. Train a fully connected network with two hidden layers on the CIFAR-10 data set [132], and minimise the classification error by optimising the network parameters. Compare with the performance of an optimised convolutional network.
8.6 Bars-and-stripes data set. Construct a convolutional network to classify the patterns of the bars-and-stripes data set (Figure 4.4) into patterns with bars (black columns) and stripes (black rows). Use a convolution layer with at most four 2 × 2 kernels, one pooling layer, and one fully connected layer for classification. Give all parameters of the network (weights, thresholds, padding and stride where relevant).
8.7 Visualise activations of hidden neurons. Train a convolutional network on the CIFAR-10 data set [132], and for each feature map construct the input patterns that achieve maximal activation. Use gradient ascent to modify the values of the input pixels in order to find these patterns. See cover image of this book, and Refs. [129, 130].
9 Supervised recurrent networks
The layout of the perceptrons analysed in the previous Chapters is special. All connections are one way, and only to the layer to the right, so that the update rule for the i -th neuron in layer l becomes, for example,
$$V_i^{(l)} = g\Big(\sum_j w_{ij}^{(l)} V_j^{(l-1)} - \theta_i^{(l)}\Big). \qquad (9.1)$$
The backpropagation algorithm relies on this feed-forward layout. It means that the derivatives $\partial V_j^{(l-1)}/\partial w_{mn}^{(l)}$ vanish. This ensures that the outputs are nested functions of the inputs, which in turn implies the simple iterative structure of the backpropagation algorithm (Chapter 6).
In some cases it is necessary or convenient to use networks that do not have this simple layout. The Hopfield networks discussed in Part I are examples where all connections are symmetric. More general networks may have a feed-forward layout with feedbacks, as shown in Figure 9.1. Such networks are called recurrent networks. There are many different ways in which the feedbacks can act: from the output layer to hidden neurons for example, or there could be connections between the neurons in a given layer. Neurons 3 and 4 in Figure 9.1 are output neurons; they are associated with targets just as in Chapters 5 to 7. The layout of recurrent networks is very general, but because of the feedbacks we must consider how such networks can be trained.
Unlike multilayer perceptrons that represent an input-to-output mapping in terms of nested activation functions, recurrent networks are used as dynamical networks, where the iteration index t replaces the layer index l:
$$V_i(t) = g\Big(\sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Big) \quad\text{for } t = 1, 2, \ldots. \qquad (9.2)$$
See Figure 9.1 for a definition of the different weights; the parameters $\theta_i^{(v)}$ are thresholds. Equation (9.2) is analogous to the deterministic McCulloch-Pitts dynamics of Hopfield networks and Boltzmann machines [c.f. Equation (1.5)]. As in the case of Hopfield networks (Exercise 2.10), one may also consider a continuous network dynamics:
$$\tau \frac{\mathrm{d}V_i}{\mathrm{d}t} = -V_i + g\Big(\sum_j w_{ij}^{(vv)} V_j(t) + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Big), \qquad (9.3)$$
with time constant τ. We shall see in a moment why it is convenient to assume that the dynamics is continuous in t , as in Equation (9.3).
Figure 9.1: Network with a feedback connection. Neurons 1 and 2 are hidden neurons. The weights from the input $x_k$ to the neurons $V_i$ are denoted by $w_{ik}^{(vx)}$, the weight from neuron $V_j$ to neuron $V_i$ is $w_{ij}^{(vv)}$. Neurons 3 and 4 are output neurons, with prescribed target values $y_i$. To avoid confusion with the iteration index $t$, the targets are denoted by $y$ in this Chapter.
Recurrent networks can learn in different ways. One possibility is to use a training set of pairs $(x^{(\mu)}, y^{(\mu)})$ with $\mu = 1, \ldots, p$. One feeds a pattern from this set and runs the dynamics (9.2) or (9.3) for the given $x^{(\mu)}$ until it reaches a steady state $V^*$ (if this does not happen, the training fails). Then one adjusts the weights by one gradient-descent step using the energy function
$$H = \frac{1}{2}\sum_k (E_k^*)^2 \quad\text{where}\quad E_k^* = \begin{cases} y_k^{(\mu)} - V_k^* & \text{if } V_k \text{ is an output neuron,}\\ 0 & \text{otherwise.}\end{cases} \qquad (9.4)$$
The asterisk in this Equation indicates that all variables are evaluated in the steady state, at V = V ∗. Iterating these steps, one feeds another pattern x (μ), finds the steady state V ∗, adjusts the weights, and so forth. One continues to iterate until the steady-state outputs yield the correct targets for all input patterns. This is reminiscent of the algorithms discussed in Chapters 5 to 7. Instead of defining the energy function in terms of the mean-squared output errors, one could also use the negative log-likelihood function (7.44).
Another possibility is that inputs and targets change as functions of time $t$ while the network dynamics runs. This makes it possible to solve temporal association tasks. The network is trained on a set of input sequences $x(t)$ and corresponding target sequences $y(t)$. In this way, recurrent networks can translate written text or recognise speech. The network can be trained by unfolding its dynamics in time as explained in Section 9.2, although this algorithm suffers from the vanishing-gradient problem discussed in Chapter 7.
9.1 Recurrent backpropagation
This Section summarises how to generalise Algorithm 4 to recurrent networks with feedback connections. Recall the recurrent network shown in Figure 9.1. The neurons $V_i$ have smooth activation functions, and they are connected by weights $w_{ij}^{(vv)}$. Several neurons may be linked to inputs $x_k^{(\mu)}$, with weights $w_{ik}^{(vx)}$. Other neurons are output units with associated target values $y_i^{(\mu)}$.
One takes the dynamics to be continuous in time, Equation (9.3), and assumes that $V(t)$ runs into a steady state,
$$V(t) \to V^* \quad\text{so that}\quad \frac{\mathrm{d}V_i^*}{\mathrm{d}t} = 0. \qquad (9.5)$$
Equation (9.3) implies
$$V_i^* = g\Big(\sum_j w_{ij}^{(vv)} V_j^* + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}\Big), \qquad (9.6)$$
and it is assumed that V ∗ is a linearly stable steady state of the dynamics (9.3), so that small perturbations δV away from V ∗ decay with time.
The synchronous discrete dynamics (9.2) can exhibit undesirable stable periodic solutions [133], as mentioned in Section 1.3. This is a reason for using the continuous dynamics (9.3), yet convergence to the steady state is not guaranteed in this case either.
Equation (9.6) is a non-linear self-consistent condition for the components of V ∗, in general difficult to solve. However, if the steady state V ∗ is stable, we can use the dynamics (9.3) to automatically pick out the steady-state solution V ∗. This solution depends on the pattern x (μ). Note that the superscript (μ) is left out in Equations (9.5) and (9.6), and also in the remainder of this Section.
The goal is to find weights so that the outputs give the correct target values in
the steady state. To this end one uses gradient descent on the energy function (9.4).
Consider first how to adjust the weights $w_{ij}^{(vv)}$:
$$\delta w_{mn}^{(vv)} = -\eta \frac{\partial H}{\partial w_{mn}^{(vv)}} = \eta \sum_k E_k^* \frac{\partial V_k^*}{\partial w_{mn}^{(vv)}}. \qquad (9.7)$$
One calculates the gradients of $V^*$ by differentiating Equation (9.6):
$$\frac{\partial V_i^*}{\partial w_{mn}^{(vv)}} = g'(b_i^*)\, \frac{\partial b_i^*}{\partial w_{mn}^{(vv)}} = g'(b_i^*)\Big(\delta_{im} V_n^* + \sum_j w_{ij}^{(vv)} \frac{\partial V_j^*}{\partial w_{mn}^{(vv)}}\Big). \qquad (9.8)$$
Here $b_i^* = \sum_j w_{ij}^{(vv)} V_j^* + \sum_k w_{ik}^{(vx)} x_k - \theta_i^{(v)}$ is the local field in the steady state. Equation (9.8) is a self-consistent equation for the gradient, as opposed to the explicit expressions we found in Chapter 6. The reason for the difference is that the recurrent network has feedbacks.
Since Equation (9.8) is linear in the gradients, it can be solved by matrix inversion, at least formally. In terms of the matrix $\mathbb{L}$ with elements
$$L_{ij} = \delta_{ij} - g'(b_i^*)\, w_{ij}^{(vv)}, \qquad (9.9)$$
Equation (9.8) can be written as
$$\sum_j L_{ij}\, \frac{\partial V_j^*}{\partial w_{mn}^{(vv)}} = \delta_{im}\, g'(b_i^*)\, V_n^*. \qquad (9.10)$$
If $\mathbb{L}$ is invertible, one applies $\mathbb{L}^{-1}$ to both sides. Using the fact that $\sum_i [\mathbb{L}^{-1}]_{ki} L_{ij} = \delta_{kj}$, one finds:
$$\frac{\partial V_k^*}{\partial w_{mn}^{(vv)}} = [\mathbb{L}^{-1}]_{km}\, g'(b_m^*)\, V_n^*. \qquad (9.11)$$
Inserting this result into (9.7) one obtains:
$$\delta w_{mn}^{(vv)} = \eta \sum_k E_k^*\, [\mathbb{L}^{-1}]_{km}\, g'(b_m^*)\, V_n^*. \qquad (9.12)$$
This learning rule can be written in the form of the backpropagation rule (6.10) by introducing the error
$$\Delta_m^* = g'(b_m^*) \sum_k E_k^*\, [\mathbb{L}^{-1}]_{km}. \qquad (9.13)$$
Then the learning rule (9.12) takes the form
$$\delta w_{mn}^{(vv)} = \eta\, \Delta_m^*\, V_n^*. \qquad (9.14)$$
If there are no recurrent connections, then $L_{ij} = \delta_{ij}$. In this case Equation (9.13) reduces to the standard expression (6.6b), Exercise 9.1.
The learning rule for the weights $w_{mn}^{(vx)}$ is derived in an analogous fashion. The result is:
$$\delta w_{mn}^{(vx)} = \eta\, \Delta_m^*\, x_n. \qquad (9.15)$$
The learning rules (9.14) and (9.15) are well-defined only if the matrix $\mathbb{L}$ is invertible. Otherwise the solution (9.11) does not exist. Also, matrix inversion is an expensive operation. As described in Chapter 5, one can try to avoid the problem by finding
the inverse iteratively. The trick [1] is to write down a dynamical equation for ∆i that has a steady state at the solution of Equation (9.13):
$$\tau \frac{\mathrm{d}\Delta_j}{\mathrm{d}t} = -\Delta_j + g'(b_j^*)\, E_j^* + \sum_i \Delta_i\, w_{ij}^{(vv)}\, g'(b_j^*). \qquad (9.16)$$
It is left as an exercise (Exercise 9.2) to verify that the dynamics (9.16) has a steady state satisfying Equation (9.13). Equation (9.16) is written in a form to stress that (9.16) and (9.3) exhibit the same duality as Algorithm 4, between forward propaga- tion of states of neurons and backpropagation of errors. The sum in Equation (9.16) has the same form as the recursion for the errors in Algorithm 4, except that there are no layer indices l here.
Equation (9.16) admits the steady state (9.13). But does $\Delta_i(t)$ converge to $\Delta_i^*$? For convergence it is necessary that the steady state is linearly stable. Whether or not this is the case is determined by linear stability analysis [85]. One asks: does a small deviation from the steady state increase or decrease under Equation (9.16)? In other words, if one writes
$$V(t) = V^* + \delta V(t) \quad\text{and}\quad \Delta(t) = \Delta^* + \delta\Delta(t), \qquad (9.17)$$
do $\delta V(t)$ and $\delta\Delta(t)$ grow in magnitude? To answer this question, one inserts this ansatz into (9.3) and (9.16), and linearises:
$$\tau \frac{\mathrm{d}}{\mathrm{d}t}\,\delta V_i = -\delta V_i + g'(b_i^*) \sum_j w_{ij}^{(vv)}\, \delta V_j \approx -\sum_j L_{ij}\, \delta V_j, \qquad (9.18a)$$
$$\tau \frac{\mathrm{d}}{\mathrm{d}t}\,\delta\Delta_j = -\delta\Delta_j + \sum_i \delta\Delta_i\, w_{ij}^{(vv)}\, g'(b_j^*) \approx -\sum_i \delta\Delta_i\, g'(b_i^*) L_{ij}/g'(b_j^*). \qquad (9.18b)$$
Equation (9.18a) shows: whether or not the norm of $\delta V(t)$ grows is determined by the eigenvalues of the matrix $\mathbb{L}$. We say that $V^*$ is a linearly stable steady state of Equation (9.3) if all eigenvalues of $-\mathbb{L}$ have negative real parts. In this case $|\delta V(t)| \to 0$. If at least one eigenvalue has a positive real part then $|\delta V|$ grows. In this case we say that $V^*$ is linearly unstable. Since the matrix with elements $g'(b_i^*) L_{ij}/g'(b_j^*)$ has the same eigenvalues as $\mathbb{L}$, $\Delta^*$ is a stable steady state of (9.16) if $V^*$ is a stable steady state of (9.3). If the steady states are unstable, the algorithm does not converge.
In summary, recurrent backpropagation is analogous to backpropagation (Algorithm 4) for layered feed-forward networks, save for two differences. First, the non-linear network dynamics is no longer a simple input-to-output mapping with nested activation functions, but a non-linear dynamics that may (or may not) converge to a steady state. Second, the feedbacks give rise to linear self-consistent equations for the steady-state gradients $\partial V_j^*/\partial w_{mn}$, which can be viewed as steady-state conditions for a dual dynamics of the errors.
The main conclusion of this Section is that convergence of the training is not guaranteed if the network has feedback connections (for a layered feed-forward network without feedbacks, recurrent backpropagation simplifies to stochastic gradient descent, Algorithm 4, see Exercise 9.1). This explains why stochastic gradient descent is used mostly for multi-layer networks with feed-forward layouts. The algorithm tends to fail for networks with feedbacks. However, it is possible to get rid of the feedbacks in recurrent networks by unfolding the dynamics in time. This is described in the next Section.
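The following minimal NumPy sketch illustrates one recurrent-backpropagation step for a single pattern, following Equations (9.3), (9.6), (9.9), and (9.13) to (9.15). It is only a sketch: the steady state is found by simple fixed-point iteration and assumed to be stable, tanh activations are used, and all names are illustrative.

import numpy as np

g = np.tanh
g_prime = lambda b: 1.0 - np.tanh(b)**2

def recurrent_backprop_step(w_vv, w_vx, theta, x, y, out_idx, eta=0.01, n_iter=500):
    # relax the dynamics (9.3) to a steady state V*, Eq. (9.6)
    V = np.zeros(len(theta))
    for _ in range(n_iter):
        b = w_vv @ V + w_vx @ x - theta
        V = g(b)
    # output errors E*, nonzero only for the output neurons, Eq. (9.4)
    E = np.zeros_like(V)
    E[out_idx] = y - V[out_idx]
    # matrix L, Eq. (9.9), and errors Delta*, Eq. (9.13)
    L = np.eye(len(V)) - np.diag(g_prime(b)) @ w_vv
    Delta = g_prime(b) * (np.linalg.inv(L).T @ E)
    # weight increments, Eqs. (9.14) and (9.15)
    return eta * np.outer(Delta, V), eta * np.outer(Delta, x)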
9.2 Backpropagation through time
Recurrent networks can be used to learn sequential inputs, as in speech recognition and machine translation. The training set consists of time sequences [x (t ), y (t )] of inputs and targets. The network is trained on the sequences and learns to predict the targets. In this context the layout differs from the one described in the previous Section. There are two main differences. Firstly, the inputs and targets depend on t , and one uses a discrete-time update rule. Secondly, separate output neurons Oi (t ) are added to the layout. The update rule takes the form
$$V_i(t) = g\Big(\sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k(t) - \theta_i^{(v)}\Big), \qquad (9.19a)$$
$$O_i(t) = g\Big(\sum_j w_{ij}^{(ov)} V_j(t) - \theta_i^{(o)}\Big). \qquad (9.19b)$$
The activation function of the output neurons $O_i$ can be different from that of the hidden neurons $V_j$. One possibility is to use the softmax function for the outputs [134, 135]. For the hidden units one often uses tanh activations.
To train recurrent networks with time-dependent inputs and with the dynamics (9.19), one uses backpropagation through time. The idea is to unfold the network in time to get rid of the feedbacks. The price paid is that one obtains large networks in this way, with as many copies of the original neurons as there are time steps.
The procedure is illustrated in Figure 9.2 for a recurrent network with one hidden neuron, one input terminal, and one output neuron. The unfolded network has T inputs and outputs. It can be trained in the usual way with stochastic gradient descent. The errors are calculated using backpropagation as in Algorithm 4, but here the error is propagated back in time, not from layer to layer. The energy function is the squared error summed over all time steps
$$H = \frac{1}{2}\sum_{t=1}^{T} E_t^2 \quad\text{with}\quad E_t = y_t - O_t. \qquad (9.20)$$
Figure 9.2: Left: recurrent network with one input terminal, one hidden neuron, and one output neuron. Right: same network but unfolded in time. The weights $w^{(vv)}$ remain unchanged as drawn, also the weights $w^{(vx)}$ and $w^{(ov)}$ remain unchanged (not drawn). After Figs. 7 and 8 in Ref. [135].
One could use the negative log-likelihood function (7.38), but here we use the squared output-error function (9.20). There is only one hidden neuron in our example, and the inputs and outputs are also one-dimensional. Here and in the following we write the time argument as a subscript, $O_t$ instead of $O(t)$ and so forth, because there is no risk of confusing it with other subscripts.
Consider first how to adjust the weight $w^{(vv)}$. Gradient descent (5.24) yields
$$\delta w^{(vv)} = \eta \sum_{t=1}^{T} E_t \frac{\partial O_t}{\partial w^{(vv)}} = \eta \sum_{t=1}^{T} \Delta_t\, w^{(ov)} \frac{\partial V_t}{\partial w^{(vv)}}. \qquad (9.21a)$$
Here
$$\Delta_t = E_t\, g'(B_t) \qquad (9.22)$$
is an output error, and $B_t = w^{(ov)} V_t - \theta^{(o)}$ is the local field of the output neuron at time $t$ [Equation (9.19)]. Equation (9.21a) is similar to the learning rule for recurrent backpropagation, Equations (9.7) and (9.8), but the derivative $\partial V_t/\partial w^{(vv)}$ is evaluated differently. Equation (9.19a) yields the recursion
$$\frac{\partial V_t}{\partial w^{(vv)}} = g'(b_t)\, V_{t-1} + w^{(vv)}\, \frac{\partial V_{t-1}}{\partial w^{(vv)}} \qquad (9.23)$$
for $t \geq 1$. Since $\partial V_0/\partial w^{(vv)} = 0$, Equation (9.23) implies:
$$\frac{\partial V_1}{\partial w^{(vv)}} = g'(b_1)\,V_0,$$
$$\frac{\partial V_2}{\partial w^{(vv)}} = g'(b_2)\,V_1 + g'(b_2)\,w^{(vv)} g'(b_1)\,V_0,$$
$$\frac{\partial V_3}{\partial w^{(vv)}} = g'(b_3)\,V_2 + g'(b_3)\,w^{(vv)} g'(b_2)\,V_1 + g'(b_3)\,w^{(vv)} g'(b_2)\,w^{(vv)} g'(b_1)\,V_0,$$
$$\vdots$$
$$\frac{\partial V_{T-1}}{\partial w^{(vv)}} = g'(b_{T-1})\,V_{T-2} + g'(b_{T-1})\,w^{(vv)} g'(b_{T-2})\,V_{T-3} + \ldots,$$
$$\frac{\partial V_T}{\partial w^{(vv)}} = g'(b_T)\,V_{T-1} + g'(b_T)\,w^{(vv)} g'(b_{T-1})\,V_{T-2} + \ldots$$
Equation (9.21a) says that we must sum over t . Regrouping the terms in this sum yields:
$$\Delta_1 \frac{\partial V_1}{\partial w^{(vv)}} + \Delta_2 \frac{\partial V_2}{\partial w^{(vv)}} + \Delta_3 \frac{\partial V_3}{\partial w^{(vv)}} + \ldots$$
$$= [\Delta_1 g'(b_1) + \Delta_2 g'(b_2)\,w^{(vv)} g'(b_1) + \Delta_3 g'(b_3)\,w^{(vv)} g'(b_2)\,w^{(vv)} g'(b_1) + \ldots]\,V_0$$
$$+ [\Delta_2 g'(b_2) + \Delta_3 g'(b_3)\,w^{(vv)} g'(b_2) + \Delta_4 g'(b_4)\,w^{(vv)} g'(b_3)\,w^{(vv)} g'(b_2) + \ldots]\,V_1$$
$$+ [\Delta_3 g'(b_3) + \Delta_4 g'(b_4)\,w^{(vv)} g'(b_3) + \Delta_5 g'(b_5)\,w^{(vv)} g'(b_4)\,w^{(vv)} g'(b_3) + \ldots]\,V_2$$
$$\vdots$$
$$+ [\Delta_{T-1} g'(b_{T-1}) + \Delta_T g'(b_T)\,w^{(vv)} g'(b_{T-1})]\,V_{T-2}$$
$$+ [\Delta_T g'(b_T)]\,V_{T-1}.$$
To write the learning rule in the usual form, we define errors $\delta_t$ recursively:
$$\delta_t = \begin{cases} \Delta_T\, w^{(ov)} g'(b_T) & \text{for } t = T,\\ \Delta_t\, w^{(ov)} g'(b_t) + \delta_{t+1}\, w^{(vv)} g'(b_t) & \text{for } 0 < t < T. \end{cases} \qquad (9.24)$$
In terms of these errors, the learning rule (9.21a) takes the form
$$\delta w^{(vv)} = \eta \sum_{t=1}^{T} \delta_t\, V_{t-1}. \qquad (9.25)$$
A slight variation of the above algorithm (truncated backpropagation through time) suffers less from the exploding-gradient problem. The idea is that the exploding gradients are tamed by truncating the memory. This is achieved by limiting the error propagation backwards in time: errors are computed back to $T-\tau$ and not further, where $\tau$ is the truncation time [2]. Naturally this implies that long-time correlations cannot be learnt.
The learning rules for the weights $w^{(vx)}$ are obtained in a similar fashion. Equation (9.19a) yields the recursion
$$\frac{\partial V_t}{\partial w^{(vx)}} = g'(b_t)\, x_t + w^{(vv)}\, \frac{\partial V_{t-1}}{\partial w^{(vx)}}. \qquad (9.26)$$
This looks just like Equation (9.23), except that $V_{t-1}$ is replaced by $x_t$. As a consequence we have
$$\delta w^{(vx)} = \eta \sum_{t=1}^{T} \delta_t\, x_t. \qquad (9.27)$$
The learning rule for $w^{(ov)}$ is simpler to derive. From Equation (9.19b) we find by differentiation w.r.t. $w^{(ov)}$:
$$\delta w^{(ov)} = \eta \sum_{t=1}^{T} E_t\, g'(B_t)\, V_t = \eta \sum_{t=1}^{T} \Delta_t\, V_t. \qquad (9.28)$$
How are the thresholds θ (v ) adjusted? Going through the above derivation we see that we must replace Vt −1 in Equation (9.25) by −1. It works in the same way for the output threshold.
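A minimal NumPy sketch of these formulae for the network of Figure 9.2 (one hidden neuron, one input component, one output), using tanh activations. It performs one gradient step for one input sequence; all names are illustrative.

import numpy as np

g = np.tanh
g_prime = lambda b: 1.0 - np.tanh(b)**2

def bptt_step(x, y, w_vv, w_vx, w_ov, theta_v, theta_o, eta=0.01):
    # x, y: input and target sequences of length T (one scalar per time step)
    T = len(x)
    V, b, B, O = (np.zeros(T + 1) for _ in range(4))
    for t in range(1, T + 1):                      # forward pass, Eq. (9.19), V(0) = 0
        b[t] = w_vv * V[t - 1] + w_vx * x[t - 1] - theta_v
        V[t] = g(b[t])
        B[t] = w_ov * V[t] - theta_o
        O[t] = g(B[t])
    Delta, delta = np.zeros(T + 1), np.zeros(T + 2)
    for t in range(T, 0, -1):                      # errors, Eqs. (9.22) and (9.24)
        Delta[t] = (y[t - 1] - O[t]) * g_prime(B[t])
        delta[t] = Delta[t] * w_ov * g_prime(b[t]) + delta[t + 1] * w_vv * g_prime(b[t])
    # increments, Eqs. (9.25), (9.27), (9.28); thresholds as described above
    dw_vv = eta * np.sum(delta[1:T + 1] * V[0:T])
    dw_vx = eta * np.sum(delta[1:T + 1] * x)
    dw_ov = eta * np.sum(Delta[1:T + 1] * V[1:T + 1])
    dtheta_v = -eta * np.sum(delta[1:T + 1])
    dtheta_o = -eta * np.sum(Delta[1:T + 1])
    return dw_vv, dw_vx, dw_ov, dtheta_v, dtheta_o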
In order to keep the formulae simple, we derived the algorithm for a single hidden neuron, a single output neuron, and one-component inputs, so that we could leave out the indices referring to different hidden neurons, and different input and output components. If we consider several hidden and output neurons and multi- dimensional inputs, the structure of the Equations remains exactly the same, except
Algorithm 7 backpropagation through time
  initialise weights w^(vv)_mn, w^(vx)_mn, w^(ov)_mn and thresholds θ^(v)_m, θ^(o)_m;
  for τ = 1, …, τ_max do
    choose input sequence x(1), …, x(T); initialise V_j(0) = 0;
    for t = 1, …, T do
      propagate forward:
        b_i(t) ← Σ_j w^(vv)_ij V_j(t−1) + Σ_k w^(vx)_ik x_k(t) − θ^(v)_i and V_i(t) ← g[b_i(t)];
      compute outputs:
        B_i(t) ← Σ_j w^(ov)_ij V_j(t) − θ^(o)_i and O_i(t) ← g[B_i(t)];
    end for
    compute errors for t = T (targets y_i):
      Δ_i(T) ← [y_i − O_i(T)] g′[B_i(T)] and δ_j(T) ← Σ_i Δ_i(T) w^(ov)_ij g′[b_j(T)];
    for t = T, …, 2 do
      propagate backwards:
        Δ_i(t−1) ← [y_i − O_i(t−1)] g′[B_i(t−1)] and
        δ_j(t−1) ← Σ_i Δ_i(t−1) w^(ov)_ij g′[b_j(t−1)] + Σ_i δ_i(t) w^(vv)_ij g′[b_j(t−1)];
    end for
    δw^(vv)_mn = 0, δw^(vx)_mn = 0, δw^(ov)_mn = 0, δθ^(v)_m = 0, δθ^(o)_m = 0;
    for t = 1, …, T do
      δw^(vv)_mn = δw^(vv)_mn + η δ_m(t) V_n(t−1);
      δw^(vx)_mn = δw^(vx)_mn + η δ_m(t) x_n(t);
      δw^(ov)_mn = δw^(ov)_mn + η Δ_m(t) V_n(t);
      δθ^(v)_m = δθ^(v)_m − η δ_m(t);
      δθ^(o)_m = δθ^(o)_m − η Δ_m(t);
    end for
    adjust weights and thresholds: w^(vv)_mn = w^(vv)_mn + δw^(vv)_mn, …;
  end for
for a number of extra sums over those indices:
$$\delta w_{mn}^{(vv)} = \eta \sum_{t=1}^{T} \delta_m(t)\, V_n(t-1), \quad\text{with}\quad \delta_j(t) = \begin{cases} \sum_i \Delta_i(T)\, w_{ij}^{(ov)}\, g'[b_j(T)] & \text{for } t = T,\\ \sum_i \Delta_i(t)\, w_{ij}^{(ov)}\, g'[b_j(t)] + \sum_i \delta_i(t+1)\, w_{ij}^{(vv)}\, g'[b_j(t)] & \text{for } 0 < t < T. \end{cases} \qquad (9.29)$$
The second term in the recursion for $\delta_j(t)$ is analogous to the error recursion in Algorithm 4. The time index $t$ here plays the role of the layer index $l$ in Algorithm 4. A difference is that the weights in Equation (9.29) are the same for all time steps. The algorithm is summarised in Algorithm 7.
In conclusion we see that backpropagation through time for recurrent networks is similar to backpropagation for multilayer perceptrons. After the recurrent network is unfolded to get rid of the feedback connections, it can be trained by backpropagation. The time index t takes the role of the layer index l. Backpropagation through time is the standard approach for training recurrent networks, despite the fact that it suffers from the vanishing-gradient problem. The next Section describes how improvements to the layout make it possible to more efficiently train recurrent networks.
9.3 Vanishing gradients
Hochreiter and Schmidhuber [137] suggested to replace the hidden neurons of the recurrent network with computation units that are specially designed to reduce the vanishing-gradient problem. The method is referred to as long short-term memory (LSTM). The basic ingredient is the same as in residual networks (Section 7.4): short cuts reduce the vanishing-gradient problem. For our purposes we can think of LSTMs as units that replace the hidden neurons. For a detailed description of LSTMs see Ref. [138].
Gated recurrent units [139] serve the same purpose as LSTMs, and they function in a similar way. It has been argued that LSTMs outperform gated recurrent units for certain tasks, but since gated recurrent units are simpler, the remainder of this Section focuses on them. As illustrated in Figure 9.3, these units replace the
second word, and so forth. The drawback of this scheme is that it does not account for the fact that two given words might be more or less closely related to each other. Other encoding schemes are described in Ref. [135].
Each input to the recurrent network is a vector with as many components as there are words in the dictionary. A sentence corresponds to a sequence $x_1, x_2, \ldots, x_T$. Each sentence ends with an end-of-sentence tag, which signals that the input sentence has ended. This is necessary because the number of words per sentence is not fixed. Now suppose that a possible translation reads $x'_1, x'_2, \ldots, x'_{T'}$. The task of the network is to determine the probability $p(x'_1, \ldots, x'_{T'} \mid x_1, \ldots, x_T)$ that the translation is correct. The idea is to estimate this probability recursively as
$$p(x'_1, \ldots, x'_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(x'_t \mid x'_1, \ldots, x'_{t-1}). \qquad (9.33)$$
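In code, the factorisation (9.33) amounts to accumulating the per-step softmax outputs for the words of the candidate translation; sums of logarithms are used for numerical stability. A minimal sketch (the array layout and word indices are illustrative):

import numpy as np

def translation_log_probability(step_softmax, target_words):
    # step_softmax: shape (T_prime, n_words); row t holds the softmax output
    #   produced after feeding x'_1, ..., x'_{t-1}
    # target_words: the T_prime word indices of the candidate translation
    log_p = 0.0
    for t, word in enumerate(target_words):
        log_p += np.log(step_softmax[t, word])   # log p(x'_t | x'_1, ..., x'_{t-1})
    return log_p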
Sutskever et al. [134] describe how to achieve this with a recurrent network with two hidden LSTMs. The network uses softmax outputs $O_t$, where the $j$-th component of $O_t$ is interpreted as the probability that the $j$-th component of $x'_t$ is the correct word at position $t$ in the translated sentence. As shown in Figure 9.4, the first LSTM processes the input sentence $x_1, \ldots, x_T$, encoding its contents in the hidden states. When the
There is a large number of recent papers on machine translation with recurrent neural networks. Most studies are based on the training algorithm described in Section 9.2, backpropagation through time. The different approaches mainly differ in their network layouts. Google’s machine translation system uses a deep network with layers of LSTMs [141]. Different hidden neurons are unfolded forward as well as backwards in time, as shown schematically in Figure 9.5. For several hidden and output neurons and multi-dimensional inputs, the bidirectional network has the dynamics
$$V_i(t) = g\Big(\sum_j w_{ij}^{(vv)} V_j(t-1) + \sum_k w_{ik}^{(vx)} x_k(t) - \theta_i^{(v)}\Big),$$
$$U_i(t) = g\Big(\sum_j w_{ij}^{(uu)} U_j(t+1) + \sum_k w_{ik}^{(ux)} x_k(t) - \theta_i^{(u)}\Big), \qquad (9.34)$$
$$O_i(t) = g\Big(\sum_j w_{ij}^{(ov)} V_j(t) + \sum_j w_{ij}^{(ou)} U_j(t) - \theta_i^{(o)}\Big),$$
where the hidden neurons V are updated forward in time, while U are updated backwards in time. It is natural to use bidirectional networks for machine trans- lation because correlations go either way in a sentence, forward and backwards.
Figure 9.5: Schematic illustration of a bidirectional recurrent network. The network consists of two hidden neurons ($U$ and $V$) that are unfolded in different ways. After Fig. 12 in Ref. [135].
In German, for example, the finite verb form is usually at the end of the sentence. In practice, the hidden states are represented by LSTMs [134, 135, 141], instead of hidden neurons as in Equation (9.34).
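A minimal NumPy sketch of the forward pass of the bidirectional dynamics (9.34), with plain tanh hidden neurons instead of the LSTMs used in practice (all names are illustrative):

import numpy as np

g = np.tanh

def bidirectional_forward(x, w_vv, w_vx, w_uu, w_ux, w_ov, w_ou, th_v, th_u, th_o):
    # x: list of input vectors x(1), ..., x(T); weights and thresholds as in Eq. (9.34)
    T, n = len(x), len(th_v)
    V, U = np.zeros((T + 2, n)), np.zeros((T + 2, n))
    for t in range(1, T + 1):                     # V is updated forward in time
        V[t] = g(w_vv @ V[t - 1] + w_vx @ x[t - 1] - th_v)
    for t in range(T, 0, -1):                     # U is updated backwards in time
        U[t] = g(w_uu @ U[t + 1] + w_ux @ x[t - 1] - th_u)
    # outputs combine both sets of hidden states
    return [g(w_ov @ V[t] + w_ou @ U[t] - th_o) for t in range(1, T + 1)]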
Different schemes for scoring the accuracy of a translation are described by Lipton et al. [135]. One difficulty is that there are often several different valid translations of a given sentence, and the score must compare the machine translation with all of them. More recent papers on machine translation usually use the so-called BLEU score to evaluate the translation accuracy. The acronym stands for bilingual evaluation understudy. The scheme was proposed by Papineni et al. [142]. It is argued to evaluate the accuracy of a translation not too differently from humans.
9.5 Reservoir computing
An alternative to backpropagation through time for recurrent networks is reser- voir computing [143]. This method has been used with success to predict chaotic dynamics [144, 145] and rare transitions in stochastic bi-stable systems [146].
Consider input data in the form of a time series $x(0), \ldots, x(T-1)$ of $N$-dimensional vectors $x(t)$, and a corresponding series of $M$-dimensional targets $y(t)$. The goal is to train the recurrent network so that its outputs $O(t)$ approximate the targets as precisely as possible, by minimising the energy function $H = \frac{1}{2}\sum_{t=\tau}^{T-1}\sum_{i=1}^{M}[E_i(t)]^2$, where $E_i(t) = y_i(t) - O_i(t)$ is the output error, and $\tau$ represents an initial transient that is disregarded.
Figure 9.6 shows the layout for this task. There are $N$ input terminals. They are connected with weights $w_{jk}^{(\mathrm{in})}$ to a reservoir of hidden neurons with state variables $r_j(t)$. The reservoir is linked to $M$ linear output units $O_i(t)$ with weights $w_{ij}^{(\mathrm{out})}$. The
Figure 9.6: Reservoir computing (schematic). Not all connections are drawn. There can be connections from all inputs to all neurons in the reservoir (gray), and from all reservoir neurons to all output neurons.
reservoir itself is a large recurrent network with weights wi j . The update rule is similar to Equation (9.19). There are many different versions that differ in detail [147]. One possibility is [146]
$$r_i(t+1) = g\Big(\sum_j w_{ij}\, r_j(t) + \sum_{k=1}^{N} w_{ik}^{(\mathrm{in})} x_k(t)\Big), \qquad (9.35a)$$
$$O_i(t+1) = \sum_j w_{ij}^{(\mathrm{out})}\, r_j(t+1), \qquad (9.35b)$$
for $t = 0, \ldots, T-1$ with initial conditions $r_j(0)$.
The main difference to the training algorithms described in the previous Sections of this Chapter is that the input weights $w_{ik}^{(\mathrm{in})}$ and the reservoir weights $w_{ij}$ are randomly initialised and kept constant. Only the output weights $w_{ij}^{(\mathrm{out})}$ are trained
by gradient descent. The idea is that the dynamics of a sufficiently large reservoir finds nonlinear, high-dimensional representations of the input data [143], not unlike sparse representations of binary classification problems embedded in a high-dimensional space (Section 5.4) that become linearly separable in this way.
In addition, and this is a difference to the problem described in Section 5.4, the reservoir serves as a dynamical memory. This requires that the reservoir states faithfully represent the input sequence: similar input sequences should yield similar reservoir activations, provided one iterates it long enough. However, for random weights the recurrent reservoir dynamics can be chaotic [85]. In this case, the state of the reservoir after many iterations bears no relation to the input sequence. To
avoid this, one requires that the reservoir dynamics is linearly stable. Linearising the reservoir dynamics (9.35a) gives
$$\delta r(t+1) = \mathbb{D}(t)\,\mathbb{W}\,\delta r(t), \qquad (9.36)$$
where $\mathbb{W}$ is the matrix of reservoir weights $w_{ij}$, and $\mathbb{D}(t)$ is a diagonal matrix with entries $D_{ii}(t) = g'[b_i(t)]$, where $b_i(t) = \sum_j w_{ij} r_j(t) + \sum_{k=1}^{N} w_{ik}^{(\mathrm{in})} x_k(t)$. Whether or not $\delta r$ grows is then determined by the singular values of the matrix product $\mathbb{J}_t = \mathbb{D}(t)\mathbb{W}\,\mathbb{D}(t-1)\mathbb{W}\cdots\mathbb{D}(1)\mathbb{W}$, as in Section 7.2. The singular values of $\mathbb{J}_t$ are denoted by $\Lambda_1(t) \geq \Lambda_2(t) \geq \cdots$. At large times, the maximal Lyapunov exponent $\lambda_1 = \lim_{t\to\infty} t^{-1}\log\Lambda_1(t)$ must be negative to ensure that the reservoir dynamics (9.36), driven with a stationary input time series $x(t)$, is stable:
λ1 <0. (9.37)
Sometimes the stability criterion is quoted in terms of the maximal eigenvalue of $\mathbb{W}$. If one uses tanh activation functions and if the local fields $b_i(t)$ remain small, then the diagonal elements of $\mathbb{D}(t)$ remain close to unity. In this case the stability condition for the reservoir dynamics is given by the weight matrix $\mathbb{W}$ alone. In general the singular values of $\mathbb{W}$ are different from its eigenvalues, but the maximal singular value of $\mathbb{W}^t$ approaches $\mathrm{e}^{t\log|\nu_1|}$, where $\nu_1$ is the eigenvalue of $\mathbb{W}$ with largest modulus (Exercise 9.8).
For inputs with long time correlations, the reservoir dynamics must not decay too quickly, so that it can represent the dynamical correlations in the input sequence. There is no precise mathematical theory that says how to optimise the reservoir. In practice one adjusts the maximal Lyapunov exponent by trial and error. Its optimal value may depend on the properties of the input series, for instance upon whether the input series is chaotic or not. There are many different recipes for how to set up a reservoir. One takes the weights to be uniformly distributed, or Gaussian, or assumes that $w_{ij} = \pm 1$ with equal probability. Usually the reservoir is sparse, with only a small fraction of weights non-zero. The elements of the resulting weight matrix are rescaled to adjust $\lambda_1$ [147]. The weight matrix $\mathbb{W}^{(\mathrm{in})}$ is commonly taken to be a full matrix, and its elements are drawn from the same distribution as those of the reservoir. Lukosevicius [147] gives a practical overview over different schemes for setting up reservoir computers.
For time-series prediction, one trains the network on many input series x (0), . . . x (T − 1) with targets y (t ) = x (t ). After training, one continues to iterate the network dy- namics with inputs x (T + k ) = O (T + k ) to predict x (T + k + 1), for k = 0, 1, 2, . . .. In order to represent complex spatio-temporal patterns, Pathak et al. [144] found it necessary to use several parallel reservoirs. Lim et al. [146] used a chain of reservoirs that feed into each other, replacing Equation (9.35a) by a set of nested update rules.
Tanaka et al. [148] describe different physical implementations of reservoir com- puters, based on electronic RC-circuits, optical cavities or resonators, spin-torque oscillators, or mechanical devices.
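A minimal NumPy sketch of the scheme, for illustration only: the reservoir and input weights are drawn at random and rescaled, the dynamics (9.35a) is iterated, and the linear output weights (9.35b) are then fitted by least squares (a common shortcut for the gradient-descent training mentioned above). All parameter values and names are illustrative.

import numpy as np

def train_reservoir(x, y, n_res=500, spectral_radius=0.9, transient=100, seed=1):
    # x: (T, N) input series, y: (T, M) target series
    rng = np.random.default_rng(seed)
    T, N = x.shape
    w_in = rng.uniform(-0.5, 0.5, (n_res, N))            # fixed input weights
    w = rng.uniform(-0.5, 0.5, (n_res, n_res))
    w *= rng.random((n_res, n_res)) < 0.05                # sparse reservoir
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))   # rescale
    r, states = np.zeros(n_res), np.zeros((T, n_res))
    for t in range(T):                                    # reservoir dynamics, Eq. (9.35a)
        r = np.tanh(w @ r + w_in @ x[t])
        states[t] = r
    # fit the output weights, Eq. (9.35b), on the states after the initial transient
    w_out, *_ = np.linalg.lstsq(states[transient:], y[transient:], rcond=None)
    return w_in, w, w_out.T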
9.6 Summary
It is sometimes said that recurrent networks learn dynamical systems, while multi- layer perceptrons learn input-output maps. This emphasises a difference in how these networks are usually used, but we should bear in mind that they are trained in similar ways, by backpropagation. Neither is it given that the tasks must differ: recurrent networks are also used to learn time-independent data. It is true, however, that tools from dynamical-systems theory help to analyse the dynamics of recurrent networks [136, 149].
Recurrent neural networks can be trained by stochastic gradient descent after unfolding the network in time, to get rid of feedback connections. This algorithm suffers from the vanishing-gradient problem. To overcome this difficulty, the hidden neurons in the recurrent network are replaced by composite units that are trained to sometimes act as residual connections, passing the signal right through, and sometimes as non-linear units that can learn correlations in a meaningful way. There are different versions, long short-term memory units and gated recurrent units. They all work in similar ways. Successful layouts for machine translation use deep bidirectional networks with layers of LSTMs.
An alternative scheme is reservoir computing, where a large reservoir of hidden neurons is used to represent correlations in the input data, and a set of linear output units is trained to learn the original sequence from such representations. The idea is that it is easier to learn intricate features of an input sequence from a high-dimensional, sparse representation of the data.
9.7 Further reading
The training of recurrent networks is discussed in Chapter 15 of Ref. [2], see also Refs. [150, 151]. Recurrent backpropagation is described by Hertz, Krogh and Palmer [1], for a very similar network layout. How LSTMs combat the vanishing-gradient problem is explained in Ref. [138]. For a recent review of recurrent neural networks, see Ref. [135]. This webpage [152] gives a very enthusiastic overview of what recurrent networks can do. A more pessimistic view is expressed in this blog. For a review of reservoir computing, see Ref. [143].
Figure 9.7: Recurrent network with one input terminal $x(t)$, one hidden neuron $V(t)$, and one output neuron $O(t)$. Exercise 9.4.
9.8 Exercises
9.1 Recurrent backpropagation. Derive Eq. (9.15) for the weight increments $\delta w_{mn}^{(vx)}$ in recurrent backpropagation. Show how the recurrent-backpropagation algorithm simplifies to Algorithm 4 for layered feed-forward networks when there are no feedbacks.
9.2 Steady state of error dynamics in recurrent backpropagation. Verify that the error dynamics (9.16) has a steady state satisfying Equation (9.13).
9.3 Learning rules for backpropagation through time. Derive the learning rules (9.27) and (9.28) from Equation (9.19).
9.4 Recurrent network. Figure 9.7 shows a simple recurrent network with one hidden neuron $V(t)$, one input component $x(t)$, and one output $O(t)$. The network learns a time series of input-output pairs $[x(t), y(t)]$ for $t = 1, 2, 3, \ldots, T$. Here $t$ is a discrete time index and $y(t)$ is the target value at time $t$. The hidden unit is initialised to $V(0)$ at $t = 0$. This network can be trained by backpropagation by unfolding it in time. Write down the dynamical rules for this network, the rules that determine $V(t)$ in terms of $V(t-1)$ and $x(t)$, and $O(t)$ in terms of $V(t)$. Assume that both $V(t)$ and $O(t)$ have the same activation function $g(b)$. Derive the learning rule for $w^{(ov)}$ using gradient descent on the energy function $H = \frac{1}{2}\sum_{t=1}^{T} E(t)^2$ with $E(t) = y(t) - O(t)$. Denote the learning rate by $\eta$.
9.5 Backpropagation through time. A recurrent network with two hidden neu- rons is shown in Figure 9.8. Write down the dynamical rules for this network. Assume that all neurons have the same activation function g (b ). Draw the unfolded network. Derive the learning rules for the weights.
Figure 9.8: Recurrent network used in Exercise 9.5. After Fig. 3 in Ref. [135].

9.6 Backpropagation through time for thresholds. Derive learning rules for the thresholds $\theta_j^{(v)}$ and $\theta_i^{(o)}$ for the backpropagation-through-time algorithm.
9.7 Dual dynamics for recurrent backpropagation. Show that the dynamics (9.16) admits a steady state satisfying (9.13).
9.8 Eigenvalues and singular values. Compute the eigenvalues and the singular values of the matrices
$$\mathbb{A}_1 = \begin{pmatrix} 1 & 2\\ 2 & 2 \end{pmatrix}, \quad \mathbb{A}_2 = \begin{pmatrix} 1 & 1\\ -1 & 1 \end{pmatrix}, \quad \mathbb{A}_3 = \begin{pmatrix} 1 & 2\\ 0 & 2 \end{pmatrix}. \qquad (9.38)$$
These examples illustrate that the singular values $\Lambda_\alpha$ of a symmetric matrix equal its eigenvalues $\nu_\alpha$. For a normal matrix, $\Lambda_\alpha = |\nu_\alpha|$. In general, singular values and eigenvalues differ. Then show that for all three matrices, the maximal singular value $\Lambda_1(t)$ of $\mathbb{A}^t$ approaches $\mathrm{e}^{t\log|\nu_1|}$, where $\nu_1$ is the eigenvalue of $\mathbb{A}$ with largest modulus.
9.9 Time series prediction. Ikeda modelled the ray dynamics of light in an optical resonator by the map [150]
$$x_1(t+1) = 1 + u\,[x_1(t)\cos\tau - x_2(t)\sin\tau], \qquad (9.39a)$$
$$x_2(t+1) = u\,[x_1(t)\sin\tau + x_2(t)\cos\tau], \qquad (9.39b)$$
with τ = 0.4 − 6/(1 + |x (t )|2), and u = 0.8. Time series generated by Equation (9.39) are chaotic and therefore difficult to predict. Train a reservoir computer on a data set of time series generated by numerical solution of Equation (9.39). Evaluate how well the reservoir computer manages to predict the time series x2(t ). For background on time-series prediction, refer to Nonlinear Time Series Analysis by Kantz and Schreiber [153].
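To generate training data for this exercise one can iterate the map numerically; a minimal sketch (the initial condition is chosen arbitrarily):

import numpy as np

def ikeda_series(T, u=0.8, x0=(0.1, 0.1)):
    # iterate the Ikeda map (9.39) to generate a chaotic time series
    x = np.zeros((T, 2))
    x[0] = x0
    for t in range(T - 1):
        tau = 0.4 - 6.0 / (1.0 + x[t, 0]**2 + x[t, 1]**2)
        x[t + 1, 0] = 1.0 + u * (x[t, 0] * np.cos(tau) - x[t, 1] * np.sin(tau))
        x[t + 1, 1] = u * (x[t, 0] * np.sin(tau) + x[t, 1] * np.cos(tau))
    return x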
PART III LEARNING WITHOUT LABELS
Chapters 5 to 9 describe supervised learning of labeled data with neural networks. The network is trained to reproduce the correct labels (targets) for each input pattern. The analysis of unlabeled data requires different methods. Machine learning can be applied with success to large data sets of high-dimensional unlabeled data. The machine can for instance mark patterns that are typical for the given distribution, or detect outliers. Other tasks are to detect similarity, to find clusters in the data (Figure 10.1), and to determine non-linear, low-dimensional representations of high-dimensional data. More recently, such unsupervised learning algorithms have been used to generate synthetic data, patterns that resemble those in a certain data set. One possible application is data-set augmentation for supervised learning.
Learning without labels is called unsupervised learning , because there are no targets that tell the network whether it has learnt correctly or not. There is no obvious function to fit, or dynamics to learn. Instead the network organises the input data in relevant ways. This requires redundancy in the input data. It is sometimes said that unsupervised learning corresponds to learning without a teacher, implying that the network itself discovers suitable ways of organising the input data. This is inaccurate, because unsupervised networks operate with a pre-determined learning rule, like Hopfield networks.
Part III of this book is organised as follows. Chapter 10 describes unsupervised-learning algorithms, starting with unsupervised Hebbian learning to detect familiarity and similarity of input patterns (Sections 10.1 and 10.2). Related algorithms can be used to find low-dimensional non-linear projections of high-dimensional input data (self-organising maps, Section 10.3). In Section 10.4, these algorithms are compared and contrasted with a standard unsupervised clustering algorithm, $K$-means clustering. Section 10.5 introduces radial basis-function networks, which learn using a hybrid algorithm with supervised and unsupervised elements. Section 10.6 explains how to use layered feedforward networks for unsupervised learning.
Chapter 11 deals with learning tasks that lie in between supervised and unsu- pervised learning, problems where the machine receives partial feedback on its performance in the form of a penalty or a reward. Such tasks can be solved by reinforcement-learning algorithms that allow a neural network or more generally an agent to learn to reproduce outputs that tend to give positive rewards. Several algorithms for reinforcement learning are described, the associative reward-penalty algorithm (Section 11.1), temporal difference learning (Section 11.2), and Q -learning (Section 11.3). The Q -learning algorithm is illustrated by demonstrating how it al- lows two players to learn to compete in the board game tic-tac-toe.
Figure 10.1: Supervised learning finds decision boundaries for labeled data, like in the binary classification problem shown on the left. Unsupervised learning can find clusters in the input data (right).
Figure 10.2: Neural net for unsupervised Hebbian learning, with a single linear output unit that has weight vector w . The network output is denoted by y in this Chapter.
10 Unsupervised learning
10.1 Oja’s rule
A simple example for an unsupervised-learning algorithm uses a single McCulloch- Pitts neuron with linear activation function (Figure 10.2). The neuron computes1 y = w · x with weight vector w = [w1, . . . , wN ]T. Now consider a distribution Pdata(x ) of input patterns x = [x1,..., xN ]T with continuous-valued components xi . Patterns are drawn from this distribution at random and fed one after another to the net. For each pattern x , the weights w are adjusted as follows:
$$w' = w + \delta w \quad\text{with}\quad \delta w = \eta\, y\, x. \qquad (10.1)$$
This rule is also called the Hebbian unsupervised-learning rule, because it is reminiscent
of Hebb’s rule (Chapter 2). As usual, 0 < η ≪ 1 is the learning rate.
1In this Chapter we follow a common convention [1] and denote the output of unsupervised- learning algorithms by y .
What can this rule learn about the input distribution $P_\mathrm{data}(x)$? Since we keep adding multiples of the pattern vectors $x$ to the weights (just as described in Section 5.2), the magnitude of the output $|y|$ becomes larger the more often the input pattern occurs in the distribution $P_\mathrm{data}(x)$. So the most familiar pattern produces the largest output. In this way the network can detect how familiar certain input patterns are.
A problem is that the components of the weight vector continue to grow as we keep adding. This means that the simple Hebbian learning rule (10.1) does not converge to a steady state. To analyse learning outcomes we want the learning to converge. This is achieved by adding a weight-decay term with coefficient proportional to $y^2$ to Equation (10.1):
$$\delta w = \eta\, y\,(x - y\, w). \qquad (10.2)$$
Making use of $y = w\cdot x = w^{\mathrm T} x = x^{\mathrm T} w$, Equation (10.2) can be rewritten in the following form:
$$\delta w = \eta\,\{x x^{\mathrm T} w - [w\cdot(x x^{\mathrm T})\,w]\,w\}. \qquad (10.3)$$
This learning rule is called Oja’s rule [154]. Equation (10.3) ensures that w remains normalised. To see why, consider an analogy: a vector q that obeys the differential equation
$$\frac{\mathrm{d}q}{\mathrm{d}t} = \mathbb{A}(t)\, q. \qquad (10.4)$$
For a general matrix $\mathbb{A}(t)$, the norm $|q|$ may increase or decrease, depending on the singular values of $\mathbb{A}$. We can ensure that $q$ remains normalised by adding a term to Equation (10.4):
$$\frac{\mathrm{d}w}{\mathrm{d}t} = \mathbb{A}(t)\, w - [w\cdot\mathbb{A}(t)\,w]\,w. \qquad (10.5)$$
The vector $w$ turns in the same way as $q$, and if we set $|w| = 1$ initially, then $w$ remains normalised, $w = q/|q|$ (Exercise 10.1). Equation (10.5) describes the dynamics of the normalised orientation vector of a small rod in turbulence [155], where $\mathbb{A}(t)$ is the matrix of fluid-velocity gradients.
Returning to Equation (10.2), we note that the dynamics of (10.2) and (10.5) is the same in the limit of small learning rates $\eta$. Therefore we conclude that $w$ remains normalised under (10.3) when the learning rate is small enough. Oja's algorithm is summarised in Algorithm 8. One draws a pattern $x$ from the distribution $P_\mathrm{data}(x)$ of input patterns, applies it to the network, and updates the weights as prescribed in Equation (10.2). This is repeated many times. In the following we denote the average over $T$ input patterns as $\langle\cdots\rangle = \frac{1}{T}\sum_{t=1}^{T}\cdots$.
While the rule (10.1) does not have a steady state, Oja’s rule (10.3) does. For zero-mean input data, its steady state w ∗ corresponds to the principal component
Algorithm 8 Oja's rule
  initialise weights randomly;
  for t = 1, …, T do
    draw an input pattern x from P_data(x);
    adjust all weights using δw = η y (x − y w);
  end for
Figure 10.3: Oja's rule finds the principal component of zero-mean data (schematic). The initial weight vector is $w_0$, the steady-state weight vector is $w^*$.
of the input data, as illustrated in Figure 10.3. This can be seen by analysing the steady-state condition
$$0 = \langle\delta w\rangle_{w^*}. \qquad (10.6)$$
Here $\langle\cdots\rangle_{w^*}$ is an average over iterations of the learning rule (10.3) at fixed $w^*$, the steady state. Equation (10.6) says that the weight increments $\delta w$ must average to zero in the steady state, to ensure that the weights neither grow nor decrease in the long run. Equation (10.6) is a condition upon $w^*$. Using the learning rule (10.3), it can be written as:
$$0 = \mathbb{C}'\, w^* - (w^*\cdot\mathbb{C}' w^*)\, w^* \quad\text{with}\quad \mathbb{C}' = \langle x x^{\mathrm T}\rangle. \qquad (10.7)$$
Equation (10.7) shows that $w^*$ must be an eigenvector of the matrix² $\mathbb{C}'$, normalised to unity, $|w^*| = 1$. But which one?
We denote the eigenvectors and eigenvalues of $\mathbb{C}'$ by $u_\alpha$ and $\lambda_\alpha$, and investigate the stability of $w^* = u_\alpha$ for different values of $\alpha$ by linear stability analysis, just as in Section 9.1. To this end, consider a small perturbation $\varepsilon_t$ away from $w^* = u_\alpha$:
$$w_t = u_\alpha + \varepsilon_t. \qquad (10.8)$$
A difference to the analysis in Section 9.1 is that the dynamics is now discrete in time. The perturbation at the next time step, $\varepsilon_{t+1}$, is defined by $w_{t+1} = u_\alpha + \varepsilon_{t+1}$. A
²For zero-mean input data, $\mathbb{C}'$ equals the data-covariance matrix, Equation (6.24).
second difference is that the sequence of weight increments depends on the randomly chosen input patterns. In order to determine the linear stability one should iterate and then linearise the dynamics (10.3), to see whether $\varepsilon_t$ grows or not. However, in the limit of small learning rate it is sufficient to average over $x$ before iterating (Exercise 10.4). To linear order in $\varepsilon_t$ one finds:
$$\varepsilon_{t+1} \approx \varepsilon_t + \eta\,\big[\mathbb{C}'\varepsilon_t - 2 u_\alpha (u_\alpha\cdot\mathbb{C}'\varepsilon_t) - (u_\alpha\cdot\mathbb{C}' u_\alpha)\,\varepsilon_t\big] = \mathbb{M}^{(\alpha)}\varepsilon_t, \qquad (10.9)$$
where the last equality sign defines the matrix $\mathbb{M}^{(\alpha)}$. The steady state $w^* = u_\alpha$ is linearly stable if all eigenvalues of $\mathbb{M}^{(\alpha)}$ have real parts with magnitudes smaller than unity.³ To determine the eigenvalues of $\mathbb{M}^{(\alpha)}$, we use the fact that $\mathbb{M}^{(\alpha)}$ has the same eigenvectors as $\mathbb{C}'$. Since $\mathbb{C}'$ is symmetric, these eigenvectors form an orthonormal basis, $u_\alpha\cdot u_\beta = \delta_{\alpha\beta}$. As a consequence, the eigenvalues of $\mathbb{M}^{(\alpha)}$ are simply given by
$$\Lambda_\beta^{(\alpha)} = u_\beta\cdot\mathbb{M}^{(\alpha)} u_\beta = 1 + \eta\,[(\lambda_\beta - \lambda_\alpha) - 2\lambda_\alpha\delta_{\alpha\beta}]. \qquad (10.10)$$
Since $\mathbb{C}'$ is a positive-semidefinite matrix (its eigenvalues $\lambda_\alpha$ cannot be negative), Equation (10.10) shows that there are eigenvalues with $|\Lambda_\beta^{(\alpha)}| > 1$ unless $w^*$ is the leading eigenvector of $\mathbb{C}'$, the one corresponding to its largest eigenvalue. This means that Algorithm 8 finds the principal component of zero-mean data, and it also implies that the algorithm maximises $\langle y^2\rangle$ over all $w$ with $|w| = 1$, see Section 6.3. Note that $\langle y\rangle = 0$ for zero-mean input data.
Now consider inputs with non-zero mean. In this case Algorithm 8 still finds the maximal eigenvalue direction of $\mathbb{C}'$. But for inputs with non-zero mean, this direction is different from the maximal principal direction (Section 6.3). Figure 10.4 illustrates this difference. The Figure shows three data points in a two-dimensional input plane. The elements of $\mathbb{C}' = \langle x x^{\mathrm T}\rangle$ are
$$\mathbb{C}' = \frac{1}{3}\begin{pmatrix} 2 & 1\\ 1 & 2 \end{pmatrix}, \qquad (10.11)$$
with eigenvalues and eigenvectors
$$\lambda_1 = 1,\quad u_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1\\ 1 \end{pmatrix} \quad\text{and}\quad \lambda_2 = \tfrac{1}{3},\quad u_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} -1\\ 1 \end{pmatrix}. \qquad (10.12)$$
So the maximal eigenvalue direction of $\mathbb{C}'$ is $u_1$. To compute the principal direction of the data, we must determine the data-covariance matrix $\mathbb{C}$, Equation (6.24). Its maximal eigenvalue direction is $u_2$, and this is the maximal principal component of the data shown in Figure 10.4.
³For time-continuous dynamics (Section 9.1), linear stability is ensured when all eigenvalues have negative real parts; for discrete dynamics their magnitudes must all be smaller than unity [85].
Figure 10.4: Input data with non-zero mean. Algorithm 8 converges to $u_1$, but the principal direction is $u_2$.
Oja's rule can be generalised to determine $M$ principal components of zero-mean input data using $M$ output neurons that compute $y_i = w_i\cdot x$ for $i = 1, \ldots, M$:
$$\delta w_{ij} = \eta\, y_i\Big(x_j - \sum_{k=1}^{M} y_k\, w_{kj}\Big). \qquad (10.13)$$
This is called Oja’s M-rule [1]. For M = 1, Equation (10.13) simplifies to Oja’s rule.
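A minimal NumPy sketch of Algorithm 8 (learning rate, number of epochs, and all names are illustrative); for zero-mean input data the weight vector approaches the principal component, up to an overall sign:

import numpy as np

def oja(patterns, eta=0.01, n_epochs=100, seed=0):
    # patterns: (T, N) array of zero-mean input patterns drawn from P_data
    rng = np.random.default_rng(seed)
    w = rng.normal(size=patterns.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_epochs):
        for x in rng.permutation(patterns):
            y = w @ x
            w += eta * y * (x - y * w)        # Oja's rule, Eq. (10.2)
    return w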
10.2 Competitive learning
Oja's $M$-rule (10.13) results in neurons that are activated simultaneously. Any input usually causes several outputs to assume non-zero values $y_i \neq 0$ at the same time. In Sections 4.5 and 7.1 we encountered the notion of a winning neuron where the weights are trained in such a way that each pattern activates only a single neuron, and different patterns activate different winning neurons. This makes it possible to represent a distribution of input patterns with a neural network.
Unsupervised learning algorithms can categorise or cluster input data in this way: similar inputs are classified to belong to the same category, and activate the same winning neuron. This is called competitive learning [1]. Figure 10.5(a) shows an example, input patterns on the unit circle that cluster into two distinct clusters. The idea is to find weight vectors $w_i$ that point into the direction of the clusters. To this end we take $M$ linear output units $i$ with weight vectors $w_i$, $i = 1, \ldots, M$. We feed a pattern $x$ from the distribution $P_\mathrm{data}(x)$ and define the winning neuron $i_0$ as the one that has minimal angle between its weight and the pattern vector $x$. This is illustrated in Figure 10.5(b), where $i_0 = 2$. Then only this weight vector is updated by adding a little bit of the difference $x - w_{i_0}$ between the pattern vector and the
Figure 10.5: Detection of clusters by unsupervised learning. (a) Distribution of input patterns on the unit circle and two unit-length weight vectors initialised to random angles. The winning neuron for pattern x is the one with weight vector w 2. (b) Updating w ′2 = w 2 + δw moves this weight vector closer to x .
weight of the winning neuron. The other weights remain unchanged:
$$\delta w_i = \begin{cases} \eta\,(x - w_i) & \text{for } i = i_0(x, w_1, \ldots, w_M),\\ 0 & \text{otherwise.}\end{cases} \qquad (10.14)$$
In other words, only the winning neuron is updated, $w'_{i_0} = w_{i_0} + \delta w_{i_0}$. Equation (10.14) is called the competitive-learning rule.
The learning rule (10.14) has the following geometrical interpretation: the weight
of the winning neuron is drawn towards the pattern $x$. Upon iterating (10.14), the weight vectors are drawn to clusters of inputs. If the input patterns are normalised as in Figure 10.5, the weights end up normalised on average, even though $|w_{i_0}| = 1$ does not imply that $|w_{i_0} + \delta w_{i_0}| = 1$, in general. The algorithm for competitive learning is summarised in Algorithm 9. When weight and input vectors are normalised, then the winning neuron $i_0$ is the one with the largest scalar product $w_i\cdot x$. For linear output units $y_i = w_i\cdot x$ (Figure 10.2) this is simply the unit with the largest output. Equivalently, the winning neuron is the one with the smallest distance $|w_i - x|$. Output units with $w_i$ that are very far away from any pattern may never be
Algorithm 9 competitive learning (Figure 10.5)
initialise weights to vectors with random angles and norm |w_i| = 1;
for t = 1,…,T do
    draw a pattern x from P_data(x);
    find the winning neuron i_0 (smallest angle between w_{i_0} and x);
    adjust only the weight of the winning neuron: δw_{i_0} = η(x − w_{i_0});
end for
Figure 10.6: Principal-component analysis (Section 6.3) finds the linear principal direction (dashed line) of the data. A self-organising map can instead find the principal manifold (solid line), a non-linear approximation to the data.
Output units with weight vectors w_i that are very far away from any pattern may never be updated (dead units). There are several strategies to avoid this [1]. One possibility is to initialise the weights to directions found in the inputs. Also, how to choose the number of weight vectors is a matter of trial and error. Clearly it is better to start with too many rather than too few.
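A minimal Python sketch of Algorithm 9 might look as follows; the two-cluster input distribution and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# input patterns on the unit circle, forming two clusters (illustrative)
angles = np.concatenate([rng.normal(0.5, 0.1, 500), rng.normal(2.5, 0.1, 500)])
patterns = np.column_stack([np.cos(angles), np.sin(angles)])

M, eta, T = 2, 0.05, 5000
theta = rng.uniform(0, 2 * np.pi, M)                  # random initial angles
w = np.column_stack([np.cos(theta), np.sin(theta)])   # unit-length weight vectors

for t in range(T):
    x = patterns[rng.integers(len(patterns))]   # draw a pattern from P_data(x)
    i0 = np.argmax(w @ x)                       # winning neuron: largest w_i . x
    w[i0] += eta * (x - w[i0])                  # update only the winner, Eq. (10.14)

# each row of w now points towards one of the clusters
```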
Finally, consider the relation between the competitive learning rule (10.14) and Oja’s rule (10.13). If we define
\[ y_i = \delta_{i i_0} = \begin{cases} 1 & \text{for } i = i_0, \\ 0 & \text{otherwise}, \end{cases} \tag{10.15} \]

then the rule (10.14) can be written in the form of Oja's M-rule:

\[ \delta w_{ij} = \eta\, y_i \Big( x_j - \sum_{k=1}^{M} y_k w_{kj} \Big). \tag{10.16} \]

Equation (10.16) is reminiscent of Hebb's rule (Chapter 2) with weight decay.
10.3 Self-organising maps
In order to analyse high-dimensional data it is often useful to map the high-dimensional input patterns to a low-dimensional output space, to obtain a low-dimensional representation of the input distribution. Principal-component analysis (Section 6.3) does just that. However, it does not necessarily preserve distance. To visualise clusters or other arrangements of the input patterns, similar patterns or patterns that are close in input space should be mapped to nearby points in output space, and patterns that are far apart should be mapped to outputs that are far from each other. Maps with this property are called semantic or topographic maps.
Moreover, principal-component analysis is a linear method. As explained in Section 6.3, it projects the data to the space spanned by the leading eigenvectors of the correlation matrix. In many cases, however, the data may not lie in a linear subspace, as illustrated in Figure 10.6. In order to project the data onto the non-linear principal manifold (solid line), a non-linear map is needed.
In neuroscience, the term topographic map refers to the relation between the spatial arrangement of stimuli and the activation patterns in certain parts of the mammalian brain. Similar patterns of visual stimuli on the retina, for instance, activate close-by regions in the visual cortex [156]. Other cognitive stimuli, auditory and sensory, are mapped in analogous ways. The complex neural networks in the mammalian cortex host large numbers of such maps, arranged in a hierarchical fashion. They represent sensory stimuli in terms of spatially localised neural activation. How did this complex structure arise? One possibility is that the mappings are coded in the genetic sequence, that the connections are hard-wired, so to speak. However, it is observed that such maps can change over time [157], leading to the hypothesis that they are learned, and that our DNA merely encodes a set of fairly simple learning rules.
This motivated Kohonen [157, 158] and others to propose and analyse learning rules for topographic maps. The term self-organising map [18, 159] emphasises that the mapping develops in response to the stimuli it maps, that it learns in an unsupervised fashion. Kohonen’s model for a non-linear self-organising map relies on an ordered array of output neurons, as illustrated in Figure 10.7. The map learns to activate nearby output neurons for similar inputs. This is achieved using a competitive learning rule, similar to the learning rule (10.14) described in the previous Section. In order to represent the proximity or similarity of inputs, the rule is endowed with the notion of distance in the output array, by updating not only the winning neuron, but also those that are neighbours in the output array. To this end one replaces the competitive-learning rule (10.14) by
δwi =ηh(i,i0)(x−wi), (10.17)
where i_0 is the index of the winning neuron, the one with weight vector closest to the input x. The neighbourhood function h(i, i_0) depends on the distance of the neurons i and i_0 in the output array. The neighbourhood function has a maximum at i = i_0 and decreases as the distance between i and i_0 increases. One possibility is to assign decreasing values to h(i, i_0) for nearest neighbours, next-nearest neighbours, and so forth (Figure 10.8). Another possibility is to use a Gaussian function of the Euclidean distance |r_i − r_{i_0}| in the output array [1]:
\[ h(i, i_0) = \exp\Big( -\frac{1}{2\sigma^2}\, | r_i - r_{i_0} |^2 \Big). \tag{10.18} \]
Figure 10.7: Kohonen’s self-organising map. If patterns x (1) and x (2) are close in input space, then the two patterns activate neighbouring winning neurons in the output array (with coordinates r = [r1, r2]T). Often the dimension of the output array is much lower than that of input space.
Here r i is the position of neuron i in the output array (Figure 10.7). Different normalisations of the Gaussian [2] can be subsumed in different learning rates.
Kohonen’s rule has two parameters: the learning rate η, and the width σ of the neighbourhood function. Usually one adjusts these parameters as the learning proceeds. Typically one begins with large values for η and σ (ordering phase), and then reduces these parameters as the elastic network evolves (convergence phase): quickly at first and then in smaller steps, until the algorithm converges [1, 2, 157].
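A minimal Python sketch of Kohonen's rule (10.17) with the Gaussian neighbourhood function (10.18) is shown below; the uniform input distribution, the array size, and the schedules for η and σ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# output array: L x L neurons with positions r_i; weights live in 2-d input space
L = 10
r = np.array([[a, b] for a in range(L) for b in range(L)], dtype=float)
w = rng.uniform(0, 1, size=(L * L, 2))

T = 20000
eta0, etaf = 0.5, 0.01       # learning-rate schedule (ordering -> convergence phase)
sig0, sigf = 5.0, 0.5        # width schedule of the neighbourhood function

for t in range(T):
    frac = t / T
    eta = eta0 * (etaf / eta0) ** frac
    sigma = sig0 * (sigf / sig0) ** frac
    x = rng.uniform(0, 1, size=2)                   # pattern from a uniform P_data
    i0 = np.argmin(np.sum((w - x) ** 2, axis=1))    # winning neuron (closest weight)
    # Gaussian neighbourhood function in the output array, Eq. (10.18)
    h = np.exp(-np.sum((r - r[i0]) ** 2, axis=1) / (2 * sigma ** 2))
    w += eta * h[:, None] * (x - w)                 # Kohonen's rule, Eq. (10.17)
```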
According to Equations (10.17) and (10.18), similar patterns activate nearby neurons in output space, and their weight vectors change in similar ways. Kohonen's rule drags the winning weight vector w_{i_0} towards x, just as the competitive-learning rule (10.14), but it also drags the neighbouring weight vectors along. Figure 10.9 illustrates a geometrical interpretation of Kohonen's rule [18]. We can think of the weight vectors as pointing to the nodes of an elastic net that has the same layout as the output array. As one feeds patterns from the input distribution, the weights are updated, causing the nodes of the network to move. This changes the shape of the elastic network until it resembles the shape defined by the distribution of input
Figure 10.8: Nearest neighbours (•) and next-nearest neighbours (◦) to the neuron at the centre of the output array.
patterns. Figure 10.6 shows another example where the dimensionality of the output array (one-dimensional) is lower than that of the input space (two-dimensional). The algorithm finds a non-linear approximation to the data, the principal manifold. As opposed to the principal direction in principal-component analysis, the principal manifold need not be linear. Therefore it can approximate the data more precisely, leading to a smaller residual variance (Exercise 10.7).
In summary, Kohonen’s algorithm learns by distributing the weight vectors of the output neurons to reflect the distribution of input patterns. In general this works well, but problems occur at the boundaries. Why this happens is quite clear (Figure 10.9): since the density of patterns outside the parallelogram is low, the elastic network cannot be drawn very close to the boundary. To analyse how the boundaries affect learning for Kohonen’s rule, consider the steady-state condition
\[ \langle \delta w_i \rangle = \frac{\eta}{T} \sum_{t=1}^{T} h(i, i_0)\, \big[ x(t) - w_i^* \big] = 0. \tag{10.19} \]
This condition is more complicated than it looks at first sight, because i0 depends on the weights and on the patterns, as mentioned above. The steady-state condition (10.19) is very difficult to analyse in general. One of the reasons is that global geometric information is difficult to learn. It is usually much easier to learn local structures. This is particularly true in the continuum limit where we can analyse local learning progress using Taylor expansions.
The analysis of condition (10.19) in the continuum limit is due to Ritter and Schulten [160], and it is described in detail by Hertz, Krogh, and Palmer [1]. One assumes that there is a very dense network of output neurons, so that one can approximate i → r, i_0 → r_0, w_i → w(r), h(i, i_0) → h(r − r_0(x)), and (1/T) Σ_t → ∫dx P_data(x). In this continuum limit, Equation (10.19) reads

\[ \int d x \, P_{\rm data}(x)\, h\big(r - r_0(x)\big)\, \big[ x - w^*(r) \big] = 0. \tag{10.20} \]
Figure 10.9: Learning a distribution P_data(x) (gray) of two-dimensional real-valued inputs x with Kohonen's algorithm. Illustration of the dynamics of the self-organising map in terms of an elastic net. (a) Initial condition. (b) Intermediate stage. (c) In the steady state the elastic network resembles the shape defined by the input distribution P_data(x).
This is a condition for the steady-state learning outcome, the function w ∗(r ).
In the continuum limit, the position r 0(x ) of the winning neuron in the output
array for pattern x is given by
w∗(r0)=x . (10.21)
We use this relation to write Equation (10.20) as:
\[ \int d x \, P_{\rm data}(x)\, h\big(r - r_0(x)\big)\, \big[ w^*(r_0(x)) - w^*(r) \big] = 0. \tag{10.22} \]
Equation (10.21) defines a mapping r 0(x ) from input space to output space, the self-organising map (Figure 10.7). Assuming that this mapping is one-to-one, we change integration variable from x to r 0:
\[ \int d r_0 \, |\det \mathbb{J}| \, Q(r_0)\, h(r - r_0)\, \big[ w^*(r_0) - w^*(r) \big] = 0, \tag{10.23} \]
where Q(r_0) ≡ P_data(x(r_0)), and where the determinant represents the volume element of the variable transformation. Using Equation (10.21), the Jacobian 𝕁 of the transformation has elements
\[ J_{ij} = \frac{\partial w_i(r_0)}{\partial r_j}. \tag{10.24} \]
The neighbourhood function is sharply peaked at r = r 0, and this makes it possible to evaluate the steady-state condition (10.23) approximately, expanding the inte- grand in δr = r 0 − r , assuming that w ∗(r ) is a smooth function. This is illustrated in Figure 10.10, for one-dimensional inputs and outputs. We consider this special case not only to simplify the notation, but also because it is one of the few cases
Figure 10.10: In order to find out how the steady-state map w*(r) varies near r (gray line), one expands w* in δr around r: w*(r + δr) = w*(r) + (dw*/dr) δr + ½ (d²w*/dr²) δr² + ….
that admit mathematical analysis (Exercise 10.9). Expanding w ∗(r + δr ) as shown in Figure 10.10 yields
\[ w^*(r + \delta r) - w^*(r) = \frac{d}{dr} w^*(r)\, \delta r + \frac{1}{2} \frac{d^2}{dr^2} w^*(r)\, \delta r^2 + \ldots . \tag{10.25} \]

The other factors in Equation (10.23) are expanded in a similar way:

\[ J(r + \delta r) = \frac{d w^*}{dr} + \frac{d^2 w^*}{dr^2}\, \delta r + \ldots , \tag{10.26a} \]
\[ Q(r + \delta r) = P_{\rm data}(w^*) + \delta r\, \frac{d w^*}{dr}\, \frac{d}{dw} P_{\rm data}(w) + \ldots . \tag{10.26b} \]

Inserting these expressions into Equation (10.20), discarding terms of order higher than δr², and changing the integration variable to δr, one finds

\[ 0 = w' \Big[ \tfrac{3}{2}\, w'' P_{\rm data}(w) + (w')^2 \frac{d}{dw} P_{\rm data}(w) \Big] \int_{-\infty}^{\infty} d\delta r\, \delta r^2\, h(\delta r), \tag{10.27} \]
where we introduced the short-hand notation w′ = (d/dr) w*(r), and we used that the
neighbourhood function (10.18) is symmetric, h (−δr ) = h (δr ). Since the integral in
Equation (10.27) is non-zero, we must either have
\[ w' = 0 \quad \text{or} \quad \tfrac{3}{2}\, w'' P_{\rm data}(w) + (w')^2 \frac{d}{dw} P_{\rm data}(w) = 0. \tag{10.28} \]
The first solution can be excluded because it corresponds to a singular weight distribution that does not contain any geometrical information about the input distribution Pdata. The second solution gives
\[ \frac{w''}{w'} = -\frac{2}{3}\, \frac{w'\, \frac{d}{dw} P_{\rm data}(w)}{P_{\rm data}(w)}. \tag{10.29} \]

In other words, (d/dr) log|w′| = −(2/3) (d/dr) log P_data(w), and this means that |w′| ∝ [P_data(w)]^{−2/3}.
The density of output weights can be computed as
\[ \rho(w) = \int dr\, \delta[w - w^*(r)], \tag{10.30} \]

where δ(w) is the Dirac δ-function [161]. Changing variables in the δ-function,

\[ \delta[w - w^*(r)] = \sum_{j\,|\,w = w^*(r_j)} \frac{1}{|w'|}\, \delta(r - r_j), \tag{10.31} \]

and assuming that the function w*(r) is one-to-one, one finds

\[ \rho(w) = \frac{1}{|w'|} \propto [P_{\rm data}(w)]^{2/3}. \tag{10.32} \]
This tells us that the self-organising map learns the input distribution in the following way: the distribution of output weights in the steady state reflects the distribution of input patterns. Equation (10.32) tells us that the two distributions are not equal (equality would have been a perfect outcome). The distribution of weights is instead proportional to [P_data(w)]^{2/3}. Little is known in higher dimensions, but the general
idea is that the elastic network has difficulties reaching the corners and edges of the domain where the input distribution is non-zero.
The output of a self-organising map can be interpreted in different ways. For low-dimensional inputs and outputs, one can simply plot the map w*(r), as in Figure 10.7. Dense regions of weights point to regions in input space with a high density of inputs. Often the output dimension is taken to be much lower than the dimension of input space. In this case the self-organising map performs non-linear dimensionality reduction, and it can be used to find clusters in high-dimensional input data [162]. The analysis proceeds in two steps. First, one runs Kohonen's algorithm until the map has converged to a steady state. Second, one feeds all inputs into the net, and for each input one determines the location of the winning neuron in the output array. The spatial activation patterns in the output array represent clusters of similar inputs. This is illustrated in Figure 10.11, which shows how a self-organising map represents handwritten digits from the MNIST data set. To reveal the semantic map, the Figure labels clusters of outputs that correspond to the same digits (as determined by the labels in the training set). We see that the self-organising map groups the same digits together, but it has some difficulty distinguishing the digits 3 and 8, and also 4 and 9.
Figure 10.11: Clustering of hand-written digits (MNIST data set) with a self- organising map with a 30 × 30 output array. In the shaded regions the outputs are quite certain: here the winning neurons are activated by the indicated digit in 80% of the cases. The white regions correspond to outputs where the majority digit appears in less than 80% of the cases, or to outputs that are never activated, or only once. Schematic, based on simulations performed by Juan Diego Arango.
10.4 K -means clustering
Sections 10.2 and 10.3 described different ways of finding clusters in input data. In particular, it was shown how self-organising maps can find clusters in high- dimensional input data, and represent them in a low-dimensional, non-linear pro- jection. K -means clustering [2] is an alternative unsupervised-learning algorithm for finding clusters in the input data. Let us compare and contrast this algorithm with Kohonen’s self-organising map. The goal is to cluster input patterns x (μ) , μ = 1, . . . , p into K clusters. Usually K is much smaller than the number of inputs, p , and than the input dimension N .
A solution of the clustering task is a mapping k (μ) that associates each input x (μ) with one of the clusters k = 1, . . . , K . The function k (μ) is determined by minimising
Figure 10.12: Schematic illustration of the K-means clustering algorithm with two weight vectors w_1 and w_2. The radii of the disks equal s_1 and s_2, Equation (10.34).
the energy function

\[ H(w_1, \ldots, w_K) = \frac{1}{2} \sum_{k=1}^{K} \Big( \sum_{\mu | k(\mu) = k} | x^{(\mu)} - w_k |^2 \Big). \tag{10.33} \]

The second sum is over all values of μ that satisfy k(μ) = k. The vector w_k becomes the average of all pattern vectors in cluster k, and the expression in the parentheses is the variance associated with this cluster:

\[ \sigma_k^2 = \sum_{\mu | k(\mu) = k} | x^{(\mu)} - w_k |^2. \tag{10.34} \]

In other words, H measures the sum of the cluster variances σ_k². A solution to the clustering problem corresponds to a local minimum of H. To determine the cluster vectors w_k and the corresponding variances σ_k², one starts from an initial guess for k(μ). For each cluster, one begins by adjusting the w_k to minimise the cluster variance:

\[ \underset{w_k}{\arg\min} \sum_{\mu | k(\mu) = k} | x^{(\mu)} - w_k |^2. \tag{10.35} \]

In a second step, one optimises the encoding function

\[ k(\mu) = \underset{1 \le k \le K}{\arg\min} \, | x^{(\mu)} - w_k |^2, \tag{10.36} \]
given the vectors w_k. These steps are repeated until a satisfactory solution is found (Figure 10.12). The solution is not unique; usually the algorithm converges to a local minimum of H. In practice one should try different random initialisations to find the best local minimum.
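The two alternating steps (10.35) and (10.36) can be sketched in a few lines of Python; the function name and the synthetic data are illustrative.

```python
import numpy as np

def k_means(x, K, n_iter=100, seed=0):
    """Alternate between the steps (10.35) and (10.36) to minimise H, Eq. (10.33)."""
    rng = np.random.default_rng(seed)
    w = x[rng.choice(len(x), K, replace=False)]        # initial cluster vectors
    for _ in range(n_iter):
        # Eq. (10.36): assign each pattern to the closest cluster vector
        k_of_mu = np.argmin(((x[:, None, :] - w[None, :, :]) ** 2).sum(-1), axis=1)
        # Eq. (10.35): each w_k becomes the mean of the patterns in cluster k
        for k in range(K):
            members = x[k_of_mu == k]
            if len(members) > 0:
                w[k] = members.mean(axis=0)
    return w, k_of_mu

# illustrative usage with synthetic two-cluster data
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(1.0, 0.1, (100, 2))])
w, labels = k_means(x, K=2)
```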
All three algorithms, competitive learning, the self-organising map, and K -means clustering move weight vectors towards clusters in input space. A difference between
Figure 10.13: Linear separation of the XOR function by the non-linear mapping (10.37). (a) In the input plane the problem is not linearly separable. (b) In the u1-u2 plane the problem is homogeneously linearly separable.
the self-organising map and the other two algorithms is that the self-organising map uses a neighbourhood function (so that similar inputs activate close-by neurons in the output array), and updates their weight vectors in similar fashion. In this way, a self-organising map with a large output array can find a smooth parameterisation of the principal manifold. If we shrink the neighbourhood function to the centre point in Figure 10.8, all geometric information is lost, and the self-organising map becomes equivalent to competitive learning (Algorithm 9). Essentially, competitive learning and K -means clustering are sequential and batch versions of the same algorithm [1]. So the self-organising map becomes equivalent to K -means clustering when the neighbourhood range tends to zero.
10.5 Radial basis functions
Problems that are not linearly separable can be solved by perceptrons with hidden layers, as we saw in Chapter 5. Figure 5.13(b), for example, shows a piecewise linear decision boundary parameterised by hidden neurons.
Separability can also be achieved by a non-linear transformation of input space. Figure 10.13 shows how the XOR problem can be transformed into a linearly separa- ble problem by the transformation
\[ u_1(x) = (x_2 - x_1)^2 - \tfrac{1}{2} \quad \text{and} \quad u_2(x) = x_2. \tag{10.37} \]
The Figure shows the non-separable problem in the x1-x2 plane, and in the new coordinates u1 and u2. The problem is homogeneously linearly separable in the u1-u2 plane. We can solve it by a single McCulloch-Pitts neuron with weights W and zero threshold, parameterising the decision boundary as W · u (x ) = 0.
It is even better to map the patterns (non-linearly) to a space of higher dimension, because Cover's theorem (Section 5.4) says that it is easier to separate the patterns there: consider a set u(x) = [u_1(x),…,u_m(x)]^T of m polynomial functions of finite order that embed N-dimensional input space in an m-dimensional space. Then the probability that a problem with p points x^(μ) in N-dimensional input space is separable by a polynomial decision boundary is given by P(p, m) [Equation (5.29)] [2, 72]. Note that this probability is independent of the dimension N of input space.
The question is of course how to find the non-linear mapping u (x ). One possibil- ity is to use radial basis functions. The idea is to parameterise the functions u j (x ) in terms of weight vectors w j , and to use an unsupervised-learning algorithm to find weights that separate the input data. A common choice [2] is to use radial basis functions of the form:
\[ u_j(x) = \exp\Big( -\frac{1}{2 s_j^2}\, | x - w_j |^2 \Big). \tag{10.38} \]
Note that these functions are not of the finite-order polynomial form that was assumed above. So strictly speaking we cannot invoke Cover's theorem. In practice the mapping u_j(x) works nevertheless quite well. The parameters s_j parameterise the widths of the radial basis functions. In the simplest version of the algorithm they are set to unity. Hertz, Krogh, and Palmer [1] discuss radial basis-function networks with normalised radial basis functions
\[ u_j(x) = \frac{ \exp\big( -\frac{1}{2 s_j^2} | x - w_j |^2 \big) }{ \sum_{k=1}^{m} \exp\big( -\frac{1}{2 s_k^2} | x - w_k |^2 \big) }. \tag{10.39} \]
Other choices for radial basis functions are given by Haykin [2].
Figure 10.14 shows a radial basis-function network for N = 2 and m = 4. The four neurons in the hidden layer stand for the four radial basis functions (10.38) that map the inputs to four-dimensional u-space. The network looks like a perceptron (Chapter 5). But here the hidden layer works in a different way. Perceptrons have hidden McCulloch-Pitts neurons that compute non-local outputs σ(w_j · x − θ). The output of radial basis functions u_j(x), by contrast, is localised in input space [Figure 10.15 (left)]. We saw in Section 7.1 how to make localised basis functions out of McCulloch-Pitts neurons with sigmoid activation functions σ(b), but one needs
two hidden layers to do that [Figure 10.15 (right)].
Radial basis functions produce localised outputs with a single hidden layer: they divide up input space into localised regions, each corresponding to one radial basis function. Imagine for a moment that we have as many radial basis functions as input patterns. In this case we can simply take w_ν = x^(ν) for ν = 1,…,p. The linear
Figure 10.14: Radial basis-function network for N = 2 inputs and m = 4 radial basis functions (10.38). The output neuron has a linear activation function, weights W, and zero threshold.
Figure 10.15: Comparison between radial basis-function network and perceptron. Left: the output of a radial basis function is localised in input space. Right: to achieve a localised output with sigmoid units one needs two hidden layers (Figure 7.4).
output in Figure 10.14 computes O (μ) = W · u (x (μ) ), and so the classification problem in u-space takes the form
\[ \sum_{\mu=1}^{p} W_\mu U_{\mu\nu} = t^{(\nu)} \tag{10.40} \]
with U_μν = u_ν(x^(μ)). If all patterns are pairwise different, x^(μ) ≠ x^(ν) for μ ≠ ν, then the matrix 𝕌 is invertible [2]. In this case the solution of the classification problem reads
\[ W_\mu = \sum_{\nu=1}^{p} t^{(\nu)} \big[ \mathbb{U}^{-1} \big]_{\nu\mu}, \tag{10.41} \]
where 𝕌 is the symmetric p × p matrix with elements U_μν.
Algorithm 10 radial basis functions
initialise the weights w_jk independently randomly from [−1, 1];
set all widths to s_j = 0;
for t = 1,…,T do
    feed randomly chosen pattern x^(μ);
    determine winning neuron j_0: u_{j_0} ≥ u_j for all values of j;
    update widths: s_j = min_{j≠k} |w_j − w_k|;
    update only winning neuron: δw_{j_0} = η(x^(μ) − w_{j_0});
end for
In practice one can get away with fewer radial basis functions by choosing their weights to point in the directions of clusters of input data. To this end one uses unsupervised competitive learning (Algorithm 10), where the index j_0 of the winning neuron is defined to be the one with largest u_j. How are the widths s_j determined? The width s_j of radial basis function u_j(x) is taken to be equal to the minimum distance between w_j and the centers of the surrounding radial basis functions. Once weights and widths of the radial basis functions are found, the weights of the output neuron are determined by minimising
\[ H = \frac{1}{2} \sum_{\mu} \big( t^{(\mu)} - O^{(\mu)} \big)^2 \tag{10.42} \]
with respect to W. This works even if 𝕌 is not invertible. An approximate solution can be obtained by stochastic gradient descent on H, keeping the parameters of the radial basis functions fixed. Cover's theorem indicates that the problem is more likely to be separable if the embedding dimension m is higher.
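The hybrid scheme can be sketched as follows in Python: the centres w_j are placed by competitive learning roughly as in Algorithm 10, and the output weights W are then obtained from a least-squares solution of Equation (10.42) (used here in place of stochastic gradient descent). Data, parameter values, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(x, w, s):
    """Radial basis functions, Eq. (10.38): u_j(x) = exp(-|x - w_j|^2 / (2 s_j^2))."""
    d2 = ((x[:, None, :] - w[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s[None, :] ** 2))

# illustrative two-class problem in the plane
x = rng.uniform(-1, 1, size=(400, 2))
t = np.where(x[:, 0] * x[:, 1] > 0, 1.0, -1.0)

# unsupervised step: place the centres by competitive learning (cf. Algorithm 10)
m, eta = 10, 0.05
w = x[rng.choice(len(x), m, replace=False)].copy()
for _ in range(5000):
    xi = x[rng.integers(len(x))]
    j0 = np.argmin(((w - xi) ** 2).sum(-1))          # winning radial basis function
    w[j0] += eta * (xi - w[j0])
s = np.array([np.sqrt(np.delete(((w - w[j]) ** 2).sum(-1), j).min()) for j in range(m)])

# supervised step: output weights W from a least-squares solution of Eq. (10.42)
U = rbf(x, w, s)
W, *_ = np.linalg.lstsq(U, t, rcond=None)
print("training error:", np.mean(np.sign(U @ W) != t))
```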
Radial basis-function networks are similar to the perceptrons described in Chapters 5 to 7, in that they are feed-forward networks designed to solve classification problems. A fundamental difference is that the parameters of the radial basis functions are determined by unsupervised learning, whereas perceptrons are trained using supervised learning for all units. While McCulloch-Pitts neurons adjust their weights to minimise the deviation of their output from given targets, the radial basis functions compute weights by maximising the output u_j as a function of j. The algorithm for finding the weights of the radial basis functions is summarised in Algorithm 10. Further, as opposed to the deep networks from Chapter 7, radial basis-function networks have only one hidden layer, and a linear output neuron. In summary, radial basis-function networks learn using a hybrid scheme: unsupervised learning for the parameters of the radial basis functions, and supervised learning for the weights of the output neuron.
Figure 10.16: Autoencoder (schematic). Both encoder and decoder consist of a number of fully connected or convolutional layers (depicted as squares). In the layout shown, the bottleneck consists of a layer with very few neurons. Sparse autoencoders have bottlenecks with many neurons, but only few are activated.
10.6 Autoencoders
Multi-layer perceptrons, layered feed-forward networks, were developed for super- vised learning, as described in Part II. Such layouts can also be used for unsupervised learning. Examples are autoencoders and generative adversarial networks.
Autoencoders employ layered feed-forward networks for unsupervised learning of an unlabeled data set of input patterns, using the inputs as targets, t (μ) = x (μ). The layout is illustrated in Figure 10.16. The network consists of two main parts, an encoder (on the left), and a decoder (on the right). The encoder consists for instance of several fully connected or convolutional layers and maps the inputs to a bottleneck layer with a small number M of neurons, significantly smaller than the input dimension, M ≪ N . We denote the states of the bottleneck neurons by z j . The encoder corresponds to a non-linear mapping z = f e (x ). The decoder maps the bottleneck (or latent) variables back to the inputs, x = f d(z ). One adjusts weights and thresholds by backpropagation until the network learns to approximate the inputs as
\[ x = f_{\rm d}[ f_{\rm e}(x) ]. \tag{10.43} \]

The energy function reads:

\[ H = \frac{1}{2} \sum_{\mu} \big| x^{(\mu)} - f_{\rm d}[ f_{\rm e}(x^{(\mu)}) ] \big|^2, \tag{10.44} \]
where |x|² = x^T x. In other words, the autoencoder learns the identity function. The point is that the identity is represented in terms of two non-linear functions, the encoder f_e and the decoder f_d. While the identity function is trivial, the encoding and decoding functions need not be. The bottleneck ensures that the network does not simply learn f_e(x) = f_d(x) = x.
The latent variables z may encode interesting properties of the input patterns. If the number of neurons is much smaller than the number of pattern bits, as indicated by the term bottleneck, the encoder is a low-dimensional (compressed) representation of the input data. In this way autoencoders can perform non-linear dimensionality reduction, like self-organising maps (Section 10.3). If both encoder and decoder are linear functions with zero thresholds, then H = ½ Σ_μ |x^(μ) − f_d[f_e(x^(μ))]|² with linear maps f_e and f_d. In this case, z_1(x),…,z_M(x) are simply the first M principal components of zero-mean input data [163] (Exercise 10.14).
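The statement about linear autoencoders can be checked numerically. The Python sketch below does not train by backpropagation; instead it uses the closed-form optimum of the linear reconstruction problem (projection onto the leading principal directions), which is what a trained linear autoencoder converges to. The synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# zero-mean synthetic data (illustrative)
p, N, M = 2000, 6, 2
x = rng.normal(size=(p, N)) @ rng.normal(size=(N, N))
x -= x.mean(axis=0)

# leading M principal directions from the data-covariance matrix
C = x.T @ x / p
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
V = eigvecs[:, -M:]                      # columns: top-M principal directions

# optimal linear autoencoder with zero thresholds, written in closed form
# (projection onto the principal subspace) instead of training by backpropagation
z = x @ V                                # latent variables, f_e(x) = V^T x
reconstruction = z @ V.T                 # f_d(z) = V z

residual = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
print(residual, eigvals[:-M].sum())      # equal: the unexplained variance is the sum
                                         # of the discarded covariance eigenvalues
```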
Sparse autoencoders [164] have a large number of neurons in the bottleneck, possibly more than the number of pattern bits. But only a small number of bottleneck neurons are allowed to be active at the same time. The idea is that sparse representations of input data are more robust than dense ones, and generalise more reliably. At least high-dimensional but sparse representations of binary classification problems are more likely to be linearly separable (Section 5.4). There are different ways of enforcing sparsity, for instance using L1- or L2-regularisation (Section 7.6.1). An alternative [164] is to ensure that the average activation of each bottleneck neuron with sigmoid activation function,
\[ a_j = \frac{1}{p} \sum_{\mu=1}^{p} \sigma\big( b_j^{(\mu)} \big), \tag{10.45} \]

remains small. This is achieved by adding the term

\[ \lambda \sum_j \Big[ a \log\frac{a}{a_j} + (1 - a) \log\frac{1 - a}{1 - a_j} \Big] \tag{10.46} \]

to the energy function, with Lagrange multiplier λ (Section 6.3). This term penalises any deviation of a_j from a_j = a ≪ 1, where a > 0 is a sparsity parameter. Each term in the sum is non-negative and vanishes when a_j = a for all j, because each term can be interpreted as the Kullback-Leibler divergence (Section 4.4) between two Bernoulli distributions with parameters a and a_j.
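For concreteness, the penalty (10.46) can be evaluated from the bottleneck activations as in the following Python sketch (names and default values are illustrative):

```python
import numpy as np

def sparsity_penalty(activations, a=0.05, lam=1.0):
    """Kullback-Leibler sparsity term, Eq. (10.46), added to the energy function.

    activations: array of shape (p, M) with the sigmoid outputs sigma(b_j^(mu))
    a:           sparsity parameter (target average activation)
    lam:         Lagrange multiplier lambda
    """
    a_j = activations.mean(axis=0)                  # average activations, Eq. (10.45)
    a_j = np.clip(a_j, 1e-8, 1 - 1e-8)              # avoid log(0)
    kl = a * np.log(a / a_j) + (1 - a) * np.log((1 - a) / (1 - a_j))
    return lam * kl.sum()
```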
Variational autoencoders [165–167] have layouts similar to the one shown schematically in Figure 10.16, but their purpose is quite different. Variational autoencoders are generative models (Section 4.5): just like restricted Boltzmann machines (Section 4.5) they approximate a data distribution of inputs P_data(x), and make it possible to sample from it. As an example consider the MNIST data set of handwritten digits. The
patterns define a data distribution that encodes the properties of the digits in terms of covariances and higher-order correlations. The question is how to generate new digits from this distribution, different from those in the data set, yet sharing their defining properties. In other words, how can a machine learn to generate images that look like handwritten digits?
The idea of variational autoencoders is to represent the data distribution in terms of a Gaussian distribution PL(z ) of latent variables z , using the fact that one can approximate any given data distribution Pdata(x ) in terms of PL(z ) by a suitable non-linear transformation. Variational autoencoders are trained not unlike neural networks, but an essential difference to the algorithms described in Part II is that variational autoencoders learn probabilities rather than deterministic input-output mappings.
Given the Gaussian distribution P_L(z) of the latent variables, the goal is to maximise the log-likelihood (Section 4.5):

\[ \mathscr{L} = \log P(x) = \log \int \! dz \, P(x | z)\, P_L(z). \tag{10.47} \]
Here P(x|z) is the probability to generate x given z. In the simplest case, this distribution is assumed to be Gaussian with mean μ_P(z) and correlation matrix ℂ_P(z). The decoder represents these functions in terms of a multilayer perceptron or a convolutional neural net. Weights and thresholds are determined to maximise ℒ by gradient ascent. To this end we must find an efficient way of computing ℒ and its gradients. One possibility is Monte-Carlo sampling (Section 4.2), but this is not very efficient because most values of z drawn from P_L(z) result in unlikely patterns x, with only negligible contributions to ℒ. To get around this problem, one needs to know which values of z are likely to produce a given pattern x. The idea is to learn a second approximate distribution Q(z|x) of z given x. We can think of Q(z|x) as an encoder. So P(x|z) corresponds to the decoder f_d discussed above, while Q(z|x) corresponds to f_e. An important difference is that P and Q are probabilities, not deterministic functions.
To determine a good approximation Q (z |x ), we minimise the difference between Q (z |x ) and the unknown exact distribution P (z |x ),
\[ D_{\rm KL}[Q(z|x), P(z|x)] = \langle \log Q(z|x) - \log P(z|x) \rangle_Q . \tag{10.48} \]

A first trick is to rewrite this expression using Bayes' theorem [29],

\[ P(z|x) = P(x|z)\, P_L(z) / P(x). \tag{10.49} \]

This gives:

\[ \mathscr{L} - D_{\rm KL}[Q(z|x)|P(z|x)] = \langle \log P(x|z) \rangle_Q - D_{\rm KL}[Q(z|x)|P_L(z)], \tag{10.50} \]
The second trick is to recognise that the l.h.s. of Equation (10.50) is a suitable target function to maximise. We want to maximise ℒ subject to the constraint that the unknown function Q(z|x) approximates the probability P(z|x) of z encoding the pattern x. Usually one takes Q(z|x) to be a Gaussian with mean μ_Q(x) and correlation matrix ℂ_Q(x). The task is then to determine the functions μ_P(z), ℂ_P(z), μ_Q(x), and ℂ_Q(x) by adjusting the weights of two neural networks, the encoder and the decoder.
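If one assumes, as is common, that P_L(z) is a standard Gaussian and that Q(z|x) has a diagonal correlation matrix (both assumptions of this sketch, not statements from the text), the term D_KL[Q(z|x)|P_L(z)] in Equation (10.50) can be evaluated in closed form:

```python
import numpy as np

def kl_q_to_standard_normal(mu_q, var_q):
    """D_KL[ N(mu_q, diag(var_q)) || N(0, I) ], summed over the latent dimensions:
    1/2 * sum_j ( var_q + mu_q^2 - 1 - log var_q )."""
    mu_q, var_q = np.asarray(mu_q), np.asarray(var_q)
    return 0.5 * np.sum(var_q + mu_q ** 2 - 1.0 - np.log(var_q))

# illustrative encoder output for one pattern x: mean and variance of Q(z|x)
print(kl_q_to_standard_normal([0.5, -0.2], [0.8, 1.3]))
```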
This task, maximising the r.h.s. of Equation (10.50), is not as straightforward as it may seem, because the target function (10.50) involves an average over the distribution Q which in turn depends on the weights. The question is how to move the derivative ∂/∂w_mn inside the average ⟨···⟩_Q, to obtain an unbiased expression for the weight updates δw_mn. In other words, the goal is to ensure that the average weight increments are proportional to the gradients of the target function (10.50). A related problem occurs when training binary stochastic neurons (Section 11.1). Similarities and differences are described in Ref. [168].
One solution is to use stochastic backpropagation [167]. In its simplest form, the algorithm makes use of a relation for the gradient of the average of a test function F (z ) with a Gaussian probability Q (z ;wi j ) that depends on the weights wi j :
\[ \frac{\partial}{\partial w_{mn}} \langle F(z) \rangle_Q = \Big\langle\, b \cdot \frac{\partial \mu_Q}{\partial w_{mn}} + \frac{1}{2}\, \mathrm{tr}\Big( \mathbb{H}\, \frac{\partial \mathbb{C}_Q}{\partial w_{mn}} \Big) \Big\rangle_Q . \tag{10.51} \]
Here b and ℍ are the gradient and the Hessian of the function F(z), and ℂ_Q is the correlation matrix of Q(z|x). A challenge is that they tend to be difficult to compute reliably. Suitable approximations are described in Ref. [167]. The expression inside the average in Equation (10.51) is the desired weight increment. Iterating this learning rule, one determines the parameters of the Gaussian distributions P(x|z) and Q(z|x). This makes it possible to sample x efficiently, by sampling z and then applying the decoder P(x|z).
Generative adversarial networks [169] are generative models based on learning rules similar to that described above for variational autoencoders, but there are some differences in detail. Generative adversarial networks consist of two multilayer perceptrons, a generator and a discriminator. The generative network produces new outputs from a given data distribution (fakes), and the task of the discriminator is to classify these outputs into two classes: real or fake data. Generator and discriminator are trained together. The weights of the generator are adjusted to maximise the classification error of the discriminator, while those of the discriminator are trained to minimise this error [170].
10.7 Summary
The unsupervised-learning algorithms described in Sections 10.1 and 10.2 are based on Hebb’s rule. These algorithms can learn different features of unlabeled input data:
they can detect the familiarity of inputs, perform principal-component analysis, and identify clusters in the input data. Self-organising maps also rely on Hebb’s rule. An important difference is that the outputs are arranged in an array, and that output neurons that are close-by in the output array are updated in similar ways. Self- organising maps can therefore represent topographic and semantic maps, where close-by or similar inputs are mapped to nearby outputs. When the dimension of the output array is much lower than the input dimension, self-organising maps perform non-linear dimensional reduction. Radial basis-function networks are classifiers, just like multilayer perceptrons. Their output neurons are trained in the same way, using labeled input data. However, the decision boundaries of radial basis-function networks are polynomial functions (not just hyperplanes), and their parameters are determined by unsupervised learning.
Autoencoders are multilayer perceptrons. They can learn to encode non-linear features of unlabeled input data by using the input patterns as targets. Finally, gener- ative adversarial networks do not require labeled inputs, so they can be considered unsupervised-learning machines. They are used to generate synthetic data in or- der to expand training sets for supervised learning, and pose an ethical dilemma because they can be used to generate deep fakes [171], manipulated videos where someone’s facial expression and speech are replaced by another person’s.
In summary, the simple algorithms described in this Chapter provide a proof of concept: how machines can learn without labels.
10.8 Further reading
The primary source for Sections 10.1 and 10.2 is the book by Hertz, Krogh, and Palmer [1]. A good reference for self-organising maps is Kohonen's book [158]. Radial basis-function networks are discussed by Haykin in Chapter 5 of his book [2]. It has been argued that radial basis-function networks do not generalise as well as perceptrons do [172]. To solve this problem, Poggio and Girosi [173] suggested to determine the parameters w_j of the radial basis functions by supervised learning, using stochastic gradient descent.
Autoencoders can generate non-linear, low-dimensional representations of an input distribution. The relation to principal component analysis is discussed in Refs. [163, 174].
The recommended introduction to variational autoencoders is the tutorial by Doersch [166]. He also mentions that the underlying mathematics for variational autoencoders is similar to that of Helmholtz machines (Section 4.7), although the two machines learn in quite different ways.
Variational autoencoders are used for a number of different purposes. Ref. [175]
suggests to employ a variational autoencoder for active learning (Section 7.8). The idea is to represent the input distribution in terms of lower-dimensional latent variables, and to use K -means clustering (Section 10.4) to identify groups of patterns that should be labeled. Variational autoencoders have also been used for outlier detection [176] and language generation [177].
10.9 Exercises
10.1 Continuous Oja's rule. Using the ansatz w = q/|q|, show that Equations (10.4) and (10.5) describe the same angular dynamics. The difference is just that w remains normalised to unity, whereas the norm of q may increase or decrease. See Ref. [155].
10.2 Data-covariance matrix. Determine the data-covariance matrix and the principal direction for the data shown in Figure 10.4.
10.3 Oja’s rule. The aim of unsupervised learning is to construct a network that
learns the properties of a distribution Pdata(x) of input patterns x = [x1,…,xN ]T.
Consider one linear output that computes y = Σ_{j=1}^{N} w_j x_j. Show that Oja's learning rule δw_j = η y (x_j − y w_j) has the stable steady state w* corresponding to the leading eigenvector of the matrix ℂ′ with elements C′_ij = ⟨x_i x_j⟩. Here ⟨···⟩ denotes the
average over Pdata(x ).
10.4 Linear stability analysis for Oja’s rule. Iterate the stochastic dynamics (10.3) near a fixed point w ∗, linearise, and average the result over a random sequence of patterns x . Expand the result to leading order in the learning rate η to show that the linear stability of w ∗ to this order is determined by Equation (10.9).
10.5 Competitive learning for binary patterns. A competitive learning rule for binary patterns with 0/1 bits reads δw_ij = η(x_j / Σ_{k=1}^{N} x_k − w_ij) for the winning neuron i = i_0, and δw_ij = 0 otherwise. Show that the steady-state weight vectors w_{i_0} have positive components and are normalised as Σ_{k=1}^{N} w_{i_0,k} = 1.
10.6 Self-organising map. Write a computer program that implements Koho- nen’s algorithm with a two-dimensional output array, to learn the properties of a two-dimensional input distribution that is uniform inside an equilateral triangle with sides of unit length, and zero outside. Hint: to generate this distribution, sam- ple at least 1000 points uniformly distributed over the smallest square that contains the triangle, and then accept only points that fall inside the triangle. Increase the number of weights and study how the two-dimensional density of weights near the
boundary depends on the distance from the boundary.
10.7 Principal manifolds. Create a data set similar to the one shown in Figure 10.6, using x_2 = x_1² + r where r is a Gaussian random number with mean zero and variance σ_r² = 0.01. Determine the principal component (dashed line in Figure 10.6) of the data set (Section 6.3). Use Kohonen's algorithm to find a better approximation to the data, the principal manifold (solid line). For both cases, determine the variance of the data that remains unexplained.
10.8 Iris data set. Write a computer program that combines a two-dimensional self-organising map with a simple classifier to classify the Iris data set (Figure 5.1).
10.9 Steady state of two-dimensional Kohonen algorithm. Repeat the analysis of Equation (10.23) for a two-dimensional self-organising map. Derive the equivalent of Equation (10.27) and determine a relation between the weight density ρ and P_data, assuming that the data distribution factorises, P_data(w) = f(w_1) g(w_2) [160]. Assume that w(r) = u + iv can be written as an analytic function of r = x + iy and derive a relation between ρ and P_data.
10.10 Radial basis functions for XOR. Show that the two-dimensional Boolean XOR problem with 0/1 inputs can be solved using the two radial basis functions u1(x(μ)) = exp(−|x(μ) −w1|2) and u2(x(μ)) = exp(−|x(μ) −w2|2) with w1 = (1,1)T and w 2 = [0,0]T. Draw the positions of the four input patterns in the transformed space with coordinates u1 and u2.
10.11 Radial basis functions. Table 10.1 describes a classification problem. Show that this problem can be solved as follows. Transform the inputs x to two-dimensional coordinates u1, u2 using radial basis functions:
\[ u_1 = \exp\big( -\tfrac{1}{4} | x - w_1 |^2 \big), \quad \text{with } w_1 = [-1, 1, 1]^{\rm T}, \tag{10.52} \]
\[ u_2 = \exp\big( -\tfrac{1}{4} | x - w_2 |^2 \big), \quad \text{with } w_2 = [1, 1, -1]^{\rm T}. \tag{10.53} \]
Plot the positions of the eight input patterns in the u_1-u_2 plane. Hint: to compute u_j use the following approximations: exp(−1) ≈ 0.37, exp(−2) ≈ 0.14, exp(−3) ≈ 0.05. The transformed data is used as input to a simple perceptron O^(μ) = sgn( Σ_{j=1}^{2} W_j u_j^(μ) − Θ ). Draw a decision boundary in the u_1-u_2 plane and determine the corresponding weight vector W, as well as the threshold Θ.
10.12 A two-dimensional binary classification problem. Figure 10.17 illustrates a binary classification problem defined on the square −1 ≤ x_1 ≤ 1 and −1 ≤ x_2 ≤ 1 with decision boundary x_2 = sin(2πx_1). Make your own input data set by distributing 1000 inputs in the two regions shown, half of them with target t = +1, the other
x1   x2   x3    t
−1   −1   −1    1
−1   −1    1    1
−1    1   −1    1
−1    1    1   −1
 1   −1   −1    1
 1   −1    1    1
 1    1   −1   −1
 1    1    1    1

Table 10.1: Inputs and targets for Exercise 10.11.
Figure 10.17: Non-linear decision boundary (gray line) x_2 = sin(2πx_1) for a non-linearly separable binary classification problem defined on the square −1 ≤ x_1 ≤ 1 and −1 ≤ x_2 ≤ 1. Exercise 10.12.
half with t = −1. Find approximate decision boundaries using a radial-basis function network with m radial basis functions, for m = 5, 10, 20 and 100. Plot the decision boundaries in the input plane and determine the classification errors.
10.13 Autoencoder for MNIST. Train the autoencoder network shown in Figure 10.18 on the MNIST data set. Analyse which properties of the data set the latent variables z1 and z2 encode.
10.14 Autoencoder with linear units. Figure 10.19 shows the layout of an autoen- coder [Equations (10.43) and (10.44)]. All neurons are linear units. Assume that the input data has zero mean, so that all thresholds can be set to zero. Prove that the latent variables z = [z1, z2]T are the top principal components of the input data after training the autoencoder with backpropagation. Repeat the proof for inputs with non-zero means.
Figure 10.18: Layout for an autoencoder [Equations (10.43) and (10.44)] to analyse the MNIST data set. The layers are fully connected; the numbers of neurons are given in the Figure. Exercise 10.13.
Figure 10.19: Layout of an autoencoder [Equations (10.43) and (10.44)]. The layers are fully connected. The bottleneck consists of two hidden neurons with states z_1 and z_2 (latent variables). Input and output dimensions are equal to N. Exercise 10.14.
11 Reinforcement learning
Supervised learning requires labeled data, where each input comes with a target the network is supposed to learn. Unsupervised learning, by contrast, does not require labeled data. Reinforcement learning lies between these extremes. The term reinforcement describes the principle of learning by means of a reward function. This function assigns penalties or rewards to the network output, depending on how the output relates to the learning goal. For a neural network with a vector of outputs, the reward function could be
\[ r = \begin{cases} +1 & \text{(reward) if all outputs correct}, \\ -1 & \text{(penalty) otherwise}. \end{cases} \tag{11.1} \]
The goal is to learn to produce outputs that receive a reward more frequently than those that trigger a penalty. We say that rewarded outputs are reinforced. The feedback may be random, given by a distribution initially unknown to the network.
The reward function reflects the learning goal. The training process as well as the learning outcome depend crucially on this function. Suppose one replaces the reward function (11.1) by the more lenient alternative: r = 1 if at least one output is correct, and r = −1 if all outputs are wrong. Naturally this leads to more errors, possibly not a good idea if the goal is to teach a robot to fly.
One distinguishes two different types of reinforcement problems, associative and non-associative tasks [178]. An example for a non-associative task is the N -armed bandit problem [16]. Imagine N slot machines with different reward distributions, initially unknown to the player. Given a finite amount of money, the question is in which order to play the machines so as to maximise the overall profit. The dilemma is whether to stick with a machine that yields a decent reward, or whether to try out other machines that may yield a low reward initially, but could give much higher rewards eventually (exploit-versus-explore dilemma). In this type of problem, the player receives only the reinforcement signal, no other inputs. In associative tasks, by contrast, the agent receives inputs, or stimuli, and it should learn to associate with each stimulus the output that yields the highest reward. Such tasks occur for instance in behavioural psychology, where the problem is to discriminate between different stimuli, and to associate the right behaviour with each stimulus.
In general, such associative tasks can be described as sequential decision processes (Figure 11.1), where an agent explores a sequence of states s0,s1,s2,… through a sequence of actions a 0 , a 1 , a 2 , . . .. Consider for instance a motile microorganism in the turbulent ocean that should swim to the water surface as quickly as possible [179]. It determines its state by observing the local environment. The microorganism might measure local strain and vorticity of the flow. The environment provides a
Figure 11.1: Sequential decision process (schematic). Adapted from Figure 3.1 in Ref. [16].
reinforcement signal (the distance to the surface for example), and the organism determines which action to take, given its state and the reinforcement signal. Should it turn, stop to swim, or accelerate? The organism learns to associate actions with certain states that maximise the reward. This sounds quite similar to associating optimal outputs with stimuli. A conceptual difference is that the action of the agent modifies the environment: its actions take it to a different place in the turbulent flow, with different vorticity and different strain. A second point is that the reward to an action may not be immediate. In this case the challenge is to credit actions that optimise the expected future reward, given the information collected so far. This is the credit-assignment problem [180].
There are two different kinds of associative tasks: continuous and episodic ones. In continuous tasks, the intertwined sequences of states and actions have no natural end, so that one must either terminate the sequence in an ad-hoc fashion or intro- duce a weighting factor to ensure that the expected future reward remains finite. In episodic tasks, by contrast, the learning is divided into episodes that terminate after a finite number of steps. An example is to learn a strategy for winning a board game. In this case, each episode corresponds to a round of the game, and the re- ward is incurred at the end of each episode. The number of steps per round, the episode length T , may vary from round to round. In order to estimate the expected reward one usually needs many episodes. A second, very simple example is the stimulus problem described above, where the states (stimuli) are independent from the actions. Each episode consists of only one step, so that T = 1. In response to a randomly chosen state s 0 , the agent learns to perform the action a 0 that maximises the immediate reward. This can be achieved by the associative reward-penalty algorithm (Section 11.1). It uses stochastic neurons with weights that are trained by gradient ascent to maximise the expected immediate reward.
To estimate the expected future reward when T > 1, one must use a different method, usually temporal difference learning. It allows to estimate the expected
future reward, after T steps, by breaking up the learning into time steps t = 1, . . . , T . The idea is that it is better to adjust the prediction of a future reward as one iterates, rather than waiting for T iterations before updating the prediction. In temporal difference learning one expresses the reward at time T in terms of differences at time steps t + 1 and t [181].
Temporal difference learning builds up a lookup table that summarises the best
actions for each state, the Q -table Q (s , a ). Given Ns states s and Na possible actions
a , Q (s , a ) is a Ns × Na table. Its elements contain the expected future reward for
each state-action pair. To implement temporal difference learning, one must adopt
a policy or strategy. It specifies for each state which action is taken. In total there
are N_a^{N_s} possibilities of assigning actions to states. When there are many states and
actions it quickly becomes impossible to determine the best strategy by simple sampling, because there are too many to consider.
The advantage of temporal difference learning and related algorithms is that they do not rely on the evaluation of all possible policies. Instead, the policy is updated using iterated estimates of the Q-table. We write Q_t for the estimate at time step t. There are different ways of deriving a policy from a Q-table. The greedy policy is a deterministic policy: it corresponds to choosing the action with the largest Q-element in a given row of the Q-table, a = argmax_{a′} Q_t(s, a′). This policy maximises the current estimate of the future reward.
Stochastic policies are often better, in particular for non-stationary or stochastic environments, because they allow the agent to explore potentially better alterna- tives. Also, for a deterministic environment, a deterministic policy may lead to cycles. This can be avoided with a stochastic policy. One example for a stochastic policy is the ε-greedy policy. With probability 1 − ε, it chooses the greedy action a = argmaxa′Qt (s,a′), but with a small probability ε it takes a suboptimal action. Another example is the softmax policy, where argmax is replaced by the softmax function (Section 7.5). The softmax policy can handle actions described in terms of continuous variables.
In general the policy can change as the algorithm is iterated. A common choice is to reduce the parameter ε in the ε-greedy policy as one iterates, so that the algorithm converges to the optimal deterministic policy.
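As an illustration, an ε-greedy choice of action from a Q-table takes only a few lines of Python (the array shapes and names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Choose an action for state s from the Q-table Q (shape: n_states x n_actions).

    With probability 1 - epsilon take the greedy action argmax_a' Q[s, a'];
    with probability epsilon take an action drawn uniformly at random."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

rng = np.random.default_rng(0)
Q = np.zeros((10, 4))            # illustrative Q-table: 10 states, 4 actions
action = epsilon_greedy(Q, s=3, epsilon=0.1, rng=rng)
```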
Q-learning is an approximation to the temporal difference algorithm. In Q- learning, the Q -table is updated assuming that the agent always follows the greedy policy, even though it might actually follow a different policy. Q -learning allows agents to learn to play strategic games [182]. A simple example is the game of tic-tac- toe (Section 11.3). Games such as chess or go require to keep track of a very large number of states, so large that Q -learning in its simplest form becomes impractical. An alternative is to represent the state-action mapping by a deep neural network [183].
11.1 Associative reward-penalty algorithm
The associative reward-penalty algorithm uses stochastic neurons that are trained to maximise the average immediate reward. In Chapters 5 to 9, the output neurons were deterministic functions of their inputs. For reinforcement learning, by contrast, it is better to use stochastic neurons. The idea is the same as in Chapters 3 and 4: stochastic neurons can explore a wider range of possible states, which may in the end lead to a better solution. The state yi of neuron i is given by the stochastic update rule (3.1):
\[ y_i = \begin{cases} +1 & \text{with probability } p(b_i), \\ -1 & \text{with probability } 1 - p(b_i), \end{cases} \tag{11.2} \]
where bi = w i · x is the local field (no thresholds), and p (b ) = (1 + e −2β b )−1. Recall that the parameter β −1 is the noise level. Since the outputs can assume only two values, yi = ±1, Equation (11.2) describes a binary stochastic neuron.
To illustrate the associative reward-penalty algorithm for a single binary stochas- tic neuron, consider an agent experiencing different stimuli x drawn with equal probability from a distribution of inputs. Upon receiving stimulus x , the stochastic neuron outputs either y = 1 or y = −1. Given x and y , the environment provides a stochastic reward r (x , y ) = ±1 drawn from a reward distribution preward (x , y ):
\[ r(x, y) = \begin{cases} +1 & \text{with probability } p_{\rm reward}(x, y), \\ -1 & \text{with probability } 1 - p_{\rm reward}(x, y). \end{cases} \tag{11.3} \]
The goal is to adjust the weights so that the neuron produces outputs that are rewarded with high probability. Figure 11.2(a) shows an example with just two stimuli, x_1 = [1,0]^T and x_2 = [1,1]^T. The numerical values of p_reward(x, y) indicate that the expected reward is maximised when the neuron outputs y = 1 in response to x_1, and y = −1 in response to x_2. Since x_1 and x_2 occur with equal probability, the maximal expected reward is
\[ r_{\max} = \tfrac{1}{2} \big[ \langle r(x_1, +1) \rangle_{\rm reward} + \langle r(x_2, -1) \rangle_{\rm reward} \big] = 0.1. \tag{11.4} \]
Here we used 〈r (x , y )〉reward = preward(x , y )−[1−preward(x , y )], as well as the numerical values for preward (x , y ) given in Figure 11.2(a).
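Spelling this out with the values from the table: ⟨r(x_1, +1)⟩_reward = 0.8 − 0.2 = 0.6 and ⟨r(x_2, −1)⟩_reward = 0.3 − 0.7 = −0.4, so that r_max = ½(0.6 − 0.4) = 0.1.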
Figure 11.2(b) shows the contingency space [184] of the problem, representing the inputs x in a plane with coordinates p_reward(x, +1) and p_reward(x, −1). It is easier to learn to associate the correct output with inputs that lie in the shaded regions where p_reward(x, +1) > ½ and p_reward(x, −1) < ½, or vice versa. In this case one can solve the problem by fixing y = +1 and sampling p_reward(x, +1) for all x. If p_reward(x, +1) > ½ then y = +1 is the optimal output for x, otherwise it is y = −1. This strategy cannot
p_reward(x, y)    x_1 = [1, 0]^T    x_2 = [1, 1]^T
y = −1            0.6               0.3
y = +1            0.8               0.1
Figure 11.2: Conditioning by reward [184]. A stochastic neuron responds to stimuli x_1 and x_2 with different outputs, y = ±1, and receives the reward (11.1): r = +1 with probability p_reward(x, y), and r = −1 with probability 1 − p_reward(x, y). The goal is to always respond with the output that maximises the expected reward. Table: reward distribution. Panel (a): contingency space [184] of the problem, representing each input x in a plane with coordinates p_reward(x, +1) and p_reward(x, −1). (b) Reward versus iteration number of the associative reward-penalty rule (11.10). Schematic, based on simulations by Phillip Graefensteiner averaged over 100 independent realisations.
be used outside the shaded region. For example, if both reward probabilities are larger than one half, it is necessary to sample both preward(x,−1) and preward(x,+1) sufficiently often in order to determine which one is larger: one must find the greater of two goods according to Barto [184]. This illustrates the fundamental dilemma of reinforcement learning: an output that appears at first to yield a high reward may not be the optimal one in the long run. To find the optimal output, it is necessary to estimate both reward probabilities precisely. This means that one must try all possible outputs frequently, not only the one that appears to be optimal at the moment.
To derive a learning rule we need a cost function. One possibility is to use the average of the immediate reward for a given stimulus x
\[ \langle r \rangle = \sum_{y_1 = \pm 1, \ldots, y_M = \pm 1} \langle r(x, y) \rangle_{\rm reward} \, P(y | x), \tag{11.5} \]
averaged over the states of M outputs, and over the response of the environment
determined by the reward distribution p_reward(x, y). It is assumed that the reward distribution is stationary. Furthermore,
\[ P(y | x) = \prod_{i=1}^{M} \begin{cases} p(b_i) & \text{for } y_i = 1, \\ 1 - p(b_i) & \text{for } y_i = -1, \end{cases} \tag{11.6} \]
is the probability that the network produces the output y = [y_1, …, y_M]^T given the local fields b_i = Σ_j w_ij x_j.
To find the maximum of ⟨r⟩ one uses gradient ascent on ⟨r⟩, analogous to maximising the log-likelihood for Boltzmann machines (Section 4.4), and to gradient descent on the energy function for perceptrons in supervised learning (Chapter 6). The gradient is computed by applying the chain rule, as usual. The calculation is similar to the one for Boltzmann machines (Chapter 4). After some algebra (Exercise 11.2) one finds for given x and y that the derivative of P(y|x) with respect to w_mn equals P(y|x) β [y_m − tanh(βb_m)] x_n. We conclude that
\[ \frac{\partial \langle r \rangle}{\partial w_{mn}} = \beta \, \big\langle r(x, y)\, \big[ y_m - \tanh(\beta b_m) \big]\, x_n \big\rangle, \tag{11.7} \]
with b_m = Σ_j w_mj x_j, as before. The average is over the output of the network and over the reward distribution, just as in Equation (11.5).
Now we seek a learning rule that increases the expected immediate reward ⟨r⟩. In other words, we require that the weight increment δw_mn is an unbiased estimator (Section 10.6) of the gradient of the expected immediate reward [178]:
\[ \langle \delta w_{mn} \rangle = \eta \, \frac{\partial \langle r \rangle}{\partial w_{mn}}. \tag{11.8} \]
Comparison with Equation (11.7) leads to:
δwmn =αr[ym −tanh(βbm)]xn , (11.9)
with α = ηβ. This learning rule belongs to a set of more general rules derived by Williams [178]. It is plausible that the rule (11.9) converges to a steady state, because the weight increments approach zero as the network learns to produce the output that maximises p_reward(x, y), independently of y, so that y − ⟨y⟩ averages to zero. But there is no proof of convergence.
An alternative is the associative reward-penalty rule [184]:
\[ \delta w_{mn} = \alpha \begin{cases} \big[ y_m - \tanh(\beta b_m) \big] x_n & \text{for } r = +1, \\ -\delta \big[ y_m + \tanh(\beta b_m) \big] x_n & \text{for } r = -1, \end{cases} \tag{11.10} \]
with 0 < δ ≪ 1. For r = 1, the learning rules (11.9) and (11.10) give the same weight increment, but for r = −1 the increments are different. With rule (11.10), the agent learns primarily from positive feedback. One advantage of this asymmetric rule is that it can be proven to converge in the limit of δ → 0 [184]. In general, however, the convergence becomes quite slow when δ is small. Figure 11.2(b) shows simulation results for the immediate reward, averaged over 100 independent realisations of the learning process, versus the iteration number of the rule (11.10). We see that the average immediate reward approaches a steady state. The steady-state average of the immediate reward is smaller than rmax = 0.1, but as expected it approaches rmax as δ decreases.
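To make the rule concrete, here is a minimal simulation sketch (not from the book) of a single stochastic neuron trained with the associative reward-penalty rule (11.10) on the task of Figure 11.2. The parameter values (β, α, δ), the number of iterations, and the random-number handling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stimuli and reward probabilities p_reward(x, y) from the table in Figure 11.2.
stimuli = np.array([[1.0, 0.0],    # x1
                    [1.0, 1.0]])   # x2
p_reward = {(0, -1): 0.6, (0, +1): 0.8,   # stimulus x1
            (1, -1): 0.3, (1, +1): 0.1}   # stimulus x2

beta, alpha, delta = 1.0, 0.05, 0.01      # noise parameter, learning rate, penalty factor (illustrative)
w = np.zeros(2)                           # weights of the single stochastic neuron

def step(mu):
    """One presentation of stimulus mu: sample output, reward, and weight increment."""
    x = stimuli[mu]
    b = w @ x                             # local field b = sum_j w_j x_j
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * b))
    y = +1 if rng.random() < p_plus else -1
    r = +1 if rng.random() < p_reward[(mu, y)] else -1
    if r == +1:                           # reward: rule (11.10), same increment as (11.9)
        dw = alpha * (y - np.tanh(beta * b)) * x
    else:                                 # penalty: small increment scaled by delta
        dw = -alpha * delta * (y + np.tanh(beta * b)) * x
    return dw, r

rewards = []
for it in range(20000):
    mu = int(rng.integers(2))             # present x1 or x2 with equal probability
    dw, r = step(mu)
    w += dw
    rewards.append(r)

print("final weights:", w, " mean reward over last 1000 steps:", np.mean(rewards[-1000:]))
```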
The averaged learning curves still exhibit substantial fluctuations. They reflect significant variations within and between individual realisations. Furthermore, the convergence proof assumes that the input patterns are linearly independent. This means that the number of patterns cannot exceed the input dimension N . Associative reinforcement problems with linearly dependent inputs can be solved by embedding the input patterns in a higher-dimensional input space (Section 5.4).
The associative reward-penalty rule illustrates how an agent can use a reinforce- ment signal to maximise the expected immediate reward. The algorithm could for instance be a model for how an animal learns to respond in different ways to different stimuli.
Yet there are many problems where the reward is not immediate. When we play chess, the reward comes at the end of the game, for example r = +1 if we won, r = −1 if we lost, and r = 0 if the game ended in a draw. More generally, an agent navigating a complex environment should not only consider immediate rewards, but also how a certain action affects possible future rewards. One way of estimating future rewards for such tasks is temporal difference learning, which is discussed next.
11.2 Temporal difference learning
How does temporal difference learning allow an agent to optimise its expected future reward? For an episodic task, given an episode with T steps, the agent visits the finite sequence of states s0,...,sT −1, and collects the rewards r1,...,rT . The future reward is defined as
R_t = \sum_{τ=t}^{T−1} r_{τ+1} .   (11.11)
Continuous tasks, by contrast, do not have defined end points. Since the sum in (11.11) might diverge as T → ∞, it is customary to introduce a weighting factor
0 ≤ γ ≤ 1 in the sum over rewards:

R_t = \sum_{τ=t}^{∞} γ^{τ−t} r_{τ+1} .   (11.12)
The weighting factor reduces the contribution of the far future to the estimate. Smaller values of γ give more weight to the immediate future, and the limit γ → 0+ corresponds to R_t = r_{t+1}. The sum in Equation (11.12) is called the future discounted reward.
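As a minimal illustration, the future reward (11.11) and the discounted return (11.12) can be computed from a recorded list of rewards as sketched below. The example reward sequence and the indexing convention rewards[t] = r_{t+1} are assumptions made for this sketch.

```python
def future_reward(rewards, t):
    """Undiscounted future reward R_t of Eq. (11.11) for an episodic task.
    rewards[k] holds r_{k+1}, the reward collected when leaving state s_k."""
    return sum(rewards[t:])

def discounted_return(rewards, t, gamma):
    """Future discounted reward R_t of Eq. (11.12), truncated at the end of the list."""
    return sum(gamma ** (tau - t) * rewards[tau] for tau in range(t, len(rewards)))

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]          # example rewards r_1, ..., r_5
print(future_reward(rewards, 2))             # 3.0
print(discounted_return(rewards, 2, 0.5))    # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```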
We use a neural network with input s_t to estimate R_t. In general, the network output is a non-linear function of the inputs, parameterised by weights that could be arranged into several layers of hidden neurons (Part II). The simplest choice is to use a single linear unit, just as in Equation (5.19):
O(s_t) = w · s_t .   (11.13)
The components w_j of the weight vector w are determined so that the network output O(s_t) approximates R_t. This can be achieved by minimising the energy function

H = (1/2) \sum_{t=0}^{T−1} [R_t − O(s_t)]^2   (11.14)

using gradient descent. The corresponding learning rule reads:

δw_m = α \sum_{t=0}^{T−1} [R_t − O(s_t)] ∂O/∂w_m .   (11.15)

The idea of temporal difference learning [181] is to express the error R_t − O(s_t) as a
sum of temporal differences:
R_t − O(s_t) = \sum_{τ=t}^{T−1} [r_{τ+1} + O(s_{τ+1}) − O(s_τ)] ,   (11.16)

where O(s_T) is defined to be zero, O(s_T) ≡ 0. Using the gradient-descent rule (11.15) one obtains

δw = α \sum_{t=0}^{T−1} \sum_{τ=t}^{T−1} [r_{τ+1} + O(s_{τ+1}) − O(s_τ)] s_t .   (11.17)

The terms in this double sum can be summed in a different way, as illustrated in Figure 11.3:

δw = α \sum_{τ=0}^{T−1} \sum_{t=0}^{τ} [r_{τ+1} + O(s_{τ+1}) − O(s_τ)] s_t .   (11.18)
Figure 11.3: The double sum in Equation (11.17) extends over the terms indicated in black. The corresponding terms can be summed in two ways, as illustrated in the two panels, for T = 6.
Exchanging the summation variables and introducing a weighting factor 0 ≤ λ ≤ 1 gives:
δw = α \sum_{t=0}^{T−1} [r_{t+1} + O(s_{t+1}) − O(s_t)] \sum_{τ=0}^{t} λ^{t−τ} s_τ .   (11.19)
The purpose of the weighting factor is to reduce the weight of past states in the sum [185]. Alternatively one may update w (and hence O ) at each time step, with increment [184]
δw_t = α [r_{t+1} + O(w_t, s_{t+1}) − O(w_t, s_t)] \sum_{τ=0}^{t} λ^{t−τ} s_τ .   (11.20)
This is the temporal difference learning rule, also called TD(λ) [185]. Temporal difference learning allows a machine to learn the board game backgammon [17, 186], using a deep layered network and backpropagation (Section 6.1) to determine the weights.
The rule TD(0) is similar to the learning rule (6.6a) with target r_{t+1} + O(w_t, s_{t+1}). It makes it possible to learn one-step prediction of the time series. Using Equation (11.13), we see that the TD(0)-learning rule corresponds to the following learning rule for the output O:
Ot+1(st )=Ot (st )+α[rt+1 +Ot (st+1)−Ot (st )]. (11.21)
The subscript in O_t emphasises that the output function is updated iteratively. The learning rule (11.21) applies to estimating the future reward (11.11) for episodic tasks. If the environment is stationary, one may average over many consecutive episodes,
Figure 11.4: Sequence of states s_t in sequential reinforcement learning. The action a_t leads from s_t to s_{t+1}, where the agent receives reinforcement r_{t+1}. The Q-table with elements Q(s_t, a_t) estimates the future discounted reward.
using the final weights from episode k as initial weight values for episode k + 1. For continuous tasks, the corresponding rule for estimating the future discounted reward (11.12) reads:
Ot+1(st )=Ot (st )+α[rt+1 +γOt (st+1)−Ot (st )]. (11.22)
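A minimal sketch of the TD(0) rule (11.22) with the linear output (11.13) might look as follows. The terminal value O(s_T) is taken to be zero as in the text; the toy episode, the parameter values, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def td0_episode(w, states, rewards, alpha=0.1, gamma=1.0):
    """One pass of the TD(0) rule (11.22) with the linear output O(s) = w·s of Eq. (11.13).

    states  : list of state vectors s_0, ..., s_{T-1} (numpy arrays)
    rewards : list of rewards r_1, ..., r_T, with rewards[t] = r_{t+1}
    """
    T = len(states)
    for t in range(T):
        O_t = w @ states[t]
        O_next = w @ states[t + 1] if t + 1 < T else 0.0   # O(s_T) = 0 at the end of the episode
        td_error = rewards[t] + gamma * O_next - O_t
        w += alpha * td_error * states[t]                  # gradient of O with respect to w is s_t
    return w

# toy usage: three one-hot states on a line, reward only at the end of the episode
states = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
rewards = [0.0, 0.0, 1.0]
w = np.zeros(3)
for episode in range(200):
    w = td0_episode(w, states, rewards)
print("estimated values O(s_t):", [float(w @ s) for s in states])
```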
Returning to the problem outlined in the beginning of this Chapter, consider an agent exploring a complex environment. The task might be to get from location A to location B as quickly as possible, or expending as little energy as possible. At time t the agent is at position x_t with velocity v_t. These variables as well as the local state of the environment are summarised in the state vector s_t. Given s_t, the agent can act in certain ways: it might for example slow down, speed up, or turn. These possible actions are summarised in a vector a_t. At each time step, the agent takes the action a_t that optimises the expected future discounted reward (11.12), given its present state s_t. The estimated expected future reward for any state-action pair is summarised in a table: the Q-table with elements Q_t(s_t, a_t) is the analogue of O_t(s_t). Different rows of the Q-table correspond to different states, and different columns to different actions. The TD(0) rule for the Q-table reads:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)] .   (11.23)
This algorithm is called SARSA, because one needs st ,at ,rt+1,st+1, and at+1 to update the Q -table (Figure 11.4). A difficulty with the rule (11.23) is that it depends not only on the present state-action pair [s t , a t ], but also on the next action a t +1 , and thus indirectly upon the policy. Sometimes this is indicated by writing Qπ for the Q -table given policy π.
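Written as code, the SARSA update (11.23) is a single assignment. The sketch below is an illustration, not the book's implementation; it assumes the Q-table is stored as a two-dimensional array indexed by integer state and action labels, and the parameter values are arbitrary.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # SARSA update, Eq. (11.23): uses the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}).
    # Q is a 2d array indexed as Q[state, action]; alpha and gamma are illustrative values.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Q-learning (Section 11.3, Eq. (11.24)) differs only in the target:
# it replaces Q[s_next, a_next] by Q[s_next].max().
Q = np.zeros((3, 2))                            # toy Q-table with 3 states and 2 actions
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
print(Q)
```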
Algorithm 11 Q-learning for episodic task with the ε-greedy policy
initialise Q;
for k = 1, …, K do
  initialise s_0;
  for t = 0, …, T_k − 1 do
    choose a_t from Q(s_t, a_t) according to the ε-greedy policy;
    compute s_{t+1} and record r_{t+1};
    update Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + max_a Q(s_{t+1}, a) − Q(s_t, a_t)];
  end for
end for
11.3 Q -learning
The Q -learning rule [187] is an approximation to Eq. (11.23) that does not depend
on a_{t+1}. Instead one assumes that the next action, a_{t+1}, is the optimal one:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)] ,   (11.24)
regardless of the policy that is currently followed. Although Equation (11.24) does not refer to any policy, the learning outcome nevertheless depends on it, because the policy determines the sequence of states and actions [s_t, a_t]. For the greedy policy, Eq. (11.24) is equivalent to (11.23), but in general the two algorithms differ, and converge to different solutions. While the Q-table converges to the expected future reward of the greedy policy in Q-learning, for SARSA it converges to the expected future reward corresponding to the policy used in training. This can be an advantage when performance during training is important, for example when training an expensive robot that should not crash too often, or for a small bird that learns to fly by doing. Q-learning, by contrast, can be used for problems where the final strategy based on argmax_{a′} Q(s, a′) matters, but where the reward during training is less important. Examples are board games where only the quality of the final strategy counts, not how often one loses during training. In summary, Q-learning is simpler; if one takes ε → 0 during training, it yields the optimal strategy just as SARSA does, but it may give lower rewards during training.
The Q-learning algorithm is summarised in Algorithm 11. Usually one sets the initial entries in the Q-table to large positive values (optimistic initialisation), because this prompts the agent to explore many different actions, at least in the beginning. If the agent is in state s_t, it chooses the action a_t from Q_t(s_t, a_t) according to the given policy. For the ε-greedy policy, for example, the agent picks a random action from the corresponding row of the Q-table with probability ε. With probability 1 − ε,
it chooses the action a_t that yields the largest¹ Q_t(s_t, a_t) given s_t. The choice of action a_t determines the next state s_{t+1}, and this in turn makes it possible to update the Q-table: given the new state s_{t+1} resulting from the action a_t, one updates Q_t(s_t, a_t) using Equation (11.24). For episodic tasks one puts γ = 1 in (11.24), and one averages over many episodes, using the outcome Q_{T_k} from episode k as initial condition for the Q-table in episode k + 1. Each new episode can start with a new initial state s_0. It helps the exploration process if s_0 is one of the states that are rarely visited by the learning algorithm.
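A sketch of Algorithm 11 in code form could look as follows. The environment interface (reset and step), the optimistic initial value q0, the fixed ε, and the treatment of the terminal step are assumptions made for illustration; they are not prescribed by the algorithm box above.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one.
    Ties among maximal Q-elements are broken uniformly at random (see footnote)."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return rng.integers(n_actions)
    row = Q[s]
    best = np.flatnonzero(row == row.max())
    return rng.choice(best)

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=1.0, epsilon=0.1, q0=1.0, seed=0):
    """Tabular Q-learning in the spirit of Algorithm 11. `env` is assumed to provide
    reset() -> s0 and step(a) -> (s_next, r, done); this interface is an assumption."""
    rng = np.random.default_rng(seed)
    Q = np.full((n_states, n_actions), q0)       # optimistic initialisation
    for k in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)
            s_next, r, done = env.step(a)
            # at the end of the episode the max-term is dropped (assumption: terminal value zero)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```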
When the sequence s_0, s_1, s_2, … is a Markov chain (Section 4.2), then the Q-learning algorithm can be shown [185, 188] to converge if one uses a time-dependent learning rate α_t that satisfies

\sum_{t=0}^{∞} α_t = ∞   and   \sum_{t=0}^{∞} α_t^2 < ∞ .   (11.25)
Often Q-learning is implemented in combination with the ε-greedy policy. This policy shares an important property with the associative reward-penalty algorithm with stochastic neurons: stochasticity allows for a wider range of responses, some of which may turn out beneficial in the long run. When ε is very small, the agent picks the action that appears optimal. As a consequence, suboptimal Q-elements are sampled less frequently and are therefore subject to larger errors. Therefore it is advantageous to begin with a relatively large value of ε. It is customary to decrease ε as the algorithm is iterated, because this accelerates convergence to the greedy policy.
It is important to bear in mind that the learning outcome depends on the reward function, as mentioned above. In general it is a good idea to analyse how the optimal strategy changes as one varies the reward function. Sometimes we are faced with the inverse problem: consider how a microorganism swimming in the turbulent ocean responds to different stimuli. How was this behaviour shaped by genetic evolution? Which quantity was optimised? Is it most important to reduce the energy cost for propulsion? Or is it more important to avoid predation?
Another challenge is to determine suitable states and actions. An agent navigating a complex environment may have a continuous range of positions and velocities, and may experience continuous-valued signals from the environment. To represent the corresponding states in a Q -table it is necessary to discretise. To this end one must determine suitable ranges and resolutions of these variables, and for the actions. If there are too many states and actions, Q -learning becomes inefficient. This is referred to as the curse of dimensionality [16, 189].
¹ If several elements in the relevant row have the same maximal value, then any one of them is chosen with equal probability.
Let us see how Q -learning works for a very simple example, for the associative task described in Fig. 11.2. One episode corresponds to computing the output of the neuron given its initial state, so T = 1. There is no sequence of states, and the task is to estimate the immediate reward. In this case, the learning rule (11.24) simplifies to
δQ(s, a) = α [r(s, a) − Q(s, a)] .   (11.26)
Since each episode consists only of a single time step, we dropped the index t. Also, the term max_a Q(s_{t+1}, a) in Equation (11.24) does not appear in (11.26) because Q estimates the immediate reward. There are only two states in this problem, s = x_1 and s = x_2, and the possible actions are a = ±1. In other words, N_s = N_a = 2 in this case. In each round, one of the states is chosen randomly, with equal probability. The action is determined from the current estimate of the immediate reward as argmax_a Q(s, a) with probability 1 − ε, and uniformly randomly otherwise. These steps are repeated over many episodes, using the outcome of episode k as initial condition for episode k + 1. The rule (11.26) describes exponential relaxation to the target for small learning rate α. In this limit, Equation (11.26) is approximated by the stochastic differential equation
(d/dk) Q_k(s, a) = α f_ε(s, a) [r(s, a) − Q_k(s, a)] ,   (11.27)
where fε (s , a ) is the stationary frequency with which the state-action pair [s , a ] is visited using the ε-greedy policy:
f_ε(s, a) = \frac{1}{N_s} \begin{cases} 1 − ε + ε/N_a & \text{if } a = argmax_{a′} Q_k(s, a′), \\ ε/N_a & \text{otherwise.} \end{cases}   (11.28)
The frequency is normalised to unity, 1 = \sum_{s,a} f_ε(s, a). Averaging the solution of Equation (11.27) with initial condition Q_0(s, a) = 1 over the reward distribution gives:

⟨Q_k(s, a)⟩ = exp[−α f_ε(s, a) k] + α f_ε(s, a) \int_0^k dk′ ⟨r(s, a)⟩ exp[f_ε(s, a) α (k′ − k)] .   (11.29)

For ε > 0, Q_k(s, a) converges on average to

\begin{pmatrix} Q^*(x_1, −1) & Q^*(x_1, +1) \\ Q^*(x_2, −1) & Q^*(x_2, +1) \end{pmatrix} = \begin{pmatrix} 0.2 & 0.6 \\ −0.4 & −0.8 \end{pmatrix} .   (11.30)

Here we used that ⟨r(x, y)⟩ = 2 preward(x, y) − 1, as well as the reward probabilities in Figure 11.2. Figure 11.5 illustrates how the rate of convergence depends on the value of the parameter ε.
Figure 11.5: Q -learning for the task described in Figure 11.2. (a) Entries of the Q -table versus the number of iterations of Equation (11.26) for α = 0.01 and ε = 1. (b) Same, but for ε = 0.05. Schematic, based on simulations by Navid Mousavi, averaged over 5000 independent realisations of the learning curve.
For ε = 1, all state-action pairs are visited and evaluated equally often, independently of the present elements of the Q-table. Equations (11.28) and (11.29) show that all elements of the Q-table converge to their steady-state values at the same rate, equal to α/4. For small values of ε, by contrast, the algorithm tends to take optimal actions, argmax_{a′} Q_k(s, a′). Therefore it finds the optimal elements of the Q-table more quickly, at the rate α/2. However, the other elements converge much more slowly. Initially, the theory (11.29) does not apply because optimal and suboptimal Q-elements in each row of the Q-table are not well separated. As a result the decay rates of all elements are similar at first. But once optimal and suboptimal elements are significantly different, the suboptimal ones decay at the rate εα/4, as predicted by the theory.
This example illustrates the strength of Q -learning with the ε-greedy policy. For small values of ε, the algorithm tends to converge to the optimal strategy more quickly than a brute-force algorithm that visits every state-action pair equally often. Equation (11.28) shows that this advantage becomes larger for larger values of Na. The price one pays is that the suboptimal entries of the Q -table converge more slowly.
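The following sketch iterates rule (11.26) with the ε-greedy policy for the two-state task of Figure 11.2; after sufficiently many iterations the Q-table should fluctuate around the steady state (11.30). The learning rate, the value of ε, the number of iterations, and the random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p_reward = np.array([[0.6, 0.8],     # p_reward(x1, y=-1), p_reward(x1, y=+1)
                     [0.3, 0.1]])    # p_reward(x2, y=-1), p_reward(x2, y=+1)
alpha, epsilon = 0.01, 0.05
Q = np.ones((2, 2))                  # initial condition Q_0(s, a) = 1, as in the text

for k in range(200000):
    s = rng.integers(2)                              # stimulus chosen with equal probability
    if rng.random() < epsilon:                       # epsilon-greedy choice of a in {-1, +1}
        a = rng.integers(2)
    else:
        a = int(np.argmax(Q[s]))
    r = +1 if rng.random() < p_reward[s, a] else -1
    Q[s, a] += alpha * (r - Q[s, a])                 # rule (11.26)

print(Q)   # should approach the steady state (11.30): [[0.2, 0.6], [-0.4, -0.8]]
```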
The gain is even more significant for episodic tasks with T > 1 steps. In this case the learning rule for the Q-table depends on argmax_{a′} Q_t(s, a′). As mentioned above, there are N_a^{N_s} ways in which the largest elements can be distributed over the rows of the Q-table. To find the optimal deterministic strategy by complete enumeration, one must run a large number of episodes for each of the N_a^{N_s} possibilities, to find out which one gives the largest reward. This is impractical when either N_a or N_s or both become too large. Q-learning with a small value of ε, by contrast, tends to explore actions that appear to yield the largest expected future reward given the current estimates of the Q-values. Using experience in this way allows the algorithm to
simultaneously improve its policy and the Q -values towards optimality. In this way, Q -learning can find optimal, or at least good strategies when complete enumeration of all possibilities fails.
A second example for Q-learning is illustrated in Figure 11.6, the board game tic-tac-toe. It is a very simple game where two players take turns in placing their pieces on a 3 × 3 board. The player who first manages to obtain three pieces in a row, column, or diagonal wins and receives the reward r = +1. A draw gives r = 0, and the player receives r = −1 when the round is lost. The goal is to win as often as possible, to maximise the expected future reward. However, there is a strategy for both players that ensures that they do not lose. If both players try to maximise their expected future reward, then they end up following this strategy, and every game ends in a draw [190]. As a result, the game is quite boring.
Nevertheless it is instructive to ask how the players can learn to find this strategy using Q -learning with the ε-greedy policy. To this end we let two agents play many rounds against each other. The state space is the collection of all board configura- tions. Player × starts, and thus always sees a board with an even number of pieces, while the number of pieces is odd for player ◦. Since the players encounter different sets of states, each must keep track of their own Q -table. The task is episodic, and the number T of steps may vary from round to round. Feedback is only obtained at the end of each round.
We use Equation (11.24) with a constant learning rate α. We can set γ = 1 since the number of steps in each round is finite. The Q -table is a 2 × n table where each entry is a 3 × 3 array. Here n is the number of states the player encountered so far. The first row lists the states, each a 3 × 3 array with entries −1 (◦), 1 (×), or 0 (empty). The second row contains the Q -values. Given a certain state of the board, a player can play a piece on any empty field. The corresponding estimate of the expected future reward is stored in the corresponding 3 × 3 array in the second row. Since one cannot place a piece onto an occupied field, the corresponding entries in the Q -table are assigned NaN.
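One way to realise the Q-table just described is sketched below: a dictionary that maps board states to 3 × 3 arrays of Q-values, with NaN marking occupied fields. The helper functions, the tuple encoding of the board, and the convention +1 for ×, −1 for ◦ are illustrative assumptions.

```python
import numpy as np

Q = {}   # maps a board state (tuple of 9 entries in {-1, 0, +1}) to a 3x3 array of Q-values

def q_row(board):
    """Return (and lazily create) the 3x3 array of Q-values for this board state.
    Occupied fields get NaN, since no piece can be placed there; empty fields start at 0."""
    key = tuple(board.flatten())
    if key not in Q:
        values = np.zeros((3, 3))
        values[board != 0] = np.nan
        Q[key] = values
    return Q[key]

def greedy_move(board, rng):
    """Pick a field with maximal Q-value among the empty ones; break ties at random."""
    values = q_row(board)
    best = np.nanmax(values)
    candidates = np.argwhere(values == best)
    return tuple(candidates[rng.integers(len(candidates))])

board = np.zeros((3, 3), dtype=int)     # empty board; +1 for x, -1 for o
rng = np.random.default_rng(0)
print(greedy_move(board, rng))          # some empty field, chosen at random among the ties
```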
During a round, the Q-tables of the players are updated in turns, always the one of the player who places a piece. After a round, the Q-tables of both players are updated. During the first round, the elements of the Q-tables encountered are initialised to zero and remain zero. The first change to Q occurs in the last step of this round. If player × wins, for example, the element Q_{T−1}(s_{T−1}, a_{T−1}) corresponding to the state-action pair that led to the final state s′_{T−1} is updated for the winning player, and Q_{T−2}(s′_{T−2}, a′_{T−2}) is set to −1 for player ◦.
Both players follow the ε-greedy policy. With probability 1 − ε they take the optimal move (if the maximal Q-element in the relevant row is degenerate, then one of the maximal elements is chosen randomly). With probability ε, a random action is chosen. As the players continue to play rounds against each other, the
Figure 11.6: Tic-tac-toe. Two players, × and ◦, take turns in placing a piece on an empty field of a 3 × 3 board. The goal is to be the first to complete a row, column, or diagonal consisting of three of one's own pieces. In the example shown, player × starts and ends up winning the game. The states encountered by player × are denoted by s_t, those encountered by player ◦ by s′_t. Their actions are denoted by a_t and a′_t.
rewards spread to other elements of Q. Suppose that the state s_{T−1} is encountered once more, the one that allowed player × to win the first round with a_{T−1}. Then the term max_a Q_{T−1}(s_{T−1}, a) causes a Q-element for the previous state to change, the one from which s_{T−1} was reached the second time. However, as time goes on, this process slows down, because later updates are multiplied by higher powers of the learning rate α. Also, if the opponent lost in the previous round, it will try different actions that may block winning moves for the other player.
Figure 11.7 illustrates how the players learn, after playing many rounds against each other. Since both players try to maximise their expected future reward in the steady state of the Q -learning algorithm, all games end in a draw in this case. The corresponding Q -tables contain the strategies each player should adopt to maximise their reward. Suppose player ◦ places the first piece as shown below:
[Equation (11.31): the board after the first piece has been placed, shown together with two Q-tables (left and right) whose entries estimate the expected future reward for each empty field; occupied fields are marked NaN.]
How does this game continue? There are several different ways for player × to win. The left Q-table in Equation (11.31) shows that one possibility is to place the piece in the top or bottom row, because this creates the opportunity of forming a bridge
Figure 11.7: Learning curves for two players learning to play tic-tac-toe with Q-learning and the ε-greedy policy. Shown are the frequencies that the game ends in a draw, that player × wins, and that player ◦ wins. Similar curves are obtained using a learning rate α = 0.1. The parameter ε was equal to unity for the first 10^4 rounds, and was then decreased by a factor of 0.9 after every 100 rounds. Each curve is averaged over a running window of 30 rounds. Schematic, based on simulations performed by Navid Mousavi.
in the next move, a configuration that cannot be blocked by the opponent, allowing × to win. The right Q-table shows that player × could still lose or end up with a draw if he makes the wrong move. The corresponding Q-entries have not quite converged to −1 and 0, respectively. Q-entries corresponding to suboptimal states are not estimated as precisely, because they are visited less frequently. Here the rather large value ε = 0.3 was chosen. Smaller values of ε give even less accurate estimates for the suboptimal Q-elements compared with those in Equation (11.31), after training for the same number of rounds.
As pointed out above, the learning outcome depends on the reward function. If one increases the reward for winning, to r = +2 for instance, the optimal strategy appears to be to take turns in winning. The same learning outcome is expected if one imposes a penalty for a draw, r = +1 (win), r = −1 (draw or lose), Exercise 11.8. More examples of reinforcement-learning problems in robotics and in the natural sciences are described in Ref. [191].
The Q-learning algorithm described above is quite efficient when the number of states and actions is not too large. For very large Q-tables, the algorithm becomes quite slow. In this case it may be more efficient to replace the Q-table by an approximate Q-function, a parameterised function of state and action. As explained in Section 11.2, one can use a neural network to represent the Q-function [183] (deep reinforcement learning). An application of this method is AlphaGo, a machine-learning algorithm that learnt to play the game of go [182]. The Q-function is represented in terms of a
convolutional neural network. This makes it possible to use a variant of Q -learning despite the fact that the number of states is enormous.
In recent years many proof-of-principle studies have demonstrated the possi- bilities of reinforcement learning in a wide range of scientific problems. Recent advances in deep reinforcement learning hold promise for the future, for real-world control problems in the engineering sciences.
11.4 Summary
Reinforcement learning lies between unsupervised learning (Chapter 10) and su- pervised learning (Chapters 5 to 9). In reinforcement learning, there are no labeled data sets. Instead, the neural network or agent learns through feedback from the environment in the form of a reward or a penalty. The goal is to find a strategy that maximises the expected reward. Reinforcement learning is applied in a wide range of fields, from psychology to mechanical engineering, using a large variety of algo- rithms. The associative reward-penalty algorithm and many versions of temporal difference learning were originally formulated using neural networks. Q -learning is an approximation to temporal difference learning for sequential decision processes. In its simplest form it does not rely on neural networks. However, when the number of states and actions is large this algorithm becomes slow. In this case it may be more efficient to approximate the Q -function by a neural network.
11.5 Further reading
The standard reference for reinforcement learning is Reinforcement learning: an introduction by Sutton and Barto [16]. The original reference for the convergence of the Q -learning algorithm is Ref. [188]. A more mathematical introduction to reinforcement learning is given in Ref. [185]. Examples for reinforcement learning in statistical and non-linear physics are summarised in Ref. [191].
An open question is when and how symmetries can be exploited to simplify a reinforcement problem. For a small microorganism learning to navigate a turbulent flow, some aspects are discussed in Ref. [192], but little is known in general. Another open question concerns the convergence of the Q -learning algorithm. Convergence to the optimal policy is assured if the sequence of states is a Markov chain. However, most real-world problems are not Markovian, so that convergence is not guaranteed. The algorithm appears to perform well nevertheless (a recurring theme in this book), but it is an open question under which circumstances it may fail.
preward(x, y)      x1 = [0, 0]^T    x2 = [0, 1]^T    x3 = [1, 0]^T    x4 = [1, 1]^T
y = −1                  0.8              0.1              0.1              0.8
y = +1                  0.1              0.8              0.8              0.1

Table 11.1: Stochastic XOR problem from Ref. [184]. Exercise 11.4.

11.6 Exercises
11.1 Binary stochastic neurons. Consider binary stochastic neurons y_i with update rule y_i = +1 with probability p(b_i) and y_i = −1 otherwise. Here b_i = \sum_j w_{ij} x_j, and p(b) = (1 + e^{−2βb})^{−1}. The parameter β^{−1} is the noise level, the x_j are inputs, and the w_{ij} are weights. The energy function reads H = (1/2) \sum_{iμ} [t_i^{(μ)} − y_i^{(μ)}]^2, with targets t_i^{(μ)} = ±1. Stochastic neurons can be trained by gradient descent on the energy function H′ = (1/2) \sum_{iμ} [t_i^{(μ)} − ⟨y_i^{(μ)}⟩]^2, defined in terms of the average outputs ⟨y_i^{(μ)}⟩. The error δ_m^{(μ)} is defined by δw_{mn} ≡ −η ∂H′/∂w_{mn} = η \sum_μ δ_m^{(μ)} x_n^{(μ)}. Show that δ_m^{(μ)} = (t_m^{(μ)} − ⟨y_m^{(μ)}⟩) β (1 − ⟨y_m^{(μ)}⟩^2). Show that this rule does not necessarily minimise ⟨H⟩.
11.2 Gradient of average reward. Derive Equation (11.7). An outline of the deriva- tion is given in Section 7.4 of Hertz, Krogh, and Palmer [1].
11.3 Klopf’s self-interested neuron. Klopf’s self-interested neuron [193] is a bi- nary stochastic neuron with outputs 0 and 1. Derive a learning rule that is equivalent to Eq. (11.9).
11.4 Associative reward-penalty algorithm. Barto [184] explains how to solve association tasks with linearly dependent inputs using hidden neurons. One of his examples is the XOR problem, Table 11.1. Verify by numerical simulation that the task can be solved by a single stochastic neuron if you embed the input data in four-dimensional input space.
11.5 Three-armed bandit problem. Implement the Q -learning algorithm for the three-armed bandit problem with reward distributions shown in Fig. 11.8. Analyse the convergence of Q -learning for different values of ε. Discuss the exploitation- exploration dilemma. Illustrate with results of your computer simulations.
11.6 Psychology of rock-paper-scissors. Learn to exploit idiosyncrasies in the
Figure 11.8: Three-armed bandit problem [16]. Three slot machines have the Gaussian reward distributions shown, with means and standard deviations μ1 = 7.5, σ1 = 2, μ2 = 10, σ2 = 1, and μ3 = 15, σ3 = 5.
strategy of your opponent. Suppose that your opponent tends to repeat his action if he won the previous round [194], but changes his action if he lost. Let us say that your opponent randomly chooses between two strategies. With probability p he repeats his action if he won, but chooses one of the other two actions with equal probability if he lost. With probability 1 − p he picks one of the three actions, rock with probability 1 − 2q, and paper or scissors with probability q (your opponent has a preference for rock, so q < 1/3), and this is also the strategy for the first round. Implement the Q-learning algorithm to determine the optimal strategy as a function of q and p after N rounds.
11.7 Tic-tac-toe. Implement the Q -learning algorithm for two agents learning to play tic-tac-toe (Figure 11.6), with reward function r = +1 (win), r = 0 (draw), and r = −1 (lose). Crowley and Siegler [190] described how a perfect player should play to never lose. Their Table 1 summarises how to play given a certain configuration of the board. Determine the Q -table of a perfect player, to verify Table 1 of Crowley and Siegler.
11.8 Different reward function for tic-tac-toe. For the tic-tac-toe problem (Fig- ure 11.6) investigate, with Q learning, how the optimal strategy depends on the reward function. Determine the optimal strategy for r = +2 (win), r = 0 (draw), r = −1 (lose), and for r = +1 (win), r = −1 (draw, lose).
11.9 Connect four. Implement the Q -learning algorithm for two agents learning to play connect four (Figure 11.9) on a 6 × 6 board. Show that the second player can always achieve a draw against a perfect player [195].
11.10 Eat or save the chocolate. Suppose you get a piece of chocolate every morn-
Figure 11.9: Connect four is a game for two players who take turns dropping their pieces into one of k columns of height l (k = 7 and l = 6 in the Figure). The player first completing a horizontal, vertical, or diagonal row of four pieces wins. The red player started.
ing. Either you save your chocolate for the next day, or you eat all of it during the day, including all pieces you may have saved from previous days. Each day, before you go to bed, you receive a reward: for each piece of chocolate you ate during the day you get +2. If you save the chocolate instead you get +1 for each piece of chocolate in stock. But your brother likes chocolate too, and he searches for your stock while you are asleep. Suppose he finds it with probability p , and eats all the chocolate. What is your strategy to optimise your future reward over N days?
Bibliography
[1] HERTZ, J, KROGH, A & PALMER, R 1991 Introduction to the Theory of Neural
Computation. Addison-Wesley.
[2] HAYKIN,S1999NeuralNetworks:acomprehensivefoundation,2ndedn.New
Jersey: Prentice Hall.
[3] HORNER,H,NeuronaleNetze,www.tphys.uni-heidelberg.de/~horner,[Last
accessed 8-November-2018].
[4] GOODFELLOW, I. J, BENGIO, Y & COURVILLE, A, Deep learning,
www.deeplearningbook.org, [Last accessed 5-September-2018].
[5] NIELSEN,M,Neuralnetworksanddeeplearning,
http://neuralnetworksanddeeplearning.com, [Last accessed 13-August-2018].
[6] MCCULLOCH,W&PITTS,W1943Alogicalcalculusoftheideasimmanentin
nervous activity. Bull. Math. Biophys. 5, 115.
[7] CARNAP,R1937Thelogicalsyntaxoflanguage.London:K.Paul,Trench,Trub-
ner & Co., Limited.
[8] MCCULLOCH,W&PITTS,W1947Howweknowuniversalstheperceptionof
auditory and visual forms. Bull. Math. Biophys. 9, 127–147.
[9] HEBB,D.O1949Theorganizationofbehavior:Aneuropsychologicaltheory.
New York: Wiley.
[10] RO S E N B L AT T, F 1958 The perceptron: A probabilistic model for information
storage and organization in the brain. Psychological Rev. 65, 386.
[11] MINSKY,M&PAPERT,S1969Perceptrons.AnIntroductiontoComputational
Geometry. MIT Press.
[12] RUMELHART, D. E, HINTON, G. E & WILLIAMS, R. J 1986 Learning internal representations by error propagation. In Parallel distributed processing: ex- plorations in the microstructure of cognition (ed. D. E Rumelhart & J. L MC- Clelland).
[13] HOPFIELD, J. J 1982 Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79 (8), 2554–2558.
[14] HINTON,G.E2010Boltzmannmachines.InEncyclopediaofMachineLearning (ed. C Sammut & G. I Webb), pp. 132–136. Boston, MA: Springer US.
[15] HINTON,G.E&SEJNOWSKI,T.J1986LearningandRelearninginBoltzmann Machines, pp. 282–317. Cambridge, MA, USA: MIT Press.
[16] SUTTON, R. S & BARTO, A. G 2018 Reinforcement Learning: An Introduction, 2nd edn. The MIT Press.
[17] TESAURO,G1995TemporaldifferencelearningandTD-Gammon.Communi- cations of the ACM 38, 58–68.
[18] KOHONEN,T1990Theself-organizingmap.ProceedingsoftheIEEE78,1464– 1480.
[19] SENIOR, A. W, EVANS, R, JUMPER, J et al. 2020 Improved protein structure prediction using potentials from deep learning. Nature 577, 2076–710.
[20] TheNobelPrizeinPhysiologyorMedicine1906,www.nobelprize.org,[Last accessed 1-October-2020].
[21] NEWMAN,E.A,ARAQUE,A&DUBINSKY,J.M,ed.2017Thebeautifulbrain.The drawings of Santiago Ramón y Cajal. New York: Abrams.
[22] GABBIANI,F&METZNER,W1999Encodingandprocessingofsensoryinfor- mation in neuronal spike trains. Journal of Experimental Biology 202 (10), 1267.
[23] KANAL,L2001Perceptrons.InInternationalEncyclopediaoftheSocial&Be- havioral Sciences (ed. N. J Smelser & P. B Baltes), pp. 11218 – 11221. Oxford: Pergamon.
[24] LITTLE,W1974Theexistenceofpersistentstatesinthebrain.Mathematical Biosciences 19, 101 – 120.
[25] FISCHER,A&IGEL,C2014TrainingrestrictedBoltzmannmachines:Anintro- duction. Pattern Recognition 47 (1), 25–39.
[26] SHERRINGTON, D, Spin glasses: a perspective, arxiv.org/abs/cond-mat/0512425, [Last accessed 5-December-2020].
[27] SHERRINGTON,D&KIRKPATRICK,S1975Solvablemodelofaspin-glass.Phys. Rev. Lett. 35, 1792–1796.
[28] LIPPMANN,R1987Anintroductiontocomputingwithneuralnets.IEEEASSP
Magazine 4, 4–22.
[29] MATHEWS,J&WALKER,R.L1964MathematicalMethodsofPhysics.NewYork:
W.A. Benjamin.
[30] FELLER,W1968Anintroductiontoprobabilitytheoryanditsapplications,3rd edn. New York: John Wiley & Sons.
[31] WEISSTEIN, E. W, WolframMathWorld - a Wolfram web resource, math- world.wolfram.com/Erf.html, [Last accessed 17-September-2019].
[32] KADANOFF,L.P,Moreisthesame:phasetransitionsandmeanfieldtheories, arxiv.org/abs/0906.0653, [Last accessed 3-September-2020].
[33] AMIT,D.J,GUTFREUND,H&SOMPOLINSKY,H1985Spin-glassmodelsofneural networks. Phys. Rev. A 32, 1007.
[34] AMIT, D. J & GUTFREUND, H 1987 Statistical mechanics of neural networks near saturation. Ann. Phys. 173, 30.
[35] HOPFIELD,J.J1984Neuronswithgradedresponsehavecollectivecomputa- tional properties like those of two-state neurons. Proceedings of the National Academy of Sciences 81 (10), 3088–3092.
[36] HINTON, G. E & SEJNOWSKI, T. J 1983 Optimal perceptual inference. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 444–453.
[37] MÜLLER, B, REINHARDT, J & STRICKLAND, M. T 1999 Neural Networks: An Introduction. Heidelberg: Springer.
[38] GESZTI,T1990Physicalmodelsofneuralnetworks.WorldScientific.
[39] STEFFAN,H&KÜHN,R1994Replicasymmetrybreakinginattractorneural
network models. Zeitschrift für Physik B Condensed Matter 95, 249–260.
[40] VOLK,D1998OnthephasetransitionofHopfieldnetworks–anotherMonte
Carlo study. Int. J. Mod. Phys. C 9, 693.
[41] LÖWE, M 1998 On the storage capcaity of Hopfield models with correlated
patterns. Ann. Prob. 8, 1216.
[42] ENGEL, A & VAN DEN BROECK, C 2001 Statistical Mechanics of Learning. Cam- bridge University Press.
[43] WATKIN,T.L.H,RAU,A&BIEHL,M1993Thestatisticalmechanicsoflearning
a rule. Rev. Mod. Phys. 65, 499–556.
[44] KIRKPATRICK,S,GELATT,C.D&VECCHI,M.P1983Optimizationbysimulated
annealing. Science 220, 671–680.
[45] HINTON,G.E,Boltzmannmachine, www.scholarpedia.org/article/Boltzmann_machine, [Last accessed 21- September-2019].
[46] HINTON, G. E, A practical guide to training restricted Boltzmann ma- chines, www.cs.toronto.edu/~hinton/absps/guideTR.pdf, [Last accessed 18- September-2019].
[47] MACKAY,D.J.C2003InformationTheory,InferenceandLearningAlgorithms. New Jersey: Cambridge University Press.
[48] VANKAMPEN,N.G2007Stochasticprocessesinphysicsandchemistry.North Holland.
[49] SOKAL,A1997MonteCarlomethodsinstatisticalmechanics:Foundations and new algorithms. In Functional Integration: Basics and Applications (ed. C DeWitt-Morette, P Cartier & A Folacci), pp. 131–192. Boston, MA: Springer US.
[50] MEHLIG, B, HEERMANN, D. W & FORREST, B. M 1992 Hybrid Monte Carlo method for condensed-matter systems. Phys. Rev. B 45, 679–685.
[51] METROPOLIS,N,ROSENBLUTH,A.W,ROSENBLUTH,M.N,TELLER,M&TELLER, E 1953 Equation of state calculations by very fast computing machine. Journal of Chemical Physics 21, 1087–1092.
[52] BINDER,K,ed.1986Monte-CarloMethodsinStatisticalPhysics,2ndedn.Berlin: Springer.
[53] PRESS,W.H,TEUKOLSKY,S.A,VETTERLING,W.T&FLANNERY,W.P1992Nu- merical Recipes in C: The Art of Scientific Computing, second edition. New York: Cambridge University Press.
[54] HOPFIELD,J.J&TANK,D.W1985Neuralcomputationofdecisionsinoptimi- sation problems. Biol. Cybern. 52, 141.
[55] WATERMAN,M.S1995IntroductiontoBioinformatics.PrenticeHall.
[56] LANDER,E,LINTON,L,BIRREN,Betal.2001Initialsequencingandanalysisof
the Human genome. Nature 409, 860–921.
[57] SMOLENSKY, P 1987 Information Processing in Dynamical Systems: Founda-
tionsofHarmonyTheory,pp.194–281.MITP.
[58] LEROUX,N&BENGIO,Y2008RepresentationalpowerofrestrictedBoltzmann
machines and deep belief networks. Neural Computation 20, 1631–1649.
[59] LE ROUX, N & BENGIO, Y 2010 Deep belief networks are compact universal
approximators. Neural Computation 22, 2192–2207.
[60] MONTÚFAR, G. F & AY, N 2011 Refinements of universal approximation re- sults for deep belief networks and restricted Boltzmann machines. Neural Computation 23, 1306–1319.
[61] MONTÚFAR,G.F,RAUH,J&AY,N2011Expressivepowerandapproximation errors of restricted Boltzmann machines. In Advances in Neural Information Processing Systems (ed. J Shawe-Taylor, R Zemel, P Bartlett, F Pereira & K. Q Weinberger), , vol. 24, pp. 415–423.
[62] MONTÚFAR,G.F,RAUH,J&AY,N2013Maximalinformationdivergencefrom statistical models defined by neural networks. In Geometric Science of Informa- tion (ed. F Nielsen & F Barbaresco), pp. 759–766. Berlin, Heidelberg: Springer Berlin Heidelberg.
[63] CARLEO,G&TROYER,M2017Solvingthequantummany-bodyproblemwith artificial neural networks. Science 355 (6325), 602–606.
[64] GUBERNATIS, J. E 2005 Marshal Rosenbluth and the Metropolis algorithm. Physics of Plasmas 12, 057303.
[65] MURPHY,K.P2012MachineLearning:AProbabilisticPerspective.Cambridge, Massachusetts: MIT Press.
[66] FISCHER,A&IGEL,C2012AnintroductiontorestrictedBoltzmannmachines. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (ed. L Alvarez, M Mejail, L Gomez & J Jacobo), pp. 14–36. Berlin, Heidelberg: Springer Berlin Heidelberg.
[67] BENGIO,Y2009LearningdeeparchitecturesforAI.FoundationsandTrends in Machine Learning 2, 1–127.
[68] DAYAN,P,HINTON,G.E,NEAL,R.M&ZEMEL,R.S1995TheHelmholtzmachine.
Neural Computation 7, 889–904.
[69] DAYAN,P&HINTON,G.E1996VarietiesofHelmholtzmachine.NeuralNet-
works 9, 1385–1403.
[70] DUA,D&GRAFF,C,UCImachinelearningrepository,archive.ics.uci.edu/ml,
[Last accessed 18-August-2018].
[71] FISHER,R.A1936Theuseofmultiplemeasurementsintaxonomicproblems.
Ann. Eugenics 7, 179.
[72] COVER,T.M1965Geometricalandstatisticalpropertiesofsystemsoflinear inequalities with applications in pattern recognition. IEEE Trans. on electronic computers p. 326.
[73] SOMPOLINSKY, H, Introduction: the perceptron, web.mit.edu, [Last accessed 9-October-2018].
[74] SLOANE,N.J.A,Onlineencyclopediaofintegersequences,oeis.org/A000609, [Last accessed 9-November-2020].
[75] GREUB, W 1981 Linear Algebra. New York: Springer.
[76] LECUN,Y,BOTTOU,L,ORR,G.B&MÜLLER,K.-R1998Efficientbackprop.In
Neural networks: tricks of the trade (ed. G. B Orr & K.-R Müller). Springer.
[77] NESTEROV,Y1983Amethodofsolvingaconvexprogrammingproblemwith
convergence rate o(1/k2). Soviet Mathematics Doklady 27, 372.
[78] SUTSKEVER,I2013Trainingrecurrentneuralnetworks.PhDthesis,University
of Toronto.
[79] HORNIK,K,STINCHCOMBE,M&WHITE,H1989Neuralnetworksareuniversal
approximators. Neural Networks 2, 359.
[80] LAPEDES,A&FARBER,R1988Howneuralnetswork.InNeuralInformationPro-
cessing Systems (ed. D Anderson), pp. 442–456. American Institute of Physics.
[81] FRANCO,L&CANNAS,S2001Generalizationpropertiesofmodularnetworks: implementing the parity function. IEEE Transactions on Neural Networks 12, 1306–1313.
[82] CRISANTI,A,VULPIANI,A&PALADIN,G1993Productsofrandommatricesin Statistical Physics. Berlin: Springer.
[83] CVITANOVIC, P, ARTUSO, G, MAINIERI, R, TANNER, G & VATTAY, G, Lya- punov exponents, chaosbook.org/chapters/Lyapunov.pdf, [Last accessed 30- September-2018].
[84] ECKMANN,J.P&RUELLE,D1985Ergodictheoryofchaosandstrangeattractors. Rev. Mod. Phys. 57, 617–656.
[85] STROGATZ, S. H 2000 Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry and Engineering. Westview Press.
[86] STORM,L,Unstablegradientsindeepneuralnets,MScthesisChalmersUni- versity of Technology (2020).
[87] PENNINGTON,J,SCHOENHOLZ,S.S&GANGULI,S2017Resurrectingthesig- moid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems (ed. I Guyon, U. V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan & R Garnett), , vol. 30, pp. 4785– 4795. Curran Associates, Inc.
[88] SUTSKEVER,I,MARTENS,J,DAHL,G&HINTON,G.E2013Ontheimportance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, pp. III–1139–III– 1147.
[89] GLOROT, X & BENGIO, Y 2010 Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (ed. Y. W Teh & M Titter- ington), Proceedings of Machine Learning Research, vol. 9, pp. 249–256. Chia Laguna Resort, Sardinia, Italy: JMLR Workshop and Conference Proceedings.
[90] SCHOENHOLZ,S.S,GILMER,J,GANGULI,S&SOHL-DICKSTEIN,J,Deepinfor- mation propagation, arxiv.org/abs/1611.01232, [Last accessed 5-December- 2020].
[91] GLOROT,X,BORDES,A&BENGIO,Y2011Deepsparserectifierneuralnetworks. In Proceedings of the Fourteenth International Conference on Artificial Intel- ligence and Statistics (ed. G Gordon, D Dunson & M Dudík), Proceedings of Machine Learning Research, vol. 15, pp. 315–323. Fort Lauderdale, FL, USA: JMLR Workshop and Conference Proceedings.
[92] HE,K,ZHANG,X,REN,S&SUN,J2016Deepresiduallearningforimagerecog- nition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[93] Residualneuralnetwork,wikipedia.org/wiki/Residual_neural_network,[Last
accessed 25-May-2021].
[94] KLEINBAUM, D, KUPPER, L & NIZAM, A 2008 Applied regression analysis and
other multivariable methods, 3rd edn. Belmont: Thomson Higher Education.
[95] SRIVASTAVA,N,HINTON,G.E,KRIZHEVSKY,A,SUTSKEVER,I&SALAKHUTDINOV, R 2014 Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), 1929–1958.
[96] HANSON,S&PRATT,L1989Comparingbiasesforminimalnetworkconstruc- tion with back-propagation. In Advances in Neural Information Processing Systems (ed. D Touretzky), , vol. 1, pp. 177–185. Morgan-Kaufmann.
[97] HASSIBI, B & STORK, D 1993 Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (ed. S Hanson, J Cowan & C Giles), , vol. 5, pp. 164–171. Morgan-Kaufmann.
[98] LECUN,Y,DENKER,J&SOLLA,S1990Optimalbraindamage.InAdvancesin Neural Information Processing Systems (ed. D Touretzky), , vol. 2, pp. 598–605. Morgan-Kaufmann.
[99] FRANKLE,J&CARBIN,M,Thelotterytickethypothesis:Findingsmall,trainable neural networks, arxiv.org/abs/1803.03635, [Last accessed 5-December-2020].
[100] DENG,J,DONG,W,SOCHER,R,LI,L.J,LI,K&LI,F.F2009ImageNet:Alarge- scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. [Last accessed 3-September- 2018].
[101] IOFFE,S&SZEGEDY,C,Batchnormalization:Acceleratingdeepnetworktrain- ing by reducing internal covariate shift, arxiv.org/abs/1502.03167, [Last ac- cessed 5-December-2020].
[102] SANTURKAR, S, TSIPRAS, D, ILYAS, A & MADRY, A, How does batch normal- ization help optimization? (No, it is not about internal covariate shift), arxiv.org/abs/1805.11604, [Last accessed 5-December-2020].
[103] KIRKPATRICK, J, PASCANU, R, RABINOWITZ, N et al. 2017 Overcoming catas- trophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 3521–3526.
[104] SETTLES,B,Activelearningliteraturesurvey, burrsettles.com/pub/settles.activelearning.pdf, [Last accessed 5-December- 2020].
[105] CHOROMANSKA,A,HENAFF,M,MATHIEU,M,AROUS,G.B&LECUN,Y2015 The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (ed. G Lebanon & S. V. N Vishwanathan), Proceedings of Machine Learning Research, vol. 38, pp. 192–204. San Diego, California, USA: PMLR.
[106] FYODOROV, Y. V 2004 Complexity of random energy landscapes, glass tran- sition, and absolute value of the spectral determinant of random matrices. Phys. Rev. Lett. 92, 240601.
[107] BECKER,S,ZHANG,Y&LEE,A.A2020Geometryofenergylandscapesandthe optimizability of deep neural networks. Phys. Rev. Lett. 124, 108301.
[108] WANG,Y,YAO,Q,KWOK,J&NI,L.M2020Generalizingfromafewexamples: A survey on few-shot learning. ACM Computing Surveys 53, 63.
[109] KRIZHEVSKY, A, SUTSKEVER, I & HINTON, G. E 2012 ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (ed. F Pereira, C. J. C Burges, L Bottou & K. Q Weinberger), , vol. 25, pp. 1097–1105. Curran Associates, Inc.
[110] ABADI, M, AGARWAL, A, BARHAM, P et al., TensorFlow: Large-scale machine learning on heterogeneous systems, www.tensorflow.org, [Last accessed 3- September-2018].
[111] LECUN, Y, CORTES, C & BURGES, C. J, The MNIST database of handwritten digits, yann.lecun.com/exdb/mnist, [Last accessed 3-September-2018].
[112] SMITH,L.N2017Cyclicallearningratesfortrainingneuralnetworks.In2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464– 472.
[113] CIREGAN, D, MEIER, U & SCHMIDHUBER, J 2012 Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3642–3649.
[114] Deep learning in MATLAB, se.mathworks.com, [Last accessed 14-January- 2020].
[115] PICASSO, J. P, Pre-processing before digit recognition for NN and CNN trained with MNIST dataset, stackoverflow.com, [Last accessed 26-September-2018].
[116] KOZIELSKI, M, FORSTER, J & NEY, H 2012 Moment-based image normalization for handwritten text recognition. In 2012 International Conference on Frontiers in Handwriting Recognition, pp. 256–261.
[117] RUSSAKOVSKY, O, DENG, J, SU, H, KRAUSE, J, SATHEESH, S, MA, S, HUANG, Z, KARPATHY, A, KHOSLA, A, BERNSTEIN, M, BERG, A. C & LI, F. F 2015 ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252.
[118] LI, F. F, JOHNSON, J & YEUNG, S, CNN architectures, http://cs231n.stanford.edu, [Last accessed 4-December-2020].
[119] HU, J, SHEN, L & SUN, G 2018 Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[120] SEIF, G, Deep learning for image recognition: why it's challenging, where we've been, and what's next, towardsdatascience.com, [Last accessed 26-September-2018].
[121] SZEGEDY, C, WEI LIU, YANGQING JIA, SERMANET, P, REED, S, ANGUELOV, D, ERHAN, D, VANHOUCKE, V & RABINOVICH, A 2015 Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9.
[122] ZENG, X, OUYANG, W, YAN, J et al. 2018 Crafting gbd-net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (9), 2109–2123.
[123] HERN, A, Computers now better than humans at recognising and sorting images, www.theguardian.com, [Last accessed 26-September-2018].
[124] KARPATHY, A, What I learned from competing against a convnet on imagenet, karpathy.github.io, [Last accessed 26-September-2018].
[125] KHURSHUDOV, A, Suddenly, a leopard print sofa appears, rocknrollnerd.github.io, [Last accessed 23-August-2018].
[126] GEIRHOS, R, MEDINA TEMME, C. R, RAUBER, J, SCHÜTT, H. H, BETHGE, M & WICHMANN, F. A 2018 Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems (ed. S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi & R Garnett), vol. 31, pp. 7538–7550. Curran Associates, Inc.
[127] SZEGEDY, C, ZAREMBA, W, SUTSKEVER, I, BRUNA, J, ERBAN, D, GOODFELLOW, I. J & FERGUS, R, Intriguing properties of neural networks, arxiv.org/abs/1312.6199, [Last accessed 5-December-2020].
[128] NGUYEN, A, YOSINSKI, J & CLUNE, J 2015 Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 427–436.
[129] YOSINSKI, J, CLUNE, J, NGUYEN, A, FUCHS, T & LIPSON, H, Understanding neural networks through deep visualization, arxiv.org/abs/1506.06579, [Last accessed 5-December-2020].
[130] GRAETZ, F. M, How to visualize convolutional features in 40 lines of code, https://towardsdatascience.com, [Last accessed 30-December-2020].
[131] DOSOVITSKIY, A, BEYER, L, KOLESNIKOV, A et al., An image is worth 16x16 words: Transformers for image recognition at scale, arxiv.org/abs/2010.11929, [Last accessed 5-December-2020].
[132] KRIZHEVSKY, A, Learning multiple layers of features from tiny images, www.cs.toronto.edu/~kriz, [Last accessed 1-November-2020].
[133] OTT, E 2002 Chaos in Dynamical Systems, 2nd edn. Cambridge University Press.
[134] SUTSKEVER, I, VINYALS, O & LE, Q. V 2014 Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (ed. Z Ghahramani, M Welling, C Cortes, N Lawrence & K. Q Weinberger), vol. 27, pp. 3104–3112. Curran Associates, Inc.
[135] LIPTON, Z. C, BERKOWITZ, J & ELKAN, C, A critical review of recurrent neural networks for sequence learning, arxiv.org/abs/1506.00019, [Last accessed 5-December-2020].
[136] PASCANU, R, MIKOLOV, T & BENGIO, Y 2013 On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, pp. III–1310–III–1318.
[137] HOCHREITER, S & SCHMIDHUBER, J 1997 Long short-term memory. Neural Computation 9, 1735.
[138] OLAH, C, Understanding lstm networks, colah.github.io, [Online; accessed 30-September-2020].
[139] CHO, K, VAN MERRIËNBOER, B, GULCEHRE, C, BAHDANAU, D, BOUGARES, F, SCHWENK, H & BENGIO, Y 2014 Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Doha, Qatar: Association for Computational Linguistics.
[140] HECK, J. C & SALEM, F. M 2017 Simplified minimal gated unit variations for recurrent neural networks. In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 1593–1596.
[141] WU, Y, SCHUSTER, M, CHEN, Z et al., Google's neural machine translation system: bridging the gap between Human and machine translation, arxiv.org/abs/1609.08144, [Last accessed 5-December-2020].
[142] PAPINENI, K, ROUKOS, S, WARD, T & ZHU, W.-J 2002 BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, p. 311.
[143] LUKOSEVICIUS, M & JAEGER, H 2009 Reservoir computing approaches to recurrent neural network training. Computer Science Review 3, 127.
[144] PATHAK, J, HUNT, B, GIRVAN, M, LU, Z & OTT, E 2018 Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett. 120, 024102.
[145] JAEGER, H & HAAS, H 2004 Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304, 78–80.
[146] LIM, S. H, GIORGINI, L. T. T, MOON, W & WETTLAUFER, J. S, Predicting critical transitions in multiscale dynamical systems using reservoir computing, arxiv.org/abs/1908.03771, [Last accessed 5-December-2020].
[147] LUKOSEVICIUS, M 2012 A practical guide to applying echo state networks. In Neural Networks: Tricks of the Trade (ed. G Montavon, G Orr & K Müller). Berlin, Heidelberg: Springer.
[148] TANAKA, G, YAMANE, T, HÉROUX, J. B et al. 2019 Recent advances in physical reservoir computing: A review. Neural Networks 115, 100–123.
[149] DOYA, K 1993 Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks 1, 75.
[150] WILLIAMS, R. J & ZIPSER, D 1995 Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications (ed. Y Chauvin & D. E Rumelhart), pp. 433–486. Hillsdale, NJ: Erlbaum.
[151] DOYA, K 1995 Recurrent networks: supervised learning. In The Handbook of Brain Theory and Neural Networks (ed. M. A Arbib), pp. 796–799. Cambridge MA: MIT Press.
[152] KARPATHY, A, The unreasonable effectiveness of recurrent neural networks, karpathy.github.io, [Online; accessed 4-October-2018].
[153] KANTZ, H & SCHREIBER, T 2004 Nonlinear Time Series Analysis. Cambridge: Cambridge University Press.
[154] OJA, E 1982 A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267.
[155] WILKINSON, M, BEZUGLYY, V & MEHLIG, B 2009 Fingerprints of random flows? Phys. Fluids 21, 043304.
[156] WELIKY, M, BOSKING, W. H & FITZPATRICK, D 1996 A systematic map of direction preference in primary visual cortex. Nature 379, 1476–4687.
[157] KOHONEN, T 2013 Essentials of the self-organizing map. Neural Networks 37, 52–65.
[158] KOHONEN, T 1995 Self-Organizing Maps. Berlin: Springer.
[159] MARTIN, R & OBERMAYER, K 2009 Self-organizing maps. In Encyclopedia of Neuroscience (ed. L. R Squire), p. 551. Oxford: Academic Press.
[160] RITTER, H & SCHULTEN, K 1986 On the stationary state of kohonen's self-organizing sensory mapping. Biological Cybernetics 54, 99–106.
[161] JACKSON, J. D 1999 Classical electrodynamics, 3rd edn. New York, NY: Wiley.
[162] SNYDER, W, NISSMAN, D, VAN DEN BOUT, D & BILBRO, G 1991 Kohonen networks and clustering: Comparative performance in color clustering. In Advances in Neural Information Processing Systems (ed. R. P Lippmann, J Moody & D Touretzky), vol. 3, pp. 984–990. Morgan-Kaufmann.
[163] BOURLARD, H & KAMP, Y 1988 Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59, 201.
[164] NG, A, Sparse autoencoder, web.stanford.edu/class/cs294a, [Online; accessed 13-October-2020].
[165] KINGMA, D. P & WELLING, M, Auto-encoding variational Bayes, arxiv.org/abs/1312.6114, [Last accessed 5-December-2020].
[166] DOERSCH, C, Tutorial on variational autoencoders, arxiv.org/abs/1606.05908, [Last accessed 5-December-2020].
[167] JIMENEZ REZENDE, D, MOHAMED, S & WIERSTRA, D 2014 Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ed. E. P Xing & T Jebara), Proceedings of Machine Learning Research, vol. 32, pp. 1278–1286. Bejing, China: PMLR.
[168] JANKOWIAK, M & OBERMEYER, F, Pathwise derivatives beyond the reparameterization trick, arxiv.org/abs/1806.01851, [Last accessed 25-May-2021].
[169] GOODFELLOW, I. J, POUGET-ABADIE, J, MIRZA, M, XU, B, WARDE-FARLEY, D, OZAIR, S, COURVILLE, A & BENGIO, Y 2014 Generative adversarial nets. In Advances in Neural Information Processing Systems (ed. Z Ghahramani, M Welling, C Cortes, N Lawrence & K. Q Weinberger), vol. 27, pp. 2672–2680. Curran Associates, Inc.
[170] ROCCA, J, Understanding generative adversarial networks, towardsdatascience.com, [Last accessed 15-October-2020].
[171] SAMPLE, I, What are deepfakes – and how can you spot them?, theguardian.com, [Last accessed 30-September-2020].
[172] WETTSCHERECK, D & DIETTERICH, T 1992 Improving the performance of radial basis function networks by learning center locations. In Advances in Neural Information Processing Systems (ed. J Moody, S Hanson & R. P Lippmann), vol. 4, pp. 1133–1140. Morgan-Kaufmann.
[173] POGGIO, T & GIROSI, F 1990 Networks for approximation and learning. Proceedings of the IEEE 78 (9), 1481–1497.
[174] BOURLARD, H, Auto-association by multilayer perceptrons and singular value decomposition, publications.idiap.ch/downloads/reports/2000/rr00-16.pdf, [Last accessed 16-October-2020].
[175] POURKAMALI-ANARAKI, F & WAKIN, M. B, The effectiveness of variational autoencoders for active learning, arxiv.org/abs/1911.07716, [Last accessed 5-December-2020].
[176] EDUARDO, S, NAZABAL, A, WILLIAMS, C. K. I & SUTTON, C, Robust variational autoencoders for outlier detection and repair of mixed-type data, arxiv.org/abs/1907.06671, [Last accessed 5-December-2020].
[177] LI, C, GAO, X, LI, Y, PENG, B, LI, X, ZHANG, Y & GAO, J, Optimus: Organizing sentences via pre-trained modeling of a latent space, arxiv.org/abs/2004.04092, [Last accessed 5-December-2020].
[178] WILLIAMS, R. J 1992 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.
[179] COLABRESE, S, GUSTAVSSON, K, CELANI, A & BIFERALE, L 2017 Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118, 158004.
[180] MINSKY, M 1961 Steps toward artificial intelligence. Proceedings of the IRE pp. 8–30.
[181] SUTTON, R. S 1988 Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44.
[182] SILVER, D, HUANG, A, MADDISON, C. J et al. 2016 Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489.
[183] MNIH, V, KAVUKCUOGLU, K, SILVER, D et al. 2015 Human-level control through deep reinforcement learning. Nature 518, 1476–4687.
[184] BARTO, A. G 1985 Learning by statistical cooperation of self-interested neuron-like computing elements. Hum. Neurobiol. 4, 229–56.
[185] SZEPESVARI, C 2010 Algorithms for reinforcement learning. In Synthesis Lectures on Artificial Intelligence and Machine Learning (ed. R. J Brachmann & T Dietterich). Morgan and Claypool Publishers.
[186] MCCLELLAND, J. L 2015 Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. New Jersey: Prentice Hall, [Online; accessed 19-November-2019].
[187] WATKINS, C. J. C. H 1989 Learning from delayed rewards. PhD thesis, University of Cambridge, [Online; accessed 25-December-2019].
[188] WATKINS, C. J. C. H & DAYAN, P 1992 Q-learning. Machine Learning 8, 279–292.
[189] BELLMAN, R. E 1957 Dynamic Programming. Dover Publications.
[190] CROWLEY, K & SIEGLER, R. S 1993 Flexible strategy use in young children's tic-tac-toe. Cognitive Science 17, 531–561.
[191] CICHOS, F, GUSTAVSSON, K, MEHLIG, B & VOLPE, G 2020 Machine learning for active matter. Nature Machine Intelligence 2, 94–103.
[192] QIU, J, MOUSAVI, N, GUSTAVSSON, K, XU, C, MEHLIG, B & ZHAO, L, Navigation of a micro-swimmer in steady vortical flow: the importance of symmetries, arxiv.org/abs/2104.11303, [Last accessed 25-May-2021].
[193] KLOPF, A. H 1982 The Hedonistic Neuron: Theory of Memory, Learning and Intelligence. Taylor and Francis.
[194] MORGAN, J, How to win at rock-paper-scissors, bbc.com/news/science-environment-27228416, [Last accessed 7-September-2020].
[195] ALLIS, V, A knowledge-based approach of connect-four, Report IR-163, Faculty of Mathematics and Computer Science at the Vrije Universiteit Amsterdam.
Author index
Abadi, M. 152
Agarwal, A. 152
Allis, V. 237
Amit, D. J. 30, 41, 42, 47, 48 Anguelov, D. 161
Arous, G. Ben 147 Artuso, G.R. 131 Ay, N. 70, 71
Bahdanau, D. 176
Barham, P. 152
Barto, A. G. 4, 218, 219, 221–224, 226,
229, 235–237
Becker, S. 148
Bellman, R. E. 229
Bengio, Y. v, 8, 72, 132–134, 150, 152, 153,
160, 174, 176, 184, 211
Berg, A. C. 159, 160
Berkowitz, J. 171, 172, 178–181, 184, 186 Bernstein, M. 159, 160
Bethge, M. 162
Beyer, L. 163
Bezuglyy, V. 190, 213
Biehl, M. 52
Biferale, L. 218
Bilbro, G. 201
Birren, B. 60, 61
Bordes, A. 133, 134
Bosking, W. H. 196
Bottou, L. 107, 108, 115
Bougares, F. 176
Bourlard, H. 209, 212
Bruna, J. 162, 163
Burges, C. J.C. 154, 156
Cannas, S.A. 126, 127 Carbin, M. 144
Carleo, G. 71
Carnap, R. 2
Celani, A. 218
Chen, Z. 180, 181
Cho, K. 176
Choromanska, A. 147
Cichos, F. 234, 235
Ciregan, D. 156, 157
Clune, J. 163, 165
Colabrese, S. 218
Cortes, C. 154, 156
Courville, A. v, 8, 150, 152, 153, 160, 211 Cover, T. M. 87, 88, 204
Crisanti, A. 131 Crowley, K. 232, 237 Cvitanovic, P. 131
Dahl, G. 132
Dayan, P. 72, 229, 235
Deng, J. 145, 159, 160
Denker, J. 144
Dietterich, T. 212
Doersch, C. 209, 212
Dong, W. 145, 159
Dosovitskiy, A. 163
Doya, K. 184
Dua, D. 76, 77
Eckmann, J. P. 131
Eduardo, S. 213
Elkan, C. 171, 172, 178–181, 184, 186 Engel, A. 52
Erban, D. 162, 163
Erhan, D. 161
Evans, R. 4
Farber, R. 121, 123 Feller, W. 24
Fergus, R. 162, 163 Fischer, A. 14, 54, 69–71 Fisher, R. A. 76 Fitzpatrick, D. 196 Flannery, W. P. 60, 94 Forrest, B. M. 57 Forster, J. 157
Franco, L. 126, 127 Frankle, J. 144
Fuchs, T. 163, 165 Fyodorov, Y. V. 148
Gabbiani, F. 6, 7 Ganguli, S. 131–133 Gao, J. 213
Gao, X. 213
Geirhos, R. 162
Gelatt, C. D. 54, 60 Geszti, T. 43
Gilmer, J. 133
Giorgini, L. T. T. 181–183 Girosi, F. 212
Girvan, M. 181, 183
Glorot, X. 132–134
Goodfellow, I. J. v, 8, 150, 152, 153, 160,
162, 163, 211
Graetz, F. M. 163, 165
Graff, C. 76, 77
Greub, W. 105
Gubernatis, J. E. 71
Gulcehre, C. 176
Gustavsson, K. 218, 234, 235 Gutfreund, H. 30, 41, 42, 47, 48
Haas, H. 181
Hanson, S. 142
Hassibi, B. 142, 144
Haykin, S. v, 15, 22, 72, 115, 123, 174, 184,
197, 202, 204–206, 212 He, K. 134, 137, 160, 161 Hebb, D. O. 2, 14, 18
Heck, J. C. 178
Heermann, D. W. 57
Henaff, M. 147
Hern, A. 162
Héroux, J. B. 184
Hertz, J. v, 5, 30, 40, 41, 43, 46, 47, 50, 51,
60, 77, 99, 102, 115, 121, 123, 124, 170, 184, 189, 193, 195–198, 204, 205, 212, 236
Hinton, G. E 150, 154, 161 Hochreiter, S. 176
Hopfield, J. J. 3, 14, 16, 22, 27, 33, 60 Horner, H. v, 89
Hornik, K. 121
Hu, J. 160, 161 Huang, A. 220, 235 Huang, Z. 159, 160 Hunt, B. 181, 183
Igel, C. 14, 54, 69–71 Ilyas, A. 146
Ioffe, S. 146
Jackson, J. D. 201
Jaeger, H. 181, 182, 184 Jankowiak, Martin 211 Jimenez Rezende, D. 209, 211 Johnson, J. 160
Jumper, J. 4
Kadanoff, L. P. 28, 48, 57 Kamp, Y. 209, 212 Kanal, L.N. 12, 93 Kantz, H. 186
Karpathy, A. 159, 160, 162, 184 Kavukcuoglu, K. 220, 234 Khosla, A. 159, 160 Khurshudov, A. 162
Kingma, D. P. 209 Kirkpatrick, J. 147 Kirkpatrick, S. 14, 30, 54, 60
Kleinbaum, D. 139, 140 Klopf, A. H. 236
Kohonen, T. 4, 196, 197, 212 Kolesnikov, A. 163 Kozielski, M. 157
Krause, J. 159, 160
Krizhevsky, A. 141, 145, 150, 154, 161, 165 Krogh, A. v, 5, 30, 40, 41, 43, 46, 47, 50, 51,
60, 77, 99, 102, 115, 121, 123, 124, 170, 184, 189, 193, 195–198, 204, 205, 212, 236
Kühn, R. 49 Kupper, L. 139, 140 Kwok, J. 148
Lander, E. 60, 61
Lapedes, A. 121, 123
Le, Q. V. 171, 178–181
Le Roux, N. 70
LeCun, Y. 107, 108, 115, 144, 147, 154,
156
Lee, A. A. 148
Li, C. 213
Li, F. F. 145, 159, 160
Li, K. 145, 159
Li, L. J. 145, 159
Li, X. 213
Li, Y. 213
Lim, S. H. 181–183
Linton, L. 60, 61
Lippmann, R. 15
Lipson, H. 163, 165
Lipton, Z. C. 171, 172, 178–181, 184, 186 Little, W.A. 14, 16
Löwe, M. 50
Lu, Z. 181, 183
Lukosevicius, M. 181–184
Ma, S. 159, 160
MacKay, D. J. C. 54, 66, 69, 71, 73 Maddison, C. J. 220, 235
Madry, A. 146
Mainieri, R. 131
Martens, J. 132
Martin, R. 196
Mathews, J. 20, 24, 210 Mathieu, M. 147
McClelland, J. L. 226 McCulloch, W.S. 2, 7, 11 Medina Temme, C. R. 162 Mehlig, B. 57, 190, 213, 234, 235 Meier, U. 156, 157
Metropolis, N. 57, 58 Metzner, W. 6, 7
Mikolov, T. 174, 184 Minsky, M. 3, 12, 82, 91, 219 Mirza, M. 211
Mnih, V. 220, 234 Mohamed, S. 209, 211 Montúfar, G. F. 70, 71 Moon, W. 181–183 Morgan, J. 237
Mousavi, N. 235
Müller, B. 39, 52
Müller, K.-R. 107, 108, 115 Murphy, K. P. 71, 140
Nazabal, A. 213 Neal, R. M. 72 Nesterov, Y. 114, 116 Ney, H. 157
Ng, A. 209
Nguyen, A. 163, 165
Ni, L. M. 148
Nielsen, M. v, 127–129, 141, 145, 151,
153–155, 157, 162, 163 Nissman, D. 201
Nizam, A. 139, 140
Obermayer, K. 196 Obermeyer, Fritz 211 Oja, E. 190
Olah, C. 176, 184
Orr, G. B. 107, 108, 115 Ott, E. 168, 181, 183 Ouyang, W. 161
Ozair, S. 211
Paladin, G. 131
Palmer, R.G. v, 5, 30, 40, 41, 43, 46, 47, 50,
51, 60, 77, 99, 102, 115, 121, 123, 124, 170, 184, 189, 193, 195–198, 204, 205, 212, 236
Papert, S. 3, 12, 82, 91 Papineni, K. 181 Pascanu, R. 147, 174, 184 Pathak, J. 181, 183
Peng, B. 213
Pennington, J. 131, 132 Picasso, J. P. 157
Pitts, W. 2, 7, 11
Poggio, T. 212 Pouget-Abadie, J. 211 Pourkamali-Anaraki, F. 212 Pratt, L. 142
Press, W. H. 60, 94
Qiu, J. 235
Rabinovich, A. 161 Rabinowitz, N. 147 Rau, A. 52
Rauber, J. 162 Rauh, J. 70
Reed, S. 161
Reinhardt, J. 39, 52
Ren, S. 134, 137, 160, 161 Ritter, H. 198, 214
Rocca, J. 211
Rosenblatt, F. 3, 76, 77, 84 Rosenbluth, A. W. 57, 58 Rosenbluth, M. N. 57, 58 Roukos, S. 181
Ruelle, D. 131
Rumelhart, D. E. 3, 115, 116 Russakovsky, O. 159, 160
Salakhutdinov, R. 141, 145 Salem, F. M. 178
Sample, I. 212
Santurkar, S. 146 Satheesh, S. 159, 160 Schmidhuber, J. 176 Schoenholz, S. S. 131–133 Schreiber, T. 186 Schulten, K. 198, 214 Schuster, M. 180, 181 Schütt, H. H. 162 Schwenk, H. 176
Seif, G. 160
Sejnowski, T. J. 3, 35, 54, 73
Senior, A. W. 4
Sermanet, P. 161
Settles, Burr 147
Shen, L. 160, 161 Sherrington, D. 14, 30
Siegler, R. S. 232, 237
Silver, D. 220, 234, 235 Sloane, N. J. A. 99
Smith, L. N. 155
Smolensky, P. 66
Snyder, W. 201
Socher, R. 145, 159 Sohl-Dickstein, J. 133
Sokal, A. 56, 58, 71
Solla, S. 144
Sompolinsky, H. 30, 41, 42, 93 Srivastava, N. 141, 145 Steffan, H. 49
Stinchcombe, M. 121
Stork, D. 142, 144
Storm, L. 131
Strickland, M. T. 39, 52
Strogatz, S. H. 131, 170, 182, 192
Su, H. 159, 160
Sun, G. 160, 161
Sun, J. 134, 137, 160, 161
Sutskever, I. 115, 116, 132, 141, 145, 150,
154, 161–163, 171, 178–181
Sutton, C. 213
Sutton, R. S. 4, 218–220, 225, 229, 235,
237
Szegedy, C. 146, 162, 163 Szepesvari, C. 226, 229, 235
Tanaka, G. 184
Tank, D. W. 60 Tanner, G. 131
Teller, E. 57, 58
Teller, M. 57, 58 Tesauro, G. 4, 226 Teukolsky, S. A. 60, 94 Troyer, M. 71
Tsipras, D. 146
Van den Bout, D. 201 Van den Broeck, C. 52 Van Kampen, N. G. 56 van Merriënboer, B. 176 Vanhoucke, V. 161 Vattay, G. 131
Vecchi, M. P. 54, 60 Vetterling, W. T. 60, 94 Vinyals, O. 171, 178–181 Volk, D 50
Volpe, G. 234, 235 Vulpiani, A. 131
Wakin, M. B. 212 Walker, R. L. 20, 24, 210 Wang, Y. 148
Ward, T. 181
Warde-Farley, D. 211
Waterman, M. S. 60, 61
Watkin, T. L. H. 52
Watkins, C. J. C. H. 229, 235
Wei Liu 161
Weisstein, E. W. 26
Weliky, M. 196
Welling, M. 209
Wettlaufer, J. S. 181–183
Wettschereck, D. 212
White, H. 121
Wichmann, F. A. 162
Wierstra, D. 209, 211
Wilkinson, M. 190, 213
Williams, C. K. I. 213
Williams, R. J. 3, 115, 116, 184, 218, 223 Wu, Y. 180, 181
Xu, B. 211 Xu, C. 235
Yamane, T. 184 Yan, J. 161 Yangqing Jia 161 Yao, Q. 148
Yeung, S. 160 Yosinski, J. 163, 165
Zaremba, W. 162, 163 Zemel, R. S. 72
Zeng, X. 161
Zhang, X. 134, 137, 160, 161 Zhang, Y. 148, 213
Zhao, L. 235 Zhu, W.-J. 181 Zipser, D. 184
Subject index
acceptance probability, 56 action, 227
greedy, 220
suboptimal, 220
activation function, 9, 32, 117, 171
derivative of, 106
linear, 84, 121, 205
piecewise linear, 9, 10
ReLU, see ReLU function
saturation, 106, 140
sigmoid, 105, 112, 121–123, 134, 137,
139, 146, 148, 164, 177, 206, 209 tanh, 105, 124, 137, 148, 171, 183
active learning, 147, 212 adversarial images, 163
agent, 218
annealing, simulated, 4, 54, 60 annotation, 159, 160
argmax, 220
association task, temporal, 167 associative reward penalty algorithm, 219,
221–224, 229, 235 attraction, region of, 17
attractor, 17, 18, 21–23, 29, 35 autoencoder, 14, 208, 207–211, 216 average
time, see time average weighted, 8
backpropagation, 3, 102, 100–103, 166, 226
recurrent, see recurrent backpropagation
stochastic, see stochastic backpropagation
through time, 167, 175, 171–176
bars and stripes data set, 69, 69, 73, 165 basis
function, 122–123
orthonormal, 109, 192 batch
learning, 204
batch normalisation, 133, 140, 146, 146
batch training, 102, 103
Bernoulli trial, 24
bias, see also threshold, 8
binary threshold unit, 2, 7, 11, 79, 81, 87,
164
binomial distribution, 24
bit, binary, 15
Boltzmann constant, 53
Boltzmann distribution, 49, 54, 54, 56–58,
60, 62, 68
Boltzmann machine, 3, 14, 54, 62–71, 166,
209
restricted, see restricted Boltzmann
machine
Boolean function, 81, 82, 86, 105, 123–
125, 140, 148, 214 AND, 81
parity, see parity function XNOR, 82
XOR, see XOR function
bottleneck, 133, 208, 209
bridge, 233
catastrophic forgetting, 147
categorical outcome, 140
Cauchy Green matrix, 131
central limit theorem, 24, 38, 43, 130, 132 chain rule, 100, 130, 223
chaos theory, 131 CIFAR 10 data set, 165 classification
accuracy, 112, 155–156
error, see classification error problem, binary, 91, 94, 182, 189, 209,
214
task, 51, 76, 121, 159
classification error, 111, 111–113, 139, 155, 157, 160, 162, 165, 211, 214
cluster, 1, 188, 189, 193–195, 201–204, 206, 211
colour channel, 152
combinatorial optimisation, 59, 60 competitive learning, 194, 193–197, 203,
206, 213
complete enumeration, 59
configuration space, 16, 29, 35, 61
connections, symmetric, see weight matrix, symmetric
contingency space, 221, 222
continuum limit, 198, 198
convergence, 3, 11, 14, 17, 17, 19, 27, 27, 29, 34
accelerated, 114, 115
CDk, 70
criterion, 17, 36
energy function, 27
gradient descent, 86
Hopfield network, 51
inverted pattern, 30
learning rule, 65
Markov chain, 56
mixed state, 30, 42
Monte Carlo sampling, 58
order parameter, 37, 40, 41
phase, 129, 197
proof, 17, 223, 224
Q learning, 228–230, 235, 236
rate, 230
recurrent backpropagation, 170, 171
recurrent network, 168
SARSA, 228
slow, 224, 231
spurious state, 30
stochastic dynamics, 54, 61, 68
to attractor, 35
training, 76
convolution, 150, 152
convolution layer, 150, 151, 151
convolutional network, 3, 127, 150, 164, 235
cortex
cerebral, 5, 6, 9, 196 visual, 196
cost function, 222
covariance, 23
covariance matrix, 109, 109, 116, 191, 192,
213
covariate shift, 107, 146
credit assignment problem, 219
cross entropy, 140, 148
cross talk term, 23, 22–24, 27, 29, 31, 41–43, 50, 52, 61
cross validation, 110–162
curse of dimensionality, 229 cycle, 220
data set augmentation, see also training set, expansion of, 145, 147, 161, 188
data, synthetic, 188
decision boundary, 80, 79–84, 87, 89–163
piecewise linear, 90
decision process, sequential, 218 decoder, 208, 208–211
deep fake, 212
deep learning, 8, 72, 77, 115, 121–149,
158, 163
deep network, 3, 106, 127, 128, 131, 133, 139, 140, 180, 220, 226
delta rule, 102
detailed balance, 56
deterministic limit, 36, 41, 46, 47, 51, 52 difference, temporal, 225
differential equation, stochastic, 230 digest, 60
dimensionality reduction, 4, 201, 209 dimensionality,curseof,seecurseofdi-
mensionality diminishing returns, 156
direction
maximal eigenvalue, 192
distance, 194, 195, 206
between patterns, 16, 16, 19, 34 Euclidean, see Euclidean distance from the boundary, 213 Hamming, see Hamming distance in output array, 196, 196
to surface, 219
distribution Bernoulli, 140
Boltzmann, see Boltzmann distribu- tion
log normal, 130
steady state, 49, 56, 64 DNA segment, 60
DNA sequence, 60
double digest problem, 60 drop out, 140, 145–147, 161 duality, 170
dynamical system, 184 dynamics
asynchronous, see update rule, asyn- chronous
linearised, 183
stochastic, 35, 35–55, 60, 61 synchronous, see update rule, syn-
chronous
early stopping, 110, 155
eigenvalue, 109, 109, 110, 116, 131, 191,
192
maximal, 110, 192 non negative, 131
eigenvector, 109, 109, 110, 131, 191, 192, 213
leading, 109, 196
elastic net, 197, 199
embedding, 88, 89, 204, 224, 236
dimension, 207
encoder, 208, 208–210
energy function, 27–30, 54, 55, 57, 59–61,
78, 85, 86, 100, 107, 111–113, 137– 141, 155, 167, 168, 171, 181, 185, 202, 208, 209, 223, 225
invariance, 30
energy landscape, 60, 62, 146–148 enumeration, complete, 231 environment, 227
deterministic, 220 non-stationary, 220 stochastic, 220
episode, 219, 219, 224, 226, 227, 229–231 length, 219
epoch, 103, 105, 108, 129, 155
error, 102, 127, 129–132, 134, 169, 170,
173
avalanche, 47, 47
backpropagation, 102, 136, 170, 171,
174
classification, see classification error distribution, 130, 133
dynamics, 131
function, 26
output, see output error probability, see error probability squared, 139
variance, 132
error probability, 24–27, 41, 42, 46–48 Euclidean distance, 196
evolution, genetic, 229
exploding gradient, 127–133, 147, 174 exploitation, 218
exploration, 218
factor, weighting, 225, 226
familiarity, 1, 188, 190
feature map, 127, 150, 151, 151, 164 feed forward network, 77, 101, 166 feedback, 2, 166, 167, 168, 170, 171, 185,
218, 224
field
local, see local field
receptive, see receptive field
filter, 150, 151 fingerprint, 60 fluctuations, 224 free energy, 40 function
activation, see activation function approximation, 121
basis, see basis function
Boolean, see Boolean function continuous, 123
energy, see energy function
loss, see loss function
Lyapunov, see Lyapunov, function neighbourhood, see neighbourhood
function
Q function, see Q learning
radial basis, see radial basis function ReLU, see ReLU function
gate, 177
generalisation, 1, 76, 141
generative adversarial networks, 207 generative model, 70, 72, 209, 211 GPU, 161
gradient
accelerated, Nesterov, 114 exploding, see exploding gradient unstable, see unstable gradient vanishing, see vanishing gradient
gradient ascent, 63, 219, 223
gradient descent, 78, 84–86, 92, 100, 141,
178, 225 stochastic, 76, 135
Hamiltonian, see also energy function, 28, 57
Hamming distance, 33
Heaviside function, 33, 95, 97, 164 Hebb’s rule, 14, 16, 18, 18, 22, 65, 189, 190,
195
Helmholtz machine, 14, 72, 212
hidden layer, 3, 76, 77, 86, 89–92, 101–
106, 121–127, 154–156, 204, 207–
211, 217, 225
hidden neuron, 3, 14, 54, 54, 62, 66, 70,
76, 77, 91, 92, 100, 107, 115, 124– 125, 133, 142, 144, 146, 151, 154– 156, 166, 167, 171, 174, 176, 181, 204, 225
homogeneously linearly separable, 87 Hopfield network, 3, 18, 14–54, 62, 166 human genome sequence, 60
image classification, 159
imagenet, 145, 150, 154, 159, 160, 162 importance sampling, 58
inertia, 113
initialisation
of weights, 183 optimistic of Q table, 228
input, 1, 107
array of input terminals, 150
clamp, 69, 69 linearlyseparable,seelinearsepara-
bility
pattern, see input pattern plane, 79
preprocessing, 106–110, 154 scaling, 107
sequential, 171
space, see input space terminal, see input terminal
input distribution, 54, 62, 66, 71, 111, 147, 162, 190, 195, 197, 199–201, 212, 213
input pattern, 1, 3, 14, 62, 63, 65, 76, 78, 85, 93, 100, 105, 124, 133, 167, 188, 207
binary, 54, 62, 63, 70
cluster, 202
distribution, 158, 189, 190, 193, 198,
201 encoding, 208
familiarity, 188
high dimensional, 195
linearly dependent, 85
linearly independent, 224
linearly separable, see linear separa-
bility normalisation, 146 normalised, 194 random, 132 shuffle, 108 similarity, 188
input space, 81, 86–89, 97, 108, 109, 162, 163, 195, 197–199, 201, 203–206, 224, 236
input terminal, 77, 77, 78, 92, 103, 108, 120, 150, 171, 172, 181, 185
instability, 127
iris data set, 76, 214
K means clustering, 4, 202
kernel, 150, 151
Kronecker delta, 23, 64, 101 Kullback Leibler divergence, 63, 209
label, see also target, 1, 71, 76, 76, 188, 201
Lagrange multiplier, 109, 143, 209 Lagrangian, 117, 143
singular point, 117
latent variable, 208–210, 212, 216, 217 layer
bottleneck, 208
convolution, see convolution layer fully connected, 129, 151, 154, 156 hidden, see hidden layer
pooling, see pooling layer
learning
active, see active learning approximation problem, 121, 123 batch, see batch learning competitive, see competitive learn-
ing
curve, 224, 231, 234
deep, see deep learning
episode, see episode
few shot, 148
goal, 218
rate, see learning rate
reinforcement, see reinforcement learn-
ing
rule, see learning rule
supervised, see supervised learning temporal difference, 219, 224 unsupervised, see unsupervised learn-
ing
learning rate, 63, 65, 70, 84, 85, 100, 105,
113, 115, 117, 155, 229, 233
learning rule, 2, 3, 14, 18, 65, 67, 84, 86, 99, 149, 169, 173, 185, 188, 191, 196, 222, 224
adaptive, 113
competitive, see competitive learning
Hebb's, see Hebb's rule
SARSA, 227
TD(0), 226
TD(λ), 226
likelihood, 63, 139
linear separability, 81, 204
local connection, 92
local field, 8, 9, 23, 33, 35, 36, 38, 49, 55,
78, 101, 103, 106, 107, 124, 128, 132, 134, 135, 137, 140, 153, 172, 221, 223
time averaged, 38
local minimum, 30, 35, 41, 51, 59, 103,
142, 147, 203 log likelihood, 139, 167
gradient, 64
long short term memory, 176, 179 lookup table, 220
Lorenz system, 186
loss function, see also energy function, 28 LSTM, see long short term memory Lyapunov
exponent, 131, 132 function, 28, 54 vector, 131
machine learning repository, 76 machine translation, 1, 171, 179
bilingual evaluation understudy, 181 code, 178
end of sentence tag, 179, 180 sentence, 1, 179–181
map
non linear, 204, 205
restriction, see restriction, map
self organising, 196, 199
semantic, 195
topographic, 195
Markov chain, 4, 56, 229, 235
Markov chain Monte Carlo, 4, 14, 54, 57–
71
global move, 56, 58 local move, 56, 57, 58
matrix
correlation, 196
covariance, see covariance matrix Hessian, 142, 144
idempotent, 20, 21
inverse, 85, 143, 169 multiplication, 19
notation, 16
overlap, see overlap matrix projection, 20, 110 random, 130
transpose, 105
weight, see weight, matrix
McCulloch Pitts neuron, 2, 7–9, 11, 16, 18, 51, 76, 77, 99
mean field, 38
mean field approximation, see mean field theory
mean field equation, 39, 39, 40, 42, 43
mean field theory, 38–42, 49, 50, 52, 132
membrane potential, 10
memory, dynamical, 182
Metropolis algorithm, 57 microorganism, 229
mini batch, 105, 105, 108, 145, 146 minimum
local, see local minimum
spurious, see spurious minimum MNIST data set, 112, 129, 154, 154–158,
165, 201, 202 momentum, 113, 114–116, 155
constant, 113, 114, 155
Monte Carlo, see Markov chain Monte
Carlo
multi layer perceptron, 147
N armed bandit problem, 218 neighbourhood function, 196, 197, 199,
200, 203
network, 106, 127, 128, 131, 133, 139, 140
convolutional, see convolutional network
deep, see deep network
deep belief, 14
dynamics, 17, 37, 47, 55, 166, 170,
183
ensemble of networks, 156, 161 feed forward, see feed forward net-
work
Hopfield, see Hopfield network layered, 3, 77, 102, 226 recurrent, see recurrent network residual, see residual network
neuron, 6–11
active, 2, 6, 133, 134
axon, 6, 7
binary stochastic, 35, 211, 221, 236 bottleneck, 209
cell body, 6, 7
dendrite, 6, 7
firing, 2, 10
hidden, see hidden neuron
inactive, 6
leaky integrate and fire, 10 McCulloch Pitts, see McCulloch Pitts
neuron pyramidal, 6
state, 7
stochastic, 35, 221, 236 synapse, 6, 7
visible, 54
winning, see winning neuron
noise level, 35, 36, 41, 42, 48, 54, 60, 221
critical, 41
object recognition, 1, 145, 147, 150, 154, 158–161
observable, 58
Oja’s rule, 190, 189–193, 195, 213
Oja’s M rule, 193 orderdisordertransition,35
order parameter, 37, 37–43, 47–49, 51
mixed state, 41, 52
ordering phase, 197
output error, 3, 76, 101, 101, 102, 108, 112,
133, 138, 147, 167, 172, 181 output neuron, 71, 77, 78, 82, 91, 93–95, 103, 111, 120, 123–125, 137, 138, 148, 154, 164, 166, 167, 171, 172, 174, 182, 185, 193, 196, 198, 205,
207, 212
output space, 195, 197, 199
output unit, see also output neuron, 66, 168
overfitting, 110, 110, 111, 115, 140–142, 145, 150, 153, 155, 161
overlap matrix, 50, 85
padding, 152, 153
parity function, 125 partition function, 49, 57 pattern, 1, 14, 15
bit, see bit
correlations, 50, 54, 132
distorted, 15, 15, 17, 19, 21, 23 input, see input pattern
inverse, 29
inverted, 22, 30, 40
linearly dependent, see input pat-
tern, linearly dependent, 51 linearly independent, see input pat-
tern, linearly independent
linearly separable, see linear separa- bility
orthogonal, 27, 31, 50
overlap, 50
random, 15, 23, 37, 41–43, 49, 50, 132 recognition, 14–16, 31
retrieval, 14, 30
second order statistics, see also two
point correlations, 22
spatio temporal, 183
stored, 14, 15, 17, 19, 21–23, 26, 29–
31, 36, 37, 41, 42, 48, 51, 52
two point correlations, 22, 54, 64, 65 uncorrelated, 23
undistorted, 17
penalty, 218
perceptron, 3, 76, 77, 166 phase transition, 48, 51 policy, 220, 227
ε greedy, 220, 228–232 deterministic, 220 greedy, 220, 228 softmax, 220 stochastic, 220
pooling layer, 150, 153–154, 156, 164 L2 pooling layer, 153
max pooling layer, 153
predation, 229
prediction, one step, 226
principal component, 4, 108–110, 116,
190–193, 209, 211, 214 principal manifold, 195, 196, 198, 214 probability
of acceptance, see acceptance proba- bility
of transition, see transition probabil- ity
product
Hadamard, see also product, Schur,
105
matrix product, 183 outer, 20
scalar, see scalar product Schur, 105
projection, see also matrix, projection, 110
propulsion, 229
protein folding, 4
pruning, 140, 141, 141, 145, 147, 148
iterative, 144 psychology, behavioural, 218
Q learning, 220, 228, 228–235 Q table, 220, 220, 227, 232 symmetry, 235
radial basis function, 205, 204–207 normalised, 205
rate
firing, 10
learning, see learning rate
of convergence, see convergence, rate receptive field, 151, 151, 152, 164 rectified linear unit, 11, 133–134 recurrent backpropagation, 170, 168–171,
184
recurrent network, 1, 14, 166, 166, 181
bidirectional, 181 redundancy, 188 regression
logistic, binary, 139 multinomial, 140 multivariate, 140
regularisation, 140, 140–141, 161
batch normalisation, see batch nor-
malisation
drop out, see drop out
expansion of training set, see training
set, expansion of
L1 regularisation, 134, 141, 209 L2 regularisation, 141, 155, 209
max norm regularisation, 141, 145 pruning, see pruning
weight decay, see weight decay
reinforcement, 218 signal, 218, 219, 224
reinforcement learning, 2, 4, 188, 218 associative task, 218, 230 continuous task, 219, 224
deep, 234
episodic task, 219, 224 non associative task, 218 sequential, 227
relaxation, exponential, 230 ReLU, see rectified linear unit ReLU function, 11, 133–134, 161 replica, 49
symmetry, 49
symmetry, breaking of, 49 trick, 49, 52
representation
high dimensional, 182 nonlinear, 182
sparse, 133, 182
reservoir
chain of, 183
multiple reservoirs, 183
sparse, 183
reservoir computing, 181, 181–184, 186 residual network, 137, 133–176
response curve, 10, 11
restricted Boltzmann machine, 54, 66, 66–71
energy function, 66
restriction enzyme, 60, 61
fragment, 60, 61 fragment length, 60, 61 fragment set, 73
map, 60
site, 60
reward, 218
discounted, 225
distribution, 218, 221, 222, 230 expected, 219
function, 218, 234
future, 219, 224
immediate, 219, 221–223, 230 maximal, 221
probability, 222, 222, 230 stochastic, 221
robotics, 234 rule
associative reward penalty, 223 deterministic, 36
Hebb’s, see Hebb’s rule
learning, see learning rule McCulloch Pitts, see McCulloch Pitts
neuron
Oja’s, see Oja’s rule udpate, see update rule
scalar product, 20, 79
self averaging, 37 sequence, memoryless, 56 shifter ensemble, 73 shuffle inputs, 105 similarity, 188
simulated annealing, 60
singular value, 183, 190
slowing down, 41
softmax, 137, 171
solution
degenerate, 61 stable, 40 unstable, 40
sparse, 182, 209
speech recognition, 171 spike, 10
spike train, 1
spin glass, 14, 35
spurious minimum, 35, 36
spurious state, 30
stability, linear, 170, 183, 191, 192, 213 state
active, 7
correlations, 58
inactive, 7
mixed, 30, 30, 41, 42, 52 neuron, see neuron, state space, 72, 232
spin glass, 30
state action pair, 227, 230 steady, see steady state value, 102
vector, 16, 54, 227
steady state, 31, 36, 38, 41–43, 51, 56, 57, 76, 167, 168, 170, 190, 191, 198, 201, 213, 231, 233
stimulus, 196, 218, 221, 222 cognitive, 196
sensory, 196
visual, 196
stochastic backpropagation, 211 stochastic gradient descent, 103, 108, 171 stochastic path, 103
stochastic policy, 220
storage capacity, 26, 42, 47
critical, 48, 49, 51, 52
strategy, 220, 233
stride, 152, 153
superposition, 30, 52
supervised learning, 1, 54, 76, 110, 166,
188, 212, 218 symmetry, 40, 49, 235
point group, 94
replica, see replica, symmetry system, dynamical, see dynamical sys-
tem
target, 1, 71, 76, 78, 79, 81, 85, 88, 90, 91,
93, 111–113, 117, 137, 139, 140, 166–168, 171, 181, 183, 185, 188, 230
function, 121, 122 random, 87 vector, 78, 100
temperature, 57, 59–62
tensor, 152
test data set, 155
test set, 154, 157, 165
threshold, 8, 8, 11, 17, 29, 32, 38, 49, 55,
63, 67, 69, 71, 73, 76–78, 80, 80– 82, 86, 87, 91, 92, 102, 105, 124, 129, 138, 139, 151–153, 166, 174, 177, 186, 204, 205, 208, 210, 214
increments, 102
initial, 106, 131, 144
threshold unit, see binary threshold unit tic tac toe, 220, 232
time average, 37–39
time constant, 166
time correlation, 174, 178, 183
time series, 6, 7, 9, 181, 185
prediction of, 183, 226
time series prediction, 186
top 5 error, 159
training, see also learning, 54, 66, 72, 73,
92, 94, 108, 111, 129, 142, 144–
146, 155, 178, 183, 218, 228, 234 algorithm, 62, 65, 76, 100, 105, 145,
180, 182
autoencoder, 216
batch, see batch training
deep networks, 71, 114, 133, 145, 146 distribution, 163
early stages of, 128
energy, 111, 155
epoch, 129
error, 112, 144
instability, 127, 130
recurrent networks, 174, 176, 184 sequential, 103, 108, 117
set, see training set
slow down, 129, 134, 140, 147 speed up, 134
training set, 1, 76, 78, 92, 93, 105, 107, 108, 110, 111, 111, 122, 153, 155, 160, 162, 167, 171, 201, 212
expansion of, 140, 145, 145–146 image net, 159
MNIST, 156
size of, 162
transient, 37, 65, 181
transition probability, 56, 56, 57, 72 translation invariance, 150, 153, 162
unbiased estimator, 223 undirected bipartite graph, 66 unfolding in time, 171, 172, 180 unit
gated recurrent, 176, 177
input, see input terminal
linear, 85, 85, 86, 88, 95, 121, 181, 184,
189, 193, 194, 205, 216, 225 pooling, see pooling layer
rectified linear, see rectified linear unit ReLU, see ReLU function
softmax, see softmax
threshold, see binary threshold unit winning, see winning neuron
XOR, see XOR function
universal approximation theorem, 123 unstable gradient, 174
unsupervised learning, 1, 188, 189–216,
218
update rule, 11, 29, 32, 68, 166, 177, 182
asynchronous, 11, 17, 35, 68 continuous, 166 deterministic, 35
discrete, 171
nested, 183
stochastic, 35, 36, 221 synchronous, 11, 16, 19, 29 typewriter scheme, 11
validation
cross, see cross validation
energy, 111, 112, 155
error, 112
set, 92, 94, 111, 111, 112, 155, 156,
159, 162
vanishing gradient, 86, 127, 129, 127–134,
146, 147, 161, 167, 174, 176–178 variable, latent, see latent variable vector
column, 15, 19 notation, 15
row, 19
target, see target, vector transpose, 19
weight, 8, 14, 17, 71, 76, 78, 82, 86, 91, 105, 210
asymmetric weights, 32, 86
decay, 140–142, 144, 147
diagonal, 22, 24, 28, 29, 33, 49, 54, 55,
60, 62–64
elimination, 141, 144
increment, see weight increment initial, 106, 131, 144
matrix, 19, 21, 63, 67, 82, 103, 105,
142, 183
matrix, transpose, 105
off diagonal, 64
symmetric weights, 22, 28–30, 33, 54,
55, 60, 64 vector, 81
weight increment, 63, 65, 65, 85, 100–103, 105, 139, 148, 191, 223, 224
white noise, 162
winning neuron, 70, 124, 137, 193, 193, 194, 197, 206
XOR function, 33, 66, 70, 73, 82, 82, 88, 89, 91, 93, 105, 124, 126, 127, 140, 144
not linearly separable, 82 XOR problem, see XOR function
Congratulations, you have reached the end of this book. This completes the first episode, for which you receive a reward of +1. Please multiply everything you have learned by α = 0.01 and add it to your knowledge. Then turn back to the first page to begin a new episode.
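The closing note plays on the incremental updates used in the reinforcement-learning chapters: each episode nudges the current estimate towards its target by a small step proportional to the learning rate. A light-hearted sketch of that arithmetic (not part of the book; the variable names are purely illustrative):

```python
# Playful illustration of the closing note: blend what each reading ("episode")
# teaches you into your current knowledge, weighted by the learning rate alpha.
alpha = 0.01        # learning rate suggested in the note
knowledge = 0.0     # estimate before the first reading

for episode in range(1, 4):
    reward = 1.0                                # +1 for reaching the end of the book
    knowledge += alpha * (reward - knowledge)   # small incremental step towards the target
    print(f"episode {episode}: knowledge = {knowledge:.4f}")
```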