Neural Networks:
What can a network represent
Deep Learning, Fall 2021
1
•
Recap : Neural networks have taken over AI
Tasks that are made possible by NNs, aka deep learning
– Tasks that were once assumed to be purely in the human domain of expertise
2
So what are neural networks?
[Figure: “N.Net” boxes mapping Voice signal → Transcription, Image → Text caption, Game State → Next move]
• What are these boxes?
– Functions that take an input and produce an output
– What’s in these functions?
3
The human perspective
[Figure: the same Voice signal / Image / Game State → Transcription / Text caption / Next move mappings, with the brain in place of the “N.Net” boxes]
• In a human, those functions are computed by the brain…
4
Recap : NNets and the brain
• In their basic form, NNets mimic the networked structure in the brain
5
Recap : The brain
• The brain is composed of networks of neurons
6
Recap : NNets and the brain
• Neural nets are composed of networks of computational models of neurons called perceptrons
7
Recap: the perceptron
[Figure: a perceptron with inputs x1 … xN, weights, and a threshold T]
• A threshold unit
– “Fires” if the weighted sum of inputs exceeds a threshold
– Electrical engineers will call this a threshold gate
• A basic unit of Boolean circuits
8
A better figure
[Figure: the perceptron redrawn with an explicit bias input and a summing unit]
• A threshold unit
– “Fires” if the affine function of inputs is positive
• The bias is the negative of the threshold T in the previous slide
9
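To make the affine-plus-threshold picture concrete, here is a minimal numpy sketch of the unit described above; the particular weights, bias, and inputs are illustrative values, not anything taken from the slides.

```python
import numpy as np

def perceptron(x, w, b):
    """Threshold unit: fires (outputs 1) if the affine function w.x + b is positive.
    The bias b is the negative of the threshold T from the previous slide."""
    return int(np.dot(w, x) + b > 0)

# Example: a unit with threshold T = 1 (so b = -1) over two inputs
print(perceptron(np.array([1, 1]), np.array([1.0, 1.0]), b=-1.0))  # 1: weighted sum 2 exceeds 1
print(perceptron(np.array([1, 0]), np.array([1.0, 1.0]), b=-1.0))  # 0: weighted sum 1 does not exceed 1
```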
The “soft” perceptron (logistic)
+
• A “squashing” function instead of a threshold at the output
– The sigmoid “activation” replaces the threshold
• Activation: The function that acts on the weighted combination of inputs (and threshold)
10
Other “activations”
[Figure: a perceptron computing 𝑧 = Σᵢ 𝑤ᵢ𝑥ᵢ + 𝑏, followed by an activation 𝑦 = 𝑓(𝑧)]
sigmoid: 𝑦 = 1 / (1 + exp(−𝑧))
tanh: 𝑦 = tanh(𝑧)
softplus: 𝑦 = log(1 + 𝑒^𝑧)
• Does not always have to be a squashing function
– We will hear more about activations later
• We will continue to assume a “threshold” activation in this lecture
11
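The activations listed above, written out in numpy for reference (tanh is available directly as np.tanh); the sample inputs are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values to (0, 1)

def softplus(z):
    return np.log(1.0 + np.exp(z))    # smooth, non-squashing

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), np.tanh(z), softplus(z))
```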
• Piazza @93
Poll 1
12
Poll 1
• Mark all true statements
– 3x + 7y is a linear combination of x and y
– 3x + 7y + 4 is a linear combination of x and y
– 3x + 7y is an affine function of x and y
– 3x + 7y + 4 is an affine function of x and y
13
The multi-layer perceptron
• A network of perceptrons
– Perceptrons “feed” other perceptrons
– We give you the “formal” definition of a layer later
14
Defining “depth”
• What is a “deep” network?
15
Deep Structures
• In any directed graph with input source nodes and output sink nodes, “depth” is the length of the longest path from a source to a sink
– A “source” node in a directed graph is a node that has only outgoing edges
– A “sink” node is a node that has only incoming edges
• Left: Depth = 2. Right: Depth = 3
16
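A small sketch of the depth definition above — the length of the longest path from a source to a sink in a directed acyclic graph. The toy adjacency list is made up for illustration.

```python
# Depth of a directed acyclic graph = length (in edges) of the longest source-to-sink path.
def depth(graph):
    """graph: dict mapping node -> list of successor nodes (must be acyclic)."""
    memo = {}
    def longest_from(node):
        if node not in memo:
            succ = graph.get(node, [])
            memo[node] = 0 if not succ else 1 + max(longest_from(s) for s in succ)
        return memo[node]
    return max(longest_from(n) for n in graph)

# Two sources (a, b), one sink (d): the longest path a -> c -> d has length 2
print(depth({"a": ["c"], "b": ["c", "d"], "c": ["d"], "d": []}))  # 2
```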
Deep Structures
• Layered deep structure
– The input is the “source”
– The output nodes are “sinks”
[Figure legend: Input: Black, Layer 1: Red, Layer 2: Green, Layer 3: Yellow, Layer 4: Blue]
• “Deep” ⇒ Depth greater than 2
• “Depth” of a layer – the depth of the neurons in the layer w.r.t. input
17
The multi-layer perceptron
• Inputs are real or Boolean stimuli
• Outputs are real or Boolean values
– Can have multiple outputs for a single input
• What can this network compute?
– What kinds of input/output relationships can it model?
18
MLPs approximate functions
[Figure: an MLP over Boolean inputs X, Y, Z, A, and a plot of a real-valued function h(x)]
• MLPs can compose Boolean functions
• MLPs can compose real-valued functions
• What are the limitations?
19
Today
• Multi-layer Perceptrons as universal Boolean functions
– The need for depth
• MLPs as universal classifiers
– The need for depth
• MLPs as universal approximators
• A discussion of optimal depth and width
• Brief segue: RBF networks
20
Today
• Multi-layer Perceptrons as universal Boolean functions
– The need for depth
• MLPs as universal classifiers
– The need for depth
• MLPs as universal approximators
• A discussion of optimal depth and width
• Brief segue: RBF networks
21
The MLP as a Boolean function
• How well do MLPs model Boolean functions?
22
The perceptron as a Boolean gate
[Figure: perceptrons implementing simple binary Boolean gates (e.g., AND, OR, NOT); values in the circles are thresholds, values on edges are weights]
• A perceptron can model any simple binary Boolean gate
23
Perceptron as a Boolean gate
[Figure: a perceptron with weights +1 on X1 .. XL, −1 on XL+1 .. XN, and threshold L]
Will fire only if X1 .. XL are all 1 and XL+1 .. XN are all 0
• The universal AND gate
– AND any number of inputs
• Any subset of which may be negated
24
Perceptron as a Boolean gate
[Figure: the same weights (+1 on X1 .. XL, −1 on XL+1 .. XN) with threshold L − N + 1]
Will fire only if any of X1 .. XL are 1 or any of XL+1 .. XN are 0
• The universal OR gate
– OR any number of inputs
• Any subset of which may be negated
25
Perceptron as a Boolean Gate
[Figure: a perceptron with all weights +1 and threshold K]
Will fire only if at least K inputs are 1
• Generalized majority gate
– Fire if at least K inputs are of the desired polarity
26
Perceptron as a Boolean Gate
[Figure: weights +1 on X1 .. XL, −1 on XL+1 .. XN, threshold L − N + K]
Will fire only if the total number of X1 .. XL that are 1 and XL+1 .. XN that are 0 is at least K
• Generalized majority gate
– Fire if at least K inputs are of the desired polarity
27
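A hedged sketch of the gate constructions on the last few slides, using the “weighted sum ≥ threshold” convention from the gate figures; the particular input vector and the choices L = 2, N = 4, K = 3 are illustrative.

```python
import numpy as np

def threshold_gate(x, w, T):
    """Fires if the weighted sum of the inputs meets the threshold T."""
    return int(np.dot(w, x) >= T)

x = np.array([1, 1, 0, 0])          # X1, X2 = 1 and X3, X4 = 0

# AND of X1, X2 and the negations of X3, X4: weights +1/-1, threshold L (= 2 here)
print(threshold_gate(x, np.array([1, 1, -1, -1]), T=2))      # 1

# OR of X1, X2 and the negations of X3, X4: threshold L - N + 1 = 2 - 4 + 1 = -1
print(threshold_gate(x, np.array([1, 1, -1, -1]), T=-1))     # 1

# Majority-style gate: fires if at least K = 3 of the 4 inputs are 1
print(threshold_gate(x, np.array([1, 1, 1, 1]), T=3))        # 0: only 2 inputs are 1
```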
The perceptron is not enough
[Figure: a single perceptron on X and Y with unknown weights and threshold, marked “?”]
• Cannot compute an XOR
28
Multi-layer perceptron
[Figure: a two-layer network on X and Y; a hidden layer of two threshold units feeding an output unit]
• MLPs can compute the XOR
29
Multi-layer perceptron XOR
[Figure: XOR with a single hidden unit h = (X + Y ≥ 1.5) and output (X + Y − 2h ≥ 0.5)]
• With 2 neurons
– 5 weights and two thresholds
Thanks to Gerald Friedland
30
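A quick numpy check of the two XOR constructions above. The weights for the 3-perceptron version are one standard choice (the slide’s exact values may differ); the 2-neuron version uses the 1, 1, 1.5 / 1, 1, −2, 0.5 values noted above. Treat this as a sketch, not the canonical implementation.

```python
import numpy as np

def fires(w, x, T):
    return int(np.dot(w, x) >= T)

def xor_3unit(x, y):
    """3-perceptron XOR: hidden units "X and not Y", "Y and not X", OR-ed by the output."""
    h1 = fires([1, -1], [x, y], 1)
    h2 = fires([-1, 1], [x, y], 1)
    return fires([1, 1], [h1, h2], 1)

def xor_2unit(x, y):
    """2-neuron XOR (5 weights, 2 thresholds): the output unit also sees x and y directly."""
    h = fires([1, 1], [x, y], 1.5)
    return fires([1, 1, -2], [x, y, h], 0.5)

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert xor_3unit(x, y) == xor_2unit(x, y) == (x ^ y)
```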
Multi-layer perceptron
[Figure: a larger MLP over Boolean inputs X, Y, Z, A]
• MLPs can compute more complex Boolean functions
• MLPs can compute any Boolean function
– Since they can emulate individual gates
• MLPs are universal Boolean functions
31
MLP as Boolean Functions
[Figure: the same MLP over X, Y, Z, A]
• MLPs are universal Boolean functions
– Any function over any number of inputs and any number of outputs
• But how many “layers” will they need?
32
How many layers for a Boolean MLP?
Truth Table
Truth table shows all input combinations for which output is 1

X1 X2 X3 X4 X5 | Y
 0  0  1  1  0 | 1
 0  1  0  1  1 | 1
 0  1  1  0  0 | 1
 1  0  0  0  1 | 1
 1  0  1  1  1 | 1
 1  1  0  0  1 | 1

• A Boolean function is just a truth table
33
How many layers for a Boolean MLP?
Truth Table
(Same truth table as the previous slide)
[Figure: a one-hidden-layer network over inputs X1 .. X5, one hidden unit per DNF term]
• Expressed in disjunctive normal form:
Y = X̄1X̄2X3X4X̄5 + X̄1X2X̄3X4X5 + X̄1X2X3X̄4X̄5 + X1X̄2X̄3X̄4X5 + X1X̄2X3X4X5 + X1X2X̄3X̄4X5
34
How many layers for a Boolean MLP?
(Same truth table and DNF network as above)
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
42
How many layers for a Boolean MLP?
(Same truth table and DNF network as above)
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function?
43
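A sketch of the one-hidden-layer DNF construction: one hidden AND-unit per truth-table row with output 1 (the universal AND gate from slide 24), and an output unit that ORs the hidden units. The rows listed in the code are the ones from the truth table above.

```python
import numpy as np

rows_with_output_1 = [
    (0, 0, 1, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0),
    (1, 0, 0, 0, 1), (1, 0, 1, 1, 1), (1, 1, 0, 0, 1),
]

def dnf_mlp(x):
    hidden = []
    for row in rows_with_output_1:
        w = np.array([1 if b else -1 for b in row])   # +1 for literals, -1 for negated literals
        T = sum(row)                                   # fires only on an exact match with this row
        hidden.append(int(np.dot(w, x) >= T))
    return int(sum(hidden) >= 1)                       # the output unit ORs the hidden units

print(dnf_mlp(np.array([0, 1, 0, 1, 1])))  # 1: matches a row of the table
print(dnf_mlp(np.array([1, 1, 1, 1, 1])))  # 0: not in the table
```

In the worst case this construction needs one hidden unit for every input pattern with output 1, which is what drives the exponential width discussed next.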
Reducing a Boolean Function
[Figure: a 4-variable Karnaugh map over W, X, Y, Z, with axes labeled WX and YZ (00, 01, 11, 10)]
This is a “Karnaugh Map”
It represents a truth table as a grid
Filled boxes represent input combinations for which output is 1; blank boxes have output 0
Adjacent boxes can be “grouped” to reduce the complexity of the DNF formula for the table
• DNF form:
– Find groups
– Express as reduced DNF
44
Reducing a Boolean Function
[Karnaugh map]
Basic DNF formula will require 7 terms
45
Reducing a Boolean Function
[Karnaugh map with groups marked]
• Reduced DNF form:
– Find groups
– Express as reduced DNF
46
Reducing a Boolean Function
[Karnaugh map with groups marked; a one-hidden-layer network over W, X, Y, Z with 3 hidden units]
• Reduced DNF form:
– Find groups
– Express as reduced DNF
– Boolean network for this function needs only 3 hidden units
• Reduction of the DNF reduces the size of the one-hidden-layer network
47
Largest irreducible DNF?
[Karnaugh map]
• What arrangement of ones and zeros simply cannot be reduced further?
48
Largest irreducible DNF?
[Karnaugh map in a checkerboard pattern]
Red=0, white=1
• What arrangement of ones and zeros simply cannot be reduced further?
49
Largest irreducible DNF?
[Checkerboard Karnaugh map]
• What arrangement of ones and zeros simply cannot be reduced further?
50
Width of a one-hidden-layer Boolean MLP
[Figure: a 6-variable checkerboard Karnaugh map over U, V, W, X, Y, Z; red = 0, white = 1]
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function of 6 variables?
51
Width of a one-hidden-layer Boolean MLP
[6-variable checkerboard Karnaugh map]
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function?
Can be generalized: Will require 2^(N−1) perceptrons in the hidden layer
Exponential in N
52
Poll 2
• Piazza thread @94
53
Poll 2
How many neurons will be required in the hidden layer of a one-hidden-layer network that models a Boolean function over 10 inputs, where the output for two input bit patterns that differ in only one bit is always different? (I.e. the checkerboard Karnaugh map)
– 20
– 256
– 512
– 1024
54
Width of a one-hidden-layer Boolean MLP
[6-variable checkerboard Karnaugh map]
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function?
Can be generalized: Will require 2^(N−1) perceptrons in the hidden layer
Exponential in N
How many units if we use multiple hidden layers?
55
Size of a deep MLP
[Figure: the 4-variable and 6-variable checkerboard Karnaugh maps]
56
Multi-layer perceptron XOR
[Figure: the 3-perceptron XOR network from earlier]
• An XOR takes three perceptrons
57
Size of a deep MLP
[Figure: W⊕X⊕Y⊕Z computed as a tree of 3 XORs over inputs W, X, Y, Z — 9 perceptrons]
• An XOR needs 3 perceptrons
• This network will require 3 × 3 = 9 perceptrons
58
Size of a deep MLP
[Figure: U⊕V⊕W⊕X⊕Y⊕Z computed as a tree of 5 XORs over inputs U, V, W, X, Y, Z — 15 perceptrons]
• An XOR needs 3 perceptrons
• This network will require 3 × 5 = 15 perceptrons
59
Size of a deep MLP
[Same figure as the previous slide]
• An XOR needs 3 perceptrons
• This network will require 3 × 5 = 15 perceptrons
60
More generally, the XOR of N variables will require 3(N-1) perceptrons!!
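A sketch of this pairing construction: the parity of N bits computed by a tree of 3-perceptron XORs, i.e. 3(N−1) perceptrons arranged in roughly 2·log2(N) layers, instead of the 2^(N−1) hidden units of the one-hidden-layer DNF network. Written as plain Python for clarity.

```python
def xor3(x, y):
    """The 3-perceptron XOR from slide 29, written out as threshold units."""
    h1 = int(x - y >= 1)
    h2 = int(y - x >= 1)
    return int(h1 + h2 >= 1)

def parity_tree(bits):
    layer = list(bits)
    while len(layer) > 1:                       # each pass is one "XOR level" of the tree
        nxt = [xor3(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                      # an odd leftover element passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

print(parity_tree([1, 0, 1, 1, 1]))  # 0: an even number (4) of ones
```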
One-hidden-layer vs deep Boolean MLP
[6-variable checkerboard Karnaugh map]
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function?
Single hidden layer: Will require 2^(N−1) + 1 perceptrons in all (including output unit)
Exponential in N
Will require 3(N−1) perceptrons in a deep network
Linear in N!!!
Can be arranged in only 2 log2(N) layers
A better representation
[Figure: a binary tree of XOR subnetworks over inputs 𝑋₁ … 𝑋ₙ]
• Only 2 log2(N) layers
– By pairing terms
– 2 layers per XOR
63
The challenge of depth
[Figure: the XOR tree over 𝑋₁ … 𝑋ₙ truncated after K hidden layers, with intermediate values 𝑍ᵢ]
• Using only K hidden layers will require O(2^(CN)) neurons in the Kth layer, where the constant C depends on K
– Because the output is the XOR of all the (roughly N·2^(−K/2)) values output by the (K−1)th hidden layer
– I.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function
64
The actual number of parameters in a network
[Figure: a network over inputs X1 .. X5]
• The actual number of parameters in a network is the number of connections
– In this example there are 30
• This is the number that really matters in software or hardware
implementations
• Networks that require an exponential number of neurons will require an exponential number of weights..
65
Recap: The need for depth
• Deep Boolean MLPs that scale linearly with the number of inputs …
• … can become exponentially large if recast using only one hidden layer
66
The need for depth
[Figure: a network over X1 .. X5 with intermediate variables a .. f; the XORs could occur anywhere!]
• An MLP for any function that can eventually be expressed as the XOR of a number of intermediate variables will require depth.
– The XOR structure could occur in any layer
– If you have a fixed depth from that point on, the network can grow exponentially in size.
• Having a few extra layers can greatly reduce network size
67
Depth vs Size in Boolean Circuits
• The XOR is really a parity problem
• Any Boolean parity circuit of bounded depth using AND, OR and NOT gates with unbounded fan-in must have super-polynomial size
– Parity, Circuits, and the Polynomial-Time Hierarchy, M. Furst, J. B. Saxe, and M. Sipser, Mathematical Systems Theory 1984
– Alternately stated:
• Parity is not in the set of constant-depth, polynomial-size circuits of unbounded fan-in elements
68
Caveat 1: Not all Boolean functions..
• Not all Boolean circuits have such clear depth-vs-size tradeoff
• Shannon’s theorem: For n > 2, there is a Boolean function of n variables that requires at least 2^n / n Boolean gates
– More correctly, for large n, almost all n-input Boolean functions need more than 2^n / n Boolean gates
• Regardless of depth
• Note: If all Boolean functions over n inputs could be computed using a circuit of size that is polynomial in n, P = NP!
69
Network size: summary
• An MLP is a universal Boolean function
• But can represent a given function only if
– It is sufficiently wide
– It is sufficiently deep
– Depth can be traded off for (sometimes) exponential growth of the width of the network
• Optimal width and depth depend on the number of variables and the complexity of the Boolean function
– Complexity: minimal number of terms in DNF formula to represent it
70
Story so far
• Multi-layer perceptrons are Universal Boolean Machines
• Even a network with a single hidden layer is a universal
Boolean machine
– But a single-layer network may require an exponentially large number of perceptrons
• Deeper networks may require far fewer neurons than shallower networks to express the same function
– Could be exponentially smaller
71
Caveat 2
• Used a simple “Boolean circuit” analogy for explanation
• We actually have a threshold circuit (TC), not just a Boolean circuit (AC)
– Specifically composed of threshold gates
• More versatile than Boolean gates (can compute majority function)
– E.g. “at least K inputs are 1” is a single TC gate, but an exponential size AC
– For fixed depth, 𝐵𝑜𝑜𝑙𝑒𝑎𝑛 𝑐𝑖𝑟𝑐𝑢𝑖𝑡𝑠 ⊂ 𝑡h𝑟𝑒𝑠h𝑜𝑙𝑑 𝑐𝑖𝑟𝑐𝑢𝑖𝑡𝑠 (strict subset)
– A depth-2 TC parity circuit can be composed with 𝒪(𝑛²) weights
• But a network of depth log(𝑛) requires only 𝒪(𝑛) weights
– But more generally, for large 𝑛, for most Boolean functions, a threshold circuit that is polynomial in 𝑛 at optimal depth may become exponentially large at lesser depths
• Other formal analyses typically view neural networks as arithmetic circuits
– Circuits which compute polynomials over any field
• So let’s consider functions over the field of reals
72
Today
• Multi-layer Perceptrons as universal Boolean functions
– The need for depth
• MLPs as universal approximators
• A discussion of optimal depth and width
• Brief segue: RBF networks
• MLPs as universal classifiers
– The need for depth
73
Recap: The MLP as a classifier
[Figure: an MLP taking a 784-dimensional MNIST digit image as input, and a decision boundary over the 784-dimensional space]
• MLP as a function over real inputs
• MLP as a function that finds a complex “decision boundary” over a space of reals
74
A Perceptron on Reals
[Figure: a perceptron over real inputs x1 .. xN, and the decision boundary w1x1 + w2x2 = T in the (x1, x2) plane]
• A perceptron operates on real-valued vectors
– This is a linear classifier
75
Boolean functions with a real perceptron
[Figure: Boolean gates drawn as linear decision boundaries over the unit square with corners (0,0), (0,1), (1,0), (1,1)]
• Boolean perceptrons are also linear classifiers
– Purple regions are 1
76
Poll 3
• Piazza thread @95
77
•
Poll 3
An XOR network needs two hidden neurons and one output neuron, because we need one hidden neuron for each of the two boundaries of the XOR region, and an output neuron to AND them. True or false?
– True – False
78
Composing complicated “decision” boundaries
x2
x1
Can now be composed into “networks” to compute arbitrary classification “boundaries”
• Build a network of units with a single output that fires if the input is in the coloured area
79
Booleans over the reals
[Figure: a coloured region in the (x1, x2) plane, and a network over inputs x1, x2]
• The network must fire if the input is in the coloured area
80
Booleans over the reals
[Figure: five half-plane units y1 .. y5 defining the pentagon, with the sum of the unit outputs (5 inside; 4, 3, 2 in the regions outside) feeding an AND]
• The network must fire if the input is in the coloured area
– The AND compares the sum of the hidden outputs to 5
• NB: What would the pattern be if it compared it to 4?
85
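A sketch of the pentagon net above: five half-plane units whose outputs are summed and compared to 5 by the output unit. The pentagon’s center, radius, and orientation are made-up values for illustration.

```python
import numpy as np

angles = 2 * np.pi * np.arange(5) / 5
normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # outward normals of the 5 sides
radius = 1.0

def pentagon_net(point, K=5):
    hidden = (normals @ point <= radius).astype(int)   # y_i = 1 on the inner side of side i
    return int(hidden.sum() >= K)                       # the AND: compare the sum to 5

print(pentagon_net(np.array([0.0, 0.0])))    # 1: the center is inside
print(pentagon_net(np.array([2.0, 2.0])))    # 0: well outside the pentagon
```

Lowering K to 4 (the NB above) would also fire in the regions just outside each edge, where exactly one constraint is violated.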
More complex decision boundaries
OR
AND
AND
x2
x1
x1
x2
• Network to fire if the input is in the yellow area
– “OR” two polygons
– A third layer is required
86
Complex decision boundaries
• Can compose arbitrarily complex decision boundaries
87
Complex decision boundaries
OR
AND
x1 x2
• Can compose arbitrarily complex decision boundaries
88
Complex decision boundaries
OR
AND
x1 x2
• Can compose arbitrarily complex decision boundaries
– With only one hidden layer!
– How?
89
Exercise: compose this with one hidden layer
x2
x1
x1 x2
• How would you compose the decision boundary to the left with only one hidden layer?
90
Composing a Square decision boundary
[Figure: four half-plane units y1 .. y4 around a square; region sums are 4 inside, smaller outside]
• The polygon net
Σᵢ yᵢ ≥ 4?
91
Composing a pentagon
[Figure: five half-plane units y1 .. y5 around a pentagon; region sums are 5 inside, down to 2 far outside]
• The polygon net
Σᵢ yᵢ ≥ 5?
92
Composing a hexagon
[Figure: six half-plane units y1 .. y6 around a hexagon; region sums are 6 inside, down to 3 far outside]
• The polygon net
Σᵢ yᵢ ≥ 6?
93
How about a heptagon
• What are the sums in the different regions?
– A pattern emerges as we consider N > 6..
• N is the number of sides of the polygon
94
16 sides
• What are the sums in the different regions?
– A pattern emerges as we consider N > 6..
95
64 sides
• What are the sums in the different regions?
– A pattern emerges as we consider N > 6..
96
1000 sides
• What are the sums in the different regions?
– A pattern emerges as we consider N > 6..
97
Polygon net
[Figure: the polygon net with N half-plane units y1 .. yN over (x1, x2)]
Σᵢ yᵢ ≥ N?
• Increasing the number of sides reduces the area outside the polygon where the sum of the hidden outputs remains close to N
98
In the limit
[Figure: the value of the sum Σᵢ yᵢ as a function of the distance of 𝐱 from the center, as N increases: N inside, N/2 outside]
• For small radius, it’s a near perfect cylinder
– N in the cylinder, N/2 outside
• Value of the sum at the output unit, as a function of distance from center, as N increases
99
Composing a circle
[Figure: the circle approximated by a very large number of half-plane units; output unit computes Σᵢ yᵢ ≥ N?]
• The circle net
– Very large number of neurons
– Sum is N inside the circle, N/2 outside almost everywhere
– Circle can be at any location
100
Composing a circle
[Figure: the circle net with an extra bias of −N/2 at the output: Σ_{i=1..N} y_i − N/2 ≥ 0?; the value is N/2 inside the circle and 0 outside]
• The circle net
– Very large number of neurons
– Sum is N/2 inside the circle, 0 outside almost everywhere – Circle can be at any location
101
Adding circles
[Figure: two circle subnets feeding one output unit: Σ_{i=1..2N} y_i − N/2 ≥ 0?]
• The “sum” of two circles sub nets is exactly N/2 inside either circle, and 0 almost everywhere outside
102
Composing an arbitrary figure
[Figure: K circle subnets: Σ_{i=1..KN} y_i − N/2 ≥ 0?]
• Just fit in an arbitrary number of circles
– More accurate approximation with greater number of smaller circles
– Can achieve arbitrary precision
103
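A sketch of the “adding circles” idea: each circle is approximated by N half-plane units (roughly N fire inside it, roughly N/2 far outside), each subnet is shifted down by N/2, and the output fires when the total is large. The circle centers, radii, and the exact firing threshold are illustrative; the approximation sharpens as N grows and the circles shrink.

```python
import numpy as np

N = 1000
angles = 2 * np.pi * np.arange(N) / N
normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def circle_subnet(point, center, r):
    """Sum of the N half-plane units for one circle (~N inside, ~N/2 far outside)."""
    return np.sum(normals @ (point - center) <= r)

def figure_net(point, circles):
    total = sum(circle_subnet(point, c, r) - N / 2 for c, r in circles)
    return int(total >= N / 2)          # fires if the point falls (roughly) inside any circle

circles = [(np.array([0.0, 0.0]), 1.0), (np.array([3.0, 0.0]), 0.5)]
print(figure_net(np.array([0.2, 0.1]), circles))   # 1: inside the first circle
print(figure_net(np.array([1.8, 1.8]), circles))   # 0: outside both
```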
MLP: Universal classifier
[Figure: the circle-composition net: Σ_{i=1..KN} y_i − N/2 ≥ 0?]
• MLPs can capture any classification boundary
• A one-hidden-layer MLP can model any classification boundary
• MLPs are universal classifiers
104
Depth and the universal classifier
x2
x1
x1 x2
• Deeper networks can require far fewer neurons
105
Optimal depth..
• Formal analyses typically view these as category of arithmetic circuits
– Compute polynomials over any field
• Valiant et al.: A polynomial of degree n requires a network of depth log(n)
– Cannot be computed with shallower networks
– The majority of functions are very high (possibly ∞) order polynomials
• Bengio et al.: Shows a similar result for sum-product networks
– But only considers two-input units
– Generalized by Mhaskar et al. to all functions that can be expressed as a binary tree
– Depth/Size analyses of arithmetic circuits still a research problem
106
Special case: Sum-product nets
• “Shallow vs deep sum-product networks,” Olivier Delalleau and Yoshua Bengio
– For networks where layers alternately perform either sums or products, a shallow network may require exponentially more units than a deep one
107
Depth in sum-product networks
108
Optimal depth in generic nets
• We look at a different pattern:
– “worst case” decision boundaries
• For threshold-activation networks
– Generalizes to other nets
109
Optimal depth
[Figure: the one-hidden-layer circle net, Σ_{i=1..KN} y_i − N/2 > 0?, applied to the worst-case pattern]
• A naïve one-hidden-layer neural network will require infinite hidden neurons
110
Optimal depth
• Two-hidden-layer network: 56 hidden neurons
111
Optimal depth
• Two-hidden-layer network: 56 hidden neurons
– 16 neurons in hidden layer 1
112
Optimal depth
• Two-hidden-layer network: 56 hidden neurons
– 16 in hidden layer 1
– 40 in hidden layer 2
– 57 total neurons, including output neuron
113
Optimal depth
• But this is just an XOR of the 16 first-layer outputs
114
Optimal depth
• But this is just an XOR of the 16 first-layer outputs
– The XOR net will require 16 + 15×3 = 61 neurons
• 46 neurons if we use a two-neuron XOR model
115
Optimal depth
• Grid formed from 64 lines
– Network must output 1 for inputs in the yellow regions
116
Actual linear units
• 64 basic linear feature detectors
117
Optimal depth
• Two hidden layers: 608 hidden neurons
– 64 in layer 1
– 544 in layer 2
• 609 total neurons (including output neuron)
118
Optimal depth
• XOR network (12 hidden layers): 253 neurons
– 190 neurons with 2-gate XOR
• The difference in size between the deeper optimal (XOR) net and shallower nets increases with increasing pattern complexity and input dimension
119
Network size?
• In this problem the 2-layer net was quadratic in the number of lines
– O(N²) neurons in the 2nd hidden layer
– Not exponential
– Even though the pattern is an XOR
– Why?
• The data are two-dimensional!
– Only two fully independent features
– The pattern is exponential in the dimension of the input (two)!
• For the general case of N mutually intersecting hyperplanes in D dimensions, we will need a number of weights that grows as 𝒪(N^D), a polynomial of order D (assuming N ≫ D)
– Increasing input dimensions can increase the worst-case size of the shallower network exponentially, but not the XOR net
• The size of the XOR net depends only on the number of first-level linear detectors (𝑁)
120
Depth: Summary
• The number of neurons required in a shallow network is potentially exponential in the dimensionality of the input
– (this is the worst case)
– Alternately, exponential in the number of statistically independent features
121
Story so far
• Multi-layer perceptrons are Universal Boolean Machines
– Even a network with a single hidden layer is a universal Boolean machine
• Multi-layer perceptrons are Universal Classification Functions
– Even a network with a single hidden layer is a universal classifier
• But a single-hidden-layer network may require an exponentially larger number of perceptrons than a deep one
• Deeper networks may require far fewer neurons than shallower networks to express the same function
– Could be exponentially smaller
– Deeper networks are more expressive
122
Today
• Multi-layer Perceptrons as universal Boolean functions
– The need for depth
• MLPs as universal classifiers
– The need for depth
• MLPs as universal approximators
• A discussion of optimal depth and width
• Brief segue: RBF networks
123
MLP as a continuous-valued regression
[Figure: two threshold units on x with thresholds T1 and T2 and weights +1 and −1, feeding a summing unit; the result f(x) is a square pulse between T1 and T2]
• A simple 3-unit MLP with a “summing” output unit can generate a “square pulse” over an input
– Output is 1 only if the input lies between T1 and T2
– T1 and T2 can be arbitrarily specified
124
MLP as a continuous-valued regression
[Figure: many square-pulse subnets, each scaled by a height hᵢ and summed, approximating a target function f(x)]
• A simple 3-unit MLP can generate a “square pulse” over an input
• An MLP with many units can model an arbitrary function over an input
– To arbitrary precision
• Simply make the individual pulses narrower
• A one-hidden-layer MLP can model an arbitrary function of a single input
125
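A sketch of this construction: unit “square pulses” built from pairs of threshold units, then scaled and summed to approximate a 1-D function. The target function (sin) and the number of pulses are arbitrary choices for illustration.

```python
import numpy as np

def pulse(x, t1, t2):
    """1 for t1 <= x < t2, else 0 -- built from two threshold units with weights +1 and -1."""
    return (x >= t1).astype(float) - (x >= t2).astype(float)

target = np.sin                                # the function to approximate (illustrative choice)
edges = np.linspace(0, 2 * np.pi, 65)          # 64 narrow pulses
centers = 0.5 * (edges[:-1] + edges[1:])
heights = target(centers)                      # scale each pulse by the function's value there

x = np.linspace(0, 2 * np.pi, 1000)
approx = sum(h * pulse(x, a, b) for h, a, b in zip(heights, edges[:-1], edges[1:]))
print(np.max(np.abs(approx - target(x))))      # error shrinks as the pulses get narrower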
For higher dimensions
[Figure: the circle net plus a bias of −N/2, followed by a summing unit: a “cylinder” over the circle]
• An MLP can compose a cylinder
– N/2 in the circle, 0 outside
126
MLP as a continuous-valued function
+
• MLPs can actually compose arbitrary functions in any number of dimensions!
– Even with only one hidden layer
• As sums of scaled and shifted cylinders
– To arbitrary precision
• By making the cylinders thinner
– The MLP is a universal approximator!
127
Poll 4
• Piazza thread @96
128
Poll 4
Any real valued function can be modelled exactly by a one-hidden layer network with infinite neurons in the hidden layer, true or false?
– False – True
Explanation: (it can only be approximated)
129
Caution: MLPs with additive output units are universal approximators
[Figure: the sum-of-cylinders network with a purely additive output unit]
• MLPs can actually compose arbitrary functions
• But the explanation so far only holds if the output unit only performs summation
– i.e. does not have an additional “activation”
130
“Proper” networks: Outputs with activations
[Figure: a network over x1 .. xN whose output neuron has a sigmoid or tanh activation]
• Output neuron may have an actual “activation”
– Threshold, sigmoid, tanh, softplus, rectifier, etc.
• What is the property of such networks?
131
The network as a function
• Output unit with activation function
– Threshold or sigmoid, or any other
• The network is actually a universal map from the entire domain of input values to the entire range of the output activation
– All values the activation function of the output neuron can produce
132
The network as a function
The MLP is a Universal Approximator for the entire class of functions (maps) it represents!
• Output unit with activation function
– Threshold or sigmoid, or any other
• The network is actually a universal map from the entire domain of input values to the entire range of the output activation
– All values the activation function of the output neuron can produce
133
Today
• Multi-layer Perceptrons as universal Boolean functions
– The need for depth
• MLPs as universal classifiers
– The need for depth
• MLPs as universal approximators
• A discussion of sufficient depth and width
• Brief segue: RBF networks
134
The issue of depth
• Previous discussion showed that a single-hidden-layer MLP is a universal function approximator
– Can approximate any function to arbitrary precision
– But may require infinitely many neurons in the layer
• More generally, deeper networks will require far fewer neurons for the same approximation error
– True for Boolean functions, classifiers, and real-valued functions
• But there are limitations…
135
Sufficiency of architecture
[Figure: the 16-line worst-case pattern]
A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly
• A neural network can represent any function provided it has sufficient capacity
– I.e. sufficiently broad and deep to represent the function
• Not all architectures can represent any function
136
Sufficiency of architecture
A network with 16 or more neurons in the first layer is capable of representing the figure perfectly
A network with less than 16 neurons in the first layer cannot represent this pattern exactly
Why? (We will revisit this idea shortly)
With caveats..
• A neural network can represent any function provided it has sufficient capacity
– I.e. sufficiently broad and deep to represent the function
• Not all architectures can represent any function
137
Sufficiency of architecture
A network with 16 or more neurons in the first layer is capable of representing the figure perfectly
A network with less than 16 neurons in the first layer cannot represent this pattern exactly
A 2-layer network with 16 neurons in the first layer cannot represent the pattern with less than 40 neurons in the second layer
With caveats..
• A neural network can represent any function provided it has sufficient capacity
– I.e. sufficiently broad and deep to represent the function
• Not all architectures can represent any function
Why?
141
Sufficiency of architecture
This effect is because we use the threshold activation
It gates (blocks) information in the input from reaching later layers
The pattern of outputs within any colored region is identical
Subsequent layers do not obtain enough information to partition them
143
Sufficiency of architecture
This effect is because we use the threshold activation
It gates (blocks) information in the input from reaching later layers
Continuous activation functions result in graded output at the layer
The gradation provides information to subsequent layers, to capture information “missed” by the lower layer (i.e. it “passes” information to subsequent layers).
Activations with more gradation (e.g. RELU) pass more information
145
Width vs. Activations vs. Depth
• Narrow layers can still pass information to subsequent layers if the activation function is sufficiently graded
• But will require greater depth, to permit later layers to capture patterns
146
Poll 5
• Piazza thread @97
147
Poll 5
Mark all true statements
– A network with an upper bound on layer width (no. of neurons in a layer) can nevertheless model any function by making it sufficiently deep.
– Networks with “graded” activation functions are more able to compensate for insufficient width through depth, than those with threshold or saturating activations.
– We can always compensate for limits in the width and depth of the network by using more graded activations.
– For a given accuracy of modelling a function, networks with more graded activations will generally be smaller than those with less graded (i.e saturating or thresholding) activations.
148
Sufficiency of architecture
• The capacity of a network has various definitions
– Information or Storage capacity: how many patterns can it remember
– VC dimension
• bounded by the square of the number of weights in the network
– From our perspective: largest number of disconnected convex regions it can represent
• A network with insufficient capacity cannot exactly model a function that requires a greater minimal number of convex hulls than the capacity of the network
– But can approximate it with some error
149
The “capacity” of a network
• VC dimension
• A separate lecture
– Koiran and Sontag (1998): For “linear” or threshold units, VC dimension is proportional to the number of weights
• Forunitswithpiecewiselinearactivationitisproportionaltothe square of the number of weights
– Bartlett, Harvey, Liaw, Mehrabian “Nearly-tight VC-dimension bounds for piecewise linear neural networks” (2017):
• For any W and L (under mild conditions), there exists a RELU network with L layers and W weights whose VC dimension is Ω(WL log(W/L))
– Friedland, Krell, “A Capacity Scaling Law for Artificial Neural Networks” (2017):
• The VC dimension of a linear/threshold net is expressed in terms of the overall number of hidden neurons and the weights per neuron
150
Lessons today
• MLPs are universal Boolean functions
• MLPs are universal classifiers
• MLPs are universal function approximators
• A single-hidden-layer MLP can approximate anything to arbitrary precision
– But could be exponentially or even infinitely wide in its input size
• Deeper MLPs can achieve the same precision with far fewer neurons
– Deeper networks are more expressive
– More graded activation functions result in more expressive networks
151
Today
• Multi-layer Perceptrons as universal Boolean functions
– The need for depth
• MLPs as universal classifiers
– The need for depth
• MLPs as universal approximators
• A discussion of optimal depth and width
• Brief segue: RBF networks
152
Perceptrons so far
[Figure: the standard perceptron: a weighted sum of inputs plus a bias, followed by an activation]
• The output of the neuron is a function of a linear combination of the inputs and a bias
153
An alternate type of neural unit:
Radial Basis Functions
[Figure: an RBF unit computing 𝑧 = ‖𝐱 − 𝐰‖², followed by a typical activation 𝑓(𝑧) = exp(−𝛽𝑧)]
• The output is a function of the distance of the input from a “center”
– The “center” 𝐰 is the parameter specifying the unit
– The most common activation is the exponent 𝑓(𝑧) = exp(−𝛽𝑧)
• 𝛽 is a “bandwidth” parameter
– But other similar activations may also be used
• Key aspect is radial symmetry, instead of linear symmetry
154
An alternate type of neural unit:
Radial Basis Functions
[Figure: the RBF unit]
• Radial basis functions can compose cylinder-like outputs with just a single unit with appropriate choice of bandwidth (or activation function)
– As opposed to the very large number of units needed to compose a cylinder from linear perceptrons
155
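A minimal sketch of the RBF unit described above, with the common exponential activation exp(−βz) applied to the squared distance from the center; the center and bandwidth values below are illustrative.

```python
import numpy as np

def rbf_unit(x, center, beta):
    z = np.sum((x - center) ** 2)        # squared distance of the input from the "center"
    return np.exp(-beta * z)             # close to 1 near the center, ~0 far away

center = np.array([1.0, 2.0])
print(rbf_unit(np.array([1.0, 2.0]), center, beta=5.0))   # 1.0 exactly at the center
print(rbf_unit(np.array([3.0, 0.0]), center, beta=5.0))   # ~0: a single unit already carves out a "cylinder"
```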
RBF networks as universal approximators
[Figure: a sum of scaled RBF “cylinders” approximating a function]
• RBF networks are more effective approximators of continuous-valued functions
– A one-hidden-layer net only requires one unit per “cylinder”
156
RBF networks
• More effective than conventional linear perceptron networks in some problems
• We will revisit this topic, time permitting
158
Lessons today
• MLPs are universal Boolean functions
• MLPs are universal classifiers
• MLPs are universal function approximators
• A single-hidden-layer MLP can approximate anything to arbitrary precision
– But could be exponentially or even infinitely wide in its input size
• Deeper MLPs can achieve the same precision with far fewer neurons
– Deeper networks are more expressive
• RBFs are good, now let’s get back to linear perceptrons…
159
Next up
• We know MLPs can emulate any function
• But how do we make them emulate a specific desired function
– E.g. a function that takes an image as input and outputs the labels of all objects in it
– E.g. a function that takes speech input and outputs the labels of all phonemes in it
– Etc…
• Training an MLP
160