
COMP9444
Neural Networks and Deep Learning

1d. Backpropagation

Outline

• Supervised Learning (5.1)
• Ockham’s Razor (5.2)
• Multi-Layer Networks
• Continuous Activation Functions (3.10)
• Gradient Descent (4.3)
• Backpropagation (6.5.2)

Textbook, Sections 3.10, 4.3, 5.1-5.2, 6.5.2
Types of Learning (5.1)

• Supervised Learning
  ◦ agent is presented with examples of inputs and their target outputs
• Reinforcement Learning
  ◦ agent is not presented with target outputs, but is given a reward signal, which it aims to maximize
• Unsupervised Learning
  ◦ agent is only presented with the inputs themselves, and aims to find structure in these inputs

Supervised Learning

• we have a training set and a test set, each consisting of a set of items; for each item, a number of input attributes and a target value are specified
• the aim is to predict the target value, based on the input attributes
• the agent is presented with the input and target output for each item in the training set; it must then predict the output for each item in the test set
• various learning paradigms are available:
  ◦ Neural Network
  ◦ Decision Tree
  ◦ Support Vector Machine, etc.

Supervised Learning – Issues

• framework (decision tree, neural network, SVM, etc.)
• representation (of inputs and outputs)
• pre-processing / post-processing
• training method (perceptron learning, backpropagation, etc.)
• generalization (avoid over-fitting)
• evaluation (separate training and testing sets)

Curve Fitting

Which curve gives the “best fit” to these data?

[Figure: scattered data points, f(x) plotted against x]
Which curve gives the “best fit” to these data?

• a straight line?
• a parabola?
• a 4th order polynomial?
• something else?

[Figure: on successive slides, the same data points fitted by a straight line, a parabola, a 4th order polynomial, and another, more complex curve]
Ockham’s Razor (5.2)

“The most likely hypothesis is the simplest one consistent with the data.”

[Figure: three classifiers fitted to the same o/x data – an inadequate fit, a good compromise, and an over-fitted boundary]

Since there can be noise in the measurements, in practice we need to make a tradeoff between the simplicity of the hypothesis and how well it fits the data.

Outliers

[Figure: Predicted Buchanan Votes by County, faculty.washington.edu/mtbrett]
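To make the tradeoff concrete, here is a minimal sketch (my illustration, not from the slides) that fits the candidate curves above to some hypothetical noisy data using numpy.polyfit; the data, seed and degrees are assumptions:

```python
import numpy as np

# Hypothetical noisy samples from an underlying quadratic (an illustrative
# assumption -- the lecture's actual data points are not given).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
t = 0.5 + 2.0 * x - 1.5 * x ** 2 + rng.normal(scale=0.1, size=x.shape)

# Fit candidate curves of increasing complexity and compare the
# training error E = (1/2) sum_i (z_i - t_i)^2.
for degree in (1, 2, 4, 9):
    coeffs = np.polyfit(x, t, degree)
    E = 0.5 * np.sum((np.polyval(coeffs, x) - t) ** 2)
    print(f"degree {degree}: E = {E:.5f}")

# The training error always shrinks as the degree grows, but the
# high-degree fits are modelling the noise; Ockham's Razor favours the
# simplest hypothesis that fits the data adequately.
```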

Butterfly Ballot

[Figure: the Palm Beach County “butterfly ballot” from the 2000 US Presidential election]

Recall: Limitations of Perceptrons

Problem: many useful functions are not linearly separable (e.g. XOR)

[Figure: (a) I1 and I2 and (b) I1 or I2 are linearly separable; (c) I1 xor I2 is not]

Possible solution: x1 XOR x2 can be written as (x1 AND x2) NOR (x1 NOR x2).
Recall that AND, OR and NOR can be implemented by perceptrons (see the sketch after these slides).

Multi-Layer Neural Networks

[Figure: XOR implemented by a small network of perceptrons – AND and NOR units with weights ±1 and biases −1.5 and +0.5]

Problem: How can we train it to learn a new function? (credit assignment)

Two-Layer Neural Network

[Figure: a generic two-layer network – input units ak, connected by weights Wk,j to hidden units aj, connected by weights Wj,i to output units ai]

Normally, the numbers of input and output units are fixed, but we can choose the number of hidden units.
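As a sanity check on this construction, here is a minimal sketch (my illustration, not course code) with hand-set perceptron weights implementing AND and NOR:

```python
def step(s):
    """Step-function perceptron activation: 1 if s > 0, else 0."""
    return int(s > 0)

# Hand-set perceptrons: output = step(bias + w1*a + w2*b).
def AND(a, b):
    return step(-1.5 + 1.0 * a + 1.0 * b)

def NOR(a, b):
    return step(+0.5 - 1.0 * a - 1.0 * b)

def XOR(a, b):
    # x1 XOR x2 = (x1 AND x2) NOR (x1 NOR x2)
    return NOR(AND(a, b), NOR(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))   # prints 0 0 0, 0 1 1, 1 0 1, 1 1 0
```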

The XOR Problem

x1  x2  target
 0   0     0
 0   1     1
 1   0     1
 1   1     0

• the XOR data cannot be learned with a perceptron, but can be learned by a 2-layer network with two hidden units
• for this toy problem, there is only a training set; there is no validation or test set, so we don’t worry about overfitting

Neural Network Equations

[Figure: a 2-2-1 network with inputs x1, x2, hidden units y1, y2 (biases b1, b2, weights w11, w12, w21, w22), and output z (bias c, weights v1, v2)]

u1 = b1 + w11 x1 + w12 x2
y1 = g(u1)
s = c + v1 y1 + v2 y2
z = g(s)

We sometimes use w as a shorthand for any of the trainable weights {c, v1, v2, b1, b2, w11, w21, w12, w22}.

NN Training as Cost Minimization

We define an error function or loss function E to be (half) the sum over all input patterns of the square of the difference between actual output and target output:

E = (1/2) ∑i (zi − ti)²

Local Search in Weight Space

If we think of E as height, it defines an error landscape on the weight space. The aim is to find a set of weights for which E is very low.

Problem: because of the step function, the landscape will not be smooth but will instead consist almost entirely of flat local regions and “shoulders”, with occasional discontinuous jumps.
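A minimal Python sketch of these equations (my illustration, not the course’s code); the equation for u2 is assumed to mirror u1 with weights w21, w22:

```python
import numpy as np

def g(s):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-s))

def forward(x1, x2, w):
    """Forward pass for the 2-2-1 network, following the slide's equations.
    w holds the trainable weights {c, v1, v2, b1, b2, w11, w12, w21, w22}."""
    u1 = w["b1"] + w["w11"] * x1 + w["w12"] * x2
    u2 = w["b2"] + w["w21"] * x1 + w["w22"] * x2   # assumed, by symmetry with u1
    y1, y2 = g(u1), g(u2)
    s = w["c"] + w["v1"] * y1 + w["v2"] * y2
    return g(s)                                    # z = g(s)

# Loss over the four XOR patterns: E = (1/2) sum_i (z_i - t_i)^2
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
rng = np.random.default_rng(0)
w = {k: rng.normal(scale=0.1) for k in
     ("c", "v1", "v2", "b1", "b2", "w11", "w12", "w21", "w22")}
E = 0.5 * sum((forward(x1, x2, w) - t) ** 2 for (x1, x2), t in data)
print(E)
```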

Continuous Activation Functions (3.10)

[Figure: (a) step function, (b) sign function, (c) sigmoid function, each plotted against the net input ini]

Key Idea: Replace the (discontinuous) step function with a differentiable function, such as the sigmoid

g(s) = 1 / (1 + e^(−s))

or hyperbolic tangent

g(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s)) = 2/(1 + e^(−2s)) − 1

Note: if z(s) = 1/(1 + e^(−s)), then z′(s) = z(1 − z); if z(s) = tanh(s), then z′(s) = 1 − z².

Gradient Descent (4.3)

Recall that the loss function E is (half) the sum over all input patterns of the square of the difference between actual output and target output:

E = (1/2) ∑i (zi − ti)²

The aim is to find a set of weights for which E is very low. If the functions involved are smooth, we can use multi-variable calculus to adjust the weights in such a way as to take us in the steepest downhill direction:

w ← w − η ∂E/∂w

The parameter η is called the learning rate.

Chain Rule (6.5.2)

If, say, y = y(u) and u = u(x), then

∂y/∂x = (∂y/∂u)(∂u/∂x)

This principle can be used to compute the partial derivatives in an efficient and localized manner. Note that the transfer function must be differentiable (usually sigmoid, or tanh).

Forward Pass

[Figure: the same 2-2-1 network, annotated with the quantities computed in the forward pass]

u1 = b1 + w11 x1 + w12 x2
y1 = g(u1)
s = c + v1 y1 + v2 y2
z = g(s)
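A quick sketch (not from the slides) that checks the derivative identity z′(s) = z(1 − z) against a finite difference:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Derivatives expressed in terms of the unit's output, as on the slide:
# if z = sigmoid(s) then z' = z(1 - z); if z = tanh(s) then z' = 1 - z^2.
def sigmoid_prime(s):
    z = sigmoid(s)
    return z * (1.0 - z)

def tanh_prime(s):
    z = np.tanh(s)
    return 1.0 - z ** 2

# Sanity check against a centred finite difference.
s, eps = 0.7, 1e-6
fd = (sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps)
print(sigmoid_prime(s), fd)   # the two values agree to ~1e-10
```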

Backpropagation

Partial Derivatives:

∂E/∂z = z − t
dz/ds = g′(s) = z(1 − z)
∂s/∂y1 = v1
dy1/du1 = y1(1 − y1)

Useful notation:

δout = ∂E/∂s,  δ1 = ∂E/∂u1,  δ2 = ∂E/∂u2

Then

δout = (z − t) z(1 − z)
δ1 = δout v1 y1(1 − y1)
∂E/∂v1 = δout y1
∂E/∂w11 = δ1 x1

Partial derivatives can be calculated efficiently by backpropagating deltas through the network.

Two-Layer NN’s – Applications

• Medical Diagnosis
• Autonomous Driving
• Game Playing
• Credit Card Fraud Detection
• Handwriting Recognition
• Financial Prediction
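Putting the forward pass, the deltas and the gradient descent step together, here is a minimal online-training sketch for the 2-2-1 XOR network (my own illustration of the slide’s equations; the seed, learning rate and epoch count are arbitrary, and convergence from every initialization is not guaranteed):

```python
import numpy as np

def g(s):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-s))

# 2-2-1 network: W[j,k] are the hidden weights (the slide's w11..w22),
# b the hidden biases (b1, b2), v the output weights (v1, v2), c the bias.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(2, 2))
b = np.zeros(2)
v = rng.normal(scale=0.5, size=2)
c = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)
eta = 1.0   # learning rate (arbitrary choice)

for epoch in range(10000):
    for x, t in zip(X, T):
        # forward pass:  u_j = b_j + sum_k w_jk x_k,  y_j = g(u_j)
        y = g(b + W @ x)
        z = g(c + v @ y)
        # backward pass:  delta_out = (z - t) z (1 - z)
        delta_out = (z - t) * z * (1 - z)
        # delta_j = delta_out v_j y_j (1 - y_j)
        delta = delta_out * v * y * (1 - y)
        # gradient descent step:  w <- w - eta dE/dw
        c -= eta * delta_out
        v -= eta * delta_out * y
        b -= eta * delta
        W -= eta * np.outer(delta, x)

# With a lucky initialization this approaches [0, 1, 1, 0];
# a different seed may land in a local minimum.
print(np.round(g(c + g(b + X @ W.T) @ v), 2))
```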
Example: Pima Indians Diabetes Dataset

Attribute                                            mean    stdv
1. Number of times pregnant                           3.8     3.4
2. Plasma glucose concentration                     120.9    32.0
3. Diastolic blood pressure (mm Hg)                  69.1    19.4
4. Triceps skin fold thickness (mm)                  20.5    16.0
5. 2-Hour serum insulin (mu U/ml)                    79.8   115.2
6. Body mass index (weight in kg/(height in m)²)     32.0     7.9
7. Diabetes pedigree function                         0.5     0.3
8. Age (years)                                       33.2    11.8

Based on these inputs, try to predict whether the patient will develop Diabetes (1) or Not (0).

Training Tips

• re-scale inputs and outputs to be in the range 0 to 1 or −1 to 1 (see the sketch below)
  ◦ otherwise, backprop may put undue emphasis on larger values
• replace missing values with mean value for that attribute
• initialize weights to small random values
• on-line, batch, mini-batch, experience replay
• adjust learning rate (and momentum) to suit the particular task
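A minimal sketch of the first two tips (my illustration, not course code); the three sample rows and the zero-means-missing convention are assumptions:

```python
import numpy as np

# X is assumed to be an (n_patients x 8) array of the attributes above;
# three hypothetical rows are used here, with 0 marking a missing value
# (an assumption, not part of the slides).
X = np.array([[6, 148, 72, 35,  0, 33.6, 0.627, 50],
              [1,  85, 66, 29,  0, 26.6, 0.351, 31],
              [8, 183, 64,  0, 94, 23.3, 0.672, 32]], dtype=float)

# replace missing values with the mean of the non-missing values
# for that attribute
for j in range(X.shape[1]):
    missing = X[:, j] == 0
    if missing.any() and not missing.all():
        X[missing, j] = X[~missing, j].mean()

# re-scale each attribute to the range 0 to 1, so that large-valued
# attributes (e.g. serum insulin) do not dominate small ones
# (e.g. pedigree function)
lo, hi = X.min(axis=0), X.max(axis=0)
X01 = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
print(np.round(X01, 2))
```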