
COMP9444
Neural Networks and Deep Learning
2a. Probability and Backprop Variations
Outline
Textbook, Sections 3.1-3.5, 3.9-3.13, 5.2.2, 5.5, 8.3

  • Probability and Random Variables (3.1-3.2)
  • Probability for Continuous Variables (3.3)
  • Gaussian Distribution (3.9.3)
  • Conditional Probability (3.5)
  • Bayes' Rule (3.11)
  • Entropy and KL-Divergence (3.13)
  • Cross Entropy (3.13)
  • Maximum Likelihood (5.5)
  • Weight Decay (5.2.2)
  • Momentum (8.3)
Probability (3.1)

Begin with a set Ω – the sample space (e.g. the 6 possible rolls of a die).
Each ω ∈ Ω is a sample point / possible world / atomic event.

A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω such that

    0 ≤ P(ω) ≤ 1,      ∑_ω P(ω) = 1

e.g. P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

Random Events

A random event A is any subset of Ω:

    P(A) = ∑_{ω ∈ A} P(ω)

e.g. P(die roll < 5) = P(1) + P(2) + P(3) + P(4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
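To make these definitions concrete, here is a minimal Python sketch (not part of the original slides) that represents the fair-die probability model as a dictionary and computes the probability of a random event by summing over its sample points; the names `omega` and `event_prob` are illustrative choices, not notation from the lecture.

```python
# Minimal sketch of a discrete probability space: a fair six-sided die.
# The names `omega` and `event_prob` are illustrative, not from the slides.
omega = {w: 1/6 for w in range(1, 7)}       # assignment P(w) for every sample point w

def event_prob(event, p=omega):
    """P(A) = sum of P(w) over the sample points w in the event A."""
    return sum(p[w] for w in event)

A = {1, 2, 3, 4}                            # the random event "die roll < 5"
print(event_prob(A))                        # 0.666...  (i.e. 2/3, as above)

# Sanity checks: 0 <= P(w) <= 1 for every w, and the probabilities sum to 1.
assert all(0 <= pw <= 1 for pw in omega.values())
assert abs(sum(omega.values()) - 1) < 1e-9
```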
Random Variables (3.2)

A random variable is a function from sample points to some range (e.g. the Reals or Booleans).
For example, Odd(3) = true.

P induces a probability distribution for any random variable X:

    P(X = x_i) = ∑_{ω : X(ω) = x_i} P(ω)

e.g. P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2

Probability for Continuous Variables (3.3)

e.g. P(X = x) = U[18, 26](x) = uniform density between 18 and 26
[figure: uniform density of height 0.125 on the interval from 18 to 26]

Here P is a density; it integrates to 1.  P(X = 20.5) = 0.125 really means

    lim_{dx → 0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125

Gaussian Distribution (3.9.3)

    P_{μ,σ}(x) = (1 / √(2π)σ) e^{−(x−μ)² / 2σ²}

where μ = mean and σ = standard deviation.

Multivariate Gaussian:

    P_{μ,σ}(x) = ∏_i P_{μ_i,σ_i}(x_i)

Probability and Logic

Logically related events must have related probabilities. For example,

    P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

[figure: Venn diagram of overlapping events A and B, with intersection A ∧ B]

Conditional Probability (3.5)

If P(B) ≠ 0, then the conditional probability of A given B is

    P(A | B) = P(A ∧ B) / P(B)

[figure: Venn diagram of A, B and A ∧ B]

Bayes' Rule (3.11)

The formula for conditional probability can be manipulated to find a relationship when the two variables are swapped:

    P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)

    →  Bayes' rule:  P(A | B) = P(B | A) P(A) / P(B)

This is often useful for assessing the probability of an underlying cause after an effect has been observed:

    P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

Example: Medical Diagnosis

Question: Suppose we have a test for a type of cancer which occurs in 1% of patients. The test has a sensitivity of 98% and a specificity of 97%. If a patient tests positive, what is the probability that they have the cancer?

Answer: There are two random variables: Cancer (true or false) and Test (positive or negative). The probability P(cancer) = 0.01 is called a prior, because it represents our estimate of the probability before we have done the test (or made some other observation). The sensitivity and specificity are interpreted as follows:

    P(positive | cancer) = 0.98,      P(negative | ¬cancer) = 0.97

Bayes' Rule for Medical Diagnosis

[figure: probability tree – Cancer? Yes (0.01) / No (0.99); given Yes, Test Pos (0.98) / Neg (0.02); given No, Test Pos (0.03) / Neg (0.97); joint probabilities P(Yes, Pos) ≈ 0.01, P(Yes, Neg) ≈ 0.00, P(No, Pos) ≈ 0.03, P(No, Neg) ≈ 0.96]

    P(cancer | positive) = P(positive | cancer) P(cancer) / P(positive)
                         = (0.98 × 0.01) / (0.98 × 0.01 + 0.03 × 0.99)
                         ≈ 0.01 / (0.01 + 0.03) = 1/4
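The diagnosis calculation above is easy to check numerically. The sketch below (an illustration, not part of the slides; `posterior_cancer` is an invented helper name) applies Bayes' rule with the stated prior, sensitivity and specificity.

```python
# Bayes' rule for the medical diagnosis example above (illustrative sketch).
prior       = 0.01   # P(cancer): the cancer occurs in 1% of patients
sensitivity = 0.98   # P(positive | cancer)
specificity = 0.97   # P(negative | no cancer), so P(positive | no cancer) = 0.03

def posterior_cancer(prior, sensitivity, specificity):
    """P(cancer | positive) via Bayes' rule; the helper name is illustrative."""
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)   # P(positive)
    return sensitivity * prior / p_pos

print(posterior_cancer(prior, sensitivity, specificity))   # about 0.248, i.e. roughly 1/4
```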
Example: Light Bulb Defects

Question: You work for a lighting company which manufactures 60% of its light bulbs in Factory A and 40% in Factory B. One percent of the light bulbs from Factory A are defective, while two percent of those from Factory B are defective. If a random light bulb turns out to be defective, what is the probability that it was manufactured in Factory A?

Answer: There are two random variables: Factory (A or B) and Defect (Yes or No). In this case, the prior is:

    P(A) = 0.6,      P(B) = 0.4

The conditional probabilities are:

    P(defect | A) = 0.01,      P(defect | B) = 0.02

Bayes' Rule for Light Bulb Defects

[figure: probability tree – Factory? A (0.6) / B (0.4); given A, Defect Yes (0.01) / No (0.99); given B, Defect Yes (0.02) / No (0.98); joint probabilities P(A, Yes) = 0.006, P(A, No) = 0.594, P(B, Yes) = 0.008, P(B, No) = 0.392]

    P(A | defect) = P(defect | A) P(A) / P(defect)
                  = (0.01 × 0.6) / (0.01 × 0.6 + 0.02 × 0.4)
                  = 0.006 / (0.006 + 0.008) = 3/7

Entropy and KL-Divergence (3.13)

The entropy of a discrete probability distribution p = ⟨p_1, ..., p_n⟩ is

    H(p) = ∑_{i=1}^n p_i (−log₂ p_i)

Given two probability distributions p = ⟨p_1, ..., p_n⟩ and q = ⟨q_1, ..., q_n⟩ on the same set Ω, the Kullback-Leibler Divergence between p and q is

    D_KL(p || q) = ∑_{i=1}^n p_i (log₂ p_i − log₂ q_i)

KL-Divergence is like a "distance" from one probability distribution to another. But it is not symmetric: D_KL(p || q) ≠ D_KL(q || p).

Entropy and Huffman Coding

Entropy is the number of bits per symbol achieved by a (block) Huffman Coding scheme.

Example 1: H(⟨0.5, 0.5⟩) = 1 bit.

Suppose we want to encode, in zeros and ones, a long message composed of the two letters A and B, which occur with equal frequency. This can be done efficiently by assigning A = 0, B = 1. In other words, one bit is needed to encode each letter.
[figure: Huffman tree assigning A = 0, B = 1]

Example 2: H(⟨0.5, 0.25, 0.25⟩) = 1.5 bits.

Suppose we need to encode a message consisting of the letters A, B and C, and that B and C occur equally often but A occurs twice as often as the other two letters. In this case, the most efficient code would be A = 0, B = 10, C = 11. The average number of bits needed to encode each letter is 1.5.
[figure: Huffman tree assigning A = 0, B = 10, C = 11]

If the samples occur in some other proportion, we would need to "block" them together in order to encode them efficiently. But the average number of bits required by the most efficient coding scheme is still given by

    H(⟨p_1, ..., p_n⟩) = ∑_{i=1}^n p_i (−log₂ p_i)

Entropy and KL-Divergence

D_KL(p || q) is the number of extra bits we need to transmit if we designed a code for q() but it turned out that the samples were drawn from p() instead:

    D_KL(p || q) = ∑_{i=1}^n p_i (log₂ p_i − log₂ q_i)
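The two entropy examples and the asymmetry of the KL-Divergence can be verified numerically. The sketch below is an illustration only (the helper names `entropy` and `kl_div` and the test distributions are invented), using base-2 logarithms as on the slides.

```python
import math

# Illustrative helpers for the discrete formulas above (base-2 logs, as on the slides).
def entropy(p):
    """H(p) = sum_i p_i * (-log2 p_i), in bits."""
    return sum(pi * -math.log2(pi) for pi in p if pi > 0)

def kl_div(p, q):
    """D_KL(p || q) = sum_i p_i * (log2 p_i - log2 q_i), in bits."""
    return sum(pi * (math.log2(pi) - math.log2(qi)) for pi, qi in zip(p, q) if pi > 0)

print(entropy([0.5, 0.5]))          # 1.0 bit   (Example 1)
print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits  (Example 2)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_div(p, q), kl_div(q, p))   # the two values differ: D_KL is not symmetric
```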
Continuous Entropy and KL-Divergence

  • the entropy of a continuous distribution p() is
        H(p) = ∫_θ p(θ) (−log p(θ)) dθ
  • for a multivariate Gaussian distribution, H(p) = ∑_i log σ_i
  • KL-Divergence:
        D_KL(p || q) = ∫_θ p(θ) (log p(θ) − log q(θ)) dθ

Forward KL-Divergence

Given P, choose Gaussian Q to minimize D_KL(P || Q).
Q must not be small in any place where P is large.

Reverse KL-Divergence

Given P, choose Gaussian Q to minimize D_KL(Q || P).
Q just needs to be concentrated in some place where P is large.

KL-Divergence in Deep Learning

  • KL-Divergence is used in a number of deep learning algorithms, including Variational Autoencoders (Week 10)
  • KL-Divergence is also used in some policy-based deep reinforcement learning algorithms such as Variational Inference or Trust Region Policy Optimization (TRPO) (but we will not cover these in detail)

Variations on Backprop

  • Cross Entropy
      ◦ problem: the least squares error function is not suitable for classification
      ◦ mathematical theory: maximum likelihood
      ◦ solution: replace it with the cross entropy error function
  • Weight Decay
      ◦ problem: weights "blow up", and inhibit further learning
      ◦ mathematical theory: Bayes' rule
      ◦ solution: add a weight decay term to the error function
  • Momentum
      ◦ problem: weights oscillate in a "rain gutter"
      ◦ solution: weighted average of the gradient over time

Cross Entropy

For classification tasks, the target t is either 0 or 1, so it is better to use

    E = −t log(z) − (1 − t) log(1 − z)

This can be justified mathematically, and works well in practice – especially when negative examples vastly outweigh positive ones. It also makes the backprop computations simpler:

    ∂E/∂z = (z − t) / (z (1 − z))

If z = 1 / (1 + e^{−s}), then

    ∂E/∂s = (∂E/∂z)(∂z/∂s) = z − t

Maximum Likelihood (5.5)

H is a class of hypotheses.
P(D | h) = probability of data D being generated under hypothesis h ∈ H.
P(D | h) is called the likelihood (and log P(D | h) the log likelihood).

ML Principle: choose h ∈ H which maximizes this likelihood, i.e. maximizes P(D | h) [or, equivalently, log P(D | h)].

In our case, each hypothesis h is a choice of parameters for a neural network or other function applied to input values x, while the observed data D consist of the target values t.

Least Squares Line Fitting

[figure: scatter of data points with a fitted line f(x) plotted against x]

Derivation of Least Squares

Suppose the data are generated by a linear function f() plus Gaussian noise with mean zero and standard deviation σ. Then

    P(D | h) = P(t | f) = ∏_i (1 / √(2π)σ) e^{−(t_i − f(x_i))² / 2σ²}

    log P(t | f) = ∑_i [ −(1/2σ²)(t_i − f(x_i))² − log(σ) − (1/2) log(2π) ]

    f_ML = argmax_{f ∈ H} log P(t | f) = argmin_{f ∈ H} ∑_i (t_i − f(x_i))²

(Note: we do not need to know σ)

Derivation of Cross Entropy (3.9.1)

For classification tasks, t is either 0 or 1. Assume t is generated by hypothesis f as follows:

    P(1 | f(x_i)) = f(x_i)
    P(0 | f(x_i)) = 1 − f(x_i)

i.e.

    P(t_i | f(x_i)) = f(x_i)^{t_i} (1 − f(x_i))^{1 − t_i}

Then

    log P(t | f) = ∑_i t_i log f(x_i) + (1 − t_i) log(1 − f(x_i))

    f_ML = argmax_{f ∈ H} ∑_i t_i log f(x_i) + (1 − t_i) log(1 − f(x_i))

(Can also be generalized to multiple classes.)
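As a numerical sanity check on the cross entropy gradient quoted above (∂E/∂s = z − t for a sigmoid output), the following sketch (illustrative, not from the slides; `sigmoid` and `cross_entropy` are invented helper names) compares the analytic gradient with a finite-difference estimate.

```python
import math

# Cross entropy error for a single sigmoid output z = 1 / (1 + e^{-s}), target t in {0, 1}.
def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def cross_entropy(s, t):
    z = sigmoid(s)
    return -t * math.log(z) - (1 - t) * math.log(1 - z)

s, t = 0.7, 1.0
z = sigmoid(s)

analytic = z - t                     # dE/ds = z - t, as derived above
eps = 1e-6                           # finite-difference check of the same gradient
numeric = (cross_entropy(s + eps, t) - cross_entropy(s - eps, t)) / (2 * eps)

print(analytic, numeric)             # the two values should agree closely
```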
Weight Decay (5.2.2)

Sometimes we add a penalty term to the loss function which encourages the neural network weights w_j to remain small:

    E = (1/2) ∑_i (z_i − t_i)² + (λ/2) ∑_j w_j²

This can prevent the weights from "saturating" to very high values. It is sometimes referred to as "elastic weights", because the weights experience a force as if there were a spring pulling them back towards the origin according to Hooke's Law.

The scaling factor λ needs to be determined from experience, or empirically.

Bayesian Inference

H is a class of hypotheses.
P(D | h) = probability of data D being generated under hypothesis h ∈ H.
P(h | D) = probability that h is correct, given that data D were observed.

Bayes' Theorem:

    P(h | D) P(D) = P(D | h) P(h)

    P(h | D) = P(D | h) P(h) / P(D)

P(h) is called the prior.

Weight Decay as MAP Estimation (5.6.1)

We assume a Gaussian prior distribution for the weights, i.e.

    P(w) = ∏_j (1 / √(2π)σ₀) e^{−w_j² / 2σ₀²}

Then

    P(w | t) = P(t | w) P(w) / P(t)
             = (1 / P(t)) ∏_i (1 / √(2π)σ) e^{−(z_i − t_i)² / 2σ²} ∏_j (1 / √(2π)σ₀) e^{−w_j² / 2σ₀²}

    log P(w | t) = −(1/2σ²) ∑_i (z_i − t_i)² − (1/2σ₀²) ∑_j w_j² + constant

    w_MAP = argmax_{w ∈ H} log P(w | t)
          = argmin_{w ∈ H} [ (1/2) ∑_i (z_i − t_i)² + (λ/2) ∑_j w_j² ]

where λ = σ² / σ₀².

Momentum (8.3)

If the landscape is shaped like a "rain gutter", the weights will tend to oscillate without much improvement.

Solution: add a momentum factor

    δw ← α δw − η ∂E/∂w
    w  ← w + δw

Hopefully, this will dampen the sideways oscillations but amplify the downhill motion by a factor of 1/(1 − α).
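To show how the weight decay and momentum terms enter the update rule in practice, here is a small gradient-descent sketch. It is an illustration only: the quadratic toy loss, the target vector, and the learning rate, momentum, and decay values are invented for the example, not taken from the slides.

```python
import numpy as np

# Illustrative gradient descent with weight decay (lam) and momentum (alpha),
# following the update  delta_w <- alpha*delta_w - eta*dE/dw,  w <- w + delta_w.
# The toy quadratic loss and all hyperparameter values are invented for this sketch.

rng = np.random.default_rng(0)
w       = rng.normal(size=3)          # weights
delta_w = np.zeros_like(w)            # momentum buffer
eta, alpha, lam = 0.1, 0.9, 0.01      # learning rate, momentum, weight decay

def grad_loss(w):
    """Gradient of a toy quadratic loss 0.5*||w - target||^2 (illustrative only)."""
    target = np.array([1.0, -2.0, 0.5])
    return w - target

for step in range(100):
    g = grad_loss(w) + lam * w        # weight decay adds  lam * w  to the gradient
    delta_w = alpha * delta_w - eta * g
    w = w + delta_w

print(w)                              # close to the target, shrunk slightly toward 0 by the decay
```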