The Australian National University
School of Computing
Other Contributors: , Koneputugodage,
Semester 1, 2022 Assignment 2 Weight: 20%
COMP4670/8600: Statistical Machine Learning
Release Date. 22nd April 2022.
Due Date. 16th May 2022 at 1200 AEDT.
Maximum credit. 100 Marks for COMP4670 and COMP8600.
For submission, we are using Gradescope. Make sure the files included in your submission follow the format below:
1. Ensure that if submitting a zip file (or similar), the file is uncompressed on Gradescope. Individual files should be visible in Gradescope’s submission website.
2. contribution.py is submitted with your contribution / reference / collaboration statement filled in. State your group members (if any) and any collaboration you had in completing the assignment.
3. A PDF for theory solutions. We will not be accepting any other file format; submitting one may result in a null grade or a heavy mark penalty. When referring to equations in the spec, please use the equation number, e.g., (2.1). For Section 2, you should include the answers to Q2.3 and Q2.6b inside your theory PDF file.
4. Submit the coding folder boframework that should only include the following files with your code implementation where appropriate (FIXME sections):
• kernels.py
• acquisitions.py
• bayesopt.py
Do not delete any setup code provided or change the functions you are required to complete.
5. Submit the file bayesopt_implementation_viewer.ipynb. Note that you should submit your executed version of the file, and not the default one. Similarly, do not delete or change any of the code inside the Jupyter notebook.
Section 1: Generalisation and Data Noise (50 Total Points)

In the lectures, we explored several aspects of the generalisation bound. One simple variant we discussed is that of a finite hypothesis class, where it was shown that the generalisation gap can be upper-bounded by a quantity scaling as $O(1/\sqrt{N})$ in the number of samples $N$. However, this requires the independent samples to follow the true distribution. Inspired by the discussions and questions raised by students in that lecture, we will consider a very simple method¹ of adapting the uniform bound for a finite hypothesis class in the presence of noise.
Setup. Let us set up the scene. Suppose we are interested in the classification of inputs X to classes Y, with corresponding distribution D and corresponding joint probability density function p. We also consider the random variables of input and output classes given by the distribution (X, Y ) ∼ D.
Define a hypothesis class H as a set of classifiers of the form f : X → Y. The objective of learning is to find the best possible f ∈ H, where “best” means the classifier that gives us the lowest risk based on some loss function l : Y × Y → R. Recall from the lectures that there are two ways we can aggregate the loss: either using the theoretical distribution D to get the true risk, or using the empirical distribution (through an i.i.d. dataset) to get the empirical risk. The following definition will make it clearer.
Definition 1 (Risk²). The (true) risk of a classifier $f : \mathcal{X} \to \mathcal{Y}$ with respect to (w.r.t.) a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$ and loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is given by
$$R_D[f] \stackrel{\text{def}}{=} \mathbb{E}_{(X,Y) \sim D}\left[ l(f(X), Y) \right];$$
and the empirical risk w.r.t. samples $S \sim D^N$ is given by
$$\hat{R}_S[f] \stackrel{\text{def}}{=} \mathbb{E}_{(X,Y) \sim S}\left[ l(f(X), Y) \right] = \frac{1}{N} \sum_{i=1}^{N} l(f(x_i), y_i).$$
Remark. Note the slight difference in notation from lectures. The empirical risk is denoted with a hat, $\hat{R}_{\square}[f]$, and each type of risk has an explicit subscript to denote which distribution we are considering. This is so we can compare the true risk and the empirical risk for different distributions.
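As a concrete illustration (our own example, not part of the assignment), under the 0-1 loss $l(y, y') = \mathbb{1}\{y \ne y'\}$ the empirical risk is simply the misclassification rate of $f$ on the sample. A minimal NumPy sketch, with a hypothetical sign classifier:

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """Empirical risk R_S[f] under the 0-1 loss: the average of
    1{f(x_i) != y_i} over the N sample points."""
    preds = np.array([f(x) for x in xs])
    return np.mean(preds != ys)

# Toy binary sample and a hypothetical sign classifier.
xs = np.array([-2.0, -0.5, 0.3, 1.7])
ys = np.array([-1, +1, +1, +1])
f = lambda x: 1 if x > 0 else -1
print(empirical_risk(f, xs, ys))  # 0.25: one of four points is misclassified
```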
From the lecture, we know that the difference between these two types of risk — also called the generalisation gap — is important. It determines how close we can get to the true risk given only a finite amount of data. We also know from the lecture that this gap can be bounded no matter which classifier $f$ we choose, using the two-sided Hoeffding’s inequality (see Proposition 1 in Appendix A). You may use this result without proof later.
General noise. To consider the effects data noise has on our generalisation bounds, we will consider two different distributions: (1) a clean distribution which we want to learn from; and (2) a noisy distribution which can cause problems in learning. In particular, we consider a noisy distribution $\tilde{D}$ as a noisy version of $D$. The tilde symbol $\tilde{\square}$ is used to indicate noisy variants of objects we have already introduced. Table 1 in Appendix A summarises the differences.
In this general setup, we have a natural question about the effect of noise: how close can we get to the true risk if we only have a noisy sample of data? Or, mathematically, can the generalisation gap when we have noisy data, defined as
$$\Delta_{\text{noise gap}}[f] \stackrel{\text{def}}{=} \left| R_D[f] - \hat{R}_{\tilde{S}}[f] \right|, \tag{1.4}$$
be bounded? To answer this, let us first examine the difference between the true risk of the clean distribution and that of the noisy version.
¹There are slicker bounds available if we use more sophisticated frameworks, e.g., PAC-Bayes.
²Technically, the co-domain of the classifier does not need to match that of the output space. However, we simplify to this case for convenience of notation.
Assumption: For the rest of this section, we will make two assumptions. First, the hypothesis class H is finite and contains |H| classifiers. Second, the loss function l is bounded between 0 and 1 on its domain—that is, for any y, y′ ∈ Y,
0 ≤ l(y, y′) ≤ 1. (1.3)
Question 1.1: Noise and True Risk (10 Points)

Let $f \in \mathcal{H}$ with a bounded loss function (as per Eq. (1.3)). Prove that
$$\left| R_D[f] - R_{\tilde{D}}[f] \right| \le D_{TV}(p, \tilde{p}),$$
where $p, \tilde{p}$ are the density functions of $D, \tilde{D}$, respectively; and
$$D_{TV}(p, \tilde{p}) \stackrel{\text{def}}{=} \int_{\mathcal{X} \times \mathcal{Y}} \left| p(x,y) - \tilde{p}(x,y) \right| \, dx \, dy. \tag{1.6}$$
The integral term in Eq. (1.6) is actually a well-known distance function called the total variation distance. One nice characteristic of this distance is that it is (polynomially) upper-bounded by the KL-divergence.³ A hint for later: the integral in Eq. (1.6) can be simplified to a “summation” if we are considering a countable domain, i.e., a classification task with finite $\mathcal{Y}$.
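For instance (our own restatement of the hint, not from the spec), with binary labels $\mathcal{Y} = \{-1, +1\}$ the integral over $y$ collapses to a sum:
$$D_{TV}(p, \tilde{p}) = \sum_{y \in \{-1, +1\}} \int_{\mathcal{X}} \left| p(x, y) - \tilde{p}(x, y) \right| \, dx.$$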
Thus putting everything together, let us establish a uniform bound for the noisy sample gap (given by
Eq. (1.4)) for a finite hypothesis class H.
Question 1.2: Uniform Bound for Noisy Sample Gap (15 Points)

Let $\mathcal{H}$ be a finite hypothesis class. Prove that given a finite sample $\tilde{S} \sim \tilde{D}^N$, for $\delta > 0$,
$$\Pr\left( \forall f \in \mathcal{H} : \left| R_D[f] - \hat{R}_{\tilde{S}}[f] \right| \le D_{TV}(p, \tilde{p}) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{N}} \right) \ge 1 - \delta. \tag{1.7}$$

Hint: Use the triangle inequality. Then reuse the result of Question 1.1 to find an upper bound for the left-hand distance above. Finally, use Proposition 1 to derive the bound and change $\varepsilon$ to $\delta$ to get Eq. (1.7).
Question 1.3: Interpreting the Uniform Bound for Noisy Sample Gap (10 Points)

Compare Eq. (1.7) with the bound we found in the lecture in the absence of noisy data, namely, for any $\delta > 0$,
$$\Pr\left( \forall f \in \mathcal{H} : \left| R_D[f] - \hat{R}_S[f] \right| \le \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{N}} \right) \ge 1 - \delta.$$
What is the effect of noisy data on the generalisation bound and on the rate at which it decreases with the number of data points?
Label noise. Now that we have a “general” bound for different kinds of noise, let us specialise the bound to label noise. In particular, for the next couple of questions we will consider label noise for binary classification, where $\mathcal{Y} = \{-1, +1\}$. Simply put, label noise is the case where the class labels used for training have a chance of being incorrect.
Definition 2 ((Asymmetric) Label Noise). Let $(X, Y) \sim D$ be a data distribution. Given label flipping probabilities $\sigma_{-1}, \sigma_{+1} \in [0, 1]$, the corresponding noisy distribution under (asymmetric) label noise is the distribution $(\tilde{X}, \tilde{Y}) \sim \tilde{D}$ in which the noisy random variables $\tilde{X}, \tilde{Y}$ satisfy the following properties:

(LN1) Label flipping probability: $\Pr(\tilde{Y} = -y \mid Y = y) = \sigma_y$ for all $y \in \mathcal{Y}$;
(LN2) Identical input marginals: $\Pr(\tilde{X} = x) = \Pr(X = x)$ for all $x \in \mathcal{X}$;
(LN3) Conditional independence: $\tilde{Y} \perp X \mid Y$.
³Pinsker’s Inequality gives, for every pair of probability distributions $p, q$ on $\mathcal{X} \times \mathcal{Y}$: $D_{TV}(p, q) \le \sqrt{\tfrac{1}{2} D_{KL}(p, q)}$.
Symmetric label noise is the noise model when σ = σ−1 = σ+1.
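To make Definition 2 concrete, here is a small simulation sketch (our own illustration, not assignment code): it applies symmetric label noise with flip probability $\sigma$ and checks that the empirical flip rate matches (LN1).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2                             # symmetric flip probability

y = rng.choice([-1, +1], size=100_000)  # clean labels Y
flip = rng.random(y.shape) < sigma      # each label flips w.p. sigma (LN1)
y_noisy = np.where(flip, -y, y)         # noisy labels Y~

# Empirical Pr(Y~ = -y | Y = y) should be close to sigma for both classes.
for label in (-1, +1):
    mask = y == label
    print(label, np.mean(y_noisy[mask] != y[mask]))  # both approximately 0.2
```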
Given the definition of asymmetric label noise, we can calculate certain posterior probability quantities which are more intuitive. For instance, one can summarise Definition 2 by considering the noisy posterior $\Pr(\tilde{Y} \mid X)$.
Question 1.4: Noisy Posterior for Label Noise (5 Points)

Given a clean distribution $D$ and a noisy version of the distribution $\tilde{D}$ under asymmetric label noise with flipping probabilities $\sigma_{-1}, \sigma_{+1} \in [0, 1]$, show that for all $y \in \mathcal{Y}$,
$$\Pr(\tilde{Y} = y \mid X) = \sigma_{-y} \cdot \left( 1 - \Pr(Y = y \mid X) \right) + (1 - \sigma_y) \cdot \Pr(Y = y \mid X). \tag{1.8}$$
Using Eq. (1.8) we can simply adapt the uniform bound given in the general case for label noise. In particular, we will look at the symmetric label noise version of the bound.
Question 1.5: Uniform Bound for Symmetric Label Noise (10 Points)

Suppose that a noisy distribution $\tilde{D}$ is given by symmetric label noise, with flipping probability $\sigma \in [0, 1]$, applied to a clean distribution $D$. Prove that given a finite sample $\tilde{S} \sim \tilde{D}^N$, for $\delta > 0$,
$$\Pr\left( \forall f \in \mathcal{H} : \left| R_D[f] - \hat{R}_{\tilde{S}}[f] \right| \le 2\sigma + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{N}} \right) \ge 1 - \delta. \tag{1.9}$$

Hint: Start by bounding $|R_D[f] - R_{\tilde{D}}[f]|$ as in Question 1.1. Then reuse the technique of Question 1.2.
Remark. In Eq. (1.9), we can see that the noise term $2\sigma$ is linear in $\sigma$: thus the noise is maximally harmful when $\sigma = 1$ and vanishes when $\sigma = 0$.
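For a rough feel of the magnitudes (our own illustrative numbers, not from the spec): with $|\mathcal{H}| = 1000$, $\delta = 0.05$ and $N = 10{,}000$, the complexity term in Eq. (1.9) is
$$\sqrt{\frac{\ln 1000 + \ln 20}{10{,}000}} \approx 0.031,$$
so for $\sigma = 0.1$ the bound is approximately $0.2 + 0.031 = 0.231$, dominated by the irreducible noise term $2\sigma$.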
Section 2: Gaussian Processes and Bayesian Optimisation (50 Total Points)

Gaussian Processes. For reference, we have defined the key concepts and notation of a Gaussian Process in Appendix B.

Gaussian Process Regression. Having defined the prior and the joint distributions (see Appendix B), we are now in a position to define the Gaussian Process predictive distribution as
$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\left( \bar{\mathbf{f}}_*, \operatorname{cov}(\mathbf{f}_*) \right),$$
where we have
$$\bar{\mathbf{f}}_* \stackrel{\text{def}}{=} \mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K_*^\top K_y^{-1} \mathbf{y}, \tag{2.2a}$$
$$\operatorname{cov}(\mathbf{f}_*) = K_{**} - K_*^\top K_y^{-1} K_*. \tag{2.2b}$$
If we have only a single point to predict, the above equations simplify to
$$\bar{f}_* = \mathbf{k}_*^\top K_y^{-1} \mathbf{y}, \tag{2.3a}$$
$$\mathbb{V}(f_*) = k_{**} - \mathbf{k}_*^\top K_y^{-1} \mathbf{k}_*. \tag{2.3b}$$
Depending on our choice of covariance function, we also have to choose its parameters. This choice must be made carefully, otherwise we will have trouble predicting new points with high confidence. It would be ideal to have a domain expert for every problem we are trying to solve to inform us of the underlying parameters of the model, but this is not always possible. In those cases, we have to treat the covariance function parameters as hyperparameters, which we denote as $\theta$, and optimise them in our algorithm. There are many ways to achieve this hyperparameter optimisation, but here we will maximise the log marginal likelihood with respect to those parameters $\theta$, with the log marginal likelihood defined as
$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{n}{2} \log 2\pi. \tag{2.4}$$
Now, as you have noticed, we are required many times to compute the inverse of the training covariance matrix, $K_y^{-1}$. Computing the inverse directly would be highly inefficient and numerically unstable, especially as the training dataset grows. So, instead we should use a more appropriate method, utilising techniques from matrix factorisation theory. More specifically, we will perform a Cholesky factorisation of the training covariance matrix. The steps to implement this decomposition are given in Algorithm 1. Note the backslash notation: $A \backslash b$ is the vector $x$ which solves $Ax = b$.
Algorithm 1 Efficient Gaussian Process Regression
1: $L \leftarrow \text{cholesky}(K_y)$  ▷ Cholesky factorisation for predictions and log marginal likelihood
2: $\alpha \leftarrow L^\top \backslash (L \backslash \mathbf{y})$
3: $\bar{f}_* \leftarrow \mathbf{k}_*^\top \alpha$  ▷ predictive mean, Eq. (2.3a)
4: $\mathbf{v} \leftarrow L \backslash \mathbf{k}_*$
5: $\mathbb{V}(f_*) \leftarrow k_{**} - \mathbf{v}^\top \mathbf{v}$  ▷ predictive variance, Eq. (2.3b)
6: $\log p(\mathbf{y} \mid X) \leftarrow -\frac{1}{2} \mathbf{y}^\top \alpha - \sum_i \log L_{ii} - \frac{n}{2} \log 2\pi$  ▷ log marginal likelihood, Eq. (2.4)
Choice of Covariance. As we discussed before, the choice of covariance function plays an important role in the success of GP regression. For this assignment, we will consider a general class of covariance functions, called Matérn. This class is defined mathematically as
$$k_{\text{Matérn}}(x, y) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, d}{l} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, d}{l} \right), \tag{2.5}$$
with positive parameters $\nu$ and $l$, where $K_\nu$ is a modified Bessel function and $d = \operatorname{dist}(x, y)$ denotes the pairwise Euclidean distance.
We can easily observe that when $\nu \to \infty$, we obtain the well-known Radial Basis Function (RBF); this class of covariance functions is therefore a generalisation of the RBF class. However, the RBF can be too smooth, which is not always appropriate in a machine learning setting. For machine learning solutions, we usually tend to use two other covariance functions, Matérn-3/2 and Matérn-5/2, given by $\nu = 3/2$ and $\nu = 5/2$ respectively. The Matérn-3/2 covariance function works for once-differentiable functions, whereas Matérn-5/2 works for up to twice-differentiable functions. In general, a Gaussian process with Matérn covariance is $\lceil \nu \rceil - 1$ times differentiable in the mean-square sense.
The two covariance functions are defined below as
$$k_{\nu=3/2}(x, y) = \sigma_f^2 \cdot \left( 1 + \frac{\sqrt{3}\, d}{l} \right) \exp\!\left( -\frac{\sqrt{3}\, d}{l} \right) \quad \text{(Matérn-3/2 case)}, \tag{2.6a}$$
$$k_{\nu=5/2}(x, y) = \sigma_f^2 \cdot \left( 1 + \frac{\sqrt{5}\, d}{l} + \frac{5 d^2}{3 l^2} \right) \exp\!\left( -\frac{\sqrt{5}\, d}{l} \right) \quad \text{(Matérn-5/2 case)}, \tag{2.6b}$$
with $\sigma_f^2$ being the signal variance.

Question 2.1: Matérn Class Covariance Functions (5 Points)

Implement the Matérn-3/2 and Matérn-5/2 covariance functions (__call__()) under the FIXME sections in kernels.py. You should use Equations (2.6a) and (2.6b) for your implementation.
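For orientation, here is a minimal NumPy sketch of the two kernels (our own illustration; the actual class structure and __call__() signatures in kernels.py may well differ):

```python
import numpy as np
from scipy.spatial.distance import cdist

def matern32(X, Y, length_scale=1.0, sigma_f=1.0):
    """Matérn-3/2 covariance, Eq. (2.6a): sigma_f^2 (1 + s) exp(-s), s = sqrt(3) d / l."""
    d = cdist(X, Y, metric="euclidean")  # pairwise Euclidean distances
    s = np.sqrt(3.0) * d / length_scale
    return sigma_f**2 * (1.0 + s) * np.exp(-s)

def matern52(X, Y, length_scale=1.0, sigma_f=1.0):
    """Matérn-5/2 covariance, Eq. (2.6b): sigma_f^2 (1 + s + s^2/3) exp(-s), s = sqrt(5) d / l."""
    d = cdist(X, Y, metric="euclidean")
    s = np.sqrt(5.0) * d / length_scale
    return sigma_f**2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

Note that with $s = \sqrt{5}\, d / l$, the term $s^2 / 3$ equals $5 d^2 / (3 l^2)$, matching Eq. (2.6b).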
Question 2.2: Gaussian Process Regression (15 Points)

Implement the optimisation(), fit(), predict(), and log_marginal_likelihood() functions under the respective FIXME sections in gp.py. You should use Algorithm 1, where appropriate, for your implementation.
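As a self-contained sketch of Algorithm 1 (our own illustration: the real gp.py likely splits this logic across fit(), predict() and log_marginal_likelihood(), and the function name, noise jitter, and signature here are our assumptions):

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, solve_triangular

def gp_predict(X, y, X_star, kernel, noise=1e-8):
    """One pass of Algorithm 1: predictive mean/variance and log marginal likelihood."""
    n = X.shape[0]
    K_y = kernel(X, X) + noise * np.eye(n)       # training covariance K_y
    L = cholesky(K_y, lower=True)                # 1: L <- cholesky(K_y)
    alpha = cho_solve((L, True), y)              # 2: alpha <- L^T \ (L \ y)

    k_star = kernel(X, X_star)                   # cross-covariance, shape (n, m)
    mean = k_star.T @ alpha                      # 3: predictive mean, Eq. (2.3a)
    v = solve_triangular(L, k_star, lower=True)  # 4: v <- L \ k_*
    k_ss = np.diag(kernel(X_star, X_star))
    var = k_ss - np.sum(v**2, axis=0)            # 5: predictive variance, Eq. (2.3b)

    # 6: log marginal likelihood, Eq. (2.4)
    lml = (-0.5 * y @ alpha
           - np.sum(np.log(np.diag(L)))
           - 0.5 * n * np.log(2.0 * np.pi))
    return mean, var, lml
```

Using cho_solve reuses the single factorisation for both the predictions and the likelihood, which is the whole point of Algorithm 1 over a direct inverse.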
Bayesian Optimisation – Intuition. Bayesian Optimisation (BayesOpt) is a zeroth-order optimisation method of finding the maximum of expensive cost functions. BayesOpt employs Bayesian techniques of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utility-based selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). More specifically, it builds a surrogate model for the objective and quantifies the uncertainty in that surrogate using Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample. We will define all those terms later.
Let’s assume we want to solve the following optimisation problem:
$$\max_x f(x). \tag{2.7}$$
The function defined in Eq. (2.7) also has the following important properties:
• The objective function f is L-Lipschitz continuous, which allows us to use Gaussian process regression.
• f is expensive to evaluate, whether in terms of time, money, or other factors that will create hurdles along the way.
• f lacks known special structure, such as convexity or quasi-convexity (similarly with concavity) to allow us to use known gradient-based optimisation algorithms. We say that f is a black-box in those cases.
Finally, our focus is on finding a global rather than a local optimum.
Surrogate Model. As we said before, BayesOpt consists of two main components: a Bayesian statistical model for modelling the objective function, and an acquisition function for deciding where to sample next.
For the surrogate model we will use the Gaussian process regression model that we have implemented before. Since our evaluation functions will be continuous, a GPR is an ideal surrogate model. In general, you can read about surrogate models here.
Improvement-based Acquisition Functions. Now, let’s turn our focus to the acquisition functions. The role of the acquisition function is to guide the search for the optimum. Typically, acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function, whether because the prediction is high, the uncertainty is great, or both. Maximising the acquisition function is used to select the next point at which to evaluate the objective. That is, we wish to sample $f$ at $\arg\max_x u(x \mid \mathcal{D})$, where $u(\cdot)$ is the generic symbol for an acquisition function.
There are many types of acquisition functions, but in this assignment we will focus only on improvement-based acquisition functions. This class of functions is based on the random variable Improvement:
$$I(x) = \max\{0,\ f(x) - f(x^+) - \xi\}, \tag{2.8}$$
where $f(x^+)$ is the incumbent solution and $x^+ = \arg\max_{x_i \in x_{1:t}} f(x_i)$. Note that $\xi \ge 0$ is the exploration-exploitation trade-off hyperparameter.
Intuitively, $I(x)$ assigns a reward of $f(x) - f(x^+) - \xi$ if $f(x) > f(x^+) + \xi$, and zero otherwise. Therefore, we should select the point that has the highest probability of improving upon the incumbent function value $f(x^+)$. This effectively means that we should maximise the probability of the event $\{I(x) > 0\}$, or equivalently the probability of the event $\{f(x) > f(x^+) + \xi\}$. Another way of seeing this is by taking the expectation of the utility function $u(x) = \mathbb{1}\{f(x) > f(x^+) + \xi\}$, which is $u(x) = 1$ when $f(x) > f(x^+) + \xi$ and zero otherwise.
A well-known acquisition function maximises this probability of improvement (PI) over the incumbent $f(x^+)$, and is defined as
$$\mathrm{PI}(x) = \Pr(I(x) > 0) = 1 - \Pr\left( \frac{f(x) - \mu(x)}{\sigma(x)} \le \frac{f(x^+) + \xi - \mu(x)}{\sigma(x)} \right) = \Phi(Z),$$
where, via the reparameterisation trick,
$$Z = \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}.$$
The full derivation can be found in Appendix B. Here $\Phi(\cdot)$ and $\varphi(\cdot)$ are the standard normal cumulative distribution and probability density functions respectively, $\mu(x)$ is the mean of $f(x)$, $\sigma(x)$ is the standard deviation of $f(x)$, and $\xi$ is the exploration-exploitation trade-off hyperparameter, as before.
However, the solution of the PI acquisition function is greedy as it selects the point most likely to offer an improvement of at least ξ.
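A minimal sketch of PI on top of a fitted GP posterior (our own illustration; the function name and the array-of-means interface are assumptions, and acquisitions.py may organise this differently):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI(x) = Phi(Z) with Z = (mu - f_best - xi) / sigma; points with
    zero predictive variance get PI = 0 to avoid division by zero."""
    pi = np.zeros_like(mu)
    mask = sigma > 0
    z = (mu[mask] - f_best - xi) / sigma[mask]
    pi[mask] = norm.cdf(z)
    return pi
```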
A somewhat more satisfying alternative acquisition function would be one that takes into account not only the probability of improvement, but also the magnitude of the improvement a point can potentially yield. In particular, we want to minimise the expected deviation from the true maximum f(x⋆), when choosing a new trial point. This can be formulated mathematically as,
$$x_{t+1} = \arg\min_x \mathbb{E}\left[ \| f_{t+1}(x) - f(x^\star) \| \mid \mathcal{D}_{1:t} \right] = \arg\min_x \int \| f_{t+1}(x) - f(x^\star) \| \Pr(f_{t+1} \mid \mathcal{D}_{1:t}) \, df_{t+1}.$$
In the above definition, we have considered only one step, $t + 1$; to account for many steps, we would have to use a recursive procedure with dynamic programming to find the solution. However, this is expensive, and so we shall propose an alternative procedure: maximise this expected improvement with respect to $f(x^+)$. More specifically, we can re-introduce the improvement (also called a window function in stochastic modelling in engineering) from the PI acquisition function.
The new query point is found by maximising the expected improvement:
$$x_{t+1} = \arg\max_x \mathbb{E}[I(x) \mid \mathcal{D}_{1:t}]. \tag{2.12}$$
Thus, taking the expectation of the random variable $I(x)$ yields the expected improvement (EI) acquisition function, defined as follows,
$$\mathrm{EI}(x) = \begin{cases} \left( \mu(x) - f(x^+) - \xi \right) \Phi(Z) + \sigma(x)\, \varphi(Z) & \text{if } \sigma(x) > 0, \\ 0 & \text{if } \sigma(x) = 0, \end{cases} \tag{2.13}$$
where
$$Z = \begin{cases} \dfrac{\mu(x) - f(x^+) - \xi}{\sigma(x)} & \text{if } \sigma(x) > 0, \\ 0 & \text{if } \sigma(x) = 0. \end{cases} \tag{2.14}$$
Question 2.3: Expected Improvement (6 Points)

Derive mathematically the result of the Expected Improvement (EI) acquisition function in Equation (2.13). That is, show that
$$\mathbb{E}[I(x)] = \begin{cases} \left( \mu(x) - f(x^+) - \xi \right) \Phi(Z) + \sigma(x)\, \varphi(Z) & \text{if } \sigma(x) > 0, \\ 0 & \text{if } \sigma(x) = 0, \end{cases}$$
where $Z$ is defined in Equation (2.14).
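A direct transcription of Eqs. (2.13)-(2.14) into code (again a sketch under the same interface assumptions as the PI example, not the required acquisitions.py interface):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI(x) per Eq. (2.13), with Z as in Eq. (2.14); EI = 0 when sigma = 0."""
    ei = np.zeros_like(mu)
    mask = sigma > 0
    imp = mu[mask] - f_best - xi                 # mu(x) - f(x+) - xi
    z = imp / sigma[mask]                        # Eq. (2.14)
    ei[mask] = imp * norm.cdf(z) + sigma[mask] * norm.pdf(z)
    return ei
```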
Bayesian Optimisation. To summarise, the BayesOpt procedure can be described as in Algorithm 2.

Algorithm 2 Bayesian Optimisation
1: for $t = 1, 2, \dots$ do
2:   Choose $x_t$ by optimising the acquisition function $u$ over the Gaussian Process: $x_t = \arg\max_x u(x \mid \mathcal{D}_{1:t-1})$.
3:   Sample the objective function: $y_t = f(x_t) + \varepsilon$.
4:   Augment the data $\mathcal{D}_{1:t} = \mathcal{D}_{1:t-1} \cup \{(x_t, y_t)\}$ and update the GP.
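Putting the pieces together, here is a schematic of the loop in Algorithm 2, reusing the gp_predict and expected_improvement sketches above (our own illustration: a candidate grid stands in for a proper continuous optimiser of the acquisition, and bayesopt.py will likely be structured differently):

```python
import numpy as np

def bayes_opt(f, X_init, y_init, n_iter, kernel, candidates):
    """Schematic BayesOpt loop: fit GP, maximise acquisition, evaluate, augment."""
    X, y = X_init, y_init
    for t in range(n_iter):
        # Surrogate posterior at the candidate points (sketch from Q2.2 above).
        mu, var, _ = gp_predict(X, y, candidates, kernel)
        sigma = np.sqrt(np.maximum(var, 0.0))        # clip tiny negative variances
        ei = expected_improvement(mu, sigma, y.max())
        x_next = candidates[np.argmax(ei)]           # x_t = argmax u(x | D_{1:t-1})
        y_next = f(x_next)                           # sample the (possibly noisy) objective
        X = np.vstack([X, x_next[None, :]])          # augment D_{1:t}
        y = np.append(y, y_next)                     # ... and update the GP on the next pass
    return X, y
```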