COMP4670/8600: Statistical Machine Learning

The Australian National University
School of Computing
Other Contributors: , Koneputugodage
Semester 1, 2022 Assignment 1 Weight: 20%


Release Date. 23rd February 2022.
Due Date. 28th March 2022 at 1200 AEDT.
Maximum credit. 100 Marks for COMP4670 and 120 Marks for COMP8600 (see Appendix A).
Section 1: Bayesian Estimate vs Probabilistic Hunches (5 Total Points)
Background: Kahneman and Tversky were the first to systematically describe the various human judgement biases in situations involving uncertain outcomes or probabilistic reasoning. In this question, we pitch human intuitions (including our own) against statistical inference, on several examples presented in their original 1972 paper1.
Gender balance in high school classes. There are two programs in a high school. Boys are a majority (65%) in program A, and a minority (45%) in program B. There is an equal number of classes in each of the two programs. You enter a class at random, and observe that 55% of the students are boys.
Question 1.1: Your hunch (0 Points) In 30 seconds or less, what is your guess – does the class belong to program A or to program B?
Question 1.2: Computing the probabilities (4 Points)
Can you compute the probabilities of the class being in program A and program B given the observation, and from there reason about which program is more likely?
Hint: you may need to write out expressions involving the number of students in this class, say n, which is fixed but not provided in this question. Does the probability of the observed class being in program A depend on the class size n?
Question 1.3: Well-reasoned hunch (1 Point)
Without computing the posterior probabilities for program A or B, could you make a reasoned argument, based on properties of the binomial distribution, that program B is more likely? Hint: You may want to consider the ratio of likelihoods.
1Kahneman, D. and Tversky, A., 1972. Subjective probability: A judgment of representativeness. Cognitive psychology, 3(3), pp.430-454.
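The hint in Question 1.2 can be made concrete numerically. Below is a sketch (not a full answer) comparing the two binomial likelihoods for a hypothetical class size n, assuming equal prior probability for the two programs; the values n = 20 and n = 100 are illustrative assumptions, not given in the question.

```python
def posterior_A(n_boys, n_total, p_A=0.65, p_B=0.45):
    """Posterior probability that the class is from program A, assuming
    equal priors over the two programs (the binomial coefficients cancel)."""
    like_A = p_A ** n_boys * (1 - p_A) ** (n_total - n_boys)
    like_B = p_B ** n_boys * (1 - p_B) ** (n_total - n_boys)
    return like_A / (like_A + like_B)

# 55% boys observed, for two hypothetical class sizes
print(posterior_A(11, 20))    # just below 0.5: program B is (barely) more likely
print(posterior_A(55, 100))   # further below 0.5: the answer depends on n
```

That the two printed values differ suggests the posterior does depend on the class size n, which is exactly what the hint asks you to examine.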

Section 2: Exponential Families (60 Total Points)
In the following questions, we will explore an important family of distributions in machine learning: the exponential family (not to be confused with the exponential distribution). Although this section and its questions are littered with mathematical and technical details, one should not miss the elegance and simplicity of exponential family distributions. Through the following questions, we will establish how this family of distributions generalises distributions you have seen before, and explore the different forms of geometric ideas it allows for.
First, let us define the exponential family distributions we will be considering in this question.
Definition 1 (Exponential Family2). Given a function u : R → Rm, we denote an exponential family distribution as Exp(u, η), where η ∈ P ⊂ Rm designates the m-dimensional parameters of the distribution within an exponential family3. The corresponding densities of the distributions are given by
q(x | η) = exp(η⊤u(x) − ψ(η)),  (2.1)

where

ψ(η) = log ∫ exp(η⊤u(x)) dx.  (2.2)
Remark. Notice that Definition 1 represents a special case of the definition given in the Bishop textbook, Eq. (2.194), with h(x) = 1 and g(η) = exp(−ψ(η)). For notational convenience in the tasks specified below, we use this definition.
In this section we will also assume, for simplicity, that any random variable x ∼ Exp(u, η) defined with respect to an exponential family distribution is a continuous random variable. (For technical reasons, we also assume that each dimension of u(·) is continuous, as are its partial derivatives.)
Question 2.1: Verifying valid distribution (5 Points) Show that q(x | η) given by Eqs. (2.1) and (2.2) is a valid probability density function.
To start our exploration of general exponential families, we will consider their application to Bayesian estimation. In particular, we look at the example of a simple 1-dimensional Gaussian distribution.
Question 2.2: A Bayesian example (Part 1) (13 Points) Suppose that we have a likelihood and prior distribution given by,
x | μ ∼ N(μ,σ2)
μ ∼ Exp(u, η),
where N is a 1-dimensional Gaussian distribution with mean μ and standard deviation σ. Show that μ | x ∼ Exp(u, η̂), for some η̂.
Make sure to clearly specify the parameters η̂.
Hint: The integral for q(μ) does not need to be simplified / solved.
To move on from this general case, we make the additional assumption that μ is also Gaussian, by considering the sufficient statistic u(x) = (x, x^2).
2This is actually a simplified version of an exponential family. A more general version can be defined, i.e., multi-dimensional and with a non-unit base measure.
3The exponential family for a fixed u is the set of distributions given by Mu = {Exp(u,η) | η ∈ P}. Two exponential family distributions share the same exponential family if and only if their u’s are the same.
The function u is called the sufficient statistic of the exponential family, and the function ψ is called the log-partition function of the exponential family.
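As a concrete instance of Definition 1 (a sketch using the one-dimensional exponential distribution, which, naming clash aside, is a member of an exponential family): take u(x) = x with support (0, ∞), so that ψ(η) = log ∫_0^∞ exp(ηx) dx = −log(−η) for η < 0, and q(x | η) recovers the familiar rate-λ density with η = −λ. The support restriction is an assumption for illustration; Definition 1 is stated on all of R.

```python
from math import exp, log

def psi(eta):
    # log-partition for u(x) = x on (0, infinity); finite only when eta < 0
    return -log(-eta)

def q(x, eta):
    # exponential-family density: exp(eta * u(x) - psi(eta)) with u(x) = x
    return exp(eta * x - psi(eta))

eta = -2.0  # corresponds to the rate-2 exponential distribution

# matches the textbook density lambda * exp(-lambda * x) with lambda = 2
assert abs(q(0.3, eta) - 2.0 * exp(-2.0 * 0.3)) < 1e-12

# crude midpoint Riemann sum: q should integrate to 1 (cf. Question 2.1)
dx = 1e-4
total = sum(q((i + 0.5) * dx, eta) * dx for i in range(int(20 / dx)))
print(total)  # close to 1
```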

Question 2.3: A Bayesian example (Part 2) (7 Points)
Suppose that we have the same likelihood as per Question 2.2 but with a Gaussian prior distribution:

x | μ ∼ N(μ, σ^2)
μ ∼ N(μ_0, σ_0^2).
First, find the parameters of an exponential family such that Exp(u, η) = N(μ_0, σ_0^2) (in distribution) with u(x) = (x, x^2).
Then use the result of Question 2.2 to find the parameters of μ | x.
Another aspect which makes exponential families convenient to work with in Bayesian inference is the evaluation of maximum likelihood estimation (MLE). For multiple data points {x_i}_{i=1}^N and a general probability distribution q(x | θ) with parameter θ, the MLE corresponds to the maximisation of:
∑_{i=1}^N log q(x_i | θ).  (2.3)

Let us consider the estimation of a parameter η of Exp(u, η) given a single data point x. Given that this
is an exponential family distribution, the log-likelihood is presented in a relatively simple form:
log q(x | η) = η⊤u(x) − ψ(η). (2.4)
In practice, we often maximise the log-likelihood by minimising the negative of Eq. (2.4), i.e., minimising the negative log-likelihood. The maximisation of this quantity for exponential families has many interesting properties which can be explored (for the interested reader, see the convex conjugate).
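As an illustration of how clean this becomes (a sketch reusing the hypothetical one-parameter family with u(x) = x and ψ(η) = −log(−η), i.e., the exponential distribution): setting the derivative of Eq. (2.4) to zero equates ∇ψ(η̂) with the observed sufficient statistic, so the MLE is available in closed form.

```python
import random
from math import log

random.seed(0)
data = [random.expovariate(2.0) for _ in range(5000)]  # true eta = -2

def nll(eta, xs):
    # negative of summed Eq. (2.4): -(eta * u(x) - psi(eta)), psi(eta) = -log(-eta)
    return -sum(eta * x + log(-eta) for x in xs)

# stationarity: grad psi(eta) = -1/eta must equal the mean of u(x) = x
eta_hat = -1.0 / (sum(data) / len(data))

# eta_hat should do no worse than nearby parameter values
assert nll(eta_hat, data) <= nll(1.1 * eta_hat, data)
assert nll(eta_hat, data) <= nll(0.9 * eta_hat, data)
print(eta_hat)  # close to the true value -2
```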
Another important object in statistics is the KL-divergence, which is defined by:

DKL[p(x) : q(x)] = ∫ p(x) log(p(x) / q(x)) dx.  (2.5)

Question 2.4 [COMP8600]: MLE is equivalent to KL minimisation (10 Points)
We will show an equivalence between MLE and the minimisation of the KL-divergence. Let {x_i}_{i=1}^N denote a set of data points we will use to fit an exponential family distribution Exp(u, η). With this, let the corresponding empirical distribution be defined as

q̄(x) = (1/N) ∑_{i=1}^N δ(x − x_i),  (2.6)

where δ(·) is the Dirac delta function.

With q(x | η) as the p.d.f. of Exp(u, η), show that the minimisation of DKL[q̄(x) : q(x | η)] is equivalent to the MLE of η using all data points x_i.

Then, with the additional assumption that u is a linear map, show that the minimisation of DKL[q̄(x) : q(x | η)] is equivalent to MLE using a single data point consisting of the mean of all data points x_i.

Hint: For the first part, use the property that for any continuous function f we have ∫ δ(x − a) · f(x) dx = f(a).

Remark. Notice that the first part of Question 2.4 actually holds for any type of distribution and is not limited to just exponential family distributions.

As we have seen from the preceding questions, the parameters η are key in the analysis of exponential families. These parameters are called the natural parameters of the exponential family. Despite their usefulness, it is sometimes more convenient to use a different set of parameters when analysing exponential family distributions: the distribution's expectation parameters λ, which correspond to the expectation of the sufficient statistic for a specific parametrisation η.
In particular, there is a very nice connection between the expectation parameters λ and the log-partition function ψ(·):
λ := E_{x∼Exp(u,η)}[u(x)] = ∇ψ(η).  (2.7)
Although one could think of λ as a function of η, we make this correspondence implicit. If there are multiple natural and expectation parameters we are interested in, we will subscript them to remove ambiguity, i.e., ηi corresponds to λi or more concretely λi = ∇ψ(ηi).
Eq. (2.7) is key to the next few properties of exponential families we will explore.
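Eq. (2.7) can also be checked numerically; a quick sketch, again assuming the hypothetical exponential-distribution family (u(x) = x, ψ(η) = −log(−η)), comparing a finite-difference gradient of ψ against a Monte Carlo estimate of E[u(x)]:

```python
import random
from math import log

def psi(eta):
    return -log(-eta)  # log-partition for u(x) = x, eta < 0

eta = -2.0
h = 1e-6
grad_psi = (psi(eta + h) - psi(eta - h)) / (2 * h)  # central difference

random.seed(1)
n = 100_000
mean_u = sum(random.expovariate(-eta) for _ in range(n)) / n  # estimates E[u(x)] = E[x]

print(grad_psi, mean_u)  # both close to 0.5, the mean of a rate-2 exponential
```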
One reason why the expectation parameters are of interest is that the KL-divergence of these exponential
families can be characterised by these parameters. For two distributions Exp(u,η1) and Exp(u,η2)
within the same exponential family, we use the shorthand DKL[η1 : η2] := DKL[q(x | η1) : q(x | η2)].
Question 2.5: KL-Divergence for exponential families (15 Points)
Show that the KL-divergence of distributions Exp(u,η1) and Exp(u,η2) within the same expo- nential family is given by:
DKL[η1 : η2] = ψ(η2) − ψ(η1) − λ1⊤(η2 − η1).
Hint: Use Eq. (2.7).
Divergences are an important statistical tool for measuring the 'closeness' of distributions. However, they typically do not exhibit the properties of more common distance functions (metrics). For example, the Euclidean distance is symmetric and satisfies the triangle inequality; the KL-divergence has neither property. Despite this, within an exponential family a generalisation of the Pythagorean theorem exists, where a different notion of "right angle" is used and the squared Euclidean distance is replaced by the KL-divergence:
Figure 1: Question 2.6. A Pythagorean Theorem for KL-divergence.
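The formula stated in Question 2.5 can be sanity-checked numerically before proving it. A sketch, once more assuming the hypothetical exponential-distribution family (η = −rate, ψ(η) = −log(−η), λ = ∇ψ(η) = −1/η), compared against the known closed-form KL between exponential distributions with rates a and b, log(a/b) + b/a − 1; the check also exhibits the asymmetry mentioned above:

```python
from math import log

def psi(eta):
    return -log(-eta)      # log-partition for u(x) = x, eta < 0

def lam(eta):
    return -1.0 / eta      # expectation parameter, the gradient of psi

def kl(eta1, eta2):
    # the formula stated in Question 2.5
    return psi(eta2) - psi(eta1) - lam(eta1) * (eta2 - eta1)

def kl_closed_form(rate1, rate2):
    # standard KL between exponential distributions with the given rates
    return log(rate1 / rate2) + rate2 / rate1 - 1

e1, e2 = -2.0, -0.5
assert abs(kl(e1, e2) - kl_closed_form(2.0, 0.5)) < 1e-12
print(kl(e1, e2), kl(e2, e1))  # differ: the KL-divergence is not symmetric
```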

Question 2.6: Pythagorean Theorem for exponential families (10 Points) Let Exp(u, η1), Exp(u, η2), and Exp(u, η3) be distributions within the same exponential family.
Furthermore, let
a^2 = DKL[η1 : η2];  b^2 = DKL[η2 : η3];  c^2 = DKL[η1 : η3].
Prove that Fig. 1 holds. That is, show that a^2 + b^2 = c^2 iff the difference in natural parameters (η2 − η3) is perpendicular to the difference in expectation parameters (λ1 − λ2).

Section 3: Utilising Expectation Maximisation (55 Total Points)
In this set of questions we will explore the Expectation Maximisation (EM) algorithm in greater detail. We first consider EM for Exponential family Mixture Models (EMMs), which mirrors many aspects of Gaussian Mixture Models (GMMs). We then adapt EM to Bayesian Linear Regression (BLR) and to BLR with non-Gaussian priors. A light refresher of the EM algorithm, with the notation used in this section, can be found in Appendix B.
Understanding EM with EMM. Let us consider a mixture of K exponential family distributions Exp(u, η_k) (all of which come from the same exponential family, i.e., they all share the same sufficient statistic function u(·)) with m-dimensional parameters {η_k ∈ Rm}_{k=1}^K; and N data points sampled from the mixture, {x_n}_{n=1}^N, x_n ∈ R, where the sampling is done with ratios π_1, …, π_K satisfying 0 < π_k < 1 and ∑_{k=1}^K π_k = 1. Note then that our data is X = [x_n] ∈ R^{1×N}, and our parameters are H = [η_k]_{k=1}^K and π = [π_k]_{k=1}^K. We denote both sets of parameters by θ = {H, π} to follow the notation in Appendix B. In summary, to draw a single sample from the mixture we have:

z ∼ Categorical(π)
x ∼ Exp(u, η_z),

where Categorical(π) is the categorical distribution, sampling values in 1 to K (1-of-K coding) with probabilities determined by π. As suggested by the above equations, the use of a (set of) latent variables is useful. Hence, we define Z as the set of latent variables, similar to their definition in Bishop for GMMs. For notation's sake, we denote the densities of exponential family distributions by q(x | η), following the convention set in Definition 1. As such, we have similar equations summarising the main components of our EMM (cf. Bishop Section 9.2):

p(x_n | θ) := ∑_{k=1}^K π_k q(x_n | η_k)  (3.1)
π_k := p(z_nk = 1)  (3.2)
p(x_n | z_n, θ) := ∏_{k=1}^K (q(x_n | η_k))^{z_nk}  (3.3)
p(z_n) := ∏_{k=1}^K π_k^{z_nk}.  (3.4)

Question 3.1: Deriving the EMM expectation step (8 Points)
Derive the expectation we will maximise in EMM EM. In particular, show that

∑_Z p(Z | X, θ^old) log p(X, Z | θ) = ∑_{n=1}^N ∑_{k=1}^K γ^old(z_nk) (log π_k + log q(x_n | η_k)),  (3.5)

where

γ^old(z_nk) := ∑_Z p(Z | X, θ^old) z_nk = ∑_{z_n} p(z_n | X, θ^old) z_nk = p(z_nk = 1 | x_n, θ^old) = π_k^old q(x_n | η_k^old) / ∑_j π_j^old q(x_n | η_j^old),  (3.6)–(3.7)

which holds because Z is independent in n and because z_n is a 1-of-K coding.

Hint: You will first need to derive the complete likelihood p(X, Z | θ), and may want to use Eqs. (3.3) and (3.4).

Remark. Notice that these equations are similar to those presented in Bishop, namely Eqs. (9.40) and (9.39).
Further note that Eq. (9.39) is incorrect (see the errata, page 19). It follows that in the E-step, we need to calculate {γ^old(z_nk)} for n = 1, …, N and k = 1, …, K.

Now that we have the objective to maximise (the expectation to maximise), as given by Eq. (3.5), the EM updates can be derived. However, unlike in regular GMMs, the updates we initially derive do not correspond to the original natural parameters H of the exponential distributions. Instead, we actually get an update to the expectation parameters Λ := [λ_k]_{k=1}^K; that is, we get the update H^old → Λ. Despite this, there is in fact a method to "re-project" the updated expectation parameters back to natural parameters. We will use, without proof, the transformation

η = ∇φ(λ),  (3.8)

where φ is a convex function (you will not need to know the details4). Note the similarity of this transformation to that in Eq. (2.7).

Question 3.2: Deriving the EMM maximisation step (7 Points)
Derive the maximisation updates of EMM EM for the mixing parameters π and expectation parameters Λ. That is, show that

π_k = N_k / N  (3.9)

and

λ_k = (1/N_k) ∑_{n=1}^N u(x_n) · γ^old(z_nk),  (3.10)

where N_k = ∑_{n=1}^N γ^old(z_nk). Thus show that we have the update equation for the natural parameters H given by

η_k = ∇φ( (1/N_k) ∑_{n=1}^N u(x_n) · γ^old(z_nk) ).  (3.11)

Hint: Use Lagrange multipliers for Eq. (3.9) to enforce the constraint ∑_k π_k = 1, and use Eq. (2.7) for Eq. (3.10).

Remark. The update in Eq. (3.11) can be thought of as a series of optimisation and projection steps, as alluded to earlier: we do an optimisation step to obtain the next λ, then project the expectation parameters back into the natural parameter space.

Now that we have our update equations, it is time for the algorithm's implementation. In particular, you are tasked to implement the EMM algorithm in its generality (as per the prior equations and questions).
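To make the generative process above concrete, here is a minimal sampling sketch. It uses Gaussian components, matching the special case used in the implementation task below; the particular mixing ratios and component parameters are illustrative assumptions.

```python
import random

random.seed(0)
pis = [0.2, 0.5, 0.3]                               # mixing ratios, sum to 1
components = [(-4.0, 1.0), (0.0, 0.5), (3.0, 2.0)]  # (mean, std) per component

def sample_mixture(n):
    xs, zs = [], []
    for _ in range(n):
        z = random.choices(range(len(pis)), weights=pis)[0]  # z ~ Categorical(pi)
        mu, sigma = components[z]
        xs.append(random.gauss(mu, sigma))  # x ~ component z (Gaussian special case)
        zs.append(z)
    return xs, zs

xs, zs = sample_mixture(10_000)
freq = [zs.count(k) / len(zs) for k in range(len(pis))]
print(freq)  # empirical component frequencies, close to pis
```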
The algorithm will then be tested in the simple case where the exponential families of the mixture are simplified to Gaussian distributions (thus simplifying the EMM to a GMM). In particular, we will use the sufficient statistic u(x) = (x, x^2), as presented in Question 2.3.

Question 3.3: EMM implementation (15 Points)
Implement weighted_probs(), e_step_EMM(), m_step_EMM() in emm_question.py. You can test your implemented algorithm using implementation_viewer.ipynb, a Jupyter notebook which includes a visualisation of 1-dimensional GMMs.
Hint: Note that you can do this coding question without completing the corresponding theory questions: the equations for the E- and M-steps are given in Questions 3.1 and 3.2.

4It turns out that φ is the convex conjugate of the log-partition function ψ.

Bayesian Linear Regression. We now revisit Bayesian Linear Regression (BLR) for a more complex use case of EM. In particular, given a dataset X = (φ(X′), t) = {(φ(x′_n), t_n)}_{n=1}^N ∈ R^{N×(M+1)} of training data, our model assumes that t_n = y(φ(x′_n), w) + ε, where φ is our feature map, ε is small Gaussian noise, and w has a Gaussian prior. We define Φ = φ(X′) ∈ R^{N×M} and φ_n = φ(x′_n) ∈ R^M to simplify notation. This is clearly not a mixture model, but it fits the criteria for EM: we have incomplete data from which we cannot describe the full likelihood, but there are other variables which would make it possible. It is not possible to specify the probability of the data X and the parameters w by themselves; we need the parameters of the two Gaussian distributions mentioned, which will be β^{−1} and α^{−1} respectively. Specifically,

p(t_n | φ_n, w, β) = N(t_n | y(φ_n, w), β^{−1})  (3.12)
p(w | α) = N(w | 0, α^{−1} I).  (3.13)

It turns out to be easier to consider w as the latent variable over which we will take the expectation, and {α, β} as the parameters of the model. Thus the complete-data likelihood is

p(X, Z | θ) = p(X, w | α, β) = ∏_{n=1}^N p((t_n, φ_n), w | α, β) = ∏_{n=1}^N p(t_n | φ_n, w, β) p(w | α).  (3.14)
Question 3.4 [COMP8600]: Deriving the BLR expectation step (5 Points)
Derive the expectation we will maximise in BLR EM. In particular, show that

E_{Z|X,θ^old}[log p(t, w | α, β)] = (M/2) log(α/2π) − (α/2) E[w⊤w] + (N/2) log(β/2π) − (β/2) ∑_{n=1}^N E[(t_n − w⊤φ_n)^2].  (3.15)

With the objective for the expectation step of BLR EM, the update parameters for BLR can now be derived.

Question 3.5 [COMP8600]: Deriving the BLR maximisation step (Part 1) (2 Points)
Derive the maximisation updates for BLR EM. That is, maximise Eq. (3.15) w.r.t. α to get

α = M / E[w⊤w],  (3.16)

and maximise Eq. (3.15) w.r.t. β to get

β = N / ∑_{n=1}^N E[(t_n − w⊤φ_n)^2].  (3.17)

Remark. Note that Eqs. (3.16) and (3.17) are the update rules for the BLR M-step.

Despite having the update equations Eqs. (3.16) and (3.17) as presented in Question 3.5, these updates cannot be calculated directly as is. First we need to compute the denominators of each of the equations. Thankfully, there are some simple equations which can be derived for each of these quantities.

Question 3.6 [COMP8600]: Deriving the BLR maximisation step (Part 2) (3 Points)
Show that the following equations hold:

E[w⊤w] = m_N⊤ m_N + tr(S_N)  (3.18)
∑_{n=1}^N E[(t_n − w⊤φ_n)^2] = ∥t − Φ m_N∥^2 + tr(Φ⊤Φ S_N),  (3.19)

where m_N = β S_N Φ⊤ t and S_N^{−1} = α I + β Φ⊤Φ.

Hint: you may use that w ∼ N(m_N, S_N), as per (3.49) in Bishop. You may also use the following identities: given a vector of random variables v ∈ Rn with mean μ ∈ Rn and covariance Σ ∈ Rn×n, and a constant vector a ∈ Rn and constant matrix A ∈ Rn×n,

E[v⊤a] = μ⊤a  (3.20)
E[v⊤Av] = μ⊤Aμ + tr(AΣ).  (3.21)

Question 3.7: Implementing BLR within the EM framework (15 Points)
Implement single_EM_iter_blr() in blr_question.py. You can test your implemented algorithm using implementation_viewer.ipynb, a Jupyter notebook which includes a visualisation of the BLR EM algorithm.
Hint: Note that you can do these coding questions without having done the above theory questions: the equations for the E- and M-steps are given in the question text of Questions 3.4 to 3.6.
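The identities in Question 3.6 lend themselves to a quick numerical sanity check. The sketch below uses a tiny hand-rolled example (M = 2; the design matrix, targets, α and β are all illustrative assumptions) with m_N and S_N computed from the formulas quoted in the question; it compares the per-point expansion E[(t_n − w⊤φ_n)^2] = (t_n − m_N⊤φ_n)^2 + φ_n⊤S_Nφ_n, valid when w ∼ N(m_N, S_N), against the matrix form on the right-hand side of Eq. (3.19).

```python
alpha, beta = 0.5, 2.0
Phi = [[1.0, 0.3], [1.0, -0.7], [1.0, 1.2], [1.0, 0.1]]  # N = 4 rows, M = 2
t = [0.5, -0.2, 1.1, 0.4]
N, M = len(Phi), 2

# Gram matrix G = Phi^T Phi and precision A = S_N^{-1} = alpha*I + beta*G
G = [[sum(Phi[n][i] * Phi[n][j] for n in range(N)) for j in range(M)] for i in range(M)]
A = [[(alpha if i == j else 0.0) + beta * G[i][j] for j in range(M)] for i in range(M)]

# invert the 2x2 precision matrix by hand to get S_N
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
SN = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

# m_N = beta * S_N * Phi^T t
Phi_t = [sum(Phi[n][i] * t[n] for n in range(N)) for i in range(M)]
mN = [beta * sum(SN[i][j] * Phi_t[j] for j in range(M)) for i in range(M)]

# left: sum of per-point expectations; right: matrix form of Eq. (3.19)
pred = [sum(mN[i] * Phi[n][i] for i in range(M)) for n in range(N)]
lhs = sum((t[n] - pred[n]) ** 2
          + sum(Phi[n][i] * SN[i][j] * Phi[n][j] for i in range(M) for j in range(M))
          for n in range(N))
rhs = sum((t[n] - pred[n]) ** 2 for n in range(N)) \
      + sum(G[i][j] * SN[j][i] for i in range(M) for j in range(M))
print(lhs, rhs)  # agree up to floating-point rounding

# these quantities then feed the M-step updates, e.g. Eq. (3.16):
E_wTw = mN[0] ** 2 + mN[1] ** 2 + SN[0][0] + SN[1][1]
alpha_new = M / E_wTw
print(alpha_new)
```

Because tr(Φ⊤Φ S_N) = ∑_n φ_n⊤S_Nφ_n, the two sides agree exactly; note this check only exercises the algebra of the identities, not the derivation itself.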
A Mark Allocation

The following section details the mark allocation for both COMP4670 and COMP8600 students.

COMP8600 students are assigned all questions listed in the assignment (both gray and blue background). The total number of marks available is 120.

COMP4670 students are given the choice of either completing only the non-COMP8600 questions (gray background) or all questions in the assignment (both gray and blue background). In either case, the student will receive the best total mark with respect to each selection of questions. Thus, attempting additional questions will not lower your grade in any circumstance.

In addition to these base marks, students will receive 2 extra marks (capped at 100 or 120 marks) if the theory side of their assignment is submitted in LaTeX. We warn that this may take a significant amount of time depending on the student's familiarity, but we encourage the use of typesetting for this assignment. A short post on Piazza will be made with tips for learning LaTeX. A template file on Overleaf is also provided for students.