
Machine Learning and Data Mining in Business
Week 6 Tutorial
When studying these exercises, please keep in mind that they are about problem-solving techniques for machine learning. In general, they’re not about particular distributions or learning algorithms.
Question 1


Let $Y_1, Y_2, \ldots, Y_n \sim \text{Poisson}(\lambda)$. Recall that the Poisson distribution has probability mass function
$$p(y; \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}.$$
(a) Write down the likelihood for a sample $y_1, \ldots, y_n$.
Solution: Likelihood function:
$$L(\lambda) = \prod_{i=1}^{n} p(y_i; \lambda) = \prod_{i=1}^{n} \frac{\exp(-\lambda) \lambda^{y_i}}{y_i!} \propto \prod_{i=1}^{n} \exp(-\lambda) \lambda^{y_i}$$
The symbol ∝ means “proportional to”. In practice, we automatically drop any terms that do not depend on the parameter since they are irrelevant for optimisation.
(b) Derive a simple expression for the log-likelihood.

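Solution: Log-likelihood function:
$$\begin{aligned}
l(\lambda) &= \log L(\lambda) \\
&= \log\left( \prod_{i=1}^{n} \frac{\exp(-\lambda) \lambda^{y_i}}{y_i!} \right) \\
&= \sum_{i=1}^{n} \log\left( \frac{\exp(-\lambda) \lambda^{y_i}}{y_i!} \right) \\
&= \sum_{i=1}^{n} \Big[ \log\big(\exp(-\lambda)\big) + \log\big(\lambda^{y_i}\big) - \log(y_i!) \Big] \\
&= \sum_{i=1}^{n} \Big[ -\lambda + y_i \log(\lambda) - \log(y_i!) \Big] \\
&= -n\lambda + \left( \sum_{i=1}^{n} y_i \right) \log(\lambda) - \sum_{i=1}^{n} \log(y_i!)
\end{aligned}$$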
Essentially, all that we did was to repeatedly apply the laws of exponents and logarithms from the basic facts training and review notes provided at the start of the semester.
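As a quick numerical sanity check of the simplification, the sketch below compares the log of the raw likelihood product with the simplified log-likelihood. The sample `ys` and the value `lam` are made up for illustration, and the function names are hypothetical; nothing here comes from the tutorial itself.

```python
import math

def likelihood(ys, lam):
    """Raw likelihood: product of Poisson probabilities exp(-lam) * lam**y / y!."""
    return math.prod(math.exp(-lam) * lam**y / math.factorial(y) for y in ys)

def log_likelihood(ys, lam):
    """Simplified form: -n*lam + (sum of y_i)*log(lam) - sum of log(y_i!)."""
    n = len(ys)
    # math.lgamma(y + 1) equals log(y!) for non-negative integers y.
    log_factorials = sum(math.lgamma(y + 1) for y in ys)
    return -n * lam + sum(ys) * math.log(lam) - log_factorials

ys = [2, 0, 3, 1, 4]   # hypothetical sample (assumption)
lam = 1.7              # arbitrary parameter value (assumption)

direct = math.log(likelihood(ys, lam))
simplified = log_likelihood(ys, lam)
print("log of product :", direct)
print("simplified form:", simplified)
```

The two printed values should agree up to floating-point rounding. For large samples the raw product underflows, which is one practical reason to work with the log-likelihood directly.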
(c) Let the objective function for optimisation be the negative log-likelihood. Find the critical point of the cost function.
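Solution: Objective function (dropping constant terms from the negative log-likelihood):
$$J(\lambda) = n\lambda - \sum_{i=1}^{n} y_i \log(\lambda)$$
First derivative with respect to the parameter:
$$\frac{dJ}{d\lambda} = n - \frac{1}{\lambda} \sum_{i=1}^{n} y_i$$
First-order necessary condition:
$$n - \frac{1}{\hat{\lambda}} \sum_{i=1}^{n} y_i = 0$$
Solving the equation:
$$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} y_i$$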
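The critical point can also be checked numerically. The sketch below minimises the cost function by brute force over a grid and compares the result with the sample mean. The sample, the grid range, and the function names are assumptions chosen for illustration only.

```python
import math

def cost(lam, ys):
    """J(lambda) = n*lambda - (sum of y_i) * log(lambda), constants dropped."""
    return len(ys) * lam - sum(ys) * math.log(lam)

def cost_derivative(lam, ys):
    """dJ/dlambda = n - (sum of y_i) / lambda."""
    return len(ys) - sum(ys) / lam

ys = [2, 0, 3, 1, 4]                 # hypothetical sample (assumption)
sample_mean = sum(ys) / len(ys)

# Brute-force search over candidate values 0.001, 0.002, ..., 10.0.
grid = [i / 1000 for i in range(1, 10001)]
lam_grid = min(grid, key=lambda lam: cost(lam, ys))

print("sample mean              :", sample_mean)
print("grid minimiser           :", lam_grid)
print("derivative at sample mean:", cost_derivative(sample_mean, ys))
```

A grid search is used here only because it makes no assumptions about the shape of the cost function; it is not an efficient way to compute the estimate when a closed form is available.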
(d) Show that the critical point is the MLE.

Solution: The second derivative of the cost function is
$$\frac{d^2 J}{d\lambda^2} = \frac{1}{\lambda^2} \sum_{i=1}^{n} y_i,$$
which is positive as long as at least one of the data points $y_i$ is not zero (recall that $y \in \{0, 1, 2, \ldots\}$ for the Poisson distribution).
Therefore, $J(\lambda)$ is strictly convex on $\lambda > 0$ and we can conclude that $\hat{\lambda}$ is the MLE.
(e) You can create as many additional exercises of this type as you like by picking any simple statistical distribution and answering the same questions.
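As one possible illustration of such an additional exercise (an example chosen here, not taken from the tutorial), the same steps can be sketched for the exponential distribution with rate $\theta > 0$, assuming the density $p(y; \theta) = \theta \exp(-\theta y)$ for $y \geq 0$:

```latex
\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} \theta \exp(-\theta y_i)
           = \theta^{n} \exp\Big(-\theta \sum_{i=1}^{n} y_i\Big) \\
l(\theta) &= n \log(\theta) - \theta \sum_{i=1}^{n} y_i \\
J(\theta) &= -n \log(\theta) + \theta \sum_{i=1}^{n} y_i \\
\frac{dJ}{d\theta} &= -\frac{n}{\theta} + \sum_{i=1}^{n} y_i = 0
  \quad\Longrightarrow\quad
  \hat{\theta} = \frac{n}{\sum_{i=1}^{n} y_i} = \frac{1}{\bar{y}} \\
\frac{d^2 J}{d\theta^2} &= \frac{n}{\theta^{2}} > 0
\end{aligned}
```

Since the second derivative is positive for every $\theta > 0$, the cost function is strictly convex and the critical point is the MLE, provided that $\bar{y} > 0$.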
Question 2
In addition to being good practice, this exercise derives results that will be very useful later.
Consider the model $Y_1, Y_2, \ldots, Y_n \sim \text{Bernoulli}\big(\sigma(\beta)\big)$, where $\beta \in \mathbb{R}$ is a parameter and $\sigma$ is the sigmoid function
$$\sigma(\beta) = \frac{1}{1 + \exp(-\beta)}.$$
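You can think of this model as a logistic regression that only has the intercept. Following the lecture, the optimisation problem for estimating this model is
$$\underset{\beta}{\text{minimise}} \quad J(\beta) = \sum_{i=1}^{n} \Big[ -y_i \log\big(\sigma(\beta)\big) - (1 - y_i) \log\big(1 - \sigma(\beta)\big) \Big].$$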
(a) Differentiate σ(β).
Solution: Starting from
$$\sigma(\beta) = \frac{1}{1 + \exp(-\beta)},$$
we have that
$$\sigma'(\beta) = \frac{\exp(-\beta)}{\big(1 + \exp(-\beta)\big)^2},$$
using the chain rule, the reciprocal rule, and the derivative of the exponential function.

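(b) Show that $\sigma'(\beta) = \sigma(\beta)\big(1 - \sigma(\beta)\big)$.
Solution:
$$\sigma'(\beta) = \frac{\exp(-\beta)}{\big(1 + \exp(-\beta)\big)^2} = \frac{1}{1 + \exp(-\beta)} \cdot \frac{\exp(-\beta)}{1 + \exp(-\beta)} = \sigma(\beta)\big(1 - \sigma(\beta)\big),$$
noting that
$$1 - \sigma(\beta) = 1 - \frac{1}{1 + \exp(-\beta)} = \frac{1 + \exp(-\beta) - 1}{1 + \exp(-\beta)} = \frac{\exp(-\beta)}{1 + \exp(-\beta)}.$$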
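The identity can be checked numerically. The sketch below compares three quantities at a few arbitrary values of β: the quotient form of the derivative, the product form σ(β)(1 − σ(β)), and a central finite-difference approximation. The step size `h` and the test points are assumptions chosen for illustration.

```python
import math

def sigmoid(beta):
    """sigma(beta) = 1 / (1 + exp(-beta))."""
    return 1.0 / (1.0 + math.exp(-beta))

def sigmoid_prime_quotient(beta):
    """Derivative in quotient form: exp(-beta) / (1 + exp(-beta))**2."""
    return math.exp(-beta) / (1.0 + math.exp(-beta)) ** 2

def sigmoid_prime_product(beta):
    """Derivative in product form: sigma(beta) * (1 - sigma(beta))."""
    s = sigmoid(beta)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-6):
    """Numerical derivative (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

test_points = (-3.0, -0.5, 0.0, 1.2, 4.0)   # arbitrary values (assumption)
for beta in test_points:
    print(beta,
          sigmoid_prime_quotient(beta),
          sigmoid_prime_product(beta),
          central_difference(sigmoid, beta))
```

The three columns should agree closely at every test point, with the finite-difference column matching only up to its approximation error.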
(c) Find the derivative of J(β) using the chain rule and the previous result.
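Solution: The derivative is
$$\begin{aligned}
\frac{dJ}{d\beta} &= \sum_{i=1}^{n} \left[ -y_i \frac{\sigma'(\beta)}{\sigma(\beta)} + (1 - y_i) \frac{\sigma'(\beta)}{1 - \sigma(\beta)} \right] \\
&= \sum_{i=1}^{n} \left[ -y_i \frac{\sigma(\beta)\big(1 - \sigma(\beta)\big)}{\sigma(\beta)} + (1 - y_i) \frac{\sigma(\beta)\big(1 - \sigma(\beta)\big)}{1 - \sigma(\beta)} \right] \\
&= \sum_{i=1}^{n} \Big[ -y_i \big(1 - \sigma(\beta)\big) + (1 - y_i)\, \sigma(\beta) \Big] \\
&= \sum_{i=1}^{n} \Big[ -y_i + y_i \sigma(\beta) + \sigma(\beta) - y_i \sigma(\beta) \Big] \\
&= \sum_{i=1}^{n} \big[ \sigma(\beta) - y_i \big] \\
&= n \sigma(\beta) - \sum_{i=1}^{n} y_i
\end{aligned}$$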
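A finite-difference gradient check is a common way to catch algebra slips in a derivation like this one. The sketch below evaluates the cost function and the closed-form derivative nσ(β) − Σ yᵢ on a made-up binary sample and compares the latter with a numerical derivative. The data, test points, and function names are assumptions.

```python
import math

def sigmoid(beta):
    return 1.0 / (1.0 + math.exp(-beta))

def cost(beta, ys):
    """J(beta): sum of -y*log(sigma) - (1 - y)*log(1 - sigma) over the sample."""
    s = sigmoid(beta)
    return sum(-y * math.log(s) - (1 - y) * math.log(1.0 - s) for y in ys)

def cost_gradient(beta, ys):
    """Closed-form derivative: n * sigma(beta) - sum of y_i."""
    return len(ys) * sigmoid(beta) - sum(ys)

def numerical_gradient(beta, ys, h=1e-6):
    """Central finite-difference approximation of dJ/dbeta."""
    return (cost(beta + h, ys) - cost(beta - h, ys)) / (2.0 * h)

ys = [1, 0, 1, 1, 0, 1, 0, 1]        # hypothetical binary sample (assumption)
test_points = (-2.0, 0.0, 0.3, 1.5)  # arbitrary values (assumption)
for beta in test_points:
    print(beta, cost_gradient(beta, ys), numerical_gradient(beta, ys))
```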
(d) Find the critical point of J(β).
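Solution: The first-order necessary condition is
$$n \sigma(\hat{\beta}) - \sum_{i=1}^{n} y_i = 0.$$
Rearranging,
$$\sigma(\hat{\beta}) = \bar{y},$$
where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the sample mean. We conclude that the critical point is
$$\hat{\beta} = \sigma^{-1}(\bar{y}),$$
where $\sigma^{-1}$ denotes the inverse of the sigmoid function, i.e. the logit function.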
(e) What is the second derivative of the cost function? Show that the objective function is convex.
Solution: Using the first derivative above,
$$\frac{d^2 J}{d\beta^2} = n \sigma'(\beta) = n \sigma(\beta)\big(1 - \sigma(\beta)\big).$$
This expression is strictly positive since $\sigma(\beta) \in (0, 1)$. Therefore, the cost function is strictly convex.
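With both derivatives available, the critical point can be confirmed with a few Newton iterations, β ← β − J′(β)/J″(β). Newton's method is a choice made for this sketch rather than something the tutorial prescribes, and the data, starting value, and iteration count are assumptions. The result is compared with the closed form logit(ȳ) = log(ȳ / (1 − ȳ)).

```python
import math

def sigmoid(beta):
    return 1.0 / (1.0 + math.exp(-beta))

def newton_minimise(ys, beta=0.0, steps=25):
    """Minimise J(beta) with Newton updates using the derived derivatives."""
    n, total = len(ys), sum(ys)
    for _ in range(steps):
        s = sigmoid(beta)
        gradient = n * s - total          # dJ/dbeta
        curvature = n * s * (1.0 - s)     # d^2J/dbeta^2, strictly positive
        beta -= gradient / curvature
    return beta

ys = [1, 0, 1, 1, 0, 1, 0, 1]             # hypothetical binary sample (assumption)
y_bar = sum(ys) / len(ys)

beta_closed_form = math.log(y_bar / (1.0 - y_bar))   # logit of the sample mean
beta_newton = newton_minimise(ys)

print("logit(y_bar):", beta_closed_form)
print("Newton      :", beta_newton)
```

The closed form only exists when 0 < ȳ < 1. If every observation is 0, or every observation is 1, the cost function has no finite minimiser and the iteration above would diverge.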
Question 3
Support vector machines (SVMs) were a major development in machine learning in the mid-1990s due to their state-of-the-art performance and novelty at the time. Since then, researchers have discovered that support vector machines can be reformulated as regularised estimation, establishing a deep connection to classical methods such as logistic regression.
In support vector classification (SVC), we consider a binary classification problem and encode the response as $y \in \{-1, 1\}$. The method is based on the linear decision function
$$f(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$
and classification rule
$$\hat{y} = \text{sign}\big(f(x)\big),$$
which means that $\hat{y} = 1$ if $f(x) > 0$ and $\hat{y} = -1$ if $f(x) < 0$. The set $\{x : f(x) = 0\}$ is the decision boundary. Thus, we can view $|f(x)|$ as a measure of the learning algorithm’s confidence that the observation is correctly classified.
The support vector classifier learns the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ by regularised empirical risk minimisation based on the hinge loss
$$L\big(y, f(x)\big) = \max\big(0, 1 - y f(x)\big).$$
This figure from the ISL textbook plots the hinge loss and the cross-entropy loss (negative log-likelihood loss) for $y = 1$. The figure calls the latter the logistic regression loss because in this formulation, the prediction $f(x)$ in the loss function $L(y, f(x))$ is a prediction for the logit of the probability.
[Figure: the hinge loss and the logistic regression loss plotted against $y_i(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip})$.]
(a) Write down the learning rule for a support vector classifier based on $\ell_2$ regularisation.
Solution:
$$\underset{\beta}{\text{minimise}} \quad \sum_{i=1}^{n} \max\Big(0, 1 - y_i (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip})\Big) + \lambda \sum_{j=1}^{p} \beta_j^2$$
(b) Consider the term $y f(x)$ from the hinge loss. What is the classification when $y_i f(x_i) > 0$ compared to $y_i f(x_i) < 0$?
Solution: We have that $y f(x) > 0$ if $y$ and $f(x)$ both have the same sign. Therefore, $y f(x) > 0$ occurs when the observation is correctly classified. Likewise, $y f(x) < 0$ occurs when $y$ and $f(x)$ have opposite signs, which means that the instance is incorrectly classified.
(c) Interpret the hinge loss function by considering the following cases:
1. $y f(x) > 1$
2. $0 \leq y f(x) \leq 1$
3. $y f(x) < 0$
Solution:
1. When $y f(x) > 1$, the observation is correctly classified and the loss is zero since $1 - y f(x) < 0$. This occurs when $f(x)$ has the right sign and is sufficiently far from zero. In words, the loss is zero when the model classifies correctly with sufficiently high confidence.
2. When $0 \leq y f(x) \leq 1$, the instance is correctly classified but $f(x)$ is close to the decision boundary of zero. The loss increases linearly as $f(x)$ gets closer to zero. This is a loss for not being sufficiently confident in the right prediction.
3. When $y f(x) < 0$, the observation is incorrectly classified. The loss increases linearly in $|f(x)|$ as $f(x)$ moves further from the decision boundary on the wrong side. This is a loss for classifying incorrectly with increasing degrees of confidence.
(d) The observations that satisfy $y f(x) < 1$ are called the support vectors. Why do the support vectors have special relevance in this method?
Solution: The hinge loss is nonzero only for the support vectors. Therefore, only the support vectors contribute to the loss term of the cost function at the solution. That implies that the learned model depends only on the support vectors.
(e) Interpret the figure from the beginning of the exercise. Compare the hinge and logistic regression loss functions to the zero-one loss and to each other. What do we learn about loss functions for binary classification?
Solution: The zero-one loss is $L(y, \hat{y}) = I(y \neq \hat{y})$. In the context of this exercise, we can write
$$L\big(y, f(x)\big) = I\big(y \, \text{sign}(f(x)) \neq 1\big),$$
which equals zero when $y f(x) > 0$ and one otherwise, so we could add it to the figure as a step function. In the zero-one loss, all that matters is whether the observation is correctly classified. In contrast, the hinge and cross-entropy losses also take into account the “confidence” of the classifier in the prediction.
For yf(x) < 0, the hinge loss increases linearly and the cross-entropy loss tends towards linearity as f(x) becomes more negative, indicating higher confidence in the wrong classification. For yf(x) > 0, both the hinge loss and the cross-entropy loss penalise the classifier for insufficient confidence in the right prediction. The key difference between the two loss functions is that the hinge loss is zero for all instances that are correctly classified with sufficiently high confidence, while the cross-entropy loss slowly tends towards zero as yf(x) increases.
Overall, the hinge and cross-entropy loss functions are quite similar. The logistic regression loss is smooth, while the hinge loss has a “kink” where 1 − yf (x) = 0. Informally, we can say that the logistic regression loss is almost like a smooth (differentiable) version of the hinge loss.
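The comparison can be made concrete by tabulating the three losses as functions of the margin m = y f(x). The sketch below assumes the standard form log(1 + exp(−m)) for the logistic regression loss with y ∈ {−1, 1}, and counts a point on the decision boundary (m = 0) as an error for the zero-one loss. The toy data set and the coefficients `beta0`, `beta1` are hand-picked for illustration and are not fitted values.

```python
import math

def hinge_loss(margin):
    """max(0, 1 - y*f(x)), written in terms of the margin y*f(x)."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """log(1 + exp(-y*f(x))): cross-entropy loss for y in {-1, 1} (assumed form)."""
    return math.log1p(math.exp(-margin))

def zero_one_loss(margin):
    """1 if the observation is not correctly classified, else 0."""
    return 1.0 if margin <= 0 else 0.0

print("margin  zero-one  hinge  logistic")
for margin in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"{margin:6.1f}  {zero_one_loss(margin):8.0f}  "
          f"{hinge_loss(margin):5.2f}  {logistic_loss(margin):8.4f}")

# Toy one-dimensional data (x, y) with hypothetical, hand-picked coefficients.
beta0, beta1 = -1.0, 2.0
data = [(-1.0, -1), (0.2, -1), (0.6, 1), (0.9, -1), (1.5, 1), (3.0, 1)]
margins = [y * (beta0 + beta1 * x) for x, y in data]

# Support vectors are the observations with margin below one.
support_vectors = [point for point, m in zip(data, margins) if m < 1.0]
total_hinge = sum(hinge_loss(m) for m in margins)
print("support vectors:", support_vectors)
print("total hinge loss:", total_hinge)
```

In the table, the hinge loss is exactly zero once the margin reaches one, while the logistic loss stays positive and only decays towards zero. In the toy data, only the observations flagged as support vectors contribute to the total hinge loss.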
