Machine Learning 1 TU Berlin, WiSe 2020/21
Maximum Likelihood Parameter Estimation
In this first exercise, we would like to use the maximum-likelihood method to estimate the best parameter of a data density model p(x|θ) with respect to some dataset D = (x1,…,xN), and use that approach to build a classifier. Assuming the data points are generated independently and identically distributed (i.i.d.), the dataset likelihood is given by
p(D|θ) = ∏_{k=1}^{N} p(x_k|θ)

and the maximum likelihood solution is then computed as
θ̂ = argmax_θ p(D|θ) = argmax_θ log p(D|θ)
where the log term can also be expressed as a sum, i.e.
log p(D|θ) = ∑_{k=1}^{N} log p(x_k|θ).
As a first step, we load some useful libraries for numerical computations and plotting.
In [1]: import numpy
import matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
na = numpy.newaxis
We now consider the univariate data density model
p(x|θ) = (1/π) · 1/(1 + (x − θ)²)
also known as the Cauchy distribution with fixed parameter γ = 1, and with parameter θ unknown. Compared to the Gaussian distribution, the Cauchy distribution is heavy-tailed, and this can be useful to handle the presence of outliers in the data generation process. The probability density function is implemented below.
In [2]: def pdf(X,THETA):
return (1.0 / numpy.pi) * (1.0 / (1+(X-THETA)**2))
Note that the function can be called with scalars or with numpy arrays, and if feeding arrays of different shape, numpy broadcasting rules will apply. Our first step will be to implement a function that estimates the optimal parameter θˆ in the maximum likelihood sense for some dataset D.
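For instance, a small sketch of the broadcasting behaviour (the arrays x_demo and theta_demo below are only illustrative): evaluating every data point under every candidate parameter at once yields a matrix with one row per parameter and one column per data point.

x_demo = numpy.array([0.0, 1.0, 2.0])              # three data points
theta_demo = numpy.array([-1.0, 0.0, 1.0, 2.0])    # four candidate parameters
P = pdf(x_demo[na,:], theta_demo[:,na])            # shape (4,3): P[i,j] = p(x_j | theta_i)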
Task (10 P):
• Implement a function that takes a dataset D as input (given as one-dimensional array of numbers) and a list of candidate parameters θ (also given as a one-dimensional array), and returns a one- dimensional array containing the log-likelihood w.r.t. the dataset D for each parameter θ.
In [3]: def ll(D,THETA):
            # --------------------------------------
            # TODO: replace by your code
            # --------------------------------------
            import solution; return solution.ll(D,THETA)
            # --------------------------------------
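One possible way to fill in this function (a sketch, not necessarily the reference solution) is to broadcast the dataset against the candidate parameters and sum the log-densities along the data axis:

def ll(D, THETA):
    # log-likelihood log p(D|theta) for every candidate theta:
    # rows index parameters, columns index data points
    return numpy.log(pdf(D[na,:], THETA[:,na])).sum(axis=1)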
To test the method, we apply it to some dataset, and plot the log-likelihood for some plausible range of parameters θ.
In [4]: D = numpy.array([ 2.803, -1.563, -0.853, 2.212, -0.334, 2.503])
THETA = numpy.linspace(-10,10,1001)
plt.grid(True)
plt.plot(THETA,ll(D,THETA))
plt.xlabel(r'$\theta$')
plt.ylabel(r'$\log p(\mathcal{D}|\theta)$')
plt.show()
We observe that the log-likelihood has two peaks: one around θ = −0.5 and one around θ = 2. The second peak is the higher one, hence it is retained as the maximum likelihood solution.
Building a Classifier
We now would like to use the maximum likelihood technique to build a classifier. We consider a labeled dataset where the data associated to the two classes are given by:
In [5]: D1 = numpy.array([ 2.803, -1.563, -0.853, 2.212, -0.334, 2.503])
D2 = numpy.array([-4.510, -3.316, -3.050, -3.108, -2.315])
To be able to classify new data points, we consider the discriminant function
g(x) = log p(x|θ̂₁) − log p(x|θ̂₂) + log P(ω₁) − log P(ω₂)
where the first two terms can be computed from our maximum likelihood estimates, and where the last two terms are the class prior probabilities. The classifier decides ω1 if g(x) > 0 and ω2 if g(x) < 0. We would like to implement a maximum-likelihood based classifier.
Tasks (10 P):
• Implement the function fit that receives as input a vector of candidate parameters θ and the dataset associated with each class, and produces the maximum likelihood parameter estimates. (Hint: from your function fit, you can call the function ll you have previously implemented.)
• Implement the function predict that takes as input the prior probability for each class and a vector of points X on which to evaluate the discriminant function, and that outputs a vector containing the value of g for each point in X.
In [6]: class MLClassifier:

            def fit(self,THETA,D1,D2):
                # --------------------------------------
                # TODO: replace by your code
                # --------------------------------------
                import solution
                self.theta1,self.theta2 = solution.fit(THETA,D1,D2)
                # --------------------------------------

            def predict(self,X,p1,p2):
                # --------------------------------------
                # TODO: replace by your code
                # --------------------------------------
                import solution
                return solution.predict(self.theta1,self.theta2,X,p1,p2)
                # --------------------------------------
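A possible implementation (a sketch that reuses the ll function above and assumes the grid THETA is fine enough) selects for each class the grid value maximizing the log-likelihood, and evaluates the discriminant g directly from the Cauchy density:

class MLClassifierSketch:

    def fit(self, THETA, D1, D2):
        # maximum likelihood estimates: grid values with highest log-likelihood
        self.theta1 = THETA[numpy.argmax(ll(D1, THETA))]
        self.theta2 = THETA[numpy.argmax(ll(D2, THETA))]

    def predict(self, X, p1, p2):
        # g(x) = log p(x|theta1) - log p(x|theta2) + log P(w1) - log P(w2)
        return (numpy.log(pdf(X, self.theta1)) - numpy.log(pdf(X, self.theta2))
                + numpy.log(p1) - numpy.log(p2))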
Once these two functions have been implemented, the maximum likelihood classifier can be applied to our labeled data, and the decision function it implements can be visualized.
In [7]: X = numpy.linspace(-10,10,1001)
plt.grid(True)
mlcl = MLClassifier()
mlcl.fit(THETA,D1,D2)
plt.plot(X,mlcl.predict(X,0.5,0.5))
plt.plot(X,0*X,color='black',ls='dotted')
plt.xlabel(r'$x$')
plt.ylabel(r'$g(x)$')
for d1 in D1: plt.plot([d1,d1],[0,+0.5],color='black')
for d2 in D2: plt.plot([d2,d2],[0,-0.5],color='black')
Here, we observe that the model essentially learns a threshold classifier with threshold approximately −0.5. However, we note that the threshold seems to be too high to properly classify the data. One reason for this is that the maximum likelihood estimate retains only a single best parameter. Here, the model for the first class focuses mainly on the likelihood peak near θ = 2 and treats the examples with x < 0 as outliers, without considering the possibility that the data points near x = 2 might themselves be the outliers.
Bayes Parameter Estimation
Let us now bypass the computation of a maximum likelihood estimate of the parameter and adopt instead a full Bayesian approach. We will consider the same data density model and datasets as in the maximum likelihood exercise, but we now include a prior distribution over the parameter θ. Specifically, we set for both classes the prior distribution:
p(θ) = (1/(10π)) · 1/(1 + (θ/10)²)
Given a dataset D, the posterior distribution for the unknown parameter θ can then be obtained from the Bayes rule:
p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ
The integration can be performed numerically using the trapezoidal rule.

Task (10 P):
• Implement the prior and posterior functions below. These functions receive as input a vector of parameters θ (assumed to be sorted from smallest to largest, linearly spaced, and covering the range of values where most of the probability mass lies). The posterior function also receives a dataset D as input. Both functions return a vector containing the probability scores associated with each value of θ.
In [8]: def prior(THETA):
            # --------------------------------------
            # TODO: replace by your code
            # --------------------------------------
            import solution; return solution.prior(THETA)
            # --------------------------------------

        def posterior(D,THETA):
            # --------------------------------------
            # TODO: replace by your code
            # --------------------------------------
            import solution; return solution.posterior(D,THETA)
            # --------------------------------------
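One way these functions could look (a sketch, not necessarily the reference solution; the normalization uses numpy.trapz to apply the trapezoidal rule over the parameter grid):

def prior_sketch(THETA):
    # Cauchy prior with location 0 and scale 10
    return (1.0 / (10 * numpy.pi)) * (1.0 / (1 + (THETA / 10)**2))

def posterior_sketch(D, THETA):
    # unnormalized posterior: likelihood times prior (exponentiate the log-likelihood)
    unnorm = numpy.exp(ll(D, THETA)) * prior_sketch(THETA)
    # normalize so that the posterior integrates to one over the grid
    return unnorm / numpy.trapz(unnorm, THETA)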
To verify the implementation of the two functions, we apply them to the dataset D defined above and with a broad range of parameters θ.
In [9]: THETA = numpy.linspace(-100,100,10001)
plt.grid(True)
plt.plot(THETA,numpy.log(prior(THETA)),label=r'$p(\theta)$')
plt.plot(THETA,numpy.log(posterior(D,THETA)),label=r'$p(\theta|\mathcal{D})$')
plt.legend(); plt.xlabel(r'$\theta$'); plt.show()
We observe that the posterior distribution is more concentrated than the prior on the specific values of the parameter that explain the dataset well. In particular, we recover the same two peaks around θ = −0.5 and θ = 2 as in the maximum likelihood exercise.
Building a Classifier
We now would like to build a Bayes classifier based on the discriminant function
h(x) = log p(x|D₁) − log p(x|D₂) + log P(ω₁) − log P(ω₂)
where the dataset-conditioned densities are obtained from the original data density model and the parameter posterior as

p(x|D_j) = ∫ p(x|θ) p(θ|D_j) dθ

Tasks (10 P):
• Implement a function fit that produces the parameter posteriors p(θ|D1) and p(θ|D2).
• Implement a function predict computing the new discriminant function h based on the dataset-conditioned data densities.
In [10]: class BayesClassifier:

             def fit(self,THETA,D1,D2):
                 # --------------------------------------
                 # TODO: replace by your code
                 # --------------------------------------
                 import solution
                 self.THETA,self.post1,self.post2 = solution.fitBayes(THETA,D1,D2)
                 # --------------------------------------

             def predict(self,X,p1,p2):
                 # --------------------------------------
                 # TODO: replace by your code
                 # --------------------------------------
                 import solution
                 return solution.predictBayes(self.THETA,self.post1,self.post2,X,p1,p2)
                 # --------------------------------------
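A possible implementation (a sketch reusing the posterior function above; each dataset-conditioned density is obtained by integrating p(x|θ) p(θ|Dj) over the parameter grid with numpy.trapz):

class BayesClassifierSketch:

    def fit(self, THETA, D1, D2):
        # store the parameter grid and the per-class posteriors p(theta|Dj)
        self.THETA = THETA
        self.post1 = posterior(D1, THETA)
        self.post2 = posterior(D2, THETA)

    def predict(self, X, p1, p2):
        # p(x|Dj) = integral of p(x|theta) p(theta|Dj) dtheta, for all x at once
        PX = pdf(X[:,na], self.THETA[na,:])    # shape (len(X), len(THETA))
        px1 = numpy.trapz(PX * self.post1[na,:], self.THETA, axis=1)
        px2 = numpy.trapz(PX * self.post2[na,:], self.THETA, axis=1)
        # h(x) = log p(x|D1) - log p(x|D2) + log P(w1) - log P(w2)
        return numpy.log(px1) - numpy.log(px2) + numpy.log(p1) - numpy.log(p2)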
We note that the function predict is computationally more expensive than the one for maximum likelihood since it involves computing an integral for each point to be predicted.
However, the quality of the prediction also differs compared to that of the maximum likelihood method. In the plot below, we compare the ML and Bayes approaches.
In [11]: X = numpy.linspace(-10,10,1001)
bacl = BayesClassifier()
bacl.fit(THETA,D1,D2)
plt.grid(True)
plt.plot(X,mlcl.predict(X,0.5,0.5),label='ML')
plt.plot(X,bacl.predict(X,0.5,0.5),label='Bayes')
plt.plot(X,0*X,color='black',ls='dotted')
plt.xlabel(r'$x$'); plt.ylabel(r'$g(x)$')
plt.legend()
for d1 in D1: plt.plot([d1,d1],[0,+0.5],color='black')
for d2 in D2: plt.plot([d2,d2],[0,-0.5],color='black')
We observe that the Bayes classifier has generally lower output scores and that its decision boundary has been noticeably shifted to the left, leading to better predictions for the current data. In this particular case, the difference between the two models can be explained by the fact that the Bayesian model better accounts for the possibility that the negative examples of the first class are not necessarily outliers.