Machine Learning 1 TU Berlin, WiSe 2020/21
Programming Sheet 1: Bayes Decision Theory (40 P)
In this exercise sheet, we will apply Bayes decision theory in the context of small two-dimensional problems. For this, we will make use of 3D plotting. We introduce below the basics for constructing these plots in Python/Matplotlib.
The function numpy.meshgrid
To plot two-dimensional functions, we first need to discretize the two-dimensional input space. One basic function for this purpose is numpy.meshgrid. The following code creates a discrete grid of the rectangular surface [0, 4] × [0, 3]. The function numpy.meshgrid takes the discretized intervals as input and returns two arrays whose shape corresponds to the discretized surface (i.e. the grid), containing the X- and Y-coordinates respectively.
In [1]: import numpy as np
X,Y = np.meshgrid([0,1,2,3,4],[0,1,2,3])
print(X)
print(Y)
[[0 1 2 3 4]
[0 1 2 3 4]
[0 1 2 3 4]
[0 1 2 3 4]]
[[0 0 0 0 0]
[1 1 1 1 1]
[2 2 2 2 2]
[3 3 3 3 3]]
Note that we can iterate over the elements of the grid by zipping the two arrays X and Y containing each coordinate. The method flatten (i.e. numpy.ndarray.flatten) converts the 2D arrays into one-dimensional arrays that can then be iterated element-wise.
In [2]: print(list(zip(X.flatten(),Y.flatten())))
[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (0, 3), (1, 3), (2, 3), (3, 3), (4, 3)]
3D-Plotting
To enable 3D-plotting, we first need to load some modules in addition to matplotlib:
In [3]: import matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
As an example, we would like to plot the L2-norm function $f(x, y) = \sqrt{x^2 + y^2}$ on the subspace x, y ∈ [−4, 4]. First, we create a meshgrid of appropriate size:
In [4]: R = np.arange(-4,4+1e-9,0.1)
X,Y = np.meshgrid(R,R)
print(X.shape,Y.shape)
(81, 81) (81, 81)
Here, we have used a discretization with small increments of 0.1 in order to produce a plot with better resolution. The resulting meshgrid has size (81×81), that is, 6561 points. The function f needs to be evaluated at each of these points. This is achieved by applying element-wise operations on the arrays of the meshgrid. The norm at each point of the grid is therefore computed as:
In [5]: F = (X**2+Y**2)**.5
print(F.shape)
(81, 81)
The resulting function values are of the same size as the meshgrid. Taking X, Y, F jointly results in a list of 6561 triplets representing the x-, y-, and z-coordinates at which the function should be plotted in three-dimensional space. The 3D plot can now be constructed easily by means of the scatter function of the 3D axes object.
In [6]: fig = plt.figure(figsize=(10,6))
ax = plt.axes(projection='3d')
ax.scatter(X,Y,F,s=1,alpha=0.5)
Out[6]: [3D scatter plot of the L2-norm function over the grid]
The parameters s and alpha control the size and the transparency of each data point. Other 3D plotting variants exist (e.g. surface plots); however, the scatter plot is conceptually the simplest approach. Having introduced how to easily plot 3D functions in Python, we can now analyze two-dimensional probability distributions with this same tool.
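As a quick illustration of one such variant, the same function can also be drawn as a surface. The following lines are only a minimal sketch, reusing the arrays X, Y and F from the cells above:

fig = plt.figure(figsize=(10,6))
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, F, cmap='viridis')  # draw F as a colored surface instead of individual points
plt.show()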
Exercise 1: Gaussian distributions (5+5+5 P)
Using the technique introduced above, we would like to plot a Gaussian probability distribution with mean vector μ = (0, 0) and covariance matrix Σ = I, also known as the standard normal distribution. We consider the same discretization as above (i.e. a grid from −4 to 4 with step size 0.1). For two-dimensional input spaces, the standard normal distribution is given by:
$$p(x, y) = \frac{1}{2\pi} e^{-0.5(x^2+y^2)}.$$
This distribution integrates to 1 over ℝ². However, it does not sum to 1 when summed over the discretized space (i.e. the grid). Instead, we can work with a discretized Gaussian-like distribution:
$$P(x, y) = \frac{1}{Z} e^{-0.5(x^2+y^2)} \qquad \text{with} \qquad Z = \sum_{x,y} e^{-0.5(x^2+y^2)},$$
where the sum runs over the whole discretized space.
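As a hint, the normalization can be carried out directly on the grid. The following lines are only a rough sketch (not the reference solution), reusing the arrays X and Y from the discretization above:

G = np.exp(-0.5*(X**2 + Y**2))   # unnormalized Gaussian values at each grid point
Z = G.sum()                      # normalization constant Z (sum over the whole grid)
P = G / Z                        # discretized distribution P(x,y)
print(P.sum())                   # sanity check: should be 1.0 up to rounding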
• Compute the distribution P(x,y), and plot it.
• Compute the conditional distribution $Q(x, y) = P((x, y) \mid x^2 + y^2 \geq 1)$, and plot it.
• Marginalize the conditioned distribution Q(x, y) over y, and plot the resulting distribution Q(x). (A rough sketch of the conditioning and marginalization steps is given below.)
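The conditioning and marginalization can be expressed with element-wise operations on the grid. The following lines are only a rough sketch (not the reference solution), assuming P is the normalized grid array from the sketch above:

mask = (X**2 + Y**2 >= 1)        # indicator of the conditioning region
Q = P * mask                     # remove the probability mass outside the region
Q = Q / Q.sum()                  # renormalize so that Q sums to 1
Qx = Q.sum(axis=0)               # marginalize over y (rows of the grid correspond to y values)
plt.figure(); plt.plot(R, Qx)    # Q(x) is one-dimensional and can be drawn as an ordinary 2D curve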
In [7]: ### REPLACE BY YOUR CODE
import solutions
solutions.s1a()
###
In [8]: ### REPLACE BY YOUR CODE
import solutions
solutions.s1b()
###
In [9]: ### REPLACE BY YOUR CODE
import solutions
solutions.s1c()
###
Exercise 2: Bayesian Classification (5+5+5 P)
Let the two coordinates x and y now be represented as a two-dimensional vector x. We consider two classes ω1 and ω2 with data-generating Gaussian distributions p(x|ω1) and p(x|ω2) with mean vectors μ1 = (−0.5, −0.5) and μ2 = (0.5, 0.5) respectively, and the same covariance matrix
$$\Sigma = \begin{pmatrix} 1.0 & 0 \\ 0 & 0.5 \end{pmatrix}.$$
Classes occur with prior probabilities P(ω1) = 0.9 and P(ω2) = 0.1. Analysis tells us that in such a scenario the optimal decision boundary between the two classes should be linear. We would like to verify this computationally by applying Bayes decision theory to grid-like discretized distributions.
• Using the same grid as in Exercise 1, discretize the two data-generating distributions p(x|ω1) and p(x|ω2) (i.e. create discrete distributions P(x|ω1) and P(x|ω2) on the grid), and plot them with different colors.
• From these distributions, compute the total probability distribution $P(x) = \sum_{c \in \{1,2\}} P(x|\omega_c) \cdot P(\omega_c)$, and plot it.
• Compute and plot the class posterior probabilities P(ω1|x) and P(ω2|x), and print the Bayes error rate for the discretized case. (A rough sketch of these computations is given below.)
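The following lines are only a rough sketch of these steps (not the reference solution). They reuse the arrays X and Y from Exercise 1 and exploit the diagonal structure of Σ to evaluate the densities element-wise:

# unnormalized Gaussian densities with means (-0.5,-0.5) and (0.5,0.5) and covariance diag(1.0, 0.5)
G1 = np.exp(-0.5*((X+0.5)**2/1.0 + (Y+0.5)**2/0.5))
G2 = np.exp(-0.5*((X-0.5)**2/1.0 + (Y-0.5)**2/0.5))
P1 = G1 / G1.sum()                      # discretized P(x|omega1)
P2 = G2 / G2.sum()                      # discretized P(x|omega2)
prior1, prior2 = 0.9, 0.1               # class priors P(omega1), P(omega2)
Px = P1*prior1 + P2*prior2              # total probability P(x) on the grid
post1 = P1*prior1 / Px                  # posterior P(omega1|x)
post2 = P2*prior2 / Px                  # posterior P(omega2|x)
err = (np.minimum(post1, post2) * Px).sum()   # Bayes error rate in the discretized case
print('Bayes error rate: %.3f' % err)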
In [10]: ### REPLACE BY YOUR CODE
import solutions
solutions.s2a()
###
In [11]: ### REPLACE BY YOUR CODE
import solutions
solutions.s2b()
###
In [12]: ### REPLACE BY YOUR CODE
import solutions
solutions.s2c()
###
Bayes error rate: 0.080
Exercise 3: Reducing the Variance (5+5 P)
Suppose that the data generating distribution for the second class changes to produce samples much closer to the mean. This variance reduction for the second class is implemented by keeping the first covariance the same (i.e. Σ1 = Σ) and dividing the second covariance matrix by 4 (i.e. Σ2 = Σ/4). For this new set of parameters, we can perform the same analysis as in Exercise 2.
• Plot the new class posterior probabilities P(ω1|x) and P(ω2|x) associated with the new covariance matrices, and print the new Bayes error rate.
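Under the same assumptions as the sketch in Exercise 2 (again only a rough sketch, not the reference solution), the only change is the covariance of the second class, whose diagonal entries are divided by 4:

G2 = np.exp(-0.5*((X-0.5)**2/0.25 + (Y-0.5)**2/0.125))   # covariance Sigma/4 = diag(0.25, 0.125)
P2 = G2 / G2.sum()                                        # discretized P(x|omega2) with reduced variance
# the remaining steps (total probability, posteriors, Bayes error rate) are unchanged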
In [13]: ### REPLACE BY YOUR CODE
import solutions
solutions.s3a()
###
Bayes error rate: 0.073
Intuition tells us that the variance reduction, and the resulting concentration of the data generated for class 2 in a smaller region of the input space, should make it easier to predict class 2 with certainty at this location. Paradoxically, in this new “dense” setting, we observe that class 2 does not reach full certainty anywhere in the input space, whereas it did in the previous exercise.
• Explain this paradox. [YOUR EXPLANATION HERE]