Machine Learning 1 TU Berlin, WiSe 2020/21
Part 1: The James-Stein Estimator (20 P)
Let x1, . . . , xN ∈ Rd be independent draws from a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ = σ2I. It can be shown that the maximum-likelihood estimator of the mean parameter μ is the empirical mean given by:
1 N μˆML=N xi
i=1
Maximum-likelihood appears to be a strong estimator. However, it was demonstrated that the following estimator
2
(d − 2) · σ μˆJS=1− N μˆML
(a shrinked version of the maximum-likelihood estimator towards the origin) has actually a smaller distance from the true mean when d ≥ 3. This however assumes knowledge of the variance of the distribution for which the mean is estimated. This estimator is called the James-Stein estimator. While the proof is a bit involved, this fact can be easily demonstrated empirically through simulation. This is the object of this exercise.
The code below draws ten 50-dimensional points from a normal distribution with mean vector μ = (1, . . . , 1) and covariance Σ = I.
In [1]: import numpy
def getdata(seed):
n = 10 # data points
d = 50 # dimensionality of data m = numpy.ones([d]) # true mean
s = 1.0 # true standard deviation
rstate = numpy.random.mtrand.RandomState(seed)
X = rstate.normal(0,1,[n,d])*s+m
return X,m,s
The following function computes the maximum likelihood estimator from a sample of the data assumed to be
generated by a Gaussian distribution:
In [2]: def ML(X):
return X.mean(axis=0)
Implementing the James-Stein Estimator (10 P)
• Based on the ML estimator function, write a function that receives as input the data (Xi)ni=1 and
the (known) variance σ2 of the generating distribution, and computes the James-Stein estimator
In [3]: def JS(X,s):
# REPLACE BY YOUR CODE
import solutions
m_JS = solutions.JS(X,s) ###
return m_JS
∥μˆML ∥2
1
Comparing the ML and James-Stein Estimators (10 P)
We would like to compute the error of the maximum likelihood estimator and the James-Stein estimator for 100 different samples (where each sample consists of 10 draws generated by the function getdata with a different random seed). Here, for reproducibility, we use seeds from 0 to 99. The error should be measured as the Euclidean distance between the true mean vector and the estimated mean vector.
• Compute the maximum-likelihood and James-Stein estimations. • Measure the error of these estimations.
• Build a scatter plot comparing these errors for different samples.
In [4]: %matplotlib inline
### REPLACE BY YOUR CODE
import solutions solutions.compare_ML_JS() ###
Part 2: Bias/Variance Decomposition (30 P)
In this part, we would like to implement a procedure to find the bias and variance of different predictors. We consider one for regression and one for classification. These predictors are available in the module utils.
• utils.ParzenRegressor: A regression method based on Parzen window. The hyperparameter corresponds to the scale of the Parzen window. A large scale creates a more rigid model. A small scale creates a more flexible one.
• utils.ParzenClassifier: A classification method based on Parzen window. The hyperparameter corre- sponds to the scale of the Parzen window. A large scale creates a more rigid model. A small scale creates a more flexible one. Note that instead of returning a single class for a given data point, it outputs a probability distribution over the set of possible classes.
2
Each class of predictor implements the following three methods:
• • •
init (self,parameter): Create an instance of the predictor with a certain scale parameter. fit(self,X,T): Fit the predictor to the data (a set of data points X and targets T). predict(self,X): Compute the output values arbitrary inputs X.
To compute the bias and variance estimates, we require multiple samples from the training set for a single set of observation data. To acomplish this, we utilize the Sampler class provided. The sampler is initialized with the training data and passed to the method for estimating bias and variance, where its function sampler.sample() is called repeatedly in order to fit multiple models and create an ensemble of prediction for each test data point.
Regression Case (15 P)
For the regression case, Bias, Variance and Error are given by:
• Bias(Y )2 = (EY [Y − T ])2
• Var(Y ) = EY [(Y − EY [Y ])2] • Error(Y)=EY[(Y −T)2]
Task: Implement the KL-based Bias-Variance Decomposition defined above. The function should repeatedly sample training sets from the sampler (as many times as specified by the argument nbsamples), learn the predictor on them, and evaluate the variance on the out-of-sample distribution given by X and T.
In [5]: def biasVarianceRegression(sampler, predictor, X, T, nbsamples):
# ——————————– # TODO: REPLACE BY YOUR CODE
# ——————————– import solutions
bias,variance = solutions.biasVarianceRegression(sampler, predictor, X, T, nbsamples=nbsampl # ——————————–
return bias,variance
Your implementation can be tested with the following code:
In [6]: import utils,numpy
%matplotlib inline
utils.plotBVE(utils.Housing,numpy.logspace(-6,3,num=30),utils.ParzenRegressor,biasVarianceRegres
3
Classification Case (15 P)
We consider here the Kullback-Leibler divergence as a measure of classification error, as derived in the exercise, the Bias, Variance decomposition for such error is:
• Bias(Y)=DKL(T||R)
• Var(Y ) = EY [DKL(R||Y )] • Error(Y ) = EY [DKL(T ||Y )]
where R is the distribution that minimizes its expected KL divergence from the estimator of probability distribution Y (see the theoretical exercise for how it is computed exactly), and where T is the target class distribution.
Task: Implement the KL-based Bias-Variance Decomposition defined above. The function should repeatedly sample training sets from the sampler (as many times as specified by the argument nbsamples), learn the predictor on them, and evaluate the variance on the out-of-sample distribution given by X and T.
In [7]: def biasVarianceClassification(sampler, predictor, X, T, nbsamples=25):
# ——————————– # TODO: REPLACE BY YOUR CODE
# ——————————– import solutions
bias,variance = solutions.biasVarianceClassification(sampler, predictor, X, T, nbsamples=nbs # ——————————–
return bias,variance
Your implementation can be tested with the following code:
4
In [8]: import utils,numpy
%matplotlib inline
utils.plotBVE(utils.Yeast,numpy.logspace(-6,3,num=30),utils.ParzenClassifier,biasVarianceClassif
5