ANLY-601 Spring 2018
Assignment 7 — Revised Again for Clarity, a Slight Modification to Prob. 3, and Consistency of Notation in Prob. 4
Due May 7, 2018
You may use the text, your lecture notes and handouts, and any mathematics references that you need. You may ask me for clarification but you may not consult other people or other references (including the internet). If you use Mathematica for any of this (I’m not convinced it’s useful here), be sure to include the notebook in your solution so I can see what you did. Better yet, cut and paste from the Mathematica notebook into your solution so it all flows logically.
1. (10 points) KL Divergence and Log-Likelihood
The Kullback-Leibler (KL) divergence is one of several measures of distance between probability distributions. Consider a true distribution $p(x)$ and a model $\hat{p}(x|\theta)$ with parameters $\theta$; the KL divergence between them is
\[
d(p,\hat{p}) \;\equiv\; -\int p(x)\,\ln\frac{\hat{p}(x|\theta)}{p(x)}\,dx
\;=\; -\int p(x)\,\ln\hat{p}(x|\theta)\,dx \;+\; \int p(x)\,\ln p(x)\,dx \;\ge\; 0 \qquad (1)
\]
where the equality (in $\ge 0$) is met if and only if $\hat{p}(x|\theta) = p(x)$. This exercise will show you how the KL divergence is related to the data log-likelihood used in model fitting.
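(For intuition only, not required for your solution: here is a minimal numerical sketch, assuming a Gaussian true distribution and a Gaussian model, that approximates the integral in (1) on a grid and confirms that $d(p,\hat{p}) \ge 0$ with equality when the model matches $p$. The grid range and parameter values are arbitrary illustrative choices.)

import numpy as np
from scipy.stats import norm

# Grid over which to approximate the integral in Eq. (1).
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]

def kl_divergence(p, q):
    """Riemann-sum approximation of d(p,q) = -integral p(x) ln[q(x)/p(x)] dx."""
    return -np.sum(p * np.log(q / p)) * dx

p_true  = norm.pdf(x, loc=0.0, scale=1.0)   # true distribution p(x)
p_model = norm.pdf(x, loc=0.5, scale=1.5)   # model p-hat(x | theta)

print(kl_divergence(p_true, p_model))   # > 0, since the model differs from p
print(kl_divergence(p_true, p_true))    # ~ 0: equality iff p-hat = p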
Suppose you have a data set consisting of $m$ samples $D = \{x_1, x_2, \ldots, x_m\}$, each of which is a scalar ($x \in \mathbb{R}$). (Note, the subscript identifies the sample.) Assuming that these samples are statistically independent, the log-likelihood of the data set under the model is
\[
L \;=\; \ln\hat{p}(D|\theta) \;=\; \sum_{a=1}^{m} \ln\hat{p}(x_a|\theta) \,. \qquad (2)
\]
Show that the expected log-likelihood of the dataset under the model is
\[
E_D[L] \;=\; m \int p(x)\,\ln\hat{p}(x|\theta)\,dx \qquad (3)
\]
where the expectation $E_D[\cdot]$ is with respect to the distribution over all possible data sets. (To receive full credit, you must write $E_D$ out as an integral over the distribution of data sets, and evaluate the integral showing all steps.)
Conclude that the KL divergence is
\[
d(p,\hat{p}) \;=\; -\frac{1}{m}\,E_D[L] \;-\; H_p
\]
where the differential entropy of the true distribution is
\[
H_p \;\equiv\; -\int p(x)\,\ln p(x)\,dx \,.
\]
(Notice also that since the KL divergence is bounded below by zero, the expected log-likelihood is bounded above: a theoretical fitting bound $\frac{1}{m} E_D[L] \le -H_p$.)
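(A numerical sanity check of this relation, not a substitute for the analytic derivation asked for above: the sketch below draws many synthetic data sets of size m from an assumed true Gaussian, averages L under an assumed Gaussian model, and compares $-\frac{1}{m}E_D[L] - H_p$ against the closed-form KL divergence. All distributions, parameter values, and the random seed are illustrative assumptions.)

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, n_datasets = 50, 20000          # samples per data set, number of data sets

# Assumed true distribution p = N(0, 1) and model p-hat = N(0.5, 1.5^2).
mu_p, sd_p = 0.0, 1.0
mu_q, sd_q = 0.5, 1.5

# Expected log-likelihood E_D[L], estimated by averaging over many data sets.
datasets = rng.normal(mu_p, sd_p, size=(n_datasets, m))
L = norm.logpdf(datasets, loc=mu_q, scale=sd_q).sum(axis=1)   # one L per data set
expected_L = L.mean()

# Differential entropy of a Gaussian: H_p = 0.5 * ln(2 * pi * e * sd^2).
H_p = 0.5 * np.log(2 * np.pi * np.e * sd_p**2)

# Closed-form KL divergence between the two Gaussians, for comparison.
kl_closed = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

print(-expected_L / m - H_p)   # should be close to ...
print(kl_closed)               # ... the closed-form KL divergence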
2. (10 points) Interpolating and Smoothing Kernels
In class we developed kernel density estimates. One can also use a kernel approach for regression, and this is closely related to Gaussian process regression and kriging (in geostatistics).
Suppose you have a data set consisting of (real) input/output pairs $(x_a, y_a)$, $a = 1, \ldots, N$. One can build an interpolating kernel model in a way very similar to a kernel density estimate. We will use symmetric kernels, as before,
\[
\kappa(x-y) \;=\; \kappa(y-x) \,.
\]
A common choice is the radial exponential kernel
\[
\kappa(x-y) \;=\; \exp\!\left(-\,\frac{|x-y|^2}{r^2}\right)
\]
where $r$ is the (adjustable) kernel radius. (But you do not need to consider the kernel form in what follows.)
Our kernel model of $f(x)$ is
\[
\hat{f}(x) \;=\; \sum_{a=1}^{N} c_a\,\kappa(x - x_a) \qquad (4)
\]
where the coefficients $c_a$ need to be specified.
(a) Find the coefficients by minimizing the MSE cost
\[
E(c) \;=\; \frac{1}{N}\sum_{a=1}^{N}\left( y_a - \hat{f}(x_a) \right)^2
\;=\; \frac{1}{N}\sum_{a=1}^{N}\left( y_a - \sum_{b=1}^{N} c_b\,\kappa(x_a - x_b) \right)^2 \qquad (5)
\]
with respect to $c_a$, $a = 1, \ldots, N$. That is, set
\[
\frac{dE}{dc_k} \;=\; 0 \qquad (6)
\]
and solve for the c's¹. You should find
\[
c_a \;=\; \left( K^{-1} y \right)_a \;=\; \sum_{b=1}^{N} K^{-1}_{ab}\, y_b
\]
where K is the matrix with elements $K_{ab} = \kappa(x_a - x_b)$, $K^{-1}$ is its matrix inverse, and y is the (column) vector with elements $y_a$, $a = 1, \ldots, N$. The complete kernel interpolator is then
\[
\hat{f}(x) \;=\; \sum_{a=1}^{N} c_a\,\kappa(x - x_a) \;=\; \sum_{a=1}^{N} \left( K^{-1} y \right)_a \kappa(x - x_a) \,. \qquad (7)
\]

¹ Be careful taking the derivative $\frac{d}{dc_k}\sum_{b=1}^{N} c_b\,\kappa(x_a - x_b)$. To be certain you get the indices correct, take for example k = 2, write out the first several terms of the summation explicitly, and take the derivative.
(b) Show that at each input point $x_a$, the interpolator passes through the corresponding output point, $\hat{f}(x_a) = y_a$. (This form of interpolator assumes that the $y_a$ are noiseless measurements of the underlying generating function, $y_a = f(x_a)$.)
Since I’ve given you the answers, to get full credit your solution must show all steps. I suggest that you use component notation for all your matrix manipulations to be clear about the algebra.
(Note: If the data points $y_a$ are noisy measurements of the generating function $y = f(x) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, the coefficients take the form $c = (K + \sigma^2 I)^{-1} y$ instead. The resulting model is called a smoothing kernel, and it does not pass through all the $y_a$. This form makes full contact with Gaussian process regression and smoothing splines.)
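(Purely as a numerical illustration of Eqs. (4)-(7), not a replacement for the algebra asked for above: the minimal sketch below builds the kernel matrix K on synthetic 1-D data, solves for the coefficients, and checks numerically that the interpolator reproduces the $y_a$, with the smoothing variant from the note in the last lines. The synthetic data, the choice f(x) = sin(x), the kernel radius r, and the noise level are illustrative assumptions.)

import numpy as np

# Synthetic 1-D data (illustrative): noiseless y_a = f(x_a) with f(x) = sin(x).
x_data = np.linspace(0.0, 7.0, 8)
y_data = np.sin(x_data)
r = 1.0                                         # kernel radius (adjustable)

def kappa(u, v):
    """Radial exponential kernel kappa(u - v) = exp(-|u - v|^2 / r^2)."""
    return np.exp(-np.subtract.outer(u, v) ** 2 / r**2)

# Kernel matrix K_ab = kappa(x_a - x_b) and coefficients c = K^{-1} y.
K = kappa(x_data, x_data)
c = np.linalg.solve(K, y_data)

def f_hat(x):
    """Kernel interpolator f_hat(x) = sum_a c_a kappa(x - x_a)  (Eq. 4)."""
    return kappa(np.atleast_1d(x), x_data) @ c

# Part (b) numerically: the interpolator passes through every data point.
print(np.max(np.abs(f_hat(x_data) - y_data)))   # ~ 0 up to round-off

# Smoothing variant from the note: c = (K + sigma^2 I)^{-1} y does NOT interpolate.
sigma2 = 0.1
c_smooth = np.linalg.solve(K + sigma2 * np.eye(len(x_data)), y_data)
print(np.max(np.abs(K @ c_smooth - y_data)))    # > 0: no longer exact at the data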
3. (10 points) Bootstrap Variance Estimate
Unlike the sample mean, one cannot write a closed-form expression for the variance of the sample median across data sets drawn from a distribution. This exercise will have you construct a bootstrap estimate for the variance of the sample median.
On the web page is a new dataset dataForBootstrap.txt that has 50 samples from a distribution. Calculate the median of the sample using whatever package you prefer. Next generate 10 bootstrap replicates. Calculate the median of each. Then compute the mean and variance of these 10 estimates of the median. Repeat for
[10, 30, 100, 300, 1000, 3000, 10000, 30000]
replicates² and report the variance for each case. Note that if your original data set has n samples, there are
\[
\binom{2n-1}{n}
\]
possible different bootstrap replicates, so you don't have to worry about running out.
Interpret your results. I suggest making plots of the mean and variance (of the median for each set of replicates) as a function of the number of replicates, using a log scale for the x-axis (number of replicates). What value would you quote for the population median, and for the variance of the median estimate?

² You can increase this to 300,000 or 3,000,000 without excessive compute time.
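(A minimal sketch of the bootstrap loop, assuming dataForBootstrap.txt is a plain single-column text file in the working directory; the use of NumPy and the random seed are arbitrary choices, and the plotting is left to you.)

import numpy as np

rng = np.random.default_rng(0)
data = np.loadtxt("dataForBootstrap.txt")     # 50 samples, one per line (assumed)
n = len(data)

print("sample median:", np.median(data))

for B in [10, 30, 100, 300, 1000, 3000, 10000, 30000]:
    # Each bootstrap replicate is n draws from the data WITH replacement.
    replicates = rng.choice(data, size=(B, n), replace=True)
    medians = np.median(replicates, axis=1)   # one median per replicate
    print(B, medians.mean(), medians.var(ddof=1))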
4. (25 points) Overfitting in K-Means
We discussed both soft clustering (with Gaussian mixture models fit by EM) and hard clustering (with K-means), and the relation between the two algorithms. For the last homework assignment you fit Gaussian mixture models to a small, two-dimensional dataset and examined the log-likelihood on fitting and holdout sets, noting overfitting with increasing model size. (Although the log-likelihood per data point on the fitting data increases monotonically with the number of Gaussian bumps in the mixture model, the log-likelihood on the holdout data first increases, then decreases.) This exercise has you explore the same dataset and concepts for K-means.
Using any language or package you prefer, write or use an existing K-means algorithm (with Euclidean distance function) and fit a sequence of models to the first 750 vectors in the toydata1.txt dataset. Plot the mean square distance between the data points and the centroid of the cluster each belongs to (the cost function minimized by K-means)
\[
J \;=\; \sum_{i=1}^{K}\frac{N_i}{N}\,\frac{1}{N_i}\sum_{a=1}^{N_i}\left| x_a^{(i)} - m_i \right|^2
\;=\; \frac{1}{N}\sum_{i=1}^{K}\sum_{a=1}^{N_i}\left| x_a^{(i)} - m_i \right|^2
\]
as a function of K. (Here K is the number of clusters, N is the total number of data points, $N_i$ is the number of data points in cluster i, $x_a^{(i)}$ is the a-th data point in cluster i, and $m_i$ is the centroid (or mean) of the data in the i-th cluster; see Lecture 13.)
Make such plots for both the fitting data (the first 750 data points) and the test data (the remaining 750 points). To facilitate comparing the curves of J vs K, plot both the training and test set curves on the same frame. Explore numbers of clusters K in the range 5–100. (You don't have to make models for each value in between, but use enough so you can see what's happening. I used 5, 10, 20, 30, 40, 50, 60, and 100, but this is by no means a preferred set.) Since K-means is a greedy algorithm (it moves downhill in J), and the cost function has suboptimal local minima that the algorithm can converge to, it's a good idea to run several restarts (with different initializations of the $m_i$) for each value of K. This helps ensure a nice monotonically decreasing curve of J vs K on the training set. (I used 10 restarts for each K, and kept the best fit on the training data for each, running the test data through the corresponding models.)
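(One possible minimal sketch of the fitting loop, assuming toydata1.txt is a whitespace-delimited text file of 1500 two-dimensional rows and that scikit-learn is available; the particular K values and number of restarts are just the examples mentioned above, and the plotting is left to you.)

import numpy as np
from sklearn.cluster import KMeans

data = np.loadtxt("toydata1.txt")             # assumed: 1500 rows, 2 columns
train, test = data[:750], data[750:]

K_values = [5, 10, 20, 30, 40, 50, 60, 100]
J_train, J_test = [], []

for K in K_values:
    # n_init restarts; scikit-learn keeps the best (lowest-cost) fit on the training data.
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(train)

    # J = mean squared distance of each point to the centroid of its cluster.
    J_train.append(km.inertia_ / len(train))  # inertia_ = sum of squared distances

    d_test = km.transform(test)               # distances from test points to all K centroids
    J_test.append(np.mean(d_test.min(axis=1) ** 2))

print(list(zip(K_values, J_train, J_test)))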
(a) (10 points) For models fit to the training data, plot the cost J vs the number of clusters (means) K for both the training and test data (as described above). How do these plots compare (qualitatively) to what you found for the Gaussian mixture model, and to what you're familiar with for regression and classification problems in general?
(b) (10 points) WHY does the K-means algorithm behave so differently with regard to overfitting? Use diagrams and words to describe clearly what's going on with K-means in this regard. (It is your job to be clear; this is both an exercise in reasoning about the cost function J, and in communicating your thoughts.)
(c) (5 points) Re-do the plots from part (a) using a log scale for both axes. (It doesn't matter if you use log base 10 or base 2 or base e, but be clear which you use.) You will see a roughly straight line for the training data curve,
\[
\ln J \;=\; \ln b \;-\; m \ln K \,.
\]
Note that this is equivalent to
\[
J \;\approx\; \frac{b}{K^{m}} \,,
\]
a power law for J vs K. Find b and m using the fitting-curve values with K ≥ 20 (using a linear fit for the log data). You should see m ≈ 2/D where D is the true dimensionality of the data.
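(A minimal sketch of the straight-line fit in log-log coordinates, assuming arrays K_values and J_train like those produced by the fitting loop sketched earlier; natural logs are used here.)

import numpy as np

# Assumes K_values and J_train (training-set cost J for each K) already exist,
# e.g. from the fitting loop sketched in the setup above.
K_arr = np.array(K_values, dtype=float)
J_arr = np.array(J_train, dtype=float)

keep = K_arr >= 20                              # fit only the K >= 20 portion
slope, intercept = np.polyfit(np.log(K_arr[keep]), np.log(J_arr[keep]), deg=1)

m_hat = -slope                                  # ln J = ln b - m ln K
b_hat = np.exp(intercept)
print("m ~", m_hat, " b ~", b_hat)              # expect m close to 2/D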
(d) (10 points EXTRA CREDIT) Use simple mathematical arguments (dimensional analysis) to show that you should find m = 2/D. (Hints: The mean square error is proportional to the square of the cluster length, $J \propto r^2$, and the volume of each cluster is $\mathrm{Vol}_{\mathrm{cluster}} \approx \mathrm{Vol}_{\mathrm{data\ set}}/K$, while on dimensional grounds $\mathrm{Vol}_{\mathrm{cluster}} \propto r^D$.)
I will not be assigning an extra credit problem.