
ANLY-601 Spring 2018

Assignment 7 — Revised Again for Clarity, a Slight Modification to Prob. 3,
and Consistency of Notation in Prob. 4

Due May 7, 2018

You may use the text, your lecture notes and handouts, and any mathematics references that you
need. You may ask me for clarification but you may not consult other people or other references
(including the internet). If you use Mathematica for any of this (I’m not convinced it’s useful here),
be sure to include the notebook in your solution so I can see what you did. Better yet, cut and
paste from the Mathematica notebook into your solution so it all flows logically.

1. (10 points) KL Divergence and Log-Likelihood

The Kullback-Leibler divergence is one of several measures of distance between probability
distributions. Consider a true distribution p(x) and a model p̂(x|θ) with parameters θ, then
the KL divergence between them is

d(p, p̂) ≡ − ∫ p(x) ln ( p̂(x|θ) / p(x) ) dx

         = − ∫ p(x) ln p̂(x|θ) dx + ∫ p(x) ln p(x) dx ≥ 0        (1)

where the equality (in ≥ 0) is met if and only if p̂(x|θ) = p(x). This exercise will show you
how the KL divergence is related to data log-likelihood used in model fitting.

Suppose you have a data set consisting of m samples D = {x1, x2, . . . , xm} each of which is a
scalar (x ∈ R). (Note, the subscript identifies the sample.) Assuming that these samples are
statistically independent, the log-likelihood of the data set under the model is

L = ln p̂(D|θ) = ∑_{a=1}^{m} ln p̂(x_a|θ) .        (2)

Show that the expected log-likelihood of the dataset under the model is

ED[L] = m ∫ p(x) ln p̂(x|θ) dx        (3)

where the expectation ED[·] is taken with respect to the distribution over all possible data sets. (To
receive full credit, you must write ED out as an integral over the distribution of data sets,
and evaluate the integral showing all steps.)

Conclude that the KL divergence is

d(p, p̂) = − (1/m) ED[L] − Hp

where the differential entropy of the true distribution is

Hp ≡ − ∫ p(x) ln p(x) dx .

(Notice also that since the KL divergence is bounded below by zero, the expected log-likelihood
is bounded above — a theoretical fitting bound (1/m) ED[L] ≤ −Hp.)
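This relationship is easy to check numerically. Below is a minimal sketch, assuming a hypothetical Gaussian example (true p(x) = N(0, 1) and model p̂(x|θ) = N(0.5, 1), neither of which comes from the assignment): it estimates ED[L] by Monte Carlo over many simulated data sets and compares ED[L]/m with −Hp − d(p, p̂).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical example: true p(x) = N(0, 1), model p_hat(x|theta) = N(0.5, 1).
p = norm(loc=0.0, scale=1.0)
p_hat = norm(loc=0.5, scale=1.0)

m = 50              # samples per data set
n_datasets = 20000  # number of simulated data sets D

# Log-likelihood L of each simulated data set under the model, then its average.
x = p.rvs(size=(n_datasets, m), random_state=rng)
L = p_hat.logpdf(x).sum(axis=1)
E_L = L.mean()                    # Monte Carlo estimate of E_D[L]

# Closed-form pieces for this Gaussian example.
H_p = p.entropy()                 # differential entropy of N(0, 1)
kl = 0.5 * (0.5 - 0.0) ** 2       # d(p, p_hat) for equal-variance Gaussians

print("E_D[L] / m :", E_L / m)
print("-H_p - d   :", -H_p - kl)  # the two numbers should agree closely
```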


2. (10 points) Interpolating and Smoothing Kernels

In class we developed kernel density estimates. One can also use a kernel approach for regres-
sion, and this is closely related to Gaussian process regression and kriging (in geostatistics).

Suppose you have a data set consisting of (real) input/output pairs (xa, ya), a = 1, . . . N . One
can build an interpolating kernel model in a way very similar to a kernel density estimate.
We will use symmetric kernels, as before

κ(x− y) = κ(y − x) .

A common choice is the radial exponential kernel

κ(x − y) = exp( −|x − y|² / r² )

where r is the (adjustable) kernel radius. (But you do not need to consider the kernel form
in what follows.)

Our kernel model of f(x) is

f̂(x) = ∑_{a=1}^{N} c_a κ(x − x_a)        (4)

where the coefficients ca need to be specified.

(a) Find the coefficients by minimizing the MSE cost

E(c) = (1/N) ∑_{a=1}^{N} ( y_a − f̂(x_a) )²
     = (1/N) ∑_{a=1}^{N} ( y_a − ∑_{b=1}^{N} c_b κ(x_a − x_b) )²        (5)

with respect to ca, a = 1, . . . N. That is, set

dE/dc_k = 0

and solve for the c's¹. You should find

c_a = ( K⁻¹ y )_a = ∑_{b=1}^{N} ( K⁻¹ )_{ab} y_b        (6)

where K is the matrix with elements Kab = κ(xa − xb), K⁻¹ is its matrix inverse, and y
is the (column) vector with elements ya, a = 1, . . . N. The complete kernel interpolator
is then

f̂(x) = ∑_{a=1}^{N} c_a κ(x − x_a) = ∑_{a=1}^{N} ( K⁻¹ y )_a κ(x − x_a) .        (7)

¹ Be careful taking the derivative

d/dc_k ∑_{b=1}^{N} c_b κ(x_a − x_b) .

To be certain you get the indices correct, take for example k = 2, write out the first several terms of the summation
explicitly, and take the derivative.


(b) Show that at each input point xa, the interpolator passes through the corresponding
output point f̂(xa) = ya. (This form of interpolator assumes that the ya are noiseless
measurements of the underlying generating function ya = f(xa).)

Since I’ve given you the answers, to get full credit your solution must show all steps. I suggest
that you use component notation for all your matrix manipulations to be clear about the
algebra.

(Note: If the data points ya are noisy measurements of the generating function y = f(x) + ε,
ε ∼ N(0, σ²), the coefficients take the form c = ( K + σ²I )⁻¹ y instead. The resulting model
is called a smoothing kernel, and it does not pass through all the ya. This form makes full
contact with Gaussian process regression and smoothing splines.)
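As a numerical sanity check of equations (6)–(7) and part (b), here is a small Python sketch. The sine-wave data, sample size N = 15, and kernel radius r = 1 are arbitrary illustrative choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 1-D data: y = sin(x) sampled at N points (noiseless).
N = 15
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, N))
y = np.sin(x)

r = 1.0  # kernel radius

def kernel(u, v):
    """Radial exponential kernel: kappa(u - v) = exp(-|u - v|^2 / r^2)."""
    return np.exp(-np.abs(u - v) ** 2 / r ** 2)

# Kernel matrix K_ab = kappa(x_a - x_b) and coefficients c = K^{-1} y, as in eq. (6).
K = kernel(x[:, None], x[None, :])
c = np.linalg.solve(K, y)

def f_hat(x_new):
    """Kernel interpolator f_hat(x) = sum_a c_a kappa(x - x_a), as in eq. (7)."""
    return kernel(np.atleast_1d(x_new)[:, None], x[None, :]) @ c

# Part (b) numerically: the interpolator reproduces the training outputs (up to round-off).
print(np.max(np.abs(f_hat(x) - y)))

# Smoothing variant for noisy data: c = (K + sigma^2 I)^{-1} y no longer interpolates exactly.
sigma2 = 0.1
c_smooth = np.linalg.solve(K + sigma2 * np.eye(N), y)
```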

3. (10 points) Bootstrap Variance Estimate

Unlike the sample mean, one cannot write a closed-form expression for the variance of the
sample median across data sets drawn from a distribution. This exercise will have you con-
struct a bootstrap estimate for the variance of the sample median.

On the web page is a new dataset dataForBootstrap.txt that has 50 samples from a distri-
bution. Calculate the median of the sample using whatever package you prefer. Next generate
10 bootstrap replicates. Calculate the median of each. Then compute the mean and variance
of these 10 estimates of the median. Repeat for

[10, 30, 100, 300, 1000, 3000, 10000, 30000]

replicates² and report the variance for each case. Note that if your original data set has n
samples, there are (2n − 1 choose n) possible different bootstrap replicates; so you don't have
to worry about running out.

Interpret your results. I suggest making plots of the mean and variance (of the median
for each set of replicates) as a function of the number of replicates, using a log scale for the
x-axis (number of replicates). What value would you quote for the population median, and
the variance of the median estimate?

² You can increase this to 300,000 or 3,000,000 without excessive compute time.
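A minimal Python sketch of the procedure (resampling indices with replacement; the file name is the one given above, and the exact reporting format is up to you):

```python
import numpy as np

rng = np.random.default_rng(2)

# Load the 50-sample data set from the course web page.
data = np.loadtxt("dataForBootstrap.txt")
n = len(data)
print("sample median:", np.median(data))

# For each replicate count B: draw B bootstrap replicates (n points sampled with
# replacement), take the median of each, and record the mean and variance of those medians.
for B in [10, 30, 100, 300, 1000, 3000, 10000, 30000]:
    idx = rng.integers(0, n, size=(B, n))      # B sets of resampled indices
    medians = np.median(data[idx], axis=1)     # median of each bootstrap replicate
    print(B, medians.mean(), medians.var(ddof=1))
```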

4. (25 points) Overfitting in K-Means

We discussed both soft clustering (with Gaussian mixture models fit by EM), and hard clus-
tering (with K-means), and the relation between the former and latter algorithms. For the last
homework assignment you fit Gaussian mixture models to a small, two-dimensional dataset,
examined the log-likelihood on fitting and holdout sets, noting overfitting with increasing
model sizes. (Although the log-likelihood per data point for the fitting data increases mono-
tonically with increasing number of Gaussian bumps in the mixture model, the log-likelihood
increases, then decreases on the holdout data.) This exercise has you explore the same dataset
and concepts for K-Means.

Using any language or package you prefer, write or use an existing K-means algorithm (with
Euclidean distance function) and fit a sequence of models to the first 750 vectors in the
toydata1.txt dataset. Plot the mean square distance between the data points and the centroids
of the cluster they belong to (the cost function minimized by K-means)

J = ∑_{i=1}^{K} (N_i / N) (1/N_i) ∑_{a=1}^{N_i} | x_a^(i) − m_i |²

as a function of K. (Here K is the number of clusters, N is the total number of data points,
N_i is the number of data points in cluster i, x_a^(i) is the a-th data point in cluster i, and m_i is
the centroid (or mean) of the data in the i-th cluster — see Lecture 13.)

Make such plots for both the fitting data (the first 750 data points) and the test data (the
remaining 750 points). To facilitate comparing the curves of J vs K, plot both the training
and test set curves on the same frame. Explore numbers of clusters K in the range 5–100.
(You don’t have to make models for each value in between — but use enough so you can see
what’s happening. I used 5,10,20,30,40,50,60,100 — but this is by no means a preferred set.)
Since K-means is a greedy algorithm (it moves downhill in J), and the cost function has
suboptimal local minima that the algorithm can converge to, it’s a good idea to run several
re-starts (with different initializations of the mi) for each value of K. This will ensure a nice
monotonically decreasing curve of J vs K on the training set. (I used 10 restarts for each
K, and kept the best fit on the training data for each — running the test data through the
corresponding models.)
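One possible sketch of this experiment in Python uses scikit-learn's KMeans; the assumptions that toydata1.txt loads as one 2-D point per row and that these particular K values are used mirror the description above but are not prescriptive.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumes toydata1.txt has one 2-D data point per row; first 750 rows fit, next 750 test.
data = np.loadtxt("toydata1.txt")
train, test = data[:750], data[750:1500]

def cost_J(X, centers):
    """Mean squared distance from each point to its nearest centroid (the cost J above)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    return d2.min(axis=1).mean()

Ks = [5, 10, 20, 30, 40, 50, 60, 100]
J_train, J_test = [], []
for K in Ks:
    # n_init=10 restarts K-means from 10 random initializations and keeps the best fit.
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(train)
    J_train.append(cost_J(train, km.cluster_centers_))
    J_test.append(cost_J(test, km.cluster_centers_))
# Plot J_train and J_test vs Ks on one frame (e.g. with matplotlib) for parts (a) and (c).
```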

(a) (10 points) For models fit to the training data, plot the cost J vs the number of clusters
(means) K for both the training and test data (as described above). How do these plots
compare (qualitatively) to what you found for the Gaussian mixture model, and what
you’re familiar with for regression and classification problems in general?

(b) (10 points) WHY does the K-means algorithm behave so differently with regard to
overfitting? Use diagrams and words to describe clearly what's going on with K-means in
this regard. (It is your job to be clear — this is both an exercise in reasoning about the
cost function J , and communicating your thoughts.)

(c) (5 points) Re-do the plots from part (a) using a log scale for both axes. (It doesn’t
matter if you use log base 10 or base 2 or base e, but be clear which you use.) You will
see a roughly straight line for the training data curve

ln J = ln b − m ln K .

Note that this is equivalent to

J ≈ b / K^m ,

a power law for J vs K. Find b and m using the fitting curve values with K ≥ 20 (using
a linear fit for the log data). You should see m ≈ 2/D where D is the true dimensionality
of the data.
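Continuing the sketch above, b and m can be extracted with an ordinary linear fit to the logged training curve; np.polyfit is just one convenient way to do this (natural logs used here).

```python
import numpy as np

# Linear fit of ln J = ln b - m ln K on the training curve, restricted to K >= 20.
Ks_arr = np.array(Ks)
mask = Ks_arr >= 20
slope, intercept = np.polyfit(np.log(Ks_arr[mask]), np.log(np.array(J_train)[mask]), 1)
m_fit, b_fit = -slope, np.exp(intercept)
print("m =", m_fit, " b =", b_fit)   # if the data have intrinsic dimension D, expect m near 2/D
```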

(d) (10 points EXTRA CREDIT) Use simple mathematical arguments (dimensional analysis)
to show that you should find m = 2/D. (Hints: The mean square error is proportional
to the square of the cluster length, J ∝ r²; the volume of each cluster is
Vol_cluster ≈ Vol_data set / K; while on dimensional grounds the volume of each cluster
is Vol_cluster ∝ r^D.)
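For reference, the chain of reasoning suggested by the hints can be written compactly as a LaTeX sketch (an outline of the argument, not a full derivation):

```latex
% Each of the K clusters takes roughly an equal share of the data volume,
% and on dimensional grounds that volume scales as the D-th power of the
% cluster length scale r:
\[
  \mathrm{Vol}_{\mathrm{cluster}} \approx \frac{\mathrm{Vol}_{\mathrm{data}}}{K},
  \qquad
  \mathrm{Vol}_{\mathrm{cluster}} \propto r^{D}
  \quad\Longrightarrow\quad
  r \propto K^{-1/D}.
\]
% The cost is a mean squared distance, so
\[
  J \propto r^{2} \propto K^{-2/D},
  \qquad\text{giving } m = \tfrac{2}{D}.
\]
```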


I will not be assigning an extra credit problem.
