5 MINIMUM DESCRIPTION LENGTH
• MDL as Approximation of Solomonoff’s M
• The Minimum Description Length Principle
• Application: Sequence Prediction
• Application: Regression / Polynomial Fitting
• Summary
Minimum Description Length: Abstract
The Minimum Description/Message Length principle is one of the most important concepts in Machine Learning, and serves as a scientific guide in general. The motivation is as follows: making predictions involves finding regularities in past data; regularities in data allow for compression; hence short descriptions of data should help in making predictions. In this lecture series we approach MDL from a Bayesian perspective and relate it to a MAP (maximum a posteriori) model choice. The Bayesian prior is chosen in accordance with Occam and Epicurus, and the posterior is approximated by the MAP solution. We reconsider (un)fair coin flips and compare the M(D)L estimate to Bayes-Laplace’s solution, and do the same for general sequence prediction tasks. Finally I present an application to regression / polynomial fitting.
From Compression to Prediction
The better you can compress, the better you can predict.
Being able to predict (the environment) well is key to being able to act well.
Simple Example: Consider “14159…[990 more digits]…01989”.
• If it looks random to you, you can neither compress it nor predict the 1001st digit.
• If you realize that these are the first 1000 digits of π, you can compress the sequence and predict the next digit.
Practical Example: The quality of natural language models is typically judged by their perplexity, which is essentially a compression ratio.
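To make the link explicit (a standard identity, not spelled out on the slide): a model that assigns probability p(x_t|x_{<t}) to each token can code it in −log₂ p(x_t|x_{<t}) bits via arithmetic coding, so with H the achieved bits per token,

\[ \mathrm{perplexity} \;=\; 2^{H}, \qquad H \;=\; -\frac{1}{n}\sum_{t=1}^{n} \log_2 p(x_t \mid x_{<t}), \]

i.e. fewer compressed bits per token means lower perplexity.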
Later: Sequential decision theory tells you how to exploit such models for optimal rational actions.
MDL as Approximation of Solomonoff’s M
• Approximation of Solomonoff, since M is incomputable:
• M(x) ≈ 2^−Km(x) (excellent approximation)
• Km(x) ≡ Km_U(x) ≈ Km_T(x) (approximation quality depends on T and x)
• Predicting the y of highest M(y|x) is approximately the same as
• MDL: Predict the y of smallest complexity Km_T(xy).
• Examples for x: Daily weather or stock market data.
• Example for T: Lempel-Ziv decompressor.
• Prediction = finding regularities = compression = MDL.
• Improved compressors lead to improved predictors.
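A minimal sketch of this recipe (my illustration, not from the slides), with zlib standing in for the decompressor T: predict the continuation y that yields the shortest code for xy.

import zlib

def code_length(s: bytes) -> int:
    # Crude stand-in for Km_T(s): bits needed by compressor T = zlib
    return 8 * len(zlib.compress(s, 9))

def mdl_predict(x: bytes, alphabet: bytes = b"01") -> bytes:
    # MDL: predict the y of smallest complexity Km_T(xy)
    best = min(alphabet, key=lambda y: code_length(x + bytes([y])))
    return bytes([best])

x = b"01" * 500               # a highly regular sequence
print(mdl_predict(x))         # typically b'0', the periodic continuation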
Human Knowledge Compression Contest
• compression (hard file-size numbers) = finding regularities ⇒ prediction ≈ intelligence (a slippery concept)
• Many researchers analyze data and find compact models.
• Compressors beating the current best compressors need to be smart(er).
• “universal” corpus of data ⇒ “universally” smart compressors.
• Wikipedia seems to be a good snapshot of human world knowledge.
• The ultimate compressor of Wikipedia will “understand” all human knowledge, i.e. be really smart.
• Contest: Compress Wikipedia better than the current record.
• Prize: 50’000 Euro × the relative improvement over the previous record. [http://prize.hutter1.net]
The Minimum Description Length Principle
Identification of the probabilistic model “best” describing the data: probabilistic model (= hypothesis) H_ν with ν ∈ M and data D. The most probable model is ν^MDL = argmax_{ν∈M} p(H_ν|D).
Bayes’ rule: p(H_ν|D) = p(D|H_ν)·p(H_ν)/p(D).
Occam’s razor: p(H_ν) = 2^−Kw(ν).
By definition: p(D|H_ν) = ν(x), where D = x = data sequence and p(D) = const.
Take the logarithm:
Kν(x) := −log ν(x) = length of the Shannon-Fano code of x given H_ν,
Kw(ν) = length of the code for model H_ν.
Names: two-part MDL or MAP or MML (∃ “slight” differences)

Definition 5.1 (MDL)   ν^MDL = argmin_{ν∈M} {Kν(x) + Kw(ν)}
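A minimal sketch of Definition 5.1 on a toy Bernoulli class (the class and the code lengths Kw are my illustrative choices, not from the slide):

import math

def two_part_mdl(x, models):
    # Definition 5.1: nu^MDL = argmin_nu { K_nu(x) + Kw(nu) },
    # with K_nu(x) = -log2 nu(x) the data code length in bits
    # and Kw(nu) the model code length in bits.
    return min(models, key=lambda m: -math.log2(models[m][0](x)) + models[m][1])

def bernoulli(theta):
    # Likelihood nu_theta(x_1:n) = theta^n1 * (1-theta)^n0
    return lambda x: theta ** sum(x) * (1 - theta) ** (len(x) - sum(x))

models = {                                # hypothetical class: 'fair coin' has a
    "theta=0.5": (bernoulli(0.5), 1.0),   # short code (Kw = 1 bit), biased coins
    "theta=0.7": (bernoulli(0.7), 5.0),   # longer codes (Kw = 5 bits)
    "theta=0.9": (bernoulli(0.9), 5.0),
}
x = [1, 1, 0, 1, 1, 1, 0, 1]              # n1 = 6, n0 = 2
print(two_part_mdl(x, models))            # theta=0.5: its ~1.4-bit worse data fit
                                          # is outweighed by its 4-bit shorter code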
Predict with Best Model
• Use best model from class of models M for prediction:
• Predict y with probability ν^MDL(y|x) = ν^MDL(xy) / ν^MDL(x) (3 variants)
• y^MDL = argmax_y {ν^MDL(y|x)} is the most likely continuation of x
• Special case: Kw(ν) = const. ⇒ MDL ❀ ML := Maximum Likelihood principle.
• Example: H_θ = Bernoulli(θ) with θ ∈ [0,1] and Kw(θ) := const. and ν_θ(x_1:n) = θ^n1 (1−θ)^n0 with n1 = x1+…+xn = n−n0.
⇒ θ^MDL = argmin_θ {−log θ^n1 (1−θ)^n0 + Kw(θ)} = n1/n = ν^MDL(1|x)
= ML frequency estimate. (overconfident, e.g. for n1 = 0)
• Compare with Laplace’s rule based on Bayes’ rule: θ^Laplace = (n1+1)/(n+2).
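A tiny numeric check of these two bullets (the data is made up): after observing only 0s, the ML/MDL estimate declares a 1 impossible, while Laplace’s rule does not.

def theta_mdl(x):
    # MDL with constant Kw reduces to the ML frequency estimate n1/n
    return sum(x) / len(x)

def theta_laplace(x):
    # Laplace's rule (n1+1)/(n+2)
    return (sum(x) + 1) / (len(x) + 2)

x = [0, 0, 0, 0]           # n = 4, n1 = 0
print(theta_mdl(x))        # 0.0: overconfident, a '1' gets probability zero
print(theta_laplace(x))    # 0.1666...: still leaves room for a '1'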
Application: Sequence Prediction
Instead of the Bayes mixture ξ(x) = Σ_{ν∈M} w_ν ν(x), consider the MAP/MDL model:
ν^MDL(x) = max{w_ν ν(x) : ν ∈ M}, attained by ν^MDL = argmin_{ν∈M} {Kν(x) + Kw(ν)}.
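A minimal sketch of the two quantities on the toy Bernoulli class from before (the weights w_ν = 2^−Kw(ν) are again illustrative):

def bernoulli(theta):
    # nu_theta(x) = theta^n1 * (1-theta)^n0
    return lambda x: theta ** sum(x) * (1 - theta) ** (len(x) - sum(x))

M = {0.5: 2.0 ** -1, 0.7: 2.0 ** -5, 0.9: 2.0 ** -5}   # theta -> w_nu = 2^-Kw(nu)
x = [1, 1, 0, 1, 1, 1, 0, 1]

xi = sum(w * bernoulli(t)(x) for t, w in M.items())    # Bayes mixture xi(x)
nu = max(w * bernoulli(t)(x) for t, w in M.items())    # MAP/MDL nu^MDL(x)
print(nu <= xi <= len(M) * nu)                         # True: xi dominates nu^MDL,
                                                       # but by at most a factor |M|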
Theorem 5.2 (MDL bound)
Σ_{t=1}^∞ E[(μ(x_t|x_{<t}) − ν^MDL(x_t|x_{<t}))²] ≤ 21·w_μ^{−1} = 21·2^{Kw(μ)}