
4 ALGORITHMIC PROBABILITY & UNIVERSAL INDUCTION
• The Universal a Priori Probability M
• Universal Sequence Prediction
• Universal Inductive Inference
• Martin-Löf Randomness
• Discussion

Algorithmic Probability & Universal Induction: Abstract
Solomonoff completed the Bayesian framework by providing a rigorous, unique, formal, and universal choice for the model class and the prior. I will discuss in breadth how and in which sense universal (non-i.i.d.) sequence prediction solves various (philosophical) problems of traditional Bayesian sequence prediction. I show that Solomonoff's model possesses many desirable properties: strong total and weak instantaneous bounds; in contrast to most classical continuous prior densities it has no zero p(oste)rior problem, i.e. it can confirm universal hypotheses; it is reparametrization and regrouping invariant; and it avoids the old-evidence and updating problems. It even performs well (actually better) in non-computable environments.

Problem Setup
• Since our primary purpose for doing induction is to forecast (time-series), we will concentrate on sequence prediction tasks.
• Classification is a special case of sequence prediction. (With some tricks the other direction is also true)
• This course focuses on maximizing profit (minimizing loss).
We’re not (primarily) interested in finding a (true/predictive/causal) model.
• Separating noise from data is not necessary in this setting!

Philosophy & Notation
Occam’s razor: take simplest hypothesis consistent with data.
Epicurus’ principle of multiple explanations: Keep all theories consistent with the data.
⇓
We now combine both principles:
Take all consistent explanations into account, but weight the simpler ones higher.
Formalization with Turing machines and Kolmogorov complexity
Additional notation: We denote binary strings of length l(x) = n by x = x1:n = x1x2…xn with xt ∈ B and further abbreviate x<t := x1:t−1.
Universality of M: M(x) ≥× 2^−K(ρ) · ρ(x) for all enumerable semimeasures ρ. M is enumerable, but not estimable.
Up to a multiplicative constant, M assigns higher probability to all x than any other computable probability distribution.
Proof sketch:
M(x) = ∑_{p : U(p)=x∗} 2^−l(p) ≥ ∑_{q : U(Tq)=x∗} 2^−l(Tq) = 2^−l(T) ∑_{q : T(q)=x∗} 2^−l(q) =× 2^−K(ρ)·ρ(x),
where T is a (shortest) self-delimiting program for the Turing machine computing ρ.
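The dominance property can be checked numerically on a finite analogue: a Bayes mixture over a small hand-picked class of Bernoulli measures, with simple powers of two standing in for the incomputable weights 2^−K(ρ). This is only a sketch of why a mixture dominates each of its components (the mechanism behind the proof sketch above), not a computation of M; the class and weights are illustrative assumptions.

# Toy dominance check: xi(x) = sum_nu w_nu * nu(x) over a small class satisfies
# xi(x) >= w_rho * rho(x) for every member rho -- the finite analogue of
# M(x) dominating rho up to the factor 2^-K(rho).  Not universal, illustration only.

def bernoulli(theta):
    # nu(x) = probability that an i.i.d. Bernoulli(theta) source emits bitstring x
    return lambda x: theta ** x.count('1') * (1 - theta) ** x.count('0')

CLASS_ = {'theta=1/2': (0.50, bernoulli(0.50)),   # weights: stand-ins for 2^-K(nu)
          'theta=1/4': (0.25, bernoulli(0.25)),
          'theta=3/4': (0.25, bernoulli(0.75))}

def xi(x):                                        # the mixture
    return sum(w * nu(x) for (w, nu) in CLASS_.values())

x = '110101'
for name, (w, nu) in CLASS_.items():
    assert xi(x) >= w * nu(x)                     # dominance holds by construction
    print(name, round(xi(x) / nu(x), 3), '>=', w)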

4.2 Universal Sequence Prediction: Contents
• Solomonoff, Occam, Epicurus
• Prediction
• Simple Deterministic Bound
• Solomonoff’s Major Result
• Implications of Solomonoff’s Result
• Entropy Inequality
• Proof of the Entropy Bound

Solomonoff, Occam, Epicurus
• In which sense does M incorporate Occam’s razor and Epicurus’ principle of multiple explanations?
• From M(x) ≈ 2−K(x) we see that M assigns high probability to simple strings (Occam).
• More useful is to think of x as being the observed history.
• We see from Definition 4.1 that every program p consistent with
history x is allowed to contribute to M (Epicurus).
• On the other hand, shorter programs give significantly larger contribution (Occam).

Prediction
How does all this affect prediction?
If M(x) correctly describes our (subjective) prior belief in x, then
M(y|x) := M(xy)/M(x) must be our posterior belief in y.
From the symmetry of algorithmic information
K(x, y) =+ K(y|x, K(x)) + K(x) (Theorem 2.15), and assuming
K(x, y) ≈ K(xy), and approximating K(y|x, K(x)) ≈ K(y|x),
M(x) ≈ 2−K(x), and M(xy) ≈ 2−K(xy) we get: M(y|x) ≈ 2−K(y|x)
This tells us that M predicts y with high probability iff y has an easy explanation, given x (Occam & Epicurus).
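Conditioning a mixture makes this concrete. A toy Python sketch of M(y|x) := M(xy)/M(x), with a three-component Bernoulli mixture standing in for the incomputable M; the class and weights below are illustrative assumptions, not part of the slides.

# Toy posterior prediction: predict(y, x) plays the role of M(y|x) = M(xy)/M(x).
def bern(theta):
    return lambda x: theta ** x.count('1') * (1 - theta) ** x.count('0')

WEIGHTS = [(0.50, bern(0.5)), (0.25, bern(0.9)), (0.25, bern(0.1))]

def mix(x):                              # stand-in for M(x)
    return sum(w * nu(x) for w, nu in WEIGHTS)

def predict(y, x):                       # stand-in for M(y|x)
    return mix(x + y) / mix(x)

history = '1111111111'                   # after observing many 1s ...
print(round(predict('1', history), 3))   # ... another 1 gets high probability (~0.9)
print(round(predict('0', history), 3))

The continuation with the easy explanation given the history (keep emitting 1s) receives almost all of the predictive weight, a finite-class shadow of M(y|x) ≈ 2^−K(y|x).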

Simple Deterministic Bound
Sequence prediction algorithms try to predict the continuation xt ∈ B of
a given sequence x1…xt−1. Simple deterministic bound:
∑_{t=1}^∞ |1 − M(xt|x<t)| ≤(a) −∑_{t=1}^∞ ln M(xt|x<t) =(b) −ln M(x1:∞) ≤(c) Km(x1:∞)·ln 2
(a) uses |1−a| ≤ −ln a for a ∈ (0,1]; (b) uses the product rule; (c) uses the definition of the monotone complexity Km.
Implications of Solomonoff’s result (Theorem 4.5):
• The number of times t in which M deviates from μ by more than ε is finite and bounded by c/ε², and the probability that the number of ε-deviations exceeds c/(ε²δ) is smaller than δ, where c =+ ln 2·K(μ).
• No statement is possible for which t these deviations occur.
• This holds for any computable probability distribution μ.
• How does M know to which μ?
The set of μ-random sequences differs for different μ.
• Intuition: Past data x<t are exploited to get a (coarse) estimate of μ; technically, the bounds follow from the universality M(x) ≥× 2^−K(μ)·μ(x).
Inserting dt into Dn yields Solomonoff’s Theorem 4.5.
Dn := ∑_{t=1}^n E[dt] =(a) ∑_{x1:n} μ(x1:n) ∑_{t=1}^n ln[μ(xt|x<t)/ξ(xt|x<t)] =(b) ∑_{x1:n} μ(x1:n) ln[μ(x1:n)/ξ(x1:n)] ≤ ln(1/wμ)
(a) Insert def. of dt and use product rule μ(x<t)·μ(xt|x<t) = μ(x1:t).
(b) Use the product rule again to combine the logarithms; the final inequality uses dominance ξ(x1:n) ≥ wμ·μ(x1:n).

• The universal prior wθU := 2^−K(θ) is > 0 for computable θ, and 0 for uncomp. θ.
• This effectively reduces M to a discrete class {νθ ∈ M : wθU > 0}
which is typically dense in M.
• This prior has many advantages over the classical prior (densities).
wνU := 2^−K(ν)

The Problem of Zero Prior
= the problem of confirmation of universal hypotheses
Problem: If the prior is zero, then the posterior is necessarily also zero. Example: Consider the hypothesis H = H1 that all balls in some urn or
all ravens are black (=1) or that the sun rises every day.
Starting with a prior density such as w(θ) = 1 implies that the prior P[Hθ] = 0
for all θ, hence posterior P[Hθ|1..1] = 0, hence H never gets confirmed. 3 non-solutions: define H = {ω = 1∞} | use finite population | abandon
strict/logical/all-quantified/universal hypotheses in favor of soft hyp.
Solution: Assign non-zero prior to θ = 1 ⇒ P[H|1n] → 1.
Generalization: Assign non-zero prior to all “special” θ, like 1/2 and 1/6,
which may naturally appear in a hypothesis, like “is the coin or die fair”. Universal solution: Assign non-zero prior to all comp. θ, e.g. wθU = 2−K(θ)
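A small numerical illustration of the solution, assuming for concreteness a prior that mixes a point mass p1 at θ = 1 with a uniform density; p1 = 0.5 is an arbitrary illustrative choice, not a value from the slides.

# Confirmation of H: "theta = 1" after observing n ones.
# Pure uniform density w(theta) = 1:  P[H] = 0, so P[H | 1^n] = 0 forever.
# Mixed prior: point mass p1 at theta = 1 plus (1 - p1) * uniform density, where
# P[1^n | density part] = integral_0^1 theta^n dtheta = 1/(n+1).
p1 = 0.5
for n in (1, 5, 20, 100):
    evidence = p1 * 1.0 + (1 - p1) / (n + 1)   # P[1^n] under the mixed prior
    posterior_H = p1 * 1.0 / evidence          # P[theta = 1 | 1^n]
    print(n, round(posterior_H, 4))            # 0.6667, 0.8571, 0.9545, 0.9902 -> 1
print('pure density prior: P[H | 1^n] = 0 for every n')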

Reparametrization Invariance
• New parametrization, e.g. ψ = √θ; then the ψ-density w̃(ψ) = 2√θ·w(θ) is no longer uniform if w(θ) = 1 is uniform ⇒ the indifference principle is not reparametrization invariant (RIP). (Change of variables spelled out below.)
• Jeffreys’ and Bernardo’s principles satisfy RIP w.r.t. differentiable bijective transformations ψ = f−1(θ).
• The universal prior wθU = 2−K(θ) also satisfies RIP w.r.t. simple computable f. (within a multiplicative constant)
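For completeness, the change-of-variables step behind the first bullet (standard calculus; the notation w̃ is the slide's):

\tilde w(\psi) \;=\; w(\theta)\,\Big|\tfrac{d\theta}{d\psi}\Big|
           \;=\; w(\psi^2)\cdot 2\psi
           \;=\; 2\sqrt{\theta}\,w(\theta),
\qquad\text{so}\quad w(\theta)\equiv 1 \;\Rightarrow\; \tilde w(\psi)=2\psi\not\equiv 1 .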

Regrouping Invariance
• Non-bijective transformations:
E.g. grouping ball colors into categories black/non-black.
• No classical principle is regrouping invariant.
• Regrouping invariance is regarded as a very important and desirable
property. [Walley’s (1996) solution: sets of priors]
• The universal prior wθU = 2−K(θ) is invariant under regrouping, and more generally under all simple [computable with complexity O(1)] even non-bijective transformations. (within a multiplicative constant)
• Note: Reparametrization and regrouping invariance hold for arbitrary classes and are not limited to the i.i.d. case.
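A concrete instance of the black/non-black grouping above, assuming for illustration an urn with three colors (black, white, red) and the indifference prior:

% indifference on three colors: (theta_b, theta_w, theta_r) ~ Dirichlet(1,1,1), density 2 on the simplex
p(\theta_b) \;=\; \int_0^{1-\theta_b} 2\, d\theta_w \;=\; 2\,(1-\theta_b),
% whereas indifference applied directly to {black, non-black} gives p(theta_b) = 1.

The two routes disagree, so the indifference principle is not regrouping invariant; the universal prior 2^−K(θ) changes only by O(1) under such a simple transformation.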

Universal Choice of Class M
• The larger M the less restrictive is the assumption μ ∈ M.
• The class MU of all (semi)computable (semi)measures, although only countable, is pretty large, since it includes all valid physics theories. Further, ξU is itself semi-computable [?].
• Solomonoff’s universal prior M(x) := probability that the output of a universal TM U with random input starts with x.
• Formally: M(x) := ∑_{p : U(p)=x∗} 2^−l(p), where the sum is over all (minimal) programs p for which U outputs a string starting with x.
• M may be regarded as a 2−l(p)-weighted mixture over all deterministic environments νp. (νp(x) = 1 if U(p) = x∗ and 0 else)
• M(x) coincides with ξU(x) within an irrelevant multiplicative constant.
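A toy sketch of this defining sum, with a tiny hand-made prefix machine standing in for the universal U, so that the enumeration stays finite and computable; the encoding 1^k 0 w (output: the payload w repeated forever) and the cutoff max_k are illustrative assumptions, and the result is not M, just the shape of the computation.

# Toy version of M(x) := sum over programs p with U(p) = x* of 2^-l(p).
# "Machine": program = 1^k 0 w with |w| = k, so l(p) = 2k+1 and programs are prefix-free;
# the machine outputs the payload w repeated forever.  It is NOT universal.
from itertools import product

def M_toy(x, max_k=12):
    total = 0.0
    for k in range(1, max_k + 1):
        for bits in product('01', repeat=k):
            w = ''.join(bits)
            out = (w * (len(x) // k + 1))[:len(x)]    # first len(x) output bits
            if out == x:                              # output starts with x
                total += 2.0 ** -(2 * k + 1)
    return total

for x in ['0', '1', '01', '0101', '0111']:
    print(x, M_toy(x))
# periodic continuations such as '0101' have short programs and hence receive
# noticeably more weight than irregular strings such as '0111'.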

The Problem of Old Evidence / New Theories
• What if some evidence E =̂ x (e.g. Mercury’s perihelion advance) is known well before the correct hypothesis/theory/model H =̂ μ (Einstein’s general relativity theory) is found?
• How shall H be added to the Bayesian machinery a posteriori?
• What should the “prior” of H be?
• Should it be the belief in H in a hypothetical counterfactual world in which E is not known?
• Can old evidence E confirm H?
• After all, H could simply be constructed/biased/fitted towards “explaining” E.

Solution of the Old-Evidence Problem
• The universal class MU and universal prior wνU formally solve this problem.
• The universal prior of H is 2−K(H) independent of M and of whether E is known or not.
• Updating M is unproblematic, and not even necessary when starting with MU, since it includes all hypotheses (including yet unknown or unnamed ones) a priori.

Universal is Better than Continuous M
• Although νθ() and wθ are incomp. for cont. classes M for most θ,
ξ() is typically computable. (exactly as for Laplace or numerically) ⇒ M(x) ≥× 2^−K(ξ)·ξ(x)
• That is, M is superior to all computable mixture predictors ξ based on any (continuous or discrete) model class M and weight w(θ), save an additive constant K(ξ) ln 2 = O(1), even if environment μ is not computable.
• While Dn(μ||ξ) ∼ (d/2)·ln n for all μ ∈ M,
Dn(μ||M) ≤ K(μ)·ln 2 is even finite for computable μ. Conclusion:
Dn(μ||M) <+ Dn(μ||ξ) + K(ξ)·ln 2 for all μ.
Solomonoff prediction works also in non-computable environments.

Convergence and Bounds
• Total (loss) bounds: ∑_{n=1}^∞ E[hn] <+ K(μ)·ln 2, where ht(ω<t) := ∑_{a∈B} (√M(a|ω<t) − √μ(a|ω<t))² is the instantaneous (Hellinger) distance between M and μ; the bound follows from the universality of M.
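The flavour of these bounds can be seen in a small simulation sketch, with a finite Bernoulli class and uniform weights standing in for the incomputable M; the grid, the uniform weights, and μ = Bernoulli(0.7) are illustrative assumptions.

# Bayes mixture xi over Bernoulli(theta), theta in a finite grid, data from mu = Bernoulli(0.7).
# The expected cumulative squared prediction error is bounded by ln(1/w_mu), the discrete
# analogue of the K(mu) ln 2 bound for M; a single sampled path typically stays well below it.
import math, random

random.seed(0)
thetas = [i / 10 for i in range(1, 10)]          # class: theta = 0.1, ..., 0.9
w = {th: 1 / len(thetas) for th in thetas}       # uniform weights as stand-in for 2^-K
mu = 0.7
loglik = {th: 0.0 for th in thetas}              # log nu_theta(x_<t)

cum_sq_err = 0.0
for t in range(2000):
    logpost = {th: math.log(w[th]) + loglik[th] for th in thetas}
    m = max(logpost.values())
    post = {th: math.exp(lp - m) for th, lp in logpost.items()}
    z = sum(post.values())
    pred1 = sum(post[th] / z * th for th in thetas)   # xi(1 | x_<t)
    cum_sq_err += (pred1 - mu) ** 2
    x = 1 if random.random() < mu else 0              # sample next bit from mu
    for th in thetas:
        loglik[th] += math.log(th if x == 1 else 1 - th)

print('cumulative squared error:', round(cum_sq_err, 3))
print('bound ln(1/w_mu)        :', round(math.log(len(thetas)), 3))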

Properties of ML-Random Sequences
• Special case of μ being a fair coin, i.e. μ(x1:n) = 2−n, then
x1:∞ is random ⇐⇒ Km(x1:n) =+ n, i.e. iff x1:n is incompressible.
• For general μ, −logμ(x1:n) is the length of the Arithmetic code of x1:n, hence x1:∞ is μ-random ⇐⇒ the Arithmetic code is optimal.
• One can show that a μ-random sequence x1:∞ passes all thinkable effective randomness tests, e.g. the law of large numbers, the law of the iterated logarithm, etc.
• In particular, the set of all μ-random sequences has μ-measure 1.
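A crude, purely illustrative check of the incompressibility characterization, using zlib's compressed length as a very rough stand-in for Km; real Km is incomputable and zlib is of course not a universal machine, so this only conveys the qualitative picture.

# Compressed length as a rough proxy for Km: coin-flip data stays about as long as it is,
# a periodic pattern collapses to a tiny description.
import random, zlib

random.seed(1)
n = 4000
fair_coin = bytes([random.getrandbits(8) for _ in range(n)])  # "mu-random"-looking data
regular   = bytes([0, 1] * (n // 2))                          # highly compressible pattern

for name, s in [('fair coin', fair_coin), ('regular', regular)]:
    print(name, len(s), '->', len(zlib.compress(s, 9)))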

4.5 Discussion: Contents
• Limitations of Other Approaches
• Summary
• Exercises
• Literature

Limitations of Other Approaches 1
• Popper’s philosophy of science is seriously flawed:
– falsificationism is too limited,
– corroboration ≡ confirmation or meaningless,
– simple ̸= easy-to-refute.
• No free lunch myth relies on unrealistic uniform sampling. Universal sampling permits free lunch.
• Frequentism: definition circular,
limited to i.i.d. data, reference class problem.
• Statistical Learning Theory: Predominantly considers i.i.d. data: Empirical Risk Minimization, PAC bounds, VC-dimension, Rademacher complexity, Cross-Validation.

Limitations of Other Approaches 2
• Subjective Bayes: No formal procedure/theory to get prior.
• Objective Bayes: Right in spirit, but limited to small classes
unless community embraces information theory.
• MDL/MML: practical approximations of universal induction.
• Pluralism is globally inconsistent.
• Deductive Logic: Not strong enough to allow for induction.
• Non-monotonic reasoning, inductive logic, default reasoning do not properly take uncertainty into account.
• Carnap’s confirmation theory: Only for exchangeable data. Cannot confirm universal hypotheses.
• Data paradigm: Data may be more important than algorithms for “simple” problems, but a “lookup-table” AGI will not work.
• Eliminative induction ignores uncertainty and information theory.

Summary
• Solomonoff’s universal a priori probability M(x)
= Occam + Epicurus + Turing + Bayes + Kolmogorov
= output probability of a universal TM with random input
= enum. semimeasure that dominates all enum. semimeasures
≈ 2−Kolmogorov complexity(x)
• M(xt|x<t) converges rapidly to μ(xt|x<t) for any computable μ.

Exercises
5. [C10] Prove the claim about (rapid) convergence after Theorem 4.5
(Hint: Markov-Inequality).
6. [C20] Prove the instantaneous bound M(1|0n) =× 2−K(n).

Literature
[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1–22 and 224–254, 1964.
[LV08] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, Berlin, 3rd edition, 2008.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. http://www.hutter1.net/ai/uaibook.htm
[Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007. http://arxiv.org/abs/0709.1516
[RH11] S. Rathmanner and M. Hutter.
A philosophical treatise of universal induction. Entropy, 13(6):1076–1136, 2011. http://dx.doi.org/10.3390/e13061076