
COS597D: Information Theory in Computer Science October 19, 2011 Lecture 10
Lecturer: Mark Braverman
Scribe: Andrej Risteski∗
1 Kolmogorov Complexity
In the previous lectures, we became acquainted with the concept of Shannon entropy, which is designed to capture the randomness of a distribution X over a set, i.e. the relative frequencies of its elements; it is a property of distributions rather than of individual strings.
Intuitively, however, we would like a notion that captures the fact that:
• 111…111 is not very random.
• 3.14159… is not very random (since we can write a short program outputting π to arbitrary accuracy, and so predict the next digit easily).
• 78149512… is “fairly” random.
In other words, we want to describe how “compressible” and/or “hard to describe” a string is.
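To make the intuition concrete, here is a toy sketch (ours, not from the lecture) showing that the first two strings above have short generating programs; the function names and the use of Machin's formula are illustrative choices.

```python
def ones(n: int) -> str:
    # "111...111": a constant-size program plus ~log n bits to specify n.
    return "1" * n

def arctan_inv(x: int, unity: int) -> int:
    # Integer-arithmetic Taylor series for arctan(1/x), scaled by `unity`.
    total = term = unity // x
    n = 3
    while term:
        term //= x * x
        total += term // n if n % 4 == 1 else -(term // n)
        n += 2
    return total

def pi_digits(n: int) -> str:
    # Machin's formula: pi = 4*(4*arctan(1/5) - arctan(1/239)).
    unity = 10 ** (n + 10)  # 10 extra guard digits
    pi = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi)[:n]      # "31415926...": arbitrary accuracy from one short program

print(ones(20))      # 11111111111111111111
print(pi_digits(8))  # 31415926
```

No comparably short description is expected to exist for a “fairly random” string like the third one.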
Definition 1. Let U be a universal computer. A valid execution of U on p is one that terminates and reads the entire string p.
Note that the programs on which U executes validly form a prefix-free set. Indeed, if p is a prefix of p′ and U is valid on p, then U terminates once it is done reading p – so it never reads all of p′, and the execution on p′ cannot be valid.
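As a quick illustration of the prefix-free property (a sanity check of our own, not from the notes):

```python
def is_prefix_free(strings) -> bool:
    # No string in the set may be a proper prefix of another.
    return not any(a != b and b.startswith(a) for a in strings for b in strings)

assert is_prefix_free(["0", "10", "110", "111"])
assert not is_prefix_free(["0", "01"])  # "0" is a proper prefix of "01"
```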
Definition 2. The Kolmogorov complexity K_U(x) of a string x is defined as

K_U(x) = min_{p : U(p) = x} l(p),

where l(p) denotes the length of the string p.
In other words, the Kolmogorov complexity of x is the length of the shortest program outputting x. Let's note that
this definition is fairly robust.
Claim 3. If U is a universal computer and A is another computer, then there is some constant C_A s.t. K_U(x) ≤ K_A(x) + C_A for all strings x.
Proof This is almost immediate. One can write a compiler translating programs for A into programs for U. This is some constant-sized program – so the claim holds.
Having the claim above in mind, we will henceforth drop the subscript U in the notation for Kolmogorov complexity. Now, let’s note a few properties about Kolmogorov complexity.
Claim 4. K(x | l(x)) ≤ l(x) + C
Proof Just use the trivial encoding as a program which outputs the sequence of bits in x.
Claim 5. K(x) ≤ l(x) + 2 log l(x) + C
∗Portions of the notes are based on “Elements of Information Theory”, T.M.Cover and J.A.Thomas
Proof We need to somehow encode the length of the string, and then we can use the trivial encoding as above. We can do this by writing the length of the string as a binary number, doubling each of the digits in its binary representation, and appending the string “01” as a delimiter. For example, if the length is 6, i.e. 110 in binary, we encode it as 11110001. Since the doubled pairs are only 00 or 11, the first occurrence of 01 unambiguously marks the end of the length. This gives us the bound we need.
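Here is a minimal sketch of this self-delimiting encoding; the function names are ours, but the scheme is exactly the one in the proof.

```python
def encode(x: str) -> str:
    length_bits = bin(len(x))[2:]                  # e.g. 6 -> "110"
    doubled = "".join(b + b for b in length_bits)  # "110" -> "111100"
    return doubled + "01" + x                      # delimiter, then x itself

def decode(s: str) -> str:
    i, length_bits = 0, ""
    while s[i:i + 2] != "01":   # doubled pairs are only "00" or "11"
        length_bits += s[i]     # each pair encodes one bit of the length
        i += 2
    n = int(length_bits, 2)
    return s[i + 2:i + 2 + n]   # the next n bits are x itself

assert encode("101011").startswith("11110001")  # length 6 -> 11110001, as above
assert decode(encode("101011")) == "101011"
```

The overhead is 2 log l(x) + O(1) bits for the length, giving the bound of Claim 5.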
As a remark, we can keep applying the trick in the proof above recursively. For instance, by doing it one more time, we get K(x) ≤ l(x) + log l(x) + 2 log log l(x) + C. By continuing the recursion for as long as the terms remain positive, we get an encoding of size l(x) + log l(x) + log log l(x) + · · · + C.
Claim 6. The number of strings of small Kolmogorov complexity satisfies |{x : K(x) < k}| < 2^k.

Proof Trivial. There are fewer than 2^k programs of length less than k – so they describe fewer than 2^k different strings.

Theorem 7. Kolmogorov complexity is not computable.

Proof Suppose there is a program Q(x) that outputs K(x). Then, consider the following program:

P(n): for each x ∈ {0,1}^n in lexicographic order, if x is the first string with K(x) ≥ n (using the description of Q to check this), output x and halt.

Such a string exists by Claim 6. Now, if x_n is the string output by P on input n, then K(x_n) ≥ n. However, K(x_n) ≤ c + 2 log n – by just using the program above, which has a constant-size description (given that the description of Q is just a constant as well), and using 2 log n bits for encoding n. For large enough n, we have a contradiction.

2 Kolmogorov Complexity and Entropy

We note here a relationship between Kolmogorov complexity and entropy.

Lemma 8. For any computer U,

Σ_{p : U(p) halts} 2^{-l(p)} ≤ 1.

Proof This is fairly simple, and based on the proof of Kraft's inequality in “Elements of Information Theory”. Take any finite set of strings p s.t. U(p) halts; we will call these strings “valid” in the proof to follow. We can view the valid strings as nodes in a full binary tree, where each string is located at the node we get by tracing out the sequence of bits in the string from the root. Let l_max be the maximal length of any of the valid strings, i.e. the maximal depth of the tree at which there is a node representing a valid string. Consider all nodes at depth l_max of the tree. Each of them either denotes a valid string, a descendant of a valid string, or neither. A valid string at level l has exactly 2^{l_max − l} descendants at level l_max, and these descendant sets are disjoint, since the valid strings are prefix-free. So

Σ_{p valid} 2^{l_max − l(p)} ≤ 2^{l_max}, i.e. Σ_{p valid} 2^{-l(p)} ≤ 1,

as we need. Since this holds for any finite subset of the strings p s.t. U(p) halts, it holds in the infinite case as well.

Lemma 9. If the x_i are i.i.d. random variables, then

(1/n) E[K(x_1, x_2, ..., x_n)] → H(x_1), as n → ∞.

This will not be proven here – only the following partial result:

Theorem 10. Let x_1, x_2, ..., x_n be i.i.d. B_{1/2} bits. Then Pr[K(x_1, x_2, ..., x_n) < n − k] < 2^{-k}.

Proof By Claim 6, fewer than 2^{n−k} strings have complexity less than n − k, and each string of length n occurs with probability exactly 2^{-n} – so the probability in question is less than 2^{n−k} · 2^{-n} = 2^{-k}.

3 Universal Probability

Definition 13. The universal probability of a string x is

P_U(x) = Σ_{p : U(p) = x} 2^{-l(p)} = Pr[U(p) = x],

the probability that U outputs x when its program p is a sequence of fair coin flips. By Lemma 8, Σ_x P_U(x) ≤ 1.
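To make the definition concrete, here is a toy sketch (ours, not from the notes) of how one could estimate P_U(x) from below; `run(p, max_steps)` is a hypothetical step-bounded simulator that returns the output of program p if it halts within the budget and None otherwise, and is assumed to accept only a prefix-free set of programs.

```python
from itertools import product

def estimate_PU(run, x: str, max_len: int = 12, max_steps: int = 1000) -> float:
    # Lower bound on P_U(x): enumerate all programs up to length max_len,
    # run each for max_steps steps, and add 2^{-l(p)} whenever the output is x.
    est = 0.0
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if run(p, max_steps) == x:
                est += 2.0 ** (-length)
    return est  # grows toward P_U(x) as max_len and max_steps grow
```

Since this is a partial sum of the defining series, the estimate never overshoots P_U(x).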
Theorem 14. There is a constant C s.t. for every string x,

2^{-K(x)} ≤ P_U(x) ≤ 2^C · 2^{-K(x)}.

Proof The proof that P_U(x) ≥ 2^{-K(x)} is trivial – just by definition, the sum defining P_U(x) includes the term 2^{-K(x)}, contributed by the shortest program for x.

We can rewrite the second part as K(x) ≤ log(1/P_U(x)) + C. In other words, we wish to find a short program for strings with high P_U(x).

The problem is that P_U(x) is not computable, so despite the fact that we know it will be large for simple strings, we can't just brute-force run all programs of length ≤ log(1/P_U(x)) and find one. So we have to be somewhat clever.
The short program for x will simulate the universal computer U on all programs in lexicographic order, and keep an estimate of P_U(x′) for every string x′ as it goes along. Then, we will specify a particular short string which will allow the program to recover x from these estimates.
We can simulate U on all programs as follows. Let p_1, p_2, ... be the programs in lexicographic order, and let x_1, x_2, ... be the corresponding outputs (when they terminate, of course). We run the simulation in stages: in stage i, we run programs p_1, p_2, ..., p_i for i steps each. That way, each program that terminates will eventually be run until it terminates.
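A minimal sketch of this staged (“dovetailing”) schedule, assuming a hypothetical simulator interface resume(p, i) that runs program p for i steps from scratch and reports whether it has halted and, if so, its output:

```python
def dovetail(programs, resume, stages):
    # Yield (k, output) the first time each program p_k is seen to halt.
    halted = set()
    for i in range(1, stages + 1):              # stage i
        for k in range(min(i, len(programs))):  # run p_1, ..., p_i ...
            if k in halted:
                continue
            done, out = resume(programs[k], i)  # ... for i steps each
            if done:
                halted.add(k)
                yield k, out  # p_k terminates within i steps of the start
```

A program halting in t steps is reported no later than stage max(t, its index), so nothing is missed.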
Furthermore, we handle the terminations of programs as follows. If the program p_k terminates in exactly i steps in the i-th stage (i.e. we see it terminate for the first time) with output x_k, we increase the estimate P̃_U(x_k) by 2^{-l(p_k)}. Then, we calculate n_k = ⌈log(1/P̃_U(x_k))⌉. Now, we build a binary tree in whose nodes triplets (p_k, n_k, x_k) will live.
live. We will assign the triplets to the nodes incrementally. Namely, every time a program pk outputs a string x , we check whether the new estimate of n , after updating P ̃ (x ) has changed. If that is so, then
kkUk
we find any unassigned node on level nk + 1, and assign it the triplet (pk , nk , xk ).
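Putting the two previous pieces together, here is a sketch of the bookkeeping (ours), using the `dovetail` generator from above and a hypothetical `place(triplet, depth)` routine such as the greedy one sketched after Lemma 15 below:

```python
from math import ceil, log2

def build_tree(programs, resume, stages, place):
    estimate = {}  # x -> current lower-bound estimate of P_U(x)
    level = {}     # x -> last n_k for which a triplet was placed
    for k, x in dovetail(programs, resume, stages):
        # p_k just halted with output x: bump the estimate by 2^{-l(p_k)}
        estimate[x] = estimate.get(x, 0.0) + 2.0 ** (-len(programs[k]))
        n = ceil(-log2(estimate[x]))   # n_k = ceil(log(1 / estimate))
        if level.get(x) != n:          # the value of n_k has changed
            level[x] = n
            place((programs[k], n, x), n + 1)  # request a node at depth n_k + 1
```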
There are two things we need to prove. One is that using this construction, we get a program of the required size outputting x. The other is that it is always possible to place the triplets as specified above (i.e. whenever we need to put a node on some level, there is always at least one unassigned node on that level).
For the first part, the encoding of x can be the path to the lowest-depth node (i.e. the one closest to the root) containing a triplet (p_k, n_k, x_k) with x_k = x. By construction, this depth is at most ⌈log(1/P_U(x))⌉ + 1 ≤ log(1/P_U(x)) + 2 – so K(x) ≤ log(1/P_U(x)) + C, as we need.
For the second part, the proof resembles the Kraft inequality proof mentioned above. We sketch it here as well, because the idea is simple but elegant, and is used in other places too. We state it in the form of a general lemma.
Lemma 15. Let r_1, r_2, ..., r_n be a sequence of requests to place index i at depth r_i of a prefix-free binary coding tree. Then, it's possible to satisfy all of the requests online if Σ_{i=1}^n 2^{-r_i} ≤ 1.
Proof We can intuitively think of 2^{-r_i} as the “volume” of request i, as placing it blocks the entire subtree below its node. So, really, what the statement of the lemma says is that the obvious necessary condition on the volumes is also sufficient.
We prove that the trivial placement strategy – greedy leftmost – actually works. In other words, when we get a new request r_i, we place the element at the leftmost available node at depth r_i.
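A minimal sketch of the greedy leftmost strategy (ours; it identifies a node with the bit string of the path leading to it, and checks availability via prefix-freeness):

```python
from itertools import product

def leftmost_available(assigned, depth):
    # A node is free if no assigned node is its ancestor or descendant.
    def free(node):
        return not any(a == node[:len(a)] or node == a[:len(node)]
                       for a in assigned)
    for bits in product("01", repeat=depth):  # left-to-right order
        node = "".join(bits)
        if free(node):
            return node
    return None

def place_all(requests):
    # Serve requests online; Lemma 15 guarantees success if sum 2^{-r_i} <= 1.
    assigned = []
    for r in requests:
        node = leftmost_available(assigned, r)
        assert node is not None
        assigned.append(node)
    return assigned

print(place_all([1, 2, 3, 3]))  # volumes sum to 1: ['0', '10', '110', '111']
```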
The proof is by induction on n.
The case n = 1 is trivial.
So, suppose the claim holds whenever the number of requests is less than n, and suppose we get n requests. Suppose also, for the sake of contradiction, that we get stuck inserting r_n. Let k be the depth of the deepest element in the right subtree R (WLOG assume R is not empty, otherwise the claim is trivial: we could just insert the request in the right subtree – so we would get a contradiction immediately). We prove that in this case both subtrees have to be fairly full, and the volume constraint is violated.

Neither an element at depth k nor one at depth r_n fit into the left subtree L. Indeed, otherwise, in the former case we could have put the elements at depth k in the left subtree, and in the latter case we could have satisfied r_n by putting the element in the left subtree.

Let's define l := max(k, r_n), and let d_1, d_2, ..., d_m be the depths of the nodes assigned to the left subtree L. Then, since neither a node at depth k nor one at depth r_n fits into the left subtree,

w(L) = Σ_{i=1}^m 2^{-d_i} > 1/2 − 2^{-l}.

Similarly, since r_n didn't fit in R, w(R) + 2^{-r_n} > 1/2. Furthermore, all elements of R are at depth ≤ l, therefore w(R) is an integer multiple of 2^{-l}; since 2^{-r_n} is one as well, w(R) + 2^{-r_n} ≥ 1/2 + 2^{-l}.

Putting everything together,

w(L) + w(R) + 2^{-r_n} > (1/2 − 2^{-l}) + (1/2 + 2^{-l}) = 1,

which is a contradiction, since w(L) + w(R) + 2^{-r_n} = Σ_{i=1}^n 2^{-r_i} ≤ 1. Hence, the claim follows by induction.

With the lemma above, the last piece of the proof is almost complete. Once we verify that the sequence of requests in our particular case satisfies the constraints of Lemma 15, we are done. But this is easy. In our particular case, we get
Σ_{i=1}^n 2^{-r_i} = Σ_{x ∈ {0,1}*} Σ_{n_k : there is a request to insert x at level n_k + 1} 2^{-(n_k + 1)}
                   = 2^{-1} Σ_{x ∈ {0,1}*} Σ_{n_k : there is a request to insert x at level n_k + 1} 2^{-n_k}.
But now, since we insert a new triplet only when the estimate of n_k changes, and n_k never drops below ⌈log(1/P_U(x))⌉ because the estimate P̃_U(x) never exceeds its actual value P_U(x), the above summation is at most:
2^{-1} Σ_{x ∈ {0,1}*} Σ_{i=0}^∞ 2^{-⌈log(1/P_U(x))⌉ − i}
= Σ_{x ∈ {0,1}*} 2^{-1} · 2^{-⌈log(1/P_U(x))⌉} · Σ_{i=0}^∞ 2^{-i}
= Σ_{x ∈ {0,1}*} 2^{-1} · 2^{-⌈log(1/P_U(x))⌉} · 2
≤ Σ_{x ∈ {0,1}*} P_U(x) ≤ 1,

where the last inequality comes from Lemma 8. This finishes the proof.