
Prof. Mark Bun
CAS CS 591 B: Communication Complexity
Lecture Notes 12: One-way communication, streaming
Fall 2019
Reading.
• Rao-Yehudayoff Chapter 10, Roughgarden Chapters 1 & 2
We’ll start looking in depth at an application of communication complexity to lower bounds for streaming algorithms. The data stream model is motivated by applications to analyzing massive data sets: ones which are too large to fit entirely in memory. For instance, we might be interested in monitoring traffic at a network switch and would like to be able to produce useful summary statistics about the traffic pattern. A streaming algorithm receives as input a sequence of elements x_1, …, x_m from a domain [n] one-by-one. With space m log n the algorithm can store all of the data. We’d like to be able to design algorithms which use far less space than this, ideally polylog(m, n), as well as prove lower bounds on the memory requirements of such algorithms.
1 Frequency Moments
For a data stream x_1, …, x_m ∈ [n] and universe item j ∈ [n], let f_j = |{i : x_i = j}| denote the frequency of item j in the stream. For k ≥ 0, the k-th frequency moment of the stream is defined as
F_k = Σ_{j=1}^n f_j^k.
Interpreting 0^0 = 0, the 0-th frequency moment F_0 is the number of distinct elements in the stream. F_1 is simply the number of elements in the stream, m. F_2 is a natural measure of how far from uniform a data stream is. (A stream in which every element is distinct has F_2 = m, whereas a stream that is concentrated on a single element has F_2 = m².) And finally, we can define
F_∞ = max_{j∈[n]} f_j
to be the maximum number of occurrences of any element.
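To make the definitions concrete, here is a small brute-force Python sketch (mine, not from the notes) that computes the moments of a toy stream by storing every frequency, i.e., with none of the space savings discussed below.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Brute-force F_k: store every frequency (no attempt to save space)."""
    freqs = Counter(stream)        # only items that actually appear, so 0^0 = 0 is automatic
    if k == float("inf"):
        return max(freqs.values())  # F_infinity = count of the most frequent item
    return sum(f ** k for f in freqs.values())

stream = [3, 1, 4, 1, 5, 1]                        # m = 6 elements
print(frequency_moment(stream, 0))                 # 4 distinct elements
print(frequency_moment(stream, 1))                 # 6 = m
print(frequency_moment(stream, 2))                 # 1 + 9 + 1 + 1 = 12
print(frequency_moment(stream, float("inf")))      # 3: the item 1 appears three times
```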
If we allow for both randomization and approximation, then we can construct algorithms for
estimating F0 and F2 with extremely low space.
Theorem 1 (Alon-Matias-Szegedy96). F0 and F2 can be approximated to multiplicative error (1±ε)
with probability at least 1 − δ using space O((log n + log m) · log(1/δ)/ε²).
The details of these algorithms are beyond the scope of this lecture, but you can find good expositions of them in the reading material.
To think about how to design an F_0 estimator, let h : [n] → [0, 1] be a random function. The streaming algorithm will simply keep track of the minimum value of h(j) seen in the stream. To see why this is helpful, suppose we see k distinct elements in the stream. Their hash values are k independent uniform samples from [0, 1], and the expected minimum of k such samples is 1/(k + 1). Hence if u ∈ [0, 1] is the minimum value of h(j) seen in the stream, a good estimator for k is 1/u − 1.
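Here is a minimal Python sketch of this idea (my own illustration). The "random function" h is simulated by memoizing fresh uniform values in a dictionary, which of course uses more space than a true streaming algorithm; the real constructions replace it with a small pseudorandom hash. The minima of several independent runs are averaged before inverting to tame the variance.

```python
import random

def estimate_f0(stream, trials=200, seed=0):
    """Idealized distinct-element estimator based on the minimum hash value."""
    rng = random.Random(seed)
    minima = []
    for _ in range(trials):
        h = {}                      # memoized "random function" h : [n] -> [0, 1]
        u = 1.0                     # the only value a streaming algorithm would keep
        for j in stream:
            if j not in h:
                h[j] = rng.random()
            u = min(u, h[j])
        minima.append(u)
    avg_min = sum(minima) / trials  # concentrates around 1/(k + 1) for k distinct items
    return 1.0 / avg_min - 1.0

stream = [3, 1, 4, 1, 5, 1, 9, 2, 6]   # 8 distinct elements
print(estimate_f0(stream))             # should be roughly 8
```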
Similarly, the idea for the F_2 algorithm is to start by designing a "basic" unbiased estimator for F_2, which can then be run many times in parallel to amplify its accuracy and its success probability. Let h : [n] → {−1, 1} be a random hash function. We initialize Z = 0, and every time an element j ∈ [n] appears in the stream, we add h(j) to Z. Finally, we output Z². To see that this is an unbiased estimator we compute
E[Z²] = E[(Σ_{j=1}^n h(j) f_j)²] = E[Σ_{j=1}^n h(j)² f_j² + Σ_{j≠k} h(j) h(k) f_j f_k] = Σ_{j=1}^n f_j² = F_2,
since h(j)² = 1 always and E[h(j) h(k)] = 0 for j ≠ k.
Some details that need to be checked include: showing that the estimator has low variance, showing that the hash function can be stored in small space (it actually needs to be derandomized using 4-wise independence), and showing that the error/failure probabilities can be amplified.
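The following Python sketch (mine) implements the basic estimator above with an idealized ±1 hash, again memoized in a dictionary rather than drawn from a 4-wise independent family, and averages independent copies to reduce the variance; the median trick from Theorem 1 would then boost the success probability.

```python
import random

def f2_basic(stream, rng):
    """One copy of the basic estimator: Z = sum_j h(j) * f_j, output Z^2."""
    h = {}                          # idealized random sign hash [n] -> {-1, +1}
    z = 0
    for j in stream:
        if j not in h:
            h[j] = rng.choice((-1, 1))
        z += h[j]                   # adding h(j) on every occurrence gives Z = sum_j h(j) f_j
    return z * z                    # unbiased: E[Z^2] = F_2

def f2_estimate(stream, copies=500, seed=0):
    """Average independent copies of the basic estimator."""
    rng = random.Random(seed)
    return sum(f2_basic(stream, rng) for _ in range(copies)) / copies

stream = [3, 1, 4, 1, 5, 1]         # frequencies 1, 3, 1, 1, so F_2 = 12
print(f2_estimate(stream))          # should be close to 12
```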
Can either of these algorithms be generalized to k > 2? It turns out the answer is no: in general there is an Ω(n^{1−2/k}) space lower bound for computing the k-th frequency moment for k > 2. The proofs go by way of reductions from two-party communication problems. To illustrate, let’s see a lower bound of Ω(n) for the case of k = ∞.
Theorem 2. Any streaming algorithm which, with probability at least 2/3, estimates F_∞ to multiplicative error (1 ± 0.2) requires space Ω(min{m, n}).
Proof. We show that a streaming algorithm for estimating F_∞ with space s on a stream of length m ≤ 2n over the universe [n] would allow us to solve the Disjointness problem with communication s. The reduction is as follows. On input x, Alice initializes the streaming algorithm and feeds it every i ∈ [n] such that x_i = 1. She then sends the state of the streaming algorithm to Bob using s bits of communication. Bob then feeds the algorithm every i ∈ [n] for which y_i = 1. Observe that if DISJ_n(x, y) = 0 then F_∞ ≤ 1, since every i appears at most once in the stream. On the other hand, if DISJ_n(x, y) = 1 then F_∞ = 2. Being able to approximate F_∞ to within a (1 ± 0.2) multiplicative factor allows us to distinguish these two cases and thus solve Disjointness. Since Disjointness requires Ω(n) randomized communication, it follows that s = Ω(n) = Ω(min{m, n}).
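The reduction can be written out as a short driver. In the Python sketch below (my own; the feed/get_state/load_state/estimate interface for the streaming algorithm is hypothetical), the only "communication" is the serialized state that Alice hands to Bob.

```python
def solve_disjointness(x, y, make_estimator):
    """Reduce DISJ_n to F_infinity estimation as in the proof of Theorem 2.

    make_estimator() is assumed to return a streaming F_infinity estimator
    with methods feed(j), get_state(), load_state(state), and estimate();
    this interface is invented for illustration.
    """
    n = len(x)

    # Alice: feed every i with x_i = 1, then send the algorithm's state (s bits).
    alice = make_estimator()
    for i in range(n):
        if x[i] == 1:
            alice.feed(i)
    state = alice.get_state()

    # Bob: resume from Alice's state and feed every i with y_i = 1.
    bob = make_estimator()
    bob.load_state(state)
    for i in range(n):
        if y[i] == 1:
            bob.feed(i)

    # Disjoint sets give F_infinity <= 1, intersecting sets give F_infinity = 2,
    # so a (1 +/- 0.2)-approximation cleanly separates the two cases.
    return 0 if bob.estimate() < 1.5 else 1
```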
2 One-Way Communication
The proof of Theorem 2 did not use the full power of the Disjointness lower bound. The protocol for Disjointness that we reduced to involved only a single message from Alice to Bob. Such "one-way" communication lower bounds are almost always sufficient to prove lower bounds in streaming.
Definition 3. Let f : X × Y → {0, 1} and ε > 0. The randomized one-way communication complexity, denoted R^{A→B}_ε(f), is the least cost of a public-coin randomized protocol computing f to error ε in which the only communication consists of a single message from Alice to Bob.
n 2 
􏰕 h(j)fj j=1
E[Z2] = E
=E 􏰕h(j)2fj2+􏰕h(j)h(k)fjfk
n 
ε

2.1 Lower Bound for Indexing
In the Indexing problem INDn, Alice is given a string x ∈ {0,1}n and Bob is given an index i ∈ [n]. The goal is for Bob to compute xi given one-way communication from Alice. Note that if we allow communication from Bob to Alice, this problem is easy since he can just send the log n bit index i. Nevertheless, when communication only goes from Alice to Bob, this problem requires communication Ω(n).
To state the result, it will be helpful for us to define the binary entropy function h(ε) = ε log(1/ε) + (1 − ε) log(1/(1 − ε)). This is simply the entropy of a binary random variable B which takes value 1 with probability ε.
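For reference, here is a tiny numerical helper for h (mine, not from the notes), guarding the endpoints where the formula would divide by zero.

```python
import math

def binary_entropy(eps):
    """h(eps) = eps*log2(1/eps) + (1 - eps)*log2(1/(1 - eps)), with h(0) = h(1) = 0."""
    if eps in (0.0, 1.0):
        return 0.0
    return eps * math.log2(1 / eps) + (1 - eps) * math.log2(1 / (1 - eps))

print(binary_entropy(1 / 3))   # about 0.918, so the bound below gives roughly 0.082 * n
```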
Theorem 4. For every ε > 0, R^{A→B}_ε(IND_n) ≥ (1 − h(ε))n.
Proof. As one should expect from the statement, the proof goes by way of information theory. Let Π be a one-way protocol computing IND_n with error ε. Let A be uniformly random over {0,1}^n and let B be uniform over [n]. Our goal will be to show that I(A; Π(A, B)) ≥ (1 − h(ε))n; since the cost of the protocol is at least H(Π) ≥ I(A; Π(A, B)), this suffices to prove the theorem. To do this, let’s write
I(A; Π) = H(A) − H(A|Π) = n − H(A|Π)
so the remaining goal is to show that H(A|Π) ≤ h(ε)n. To show this we use the chain rule to obtain
H(A|Π) = Σ_{i=1}^n H(A_i | Π, A_{<i}) ≤ Σ_{i=1}^n H(A_i | Π).
Now for every i ∈ [n], we have H(A_i|Π) = H(A_i|Π, B = i), since in a one-way protocol the message Π depends only on A (and the shared randomness), so (A_i, Π) is independent of B. So it now suffices to show that H(A_i|Π, B = i) ≤ h(ε) for every i. It’s intuitive that we should be able to upper bound this quantity by a constant: if Π lets us predict A_i with high probability, then conditioning on Π should leave little remaining uncertainty about A_i. The technical tool we need is (a special case of) Fano’s Inequality:
Lemma 5 (Fano). Let X ∈ {0, 1} be a binary random variable, let M be a random variable jointly distributed with X, and let g be a function of M. Let E be the event that g(M) ≠ X. Then
H(X|M) ≤ H(E).
To use Lemma 5 to complete the proof, let E be the event that Bob’s output disagrees with A_i. Then by correctness of the protocol, E = 0 with probability 1 − δ and E = 1 with probability δ for some 0 < δ ≤ ε. Hence
H(A_i | Π, B = i) ≤ H(E | B = i) = h(δ) ≤ h(ε),
as we wanted.
For completeness, let’s now give a proof of the special case of Fano’s Inequality we used.
Proof of Lemma 5. Since E is a function of M and X, we have H(E|M, X) = 0. Therefore,
H(X|M) = H(X|M) + H(E|M, X) = H(X, E|M) = H(E|M) + H(X|M, E) ≤ H(E),
where the last inequality follows because conditioning only decreases entropy (so H(E|M) ≤ H(E)) and because M, E together completely determine X (so H(X|M, E) = 0).
The general statement of Fano’s Inequality is as follows.
Theorem 6 (Fano’s Inequality). Let X be a random variable taking values in a finite set 𝒳, let M be jointly distributed with X, and let g be a function of M. Let E be the event that g(M) ≠ X. Then
H(X|M) ≤ H(E) + Pr[E] log(|𝒳| − 1).
2.2 Lower Bound for Gap Hamming
In the Gap Hamming problem GH_n, Alice is given a string x ∈ {0,1}^n and Bob is given a string y ∈ {0,1}^n, and the goal is to distinguish between the following two cases: 1) |x − y| ≤ n/2 − c√n and 2) |x − y| ≥ n/2 + c√n. Here c > 0 is a parameter of the problem which is independent of n (chosen to make proofs work).
We can use a reduction to Indexing to show that the Gap Hamming problem requires linear one- way communication. Note that a linear lower bound holds for two-way randomized communication as well (but the proof is more involved).
Theorem 7. R^{A→B}_{1/3}(GH_n) ≥ Ω(n).
Proof. We give a randomized reduction from Indexing to Gap Hamming. Given an instance (x, i) of the Indexing problem with m odd, x ∈ {0,1}^m and i ∈ [m], Alice and Bob will (with public randomness but zero communication) construct an instance (x′, y′) ∈ {0,1}^n × {0,1}^n for n = O(m) such that GH_n(x′, y′) = IND_m(x, i) with high probability.
The inputs x′, y′ will each be constructed one bit at a time, independently. To construct a single pair of bits (a, b), Alice and Bob look at a fresh m-bit portion of the public randomness r ∈ {0,1}^m. Bob sets b = r_i, i.e., the i-th bit of the random string. Alice sets a = 1 if |x − r| < m/2 and a = 0 if |x − r| > m/2 (since m is odd, |x − r| ≠ m/2). The idea is that if x_i = 1, then the bits a, b are correlated, but if x_i = 0 then the bits are anti-correlated. More precisely, let E be the event that |x_{−i} − r_{−i}| = (m − 1)/2, i.e., that x and r disagree on exactly half of the coordinates other than i. If E occurs, then a = 1 iff x_i = r_i. Hence conditioned on E we have that x_i = 1 ⟹ a = b and x_i = 0 ⟹ a ≠ b. On the other hand, conditioned on Ē, the bit a is determined by x and r_{−i}, and is therefore independent of the uniformly random bit b = r_i. Since Pr[E] = 2c′/√m for some constant c′ = Ω(1),
Pr[a = b] = Pr[a = b | E] Pr[E] + Pr[a = b | Ē] Pr[Ē]
          = Pr[a = b | E] · (2c′/√m) + (1/2) · (1 − 2c′/√m)
          = 1/2 − c′/√m   if x_i = 0,
            1/2 + c′/√m   if x_i = 1.
Hence for every j independently, we have Pr[x′_j = y′_j] ≤ 1/2 − c′/√m if x_i = 0 and Pr[x′_j = y′_j] ≥ 1/2 + c′/√m if x_i = 1. Repeating n = O(m) times, we get that there is a constant c such that with high probability (say 8/9), |x′ − y′| ≥ n/2 + c√n if x_i = 0 and |x′ − y′| ≤ n/2 − c√n if x_i = 1. So if we can solve GH_n to error 1/3 with sublinear communication, we can solve IND_m to error 1/3 + 1/9 = 4/9 with sublinear communication, contradicting Theorem 4.
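The reduction is easy to simulate. The Python sketch below (mine; the length blow-up factor is chosen arbitrarily for illustration) builds (x′, y′) from shared randomness exactly as in the proof; note that x′ depends only on x and r, while y′ depends only on i and r.

```python
import random

def index_to_gap_hamming(x, i, blowup=8, rng=None):
    """Map an Indexing instance (x, i), with |x| = m odd, to a Gap Hamming instance.

    Returns (x_prime, y_prime) of length n = blowup * m such that, with high
    probability, |x' - y'| < n/2 when x_i = 1 and |x' - y'| > n/2 when x_i = 0.
    """
    rng = rng or random.Random()
    m = len(x)
    assert m % 2 == 1, "m must be odd so that |x - r| != m/2"
    x_prime, y_prime = [], []
    for _ in range(blowup * m):
        r = [rng.randint(0, 1) for _ in range(m)]      # fresh block of shared randomness
        b = r[i]                                       # Bob's bit needs only i and r
        dist = sum(xj != rj for xj, rj in zip(x, r))   # |x - r|, known only to Alice
        a = 1 if dist < m / 2 else 0                   # Alice's bit needs only x and r
        x_prime.append(a)
        y_prime.append(b)
    return x_prime, y_prime

# Toy usage: with x_i = 1 the Hamming distance should land below n/2 = 28.
x, i = [1, 0, 1, 1, 0, 0, 1], 2
xp, yp = index_to_gap_hamming(x, i, rng=random.Random(0))
print(sum(a != b for a, b in zip(xp, yp)), "differences out of", len(xp))
```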
3 Frequency Moment Lower Bound from Gap Hamming
Theorem 8. A randomized streaming algorithm which computes F_0 to within a multiplicative (1 ± c/√n) factor requires space Ω(n).
Proof. Suppose we have a streaming algorithm which computes F_0 to within a multiplicative (1 ± c/√n) factor using space s. We construct a one-way communication protocol which computes the Hamming distance |x − y| up to an additive ±O(c√n) error using s + log n bits of communication as follows. Alice and Bob interpret their inputs x, y as the indicator vectors of sets A, B ⊆ [n]. Alice loads A into the streaming algorithm and sends its state to Bob, who then loads B. She also sends him |A| using an additional log n bits of communication. Observe that |x − y| = |A∆B| = |A \ B| + |B \ A| = (|A∪B| − |B|) + (|A∪B| − |A|). Noting that F_0 = |A∪B|, we have |x − y| = 2F_0 − |x| − |y|. Since F_0 ∈ [0, n], a multiplicative (1 ± c/√n) approximation to F_0 translates into an additive ±2c√n approximation to |x − y|. For a suitable choice of the constant c, this solves GH_n, so by Theorem 7 we must have s + log n ≥ Ω(n), i.e., s ≥ Ω(n).
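As in the proof of Theorem 2, the protocol is mechanical; below is a hedged Python sketch (mine, with the same hypothetical feed/get_state/load_state/estimate interface) of how Bob converts the F_0 estimate into an estimate of |x − y|.

```python
def approx_hamming_via_f0(x, y, make_f0_estimator):
    """One-way protocol from the proof of Theorem 8 (illustrative interface only).

    Alice sends the streaming state plus |A|; Bob returns an estimate of |x - y|
    via |x - y| = 2 * F_0 - |A| - |B|, where A and B are the supports of x and y.
    """
    # Alice's side: load A = {i : x_i = 1} and send the state together with |A|.
    alg = make_f0_estimator()
    for i, xi in enumerate(x):
        if xi == 1:
            alg.feed(i)
    state, size_a = alg.get_state(), sum(x)     # the extra log n bits carry |A|

    # Bob's side: resume from the state, load B = {i : y_i = 1}, and estimate F_0.
    alg_b = make_f0_estimator()
    alg_b.load_state(state)
    for i, yi in enumerate(y):
        if yi == 1:
            alg_b.feed(i)
    f0_est = alg_b.estimate()                   # approximates |A ∪ B|
    return 2 * f0_est - size_a - sum(y)         # approximates |x - y| = |A Δ B|
```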
A padding argument can be used to show that for ε > c′/√n, a multiplicative (1 ± ε) approximation to F_0 requires space Ω(1/ε²). Specifically, we can let k = 1/ε² and reduce from GH_k as follows. On inputs x, y to GH_k, construct x′, y′ by appending n − k zeroes to the end of each. Estimating F_0 ∈ [0, k] to error (1 ± ε) on the resulting stream translates into additive error k · ε = O(√k), which is ruled out by the lower bound of Ω(k) for computing GH_k.