
Communication Complexity (for Algorithm Designers)
Tim Roughgarden

© Tim Roughgarden 2015

Preface
The best algorithm designers prove both possibility and impossibility results — both upper and lower bounds. For example, every serious computer scientist knows a collection of canonical NP-complete problems and how to reduce them to other problems of interest. Communication complexity offers a clean theory that is extremely useful for proving lower bounds for lots of different fundamental problems. Many of the most significant algorithmic consequences of the theory follow from its most elementary aspects.
This document collects the lecture notes from my course “Communication Complexity (for Algorithm Designers),” taught at Stanford in the winter quarter of 2015. The two
primary goals of the course are:
(1) Learn several canonical problems in communication complexity that are useful for proving lower bounds for algorithms (Disjointness, Index, Gap-Hamming, etc.).

(2) Learn how to reduce lower bounds for fundamental algorithmic problems to communication complexity lower bounds.

Along the way, we’ll also:
(3) Get exposure to lots of cool computational models and some famous results about them — data streams and linear sketches, compressive sensing, space-query time trade-offs in data structures, sublinear-time algorithms, and the extension complexity of linear programs.
(4) Scratch the surface of techniques for proving communication complexity lower bounds (fooling sets, corruption arguments, etc.).
Readers are assumed to be familiar with undergraduate-level algorithms, as well as the statements of standard large deviation inequalities (Markov, Chebyshev, and Chernoff-Hoeffding).
The course begins in Lectures 1–3 with the simple case of one-way communication protocols — where only a single message is sent — and their relevance to algorithm design. Each of these lectures depends on the previous one. Many of the “greatest hits” of communication complexity applications, including lower bounds for small-space streaming algorithms and compressive sensing, are already implied by lower bounds for one-way
protocols. Reasoning about one-way protocols also provides a gentle warm-up to the standard model of general two-party communication protocols, which is the subject of Lecture 4. Lectures 5–8 translate communication complexity lower bounds into lower bounds in several disparate problem domains: the extension complexity of polytopes, data structure design, algorithmic game theory, and property testing. Each of these final four lectures depends only on Lecture 4.
The course Web page (http://theory.stanford.edu/~tim/w15/w15.html) contains links to relevant large deviation inequalities, links to many of the papers cited in these notes, and a partial list of exercises. Lecture notes and videos on several other topics in theoretical computer science are available from my Stanford home page.
I am grateful to the Stanford students who took the course, for their many excellent questions: Josh Alman, Dylan Cable, Brynmor Chapman, Michael Kim, Arjun Puranik, Okke Schrijvers, Nolan Skochdopole, Dan Stubbs, Joshua Wang, Huacheng Yu, Lin Zhai, and several auditors whose names I’ve forgotten. I am also indebted to Alex Andoni, Parikshit Gopalan, Ankur Moitra, and C. Seshadhri for their advice on some of these lectures. The writing of these notes was supported in part by NSF award CCF-1215965.
I always appreciate suggestions and corrections from readers.
Tim Roughgarden
474 Gates Building, 353 Serra Mall
Stanford, CA 94305
Email: tim@cs.stanford.edu
WWW: http://theory.stanford.edu/~tim/

Contents
Preface
List of Figures

1 Data Streams: Algorithms and Lower Bounds
  1.1 Preamble
  1.2 The Data Stream Model
  1.3 Frequency Moments
  1.4 Estimating F2: The Key Ideas
    1.4.1 The Basic Estimator
    1.4.2 4-Wise Independent Hash Functions
    1.4.3 Further Optimizations
  1.5 Estimating F0: The High-Order Bit
  1.6 Can We Do Better?
  1.7 One-Way Communication Complexity
  1.8 Connection to Streaming Algorithms
  1.9 The Disjointness Problem
    1.9.1 Disjointness Is Hard for One-Way Communication
    1.9.2 Space Lower Bound for F∞
    1.9.3 Space Lower Bound for Exact Computation of F0 and F2
  1.10 Looking Backward and Forward

2 Lower Bounds for One-Way Communication
  2.1 The Story So Far
  2.2 Randomized Protocols
  2.3 Distributional Complexity
  2.4 The Index Problem
  2.5 Where We’re Going
  2.6 The Gap-Hamming Problem
    2.6.1 Why Disjointness Doesn’t Work
    2.6.2 Reducing Gap-Hamming to F0 Estimation
  2.7 Lower Bound for Gap-Hamming

3 Lower Bounds for Compressive Sensing
  3.1 Randomized Communication Complexity of the Equality Function
  3.2 Sparse Recovery
    3.2.1 The Basic Setup
    3.2.2 A Toy Version
    3.2.3 Motivating Applications
    3.2.4 The Real Problem
  3.3 A Lower Bound for Sparse Recovery
    3.3.1 Context
    3.3.2 Proof of Theorem 3.4: First Attempt
    3.3.3 Proof of Theorem 3.4
    3.3.4 Lower Bounds for Randomized Recovery
    3.3.5 Digression

4 Boot Camp on Communication Complexity
  4.1 Preamble
  4.2 Deterministic Protocols
    4.2.1 Protocols
    4.2.2 Example: Clique-Independent Set
    4.2.3 Trees and Matrices
    4.2.4 Protocols and Rectangles
    4.2.5 Lower Bounds for Equality and Disjointness
    4.2.6 Take-Aways
  4.3 Randomized Protocols
    4.3.1 Default Parameter Settings
    4.3.2 Newman’s Theorem: Public- vs. Private-Coin Protocols
    4.3.3 Distributional Complexity
    4.3.4 Case Study: Disjointness

5 Lower Bounds for the Extension Complexity of Polytopes
  5.1 Linear Programs, Polytopes, and Extended Formulations
    5.1.1 Linear Programs for Combinatorial Optimization Problems
    5.1.2 Auxiliary Variables and Extended Formulations
  5.2 Nondeterministic Communication Complexity
  5.3 Extended Formulations and Nondeterministic Communication Complexity
    5.3.1 Faces and Facets
    5.3.2 Yannakakis’s Lemma
    5.3.3 Proof Sketch of Lemma 5.5: A Geometric Argument
    5.3.4 Proof Sketch of Lemma 5.5: An Algebraic Argument
  5.4 A Lower Bound for the Correlation Polytope
    5.4.1 Overview
    5.4.2 Preliminaries
    5.4.3 Some Faces of the Correlation Polytope
    5.4.4 Face-Vertex(COR) and Unique-Disjointness
    5.4.5 A Lower Bound for Unique-Disjointness

6 Lower Bounds for Data Structures
  6.1 Preamble
  6.2 The Approximate Nearest Neighbor Problem
  6.3 An Upper Bound: Biased Random Inner Products
    6.3.1 The Key Idea (via a Public-Coin Protocol)
    6.3.2 The Data Structure (Decision Version)
    6.3.3 The Data Structure (Full Version)
  6.4 Lower Bounds via Asymmetric Communication Complexity
    6.4.1 The Cell Probe Model
    6.4.2 Asymmetric Communication Complexity
    6.4.3 Lower Bound for the Approximate Nearest Neighbor Problem

7 Lower Bounds in Algorithmic Game Theory
  7.1 Preamble
  7.2 The Welfare Maximization Problem
  7.3 Multi-Party Communication Complexity
    7.3.1 The Model
    7.3.2 The Multi-Disjointness Problem
    7.3.3 Proof of Theorem 7.1
  7.4 Lower Bounds for Approximate Welfare Maximization
    7.4.1 General Valuations
    7.4.2 Subadditive Valuations
  7.5 Lower Bounds for Equilibria
    7.5.1 Game Theory
    7.5.2 POA Lower Bounds from Communication Complexity
    7.5.3 Proof of Theorem 7.9
    7.5.4 An Open Question

8 Lower Bounds in Property Testing
  8.1 Property Testing
  8.2 Example: The BLR Linearity Test
  8.3 Monotonicity Testing: Upper Bounds
    8.3.1 The Boolean Case
    8.3.2 Recent Progress for the Boolean Case
    8.3.3 Larger Ranges
  8.4 Monotonicity Testing: Lower Bounds
    8.4.1 Lower Bound for General Ranges
    8.4.2 Extension to Smaller Ranges
  8.5 A General Approach

Bibliography

List of Figures
1.1 Expected order statistics of i.i.d. samples from the uniform distribution
1.2 A small-space streaming algorithm induces a low-communication one-way protocol
2.1 A one-way communication protocol
2.2 Balls of radius n/4 in the Hamming metric
2.3 Proof structure of linear space lower bounds for streaming algorithms
2.4 Proof plan for Ω(ε⁻²) space lower bounds
2.5 Hamming distance and symmetric difference
3.1 Compressive sensing
3.2 How Alice interprets her log |X| log n-bit input
3.3 The triangle inequality implies that Bob’s recovery is correct
4.1 A clique and an independent set
4.2 The binary tree induced by a protocol for Equality
4.3 Partition of the input space X × Y
4.4 A covering by four monochromatic rectangles that is not a partition
4.5 If S and T are different sets, then either S and T^c or T and S^c are not disjoint
5.1 A covering by four monochromatic rectangles that is not a partition
5.2 A supporting hyperplane and the corresponding face
5.3 Nonnegative matrix factorization
6.1 Should tables be sorted?
6.2 Proof of the Richness Lemma (Lemma 6.4)
7.1 Player utilities in Rock-Paper-Scissors
7.2 Proof of Theorem 7.9
8.1 {0, 1}^n as an n-dimensional hypercube
8.2 Swapping values to eliminate the monotonicity violations in the ith slice
8.3 Tracking the number of monotonicity violations

Lecture 1
Data Streams: Algorithms and Lower Bounds
1.1 Preamble
This class is mostly about impossibility results — lower bounds on what can be accomplished by algorithms. However, our perspective will be unapologetically that of an algorithm designer.1 We’ll learn lower bound technology on a “need-to-know” basis, guided by fundamental algorithmic problems that we care about (perhaps theoretically, perhaps practically). That said, we will wind up learning quite a bit of complexity theory — specifically, communication complexity — along the way. We hope this viewpoint makes this course and these notes complementary to the numerous excellent courses, books (Jukna (2012) and Kushilevitz and Nisan (1996)), and surveys (e.g. Lee and Shraibman (2009); Lovász (1990); Chattopadhyay and Pitassi (2010); Razborov (2011)) on communication complexity.2 The theme of communication complexity lower bounds also provides a convenient excuse to take a guided tour of numerous models, problems, and algorithms that are central to modern research in the theory of algorithms but missing from many algorithms textbooks: streaming algorithms, space-time trade-offs in data structures, compressive sensing, sublinear algorithms, extended formulations for linear programs, and more.
Why should an algorithm designer care about lower bounds? The best mathematical researchers can work on an open problem simultaneously from both sides. Even if you have a strong prior belief about whether a given mathematical statement is true or false, failing to prove one direction usefully informs your efforts to prove the other. (And for most of us, the prior belief is wrong surprisingly often!) In algorithm design, working on both sides means striving simultaneously for better algorithms and for better lower bounds. For example, a good undergraduate algorithms course teaches you both how to design polynomial-time algorithms and how to prove that a problem is NP-complete — since when you encounter a new computational problem in your research or workplace, both are distinct possibilities. There are many other algorithmic problems where the fundamental difficulty is not the amount of time required, but rather concerns communication or information transmission. The goal of this course is to equip you with the basic tools of communication complexity — its canonical hard problems, the canonical reductions from computation in various models to
1Already in this lecture, over half our discussion will be about algorithms and upper bounds!
2See Pătraşcu (2009) for a series of four blog posts on data structures that share some spirit with our approach.

the design of low-communication protocols, and a little bit about its lower bound techniques — in the service of becoming a better algorithm designer.
This lecture and the next study the data stream model of computation. There are some nice upper bounds in this model (see Sections 1.4 and 1.5), and the model also naturally motivates a severe but useful restriction of the general communication complexity setup
(Section 1.7). We’ll cover many computational models in the course, so whenever you get sick of one, don’t worry, a new one is coming up around the corner.
1.2 The Data Stream Model
The data stream model is motivated by applications in which the input is best thought of as a firehose — packets arriving to a network switch at a torrential rate, or data being generated by a telescope at a rate of one exabyte per day. In these applications, there’s no hope of storing all the data, but we’d still like to remember useful summary statistics about what we’ve seen.
Alternatively, for example in database applications, it could be that data is not thrown away but resides on a big, slow disk. Rather than incurring random access costs to the data, one would like to process it sequentially once (or a few times), perhaps overnight, remembering the salient features of the data in a limited main memory for real-time use. The daily transactions of Amazon or Walmart, for example, could fall into this category.
Formally, suppose data elements belong to a known universe U = {1, 2, . . . , n}. The input is a stream x1, x2, . . . , xm ∈ U of elements that arrive one-by-one. Our algorithms will not assume advance knowledge of m, while our lower bounds will hold even if m is known a priori. With space ≈ m log2 n, it is possible to store all of the data. The central question in data stream algorithms is: what is possible, and what is impossible, using a one-pass algorithm and much less than m log n space? Ideally, the space usage should be sublinear or even logarithmic in n and m. We’re not going to worry about the computation time used by the algorithm (though our positive results in this model have low computational complexity, anyway).
Many of you will be familiar with a streaming or one-pass algorithm from the following common interview question. Suppose an array A, with length m, is promised to have a majority element — an element that occurs strictly more than m/2 times. A simple one-pass algorithm, which maintains only the current candidate majority element and a counter for it
— so O(log n + log m) bits — solves this problem. (If you haven’t seen this algorithm before, see the Exercises.) This can be viewed as an exemplary small-space streaming algorithm.3
3Interestingly, the promise that a majority element exists is crucial. A consequence of the next lecture is that there is no small-space streaming algorithm to check whether or not a majority element exists!
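For reference, one standard way the interview answer goes (the classic Boyer-Moore majority vote) is the following Python sketch; the stream and variable names are illustrative, not from the notes:

```python
def majority_candidate(stream):
    """One-pass majority vote: maintain one candidate element and a counter.

    If the stream is promised to contain a majority element (occurring
    strictly more than half the time), the surviving candidate is it. The
    state is one element plus one counter: O(log n + log m) bits.
    """
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1      # adopt a fresh candidate
        elif x == candidate:
            count += 1                   # a vote for the candidate
        else:
            count -= 1                   # a vote against it
    return candidate

# 3 occurs 4 > 7/2 times, so it is the majority element.
assert majority_candidate([3, 1, 3, 2, 3, 3, 2]) == 3
```

Note that without the majority promise the returned candidate can be wrong, which is consistent with the footnote above: verifying the promise is exactly what small-space streaming algorithms cannot do.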

1.3 Frequency Moments
Next we introduce the canonical problems in the field of data stream algorithms: computing the frequency moments of a stream. These were studied in the paper that kickstarted the field (Alon et al., 1999), and the data stream algorithms community has been obsessed with them ever since.
Fix a data stream x1,x2,…,xm ∈ U. For an element j ∈ U, let fj ∈ {0,1,2,…,m} denote the number of times that j occurs in the stream. For a non-negative integer k, the kth frequency moment of the stream is
Fk := Σ_{j∈U} fj^k.   (1.1)
Note that the bigger k is, the more the sum in (1.1) is dominated by the largest frequencies. It is therefore natural to define
F∞ = max_{j∈U} fj
as the largest frequency of any element of U.
Let’s get some sense of these frequency moments. F1 is boring — since each data stream element contributes to exactly one frequency fj, F1 = Σ_{j∈U} fj is simply the stream length m. F0 is the number of distinct elements in the stream (we’re interpreting 0^0 = 0) — it’s easy to imagine wanting to compute this quantity; for example, a network switch might want to know how many different TCP flows are currently going through it. F∞ is the largest frequency, and again it’s easy to imagine wanting to know this — for example, to detect a denial-of-service attack at a network switch, or to identify the most popular product on Amazon yesterday. Note that computing F∞ is related to the aforementioned problem of detecting a majority element. Finally, F2 = Σ_{j∈U} fj² is sometimes called the “skew” of the data — it is a measure of how non-uniform the data stream is. In a database context, it arises naturally as the size of a “self-join” — the table you get when you join a relation with itself on a particular attribute, with the fj’s being the frequencies of various values of this attribute. Having estimates of self-join (and more generally join) sizes at the ready is useful for query optimization, for example. We’ll talk about F2 extensively in the next section.4
It is trivial to compute all of the frequency moments in O(m log n) space, just by storing the xi’s, or in O(n log m) space, just by computing and storing the fj’s (a log m-bit counter for each of the n universe elements). Similarly, F1 is trivial to compute in O(log m) space (via a counter), and F0 in O(n) space (via a bit vector). Can we do better?
Intuitively, it might appear difficult to improve over the trivial solution. For F0, for example, it seems like you have to know which elements you’ve already seen (to avoid double-counting them), and there’s an exponential (in n) number of different possibilities for
4The problem of computing F2 and the solution we give for it are also quite well connected to other important concepts, such as compressive sensing and dimensionality reduction.
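The trivial computation just described — store every frequency fj, then sum the kth powers — can be sketched as follows (a baseline for the small-space algorithms to come, not itself a streaming algorithm worth using):

```python
from collections import Counter

def frequency_moment(stream, k):
    """The trivial O(n log m)-space computation of F_k: store all the
    frequencies f_j, then return sum_j f_j^k. With the convention 0^0 = 0,
    F_0 counts distinct elements; k = infinity returns the max frequency."""
    freqs = Counter(stream)  # f_j for every element that appears
    if k == float("inf"):
        return max(freqs.values())
    return sum(f ** k for f in freqs.values())

stream = [1, 1, 1, 2, 2, 3]            # frequencies (3, 2, 1)
assert frequency_moment(stream, 0) == 3             # distinct elements
assert frequency_moment(stream, 1) == 6             # stream length m
assert frequency_moment(stream, 2) == 14            # 9 + 4 + 1, the "skew"
assert frequency_moment(stream, float("inf")) == 3  # largest frequency
```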

what you might have seen in the past. As we’ll see, this is good intuition for deterministic algorithms, and for (possibly randomized) exact algorithms. Thus, the following positive result is arguably surprising, and very cool.5
Theorem 1.1 (Alon et al. 1999) Both F0 and F2 can be approximated, to within a (1±ε) factor with probability at least 1 − δ, in space
O( ε⁻² (log n + log m) log(1/δ) ).   (1.2)
Theorem 1.1 refers to two different algorithms, one for F0 and one for F2. We cover the latter in detail below. Section 1.5 describes the high-order bit of the F0 algorithm, which is a modification of the earlier algorithm of Flajolet and Martin (1983), with the details in the exercises. Both algorithms are randomized, and are approximately correct (to within
(1 ± ε)) most of the time (except with probability δ). Also, the log m factor in (1.2) is not needed for the F0 algorithm, as you might expect. Some further optimizations are possible; see Section 1.4.3.
The first reason to talk about Theorem 1.1 is that it’s a great result in the field of algorithms — if you only remember one streaming algorithm, the one below might as well be the one.6 You should never tire of seeing clever algorithms that radically outperform the “obvious solution” to a well-motivated problem. And Theorem 1.1 should serve as inspiration to any algorithm designer — even when at first blush there is no non-trivial algorithm for a problem in sight, the right clever insight can unlock a good solution.
On the other hand, there unfortunately are some important problems out there with no non-trivial solutions. And it’s important for the algorithm designer to know which ones they are — the less effort wasted on trying to find something that doesn’t exist, the more energy is available for formulating the motivating problem in a more tractable way, weakening the desired guarantee, restricting the problem instances, and otherwise finding new ways to make algorithmic progress. A second interpretation of Theorem 1.1 is that it illuminates why such lower bounds can be so hard to prove. A lower bound is responsible for showing that every algorithm, even fiendishly clever ones like those employed for Theorem 1.1, cannot make significant inroads on the problem.
1.4 Estimating F2 : The Key Ideas
In this section we give a nearly complete proof of Theorem 1.1 for the case of F2 = Σ_{j∈U} fj²
estimation (a few details are left to the Exercises).
5The Alon-Matias-Szegedy paper (Alon et al., 1999) ignited the field of streaming algorithms as a hot area, and for this reason won the 2005 Gödel Prize (a “test of time”-type award in theoretical computer science). The paper includes a number of other upper and lower bounds as well, some of which we’ll cover shortly.
6Either algorithm, for estimating F0 or for F2, could serve this purpose. We present the F2 estimation algorithm in detail, because the analysis is slightly slicker and more canonical.

1.4.1 The Basic Estimator
The high-level idea is very natural, especially once you start thinking about randomized algorithms.
1. Define a randomized unbiased estimator of F2, which can be computed in one pass. Small space seems to force a streaming algorithm to lose information, but maybe it’s possible to produce a result that’s correct “on average.”
2. Aggregate many independent copies of the estimator, computed in parallel, to get an estimate that is very accurate with high probability.
This is very hand-wavy, of course — does it have any hope of working? It’s hard to answer that question without actually doing some proper computations, so let’s proceed to the estimator devised in Alon et al. (1999).
The Basic Estimator:7
1. Let h : U → {±1} be a function that associates each universe element with a random sign. On a first reading, to focus on the main ideas, you should assume that h is a totally random function. Later we’ll see that relatively lightweight hash functions are good enough (Section 1.4.2), which enables a small-space implementation of the basic ideas.
2. Initialize Z = 0.
3. Every time a data stream element j ∈ U appears, add h(j) to Z. That is, increment
Z if h(j) = +1 and decrement Z if h(j) = −1.8
4. Return the estimate X = Z².
Remark 1.2 A crucial point: since the function h is fixed once and for all before the data stream arrives, an element j ∈ U is treated consistently every time it shows up. That is, Z is either incremented every time j shows up or is decremented every time j shows up. In the end, element j contributes h(j)fj to the final value of Z.
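To make the four steps concrete, here is a minimal Python simulation of the basic estimator. It simulates a totally random h by lazily sampling a sign the first time each element appears (so, as in Remark 1.2, every element is treated consistently); the lightweight hash functions of Section 1.4.2 are not modeled:

```python
import random
import statistics

def tug_of_war_estimate(stream):
    """One run of the basic estimator X = Z^2 for F2, with a simulated
    totally random h : U -> {+1, -1}."""
    signs, z = {}, 0
    for j in stream:
        if j not in signs:
            signs[j] = random.choice((1, -1))  # h(j), fixed once and for all
        z += signs[j]                          # the "tug of war" on Z
    return z * z

stream = [1] * 3 + [2] * 2 + [3]   # frequencies (3, 2, 1), so F2 = 14
estimates = [tug_of_war_estimate(stream) for _ in range(20000)]
print(statistics.mean(estimates))  # close to 14 on average, as claimed below
```

A single run is wildly inaccurate (here X is one of 0, 4, 16, 36), but the average over many runs hugs F2 — which is exactly the unbiasedness that is proved next.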
First we need to prove that the basic estimator is indeed unbiased.
Lemma 1.3 For every data stream,
E_h[X] = F2.
7This is sometimes called the “tug-of-war” estimator.
8This is the “tug of war,” between elements j with h(j) = +1 and those with h(j) = −1.

Proof: We have

E[X] = E[Z²] = E[( Σ_{j∈U} h(j) fj )²]
     = E[ Σ_{j∈U} h(j)² fj² + 2 Σ_{j<l} h(j) h(l) fj fl ]
     = Σ_{j∈U} fj² + 2 Σ_{j<l} fj fl · E[h(j) h(l)] = F2,   (1.4)

where the final equality holds because h(j)² = 1 always, and because, for j ≠ l, independence gives E[h(j)h(l)] = E[h(j)] · E[h(l)] = 0. ■

The basic estimator is unbiased but has large variance; the fix is to average many independent copies. The variance is controlled by the following lemma, proved below.

Lemma 1.4 For every data stream, Var_h[X] ≤ 2F2².

Recall Chebyshev’s inequality: for every random variable Y with finite first and second moments, and every c > 0,
Pr[ |Y − E[Y]| > c ] ≤ Var[Y] / c².   (1.5)
Note that (1.5) is non-trivial (i.e., probability less than 1) once c exceeds Y’s standard deviation, and the probability goes down quadratically with the number of standard deviations. It’s a simple inequality to prove; see the separate notes on tail inequalities for details.
We are interested in the case where Y is the average of t basic estimators, with variance as in Lemma 1.4. Since we want to guarantee a (1 ± ε)-approximation, the deviation c of interest to us is εF2. We also want the right-hand side of (1.5) to be equal to δ. Using Lemma 1.4 and solving, we get t = 2/(ε²δ).9 ■
We now stop procrastinating and prove Lemma 1.4.

Proof of Lemma 1.4: Recall that

Var[X] = E[X²] − (E[X])²,   (1.6)

where E[X] = F2 by Lemma 1.3.
9The dependence on 1/δ can be decreased to logarithmic; see Section 1.4.3.

Zooming in on the E[X²] term, recall that X is already defined as the square of the running sum Z, so X² is Z⁴. Thus,

E[X²] = E[( Σ_{j∈U} h(j) fj )⁴].   (1.7)
Expanding the right-hand side of (1.7) yields |U|⁴ terms, each of the form h(j1)h(j2)h(j3)h(j4)fj1fj2fj3fj4. (Remember: the h-values are random, the f-values are fixed.) This might seem unwieldy. But, just as in the computation (1.4) in the proof of Lemma 1.3, most of these are zero in expectation. For example, suppose j1, j2, j3, j4 are distinct. Condition on the h-values of the first three. Since h(j4) is equally likely to be +1 or −1, the conditional expected value (averaged over the two cases) of the corresponding term is 0. Since this holds for all possible values of h(j1), h(j2), h(j3), the unconditional expectation of this term is also 0. This same argument applies to any term in which some element j ∈ U appears an odd number of times. Thus, when the dust settles, we have
 
E􏰥X2􏰦 = E 􏰕 h(j)4 fj4 + 6 􏰕 h(j)2 h(l)2 fj2fl2 j∈U 􏱄􏱅􏱆􏱇 j 2, check out the Exercises for details. The main message is to realize that you shouldn’t be scared of them — heavy machinery is not required. For example, it suffices to map the elements of U injectively into a suitable finite field (of size roughly |U|), and then let H be the set of all cubic polynomials (with all operations occurring in this field). The final output is then +1 if the polynomial’s output (viewed as an integer) is even, and
-1 otherwise. Such a hash function is easily specified with O(log n) bits (just list its four coefficients), and can also be evaluated in O(log n) space (which is not hard, but we won’t say more about it here).
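For concreteness, here is one possible implementation sketch of this recipe in Python. One deviation from the text, flagged in the comments: it evaluates the cubic over a prime field F_p rather than a field of size roughly |U|, so the parity-based ±1 output is unbiased only up to an O(1/p) error (negligible for large p); it is a sketch, not the exact construction:

```python
import random

P = 2_147_483_647  # the Mersenne prime 2^31 - 1; assumes elements of U are < P

def random_sign_hash():
    """Draw h : U -> {+1, -1} by picking the four coefficients of a random
    cubic polynomial over F_p, then outputting +1 iff the value is even.

    Storing (a, b, c, d) takes O(log n) bits, matching the claim above.
    Caveat: since P is odd, the parity of a uniform field element is
    biased by O(1/P) -- harmless here, but not the textbook field choice.
    """
    a, b, c, d = (random.randrange(P) for _ in range(4))

    def h(x):
        v = (((a * x + b) * x + c) * x + d) % P  # Horner's rule over F_p
        return 1 if v % 2 == 0 else -1

    return h

h = random_sign_hash()
assert h(7) == h(7)  # an element gets the same sign every time it appears
```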

Putting it all together, we get a space bound of

O( (1/(ε²δ)) · (log m + log n) ),   (1.9)

where the first factor is the number of copies of the basic estimator, the log m term pays for each counter, and the log n term pays for each hash function.
1.4.3 Further Optimizations
The bound in (1.9) is worse than that claimed in Theorem 1.1, with a dependence on 1/δ instead of log(1/δ). A simple trick yields the better bound. In Section 1.4.1, we averaged t copies of the basic estimator to accomplish two conceptually different things: to improve the approximation ratio to (1 ± ε), for which we suffered a 1/ε² factor, and to improve the success probability to 1 − δ, for which we suffered an additional 1/δ. It is more efficient to implement these improvements one at a time, rather than in one shot. The smarter implementation first uses ≈ 1/ε² copies to obtain an approximation of (1 ± ε) with probability at least 2/3 (say). To boost the success probability from 2/3 to 1 − δ, it is enough to run ≈ log(1/δ) different copies of this solution, and then take the median of their ≈ log(1/δ) different estimates. Since we expect at least two-thirds of these estimates to lie in the interval (1 ± ε)F2, it is very likely that the median of them lies in this interval. The details are easily made precise using a Chernoff bound argument; see the Exercises for details.
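The average-then-median scheme can be sketched as follows (the constants are illustrative rather than tuned to the exact bounds above, and the basic estimator again simulates a totally random h):

```python
import math
import random
import statistics

def f2_estimate(stream, eps=0.5, delta=0.01):
    """Average-then-median estimation of F2.

    First average ~1/eps^2 independent tug-of-war estimators, which by
    Chebyshev is a (1 +/- eps)-approximation with constant probability;
    then take the median of ~log(1/delta) such averages to push the
    failure probability down to about delta (via a Chernoff bound).
    """
    def basic_estimator():
        signs, z = {}, 0
        for j in stream:
            if j not in signs:
                signs[j] = random.choice((1, -1))
            z += signs[j]
        return z * z

    t = max(1, round(6 / eps ** 2))                  # copies per average
    groups = max(1, round(8 * math.log(1 / delta)))  # number of averages
    averages = [statistics.mean(basic_estimator() for _ in range(t))
                for _ in range(groups)]
    return statistics.median(averages)

stream = [1] * 3 + [2] * 2 + [3]   # frequencies (3, 2, 1), so F2 = 14
est = f2_estimate(stream, eps=0.5, delta=0.01)
```

The point of the median is robustness: a few bad averages cannot drag it outside (1 ± ε)F2 as long as most averages are good.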
Second, believe it or not, the log m term in Theorem 1.1 can be improved to log log m.
The reason is that we don’t need to count the Zi’s exactly, only approximately and with
high probability. This relaxed counting problem can be solved using Morris’s algorithm,
which can be implemented as a streaming algorithm that uses O(ε⁻² log log m · log(1/δ)) space.
See the Exercises for further details.
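Morris's algorithm itself fits in a few lines. The following sketch keeps only x ≈ log₂(count), which is why O(log log m) bits suffice; the averaging needed for a sharp (1 ± ε) guarantee is elided:

```python
import random
import statistics

def morris_counter(increments):
    """Morris's approximate counter: store only x, roughly log2 of the count.

    Each increment bumps x with probability 2^(-x). The estimate 2^x - 1
    is an unbiased estimator of the true count, while x itself needs only
    about log log m bits of space.
    """
    x = 0
    for _ in range(increments):
        if random.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1

# Unbiasedness check: average many independent counters of 1000 increments.
estimates = [morris_counter(1000) for _ in range(2000)]
print(statistics.mean(estimates))  # close to 1000 on average
```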
1.5 Estimating F0 : The High-Order Bit
Recall that F0 denotes the number of distinct elements present in a data stream. The high-level idea of the F0 estimator is the same as for the F2 estimator above. The steps are to define a basic estimator that is essentially unbiased, and then reduce the variance by taking averages and medians. (Really making this work takes an additional idea; see Bar-Yossef et al. (2002b) and the Exercises.)
The basic estimator for F0 — originally from Flajolet and Martin (1983) and developed further in Alon et al. (1999) and Bar-Yossef et al. (2002b) — is as simple as, but quite different from, the one used to estimate F2. The first step is to choose a random permutation h of U.10 Then, just remember (using O(log n) space) the minimum value of h(x) that ever shows up in the data stream.
10Or rather, a simple hash function with the salient properties of a random permutation.

Why use the minimum? One intuition comes from the suggestive match between the idempotence of F0 and of the minimum — adding duplicate copies of an element to the input has no effect on the answer.
Given the minimum h(x)-value in the data stream, how do we extract from it an estimate
of F0, the number of distinct elements? For intuition, think about the uniform distribution
on [0, 1] (Figure 1.1). Obviously, the expected value of one draw from the distribution is 1/2. For two i.i.d. draws, simple calculations show that the expected minimum and maximum are 1/3 and 2/3, respectively. More generally, the expected order statistics of k i.i.d. draws split the interval into k + 1 segments of equal length. In particular, the expected minimum is 1/(k + 1). In other words, if you are told that some number of i.i.d. draws were taken from the uniform distribution on [0, 1] and the smallest draw was c, you might guess that there were roughly 1/c draws.
Figure 1.1 The expected order statistics of i.i.d. samples from the uniform distribution on the unit interval are spaced out evenly over the interval.
Translating this idea to our basic F0 estimator, if there are k distinct elements in a data stream x1, …, xm, then there are k different (random) hash values h(xi), and we expect the smallest of these to be roughly |U|/k. This leads to the basic estimator X = |U| / min_{i=1}^{m} h(xi). Using averaging and an extra idea to reduce the variance, and medians to boost the success probability, this leads to the bound claimed in Theorem 1.1
(without the log m term). The details are outlined in the exercises.
Remark 1.7 You’d be right to ask if this high-level approach to probabilistic and approxi- mate estimation applies to all of the frequency moments, not just F0 and F2. The approach can indeed be used to estimate Fk for all k. However, the variance of the basic estimators will be different for different frequency moments. For example, as k grows, the statistic Fk becomes quite sensitive to small changes in the input, resulting in probabilistic estimators with large variance, necessitating a large number of independent copies to obtain a good approximation. More generally, no frequency moment Fk with k ̸∈ {0, 1, 2} can be computed using only a logarithmic amount of space (more details to come).
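As a sanity check on the order-statistics intuition, here is a small simulation of the basic F0 estimator. For brevity it hashes each element to a consistent uniform value in [0, 1] rather than using a permutation of U, and it reports a median over independent runs only to stabilize the picture; a single run is extremely noisy (the raw estimator even has infinite expectation), which is why the real algorithm needs the averaging and the extra variance-reduction idea from the exercises:

```python
import random
import statistics

def f0_basic_estimate(stream):
    """Basic F0 estimator with a [0, 1]-valued 'hash': with k distinct
    elements, the smallest hash value is around 1/(k + 1), so the estimate
    is 1/min - 1. Space is one hash description plus one running minimum."""
    h, smallest = {}, 1.0
    for x in stream:
        if x not in h:
            h[x] = random.random()   # consistent hash value per element
        smallest = min(smallest, h[x])
    return 1.0 / smallest - 1.0

stream = [5, 2, 9, 2, 5, 7, 5]       # four distinct elements, so F0 = 4
estimates = [f0_basic_estimate(stream) for _ in range(2000)]
print(statistics.median(estimates))  # in the right ballpark of 4
```

Note also the idempotence in action: the duplicates in the stream have no effect on the minimum, hence none on the estimate.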

1.6 Can We Do Better?
Theorem 1.1 is a fantastic result. But a good algorithm designer is never satisfied, and always wants more. So what are the weaknesses of the upper bounds that we’ve proved so far?
1. We only have interesting positive results for F0 and F2 (and maybe F1, if you want to count that). What about for k > 2 and k = ∞?
2. Our F0 and F2 algorithms only approximate the corresponding frequency moment. Can we compute it exactly, possibly using a randomized algorithm?
3. Our F0 and F2 algorithms are randomized, and with probability δ fail to provide a good approximation. (Also, they are Monte Carlo algorithms, in that we can’t tell when they fail.) Can we compute F0 or F2 deterministically, at least approximately?
4. Our F0 and F2 algorithms use Ω(log n) space. Can we reduce the dependency of the space on the universe size?11
5. Our F0 and F2 algorithms use Ω(ε−2) space. Can the dependence on ε−1 be improved? The ε−2 dependence can be painful in practice, where you might want to take ε = .01, resulting in an extra factor of 10,000 in the space bound. An improvement to ≈ ε−1, for example, would be really nice.
Unfortunately, we can’t do better — the rest of this lecture and the next (and the exercises) explain why all of these compromises are necessary for positive results. This is kind of amazing, and it’s also pretty amazing that we can prove it without overly heavy machinery. Try to think of other basic computational problems where, in a couple hours of lecture and with minimal background, you can explain complete proofs of both a non-trivial upper bound and an unconditional (independent of P vs. NP, etc.) matching lower bound.12
1.7 One-Way Communication Complexity
We next describe a simple and clean formalism that is extremely useful for proving lower bounds on the space required by streaming algorithms to perform various tasks. The model will be a quite restricted form of the general communication model that we study later — and this is good for us, because the restriction makes it easier to prove lower bounds. Happily, even lower bounds for this restricted model typically translate to lower bounds for streaming algorithms.
11This might seem like a long shot, but you never know. Recall our comment about reducing the space dependency on m from O(log m) to O(log log m) via probabilistic approximate counters.
12OK, comparison-based sorting, sure. And we’ll see a couple others later in this course. But I don’t know of that many examples!
In general, communication complexity is a sweet spot. It is a general enough concept to capture the essential hardness lurking in many different models of computation, as we'll see throughout the course. At the same time, it is possible to prove numerous different lower bounds in the model — some of these require a lot of work, but many of the most important ones are easier than you might have guessed. These lower bounds are "unconditional" — they are simply true, and don't depend on any unproven (if widely believed) conjectures like P ̸= NP. Finally, because the model is so clean and free of distractions, it naturally guides one toward the development of the "right" mathematical techniques needed for proving new lower bounds.
In (two-party) communication complexity, there are two parties, Alice and Bob. Alice has an input x ∈ {0, 1}^a, Bob an input y ∈ {0, 1}^b. Neither one has any idea what the other's input is. Alice and Bob want to cooperate to compute a Boolean function (i.e., a predicate) f : {0, 1}^a × {0, 1}^b → {0, 1} that is defined on their joint input. We'll discuss several examples of such functions shortly.
For this lecture and the next, we can get away with restricting attention to one-way communication protocols. All that is allowed here is the following:
1. Alice sends Bob a message z, which is a function of her input x only.
2. Bob declares the output f(x, y), as a function of Alice's message z and his input y only.
Since we’re interested in both deterministic and randomized algorithms, we’ll discuss both deterministic and randomized one-way communication protocols.
The one-way communication complexity of a Boolean function f is the minimum worst- case number of bits used by any one-way protocol that correctly decides the function. (Or for randomized protocols, that correctly decides it with probability at least 2/3.) That is, it is
min_P max_{(x,y)} {length (in bits) of Alice's message z when Alice's input is x},
where the minimum ranges over all correct protocols.
Note that the one-way communication complexity of a function f is always at most a, since Alice can just send her entire a-bit input x to Bob, who can then certainly correctly compute f. The question is to understand when one can do better. This will depend on the specific function f. For example, if f is the parity function (i.e., decide whether the total number of 1s in (x, y) is even or odd), then the one-way communication complexity of f is 1 (Alice just sends the parity of x to Bob, who’s then in a position to figure out the parity of (x, y)).
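As a toy illustration, the one-bit parity protocol can be simulated directly. This is a sketch in Python; the function names are our own, not from the text:

```python
import random

def alice_message(x):
    """Alice's one-bit message: the parity of her input x."""
    return sum(x) % 2

def bob_output(z, y):
    """Bob combines Alice's parity bit z with the parity of his own input y."""
    return (z + sum(y)) % 2

# Sanity check against direct computation of the parity of the joint input (x, y).
random.seed(0)
for _ in range(100):
    x = [random.randint(0, 1) for _ in range(8)]
    y = [random.randint(0, 1) for _ in range(8)]
    assert bob_output(alice_message(x), y) == sum(x + y) % 2
```

Whatever the lengths of x and y, Alice communicates exactly one bit.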
1.8 Connection to Streaming Algorithms
If you care about streaming algorithms, then you should also care about one-way communi- cation complexity. Why? Because of the unreasonable effectiveness of the following two-step plan to proving lower bounds on the space usage of streaming algorithms.
1. Small-space streaming algorithms imply low-communication one-way protocols.
2. The latter don’t exist.
Both steps of this plan are quite doable in many cases.
Does the connection in the first step above surprise you? It’s the best kind of statement
— genius and near-trivial at the same time. We’ll be formal about it shortly, but it’s worth remembering a cartoon meta-version of the connection, illustrated in Figure 1.2. Consider a problem that can be solved using a streaming algorithm S that uses space only s. How can we use it to define a low-communication protocol? The idea is for Alice and Bob to treat their inputs as a stream (x, y), with all of x arriving before all of y. Alice can feed x into S without communicating with Bob (she knows x and S). After processing x, S’s state is completely summarized by the s bits in its memory. Alice sends these bits to Bob. Bob can then simply restart the streaming algorithm S seeded with this initial memory, and then feed his input y to the algorithm. The algorithm S winds up computing some function of (x, y), and Alice only needs to communicate s bits to Bob to make it happen. The communication cost of the induced protocol is exactly the same as the space used by the streaming algorithm.
Figure 1.2 Why a small-space streaming algorithm induces a low-communication one-way protocol. Alice runs the streaming algorithm on her input, sends the memory contents of the algorithm to Bob, and Bob resumes the execution of the algorithm where Alice left off on his input.
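In code, the simulation looks like the following sketch. The streaming algorithm used here (exact distinct-element counting via a set) is a stand-in chosen for simplicity, not one from the text; any streaming algorithm whose memory state can be serialized works the same way, and the message Bob receives is exactly that memory state:

```python
import pickle

class CountDistinctExactly:
    """A deliberately naive streaming algorithm: its entire memory is the
    set of elements seen so far. Any streaming algorithm with a
    serializable memory state could be substituted here."""
    def __init__(self):
        self.memory = set()
    def process(self, item):
        self.memory.add(item)
    def answer(self):
        return len(self.memory)

def one_way_protocol(x_stream, y_stream):
    # Alice runs the streaming algorithm on her half of the stream...
    S = CountDistinctExactly()
    for item in x_stream:
        S.process(item)
    # ...and sends its memory contents to Bob. The communication cost of
    # the protocol is exactly the size of this serialized memory.
    message = pickle.dumps(S.memory)
    # Bob resumes the computation from Alice's memory snapshot.
    T = CountDistinctExactly()
    T.memory = pickle.loads(message)
    for item in y_stream:
        T.process(item)
    return T.answer(), len(message)

answer, message_size = one_way_protocol([1, 5, 5, 2], [2, 7])
assert answer == 4  # distinct elements of the combined stream: 1, 5, 2, 7
```

The naive algorithm above uses linear space, so the induced protocol communicates a lot; a small-space streaming algorithm dropped into its place would induce a low-communication protocol, which is exactly the point of the reduction.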
1.9 The Disjointness Problem
To execute the two-step plan above to prove lower bounds on the space usage of streaming algorithms, we need to come up with a Boolean function that (i) can be reduced to a streaming problem that we care about and (ii) does not admit a low-communication one-way protocol.
1.9.1 Disjointness Is Hard for One-Way Communication
If you only remember one problem that is hard for communication protocols, it should be the Disjointness problem. This is the canonical hard problem in communication complexity, analogous to satisfiability (SAT) in the theory of NP-completeness. We’ll see more reductions from the Disjointness problem than from any other in this course.
In an instance of Disjointness, both Alice and Bob hold n-bit vectors x and y. We interpret these as characteristic vectors of two subsets of the universe {1, 2, . . . , n}, with the subsets corresponding to the “1” coordinates. We then define the Boolean function DISJ in the obvious way, with DISJ(x,y) = 0 if there is an index i ∈ {1,2,…,n} with xi = yi = 1, and DISJ(x, y) = 1 otherwise.
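In code, the function is a one-liner (bit vectors as Python lists; the helper name `disj` is ours):

```python
def disj(x, y):
    """DISJ(x, y) = 1 if the sets encoded by x and y are disjoint, else 0."""
    return 0 if any(xi == 1 and yi == 1 for xi, yi in zip(x, y)) else 1

# {1, 3} and {2, 4} are disjoint; {1, 3} and {3, 4} are not.
assert disj([1, 0, 1, 0], [0, 1, 0, 1]) == 1
assert disj([1, 0, 1, 0], [0, 0, 1, 1]) == 0
```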
To warm up, let’s start with an easy result.
Proposition 1.8 Every deterministic one-way communication protocol that computes the function DISJ uses at least n bits of communication in the worst case.
That is, the trivial protocol is optimal among deterministic protocols.13 The proof follows pretty straightforwardly from the Pigeonhole Principle — you might want to think it through before reading the proof below.
Formally, consider any one-way communication protocol where Alice always sends at most n − 1 bits. This means that, ranging over the 2^n possible inputs x that Alice might have, she sends at most 2^n − 1 distinct messages. By the Pigeonhole Principle, there are distinct inputs x1 and x2 for which Alice sends the same message z to Bob. Poor Bob, then, has to compute DISJ(x, y) knowing only z and y and not knowing x — x could be x1, or it could be x2. Letting i denote an index in which x1 and x2 differ (there must be one), Bob is really in trouble if his input y happens to be the ith basis vector (all zeroes except yi = 1). For then, whatever Bob says upon receiving the message z, he will be wrong for exactly one of the cases x = x1 or x = x2. We conclude that the protocol is not correct.
A stronger, and more useful, lower bound also holds.
Theorem 1.9 Every randomized one-way protocol14 that, for every input (x, y), correctly decides the function DISJ with probability at least 2/3, uses Ω(n) communication in the worst case.
13We’ll see later that the communication complexity remains n even when we allow general communication protocols.
14There are different flavors of randomized protocols, such as “public-coin” vs. “private-coin” versions. These distinctions won’t matter until next lecture, and we elaborate on them then.
The probability in Theorem 1.9 is over the coin flips performed by the protocol (there is no randomness in the input, which is "worst-case"). There's nothing special about the constant 2/3 in the statement of Theorem 1.9 — it can be replaced by any constant strictly larger than 1/2.
Theorem 1.9 is certainly harder to prove than Proposition 1.8, but it’s not too bad — we’ll kick off next lecture with a proof.15 For the rest of this lecture, we’ll take Theorem 1.9 on faith and use it to derive lower bounds on the space needed by streaming algorithms.
1.9.2 Space Lower Bound for F∞ (even with Randomization and Approximation)
Recall from Section 1.6 that the first weakness of Theorem 1.1 is that it applies only to F0 and F2 (and F1 is easy). The next result shows that, assuming Theorem 1.9, there is no sublinear-space algorithm for computing F∞, even probabilistically and approximately.
Theorem 1.10 (Alon et al. 1999) Every randomized streaming algorithm that, for every data stream of length m, computes F∞ to within a (1 ± .2) factor with probability at least 2/3 uses space Ω(min{m, n}).
Theorem 1.10 rules out, in a strong sense, extending our upper bounds for F0,F1,F2 to all Fk. Thus, the different frequency moments vary widely in tractability in the streaming model.16
Proof of Theorem 1.10: The proof simply implements the cartoon in Figure 1.2, with the problems of computing F∞ (in the streaming model) and Disjointness (in the one-way communication model). In more detail, let S be a space-s streaming algorithm that for every data stream, with probability at least 2/3, outputs an estimate in (1 ± .2)F∞. Now consider the following one-way communication protocol P for solving the Disjointness problem (given an input (x, y)):
1. Alice feeds into S the indices i for which xi = 1; the order can be arbitrary. Since Alice knows S and x, this step requires no communication.
2. Alice sends S’s current memory state σ to Bob. Since S uses space s, this can be communicated using s bits.
3. Bob resumes the streaming algorithm S with the memory state σ, and feeds into S the indices i for which yi = 1 (in arbitrary order).
15A more difficult and important result is that the communication complexity of Disjointness remains Ω(n) even if we allow arbitrary (not necessarily one-way) randomized protocols. We’ll use this stronger result several times later in the course. We’ll also briefly discuss the proof in Section 4.3.4 of Lecture 4.
16For finite k strictly larger than 2, the optimal space of a randomized (1 ± ε)-approximate streaming algorithm turns out to be roughly Θ(n^{1−2/k}) (Bar-Yossef et al., 2002a; Chakrabarti et al., 2003; Indyk and Woodruff, 2005). See the exercises for a bit more about these problems.
4. Bob declares "disjoint" if and only if S's final answer is at most 4/3.
To analyze this reduction, observe that the frequency of an index i ∈ {1, 2, . . . , n} in the data stream induced by (x, y) is 0 if xi = yi = 0, 1 if exactly one of xi, yi is 1, and 2 if xi = yi = 1. Thus, F∞ of this data stream is 2 if (x, y) is a "no" instance of Disjointness, and is at most 1 otherwise. By assumption, for every "yes" (respectively, "no") input (x, y), with probability at least 2/3 the algorithm S outputs an estimate that is at most 1.2 (respectively, at least 2/1.2); in this case, the protocol P correctly decides the input (x, y). Since P is a one-way protocol using s bits of communication, Theorem 1.9 implies that s = Ω(n). Since the length m of the induced data stream is at most 2n, this reduction also rules out o(m)-space streaming algorithms for the problem. ∎
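The frequency bookkeeping in this proof is easy to check directly. The following sketch (a plain simulation, not a streaming algorithm; all helper names are ours) builds the stream induced by (x, y), computes its F∞ exactly, and applies Bob's 4/3 threshold:

```python
from collections import Counter

def f_infinity(stream):
    """F_infinity = the maximum frequency of any element in the stream."""
    return max(Counter(stream).values()) if stream else 0

def induced_stream(x, y):
    """The stream fed into S: the indices of x's 1s (Alice's half),
    followed by the indices of y's 1s (Bob's half)."""
    return [i for i, xi in enumerate(x) if xi == 1] + \
           [i for i, yi in enumerate(y) if yi == 1]

def bobs_decision(f_inf_estimate):
    """Bob declares 'disjoint' iff the estimate is at most 4/3."""
    return f_inf_estimate <= 4 / 3

# Disjoint sets: every frequency is 0 or 1, so F_infinity <= 1.
assert f_infinity(induced_stream([1, 0, 1], [0, 1, 0])) == 1
# Intersecting sets: a common index appears twice, so F_infinity = 2.
assert f_infinity(induced_stream([1, 0, 1], [1, 1, 0])) == 2
```

The threshold 4/3 sits strictly between the largest possible estimate of a "yes" instance (1.2) and the smallest possible estimate of a "no" instance (2/1.2 ≈ 1.67), so a (1 ± .2)-accurate estimate always lands on the correct side.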
Remark 1.11 (The Heavy Hitters Problem) Theorem 1.10 implies that computing the maximum frequency is a hard problem in the streaming model, at least for worst-case inputs. As mentioned, the problem is nevertheless practically quite important, so it’s important to make progress on it despite this lower bound. For example, consider the following relaxed version, known as the “heavy hitters” problem: for a parameter k, if there are any elements with frequency bigger than m/k, then find one or all such elements. When k is constant, there are good solutions to this problem: the exercises outline the “Mishra-Gries” algorithm, and the “Count-Min Sketch” and its variants also give good solutions (Charikar et al., 2004; Cormode and Muthukrishnan, 2005).17 The heavy hitters problem captures many of the applications that motivated the problem of computing F∞.
1.9.3 Space Lower Bound for Randomized Exact Computation of F0 and F2
In Section 1.6 we also criticized our positive results for F0 and F2 — to achieve them, we had to make two compromises, allowing approximation and a non-zero chance of failure. The reduction in the proof of Theorem 1.10 also implies that merely allowing randomization is not enough.
Theorem 1.12 (Alon et al. 1999) For every non-negative integer k ̸= 1, every random- ized streaming algorithm that, for every data stream, computes Fk exactly with probability at least 2/3 uses space Ω(min{n, m}).
The proof of Theorem 1.12 is almost identical to that of Theorem 1.10. The reason the proof of Theorem 1.10 rules out approximation (even with randomization) is because F∞ differs by a factor of 2 in the two different cases ("yes" and "no" instances of Disjointness).
17This does not contradict Theorem 1.10 — in the hard instances of F∞ produced by that proof, all frequencies are in {0, 1, 2} and hence there are no heavy hitters.
For finite k, the correct value of Fk will be at least slightly different in the two cases, which is enough to rule out a randomized algorithm that is exact at least two-thirds of the time.18 The upshot of Theorem 1.12 is that, even for F0 and F2, approximation is essential
to obtain a sublinear-space algorithm. It turns out that randomization is also essential — every deterministic streaming algorithm that always outputs a (1 ± ε)-estimate of Fk (for any k ̸= 1) uses linear space (Alon et al., 1999). The argument is not overly difficult — see the exercises for the details.
1.10 Looking Backward and Forward
Assuming that randomized one-way communication protocols require Ω(n) communication to solve the Disjointness problem (Theorem 1.9), we proved that some frequency moments (in particular, F∞) cannot be computed in sublinear space, even allowing randomization and approximation. Also, both randomization and approximation are essential for our
sublinear-space streaming algorithms for F0 and F2. The next action items are:
1. Prove Theorem 1.9.
2. Revisit the five compromises we made to obtain positive results (Section 1.6). We've shown senses in which the first three compromises are necessary. Next lecture we'll see why the last two are needed, as well.
18Actually, this is not quite true (why?). But if Bob also knows the number of 1’s in Alice’s input (which Alice can communicate in log2 n bits, a drop in the bucket), then the exact computation of Fk allows Bob to distinguish “yes” and “no” inputs of Disjointness (for any k ̸= 1).

Lecture 2
Lower Bounds for One-Way Communication: Disjointness, Index, and Gap-Hamming
2.1 The Story So Far
Recall from last lecture the simple but useful model of one-way communication complexity. Alice has an input x ∈ {0, 1}^a, Bob has an input y ∈ {0, 1}^b, and the goal is to compute a Boolean function f : {0, 1}^a × {0, 1}^b → {0, 1} of the joint input (x, y). The players communicate as in Figure 2.1: Alice sends a message z to Bob as a function of x only (she doesn't know Bob's input y), and Bob has to decide the function f knowing only z and y (he doesn't know Alice's input x). The one-way communication complexity of f is the smallest number of bits communicated (in the worst case over (x, y)) of any protocol that computes f. We'll sometimes consider deterministic protocols but are interested mostly in randomized protocols, which we'll define more formally shortly.
Figure 2.1 A one-way communication protocol. Alice sends a message to Bob that depends only on her input; Bob makes a decision based on his input and Alice’s message.
We motivated the one-way communication model through applications to streaming
algorithms. Recall the data stream model, where a data stream x1, . . . , xm ∈ U of elements
from a universe of n = |U| elements arrive one by one. The assumption is that there is
insufficient space to store all of the data, but we’d still like to compute useful statistics
of it via a one-pass computation. Last lecture, we showed that very cool and non-trivial
positive results are possible in this model. We presented a slick and low-space (O(ε^{-2}(log n + log m) log(1/δ))) streaming algorithm that, with probability at least 1 − δ, computes a (1 ± ε)-approximation of F2 = Σ_{j∈U} f_j^2, the skew of the data. (Recall that f_j ∈ {0, 1, 2, . . . , m} is the number of times that j appears in the stream.) We also mentioned the main idea
(details in the homework) for an analogous low-space streaming algorithm that estimates F0, the number of distinct elements in a data stream.
Low-space streaming algorithms S induce low-communication one-way protocols P , with the communication used by P equal to the space used by S. Such reductions typically have the following form. Alice converts her input x to a data stream and feeds it into the assumed space-s streaming algorithm S. She then sends the memory of S (after processing x) to Bob; this requires only s bits of communication. Bob then resumes S’s execution at the point that Alice left off, and feeds a suitable representation of his input y into S. When S terminates, it has computed some kind of useful function of (x, y) with only s bits of communication. The point is that lower bounds for one-way communication protocols — which, as we’ll see, we can actually prove in many cases — imply lower bounds on the space needed by streaming algorithms.
Last lecture we used without proof the following result (Theorem 1.9).1
Theorem 2.1 The one-way communication complexity of the Disjointness problem is
Ω(n), even for randomized protocols.
We'll be more precise about the randomized protocols that we consider in the next section. Recall that an input of Disjointness is defined by x, y ∈ {0, 1}^n, which we view as characteristic vectors of two subsets of {1, 2, . . . , n}, and the output should be "0" if there is an index i with xi = yi = 1 and "1" otherwise.
We used Theorem 1.9 to prove a few lower bounds on the space required by streaming algorithms. A simple reduction showed that every streaming algorithm that computes F∞, the maximum frequency, even approximately and with probability 2/3, needs linear (i.e., Ω(min{n, m})) space. This is in sharp contrast to our algorithms for approximating F0 and F2, which required only logarithmic space. The same reduction proves that, for F0 and F2, exact computation requires linear space, even if randomization is allowed. A different simple argument (see the homework) shows that randomization is also essential for our positive results: every deterministic streaming algorithm that approximates F0 or F2 up to a small constant factor requires linear space.
In today’s lecture we’ll prove Theorem 1.9, introduce and prove lower bounds for a couple of other problems that are hard for one-way communication, and prove via reductions some further space lower bounds for streaming algorithms.
2.2 Randomized Protocols
There are many different flavors of randomized communication protocols. Before proving any lower bounds, we need to be crystal clear about exactly which protocols we’re talking about. The good news is that, for algorithmic applications, we can almost always focus
1Though we did prove it for the special case of deterministic protocols, using a simple Pigeonhole Principle argument.
on a particular type of randomized protocol. By default, we adopt the following four assumptions and rules of thumb. The common theme behind them is that we want to allow as permissive a class of randomized protocols as possible, to maximize the strength of our lower bounds and the consequent algorithmic applications.
Public coins. First, unless otherwise noted, we consider public-coin protocols. This means that, before Alice and Bob ever show up, a deity writes an infinite sequence of perfectly random bits on a blackboard visible to both Alice and Bob. Alice and Bob can freely use as many of these random bits as they want — it doesn’t contribute to the communication cost of the protocol.
The private coins model might seem more natural to the algorithm designer — here, Alice and Bob just flip their own random coins as needed. Coins flipped by one player are unknown to the other player unless they are explicitly communicated.2 Note that every private-coin protocol can be simulated with no loss by a public-coin protocol: for example, Alice uses the shared random bits 1, 3, 5, etc. as needed, while Bob uses the random bits 2, 4, 6, etc.
It turns out that while public-coin protocols are strictly more powerful than private-coin protocols, for the purposes of this course, the two models have essentially the same behavior. In any case, our lower bounds will generally apply to public-coin (and hence also private-coin) protocols.
A second convenient fact about public-coin randomized protocols is that they are equivalent to distributions over deterministic protocols. Once the random bits on the blackboard have been fixed, the protocol proceeds deterministically. Conversely, every distribution over deterministic protocols (with rational probabilities) can be implemented via a public-coin protocol — just use the public coins to choose from the distribution.
Two-sided error. We consider randomized algorithms that are allowed to err with some probability on every input (x, y), whether f(x, y) = 0 or f(x, y) = 1. A stronger requirement would be one-sided error — here there are two flavors, one that forbids false positives (but allows false negatives) and one that forbids false negatives (but allows false positives). Clearly, lower bounds that apply to protocols with two-sided error are at least as strong as those for protocols with one-sided error — indeed, the latter lower bounds are often much easier to prove (at least for one of the two sides). Note that the one-way protocols induced by the streaming algorithms in the last lecture are randomized protocols with two-sided error. There are other problems for which the natural randomized solutions have only one-sided error.3
Arbitrary constant error probability. A simple but important fact is that all constant error probabilities ε ∈ (0, 1/2) yield the same communication complexity, up to a
2Observe that the one-way communication protocols induced by streaming algorithms are private-coin protocols — random coins flipped during the first half of the data stream are only available to the second half if they are explicitly stored in memory.
3One can also consider “zero-error” randomized protocols, which always output the correct answer but use a random amount of communication. We won’t need to discuss such protocols in this course.
constant factor. The reason is simple: the success probability of a protocol can be boosted through amplification (i.e., repeated trials).4 In more detail, suppose P uses k bits of communication and has success probability at least 51% on every input. Imagine repeating P 10000 times. To preserve the one-way-ness of the protocol, all of the repeated trials need to happen in parallel, with the public coins providing the necessary 10000 independent random strings. Alice sends 10000 messages to Bob, Bob imagines answering each one — some answers will be "1," others "0" — and concludes by reporting the majority vote of the 10000 answers. In expectation 5100 of the trials give the correct answer, and the probability that more than 5000 of them are correct is big (at least 90%, say). In general, a constant number of trials, followed by a majority vote, boosts the success probability of a protocol from any constant bigger than 1/2 to any other constant less than 1. These repeated trials increase the amount of communication by only a constant factor. See the exercises and the separate notes on Chernoff bounds for further details.
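The boosting calculation can be made concrete: the success probability of the majority vote is a binomial tail, which the sketch below evaluates exactly (the helper name is ours; the computation is done in log space only to dodge floating-point underflow for large trial counts):

```python
from math import exp, fsum, lgamma, log

def majority_success_prob(p, trials):
    """Probability that a strict majority of `trials` independent runs,
    each correct with probability p, answer correctly. This is a binomial
    tail, computed in log space to avoid underflow for large `trials`."""
    def log_pmf(k):
        log_binom = lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
        return log_binom + k * log(p) + (trials - k) * log(1 - p)
    return fsum(exp(log_pmf(k)) for k in range(trials // 2 + 1, trials + 1))

# A 51%-correct protocol, repeated 10000 times in parallel with a majority
# vote (mean 5100 correct trials, standard deviation about 50), is correct
# well over 90% of the time.
assert majority_success_prob(0.51, 10000) > 0.9
```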
This argument justifies being sloppy about the exact (constant) error of a two-sided protocol. For upper bounds, we'll be content to achieve error 49% — it can be reduced to an arbitrarily small constant with a constant blow-up in communication. For lower bounds, we'll be content to rule out protocols with error 1% — the same communication lower bounds hold, modulo a constant factor, even for protocols with error 49%.
Worst-case communication. When we speak of the communication used by a ran- domized protocol, we take the worst case over inputs (x, y) and over the coin flips of the protocol. So if a protocol uses communication at most k, then Alice always sends at most k bits to Bob.
This definition seems to go against our guiding rule of being as permissive as possible. Why not measure only the expected communication used by a protocol, with respect to its coin flips? This objection is conceptually justified but technically moot — for protocols that can err, passing to the technically more convenient worst-case measure can only increase the communication complexity of a problem by a constant factor.
To see this, consider a protocol R that, for every input (x, y), has two-sided error at most 1/3 (say) and uses at most k bits of communication on average over its coin flips. This protocol uses at most 10k bits of communication at least 90% of the time — if it used more than 10k bits more than 10% of the time, its expected communication cost would be more than k. Now consider the following protocol R′: simulate R for up to 10k steps; if R fails to terminate, then abort and output an arbitrary answer. The protocol R′ always sends at most 10k bits of communication and has error at most that of R, plus 10% (here, ≈ 43%). This error probability of R′ can be reduced back down (to 1/3, or whatever) through repeated trials, as before.
In light of these four standing assumptions and rules, we can restate Theorem 1.9 as
follows.
4We mentioned a similar "median of means" idea last lecture (developed further in the homework), when we discussed how to reduce the 1/δ factor in the space usage of our streaming algorithms to a factor of log(1/δ).
Theorem 2.2 Every public-coin randomized one-way protocol for Disjointness that has two-sided error at most a constant ε ∈ (0, 1/2) uses Ω(n) communication in the worst case (over inputs and coin flips).
Now that we are clear on the formal statement of our lower bound, how do we prove it?
2.3 Distributional Complexity
Randomized protocols are much more of a pain to reason about than deterministic protocols. For example, recall our Pigeonhole Principle-based argument last lecture for deterministic protocols: if Alice holds an n-bit input and always sends at most n − 1 bits, then there are distinct inputs x, x′ for which Alice sends the same message z. (For Disjointness, this ambiguity left Bob in a lurch.) In a randomized protocol where Alice always sends at most n − 1 bits, Alice can use a different distribution over (n − 1)-bit messages for each of her 2^n inputs x, and the naive argument breaks down. While Pigeonhole-type arguments can sometimes be pushed through for randomized protocols, this section introduces a different approach.
Distributional complexity is the main methodology by which one proves lower bounds on the communication complexity of randomized algorithms. The point is to reduce the goal to proving lower bounds for deterministic protocols only, with respect to a suitably chosen input distribution.
Lemma 2.3 (Yao 1983) Let D be a distribution over the space of inputs (x, y) to a communication problem, and ε ∈ (0, 1/2). Suppose that every deterministic one-way protocol P with

Pr_{(x,y)∼D}[P wrong on (x, y)] ≤ ε

has communication cost at least k. Then every (public-coin) randomized one-way protocol R with (two-sided) error at most ε on every input has communication cost at least k.
In the hypothesis of Lemma 2.3, all of the randomness is in the input — P is deterministic, (x, y) is random. In the conclusion, all of the randomness is in the protocol R — the input is arbitrary but fixed, while the protocol can flip coins. Not only is Lemma 2.3 extremely
useful, but it is easy to prove.
Proof of Lemma 2.3: Let R be a randomized protocol with communication cost less than k. Recall that such an R can be written as a distribution over deterministic protocols, call them P1, P2, . . . , Ps. Recalling that the communication cost of a randomized protocol is defined as the worst-case communication (over both inputs and coin flips), each deterministic protocol Pi always uses less than k bits of communication. By assumption,
Pr_{(x,y)∼D}[Pi wrong on (x, y)] > ε
for i = 1, 2, . . . , s. Averaging over the Pi's, we have

Pr_{(x,y)∼D; R}[R wrong on (x, y)] > ε.
Since the maximum of a set of numbers is at least its average, there exists an input (x, y) such that

Pr_R[R wrong on (x, y)] > ε,

which completes the proof. ∎
The converse of Lemma 2.3 also holds — whatever the true randomized communication complexity of a problem, there exists a bad distribution D over inputs that proves it (Yao, 1983). The proof is by strong linear programming duality or, equivalently, von Neumann’s Minimax Theorem for zero-sum games (see the exercises for details). Thus, the distributional methodology is “complete” for proving lower bounds — one “only” needs to find the right distribution D over inputs. In general this is a bit of a dark art, though in today’s application D will just be the uniform distribution.
2.4 The Index Problem
We prove Theorem 2.2 in two steps. The first step is to prove a linear lower bound on the randomized communication complexity of a problem called Index, which is widely useful for proving one-way communication complexity lower bounds. The second step, which is easy, reduces Index to Disjointness.
In an instance of Index, Alice gets an n-bit string x ∈ {0, 1}^n and Bob gets an integer i ∈ {1, 2, . . . , n}, encoded in binary using ≈ log2 n bits. The goal is simply to compute xi, the ith bit of Alice's input.
Intuitively, since Alice has no idea which of her bits Bob is interested in, she has to send Bob her entire input. This intuition is easy to make precise for deterministic protocols, by a Pigeonhole Principle argument. The intuition also holds for randomized protocols, but the proof takes more work.
Theorem 2.4 (Kremer et al. 1999) The randomized one-way communication complexity of Index is Ω(n).
With a general communication protocol, where Bob can also send information to Alice, Index is trivial to solve using only ≈ log2 n bits of information — Bob just sends i to Alice. Thus Index nicely captures the difficulty of designing non-trivial one-way communication protocols, above and beyond the lower bounds that already apply to general protocols.
Theorem 2.4 easily implies Theorem 2.2.
Proof of Theorem 2.2: We show that Index reduces to Disjointness. Given an input (x, i) of Index, Alice forms the input x′ = x while Bob forms the input y′ = e_i; here e_i is the standard basis vector, with a "1" in the ith coordinate and "0"s in all other coordinates. Then, (x′, y′) is a "yes" instance of Disjointness if and only if xi = 0. Thus, every one-way protocol for Disjointness induces one for Index, with the same communication cost and error probability. ∎
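The reduction is mechanical enough to express in a few lines. In the sketch below, `disj_protocol` stands in for an arbitrary black-box one-way protocol for Disjointness (here instantiated with the trivial direct computation, since only the input transformation matters); all names are ours:

```python
def disj(x, y):
    """DISJ = 1 iff no index has x_i = y_i = 1 (stand-in for any protocol)."""
    return 0 if any(a == 1 and b == 1 for a, b in zip(x, y)) else 1

def index_via_disjointness(x, i, disj_protocol):
    """Solve Index using any protocol for Disjointness: Bob's input
    becomes the standard basis vector e_i."""
    e_i = [1 if j == i else 0 for j in range(len(x))]
    # (x, e_i) is a "yes" instance of Disjointness exactly when x_i = 0.
    return 0 if disj_protocol(x, e_i) == 1 else 1

# The reduction recovers every bit of x.
x = [0, 1, 1, 0, 1]
for i in range(len(x)):
    assert index_via_disjointness(x, i, disj) == x[i]
```

Since the transformation touches only the players' inputs, any s-bit protocol for Disjointness yields an s-bit protocol for Index with the same error, so the Ω(n) lower bound for Index transfers to Disjointness.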
We now prove Theorem 2.4. While some computations are required, the proof is conceptually pretty straightforward.
Proof of Theorem 2.4: We apply the distributional complexity methodology. This requires positing a distribution D over inputs. Sometimes this takes creativity. Here, the first thing you’d try — the uniform distribution D, where x and i are chosen independently and uniformly at random — works.
Let c be a sufficiently small constant (like .1 or less) and assume that n is sufficiently large (like 300 or more). We’ll show that every deterministic one-way protocol that uses at most cn bits of communication has error (w.r.t. D) at least 1/8. By Lemma 2.3, this implies that every randomized protocol using at most cn bits of communication has error at least 1/8 on some input. Recalling the discussion about error probabilities in Section 2.2, this implies that for every constant error ε′ ∈ (0, 1/2), there is a constant c′ > 0 such that every randomized protocol that uses at most c′n bits of communication has error bigger than ε′.
Fix a deterministic one-way protocol P that uses at most cn bits of communication. Since P is deterministic, there are only 2^{cn} distinct messages z that Alice ever sends to Bob (ranging over the 2^n possible inputs x). We need to formalize the intuition that Bob typically (over x) doesn’t learn very much about x, and hence typically (over i) doesn’t know what xi is.

Suppose Bob gets a message z from Alice, and his input is i. Since P is deterministic, Bob has to announce a bit, “0” or “1,” as a function of z and i only. (Recall Figure 2.1.) Holding z fixed and considering Bob’s answers for each of his possible inputs i = 1, 2, . . . , n, we get an n-bit vector — Bob’s answer vector a(z) when he receives message z from Alice. Since there are at most 2^{cn} possible messages z, there are at most 2^{cn} possible answer vectors a(z).
Answer vectors are a convenient way to express the error of the protocol P, with respect to the randomness in Bob’s input. Fix Alice’s input x, which results in the message z. The protocol is correct if Bob holds an input i with a(z)i = xi, and incorrect otherwise. Since Bob’s index i is chosen uniformly at random, and independently of x, we have
Pr_i[P is incorrect | x, z] = dH(x, a(z))/n,    (2.1)
where dH(x,a(z)) denotes the Hamming distance between the vectors x and a(z) (i.e., the number of coordinates in which they differ). Our goal is to show that, with constant probability over the choice of x, the expression (2.1) is bounded below by a constant.
Let A = {a(z(x)) : x ∈ {0, 1}^n} denote the set of all answer vectors used by the protocol P. Recall that |A| ≤ 2^{cn}. Call Alice’s input x good if there exists an answer vector a ∈ A with dH(x, a) < n/4, and bad otherwise. Geometrically, you should think of each answer vector a as the center of a ball of radius n/4 in the Hamming cube — the set {0, 1}^n equipped with the Hamming metric. See Figure 2.2. The next claim states that, because there aren’t too many balls (only 2^{cn} for a small constant c) and their radii aren’t too big (only n/4), the union of all of the balls is less than half of the Hamming cube.5

Figure 2.2 Balls of radius n/4 in the Hamming metric, centered at the answer vectors used by the protocol P.

Claim: Provided c is sufficiently small and n is sufficiently large, there are at least 2^{n−1} bad inputs x.

Before proving the claim, let’s see why it implies the theorem. We can write

Pr_{(x,i)∼D}[P wrong on (x, i)] = Pr[x is good] · Pr[P wrong on (x, i) | x is good]
                               + Pr[x is bad] · Pr[P wrong on (x, i) | x is bad],

where the first term is at least 0 and, by the Claim, Pr[x is bad] ≥ 1/2. Recalling (2.1) and the definition of a bad input x, we have

Pr_{(x,i)}[P wrong on (x, i) | x is bad] = E_x[ dH(x, a(z(x)))/n | x is bad ]
                                        ≥ E_x[ min_{a∈A} dH(x, a)/n | x is bad ]
                                        ≥ 1/4,

since dH(x, a) ≥ n/4 for every a ∈ A whenever x is bad. We conclude that the protocol P errs on the distribution D with probability at least 1/2 · 1/4 = 1/8, which implies the theorem.

We conclude by proving the claim.

Proof of Claim: Fix some answer vector a ∈ A. The number of inputs x with Hamming distance at most n/4 from a is

1 + C(n, 1) + C(n, 2) + · · · + C(n, n/4),    (2.2)

where C(n, j) denotes the binomial coefficient “n choose j” and the successive terms count the x’s with dH(x, a) = 0, 1, 2, . . . , n/4. Recalling the inequality C(n, k) ≤ (en/k)^k, which follows easily from Stirling’s approximation of the factorial function (see the exercises), we can crudely bound (2.2) above by

n · (4e)^{n/4} = n · 2^{(n/4) log2(4e)} ≤ n · 2^{.861n}.

The total number of good inputs x — the union of all the balls — is at most |A| · n · 2^{.861n} ≤ n · 2^{(.861+c)n}, which is at most 2^{n−1} for c sufficiently small (say .1) and n sufficiently large (at least 300, say). ∎

5More generally, the following is good intuition about the Hamming cube for large n: as you blow up a ball of radius r around a point, the ball includes very few points until r is almost equal to n/2; the ball includes roughly half the points for r ≈ n/2; and for r even modestly larger than n/2, the ball contains almost all of the points.

2.5 Where We’re Going

Theorem 2.4 completes our first approach to proving lower bounds on the space required by streaming algorithms to compute certain statistics. To review, we proved from scratch that Index is hard for one-way communication protocols (Theorem 2.4), reduced Index to Disjointness to extend the lower bound to the latter problem (Theorem 2.2), and reduced Disjointness to various streaming computations (last lecture). See also Figure 2.3. Specifically, we showed that linear space is necessary to compute the highest frequency in a data stream (F∞), even when randomization and approximation are allowed, and that linear space is necessary to compute exactly F0 or F2 by a randomized streaming algorithm with success probability 2/3.

Index [hard by Theorem 2.4] −−Theorem 2.2−→ Disjointness −−Lecture 1−→ Streaming

Figure 2.3 Review of the proof structure of linear (in min{n, m}) space lower bounds for streaming algorithms. Lower bounds travel from left to right.

We next focus on the dependence on the approximation parameter ε required by a streaming algorithm to compute a (1 ± ε)-approximation of a frequency moment. Recall that the streaming algorithms that we’ve seen for F0 and F2 have quadratic dependence on ε^{−1}. Thus an approximation of 1% would require a blowup of 10,000 in the space. Obviously, it would be useful to have algorithms with a smaller dependence on ε^{−1}.
We next prove that space quadratic in ε^{−1} is necessary, even allowing randomization and even for F0 and F2, to achieve a (1 ± ε)-approximation. Happily, we’ll prove this via reductions, and won’t need to prove from scratch any new communication lower bounds. We’ll follow the path in Figure 2.4. First we introduce a new problem, also very useful for proving lower bounds, called the Gap-Hamming problem. Second, we give a quite clever reduction from Index to Gap-Hamming. Finally, it is straightforward to show that one-way protocols for Gap-Hamming with sublinear communication induce streaming algorithms that can compute a (1 ± ε)-approximation of F0 or F2 in o(ε^{−2}) space.

Index [hard by Theorem 2.4] −−Theorem 2.5−→ Gap-Hamming −−Section 2.6.2−→ Streaming

Figure 2.4 Proof plan for Ω(ε^{−2}) space lower bounds for (randomized) streaming algorithms that approximate F0 or F2 up to a 1 ± ε factor. Lower bounds travel from left to right.

2.6 The Gap-Hamming Problem

Our current goal is to prove that every streaming algorithm that computes a (1 ± ε)-approximation of F0 or F2 needs Ω(ε^{−2}) space. Note that we’re not going to prove this when ε ≪ 1/√n, since we can always compute a frequency moment exactly in linear or near-linear space. So the extreme case of what we’re trying to prove is that a (1 ± 1/√n)-approximation requires Ω(n) space. This special case already requires all of the ideas needed to prove a lower bound of Ω(ε^{−2}) for all larger ε as well.

2.6.1 Why Disjointness Doesn’t Work

Our goal is also to prove this lower bound through reductions, rather than from scratch. We don’t know too many hard problems yet, and we’ll need a new one. To motivate it, let’s see why Disjointness is not good enough for our purposes.

Suppose we have a streaming algorithm S that gives a (1 ± 1/√n)-approximation to F0 — how could we use it to solve Disjointness? The obvious idea is to follow the reduction used last lecture for F∞.
Alice converts her input x of Disjointness into a stream, feeds this stream into S, sends the final memory state of S to Bob, and Bob converts his input y of Disjointness into a stream and resumes S’s computation on it. With healthy probability, S returns a (1 ± 1/√n)-approximation of F0 of the stream induced by (x, y). But is this good for anything?

Suppose (x, y) is a “yes” instance of Disjointness. Then, F0 of the corresponding stream is |x| + |y|, where | · | denotes the number of 1’s in a bit vector. If (x, y) is a “no” instance of Disjointness, then F0 is somewhere between max{|x|, |y|} and |x| + |y| − 1. A particularly hard case is when |x| = |y| = n/2 and x, y are either disjoint or overlap in exactly one element — F0 is then either n or n − 1. In this case, a (1 ± 1/√n)-approximation of F0 translates to additive error √n, which is nowhere near enough resolution to distinguish between “yes” and “no” instances of Disjointness.

2.6.2 Reducing Gap-Hamming to F0 Estimation

A (1 ± 1/√n)-approximation of F0 is insufficient to solve Disjointness — but perhaps there is some other hard problem that it does solve? The answer is yes, and the problem is estimating the Hamming distance between two vectors x, y — the number of coordinates in which x, y differ.

To see the connection between F0 and Hamming distance, consider x, y ∈ {0, 1}^n and the usual data stream (with elements in U = {1, 2, . . . , n}) induced by them. As usual, we can interpret x, y as characteristic vectors of subsets A, B of U (Figure 2.5). Observe that the Hamming distance dH(x, y) is just the size of the symmetric difference, |A \ B| + |B \ A|. Observe also that F0 = |A ∪ B|, so |A \ B| = F0 − |B| and |B \ A| = F0 − |A|, and hence

dH(x, y) = 2F0 − |x| − |y|.

Finally, Bob knows |y|, and Alice can send |x| to Bob using log2 n bits.
Figure 2.5 The Hamming distance between two bit vectors equals the size of the symmetric difference of the corresponding subsets of 1-coordinates.

The point is that a one-way protocol that computes F0 with communication c yields a one-way protocol that computes dH(x, y) with communication c + log2 n. More generally, a (1 ± 1/√n)-approximation of F0 yields a protocol that estimates dH(x, y) up to 2F0/√n ≤ 2√n additive error, with log2 n extra communication.

This reduction from Hamming distance estimation to F0 estimation is only useful to us if the former problem has large communication complexity. It’s technically convenient to convert Hamming distance estimation into a decision problem. We do this using a “promise problem” — intuitively, a problem where we only care about a protocol’s correctness when the input satisfies some conditions (a “promise”). Formally, for a parameter t, we say that a protocol correctly solves Gap-Hamming(t) if it outputs “1” whenever dH(x, y) < t − c√n and outputs “0” whenever dH(x, y) > t + c√n, where c is a sufficiently small constant. Note that the protocol can output whatever it wants, without penalty, on inputs for which dH(x, y) = t ± c√n.
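The identity dH(x, y) = 2F0 − |x| − |y| is easy to sanity-check numerically; here is a small illustrative sketch (the function names are ours, not part of any protocol):

```python
def distinct_elements(x, y):
    """F0 of the stream induced by bit vectors x and y, i.e. |A ∪ B|."""
    return sum(1 for xi, yi in zip(x, y) if xi or yi)

def hamming(x, y):
    """d_H(x, y): the number of coordinates where x and y differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

x = [1, 0, 1, 1, 0, 0]   # A = {1, 3, 4}
y = [0, 0, 1, 0, 1, 0]   # B = {3, 5}
f0 = distinct_elements(x, y)   # |A ∪ B| = 4
# The identity from the text: d_H(x, y) = 2*F0 - |x| - |y|.
assert hamming(x, y) == 2 * f0 - sum(x) - sum(y) == 3
```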
Our reduction above shows that, for every t, Gap-Hamming(t) reduces to the (1 ± c/√n)-approximation of F0. Does it matter how we pick t? Remember we still need to prove that the Gap-Hamming(t) problem does not admit low-communication one-way protocols. If we pick t = 0, then the problem becomes a special case of the Equality problem (where f(x, y) = 1 if and only if x = y). We’ll see next lecture that the one-way randomized communication complexity of Equality is shockingly low — only O(1) for public-coin protocols. Picking t = n has the same issue. Picking t = n/2 seems more promising. For example, it’s easy to certify a “no” instance of Equality — just exhibit an index where x and y differ. How would you succinctly certify that dH(x, y) is either at least n/2 + √n or at most n/2 − √n? For more intuition, think about two vectors x, y ∈ {0, 1}^n chosen uniformly at random. The expected Hamming distance between them is n/2, with a standard deviation of ≈ √n. Thus deciding an instance of Gap-Hamming(n/2) has the flavor of learning an unpredictable fact about two random strings, and it seems difficult to do this without learning detailed information about the particular strings at hand.
2.7 Lower Bound on the One-Way Communication Complexity of
Gap-Hamming
This section dispenses with the hand-waving and formally proves that every protocol that solves Gap-Hamming — with t = n/2 and c sufficiently small — requires linear communication.
Theorem 2.5 (Jayram et al. 2008; Woodruff 2004, 2007) The randomized one-way communication complexity of Gap-Hamming is Ω(n).
Proof: The proof is a randomized reduction from Index, and is more clever than the other reductions that we’ve seen so far. Consider an input to Index, where Alice holds an n-bit string x and Bob holds an index i ∈ {1, 2, . . . , n}. We assume, without loss of generality, that n is odd and sufficiently large.
Alice and Bob generate, without any communication, an input (x′, y′) to Gap-Hamming.
They do this one bit at a time, using the publicly available randomness. To generate the
first bit of the Gap-Hamming input, Alice and Bob interpret the first n public coins as a
random string r. Bob forms the bit b = ri, the ith bit of the random string. Intuitively, Bob
says “I’m going to pretend that r is actually Alice’s input, and report the corresponding
answer ri.” Meanwhile, Alice checks whether dH(x, r) < n/2 or dH(x, r) > n/2. (Since n is odd, one of these holds.) In the former case, Alice forms the bit a = 1 to indicate that r is a decent proxy for her input x. Otherwise, she forms the bit a = 0 to indicate that 1 − r would have been a better approximation of reality (i.e., of x).
The key and clever point of the proof is that a and b are correlated — positively if xi = 1
and negatively if xi = 0, where x and i are the given input to Index. To see this, condition
on the n−1 bits of r other than i. There are two cases. In the first case, x and r agree
on strictly less than or strictly greater than (n − 1)/2 of the bits so-far. In this case, a is
already determined (to 0 or 1, respectively). Thus, in this case, Pr[a = b] = Pr[ri = a] = 1/2, using that ri is independent of all the other bits. In the second case, amongst the n − 1 bits of r other than ri, exactly half of them agree with x. In this case, a = 1 if and only if xi = ri. Hence, if xi = 1, then a and b always agree (if ri = 1 then a = b = 1; if ri = 0 then a = b = 0). If xi = 0, then a and b always disagree (if ri = 1, then a = 0 and b = 1; if ri = 0, then a = 1 and b = 0).
The probability of the second case is the probability of getting (n − 1)/2 “heads” out of n − 1 coin flips, which is the binomial probability C(n−1, (n−1)/2)/2^{n−1}. Applying Stirling’s approximation of the factorial function shows that this probability is bigger than you might have expected, namely ≈ c′/√n for a constant c′ (see Exercises for details). We therefore have

Pr[a = b] = Pr[Case 1] · Pr[a = b | Case 1] + Pr[Case 2] · Pr[a = b | Case 2]
          = (1 − c′/√n) · (1/2) + (c′/√n) · (1 or 0, according as xi = 1 or xi = 0)
          = 1/2 + c′/(2√n)  if xi = 1,  and  1/2 − c′/(2√n)  if xi = 0.
This is pretty amazing when you think about it — Alice and Bob have no knowledge of each other’s inputs and yet, with shared randomness but no explicit communication, can generate bits correlated with xi!6
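For a small odd n, this correlation can be verified exactly by enumerating all 2^n shared random strings. The following illustrative sketch (not part of the reduction itself) computes Pr[a = b] by brute force:

```python
from itertools import product

def correlated_bit_prob(x, i):
    """Exact Pr[a = b] over a uniform random string r, where b = r[i]
    and a = 1 iff d_H(x, r) < n/2 (n odd). Brute-force enumeration."""
    n = len(x)
    agree = 0
    for r in product([0, 1], repeat=n):
        d = sum(1 for xj, rj in zip(x, r) if xj != rj)
        a = 1 if d < n / 2 else 0
        agree += (a == r[i])
    return agree / 2 ** n

# n = 7: Pr[Case 2] = C(6,3)/2^6 = 0.3125, so Pr[a = b] = 0.5 ± 0.15625.
x = [1, 0, 1, 1, 0, 1, 0]
assert correlated_bit_prob(x, 0) == 0.65625   # x[0] = 1: positive correlation
assert correlated_bit_prob(x, 1) == 0.34375   # x[1] = 0: negative correlation
```

The computed values match the formula above exactly, since the probabilities involved are dyadic rationals.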
The randomized reduction from Index to Gap-Hamming now proceeds as one would
expect. Alice and Bob repeat the bit-generating experiment above m independent times to
generate m-bit inputs x′ and y′ of Gap-Hamming. Here m = qn for a sufficiently large
constant q. The expected Hamming distance between x′ and y′ is at most m/2 − c′√m (if xi = 1) or at least m/2 + c′√m (if xi = 0). A routine application of the Chernoff bound (see Exercises) implies that, for a sufficiently small constant c and large constant q, with probability at least 8/9 (say), dH(x′, y′) < m/2 − c√m (if xi = 1) or dH(x′, y′) > m/2 + c√m
(if xi = 0). When this event holds, Alice and Bob can correctly compute the answer to the original input (x, i) to Index by simply invoking any protocol P for Gap-Hamming on the input (x′,y′). The communication cost is that of P on inputs of length m = Θ(n). The error is at most the combined error of the randomized reduction and of the protocol P — whenever the reduction and P both proceed as intended, the correct answer to the Index input (x, i) is computed.
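The input-generation step of the reduction can be sketched as follows; the shared seed is a stand-in for the public coins, and the parameter choices are illustrative rather than those required by the formal analysis:

```python
import random

def gap_hamming_instance(x, i, m, shared_seed):
    """From an Index input (x, i), Alice and Bob build a Gap-Hamming
    input (x', y') with NO communication, using shared randomness.
    len(x) must be odd. (Sketch; the seed models the public coins.)"""
    n = len(x)
    rng = random.Random(shared_seed)
    x_prime, y_prime = [], []
    for _ in range(m):
        r = [rng.randint(0, 1) for _ in range(n)]     # fresh public string
        d = sum(1 for xj, rj in zip(x, r) if xj != rj)
        x_prime.append(1 if d < n / 2 else 0)         # Alice's bit a
        y_prime.append(r[i])                          # Bob's bit b
    return x_prime, y_prime

# d_H(x', y') concentrates below m/2 when x[i] = 1 and above m/2 when
# x[i] = 0; with this large m the separation holds with overwhelming
# probability.
x = [1, 0, 1, 1, 0, 1, 0] * 7     # n = 49 (odd); x[0] = 1
m = 40 * len(x)
xp, yp = gap_hamming_instance(x, 0, m, shared_seed=1)
assert sum(a != b for a, b in zip(xp, yp)) < m // 2
```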
Summarizing, our randomized reduction implies that, if there is a (public-coin) randomized protocol for Gap-Hamming with (two-sided) error 1/3 and sublinear communication, then there is a randomized protocol for Index with error 4/9. Since we’ve ruled out the latter, the former does not exist. ∎
Combining Theorem 2.5 with our reduction from Gap-Hamming to F0 estimation (Section 2.6.2), we’ve proved the following.
Theorem 2.6 There is a constant c > 0 such that the following statement holds: there is no sublinear-space randomized streaming algorithm that, for every data stream, computes F0 to within a 1 ± c/√n factor with probability at least 2/3.
A variation on the same reduction proves the same lower bound for approximating F2; see the Exercises.
6This would clearly not be possible with a private-coin protocol. But we’ll see later that the (additive) difference between the private-coin and public-coin communication complexity of a problem is O(log n), so a linear communication lower bound for one type automatically carries over to the other type.

Our original goal was to prove that the (1 ± ε)-approximate computation of F0 requires space Ω(ε^{−2}), when ε ≥ 1/√n. Theorem 2.6 proves this in the special case where ε = Θ(1/√n). This can be extended to larger ε by a simple “padding” trick. Fix your favorite values of n and ε ≥ 1/√n, and modify the proof of Theorem 2.6 as follows. Reduce from Gap-Hamming on inputs of length m = Θ(ε^{−2}). Given an input (x, y) of Gap-Hamming, form (x′, y′) by appending n − m zeroes to x and y. A streaming algorithm with space s that estimates F0 on the induced data stream up to a (1 ± ε) factor induces a randomized protocol that solves this special case of Gap-Hamming with communication s. Theorem 2.5 implies that every randomized protocol for the latter problem uses communication Ω(ε^{−2}), so this lower bound carries over to the space used by the streaming algorithm.

Lecture 3
Lower Bounds for Compressive Sensing
3.1 An Appetizer: Randomized Communication Complexity of Equality
We begin with an appetizer before starting the lecture proper — an example that demon- strates that randomized one-way communication protocols can sometimes exhibit surprising power.
It won’t surprise you that the Equality function — with f(x, y) = 1 if and only if x = y — is a central problem in communication complexity. It’s easy to prove, by the Pigeonhole Principle, that its deterministic one-way communication complexity is n, where n is the length of the inputs x and y.1 What about its randomized communication complexity? Recall from last lecture that by default, our randomized protocols can use public coins2 and can have two-sided error ε, where ε is any constant less than 1/2.
Theorem 3.1 (Yao 1979) The (public-coin) randomized one-way communication com- plexity of Equality is O(1).
Thus, the randomized communication complexity of a problem can be radically smaller than its deterministic communication complexity. A similar statement follows from our upper and lower bound results for estimating the frequency moments F0 and F2 using small-space streaming algorithms, but Theorem 3.1 illustrates this point in a starker and clearer way.
Theorem 3.1 provides a cautionary tale: sometimes we expect a problem to be hard for a class of protocols and are proved wrong by a clever protocol; other times, clever protocols don’t provide non-trivial solutions to a problem but still make proving strong lower bounds technically difficult. Theorem 3.1 also suggests that, if we want to prove strong communication lower bounds for randomized protocols via a reduction, there might not be too many natural problems out there to reduce from.3
1We’ll see later that this lower bound applies to general deterministic protocols, not just to one-way protocols.
2Recall the public-coin model: when Alice and Bob show up there is already an infinite stream of random bits written on a blackboard, which both of them can see. Using shared randomness does not count toward the communication cost of the protocol.
3Recall our discussion about Gap-Hamming last lecture: for the problem to be hard, it is important to choose the midpoint t to be n/2. With t too close to 0 or n, the problem is a special case of Equality and is therefore easy for randomized protocols.

Proof of Theorem 3.1: The protocol is as follows.
1. Alice and Bob interpret the first 2n public coins as random strings r1, r2 ∈ {0, 1}n. This requires no communication.
2. Alice sends the two random inner products ⟨x, r1⟩ mod 2 and ⟨x, r2⟩ mod 2 to Bob. This requires two bits of communication.
3. Bob reports “1” if and only if his random inner products match those of Alice: ⟨y, ri⟩ = ⟨x, ri⟩ mod 2 for i = 1, 2. Note that Bob has all of the information needed to perform this computation.
We claim that the error of this protocol is at most 25% on every input. The protocol’s error is one-sided: when x = y the protocol always accepts, so there are no false negatives. Suppose that x ̸= y. We use the Principle of Deferred Decisions to argue that, for each i = 1, 2, the inner products ⟨y, ri⟩ and ⟨x, ri⟩ are different (mod 2) with probability exactly 50%. To see this, pick an index j where xj ̸= yj and condition on all of the bits of a random string except for the jth one. Let a and b denote the values of the inner products-so-far of x and y (modulo 2) with the random string. If the jth random bit is a 0, then the final inner products are also a and b. If the jth random bit is a 1, then one inner product stays the same while the other flips its value (since exactly one of xj, yj is a 1). Thus, whether a = b or a ̸= b, exactly one of the two random bit values (50% probability) results in the final two inner products having different values (modulo 2). The probability that two unequal strings have equal inner products (modulo 2) in two independent experiments is 25%. ∎
The proof of Theorem 3.1 gives a 2-bit protocol with (1-sided) error 25%. As usual, executing many parallel copies of the protocol reduces the error to an arbitrarily small constant, with a constant blow-up in the communication complexity.
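A sketch of the protocol in code, with the shared seed standing in for the public random bits (a simulation, since real parties would exchange only the two inner-product bits):

```python
import random

def equality_protocol(x, y, shared_seed, trials=2):
    """Public-coin one-way protocol for Equality (simulation sketch).
    Alice sends one inner product mod 2 per trial; Bob accepts iff
    all of his own inner products match."""
    rng = random.Random(shared_seed)     # both parties derive the same r's
    for _ in range(trials):
        r = [rng.randint(0, 1) for _ in range(len(x))]
        alice_bit = sum(xi * ri for xi, ri in zip(x, r)) % 2   # communicated
        bob_bit = sum(yi * ri for yi, ri in zip(y, r)) % 2
        if alice_bit != bob_bit:
            return 0                     # certainly x != y
    return 1                             # probably x == y

x = [1, 0, 1, 1]
assert equality_protocol(x, x, shared_seed=7) == 1   # never a false negative
# When x != y, each trial errs with probability 1/2, so two trials give
# (one-sided) error 1/4; more trials drive the error down geometrically.
```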
The protocol used to prove Theorem 3.1 makes crucial use of public coins. We’ll see later that the private-coin one-way randomized communication complexity is Θ(log n), which is worse than public-coin protocols but still radically better than deterministic protocols. More generally, next lecture we’ll prove Newman’s theorem, which states that the private-coin randomized communication complexity of a problem is at most O(logn) more than its public-coin randomized communication complexity.
The protocol in the proof of Theorem 3.1 effectively gives each of the two strings x, y a 2-bit “sketch” or “fingerprint” such that the property of distinctness is approximately preserved. Clearly, this is the same basic idea as hashing. This is a useful idea in both theory and practice, and we’ll use it again shortly.
Remark 3.2 The computational model studied in communication complexity is potentially very powerful — for example, Alice and Bob have unlimited computational power — and the primary point of the model is to prove lower bounds. Thus, whenever you see an upper bound result in communication complexity, like Theorem 3.1, it’s worth asking what the

point of the result is. In many cases, a positive result is really more of a “negative negative result,” intended to prove the tightness of a lower bound rather than offer a practical solution to a real problem. In other cases, the main point is to demonstrate separations between different notions of communication complexity or between different problems. For example, Theorem 3.1 shows that the one-way deterministic and randomized communication complexity of a problem can be radically different, even if we insist on one-sided error. It also shows that the randomized communication complexity of Equality is very different from that of the problems we studied last lecture: Disjointness, Index, and Gap-Hamming.
Theorem 3.1 also uses a quite reasonable protocol, which is not far from a practical solution to probabilistic equality-testing. In some applications, the public coins can be replaced by a hash function that is published in advance; in other applications, one party can choose a random hash function that can be specified with a reasonable number of bits and communicate it to other parties.
3.2 Sparse Recovery

3.2.1 The Basic Setup
The field of sparse recovery has been a ridiculously hot area for the past ten years, in applied mathematics, machine learning, and theoretical computer science. We’ll study sparse recovery in the standard setup of “compressive sensing” (also called “compressed sensing”). There is an unknown “signal” — i.e., a real-valued vector x ∈ Rn — that we want to learn. The bad news is that we’re only allowed to access the signal through “linear measurements;” the good news is that we have the freedom to choose whatever measurements we want. Mathematically, we want to design a matrix A ∈ Rm×n, with m as small as possible, such that we can recover the unknown signal x from the linear measurements Ax
(whatever x may be).
As currently stated, this is a boring problem. It is clear that n measurements are sufficient — just take A = I, or any other invertible n × n matrix. It is also clear that n measurements are necessary: if m < n, then for each possible measurement outcome there is an entire affine subspace of dimension n − m of vectors with that same image under A, and we have no way to know which one of them is the actual unknown signal. The problem becomes interesting when we also assume that the unknown signal x is “sparse,” in senses we define shortly. The hope is that under the additional promise that x is sparse, we can get away with much fewer than n measurements (Figure 3.1).

Figure 3.1 The basic compressive sensing setup. The goal is to design a matrix A such that an unknown sparse signal x can be recovered from the linear measurements Ax.

3.2.2 A Toy Version

To develop intuition for this problem, let’s explore a toy version. Suppose you are promised that x is a 0-1 vector with exactly k 1’s (and hence n − k 0’s). Throughout this lecture, k is the parameter that measures the sparsity of the unknown signal x — it could be anything, but you might want to keep k ≈ √n in mind as a canonical parameter value. Let X denote the set of all such k-sparse 0-1 vectors, and note that |X| = C(n, k), the binomial coefficient.

Here’s a solution to the sparse recovery problem under the guarantee that x ∈ X. Take m = 3 log2 |X|, and choose each of the m rows of the sensing matrix A independently and uniformly at random from {0, 1}^n. By the fingerprinting argument used in our randomized protocol for Equality (Theorem 3.1), for fixed distinct x, x′ ∈ X, we have

Pr_A[Ax = Ax′ mod 2] = 1/2^m = 1/|X|^3.

Of course, the probability that Ax = Ax′ (not modulo 2) is only less. Taking a union bound over the at most |X|^2 distinct pairs x, x′ ∈ X, we have

Pr_A[there exists x ̸= x′ s.t. Ax = Ax′] ≤ 1/|X|.

Thus, there is a matrix A that maps all x ∈ X to distinct m-vectors. (Indeed, a random A works with high probability.)
Thus, given Ax, one can recover x — if nothing else, one can just compute Ax′ for every x′ ∈ X until a match is found.4

4We won’t focus on computational efficiency in this lecture, but positive results in compressive sensing generally also have computationally efficient recovery algorithms. Our lower bounds will hold even for recovery algorithms with unbounded computational power.

The point is that m = Θ(log |X|) measurements are sufficient to recover exactly k-sparse 0-1 vectors. Recalling that |X| = C(n, k) and that

(n/k)^k ≤ C(n, k) ≤ (en/k)^k

(the first inequality is easy; the second follows from Stirling’s approximation), we see that m = Θ(k log(n/k)) rows suffice.5 This exercise shows that, at least for the special case of exactly sparse 0-1 signals, we can indeed achieve recovery with far fewer than n measurements. For example, if k ≈ √n, we need only O(√n log n) measurements.

Now that we have a proof of concept that recovering an unknown sparse vector is an interesting problem, we’d like to do better in two senses. First, we want to move beyond the toy version of the problem to the “real” version of the problem. Second, we have to wonder whether even fewer measurements suffice.
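The toy scheme can be simulated directly at small scale. In this sketch the parameters are tiny and the decoder is the brute-force search described above; recovery fails only if the random A collides on the unknown x, which happens with probability at most |X|/2^m:

```python
import itertools, random

# Toy sparse recovery: recover an exactly k-sparse 0-1 vector from
# m ~ 3*log2|X| random measurements mod 2. (Illustrative sketch.)
n, k = 12, 2
X = [tuple(1 if j in S else 0 for j in range(n))
     for S in itertools.combinations(range(n), k)]    # |X| = C(12, 2) = 66

rng = random.Random(0)
m = 21                                   # ~ 3 * log2(66)
A = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]

def sketch(v):
    """The measurement Av, taken mod 2 as in the fingerprinting argument."""
    return tuple(sum(aj * vj for aj, vj in zip(row, v)) % 2 for row in A)

x = X[17]                                # the unknown signal
decoded = next(v for v in X if sketch(v) == sketch(x))
assert decoded == x    # fails only if A collides on x: prob. <= 65/2^21
```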
For example, one can store random linear combinations of light intensities (implemented via suitable mirrors) rather than the light intensities themselves. This idea leads to a reduction in the number of pixels needed to capture an image at a given resolution.

Another application of compressive sensing is in MRI. Here, decreasing the number of measurements decreases the time necessary for a scan. Since a patient needs to stay motionless during a scan — in some cases, not even breathing — shorter scan times can be a pretty big deal.

3.2.4 The Real Problem

If the unknown signal x is an image, say, there’s no way it’s an exactly sparse 0-1 vector. We need to consider more general unknown signals x that are real-valued and only “approximately sparse.” To measure approximate sparsity, with respect to a choice of k, we define the residual res(x) of a vector x ∈ Rn as the contribution to x’s l1 norm by its n − k coordinates with smallest magnitudes. Recall that the l1 norm is just ∥x∥1 = Σ_{i=1}^n |xi|. If we imagine sorting the coordinates by |xi|, then the residual of x is just the sum of the |xi|’s of the final n − k terms. If x is exactly k-sparse, it has at least n − k zeros and hence res(x) = 0.

The goal is to design a sensing matrix A with a small number m of rows such that an unknown approximately sparse vector x can be recovered from Ax. Or rather, given that x is only approximately sparse, we want to recover a close approximation of x. The formal guarantee we’ll seek for the matrix A is the following: for every x ∈ Rn, we can compute from Ax a vector x′ such that

∥x′ − x∥1 ≤ c · res(x).    (3.1)

Here c is a constant, like 2. The guarantee (3.1) is very compelling. First, if x is exactly k-sparse, then res(x) = 0 and so (3.1) demands exact recovery of x.

5If you read through the compressive sensing literature, you’ll be plagued by ubiquitous “k log(n/k)” terms — remember this is just ≈ log C(n, k).
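In code, the residual is a two-liner (an illustrative sketch of the definition, nothing more):

```python
def res(x, k):
    """res(x): contribution to the l1 norm from the n-k coordinates
    of smallest magnitude (the definition in the text)."""
    return sum(sorted(abs(v) for v in x)[:len(x) - k])

assert res([5, 1, -3, 2, 0], k=2) == 0 + 1 + 2   # 5 and -3 are excluded
assert res([7, 0, -4, 0, 0], k=2) == 0           # exactly 2-sparse
```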
Second, the guarantee is parameterized by how close x is to being sparse — the recovered vector x′ should lie in a ball (in the l1 norm) around x, and the further x is from being k-sparse, the bigger this ball is.

Intuitively, the radius of this ball has to depend on something like res(x). For example, suppose that x′ is exactly k-sparse (with res(x′) = 0). The guarantee (3.1) forces the recovery algorithm to return x′ for every unknown signal x with Ax = Ax′. Recall that when m < n, there is an (n − m)-dimensional subspace of such signals x. In the extreme case, x is x′ with a little noise added to its zero coordinates, and the recovery algorithm is forced to return a solution x′ with ∥x′ − x∥1 = res(x).

Now that we’ve defined the real version of the problem, is there an interesting solution? Happily, the real version can be solved as well as the toy version.

Fact 3.3 (Candes et al. 2006; Donoho 2006) With high probability, a random m × n matrix A with m = Θ(k log(n/k)) rows admits a recovery algorithm that satisfies the guarantee (3.1) for every x ∈ Rn.

Fact 3.3 is non-trivial and well outside the scope of this course. The fact is true for several different distributions over matrices, including the case where each matrix entry is an independent standard Gaussian. Also, there are computationally efficient recovery algorithms that achieve the guarantee.6

6See e.g. Moitra (2014) for an introduction to such positive results about sparse recovery. Lecture #9 of the instructor’s CS264 course also gives a brief overview.

3.3 A Lower Bound for Sparse Recovery

3.3.1 Context

At the end of Section 3.2.2, when we asked “can we do better?,” we meant this in two senses. First, can we extend the positive results for the toy problem to a more general problem? Fact 3.3 provides a resounding affirmative answer. Second, can we get away with an even smaller value of m — even fewer measurements?
In this section we prove a relatively recent result of Do Ba et al. (2010), who showed that the answer is "no."7 Amazingly, they proved this fundamental result via a reduction to a lower bound in communication complexity (for Index).

Theorem 3.4 (Do Ba et al. 2010) If an m × n matrix A admits a recovery algorithm R that, for every x ∈ Rn, computes from Ax a vector x′ that satisfies (3.1), then m = Ω(k log(n/k)).

The lower bound is information-theoretic, and therefore applies also to recovery algorithms with unlimited computational power. Note that there is an easy lower bound of m ≥ k.8 Theorem 3.4 offers a non-trivial improvement when k = o(n); in this case, we're asking whether or not it's possible to shave off the log factor in Fact 3.3. Given how fundamental the problem is, and how much a non-trivial improvement could potentially matter in practice, this question is well worth asking.

3.3.2 Proof of Theorem 3.4: First Attempt

Recall that we first proved our upper bound of m = O(k log(n/k)) in the toy setting of Section 3.2.2, and then stated in Fact 3.3 that it can be extended to the general version of the problem. Let's first try to prove a matching lower bound on m that applies even in the toy setting. Recall that X denotes the set of all 0-1 vectors that have exactly k 1's, and that |X| = (n choose k).

For vectors x ∈ X, the guarantee (3.1) demands exact recovery. Thus, the sensing matrix A that we pick has to satisfy Ax ̸= Ax′ for all distinct x, x′ ∈ X. That is, Ax encodes x for all x ∈ X. But X has (n choose k) members, so the worst-case encoding length of Ax has to be at least log2 (n choose k) = Θ(k log(n/k)). So are we done?

The issue is that we want a lower bound on the number of rows m of A, not on the worst-case length of Ax in bits. What is the relationship between these two quantities? Note that, even if A is a 0-1 matrix, then each entry of Ax is generally of magnitude Θ(k), requiring Θ(log k) bits to write down.
For example, when k is polynomial in n (like our running choice k = √n), then Ax generally requires Ω(m log n) bits to describe, even when A is a 0-1 matrix. Thus our lower bound of Θ(k log(n/k)) on the length of Ax does not yield a lower bound on m better than the trivial lower bound of k.9

7Such a lower bound was previously known for various special cases — for particular classes of matrices, for particular families of recovery algorithms, etc.

8For example, consider just the subset of vectors x that are zero in the final n − k coordinates. The guarantee (3.1) demands exact recovery of all such vectors. Thus, we're back to the boring version of the problem mentioned at the beginning of the lecture, and A has to have rank at least k.

9This argument does imply that m = Ω(k log(n/k)) if we only report Ax modulo 2 (or some other constant), since in this case the length of Ax is Θ(m).

The argument above was doomed to fail. The reason is that, if you only care about recovering exactly k-sparse vectors x — rather than the more general and robust guarantee in (3.1) — then m = 2k suffices! One proof is via "Prony's Method," which uses the fact that a k-sparse vector x can be recovered exactly from its first 2k Fourier coefficients (see e.g. Moitra (2014)).10 Our argument above only invoked the requirement (3.1) for k-sparse vectors x, and such an argument cannot prove a lower bound of the form m = Ω(k log(n/k)).

The first take-away from this exercise is that, to prove the lower bound that we want, we need to use the fact that the matrix A satisfies the guarantee (3.1) also for non-k-sparse vectors x. The second take-away is that we need a smarter argument — a straightforward application of the Pigeonhole Principle is not going to cut it.

3.3.3 Proof of Theorem 3.4

A Communication Complexity Perspective

We can interpret the failed proof attempt in Section 3.3.2 as an attempted reduction from a "promise version" of Index.
Recall that in this communication problem, Alice has an input x ∈ {0,1}n, Bob has an index i ∈ {1,2,...,n}, specified using ≈ log2 n bits, and the goal is to compute xi using a one-way protocol. We showed last lecture that the deterministic communication complexity of this problem is n (via an easy counting argument) and its randomized communication complexity is Ω(n) (via a harder counting argument).

The previous proof attempt can be rephrased as follows. Consider a matrix A that permits exact recovery for all x ∈ X. This induces a one-way protocol for solving Index whenever Alice's input x lies in X — Alice simply sends Ax to Bob, Bob recovers x, and Bob can then solve the problem, whatever his index i might be. The communication cost of this protocol is exactly the length of Ax, in bits. The deterministic one-way communication complexity of this promise version of Index is k log(n/k), by the usual counting argument, so this lower bound applies to the length of Ax.

The Plan

How can we push this idea further? We begin by assuming the existence of an m × n matrix A, a recovery algorithm R, and a constant c ≥ 1 such that, for every x ∈ Rn, R computes from Ax a vector x′ ∈ Rn that satisfies

∥x′ − x∥1 ≤ c · res(x).     (3.2)

Our goal is to show that if m ≪ k log(n/k), then we can solve Index with sublinear communication.

10This method uses a sensing matrix A for which Ax will generally have Ω(m log n) = Ω(k log n) bits, so this does not contradict our lower bound on the necessary length of Ax.

For simplicity, we assume that the recovery algorithm R is deterministic. The lower bound continues to hold for randomized recovery algorithms that have success probability at least 2/3, but the proof requires more work; see Section 3.3.4.

An Aside on Bit Complexity

We can assume that the sensing matrix A has reasonable bit complexity.
Roughly: (i) we can assume without loss of generality that A has orthonormal rows (by a change of basis argument); (ii) dropping all but the O(log n) highest-order bits of every entry has negligible effect on the recovery algorithm R. We leave the details to the Exercises. When every entry of the matrix A and of a vector x can be described in O(log n) bits — equivalently, by scaling, is a polynomially-bounded integer — the same is true of Ax. In this case, Ax has length O(m log n).

Redefining X

It will be useful later to redefine the set X. Previously, X was all 0-1 n-vectors with exactly k 1's. Now it will be a subset of such vectors, subject to the constraint that dH(x,x′) ≥ .1k for every distinct pair x, x′ of vectors in the set. Recall that dH(·,·) denotes the Hamming distance between two vectors — the number of coordinates in which they differ. Two distinct 0-1 vectors with k 1's each have Hamming distance between 2 and 2k, so we're restricting to a subset of such vectors that have mostly disjoint supports. The following lemma says that there exist sets of such vectors with size not too much smaller than the number of all 0-1 vectors with k 1's; we leave the proof to the Exercises.

Lemma 3.5 Suppose k ≤ .01n. There is a set X of 0-1 n-bit vectors such that each x ∈ X has k 1's, each distinct x, x′ ∈ X satisfy dH(x,x′) ≥ .1k, and log2 |X| = Ω(k log(n/k)).

Lemma 3.5 is reminiscent of the fact that there are large error-correcting codes with large distance. One way to prove Lemma 3.5 is via the probabilistic method — by showing that a suitable randomized experiment yields a set with the desired size with positive probability.

Intuitively, our proof attempt in Section 3.3.2 used that, because each x ∈ X is exactly k-sparse, it can be recovered exactly from Ax and thus there is no way to get confused between distinct elements of X from their images.
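As an aside, the flavor of the randomized experiment behind Lemma 3.5 can be seen in a small rejection-sampling sketch (ours, for illustration only; the actual proof is left to the Exercises): keep a random k-sparse 0-1 vector only if it is far from everything kept so far.

```python
import random
from itertools import combinations

def sample_far_set(n, k, trials=500, seed=0):
    """Collect k-sparse 0-1 vectors (represented by their supports) whose
    pairwise Hamming distance is at least .1k, by rejection sampling."""
    rng = random.Random(seed)
    kept = []
    for _ in range(trials):
        s = frozenset(rng.sample(range(n), k))
        # Hamming distance of two 0-1 indicator vectors = size of the
        # symmetric difference of their supports.
        if all(len(s ^ t) >= 0.1 * k for t in kept):
            kept.append(s)
    return kept

X = sample_far_set(n=200, k=40)
assert all(len(s ^ t) >= 4 for s, t in combinations(X, 2))
print(len(X))  # nearly all samples survive: random supports are far apart
```

The observation driving the lemma is visible in the run above: two independent random k-subsets of {1,...,n} typically overlap in only ≈ k²/n coordinates, so almost every sample clears the .1k distance threshold.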
In the following proof, we'll instead need to recover "noisy" versions of the x's — recall that our proof cannot rely only on the fact that the matrix A performs exact recovery of exactly sparse vectors. This means we might get confused between two different 0-1 vectors x, x′ that have small (but non-zero) Hamming distance. The above redefinition of X, which is essentially costless by Lemma 3.5, fixes the issue by restricting to vectors that all look very different from one another.

The Reduction

To obtain a lower bound of m = Ω(k log(n/k)) rather than m = Ω(k), we need a more sophisticated reduction than in Section 3.3.2.11 The parameters offer some strong clues as to what the reduction should look like. If we use a protocol where Alice sends Ay to Bob, where A is the assumed sensing matrix with a small number m of rows and y is some vector of Alice's choosing — perhaps a noisy version of some x ∈ X — then the communication cost will be O(m log n). We want to prove that m = Ω(k log(n/k)) = Ω(log |X|) (using Lemma 3.5). This means we need a communication lower bound of Ω(log |X| log n). Since the communication lower bound for Index is linear, this suggests considering inputs to Index where Alice's input has length log |X| log n, and Bob is given an index i ∈ {1, 2, ..., log |X| log n}.

To describe the reduction formally, let enc : X → {0,1}^{log2 |X|} be a binary encoding of the vectors of X. (We can assume that |X| is a power of 2.) Here is the communication protocol for solving Index.

(1) Alice interprets her (log |X| log n)-bit input as log n blocks of log |X| bits each. For each j = 1, 2, ..., log n, she interprets the bits of the jth block as enc(xj) for some xj ∈ X. See Figure 3.2.

(2) Alice computes a suitable linear combination of the xj's:

y = Σ_{j=1}^{log n} α^j xj,

with the details provided below. Each entry of y will be a polynomially-bounded integer.

(3) Alice sends Ay to Bob. This uses O(m log n) bits of communication.
(4) Bob uses Ay and the assumed recovery algorithm R to recover all of x1, ..., xlog n. (Details TBA.)

(5) Bob identifies the block j of log |X| bits that contains his given index i ∈ {1, 2, ..., log |X| log n}, and outputs the relevant bit of enc(xj).

If we can implement steps (2) and (4), then we're done: this five-step deterministic protocol would solve Index on (log |X| log n)-bit inputs using O(m log n) communication. Since the communication complexity of Index is linear, we conclude that m = Ω(log |X|) = Ω(k log(n/k)).12

11We assume from now on that k = o(n), since otherwise there is nothing to prove.

12Thus even the deterministic communication lower bound for Index, which is near-trivial to prove (Lectures 1–2), has very interesting implications. See Section 3.3.4 for the stronger implications provided by the randomized communication complexity lower bound.

Figure 3.2 In the reduction, Alice interprets her (log |X| log n)-bit input as log n blocks, with each block encoding some vector xj ∈ X.

For step (2), let α ≥ 2 be a sufficiently large constant, depending on the constant c in the recovery guarantee (3.2) that the matrix A and algorithm R satisfy. Then

y = Σ_{j=1}^{log n} α^j xj.     (3.3)

Recall that each xj is a 0-1 n-vector with exactly k 1's, so y is just a superposition of log n scaled copies of such vectors. For example, the l1 norm of y is simply k Σ_{j=1}^{log n} α^j. In particular, since α is a constant, the entries of y are polynomially bounded non-negative integers, as promised earlier, and Ay can be described using O(m log n) bits.

Bob's Recovery Algorithm

The interesting part of the protocol and analysis is step (4), where Bob wants to recover the vectors x1, ..., xlog n encoded by Alice's input using only knowledge of Ay and black-box access to the recovery algorithm R.
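Before turning to Bob's side, note that Alice's steps (1)–(3) are purely mechanical. A sketch (the names, and the identity "sensing matrix" used in the demo, are ours for illustration):

```python
def alice_message(blocks, A, alpha=2):
    """Steps (1)-(3): blocks is Alice's input already decoded into the
    vectors x_1, ..., x_{log n} (each a 0-1 list of length n).
    Returns Ay, where y = sum_j alpha^j * x_j as in (3.3)."""
    n = len(blocks[0])
    # Step (2): superpose scaled copies of the x_j's.
    y = [sum(alpha ** (j + 1) * x[i] for j, x in enumerate(blocks))
         for i in range(n)]
    # Step (3): send Ay (here A is given as a list of rows).
    return [sum(a_i * y_i for a_i, y_i in zip(row, y)) for row in A]

# Toy run: n = 4, two blocks, and an identity "sensing matrix" so the
# message is y itself.
x1, x2 = [1, 0, 1, 0], [0, 1, 1, 0]
I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
print(alice_message([x1, x2], I4))  # y = 2*x1 + 4*x2 = [2, 4, 6, 0]
```

Note how the geometrically growing coefficients keep the blocks separated in y: the entry 6 decomposes uniquely as 2 + 4, so no two choices of blocks collide.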
To get started, we explain how Bob can recover the last vector xlog n, which suffices to solve Index in the lucky case where Bob's index i is one of the last log2 |X| positions. Intuitively, this is the easiest case, since xlog n is by far (for large α) the largest contributor to the vector y Alice computes in (3.3). With y = α^{log n} xlog n + noise, we might hope that the recovery algorithm can extract α^{log n} xlog n from Ay.

Bob's first step is the only thing he can do: invoke the recovery algorithm R on the message Ay from Alice. By assumption, R returns a vector yˆ satisfying

∥yˆ − y∥1 ≤ c · res(y),

where res(y) is the contribution to y's l1 norm by its smallest n − k coordinates. Bob then computes yˆ's nearest neighbor x∗ in a scaled version of X under the l1 norm — argmin_{x ∈ X} ∥yˆ − α^{log n} x∥1. Bob can do this computation by brute force.

The key claim is that the computed vector x∗ is indeed xlog n. This follows from a geometric argument, pictured in Figure 3.3. Briefly: (i) because α is large, y and α^{log n} xlog n are close to each other; (ii) since y is approximately k-sparse (by (i)) and since R satisfies the approximate recovery guarantee in (3.2), yˆ and y are close to each other and hence α^{log n} xlog n and yˆ are close; and (iii) since distinct vectors in X have large Hamming distance, for every x ∈ X other than xlog n, α^{log n} x is far from α^{log n} xlog n and hence also far from yˆ. We conclude that α^{log n} xlog n is closer to yˆ than any other scaled vector from X.

Figure 3.3 The triangle inequality implies that the vector x∗ computed by Bob from yˆ must be xlog n.

We now supply the details.

(i) Recall that y is the superposition (i.e., sum) of scaled versions of the vectors x1, ..., xlog n. y − α^{log n} xlog n is just y with the last contributor omitted. Thus,

∥y − α^{log n} xlog n∥1 = Σ_{j=1}^{log n − 1} α^j k.

Assuming that α ≥ max{2, 200c}, where c is the constant that A and R satisfy in (3.2), we can bound from above the geometric sum and derive

∥y − α^{log n} xlog n∥1 ≤ (.01/c) · k α^{log n}.     (3.4)

(ii) By considering the n − k coordinates of y other than the k that are non-zero in xlog n, we can upper bound the residual res(y) by the contributions to the l1 weight by α x1, ..., α^{log n − 1} xlog n−1. The sum of these contributions is k Σ_{j=1}^{log n − 1} α^j, which we already bounded above in (3.4). In light of the recovery guarantee (3.2), we have

∥yˆ − y∥1 ≤ .01 k α^{log n}.     (3.5)

(iii) Let x, x′ ∈ X be distinct. By the definition of X, dH(x, x′) ≥ .1k. Since x, x′ are both 0-1 vectors,

∥α^{log n} x − α^{log n} x′∥1 ≥ .1 k α^{log n}.     (3.6)

Combining (3.4) and (3.5) with the triangle inequality, we have

∥yˆ − α^{log n} xlog n∥1 ≤ .02 k α^{log n}.     (3.7)

Meanwhile, for every other x ∈ X, combining (3.6) and (3.7) with the triangle inequality gives

∥yˆ − α^{log n} x∥1 ≥ .08 k α^{log n}.     (3.8)

Inequalities (3.7) and (3.8) imply that Bob's nearest-neighbor computation will indeed recover xlog n.

You'd be right to regard the analysis so far with skepticism. The same reason that Bob can recover xlog n — because y is just a scaled version of xlog n, plus some noise — suggests that Bob should not be able to recover the other xj's, and hence be unable to solve Index for indices i outside of the last block of Alice's input. The key observation is that, after recovering xlog n, Bob can "subtract it out" without any further communication from Alice and then recover xlog n−1. Iterating this argument allows Bob to reconstruct all of x1, ..., xlog n and hence solve Index, no matter what his index i is.

In more detail, suppose Bob has already reconstructed xj+1, ..., xlog n. He's then in a position to form

z = α^{j+1} xj+1 + ··· + α^{log n} xlog n.     (3.9)

Then, y − z equals the first j contributors to y — subtracting z undoes the last log n − j of them — and is therefore just a scaled version of xj, plus some relatively small noise (the scaled xl's with l < j). This raises the hope that Bob can recover a scaled version of xj from A(y − z). How can Bob get his hands on the latter vector? (He doesn't know y, and we don't want Alice to send any more bits to Bob.) The trick is to use the linearity of A — Bob knows A and z and hence can compute Az, Alice has already sent him Ay, so Bob just computes Ay − Az = A(y − z)!

After computing A(y − z), Bob invokes the recovery algorithm R to obtain a vector wˆ ∈ Rn that satisfies

∥wˆ − (y − z)∥1 ≤ c · res(y − z),

and computes (by brute force) the vector x ∈ X minimizing ∥wˆ − α^j x∥1. The minimizing vector is xj — the reader should check that the proof of this is word-for-word the same as our recovery proof for xlog n, with every "log n" replaced by "j," every "y" replaced by "y − z," and every "yˆ" replaced by "wˆ."

This completes the implementation of step (4) of the protocol, and hence of the reduction from Index to the design of a sensing matrix A and recovery algorithm R that satisfy (3.2). We conclude that A must have m = Ω(k log(n/k)), which completes the proof of Theorem 3.4.

3.3.4 Lower Bounds for Randomized Recovery

We proved the lower bound in Theorem 3.4 only for fixed matrices A and deterministic recovery algorithms R. This is arguably the most relevant case, but it's also worth asking whether or not better positive results (i.e., fewer rows) are possible for a randomized recovery requirement, where recovery can fail with constant probability. Superficially, the randomization can come from two sources: first, one can use a distribution over matrices A; second, the recovery algorithm (given Ax) can be randomized.
Since we're not worrying about computational efficiency, we can assume without loss of generality that R is deterministic — a randomized recovery algorithm can be derandomized just by enumerating over all of its possible executions and taking the majority vote. Formally, the relaxed requirement for a positive result is: there exists a constant c ≥ 1, a distribution D over m × n matrices A, and a (deterministic) recovery algorithm R such that, for every x ∈ Rn, with probability at least 2/3 (over the choice of A), R returns a vector x′ ∈ Rn that satisfies ∥x′ − x∥1 ≤ c · res(x). The lower bound in Theorem 3.4 applies even to such randomized solutions.

The obvious idea is to follow the proof of Theorem 3.4 to show that a randomized recovery guarantee yields a randomized protocol for Index. Since even randomized protocols for the latter problem require linear communication, this would imply the desired lower bound. The first attempt at modifying the proof of Theorem 3.4 has Alice and Bob using the public coins to pick a sensing matrix A at random from the assumed distribution — thus, A is known to both Alice and Bob with no communication. Given the choice of A, Alice sends Ay to Bob as before and Bob runs the assumed recovery algorithm R. With probability at least 2/3, the result is a vector yˆ from which Bob can recover xlog n. The issue is that Bob has to run R on log n different inputs, once to recover each of x1, ..., xlog n, and there is a failure probability of 1/3 each time.

The obvious fix is to reduce the failure probability by independent trials. So Alice and Bob use the public coins to pick l = Θ(log log n) matrices A1, ..., Al independently from the assumed distribution. Alice sends A1y, ..., Aly to Bob, and Bob runs the recovery algorithm R on each of them and computes the corresponding vectors x1, ..., xl ∈ X. Except with probability O(1/log n), a majority of the xj's will be the vector xlog n.
By a union bound over the log n iterations of Bob's recovery algorithm, Bob successfully reconstructs each of x1, ..., xlog n with probability at least 2/3, completing the randomized protocol for Index. This protocol has communication cost O(m log n log log n) and the lower bound for Index is Ω(log |X| log n), which yields a lower bound of m = Ω(k log(n/k) / log log n) for randomized recovery.

The reduction in the proof of Theorem 3.4 can be modified in a different way to avoid the log log n factor and prove the same lower bound of m = Ω(k log(n/k)) that we established for deterministic recovery. The trick is to modify the problem being reduced from (previously Index) so that it becomes easier — and therefore solvable assuming only a randomized recovery guarantee — subject to its randomized communication complexity remaining linear.

The modified problem is called Augmented Index — it's a contrived problem but has proved technically convenient in several applications. Alice gets an input x ∈ {0,1}l. Bob gets an index i ∈ {1, 2, ..., l} and also the subsequent bits xi+1, ..., xl of Alice's input. This problem is obviously only easier than Index, but it's easy to show that its deterministic one-way communication complexity is l (see the Exercises). With some work, it can be shown that its randomized one-way communication complexity is Ω(l) (see Bar-Yossef et al. (2004); Do Ba et al. (2010); Miltersen et al. (1998)).

The reduction in Theorem 3.4 is easily adapted to show that a randomized approximate sparse recovery guarantee with matrices with m rows yields a randomized one-way communication protocol for Augmented Index with (log |X| log n)-bit inputs with communication cost O(m log n) (and hence m = Ω(log |X|) = Ω(k log(n/k))). We interpret Alice's input in the same way as before. Alice and Bob use the public coins to pick a matrix A from the assumed distribution and Alice sends Ay to Bob.
Bob is given an index i ∈ {1, 2, ..., log |X| log n} that belongs to some block j. Bob is also given all bits of Alice's input after the ith one. These bits include enc(xj+1), ..., enc(xlog n), so Bob can simply compute z as in (3.9) (with no error) and invoke the recovery algorithm R (once). Whenever A is such that the guarantee (3.2) holds for y − z, Bob will successfully reconstruct xj and therefore compute the correct answer to Augmented Index.

3.3.5 Digression

One could ask if communication complexity is "really needed" for this proof of Theorem 3.4. Churlish observers might complain that, due to the nature of communication complexity lower bounds (like those last lecture), this proof of Theorem 3.4 is "just counting."13 While not wrong, this attitude is counterproductive. The fact is that adopting the language and mindset of communication complexity has permitted researchers to prove results that had previously eluded many smart people — in this case, the ends justify the means.

The biggest advantage of using the language of communication complexity is that one naturally thinks in terms of reductions between different lower bounds.14 Reductions can repurpose a single counting argument, like our lower bound for Index, for lots of different problems. Many of the more important lower bounds derived from communication complexity, including today's main result, involve quite clever reductions, and it would be difficult to devise from scratch the corresponding counting arguments.

13Critics said (and perhaps still say) the same thing about the probabilistic method (Alon and Spencer, 2008).

14Thinking in terms of reductions seems to be a "competitive advantage" of theoretical computer scientists — there are several examples where a reduction-based approach yielded new progress on old and seemingly intractable mathematical problems.
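To close the lecture, the peeling recovery at the heart of the reduction (Section 3.3.3) can be simulated end-to-end on toy data. Everything in this sketch is illustrative: a tiny hand-picked X, an identity sensing matrix (so Ay is y itself), and an idealized stand-in for R that simply keeps the k largest coordinates.

```python
# Toy simulation of the peeling reduction of Section 3.3.3.
ALPHA, K, N = 4, 2, 6
# A small X: 2-sparse 0-1 vectors with pairwise-disjoint supports.
X = [(1, 1, 0, 0, 0, 0), (0, 0, 1, 1, 0, 0), (0, 0, 0, 0, 1, 1)]

def recover(v, k=K):
    """Idealized recovery oracle: zero out all but the k largest entries."""
    keep = sorted(range(len(v)), key=lambda i: -abs(v[i]))[:k]
    return [v[i] if i in keep else 0 for i in range(len(v))]

def bob_peel(Ay, num_blocks):
    """Recover x_{num_blocks}, ..., x_1 by repeated subtract-and-recover.
    With an identity A, Ay is y itself and A(y - z) = Ay - Az."""
    recovered, residue = {}, list(Ay)
    for j in range(num_blocks, 0, -1):
        w = recover(residue)                    # run R on A(y - z)
        # Brute-force nearest neighbor in the scaled copy of X:
        xj = min(X, key=lambda x: sum(abs(wi - ALPHA ** j * xi)
                                      for wi, xi in zip(w, x)))
        recovered[j] = xj
        # "Subtract out" the recovered block, using linearity of A:
        residue = [r - ALPHA ** j * xi for r, xi in zip(residue, xj)]
    return recovered

# Alice's blocks and her message y = sum_j ALPHA^j x_j:
blocks = {1: X[2], 2: X[0], 3: X[1]}
y = [sum(ALPHA ** j * x[i] for j, x in blocks.items()) for i in range(N)]
assert bob_peel(y, 3) == blocks  # Bob reconstructs every block
```

The geometric weighting does the work here, just as in the analysis: at each stage the current block dominates the residue, so the idealized oracle plus a nearest-neighbor search in X identifies it exactly.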
Lecture 4

Boot Camp on Communication Complexity

4.1 Preamble

This lecture covers the most important basic facts about deterministic and randomized communication protocols in the general two-party model, as defined by Yao (1979). Some version of this lecture would normally be the first lecture in a course on communication complexity. How come it's the fourth one here?

The first three lectures were about one-way communication complexity — communication protocols where there is only one message, from Alice to Bob — and its applications. One reason we started with the one-way model is that several of the "greatest hits" of algorithmic lower bounds via communication complexity, such as space lower bounds for streaming algorithms and row lower bounds for compressive sensing matrices, already follow from communication lower bounds for one-way protocols. A second reason is that considering only one-way protocols is a gentle introduction to what communication protocols look like. There are already some non-trivial one-way protocols, like our randomized protocol for Equality. On the other hand, proving lower bounds for one-way protocols is much easier than proving them for general protocols, so it's also a good introduction to lower bound proofs.

The rest of our algorithmic applications require stronger lower bounds that apply to more than just one-way protocols. This lecture gives a "boot camp" on the basic model. We won't say much about applications in this lecture, but the final five lectures all focus on applications. We won't prove any hard results today, and focus instead on definitions and vocabulary, examples, and some easy results. One point of today's lecture is to get a feel for what's involved in proving a communication lower bound for general protocols.
It generally boils down to a conceptually simple, if sometimes mathematically challenging, combinatorial problem — proving that a large number of "rectangles" of a certain type are needed to cover a matrix.1

1There are many other methods for proving communication lower bounds, some quite deep and exotic (see e.g. Lee and Shraibman (2009)), but all of our algorithmic applications can ultimately be derived from combinatorial covering-type arguments. For example, we're not even going to mention the famous "rank lower bound." For your edification, some other lower bound methods are discussed in the Exercises.

4.2 Deterministic Protocols

4.2.1 Protocols

We are still in the two-party model, where Alice has an input x ∈ X unknown to Bob, and Bob has an input y ∈ Y unknown to Alice. (Most commonly, X = Y = {0,1}n.) A deterministic communication protocol specifies, as a function of the messages sent so far, whose turn it is to speak. A protocol always specifies when the communication has ended and, in each end state, the value of the computed bit. Alice and Bob can coordinate in advance to decide upon the protocol, and both are assumed to cooperate fully. The only constraint faced by the players is that what a player says can depend only on what the player knows — his or her own input, and the history of all messages sent so far.

As with one-way protocols, we define the cost of a protocol as the maximum number of bits it ever sends, ranging over all inputs. The communication complexity of a function is then the minimum communication cost of a protocol that correctly computes it.

The key feature of general communication protocols absent from the special case of one-way protocols is interaction between the two players. Intuitively, interaction should allow the players to communicate much more efficiently. Let's see this in a concrete example.
4.2.2 Example: Clique-Independent Set

The following problem might seem contrived, but it is fairly central in communication complexity. There is a graph G = (V, E) with |V| = n that is known to both players. Alice's private input is a clique C of G — a subset of vertices such that (u, v) ∈ E for every distinct u, v ∈ C. Bob's private input is an independent set I of G — a subset of vertices such that (u, v) ̸∈ E for every distinct u, v ∈ I. (There is no requirement that C or I is maximal.) Observe that C and I are either disjoint, or they intersect in a single vertex (Figure 4.1). The players' goal is to figure out which of these two possibilities is the case. Thus, this problem is a special case of Disjointness, where the players' sets are not arbitrary but rather a clique and an independent set from a known graph.

Figure 4.1 A clique C and an independent set I overlap in zero or one vertices.

The naive communication protocol for solving the problem uses Θ(n) bits — Alice can send the characteristic vector of C to Bob, or Bob the characteristic vector of I to Alice, and then the other player computes the correct answer. Since the number of cliques and independent sets of a graph is generally exponential in the number n of vertices, this protocol cannot be made significantly more communication-efficient via a smarter encoding. An easy reduction from Index shows that one-way protocols, including randomized protocols, require Ω(n) communication (exercise).

The players can do much better by interacting. Here is the protocol.

1. If there is a vertex v ∈ C with deg(v) < n/2, then Alice sends the name of an arbitrary such vertex to Bob (≈ log2 n bits).

a) Bob announces whether or not v ∈ I (1 bit). If so, the protocol terminates with conclusion "not disjoint."

b) Otherwise, Alice and Bob recurse on the subgraph H induced by v and its neighbors. [Note: H contains at most half the nodes of G. It contains all of C and, if I intersects C, it contains the vertex in their intersection. C and I intersect in G if and only if their projections to H intersect in H.]

2. Otherwise, Alice sends a "NULL" message to Bob (≈ log2 n bits).

3. If there is a vertex v ∈ I with deg(v) ≥ n/2, then Bob sends the name of an arbitrary such vertex to Alice (≈ log2 n bits).

a) Alice announces whether or not v ∈ C (1 bit). If so, the protocol terminates ("not disjoint").

b) If not, Alice and Bob recurse on the subgraph H induced by v and its non-neighbors. [Note: H contains at most half the nodes of G. It contains all of I and, if C intersects I, it contains the vertex in their intersection. Thus the function's answer in H is the same as that in G.]

4. Otherwise, Bob terminates the protocol and declares "disjoint." [Disjointness is obvious since, at this point in the protocol, we know that deg(v) < n/2 for all v ∈ C and deg(v) ≥ n/2 for all v ∈ I.]

Since each iteration of the protocol uses O(log n) bits of communication and cuts the number of vertices of the graph in half (or terminates), the total communication is O(log² n). As previously noted, such a result is impossible without interaction between the players.

4.2.3 Trees and Matrices

The Clique-Independent Set problem clearly demonstrates that we need new lower bound techniques to handle general communication protocols — the straightforward Pigeonhole Principle arguments that worked for one-way protocols are not going to be good enough. At first blush this might seem intimidating — communication protocols can do all sorts of crazy things, so how can we reason about them in a principled way? How can we connect properties of a protocol to the function that it computes? Happily, we can quickly build up some powerful machinery for answering these questions.

First, we observe that deterministic communication protocols are really just binary trees.
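Before moving on, the recursive Clique-Independent Set protocol above can be transcribed into code. A sketch (ours, for illustration; one small safeguard is added: the announced vertex is dropped before recursing, since the other player has just ruled it out, which also guarantees termination):

```python
import math

def clique_is_protocol(adj, C, I):
    """Simulate the protocol; adj[v] is the set of v's neighbors.
    Returns (disjoint?, bits_used). Each player's decisions use only
    information that player would have; we just count the bits."""
    V = set(adj)
    bits = 0
    name = max(1, math.ceil(math.log2(len(V) + 1)))  # cost of a vertex name
    while True:
        n = len(V)
        low = [v for v in C & V if len(adj[v] & V) < n / 2]
        if low:                          # step 1: Alice names a low-degree v
            v = low[0]
            bits += name + 1
            if v in I:
                return False, bits       # step 1a: "not disjoint"
            V = adj[v] & V               # step 1b: recurse on v's neighbors
            continue
        bits += name                     # step 2: Alice sends "NULL"
        high = [v for v in I & V if len(adj[v] & V) >= n / 2]
        if high:                         # step 3: Bob names a high-degree v
            v = high[0]
            bits += name + 1
            if v in C:
                return False, bits       # step 3a: "not disjoint"
            V = V - adj[v] - {v}         # step 3b: recurse on non-neighbors
            continue
        return True, bits                # step 4: "disjoint"

star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(clique_is_protocol(star, {0, 1}, {2, 3}))  # disjoint
print(clique_is_protocol(star, {0, 1}, {1, 2}))  # intersect at vertex 1
```

Note that both recursion rules preserve the possible intersection vertex: in step 1b, any vertex of C ∩ I other than v is a neighbor of v (C is a clique), and in step 3b it is a non-neighbor of v (I is an independent set).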
We'll almost never use this fact directly, but it should build some confidence that protocols are familiar and elementary mathematical objects. The connection is easiest to see by example; see Figure 4.2. Consider the following protocol for solving Equality with n = 2 (i.e., f(x,y) = 1 if and only if x = y). Alice begins by sending her first bit. If Bob's first bit is different, he terminates the protocol and announces "not equal." If Bob's first bit is the same, then he transmits the same bit back. In this case, Alice then sends her second bit. At this point, Bob knows Alice's whole input and can therefore compute the correct answer.

In Figure 4.2, each node corresponds to a possible state of the protocol, and is labeled with the player whose turn it is to speak. Thus the labels alternate with the levels, with the root belonging to Alice.2 There are 10 leaves, representing the possible end states of the protocol. There are two leaves for the case where Alice and Bob have different first bits and the protocol terminates early, and eight leaves for the remaining cases where Alice and Bob have the same first bit. Note that the possible transcripts of the protocol are in one-to-one correspondence with the leaves of the tree — we use leaves and transcripts interchangeably below.

We can view the leaves as a partition {Z(l)} of the input space X × Y, with Z(l) the inputs (x, y) such that the protocol terminates in the leaf l. In our example, there are 10 leaves for the 16 possible inputs (x, y), so different inputs can generate the same transcript — more on this shortly.

Next note that we can represent a function (from (x, y) to {0, 1}) as a matrix. In contrast to the visualization exercise above, we'll use this matrix representation all the time. The rows are labeled with the set X of possible inputs of Alice, the columns with the set Y of possible inputs of Bob. Entry (x, y) of the matrix is f(x, y).
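Such matrices are easy to generate mechanically. A small sketch that builds the 4 × 4 matrices for Equality, Greater-Than, and Disjointness on 2-bit inputs (inputs ordered 00, 01, 10, 11, matching the displayed matrices):

```python
inputs = ["00", "01", "10", "11"]

def matrix(f):
    """Rows indexed by Alice's input x, columns by Bob's input y."""
    return [[f(x, y) for y in inputs] for x in inputs]

EQ = matrix(lambda x, y: int(x == y))
GT = matrix(lambda x, y: int(int(x, 2) >= int(y, 2)))    # x at least y
DISJ = matrix(lambda x, y: int(all(a == "0" or b == "0"  # disjoint supports
                                   for a, b in zip(x, y))))

for row in EQ:
    print(row)  # EQ is the 4x4 identity matrix
```

The same three-line recipe produces the matrix of any two-party function, which is all the machinery the rest of the lecture needs.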
Keep in mind that the matrix M(f) is fully known to both Alice and Bob when they agree on a protocol.

2 In general, players need not alternate turns in a communication protocol.

Boot Camp on Communication Complexity

Figure 4.2 The binary tree induced by a communication protocol for Equality with n = 2.

For example, suppose that X = Y = {0, 1}^2, resulting in 4 × 4 matrices. Equality then corresponds to the identity matrix:

         00  01  10  11
    00 [  1   0   0   0 ]
    01 [  0   1   0   0 ]                                    (4.1)
    10 [  0   0   1   0 ]
    11 [  0   0   0   1 ]

If we define the Greater-Than function as 1 whenever x is at least y (where x and y are interpreted as non-negative integers, written in binary), then we just fill in the lower triangle with 1s:

         00  01  10  11
    00 [  1   0   0   0 ]
    01 [  1   1   0   0 ]
    10 [  1   1   1   0 ]
    11 [  1   1   1   1 ]

We also write out the matrix for Disjointness, which is somewhat more inscrutable:

         00  01  10  11
    00 [  1   1   1   1 ]
    01 [  1   0   1   0 ]
    10 [  1   1   0   0 ]
    11 [  1   0   0   0 ]

4.2.4 Protocols and Rectangles

How can we reason about the behavior of a protocol? Just visualizing them as trees is not directly useful. We know that simple Pigeonhole Principle-based arguments are not strong enough, but it still feels like we want some kind of counting argument. To see what might be true, let's run the 2-bit Equality protocol depicted in Figure 4.2 and track its progress using the matrix in (4.1). Put yourself in the shoes of an outside observer, who knows neither x nor y, and makes inferences about (x, y) as the protocol proceeds. When the protocol terminates, we'll have carved up the matrix into 10 pieces, one for each leaf of the protocol tree — the protocol transcript reveals the leaf to an outside observer, but nothing more. Before the protocol begins, all 16 inputs are fair game.
After Alice sends her first bit, the outside observer can narrow down the possible inputs to a set of 8 — the top 8 if Alice sent a 0, the bottom 8 if she sent a 1. The next bit sent gives away whether Bob's first bit is a 0 or a 1, so the outside observer learns which quadrant the input lies in. Interestingly, in the northeastern and southwestern quadrants, all of the entries are 0. In these cases, even though ambiguity remains about exactly what the input (x, y) is, the function's value f(x, y) has been determined (it is 0, whatever the input). It's no coincidence that these two regions correspond to the two leaves of the protocol in Figure 4.2 that stop early, with the correct answer. If the protocol continues further, then Alice's second bit splits the northwestern and southeastern quadrants into two, and Bob's final bit splits them again, now into singleton regions. In these cases, an outside observer learns the entire input (x, y) from the protocol's transcript.3

What have we learned? We already knew that every protocol induces a partition of the input space X × Y , with one set for each leaf or, equivalently, for each distinct transcript. At least for the particular protocol that we just studied, each of the sets has a particularly nice submatrix form (Figure 4.3). This is true in general, in the following sense.

3 It's also interesting to do an analogous thought experiment from the perspective of one of the players. For example, consider Bob's perspective when the input is (00, 01). Initially Bob knows that the input lies in the second column but is unsure of the row. After Alice's first message, Bob knows that the input is in the second column and one of the first two rows. Bob still cannot be sure about the correct answer, so the protocol proceeds.

Figure 4.3 The partition of the input space X × Y according to the 10 different transcripts that can be generated by the Equality protocol.
Lemma 4.1 (Rectangles) For every transcript z of a deterministic protocol P, the set of inputs (x, y) that generate z is a rectangle, of the form A × B for A ⊆ X and B ⊆ Y .

A rectangle just means a subset of the input space X × Y that can be written as a product. For example, the set {(00, 00), (11, 11)} is not a rectangle, while the set {(00, 00), (11, 00), (00, 11), (11, 11)} is. In general, a subset S ⊆ X × Y is a rectangle if and only if it is closed under "mix and match," meaning that whenever (x1, y1) and (x2, y2) are in S, so are (x1, y2) and (x2, y1) (see the Exercises).

Don't be misled by our example (Figure 4.3), where the rectangles induced by our protocol happen to be "contiguous." For example, if we keep the protocol the same but switch the order in which we write down the rows and columns corresponding to 01 and 10, we get an analogous decomposition in which the two large rectangles are not contiguous. In general, you shouldn't even think of X and Y as ordered sets. Rectangles are sometimes called combinatorial rectangles to distinguish them from "geometric" rectangles and to emphasize this point.

Lemma 4.1 is extremely important, though its proof is straightforward — we just follow the protocol, as in our example above. Intuitively, each step of a protocol allows an outside observer to narrow down the possibilities for x while leaving the possibilities for y unchanged (if Alice speaks) or vice versa (if Bob speaks).

Proof of Lemma 4.1: Fix a deterministic protocol P. We proceed by induction on the number of bits exchanged. For the base case, all inputs X × Y begin with the empty transcript. For the inductive step, consider an arbitrary t-bit transcript-so-far z generated by P, with t ≥ 1. Assume that Alice was the most recent player to speak; the other case is analogous. Let z′ denote z with the final bit b ∈ {0, 1} lopped off. By the inductive hypothesis, the set of inputs that generate z′ has the form A × B.
Let Ab ⊆ A denote the inputs x ∈ A such that, in the protocol P, Alice sends the bit b given the transcript z′. (Recall that the message sent by a player is a function only of his or her private input and the history of the protocol so far.) Then the set of inputs that generate z is Ab × B, completing the inductive step. □

Note that Lemma 4.1 makes no reference to a function f — it holds for any deterministic protocol, whether or not it computes a function that we care about. In Figure 4.3, we can clearly see an additional property of all of the rectangles — with respect to the matrix in (4.1), every rectangle is monochromatic, meaning all of its entries have the same value. This is true for any protocol that correctly computes a function f.

Lemma 4.2 If a deterministic protocol P computes a function f, then every rectangle induced by P is monochromatic in the matrix M(f).

Proof: Consider an arbitrary combinatorial rectangle A × B induced by P, with all inputs in A × B inducing the same transcript. The output of P is constant on A × B. Since P correctly computes f, f is also constant on A × B. □

Amazingly, the minimal work we've invested so far already yields a powerful technique for lower bounding the deterministic communication complexity of functions.

Theorem 4.3 Let f be a function such that every partition of M(f) into monochromatic rectangles requires at least t rectangles. Then the deterministic communication complexity of f is at least log2 t.

Proof: A deterministic protocol with communication cost c can only generate 2^c distinct transcripts — equivalently, its (binary) protocol tree can only have 2^c leaves. If such a protocol computes the function f, then by Lemmas 4.1 and 4.2 it partitions M(f) into at most 2^c monochromatic rectangles. By assumption, 2^c ≥ t and hence c ≥ log2 t. □
Rather than applying Theorem 4.3 directly, we'll almost always be able to prove a stronger and simpler condition. To partition a matrix, one needs to cover all of its entries with disjoint sets. The disjointness condition is annoying. So by a covering of a 0-1 matrix, we mean a collection of subsets of entries whose union includes all of its elements — overlaps between these sets are allowed. See Figure 4.4.

Corollary 4.4 Let f be a function such that every covering of M(f) by monochromatic rectangles requires at least t rectangles. Then the deterministic communication complexity of f is at least log2 t.

Communication complexity lower bounds proved using covers — including all of those proved in Section 4.2.5 — automatically apply also to more general "nondeterministic" communication protocols, as well as randomized protocols with 1-sided error. We'll discuss this more next lecture, when it will be relevant.

Figure 4.4 A covering by four monochromatic rectangles that is not a partition.

4.2.5 Lower Bounds for Equality and Disjointness

Armed with Corollary 4.4, we can quickly prove communication lower bounds for some functions of interest. For example, recall that when f is the Equality function, the matrix M(f) is the identity. The key observation about this matrix is: a monochromatic rectangle that includes a "1" contains only one element. The reason is simple: such a rectangle is not allowed to contain any 0's since it is monochromatic, and if it included a second 1 it would pick up some 0-entries as well (recall that rectangles are closed under "mix and match"). Since there are 2^n 1's in the matrix, every covering by monochromatic rectangles (even of just the 1's) has size at least 2^n.

Corollary 4.5 The deterministic communication complexity of Equality is at least n.4

The exact same argument gives the same lower bound for the Greater-Than function.
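For n = 2, the key observation about the Equality matrix can be checked by brute force over all nonempty rectangles; a sketch:

```python
from itertools import chain, combinations

inputs = ['00', '01', '10', '11']
M = {(x, y): int(x == y) for x in inputs for y in inputs}  # Equality, n = 2

def nonempty_subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(1, len(xs) + 1))

# Every monochromatic rectangle of the identity matrix that contains a 1-entry
# is a single cell, so covering the 2^n ones takes at least 2^n rectangles.
for A in nonempty_subsets(inputs):
    for B in nonempty_subsets(inputs):
        if {M[x, y] for x in A for y in B} == {1}:
            assert len(A) == 1 and len(B) == 1
```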
Corollary 4.6 The deterministic communication complexity of Greater-Than is at least n.

We can generalize this argument as follows. A fooling set for a function f is a subset F ⊆ X × Y of inputs such that:

(i) f is constant on F;

(ii) for each distinct pair (x1, y1), (x2, y2) ∈ F , at least one of (x1, y2), (x2, y1) has the opposite f-value.

4 The 0's can be covered using another 2^n monochromatic rectangles, one per row (rectangles need not be "contiguous"!). This gives a lower bound of n + 1. The trivial upper bound has Alice sending her input to Bob and Bob announcing the answer, which is an (n + 1)-bit protocol. Analogous "+1" improvements are possible for the other examples in this section.

Since rectangles are closed under the "mix and match" operation, (i) and (ii) imply that every monochromatic rectangle contains at most one element of F.

Corollary 4.7 If F is a fooling set for f, then the deterministic communication complexity of f is at least log2 |F|.

For Equality and Greater-Than, we were effectively using the fooling set F = {(x, x) : x ∈ {0, 1}^n}. The fooling set method is powerful enough to prove a strong lower bound on the deterministic communication complexity of Disjointness.

Corollary 4.8 The deterministic communication complexity of Disjointness is at least n.

Proof: Take F = {(x, 1 − x) : x ∈ {0, 1}^n} — or in set notation, {(S, S^c) : S ⊆ {1, 2, . . . , n}}. The set F is a fooling set — it obviously consists only of "yes" inputs of Disjointness, while for every S ≠ T, either S ∩ T^c ≠ ∅ or T ∩ S^c ≠ ∅ (or both). See Figure 4.5. Since |F| = 2^n, Corollary 4.7 completes the proof. □

Figure 4.5 If S and T are different sets, then either S and T^c or T and S^c are not disjoint.

4.2.6 Take-Aways

A key take-away point from this section is that, using covering arguments, we can prove the lower bounds that we want on the deterministic communication complexity of many functions of interest.
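The fooling set of Corollary 4.8 can be verified exhaustively for small n; a sketch (the choice n = 4 is arbitrary):

```python
from itertools import combinations, product

def disj(x, y):
    """Disjointness on characteristic vectors: 1 iff no shared coordinate."""
    return int(all(a == 0 or b == 0 for a, b in zip(x, y)))

n = 4
F = [(x, tuple(1 - b for b in x)) for x in product((0, 1), repeat=n)]  # {(S, S^c)}

# (i) Disjointness is constant (equal to 1) on F.
assert all(disj(x, y) == 1 for x, y in F)

# (ii) Mixing any two distinct pairs flips the value at least once.
for (x1, y1), (x2, y2) in combinations(F, 2):
    assert disj(x1, y2) == 0 or disj(x2, y1) == 0

# |F| = 2^n, so Corollary 4.7 gives a lower bound of log2 |F| = n bits.
assert len(F) == 2 ** n
```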
These lower bounds apply also to nondeterministic protocols (discussed next week) and randomized protocols with 1-sided error. As with one-way communication complexity, proving stronger lower bounds that apply also to randomized protocols with two-sided error is more challenging. Since we're usually perfectly happy with a good randomized algorithm — recall the F2 estimation algorithm from Section 1.4 — such lower bounds are very relevant for algorithmic applications. They are our next topic.

4.3 Randomized Protocols

4.3.1 Default Parameter Settings

Our discussion of randomized one-way communication protocols in Section 2.2 remains equally relevant for general protocols. Our "default parameter settings" for such protocols will be the same.

Public coins. By default, we work with public-coin protocols, where Alice and Bob have shared randomness in the form of an infinite sequence of perfectly random bits written on a blackboard in public view. Such protocols are more powerful than private-coin protocols, but not by much (Theorem 4.9). Recall that public-coin randomized protocols are equivalent to distributions over deterministic protocols.

Two-sided error. We allow a protocol to err with constant probability (1/3 by default), whether the correct answer is "1" or "0." This is the most permissive error model.

Arbitrary constant error probability. Recall that all constant error probabilities in (0, 1/2) are the same — changing the error changes the randomized communication complexity by only a constant factor (by the usual "independent trials" argument, detailed in the exercises). Thus for upper bounds, we'll be content to achieve error 49%; for lower bounds, it is enough to rule out low-communication protocols with error 1%.

Worst-case communication.
We define the communication cost of a randomized protocol as the maximum number of bits ever communicated, over all choices of inputs and coin flips. Measuring the expected communication (over the protocol's coin flips) could reduce the communication complexity of a problem, but only by a constant factor.

4.3.2 Newman's Theorem: Public- vs. Private-Coin Protocols

We mentioned a few times that, for our purposes, it usually won't matter whether we consider public-coin or private-coin randomized protocols. What we meant is the following result.

Theorem 4.9 (Newman's Theorem (1991)) If there is a public-coin protocol for a function f with n-bit inputs that has two-sided error 1/3 and communication cost c, then there is a private-coin protocol for the problem that has two-sided error 1/3 and communication cost O(c + log n).

Thus, for problems with public-coin randomized communication complexity Ω(log n), like most of the problems that we'll study in this course, there is no difference between the communication complexity of the public-coin and private-coin variants (modulo constant factors).

An interesting exception is Equality. Last lecture, we gave a public-coin protocol — one-way, even — with constant communication complexity. Theorem 4.9 only implies an upper bound of O(log n) communication for private-coin protocols. (One can also give such a private-coin protocol directly, see the Exercises.) There is also a matching lower bound of Ω(log n) for the private-coin communication complexity of Equality. (This isn't very hard to prove, but we won't have an occasion to do it.) Thus public-coin protocols can save Θ(log n) bits of communication over private-coin protocols, but no more.

Proof of Theorem 4.9: Let P denote a public-coin protocol with two-sided error 1/3. We begin with a thought experiment. Fix an input (x, y), with x, y ∈ {0, 1}^n.
If we run P on this input, a public string r1 of random bits is consumed and the output of the protocol is correct with probability at least 2/3. If we run it again, a second (independent) random string r2 is consumed and another (independent) answer is given, again correct with probability at least 2/3. After t such trials and the consumption of random strings r1, . . . , rt, P produces t answers. We expect at least 2/3 of these to be correct, and Chernoff bounds (with δ = Θ(1) and μ = Θ(t)) imply that at least 60% of these answers are correct with probability at least 1 − exp{−Θ(t)}.

We continue the thought experiment by taking a Union Bound over the 2^n · 2^n = 2^(2n) choices of the input (x, y). With probability at least 1 − 2^(2n) · exp{−Θ(t)} over the choice of r1, . . . , rt, for every input (x, y), running the protocol P with these random strings yields at least .6t (out of t) correct answers. In this event, the single sequence r1, . . . , rt of random strings "works" simultaneously for all inputs (x, y). Provided we take t = cn with a large enough constant c, this probability is positive. In particular, such a set r1, . . . , rt of random strings exists.

Here is the private-coin protocol.

(0) Before receiving their inputs, Alice and Bob agree on a set of strings r1, . . . , rt with the property that, for every input (x, y), running P t times with the random strings r1, . . . , rt yields at least 60% correct answers.

(1) Alice picks an index i ∈ {1, 2, . . . , t} uniformly at random and sends it to Bob. This requires ≈ log2 t = Θ(log n) bits of communication (recall t = Θ(n)).

(2) Alice and Bob simulate the public-coin protocol P as if they had public coins given by ri.

By the defining property of the ri's, this (private-coin) protocol has error 40%. As usual, this can be reduced to 1/3 via a constant number of independent repetitions followed by taking the majority answer. The resulting communication cost is O(c + log n), as claimed. □
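The choice t = Θ(n) can be made concrete. A back-of-the-envelope sketch, substituting an explicit Hoeffding bound for the Chernoff bound in the proof (the 2/3-versus-60% gap and the resulting constant are illustrative, not canonical):

```python
import math

def trials_needed(n, p_correct=2/3, threshold=0.6):
    """Smallest t making the union bound go through:
    2^(2n) * exp(-2 * t * (p_correct - threshold)^2) < 1,
    using Hoeffding's bound on the fraction of correct answers over t runs."""
    gap = p_correct - threshold
    return math.floor(2 * n * math.log(2) / (2 * gap ** 2)) + 1

# t grows linearly in n, so sending the index i costs ~log2 t = Theta(log n) bits.
assert 1.9 < trials_needed(200) / trials_needed(100) < 2.1
assert math.log2(trials_needed(10 ** 6)) < 40  # the index stays cheap
```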
We stated and proved Theorem 4.9 for general protocols, but the exact same statement holds (with the same proof) for the one-way protocols that we studied in Lectures 1–3.

4.3.3 Distributional Complexity

Randomized protocols are significantly harder to reason about than deterministic ones. For example, we've seen that a deterministic protocol can be thought of as a partition of the input space into rectangles. A randomized protocol is a distribution over such partitions. While a deterministic protocol that computes a function f induces only monochromatic rectangles, this does not hold for randomized protocols (which can err with some probability). We can make our lives somewhat simpler by using Yao's Lemma to translate distributional lower bounds for deterministic protocols to worst-case lower bounds for randomized protocols. Recall the lemma from Lecture 2 (Lemma 2.3).

Lemma 4.10 (Yao 1983) Let D be a distribution over the space of inputs (x, y) to a communication problem, and ε ∈ (0, 1/2). Suppose that every deterministic protocol P with

Pr(x,y)∼D [P wrong on (x, y)] ≤ ε

has communication cost at least k. Then every (public-coin) randomized protocol R with (two-sided) error at most ε on every input has communication cost at least k.

We proved Lemma 2.3 in Lecture 2 for one-way protocols, but the same proof holds verbatim for general communication protocols. As in the one-way case, Lemma 2.3 is a "complete" proof technique — whatever the true randomized communication complexity, there is a hard distribution D over inputs that can in principle be used to prove it (recall the Exercises).

Summarizing, proving lower bounds for randomized communication complexity reduces to:

1. Figuring out a "hard distribution" D over inputs.

2. Proving that every low-communication deterministic protocol has large error w.r.t. inputs drawn from D.

Of course, this is usually easier said than done.
4.3.4 Case Study: Disjointness

Overview

We now return to the Disjointness problem. In Lecture 2 we proved that the one-way randomized communication complexity of this problem is linear (Theorem 2.2). We did this by reducing Index to Disjointness — the former is just a special case of the latter, where one player has a singleton set (i.e., a standard basis vector). We used Yao's Lemma (with D the uniform distribution) and a counting argument (about the volume of small-radius balls in the Hamming cube, remember?) to prove that the one-way randomized communication complexity of Index is Ω(n).

Unfortunately, for general communication protocols, the communication complexity of Index is obviously O(log n) — Bob can just send his index i to Alice using ≈ log2 n bits, and Alice can compute the function. So, it's back to the drawing board. The following is a major and useful technical achievement.

Theorem 4.11 (Kalyanasundaram and Schnitger 1992; Razborov 1992) The randomized communication complexity of Disjointness is Ω(n).

Theorem 4.11 was originally proved in Kalyanasundaram and Schnitger (1992); the simplified proof in Razborov (1992) has been more influential. More recently, all the cool kids prove Theorem 4.11 using "information complexity" arguments; see Bar-Yossef et al. (2002a). If you only remember one result from the entire field of communication complexity, it should be Theorem 4.11. The primary reason is that the problem is unreasonably effective for proving lower bounds for other algorithmic problems — almost every subsequent lecture will include a significant example. Indeed, many algorithm designers simply use Theorem 4.11 as a "black box" to prove lower bounds for other problems, without losing sleep over its proof.5,6 As a bonus, proofs of Theorem 4.11 tend to showcase techniques that are reusable in other contexts.
For a trivial consequence of Theorem 4.11 — see future lectures for less obvious ones — let's return to the setting of streaming algorithms. Lectures 1 and 2 considered only one-pass algorithms. In some contexts, like a telescope that generates an exabyte of data per day, this is a hard constraint. In other settings, like database applications, a small constant number of passes over the data might be feasible (as an overnight job, for example). Communication complexity lower bounds for one-way protocols say nothing about two-pass algorithms, while those for general protocols do. Using Theorem 4.11, all of our Ω(m) space lower bounds for 1-pass algorithms become Ω(m/p) space lower bounds for p-pass algorithms, via the same reductions.7 For example, we proved such lower bounds for computing F∞, the highest frequency of an element, even with randomization and approximation, and for computing F0 or F2 exactly, even with randomization.

So how would one go about proving Theorem 4.11? Recall that Yao's Lemma reduces the proof to exhibiting a hard distribution D (a bit of dark art) over inputs and proving that all low-communication deterministic protocols have large error with respect to D (a potentially tough math problem). We next discuss each of these steps in turn.

5 Similar to, for example, the PCP Theorem and the Parallel Repetition Theorem in the context of hardness of approximation (see e.g. Arora and Lund (1997)).

6 There's no shame in this — life is short and there's lots of theorems that need proving.

7 A p-pass space-s streaming algorithm S induces a communication protocol with O(ps) communication: Alice and Bob turn their inputs into data streams and repeatedly feed them into S, sending the memory state of S back and forth to continue the simulation.

Choosing a Hard Distribution

The uniform distribution over inputs is not a hard distribution for Disjointness.
What is the probability that a random input (x, y) satisfies f(x, y) = 1? Independently in each coordinate i, there is a 25% probability that xi = yi = 1. Thus, f(x, y) = 1 with probability (3/4)^n. This means that the zero-communication protocol that always outputs "not disjoint" has low error with respect to this distribution. The moral is that a hard distribution D must, at the very least, have a constant probability of producing both "yes" and "no" instances.

The next idea, motivated by the Birthday Paradox, is to define D such that each of Alice and Bob receives a random subset of {1, 2, . . . , n} of size ≈ √n. Elementary calculations show that a random instance (x, y) from D has a constant probability of satisfying each of f(x, y) = 1 and f(x, y) = 0. An obvious issue with this approach is that there is a trivial deterministic protocol that uses O(√n log n) communication and has zero error: Alice (say) just sends her whole input to Bob by describing each of her √n elements explicitly by name (≈ log2 n bits each). So there's no way to prove a linear communication lower bound using this distribution. Babai et al. (1986) prove that one can at least prove an Ω(√n) communication lower bound using this distribution, which is already quite a non-trivial result (more on this below). They also showed that for every product distribution D — meaning whenever the random choices of x and of y are independent — there is a zero-error deterministic protocol that uses only O(√n log n) bits of communication (see the Exercises).8

Summarizing, if we believe that Disjointness really requires Ω(n) communication to solve via randomized protocols, then we need to find a distribution D that meets all of the following criteria.

1. There is a constant probability that f(x, y) = 1 and that f(x, y) = 0. (Otherwise, a constant protocol works.)

2. Alice and Bob need to usually receive inputs that correspond to sets of size Ω(n).
(Otherwise, one player can explicitly communicate his or her set.)

3. The random inputs x and y are correlated. (Otherwise, the upper bound from Babai et al. (1986) applies.)

4. It must be mathematically tractable to prove good lower bounds on the error of all deterministic communication protocols that use a sublinear amount of communication.

8 This does not imply that a linear lower bound is impossible. The proof of the converse of Lemma 2.3 — that a tight lower bound on the randomized communication complexity of a problem can always be proved through a distributional lower bound for a suitable choice of D — generally makes use of distributions in which the choices of x and y are correlated.

Razborov (1992) proposed a distribution that obviously satisfies the first three properties and, less obviously, also satisfies the fourth. It is:

1. With probability 75%:
   a) (x, y) is chosen uniformly at random subject to:
      i. x, y each have exactly n/4 1's;
      ii. there is no index i ∈ {1, 2, . . . , n} with xi = yi = 1 (so f(x, y) = 1).

2. With probability 25%:
   a) (x, y) is chosen uniformly at random subject to:
      i. x, y each have exactly n/4 1's;
      ii. there is exactly one index i ∈ {1, 2, . . . , n} with xi = yi = 1 (so f(x, y) = 0).

Note that in both cases, the constraint on the number of indices i with xi = yi = 1 creates correlation between the choices of x and y.

Proving Error Lower Bounds via Corruption Bounds

Even if you're handed a hard distribution over inputs, there remains the challenging task of proving a good error lower bound on low-communication deterministic protocols. There are multiple methods for doing this, with the corruption method being the most successful one so far. We outline this method next. At a high level, the corruption method is a natural extension of the covering arguments of Section 4.2 to protocols that can err.
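Before turning to the corruption method, the three candidate distributions just discussed can be compared by simulation. A sketch (the choices n = 64, the trial count, and the seed are arbitrary; sampling x uniformly and then y conditionally reproduces Razborov's distribution because the number of valid y's does not depend on x):

```python
import random

def prob_intersect(sample_pair, trials=20000, seed=0):
    """Estimate Pr[x and y intersect] for a distribution over input pairs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, y = sample_pair(rng)
        hits += not x.isdisjoint(y)
    return hits / trials

n, k = 64, 8  # k ~ sqrt(n)

def uniform(rng):
    return ({i for i in range(n) if rng.random() < 0.5},
            {i for i in range(n) if rng.random() < 0.5})

def sqrt_size(rng):
    return set(rng.sample(range(n), k)), set(rng.sample(range(n), k))

def razborov(rng):
    m = n // 4
    x = rng.sample(range(n), m)
    rest = [i for i in range(n) if i not in set(x)]
    if rng.random() < 0.75:
        y = rng.sample(rest, m)                          # disjoint case
    else:
        y = [rng.choice(x)] + rng.sample(rest, m - 1)    # unique intersection
    return set(x), set(y)

# Uniform: Pr[disjoint] = (3/4)^n, so the pair almost always intersects.
assert prob_intersect(uniform) > 0.99
# Random sqrt(n)-size sets: both outcomes occur with constant probability.
assert 0.1 < prob_intersect(sqrt_size) < 0.9
# Razborov's distribution: intersecting with probability exactly 1/4.
assert 0.2 < prob_intersect(razborov) < 0.3
```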
Recall that for deterministic protocols, the covering approach argues that every covering of the matrix M(f) of the function f by monochromatic rectangles requires a lot of rectangles. In our examples, we only bothered to argue about the 1-inputs of the function.9 We'll do something similar here, weighted by the distribution D and allowing errors — arguing that there's significant mass on the 1-inputs of f, and that a lot of nearly monochromatic rectangles are required to cover them all.

Precisely, suppose you have a distribution D over the inputs of a problem so that the "1-mass" of D, meaning Pr(x,y)∼D[f(x, y) = 1], is at least a constant, say .5. The plan is to prove two properties.

(1) For every deterministic protocol P with error at most a sufficiently small constant ε, at least 25% of the 1-mass of D is contained in "almost monochromatic 1-rectangles" of P (defined below). We'll see below that this is easy to prove in general by an averaging argument.

9 Since f has only two outputs, it's almost without loss to pick a single output z ∈ {0, 1} of f and lower bound only the number of monochromatic rectangles needed to cover all of the z's.

(2) An almost monochromatic 1-rectangle contains at most 2^−c mass of the distribution D, where c is as large as possible (ideally c = Ω(n)). This is the hard step, and the argument will be different for different functions f and different input distributions D.

If we can establish (1) and (2), then we have a lower bound of Ω(2^c) on the number of rectangles induced by P, which proves that P uses communication Ω(c).10

Here's the formal definition of an almost monochromatic 1-rectangle (AM1R) R = A × B of a matrix M(f) with respect to an input distribution D:

Pr(x,y)∼D[(x, y) ∈ R and f(x, y) = 0] ≤ 8ε · Pr(x,y)∼D[(x, y) ∈ R and f(x, y) = 1].   (4.2)

Here's why property (1) is true in general. Let P be a deterministic protocol with error at most ε with respect to D.
Since P is deterministic, it partitions the matrix M(f) into rectangles, and in each rectangle, P's output is constant. Let R1, . . . , Rl denote the rectangles in which P outputs "1." At least 50% of the 1-mass of D — and hence at least 25% of D's overall mass — must be contained in R1, . . . , Rl. For if not, on at least 25% of the mass of D, f(x, y) = 1 while P outputs "0". This contradicts the assumption that P has error ε with respect to D (provided ε < .25). Also, at least 50% of the mass in R1, . . . , Rl must lie in AM1Rs. For if not, using (4.2) and the fact that the total mass in R1, . . . , Rl is at least .25, it would follow that D places more than 8ε · .125 = ε mass on 0-inputs in R1, . . . , Rl. Since P outputs "1" on all of these inputs, this contradicts the assumption that P has error at most ε with respect to D. This completes the proof of step (1), which applies to every problem and every distribution D over inputs with 1-mass at least .5.

Step (2) is difficult and problem-specific. Babai et al. (1986), for their input distribution D over Disjointness inputs mentioned above, gave a proof of step (2) with c = Ω(√n), thus giving an Ω(√n) lower bound on the randomized communication complexity of the problem. Razborov (1992) gave, for his input distribution, a proof of step (2) with c = Ω(n), implying the desired lower bound for Disjointness. Sadly, we won't have time to talk about these and subsequent proofs (as in Bar-Yossef et al. (2002a)); perhaps in a future course.

10 Why call it the "corruption method"? Because the argument shows that, if a deterministic protocol has low communication, then most of its induced rectangles that contain 1-inputs are also "corrupted" by lots of 0-inputs — its rectangles are so big that (4.2) fails. In turn, this implies large error.

Lecture 5

Lower Bounds for the Extension Complexity of Polytopes

5.1 Linear Programs, Polytopes, and Extended Formulations

5.1.1 Linear Programs for Combinatorial Optimization Problems

You've probably seen some polynomial-time algorithms for the problem of computing a maximum-weight matching of a bipartite graph.1 Many of these, like the Hungarian algorithm (Kuhn, 1955), are "combinatorial algorithms" that operate directly on the graph. Linear programming is also an effective tool for solving many discrete optimization problems. For example, consider the following linear programming relaxation of the maximum-weight bipartite matching problem (for a weighted bipartite graph G = (U, V, E, w)):

max Σ_{e∈E} w_e x_e   (5.1)

subject to

Σ_{e∈δ(v)} x_e ≤ 1   for every vertex v ∈ U ∪ V   (5.2)

x_e ≥ 0   for every edge e ∈ E   (5.3)

(where δ(v) denotes the edges incident to v). In this formulation, each decision variable x_e is intended to encode whether an edge e is in the matching (x_e = 1) or not (x_e = 0). It is easy to verify that the vectors of {0, 1}^E that satisfy the constraints (5.2) and (5.3) are precisely the characteristic vectors of the matchings of G, with the objective function value of the solution to the linear program equal to the total weight of the matching.

1 Recall that a graph is bipartite if its vertex set can be partitioned into two sets U and V such that every edge has one endpoint in each of U, V . Recall that a matching of a graph is a subset of edges that are pairwise disjoint.

Since every characteristic vector of a matching satisfies (5.2) and (5.3), and the set of feasible solutions to the linear system defined by (5.2) and (5.3) is convex, the convex hull
of the characteristic vectors of matchings is contained in this feasible region.2 Also note that every characteristic vector x of a matching is a vertex3 of this feasible region — since all feasible solutions have all coordinates between 0 and 1, the 0-1 vector x cannot be written as a non-trivial convex combination of other feasible solutions.

The worry is: does this feasible region contain anything other than the convex hull of characteristic vectors of matchings? Equivalently, does it have any vertices that are fractional, and hence do not correspond to matchings? (Note that integrality is not explicitly enforced by (5.2) or (5.3).) A nice fact is that the vertices of the feasible region defined by (5.2) and (5.3) are precisely the characteristic vectors of matchings of G. This is equivalent to the Birkhoff-von Neumann theorem (see Exercises). There are algorithms that solve linear programs in polynomial time (and output a vertex of the feasible region; see e.g. Grötschel et al. (1988)), so this implies that the maximum-weight bipartite matching problem can be solved efficiently using linear programming.

How about the more general problem of maximum-weight matching in general (non-bipartite) graphs? While the same linear system (5.2) and (5.3) still contains the convex hull of all characteristic vectors of matchings, and these characteristic vectors are vertices of the feasible region, there are also other, fractional, vertices. To see this, consider the simplest non-bipartite graph, a triangle. Every matching contains at most 1 edge. But assigning x_e = 1/2 for each of the three edges e yields a fractional solution that satisfies (5.2) and (5.3). This solution clearly cannot be written as a convex combination of characteristic vectors of matchings.
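Both phenomena can be checked numerically with an off-the-shelf LP solver. The following sketch uses scipy.optimize.linprog on the LP (5.1)–(5.3); the K_{2,2} instance and its edge weights are made-up choices for illustration, while the triangle instance is the one from the text.

```python
import numpy as np
from scipy.optimize import linprog

# LP relaxation (5.1)-(5.3) on K_{2,2} with edges (u1,v1),(u1,v2),(u2,v1),(u2,v2)
# and made-up weights. Bipartite => the LP optimum is an integral matching.
w = np.array([3.0, 1.0, 2.0, 4.0])
A_bip = np.array([[1, 1, 0, 0],   # degree constraint (5.2) at u1
                  [0, 0, 1, 1],   # at u2
                  [1, 0, 1, 0],   # at v1
                  [0, 1, 0, 1]])  # at v2
res_bip = linprog(-w, A_ub=A_bip, b_ub=np.ones(4),
                  bounds=[(0, None)] * 4, method="highs")
print(res_bip.x, -res_bip.fun)   # integral: matching {u1-v1, u2-v2}, weight 7

# The same LP on the triangle (vertices a,b,c; edges ab, bc, ca; unit weights).
A_tri = np.array([[1, 0, 1],     # edges incident to a
                  [1, 1, 0],     # to b
                  [0, 1, 1]])    # to c
res_tri = linprog([-1, -1, -1], A_ub=A_tri, b_ub=np.ones(3),
                  bounds=[(0, None)] * 3, method="highs")
print(res_tri.x, -res_tri.fun)   # fractional: x_e = 1/2 on every edge, value 3/2
```

Both optima are unique (summing the three triangle constraints shows the LP value is at most 3/2, with equality only at the all-1/2 point), so the solver must return exactly the solutions described in the text.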
It is possible to add to (5.2)–(5.3) additional inequalities — "odd cycle inequalities" stating that, for every odd cycle C of G, Σ_{e∈C} x_e ≤ (|C| − 1)/2 — so that the resulting smaller set of feasible solutions is precisely the convex hull of the characteristic vectors of matchings. Unfortunately, many graphs have an exponential number of odd cycles. Is it possible to add only a polynomial number of inequalities instead? Unfortunately not — the convex hull of the characteristic vectors of matchings can have 2^{Ω(n)} "facets" (Pulleyblank and Edmonds, 1974).4 We define facets more formally in Section 5.3.1, but intuitively they are the "sides" of a polytope,5 like the 2n sides of an n-dimensional cube.

2 Recall that a set S ⊆ R^n is convex if it is "filled in," with λx + (1 − λ)y ∈ S whenever x, y ∈ S and λ ∈ [0, 1]. Recall that the convex hull of a point set P ⊆ R^n is the smallest (i.e., intersection of all) convex set that contains it. Equivalently, it is the set of all finite convex combinations of points of P, where a convex combination has the form Σ_{i=1}^p λ_i x_i for non-negative λ_i's summing to 1 and x_1, . . . , x_p ∈ P.

3 There is an unfortunate clash of terminology when talking about linear programming relaxations of combinatorial optimization problems: a "vertex" might refer to a node of a graph or to a "corner" of a geometric set.

4 This linear programming formulation still leads to a polynomial-time algorithm, but using fairly heavy machinery — the "ellipsoid method" (Khachiyan, 1979) and a "separation oracle" for the odd cycle inequalities (Padberg and Rao, 1982). There are also polynomial-time combinatorial algorithms for (weighted) non-bipartite matching, beginning with Edmonds (1965).

5 A polytope is just a high-dimensional polygon — an intersection of halfspaces that is bounded or, equivalently, the convex hull of a finite set of points.

It is intuitively
clear that a polytope with l facets needs l inequalities to describe it — it's like cleaving a shape out of marble, with each inequality contributing a single cut. We conclude that there is no linear program with variables {x_e}_{e∈E} of polynomial size that captures the maximum-weight (non-bipartite) matching problem.

5.1.2 Auxiliary Variables and Extended Formulations

The exponential lower bound above on the number of linear inequalities needed to describe the convex hull of characteristic vectors of matchings of a non-bipartite graph applies to linear systems in R^E, with one dimension per edge. The idea of an extended formulation is to add a polynomial number of auxiliary decision variables, with the hope that radically fewer inequalities are needed to describe the region of interest in the higher-dimensional space.

This idea might sound like grasping at straws, but sometimes it actually works. For example, fix a positive integer n, and represent a permutation π ∈ S_n by the n-vector x_π = (π(1), π(2), . . . , π(n)), with all coordinates in {1, 2, . . . , n}. The permutahedron is the convex hull of all n! such vectors. The permutahedron is known to have 2^n − 2 facets (see e.g. Goemans (2014)), so a polynomial-size linear description would seem out of reach.

Suppose we add n^2 auxiliary variables, y_ij for all i, j ∈ {1, 2, . . . , n}. The intent is for y_ij to be a 0-1 variable that indicates whether or not π(i) = j — in this case, the y_ij's are the entries of the n × n permutation matrix that corresponds to π. We next add a set of constraints to enforce the desired semantics of the y_ij's (cf. (5.2) and (5.3)):

Σ_{j=1}^n y_ij = 1  for i = 1, 2, . . . , n;    (5.4)

Σ_{i=1}^n y_ij = 1  for j = 1, 2, . . . , n;    (5.5)

and

y_ij ≥ 0  for all i, j ∈ {1, 2, . . . , n}.    (5.6)

We also add constraints that enforce consistency between the permutation encoded by the x_i's and by the y_ij's:

x_i = Σ_{j=1}^n j · y_ij    (5.7)

for all i = 1, 2, . . . , n.
It is straightforward to check that the vectors y ∈ {0, 1}^{n^2} that satisfy (5.4)–(5.6) are precisely the permutation matrices. For such a y corresponding to a permutation π, the constraints (5.7) force the x_i's to encode the same permutation π. Using again the Birkhoff-von Neumann Theorem, every vector y ∈ R^{n^2} that satisfies (5.4)–(5.6) is a convex combination of permutation matrices (see Exercises). Constraint (5.7) implies that the x_i's encode the same convex combination of permutations. Thus, if we take the set of solutions in R^{n+n^2} that satisfy (5.4)–(5.7) and project onto the x-coordinates, we get exactly the permutahedron. This is what we mean by an extended formulation of a polytope.

To recap the remarkable trick we just pulled off: blowing up the number of variables from n to n + n^2 reduced the number of inequalities needed from 2^n − 2 to n^2 + 3n. This allows us to optimize a linear function over the permutahedron in polynomial time. Given a linear function (in the x_i's), we optimize it over the (polynomial-size) extended formulation, and retain only the x-variables of the optimal solution.

Given the utility of polynomial-size extended formulations, we'd obviously like to understand which problems have them. For example, does the non-bipartite matching problem admit such a formulation? The goal of this lecture is to develop communication complexity-based techniques for ruling out such polynomial-size extended formulations. We'll prove an impossibility result for the "correlation polytope" (Fiorini et al., 2015); similar (but much more involved) arguments imply that every extended formulation of the non-bipartite matching problem requires an exponential number of inequalities (Rothvoß, 2014).
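The extended formulation (5.4)–(5.7) is small enough to hand directly to an LP solver. The following sketch uses scipy.optimize.linprog with a made-up objective vector c; by the rearrangement inequality, the optimal x should be the permutation assigning the largest values to the largest entries of c.

```python
import numpy as np
from scipy.optimize import linprog

# Optimize c^T x over the permutahedron via (5.4)-(5.7).
# Variables: (x_1..x_n, y_11..y_nn), n + n^2 in total.
n = 4
c = np.array([2.0, 7.0, 1.0, 5.0])      # made-up objective
nv = n + n * n

def yidx(i, j):                          # position of y_ij among the variables
    return n + i * n + j

A_eq, b_eq = [], []
for i in range(n):                       # (5.4): each row of y sums to 1
    row = np.zeros(nv); row[[yidx(i, j) for j in range(n)]] = 1
    A_eq.append(row); b_eq.append(1.0)
for j in range(n):                       # (5.5): each column of y sums to 1
    row = np.zeros(nv); row[[yidx(i, j) for i in range(n)]] = 1
    A_eq.append(row); b_eq.append(1.0)
for i in range(n):                       # (5.7): x_i - sum_j j * y_ij = 0
    row = np.zeros(nv); row[i] = 1
    for j in range(n):
        row[yidx(i, j)] = -(j + 1)       # values 1..n (0-indexed j)
    A_eq.append(row); b_eq.append(0.0)

obj = np.concatenate([-c, np.zeros(n * n)])           # maximize c^T x
bounds = [(None, None)] * n + [(0, None)] * (n * n)   # (5.6): y >= 0
res = linprog(obj, A_eq=np.array(A_eq), b_eq=b_eq,
              bounds=bounds, method="highs")
x = res.x[:n]
print(x, -res.fun)   # x = (2, 4, 1, 3): rank 4 goes to c's largest entry, etc.
```

Since the entries of c are distinct, the maximizing vertex of the permutahedron is unique, so the projected x is forced even though the lift y need not be.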
Remark 5.1 (Geometric Intuition) It may seem surprising that adding a relatively small number of auxiliary variables can radically reduce the number of inequalities needed to describe a set — described in reverse, that projecting onto a subset of variables can massively blow up the number of sides. It's hard to draw (low-dimensional) pictures that illustrate this point. If you play around with projections of some three-dimensional polytopes onto the plane, you'll observe that non-facets of the high-dimensional polytope (edges) often become facets (again, edges) in the low-dimensional projection. Since the number of lower-dimensional faces of a polytope can be much bigger than the number of facets — already in the 3-D cube, there are 12 edges and only 6 sides — it should be plausible that a projection could significantly increase the number of facets.

5.2 Nondeterministic Communication Complexity

The connection between extended formulations of polytopes and communication complexity involves nondeterministic communication complexity. We studied this model implicitly in parts of Lecture 4; this section makes the model explicit.

Consider a function f : X × Y → {0, 1} and the corresponding 0-1 matrix M(f), with rows indexed by Alice's possible inputs and columns indexed by Bob's possible inputs. In Lecture 4 we proved that if every covering of M(f) by monochromatic rectangles6 requires at least t rectangles, then the deterministic communication complexity of f is at least log2 t (Theorem 4.3). The reason is that every communication protocol computing f with communication cost c induces a partition of M(f) into at most 2^c monochromatic rectangles, and partitions are a special case of coverings. See also Figure 5.1.

Figure 5.1 A covering by four monochromatic rectangles that is not a partition.

Communication complexity lower bounds that are proved through coverings are actually much stronger than we've let on thus far — they apply also to nondeterministic protocols, which we define next. You presumably already have a feel for nondeterminism from your study of the complexity class NP. Recall that one way to define NP is as the class of problems for which membership can be verified in polynomial time. To see how an analog might work with communication protocols, consider the complement of the Equality problem, ¬Equality. If a third party wanted to convince Alice and Bob that their inputs x and y are different, it would not be difficult: just specify an index i ∈ {1, 2, . . . , n} for which x_i ≠ y_i. Specifying an index requires log2 n bits, and specifying whether x_i = 0 and y_i = 1 or x_i = 1 and y_i = 0 requires one additional bit. Given such a specification, Alice and Bob can check the correctness of this "proof of non-equality" without any communication. If x ≠ y, there is always a (log2 n + 1)-bit proof that will convince Alice and Bob of this fact; if x = y, then no such proof will convince Alice and Bob otherwise. This means that ¬Equality has nondeterministic communication complexity at most log2 n + 1.

Coverings of M(f) by monochromatic rectangles are closely related to the nondeterministic communication complexity of f. We first show how coverings lead to nondeterministic protocols. It's easiest to formally define such protocols after the proof.

Proposition 5.2 Let f : X × Y → {0, 1} be a Boolean function and M(f) the corresponding matrix. If there is a cover of the 1-entries of M(f) by t 1-rectangles, then there is a nondeterministic protocol that verifies f(x, y) = 1 with cost log2 t.

6 Recall that a rectangle is a subset S ⊆ X × Y that has a product structure, meaning S = A × B for some A ⊆ X and B ⊆ Y. Equivalently, S is closed under "mix and match": whenever (x1, y1) and (x2, y2) are in S, so are (x1, y2) and (x2, y1). A rectangle is monochromatic (w.r.t. f) if it contains only 1-entries of M(f) or only 0-entries of M(f).
In these cases, we call it a 1-rectangle or a 0-rectangle, respectively.

Proof: Let R1, . . . , Rt denote a covering of the 1s of M(f) by 1-rectangles. Alice and Bob can agree on this covering in advance of receiving their inputs. Now consider the following scenario:

1. A prover — a third party — sees both inputs x and y. (This is the formal model used for nondeterministic protocols.)
2. The prover writes an index i ∈ {1, 2, . . . , t} — the name of a rectangle Ri — on a blackboard, in public view. Since Ri is a rectangle, it can be written as Ri = Ai × Bi with Ai ⊆ X, Bi ⊆ Y.
3. Alice accepts if and only if x ∈ Ai.
4. Bob accepts if and only if y ∈ Bi.

This protocol has the following properties:

1. If f(x, y) = 1, then there exists a proof such that Alice and Bob both accept. (Since f(x, y) = 1, (x, y) ∈ Ri for some i, and Alice and Bob both accept if "i" is written on the blackboard.)
2. If f(x, y) = 0, there is no proof that both Alice and Bob accept. (Whatever index i ∈ {1, 2, . . . , t} is written on the blackboard, since f(x, y) = 0, either x ∉ Ri or y ∉ Ri, causing a rejection.)
3. The maximum length of a proof is log2 t. (A proof is just an index i ∈ {1, 2, . . . , t}.)

These three properties imply, by definition, that the nondeterministic communication complexity of the function f and the output 1 is at most log2 t. □

The proof of Proposition 5.2 introduces our formal model of nondeterministic communication complexity: Alice and Bob are given a "proof" or "advice string" by a prover, which can depend on both of their inputs; the communication cost is the worst-case length of the proof; and a protocol is said to compute an output z ∈ {0, 1} of a function f if f(x, y) = z if and only if there exists a proof such that both Alice and Bob accept. With nondeterministic communication complexity, we speak about both a function f and an output z ∈ {0, 1}.
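The covering-to-protocol construction of Proposition 5.2 can be simulated directly. The sketch below instantiates it for ¬Equality on n-bit strings, using the covering by the 2n 1-rectangles implicit in the discussion above (one per choice of an index i and a bit b).

```python
from itertools import product

# Nondeterministic protocol for not-Equality on n-bit inputs, built from the
# covering of the 1s of M(¬EQ) by the 1-rectangles
#   R_{i,b} = {x : x_i = b} x {y : y_i = 1 - b}.
n = 3
rects = [(lambda x, i=i, b=b: x[i] == b,      # Alice's test: x in A_{i,b}
          lambda y, i=i, b=b: y[i] == 1 - b)  # Bob's test:   y in B_{i,b}
         for i in range(n) for b in (0, 1)]

def nondet_accepts(x, y):
    # The prover wins iff SOME proof index (rectangle) is accepted by both.
    return any(alice(x) and bob(y) for alice, bob in rects)

# Check correctness over all input pairs: accept exactly the 1-inputs x != y.
for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        assert nondet_accepts(x, y) == (x != y)
print("covering size", len(rects), "=> proofs of about log2(2n) bits")
```

A proof here is just the name of one of the 2n rectangles, matching the (log2 n + 1)-bit bound discussed above.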
For example, if f is Equality, then we saw that the nondeterministic communication complexity of f and the output 0 is at most log2 n + 1. Since it's not clear how to convince Alice and Bob that their inputs are equal without specifying at least one bit for each of the n coordinates, one might expect the nondeterministic communication complexity of f and the output 1 to be roughly n. (And it is, as we'll see.)

We've defined nondeterministic protocols so that Alice and Bob never speak, and only verify. This is without loss of generality, since given a protocol in which they do speak, one could modify it so that the prover writes on the blackboard everything that they would have said. We encourage the reader to formalize an alternative definition of nondeterministic protocols without a prover and in which Alice and Bob speak nondeterministically, and to prove that this definition is equivalent to the one we've given above (see Exercises).

Next we prove the converse of Proposition 5.2.

Proposition 5.3 If the nondeterministic communication complexity of the function f and the output 1 is c, then there is a covering of the 1s of M(f) by 2^c 1-rectangles.

Proof: Let P denote a nondeterministic communication protocol for f and the output 1 with communication cost (i.e., maximum proof length) at most c. For a proof l, let Z(l) denote the inputs (x, y) where both Alice and Bob accept the proof. We can write Z(l) = A × B, where A is the set of inputs x ∈ X of Alice where she accepts the proof l, and B is the set of inputs y ∈ Y of Bob where he accepts the proof. By the assumed correctness of P, f(x, y) = 1 for every (x, y) ∈ Z(l). That is, Z(l) is a 1-rectangle. By the first property of nondeterministic protocols, for every 1-input (x, y) there is a proof such that both Alice and Bob accept. That is, ∪l Z(l) is precisely the set of 1-inputs of f — a covering of the 1s of M(f) by 1-rectangles.
Since the communication cost of P is at most c, there are at most 2^c different proofs l. □

Proposition 5.3 implies that communication complexity lower bounds derived from covering lower bounds apply to nondeterministic protocols.

Corollary 5.4 If every covering of the 1s of M(f) by 1-rectangles uses at least t rectangles, then the nondeterministic communication complexity of f is at least log2 t.

Thus our arguments in Lecture 4, while simple, were even more powerful than we realized — they prove that the nondeterministic communication complexity of Equality, Disjointness, and Greater-Than (all with output 1) is at least n. It's kind of amazing that these lower bounds can be proved with so little work.

5.3 Extended Formulations and Nondeterministic Communication Complexity

What does communication complexity have to do with extended formulations? To forge a connection, we need to show that an extended formulation with few inequalities is somehow useful for solving hard communication problems. While this course includes a number of clever connections between communication complexity and various computational models, this connection to extended formulations is perhaps the most surprising and ingenious one of them all. Superficially, extended formulations with few inequalities can be thought of as "compressed descriptions" of a polytope, and communication complexity is generally useful for ruling out compressed descriptions of various types. It is not at all obvious that this vague intuition can be turned into a formal connection, let alone one that is useful for proving non-trivial impossibility results.

5.3.1 Faces and Facets

We discuss briefly some preliminaries about polytopes. Let P be a polytope in variables x ∈ R^n. By definition, an extended formulation of P is a set of the form

Q = {(x, y) : Cx + Dy ≤ d},

where x and y are the original and auxiliary variables, respectively, such that

{x : ∃y s.t. (x, y) ∈ Q} = P.
That is, projecting Q onto the original variables x yields the original polytope P. The extended formulation of the permutahedron described in Section 5.1.2 is a canonical example. The size of an extended formulation is the number of inequalities.7

Recall that x ∈ P is a vertex if it cannot be written as a non-trivial convex combination of other points in P. A supporting hyperplane of P is a vector a ∈ R^n and scalar b ∈ R such that a^T x ≤ b for all x ∈ P. Every supporting hyperplane (a, b) induces a face of P, defined as {x ∈ P : a^T x = b} — the intersection of the boundaries of P and of the halfspace defined by the supporting hyperplane. (See Figure 5.2.) Note that a face is generally induced by many different supporting hyperplanes. The empty set is considered a face. Note also that faces are nested — in three dimensions, there are vertices, edges, and sides. In general, if f is a face of P, then the vertices of f are precisely the vertices of P that are contained in f.

Figure 5.2 A supporting hyperplane of a polytope P and the corresponding face of the polytope.

7 There is no need to keep track of the number of auxiliary variables — there is no point in having an extended formulation of this type with more variables than inequalities (see Exercises).

A facet of P is a maximal face — a face that is not strictly contained in any other face. Provided P has a non-empty interior, its facets are (n − 1)-dimensional. There are two different types of finite descriptions of a polytope, and it is useful to go back and forth between them. First, a polytope P equals the convex hull of its vertices. Second, P is the intersection of the halfspaces that define its facets.8

5.3.2 Yannakakis's Lemma

What good is a small extended formulation? We next make up a contrived communication problem for which small extended formulations are useful.
For a polytope P, in the corresponding Face-Vertex(P) problem, Alice gets a face f of P (in the form of a supporting hyperplane (a, b)) and Bob gets a vertex v of P. The function FV(f, v) is defined as 1 if v does not belong to f, and 0 if v ∈ f. Equivalently, FV(f, v) = 1 if and only if a^T v < b, where (a, b) is a supporting hyperplane that induces f. Polytopes in n dimensions generally have an exponential number of faces and vertices. Thus, trivial protocols for Face-Vertex(P), where one party reports their input to the other, can have communication cost Ω(n).

A key result is the following.

Lemma 5.5 (Yannakakis's Lemma (1991)) If the polytope P admits an extended formulation Q with r inequalities, then the nondeterministic communication complexity of Face-Vertex(P) is at most log2 r.

That is, if we can prove a linear lower bound on the nondeterministic communication complexity of the Face-Vertex(P) problem, then we have ruled out subexponential-size extended formulations of P.

Sections 5.3.3 and 5.3.4 give two different proof sketches of Lemma 5.5. These are roughly equivalent, with the first emphasizing the geometric aspects (following Lovász (1990)) and the second the algebraic aspects (following Yannakakis (1991)). In Section 5.4 we put Lemma 5.5 to use and prove strong lower bounds for a concrete polytope. Remarkably, Yannakakis (1991) did not give any applications of his lemma — the lower bounds for extended formulations in Yannakakis (1991) are for "symmetric" formulations and proved via direct arguments. Lemma 5.5 was suggested by Yannakakis (1991) as a potentially useful tool for more general impossibility results, and finally in the past five years (beginning with Fiorini et al. (2015)) this prophecy has come to pass.

5.3.3 Proof Sketch of Lemma 5.5: A Geometric Argument

Suppose P admits an extended formulation Q = {(x, y) : Cx + Dy ≤ d} with only r inequalities. Both P and Q are known to Alice and Bob before the protocol begins.
A first idea is for Alice, who is given a face f of the original polytope P, to tell Bob the name of the "corresponding face" of Q. Bob can then check whether or not his "corresponding vertex" belongs to the named face, thereby computing the function. Unfortunately, knowing that Q is defined by r inequalities only implies that it has at most r facets — it can have a very large number of faces. Thus Alice can no more afford to write down an arbitrary face of Q than a face of P. We use a third-party prover to name a suitable facet of Q that enables Alice and Bob to compute the Face-Vertex(P) function; since Q has at most r facets, the protocol's communication cost is only log2 r, as desired.

8 Proofs of all of these statements are elementary but outside the scope of this lecture; see e.g. Ziegler (1995) for details.

Suppose the prover wants to convince Alice and Bob that Bob's vertex v of P does not belong to Alice's face f of P. If the prover can name a facet f* of Q such that:

(i) there exists y_v with (v, y_v) ∈ Q such that (v, y_v) ∉ f*; and
(ii) for every (x, y) ∈ Q with x ∈ f, (x, y) ∈ f*;

then this facet f* proves that v ∉ f. Moreover, given f*, Alice and Bob can verify (ii) and (i), respectively, without any communication. All that remains to prove is that, when v ∉ f, there exists a facet f* of Q such that (i) and (ii) hold.

First consider the inverse image of f in Q, f̃ = {(x, y) ∈ Q : x ∈ f}. Similarly, define ṽ = {(v, y) ∈ Q}. Since v ∉ f, f̃ and ṽ are disjoint subsets of Q. It is not difficult to prove that f̃ and ṽ, as inverse images of faces under a linear map, are faces of Q (exercise).
An intuitive but non-trivial fact is that every face of a polytope is the intersection of the facets that contain it.9 Thus, for every vertex v* of Q that is contained in ṽ (and hence not in f̃) — and since ṽ is non-empty, there is at least one — we can choose a facet f* of Q that contains f̃ (property (ii)) but excludes v* (property (i)). This concludes the proof sketch of Lemma 5.5.

9 This follows from Farkas's Lemma, or equivalently the Separating Hyperplane Theorem. See Ziegler (1995) for details.

5.3.4 Proof Sketch of Lemma 5.5: An Algebraic Argument

The next proof sketch of Lemma 5.5 is a bit longer but introduces some of the most important concepts in the study of extended formulations.

The slack matrix of a polytope P has rows indexed by faces F and columns indexed by vertices V. We identify each face with a canonical supporting hyperplane (a, b). Entry S_fv of the slack matrix is defined as b − a^T v, where (a, b) is the supporting hyperplane corresponding to the face f. Observe that all entries of S are nonnegative.

Define the support supp(S) of the slack matrix S as the |F| × |V| matrix with 1-entries wherever S has positive entries, and 0-entries wherever S has 0-entries. Observe that supp(S) is a property only of the polytope P, independent of the choices of the supporting hyperplanes for the faces of P. Observe also that supp(S) is precisely the answer matrix for the Face-Vertex(P) problem for the polytope P.

We next identify a sufficient condition for Face-Vertex(P) to have low nondeterministic communication complexity; later we explain why the existence of a small extended formulation implies this sufficient condition.
Suppose the slack matrix S has nonnegative rank r, meaning it is possible to write S = TU with T a |F| × r nonnegative matrix and U an r × |V| nonnegative matrix (Figure 5.3).10 Equivalently, suppose we can write S as the sum of r outer products of nonnegative vectors (indexed by F and V):

S = Σ_{j=1}^r α_j · β_j^T,    (5.8)

where the α_j's correspond to the columns of T and the β_j's to the rows of U.

Figure 5.3 A rank-r factorization of the slack matrix S into nonnegative matrices T and U: S = TU, with T ≥ 0 (rows indexed by f, r columns) and U ≥ 0 (r rows, columns indexed by v).

We claim that if the slack matrix S of a polytope P has nonnegative rank r, then there is a nondeterministic communication protocol for Face-Vertex(P) with cost at most log2 r. As usual, Alice and Bob can agree on the decomposition (5.8) in advance. A key observation is that, by inspection of (5.8), S_fv > 0 if and only if there exists some j ∈ {1, 2, . . . , r} such that both the f-th component of α_j and the v-th component of β_j are strictly positive. (We are using here that everything is nonnegative and so no cancellations are possible.) Equivalently, the supports of the outer products α_j · β_j^T can be viewed as a covering of the 1-entries of supp(S) by r 1-rectangles. Given this observation, the protocol for Face-Vertex(P) should be clear.

10 This is called a nonnegative matrix factorization. It is the analog of the singular value decomposition (SVD), but with the extra constraint that the factors are nonnegative matrices. It obviously only makes sense to ask for such decompositions for nonnegative matrices (like S).
1. The prover announces an index j ∈ {1, 2, . . . , r}.
2. Alice accepts if and only if the f-th component of α_j is strictly positive.
3. Bob accepts if and only if the v-th component of β_j is strictly positive.
The communication cost of the protocol is clearly log2 r. The key observation above implies that there is a proof (i.e., an index j ∈ {1, 2, . . . , r}) accepted by both Alice and Bob if and only if Bob’s vertex v does not belong to Alice’s face f.
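The key observation can be checked on a concrete slack matrix. The sketch below builds the slack matrix of the square [0, 1]^2 (a small example of our choosing), uses the trivial nonnegative factorization S = S · I, and verifies that the supports of the r outer products cover exactly the 1-entries of supp(S) — which is exactly the answer matrix of Face-Vertex for the square.

```python
import numpy as np

# Slack matrix of the square [0,1]^2: rows are the four facets a^T x <= b,
# columns the four vertices; S[f, v] = b - a^T v.
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)  # facet normals a
b = np.array([1, 0, 1, 0], dtype=float)                        # right-hand sides
V = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)    # vertices
S = b[:, None] - A @ V.T
assert (S >= 0).all()            # slack-matrix entries are nonnegative

# A (trivial) nonnegative factorization S = T U with r = 4: T = S, U = I.
T, U = S, np.eye(4)
for j in range(T.shape[1]):      # proof index j names the rectangle A_j x B_j
    A_j = np.nonzero(T[:, j])[0]     # faces f with alpha_j(f) > 0
    B_j = np.nonzero(U[j, :])[0]     # vertices v with beta_j(v) > 0
    print(f"rectangle {j}: A = {A_j.tolist()}, B = {B_j.tolist()}")

# No cancellations are possible, so the supports cover exactly supp(S):
covered = ((T > 0).astype(int) @ (U > 0).astype(int)) > 0
assert np.array_equal(covered, S > 0)
```

With a smaller nonnegative rank r, the same check would yield a covering by fewer rectangles, and hence a cheaper nondeterministic protocol.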
It remains to prove that, whenever a polytope P admits an extended formulation with a small number of inequalities, its slack matrix admits a low-rank nonnegative matrix factorization.11 We'll show this by exhibiting nonnegative r-vectors λ_f (for all faces f of P) and μ_v (for all vertices v of P) such that S_fv = λ_f^T μ_v for all f and v. In terms of Figure 5.3, the λ_f's and μ_v's correspond to the rows of T and columns of U, respectively.
The next task is to understand better how an extended formulation Q = {(x, y) : Cx + Dy ≤ d} must be related to the original polytope P. Given that projecting Q onto the variables x yields P, it must be that every supporting hyperplane of P is logically implied by the inequalities that define Q. To see one way how this can happen, suppose there is a non-negative r-vector λ ∈ R^r_+ with the following properties:
(P1) λ^T C = a^T;
(P2) λ^T D = 0;
(P3) λ^T d = b.
(P1)–(P3) imply that, for every (x, y) in Q (and so with Cx + Dy ≤ d), we have

(λ^T C) x + (λ^T D) y ≤ λ^T d,

where λ^T C = a^T, λ^T D = 0, and λ^T d = b, and hence a^T x ≤ b (no matter what y is).
Nonnegative linear combinations λ of the constraints of Q that satisfy (P1)–(P3) are one way in which the constraints of Q imply constraints on the values of x in the projection of Q. A straightforward application of Farkas's Lemma (see e.g. Chvátal (1983)) implies that such nonnegative linear combinations are the only way in which the constraints of Q imply constraints on the projection of Q.12 Put differently, whenever a^T x ≤ b is a supporting hyperplane of P, there exists a nonnegative linear combination λ that proves it (i.e., that satisfies (P1)–(P3)). This clarifies what the extended formulation Q really accomplishes:
11 The converse also holds, and might well be the easier direction to anticipate. See the Exercises for details.

12 Farkas's Lemma is sometimes phrased as the Separating Hyperplane Theorem. It can also be thought of as the feasibility version of strong linear programming duality.

ranging over all λ ∈ R^r_+ satisfying (P2) generates all of the supporting hyperplanes (a, b) of P (with a and b arising as λ^T C and λ^T d, respectively).
To define the promised λf’s and μv’s, fix a face f of P with supporting hyperplane aTx ≤ b. Since Q’s projection does not include any points not in P, the constraints of Q imply this supporting hyperplane. By the previous paragraph, we can choose a nonnegative vector λf so that (P1)–(P3) hold.
Now fix a vertex v of P. Since Q's projection includes every point of P, there exists a choice of y_v such that (v, y_v) ∈ Q. Define μ_v ∈ R^r_+ as the slack in Q's constraints at the point (v, y_v):

μ_v = d − Cv − Dy_v.

Since (v, y_v) ∈ Q, μ_v is a nonnegative vector.
Finally, for every face f of P and vertex v of P , we have
λ_f^T μ_v = λ_f^T d − λ_f^T C v − λ_f^T D y_v = b − a^T v = S_fv

(using λ_f^T d = b, λ_f^T C = a^T, and λ_f^T D = 0),
as desired. This completes the second proof of Lemma 5.5.
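The identity λ_f^T μ_v = S_fv can be verified end-to-end on a toy extended formulation. In the sketch below (our own made-up example), P = [0, 1] is the projection of Q = {(x, y) : x − y ≤ 0, y ≤ 1, −x ≤ 0}, and the multipliers λ_f satisfying (P1)–(P3) were found by hand.

```python
import numpy as np

# Q = {(x, y) : Cx + Dy <= d} with one original and one auxiliary variable.
C = np.array([[1.0], [0.0], [-1.0]])   # coefficients of x in the 3 constraints
D = np.array([[-1.0], [1.0], [0.0]])   # coefficients of y
d = np.array([0.0, 1.0, 0.0])

# Faces of P = [0,1]: vertex {1} from x <= 1 (a=1, b=1), vertex {0} from
# -x <= 0 (a=-1, b=0). Hand-found multipliers satisfying (P1)-(P3):
lam = {1: np.array([1.0, 1.0, 0.0]),   # lam^T C = 1, lam^T D = 0, lam^T d = 1
       0: np.array([0.0, 0.0, 1.0])}   # lam^T C = -1, lam^T D = 0, lam^T d = 0
faces = {1: (1.0, 1.0), 0: (-1.0, 0.0)}   # face -> supporting hyperplane (a, b)

for v, y_v in [(0.0, 0.0), (1.0, 1.0)]:   # vertex v with a lift (v, y_v) in Q
    mu = d - C[:, 0] * v - D[:, 0] * y_v  # mu_v: slack of Q at (v, y_v)
    assert (mu >= -1e-12).all()           # (v, y_v) in Q => mu_v >= 0
    for fkey, (a, b) in faces.items():
        S_fv = b - a * v                  # slack-matrix entry
        assert abs(lam[fkey] @ mu - S_fv) < 1e-12   # lam_f^T mu_v = S_fv
print("lambda_f^T mu_v = S_fv verified on all face/vertex pairs")
```

Here r = 3, so this factorization certifies a (trivial) nondeterministic protocol of cost log2 3 for Face-Vertex of the interval.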
5.4 A Lower Bound for the Correlation Polytope

5.4.1 Overview
Lemma 5.5 reduces the task of proving lower bounds on the size of extended formulations of a polytope P to proving lower bounds on the nondeterministic communication complexity of Face-Vertex(P). The case study of the permutahedron (Section 5.1.2) serves as a cautionary tale here: the communication complexity of Face-Vertex(P) is surprisingly low for some complex-seeming polytopes, so proving strong lower bounds, when they exist, typically requires work and a detailed understanding of the particular polytope of interest.
Fiorini et al. (2015) were the first to use Yannakakis’s Lemma to prove lower bounds on the size of extended formulations of interesting polytopes.13 We follow the proof plan of Fiorini et al. (2015), which has two steps.
1. First, we exhibit a polytope that is tailor-made for proving a nondeterministic com- munication complexity lower bound on the corresponding Face-Vertex(P) problem, via a reduction from Disjointness. We’ll prove this step in full.
2. Second, we extend the consequent lower bound on the size of extended formulations to other problems, such as the Traveling Salesman Problem (TSP), via reductions. These reductions are bread-and-butter NP-completeness-style reductions; see the Exercises for more details.
13 This paper won the Best Paper Award at STOC '12.

This two-step plan does not seem sufficient to resolve the motivating problem mentioned in Section 5.1, the non-bipartite matching problem. For an NP-hard problem like TSP, we fully expect all extended formulations of the convex hull of the characteristic vectors of solutions to have exponential size; otherwise, we could use linear programming to obtain a subexponential-time algorithm for the problem (an unlikely result). The non-bipartite matching problem is polynomial-time solvable, so it's less clear what to expect. Rothvoß (2014) proved that every extended formulation of the convex hull of the perfect matchings of the complete graph has exponential size.14 The techniques in Rothvoß (2014) are more sophisticated variations of the tools covered in this lecture — a reader of these notes is well-positioned to learn them.
5.4.2 Preliminaries
We describe a polytope P for which it’s relatively easy to prove nondeterministic commu- nication complexity lower bounds for the corresponding Face-Vertex(P) problem. The polytope was studied earlier for other reasons (Pitowsky, 1991; de Wolf, 2003).
Given a 0-1 n-bit vector x, we consider the corresponding (symmetric and rank-1) outer product xx^T. For example, if x = 10101, then

         [ 1 0 1 0 1 ]
         [ 0 0 0 0 0 ]
  xx^T = [ 1 0 1 0 1 ] .
         [ 0 0 0 0 0 ]
         [ 1 0 1 0 1 ]
For a positive integer n, we define COR as the convex hull of all 2^n such vectors xx^T (ranging over x ∈ {0, 1}^n). This is a polytope in R^{n^2}, and its vertices are precisely the points xx^T with x ∈ {0, 1}^n.
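The definition is easy to experiment with; here is a minimal pure-Python sketch (the helper names are ours) that builds the outer product from the example above and enumerates the vertices of COR for small n:

```python
import itertools

def outer(x):
    """The outer product x x^T of a 0-1 vector x, as a list of rows."""
    return [[xi * xj for xj in x] for xi in x]

def cor_vertices(n):
    """The 2^n vertices of COR: one matrix x x^T per x in {0,1}^n."""
    return [outer(x) for x in itertools.product([0, 1], repeat=n)]

# The example from the text: x = 10101.
for row in outer((1, 0, 1, 0, 1)):
    print(row)
```

Each vertex is a symmetric 0-1 matrix of rank at most 1, matching the description in the text.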
Our goal is to prove the following result.
Theorem 5.6 (Fiorini et al. 2015) The nondeterministic communication complexity of Face-Vertex(COR) is Ω(n).
This lower bound is clearly the best possible (up to a constant factor), since Bob can communicate his vertex to Alice using only n bits (by specifying the appropriate x ∈ {0, 1}^n).
Lemma 5.5 then implies that every extended formulation of the COR polytope requires 2^{Ω(n)} inequalities, no matter how many auxiliary variables are added. Note the dimension d
14This paper won the Best Paper Award at STOC ’14.

is Θ(n^2), so this lower bound has the form 2^{Ω(√d)}.
Elementary reductions (see the Exercises) translate this extension complexity lower bound for the COR polytope to a lower bound of 2^{Ω(√n)} on the size of extended formulations of the convex hull of characteristic vectors of n-point traveling salesman tours.

5.4.3 Some Faces of the Correlation Polytope
Next we establish a key connection between certain faces of the correlation polytope and inputs to Disjointness. Throughout, n is a fixed positive integer.
Lemma 5.7 (Fiorini et al. 2015) For every subset S ⊆ {1, 2, . . . , n}, there is a face fS of COR such that: for every R ⊆ {1, 2, . . . , n} with characteristic vector xR and corresponding vertex vR = xR xR^T of COR,

vR ∈ fS if and only if |S ∩ R| = 1.
That is, among the faces of COR are 2^n faces that encode the "unique intersection property" for each of the 2^n subsets S of {1, 2, . . . , n}. Note that for a given S, the sets R with |S ∩ R| = 1 can be generated by (i) first picking an element of S; (ii) then picking a subset of {1, 2, . . . , n} \ S. Thus if |S| = k, there are k · 2^{n−k} sets R with which it has a unique intersection.
Lemma 5.7 is kind of amazing, but also not too hard to prove.
Proof of Lemma 5.7: For every S ⊆ {1, 2, . . . , n}, we need to exhibit a supporting hyperplane a^T x ≤ b such that a^T vR = b if and only if |S ∩ R| = 1, where vR denotes xR xR^T and xR the characteristic vector of R ⊆ {1, 2, . . . , n}.
Fix S ⊆ {1, 2, . . . , n}. We develop the appropriate supporting hyperplane, in variables y ∈ R^{n^2}, over several small steps.
1. For clarity, let's start in the wrong space, with variables z ∈ R^n rather than y ∈ R^{n^2}. Here z is meant to encode the characteristic vector of a set R ⊆ {1, 2, . . . , n}. One sensible inequality to start with is
∑_{i∈S} zi − 1 ≥ 0.   (5.9)
For example, if S = {1, 3}, then this constraint reads z1 + z3 − 1 ≥ 0.
The good news is that for 0-1 vectors xR, this inequality is satisfied with equality if and only if |S ∩ R| = 1. The bad news is that it does not correspond to a supporting hyperplane: if S and R are disjoint, then xR violates the inequality. How can we change the constraint so that it holds with equality for xR with |S ∩ R| = 1 and is also valid for all R?
2. One crazy idea is to square the left-hand side of (5.9):
( ∑_{i∈S} zi − 1 )^2 ≥ 0.   (5.10)
For example, if S = {1, 3}, then the constraint reads (after expanding) z1^2 + z3^2 + 2 z1 z3 − 2 z1 − 2 z3 + 1 ≥ 0.

The good news is that every 0-1 vector xR satisfies this inequality, and equality holds if and only if |S ∩ R| = 1. The bad news is that the constraint is non-linear and hence does not correspond to a supporting hyperplane.
3. The obvious next idea is to "linearize" the previous constraint. Wherever the constraint has a zi^2 or a zi, we replace it by a variable yii (note that these partially cancel). Wherever the constraint has a 2 zi zj (and notice that for i ≠ j these always come in pairs), we replace it by yij + yji. Formally, the constraint now reads
−∑_{i∈S} yii + ∑_{i,j∈S: i≠j} yij + 1 ≥ 0.   (5.11)
Note that the new variable set is y ∈ R^{n^2}. For example, if S = {1, 3}, then the new constraint reads y13 + y31 − y11 − y33 ≥ −1.
A first observation is that, for y's that are 0-1, symmetric, and rank-1, with y = zz^T (hence yij = zi · zj for i, j ∈ {1, 2, . . . , n}), the left-hand sides of (5.10) and (5.11) are the same by definition. Thus, for y = xR xR^T with xR ∈ {0, 1}^n, y satisfies the (linear) inequality (5.11), and equality holds if and only if |S ∩ R| = 1.
We have shown that, for every S ⊆ {1, 2, . . . , n}, the linear inequality (5.11) is satisfied by every vector y ∈ R^{n^2} of the form y = xR xR^T with xR ∈ {0, 1}^n. Since COR is by definition the convex hull of such vectors, every point of COR satisfies (5.11). This inequality therefore defines a supporting hyperplane, and the face it induces contains precisely those vertices of the form xR xR^T with |S ∩ R| = 1. This completes the proof. □
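Since everything here is finite, Lemma 5.7 can also be checked by brute force for small n. The following sketch (our own code; at a vertex, the left-hand side of (5.11) collapses to (|S ∩ R| − 1)^2) verifies both validity and tightness:

```python
import itertools

def subsets(n):
    """All subsets of {0, ..., n-1}, as frozensets."""
    for bits in itertools.product([0, 1], repeat=n):
        yield frozenset(i for i in range(n) if bits[i])

def lhs_511(S, R):
    """Left-hand side of (5.11) at the vertex y = x_R x_R^T (y_ij = [i in R][j in R])."""
    y = lambda i, j: int(i in R and j in R)
    return (-sum(y(i, i) for i in S)
            + sum(y(i, j) for i in S for j in S if i != j)
            + 1)

n = 4
for S in subsets(n):
    for R in subsets(n):
        val = lhs_511(S, R)
        assert val >= 0                         # (5.11) is valid at every vertex
        assert (val == 0) == (len(S & R) == 1)  # and tight iff |S ∩ R| = 1
print("Lemma 5.7 verified for n =", n)
```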
5.4.4 Face-Vertex(COR) and Unique-Disjointness
In the Face-Vertex(COR) problem, Alice receives a face f of COR and Bob a vertex v of COR. In the 1-inputs, v ∉ f; in the 0-inputs, v ∈ f. Let's make the problem only easier by restricting Alice's possible inputs to the 2^n faces (one per subset S ⊆ {1, 2, . . . , n}) identified in Lemma 5.7. In the corresponding matrix MU of this function, we can index the rows by subsets S. Since every vertex of COR has the form y = xR xR^T for R ⊆ {1, 2, . . . , n}, we can index the columns of MU by subsets R. By Lemma 5.7, the entry (S, R) of the matrix MU is 1 if |S ∩ R| ≠ 1 and 0 if |S ∩ R| = 1. That is, the 0-entries of MU correspond to pairs (S, R) that intersect in a unique element.
There is clearly a strong connection between the matrix MU above and the analogous
matrix MD for Disjointness. They differ on entries (S, R) with |S ∩ R| ≥ 2: these are 0-entries of MD but 1-entries of MU . In other words, MU is the matrix corresponding to the communication problem ¬Unique-Intersection: do the inputs S and R fail to have a unique intersection?
The closely related Unique-Disjointness problem is a “promise” version of Disjoint- ness. The task here is to distinguish between:

(1) inputs (S, R) of Disjointness with |S ∩ R| = 0;
(0) inputs (S, R) of Disjointness with |S ∩ R| = 1.
For inputs that fall into neither case (with |S ∩ R| > 1), the protocol is off the hook — any output is considered correct. Since a protocol that solves Unique-Disjointness has less to do than one that solves ¬Unique-Intersection, communication complexity lower bounds for the former problem apply immediately to the latter.
We summarize the discussion of this section in the following proposition.
Proposition 5.8 (Fiorini et al. 2015) The nondeterministic communication complexity of Face-Vertex(COR) is at least that of Unique-Disjointness.
5.4.5 A Lower Bound for Unique-Disjointness

The Goal
One final step remains in our proof of Theorem 5.6, and hence of our lower bound on the size of extended formulations of the correlation polytope.
Theorem 5.9 (Fiorini et al. 2015; Kaibel and Weltge 2015) The nondeterministic communication complexity of Unique-Disjointness is Ω(n).
Disjointness Revisited
As a warm-up, we revisit the standard Disjointness problem. Recall that, in Lecture 4, we proved that the nondeterministic communication complexity of Disjointness is at least n by a fooling set argument (Corollary 4.8). Next we prove a slightly weaker lower bound, via an argument that generalizes to Unique-Disjointness.
The first claim is that, of the 2^n × 2^n = 4^n possible inputs of Disjointness, exactly 3^n of them are 1-inputs. The reason is that the following procedure, which makes n 3-way choices, generates every 1-input exactly once: independently for each coordinate i = 1, 2, . . . , n, choose between the options (i) xi = yi = 0; (ii) xi = 1 and yi = 0; and (iii) xi = 0 and yi = 1.
The second claim is that every 1-rectangle — every subset A of rows of MD and B of columns of MD such that A × B contains only 1-inputs — has size at most 2^n. To prove this, let R = A × B be a 1-rectangle. We assert that, for every coordinate i = 1, 2, . . . , n, either (i) xi = 0 for all x ∈ A or (ii) yi = 0 for all y ∈ B. That is, every coordinate has, for at least one of the two parties, a "forced zero" in R. For if neither (i) nor (ii) holds for a coordinate i, then since R is a rectangle (and hence closed under "mix and match") we can choose (x, y) ∈ R with xi = yi = 1; but this is a 0-input and R is a 1-rectangle. This assertion implies that the following procedure, which makes n 2-way choices, generates every 1-input of R (and possibly other inputs as well): independently for each coordinate

i = 1,2,…,n, set the forced zero (xi = 0 in case (i) or yi = 0 in case (ii)) and choose a bit for this coordinate in the other input.
These two claims imply that every covering of the 1-inputs by 1-rectangles requires at least (3/2)^n rectangles. Proposition 5.3 then implies a lower bound of Ω(n) on the nondeterministic communication complexity of Disjointness.
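Both counting claims are easy to confirm exhaustively for small n; in this sketch (our own code) we also exhibit a 1-rectangle that attains the 2^n bound:

```python
import itertools

def one_inputs(n):
    """All disjoint pairs (x, y) of n-bit vectors: the 1-inputs of Disjointness."""
    return [(x, y)
            for x in itertools.product([0, 1], repeat=n)
            for y in itertools.product([0, 1], repeat=n)
            if all(xi == 0 or yi == 0 for xi, yi in zip(x, y))]

n = 4
assert len(one_inputs(n)) == 3 ** n  # first claim: exactly 3^n 1-inputs

# Second claim, illustrated: A x B below is a 1-rectangle (x-supports lie in
# {2,3}, y-supports in {0,1}), every coordinate has a forced zero on one side,
# and the rectangle has exactly 2^n 1-inputs, so the bound is tight.
A = [x for x in itertools.product([0, 1], repeat=n) if x[0] == x[1] == 0]
B = [y for y in itertools.product([0, 1], repeat=n) if y[2] == y[3] == 0]
for i in range(n):
    assert all(x[i] == 0 for x in A) or all(y[i] == 0 for y in B)
assert len(A) * len(B) == 2 ** n
```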
Proof of Theorem 5.9
Recall that the 1-inputs (x, y) of Unique-Disjointness are the same as those of Disjointness (for each i, either xi = 0, yi = 0, or both). Thus, there are still exactly 3^n 1-inputs. The 0-inputs (x, y) of Unique-Disjointness are those with xi = yi = 1 in exactly one coordinate i. We call all other inputs, where the promise fails to hold, *-inputs. By a 1-rectangle, we now mean a rectangle with no 0-inputs (*-inputs are fine). With this revised definition, it is again true that every nondeterministic communication protocol that solves Unique-Disjointness using c bits of communication induces a covering of the 1-inputs by at most 2^c 1-rectangles.
Lemma 5.10 Every 1-rectangle of Unique-Disjointness contains at most 2^n 1-inputs.
As with the argument for Disjointness, Lemma 5.10 completes the proof of Theorem 5.9: since there are 3^n 1-inputs and at most 2^n per 1-rectangle, every covering by 1-rectangles requires at least (3/2)^n rectangles. This implies that the nondeterministic communication complexity of Unique-Disjointness is Ω(n).
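For very small n, Lemma 5.10 itself can be confirmed by enumerating every rectangle, discarding those that contain a 0-input (a pair with a unique intersection), and counting the disjoint pairs in the rest. A brute-force sketch in our own code (n = 3 keeps the search at 255 × 255 candidate rectangles):

```python
n = 3
N = 1 << n  # the 2^n subsets of {1,...,n}, encoded as bitmasks 0..N-1
popcount = [bin(v).count("1") for v in range(N)]
# members[mask] lists which of the N subsets the family encoded by mask contains
members = [[s for s in range(N) if mask >> s & 1] for mask in range(1 << N)]

best = 0
for A_mask in range(1, 1 << N):
    A = members[A_mask]
    for B_mask in range(1, 1 << N):
        B = members[B_mask]
        # 1-rectangle: contains no 0-input (no pair with a unique intersection)
        if any(popcount[x & y] == 1 for x in A for y in B):
            continue
        # count its 1-inputs (disjoint pairs); *-inputs are allowed
        best = max(best, sum(1 for x in A for y in B if x & y == 0))

assert best == 1 << n  # no 1-rectangle beats 2^n; A = {empty set} attains it
```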
Why is the proof of Lemma 5.10 harder than the argument for Disjointness above? We can no longer easily argue that, in a rectangle R = A × B, for each coordinate i, either xi = 0 for all x ∈ A or yi = 0 for all y ∈ B. Assuming the opposite no longer yields a contradiction: exhibiting x ∈ A and y ∈ B with xi = yi = 1 does not necessarily contradict the fact that R is a 1-rectangle, since (x, y) might be a *-input.
Proof of Lemma 5.10: The proof is one of those slick inductions that you can’t help but sit back and admire.
We claim, by induction on k = 0, 1, 2, . . . , n, that if R = A × B is a 1-rectangle for which all x ∈ A and y ∈ B have 0s in their last n − k coordinates, then the number of 1-inputs in R is at most 2^k. The lemma is equivalent to the case of k = n. The base case k = 0 holds, because in this case the only possible input in R is (0, 0).
For the inductive step, fix a 1-rectangle R = A × B in which the last n − k coordinates of all x ∈ A and all y ∈ B are 0. To simplify notation, from here on we ignore the last n−k coordinates of all inputs (they play no role in the argument).
Intuitively, we need to somehow "zero out" the kth coordinate of all inputs in R so that we can apply the inductive hypothesis. This motivates focusing on the kth coordinate, and we'll often write inputs x ∈ A and y ∈ B as x′a and y′b, respectively, with x′, y′ ∈ {0, 1}^{k−1}

and a, b ∈ {0, 1}. (Recall we're ignoring the last n − k coordinates, which are now always zero.)
First observe that, whenever (x′a, y′b) is a 1-input, we cannot have a = b = 1. Also:

(*) If (x′a, y′b) ∈ R is a 1-input, then R cannot contain both the inputs (x′0, y′1) and (x′1, y′0).

For otherwise, R would also contain the 0-input (x′1, y′1), contradicting that R is a 1-rectangle. (Since (x′a, y′b) is a 1-input, the unique coordinate of (x′1, y′1) with a 1 in both inputs is the kth coordinate.)
The plan for the rest of the proof is to define two sets S1, S2 of 1-inputs — not necessarily rectangles — such that:
(P1) the number of 1-inputs in S1 and S2 combined is at least that in R;
(P2) the inductive hypothesis applies to rect(S1) and rect(S2), where rect(S) denotes the smallest rectangle containing a set S of inputs.15
If we can find sets S1, S2 with properties (P1), (P2), then we are done: by the inductive hypothesis, the rect(Si)'s have at most 2^{k−1} 1-inputs each, the Si's are only smaller, and hence (by (P1)) R has at most 2^k 1-inputs, as required.
We define the sets in two steps, focusing first on property (P1). Recall that every 1-input (x, y) ∈ R has the form (x′1, y′0), (x′0, y′1), or (x′0, y′0). We put all 1-inputs of the first type into a set S1′ , and all 1-inputs of the second type into a set S2′ . When placing inputs of the third type, we want to avoid putting two inputs of the form (x′a,y′b) with the same x′ and y′ into the same set (this would create problems in the inductive step). So, for an input (x′0, y′0) ∈ R, we put it in S1′ if and only if the input (x′1, y′0) was not already put
in S1′; and we put it in S2′ if and only if the input (x′0,y′1) was not already put in S2′. Crucially, observation (*) implies that R cannot contain two 1-inputs of the form (x′1,y′0) and (x′0, y′1), so the 1-input (x′0, y′0) is placed in at least one of the sets S1′ , S2′ . (It is placed in both if R contains neither (x′1,y′0) nor (x′0,y′1).) By construction, the sets S1′ and S2′ satisfy property (P1).
We next make several observations about S1′ and S2′ . By construction:
(**) for each i = 1, 2 and x′, y′ ∈ {0, 1}^{k−1}, there is at most one input of Si′ of the form (x′a, y′b).
Also, since S1′ , S2′ are subsets of the rectangle R, rect(S1′ ), rect(S2′ ) are also subsets of R.
Since R is a 1-rectangle, so are rect(S1′ ), rect(S2′ ). Also, since every input (x, y) of Si′ (and
15Equivalently, the closure of S under the "mix and match" operation on pairs of inputs. Formally, rect(S) = X(S) × Y(S), where X(S) = {x : (x, y) ∈ S for some y} and Y(S) = {y : (x, y) ∈ S for some x}.

hence rect(Si′)) has yk = 0 (for i = 1) or xk = 0 (for i = 2), the kth coordinate contributes nothing to the intersection of any inputs of rect(S1′ ) or rect(S2′ ).
Now obtain Si from Si′ (for i = 1,2) by zeroing out the kth coordinate of all inputs. Since the Si′’s only contain 1-inputs, the Si’s only contain 1-inputs. Since property (**) implies that |Si| = |Si′| for i = 1, 2, we conclude that property (P1) holds also for S1, S2.
Moving on to property (P2), since rect(S1′), rect(S2′) contain no 0-inputs and contain only inputs with no intersection in the kth coordinate, rect(S1), rect(S2) contain no 0-inputs.16 Finally, since all inputs of S1, S2 have zeroes in their final n − k + 1 coordinates, so do all inputs of rect(S1), rect(S2). The inductive hypothesis applies to rect(S1) and rect(S2), so each of them has at most 2^{k−1} 1-inputs. This implies the inductive step and completes the proof. □
16The concern is that zeroing out an input in the kth coordinate turns some *-input (with intersection size 2) into a 0-input (with intersection size 1); but since there were no intersections in the kth coordinate anyway, this can't happen.

Lecture 6
Lower Bounds for Data Structures
6.1 Preamble
Next we discuss how to use communication complexity to prove lower bounds on the performance — meaning space, query time, and approximation — of data structures. Our case study will be the high-dimensional approximate nearest neighbor problem.
There is a large literature on data structure lower bounds. There are several different ways to use communication complexity to prove such lower bounds, and we’ll unfortunately only have time to discuss one of them. For example, we discuss only a static data structure problem — where the data structure can only be queried, not modified — and lower bounds for dynamic data structures tend to use somewhat different techniques. See Miltersen (1999) and Pătraşcu (2008) for some starting points for further reading.
We focus on the approximate nearest neighbor problem for a few reasons: it is obviously a fundamental problem that gets solved all the time (in data mining, for example); there are some non-trivial upper bounds; for certain parameter ranges, we have matching lower bounds; and the techniques used to prove these lower bounds are representative of work in the area — asymmetric communication complexity and reductions from the "Lopsided Disjointness" problem.
6.2 The Approximate Nearest Neighbor Problem
In the nearest neighbor problem, the input is a set S of n points that lie in a metric space (X, l). Most commonly, the metric space is Euclidean space (Rd with the l2 norm). In these lectures, we’ll focus on the Hamming cube, where X = {0, 1}d and l is Hamming distance.
On the upper bound side, the high-level ideas (related to "locality sensitive hashing (LSH)") we use are also relevant for Euclidean space and other natural metric spaces — we'll get a glimpse of this at the very end of the lecture. On the lower bound side, you won't be surprised to hear that the Hamming cube is the easiest metric space to connect directly to the standard communication complexity model. Throughout the lecture, you'll want to think of d as pretty big — say d = √n.
Returning to the general nearest neighbor problem, the goal is to build a data structure
D (as a function of the point set S ⊆ X) to prepare for all possible nearest neighbor queries. Such a query is a point q ∈ X, and the responsibility of the algorithm is to use D to return the point p∗ ∈ S that minimizes l(p,q) over p ∈ S. One extreme solution is to build no

data structure at all, and given q to compute p∗ by brute force. Assuming that computing l(p, q) takes O(d) time, this query algorithm runs in time O(dn). The other extreme is to pre-compute the answer to every possible query q, and store the results in a look-up table. For the Hamming cube, this solution uses Θ(d · 2^d) space, and the query time is that of one look-up in this table. The exact nearest neighbor problem is believed to suffer from the "curse of dimensionality," meaning that a non-trivial query time (sublinear in n, say) requires a data structure with space exponential in d.
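The brute-force extreme is a one-liner; here is a sketch (our own helper names) for the Hamming cube:

```python
def hamming(p, q):
    """Hamming distance between two equal-length 0-1 tuples."""
    return sum(pi != qi for pi, qi in zip(p, q))

def brute_force_nn(S, q):
    """O(dn)-time exact nearest neighbor: scan all n points, no data structure."""
    return min(S, key=lambda p: hamming(p, q))

S = [(0, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 1)]
print(brute_force_nn(S, (1, 1, 0, 0)))  # (1, 1, 0, 0)
```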
There have been lots of exciting positive results for the (1 + ε)-approximate version of the nearest neighbor problem, where the query algorithm is only required to return a point p with l(p, q) ≤ (1 + ε)l(p∗, q), where p∗ is the (exact) nearest neighbor of q. This is the problem we discuss in these lectures. You’ll want to think of ε as a not-too-small constant, perhaps 1 or 2. For many of the motivating applications of the nearest-neighbor problem — for example, the problem of detecting near-duplicate documents (e.g., to filter search results)
— such approximate solutions are still practically relevant.
6.3 An Upper Bound: Biased Random Inner Products
In this section we present a non-trivial data structure for the (1 + ε)-approximate nearest neighbor problem in the Hamming cube. The rough idea is to hash the Hamming cube and then precompute the answer for each of the hash table’s buckets — the trick is to make sure that nearby points are likely to hash to the same bucket. Section 6.4 proves a sense in which this data structure is the best possible: no data structure for the problem with equally fast query time uses significantly less space.
6.3.1 The Key Idea (via a Public-Coin Protocol)
For the time being, we restrict attention to the decision version of the (1+ε)-nearest neighbor problem. Here, the data structure construction depends both on the point set S ⊆ {0, 1}d and on a given parameter L ∈ {0,1,2,…,d}. Given a query q, the algorithm only has to decide correctly between the following cases:
1. There exists a point p ∈ S with l(p,q) ≤ L.
2. l(p, q) ≥ (1 + ε)L for every point p ∈ S.
If neither of these two cases applies, then the algorithm is off the hook (either answer is regarded as correct). We’ll see in Section 6.3.3 how, using this solution as an ingredient, we can build a data structure for the original version of the nearest neighbor problem.
Recall that upper bounds on communication complexity are always suspect — by design, the computational model is extremely powerful so that lower bounds are as impressive as possible. There are cases, however, where designing a good communication protocol reveals

the key idea for a solution that is realizable in a reasonable computational model. Next is the biggest example of this that we’ll see in the course.
In the special case where S contains only one point, the decision version of the (1 + ε)- approximate nearest neighbor problem resembles two other problems that we’ve studied before in other contexts, one easy and one hard.
1. Equality. Recall that when Alice and Bob just want to decide whether their inputs are the same or different — equivalently, deciding between Hamming distance 0 and Hamming distance at least 1 — there is an unreasonably effective (public-coin) randomized communication protocol for the problem. Alice and Bob interpret the first 2n public coins as two random n-bit strings r1, r2. Alice sends the inner products modulo 2 of her input x with r1 and r2 (2 bits) to Bob. Bob accepts if and only if the two inner products modulo 2 of his input y with r1, r2 match those of Alice. This protocol never rejects inputs with x = y, and accepts inputs with x ≠ y with probability 1/4.
2. Gap-Hamming. Recall that in this problem Alice and Bob want to decide between the cases where the Hamming distance between their inputs is at most n/2 − √n, or at least n/2 + √n. In Theorem 2.5 of Section 2.7, we proved that this problem is hard for one-way communication protocols (via a clever reduction from Index); it is also hard for general communication protocols (Chakrabarti and Regev, 2012; Sherstov, 2012; Vidick, 2011). Note however that the gap between the two cases is very small, corresponding to ε ≈ 2/√n. In the decision version of the (1 + ε)-approximate nearest neighbor problem, we're assuming a constant-factor gap in Hamming distance between the two cases, so there's hope that the problem is easier.
Consider now the communication problem where Alice and Bob want to decide if the Hamming distance lH(x,y) between their inputs x,y ∈ {0,1}d is at most L or at least (1 + ε)L. We call this the ε-Gap Hamming problem. We analyze the following protocol;
we’ll see shortly how to go from this protocol to a data structure.
1. Alice and Bob use the public random coins to choose s random strings R = {r1, . . . , rs} from {0, 1}^d, where s = Θ(ε^{−2}). The strings are not uniformly random: each entry is chosen independently, with probability 1/(2L) of being equal to 1.
2. Alice sends the s inner products (modulo 2) ⟨x, r1⟩, . . . , ⟨x, rs⟩ of her input with the random strings to Bob — a "hash value" hR(x) ∈ {0, 1}^s.
3. Bob accepts if and only if the corresponding hash value hR(y) of his input — the s inner products (modulo 2) of y with the random strings in R — differs from hR(x) in only a "small" (TBD) number of coordinates.

Intuitively, the goal is to modify our randomized communication protocol for Equality so that it continues to accept strings that are close to being equal. A natural way to do this is to bias the coefficient vectors significantly more toward 0 than before. For example, if x, y differ in only a single bit, then choosing r uniformly at random results in ⟨x, r⟩ ≢ ⟨y, r⟩ mod 2 with probability 1/2 (if and only if r has a 1 in the coordinate where x, y differ). With probability 1/(2L) of a 1 in each coordinate of r, the probability that ⟨x, r⟩ ≢ ⟨y, r⟩ mod 2 is only 1/(2L). Unlike our Equality protocol, this protocol for ε-Gap Hamming has two-sided error.
For the analysis, it’s useful to think of each random choice of a coordinate rji as occurring in two stages: in the first stage, the coordinate is deemed relevant with probability 1/L and irrelevant otherwise. In stage 2, rji is set to 0 if the coordinate is irrelevant, and chosen uniformly from {0, 1} if the coordinate is relevant. We can therefore think of the protocol as:
(i) first choosing a subset of relevant coordinates; (ii) running the old Equality protocol on these coordinates only. With this perspective, we see that if lH (x, y) = ∆, then
Pr_{rj}[⟨rj, x⟩ ≢ ⟨rj, y⟩ mod 2] = (1/2) · (1 − (1 − 1/L)^∆)   (6.1)

for every rj ∈ R. In (6.1), the quantity inside the outer parentheses is exactly the probability that at least one of the ∆ coordinates on which x, y differ is deemed relevant. This is a necessary condition for the event ⟨rj, x⟩ ≢ ⟨rj, y⟩ mod 2 and, in this case, the conditional probability of this event is exactly 1/2 (as in the old Equality protocol).
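Equation (6.1) can be double-checked numerically: ⟨rj, x⟩ and ⟨rj, y⟩ differ mod 2 exactly when an odd number of the ∆ differing coordinates of rj equal 1, each independently with probability 1/(2L). This sketch (our own helper names) compares the closed form against direct enumeration:

```python
import itertools

def prob_differ_exact(delta, L):
    """Pr[an odd number of the delta differing coordinates are set], each w.p. 1/(2L)."""
    p = 1 / (2 * L)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=delta):
        k = sum(bits)
        if k % 2 == 1:
            total += p ** k * (1 - p) ** (delta - k)
    return total

def prob_differ_formula(delta, L):
    """The right-hand side of (6.1)."""
    return 0.5 * (1 - (1 - 1 / L) ** delta)

for delta in range(10):
    assert abs(prob_differ_exact(delta, 4) - prob_differ_formula(delta, 4)) < 1e-12
```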
The probability in (6.1) is an increasing function of ∆, as one would expect. Let t denote the probability in (6.1) when ∆ = L. We're interested in how much bigger this probability is when ∆ is at least (1 + ε)L. We can take the difference between these two cases and bound it below using the fact that 1 − x ∈ [e^{−2x}, e^{−x}] for x ∈ [0, 1/2]:
  
(1/2) · (1 − 1/L)^L · (1 − (1 − 1/L)^{εL}) ≥ (1/2) · e^{−2} · (1 − e^{−ε}) =: h(ε),

where the inequality uses (1 − 1/L)^L ≥ e^{−2} for the first factor and (1 − 1/L)^{εL} ≤ e^{−ε} for the second.
Note that h(ε) is a constant, depending on ε only. Thus, with s random strings, if lH(x, y) ≤ L then we expect at most ts of the random inner products (modulo 2) to be different; if lH(x, y) ≥ (1 + ε)L, then we expect at least (t + h(ε))s of them to be different. A routine application of Chernoff bounds implies the following.
Corollary 6.1 Define the hash function hR as in the communication protocol above. If s = Ω(ε^{−2} log(1/δ)), then with probability at least 1 − δ over the choice of R:

(i) if lH(x, y) ≤ L, then lH(hR(x), hR(y)) ≤ (t + h(ε)/2) · s;

(ii) if lH(x, y) ≥ (1 + ε)L, then lH(hR(x), hR(y)) > (t + h(ε)/2) · s.
We complete the communication protocol above by defining "small" in the third step as (t + h(ε)/2) · s. We conclude that the ε-Gap Hamming problem can be solved by a public-coin randomized protocol with two-sided error and communication cost Θ(ε^{−2}).
6.3.2 The Data Structure (Decision Version)
We now show how to translate the communication protocol above into a data structure for the (1 + ε)-nearest neighbor problem. For the moment, we continue to restrict to the decision version of the problem, for an a priori known value of L ∈ {0, 1, 2, . . . , d}. All we do is precompute the answers for all possible hash values of a query (an “inverted index”).
Given a point set P of n points in {0, 1}^d, we choose a set R of s = Θ(ε^{−2} log n) random strings r1, . . . , rs according to the distribution of the previous section (with a "1" chosen with probability 1/(2L)). We again define the hash function hR : {0, 1}^d → {0, 1}^s by setting the jth coordinate of hR(x) to ⟨rj, x⟩ mod 2. We construct a table with 2^s = n^{Θ(ε^{−2})} buckets, indexed by s-bit strings.1 Then, for each point p ∈ P, we insert p into every bucket b ∈ {0, 1}^s for which lH(hR(p), b) ≤ (t + h(ε)/2) · s, where t is defined as in the previous section (as the probability in (6.1) with ∆ = L). This preprocessing requires n^{Θ(ε^{−2})} time. With this data structure in hand, answering a query q ∈ {0, 1}^d is easy: just compute the hash value hR(q) and return an arbitrary point of the corresponding bucket, or "none"
if this bucket is empty.
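A toy implementation of this preprocessing (our own code; the parameters, and the threshold abstracted as `radius`, are chosen for illustration rather than per the theory) makes the inverted-index idea concrete:

```python
import itertools
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_table(P, d, s, L, radius):
    """Insert each point into every bucket within `radius` of its hash value.

    `radius` stands in for the threshold (t + h(eps)/2) * s from the text.
    """
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    # s biased random strings: each bit is 1 with probability 1/(2L).
    R = [tuple(int(rng.random() < 1 / (2 * L)) for _ in range(d))
         for _ in range(s)]
    h = lambda x: tuple(sum(xi & ri for xi, ri in zip(x, r)) % 2 for r in R)
    table = {b: [] for b in itertools.product([0, 1], repeat=s)}
    for p in P:
        hp = h(p)
        for b in table:
            if hamming(hp, b) <= radius:
                table[b].append(p)
    return table, h

d, s = 16, 6
points = []
for i in range(5):
    rng = random.Random(100 + i)
    points.append(tuple(rng.randint(0, 1) for _ in range(d)))
table, h = build_table(points, d, s, L=2, radius=1)

# Every point lands in its own bucket (distance 0 <= radius), so querying
# a stored point always finds a non-empty bucket.
assert all(p in table[h(p)] for p in points)
```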
For the analysis, think of an adversary who chooses a query point q ∈ {0, 1}d, and then
we subsequently flip our coins and build the above data structure. (This is the most common
way to analyze hashing, with the query independent of the random coin flips used to choose
the hash function.) Choosing the hidden constant in the definition of s appropriately and
applying Corollary 6.1 with δ = 1/n^2, we find that, for every point p ∈ P, with probability at least 1 − 1/n^2, p is in the bucket hR(q) (if lH(p, q) ≤ L) or is not in the bucket hR(q) (if lH(p, q) ≥ (1 + ε)L). Taking a Union Bound over the n points of P, we find that the data structure correctly answers the query q with probability at least 1 − 1/n.
Before describing the full data structure, let's take stock of what we've accomplished thus far. We've shown that, for every constant ε > 0, there is a data structure for the decision version of the (1 + ε)-nearest neighbor problem that uses space n^{O(ε^{−2})}, answers a query with a single random access to the data structure, and for every query is correct with high probability. Later in this lecture, we show a matching lower bound: every (possibly randomized) data structure with equally good search performance for the decision version of the (1 + ε)-nearest neighbor problem has space n^{Ω(ε^{−2})}. Thus, smaller space can only be achieved by increasing the query time (and there are ways to do this, see e.g. Indyk (2004)).
1Note the frightening dependence of the space on 1/ε. This is why we suggested thinking of ε as a not-too-small constant.

6.3.3 The Data Structure (Full Version)
The data structure of the previous section is an unsatisfactory solution to the (1 + ε)-nearest neighbor problem in two respects:
1. In the real problem, there is no a priori known value of L. Intuitively, one would like to take L equal to the actual nearest-neighbor distance of a query point q, a quantity that is different for different q’s.
2. Even for the decision version, the data structure can answer some queries incorrectly. Since the data structure only guarantees correctness with probability at least 1 − 1/n for each query q, it might be wrong on as many as 2^d/n different queries. Thus, an adversary that knows the data structure's coin flips can exhibit a query that the data structure gets wrong.
The first fix is straightforward: just build ≈ d copies of the data structure of Section 6.3.2, one for each relevant value of L.2 Given a query q, the data structure now uses binary search over L to compute a (1 + ε)-nearest neighbor of q; see the Exercises for details. Answering a query thus requires O(log d) lookups to the data structure. This also necessitates blowing up the number s of random strings used in each data structure by a Θ(log log d) factor — this reduces the failure probability of a given lookup by a log d factor, enabling a Union Bound over log d times as many lookups as before.3
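The binary search in this first fix can be sketched as follows (our own code, assuming an idealized, error-free decision oracle decide(L, q); the real oracle is randomized, which is why the Exercises treat this step more carefully):

```python
def approx_nn_distance(decide, d, q):
    """Binary search for the smallest L in {0, ..., d} with decide(L, q) == True.

    decide(L, q) models the scale-L decision structure: True indicates some
    point within distance about L of q, False indicates no point within
    (1 + eps) * L. With an error-free oracle, the answer is monotone in L.
    """
    if decide(0, q):
        return 0
    lo, hi = 0, d  # invariant: decide(lo) is False, decide(hi) is True
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if decide(mid, q):
            hi = mid
        else:
            lo = mid
    return hi

# Toy oracle with an exact threshold at distance 7, standing in for the
# real (hypothetical) per-scale data structures.
oracle = lambda L, q: L >= 7
assert approx_nn_distance(oracle, d=100, q=None) == 7
```

O(log d) oracle calls suffice, matching the O(log d) lookups claimed in the text.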
A draconian approach to the second problem is to again replicate the data structure
above Θ(d) times. Each query q is asked in all Θ(d) copies, and majority vote is used to
determine the final answer. Since each copy is correct on each of the 2^d possible queries with probability at least 1 − 1/n > 2/3, the majority vote is wrong on a given query with probability at most inverse exponential in d. Taking a Union Bound over the 2^d possible queries shows that, with high probability over the coin flips used to construct the data structure, the data structure answers every query correctly. Put differently, for almost all outcomes of the coin flips, not even an adversary that knows the coin flip outcomes can produce a query on which the data structure is incorrect. This solution blows up both the space used and the query time by a factor of Θ(d).
An alternative approach is to keep Θ(d) copies of the data structure as above but, given
a query q, to answer the query using one of the Θ(d) copies chosen uniformly at random.
With this solution, the space gets blown up by a factor of Θ(d) but the query time is
unaffected. The correctness guarantee is now slightly weaker. With high probability over
the coin flips used to construct the data structure (in the preprocessing), the data structure
satisfies: for every query q ∈ {0, 1}^d, with probability at least 1 − Θ(1/n) over the coins
2For L ≥ d/(1 + ε), the data structure can just return an arbitrary point of P. For L = 0, when the data structure of Section 6.3.2 is not well defined, a standard data structure for set membership, such as a perfect hash table (Fredman et al., 1984), can be used.
3With some cleverness, this log log d factor can be avoided – see the Exercises.

flipped at query time, the data structure answers the query correctly (why?). Equivalently, think of an adversary who is privy to the outcomes of the coins used to construct the data structure, but not those used to answer queries. For most outcomes of the coins used in the preprocessing phase, no matter what query the adversary suggests, the data structure answers the query correctly with high probability.
To put everything together, the data structure for a fixed L (from Section 6.3.2) requires n^{Θ(ε^{-2})} space, the first fix blows up the space by a factor of Θ(d log d), and the second fix blows up the space by another factor of d. For the query time, with the alternative implementation of the second fix, answering a query involves O(log d) lookups into the data structure. Each lookup involves computing a hash value, which in turn involves computing inner products (modulo 2) with s = Θ(ε^{-2} log n log log d) random strings. Each such inner product requires O(d) time.
Thus, the final scorecard for the data structure is:

• Space: O(d² log d) · n^{Θ(ε^{-2})}.

• Query time: O(ε^{-2} d log n log log d).
For example, for the suggested parameter values of d = n^c for a constant c ∈ (0, 1) and ε a not-too-small constant, we obtain a query time significantly better than the brute-force (exact) solution (which is Θ(dn)), while using only a polynomial amount of space.

6.4 Lower Bounds via Asymmetric Communication Complexity
We now turn our attention from upper bounds for the (1 + ε)-nearest neighbor problem to lower bounds. We do this in three steps. In Section 6.4.1, we introduce a model for proving lower bounds on time-space trade-offs in data structures — the cell probe model. In Section 6.4.2, we explain how to deduce lower bounds in the cell probe model from a certain type of communication complexity lower bound. Section 6.4.3 applies this machinery to the (1 + ε)-approximate nearest neighbor problem, and proves a sense in which the data structure of Section 6.3 is optimal: every data structure for the decision version of the problem that uses O(1) lookups per query has space n^{Ω(ε^{-2})}. Thus, no polynomial-sized data structure can be both super-fast and super-accurate for this problem.
6.4.1 The Cell Probe Model
Motivation
The most widely used model for proving data structure lower bounds is the cell probe model, introduced by Yao — two years after he developed the foundations of communication complexity (Yao, 1979) — in the paper “Should Tables Be Sorted?” (Yao, 1981).4 The point
4Actually, what is now called the cell probe model is a bit stronger than the model proposed by Yao (1981).

of this model is to prove lower bounds for data structures that allow random access. To make the model as powerful as possible, and hence lower bounds for it as strong as possible (cf., our communication models), a random access to a data structure counts as 1 step in
this model, no matter how big the data structure is.
Some History
To explain the title of the paper by Yao (1981), suppose your job is to store k elements from a totally ordered universe of n elements ({1, 2, . . . , n}, say) in an array of length k. The goal is to minimize the worst-case number of array accesses necessary to check whether or not a given element is in the array, over all subsets of k elements that might be stored and over the n possible queries.
To see that this problem is non-trivial, suppose n = 3 and k = 2. One strategy is to store the pair of elements in sorted order, leading to the three possible arrays in Figure 6.1(a). This yields a worst-case query time of 2 array accesses. To see this, suppose we want to check whether or not 2 is in the array. If we initially query the first array element and find a “1,” or if we initially query the second array element and find a “3,” then we can’t be sure whether or not 2 is in the array.
(a) Sorted Array (b) Unsorted Array
Figure 6.1 Should tables be sorted? For n = 3 and k = 2, it is suboptimal to store the array elements in sorted order.
Suppose, on the other hand, our storage strategy is as shown in Figure 6.1(b). Whichever array entry is queried first, the answer uniquely determines the other array entry. Thus, storing the table in a non-sorted fashion is necessary to achieve the optimal worst-case query time of 1. On the other hand, if k = 2 and n = 4, storing the elements in sorted order (and using binary search to answer queries) is optimal!5
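The n = 3, k = 2 case can be verified exhaustively. The following brute-force check (our own illustration, not from the text) confirms that no strategy answers every query with one probe of the sorted layout, while some unsorted layout achieves worst-case query time 1:

```python
from itertools import permutations, product

sets = [(1, 2), (1, 3), (2, 3)]  # the three 2-subsets of {1, 2, 3}

def consistent(layout, i, x):
    # Map each value that can be seen at position i to the set of membership
    # answers for query x that are consistent with seeing it.
    table = {}
    for s, arr in zip(sets, layout):
        table.setdefault(arr[i], set()).add(x in s)
    return table

def one_probe_suffices(layout):
    # One probe answers query x iff some position exists whose probed value
    # uniquely determines, across all stored sets, whether x is present.
    for x in (1, 2, 3):
        if not any(all(len(answers) == 1
                       for answers in consistent(layout, i, x).values())
                   for i in (0, 1)):
            return False
    return True

sorted_layout = tuple(tuple(sorted(s)) for s in sets)
# Sorted storage cannot achieve query time 1...
assert not one_probe_suffices(sorted_layout)
# ...but some unsorted layout can, e.g. one whose first cells are all distinct.
assert any(one_probe_suffices(lay)
           for lay in product(*[list(permutations(s)) for s in sets]))
```

One layout that works stores {1,2} as (2,1), {1,3} as (1,3), and {2,3} as (3,2): the first cell then takes a different value for each of the three possible sets, so a single probe identifies the entire stored set.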
Formal Model
In the cell probe model, the goal is to encode a “database” D in order to answer a set Q of queries. The query set Q is known up front; the encoding scheme must work (in the
5Yao (1981) also uses Ramsey theory to prove that, provided the universe size is a sufficiently (really, really) large function of the array size, then binary search on a sorted array is optimal. This result assumes that no auxiliary storage is allowed, so solutions like perfect hashing (Fredman et al., 1984) are ineligible. If the universe size is not too much larger than the array size, then there are better solutions (Fiat and Naor, 1993; Gabizon and Hassidim, 2010; Gabizon and Shaltiel, 2012), even when there is no auxiliary storage.
1″
2″
1″
3″
1″
2″
3″
1″
2″
3″
2″
3″

sense below) for all possible databases.
For example, in the (1 + ε)-approximate nearest neighbor problem, the database corresponds to the point set P ⊆ {0,1}^d, while the possible queries Q correspond to the elements of {0,1}^d. Another canonical example is the set membership problem: here, D is a subset of a universe U, and each query q ∈ Q asks “is i ∈ D?” for some element i ∈ U.
A parameter of the cell probe model is the word size w; more on this shortly. Given this parameter, the design space consists of the ways to encode databases D as s cells of w bits each. We can view such an encoding as an abstract data structure representing the database, and we view the number s of cells as the space used by the encoding. To be a valid encoding, it must be possible to correctly answer every query q ∈ Q for the database D by reading enough bits of the encoding. A query-answering algorithm accesses the encoding by specifying the name of a cell; in return, the algorithm is given the contents of that cell. Thus every access to the encoding yields w bits of information. The query time of an algorithm
(with respect to an encoding scheme) is the maximum, over databases D (of a given size) and queries q ∈ Q, of the number of accesses to the encoding used to answer a query.
For example, in the original array example from Yao (1981) mentioned above, the word size w is ⌈log2 n⌉ — just enough to specify the name of an element. The goal in Yao (1981) was, for databases consisting of k elements of the universe, to understand when the
minimum-possible query time is ⌈log2 k⌉ under the constraint that the space is k.6
Most research on the cell-probe model seeks time-space trade-offs with respect to a fixed value for the word size w. Most commonly, the word size is taken large enough so that a single element of the database can be stored in a single cell, and ideally not too much larger than this. For nearest-neighbor-type problems involving n-point sets in the d-dimensional
hypercube, this guideline suggests taking w to be polynomial in max{d, log2 n}.
For this choice of w, the data structure in Section 6.3.2 that solves the decision version of the (1 + ε)-approximate nearest neighbor problem yields a (randomized) cell-probe encoding of point sets with space n^{Θ(ε^{-2})} and query time 1. Cells of this encoding correspond to all possible s-bit hash values h_R(q) of a query q ∈ {0, 1}^d, where s = Θ(ε^{-2} log n), and the contents of a cell name an arbitrary point p ∈ P with hash value h_R(p) sufficiently close (in Hamming distance in {0,1}^s) to that of the cell’s name (or “NULL” if no such p exists). The rest of this lecture proves a matching lower bound in the cell-probe model: constant query time can only be achieved by encodings (and hence data structures) that use n^{Ω(ε^{-2})} space.

From Data Structures to Communication Protocols
Our goal is to derive data structure lower bounds in the cell-probe model from communication complexity lower bounds. Thus, we need to extract low-communication protocols from good data structures. Similar to our approach last lecture, we begin with a contrived communication problem to forge an initial connection. Later we’ll see how to prove lower
6The model in Yao (1981) was a bit more restrictive — cells were required to contain names of elements in the database, rather than arbitrary ⌈log2 n⌉-bit strings.

bounds for the contrived problem via reductions from other communication problems that we already understand well.
Fix an instantiation of the cell probe model – i.e., a set of possible databases and possible queries. For simplicity, we assume that all queries are Boolean. In the corresponding Query-Database problem, Alice gets a query q and Bob gets a database D. (Note that in all natural examples, Bob’s input is much bigger than Alice’s.) The communication problem is to compute the answer to q on the database D.
We made up the Query-Database problem so that the following lemma holds.
Lemma 6.2 Consider a set of databases and queries so that there is a cell-probe encoding with word size w, space s, and query time t. Then, there is a communication protocol for the corresponding Query-Database problem with communication at most t log₂ s bits sent by Alice plus tw bits sent by Bob.
The proof is the obvious simulation: Alice simulates the query-answering algorithm, sending at most log₂ s bits to Bob to specify each cell requested by the algorithm, and Bob sends w bits back to Alice to describe the contents of each requested cell. By assumption, they only need to go back and forth at most t times to identify the answer to Alice’s query q.
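The simulation can be made concrete. Below is an illustrative sketch (the callback interface and all names are ours, not from the text): Alice's cell-probe algorithm either requests a cell or outputs an answer, and we count the bits each party sends.

```python
import math

def simulate(query_alg, cells, s, w):
    """Lemma 6.2 simulation (illustrative sketch): Alice runs the cell-probe
    query algorithm, naming each requested cell with ceil(log2 s) bits; Bob
    answers with the cell's w-bit contents.  Returns (answer, a_bits, b_bits)."""
    a_bits = b_bits = 0
    seen = []  # (cell, contents) pairs revealed so far
    while True:
        step = query_alg(seen)               # Alice's next move
        if step[0] == "answer":
            return step[1], a_bits, b_bits
        a_bits += math.ceil(math.log2(s))    # Alice names a cell
        contents = cells[step[1]]            # Bob looks it up...
        b_bits += w                          # ...and sends its w bits back
        seen.append((step[1], contents))

# Toy example: set membership with a 1-bit-per-cell table (t = 1 probe).
D = {2, 5, 7}
s, w = 8, 1
cells = {i: int(i in D) for i in range(s)}

def member_query(q):
    def alg(seen):
        if not seen:
            return ("probe", q)              # probe cell q
        return ("answer", seen[0][1])        # its bit is the answer
    return alg

ans, a, b = simulate(member_query(5), cells, s, w)
# t = 1, so Alice sends ceil(log2 s) = 3 bits and Bob sends w = 1 bit
```

With t probes, the totals are exactly t·⌈log₂ s⌉ bits from Alice and t·w bits from Bob, matching the lemma.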
Lemma 6.2 reduces the task of proving data structure lower bounds to proving lower bounds on the communication cost of protocols for the Query-Database problem.7
6.4.2 Asymmetric Communication Complexity
Almost all data structure lower bounds derived from communication complexity use asym- metric communication complexity. This is just a variant of the standard two-party model where we keep track of the communication by Alice and by Bob separately. The most common motivation for doing this is when the two inputs have very different sizes, like in the protocol used to prove Lemma 6.2 above.
Case Study: Index
To get a feel for asymmetric communication complexity and lower bound techniques for it, let’s revisit an old friend, the Index problem. In addition to the application we saw earlier in the course, Index arises naturally as the Query-Database problem corresponding to the membership problem in data structures.
7The two-party communication model seems strictly stronger than the data structure design problem that it captures — in a communication protocol, Bob can remember which queries Alice asked about previously, while a (static) data structure cannot. An interesting open research direction is to find communication models and problems that more tightly capture data structure design problems, thereby implying strong lower bounds.

Recall that an input of Index gives Alice an index i ∈ {1,2,…,n}, specified using ≈ log2 n bits, and Bob a subset S ⊆ {1, 2, . . . , n}, or equivalently an n-bit vector.8 In Lecture 2 we proved that the communication complexity of Index is Ω(n) for one-way randomized protocols with two-sided error (Theorem 2.4) — Bob must send almost his entire input to Alice for Alice to have a good chance of computing her desired index. This lower bound clearly does not apply to general communication protocols, since Alice can just send her log2 n-bit input to Bob. It is also easy to prove a matching lower bound on deterministic and nondeterministic protocols (e.g., by a fooling set argument).
We might expect a more refined lower bound to hold: to solve Index, not only do the players have to send at least log₂ n bits in total, but more specifically Alice has to send at least log₂ n bits to Bob. Well, not quite: Bob could always send his entire input to Alice, using n bits of communication while freeing Alice to use essentially no communication. Revising our ambition, we could hope to prove that in every Index protocol, either (i) Alice has to communicate most of her input; or (ii) Bob has to communicate most of his input. The next result states that this is indeed the case.
Theorem 6.3 (Miltersen et al. 1998) For every δ > 0, there exists a constant N = N(δ) such that, for every n ≥ N and every randomized communication protocol with two-sided error that solves Index with n-bit inputs, either:
(i) in the worst case (over inputs and protocol randomness), Alice communicates at least δ log2 n bits; or
(ii) in the worst case (over inputs and protocol randomness), Bob communicates at least n^{1−2δ} bits.9

Loosely speaking, Theorem 6.3 states that the only way Alice can get away with sending o(log n) bits of communication is if Bob sends at least n^{1−o(1)} bits of communication.
For simplicity, we’ll prove Theorem 6.3 only for deterministic protocols. The lower bound for randomized protocols with two-sided error is very similar, just a little messier
(see Miltersen et al. (1998)).
Conceptually, the proof of Theorem 6.3 has the same flavor as many of our previous lower
bounds, and is based on covering-type arguments. The primary twist is that rather than keeping track only of the size of monochromatic rectangles, we keep track of both the height and width of such rectangles. For example, we’ve seen in the past that low-communication protocols imply the existence of a large monochromatic rectangle — if the players haven’t had the opportunity to speak much, then an outside observer hasn’t had the chance to eliminate many inputs as legitimate possibilities. The next lemma proves an analog of
8We’ve reversed the roles of the players relative to the standard description we gave in Lectures 1– 2. This reverse version is the one corresponding to the Query-Database problem induced by the membership problem.
9From the proof, it will be evident that n^{1−2δ} can be replaced by n^{1−cδ} for any constant c > 1.

this, with the height and width of the monochromatic rectangle parameterized by the
communication used by Alice and Bob, respectively.
Lemma 6.4 (Richness Lemma (Miltersen et al., 1998)) Let f : X × Y → {0, 1} be a Boolean function with corresponding X × Y 0-1 matrix M (f ). Assume that:
(1) M(f) has at least v columns that each have at least u 1-inputs.10
(2) There is a deterministic protocol that computes f in which Alice and Bob always send
at most a and b bits, respectively.11
Then, M(f) has a 1-rectangle A × B with |A| ≥ u/2^a and |B| ≥ v/2^{a+b}.
The proof of Lemma 6.4 is a variation on the classic argument that a protocol computing a function f induces a partition of the matrix M(f) into monochromatic rectangles. Let’s recall the inductive argument. Let z be a transcript-so-far of the protocol, and assume by induction that the inputs (x, y) that lead to z form a rectangle A × B. Assume that Alice speaks next (the other case is symmetric). Partition A into A0, A1, with Aη the inputs x ∈ A such that Alice sends the bit η next. (As always, this bit depends only on her input x and the transcript-so-far z.) After Alice speaks, the inputs consistent with the resulting transcript are either A0 × B or A1 × B — either way, a rectangle. All inputs that generate the same final transcript z form a monochromatic rectangle — since the protocol’s output is constant across these inputs and it computes the function f, f is also constant across these inputs.
Now let’s refine this argument to keep track of the dimensions of the monochromatic rectangle, as a function of the number of times that each of Alice and Bob speak.
Proof of Lemma 6.4: We proceed by induction on the number of steps of the protocol. Suppose the protocol has generated the transcript-so-far z and that A × B is the rectangle of inputs consistent with this transcript. Suppose that at least c of the columns of B have at least d 1-inputs in rows of A (possibly with different rows for different columns).
For the first case, suppose that Bob speaks next. Partition A × B into A × B0 and A × B1, where Bη are the inputs y ∈ B such that (given the transcript-so-far z) Bob sends the bit η. At least one of the sets B0,B1 contains at least c/2 columns that each contain at least d 1-inputs in the rows of A (Figure 6.2(a)).
For the second case, suppose that Alice speaks. Partition A × B into A0 × B and A1 × B. It is not possible that both (i) A0 × B has strictly fewer than c/2 columns with d/2 or more 1-inputs in the rows of A0 and (ii) A1 × B has strictly fewer than c/2 columns with d/2 or more 1-inputs in the rows of A1. For if both (i) and (ii) held, then A × B would have fewer than c columns with d or more 1-inputs in the rows of A, a contradiction (Figure 6.2(b)).
10Such a matrix is sometimes called (u,v)-rich. 11This is sometimes called an [a,b]-protocol.

(a) When Bob Speaks (b) When Alice Speaks
Figure 6.2 Proof of the Richness Lemma (Lemma 6.4). When Bob speaks, at least one of the corresponding subrectangles has at least c/2 columns that each contain at least d 1-inputs. When Alice speaks, at least one of the corresponding subrectangles has at least c/2 columns that each contain at least d/2 1-inputs.
By induction, we conclude that there is a 1-input (x, y) such that, at each point of the protocol’s execution on (x, y) (with Alice and Bob having sent α and β bits so far, respectively), the current rectangle A × B of inputs consistent with the protocol’s execution has at least v/2^{α+β} columns (in B) that each contain at least u/2^α 1-inputs (among the rows of A). Since the protocol terminates with a monochromatic rectangle of M(f) and with α ≤ a, β ≤ b, the proof is complete. ∎
Lemma 6.4 and some simple calculations prove Theorem 6.3.
Proof of Theorem 6.3: We first observe that the matrix M(Index) has a set of (n choose n/2) columns that each contain n/2 1-inputs: choose the columns (i.e., inputs for Bob) that correspond to subsets S ⊆ {1, 2, . . . , n} of size n/2 and, for such a column S, consider the rows (i.e., indices for Alice) that correspond to the elements of S.
Now suppose for contradiction that there is a protocol that solves Index in which Alice always sends at most a = δ log₂ n bits and Bob always sends at most b = n^{1−2δ} bits. Invoking Lemma 6.4 proves that the matrix M(Index) has a 1-rectangle of size at least

  (n/2) · 2^{-a} × (n choose n/2) · 2^{-(a+b)} = (1/2)·n^{1−δ} × c₂·2^{n − n^{1−2δ}},

where we are using that 2^{-a} = n^{-δ}, that (n choose n/2) ≈ 2^n/√n, and that 2^{-(a+b)} = n^{-δ} · 2^{-n^{1−2δ}}, and where c₂ > 0 is a constant independent of n. (We’re using here that n ≥ N(δ) is sufficiently large.)
On the other hand, how many columns can there be in a 1-rectangle with (1/2)·n^{1−δ} rows? If these rows correspond to the set S ⊆ {1, 2, . . . , n} of indices, then every column of the 1-rectangle must correspond to a superset of S. There are

  2^{n−|S|} = 2^{n − (1/2)·n^{1−δ}}

of these. But

  c₂·2^{n − n^{1−2δ}} > 2^{n − (1/2)·n^{1−δ}}

for sufficiently large n, providing the desired contradiction. ∎
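The final inequality boils down to comparing exponents: n^{1−2δ} grows more slowly than (1/2)·n^{1−δ}, so for large n the left-hand side dominates. A quick numeric illustration (δ = 0.1, constants ignored; our own sanity check):

```python
delta = 0.1
for n in (10**6, 10**9, 10**12):
    # the rectangle's column count has exponent n - n^{1-2*delta}, while the
    # superset bound has exponent n - (1/2) * n^{1-delta}; the former wins
    # once n^{1-2*delta} < (1/2) * n^{1-delta}
    assert n ** (1 - 2 * delta) < 0.5 * n ** (1 - delta)
```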
Does the asymmetric communication complexity lower bound in Theorem 6.3 have any interesting implications? By Lemma 6.2, a data structure that supports membership queries with query time t, space s, and word size w induces a communication protocol for Index in which Alice sends at most t log₂ s bits and Bob sends at most tw bits. For example, suppose t = Θ(1) and w is at most poly-logarithmic in n. Since Bob only sends tw bits in the induced protocol for Index, he certainly does not send n^{1−2δ} bits.12 Thus, Theorem 6.3 implies that Alice must send at least δ log₂ n bits in the protocol. This implies that

  t log₂ s ≥ δ log₂ n

and hence s ≥ n^{δ/t}. The good news is that this is a polynomial lower bound for every constant t. The bad news is that even for t = 1, this argument will never prove a super-linear lower bound. We don’t expect to prove a super-linear lower bound in the particular case of the membership problem, since there is a data structure for this problem with constant query time and linear space (e.g., perfect hashing (Fredman et al., 1984)). For the (1 + ε)-approximate nearest neighbor problem, on the other hand, we want to prove a lower bound of the form n^{Ω(ε^{-2})}. To obtain such a super-linear lower bound, we need to reduce from a communication problem harder than Index — or rather, a communication problem in which Alice’s input is bigger than log₂ n and in which she still reveals almost her entire input in every communication protocol induced by a constant-query data structure.
(k, l)-Disjointness
A natural idea for modifying Index so that Alice’s input is bigger is to give Alice multiple
indices; Bob’s input remains an n-bit vector. The new question is whether or not for at least
12Here δ ∈ (0, 1) is a constant and n ≥ N(δ) is sufficiently large. Using the version of Theorem 6.3 with “n^{1−2δ}” replaced by “n^{1−cδ}” for an arbitrary constant c > 1, we can take δ arbitrarily close to 1.

one of Alice’s indices, Bob’s input is a 1.13 This problem is essentially equivalent — up to the details of how Alice’s input is encoded — to Disjointness.
This section considers the special case of Disjointness where the sizes of the sets given to Alice and Bob are restricted. If we follow the line of argument in the proof of Theorem 6.3, the best-case scenario is a space lower bound of 2^{Ω(a)}, where a is the length of Alice’s input; see also the proofs of Corollary 6.8 and Theorem 6.9 at the end of the lecture. This is why the Index problem (where Alice’s set is a singleton and a = log₂ n) cannot lead — at least via Lemma 6.2 — to super-linear data structure lower bounds. The minimum a necessary for the desired space lower bound of n^{Ω(ε^{-2})} is ε^{-2} log₂ n. This motivates considering instances of Disjointness in which Alice receives a set of size ε^{-2}. Formally, we define (k, l)-Disjointness as the communication problem in which Alice’s input is a set of size k (from a universe U) and Bob’s input is a set of size l (also from U), and the goal is to determine whether the sets are disjoint (a 1-input) or intersecting (a 0-input).
We next extend the proof of Theorem 6.3 to show the following.
Theorem 6.5 (Andoni et al. 2006; Miltersen et al. 1998) For every ε, δ > 0 and every sufficiently large n ≥ N(ε, δ), in every communication protocol that solves (1/ε², n)-Disjointness with a universe of size 2n, either:

(i) Alice sends at least (δ/ε²) log₂ n bits; or

(ii) Bob sends at least n^{1−2δ} bits.

13This is reminiscent of a “direct sum,” where Alice and Bob are given multiple instances of a communication problem and have to solve all of them. Direct sums are a fundamental part of communication complexity, but we won’t have time to discuss them.

As with Theorem 6.3, we’ll prove Theorem 6.5 for the special case of deterministic protocols. The theorem also holds for randomized protocols with two-sided error (Andoni et al., 2006), and we’ll use this stronger result in Theorem 6.9 below. (Our upper bound in Section 6.3.2 is randomized, so we really want a randomized lower bound.) The proof for randomized protocols argues along the lines of the Ω(√n) lower bound for the standard version of Disjointness, and is not as hard as the stronger Ω(n) lower bound (recall the discussion in Section 4.3.4).

Proof of Theorem 6.5: Let M denote the 0-1 matrix corresponding to the (1/ε², n)-Disjointness function. Ranging over all subsets of U of size n, and for a given such set S, over all subsets of U \ S of size 1/ε², we see that M has at least (2n choose n) columns that each have at least (n choose ε^{-2}) 1-inputs.

Assume for contradiction that there is a communication protocol for (1/ε², n)-Disjointness such that neither (i) nor (ii) holds. By the Richness Lemma (Lemma 6.4), there exists a
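The counting claim at the start of the proof can be sanity-checked by brute force on a tiny instance (n = 3, with Alice's sets of size k = 2 standing in for 1/ε²; this check is our own illustration):

```python
from itertools import combinations
from math import comb

n, k = 3, 2
U = range(2 * n)
rows = list(combinations(U, k))   # Alice's k-sets
cols = list(combinations(U, n))   # Bob's n-sets

def rich_columns(min_ones):
    # columns (Bob inputs T) with at least min_ones disjoint rows (1-inputs)
    return sum(1 for T in cols
               if sum(1 for S in rows if not set(S) & set(T)) >= min_ones)

# Every one of the (2n choose n) columns has at least (n choose k) 1-inputs:
# the sets S disjoint from T are exactly the k-subsets of U \ T, and |U \ T| = n.
assert rich_columns(comb(n, k)) == comb(2 * n, n)
```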

1-rectangle A × B where

  |A| = (n choose ε^{-2}) · 2^{-(δ/ε²) log₂ n} ≥ (ε²n)^{1/ε²} · n^{-δ/ε²} = ε^{2/ε²} · n^{(1−δ)/ε²}   (6.2)

and

  |B| = (2n choose n) · 2^{-(δ/ε²) log₂ n} · 2^{-n^{1−2δ}} ≥ 2^{2n − n^{1−3δ/2}},   (6.3)

where in (6.3) we are using that (2n choose n) ≈ 2^{2n}/√(2n) and that n is sufficiently large.

Since A × B is a 1-rectangle, S and T are disjoint for every choice of S ∈ A and T ∈ B. This implies that ∪_{S∈A} S and ∪_{T∈B} T are disjoint sets. Letting s = |∪_{S∈A} S|, we have

  |A| ≤ (s choose ε^{-2}) ≤ s^{1/ε²}.   (6.4)

Combining (6.2) and (6.4) implies that

  s ≥ ε²·n^{1−δ}.

Since every subset T ∈ B avoids the s elements in ∪_{S∈A} S,

  |B| ≤ 2^{2n−s} ≤ 2^{2n − ε²·n^{1−δ}}.   (6.5)

Inequalities (6.3) and (6.5) furnish the desired contradiction. ∎
The upshot is that, for the goal of proving a communication lower bound of Ω(ε^{-2} log n) (for Alice, in a Query-Database problem) and a consequent data structure space lower bound of n^{Ω(ε^{-2})}, (1/ε², n)-Disjointness is a promising candidate to reduce from.
6.4.3 Lower Bound for the (1 + ε)-Approximate Nearest Neighbor Problem

The final major step is to show that (1/ε², n)-Disjointness, which is hard by Theorem 6.5, reduces to the Query-Database problem for the decision version of the (1 + ε)-nearest neighbor problem.
A Simpler Lower Bound of n^{Ω(ε^{-1})}

We begin with a simpler reduction that leads to a suboptimal but still interesting space lower bound of n^{Ω(ε^{-1})}. In this reduction, we’ll reduce from (1/ε, n)-Disjointness rather than (1/ε², n)-Disjointness. Alice is given a (1/ε)-set S (from a universe U of size 2n), which

we need to map to a nearest-neighbor query. Bob is given an n-set T ⊆ U, which we need to map to a point set.
Our first idea is to map the input to (1/ε, n)-Disjointness to a nearest-neighbor query in the 2n-dimensional hypercube {0,1}^{2n}. Alice performs the obvious mapping of her input, from the set S to a query point q that is the characteristic vector of S (which lies in {0, 1}^{2n}). Bob maps his input T to the point set P = {e_i : i ∈ T}, where e_i denotes the characteristic vector of the singleton {i} (i.e., the ith standard basis vector).

If the sets S and T are disjoint, then the corresponding query q has Hamming distance (1/ε) + 1 from every point in the corresponding point set P. If S and T are not disjoint, then there exists a point e_i ∈ P such that the Hamming distance between q and e_i is (1/ε) − 1. Thus, the (1/ε, n)-Disjointness problem reduces to the (1 + ε)-approximate nearest neighbor problem in the 2n-dimensional Hamming cube, where 2n is also the size of the universe from which the point set is drawn.
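The two distances in this reduction are easy to verify directly; here is a small sketch with ε = 1/4, so |S| = 1/ε = 4 (the concrete parameters are our own illustration):

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

n = 8
U = range(2 * n)
S = {0, 1, 2, 3}                      # Alice's (1/eps)-set, 1/eps = 4
q = [1 if i in S else 0 for i in U]   # characteristic vector of S

def nn_distance(T):
    # Bob's point set: standard basis vectors e_i for i in T
    P = [[1 if j == i else 0 for j in U] for i in T]
    return min(hamming(q, p) for p in P)

assert nn_distance({4, 5, 6}) == 4 + 1   # disjoint: distance 1/eps + 1
assert nn_distance({3, 9, 10}) == 4 - 1  # intersecting: distance 1/eps - 1
```

The ratio between the two cases is (1/ε + 1)/(1/ε − 1) = 1 + Θ(ε), which is exactly the gap a (1 + ε)-approximate nearest neighbor query must distinguish.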
We’re not done, because extremely high-dimensional nearest neighbor problems are not very interesting. The convention in nearest neighbor problems is to assume that the word size w — recall Section 6.4.1 — is at least the dimension. When d is at least the size of the universe from which points are drawn, an entire point set can be described using a single word! This means that our reduction so far cannot possibly yield an interesting lower bound in the cell probe model. We fix this issue by applying dimension reduction — just as in our upper bound in Section 6.3.2 — to the instances produced by the above reduction.
Precisely, we can use the following embedding lemma.

Lemma 6.6 (Embedding Lemma #1) There exists a randomized function f from {0, 1}^{2n} to {0,1}^d with d = Θ(ε^{-2} log n) and a constant α > 0 such that, for every set P ⊆ {0,1}^{2n} of n points and query q ∈ {0, 1}^{2n} produced by the reduction above, with probability at least 1 − 1/n:

(1) if the nearest-neighbor distance between q and P is (1/ε) − 1, then the nearest-neighbor distance between f(q) and f(P) is at most α;

(2) if the nearest-neighbor distance between q and P is (1/ε) + 1, then the nearest-neighbor distance between f(q) and f(P) is at least α(1 + h(ε)), where h(ε) > 0 is a constant depending on ε only.
Lemma 6.6 is an almost immediate consequence of Corollary 6.1 — the map f just takes d = Θ(ε^{-2} log n) random inner products with 2n-bit vectors, where the probability of a “1” is roughly ε/2. We used this idea in Section 6.3.2 for a data structure — here we’re using it for a lower bound!
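A minimal sketch of such a map (a hypothetical implementation of ours; the exact bias and constants are not from the text):

```python
import random

def embed(x, d, eps, seed=0):
    """Map a 2n-bit vector x to {0,1}^d via d random inner products mod 2,
    each with a random vector whose entries are 1 with probability ~ eps/2.
    The fixed seed models shared randomness: all inputs are mapped using
    the same d random vectors, so distances between images are meaningful."""
    rng = random.Random(seed)
    image = []
    for _ in range(d):
        r = [1 if rng.random() < eps / 2 else 0 for _ in x]
        image.append(sum(ri & xi for ri, xi in zip(r, x)) % 2)
    return image
```

Because the randomness is shared, the probability that a coordinate of embed(x, ·) and embed(y, ·) differ is a monotone function of the Hamming distance between x and y, which is what separates the two cases in Lemma 6.6.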
Composing our initial reduction with Lemma 6.6 yields the following.
Corollary 6.7 Every randomized asymmetric communication lower bound for (1/ε, n)-Disjointness carries over to the Query-Database problem for the (1 + ε)-approximate nearest neighbor problem in d = Ω(ε^{-2} log n) dimensions.

Proof: To recap our ideas, the reduction works as follows. Given inputs to (1/ε, n)-Disjointness, Alice interprets her input as a query and Bob interprets his input as a point set (both in {0,1}^{2n}) as described at the beginning of the section. They use shared randomness to choose the function f of Lemma 6.6 and use it to map their inputs to {0, 1}^d with d = Θ(ε^{-2} log n). They run the assumed protocol for the Query-Database problem for the (1 + ε)-approximate nearest neighbor problem in d dimensions. Provided the hidden constant in the definition of d is sufficiently large, correctness (with high probability) is guaranteed by Lemma 6.6. (Think of Lemma 6.6 as being invoked with a parameter ε′ satisfying h(ε′) = ε.) The amount of communication used by the (1/ε, n)-Disjointness protocol is identical to that of the Query-Database protocol.14 ∎
Following the arguments of Section 6.4.2 translates our asymmetric communication
complexity lower bound (via Lemma 6.2) to a data structure space lower bound.
Corollary 6.8 Every data structure for the decision version of the (1 + ε)-approximate nearest neighbor problem with query time t = Θ(1) and word size w = O(n^{1−δ}) for constant δ > 0 uses space s = n^{Ω(ε^{-1})}.
Proof: Since tw = O(n^{1−δ}), in the induced communication protocol for the Query-Database problem (and hence (1/ε, n)-Disjointness, via Corollary 6.7), Bob sends a sublinear number of bits. Theorem 6.5 then implies that Alice sends at least Ω(ε^{-1} log n) bits, and so (by Lemma 6.2) we have t log₂ s = Ω(ε^{-1} log n). Since t = O(1), this implies that s = n^{Ω(ε^{-1})}. ∎
The n^{Ω(ε^{-2})} Lower Bound

The culmination of this lecture is the following.

Theorem 6.9 Every data structure for the decision version of the (1 + ε)-approximate nearest neighbor problem with query time t = Θ(1) and word size w = O(n^{1−δ}) for constant δ > 0 uses space s = n^{Ω(ε^{-2})}.
The proof is a refinement of the embedding arguments we used to prove Corollary 6.8. In that proof, the reduction structure was

  S, T ⊆ U ↦ {0,1}^{2n} ↦ {0,1}^d,

with inputs S, T of (1/ε, n)-Disjointness mapped to the 2n-dimensional Hamming cube and then to the d-dimensional Hamming cube, with d = Θ(ε^{-2} log n). The new plan is

  S, T ⊆ U ↦ (R^{2n}, ℓ₂) ↦ (R^D, ℓ₁) ↦ {0,1}^{D′} ↦ {0,1}^d,
14The (1/ε, n)-Disjointness protocol is randomized with two-sided error even if the Query-Database protocol is deterministic. This highlights our need for Theorem 6.5 in its full generality.

where d = Θ(ε^{-2} log n) as before, and D, D′ can be very large. Thus we map inputs S, T of (1/ε², n)-Disjointness to 2n-dimensional Euclidean space (with the ℓ₂ norm), which we then map (preserving distances) to high-dimensional space with the ℓ₁ norm, then to the high-dimensional Hamming cube, and finally to the Θ(ε^{-2} log n)-dimensional Hamming cube as before (via Lemma 6.6). The key insight is that switching the initial embedding from the high-dimensional hypercube to high-dimensional Euclidean space achieves a nearest-neighbor gap of 1 ± ε even when Alice begins with a (1/ε²)-set; the rest of the argument uses standard (if non-trivial) techniques to eventually get back to a hypercube of reasonable dimension.
To add detail to the important first step, consider inputs S, T to (1/ε², n)-Disjointness. Alice maps her set S to a query vector q that is ε times the characteristic vector of S, which we interpret as a point in 2n-dimensional Euclidean space. Bob maps his input T to the point set P = {e_i : i ∈ T}, again in 2n-dimensional Euclidean space, where e_i denotes the ith standard basis vector.
First, suppose that S and T are disjoint. Then the ℓ_2 distance between Alice's query q and each point e_i ∈ P is

√(1 + (1/ε²) · ε²) = √2.

If S and T are not disjoint, then there exists a point e_i ∈ P such that the ℓ_2 distance between q and e_i is

√((1 − ε)² + (1/ε² − 1) · ε²) = √(2 − 2ε) ≤ √2 · (1 − ε/2).

Thus, as promised, switching to the ℓ_2 norm — and tweaking Alice's query — allows us to get a 1 ± Θ(ε) gap in nearest-neighbor distance between the "yes" and "no" instances of (1/ε², n)-Disjointness. This immediately yields (via Theorem 6.5, following the proof of Corollary 6.8) lower bounds for the (1 + ε)-approximate nearest neighbor problem in high-dimensional Euclidean space in the cell-probe model. We can extend these lower bounds to the hypercube through the following embedding lemma.

Lemma 6.10 (Embedding Lemma #2) For every δ > 0 there exists a randomized function f from R^{2n} to {0,1}^D (with possibly large D = D(δ)) such that, for every set P ⊆ R^{2n} of n points and query q ∈ R^{2n} produced by the reduction above, with probability at least 1 − 1/n,

ℓ_H(f(p), f(q)) ∈ (1 ± δ) · ℓ_2(p, q)

for every p ∈ P.
Thus Lemma 6.10 says that one can re-represent a query q and a set P of n points in R^{2n} in a high-dimensional hypercube so that the nearest-neighbor distance — ℓ_2 distance in the domain, Hamming distance in the range — is approximately preserved, with the approximation factor tending to 1 as the number D of dimensions tends to infinity. The Exercises outline the proof, which combines two standard facts from the theory of metric embeddings:
1. "ℓ_2 embeds isometrically into ℓ_1." For every δ > 0 and dimension D there exists a randomized function f from R^D to R^{D′}, where D′ can depend on D and δ, such that, for every set P ⊆ R^D of n points, with probability at least 1 − 1/n,

∥f(p) − f(p′)∥_1 ∈ (1 ± δ) · ∥p − p′∥_2

for all p, p′ ∈ P.
2. "ℓ_1 embeds isometrically into the (scaled) Hamming cube." For every δ > 0, there exist constants M = M(δ) and D′′ = D′′(D′, δ) and a function g : R^{D′} → {0,1}^{D′′} such that, for every set P ⊆ R^{D′},

d_H(g(p), g(p′)) = M · ∥p − p′∥_1 ± δ

for every p, p′ ∈ P.

With Lemma 6.10 in hand, we can prove Theorem 6.9 by following the argument in Corollary 6.8.
Proof of Theorem 6.9: By Lemmas 6.6 and 6.10, Alice and Bob can use a communication protocol that solves the Query-Database problem for the decision version of (1+ε)-nearest neighbors to solve the (1/ε², n)-Disjointness problem, with no additional communication (only shared randomness, to pick the random functions in Lemmas 6.6 and 6.10).15 Thus, the (randomized) asymmetric communication lower bound for the latter problem applies also to the former problem.
Since tw = O(n^{1−δ}), in the induced communication protocol for the Query-Database problem (and hence (1/ε², n)-Disjointness), Bob sends a sublinear number of bits. Theorem 6.5 then implies that Alice sends at least Ω(ε^{−2} log n) bits, and so (by Lemma 6.2) we have t log₂ s = Ω(ε^{−2} log n). Since t = O(1), this implies that s = n^{Ω(ε^{−2})}. ∎
15 Strictly speaking, we're using a generalization of Lemma 6.6 (with the same proof) where the query and point set can lie in a hypercube of arbitrarily large dimension, not just 2n.

Lecture 7
Lower Bounds in Algorithmic Game Theory
7.1 Preamble
This lecture explains some applications of communication complexity to proving lower bounds in algorithmic game theory (AGT), at the border of computer science and economics. In AGT, the natural description size of an object is often exponential in a parameter of interest, and the goal is to perform non-trivial computations in time polynomial in the parameter (i.e., polylogarithmic in the description size). As we know, communication complexity is a great tool for understanding when non-trivial computations require looking at most of the input.
7.2 The Welfare Maximization Problem
The focus of this lecture is the following optimization problem, which has been studied in AGT more than any other.
1. There are k players.
2. There is a set M of m items.
3. Each player i has a valuation v_i : 2^M → R_+. The number v_i(T) indicates i's value, or willingness to pay, for the items T ⊆ M. The valuation is the private input of player i — i knows v_i but none of the other v_j's. We assume that v_i(∅) = 0 and that the valuations are monotone, meaning v_i(S) ≤ v_i(T) whenever S ⊆ T. To avoid bit complexity issues, we'll also assume that all of the v_i(T)'s are integers with description length polynomial in k and m.
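To make the setup concrete, here is a small illustrative sketch (the representation and helper names are my own, not from the notes) that stores a valuation as a table over all subsets and checks the normalization and monotonicity conditions just stated — note the 2^m-entry description:

```python
from itertools import combinations

M = frozenset({'a', 'b', 'c'})   # a toy item set (m = 3)

def subsets(items):
    """Enumerate all subsets of an item set as frozensets."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

# A toy valuation v : 2^M -> nonnegative integers, here v(T) = |T|^2.
v = {T: len(T) ** 2 for T in subsets(M)}

def is_valid_valuation(v, M):
    """Check v(empty) = 0 and monotonicity: v(S) <= v(T) whenever S is a subset of T."""
    if v[frozenset()] != 0:
        return False
    for S in subsets(M):
        for T in subsets(M):
            if S <= T and v[S] > v[T]:
                return False
    return True

assert is_valid_valuation(v, M)
# The description length is exponential in m: one number per subset.
assert len(v) == 2 ** len(M)
```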
Note that we may have more than two players — more than just Alice and Bob. Also note that the description length of a player's valuation is exponential in the number of items m. In the welfare-maximization problem, the goal is to partition the items M into sets T_1, ..., T_k to maximize, at least approximately, the welfare

∑_{i=1}^{k} v_i(T_i),

108 Lower Bounds in Algorithmic Game Theory
using communication polynomial in k and m. Note this amount of communication is logarithmic in the sizes of the private inputs.
The main motivation for this problem is combinatorial auctions. Already in the domain of government spectrum auctions, dozens of such auctions have raised hundreds of billions of dollars of revenue. They have also been used for other applications such as allocating take-off and landing slots at airports. For example, items could represent licenses for wireless spectrum — the right to use a certain frequency range in a certain geographic area. Players would then be wireless telecommunication companies. The value vi(S) would be the amount of profit company i expects to be able to extract from the licenses in S.
Designing good combinatorial auctions requires careful attention to “incentive issues,” making the auctions as robust as possible to strategic behavior by the (self-interested) participants. Incentives won’t play much of a role in this lecture. Our lower bounds for protocols in Section 7.4 apply even in the ideal case where players are fully cooperative. Our lower bounds for equilibria in Section 7.5 effectively apply no matter how incentive issues are resolved.
7.3 Multi-Party Communication Complexity

7.3.1 The Model
Welfare-maximization problems have an arbitrary number k of players, so lower bounds for them follow most naturally from lower bounds for multi-party communication protocols. The extension from two to many parties proceeds as one would expect, so we’ll breeze through the relevant points without much fuss.
Suppose we want to compute a Boolean function f : {0,1}^{n_1} × {0,1}^{n_2} × ··· × {0,1}^{n_k} → {0,1} that depends on the k inputs x_1, ..., x_k. We'll be interested in the number-in-hand (NIH) model, where player i only knows x_i. What other model could there be, you ask? There's also the stronger number-on-forehead (NOF) model, where player i knows everything except x_i. (Hence the name — imagine the players are sitting in a circle.) The NOF model is studied mostly for its connections to circuit complexity; it has few direct algorithmic applications, so we won't discuss it in this course. The NIH model is the natural one for our purposes and, happily, it's also much easier to prove strong lower bounds in it.

Deterministic protocols are defined as you would expect, with the protocol specifying whose turn it is to speak (as a function of the protocol's transcript-so-far) and when the computation is complete. We'll use the blackboard model, where we think of the bits sent by each player as being written on a blackboard in public view.1 Similarly, in a nondeterministic protocol, the prover writes a proof on the blackboard, and the protocol accepts the input if and only if all k players accept the proof.
1In the weaker message-passing model, players communicate by point-to-point messages rather than via broadcast.

7.3.2 The Multi-Disjointness Problem
We need a problem that is hard for multi-party communication protocols. An obvious idea is to use an analog of Disjointness. There is some ambiguity about how to define a version of Disjointness for three or more players. For example, suppose there are three players, and amongst the three possible pairings of them, two have disjoint sets while the third has intersecting sets. Should this count as a "yes" or "no" instance? We'll skirt this issue by worrying only about unambiguous inputs that are either "totally disjoint" or "totally intersecting."
Formally, in the Multi-Disjointness problem, each of the k players i holds an input x_i ∈ {0,1}^n. (Equivalently, a set S_i ⊆ {1, 2, ..., n}.) The task is to correctly identify inputs that fall into one of the following two cases:

(1) "Totally disjoint," with S_i ∩ S_{i′} = ∅ for every i ≠ i′.

(0) "Totally intersecting," with ∩_{i=1}^{k} S_i ≠ ∅.
When k = 2, this is just Disjointness. When k > 2, there are inputs that are neither 1-inputs nor 0-inputs. We let protocols off the hook on such ambiguous inputs — they can answer “1” or “0” with impunity.
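As a concrete illustration (my own sketch, not from the notes), a brute-force classifier for Multi-Disjointness inputs — returning 1, 0, or "ambiguous" — looks like this:

```python
def classify(sets):
    """sets: list of k sets S_1, ..., S_k over {1, ..., n}.
    Returns 1 (totally disjoint), 0 (totally intersecting), or None (ambiguous)."""
    k = len(sets)
    totally_disjoint = all(
        sets[i].isdisjoint(sets[j]) for i in range(k) for j in range(i + 1, k)
    )
    if totally_disjoint:
        return 1
    if set.intersection(*sets):    # a common element exists
        return 0
    return None                    # protocols may answer arbitrarily here

assert classify([{1, 2}, {3}, {4, 5}]) == 1
assert classify([{1, 2}, {1, 3}, {1, 4}]) == 0
# One intersecting pair but no common element: neither a 1-input nor a 0-input.
assert classify([{1, 2}, {1, 3}, {4}]) is None
```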
In the next section, we’ll prove the following communication complexity lower bound for Multi-Disjointness, credited to Jaikumar Radhakrishnan and Venkatesh Srinivasan in Nisan (2002).
Theorem 7.1 The nondeterministic communication complexity of Multi-Disjointness, with k players with n-bit inputs, is Ω(n/k).
The nondeterministic lower bound is for verifying a 1-input. (It is easy to verify a 0-input — the prover just suggests the index of an element r in ∩_{i=1}^{k} S_i, the validity of which is easily checked privately by each of the players.)
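The easy direction can be sketched as a simulation (an illustration with invented helper names; the communication cost is just the ⌈log₂ n⌉ bits of the prover's suggested index):

```python
import math

def verify_zero_input(sets, n, r):
    """Nondeterministic verification of a 0-input: the prover writes an index r
    (about log2(n) bits) on the blackboard; each player privately checks r in S_i,
    and the protocol accepts iff all k players accept."""
    cost_bits = math.ceil(math.log2(n))
    return all(r in S for S in sets), cost_bits

# A totally intersecting input: every set contains 4, so the proof r = 4 is accepted.
accept, bits = verify_zero_input([{1, 4}, {2, 4}, {3, 4}], n=8, r=4)
assert accept and bits == 3

# On a totally disjoint input, no proof is ever accepted.
disjoint = [{1}, {2}, {3}]
assert not any(verify_zero_input(disjoint, n=8, r=r)[0] for r in range(1, 9))
```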
In our application in Section 7.4, we'll be interested in the case where k is much smaller than n, such as k = Θ(log n). Intuition might suggest that the lower bound should be Ω(n) rather than Ω(n/k), but this is incorrect — a slightly non-trivial argument shows that Theorem 7.1 is tight for nondeterministic protocols (for all small enough k, like k = O(√n)). See the Homework for details. This factor-k difference won't matter for our applications, however.
7.3.3 Proof of Theorem 7.1
The proof of Theorem 7.1 has three steps, all of which are generalizations of familiar arguments.
Step 1: Every deterministic protocol with communication cost c induces a partition of M(f) into at most 2^c monochromatic boxes. By "M(f)," we mean the k-dimensional array in

which the ith dimension is indexed by the possible inputs of player i, and an array entry contains the value of the function f on the corresponding joint input. By a “box,” we mean the k-dimensional generalization of a rectangle — a subset of inputs that can be written as a product A1 × A2 × · · · × Ak. By “monochromatic,” we mean a box that does not contain both a 1-input and a 0-input. (Recall that for the Multi-Disjointness problem there are also wildcard (“*”) inputs — a monochromatic box can contain any number of these.)
The proof of this step is the same as in the two-party case. We just run the protocol and keep track of the joint inputs that are consistent with the transcript. The box of all inputs is consistent with the empty transcript, and the box structure is preserved inductively: when player i speaks, it narrows down the remaining possibilities for the input xi, but has no effect on the possible values of the other inputs. Thus every transcript corresponds to a box, with these boxes partitioning M(f). Since the protocol’s output is constant over such a box and the protocol computes f, all of the boxes it induces are monochromatic with respect to M(f).
Similarly, every nondeterministic protocol with communication cost c (for verifying 1-inputs) induces a cover of the 1-inputs of M(f) by at most 2^c monochromatic boxes.
Step 2: The number of 1-inputs in M(f) is (k + 1)^n. This step and the next are easy generalizations of our second proof of the nondeterministic communication complexity lower bound for Disjointness (from Section 5.4.5): first we lower bound the number of 1-inputs, then we upper bound the number of 1-inputs that can coexist in a single 1-box. In a 1-input (x_1, ..., x_k), for every coordinate l, at most one of the k inputs has a 1 in the lth coordinate. This yields k + 1 options for each of the n coordinates, thereby generating a total of (k + 1)^n 1-inputs.
Step 3: The number of 1-inputs in a monochromatic box is at most k^n. Let B = A_1 × A_2 × ··· × A_k be a 1-box. The key claim here is: for each coordinate l = 1, ..., n, there is a player i ∈ {1, ..., k} such that, for every input x_i ∈ A_i, the lth coordinate of x_i is 0. That is, to each coordinate we can associate an "ineligible player" that, in this box, never has a 1 in that coordinate. This is easily seen by contradiction: otherwise, there exists a coordinate l such that, for every player i, there is an input x_i ∈ A_i with a 1 in the lth coordinate. Since B is a box, this means that B contains the input (x_1, ..., x_k). But this is a 0-input, contradicting the assumption that B is a 1-box.

The claim implies the stated upper bound. Every 1-input of B can be generated by choosing, for each coordinate l, an assignment of at most one "1" in this coordinate to one of the k − 1 eligible players for this coordinate. With only k choices per coordinate, there are at most k^n 1-inputs in the box B.
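For small k and n, the count from Step 2 can be confirmed by brute force (an illustrative check, not part of the notes):

```python
from itertools import product

def count_one_inputs(k, n):
    """Count joint inputs (x_1, ..., x_k) of n-bit strings that are totally disjoint."""
    count = 0
    for xs in product(range(2 ** n), repeat=k):
        # Totally disjoint: in each coordinate at most one player has a 1,
        # i.e. the k bit strings are pairwise AND-free.
        if all(xs[i] & xs[j] == 0 for i in range(k) for j in range(i + 1, k)):
            count += 1
    return count

# Step 2 predicts (k + 1)^n totally disjoint inputs.
for k, n in [(2, 3), (3, 2), (2, 4)]:
    assert count_one_inputs(k, n) == (k + 1) ** n
```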
Conclusion: Steps 2 and 3 imply that covering the 1-inputs of the k-dimensional array of the Multi-Disjointness function requires at least (1 + 1/k)^n 1-boxes. By the discussion in Step 1, this implies a lower bound of n log₂(1 + 1/k) = Θ(n/k) on the nondeterministic communication complexity of the Multi-Disjointness function (for the output 1). This

concludes the proof of Theorem 7.1.
Remark 7.2 (Randomized Communication Complexity of Multi-Disjointness) Randomized protocols with two-sided error also require communication Ω(n/k) to solve Multi-Disjointness (Gronemeier, 2009; Chakrabarti et al., 2003).2 This generalizes the Ω(n) lower bound that we stated (but did not prove) in Theorem 4.11, so naturally we're not going to prove this lower bound either. Extending the lower bound for Disjointness to Multi-Disjointness requires significant work, but it is a smaller step than proving from scratch a linear lower bound for Disjointness (Kalyanasundaram and Schnitger, 1992; Razborov, 1992). This is especially true if one settles for the weaker lower bound of Ω(n/k^4) (Alon et al., 1999), which is good enough for our purposes in this lecture.
7.4 Lower Bounds for Approximate Welfare Maximization

7.4.1 General Valuations
We now put Theorem 7.1 to work and prove that it is impossible to obtain a non-trivial approximation of the general welfare-maximization problem with a subexponential (in m) amount of communication. First, we observe that a k-approximation is trivial. The protocol is to give the full set of items M to the player with the largest v_i(M). This protocol can clearly be implemented with a polynomial amount of communication. To prove the approximation guarantee, consider a partition T_1, ..., T_k of M with the maximum-possible welfare W*. There is a player i with v_i(T_i) ≥ W*/k. The welfare obtained by our simple protocol is at least v_i(M); since we assume that valuations are monotone, this is at least v_i(T_i) ≥ W*/k.
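The trivial protocol and its guarantee can be sketched as follows (a toy instance with made-up valuations; the helper names are my own):

```python
from itertools import product

def trivial_protocol(valuations, items):
    """Give all items to the player with the largest v_i(M); a k-approximation."""
    winner = max(range(len(valuations)), key=lambda i: valuations[i](frozenset(items)))
    allocation = [frozenset()] * len(valuations)
    allocation[winner] = frozenset(items)
    return allocation

def welfare(valuations, allocation):
    return sum(v(T) for v, T in zip(valuations, allocation))

def optimal_welfare(valuations, items):
    """Brute force over all ways to assign each item to one of the k players."""
    k = len(valuations)
    best = 0
    for assignment in product(range(k), repeat=len(items)):
        parts = [frozenset(it for it, p in zip(items, assignment) if p == i)
                 for i in range(k)]
        best = max(best, welfare(valuations, parts))
    return best

items = ('a', 'b', 'c')
# Two monotone toy valuations (hypothetical numbers, not from the notes).
def v0(T): return 2 * len(T)
def v1(T): return 3 if 'a' in T else 0
valuations = [v0, v1]

opt = optimal_welfare(valuations, items)   # best: {'b','c'} to player 0, {'a'} to player 1
got = welfare(valuations, trivial_protocol(valuations, items))
assert got >= opt / len(valuations)        # the k-approximation guarantee
```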
To apply communication complexity, it is convenient to turn the optimization problem of welfare maximization into a decision problem. In the Welfare-Maximization(k) problem, the goal is to correctly identify inputs that fall into one of the following two cases:
(1) Every partition (T1,…,Tk) of the items has welfare at most 1.
(0) There exists a partition (T1, . . . , Tk) of the items with welfare at least k.
Clearly, communication lower bounds for Welfare-Maximization(k) apply more generally to the problem of obtaining a better-than-k-approximation of the maximum welfare.
We prove the following.
Theorem 7.3 (Nisan 2002) The communication complexity of Welfare-Maximization(k) is exp{Ω(m/k²)}.
2There is also a far-from-obvious matching upper bound of O(n/k) (Håstad and Wigderson, 2007; Chakrabarti et al., 2003).

112 Lower Bounds in Algorithmic Game Theory
Thus, if the number of items m is at least k^{2+ε} for some ε > 0, then the communication complexity of the Welfare-Maximization(k) problem is exponential. Because the proof is a reduction from Multi-Disjointness, the lower bound applies to deterministic protocols, nondeterministic protocols (for the output 1), and randomized protocols with two-sided error.
The proof of Theorem 7.3 relies on Theorem 7.1 and a combinatorial gadget. We construct this gadget using the probabilistic method. As a thought experiment, consider t random partitions P^1, ..., P^t of M, where t is a parameter to be defined later. By a random partition P^j = (P^j_1, ..., P^j_k), we just mean that each of the m items is assigned to exactly one of the k players, independently and uniformly at random.
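This thought experiment can be simulated directly. The sketch below (with parameter choices of my own) draws t random partitions and checks the intersecting-family property analyzed next — that P^j_i and P^l_{i′} intersect whenever i ≠ i′ and j ≠ l:

```python
import random

def random_partition(m, k, rng):
    """Assign each of m items to one of k players independently and uniformly."""
    parts = [set() for _ in range(k)]
    for item in range(m):
        parts[rng.randrange(k)].add(item)
    return parts

def is_intersecting_family(partitions, k):
    """Check that class i of partition j intersects class i' of partition l
    whenever i != i' and j != l."""
    t = len(partitions)
    for j in range(t):
        for l in range(t):
            if j == l:
                continue
            for i in range(k):
                for i2 in range(k):
                    if i != i2 and not (partitions[j][i] & partitions[l][i2]):
                        return False
    return True

rng = random.Random(0)
m, k, t = 200, 3, 5   # m far larger than k^2, t small
family = [random_partition(m, k, rng) for _ in range(t)]
# The failure probability k^2 * t^2 * e^{-m/k^2} is on the order of 1e-7 here,
# so the random family is overwhelmingly likely to be intersecting.
assert is_intersecting_family(family, k)
```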
We are interested in the probability that two classes of different partitions intersect: for all i ≠ i′ and j ≠ l, since the probability that a given item is assigned to i in P^j and also to i′ in P^l is 1/k², we have

Pr[ P^j_i ∩ P^l_{i′} = ∅ ] = (1 − 1/k²)^m ≤ e^{−m/k²}.

Taking a union bound over the (at most) k² choices for i and i′ and the t² choices for j and l, we have

Pr[ ∃ i ≠ i′, j ≠ l s.t. P^j_i ∩ P^l_{i′} = ∅ ] ≤ k²t²e^{−m/k²}.   (7.1)

Call P^1, ..., P^t an intersecting family if P^j_i ∩ P^l_{i′} ≠ ∅ whenever i ≠ i′ and j ≠ l. By (7.1), the probability that our random experiment fails to produce an intersecting family is less than 1 provided t < (1/k)·e^{m/2k²}. The following lemma is immediate.

Lemma 7.4 For every m, k ≥ 1, there exists an intersecting family of partitions P^1, ..., P^t with t = exp{Ω(m/k²)}.

A simple combination of Theorem 7.1 and Lemma 7.4 proves Theorem 7.3.

Proof of Theorem 7.3: The proof is a reduction from Multi-Disjointness. Fix k and m. (To be interesting, m should be significantly bigger than k².) Let (S_1, ..., S_k) denote an input to Multi-Disjointness with t-bit inputs, where t = exp{Ω(m/k²)} is the same value as in Lemma 7.4. We can assume that the players have coordinated in advance on an intersecting family of t partitions of a set M of m items. Each player i uses this family and its input S_i to form the following valuation:

v_i(T) = 1 if T ⊇ P^j_i for some j ∈ S_i, and v_i(T) = 0 otherwise.

That is, player i is either happy (value 1) or unhappy (value 0), and is happy if and only if it receives all of the items in the corresponding class P^j_i of some partition P^j with index j belonging to its input to Multi-Disjointness. The valuations v_1, ..., v_k define an input to Welfare-Maximization(k). Forming this input requires no communication between the players.

Consider the case where the input to Multi-Disjointness is a 1-input, with S_i ∩ S_{i′} = ∅ for every i ≠ i′. We claim that the induced input to Welfare-Maximization(k) is a 1-input, with maximum welfare at most 1. To see this, consider a partition (T_1, ..., T_k) in which some player i is happy (with v_i(T_i) = 1). For some j ∈ S_i, player i receives all the items in P^j_i. Since j ∉ S_{i′} for every i′ ≠ i, the only way to make a second player i′ happy is to give it all the items in P^l_{i′} for some other partition P^l with l ∈ S_{i′} (and hence l ≠ j). Since P^1, ..., P^t is an intersecting family, this is impossible — P^j_i and P^l_{i′} overlap for every l ≠ j.
When the input to Multi-Disjointness is a 0-input, with an element r in the mutual intersection ∩_{i=1}^{k} S_i, we claim that the induced input to Welfare-Maximization(k) is a 0-input, with maximum welfare at least k. This is easy to see: for i = 1, 2, ..., k, assign the items of P^r_i to player i. Since r ∈ S_i for every i, this makes all k players happy.

This reduction shows that a (deterministic, nondeterministic, or randomized) protocol for Welfare-Maximization(k) yields one for Multi-Disjointness (with t-bit inputs) with the same communication. We conclude that the communication complexity of Welfare-Maximization(k) is Ω(t/k) = exp{Ω(m/k²)}. ∎

7.4.2 Subadditive Valuations

To an algorithms person, Theorem 7.3 is depressing, as it rules out any non-trivial positive results. A natural idea is to seek positive results by imposing additional structure on players' valuations. Many such restrictions have been studied. We consider here the case of subadditive valuations, where each v_i satisfies v_i(S ∪ T) ≤ v_i(S) + v_i(T) for every pair S, T ⊆ M.

Our reduction in Theorem 7.3 immediately yields a weaker inapproximability result for welfare maximization with subadditive valuations. Formally, define the Welfare-Maximization(2) problem as that of identifying inputs that fall into one of the following two cases:

(1) Every partition (T_1, ..., T_k) of the items has welfare at most k + 1.

(0) There exists a partition (T_1, ..., T_k) of the items with welfare at least 2k.

Communication lower bounds for Welfare-Maximization(2) apply to the problem of obtaining a better-than-2-approximation of the maximum welfare.

Corollary 7.5 (Dobzinski et al. 2010) The communication complexity of Welfare-Maximization(2) is exp{Ω(m/k²)}, even when all players have subadditive valuations.

Proof: Picking up where the reduction in the proof of Theorem 7.3 left off, every player i adds 1 to its valuation for every non-empty set of items.
Thus, the previously 0-1 valuations become 0-1-2 valuations that are only 0 for the empty set. Such functions always satisfy the subadditivity condition (v_i(S ∪ T) ≤ v_i(S) + v_i(T)). 1-inputs and 0-inputs of Multi-Disjointness now become 1-inputs and 0-inputs of Welfare-Maximization(2), respectively. The communication complexity lower bound follows. ∎

There is also a quite non-trivial matching upper bound of 2 for deterministic, polynomial-communication protocols (Feige, 2009).

7.5 Lower Bounds for Equilibria

The lower bounds of the previous section show that every protocol for the welfare-maximization problem that interacts with the players and then explicitly computes an allocation has either a bad approximation ratio or high communication cost. Over the past five years, many researchers have aimed to shift the work from the protocol to the players, by analyzing the equilibria of simple auctions. Can such equilibria bypass the communication complexity lower bounds proved in Section 7.4? The answer is not obvious, because equilibria are defined non-constructively, and not through a low-communication protocol.3

7.5.1 Game Theory

Next we give the world's briefest-ever game theory tutorial. See e.g. Shoham and Leyton-Brown (2010), or the instructor's CS364A lecture notes, for a more proper introduction. We'll be brief because the details of these concepts do not play a first-order role in the arguments below.

Games

A (finite, normal-form) game is specified by:

1. A finite set of k ≥ 2 players.

2. For each player i, a finite action set A_i.

3. For each player i, a utility function u_i(a) that maps an action profile a ∈ A_1 × ··· × A_k to a real number.

The utility of a player generally depends not only on its own action, but also on those chosen by the other players. For example, in "Rock-Paper-Scissors (RPS)," there are two players, each with three actions. A natural choice of utility functions is depicted in Figure 7.1.
3 This question was bothering your instructor back in CS364B (Winter '14) — hence, Theorem 7.9.

Figure 7.1 Player utilities in Rock-Paper-Scissors. The pair of numbers in a matrix entry denote the utilities of the row and column players, respectively, in a given outcome:

             Rock    Paper   Scissors
  Rock       0, 0    -1, 1   1, -1
  Paper      1, -1   0, 0    -1, 1
  Scissors   -1, 1   1, -1   0, 0

For a more complex and relevant example of a game, consider simultaneous first-price auctions (S1As). There are k players. An action a_i of a player i constitutes a bid b_{ij} on each item j of a set M of m items.4 In a S1A, each item is sold separately in parallel using a "first-price auction" — the item is awarded to the highest bidder, and the price is whatever that player bid.5 To specify the utility functions, we assume that each player i has a valuation v_i as in Section 7.2. We define

u_i(a) = v_i(S_i) − ∑_{j ∈ S_i} b_{ij},

where v_i(S_i) is the value of the items won, the sum is the price paid for them, and S_i denotes the items on which i is the highest bidder (given the bids of a).6 Note that the utility of a bidder depends both on its own action and on those of the other bidders. Having specified the players, their actions, and their utility functions, we see that an S1A is an example of a game.

Equilibria

Given a game, how should one reason about it? The standard approach is to define some notion of "equilibrium" and then study the equilibrium outcomes. There are many useful notions of equilibria (see e.g. the instructor's CS364A notes); for simplicity, we'll stick here with the most common notion, (mixed) Nash equilibria.7

A mixed strategy for a player i is a probability distribution over its actions — for example, the uniform distribution over Rock/Paper/Scissors. A Nash equilibrium is a collection σ_1, ..., σ_k of mixed strategies, one per player, so that each player is performing a "best response" to the others. To explain, adopt the perspective of player i.
We think of i as knowing the mixed strategies σ_{−i} used by the other k − 1 players (but not their coin flips). Thus, player i can compute the expected payoff of each action a_i ∈ A_i, where the expectation assumes that the other k − 1 players randomly and independently select actions from their mixed strategies. Every action that maximizes i's expected utility is a best response to σ_{−i}. Similarly, every probability distribution over best responses is again a best response (and these exhaust the best responses). For example, in Rock-Paper-Scissors, both players playing the uniform distribution yields a Nash equilibrium. (Every action of a player has expected utility 0 w.r.t. the mixed strategy of the other player, so everything is a best response.) Nash proved the following.

Theorem 7.6 (Nash 1950) In every finite game, there is at least one Nash equilibrium.

Theorem 7.6 can be derived from, and is essentially equivalent to, Brouwer's Fixed-Point Theorem. Note that a game can have a large number of Nash equilibria — if you're trying to meet a friend in New York City, with actions equal to intersections, then every intersection corresponds to a Nash equilibrium.

4 To keep the game finite, let's agree that each bid has to be an integer between 0 and some known upper bound B.

5 You may have also heard of the Vickrey or second-price auction, where the winner does not pay their own bid, but rather the highest bid by someone else (the second-highest overall). We'll stick with S1As for simplicity, but similar results are known for simultaneous second-price auctions, as well.

6 Break ties in an arbitrary but consistent way.

7 For the auction settings we study, "Bayes-Nash equilibria" are more relevant. These generalize Nash equilibria, so our lower bounds immediately apply to them.
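The claim that uniform play is a Nash equilibrium of Rock-Paper-Scissors can be checked directly (a minimal sketch of my own, using the standard RPS payoffs of Figure 7.1):

```python
from fractions import Fraction

# Row player's utilities; the game is zero-sum, so the column player's
# utility in each outcome is the negation.
ROCK, PAPER, SCISSORS = 0, 1, 2
U = [[0, -1, 1],    # Rock vs (Rock, Paper, Scissors)
     [1, 0, -1],    # Paper
     [-1, 1, 0]]    # Scissors

uniform = [Fraction(1, 3)] * 3

def expected_utility(action, opponent_mixed):
    """Expected payoff of a pure action against an opponent's mixed strategy."""
    return sum(p * U[action][b] for b, p in enumerate(opponent_mixed))

# Against the uniform strategy, every pure action has expected utility 0, so
# every action — and hence every mixed strategy, including uniform — is a
# best response. Thus (uniform, uniform) is a Nash equilibrium.
payoffs = [expected_utility(a, uniform) for a in (ROCK, PAPER, SCISSORS)]
assert payoffs == [0, 0, 0]
```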
An ε-Nash equilibrium is the relaxation of a Nash equilibrium in which no player can increase its expected utility by more than ε by switching to a different strategy. Note that the set of ε-Nash equilibria is nondecreasing with ε. Such approximate Nash equilibria seem crucial to the lower bound in Theorem 7.9, below.

The Price of Anarchy

So how good are the equilibria of various games, such as S1As? To answer this question, we use an analog of the approximation ratio, adapted for equilibria. Given a game (like an S1A) and a nonnegative maximization objective function on the outcomes (like welfare), the price of anarchy (POA) (Koutsoupias and Papadimitriou, 1999) is defined as the ratio between the objective function value of an optimal solution, and that of the worst equilibrium.8 If the equilibrium involves randomization, as with mixed strategies, then we consider its expected objective function value. The POA of a game and a maximization objective function is always at least 1. It is common to identify "good performance" of a system with strategic participants as having a POA close to 1.9 For example, the equilibria of S1As are surprisingly good in fairly general settings.

8 Recall that games generally have multiple equilibria. Ideally, we'd like an approximation guarantee that applies to all equilibria — this is the point of the POA.

9 An important issue, outside the scope of these notes, is the plausibility of a system reaching an equilibrium. A natural solution is to relax the notion of equilibrium enough so that it becomes "relatively easy" to reach an equilibrium. See e.g. the instructor's CS364A notes for much more on this point.

Theorem 7.7 (Feldman et al. 2013) In every S1A with subadditive bidder valuations, the POA is at most 2.
This result is particularly impressive because achieving an approximation factor of 2 for the welfare-maximization problem with subadditive bidder valuations by any means (other than brute-force search) is not easy (see Feige (2009)). A recent result shows that the analysis of Feldman et al. (2013) is tight. Theorem 7.8 (Christodoulou et al. 2013) The worst-case POA of S1As with subaddi- tive bidder valuations is at least 2. The proof of Theorem 7.8 is an ingenious explicit construction — the authors exhibit a choice of subadditive bidder valuations and a Nash equilibrium of the corresponding S1A so that the welfare of this equilibrium is only half of the maximum possible. One reason that proving results like Theorem 7.8 is challenging is that it can be difficult to solve for a (bad) equilibrium of a complex game like a S1A. 7.5.2 Price-of-Anarchy Lower Bounds from Communication Complexity Theorem 7.7 motivates an obvious question: can we do better? Theorem 7.8 implies that the analysis in Feldman et al. (2013) cannot be improved, but can we reduce the POA by considering a different auction? Ideally, the auction would still be “reasonably simple” in some sense. Alternatively, perhaps no “simple” auction could be better than S1As? If this is the case, it’s not clear how to prove it directly — proving lower bounds via explicit constructions auction-by-auction does not seem feasible. Perhaps it’s a clue that the POA upper bound of 2 for S1As (Theorem 7.7) gets stuck at the same threshold for which there is a lower bound for protocols that use polynomial commu- nication (Theorem 7.5). It’s not clear, however, that a lower bound for low-communication protocols has anything to do with equilibria. In the spirit of the other reductions that we’ve seen in this course, can we extract a low-communication protocol from an equilibrium? Theorem 7.9 (Roughgarden 2014) Fix a class V of possible bidder valuations. 
Suppose there exists no nondeterministic protocol with subexponential (in m) communication for the 1-inputs of the following promise version of the welfare-maximization problem with bidder valuations in V:

(1) Every allocation has welfare at most W*/α.

(0) There exists an allocation with welfare at least W*.

Let ε be bounded below by some inverse polynomial function of n and m. Then, for every auction with sub-doubly-exponential (in m) actions per player, the worst-case POA of ε-Nash equilibria with bidder valuations in V is at least α.

Theorem 7.9 says that lower bounds for nondeterministic protocols carry over to all "sufficiently simple" auctions, where "simplicity" is measured by the number of actions available to each player. These POA lower bounds follow from communication complexity lower bounds, and do not require any new explicit constructions. To get a feel for the simplicity constraint, note that S1As with integral bids between 0 and B have (B + 1)^m actions per player — singly exponential in m. On the other hand, in a "direct-revelation" auction, where each bidder is allowed to submit a bid on each bundle S ⊆ M of items, each player has a doubly-exponential (in m) number of actions.10

The POA lower bound promised by Theorem 7.9 is only for ε-Nash equilibria; since the POA is a worst-case measure and the set of ε-Nash equilibria is nondecreasing with ε, this is weaker than a lower bound for exact Nash equilibria. It is an open question whether or not Theorem 7.9 holds also for the POA of exact Nash equilibria. Arguably, Theorem 7.9 is good enough for all practical purposes — a POA upper bound that holds for exact Nash equilibria and does not hold (at least approximately) for ε-Nash equilibria with very small ε is too brittle to be meaningful.

Theorem 7.9 has a number of interesting corollaries. First, since S1As have only a singly-exponential (in m) number of actions per player, Theorem 7.9 applies to them.
Thus, combining it with Theorem 7.5 recovers the POA lower bound of Theorem 7.8 — modulo the exact vs. approximate Nash equilibria issue — and shows the optimality of the upper bound in Theorem 7.7 without an explicit construction. More interestingly, this POA lower bound of 2 (for subadditive bidder valuations) applies not only to S1As, but more generally to all auctions in which each player has a sub-doubly-exponential number of actions. Thus, S1As are in fact optimal among the class of all such auctions when bidders have subadditive valuations (w.r.t. the worst-case POA of ε-Nash equilibria).

We can also combine Theorem 7.9 with Theorem 7.3 to prove that no “simple” auction gives a non-trivial (better than k-) approximation for general bidder valuations. Thus with general valuations, complexity is essential to any auction format that offers good equilibrium guarantees.

7.5.3 Proof of Theorem 7.9

Presumably, the proof of Theorem 7.9 extracts a low-communication protocol from a good POA bound. The hypothesis of Theorem 7.9 offers the clue that we should be looking to construct a nondeterministic protocol. So what could we use an all-powerful prover for? We’ll see that a good role for the prover is to suggest a Nash equilibrium to the players.

Unfortunately, it’s too expensive for the prover to even write down the description of a Nash equilibrium, even in S1As. Recall that a mixed strategy is a distribution over actions, and that each player has an exponential (in m) number of actions available in a S1A. Specifying a Nash equilibrium thus requires an exponential number of probabilities.

10Equilibria can achieve the optimal welfare in direct-revelation mechanisms, so the bound in Theorem 7.9 on the number of actions is necessary. See the Exercises for further details.

7.5 Lower Bounds for Equilibria

To circumvent this issue, we resort to ε-Nash equilibria, which are guaranteed to exist even if we restrict ourselves to distributions with small descriptions.
Lemma 7.10 (Lipton et al. 2003) For every ε > 0 and every game with k players with action sets A1, . . . , Ak, there exists an ε-Nash equilibrium with description length polynomial in k, log(maxi |Ai|), and 1/ε.
We give the high-level idea of the proof of Lemma 7.10; see the Exercises for details.
1. Let (σ1, . . . , σk) be a Nash equilibrium. (One exists, by Nash’s Theorem.)
2. Run T independent trials of the following experiment: draw actions at1 ∼ σ1, . . . , atk ∼ σk for the k players independently, according to their mixed strategies in the Nash equilibrium.
3. For each i, define σˆi as the empirical distribution of the ati’s. (With the probability of ai in σˆi equal to the fraction of trials in which i played ai.)
4. Use Chernoff bounds to prove that, if T is at least a sufficiently large polynomial in k, log(maxi |Ai|), and 1/ε, then with high probability (σˆ1, . . . , σˆk) is an ε-Nash equilibrium. Note that the natural description length of (σˆ1, . . . , σˆk) — for example, just by listing all of the sampled actions — is polynomial in k, log(maxi |Ai|), and 1/ε.
The intuition is that, for T sufficiently large, expectations with respect to σi and with respect to σˆi should be roughly the same. Since there are |Ai| relevant expectations per player (the expected utility of each of its actions) and Chernoff bounds give deviation probabilities that have an inverse exponential form, we might expect a log |Ai| dependence to show up in the number of trials.
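Steps 1–4 can be sketched concretely in code. Everything in the sketch below is invented for illustration: the game (matching pennies, whose exact Nash equilibrium is uniform play by both players), the trial count T, and the tolerance 0.1. The point is only that the empirical profile (σˆ1, σˆ2) built from the samples is an ε-Nash equilibrium with high probability.

```python
import random

# Toy 2-player game (matching pennies): u1 is player 1's payoff; the game is
# zero-sum, so player 2's payoff is -u1. The exact Nash equilibrium is
# (uniform, uniform), which plays the role of (sigma_1, sigma_2) in step 1.
def u1(a1, a2):
    return 1.0 if a1 == a2 else -1.0

def empirical(samples, actions):
    # Step 3: sigma-hat_i puts on each action the fraction of trials it was drawn.
    return {a: samples.count(a) / len(samples) for a in actions}

def max_deviation_gain(p1, p2):
    # Largest amount either player can gain by switching to a pure action.
    acts = [0, 1]
    exp1 = lambda a: sum(p2[b] * u1(a, b) for b in acts)
    exp2 = lambda b: sum(p1[a] * -u1(a, b) for a in acts)
    cur1 = sum(p1[a] * exp1(a) for a in acts)
    cur2 = sum(p2[b] * exp2(b) for b in acts)
    return max(max(exp1(a) for a in acts) - cur1,
               max(exp2(b) for b in acts) - cur2)

random.seed(0)
T = 20000  # step 2: number of independent trials; polynomial in 1/eps suffices
s1 = [random.randint(0, 1) for _ in range(T)]  # draws from sigma_1
s2 = [random.randint(0, 1) for _ in range(T)]  # draws from sigma_2
p1, p2 = empirical(s1, [0, 1]), empirical(s2, [0, 1])

# (p1, p2) is described by just the T sampled actions and, with high
# probability, is an eps-Nash equilibrium for small eps (here eps = 0.1).
print("max deviation gain:", max_deviation_gain(p1, p2))
```

With T = 20000 samples, the empirical frequencies are within roughly 1/√T of 1/2, so the best deviation gain is far below the tolerance.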
We now proceed to the proof of Theorem 7.9.
Proof of Theorem 7.9: Fix an auction with at most A actions per player, and a value for ε = Ω(1/poly(k, m)). Assume that, no matter what the bidder valuations v1, . . . , vk ∈ V are, the POA of ε-Nash equilibria of the auction is at most ρ < α. We will show that A must be doubly-exponential in m.

Consider the following nondeterministic protocol for computing a 1-input of the welfare-maximization problem — for convincing the k players that every allocation has welfare at most W∗/α. See also Figure 7.2. The prover writes on a publicly visible blackboard an ε-Nash equilibrium (σ1, . . . , σk) of the auction, with description length polynomial in k, log A, and 1/ε = O(poly(k, m)), as guaranteed by Lemma 7.10. The prover also writes down the expected welfare contribution E[vi(S)] of each bidder i in this equilibrium.

Given this advice, each player i verifies that σi is indeed an ε-best response to the other σj’s and that its expected welfare is as claimed when all players play the mixed strategies σ1, . . . , σk. Crucially, player i is fully equipped to perform both of these checks without any communication — it knows its valuation vi (and hence its utility in each outcome of the game) and the mixed strategies used by all players, and this is all that is needed to verify the ε-Nash equilibrium conditions that apply to it and to compute its expected contribution to the welfare.11 Player i accepts if and only if the prover’s advice passes these two tests, and if the expected welfare of the equilibrium is at most W∗/α.

Figure 7.2 Proof of Theorem 7.9. How to extract a low-communication nondeterministic protocol from a good price-of-anarchy bound.

For the protocol correctness, consider first the case of a 1-input, where every allocation has welfare at most W∗/α.
If the prover writes down the description of an arbitrary ε-Nash equilibrium and the appropriate expected contributions to the social welfare, then all of the players will accept (the expected welfare is obviously at most W∗/α). We also need to argue that, for the case of a 0-input — where some allocation has welfare at least W∗ — there is no proof that causes all of the players to accept. We can assume that the prover writes down an ε-Nash equilibrium and its correct expected welfare W, since otherwise at least one player will reject. Since the maximum-possible welfare is at least W∗ and (by assumption) the POA of ε-Nash equilibria is at most ρ < α, the expected welfare of the given ε-Nash equilibrium must satisfy W ≥ W∗/ρ > W∗/α. Since the players will reject such a proof, we conclude that the protocol is correct. Our assumption then implies that the protocol has communication cost exponential in m. Since the cost of the protocol is polynomial in k, m, and log A, A must be doubly exponential in m. □
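Player i’s three acceptance checks can be mimicked on a toy example. Everything below is hypothetical: the two-bidder utility tables, the simplifying convention that welfare contributions coincide with utilities (which need not hold in a real auction), and the threshold playing the role of W∗/α. It is a sketch of the verification step only, not of an actual S1A.

```python
from itertools import product

# Hypothetical toy instance with two players and actions {0, 1}. utilities[i]
# gives player i's utility for each action profile; for simplicity the welfare
# contribution v_i(S_i) is taken to equal the utility (an invented convention).
utilities = [{(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 1.0},
             {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.0, (1, 1): 1.0}]
welfare = [dict(utilities[0]), dict(utilities[1])]

def expect(table, sigma):
    # Expectation of a payoff table under a product mixed-strategy profile.
    return sum(sigma[0][a] * sigma[1][b] * table[(a, b)]
               for a, b in product([0, 1], repeat=2))

def deviate(sigma, i, a):
    s = list(sigma)
    s[i] = {a: 1.0, 1 - a: 0.0}  # replace sigma_i by the pure action a
    return s

def player_accepts(i, sigma, claimed, threshold, eps):
    # (1) sigma_i is an eps-best response to the other sigma_j's.
    cur = expect(utilities[i], sigma)
    best = max(expect(utilities[i], deviate(sigma, i, a)) for a in (0, 1))
    # (2) player i's expected welfare contribution matches the prover's claim.
    honest = abs(expect(welfare[i], sigma) - claimed[i]) < 1e-9
    # (3) the claimed total equilibrium welfare is at most the threshold W*/alpha.
    return best - cur <= eps and honest and sum(claimed) <= threshold

# The prover's advice: the pure profile (1, 1), which is an exact Nash
# equilibrium of this toy game, plus the correct welfare contributions.
sigma = [{0: 0.0, 1: 1.0}, {0: 0.0, 1: 1.0}]
claimed = [expect(welfare[i], sigma) for i in (0, 1)]

print(all(player_accepts(i, sigma, claimed, threshold=2.0, eps=0.0)
          for i in (0, 1)))
```

With a stricter threshold (e.g., 1.5 instead of 2.0), the claimed welfare exceeds the bound and the players reject, mirroring the 0-input case in the proof.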
Conceptually, the proof of Theorem 7.9 argues that, when the POA of ε-Nash equilibria is small, every ε-Nash equilibrium provides a privately verifiable proof of a good upper bound
11These computations may take a super-polynomial amount of time, but they do not contribute to the protocol’s cost.
[Figure 7.2: the two cases are (i) OPT ≤ W∗/α and (ii) OPT ≥ W∗. The prover writes down an ε-NE x; the players privately verify the ε-NE conditions and compute the expected welfare of x. If E[welfare(x)] ≤ W∗/α then OPT ≤ ρW∗/α < W∗ (so case (i)); if E[welfare(x)] > W∗/α then OPT > W∗/α (so case (ii)).]
on the maximum-possible welfare. When such upper bounds require large communication,
the equilibrium description length (and hence the number of actions) must be large.
7.5.4 An Open Question
While Theorems 7.5, 7.7, and 7.9 pin down the best-possible POA achievable by simple auctions with subadditive bidder valuations, there are still open questions for other valuation classes. For example, a valuation vi is submodular if it satisfies
vi(T ∪ {j}) − vi(T) ≤ vi(S ∪ {j}) − vi(S)
for every S ⊆ T ⊂ M and j ∉ T. This is a “diminishing returns” condition for set functions. Every submodular function is also subadditive, so welfare-maximization with the former valuations is only easier than with the latter.
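The diminishing-returns condition is easy to verify by brute force on small ground sets. The two example valuations below are invented for illustration: a budget-additive function (which is submodular) and a supermodular one (which is not).

```python
from itertools import chain, combinations

def subsets(ground):
    # All subsets of the ground set, as an iterator of tuples.
    return chain.from_iterable(combinations(ground, r)
                               for r in range(len(ground) + 1))

def is_submodular(v, ground):
    # Check v(T + {j}) - v(T) <= v(S + {j}) - v(S) for all S <= T and j not in T.
    for T in map(frozenset, subsets(ground)):
        for S in map(frozenset, subsets(T)):
            for j in ground - T:
                if v(T | {j}) - v(T) > v(S | {j}) - v(S) + 1e-12:
                    return False
    return True

ground = frozenset({1, 2, 3})
budget_additive = lambda S: min(len(S), 2)  # submodular (marginals shrink)
supermod = lambda S: len(S) ** 2            # marginals grow: not submodular

print(is_submodular(budget_additive, ground))
print(is_submodular(supermod, ground))
```

The check is exponential in the size of the ground set, so it is only a sanity-checking tool for tiny examples.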
The worst-case POA of S1As is exactly e/(e − 1) ≈ 1.58 when bidders have submodular valuations. The upper bound was proved in Syrgkanis and Tardos (2013), the lower bound in Christodoulou et al. (2013). It is an open question whether or not there is a simple auction with a smaller worst-case POA. The best lower bound known — for nondeterministic protocols and hence, by Theorem 7.9, for the POA of ε-Nash equilibria of simple auctions — is 2e/(2e − 1) ≈ 1.23. Intriguingly, there is an upper bound (slightly) better than e/(e − 1) for polynomial-communication protocols (Feige and Vondrák, 2010) — can this better upper bound also be realized as the POA of a simple auction? What is the best-possible approximation guarantee, either for polynomial-communication protocols or for the POA of simple auctions?

Lecture 8
Lower Bounds in Property Testing
8.1 Property Testing
We begin in this section with a brief introduction to the field of property testing. Section 8.2 explains the famous example of “linearity testing.” Section 8.3 gives upper bounds for the canonical problem of “monotonicity testing,” and Section 8.4 shows how to derive property testing lower bounds from communication complexity lower bounds.1 These lower bounds will follow from our existing communication complexity toolbox (specifically, Disjointness); no new results are required.
Let D and R be a finite domain and range, respectively. In this lecture, D will always be {0,1}n, while R might or might not be {0,1}. A property is simply a set P of functions from D to R. Examples we have in mind include:
1. Linearity, where P is the set of linear functions (with R a field and D a vector space over R).
2. Monotonicity, where P is the set of monotone functions (with D and R being partially ordered sets).
3. Various graph properties, like bipartiteness (with functions corresponding to charac- teristic vectors of edge sets, with respect to a fixed vertex set).
4. And so on. The property testing literature is vast. See Ron (2010) for a starting point.
In the standard property testing model, one has “black-box access” to a function f : D → R. That is, one can only learn about f by supplying an argument x ∈ D and receiving the function’s output f(x) ∈ R. The goal is to test membership in P by querying f as few times as possible. Since the goal is to use a small number of queries (much smaller than |D|), there is no hope of testing membership exactly. For example, suppose you derive f from your favorite monotone function by changing its value at a single point to introduce a non-monotonicity. There is little hope of detecting this monotonicity violation with a small number of queries. We therefore consider a relaxed “promise” version of the membership problem.
1Somewhat amazingly, this connection was only discovered in 2011 (Blais et al., 2012), even though the connection is simple and property testing is a relatively mature field.
Formally, we say that a function f is ε-far from the property P if, for every g ∈ P, f and g differ in at least ε|D| entries. Viewing functions as vectors indexed by D with coordinates in R, this definition says that f has distance at least ε|D| from its nearest neighbor in P
(under the Hamming metric). Equivalently, repairing f so that it belongs to P would require changing at least an ε fraction of its values. A function f is ε-close to P if it is not ε-far — if it can be turned into a function in P by modifying its values on strictly less than ε|D| entries.
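For very small domains, these definitions can be checked directly by brute force. In the sketch below, the property P is the set of all monotone functions on {0, 1}^2 (monotonicity is just a convenient concrete example here), and the distance from f to P is computed exactly.

```python
from itertools import product

def hamming(f, g, domain):
    # Number of inputs on which f and g disagree, viewing functions as vectors.
    return sum(1 for x in domain if f[x] != g[x])

def distance_to_property(f, prop, domain):
    # Minimum Hamming distance from f to any member of the property.
    return min(hamming(f, g, domain) for g in prop)

n = 2
domain = list(product([0, 1], repeat=n))

def is_monotone(g):
    return all(g[x] <= g[y] for x in domain for y in domain
               if all(a <= b for a, b in zip(x, y)))

# P = all monotone functions {0,1}^2 -> {0,1}, enumerated by brute force.
all_fns = [dict(zip(domain, vals))
           for vals in product([0, 1], repeat=len(domain))]
P = [g for g in all_fns if is_monotone(g)]

# f violates monotonicity only at (0,0); a single change repairs it.
f = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 0}
d = distance_to_property(f, P, domain)
print(d, d / len(domain))  # f is eps-far from P exactly when eps <= d/|D|
```

Here d = 1 and |D| = 4, so this f is ε-far from monotone precisely for ε ≤ 1/4 and ε-close for larger ε.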
The property testing goal is to query a function f a small number of times and then decide if:
1. f ∈ P; or
2. f is ε-far from P.
If neither of these two conditions applies to f, then the tester is off the hook — any declaration is treated as correct.
A tester specifies a sequence of queries to the unknown function f, and a declaration of either “∈ P” or “ε-far from P” at its conclusion. Interesting property testing results almost always require randomization. Thus, we allow the tester to be randomized, and allow it to err with probability at most 1/3. As with communication protocols, testers come in various flavors. One-sided error means that functions in P are accepted with probability 1, with no false negative allowed. Testers with two-sided error are allowed both false positives and false negatives (with probability at most 1/3, on every input that satisfies the promise). Testers can be non-adaptive, meaning that they flip all their coins and specify all their queries up front, or adaptive, with queries chosen as a function of the answers to previously asked queries. For upper bounds, we prefer the weakest model of non-adaptive testers with 1-sided error. Often (though not always) in property testing, neither adaptivity nor two-sided error leads to more efficient testers. Lower bounds can be much more difficult to prove for adaptive testers with two-sided error, however.
For a given choice of a class of testers, the query complexity of a property P is the minimum (over testers) worst-case (over inputs) number of queries used by a tester that solves the testing problem for P. The best-case scenario is that the query complexity of a property is a function of ε only; sometimes it depends on the size of D or R as well.
8.2 Example: The BLR Linearity Test
The unofficial beginning of the field of property testing is Blum et al. (1993). (For the official beginning, see Rubinfeld and Sudan (1996) and Goldreich et al. (1998).) The setting is D = {0,1}n and R = {0,1}, and the property is the set of linear functions, meaning

functions f such that f(x + y) = f(x) + f(y) (over F2) for all x, y ∈ {0, 1}n.2 The BLR linearity test is the following:

1. Repeat t = Θ(1/ε) times:

a) Pick x, y ∈ {0, 1}n uniformly at random.

b) If f(x + y) ̸= f(x) + f(y) (over F2), then REJECT.

2. ACCEPT.
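In code, the BLR test is only a few lines. The example functions below are my own: a linear function, which always passes (one-sided error), and the AND of two coordinates, whose nearest linear function is at distance 1/4, so it is ε-far from linear for the ε used here.

```python
import random

def blr_test(f, n, eps, trials=None):
    # Non-adaptive, one-sided test: linear functions are always accepted.
    t = trials if trials is not None else max(1, int(3 / eps))
    for _ in range(t):
        x = [random.randint(0, 1) for _ in range(n)]
        y = [random.randint(0, 1) for _ in range(n)]
        xy = [(a + b) % 2 for a, b in zip(x, y)]
        if f(xy) != (f(x) + f(y)) % 2:
            return False  # REJECT: found a violated triple (x, y, x + y)
    return True           # ACCEPT

random.seed(1)
n = 6
a = [1, 0, 1, 1, 0, 1]
linear = lambda x: sum(ai * xi for ai, xi in zip(a, x)) % 2
and_fn = lambda x: x[0] * x[1]  # 1/4-far from the set of linear functions

print(blr_test(linear, n, eps=0.1))              # always True
print(blr_test(and_fn, n, eps=0.1, trials=200))  # rejects w.h.p.
```

For the AND function, a direct enumeration shows a single trial finds a violation with probability 3/8, so 200 trials reject except with vanishing probability.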
It is clear that if f is linear, then the BLR linearity test accepts it with probability 1. That is, the test has one-sided error. The test is also non-adaptive — the t random choices of x and y can all be made up front. The non-trivial statement is that only functions that are close to linear pass the test with large probability.
Theorem 8.1 (Blum et al. 1993) If the BLR linearity test accepts a function f with probability greater than 1/3, then f is ε-close to the set of linear functions.
The modern and slick proof of Theorem 8.1 uses Fourier analysis — indeed, the elegance of this proof serves as convincing motivation for the more general study of Boolean functions from a Fourier-analytic perspective. See Chapter 1 of O’Donnell (2014) for a good exposition. There are also more direct proofs of Theorem 8.1, as in Blum et al. (1993). None of these proofs are overly long, but we’ll spend our time on monotonicity testing instead. We mention the BLR test for the following reasons:
1. If you only remember one property testing result, Theorem 8.1 and the BLR linearity test would be a good one.
2. The BLR test is the thin end of the wedge in constructions of probabilistically checkable proofs (PCPs). Recall that a language is in NP if membership can be efficiently verified — for example, verifying an alleged satisfying assignment to a SAT formula is easy to do in polynomial time. The point of a PCP is to rewrite such a proof of membership so that it can be probabilistically verified after reading only a constant number of bits. The BLR test does exactly this for the special case of linearity testing — for proofs where “correctness” is equated with being the truth table of a linear function. The BLR test effectively means that one can assume without loss of generality that a proof encodes a linear function — the BLR test can be used as a preprocessing step to reject alleged proofs that are not close to a linear function. Subsequent testing steps can then focus on whether or not the encoded linear function is close to a subset of linear functions of interest.
2Equivalently, these are the functions that can be written as f(x) = ∑ni=1 aixi for some a1, . . . , an ∈ {0, 1}.

3. Theorem 8.1 highlights a consistent theme in property testing — establishing connections between “global” and “local” properties of a function. Saying that a function f is ε-far from a property P refers to the entire domain D and in this sense asserts a “global violation” of the property. Property testers work well when there are ubiquitous “local violations” of the property. Theorem 8.1 proves that, for the property of linearity, a global violation necessarily implies lots of local violations. We give a full proof of such a “global to local” statement for monotonicity testing in the next section.
8.3 Monotonicity Testing: Upper Bounds
The problem of monotonicity testing was introduced in Goldreich et al. (2000) and is one of the central problems in the field. We discuss the Boolean case, where there have been several breakthroughs in just the past few months, in Sections 8.3.1 and 8.3.2. We discuss the case of larger ranges, where communication complexity has been used to prove strong lower bounds, in Section 8.3.3.
8.3.1 The Boolean Case
In this section, we take D = {0,1}n and R = {0,1}. For b ∈ {0,1} and x−i ∈ {0,1}n−1, we use the notation (b, x−i) to denote a vector of {0, 1}n in which the ith bit is b and the other n − 1 bits are x−i. A function f : {0, 1}n → {0, 1} is monotone if flipping a coordinate of an input from 0 to 1 can only increase the function’s output:
f(0, x−i) ≤ f(1, x−i)
for every i ∈ {1,2,…,n} and x−i ∈ {0,1}n−1.
It will be useful to visualize the domain {0, 1}n as the n-dimensional hypercube; see also
Figure 8.1. This graph has 2n vertices and n2n−1 edges. An edge can be uniquely specified by a coordinate i and vector x−i ∈ {0, 1}n−1 — the edge’s endpoints are then (0, x−i) and (1,x−i). By the ith slice of the hypercube, we mean the 2n−1 edges for which the endpoints differ (only) in the ith coordinate. The n slices form a partition of the edge set of the hypercube, and each slice is a perfect matching of the hypercube’s vertices. A function
{0, 1}n → {0, 1} can be visualized as a binary labeling of the vertices of the hypercube.

We consider the following edge tester, which picks random edges of the hypercube and rejects if it ever finds a monotonicity violation across one of the chosen edges.

1. Repeat t times:

a) Pick i ∈ {1, 2, . . . , n} and x−i ∈ {0, 1}n−1 uniformly at random.

b) If f(0, x−i) > f(1, x−i) then REJECT.
2. ACCEPT.
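A sketch of the edge tester in code. The example functions are my own: a monotone threshold function (always accepted, by one-sided error) and its complement, which has many violated edges; the trial counts are arbitrary.

```python
import random

def edge_tester(f, n, trials):
    # Pick random hypercube edges; reject on any monotonicity violation.
    for _ in range(trials):
        i = random.randrange(n)
        x = [random.randint(0, 1) for _ in range(n)]
        lo, hi = list(x), list(x)
        lo[i], hi[i] = 0, 1  # the two endpoints of the chosen slice-i edge
        if f(tuple(lo)) > f(tuple(hi)):
            return False      # REJECT
    return True               # ACCEPT

random.seed(2)
n = 8
threshold = lambda x: 1 if sum(x) >= 4 else 0  # a monotone threshold function
antimono = lambda x: 1 - threshold(x)          # non-monotone everywhere it matters

print(edge_tester(threshold, n, trials=10 * n))  # always True
print(edge_tester(antimono, n, trials=10 * n))   # rejects w.h.p.
```

For the complement function, a random edge is violated whenever the other n − 1 coordinates sum to exactly 3, which happens with constant probability, so a logarithmic-in-|D| number of trials suffices here.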

Figure 8.1 {0, 1}n can be visualized as an n-dimensional hypercube.
Like the BLR test, it is clear that the edge tester has 1-sided error (no false negatives) and is non-adaptive. The non-trivial part is to understand the probability of rejecting a function that is ε-far from monotone — how many trials t are necessary and sufficient for a rejection probability of at least 2/3? Conceptually, how pervasive are the local failures of monotonicity for a function that is ε-far from monotone?
The bad news is that, in contrast to the BLR linearity test, taking t to be a constant (depending only on ε) is not good enough. The good news is that we can take t to be only
logarithmic in the size of the domain.
Theorem 8.2 (Goldreich et al. 2000) For t = Θ(n/ε), the edge tester rejects every function that is ε-far from monotone with probability at least 2/3.

Proof: A simple calculation shows that it is enough to prove that a single random trial of the edge test rejects a function that is ε-far from monotone with probability at least ε/n.
Fix an arbitrary function f. There are two quantities that we need to relate to each other — the rejection probability of f, and the distance between f and the set of monotone functions. We do this by relating both quantities to the sizes of the following sets: for i = 1,2,…,n, define
Ai = {x−i ∈ {0, 1}n−1 : f(0, x−i) > f(1, x−i)}. (8.1)

That is, Ai is the set of edges of the ith slice of the hypercube across which f violates monotonicity.
By the definition of the edge tester, the probability that a single trial rejects f is exactly

∑ni=1 |Ai| / (n2n−1), (8.2)

where the numerator counts the monotonicity violations (the violated edges) and the denominator counts the total number of edges of the hypercube.
Next, we upper bound the distance between f and the set of monotone functions, implying that the only way in which the |Ai|’s (and hence the rejection probability) can be

small is if f is close to a monotone function. To upper bound the distance, all we need to do is exhibit a single monotone function close to f. Our plan is to transform f into a monotone function, coordinate by coordinate, tracking the number of changes that we make along the way. The next claim controls what happens when we “monotonize” a single coordinate.
Key Claim: Let i ∈ {1,2,…,n} be a coordinate. Obtain f′ from f by, for each violated edge ((0,x−i),(1,x−i)) ∈ Ai of the ith slice, swapping the values of f on its endpoints (Figure 8.2). That is, set f′(0,x−i) = 0 and f′(1,x−i) = 1. (This operation is well defined because the edges of Ai are disjoint.) For every coordinate j = 1, 2, . . . , n, f ′ has no more
monotonicity violations in the jth slice than does f.
Proof of Key Claim: The claim is clearly true for j = i: by construction, the swapping operation fixes all of the monotonicity violations in the ith slice, without introducing any new violations in the ith slice. The interesting case is when j ̸= i, since new monotonicity violations can be introduced (cf., Figure 8.2). The claim asserts that the overall number of violations cannot increase (cf., Figure 8.2).
Figure 8.2 Swapping values to eliminate the monotonicity violations in the ith slice: (a) fixing the first slice; (b) fixing the second slice.

We partition the edges of the jth slice into edge pairs as follows. We use x0−j to denote an assignment to the n − 1 coordinates other than j in which the ith coordinate is 0, and x1−j the corresponding assignment in which the ith coordinate is flipped to 1. For a choice of x0−j, we can consider the “square” formed by the vertices (0,x0−j), (0,x1−j), (1,x0−j), and
(1,x1−j); see Figure 8.3. The edges ((0,x0−j),(1,x0−j)) and ((0,x1−j),(1,x1−j)) belong to the jth slice, and ranging over the 2n−2 choices for x0−j — one binary choice per coordinate other than i and j — generates each such edge exactly once.
Figure 8.3 The number of monotonicity violations on edges e3 and e4 is at least as large under f as under f′.
Fix a choice of x0−j, and label the edges of the corresponding square e1, e2, e3, e4 as in Figure 8.3. A simple case analysis shows that the number of monotonicity violations on edges e3 and e4 is at least as large under f as under f′. If neither e1 nor e2 was violated under f, then f′ agrees with f on this square and the total number of monotonicity violations is obviously the same. If both e1 and e2 were violated under f, then values were swapped along both these edges; hence e3 (respectively, e4) is violated under f′ if and only if e4 (respectively, e3) was violated under f. Next, suppose the endpoints of e1 had their values swapped, while the endpoints of e2 did not. This implies that f(0, x0−j) = 1 and f(0, x1−j) = 0, and hence f′(0, x0−j) = 0 and f′(0, x1−j) = 1. If the endpoints (1, x0−j) and (1, x1−j) of e2 have the values 0 and 1 (under both f and f′), then the number of monotonicity violations on e3 and e4 drops from 1 to 0. The same is true if their values are 0 and 0. If their values are 1 and 1, then the monotonicity violation on edge e4 under f moves to one on edge e3 under f′, but the number of violations remains the same. The final set of cases, when the endpoints of e2 have their values swapped while the endpoints of e1 do not, is similar.3
3Suppose we corrected only one endpoint of an edge to fix a monotonicity violation, rather than swapping the endpoint values. Would the proof still go through?

Summing over all such squares — all choices of x0−j — we conclude that the number of monotonicity violations in the jth slice can only decrease. □
Now consider turning a function f into a monotone function g by doing a single pass through the coordinates, fixing all monotonicity violations in a given coordinate via swaps as in the Key Claim. This process terminates with a monotone function: immediately after coordinate i is treated, there are no monotonicity violations in the ith slice by construction; and by the Key Claim, fixing future coordinates does not break this property. The Key Claim also implies that, in the iteration where this procedure processes the ith coordinate, the number of monotonicity violations that need fixing is at most the number |Ai| of monotonicity violations in this slice under the original function f. Since the procedure makes two modifications to f for each monotonicity violation that it fixes (the two endpoints of an edge), we conclude that f can be made monotone by changing at most 2∑ni=1 |Ai| of its values. If f is ε-far from monotone, then 2∑ni=1 |Ai| ≥ ε2n. Plugging this into (8.2), we find that a single trial of the edge tester rejects such an f with probability at least
(ε2n/2) / (n2n−1) = ε/n,

as claimed. □
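The repair procedure from the proof is easy to run on small examples. The function f below is an arbitrary non-monotone example of my own; the asserts check the Key Claim (no pass increases the total violation count), that the final result is monotone, and the 2∑|Ai| bound on the number of changed values.

```python
from itertools import product

def violations(f, n):
    # Total number of violated hypercube edges, summed over all n slices.
    count = 0
    for i in range(n):
        for x in product([0, 1], repeat=n):
            if x[i] == 0 and f[x] > f[x[:i] + (1,) + x[i + 1:]]:
                count += 1
    return count

def fix_slice(f, n, i):
    # Swap endpoint values across every violated edge of the ith slice.
    g = dict(f)
    for x in product([0, 1], repeat=n):
        if x[i] == 0:
            y = x[:i] + (1,) + x[i + 1:]
            if f[x] > f[y]:
                g[x], g[y] = f[y], f[x]
    return g

n = 4
domain = list(product([0, 1], repeat=n))
f = {x: (sum(x) + x[0]) % 2 for x in domain}  # an arbitrary non-monotone f

g = dict(f)
for i in range(n):
    before = violations(g, n)
    g = fix_slice(g, n, i)
    assert violations(g, n) <= before  # total count never increases (Key Claim)

changed = sum(1 for x in domain if f[x] != g[x])
assert violations(g, n) == 0                    # the result is monotone
assert changed <= 2 * violations(f, n)          # the distance bound from the proof
print("violations in f:", violations(f, n), "values changed:", changed)
```

This mirrors the proof exactly: each pass reads the pre-pass function when deciding which disjoint edges of the current slice to swap.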
8.3.2 Recent Progress for the Boolean Case
An obvious question is whether or not we can improve over the query upper bound in Theorem 8.2. The analysis in Theorem 8.2 of the edge tester is tight up to a constant factor (see Exercises), so an improvement would have to come from a different tester. There was no progress on this problem for 15 years, but recently there has been a series of breakthroughs on the problem. Chakrabarty and Seshadhri (2013) gave the first improved upper bound, of Õ(n7/8/ε3/2).4 A year later, Chen et al. (2014) gave an upper bound of Õ(n5/6/ε4). Just a couple of months ago, Khot et al. (2015) gave a bound of Õ(√n/ε2). All of these improved upper bounds are for path testers. The idea is to sample a random monotone path from the hypercube (checking for a violation on its endpoints), rather than a random edge. One way to do this is: pick a random point x ∈ {0, 1}n; pick a random number z between 0 and the number of zeroes of x (from some distribution); and obtain y from x by choosing at random z of x’s 0-coordinates and flipping them to 1. Given that a function that is ε-far from monotone must have lots of violated edges (by Theorem 8.2), it is plausible that path testers, which aspire to check many edges at once, could be more effective than edge testers. The issue is that just because a path contains one or more violated edges does not imply that the path’s endpoints will reveal a monotonicity violation. Analyzing path testers seems substantially more complicated than analyzing the edge tester (Chakrabarty and Seshadhri, 2013;

4The notation Õ(·) suppresses logarithmic factors.

Chen et al., 2014; Khot et al., 2015). Note that path testers are non-adaptive and have 1-sided error.
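The sampling procedure just described can be sketched as follows. The uniform distribution over the path length z is a placeholder (the cited papers choose this distribution carefully), and the threshold function is an arbitrary monotone example of my own.

```python
import random

def path_tester(f, n, trials):
    # Sample endpoints of a random monotone path and compare their values.
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]
        zeros = [i for i in range(n) if x[i] == 0]
        if not zeros:
            continue
        # Placeholder distribution: z uniform on {0, ..., #zeros of x}.
        z = random.randint(0, len(zeros))
        y = list(x)
        for i in random.sample(zeros, z):
            y[i] = 1  # flip z of x's 0-coordinates, so x <= y coordinatewise
        if f(tuple(x)) > f(tuple(y)):
            return False  # the path's endpoints reveal a monotonicity violation
    return True

random.seed(3)
n = 8
threshold = lambda x: 1 if sum(x) >= 5 else 0  # monotone: never rejected
print(path_tester(threshold, n, trials=50))
```

Since y dominates x coordinatewise, a monotone function can never be rejected, so this tester also has one-sided error.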
There have also been recent breakthroughs on the lower bound side. It has been known for some time that all non-adaptive testers with 1-sided error require Ω(√n) queries (Fischer et al., 2002); see also the Exercises. For non-adaptive testers with two-sided error, Chen et al. (2014) proved a lower bound of Ω̃(n1/5) and Chen et al. (2015) improved this to Ω(n1/2−c) for every constant c > 0. Because the gap in query complexity between adaptive and non-adaptive testers can only be exponential (see Exercises), these lower bounds also imply that adaptive testers (with two-sided error) require Ω(log n) queries. The gap between Õ(√n) and Ω(log n) for adaptive testers remains open; most researchers think that adaptivity cannot help and that the upper bound is the correct answer.
An interesting open question is whether or not communication complexity is useful for
proving interesting lower bounds for the monotonicity testing of Boolean functions.5 We’ll see in Section 8.4 that it is useful for proving lower bounds in the case where the range is relatively large.
8.3.3 Larger Ranges
In this section we study monotonicity testing with the usual domain D = {0, 1}n but with a range R that is an arbitrary finite, totally ordered set. Some of our analysis for the Boolean case continues to apply. For example, the edge tester continues to be a well-defined tester with 1-sided error. Returning to the proof of Theorem 8.2, we can again define each Ai as the set of monotonicity violations — meaning f(0,x−i) > f(1,x−i) — along edges in the ith slice. The rejection probability again equals the quantity in (8.2).
We need to revisit the major step of the proof of Theorem 8.2, which for Boolean functions gives an upper bound of 2∑ni=1 |Ai| on the distance from a function f to the set of monotone functions. One idea is to again do a single pass through the coordinates, swapping the function values of the endpoints of the edges in the current slice that have monotonicity violations. In contrast to the Boolean case, this idea does not always result in a monotone function (see Exercises).
We can extend the argument to general finite ranges R by doing multiple passes over the coordinates. The simplest approach uses one pass over the coordinates, fixing all monotonicity violations that involve a vertex x with f(x) = 0; a second pass, fixing all monotonicity violations that involve a vertex x with f(x) = 1; and so on. Formalizing this argument yields a bound of 2|R|∑ni=1 |Ai| on the distance between f and the set of monotone functions, which gives a query bound of O(n|R|/ε) (Goldreich et al., 2000).
5At the very least, some of the techniques we’ve learned in previous lectures are useful. The arguments in Chen et al. (2014) and Chen et al. (2015) use an analog of Yao’s Lemma (Lemma 2.3) to switch from randomized to distributional lower bounds. The hard part is then to come up with a distribution over both monotone functions and functions ε-far from monotone such that no deterministic tester can reliably distinguish between the two cases using few queries to the function.

A divide-and-conquer approach gives a better upper bound (Dodis et al., 1999). Assume without loss of generality (relabeling if necessary) that R = {0, 1, . . . , r − 1}, and also (by padding) that r = 2k for a positive integer k. The first pass over the coordinates fixes all monotonicity violations that involve values that differ in their most significant bit — one value that is less than r/2 and one value that is at least r/2. The second pass fixes all monotonicity violations involving two values that differ in their second-highest-order bit. And so on. The Exercises ask you to prove that this idea can be made precise and show that the distance between f and the set of monotone functions is at most 2 log2 |R|∑ni=1 |Ai|. This implies an upper bound of O(n log |R| / ε) on the number of queries used by the edge tester for the case of general finite ranges. The next section shows a lower bound of Ω(n/ε) when |R| = Ω(√n); in these cases, this upper bound is the best possible, up to the log |R| factor.6
8.4 Monotonicity Testing: Lower Bounds

8.4.1 Lower Bound for General Ranges
This section uses communication complexity to prove a lower bound on the query complexity of testing monotonicity for sufficiently large ranges.
Theorem 8.3 (Blais et al. 2012) For large enough ranges R and ε = 1/8, every (adaptive) monotonicity tester with two-sided error uses Ω(n) queries.
Note that Theorem 8.3 separates the case of a general range R from the case of a Boolean range, where Õ(√n) queries are enough (Khot et al., 2015). With the right communication complexity tools, Theorem 8.3 is not very hard to prove. Simultaneously with Blais et al. (2012), Briët et al. (2012) gave a non-trivial proof from scratch of a similar lower bound, but it applies only to non-adaptive testers with 1-sided error. Communication complexity techniques naturally lead to lower bounds for adaptive testers with two-sided error.
As always, the first thing to try is a reduction from Disjointness, with the query complexity somehow translating to the communication cost. At first this might seem weird — there’s only one “player” in property testing, so where do Alice and Bob come from? But as we’ve seen over and over again, starting with our applications to streaming lower bounds, it can be useful to invent two parties just for the sake of standing on the shoulders of communication complexity lower bounds. To implement this, we need to show how a
low-query tester for monotonicity leads to a low-communication protocol for Disjointness. It’s convenient to reduce from a “promise” version of Disjointness that is just as hard as the general case. In the Unique-Disjointness problem, the goal is to distinguish between
6It is an open question to reduce the dependence on |R|. Since we can assume that |R| ≤ 2^n (why?), any sub-quadratic upper bound o(n²) would constitute an improvement.

inputs where Alice and Bob have sets A and B with A∩B = ∅, and inputs where |A∩B| = 1. On inputs that satisfy neither property, any output is considered correct. The Unique- Disjointness problem showed up a couple of times in previous lectures; let’s review them. At the conclusion of our lecture on the extension complexity of polytopes, we proved that the nondeterministic communication complexity of the problem is Ω(n) using a covering argument with a clever inductive proof (Theorem 5.9). In our boot camp (Section 4.3.4), we discussed the high-level approach of Razborov’s proof that every randomized protocol for Disjointness with two-sided error requires Ω(n) communication. Since the hard probability distribution in this proof makes use only of inputs with intersection size 0 or 1, the lower bound applies also to the Unique-Disjointness problem.
Key to the proof of Theorem 8.3 is the following lemma.
Lemma 8.4 Fix sets A, B ⊆ U = {1, 2, . . . , n}. Define the function hAB : 2^U → Z by

hAB(S) = 2|S| + (−1)^|S∩A| + (−1)^|S∩B|.    (8.3)

Then:

(i) If A ∩ B = ∅, then hAB is monotone.

(ii) If |A ∩ B| = 1, then hAB is 1/8-far from monotone.
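The function in (8.3) is simple enough to sanity-check by brute force on a small universe. The following Python sketch (helper names are ours, chosen for illustration) evaluates hAB and verifies both cases of Lemma 8.4 for n = 6: case (i) by checking every hypercube edge, and case (ii) by counting the violated edges in the ith slice:

```python
from itertools import combinations

def h(S, A, B):
    """The function h_AB of (8.3): 2|S| + (-1)^|S ∩ A| + (-1)^|S ∩ B|."""
    return 2 * len(S) + (-1) ** len(S & A) + (-1) ** len(S & B)

def subsets(U):
    return [frozenset(c) for r in range(len(U) + 1) for c in combinations(U, r)]

def is_monotone(A, B, U):
    """Check h_AB(S) <= h_AB(S ∪ {i}) over all S ⊆ U and i ∉ S."""
    return all(h(S, A, B) <= h(S | {i}, A, B)
               for S in subsets(U) for i in U - S)

U = frozenset(range(1, 7))  # n = 6, small enough for brute force
# Case (i): disjoint A, B gives a monotone function.
assert is_monotone(frozenset({1, 2}), frozenset({3, 4}), U)
# Case (ii): |A ∩ B| = 1 gives at least 2^n / 8 violated edges in slice i.
A, B, i = frozenset({1, 2}), frozenset({2, 3}), 2
violations = sum(1 for S in subsets(U - {i}) if h(S | {i}, A, B) < h(S, A, B))
assert violations >= 2 ** len(U) / 8
```

This is only a numerical check of the lemma's conclusions, of course, not a substitute for the proof below; in particular it counts violated edges rather than computing the exact distance to the set of monotone functions.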
We’ll prove the lemma shortly; let’s first see how to use it to prove Theorem 8.3. Let Q be a tester that distinguishes between monotone functions from {0, 1}^n to R and functions that are 1/8-far from monotone. We proceed to construct a (public-coin randomized) protocol for the Unique-Disjointness problem.
Suppose Alice and Bob have sets A,B ⊆ {1,2,…,n}. The idea is for both parties to
run local copies of the tester Q to test the function hAB, communicating with each other as needed to carry out these simulations. In more detail, Alice and Bob first use the public coins to agree on a random string to be used with the tester Q. Given this shared random string, Q is deterministic. Alice and Bob then simulate local copies of Q query-by-query:
1. Until Q halts:
a) Let S ⊆ {1,2,…,n} be the next query that Q asks about the function hAB.7
b) Alice sends (−1)^|S∩A| to Bob.
c) Bob sends (−1)^|S∩B| to Alice.
d) Both Alice and Bob evaluate the function hAB at S, and give the result to their
respective local copies of Q.
7As usual, we’re not distinguishing between subsets of {1, 2, . . . , n} and their characteristic vectors.

2. Alice (or Bob) declares “disjoint” if Q accepts the function hAB, and “not disjoint”
otherwise.
We first observe that the protocol is well defined. Since Alice and Bob use the same random string and simulate Q in lockstep, both parties know the (same) relevant query S to hAB in every iteration, and thus are positioned to send the relevant bits ((−1)^|S∩A| and (−1)^|S∩B|) to each other. Given these bits, they are able to evaluate hAB at the point S (even though Alice doesn’t know B and Bob doesn’t know A).
The communication cost of this protocol is twice the number of queries used by the tester Q, and it doesn’t matter if Q is adaptive or not. Correctness of the protocol follows immediately from Lemma 8.4, with the error of the protocol the same as that of the tester Q. Because every randomized protocol (with two-sided error) for Unique-Disjointness has communication complexity Ω(n), we conclude that every (possibly adaptive) tester Q with two-sided error requires Ω(n) queries for monotonicity testing. This completes the proof of Theorem 8.3.
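The simulation argument can be sketched in code. Suppose the tester's behavior, with its random string already fixed via the public coins, is modeled as a function from the query-answer history so far to either the next query or a verdict. The Python below is illustrative only (the `toy_tester` is a made-up stand-in, not an actual monotonicity tester), but it shows how two bits per query let both parties evaluate hAB:

```python
def simulate(tester, A, B):
    """Alice and Bob run lockstep copies of a derandomized tester.

    `tester(history)` returns ('query', S) or ('accept'/'reject', None);
    it stands in for a deterministic tester whose random string was
    already fixed using the public coins. Returns (verdict, bits sent).
    """
    history, bits = [], 0
    while True:
        action, S = tester(history)
        if action != 'query':
            # Alice (or Bob) declares "disjoint" iff the tester accepts.
            return ('disjoint' if action == 'accept' else 'not disjoint', bits)
        a = (-1) ** len(S & A)                   # Alice's message to Bob
        b = (-1) ** len(S & B)                   # Bob's message to Alice
        bits += 2                                # one bit in each direction
        history.append((S, 2 * len(S) + a + b))  # both evaluate h_AB(S)

# Illustrative stand-in (NOT a real monotonicity tester): query two fixed
# sets and accept iff h_AB does not decrease along that single pair.
def toy_tester(history):
    queries = [frozenset(), frozenset({1})]
    if len(history) < len(queries):
        return ('query', queries[len(history)])
    (_, h0), (_, h1) = history
    return ('accept' if h0 <= h1 else 'reject', None)

verdict, cost = simulate(toy_tester, frozenset({1}), frozenset({2}))
```

The communication cost accounting is visible in the code: each tester query costs exactly two bits, regardless of how the tester chose the query, which is why adaptivity is irrelevant.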
Proof of Lemma 8.4: For part (i), assume that A ∩ B = ∅ and consider any set S ⊆ {1, 2, . . . , n} and i ∉ S. Because A and B are disjoint, i does not belong to at least one of A or B. Recalling (8.3), in the expression hAB(S ∪ {i}) − hAB(S), the difference between the first terms is 2, the difference in either the second terms (if i ∉ A) or in the third terms (if i ∉ B) is zero, and the difference in the remaining terms is at least −2. Thus, hAB(S ∪ {i}) − hAB(S) ≥ 0 for all S and i ∉ S, and hAB is monotone.
For part (ii), let A ∩ B = {i}. For all S ⊆ {1, 2, . . . , n} \ {i} such that |S ∩ A| and |S ∩ B| are both even, hAB(S ∪ {i}) − hAB(S) = −2. If we choose such an S uniformly at random, then Pr[|S ∩ A| is even] is 1 (if A = {i}) or 1/2 (if A has additional elements, using the Principle of Deferred Decisions). Similarly, Pr[|S ∩ B| is even] ≥ 1/2. Since no potential element of S ⊆ {1, 2, . . . , n} \ {i} is a member of both A and B, these two events are independent and hence Pr[|S ∩ A|, |S ∩ B| are both even] ≥ 1/4. Thus, for at least (1/4) · 2^{n−1} = 2^n/8 choices of S, hAB(S ∪ {i}) < hAB(S). Since all of these monotonicity violations involve different values of hAB — in the language of the proof of Theorem 8.2, they are all edges of the ith slice of the hypercube — fixing all of them requires changing hAB at 2^n/8 values. We conclude that hAB is 1/8-far from a monotone function. ∎

8.4.2 Extension to Smaller Ranges

Recalling the definition (8.3) of the function hAB, we see that the proof of Theorem 8.3 establishes a query complexity lower bound of Ω(n) provided the range R has size Ω(n). It is not difficult to extend the lower bound to ranges of size Ω(√n). The trick is to consider a “truncated” version of hAB, call it h′AB, where values of hAB less than n − c√n are rounded up to n − c√n and values more than n + c√n are rounded down to n + c√n. (Here c is a sufficiently large constant.) The range of h′AB has size Θ(√n) for all A, B ⊆ {1, 2, . . . , n}.

We claim that Lemma 8.4 still holds for h′AB, with the “1/8” in case (ii) replaced by “1/16;” the new version of Theorem 8.3 then follows. Checking that case (i) in Lemma 8.4 still holds is easy: truncating a monotone function yields another monotone function. For case (ii), it is enough to show that hAB and h′AB differ in at most a 1/16 fraction of their entries; since Hamming distance satisfies the triangle inequality, this implies that h′AB must be 1/16-far from the set of monotone functions. Finally, consider choosing S ⊆ {1, 2, . . . , n} uniformly at random: up to an ignorable additive term in {−2, −1, 0, 1, 2}, the value of hAB lies in n ± c√n with probability at least 15/16, provided c is a sufficiently large constant (by Chebyshev’s inequality). This implies that hAB and h′AB agree on all but a 1/16 fraction of the domain, completing the proof.

For even smaller ranges R, the argument above can be augmented by a padding argument
to prove a query complexity lower bound of Ω(|R|²); see the Exercises.

8.5 A General Approach

It should be clear from the proof of Theorem 8.3 that its method of deriving property testing lower bounds from communication complexity lower bounds is general, and not particular to the problem of testing monotonicity. The general template for deriving lower bounds for testing a property P is:

1. Map inputs (x, y) of a communication problem Π with communication complexity at least c to a function h(x,y) such that:
a) 1-inputs (x, y) of Π map to functions h(x,y) that belong to P;
b) 0-inputs (x, y) of Π map to functions h(x,y) that are ε-far from P.

2. Devise a communication protocol for evaluating h(x,y) that has cost d. (In the proof of Theorem 8.3, d = 2.)

Via the simulation argument in the proof of Theorem 8.3, instantiating this template yields a query complexity lower bound of c/d for testing the property P.8

There are a number of other applications of this template to various property testing problems, such as testing if a function admits a small representation (as a sparse polynomial, as a small decision tree, etc.). See Blais et al. (2012) and Goldreich (2013) for several examples. A large chunk of the property testing literature is about testing graph properties (Goldreich et al. 1998). An interesting open question is whether communication complexity can be used to prove strong lower bounds for such problems.

8There is an analogous argument that uses one-way communication complexity lower bounds to derive query complexity lower bounds for non-adaptive testers; see the Exercises.

Bibliography

Alon, N., Matias, Y., and Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147.

Alon, N. and Spencer, J. H. (2008). The Probabilistic Method. Wiley. Third Edition.

Andoni, A., Indyk, P., and Pătraşcu, M. (2006). On the optimality of the dimensionality reduction method.
In Proceedings of the 47th Symposium on Foundations of Computer Science (FOCS), pages 449–458.

Arora, S. and Lund, C. (1997). Hardness of approximations. In Hochbaum, D. S., editor, Approximation Algorithms for NP-Hard Problems, chapter 10, pages 399–446. PWS Publishing Company.

Babai, L., Frankl, P., and Simon, J. (1986). Complexity classes in communication complexity. In Proceedings of the 27th Annual Symposium on Foundations of Computer Science (FOCS), pages 337–347.

Bar-Yossef, Z., Jayram, T. S., Krauthgamer, R., and Kumar, R. (2004). The sketching complexity of pattern matching. In Proceedings of APPROX-RANDOM, pages 261–272.

Bar-Yossef, Z., Jayram, T. S., Kumar, R., and Sivakumar, D. (2002a). An information statistics approach to data stream and communication complexity. In Proceedings of the 43rd Symposium on Foundations of Computer Science (FOCS), pages 209–218.

Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D., and Trevisan, L. (2002b). Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM), pages 1–10.

Blais, E., Brody, J., and Matulef, K. (2012). Property testing lower bounds via communication complexity. Computational Complexity, 21(2):311–358.

Blum, M., Luby, M., and Rubinfeld, R. (1993). Self-testing/correcting with applications to numerical problems. Journal of Computer and System Sciences, 47(3):549–595.

Briët, J., Chakraborty, S., García-Soriano, D., and Matsliah, A. (2012). Monotonicity testing and shortest-path routing on the cube. Combinatorica, 32(1):35–53.

Candes, E. J., Romberg, J. K., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete Fourier information. IEEE Transactions on Information Theory, 52(2):489–509.

Chakrabarti, A., Khot, S., and Sun, X. (2003). Near-optimal lower bounds on the multi-party communication complexity of set disjointness.
In Proceedings of the 18th Annual IEEE Conference on Computational Complexity, pages 107–117.

Chakrabarti, A. and Regev, O. (2012). An optimal lower bound on the communication complexity of gap-hamming-distance. SIAM Journal on Computing, 41(5):1299–1317.

Chakrabarty, D. and Seshadhri, C. (2013). A o(n) monotonicity tester for Boolean functions over the hypercube. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 411–418.

Charikar, M., Chen, K., and Farach-Colton, M. (2004). Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15.

Chattopadhyay, A. and Pitassi, T. (2010). The story of set disjointness. SIGACT News, 41(3):59–85.

Chen, X., De, A., Servedio, R. A., and Tan, L. (2015). Boolean function monotonicity testing requires (almost) n^{1/2} non-adaptive queries. In Proceedings of the 47th ACM Symposium on Theory of Computing (STOC), pages 519–528.

Chen, X., Servedio, R. A., and Tan, L. (2014). New algorithms and lower bounds for monotonicity testing. In Proceedings of the 55th Symposium on Foundations of Computer Science (FOCS), pages 286–295.

Christodoulou, G., Kovacs, A., Sgouritsa, A., and Tang, B. (2013). Tight bounds for the price of anarchy of simultaneous first price auctions. arXiv:1312.2371.

Chvátal, V. (1983). Linear Programming. Freeman.

Cormode, G. and Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75.

de Wolf, R. (2003). Nondeterministic quantum query and quantum communication complexities. SIAM Journal on Computing, 32(3):681–699.

Do Ba, K., Indyk, P., Price, E., and Woodruff, D. P. (2010). Lower bounds for sparse recovery. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1190–1197.

Dobzinski, S., Nisan, N., and Schapira, M. (2010). Approximation algorithms for combinatorial auctions with complement-free bidders. Math. Oper. Res., 35(1):1–13.
Dodis, Y., Goldreich, O., Lehman, E., Raskhodnikova, S., Ron, D., and Samorodnitsky, A. (1999). Improved testing algorithms for monotonicity. In Proceedings of APPROX-RANDOM, pages 97–108.

O’Donnell, R. (2014). Analysis of Boolean Functions. Cambridge.

Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.

Edmonds, J. (1965). Maximum matching and a polyhedron with 0,1-vertices. Journal of Research of the National Bureau of Standards, Series B, 69B:125–130.

Feige, U. (2009). On maximizing welfare where the utility functions are subadditive. SIAM Journal on Computing, 39(1):122–142.

Feige, U. and Vondrák, J. (2010). The submodular welfare problem with demand queries. Theory of Computing, 6(1):247–290.

Feldman, M., Fu, H., Gravin, N., and Lucier, B. (2013). Simultaneous auctions are (almost) efficient. In 45th ACM Symposium on Theory of Computing (STOC), pages 201–210.

Fiat, A. and Naor, M. (1993). Implicit O(1) probe search. SIAM Journal on Computing, 22(1):1–10.

Fiorini, S., Massar, S., Pokutta, S., Tiwary, H. R., and de Wolf, R. (2015). Exponential lower bounds for polytopes in combinatorial optimization. Journal of the ACM, 62(2). Article 17.

Fischer, E., Lehman, E., Newman, I., Raskhodnikova, S., Rubinfeld, R., and Samorodnitsky, A. (2002). Monotonicity testing over general poset domains. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pages 474–483.

Flajolet, P. and Martin, G. N. (1983). Probabilistic counting. In Proceedings of the 24th Annual Symposium on Foundations of Computer Science (FOCS), pages 76–82.

Fredman, M. L., Komlós, J., and Szemerédi, E. (1984). Storing a sparse table with O(1) worst case access time. Journal of the ACM, 31(3):538–544.

Gabizon, A. and Hassidim, A. (2010). Derandomizing algorithms on product distributions and other applications of order-based extraction. In Proceedings of Innovations in Computer Science (ICS), pages 397–405.

Gabizon, A.
and Shaltiel, R. (2012). Increasing the output length of zero-error dispersers. Random Structures & Algorithms, 40(1):74–104.

Goemans, M. X. (2014). Smallest compact formulation for the permutahedron. Mathematical Programming, Series A. To appear.

Goldreich, O. (2013). On the communication complexity methodology for proving lower bounds on the query complexity of property testing. Electronic Colloquium on Computational Complexity (ECCC), 20.

Goldreich, O., Goldwasser, S., Lehman, E., Ron, D., and Samorodnitsky, A. (2000). Testing monotonicity. Combinatorica, 20(3):301–337.

Goldreich, O., Goldwasser, S., and Ron, D. (1998). Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750.

Gronemeier, A. (2009). Asymptotically optimal lower bounds on the NIH-multi-party information complexity of the AND-function and disjointness. In Proceedings of the 26th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pages 505–516.

Grötschel, M., Lovász, L., and Schrijver, A. (1988). Geometric Algorithms and Combinatorial Optimization. Springer. Second Edition, 1993.

Håstad, J. and Wigderson, A. (2007). The randomized communication complexity of set disjointness. Theory of Computing, 3(11):211–219.

Indyk, P. (2004). Nearest-neighbor searching in high dimensions. In Goodman, J. E. and O’Rourke, J., editors, Handbook of Discrete and Computational Geometry, chapter 39, pages 877–892. CRC Press. Second Edition.

Indyk, P. and Woodruff, D. P. (2005). Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 202–208.

Jayram, T. S., Kumar, R., and Sivakumar, D. (2008). The one-way communication complexity of Hamming distance. Theory of Computing, 4:129–135.

Jukna, S. (2012). Boolean Function Complexity: Advances and Frontiers. Springer.

Kaibel, V. and Weltge, S. (2015).
A short proof that the extension complexity of the correlation polytope grows exponentially. Discrete & Computational Geometry, 53(2):397–401.

Kalyanasundaram, B. and Schnitger, G. (1992). The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics, 5(4):545–557.

Khachiyan, L. G. (1979). A polynomial algorithm in linear programming. Soviet Mathematics Doklady, 20(1):191–194.

Khot, S., Minzer, D., and Safra, M. (2015). On monotonicity testing and Boolean isoperimetric type theorems. In Proceedings of the 56th Symposium on Foundations of Computer Science (FOCS). To appear.

Koutsoupias, E. and Papadimitriou, C. H. (1999). Worst-case equilibria. In Proceedings of the 16th Annual Symposium on Theoretical Aspects of Computer Science (STACS), volume 1563 of Lecture Notes in Computer Science, pages 404–413.

Kremer, I., Nisan, N., and Ron, D. (1999). On randomized one-round communication complexity. Computational Complexity, 8(1):21–49.

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97.

Kushilevitz, E. and Nisan, N. (1996). Communication Complexity. Cambridge.

Lee, T. and Shraibman, A. (2009). Lower bounds in communication complexity. Foundations and Trends in Theoretical Computer Science, 3(4):263–399.

Lipton, R. J., Markakis, E., and Mehta, A. (2003). Playing large games using simple strategies. In Proceedings of the 4th ACM Conference on Electronic Commerce (EC), pages 36–41.

Lovász, L. (1990). Communication complexity: A survey. In Korte, B. H., editor, Paths, Flows, and VLSI Layout, pages 235–265. Springer-Verlag.

Miltersen, P. B. (1999). Cell probe complexity — a survey. Invited paper at Workshop on Advances in Data Structures.

Miltersen, P. B., Nisan, N., Safra, S., and Wigderson, A. (1998). On data structures and asymmetric communication complexity. Journal of Computer and System Sciences, 57(1):37–49.

Moitra, A. (2014).
Algorithmic aspects of machine learning. Lecture notes.

Nash, J. F. (1950). Equilibrium points in N-person games. Proceedings of the National Academy of Sciences, 36(1):48–49.

Newman, I. (1991). Private vs. common random bits in communication complexity. Information Processing Letters, 39:67–71.

Nisan, N. (2002). The communication complexity of approximate set packing. In Proceedings of the 29th Annual International Colloquium on Automata, Languages, and Programming (ICALP), volume 2380 of Lecture Notes in Computer Science, pages 868–875.

Padberg, M. and Rao, M. R. (1982). Odd minimum cut-sets and b-matchings. Mathematics of Operations Research, 7:67–80.

Pitowsky, I. (1991). Correlation polytopes: Their geometry and complexity. Mathematical Programming, 50:395–414.

Pătraşcu, M. (2008). Lower Bound Techniques for Data Structures. PhD thesis, Massachusetts Institute of Technology.

Pătraşcu, M. (2009). Communication complexity I, II, III, IV. WebDiarios de Motocicleta blog posts dated May 23, 2008; March 25, 2009; April 3, 2009; April 8, 2009.

Pulleyblank, W. and Edmonds, J. (1974). Facets of 1-matching polyhedra. In Proceedings of the Working Seminar on Hypergraphs, pages 214–242. Springer.

Razborov, A. A. (1992). On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385–390.

Razborov, A. A. (2011). Communication complexity. In Schleicher, D. and Lackmann, M., editors, An Invitation to Mathematics: From Competitions to Research, pages 97–117. Springer.

Ron, D. (2010). Property testing: A learning theory perspective. Foundations and Trends in Theoretical Computer Science, 5(2):73–205.

Rothvoß, T. (2014). The matching polytope has exponential extension complexity. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 263–272.

Roughgarden, T. (2014). Barriers to near-optimal equilibria.
In Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 71–80.

Rubinfeld, R. and Sudan, M. (1996). Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252–271.

Sherstov, A. A. (2012). The communication complexity of gap Hamming distance. Theory of Computing, 8(8):197–208.

Shoham, Y. and Leyton-Brown, K. (2010). Multiagent Systems. Cambridge.

Syrgkanis, V. and Tardos, É. (2013). Composable and efficient mechanisms. In Proceedings of the 45th ACM Symposium on Theory of Computing (STOC), pages 211–220.

Vidick, T. (2011). A concentration inequality for the overlap of a vector on a large set, with application to the communication complexity of the gap-hamming-distance problem. Electronic Colloquium on Computational Complexity (ECCC), 18:51.

Woodruff, D. P. (2004). Optimal space lower bounds for all frequency moments. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 167–175.

Woodruff, D. P. (2007). Efficient and Private Distance Approximation in the Communication and Streaming Models. PhD thesis, MIT.

Yannakakis, M. (1991). Expressing combinatorial optimization problems by linear programs. Journal of Computer and System Sciences, 43(3):441–466.

Yao, A. C. (1981). Should tables be sorted? Journal of the ACM, 28(3):615–628.

Yao, A. C.-C. (1979). Some complexity questions related to distributive computing (preliminary report). In Proceedings of the 11th Annual ACM Symposium on Theory of Computing (STOC), pages 209–213.

Yao, A. C.-C. (1983). Lower bounds by probabilistic arguments. In Proceedings of the 24th Annual Symposium on Foundations of Computer Science (FOCS), pages 420–428.

Ziegler, G. M. (1995). Lectures on Polytopes. Springer.