
David Liben-Nowell
Department of Computer Science Carleton College
Discrete Mathematics for Computer Science
or
(A Bit of) The Math that Computer Scientists Need to Know

VP AND EDITORIAL DIRECTOR       Laurie Rosatone
SENIOR DIRECTOR                 Don Fowley
ACQUISITIONS EDITOR             Linda Ratts
EDITORIAL MANAGER               Gladys Soto
CONTENT MANAGEMENT DIRECTOR     Lisa Wojcik
CONTENT MANAGER                 Nichole Urban
SENIOR CONTENT SPECIALIST       Nicole Repasky
PRODUCTION EDITOR               Rajeshkumar Nallusamy
PHOTO RESEARCHER                Billy Ray
COVER PHOTO CREDIT              © slobo/Getty Images, Inc.
This book was set in TeXGyrePagella 10/12 by SPi Global and printed and bound by Strategic Content Imaging.
This book is printed on acid free paper. ∞
Founded in 1807, John Wiley & Sons, Inc. has been a valued source of knowledge and understanding for more than 200 years, helping people around the world meet their needs and fulfill their aspirations. Our company is built on a foundation of principles that include responsibility to the communities we serve and where we live and work. In 2008, we launched a Corporate Citizenship Initiative, a global effort to address the environmental, social, economic, and ethical challenges we face in our business. Among the issues we are addressing are carbon impact, paper specifications and procurement, ethical conduct within our business and among our vendors,
and community and charitable support. For more information, please visit our website: www.wiley.com/go/citizenship.
Copyright © 2018, John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment
of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923 (Web site: www.copyright.com). Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, (201) 748-6011, fax (201) 748-6008, or online at: www.wiley.com/go/permissions.
Evaluation copies are provided to qualified academics and professionals for review purposes only, for use
in their courses during the next academic year. These copies are licensed and may not be sold or transferred
to a third party. Upon completion of the review period, please return the evaluation copy to Wiley. Return instructions and a free of charge return shipping label are available at: www.wiley.com/go/returnlabel. If you have chosen to adopt this textbook for use in your course, please accept this book as your complimentary desk copy. Outside of the United States, please contact your local sales representative.
ISBN: 978-1-118-06553-2 (PBK) ISBN: 978-1-119-07073-3 (EVALC)
Library of Congress Cataloging in Publication Data:
Liben-Nowell, David, author.
Title: Discrete mathematics for computer science / by David Liben-Nowell.
Description: Hoboken, NJ : John Wiley & Sons, 2017. | Includes index. |
Identifiers: LCCN 2017025007 (print) | LCCN 2017035974 (ebook) | ISBN 9781119397199 (pdf) | ISBN 9781119397113 (epub) | ISBN 9781118065532 (pbk.)
Subjects: LCSH: Computer science—Mathematics.
Classification: LCC QA76.9.M35 (ebook) | LCC QA76.9.M35 L53 2017 (print) |
DDC 004.01/51—dc23
LC record available at https://lccn.loc.gov/2017025007
The inside back cover will contain printing identification and country of origin if omitted from this page. In addition, if the ISBN on the back cover differs from the ISBN on this page, the one on the back cover is correct.

To MDSWM, with never-ending appreciation, and in loving memory of my grandfather, Jay Liben, who brought more joy, curiosity, and kvetching to this world than anyone else I know.

Contents
1 On the Point of this Book 101
2 Basic Data Types 201
  2.1 Why You Might Care 202
  2.2 Booleans, Numbers, and Arithmetic 203
  2.3 Sets: Unordered Collections 222
  2.4 Sequences, Vectors, and Matrices: Ordered Collections 237
  2.5 Functions 253
  2.6 Chapter at a Glance 270
3 Logic 301
  3.1 Why You Might Care 302
  3.2 An Introduction to Propositional Logic 303
  3.3 Propositional Logic: Some Extensions 317
  3.4 An Introduction to Predicate Logic 331
  3.5 Predicate Logic: Nested Quantifiers 349
  3.6 Chapter at a Glance 362
4 Proofs 401
  4.1 Why You Might Care 402
  4.2 Error-Correcting Codes 403
  4.3 Proofs and Proof Techniques 423
  4.4 Some Examples of Proofs 441
  4.5 Common Errors in Proofs 458
  4.6 Chapter at a Glance 469
5 Mathematical Induction 501
  5.1 Why You Might Care 502
  5.2 Proofs by Mathematical Induction 503
  5.3 Strong Induction 521
  5.4 Recursively Defined Structures and Structural Induction 533
  5.5 Chapter at a Glance 546
6 Analysis of Algorithms 601
  6.1 Why You Might Care 602
  6.2 Asymptotics 603
  6.3 Asymptotic Analysis of Algorithms 617
  6.4 Recurrence Relations: Analyzing Recursive Algorithms 631
  6.5 Recurrence Relations: The Master Method 647
  6.6 Chapter at a Glance 657
7 Number Theory 701
  7.1 Why You Might Care 702
  7.2 Modular Arithmetic 703
  7.3 Primality and Relative Primality 717
  7.4 Multiplicative Inverses 734
  7.5 Cryptography 745
  7.6 Chapter at a Glance 756
8 Relations 801
  8.1 Why You Might Care 802
  8.2 Formal Introduction 803
  8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 818
  8.4 Special Relations: Equivalence Relations and Partial/Total Orders 833
  8.5 Chapter at a Glance 850
9 Counting 901
  9.1 Why You Might Care 902
  9.2 Counting Unions and Sequences 903
  9.3 Using Functions to Count 926
  9.4 Combinations and Permutations 944
  9.5 Chapter at a Glance 965
10 Probability 1001
  10.1 Why You Might Care 1002
  10.2 Probability, Outcomes, and Events 1005
  10.3 Independence and Conditional Probability 1021
  10.4 Random Variables and Expectation 1041
  10.5 Chapter at a Glance 1067
11 Graphs and Trees 1101
  11.1 Why You Might Care 1102
  11.2 Formal Introduction 1103
  11.3 Paths, Connectivity, and Distances 1129
  11.4 Trees 1147
  11.5 Weighted Graphs 1164
  11.6 Chapter at a Glance 1177
12 Index 1201

List of Computer Science Connections
Chapter 2: Basic Data Types
  Integers and ints, Reals and floats 217
  Computing Square Roots, and Not Computing Square Roots 218
  Set Building in Languages 233
  Clustering 234
  The Vector Space Model 248
  Rotation Matrices 249
  Hash Tables and Hash Functions 267
Chapter 3: Logic
  Natural Language Processing, Ambiguity, and Truth 314
  Computational Complexity, Satisfiability, and $1,000,000 326
  Short-Circuit Evaluation, Optimization, and Modern Compilers 327
  Game Trees, Logic, and Winning Tic-Tac(-Toe) 344
  Nonlocal Variables and Lexical vs. Dynamic Scoping 345
  Gödel’s Incompleteness Theorem 346
  Currying 357
Chapter 4: Proofs
  Reed–Solomon Codes 418
  Are Massive Computer-Generated Proofs Proofs? 437
  Paul Erdős, “The Book,” and Erdős Numbers 438
  Cryptography and the Generation of Prime Numbers 454
  Other Uncomputable Problems (That You Might Care About) 455
  The Cost of Missing Proofs: Some Famous Bugs in CS 464
Chapter 5: Mathematical Induction
  Loop Invariants 517
  Triangulation, Computer Graphics, and 3D Surfaces 528
  Max Heaps 529
  Grammars, Parsing, and Ambiguity 543
Chapter 6: Analysis of Algorithms
  Moore’s Law 613
  Multitasking, Garbage Collection, and Wall Clocks 627
  Time, Space, and Complexity 628
  AVL Trees 643
  Divide-and-Conquer Algorithms and Matrix Multiplication 655
Chapter 7: Number Theory
  Converting Between Bases, Binary Representation, and Generating Strings 714
  Secret Sharing 730
  Error Correction with Reed–Solomon Codes 731
  Miller–Rabin Primality Test 742
  Diffie–Hellman Key Exchange 753
Chapter 8: Relations
  Relational Databases 815
  Regular Expressions 830
  Deterministic Finite Automata (DFAs) 846
  The Painter’s Algorithm and Hidden-Surface Removal 847
Chapter 9: Counting
  Running out of IP addresses, and IPv6 919
  A Lower Bound for Comparison-Based Sorting 920
  Infinite Cardinalities (and Problems that Can’t Be Solved by Any Program) 937
  Lossy and Lossless Compression 938
  Brute Force Algorithms and Dynamic Programming 959
  The Enigma Machine and the First Computer 960
Chapter 10: Probability
  Quantum Computing 1016
  Information, Charles Dickens, and the Entropy of English 1017
  Speech Recognition, Bayes’ Rule, and Language Models 1036
  Bayesian Modeling and Spam Filtering 1037
  A Randomized Algorithm for Finding Medians 1060
  The Monte Carlo Method 1062
Chapter 11: Graphs and Trees
  Degree Distributions and the Heavy Tail 1123
  Graph Drawing, Graph Layouts, and the 9/11 Memorial 1124
  The Bowtie Structure of the Web 1142
  Garbage Collection 1143
  Directed Graphs, Cycles, and Kidney Transplants 1159
  Binary Search Trees 1160
  Random Walks and Ranking Web Pages 1174

Acknowledgements
Would thou hadst less deserved,
That the proportion both of thanks and payment
Might have been mine! only I have left to say,
More is thy due than more than all can pay.
William Shakespeare (1564–1616)
The Scottish Play
To everyone who has helped, directly and indirectly, with everything over these last years—these words cannot adequately convey my thanks, but at least they’re a start: thank you!
I owe special thanks to a very long list of generous and warm people—many more than I can mention here—for advice and kindness and support, both technical and emotional, as this book came into being. For those whom I haven’t named by name, please know that it’s only because I have gotten such great support from so many people, and I hope that you’ll consider this sentence the promise that, when we next see each other, the first round’s on me. While I’m leaving out the names of the many people who have helped make my life happy and fulfilling while I’ve been working on this book, I do want to give specific thanks to a few people:
I want to thank my colleagues—near and far, including many who are not just colleagues but also dear friends and beloved family members—for their wisdom and patience, for answering my endlessly annoying questions, and for conversations that led to examples or exercises or bug fixes or the very existence of this entire book (even if you didn’t know that’s what we were talking about at the time): Eric Alexander, Tanya Berger-Wolf, Kelly Connole, Amy Csizmar Dalal, Josh Davis, Roger Downs, Laura Effinger-Dean, Eric Egge, Adriana Estill, Andy Exley, Alex Freeman, Sherri Goings, Jack Goldfeather, Deanna Haunsperger, Pierre Hecker, David Huyck, Sue Jandro, Sarah Jansen, Iris Jastram, Jon Kleinberg, Carissa Knipe, Mark Krusemeyer, Jessica Leiman, Lynn Liben, Jadrian Miles, Dave Musicant, Gail Nelson, Rich Nowell, Layla Oesper, Jeff Ondich, Sam Patterson, Anna Rafferty, Alexa Sharp, Julia Strand, Mike Tie, Zach Weinersmith, Tom Wexler, Kevin Woods, Jed Yang, and Steve Zdancewic.
I also owe my appreciation to Don Fowley, Bryan Gambrel, Beth Golub, Jessy Moor, Anna Pham, Sondra Scott, and Gladys Soto at Wiley. Thanks to Judy Brody for relentless and efficient pursuit of permissions (from many different people and publishers) to use the quotes that appear as epigraphs throughout the book. And thanks as well to the many insightful reviewers of previous drafts of this material. So many times I got chapter reviews back and put them aside in a huff, only to come back to the reviewers’ comments months later and realize that their suggestions were exactly right. (And, to be clear: blame me, not them, for the errors that I’m sure remain.)
I specifically want to thank Eric Alexander, Laura Biester, Josh Davis, Charlotte
Foran, Jadrian Miles, Dave Musicant, Layla Oesper, Anna Rafferty, Jed Yang, and the Carleton CS 202 students from 2013–2017 for their willingness to work with early,
and buggy, drafts of this book. And thanks to those and many other students at
Carleton for their patience, and for sending their comments and suggestions for improvements—in particular: Hami Abdi, David Abel, Alexander Auyeung, Andrew Bacon, Kharmen Bharucha, John Blake, Caleb Braun, Macallan Brown, Adam Canady, Noah Carnahan, Yitong Chen, Jinny Cho, Leah Cole, Katja Collier, Lila Conlee, Eric Ewing, Greg Fournier, Andy Freeland, Emma Freeman, Samuel Greaves, Reilly Hallstrom, Jacob Hamalian, Sylvie Hauser, Jack Hessel, Joy Hill, Matt Javaly, Emily Johnston,
Emily Kampa, Carlton Keedy, Henry Keiter, Jonathan Knudson, Julia Kroll, Brennan
Kuo, Edward Kwiatkowski, Dimitri Lang, Tristan Leigh, Zach Levonian, Daniel Levy, Rhys Lindmark, Gordon Loery, David Long, Robert Lord, Inara Makhmudova, Elliot Mawby, Javier Moran Lemus, Sean Mullan, Micah Nacht, Justin Norden, Laurel Orr, Raven Pillmann, Josh Pitkofsky, Matthew Pruyne, Nikki Rhodes, Will Schifeling,
Colby Seyferth, Alex Simonides, Oscar Smith, Kyung Song, Frederik Stensaeth, Patrick Stephen, Maximiliano Villarreal, Alex Voorhees, Allie Warren, Ben Wedin, Michael Wheatman, Jack Wines, Christopher Winter, and Andrew Yang.
This book would not have been possible without the support of Carleton College, not only for the direct support of this project, but also for providing a wonderfully engaging place to make my professional home. When I started at Carleton, my friends and family back east thought that moving to Minnesota (the frontier!) was nothing less than a sign that I had finally lost it, and I have to admit that I thought they had a point. But it’s been a fabulous place to have landed, with great friends and colleagues and students—the kind who don’t let you get away with anything, but in a good way.
Some of the late stages of the work on this book occurred while I was visiting the University of Cambridge. Thanks to Churchill College and the Computer Laboratory, and especially to Melissa Hines and Cecilia Mascolo, for their hospitality and support.
And my thanks to the somewhat less formal host institutions that have fueled this writing: Brick Oven Bakery, Cakewalk, Goodbye Blue Monday, Tandem Bagels, The Hideaway (Northfield, MN); Anodyne, Blue Moon, Bull Run, Caffetto, Common Roots, Espresso Royale, Isles Bun & Coffee, Keen Eye, Plan B, Precision Grind, Reverie, Spyhouse, Sebastian Joe’s, The Beat, The Nicollet, The Purple Onion, Turtle Bread Company, Uncommon Grounds, Urban Bean (Minneapolis, MN); Ginkgo, Grand Central, Kopplin’s (St. Paul, MN); Collegetown Bagels (Ithaca, NY); Slave to the Grind (Bronxville, NY); Bloc Eleven, Diesel Cafe (Somerville, MA); Lyndell’s (Cambridge, MA); Tryst (Washington, DC); Hot Numbers, Espresso Library (Cambridge, England); and various Starbucks, Caribous, and Dunn Brothers.

And, last but certainly not least, my deepest gratitude to my friends and family for all your help and support while this project has consumed both hours and years. You know who you are, and I hope you also know how much I appreciate you. Thank you!
David Liben-Nowell
Northfield, MN
May 2017
PS: I would be delighted to receive any comments or suggestions from readers. Please don’t hesitate to get in touch.

Credits
This book was typeset using LaTeX, and I produced all but a few figures from scratch using a combination of PSTricks and TikZ. The other figures are reprinted with permission from their copyright holders. The illustrations that open every chapter were drawn by Carissa Knipe (http://carissaknipe.com), who was a complete delight to work with—both on these illustrations and when she was a student at Carleton. I took the photograph of a house in Figure 2.48 myself. Figure 4.5 (the Therac-25 diagram) is reproduced from Nancy Leveson’s book Safeware: System Safety and Computers with permission from Pearson Education. Figure 4.27 (a poem proving the undecidability of the Halting Problem) is reproduced with permission from Geoffrey K. Pullum. Figure 5.22 (triangulations of a rabbit) is reproduced from a paper by Tobias Isenberg, Knut Hartmann, and Henry König with permission from the Society for Modeling and Simulation International (SCS). Figure 11.15 (a map of some European train routes) is reproduced with permission from RGBAlpha/Getty Images.¹
For their kind permission to use quotes that appear as epigraphs in sections throughout the book, thanks to:
Kurt Vonnegut, p. 102. Excerpt from Hocus Pocus by Kurt Vonnegut, copyright ©1990 by Kurt Vonnegut. Used by permission of G. P. Putnam’s Sons, an imprint of Penguin Publishing Group, a division of Penguin Random House LLC. All rights reserved. Any third party use of this material, outside of this publication, is prohibited. Interested parties must apply directly to Penguin Random House LLC for permission.
Pablo Picasso, p. 203. ©2017 Estate of Pablo Picasso / Artists Rights Society (ARS), New York. Reprinted with permission.
Laurence J. Peter, p. 317. Reprinted with permission of the estate of Laurence J. Peter.
Carl Sagan, p. 331. From Broca’s Brain: Reflections on the Romance of Science, ©1979 Carl Sagan. Reprinted with permission from Democritus Properties, LLC.
Peter De Vries, p. 349. Copyright ©1967 by Peter De Vries. Reprinted by permission of Curtis Brown, Ltd. All rights reserved.
¹ Nancy Leveson. Safeware: System Safety and Computers. Pearson Education, Inc., New York, 1995; Tobias Isenberg, Knut Hartmann, and Henry König. Interest value driven adaptive subdivision. In Simulation and Visualisation (SimVis), pages 139–149. SCS European Publishing House, 2003; and Geoffrey K. Pullum. Scooping the loop snooper: A proof that the halting problem is undecidable. Mathematics Magazine, 73(4):319–320, 2000. Used by permission of Geoffrey K. Pullum.

Edna St. Vincent Millay, p. 521. Edna St. Vincent Millay, excerpt from a letter to Arthur Davidson Ficke (October 24, 1930) from Letters of Edna St. Vincent Millay, edited by Allan Ross Macdougall, ©1952 by Norma Millay Ellis. Reprinted with the permission of The Permissions Company, Inc., on behalf of Holly Peppe, Literary Executor, The Millay Society, www.millay.org.
George C. Marshall, p. 533. Reprinted with permission of the George C. Marshall Foundation.
Peter Drucker, p. 602. Reprinted with permission of the Drucker 1996 Literary Works Trust.
Bob Dylan, p. 603. Lyrics from Bob Dylan’s “Don’t Think Twice, It’s All Right” (1963). Copyright ©1963 by Warner Bros. Inc.; renewed 1991 by Special Rider Music. All rights reserved. International copyright secured. Reprinted by permission.
Mario Andretti, p. 617. Printed with permission of Sports Management Network, Inc.
E. B. White, p. 631. E. B. White / The New Yorker; © Conde Nast. The quote originally appeared in the Notes and Comment section of the July 3, 1943 issue of The New Yorker, “The 40s: The Story of a Decade.” Reprinted with permission.
Charles de Gaulle, p. 647. © Editions Plon. Reprinted with permission.
W. H. Auden, p. 703. “Notes on the Comic” from The Dyer’s Hand and Other Essays by W. H. Auden, copyright ©1948, 1950, 1952, 1953, 1954, 1956, 1957, 1958, 1960, 1962 by W. H. Auden. Used by permission of Random House, an imprint and division of Penguin Random House LLC. All rights reserved. Any third party use of this material, outside of this publication, is prohibited. Interested parties must apply directly to Penguin Random House LLC for permission.
Bill Watterson, p. 833. Quote from a Calvin & Hobbes cartoon; reprinted with permission from Universal Uclick.
Tom Lehrer, p. 926. Lyrics from “Poisoning Pigeons In The Park” reprinted with permission from Maelstrom Music/Tom Lehrer.
Dick Cavett, p. 1021. Reprinted with permission from Dick Cavett.
Tom Stoppard, p. 1108. Excerpts from Rosencrantz and Guildenstern Are Dead, copyright © 1967 by Tom Stoppard. Used by permission of Grove/Atlantic, Inc. Any third party use of this material, outside of this publication, is prohibited.
Marshall Dodge and Robert Bryan, p. 1129. From “Which Way to Millinocket?,” Bert and I (1958). Reprinted with permission from Islandport Press, Inc.

1
On the Point of this Book
In which our heroes decide, possibly encouraged by a requirement for graduation, to set out to explore the world.

102 CHAPTER 1. ON THE POINT OF THIS BOOK
Why You Might Care
Just because some of us can read and write and do a little math, that doesn’t mean we deserve to conquer the Universe.
Kurt Vonnegut (1922–2007) Hocus Pocus (1990)
This book is designed for an undergraduate student who has taken a computer science class or three—most likely, you are a sophomore or junior prospective or current computer science major taking your first non-programming-based CS class. If you are a student in this position, you may be wondering why you’re taking this class (or why you have to take this class!). Computer science students taking a class like this one sometimes don’t see why this material has anything to do with computer science—particularly if you enjoy CS because you enjoy programming.
I want to be clear: programming is awesome! I get lost in code all the time—let’s not count the number of hours that I spent writing the code to draw the fractals in Figure 5.1 in LaTeX, for example. (LaTeX, the tool used to typeset this book, is the standard typesetting package for computer scientists, and it’s actually also a full-fledged, if somewhat bizarre, programming language.)
But there’s more to CS than programming. In fact, many seemingly unrelated problems rely on the same sorts of abstract thinking. It’s not at all obvious that an optimizing compiler (a program that translates source code in a programming language like C into something directly executable by a computer) would have anything important in common with a program to play chess perfectly. But, in fact, they’re both tasks that are best understood using logic (Chapter 3) as a central component of any solution. Similarly, filtering spam out of your inbox (“given a message m, should m be categorized as spam?”) and doing speech recognition (“given an audio stream s of a person speaking in English, what is the best ‘transcript’ reflecting the words spoken in s?”) are both best understood using probability (Chapter 10).
And these, of course, are just examples; there are many, many ways in which we can gain insight and efficiency by thinking more abstractly about the commonalities of interesting and important CS problems. That is the goal of this book: to introduce the kind of mathematical, formal thinking that will allow you to understand ideas that are shared among disparate applications of computer science—and to make it easier for you to make your own connections, and to extend CS in even more new directions.
How To Use This Book
Read much, but not many Books.
Benjamin Franklin (1706–1790)
Poor Richard’s Almanack (1738)
The brief version of the advice for how to use this book is: it’s your book; use it however you’d like. (Will Shortz, the puzzle editor of The New York Times, gives the analogous advice about crossword puzzles when he’s asked whether Googling for an answer is cheating.) But my experience is that students do best when they read actively, with scrap paper close by; most people end up with a deeper understanding of a problem by trying to solve it themselves first, before they look at the solution.
I’ve assumed throughout that you’re comfortable with programming in at least one language, including familiarity with recursion. It doesn’t much matter which particular programming language you know; we’ll use features that are shared by almost all modern languages—things like conditionals, loops, functions, and recursion. You may or may not have had more than one programming-based CS course; many, but not all, institutions require Data Structures as a prerequisite for this material. There are times in the book when a data structures background may give you a deeper understanding (but the same is true in reverse if you study data structures after this material). There are similarly a handful of topics for which rudimentary calculus background is valuable. But knowing/remembering calculus will be specifically useful only a handful of times in this book; the mathematical prerequisite for this material is really algebra and “mathematical maturity,” which basically means having some degree of comfort with the idea of a mathematical definition and with the manipulation of a mathematical expression. (The few places where calculus is helpful are explicitly marked.)
There are 10 chapters after this one in the book. Their dependencies are as shown in the accompanying diagram. Aside from these dependencies, there are some occasional references to other chapters, but these references are light. If you’ve skipped Chapter 6—many instructors will choose not to cover this material, as it is frequently included in a course on Algorithms instead of this one—then it will still be useful to have an informal sense of O, Ω, and Θ notation in the context of the worst-case running time of an algorithm. (You might skim Sections 6.1 and 6.6 before reading Chapters 7–11.)
I’ve tried to include some helpful tips for problem
solving in the margins throughout the book, along with
a few warnings about common confusions and some
notes on terminology/notation that may be helpful in
keeping the words and symbols straight. There are also two kinds of extensions to the main material. The “Taking it Further” blocks give more technical details about the material under discussion—an alternate way of thinking about a definition, or a way that a concept is used in CS or a related field. You should read the “Taking it Further” blocks if—but only if!—you find them engaging. Each section also ends with one or more boxed-off “Computer Science Connections” that show how the core material can be used to solve a wide variety of (interesting, I hope!) CS applications. No matter how interesting the core technical material may be, I think that it is what we can do with it that makes it worth studying.
[Figure: chapter dependency diagram over Chapters 2 (data types), 3 (logic), 4 (proofs), 5 (induction), 6 (analysis of algorithms), 7 (number theory), 8 (relations), 9 (counting), 10 (probability), and 11 (graphs/trees).]

What This Book Is About
All truths are easy to understand once they are discovered; the point is to discover them.
Galileo Galilei (1564–1642)
This book focuses on discrete mathematics, in which the entities of interest are distinct and separate. Discrete mathematics contrasts with continuous mathematics, as in calculus, which addresses infinitesimally small objects, which cannot be separated. We’ll use summations rather than integrals, and we’ll generally be thinking about things more like the integers (“1, 2, 3, . . .”) than like the real numbers (“all numbers between π and 42”). Because this book is mostly focused on non-programming-based parts of computer science, in general the “output” that you produce when solving a problem will be something different from a program. Most typically, you will be asked to answer some question (quantitatively or qualitatively) and to justify that answer—that is, to prove your answer. (A proof is an ironclad, airtight argument that convinces its reader of your claim.) Remember that your task in solving a problem is to persuade your reader that your purported solution genuinely solves the problem. Above all, that means that your main task in writing is communication and persuasion.
There are three very reasonable ways of thinking about this book.
View #1 is that this book is about the mathematical foundations of computation. This book is designed to give you a firm foundation in mathematical concepts that are crucial to computer science: sets and sequences and functions, logic, proofs, probability, number theory, graphs, and so forth.
View #2 is that this book is about practice. Essentially no particular example that we consider matters; what’s crucial is for you to get exposure to and experience with formal reasoning. Learning specific facts about specific topics is less important than developing your ability to reason rigorously about formally defined structures.
View #3 is that this book is about applications of computer science: it’s about error-correcting codes (how to represent data redundantly so that the original information is recoverable even in the face of data corruption); cryptography (how to communicate securely so that your information is understood by its intended recipient but not by anyone else); natural language processing (how to interpret the “meaning” of an English sentence spoken by a human using an automated customer service system); and so forth. But, because solutions to these problems rely fundamentally on sets and counting and number theory and logic, we have to understand basic abstract structures in order to understand the solutions to these applied problems.
In the end, of course, all three views are right: I hope that this book will help to introduce some of the foundational technical concepts and techniques of theoretical computer science, and I hope that it will also help demonstrate that these theoretical approaches have relevance and value in work throughout computer science—in topics both theoretical and applied. And I hope that it will be at least a little bit of fun.
Bon voyage!
Be careful; there are two different words that are pronounced identically:
discrete, adj.: individually separate and distinct.
discreet, adj.: careful and judicious in speech, especially to maintain privacy or avoid embarrassment.
You wouldn’t read a book about discreet mathematics; instead, someone who trusts you might quietly share it while making sure no one was eavesdropping.

2
Basic Data Types
In which our heroes equip themselves for the journey ahead, by taking on the basic provisions that they will need along the road.

202 CHAPTER 2. BASIC DATA TYPES
2.1 Why You Might Care
It is a capital mistake to theorize before one has data.
Sir Arthur Conan Doyle (1859–1930),
A Scandal in Bohemia (1892)
This chapter will introduce concepts, terminology, and notation related to the most common data types that recur throughout this book, and throughout computer science. These basic entities—the Booleans (True and False), numbers (integers, rationals, and reals), sets, sequences, functions—are also the basic data types we use in modern programming languages. Essentially every common primitive data type in programs appears on this list: a Boolean, an integer (or an int), a real number (or a float), and a string (an ordered sequence of characters). Ordered sequences of other elements are usually called arrays or lists. If you’ve taken a course on data structures, you’ve probably worked on several implementations of sets that allow you to insert an element into an unordered collection and to test whether a particular object is a “member” of the collection. And functions that map a given input to a corresponding output are the basic building blocks of programs.
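To make that list concrete, here is a minimal sketch of how these data types might look in one language. Python is chosen purely for illustration (the text assumes no particular language), and every variable and function name below is hypothetical.

```python
# Illustrative values only; all names here are assumptions, not from the text.
is_ready = True               # a Boolean
count = 42                    # an integer (Python's int)
ratio = 0.75                  # a float, approximating a real number
name = "discrete"             # a string: an ordered sequence of characters
primes = [2, 3, 5, 7]         # a list: an ordered sequence of other elements
seen = {2, 3, 5}              # a set: an unordered collection

seen.add(7)                   # insert an element into the set
has_seven = 7 in seen         # membership test
has_four = 4 in seen          # membership test for an absent element


def square(x):
    """A function: maps a given input to a corresponding output."""
    return x * x


squares = [square(p) for p in primes]
```

Note that the float only approximates a real number, a distinction that matters later when exact arithmetic is needed.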
Virtually every interesting computer science application uses these basic data types extensively. Cryptography, which is devoted to the secure storage and transmission of information in such a way that a malicious third party cannot decipher that information, is typically based directly on integers, particularly large prime numbers. A ubiquitous task in machine learning is to “cluster” a set of entities into a collection of nonoverlapping subsets so that two entities in the same subset are similar and two entities in different subsets are dissimilar. In information retrieval, where we might seek to find the document from a large collection that is most relevant to a given query, it is common to represent each document by a vector (a sequence of numbers) based on the words used in the document, and to find the most relevant documents by identifying which ones “point in the same direction” as the query’s vector. And functions are everywhere in CS, from data structures like hash tables to the routing that’s done for every packet of information on the internet.
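As a rough sketch of the information-retrieval idea, the code below represents each text by a vector of word counts and compares directions using cosine similarity. Cosine similarity is one common way to formalize “pointing in the same direction”; treating it as the measure here is an assumption, as are the helper names word_vector, cosine_similarity, and most_relevant and the whitespace-based word splitting.

```python
import math
from collections import Counter


def word_vector(text, vocabulary):
    """Count how often each vocabulary word appears in text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_relevant(query, documents):
    """Return the document whose word vector best aligns with the query's."""
    vocabulary = sorted(
        {word for text in [query, *documents] for word in text.lower().split()}
    )
    q = word_vector(query, vocabulary)
    return max(
        documents,
        key=lambda doc: cosine_similarity(q, word_vector(doc, vocabulary)),
    )
```

For example, given the documents "prime numbers and cryptography" and "hash tables and hash functions", the query "hash functions" shares no words with the first, so only the second has a nonzero similarity and it is the one returned.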
In this chapter, we’ll describe these basic entities and some standard notation that’s associated with them. Some closely related topics will appear later in the book, as well. Chapter 7, on number theory, will discuss some subtler properties of the integers, particularly divisibility and prime numbers. Chapter 8 will discuss relations, a generalization of functions. But, really, every chapter of this book is related to this chapter: our whole enterprise will involve building complex objects out of these simple ones (and, to be ready to understand the more complex objects, we have to understand the simple pieces first). And before we launch into the sea of applications, we need to establish some basic shared language. Much of the basic material in this chapter may be familiar, but regardless of whether you have seen it before, it is standard content with which it is important to be comfortable.

2.2 Booleans, Numbers, and Arithmetic
Everything you can imagine is real.
Pablo Picasso (1881–1973)
We start with the most basic types of data: Boolean values (True and False), integers (. . . , −2, −1, 0, 1, 2, . . .), rational numbers (fractions with integers as numerators and denominators), and real numbers (including the integers and all the numbers in between them). The rest of this section will then introduce some basic numerical operations: absolute values and rounding, exponentiation and logarithms, summations and products. Figure 2.1 summarizes this section’s notation and definitions.
2.2.1 Booleans: True and False
The most basic unit of data is the bit: a single piece of information, which either takes on the value 0 or the value 1. Every piece of stored data in a digital computer is stored as a sequence of bits. (See Section 2.4 for a formal definition of sequences.)
We’ll view bits from several different perspectives: 1 and 0, on and off, yes and no, True and False. Bits viewed under the last of these perspectives have a special name, the Booleans:
Definition 2.1 (Booleans)
A Boolean value is either True or False.
Booleans are named after George Boole (1815–1864), a British mathematician, who was the first person to think about True as 1 and False as 0.
The Booleans are the central object of study of Chapter 3, on logic. In fact, they are in a sense the central object of study of this entire book: simply, we are interested in making true statements, with a proof to justify why the statement is true.
2.2.2 Numbers: Integers, Reals, and Rationals
We’ll often encounter a few common types of numbers (integers, reals, and rationals):
Definition 2.2 (Integers, Reals, and Rationals)
• The integers, denoted by Z, are those numbers with no fractional part: 0, the positive integers (1, 2, . . .), and the negative integers (−1, −2, −3, . . .).
• The real numbers, denoted by R, are those numbers that can be (approximately) represented by decimal numbers; informally, the reals include all integers and all numbers “between” any two integers.
• The rational numbers, denoted by Q, are those real numbers that can be represented as a ratio $\frac{n}{m}$ of two integers n and m, where n is called the numerator and m ≠ 0 is called the denominator. A real number that is not rational is called an irrational number.
The superficially unintuitive notation for the integers, the symbol Z, is a stylized “Z” that was chosen because of the German word Zahlen, which means “numbers.” The name rationals comes from the word ratio; the symbol Q comes from its synonym quotient. (Besides, the symbol R was already taken by the reals, so the rationals got stuck with their second choice.)
Here are a few examples of each of these types of numbers:

Example 2.1 (Integers, reals, and rationals)
The following are all examples of integers: 1, 42, 0, and −17. All of the following are real numbers: 1, 99.44, the ratio of the circumference of a circle to its diameter $\pi \approx 3.141592653\cdots$, and the so-called golden ratio $\phi = (1+\sqrt{5})/2 \approx 1.61803\cdots$.
Examples of rational numbers include $\frac{3}{2}$, $\frac{9}{5}$, $\frac{16}{4}$, and $\frac{4}{1}$. (In Chapter 8, we’ll talk about the familiar notion of the equivalence of two rational numbers like $\frac{1}{2}$ and $\frac{2}{4}$, or like $\frac{16}{4}$ and $\frac{4}{1}$, based on common divisors. See Example 8.36.) Of the example real numbers above, both 1 and 99.44 are rational numbers; we can write them as $\frac{1}{1}$ and $\frac{4972}{50}$, for example. Both π and φ are irrational.
Here are a few useful points relating these three types of numbers:
• All integers are rational numbers (with denominator equal to 1).
• All rational numbers are real numbers.
• But not all rational numbers are integers and not all real numbers are rational: for example, $\frac{3}{2}$ is not an integer, and $\sqrt{2}$ is not rational. (We’ll prove that $\sqrt{2}$ is not rational in Example 4.21.)
Figure 2.1: Summary of the basic mathematical notation introduced in Section 2.2.
Booleans: True and False
Z: integers (. . . , −3, −2, −1, 0, 1, 2, 3, . . .)
Q: rational numbers
R: real numbers
[a, b]: those real numbers x where a ≤ x ≤ b
(a, b): those real numbers x where a < x < b
[a, b): those real numbers x where a ≤ x < b
(a, b]: those real numbers x where a < x ≤ b
|x|: absolute value of x: |x| := −x if x < 0; |x| := x if x ≥ 0
⌊x⌋: floor of x: x rounded down to the nearest integer
⌈x⌉: ceiling of x: x rounded up to the nearest integer
$b^n$: b multiplied by itself n times
$b^{1/n}$, or $\sqrt[n]{b}$: a number y such that $y^n = b$ (where y ≥ 0 if possible), if one exists
$b^{m/n}$: $(b^{1/n})^m$
$\log_b x$: logarithm: $\log_b x$ is the value y such that $b^y = x$, if one exists
n mod k: modulo: n mod k := the remainder when dividing n by k
k | n: k (evenly) divides n
∑: summation: $\sum_{i=1}^{n} x_i := x_1 + x_2 + \cdots + x_n$
∏: product: $\prod_{i=1}^{n} x_i := x_1 \cdot x_2 \cdot \cdots \cdot x_n$
Taking it further: Definition 2.2 specifies Z, Q, and R somewhat informally. To be completely rigorous, one can define the nonnegative integers as the smallest collection of numbers such that: (i) 0 is an integer; and (ii) if x is an integer, then x + 1 is also an integer. See Section 5.4.1. (Of course, for even this definition to make sense, we’d need to give a rigorous definition of the number zero and a rigorous definition of the operation of adding one.) With a proper definition of the integers, it’s fairly easy to define the rationals as ratios of integers. But formally defining the real numbers is surprisingly challenging; it was a major enterprise of mathematics in the late 1800s, and is often the focus of a first course in analysis in an undergraduate mathematics curriculum.
Virtually every programming language supports both integers (usually known as ints) and real numbers (usually known as floats); see p. 217 for some discussion of the way that these basic numerical types are implemented in real computers. (Rational numbers are much less frequently implemented as basic data types in programming languages, though there are some exceptions, like Scheme.)
In addition to the basic symbols that we’ve introduced to represent the integers, the rationals, and the reals (Z, Q, and R), we will also introduce special notation for some specific subsets of these numbers. We will write $\mathbb{Z}^{\geq 0}$ and $\mathbb{Z}^{\leq 0}$ to denote the nonnegative integers (0, 1, 2, . . .) and nonpositive integers (0, −1, −2, . . .), respectively. Generally, when we write Z with a superscripted condition, we mean all those integers for which the stated condition is true. For example, $\mathbb{Z}^{\neq 1}$ denotes all integers aside from 1. Similarly, we write $\mathbb{R}^{>0}$ to denote the positive real numbers (every real number x > 0). Other conditions in the superscript of R are analogous.
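As a rough illustration of the remark above about rational numbers as a data type, the following Python sketch uses the standard-library fractions module to represent a few of the rationals from Example 2.1 exactly. Python is our choice here, not something the text prescribes, and the variable names are made up for illustration.

```python
from fractions import Fraction

# A Fraction stores an integer numerator and denominator, so arithmetic is exact.
half = Fraction(1, 2)
same_half = Fraction(2, 4)        # stored in lowest terms, i.e., as 1/2
print(half == same_half)          # True

# 99.44 written as a ratio of two integers, as in Example 2.1.
x = Fraction(4972, 50)
print(x)                          # 2486/25 (the same ratio in lowest terms)
print(float(x))                   # 99.44

# Every integer is a rational number with denominator 1.
print(Fraction(42).denominator)   # 1
```

Equal ratios such as 1/2 and 2/4 compare as equal because the library reduces each fraction to lowest terms.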
We’ll also use standard notation for intervals of real numbers, denoting all real numbers between two specified values. There are two variants of this notation, which allow “between two specified values” to either include or exclude those specified values. We use round parentheses to mean “exclude the endpoint” and square brackets to mean “include the endpoint” when we denote a range:
• (a, b) denotes those real numbers x for which a < x < b.
[…] 0 that is not an integer. (It’s all too easy to have done this calculation by typing numbers into a calculator without thinking about what the expression actually means!) Here’s the definition of $b^{m/n}$ when the exponent $\frac{m}{n}$ is a rational number:

Definition 2.6 (Raising a number to a positive rational power)
For any real number b and for any positive integers m and n ≠ 0:
• $b^{1/n}$ denotes the number y such that $y^n = b$. The value $b^{1/n}$ is called the nth root of b, and it can also be denoted by $\sqrt[n]{b}$. If there are two values y such that $y^n = b$, then by $b^{1/n}$ we mean the number y ≥ 0 such that $y^n = b$. If there are no such values y, then we’ll treat $b^{1/n}$ as undefined.
• $b^{m/n}$ denotes the mth power of $b^{1/n}$: that is, $b^{m/n} := (b^{1/n})^m$.
Here are a few examples:
Example 2.3 (Some fractional exponents)
• $16^{1/2}$ is the value y such that $y^2 = 16$, so $16^{1/2} = 4$ (because $4^2 = 16$). Similarly, $16^{1/4} = 2$ because $2^4 = 16$.
• The value of $5^{1/2}$ is roughly 2.2360679774, because $2.2360679774^2 \approx 5$. (But note that this value of $5^{1/2}$ is only an approximation, because actually $2.2360679774^2 = 4.99999999955372691076 \neq 5$.)
• As the definition implies, there may be more than one y such that $y^n = b$. For example, consider $4^{1/2}$. We need a number y such that $y^2 = 4$, and either y = 2 or y = −2 satisfies this condition. By the definition, if there are positive and negative values of y satisfying the requirement, we choose the positive one. So $4^{1/2} = 2$.
• For $(-8)^{1/3}$, we need a value y such that $y^3 = -8$. No y ≥ 0 satisfies this condition, but y = −2 does. Thus $(-8)^{1/3} = -2$.
• For $(-8)^{1/2}$, we need a value y such that $y^2 = -8$. No y ≥ 0 satisfies this condition, and no y ≤ 0 does either. Thus we will treat $(-8)^{1/2}$ as undefined.
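The cases in Example 2.3 can be traced in code. The sketch below is one possible Python rendering of Definition 2.6; the function names nth_root and rational_power are hypothetical, and the results are floating-point approximations rather than exact values. The explicit case analysis is there because, in Python 3, writing (-8) ** (1/3) directly yields a complex number instead of −2.

```python
def nth_root(b: float, n: int):
    """Approximate b**(1/n) as in Definition 2.6; return None when it is undefined."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if b >= 0:
        return b ** (1.0 / n)           # choose the nonnegative root
    if n % 2 == 1:
        return -((-b) ** (1.0 / n))     # odd n: the only real root is negative
    return None                         # even n and negative b: no real root exists


def rational_power(b: float, m: int, n: int):
    """Approximate b**(m/n) as (b**(1/n))**m; return None when the root is undefined."""
    root = nth_root(b, n)
    return None if root is None else root ** m


print(nth_root(16, 2))    # approximately 4
print(nth_root(-8, 3))    # approximately -2
print(nth_root(-8, 2))    # None: treated as undefined
```

Returning None for the undefined case is only one design option; raising an exception would be equally reasonable.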
Taking it further: Definition 2.6 presents difficulties if we try to compute, say, $\sqrt{-1}$: the definition tells us that we need to find a number y such that $y^2 = -1$. But $y^2 \geq 0$ if y ≤ 0 and if y ≥ 0, so no real number y satisfies the requirement $y^2 = -1$. To handle this situation, one can define the imaginary numbers, specifically by defining $i := \sqrt{-1}$. (The name “real” to describe real numbers was chosen to contrast with the imaginary numbers.)
We will not be concerned with imaginary numbers in this book, although (perhaps surprisingly) there are some very natural computational problems in which imaginary numbers are fundamental parts of the best algorithms solving them, such as in signal processing and speech processing (transcribing English words from a raw audio stream) or even quickly multiplying large numbers together.
When we write $\sqrt{b}$ without explicitly indicating which root is intended, then we are talking about the square root of b. In other words, $\sqrt{b} := \sqrt[2]{b}$ denotes the y such that $y^2 = b$. An integer n is called a perfect square if $\sqrt{n}$ is an integer.
Definition 2.7 (Raising a number to a negative power)
When the exponent x is negative, then $b^x$ is defined as $\frac{1}{b^{-x}}$.
For example, $2^{-4} = \frac{1}{2^4} = \frac{1}{16}$ and $25^{-3/2} = \frac{1}{25^{3/2}} = \frac{1}{(25^{1/2})^3} = \frac{1}{5^3} = \frac{1}{125}$.

For an irrational exponent x, the value of $b^x$ is approximated arbitrarily closely by choosing a rational number $\frac{m}{n}$ sufficiently close to x and computing the value of $b^{m/n}$.
Taking it further: A fully rigorous treatment of irrational powers requires a formal definition of the real numbers and an (ε, δ)-style proof as in calculus; we will omit the details as they are tangential to our purposes in this book. The basic idea is to choose a rational number m/n that approximates x to within a small error (for example, approximate x by the first k digits of its decimal expansion, which can be written as $m/10^k$) and approximate $b^x$ by $b^{m/n}$. For example, $2^\pi$ is approximated by the sequence shown in Figure 2.4; the value of $2^\pi$ is the limit of this sequence of approximations.
While essentially every modern programming language supports exponentiation—including positive, fractional, and negative powers—in some form, often in a separate math library, the actual behind-the-scenes computation is rather complicated. See p. 218 for some discussion of the underlying steps that are done to compute a quantity like √x.
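The approximation idea described above can be tried numerically. This is a minimal Python sketch, assuming ordinary double-precision floats; the function name approximations_of_two_to_pi is hypothetical. It raises 2 to successive decimal truncations of π, in the spirit of Figure 2.4.

```python
import math


def approximations_of_two_to_pi(max_digits: int = 6):
    """Return 2**(m/n) for m/n = 3, 31/10, 314/100, ... (decimal truncations of pi)."""
    values = []
    for k in range(max_digits):
        n = 10 ** k
        m = math.floor(math.pi * n)   # 3, 31, 314, 3141, ...
        values.append(2 ** (m / n))
    return values


for value in approximations_of_two_to_pi():
    print(f"{value:.6f}")
print(f"{2 ** math.pi:.6f}")          # the value that the sequence approaches
```

Each truncation is a rational exponent, so every term is covered by Definition 2.6; the library's own `2 ** math.pi` is printed last only for comparison.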
Here are a few useful facts about exponentiation:
Figure 2.4: Approximating $2^\pi$.
$2^{3} = 8$
$2^{31/10} = 8.5741\cdots$
$2^{314/100} = 8.8152\cdots$
$2^{3141/1000} = 8.8213\cdots$
$2^{31415/10000} = 8.8244\cdots$
$2^{314159/100000} = 8.8249\cdots$
⋮
Theorem 2.1 (Properties of exponentials)
For any real numbers a and b, and for any rational numbers x and y:
$b^0 = 1$ (2.1.1)
$b^1 = b$ (2.1.2)
$b^{x+y} = b^x \cdot b^y$ (2.1.3)
$(b^x)^y = b^{xy}$ (2.1.4)
$(ab)^x = a^x \cdot b^x$ (2.1.5)
These properties follow fairly straightforwardly from the definition of exponentiation. (The properties of Theorem 2.1 carry over to irrational exponents, though the proofs are less straightforward.)
2.2.5 Logarithms
The logarithm (or log) is the inverse operation to exponentiation: the value of an exponential $b^y$ is the result of multiplying a number b by itself y times, while the value of a logarithm $\log_b x$ is the number of times we must multiply b by itself to get x.
Definition 2.8 (Logarithm)
For a positive real number b ≠ 1 and a real number x > 0, the logarithm base b of x, written $\log_b x$, is the real number y such that $b^y = x$.
Here are a few simple examples:
Example 2.4 (Some logs)
• The quantity $\log_3 81$ is the power to which we must raise 3 to get 81, and thus $\log_3 81 = 4$, because $3^4 = 3 \cdot 3 \cdot 3 \cdot 3 = 81$.
• Similarly, $\log_4 16 = 2$, because $4^2 = 16$.
• Because $2 = \sqrt{4} = 4^{1/2}$, we have $\log_4 2 = 0.5$.
• $128^0 = 1$, so $\log_{128} 1 = 0$.
• $2^{1.5849625} = 2.999999998 \approx 3$, so $\log_2 3 \approx 1.5849625$.
Problem-solving tip: I have found many CS students scared, and scarred, by logs. The fear appears to me to result from students attempting to memorize facts about logs without trying to think about what they mean. Mentally translating between logs and exponentials can help make these properties more intuitive and can help make them make sense. Often the intuition of a property of exponentials is reasonably straightforward to grasp.
For any base b, note that $\log_b x$ does get larger as the value of x increases, but it gets larger very slowly. Figure 2.5 illustrates the slow rate of growth of $\log_{10} x$ as x grows.
For a real number x ≤ 0 and any base b, the expression $\log_b x$ is undefined. For example, the value of $\log_2(-4)$ would be the number y such that $2^y = -4$, but $2^y$ can never be negative. Similarly, logarithms base 1 are undefined: $\log_1 2$ would be the number y such that $1^y = 2$, but $1^y = 1$ for every value of y.
Figure 2.5: A graph of $\log_{10} x$, for x up to 1000.
Logarithms show up frequently in the analysis of data structures and algorithms, including a number that we will discuss in this book. Several facts about logarithms will be useful in these analyses, and are also useful in other settings. Here are a few:
Theorem 2.2 (Properties of logarithms)
For any real numbers b > 1, c > 1, x > 0, and y > 0, the following properties hold:
$\log_b 1 = 0$ (2.2.1)
$\log_b b = 1$ (2.2.2)
$\log_b xy = \log_b x + \log_b y$ (log of a product) (2.2.3)
$\log_b \frac{x}{y} = \log_b x - \log_b y$ (log of a quotient) (2.2.4)
$\log_b x^y = y \log_b x$ (2.2.5)
$\log_b x = \frac{\log_c x}{\log_c b}$ (“change of base” formula) (2.2.6)
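The properties in Theorem 2.2 can be spot-checked numerically. The following Python sketch is only a sanity check on a few sample values, not a proof; the helper log_base is a hypothetical name, and it computes $\log_b x$ through the change-of-base formula (2.2.6) with c = 2. Comparisons use math.isclose because floating-point logarithms are approximate.

```python
import math


def log_base(b: float, x: float) -> float:
    """Approximate log_b(x) via the change-of-base formula (2.2.6) with c = 2."""
    if b <= 0 or b == 1 or x <= 0:
        raise ValueError("need b > 0, b != 1, and x > 0")
    return math.log2(x) / math.log2(b)


b, x, y = 3.0, 81.0, 7.5   # arbitrary sample values with b > 1, x > 0, y > 0

print(math.isclose(log_base(b, x * y), log_base(b, x) + log_base(b, y)))   # (2.2.3)
print(math.isclose(log_base(b, x / y), log_base(b, x) - log_base(b, y)))   # (2.2.4)
print(math.isclose(log_base(b, x ** y), y * log_base(b, x)))               # (2.2.5)
print(math.isclose(log_base(b, x), math.log(x) / math.log(b)))             # (2.2.6), c = e
```

Here math.log2 is the base-2 logarithm and math.log with one argument is the natural logarithm.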
These properties generally follow directly from the analogous properties of exponentials in Theorem 2.1. You’ll explore some properties of logarithms (including many of the properties from Theorem 2.2) in the exercises.
We will make use of one standard piece of notational shorthand: often the expression log x is written without an explicit base. When computer scientists write the expression log x, we mean $\log_2 x$. One other base is commonly used in logarithms: the natural logarithm ln x denotes $\log_e x$, where $e \approx 2.718281828\cdots$ is defined from calculus as $e := \lim_{n \to \infty} (1 + \frac{1}{n})^n$.
Throughout this book (and throughout computer science), the assumed base of log x is 2. (Some computer scientists write lg x to denote $\log_2 x$; we’ll simply write log x.) But be aware that mathematicians or engineers may treat the default base to be e or 10.
2.2.6 Moduli and Division
So far, we’ve discussed multiplying numbers (repeatedly, to compute exponentials); in this subsection, we turn to the division of one number by another. When we consider dividing two integers (64 by 5, for example), there are several useful values to consider: regular-old division ($\frac{64}{5} = 12.8$), what’s sometimes called integer division giving “the whole part” of the fraction ($\lfloor \frac{64}{5} \rfloor = 12$), and the remainder giving “the leftover part” of the fraction (the difference between 64 and 12 · 5, namely 64 − 60 = 4).
We will return to these notions of division in great detail in Chapter 7, but we’ll
begin here with the formal definitions for the notions related to remainders:
Definition 2.9 (Modulus (remainder))
For any integers k > 0 and n, the integer n mod k is the remainder when we divide n by k.
Using the “floor” notation from Section 2.2.3, the value n mod k is defined as
$n \bmod k := n - k \cdot \lfloor \frac{n}{k} \rfloor$.
Here are examples of the value of a few integers mod 3:
Example 2.5 (Three values mod 3)
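• 8 mod 3 = 2, because 8 is 2 more than a multiple of 3, namely 6. (Or because $\lfloor \frac{8}{3} \rfloor = \lfloor 2.6666\cdots \rfloor = 2$, and 8 − 2 · 3 = 8 − 6 = 2.)
• 28 mod 3 = 1, as $\lfloor \frac{28}{3} \rfloor = 9$, and 28 − 9 · 3 = 28 − 27 = 1.
• 48 mod 3 = 0, because $\lfloor \frac{48}{3} \rfloor = \lfloor 16 \rfloor = 16$, and 48 − 16 · 3 = 0.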
Taking it further: In many programming languages, the / operator performs integer division when its arguments are both integers, and performs “real” division when either argument is a floating point number. So the expression 64 / 5 will yield 12, but 64.0 / 5 and 64 / 5.0 and 64.0 / 5.0 will all yield 12.8. In this book, though, we will always mean “real” division when we write x/y or $\frac{x}{y}$.
The n mod k operation is a standard one in programming languages; it’s written as n % k in many languages, including Java, Python, and C/C++, for example.
In Definition 2.9, we allowed n to be a negative integer, which may stretch your intuition about remainders a bit. Here’s an example of this case of the definition:
Example 2.6 (A negative integer mod 5)
We’ll compute −3 mod 5 simply by following the definition of mod from Definition 2.9:
$-3 \bmod 5 = (-3) - 5 \cdot \lfloor \frac{-3}{5} \rfloor = (-3) - 5 \cdot (-1) = (-3) + 5 = 2$.
Viewed from an appropriate perspective, this calculation should actually be very intuitive: the value r = n mod k gives the amount r by which n exceeds its closest multiple of k. (And −3 is 2 more than a multiple of 5, namely −5, so −3 mod 5 = 2.)
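The floor-based definition translates almost directly into code. The sketch below (the function name mod is our own) follows Definition 2.9 in Python, whose // operator rounds down for integers. One caveat worth knowing: Python’s built-in % agrees with Definition 2.9 when k > 0, but in languages such as C and Java the % operator truncates toward zero, so −3 % 5 evaluates to −3 there. Python’s math.fmod follows that truncating convention and is shown for contrast.

```python
import math


def mod(n: int, k: int) -> int:
    """n mod k as in Definition 2.9: n - k * floor(n / k), for an integer k > 0."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return n - k * (n // k)       # // is floor division on Python integers


print(mod(8, 3), mod(28, 3), mod(48, 3))   # 2 1 0, matching Example 2.5
print(mod(-3, 5))                          # 2, matching Example 2.6
print(-3 % 5)                              # 2: Python's % uses the same convention
print(math.fmod(-3, 5))                    # -3.0: truncating remainder, for contrast
```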
Notice that the value of n mod k is always at least 0 and at most k − 1, for any n and any k > 0; the remainder when dividing by k can never be k or more. At one of these extreme points, when $\frac{n}{k}$ has zero remainder, then we say that k (evenly) divides n:
Definition 2.10 (Integer k (evenly) divides integer n)
For any integers k > 0 and n, we say that k divides n, written k | n, if $\frac{n}{k}$ is an integer.
Notice that k | n is equivalent to n mod k = 0. By rearranging the floor-based definition from Definition 2.9 when n mod k = 0, we can see that the condition k | n is also equivalent to the condition $k \cdot \lfloor \frac{n}{k} \rfloor = n$.
Here’s a simple example:
Example 2.7 (What 5 divides)
Because $5 \cdot \lfloor \frac{10}{5} \rfloor = 5 \cdot 2 = 10$, we know 5 | 10. But $5 \cdot \lfloor \frac{9}{5} \rfloor = 5 \cdot 1 = 5 \neq 9$, so 5 ∤ 9.
Some special numbers: evens, odds, primes, composites
A few special types of integers are defined in terms of their divisibility, specifically based on whether they are divisible by 2 (evens and odds), or whether they are divisible by any other integer except for 1 (primes and composites).
Definition 2.11 (Even, odd, and parity)
A nonnegative integer n is even if n mod 2 = 0, and n is odd if n mod 2 = 1. The parity of n is its “oddness” or “evenness.”
For example, we have 17 mod 2 = 1 and 42 mod 2 = 0, so 17 is odd and 42 is even.
Taking it further: If we view 0 as False and 1 as True (see Section 2.2.1), then the value n mod 2 can be interpreted as a Boolean value. In fact, there’s a deeper connection between arithmetic and the Booleans than might be readily apparent. The “exclusive or” of two Boolean values p and q (which we will encounter in Section 3.2.3) is denoted p ⊕ q, and the expression p ⊕ q is true when one but not both of p and q is true. The exclusive or is sometimes referred to as the parity function, because p + q is odd (viewing p and q as numerical values, 0 or 1) exactly when p ⊕ q is true (viewing p and q as Boolean values, False or True).
Definition 2.12 (Prime and composite numbers)
A positive integer n > 1 is prime if the only positive integers that evenly divide n are 1 and n itself. A positive integer n > 1 is composite if it is not prime.
Notice that the definition of prime numbers does not include 0 and 1, and neither does the definition of composite numbers: in other words, 0 and 1 are neither composite nor prime. Here are a few examples of prime and composite numbers:
Example 2.8 (Prime numbers)
Problem: Is 77 prime? What about 7?
Solution: 77 is not prime, because it is evenly divisible by 7. In other words, because 77 mod 7 = 0 (and the integer 7 that evenly divides 77 is neither 1 nor 77 itself), 77 is composite.
On the other hand, 7 is prime. Convincing yourself that something is prime is harder than convincing yourself that something is not prime, but we can see it by trying all the possible divisors, namely every positive integer except 1 and 7: 7 mod 2 = 1 and 7 mod 3 = 1 and 7 mod 4 = 3 and 7 mod 5 = 2 and 7 mod 6 = 1, and furthermore 7 mod d = 7 for any d ≥ 8. None of these remainders is zero, so 7 is prime.
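Definitions 2.10 and 2.12 can be turned into a brute-force test. The Python sketch below is deliberately naive and is meant only to mirror the definitions: it tries every candidate divisor between 2 and n − 1. The function names divides, is_prime, and is_composite are hypothetical.

```python
def divides(k: int, n: int) -> bool:
    """True exactly when k | n, i.e., when n mod k = 0 (for an integer k > 0)."""
    return n % k == 0


def is_prime(n: int) -> bool:
    """True when n > 1 and no integer d with 1 < d < n evenly divides n."""
    if n <= 1:
        return False              # 0 and 1 are neither prime nor composite
    return not any(divides(d, n) for d in range(2, n))


def is_composite(n: int) -> bool:
    """True when n > 1 and n is not prime."""
    return n > 1 and not is_prime(n)


print(is_prime(77), is_prime(7))                   # False True
print([n for n in range(2, 30) if is_prime(n)])    # primes below 30
```

Divisors d ≥ n never need to be tried for n > 1, because n mod d = n ≠ 0 whenever d > n.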

Example 2.9 (Small primes and composites)
The first ten prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. The first ten composite numbers are 4, 6, 8, 9, 10, 12, 14, 15, 16, 18.
Chapter 7 is devoted to the properties of modular arithmetic, prime numbers, and the like. These quantities have deep and important connections to cryptography, error-correcting codes, and other applications that we’ll explore later.
2.2.7 Summations and Products
There is one final piece of notation related to numbers that we need to introduce: a simple way of expressing the sum or product of a collection of numbers. We’ll start with the compact summation notation that allows us to express the result of adding many numbers:
Definition 2.13 (Summation notation)
Let $x_1, x_2, \ldots, x_n$ be a sequence of n numbers. We write $\sum_{i=1}^{n} x_i$ (usually read as “the sum for i equals 1 to n of $x_i$”) to denote the sum of the $x_i$s:
$\sum_{i=1}^{n} x_i := x_1 + x_2 + \cdots + x_n$.
The variable i is called the index of summation or the index variable.
Note that $\sum_{i=1}^{0} x_i = 0$: when you add nothing together, you end up with zero.
Here are a few very simple examples:
Example 2.10 (Some simple summations)
Let $a_1 = 2$, $a_2 = 4$, $a_3 = 8$, and $a_4 = 16$, and let $b_1 = 1$, $b_2 = 2$, $b_3 = 3$, and $b_4 = 4$. Then
$\sum_{i=1}^{4} a_i = a_1 + a_2 + a_3 + a_4 = 2 + 4 + 8 + 16 = 30$
$\sum_{i=1}^{4} b_i = b_1 + b_2 + b_3 + b_4 = 1 + 2 + 3 + 4 = 10$
We can interpret this summation notation as if it expressed a for loop, as shown in Figure 2.6. The for loop interpretation might help make the “empty sum” more intuitive: the value of $\sum_{i=1}^{0} x_i$ is simply 0 because result is set to 0 in line 1, and it never changes, because n = 0 (and therefore line 3 is never executed).
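For readers who want to run the loop interpretation, here is one possible Python rendering of it. The function name summation is ours, and Python lists are indexed from 0, so the loop walks over the values directly instead of over the indices 1, …, n.

```python
def summation(xs):
    """Add up the numbers in xs, one term at a time, starting from 0."""
    result = 0
    for x in xs:
        result = result + x
    return result


a = [2, 4, 8, 16]
b = [1, 2, 3, 4]
print(summation(a))    # 30
print(summation(b))    # 10
print(summation([]))   # 0: the loop body never runs, so result stays 0
```

Python’s built-in sum function behaves the same way on these inputs, including returning 0 for an empty list.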
Figure 2.6: A for loop that returns the value of $\sum_{i=1}^{n} x_i$.
1: result := 0
2: for i := 1, 2, …, n
3:     result := result + $x_i$
4: return result
In general, instead of just adding $x_i$ in the ith term of the sum, we can add any expression involving the index of summation. (We can also start the index of summation at a value other than 1: to denote the sum $x_j + x_{j+1} + \cdots + x_n$, we write $\sum_{i=j}^{n} x_i$.) Here are a few examples:

Example 2.11 (Some sums)
Let $a_1 = 2$, $a_2 = 4$, $a_3 = 8$, and $a_4 = 16$. Then
$\sum_{i=1}^{4} a_i = 2 + 4 + 8 + 16 = 30$
$\sum_{i=1}^{4} (a_i + 1) = (2+1) + (4+1) + (8+1) + (16+1) = 34$
$\sum_{i=1}^{4} i = 1 + 2 + 3 + 4 = 10$
Example 2.12 (Some more sums)
Problem: As above, let $a_1 = 2$, $a_2 = 4$, $a_3 = 8$, and $a_4 = 16$. What are the values of the following expressions?
1. $\sum_{i=1}^{4} i^2$
2. $\sum_{i=2}^{4} i^2$
3. $\sum_{i=1}^{4} (a_i + i^2)$
4. $\sum_{i=1}^{4} 5$
Solution: Here are the values of these sums:
1. $\sum_{i=1}^{4} i^2 = 1^2 + 2^2 + 3^2 + 4^2 = 30$
2. $\sum_{i=2}^{4} i^2 = 2^2 + 3^2 + 4^2 = 29$
3. $\sum_{i=1}^{4} (a_i + i^2) = (2 + 1^2) + (4 + 2^2) + (8 + 3^2) + (16 + 4^2) = 60$
4. $\sum_{i=1}^{4} 5 = 5 + 5 + 5 + 5 = 20$
Two special types of summations arise frequently enough to have special names. A geometric series is $\sum_{i=1}^{n} \alpha^i$ for some real number α; an arithmetic series is $\sum_{i=1}^{n} i \cdot \alpha$ for a real number α. See Section 5.2.2 for more on these types of summations.
We will very occasionally consider an infinite sequence of numbers $x_1, x_2, \ldots, x_i, \ldots$; we may write $\sum_{i=1}^{\infty} x_i$ to denote the infinite sum of these numbers.
Example 2.13 (An infinite sum)
Define $x_i := 1/2^i$, so that $x_1 = 1/2$, $x_2 = 1/4$, $x_3 = 1/8$, and so forth. We can write $\sum_{i=1}^{\infty} x_i$ to denote 1/2 + 1/4 + 1/8 + 1/16 + ···. The value of this summation is 1: each term takes the sum halfway closer to 1.
While the for loop in Figure 2.6 would run forever if we tried to apply it to an infinite summation, the idea remains precisely the same: we successively add the value of each term to the result variable. (We will discuss this type of infinite sum in detail in Section 5.2.2, too.)
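Although the infinite loop itself cannot be run to completion, its partial results can be inspected. The Python sketch below (with a hypothetical function name, partial_sum) adds the first n terms 1/2, 1/4, 1/8, … using exact fractions, which makes it visible that each added term halves the remaining distance to 1.

```python
from fractions import Fraction


def partial_sum(n: int) -> Fraction:
    """Exact value of 1/2 + 1/4 + ... + 1/2**n (the first n terms of the series)."""
    return sum((Fraction(1, 2 ** i) for i in range(1, n + 1)), Fraction(0))


for n in [1, 2, 3, 4, 10]:
    s = partial_sum(n)
    print(n, s, 1 - s)    # the gap 1 - s is exactly 1/2**n
```

Exact fractions are used here instead of floats so that the gap to 1 is not blurred by rounding.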
Reindexing summations
Just as in a for loop, the “name” of the index variable in a summation doesn’t matter, as long as it’s used consistently. For example, both $\sum_{i=1}^{5} a_i$ and $\sum_{j=1}^{5} a_j$ denote the value of $a_1 + a_2 + a_3 + a_4 + a_5$.
We can also rewrite a summation by reindexing it (also known as using a change of index or a change of variable), by adjusting both the limits of the sum (lower and upper) and what’s being summed while ensuring that, overall, exactly the same things are being added together.

Example 2.14 (Shifting by two)
The sums $\sum_{i=3}^{n} i$ and $\sum_{j=1}^{n-2} (j+2)$ are equal, because both express 3 + 4 + 5 + ··· + n. (We have applied the substitution j := i − 2 to get from the first summation to the second.)
Example 2.15 (Counting backward)
The following two summations have the same value:
$\sum_{i=0}^{n} (n - i)$ and $\sum_{j=0}^{n} j$.
We can produce one from the other by substituting j := n − i, so that i = 0, 1, …, n corresponds to j = n − 0, n − 1, . . . , n − n (or, more simply, to j = n, n − 1, . . . , 0).
Reindexing can be surprisingly helpful when we’re confronted by ungainly summations; doing so can often turn the given summation into something more familiar.
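A quick way to gain confidence in a reindexing is to evaluate both forms for several values of n. The Python sketch below does this for the two reindexings of Examples 2.14 and 2.15; the four function names are made up for illustration, and a numerical check like this is evidence, not a proof.

```python
def shift_original(n: int) -> int:
    return sum(i for i in range(3, n + 1))            # sum over i = 3..n of i


def shift_reindexed(n: int) -> int:
    return sum(j + 2 for j in range(1, n - 2 + 1))    # sum over j = 1..n-2 of (j + 2)


def backward(n: int) -> int:
    return sum(n - i for i in range(0, n + 1))        # sum over i = 0..n of (n - i)


def forward(n: int) -> int:
    return sum(j for j in range(0, n + 1))            # sum over j = 0..n of j


for n in [3, 4, 10, 100]:
    print(n, shift_original(n) == shift_reindexed(n), backward(n) == forward(n))
```

Note that Python’s range excludes its upper endpoint, which is why each upper limit appears with a + 1.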
Nested sums
We can sum any expression that depends on the index variable, including summations. These summations are called double summations or, more generally, nested summations. Just as with nested loops in programs, the key is to read “from the inside out” in simplifying a summation. Here are two examples:
Example 2.16 (A double sum)
Let’s compute $\sum_{i=1}^{6} \left( \sum_{j=1}^{i} 5 \right)$.
Observe that, for any fixed value of i ≥ 0, the value of $\sum_{j=1}^{i} 5$ is just 5i, because we are summing i different copies of the number 5. Therefore
$\sum_{i=1}^{6} \left( \sum_{j=1}^{i} 5 \right) = \sum_{i=1}^{6} 5i = 5 + 10 + 15 + 20 + 25 + 30 = 105$.
Example 2.17 (A slightly more complicated double sum)
Problem: What is $\sum_{i=1}^{6} \left( \sum_{j=1}^{i} j \right)$?
Solution: Observe that the inner sum ($\sum_{j=1}^{i} j$) has the following value, for each 1 ≤ i ≤ 6:
• $\sum_{j=1}^{1} j = 1$
• $\sum_{j=1}^{2} j = 1 + 2 = 3$
• $\sum_{j=1}^{3} j = 1 + 2 + 3 = 6$
• $\sum_{j=1}^{4} j = 1 + 2 + 3 + 4 = 10$
• $\sum_{j=1}^{5} j = 1 + 2 + 3 + 4 + 5 = 15$
• $\sum_{j=1}^{6} j = 1 + 2 + 3 + 4 + 5 + 6 = 21$
Thus $\sum_{i=1}^{6} \sum_{j=1}^{i} j = 1 + 3 + 6 + 10 + 15 + 21 = 56$.
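The analogy between nested sums and nested loops can be made concrete. In the Python sketch below (function names are ours), the outer loop plays the role of the index i and the inner loop plays the role of j, with the inner loop’s upper limit depending on i just as in Examples 2.16 and 2.17.

```python
def double_sum_of_fives(n: int = 6) -> int:
    """Sum over i = 1..n of (sum over j = 1..i of 5), written as two nested loops."""
    total = 0
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            total += 5
    return total


def double_sum_of_j(n: int = 6) -> int:
    """Sum over i = 1..n of (sum over j = 1..i of j), written as two nested loops."""
    total = 0
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            total += j
    return total


print(double_sum_of_fives())   # 105
print(double_sum_of_j())       # 56
```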

When you’re programming and need to write two nested loops, it sometimes ends up being easier to write the loops with one variable in the outer loop rather than the other variable. Similarly, it may turn out to be easier to think about a nested sum by reversing the summation, that is, swapping which variable is the “outer” summation and which is the “inner.” If we have any sequence $a_{i,j}$ of numbers indexed by two variables i and j, then $\sum_{i=1}^{n} \sum_{j=1}^{n} a_{i,j}$ and $\sum_{j=1}^{n} \sum_{i=1}^{n} a_{i,j}$ have precisely the same value.
Here are two examples of reversing the order of a double summation, for the tables shown in Figure 2.7:
Example 2.18 (A simple sum)
Consider the table in Figure 2.7(a). Write $a_{i,j}$ to denote the element in the ith row and jth column of the table. Then the sum of elements in the table is, by summing the row-sums,
$\sum_{i=1}^{3} \left( \sum_{j=1}^{4} a_{i,j} \right) = \sum_{i=1}^{3} (\text{the sum of elements in row } i) = 23 + 18 + 19 = 60$.
And, by summing the column-sums, the sum of elements in the table is also
$\sum_{j=1}^{4} \left( \sum_{i=1}^{3} a_{i,j} \right) = \sum_{j=1}^{4} (\text{the sum of elements in column } j) = 15 + 15 + 15 + 15 = 60$.
Figure 2.7: Two tables whose elements we’ll sum “row-wise” and “column-wise.”
(a) A small table with some arbitrarily chosen numbers.
       j=1   2   3   4
i=1     7    5   6   5
i=2     5    5   1   7
i=3     3    5   8   3
(b) The terms of $\sum_{i=1}^{n} \sum_{j=1}^{n} \left( (-1)^i \cdot \lceil \frac{j}{2} \rceil \right)$, for n = 8.
       j=1   2   3   4   5   6   7   8
i=1    −1   −1  −2  −2  −3  −3  −4  −4
i=2     1    1   2   2   3   3   4   4
i=3    −1   −1  −2  −2  −3  −3  −4  −4
i=4     1    1   2   2   3   3   4   4
i=5    −1   −1  −2  −2  −3  −3  −4  −4
i=6     1    1   2   2   3   3   4   4
i=7    −1   −1  −2  −2  −3  −3  −4  −4
i=8     1    1   2   2   3   3   4   4
Example 2.19 (A double sum, reversed)
Problem: Let n = 8. What is the value of the following sum?
$\sum_{i=1}^{n} \sum_{j=1}^{n} \left( (-1)^i \cdot \lceil \frac{j}{2} \rceil \right)$
Solution: We are computing the sum of all the values contained in the table in Figure 2.7(b). The hard way to add up all of these values is by computing the row sums, and then adding them all up. (The given equation expresses this hard way.) The easier way is to reverse the summation, and to instead compute
$\sum_{j=1}^{n} \sum_{i=1}^{n} \left( (-1)^i \cdot \lceil \frac{j}{2} \rceil \right)$.
For any value of j, observe that $\sum_{i=1}^{n} (-1)^i \cdot \lceil \frac{j}{2} \rceil$ is actually zero! (This value is just $\lceil \frac{j}{2} \rceil \cdot \frac{n}{2} + (-\lceil \frac{j}{2} \rceil) \cdot \frac{n}{2}$.) In other words, every column sum in the table is zero. Thus the value of the entire summation is $\sum_{j=1}^{n} 0$, which is just 0.
Problem-solving tip: When you’re looking at a complicated double summation, try reversing it; it may be much easier to analyze the other way around.

Note that computing the sum from Example 2.19 when n = 100 or n = 100,000 remains just as easy if we use the column-based approach: as long as n is an even number, every column sum is 0, and thus the entire summation is 0. (The row-based approach is ever-more painful to use as n gets large.)
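Both orders of summation from Examples 2.18 and 2.19 can be checked with a few lines of Python. In this sketch, the table entries are those of Figure 2.7(a) read row by row, and term is a hypothetical helper for the summand $(-1)^i \cdot \lceil j/2 \rceil$; the ceiling is computed with integer arithmetic as (j + 1) // 2, which equals ⌈j/2⌉ for positive integers j.

```python
table = [
    [7, 5, 6, 5],
    [5, 5, 1, 7],
    [3, 5, 8, 3],
]

row_sums = [sum(row) for row in table]
column_sums = [sum(column) for column in zip(*table)]   # zip(*table) yields the columns
print(row_sums, sum(row_sums))          # [23, 18, 19] 60
print(column_sums, sum(column_sums))    # [15, 15, 15, 15] 60


def term(i: int, j: int) -> int:
    """The summand (-1)**i * ceil(j / 2) from Example 2.19."""
    return (-1) ** i * ((j + 1) // 2)


def column_sum(j: int, n: int) -> int:
    """Sum of column j of the n-by-n table of terms: sum over i = 1..n of term(i, j)."""
    return sum(term(i, j) for i in range(1, n + 1))


n = 8
print([column_sum(j, n) for j in range(1, n + 1)])      # every column sums to 0
print(sum(term(i, j) for i in range(1, n + 1) for j in range(1, n + 1)))   # 0
```

For odd n the column sums are no longer zero, which is why the observation in the text is stated for even n.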
Here’s one more example—another view of the double sum ∑6i=1 ∑ij=1 j from Exam- ple 2.17—where reversing the summation makes the calculation simpler:
Example 2.20 (A double sum, redone)
The value of ∑6 ∑i j is the sum of all the numbers in the table in Figure 2.8. We i=1 j=1 i
solved Example 2.17 by first computing ∑j=1 j, which is the sum of the numbers in the ith row. We then summed these values over the six different values of i to get 56.
123456
Figure2.8:The terms of ∑6i=1 ∑ij=1 j. We seek the sum
of all entries in the table.
The summation and product notation have a secret mnemonic to help you remember what each means: “Σ” is the Greek letter Sigma, which starts with the same letter as the word sum. And “Π” is the Greek letter Pi, which starts with the same letter as the word product.
Figure 2.9: A for loop that returns the value of ∏ni=1 xi.
2
2
3
4
1
3
4
5
1
2
3
4
5
6
Alternatively, we can compute the desired sum by looking at columns instead of 6 􏰖 6 􏰗 6
1 2 3 4 5 6
rows. The sum of the table’s elements is also ∑j=1 ∑i=j j , where ∑i=j j is the sum of the numbers in the jth column. Because there are a total of (7 − j) terms in ∑6i=j j, the sum of the numbers in the jth column is precisely j · (7 − j). (For example, the 4th column’s sum is 4 · (7 − 4) = 4 · 3 = 12.) Thus the overall summation can be written as
∑6 ∑i j=∑6 􏰂j·(7−j)􏰃=(1·6)+(2·5)+(3·4)+(4·3)+(5·2)+(6·1) i=1 j=1 j=1
= 6+10+12+12+10+6 = 56.
The ∑ notation allows us to express repeated addition of a sequence of numbers;
Products
there is analogous notation to represent repeated multiplication of numbers, too:
Definition 2.14 (Product notation)
Let x1, x2, . . . , xn be a sequence of n numbers. We write ∏ni=1 xi (usually read as “the product for i equals 1 to n of xi”) to denote the product of the xis:
n
∏i = 1 x i : = x 1 · x 2 · · · · · x n .
There are direct analogues between the notions regarding ∑ and corresponding notions for ∏: the for loop interpretation (Figure 2.9), infinite products, reindexing, and nested products. One slight difference worthy of note: the value of ∏0i=1 xi is 1; when we multiply by nothing, we’re multiplying by one.
Example 2.21 (Some products)
Here are a few simple products:
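$\prod_{i=1}^{4} i = 1 \cdot 2 \cdot 3 \cdot 4 = 24$
$\prod_{i=0}^{4} i = 0 \cdot 1 \cdot 2 \cdot 3 \cdot 4 = 0$
$\prod_{i=1}^{4} i^2 = 1^2 \cdot 2^2 \cdot 3^2 \cdot 4^2 = 576$
$\prod_{i=1}^{4} 5 = 5 \cdot 5 \cdot 5 \cdot 5 = 625$
The for loop of Figure 2.9:
result := 1
for i := 1, 2, …, n
    result := result · xi
return result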
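The for loop interpretation of ∏ (Figure 2.9) carries over to Python almost line for line. The following is a minimal sketch; the function name product is an assumption made here, and the standard library's math.prod (Python 3.8 and later) is shown only as a cross-check.

```python
import math

def product(xs):
    """Return the product of the numbers in xs, following Figure 2.9."""
    result = 1                  # the empty product is 1
    for x in xs:
        result = result * x
    return result

print(product(range(1, 5)))                   # 1*2*3*4 = 24
print(product(range(0, 5)))                   # 0, because one factor is 0
print(product(i ** 2 for i in range(1, 5)))   # 576
print(product([5] * 4))                       # 625
print(product([]))                            # 1
print(product(range(1, 5)) == math.prod(range(1, 5)))  # True
```

Starting result at 1 rather than 0 is what makes the product of zero terms come out as 1.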

2.2. BOOLEANS,NUMBERS,ANDARITHMETIC 217
Computer Science Connections
Integers and ints, Reals and floats
Every modern programming language has types that correspond to the integers and the real numbers, often called something like int (short for “integer”) and float (short for floating-point number; more about this name and the floating point representation is below).
In most programming languages, though, these types differ from Z and R in important ways. Every piece of data stored on a computer is stored as a sequence of bits, and typically the bit sequence storing a number has some fixed length. For example, an int stored using 7 bits can range from 0000000 (the number 0 represented in binary) to 1111111 (the number 2^7 − 1 = 127 represented in binary). Typically, the first bit in an int’s representation is reserved as the sign bit (set to True for a negative number and False for a positive number), and the remaining bits store the value of the number. (See Figure 2.10.) Thus there’s a bound on the largest int, depending on the number of bits used to represent ints in a particular programming language: 32,767 in Pascal (= 2^15 − 1, using 16 bits per int: 1 sign bit and 15 data bits), and 2,147,483,647 in Java (= 2^31 − 1; 32 bits, of which 1 is a sign bit). Similar constraints apply to the set of real numbers representable as a float.
A crucial point about Z and R is that they are infinite: there is no smallest integer, there’s no biggest real number, and there isn’t even a biggest real number that is smaller than 1. In almost every programming language, however, there is a smallest int, a biggest float, and a biggest float that’s smaller than 1: after all, there are only finitely many possible floats (perhaps 2^64 different values), and one of these 2^64 values is the smallest float.
The finite nature of these programming language data types can cause some subtle bugs in programs. There are issues related to integer overflow if we try to store “too large” an integer: for example, when we compute 32767 + 1 in Pascal, the result is −32768. And there are bugs related to underflow if we try to store “too small” a floating-point number: for example, if we compute (0.0000000001)^33 in Python, the result is 0.0. (But (0.0000000001)^32 is, correctly, 10^−320.) Similarly, there are also rounding errors implicit in floating point representations of numbers: because there are only finitely many different floats, the infinitely many real numbers cannot all be stored exactly. For example, when I type 0.0006 - 0.0004 == 0.0002 into a Python interpreter, I get False as output. (That’s because, according to Python, 0.0006 - 0.0004 is 0.00019999999999999993, not 0.0002.)
The name float originates with a clever idea that’s used to mitigate (though not solve) the issues above: we allow the decimal point to “float” in the representation of different numbers. Consider decimal numbers like
x = 0.000000000000000000000000000000000000000000000000001
y = 1929192919291929192919291929192919291929192919291929.5.
If, say, we represent these numbers using a total of 64 bits, most of the 64 bits representing x are devoted to the part after the decimal point, whereas most of the 64 bits representing y are devoted to the part before the decimal point.1
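The overflow, underflow, and rounding effects described above can be observed directly in Python. This is only a sketch under stated assumptions: Python's own int type never overflows, so the helper wrap_int16 (a hypothetical name) simulates a 16-bit two's-complement int; the rounding line uses 0.1 + 0.2, a standard illustration that is not taken from the text; and math.nextafter requires Python 3.9 or later.

```python
import math
import sys

def wrap_int16(n):
    """Simulate storing n in a 16-bit two's-complement int (hypothetical helper)."""
    return (n + 2 ** 15) % 2 ** 16 - 2 ** 15

print(wrap_int16(32767 + 1))      # -32768: overflow wraps around
print((1e-10) ** 33)              # 0.0: too small to represent (underflow)
print(0.1 + 0.2 == 0.3)           # False: rounding error
print(sys.float_info.max)         # the biggest finite float
print(math.nextafter(1.0, 0.0))   # the biggest float that is smaller than 1
```

The last two lines illustrate the point that, unlike R, the set of floats has a largest element and a largest element below 1.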
sign bit   data bits   value
0          0110011     + (0 + 32 + 16 + 0 + 0 + 2 + 1) = 51
0          1010100     + (64 + 0 + 16 + 0 + 4 + 0 + 0) = 84
Figure 2.10: The integers 51 and 84, represented in binary as 8-bit signed integers.
You can learn more about the details of how numerical values are stored on computers in a course on computer architecture. In addition to the floating-point standard, other interesting details include 2’s complement storage of integers, which allows a single representation of positive and negative integers so that addition “just works” the same way, even with a sign bit. You can learn more about this material in a good computer architecture textbook, such as
1 David A. Patterson and John L. Hennessy. Computer Organization and Design: the Hardware/Software Interface. Morgan Kaufmann, 4th edition, 2008.

218 CHAPTER 2. BASIC DATA TYPES
Computer Science Connections
Computing Square Roots, and Not Computing Square Roots
Programs can make use of numerical operations in surprisingly complex ways. Many programmers just happily use these numerical operations without thinking about how they’re implemented, but a little knowledge of what’s happening behind the scenes can actually help speed up our programs. Computer hardware can directly and efficiently execute basic arithmetic operations like addition and multiplication and division, but more complex operations may require many of these basic operations.
Consider the task of computing √x, given an input value x, for example. The basic idea is to use some kind of iterative improvement algorithm: we start with a guess y0 of the value of √x, and then update our guess to a new guess y1 (by observing in some way whether y0 was too big or too small). We continue to improve our guess until we’ve reached a value y such that y^2 is “close enough” to x. (We can specify the tolerance of the algorithm, that is, how close counts as “close enough.”)
A simple implementation of this idea is called Heron’s method, named after the 1st-century Greek mathematician Heron of Alexandria and shown in Figure 2.11. It relies on the nonobvious fact that the average of y and x/y is closer to √x than y was. (Unless y is exactly equal to √x, of course; in that case, the new guess is identical to the old guess: the average of √x and x/√x is still √x.) Almost two millennia later, Isaac Newton developed a general technique for computing values of numerical expressions involving exponentials, among other things. This technique, known as Newton’s method, involves calculus: specifically, using derivatives to figure out how far to move from a current guess yi in making the next guess yi+1. Like Heron’s method, Newton’s method is an example of a technique in scientific computing, the subfield of computer science devoted to efficient computation of numerical values, often for the purposes of simulating a complex system.2
Many interesting questions and techniques are used in scientific computing; one outstanding, and classic, reference for some of this material is the book
2 William Press, Saul Teukolsky, William Vetterling, and Brian Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 3rd edition, 2007.
Figure 2.12: Implementing a blur filter. We wish to average all pixels within the circle to compute the new pixel p.
Input: A positive real number x.
Output: A real number y such that y^2 ≈ x.
Let y0 be arbitrary, and let i := 0.
while (yi)^2 is too far from x:
    let yi+1 := (yi + x/yi)/2 and i := i + 1
return yi
For example, here’s the computation of the square root of x = 42, using x/2 as the initial guess:
i   yi
0   21
1   11.5
2   7.576086956 · · ·
3   6.559922961 · · ·
4   6.481218587 · · ·
5   6.480740716 · · ·
6   6.480740698 · · ·
Figure 2.11: Heron’s method for computing square roots, and an example.
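Heron’s method as given in Figure 2.11 can be sketched in a few lines of Python. The function name heron_sqrt, the default tolerance of 1e-9, and the choice to return the whole list of guesses are assumptions made here for illustration; the initial guess x/2 matches the worked example for x = 42.

```python
def heron_sqrt(x, tolerance=1e-9):
    """Approximate the square root of a positive number x by Heron's method.

    Returns the list of guesses y_0, y_1, ...; the last entry is the answer.
    """
    guesses = [x / 2]                        # initial guess y_0 = x/2
    while abs(guesses[-1] ** 2 - x) > tolerance:
        y = guesses[-1]
        guesses.append((y + x / y) / 2)      # average of y and x/y
    return guesses

for i, y in enumerate(heron_sqrt(42)):
    print(i, y)
```

The first two printed guesses are 21.0 and 11.5, as in the table of Figure 2.11, and the loop stops once the square of the current guess is within the tolerance of x.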
Work in scientific computing has improved the efficiency of numerical computation. But even better is to be aware of the fact that operations like square roots require significant computation “under the hood,” and to avoid them when possible. To take one particular example, consider applying a blur filter to an image: replace each pixel p by the average of all pixels within a radius-r circle centered at p in the original image. To compute the blurred version of a particular pixel p, we might look at every pixel q within ±r rows or columns and compute whether p and q are within distance r. (See Figure 2.12.) There are two natural ways to compute whether the two pixels p and q are within distance r:
1. the “obvious” way: test whether $\sqrt{(p_x - q_x)^2 + (p_y - q_y)^2} \le r$.
2. the “other” way: test whether $(p_x - q_x)^2 + (p_y - q_y)^2 \le r^2$.
While there is no important mathematical difference between these two formulas (we’ve simply squared both sides in the “other” way), there is a computational difference. Because square roots are expensive to compute, it turns out that in my Python implementation of a blur filter, using the “other” way was about 12% faster than using the “obvious” way.
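The two distance tests can be written side by side as follows. This is a sketch, not the author’s blur filter: the function names and the representation of a pixel as an (x, y) tuple are assumptions, and no timing claim is made for this code.

```python
import math

def within_radius_sqrt(p, q, r):
    """The "obvious" test: compare the Euclidean distance to r."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) <= r

def within_radius_squared(p, q, r):
    """The "other" test: compare the squared distance to r squared."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= r ** 2

p, q = (10, 10), (13, 14)                 # these pixels are at distance 5
print(within_radius_sqrt(p, q, 5), within_radius_squared(p, q, 5))  # True True
print(within_radius_sqrt(p, q, 4), within_radius_squared(p, q, 4))  # False False
```

For nonnegative r, squaring both sides preserves the inequality, so the second test avoids the square root without changing the answer.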

2.2.8 Exercises
What are the smallest and largest integers that are . . .
2.1 . . . in the interval (111, 202)?
2.2 . . . in the interval [111, 202)?
2.3 . . . in the interval (17, 42) but not in the interval (39, 99]?
2.4 . . . in the interval [17, 42] but not in the interval [39, 99)?
Explain your answers to the following questions.
2.5 If x and y are integers, is x + y necessarily an integer?
2.6 If x and y are rational numbers, is x + y necessarily rational?
2.7 If x and y are irrational numbers, is x + y necessarily irrational?
What is the value of each of the following expressions?
2.8 ⌊2.5⌋ + ⌈3.75⌉
2.9 ⌊3.14159⌋ · ⌈0.87853⌉
2.10 (⌊3.14159⌋)^⌈3.14159⌉
2.11 Most programming languages provide two different functions called floor and truncate to trim real numbers to integers. In these languages, floor(x) is defined exactly as we defined ⌊x⌋, and trunc(x) is defined to simply delete any digits that appear after the decimal point in writing x. So trunc(3.14159) = 3, with the digits .14159 deleted. Explain why programming languages have both floor and trunc; that is, explain under what circumstances floor(x) and trunc(x) give different values.
Using floor, ceiling, and standard arithmetic notation, give an expression for a real number x . . .
2.12 . . . rounded to the nearest integer. (“Round up” for a number that’s exactly between two integers— for example, 7.5 rounds to 8.)
2.13 . . . rounded to the nearest 0.1.
2.14 . . . rounded to the nearest 10^−k, for an arbitrary number k of digits after the decimal point.
2.15 . . . truncated to k digits after the decimal point—that is, leaving off the (k + 1)st digit and beyond.
(For example, 3.1415926 truncated with 3 digits is 3.141, and truncated with 4 digits is 3.1415.)
Taking it further: Many programming languages provide a facility for displaying formatted output, particularly numbers, in the style of Example 2.15. For example, printf(“%.3f”, x) says to “print (formatted)” the value of x with only 3 digits after the decimal point. (The “f” of “printf” stands for formatted; the “f” of “%.3f” stands for float.) This style of printf command appears in many languages: C, Java, Python, and others.
2.16 For what value(s) of x in the interval [2, 3] is x − (⌊x⌋ + ⌈x⌉)/2 the largest?
2.17 For what value(s) of x in the interval [2, 3] is x − (⌊x⌋ + ⌈x⌉)/2 the smallest?
Let x be a real number. Rewrite each of the following as simply as possible:
2.18 ⌊⌊x⌋⌋
2.19 ⌈⌈x⌉⌉
2.20 ⌊⌈x⌉⌋
2.21 ⌈⌊x⌋⌉
2.22 Are |⌊x⌋| and ⌊|x|⌋ always equal? Explain.
2.23 Are 1 + ⌊x⌋ and ⌊1 + x⌋ always equal? Explain.
2.24 Are ⌊x⌋ + ⌊y⌋ and ⌊x + y⌋ always equal? Explain.
2.25 Let x be a real number. Describe (in English) what 1 + ⌊x⌋ − ⌈x⌉ represents. Explain.
2.26 In performing a binary search for x in a sorted n-element array A[1 . . . n] (see Figure 6.17(a)), the first thing we do is to compare the value of x and the value of A[⌊(1 + n)/2⌋]. Assume that all elements of A are distinct. How many elements of A are less than A[⌊(1 + n)/2⌋]? How many are greater? Write your answers as simply as possible.
2.27 Which is bigger, 3^10 or 10^3?
What is the value of each of the following expressions?
2.28 4^8
2.29 (1/4)^8
2.30 (−4)^8
2.31 (−4)^9
2.32 256^(1/4)
2.33 8^(1/4)
2.34 8^(3/4)
2.35 (−9)^(1/4)
What is the value of each of the following expressions?
2.36 log_2 8
2.37 log_2 (1/8)
2.38 log_8 2
2.39 log_{1/8} 2

2.40 Which is bigger, log_10 17 or log_17 10?
Each of the following statements is a general property of logarithms (from Theorem 2.2), for any real numbers b, c > 1 and x, y > 0. Using the definition of logarithms and the properties of exponentials from Theorem 2.1, justify each of these properties.
2.41 log_b 1 = 0
2.42 log_b b = 1
2.43 log_b x^y = y log_b x
2.44 log_b xy = log_b x + log_b y
2.45 log_b x = (log_c x)/(log_c b)
Using the properties from Theorem 2.2 that you just proved, and the fact that log_b x = log_b y exactly when x = y (for any base b > 1), justify the following additional properties of logarithms:
2.46 For any real numbers b > 1 and x > 0, we have that b^[log_b x] = x.
2.47 For any real numbers b > 1 and a, n > 0, we have that n^[log_b a] = a^[log_b n].
2.48 Prove (2.2.4) from Theorem 2.2: for any b > 1 and x, y > 0, we have that log_b (x/y) = log_b x − log_b y.
2.49 Using notation defined in this chapter, define the “hyperceiling” ⌈n⌉ of a positive integer n, where ⌈n⌉ is the smallest exact power of two that is greater than or equal to n. (That is, ⌈n⌉ denotes the smallest value of 2^k where 2^k ≥ n and k is a nonnegative integer.)
2.50 Similar to the last exercise: when writing down an integer n on paper using standard decimal notation, we need enough columns for all the digits of n (and perhaps one additional column for a “−” if n < 0). Write down an expression indicating how many columns we need to represent n. (Hint: use the case notation introduced in Definition 2.3, and be sure that your expression is well defined, that is, it doesn’t “generate any errors,” for all integers n.)
What are the values of the following expressions?
2.51 202 mod 2
2.52 202 mod 3
2.53 202 mod 10
2.54 −202 mod 10
2.55 17 mod 42
2.56 42 mod 17
2.57 17 mod 17
2.58 −42 mod 17
2.59 −42 mod 42
2.60 Observe the Python behavior of the % operator (the Python notation for mod) that’s shown in Figure 2.13. The first two lines (3 mod 5 = 3 and −3 mod 5 = 2) are completely consistent with the definition that we gave for mod (Definition 2.9), including its use for n mod k when n is negative (as in Example 2.6). But we haven’t defined what n mod k means for k < 0. Propose a formal definition of % in Python that’s consistent with Figure 2.13.
What is the smallest positive integer n that has the following characteristics?
2.61 n mod 2 = 0, n mod 3 = 0, and n mod 5 = 0
2.62 n mod 2 = 1, n mod 3 = 1, and n mod 5 = 1
2.63 n mod 2 = 0, n mod 3 = 1, and n mod 5 = 0
2.64 n mod 3 = 2, n mod 5 = 3, and n mod 7 = 5
2.65 n mod 2 = 1, n mod 3 = 2, n mod 5 = 3, and n mod 7 = 4
2.66 (programming required) Write a program to determine whether a given positive integer n is prime by testing all possible divisors between 2 and n − 1. Use your program to find all prime numbers less than 202.
2.67 (programming required) A perfect number is a positive integer n that has the following property: n is equal to the sum of all positive integers k < n that evenly divide n. For example, 6 is a perfect number, because 1, 2, and 3 are the positive integers less than 6 that evenly divide 6, and 6 = 1 + 2 + 3. Write a program that finds the four smallest perfect numbers.
2.68 (programming required) Write a program to find all integers between 1 and 1000 that are evenly divisible by exactly three different integers.
>>> 3 % 5
3
>>> -3 % 5
2
>>> 3 % -5
-2
>>> -3 % -5
-3
Figure 2.13: Python’s implementation of % (“mod”). (The value of the expression written after >>> is shown on the next line.)

Compute the values of the following summations and products.
2.69 $\sum_{i=1}^{6} 6$
2.70 $\sum_{i=1}^{6} i^2$
2.71 $\sum_{i=1}^{6} 2^{2i}$
2.72 $\sum_{i=1}^{6} i \cdot 2^i$
2.73 $\sum_{i=1}^{6} (i + 2^i)$
2.74 $\prod_{i=1}^{6} 6$
2.75 $\prod_{i=1}^{6} i^2$
2.76 $\prod_{i=1}^{6} 2^{2i}$
2.77 $\prod_{i=1}^{6} i \cdot 2^i$
2.78 $\prod_{i=1}^{6} (i + 2^i)$
Compute the values of the following nested summations.
2.79 $\sum_{i=1}^{6} \sum_{j=1}^{6} (i \cdot j)$
2.80 $\sum_{i=1}^{6} \sum_{j=i}^{6} (i \cdot j)$
2.81 $\sum_{i=1}^{6} \sum_{j=1}^{i} (i \cdot j)$
2.82 $\sum_{i=1}^{8} \sum_{j=i}^{8} i$
2.83 $\sum_{i=1}^{8} \sum_{j=i}^{8} j$
2.84 $\sum_{i=1}^{8} \sum_{j=i}^{8} (i + j)$
2.85 $\sum_{i=1}^{4} \sum_{j=i}^{4} (j^i)$
2.3 Sets: Unordered Collections
History is a set of lies agreed upon.
Napoleon Bonaparte (1769–1821)
Section 2.2 introduced the primitive types of objects that we’ll use throughout the book. We turn now to collections of objects, analogous to lists and arrays in programming languages. We start in this section with sets, in which objects are collected without respect to order or repetition. (Section 2.4 will address sequences, which are collections of objects in which order and repetition do matter.) The definitions and notation related to sets are summarized in Figure 2.14.
Definition 2.15 (Sets)
A set is an unordered collection of objects.
Here are a few simple examples:
Example 2.22 (Some sets)
Here are three sets: the set of bits {0, 1}, the set of prime numbers {2, 3, 5, 7, 11, . . .}, and the set of basic arithmetic operators {+, −, ·, /}. (We’ve written these sets using standard notation by listing the objects in the set between curly braces { and }.)
Set membership (that is, the question is the object x one of the objects in the collection S?, for a particular object x and a particular set S) is the central notion for sets:
Definition 2.16 (Set membership)
For a set S and an object x, the expression x ∈ S is true when x is one of the objects contained in the set S. When x ∈ S, we say that x is an element or member of S or, more simply, that x is in S.
The expression x ∉ S is the negation of the expression x ∈ S: that is, x ∉ S is true whenever x is not an element of S (and thus whenever x ∈ S is false).
Example 2.23 (Some set memberships)
The integer 0 is an element of the set of bits, and + is in the set of basic arithmetic operators. But 1 is not an element of the set of prime numbers, and 8 is not in the set of bits.
A second key concept about a set is its cardinality, or size:
Definition 2.17 (Set cardinality)
The cardinality of a set S, denoted by |S|, is the number of distinct elements in S.
Sets are typically denoted by uppercase letters (generically S, T, U, A, B, …), often by a mnemonic letter: S for a set of students, D for a set of documents, etc. As we saw, the common sets from mathematics defined in Section 2.2.2 are often written using a “blackboard bold” font: Z, R, and Q.

2.3. SETS:UNORDEREDCOLLECTIONS 223
set membership     x ∈ S                          x is one of the elements of S
cardinality        |S|                            the number of distinct elements in the set S
set enumeration    {x1, x2, …, xk}                the set containing elements x1, x2, . . . , xk
set abstraction    {x ∈ U : P(x)}                 the set containing all x ∈ U for which P(x) is true; U is the “universe” of candidate elements
empty set          {} or ∅                        the set containing no elements
complement         ∼S := {x ∈ U : x ∉ S}          the set of all elements in the universe U that aren’t in S; U may be left implicit if it’s obvious from context
union              S ∪ T := {x : x ∈ S or x ∈ T}  the set of all elements in either S or T (or both)
intersection       S ∩ T := {x : x ∈ S and x ∈ T} the set of all elements in both S and T
set difference     S − T := {x : x ∈ S and x ∉ T} the set of all elements in S but not in T
set equality       S = T                          every x ∈ S is also in T, and every x ∈ T is also in S
subset             S ⊆ T                          every x ∈ S is also in T
proper subset      S ⊂ T                          S ⊆ T but S ≠ T
superset           S ⊇ T                          every x ∈ T is also in S
proper superset    S ⊃ T                          S ⊇ T but S ≠ T
power set          P(S)                           the set of all subsets of S
Example 2.24 (Some set sizes)
The cardinality of the set of bits is 2, because there are two distinct elements of that set (namely 0 and 1).
The cardinality of the set S of prime numbers between 10 and 20 is |S| = 4: the four elements of S are 11, 13, 17, and 19.
Chapter 9 is devoted entirely to the apparently trivial problem of counting—given a (possibly convoluted) description of a set S, find |S|—which turns out to have some interesting and useful applications, and isn’t as easy as it seems.
Taking it further: In this book, we will be concerned almost exclusively with the cardinality of finite sets, but one can also ask questions about the cardinality of sets like Z or R that contain an infinite number of distinct elements. For example, it’s possible to prove that |Z| = |Z≥0|, which is a pretty amazing result: there are as many nonnegative integers as there are integers! (And that’s true despite the fact that every nonnegative integer is an integer!) But it’s also possible to prove that |Z| ≠ |R|: . . . but there are more real numbers than integers! More amazingly, one can use similar ideas to prove that there are fewer computer programs than there are problems to solve, and that therefore there are some problems that are not solved by any computer program. This idea is the central focus of the study of computability and uncomputability. See Section 4.4.4 and the discussion on p. 937.
2.3.1 Building Sets from Scratch
There are two standard ways to specify a set “from scratch”: by simply listing each of the elements of the set, or by defining the set as the collection of objects for which a particular logical condition is true.
Set definition via exhaustive enumeration
A set can be specified by exhaustively listing its elements, that is, by writing a complete list of its elements inside the curly braces { and }. Here are a few examples:
Example 2.25 (Some exhaustively enumerated sets)
• The set of even prime numbers is {2}.
• The set of prime numbers between 10 and 20 is {11, 13, 17, 19}.
• The set of 2-digit perfect squares is {81, 64, 25, 16, 36, 49}.
• The set of bits is {0, 1}.
• The set of Turing Award winners between 1984 and 1987 inclusive is {Niklaus Wirth, Richard Karp, John Hopcroft, Robert Tarjan, John Cocke}.
Figure 2.14: A summary of set notation.
Taking it further: The Turing Award is the most prestigious award given in computer science; the “Nobel Prize of CS,” it’s sometimes called. Niklaus Wirth developed a number of programming languages, including Pascal. Richard Karp made major contributions to the study of computational complexity, in particular with respect to the understanding of NP-Completeness. John Hopcroft and Robert Tarjan made massive early contributions in designing and analyzing algorithms and data structures for problems. John Cocke was a leader in compilers and computer architecture and is often credited with inventing the RISC architecture, which changed the way that computer chips and their corresponding instruction sets were designed.
Recall that a set is an unordered collection, and thus the order in which the elements are listed doesn’t matter when specifying a set via exhaustive enumeration. Any repetition in the listed elements is also unimportant. For example:
Example 2.26 (The same set, three ways)
The set {2 + 2, 2 · 2, 2/2, 2 − 2} is precisely identical to the set {0, 1, 4}, both of which are precisely identical to {4, 0, 1}. Also note that |{2 + 2, 2 · 2, 2/2, 2 − 2}| = 3; despite there being four entries in the list of elements, there are only three distinct objects in the set.
It’s important to remember that the integer 2 and the set {2} are two entirely different kinds of things. For example, note that 2 ∈ {2}, but that {2} ∉ {2}; the lone element in {2} is the number two, not the set containing the number two.
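Python’s built-in set type behaves the same way with respect to order and repetition, so Example 2.26 can be checked directly. Two choices in this sketch are assumptions rather than anything from the text: integer division (2 // 2) stands in for 2/2 so that every element is an int, and frozenset({2}) stands in for the set {2} as a candidate element, because Python only allows hashable objects inside a set.

```python
a = {2 + 2, 2 * 2, 2 // 2, 2 - 2}

print(a == {0, 1, 4})           # True: repetition is ignored
print(a == {4, 0, 1})           # True: order is ignored
print(len(a))                   # 3, the cardinality of the set

print(2 in {2})                 # True: the number 2 is an element of {2}
print(frozenset({2}) in {2})    # False: the set {2} is not an element of {2}
```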
Set definition via set abstraction
Instead of explicitly listing all of a set’s elements, we can also define a set in terms of
a condition that is true for the elements of the set and that’s false for every object that is not an element of the set. Defining a set this way uses set abstraction notation:
Definition 2.18 (Set Abstraction)
Let U be a set of possible elements, called the universe. Let P(x) be a condition (also called a predicate) that, for every x ∈ U, is either true or false. Then
{x ∈ U : P(x)} denotes the set of all objects x ∈ U for which P(x) is true.
That is, for any candidate element y ∈ U, the element y is in the set {x ∈ U : P(x)} when P(y) = True, and y ∉ {x ∈ U : P(x)} when P(y) = False. (A fully proper version of Definition 2.18 requires functions, described in Section 2.5.)
The colon in the notation for set abstraction is read as “such that,” so the set in Definition 2.18 would be read “the set of all x in U such that P of x.”

Example 2.27 (Most of Example 2.25, redone)
• The set of even prime numbers is {x ∈ Z>1 : x is prime and x is even}.
• The set of 2-digit perfect squares is {n ∈ Z : √n ∈ Z and 10 ≤ n ≤ 99}.
• The set of bits is {b ∈ Z : b^2 = b}.
For this set abstraction notation to meaningfully define a set S, we must specify the universe U of candidates from which the elements of S are drawn. We will permit ourselves to be sloppy in our notation, and when the universe U is clear from context we will allow ourselves the liberty of writing {x : P(x)} instead of {x ∈ U : P(x)}.3
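Set abstraction has a close counterpart in Python’s set comprehensions, which filter a universe by a predicate. The sketch below redoes Example 2.27 under one loudly stated assumption: because a program cannot enumerate all of Z, a finite range (the name universe is chosen here) stands in for the universe. The helper is_prime is likewise a hypothetical name.

```python
import math

def is_prime(n):
    """Trial-division primality test (illustrative helper)."""
    return n > 1 and all(n % d != 0 for d in range(2, math.isqrt(n) + 1))

universe = range(-100, 101)    # finite stand-in for Z (an assumption)

even_primes = {x for x in universe if x > 1 and is_prime(x) and x % 2 == 0}
two_digit_squares = {n for n in universe
                     if 10 <= n <= 99 and math.isqrt(n) ** 2 == n}
bits = {b for b in universe if b * b == b}

print(even_primes)                 # {2}
print(sorted(two_digit_squares))   # [16, 25, 36, 49, 64, 81]
print(sorted(bits))                # [0, 1]
```

In each comprehension the part after `if` plays the role of the predicate P(x), and the `for x in universe` clause plays the role of x ∈ U.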
Taking it further: The notational sloppiness of omitting the universe in set abstraction will be a convenience for us, and it will not cause us any trouble—but it turns out that one must be careful! In certain strange scenarios when defining sets, there are subtle but troubling paradoxes that arise if we allow the universe to be anything at all. The key problem can be seen in Russell’s paradox, named after the British philosopher/mathematician Bertrand Russell; Russell’s discovery of this paradox revealed an inconsistency in the commonly accepted foundations of mathematics in the early 20th century.
Here is a brief sketch of Russell’s Paradox. Let X denote the set of all sets that do not contain themselves: that is, let X := {S : S ∉ S}. For example, {2} ∈ X because {2} ∉ {2}, and R ∈ X because R is not a real number, so R ∉ R. On the other hand, if we let T∗ denote the set of all sets, then T∗ ∉ X: because T∗ is a set, and T∗ contains all sets, then T∗ ∈ T∗ and therefore T∗ ∉ X.
Here’s the problem: is X ∈ X? Suppose that X ∈ X: then X ∈ {S : S ∉ S} by the definition of X, and thus X ∉ X. But suppose that X ∉ X; then, by the definition of X, we have X ∈ X. So if X ∈ X then X ∉ X, and if X ∉ X then X ∈ X. But that’s absurd!
One standard way to escape this paradox is to say that the set X cannot be defined, because, to be able to define a set using set abstraction, we need to start from a defined universe of candidate elements. (And the set T∗ cannot be defined either.) The Liar’s Paradox, dating back about 3000 years, is a similar paradox: is “this sentence is false” true (nope!) or false (nope!)? In both Russell’s Paradox and the Liar’s Paradox, the fundamental issue relates to self-reference; many other mind-twisting paradoxes are generated through self-reference, too.3
Definition 2.18 lets us write {x ∈ U : P(x)} to denote the set containing exactly those elements x of U for which P(x) is True. We will extend this notation to allow ourselves to write more complicated expressions to the left of the colon, as in the following example:
Example 2.28 (2-digit perfect squares, again)
We can write the set of 2-digit perfect squares as {x^2 : x ∈ Z and 10 ≤ x^2 ≤ 99} or as {x^2 : x ∈ {4, 5, 6, 7, 8, 9}} = {4^2, 5^2, 6^2, 7^2, 8^2, 9^2}.
To properly define this extended form of the set-abstraction notation, we again need the idea of functions, which are defined in Section 2.5.1. See Definition 2.47 for a proper definition of this extended notation.
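The extended notation, with an expression to the left of the colon, also has a direct Python analogue: the expression at the front of a set comprehension. The sketch below mirrors the two descriptions in Example 2.28; as before, a finite range is assumed in place of Z, and the variable names are illustrative.

```python
# {x^2 : x in Z and 10 <= x^2 <= 99}, with a finite range standing in for Z
squares_by_condition = {x * x for x in range(-100, 101) if 10 <= x * x <= 99}

# {x^2 : x in {4, 5, 6, 7, 8, 9}}
squares_by_listing = {x * x for x in {4, 5, 6, 7, 8, 9}}

print(sorted(squares_by_condition))                 # [16, 25, 36, 49, 64, 81]
print(squares_by_condition == squares_by_listing)   # True
```

Negative values such as x = −4 also satisfy the condition in the first description, but they contribute squares that are already in the set, so the two sets coincide.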
Taking it further: Almost all modern programming languages support the use of lists to store a collection of objects. While these lists store ordered collections, there are some very close parallels between these lists and sets. In fact, the ways we’ve described building sets have very close connections to ideas in certain programming languages like Scheme and Python; see p. 233 for some discussion.
For more on these and other paradoxes, see
3 R. M. Sainsbury. Paradoxes. Cambridge University Press, 3rd edition, 2009.
The empty set
One particularly useful set (despite its simplicity) is the empty set, also sometimes called the null set:
Definition 2.19 (The empty set ∅)
The empty set, denoted {} or ∅, is the set that contains no elements.
The definition of the empty set as {} is an exhaustive listing of all of the elements of the set—though, because there aren’t any elements, there are no elements in the list. Alternatively, we could have used the set abstraction notation to define the empty
set, as ∅ := {x : False}. This definition may seem initially confusing, but it’s in fact a direct application of Definition 2.18: the condition P for this set is P(x) = False (that is: for every object x, the value of P(x) is False), and we’ve defined ∅ to contain every object y such that P(y) = True. But there isn’t any object y such that P(y) = True— because P(y) is always false—and thus there’s no y ∈ {x : P(x)}.
Notice that, because there are zero elements in ∅, its cardinality is zero: in other words, |∅| = 0. One other special type of set is defined based on its cardinality; a singleton set is a set S that contains exactly one element, that is, a set S such that |S| = 1.
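Both definitions of the empty set can be mirrored in Python, with one notational caveat that is specific to Python and not part of the text: there, the literal {} denotes an empty dictionary, so the empty set is written set(). The variable names below are illustrative.

```python
empty = set()                                       # Python's empty set
also_empty = {x for x in range(100) if False}       # {x : False} over a small universe

print(empty == also_empty)   # True
print(len(empty))            # 0, matching |∅| = 0
print(type({}).__name__)     # dict: in Python, {} is not a set

singleton = {42}
print(len(singleton))        # 1
```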
2.3.2 Building Sets from Other Sets
There are a number of ways to create new sets from two given sets A and B. We will define these operations formally, but it is sometimes more intuitive to look at a more visual representation of sets called Venn diagrams, which are drawings that represent sets as circular “blobs” that contain points (elements), enclosed in a rectangle that denotes the universe.
Venn diagrams are named after the 19th-century British logician/philosopher John Venn.
Example 2.29 (Venn diagram of odds and primes)
Let U := {1, 2, …, 10}. Let P := {2, 3, 5, 7} denote the set of primes in U, and let O := {1, 3, 5, 7, 9} denote the set of odd numbers in U.
A Venn diagram illustrating these sets is shown in Figure 2.15: 3, 5, and 7 are elements of both P and O; 2 is in P but not O; 1 and 9 are in O but not P; and 4, 6, and 8 are in neither P nor O.
Figure 2.15: A Venn diagram for the set O of odd numbers and the set P of prime numbers between 1 and 9.
We will now define four standard ways of building a new set in terms of one or two existing sets: complement, union, intersection, and set difference.
Definition 2.20 (Set complement)
The complement of a set A with respect to the universe U, written ∼A (or sometimes $\overline{A}$), is the set of all elements not contained within A. Formally, ∼A := {x ∈ U : x ∉ A}. (When the universe is obvious from context, we will leave it implicit.)
Figure 2.16 shows a Venn diagram illustrating the complement of A. For example, if the universe is {1, 2, …, 10}, then ∼{1, 2, 3} = {4, 5, 6, 7, 8, 9, 10} and ∼{3, 4, 5, 6} = {1, 2, 7, 8, 9, 10}.
Figure 2.16: The complement of a set A. The shaded region represents the set ∼A with respect to the universe U.

2.3. SETS:UNORDEREDCOLLECTIONS 227
Definition 2.21 (Set union)
The union of two sets A and B, denoted A ∪ B, is the set of all elements in either A or B (or both). Formally, A ∪ B := {x : x ∈ A or x ∈ B}. Analogously to summation and product notation (∑ and ∏), we will sometimes write ⋃_{i=1}^{n} S_i to denote S_1 ∪ S_2 ∪ ··· ∪ S_n.
Figure 2.17 shows a Venn diagram illustrating the union of A and B. For example, {1, 2, 3} ∪ {3, 4, 5, 6} = {1, 2, 3, 4, 5, 6}.
Figure 2.17: The union A ∪ B of two sets A and B.
Definition 2.22 (Set intersection)
The intersection of two sets A and B, denoted A ∩ B, is the set of all elements in both A and B. Formally, A ∩ B := {x : x ∈ A and x ∈ B}. We will sometimes write ⋂_{i=1}^{n} S_i to denote S_1 ∩ S_2 ∩ ··· ∩ S_n.
Figure 2.18 shows a Venn diagram illustrating A ∩ B. For example, {1, 2, 3} ∩ {3, 4, 5, 6} = {3}.
Figure 2.18: The intersection A ∩ B of sets A and B.
Definition 2.23 (Set difference)
The difference of two sets A and B, denoted A − B, is the set of all elements contained in the set A but not in the set B. Formally, A − B := {x : x ∈ A and x ∉ B}. (Some people write A \ B instead of A − B to denote set difference.)
Figure 2.19 shows a Venn diagram illustrating the set difference of A and B. Note that A − B and B − A are different sets; both are illustrated in Figure 2.19. For example, {1, 2, 3} − {3, 4, 5, 6} = {1, 2} and {3, 4, 5, 6} − {1, 2, 3} = {4, 5, 6}.
Figure 2.19: The difference of two sets A and B. The shaded region in the first panel represents the set A − B, and the shaded region in the second panel represents B − A.
In more complicated expressions that use more than one of these set operators, the ∼ operator “binds tightest”: in an expression like ∼S ∪ T, we mean (∼S) ∪ T and not ∼(S ∪ T). We use parentheses to specify the order of operations among ∩, ∪, and −. Here’s a slightly more complicated example that combines set operations:
Example 2.30 (Combining odds and primes)
Problem: As in Example 2.29, define U := {1, 2, …, 10}, the set P := {2, 3, 5, 7} of primes in U, and the set O := {1, 3, 5, 7, 9} of odd numbers in U. What are the following sets?
1. P ∩ ∼O
2. ∼(P ∪ O)
3. ∼P − ∼O
Solution: For each part, we simply plug in the definitions:
1. The set P ∩ ∼O is the set of all prime numbers that are also not odd.
P ∩ ∼O = {2, 3, 5, 7} ∩ ∼{1, 3, 5, 7, 9} = {2, 3, 5, 7} ∩ {2, 4, 6, 8, 10} = {2}.

2. The set ∼(P ∪ O) consists of everything that is not an element of P ∪ O; that is, ∼(P ∪ O) contains only nonprime even numbers.
∼(P ∪ O) = ∼({2, 3, 5, 7} ∪ {1, 3, 5, 7, 9}) = ∼{1, 2, 3, 5, 7, 9} = {4, 6, 8, 10}.
3. The set ∼P − ∼O consists of all elements of ∼P except those that are elements of ∼O; in other words, all nonprime numbers that aren’t nonodd, or, more simply stated, all nonprime odd numbers:
∼P − ∼O = ∼{2, 3, 5, 7} − ∼{1, 3, 5, 7, 9} = {1, 4, 6, 8, 9, 10} − {2, 4, 6, 8, 10} = {1, 9}.
Of course, we can also combine more than two sets in expressions using these set operators—for example,
A ∪ B ∪ C denotes the set {x : x ∈ A or x ∈ B or x ∈ C}. We can use Venn diagrams to visualize set operations that involve more than two sets; see Figure 2.20 for a few examples.
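The four set operators map closely onto Python's built-in set type. The sketch below recomputes the three parts of Example 2.30. Python has no complement operator, so the helper `complement` (a name chosen here purely for illustration) takes the universe explicitly.

```python
# Universe and sets from Example 2.30.
U = set(range(1, 11))        # {1, 2, ..., 10}
P = {2, 3, 5, 7}             # primes in U
O = {1, 3, 5, 7, 9}          # odd numbers in U

def complement(A, universe=U):
    """Return ∼A := {x ∈ universe : x ∉ A}."""
    return universe - A

part1 = P & complement(O)               # P ∩ ∼O
part2 = complement(P | O)               # ∼(P ∪ O)
part3 = complement(P) - complement(O)   # ∼P − ∼O

print(part1, part2, part3)
```

Here `|`, `&`, and `-` play the roles of ∪, ∩, and set difference. Because the complement is an ordinary function call, the parentheses make the "binds tightest" convention explicit.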
Arithmetic operations on sets
We’ll end this subsection with a few pieces of notation that allow us to perform
mathematical operations on the elements of a set. In Section 2.2.7, we introduced summation and product notation, so that we could write ∑_{i=1}^{n} x_i and ∏_{i=1}^{n} x_i to represent x_1 + x_2 + ··· + x_n and x_1 · x_2 · ··· · x_n. We will also sometimes wish to represent the sum or product of the elements of a particular set (instead of a sequence of values like x_1, x_2, …, x_n). It will also sometimes be handy to refer to the smallest or largest element in a set.
Figure 2.20: Some three-set Venn diagrams: (a) (B ∪ C) − A; (b) (A − B) ∩ C; (c) A ∩ (B ∪ C).
Definition 2.24 (Sum, product, minimum, and maximum of a set)
Let S be a set. Then the expressions ∑_{x∈S} x, ∏_{x∈S} x, min_{x∈S} x, and max_{x∈S} x respectively denote the sum of the elements of S, the product of the elements of S, the smallest element in S, and the largest element in S.
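As a rough illustration of Definition 2.24, each of the four expressions has a direct counterpart in Python's standard library. The set below is an arbitrary choice, and `math.prod` assumes Python 3.8 or later.

```python
import math

S = {3, 5, 7}            # an arbitrary example set of numbers

total = sum(S)           # ∑_{x∈S} x
product = math.prod(S)   # ∏_{x∈S} x
smallest = min(S)        # min_{x∈S} x
largest = max(S)         # max_{x∈S} x

print(total, product, smallest, largest)
```

One Python-specific detail: on an empty set, `sum` returns 0 and `math.prod` returns 1, while `min` and `max` raise a `ValueError`.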
For example, for the set S := {1, 2, 4, 8}, we have that the sum of the elements of S is ∑_{x∈S} x = 15; the product of the elements of S is ∏_{x∈S} x = 64; the minimum of S is min_{x∈S} x = 1; and the maximum of S is max_{x∈S} x = 8.
2.3.3 Comparing Sets
In the same way that two numbers x and y can be compared (we can ask questions like: does x = y? is x ≤ y? is x ≥ y?), we can also compare two sets A and B. Here, we will define the analogous notions of comparison for sets. We’ll begin by defining what it means for two sets to be equal:
Definition 2.25 (Set equality)
Two sets A and B are equal, denoted A = B, if A and B have exactly the same elements. (In other words, sets A and B are not equal if there’s an element x ∈ A but x ∉ B, or if there’s an element y ∈ B but y ∉ A.)
This definition formalizes the idea that order and repetition don’t matter in sets: for example, the sets {4, 4} and {4} are equal because there is no element x ∈ {4, 4} where x ∉ {4} and there is no element y ∈ {4} where y ∉ {4, 4}. This definition also implies that the empty set is unique: any set containing no elements is identical to ∅.
Taking it further: Definition 2.25 is sometimes called the axiom of extensionality. (All of mathematics, including a completely rigorous definition of the integers and all of arithmetic, can be built up from a small number of axioms about sets, including this one.) The point is that the only way to compare two sets is by their “externally observable” properties. For example, the following two sets are exactly the same set: {x : x > 10 is an even prime number}, and {y : y is a country with a 128-letter name}. (Namely, both of these sets are ∅.)
The other common type of comparison between two sets A and B is the subset relationship, which expresses that every element of A is also an element of B:
Definition 2.26 (Subset)
A set A is a subset of a set B, written A ⊆ B, if every x ∈ A is also an element of B. (In other words, A ⊆ B is equivalent to A − B = {}.)
For example, {1, 3, 5} ⊆ {1, 2, 3, 4, 5}, because 1 ∈ {1, 2, 3, 4, 5} and 3 ∈ {1, 2, 3, 4, 5} and 5 ∈ {1, 2, 3, 4, 5}.
Notice that {} ⊆ S for any set S: it’s impossible for there to be an x ∈ {} that satisfies x ∉ S, because there is no element x ∈ {} in the first place; and if there’s no x ∈ {} at all, then there’s certainly no x ∈ {} such that x ∉ S.
Definition 2.27 (Proper subset)
A set A is a proper subset of a set B, written A ⊂ B, if A ⊆ B and A ≠ B. In other words, A ⊂ B whenever A ⊆ B but B ⊈ A.
For example, let A := {1, 2, 3}. Then A ⊆ {1, 2, 3, 4} and A ⊆ {1, 2, 3} and A ⊂ {1, 2, 3, 4}, but A is not a proper subset of {1, 2, 3}.
When A ⊂ B or A ⊆ B, we refer to A as the (possibly proper) subset of B; we can also call B the (possibly proper) superset of A:
Definition 2.28 (Superset and proper superset)
Let A be a set. A set B is a superset of A, written B ⊇ A, if A ⊆ B. The set B is a proper superset of A, written B ⊃ A, if A ⊂ B.
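The comparisons in Definitions 2.25 through 2.28 correspond to comparison operators on Python's built-in sets. The following sketch uses two small example sets and only illustrates the notation.

```python
A = {1, 2, 3}
B = {1, 2, 3, 4}

is_subset = A <= B                      # A ⊆ B
is_proper_subset = A < B                # A ⊂ B
is_superset = B >= A                    # B ⊇ A
is_proper_superset = B > A              # B ⊃ A
self_proper = A < A                     # A ⊂ A never holds
equal_despite_repeats = {4, 4} == {4}   # order and repetition don't matter
empty_is_subset = set() <= A            # {} ⊆ S for every set S

print(is_subset, is_proper_subset, self_proper, equal_despite_repeats)
```

Note that `A <= A` is true while `A < A` is false, matching the distinction between ⊆ and ⊂.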
Figure 2.21 illustrates subsets, proper subsets, supersets, and proper supersets.
Figure 2.21: Two sets satisfying A ⊆ B and, equivalently, B ⊇ A. The sets satisfy A ⊂ B (and B ⊃ A) if there’s at least one element in the darker shaded region, and they satisfy A = B if there’s no element in that region.
Here’s an example involving these relationships:
Example 2.31 (Subsets and supersets)
Problem: Let A := {3, 4, 5} and B := {4, 5, 6}. Identify a set C satisfying the following conditions, or state that the requirement is impossible to achieve and explain why.
1. A ⊆ C and C ⊇ B
2. A ⊇ C and C ⊆ B
3. A ⊇ C and C ⊇ B
Solution: The first two conditions are achievable, but the third isn’t.
1. Let C := {3, 4, 5, 6}; both A and B are (proper) subsets of this set.
2. We can choose C := {4, 5}, because {4, 5} ⊆ A and {4, 5} ⊆ B.
3. It’s impossible to satisfy {3, 4, 5} ⊇ C and C ⊇ {4, 5, 6} simultaneously. If 6 ∈ C then we don’t have {3, 4, 5} ⊇ C, but if 6 ∉ C we don’t have C ⊇ {4, 5, 6}. We can’t have 6 ∈ C and we can’t have 6 ∉ C, so we’re stuck with an impossibility.
We’ll end the section with one last piece of terminology. Two sets A and B are called disjoint if they have no elements in common:
Definition 2.29 (Disjoint sets)
Two sets A and B are disjoint if there is no x ∈ A where x ∈ B; in other words, if A ∩ B = {}.
For example, the sets {1, 2, 3} and {4, 5, 6} are disjoint because {1, 2, 3} ∩ {4, 5, 6} = {}, but the sets {2, 3, 5, 7} and {2, 4, 6, 8} are not disjoint because 2 is an element of both. See Figure 2.22 for a diagram of two disjoint sets.
Figure 2.22: Disjoint sets A and B.
2.3.4 Sets of Sets
Just as we can have a list of lists in a programming language like Scheme or Java, we can also consider a set that has sets as its elements. (After all, sets are just collections of objects, and one kind of object that can be collected is a set itself.)
Example 2.32 (Set of sets of numbers)
The set A := {Z, R, Q} of the sets defined in Section 2.2.2 is itself a set. This set has cardinality |A| = 3, because A has three distinct elements, namely Z and R and Q. (Of course, all three of these elements of A are themselves sets, and each of these three elements of A has infinite cardinality.)

Example 2.33 (A set of smaller sets)
Consider the set B := {{}, {1, 2, 3}}. Note that |B| = 2: B has two elements, namely {} and {1, 2, 3}. Therefore {} ∈ B because {} is one of the two elements of B. However, 1 ∉ B, because 1 is not one of the two elements of B (that is, 1 ≠ {} and 1 ≠ {1, 2, 3}), although 1 is an element of one of the two elements of B.
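Example 2.33 can be mirrored in Python with one caveat: a built-in `set` can only contain hashable objects, so the inner sets below are written as `frozenset` values (immutable sets). This is a detail of Python, not of the mathematics.

```python
# The set B := {{}, {1, 2, 3}}, with frozensets standing in for the inner sets.
B = {frozenset(), frozenset({1, 2, 3})}

size = len(B)                                              # |B| = 2
empty_in_B = frozenset() in B                              # {} ∈ B
one_in_B = 1 in B                                          # 1 ∉ B
one_in_some_element = any(1 in element for element in B)   # 1 ∈ {1, 2, 3}

print(size, empty_in_B, one_in_B, one_in_some_element)
```

The last two lines capture the distinction drawn in the example: 1 is not an element of B, but it is an element of one of B's elements.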
There are two important types of sets of sets that we will define in the remainder of this section, both derived from a base set S.
Partitions
The first interesting use of a set of sets is to form a partition of S into a set of disjoint subsets whose union is precisely S.
Definition 2.30 (Partition)
A partition of a set S is a set {A_1, A_2, …, A_k} of nonempty sets A_1, A_2, …, A_k, for some k ≥ 1, such that:
• A_1 ∪ A_2 ∪ ··· ∪ A_k = S; and
• for any distinct i, j ∈ {1, …, k}, the sets A_i and A_j are disjoint.
A useful way of thinking about a partition of a set S is that we’ve divided S up into several (nonoverlapping) subcategories. See Figure 2.23 for an illustration of a partition of a set S. Here’s an example of one set partitioned many different ways:
Figure 2.23: A visualization of partitioning a set S into disjoint nonempty subsets whose union equals S itself. (a) The set S. (b) S partitioned into 5 subsets.
Example 2.34 (Several partitions of the same set)
Consider the set S := {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Here are some different ways to partition S:
{{1, 3, 5, 7, 9}, {2, 4, 6, 8, 10}} (evens and odds)
{{1, 2, 3, 4, 5, 6, 7, 8, 9}, {10}} (one- and two-digit numbers)
{{1, 4, 7, 10}, {2, 5, 8}, {3, 6, 9}} (x mod 3 = 0 and = 1 and = 2)
{{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}} (all separate)
{{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}} (all together)
In each case, each of the 10 numbers from S is in one, and only one, of the listed sets (and no elements not in S appear in any of the listed sets).
It’s worth noting that the last two ways of partitioning S in Example 2.34 genuinely are partitions. For the partition {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}}, we have k = 10 different disjoint sets whose union is precisely S. For the partition {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}}, we have k = 1: there’s only one “subcategory” in the partitioning, and every x ∈ S is indeed contained in one (the only one!) of these “subcategories.” (And no two distinct subcategories overlap, because there aren’t even two distinct subcategories at all!)
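The conditions of Definition 2.30 can be checked mechanically. The function below is a minimal sketch: the name `is_partition` and the choice to pass the blocks as a list of Python sets are assumptions made for illustration.

```python
from itertools import combinations

def is_partition(blocks, S):
    """Check Definition 2.30: nonempty, pairwise disjoint blocks whose union is S."""
    blocks = list(blocks)
    if not blocks or any(len(block) == 0 for block in blocks):
        return False          # need k ≥ 1 blocks, and every block must be nonempty
    if set().union(*blocks) != set(S):
        return False          # the union of the blocks must be exactly S
    # Every pair of distinct blocks must have no element in common.
    return all(a.isdisjoint(b) for a, b in combinations(blocks, 2))

S = set(range(1, 11))
print(is_partition([{1, 3, 5, 7, 9}, {2, 4, 6, 8, 10}], S))
print(is_partition([S], S))
print(is_partition([{1, 2}, {2, 3}], {1, 2, 3}))
```

The first two calls correspond to the "evens and odds" and "all together" partitions of Example 2.34; the third fails because the two blocks share the element 2.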

Taking it further: One way to helpfully organize a massive set S of data—for example, students or restaurants or web pages—is to partition S into small clusters. The idea is that two elements in the same cluster will be “similar,” and two entities in different clusters will be “dissimilar.” (So students might be clustered by their majors or dorms; restaurants might be clustered by their cuisine or geography; and web pages might be clustered based on the set of words that appear in them.) For more about clustering, see the discussion on p. 234.
Power sets
Our second important type of a set of sets is the power set of a set S, which is the set of all subsets of S:
Definition 2.31 (Power set)
The power set of a set S, written P(S), denotes the set of all subsets of S: that is, a set A is an element of P(S) precisely if A ⊆ S. In other words, P(S) := {A : A ⊆ S}.
Here are some simple examples, and one example that’s a bit more complicated:
Example 2.35 (Some small power sets)
Here are the power sets of {0}, {0, 1}, and {0, 1, 2}:
P({0}) = {{}, {0}}
P({0, 1}) = {{}, {0}, {1}, {0, 1}}
P({0, 1, 2}) = {{}, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}}
A quick check for the second of these examples: there are four elements in P({0, 1}): the empty set, two singleton sets {0} and {1}, and the two-element set {0, 1} itself, because {0, 1} is a subset of itself ({0, 1} ⊆ {0, 1}).
Example 2.36 (P(P({0, 1})))
The power set of the power set of {0, 1} is
P(P({0, 1}))
= P({{}, {0}, {1}, {0, 1}})
= { {},   (1 set with 0 elements)
    {{}}, {{0}}, {{1}}, {{0, 1}},   (4 sets with 1 element)
    {{}, {0}}, {{}, {1}}, {{}, {0, 1}}, {{0}, {1}}, {{0}, {0, 1}}, {{1}, {0, 1}},   (6 sets with 2 elements)
    {{0}, {1}, {0, 1}}, {{}, {1}, {0, 1}}, {{}, {0}, {0, 1}}, {{}, {0}, {1}},   (4 sets with 3 elements)
    {{}, {0}, {1}, {0, 1}}   (1 set with 4 elements)
  }.
The power set of S is also occasionally denoted by 2^S, in part because (as we’ll see in Chapter 9) |P(S)| is 2^{|S|}. The name “power set” also comes from this fact: the cardinality of P(S) is 2 to the power of |S|.
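One possible way to build a power set in Python is to collect the subsets of each size using `itertools.combinations`. The function name `power_set` is an assumption for illustration, and the subsets are returned as `frozenset` values so that they can themselves be elements of a set.

```python
from itertools import combinations

def power_set(S):
    """Return P(S) as a set of frozensets, one per subset of S."""
    elements = list(S)
    return {
        frozenset(subset)
        for size in range(len(elements) + 1)
        for subset in combinations(elements, size)
    }

P1 = power_set({0, 1})   # P({0, 1})
P2 = power_set(P1)       # P(P({0, 1}))

print(len(P1), len(P2))
```

The two sizes agree with Examples 2.35 and 2.36 and with the rule |P(S)| = 2^{|S|}: the first power set has 2^2 elements and the second has 2^4.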
Computer Science Connections
Set Building in Languages
Programming languages like Python, Scheme, or ML make heavy use of lists and also allow higher-order functions (functions that take other functions as parameters); if you have experience programming in these languages, the set-construction notions from Section 2.3.1 may seem familiar. These mechanisms for building sets in mathematical notation closely parallel built-in functionality for building lists in programs in these languages:
• build a list from scratch by writing out its elements.
• build a list from an existing list using the function filter, which takes two parameters (a list U, corresponding to the universe, and a function P) and returns a new list containing all x ∈ U for which P(x) is true.
• build a list from an existing list using the function map, which takes two parameters (a list U and a function f) and returns a new list containing f(x) for every element x of U.
Python has filter and map built in; some versions of Scheme have filter and map either built in or in a standard library. In Python, there’s even an explicit list comprehension syntax to create a list without using filter or map, which even more closely parallels the set-abstraction notation from Definitions 2.18 and 2.47. Here are some examples:
In set notation:
    L = {1, 2, 4, 8, 16}
    M = {x ∈ L : x < 10}
    N = {x ∈ L : x is even}
    O = {x^2 : x ∈ L}
    P = {x^2 : x ∈ L and x is even}
    Q = {x ∈ L : False}

    L = {1, 2, 4, 8, 16}
    M = {1, 2, 4, 8}
    N = {2, 4, 8, 16}
    O = {1, 4, 16, 64, 256}
    P = {4, 16, 64, 256}
    Q = {}
In Python:
    def even(x): return x % 2 == 0
    def square(x): return x**2
    def false(x): return False
    L = [1, 2, 4, 8, 16]
    M = [x for x in L if x < 10]
    # In Python 3, filter and map return iterators, so list(...) makes them lists.
    N = list(filter(even, L))
    O = list(map(square, L))
    P = [square(x) for x in L if even(x)]
    Q = [x for x in L if false(x)]

    >>> L
    [1, 2, 4, 8, 16]
    >>> M
    [1, 2, 4, 8]
    >>> N
    [2, 4, 8, 16]
    >>> O
    [1, 4, 16, 64, 256]
    >>> P
    [4, 16, 64, 256]
    >>> Q
    []
In Scheme:
    (define even?
      (lambda (x) (= (modulo x 2) 0)))
    (define square (lambda (x) (* x x)))
    (define false? (lambda (x) #f))
    (define L (list 1 2 4 8 16))
    ;;; no simple Scheme is analogous to M in Python
    (define N (filter even? L))
    (define O (map square L))
    (define P (map square (filter even? L)))
    (define Q (filter false? L))

    > L
    (1 2 4 8 16)
    > N
    (2 4 8 16)
    > O
    (1 4 16 64 256)
    > P
    (4 16 64 256)
    > Q
    ()
Unlike sets, the map function can cause repetitions in the stored list: map(square, L) where L contains both 2 and −2 will lead to 4 being present twice. (Some languages, including Python, also have syntax for sets instead of lists, creating an unordered, duplicate-free collection of elements.)
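To make the contrast between lists and sets concrete, here is a small sketch using Python's set comprehension syntax, which looks like a list comprehension but with curly braces. The input list is an arbitrary example containing both 2 and −2.

```python
L = [-2, 1, 2, 4]

squares_list = [x**2 for x in L]   # a list: keeps order and repetitions
squares_set = {x**2 for x in L}    # a set: unordered and duplicate-free

print(squares_list)
print(len(squares_list), len(squares_set))
```

The list contains the value 4 twice (once from −2 and once from 2), while the set contains it only once, just as in set-abstraction notation.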
While the technical details are a bit different, the basic idea underlying map forms half of a programming model called MapReduce that’s become increasingly popular for processing very large datasets.4 MapReduce is a distributed-computing framework that processes data using two user-specified functions: a “map” function that’s applied to every element of the dataset, and a “reduce” function that collects together the outputs of the map function. Implementations of MapReduce allow these computations to occur in parallel, on a cluster of machines, vastly speeding processing time.
4 Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
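The map-then-reduce pattern can be imitated on a single machine with Python's built-in `map` and `functools.reduce`. The word-counting sketch below only illustrates the shape of the idea; it is not the distributed framework described in the cited paper, and the names `map_fn` and `reduce_fn` and the toy input are assumptions made for illustration.

```python
from functools import reduce

documents = ["a set is a collection", "a list is ordered"]   # toy input

def map_fn(doc):
    """The "map" step: turn one document into a dictionary of word counts."""
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_fn(total, counts):
    """The "reduce" step: merge one mapped result into the running totals."""
    merged = dict(total)
    for word, n in counts.items():
        merged[word] = merged.get(word, 0) + n
    return merged

# Apply map_fn to every document, then fold the results together.
word_counts = reduce(reduce_fn, map(map_fn, documents), {})
print(word_counts)
```

Because each call to `map_fn` looks at only one document, the map step could in principle run on many machines at once, which is the source of the speedup mentioned above.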

Computer Science Connections
Clustering
Partitioning a set is a task that arises frequently in various applications, usually with a goal like clustering a large collection of data points. The goal is that elements placed into the same cluster should be “very similar,” and elements in different clusters should be “not very similar.”5 Why might we want to perform clustering on a data set? For example, we might try to cluster a set N of news articles into “topics” C_1, C_2, …, C_k, where any two articles x, y that are both in the same cluster C_i are similar (say, with respect to the words contained within them), but if x ∈ C_i and y ∈ C_j with j ≠ i, then x and y are not very similar. Or we might try to cluster the people in a social network into communities, so that a person in community c has a large fraction of her friends who are also in community c. Understanding these clusters (and understanding what properties of a data point “cause” it to be in one cluster rather than another) can help reveal the structure of a large data set, and can also be useful in building a system to react to new data. Or we might want to use clusters for anomaly detection: given a large data set (for example, of user behavior on a computer system, or the trajectory of a car on a highway), we might be able to identify those data points that do not seem to be part of a normal pattern. These data points may be the result of suspicious behavior that’s worth further investigation (or that might trigger a warning to the driver of the car that he or she has strayed from a lane).
5 You can read more about clustering, and clustering algorithms, in a data-mining book like Jure Leskovec, Anand Rajaraman, and Jeff Ullman. Mining of Massive Datasets. Cambridge University Press, 2nd edition, 2014.
Here’s one (vastly simplified) example application for clustering: speech processing. Software systems that interact with users as they speak in natural language (that is, as they talk in English) have developed with rapidly increasing quality over the last decade. Speech recognition (taking an audio input, and identifying what English word is being spoken from the acoustic properties of the audio signal) turns out to be a very challenging problem. Figure 2.24 illustrates some of the reasons for the difficulty, showing a spectrogram generated by the Praat software tool.6 In a spectrogram, the x-axis is time, and the y-axis is frequency; a darkly shaded frequency f at time t shows that the speech at time t had an intense component at frequency f. But we can partition a training set of many speakers saying a collection of common words into subsets based on which word was spoken, and then use the average acoustic properties of the utterances to guess which word was spoken. Figure 2.25 shows the frequencies of the two lowest formants (frequencies of very high intensity) in the utterances of a half-dozen college students pronouncing the words bat and beat. First, the formants’ frequencies are shown unclustered; second, they are shown partitioned by the pronounced word. The centroid of each cluster (the center of mass of the points) can serve as a prototypical version of each word’s acoustics.
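The centroid idea can be sketched in a few lines of Python. The formant values below are invented purely for illustration (they are not the measurements plotted in Figure 2.25), and guessing a word by its nearest centroid is one simple choice among many.

```python
# Hypothetical (F1, F2) formant measurements in Hz, labeled by the spoken word.
# These numbers are made up for illustration; they are not the book's data.
training = [
    ("beat", (280.0, 2250.0)), ("beat", (300.0, 2300.0)), ("beat", (320.0, 2350.0)),
    ("bat", (660.0, 1700.0)), ("bat", (700.0, 1750.0)), ("bat", (740.0, 1800.0)),
]

# Partition the training points into clusters, one per word.
clusters = {}
for word, point in training:
    clusters.setdefault(word, []).append(point)

def centroid(points):
    """Center of mass: the componentwise average of the points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

centroids = {word: centroid(points) for word, points in clusters.items()}

def guess_word(point):
    """Guess the word whose centroid is closest (in squared Euclidean distance)."""
    def sq_dist(c):
        return sum((p - q) ** 2 for p, q in zip(point, c))
    return min(centroids, key=lambda word: sq_dist(centroids[word]))

print(centroids)
print(guess_word((310.0, 2200.0)))
```

Grouping the points by label is exactly a partition of the training set, and each centroid then acts as the prototype for its block.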
Figure 2.24: A spectrogram generated by Praat of me pronouncing the sentence “I prefer agglomerative clustering.” There are essentially no acoustic correlates to the divisions between words, which is one reason speech recognition is so difficult.
6 Paul Boersma and David Weenink. Praat: doing phonetics by computer. http://www.praat.org, 2012. Version 5.3.22.
Figure 2.25: The frequencies of the first two formants in utterances by six speakers saying the words beat and bat.

2.3.5 Exercises
Let H := {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f} denote the set of hexadecimal digits.
2.86 Is 6 ∈ H?
2.87 Is h ∈ H?
2.88 Is a70e ∈ H?
2.89 What is |H|?
Let S := {0+0, 0+1, 1+0, 1+1, 0·0, 0·1, 1·0, 1·1} be the set of results of adding any two bits together or multiplying any two bits together.
2.90 Which of 0, 1, 2, and 3 are elements of S?
2.91 What is |S|?
Let T := {n ∈ Z : 0 ≤ n ≤ 20 and n mod 2 = n mod 3}. Let H := {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f} and S := {0+0, 0+1, 1+0, 1+1, 0·0, 0·1, 1·0, 1·1}, as in the previous blocks of exercises.
2.92 Identify at least one element of H that is not an element of T.
2.93 Identify at least one element of T that is not an element of H.
2.94 Identify at least one element of T that is not an element of S.
2.95 Identify at least one element of S that is not an element of T.
2.96 What is |T|?
Rewrite the following sets by exhaustively listing their elements:
2.97 {n ∈ Z : 0 ≤ n ≤ 20 and n mod 5 = n mod 7}
2.98 {n ∈ Z : 10 ≤ n ≤ 30 and n mod 5 = n mod 7}
Let A := {1, 3, 4, 5, 7, 8, 9} and let B := {0, 4, 5, 9}. What are the following sets?
2.99 A ∩ B
2.100 A ∪ B
2.101 A − B
2.102 B − A
Assume the universe is the set U := {0, 1, 2, …, 9}. Define C := {0, 3, 6, 9}, and let A := {1, 3, 4, 5, 7, 8, 9} and B := {0, 4, 5, 9} as before. What are the following sets?
2.103 ∼B
2.104 A ∪ ∼C
2.105 ∼C − ∼B
2.106 C − ∼C
2.107 ∼(C − ∼A)
2.108 In general, A − B and B − A do not denote the same set. (See Figure 2.26.) But your friends Evan and Yasmin wander by and tell you the following. Let E denote the set of CS homework questions that Evan has not yet solved. Let Y denote the set of CS homework questions that Yasmin has not yet solved. Evan and Yasmin claim that E − Y = Y − E. Is this possible? If so, under what circumstances? If not, why not? Justify your answer.
Let D and E be arbitrary sets. For each set given below, indicate which of the following statements is true:
• the given set must be a subset of D (for every choice of D and E);
• the given set may be a subset of D (for certain choices of D and E); or
• the given set cannot be a subset of D (for any choice of D and E).
If you answer “must” or “cannot,” justify your answer (1–2 sentences). If you answer “may,” identify an example D_1, E_1 for which the given set is a subset of D_1, and an example D_2, E_2 for which the given set is not a subset of D_2.
2.109 D ∪ E
2.110 D ∩ E
2.111 D − E
2.112 E − D
2.113 ∼D
Let F := {1,2,4,8}, let G := {1,3,9}, and let H := {0,5,6,7}. Let U := {0,1,2,…,9} be the universe. Which of the following pairs of sets are disjoint?
2.114 F and G
2.115 G and ∼F
2.116 F ∩ G and H
2.117 H and ∼H
Let S and T be two sets, with n = |S| and m = |T|. For each of the following sets, state the smallest cardinality that the given set can have. Give examples of the minimum-sized sets for each part. (You should give a family of examples; that is, describe a smallest-possible set for any values of n and m.)
2.118 S ∪ T
2.119 S ∩ T
2.120 S − T
Repeat the last three exercises for the largest set: for two sets S and T with n = |S| and m = |T|, state the largest cardinality that the given set can have. Give a family of examples of the largest-possible sets for each part.
2.121 S ∪ T
2.122 S ∩ T
2.123 S − T
Figure 2.26: In general, the sets A − B and B − A are different.
In a variety of CS applications, it’s useful to be able to compute the similarity of two sets A and B. (More about one of these applications, collaborative filtering, below.) There are a number of different ideas of how to measure set similarity, all based on the intuition that the larger |A ∩ B| is, the more similar the sets A and B are. Here are two basic measures of set similarity that are sometimes used:
• the cardinality measure: the similarity of A and B is |A ∩ B|.
• the Jaccard coefficient:7 the similarity of A and B is |A ∩ B| / |A ∪ B|.
The Jaccard coefficient is named after the Swiss botanist Paul Jaccard, from around the turn of the 20th century, who was interested in how similar or different the distributions of various plants were in different regions.
7 P. Jaccard. Distribution de la flore alpine dans le bassin des dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
2.124 Let A := {chocolate, hazelnut, cheese}; B := {chocolate, cheese, cardamom, cherries}; and C := {chocolate}. Compute the similarities of each pair of these sets using the cardinality measure.
2.125 Repeat the previous exercise for the Jaccard coefficient.
Suppose we have a collection of sets A_1, A_2, …, A_n. Consider the following claim:
Claim: Suppose that the set A_v is the most similar set to the set A_u in this collection (aside from A_u itself). Then A_u is necessarily the set that is most similar to A_v (aside from A_v itself).
2.126 Decide whether you think this claim is true for the cardinality measure of set similarity, and justify your answer. (That is, argue why it must be true, or give an example showing that it’s false.)
2.127 Repeat the previous exercise for the Jaccard coefficient.
Taking it further: A collaborative filtering system, or recommender system, seeks to suggest new products to a user u on the basis of the similarity of u’s past behavior to the past behavior of other users in the system. Collaborative filtering systems are mainstays of many popular commercial online sites (like Amazon or Netflix, for example). One common approach to collaborative filtering is the following. Let U denote the set of users of the system, and for each user u ∈ U, define the set S_u of products that u has purchased. To make a product recommendation to a user u ∈ U:
(i) Identify the user v ∈ U − {u} such that S_v is the set “most similar” to S_u.
(ii) Recommend the products in S_v − S_u to user u (if any exist).
This approach is called nearest-neighbor collaborative filtering, because the v found in step (i) is the other person closest to u. The measure of set similarity used in step (i) is all that’s left to decide, and either cardinality or the Jaccard coefficient is a reasonable choice. The idea behind the Jaccard coefficient is that the fraction of agreement matters more than the total amount of agreement: a {Cat’s Cradle, Catch 22} purchaser is more similar to a {Slaughterhouse Five, Cat’s Cradle} purchaser than someone who bought every book Amazon sells.
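Steps (i) and (ii) can be sketched directly with Python sets. Everything named below (the functions `cardinality`, `jaccard`, and `recommend`, the user labels, and the purchase sets) is hypothetical and chosen only to illustrate the procedure. The sketch assumes at least two users, and ties in step (i) are broken arbitrarily by `max`.

```python
def cardinality(A, B):
    """Cardinality measure: |A ∩ B|."""
    return len(A & B)

def jaccard(A, B):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (taken to be 0 if both sets are empty)."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def recommend(u, purchases, similarity=jaccard):
    """Nearest-neighbor recommendation for user u.

    purchases maps each user to the set S_u of products that user has bought.
    """
    others = [v for v in purchases if v != u]
    # Step (i): find the other user v whose set S_v is most similar to S_u.
    v = max(others, key=lambda w: similarity(purchases[u], purchases[w]))
    # Step (ii): recommend the products in S_v − S_u.
    return purchases[v] - purchases[u]

# Hypothetical users and purchases, invented for illustration.
purchases = {
    "u1": {"Cat's Cradle", "Catch 22"},
    "u2": {"Slaughterhouse Five", "Cat's Cradle"},
    "u3": {"Cat's Cradle", "Catch 22", "Slaughterhouse Five",
           "Dune", "Emma", "Ulysses", "Beloved"},
}

print(recommend("u1", purchases))                # nearest neighbor by Jaccard
print(recommend("u1", purchases, cardinality))   # nearest neighbor by cardinality
```

With these made-up sets, the two measures disagree about who is closest to u1: the Jaccard coefficient prefers u2 (1/3 versus 2/7), while the cardinality measure prefers the heavy purchaser u3 (2 versus 1), which is the effect described in the passage above.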
For each of the following claims, decide whether you think the statement is true for all sets of integers A, B, C. If it’s true for every A, B, C, then explain why. (A Venn diagram may be helpful.) If it’s not true for every A, B, C, then provide an example for which it does not hold.
2.128 A ∩ B = ∼(∼A ∪ ∼B)
2.129 A ∪ B = ∼(∼A ∩ ∼B)
2.130 (A − B) ∪ (B − C) = (A ∪ B) − C
2.131 (B − A) ∩ (C − A) = (B ∩ C) − A
2.132 List all of the different ways to partition the set {1, 2, 3}.
Consider the table of distances shown in Figure 2.27 for a set P = {Alice, …, Frank} of people. Suppose we partition P into subsets S_1, …, S_k. Define the intracluster distance as the largest distance between two people who are in the same cluster:
    max_{i} [ max_{x,y ∈ S_i} (distance between x and y) ].
Define the intercluster distance as the smallest distance between two people who are in different clusters:
    min_{i, j≠i} [ min_{x ∈ S_i, y ∈ S_j} (distance between x and y) ].
Figure 2.27: Some distances between people.
             Alice   Bob   Charlie   David   Eve   Frank
    Alice     0.0    1.7     1.2      0.8    7.2    2.9
    Bob       1.7    0.0     4.3      1.1    4.3    3.4
    Charlie   1.2    4.3     0.0      7.8    5.2    1.3
    David     0.8    1.1     7.8      0.0    2.1    1.9
    Eve       7.2    4.3     5.2      2.1    0.0    1.9
    Frank     2.9    3.4     1.3      1.9    1.9    0.0
In each of the following questions, partition P into . . .
2.133 . . . 3 or fewer subsets so that the intracluster distance is ≤ 2.0.
2.134 . . . subsets S_1, …, S_k so the intracluster distance is as small as possible. (You choose k.)
2.135 . . . subsets S_1, …, S_k so the intercluster distance is as large as possible. (Again, you choose k.)
2.136 Define S := {1, 2, …, 100}. Let W := {x ∈ S : x mod 2 = 0}, H := {x ∈ S : x mod 3 = 0}, and O := S − H − W. Is {W, H, O} a partition of S?
What is the power set of each of the following sets?
2.137 {1, a}
2.138 {1}
2.139 {}
2.140 P(1)

2.4 Sequences, Vectors, and Matrices: Ordered Collections
Watch out for the fellow who talks about putting things in order! Putting things in order always means getting other people under your control.
Denis Diderot (1713–1784) Supplément au voyage de Bougainville (1796)
In Section 2.3, we introduced sets: collections of objects in which the order of those objects doesn’t matter. In many circumstances, though, order does matter: if a Java method takes two parameters, then swapping the order of those parameters will usually change what the method does; if there’s an interesting site at longitude x and latitude y, then showing up at longitude y and latitude x won’t do. In this section, we turn to ordered collections of objects, called sequences. A summary of the notation related to sequences is given in Figure 2.29.
Definition 2.32 (Sequence, list, and tuple)
A sequence (also known as a list or tuple) is an ordered collection of objects, typically called components or entries. When the number of objects in the collection is 2, 3, 4, or n, the sequence is called an (ordered) pair, triple, quadruple, or n-tuple, respectively.
We’ll write a sequence inside angle brackets, as in ⟨Northfield, Minnesota⟩ or ⟨0, 1⟩. (Some people use parentheses instead of angle brackets, as in (128, 128, 0) instead of ⟨128, 128, 0⟩.) For two sets A and B, we frequently will refer to the set of ordered pairs whose two elements, in order, come from A and B:
Definition 2.33 (Cartesian product)
The Cartesian product of two sets A and B, denoted A × B, is the set A × B = {⟨a, b⟩ : a ∈ A and b ∈ B} containing all ordered pairs where the first component comes from A and the second from B.
For example, {0, 1} × {2, 3} is the set {⟨0, 2⟩, ⟨0, 3⟩, ⟨1, 2⟩, ⟨1, 3⟩}. We can also view any particular cell in a 2-dimensional grid (like a cell in a spreadsheet, or a square on a chess board) as a sequence:
The Cartesian product is named after René Descartes, the 17th-century French philosopher/mathematician. (The English adjectival form uses only the cartes part of his last name Descartes.)
Example 2.37 (Chess positions)
A chess board is an 8-by-8 grid. Chess players use what’s called “Algebraic notation” to refer to the columns (which they call files) using the letters a through h, and they refer to the rows (which they call ranks) using the numbers 1 through 8. (See Figure 2.28.)
Thus the square containing the white queen Q is ⟨d, 1⟩; the full set of squares of the chess board is {a, b, c, d, e, f, g, h} × {1, 2, 3, 4, 5, 6, 7, 8}; and the squares containing knights (the N pieces, both white and black) are {⟨b, 1⟩, ⟨g, 1⟩, ⟨b, 8⟩, ⟨g, 8⟩}. The set of squares with knights could also be written as {b, g} × {1, 8}.
Figure 2.28: The squares of a chess board, written using Algebraic notation.
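As a quick sketch of Definition 2.33, `itertools.product` from Python's standard library enumerates the ordered pairs of a Cartesian product; Python tuples stand in for the angle-bracket pairs. The variable names are chosen for illustration and follow Example 2.37.

```python
from itertools import product

files = "abcdefgh"        # columns of the board
ranks = range(1, 9)       # rows of the board

# A × B = {⟨a, b⟩ : a ∈ A and b ∈ B}
squares = set(product(files, ranks))
knights = set(product("bg", [1, 8]))   # {b, g} × {1, 8}

print(len(squares))
print(sorted(knights))
```

Order matters inside each pair: `("d", 1)` is a square of the board, but `(1, "d")` is not an element of `squares`.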
sequence/ordered tuple
Cartesian product
the set of all n-element sequences of S
⟨a1,a2,…,an⟩
A × B := {⟨a, b⟩ : a ∈ A and b ∈ B} Sn :=S×S×···×S(ntimes) x∈Rn􏰟n 2
∥x∥ := ∑i=1 xi
x+y := ⟨x1 +y1,x2 +y2,…,xn +yn⟩ ax := ⟨a · x1 , a · x2 , . . . , a · xn ⟩
x • y : = ∑ ni = 1 x i · y i
M ∈ Rn×m
1 0 … 0
0 1 … 0 where I = . . … .
a matrix I ∈ R
n×n
00…1 a matrix N ∈ Rn×m where Ni,j := α · Mi,j
a matrix N ∈ Rn×m where Ni,j := Mi,j + Mi′,j
a matrix M ∈ Rn×p where Mi,j = ∑mk=1 Ai,kBk,j
a matrix M−1 ∈ Rn×n where MM−1 = I (if any such M−1 exists)
vector
vector length, for x ∈ Rn
vector addition, for vectors x, y ∈ Rn scalar product, for a ∈ R and x ∈ Rn dot product, for vectors x, y ∈ Rn matrix
identity matrix
scalar multiplication, for α ∈ R and M ∈ Rn×m matrix addition, for M, M′ ∈ Rn×m
matrix multiplication, for A ∈ Rn×m and B ∈ Rm×p matrix inverse, for M ∈ Rn×n
Here’s another example, about color representation on computers:
Example 2.38 (RGB color values)
The RGB color space represents colors as ordered triples, where each component is an element of {0, 1, . . . , 255}. RGB stands for red–green–blue; the three components of a color c, respectively, represent how red, how green, and how blue the color c is. Formally, a color c is an element of {0, 1, . . . , 255} × {0, 1, . . . , 255} × {0, 1, . . . , 255}.
The order of these components matters; for example, the color ⟨0, 0, 255⟩ is pure blue, while the color ⟨255, 0, 0⟩ is pure red. See Figure 2.30 for a few examples.
Taking it further: An annoying pedantic point: we are being sloppy with notation in Example 2.38;
we only defined the Cartesian product for two sets, so when we write S × S × S we “must” mean
either S × (S × S) or (S × S) × S. We’re going to ignore this issue, and simply write statements like ⟨0,1,1⟩ ∈ {0,1}×{0,1}×{0,1}—eventhoughweoughttoinsteadbewritingstatementslike ⟨0,⟨1,1⟩⟩ ∈ {0,1}×({0,1}×{0,1}).(Asimilarshorthandshowsupinprogramminglanguages likeScheme,wherepairing—“cons”ing—asingleelement3withalist(2 1)yieldsthethree-elementlist (3 2 1),ratherthanthetwo-elementpair(3 . (2 1)),wherethesecondelementisatwo-elementlist.)
Beyond the “obvious” sequences like Examples 2.37 and 2.38, we’ve also already seen some definitions that don’t seem to involve sequences, but implicitly are about ordered tuples of values. One example is the rational numbers (see Section 2.2.2):
Example 2.39 (Rational numbers as sequences)
We can define the rational numbers (also known as fractions) as the set Q := Z × Z>0. Under this view, a rational number would be represented as a pair ⟨n, d⟩ ∈ Z × Z>0, with a numerator n and a denominator d.
Figure 2.29: A summary of notation for sequences, vectors, and matrices.
    violet   ⟨128, 0, 128⟩
    indigo   ⟨74, 0, 130⟩
    blue     ⟨0, 0, 255⟩
    green    ⟨0, 255, 0⟩
    yellow   ⟨255, 255, 0⟩
    orange   ⟨255, 128, 0⟩
    red      ⟨255, 0, 0⟩
Figure 2.30: A few RGB values of colors.
For example, the fractions 1/2 and 202/808 would be represented as ⟨1, 2⟩ and ⟨202, 808⟩, respectively. (To flesh out the details of this representation, we also have to consider reducing fractions to lowest terms, to establish the equivalence of fractions like ⟨2, 4⟩ and ⟨1, 2⟩. In Example 8.36, we’ll formalize this equivalence.)

We will often consider sequences of elements that are all drawn from the same set, and there is special notation for such a sequence:
Definition 2.34 (Sequences of elements from the same set)
For a set S and a positive integer n, we write S^n to denote
S^n := S × S × … × S (n times).
Thus S^n denotes the set of all sequences of length n where each component of the sequence is an element of the set S. For example, the RGB values from Example 2.38 are elements of {0, 1, …, 255}^3, and {0, 1}^3 denotes the set
{⟨0,0,0⟩, ⟨0,0,1⟩, ⟨0,1,0⟩, ⟨0,1,1⟩, ⟨1,0,0⟩, ⟨1,0,1⟩, ⟨1,1,0⟩, ⟨1,1,1⟩}.
This notation also lets us write R × R, called the Cartesian plane, as R^2, the way you might have written it in a high school algebra class. (See Figure 2.31.)
Figure 2.31: Three points in R^2. The first component represents the x-axis (horizontal) position; the second component represents the y-axis (vertical) position.
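When S is finite, the set S^n from Definition 2.34 can be listed mechanically. Here is a minimal Python sketch, not taken from the text: the helper name cartesian_power is our own, and it relies on the standard library’s itertools.product to build every length-n sequence over S.

```python
from itertools import product

def cartesian_power(s, n):
    """Return S^n as a list of n-tuples, for a finite collection s and a positive integer n."""
    # Sorting only fixes a predictable listing order; S^n itself is an unordered set.
    return list(product(sorted(s), repeat=n))

bits = cartesian_power({0, 1}, 3)
print(bits)        # the eight triples of {0, 1}^3, from (0, 0, 0) to (1, 1, 1)
print(len(bits))   # 8, which is 2**3
```

In general |S^n| = |S|^n, so this kind of exhaustive listing is only practical for small S and n.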
Taking it further: René Descartes, the namesake of the Cartesian product and the Cartesian plane, was a major contributor in mathematics, particularly geometry. But Descartes is probably most famous as
a philosopher, for the cogito ergo sum (“I think therefore I am”) argument, in which Descartes—after adopting a highly skeptical view about all claims, even apparently obviously true ones—attempts to argue that he himself must exist.
In certain contexts, sequences of elements from the same set (as in Definition 2.34) are called strings. For a set Σ, called an alphabet, a string over Σ is an element of Σ^n for some nonnegative integer n. (In other words, a string is any element of ⋃_{n ∈ Z≥0} Σ^n.) The length of a string x ∈ Σ^n is n. For example, the set of 5-letter words in English is a subset of {A, B, …, Z}^5. We allow strings to have length zero: for any alphabet Σ, there is only one sequence of elements from Σ of length 0, called the empty string; it’s denoted by ε, and for any alphabet Σ, we have Σ^0 := {ε}. When writing strings, it is customary to omit the punctuation (angle brackets and commas), so we write ABRACADABRA ∈ {A, B, …, Z}^11 and 11010011 ∈ {0, 1}^8.
2.4.1 Vectors
As we’ve already seen, we can create sequences of many types of things: we can view sequences of letters as strings (like ABRACADABRA ∈ {A, B, …, Z}^11), or sequences of three integers between 0 and 255 as colors (like ⟨119, 136, 153⟩ ∈ {0, 1, …, 255}^3, officially called “light slate gray”). Perhaps the most pervasive type of sequence, though, is a sequence of real numbers, called a vector.
Taking it further: Vectors are used in a tremendous variety of computational contexts: computer graphics (representing the line-of-sight from the viewer’s eye to an object in a scene), machine learning (a feature vector describing which characteristics a particular object has, which can be used in trying to classify that object as satisfying a condition or failing to satisfy a condition), among many others. The discussion on p. 248 describes the vector-space model for representing a document d as a vector whose components correspond to the number of times each word appears in d.
Vectors and matrices (the topics of this and the next subsection) are the main focus of a math course in linear algebra. In these subsections, we’re only mentioning a few highlights of vectors and matrices; you can find much more in any good textbook on linear algebra.

Definition 2.35 (Vector)
A vector (or n-vector) x is a sequence x ∈ R^n, for some positive integer n. For a vector x ∈ R^n and for any index i ∈ {1, 2, …, n}, we write x_i to denote the ith component of x.
For example, ⟨0, 1⟩, ⟨1, 0⟩, and ⟨1/√2, 1/√2⟩ are all vectors in R^2. For the vector x := ⟨1/2, √3/2⟩, we have x_1 = 1/2 and x_2 = √3/2.
A warning for C or Java or Python (or …) programmers: notice that our vectors’ components are indexed starting at one, not zero. For a vector x ∈ R^n, the expression x_i is meaningless unless i ∈ {1, 2, …, n}. In particular, the expression x_0 doesn’t mean anything.
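To make the indexing warning concrete, here is a small Python sketch (our own illustration; the helper name component and the sample vector are hypothetical) that stores a vector in a 0-indexed list but exposes the 1-based components x_1, …, x_n used in the text.

```python
def component(x, i):
    """Return x_i using the text's 1-based indexing; x is an ordinary (0-based) Python list."""
    if not 1 <= i <= len(x):
        raise IndexError(f"x_{i} is undefined for a vector with {len(x)} components")
    return x[i - 1]

x = [10, 20, 30]          # a hypothetical 3-vector
print(component(x, 1))    # 10
print(component(x, 3))    # 30
```

The explicit range check matters in Python in particular: without it, asking for x_0 would evaluate x[-1] and silently return the last component instead of failing.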
Vectors are sometimes contrasted with scalars, which are just numbers: that is, a scalar is an element of R. Vectors are also sometimes written in square brackets, so we may see an n-vector x written as x = [x_1, x_2, …, x_n]. We may encounter vectors in which the components are a restricted kind of number (for example, integers or bits). Elements of {0, 1}^n are often called bit vectors or bit strings.
Here’s an example of using vectors to compute distances between points:
Example 2.40 (Train stations in Manhattan)
Problem: Let’s (very roughly!) represent a location in Manhattan as a vector, specifically as a point ⟨x, y⟩ ∈ R^2 representing the intersection of xth Avenue and yth Street. Define the walking distance between points p and q in Manhattan as |p_1 − q_1| + |p_2 − q_2|: the number of east–west blocks between p and q plus the number of north–south blocks between p and q. (Note that walking distance is different from the straight-line distance between the points!)
1. The two major train stations in Manhattan are Penn Station, located at s := ⟨8, 33⟩, and Grand Central Station, located at g := ⟨4, 42⟩. What’s the walking distance between Penn Station and Grand Central?
2. Describe the set of all points that are closer (in walking distance) to Penn Station than to Grand Central.
Solution:
1. The distance between s = ⟨8, 33⟩ and g = ⟨4, 42⟩ is |s_1 − g_1| + |s_2 − g_2| = |8 − 4| + |33 − 42| = 4 + 9 = 13.
2. Let’s compute some points that are equidistant to the two stations. (Those points are on the boundary of the region of points closer to g and the region of points closer to s.) For example, a point ⟨4, y⟩ has distances |42 − y| and 4 + |y − 33| to the stations; these distances are both equal to 6.5 when y = 35.5.
More generally, let’s think about a point whose x-coordinate falls between 4 and 8. For any offset 0 ≤ δ ≤ 4, the distances between the point ⟨4 + δ, y⟩ and the two stations are δ + |42 − y| and 4 − δ + |y − 33|. These two values are both equal to 6.5 when y = 35.5 + δ. (For example, when δ = 4, then y = 39.5.) Thus the points ⟨4 + 0, 35.5 + 0⟩ = ⟨4, 35.5⟩ and ⟨4 + 4, 35.5 + 4⟩ = ⟨8, 39.5⟩ are both equidistant to s and g, as are all points on the line segment between them. (See Figure 2.32.)
The remaining cases of the analysis (figuring out which points with x-coordinate less than 4 or greater than 8 are closer to s or g, the regions marked with “?” in Figure 2.32) are left to you in Exercises 2.184 and 2.185.
Figure 2.32: Illustrations of Manhattan train stations. In the second panel, the dark shaded points are closer (in walking distance) to ⟨4, 42⟩ than to ⟨8, 33⟩. The white shaded points are closer to ⟨8, 33⟩ than to ⟨4, 42⟩.

Taking it further: The measure of walking distance between points that we used in Example 2.40 is used surprisingly commonly in computer science applications, and, appropriately enough, it’s actually named after Manhattan. The Manhattan distance between two points p, q ∈ R^n is defined as ∑_{i=1}^{n} |p_i − q_i|. (We’re summing the number of “blocks” of difference in each of the n dimensions; we take the absolute value of the difference in each component because we care about the difference in each dimension rather than which point has the higher value in that component.)
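The Manhattan distance is short enough to state directly in code. The following Python sketch is our own illustration (the function name manhattan_distance is an assumption, not something from the text); it checks the station-to-station distance from Example 2.40 and one of the equidistant boundary points found there.

```python
def manhattan_distance(p, q):
    """Sum of |p_i - q_i| over all components; p and q must have the same length."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of components")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

s = (8, 33)   # Penn Station, as in Example 2.40
g = (4, 42)   # Grand Central, as in Example 2.40
print(manhattan_distance(s, g))   # 13

# The point <4 + 2, 35.5 + 2> lies on the boundary segment, so it is equidistant.
boundary = (6, 37.5)
print(manhattan_distance(boundary, s), manhattan_distance(boundary, g))   # 6.5 6.5
```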
Here’s one more useful definition about vectors:
Definition 2.36 (Vector length)
The length of a vector x ∈ R^n is defined as ∥x∥ := √(∑_{i=1}^{n} (x_i)^2).
For example, ∥⟨2, 8⟩∥ = √(2^2 + 8^2) = √(4 + 64) = √68 ≈ 8.246. If we draw a vector x ∈ R^2 in the Cartesian plane, then ∥x∥ denotes the length of the line from ⟨0, 0⟩ to x. (See Figure 2.33.) A vector x ∈ R^n is called a unit vector if ∥x∥ = 1.
Figure 2.33: Two vector lengths: ∥⟨1, 9⟩∥ is √(1 + 81) = √82, and ∥⟨−3, −5⟩∥ is √(9 + 25) = √34.
Vector arithmetic
We will now define basic arithmetic for vectors: vector addition, which is performed component-wise (adding the corresponding elements of the two vectors), and two forms of multiplication: one for multiplying a vector by a scalar (also component-wise) and one for multiplying two vectors together. We’ll start with addition:
Definition 2.37 (Vector addition)
The sum of two vectors x, y ∈ R^n, written x + y, is a vector z ∈ R^n, where for every index i ∈ {1, 2, …, n} we have z_i := x_i + y_i. (Note that the sum of two vectors with different sizes is meaningless.)
For example, ⟨1.1, 2.2, 3.3⟩ + ⟨2, 0, 2⟩ = ⟨3.1, 2.2, 5.3⟩.
The first type of multiplication for vectors is scalar multiplication, when we multiply a vector by a real number. As with vector addition, scalar multiplication acts on each component independently, by rescaling each component by the same factor:
Definition 2.38 (Scalar product)
Given a vector x ∈ R^n and a real number α ∈ R, the scalar product αx is a vector z ∈ R^n, where z_i := αx_i for every index i ∈ {1, 2, …, n}.
For example, we have 3 · ⟨1, 2, 3⟩ = ⟨3, 6, 9⟩. Similarly −1.5 · ⟨1, −1⟩ = ⟨−1.5, 1.5⟩ and 0 · ⟨1, 2, 3, 5, 8⟩ = ⟨0, 0, 0, 0, 0⟩.
The second type of vector multiplication, the dot product, takes two vectors as input and multiplies them together to produce a single scalar as output:
Definition 2.39 (Dot product)
Given two vectors x, y ∈ R^n, the dot product of x and y, denoted x • y, is given by summing the products of the corresponding components:
x • y = ∑_{i=1}^{n} x_i · y_i.
As with vector addition, the dimensions of the vectors in a dot product have to match up: if x ∈ R^n and y ∈ R^m are vectors where n ≠ m, then x • y is meaningless.

For example, ⟨1, 2, 3⟩ • ⟨4, 5, 6⟩ = 1 · 4 + 2 · 5 + 3 · 6 = 4 + 10 + 18 = 32.
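The definitions of vector length, vector addition, scalar product, and dot product can each be written in a line or two. Here is a minimal Python sketch under our own naming (length, add, scale, and dot are illustrative helper names, not notation from the text), with vectors stored as plain lists.

```python
import math

def length(x):
    """Vector length: the square root of the sum of squared components."""
    return math.sqrt(sum(xi ** 2 for xi in x))

def add(x, y):
    """Component-wise sum; meaningless (here: an error) for vectors of different sizes."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return [xi + yi for xi, yi in zip(x, y)]

def scale(alpha, x):
    """Scalar product: rescale every component by alpha."""
    return [alpha * xi for xi in x]

def dot(x, y):
    """Dot product: sum of products of corresponding components."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return sum(xi * yi for xi, yi in zip(x, y))

print(length([2, 8]))              # about 8.246
print(add([1, 2, 3], [2, 0, 2]))   # [3, 2, 5]
print(scale(3, [1, 2, 3]))         # [3, 6, 9]
print(dot([1, 2, 3], [4, 5, 6]))   # 32
```

The explicit length checks mirror the remark that sums and dot products of vectors with different sizes are meaningless; Python’s zip would otherwise silently truncate to the shorter vector.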
Intuitively, the dot product of two vectors measures the extent to which they point
in the “same direction.” Here’s an example with a few unit vectors:
Example 2.41 (Dot products of unit vectors)
Consider the unit vectors a := ⟨0, 1⟩, b := ⟨1, 0⟩, c := ⟨1/√2, 1/√2⟩, and d := ⟨0, −1⟩. (See Figure 2.34.) Here is the dot product of c with each of these vectors:
c • a = c_1 · a_1 + c_2 · a_2 = (1/√2) · 0 + (1/√2) · 1 = 1/√2.
c • b = c_1 · b_1 + c_2 · b_2 = (1/√2) · 1 + (1/√2) · 0 = 1/√2.
c • c = c_1 · c_1 + c_2 · c_2 = (1/√2) · (1/√2) + (1/√2) · (1/√2) = 1/2 + 1/2 = 1.
c • d = c_1 · d_1 + c_2 · d_2 = (1/√2) · 0 + (1/√2) · (−1) = −1/√2.
Here are two examples using dot products for simple applications:
Example 2.42 (Common classes)
Let C := ⟨CS1, CS2, …, CS8⟩ denote the list of all courses offered by a (somewhat narrowly focused) university. For a particular student, let the bit vector u represent the courses taken by that student, so that u_i := 1 if the student has taken course C_i (and u_i := 0 otherwise). For example, a student who’s taken only CS1 and CS8 would be represented by x := ⟨1, 0, 0, 0, 0, 0, 0, 1⟩, and a student who’s taken everything except CS3 would be represented by y := ⟨1, 1, 0, 1, 1, 1, 1, 1⟩.
The dot product of two student vectors represents the number of common courses that they’ve taken. For example, the number of common classes taken by x and y is
x • y = ∑_{i=1}^{8} x_i y_i = 1·1 + 0·1 + 0·0 + 0·1 + 0·1 + 0·1 + 0·1 + 1·1
      = 1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 = 2.
Specifically, the two common courses taken by x and y are CS1 and CS8.
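The same count can be reproduced in a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the course labels are generated as "CS1" through "CS8" to match the list C in Example 2.42.

```python
x = [1, 0, 0, 0, 0, 0, 0, 1]   # took only CS1 and CS8
y = [1, 1, 0, 1, 1, 1, 1, 1]   # took everything except CS3

# The dot product of two bit vectors counts the positions where both are 1.
common_count = sum(xi * yi for xi, yi in zip(x, y))

# Recover which courses those positions correspond to (courses are numbered from 1).
common_courses = [f"CS{i}" for i, (xi, yi) in enumerate(zip(x, y), start=1) if xi and yi]

print(common_count)     # 2
print(common_courses)   # ['CS1', 'CS8']
```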
Example 2.43 (GPAs)
Let g ∈ R^n be an n-vector where g_i denotes the grade (measured on the grade point scale) that you got in the ith class that you’ve taken in your college career. Let c ∈ R^n be an n-vector where c_i denotes the number of credit hours for the ith class you took in your college career. Then your grade point average (GPA) is given by (g • c) / (∑_{i=1}^{n} c_i).
For example, suppose your school gives grade points on the scale 4.0 = A, 3.7 = A-, 3.3 = B+, 3.0 = B, etc. Suppose you took CS 111 (6 credits), CS 201 (6 credits), and Mbira Lessons (4 credits), and got grades of B+, A-, and B, respectively. Then g = ⟨3.3, 3.7, 3.0⟩ and c = ⟨6, 6, 4⟩, and your GPA is given by
(g • c) / (∑_{i=1}^{3} c_i) = (3.3·6 + 3.7·6 + 3.0·4) / (6 + 6 + 4) = (19.8 + 22.2 + 12.0) / 16 = 54/16 = 3.375.
Figure 2.34: Four unit vectors.
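The GPA formula from Example 2.43 is a dot product divided by a sum, which can be sketched in Python as follows. The function name gpa is our own choice, and the result is a floating-point value, so it should be compared approximately rather than exactly.

```python
def gpa(grades, credits):
    """Credit-weighted average: (g . c) divided by the total number of credits."""
    if len(grades) != len(credits) or not credits:
        raise ValueError("need one credit count per grade, and at least one class")
    weighted_points = sum(g * c for g, c in zip(grades, credits))
    return weighted_points / sum(credits)

g = [3.3, 3.7, 3.0]   # B+, A-, B
c = [6, 6, 4]         # CS 111, CS 201, Mbira Lessons
print(gpa(g, c))      # approximately 3.375
```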

2.4.2 Matrices
If a vector is analogous to an array of numbers, then a matrix is analogous to a two-dimensional array of numbers:
Definition 2.40 (Matrix)
An n-by-m matrix M is a two-dimensional table of real numbers containing n rows and m columns. The ⟨i, j⟩th entry of the matrix appears in the ith row and jth column, and we denote that entry by M_{i,j}, as shown in Figure 2.35. Such a matrix M is an element of R^{n×m}, and we refer to M as having size or dimension n-by-m.
    [ M_{1,1}  M_{1,2}  …  M_{1,m} ]
    [ M_{2,1}  M_{2,2}  …  M_{2,m} ]
    [    ⋮        ⋮     ⋱     ⋮    ]
    [ M_{n,1}  M_{n,2}  …  M_{n,m} ]
Figure 2.35: A matrix M.
The plural of matrix is matrices (which rhymes with the word “cheese”).
Here are a few very small example matrices:
Example 2.44 (Three matrices)
Here are three matrices. (The ⟨2, 1⟩st entry, circled in the original figure, is 9 in A, 4 in B, and 0 in I.)
    A = [ 3 1 4 ]     B = [ 5 3 ]     I = [ 1 0 0 ]
        [ 9 7 2 ]         [ 4 8 ]         [ 0 1 0 ]
                          [ 6 9 ]         [ 0 0 1 ]
In these examples, A is a 2-by-3 matrix, B is a 3-by-2 matrix, and I is a 3-by-3 matrix.
One can think of a two-dimensional array in a programming language as a one-dimensional array of one-dimensional arrays. Similarly, if you prefer, you can think of an n-by-m matrix as a sequence of n vectors, all of which are elements of R^m. This view of an n-by-m matrix is as an element of (R^m)^n.
One simple application of matrices is as an easy way to represent images:
Example 2.45 (Bitmaps)
A black-and-white image can be represented as a matrix with all entries in {0, 1}: each 1 entry represents white in the corresponding pixel; each 0 represents black. For example, the matrix in Figure 2.36(a) could represent the image in Figure 2.36(b).
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
(a) A matrix.
(b) A bitmapped image.
Figure 2.36: A matrix representing a black-and-white bitmapped image, and the image.
Taking it further: The picture shown in Figure 2.36 is a simple black-and-white image, but we can use the same basic structure for grayscale or color images. Instead of just an integer in {0, 1} as each entry in the matrix, a grayscale pixel could be represented using a real number in [0, 1] or, more practically, a number in {0/255, 1/255, …, 255/255}. For color images, each entry would be an RGB triple (see Example 2.38).
These matrix-based representations of an image are often called bitmaps. Bitmaps are highly inefficient ways of storing images; most computer graphics file formats use much cleverer (and more space-efficient) representations.
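A matrix maps naturally onto a list of rows. As a hedged illustration (the helper names entry and render, and the choice of '#' and '.' as display characters, are our own), the following Python sketch stores the 9-by-20 matrix of Figure 2.36(a), reads an entry with the text’s 1-based indices, and draws the image it encodes.

```python
# Rows of the 9-by-20 matrix from Figure 2.36(a); 1 is white, 0 is black.
rows = [
    "1" * 20,
    "1" * 14 + "0" + "1" * 5,
    "1" * 14 + "00" + "1" * 4,
    "1" * 14 + "000" + "1" * 3,
    "0" * 18 + "11",
    "1" * 14 + "000" + "1" * 3,
    "1" * 14 + "00" + "1" * 4,
    "1" * 14 + "0" + "1" * 5,
    "1" * 20,
]
bitmap = [[int(ch) for ch in row] for row in rows]

def entry(m, i, j):
    """Return M_{i,j} using 1-based row and column indices, as in Definition 2.40."""
    return m[i - 1][j - 1]

def render(m):
    """Draw black pixels (0) as '#' and white pixels (1) as '.'."""
    return "\n".join("".join("#" if v == 0 else "." for v in row) for row in m)

print(len(bitmap), len(bitmap[0]))   # 9 20
print(entry(bitmap, 5, 1))           # 0
print(render(bitmap))                # a right-pointing arrow
```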

Here are a few other examples of the pervasive applications of matrices in computer science. A term–document matrix can be used to represent a collection of documents: the entry M_{d,k} of the matrix M stores the number of times that keyword k appears in document d. An adjacency matrix (see Chapter 11) can represent the page-to-page hyperlinks of the web in a matrix M, where M_{i,j} = 1 if web page i has a hyperlink to web page j (and M_{i,j} = 0 otherwise). A rotation matrix can be used in computer graphics to re-render a scene from a different perspective; see p. 249 for some discussion.
A matrix M ∈ R^{m×n} is called square if m = n. For a square matrix M ∈ R^{n×n}, we may say that the size of M is n (rather than saying that its size is n-by-n). A square matrix M is called symmetric if, for all indices i, j ∈ {1, 2, …, n}, we have M_{i,j} = M_{j,i}. The main diagonal of a square matrix M ∈ R^{n×n} is the sequence consisting of the entries M_{i,i} for i = 1, 2, …, n. For example:
Example 2.46 (Main diagonal)
Consider the 3-by-3 square matrix M shown in Figure 2.37. The main diagonal of M is ⟨M_{1,1}, M_{2,2}, M_{3,3}⟩ = ⟨1, 5, 9⟩.
    [ 1 2 3 ]
    [ 4 5 6 ]
    [ 7 8 9 ]
Figure 2.37: A matrix M with the entries of the main diagonal circled.
One special square matrix that will arise frequently is the identity matrix, which has ones on the main diagonal and zeros everywhere else (see Figure 2.38):
    [ 1 0 … 0 ]
    [ 0 1 … 0 ]
    [ ⋮ ⋮ ⋱ ⋮ ]
    [ 0 0 … 1 ]
Figure 2.38: The identity matrix I.
Definition 2.41 (Identity matrix)
The n-by-n identity matrix is the matrix I ∈ R^{n×n} whose entries satisfy
I_{i,j} = 1 if i = j, and I_{i,j} = 0 if i ≠ j.
Note that there is a different n-by-n identity matrix for every n ≥ 1:
Example 2.47 (The smallest identity matrices)
Here are the identity matrices of size up to 5:
    [ 1 ]    [ 1 0 ]    [ 1 0 0 ]    [ 1 0 0 0 ]    [ 1 0 0 0 0 ]
             [ 0 1 ]    [ 0 1 0 ]    [ 0 1 0 0 ]    [ 0 1 0 0 0 ]
                        [ 0 0 1 ]    [ 0 0 1 0 ]    [ 0 0 1 0 0 ]
                                     [ 0 0 0 1 ]    [ 0 0 0 1 0 ]
                                                    [ 0 0 0 0 1 ]
As with vectors, we will need to define the basic arithmetic operations of addition and multiplication for matrices. Just as with vectors, adding two n-by-m matrices or multiplying a matrix by a scalar is done component by component.
Definition 2.42 (Matrix addition and scalar multiplication)
Given two matrices M, M′ ∈ Rn×m and a real number α ∈ R:
• The product αM is a matrix N ∈ Rn×m where Ni,j := αMi,j for all indices i ∈ {1,2,…,n} and j ∈ {1,2,…,m}.

• The sum M + M′ is a matrix N ∈ R^{n×m} where N_{i,j} := M_{i,j} + M′_{i,j} for all indices i ∈ {1, 2, …, n} and j ∈ {1, 2, …, m}.
Again, just as with vectors, adding two matrices that are not the same size is meaningless. Here are some small examples:
Example 2.48 (Simple matrix arithmetic)
Consider the following matrices:
    A := [ 0 2 2 ]     B := [ 1 2 3 ]     I := [ 1 0 0 ]
         [ 2 0 2 ]          [ 0 0 6 ]          [ 0 1 0 ]
         [ 2 2 0 ]          [ 0 0 4 ]          [ 0 0 1 ]
Then we have:
    A + B = [ 1 4 5 ]      4B = [ 4 8 12 ]
            [ 2 0 8 ]           [ 0 0 24 ]
            [ 2 2 4 ]           [ 0 0 16 ]

    A + 3I = [ 3 2 2 ]     A − 3I = [ −3  2  2 ]
             [ 2 3 2 ]              [  2 −3  2 ]
             [ 2 2 3 ]              [  2  2 −3 ]
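The four results of Example 2.48 can be checked with a short Python sketch of Definition 2.42. The function names mat_add and mat_scale are our own, and matrices are stored as lists of rows.

```python
def mat_add(m1, m2):
    """Entry-by-entry sum; the two matrices must have exactly the same size."""
    if len(m1) != len(m2) or any(len(r1) != len(r2) for r1, r2 in zip(m1, m2)):
        raise ValueError("matrices must have the same dimensions")
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def mat_scale(alpha, m):
    """Multiply every entry of the matrix by the scalar alpha."""
    return [[alpha * v for v in row] for row in m]

A = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]
B = [[1, 2, 3], [0, 0, 6], [0, 0, 4]]
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(mat_add(A, B))                  # [[1, 4, 5], [2, 0, 8], [2, 2, 4]]
print(mat_scale(4, B))                # [[4, 8, 12], [0, 0, 24], [0, 0, 16]]
print(mat_add(A, mat_scale(3, I)))    # [[3, 2, 2], [2, 3, 2], [2, 2, 3]]
print(mat_add(A, mat_scale(-3, I)))   # [[-3, 2, 2], [2, -3, 2], [2, 2, -3]]
```

Here A − 3I is computed as A + (−3)I, which uses only the two operations that Definition 2.42 provides.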
Matrix multiplication
Multiplying matrices is a bit more complicated than the other vector/matrix operations that we’ve seen so far. The product of two matrices is a matrix, rather than a single number: the entry in the ith row and jth column of AB is derived from the ith row of A and the jth column of B. More precisely:
Definition 2.43 (Matrix multiplication)
The product AB of two matrices A ∈ R^{n×m} and B ∈ R^{m×p} is an n-by-p matrix M ∈ R^{n×p} whose entries are, for any i ∈ {1, 2, …, n} and j ∈ {1, 2, …, p},
M_{i,j} := ∑_{k=1}^{m} A_{i,k} B_{k,j}.
As usual, if the dimensions of the matrices A and B don’t match (if the number of columns in A is different from the number of rows in B), then AB is undefined.
Example 2.49 (Multiplying some small matrices)
Let’s compute the product of a sample 2-by-3 matrix and a 3-by-2 matrix:
    [ 1 2 3 ]   [ 7 8 ]
    [ 4 5 6 ] · [ 1 3 ]
                [ 9 0 ]

Note that, by definition, the result will be a 2-by-2 matrix. Let’s compute its entries:
    [ 1 2 3 ]   [ 7 8 ]   [ 1·7 + 2·1 + 3·9   1·8 + 2·3 + 3·0 ]
    [ 4 5 6 ] · [ 1 3 ] = [ 4·7 + 5·1 + 6·9   4·8 + 5·3 + 6·0 ]
                [ 9 0 ]

                        = [ 7 + 2 + 27    8 + 6 + 0   ]
                          [ 28 + 5 + 54   32 + 15 + 0 ]

                        = [ 36 14 ]
                          [ 87 47 ].
For example, the 14 in ⟨row #1, column #2⟩ of the result was calculated by successively multiplying the first matrix’s first row ⟨1, 2, 3⟩ by the second matrix’s second column ⟨8, 3, 0⟩. Alternatively, here’s a visual representation of this multiplication:
    [ 1 2 3 ]   [ 7 8 ]   [ 36 14 ]
    [ 4 5 6 ] · [ 1 3 ] = [ 87 47 ]
                [ 9 0 ]
Problem-solving tip: To help keep matrix multiplication straight, it may be helpful to compute the ⟨i, j⟩th entry of AB by simultaneously tracing the ith row of A with the index finger of your left hand, and the jth column of B with the index finger of your right hand. Multiply the two numbers that you’re pointing at, and add the result to a running tally; when you’ve traced the whole row/column, the running tally is (AB)_{i,j}.
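Definition 2.43 translates almost symbol for symbol into code. The following Python sketch is our own illustration (mat_mul is a hypothetical helper name, and it favors clarity over speed); it reproduces the product computed in Example 2.49.

```python
def mat_mul(a, b):
    """Product of an n-by-m matrix a and an m-by-p matrix b, following Definition 2.43."""
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("number of columns of A must equal number of rows of B")
    # Entry (i, j) is the sum over k of A[i][k] * B[k][j].
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [1, 3],
     [9, 0]]

print(mat_mul(A, B))   # [[36, 14], [87, 47]]
```

The inner sum plays the role of the "running tally" in the problem-solving tip: k walks along row i of A and down column j of B at the same time.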
More compactly, we could write matrix multiplication using the dot product from Definition 2.39: for two matrices A ∈ R^{n×m} and B ∈ R^{m×p}, the ⟨i, j⟩th entry of AB is the value of A_{i,(1…m)} • B_{(1…m),j}.
Be careful: matrix multiplication is not commutative; that is, for matrices A and B, the values AB and BA are generally different! (This asymmetry is unlike numerical multiplication: for x, y ∈ R, it is always the case that xy = yx.) In fact, because the number of columns of A must match the number of rows of B for AB to even be meaningful, it’s possible for BA to be meaningless or a different size from AB.
Example 2.50 (Multiplying the other way around)
If we multiply the matrices from Example 2.49 in the other order, we get
    [ 7 8 ]   [ 1 2 3 ]   [ 39 54 69 ]
    [ 1 3 ] · [ 4 5 6 ] = [ 13 17 21 ]
    [ 9 0 ]               [  9 18 27 ]
This matrix differs from the result in Example 2.49: it’s not even the same size!
You’ll show in the exercises that, for any n-by-m matrix A, the result of multiplying A by the identity matrix I yields A itself: that is, AI = A. You’ll also explore the inverse of a matrix A: that is, the matrix A^{−1} such that AA^{−1} = I (if any such A^{−1} exists).
Here’s another example of using matrices, and matrix multiplication, to combine different types of information:

Example 2.51 (Programming language knowledge)
Problem: Let A be an n-by-m matrix where A_{i,j} = 1 if student i has taken class j (and A_{i,j} = 0 otherwise). Let B be an m-by-p matrix where B_{j,k} = 1 if class j uses programming language k (and B_{j,k} = 0 otherwise). What does the matrix AB represent?
Solution: First, note that the resulting matrix AB has n rows and p columns; that is, its size is (number of students)-by-(number of languages). For a student i and a programming language k, we have by definition that
(AB)_{i,k} = ∑_{j=1}^{m} A_{i,j} B_{j,k}
           = ∑_{j=1}^{m} (1 if student i took class j and j uses language k; 0 otherwise)
because 0·0 = 0·1 = 1·0 = 0, so the only terms of the sum that are 1 occur when both A_{i,j} (“student i took class j?”) and B_{j,k} (“class j uses language k?”) are true (that is, 1). Thus (AB)_{i,k} denotes the number of classes that use language k that student i took.
Example 2.52 (A concrete example of Example 2.51)
Concretely, consider these 3 students, 5 courses, and 7 programming languages:

Alice0 1 1 1
A:= Bob1 1 0 1 Charlie 1 0 0 0
For these matrices, we have
AB =
(For example, the Alice/C cell is
product of Alice’s row of A with 0 · 0 +
 1 0 
1
B :=
i n t r o  0 datastruct0 org/arch 0 proglang0 theoryofcomp 0
1 0 0
1 0 1
0 1 0
1 1 1
0 0 0
0 0 0 0 1 0 1 1 0 0
0  0
0   . 1 0
Alice0 2 2 2 2 1 1 Bob0 3 1 2 1 1 1.
Charlie 0 1 0 0 0 0 0
computed by ⟨0, 1, 1, 1, 1⟩ • ⟨0, 0, 1, 1, 0⟩—the dot
C’s column of B—which has the value 1 · 0 + 1 · 1 + 1 · 1 + 1 · 0 = 2.
This entry reflects the fact that Alice has taken two classes that use C: organization/ architecture and programming languages.)
Perl Python
C
Java Assembly C++ Scheme
Perl Python
C
Java Assembly C++ Scheme
intro
data structures org/arch
prog langs theory of comp
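The product in Example 2.52 can be recomputed directly. In the Python sketch below, the matrix entries are our best reading of the printed tables (in particular the last column of A and the left-to-right order of the language columns are assumptions), and all variable names are our own.

```python
students = ["Alice", "Bob", "Charlie"]
languages = ["Perl", "Python", "C", "Java", "Assembly", "C++", "Scheme"]

# A[i][j] = 1 if student i took class j
# (classes: intro, data structures, org/arch, prog langs, theory of comp).
A = [[0, 1, 1, 1, 1],
     [1, 1, 0, 1, 0],
     [1, 0, 0, 0, 1]]

# B[j][k] = 1 if class j uses language k.
B = [[0, 1, 0, 0, 0, 0, 0],
     [0, 1, 0, 1, 0, 0, 0],
     [0, 0, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 1, 1, 1],
     [0, 0, 0, 0, 0, 0, 0]]

# (AB)[i][k] counts the classes taken by student i that use language k.
AB = [[sum(A[i][j] * B[j][k] for j in range(len(B))) for k in range(len(languages))]
      for i in range(len(students))]

for name, row in zip(students, AB):
    print(name, row)

alice_c = AB[students.index("Alice")][languages.index("C")]
print(alice_c)   # 2
```

Under this reading, the computed rows agree with the AB table in the example, including the Alice/C entry of 2.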

Computer Science Connections
The Vector Space Model
Here’s a classic application of vectors, taken from information retrieval, the subfield of computer science devoted to searching for information relevant to a given query in large datasets. We start with a large corpus of documents: for example, transcripts of all email messages that you’ve sent in your entire life. (The word corpus comes from the Latin for “body”; it simply means a body of texts.) Tasks involving the corpus might include clustering the documents into subcollections (“which of my email messages are spam?”), or finding the stored documents most similar to a given query (“find me the 10 emails most relevant to ‘good restaurants in Chicago’ in my archives”).
The vector space model is a standard approach to representing text documents for the purposes of information retrieval. We choose a list of n terms that might appear in a document. We then represent a document d as an n-vector x of integers, where x_i is the number of times that the ith term appears in the document d. See Figure 2.39 for an example.
    d1   Three is one of the loneliest numbers.    →   [1, 0, 1]
    d2   A one and a two and a one, two, three.    →   [2, 2, 1]
    d3   One, two, buckle my shoe.                 →   [1, 1, 0]
Because documents that are about similar topics tend to contain similar vocabulary, we can judge the similarity of documents d and d′ based on “how similar” their corresponding vectors x and x′ are:
• A first stab at measuring similarity between x and x′ is to compute the dot product x • x′; this approach counts the number of times any word in d appears in d′. (And if a word appears twice in d, then each appearance in d′ counts twice for the dot product.)
• This first approach has an issue in that it favors longer documents: a document that lists all the words in the dictionary would correspond to a vector [1, 1, 1, 1, 1, …], which would therefore have a large dot product with all documents in the corpus. To compensate for the fact that longer documents have more words, we normalize these vectors so that they have the same length, by using x/∥x∥ and x′/∥x′∥ to represent the documents. It turns out that the dot product of the normalized vectors computes the cosine of the angle between these representations of the documents.
• This second approach suffers from counting common occurrences of the word the and the word normalize as equally indicative of the similarity of documents. Information retrieval systems apply different weights to different terms in measuring similarity; one common approach is called term frequency–inverse document frequency (TFIDF), which downweights terms that appear in many documents in the corpus.
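The first two similarity measures above can be tried on the three documents of Figure 2.39. This Python sketch is illustrative only: the function names dot and cosine_similarity are ours, the vectors are the ones given for the keywords ‘one’, ‘two’, and ‘three’, and no TFIDF weighting is attempted.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def cosine_similarity(x, y):
    """Dot product of x/||x|| and y/||y||; undefined (division by zero) for an all-zero vector."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

docs = {"d1": [1, 0, 1], "d2": [2, 2, 1], "d3": [1, 1, 0]}

for a, b in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
    raw = dot(docs[a], docs[b])
    cos = cosine_similarity(docs[a], docs[b])
    print(a, b, raw, round(cos, 3))
# d1 d2 3 0.707
# d1 d3 1 0.5
# d2 d3 4 0.943
```

On this tiny corpus both measures rank d2 and d3 as the most similar pair, but the normalized scores are no longer inflated by d2 simply being the longest document.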
It’s worth noting that real information retrieval systems are usually quite a lot more complicated than we’ve discussed so far. For example, a document that talks about sofas would be judged to be completely unrelated to a document that talks about couches, which seems like a naïve judgement. Handling synonyms requires a more complicated approach, often based around analyzing the term–document matrix that simultaneously represents the entire corpus. (For example, if documents that discuss sofas use very similar other words to documents that discuss couches, like change and cushion and nap, then we might be able to infer something about sofas and couches.)⁸
(a) Three documents translated into vectors using the keywords ‘one’, ‘two’, and ‘three’.
(b) A plot of the three documents in R^3.
Figure 2.39: An example from the vector-space model.
For much more on information retrieval, see the excellent text
8 Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Computer Science Connections
Rotation Matrices
When an image is rendered (drawn) using computer graphics, we typically proceed by transforming a 3-dimensional representation of a scene, a model of the world, into a 2-dimensional image fit for a screen. The scene is typically represented by a collection of points in R^3, each defining a vertex of a polygon. The camera (the eye from which the scene is viewed) is another point in R^3, with an orientation describing the direction of view. We then project the polygons’ points into R^2. This computation is done using matrix multiplications, by taking into account the position and direction of view of the camera, and the position of the given point. While a full account of this rendering algorithm isn’t too difficult, we’ll stick with a simpler problem that still includes the interesting matrix computations.⁹ We’ll instead consider the rotation of a set of points in R^2 by an angle θ. (The full-scale problem requires thinking about the angle of view with two parameters, akin to “azimuth” and “elevation” in orienteering: the direction θ in the horizontal plane and the angle φ away from a straight horizontal view.) Suppose that we have a scene that consists of a collection of points in R^2. As an example, Figure 2.40 shows a collection of hand-collected points in R^2 that represent the borders of the state of Nevada.
Suppose that we wish to rotate a point ⟨x, y⟩ by an angle θ around the point ⟨0, 0⟩. You should be able to convince yourself with a drawing that we can rotate a point ⟨x, 0⟩ around the point ⟨0, 0⟩ by moving it to ⟨x cos θ, x sin θ⟩. More generally, the point ⟨x, y⟩ becomes the point ⟨x cos θ − y sin θ, x sin θ + y cos θ⟩ when it’s rotated.
Suppose we wish to rotate the points ⟨x_1, y_1⟩, …, ⟨x_n, y_n⟩ by angle θ. Write a matrix with the ith column corresponding to the ith point, and perform matrix multiplication as follows:
    [ cos θ  −sin θ ]   [ x_1  x_2  ···  x_n ]   [ x_1 cos θ − y_1 sin θ   x_2 cos θ − y_2 sin θ   ···   x_n cos θ − y_n sin θ ]
    [ sin θ   cos θ ] · [ y_1  y_2  ···  y_n ] = [ x_1 sin θ + y_1 cos θ   x_2 sin θ + y_2 cos θ   ···   x_n sin θ + y_n cos θ ]
(The matrix
    R = [ cos θ  −sin θ ]
        [ sin θ   cos θ ]
is called a rotation matrix.)
The result is that we have rotated an entire collection of points, arranged in the 2-by-n matrix M, by multiplying M by this rotation matrix. In other words, RM is a 2-by-n matrix of the rotated points. See Figure 2.41.
You can learn more about the way that the full-scale computer graphics algorithms work in a textbook like
9 John F. Hughes, Andries van Dam, Morgan McGuire, David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt Akeley. Computer Graphics: Principles and Practice. Addison-Wesley, 3rd edition, 2013.
Figure 2.40: The 10 points in R^2 representing the boundaries of Nevada.
Figure 2.41: Nevada, as above and rotated by three different angles.
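The rotation RM described above can be sketched in Python with nothing more than the standard math module. The helper names rotation_matrix and mat_mul are our own, and the sample points are hypothetical stand-ins (they are not the Nevada data of Figure 2.40).

```python
import math

def rotation_matrix(theta):
    """The 2-by-2 matrix R that rotates points by theta radians around <0, 0>."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def mat_mul(a, b):
    """Matrix product following the definition of matrix multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

points = [(1, 0), (0, 1), (2, 2)]   # hypothetical sample points

# Build the 2-by-n matrix M whose ith column is the ith point.
M = [[x for x, _ in points],
     [y for _, y in points]]

RM = mat_mul(rotation_matrix(math.pi / 2), M)   # rotate by 90 degrees
rotated = list(zip(RM[0], RM[1]))               # back to a list of <x, y> pairs

for (x, y), (rx, ry) in zip(points, rotated):
    print(f"({x}, {y}) -> ({rx:.3f}, {ry:.3f})")
```

Because cos and sin are computed in floating point, the rotated coordinates are only approximately the exact values (for example, ⟨1, 0⟩ lands very near, but not exactly at, ⟨0, 1⟩).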

2.4.3 Exercises
2.141 What is {1, 2, 3} × {1, 4, 16}? 2.143 What is {1} × {1} × {1}?
2.142 What is {1, 4, 16} × {1, 2, 3}? 2.144 2.145 Suppose A × B = {⟨1, 1⟩, ⟨2, 1⟩}. What are A and B?
What is {1, 2} × {2, 3} × {1, 4, 16}?
Let S := {1, 2, 3, 4, 5, 6, 7, 8}, and let T be an unknown set. From the following, what can you conclude about T? Be as precise as possible: if you can list the elements of T exhaustively, do so; if you can’t, identify any elements that you can conclude must be (or must not be) in T.
2.146 |S × T| = 16 and ⟨1, 2⟩, ⟨3, 4⟩ ∈ S × T 2.148 (S × T) ∩ (T × S) = {⟨3, 3⟩}
2.147 S×T=∅ 2.149 S×T=T×S
Recall that Algebraic notation denotes the squares of the chess board as {a, b, c, d, e, f, g, h} × {1, 2, 3, 4, 5, 6, 7, 8}, as in Figure 2.42. For each of the following questions, identify sets S and T such that the set of cells containing the designated pieces can be described as S × T.
2.150 the white rooks (R)
2.151 the bishops (B, white or black)
2.152 the pawns (p, white or black)
2.153 no pieces at all
Figure 2.42: The squares of a chess board, written using Algebraic notation. [Board diagram: pieces in the standard starting position, with ranks 1–8 from bottom to top and files a–h from left to right.]
Write out the elements of the following sets.
2.154 {0, 1, 2}^3
2.155 {A, B} × {C, D}^2 × {E}
2.156 ⋃_{i=1}^{3} {0, 1}^i
Let Σ := {A, B, . . . , Z} denote the English alphabet. Using notation from this chapter, give an expression that denotes each of the following sets. It may be useful to recall that Σ^k denotes the set of strings consisting of a sequence of k elements from Σ, so Σ^0 contains the unique string of length 0 (called the empty string, and typically denoted by ε, or by “” in most programming languages).
2.157 The set of 8-letter strings.
2.158 The set of 5-letter strings that do not contain any vowels {A, E, I, O, U}.
2.159 The set of 6-letter strings that do not contain more than one vowel. (So GRITTY, QWERTY, and BRRRRR are fine; but EEEEEE, THREAT, STRENGTHS, and A are not.)
2.160 The set of 6-letter strings that contain at most one type of vowel: multiple uses of the same vowel are fine, but no two different vowels can appear. (So BANANA, RHYTHM, and BOOBOO are fine; ESCAPE and STRAIN are not.)
Recall that the length of a vector x ∈ R^n is given by ∥x∥ = √(∑_{i=1}^{n} x_i^2). Considering the vectors a := ⟨1, 3⟩, b := ⟨2, −2⟩, c := ⟨4, 0⟩, and d := ⟨−3, −1⟩, state the values of each of the following:
2.161 ∥a∥
2.162 ∥b∥
2.163 ∥c∥
2.164 a + b
2.165 3d
2.166 2a + c − 3b
2.167 ∥a∥ + ∥c∥ and ∥a + c∥
2.168 ∥a∥ + ∥b∥ and ∥a + b∥
2.169 3∥d∥ and ∥3d∥
2.170 Explain why, for an arbitrary vector x ∈ R^n and an arbitrary scalar a ∈ R, ∥ax∥ = |a| ∥x∥.
2.171 For any two vectors x, y ∈ R^n, we have ∥x∥ + ∥y∥ ≥ ∥x + y∥. Under precisely what circumstances do we have ∥x∥ + ∥y∥ = ∥x + y∥ for x, y ∈ R^n? Explain briefly.
Still considering the same vectors a := ⟨1, 3⟩, b := ⟨2, −2⟩, c := ⟨4, 0⟩, and d := ⟨−3, −1⟩, what are the following?
2.172 a • b
2.173 a • d
2.174 c • c
Recall that the Manhattan distance between vectors x, y ∈ R^n is defined as ∑_{i=1}^{n} |x_i − y_i|. The Euclidean distance between two vectors x, y ∈ R^n is √(∑_{i=1}^{n} (x_i − y_i)^2). What are the Manhattan and Euclidean distances between the following pairs of vectors?
2.175 a and b
2.176 a and d
2.177 b and c
Suppose that the Manhattan distance between two vectors x, y ∈ R^2 is 1. Justify your answers:
2.178 What’s the largest possible Euclidean distance between x and y?
2.179 What’s the smallest possible Euclidean distance between x and y?
2.180 What’s the smallest possible Euclidean distance between x and y if x, y ∈ R^n (not just n = 2)?
Consider Figure 2.43, and sketch the following sets:
2.181 {x ∈ R^2 : the Euclidean distance between x and ⟨0, 0⟩ is at most 2}.
2.182 {x ∈ R^2 : the Manhattan distance between x and ⟨0, 0⟩ is at most 2}.
Figure 2.43: The plane. [Axes: both coordinates run from −3 to 3.]
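The length and distance formulas recalled in the exercises above translate almost symbol for symbol into code. The following is a minimal Python sketch, assuming vectors are stored as tuples of numbers; the function names length, manhattan, and euclidean are our own choices for illustration, and the sample vectors are deliberately not the ones used in the exercises.

```python
import math

def length(x):
    """Length of x: the square root of the sum of the squared entries."""
    return math.sqrt(sum(xi ** 2 for xi in x))

def manhattan(x, y):
    """Manhattan distance: the sum of |x_i - y_i| over all coordinates."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: the square root of the sum of (x_i - y_i)^2."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Illustrative vectors only (not taken from the exercises).
p = (0, 0)
q = (3, 4)
print(length(q))        # 5.0
print(manhattan(p, q))  # 7
print(euclidean(p, q))  # 5.0
```

Note that euclidean(x, y) is just the length of the difference vector x − y, which is one way to sanity-check hand computations.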

2.4. SEQUENCES, VECTORS, AND MATRICES: ORDERED COLLECTIONS 251
In Example 2.40, we considered two train stations located at points s := ⟨8, 33⟩ and g := ⟨4, 42⟩. (See Figure 2.44(a).) In that example, we showed that, for an offset δ ∈ [0, 4], the Manhattan distance between the point ⟨4 + δ, y⟩ and s is smaller than the Manhattan distance between the point ⟨4 + δ, y⟩ and g when y < 35.5 + δ.
2.183 Show that the point ⟨16, 40⟩ is closer to one station under Manhattan distance, and to the other under Euclidean distance.
Let δ ≥ 0. Under Manhattan distance, describe the values of y for which the following point is closer to s than to g:
2.184 ⟨8 + δ, y⟩
2.185 ⟨4 − δ, y⟩
2.186 In the real-world island of Manhattan, the east–west blocks are roughly twice the length of the north–south blocks. As such, the more accurate picture of distances in the city is shown in Figure 2.44(b). Assuming it takes 1.5 minutes to walk a north–south (up–down) block and 3 minutes to walk an east–west (left–right) block, give a formula for the walking distance between ⟨x, y⟩ and Penn Station, at s := ⟨8, 33⟩.
A Voronoi diagram (named after the 20th-century Russian mathematician Georgy Voronoy) is a decomposition of the plane R^2 into regions based on a given set S of points. The region “belonging” to a point x ∈ S is {y ∈ R^2 : d(x, y) ≤ min_{z∈S} d(z, y)}, where d(·, ·) denotes Euclidean distance. In other words, the region “belonging” to point x is that portion of the plane that’s closer to x than any other point in S.
2.187 Compute the Voronoi diagram of the set of points {⟨0, 0⟩, ⟨4, 5⟩, ⟨3, 1⟩}. That is, compute:
• the set of points y ∈ R^2 that are closer to ⟨0, 0⟩ than ⟨4, 5⟩ or ⟨3, 1⟩ under Euclidean distance;
• the set of points y ∈ R^2 that are closer to ⟨4, 5⟩ than ⟨0, 0⟩ or ⟨3, 1⟩ under Euclidean distance; and
• the set of points y ∈ R^2 that are closer to ⟨3, 1⟩ than ⟨0, 0⟩ or ⟨4, 5⟩ under Euclidean distance.
2.188 Compute the Voronoi diagram of the set of points {⟨2, 2⟩, ⟨8, 1⟩, ⟨5, 8⟩}.
2.189 Compute the Voronoi diagram of the set of points {⟨0, 7⟩, ⟨3, 3⟩, ⟨8, 1⟩}.
2.190 (programming required) Write a program that takes three points as input and produces a representation of the Voronoi diagram of those three points as output.
Taking it further: Voronoi diagrams are used frequently in computational geometry, among other areas of computer science. (For example, a coffee-shop chain might like to build a mobile app that is able to quickly answer the question What store is closest to me right now? for any customer at any time. Voronoi diagrams can allow precomputation of these answers.) Given any set S of n points, it’s reasonably straightforward to compute (an inefficient representation of) the Voronoi diagram of those points by computing the line that’s equidistant between each pair of points, as you saw in the last few exercises. But there are cleverer ways of computing Voronoi diagrams more efficiently; see a good textbook on computational geometry for more.¹⁰
¹⁰ Mark de Berg, Marc van Kreveld, Mark Overmars, and Otfried Schwarzkopf. Computational Geometry. Springer-Verlag, 2nd edition, 2000.
Figure 2.44: Manhattan train stations. (a) The unscaled version. (b) The scaled version. [Both panels mark the stations at ⟨4, 42⟩ and ⟨8, 33⟩.]
Consider the following matrix:
M = [matrix layout lost in extraction; the surviving entries, in extraction order, are 3 9 2, 0 9 8, 6 2 0, 7 7 2 5 4 5 1 6, and a detached 7]
2.191 What size is M?
2.192 What is M_{3,1}?
2.193 List every ⟨i, j⟩ such that M_{i,j} = 7.
2.194 What is 3M?
Considering the following matrices, what are the values of the given expressions (if they’re defined)? (If the given quantity is undefined, say so, and say why.)

    A = [0 8 0]    B = [5 8]    C = [7 2 7]
        [9 6 0]        [7 5]        [3 5 6]
        [2 3 3]        [3 2]        [1 2 5]

    D = [3 1]    E = [8 4]    F = [1 2 9]
        [0 8]        [3 2]        [5 4 0]

2.195 A + C
2.196 B + F
2.197 D + E
2.198 A + A
2.199 −2D
2.200 0.5F
2.201 AB
2.202 AC
2.203 AF
2.204 BC
2.205 DE
2.206 ED
Consider the matrices

    A = [1 0 0]          B = [0 0 0]
        [1 0 0]   and        [0 1 0]
        [1 1 0]              [1 1 1].

2.207 What is 0.25A + 0.75B?
2.208 What is 0.5A + 0.5B?
2.209 Identify two other matrices C and D with the same average; that is, such that {A, B} ≠ {C, D} but 0.5A + 0.5B = 0.5C + 0.5D.
2.210 (programming required) A common computer graphics effect in the spirit of the last few exercises is morphing one image into another, that is, slowly changing the first image into the second. There are sophisticated techniques for this task, but a simple form can be achieved just by averaging. Given two n-by-m images represented by matrices A and B (say grayscale images, with each entry in [0, 1]), we can produce a “weighted average” of the images as λA + (1 − λ)B, for a parameter λ ∈ [0, 1]. See Figure 2.45. Write a program, in a programming language of your choice, that takes three inputs (an image A, an image B, and a weight λ ∈ [0, 1]) and produces a new image λA + (1 − λ)B. (You’ll need to research an image-processing library to use in your program.)
2.211 Let A be an m-by-n matrix. Let I be the n-by-n identity matrix. Explain why the matrix AI is identical to the matrix A.
If M is an n-by-n matrix, then the product of M with itself is also an n-by-n matrix. We write matrix powers in the normal way that we defined powers of integers (or of the Cartesian product of sets): M^k = M · M ··· M, multiplied k times. (M^0 is the n-by-n identity matrix I.) What are the following? (Hint: M^{2k} = (M^k)^2.)

2.212   [2 3]^3
        [1 1]

2.213   [1 1]^2
        [1 0]

2.214   [1 1]^4
        [1 0]

2.215   [1 1]^9
        [1 0]

Taking it further: The Fibonacci numbers are defined recursively as the sequence f_1 := 1, f_2 := 1, and f_n := f_{n−1} + f_{n−2} for n ≥ 3. The first several Fibonacci numbers are 1, 1, 2, 3, 5, 8, 13, . . .. As we’ll see in Exercises 5.56 and 6.99, there’s a very fast algorithm to compute the nth Fibonacci number based on computing the nth power of the matrix from Exercises 2.213–2.215.
Let A be an n-by-n matrix.
The inverse of A, denoted A^{−1}, is also an n-by-n matrix, with the property that AA^{−1} = I. There’s a general algorithm that one can develop to invert matrices, but in the next few exercises you’ll calculate inverses of some small matrices by hand.
Figure 2.45: Clubs to hearts (0%, 20%, 40%, 60%, 80%, and 100%).
2.216 Note that

    [1 1]   [x y]   [ x + z    y + w]
    [2 1] · [z w] = [2x + z   2y + w].

Thus the inverse of the matrix with rows [1 1] and [2 1] is the matrix with rows [x y] and [z w], where the following four conditions hold: x + z = 1 and y + w = 0 and 2x + z = 0 and 2y + w = 1. Find the values of x, y, w, and z that satisfy these four conditions.
Using the same approach as the last exercise, find the inverse of the following matrices:

2.217   [1 2]
        [3 4]

2.218   [0 1]
        [1 0]

2.219   [1 0]
        [0 1]

2.220 Not all matrices have inverses. For example, the matrix

    [1 1]
    [1 1]

doesn’t have an inverse. Explain why not.
An error-correcting code (see Section 4.2) is a method for redundantly encoding information so that the information can still be retrieved even in the face of some errors in transmission/storage. The Hamming code is a particular error-correcting code for 4-bit chunks of information. The Hamming code can be described using matrix multiplication: given a message m ∈ {0, 1}^4, we encode m as mG mod 2, where

        [1 0 0 0 0 1 1]
    G = [0 1 0 0 1 0 1]
        [0 0 1 0 1 1 0]
        [0 0 0 1 1 1 1].

(Here you should interpret the “mod 2” as an operation applied to each element of the output vector.) For example, [1, 1, 1, 1] · G = [1, 1, 1, 1, 3, 3, 3], so we’d encode [1, 1, 1, 1] as [1, 1, 1, 1, 3, 3, 3] mod 2 = [1, 1, 1, 1, 1, 1, 1]. What is the Hamming code encoding of the following messages?
2.221 [0, 0, 0, 0]
2.222 [0, 1, 1, 0]
2.223 [1, 0, 0, 1]
2.5 Functions
There is no passion like that of a functionary for his function.
Georges Clemenceau (1841–1929)
A function transforms an input value into an output value; that is, a function f takes an argument or parameter x, and returns a value f(x). Functions are familiar from both algebra and from programming.
In algebra, we frequently encounter mathematical functions like f(x) = x + 6, which means that, for example, we have f(3) = 9 and f(4) = 10. In programming, we often write or invoke functions that use an algorithm to transform an input into an output, like a function sort, so that sort(⟨3, 1, 4, 1, 5, 9⟩) = ⟨1, 1, 3, 4, 5, 9⟩, for example.
In this section, we will give formal definitions of functions and of some terminology related to functions, and also discuss a few special types of functions. (Functions themselves are a special case of relations, and we will revisit the definition of functions in Chapter 8 when we discuss relations.)
2.5.1 Basic Definitions
We start with the definition of a function itself:
Definition 2.44 (Function)
Let A and B be sets. A function f from A to B, written f : A → B, assigns to each input value a ∈ A a unique output value b ∈ B; the unique value b assigned to a is denoted by f(a). We sometimes say that f maps a to f(a).
Note that A and B are allowed to be the same set; for example, a function might have inputs and outputs that are both elements of Z. Here are two simple examples. First, we define a function not for Boolean inputs that maps True to False, and False to True:
Example 2.53 (Not function)
The function not : {True, False} → {True, False} can be defined with the table in Figure 2.46. Given an input x, we find the output value not(x) by locating x in the first column of the table and reading the value in that row’s second column. Thus not(True) = False and not(False) = True.

    x        not(x)
    True     False
    False    True

Figure 2.46: The function not.
As another simple example, we can also define a function square that returns its input multiplied by itself:
Example 2.54 (Square function)
The function square : R → R can be defined as square(x) := x^2: for any input x ∈ R, the output is the real number x^2. Thus, for example, square(8) = 64, because the function square assigns the output 8^2 = 64 to the input 8.
Note, too, that a function f : A → B might have a set A of inputs that are pairs; for example, the function that takes two numbers and returns their average is the function average : R × R → R, where average(⟨x, y⟩) := (x + y)/2. (We interpret R × R → R as (R × R) → R.) When there is no danger of confusion, we drop the angle brackets and simply write, for example, average(3, 2) instead of average(⟨3, 2⟩).
As we’ve already seen in Examples 2.53 and 2.54, the rule by which a function assigns an output to a given input can be specified either symbolically (typically via an algebraic expression) or exhaustively, by giving a table describing the input/output relationship. The table-based definition only makes sense when the set of possible inputs is finite; otherwise the table would have to be infinitely large. (And it’s only practical to define a function with a table if the set of possible inputs is pretty small!) Here’s an example of specifying the same function in two different ways, once symbolically and once using a table:
Example 2.55 (Doubling function)
Let’s define the function double that doubles its input value, for any input in {0, 1, . . . , 7}. (That is, we are defining a function double : {0, 1, . . . , 7} → Z.) We can write double symbolically by defining
    double(x) := 2 · x.
To define double using a table, we specify the output corresponding to every one of the 8 possible inputs, as shown in Figure 2.47.

    x    double(x)
    0    0
    1    2
    2    4
    3    6
    4    8
    5    10
    6    12
    7    14

Figure 2.47: The double function, specified using a table.
The functions that we’ve discussed so far are all fairly simple, but even simple functions can have some valuable applications. Here’s an example of another simple function that can be used in compressing images so that they take up less space:
Example 2.56 (Reducing the colorspace of an image)
The pixels in a grayscale image are all elements of {0, 1, . . . , 255}. To reduce the space requirements for a large image, we can consider a form of lossy compression (that is, compression that loses some amount of data) by replacing each pixel with one chosen from a smaller list of candidate colors. That is, instead of having 256 different shades of gray, we might have 128 or 64 or even fewer shades. Define quantize : {0, 1, . . . , 255} → {0, 1, . . . , 255} as follows:

    quantize(n) :=  26    if 0 ≤ n ≤ 51
                    78    if 52 ≤ n ≤ 103
                    130   if 104 ≤ n ≤ 155
                    182   if 156 ≤ n ≤ 207
                    234   if 208 ≤ n ≤ 255.

We can apply quantize to every pixel in a grayscale image, and then use a much smaller number of bits per pixel in storing the resulting image. See Figure 2.48 for an example.
[Figure 2.48 panels: (a) The function quantize. (b) An image of a house. (c) The same image, compressed to use only 5 shades of gray using the quantize function.]
Taking it further: A byte is a sequence of 8 bits. Using 8 bits, we can represent the numbers from 00000000 to 11111111, that is, from 0 to 255. Thus a pixel with {0, 1, . . . , 255} as possible grayscale values in an image requires one byte of storage for each pixel. If we don’t do something cleverer, a moderately sized 2048-by-1536 image (the size of an iPad) requires over 3 megabytes even if it’s grayscale. (A color image requires three times that amount of space.) Techniques similar to the compression function from Example 2.56 are used in a variety of CS applications, including, for example, automatic speech recognition, where each sample from a sound stream is stored using one of only, say, 256 different possible values instead of a floating-point number, which requires much more space.
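The piecewise definition of quantize in Example 2.56 can be written almost directly as a chain of comparisons. The following is a minimal Python sketch, assuming an image is stored as a list of rows of integer pixel values; that representation and the tiny sample image are made up for illustration and are not part of the example.

```python
def quantize(n):
    """Map a grayscale value in {0, ..., 255} to one of five shades (Example 2.56)."""
    if not 0 <= n <= 255:
        raise ValueError("pixel value must be in {0, ..., 255}")
    if n <= 51:
        return 26
    if n <= 103:
        return 78
    if n <= 155:
        return 130
    if n <= 207:
        return 182
    return 234

# A tiny, made-up "image": a list of rows of pixel values.
image = [[0, 60, 120], [170, 220, 255]]
compressed = [[quantize(p) for p in row] for row in image]
print(compressed)  # [[26, 78, 130], [182, 234, 234]]
```

Because only five output values can occur, each compressed pixel could in principle be stored using 3 bits rather than 8.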
Figure 2.48: A visual representation of the color-mapping function (each input color in the left column is assigned the corresponding color in the right column), applied to an example image. In PNG format, the file for the second image takes up less than 14% of the space consumed by the first image.
Domain and codomain
The domain and codomain of a function are its sets of possible inputs and outputs:
Definition 2.45 (Domain/codomain)
For a function f : A → B, the set A is called the domain of f, and the set B is called the codomain of f.
Let’s identify the domain and codomain from the previous examples of this section:
Example 2.57 (Some domains and codomains)
For the functions from Examples 2.53–2.56, we have:

    function                   domain               codomain
    not (Example 2.53)         {True, False}        {True, False}
    square (Example 2.54)      R                    R
    double (Example 2.55)      {0, 1, . . . , 7}    Z
    quantize (Example 2.56)    {0, 1, . . . , 255}  {0, 1, . . . , 255}

Note that for three of these functions, the domain and codomain are actually the same set; for the function double : {0, 1, . . . , 7} → Z, they’re different.
When the domain and codomain are clear from context (or they are unimportant for the purposes of a discussion), then they may be left unwritten.
Taking it further: This possibility of implicitly representing the domain and codomain of a function is also present in code. Some programming languages (like Java) require the programmer to explicitly write out the types of the inputs and outputs of a function; in some (like Python), the input and output types are left implicit. In Java, for example, one would write an isPrime function with the explicit declaration that the input is an integer (int) and the output is a Boolean (boolean). In Python, one would write the function without any explicit type information.
    boolean isPrime(int n) {
        /* code to check primality of n */
    }

    def isPrime(n):
        # code to check primality of n

But regardless of whether they’re written out or left implicit, these functions do have a domain (the set of valid inputs) and a codomain (the set of possible outputs).
Range/Image
For a function f : A → B, the set A (the domain) is the set of all possible inputs, and the set B (the codomain) is the set of all possible outputs. But not all of the possible outputs are necessarily actually achieved: in other words, there may be an element b ∈ B for which there’s no a ∈ A with f(a) = b. For example, we defined square : R → R in Example 2.54, but there is no real number x such that square(x) = −1. The range or image is the set of outputs that are actually achieved:
Definition 2.46 (Range/image)
The range or image of a function f : A → B is the set of all b ∈ B such that f(a) = b for some a ∈ A. Using the notation of Section 2.3, the range of f is the set {y ∈ B : there exists at least one x ∈ A such that f(x) = y}.
We’ll start with the four functions defined earlier in this section:
Example 2.58 (Some ranges)
For the functions from Examples 2.53–2.56, we have:

    function                   range
    not (Example 2.53)         {True, False}
    square (Example 2.54)      R_{≥0}
    double (Example 2.55)      {0, 2, 4, 6, 8, 10, 12, 14}
    quantize (Example 2.56)    {26, 78, 130, 182, 234}

For not, double, and quantize, the range is easy to determine: it’s precisely the set of values that appear in the “output” column of the table defining the function. For square, it’s clear that the range includes no negative numbers, because there’s no y ∈ R such that y^2 < 0. In fact, the range of square is precisely R_{≥0}: for any x ∈ R_{≥0}, there’s an input to square that produces x as output, specifically √x.
Here’s another example, for a slightly more complex function:
Example 2.59 (The smallest divisor function)
Problem: Define a function sd : Z_{≥2} → Z_{≥2} as follows. Given an input n ∈ Z_{≥2}, the value of sd(n) is the smallest integer k ≥ 2 that evenly divides n.
For example:
• sd(2) = 2 (because 2 | 2);
• sd(3) = 3 (because 3 | 3 but 2 ∤ 3);
• sd(4) = 2 (because 2 | 4); and
• sd(121) = 11 (because 11 | 121 but 2 ∤ 121, 3 ∤ 121, . . . , 10 ∤ 121).
What are the domain, codomain, and range of sd?
Solution: The domain and codomain of sd are easy to determine: they are both Z_{≥2}. Any integer n ≥ 2 is a valid input to sd, and we defined the function sd as producing an integer k ≥ 2 as its output. (The domain and codomain are simply written in the function’s definition, before and after the arrow in sd : Z_{≥2} → Z_{≥2}.)
The range is a bit harder to see, but it turns out to be the set P of all prime numbers. Let’s argue that P is the range of sd by showing that (i) every prime number p ∈ P is in the range of sd, and (ii) every number p in the range of sd is a prime number.
(i) Let p ∈ Z_{≥2} be any prime number. Then sd(p) = p: by the definition of primality, the only integers that evenly divide p are 1 and p itself (and 1 ≥ 2 isn’t true!). Therefore every prime number p is in the range of sd, because there’s an input to sd such that the output is p.
(ii) Let p be any number in the range of sd; that is, suppose sd(n) = p for some n. We will argue that p must be prime. Imagine that p were instead composite; that is, there is an integer k satisfying 2 ≤ k < p that evenly divides p. But then sd(n) = p is impossible: if p evenly divides n, then k also evenly divides n, and k < p, so k would be a smaller divisor of n. (For example, if n were evenly divisible by the composite number 15, then n would also be evenly divisible by 3 and 5, two factors of 15, so sd(n) ≠ 15.) Therefore every number in the range of sd is prime.
Putting together the facts from (i) and (ii), we conclude that the range of sd is precisely the set of all prime numbers.
We will also introduce a minor extension to the set-abstraction notation from Section 2.3.1 that’s related to the range of a function. (We used this notation informally in Example 2.28.) Consider a function f : A → B and a set U ⊆ A.
We denote by {f(x) : x ∈ U} the set of all output values of the function f when it’s applied to the elements x ∈ U:
Definition 2.47 (Set abstraction using functions)
For a function f : A → B and a set U ⊆ A, we write {f(x) : x ∈ U} as shorthand for the set {b ∈ B : there exists some u ∈ U for which f(u) = b}.
Remember that order and repetition of elements in a set don’t matter, which means that the set {f(x) : x ∈ A} is precisely the range of the function f : A → B.
Problem-solving tip: Example 2.59 illustrates a useful general technique if we wish to show that two sets A and B are equal. One nice way to establish that A = B is to show that A ⊆ B and B ⊆ A. That’s what we did to establish the range of sd in Example 2.59:
• define P as the set of all prime numbers.
• define R as the range of sd.
We showed in (i) that every element of P is in R (that is, P ⊆ R); and in (ii) that every element of R is in P (that is, R ⊆ P). Together these facts establish that R = P.
A visual representation of functions
The table-based and symbolic representations of functions that we’ve discussed fully represent a function, but sometimes a more visual representation of a function is clearer. Consider a function f : A → B. We can give a picture representing f by putting the elements of A into one column, the elements of B into a second column, and drawing an arrow from each a ∈ A to the value of f(a) ∈ B. Notice that the definition of a function guarantees that every element in the first column has one and only one arrow going from it to the second column: if f : A → B is a function, then every a ∈ A is assigned a unique output f(a) ∈ B. Here’s a simple example:
Example 2.60 (A picture of a function)
Figure 2.49 displays a function f : {1, . . . , 5} → {10, . . . , 15}, where f(1) = 10 and f(2) = f(4) = 11 and f(3) = 12 and f(5) = 13.
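A function with a small finite domain, like the one in Example 2.60, can also be stored in a program as a table of arrows. The following is a minimal Python sketch, assuming a dictionary representation (each key is an input and its value is the assigned output); the variable names are our own. The set comprehension mirrors the notation {f(x) : x ∈ A} from Definition 2.47.

```python
# The function f from Example 2.60, stored as a dictionary (input -> output).
f = {1: 10, 2: 11, 3: 12, 4: 11, 5: 13}
codomain = set(range(10, 16))      # B = {10, ..., 15}

domain = set(f)                    # the keys are exactly the inputs in A
image = {f[a] for a in domain}     # {f(a) : a in A}, the range of f

print(sorted(domain))              # [1, 2, 3, 4, 5]
print(sorted(image))               # [10, 11, 12, 13]
print(sorted(codomain - image))    # [14, 15]
```

A dictionary enforces the "unique output" part of the definition automatically, because each key can be associated with only one value.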
We can read the domain, codomain, and range directly from this picture: the domain is the set of elements in the first column; the codomain is the set of elements in the second column; and the range is the set of elements in the second column for which there is at least one incoming arrow. For instance, the range of f from Example 2.60 is {10, 11, 12, 13}. (There are no arrows pointing to 14 or 15, so these two numbers are in the codomain but not the range of f.)
Figure 2.49: A picture of a function f : A → B, where A = {1, . . . , 5} and B = {10, . . . , 15}.
Function composition
Suppose we have two functions f : A → B and g : B → C. Given an input a ∈ A, we can find f(a) ∈ B, and then apply g to map f(a) to an element of C, namely g(f(a)) ∈ C. This successive application of f and g defines a new function, called the composition of f and g, whose domain is A and whose codomain is C:
Definition 2.48 (Function composition)
For two functions f : A → B and g : B → C, the function g ◦ f : A → C maps an element a ∈ A to g(f(a)) ∈ C. The function g ◦ f is called the composition of f and g.
Notice a slight oddity of the notation: g ◦ f applies the function f first and the function g second, even though g is written first.
Here’s an example of the functions that result from composing two simple functions in four different ways:
Example 2.61 (Function composition, four ways)
Let f : R → R and g : R → R be defined by f(x) := 2x + 1 and g(x) := x^2.
1. The function g ◦ f, given an input x, produces output g(f(x)) = g(2x + 1) = (2x + 1)^2 = 4x^2 + 4x + 1.
2. The function f ◦ g maps x to f(g(x)) = f(x^2) = 2x^2 + 1.
3. The function g ◦ g maps x to g(g(x)) = g(x^2) = (x^2)^2 = x^4.
4. The function f ◦ f maps x to f(f(x)) = f(2x + 1) = 2(2x + 1) + 1 = 4x + 3.
As with many function-related concepts, the visual representation of functions gives a nice way of thinking about function composition: the function g ◦ f corresponds to the “short-circuiting” of the pictures of the functions f and g.
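Function composition carries over directly to code. The following is a minimal Python sketch of the four compositions in Example 2.61; the helper name compose is our own choice for illustration, not notation from the text. As with g ◦ f, the function written first is applied second.

```python
def compose(g, f):
    """Return the function g ∘ f, which applies f first and then g."""
    return lambda x: g(f(x))

def f(x):
    return 2 * x + 1

def g(x):
    return x ** 2

g_after_f = compose(g, f)   # x -> (2x + 1)^2
f_after_g = compose(f, g)   # x -> 2x^2 + 1

print(g_after_f(3))         # 49
print(f_after_g(3))         # 19
print(compose(g, g)(3))     # 81
print(compose(f, f)(3))     # 15
```

The four printed values agree with the closed forms in the example evaluated at x = 3: (2·3 + 1)^2 = 49, 2·3^2 + 1 = 19, 3^4 = 81, and 4·3 + 3 = 15.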
Here is a small example of visualizing function composition with pictures:
Example 2.62 (Function composition, by picture)
Figure 2.50 shows functions f : A → B and g : B → C. Their composition g ◦ f is given by following two arrows in the diagram. For example, the value of (g ◦ f)(1) is g(f(1)), which is g(11) because f(1) = 11. And g(11) = 24 because of g’s arrow from 11 to 24.
Figure 2.50: A picture of functions f : A → B and g : B → C, first separately and then pasted together. The third panel shows g ◦ f, based on successively following two arrows from the second panel. [Diagram columns: A = {1, . . . , 5}, B = {10, . . . , 15}, C = {20, 21, 23, 24}.]
2.5.2 Onto and One-to-One Functions
We now turn to two special categories of functions, onto and one-to-one functions, that are distinguished by how many different input values (always at least one? never more than one?) are mapped to each output value.
Onto functions
A function f : A → B is onto if every possible output in B is, in fact, an actual output:
Definition 2.49 (Onto functions)
A function f : A → B is called onto if, for every b ∈ B, there exists at least one a ∈ A for which f(a) = b. An onto function is also sometimes called a surjective function.
Alternatively, using the terminology of Section 2.5.1, a function f is onto if f’s codomain equals f’s range. As an example, here are two of our previous functions, one of which is onto and one of which isn’t:
Example 2.63 (An onto function)
The function not : {True, False} → {True, False} is onto: there’s an input value that produces True (namely False), and there’s an input value that produces False (namely True). Every element of the codomain is “hit” by not, so the function is onto.
Example 2.64 (A non-onto function)
The function quantize : {0, 1, . . . , 255} → {0, 1, . . . , 255} from Example 2.56 is not onto. Recall that the only output values achieved were {26, 78, 130, 182, 234}. For example, then, there is no value of x for which quantize(x) = 42. Thus 42 is not in the range of quantize, and therefore this function is not onto.
Here is a collection of a few more examples, where we’ll try to construct onto and non-onto functions meeting a certain description:
Example 2.65 (Sample onto/non-onto functions)
Problem: Let A := {0, 1, 2} and B := {3, 4}. Give an example of a function that satisfies the following descriptions; if there’s no such function, explain why it’s impossible.
1. an onto function f : A → B.
2. a function g : A → B that is not onto.
3. an onto function h : B → A.
Solution: The first two are possible, but the third is not:
1. Define f(0) := 3, f(1) := 4, and f(2) := 4.
2. Define g(0) := 3, g(1) := 3, and g(2) := 3.
3. Impossible! A function h whose domain is {3, 4} only has two output values, namely h(3) and h(4). For a function whose codomain is {0, 1, 2} to be onto, we need three different output values to be achieved. These two conditions cannot be simultaneously satisfied, so there is no onto function from B to A.
It may be easier to think about onto functions using the visual representation that we just introduced: a function f is onto if there’s at least one arrow pointing at every element in the second column. Figure 2.51 illustrates the functions from Example 2.65.1 and Example 2.65.2; the fact that f is onto and g is not onto is immediately visible.
Figure 2.51: An onto function f : {0, 1, 2} → {3, 4} and a non-onto function g : {0, 1, 2} → {3, 4}.
One-to-one functions
An onto function f : A → B guarantees that every element b ∈ B is “hit at least once” by f, that is, that b = f(a) for at least one a ∈ A. A one-to-one function f : A → B guarantees that every element b ∈ B is “hit at most once” by f. (Terminologically, a one-to-one function sits in contrast to a many-to-one function, in which many different input values map to the same output value. Thinking about what a many-to-one function would mean may help to make the name “one-to-one” more intuitive.)
Definition 2.50 (One-to-one functions)
A function f : A → B is called one-to-one if, for any b ∈ B, there is at most one a ∈ A such that f(a) = b. A one-to-one function is also sometimes called an injective function.
Taking it further: One of the many places that functions are used in computer science is in designing the data structure known as a hash table, discussed on p. 267. The idea is that we will store a piece of data called x in a location h(x), for some function h called a hash function. We want to choose h to ensure that this function is “not-too-many-to-one” so that no location has to store too much information.
As an example, we’ll consider two of our previous functions, double and quantize, and evaluate whether they are one-to-one:
Example 2.66 (A one-to-one function)
The function double : {0, 1, . . . , 7} → Z, defined in Example 2.55, is one-to-one. By examining the table of outputs for the function (reproduced in Figure 2.52), we see that no number appears more than once in the second column. Because every element of the codomain is “hit” by double at most once, the function is one-to-one.
Observe that double : {0, 1, . . . , 7} → Z is not onto, because there are elements of the codomain that are “hit” zero times, but it is one-to-one, because no element of the codomain is hit twice.
Here’s an example of a function that is not one-to-one:
Example 2.67 (A non–one-to-one function)
The function quantize : {0, 1, . . . , 255} → {0, 1, . . . , 255} from Example 2.56 is not one-to-one. Recall that quantize(42) = 26 and quantize(17) = 26. Thus 26 is the output for two or more distinct inputs, and therefore this function is not one-to-one.
As with the definition of onto, it may be easier to think about one-to-one functions using our visual two-column representation: a function f is one-to-one if there’s at most one arrow pointing at every element in the second column.
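For a function with a finite domain, both properties can be checked mechanically by counting the arrows that arrive at each output. The following is a minimal Python sketch, assuming functions are stored as dictionaries from inputs to outputs; the helper names is_onto and is_one_to_one are hypothetical names chosen for illustration. The sample functions f and g are the ones from Example 2.65, and double is the function from Example 2.55.

```python
from collections import Counter

def is_onto(f, codomain):
    """True if every element of the codomain is hit at least once."""
    return set(codomain) <= set(f.values())

def is_one_to_one(f):
    """True if no output value is hit more than once."""
    hits = Counter(f.values())
    return all(count <= 1 for count in hits.values())

B = {3, 4}
f = {0: 3, 1: 4, 2: 4}                 # f from Example 2.65
g = {0: 3, 1: 3, 2: 3}                 # g from Example 2.65
double = {x: 2 * x for x in range(8)}  # double from Example 2.55

print(is_onto(f, B), is_one_to_one(f))   # True False
print(is_onto(g, B), is_one_to_one(g))   # False False
print(is_one_to_one(double))             # True
```

Notice that is_onto needs the codomain as a separate argument: the dictionary alone records the range, but not which unreached outputs were allowed.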
Figure 2.53 shows two simple examples from this visual perspective: the function f is one-to-one, because no element of B has multiple incoming arrows, but the function g is not one-to-one, because 4 ∈ B has two incoming arrows.
Figure 2.53: A one-to-one function f and a non–one-to-one function g. [Diagram: in both panels the left column is {0, 1, 2} and the right column is {3, 4, 5, 6}.]

    x    double(x)
    0    0
    1    2
    2    4
    3    6
    4    8
    5    10
    6    12
    7    14

Figure 2.52: The double function.
One-to-one and onto functions
One way of restating the definitions of onto and one-to-one functions is as follows. Let f : A → B be a function. Then
• f is onto if, for every b ∈ B, we have |{a ∈ A : f(a) = b}| ≥ 1.
• f is one-to-one if, for every b ∈ B, we have |{a ∈ A : f(a) = b}| ≤ 1.
Therefore a function f : A → B that is both one-to-one and onto guarantees that |{a ∈ A : f(a) = b}| = 1; that is, for any b ∈ B, there is exactly one element a ∈ A so that f(a) = b. (There is at most one such a because f is one-to-one, and at least one such a because f is onto.) A function with both of these properties is called a bijection:
Definition 2.51 (Bijection)
A function f : A → B is called a bijection if f is one-to-one and onto, that is, if |{a ∈ A : f(a) = b}| = 1 for every b ∈ B.
Here are two examples of bijections:
Example 2.68 (Two bijections)
The function not : {True, False} → {True, False} from Example 2.53 and the function f : R → R defined by f(x) := x − 1 are both bijections.
For not, there’s exactly one input value whose output is True, namely False; and there’s exactly one input value whose output is False, namely True. Similarly, for f, for every b ∈ R, there is exactly one a such that f(a) = b: specifically, the value a = b + 1.
If f : A → B is a bijection, then every input in A is assigned by f to a unique value in B. We can define a new function, denoted f^{−1}, that reverses this assignment: given b ∈ B, the function f^{−1}(b) identifies the a ∈ A to which b was assigned by f.
This function f^{-1} is called the inverse of f:

Definition 2.52 (Function inverses)
Let f : A → B be a bijection. Then f^{-1} : B → A is a function called the inverse of f, where f^{-1}(b) = a whenever f(a) = b.

Here is an example of finding inverses of a few functions:

Example 2.69 (Three inverses)
Problem: What is the inverse of each of the following functions?
1. f : R → R, where f(x) = x/2.
2. square : R≥0 → R≥0, where square(x) = x^2.
3. not : {True, False} → {True, False}.
Solution:
1. We can find the function f^{-1}, the inverse of f, by solving the equation y = x/2 for x. We see that 2y = x. Thus the function f^{-1} : R → R is given by f^{-1}(y) = 2y. For any real number x ∈ R, we have that f(x) = x/2 and f^{-1}(x/2) = x. (For example, f(3) = 1.5 and f^{-1}(1.5) = 3.)
2. Notice that square : R≥0 → R≥0 is a bijection (otherwise this problem wouldn’t be solvable!) because the domain and the codomain are both equal to the set of nonnegative real numbers. (For example, 3^2 = 9 and (−3)^2 = 9; if we had allowed both negative and positive inputs, then square would not have been one-to-one. And there’s no x ∈ R such that x^2 = −9; if we had allowed negative outputs, then square would not have been onto.) The inverse of square is the function square^{-1}(y) = √y.
3. Note that not(not(True)) = not(False) = True and not(not(False)) = not(True) = False. Thus the inverse of the function not is the function not itself!

If f : A → B is a bijection, then, for any a ∈ A, observe that applying f^{-1} to f(a) gives a back as output: that is, f^{-1}(f(a)) = a. In other words, the function f^{-1} ◦ f is the identity function, defined by id : A → A where id(a) := a.

A bijection f : A → B has exactly one arrow coming into every element in the second column, and by definition it also has exactly one arrow leaving every element in the first column. The inverse of f is precisely the function that results from reversing the direction of each arrow.
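The arrow-reversal view of inverses can be sketched in Python for functions on small finite sets. This is only an illustration, not part of the text's development: the helper names (is_one_to_one, is_onto, inverse) are made up here, and because the codomain Z of double is infinite, the sketch assumes the finite stand-in {0, ..., 14} when testing for onto.

```python
def is_one_to_one(f, domain):
    """True if no output value is produced by two different inputs."""
    outputs = [f(a) for a in domain]
    return len(outputs) == len(set(outputs))


def is_onto(f, domain, codomain):
    """True if every element of the codomain is hit by at least one input."""
    return {f(a) for a in domain} == set(codomain)


def inverse(f, domain, codomain):
    """Reverse every arrow a -> f(a); this only makes sense for a bijection."""
    if not (is_one_to_one(f, domain) and is_onto(f, domain, codomain)):
        raise ValueError("only a bijection has an inverse")
    return {f(a): a for a in domain}


def double(x):
    return 2 * x


def logical_not(b):
    return not b


# double on {0, ..., 7}: one-to-one, but not onto the assumed codomain {0, ..., 14}.
print(is_one_to_one(double, range(8)))
print(is_onto(double, range(8), range(15)))

# not is a bijection on {True, False}; its inverse table maps each value back.
not_inverse = inverse(logical_not, [True, False], [True, False])
print(not_inverse)
```

Representing the inverse as a dictionary mirrors the two-column picture: each key is an element of B and its value is the unique element of A whose arrow points at it.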
(The fact that every right-hand column element has exactly one incoming arrow under f is precisely what guarantees that reversing the direction of each arrow still results in the arrow diagram of a function.) Figure 2.54 shows an example of a bijection and its inverse illustrated in this manner.

Figure 2.54: A bijection f : {0, 1, 2, 3} → {4, 5, 6, 7} and its inverse f^{-1} : {4, 5, 6, 7} → {0, 1, 2, 3}.

This picture-based approach should help to illustrate why a function that is not onto or that is not one-to-one fails to have an inverse. If f : A → B is not onto, then there exists some element b∗ ∈ B that’s never the value of f, so f^{-1}(b∗) would be undefined. On the other hand, if f is not one-to-one, then there exists b† such that f(a) = b† and f(a′) = b† for a ≠ a′; thus f^{-1}(b†) would have to be both a and a′, which is forbidden by the definition of a function.

2.5.3 Polynomials
We’ll turn now to polynomials, a special type of function whose input and output are both real numbers, and where f(x) is the sum of powers of x:

Definition 2.53 (Polynomial)
A polynomial is a function f : R → R of the form f(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_k x^k where each a_i ∈ R and a_k ≠ 0, for some k ∈ Z≥0. (More compactly, we can write this function as f(x) = ∑_{i=0}^{k} a_i x^i.) The real numbers a_0, a_1, . . . , a_k are called the coefficients of the polynomial, and the values a_0, a_1 x, a_2 x^2, . . . , a_k x^k being added together are called the terms of the polynomial.

Here are a few examples:

Example 2.70 (Some polynomials)
Here are a few polynomials: f(x) = 7x, g(x) = x^202 − 201x^111, and h(x) = x^2 − 2. The function h is graphed in Figure 2.55; in other words, for every x ∈ R, the point ⟨x, h(x)⟩ is drawn.

Figure 2.55: A graph of the polynomial h(x) = x^2 − 2.

There are two additional definitions related to polynomials that will be useful. The first is the degree of the polynomial p(x), which is the highest power of x in p’s terms:
Definition 2.54 (Degree)
The degree of a polynomial f(x) = ∑_{i=0}^{k} a_i x^i is the largest index i such that a_i ≠ 0; that is, the highest power of x with a nonzero coefficient.

Here are a few examples:

Example 2.71 (Some degrees)
For the polynomials f(x) = x + x^3 and g(x) = x^9, the degree of f is 3 and the degree of g is 9. For the polynomial p(x) with a_0 = 1, a_1 = 3, and a_2 = 0, the degree of p is 1, because p(x) = 1 + 3x + 0x^2 = 1 + 3x.

Some more examples of polynomials with small degrees (namely 0, 1, 2, 3, and 4) are shown in Figure 2.56.

Figure 2.56: Graphs of some polynomials of degree 0, 1, 2, 3, and 4. (a) Degree 0. (b) Degree 1. (c) Degree 2. (d) Degree 3. (e) Degree 4.

The second useful notion about a polynomial p(x) is a root, which is a value of x where the graph of p crosses the x-axis:

Definition 2.55 (Roots)
The roots of a polynomial p(x) are the values in the set {x ∈ R : p(x) = 0}.

Here are a few simple examples:

Example 2.72 (Some roots)
The roots of the polynomial f(x) = x + x^2 are 0 and −1. For the polynomial g(x) = x^9, the only root is 0.

A useful general theorem relates the number of different roots for a polynomial to its degree: a polynomial p with degree k has at most k different values of x for which p(x) = 0 (unless p is always equal to 0):

Theorem 2.3 ((Nonzero) polynomials of degree k have at most k roots)
Let p(x) be a polynomial of degree at most k. Then p has at most k roots unless p(x) is zero for every value x ∈ R.

When p(x) is zero for every value x ∈ R, we sometimes write p(x) ≡ 0 and say that p is identically zero.

We won’t give a formal proof of Theorem 2.3, but here’s one way to convince yourself of the basic idea. Think about how many times a polynomial of degree k can “change direction” from increasing to decreasing or from decreasing to increasing. Observe that a polynomial p must change directions between any two roots. (Draw a picture!)
A polynomial of degree 0 never changes direction, so it’s either always zero or never zero. A polynomial p(x) of degree d ≥ 1 can change directions only at a point where its slope is precisely equal to zero; that is, a point x where the derivative p′ of p satisfies p′(x) = 0. Using calculus, we can show that the derivative of a polynomial of degree d ≥ 1 is a polynomial of degree d − 1. The idea of a proof by mathematical induction is to combine the above intuition to prove the theorem.

Taking it further: Here’s some more detailed intuition of how to prove Theorem 2.3 using a proof by mathematical induction; see Chapter 5 for much more detail on this form of proof.
Think first about a degree-zero polynomial, that is, a constant function p(x) = a. The theorem is clear for this case: either a = 0 (in which case p(x) ≡ 0); or a ≠ 0, in which case p(x) ≠ 0 for any x. (See Figure 2.56(a).)
Now think about a degree-1 polynomial, that is, p(x) = ax + b for a ≠ 0. The derivative of p is a constant function, namely p′(x) = a ≠ 0. Imagine what it would mean for p to have two roots: as we move from smaller x to larger x, at some point r we cross the x-axis, say from p(r − ε) < 0 to p(r + ε) > 0. (See Figure 2.56(b).) In order to find another root larger than r, the function p would have to change from increasing to decreasing; in other words, there would have to be a point at which p′(x) = 0. But we just argued that a degree-zero polynomial like p′(x) that is not identically zero is never zero. So we can’t find another root.
Now think about a degree-2 polynomial, that is, p(x) = ax^2 + bx + c for a ≠ 0. After a root, p will have to change direction to head back toward the x-axis. That is, between any two roots of p, there must be a point where the derivative of p is zero: that is, there is a root of the degree-one polynomial p′(x) = 2ax + b between any two roots of p. But p′ has at most one root, as we just argued, so p has at most two roots.
And so forth! We can apply the same argument for degree 3, then degree 4, and so on, up to any degree k. (See Chapter 5.)
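The definitions of degree, roots, and the derivative used in this argument can be made concrete with a short Python sketch. It assumes a representation that the text does not prescribe: a polynomial is stored as its list of coefficients [a_0, a_1, ..., a_k], and the function names are chosen here only for illustration.

```python
def degree(coeffs):
    """Largest index i with coeffs[i] != 0, or None if the polynomial is identically zero."""
    for i in range(len(coeffs) - 1, -1, -1):
        if coeffs[i] != 0:
            return i
    return None


def evaluate(coeffs, x):
    """Compute a_0 + a_1*x + ... + a_k*x^k using Horner's rule."""
    total = 0
    for a in reversed(coeffs):
        total = total * x + a
    return total


def derivative(coeffs):
    """Coefficients of p'(x): each term a_i * x^i becomes i * a_i * x^(i-1)."""
    return [i * a for i, a in enumerate(coeffs)][1:]


f = [0, 1, 1]          # f(x) = x + x^2, from Example 2.72
p = [1, 3, 0]          # p(x) = 1 + 3x + 0x^2, from Example 2.71

print(degree(f), degree(p))
print(evaluate(f, 0), evaluate(f, -1))   # both candidate roots of f
print(degree(derivative(f)))             # the derivative has degree one less
```

The last line reflects the key step of the intuition above: taking the derivative of a polynomial of degree d ≥ 1 lowers the degree by exactly one.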
2.5.4 Algorithms
While functions are a valuable mathematical abstraction, computer scientists are fundamentally interested in computing things. So, in addition to the type of functions that we’ve discussed so far in this section, we will also often talk about mapping an input x to a corresponding output f(x) in the way that a computer program would, by computing the value of f(x) using an algorithm:

Definition 2.56 (Algorithm)
An algorithm is a step-by-step procedure to transform an input into an output.

In other words, an algorithm is a function, but specified as a sequence of simple operations, of the type that could be written as a program in your favorite programming language; in fact, these step-by-step procedures are even called functions in many programming languages. (It’s probably worth noting that it’s unusual for a book like this one to introduce algorithms in the context of functions. But, because the point of an algorithm really is to transform inputs into outputs, it can be helpful to think of an algorithm as a description of a function f that specifies how to calculate the output f(x) from a given input x, instead of simply describing what the value f(x) is.)
We will write algorithms in pseudocode, rather than in any particular programming language. In other words, we will specify the steps of the algorithm in a style that is neither Python nor Java nor English, but something in between; it’s written in a style that “looks” like a program, but is designed to communicate the steps to a human reader, rather than to a computer executing the code. We will aim to write pseudocode that can be interpreted straightforwardly by a reader who has used any modern programming language; we will always try to avoid getting bogged down in detailed syntax, and instead emphasize trying to communicate algorithms clearly. Translating the pseudocode for an algorithm into any programming language should be straightforward.
We will make use of the standard elements of any programming language in our pseudocode: conditionals (“if”), loops (“for” and “while”), function definitions and function calls (including recursive function calls), and functions returning values. We will use the symbol “:=” to denote assignment and the symbol “=” to denote equality testing, so that x := 3 sets the value of x to be 3, and x = 3 is True (if x is 3) or False (if x is not 3). We assume a familiarity with these basic programming constructs throughout the book.
We will spend significant energy later in the book on proving algorithms correct (Chapters 4 and 5), that is, showing that an algorithm computes the correct output for any given input, and on analyzing the efficiency of algorithms (Chapter 6). But here is one simple example to get us started:
Example 2.73 (Max finder)
An algorithm to find the index of the maximum element of a list is shown in Figure 2.57. (More properly, this algorithm finds the index of the first maximum element.)

Our notation of := for assignment and = for equality testing is borrowed from the programming language Pascal. In a lot of other programming languages, like C and Java and Python, assignment is expressed using = and equality testing is expressed using ==.

findMaxIndex(L):
Input: A list L with n ≥ 1 elements L[1], …, L[n].
Output: An index i such that L[i] is the maximum value in L.
    maxIndex := 1
    for i := 2 to n:
        if L[i] > L[maxIndex] then
            maxIndex := i
    return maxIndex
Figure 2.57: An algorithm to find the index of the maximum element of a list.
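As one illustration of how directly such pseudocode translates, here is a possible Python rendering of findMaxIndex. Two details are choices made for this sketch rather than part of the pseudocode: Python lists are indexed from 0 instead of 1, so the returned index is one less than the pseudocode's, and the n ≥ 1 requirement is enforced with an explicit error.

```python
def find_max_index(values):
    """Return the (0-based) index of the first maximum element of a nonempty list."""
    if not values:
        raise ValueError("the list must have at least one element")
    max_index = 0
    for i in range(1, len(values)):
        # Strict > keeps the earliest index when the maximum appears more than once.
        if values[i] > values[max_index]:
            max_index = i
    return max_index


print(find_max_index([3, 9, 2, 9]))
```

Note the use of = for assignment and > for comparison, in line with the remark above about how Python's notation differs from the pseudocode's := and =.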

2.5. FUNCTIONS 267
Computer Science Connections
Hash Tables and Hash Functions
Consider the following scenario: we have a set S of elements that we must store, each of which is chosen from a universe U of all possible elements. We need to be able to answer the question “is x in S?” quickly. (We might also have data associated with each x ∈ S, and seek to find the associated data rather than just determining membership.) Furthermore, the set S might change over time, either by insertion of a new element or deletion of an existing element. How might we efficiently organize the data to support these operations?
A hash table, one of the most frequently used data structures in computer science, is designed to store a set like S, as follows:
• we define a table T[1…n].
• we choose a hash function h : U → {1, …, n}.
• each element x ∈ S is stored in the cell T[h(x)].
There are several different choices about how to handle collisions, when we try to store two different elements in the same cell, but for simplicity let’s assume that we store them all in that cell, in a list. For example, see the hash function and hash table in Figure 2.58:
h(x) := (x^2 mod 10) + 1
(a) A hash table with hash function h: ten empty cells, numbered 1 through 10.
(b) The table, filled with 4, 2, 8, and 20.
To insert a value x into the table, we merely need to compute h(x) and place the value into the list in the cell T[h(x)]. Answering the question “is x stored in the table?” is similar; we compute h(x) and look through whatever entries are stored in that list. As a result, the performance of this data structure is almost entirely dependent on how many collisions are generated—that is, how long the lists are in the cells of the table.
A “good” hash function h : U → {1, . . . , n} is one that distributes the possible values of U as evenly as possible across the n different cells. The more evenly the function spreads out U across the cells of the table, the smaller the typical length of the list in a cell, and therefore the more efficiently the program would run. (Figure 2.58(c) shows that the above hash function is not a very good one.) Programming languages like Python and Java have built-in implementations of hash tables, and they use some mildly complex iterative arithmetic operations in their hash functions. But designing a good hash function for whatever kind of data you end up storing can be the difference between a slow implementation and a blazingly fast one.
Incidentally, there are two other concerns with efficiency: first, the hash function must be able to be computed quickly, and there’s also some cleverness in choosing the size of the table and in deciding when to rehash everything in the table into a bigger table if the lists get too long (on average).
(c) The table filled with {0, 1, . . . , 99}.
Figure 2.58: A hash table, empty and filled. If we’re asked to store 4 and 2 and 20 and 8, they would go into cells h(4) = (16 mod 10) + 1 = 7 and h(2) = 5 and h(20) = 1 and h(8) = 5. Panel (c) shows every element from the universe {0, 1, . . . , 99}; the fact that the number of elements per cell is so variable means that this hash function does a poor job of spreading out its inputs across the table.

Contents of panel (b):
cell 1: 20
cell 5: 2, 8
cell 7: 4
(all other cells are empty)

Contents of panel (c):
cell 1: 0, 10, 20, 30, 40, 50, 60, 70, 80, 90
cell 2: 1, 9, 11, 19, 21, 29, . . . , 91, 99
cell 5: 2, 8, 12, 18, 22, 28, . . . , 92, 98
cell 6: 5, 15, 25, 35, 45, 55, 65, 75, 85, 95
cell 7: 4, 6, 14, 16, 24, 26, . . . , 94, 96
cell 10: 3, 7, 13, 17, 23, 27, . . . , 93, 97
(cells 3, 4, 8, and 9 are empty)
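The hash table described above, with collisions resolved by keeping a list in each cell, can be sketched in a few lines of Python. This is a minimal illustration rather than how any real library implements hash tables: the class and method names are invented here, and the hash function is the one from Figure 2.58, generalized (as an assumption) to (x^2 mod n) + 1 for a table with n cells.

```python
class ChainedHashTable:
    """A hash table that stores all colliding elements in a per-cell list."""

    def __init__(self, n=10):
        self.n = n
        # Python lists are 0-indexed, so cell j of the text is self.cells[j - 1].
        self.cells = [[] for _ in range(n)]

    def h(self, x):
        """The hash function of Figure 2.58 when n = 10; it returns a cell in 1..n."""
        return (x * x) % self.n + 1

    def insert(self, x):
        cell = self.cells[self.h(x) - 1]
        if x not in cell:
            cell.append(x)

    def contains(self, x):
        # Only the single list in cell h(x) has to be searched.
        return x in self.cells[self.h(x) - 1]

    def cell_sizes(self):
        return [len(cell) for cell in self.cells]


table = ChainedHashTable()
for x in (4, 2, 8, 20):
    table.insert(x)
print(table.cells)

full = ChainedHashTable()
for x in range(100):
    full.insert(x)
print(full.cell_sizes())
```

Printing the cell sizes for the universe {0, 1, ..., 99} makes the unevenness of this hash function visible as numbers: some cells hold many elements while others hold none.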

2.5.5 Exercises
Consider the function f : {0, 1, …, 7} → {0, 1, …, 7} defined by f(x) := (x^2 + 3) mod 8.
2.224 What is f(3)?
2.225 What is f(7)?
2.226 For what x is f(x) = 3?
2.227 Redefine f using a table.

2.228 In Example 2.56, we introduced a function quantize for compressing a grayscale image to use only five different shades of gray. (See Figure 2.59 for a reminder of the function.) Using basic arithmetic notation (including ⌊ ⌋ and/or ⌈ ⌉ if appropriate), redefine quantize without using cases.

quantize(n) :=
    26     if 0 ≤ n ≤ 51
    78     if 52 ≤ n ≤ 103
    130    if 104 ≤ n ≤ 155
    182    if 156 ≤ n ≤ 207
    234    if 208 ≤ n ≤ 255
Figure 2.59: The function from Example 2.56.

Let’s generalize the quantization idea from the previous exercise to be a two-argument function, so that quantize(n, k) takes an input color n ∈ {0, 1, …, 255} and a number k of “quanta.” (We insist that 1 ≤ k ≤ 256.) In other words, k is the number of different equally spaced output values, and the input color n is translated to the closest of these k values. (The ranges associated with the quanta are only approximately equal because of issues of integrality: for example, in the k = 5 case from Figure 2.59, the first four quanta correspond to 52 different colors; the last quantum corresponds to only 256 − 52 · 4 = 48 different colors.)
2.229 What are the domain and range of quantize(n, k)?
2.230 Repeat Exercise 2.228 for quantize(n, k). You should ensure that quantize(n, 5) yields the function from Figure 2.59. (Hint: first determine how big a range of colors should be mapped to a particular quantum, rounding the size up. Then figure out which quantum the given input n corresponds to.)
2.231 A function f : A → B is said to be c-to-1 if, for every output value b ∈ B, there are exactly c different values a ∈ A such that f(a) = b. (These functions are useful in counting; see the Division Rule in Theorem 9.11.) For what values of k is it possible to define a c-to-1 (for some integer c) quantizing function that transforms {0, 1, . . . , 255} into a set of k quanta?
2.232 (programming required) Implement quantization for image files, in a programming language of your choice. Specifically, implement quantize(n, k), and apply it to every pixel of a given image. (You’ll need to research an image-processing library to use in your program.)

Many of the pieces of basic numerical notation that we’ve introduced can be thought of as functions. For each of the following, state the domain and range of the given function.
2.233 f(x) = |x|
2.234 f(x) = ⌊x⌋
2.235 f(x) = 2^x
2.236 f(x) = log_2 x
2.237 f(x) = x mod 2
2.238 f(x) = 2 mod x
2.239 f(x, y) = x mod y
2.240 f(x) = 2 | x
2.241 f(x) = ∥x∥
2.242 f(θ) = ⟨cos θ, sin θ⟩

2.243 Let T = {1, …, 12} × {0, 1, …, 59} denote the set of numbers that can be displayed on a digital clock in twelve-hour mode. Define a function add : T × Z≥0 → T so that add(t, x) denotes the time that’s x minutes later than t. Do so using only standard symbols from arithmetic.

Define the functions f(x) := x mod 10, g(x) := x + 3, and h(x) := 2x. What are the following? (That is, rewrite the definition of the given function using a single algebraic expression. For example, the function g ◦ g is given by the definition (g ◦ g)(x) = g(g(x)) = x + 6.)
2.244 f ◦ f
2.245 h ◦ h
2.246 f ◦ g
2.247 g ◦ h
2.248 h ◦ g
2.249 f ◦ h
2.250 f ◦ g ◦ h

Let f(x) := 3x + 1 and let g(x) := 2x. Identify a function h such that . . .
2.251 … g ◦ h and f are identical.
2.252 … h ◦ g and f are identical.

Which of the following functions f : {0, 1, 2, 3} → {0, 1, 2, 3} are onto?
2.253 f(x) = x
2.254 f(x) = x^2 mod 4
2.255 f(x) = x^2 − x mod 4
2.256 f(0) = 3, f(1) = 2, f(2) = 1, f(3) = 0
2.257 f(0) = 1, f(1) = 2, f(2) = 1, f(3) = 2

Which of the following functions f : {0, 1, 2, 3} → {0, 1, . . . , 7} are one-to-one?
2.258 f(x) = x^2 mod 8
2.259 f(x) = x^3 mod 8
2.260 f(x) = (x^3 − x) mod 8
2.261 f(x) = (x^3 + 2x) mod 8
2.262 f(0) = 3, f(1) = 1, f(2) = 4, f(3) = 1

A heap is a data structure that is used to represent a collection of items, each of which has an associated priority. (See p. 529.) A heap can be represented as a complete binary tree—a binary tree with no “holes” as you read in left-to-right, top-to-bottom order—but a heap can also be stored more efficiently as an array, in which the elements are stored in that same left-to-right and top-to-bottom order. (See Figure 2.60.) To do so, we define three functions that allow us to compute the index of the parent of a node; the index of the left child of a node; and the index of the right child of a node. (For example, the parent of the node labeled 8 in Figure 2.60 is labeled 9, the left child of the node labeled 8 is labeled 3, and the right child is labeled 5.) Here are the functions: given an index i into the array, we define
parent(i) := ⌊i/2⌋        left(i) := 2i        right(i) := 2i + 1.
For example, the node labeled 8 has index 2 in the array, and parent(2) = 1 (the index of the node labeled
9); left(2) = 4 (the index of the node labeled 3); and right(2) = 5 (the index of the node labeled 5).
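The three index functions can be tried out directly on the array from Figure 2.60. The sketch below is in Python and makes one representational assumption that the text does not: because Python lists start at index 0, an unused placeholder is stored in slot 0 so that the remaining indices match the 1-based indices used above.

```python
def parent(i):
    return i // 2          # integer division computes the floor of i/2 for i >= 1


def left(i):
    return 2 * i


def right(i):
    return 2 * i + 1


# The heap of Figure 2.60; slot 0 is unused padding so that indices start at 1.
heap = [None, 9, 8, 7, 3, 5, 6]

i = 2                      # the index of the node labeled 8
print(heap[parent(i)], heap[left(i)], heap[right(i)])
```

Storing the tree this way needs no explicit pointers between nodes: moving to a parent or a child is just arithmetic on the index.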
2.263 Suppose that we have a heap stored as an array A[1 . . . n]. State the domain and range of the function parent. Is parent one-to-one?
2.264 State the domain and range of left and right for the heap as stored in A[1 . . . n]. Are left and right one-to-one?
Give both a mathematical description and an English-language description of the meanings of the following heap-related functions. Assume for the purposes of these questions that the array A is infinite (that is, don’t worry about the possibility of encountering an i such that left(i) or right(i) is undefined).
Figure 2.60: A maximum heap, as a tree and as an array. (Tree: the root 9 has children 8 and 7; the node 8 has children 3 and 5; the node 7 has left child 6. Array, indices 1 through 6: 9, 8, 7, 3, 5, 6.)
2.265 parent ◦ left
2.266 parent ◦ right
2.267 left ◦ parent 2.268 right ◦ parent
What are the inverses of the following functions?
2.269 f : R → R, where f(x) = 3x + 1.
2.270 g : R≥0 → R≥0, where g(x) = x^3.
2.271 h : R≥0 → R≥1, where h(x) = 3^x.
2.272 Why doesn’t the function f : {0, …, 23} → {0, …, 11} where f(n) = n mod 12 have an inverse?
What are the degrees of the following polynomials?
2.273 p(x) = 3x^3 + 2x^2 + x + 0
2.274 p(x) = 9x^3
2.275 p(x) = 4x^4 + x^2 − (2x)^2
Suppose that p and q are polynomials, both with degree 7. What are the smallest and largest possible degrees of the following polynomials?
2.276 f(x) = p(x) + q(x)
2.277 f(x) = p(x) · q(x)
2.278 f(x) = p(q(x))
Give an example of a polynomial p of degree 2 such that . . .
2.279 . . . p has exactly 0 roots.
2.280 . . . p has exactly 1 root.
2.281 . . . p has exactly 2 roots.
2.282 The median of a list L of n numbers is the number in the “middle” of L in sorted order. Describe an algorithm to find the median of a list L. (Don’t worry about efficiency.) You may find it useful to make use of the algorithm in Figure 2.57.

2.6 Chapter at a Glance

Booleans, Numbers, and Arithmetic
A Boolean value is True or False. The integers Z are {…, −3, −2, −1, 0, 1, 2, 3, …}. The real numbers R are the integers and all numbers in between. The closed interval [a, b] consists of all real numbers x where a ≤ x ≤ b; the open interval (a, b) excludes a and b. The rational numbers Q are those numbers that can be represented as a/b for integers a and b ≠ 0. Here is some useful notation involving numbers:
• exponentiation: b^k is b · b · · · · · b, where b is multiplied k times;
• logarithms: log_b x is the number y such that b^y = x;
• absolute value: |x| is x for x ≥ 0, and |x| = −x for x < 0;
• floor and ceiling: ⌊x⌋ is the largest integer n ≤ x; ⌈x⌉ is the smallest integer n ≥ x;
• modulus: n mod k is the remainder when n is divided by k.
If n mod d = 0, then d is a factor of n or evenly divides n, written d | n. If 2 | n for a positive integer n, then n is even (“has even parity”); otherwise n is odd. An integer n ≥ 2 is prime if it has no positive integer factors other than 1 and n; otherwise n is composite. (Note that 0 and 1 are neither prime nor composite.)
For a collection of numbers x_1, x_2, . . . , x_n, their sum x_1 + x_2 + · · · + x_n is written formally as ∑_{i=1}^{n} x_i, and their product x_1 · x_2 · · · · · x_n is written ∏_{i=1}^{n} x_i.

Sets: Unordered Collections
A set is an unordered collection of objects called elements. A set can be specified by listing its elements inside braces, as {x_1, x_2, . . . , x_n}. A set can also be denoted by {x : P(x)}, which contains all objects x such that P(x) is true. The set of possible values x that are considered is the universe U, which is sometimes left implicit. Standard sets include the empty set {} (also written ∅), which contains no elements; the integers Z; the real numbers R; and the booleans {True, False}. We write Z≥0 = {0, 1, 2, . . .} and Z<0 = {−1, −2, . . .}, etc.
For a set A and an object x, the expression x ∈ A (“x is in A”) is true whenever x is in the set A. (So y ∈ {x : P(x)} whenever P(y) = True, and y ∈ {x_1, x_2, . . . , x_n} whenever x_i = y for some i.) The cardinality of a set A, written |A|, is the number of distinct elements in A.
Given two sets A and B, the union of A and B is A ∪ B = {x : x ∈ A or x ∈ B}. The intersection of A and B is A ∩ B = {x : x ∈ A and x ∈ B}. The set difference of A and B is A − B = {x : x ∈ A and x ∉ B}. The complement of a set A is ∼A = U − A = {x : x ∈ U and x ∉ A}, where U is the universe.
A subset of a set B is a set A such that every element of A is also an element of B; this relationship is denoted by A ⊆ B. If A is a subset of B, then B is a superset of A, written B ⊇ A. A proper subset of B is a set A that is a subset of B but A ≠ B, written A ⊂ B. Such a set B is a proper superset of A, written B ⊃ A. Two sets A and B are disjoint if A ∩ B = ∅. A partition of a set S is a collection of sets A_1, A_2, . . . , A_k, where A_1 ∪ A_2 ∪ · · · ∪ A_k = S and, for any distinct i and j, the sets A_i and A_j are disjoint. The power set of a set A, written P(A), is the set of all subsets of A.

Sequences, Vectors, and Matrices: Ordered Collections
A sequence (or tuple, (ordered) pair, triple, quadruple, . . . , n-tuple, . . . ) is an ordered collection of objects called components or entries, written inside angle brackets. The set A × B = {⟨a, b⟩ : a ∈ A and b ∈ B} is the Cartesian product of sets A and B; the set A × B contains all pairs where the first component comes from A and the second from B. For a set S and a number n ≥ 0, the set S^n denotes the n-fold Cartesian product of S with itself: S^n = S × S × · · · × S, where S occurs n times in this product.
A vector (or n-vector) is an element of R^n, for some positive integer n ≥ 2. (An element of R^1 = R is called a scalar.) A bit vector is an element of {0, 1}^n. Vectors are sometimes written in square brackets: x = [x_1, x_2, . . . , x_n]. For a vector x, write x_i to denote the ith component of x. (But x_i is meaningless unless i ∈ {1, 2, . . . , n}.) The size or dimensionality of x ∈ R^n is n. For a vector x ∈ R^n and a real number α ∈ R, the scalar product αx is a vector where (αx)_i = αx_i. For two vectors x, y ∈ R^n, the sum of x and y is a vector x + y, where (x + y)_i = x_i + y_i. The dot product of two vectors x, y ∈ R^n is x • y = ∑_{i=1}^{n} x_i y_i. Both x + y and x • y are meaningless unless x and y have the same dimensionality.
An n-by-m matrix M is an element of (R^n)^m, which is also sometimes written R^{n×m}. Such a matrix M has n rows and m columns, as in Figure 2.61. A matrix M ∈ R^{n×m} is square if n = m. For a size n, the identity matrix I ∈ R^{n×n} has ones on the main diagonal (the entries I_{i,i} = 1) and zeros everywhere else.

M =
    M_{1,1}  M_{1,2}  . . .  M_{1,m}
    M_{2,1}  M_{2,2}  . . .  M_{2,m}
    . . .
    M_{n,1}  M_{n,2}  . . .  M_{n,m}
Figure 2.61: A matrix.

Given a matrix M ∈ R^{n×m} and a real number α ∈ R, the matrix αM is specified by (αM)_{i,j} = αM_{i,j}. Given two matrices M, M′ ∈ R^{n×m}, the matrix M + M′ is specified by (M + M′)_{i,j} = M_{i,j} + M′_{i,j}. (The sum M + M′ is meaningless if M and M′ have different dimensions.) The product of two matrices A ∈ R^{n×m} and B ∈ R^{m×p} is a matrix AB ∈ R^{n×p} whose components are given by (AB)_{i,j} = ∑_{k=1}^{m} A_{i,k} B_{k,j}. (More compactly, (AB)_{i,j} = A_{i,(1...m)} • B_{(1...m),j}.) If the number of columns in A is different from the number of rows in B then AB is meaningless. The inverse of M is a matrix M^{-1} such that MM^{-1} = I (if any such matrix M^{-1} exists).

Functions
A function f : A → B maps every element a ∈ A to some element f(a) ∈ B. The domain of f is A and the codomain is B. The image or range of f is {f(x) : x ∈ A}, the set of elements of the codomain “hit” by some element of A according to f.
The composition of a function f : A → B and g : B → C is written g ◦ f : A → C, and (g ◦ f)(x) = g(f(x)). A function f : A → B is one-to-one or injective if f(x) = f(y) implies that x = y. The function f is onto or surjective if the image is equal to the codomain. If f : A → B is one-to-one and onto, it is bijective. For a bijection f : A → B, the function f^{-1} : B → A is the inverse of f, where f^{-1}(b) = a when f(a) = b.
A polynomial p : R → R is a function p(x) = a_0 + a_1 x + · · · + a_k x^k, where each a_i ∈ R is a coefficient. The degree of p is k. The roots of p are {x : p(x) = 0}. A polynomial of degree k that is not always zero has at most k different roots.
An algorithm is a step-by-step procedure that transforms an input into an output.
Key Terms and Results

Key Terms

Booleans, Numbers, Arithmetic
• booleans, integers, reals, rationals
• open intervals, closed intervals
• absolute value |x|, floor ⌊x⌋, ceiling ⌈x⌉
• exponentiation, logarithms
• modulus, remainder, divides
• even, odd, prime, parity
• summation ∑, product ∏
• nested summations, nested products

Sets
• set, element, membership, cardinality
• exhaustive enumeration
• set abstraction, universe
• the empty set ∅ = {}
• Venn diagram
• complement ∼, union ∪, intersection ∩
• set difference −
• (proper) subset, (proper) superset
• disjoint sets
• partitions
• power set

Sequences, Vectors, Matrices
• sequence, list, ordered pair, n-tuple
• Cartesian product
• vector, dot product
• matrix, identity matrix
• matrix multiplication
• matrix inverse

Functions
• domain, codomain, image/range
• function composition
• one-to-one, onto functions
• bijection, inverse
• polynomial, degree, roots
• algorithm

Key Results

Booleans, Numbers, and Arithmetic
1. The value of b^n is b · b · · · b, multiplied together n times. If n < 0, then b^n = 1/(b^{−n}). For rational exponents, b^{1/m} is the number x such that x^m = b, and b^{n/m} = (b^{1/m})^n.
2. For a positive real number b ≠ 1 and a real number x > 0, the quantity log_b x (the log base b of x) is the real number y such that b^y = x.
3. Consider integers k > 0 and n. Then k | n (“k divides n”) if n/k is an integer or, equivalently, if n mod k = 0.
4. As long as the terms being added remain unchanged, we can reindex a summation (for example, shifting the variable over which the sum is taken, or reversing the order of nested sums) without affecting the total value of the sum. The same is true for products.

Sets: Unordered Collections
1. A set can be specified using exhaustive enumeration (a list of its elements), or by abstraction (a condition describing when an object is an element of the set).
2. Two sets S and T are equal if every element of S is an element of T and every element of T is an element of S.

Sequences, Vectors, and Matrices
1. For vectors x, y ∈ R^n, the dot product of x and y is x • y = ∑_{i=1}^{n} x_i y_i.
2. The product AB of two matrices A ∈ R^{n×m} and B ∈ R^{m×p} is an n-by-p matrix M ∈ R^{n×p} whose components are given by M_{i,j} = ∑_{k=1}^{m} A_{i,k} B_{k,j}.

Functions
1. A one-to-one and onto function f : A → B has an inverse function f^{-1} : B → A, where f(a) = b precisely when f^{-1}(b) = a.
2. A polynomial of degree k that is not always zero has at most k different roots.

3 Logic
In which our heroes move carefully through the marsh, making sure that each step follows safely from the one before it.

3.1 Why You Might Care
How fondly dost thou reason!
William Shakespeare (1564–1616)
The Comedy of Errors
Logic is the study of truth and falsity, of theorem and proof, of valid reasoning in any context. In this chapter, we focus on formal logic, in which it is the “form” of the argument that matters, rather than the “content.” This chapter will introduce the two major types of formal logic:
• propositional logic (Sections 3.2 and 3.3), in which we will study the truth and falsity of statements, how to construct logical statements from basic logical operators (like and and or), and how to reason about those statements.
• predicate logic (Sections 3.4 and 3.5), which gives us a framework to write logical statements of the form “every x . . .” or “there’s some x such that . . ..”
One of our main goals in this chapter will be to define a precise, formal, and unambiguous language to express reasoning, in which writer and reader agree on what each word means.
Logic is the foundation of all of computer science; it’s the reasoning that you use when you write the condition of an if statement or when you design a circuit to add two 32-bit integers or when you design a program to beat a grandmaster at chess. Because logic is the study of valid reasoning, any endeavor in which one wishes to state and justify claims rigorously (such as that of this book) must at its core rely on logic. Every condition that you write in a loop is a logical statement. When you sit down to write binary search in Python, it is through a (perhaps tacit) use of logical reasoning that you ensure that your code works properly for any input. When you use a search engine to look for web pages on the topic “beatles and not john or paul or george or ringo” you’ve implicitly used logical reasoning to select this particular query. Solving a Sudoku puzzle is nothing more and nothing less than following logical constraints to their conclusion. The central component of a natural language processing (NLP) system is to take an utterance by a human user that’s made in a “natural” language like English and “understand” what it means; and understanding what a sentence means is essentially the same task as understanding the circumstances under which the sentence is true, and thus is a question of logic.
And these are just a handful of examples; for a computer scientist, logic is the basis of the discipline. Indeed, the processor of a computer is built up from almost unthinkably simple logical components: wires and physical implementations of logical operations like and, or, and not. Our main goal in this chapter will be to introduce the basic constructs of logic. But along the way, we will encounter applications of logic to natural language processing, circuits, programming languages, optimizing compilers, and building artificially intelligent systems to play chess and other games.

3.2 An Introduction to Propositional Logic
Everyone wishes to have truth on his side, but not everyone wishes to be on the side of truth.
Richard Whately (1787–1863)
A proposition is a statement that is either true or false—In December 2012, Facebook had over one billion users or Java is a programming language that uses indentation to denote block structure, for example. Propositional logic is the study of propositions, including how to formulate statements as propositions, how to evaluate whether a proposition is true or false, and how to manipulate propositions. The goal of this section is to introduce propositions—including related terminology, standard notation, and some techniques for reasoning about propositions.
3.2.1 Propositions and Truth Values
We’ll begin, briefly, with propositions themselves:
Definition 3.1 (Propositions and Truth Values)
A proposition is a statement that is either true or false. For a particular proposition p, the truth value of p is its truth or falsity.
A proposition is also sometimes called a Boolean expression or a Boolean formula. (See Section 2.2.1.) A proposition is written in English as a declarative sentence, the kind of sentence that usually ends with a period. (Questions and demands—like Did you try binary search? or Use quicksort!—aren’t the kinds of things that are true or false, and so they’re not propositions.) Here are a few examples:
Example 3.1 (Some sample propositions)
The following statements are all propositions:
1. 2 + 2 = 4.
2. 33 is a prime number.
3. Barack Obama is the 44th person to be president of the United States.
4. Every even integer greater than 2 can be written as the sum of two prime numbers.
(The last of these propositions is called Goldbach’s conjecture; it’s more complicated than the other propositions in this example, and we’ll return to it in Section 3.4.)
Let’s determine the above propositions’ truth values:
Example 3.2 (Determining truth values)
Problem: What are the truth values of the propositions from Example 3.1?
Solution: These propositions’ truth values are
1. True. It really is the case that 2 + 2 equals 4.
2. False. The integer 33 is not a prime number because 33 = 3 · 11. (Prime numbers are evenly divisible only by 1 and themselves; 33 is evenly divisible by 3 and 11.)
3. False. Although Barack Obama is called president #44, Grover Cleveland was president #22 and #24. So Barack Obama is actually the 43rd person to be president of the United States, not the 44th.
4. Unknown(!). Goldbach’s conjecture was first made in 1742, but has thus far resisted proof—or disproof! It’s easy to check that particular small even integers can be written as the sum of two prime numbers; for example, 4 = 2 + 2, 6 = 3 + 3, 8 = 3 + 5, 10 = 3 + 7, and so on. But is it true for all even integers greater than 2? We simply don’t know! Many even integers have been tested, and no violation has been found in any of these tests. But, for all we know, the next even integer we test might not be expressible as the sum of two primes. See Example 3.47 and Exercises 3.178–3.181.
Before we move on from Example 3.2, there’s an important point to make about statements that have an unknown truth value. Even though we don’t know the truth value of Goldbach’s conjecture, it is still a proposition and thus it has a truth value. That is, Goldbach’s conjecture is indeed either true or false; it’s just that we don’t know which it is. (Like the proposition The person currently sitting next to you is wearing clean underwear: it has a truth value, you just don’t know what truth value it has.)
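To make the contrast between “known” and “unknown” truth values concrete, here is a minimal Python sketch. The helper names is_prime and goldbach_witness are our own, chosen for illustration. The first two propositions of Example 3.1 can be evaluated outright, while Goldbach’s conjecture can only be spot-checked on finitely many even integers; passing such a check is evidence, not a proof.

```python
def is_prime(n: int) -> bool:
    """Return True if n is prime, using simple trial division."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def goldbach_witness(n: int):
    """Return primes (a, b) with a + b == n and a <= b, or None if there are none."""
    for a in range(2, n // 2 + 1):
        if is_prime(a) and is_prime(n - a):
            return (a, n - a)
    return None


# Propositions 1 and 2 of Example 3.1 have truth values we can simply compute.
print(2 + 2 == 4)      # True
print(is_prime(33))    # False, because 33 = 3 * 11

# Proposition 4 can be tested on a finite range, but never settled this way.
print(goldbach_witness(10))  # (3, 7)
print(all(goldbach_witness(n) is not None for n in range(4, 1001, 2)))  # True
```

The final line reports only that no counterexample exists among the even integers from 4 to 1000; it says nothing about larger ones.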
Taking it further: Goldbach’s conjecture stands in contrast to declarative sentences whose truth is ill-defined—for example, This book is boring and Logic is fun. Whether these claims are true or false depends on the (imprecise) definitions of words like boring and fun. We’re going to de-emphasize subtle “shades of truth” questions of this form throughout the book, but see p. 314 for some discussion, including the role of ambiguity in software systems that interact with humans via English language input and output.
There is also a potentially interesting philosophical puzzle that’s hiding in questions about the truth values of natural-language utterances. Here’s a silly (but obviously true) statement: The sentence “snow is white” is true if and only if snow is white. (Of course!) This claim becomes a bit less trivial if the embedded proposition is stated in a different language—Spanish or Dutch, say: The sentence “La nieve es blanca” is true if and only if snow is white; or The sentence “Sneeuw is wit” is true if and only if snow is white. But there’s a troubling paradox lurking here. Surely we would like to believe that the English sentence x and the French translation of the English sentence x have the same truth value. For example, Snow is white and La neige est blanche surely are both true, or they’re both false. (And, in fact, it’s the former.) But this belief leads to a problem with certain self-referential sentences: for example, This sentence starts with a ‘T’ is true, but Cette phrase commence par un ‘T’ is, surely, false.1
1 For more on paradoxes and puzzles of translation, see Douglas Hofstadter. Le Ton Beau de Marot: In Praise of the Music of Language. Basic Books, 1998; and R. M. Sainsbury. Paradoxes. Cambridge University Press, 3rd edition, 2009.
3.2.2 Atomic and Compound Propositions
We will distinguish between two types of propositions, those that cannot be broken down into conceptually simpler pieces and those that can be:
Definition 3.2 (Atomic and compound propositions)
An atomic proposition is a proposition that is conceptually indivisible. A compound proposition is a proposition that is built up out of conceptually simpler propositions.

Here’s a simple example of the difference:
Example 3.3 (Atomic and compound propositions)
The University of Minnesota’s mascot is the Badger is an atomic proposition, because it is not conceptually divisible into any simpler claim.
The University of Washington’s mascot is the Duck or the University of Oregon’s mascot is the Duck is a compound proposition, because it is conceptually divisible into two simpler claims—namely The University of Washington’s mascot is the Duck and The University of Oregon’s mascot is the Duck.
Atomic propositions are also sometimes called Boolean variables; see Section 2.2.1. A compound proposition that contains Boolean variables p1, . . . , pk is sometimes called a Boolean expression or Boolean formula over p1 , . . . , pk .
Example 3.4 (Password validity as a compound proposition)
A certain small college sends the following instructions to its users when they are required to change their password:
Your password is valid only if it is at least 8 characters long, you have not previously used it as your password, and it contains at least three different types of characters (lowercase letters, uppercase letters, digits, non-alphanumeric characters).
This compound proposition involves seven different atomic propositions:
• p: the password is valid
• q: the password is at least 8 characters long
• r: the password has been used previously by you
• s: the password contains lowercase letters
• t: the password contains uppercase letters
• u: the password contains digits
• v: the password contains non-alphanumeric characters
The form of the compound proposition is “p, only if q and not r and at-least-three- of {s, t, u, v} are true.” (Later we’ll see how to write this compound proposition in standard logical notation; see Example 3.15.)
3.2.3 Logical Connectives
Logical connectives are the glue that creates the more complicated compound proposi- tions from simpler propositions. Here are definitions of our first three of these logical connectives—not, and, and or:
Definition 3.3 (Negation (not): ¬)
The proposition ¬p (“not p,” called the negation of the proposition p) is true when the proposition p is false, and is false when p is true.

Definition 3.4 (Conjunction (and): ∧)
The proposition p ∧ q (“p and q,” the conjunction of the propositions p and q) is true when both of the propositions p and q are true, and is false when one or both of p or q is false.
Definition 3.5 (Disjunction (or): ∨)
The proposition p ∨ q (“p or q,” the disjunction of the propositions p and q) is true when one or both of the propositions p or q is true, and is false when both p and q are false.
In the conjunction p ∧ q, the propositions p and q are called conjuncts; in p ∨ q, they are called disjuncts. Here’s a simple example:
Example 3.5 (Some simple compound propositions)
Let p denote the proposition Ohio State’s mascot is the Buckeye and let q denote the proposition Michigan’s mascot is the Wolverine. Then:
• ¬q denotes the proposition Michigan’s mascot is not the Wolverine.
• p ∧ q denotes the proposition Ohio State’s mascot is the Buckeye, and Michigan’s mascot is the Wolverine.
• p ∨ q denotes the proposition Ohio State’s mascot is the Buckeye, or Michigan’s mascot is the Wolverine.
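Definitions 3.3–3.5 correspond directly to the not, and, and or operators found in most programming languages. As a small sketch in Python (the function name connectives is ours, not standard), we can evaluate the three compound propositions of Example 3.5 under each possible pair of truth values for p and q:

```python
def connectives(p: bool, q: bool) -> dict:
    """Evaluate the three compound propositions of Example 3.5 for given p and q."""
    return {
        "not q": not q,      # negation
        "p and q": p and q,  # conjunction
        "p or q": p or q,    # disjunction (inclusive)
    }


# Try every combination of truth values for the two atomic propositions.
for p in (True, False):
    for q in (True, False):
        print(p, q, connectives(p, q))
```

Note that the conjunction is true in exactly one of the four cases, and the disjunction is false in exactly one.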
Here’s an example of translating some English statements that express compound propositions into standard logical notation:
Example 3.6 (From English statements to compound propositions)
Problem: Translate each of the following statements into logical notation. (Name the atomic propositions using appropriate Boolean variables.)
1. Carissa is majoring in computer science and studio art.
2. Either Dave took a formal logic class, or he is a quick learner.
3. Eli broke his hand and didn’t take the test as scheduled.
4. Fred knows Python or he has programmed in both C and Java.
Solution: Let’s first name the atomic propositions within these English statements:
p = Carissa is majoring in computer science.
q = Carissa is majoring in studio art.
r = Dave took a formal logic class.
s = Dave is a quick learner.
t = Eli broke his hand.
u = Eli took the test as scheduled.
v = Fred knows Python.
w = Fred has programmed in C.
x = Fred has programmed in Java.
We can now translate the four given statements as: (1) p ∧ q; (2) r ∨ s; (3) t ∧ ¬u; and (4) v ∨ (w ∧ x).
The prefix con- means “together” and dis- means “apart.” (Junct means “join.”) The conjunction p ∧ q is true when p and q are true together; the disjunction p ∨ q is true when p is true “apart from” q, or the other way around.
To help keep the symbols straight, it may be helpful to notice that the symbol ∧ is the angular version of the symbol ∩ (intersection), while the symbol ∨ is the angular version of the symbol ∪ (union). The set S ∩ T is the set of all elements contained in S and T; the set S ∪ T is the set of all elements contained in S or T.
Implication (if/then)
Another important logical connective is ⇒, which denotes implication. It expresses a familiar idea from everyday life, though one that’s not quite captured by a single English word. Consider the sentence If you scratch my back, then I’ll scratch yours. It’s easiest to think of this sentence as a promise: I’ve promised that I’ll scratch your back as long as you scratch mine. I haven’t promised anything about what I’ll do if you fail to scratch my back—I can abstain from back scratching, or I might generously scratch your back anyway, but I haven’t guaranteed anything. (You’d justifiably call me a liar if you scratched my back and I failed to scratch yours in return.) This kind of promise is expressed as an implication in propositional logic:
Definition 3.6 (Implication: ⇒)
The proposition p ⇒ q is true when the truth of p implies the truth of q. In other words, p ⇒ q is true unless p is true and q is false.
In the implication p ⇒ q, the proposition p is called the antecedent or the hypothesis, and the proposition q is called the consequent or the conclusion.
One initially confusing aspect of logical implication is that the word “implies” seems to hint at something about causation—but p ⇒ q doesn’t actually say anything about p causing q, only that p being true implies that q is true (or, in other words, p being true lets us conclude that q is true).
Here are a few examples of statements involving implication:
Example 3.7 (Some implications)
The following propositions are all true:
• 1 + 1 = 2 implies that 2 + 3 = 5. (“True implies True” is true.)
• 2 + 3 = 4 implies that 2 + 2 = 4. (“False implies True” is true.)
• 2 + 3 = 4 implies that 2 + 3 = 6. (“False implies False” is true.)
But the following proposition is false:
• 2 + 2 = 4 implies that 2 + 1 = 5. (“True implies False” is false.)
This last proposition is false because 2 + 2 = 4 is true, but 2 + 1 = 5 is false.
There are many different ways to express the proposition p ⇒ q in English, including all of those in Figure 3.1.
“p implies q”
“if p, then q”
“p only if q”
“q whenever p”
“q, if p”
“q is necessary for p”
“p is sufficient for q”
Figure 3.1: Some ways of expressing p ⇒ q in English.
Here is an example of the same implication being stated in English in many different ways:
Example 3.8 (Expressing implications in English)
According to United States law, people who can legally vote must be American citizens, and they must also satisfy various other conditions that differ from state to state (for example, registering in advance or not being a felon). Thus the following compound proposition is true:
you are a legal U.S. voter ⇒ you are an American citizen.
All of the following sentences express this proposition in English:
If you are a legal U.S. voter, then you are an American citizen.
You being a legal U.S. voter implies that you are an American citizen.
You are a legal U.S. voter only if you are an American citizen.
You are an American citizen if you are a legal U.S. voter.
You are an American citizen whenever you are a legal U.S. voter.
You being an American citizen is necessary for you to be a legal U.S. voter.
You being a legal U.S. voter is sufficient for you to be an American citizen.
Most of these sentences are reasonably natural ways to express the stated implication, though the last phrasing seems awkward. But it’s easier to understand if we slightly rephrase it as “You being a legal U.S. voter is sufficient for one to conclude that you are an American citizen.”
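Python has no built-in implication operator, but Definition 3.6 (“true unless p is true and q is false”) suggests one simple way to write our own. The following sketch defines a helper that we have named implies, and rechecks the four propositions of Example 3.7:

```python
def implies(p: bool, q: bool) -> bool:
    """p => q is false only when p is true and q is false."""
    return (not p) or q


# The three true implications from Example 3.7:
print(implies(1 + 1 == 2, 2 + 3 == 5))  # True  (True implies True)
print(implies(2 + 3 == 4, 2 + 2 == 4))  # True  (False implies True)
print(implies(2 + 3 == 4, 2 + 3 == 6))  # True  (False implies False)

# The one false implication:
print(implies(2 + 2 == 4, 2 + 1 == 5))  # False (True implies False)
```

Writing the implication as (not p) or q is only one of several equivalent formulations; it has the advantage of mirroring the definition’s “unless” phrasing.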
Here’s another example of restating implications:
Example 3.9 (More implications in English)
Consider the proposition
The nondisclosure agreement is valid only if you signed it,
where p denotes the nondisclosure agreement is valid and q denotes you signed it.
(This statement is different from “if you signed, then the agreement is valid”: for example, the agreement might not be valid because you’re legally a minor and thus not legally allowed to sign away rights.) We can restate p ⇒ q as “if p then q”:
If the nondisclosure agreement is valid, then you signed it.
We can also restate this implication equivalently—and perhaps more intuitively— using the so-called contrapositive ¬q ⇒ ¬p (see Example 3.21):
The nondisclosure agreement is invalid if you didn’t sign it.
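With only two variables, we can confirm by brute force that an implication and its contrapositive always agree, while an implication and its converse (“if you signed, then the agreement is valid”) need not. The sketch below uses helper names of our own choosing (implies, contrapositive_always_agrees, converse_disagreements):

```python
from itertools import product

BOOLS = [True, False]


def implies(p: bool, q: bool) -> bool:
    """p => q is false only when p is true and q is false."""
    return (not p) or q


def contrapositive_always_agrees() -> bool:
    """Check p => q against (not q) => (not p) under all four truth assignments."""
    return all(implies(p, q) == implies(not q, not p)
               for p, q in product(BOOLS, repeat=2))


def converse_disagreements():
    """List the (p, q) pairs for which p => q and q => p differ."""
    return [(p, q) for p, q in product(BOOLS, repeat=2)
            if implies(p, q) != implies(q, p)]


print(contrapositive_always_agrees())  # True
print(converse_disagreements())        # [(True, False), (False, True)]
```

The pair (False, True) in the second list corresponds to the scenario described above: you signed (q is true), but the agreement is not valid (p is false).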
“Exclusive or” and “if and only if”
The four logical connectives that we have defined so far (¬, ∨, ∧, and ⇒) are the ones that are most frequently used, but we’ll define two other common connectives too. The first is exclusive or:
Definition 3.7 (Exclusive or: ⊕)
The proposition p ⊕ q (“p exclusive or q” or, more briefly, “p xor q”) is true when one of the propositions p or q is true, but not both. Thus p ⊕ q is false when both p and q are true, and when both p and q are false.
The connective ⊕ is usually pronounced like “ex ore” (a former significant other + some rock with high precious-metal content).
When we want to emphasize the distinction between ∨ and ⊕, we refer to ∨ as inclusive or. This terminology highlights the fact that p ∨ q includes the possibility that both p and q are true, while p ⊕ q excludes that possibility. Unfortunately, the word “or” in English can mean either inclusive or exclusive or, depending on the context in which it’s being used. When you see the word “or,” you’ll have to think carefully about which meaning is intended.
Here’s an example of distinguishing inclusive and exclusive or:

Example 3.10 (Inclusive versus exclusive or in English)
Problem: Translate these statements from a cover letter for a job into logical notation:
You may contact me by email or by phone. I am available for an on-site day-long interview on October 8th in Minneapolis or Hong Kong.
Use the following Boolean variables:
p = you may contact me by phone
q = you may contact me by email
r = I am physically available for an interview in Minneapolis
s = I am physically available for an interview in Hong Kong
Solution: The “or” in “email or phone” is inclusive, because you could receive both an email and a call. However, the “or” in “Minneapolis or Hong Kong” is exclusive, because it’s not physically possible to be simultaneously present in Minneapolis and Hong Kong. Thus a correct translation of these statements is (p ∨ q) ∧ (r ⊕ s).
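For Boolean values in Python, exclusive or can be expressed as “the two values differ.” The sketch below (with function names xor and cover_letter that we have made up for this example) evaluates the translation (p ∨ q) ∧ (r ⊕ s) under a couple of truth assignments:

```python
def xor(p: bool, q: bool) -> bool:
    """Exclusive or: true when exactly one of p and q is true."""
    return p != q


def cover_letter(p: bool, q: bool, r: bool, s: bool) -> bool:
    """Evaluate (p or q) and (r xor s), the translation from Example 3.10."""
    return (p or q) and xor(r, s)


# Reachable by both phone and email, and available in Minneapolis only:
print(cover_letter(True, True, True, False))  # True

# Reachable by both, but claiming to be in both cities at once:
print(cover_letter(True, True, True, True))   # False
```

The inclusive or tolerates both of its disjuncts being true; the exclusive or does not.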
We are now ready to define our last logical connective:
Definition 3.8 (If and only if: ⇔)
The proposition p ⇔ q (“p if and only if q”) is true when the propositions p and q have the same truth value (both p and q are true, or both p and q are false), and false otherwise.
The reason that ⇔ is read as “if and only if” is that p ⇔ q means the same thing as the compound proposition (p ⇒ q) ∧ (q ⇒ p). (We’ll prove this equivalence in Example 3.23.) Furthermore, the propositions p ⇒ q and q ⇒ p can be rendered, respectively, as “p only if q” and “p, if q.” Thus p ⇔ q expresses “p if q, and p only if q”—or, more compactly, “p if and only if q.” (The connective ⇔ is also sometimes called the biconditional, because an implication can also be called a conditional.)
Sometimes you’ll see ⇔ abbreviated in sentences as “iff” as shorthand for “if and only if.” We’ll avoid this notation in this book, but you should understand it if you see it elsewhere.
Unfortunately, just like with “or,” the word “if” is ambiguous in English. Sometimes “if” is used to express an implication, and sometimes it’s used to express an if-and-only-if definition. When you see the word “if” in a sentence, you’ll need to think carefully about whether it means ⇒ or ⇔. Here’s an example:
Example 3.11 (“If” versus “if and only if” in English)
Problem: Think of a number between 10 and 1,000,000. Let
p := your number is prime.
q := your number is even.
r := your number is evenly divisible by an integer other than 1 and itself.
Now translate the following two sentences into logical notation:
1. If the number you’re thinking of is even, then it isn’t prime.
2. The number you’re thinking of isn’t prime if it’s evenly divisible by an integer other than 1 and itself.
Solution: The “if” in (1) is an implication, and the “if” in (2) is “if and only if.” A correct translation of these sentences is (1) q ⇒ ¬p; and (2) ¬p ⇔ r.
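We can test the two translations in Example 3.11 empirically. The following sketch checks them on a small sample of numbers (10 through 1999, a range we picked arbitrarily to keep the running time short; the helper names are also ours). It shows that sentence 1 holds as an implication but would fail if read as “if and only if,” whereas sentence 2 holds as “if and only if.”

```python
from math import isqrt


def implies(p: bool, q: bool) -> bool:
    return (not p) or q


def iff(p: bool, q: bool) -> bool:
    """p <=> q is true exactly when p and q have the same truth value."""
    return p == q


def is_prime(n: int) -> bool:
    return n >= 2 and all(n % d != 0 for d in range(2, isqrt(n) + 1))


def has_other_divisor(n: int) -> bool:
    """Is n evenly divisible by an integer other than 1 and itself?"""
    return any(n % d == 0 for d in range(2, n))


numbers = range(10, 2000)  # a small sample of the numbers allowed in the example

# Sentence 1 as an implication, q => (not p):
sentence_1 = all(implies(n % 2 == 0, not is_prime(n)) for n in numbers)
# Sentence 1 misread as q <=> (not p); it fails for odd composites such as 15:
sentence_1_as_iff = all(iff(n % 2 == 0, not is_prime(n)) for n in numbers)
# Sentence 2 as (not p) <=> r:
sentence_2 = all(iff(not is_prime(n), has_other_divisor(n)) for n in numbers)

print(sentence_1, sentence_1_as_iff, sentence_2)  # True False True
```

A finite check like this cannot establish the claims for every number in the stated range, but a single failing case (like 15) is enough to rule out the “if and only if” reading of sentence 1.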

3.2.4 Combining Logical Connectives
The six standard logical connectives that we’ve defined so far (¬, ∧, ∨, ⇒, ⊕, and ⇔) are summarized in Figure 3.2. The logical connective ¬ is a unary operator, because it builds a compound proposition from a single simpler proposition. The other five connectives are binary operators, which build a compound proposition from two simpler propositions. (We’ll encounter the full list of binary logical connectives later; see Exercises 4.66–4.71.)
Taking it further: The unary-vs.-binary categorization of logical connectives based on how many “arguments” they accept also occurs in other contexts—for example, arithmetic and programming. In arithmetic, for example, one might distinguish between “unary minus” and “binary minus”: the former denotes negation, as in −3; the latter subtraction, as in 2 − 3.
In programming languages, the number of arguments that a function takes is called its arity. (The arity of length is one; the arity of equals is two.) You will sometimes encounter variable arity functions that can take a different number of arguments each time they’re invoked. Common examples include the print functions in many languages—C’s printf and Python’s print, for example, can take any number of arguments—or arithmetic in prefix languages like Scheme, where you can write an expression like (+ 1 2 3 4) to denote 1 + 2 + 3 + 4 (= 10).
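As a brief illustration of arity in Python, the sketch below defines toy versions of length and equals (our own stand-ins, not library functions), inspects their arity, and then defines a variable-arity add in the spirit of Scheme’s (+ 1 2 3 4):

```python
import inspect


def length(xs):
    """A unary function: one argument."""
    return len(xs)


def equals(a, b):
    """A binary function: two arguments."""
    return a == b


def add(*terms):
    """A variable-arity function: accepts any number of arguments."""
    return sum(terms)


def arity(f) -> int:
    """Count the declared parameters of a fixed-arity function."""
    return len(inspect.signature(f).parameters)


print(arity(length))    # 1
print(arity(equals))    # 2
print(add(1, 2, 3, 4))  # 10
print(add())            # 0
```

The arity helper is meaningful only for functions with a fixed list of parameters; for a function like add, the number of arguments can differ on each call.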
Order of operations
A full description of the syntax of a programming language always includes a table of the precedence of operators, arranged from “binds the tightest” (highest precedence) to “binds the loosest” (lowest precedence). These precedence rules tell us when we have to include parentheses in an expression to make it mean what we want it to mean, and when the parentheses are optional. In the same way, we’ll adopt some standard conventions regarding the precedence of our logical connectives:
• Negation (¬) binds the tightest.
• After negation, there is a three-way tie among ∧, ∨, and ⊕. (We’ll always use parentheses in propositions containing more than one of these three operators, just as we should in programs.)
• The trifecta (∧, ∨, and ⊕) is followed by ⇒.
• ⇒ is followed finally by ⇔.
The horizontal lines in Figure 3.2 separate the logical connectives by their precedence, so that operators closer to the top of the table have higher precedence. For example:
Example 3.12 (Precedence of logical connectives)
The propositions p ∨ ¬q and p ∨ q ⇒ ¬r ⇔ p mean, respectively,
p ∨ (¬q) and [(p ∨ q) ⇒ (¬r)] ⇔ p,
which we can see by simply applying the relevant precedence rules (“¬ goes first, then ∨, then ⇒, then ⇔”).
negation          ¬p       “not p”
--------------------------------------------------------------
conjunction       p ∧ q    “p and q”
disjunction       p ∨ q    “p or q”
exclusive or      p ⊕ q    “p xor q”
--------------------------------------------------------------
implication       p ⇒ q    “if p, then q” or “p implies q”
--------------------------------------------------------------
if and only if    p ⇔ q    “p if and only if q”
Figure 3.2: Summary of notation for propositional logic. The connectives are listed from highest precedence (top) to lowest precedence (bottom).
The word “precedence” (pre before, cede go) means “what comes first,” so precedence rules tell us the order in which the operators “get to go.” For example, consider a proposition like p ∧ q ⇒ r. If ∧ “goes first,” the proposition is (p ∧ q) ⇒ r; if ⇒ “goes first,” it means p ∧ (q ⇒ r). Figure 3.2 says that the former is the correct interpretation.

Taking it further: The precedence rules that we’ve described here match the precedence rules in most programming languages. In Java, for example, the condition !p && q—that’s “not p and q” in Java syntax—will be interpreted as (!p) && q, because not/¬/! binds tighter than and/∧/&&.
The precedence rules for operators tell us the order in which two different operators are applied in an expression. For a sequence of applications of the same binary opera- tor, we’ll use the convention that the operator associates to the left. For example, p ∧ q ∧ r will mean (p ∧ q) ∧ r and not p ∧ (q ∧ r).
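The same kind of precedence check can be carried out in Python, whose not operator also binds tighter than and. The sketch below (with function names of our own) confirms that not p and q is read as (not p) and q, and lists the truth assignments where the other grouping would give a different answer. One caution: Python additionally gives and higher precedence than or, which differs from the three-way tie adopted in this section, so parentheses remain the safest choice.

```python
from itertools import product

BOOLS = [True, False]


def negation_binds_tighter() -> bool:
    """Does Python parse `not p and q` as `(not p) and q` for every p, q?"""
    return all((not p and q) == ((not p) and q)
               for p, q in product(BOOLS, repeat=2))


def where_grouping_matters():
    """Truth assignments where (not p) and q differs from not (p and q)."""
    return [(p, q) for p, q in product(BOOLS, repeat=2)
            if ((not p) and q) != (not (p and q))]


print(negation_binds_tighter())   # True
print(where_grouping_matters())   # [(True, False), (False, False)]

# Python-specific: `and` binds tighter than `or`.
print(True or False and False)    # True, parsed as True or (False and False)
print((True or False) and False)  # False
```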
Example 3.13 (Precedence of logical connectives)
Problem: Fully parenthesize each of the following propositions. (In other words, add parentheses around each operator without changing the meaning.)
1. p ∨ q ⇔ p
2. p ⊕ p ⊕ q ⊕ q
3. ¬p ⇔ p ⇔ ¬(p ⇔ p)
4. p ∧ ¬q ⇒ r ⇔ s
5. p ⇒ q ⇒ r ∧ s
Solution: Using the precedence rules from Figure 3.2 and left associativity, we get:
1. (p ∨ q) ⇔ p
2. ((p ⊕ p) ⊕ q) ⊕ q
3. ((¬p) ⇔ p) ⇔ (¬(p ⇔ p))
4. ((p ∧ (¬q)) ⇒ r) ⇔ s
5. (p ⇒ q) ⇒ (r ∧ s)
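Example 3.13 relied on the convention that repeated applications of the same operator associate to the left. A natural question is whether associating to the right instead would ever change a proposition’s truth value. The following brute-force sketch (the dictionary OPERATORS and the function is_associative are our own names) compares the two groupings for each binary connective over all eight truth assignments to three variables:

```python
from itertools import product

BOOLS = [True, False]

# Each binary connective, written as a Python function on two Booleans.
OPERATORS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "xor": lambda a, b: a != b,
    "implies": lambda a, b: (not a) or b,
    "iff": lambda a, b: a == b,
}


def is_associative(op) -> bool:
    """Does (p op q) op r always equal p op (q op r)?"""
    return all(op(op(p, q), r) == op(p, op(q, r))
               for p, q, r in product(BOOLS, repeat=3))


for name, op in OPERATORS.items():
    print(name, is_associative(op))
# and True
# or True
# xor True
# implies False
# iff True
```

For implication, one assignment on which the groupings disagree is p = q = r = F: there (p ⇒ q) ⇒ r is false, but p ⇒ (q ⇒ r) is true.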
The choice that logical operators associate to the left (instead of associating to the right) won’t matter for most of the logical connectives anyway. For example, the propositions (p ∧ q) ∧ r and p ∧ (q ∧ r) are true under exactly the same circumstances, as we’ll see shortly. In fact, of the binary operators {∧, ∨, ⊕, ⇒, ⇔}, the only one for which the order of application matters is implication. See Exercises 3.45–3.47.
Writing tip: Because the order of application does matter for implication, it’s considered good style to include the optional parentheses so that it’s clear what you mean.
3.2.5 Truth Tables
In Section 3.2.3, we described the logical connectives ¬, ∧, ∨, ⇒, ⊕, and ⇔, but we can more systematically define these connectives by using a truth table that collects the value yielded by the logical connective under every truth assignment.
Definition 3.9 (Truth assignment)
A truth assignment for a proposition over variables p1, p2, . . . , pk is a function that assigns a truth value to each pi.
For example, the function f where f (p) = T and f (q) = F is a truth assignment for the proposition p ∨ ¬q. (Each “T” abbreviates a truth value of true; each “F” abbreviates a truth value of false.)
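In code, one convenient way to represent a truth assignment is as a dictionary from variable names to Booleans. This is a representation we have chosen for illustration; others would work equally well. The sketch below encodes the assignment f from the text and evaluates p ∨ ¬q under it, and then lists all truth assignments over the two variables:

```python
from itertools import product


def proposition(assignment: dict) -> bool:
    """Evaluate p or (not q) under the given truth assignment."""
    return assignment["p"] or (not assignment["q"])


# The truth assignment f with f(p) = T and f(q) = F.
f = {"p": True, "q": False}
print(proposition(f))  # True, because T or (not F) is T or T


def all_assignments(variables):
    """Return every truth assignment over the given variables (2**k of them)."""
    return [dict(zip(variables, values))
            for values in product([True, False], repeat=len(variables))]


print(len(all_assignments(["p", "q"])))  # 4
```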

For any particular proposition and for any particular truth assignment f for that proposition, we can evaluate the proposition under f to figure out the truth value of the entire proposition. In the previous example, the proposition p ∨ ¬q is true under the truth assignment with p = T and q = F (because T ∨ ¬F is T ∨ T, which is true). A truth table displays a proposition’s truth value (evaluated in the way we just described) under all truth assignments:
Definition 3.10 (Truth table)
A truth table for a proposition lists, for each possible truth assignment for that proposition (with one truth assignment per row in the table), the truth value of the entire proposition.
For example, the truth table that defines ∧ is shown in Figure 3.3. A few words about this truth table are in order:
• Columns #1 and #2 correspond to the atomic propositions p and q. There is a row in the table corresponding to each possible truth assignment for p ∧ q—that is, for every pair of truth values for p and q. (So there are four rows: TT, TF, FT, and FF.)
• The third column corresponds to the compound proposition p ∧ q, and it has a T only in the first row. That is, the truth value of p ∧ q is false unless both p and q are true—just as Definition 3.4 said.
p   q   p ∧ q
T   T   T
T   F   F
F   T   F
F   F   F
Figure 3.3: The truth table for ∧.
The truth tables for the six basic logical connectives (negation, conjunction, disjunction, exclusive or, implication, and “if and only if”) are shown in Figure 3.4. It’s worth paying special attention to the column for p ⇒ q: the only truth assignment under which p ⇒ q is false is when p is true and q is false. False implies anything! Anything implies true! For example, both of the following are true propositions:
If 2 + 3 = 4, then you will eat tofu for dinner. (if false, then anything)
If you are your own mother, then 2 + 3 = 5. (if anything, then true)
To emphasize the point, observe that the first statement is true even if you would never eat tofu if it were the last so-called food on earth; the hypothesis “2 + 3 = 4” of the proposition wasn’t true, so the truth of the proposition doesn’t depend on what your dinner plans are.
p   ¬p
T   F
F   T

p   q   p ∧ q   p ∨ q   p ⇒ q   p ⊕ q   p ⇔ q
T   T   T       T       T       F       T
T   F   F       T       F       T       F
F   T   F       T       T       T       F
F   F   F       F       T       F       T
Figure 3.4: Truth tables for the basic logical connectives.
For more complicated compound propositions, we can fill in a truth table by repeatedly applying the rules in Figure 3.4. For example, to find the truth table for (p ⇒ q) ∧ (q ∨ p), we compute the truth tables for p ⇒ q and q ∨ p, and put a “T” in the (p ⇒ q) ∧ (q ∨ p) column for precisely those rows in which the truth tables for p ⇒ q and q ∨ p both had “T”s. Here’s a simple example, and a somewhat more complicated one:
Example 3.14 (A small truth table)
Here is a truth table for the proposition (p ∧ q) ⇒ ¬q:
p   q   p ∧ q   ¬q   (p ∧ q) ⇒ ¬q
T   T   T       F    F
T   F   F       T    T
F   T   F       F    T
F   F   F       T    T
This truth table shows that the given proposition (p ∧ q) ⇒ ¬q is true precisely when at least one of p and q is false.
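Because a truth table simply lists a proposition’s value under every truth assignment, it is straightforward to generate one mechanically. The following sketch is one possible approach (the function names truth_table and example_3_14 are ours); it rebuilds the last column of the table in Example 3.14 and checks the observation that the proposition is true precisely when at least one of p and q is false:

```python
from itertools import product


def truth_table(variables, proposition):
    """Return a list of (values, result) pairs, one per truth assignment.

    Rows appear in the usual order: TT, TF, FT, FF for two variables.
    """
    rows = []
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        rows.append((values, proposition(assignment)))
    return rows


def example_3_14(a: dict) -> bool:
    """(p and q) => (not q), writing x => y as (not x) or y."""
    return (not (a["p"] and a["q"])) or (not a["q"])


def tf(value: bool) -> str:
    return "T" if value else "F"


for values, result in truth_table(["p", "q"], example_3_14):
    print(" ".join(tf(v) for v in values), "|", tf(result))
# T T | F
# T F | T
# F T | T
# F F | T

# The proposition is true exactly when at least one of p and q is false.
print(all(result == (not (p and q))
          for (p, q), result in truth_table(["p", "q"], example_3_14)))  # True
```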
Example 3.15 (Three (or more) of four, formalized)
In Example 3.4 (on the validity of passwords), we had a sentence of the form “p, only if q and not r and at-least-three-of {s, t, u, v} are true.”
Let’s translate this sentence into propositional logic. The tricky part will be translat- ing “at least three of {s, t, u, v} are true.” There are many solutions, but one relatively simple way to do it is to explicitly write out four cases, one corresponding to allowing a different one of the four variables {s, t, u, v} to be false:
(s ∧ t ∧ u) ∨ (s ∧ t ∧ v) ∨ (s ∧ u ∧ v) ∨ (t ∧ u ∧ v)
We can verify that we’ve gotten this proposition right with a (big!) truth table, shown in Figure 3.5. Indeed, the five rows in which the last column has a “T” are exactly the five rows in which there are three or four “T”s in the columns for s, t, u, and v.
To finish the translation, recall that “x only if y” means x ⇒ y, so the given sentence can be translated as p ⇒ q ∧ ¬r ∧ (the proposition above)—that is,
p ⇒ q ∧ ¬r ∧ [(s ∧ t ∧ u) ∨ (s ∧ t ∧ v) ∨ (s ∧ u ∧ v) ∨ (t ∧ u ∧ v)].
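Rather than inspecting a sixteen-row truth table by eye, we can let a program do the comparison. The sketch below (with function names at_least_three and password_rule_holds that we have invented for this example) checks that the four-case disjunction agrees with “three or more of s, t, u, v are true” on every truth assignment, and then encodes the full translation of the password rule:

```python
from itertools import product


def at_least_three(s: bool, t: bool, u: bool, v: bool) -> bool:
    """The four-case disjunction from Example 3.15."""
    return ((s and t and u) or (s and t and v)
            or (s and u and v) or (t and u and v))


rows = list(product([True, False], repeat=4))

# Compare against directly counting the true variables (True counts as 1).
print(all(at_least_three(*row) == (sum(row) >= 3) for row in rows))  # True

# How many of the 16 rows make the disjunction true?
print(sum(1 for row in rows if at_least_three(*row)))  # 5


def password_rule_holds(p, q, r, s, t, u, v) -> bool:
    """p => (q and (not r) and at-least-three-of {s, t, u, v})."""
    return (not p) or (q and (not r) and at_least_three(s, t, u, v))


# A "valid" password that is too short violates the rule:
print(password_rule_holds(True, False, False, True, True, True, False))  # False
```

Note that the rule is an implication, so it says nothing about passwords that are not valid: password_rule_holds returns True whenever p is false.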
Taking it further: It’s worth pondering why there are five different rows of the truth table in Figure 3.5 in which the last column is true: there are four different truth assignments corresponding to exactly three of {s, t, u, v} being true (stu, suv, stv, tuv), and there is one truth assignment corresponding to all four being true (stuv). In Chapter 9, on counting, we’ll re-encounter this style of question. (And, actually, precisely the same reasoning as in this example will allow us to prove something interesting about error-correcting codes—see Section 4.2.5.)
s   t   u   v   s∧t∧u   s∧t∧v   s∧u∧v   t∧u∧v   (s∧t∧u) ∨ (s∧t∧v) ∨ (s∧u∧v) ∨ (t∧u∧v)
T   T   T   T   T       T       T       T       T
T   T   T   F   T       F       F       F       T
T   T   F   T   F       T       F       F       T
T   T   F   F   F       F       F       F       F
T   F   T   T   F       F       T       F       T
T   F   T   F   F       F       F       F       F
T   F   F   T   F       F       F       F       F
T   F   F   F   F       F       F       F       F
F   T   T   T   F       F       F       T       T
F   T   T   F   F       F       F       F       F
F   T   F   T   F       F       F       F       F
F   T   F   F   F       F       F       F       F
F   F   T   T   F       F       F       F       F
F   F   T   F   F       F       F       F       F
F   F   F   T   F       F       F       F       F
F   F   F   F   F       F       F       F       F
Figure 3.5: A truth table for Example 3.15.

Computer Science Connections
Natural Language Processing, Ambiguity, and Truth
Our main interest in this book is in developing (and understanding) precise and unambiguous language to express mathematical notions; in this chap-
ter specifically, we’re thinking about the truth values of completely precise statements. But thinking about the truth of ambiguous or ill-defined terms
is absolutely crucial to any computational system that’s designed to interact with users via natural language. (A natural language is one like English or French or Xhosa; these languages contrast with artificial languages like Java or Python or, arguably, Esperanto or Klingon.)
Natural language processing (NLP) (or the roughly similar computational linguistics) is the subfield of computer science that lies at the discipline’s inter- facewithlinguistics.2 InNLP,weworktodevelopsoftwaresystemsthatcan interact with users in a natural language. A necessary step in an NLP system is to take an utterance made by the human user and “understand it.” (“Under- standing what a sentence means” is more or less the same as “understanding the circumstances under which it is true”—which is fundamentally a question of logic.)
One major reason that NLP is hard is that there is a tremendous amount
of ambiguity in natural-language utterances. We can have lexical ambiguity, in which two different words are spelled identically but have two different mean- ings; we have to determine which word is meant in a sentence. Or there’s syntactic ambiguity, in which a sentence’s structure can be interpreted very differently. (See Figure 3.6.) But there are also subtleties about when a state- ment is true, even if the meaning of each word and the sentence’s structure are clear.
Consider, for example, designing and implementing a conversational system designed to assist with travel planning. (Many airlines or train com- panies have such systems.) Such a system might engage in a dialogue like the one in Figure 3.7 with a human user. There’s no hard-and-fast rule for what other flights should count as “slightly later” and “too much more expensive.” This conversational system has to be able to decide the truth of statements like Delta #2931 is slightly later than Delta #1927 and Delta #2931 isn’t too much more expensive than Delta #1927, even though the “truth” of these statements depends on heavy use of conversational context and pragmatic reasoning.
Of course, even though one cannot unambiguously determine whether these sentences are true or false, they’re the kind of statement made continually in natural language. So systems that process natural language must deal with this issue with great frequency.
One approach for handling these statements whose truth value is ambiguous is called fuzzy logic, in which each proposition has a truth value that is a real number between 0 and 1. (So 10:33a is slightly later than 8:45a is “more true” than 12:19p is slightly later than 8:45a: the former might have a truth value of 0.74, while the latter might have a truth value of 0.34. But 7:30a is slightly later than 8:45a would have a truth value of 0.00, as 7:30a is unambiguously not slightly later than 8:45a.)
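To make the idea of real-valued truth concrete, here is a minimal Python sketch. The three truth values are the illustrative ones from the paragraph above. The connectives (min for and, max for or, 1 − a for not) are one common choice of fuzzy operators and are an assumption here; the text does not say how fuzzy truth values should be combined.

```python
# Truth values are real numbers in [0, 1] rather than just True/False.
# ASSUMPTION: min / max / (1 - a) are used as the fuzzy connectives;
# other choices exist, and the text does not pick one.

def fuzzy_not(a: float) -> float:
    return 1.0 - a

def fuzzy_and(a: float, b: float) -> float:
    return min(a, b)

def fuzzy_or(a: float, b: float) -> float:
    return max(a, b)

# Illustrative truth values of "... is slightly later than 8:45a".
later_1033 = 0.74
later_1219 = 0.34
later_0730 = 0.00

print(fuzzy_and(later_1033, later_1219))  # 0.34
print(fuzzy_or(later_1033, later_1219))   # 0.74
print(fuzzy_not(later_0730))              # 1.0
```

When every truth value is exactly 0 or 1, these three functions agree with the ordinary truth tables for ∧, ∨, and ¬.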
For more, you can look for a textbook on NLP like
2 Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, 2nd edition, 2008.
A: Do you prefer coffee or tea?
B: Do you prefer cream or sugar?
C: We ate cake with walnuts.
D: We ate cake with forks.
Figure 3.6: Examples of lexical (A and B) and syntactic ambiguity (C and D). The or of A/B can be either inclusive or exclusive; simply answering “yes” is a reasonable response to question B, but a bizarre one to question A. The with of C/D can either attach to the cake or the eating; the sentences’ structures are consistent with using walnuts as an eating utensil in C, or the cake containing forks as an ingredient in D.
User: I want to fly from MSP to BOS on 28 December.
System: Delta #1927 is a nonstop flight from MSP to BOS on Delta Airlines for $472 that leaves at 8:45am.
User: Is there a slightly later flight that isn’t too much more expensive?
Figure 3.7: A sample dialogue. Suppose that Delta #2931 is a second nonstop flight from MSP to BOS that leaves at 10:33am and costs $529.

3.2.6 Exercises
What are the truth values of the following propositions?
3.1 2² + 3² = 4²
3.2 The number 202 is written 11010010 in binary.
3.3 After executing the C code fragment in Figure 3.8 (shown at right), the variable x has the value 1.
Consider the following atomic propositions:
Using these atomic propositions, translate the following (true!) statements about legal Python programs into logical notation. (Note that these statements do not come close to fully characterizing the set of valid Python statements, for several reasons: first, they’re about particular variables—x and y—rather than about generic variables. And, second, they omit some important common-sense facts—for example, it’s not simultaneously possible to be both a list and a numeric value. That is, for example, we have ¬v ∨ ¬z.)
3.4 x ** y is valid Python if and only if x and y are both numeric values.
3.5 x + y is valid Python if and only if x and y are both numeric values, or they’re both lists.
3.6 x * y is valid Python if and only if x and y are both numeric values, or if one of x and y is a list and the other is numeric.
3.7 x * y is a list if x * y is valid Python and x and y are not both numeric values.
3.8 If x + y is a list, then x * y is not a list.
3.9 x + y and x * y are both valid Python only if x is not a list.
3.10 True story: a 29-year-old friend of mine who does not have an advance care directive was asked
the following question on a form at a doctor’s office. What should she answer?
If you’re over 55 years old, do you have an advance care directive? Circle one: YES NO
In Example 3.15, we constructed a proposition corresponding to “at least three of {s, t, u, v} are true.” Generalize this construction by building a proposition . . .
3.11 … expressing “at least 3 of {p1,…,pn} are true.”
3.12 . . . expressing “at least n − 1 of {p1 , . . . , pn } are true.”
The identity of a binary operator ⋄ is a value i such that, for any x, the expressions {x, x ⋄ i, i ⋄ x} are all equivalent. The zero of ⋄ is a value z such that, for any x, the expressions {z, x ⋄ z, z ⋄ x} are all equivalent. For an example from arithmetic, the identity of + is 0, because x + 0 = 0 + x = x for any number x. And the zero of multiplication is 0, because x · 0 = 0 · x = 0 for any number x. For each of the following, identify the identity or zero of the given logical operator. Justify your answer. Some operators do not have an identity or a zero; if the given operator fails to have the stated identity/zero, explain why it doesn’t exist.
Figure 3.8: Snippet of C code. Note that x/2 denotes integer division; for example, 7/2 = 3.
int x = 202;
while (x > 2) {
    x = x / 2;
}
p: x + y is valid Python
q: x * y is valid Python
r: x ** y is valid Python
s: x * y is a list
t: x + y is a list
u: x is a numeric value
v: y is a numeric value
w: x is a list
z: y is a list
3.13 What is the identity of ∨?
3.14 What is the identity of ∧?
3.15 What is the identity of ⇔?
3.16 What is the identity of ⊕?
3.17 What is the zero of ∨?
3.18 What is the zero of ∧?
3.19 What is the zero of ⇔?
3.20 What is the zero of ⊕?
Because ⇒ is not commutative (that is, because p ⇒ q and q ⇒ p mean different things), it is not too surprising that ⇒ has neither an identity nor a zero. But there is a pair of related definitions that apply to this type of operator:
3.21 The left identity of a binary operator ⋄ is a value i_l such that, for any x, the expressions x and i_l ⋄ x are equivalent. The right identity of ⋄ is a value i_r such that, for any x, the expressions x and x ⋄ i_r are equivalent. (Again, some operators may not have left or right identities.) What are the left and right identities of ⇒ (if they exist)?
3.22 The left zero of a binary operator ⋄ is a value z_l such that, for any x, the expressions z_l and z_l ⋄ x are equivalent; similarly, the right zero is a value z_r such that, for any x, the expressions z_r and x ⋄ z_r are equivalent. (Again, some operators may not have left or right zeros.) What are the left and right zeros for ⇒ (if they exist)?

In many programming languages, the Boolean values True and False are actually stored as the numerical values 1 and 0, respectively. In Python, for example, both 0 == False and 1 == True are True. Thus, despite appearances, we can add or subtract or multiply Boolean values! Furthermore, in many languages (including Python), anything that is not False (in other words, anything other than 0) is considered True for the purposes of conditionals. For example, in many programming languages, including Python, code like if 2 print “yes” else print “no” will print “yes.”
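These claims are easy to check directly. The following short Python 3 sketch illustrates them; the variable name answer is just for illustration.

```python
# Booleans behave like the integers 1 and 0.
print(0 == False, 1 == True)   # True True
print(True + True)             # 2
print(True * False)            # 0

# A nonzero number counts as true in a conditional.
answer = "yes" if 2 else "no"
print(answer)                  # yes
```

(Python also treats a few non-numeric values, such as None and empty lists, as false in conditionals.)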
Suppose that x and y are two Boolean variables in a programming language, like Python, where True and False are 1 and 0, respectively—that is, the values of x and y are both 0 or 1. Each of the following code snippets includes a conditional statement based on an arithmetic expression using x and y. For each, rewrite the given condition using the standard notation of propositional logic.
3.23 if x * y …
3.24 if x + y …
3.25 if 1 - x …
3.26 if (x * (1 - y)) + ((1 - x) * y) …
We can use the common programming language features described in the previous block of exercises to give a simple programming solution to Exercises 3.11–3.12. Assume that {p1, …, pn} are all Boolean variables in Python (that is, their values are all 0 or 1). Write a Python conditional expressing the condition that . . .
3.27 … at least 3 of {p1,…,pn} are true.
3.28 . . . at least n − 1 of {p1 , . . . , pn } are true.
In addition to purely logical operations, computer circuitry has to be built to do simple arithmetic very quickly. Here
you’ll explore some pieces of using propositional logic and binary representation of integers to express arithmetic operations. (It’s straightforward to convert your answers into circuits.)
Consider a number x ∈ {0,…,15} represented as a 4-bit binary number, as shown in Figure 3.9. Denote by x0 the least-significant bit of x, by x1 the next bit, and so forth. For example, the number x = 12 (written 1100 in binary) would have x0 = 0, x1 = 0, x2 = 1, and x3 = 1. For each of the following conditions, give a proposition over the Boolean variables {x0, x1, x2, x3} that expresses the stated condition. (Think of 0 as false and 1 as true.)
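If you would like to experiment with this representation, the bits of a number can be pulled out with shifts and masks. This is only a sketch of the representation, not a solution to any exercise; the helper name bits4 is made up for illustration.

```python
def bits4(x: int) -> list:
    """Return [x0, x1, x2, x3], least-significant bit first."""
    if not 0 <= x <= 15:
        raise ValueError("x must fit in 4 bits")
    return [(x >> i) & 1 for i in range(4)]

x0, x1, x2, x3 = bits4(12)
print(x0, x1, x2, x3)                  # 0 0 1 1
print(8 * x3 + 4 * x2 + 2 * x1 + x0)   # 12
```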
3.29 x is greater than or equal to 8.
3.30 x is evenly divisible by 4.
3.31 x is evenly divisible by 5. (Hint: use a truth table, and then build a proposition from the table.)
3.32 x is an exact power of two.
3.33 Suppose that we have two 4-bit input integers x and y, represented as in Exercises 3.29–3.32. Give
a proposition over {x0,x1,x2,x3,y0,y1,y2,y3} that expresses the condition that x = y.
3.34 Given two 4-bit integers x and y as in the previous exercise, give a proposition over the Boolean variables {x0,x1,x2,x3,y0,y1,y2,y3} that expresses the condition that x ≤ y.
3.35 Suppose that we have a 4-bit input integer x, represented by four Boolean variables {x0 , x1 , x2 , x3 } as in Exercises 3.29–3.32. Let y be the integer x + 1, represented again as a 4-bit value {y0 , y1 , y2 , y3 }. (For the purposes of this question, treat 15 + 1 = 0—that is, we’re really defining y = (x + 1) mod 16.) For example, for x = 11 (which is 1011 in binary), we have that y = 12 (which is 1100 in binary). For each i ∈ {0,1,2,3}, give a proposition over the Boolean variables {x0 , x1 , x2 , x3 } that expresses the value of yi .
The remaining problems in this section ask you to build a program to compute various facts about a given proposition φ. To make your life as easy as possible, you should consider a simple representation of φ, based on representing
any compound proposition as a list. In such a list, the first element will be the logical connective, and the remaining elements will be the subpropositions. For example, the proposition (p ∨ r) ⇒ (¬q) will be represented as
[“implies”, [“or”, “p”, “r”], [“not”, “q”]]
Now, using this representation of propositions, write a program, in a programming language of your choice, to accomplish the following operations:
3.36 (programming required) Given a proposition φ, compute the set of all atomic propositions contained within φ. The following recursive formulation may be helpful:
variables(p) := {p}
variables(¬φ) := variables(φ)
variables(φ ⋄ ψ) := variables(φ) ∪ variables(ψ) for any connective ⋄ ∈ {∧, ∨, ⇒, ⇔, ⊕, . . .}
3.37 (programming required) Given a proposition φ and a truth assignment for each variable in φ, evaluate whether φ is true or false under this truth assignment.
3.38 (programming required) Given a proposition φ, compute the set of all truth assignments for the variables in φ that make φ true. (One good approach: use your solution to Exercise 3.36 to compute all the variables in φ, then build the full list of truth assignments for those variables, and then evaluate φ under each of these truth assignments using your solution to Exercise 3.37.)
Figure 3.9: Representing x ∈ {0,…,15} using 4 bits.
x3  x2  x1  x0
0   0   1   1     0 + 0 + 2 + 1 = 3
1   1   0   0     8 + 4 + 0 + 0 = 12
We’ll occasionally use lowercase Greek letters, particularly φ (“phi”) or ψ (“psi”), to denote not-necessarily-atomic propositions.

3.3 Propositional Logic: Some Extensions
Against logic there is no armor like ignorance.
Laurence J. Peter (1919–1990)
With the definitions from Section 3.2 in hand, we turn to a few extensions: some special types of propositions, and some special ways of representing propositions.
3.3.1 Tautology and Satisfiability
Several important types of propositions are defined in terms of their truth tables: those that are always true (tautologies), sometimes true (satisfiable propositions), or never true (unsatisfiable propositions). We will explore each of these types in turn.
Tautologies
We’ll start by considering propositions that are always true:
One reason that tautologies are important is that we can use them to reason about logical statements, which can be particularly valuable when we’re trying to prove a claim.
Examples 3.16 and 3.17 illustrate two important tautologies. The first of these tautologies is the proposition p ∨ ¬p, which is called the law of the excluded middle: for any proposition p, either p is true or p is false; there is nothing “in between.”
Example 3.16 (Law of the Excluded Middle)
Here is the truth table for the proposition p ∨ ¬p:
p   ¬p   p ∨ ¬p
T   F    T
F   T    T
The third column is filled with “T”s, so p ∨ ¬p is a tautology.
The second tautology is the proposition p ∧ (p ⇒ q) ⇒ q, called modus ponens: if we
know both that (a) p is true and that (b) the truth of p implies the truth of q, then we can conclude that q is true.
Example 3.17 (Modus Ponens)
Here is the truth table for p ∧ (p ⇒ q) ⇒ q (with a few extra columns of “scratch work,” for each of the constituent pieces of the desired final proposition):
p   q   p ⇒ q   p ∧ (p ⇒ q)   p ∧ (p ⇒ q) ⇒ q
T   T   T       T             T
T   F   F       F             T
F   T   T       F             T
F   F   T       F             T
Etymologically, the word tautology comes from taut “same” (to + auto) + logy “word.” Another meaning for the word “tautology” (in real life, not just in logic) is the unnecessary repetition of an idea: “a canine dog.” (The etymology and the secondary street meaning are not totally removed from the usage in logic.)
Modus ponens rhymes with “goad us phone-ins”; literally, it means “the mood that affirms” in Latin.
Definition 3.11 (Tautology)
A proposition is a tautology if it is true under every truth assignment.

There are only “T”s in the last column of this truth table, which establishes that modus ponens is a tautology.
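The truth-table test for a tautology is mechanical enough to automate. Here is a minimal Python sketch that tries every truth assignment; the function names (implies, is_tautology) and the choice to represent a proposition as a Python function of Booleans are assumptions made for illustration.

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """a ⇒ b is false only when a is true and b is false."""
    return (not a) or b

def is_tautology(prop, num_vars: int) -> bool:
    """True if prop holds under every one of the 2**num_vars truth assignments."""
    return all(prop(*row) for row in product([True, False], repeat=num_vars))

excluded_middle = lambda p: p or (not p)                     # p ∨ ¬p
modus_ponens = lambda p, q: implies(p and implies(p, q), q)  # p ∧ (p ⇒ q) ⇒ q

print(is_tautology(excluded_middle, 1))  # True
print(is_tautology(modus_ponens, 2))     # True
print(is_tautology(implies, 2))          # False: p ⇒ q fails when p = T, q = F
```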
Figure 3.10 contains a number of tautologies that you may find interesting and occasionally helpful. (Exercises 3.60–3.72 ask you to build truth tables to verify that these propositions really are tautologies.)
One terminological note from Figure 3.10: modus tollens is the proposition (p ⇒ q) ∧ ¬q ⇒ ¬p, and it’s the counterpoint to modus ponens: if we know both that (a) the truth of p implies the truth of q and that (b) q is not true, then we can conclude that p cannot be true either. (Modus tollens means “the mood that denies” in Latin.)
Satisfiable and unsatisfiable propositions
We now turn to propositions that are sometimes true, and those propositions that
are never true:
If f is a truth assignment under which a proposition is true, then we say that the proposition is satisfied by f .
Thus a proposition is satisfiable if it is true under at least one truth assignment, and unsatisfiable if it is false under every truth assignment. (And it’s a tautology if it is true under every truth assignment.) Here are some examples:
Example 3.18 (Contradiction of p ⇔ q and p ⊕ q)
Here is the truth table for (p ⇔ q) ∧ (p ⊕ q):
p   q   p ⇔ q   p ⊕ q   (p ⇔ q) ∧ (p ⊕ q)
T   T   T       F       F
T   F   F       T       F
F   T   F       T       F
F   F   T       F       F
Because the column of the truth table corresponding to the given proposition has no “T”s in it, the proposition (p ⇔ q) ∧ (p ⊕ q) is unsatisfiable.
Figure 3.10: Some tautologies.
(p ⇒ q) ∧ p ⇒ q                        Modus Ponens
(p ⇒ q) ∧ ¬q ⇒ ¬p                      Modus Tollens
p ∨ ¬p                                 Law of the Excluded Middle
p ⇔ ¬¬p                                Double Negation
p ⇔ p
p ⇒ p ∨ q
p ∧ q ⇒ p
(p ∨ q) ∧ ¬p ⇒ q
(p ⇒ q) ∧ (¬p ⇒ q) ⇒ q
(p ⇒ q) ∧ (q ⇒ r) ⇒ (p ⇒ r)
(p ⇒ q) ∧ (p ⇒ r) ⇔ p ⇒ q ∧ r
(p ⇒ q) ∨ (p ⇒ r) ⇔ p ⇒ q ∨ r
p ∧ (q ∨ r) ⇔ (p ∧ q) ∨ (p ∧ r)
p ⇒ (q ⇒ r) ⇔ p ∧ q ⇒ r
Definition 3.12 (Satisfiable propositions)
A proposition is satisfiable if it is true under at least one truth assignment.
Definition 3.13 (Unsatisfiable propositions/contradictions)
A proposition is unsatisfiable if it is not satisfiable. Such a proposition is also called a contradiction.

Though it might not have been immediately apparent when they were defined, the logical connectives ⊕ and ⇔ demand precisely opposite things of their arguments: the proposition p ⊕ q is true when p and q have different truth values, while p ⇔ q is true when p and q have the same truth values. Because p and q cannot simultaneously have the same and different truth values, the conjunction (p ⇔ q) ∧ (p ⊕ q) is a contradiction.
Example 3.19 (Demanding satisfaction)
Problem: Is the proposition p ∨ q ⇒ ¬p ∧ ¬q satisfiable?
Solution: We’ll answer the question by building a truth table for the given proposition:
p   q   p ∨ q   ¬p   ¬q   ¬p ∧ ¬q   p ∨ q ⇒ ¬p ∧ ¬q
T   T   T       F    F    F         F
T   F   T       F    T    F         F
F   T   T       T    F    F         F
F   F   F       T    T    T         T
Because there is at least one “T” in the last column in the truth table, the proposition is satisfiable. Specifically, this proposition is satisfied by the truth assignment p = False, q = False. (Under this truth assignment, the hypothesis p ∨ q is false; because false implies anything, the entire implication is true.)
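The same search can be carried out by brute force in code. The sketch below lists every satisfying truth assignment of a proposition; the helper name satisfying_assignments and the encoding of propositions as Python functions are assumptions for illustration. An empty result means that the proposition is unsatisfiable.

```python
from itertools import product

def satisfying_assignments(prop, names):
    """Return every truth assignment (as a dict) under which prop is true."""
    result = []
    for values in product([True, False], repeat=len(names)):
        if prop(*values):
            result.append(dict(zip(names, values)))
    return result

# p ∨ q ⇒ ¬p ∧ ¬q, writing a ⇒ b as (not a) or b
example_3_19 = lambda p, q: (not (p or q)) or ((not p) and (not q))
# (p ⇔ q) ∧ (p ⊕ q), writing ⇔ as == and ⊕ as != on Booleans
example_3_18 = lambda p, q: (p == q) and (p != q)

print(satisfying_assignments(example_3_19, ["p", "q"]))
# [{'p': False, 'q': False}]
print(satisfying_assignments(example_3_18, ["p", "q"]))
# []
```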
Let φ be any proposition. Then φ is a tautology exactly when ¬φ is unsatisfiable: φ is a tautology when the truth table for φ is all “T”s, which happens exactly when the truth table for ¬φ is all “F”s. And that’s precisely the definition of ¬φ being unsatisfiable!
Taking it further: While satisfiability seems like a pretty precise technical definition that wouldn’t matter all that much, the satisfiability problem (given a proposition φ, determine whether φ is satisfiable) turns out to be at the heart of the biggest open question in computer science today. If you figure out how to solve the satisfiability problem efficiently (or prove that it’s impossible to solve efficiently), then you’ll be the most famous computer scientist of the century. See the discussion on p. 326.
3.3.2 Logical Equivalence
We’ll now turn to a special type of pairs of propositions. When two propositions “mean the same thing” (that is, they are true under precisely the same circumstances), they are called logically equivalent:
To state it differently: propositions φ and ψ are logically equivalent whenever φ ⇔ ψ is a tautology. Here’s a simple example of logical equivalence:
As we said in Section 3.2.6,
we occasionally denote generic propositions by lowercase Greek letters, particularly φ (“phi”) or ψ (“psi”).
Definition 3.14 (Logical equivalence)
Two propositions φ and ψ are logically equivalent, written φ ≡ ψ, if they have exactly identical truth tables (in other words, their truth values are the same under every truth assignment).

Example 3.20 (¬(p ∧ q) ≡ (p ∧ q) ⇒ ¬q)
In Example 3.14, we found that (p ∧ q) ⇒ ¬q is true except when p and q are both true. Thus ¬(p ∧ q) is logically equivalent to (p ∧ q) ⇒ ¬q, as this truth table shows:
p   q   (p ∧ q) ⇒ ¬q   ¬(p ∧ q)
T   T   F              F
T   F   T              T
F   T   T              T
F   F   T              T
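Comparing two truth tables row by row is also easy to automate. Here is a small Python sketch of that check; the name logically_equivalent and the encoding of propositions as functions are illustrative assumptions.

```python
from itertools import product

def logically_equivalent(phi, psi, num_vars: int) -> bool:
    """True if phi and psi have the same truth value under every truth assignment."""
    return all(phi(*row) == psi(*row)
               for row in product([True, False], repeat=num_vars))

left = lambda p, q: not (p and q)                 # ¬(p ∧ q)
right = lambda p, q: (not (p and q)) or (not q)   # (p ∧ q) ⇒ ¬q

print(logically_equivalent(left, right, 2))               # True
print(logically_equivalent(left, lambda p, q: not p, 2))  # False
```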
Implication, converse, contrapositive, inverse, and mutual implication
We’ll now turn to an important question of logical equivalence that involves the proposition p ⇒ q and three other implications derived from it:
These three new implications derived from the original implication p ⇒ q (particularly the converse and the contrapositive) will arise frequently. Let’s compare the three new implications to the original in light of logical equivalence:
Example 3.21 (Implications, contrapositives, converses, inverses)
Problem: Consider the implication p ⇒ q. Which of the converse, contrapositive, and inverse of p ⇒ q are logically equivalent to the original proposition p ⇒ q?
Solution: To answer this question, let’s build the truth table; see Figure 3.11. Thus the proposition p ⇒ q is logically equivalent to its contrapositive ¬q ⇒ ¬p, but not to its inverse or its converse.
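The four columns of that truth table can be recomputed in a few lines of Python. This is a sketch for checking the result; the helper names implies and truth_column are made up for illustration.

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    return (not a) or b

def truth_column(prop):
    """Truth values of prop on the rows TT, TF, FT, FF (in that order)."""
    return [prop(p, q) for p, q in product([True, False], repeat=2)]

original       = truth_column(lambda p, q: implies(p, q))          # p ⇒ q
converse       = truth_column(lambda p, q: implies(q, p))          # q ⇒ p
contrapositive = truth_column(lambda p, q: implies(not q, not p))  # ¬q ⇒ ¬p
inverse        = truth_column(lambda p, q: implies(not p, not q))  # ¬p ⇒ ¬q

print(original == contrapositive)  # True
print(original == converse)        # False
print(original == inverse)         # False
```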
Here’s a real-world example to make these results more intuitive:
Example 3.22 (Contrapositives, converses, and inverses)
Consider the following (true!) proposition, of the form p ⇒ q:
Writing tip: Now that we have a reasonable amount of experience
in writing truth tables, we will permit ourselves
to skip columns when they’re both obvious and not central to the point of a particular example. When you’re writing anything—whether as a food critic or a Shakespeare scholar or a computer scientist—you should always think about the intended audience, and
how much detail is appropriate for them.
Definition 3.15 (Converse, Contrapositive, and Inverse)
Consider an implication p ⇒ q. Then:
• The converse of p ⇒ q is the proposition q ⇒ p.
• The contrapositive of p ⇒ q is the proposition ¬q ⇒ ¬p.
• The inverse of p ⇒ q is the proposition ¬p ⇒ ¬q.
p   q   proposition   converse   contrapositive   inverse
        p ⇒ q         q ⇒ p      ¬q ⇒ ¬p          ¬p ⇒ ¬q
T   T   T             T          T                T
T   F   F             T          F                T
F   T   T             F          T                F
F   F   T             T          T                T
Figure 3.11: The truth table for an implication and its contrapositive, converse, and inverse.
Thanks to Jeff Ondich for Example 3.22.
If you were President of the U.S. in 2006, then your name is George.
(Here p is “you were President of the U.S. in 2006” and q is “your name is George.”)
The contrapositive of this proposition is ¬q ⇒ ¬p, which is also true:

If your name isn’t George, then you weren’t President of the U.S. in 2006.
But the converse q ⇒ p and the inverse ¬p ⇒ ¬q are both blatantly false:
If your name is George, then you were President of the U.S. in 2006.
If you weren’t President of the U.S. in 2006, then your name isn’t George.
Consider, for example, George Clooney, Saint George, George Lucas, and Curious George—all named George, and none the President in 2006.
For emphasis, let’s summarize the results from Example 3.21. Any implication p ⇒ q is logically equivalent to its contrapositive ¬q ⇒ ¬p, but it is not logically equivalent to its converse q ⇒ p or its inverse ¬p ⇒ ¬q. You might notice, though, that the inverse of p ⇒ q is the contrapositive of the converse of p ⇒ q (!), so the inverse and the converse are logically equivalent to each other.
Here’s another example of the concepts of tautology and satisfiability, as they relate to implications and converses:
Example 3.23 (Mutual implication)
Problem: Consider the conjunction of the implication p ⇒ q and its converse: in other words, consider (p ⇒ q) ∧ (q ⇒ p). Is this proposition a tautology? Satisfiable? Unsatisfiable? Is there a simpler proposition to which it’s logically equivalent?
Solution: We can answer this question with a truth table:
p   q   p ⇒ q   q ⇒ p   (p ⇒ q) ∧ (q ⇒ p)
T   T   T       T       T
T   F   F       T       F
F   T   T       F       F
F   F   T       T       T
Because there is a “T” in its column, (p ⇒ q) ∧ (q ⇒ p) is satisfiable (and thus isn’t a contradiction). But that column does contain an “F” as well, and therefore (p ⇒ q) ∧ (q ⇒ p) is not a tautology.
Notice that the truth table for (p ⇒ q) ∧ (q ⇒ p) is identical to the truth table for p ⇔ q. (See Figure 3.4.) Thus p ⇔ q and (p ⇒ q) ∧ (q ⇒ p) are logically equivalent. (And ⇔ is called mutual implication for this reason: p and q imply each other.)
Some other logically equivalent statements
Figure 3.12 contains a large collection of logical equivalences. These equivalences
may use some unfamiliar terminology, which we’ll define here. Informally, an operator is commutative if the order of its arguments doesn’t matter; an operator is associative
if the way we parenthesize successive applications doesn’t matter; and an operator
is idempotent if applying it to the same argument twice gives that argument back. (In addition to these definitions, there are two other frequently discussed concepts: the identity and the zero of the operator; logical equivalences involving identities and zeros were left to you, in Exercises 3.13–3.22.) For each equivalence in Figure 3.12, it’s worth
Latin: idem “same” + potent “strength.”

p ∨ q ≡ q ∨ p                          Commutativity
p ∧ q ≡ q ∧ p
p ⊕ q ≡ q ⊕ p
p ⇔ q ≡ q ⇔ p
p ∨ (q ∨ r) ≡ (p ∨ q) ∨ r              Associativity
p ∧ (q ∧ r) ≡ (p ∧ q) ∧ r
p ⊕ (q ⊕ r) ≡ (p ⊕ q) ⊕ r
p ⇔ (q ⇔ r) ≡ (p ⇔ q) ⇔ r
p ∨ p ≡ p                              Idempotence
p ∧ p ≡ p
¬(p ∧ q) ≡ ¬p ∨ ¬q                     De Morgan’s Laws
¬(p ∨ q) ≡ ¬p ∧ ¬q
p ∧ (q ∨ r) ≡ (p ∧ q) ∨ (p ∧ r)        Distribution of ∧ over ∨
p ∨ (q ∧ r) ≡ (p ∨ q) ∧ (p ∨ r)        Distribution of ∨ over ∧
p ⇒ q ≡ ¬q ⇒ ¬p                        Contrapositive
p ⇒ q ≡ ¬p ∨ q
p ⇒ (q ⇒ r) ≡ p ∧ q ⇒ r
p ⇔ q ≡ ¬p ⇔ ¬q
(p ⇒ q) ∧ (q ⇒ p) ≡ p ⇔ q              Mutual Implication
taking a few minutes to think about why the two propositions are logically equivalent.
See also Exercises 3.73–3.82.
Taking it further: There are at least two ways in which the types of logical equivalences shown in Figure 3.12 play an important role in programming. (See the discussion on p. 327.) First, most modern languages have a feature called short-circuit evaluation of logical expressions (they evaluate conjunctions and disjunctions from left to right, and stop as soon as the truth value of the logical expression is known), and programmers can exploit this feature to make their code cleaner or more efficient. Second, in compiled languages, an optimizing compiler can make use of logical equivalences to simplify the machine code that ends up being executed.
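Short-circuit evaluation is easy to observe in Python, whose and and or operators behave this way. In the sketch below, the helper check (a made-up name) records which operands actually get evaluated.

```python
calls = []

def check(name: str, value: bool) -> bool:
    """Record that this operand was evaluated, then return its value."""
    calls.append(name)
    return value

# False and ...: the answer is already known, so "b" is never evaluated.
result_and = check("a", False) and check("b", True)
# True or ...: likewise, "d" is skipped.
result_or = check("c", True) or check("d", False)

print(result_and, result_or)  # False True
print(calls)                  # ['a', 'c']

# A common use: guard an operation that would otherwise fail.
xs = []
first_is_positive = len(xs) > 0 and xs[0] > 0   # no IndexError is raised
print(first_is_positive)      # False
```

Swapping the operands in the last condition would raise an IndexError on the empty list, even though ∧ is commutative as a logical connective.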
3.3.3 Representing Propositions: Circuits and Normal Forms
Now that we’ve established the core concepts of propositional logic, we’ll turn to some bigger and more applied questions. We’ll spend the rest of this section exploring two specific ways of representing propositions: circuits, the wires and connections from which physical computers are built; and two normal forms, in which the structure of propositions is restricted in a particular way.
The approach we’re taking with normal forms is a commonly used idea to make reasoning about some language L easier: we define a subset S of L, with two goals:
(1) any statement in L is equivalent to some statement in S; and (2) S is “simple” in some way. Then we can consider any statement from the “full” language L, which we can then “translate” into a simple-but-equivalent statement of S. Defining this subset and its accompanying translation will make it easier to accomplish some task for all expressions in L, while still making it easy to write statements clearly.
Taking it further: The idea of translating all propositions into a particular form has a natural analogue in designing and implementing programming languages. For example, every for loop can be expressed as a while loop instead, but it would be very annoying to program in a language that doesn’t have for loops. A nice compromise is to allow for loops, but behind the scenes to translate each for loop into a while loop. This compromise makes the language easier for the “user” programmer to use (for loops exist!) and also makes the job of the programmer of the compiler/interpreter easier (she can worry exclusively about implementing and optimizing while loops!).
In programming languages, this translation is captured by the notion of syntactic sugar. (The phrase is meant to suggest that the addition of for to the language is a bonus for the programmer, “sugar on top,” maybe, that adds to the syntax of the language.) The programming language Scheme is perhaps the pinnacle of syntactic sugar; the core language is almost unbelievably simple. Here’s one illustration: (and x y) (Scheme for “x ∧ y”) is syntactic sugar for (if x y #f) (that’s “if x then y else false”). So a Scheme programmer can use and, but there’s no “real” and that has to be handled by the interpreter.
Circuits
We’ll introduce the idea of circuits by using the proposition (p ∧ ¬q) ∨ (¬p ∧ q) as an
Figure 3.12: Some logically equivalent propositions.
De Morgan’s Laws are named after Augustus De Morgan, a 19th- century British mathematician.

example. (Note, by the way, that this proposition is logically equivalent to p ⊕ q.) Observe that the stated proposition is a disjunction of two smaller propositions, p ∧ ¬q and ¬p ∧ q. Similarly, p ∧ ¬q is a conjunction of two even simpler propositions, namely p and ¬q. A representation of a proposition called a tree continues to break down every compound proposition embedded within it. (We’ll talk about trees in detail in Chapter 11.) The tree for (p ∧ ¬q) ∨ (¬p ∧ q) is shown in Figure 3.13. The tree-based view isn’t much of a change from our usual notation (p ∧ ¬q) ∨ (¬p ∧ q); all we’ve done is use the parentheses and order-of-operation rules to organize the logical connectives. But this representation is closely related to a very important way of viewing logical propositions: circuits.
Figure 3.14 shows the same proposition redrawn as a collection of wires and gates. Wires carry a truth value from one physical location to another; gates are physical implementations of logical connectives. We can think of truth values “flowing in” as inputs to the left side of each gate, and
a truth value “flowing out” as output from the right side of the gate. (The only substantive difference between Figures 3.13 and 3.14—aside from which way is up—is whether the two p inputs come from the same wire, and likewise whether the two q inputs do.)
Example 3.24 (Using and and not for or)
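Problem: Build a circuit for p ∨ q using only ∧ and ¬ gates.
Solution: We’ll use one of De Morgan’s Laws, which says that p ∨ q ≡ ¬(¬p ∧ ¬q):
[Circuit: p and q each pass through a ¬ gate; the two results feed an ∧ gate, whose output passes through a final ¬ gate.]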
This basic idea—of replacing one logical connective by another one (or by multiple other ones)—is a crucial part of the construction of computers themselves; we’ll return to this idea in Section 4.4.1.
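The construction from Example 3.24 can be mimicked in software by treating each gate as a function. The following Python sketch (with made-up function names) builds “or” from the two allowed gates and checks it against Python’s own or on all four inputs.

```python
from itertools import product

def not_gate(a: bool) -> bool:
    return not a

def and_gate(a: bool, b: bool) -> bool:
    return a and b

def or_from_and_not(p: bool, q: bool) -> bool:
    """p ∨ q built as ¬(¬p ∧ ¬q), using only the two gates above."""
    return not_gate(and_gate(not_gate(p), not_gate(q)))

for p, q in product([True, False], repeat=2):
    assert or_from_and_not(p, q) == (p or q)
print("matches p or q on all four inputs")
```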
Conjunctive and Disjunctive Normal Forms
In the rest of this section, we’ll consider a way to simplify propositions: conjunctive
and disjunctive normal forms, which constrain propositions to have a particular format. To define these restricted types of propositions, we need a basic definition: a literal is a Boolean variable (a.k.a. an atomic proposition) or the negation of a Boolean variable. (So p and ¬p are both literals.)
Figure 3.13: A tree-based view of (p ∧ ¬q) ∨ (¬p ∧ q). [Tree: the root ∨ has two ∧ children; one ∧ has children p and ¬q, and the other has children ¬p and q.]
Figure 3.14: A circuit-based view.

Definition 3.16 (Conjunctive normal form)
A proposition is in conjunctive normal form (CNF) if it is the conjunction of one or more clauses, where each clause is the disjunction of one or more literals.
Definition 3.17 (Disjunctive normal form)
A proposition is in disjunctive normal form (DNF) if it is the disjunction of one or more clauses, where each clause is the conjunction of one or more literals.
Less formally, a proposition in conjunctive normal form is “the and of a bunch of ors,” and a proposition in disjunctive normal form is “the or of a bunch of ands.”
Taking it further: In computer architecture and digital electronics, people usually refer to a proposition in CNF as being a product of sums, and a proposition in DNF as being a sum of products. (There is a deep way of thinking about formal logic based on ∧ as multiplication, ∨ as addition, 0 as False, and 1 as True; see Exercises 3.23–3.26.)
Here is a simple example of both CNF and DNF:
Example 3.25 (Simple propositions in CNF and DNF)
The proposition (¬p ∨ q ∨ r) ∧ (¬q ∨ ¬r) ∧ (r) is in conjunctive normal form. It has three clauses: ¬p ∨ q ∨ r and ¬q ∨ ¬r and r.
The proposition (¬p ∧ q ∧ r) ∨ (¬q ∧ ¬r) ∨ (r) is in disjunctive normal form, again with three clauses: ¬p ∧ q ∧ r and ¬q ∧ ¬r and r.
While conjunctive and disjunctive normal forms seem like heavy restrictions on the format of propositions, it turns out that every proposition is logically equivalent to a CNF proposition and to a DNF proposition:
These two theorems are perhaps the first results that we’ve encountered that are unexpected, or at least unintuitive. There’s no particular reason for it to be clear that they’re true, let alone how we might prove them. But we can, and we will: we’ll prove both theorems in Section 4.4.1 and again in Section 5.4.3, after we’ve introduced some relevant proof techniques. But, for now, here are a few examples of translating propositions into DNF/CNF.
Problem-solving tip: A good strategy when you’re trying to prove a not-at-all-obvious claim is to test out some small examples, and then try to figure out a general pattern.
Theorem 3.1 (All propositions are expressible in CNF)
For any proposition φ, there is a proposition φcnf over the same Boolean variables and in conjunctive normal form such that φ ≡ φcnf.
Theorem 3.2 (All propositions are expressible in DNF)
For any proposition φ, there is a proposition ψdnf over the same Boolean variables and in disjunctive normal form such that φ ≡ ψdnf.

Example 3.26 (Translating basic connectives into DNF)
Problem: Give propositions in disjunctive normal form that are logically equivalent to each of the following:
1. p ∨ q   2. p ∧ q   3. p ⇒ q   4. p ⇔ q
Solution:
1 & 2. These questions are boring: both propositions are already in DNF, with 2 clauses (p and q) and 1 clause (p ∧ q), respectively.
3. Figure 3.12 tells us that p ⇒ q ≡ ¬p ∨ q, and ¬p ∨ q is in DNF.
4. The proposition p ⇔ q is true when p and q are either both true or both false, and false otherwise. So we can rewrite p ⇔ q as (p ∧ q) ∨ (¬p ∧ ¬q). We can check that we’ve gotten this proposition right with a truth table:
p   q   p ∧ q   ¬p ∧ ¬q   (p ∧ q) ∨ (¬p ∧ ¬q)   p ⇔ q
T   T   T       F         T                     T
T   F   F       F         F                     F
F   T   F       F         F                     F
F   F   F       T         T                     T
And here’s the task of translating basic logical connectives into CNF:
Example 3.27 (Translating basic connectives into CNF)
Problem: Give propositions in conjunctive normal form that are logically equivalent to each of the following:
1. p ⇒ q    2. p ⇔ q    3. p ⊕ q
(Note that, as with DNF, both p ∨ q and p ∧ q are already in CNF.)
Solution:
1. As above, we know that p ⇒ q ≡ ¬p ∨ q, and ¬p ∨ q is also in CNF.
2. We can rewrite p ⇔ q as follows:
    p ⇔ q ≡ (p ⇒ q) ∧ (q ⇒ p)        mutual implication (Example 3.23)
          ≡ (¬p ∨ q) ∧ (¬q ∨ p)      x ⇒ y ≡ ¬x ∨ y (Figure 3.12), used twice
The proposition (¬p ∨ q) ∧ (¬q ∨ p) is in CNF.
3. Because p ⊕ q is true as long as one of {p, q} is true and one of {p, q} is false, it’s easy to verify via truth table that p ⊕ q ≡ (p ∨ q) ∧ (¬p ∨ ¬q), which is in CNF.
We’ve only given some examples of converting a (simple) proposition into a new proposition, logically equivalent to the original, that’s in either CNF or DNF. We will figure out how to generalize this technique to any proposition in Section 4.4.1.
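The truth-table checks used in Examples 3.26 and 3.27 can also be carried out mechanically. The following Python sketch is only an illustration: the helper name equivalent and the lambda names are assumptions made here, not notation from the text. It compares two propositions on every truth assignment.

```python
from itertools import product

def equivalent(f, g, num_vars):
    """Return True if f and g agree on every truth assignment."""
    return all(f(*vals) == g(*vals)
               for vals in product([True, False], repeat=num_vars))

# p <=> q, and its DNF and CNF forms from Examples 3.26 and 3.27
iff = lambda p, q: p == q
iff_dnf = lambda p, q: (p and q) or (not p and not q)
iff_cnf = lambda p, q: (not p or q) and (not q or p)

# p xor q, and its CNF form from Example 3.27
xor = lambda p, q: p != q
xor_cnf = lambda p, q: (p or q) and (not p or not q)

print(equivalent(iff, iff_dnf, 2))
print(equivalent(iff, iff_cnf, 2))
print(equivalent(xor, xor_cnf, 2))
```

Each call enumerates all four rows of the two-variable truth table, so a result of True is the same evidence as a hand-written table.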

326 CHAPTER 3. LOGIC
Computer Science Connections
Computational Complexity, Satisfiability, and $1,000,000
Complexity theory is the subfield of computer science devoted to understanding the resources—time and memory, usually—necessary to solve particular problems. It’s the subject of a great deal of fascinating current research in theoretical computer science.3 Here is a central problem of complexity theory, the satisfiability problem:
Given: A Boolean formula φ over variables p_1, p_2, . . . , p_n.
Output: Is φ satisfiable?
The satisfiability problem is pretty simple to solve. In fact, we’ve implicitly described an algorithm for this problem already:
• construct the truth table for the n-variable proposition φ; and
• check to see whether there are any “T”s in φ’s column of the table.
But this algorithm is not very fast, because the truth table for φ has lots and lots of rows—2^n rows, to be precise. (We’ve already seen this for n = 1, for negation, and n = 2, for all the binary connectives, with 2^1 = 2 and 2^2 = 4 rows each; in Chapter 9, we’ll address this counting issue formally.) And then even a moderate value of n means that this algorithm will not terminate in your lifetime; 2^300 exceeds the number of particles in the known universe.
So, it’s clear that there is an algorithm that solves the SAT problem. What’s not clear is whether there is a substantially more efficient algorithm to solve the SAT problem. It’s so unclear, in fact, that nobody knows the answer, and this question is one of the biggest open problems in computer science and mathematics today. (Arguably, it’s the biggest.) The Clay Mathematics Institute will even give a $1,000,000 prize to anyone who solves it.
Why is this problem so important? The reason is that, in a precise technical sense, SAT is just as hard as a slew of other problems that have a plethora of unspeakably useful applications: the traveling salesman problem, protein folding, optimally packing the trunk of a car with suitcases. This slew is a class of computational problems known as NP (“nondeterministic polynomial time”), for which it is easy to “verify” correct answers. In the context of SAT, that means that whenever you’ve got a satisfiable proposition φ, it’s very easy for you to (efficiently) convince me that φ is satisfiable. Here’s how: you’ll simply tell me a truth assignment under which φ evaluates to true. And I can make sure that you didn’t try to fool me by plugging and chugging: I substitute your truth assignment in for every variable, and then I make sure that the final truth value of φ is indeed True.
One of the most important results in theoretical computer science in the 20th century—that’s saying something for a field that was founded in the 20th century!—is the Cook–Levin Theorem:4 if one can solve SAT efficiently, then one can solve any problem in NP efficiently. The major open question is what’s known as the P-versus-NP question. A problem that’s in P is easy to solve from scratch. A problem that’s in NP is easy to verify (in the way described above). So the question is: does P = NP? Is verifying an answer to a problem no easier than solving the problem from scratch? (It seems intuitively “clear” that the answer is no—but nobody has been able to prove it!)
You can read more about complexity theory in general, and the P-versus-NP question addressed here in particular, in most books on algorithms or the theory of computing. Some excellent places to read more are:
3 Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009; Jon Kleinberg and Éva Tardos. Algorithm Design. Addison–Wesley, 2006; and Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012.
4 Stephen Cook. The complexity of theorem proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, pages 151–158, 1971; and Leonid Levin. Universal search problems. Problems of Information Transmission, 9(3):265–266, 1973. In Russian.
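The truth-table algorithm for satisfiability, and the "plug and chug" verification of a claimed assignment, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a formula is represented as a Python function that takes a dictionary from variable names to truth values, and the names brute_force_sat, verify, and example are hypothetical choices made for this sketch.

```python
from itertools import product

def brute_force_sat(phi, variables):
    """Try all 2^n truth assignments; return a satisfying one, or None."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if phi(assignment):
            return assignment
    return None

def verify(phi, assignment):
    """Check a claimed satisfying assignment with a single evaluation of phi."""
    return bool(phi(assignment))

# The CNF proposition from Example 3.25: (not p or q or r) and (not q or not r) and r
def example(a):
    return ((not a["p"] or a["q"] or a["r"])
            and (not a["q"] or not a["r"])
            and a["r"])

witness = brute_force_sat(example, ["p", "q", "r"])
print(witness)
print(verify(example, witness))
```

The search loop may run 2^n times, whereas verify evaluates the formula only once; that gap between finding and checking is the contrast that the P-versus-NP question asks about.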

Computer Science Connections
Short-Circuit Evaluation, Optimization, and Modern Compilers
The logical equivalences in Figure 3.12 may seem far removed from “real” programming, but logical equivalences are actually central in modern programming. Here are two ways in which they play an important role:
Short-circuit evaluation: In most modern programming languages, a logical expression involving ands and ors will only be evaluated until the truth value of the expression can be determined. For an example in Java, see Figure 3.15. Like most modern languages, Java evaluates an ∧ expression from left to right and stops as soon as it finds a false conjunct. Similarly, Java evaluates an ∨ expression from left to right and stops as soon as it finds a true disjunct, because True ∨ anything ≡ True. This style of evaluation is called short-circuit evaluation.
Two slick ways in which programmers can take advantage of short-circuit evaluation are shown in Figure 3.16.
• Lines 1–4 use short-circuit evaluation to avoid deeply nested if statements to handle exceptional cases. When x = 0, evaluating the second disjunct would cause a divide-by-zero error—but the second disjunct isn’t evaluated when x = 0 because the first disjunct was true!
• Lines 6–9 use short-circuit evaluation to make code faster. If the second conjunct typically takes much longer to evaluate (or if it is much more frequently true) than the first conjunct, then careful ordering of conjuncts avoids a long and usually fruitless computation.
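Both uses of short-circuit evaluation can be observed directly. The sketch below uses Python rather than Java (Python's and/or operators also short-circuit); the function names mirror the ones in Figure 3.16, but their bodies and the calls list are assumptions invented for this illustration.

```python
def safe_ratio_check(x):
    # When x == 0 the first disjunct is True, so (x - 1) / x is never evaluated.
    return x == 0 or (x - 1) / x > 0.5

calls = []  # records which conjuncts actually get evaluated

def simple_or_often_false(x):
    calls.append("simple")
    return x > 100

def complex_or_often_true(x):
    calls.append("complex")
    return sum(i * i for i in range(x)) >= 0

result = simple_or_often_false(5) and complex_or_often_true(5)

print(safe_ratio_check(0))   # no ZeroDivisionError is raised
print(result, calls)         # the second conjunct was never called
```

Swapping the order of the two conjuncts would give the same truth value but would always pay for the expensive call.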
Compile-time optimization: For a program written in a compiled language like C, the source code is translated into machine-readable form by the compiler. But this translation is not verbatim; instead, the compiler streamlines your code (when it can!) to make it run faster.
One of the simplest types of compiler optimizations is constant folding: if some of the values in an arithmetic or logical expression are constants—known to the compiler at “compile time,” and thus unchanged at “run time”—then the compiler can “fold” those constants together. Using the rules of logical or arithmetic equivalence broadens the types of code that can be folded in this way. For example, in C, when you write an assignment statement like y = x + 2 + 3, most compilers will translate it into y = x + 5. But what about z = 7 * x * 8? A modern compiler will optimize it into z = x * 56, using the commutativity of multiplication. Because the compiler can reorder the multiplicands without affecting the value, and this reordering allows the 7 and 8 to be folded into 56, the compiler does the reordering and the folding.
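The folding of 7 * x * 8 into x * 56 can be imitated with a toy sketch. This is not how any particular compiler is implemented: the list-of-factors representation and the name fold_product are hypothetical, chosen only to show the reorder-then-fold idea on integer products. (It uses math.prod, available in Python 3.8 and later.)

```python
from math import prod

def fold_product(factors):
    """Fold the integer constants of a product into a single constant.

    `factors` mixes variable names (strings) and integer constants; this
    list is a hypothetical stand-in for a compiler's expression tree.
    """
    constants = [f for f in factors if isinstance(f, int)]
    variables = [f for f in factors if isinstance(f, str)]
    if not constants:
        return variables
    # Commutativity lets us move every constant next to the others;
    # they are then multiplied once, ahead of time.
    return variables + [prod(constants)]

print(fold_product([7, "x", 8]))
```

The sketch relies on integer multiplication being commutative and associative. Floating-point multiplication is not associative, which is one reason real compilers tend to be more cautious about regrouping floating-point expressions.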
An example using logical equivalences is shown in Figure 3.17. Because p ∨ ¬p is a tautology—the law of the excluded middle—no matter what the value of p, the “then” clause is executed, not the “else” clause. Thus the compiler doesn’t even have to waste time checking whether p is true or false, and this optimization can be applied.
    if (2 > 3 && x + y < 9) { ... } else { ... }
Figure 3.15: A snippet of Java code. In Java, && denotes ∧ and || denotes ∨. The second conjunct of the if condition will actually never be evaluated, because 2 > 3 is false, and False ∧ anything ≡ False.
    1  if (x == 0
    2      || (x-1) / x > 0.5) {
    3    …
    4  }
    5
    6  if (simpleOrOftenFalse(x)
    7      && complexOrOftenTrue(x)) {
    8    …
    9  }
Figure 3.16: Two handy ways to rely on short-circuit evaluation.
    if (p || !p) { /* "p or not p" */
        x = 51;
    } else {
        x = 63;
    }

    x = 51;
Figure 3.17: Two snippets of C code. When this code is compiled on a modern optimizing compiler (gcc 4.3.4, with optimization turned on), the machine code that is produced is exactly identical for both snippets.

3.3.4 Exercises
The operators ∧ and ∨ are idempotent (see Figure 3.12)—that is, p ∧ p ≡ p ∨ p ≡ p. But ⇒, ⊕, and ⇔ are not idempotent. Simplify—that is, give as-simple-as-possible propositions that are logically equivalent to—the following:
3.39 p ⇒ p
3.40 p ⊕ p
3.41 p ⇔ p
Consider the proposition p ⇒ ¬p ⇒ p ⇒ q. Add parentheses to this proposition so that the resulting proposition . . .
3.42 . . . is logically equivalent to True (that is, the result is a tautology).
3.43 . . . is logically equivalent to q.
3.44 Give as simple as possible a proposition logically equivalent to the (unparenthesized) original.
Unlike the binary connectives {∧, ∨, ⊕, ⇔}, implication is not associative. In other words, p ⇒ (q ⇒ r) and (p ⇒ q) ⇒ r are not logically equivalent. The next few exercises explore the non-associativity of ⇒.
3.45 Prove that implication is not associative by giving a truth assignment in which p ⇒ (q ⇒ r) and (p ⇒ q) ⇒ r have different truth values.
3.46 Consider the propositions p ⇒ (q ⇒ q) and (p ⇒ q) ⇒ q. One of these is a tautology; one of them is not. Which is which? Prove your answer.
3.47 Consider the propositions p ⇒ (p ⇒ q) and (p ⇒ p) ⇒ q. Is either one a tautology? Satisfiable? Unsatisfiable? What is the simplest proposition to which each is logically equivalent?
On an exam, I once asked students to write a proposition logically equivalent to p ⊕ q using only the logical connectives ⇒, ¬, and ∧. Here are some of the students’ answers. Which ones are right?
3.48 ¬(p ∧ q) ⇒ (¬p ∧ ¬q)
3.49 (p ⇒ ¬q) ∧ (q ⇒ ¬p)
3.50 (¬p ⇒ q) ∧ ¬(p ∧ q)
3.51 ¬[(p ∧ ¬q ⇒ ¬p ∧ q) ∧ (¬p ∧ q ⇒ p ∧ ¬q)]
3.52 Write a proposition logically equivalent to p ⊕ q using only the logical connectives ⇒, ¬, and ∨.
The following code uses nested conditionals, or compound propositions as conditions. Simplify each as much as possible. (For example, if p ⇒ q, it’s a waste of time to test whether q holds in a block where p is known to be true.)
3.53
    if (x > 20
        or (x <= 20 and y < 0))
    then foo(x,y)
    else bar(x,y)
3.54
    if (y >= 0
        or y <= x or (x - y) * y >= 0)
    then foo(x,y)
    else bar(x,y)
3.55
    if (x % 12 == 0):
      then if not (x % 4 == 0):
             then foo(x)
             else bar(x)
      else if (x == 17):
             then baz(x)
             else quz(x)
(Note that x % k == 0 is true when x mod k = 0, also known as when k | x.)
Simplify the following propositions as much as possible.
3.56 (¬p ⇒ q) ∧ (q ∧ p ⇒ ¬p)
3.57 (p ⇒ ¬p) ⇒ ((q ⇒ (p ⇒ p)) ⇒ p)
3.58 (p ⇒ p) ⇒ (¬p ⇒ ¬p) ∧ q
3.59 Is the following claim true or false? Prove your answer.
Claim: Every proposition over the single variable p is either logically equivalent to p or it is logically equivalent to ¬p.
Show using truth tables that these propositions from Figure 3.10 are tautologies:
3.60 (p ⇒ q) ∧ ¬q ⇒ ¬p (Modus Tollens)
3.61 p ⇒ p ∨ q
3.62 p ∧ q ⇒ p
3.63 (p ∨ q) ∧ ¬p ⇒ q
3.64 (p ⇒ q) ∧ (¬p ⇒ q) ⇒ q
3.65 (p ⇒ q) ∧ (q ⇒ r) ⇒ (p ⇒ r)
3.66 (p ⇒ q) ∧ (p ⇒ r) ⇔ p ⇒ q ∧ r
3.67 (p ⇒ q) ∨ (p ⇒ r) ⇔ p ⇒ q ∨ r
3.68 p ∧ (q ∨ r) ⇔ (p ∧ q) ∨ (p ∧ r)
3.69 p ⇒ (q ⇒ r) ⇔ p ∧ q ⇒ r

Show that the following propositions are tautologies:
3.70 p ∨ (p ∧ q) ⇔ p
3.71 p ∧ (p ∨ q) ⇔ p
3.72 p ⊕ q ⇒ p ∨ q
Prove De Morgan’s Laws:
3.73 ¬(p ∧ q) ≡ ¬p ∨ ¬q
3.74 ¬(p ∨ q) ≡ ¬p ∧ ¬q
Show the following logical equivalences regarding associativity using truth tables:
3.75 p ∨ (q ∨ r) ≡ (p ∨ q) ∨ r
3.76 p ∧ (q ∧ r) ≡ (p ∧ q) ∧ r
3.77 p ⊕ (q ⊕ r) ≡ (p ⊕ q) ⊕ r
3.78 p ⇔ (q ⇔ r) ≡ (p ⇔ q) ⇔ r
Show using truth tables that the following logical equivalences hold:
3.79 p ⇒ q ≡ ¬p ∨ q
3.80 p ⇒ (q ⇒ r) ≡ p ∧ q ⇒ r
3.81 p ⇔ q ≡ ¬p ⇔ ¬q
3.82 ¬(p ⇒ q) ≡ p ∧ ¬q
3.83 On p. 327, we discussed the use of tautologies in optimizing compilers. In particular, these compilers will perform the following optimization, transforming the first block of code into the second:
    if (p || !p) { /* "p or not p" */
        x = 51;
    } else {
        x = 63;
    }

    x = 51;
The compiler performs this transformation because p ∨ ¬p is a tautology—no matter what the truth value of p, the proposition p ∨ ¬p is true. But there are situations in which this code translation actually changes the behavior of the program, if p can be an arbitrary expression (rather than just a Boolean variable)! Describe such a situation. (Hint: why do (some) people watch auto racing?)
The unknown circuit in Figure 3.18 takes three inputs {p, q, r}, and either turns on a light bulb (output of the circuit = true) or leaves it off (output = false). For each of the following, draw a circuit—using at most three ∧, ∨, and ¬ gates—that is consistent with the listed behavior. The light’s status is unknown for unlisted inputs. (If multiple circuits are consistent with the given behavior, draw any one of them.)
3.84 The light is on when the true inputs are {q} or {r}. The light is off when the true inputs are {p} or {p, q} or {p, q, r}.
3.85 The light is on when the true inputs are {p, q} or {p, r}. The light is off when the true inputs are {p} or {q} or {r}.
3.86 The light is off when the true inputs are {p} or {q} or {r} or {p, q, r}.
3.87 The light is off when the true inputs are {p, q} or {p, r} or {q, r} or {p, q, r}.
3.88 Consider a simplified class of circuits like those from Exercises 3.84–3.87: there are two inputs {p, q} and at most two gates, each of which is ∧, ∨, or ¬. There are a total of 2^4 = 16 distinct propositions over inputs {p, q}: four different input configurations, each of which can turn the light on or leave it off. Which, if any, of these 16 propositions cannot be expressed using up to two {∧, ∨, ¬} gates?
3.89 (programming required) Consider the class of circuits from Exercises 3.84–3.87: inputs {p, q, r}, and at most three gates chosen from {∧, ∨, ¬}. There are a total of 2^8 = 256 distinct propositions over inputs {p, q, r}: eight different input configurations, each of which can turn the light on or leave it off. Write a program to determine how many of these 256 propositions can be represented by a circuit of this type. (If you design it well, your program will let you check your answers to Exercises 3.84–3.88.)
3.90 Consider a set S = {p, q, r, s, t} of Boolean variables. Let φ = p ⊕ q ⊕ r ⊕ s ⊕ t. Describe briefly the conditions under which φ is true. Use English and, if appropriate, standard (nonlogical) mathematical notation. (Hint: look at the symbol ⊕ itself. What’s p + q + r + s + t, treating true as 1 and false as 0 as in Exercises 3.23–3.26?)
[Figure: inputs p, q, r feeding a box labeled “unknown ≤ 3-gate circuit”]
Figure 3.18: A circuit with at most 3 gates.

3.91 Dithering is a technique for converting grayscale images to black-and-white images (for printed media like newspapers). The classic dithering algorithm proceeds as follows. For every pixel in the image, going from top to bottom (“north to south”), and from left to right (“west to east”):
• “Round” the current pixel to black or white. (If it’s closer to black, make it black; if it’s closer to white, make it white.)
• This alteration to the current pixel has created “rounding error” x (in other words, we have added x > 0 “whiteness units” by making it white, or x < 0 “whiteness units” by making it black). We compensate for this by adding a total of −x “whiteness units,” distributed among the neighboring pixels to the “east” (add −7x/16 to the eastern neighboring pixel), “southwest” (−3x/16), “south” (−5x/16), and “southeast” (−x/16). If any of these neighboring pixels don’t exist (because the current pixel is on the border of the image), simply ignore the corresponding fraction of −x (and don’t add it anywhere).
I assigned a dithering exercise in an introductory CS class, and I got, more or less, the code in Figure 3.19 from one student. This code is correct, but it is very repetitious. Reorganize this code so that it’s not so repetitive. In particular, rewrite lines 7–63 ensuring that each “distribute the error” line (9, 11, 12, and 13) appears only once in your solution.
     1  for y = 1 … height:
     2    for x = 1 … width:
     3      if P[x,y] is more white than black:
     4        error = "white" - P[x,y]
     5        P[x,y] = "white"
     6
     7        if x > 1:
     8          if x < width and not (y < height):
     9            add 7/16 · error to P[x+1,y]     (E)
    10          else if x < width and y < height:
    11            add 5/16 · error to P[x,y+1]     (S)
    12            add 1/16 · error to P[x-1,y+1]   (SW)
    13            add 3/16 · error to P[x+1,y+1]   (SE)
    14            add 7/16 · error to P[x+1,y]     (E)
    15          else if y < height
    16                  and not (x < width):
    17            add 5/16 · error to P[x,y+1]     (S)
    18            add 1/16 · error to P[x-1,y+1]   (SW)
    19          else:
    20            do nothing
    21        else:
    22          if x < width and not (y < height):
    23            add 7/16 · error to P[x+1,y]     (E)
    24          else if x < width and y < height:
    25            add 5/16 · error to P[x,y+1]     (S)
    26            add 3/16 · error to P[x+1,y+1]   (SE)
    27            add 7/16 · error to P[x+1,y]     (E)
    28          else if y < height
    29                  and not (x < width):
    30            add 5/16 · error to P[x,y+1]     (S)
    31          else:
    32            do nothing
    33
    34      else: # P[x,y] is closer to "black"
    35        error = "black" - P[x,y]
    36        P[x,y] = "black"
    37
    38        if x > 1:
    39          if x < width and not (y < height):
    40            add 7/16 · error to P[x+1,y]     (E)
    41          else if x < width and y < height:
    42            add 5/16 · error to P[x,y+1]     (S)
    43            add 1/16 · error to P[x-1,y+1]   (SW)
    44            add 3/16 · error to P[x+1,y+1]   (SE)
    45            add 7/16 · error to P[x+1,y]     (E)
    46          else if y < height
    47                  and not (x < width):
    48            add 5/16 · error to P[x,y+1]     (S)
    49            add 1/16 · error to P[x-1,y+1]   (SW)
    50          else:
    51            do nothing
    52        else:
    53          if x < width and not (y < height):
    54            add 7/16 · error to P[x+1,y]     (E)
    55          else if x < width and y < height:
    56            add 5/16 · error to P[x,y+1]     (S)
    57            add 3/16 · error to P[x+1,y+1]   (SE)
    58            add 7/16 · error to P[x+1,y]     (E)
    59          else if y < height
    60                  and not (x < width):
    61            add 5/16 · error to P[x,y+1]     (S)
    62          else:
    63            do nothing
Figure 3.19: Some dithering code.
Recall Definition 3.16: a proposition φ is in conjunctive normal form (CNF) if φ is the conjunction of one or more clauses, where each clause is the disjunction of one or more literals, and where a literal is an atomic proposition or its negation. Further, recall Definition 3.17: φ is in disjunctive normal form (DNF) if φ is the disjunction of one or more clauses, where each clause is the conjunction of one or more literals.
Give a proposition in disjunctive normal form that’s logically equivalent to . . .
3.92 ¬(p ∧ q) ⇒ r
3.93 p ∧ (q ∨ r) ⇒ (q ∧ r)
3.94 p ∨ ¬(q ⇔ p ∧ r)
3.95 p ⊕ (¬p ⇒ (q ⇒ r) ∧ ¬r)
Give a proposition in conjunctive normal form that’s logically equivalent to . . .
3.96 ¬(p ∧ q) ⇒ r
3.97 p ∧ (q ⇒ (r ⇒ q ⊕ r))
3.98 (p ⇒ q) ⇒ (q ⇒ r ∧ p)
3.99 p ⇔ (q ∨ r ∨ ¬p)
A CNF proposition φ is in 3CNF if each clause contains exactly three distinct literals. (Note that p and ¬p are distinct literals.) In terms of the number of clauses, what’s the smallest 3CNF formula . . .
3.100 . . . that’s a tautology?
3.101 . . . that’s not satisfiable?
Consider the set of 3CNF propositions over the variables {p, q, r} for which no clause appears more than once. (Exercises 3.102–3.104 turn out to be boring without the restriction of no repeated clauses; we could repeat the same clause as many times as we please: (p ∨ q ∨ r) ∧ (p ∨ q ∨ r) ∧ (p ∨ q ∨ r) ···.) Two clauses that contain precisely the same literals (in any order) do not count as distinct. (But recall that a single clause can contain a variable in both negated and unnegated form.) In terms of the number of clauses, what’s the largest 3-variable distinct-clause 3CNF proposition . . .
3.102 . . . at all (with no further restrictions)?
3.103 . . . that’s a tautology?
3.104 . . . that’s satisfiable?
A proposition φ is in 3DNF if it is the disjunction of one or more clauses, each of which is the conjunction of exactly three distinct literals. In terms of the number of clauses, what’s the smallest 3DNF formula . . .
3.105 . . . that’s a tautology?
3.106 . . . that’s not satisfiable?
3.4 An Introduction to Predicate Logic
But the fact that some geniuses were laughed at does not imply that all who are laughed at are geniuses. They laughed at Columbus, they laughed at Fulton, they laughed at the Wright brothers. But they also laughed at Bozo the Clown.
Carl Sagan (1934–1996)
Broca’s Brain: Reflections on the Romance of Science (1979)
Propositional logic, which we have been discussing thus far, gives us formal notation to encode Boolean expressions. But these expressions are relatively simple, a sort of “unstructured programming” style of logic. Predicate logic is a more general type of logic that allows us to write function-like logical expressions called predicates, and to express a broader range of notions than in propositional logic.
3.4.1 Predicates
Informally, a predicate is a property that a particular entity might or might not have; for example, being a vowel is a property that some letters do have (A, E, . . .) and some letters do not have (B, C, . . .). A predicate isn’t the kind of thing that’s true or false, so predicates are different from propositions; rather, a predicate is like a “proposition with blanks” waiting to be filled in. For example:
Example 3.28 (Some predicates)
• “The integer ____ is prime.”
• “The string ____ is a palindrome.”
• “The person ____ costarred in a movie with Kevin Bacon.”
• “The string ____ is alphabetically after the string ____.”
• “The integer ____ evenly divides the integer ____.”
Once the blanks of a predicate are filled in, the resulting expression is a proposition.
Here are some examples of propositions—some true, some false—derived from the predicates in Example 3.28:
Example 3.29 (Some propositions derived from Example 3.28)
• “The integer 57 is prime.”
• “The string TENET is a palindrome.”
• “The person Sean Connery costarred in a movie with Kevin Bacon.”
• “The string PYTHON is alphabetically after the string PYTHAGOREAN.”
• “The integer 17 evenly divides the integer 42.”
We can now give a formal definition of predicates:
Definition 3.18 (Predicate)
A predicate P is a Boolean-valued function—that is, P is a function P : U → {True, False} for a set U. The set U is called the universe or the domain of discourse, and we say that P is a predicate over U.
When the universe U is clear from context, we will allow ourselves to be sloppy with notation by leaving U implicit.
Although we didn’t use the name at the time, we’ve already encountered predicates, in Chapter 2. Definition 2.18 introduced the notation {x ∈ U : P(x)} to denote the set of those objects x ∈ U for which P is true. The set abstraction notation “selects” the elements of U for which the predicate P is true.
Example 3.30 (Some example predicates)
Here are a few more sample predicates based on arithmetic:
1. isPrime(n): the positive integer n is a prime number.
2. isPowerOf(n, k): the integer n is an exact power of k: n = k^i for some i ∈ Z≥0.
3. onlyPowersOfTwo(S): every element of the set S is a power of two.
4. Q(n, a, b): positive integer n satisfies n = a + b, and integers a and b are both prime.
5. sumOfTwoPrimes(n): positive integer n is equal to the sum of two prime numbers.
(To reiterate Definition 3.18, the isPrime predicate, for example, is a function isPrime : Z>0 → {True, False}.)
Deriving propositions from predicates
Again, by plugging particular values into the predicates from Example 3.30, we get propositions, each of which has a truth value:
Example 3.31 (Propositions derived from predicates)
Using the predicates in Example 3.30, let’s figure out the truth values of the propositions isPrime(261), isPrime(262), Q(8, 3, 5), and Q(9, 3, 6). For each, we’ll simply plug the given arguments into the definition of the predicate and figure out the truth value of the resulting proposition.
• A little arithmetic shows that 261 = 3 · 87; thus isPrime(261) = False.
• Similarly, we have 262 = 2 · 131, so isPrime(262) = False.
• To compute the truth value of Q(8, 3, 5), we simply plug n = 8, a = 3, and b = 5 into the definition of Q(n, a, b). The proposition Q(8, 3, 5) requires that the positive integer 8 satisfies 8 = 3 + 5, and the integers 3 and 5 are both prime. All of the requirements are met, so Q(8, 3, 5) = True.
• On the other hand, Q(9, 3, 6) = False because Q(9, 3, 6) requires that 9 = 3 + 6, and that the integers 3 and 6 are both prime. But 6 isn’t prime.

Just like the propositional logical connectives, each predicate takes a fixed number of arguments. So a predicate might be unary (taking one argument, like the predicate isPrime); or binary (taking two arguments, like isPowerOf ); or ternary (taking three arguments, like Q from Example 3.30); and so forth. Here are a few more examples:
Example 3.32 (More propositions derived from predicates)
Problem: Using the predicates in Example 3.30, find the truth values of these propositions:
1. sumOfTwoPrimes(17) and sumOfTwoPrimes(34)
2. isPowerOf(16, 2) and isPowerOf(2, 16)
3. onlyPowersOfTwo({1, 2, 8, 128})
Solution: As before, we just plug the given arguments into the definition:
1. sumOfTwoPrimes(17) = False: the only way to get an odd number n by adding two prime numbers is for one of those prime numbers to be 2—but 17 − 2 = 15, and 15 isn’t prime. But sumOfTwoPrimes(34) = True, because 34 = 17 + 17, and 17 is prime. (And the other 17 is prime, too.)
2. isPowerOf(16, 2) = True because 2^4 = 16 (and the exponent 4 is an integer), but isPowerOf(2, 16) = False because 16^(1/4) = 2 (and 1/4 is not an integer).
3. onlyPowersOfTwo({1, 2, 8, 128}) = True because every element of {1, 2, 8, 128} is a power of two: {1, 2, 8, 128} = {2^0, 2^1, 2^3, 2^7}.
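Since a predicate is a Boolean-valued function, the predicates of Example 3.30 can be written as ordinary functions and the truth values from Examples 3.31 and 3.32 checked by calling them. The sketch below is one possible rendering in Python; the snake_case names and the particular algorithms (trial division, repeated multiplication) are choices made for this illustration, and is_power_of assumes k ≥ 2.

```python
def is_prime(n):
    """isPrime(n): trial division by every d with d*d <= n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_power_of(n, k):
    """isPowerOf(n, k): n == k**i for some integer i >= 0 (assumes k >= 2)."""
    power = 1
    while power < n:
        power *= k
    return power == n

def only_powers_of_two(s):
    """onlyPowersOfTwo(S): every element of S is a power of two."""
    return all(is_power_of(x, 2) for x in s)

def q(n, a, b):
    """Q(n, a, b): n = a + b, and a and b are both prime."""
    return n == a + b and is_prime(a) and is_prime(b)

def sum_of_two_primes(n):
    """sumOfTwoPrimes(n): try every split n = a + (n - a) with a >= 2."""
    return any(q(n, a, n - a) for a in range(2, n - 1))

print(is_prime(261), is_prime(262), q(8, 3, 5), q(9, 3, 6))
print(sum_of_two_primes(17), sum_of_two_primes(34))
print(is_power_of(16, 2), is_power_of(2, 16), only_powers_of_two({1, 2, 8, 128}))
```

Calling the same function on different arguments is the programming counterpart of deriving different propositions from one predicate.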
These brief examples may already be enough to begin to give you a sense of the power of logical abstraction that predicates grant us: we can now consider the same logical “condition” applied to two different “arguments.” In a sense, propositional logic is like programming without functions; letting ourselves use predicates allows us to write two related propositions using related notation, and to reason simultaneously about multiple propositions—just like writing a function in Java allows you to think simultaneously about the same function applied to different arguments.
Taking it further: Predicates give a convenient way of representing the state of play of multiplayer games like Tic-Tac-Toe, checkers, and chess. The basic idea is to define a predicate P(B) that expresses “Player 1 will win from board position B if both players play optimally.” For more on this idea, and on the application of logic (both predicate and propositional) to playing these kinds of games, see the discussion on p. 344.
3.4.2 Quantifiers
We’ve seen that we can form a proposition from a predicate by applying that predicate to a particular argument. But we can also form a proposition from a predicate using quantifiers, which allow us to formalize statements like every Java program contains at least four for loops (false!) or there is a proposition that cannot be expressed using only the connectives ∧ and ∨ (true! See Exercise 4.71).
These types of statements are expressed by the two standard quantifiers, the universal (“every”) and existential (“some”) quantifiers (see Figure 3.20):
Definition 3.19 (Universal quantifier (“for all”): ∀)
Let P be a predicate over the universe S. The proposition ∀x ∈ S : P(x) (“for all x in S, P(x)”) is true if, for every possible x ∈ S, P(x) is true.
Definition 3.20 (Existential quantifier (“there exists”): ∃)
Let P be a predicate over the universe S. The proposition ∃x ∈ S : P(x) (“there exists an x in S such that P(x)”) is true if, for at least one possible x ∈ S, we have that P(x) is true.
    ∀x ∈ S : P(x)    “for all” (universal quantifier)           true if P(x) is true for every x ∈ S.
    ∃x ∈ S : P(x)    “there exists” (existential quantifier)    true if P(x) is true for at least one x ∈ S.
Figure 3.20: Summary of notation for predicate logic.
The for all notation is ∀, an upside-down ‘A’ as in “all”; the exists notation is ∃, a backward ‘E’ as in “exists.” (Annoyingly, they had to be flipped in different directions: a backward ‘A’ is still an ‘A,’ and an upside-down ‘E’ is still an ‘E.’)
Here’s an example of two simple numerical propositions using these quantifiers:
Example 3.33 (Simple propositions using quantifiers)
Problem: What are the truth values of the following two propositions?
1. ∀n ∈ Z≥2 : isPrime(n)
2. ∃n ∈ Z≥2 : isPrime(n)
Solution:
1. False. This proposition says “every integer n ≥ 2 is prime.” This statement is false because, for example, the integer 32 is greater than or equal to 2 and is not prime.
2. True. The proposition says “there exists an integer n ≥ 2 that is prime.” This statement is true because, for example, the integer 31 (which is greater than or equal to 2) is prime.
In addition, we can make precise many intuitive statements using quantifiers. For example, we can use quantifiers to formalize the predicates from Example 3.30. (See Figure 3.21 for a reminder.)
    isPrime(n): n ∈ Z>0 is a prime number.
    isPowerOf(n, k): n ∈ Z is an exact power of k.
    onlyPowersOfTwo(S): every element of S is a power of two.
    Q(n, a, b): n ∈ Z>0 satisfies n = a + b, and a, b ∈ Z are both prime.
    sumOfTwoPrimes(n): n ∈ Z>0 is equal to the sum of two prime numbers.
Figure 3.21: Reminder of the predicates from Example 3.30.
Example 3.34 (Some example predicates, formalized)
isPrime(n): An integer n ∈ Z>0 is prime if and only if n ≥ 2 and the only integers that evenly divide n are 1 and n itself. Thus we are really expressing a condition on every candidate divisor d: either d ∈ {1, n}, or d doesn’t evenly divide n. Using the “divides” notation from Definition 2.10, we can formalize isPrime(n) as
    n ≥ 2 ∧ [∀d ∈ Z≥1 : (d | n ⇒ d = 1 ∨ d = n)].
isPowerOf(n, k): We can formalize this predicate as ∃i ∈ Z≥0 : n = k^i.
onlyPowersOfTwo(S): Because isPowerOf(n, 2) expresses the condition that n is a power of two, we can formalize this predicate as ∀x ∈ S : isPowerOf(x, 2).
Q(n, a, b): Formalizing Q actually doesn’t require a quantifier at all; we can simply write Q(n, a, b) as (n = a + b) ∧ isPrime(a) ∧ isPrime(b).
sumOfTwoPrimes(n): This predicate requires that there exist prime numbers a and b that sum to n. Given our definition of Q, we can write sumOfTwoPrimes(n) as
    ∃⟨a, b⟩ ∈ Z × Z : Q(n, a, b).
(“There exists a pair of integers ⟨a, b⟩ such that Q(n, a, b).”) Or we could write sumOfTwoPrimes(n) as ∃a ∈ Z : [∃b ∈ Z : Q(n, a, b)], by nesting one quantifier within the other. (See Section 3.5.)
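The quantified formulas of Example 3.34 translate almost symbol for symbol into Python's all and any, provided each quantifier ranges over a finite set. The sketch below makes that restriction explicit, and the restriction is an assumption of this illustration rather than part of the text: divisors of n are searched only in {1, ..., n}, and primes summing to n only in {2, ..., n}, which loses nothing because no larger candidate could satisfy the condition.

```python
from itertools import product

def divides(d, n):
    return n % d == 0

def is_prime(n):
    # n >= 2  and  [for all d in {1..n}: d | n  implies  d = 1 or d = n]
    # (an implication "a implies b" is written as "not a or b")
    return n >= 2 and all(not divides(d, n) or d == 1 or d == n
                          for d in range(1, n + 1))

def q(n, a, b):
    # (n = a + b) and isPrime(a) and isPrime(b): no quantifier needed
    return n == a + b and is_prime(a) and is_prime(b)

def sum_of_two_primes_pair(n):
    # there exists a pair <a, b> such that Q(n, a, b)
    return any(q(n, a, b) for a, b in product(range(2, n + 1), repeat=2))

def sum_of_two_primes_nested(n):
    # there exists a such that [there exists b such that Q(n, a, b)]
    return any(any(q(n, a, b) for b in range(2, n + 1))
               for a in range(2, n + 1))

print(is_prime(31), is_prime(32))
print(sum_of_two_primes_pair(34), sum_of_two_primes_nested(34))
```

The pair version and the nested version agree on every input, matching the two ways of writing sumOfTwoPrimes(n) in the example.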
Here’s one further example, regarding the prefix relationship between two strings:
Example 3.35 (Prefixes, formalized)
A binary string x ∈ {0, 1}^k is a prefix of the binary string y ∈ {0, 1}^n, for n ≥ k, if y is x with some extra bits added on at the end. For example, 01 and 0110 are both prefixes of 01101010, but 1 is not a prefix of 01101010. If we write |x| and |y| to denote the length of x and y, respectively, then we can formalize isPrefixOf(x, y) as
    |x| ≤ |y| ∧ [∀i ∈ {i ∈ Z : 1 ≤ i ≤ |x|} : x_i = y_i].
In other words, y must be no shorter than x, and the first |x| characters of y must equal their corresponding characters in x.
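The formalization of isPrefixOf can be transcribed directly. In this Python sketch (the name is_prefix_of is a choice made here), strings are indexed from 0 rather than from 1, so the quantified index runs over 0, ..., |x| − 1.

```python
def is_prefix_of(x, y):
    # |x| <= |y|  and  [for all i in {0, ..., |x| - 1}: x[i] == y[i]]
    # If x is longer than y, the "and" short-circuits, so y[i] is never
    # read at an index that does not exist.
    return len(x) <= len(y) and all(x[i] == y[i] for i in range(len(x)))

print(is_prefix_of("01", "01101010"))
print(is_prefix_of("0110", "01101010"))
print(is_prefix_of("1", "01101010"))
```

On strings, this agrees with Python's built-in y.startswith(x).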
Quantifiers as loops
One useful way of thinking about these quantifiers is by analogy to loops in programming. If we ever encounter an x ∈ S for which ¬P(x) = True, then we immediately know that ∀x ∈ S : P(x) is false. Similarly, any x ∈ S for which Q(x) = True is enough to demonstrate that ∃x ∈ S : Q(x) is true. But if we “loop through” all candidate values of x and fail to encounter an x with ¬P(x) or Q(x), we know that ∀x ∈ S : P(x) is true or ∃x ∈ S : Q(x) is false. By this analogy, we might think of the two standard quantifiers as executing the programs in Figure 3.22(a) for ∀, and Figure 3.22(b) for ∃.
    for x in S:
        if not P(x) then
            return False
    return True
(a) A loop corresponding to ∀x ∈ S : P(x).
    for x in S:
        if Q(x) then
            return True
    return False
(b) A loop corresponding to ∃x ∈ S : Q(x).
Figure 3.22: Two for loops that return the value of ∀x ∈ S : P(x) and ∃x ∈ S : Q(x).
Another intuitive and useful way to think about these quantifiers is as a supersized version of ∧ and ∨:
    ∀x ∈ {x1, x2, …, xn} : P(x) ≡ P(x1) ∧ P(x2) ∧ ··· ∧ P(xn)
    ∃x ∈ {x1, x2, …, xn} : P(x) ≡ P(x1) ∨ P(x2) ∨ ··· ∨ P(xn)
336 CHAPTER 3. LOGIC
The first of these propositions is true only if every one of the P(x_i) terms is true; the second is true if at least one of the P(x_i) terms is true.
There is one way in which these analogies are loose, though: just as for ∑ (summation) and ∏ (product) notation (from Section 2.2.7), the loop analogy only makes sense when the domain of discourse is finite! The Figure 3.22(a) “program” for a true proposition ∀x ∈ Z : P(x) would have to complete an infinite number of iterations before returning True. But the intuition may still be helpful.
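For a finite domain, the two loops from Figure 3.22 can be run directly. The Python sketch below uses hypothetical names for_all and exists and also compares them against Python's built-in all and any, which behave the same way on finite collections. It is only an illustration of the analogy, and it inherits the limitation just described: for_all cannot return True on an infinite domain.

```python
def for_all(S, P):
    # Figure 3.22(a): stop at the first counterexample.
    for x in S:
        if not P(x):
            return False
    return True

def exists(S, Q):
    # Figure 3.22(b): stop at the first example.
    for x in S:
        if Q(x):
            return True
    return False

def is_even(x):
    return x % 2 == 0

S = [2, 4, 8, 16]
T = [1, 3, 4]

# The built-ins all() and any() are the same loops, already written for us.
assert for_all(S, is_even) == all(is_even(x) for x in S)
assert exists(T, is_even) == any(is_even(x) for x in T)
```

The "supersized ∧ and ∨" view is visible here too: all(...) is a conjunction over every element, and any(...) is a disjunction.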
Precedence and parenthesization
As in propositional logic, we’ll adopt standard conventions regarding order of operations so that we don’t overdose on parentheses. We treat the quantifiers ∀ and ∃ as binding tighter than the propositional logical connectives. Thus
∀x ∈ S : P(x) ⇒ ∃y ∈ S : P(y)
will be understood to mean
[∀x ∈ S : P(x)] ⇒ [∃y ∈ S : P(y)].
To express the other reading (which involves nested quantifiers; see Section 3.5), we can use parentheses explicitly, by writing ∀x ∈ S : [P(x) ⇒ ∃y ∈ S : P(y)].
Free and bound variables
Consider the variables x and y in the expressions
3 | x and ∀y ∈ Z : 3 | y.
Understanding the first of these expressions requires knowledge of what x means, whereas the second is a self-contained statement that can be understood without any outside knowledge. The variable x is called a free or unbound variable: its value is not fixed by the expression. In contrast, the variable y is a bound variable: its value is defined within the expression itself. We say that the quantifier binds the variable y, and the scope or body of the quantifier is the part of the expression in which it has bound y. (We’ve encountered bound variables before; they arise whenever a variable name is assigned a value within an expression. For example, the variable i is bound in the arithmetic expression ∑_{i=1}^{10} i^2, as is the variable n in {n ∈ Z : |n| ≤ |n^2|}.)
A single expression can contain both free and bound variables: for example, the expression ∃y ∈ Z≥0 : x ≥ y contains a bound variable y and a free variable x. Here’s another example:
Example 3.36 (Free and bound variables)
Problem: Which variables are free in the following expression?
[∀x ∈ Z : x^2 ≥ y] ∧ [∀z ∈ Z : y = z ∨ z^y = 1]
Solution: The variable y doesn’t appear as the variable bound by either of the quantifiers in this expression, so y is a free variable. Both x and z are bound by the universal quantifiers. (Incidentally, this expression is true if and only if y = 0.)

To test whether a particular variable x is free or bound in an expression, we can (consistently) replace x by a different name in that expression. If the meaning stays the same, then x is bound; if the meaning changes, then x is free. For example:
Example 3.37 (Testing for free and bound variables)
Consider the following pairs of propositions:
∃x ∈ S : x > 251 and ∃y ∈ S : y > 251 (A)
x ≥ 42x and y ≥ 42y (B)
The expressions in (A) express precisely the same condition, namely: some element of S is greater than 251. Thus, the variables x and y in these two expressions are bound.
But the expressions in (B) mean different things, in the sense that we can construct a context in which these two statements have different truth values (for example,
x = 3 and y = −2). The first expression states a condition on the value of x, and the latter states a condition on the value of y. So x is a free variable in “x ≥ 42x.”
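The renaming test from Example 3.37 can be tried out in Python, where the variable of a generator expression plays the role of a bound variable and a function parameter plays the role of a free one. This is only a sketch: the set S and the function name b_holds are made up for illustration.

```python
S = {17, 42, 300}  # an arbitrary example set

# (A): x and y are bound by the generator expression, so the name is irrelevant.
a_with_x = any(x > 251 for x in S)
a_with_y = any(y > 251 for y in S)
assert a_with_x == a_with_y

# (B): the variable is free, so the truth value depends on what it refers to.
def b_holds(v):
    return v >= 42 * v

assert b_holds(3) != b_holds(-2)  # 3 >= 126 is false, but -2 >= -84 is true
```

The two versions of (A) can never disagree, whatever S is, whereas (B) takes different truth values for x = 3 and y = −2, just as in the example.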
Taking it further: The free-versus-bound-variable distinction is also something that may be familiar from programming, at least in some programming languages. There are some interesting issues in the design and implementation of programming languages that center on how free variables in a function definition, for example, get their values. See the discussion on p. 345.
An expression of predicate logic that contains no free variables is called fully quantified. For expressions that are not fully quantified, we adopt a standard convention that any unbound variables in a stated claim are implicitly universally quantified. For example, consider these claims:
Claim A: If x ≥ 1, then x^2 ≤ x^3.
Claim B: For all x ∈ R, if x ≥ 1, then x^2 ≤ x^3.
When we write a (true) claim like Claim A, we will implicitly interpret it to mean Claim B. (Note that Claim B also explicitly notes R as the domain of discourse, which was left implicit in Claim A.)
3.4.3 Theorem and Proof in Predicate Logic
Recall that a tautology is a proposition that is always true—in other words, it is true no matter what each Boolean variable p in the proposition “means” (that is, whether p is true or false). In this section, we will be interested in the corresponding notion of always-true statements of predicate logic, which are called theorems. A statement of predicate logic is “always true” when it’s true no matter what its predicates mean. (Formally, the “meaning” of a predicate P is the set of elements of the universe U for which the predicate is true—that is, {x ∈ U : P(x)}.)
Definition 3.21 (Theorems in predicate logic)
A fully quantified expression of predicate logic is a theorem if and only if it is true for every possible meaning of each of its predicates.

Analogously, two fully quantified expressions are logically equivalent if, for every possible meaning of their predicates, the two expressions have the same truth values.
We’ll begin with a simple example of a theorem and a nontheorem:
Example 3.38 (A theorem of predicate logic)
Let S be any set. The following claim is true regardless of what the predicate P denotes:
∀x ∈ S : [P(x) ∨ ¬P(x)].
Indeed, this claim simply says that every x ∈ S either makes P(x) true or P(x) false. And that assertion is true if the predicate P(x) is “x ≥ 42” or “x has red hair” or “x prefers programming in Python to playing Parcheesi”—indeed, it’s true for any predicate P.
Example 3.39 (A nontheorem)
Let’s show that the following proposition is not a theorem:
[∀x ∈ S : P(x)] ∨ [∀x ∈ S : ¬P(x)].
A theorem must be true regardless of P’s meaning, so we can establish that this proposition isn’t a theorem by giving an example predicate that makes it false. Here’s one: let P be isPrime (where S is Z). Observe that ∀x ∈ Z : isPrime(x) is false because isPrime(4) = False; and ∀x ∈ Z : ¬isPrime(x) is false because ¬isPrime(5) = False. Thus the given proposition is false when P is isPrime, and so it is not a theorem.
Note the crucial difference between Example 3.38, which states that every element of
S either makes P true or makes P false, and Example 3.39, which states that either every element of S makes P true, or every element of S makes P false. (Intuitively, it’s the difference between “Every letter is either a vowel or a consonant” and “Every letter is a vowel or every letter is a consonant.” The former is true; the latter is false.)
Example 3.39 establishes that the proposition [∀x ∈ S : P(x)] ∨ [∀x ∈ S : ¬P(x)] isn’t true for every meaning of the predicate P, but it may be true for some meanings. For example, if P(x) is the predicate x^2 ≥ 0 and S is the set R, then this disjunction is true (because ∀x ∈ R : x^2 ≥ 0 is true).
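On a small finite set, we can enumerate every possible meaning of P and test Examples 3.38 and 3.39 mechanically. The sketch below is only an experiment, not a proof: the helper all_predicates and the set S = {1, 2, 3} are our own illustrative choices. Passing the test on one finite set does not establish that a statement is a theorem, but a single failing meaning is enough to show that it is not.

```python
from itertools import product

def all_predicates(S):
    """Yield every predicate on the finite set S, as a dict from element to truth value."""
    elems = list(S)
    for values in product([False, True], repeat=len(elems)):
        yield dict(zip(elems, values))

S = [1, 2, 3]

# Example 3.38: for all x in S : [P(x) or not P(x)], under every meaning of P.
ex_338_always_true = all(
    all(P[x] or not P[x] for x in S)
    for P in all_predicates(S)
)

# Example 3.39: [for all x : P(x)] or [for all x : not P(x)], under every meaning of P.
ex_339_always_true = all(
    all(P[x] for x in S) or all(not P[x] for x in S)
    for P in all_predicates(S)
)
```

Here ex_338_always_true comes out True and ex_339_always_true comes out False: any predicate that is true for some elements of S and false for others is a counterexample to the second statement.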
The challenge of proofs in predicate logic
The remainder of this section states some theorems of predicate logic, along with an
initial discussion of how we might prove that they’re theorems. (A proof of a statement is simply a convincing argument that the statement is a theorem.) Much of the rest of the book will be devoted to developing and writing proofs of theorems like these, and Chapter 4 will be devoted exclusively to some techniques and strategies for proofs. (This section will preview some of the ideas we’ll see there.) Some theorems of predicate logic are summarized in Figure 3.23; we’ll prove a few of them here, and you’ll return to some of the others in the exercises.

While predicate logic allows us to express claims that we couldn’t state without quantifiers, that extra expressiveness comes with a cost! For a quantifier-free proposition (like all propositions in Sections 3.2–3.3), there is a straightforward—if tedious—algorithm to decide whether a given proposition is a tautology: first, build a truth table for the proposition; and, second, check to make sure that the proposition is true in every row. It turns out that the analogous question for predicate logic is much more difficult—in fact, impossible to solve in general: there’s no algorithm that’s guaranteed to figure out whether a given fully quantified expression is a theorem! Demonstrating that a statement in predicate logic is a theorem will require you to think in a way that demonstrating that a statement in propositional logic is a tautology did not.
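The truth-table algorithm for propositional logic is short enough to write out. In this Python sketch, a proposition is represented as a function of Boolean arguments; that representation, and the names is_tautology and implies, are assumptions made for illustration.

```python
from itertools import product

def is_tautology(formula, num_vars):
    """Check every row of the truth table; True iff the formula holds in all of them."""
    rows = product([False, True], repeat=num_vars)
    return all(formula(*row) for row in rows)

def implies(p, q):
    # p => q is false only when p is true and q is false.
    return (not p) or q

# p => (p or q) is a tautology; (p or q) => p is not.
assert is_tautology(lambda p, q: implies(p, p or q), 2)
assert not is_tautology(lambda p, q: implies(p or q, p), 2)
```

The table has 2^n rows for n variables, so the algorithm is tedious but always terminates. No analogous loop exists for predicate logic, because there is no finite list of "all meanings of P" to iterate over.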
Taking it further: See the discussion on p. 346 for more about the fact that there’s no algorithm guaranteed to determine whether a given proposition is a theorem. The absence of such an algorithm sounds like bad news; it means that proving predicate-logic statements is harder, because you can’t just plug-and-chug into a simple algorithm to figure out whether a given statement is actually always true. But this fact is also precisely the reason that creativity plays a crucial role in proofs and in theoretical computer science more generally—and why, arguably, proving things can be fun! (For me, this difference is exactly why I find Sudoku less interesting than crossword puzzles: when there’s no algorithm to solve a problem, we have to embrace the creative challenge in attacking it.)
3.4.4 A Few Examples of Theorems and Proofs
In the rest of this section, we will see a few further theorems of predicate logic, with proofs. As we’ve said, there’s no formulaic approach to prove these theorems; we’ll need to employ a variety of strategies in this endeavor.
∀x ∈ S : [P(x) ∨ ¬P(x)]
¬[∀x ∈ S : P(x)] ⇔ [∃x ∈ S : ¬P(x)]    (De Morgan’s Laws, quantified form)
¬[∃x ∈ S : P(x)] ⇔ [∀x ∈ S : ¬P(x)]    (De Morgan’s Laws, quantified form)
[∀x ∈ S : P(x)] ⇒ [∃x ∈ S : P(x)]    (if the set S is nonempty)
∀x ∈ ∅ : P(x)    (Vacuous quantification)
¬∃x ∈ ∅ : P(x)    (Vacuous quantification)
∃x ∈ S : [P(x) ∨ Q(x)] ⇔ [∃x ∈ S : P(x)] ∨ [∃x ∈ S : Q(x)]
∀x ∈ S : [P(x) ∧ Q(x)] ⇔ [∀x ∈ S : P(x)] ∧ [∀x ∈ S : Q(x)]
∃x ∈ S : [P(x) ∧ Q(x)] ⇒ [∃x ∈ S : P(x)] ∧ [∃x ∈ S : Q(x)]
∀x ∈ S : [P(x) ∨ Q(x)] ⇐ [∀x ∈ S : P(x)] ∨ [∀x ∈ S : Q(x)]
[∀x ∈ S : P(x) ⇒ Q(x)] ∧ [∀x ∈ S : P(x)] ⇒ [∀x ∈ S : Q(x)]
[∀x ∈ {y ∈ S : P(y)} : Q(x)] ⇔ [∀x ∈ S : P(x) ⇒ Q(x)]
[∃x ∈ {y ∈ S : P(y)} : Q(x)] ⇔ [∃x ∈ S : P(x) ∧ Q(x)]
φ ∧ [∃x ∈ S : P(x)] ⇔ [∃x ∈ S : φ ∧ P(x)]    (if x does not appear as a free variable in φ)
φ ∨ [∀x ∈ S : P(x)] ⇔ [∀x ∈ S : φ ∨ P(x)]    (if x does not appear as a free variable in φ)
Figure 3.23: A few theorems involving quantification.

Negating quantifiers: a first example
Suppose that your egomaniacal, overconfident partner from Intro CS wanders into
the lab and says For any array A that you give me, partner, my implementation of insertion sort correctly sorts A. You know, though, that your partner is wrong. (You spot a bug in his egomaniacal code.) What would that mean? Well, you might reply, gently but firmly: There’s an array A for which your implementation of insertion sort does not correctly sort A. The equivalence that you’re using is a theorem of predicate logic:
Example 3.40 (Negating universal quantifiers)
Let’s prove the equivalence you’re using to debunk your partner’s claim:
¬[∀x ∈ S : P(x)] ⇔ [∃x ∈ S : ¬P(x)].
Perhaps the easiest way to view this claim is as a quantified version of the tautology ¬(p ∧ q) ⇔ ¬p ∨ ¬q, which was one of De Morgan’s Laws from propositional logic. If we think of ∀x ∈ S : P(x) as P(x_1) ∧ P(x_2) ∧ P(x_3) ∧ ···, then
¬[∀x ∈ S : P(x)] ∼ ¬[P(x_1) ∧ P(x_2) ∧ P(x_3) ∧ ···]
    ≡ ¬P(x_1) ∨ ¬P(x_2) ∨ ¬P(x_3) ∨ ···
    ∼ ∃x ∈ S : ¬P(x),
where the second line follows by the propositional version of De Morgan’s Laws. There is something slightly more subtle to our claim because the set S might be infinite, but the idea is identical. If there’s an a ∈ S such that P(a) = False, then
∃x ∈ S : ¬P(x) is true (because a is an example) and ∀x ∈ S : P(x) is false (because a is a counterexample). And if every a ∈ S has P(a) = True, then ∃x ∈ S : ¬P(x) is false and ∀x ∈ S : P(x) is true.
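The partner scenario can be acted out in code. The sketch below invents a buggy insertion sort (the bug, the function names, and the choice of test arrays are all hypothetical) and checks that "not every array is sorted correctly" and "some array is sorted incorrectly" have the same truth value over a finite collection of arrays.

```python
from itertools import product

def buggy_insertion_sort(a):
    """Insertion sort with a deliberate off-by-one bug, for illustration only."""
    a = list(a)
    for i in range(1, len(a)):
        j = i
        while j > 1 and a[j - 1] > a[j]:  # bug: the test should be j > 0
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return a

def sorts_correctly(a):
    return buggy_insertion_sort(a) == sorted(a)

# Every array of length at most 3 with entries from {0, 1, 2}.
arrays = [t for n in range(4) for t in product(range(3), repeat=n)]

partner_claim = all(sorts_correctly(a) for a in arrays)   # for all A : P(A)
your_reply = any(not sorts_correctly(a) for a in arrays)  # exists A : not P(A)
assert (not partner_claim) == your_reply
```

Because the inner loop never compares against position 0, an array as small as (2, 1) already serves as the counterexample.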
The analogous claim for the negation of ∃x ∈ S : P(x) is also a theorem:
Example 3.41 (Negating existential quantifiers)
Let’s prove that this claim is a theorem, too:
¬[∃x ∈ S : P(x)] ⇔ [∀x ∈ S : ¬P(x)].
To see that this claim is true for an arbitrary predicate P, we start with the claim from Example 3.40, but using the predicate Q(x) := ¬P(x). (Note that Q is also a predicate—so Example 3.40 holds for Q too!) Thus we know that
¬[∀x ∈ S : Q(x)] ⇔ [∃x ∈ S : ¬Q(x)],
and, because p ⇔ q ≡ ¬p ⇔ ¬q, we therefore also know that
[∀x ∈ S : Q(x)] ⇔ ¬[∃x ∈ S : ¬Q(x)].
But Q(x) is just ¬P(x) and ¬Q(x) is just P(x), by definition of Q, and so we have
[∀x ∈ S : ¬P(x)] ⇔ ¬[∃x ∈ S : P(x)].
Thus we’ve now shown that the desired claim is true for any predicate P, so it is a theorem.
All implies some: a proof of an implication
The entirety of Chapter 4 is devoted to proofs and proof techniques; there’s lots more there about how to approach proving or disproving new claims. But here we’ll preview a particularly useful proof strategy for proving an implication, and use it to establish another theorem of predicate logic. Here’s the method of proof:
Definition 3.22 (Proof by assuming the antecedent)
Suppose that we must prove an implication φ ⇒ ψ. Because the only way for φ ⇒ ψ to fail to be true is for φ to be true and ψ to be false, to prove that the implication φ ⇒ ψ is always true, we will rule out the one scenario in which it wouldn’t be. Specifically, we assume that φ is true, and then prove that ψ must be true too, under this assumption.
(Recall from the truth table of ⇒ that the only way for the implication φ ⇒ ψ to be false is when φ is true but ψ is false. Also recall that the proposition φ is called the antecedent of the implication φ ⇒ ψ; hence this proof technique is called assuming the antecedent.) Here are two examples of proofs that use this technique, one from propositional logic and one from arithmetic:
• Let’s prove that p ⇒ p ∨ q is a tautology: we assume that the antecedent p is true, and we must prove that the consequent p ∨ q is true too. But that’s obvious, because p is true (by our assumption), and True ∨ q ≡ True.
• Let’s prove that if x is a perfect square, then 4x is a perfect square: assume that x is a perfect square, that is, assume that x = k^2 for an integer k. Then 4x = 4k^2 = (2k)^2 is a perfect square too, because 2k is also an integer.
Finally, here’s a theorem of predicate logic that we can prove using this technique:
Example 3.42 (If everybody’s doing it, then somebody’s doing it)
Consider the following proposition, for an arbitrary nonempty set S:
[∀x ∈ S : P(x)] ⇒ [∃x ∈ S : P(x)].
We’ll prove this claim by assuming the antecedent. Specifically, we assume ∀x ∈ S : P(x), and we need to prove that ∃x ∈ S : P(x).
Because the set S is nonempty, we know that there’s at least one element a ∈ S. By our assumption, we know that P(a) is true. But because P(a) is true, then it’s immediately apparent that ∃x ∈ S : P(x) is true too—because we can just pick x := a.
Problem-solving tip: When you’re facing a statement that contains a lot of mathematical notation, try to understand it by rephrasing it as an English sentence. Restating the assertion from Example 3.42 in English makes it pretty obvious that it’s true: if everyone in S satisfies P—and there’s actually someone in S—then of course someone in S satisfies P!

Vacuous quantification
Consider the proposition All even prime numbers greater than 12 have a 3 as their last
digit. Write P to denote the set of all even prime numbers greater than 12; formalized, then, this claim can be written as ∀n ∈ P : n mod 10 = 3. Is this claim true or false?
It has to be true! The point is that P actually contains no elements (there are no even prime numbers other than 2, because an even number is by definition divisible by 2). Thus this claim says: for every n ∈ ∅, something-or-other is true of n. But there is no n in ∅, so the claim has to be true! The general statement of the theorem is
∀x ∈ ∅ : P(x).
Quantification over the empty set is called vacuous quantification; this proposition is said to be vacuously true.
Here’s another way to see that ∀x ∈ ∅ : P(x) is a theorem, using the De Morgan–like view of quantification. The negation of ∀x ∈ ∅ : P(x) is ∃x ∈ ∅ : ¬P(x), but there never exists any element x ∈ ∅, let alone an element x ∈ ∅ such that ¬P(x). Thus
∃x ∈ ∅ : ¬P(x) is false, and therefore its negation ¬∃x ∈ ∅ : ¬P(x), which is equivalent to ∀x ∈ ∅ : P(x), is true.
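Python's built-in all and any follow exactly this convention on an empty collection, which gives a quick way to see vacuous truth in action. In the sketch below, the predicate P is a stand-in that would raise an error if it were ever called; the names and the search bound of 10,000 are assumptions made for illustration.

```python
def P(x):
    # With an empty domain this is never evaluated.
    raise AssertionError("P should never be called on an empty domain")

empty = []
forall_empty = all(P(x) for x in empty)            # for all x in {} : P(x)
exists_not_empty = any(not P(x) for x in empty)    # exists x in {} : not P(x)

def is_prime(n):
    return n >= 2 and all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

# Even primes greater than 12, searched up to an arbitrary bound.
even_primes_over_12 = [n for n in range(13, 10_000) if n % 2 == 0 and is_prime(n)]
claim = all(n % 10 == 3 for n in even_primes_over_12)
```

Here forall_empty is True, exists_not_empty is False, and claim is True because the list it quantifies over is empty. The same behavior shows why Example 3.42 needs S to be nonempty: over an empty set, the "for all" statement is true while the "there exists" statement is false.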
Disjunctions and quantifiers
Here’s one last example, where we’ll figure out when the “or” of two quantified
statements can be expressed as one single quantified statement:
Example 3.43 (Disjunctions and quantifiers)
Consider the following two propositions, for an arbitrary set S:
∀x ∈ S : [P(x) ∨ Q(x)] ⇔ [∀x ∈ S : P(x)] ∨ [∀x ∈ S : Q(x)] (A)
∃x ∈ S : [P(x) ∨ Q(x)] ⇔ [∃x ∈ S : P(x)] ∨ [∃x ∈ S : Q(x)] (B)
Problem: Is either (A) or (B) a theorem? Prove your answers.
Solution: Claim (B) is a theorem. To prove it, we’ll show that the left-hand side implies the right-hand side, and vice versa. (That is, we’re proving p ⇔ q by proving both p ⇒ q and q ⇒ p, which is a legitimate proof because p ⇔ q ≡ (p ⇒ q) ∧ (q ⇒ p).) Both proofs will use the technique of assuming the antecedent.
• First, suppose that ∃x ∈ S : [P(x) ∨ Q(x)] is true. Then there is some particular x∗ ∈ S for which either P(x∗) or Q(x∗). But in either case, we’re done: if P(x∗) then ∃x ∈ S : P(x)—in particular, x∗ satisfies the condition; if Q(x∗) then
∃x ∈ S : Q(x).
• Conversely, suppose that [∃x ∈ S : P(x)] ∨ [∃x ∈ S : Q(x)] is true. Thus either there’s an x∗ ∈ S such that P(x∗) or an x∗ ∈ S such that Q(x∗). That x∗ suffices to make the left-hand side true.
Problem-solving
tip: In thinking about a question like whether (A) from Example 3.43 is a theorem, it’s often useful to
get intuition by plugging in a few sample values for S, P, and Q.

On the other hand, (A) is not a theorem, for much the same reason as in Example 3.39. (In fact, if Q(x) := ¬P(x), then Examples 3.38 and 3.39 precisely show that (A) is not a theorem.) The set Z and the predicates isOdd and isEven make (A) false: the left-hand side is true (“all integers are either even or odd”) but the right-hand side is false (“either (i) all integers are even, or (ii) all integers are odd”).
Although (A) from this example is not a theorem, one direction of it is; we’ll prove this implication as another example:
Example 3.44 (Disjunction, quantifiers, and one-way implications)
The ⇐ direction of (A) from Example 3.43 is a theorem:
∀x ∈ S : [P(x) ∨ Q(x)] ⇐ [∀x ∈ S : P(x)] ∨ [∀x ∈ S : Q(x)].
To convince yourself of this claim, observe that if P(x) is true for an arbitrary x ∈ S, then it’s certainly true that P(x) ∨ Q(x) is true for an arbitrary x ∈ S too. And if Q(x) is true for every x ∈ S, then, similarly, P(x) ∨ Q(x) is true for every x ∈ S.
To prove this claim, we assume the antecedent [∀x ∈ S : P(x)] ∨ [∀x ∈ S : Q(x)]. Thus either [∀x ∈ S : P(x)] or [∀x ∈ S : Q(x)], and, in either case, we’ve argued that P(x) ∨ Q(x) is true for all x ∈ S.
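Following the problem-solving tip, we can plug sample values into (A) and (B). The Python sketch below first uses isOdd and isEven on a finite range standing in for Z (an assumption, since a program cannot loop over all integers), and then tries every pair of predicates on a three-element set. All names are illustrative, and the exhaustive check is an experiment on one small set, not a proof.

```python
from itertools import product

def is_odd(x):
    return x % 2 == 1

def is_even(x):
    return x % 2 == 0

# The counterexample to (A), on a finite slice of the integers.
Z_slice = range(-5, 6)
lhs = all(is_odd(x) or is_even(x) for x in Z_slice)
rhs = all(is_odd(x) for x in Z_slice) or all(is_even(x) for x in Z_slice)

def predicates(S):
    for values in product([False, True], repeat=len(S)):
        yield dict(zip(S, values))

T = [0, 1, 2]
pairs = [(P, Q) for P in predicates(T) for Q in predicates(T)]

def a_left(P, Q):
    return all(P[x] or Q[x] for x in T)

def a_right(P, Q):
    return all(P[x] for x in T) or all(Q[x] for x in T)

def b_left(P, Q):
    return any(P[x] or Q[x] for x in T)

def b_right(P, Q):
    return any(P[x] for x in T) or any(Q[x] for x in T)

a_backward = all((not a_right(P, Q)) or a_left(P, Q) for P, Q in pairs)  # (A), right to left
a_forward = all((not a_left(P, Q)) or a_right(P, Q) for P, Q in pairs)   # (A), left to right
b_both = all(b_left(P, Q) == b_right(P, Q) for P, Q in pairs)            # (B)
```

The results agree with Examples 3.43 and 3.44: lhs is True while rhs is False, a_backward and b_both hold for every pair of predicates tried, and a_forward fails.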
You’ll have a chance to consider a number of other theorems of predicate logic in the exercises, including the ∧-analogy to Examples 3.43–3.44 (in Exercises 3.130–3.131).
Computer Science Connections
Game Trees, Logic, and Winning Tic-Tac(-Toe)
In 1997, Deep Blue, a chess-playing program developed by IBM,5 beat the chess Grandmaster Garry Kasparov in a six-match series. This event was a turning point in the public perception of computation and artificial intelligence; it was the first time that a computer had outperformed the best humans at something that most people tended to identify as a “human endeavor.”
Ten years later, a research group developed a program called Chinook, a perfect checkers-playing system: from any game position arising in its games, Chinook chooses the best possible legal move.6
While chess and checkers are very complicated games, the basic ideas
of playing them—ideas based on logic—are shared with simpler games. Consider Tic-Tac, a 2-by-2 version of Tic-Tac-Toe. Two players, O and X, make alternate moves, starting with O; a player wins by occupying a complete row or column. Diagonals don’t count, and if the board is filled without O or
X winning, then the game is a draw. Note that—unless O is tremendously dull—O will win the game, but we will use a game tree (Figure 3.24), which represents all possible moves, to systematize this reasoning.
Here’s the basic idea. Define P(B) to be the predicate
P(B) := “Player O wins under optimal play starting from board B.”
5 Murray Campbell, A. Joseph Hoane Jr., and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 134:57–83, 2002.
6 Jonathan Schaeffer, Neil Burch, Yngvi Bjornsson, Akihiro Kishimoto, Martin Muller, Rob Lake, Paul Lu, and Steve Sutphen. Checkers is solved. Science, 317(5844):1518–1522, 14 September 2007.
Thanks to Jon Kleinberg for suggesting this game.
Figure 3.24: 25% of the Tic-Tac game tree. (The missing 75% is rotated, but otherwise identical.)
(In what follows, a board is written inline as its top row / bottom row, with _ marking an empty square.) For example, P(X _ / O O) = True because O has already won; and P(O X / X O) = False because it’s a draw. The answer to the question “does O win Tic-Tac if both players play optimally?” is the truth value of P(_ _ / _ _). If it’s O’s turn in board B, then P(B) is true if and only if there exists a possible move for O leading to a board B′ in which P(B′); if it’s X’s turn, then P(B) is true if and only if every possible move made by X leads to a board B′ in which P(B′). So
P(_ O / _ _) = P(X O / _ _) ∧ P(_ O / X _) ∧ P(_ O / _ X)
and
P(_ _ / _ _) = P(O _ / _ _) ∨ P(_ O / _ _) ∨ P(_ _ / O _) ∨ P(_ _ / _ O).
Figure 3.25: The game tree, with each win for O labeled by T, each loss/draw by F, ∨ if it’s Player O’s turn, and ∧ if it’s Player X’s turn.
For more on game trees and algorithms for exploring large search spaces, see a good artificial intelligence (AI) text like
7 Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.
The game tree, labeled appropriately, is shown in Figure 3.25. If we view the truth values from the leaves as “bubbling up” from the bottom of the tree, then a board B gets assigned the truth value True if and only if Player O can guarantee a win from the board B.
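The "bubbling up" computation is small enough for Tic-Tac that we can sketch it in full. In the Python code below, a board is a four-character string listing the top row and then the bottom row, with "_" for an empty square; that encoding and the names winner and o_wins are our own illustrative choices. The existential quantifier for O's turn becomes any, and the universal quantifier for X's turn becomes all.

```python
from functools import lru_cache

# Squares are indexed 0 1 (top row) and 2 3 (bottom row).
LINES = [(0, 1), (2, 3), (0, 2), (1, 3)]  # both rows and both columns; no diagonals

def winner(board):
    for a, b in LINES:
        if board[a] != "_" and board[a] == board[b]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def o_wins(board):
    """P(B): does Player O win under optimal play starting from board B?"""
    w = winner(board)
    if w is not None:
        return w == "O"
    empty = [i for i, c in enumerate(board) if c == "_"]
    if not empty:
        return False  # full board with no winner: a draw
    # O moves first, so it is O's turn exactly when the two counts are equal.
    mover = "O" if board.count("O") == board.count("X") else "X"
    children = [board[:i] + mover + board[i + 1:] for i in empty]
    if mover == "O":
        return any(o_wins(child) for child in children)  # some move for O works
    return all(o_wins(child) for child in children)      # every move for X fails
```

Calling o_wins("____") evaluates P on the empty board, and o_wins("X_OO") and o_wins("OXXO") correspond to the two example boards discussed above. For games like chess, this exhaustive recursion is exactly what becomes infeasible.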
Some serious complications arise in writing a program to play more complicated games like checkers or chess. Here are just a few of the issues that one must confront in building a system like Deep Blue or Chinook:7
• There are ≈ 500,000,000,000,000,000,000 different checkers positions—and ≈ 10^40 chess positions!—so we can’t afford to represent them all. (Luckily, we can choose moves so most positions are never reached.)
• Approximately one bit per trillion is written incorrectly merely in copying data on current hard disk technologies. So a program constructing a massive structure like the checkers game tree must “check its work.”
• For a game as big as chess, we can’t afford to compute all the way to the bottom of the tree; instead, we estimate the quality of each position after computing a handful of layers deep in the game tree.

3.4. AN INTRODUCTION TO PREDICATE LOGIC 345
Computer Science Connections
Nonlocal Variables and Lexical vs. Dynamic Scoping
In a function f written in a programming language—say, C or Python—we can use several different types of variables that store values:
• local variables, whose values are defined completely within the body of f;
• parameters, inputs to f whose value is specified when f is invoked;
• nonlocal variables, which get their value from other contexts. The most common type of these “other” variables is a global variable, which persists throughout the execution of the entire program.
For an example function (written in C and Python as illustrative examples) that uses both a parameter and a nonlocal variable, see Figure 3.26. In the body of this function, the variable a is a bound variable; specifically, it is bound when the function is invoked with an actual parameter. But the variable b is unbound. (Just as with a quantified expression, an unbound variable is one for which the meaning of the function could change if we replaced that variable with a different name. If we changed the a to an x in both lines 1 and 2, then the function would behave identically, but if we changed the b to a y, then the function would behave differently.)
In this function, the variable b has to somehow get a value from some- where if we are going to be able to invoke the function addB without causing an error. Often b will be a global variable, but it is also possible in Python or C (with appropriate compiler settings) to nest function definitions—just as quantifiers can be nested. (See Section 3.5.)
One fundamental issue in the design and implementation of programming languages is illustrated in Figure 3.27.8 Suppose x is an unbound variable in the definition of a function f. Generally, programming languages either use lexical scope, where x’s value is found by looking “outward” where f is defined; or dynamic scope, where x’s value is found by looking where f is called. Almost all modern programming languages use lexical scope, though macros in C and other languages use dynamic scope. While we’re generally used to lexical scope and therefore it feels more intuitive, there are some circumstances in which macros can be tremendously useful and convenient.
/* C */
int addB(int a) {
    return a + b;
}

# Python
def addB(a):
    return a + b
Figure 3.26: A function addB written in C and analogous function addB written in Python. Here addB takes one (integer) parameter a, accesses a nonlocal variable b, and returns a + b.
For more about lexical versus dynamic scope, and other related issues, see a textbook on programming languages. (One of the other interesting issues is that there are actually multiple paradigms for passing parameters to a function; we’re discussing call-by-value parameter passing, which probably is the most common.) Some good books on programming languages include
8 Michael L. Scott. Programming Language Pragmatics. Morgan Kaufmann Publishers, 3rd edition, 2009; and Kenneth C. Louden and Kenneth A. Lambert. Programming Languages: Principles and Practices. Course Technology, 3rd edition, 2011.
int b = 17;
int addB(int a) { return a + b; }
/* a FUNCTION in C finds values for unbound */
/* variables in the *defining* environment  */
int test(void) {
    int b = 128;    /* local b; invisible to addB */
    return addB(3);
}
/* test() returns 20 */

int b = 17;
#define addB(a) ((a) + b)
/* a MACRO in C finds values for unbound   */
/* variables in the *calling* environment  */
int test(void) {
    int b = 128;    /* the macro body expands here and sees this b */
    return addB(3);
}
/* test() returns 131 */
Figure 3.27: Two C snippets defining addB, where the nonlocal variable b gets its value from different places.
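Python functions are lexically scoped, so a Python version of the first snippet in Figure 3.27 behaves like the C function, not like the macro. The sketch below uses illustrative names (add_b, test, make_adder); the nested definition at the end shows a free variable getting its value from an enclosing function rather than from a global.

```python
b = 17

def add_b(a):
    # b is free in add_b; Python looks it up where add_b is defined.
    return a + b

def test():
    b = 128  # a local variable of test; add_b never sees it
    return add_b(3)

def make_adder(b):
    # Nested definition: here the free variable b of add is bound
    # by the enclosing function's parameter.
    def add(a):
        return a + b
    return add
```

Here test() returns 20, just as in the function version of the C code. To get 131, the value 128 has to be supplied in the defining environment, for example as make_adder(128)(3).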

346 CHAPTER 3. LOGIC
Computer Science Connections
Gödel’s Incompleteness Theorem
Given a fully quantified proposition φ, is φ a theorem? This apparently simple question drove the development of some of the most profound and mind-numbing results of the last hundred years. In the early 20th century, there was great interest in the “formalist program,” advanced especially by the German mathematician David Hilbert. The formalist approach aimed to turn all of mathematical reasoning into a machine: one could feed in a mathematical statement φ as input, turn a hypothetical crank, and the machine would spit out a proof or disproof of φ as output. But this program was shattered by two closely related results—two of the greatest intellectual achievements of the 20th century.
The first blow to the formalist program was the proof by Kurt Gödel, in 1931, of what became known as Gödel’s Incompleteness Theorem. Gödel’s incompleteness theorem is based on the following two important and desirable properties of logical systems:
• A logical system is consistent if only true statements can be proven. (In other words, if there is a proof of φ in the system, then φ is true.)
• A logical system is complete if every true statement can be proven. (In other words, if φ is true, then there is a proof of φ in the system.)
Gödel’s Incompleteness Theorem is the following troubling result:
Theorem 3.3 (Gödel’s (First) Incompleteness Theorem)
Any sufficiently powerful logical system is either inconsistent or incomplete.
(Here “sufficiently powerful” just means “capable of expressing multiplication”; predicate logic as described here is certainly “sufficiently powerful.”) If the system is inconsistent, then there is a false statement φ that can be proven (which means that anything can be proven, as false implies anything!). And if the system is incomplete, then there is a true statement φ that cannot be proven. Gödel’s proof proceeds by constructing a self-referential logical expression φ that means “φ is not provable.” (So if φ is true, then the system is incomplete; and if φ is false, then the system is inconsistent.)
The second strike against the formalist program was the proof of the undecidability of the halting problem, shown independently by Alan Turing and Alonzo Church in the 1930s. We can think of the halting problem as asking the following question: given a function f written in Python and an input x, does running f(x) get stuck in an infinite loop? (Or does it eventually terminate?) The undecidability of this problem means that there is no algorithm that solves the halting problem. A corollary of this result is that our problem—given a fully quantified proposition φ, is φ a theorem?—is also undecidable. We’ll discuss uncomputability in more detail in Chapter 4.
Undecidability, incompleteness, and their profound consequences are the focus of a number of excellent textbooks on the theory of computation9—and also Douglas Hofstadter’s fascinating masterpiece Gödel, Escher, Bach,10 which is all-but-required reading for computer scientists.
See, for example:
9 Dexter Kozen. Automata and Computability. Springer, 1997; and Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012.
10 Douglas Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Vintage, 1980.

3.4.5 Exercises
Figure 3.28 lists some well-known programming languages, with some characteristics. Using these characteristics, define a predicate that’s true for each of the following lists
of languages, and false for every other language in the table. For example, the predicate P(x) = “x has strong typing and x is not functional” makes P(Pascal) and P(Java) true, and makes P(x) false for every x ∈ {C, C++, LaTeX, ML, Perl, Scheme}.
3.107 Java
3.108 ML, Perl
3.109 Pascal, Scheme, Perl
3.110 LaTeX, Java, C++, Perl
3.111 C, Pascal, ML, C++, LaTeX, Scheme, Perl
Examples 3.4 and 3.15 construct a proposition corresponding to “the password contains at least three of four character types (digits, lowercase letters, uppercase letters, other).” In that example, we took “the password contains at least
one digit” (and its analogues for the other character types) as an atomic proposition. But we could give a lower-level characterization of valid passwords. Let isDigit, isLower, and isUpper be predicates that are true of single characters of the appropriate type. Use standard arithmetic notation and these predicates to formalize the following conditions on a password x = ⟨x_1, …, x_n⟩, where x_i is the ith character in the password:
3.112 x is at least 8 characters long.
3.113 x contains at least one lowercase letter.
3.114 x contains at least one non-alphanumeric character. (Remember that isDigit, isLower, and isUpper are the only predicates available!)
3.115 (Inspired by a letter to the editor in The New Yorker by Alexander George from 24 December 2007.) Steve Martin, the great comedian, reports in Born Standing Up: A Comic’s Life that, inspired by Lewis Carroll, he started closing his shows with the following line.11 (It got big laughs.)
I’m not going home tonight; I’m going to Bananaland, a place where only two things are true, only two things: One, all chairs are green; and two, no chairs are green.
Steve Martin describes the joke as a contradiction—but, in fact, these two true things are not contradictory! Describe how it is possible for both “all chairs in Bananaland are green” and “no chairs in Bananaland are green” to be simultaneously true.
As a rough approximation, we can think of a database as a two-dimensional table, where rows correspond to individual entities, and columns correspond to fields (data about those entities). A database query defines a predicate Q(x) that consists of tests of the values from various columns, joined by the basic logical connectives. The database system then returns a list of rows/entities for which the predicate is true. We can think of this type of database access as involving predicates: in response to query Q, the system returns the list of all rows x for which Q(x) is true.
See Figure 3.29 for an example; here, to find a list of all students with grade point averages over 3.4 who have taken at least one CS course if and only if they’re from Hawaii, we could query GPA(x) ≥ 3.4 ∧ [CS?(x) = yes ⇔ home(x) = Hawaii]. For this database, this query would return Charlie (and not Alice, Bob, or Dave).
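To make the "query as predicate" view concrete, here is a minimal Python sketch. It assumes the four rows of Figure 3.29 are stored as dictionaries; the field names gpa, cs, and home and the helper run_query are hypothetical names chosen for this illustration, not part of any real database system.

```python
# Hypothetical in-memory copy of the four rows shown in Figure 3.29.
rows = [
    {"name": "Alice",   "gpa": 4.0,  "cs": True,  "home": "Alaska"},
    {"name": "Bob",     "gpa": 3.14, "cs": True,  "home": "Bermuda"},
    {"name": "Charlie", "gpa": 3.54, "cs": False, "home": "California"},
    {"name": "Dave",    "gpa": 3.8,  "cs": True,  "home": "Delaware"},
]

def Q(x):
    """GPA(x) >= 3.4  and  (CS?(x) = yes  <=>  home(x) = Hawaii)."""
    # For Booleans, "p if and only if q" is simply p == q.
    return x["gpa"] >= 3.4 and (x["cs"] == (x["home"] == "Hawaii"))

def run_query(predicate, table):
    """Return the names of all rows x for which predicate(x) is true."""
    return [x["name"] for x in table if predicate(x)]

result = run_query(Q, rows)
print(result)  # ['Charlie']
```

Charlie is the only row returned: his GPA clears the threshold, and both sides of the biconditional are false for him, so the biconditional itself is true.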
Each of the following predicates Q(x) uses tests on particular columns in x’s row. For each, give a logically equivalent predicate in which each column’s name appears at most once. You may also use the symbols {True, False, ∧, ∨, ¬, ⇒} as many times as you please. Use a truth table to prove that your answer is logically equivalent to the given predicate.
3.116 [age(x) < 18] ∨ (¬[age(x) < 18] ∧ [gpa(x) ≥ 3.0])
3.117 cs(x) ⇒ ¬(hawaii(x) ⇒ (hawaii(x) ∧ cs(x)))
3.118 (hasMajor(x) ∧ ¬junior(x) ∧ oncampus(x)) ∨ (hasMajor(x) ∧ ¬junior(x) ∧ ¬oncampus(x)) ∨ (hasMajor(x) ∧ junior(x) ∧ ¬oncampus(x))
3.119 Following the last few exercises, you might begin to think that any query can be rewritten without duplication. Can it? Consider a unary predicate that is built up from the predicates P(x) and Q(x) and the propositional symbols {True, False, ∧, ∨, ¬, ⇒}. Decide whether the following claim is true or false, and prove your answer:
Claim: Every such predicate is logically equivalent to a predicate that uses only the following symbols: (i) {True, False, ∧, ∨, ¬, ⇒}, all of which can be used as many times as you please; and (ii) the predicates {P(x), Q(x)}, which can appear only one time each.
Figure 3.28: Some programming languages.
          paradigm          typing   scope
C         imperative        weak     lexical
C++       object-oriented   weak     lexical
Java      object-oriented   strong   lexical
LaTeX     scripting         weak     dynamic
ML        functional        strong   lexical
Pascal    imperative        strong   lexical
Perl      scripting         weak     either
Scheme    functional        weak     either
11 Steve Martin. Born Standing Up: A Comic’s Life. Simon & Schuster, 2008.
Figure 3.29: A sample database.
name      GPA    CS?   home         ···
Alice     4.0    yes   Alaska       ···
Bob       3.14   yes   Bermuda      ···
Charlie   3.54   no    California   ···
Dave      3.8    yes   Delaware     ···
⋮
Modern web search engines allow users to specify Boolean conditions in their queries. For example, “social OR networks” will return only web pages containing either the word “social” or the word “networks.” You can view a query as a predicate Q; the search engine returns (in some order) the list of all pages p for which Q(p) is true.
Consider the following queries:
A: “java AND program AND NOT computer”
B: “(computer OR algorithm) AND java”
C: “java AND NOT (computer OR algorithm OR program)”
Give an example of a web page (or a sentence) that would be returned . . .
3.120 . . . by query A but not by B or C.
3.121 . . . by query B but not by A or C.
3.122 . . . by query C but not by A or B.
3.123 Prove or disprove: ∀n ∈ Z : isPrime(n) ⇒ n/2 ∉ Z.
3.124 Translate this Groucho Marx quote into logical notation: It isn’t necessary to have relatives in Kansas City in order to be unhappy. Let P(x) be “x has relatives in Kansas City” and Q(x) be “x is unhappy,” and view the statement as implicitly making a claim that a particular kind of person exists.
Write an English sentence that expresses the logical negation of each given sentence. (Don’t just say “It is not the case that . . .”; give a genuine negation.) Some of the given sentences are ambiguous in their meaning; if so, describe all of the interpretations of the sentence that you can find, then choose one and give its negation.
3.125 Every entry in the array A is positive.
3.126 Every decent programming language denotes block structure with parentheses or braces.
3.127 There exists an odd number that is evenly divisible by a different odd number.
3.128 There is a point in Minnesota that is farther than ten miles from a lake.
3.129 Every sorting algorithm takes at least n log n steps on some n-element input array.
In Examples 3.43 and 3.44, we proved that
∃x ∈ S : [P(x) ∨ Q(x)] ⇔ [∃x ∈ S : P(x)] ∨ [∃x ∈ S : Q(x)]
∀x ∈ S : [P(x) ∨ Q(x)] ⇐ [∀x ∈ S : P(x)] ∨ [∀x ∈ S : Q(x)]
are theorems. Argue that the following ∧-analogies to these statements are also theorems:
3.130 ∃x ∈ S : [P(x) ∧ Q(x)] ⇒ [∃x ∈ S : P(x)] ∧ [∃x ∈ S : Q(x)]
3.131 ∀x ∈ S : [P(x) ∧ Q(x)] ⇔ [∀x ∈ S : P(x)] ∧ [∀x ∈ S : Q(x)]
Explain why the following are theorems of predicate logic:
3.132 [∀x ∈ S : P(x) ⇒ Q(x)] ∧ [∀x ∈ S : P(x)] ⇒ [∀x ∈ S : Q(x)]
3.133 [∀x ∈ {y ∈ S : P(y)} : Q(x)] ⇔ [∀x ∈ S : P(x) ⇒ Q(x)]
3.134 [∃x ∈ {y ∈ S : P(y)} : Q(x)] ⇔ [∃x ∈ S : P(x) ∧ Q(x)]
Explain why the following propositions are theorems of predicate logic, assuming that x does not appear as a free variable in the expression φ (and assuming that S is nonempty):
3.135 φ ⇔ [∀x ∈ S : φ]
3.136 φ ∨ [∀x ∈ S : P(x)] ⇔ [∀x ∈ S : φ ∨ P(x)]
3.137 φ ∧ [∃x ∈ S : P(x)] ⇔ [∃x ∈ S : φ ∧ P(x)]
3.138 (φ ⇒ [∃x ∈ S : P(x)]) ⇔ [∃x ∈ S : φ ⇒ P(x)]
3.139 ([∃x ∈ S : P(x)] ⇒ φ) ⇔ [∀x ∈ S : P(x) ⇒ φ]
3.140 Give an example of a predicate P, a nonempty set S, and an expression φ containing x as a free variable such that the proposition from Exercise 3.136 is false. Because x has to get its meaning from somewhere, we will imagine a universal quantifier for x wrapped around the entire expression. Specifically, give an example of P, φ, and S for which ∀x ∈ S : [φ ∨ (∀x ∈ S : P(x))] is not logically equivalent to ∀x ∈ S : [(∀x ∈ S : φ ∨ P(x))].
3.5 Predicate Logic: Nested Quantifiers
Everybody hates me because I’m so universally liked.
Peter De Vries (1910–1993)
Just as we can place one loop inside another in a program, we can place one quantified statement inside another in predicate logic. In fact, the most interesting quantified statements almost always involve more than one quantifier. (For example: during every semester, there’s a computer science class that every student on campus can take.) In formal notation, such a statement typically involves nested quantifiers: that is, multiple quantifiers in which one quantifier appears inside the scope of another. We’ve encountered statements involving nested quantification before, although so far we’ve discussed them using English rather than mathematical notation.
The definition of a partition of a set (Definition 2.30) and the definition of an onto function (Definition 2.49) are two examples. (To make the latter definition’s quantifiers more explicit: an onto function f : A → B is one where, for every element b of B, there’s an element a of A such that f(a) = b: that is, ∀b ∈ B : (∃a ∈ A : f(a) = b).) Here are two other examples:
Example 3.45 (No unmatched elements in an array)
Let’s express the condition that every element of an array A[1 . . . n] is a “double”: that is, appears at least twice in A. (For example, the array [3, 2, 1, 1, 4, 4, 2, 3, 1] satisfies this condition.) This condition requires that, for every index i, there exists another index j such that A[i] = A[j]. We can express the requirement as follows:
∀i ∈ {1, 2, . . . , n} : [∃j ∈ {1, 2, . . . , n} : i ≠ j ∧ A[i] = A[j]].
Example 3.46 (Alphabetically later)
Let’s formalize the predicate “The string ___ is alphabetically after the string ___” from Example 3.28. For two letters a, b ∈ {A, B, . . . , Z}, write a < b if a is earlier in the alphabet than b; we’ll use this ordering on letters to define an ordering on strings. Let x and y be strings over {A, B, . . . , Z}. There are two ways for x to be alphabetically later than y:
• y is a (proper) prefix of x. (See Example 3.35.) For example, FORTRAN is after FORT.
• x and y share an initial prefix of identical letters, and the first i for which x_i ≠ y_i has x_i later in the alphabet than y_i. For example, PASTOR comes after PASCAL.
Formally, then, x ∈ {A, B, . . . , Z}^n is alphabetically after y ∈ {A, B, . . . , Z}^m if
[m < n ∧ [∀j ∈ {1, 2, . . . , m} : x_j = y_j]] (y is a proper prefix of x . . .)
∨ [∃i ∈ {1, . . . , min(n, m)} : x_i > y_i ∧ [∀j ∈ {1, 2, . . . , i−1} : x_j = y_j]] (. . . or x_{1,…,i−1} = y_{1,…,i−1} and x_i > y_i).
“Sorting alphabetically” is usually called lexicographic ordering in computer science. This ordering reflects the way that words are listed in the dictionary (also known as the lexicon).
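The nested quantifiers in Examples 3.45 and 3.46 correspond closely to nested uses of Python's built-in all and any. The following is a small sketch, not a definitive implementation: the function names every_element_is_double and alphabetically_after are hypothetical, and the code uses 0-based indices where the text uses 1-based ones.

```python
def every_element_is_double(A):
    """For every index i there is some index j != i with A[i] == A[j]."""
    n = len(A)
    return all(
        any(i != j and A[i] == A[j] for j in range(n))
        for i in range(n)
    )

def alphabetically_after(x, y):
    """True if string x is alphabetically after string y (uppercase letters)."""
    n, m = len(x), len(y)
    # Case 1: y is a proper prefix of x.
    proper_prefix = m < n and all(x[j] == y[j] for j in range(m))
    # Case 2: x and y agree before some position i, and x[i] is later than y[i].
    later_at_first_difference = any(
        x[i] > y[i] and all(x[j] == y[j] for j in range(i))
        for i in range(min(n, m))
    )
    return proper_prefix or later_at_first_difference

print(every_element_is_double([3, 2, 1, 1, 4, 4, 2, 3, 1]))  # True
print(alphabetically_after("FORTRAN", "FORT"))               # True
print(alphabetically_after("PASTOR", "PASCAL"))              # True
print(alphabetically_after("FORT", "FORTRAN"))               # False
```

For strings of uppercase letters, alphabetically_after(x, y) agrees with Python's built-in comparison x > y, which is also lexicographic.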
Here is one more example of a statement that we’ve already seen, Goldbach’s conjecture, that implicitly involves nested quantifiers; we’ll formalize it in predicate logic. (Part of the point of this example is to illustrate how complex even some apparently simple concepts are; there’s a good deal of complexity hidden in words like “even” and “prime,” which at this point seem pretty intuitive!)
Example 3.47 (Goldbach’s Conjecture)
Problem: Recall Goldbach’s conjecture, from Example 3.1:
Every even integer greater than 2 can be written as the sum of two prime numbers.
Formalize this proposition using nested quantifiers.
Solution: Using the sumOfTwoPrimes predicate from Example 3.34, we can write this statement as either of the following:
∀n ∈ {n ∈ Z : n > 2 ∧ 2 | n} : sumOfTwoPrimes(n) (A)
∀n ∈ Z : [n > 2 ∧ 2 | n ⇒ sumOfTwoPrimes(n)] (B)
Writing tip: Just as with nested loops in programs, the deeper the nesting of quantifiers, the harder an expression is for a reader to follow. Using well-chosen predicates (like isPrime, for example) in a logical statement can make it much easier to read, just like using well-chosen (and well-named) functions makes your software easier to read!
In (B), we quantify over all integers, but the implication n > 2 ∧ 2 | n ⇒ sumOfTwoPrimes(n) is trivially true for an integer n that’s not even or not greater than 2, because false implies anything! Thus the only instantiations of the quantifier in which the implication has any “meat” are those for even integers greater than 2. As such, these two formulations are equivalent. (See Exercise 3.133.) Expanding the definition of sumOfTwoPrimes(n) from Example 3.34, we can also rewrite (B) as
∀n ∈ Z : [n > 2 ∧ 2 | n ⇒ [∃p ∈ Z : ∃q ∈ Z : (isPrime(p) ∧ isPrime(q) ∧ n = p + q)]] (C)
We’ve also already seen that the predicate isPrime implicitly contains quantifiers too (“for all potential divisors d, it is not the case that d evenly divides p”)—and, for that matter, so does the “evenly divides” predicate |. In Exercises 3.178, 3.179, and 3.180, you’ll show how to rewrite Goldbach’s Conjecture in a few different ways, including using yet further layers of nested quantifiers.
3.5.1 Order of Quantification
In expressions that involve nested quantifiers, the order of the quantifiers matters! As a frivolous example, take the title of the 1947 hit song “Everybody Loves Somebody” (sung by Dean Martin). There are two plausible interpretations of the title:
∀x : ∃y : x loves y and ∃y : ∀x : x loves y.
The former is the more natural reading; it says that every person x has someone that he or she loves, but each different x can love a different person. (As in: “every child loves his or her mother.”) The latter says that there is one single person loved by every x. (As in: “Everybody loves Raymond.”) These claims are different!

Taking it further: Disambiguating the order of quantification in English sentences is one of the most daunting challenges in natural language processing (NLP) systems. (See p. 314.) Compare Every student received a diploma and Every student heard a commencement address: there are, surely, many diplomas and only one address, but building a software system that understands that fact is tremendously challenging! There are many other vexing types of ambiguity in NLP systems, too. A classic example of ambiguity in natural language is the sentence I saw the man with the telescope. Is the man holding a telescope? Or did I use one to see him? Human listeners are able to use pragmatic knowledge about the world to disambiguate, but doing so properly in an NLP system is very difficult.
Figure 3.30 shows a visual representation of the importance of this order of quantification. Compare Figure 3.30(d) and Figure 3.30(f), for example: ∀r : ∃c : P(r, c) is true if every row has at least one column with a filled cell in it, whereas ∃c : ∀r : P(r, c) requires that there be a single column so that every row has that column’s cell filled. The proposition ∃c : ∀r : P(r, c) is not true in Figure 3.30(d), though the proposition ∀r : ∃c : P(r, c) is true in both Figure 3.30(d) and Figure 3.30(f).
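The row/column reading of the two quantifier orders can be checked mechanically. The sketch below assumes a nonempty rectangular grid stored as a list of rows of Booleans, where grid[r][c] plays the role of P(r, c); the two sample grids are made up for illustration and are not the panels of Figure 3.30.

```python
def forall_exists(grid):
    """∀r : ∃c : P(r, c): every row has at least one filled cell."""
    return all(any(row) for row in grid)

def exists_forall(grid):
    """∃c : ∀r : P(r, c): some single column is filled in every row."""
    num_cols = len(grid[0])
    return any(all(row[c] for row in grid) for c in range(num_cols))

# Hypothetical grids: each row of `diagonal` has a filled cell,
# but no one column is filled in all rows.
diagonal = [
    [True,  False, False],
    [False, True,  False],
    [False, False, True],
]
# In `shared_column`, the middle column is filled in every row.
shared_column = [
    [False, True, False],
    [True,  True, False],
    [False, True, True],
]

print(forall_exists(diagonal), exists_forall(diagonal))            # True False
print(forall_exists(shared_column), exists_forall(shared_column))  # True True
```

Note that the order of any and all in the code mirrors the order of ∃ and ∀ in the formula: the outermost quantifier becomes the outermost call.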
Here’s a mathematical example that illustrates the difference even more precisely.
Example 3.48 (The largest real number)
Problem: One of the following propositions is true; the other is false. Which is which?
∃y ∈ R : ∀x ∈ R : x < y (A)
∀x ∈ R : ∃y ∈ R : x . . .
. . . 15, j ≤ 0, or j > 15, treat G[i, j] = False.
Taking it further: The assumption that the ⟨i, j⟩th cell of G is False except when 1 ≤ i, j ≤ 15 can be re- expressed as us pretending that our real grid is surrounded by black squares. In CS, this style of structure is called a sentinel, wherein we introduce boundary values to avoid having to write out verbose special cases.
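As a rough illustration of the sentinel idea, here are two ways one might realize "everything outside the grid is False" in Python. This is a sketch under assumptions: the grid G is taken to be a 15-by-15 list of rows of Booleans (True for an open square), and the names SIZE, is_open, and with_sentinels are hypothetical.

```python
SIZE = 15  # assumed dimensions of the grid: 15 rows by 15 columns

def is_open(G, i, j):
    """Return G[i, j] using 1-based indices, treating any cell
    outside 1..SIZE as False (a black square)."""
    if 1 <= i <= SIZE and 1 <= j <= SIZE:
        return G[i - 1][j - 1]
    return False

def with_sentinels(G):
    """Return a (SIZE+2)-by-(SIZE+2) copy of G surrounded by a border of
    False cells, so that indices 0..SIZE+1 need no special cases."""
    top = [False] * (SIZE + 2)
    bottom = [False] * (SIZE + 2)
    middle = [[False] + list(row) + [False] for row in G]
    return [top] + middle + [bottom]

# Usage: a hypothetical grid that is entirely open except the top-left corner.
G = [[True] * SIZE for _ in range(SIZE)]
G[0][0] = False
padded = with_sentinels(G)
print(is_open(G, 0, 5), padded[0][5])  # False False
print(is_open(G, 1, 2), padded[1][2])  # True True
```

The first version hides the special case inside one lookup function; the second spends a little extra memory so that code examining a cell's neighbors never needs a bounds check.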
There are certain customs that G must obey to be a standard American puzzle. (See Figure 3.37, for example.) Rewrite the informally stated conditions that follow as fully formal definitions.
3.157 no unchecked letters: every open cell appears in both a down word and an across word.
3.158 no two-letter words: every word has length at least 3.
3.159 rotational symmetry: if the entire grid is rotated by 180°, then the rotated grid is identical to the original grid.
3.160 overall interlock: for any two open squares, there is a path of open squares that connects the first to the second. (That is, we can get from here to there through words.) Your answer should formally define a predicate P(i, j, x, y) that is true exactly when there exists a path from ⟨i, j⟩ to ⟨x, y⟩ (“there exists a sequence of open squares starting with ⟨i, j⟩ such that . . .”).
3.161 Definition 2.30 defines a partition of a set S as a set {A1, A2, . . . , Ak} of sets such that (i) A1, A2, . . . , Ak are all nonempty; (ii) A1 ∪ A2 ∪ ··· ∪ Ak = S; and (iii) for any distinct i, j ∈ {1, …, k}, the sets Ai and Aj are disjoint. Formalize this definition using nested quantifiers and basic set notation.
3.162 Consider the “maximum” problem: given an array of numbers, return the maximum element of that array. Complete the formal specification for this problem by finishing the specification under “output”:
Input: An array A[1 . . . n], where each A[i] ∈ Z.
Output: An integer x ∈ Z such that . . .
Greek: chron- “time.”
Figure 3.37: A sample American crossword puzzle.

Let T = {1, …, 12} × {0, 1, …, 59} denote the set of numbers that can be displayed on a digital clock in twelve-hour mode. (A clock actually displays a colon between the two numbers.) We can think of a clock as a function c : T → T, so that when the real time is t ∈ T, then the clock displays the time c(t). (For example, if fastby7 runs seven minutes fast, then fastby7(12:00) = 12:07.)
For several of these questions, it may be helpful to make use of the function add : T × Z≥0 → T so that add(t, x) denotes the time that’s x minutes later than t. See Exercise 2.243.
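For readers who want to experiment with clocks as functions, here is one possible Python rendering of add. It is a sketch under assumptions, not the definition from Exercise 2.243: a time t ∈ T is represented as a pair (hour, minute), and the wraparound from 12:59 to 1:00 is handled by temporarily treating hour 12 as hour 0.

```python
def add(t, x):
    """Return the time that is x minutes later than t, where t is a pair
    (hour, minute) with hour in 1..12 and minute in 0..59, and x >= 0."""
    hour, minute = t
    # Minutes elapsed since 12:00, counting hour 12 as hour 0.
    total = ((hour % 12) * 60 + minute + x) % (12 * 60)
    new_hour = total // 60
    return (12 if new_hour == 0 else new_hour, total % 60)

def fastby7(t):
    """A clock that runs seven minutes fast."""
    return add(t, 7)

print(fastby7((12, 0)))   # (12, 7)
print(add((12, 59), 1))   # (1, 0)
print(add((11, 59), 1))   # (12, 0)
```

Because T is finite (720 times), any candidate formalization of the clock predicates in the exercises that follow can be spot-checked by looping over every t ∈ T.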
Formalize each of the following predicates using only the standard quantifiers and equality symbols.
3.163 A clock is right if it always displays the correct time. Formalize the predicate right.
3.164 A clock keeps time if there’s some fixed offset by which it is always off from being right. (For example, fastby7 above correctly keeps time.) Formalize the predicate keepsTime.
3.165 A clock is close enough if it always displays a time that’s within two minutes of the correct time. Formalize the predicate closeEnough.
3.166 A clock is broken if there’s some fixed time that it always displays, regardless of the real time. Formalize the predicate broken.
3.167 “Even a broken clock is right twice a day,” they say. (They mean: “even a broken clock displays the correct time at least once per T.”) Formalize the adage and prove it true.
A classic topic of study for computational biologists is genomic distance measures: given two genomes, we’d like to report a single number that represents how different those two genomes are. These distance computations are useful in, for example, reconstructing the evolutionary tree of a collection of species.
Consider two genomes A and B of a bacterium. Let’s label the n genes that appear in A’s chromosome, in order, as πA = 1, 2, . . . , n. The same genes appear in a different order in B: say, in the order πB = r1, r2, . . . , rn. A particular model of genomic distance will define a specific way in which this list of numbers can mutate; the question at hand is to find the minimum-length sequence of these mutations that are necessary to explain the difference between the orders πA and πB. One type of biologically motivated mutation is the prefix reversal, in which some prefix of πB is reversed, as in ⟨3, 2, 1, 4, 5⟩ → ⟨1, 2, 3, 4, 5⟩. It turns out that this model is exactly the pancake-flipping problem, the subject of the lone academic paper with Bill Gates as an author.14 (See Figure 3.38.)
14 W. H. Gates and C. H. Papadimitriou. Bounds for sorting by prefix reversals. Discrete Mathematics, 27:47–57, 1979.
Figure 3.38: The pancake-flipping problem, and its biological significance.
(a) Two pancake-flipping instances. Given a stack of pancakes, with radii labeled from top to bottom, we must sort the pile by radius. We sort with a sequence of flips: turn the top k pancakes upside down, for some k, and replace them (inverted) on top of the remaining pancakes. The left instance is ⟨4, 3, 2, 1, 5⟩; the right is ⟨5, 4, 3, 1, 2⟩. They require 1 and 2 flips, respectively, to solve (as shown).
(b) A biological view. Think of a chromosome as a sequence of genes. If, in the course of cell activity, one end of the chromosome comes in contact with a point in the middle of the chromosome, a loop forms. If the loop untangles itself “the other way around,” the effect is to reverse the order of the genes in that loop. This transformation effects a prefix reversal on those genes. Here 123456789abc becomes 987654321abc.
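A flip (prefix reversal) is easy to express with list slicing. The sketch below uses a hypothetical helper flip(stack, k), with the top of the stack at index 0; the particular flip sizes are one choice that sorts the two instances from Figure 3.38(a), found by hand for this illustration.

```python
def flip(stack, k):
    """Reverse the top k pancakes (a prefix reversal); stack[0] is the top."""
    return stack[:k][::-1] + stack[k:]

# The left instance of Figure 3.38(a) can be sorted with one flip.
print(flip([4, 3, 2, 1, 5], 4))           # [1, 2, 3, 4, 5]

# The right instance can be sorted with two flips.
after_first = flip([5, 4, 3, 1, 2], 5)    # [2, 1, 3, 4, 5]
print(flip(after_first, 2))               # [1, 2, 3, 4, 5]

# The gene reversal described in Figure 3.38(b).
print("".join(flip(list("123456789abc"), 9)))  # 987654321abc
```

Here k counts pancakes rather than naming an index, so flip(stack, k) touches positions 0 through k−1; keeping that convention explicit helps avoid off-by-one mistakes.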
Suppose that you are given a stack of pancake radii r1, r2, . . . , rn, arranged from top to bottom, where {r1, r2, . . . , rn} = {1, 2, . . . , n} (but not necessarily in order). Write down a fully quantified logical expression that expresses the condi- tion that . . .
3.168 . . . the given pancakes are sorted.
3.169 . . . the given pancakes can be sorted with exactly one flip (see Figure 3.38).
3.170 . . . the given pancakes can be sorted with exactly two flips. (Hint: writing a program to verify that your indices aren’t off by one is a very good idea!)
Let P be a set of people, and let T be a set of times. Let friends(x,y) be a predicate denoting that x ∈ P and y ∈ P are friends. Let bought(x, t) be a predicate denoting that x ∈ P bought an iPad at time t ∈ T.
3.171 Formalize this statement in predicate logic: “Everyone who bought an iPad has a friend who bought one previously.”
3.172 Is the claim from Exercise 3.171 true (in the real world)? Justify your answer.
In programming, an assertion is a logical statement that announces (“asserts”) a condition φ that the programmer believes to be true. For example, a programmer who is about to access the 202nd element of an array A might assert that length(A) ≥ 202 before accessing this element. When an executing program in languages like C and Java reaches an assert statement, the program aborts if the condition in the statement isn’t true.
For the following, give a nonempty input array A that would cause the stated assertion from Figure 3.39 to fail (that is, for the asserted condition to be false).
3.173 foo
3.174 bar
3.175 baz
Taking it further: Using assertions can be an extremely valuable way of documenting and debugging programs, particularly because liberally including assertions will allow the revelation of unexpected data values much earlier in the execution of a program. And these languages have a global toggle that allows the testing of assertions to be turned off, so once the programmer is satisfied that the program is working properly, she doesn’t have to worry about any running-time overhead for these checks.
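Python, like C and Java, has an assert statement, and the array-length example from the text might look as follows. This is a minimal sketch; the function name element_202 is hypothetical, and Python's 0-based indexing means the 202nd element lives at index 201.

```python
def element_202(A):
    """Return the 202nd element of A (index 201 with 0-based indexing)."""
    # The asserted condition is what the programmer believes to be true here.
    assert len(A) >= 202, "A has fewer than 202 elements"
    return A[201]

print(element_202(list(range(1000))))  # 201

try:
    element_202([1, 2, 3])
except AssertionError as error:
    print("assertion failed:", error)
```

In Python, the global toggle is the interpreter's -O option: running a program with python -O skips assert statements entirely, so this sketch reports the failure only when the program is run without that option.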
While the quantifiers ∀ and ∃ are by far the most common, there are some other quantifiers that are sometimes used. For each of the following quantifiers, write an expression that is logically equivalent to the given statement that uses only the quantifiers ∀ and ∃; standard propositional logic notation (∧, ¬, ∨, ⇒); standard equality/inequality notation (=, ≥, ≤, <, >); and the predicate P in the question.
3.176 Write an equivalent expression to ∃! x ∈ Z : P(x) (“there exists a unique x ∈ Z such that P(x)”), which is true when there is one and only one value of x in the set Z such that P(x) is true.
3.177 Write an equivalent expression to ∃∞ x ∈ Z : P(x) (“there exist infinitely many x ∈ Z such that P(x)”), which is true when there are infinitely many different values of x ∈ Z such that P(x) is true.
Here are two formulations of Goldbach’s conjecture (see Example 3.47):
∀n ∈ Z : (n > 2 ∧ 2 | n ⇒ (∃p ∈ Z : ∃q ∈ Z : (isPrime(p) ∧ isPrime(q) ∧ n = p + q)))
∀n ∈ Z : ∃p ∈ Z : ∃q ∈ Z : (n ≤ 2 ∨ 2 ∤ n ∨ (isPrime(p) ∧ isPrime(q) ∧ n = p + q)).
3.178 Prove that these two formulations of Goldbach’s conjecture are logically equivalent.
3.179 Rewrite Goldbach’s conjecture without using isPrime: that is, using only quantifiers, the | predicate, and standard arithmetic (+, ·, ≥, etc.).
3.180 Even the | predicate implicitly involves a quantifier: p | q is equivalent to ∃k ∈ Z : p · k = q. Rewrite Goldbach’s conjecture without using the | predicate either: that is, use only quantifiers and standard arithmetic symbols (+, ·, ≥, etc.).
3.181 (programming required) As we discussed, the truth value of Goldbach’s conjecture is currently unknown. As of April 2012, the conjecture has been verified for all even integers from 4 up to 4 × 10^18, through a massive distributed computation effort led by Tomás Oliveira e Silva. Write a program to test Goldbach’s conjecture, in a programming language of your choice, for even integers up to 10,000.
Most real-world English utterances are ambiguous: that is, there are multiple possible interpretations of the given sentence. A particularly common type of ambiguity involves order of quantification. For each of the following English sentences, find as many different logical readings based on order of quantification as you can. Write down those interpretations using pseudological notation, and also write a sentence that expresses each meaning unambiguously.
3.182 A computer crashes every day.
3.183 Every prime number except 2 is divisible by an odd integer greater than 1.
3.184 Every student takes a class every term.
3.185 Every submitted program failed on a case submitted by a student.
3.186 You should have found two different logical interpretations in Exercise 3.183. One of these interpretations is a theorem, and one of them is not. Decide which is which, and prove your answers.
Figure 3.39: Some functions using assert statements.
foo(A[1…n]):
    last = 0
    for index = 1 … n-1:
        if A[index] > A[index+1]:
            last = index
    assert(last >= 1 and last <= n-1)
    swap A[last], A[last+1]

bar(A[1…n]):
    total = A[1]
    i = 1
    for i = 2 … n-1:
        if A[i+1] > A[i]:
            total = total + A[i]
    assert(total > A[1])
    return total

baz(A[1…n]):
    for start = 1 … n-1:
        min = start
        for i = start+1 … n:
            assert(start == 1 or A[i] > A[start-1])
            if A[min] > A[i]:
                min = i
        swap A[start], A[min]
Let S be an arbitrary nonempty set and let P be an arbitrary binary predicate. Decide whether the following statements are always true (for any P and S), or whether they can be false. Prove your answers.
3.187 (∃y ∈ S : ∀x ∈ S : P(x, y)) ⇒ (∀x ∈ S : ∃y ∈ S : P(x, y))
3.188 (∀x ∈ S : ∃y ∈ S : P(x, y)) ⇒ (∃y ∈ S : ∀x ∈ S : P(x, y))
Consider any unary predicate P(x) over a nonempty set S. It turns out that both of the following propositions are theorems of predicate logic. Prove them both.
3.189 ∀x ∈ S : [P(x) ⇒ (∃y ∈ S : P(y))]
3.190 ∃x ∈ S : [P(x) ⇒ (∀y ∈ S : P(y))]
The following blocks of code use nested loops to compute some fact about a predicate P. For each, write a fully quantified statement of predicate logic whose truth value matches the value returned by the given code. (Assume that S is a finite universe.)
3.191–3.196

for x in S:
    for y in S:
        flag = False
        if P(x) or P(y):
            flag = True
        if flag:
            return True
return False

for x in S:
    flag = True
    for y in S:
        if not P(x,y):
            flag = False
    if flag:
        return True
return False

for x in S:
    for y in S:
        if P(x,y):
            return False
return True

flag = False
for x in S:
    for y in S:
        if P(x,y):
            flag = True
return flag

for x in S:
    flag = False
    for y in S:
        if not P(x,y):
            flag = True
    if flag:
        return True
return False

for x in S:
    flag = False
    for y in S:
        if not P(x,y):
            flag = True
    if not flag:
        return False
return True

3.197 As we’ve discussed, there is no algorithm that can decide whether a given fully quantified proposition φ is a theorem of predicate logic. But there are several specific types of fully quantified propositions for which we can decide whether a given statement is a theorem. Here you’ll show that, when quantification is only over a finite set, it is possible to give an algorithm to determine whether φ is a theorem. Suppose that you are given a fully quantified proposition φ, where the domain for every quantifier is a finite set, say S = {0, 1}. Describe an algorithm that is guaranteed to figure out whether φ is a theorem.
3.6 Chapter at a Glance
Propositional Logic
A proposition is the kind of thing that is either true or false. An atomic proposition (or Boolean variable) is a conceptually indivisible proposition. A compound proposition (or Boolean formula) is one built up using a logical connective and one or more simpler propositions. The most common logical connectives are the ones shown in Figure 3.40. A proposition that contains the atomic propositions p1, . . . , pk is sometimes called a Boolean formula over p1, . . . , pk or a Boolean expression over p1, . . . , pk.
The truth value of a proposition is its truth or falsity. (The truth value of a Boolean formula over p1, . . . , pk is determined only by the truth values of each of p1, . . . , pk.) Each logical connective is defined by how the truth value of the compound proposition formed using that connective relates to the truth values of the constituent propositions. A truth table defines a connective by listing, for each possible assignment of truth values for the constituent propositions, the truth value of the entire compound proposition. See Figure 3.41. Observe that the proposition p ⇒ q is true if, whenever p is true, q is too. So the only situation in which p ⇒ q is false is when p is true and q is false. False implies anything! Anything implies true!
Consider a Boolean formula over variables p1, . . . , pk. A truth assignment is a setting to true or false for each variable. (So a truth assignment corresponds to a row of the truth table for the proposition.) A truth assignment satisfies the proposition if, when the values from the truth assignment are plugged in, the proposition is true. A Boolean formula is a tautology if every truth assignment satisfies it; it’s satisfiable if some truth assignment satisfies it; and it’s unsatisfiable or a contradiction if no truth assignment does. Two Boolean propositions are logically equivalent if they’re satisfied by exactly the same truth assignments (that is, they have identical truth tables).
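These definitions translate directly into a brute-force check over all 2^k truth assignments. The sketch below represents a Boolean formula over k variables as a Python function of k Boolean arguments; the helper names (truth_assignments, is_tautology, and so on) are hypothetical names for this illustration.

```python
from itertools import product

def truth_assignments(k):
    """Generate all 2**k truth assignments for k variables."""
    return product([True, False], repeat=k)

def is_tautology(formula, k):
    """Every truth assignment satisfies the formula."""
    return all(formula(*a) for a in truth_assignments(k))

def is_satisfiable(formula, k):
    """Some truth assignment satisfies the formula."""
    return any(formula(*a) for a in truth_assignments(k))

def logically_equivalent(f, g, k):
    """f and g are satisfied by exactly the same truth assignments."""
    return all(f(*a) == g(*a) for a in truth_assignments(k))

def implies(p, q):
    return (not p) or q

print(is_tautology(lambda p: p or not p, 1))         # True
print(is_satisfiable(lambda p: p and not p, 1))      # False
print(logically_equivalent(                          # True
    lambda p, q: not (p and q),
    lambda p, q: (not p) or (not q),
    2,
))
```

This approach takes time exponential in k, so it is practical only for formulas with a modest number of variables.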
Consider an implication p ⇒ q. The antecedent or hypothesis of the implication is p; the consequent or conclusion of the implication is q. The converse of the implication p ⇒ q is the implication q ⇒ p. The contrapositive is the implication ¬q ⇒ ¬p. Any implication is logically equivalent to its contrapositive. But an implication is not logically equivalent to its converse!
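A four-row truth table is enough to confirm both claims about the contrapositive and the converse. Here is a short self-contained sketch; the helper implies and the variable names are hypothetical.

```python
from itertools import product

def implies(p, q):
    """Truth value of p ⇒ q: false only when p is true and q is false."""
    return (not p) or q

# Each row holds: p, q, p ⇒ q, ¬q ⇒ ¬p (contrapositive), q ⇒ p (converse).
rows = [
    (p, q, implies(p, q), implies(not q, not p), implies(q, p))
    for p, q in product([True, False], repeat=2)
]
for row in rows:
    print(row)

same_as_contrapositive = all(r[2] == r[3] for r in rows)
same_as_converse = all(r[2] == r[4] for r in rows)
print(same_as_contrapositive)  # True
print(same_as_converse)        # False
```

The implication and its converse disagree exactly on the two rows where p and q have different truth values.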
A literal is a Boolean variable or the negation of a Boolean variable. A proposition is in conjunctive normal form (CNF) if it is the conjunction (and) of a collection of clauses, where a clause is a disjunction (or) of a collection of literals. A proposition is in disjunctive normal form (DNF) if it is the disjunction of a collection of clauses, where a clause is a conjunction of a collection of literals. Every proposition is logically equivalent to a proposition that is in CNF, and to another that is in DNF.
Figure 3.40: Logical connectives.
negation       ¬p      “not p”
disjunction    p ∨ q   “p or q” (inclusive: “p, q, or both”)
conjunction    p ∧ q   “p and q”
implication    p ⇒ q   “if p, then q” or “p implies q”
equivalence    p ⇔ q   “p if and only if q”
exclusive or   p ⊕ q   “p xor q” (“p or q, but not both”)

p   ¬p
T   F
F   T

p   q   p ∧ q   p ∨ q   p ⇒ q   p ⊕ q   p ⇔ q
T   T   T       T       T       F       T
T   F   F       T       F       T       F
F   T   F       T       T       T       F
F   F   F       F       T       F       T
Figure 3.41: Truth tables for the basic logical connectives.

Predicate Logic
A predicate is a statement containing some number of variables that has a truth value once values are plugged in for those variables. (Alternatively, a predicate is a Boolean-valued function.) Once particular values for these variables are plugged in, the resulting expression is a proposition. A proposition can also be formed from a predicate through quantifiers:
• The universal quantifier ∀ (“for all”): the proposition ∀x ∈ U : P(x) is true if, for every x ∈ U, we have that P(x) is true.
• The existential quantifier ∃ (“there exists”): the proposition ∃x ∈ U : P(x) is true if, for at least one x ∈ U, we have that P(x) is true.
The set U is called the universe or domain of discourse. When the universe is clear from context, it may be omitted from the notation.
In the expression [∀x : ___] or [∃x : ___], the scope or body of the quantifier is the underlined blank, and the variable x is bound by the quantifier. A free or unbound variable is one that is not bound by any quantifier. A fully quantified expression is one with no free variables.
A theorem of predicate logic is a fully quantified expression that is true for all possible meanings of the predicates in it. Two expressions are logically equivalent if they are true under precisely the same set of meanings for their predicates. (Alternatively, two expressions φ and ψ are logically equivalent if φ ⇔ ψ is a theorem.) Two useful theorems of predicate logic are De Morgan’s laws: ¬∀x ∈ S : P(x) ⇔ ∃x ∈ S : ¬P(x) and ¬∃x ∈ S : P(x) ⇔ ∀x ∈ S : ¬P(x).
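Over a finite set, De Morgan's laws for quantifiers can be spot-checked with Python's all and any. The sketch below is only a sanity check on a few made-up predicates and sets, not a proof; the helper name de_morgan_holds is hypothetical.

```python
def de_morgan_holds(P, S):
    """Check both of De Morgan's laws for predicate P over the finite set S."""
    S = list(S)
    # ¬∀x ∈ S : P(x)  ⇔  ∃x ∈ S : ¬P(x)
    first = (not all(P(x) for x in S)) == any(not P(x) for x in S)
    # ¬∃x ∈ S : P(x)  ⇔  ∀x ∈ S : ¬P(x)
    second = (not any(P(x) for x in S)) == all(not P(x) for x in S)
    return first and second

# Hypothetical predicates and sets, chosen only for illustration.
predicates = [lambda x: x > 0, lambda x: x % 2 == 0, lambda x: True, lambda x: False]
sets = [range(-5, 6), [2, 4, 6], [1], []]

print(all(de_morgan_holds(P, S) for P in predicates for S in sets))  # True
```

A theorem must hold for every meaning of P and every S, so a passing check like this is evidence rather than a proof; a single failing case, however, would be a genuine counterexample.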
There is no general algorithm that can test whether any given expression is a theorem. If we wish to prove that an implication φ ⇒ ψ is a theorem, we can do so with a proof by assuming the antecedent: to prove that the implication φ ⇒ ψ is always true, we will rule out the one scenario in which it wouldn’t be; specifically, we assume that φ is true, and then prove that ψ must be true too, under this assumption.
A vacuously quantified statement is one in which the domain of discourse is the empty set. The vacuous universal quantification ∀x ∈ ∅ : P(x) is a theorem; the vacuous existential quantification ∃x ∈ ∅ : P(x) is always false.
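Python's built-in all and any follow the same convention for empty collections, which gives a quick way to see vacuous quantification in action. In this small sketch, the predicate P is an arbitrary stand-in; its definition does not matter, because it is never evaluated.

```python
def P(x):
    """An arbitrary predicate; it is never called on an empty domain."""
    return x > 1000

empty = []

vacuous_universal = all(P(x) for x in empty)    # ∀x ∈ ∅ : P(x)
vacuous_existential = any(P(x) for x in empty)  # ∃x ∈ ∅ : P(x)

print(vacuous_universal)    # True
print(vacuous_existential)  # False
```

One way to read the result: a universal claim can fail only by exhibiting a counterexample, and an existential claim can succeed only by exhibiting a witness, and the empty set supplies neither.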
Quantifiers are nested if one quantifier is inside the scope of another quantifier. Nested quantifiers work in precisely the same way as single quantifiers, applied in sequence. A proposition involving nested quantifiers like ∀x ∈ S : ∃y ∈ T : R(x, y) is true if, for every choice of x, there is some choice of y (which can depend on the choice of x) for which R(x, y) is true. Order of quantification matters in general; the expressions ∀x : ∃y : R(x, y) and ∃y : ∀x : R(x, y) are not logically equivalent.
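To see concretely that the order of nested quantifiers matters, here is a small Python sketch over finite sets. The sets S and T and the relation R (equality) are hypothetical choices for illustration; any relation in which each x has its own witness y, but no single y works for every x, would do.

```python
S = range(3)  # hypothetical universe for x
T = range(3)  # hypothetical universe for y


def R(x, y):  # hypothetical relation: "x equals y"
    return x == y


# ∀x ∈ S : ∃y ∈ T : R(x, y)  (y may depend on x)
forall_exists = all(any(R(x, y) for y in T) for x in S)

# ∃y ∈ T : ∀x ∈ S : R(x, y)  (one y must work for every x)
exists_forall = any(all(R(x, y) for x in S) for y in T)
```

Here forall_exists is True (choose y = x), while exists_forall is False (no single y equals every x).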

364 CHAPTER 3. LOGIC
Key Terms and Results
Key Terms
Propositional Logic
• proposition
• truth value
• atomic and compound propositions
• logical connectives:
– negation (¬)
– conjunction (∧)
– disjunction (∨)
– implication (⇒)
– exclusive or (⊕)
– if and only if (⇔)
• truth assignments and truth tables
• tautology
• satisfiability/unsatisfiability
• logical equivalence
• antecedent and consequent
• converse, contrapositive, and inverse
• conjunctive normal form (CNF)
• disjunctive normal form (DNF)
Predicate Logic
• predicate
• quantifiers:
– universal quantifier (∀)
– existential quantifier (∃)
• free and bound variables
• fully quantified expression
• theorems of predicate logic
• logical equivalence in predicate logic
• proof by assuming the antecedent
• vacuous quantification
• nested quantifiers
Key Results
Propositional Logic
1. We can build a truth table for any proposition by repeatedly applying the definitions of each of the logical connectives, as shown in Figure 3.4.
2. Two propositions φ and ψ are logically equivalent if and only if φ ⇔ ψ is a tautology.
3. An implication p ⇒ q is logically equivalent to its contrapositive ¬q ⇒ ¬p, but not to its converse q ⇒ p.
4. There are many important propositional tautologies and logical equivalences, some of which are shown in Figures 3.10 and 3.12.
5. We can show that propositions are logically equivalent by showing that their truth tables agree in every row.
6. Every proposition is logically equivalent to one that is in disjunctive normal form (DNF) and to one that is in conjunctive normal form (CNF).
Predicate Logic
1. We can build a proposition from a predicate P(x) by plugging in a particular value for x, or by quantifying over x as in ∀x : P(x) or ∃x : P(x).
2. Unlike with propositional logic, there is no algorithm that is guaranteed to determine whether a given fully quantified predicate-logic expression is a theorem.
3. There are many important predicate-logic theorems, some of which are shown in Figure 3.23.
4. The statements ¬∀x : P(x) and ∃x : ¬P(x) are logically equivalent. So are ¬∃x : P(x) and ∀x : ¬P(x).
5. We can think of nested quantifiers as a sequence of single quantifiers, or as “games with a demon.”

4 Proofs
In which our heroes build ironclad scaffolding to support their claims, thereby making them impervious to any perils they might encounter.

402 CHAPTER 4. PROOFS
4.1 Why You Might Care
By far the best proof is experience.
Sir Francis Bacon (1561–1626)
A proof is a convincing argument that establishes a particular claim as fact. That claim might be something explicitly computational: Bubble Sort performs fewer comparisons than Merge Sort when the input array is already sorted, for example. Or the claim might be noncomputational, at least superficially: a property of an operating system, a structural fact about the minimum-length sequence of flips to sort pancakes, the impossibility of designing a voting system with a certain set of properties.
Generally speaking, our goal—in this chapter, in this book—is to establish new facts. And that’s precisely the point of a proof: to derive a new fact from old facts, while persuading the reader that the new fact is, indeed, a fact. (For example, we can derive a new fact using Modus Ponens: if we know both p and p ⇒ q, then we can conclude that q is a fact, too.) In Section 4.3, the technical meat of this chapter, we will develop a toolbox of techniques to use in proofs, and some strategies for choosing among these techniques. (In Section 4.5, we’ll also catalogue some common types
of mistakes in purported proofs, so that you can avoid them—and recognize bogus proofs when others attempt them.) We’ll illustrate these proof techniques throughout Section 4.3 with a hefty collection of examples about arithmetic.
While the proof techniques themselves are the “point” of this chapter, in many cases the fact that we’re proving is at least as interesting as the proof of that fact. Throughout our tour of proof techniques, we’ll encounter a variety of examples of (fingers crossed!) interesting facts: about propositional logic, including the fact that we need only one logical connective (“nand”) to express every proposition; about geometry (the Pythagorean theorem); about prime numbers; and about uncomputability (there are problems that cannot be solved by any computer!). We begin in Section 4.2 with an extended exploration of error-correcting codes, systems that allow for the reliable transmission and storage of information even in environments that corrupt data as it’s stored/transmitted/received/retrieved. (For example, CDs/DVDs are susceptible to scratches, and deep-space satellites’ transmissions are susceptible to radiation.) This section will merely scratch the surface of error-correcting codes, but it will serve as a nice introduction to them, and to proofs.
Why are proofs useful in computer science? First, proofs help prevent bugs. Whether or not she writes down in full detail a proof that her code is correct, a good software developer is always reasoning carefully about whether a function performs the task it’s supposed to perform, or whether a particular optimization continues to meet the given specification. For a theoretical computer scientist, proofs are bread and butter: proofs of correctness for novel algorithms, or proofs of the hardness of solving a particular problem. For both theoretically and practically oriented computer scientists, a proof often yields great insight that can avoid a brute force solution, improve the efficiency
of the code, or unearth some structural property of a problem that reveals that the problem doesn’t even need to be solved in the first place.

4.2 An Extended Application with Proofs: Error-Correcting Codes
Irrationally held truths may be more harmful than reasoned errors.
Thomas H. Huxley (1825–1895)
This section introduces error-correcting codes, a way of encoding data so that it can be transmitted correctly even in the face of (a limited number of) errors in transmission. These codes are used widely (for example, on DVDs/CDs and in file transfer protocols), and they’re interesting to study on their own. But, despite appearances, they are not the point of this section! Rather, they’re mostly an excuse to introduce a technical topic with some interesting (and nonobvious) results, and to persuade you of a few of those results. In other words, this section is really about proofs.
Error-detecting and error-correcting codes: the basic idea
Visa and Mastercard use 16-digit numbers for their credit and debit cards, but it turns out that there are only 10^15 valid credit-card numbers: a number is valid only if a particular arithmetic calculation on the digits (more or less, adding up the digits and taking the result modulo 10) turns out to be zero. (See Exercises 4.1–4.5 for details of the calculation.) Or, to describe this fact in another way: if you get a (mildly gullible) friend to read you any 15 digits of his or her credit-card number, you can figure out the 16th digit. Less creepily, this system means that there’s an error-detection mechanism built into credit-card numbers: if any one digit in your number is mistranscribed, then a very simple algorithm can reject that incorrect card number as invalid (because the calculation above will yield an answer other than zero).
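The text defers the exact calculation to the exercises. As an illustration only, the sketch below assumes the check is the widely used Luhn checksum (double every second digit counting from the right, subtract 9 from any doubled value above 9, and require the total to be 0 modulo 10). The function names luhn_valid and check_digit are hypothetical, and 4111 1111 1111 1111 is a commonly used test number, not a real account.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum (assumed scheme)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, d in enumerate(reversed(digits)):
        if position % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the doubled value
        total += d
    return total % 10 == 0


def check_digit(first_fifteen: str) -> int:
    """Recover the final digit from the first 15 by trying all ten candidates."""
    for d in range(10):
        if luhn_valid(first_fifteen + str(d)):
            return d
    raise ValueError("no valid check digit found")


sample = "4111111111111111"          # test number, assumed for illustration
is_ok = luhn_valid(sample)           # passes the checksum
last = check_digit(sample[:15])      # recovers the 16th digit
garbled = luhn_valid("4111111111111112")  # one mistranscribed digit is rejected
```

Under this assumed scheme, changing any single digit changes the total modulo 10, which is why one mistranscribed digit is always caught.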
In this section, we’ll explore encoding schemes with this sort of error-handling capability. Suppose that you have some binary data that you wish to transmit to a friend across an imperfect channel, that is, one that (due to cosmic rays, hardware failures, or whatever) occasionally mistransmits a 0 as a 1, or vice versa. (When we refer to an error in a bitstring x, what we mean is a “substitution error,” where some single bit in x is flipped.) The fundamental idea will be to add redundancy to the transmitted data; if there is enough redundancy relative to the number of errors, then enough correct information will be transmitted to allow the receiver to reconstruct the original message. We’ll explore both error-detecting codes that are able to recognize whether an error has occurred (at least, as long as there aren’t too many errors) and error-correcting codes that can fix a small number of errors. To reiterate the above, though: although we’re focusing on error-correcting and error-detecting codes in this section, the fundamental purpose of this section is to introduce proof techniques. Along the way, we’ll see some interesting results about error-correcting codes, but the takeaway message is really about the methods that we’ll use to prove those results.
Taking it further: Aside from credit-card numbers, other examples of error-detecting or error-correcting codes include checksums on a transferred file—we might break a large file we wish to transmit into 32-bit blocks, transmit those blocks individually, and transmit as a final 32-bit block the XOR of all previously transmitted blocks—as a way to check that the file was transmitted properly. Error-correcting codes are also used in storing data on media (hard disks and CDs/DVDs, for example) so that one can reconstruct stored data even in the face of hardware errors (or scratches on the disc).
4.2. ERROR-CORRECTINGCODES 403

The idea of error detection appears in other contexts, too. UPC (“universal product code”) bar codes on products in supermarkets use error checking similar to that in credit-card numbers. There are error-detection aspects in DNA. And “the buddy system” from elementary school field trips detects any one “deletion error” among the group (though two “deletions” may evade detection of the system).
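The XOR checksum mentioned in this aside can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: blocks are modeled as Python integers standing in for 32-bit words, and the names xor_checksum, transmit, and looks_intact are hypothetical.

```python
from functools import reduce


def xor_checksum(blocks):
    """XOR of all blocks (each block modeled as a 32-bit integer)."""
    return reduce(lambda a, b: a ^ b, blocks, 0)


def transmit(blocks):
    """Append the XOR of all data blocks as a final checksum block."""
    return list(blocks) + [xor_checksum(blocks)]


def looks_intact(received):
    """Data blocks XORed with their checksum block give 0 if nothing changed."""
    return xor_checksum(received) == 0


data = [0xDEADBEEF, 0x01234567, 0x89ABCDEF]  # hypothetical file contents
sent = transmit(data)

one_flip = list(sent)
one_flip[1] ^= 1 << 5          # a single flipped bit is detected

two_flips = list(sent)
two_flips[0] ^= 1 << 5         # two flips in the same bit position
two_flips[2] ^= 1 << 5         # of different blocks cancel out
```

The last case shows the limits of this checksum: it detects any single flipped bit, but two flips in the same bit position of different blocks go unnoticed.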
4.2.1 A Formal Introduction
Imagine a sender who wishes to transmit a message m ∈ {0, 1}^k to a receiver. A code C is a subset of {0, 1}^n, listing the set of legal codewords; each k-bit message m is encoded as an n-bit codeword c ∈ C. The codeword is then transmitted to the receiver, but it may be corrupted during transmission. The recipient of the (possibly corrupted) n-bit string c′ decodes c′ into a new message m′ ∈ {0, 1}^k. The goal is that, so long as the corruption is limited, the decoded message is identical to the original message; in other words, that m = m′ as long as c′ ≈ c. (We’ll make the meaning of “≈” precise soon.) Figure 4.1 shows a schematic of the process.
[Schematic: the sender holds m ∈ {0, 1}^k → encode → c ∈ C (where C ⊆ {0, 1}^n) → corruption → c′ ∈ {0, 1}^n → decode → m′ ∈ {0, 1}^k at the receiver.]
(For an error-detecting code, the receiver still receives the bitstring c′, but determines whether the originally transmitted codeword was corrupted instead of determining which codeword was originally transmitted, as in an error-correcting code.)
Measuring the distance between bitstrings
Before we get to codes themselves, we need a way of quantifying how similar or
different two bitstrings are:
The Hamming distance is named after Richard Hamming, a 20th-century American mathematician/computer scientist who was the third winner of the Turing Award.
Figure 4.1: A schematic view of error-correcting codes. The goal is that, as long as there isn’t too much corruption, the received message m′ is identical to the sent message m.
Definition 4.1 (Hamming distance)
Let x, y ∈ {0, 1}^n be two n-bit strings. The Hamming distance between x and y, denoted by ∆(x, y), is the number of positions in which x and y differ. In other words,
∆(x, y) := |{i ∈ {1, 2, …, n} : x_i ≠ y_i}|.
(Hamming distance is undefined if x and y don’t have the same length.)
For example, ∆(011, 101) = 2 because 011 and 101 differ in bit positions #1 and #2, and ∆(0011, 0111) = 1 because 0011 and 0111 differ in bit #2. Similarly, ∆(0000, 1111) = 4 because all four bits differ, and ∆(10101, 10101) = 0 because all five bits match.
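Definition 4.1 translates directly into code. The sketch below is one straightforward way to compute it in Python, representing bitstrings as strings of the characters 0 and 1; the function name hamming_distance is an illustrative choice.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions in which the equal-length bitstrings x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance is undefined for strings of different lengths")
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


# The four examples from the text.
d1 = hamming_distance("011", "101")
d2 = hamming_distance("0011", "0111")
d3 = hamming_distance("0000", "1111")
d4 = hamming_distance("10101", "10101")
```

These four calls give 2, 1, 4, and 0, matching the values worked out above.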
In Exercise 4.6, you’ll show that the Hamming distance is a metric, which means that it satisfies the following properties, for all bitstrings x, y, z ∈ {0, 1}^n:
• “reflexivity”: ∆(x, y) = 0 if and only if x = y;
• “symmetry”: ∆(x, y) = ∆(y, x); and
• “the triangle inequality”: ∆(x, y) ≤ ∆(x, z) + ∆(z, y). (See Figure 4.2.)
Informally, the fact that ∆ is a metric means that it generally matches your intuitions about geometric (Euclidean) distance.
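The three metric properties can be spot-checked by brute force for a small length. The Python sketch below does so for n = 4 (an arbitrary choice); it is evidence, not the proof that Exercise 4.6 asks for, and the names delta and strings are illustrative.

```python
from itertools import product


def delta(x, y):
    """Hamming distance between equal-length bitstrings."""
    return sum(a != b for a, b in zip(x, y))


n = 4  # small enough to enumerate all 2^n bitstrings
strings = ["".join(bits) for bits in product("01", repeat=n)]

reflexive = all((delta(x, y) == 0) == (x == y) for x in strings for y in strings)
symmetric = all(delta(x, y) == delta(y, x) for x in strings for y in strings)
triangle = all(
    delta(x, y) <= delta(x, z) + delta(z, y)
    for x in strings
    for y in strings
    for z in strings
)
```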
Error-detecting and error-correcting codes
(It might seem a bit strange to require that the number of codewords in C be a precise power of two, but doing so is convenient, as it allows us to consider all k-bit strings as the set of possible messages, for k := log_2 |C|.) Here’s an example of a code:
Example 4.1 (A small code)
The set C := {000000, 101010, 000111, 100001} is a code. Because |C| = 4 = 2^2, there are four messages, namely the four elements of {0, 1}^2 = {00, 01, 10, 11}. And because C ⊆ {0, 1}^6, the codewords (the four elements of the set C) are elements of {0, 1}^6.
We can think of a code as being defined by a pair of operations:
• encoding: given a message m ∈ {0, 1}^k, which codeword in C should we transmit? (We’d break up a longer message into a sequence of k-bit message chunks.)
• decoding: from a received (possibly corrupted) bitstring c′ ∈ {0, 1}^n, what message should we infer was sent? (Or, if we are trying to detect errors rather than correct them: from a received bitstring c′ ∈ {0, 1}^n, do we say that an error occurred, or not?)
For the moment, we’ll consider a generic (and slow) way of encoding and decoding. Given C, we build a table mapping messages to codewords, by matching up the ith-largest message with the ith-largest codeword (with both the messages from {0, 1}^k and the codewords in C sorted in numerical order):
• We encode a message m by the codeword in row m of the table.
• We detect an error in a received bitstring c′ by reporting “no error” if c′ appears in the table, and reporting “error” if c′ does not appear in the table.
• We decode a received bitstring c′ by identifying the codeword c ∈ C that’s closest to c′, measured by Hamming distance. We decode c′ as the message in row c of the table. (If there’s a tie, we choose one of the tied-for-closest codewords arbitrarily.)
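The generic table-based scheme just described can be sketched as follows in Python, using the code C from Example 4.1. The function names (build_table, encode, detect_error, decode) are hypothetical, and ties in decode are broken by whichever closest codeword min happens to see first, which is one way of choosing "arbitrarily."

```python
def hamming_distance(x, y):
    """Number of positions in which equal-length bitstrings x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def build_table(code):
    """Pair the i-th message with the i-th codeword, both in sorted order."""
    codewords = sorted(code)  # equal-length bitstrings sort in numerical order
    k = len(codewords).bit_length() - 1
    if 2 ** k != len(codewords):
        raise ValueError("the number of codewords must be a power of two")
    messages = [format(i, "0{}b".format(k)) for i in range(2 ** k)]
    return dict(zip(messages, codewords))


def encode(table, message):
    """Look up the codeword in the row for this message."""
    return table[message]


def detect_error(table, received):
    """Report an error exactly when the received string is not a codeword."""
    return received not in table.values()


def decode(table, received):
    """Return the message whose codeword is closest to the received string."""
    return min(table, key=lambda m: hamming_distance(table[m], received))


C = {"000000", "101010", "000111", "100001"}
table = build_table(C)
```

With this table, encode(table, "10") gives 100001, detect_error(table, "111110") reports an error, and decode(table, "111110") gives the message 11, because 101010 is the codeword closest to 111110.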
Example 4.2 (Encoding and decoding with a small code)
Recall the code {000000, 101010, 000111, 100001} from Example 4.1. Sorting the four codewords (and the messages from {0, 1}^2), we get the table in Figure 4.3.
Figure 4.2: The triangle inequality. The distance from x to y isn’t decreased by “stopping off” at z along the way. [Diagram: points x, y, and z, with sides labeled ∆(x, y), ∆(x, z), and ∆(z, y).]
Definition 4.2 (Codes, messages, and codewords)
A code is a set C ⊆ {0, 1}^n, where |C| = 2^k for some integer 1 ≤ k ≤ n. Any element of {0, 1}^k is called a message, and the elements of C are called codewords.
message   codeword
00        000000
01        000111
10        100001
11        101010
Figure 4.3: The message/codeword table for the code from Example 4.1.

For example, we encode the message 10 as the codeword 100001.
If we receive the bitstring 111110, we report “error” because 111110 is not in C.
To decode the received bitstring 111110, we see that ∆(111110, 000000) = 5, ∆(111110, 000111) = 4, ∆(111110, 100001) = 5, and ∆(111110, 101010) = 2. The last of these distances is smallest, so we would decode 111110 as the message 11 (corresponding to codeword 101010).
c′        ∆(c′, 000000)  ∆(c′, 000111)  ∆(c′, 100001)  ∆(c′, 101010)
000000∗   0  3  2  3
000001†   1  2  1  4
000010    1  2  3  2
000011    2  1  2  3
000100    1  2  3  4
000101    2  1  2  5
000110    2  1  4  3
000111∗   3  0  3  4
001000    1  4  3  2
001001    2  3  2  3
001010    2  3  4  1
001011    3  2  3  2
001100    2  3  4  3
001101    3  2  3  4
001110    3  2  5  2
001111    4  1  4  3
010000    1  4  3  4
010001    2  3  2  5
010010    2  3  4  3
010011    3  2  3  4
010100    2  3  4  5
010101    3  2  3  6
010110    3  2  5  4
010111    4  1  4  5
011000    2  5  4  3
011001    3  4  3  4
011010    3  4  5  2
011011    4  3  4  3
011100    3  4  5  4
011101    4  3  4  5
011110    4  3  6  3
011111    5  2  5  4
100000†   1  4  1  2
100001∗   2  3  0  3
100010    2  3  2  1
100011    3  2  1  2
100100    2  3  2  3
100101    3  2  1  4
100110    3  2  3  2
100111    4  1  2  3
101000    2  5  2  1
101001    3  4  1  2
101010∗   3  4  3  0
101011    4  3  2  1
101100    3  4  3  2
101101    4  3  2  3
101110    4  3  4  1
101111    5  2  3  2
110000    2  5  2  3
110001    3  4  1  4
110010    3  4  3  2
110011    4  3  2  3
110100    3  4  3  4
110101    4  3  2  5
110110    4  3  4  3
110111    5  2  3  4
111000    3  6  3  2
111001    4  5  2  3
111010    4  5  4  1
111011    5  4  3  2
111100    4  5  4  3
111101    5  4  3  4
111110    5  4  5  2
111111    6  3  4  3
Figure 4.4: The Hamming distance of every 6-bit string to all codewords from Example 4.1.
The danger in error detection is that we’re sent a codeword c ∈ C that’s corrupted into a bitstring c′, but we report “no error” because c′ ∈ C. (Note that we’re never wrong when we report “error.”) The danger in error correction is that we report another codeword c′′ ∈ C because c′ is closer to c′′ than it is to c. (As we’ll see soon, these dangers are really about Hamming distance between codewords: we might make a mistake if two codewords in C are too close together, relative to the number of errors.) Here are the precise definitions of error-detecting and error-correcting codes:
Definition 4.3 (Error-detecting and error-correcting codes)
Let C ⊆ {0, 1}n be a code, and let l ≥ 1 be any integer.
We say that C can detect l errors if, for any codeword c ∈ C and for any sequence of up to
l errors applied to c, we can correctly report “error” or “no error.”
The code C can correct l errors if, for any codeword c ∈ C and for any sequence of up to l
errors applied to c, we can correctly identify that c was the original codeword.
Here’s an example, for our small example code:
Example 4.3 (Error detection and correction in a small code)
Recall C = {000000, 101010, 000111, 100001} from Example 4.1. Figure 4.4 shows every bitstring x ∈ {0, 1}^6, and the Hamming distance between x and each codeword in C.
There are 24 single-bit errors that can happen to codewords in C: there are 4 choices of codeword, and, for each, 6 different one-bit errors that can occur:
no errors:   000000  101010  000111  100001
---------------------------------------------
one error:   100000  001010  100111  000001
             010000  111010  010111  110001
             001000  100010  001111  101001
             000100  101110  000011  100101
             000010  101000  000101  100011
             000001  101011  000110  100000
This code can detect one error, because the 24 bitstrings below the line are all different from the 4 bitstrings above the line; we can correctly report whether the bitstring in question is a codeword (no errors) or one of the 24 non-codewords (one error). Or, to state this fact in a different way: the four starred lines of Figure 4.4 corresponding to uncorrupted codewords are not within one error of any other codeword. On the other hand, C cannot detect two errors. If we receive the bitstring 000000, we can’t distinguish whether the original codeword was 000000 (and no errors occurred) or whether the original codeword was 100001 (and two errors occurred, in the first and last bits of 000000). (Receiving the bitstring 100001 creates the same problem.)
[Column headings of Figure 4.4: ∆(c′, 000000), ∆(c′, 000111), ∆(c′, 100001), ∆(c′, 101010).]

The code C also cannot correct even one error. Consider the bitstring 100000. We cannot distinguish (i) the original codeword was 000000 (and one error occurred) from (ii) the original codeword was 100001 (and one error occurred). Or, to state this fact differently: the two lines of Figure 4.4 marked with † are only one error away from two different codewords. (That is, 100000 appears twice in the list of 24 bitstrings below the line.)
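The conclusions of Example 4.3 can be reproduced by brute force. The Python sketch below models "up to l errors" as flipping between 1 and l distinct bit positions (an assumption; flipping the same bit twice just gives a case with fewer net errors) and assumes the table-lookup detector and nearest-codeword decoder described earlier. The names flip, corruptions, can_detect, and can_correct are hypothetical.

```python
from itertools import combinations


def dist(x, y):
    """Hamming distance between equal-length bitstrings."""
    return sum(a != b for a, b in zip(x, y))


def flip(word, positions):
    """Flip the bits of word at the given positions."""
    bits = list(word)
    for i in positions:
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)


def corruptions(word, max_errors):
    """All strings reachable from word by flipping 1..max_errors distinct bits."""
    for e in range(1, max_errors + 1):
        for positions in combinations(range(len(word)), e):
            yield flip(word, positions)


def can_detect(code, l):
    """Detection fails only if a corrupted codeword is itself a codeword."""
    return all(c2 not in code for c in code for c2 in corruptions(c, l))


def can_correct(code, l):
    """Correction needs the sent codeword to be the unique nearest codeword."""
    for c in code:
        for received in [c] + list(corruptions(c, l)):
            d_sent = dist(c, received)
            if any(o != c and dist(o, received) <= d_sent for o in code):
                return False
    return True


C = {"000000", "101010", "000111", "100001"}
detects_one = can_detect(C, 1)
detects_two = can_detect(C, 2)
corrects_one = can_correct(C, 1)
```

For this code the sketch reports that one error can be detected, two errors cannot, and one error cannot be corrected, in agreement with the example.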
4.2.2 Distance and Rate
Our goal with error-correcting codes is to ensure that the decoded message m′ is identical to the original message m, as long as there aren’t too many errors in the transmission. At a high level, we will achieve this goal by ensuring that the codewords in our code are all “very different” from each other. If every pair of distinct codewords c1 and c2 are far apart (in Hamming distance), then the closest codeword c to the received transmission c′ will correspond to the original message, even if “a few” errors occur. (We’ll quantify “very” and “a few” soon.)
Intuitively, this desire suggests adding a lot of redundancy to our codewords. But we must balance this desire for robustness against another desire that pulls in the opposite direction: we’d like to transmit a small number of bits (so that the number of “wasted” non-data bits is small). There’s a seeming trade-off between these two measures of the quality of a code: increasing error tolerance suggests making the codewords longer (so there’s room for them to differ more); increasing efficiency suggests making the codewords shorter (so there are fewer wasted bits). Let’s formally define both of these measures of code quality:
Definition 4.4 (Minimum distance)
The minimum distance of a code C is the smallest Hamming distance between two distinct codewords of C: that is, the minimum distance of C is min {∆(x, y) : x, y ∈ C and x ≠ y}.
(Quiz question: if we hadn’t restricted the minimum in this definition to be only over pairs such that x ≠ y, what would the minimum distance have been?)
Definition 4.5 (Rate)
The rate of a code C is the ratio between message length and codeword length. That is, if C is a code where |C| = 2^k and C ⊆ {0, 1}^n, then the rate of C is the ratio k/n.
Let’s compute the rate and minimum distance for our running example:
Example 4.4 (Distance and rate in a small code)
Recall the code C = {000000, 101010, 000111, 100001} from Example 4.1.
The minimum distance of C is 2, because ∆(000000, 100001) = 2. You can check Figure 4.4 (or see Figure 4.5) to see that no other pair of codewords is closer.
The rate of C is 2/6, because |C| = 4 = |{0, 1}^2|, and the codewords have length 6.
          000000  000111  100001  101010
000000    0       3       2       3
000111    3       0       3       4
100001    2       3       0       3
101010    3       4       3       0
Figure 4.5: The Hamming distance between codewords of C from Example 4.1.
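Definitions 4.4 and 4.5 can be computed directly for small codes. The following Python sketch does so for the running example; the function names minimum_distance and rate are illustrative, and the rate is returned as an exact fraction (so 2/6 appears in lowest terms as 1/3).

```python
from fractions import Fraction
from itertools import combinations


def dist(x, y):
    """Hamming distance between equal-length bitstrings."""
    return sum(a != b for a, b in zip(x, y))


def minimum_distance(code):
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(dist(x, y) for x, y in combinations(sorted(code), 2))


def rate(code):
    """Ratio k/n, where |code| = 2^k and codewords have length n."""
    n = len(next(iter(code)))
    k = len(code).bit_length() - 1
    if 2 ** k != len(code):
        raise ValueError("the number of codewords must be a power of two")
    return Fraction(k, n)


C = {"000000", "101010", "000111", "100001"}
d = minimum_distance(C)   # distance between 000000 and 100001
r = rate(C)               # 2/6, reduced to 1/3
```

Using combinations restricts the minimum to pairs of distinct codewords, which is exactly the restriction x ≠ y in Definition 4.4.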

Relating minimum distance and error detection/correction
We have now defined enough of the concepts that we can state a first nontrivial theorem, which characterizes the error-detecting and error-correcting capabilities of a code C in terms of the minimum distance of C. Here is the statement:
Theorem 4.1 (Relationship of minimum distance to detecting/correcting errors)
Let t ≥ 0 be any integer. If the minimum distance of a code C is 2t + 1, then C can detect 2t errors and correct t errors.
We’re now going to try to prove Theorem 4.1; that is, we’re going to try to generate a convincing argument that this statement is true. As with any statement that you try to prove, our first task is to understand what exactly the claim is saying. In this case, the theorem makes a statement about a generic nonnegative integer t and a generic code C. Plugging in particular values for t can help make the claim clearer:
• If the minimum distance of a code C is 9 (that is, the minimum distance is 2t + 1 for t = 4), then the claim says C can detect 2t = 2 · 4 = 8 errors and correct t = 4 errors.
• Suppose the minimum distance of C is 7. Writing 7 = 2t + 1 for t = 3, the claim states that C can detect 6 errors and correct 3 errors.
• If the minimum distance of C is 5, then C can detect 4 errors and correct 2 errors.
• If the minimum distance of C is 3, then C can detect 2 errors and correct 1 error.
• If the minimum distance of C is 1, then C can detect 0 errors and correct 0 errors.
Problem-solving tip: Step #1 in proving any claim is to understand what it’s saying! (You can’t persuade someone of something you don’t understand.) One good way to start to do so is by plugging particular values into the statement.
Problem-solving tip: Draw a picture to help you clarify/understand the statement you’re trying to prove.
Figure 4.6: If the minimum distance is 2t + 1, no codewords are within distance 2t of each other. [Diagram labels: c, 2t, 2t + 1.]
Now that we have a better sense of what the theorem says, let’s prove it:
Proof of Theorem 4.1. First we’ll prove the error-detection condition. We must argue for the following claim: if a code C has minimum distance 2t + 1, then C can detect 2t errors. In other words, for an arbitrary codeword c ∈ C and an arbitrary received bitstring c′ with ∆(c, c′) ≤ 2t, our error-detection algorithm must be correct. (If ∆(c, c′) > 2t, then we’re not obliged to correctly state that an error occurred, because we’re only arguing that we can detect 2t errors.) Recall that our error-detection algorithm reports “no error” if c′ ∈ C, and it reports “error” if c′ ∉ C. Thus:
• If ∆(c, c′) = 0, then no error occurred (because the received bitstring matches the transmitted one). In this case, our error-detection algorithm correctly reports “no error,” because c′ ∈ C (because c′ = c, and c was a codeword).
• On the other hand, suppose 1 ≤ ∆(c, c′) ≤ 2t, so an error occurred. The only way that we’d fail to detect the error is if the received bitstring c′ is itself another codeword. But this situation can’t happen, by the definition of minimum distance: for any codeword c ∈ C, the set {c′ : 1 ≤ ∆(c, c′) ≤ 2t} cannot contain any elements of C; otherwise the minimum distance of C would be 2t or smaller.
It may be helpful to think about this proof via Figure 4.6.
For the error-correction condition, suppose that x ∈ C is the transmitted codeword, and the received bitstring c′ satisfies ∆(x, c′) ≤ t. We have to persuade ourselves that x is the codeword closest to c′ in Hamming distance. Let y ∈ C − {x} be any other codeword. We’ll start from the triangle inequality, which tells us that ∆(x, y) ≤ ∆(x, c′) + ∆(c′, y) and therefore that ∆(c′, y) ≥ ∆(x, y) − ∆(x, c′), and prove that c′ is closer to x than it is to y:
∆(c′, y) ≥ ∆(x, y) − ∆(x, c′)     (triangle inequality)
         ≥ (2t + 1) − ∆(x, c′)    (∆(x, y) ≥ 2t + 1 by definition of minimum distance)
         ≥ (2t + 1) − t           (∆(x, c′) ≤ t by assumption)
         = t + 1
         > t
         ≥ ∆(x, c′).              (∆(x, c′) ≤ t by assumption)
[Figure 4.7 diagram labels: x, y, t, > t, 2t + 1.]
This chain of inequalities shows c′ is closer to x than it is to y. (Pedantically speaking, we’re also relying on the symmetry of Hamming distance here: ∆(c′, y) = ∆(y, c′). Again, see Exercise 4.6.) Because y was a generic codeword in C − {x}, we can conclude that the original codeword x is the one closest to c′. (See Figure 4.7.)
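Theorem 4.1 can also be sanity-checked by exhaustive search on tiny codes. The Python sketch below is a check under stated assumptions, not a substitute for the proof: the function name check_theorem_4_1 is hypothetical, the two test codes (the all-zeros and all-ones strings of length 3 and of length 5) are chosen only because they have odd minimum distance, and the search enumerates every string in {0, 1}^n, so it is practical only for small n.

```python
from itertools import combinations, product


def dist(x, y):
    """Hamming distance between equal-length bitstrings."""
    return sum(a != b for a, b in zip(x, y))


def check_theorem_4_1(code):
    """For a code with minimum distance 2t + 1, verify by brute force that
    up to 2t errors are detected and up to t errors are corrected."""
    n = len(next(iter(code)))
    d = min(dist(x, y) for x, y in combinations(sorted(code), 2))
    if d % 2 == 0:
        raise ValueError("the theorem is stated for odd minimum distance")
    t = (d - 1) // 2
    for c in code:
        for bits in product("01", repeat=n):
            received = "".join(bits)
            errors = dist(c, received)
            # Detection: a corrupted string within 2t errors is never a codeword.
            if 1 <= errors <= 2 * t and received in code:
                return False
            # Correction: within t errors, c is strictly the closest codeword.
            if errors <= t and any(
                dist(y, received) <= errors for y in code if y != c
            ):
                return False
    return True


small = check_theorem_4_1({"000", "111"})       # minimum distance 3, t = 1
larger = check_theorem_4_1({"00000", "11111"})  # minimum distance 5, t = 2
```

The second condition mirrors the chain of inequalities: it requires ∆(x, c′) < ∆(y, c′) for every other codeword y whenever ∆(x, c′) ≤ t.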
Before we move on from the theorem, let’s reflect a little bit on the proof. (We’ll concentrate on the error-correction half.) The most complicated part was unwinding the definitions in the theorem statement, in particular of “C has minimum distance 2t + 1” and “C can correct t errors.” Eventually, we had to argue for the claim
for every x ∈ C, y ∈ C − {x}, and c′ ∈ {0, 1}^n: if ∆(x, c′) ≤ t then ∆(x, c′) < ∆(y, c′).
(In other words, if c′ is within t errors of x, then c′ is closer to x than to any other codeword.) In the end, we were able to state the proof as a relatively simple sequence of inequalities.
After proving a theorem, it’s also worth briefly reflecting on what the theorem does not say. Theorem 4.1, for example, only addresses codes with a minimum distance that’s an odd number. You’ll be asked to consider the error-correcting and error-detecting properties of a code C with an even minimum distance in Exercise 4.13. We also didn’t show that we couldn’t do better: Theorem 4.1 says that a code C with minimum distance 2t + 1 can correct t errors, but the theorem doesn’t say that C can’t correct t + 1 (or more) errors. (But, in fact, it can’t; see Exercise 4.12.)
Outline of the remainder of the section
Intuitively, rate and minimum distance are measures of the inherent tension in an error-correcting code. A code that has a higher distance means that we are more robust to errors: the farther apart codewords are, the more corruption can occur before we’re unable to reconstruct the original message. A code that has a higher rate means that we are “wasting” fewer bits in providing this robustness: the larger the rate, the more our codeword contains “data” rather than “redundancy.” In the rest of this section, we’re going to prove several more theorems about error-correcting codes, exploring the trade-off between rate and distance. (But it’s also worth noting that it’s not a strict trade-off: sometimes we can improve in one measure without costing ourselves in the other!) And, as we go, we’ll continue to try to reflect on the proof techniques that we use to establish these claims.
Figure 4.7: If the minimum distance is 2t + 1, a bitstring within distance t of one codeword is more than t away from every other codeword.
Problem-solving tip: When you’re trying to prove a claim of the form p ⇒ q, try to massage p to look as much like q as possible. A good first step in doing so is to expand out the definitions of the premises, and then try to see what additional facts you can infer.
It is customary to mark the end of one’s proofs typographically; here, we’re using a traditional box symbol: □. Other people may write “QED,” short for the Latin phrase quod erat demonstrandum (“that which was to be demonstrated”).
Here are the three main theorems that we’ll prove in the rest of this section:
Theorem 4.2 (Good news)
There exists a code with 4-bit messages, minimum distance 3, and rate 1/3.
Theorem 4.3 (Better news)
There exists a code with 4-bit messages, minimum distance 3, and rate 4/7.
Theorem 4.4 (Bad news)
There does not exist a code with 4-bit messages, minimum distance 3, and any rate strictly better than 4/7.
Notice that the first two of these results say that a code with particular properties exists, while the third result says that it’s impossible to create a code with a different set of properties. Also notice that Theorem 4.3 is an improvement on Theorem 4.2: we’ve made the rate better (higher) without making the minimum distance worse. (When we can, we’ll prove more general versions of these theorems, too, not limited to 4-bit messages with minimum distance 3.)
We’ll prove Theorem 4.2 and Theorem 4.3 “by construction”: specifically, by building a code with the desired parameters. But, because Theorem 4.4 says that a code with certain properties fails to exist, we’ll prove the result with a proof by contradiction: we assume that a code with 4-bit messages with distance 3 and rate strictly better than 4/7 does exist, and reasoning logically from that assumption, we will derive a false statement (a contradiction). Because p ⇒ False ≡ ¬p, we can conclude that the assumption must have been false, and no such code can exist.
4.2.3 Repetition Codes
Intuitively, a good error-correcting code will amplify even a small difference between two different messages (a single differing bit) into a larger difference between the corresponding codewords. Perhaps the most obvious implementation of this idea is simply to encode a message m by repeating the bits of m several times. This idea gives rise to a simple error-correcting code, called the repetition code. (Actually, there are many different versions of the repetition code, depending on how many times we repeat m in the codeword.) Here’s the basic definition:
Definition 4.6 (Repetition code)
Let l ∈ Z_{≥2}. The Repetition_l code for k-bit messages consists of the codewords
{mm···m (l times) : m ∈ {0, 1}^k}.
That is, the codeword corresponding to a message m ∈ {0, 1}^k is the l-fold repetition of the message m, so each codeword is an element of {0, 1}^(kl).
Here are some small examples of encoding/decoding using repetition codes:
Example 4.5 (Some codewords for the repetition code)
If we encode the message 00111 using the Repetition_3 code, we get the codeword 00111 00111 00111. If we encode the same message using the Repetition_5 code, we get the codeword 00111 00111 00111 00111 00111.
For an example of decoding, suppose that we receive the (possibly corrupted) bitstring c′ = 0010 0110 0010 under the Repetition_3 code. We detect that an error occurred: c′ is not a codeword, because the only codewords are 12-bit strings where all three 4-bit thirds are identical. For error correction, note that the closest codeword to c′ is 0010 0010 0010, so we decode c′ as corresponding to the message 0010.
The message/codeword table for the Repetition_3 code for 4-bit messages is shown in Figure 4.8.
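Example 4.5 can be reproduced with a short Python sketch. It uses the generic nearest-codeword decoder from earlier in the section (searching all 2^k messages), not the faster method left to the exercises; the function names rep_encode, rep_is_codeword, and rep_decode_nearest are hypothetical.

```python
from itertools import product


def dist(x, y):
    """Hamming distance between equal-length bitstrings."""
    return sum(a != b for a, b in zip(x, y))


def rep_encode(message, l):
    """Codeword of the l-fold repetition code: the message repeated l times."""
    return message * l


def rep_is_codeword(received, k, l):
    """A string is a codeword exactly when all l blocks of k bits agree."""
    if len(received) != k * l:
        return False
    first = received[:k]
    return all(received[i * k:(i + 1) * k] == first for i in range(l))


def rep_decode_nearest(received, k, l):
    """Generic (slow) decoding: try every k-bit message, keep the closest."""
    messages = ("".join(bits) for bits in product("01", repeat=k))
    return min(messages, key=lambda m: dist(rep_encode(m, l), received))


codeword3 = rep_encode("00111", 3)
codeword5 = rep_encode("00111", 5)
received = "001001100010"                      # 0010 0110 0010 from the example
error_seen = not rep_is_codeword(received, 4, 3)
decoded = rep_decode_nearest(received, 4, 3)
```

Here error_seen is True and decoded is 0010, as in the example: the received string is one bit flip away from 0010 0010 0010.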
The distance and rate properties of the repetition code are relatively easy to see (from the definition or from this style of table):

    m      c
    0000   0000 0000 0000
    0001   0001 0001 0001
    0010   0010 0010 0010
    0011   0011 0011 0011
    0100   0100 0100 0100
    0101   0101 0101 0101
    0110   0110 0110 0110
    0111   0111 0111 0111
    1000   1000 1000 1000
    1001   1001 1001 1001
    1010   1010 1010 1010
    1011   1011 1011 1011
    1100   1100 1100 1100
    1101   1101 1101 1101
    1110   1110 1110 1110
    1111   1111 1111 1111

Figure 4.8: The Repetition_3 code for 4-bit messages.

Lemma 4.5 (Distance and rate of the repetition code)
The Repetition_l code has rate 1/l and minimum distance l.

Proof. Recall that the rate of a code is the ratio k/n, where k is the length of the messages and n is the length of the codewords. A k-bit message is encoded as a (kl)-bit codeword (l repetitions of k bits), and so the rate of this code is k/(kl) = 1/l.

For the minimum distance, consider any two distinct messages m, m′ ∈ {0,1}^k with m′ ≠ m. We know that m and m′ must differ in at least one bit position, say bit position i. (Otherwise m = m′.) But if m_i ≠ m′_i, then the codeword mm···m corresponding to m and the codeword m′m′···m′ corresponding to m′ (each with l repetitions) differ in at least one bit in each of the l “blocks” (in the ith position of the block), for a total of at least l differences. Furthermore, the Repetition_l encodings of the messages 000···0 and 100···0 differ in only l places (the first bit of each “block”). Thus the minimum distance of the Repetition_l code is exactly l. □

Lemma 4.5 says that the Repetition_3 code on 4-bit messages (see Figure 4.8) has minimum distance 3 and rate 1/3. Thus we’ve proven Theorem 4.2: we had to show that a code with these parameters exists, and we did so by explicitly building such a code. This proof is an example of a “proof by construction”: to show that an object with a particular property exists, we’ve explicitly built an object with that property.

Problem-solving tip: When you’re trying to prove a claim of the form ∃x : P(x), try using a proof by construction first. (There are other ways to prove an existential claim, but this approach is great when it’s possible.)

It’s also worth noticing that we started out by describing a generic way to do encoding and decoding for error-correcting codes in Section 4.2: after we build the table (like the one in Figure 4.8), we encode a message by finding the corresponding codeword in the table, and we decode a bitstring c′ by looking at every codeword and identifying the one closest to c′. For particular codes, we may be able to give a much more efficient algorithm, and, indeed, we can do so for repetition codes. See Exercise 4.21.

4.2.4 Hamming Codes

When we’re encoding 4-bit messages, the Repetition_3 code achieves minimum distance 3 with 12-bit codewords. (So its rate is 1/3.) But it turns out that we can do better by defining another, cleverer code: the Hamming code¹ maintains the same minimum distance, while improving the rate from 1/3 to 4/7.

The Hamming code, like the Hamming distance, is named after Richard Hamming, who invented this code in 1950. (He was frustrated that programs he started running on Friday nights often failed over the weekend because of a single bit error in memory.)

1. R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, XXIX(2):147–160, April 1950.

The basic idea of the Hamming code is to use an extra bit that, like the 16th digit of a credit card number, redundantly reports a value computed from the previous components of the message. Concretely, we could tack a single bit b onto the message m, where b reports the parity of m: that is, whether there are an even or odd number of bits set to 1 in m. If a single error occurs in the message, then b would be inconsistent with the message m, and we’d detect that error. (See Exercise 4.19.) In fact, for the Hamming code, we’ll use several different parity bits, corresponding to different subsets of the bits of m.

Definition 4.7 (Parity function)
The parity of a sequence ⟨a_1, a_2, ..., a_k⟩ of bits is denoted either parity(a_1, a_2, ..., a_k) or a_1 ⊕ a_2 ⊕ ··· ⊕ a_k, and its value is

    a_1 ⊕ a_2 ⊕ ··· ⊕ a_k := 1 if there are an odd number of i such that a_i = 1
                             0 if there are an even number of i such that a_i = 1.

(We could also have defined this function as parity(a_1, ..., a_k) := [∑_{i=1}^{k} a_i] mod 2.)

The parity of a and b can be denoted as a ⊕ b, because if you think of a, b ∈ {0, 1}, where True = 1 and False = 0, then parity(a, b) is the XOR of a and b.

Hamming’s insight was that it’s possible to achieve good error-correction properties by using three different parity bits, corresponding to different subsets of the message bits. It’s easiest to think of this code in terms of its encoding algorithm:

Definition 4.8 (Hamming code)
The Hamming code is defined via the following encoding function. We will encode a 4-bit message ⟨a, b, c, d⟩ as the following 7-bit codeword:

    ⟨ a, b, c, d, b⊕c⊕d, a⊕c⊕d, a⊕b⊕d ⟩,

where the first four components are the message bits and the last three are the parity bits.

Applying this encoding to every 4-bit message yields the table of messages and their corresponding codewords shown in Figure 4.9; here are a few examples in detail:

    m      c
    0000   0000000
    0001   0001111
    0010   0010110
    0011   0011001
    0100   0100101
    0101   0101010
    0110   0110011
    0111   0111100
    1000   1000011
    1001   1001100
    1010   1010101
    1011   1011010
    1100   1100110
    1101   1101001
    1110   1110000
    1111   1111111

Figure 4.9: The Hamming code for 4-bit messages.

Example 4.6 (Sample Hamming code encodings)

    message    codeword
    a,b,c,d    a,b,c,d,(b⊕c⊕d),(a⊕c⊕d),(a⊕b⊕d)
    0,0,0,0    0,0,0,0,(0⊕0⊕0),(0⊕0⊕0),(0⊕0⊕0) = 0000000
    1,0,0,0    1,0,0,0,(0⊕0⊕0),(1⊕0⊕0),(1⊕0⊕0) = 1000011
    1,1,1,0    1,1,1,0,(1⊕1⊕0),(1⊕1⊕0),(1⊕1⊕0) = 1110000.
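The encoding rule of Definition 4.8 is short enough to write out directly. The following Python sketch is one way to do so; the names parity and hamming_encode, and the choice of representing bitstrings as Python strings, are assumptions made for illustration only.

```python
from itertools import product


def parity(*bits: int) -> int:
    """Parity (XOR) of the given bits: 1 if an odd number are 1, else 0."""
    return sum(bits) % 2


def hamming_encode(message: str) -> str:
    """Encode a 4-bit message abcd as the 7-bit codeword of Definition 4.8."""
    if len(message) != 4 or set(message) - {"0", "1"}:
        raise ValueError("expected a 4-bit string such as '1011'")
    a, b, c, d = (int(ch) for ch in message)
    parity_bits = (parity(b, c, d), parity(a, c, d), parity(a, b, d))
    return message + "".join(str(p) for p in parity_bits)


# Rebuild the whole message/codeword table (16 rows).
table = {
    "".join(m): hamming_encode("".join(m)) for m in product("01", repeat=4)
}

print(hamming_encode("1000"))  # 1000011
print(hamming_encode("1110"))  # 1110000
```

The dictionary table should agree row by row with the message/codeword table for the Hamming code.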
(We could have described encoding for the Hamming code using matrix multiplication instead; see Exercises 2.221–2.223.)

Before we analyze the rate and minimum distance of the Hamming code, let’s start to develop some intuition by looking at a few received (possibly corrupted) codewords. (We’ll also begin to work out an efficient decoding algorithm as we go.)

Example 4.7 (Some Hamming code decoding problems)
Problem: You receive the following (possibly corrupted) Hamming code codewords. Find the original message, assuming at most one error occurred in transmission.

1. 0000010
2. 1000000
3. 1011010
4. 1110111

Solution:

1. We’ve received message bits 0000 and parity bits 010. Everything in the received codeword is consistent with the message being m = 0000, except for the second parity bit. So we infer that the second parity bit was corrupted, the transmitted codeword was 0000000, and the message was 0000.

Could there have been a one-bit error in message bits instead? No: these parity bits are consistent only with a message ⟨a, b, c, d⟩ with a ≠ b (because the first two received parity bits differ), and therefore with d = 1 (because a ≠ b implies that a ⊕ b ⊕ d = 1 ⊕ d = ¬d, and the third parity bit a ⊕ b ⊕ d is 0). But 10?1 and 01?1 are both at least two errors away from the received message 0000.

2. We’ve received message bits 1000 and parity bits 000. If the message bits were uncorrupted, then the correct parity bits would have been 011. But then we would have to have suffered two transmission errors in the parity bits, and we’re assuming that at most one error occurred. Thus the error is in the message bits; the original message is 0000, and the first bit of the message was corrupted.

3. The parity bits for the message 1011 are indeed 010, so 1011010 is itself a legal codeword for the message 1011, and no errors occurred at all.

4. These received bits are consistent with the message 1111 with parity bits 111, where the fourth bit of the message was flipped.
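The reasoning used in Example 4.7 can be mechanized. The sketch below recomputes the parity bits from the received message bits, compares them with the received parity bits, and flips the single bit that the mismatch pattern points to. The lookup table ERROR_POSITION is derived here from Definition 4.8 (which message bits feed which parity bits); its name and layout, like the function names, are illustrative assumptions, and the code assumes at most one bit was flipped.

```python
def parity_bits(a: int, b: int, c: int, d: int) -> tuple[int, int, int]:
    """The three Hamming parity bits for message bits a, b, c, d."""
    return ((b + c + d) % 2, (a + c + d) % 2, (a + b + d) % 2)


# Which received bit to flip, keyed by the pattern of mismatched parity
# bits (1 = mismatch). Indices 0-3 are message bits a-d; indices 4-6 are
# the three parity bits.
ERROR_POSITION = {
    (0, 0, 0): None,  # nothing looks wrong
    (1, 0, 0): 4,     # only parity bit #1 looks wrong
    (0, 1, 0): 5,     # only parity bit #2 looks wrong
    (0, 0, 1): 6,     # only parity bit #3 looks wrong
    (0, 1, 1): 0,     # a appears in parity bits #2 and #3
    (1, 0, 1): 1,     # b appears in parity bits #1 and #3
    (1, 1, 0): 2,     # c appears in parity bits #1 and #2
    (1, 1, 1): 3,     # d appears in all three parity bits
}


def hamming_decode(received: str) -> str:
    """Recover the 4-bit message, assuming at most one bit was flipped."""
    if len(received) != 7 or set(received) - {"0", "1"}:
        raise ValueError("expected a 7-bit string such as '1011010'")
    bits = [int(ch) for ch in received]
    expected = parity_bits(*bits[:4])
    mismatch = tuple(e ^ r for e, r in zip(expected, bits[4:]))
    position = ERROR_POSITION[mismatch]
    if position is not None:
        bits[position] ^= 1
    return "".join(str(bit) for bit in bits[:4])


for word in ["0000010", "1000000", "1011010", "1110111"]:
    print(word, "->", hamming_decode(word))
```

Run on the four received strings of Example 4.7, this should reproduce the messages found by hand: 0000, 0000, 1011, and 1111.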
From this example, the basic approach to decoding the Hamming code should start to coalesce. Briefly, we compute what the parity bits should have been, supposing that the received message bits (the first four bits of the received codeword) are correct; comparing the computed parity bits to the received parity bits allows us to deduce which, if any, of the transmitted bits were erroneous. (More on efficient decoding later.)

Recall that, for a message a, b, c, d, the bits of the uncorrupted codeword are:
1. a
2. b
3. c
4. d
5. b⊕c⊕d
6. a⊕c⊕d
7. a⊕b⊕d

Why does this approach to decoding work? (And, relatedly, why were the parity bits of the Hamming code chosen the way that they were?) Here are two critical properties in the Hamming code’s parity bits:

• every message bit appears in at least two parity bits. Thus any error in a received parity bit is distinguishable from an error in a received message bit: an erroneous message bit will cause at least two parity bits to look wrong; an erroneous parity bit will cause only that one parity bit to look wrong.

• no two message bits appear in precisely the same set of parity bits. Thus any error in a received message bit has a different “signature” of wrong-looking parity bits: an error in bit a affects parity bits #2 and #3; b affects parity bits #1 and #3; c affects #1 and #2; and d affects all three parity bits. Because all four of these signatures are different, we can distinguish which message bit was corrupted based on which set of two or more parity bits look wrong.

Rate and minimum distance of the Hamming code

Let’s use the intuition that we’ve developed so far to establish the rate and minimum distance for the Hamming code:

Lemma 4.6 (Distance and rate of the Hamming code)
The Hamming code has rate 4/7 and minimum distance 3.

Proof. The rate is straightforward to compute: we have 4-bit messages and 7-bit codewords, so the rate is 4/7 by definition.

There are several ways to convince yourself that the minimum distance is 3. Perhaps the simplest way (though certainly the most tedious) is to compute the Hamming distance between each pair of codewords in Figure 4.9. (There are only 16 codewords, so we just have to check that all (16 · 15)/2 = 120 pairs of distinct codewords have Hamming distance at least three.) You’ll write a program to verify this claim in Exercise 4.24. But here’s a different argument.

Consider any two distinct messages m ∈ {0,1}^4 and m′ ∈ {0,1}^4. We must establish that the codewords c and c′ associated with m and m′ satisfy ∆(c, c′) ≥ 3. We’ll argue for this fact by looking at three separate cases, depending on ∆(m, m′):

Case I: ∆(m, m′) ≥ 3. Then we’re done immediately: the message bits of c and c′ differ in at least three positions (even without looking at the parity bits).

Case II: ∆(m, m′) = 2. Then at least one of the three parity bits contains one of the bit positions where m_i ≠ m′_i but not the other. (This fact follows from the second crucial property above, that no two message bits appear in precisely the same set of parity bits.) Therefore this parity bit differs in c and c′. Thus there are two message bits and at least one parity bit that differ, so ∆(c, c′) ≥ 3.

Case III: ∆(m, m′) = 1. Then at least two of the three parity bits contain the bit position where m_i ≠ m′_i. (This fact follows from the first crucial property above, that every message bit appears in at least two parity bits.) Thus there are at least two parity bits and one message bit that differ, and ∆(c, c′) ≥ 3.

Note that ∆(m, m′) must be 1, 2, or ≥ 3 (it can’t be zero because m ≠ m′), so, no matter what ∆(m, m′) is, we’ve established that ∆(c, c′) ≥ 3. Because, for the codewords corresponding to messages 0000 and 1110, we have ∆(0000000, 1110000) = 3, the minimum distance is in fact exactly equal to three. □

Lemma 4.6 says that the Hamming code encodes 4-bit messages with minimum distance 3 and rate 4/7; thus we’ve proven Theorem 4.3. Let’s again reflect a little on the proof. Our proof of the minimum distance in Lemma 4.6 was a proof by cases: we divided pairs of codewords into three different categories (differing in 1, 2, or ≥ 3 bits), and then used three different arguments to show that the corresponding codewords differed in ≥ 3 places. So we showed that the desired distance property was true in all three cases and, crucially, that one of the cases applies for every pair of codewords.

Problem-solving tip: If you discover that a proposition seems true “for different reasons” in different circumstances (and those circumstances seem to cover all possible scenarios!), then a proof by cases may be a good strategy to employ.

Although we’re mostly omitting any discussion of the efficiency of encoding and decoding, it’s worth a brief mention here. (The speed of these algorithms is a big deal for error-correcting codes used in practice!) The algorithm for decoding under the Hamming code is suggested by Figure 4.10: we calculate what the parity bits would have been if the received message bits were uncorrupted, and identify which received parity bits don’t match those calculated parity bits. Figure 4.10 tells us what inference to draw from each constellation of mismatched parity bits.

    parity bit #1   parity bit #2   parity bit #3   location of error
    (b⊕c⊕d)         (a⊕c⊕d)         (a⊕b⊕d)
                                                    no error!
    ✗                                               parity #1
                    ✗                               parity #2
                                    ✗               parity #3
    ✗               ✗                               bit c
    ✗                               ✗               bit b
                    ✗               ✗               bit a
    ✗               ✗               ✗               bit d

Figure 4.10: Decoding the Hamming code. We conclude that the stated error occurred if the received parity bits and those calculated from the received message bits mismatch in the listed places.

Why does this decoding algorithm allow us to correct any single error? First, a low-level answer: the Hamming code has a minimum distance of 3 = 2 · 1 + 1, so Theorem 4.1 tells us that we can correct up to one error. So we know that a decoding scheme is possible. At a higher level, the reason that this decoding procedure works properly is that there are eight possible “≤ 1 error” corruptions of a codeword x (namely one 0-error string, x itself, and seven 1-error strings, one corresponding to an error in each of the seven bit positions of x), and furthermore there are eight different subsets of the three parity bits that can be “wrong.” The Hamming code works by carefully selecting the parity bits in a way that each of these eight bitstrings corresponds to a different one of the eight parity-bit subsets. In Exercises 4.25–4.28, you’ll explore longer versions of the Hamming code (with longer messages and more parity bits) with the same relationship.

Taking it further: As we’ve said, our attention here is mostly on the proofs and the proof techniques that we’ve used to establish the claims in this section, rather than on error-correcting codes themselves. But see p. 418 for an introduction to Reed–Solomon codes, the basis of the error-correcting codes used in CDs/DVDs (among other applications).

4.2.5 Upper Bounds on Rates

In the last two sections, we’ve constructed two different codes, both for 4-bit messages with minimum distance 3: the repetition code (rate 4/12) and the Hamming code (rate 4/7). Because the message lengths and minimum distances match, and because higher rates are better, the Hamming code is better. Here we’ll consider whether we can improve the rate further, while still encoding 4-bit messages with minimum distance 3. (In other words, can we make the codewords shorter than 7 bits?) The answer turns out to be “no,” and we’ll prove that it’s impossible.

“Balls” around codewords

We’ll start by thinking about “balls” around codewords in a general code. (The ball of radius r around x ∈ {0,1}^n is the set {x′ : ∆(x, x′) ≤ r}; that is, the set of all points that are within Hamming distance r of x.) Here’s a first observation:
Lemma 4.7 (The size of a ball of radius 1 in {0,1}^n)
Let x ∈ {0,1}^n, and define X := {x′ ∈ {0,1}^n : ∆(x, x′) ≤ 1}. Then |X| = n + 1.

Proof. The bitstring x itself is an element of X, as are all bitstrings x′ that differ from x in exactly one position. There are n such strings x′: one that is x with the first bit flipped; one that is x with the second bit flipped; . . .; and one that is x with the nth bit flipped. Thus there are 1 + n total bitstrings in X. □

Here’s a second useful fact about these balls: in a code C, the balls around codewords (of radius related to the minimum distance of C) cannot overlap.

Lemma 4.8 (Balls around codewords are disjoint)
Let C ⊆ {0,1}^n be a code with minimum distance 2t + 1. For distinct codewords x, y ∈ C, the sets {x′ ∈ {0,1}^n : ∆(x, x′) ≤ t} and {y′ ∈ {0,1}^n : ∆(y, y′) ≤ t} are disjoint.

Proof. Suppose not: that is, suppose that the sets X := {x′ ∈ {0,1}^n : ∆(x, x′) ≤ t} and Y := {y′ ∈ {0,1}^n : ∆(y, y′) ≤ t} are not disjoint. We will derive a contradiction from this assumption; that is, a statement that can’t possibly be true. Thus we’ll have proven that X ∩ Y ≠ ∅ ⇒ False, which allows us to conclude that X ∩ Y = ∅, because ¬p ⇒ False ≡ p. That is, we’re using a proof by contradiction.

To start again from the beginning: suppose that X and Y are not disjoint. That is, suppose that there is some bitstring z ∈ {0,1}^n such that z ∈ X and z ∈ Y. In other words, by definition of X and Y, there is a bitstring z ∈ {0,1}^n such that ∆(x, z) ≤ t and ∆(y, z) ≤ t. But if ∆(x, z) ≤ t and ∆(y, z) ≤ t, then, by the triangle inequality, we know

    ∆(x, y) ≤ ∆(x, z) + ∆(z, y) ≤ t + t = 2t.

Therefore ∆(x, y) ≤ 2t, but then we have two distinct codewords x, y ∈ C with ∆(x, y) ≤ 2t. This condition contradicts the assumption that the minimum distance of C is 2t + 1. (See Figure 4.11.) □
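Lemmas 4.7 and 4.8 can be spot-checked on the Hamming code, which has n = 7 and minimum distance 3 = 2 · 1 + 1, so t = 1. The following Python sketch is only a sanity check on one code, not a substitute for the proofs; the helper names ball_radius_1 and hamming_codewords are illustrative assumptions.

```python
from itertools import combinations, product


def ball_radius_1(x: str) -> set[str]:
    """x itself plus every string obtained by flipping exactly one bit of x."""
    flip = {"0": "1", "1": "0"}
    neighbors = {x[:i] + flip[x[i]] + x[i + 1:] for i in range(len(x))}
    return {x} | neighbors


def hamming_codewords() -> list[str]:
    """The 16 codewords of the Hamming code of Definition 4.8."""
    words = []
    for a, b, c, d in product((0, 1), repeat=4):
        bits = (a, b, c, d, (b + c + d) % 2, (a + c + d) % 2, (a + b + d) % 2)
        words.append("".join(map(str, bits)))
    return words


codewords = hamming_codewords()
balls = {x: ball_radius_1(x) for x in codewords}

# Lemma 4.7: every ball of radius 1 in {0,1}^7 has 7 + 1 = 8 elements.
print(all(len(s) == 8 for s in balls.values()))  # True

# Lemma 4.8 with t = 1: balls around distinct codewords never overlap.
print(all(balls[x].isdisjoint(balls[y])
          for x, y in combinations(codewords, 2)))  # True
```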
Figure 4.11: If the minimum distance is 2t + 1, the “balls” of radius t around each codeword are disjoint.

We could have used Lemma 4.8 to establish the error-correction part of Theorem 4.1 (a bitstring corrupted by ≤ t errors from a codeword c is closer to c than to any other codeword), but here we’ll use it, plus Lemma 4.7, to establish an upper bound on the rate of codes. But, first, let’s pause to look at a similar argument in a different (but presumably more familiar) domain: normal Euclidean geometry.

Problem-solving tip: When you’re facing a problem in a less familiar domain, try to find an analogous problem in a different, more familiar setting to help gain intuition.

In a circle-packing problem, we are given an enclosing shape, and we’re asked to place (“pack”) as many nonoverlapping unit circles (of radius 1) into that shape as possible. (Sphere packing, which is what grocers have to do with oranges, is the 3-dimensional analogue.) How many unit circles can we fit into a 6-by-6 square, for example? (See Figure 4.12.) Here’s an argument that it’s at most 11: a unit circle has area π · 1^2 = π, and the 6-by-6 square has area 36; thus we certainly can’t fit more than 36/π ≈ 11.459 nonoverlapping circles into the square. There isn’t room for 12. (In fact, we can’t even fit 10, because the circles won’t nestle together without wasting space “in between.” Thus, in this case we’d say that the area-based bound is loose.)

Figure 4.12: Circles packed in a square.

Using packing arguments to derive bounds on error-correcting codes

Now, let’s return to error-correcting codes, and use the circle-packing intuition (and the last two lemmas) to prove a bound on the number of n-bit codewords that can “fit” into {0,1}^n with minimum distance 3:

Lemma 4.9 (The “sphere-packing bound”: distance-3 version)
Let C ⊆ {0,1}^n be a code with minimum distance three. Then |C| ≤ 2^n/(n + 1).

Proof. For each x ∈ C, let S_x := {x′ ∈ {0,1}^n : ∆(x′, x) ≤ 1} be the ball of radius 1 around x. Lemma 4.7 says that |S_x| = n + 1 for each x. Further, Lemma 4.8 says that every element of {0,1}^n is in at most one S_x because the balls are disjoint. Therefore,

    |{x′ ∈ {0,1}^n : x′ is in one of the S_x balls}| = ∑_{x∈C} |S_x| = ∑_{x∈C} (n + 1) = |C| · (n + 1).

Also observe that every element of any S_x is an n-bit string. There are only 2^n different n-bit strings, so therefore

    |{x′ ∈ {0,1}^n : x′ is in one of the S_x balls}| ≤ 2^n.

Putting together these two facts, we see that |C| · (n + 1) ≤ 2^n. Solving for |C| yields the desired relationship: |C| ≤ 2^n/(n + 1). □

Corollary 4.10 (The Hamming code is optimal)
Any code with messages of length 4 and minimum distance 3 has codewords of length ≥ 7. (Thus the Hamming code has the best possible rate among all such codes.)

Proof. By Lemma 4.9, we know that |C| ≤ 2^n/(n + 1). With 4-bit messages we have |C| = 16, so we know that 16 ≤ 2^n/(n + 1), or, equivalently, that 2^n ≥ 16(n + 1). And 2^7 = 128 = 16(7 + 1), while for any n < 7 this inequality does not hold. (For example, with n = 6 we have 2^6 = 64 < 112 = 16 · 7.) □

Corollary 4.10 implies Theorem 4.4, so we’ve now proven the three claims that we set out to establish. Before we close, though, we’ll mention a few extensions. Lemma 4.8 was general, for any code with an odd minimum distance. But Lemma 4.7 was specifically about codes with minimum distance 3. To generalize the latter lemma, we’d need techniques from counting (see Chapter 9, specifically Section 9.4).

Another interesting question: when is the bound from Lemma 4.9 exactly achievable? If we have k-bit messages, n-bit codewords, and minimum distance 3, then Lemma 4.9 says that 2^k ≤ 2^n/(n + 1), or, taking logs, that k ≤ n − log_2(n + 1). Because k has to be an integer, this bound is exactly achievable only when n + 1 is an exact power of two. (For example, if n = 9, this bound requires us to have 2^k ≤ 2^9/10 = 512/10 = 51.2. In other words, we need k ≤ log_2 51.2 ≈ 5.678. But, because k ∈ Z, in fact we need k ≤ 5. That means that this bound is not exactly achievable for n = 9.) However, it’s possible to give a version of the Hamming code for n = 15 and k = 11 with minimum distance 3, as you’ll show in Exercise 4.26. (In fact, there’s a version of the Hamming code for any n = 2^l − 1; see Exercise 4.28.)

Computer Science Connections

Reed–Solomon Codes

The error-correcting codes that are used in CDs and DVDs are a bit more complicated than Repetition or Hamming codes, but they perform better. We’ll leave out a lot of the details, but here is a brief sketch of how they work. These codes are called Reed–Solomon codes, and they’re based on polynomials and modular arithmetic. (Reed–Solomon codes are named after Irving Reed and Gustave Solomon, 20th-century American mathematicians who invented them in 1960.)

First, we’re going to go beyond bits, to a larger “alphabet” of characters in our messages and codewords: instead of encoding messages from {0,1}^k, we’re going to encode messages from {0, 1, . . . , q}^k, for some integer q. Here’s the basic idea: given a message m = ⟨m_1, m_2, ..., m_k⟩, we will define a polynomial p_m(x) as follows, with the coefficients of the polynomial corresponding to the characters of the message:

    p_m(x) := ∑_{i=1}^{k} m_i x^i.

To encode the message m, we will evaluate the polynomial for several values of x:

    encode(m) := ⟨p_m(1), p_m(2), . . . , p_m(n)⟩.

See Figure 4.13 for an example.

Suppose that we use a k-character message and an n-character output. It’s easy enough to compute that the rate is k/n. But what about the minimum distance? Consider two distinct messages m and m′. Note that p_m and p_m′ are both polynomials of degree at most k. Therefore f(x) := p_m(x) − p_m′(x) is a polynomial of degree at most k, too, and f(x) ≢ 0, because m ≠ m′. Notice that {x : f(x) = 0} = {x : p_m(x) = p_m′(x)}. And |{x : f(x) = 0}| ≤ k, by Lemma 2.3 (“degree-k polynomials have at most k roots”). Therefore |{x : f(x) = 0} ∩ {1, 2, ..., n}| ≤ k: there are at most k values x for which p_m(x) = p_m′(x).
We encoded m and m′ by evaluating p_m and p_m′ on n different inputs, so there are at least n − k inputs on which these two polynomials disagree. Thus the minimum distance is at least n − k. For example, if we pick n = 2k, then we achieve rate 1/2 and minimum distance k.

How might we decode Reed–Solomon codes? Efficient decoding algorithms rely on some results from linear algebra, but the basic idea is to find the degree-k polynomial that goes through as many of the given points as possible. As a simple example, suppose you’re looking for a 2-character message (that is, something encoded as a quadratic), and you receive the codeword ⟨2, 6, 12, 13, 30, 42⟩. What was the original message? Plot the codeword and see! See Figure 4.14: all but one of the components of the received codeword are consistent with the polynomial p_m(x) = x + x^2, so you can decode this codeword as the message ⟨1, 1⟩.

We’ve left out several important details of actual Reed–Solomon codes here. One is that our computation of the rate was misleading: we only counted the number of slots, rather than the “size” of those slots. (Figure 4.13 shows that the numbers can get pretty big!) In real Reed–Solomon codes, every value is stored modulo a prime. See p. 731 for discussion of how (and why) this fix works. There’s also a clever trick used in the physical layout of the encoded information on a CD/DVD: the bits for a particular codeword are spread out over the disc, so that a single physical scratch doesn’t cause errors all to occur in the same codeword.

Figure 4.13: An example Reed–Solomon encoding.

Consider the message m = ⟨1, 3, 2⟩. Then p_m(x) = x + 3x^2 + 2x^3. If we choose n = 6, then the encoding of this message will be

    ⟨1(1) + 3(1)^2 + 2(1)^3, 1(2) + 3(2)^2 + 2(2)^3, 1(3) + 3(3)^2 + 2(3)^3,
     1(4) + 3(4)^2 + 2(4)^3, 1(5) + 3(5)^2 + 2(5)^3, 1(6) + 3(6)^2 + 2(6)^3⟩
    = ⟨6, 30, 84, 180, 330, 546⟩.

Alternatively, consider the message m′ = ⟨3, 0, 3⟩. Then p_m′(x) = 3x + 3x^3. Again for n = 6, the encoding of m′ is

    ⟨3(1) + 3(1)^3, 3(2) + 3(2)^3, 3(3) + 3(3)^3,
     3(4) + 3(4)^3, 3(5) + 3(5)^3, 3(6) + 3(6)^3⟩
    = ⟨6, 30, 90, 204, 390, 666⟩.

Figure 4.14: Decoding a received (corrupted) Reed–Solomon codeword. (Plot of the received values against x = 1, ..., 6.)

4.2.6 Exercises

The algorithm for testing whether a given credit-card number is valid is shown in Figure 4.15.

    cc-check(n):
    Input: a 16-digit credit-card number n ∈ {0, 1, . . . , 9}^16
    1: sum := 0
    2: for i = 1, 2, ..., 16:
    3:     if i is odd then
    4:         d_i := 2 · n_i
    5:     else
    6:         d_i := n_i
    7:     Increase sum by the ones’ and tens’ digits of d_i. (That is, sum := sum + (d_i mod 10) + ⌊d_i/10⌋.)
    8: return True if sum mod 10 = 0, and False otherwise.

Figure 4.15: An algorithm for testing the validity of credit-card numbers.

Here’s an example of the calculation that cc-check(4471 8329 · · · ) performs:

    (original number)               4   4   7     1   8     3   2   9 ...
    (odd-indexed digits doubled)    8   4   14    1   16    3   4   9 ...
    (digits summed)                 8 + 4 + 1+4 + 1 + 1+6 + 3 + 4 + 9 ...

(Try executing cc-check from Figure 4.15 on a few credit-card numbers, to make sure that you’ve understood the algorithm correctly.) This code can detect any one substitution error, because 0, 2, 4, 6, 8, 1 = 1+0, 3 = 1+2, 5 = 1+4, 7 = 1+6, 9 = 1+8 are all distinct (so, even in odd-indexed digits, changing the digit changes the overall value of sum).

4.1 (programming required) Implement cc-check in a programming language of your choice. Extend your implementation so that, if it’s given any 16-digit credit/debit-card number with a single digit replaced by a "?", it computes and outputs the correct missing digit.

4.2 Suppose that we modified cc-check so that, instead of adding the ones digit and (if it exists) the tens digit to sum in Line 7 of the algorithm, we instead simply added the ones digit. (That is, replace Line 7 by sum := sum + d_i.) Does this modified code still allow us to detect any single substitution error?

4.3 Suppose that we modified cc-check so that, instead of doubling odd-indexed digits in Line 4 of the algorithm, we instead tripled the odd-indexed digits. (That is, replace Line 4 by d_i := 3 · n_i.) Does this modified code still allow us to detect any single substitution error?

4.4 What if we replace Line 4 by d_i := 5 · n_i?

4.5 There are simpler schemes that can detect a single substitution error than the one in cc-check: for example, we could simply ensure that the sum of all the digits themselves (undoubled) is divisible by 10. (Just skip the doubling step.) The credit-card encoding system includes the more complicated doubling step to help it detect a different type of error, called a transposition error, where two adjacent digits are recorded in reverse order. (If two digits are swapped, then the “wrong” digit is multiplied by two, and so this kind of error might be detectable.) Does cc-check detect every possible transposition error?

A metric space consists of a set X and a function d : X × X → R≥0, called a distance function, where d obeys the following three properties:
• reflexivity: for any x and y in X, we have d(x, x) = 0, and d(x, y) ≠ 0 whenever x ≠ y.
• symmetry: for any x, y ∈ X, we have d(x, y) = d(y, x).
• triangle inequality: for any x, y, z ∈ X, we have d(x, y) ≤ d(x, z) + d(z, y).
When it satisfies all three conditions, we call the function d a metric.

4.6 In this section, we’ve been measuring the distance between bitstrings using the Hamming distance, which is a function ∆ : {0,1}^n × {0,1}^n → Z≥0, denoting the number of positions in which x and y differ. Prove that ∆ is a metric. (Hint: think about one bit at a time.)

The next few exercises propose a different distance function d : {0,1}^n × {0,1}^n → Z≥0. For each, decide whether you think the given function d is a metric or not, and prove your answer. (In other words, prove that d satisfies reflexivity, symmetry, and the triangle inequality; or prove that d fails to satisfy one or more of these properties.)
4.7 For x, y ∈ {0,1}^n, define d(x, y) as the smallest i ∈ {0, 1, ..., n} such that x_{i+1,...,n} = y_{i+1,...,n}. For example, d(01000, 10101) = 5 and d(01000, 10100) = 3 and d(01000, 10000) = 2 and d(11010, 01010) = 1. (This function measures how far into x and y we must go before the remaining parts match; we could also define d(x, y) as the largest i ∈ {0, 1, ..., n} such that x_i ≠ y_i, where we treat x_0 ≠ y_0.) Is d a metric?

4.8 For x, y ∈ {0,1}^n, define d(x, y) as the length of the longest consecutive run of differing bits in corresponding positions of x and y; that is, d(x, y) := max{j − i : for all k = i, i+1, ..., j we have x_k ≠ y_k}. For example, d(01000, 10101) = 3 and d(00100, 01010) = 3 and d(01000, 10000) = 2 and d(11010, 01000) = 1. Is d a metric?

4.9 For x, y ∈ {0,1}^n, define d(x, y) as the difference in the number of ones that appear in the two bitstrings; that is, d(x, y) := | |{i : x_i = 1}| − |{i : y_i = 1}| |. (The vertical bars here are a little confusing: the bars around |{i : x_i = 1}| and |{i : y_i = 1}| denote set cardinality, while the outer vertical bars denote absolute value.) For example, d(01000, 10101) = |1 − 3| = 2 and d(01000, 10100) = |1 − 2| = 1 and d(01000, 10000) = |1 − 1| = 0 and d(11010, 01010) = |2 − 2| = 0. Is d a metric?

4.10 The distance version of the Sørensen index (a.k.a. the Dice coefficient) defines the distance based on the fraction of ones in x or y that are in the same positions. Specifically,

    d(x, y) := 1 − (2 ∑_i x_i · y_i) / (∑_i x_i + y_i).

For example, d(01000, 10101) = 1 − (2·0)/(1+3) = 1 − 0/4 = 1 and d(00100, 01110) = 1 − (2·1)/(1+3) = 1 − 2/4 = 1/2 and d(01000, 11000) = 1 − (2·1)/(1+2) = 1 − 2/3 = 1/3 and d(11010, 01010) = 1 − (2·2)/(3+2) = 1 − 4/5 = 1/5. Is d a metric?

The Sørensen/Dice measure is named after independent work by two ecologists from the 1940s, the Danish botanist Thorvald Sørensen and the American mammalogist Lee Raymond Dice.

4.11 For x, y ∈ {0,1}^n, define d(x, y) as the difference in the numbers that are represented by the two strings in binary. Writing this function formally is probably less helpful (particularly because the higher powers of 2 have lower indices), but here it is: d(x, y) := | ∑_{i=1}^{n} x_i · 2^(n−i) − ∑_{i=1}^{n} y_i · 2^(n−i) |. For example, d(01000, 10101) = |8 − 21| = 13 and d(01000, 10100) = |8 − 20| = 12 and d(01000, 10000) = |8 − 16| = 8 and d(11010, 01010) = |26 − 10| = 16. Is d a metric?

4.12 Show that we can’t improve on the parameters in Theorem 4.1: for any integer t ≥ 0, prove that a code with minimum distance 2t + 1 cannot correct t + 1 or detect 2t + 1 errors.

4.13 Theorem 4.1 describes the error-detecting and error-correcting properties for a code whose minimum distance is any odd integer. This exercise asks you to give the analogous analysis for a code whose minimum distance is any even integer. Let t ≥ 1 be any integer, and let C be a code with minimum distance 2t. Determine how many errors C can detect and correct, and prove your answers.

Let c ∈ {0,1}^n be a codeword. Until now, we’ve mostly talked about substitution errors, in which a single bit of c is flipped from 0 to 1, or from 1 to 0. The next few exercises explore two other types of errors.

An erasure error occurs when a bit of c isn’t successfully transmitted, but the recipient is informed that the transmission of the corresponding bit wasn’t successful. We can view an erasure error as replacing a bit c_i from c with a ‘?’ (as in Exercise 4.1, for credit-card numbers). Thus, unlike a substitution error, the recipient knows which bit was erased. (So a codeword 1100110 might become 1?0011? after two erasure errors.) When codeword c ∈ {0,1}^n is sent, the receiver gets a corrupted codeword c′ ∈ {0, 1, ?}^n in which all unerased bits were transmitted correctly (that is, if c′_i ∈ {0, 1}, then c′_i = c_i).

A deletion error is like a “silent erasure” error: a bit fails to be transmitted, but there’s no indication to the recipient as to where the deletion occurred. (So a codeword 1100110 might become 10011 after two deletion errors.)
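To make the contrast between the two error types concrete, here is a small Python sketch that applies each to the codeword 1100110 from the examples above. The function names erase and delete, and the use of 0-based positions, are illustrative assumptions rather than anything defined in the text.

```python
def erase(codeword: str, positions: set[int]) -> str:
    """Erasure errors: replace the bits at the given 0-based positions by '?'."""
    return "".join(
        "?" if i in positions else bit for i, bit in enumerate(codeword)
    )


def delete(codeword: str, positions: set[int]) -> str:
    """Deletion errors: drop the bits at the given 0-based positions."""
    return "".join(bit for i, bit in enumerate(codeword) if i not in positions)


print(erase("1100110", {1, 6}))   # 1?0011?
print(delete("1100110", {1, 6}))  # 10011
```

After erasures the received string keeps its length, so the recipient can see which positions were lost; after deletions the string is simply shorter, and the lost positions are not marked.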
4.14 Let C be a code that can detect t substitution errors. Prove that C can correct t erasure errors.
4.15 Let C be a code that can correct t deletion errors. Prove that C can correct t erasure errors.
4.16 Give an example of a code that can correct one erasure error, but can’t correct one deletion error.
Consider the following codes. For each, determine the rate and minimum distance of the code. How many errors can it detect/correct?
4.17 the “code” where all n-bit strings are codewords. (That is, C := {0, 1}^n.)
4.18 the trivial code, defined as C := {0^n, 1^n}.
4.19 the parity-check code, defined as follows: the codewords are all n-bit strings with an even number of bits set to 1.
4.20 Let’s extend the idea of the parity-check code, from the previous exercise, as an add-on to any existing code with odd minimum distance. Let C ⊆ {0, 1}^n be a code with minimum distance 2t + 1, for some integer t ≥ 0. Consider a new code C′, in which we augment every codeword of C by adding a parity bit, which is zero if the number of ones in the original codeword is even and one if the number is odd, as follows: C′ := { ⟨x_1, x_2, ..., x_n, (∑_{i=1}^{n} x_i) mod 2⟩ : x ∈ C }. Then the minimum distance of C′ is 2t + 2. (Hint: consider two distinct codewords x, y ∈ C. You have to argue that the corresponding codewords x′, y′ ∈ C′ have Hamming distance 2t + 2 or more. Use two different cases, depending on the value of ∆(x, y).)
4.21 Show that we can correctly decode the Repetition_l code as follows: given a bitstring c′, for each bit position i, we take the majority vote of the l blocks’ ith bit in c′, breaking ties arbitrarily. (In other words, prove that this algorithm actually gives the codeword that’s closest to c′.)
In some error-correcting codes, for certain errors, we may be able to correct more errors than Theorem 4.1 suggests: that is, the minimum distance is 2t + 1, but we can correct certain sequences of > t errors.
We’ve already seen that we can’t successfully correct every such sequence of errors, but we can successfully handle some sequences of errors using the standard algorithm for error correction (returning the closest codeword).
4.22 The Repetition_3 code with 4-bit messages is only guaranteed to correct 1 error. What’s the largest number of errors that can possibly be corrected successfully by this code? Explain your answer.
4.23 In the Hamming code, we never correct more than 1 error successfully. Prove why not.
4.24 (programming required) Write a program, in a programming language of your choice, to verify that
any two codewords in the Hamming code differ in at least three bit positions.
Let’s find the “next” Hamming code, with 11-bit messages and 15-bit codewords and a minimum distance of 3. We’ll use the same style of codeword as in Definition 4.8: the first 11 bits of the codeword will simply be the message, and the next 4 bits will be parity bits (each for some subset of the message bits).
4.25 To achieve minimum distance 3, it will suffice to have parity bits with the following properties:
(a) each bit of the original message appears in at least two parity bits.
(b) no two bits of the original message appear in exactly the same set of parity bits.
Prove that these conditions are sufficient. That is, prove that any set of parity bits that satisfy conditions (a) and (b) ensure that the resulting code has minimum distance 3.
4.26 Define 4 parity bits for 11-bit messages that satisfy conditions (a) and (b) from Exercise 4.25.
4.27 Define 5 parity bits for 26-bit messages that satisfy conditions (a) and (b) from Exercise 4.25.
4.28 Let l ∈ Z_{>0}, and let n := 2^l − 1. Prove that a code with n-bit codewords, minimum distance 3,
and messages of length n − l is achievable. (Hint: look at all l-bit bitstrings; use the bits to identify which message bits are part of which parity bits.)
4.29 You have come into possession of 8 bottles of “poison,” except, you’ve learned, 7 are fake poison and only 1 is really poisonous. Your master plan to take over the world requires you to identify the poison by tomorrow. Luckily, as an evil genius, you have a small collection of very expensive rats, which you can use for testing. You can give samples from bottles to multiple rats simultaneously (a rat can receive a mixture
of samples from more than one bottle), and then wait for a day to see which ones die. Obviously you can identify the real poison with 8 rats (one bottle each), or even with 7 (one bottle each, one unused bottle; if all rats survive then the leftover bottle is the poison). But how many rats do you need to identify the poison? (Make the number as small as possible.)
Let c ∈ {0, 1}^23. A handy fact (which you’ll show in Exercise 9.132, after we’ve developed the necessary tools for counting to figure out this quantity): the number of 23-bit strings c′ with ∆(c, c′) ≤ 3 is exactly 2048 = 2^11 = 2^(23−12). This fact means that (according to a generalization of Lemma 4.9) it might be possible to achieve the following code parameters:
• 12-bit messages;
• 23-bit codewords; and
• minimum distance 7.
In fact, these parameters are achievable, and a code that achieves them is surprisingly simple to construct. The Golay code is an error-correcting code that can be constructed by the so-called “greedy” algorithm in Figure 4.16. (The loop should consider the strings x in lexicographic order: first 00···00, then 00···01, then 00···10, going all the way up to 11···11. Notice that therefore the all-zero vector will be added to S in the first iteration of the loop; a hundred and twenty-seven iterations later, 00000000000000001111111 will be the second element added to S, and so forth.)
4.30 (programming required) Write a program, in a language of your choice (but see the warning below), that implements the algorithm in Figure 4.16, and outputs the list of the 2^12 = 4096 different 23-bit codewords of the Golay code in a file, one per line.
Implementation hint: suppose you represent the set S as an array, appending each element that passes the test in Line 3 to the end of the array. When you add a bitstring x to S, the very next thing you do is to consider adding x + 1 to S. Implementing Line 3 by starting at the x-end of the array will make your code much faster than if you start at the 00000000000000000000000-end of the array. Think about why!
Implementation warning: this algorithm is not very efficient! We’re doing 2^23 iterations, each of which might involve checking the Hamming distance of as many as 2^12 pairs of strings. On a mildly aging laptop, my Python solution took about ten minutes to complete; if you ignore the implementation hint from the previous paragraph, it took 80 minutes. (I also implemented a solution in C; it took about 10 seconds following the hint, and 100 seconds not following the hint.)
4.31 You and six other friends are imprisoned by an evil genius, in a room filled with eight bubbling bottles marked as “poison.” (Though, really, seven of them look perfectly safe to you.) The evil genius, though, admires skill with bitstrings and computation, and offers you all a deal.
You and your friends will each have a red or blue hat placed on your heads randomly. (Each hat has a 50% chance of being red and 50% chance of being blue, independent of all other hats’ colors.) Each person can see all hats except his or her own. After a brief moment to look at each others’ hats, all of you must simultaneously say one of three things: red, blue, or pass. The evil genius will release all of you from your imprisonment if:
• everyone who says red or blue correctly identifies their hat color; and
• at least one person says a color (that is, not everybody says pass).
You may collaborate on a strategy before the hats are placed on your heads, but once the hat is in place, no communication is allowed.
An example strategy: all 7 of you pick a random color and say it. (You succeed with probability (1/2)^7 = 1/128 ≈ 0.0078.) Another example: you number yourselves 1, 2, . . . , 7, and person #7 picks a random color and says it; everyone else passes. (You succeed with probability 1/2.)
Can you succeed with probability better than 1/2? If so, how?
4.32 In Section 4.2.5, we proved an upper bound for the rate of a code with a particular minimum distance, based on the volume of “spheres” around each codeword. There are other bounds that we can prove, with different justifications.
Suppose that we have a code C ⊆ {0, 1}^n with |C| = 2^k and minimum distance d. Prove the Singleton bound, which states that k ≤ n − d + 1. (Hint: what happens if we delete the first d − 1 bits from each codeword?)
Figure 4.16: The “greedy algorithm” for generating the Golay code.
The Golay code is named after Marcel Golay, a Swiss researcher who discovered it in 1949, just before Hamming discovered what would later be called the Hamming code. A slight variant of the Golay code was used by NASA around 1980 to communicate with the Voyager spacecraft as they traveled to Saturn and Jupiter.
Confusingly, the Singleton bound
is named after Richard Singleton, a 20th-century American computer scientist; it has nothing to do with singleton sets (sets containing only one element).
1: S := ∅
2: for x ∈ {0, 1}^23 (in numerical order):
3:     if ∆(x, y) ≥ 7 for all y ∈ S then
4:         add x to S
5: return S.
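To make the structure of Figure 4.16 concrete, here is a minimal Python sketch of the same greedy loop with the codeword length n and the minimum distance d as parameters. The function names, the parameterization, and the representation of bitstrings as integers are all assumptions made for illustration; they are not from the text. The sketch is run only on a small case (7-bit strings, distance 3), not on the 23-bit case that Exercise 4.30 asks for.

```python
def hamming_distance(x, y):
    """Number of bit positions in which the integers x and y differ."""
    return bin(x ^ y).count("1")


def greedy_code(n, d):
    """Greedy loop in the style of Figure 4.16 (names are illustrative).

    Bitstrings of length n are represented as the integers 0 .. 2**n - 1,
    so numerical order and lexicographic order coincide.
    """
    S = []
    for x in range(2 ** n):
        # Line 3 of the figure: keep x only if it is far from everything in S.
        # Scanning S from its most recently added end follows the
        # implementation hint in Exercise 4.30.
        if all(hamming_distance(x, y) >= d for y in reversed(S)):
            S.append(x)
    return S


# A small instance only; the 23-bit, distance-7 run is left to the exercise.
small = greedy_code(7, 3)
print(len(small))
print([format(c, "07b") for c in small[:3]])
```

Representing each bitstring as an integer keeps the distance computation to a single XOR followed by a bit count, which matters because the test in Line 3 dominates the running time.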

4.3 Proofs and Proof Techniques
Arguments are to be avoided; they are always vulgar and often convincing.
Oscar Wilde (1854–1900)
In Section 4.2, we saw a number of claims about error-correcting codes—and, more importantly, proofs that those claims were true. These proofs used several different styles of argument: proofs that involved straightforward reasoning by starting from the relevant definitions; proofs that used “case-based” reasoning; and proofs “by contradiction” that argued that x must be true because something impossible would happen if x were false. Indeed, whenever you face a claim that you need to prove, a variety of different strategies (including these strategies from Section 4.2) are possible approaches for you to employ. This section is devoted to outlining these and some other common proof strategies. We’ll first catalogue these techniques in Section 4.3.1, and then, in Section 4.3.2, we’ll reflect briefly on the strategies and how to choose among them—and also reflect on the writing part of writing proofs.
What is a proof?
This chapter is devoted to techniques for proving claims, but before we explore proof techniques, let’s spend a few words discussing what a proof actually is:
Definition 4.9 (Proof)
A proof of a proposition is a convincing argument that the proposition is true.
Definition 4.9 says that a proof is a “convincing argument,” but it doesn’t say to whom the argument should be convincing. The answer is: to your reader. This definition may be frustrating, but the point is that a proof is a piece of writing, and (just like with fiction or a persuasive essay) you must write for your audience.
Taking it further: Different audiences will have very different expectations for what counts as “convincing.” A formal logician might not find an argument convincing unless she saw every last step, no matter how allegedly obvious or apparently trivial. An instructor of an early-to-mid-level computer science class might be convinced by a proof written in paragraph form that omits some simple steps, like those that invoke the commutativity of addition, for example. A professional CS researcher reading a publication in conference proceedings would expect “elementary” calculus to be omitted.
Some of the debates over what counts as convincing to an audience (in other words, what counts as a “proof”) were surprisingly controversial, particularly as computer scientists began to consider claims that had previously been the exclusive province of mathematicians. See the discussion on p. 437 of the Four-Color Theorem, which triggered many of these discussions in earnest.
To give an example of writing for different audiences, we’ll give several proofs of the same result. Here’s a claim regarding divisibility and factorials. (Recall that n!, pronounced “n factorial,” is defined as n! := n · (n − 1) · (n − 2) · · · 1.) Before reading further, spend a minute trying to convince yourself why (†) is true:
Let n be a positive integer and let k be any integer satisfying 2 ≤ k ≤ n. Then n! + 1 is not evenly divisible by k. (†)

We’ll prove Claim (†) three times, using three different levels of detail:
Example 4.8 (Factorials: Proof I)
Proof (heavy detail). By the definition of factorial, we have that n! = ∏_{i=1}^{n} i, which can be rewritten as n! = (∏_{i=1}^{k−1} i) · k · (∏_{i=k+1}^{n} i). Let m = (∏_{i=1}^{k−1} i) · (∏_{i=k+1}^{n} i). Thus we have that n! = k · m and m ∈ Z, because the product of any finite set of integers is also an integer.
Observe that n! + 1 = mk + 1. We claim that there is no integer l such that kl = n! + 1.
First, there is no l ≤ m such that kl = n! + 1, because kl ≤ km = n! < n! + 1. Second, there is no l ≥ m + 1 such that kl = n! + 1, because k ≥ 2 implies that kl ≥ k(m + 1) = n! + k > n! + 1. Because there is no such integer l ≤ m and no such integer l > m, the claim follows.
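As a sanity check on Claim (†), and not a substitute for the proof, the short Python sketch below computes the remainder of n! + 1 on division by k for small n. The function name is an assumption introduced for illustration. Because n! + 1 = mk + 1 with k ≥ 2, the proof predicts that the remainder is always 1.

```python
from math import factorial


def remainder_of_factorial_plus_one(n, k):
    """Remainder when n! + 1 is divided by k (illustrative helper)."""
    return (factorial(n) + 1) % k


# Check every pair with 2 <= k <= n for a handful of small n.
checked = 0
for n in range(2, 13):
    for k in range(2, n + 1):
        assert remainder_of_factorial_plus_one(n, k) == 1
        checked += 1
print("pairs checked:", checked)
```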
Example 4.9 (Factorials: Proof II)
Proof(mediumdetail). Definem=n!/k,sothatn!=mkandn!+1=mk+1.Becausekis an integer between 2 and n, the definition of factorial implies that m is an integer. But because k ≥ 2, we know mk < mk + 1 < (m + 1)k. Thus mk + 1 is not evenly divisible by k, because this quantity is strictly between two consecutive integral multiples of k, namely m · k and (m + 1) · k. Example 4.10 (Factorials: Proof III) Proof(lightdetail). Notethatkevenlydividesn!.Thenextintegerevenlydivisibleby k is n! + k. But k ≥ 2, so n! < n! + 1 < n! + k. The claim follows immediately. Which of the three proofs from Examples 4.8, 4.9, and 4.10 is best? It depends! The right level of detail depends on your intended reader. A typical reader of this book would probably be happiest with the medium-detail proof from Example 4.9, but it is up to you to tailor your proof to your desired reader. Taking it further: It turns out that one can encode literally all of mathematics using a handful of set- theoretic axioms, and a lot of patience. It’s possible to write down everything in this book in ultraformal set-theoretic notation, which serves the purpose of making arguments 100% airtight. But the high- level computer science content can be hard to see in that style of proof. If you’ve ever programmed in assembly language before, there’s a close analogy: you can express every program that you’ve ever written in extremely low-level machine code, or you can write it in a high-level language like C or Java or Python or Scheme (and, one hopes, make the algorithm much more understandable for the reader). We’ll prove a lot of facts in this book, but at the Python-like level of proof. Someone could “compile” our proofs down into the low-level set-theoretic language—but we won’t bother. (Lest you underestimate the difficulty of this task: a proof that 2 + 2 = 4 would require hundreds of steps in this low-level proof!) 
There are subfields of computer science (“formal methods” or “formal verification,” or “automated theorem proving”) that take this ultrarigorous approach: start from a list of axioms, and a list of inference rules, and a desired theorem, and derive the theorem by applying the inference rules. When it is absolutely life-or-death critical that the proof be 100% verified, then these approaches tend to be used: in verifying protocols in distributed computing, or in verifying certain crucial components of a processor, for example.
Writing tip: As you study the material in this book, you will frequently be given a claim and asked to prove it. To complete this task well, you must think about the question of for whom you are writing your proof. A reasonable guideline is that your audience for your proofs is a classmate or a fellow reader of this book who has read and understood everything up to the point of the claim that you’re proving, but hasn’t thought about this particular claim at all.
4.3.1 Proof Techniques
We will describe three general strategies for proofs:
• direct proof: we prove a statement φ by repeatedly inferring new facts from known facts to eventually conclude φ. (Sometimes we’ll divide our work into separate cases and give different proofs in each case. And if φ is of the form p ⇒ q, we’ll generally assume p and then try to infer q under that assumption.)
• proof by contrapositive: when the statement that we’re trying to prove is an implication p ⇒ q, we can instead prove ¬q ⇒ ¬p, the contrapositive of the original claim. The contrapositive is logically equivalent to the original implication, so once we’ve proven ¬q ⇒ ¬p, we can also conclude p ⇒ q.
• proof by contradiction: we prove a statement φ by assuming ¬φ, and proving something impossible; that is, proving ¬φ ⇒ False. Because ¬φ therefore cannot be true, we can conclude that φ must be true.
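The second and third strategies above each rest on a logical equivalence, and both equivalences can be confirmed by brute force over truth values. The Python sketch below does so; the helper name implies is an assumption introduced for illustration.

```python
from itertools import product


def implies(p, q):
    """Truth value of the implication p => q."""
    return (not p) or q


# Contrapositive: p => q has the same truth value as (not q) => (not p).
contrapositive_ok = all(
    implies(p, q) == implies(not q, not p)
    for p, q in product([False, True], repeat=2)
)

# Contradiction: phi has the same truth value as (not phi) => False.
contradiction_ok = all(
    phi == implies(not phi, False)
    for phi in [False, True]
)

print(contrapositive_ok, contradiction_ok)
```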
“When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
Sir Arthur Conan Doyle (1859–1930), The Sign of the Four (1890)
We’ll give some additional examples of each proof technique as we go, proving some purely arithmetic claims to illustrate the strategy.
Almost every claim that we’ll prove here (or that you’ll ever need to prove) will be a universally quantified statement, of the form ∀x ∈ S : P(x). (Often the quantification will not be explicit: we view any unquantified variable in a statement as being implicitly universally quantified.) To prove a claim of the form ∀x ∈ S : P(x), we usually proceed by considering a generic element x ∈ S, and then proving that P(x) holds. (Considering a “generic” element means that we make no further assumptions about x, other than assuming that x ∈ S.) Because this proof establishes that an arbitrary x ∈ S makes P(x) true, we can conclude that ∀x ∈ S : P(x).
Direct proofs
The simplest type of proof for a statement φ is a derivation of φ from known facts. This type of argument is called a direct proof:
Definition 4.10 (Direct Proof)
A direct proof of a proposition φ starts from known facts and implications, and repeatedly applies logical deduction to derive new facts, eventually leading to the conclusion φ.
Most of the proofs in Section 4.2 were direct proofs. Here’s another, simpler example:
Example 4.11 (Divisibility by 4)
Let’s prove the correctness of a simple test of whether a given integer is divisible by 4:
Claim: Any positive integer n is divisible by 4 if and only if its last two digits are themselves divisible by 4. (That is, n is divisible by 4 if and only if n’s last two digits are in {00, 04, 08, . . . , 92, 96}.)
Proof. Let d_k, d_{k−1}, ..., d_1, d_0 denote the digits of n, reading from left to right, so that
n = d_0 + 10 d_1 + 100 d_2 + 1000 d_3 + ··· + 10^k d_k,
or, dividing both sides by 4,
n/4 = (d_0 + 10 d_1)/4 + 25 d_2 + 250 d_3 + ··· + 25 · 10^{k−2} d_k. (∗)
The integer n is divisible by 4 if and only if n/4 is an integer, which because of (∗) occurs if and only if the right-hand side of (∗) is an integer. And that’s true if and only if (d_0 + 10 d_1)/4 is an integer, because all other terms in the right-hand side of (∗) are integers. Therefore 4 | n if and only if 4 | (d_0 + 10 d_1). The last two digits of n are precisely d_0 + 10 d_1, so the claim follows.
Note that this argument considers a generic positive integer n, and establishes the result for that generic n. The proof relies on two previously known facts: (1) an integer n is divisible by 4 if and only if n/4 is an integer; and (2) for an integer a, we have that x + a is an integer if and only if x is an integer. The argument itself uses these two basic facts to derive the desired claim.
Let’s give another example, this time for an implication. The proof strategy of assuming the antecedent, discussed in Definition 3.22 in Section 3.4.3, is a form of direct proof. To prove an implication of the form φ ⇒ ψ, we assume the antecedent φ and then prove ψ under this assumption. This proof establishes φ ⇒ ψ because the only way for the implication to be false is when φ is true but ψ is false, but the proof shows that ψ is true whenever φ is true.
Here’s an example of this type of direct proof, for a basic fact about rational numbers. (Recall that a number x is rational if and only if there exist integers n and d ≠ 0 such that x = n/d.)
Example 4.12 (The product of rational numbers is rational)
Claim: If x and y are rational numbers, then so is xy.
Proof. Assume the antecedent; that is, assume that x and y are rational. By the definition of rationality, then, there exist integers n_x, n_y, d_x ≠ 0, and d_y ≠ 0 such that x = n_x/d_x and y = n_y/d_y. Therefore
xy = (n_x/d_x) · (n_y/d_y) = (n_x n_y)/(d_x d_y).
Both n_x n_y and d_x d_y are integers, because the product of any two integers is also an integer. And d_x d_y ≠ 0 because both d_x ≠ 0 and d_y ≠ 0. Thus xy is a rational number, by the definition of rationality.
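Both direct proofs above describe computations that are easy to try out. The Python sketch below, in which the function names and the sample fractions are assumptions chosen for illustration, compares the last-two-digits test of Example 4.11 against ordinary divisibility, and builds the numerator and denominator used in Example 4.12. It illustrates the claims on finitely many inputs; it does not prove them.

```python
from fractions import Fraction


def last_two_digits_test(n):
    """Example 4.11's test: inspect only d0 + 10*d1, the last two digits of n."""
    return (n % 100) % 4 == 0


def product_as_ratio(nx, dx, ny, dy):
    """Example 4.12's construction: xy = (nx*ny) / (dx*dy) for x = nx/dx, y = ny/dy."""
    return nx * ny, dx * dy


# The digit test agrees with direct divisibility on every n tried.
agrees = all(last_two_digits_test(n) == (n % 4 == 0) for n in range(1, 100000))

# x = 3/4 and y = -10/7: the constructed ratio has a nonzero integer
# denominator and equals the product computed by the standard library.
num, den = product_as_ratio(3, 4, -10, 7)
matches = den != 0 and Fraction(num, den) == Fraction(3, 4) * Fraction(-10, 7)

print(agrees, matches)
```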
Proof by cases
Sometimes we’ll be asked to prove a statement of the form ∀x ∈ S : P(x) that indeed seems true for every x ∈ S, but the “reason” that P(x) is true seems to be different for different “kinds” of elements x. For example, Lemma 4.6 argued that the Hamming distance between two Hamming-code codewords was at least three, based on three different arguments based on whether the corresponding messages differed in 1, 2, or ≥ 3 positions. This proof was an example of a proof by cases:
Definition 4.11 (Proof by cases)
To give a proof by cases of a proposition φ, we identify a set of cases and then prove two different types of facts: (1) “in every case, φ holds”; and (2) one of the cases has to hold.
(Proofs by cases need not be direct proofs, but plenty of them are.) Here are two simple examples of proofs by cases:
Example 4.13 (Certain squares)
Claim: Let n be any integer. Then n · (n + 1)^2 is even.
Proof. We’ll give a proof by cases, based on the parity of n:
• If n is even, then any multiple of n is also even, so we’re done.
• If n is odd, then n + 1 must be even. Thus any multiple of n + 1 is also even, so we’re done again.
Because the integer n must be either even or odd, and the quantity n · (n + 1)^2 is an even number in either case, the claim follows.
Example 4.14 (An easy fact about absolute values)
Claim: Let x ∈ R. Then −|x| ≤ x ≤ |x|.
Proof. Observe that x ≥ 0 or x ≤ 0. In both cases, we’ll show the desired inequality:
• For the case that x ≥ 0, we know −x ≤ 0 ≤ x. By the definition of absolute value, we have |x| = x and −|x| = −x. Thus −|x| = −x ≤ 0 ≤ x = |x|.
• For the case that x ≤ 0, we know x ≤ 0 ≤ −x. By the definition of absolute value, we have |x| = −x and −|x| = x. Thus −|x| = x ≤ 0 ≤ −x = |x|.
Note that a proof by cases is only valid if the cases are exhaustive, that is, if every situation falls into one of the cases. (If, for example, you try to prove ∀x ∈ R : P(x) with the cases x > 0 and x < 0, you’ve left out x = 0, and your proof isn’t valid!) But the cases do not need to be mutually exclusive (that is, they’re allowed to overlap), as long as the cases really do cover all the possibilities; in Example 4.14, we handled the x = 0 case in both cases x ≥ 0 and x ≤ 0. If all possible values of x are covered by at least one case, and the claim is true in every case, then the proof is valid.
Here’s another slightly more complex example, where we’ll prove the triangle inequality for the absolute value function. (See Figure 4.2.)
Example 4.15 (Triangle inequality for absolute values)
Claim: Let x, y, z ∈ R. Then |x − y| ≤ |x − z| + |y − z|.
Proof. Without loss of generality, assume that x ≤ y. (If y ≤ x, then we simply swap the names of x and y, and nothing changes in the claim.)
The phrase “without loss of generality” indicates that we won’t explicitly write out all the cases in the proof, because the omitted ones are virtually identical to the ones that we are writing out. It allows you to avoid cut-and-paste-and-search-and-replace arguments for two very similar cases.
Because we’re assuming x ≤ y, we must show that |x − z| + |y − z| ≥ |x − y| = y − x. We’ll consider three cases: z ≤ x, or x ≤ z ≤ y, or y ≤ z. See Figure 4.17.
Figure 4.17: The three cases for Example 4.15: z can fall to the left of x, between x and y, or to the right of y. In each case, we argue that the sum of the lengths of the dashed lines is at least y − x.
Case I: z ≤ x. Then
|x − z| + |y − z| ≥ |y − z|    (|x − z| ≥ 0 by the definition of absolute value.)
= y − z    (x ≤ y by assumption and z ≤ x in Case I, so z ≤ y too.)
≥ y − x.    (z ≤ x in Case I, so −z ≥ −x.)
Case II: x ≤ z ≤ y. Then
|x − z| + |y − z| = (z − x) + |y − z|    (definition of absolute value and x ≤ z in Case II.)
= (z − x) + (y − z)    (definition of absolute value and z ≤ y in Case II.)
= y − x.    (algebra/rearranging terms.)
Case III: y ≤ z. Then
|x − z| + |y − z| ≥ |x − z|    (|y − z| ≥ 0 by the definition of absolute value.)
= z − x    (x ≤ y by assumption and y ≤ z in Case III, so x ≤ z too.)
≥ y − x.    (z ≥ y in Case III.)
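The case structure of Example 4.15 can be mirrored in code. The Python sketch below, in which the function name and the grid of sample values are assumptions for illustration, classifies each sampled triple into Case I, II, or III after the same “without loss of generality” swap, and checks the inequality in every case. Sampling finitely many points only illustrates the claim; the proof is what covers all real numbers.

```python
from itertools import product


def which_case(x, y, z):
    """Case of Example 4.15 that (x, y, z) falls into, assuming x <= y.

    The cases overlap at z == x and z == y; ties are assigned to the
    earlier case, which is harmless because the cases need not be exclusive.
    """
    if z <= x:
        return "I"
    if z <= y:
        return "II"
    return "III"


values = [v / 2 for v in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
cases_seen = set()
for x, y, z in product(values, repeat=3):
    if x > y:
        x, y = y, x  # without loss of generality, x <= y
    cases_seen.add(which_case(x, y, z))
    assert abs(x - z) + abs(y - z) >= y - x

print(sorted(cases_seen))
```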
In all three cases, we’ve shown that |x − z| + |y − z| ≥ y − x, so the claim follows.
Notice the creative demand if you choose to develop a proof by cases: you have to choose which cases to use! The proposition itself does not necessarily make obvious an appropriate choice of which different cases to use.
Proof by contrapositive
When we seek to prove a claim φ, it suffices to instead prove any proposition that is logically equivalent to φ. (For example, a proof by cases with two cases q and ¬q corresponds to the logical equivalence p ≡ (q ⇒ p) ∧ (¬q ⇒ p).) A valid proof of any logically equivalent proposition can be used to prove that φ is true, but a few logical equivalences turn out to be particularly useful. A proof by contrapositive is a very common proof technique that relies on this principle:
Definition 4.12 (Proof by contrapositive)
To give a proof by contrapositive of an implication φ ⇒ ψ, we instead give a proof of the implication ¬ψ ⇒ ¬φ.
Recall from Section 3.4.3 that an implication p ⇒ q is logically equivalent to its contrapositive ¬q ⇒ ¬p. (An implication is true unless its antecedent is true and its conclusion is false, so ¬q ⇒ ¬p is true unless ¬q is true and ¬p is false, which is precisely when p ⇒ q is false.) Here are two simple examples of proofs using the contrapositive, one about absolute values and one about rational numbers:
Example 4.16 (The sum of the absolute values vs. the absolute value of the sum)
Claim: If |x| + |y| ≠ |x + y|, then xy < 0.
Proof. We’ll prove the contrapositive:
If xy ≥ 0, then |x| + |y| = |x + y|. (∗)
To prove (∗), assume the antecedent; that is, assume that xy ≥ 0. We must prove |x| + |y| = |x + y|. Because xy ≥ 0, there are two cases: either both x ≥ 0 and y ≥ 0, or both x ≤ 0 and y ≤ 0.
Case I: x ≥ 0 and y ≥ 0. Then |x| + |y| = x + y, by the definition of absolute value. And |x + y| = x + y too, because x ≥ 0 and y ≥ 0 implies that x + y ≥ 0 as well.
Case II: x ≤ 0 and y ≤ 0.
Then |x| + |y| = −x + −y, by the definition of absolute value. And |x + y| = −(x + y) = −x + −y too, because x ≤ 0 and y ≤ 0 implies that x + y ≤ 0 as well.
Example 4.17 (Irrational quotients have an irrational numerator or denominator)
Claim: Let y ≠ 0. If x/y is irrational, then either x is irrational or y is irrational.
Proof. We will prove the contrapositive:
If x is rational and y is rational, then x/y is rational. (†)
(Note that, by De Morgan’s Laws, ¬(x is irrational or y is irrational) is equivalent to x being rational and y being rational.)
Writing tip: Help your reader figure out what’s going on! If you’re going to use a proof by contrapositive, say you’re using a proof by contrapositive! Don’t leave ’em guessing. This tip applies for all proof techniques: your job is to convince your reader, so be kind and informative to your reader.
To prove (†), assume the antecedent; that is, assume that x is rational and y is rational. By definition, then, there exist four integers n_x, n_y, d_x ≠ 0, and d_y ≠ 0 such that x = n_x/d_x and y = n_y/d_y. Thus x/y = (n_x d_y)/(d_x n_y). (By the assumption that y ≠ 0, we know that n_y ≠ 0, and thus d_x n_y ≠ 0.) Both the numerator and denominator are integers, so x/y is rational.
Of course, you can always reuse previous results in any proof, and Example 4.12 is particularly useful for the claim in Example 4.17. Here’s a second, shorter proof:
Example 4.18 (Irrational quotients, Version B)
Claim: Let y ≠ 0. If x/y is irrational, then either x is irrational or y is irrational.
Proof. We prove the contrapositive. Assume that x and y are rational. By definition, then, y = n/d for some integers n and d ≠ 0. Therefore 1/y = d/n is rational too. (By the assumption that y ≠ 0, we know that n ≠ 0.) But x/y = x · (1/y), and both x and 1/y are rational. Therefore Example 4.12 implies that x/y is rational too.
Here’s one more example of a proof that uses the contrapositive. When proving an “if and only if” statement φ ⇔ ψ, we can instead give proofs of both φ ⇒ ψ and ψ ⇒ φ, because φ ⇔ ψ and (φ ⇒ ψ) ∧ (ψ ⇒ φ) are logically equivalent.
This type of proof is sometimes called a proof by mutual implication. (We can also prove φ ⇔ ψ by giving a chain of logically equivalent statements that transform φ into ψ, but it is often easier to prove one direction at a time.) Here’s an example of a proof by mutual implication, which also uses the contrapositive to prove one of the directions:
Example 4.19 (Even integers (and only even integers) have even squares)
Claim: Let n be any integer. Then n is even if and only if n^2 is even.
Proof. We proceed by mutual implication.
First, we will show that if n is even, then n^2 is even too. Assume that n is even. Then, by definition, there exists an integer k such that n = 2k. Therefore n^2 = (2k)^2 = 4k^2 = 2 · (2k^2). Thus n^2 is even too, because there exists an integer l such that n^2 = 2l. (Namely, l = 2k^2.)
Second, we will show the converse: if n^2 is even, then n is even. We will instead prove the contrapositive: if n is not even, then n^2 is not even. Assume that n is not even. Then n is odd, and there exists an integer k such that n = 2k + 1. Therefore n^2 = (2k + 1)^2 = 4k^2 + 4k + 1 = 2(2k^2 + 2k) + 1. Thus n^2 is odd too, because there exists an integer l such that n^2 = 2l + 1. (Namely, l = 2k^2 + 2k.)
Proofs by contradiction
The proof techniques that we’ve described so far establish a claim φ by arguing that φ must be true. Here, we’ll look at the other side of the coin, and prove φ has to be true by proving that φ cannot be false. This approach is called a proof by contradiction: we prove that something impossible must happen if φ is false (that is, we prove ¬φ ⇒ False); thus the assumption ¬φ led us to an absurd conclusion, and we must reject the assumption ¬φ and instead conclude its negation φ:
Definition 4.13 (Proof by contradiction)
To prove φ using a proof by contradiction, we assume the negation of φ and derive a contradiction; that is, we assume ¬φ and prove False.
(This proof technique is based on the logical equivalence of φ and the proposition ¬φ ⇒ False.) We used a proof by contradiction in Lemma 4.8: to show that two particular sets X and Y were disjoint, we assumed that there was an element z ∈ X ∩ Y (that is, we assumed that X and Y were not disjoint), and we showed that this assumption led to a violation of the assumptions in the definitions of X and Y. Here’s another simple example:
A proof by contradiction is also called reductio ad absurdum (Latin: “reduction to an absurdity”).
As my grandfather always used to say: “If the conclusion is obviously false, reexamine the premises.” Jay Liben (1913–2006)
Example 4.20 (15x + 111y = 55057 for integers x and y?)
Claim: Suppose 15x + 111y = 55057, for two real numbers x and y. Then either x or y (or both) is not an integer.
Proof. Suppose not: that is, suppose that x and y are integers with 15x + 111y = 55057. But 15x + 111y = 3 · (5x + 37y), so 55057/3 = 5x + 37y. But then 55057/3 must therefore be an integer, because 5x + 37y is; but 55057/3 = 18352.333··· ∉ Z. Therefore the assumption that both x ∈ Z and y ∈ Z was false, and at least one of x and y must be nonintegral.
Here is another example of a proof by contradiction, for a classical result showing that there are numbers that aren’t rational:
Writing tip: It’s always a good idea to help your reader with “signposts” in your writing. In a proof by contradiction, announce at the outset that you’re assuming ¬φ for the purposes of deriving a contradiction; when you reach a contradiction, say that you’ve reached a contradiction, and declare that therefore the assumption ¬φ was false, and φ is true.
Example 4.21 (The irrationality of √2)
Claim: √2 is not rational.
Proof. We proceed by contradiction.
Assume that √2 is rational.
Therefore, by the definition of rationality, there exist integers n and d ≠ 0 such that n/d = √2, where n and d are in lowest terms (that is, where n and d have no common divisors greater than 1). Squaring both sides yields that n²/d² = 2, and therefore that n² = 2d². Because 2d² is even, we know that n² is even. Therefore, by Example 4.19 (“n is even if and only if n² is even”) we have that n is itself even. Because n is even, there exists an integer k such that n = 2k, which implies that n² = 4k². Thus n² = 4k² and n² = 2d², so 2d² = 4k² and d² = 2k². Hence d² is even, and (again using Example 4.19) we have that d is even. But now we have a contradiction: we assumed that n/d was in lowest terms, but we have now shown that n and d are both even! Thus the original assumption that √2 was rational was false, and we can conclude that √2 is irrational.

Note again the structure of this proof: suppose that √2 is rational; therefore we can write √2 = n/d where n and d have no common divisors greater than 1, and (a few steps later) therefore n and d are both even. Because n and d cannot both have no common divisors greater than 1 and also both be even, we’ve derived an absurdity. The only way we could have gotten to this absurdity is via our assumption that √2 was rational, so we conclude that this assumption must have been false, and therefore √2 is irrational.

Note that, when you’re trying to prove an implication φ ⇒ ψ, a proof by contrapositive has some similarity to a proof by contradiction:
• in a proof by contrapositive, we prove ¬ψ ⇒ ¬φ, by assuming ¬ψ and proving ¬φ.
• in a proof by contradiction, we prove False under the assumption ¬(φ ⇒ ψ), that is, under the assumption that φ ∧ ¬ψ.
(Note that there’s an extra creative demand here: you have to figure out which contradiction to derive, something that’s not generally made immediately clear by the given claim.)

Proofs by contrapositive are generally preferred over proofs by contradiction when a proof by contrapositive is possible.
A proof by contradiction can be hard to follow because we’re asking the reader to temporarily accept an assumption that we’ll later show to be false, and there can be a mental strain in keeping track of what’s been assumed and what was previously known. (Notice that the claim in Example 4.21 wasn’t an implication, so a proof by contrapositive wasn’t an option. The proofs of Lemma 4.8 and Example 4.20, though, could have been rephrased as proofs by contrapositive.)

Proofs by construction and disproofs by counterexample
So far we’ve concentrated on proofs of universally quantified statements, where you are asked to show that some property holds for all elements of a given set. (Every example proof in this section, except the two proofs by contradiction about the irrationality of √2 and the infinitude of primes, was a proof of a “for all” statement; actually, even those two claims could have been phrased as universal quantifications. For example, we could have phrased Example 4.21 as the following claim: for all integers n and d ≠ 0, we have n ≠ d · √2.)

Sometimes you’ll confront a universally quantified statement that’s false, though. The easiest way to prove that ∀x ∈ S : P(x) is false is using a disproof by counterexample:

Definition 4.14 (Disproof by counterexample)
A counterexample to a claim ∀x ∈ S : P(x) is a particular element y ∈ S such that P(y) is false. A disproof by counterexample of the claim ∀x ∈ S : P(x) (that is, a proof of ¬∀x ∈ S : P(x)) is such a counterexample y ∈ S, together with a proof that P(y) is false.

Finding a counterexample for a claim requires creativity: you have to think about why a claim might not be true, and then try to construct an example that embodies that reason. Here is a simple example:

Example 4.22 (Unique sums of squares)
Claim: Let n be a positive integer such that n = a² + b² for positive integers a and b. Then n cannot be expressed as the sum of the squares of two positive integers except a and b. (Alternatively, this claim could be written more tersely as: No positive integer is expressible in two different ways as the sum of two perfect squares.)
The claim is false, and we will prove that it is false by counterexample. We can start trying some examples. One easy class of potential counterexamples is a² + 1 for an integer a. 1² + 1² = 2 can’t be expressed a different way. What about 5? 10? 17? 26? 37? 50? 65? 82? By testing these examples, we find that 65 is a counterexample to the claim. Observe that 1² + 8² = 1 + 64 = 65, and 4² + 7² = 16 + 49 = 65. Another counterexample is 50, as 50 = 5² + 5² = 1² + 7².

Problem-solving tip: One way you might try to identify a counterexample to a claim is by writing a program: write a loop that tries a bunch of examples; if you ever find one for which the claim is false, then you’ve found a counterexample. Just because you haven’t found a counterexample with your program doesn’t mean that there isn’t one (unless you’ve tried all the elements of S), but if you do find a counterexample, it’s still a counterexample no matter how you found it!

What about when you’re asked to prove an existential claim ∃x : P(x)? One approach is to prove the claim by contradiction: you assume ∀x : ¬P(x), and then derive some contradiction. This type of proof is called nonconstructive: you have proven that an object with a certain property must exist, but you haven’t actually described a particular object with that property. In contrast, a proof by construction actually identifies a specific object that has the desired property:

Definition 4.15 (Proof by construction)
A constructive proof or proof by construction for a claim ∃x ∈ S : P(x) actually builds an object satisfying the property P: first, we identify a particular element y ∈ S; and, second, we prove P(y).
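The problem-solving tip about hunting for counterexamples by program can be made concrete for Example 4.22. The following Python sketch is only an illustration: the function names and the search bound of 100 are assumptions chosen here, not part of the text. It lists every way of writing each n up to the bound as a sum of two positive squares, and reports the n that have at least two such representations.

```python
from collections import defaultdict

def square_sum_representations(limit):
    """Map each n <= limit to the pairs (a, b), with 1 <= a <= b, such that a^2 + b^2 = n."""
    reps = defaultdict(list)
    a = 1
    # With b >= a, the smallest sum involving a is a^2 + a^2, so stop once that exceeds limit.
    while 2 * a * a <= limit:
        b = a
        while a * a + b * b <= limit:
            reps[a * a + b * b].append((a, b))
            b += 1
        a += 1
    return reps

def counterexamples(limit):
    """Return, in increasing order, the n <= limit with two or more representations."""
    reps = square_sum_representations(limit)
    return sorted(n for n, pairs in reps.items() if len(pairs) >= 2)

# Hypothetical search bound; any bound of at least 50 already finds a counterexample.
found = counterexamples(100)
```

A search like this can only disprove the claim, never prove it: an empty result would say nothing about the integers beyond the bound.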
For example, here’s a simple claim that we’ll prove twice, once nonconstructively and once constructively:

Example 4.23 (The last two digits of some squares)
Claim: There exist distinct integers x, y ∈ {1901, 1902, …, 2014} such that the last two digits of x² and y² are the same. (In other words, x² mod 100 = y² mod 100.)
Nonconstructive. There are 114 different numbers in the set {1901, 1902, …, 2014}. There are only 100 different possible values for the last two digits of numbers. Thus, because there are 114 elements assigned to only 100 categories, there must be some category that contains more than one element.
Constructive. Let x = 1986 and y = 1964. Both numbers’ squares have 96 as their last two digits: 1986² = 3,944,196 and 1964² = 3,857,296.

It’s generally preferable to give a constructive proof when you can. A constructive proof is sometimes harder to develop than a nonconstructive proof, though: it may require more insight about the kind of object that can satisfy a given property, and more creativity in figuring out how to actually construct that object.

Taking it further: A constructive proof of a claim is generally more satisfying for the reader than a nonconstructive proof. A proof by contradiction may leave a reader unsettled (okay, the claim is true, but what can we do with that?), while a constructive proof may be useful in designing an algorithm, or it may suggest further possible claims to try to prove. (There’s even a school of thought in logic called constructivism that doesn’t count a proof by contradiction as a proof!)

4.3.2 Some Brief Thoughts about Proof Strategy
So far in this section, we’ve concentrated on developing a toolbox of proof techniques. But when you’re confronted with a new claim and asked to prove it, you face a difficult task in figuring out which approach to take. (It’s even harder if you’re asked to formulate a claim and then prove it!)
As we discussed in Chapter 3, there’s no formulaic approach that’s guaranteed to work: you must be creative, open-minded, and persistent. You will have to accept that you will explore approaches that end up being dead ends.

This section will give a few brief pointers about proof strategy: some things to try when you’re just starting to attack a new problem. We’ll start with some concrete advice in the form of a three-step plan, largely inspired by an outstanding book by George Pólya.² (I highly recommend Pólya as further reading!)

² George Pólya. How to Solve It. Doubleday, Garden City, NY, 1957.

1. Understand what you’re trying to do. Read the statement that you’re trying to prove. Reread it. What are the assumptions? What is the desired conclusion? (That is, what are you trying to prove under the given assumptions?) Remind yourself of any unfamiliar notation or terminology. Pick a simple example and make sure the alleged theorem holds for your example. (If not, either you’ve misunderstood something or the claim is false.) Reread the statement again.
If you’re not given a specific claim (for example, you’re asked to prove or disprove a given statement, or you’re asked for the “best possible” solution to a problem), then it’s harder but even more important to understand what you’re trying to do. Play around with some examples to generate a sense of what might be plausibly true. Then try to form a conjecture based on these examples or the intuition that you’ve developed.

2. Do it. Now that you have an understanding of the statement that you’re trying to prove, it’s time to actually prove it. You might start by trying to think about slightly different problems to help grant yourself insight about this one. Are there results that you already know that “look similar” to this one? Can you solve a more general problem? Make the premises look as much like the conclusion as possible.
Expand out the definitions; write down what you know and what you have to derive, in primitive terms. Can you derive some facts from the given hypotheses? Are there easier-to-prove statements that would suffice to prove the desired conclusion?
Look for a special case: add assumptions until the problem is easy, and then see if you can remove the extra assumptions. Restate the problem. Restate it again. Make analogies to problems that you’ve already solved. Could those related problems be directly valuable? Or could you use a similar technique to what you used in that setting?
Try to use a direct proof first; if you’re finding it difficult to construct a direct proof of an implication, try working on the contrapositive instead. If both of these approaches fail, try a proof by contradiction.
When you have a candidate plan of attack, try to execute it. If there’s a picture that will help clarify the steps in your plan, draw it. Sketch out the “big” steps that you’d need to make the whole proof work. Make sure they fit together. Then crank through the details of each big step. Do the algebra. Check the algebra. If it all works out, great! If not, go back and try again. Where did things go off the rails, and can you fix them?
Think about how to present your proof; then actually write it. Note that what you did in figuring out how to prove the result might or might not be the best way to present the proof.

3. Think about what you’ve done. Check to make sure your proof is reasonable. Did you actually use all the assumptions? (If you didn’t, do you believe the stronger claim that has the smaller set of assumptions?) Look over all the steps of your proof. Turn your internal skepticism dial to its maximum, and reread what you just wrote. Ask yourself Why? as you think through each step. Don’t let yourself get away with anything.
After you’re satisfied that your proof is correct, work to improve it. Can you strengthen the result by making the conclusion stronger or the assumptions weaker?
Can you make the proof constructive? Simplify the argument as much as you can. Are there unnecessary steps? Are there unnecessarily complex steps? Are there subclaims that would be better as separate lemmas?

It’s important to be willing to move back and forth among these steps. You’ll try to prove a claim φ, and then you’ll discover a counterexample to φ, so you go back and modify the claim to a new claim φ′ and try to prove φ′ instead. You’ll formulate a draft of a proof of φ′ but discover a bug when you check your work while reflecting on the proof. You’ll go back to proving φ′, fix the bug, and discover a new proof that’s bug-free. You’ll think about your proof and realize that it didn’t use all the assumptions of φ′, so you’ll formulate a stronger claim φ′′ and then go through the proof of φ′′ and reflect again about the proof.

Problem-solving tip: If you’re totally stuck in attempting to prove a statement true, switch to trying to prove it false. If you succeed, you’re done. If you don’t, then by figuring out why you’re struggling to construct a counterexample, you may figure out how to prove that the statement is true.

Problem-solving tip: Check your work! If your claim says something about a general n, test it for n = 1. Compare your answer to a plot, or the output of a quick program.

Taking it further: One of the most famous (and prolific!) mathematicians of modern times was Paul Erdős (1913–1996), a Hungarian mathematician who wrote approximately 1500 papers over his career, on a huge range of topics. Erdős used to talk about a mythical “Book” of proofs, containing the perfect proof of every theorem (the clearest, the most elegant, the best!). See p. 438 for some more discussion of The Book, and of Paul Erdős himself.

4.3.3 Some Brief Thoughts about Writing Good Proofs
When you’re writing a proof, it’s important to remember that you are writing. Proofs, like novels or persuasive essays, form a particular genre of writing.
Treat writing a proof with the same care and attention that you would give to writing an essay. Make your argument self-contained; include definitions of all variables and all nonstandard notation. State all assumptions, and explain your notation. Choose your notation and terminology carefully; name your variables well. Here’s an example.

Writing tip: Draft. Write. Edit. Rewrite.

Example 4.24 (Pythagorean Theorem, stated poorly)
Theorem: a² + b² = c².

This formulation is a terrible way of phrasing the theorem: the reader has no idea what a, b, and c are, or even that the theorem has anything whatsoever to do with geometry. (The Pythagorean Theorem, from geometry, states that the square of the hypotenuse of a right triangle is equal to the sum of the squares of its legs.) Here’s a much better statement of the Pythagorean Theorem:

Example 4.25 (Pythagorean Theorem, stated well)
Theorem: Let a and b denote the lengths of the legs of a right triangle, and let c denote the length of its hypotenuse. Then a² + b² = c².
If you are worried that your audience has forgotten the geometric terminology from this statement, then you might add the following clarification:
As a reminder from geometry, a right triangle is a 3-sided polygon with one 90° angle, called a right angle. The two sides adjacent to the right angle are called legs and the third side is called the hypotenuse. Figure 4.18 shows an example of a right triangle. Here the legs are labeled a and b, and the hypotenuse is labeled c. As is customary, the right angle is marked with the special square-shaped symbol ✷.

[Figure 4.18: A right triangle, with legs labeled a and b and hypotenuse labeled c.]
Thanks to Josh Davis for suggesting Examples 4.24 and 4.25.

Because the “standard” phrasing of the Pythagorean Theorem (which you might have heard in high school) calls the lengths of the legs a and b and the length of the hypotenuse c, we use the standard variable names.
Calling the leg lengths θ and φ and the hypotenuse r would be hard on the reader; conventionally in geometry θ and φ are angles, while r is a radius. Whenever you can, make life as easy as possible for your reader. (By the way, we’ll prove the Pythagorean Theorem in Example 4.14, and you’ll prove it again in Exercise 4.75.)

Above all, remember that your primary goal in writing is communication. Just as when you are programming, it is possible to write two solutions to a problem that both “work,” but which differ tremendously in readability. Document! Comment your code; explain why this statement follows from previous statements. Make your proofs (and your code!) a pleasure to read.

Writing tip: In writing a proof, keep your reader informed about the status of every sentence. And make sure that everything you write is a sentence. For example, every sentence contains a verb. (Note that a symbol like “=” is read as “is equal to” and is a verb.) Is the sentence an assumption? A goal? A conclusion? Annotate your sentences with signaling words and phrases to make it clear what each statement is doing. For example, introduce statements that follow logically from previous statements with words like hence, thus, so, therefore, and then.

Computer Science Connections
Are Massive Computer-Generated Proofs Proofs?
As we’ve said, what we mean by a “proof” is an argument that convinces the audience that the claim is true. What, then, is the status of the so-called proof of the claim Checkers is a draw when both players play optimally? The “proof” of this claim that we discussed on p.
344 hinged on showing that the software system Chinook can never lose at checkers, which was established via massive computation to perform a large-scale search of the checkers game tree.³ Is that “proof” convincing? Can such a proof ever be convincing? It’s clear that a human reader cannot accommodate the 5 × 10²⁰ checkers board positions in his or her brain, so it’s not convincing in the sense that a reader would be able to verify every step of the argument. But, on the other hand, a reader could potentially be convinced that Chinook’s code is correct, even if the output is too big for a reader to find convincing.

The philosophical question about whether a large-scale computer-generated proof actually “counts” as a proof first arose in the late 1970s, when the Four-Color Theorem was first proven(?).⁴ Here is the theorem:
Any “map” of contiguous geometric regions can be colored using four colors so that no two adjacent regions share the same color.
Two quick notes: first, adjacent means sharing a positive-length border; two regions meeting at a point don’t need different colors. Second, the requirement of regions being contiguous means the map can’t require two disconnected regions (like the Lower 48 States and Alaska) to get the same color.

The computational proof of the four-color theorem given by Appel and Haken proceeds as follows. Appel and Haken first identified a set of 1476 different map configurations and proved (in the traditional way, by giving a convincing argument) that, if the four-color theorem were false, it would fail on one of these 1476 configurations. They then wrote a computer program that showed how to color each one of these 1476 configurations using only four colors. The theorem follows (“if there were a counterexample at all, there’d be a counterexample in one of the 1476 cases, and there are no counterexamples in the 1476 cases”).

A great deal of controversy followed the publication of Appel and Haken’s work.
Some mathematicians felt strongly that a proof that’s too massive for a human to understand is not a proof at all. Others were happy to accept the proof, particularly because the four-colorability question had been posed, and remained unresolved, for more than a century.

Computer scientists, by our nature, tend to be more accepting of computational proof than mathematicians, but there are still plenty of interesting questions to ponder. For example, as we discussed on p. 344, some errors in the execution of the code that generates Chinook’s proof are known to have occurred, simply because hardware errors happen at a high enough rate that they will arise in a computation of this size. Thus bit-level corruption may have occurred, without 100% correction, in Chinook’s proof that checkers is a draw under optimal play. So is Chinook’s “proof” really a proof? (Of course, there are also plenty of human-generated purported proofs that contain errors!)

³ Jonathan Schaeffer, Neil Burch, Yngvi Bjornsson, Akihiro Kishimoto, Martin Muller, Rob Lake, Paul Lu, and Steve Sutphen. Checkers is solved. Science, 317(5844):1518–1522, 14 September 2007.
⁴ Kenneth Appel and Wolfgang Haken. Solution of the four color map problem. Scientific American, 237(4):108–121, October 1977.

[Figure 4.19: A four-colored map of the 87 counties in Minnesota.]

Computer Science Connections
Paul Erdős, “The Book,” and Erdős Numbers
After you’ve completed a proof of a claim (and after you’ve celebrated completing it), you should think again about the problem. In programming, there are often many fundamentally different algorithms to solve a particular problem; in proofs, there are often many fundamentally different ways of proving a particular theorem. And, just as in programming, some approaches will be more elegant, more clear, or more efficient than others.
Paul Erdős, a prolific and world-famous mathematician who published approximately 1500 papers before his death in 1996 (including papers on math, physics, and computer science), used to talk about “The Book” of proofs. “The Book” contains the ideal proof of each theorem: the most elegant, insightful, and beautiful proof. (If you believe in God, then The Book contains God’s proofs.) There’s even a non-metaphorical book called Proofs from The Book that collects some of the most elegant known proofs of some theorems.⁵ Proving a theorem is great, but giving a beautiful proof is even better. Strive for the “book proof” of every theorem.

Erdős was one of the most respected mathematicians of his time, and one of the most eccentric, too. (He forswore most material possessions, and instead traveled the world, crashing in the guest rooms of his research collaborators for months at a time.) Because of Erdős’s prolific publication record and his great respect from the research community, a measure of a certain type of fame for researchers has sprung up around him. A researcher’s Erdős number is 1 if she has coauthored a published paper with Erdős; it’s 2 if she has coauthored a published paper with someone with an Erdős number of 1; and so forth. For example, Bill Gates has an Erdős number of 4: he wrote a paper on the pancake-flipping problem with Christos Papadimitriou, who has coauthored a paper with someone (Xiao Tie Deng) who wrote a paper with someone (Pavol Hell) who wrote a paper with Paul Erdős.

If you’re more of a movie person than a peripatetic mathematician person, then you may be more familiar with a very similar notion from the entertainment world, the so-called Bacon game. The goal here is to connect a given actor to Kevin Bacon via the shortest possible chain of intermediaries, where two actors are linked if they have appeared together in a movie.

It is a source of great pride for researchers to have small Erdős numbers.
And, although Erdős numbers themselves are really nothing more than a nerdy source of amusement, the ideas underlying them are fundamental in graph theory, the subject of Chapter 11. A closely related topic is the small-world phenomenon, also known as “six degrees of separation,” the principle that almost any two people are likely to be connected by a short chain of intermediate friends. The “six degrees of separation” phrase came from an important early paper by the social psychologist Stanley Milgram;⁶ it has spawned a massive amount of recent research by computer scientists, who have begun working to analyze questions about human behavior that have only become visible in the “Facebook era” in which it is now possible to study collective decision making on a massive scale.

⁵ Martin Aigner and Günter Ziegler. Proofs from The Book. Springer, 4th edition, 2009.
⁶ Stanley Milgram. The small world problem. Psychology Today, 1:61–67, May 1967.

The Erdős Number Project, maintained at http://www.oakland.edu/enp by Jerry Grossman of Oakland University, is a good place to look for more information. You can see more about the Bacon game at the Oracle of Bacon, at http://oracleofbacon.org.

4.3.4 Exercises
Prove the following claims about divisibility.
4.33 The binary representation of any odd integer ends with a 1.
4.34 A positive integer n is divisible by 5 if and only if its last digit is 0 or 5.
4.35 Let k be any positive integer. Then any positive integer n is divisible by 2^k if and only if its last k digits are divisible by 2^k. (This exercise is a generalization of Example 4.11.)

Prove the following claims about rationality.
4.36 If x and y are rational numbers, then x − y is also rational.
4.37 If x and y are rational numbers and y ≠ 0, then x/y is also rational.
4.38 One of the following statements is true and one is false:
• If xy and x are both rational, then y is too.
• If x − y and x are both rational, then y is too.
Decide which statement is true and which is false, and give a proof or disproof of each.

4.39 Let n be any integer. Prove by cases that n³ − n is evenly divisible by 3.
4.40 Let n be any integer. Prove by cases that n² + 1 is not evenly divisible by 3.
4.41 Prove that |x| + |y| ≥ |x + y| for any real numbers x and y.
4.42 Prove that |x| − |y| ≤ |x − y| for any real numbers x and y.
4.43 Prove that the product of the absolute values of x and y is equal to the absolute value of their product; that is, prove that |x| · |y| = |x · y| for any real numbers x and y.
4.44 Suppose that x, y ∈ R satisfy |x| ≤ |y|. Prove that |x + y|/2 ≤ |y|.
4.45 Let A and B be sets. Prove that A × B = B × A if and only if A = ∅ or B = ∅ or A = B. Prove the result by mutual implication, where the proof of the ⇐ direction proceeds by contrapositive.

Let x ≥ 0 and y ≥ 0 be arbitrary real numbers. The arithmetic mean of x and y is (x + y)/2, their average. The geometric mean of x and y is √(xy).
4.46 First, a warm-up exercise: prove that x² ≥ 0 for any real number x. (Hint: yes, it’s easy.)
4.47 Prove the Arithmetic Mean–Geometric Mean inequality: for x, y ∈ R_{≥0}, we have √(xy) ≤ (x + y)/2. (Hint: (x − y)² ≥ 0 by Exercise 4.46. Use algebraic manipulation to make this inequality look like the desired one.)
4.48 Prove that the arithmetic mean and geometric mean of x and y are equal if and only if x = y.

In Chapter 2, when we defined square roots, we introduced Heron’s method, a first-century algorithm to compute √x given x. See p. 218, or Figure 4.20 for a reminder. Here you’ll prove two properties that help establish why this algorithm correctly computes square roots:

Figure 4.20: A reminder of Heron’s method for computing square roots.
Input: A positive real number x
Output: A real number y where y² ≈ x
Let y_0 be arbitrary, and let i := 0.
while (y_i)² is too far away from x:
    let y_{i+1} := (y_i + x/y_i)/2, and let i := i + 1.
return y_i

4.49 Assume that y_0 ≥ √x. Prove that, for every i ≥ 1, we have y_i ≥ √x. In other words, prove that if y ≥ √x then (y + x/y)/2 ≥ √x too.
4.50 Suppose that y > √x. Prove that x/y is closer to √x than y is; that is, prove that |x/y − √x| < |y − √x|. (Hint: show that |y − √x| − |√x − x/y| > 0.) Now, using this result and Exercise 4.44, prove that y_{i+1} as computed in Heron’s method is closer to √x than y_i, as long as y_i > √x.
The second property that you just proved (Exercise 4.50) shows that Heron’s method improves its estimate of √x in every iteration. (We haven’t shown “how much” improvement Heron’s method achieves in an iteration, or even that this algorithm is converging to the correct answer—let alone quickly!—but, in fact, it is.)
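To watch the two properties from Exercises 4.49 and 4.50 in action, here is a minimal Python sketch of Heron’s method as summarized in Figure 4.20. The function name, the default starting guess max(x, 1), the relative tolerance, and the iteration cap are all assumptions made for this illustration; the figure leaves the starting value arbitrary and does not say what “too far away” means.

```python
import math

def heron_estimates(x, y0=None, tolerance=1e-12, max_iterations=100):
    """Return the list [y_0, y_1, ...] of successive estimates of sqrt(x).

    The stopping rule and the default starting guess are illustrative
    assumptions, not part of the method as stated in Figure 4.20.
    """
    if x <= 0:
        raise ValueError("x must be a positive real number")
    # max(x, 1) is at least sqrt(x), so the default guess satisfies the
    # assumption y_0 >= sqrt(x) of Exercise 4.49. A caller-supplied y0
    # should be positive.
    y = max(x, 1.0) if y0 is None else y0
    estimates = [y]
    for _ in range(max_iterations):
        # One concrete reading of "(y_i)^2 is too far away from x".
        if abs(y * y - x) <= tolerance * x:
            break
        y = (y + x / y) / 2  # the update y_{i+1} := (y_i + x/y_i)/2
        estimates.append(y)
    return estimates

estimates = heron_estimates(2.0)
# Distance of each estimate from the library's value of sqrt(2).
errors = [abs(y - math.sqrt(2.0)) for y in estimates]
```

Inspecting errors for a few inputs shows the distances shrinking from one iteration to the next, which is what Exercise 4.50 asks you to prove. The experiment is evidence, not a proof.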
Prove the following claims using a proof by contrapositive.
4.51 Let n ∈ Z_{≥0}. If n mod 4 ∈ {2, 3}, then n is not a perfect square.
4.52 Let n and m be integers. If nm is not evenly divisible by 3, then neither n nor m is evenly divisible by 3. (In fact, the converse is true too, but you don’t have to prove it.)
4.53 Let n ∈ Z_{≥0}. If 2n⁴ + n + 5 is odd, then n is even.

Prove the following claims using a proof by mutual implication, using a proof by contrapositive for one direction.
4.54 Let n be any integer. Then n³ is even if and only if n is even.
4.55 Let n be any integer. Then n is divisible by 3 if and only if n² is divisible by 3.
Prove the following claims using a proof by contradiction.
4.56 Let x, y be positive real numbers. If x² − y² = 1, then x or y (or both) is not an integer.
4.57 Suppose 12x + 3y = 254, for real numbers x and y. Then either x or y (or both) is not an integer.
4.58 Adapt Example 4.21 to prove that ∛2 = 2^(1/3) is irrational. (You may find Exercise 4.54 helpful.)
4.59 Adapt Example 4.21 to prove that √3 is irrational. (You may find Exercise 4.55 helpful.)
4.60 Consider an array A[1 . . . n]. A value x is called a strict majority element of A if strictly more than half of the elements in A are equal to x; in other words, if
|{i ∈ {1, 2, …, n} : A[i] = x}| > n/2.
Give a proof by contradiction that every array has at most one strict majority element.
In Example 4.12, Exercise 4.36, and Exercise 4.37, we proved that if x and y are both rational, then so are all three of xy, x − y, and x/y. The converse of each of these three statements is false. Disprove the following claims by giving counterexamples:
4.61 If xy is rational, then x and y are rational.
4.62 If x − y is rational, then x and y are rational.
4.63 If x/y is rational, then x and y are rational.
4.64 In Example 4.22, we disproved the following claim by giving a counterexample:
Claim 1: No positive integer is expressible in two different ways as the sum of two perfect squares.
Let’s consider a related claim that is not disproved by our counterexamples from Example 4.22:
Claim 2: No positive integer is expressible in three different ways as the sum of two perfect squares.
Disprove Claim 2 by giving a counterexample.
4.65 Leonhard Euler, an 18th-century Swiss mathematician to whom the idea of an abstract formal model of networks (graphs; see Chapter 11) is due, made the observation that the polynomial
f(n) = n² + n + 41
yields a prime number when it’s evaluated for many small integers n: for example, f(0) = 41 and f(1) = 43 and f(2) = 47 and f(3) = 53, and so forth. Prove or disprove the following claim: the function f(n) yields a prime for every nonnegative integer n.

4.4 Some Examples of Proofs
Few things are harder to put up with than the annoyance of a good example.
Mark Twain (1835–1910), Pudd’nhead Wilson (1894)

We’ve now catalogued a variety of proof techniques, discussed some strategies for proving novel statements, and described some ideas about presenting proofs well. Section 4.3 illustrated some proof techniques with a few simple examples each, entirely about numbers and arithmetic. In this section, we’ll give a few “bigger” (and perhaps more interesting!) examples of theorems and proofs.
4.4.1 A Proof about Propositional Logic: Conjunctive/Disjunctive Normal Form
We’ll start with a result about propositional logic, namely showing that any proposition is logically equivalent to another proposition that has a “simpler” structure. Recall the definitions of conjunctive and disjunctive normal form:
Definition 4.16 (Reminder: Conjunctive/Disjunctive Normal Form)
In propositional logic, a literal is a Boolean variable or its negation (like p or ¬p).
A proposition φ is in conjunctive normal form (CNF) if φ is the conjunction of one or
more clauses, where each clause is the disjunction of one or more literals.
A proposition φ is in disjunctive normal form (DNF) if φ is the disjunction of one or
more clauses, where each clause is the conjunction of one or more literals.
Here are two small examples of CNF and DNF:
(¬p ∨ q ∨ ¬r) ∧ (¬q ∨ r) (conjunctive normal form)
(¬p ∧ ¬q ∧ r) ∨ (¬q ∧ ¬r ∧ s) ∨ (r) (disjunctive normal form)
Back in Chapter 3, we claimed that every proposition is logically equivalent to one in CNF and one in DNF, but we didn’t prove it. Here we will.
First, though, let’s recall an example from Chapter 3 and brainstorm a bit about how to generalize that result into the desired theorem. In Example 3.26, we converted p ⇔ q into DNF as the logically equivalent proposition (p ∧ q) ∨ (¬p ∧ ¬q). Note that this expression has two clauses p ∧ q and ¬p ∧ ¬q, each of which is true in one and only one row of the truth table. And our full proposition (p ∧ q) ∨ (¬p ∧ ¬q) is true in precisely two rows of the truth table. (See Figure 4.21.)
Can we make this idea general? Yes! For an arbitrary proposition φ, and for any particular row of the truth table for φ, we can construct a clause that’s true in that row and only in that row. We can then build a DNF proposition that’s logically equivalent to φ by “or”ing together each of the clauses corresponding to the rows in which φ is true. And then we’re done!
(Well, we’re almost done! There is one subtle bug in the proof sketch in the previous paragraph—can you find it? We’ll fix the issue in the last paragraph of the proof below.)
p  q  |  p ⇔ q  |  p ∧ q  |  ¬p ∧ ¬q
T  T  |    T    |    T    |     F
T  F  |    F    |    F    |     F
F  T  |    F    |    F    |     F
F  F  |    T    |    F    |     T
Figure 4.21: Truth table for p ⇔ q and the clauses for converting it to DNF.

442 CHAPTER 4. PROOFS
Theorem 4.11 (All propositions are expressible in DNF (Theorem 3.2))
For any proposition φ, there exists a proposition ψdnf in disjunctive normal form such that φ ≡ ψdnf.
Proof. Let φ be an arbitrary proposition, say over the Boolean variables p1, …, pk. For any particular truth assignment ρ for the variables p1, …, pk, we’ll construct a conjunction cρ that’s true under ρ and false under all other truth assignments. Let x1, x2, …, xl be the variables assigned true by ρ, and y1, y2, …, yk−l be the variables assigned false by ρ. Then the clause
    cρ := x1 ∧ x2 ∧ ··· ∧ xl ∧ ¬y1 ∧ ¬y2 ∧ ··· ∧ ¬yk−l
is true under ρ, and cρ is false under every other truth assignment.
We can now construct a DNF proposition ψdnf that is logically equivalent to φ by “or”ing together the clause cρ for each truth assignment ρ that makes φ true. Build the truth table for φ, and let Sφ denote the set of truth assignments for p1, …, pk under which φ is true. If the truth assignments in Sφ are {ρ1, ρ2, …, ρm}, then define
    ψdnf := cρ1 ∨ cρ2 ∨ ··· ∨ cρm. (∗)
It’s easy to see that ψdnf is true under every truth assignment ρ under which φ was true (because the clause cρ is true under ρ). And, for a truth assignment ρ under which φ was false, every disjunct in ψdnf evaluates to false, so the entire disjunction is false under such a ρ, too. Thus φ ≡ ψdnf.
There’s one thing we have to be careful about: what happens if Sφ = ∅—that is, if φ is unsatisfiable? (This issue is the minor bug we mentioned before the theorem statement.) The construction in (∗) doesn’t work, but it’s easy to handle this case too: we simply choose an unsatisfiable DNF proposition like p ∧ ¬p as ψdnf.
Note that, although we didn’t phrase it as such from the beginning, our proof of Theorem 4.11 was actually a proof by cases, with two cases corresponding to φ being unsatisfiable and φ being satisfiable.
As an illustration, let’s use the construction from Theorem 4.11 to transform an example proposition into DNF:
Example 4.26 (Converting p ⇒ (q ∧ r) to DNF)
Problem: Find a proposition in DNF logically equivalent to p ⇒ (q ∧ r).
Solution: To convert p ⇒ (q ∧ r) to DNF, we start from the truth table, and then “or” together the propositions corresponding to each row that’s marked as True:
p  q  r  |  q ∧ r  |  p ⇒ (q ∧ r)  |  clause for this row
T  T  T  |    T    |       T       |  p ∧ q ∧ r
T  T  F  |    F    |       F       |  p ∧ q ∧ ¬r
T  F  T  |    F    |       F       |  p ∧ ¬q ∧ r
T  F  F  |    F    |       F       |  p ∧ ¬q ∧ ¬r
F  T  T  |    T    |       T       |  ¬p ∧ q ∧ r
F  T  F  |    F    |       T       |  ¬p ∧ q ∧ ¬r
F  F  T  |    F    |       T       |  ¬p ∧ ¬q ∧ r
F  F  F  |    F    |       T       |  ¬p ∧ ¬q ∧ ¬r
Our DNF proposition will therefore have five clauses, one for each of the five truth assignments under which this implication is true:
    (p ∧ q ∧ r) ∨ (¬p ∧ q ∧ r) ∨ (¬p ∧ q ∧ ¬r) ∨ (¬p ∧ ¬q ∧ r) ∨ (¬p ∧ ¬q ∧ ¬r),
where the five clauses correspond to the rows TTT, FTT, FTF, FFT, and FFF, respectively.
Problem-solving tip: Be on the lookout for special cases (like an unsatisfiable φ in Theorem 4.11), and see whether you can handle them separately from the argument for the “typical” case.
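The construction from Theorem 4.11 is mechanical enough to sketch in code. The following Python sketch is an illustration only and is not part of the original text: the names to_dnf and implies_example are made up here, the proposition is assumed to be given as a Python function of Boolean keyword arguments, at least one variable is assumed, and the output writes not/and/or in place of ¬/∧/∨.

```python
from itertools import product


def to_dnf(formula, variables):
    """Build a DNF expression (as a string) equivalent to `formula`.

    One clause is produced for each truth-table row in which `formula`
    is true, mirroring the construction in the proof of Theorem 4.11.
    """
    clauses = []
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(**assignment):
            # This clause is true under `assignment` and under no other row.
            literals = [name if assignment[name] else "not " + name
                        for name in variables]
            clauses.append("(" + " and ".join(literals) + ")")
    if not clauses:
        # Special case from the proof: an unsatisfiable proposition.
        first = variables[0]
        return "(" + first + " and not " + first + ")"
    return " or ".join(clauses)


def implies_example(p, q, r):
    """The proposition p => (q and r) from Example 4.26."""
    return (not p) or (q and r)


dnf = to_dnf(implies_example, ["p", "q", "r"])
```

Here dnf holds five clauses, one per satisfying row, in the order TTT, FTT, FTF, FFT, FFF. Like the construction itself, the sketch inspects every row of the truth table, so its running time grows exponentially in the number of variables.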
Conjunctive normal form
Now that we’ve proven that we can translate any proposition into disjunctive normal form (the “or of ands”), we’ll turn our attention to conjunctive normal form (the “and of ors”).
Theorem 4.12 (All propositions are expressible in CNF)
For any proposition φ, there exists a proposition φcnf in conjunctive normal form such that φ ≡ φcnf.
Though it’s not initially obvious, Theorem 4.12 actually turns out to be easy to prove by making use of the DNF result. The crucial idea—and, once again, it’s an idea that requires some genuine creativity to come up with!—is that it’s fairly simple to turn the negation of a DNF proposition into a CNF proposition. So, to build a CNF proposition logically equivalent to φ, we’ll construct a DNF proposition that is logically equivalent to ¬φ; we can then negate that DNF proposition and use De Morgan’s Laws to convert the resulting proposition into CNF. Here are the details:
Proof. If φ is a tautology, the task is easy; just define φcnf = p ∨ ¬p.
Otherwise, φ is a nontautology, say over the variables p1, …, pk. Using Theorem 4.11, we can construct a DNF proposition ψ that is logically equivalent to ¬φ. (Note that, using our construction from Theorem 4.11, the proposition ψ will have k literals in every clause, because ¬φ is satisfiable.) Thus the form of ψ will be
    ψ = (c_1^1 ∧ ··· ∧ c_k^1) ∨ (c_1^2 ∧ ··· ∧ c_k^2) ∨ ··· ∨ (c_1^m ∧ ··· ∧ c_k^m)
for some m ≥ 1, where each c_i^j is a literal. Recall that ψ ≡ ¬φ, so we also know that ¬ψ ≡ φ. Let’s negate ψ:
    ¬ψ = ¬[(c_1^1 ∧ ··· ∧ c_k^1) ∨ (c_1^2 ∧ ··· ∧ c_k^2) ∨ ··· ∨ (c_1^m ∧ ··· ∧ c_k^m)]
       ≡ ¬(c_1^1 ∧ ··· ∧ c_k^1) ∧ ¬(c_1^2 ∧ ··· ∧ c_k^2) ∧ ··· ∧ ¬(c_1^m ∧ ··· ∧ c_k^m)
             [De Morgan’s Law: ¬(p ∨ q) ≡ ¬p ∧ ¬q]
       ≡ (¬c_1^1 ∨ ··· ∨ ¬c_k^1) ∧ (¬c_1^2 ∨ ··· ∨ ¬c_k^2) ∧ ··· ∧ (¬c_1^m ∨ ··· ∨ ¬c_k^m).
             [De Morgan’s Law: ¬(p ∧ q) ≡ ¬p ∨ ¬q, applied once per clause]
But this expression is in CNF once we remove any doubly negated literals—that is, we replace any occurrences of ¬¬p by p instead. Thus we’ve constructed a proposition in conjunctive normal form that’s logically equivalent to ¬ψ ≡ φ.
Problem-solving tip: Try being lazy first! Think about whether there’s a way to use a previously established result to make the current problem easier.

As an illustration of this construction, let’s convert p ⇒ (q ∧ r)—which we converted to DNF in Example 4.26—to conjunctive normal form too:
Example 4.27 (Converting p ⇒ (q ∧ r) to CNF)
In Example 4.26, we converted the proposition φ = p ⇒ (q ∧ r) into DNF. Here we’ll convert it into CNF, using Theorem 4.12. Again, we start from the truth table for ¬φ:
p  q  r  |  q ∧ r  |  φ = p ⇒ (q ∧ r)  |  ¬φ = ¬(p ⇒ (q ∧ r))  |  clause for this row
T  T  T  |    T    |         T         |           F           |  p ∧ q ∧ r
T  T  F  |    F    |         F         |           T           |  p ∧ q ∧ ¬r
T  F  T  |    F    |         F         |           T           |  p ∧ ¬q ∧ r
T  F  F  |    F    |         F         |           T           |  p ∧ ¬q ∧ ¬r
F  T  T  |    T    |         T         |           F           |  ¬p ∧ q ∧ r
F  T  F  |    F    |         T         |           F           |  ¬p ∧ q ∧ ¬r
F  F  T  |    F    |         T         |           F           |  ¬p ∧ ¬q ∧ r
F  F  F  |    F    |         T         |           F           |  ¬p ∧ ¬q ∧ ¬r
We first construct a DNF proposition equivalent to ¬φ. This proposition has three clauses, one for each of the truth assignments under which ¬φ is true (and φ is false):
    ¬φ ≡ (p ∧ q ∧ ¬r) ∨ (p ∧ ¬q ∧ r) ∨ (p ∧ ¬q ∧ ¬r),
where the three clauses correspond to the rows TTF, TFT, and TFF, respectively.
We negate this proposition and use De Morgan’s Laws to push around the negations:
    φ ≡ ¬[(p ∧ q ∧ ¬r) ∨ (p ∧ ¬q ∧ r) ∨ (p ∧ ¬q ∧ ¬r)]
      ≡ ¬(p ∧ q ∧ ¬r) ∧ ¬(p ∧ ¬q ∧ r) ∧ ¬(p ∧ ¬q ∧ ¬r)          [De Morgan]
      ≡ (¬p ∨ ¬q ∨ ¬¬r) ∧ (¬p ∨ ¬¬q ∨ ¬r) ∧ (¬p ∨ ¬¬q ∨ ¬¬r)    [De Morgan]
      ≡ (¬p ∨ ¬q ∨ r) ∧ (¬p ∨ q ∨ ¬r) ∧ (¬p ∨ q ∨ r).           [Double Negation]
So (¬p ∨ ¬q ∨ r) ∧ (¬p ∨ q ∨ ¬r) ∧ (¬p ∨ q ∨ r) is a CNF proposition that’s logically equivalent to p ⇒ (q ∧ r). We can verify via truth table that this proposition is indeed logically equivalent to p ⇒ (q ∧ r).
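The CNF construction can be sketched the same way. Negating the DNF for ¬φ and applying De Morgan’s Laws amounts to writing one disjunctive clause per row in which φ is false, with every literal flipped. The Python below is a hedged illustration, not the book’s code: to_cnf, equivalent, and implies_example are names invented here, the proposition is assumed to be a Python function of Boolean keyword arguments with at least one variable, and not/and/or stand in for ¬/∧/∨.

```python
from itertools import product


def to_cnf(formula, variables):
    """Build a CNF expression (as a string) equivalent to `formula`.

    Each row where `formula` is false contributes one clause that is
    false in exactly that row: the De Morgan negation of the row's
    DNF clause, with double negations already removed.
    """
    clauses = []
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if not formula(**assignment):
            literals = ["not " + name if assignment[name] else name
                        for name in variables]
            clauses.append("(" + " or ".join(literals) + ")")
    if not clauses:
        # Special case from the proof: a tautology.
        first = variables[0]
        return "(" + first + " or not " + first + ")"
    return " and ".join(clauses)


def equivalent(formula, expression, variables):
    """Check by truth table that `expression` agrees with `formula`."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if bool(formula(**assignment)) != bool(eval(expression, {}, assignment)):
            return False
    return True


def implies_example(p, q, r):
    """The proposition p => (q and r) from Example 4.27."""
    return (not p) or (q and r)


cnf = to_cnf(implies_example, ["p", "q", "r"])
```

For this input, cnf has three clauses, one for each of the rows TTF, TFT, and TFF, and equivalent(implies_example, cnf, ["p", "q", "r"]) carries out the truth-table verification mentioned at the end of Example 4.27. The use of eval is only reasonable because the expression string was generated by the sketch itself.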
One last comment about these proofs: it’s worth emphasizing again that there’s genuine creativity required in proving these theorems. Through the strategies from Section 4.3.2 and through practice, you can get better at having the kinds of creative ideas that lead to proofs—but that doesn’t mean that these results should have been “obvious” to you in advance. It takes a real moment of insight to see how to use the truth table to develop the DNF proposition to prove Theorem 4.11, or how to use the DNF formula of the negation to prove Theorem 4.12.
Taking it further: Theorems 4.11 and 4.12 said that “a proposition ψ (of a particular form) exists for every φ”—but our proofs actually described an algorithm to build ψ from φ. (That’s a more computational way to approach a question: a statement like “such-and-such exists!” is the kind of thing more typically proven by mathematicians, and “a such-and-such can be found with this algorithm!” is a claim more typical of computer scientists.) Our algorithms in Theorems 4.11 and 4.12 aren’t very efficient, unfortunately; they require 2^k steps just to build the truth table for a k-variable proposition. We’ll give a (sometimes, and somewhat) more efficient algorithm in Chapter 5 (see Section 5.4.3) that operates directly on the form of the proposition (“syntax”) rather than on the truth table (“semantics”).

Some other results about propositional logic
In the exercises, you’ll be asked to prove a large collection of other facts about propositional logic. We’ll highlight one of them, which is similar in spirit to the theorems about DNF and CNF: you’ll show that any proposition φ is logically equivalent to a simpler proposition that uses only one kind of logical connective, called “nand.” For reasons of physics, building the physical circuitry for the logical connective nand—as in “not and,” where p nand q means ¬(p ∧ q)—is much simpler than other logical connectives. (The physical reasons relate specifically to the way that transistors—the most basic building blocks for digital circuits—work.) The truth table for nand—also known as the Sheffer stroke |—appears in Figure 4.22.
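To make the definition of nand concrete, here is a small Python sketch (an illustration added here, with all function names invented for the purpose). It defines nand and nor directly from their definitions and then checks, by brute force over all truth assignments, three standard rewritings of not, and, and or that use nand alone. These particular rewritings are well-known identities brought in for illustration; they are not quoted from the text.

```python
from itertools import product


def nand(p, q):
    """p nand q, defined as not (p and q)."""
    return not (p and q)


def nor(p, q):
    """p nor q, defined as not (p or q)."""
    return not (p or q)


def not_via_nand(p):
    return nand(p, p)


def and_via_nand(p, q):
    return nand(nand(p, q), nand(p, q))


def or_via_nand(p, q):
    return nand(nand(p, p), nand(q, q))


def check_nand_identities():
    """Return True if all three rewritings agree with not/and/or on every row."""
    for p, q in product([True, False], repeat=2):
        if not_via_nand(p) != (not p):
            return False
        if and_via_nand(p, q) != (p and q):
            return False
        if or_via_nand(p, q) != (p or q):
            return False
    return True
```

A truth-table check like this verifies only these three specific identities; turning them into a statement about arbitrary propositions requires an argument over the structure of the proposition.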
The Sheffer stroke | is named after the early-20th-century logician Henry Sheffer.
p  q  |  p | q  |  p ↓ q
T  T  |    F    |    F
T  F  |    T    |    F
F  T  |    T    |    F
F  F  |    T    |    T
Figure 4.22: The truth table for nand (also known as the Sheffer stroke |), and nor (also known as Peirce’s arrow ↓).
It turns out that every (every!) logical connective can be expressed in terms of |. In other words, if you have enough nand gates, then you will be able to build any logical circuit that you want. Here is a theorem that formally states this result:
Theorem 4.13 (All propositions are expressible using only |)
For any Boolean formula φ over p1, …, pk, there exists a proposition ψnand-only such that (i) φ ≡ ψnand-only, and (ii) ψnand-only contains only p1, …, pk and the logical connective |.
The theorem follows from Exercise 4.69, where you’ll show that every logical connective can be expressed in terms of |. (To give a fully rigorous proof, we will need to use mathematical induction, the subject of Chapter 5. Mathematical induction will essentially allow us to apply the results of Exercise 4.69 recursively to translate an arbitrary proposition φ into ψnand-only.)
Taking it further: Indeed, real circuits are typically built exclusively out of nand gates, using logical equivalences to construct and/or/not gates from a small number of nand gates. Although it may be initially implausible if this is the first time that you’ve heard it, the processor of a physical computer is essentially nothing more than a giant circuit built out of nand gates and wires. With some thought, you can build a circuit that takes two integers (represented in binary, as a 64-bit sequence) and computes their sum. Similarly, but more thought-provokingly, you can build a circuit that takes an instruction (add these numbers; compare those numbers; save this thing in memory; load the other thing from memory) and performs the requested action. That circuit is a computer!
Incidentally, all of the logical connectives can also be defined in terms of the logical connective known as Peirce’s arrow ↓ and also known as nor, as in “not or.” (You’ll prove the analogous result to Theorem 4.13 for Peirce’s arrow in Exercise 4.70.)
Peirce’s arrow is named after the 19th-century logician Charles Peirce. Its truth table is also shown in Figure 4.22.
4.4.2 The Pythagorean Theorem
Example 4.24 presented the Pythagorean Theorem, which you probably once saw in a long-ago geometry class: the square of the length of the hypotenuse of a right triangle equals the sum of the squares of the lengths of the legs. Let’s prove it. In brainstorming about this theorem, here’s an idea that turns out to be helpful. Because the statement of the Pythagorean theorem involves side lengths raised to the second power (“squared”), we might be able to think about the problem using geometric squares, appropriately configured. Here’s a proof that proceeds using this geometric idea:
The original formulation of the Pythagorean Theorem is attributed to Pythagoras, a Greek mathematician/philosopher who lived around 500 bce.

446 CHAPTER 4. PROOFS
Theorem 4.14 (The Pythagorean Theorem)
Let a and b denote the lengths of the legs of a right triangle, and let c denote the length of its hypotenuse. Then a^2 + b^2 = c^2.
Proof. Starting with the given right triangle in Figure 4.23(a), draw a square with side length c, where one side of the square coincides with the hypotenuse of the given triangle, as in Figure 4.23(b). Now draw three new triangles, each identical to the first. Place these three new triangles symmetrically around the square that we just drew, so that each side of the square coincides with the hypotenuse of one of the four triangles, as in Figure 4.23(c). Each of these four triangles has leg lengths a and b and hypotenuse c. Including both the c-by-c square and the four triangles, the resulting figure is a square with side length a + b.
Figure 4.23: Illustrations for the proof of the Pythagorean Theorem, Theorem 4.14. (a) The right triangle. (b) . . . with an added square. (c) . . . and three added triangles. (The sides in each panel are labeled a, b, and c.)
To complete the proof, we will account for the area of Figure 4.23(c) in two different ways. First, because a square with side length x has area x^2, we have that
    area of the enclosing square = (a + b)^2 = a^2 + 2ab + b^2.
Second, this enclosing square can be decomposed into a c-by-c square and four identical right triangles with leg lengths a and b. Because the area of a right triangle with leg lengths x and y is xy/2, we also have that
    area of the enclosing square = 4 · (area of one triangle) + c^2
                                 = 4 · (1/2)ab + c^2
                                 = 2ab + c^2.
But the area of the enclosing square is the same regardless of whether we count it all together, or in its five disjoint pieces. Therefore a^2 + 2ab + b^2 = 2ab + c^2. The theorem follows by subtracting 2ab from both sides.
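As a quick numerical sanity check of the two area computations (not a substitute for the proof), the following Python sketch evaluates both expressions for given side lengths. The function name area_two_ways is made up for this illustration, and integer side lengths are assumed so that the comparison is exact.

```python
def area_two_ways(a, b, c):
    """Return the enclosing square's area computed both ways.

    First value: the whole (a + b)-by-(a + b) square.
    Second value: the c-by-c square plus four triangles of area ab/2 each,
    written as 2ab + c^2 to stay in integer arithmetic.
    """
    whole = (a + b) ** 2
    pieces = 2 * a * b + c ** 2
    return whole, pieces


right_triangle = area_two_ways(3, 4, 5)
not_right_triangle = area_two_ways(2, 3, 4)
```

For the 3-4-5 right triangle the two values coincide, while for side lengths 2, 3, 4 (not a right triangle, so the picture in the proof does not apply) they differ.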
There are many proofs of the Pythagorean theorem—in fact, hundreds! There is a classic proof attributed to Euclid (see p. 447), and many subsequent and different proof approaches followed over the millennia. There’s even a book that collects over 350 different proofs of the result!7 There’s an important lesson to draw from the many proofs of this theorem: there’s more than one way to do it. Just as there are usually many fundamentally different algorithms for the same problem (think about sorting, for example), there are usually many fundamentally different techniques that can prove the same theorem. Keep an open mind; there is absolutely no shame in proving a result using a different approach than the “standard” way!
7 Elisha Scott Loomis. The Pythagorean Proposition. National Council of Teachers of Mathematics, June 1968.
“There’s more than one way to do it” is also the motto of the programming language Perl.

4.4.3 Prime Numbers
We’ll return to arithmetic for our next set of examples, a pair of proofs about the prime numbers. Recall that a positive integer n ≥ 2 is prime if and only if the only positive integers that divide n evenly are 1 and n itself. Also recall that a positive integer n ≥ 2 that is not prime is called composite. (That is, the integer n is composite if and only if there exists a positive integer k ∈/ {1, n} such that k divides n evenly.)
We’ll start with another example of a proof by contradiction:
Theorem 4.15 (An infinitude of primes)
There are infinitely many prime numbers.
Proof. We proceed by contradiction.
Suppose, for the purposes of deriving a contradiction, that there are only finitely many primes. This assumption means that there is a largest prime number, which we will call p. Consider the integer p!, the factorial of this largest prime p. Let’s consider two separate cases: either p! + 1 is prime, or p! + 1 is not prime.
• If p! + 1 is prime, then we have a contradiction of the assumption that p is the largest prime, because p! + 1 > p is also prime.
• If p! + 1 is not prime, then by definition it is evenly divisible by some integer k satisfying 2 ≤ k ≤ p!. But we proved in Example 4.8 that p! + 1 is not evenly divisible by any integer between 2 and p, inclusive. Thus the smallest integer k ≥ 2 that evenly divides p! + 1 must exceed p. Further, this integer k must be prime—otherwise some 2 ≤ k′ < k divides k and therefore divides p! + 1, but k was the smallest divisor of p! + 1. Thus k > p is prime, and again we have a contradiction of the assumption that p is the largest prime.
In either case, we have a contradiction! Thus the original assumption—there are only finitely many prime numbers—is false, and so there are infinitely many primes.
A similar proof to the one for Theorem 4.15 dates back around 2300 years. It’s due to Euclid, the ancient Greek mathematician after whom Euclidean geometry—and the Euclidean algorithm (see Section 7.2.4)—is named.
We’ll now turn to another result about prime numbers, relating to the primality testing problem: you are given a positive integer n, and you have to determine whether n is prime. The definition of primality says that n is composite if there’s a positive integer k ∉ {1, n} such that k | n, but it should be easy to see that n is composite if and only if there’s an integer k ∈ {2, 3, …, n − 1} such that k | n. (That is, the largest possible divisor of n other than n itself is n − 1.) But we can do better, strengthening this result by shrinking the largest candidate value of k:
Theorem 4.16 (A composite number n has a factor ≤ √n)
A positive integer n ≥ 2 is evenly divisible by some other integer k ∈ {2, 3, …, ⌈√n⌉} if and only if n is composite.
Proof. We’ll proceed by mutual implication.
The forward direction is easy: if there’s some integer k ∈ {2, 3, …, ⌈√n⌉} with k ≠ n such that k evenly divides n, then by definition n is composite. (That integer k satisfies k | n and k ∉ {1, n}.)
For the other direction, assume that the integer n ≥ 2 is composite. By definition of composite, there exists a positive integer k ∉ {1, n} such that n mod k = 0—that is, there exist positive integers k ∉ {1, n} and d such that dk = n, so d | n and k | n. We must have that d ≠ 1 (otherwise dk = 1 · k = k = n, but k < n) and d ≠ n (otherwise dk = nk > n, but dk = n). Thus there exist positive integers d, k ∉ {1, n} such that dk = n. But if both d > √n and k > √n, then dk > √n · √n = n, which contradicts the fact that dk = n. Thus either d ≤ √n or k ≤ √n.
Taking it further: Generating large prime numbers (and testing the primality of large numbers) is a crucial step in many modern cryptographic systems. See p. 454 for a discussion of algorithms for testing primality suggested by these proofs, and a bit about the role that they play in modern cryptographic systems.
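Both theorems suggest simple computations, sketched below in Python. This is an illustrative sketch, not the isPrime or isPrimeBetter algorithm referred to in the text: the names smallest_factor and is_prime are invented here. By Theorem 4.16 it suffices to try candidate divisors up to √n; since divisors are integers, the loop stops at ⌊√n⌋, computed with math.isqrt.

```python
from math import factorial, isqrt


def smallest_factor(n):
    """Return the smallest integer k >= 2 that divides n (for n >= 2).

    By Theorem 4.16, a composite n has a divisor of at most sqrt(n),
    so if no k in {2, ..., floor(sqrt(n))} divides n, then n is prime
    and its smallest factor is n itself.
    """
    for k in range(2, isqrt(n) + 1):
        if n % k == 0:
            return k
    return n


def is_prime(n):
    """Trial-division primality test based on smallest_factor."""
    return n >= 2 and smallest_factor(n) == n


# For a few small primes p, the smallest factor of p! + 1 is a prime
# larger than p, as in the proof of Theorem 4.15.
witnesses = {p: smallest_factor(factorial(p) + 1) for p in [2, 3, 5, 7, 11]}
```

For example, 5! + 1 = 121 is not prime, but its smallest factor, 11, is a prime larger than 5. Trial division takes roughly √n steps, which is far too slow for the very large numbers used in cryptography.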
4.4.4 Uncomputability
We’ll close this section with one of the most important results in computer science, dating from the early 20th century: there are problems that cannot be solved by computers. At that time, great thinkers were pondering some of the most fundamental questions that can be asked in CS. What is a computer? What is computation? What is a program? What tasks can be solved by computers/programs? One of the deepest and most mind-numbing results of this time was a proof, developed independently by Alan Turing and by Alonzo Church, that there are uncomputable problems. That is, there is a problem P for which it’s possible to give a completely formal description of the right answer—but it’s not possible to write a program that solves P.
Here, we’ll prove this theorem. Specifically, we’ll describe the halting problem, and prove that it’s uncomputable. (Informally, the halting problem is: given a function p written in Python and an input x, does p get stuck in an infinite loop when it’s run on x?) The result is a great example of a proof by contradiction, where we will exploit the abyss of self-reference to produce the contradiction.
Problems
Before we address the computability of the halting problem, we have to define precisely what we mean by a “problem” and “computable.” A problem is the kind of task that we wish to solve with a computer program. We will focus on yes–no problems, called decision problems:
Definition 4.17 (Problem)
A problem is a description of a set of valid inputs, and a specification of the corresponding output for each of them. A decision problem is one where the output is either “yes” or “no.”
(In other words, a decision problem is specified by a description of a set of possible inputs, along with a description of those inputs for which the correct answer is “yes.”) We’ve already encountered several decision problems:

Example 4.28 (Some sample decision problems)
• primality: the set of possible inputs is the set of positive integers; the set of “yes” inputs is the set of prime numbers. (The “no” inputs are 1 and the composites.)
• satisfiability: any propositional-logic proposition φ is a valid input, and φ is a “yes” input if and only if φ is satisfiable.
An instance of a problem is a valid input for that problem. (An invalid input is one that isn’t the right “kind of thing” for that problem.) We will refer to an instance x of a problem P as a yes-instance if the correct output is “yes,” and as a no-instance if the correct output is “no.” For example, 17 and 18 are both instances of primality; 17 is a yes-instance, while 18 is a no-instance; p ∨ ¬p is an invalid input.
Computability
Problems are the things that we’ll be interested in solving via computer programs.
Informally, problems that can be solved by computer are called computable and those that cannot be solved by computer are called uncomputable. It’ll be easiest to think of computability in terms of your favorite programming language, whatever it may be. For the sake of concreteness, we’ll pretend it’s Python, though any language would do.
Taking it further: The original definition of computability given by Alan Turing used an abstract device called a Turing machine; a programming language is called Turing complete if it can solve any problem that can be solved by a Turing machine. Every non-toy programming language is Turing complete: Java, C, C++, Python, Ruby, Perl, Haskell, BASIC, Fortran, Assembly Language, whatever.
Formally, we’ll define computability in terms of the existence of an algorithm, which we will think of as a function written in Python:
Definition 4.18 (Computability)
A decision problem P is computable if there exists a Python function A that solves P. That is, P is computable if there exists a Python function A such that, on any input x:
(i) A terminates when run on x.
(ii) A(x) returns True if and only if x is a yes-instance of P.
Notice that we insist that the Python function A must actually terminate on any input x: it’s not allowed to run forever. Furthermore, running A(x) returns True if x is a yes-instance of P and running A(x) returns False if x is a no-instance of P.
The decision problems from Example 4.28 are both computable:
Example 4.29 (Computability of some sample decision problems)
• primality is computable: both isPrime and isPrimeBetter (p. 454) are algorithms that could be implemented as a Python function that (i) terminates when run on any positive integer, and (ii) returns True on input n if and only if n is prime.
• satisfiability is computable, too: as we discussed in Section 3.3.1, we can exhaustively try all truth assignments for φ, checking whether any of them satisfies φ. This algorithm is slow—if φ has n variables, there are 2^n different truth assignments—but it is guaranteed to terminate for any input φ, and correctly decides whether φ is satisfiable.
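The exhaustive-search algorithm for satisfiability is short enough to sketch directly. In the Python below, which is an illustration with made-up names, the proposition is assumed to be supplied as a Python function of Boolean arguments together with its number of variables; the function tries all 2^n truth assignments and always terminates.

```python
from itertools import product


def is_satisfiable(formula, num_variables):
    """Return True if some truth assignment makes `formula` true.

    Tries every one of the 2**num_variables assignments, so it always
    terminates, but its running time is exponential in num_variables.
    """
    for values in product([True, False], repeat=num_variables):
        if formula(*values):
            return True
    return False


satisfiable_example = is_satisfiable(lambda p, q: p and not q, 2)
unsatisfiable_example = is_satisfiable(lambda p: p and not p, 1)
```

The first example is satisfied by p = True, q = False; the second, p ∧ ¬p, is false under both assignments to p.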
Programs that take source code as input
The inputs to the problems or programs that we’ve talked about so far have been integers (for primality) or Boolean formulas (for satisfiability). Of course, other input types like rational numbers or lists are possible, too. Programs that take programs as input are a particularly important category.
Taking it further: Although you might not have thought about them in these terms, you’ve frequently encountered programs that take programs as input. For example, in any introductory CS class, you’ve seen one frequently: the Python interpreter python, the Java compiler javac, and the C compiler gcc all take programs (written in Python or Java or C, respectively) as input.
It’s easy to think up some decision problems where the input is a Python program. Here’s one, about commenting code. (For example, it’s not hard to imagine an Intro CS instructor setting up an automated grading system for programs that gives an automatic zero to any submitted assignment that contains no comments.)
Example 4.30 (The commented decision problem)
Define the decision problem commented as follows:
Input: the Python source code s for a function
Output: “yes” if s contains at least one comment; “no” otherwise.
In Python, a comment starts with # and goes until the end of the line, so as long as a # appears somewhere in the source code s—and not inside quotation marks—then s is a yes-instance of commented; otherwise s is a no-instance.
The commented problem is computable: testing whether s is a yes-instance can be done by looking at the characters of s one by one, and testing to see whether any one of those characters starts a comment. A Python program commentedTester that solves commented is shown in Figure 4.24. (The details of testing whether a character is inside quotes are omitted from the source code, but otherwise the code for commentedTester is valid, runnable Python code.)
Consider running commentedTester on the other instances shown in Figure 4.24. Observe that absoluteValue is a no-instance of commented, because it doesn’t contain the comment character # at all, and isEven is a yes-instance of commented, because it contains three comments. As desired, if we ran commentedTester on these two pieces of source code, the output would be False and True, respectively.
Figure 4.24: Python source code for three functions.

def commentedTester(sourceCode):
    for character in sourceCode:
        if character == "#"
                  and isn't inside quotes:
            return True
    return False

def absoluteValue(n):
    if n > 0:
        return n
    else:
        return -1 * n

def isEven(n):
    # % is Python's mod operator
    if n % 2 == 0:
        return True    # n is even
    else:
        return False   # n is odd
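For readers who want a fully runnable tester, one option is to let Python’s standard tokenize module decide which # characters actually begin comments, instead of tracking quotation marks by hand. The sketch below is a substitute technique, not the book’s commentedTester: the name has_comment and the sample source strings are introduced here for illustration, the input is assumed to be syntactically valid Python, and (unlike commentedTester) this function’s own source does contain comments.

```python
import io
import tokenize


def has_comment(source_code):
    """Return True if the given Python source contains a comment token."""
    # generate_tokens needs a readline-style callable that returns strings.
    readline = io.StringIO(source_code).readline
    for token in tokenize.generate_tokens(readline):
        if token.type == tokenize.COMMENT:
            return True
    return False


ABSOLUTE_VALUE_SOURCE = (
    "def absoluteValue(n):\n"
    "    if n > 0:\n"
    "        return n\n"
    "    else:\n"
    "        return -1 * n\n"
)

IS_EVEN_SOURCE = (
    "def isEven(n):\n"
    "    # % is Python's mod operator\n"
    "    if n % 2 == 0:\n"
    "        return True    # n is even\n"
    "    else:\n"
    "        return False   # n is odd\n"
)

# A '#' that appears only inside a string literal is not a comment.
HASH_IN_STRING_SOURCE = 'def hashSign():\n    return "#"\n'
```

On these three inputs, has_comment returns False, True, and False, respectively, which matches the yes/no classification described for the commented problem.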

Example 4.30 showed that the decision problem commented is computable by giving a Python function commentedTester that solves commented. Because we can run commentedTester on any piece of Python source code we please, let’s do something a little bizarre: let’s run commentedTester on the source code for commentedTester itself (!). There weren’t any comments in commentedTester—the only # in the code is inside quotes—so the source code of commentedTester is a no-instance of commented. Put a different way, if sct denotes the source code of commentedTester, then running sct on sct returns False. This idea of taking some source code s and running s on s itself will be essential in the rest of this section.
Figure 4.25: A reminder of the Python source code for commentedTester.

def commentedTester(sourceCode):
    for character in sourceCode:
        if character == "#"
                  and isn't inside quotes:
            return True
    return False

The Halting Problem
The key decision problem that we’ll consider is the halting problem:
Definition 4.19 (The Halting Problem)
Define the decision problem haltingProblem as follows:
Input: a pair ⟨s, x⟩, where s is the source code of a syntactically valid Python function that takes one argument, and x is any value;
Output: “yes” if s terminates when run on input x; “no” otherwise.
That is, ⟨s, x⟩ is a yes-instance of haltingProblem if s(x) terminates (doesn’t get stuck in an infinite loop), and it’s a no-instance if s(x) does get stuck in an infinite loop.
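The phrase “running s on x,” and in particular running s on s itself, can be made concrete in Python with the built-in exec. The sketch below is only an illustration: the helper run_on and the sample function countLines are hypothetical names introduced here, the source is assumed to define exactly one function, and exec should only ever be used on source code that you trust.

```python
COUNT_LINES_SOURCE = (
    "def countLines(sourceCode):\n"
    "    return len(sourceCode.splitlines())\n"
)


def run_on(source, argument):
    """Define the one function described by `source`, then call it on `argument`."""
    namespace = {}
    exec(source, namespace)
    functions = [value for name, value in namespace.items()
                 if callable(value) and not name.startswith("__")]
    return functions[0](argument)


# Self-application: run the source code of countLines on that same source code.
lines_in_own_source = run_on(COUNT_LINES_SOURCE, COUNT_LINES_SOURCE)
```

Here the source code is just a string, so nothing prevents passing it as the argument to the very function it describes; countLines reports that its own source has two lines.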
We can now use the idea of running a function with itself as input to show that the Halting Problem is uncomputable, by contradiction:
Theorem 4.17 (Uncomputability of the Halting Problem)
haltingProblem is uncomputable.
Proof. We give a proof by contradiction. Suppose for the sake of contradiction that the Halting Problem is computable—that is, assume
    There’s a Python function Ahalting solving the Halting Problem. (1)
(In other words, for the Python source code s of a one-argument function, and any value x, running Ahalting(s, x) always terminates, and returns True if and only if running s on x does not result in an infinite loop.)
Now consider the Python function makeSelfSafe in Figure 4.26. The function makeSelfSafe takes as input the Python source code s of a one-argument function, tests whether running s on s itself is “safe” (does not cause an infinite loop), and if it’s safe then it runs s on s. We claim that makeSelfSafe never gets stuck in an infinite loop:
    For any Python source code s, makeSelfSafe(s) terminates. (2)
Figure 4.26: The Python code for makeSelfSafe.

makeSelfSafe(s):    # the input s is the Python source
                    # code of a one-argument function.
  1   safe = Ahalting(s, s)
  2   if safe:
          run s on input s
  3   return True

To see that (2) is true, observe that Step 1 of the algorithm always terminates, by assumption (1). Step 2 of the algorithm ensures that s is called on input s if and only if Ahalting(s, s) said that s terminates when run on s. And, by assumption, Ahalting is always correct. Thus s is run on input s only if s terminates when run on input s. So Step 2 of the algorithm always terminates. And Step 3 of the algorithm doesn’t do anything except return, so it terminates immediately. Thus (2) follows.
Write smss to denote the Python source code of makeSelfSafe. Because smss is itself Python source code, Fact (2) implies that
    makeSelfSafe(smss) terminates. (3)
In other words, running smss on smss terminates. Thus, by the assumption (1) that Ahalting is correct, we can conclude that
    Ahalting(smss, smss) returns True. (4)
But now consider what happens when we run makeSelfSafe on its own source code—that is, when we compute makeSelfSafe(smss). Observe that safe is set to True in Step 1 of the algorithm, by Fact (4). Thus Step 2 calls makeSelfSafe(smss) recursively! But therefore makeSelfSafe(smss) calls makeSelfSafe(smss), which calls makeSelfSafe(smss), and so on, ad infinitum. In other words,
    makeSelfSafe(smss) does not terminate. (5)
But (3) and (5) are contradictory! Thus the only assumption that we made, namely (1), was false. Therefore there does not exist a correct always-terminating algorithm for the Halting Problem. That is, the Halting Problem is uncomputable.
To summarize Theorem 4.17: we showed that the assumption of the existence of an algorithm for the halting problem leads to a contradiction, and therefore we conclude that such an algorithm cannot exist. The contradiction is, at its heart, about self-reference—an algorithmic version of the Liar’s Paradox: This sentence is false.
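The self-referential trap can be acted out in Python for any concrete candidate halting tester. The sketch below is an illustration under loud assumptions, not a proof and not the book’s code: a candidate is any two-argument function that returns True or False, the names observe, always_yes, always_no, MAX_DEPTH, and LooksInfinite are invented here, and “nesting deeper than MAX_DEPTH” is used as a crude stand-in for “never terminates” (which no program can detect in general).

```python
MAX_DEPTH = 25  # assumption: deeper self-application than this stands in for "runs forever"


class LooksInfinite(Exception):
    """Raised when self-application nests deeper than MAX_DEPTH."""


# Source code of makeSelfSafe, following the three steps of Figure 4.26.
# The names Ahalting and run are supplied from outside when the source is executed.
MSS_SOURCE = (
    "def makeSelfSafe(s):\n"
    "    safe = Ahalting(s, s)\n"
    "    if safe:\n"
    "        run(s, s)\n"
    "    return True\n"
)


def observe(candidate):
    """Return (candidate's prediction, observed behavior) for makeSelfSafe on its own source."""
    depth = 0

    def run(source, argument):
        nonlocal depth
        depth += 1
        if depth > MAX_DEPTH:
            raise LooksInfinite()
        namespace = {"Ahalting": candidate, "run": run}
        exec(source, namespace)
        return namespace["makeSelfSafe"](argument)

    predicted_to_halt = candidate(MSS_SOURCE, MSS_SOURCE)
    try:
        run(MSS_SOURCE, MSS_SOURCE)
        actually_halted = True
    except LooksInfinite:
        actually_halted = False
    return predicted_to_halt, actually_halted


def always_yes(s, x):
    """A hypothetical candidate that always answers 'terminates'."""
    return True


def always_no(s, x):
    """A hypothetical candidate that always answers 'does not terminate'."""
    return False
```

Both sample candidates are wrong about makeSelfSafe run on its own source: always_yes predicts termination but the call keeps re-invoking itself until the depth limit trips, while always_no predicts an infinite loop but the call returns immediately. The proof of Theorem 4.17 shows that every candidate must fail in one of these two ways.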
Taking it further: Computability theory is the study of what problems can and cannot be solved by computers. Computability was a primary focus of theoretical computer science from the 1930s through roughly the 1970s. (After that time, the focus of theoretical computer scientists began to shift to complexity theory, which addresses the question of what problems can and cannot be solved efficiently by computers.) You can read more about the halting problem in any textbook on computability theory, and in Douglas Hofstadter’s amazing book Gödel, Escher, Bach.8 For extra amusement, you can even find a full proof of Theorem 4.17 in poem form, in Figure 4.27. And see p. 455 for a discussion of some practically relevant problems that are also uncomputable.
8 Dexter Kozen. Automata and Computability. Springer, 1997; Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012; and Douglas Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Vintage, 1980.

Scooping the Loop Snooper: A proof that the Halting Problem is undecidable
Geoffrey K. Pullum
No general procedure for bug checks will do.
Now, I won’t just assert that, I’ll prove it to you.
I will prove that although you might work till you drop,
you cannot tell if computation will stop.
For imagine we have a procedure called P
that for specified input permits you to see
whether specified source code, with all of its faults,
defines a routine that eventually halts.
You feed in your program, with suitable data,
and P gets to work, and a little while later
(in finite compute time) correctly infers
whether infinite looping behavior occurs.
If there will be no looping, then P prints out ‘Good.’
That means work on this input will halt, as it should.
But if it detects an unstoppable loop,
then P reports ‘Bad!’—which means you’re in the soup.
Well, the truth is that P cannot possibly be,
because if you wrote it and gave it to me,
I could use it to set up a logical bind
that would shatter your reason and scramble your mind.
Here’s the trick that I’ll use—and it’s simple to do.
I’ll define a procedure, which I will call Q,
that will use P’s predictions of halting success
to stir up a terrible logical mess.
For a specified program, say A, one supplies,
the first step of this program called Q I devise
is to find out from P what’s the right thing to say
of the looping behavior of A run on A.
If P’s answer is ‘Bad!’, Q will suddenly stop.
But otherwise, Q will go back to the top,
and start off again, looping endlessly back,
till the universe dies and turns frozen and black.
And this program called Q wouldn’t stay on the shelf;
I would ask it to forecast its run on itself.
When it reads its own source code, just what will it do?
What’s the looping behavior of Q run on Q?
If P warns of infinite loops, Q will quit;
yet P is supposed to speak truly of it!
And if Q’s going to quit, then P should say ‘Good.’
Which makes Q start to loop! (P denied that it would.)
No matter how P might perform, Q will scoop it:
Q uses P’s output to make P look stupid.
Whatever P says, it cannot predict Q:
P is right when it’s wrong, and is false when it’s true!
I’ve created a paradox, neat as can be—
and simply by using your putative P.
When you posited P you stepped into a snare;
Your assumption has led you right into my lair.
So where can this argument possibly go?
I don’t have to tell you; I’m sure you must know.
A reductio: There cannot possibly be
a procedure that acts like the mythical P.
You can never find general mechanical means
for predicting the acts of computing machines;
it’s something that cannot be done. So we users
must find our own bugs. Our computers are losers!
Figure 4.27: A proof of Theorem 4.17, in poetic form, from 9 Geoffrey K. Pullum. Scooping the loop snooper: A proof that the halting problem is undecidable. Mathematics Magazine, 73(4):319–320, 2000. Used by permission of Geoffrey K. Pullum.
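The construction of Q in the poem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: make_q and claimed_p are hypothetical names, and claimed_p(program, data) stands in for the (impossible) procedure P, returning 'Good' if it predicts halting and 'Bad!' if it predicts an infinite loop.

```python
def make_q(claimed_p):
    """Build Q from a claimed halting checker (a hypothetical stand-in for P).

    claimed_p(program, data) is assumed to return 'Good' if it predicts that
    program halts on data, and 'Bad!' if it predicts an infinite loop.
    """
    def q(program):
        if claimed_p(program, program) == "Bad!":
            return "halted"   # P predicted a loop, so Q stops at once
        while True:           # P predicted halting, so Q loops forever
            pass
    return q


# A checker that always answers 'Bad!' is refuted by Q run on Q:
always_bad = lambda program, data: "Bad!"
q = make_q(always_bad)
result = q(q)   # P claimed that Q loops on Q, yet this call returns
```

A checker that answers 'Good' on (Q, Q) fares no better: in that case the call q(q) never returns, so that prediction is wrong too. (We do not run that case here, for obvious reasons.)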

Computer Science Connections
Cryptography and the Generation of Prime Numbers
As we’ll see in Section 7.5, prime numbers are used extensively in cryptography. The RSA cryptosystem—named after the first letters of its inventors’ last names10—uses as a primary step the generation of two large prime numbers, perhaps ≈128-bit integers.
The primary reason that prime numbers are useful in cryptography is an asymmetry in the apparent difficulty of two directions of a problem. If you are given two (big) prime numbers p and q, then computing their product pq is easy. But if you are given a number n that is guaranteed to be the product of two prime numbers, finding those two numbers—factoring n—appears to be much harder. For example, if you’re told that n = 504,761, it will probably take you a long time to figure out that n = 251 · 2011. But if you’re told that p = 251 and q = 2011, then you should be able to calculate pq = 504,761 in just a few seconds.
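To make the asymmetry concrete, here is a small Python sketch using the numbers from the paragraph above. The helper smallest_factor is a hypothetical name introduced for illustration, and naive trial division is just one (slow) way to factor; it is not how serious factoring algorithms work.

```python
def smallest_factor(n):
    """Return (k, tried): the smallest factor k >= 2 of n found by trial
    division, and the number of candidate divisors that were tried."""
    tried = 0
    k = 2
    while k * k <= n:
        tried += 1
        if n % k == 0:
            return k, tried
        k += 1
    return n, tried   # no divisor found, so n itself is prime


p, q = 251, 2011
n = p * q                            # the easy direction: one multiplication
factor, tried = smallest_factor(n)   # the hard direction: many trial divisions
```

Here the multiplication is a single step, while recovering 251 from 504,761 by trial division takes 250 divisibility checks. For primes with hundreds of digits, the gap becomes astronomically large.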
A crucial step in RSA, then, is the generation of large prime numbers. This step can be accomplished by choosing a random integer of the appropriate size and then testing whether that number is prime. (We keep retrying until the random number turns out to be prime.)
A little consideration of the definition of primality implies that we can test whether an integer n is prime using the algorithm in Figure 4.28, which tests all candidate divisors between 2 and n − 1. This algorithm requires us to do roughly n divisibility checks (actually, to be precise, n − 2 divisibility checks). Using Theorem 4.16, the algorithm can be improved to do only about √n divisibility checks, as in Figure 4.29.
We can test these two algorithms empirically. A Python implementation using n − 1 calls to isPrime to find all primes in the integers {2, . . . , n} took about three minutes for n = 65,536 on a 2010-era laptop. For the same n, isPrimeBetter took about a second. This difference is a nice example of the way in which theoretical, proof-based techniques can improve actual widely used algorithms.
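For readers who want to repeat this kind of experiment, here is one possible Python rendering of the two algorithms. It is a sketch rather than the implementation that was timed above: the names is_prime and is_prime_better are our own, and math.isqrt assumes Python 3.8 or later.

```python
import math


def is_prime(n):
    """Figure 4.28: try every candidate divisor k = 2, ..., n - 1."""
    k = 2
    while k < n:
        if n % k == 0:
            return False
        k += 1
    return True


def is_prime_better(n):
    """Figure 4.29: try only k = 2, ..., ceil(sqrt(n)), ignoring k == n."""
    limit = math.isqrt(n)
    if limit * limit < n:
        limit += 1            # limit is now the ceiling of sqrt(n)
    k = 2
    while k <= limit:
        if n % k == 0 and n != k:
            return False
        k += 1
    return True


primes_slow = [n for n in range(2, 1000) if is_prime(n)]
primes_fast = [n for n in range(2, 1000) if is_prime_better(n)]
```

Both lists agree, as Theorem 4.16 promises. Wrapping the two list computations in calls to time.perf_counter (and raising the upper limit) is one way to observe the gap in running time on your own machine.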
In part because of its importance to cryptography, there has been significant work on algorithms for primality testing over recent decades—improving far beyond the roughly √n division tests of isPrimeBetter. In general, an efficient algorithm for a number n should require a number of steps proportional to log n rather than proportional to n or even √n. (For example, when you add two 10-digit numbers by hand, you want to do about 10 operations, rather than about 1,000,000,000 operations.) Thus isPrimeBetter is still not as efficient as we’d like.
There are some very efficient randomized algorithms for primality testing which are actually used in real cryptosystems, including the Miller-Rabin test.11 This randomized algorithm performs a (randomly chosen) test that all prime numbers pass and most composite numbers fail; repeating with many different randomly chosen tests decreases the probability of getting a wrong answer to an arbitrarily small number. (See p. 742.) And more recently, three researchers gave the first theoretically efficient algorithm for primality testing that’s not randomized.12
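Here is a minimal sketch of the Miller-Rabin idea in Python, together with the "keep retrying until the random number is prime" loop described earlier. The names is_probable_prime and random_prime, and the default of 20 rounds, are our own choices for illustration. Python's random module is not cryptographically secure, so this code should not be used to generate real keys.

```python
import random


def is_probable_prime(n, rounds=20):
    """Miller-Rabin test. False means n is certainly composite. A composite n
    survives any single round with probability at most 1/4."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7):
        if n % small == 0:
            return n == small
    # Write n - 1 = 2**r * d with d odd.
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    for _ in range(rounds):
        a = random.randrange(2, n - 1)   # the randomly chosen test
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue                     # n passes this round
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break                    # n passes this round
        else:
            return False                 # a proves that n is composite
    return True


def random_prime(bits):
    """Draw random odd integers with exactly `bits` bits until one passes."""
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate


p = random_prime(64)
```

Each round costs only a few modular exponentiations, so the work grows with the number of digits of n rather than with n or √n.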
10 R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21:120–126, February 1978.
isPrime(n):
    k := 2
    while k < n:
        if n is evenly divisible by k then
            return False
        k := k + 1
    return True
Figure 4.28: Slow primality testing.

isPrimeBetter(n):
    k := 2
    while k ≤ ⌈√n⌉:
        if n is evenly divisible by k and n ≠ k then
            return False
        k := k + 1
    return True
Figure 4.29: Faster primality testing. (We could further save roughly another factor of two by checking only k = 2 and odd k ≥ 3.)

11 Gary L. Miller. Riemann’s hypothesis and tests for primality. Journal of Computer and System Sciences, 13(3):300–317, 1976; and Michael O. Rabin. Probabilistic algorithm for testing primality. Journal of Number Theory, 12(1):128–138, 1980.
12 Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. Primes is in P. Annals of Mathematics, 160:781–793, 2004.

Computer Science Connections
Other Uncomputable Problems (That You Might Care About)
The Halting Problem may seem like a purely abstract problem, and therefore one that doesn’t matter in the real world—sure, it’d be nice to have an infinite-loop detector in your Python interpreter or Java compiler, but would it just be a vaguely helpful feature for students in Intro CS classes but nobody else? The answer is a resounding no: while the Halting Problem itself may seem obscure, there are many uncomputable problems that, if solved, would vastly improve operating systems or compilers. But they’re uncomputable, and therefore the desired improvements cannot be made.
Here’s one example. Modern operating systems use virtual memory for their applications. The physical computer has a limited amount of physical memory—say, eight gigabytes of RAM—that applications can use. But the operating system “pretends” that it has a much larger amount of memory, so that the word processor, web browser, Java compiler, and solitaire game can each act as though they had even more than eight gigabytes of memory that they don’t have to share. Memory (both virtual and real) is divided into chunks of a fixed size, called pages.
The operating system stores pages that are actively in use in physical memory (RAM), and relegates some of the not-currently-used pages to the hard drive. At every point in time, the operating system’s paging system decides which pages to leave in physical memory, and which pages to “eject” to the hard drive. (This idea is the same as what you do when you’re cooking several dishes in a kitchen with limited counter space: you have to relegate some of the not-currently-being-prepared ingredients to the fridge. And at every moment you have to decide which ingredients to leave on the counter, and which to “eject” to the fridge.) See Figure 4.30.
Here’s a problem that a paging system would love to solve: given a page p of memory that an application has used, will that application ever access the contents of p again? Let’s call this problem willBeUsedAgain. When the paging system needs to eject a page, ideally it would eject a page that’s a no-instance of willBeUsedAgain, because it will never have to bring that page back into physical memory. (When you’re out of counter space, you would of course prefer to put away some ingredient that you’re done using.)
Unfortunately for operating system designers, willBeUsedAgain is uncomputable. There’s a very quick proof, based on the uncomputability of the Halting Problem. Consider the algorithm:
1. run the Python function f on the input x.
2. if f(x) terminates, then access some memory from page p.
This algorithm accesses page p if and only if ⟨f, x⟩ is a yes-instance of the Halting Problem. Therefore if we could give an algorithm to solve the willBeUsedAgain problem, then we could give an algorithm to solve the Halting Problem. But we already know that we can’t give an algorithm to solve the Halting Problem. If p ⇒ q and ¬q, then we can conclude ¬p; therefore willBeUsedAgain is uncomputable.
Figure 4.30: A sample sequence of memory fetches in a paged memory system.
(a) Initial configuration, with pages #1,2,6 in memory, and remaining pages on disk. RAM: 1, 2, 6. Hard Disk: 3, 4, 5, 7, 8, . . .
(b) Program requests data on page #2. It’s in memory, so it’s just fetched; nothing else happens. RAM: 1, 2, 6. Hard Disk: 3, 4, 5, 7, 8, . . .
(c) Program requests data on page #4. It’s on disk, so it’s fetched and replaces some page in RAM—say, #1. RAM: 4, 2, 6. Hard Disk: 1, 3, 5, 7, 8, . . .
(d) Program requests data on page #1. It’s on disk, so it’s fetched and replaces some page in RAM—say, #6. RAM: 4, 2, 1. Hard Disk: 3, 5, 6, 7, 8, . . .

4.4.5 Exercises
Figure 4.31 shows the truth tables for all 16 different binary logical operators, with each column named if it’s a logical operator that we’ve already seen. A set S of binary operators is said to be universal if every binary logical operation can be expressed using some combination of the operators in S. Formally, a set S is universal if, for every Boolean expression φ over variables p1, . . . , pk, there exists a Boolean expression ψ that is logically equivalent to φ where ψ uses only the variables p1, . . . , pk and the logical connectives in S.
4.66 Prove that the set {∨, ∧, ⇒, ¬} is universal. (Hint: To do so, you need to show that, for each column i from 1 through 16 of Figure 4.31, you can build a Boolean expression φi over the variables p and q that uses only the operators {∨, ∧, ⇒, ¬}, and such that φi is logically equivalent to the operator in column i applied to p and q.)
4.67 Prove that the set {∨, ∧, ¬} is universal. (Hint: once you’ve done Exercise 4.66, all you have to do is show that you can express ⇒ using {∨, ∧, ¬}.)
4.68 Prove that {∨, ¬} and {∧, ¬} are both universal.
4.69 Prove that the set {|}—the set containing just the Sheffer stroke, that is, nand—is universal.
4.70 Prove that the singleton set {↓} is universal.
4.71 Prove that the set {∧, ∨} is not universal. (Hint: what happens under the all-true truth assignment?)
4.72 Let φ be a fully quantified proposition of predicate logic.
Prove that φ is logically equivalent to a fully quantified proposition ψ in which all quantifiers are at the outermost level of ψ. In other words, the proposition ψ must be of the form ∀/∃ x1 : ∀/∃ x2 : · · · ∀/∃ xk : P(x1, x2, . . . , xk), where each ∀/∃ is either a universal or existential quantifier. (The transformation that you performed in Exercise 3.178 put Goldbach’s Conjecture in this special form.) (Hint: you might find the results from Exercises 4.66–4.71 helpful. Using these results, you can assume that φ has a very particular form.)
4.73 Prove that, for any integer n ≥ 1, there is an n-variable logical proposition φ in conjunctive normal form such that the truth-table translation to DNF (from Theorem 4.11) yields a DNF proposition with exponentially more clauses than φ has.
4.74 Prove that the area of a right triangle with legs x and y is xy/2.
4.75 Use Figure 4.32(a) as an outline to give a different proof of the Pythagorean theorem.
4.76 Exercise 4.47 asked you to prove (via algebra) the Arithmetic Mean–Geometric Mean inequality: for x, y ∈ R≥0, we have √(xy) ≤ (x + y)/2. Here you’ll reprove the result geometrically. Suppose that x ≥ y, and draw two circles of radius x and y tangent to each other, and tangent to a horizontal line. See Figure 4.32(b). Considering the right triangle shown in that diagram, and using the Pythagorean theorem and the fact that the hypotenuse is the longest side of a right triangle, prove the result again.

Figure 4.31: The full set of binary logical operators. (Columns 12 and 14 are unnamed.)
      1     2    3    4  5    6  7    8    9    10   11  12  13  14  15   16
p q   True  p∨q  p⇐q  p  p⇒q  q  p⇔q  p∧q  p|q  p⊕q  ¬q  -   ¬p  -   p↓q  False
T T   T     T    T    T  T    T  T    T    F    F    F   F   F   F   F    F
T F   T     T    T    T  F    F  F    F    T    T    T   T   F   F   F    F
F T   T     T    F    F  T    T  F    F    T    T    F   F   T   T   F    F
F F   T     F    T    F  T    F  T    F    T    F    T   F   T   F   T    F

Figure 4.32: More on the Pythagorean Theorem. (a) Another way to prove the Pythagorean Theorem. (b) Using the Pythagorean Theorem for the Arithmetic Mean/Geometric Mean inequality.

Let x, y ∈ R² be two points in the plane.
As usual, denote their coordinates by x1 and x2, and y1 and y2, respectively. The Euclidean distance between these points is the length of the line that connects them: √((x1 − y1)² + (x2 − y2)²). The Manhattan distance between them is |x1 − y1| + |x2 − y2|: the number of blocks that you would have to walk “over” plus the number that you’d have to walk “up” to get from one point to the other. Denote these distances by d_euclidean and d_manhattan.
4.77 Prove that d_euclidean(x, y) ≤ d_manhattan(x, y) for any two points x, y.
4.78 Prove that there exists a constant a such that both
• d_manhattan(x, y) ≤ a · d_euclidean(x, y) for all points x and y; and
• there exist points x∗, y∗ such that d_manhattan(x∗, y∗) = a · d_euclidean(x∗, y∗).
A positive integer n is called a perfect number if it is equal to the sum of all positive integer factors 1 ≤ k < n of n. For example, the number 14 is not perfect: the numbers less than 14 that evenly divide 14 are {1, 2, 7}, but 1 + 2 + 7 = 10 ≠ 14.
4.79 Prove that at least one perfect number exists.
4.80 Prove that, for any prime integer p, the positive integer p² is not a perfect number.
4.81 Let n ≥ 10 be any positive integer. Prove that the set {n, n + 1, . . . , n + 5} contains at most two prime numbers.
4.82 Let n be any positive integer. Prove or disprove: any set of ten consecutive positive integers {n, n + 1, . . . , n + 9} contains at least one prime number.
4.83 (Thanks to the NPR radio show Car Talk, from which I learned this exercise.) Imagine a junior high school, with 100 lockers, numbered 1 through 100. All lockers are initially closed. There are 100 students, each of whom—brimming with teenage angst—systematically goes through the lockers and slams some of them shut and yanks some of them open. Specifically, in round i := 1, 2, . . . , 100, student #i changes the state of every ith locker: if the door is open, then it’s slammed shut; if the door is closed, then it’s opened.
(So student #1 opens them all, student #2 closes all the even-numbered lockers, etc.) Which lockers are open after this whole process is over? Prove your answer.
4.84 We proved the following claim in Theorem 4.16: A positive integer n ≥ 2 is evenly divisible by some other integer k ∈ {2, 3, . . . , ⌈√n⌉} if and only if n is composite. If we delete the word “other,” this claim becomes false. Prove that this modified claim is false.
4.85 Prove that the unmodified claim (retaining the word “other”) remains true if the bounds on k are changed from k ∈ {2, 3, . . . , ⌈√n⌉} to k ∈ {⌈√n⌉, . . . , n − 1}.
4.86 Prove that the bound cannot be changed from k ∈ {2, 3, . . . , ⌈√n⌉} to k ∈ {⌊√n/2⌋, . . . , ⌊3√n/2⌋}. That is, prove that the following claim is false: A positive integer n ≥ 2 is evenly divisible by some other integer k ∈ {⌊√n/2⌋, . . . , ⌊3√n/2⌋} if and only if n is composite.
4.87 Let n be any positive integer, and let pn denote the smallest prime number that evenly divides n. Prove that there are an infinite number of integers n such that pn ≥ √n. (This fact establishes that we cannot change the bound in the aforementioned theorem to anything smaller than √n.)

4.5 Common Errors in Proofs
Mistakes were made.
Ron Ziegler (1939–2003), press secretary for President Richard Nixon during Watergate

We’ve now spent considerable time establishing a catalogue of proof techniques that you can use to prove theorems, along with some examples of these techniques in action. We’ll close this chapter with a brief overview of some common flaws in proofs, so that you can avoid them in your own work (and be on the lookout for them in the work of others).
Recall that a proof consists of a sequence of logical inferences, deriving new facts from assumptions or previously established facts. A valid inference is one whose conclusion is always true as long as the facts that it relies on were true. (That is, a valid step never creates a false statement from true ones.)
An invalid inference is one in which the conclusion can be false even if the premises are all true. An invalid argument can also be called a logical fallacy, a fallacious argument, or just a fallacy. In a correct proof, of course, every step is valid. Here are a few examples of a single logical inference, some of which might be fallacious:
Example 4.31 (Some (valid and invalid) logical inferences)
Problem: Here are several inferences. In each case, there are two premises, and a conclusion that is claimed to follow logically from those premises. Which of these inferences are valid, and which are fallacies?
1. Premises: (a) All software is buggy. (b) Windows is a piece of software. Conclusion: Therefore, Windows is buggy.
2. Premises: (a) All people are annoying sometimes. (b) Mark Zuckerberg is a person. Conclusion: Therefore, Mark Zuckerberg is annoying sometimes.
3. Premises: (a) If you handed in an exam without your name on it, then you got a zero. (b) You handed in an exam without your name on it. Conclusion: Therefore, you got a zero.
4. Premises: (a) If you handed in an exam without your name on it, then you got a zero. (b) You handed in an exam with your name on it. Conclusion: Therefore, you didn’t get a zero.
Solution: We abstract away from buggy software and annoying people by rewriting these arguments in purely logical form:
1. Assume a ∈ S and assume ∀x ∈ S : P(x). Conclude P(a).
2. Assume a ∈ S and assume ∀x ∈ S : P(x). Conclude P(a).
3. Assume p ⇒ q and assume p. Conclude q.
4. Assume p ⇒ q and assume ¬p. Conclude ¬q.
Problem-solving tip: To make the logical structure of an argument clearer, consider an abstract form of the argument in which you use variables to name the atomic propositions.
In this format, we see first that (1) and (2) are actually the same logical argument (with different meanings for the symbols), and they’re both valid. Argument (3) is precisely an invocation of Modus Ponens (see Chapter 3), and it’s valid.
But (4) is a fallacy: the fact that p ⇒ q and ¬p is consistent with either q or ¬q, so in particular when p = False and q = True the premises are true but the conclusion is false.
Each of these examples purports to convince its reader of its conclusion, under the assumption that the premises are true. Valid arguments will convince any (reasonable) reader that their conclusion follows from their premises. Fallacious arguments are buggy; a vigilant reader will not accept the conclusion of a fallacious argument even if she accepts the premises.
Taking it further: A useful way to think about validity and fallacy is as follows. An argument with premises p1, p2, . . . , pk and conclusion c is valid if and only if p1 ∧ p2 ∧ · · · ∧ pk ⇒ c is a theorem. If there is a circumstance in which p1 ∧ p2 ∧ · · · ∧ pk ⇒ c is false—in other words, where the premises p1 ∧ p2 ∧ · · · ∧ pk are all true but the conclusion c is false—then the argument is fallacious.
Some of the most famous disasters in the history of computer science have come from some bugs that arose because of an erroneous understanding of some property of a system—and a lack of valid proof of correctness for the system. These bugs have been costly, with both lives and many dollars lost. See p. 464 for a few highlights/lowlights.
Your main job in proofs is simple: avoid fallacies! But that can be harder than it sounds. The remainder of this section is devoted to a few types of common mistakes in proofs—that is, some common types of fallacies.
A broken proof
The most common mistake in a purported proof is simple but insidious: a single statement is alleged to follow logically from previous statements, but it doesn’t. Here’s a somewhat subtle example:
Example 4.32 (What’s wrong with this logic?)
Problem: Find the error in this purported proof, and give a counterexample to the claim.
False Theorem: Let Fn = {k ∈ Z≥1 : k | n} denote the factors of an integer n ≥ 2. Then |Fn| is even.
Proof. Let Fsmall ⊆ Fn be the set of factors of n that are less than √n. Let Fbig ⊆ Fn be the set of factors of n that are greater than √n. Observe that every d ∈ Fsmall has a unique entry n/d corresponding to it in Fbig. Therefore |Fsmall| = |Fbig|. Let k = |Fsmall| = |Fbig|. Note that k is an integer. Thus Fn contains precisely k elements less than √n and k elements greater than √n, and so |Fn| = 2k, which is an even number.
Solution: The problem comes right at the end of the proof:
Thus Fn contains precisely k elements less than √n and k elements greater than √n, and so |Fn| = 2k.
The problem is that this statement discounts the possibility that √n itself might be in Fn. For an integer n that’s a perfect square, we have that √n ∈ Fn, and therefore |Fn| = 2k + 1. For example, the integer 9 is a counterexample, because F9 = {1, 3, 9} and |F9| = 3.
Problem-solving tip: The kind of mistake in Example 4.32, in which there’s a single step that doesn’t follow from the previous step, can sometimes be difficult to sniff out. But it’s the kind of bug that you can spot by simply being überskeptical of everything that’s written in a purported proof.
But while an error of this form—one step in the proof that doesn’t actually follow from the previously established facts—may be the most common type of bug in a proof, there are some other, more structural errors that can arise. Most of these structural errors come from errors of propositional logic—namely by proving a new proposition that’s not in fact logically equivalent to the given proposition. Here are a few of these types of flawed reasoning.
Fallacy: proving true
We are considering a claim φ. We proceed as follows: we assume φ, and (correctly) prove True under that assumption. (Usually, for some reason, the “proof” writer puts a little check mark in their alleged proof at this point: ✓.) What can we conclude about φ? The answer is: absolutely nothing!
The reason: we’ve proven that φ ⇒ True, but anything implies true. (Both True ⇒ True and False ⇒ True are true implications.) Here’s a classical example of a bogus proof that uses this fallacious reasoning:
Example 4.33 (What’s wrong with this logic?)
Problem: Find the error in this purported proof.
False Theorem: 1 = 0.
Proof. Suppose that 1 = 0. Then:
    1 = 0
therefore, multiplying both sides by 0,
    0 · 1 = 0 · 0
and therefore,
    0 = 0. ✓
And, indeed, 0 = 0. Thus the assumption that 1 = 0 was correct, and the theorem follows.
Solution: We have merely shown that (1 = 0) ⇒ (0 = 0), which does not say anything about the truth or falsity of 1 = 0; anything implies true.
Writing tip: When you’re trying to prove that two quantities a and b are equal, it’s generally preferable to manipulate a until it equals b, rather than “meeting in the middle” by manipulating both sides of the equation until you reach a line in which the two sides are equal. The “manipulate a until it equals b” style of argument makes it clear to the reader that you are proving a = b rather than proving (a = b) ⇒ True.
Fallacy: affirming the consequent
We are considering a claim φ. We prove (correctly) that φ ⇒ ψ, and we prove (correctly) that ψ. We then conclude φ. (Recall that ψ is the consequent of the implication φ ⇒ ψ, and we have “affirmed” it by proving ψ.) This “proof” is wrong because it confuses necessary and sufficient conditions: when we prove φ ⇒ ψ, we’ve shown that one way for ψ to be true is for φ to be true. But there might be other reasons that ψ is true! Here’s an example of a fallacious argument that uses this bogus logic:
Example 4.34 (What’s wrong with this logic?)
Problem: Find the error in this argument:
Premises: (1) If it’s raining, then the computer burning will be postponed. (2) The computer burning was postponed.
Conclusion: Therefore, it’s raining.
Solution: This fallacious argument is an example of affirming the consequent. The first premise here merely says that the computer burning will be postponed if it rains; it does not say that rain is the only reason that the burning could be postponed. There may be many other reasons why the burning might be delayed: for example, the inability to find a match, the sudden vigilance of the health and safety office, or a last-minute stay of execution by the owner of the computer.
Fallacy: denying the hypothesis
Denying the hypothesis is a closely related fallacy to affirming the consequent: we prove (correctly) that ψ ⇒ φ, and we prove (correctly) that ¬ψ; we then (fallaciously) conclude ¬φ. This logic is buggy for essentially the same reason as affirming the consequent. (In fact, denying the hypothesis is the contrapositive of affirming the consequent—and therefore a fallacy too, because it’s logically equivalent to a fallacy.) The implication ψ ⇒ φ means that one way of φ being true is for ψ to be true, but it does not mean that there is no other way for φ to be true. Here’s an example of a fallacious argument of this type:
Example 4.35 (What’s wrong with this logic?)
Problem: Find the error in this argument:
Premises: (1) If you have resolved the P-versus-NP question, then you are famous. (2) You have not resolved the P-versus-NP question.
Conclusion: Therefore, you are not famous.
Solution: This fallacious argument is an example of denying the hypothesis. The first premise says that one way to be famous is to resolve the P-versus-NP question (see p. 326 for a brief description of this problem), but it does not say that resolving the P-versus-NP question is the only way to be famous. For example, you could be famous by being the President of the United States or by founding Google.
Fallacy: false dichotomy
A false dichotomy or false dilemma is a fallacious argument in which two nonexhaustive alternatives are presented as exhaustive (without acknowledgement that there are any unmentioned alternatives).
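Before looking at an example of a false dichotomy, it is worth noting that the two propositional fallacies just discussed (affirming the consequent and denying the hypothesis) can be checked mechanically. An argument form over propositional variables is valid exactly when no truth assignment makes every premise true and the conclusion false. The following Python sketch tries all assignments; the helper names implies and is_valid are our own, introduced only for illustration.

```python
from itertools import product


def implies(a, b):
    """Truth value of the implication a => b."""
    return (not a) or b


def is_valid(premises, conclusion, num_vars=2):
    """Valid iff every truth assignment that satisfies all of the premises
    also satisfies the conclusion."""
    for values in product([True, False], repeat=num_vars):
        if all(prem(*values) for prem in premises) and not conclusion(*values):
            return False   # counterexample: true premises, false conclusion
    return True


# Modus Ponens: from p => q and p, conclude q.
modus_ponens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: p], lambda p, q: q)

# Affirming the consequent: from p => q and q, conclude p.
affirming_consequent = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: q], lambda p, q: p)

# Denying the hypothesis: from p => q and not p, conclude not q.
denying_hypothesis = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: not p], lambda p, q: not q)
```

Only the first of the three forms passes the check. Both fallacies fail on the same assignment, p = False and q = True, which makes their premises true and their conclusions false.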
Example 4.36 (False Dichotomy)
The flawed step in Example 4.32 can be interpreted as a false dilemma: implicitly, that proof relied on the assertion that if k evenly divides n, then
k ∈ Fsmall = {factors of n that are less than √n} or k ∈ Fbig = {factors of n that are greater than √n}.
But of course the third unmentioned possibility is that k = √n.
(The classical false dichotomy, often found in political rhetoric, is “either you’re with us or you’re against us”: actually, you might be neutral on the issue, and therefore neither “with” nor “against” us!)
Fallacy: begging the question
We wish to prove a proposition φ. A purported proof of φ that begs the question is one that assumes φ along the way. That is, the “proof” assumes precisely the thing that it purports to prove, and thus actually proves φ ⇒ φ. Although this type of fallacious reasoning sounds ridiculous, the assumption of the desired result can be very subtle; you must be vigilant to catch this type of error. Here’s an example of a fallacious argument of this kind:
Example 4.37 (What’s wrong with this logic?)
Problem: Find the error in this proof:
False Theorem: Let n be a positive integer such that n + n² is even. Then n is odd.
Proof. Assume the antecedent—that is, assume that n + n² is even. Let k be the integer such that n = 2k + 1. Then
    n + n² = 2k + 1 + (2k + 1)²
           = 2k + 1 + 4k² + 4k + 1
           = 4k² + 6k + 2
           = 2 · (2k² + 3k + 1),
which is even because it is equal to 2 times an integer. But n² = (2k + 1)² = 4k² + 4k + 1 is odd (because 4k² and 4k are both even). Therefore
    n = (n + n²) − n²,
where n + n² is even by the above argument and n² is odd by the above argument. An even number less an odd number is an odd number, which implies that n must be odd too.
Solution: The problem comes very early in the “proof,” in the sentence
Let k be the integer such that n = 2k + 1.
But this statement implicitly assumes that n is an odd integer; an integer k such that n = 2k + 1 exists only if n is odd. So the proof begs the question: it assumes that n is odd, and—after some algebraic shenanigans—concludes that n is odd.
Problem-solving tip: Even without identifying the specific bug in Example 4.37, we could notice that there’s something fishy by doing the post-proof plausibility check to make sure that all premises were actually used. The “proof” states that it is assuming the antecedent, but we actually derived the fact that n + n² is even. So we never used that assumption in the “proof.” (In fact, n + n² is even for any positive integer n.) But, because we didn’t use the assumption, the same proof works just as well without it as an assumption, so we could use the same “proof” to establish this claim instead: Patently False Theorem: Let n be a positive integer. Then n is odd. Given that this new claim is obviously false, there must be a bug in the proof. The only challenge is to find that bug.
Other fallacies
We have discussed a reasonably large collection of logical fallacies into which some less-than-careful or less-than-scrupulous proof writers may fall. But there are many other types of flaws in arguments that more typically arise in informal contexts; these are the kinds of flawed arguments that are—sadly—often used in politics. (Some of them have analogues in more mathematical settings, too.) Here are a few examples of other types of fallacies that you may encounter in “real-world” arguments:
• Confusing correlation and causation. Phenomena A and B are said to be (positively) correlated if they occur together more often than their individual frequencies would predict. (See Chapter 10.) But just because A and B are correlated does not mean that one causes the other! For example, the user population of Facebook is much younger than is the population at large. We could say, correctly, that Being young is correlated with using Facebook. But Using Facebook makes you young is an obviously absurd conclusion.
(Some correlation-versus-causation mistakes are subtler; your reaction to Being young makes you use Facebook is probably less virulent, but it is equally unsupported by the facts that we’ve cited here.) Always be wary when attempting to infer causal relationships!
• Ad hominem attacks. An ad hominem attack ignores the logical argument and speaks to the arguer: Bob doesn’t know the difference between contrapositive and converse, and he says that n is prime. So n must be composite. (Latin ad hominem: “to the man.”)
• Equivocation or shifting language. This type of argument relies on changes in the meanings of the words/variables in an argument. This shift can be grammatical: Time waits for no man, and no man is an island; therefore, time waits for an island. Or it can be in the semantics of a particular word: 1024 is a prime example of an exact power of two, and prime numbers are evenly divisible only by 1 and themselves; therefore, 1024 is not divisible by 4. A similar type of fallacy can also occur when a variable in a proof is introduced to mean two different things.
Taking it further: This listing is just a brief outline of some of the many invalid techniques of persuasion/propaganda; a much more extensive and thorough list is maintained by Gary Curtis at http://www.fallacyfiles.org/. You might also be interested in books that catalogue fallacious techniques of argument.13
It is always your job to be vigilant—both when reading proofs written by others, and in developing your own proofs—to avoid fallacious reasoning.
13 For example, Madsen Pirie. How to Win Every Argument: The Use and Abuse of Logic. Continuum, 2007.

Computer Science Connections
The Cost of Missing Proofs: Some Famous Bugs in CS
There’s an apocryphal story that the first use of the word “bug” to refer to a flaw in a computer system was in the 1940s: Grace Hopper, a rear admiral in the US Navy and a pioneer in early programming, found a moth (a literal, physical moth) jamming a piece of computer equipment and causing a malfunction. (The story is true, but the Oxford English Dictionary reports uses of “bug” to refer to a technological fault dating back to Thomas Edison in the late 1800s.) But there are many other stories of bugs that are both more important and more true. When a computer system “almost” works—when there’s no proof that it works correctly in all circumstances—there can be grave repercussions, in dollars and lives lost. Here are a few of the most famous, and most costly, bugs in history:14
The Pentium division bug: In 1994, Thomas Nicely, at the time a math professor, discovered a hardware bug in Intel’s new Pentium chip that caused incorrect results when some floating-point numbers were divided by certain other floating-point numbers. The flaw resulted from a lookup table for the division operation that was missing a handful of entries. Although the range of numbers that were incorrectly divided was limited, the resulting brouhaha led to a full Pentium recall and about $500 million in losses for Intel.15
The Ariane 5 rocket: The European Space Agency’s rocket, carrying a $400,000,000 payload of satellites, exploded 40 seconds into its first flight, in 1996. The rocket had engaged its self-destruct system, which was correctly triggered when it strayed from its intended trajectory. But the altered trajectory was caused by a sequence of errors, including an integer overflow error: the rocket’s velocity was too big to fit into the 16-bit variable that was being used to store it.16 (An Ariane 5 rocket was much faster than the Ariane 4 rockets for which the code was originally developed.)
Embarrassingly, the overflow caused a subsystem to output a diagnostic error code that was interpreted as navigation data. More embarrassingly still, this entire subsystem played no role in navigation after liftoff, and would have caused no harm if it were just turned off.
The Therac-25: The Therac-25 was a medical device in use in the mid-1980s that treated tumors with a focused beam of radiation. The device fired a concentrated X-ray beam of extremely high dosage into a diffuser that would reduce the beam’s intensity to the desired levels before it was directed at the patient. But it turned out that a particularly fast touch-typing operator could cause the high-intensity beam to be fired without the diffuser in place: hitting enter at the precise moment that an internal variable reset to zero caused the undiffused beam to be fired. (This kind of bug is called a race condition, in which the output of a system depends crucially on the precise timing of events like operator input.) At least five patients were killed by radiation overdoses.17
Figure 4.33: Image of the Therac-25. Reprinted with permission from
14 For a list of one person’s view of the ten worst bugs in history, including these three and some other sordid tales, see: Simson Garfinkel. History’s worst software bugs. Wired Magazine, 2005.
For more information on these bugs and their aftermath, see:
15 Ivars Peterson. MathTrek: Pentium bug revisited. MAA Online, May 1997.
16 J. L. Lions. Ariane 5 flight 501 failure report: Report by the enquiry board, 1996.
17 Nancy Leveson. Safeware: System Safety and Computers. Pearson Education, Inc., New York, 1995.
4.5.1 Exercises
Identify whether the following arguments are valid or fallacious. Justify your answers.
4.88 Premises: (a) Every programming language that uses garbage collection is slow; and (b) C does not use garbage collection. Conclusion: Therefore, C is slow.
4.89 Premises: (a) If a piece of software is written well, then it was built with its user in mind; and (b) The Firefox web browser is a piece of software that was written with its user in mind. Conclusion: Therefore, the Firefox web browser is written well.
4.90 Premises: (a) If a processor overheats while multiplying, then it overheats while computing square roots; and (b) The xMax processor does not overheat while computing square roots. Conclusion: Therefore, the xMax processor does not overheat while multiplying.
4.91 Premises: (a) Every data structure is either slow at insertions or lookups; and (b) The data structure called the Hackmatack tree is slow at insertions. Conclusion: Therefore, the Hackmatack tree is slow at lookups.
4.92 Premises: (a) Every web server has an IP address; and (b) www.cia.gov is a web server. Conclusion: Therefore, www.cia.gov has an IP address.
4.93 Premises: (a) If a computer system is hacked, then there was user error or the system had a design flaw; and (b) A computer at NASA was hacked; and (c) That computer did not have a design flaw. Conclusion: Therefore, there was user error.
In the next several problems, you will be presented with a false claim and a bogus proof of that false claim. For each, you’ll be asked to (a) identify the precise error in the proof, and (b) give a counterexample to the claim. (Note that saying why the claim is false does not address (a) in the slightest: it would be possible to give a bogus proof of a true claim!)
False Claim #1: Let n be a positive integer and let p, q ∈ Z≥2, where p and q are prime. If n is evenly divisible by both p and q, then n is also evenly divisible by pq. (FC-1)
Bogus proof of (FC-1). Because p | n, there exists a positive integer k such that n = pk. Thus, by assumption, we know that q | pk. Because p and q are both prime, we know that p does not evenly divide q, and thus the only way that q | pk can hold is if q | k. Hence k = ql for some positive integer l, and thus n = pk = pql.
Therefore pq | n.
4.94 State precisely what’s wrong with the proof of (FC-1).
4.95 Give a counterexample to (FC-1).
False Claim #2: 721 is prime. (FC-2)
Bogus proof of (FC-2). In Example 4.8, we proved that n! + 1 is not evenly divisible by any k satisfying 2 ≤ k ≤ n. Observe that 6! = 720. Therefore, 721 = 6! + 1 isn’t evenly divisible by any integer between 2 and 720 inclusive, and therefore 721 is prime.
4.96 State precisely what’s wrong with the proof of (FC-2).
4.97 Without using a calculator, disprove (FC-2).
4.98 Without using a calculator, find an integer n such that n! + 1 is prime.
False Claim #3: $\sqrt{2}/4$ and $8/\sqrt{2}$ are both rational. (FC-3)
Bogus proof of (FC-3). In Example 4.12, we proved that if x and y are rational then xy is rational too. Here, let $x = \sqrt{2}/4$ and $y = 8/\sqrt{2}$. Then $xy = \frac{\sqrt{2}}{4} \cdot \frac{8}{\sqrt{2}} = \frac{8\sqrt{2}}{4\sqrt{2}} = 2$. So xy = 2 is rational, and x and y are too.
4.99 State precisely what’s wrong with the proof of (FC-3).
4.100 Prove that $8/\sqrt{2}$ isn’t rational.
False Claim #4: Let n be any integer. Then $12 \mid n$ if and only if $12 \mid n^2$. (FC-4)
Bogus proof of (FC-4), similar to Example 4.19. We proceed by mutual implication.
• First, assume that $12 \mid n$. Then, by definition, there exists an integer k such that n = 12k. Therefore $n^2 = (12k)^2 = 12 \cdot (12k^2)$. Thus $12 \mid n^2$ too.
• Second, we must show the converse: if $12 \mid n^2$, then $12 \mid n$. We prove the contrapositive. Assume that $12 \nmid n$. Then there exist integers k and r ∈ {1, ..., 11} such that n = 12k + r. Therefore $n^2 = (12k + r)^2 = 144k^2 + 24kr + r^2 = 12(12k^2 + 2kr) + r^2$. Because r < 12, adding $r^2$ to a multiple of 12 does not result in another multiple of 12. Thus $12 \nmid n^2$.
4.101 State precisely what’s wrong with the proof of (FC-4).
4.102 Disprove (FC-4).
False Claim #5: $\sqrt{4}$ is irrational. (FC-5)
Bogus proof of (FC-5). We’ll follow the same outline as Example 4.21. Our proof is by contradiction. Assume that $\sqrt{4}$ is rational. Therefore, there exist integers n and d ≠ 0 such that $n/d = \sqrt{4}$, where n and d have no common divisors. Squaring both sides yields that $n^2/d^2 = 4$, and therefore that $n^2 = 4d^2$. Because $4d^2$ is divisible by 4, we know that $n^2$ is divisible by 4. Therefore, by the same logic as in Example 4.19, we have that n is itself divisible by 4. Because n is divisible by 4, there exists an integer k such that n = 4k, which implies that $n^2 = 16k^2$. Thus $n^2 = 16k^2$ and $n^2 = 4d^2$, so $d^2 = 4k^2$. Hence $d^2$ is divisible by four. But now we have a contradiction: we assumed that n/d was in lowest terms, but we have now shown that $n^2$ and $d^2$ are both divisible by 4, and therefore both n and d must be even! Thus the original assumption was false, and $\sqrt{4}$ is irrational.
4.103 State precisely what’s wrong with the proof of (FC-5).
False Claim #6: 3 ≤ 2. (FC-6)
Bogus proof of (FC-6). Let x and y be arbitrary nonnegative numbers. Because y ≥ 0 implies −y ≤ y, we can add x to both sides of this inequality to get
$x - y \le x + y$. (1)
Similarly, adding y − 3x to both sides of −x ≤ x yields
$y - 4x \le y - 2x$. (2)
Observe that whenever a ≤ b and c ≤ d, we know that ac ≤ bd. So we can combine (1) and (2) to get
$(x - y)(y - 4x) \le (x + y)(y - 2x)$. (3)
Multiplying out and then combining like terms, we have
$xy - 4x^2 - y^2 + 4xy \le xy - 2x^2 + y^2 - 2xy$, and (4)
$6xy \le 2x^2 + 2y^2$. (5)
This calculation was valid for any x, y ≥ 0. For $x = y = \sqrt{1/2}$, we have $xy = x^2 = y^2 = (\sqrt{1/2})^2 = 1/2$. Plugging into (5), we have
$(6/2) \le (2/2) + (2/2)$. (6)
In other words, we have 3 ≤ 2.
4.104 State precisely what’s wrong with the proof of (FC-6).
Computer vision is the subfield of computer science devoted to developing algorithms that can “understand” images. For example, some security systems use facial recognition software to decide whether to grant access to a particular person. We desire to maximize the probability that the vision algorithm we choose gets the answer right; that is, it grants access to the person if and only if that person is authorized to enter.
Suppose that we have two algorithms, A and B, that we have employed on two different cameras in a test run. Suppose that algorithm A is deployed on Camera I. It makes the correct decision on 75% of the CS majors at Camera I and 60% of philosophy majors at Camera I. (That is, when a CS major arrives at Camera I, algorithm A correctly decides whether to grant her access 75% of the time.) Algorithm B, deployed at Camera II, makes the correct decision on 70% of CS majors and 50% of philosophy majors. The following claim seems obvious, because Algorithm A performed better for both philosophy majors and CS majors:
Claim: Algorithm A is right a higher fraction of the time (overall, combining both majors) than Algorithm B.
But the claim is false, as you’ll show!
4.105 The falsehood of this claim (for example, in the scenario illustrated by the next exercise) is called Simpson’s Paradox because the behavior is so counterintuitive. State precisely where the following argument goes wrong: Observe that Algorithm A had a better success probability with CS majors, and also had a better success probability with philosophy majors. Therefore Algorithm A was right a higher fraction of the time (in total, for both philosophy majors and CS majors) than Algorithm B.
4.106 Suppose that there were 100 CS majors and 100 philosophy majors who went by Camera I. Suppose that 1000 CS majors and 100 philosophy majors went by Camera II. Calculate the success rate for Algorithm A at Camera I, over all people. Do the same for Algorithm B at Camera II.
4.107 Here is an obviously false theorem, together with a (nonobviously) bogus proof. Identify precisely the flaw in the argument and explain where the proof fails.
False Theorem: 1 = 0.
Proof. Consider the four shapes in Figure 4.34(a), and the two arrangements thereof in Figure 4.34(b). (See below.) The area of the triangle in the first configuration is 13 · 5/2 = 65/2, as it forms a right triangle with height 5 and base 13.
But the second configuration also forms a right triangle with height 5 and base 13 as well, and therefore it too has area 65/2. But the second configuration has one unfilled square in the triangle, and thus we have
$0 = \frac{65}{2} - \frac{65}{2}$
= area of the second bounding triangle − area of the first bounding triangle
= (1 + area of four constituent shapes) − (area of four constituent shapes)
= 1.
Thus 0 = 1.
Figure 4.34: Some shapes and their arrangements, for Exercise 4.107. (a) The shapes. (b) Two configurations.
The following two statements are theorems from geometry that you may recall from high school:
• the angles of a triangle sum to precisely 180°.
• if the three angles of triangle T1 are precisely equal to the three angles of T2, then T1 and T2 are similar, and their sides are in the same ratios. (That is, if the side lengths of T1 are a, b, c and the side lengths of T2 are x, y, z, then a/x = b/y = c/z.)
These statements are theorems, but they’re used in the following utterly bogus “proof” of the Pythagorean Theorem (actually one that was published, in 1896!).
4.108 State precisely what’s wrong with the following purported proof of the Pythagorean Theorem.
Proof. Consider an arbitrary right triangle. Let the two legs and hypotenuse, respectively, have length a, b, and c, and let the angles between the legs and the hypotenuse be given by θ and φ = 90° − θ. (See Figure 4.35(a).) Draw a line perpendicular to the hypotenuse to the opposite vertex, dividing the interior of the triangle into two separate sections, which are shaded with different colors in Figure 4.35(b). Observe that the unlabeled angle within the smaller shaded interior triangle must be φ = 90° − θ, because the other angles of the smaller shaded interior triangle are (just like for the enclosing triangle) 90° and θ. Similarly, the unlabeled angle within the larger shaded interior triangle must be θ.
Therefore we have three similar triangles, all with angles 90°, θ, and φ. Call the lengths of the previously unnamed sides x, y, and z as in Figure 4.35(c). Now we can assemble our known facts. By assumption, $a^2 = x^2 + y^2$, $b^2 = x^2 + z^2$, and $(y + z)^2 = a^2 + b^2$, which we can combine to yield
$(y + z)^2 = 2x^2 + y^2 + z^2$. (1)
Expanding $(y + z)^2 = y^2 + 2yz + z^2$ and subtracting common terms from both sides, we have
$2yz = 2x^2$, (2)
which, dividing both sides by two, yields
$yz = x^2$. (3)
But (3) is immediate: we know that
$x/y = z/x$ (4)
because the two shaded triangles are similar, and therefore the two triangles have the same ratio of the length of the hypotenuse to the length of the longer leg. Multiplying both sides of (4) by xy gives us $x^2 = yz$, as desired.
Figure 4.35: Diagrams for Exercise 4.108. Panels (a), (b), and (c); in panel (c), note that c = y + z.
4.6 Chapter at a Glance
Error-Correcting Codes
Although the main purpose of this section was to introduce proofs, here’s a brief summary of the results about error-correcting and error-detecting codes, too.
A code is a set $C \subseteq \{0,1\}^n$, where $|C| = 2^k$ for some integer 1 ≤ k ≤ n. A message is an element of $\{0,1\}^k$; the elements of C are called codewords. Consider any codeword c ∈ C and any sequence of up to l errors applied to c to produce c′. The code C can detect l ≥ 0 errors if we can always correctly report “error” or “no error,” and can correct l errors if we can always correctly identify that c was the original codeword.
The Hamming distance between strings $x, y \in \{0,1\}^n$, denoted ∆(x, y), is the number of positions i in which $x_i \ne y_i$. The minimum distance of a code C is the smallest Hamming distance between two distinct codewords of C. The rate of a code with k-bit messages and n-bit codewords is k/n. If the minimum distance of a code C is 2t + 1 for an integer t, then C can detect 2t errors and correct t errors.
The Repetition_l code creates codewords via the l-fold repetition of the message.
This code has rate 1/l and minimum distance l. The Hamming code creates 7-bit codewords from 4-bit messages by adding three different parity bits to the message. This code has rate 4/7 and minimum distance 3.
Any code with messages of length 4 and minimum distance 3 has codewords of length ≥ 7. (Thus the Hamming code has the best possible rate among all such codes.) We can prove this result via a “sphere-packing” argument and a proof by contradiction.
Proofs and Proof Techniques
A proof of a claim φ is a convincing argument that φ is true. (A proof should be written with its audience in mind.) A variety of useful proof techniques can be employed to prove a given claim φ:
• direct proof: we prove φ by repeatedly inferring new facts from known facts to eventually conclude φ. (Sometimes we divide a proof into multiple cases, or “assume the antecedent,” where we prove p ⇒ q by assuming p and deriving q.)
You may also prove φ by proving a claim logically equivalent to φ:
• proof by contrapositive: to prove p ⇒ q, we instead prove ¬q ⇒ ¬p.
• proof by contradiction (or reductio ad absurdum): to prove φ, we instead prove that ¬φ ⇒ False; that is, we prove that ¬φ leads to an absurdity.
We say that y ∈ S with ¬P(y) is a counterexample to the claim ∀x ∈ S : P(x). A proof by construction of the claim ∃x ∈ S : P(x) proceeds by constructing a particular y ∈ S and proving that P(y). A nonconstructive proof establishes ∃x ∈ S : P(x) without giving an explicit y ∈ S for which P(y), for example by proving ∃x ∈ S : P(x) by contradiction.
The process of developing a proof requires persistence, open-mindedness, and creativity. Here’s a helpful three-step plan to use when developing a new proof: (1) understand what you’re trying to do (checking definitions and small examples); (2) do it (by trying the proof techniques catalogued here, and thinking about analogies from similar problems that you’ve solved previously); and (3) think about what you’ve done (reflecting on and trying to improve your proof). Remember that writing a proof is a form of writing! Be kind to your reader.
Some Examples of Proofs
We can use these proof techniques to establish a wide variety of facts about arithmetic, propositional logic, geometry, prime numbers, and computability. For more extensive examples, see Section 4.4. We’ll highlight one result: there are problems that we can formally define, but that cannot be solved by any computer program; these problems (including the Halting Problem) are called uncomputable.
Common Errors in Proofs
A valid inference is one whose conclusion is always true as long as the facts that it relies on were true. An invalid inference is one in which the conclusion can be false even if the premises are all true. An invalid, or fallacious, argument can also be called a logical fallacy or just a fallacy. In a correct proof, of course, every step is valid.
Perhaps the most common error in a proof is simply asserting that a fact φ follows from previously established facts, when actually φ is not implied by those facts. Other common types of fallacious reasoning are structural errors that involve purporting to prove a statement φ, but instead proving a statement that is not logically equivalent to φ. (For example, the fallacy of proving true: a “proof” of φ that assumes φ and proves True. But φ ⇒ True is true regardless of the truth of φ, so this purported proof proves nothing.) Be vigilant; do not let anyone (yourself or others!) get away with fallacious reasoning.
Key Terms and Results
Key Terms
Error-Correcting Codes
• Hamming distance
• code, message, codeword
• error-detecting/correcting code
• minimum distance, rate
• repetition code
• Hamming code
Proofs and Proof Techniques
• proof
• proof techniques:
  – direct proof
  – proof by contrapositive
  – proof by contradiction
• counterexample
• constructive/nonconstructive proof
Some Examples of Proofs
• conjunctive/disjunctive normal form
• uncomputability
• the Halting Problem
Valid and Fallacious Arguments
• valid argument
• fallacious/invalid argument; fallacy
• fallacy: proving true
• fallacy: affirming the consequent
• fallacy: denying the hypothesis
• fallacy: false dichotomy
• fallacy: begging the question
Key Results
Error-Correcting Codes
1. If the minimum distance of a code C is 2t + 1 for an integer t ≥ 0, then C can detect 2t errors and correct t errors.
2. For 4-bit messages and minimum distance 3, there exist codes with rate 1/3 (such as the Repetition_3 code) and with rate 4/7 (such as the Hamming code), but not with rate better than 4/7.
Proofs and Proof Techniques
1. You can prove a claim φ with a direct proof, or by instead proving a different claim that is logically equivalent to φ. Examples include proofs by contrapositive and proofs by contradiction.
2. A useful three-step process for developing proofs is: (1) understand what you’re trying to do; (2) do it; and (3) think about what you’ve done. All three steps are important, and doing each will help with the other steps.
3. Writing a proof is a form of writing.
Some Examples of Proofs
1. All logical propositions are equivalent to propositions in conjunctive/disjunctive normal form, or using only nand.
2. There are infinitely many prime numbers.
3. There are problems that can be specified completely formally that are uncomputable (that is, cannot be solved by any computer program). The Halting Problem is one example.
Valid and Fallacious Arguments
1. There are many common mistakes in proofs that are centered on several types of fallacious reasoning.
These fallacies are essentially all the result of purporting to prove a statement φ by instead proving a statement ψ, where ψ fails to be logically equivalent to φ.
5 Mathematical Induction
In which our heroes wistfully dream about having dreams about dreaming about a very simple and pleasant world in which no one sleeps at all.
5.1 Why You Might Care
Each problem that I solved became a rule which served afterwards to solve other problems.
René Descartes (1596–1650)
Recursion is a powerful technique in computer science. If we can express a solution to problem X in terms of solutions to smaller instances of the same problem X (and we can solve X directly for the “smallest” inputs), then we can solve X for all inputs. There are many examples. We can sort an n-element array A by sorting the left half of A and the right half of A and merging the results together; 1-element arrays are trivially sorted. (That’s merge sort.) We can build an efficient data structure for storing and searching a set of keys by selecting one of those keys k, and building two such data structures for keys < k and for keys > k; to search for a key x, we compare x to k and search for x in the appropriate substructure. And a trivial empty data structure can store an empty set of keys. (That’s a binary search tree.) And many other things are best understood recursively: factorials, the Fibonacci numbers, fractals (see Figure 5.1), and finding the median element of an unsorted array, for example.
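To make the recursive description of merge sort concrete, here is a minimal Python sketch. The function name merge_sort and the choice to return a new list (rather than sorting in place) are assumptions made for illustration; they are not taken from the text.

```python
def merge_sort(a):
    """Return a new sorted list with the elements of a (illustrative sketch)."""
    # Base case: lists of length 0 or 1 are trivially sorted.
    if len(a) <= 1:
        return list(a)

    # Recursive case: sort each half, which is a smaller instance of the same problem.
    mid = len(a) // 2
    left = merge_sort(a[:mid])
    right = merge_sort(a[mid:])

    # Merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two tails is nonempty
    merged.extend(right[j:])
    return merged
```

The base case and the recursive case in this sketch play the same roles that the base case and the inductive case play in a proof by induction.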
Mathematical induction is a technique for proofs that is directly analogous to recursion: to prove that P(n) holds for all nonnegative integers n, we prove that P(0) is true, and we prove that for an arbitrary n ≥ 1, if P(n − 1) is true, then P(n) is true too. The proof of P(0) is called the base case, and the proof that P(n − 1) ⇒ P(n) is called the inductive case. In the same way that a recursive solution to a problem relies on solutions to a smaller instance of the same problem, an inductive proof of a claim relies on proofs of a smaller instance of the same claim.
A full understanding of recursion depends on a thorough understanding of mathematical induction. And many other applications of mathematical induction will arise throughout the book: analyzing the running time of algorithms, counting the number of bitstrings that have a particular form, and many others.
In this chapter, we will introduce mathematical induction, including a few variations and extensions of this proof technique. We will start with the “vanilla” form of proofs by mathematical induction (Section 5.2). We will then introduce strong induction (Section 5.3), a form of proof by induction in which the proof of P(n) in the inductive case may rely on the truth of all of P(0), P(1), …, and P(n − 1) instead of just on P(n − 1). Finally, we will turn to structural induction (Section 5.4), a form of inductive proof that operates directly on recursively defined structures like linked lists, binary trees, or well-formed formulas of propositional logic.
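As a rough illustration of what a "recursively defined structure" looks like in code, here is a small Python sketch of a binary tree together with a function defined by recursion on that structure. The names Node and size are hypothetical, chosen only for this example; a tree is represented as either None (empty) or a Node with two subtrees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A nonempty binary tree: a key plus two (possibly empty) subtrees."""
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def size(tree: Optional[Node]) -> int:
    """Number of nodes in the tree, defined by recursion on its structure."""
    if tree is None:          # base case: the empty tree has no nodes
        return 0
    # recursive case: one node for the root, plus the sizes of both subtrees
    return 1 + size(tree.left) + size(tree.right)
```

A claim about size for every tree would naturally be proven with one case for the empty tree and one case for a Node, mirroring the two branches of the function.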
Figure 5.1: The Von Koch Snowflake fractal, shown at levels {0, 1, 2, 3, 4}. A level-l snowflake consists of three level-l lines. A level-0 line is a straight segment [inline image omitted]; a level-l line consists of four level-(l − 1) lines arranged in the shape _/\_ [inline image omitted].

5.2 Proofs by Mathematical Induction
So if you find nothing in the corridors open the doors, if you find nothing behind these doors there are more floors, and if you find nothing up there, don’t worry, just leap up another flight of stairs. As long as you don’t stop climbing, the stairs won’t end, under your climbing feet they will go on growing upwards.
Franz Kafka (1883–1924) Fürsprecher (Advocates) (c. 1922)
5.2.1 An Overview of Proofs by Mathematical Induction
The principle of mathematical induction says the following: to prove that a statement P(n) is true for all nonnegative integers n, we can prove that P “starts being true” (the base case) and that P “never stops being true” (the inductive case). Formally, a proof by mathematical induction proceeds as follows:
Definition 5.1 (Proof by mathematical induction)
Suppose that we want to prove that P(n) holds for all n ∈ Z≥0. To give a proof by mathematical induction of ∀n ∈ Z≥0 : P(n), we prove the following:
1. the base case: prove P(0).
2. the inductive case: for every n ≥ 1, prove P(n − 1) ⇒ P(n).
When we’ve proven both the base case and the inductive case as in Definition 5.1, we have established that P(n) holds for all n ∈ Z≥0. Here’s an example to illustrate how the base case and inductive case combine to establish this fact:
Example 5.1 (Proving P(5) from a base case and inductive case)
Problem: Suppose we’ve proven both the base case (P(0)) and the inductive case (P(n − 1) ⇒ P(n), for any n ≥ 1) as in Definition 5.1. Why do these two facts establish that P(n) holds for all n ∈ Z≥0? For example, why do they establish P(5)?
Solution: Here is a proof of P(5), using the base case once and the inductive case five times. (At each stage we make use of modus ponens, which, as a reminder, states that from p ⇒ q and p, we can conclude q.)
We know P(0) (5.1) [base case]
and we know P(0) ⇒ P(1) (5.2) [inductive case, with n = 1]
and thus we can conclude P(1). (5.3) [(5.1), (5.2), and modus ponens]
We know P(1) ⇒ P(2) (5.4) [inductive case, with n = 2]
and thus we can conclude P(2). (5.5) [(5.3), (5.4), and modus ponens]
We know P(2) ⇒ P(3) (5.6) [inductive case, with n = 3]
and thus we can conclude P(3). (5.7) [(5.5), (5.6), and modus ponens]
We know P(3) ⇒ P(4) (5.8) [inductive case, with n = 4]
and thus we can conclude P(4). (5.9) [(5.7), (5.8), and modus ponens]
We know P(4) ⇒ P(5) (5.10) [inductive case, with n = 5]
and thus we can conclude P(5). (5.11) [(5.9), (5.10), and modus ponens]
This sequence of inferences established that P(5) is true. We can use the same technique to prove that P(n) holds for an arbitrary integer n ≥ 0, using the base case once and the inductive case n times.
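The chain of inferences in Example 5.1 can be mimicked mechanically. The following Python sketch is only an illustration of the bookkeeping, not a proof checker: the function name derive_up_to and the textual step labels are invented for this example. It records P(0) from the base case and then applies modus ponens once per instance of the inductive case.

```python
def derive_up_to(n):
    """List the facts P(0), ..., P(n) in the order Example 5.1 derives them."""
    known = {0}                          # base case: P(0) is established
    steps = ["P(0) [base case]"]
    for m in range(1, n + 1):
        # The inductive case, instantiated at m, gives P(m - 1) => P(m).
        # Modus ponens applies only because P(m - 1) is already known.
        assert (m - 1) in known
        known.add(m)
        steps.append(f"P({m}) [from P({m - 1}) and P({m - 1}) => P({m})]")
    return steps


if __name__ == "__main__":
    for line in derive_up_to(5):
        print(line)
```

For n = 5 the returned list has six entries: one use of the base case followed by five uses of the inductive case, matching the count stated in the example.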
The principle of mathematical induction is as simple as in Example 5.1—we apply the base case to get started, and then repeatedly apply the inductive case to conclude P(n) for any larger n—but there are several analogies that can help to make proofs by mathematical induction more intuitive; see Figure 5.2.
Dominoes falling: We have an infinitely long line of dominoes, numbered 0, 1, 2, . . . , n, . . .. To convince someone that the nth domino falls over, you can convince them that
• the 0th domino falls over, and
• whenever one domino falls over, the next domino falls over too.
(One domino falls, and they keep on falling. Thus, for any n ≥ 0, the nth domino falls.)
Climbing a ladder: We have a ladder with rungs numbered 0, 1, 2, . . . , n, . . .. To convince someone that a climber climbing the ladder reaches the nth rung, you can convince them that
• the climber steps onto rung #0.
• if the climber steps onto one rung, then she also steps onto the next rung.
(The climber starts to climb, and the climber never stops climbing. Thus, for any n ≥ 0, the climber reaches the nth rung.)
Whispering down the alley: We have an infinitely long line of people, with the people numbered 0, 1, 2, . . . , n, . . .. To argue that everyone in the line learns a secret, we can argue that
• person #0 learns the secret.
• if person #n learns the secret, then she tells person #(n + 1) the secret.
(The person at the front of the line learns the secret, and everyone who learns it tells the secret to the next person in line. Thus, for any n ≥ 0, the nth person learns the secret.)
Falling into the depths of despair: Consider the Pit of Infinite Despair, which is filled with nothing but despair and goes infinitely far down beneath the surface of the earth. (The Pit does not respect physics.) Suppose that:
• the Evil Villain is pushed into the pit (that is, She is in the Pit zero meters below the surface).
• if someone is in the Pit at a depth of n meters beneath the surface, then She falls to depth n + 1 meters beneath the surface.
(The Villain starts to fall, and if the Villain has fallen to a certain depth then She falls another meter further. Thus, for any n ≥ 0, the Evil Villain eventually reaches depth n in the Pit.)
Figure 5.2: Some analogies to make mathematical induction more intuitive.
Taking it further: “Mathematical induction” is somewhat unfortunately named because its name collides with a distinction made by philosophers between two types of reasoning. Deductive reasoning is the use of logic (particularly rules of inference) to reach conclusions, which is what computer scientists would call a proof. A proof by mathematical induction is an example of deductive reasoning. For a philosopher, though, inductive reasoning is the type of reasoning that draws conclusions from empirical observations. If you’ve seen a few hundred ravens in your life, and every one that you’ve seen is black, then you might conclude All ravens are black. Of course, it might turn out that your conclusion is false, because you haven’t happened upon any of the albino ravens that exist in the world; hence what philosophers call inductive reasoning leads to conclusions that may turn out to be false.
A first example: summing powers of two
Let’s use mathematical induction to prove a simple arithmetic property:
Theorem 5.1 (A formula for the sum of powers of two)
For any nonnegative integer n, we have
$\sum_{i=0}^{n} 2^i = 2^{n+1} - 1$.
As a plausibility check, let’s test the given formula for some small values of n:
n = 1: $2^0 + 2^1 = 1 + 2 = 3$ and $2^2 - 1 = 3$
n = 2: $2^0 + 2^1 + 2^2 = 1 + 2 + 4 = 7$ and $2^3 - 1 = 7$
n = 3: $2^0 + 2^1 + 2^2 + 2^3 = 1 + 2 + 4 + 8 = 15$ and $2^4 - 1 = 15$
Problem-solving tip: Do this kind of plausibility check, and test out a claim for small values of n before you try to prove it. Often the process of testing small examples either reveals a misunderstanding of the claim or helps you see why the claim is true in general.
These small examples all check out, so it’s reasonable to try to prove the claim. Here is our first example of a proof by induction:
Example 5.2 (A proof of Theorem 5.1)
Let P(n) denote the property
$\sum_{i=0}^{n} 2^i = 2^{n+1} - 1$.
We’ll prove that ∀n ∈ Z≥0 : P(n) by induction on n.
base case (n = 0): We must prove P(0). That is, we must prove $\sum_{i=0}^{0} 2^i = 2^{0+1} - 1$. But this fact is easy to prove, because both sides are equal to 1: $\sum_{i=0}^{0} 2^i = 2^0 = 1$, and $2^{0+1} - 1 = 2 - 1 = 1$.
inductive case (n ≥ 1): We must prove that P(n − 1) ⇒ P(n), for an arbitrary integer n ≥ 1. We prove this implication by assuming the antecedent: namely, we assume P(n − 1) and prove P(n). The assumption P(n − 1) is
$\sum_{i=0}^{n-1} 2^i = 2^{(n-1)+1} - 1$. (∗)
We can now prove P(n), under the assumption (∗), by showing that the left-hand and right-hand sides of P(n) are equal:
$\sum_{i=0}^{n} 2^i = \left( \sum_{i=0}^{n-1} 2^i \right) + 2^n$ [by the definition of summations]
$= \left( 2^{(n-1)+1} - 1 \right) + 2^n$ [by (∗), a.k.a. by the assumption that P(n − 1)]
$= 2^n - 1 + 2^n$ [by algebraic manipulation]
$= 2 \cdot 2^n - 1$
$= 2^{n+1} - 1$.

506 CHAPTER 5. MATHEMATICAL INDUCTION
We’ve thus shown that $\sum_{i=0}^{n} 2^i = 2^{n+1} - 1$; in other words, we’ve proven P(n).
We’ve proven the base case P(0) and the inductive case P(n − 1) ⇒ P(n), so by the principle of mathematical induction we have shown that P(n) holds for all n ∈ Z≥0.
Taking it further: In case the inductive proof doesn’t feel 100% natural, here’s another way to make the result from Example 5.2 intuitive: think about binary representations of numbers. Written in binary, the number $\sum_{i=0}^{n} 2^i$ will look like 11···111, with n + 1 ones. What happens when we add 1 to, say, 11111111 (= 255)? It’s a colossal sequence of carrying (as 1 + 1 = 0, carrying the 1 to the next place):
     1111111      (carries)
     11111111
   + 00000001
  = 100000000
In other words, $2^{n+1} - 1$ is written in binary as a sequence of n + 1 ones; that is, $2^{n+1} - 1 = \sum_{i=0}^{n} 2^i$.
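The plausibility check and the binary-representation argument can both be tested mechanically for small n. The following is only a sketch of such a check (the function name sum_of_powers_of_two is ours, not the text’s), and passing it for a few values of n is evidence for Theorem 5.1, not a proof of it:

```python
def sum_of_powers_of_two(n):
    """Left-hand side of Theorem 5.1: 2^0 + 2^1 + ... + 2^n."""
    return sum(2**i for i in range(n + 1))


for n in range(11):
    total = sum_of_powers_of_two(n)
    # The closed form claimed by Theorem 5.1.
    assert total == 2**(n + 1) - 1
    # In binary, the sum is a string of n + 1 ones.
    assert bin(total) == "0b" + "1" * (n + 1)
```

Only the inductive proof establishes the claim for every nonnegative integer n; the loop above covers just n = 0, 1, ..., 10.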
Example 5.2 follows the standard outline of a proof by mathematical induction. We will always prove the inductive case P(n − 1) ⇒ P(n) by assuming the antecedent P(n − 1) and proving P(n). The assumed antecedent P(n − 1) in the inductive case of the proof is called the inductive hypothesis.
A second example, and a template for proofs by induction
Here’s another proof by induction, with the parts of the proof carefully labeled:
Example 5.3 (Summing powers of −1)
Claim: For any integer n ≥ 0, we have that
$\sum_{i=0}^{n} (-1)^i = \begin{cases} 1 & \text{if } n \text{ is even} \\ 0 & \text{if } n \text{ is odd.} \end{cases}$
Step #1: Clearly state the claim to be proven. Clearly state that the proof will be by induction, and clearly state the variable upon which induction will be performed.
Proof. Let P(n) denote the property
$\sum_{i=0}^{n} (-1)^i = \begin{cases} 1 & \text{if } n \text{ is even} \\ 0 & \text{if } n \text{ is odd.} \end{cases}$
We’ll prove that ∀n ∈ Z≥0 : P(n) by induction on n.
Step #2: State and prove the base case.
base case (n = 0): We must prove P(0). But $\sum_{i=0}^{0} (-1)^i = (-1)^0 = 1$, and 0 is even.
Step #3: State and prove the inductive case. Within the statement and proof of the inductive case . . .
. . . Step #3a: state the inductive hypothesis.
inductive case (n ≥ 1): We assume the inductive hypothesis P(n − 1), namely
$\sum_{i=0}^{n-1} (-1)^i = \begin{cases} 1 & \text{if } n-1 \text{ is even} \\ 0 & \text{if } n-1 \text{ is odd.} \end{cases}$
. . . Step #3b: state what we need to prove.
We must prove P(n).
. . . Step #3c: prove it, making use of the inductive hypothesis and stating where it was used.
$\sum_{i=0}^{n} (-1)^i = \left[\sum_{i=0}^{n-1} (-1)^i\right] + (-1)^n$    (definition of summations)
$= \begin{cases} 1 + (-1)^n & \text{if } n-1 \text{ is even} \\ 0 + (-1)^n & \text{if } n-1 \text{ is odd} \end{cases}$    (inductive hypothesis)
$= \begin{cases} 1 + (-1)^n & \text{if } n \text{ is odd} \\ 0 + (-1)^n & \text{if } n \text{ is even} \end{cases}$    (n is odd ⇔ n − 1 is even)
$= \begin{cases} 1 + (-1) & \text{if } n \text{ is odd} \\ 0 + 1 & \text{if } n \text{ is even} \end{cases}$    ($(-1)^n = \pm 1$, depending on whether n is even; see Exercise 5.3)
$= \begin{cases} 0 & \text{if } n \text{ is odd} \\ 1 & \text{if } n \text{ is even.} \end{cases}$
Thus we have proven P(n), and the theorem follows.
Note: You may see “inductive hypothesis” abbreviated as IH.
Warning! P(n) denotes a proposition; that is, P(n) is either true or false. (We’re proving that, in fact, it’s true for every n.) Despite its apparent temptation to people new to inductive proofs, it is nonsensical to treat P(n) as a number.
Writing tip: In the inductive case of a proof of an equality (like Example 5.3), start from the left-hand side of the equality and manipulate it until you derive the right-hand side of the equality exactly. If you work from both sides simultaneously, you’re at risk of the fallacy of proving true, or at least the appearance of that fallacy!
Figure 5.3: A checklist of the steps required for a proof by mathematical induction.
We can treat the labeled pieces of Example 5.3 as a checklist for writing proofs by induction. You should ensure that when you write an inductive proof, you include each of these steps. These steps are summarized in Figure 5.3.
Checklist for a proof by mathematical induction:
1. A clear statement of the claim to be proven—that is, a clear definition of the property P(n) that will be proven true for all n ≥ 0—and a statement that the proof is by induction, including specifically identifying the variable n upon which induction is being performed. (Some claims involve multiple variables, and it can be confusing if you aren’t clear about which is the variable upon which you are performing induction.)
2. A statement and proof of the base case—that is, a proof of P(0).
3. A statement and proof of the inductive case—that is, a proof of P(n − 1) ⇒ P(n), for a generic
value of n ≥ 1. The proof of the inductive case should include all of the following:
(a) a statement of the inductive hypothesis P(n − 1).
(b) a statement of the claim P(n) that needs to be proven.
(c) a proof of P(n), which at some point makes use of the assumed inductive hypothesis.
The sum of the first n integers
We’ll do another simple example of an inductive proof of an arithmetic property, by showing that the sum of the integers between 0 and n is $\frac{n(n+1)}{2}$. (For example, for n = 4 we have $0 + 1 + 2 + 3 + 4 = 10 = \frac{4(4+1)}{2}$.) Here’s a proof:
Example 5.4 (Sum of the first n integers)
Problem: Show that 0 + 1 + ··· + n is $\frac{n(n+1)}{2}$, for any integer n ≥ 0.
Problem-solving tip: Your first task in giving a proof by induction is to identify the property P(n) that you’ll prove true for every integer n ≥ 0. Sometimes the property is given to you more or less directly and sometimes you’ll have to formulate it yourself, but in any case you need to identify the precise property you’re going to prove before you can prove it!
Solution: First, we must phrase this problem in terms of a property P(n) that we’ll prove true for every n ≥ 0. For a particular integer n, let P(n) denote the claim that
$\sum_{i=0}^{n} i = \frac{n(n+1)}{2}.$
We will prove that P(n) holds for all integers n ≥ 0 by induction on n.
base case (n = 0): Note that $\sum_{i=0}^{0} i = 0$ and $\frac{0(0+1)}{2} = 0$ too. Thus P(0) follows.
inductive case (n ≥ 1): Assume the inductive hypothesis P(n − 1), namely
$\sum_{i=0}^{n-1} i = \frac{(n-1)((n-1)+1)}{2}.$
We must prove P(n); that is, we must prove that $\sum_{i=0}^{n} i = \frac{n(n+1)}{2}$. Here is the proof:
$\sum_{i=0}^{n} i = \left[\sum_{i=0}^{n-1} i\right] + n$    (definition of summations)
$= \frac{(n-1)((n-1)+1)}{2} + n$    (inductive hypothesis)
$= \frac{(n-1)n + 2n}{2}$    (putting terms over common denominator)
$= \frac{n(n-1+2)}{2}$    (factoring)
$= \frac{n(n+1)}{2}.$
Thus we’ve shown P(n) assuming P(n − 1), which completes the proof.
Taking it further: While the summation that we analyzed in Example 5.4 may seem like a purely arithmetic example, it also has direct applications in CS, particularly in the analysis of algorithms. Chapter 6 is devoted to this topic, and there’s much more there, but here’s a brief preview.
A basic step in analyzing an algorithm is counting how many steps that algorithm takes, for an input of arbitrary size. One particular example is Insertion Sort, which sorts an n-element array by repeatedly ensuring that the first k elements of the array are in sorted order (by swapping the kth element backward until it’s in position). The total number of swaps that are done in the kth iteration can be as high as k − 1, so the total number of swaps can be as high as $\sum_{k=1}^{n} (k-1) = \sum_{i=0}^{n-1} i$. Thus Example 5.4 tells us that Insertion Sort can require as many as n(n − 1)/2 swaps.
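To see the n(n − 1)/2 bound in action, here is a minimal sketch that counts swaps while running Insertion Sort. It assumes 0-based Python lists rather than the 1-based arrays used in the text, and the name insertion_sort_swaps is ours. A reverse-sorted input forces every element to travel all the way to the front, so it attains the bound exactly:

```python
def insertion_sort_swaps(values):
    """Sort a copy of values with Insertion Sort; return (sorted list, swap count)."""
    a = list(values)
    swaps = 0
    for i in range(1, len(a)):
        j = i
        # Swap the element backward until it is in position.
        while j > 0 and a[j] < a[j - 1]:
            a[j], a[j - 1] = a[j - 1], a[j]
            swaps += 1
            j -= 1
    return a, swaps


for n in range(8):
    worst_case = list(range(n, 0, -1))  # n, n-1, ..., 1
    result, swaps = insertion_sort_swaps(worst_case)
    assert result == sorted(worst_case)
    assert swaps == n * (n - 1) // 2
```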
Generating a conjecture: segments in a fractal
In the inductive proofs that we’ve seen thus far, we were given a problem statement
that described exactly what property we needed to prove. Solving these problems “just” requires proving the base case and the inductive case—which may or may not be easy, but at least we know what we’re trying to prove! In other problems, though, you may also have to first figure out what you’re going to prove, and then prove it. Obviously this task is generally harder. Here’s one example of such a proof, about the Von Koch snowflake fractal from Figure 5.1:

Example 5.5 (Vertices in a Von Koch Line)
Problem: A Von Koch line of level 0 is a straight line segment; a Von Koch line of level l ≥ 1 consists of four Von Koch lines of level (l − 1), arranged in the shape shown in Figure 5.4. Conjecture a formula for the number of vertices (that is, the number of segment endpoints) in a Von Koch line of level l. Prove your formula by induction.
Solution: Our first task is to formulate a conjecture for the number of vertices in a Von Koch line of level l. Let’s start with a few small examples, based on Figure 5.4:
• a level-0 line has 2 endpoints (and 1 segment).
• a level-1 line has 5 endpoints (and 4 segments): the two at the far left and far right, plus the three in the start, middle, and end of the “bump” in the center.
• a level-2 line, after some tedious counting in the picture in Figure 5.4, turns out to have 17 endpoints (and 16 segments).
There are a few ways to think about this pattern. Here’s one that turns out to be helpful: a level-l line contains 4 lines of level (l − 1), so it contains 16 lines of level (l − 2). And thus, expanding it all the way out, the level-l line contains $4^l$ lines of level 0. The number of endpoints that we observe is $2 = 4^0 + 1$, then $5 = 4^1 + 1$, then $17 = 4^2 + 1$. (Why the “+1”? Each segment starts where the previous segment ended, so there is one more endpoint than segment, because of the last segment’s second endpoint.)
So it looks like there are $4^l + 1$ endpoints in a Von Koch line of level l. Let’s turn this observation into a formal claim, with an inductive proof:
Claim: For any l ≥ 0, a Von Koch line of level l has $4^l + 1$ endpoints.
Proof. Let P(l) denote the claim that a Von Koch line of level l has $4^l + 1$ endpoints. We’ll prove that P(l) holds for all integers l ≥ 0 by induction on l.
base case (l = 0): We must prove P(0). By definition, a Von Koch line of level 0 is a single line segment, which has 2 endpoints. Indeed, $4^0 + 1 = 1 + 1 = 2$.
inductive case (l ≥ 1): We assume the inductive hypothesis, namely P(l − 1), and we must prove P(l). The key observation is that a Von Koch line of level l consists of four Von Koch lines of level (l − 1), and the last endpoint of line #1 is identical to the first endpoint of line #2; the last endpoint of #2 is the first of #3, and the last endpoint of #3 is the first of #4. Therefore there are three endpoints that are shared among the four lines of level (l − 1). Thus:
the number of endpoints in a Von Koch line of level l
= 4 · [the number of endpoints in a Von Koch line of level (l − 1)] − 3    (by the definition of a Von Koch line, and by the above discussion)
$= 4 \cdot \left(4^{l-1} + 1\right) - 3$    (by the inductive hypothesis)
$= 4^l + 4 - 3$    (multiplying through)
$= 4^l + 1.$    (algebra)
Thus P(l) follows, completing the proof.
Figure 5.4: Von Koch lines of level 0, 1, . . . , 5. (A Von Koch snowflake consists of three Von Koch lines, all of the same level, arranged in a triangle; see Figure 5.1.)
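The counting argument in the inductive case translates directly into a recursive function: four copies of the previous level, minus the three shared endpoints. The sketch below (the name koch_endpoints is ours) compares that recursive count against the conjectured closed form for a few levels; it illustrates the claim rather than replacing the proof:

```python
def koch_endpoints(level):
    """Count endpoints in a Von Koch line by following the recursive definition."""
    if level == 0:
        return 2  # a single segment has two endpoints
    # Four lines of the previous level, with three endpoints shared between neighbors.
    return 4 * koch_endpoints(level - 1) - 3


for level in range(10):
    assert koch_endpoints(level) == 4**level + 1
```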
A note and two variations on the inductive template
The basic idea of induction is simple: the reason that P(n) holds is that P(n − 1) held,
and the reason that P(n − 1) held is that P(n − 2) held—and so forth, until eventually the proof finally rests on P(0), the base case. A proof by induction can sometimes look superficially like it’s circular reasoning—that we’re assuming precisely the thing that we’re trying to prove. But it’s not! In the inductive case, we’re assuming P(n − 1) and proving P(n)—we are not assuming P(n) and proving P(n).
Taking it further: The superficial appearance of circularity in a proof by induction is equivalent to the superficial appearance that a recursive function in a program will run forever. (A recursive function
f will run forever if calling f on n results in f calling itself on n again! That’s the same circularity that would happen if we assumed P(n) and proved P(n).) The correspondence between these aspects of induction and recursion should be no surprise; induction and recursion are essentially the same thing. In fact, it’s not too hard to write a recursive function that “implements” an inductive proof by outputting a step-by-step argument establishing P(n) for an arbitrary n, as in Example 5.1.
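As a small illustration of that correspondence, here is one possible sketch of such a recursive function. The name explain and the wording of the steps are our own choices, not something taken from Example 5.1. The recursion bottoms out at the base case exactly where the inductive proof does:

```python
def explain(n):
    """Return a list of steps that establishes P(n) from P(0) and P(k-1) => P(k)."""
    if n == 0:
        return ["P(0) holds by the base case."]
    # Recursively establish P(n-1), then apply the inductive case once more.
    steps = explain(n - 1)
    steps.append(f"P({n - 1}) => P({n}) by the inductive case, so P({n}) holds.")
    return steps


for line in explain(3):
    print(line)
```

Calling explain on n results in a call on n − 1, never on n itself, which is why the recursion terminates and why the corresponding proof is not circular.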
Warning! If you do not use the inductive hypothesis P(n − 1) in the proof of P(n), then something is wrong, or, at least, your proof is not actually a proof by induction!
Our proofs so far have shown ∀n ∈ Z≥0 : P(n) by proving P(0) as a base case. If we instead want to prove ∀n ∈ Z≥k : P(n) for some integer k, we can prove P(k) as the base case, and then prove the inductive case P(n − 1) ⇒ P(n) for all n ≥ k + 1.
Another variation in writing inductive proofs relates to the statement of the inductive case. We’ve proven P(0) and P(n − 1) ⇒ P(n) for arbitrary n ≥ 1. Some writers prefer to prove P(0) and P(n) ⇒ P(n + 1) for arbitrary n ≥ 0. The difference is merely a reindexing, not a substantive difference: it’s just a matter of whether one thinks of induction as “the nth domino falls because the (n − 1)st domino fell into it” or as “the nth domino falls and therefore knocks over the (n + 1)st domino.”
In the remainder of this section, we’ll give some more examples of proofs by mathematical induction, following the template of Figure 5.3. While the examples that we’ve used so far have almost all related to summations, the same style of inductive proof can be used for a wide variety of claims. We’ll encounter many inductive proofs throughout the book, and you’ll find inductive proofs ubiquitous throughout computer science. We’ll start with some more summation-based proofs, and then move on to inductive proofs of some other types of statements.
5.2.2 Some Numerical Examples: Geometric, Arithmetic, and Harmonic Series
We’ll now introduce three types of summations that arise frequently in computer science: geometric sequences (1, 2, 4, 8, 16, . . .); arithmetic sequences (2, 4, 6, 8, 10, . . .); and the harmonic sequence (1, 1/2, 1/3, 1/4, 1/5, . . .). Summations involving all of these types of sequences can be analyzed inductively, and we’ll address all three of them here and in the exercises. (The statements we’ll prove are both useful facts to know about geometric/arithmetic/harmonic sequences, and good practice with induction.)
Geometric series
Definition 5.2 (Geometric sequences and series)
A geometric sequence is a sequence of numbers where each number is generated by multiplying the previous entry by a fixed ratio α ∈ R, starting from an initial value $x_0$. (Thus the sequence is $\langle x_0, x_0 \cdot \alpha, x_0 \cdot \alpha^2, x_0 \cdot \alpha^3, \ldots \rangle$.) A geometric series or geometric sum is $\sum_{i=0}^{n} x_0 \alpha^i$.
Examples include ⟨2, 4, 8, 16, 32, …⟩; or ⟨1, 1/3, 1/9, 1/27, …⟩; or ⟨1, 1, 1, 1, 1, …⟩.
It turns out that there is a relatively simple formula expressing the sum of the first n terms of a geometric sequence:
Theorem 5.2 (Analysis of geometric series)
Let α ∈ R where α ≠ 1, and let n ∈ Z≥0. Then
$\sum_{i=0}^{n} \alpha^i = \frac{\alpha^{n+1} - 1}{\alpha - 1}.$
(If α = 1, then $\sum_{i=0}^{n} \alpha^i = n + 1$.)
(For simplicity, we stated Theorem 5.2 without reference to $x_0$. Because we can pull a constant multiplicative factor out of a summation, we can use the theorem to conclude that $\sum_{i=0}^{n} x_0 \alpha^i = x_0 \cdot \sum_{i=0}^{n} \alpha^i = x_0 \cdot \frac{\alpha^{n+1} - 1}{\alpha - 1}$.)
We will be able to prove Theorem 5.2 using a proof by mathematical induction:
Example 5.6 (Geometric series)
Proof of Theorem 5.2. Consider a fixed real number α with α ≠ 1, and let P(n) denote the property that
$\sum_{i=0}^{n} \alpha^i = \frac{\alpha^{n+1} - 1}{\alpha - 1}.$
We’ll prove that P(n) holds for all integers n ≥ 0 by induction on n.
base case (n = 0): Note that $\sum_{i=0}^{0} \alpha^i = \alpha^0$ and $\frac{\alpha^{0+1} - 1}{\alpha - 1}$ both equal 1. Thus P(0) holds.
inductive case (n ≥ 1): We assume the inductive hypothesis P(n − 1), namely
$\sum_{i=0}^{n-1} \alpha^i = \frac{\alpha^{n} - 1}{\alpha - 1},$
and we must prove P(n). Here is the proof:
$\sum_{i=0}^{n} \alpha^i = \alpha^n + \sum_{i=0}^{n-1} \alpha^i$    (definition of summation)
$= \alpha^n + \frac{\alpha^n - 1}{\alpha - 1}$    (inductive hypothesis)
$= \frac{\alpha^n(\alpha - 1) + \alpha^n - 1}{\alpha - 1}$    (putting the fractions over a common denominator)
$= \frac{\alpha^{n+1} - \alpha^n + \alpha^n - 1}{\alpha - 1}$    (multiplying out)
$= \frac{\alpha^{n+1} - 1}{\alpha - 1}.$    (simplifying)
Thus P(n) holds, and the theorem follows.
Problem-solving tip: The inductive cases of many inductive proofs follow the same pattern: first, we use some kind of structural definition to “pull apart” the statement about n into some kind of statement about n − 1 (plus some “leftover” other stuff), then apply the inductive hypothesis to simplify the n − 1 part. We then manipulate the result of using the inductive hypothesis plus the leftovers to get the desired equation.
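Theorem 5.2 can be spot-checked with exact rational arithmetic, which avoids any floating-point rounding. The sketch below uses Python’s fractions module; the helper names geometric_sum and closed_form and the particular test ratios are our own choices:

```python
from fractions import Fraction


def geometric_sum(alpha, n):
    """Term-by-term sum: alpha^0 + alpha^1 + ... + alpha^n."""
    return sum(alpha**i for i in range(n + 1))


def closed_form(alpha, n):
    """Closed form from Theorem 5.2, with the separate case alpha = 1."""
    if alpha == 1:
        return n + 1
    return (alpha**(n + 1) - 1) / (alpha - 1)


ratios = [Fraction(2), Fraction(-1), Fraction(1, 3), Fraction(1), Fraction(-5, 2)]
for alpha in ratios:
    for n in range(12):
        assert geometric_sum(alpha, n) == closed_form(alpha, n)
```

The ratios 2 and −1 correspond to the sums from Examples 5.2 and 5.3.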

Notice that Examples 5.2 and 5.3 were both special cases of Theorem 5.2. For the former, Theorem 5.2 tells us that $\sum_{i=0}^{n} 2^i = \frac{2^{n+1} - 1}{2 - 1} = 2^{n+1} - 1$; for the latter, this theorem tells us that
$\sum_{i=0}^{n} (-1)^i = \frac{(-1)^{n+1} - 1}{-1 - 1} = \frac{1 - (-1)^{n+1}}{2} = \begin{cases} \frac{1 - (-1)}{2} = 1 & \text{if } n \text{ is even} \\ \frac{1 - 1}{2} = 0 & \text{if } n \text{ is odd.} \end{cases}$
A corollary of Theorem 5.2 addressing infinite geometric sums will turn out to be useful later, so we’ll state it now. (You can skip over the proof if you don’t know calculus, or if you haven’t thought about calculus recently.)
Corollary 5.3
Let α ∈ R where 0 ≤ α < 1, and define $f(n) = \sum_{i=0}^{n} \alpha^i$. Then:
1. $\sum_{i=0}^{\infty} \alpha^i = \frac{1}{1 - \alpha}$, and
2. For all n ≥ 0, we have $1 \le f(n) \le \frac{1}{1 - \alpha}$.
Proof. The proof of (1) requires calculus. Theorem 5.2 says that $f(n) = \frac{\alpha^{n+1} - 1}{\alpha - 1}$, and we take the limit as n → ∞. Because α < 1, we have that $\lim_{n \to \infty} \alpha^{n+1} = 0$. Thus as n → ∞ the numerator $\alpha^{n+1} - 1$ tends to −1, and the entire ratio tends to 1/(1 − α).
For (2), observe that $\sum_{i=0}^{n} \alpha^i$ is definitely greater than or equal to $\sum_{i=0}^{0} \alpha^i$ (because α ≥ 0 and so the latter results by eliminating n nonnegative terms from the former). Similarly, $\sum_{i=0}^{n} \alpha^i$ is definitely less than or equal to $\sum_{i=0}^{\infty} \alpha^i$. Thus:
$f(n) = \sum_{i=0}^{n} \alpha^i \ge \sum_{i=0}^{0} \alpha^i = \alpha^0 = 1$
$f(n) = \sum_{i=0}^{n} \alpha^i \le \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1 - \alpha}.$
Arithmetic series
Definition 5.3 (Arithmetic sequences and series)
An arithmetic sequence is a sequence of numbers where each number is generated by adding a fixed step-size α ∈ R to the previous number in the sequence. The first entry in the sequence is some initial value $x_0$ ∈ R. (Thus the sequence is $\langle x_0, x_0 + \alpha, x_0 + 2\alpha, x_0 + 3\alpha, \ldots \rangle$.) An arithmetic series or sum is $\sum_{i=0}^{n} (x_0 + i\alpha)$.
Examples include ⟨2, 4, 6, 8, 10, ...⟩; or ⟨1, 1/3, −1/3, −1, −5/3, ...⟩; or ⟨1, 1, 1, 1, 1, ...⟩. You’ll prove a general formula for an arithmetic sum in the exercises.
Harmonic series
Definition 5.4 (Harmonic series)
A harmonic series is the sum of a sequence of numbers whose kth number is 1/k. The nth harmonic number is defined by $H_n := \sum_{k=1}^{n} \frac{1}{k}$.
Thus, for example, we have $H_1 = 1$, $H_2 = 1 + \frac{1}{2} = 1.5$, $H_3 = 1 + \frac{1}{2} + \frac{1}{3} \approx 1.8333$, and $H_4 = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} \approx 2.0833$.
Giving a precise equation for the value of $H_n$ requires a bit more work, but we can very easily prove upper and lower bounds on $H_n$ by induction. (If you’ve had calculus, then there’s a simple way for you to approximate the value of $H_n$, as
$H_n = \sum_{x=1}^{n} \frac{1}{x} \approx \int_{x=1}^{n} \frac{1}{x} \, dx = \ln n.$
But we’ll do a calculus-free version here.)
We will be able to prove the following, which captures the value of $H_n$ to within a factor of 2, at least when n is a power of 2:
Theorem 5.4 (Bounds on the ($2^k$)th harmonic number)
For any integer k ≥ 0, we have $k + 1 \ge H_{2^k} \ge \frac{k}{2} + 1$.
We’ll prove half of Theorem 5.4 (namely $k + 1 \ge H_{2^k}$) by induction in Example 5.7, leaving the other half to the exercises. We will also leave to the exercises a proof of upper and lower bounds for $H_n$ when n is not an exact power of 2.
Example 5.7 (Inductive proof that $k + 1 \ge H_{2^k}$)
Proof. Let P(k) denote the property that $k + 1 \ge H_{2^k}$. We’ll use induction on k to prove that P(k) holds for all integers k ≥ 0.
base case (k = 0): We have that $H_{2^k} = H_{2^0} = H_1 = 1$, and k + 1 = 0 + 1 = 1 as well. Therefore $H_{2^k} = 1 = k + 1$.
inductive case (k ≥ 1): Let k ≥ 1 be an arbitrary integer. We must prove P(k); that is, we must prove that $k + 1 \ge H_{2^k}$. To do so, we assume the inductive hypothesis P(k − 1), namely that $k \ge H_{2^{k-1}}$. Consider $H_{2^k}$:
$H_{2^k} = \sum_{i=1}^{2^k} \frac{1}{i}$    (definition of the harmonic numbers)
$= \left[\sum_{i=1}^{2^{k-1}} \frac{1}{i}\right] + \left[\sum_{i=2^{k-1}+1}^{2^k} \frac{1}{i}\right]$    (splitting the summation into parts)
$= H_{2^{k-1}} + \left[\sum_{i=2^{k-1}+1}^{2^k} \frac{1}{i}\right]$    (definition of the harmonic numbers, again)
$\le H_{2^{k-1}} + \left[\sum_{i=2^{k-1}+1}^{2^k} \frac{1}{2^{k-1}}\right]$    (every term in the summation $\sum_{i=2^{k-1}+1}^{2^k} \frac{1}{i}$ is smaller than $\frac{1}{2^{k-1}}$)
$\le H_{2^{k-1}} + 2^{k-1} \cdot \frac{1}{2^{k-1}}$    (there are $2^{k-1}$ terms in the summation)
$= H_{2^{k-1}} + 1$    ($\frac{1}{x} \cdot x = 1$ for any x ≠ 0)
$\le k + 1.$    (inductive hypothesis)
Thus we’ve proven that $H_{2^k} \le k + 1$; that is, we’ve proven P(k). This proof completes the inductive case, and the theorem follows.
Note: The name “harmonic” comes from music: when a note at frequency f is played, overtones of that note (other high-intensity frequencies) can be heard at frequencies 2f, 3f, 4f, .... The wavelengths of the corresponding sound waves are $\frac{1}{f}, \frac{1}{2f}, \frac{1}{3f}, \frac{1}{4f}, \ldots$.
The proof in Example 5.7 is perhaps the first time in this chapter in which we needed some serious insight and creativity to establish the inductive case.
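Both bounds in Theorem 5.4 can be checked for small k with exact fractions. This is a sketch only; the function name harmonic is ours, and the numerical check of the lower bound is no substitute for the proof that is left to the exercises:

```python
from fractions import Fraction


def harmonic(n):
    """The nth harmonic number H_n = 1/1 + 1/2 + ... + 1/n, as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


for k in range(9):
    h = harmonic(2**k)
    # Theorem 5.4: k/2 + 1 <= H_(2^k) <= k + 1.
    assert Fraction(k, 2) + 1 <= h <= k + 1
```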
The structure of a proof by induction is rigid (we must prove a base case P(0); we must prove an inductive case P(n − 1) ⇒ P(n)), but that doesn’t make the entire proof totally formulaic. (The proof of the inductive case must use the inductive hypothesis at some point, so its statement gives you a little guidance for the kinds of manipulations to try.) Just as with all the other proof techniques that we explored in Chapter 4, a proof by induction can require you to think, and all of the strategies that we discussed in Chapter 4 may be helpful to deploy.
5.2.3 Some More Examples
We’ll close this section with a few more examples of proofs by mathematical induction, but we’ll focus on things other than analyzing summations. Some of these examples are still about arithmetic properties, but they should at least hint at the breadth of possible statements that we might be able to prove by induction.
Comparing algorithms: which is faster?
Suppose that we have two different candidate algorithms that solve a problem related to a set S with n elements: a brute-force algorithm that tries all $2^n$ possible subsets of S, and a second algorithm that computes the solution by looking at only $n^2$ subsets of S. Which would be faster to use? It turns out that the latter algorithm is faster, and we can prove this fact (with a small caveat for small n) by induction:
Example 5.8 ($2^n$ vs. $n^2$)
We’d like to prove that $2^n \ge n^2$ for all integers n ≥ 0, but it turns out not to be true! (See Figure 5.5.) Indeed, $2^3 < 3^2$. But the relationship appears to begin to hold starting at n = 4. Let’s prove it, by induction:
Claim: For all integers n ≥ 4, we have $2^n \ge n^2$.
Proof. Let P(n) denote the property $2^n \ge n^2$. We’ll use induction on n to prove that P(n) holds for all n ≥ 4.
base case (n = 4): For n = 4, we have $2^n = 16 = n^2$, so the inequality P(4) holds.
inductive case (n ≥ 5): Assume the inductive hypothesis P(n − 1); that is, assume $2^{n-1} \ge (n-1)^2$. We must prove P(n). For n ≥ 4, note that $n^2 \ge 4n$ (by multiplying both sides of the inequality n ≥ 4 by n). Thus $n^2 - 4n \ge 0$, and so
$2^n = 2 \cdot 2^{n-1}$    (definition of exponentiation)
$\ge 2 \cdot (n-1)^2$    (inductive hypothesis)
$= 2n^2 - 4n + 2$    (multiplying out)
$= n^2 + (n^2 - 4n) + 2$    (rearranging)
$\ge n^2 + 0 + 2$    (by the above discussion, we have $n^2 - 4n \ge 0$)
$> n^2.$
Thus we have shown $2^n > n^2$, which completes the proof of the inductive case. The claim follows.
n    2^n    n^2
0    1      0
1    2      1
2    4      4
3    8      9
4    16     16
5    32     25
6    64     36
7    128    49
Figure 5.5: Small values of $2^n$ and $n^2$, and a plot of the functions.
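The caveat for small n is easy to locate by brute force. The sketch below (the function name counterexamples and the search limit of 200 are arbitrary choices of ours) lists every n below the limit at which $2^n \ge n^2$ fails:

```python
def counterexamples(limit):
    """Return every n in [0, limit) for which 2^n >= n^2 fails."""
    return [n for n in range(limit) if 2**n < n**2]


# The only failure below 200 is n = 3 (8 < 9), matching the table in Figure 5.5.
assert counterexamples(200) == [3]
```

A finite search like this suggests where the base case should go; the induction is what covers all n ≥ 4.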
Taking it further: In analyzing the efficiency of algorithms, we will frequently have to do the type of comparison that we just completed, to compare the amount of time consumed by one algorithm versus another. Chapter 6 discusses this type of comparison in much greater detail, but here’s one example of this sort.
Let X be a sequence. A subsequence of X results from selecting some of the entries in X; for example, TURING is a subsequence of OUTSOURCING. For two sequences X and Y, a common subsequence is a subsequence of both X and Y. The longest common subsequence of X and Y is, naturally, the common subsequence of X and Y that’s longest. (For example, TURING is the longest common subsequence of DISTURBINGLY and OUTSOURCING.)
Given two sequences X and Y of length n, we can find the longest common subsequence fairly easily by testing every possible subsequence of X to see whether it’s also a subsequence of Y. This brute-force solution requires testing $2^n$ subsequences of X. But there’s a cleverer approach to solving this problem using an algorithmic design technique called dynamic programming (see p. 959 or a textbook on algorithms) that avoids redoing the same computation (here, testing the same sequence of letters to see if it appears in Y) more than once. The dynamic programming algorithm for longest common subsequence requires only about $n^2$ steps.
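For concreteness, here is a sketch of the standard dynamic programming table for the length of a longest common subsequence. This is the usual textbook formulation, not necessarily the exact presentation referred to above, and the function name lcs_length is ours. Filling the table takes about len(x) · len(y) steps, which is about $n^2$ when both sequences have length n:

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of the strings x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # table[i][j] = LCS length of the first i letters of x and the first j letters of y.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


# TURING has six letters, and no longer common subsequence exists.
assert lcs_length("DISTURBINGLY", "OUTSOURCING") == 6
```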
Proving algorithms correct: factorial
We just gave an example of using a proof by induction to analyze the efficiency of an algorithm, but we can also use mathematical induction to prove the correctness of a recursive algorithm. (That is, we’d like to show that a recursive algorithm always returns the desired output.) Here’s a simple example, for the natural recursive algorithm to compute factorials (see Figure 5.6):
Example 5.9 (Factorial)
Consider the recursive algorithm fact in Figure 5.6. For a positive integer n, let P(n) denote the property that fact(n) = n!. We’ll prove by induction on n that, indeed, P(n) holds for all integers n ≥ 1.
base case (n = 1): Observe that fact(1) returns 1 immediately. And 1! = 1 by definition. Thus P(1) holds.
inductive case (n ≥ 2): We assume the inductive hypothesis P(n − 1), namely that fact(n − 1) returns (n − 1)!. We want to prove that fact(n) returns n!. But this claim is easy to see:
fact(n) = n · fact(n − 1)    (by inspection of the algorithm)
= n · (n − 1)!    (by the inductive hypothesis)
= n!    (by definition of !)
Therefore the claim holds by induction.
fact(n):
  if n = 1 then
    return 1
  else
    return n · fact(n − 1)
Figure 5.6: Pseudocode for factorial: given n ∈ Z≥1, we wish to compute the value of n!.
In fact, induction and recursion are basically the same thing: recursion “works” by leveraging a solution to a smaller instance of a problem to solve a larger instance of the same problem; a proof by induction “works” by leveraging a proof of a smaller instance of a claim to prove a larger instance of the same claim. (Actually, one common use of induction is to analyze the efficiency of a recursive algorithm. We’ll discuss this type of analysis in great depth in Section 6.4.)
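The pseudocode for fact carries over to Python almost word for word. The sketch below compares it against the standard library’s math.factorial for small inputs; like the pseudocode, it assumes n ≥ 1 and makes no attempt to handle other inputs:

```python
import math


def fact(n):
    """Recursive factorial, mirroring the pseudocode; assumes n is an integer >= 1."""
    if n == 1:
        return 1  # base case of the recursion, and of the proof
    return n * fact(n - 1)  # correct if fact(n - 1) is, by the inductive hypothesis


for n in range(1, 20):
    assert fact(n) == math.factorial(n)
```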
Taking it further: While induction is much more closely related to recursive algorithms than nonrecursive algorithms, we can also prove the correctness of an iterative algorithm using induction. The basic idea is to consider a statement, called a loop invariant, about the correct behavior of a loop; we can prove inductively that a loop invariant starts out true and stays true throughout the execution of the algorithm. See the discussion on p. 517.
Divisibility
We’ll close this section with one more numerical example, about divisibility:
Example 5.10 ($k^n - 1$ is evenly divisible by k − 1)
Claim: For any n ≥ 0 and k ≥ 2, we have that $k^n - 1$ is evenly divisible by k − 1.
(For example, $7^n - 1$ is always divisible by 6, as in 7 − 1, 49 − 1, and 343 − 1. And $k^2 - 1$ is always divisible by k − 1; in fact, factoring $k^2 - 1$ yields $k^2 - 1 = (k - 1)(k + 1)$.)
Proof. We’ll proceed by induction on n. That is, let P(n) denote the claim
For all integers k ≥ 2, we have that $k^n - 1$ is evenly divisible by k − 1.
We will prove that P(n) holds for all integers n ≥ 0 by induction on n.
base case (n = 0): For any k, we have $k^n - 1 = k^0 - 1 = 1 - 1 = 0$. And 0 is evenly divisible by any positive integer, including k − 1. Thus P(0) holds.
inductive case (n ≥ 1): We assume the inductive hypothesis P(n − 1), and we need to prove P(n). Let k ≥ 2 be an arbitrary integer. Then:
$k^n - 1 = k^n - k + k - 1$    (antisimplification: x = x + k − k)
$= k \cdot (k^{n-1} - 1) + k - 1$    (factoring)
By the inductive hypothesis, $k^{n-1} - 1$ is evenly divisible by k − 1. In other words, by the definition of divisibility, there exists a nonnegative integer a such that $a \cdot (k - 1) = k^{n-1} - 1$. Therefore
$k^n - 1 = k \cdot a \cdot (k - 1) + k - 1$
$= (k - 1) \cdot (k \cdot a + 1).$
Because k · a + 1 is a nonnegative integer, (k − 1) · (k · a + 1) is by definition evenly divisible by k − 1. Thus $k^n - 1 = (k - 1) \cdot (k \cdot a + 1)$ is evenly divisible by k − 1. Our k was arbitrary, so P(n) follows.
Writing tip: Example 5.10 illustrates why it is crucial to state clearly the variable upon which induction is being performed. This statement involves two variables, k and n, but we’re performing induction on only one of them!
Problem-solving tip: In inductive proofs, try to massage the expression in question into something (anything!) that matches the form of the inductive hypothesis. Here, the “antisimplification” step is obviously true but seems completely bizarre. Why did we do it? Our only hope in the inductive case is to somehow make use of the inductive hypothesis. Here, the inductive hypothesis tells us something about $k^{n-1} - 1$, so a good strategy is to transform $k^n - 1$ into an expression involving $k^{n-1} - 1$, plus some leftover stuff.
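The proof is constructive: it shows how to build the quotient for n from the quotient for n − 1. The following sketch (the function name quotient is ours) follows that recipe and checks the resulting factorization for small k and n:

```python
def quotient(k, n):
    """Return q such that k**n - 1 == (k - 1) * q, built as in the inductive proof."""
    if n == 0:
        return 0  # base case: k^0 - 1 = 0 = (k - 1) * 0
    a = quotient(k, n - 1)  # inductive hypothesis: k^(n-1) - 1 = a * (k - 1)
    return k * a + 1  # inductive case: k^n - 1 = (k - 1) * (k*a + 1)


for k in range(2, 10):
    for n in range(8):
        assert k**n - 1 == (k - 1) * quotient(k, n)
```

For example, quotient(7, 3) is 57, and indeed 343 − 1 = 342 = 6 · 57.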

Computer Science Connections
Loop Invariants
In Example 5.9, we saw how to use a proof by induction to establish that
a recursive algorithm correctly solves a particular problem. But proving the correctness of iterative algorithms seems different. An approach—pioneered in the 1960s by Robert Floyd and C. A. R. Hoare1 —is based on loop invariants, and can be used to analyze nonrecursive algorithms. A loop invariant for a loop L is logical property P such that (i) P is true before L is first executed; and (ii) if P is true at the beginning of an iteration of L, then P is true after that iteration of L. The parallels to induction are clear; property (i) is the base case, and property (ii) is the inductive case. Together, they ensure that P is always true, and in particular P is true when the loop terminates.
Here’s an example of a sketch of a proof of correctness of Insertion Sort (Figure 5.7) using loop invariants. (Many proofs using loop invariants would proceed with more formal detail.) We claim that the property
P(k) := A[1 . . . k + 1] is sorted after completing k iterations of the outer while loop is true for all k ≥ 0. (That is, P is a loop invariant for the outer while loop.)
Proof(sketch). Forthebasecase(k=0),we’vecompletedzeroiterations—that is, we have only executed line 1. But A[1 . . . k + 1] is then vacuously sorted, because it contains only the lone element A[1].
Fortheinductivecase(k ≥ 1),weassumetheinductivehypothesis P(k − 1)—that is, A[1 . . . k] was sorted before the kth iteration. The kth iter- ation of the loop executed lines 2–7, so we must show that the execution of these lines extended the sorted segment A[1 . . . k] to A[1 . . . k + 1]. A formal proof of this claim would use another loop invariant, like
Q(j) := both A[1 . . . j − 1] and A[j . . . i] are sorted, and A[j − 1] < A[j + 1]
but for this proof sketch we’ll be satisfied with reaching the desired conclusion by inspection of the algorithm’s code.
Because P(n − 1) is true (after n − 1 iterations of the loop), we know that A[1 . . . (n − 1) + 1] = A[1 . . . n] is sorted, as desired.
Loop invariants can also be extremely valuable as part of the development of programs. For example, many people struggle to write binary search correctly, but writing down loop invariants before writing the code makes the task much easier. If we keep the property
if x is in A, then x is one of A[lo, . . . , hi]
in mind as a loop invariant while we write the program, binary search becomes much easier to get right. Many programming languages allow programmers to use assertions to state logical conditions that they believe to always be true at a particular point in the code. A simple assert(P) statement can help a programmer identify bugs earlier in the development process and avoid a great deal of debugging trauma later.
1 Robert W. Floyd. Assigning meanings to programs. In Proceedings of Symposia in Applied Mathematics XIX, American Mathematical Society, pages 19–32, 1967; and C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–585, October 1969.
insertionSort(A[1 . . . n]):
1: i := 2
2: while i ≤ n:
3:     j := i
4:     while j > 1 and A[j] < A[j − 1]:
5:         swap A[j] and A[j − 1]
6:         j := j − 1
7:     i := i + 1
Figure 5.7: Insertion Sort.
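As a rough illustration of how a loop invariant can be stated as an assertion, here is a minimal Python sketch of Insertion Sort. It assumes 0-based indexing and ascending order; the names insertion_sort and is_sorted are illustrative and do not come from the text.

```python
def is_sorted(xs):
    """Return True if xs is in nondecreasing order."""
    return all(xs[k] <= xs[k + 1] for k in range(len(xs) - 1))


def insertion_sort(a):
    """Sort the list a in place (0-indexed) and return it."""
    i = 1
    while i < len(a):
        # Outer loop invariant (0-indexed form of P(k)): a[0..i-1] is sorted.
        assert is_sorted(a[:i])
        j = i
        # Move a[j] leftward until it is no smaller than its left neighbor.
        while j > 0 and a[j] < a[j - 1]:
            a[j], a[j - 1] = a[j - 1], a[j]
            j -= 1
        i += 1
    # After the final iteration the whole list is sorted.
    assert is_sorted(a)
    return a
```

The assertions re-check sortedness on every iteration, which is slow; they are meant for catching mistakes during development, not for production use.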
binarySearch(A[1 . . . n], x):
// output: is x in the sorted array A?
1: lo := 1
2: hi := n
3: while lo ≤ hi:
4:     middle := ⌊(lo + hi)/2⌋
5:     if A[middle] = x then
6:         return True
7:     else if A[middle] > x then
8:         hi := middle − 1
9:     else
10:        lo := middle + 1
11: return False
Figure 5.8: Binary Search.
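The loop invariant for binary search mentioned above ("if x is in A, then x is one of A[lo, . . . , hi]") can be written directly as an assertion. The following is a minimal Python sketch under the assumption of 0-based indexing; the name binary_search is illustrative.

```python
def binary_search(a, x):
    """Return True if x occurs in the sorted list a (0-indexed)."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        # Loop invariant: if x is in a, then x is one of a[lo..hi].
        # (Checked by brute force here, which only makes sense while testing.)
        assert x not in a or x in a[lo:hi + 1]
        middle = (lo + hi) // 2
        if a[middle] == x:
            return True
        elif a[middle] > x:
            hi = middle - 1
        else:
            lo = middle + 1
    return False
```

If an update such as hi = middle − 1 were mistyped as hi = middle + 1, the assertion would typically fail on a small test input rather than silently producing wrong answers.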

5.2.4 Exercises
Prove that the following claims hold for all integers n ≥ 0, by induction on n:
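5.1 $\sum_{i=0}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$
5.2 $\sum_{i=0}^{n} i^3 = \frac{n^4 + 2n^3 + n^2}{4}$
5.3 $(-1)^n = 1$ if n is even, and $(-1)^n = -1$ if n is odd
5.4 $\sum_{i=1}^{n} \frac{1}{i(i+1)} = \frac{n}{n+1}$
5.5 $\sum_{i=1}^{n} \frac{2}{i(i+2)} = \frac{3}{2} - \frac{1}{n+1} - \frac{1}{n+2}$
5.6 $\sum_{i=1}^{n} i \cdot (i!) = (n+1)! - 1$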
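Before attempting the inductive proofs, it can help to sanity-check each claimed identity numerically for small n. The sketch below does so with exact rational arithmetic; the function name check_identities is illustrative, and the formulas are as read from the statements of Exercises 5.1–5.6. A passing check is not a proof, only evidence that the claim is stated correctly.

```python
from fractions import Fraction
from math import factorial


def check_identities(n):
    """Return True if the closed forms of Exercises 5.1-5.6 all hold at n."""
    checks = [
        # 5.1: sum of squares
        sum(i**2 for i in range(n + 1)) == Fraction(n * (n + 1) * (2 * n + 1), 6),
        # 5.2: sum of cubes
        sum(i**3 for i in range(n + 1)) == Fraction(n**4 + 2 * n**3 + n**2, 4),
        # 5.3: sign of (-1)^n
        (-1)**n == (1 if n % 2 == 0 else -1),
        # 5.4: telescoping sum of 1/(i(i+1))
        sum(Fraction(1, i * (i + 1)) for i in range(1, n + 1)) == Fraction(n, n + 1),
        # 5.5: telescoping sum of 2/(i(i+2))
        sum(Fraction(2, i * (i + 2)) for i in range(1, n + 1))
        == Fraction(3, 2) - Fraction(1, n + 1) - Fraction(1, n + 2),
        # 5.6: sum of i * i!
        sum(i * factorial(i) for i in range(1, n + 1)) == factorial(n + 1) - 1,
    ]
    return all(checks)
```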
Figure 5.9: A particular lens of a camera, shown at several different f-stops (f/1, f/1.4, f/2, f/2.8, f/4, and f/5.6). These configurations are only an approximation; the real blades are shaped somewhat differently than is shown here.
5.7 In a typical optical camera lens, the light that enters the lens (through the opening called the aperture) is controlled by a collection of movable blades that can be adjusted inward to narrow the area through which light can pass. (There are two effects of narrowing this opening: first, the amount of light entering the lens is reduced, darkening the resulting image; and, second, the depth of field, the range of distances from the lens at which objects are in focus in the image, increases.) Although some lenses allow continuous adjustment to their openings, many have a sequence of so-called stops: discrete steps by which the aperture narrows. (See Figure 5.9.) These steps are called f-stops (the “f” is short for “focal”), and they are denoted with some unusual notation that you’ll unwind in this exercise. The “fastest” f-stop for a lens measures the ratio of two numbers: the focal length of the lens divided by the diameter of the aperture of the lens. (For example, you might use a lens that’s 50mm long and that has a 25mm diameter, which yields an f-stop of 50mm/25mm = 2.) One can also “stop down” a lens from this fastest setting by adjusting the blades to shrink the diameter of the aperture, as described above. (For example, for the 50mm-long lens with a 25mm diameter, you might reduce the diameter to 12.5mm, which yields an f-stop of 50mm/12.5mm = 4.)
Consider a camera lens with a 50mm focal length, and let $d_0 := 50\text{mm}$ denote the lens’s aperture diameter. “Stopping down” the lens by one step causes the lens’s aperture diameter to shrink by a factor of $\frac{1}{\sqrt{2}}$; that is, the next-smaller aperture diameter for a diameter $d_i$ is defined as
$d_{i+1} := \frac{d_i}{\sqrt{2}}$, for any $i \ge 0$.
Give a closed-form expression for $d_n$; that is, give a nonrecursive numerical expression whose value is equal to $d_n$ (where your expression involves only real numbers and the variable n). Prove your answer correct by induction on n. Also give a closed-form expression for two further quantities:
• the “light-gathering” area (that is, the area of the aperture) of the lens when its diameter is set to $d_n$.
• the f-stop $f_n$ of the lens when its diameter is set to $d_n$.
(Using your formula for $f_n$, can you explain the f-stop names from Figure 5.9?)
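To build intuition before conjecturing a closed form, one might simply iterate the recurrence. The sketch below assumes the 50mm focal length and starting diameter from the exercise; the name aperture_diameters is illustrative. It deliberately computes the values recursively, step by step, so the closed form is still left to derive.

```python
import math

FOCAL_LENGTH_MM = 50.0  # focal length given in the exercise


def aperture_diameters(steps, d0=50.0):
    """Return [d_0, d_1, ..., d_steps], where d_{i+1} = d_i / sqrt(2)."""
    diameters = [d0]
    for _ in range(steps):
        diameters.append(diameters[-1] / math.sqrt(2))
    return diameters


# Print each diameter together with the corresponding f-stop
# (focal length divided by aperture diameter).
for i, d in enumerate(aperture_diameters(5)):
    print(f"d_{i} = {d:.2f} mm, f-stop = {FOCAL_LENGTH_MM / d:.1f}")
```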
5.8 What is the sum of the first n odd positive integers? First, formulate a conjecture by trying a few examples (for example, what’s 1 + 3, for n = 2? What’s 1 + 3 + 5, for n = 3? What’s 1 + 3 + 5 + 7, for n = 4?). Then prove your answer by induction.
5.9 What is the sum of the first n even positive integers? Prove your answer by induction.
5.10 Let α ∈ R and let n ∈ Z≥0, and consider the arithmetic sequence ⟨x0, x0 + α, x0 + 2α, . . .⟩. (Recall that each entry in an arithmetic sequence is a fixed amount more than the previous entry. Three examples are ⟨1, 3, 5, 7, 9, …⟩, with x0 = 1 and α = 2; ⟨25, 20, 15, 10, …⟩, with x0 = 25 and α = −5; and ⟨5, 5, 5, 5, 5, …⟩, with x0 = 5 and α = 0.) An arithmetic sum or arithmetic series is the sum of an arithmetic sequence. For the arithmetic sequence ⟨x0, x0 + α, x0 + 2α, . . .⟩, formulate and prove correct by induction a formula expressing the value of the arithmetic series
$\sum_{i=0}^{n} (x_0 + i\alpha)$.
(Hint: note that $\sum_{i=0}^{n} i\alpha = \alpha \sum_{i=0}^{n} i = \frac{\alpha n(n+1)}{2}$, by Example 5.4.)
5.11 In chess, a knight at position ⟨r, c⟩ can move in an L-shaped pattern to any of eight positions: moving over one row and up/down two columns (⟨r ± 1, c ± 2⟩), or two rows over and one column up/down (⟨r ± 2, c ± 1⟩). (See Figure 5.10.) A knight’s walk is a sequence of legal moves, starting from a square of your choice, that visits every square of the board. Prove by induction that there exists a knight’s walk for any n-by-n chessboard for any n ≥ 4. (A knight’s tour is a knight’s walk that visits every square only once. It turns out that knight’s tours exist for all even n ≥ 6, but you don’t need to prove this fact.)
Figure 5.10: A chess board. The knight can move to any of the marked positions.

5.12 (programming required) In a programming language of your choice, implement your proof from Exercise 5.11 as a recursive algorithm that computes a knight’s walk in an n-by-n chessboard.
5.13 In chess, a rook at position ⟨r, c⟩ can move in a straight line either horizontally or vertically (to ⟨r ± x, c⟩ or ⟨r, c ± x⟩, for any integer x). (See Figure 5.11.) A rook’s tour is a sequence of legal moves, starting from a square of your choice, that visits every square of the board once and only once. Prove by induction that there exists a rook’s tour for any n-by-n chessboard for any n ≥ 1.
Figure 5.12 shows three different fractals. One is the Von Koch snowflake (Figure 5.12(a)), which we’ve already seen: a Von Koch line of size s and level 0 is just a straight line segment; a Von Koch line of size s and level l consists of four Von Koch lines of size (s/3) and level (l − 1) arranged in the shape _/\_; a Von Koch snowflake of size s and level l consists of a triangle of three Von Koch lines of size s and level l.
The other two fractals in Figure 5.12 are new. Figure 5.12(b) shows the Sierpinski triangle: a Sierpinski triangle of level 0 and size s is an equilateral triangle of side length s; a Sierpinski triangle of level (l + 1) is three Sierpinski triangles of level l and side length s/2 arranged in a triangle. Figure 5.12(c) shows a related fractal called the Sierpinski carpet, recursively formed from 8 smaller Sierpinski carpets (arranged in a 3-by-3 grid with a hole in the middle); the base case is just a filled square.
Suppose that we draw each of these fractals at level l and with size 1. What is the perimeter of each of these fractals? (By “perimeter,” we mean the total length of all boundaries separating regions inside the figure from regions outside— which includes, for example, the boundary of the “hole” in the Sierpinski carpet. For the Sierpinski fractals as drawn here, the perimeter is precisely the length of lines separating colored-in from uncolored-in regions.) In each case, conjecture a formula and prove your answer correct by induction.
5.14 Von Koch snowflake 5.15 Sierpinski triangle 5.16 Sierpinski carpet
Draw each of these fractals at level l and with size 1. What is the enclosed area of each of these fractals? (Again, for the Sierpinski fractals as drawn here, the enclosed area is precisely the area of the colored-in regions.)
5.17 Von Koch snowflake 5.18 Sierpinski triangle 5.19 Sierpinski carpet
In the last few exercises, you computed the fractals’ perimeter/area at level l. But what if we continued the fractal-expansion process forever? What are the area and perimeter of an infinite-level fractal? (Hint: use Corollary 5.3.)
5.20 Von Koch snowflake 5.21 Sierpinski triangle 5.22 Sierpinski carpet
Figure 5.11: A rook can move to any of the positions marked with a circle.
The Von Koch snowflake is named after Helge von Koch, a 19th/20th-century Swedish mathematician; the Sierpinski triangle/carpet are named after Wacław Sierpiński, a 20th-century Polish mathematician.
Figure 5.12: Three fractals: the Von Koch snowflake, the Sierpinski triangle, and the Sierpinski carpet.
(a) The Von Koch snowflake, at levels 0, 1, 2, 3, and 4.
(b) The Sierpinski triangle, at levels 0, 1, 2, 3, and 4.
(c) The Sierpinski carpet, at levels 0, 1, 2, and 3.
5.23 (programming required) Write a recursive function sierpinskiTriangle(level, length, x, y), in a language of your choice, to draw a Sierpinski triangle of side length length at level level with bottom-left coordinate ⟨x, y⟩. (You’ll need to use some kind of graphics package with line-drawing capability.)
Write your function so that—in addition to drawing the fractal—it returns both the total length and total area of the triangles that it draws. Use your function to verify some small cases of Exercises 5.15 and 5.18.
5.24 (programming required) Write a recursive function sierpinskiCarpet(level, length, x, y), in a programming language of your choice, to draw a Sierpinski carpet. (See Exercise 5.23 for the meaning of the parameters.) Write your function so that, in addition to drawing the fractal, it also returns the area of the boxes that it encloses. Use your function to verify some small cases of your answer to Exercise 5.19.
5.25 An n-by-n magic square is an n-by-n grid into which the numbers 1, 2, . . . , $n^2$ are placed, once each. The “magic” is that each row, column, and diagonal must be filled with numbers that have the same sum. For example, a 3-by-3 magic square is shown in Figure 5.13. Conjecture and prove a formula for what the sum of each row/column/diagonal must be in an n-by-n magic square.
Figure 5.13: A Magic Square.
4 9 2
3 5 7
8 1 6
Recall from Section 5.2.2 the harmonic numbers, where $H_n := \sum_{i=1}^{n} \frac{1}{i}$ is the sum of the reciprocals of the first n positive integers. Further recall Theorem 5.4, which states that $k + 1 \ge H_{2^k} \ge \frac{k}{2} + 1$ for any integer k ≥ 0.
5.26 In Example 5.7, we proved that $k + 1 \ge H_{2^k}$. Using the same type of reasoning as in the example, complete the proof of Theorem 5.4: show by induction that $H_{2^k} \ge \frac{k}{2} + 1$ for any integer k ≥ 0.
5.27 Generalize Theorem 5.4 to numbers that aren’t necessarily exact powers of 2. Specifically, prove that $\log n + 2 \ge H_n \ge (\log n - 1)/2 + 1$ for any real number n ≥ 1. (Hint: use Theorem 5.4.)
5.28 Prove Bernoulli’s inequality: let x ≥ −1 be an arbitrary real number. Prove by induction on n that $(1 + x)^n \ge 1 + nx$ for any positive integer n.
Prove that the following inequalities f (n) ≤ g(n) hold “for sufficiently large n.” That is, identify an integer k and then prove (by induction on n) that f (n) ≤ g(n) for all integers n ≥ k.
5.29 $2^n \le n!$
5.30 $b^n \le n!$, for an arbitrary integer b ≥ 1
5.31 $3n \le n^2$
5.32 $n^3 \le 2^n$
5.33 Prove that, for any nonnegative integer n, the algorithm odd?(n) returns True if and only if n is odd. (See Figure 5.14.)
5.34 Prove that the algorithm sum(n, m) returns $\sum_{i=n}^{m} i$ (again see Figure 5.14) for any m ≥ n. (Hint: perform induction on the value of m − n.)
5.35 Describe how your proof from Exercise 5.34 would change if Line 4 from the sum algorithm in Figure 5.14 were changed to return m + sum(n, m − 1) instead of n + sum(n + 1, m).
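For experimenting with Exercises 5.33–5.35, the two algorithms of Figure 5.14 translate almost line for line into Python. This is a sketch only: the names is_odd and range_sum are illustrative (Python does not allow "?" in identifiers, and sum is a built-in), and the inputs are assumed to satisfy n ≥ 0 and m ≥ n respectively.

```python
def is_odd(n):
    """odd?(n) from Figure 5.14, for an integer n >= 0."""
    if n == 0:
        return False
    return not is_odd(n - 1)


def range_sum(n, m):
    """sum(n, m) from Figure 5.14: n + (n+1) + ... + m, assuming m >= n."""
    if n == m:
        return m
    return n + range_sum(n + 1, m)
```

Note that each recursive call to range_sum shrinks m − n by one, which is why the hint suggests induction on m − n rather than on n or m alone.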
5.36 Prove by induction on n that $8^n - 3^n$ is divisible by 5 for any nonnegative integer n.
5.37 Conjecture a formula for the value of $9^n \bmod 10$, and prove it correct by induction on n. (Hint: try computing $9^n \bmod 10$ for a few small values of n to generate your conjecture.)
5.38 As in the previous exercise, conjecture a formula for the value of $2^n \bmod 7$, and prove it correct.
Figure 5.14: Two algorithms.
odd?(n):
1: if n = 0 then
2:     return False
3: else
4:     return not odd?(n − 1)
sum(n, m):
1: if n = m then
2:     return m
3: else
4:     return n + sum(n + 1, m)
5.39 Suppose that we count, in binary, using an n-bit counter that goes from 0 to $2^n - 1$. There are $2^n$ different steps along the way: the initial step of 00 · · · 0, and then $2^n - 1$ increment steps, each of which causes at least one bit to be flipped. What is the average number of bit flips that occur per step? (Count the first step as changing all n bits.) For example, for n = 3, we have
000 → 001 → 010 → 011 → 100 → 101 → 110 → 111,
which has a total of 3 + 1 + 2 + 1 + 3 + 1 + 2 + 1 = 14 bit flips. Prove your answer.
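To generate data for a conjecture in Exercise 5.39, one could count the bit flips by brute force. The sketch below is one possible way to do so; the name total_bit_flips is illustrative, and the XOR trick is an implementation choice rather than anything prescribed by the exercise.

```python
def total_bit_flips(n):
    """Total bit flips for an n-bit counter running from 0 to 2**n - 1.

    The first step is counted as changing all n bits, as in the exercise.
    """
    flips = n
    for value in range(1, 2**n):
        # XOR has a 1 in exactly the positions where value and value - 1 differ.
        flips += bin(value ^ (value - 1)).count("1")
    return flips
```

Dividing total_bit_flips(n) by 2**n for several small n gives the per-step averages from which a formula can be guessed and then proven by induction.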
5.40 To protect my backyard from my neighbor, a biology professor who is sometimes a little over-friendly, I have acquired a large army of vicious robotic dogs. Unfortunately the robotic dogs in this batch are very jealous, and they must be separated by fences; in fact, they can’t even face each other directly through a fence. So I have built a collection of n fences to separate my backyard into polygonal regions, where each fence completely crosses my yard (that is, it goes from property line to property line, possibly crossing other fences). I wish to deploy my robotic dogs to satisfy the following property:
For any two polygonal regions that share a boundary (that is, are separated by a fence segment), one of the two regions has exactly one robotic dog and the other region has zero robotic dogs.
(See Figure 5.15.) Prove by induction on n that this condition is satisfiable for any collection of n fences.
Figure 5.15: A configuration of fences, and a valid way to deploy my dogs.

5.3 Strong Induction
It’s not true that life is one damn thing after another; it is one damn thing over and over.
Edna St. Vincent Millay (1892–1950)
In the proofs by induction in Section 5.2, we established the claim ∀n ∈ Z≥0 : P(n) by proving P(0) [the base case] and proving that P(n − 1) ⇒ P(n) [the inductive case]. But let’s think again about what happens in an inductive proof, as we build up facts about P(n) for ever-increasing values of n. (Glance at Example 5.1 again.)
1. We prove P(0).
2. We prove P(0) ⇒ P(1), so we conclude P(1), using Fact #1.
Now we wish to prove P(2). In a proof by induction like those from Section 5.2, we’d proceed as follows:
3. We prove P(1) ⇒ P(2), so we conclude P(2), using Fact #2.
In a proof by strong induction, we allow ourselves to make use of more assumptions: namely, we know that P(1) and P(0) when we’re trying to prove P(2). (By way of contrast, we’ll refer to proofs like those from Section 5.2 as using weak induction.) In a proof by strong induction, we proceed as follows instead:
3′. We prove P(0) ∧ P(1) ⇒ P(2), so we conclude P(2), using Fact #1 and Fact #2.
In a proof by strong induction, in the inductive case we prove P(n) by assuming n different inductive hypotheses: P(0), P(1), P(2), . . . , and P(n − 1). Or, less formally: in the inductive case of a proof by weak induction, we show that if P “was true last time” then it’s still true this time; in the inductive case of a proof by strong induction, we show that if P “has been true up until now” then it’s still true this time.
5.3.1 A Definition and a First Example
Here is the formal definition of a proof by strong induction:
Definition 5.5 (Proof by strong induction)
Suppose that we want to prove that P(n) holds for all n ∈ Z≥0. To give a proof by strong induction of ∀n ∈ Z≥0 : P(n), we prove the following:
1. the base case: prove P(0).
2. the inductive case: for every n ≥ 1, prove [P(0) ∧ P(1) ∧ · · · ∧ P(n − 1)] ⇒ P(n).
Generally speaking, using strong induction makes sense when the “reason for” P(n) is that P(k) is true for more than one index k ≤ n − 1, or that P(k) is true for some index k ≤ n − 2. (For weak induction, the “reason for” P(n) is that P(n − 1) is true.)
Strong induction makes the inductive case easier to prove than weak induction, because the claim that we need to show (that is, P(n)) is the same, but we get to use more assumptions in strong induction: in strong induction, we’ve assumed all of P(0) ∧ P(1) ∧ . . . ∧ P(n − 1); in weak induction, we’ve assumed only P(n − 1). We can always ignore those extra assumptions, so it’s never harder to prove something by strong induction than with weak induction. (Strong induction is actually equivalent to weak induction; anything that can be proven with one can also be proven with the other. See Exercises 5.75–5.76.)
A first example: a simple algorithm for parity
In the rest of this section, we’ll give several examples of proofs by strong induction. We’ll start here with a proof of correctness for a blazingly simple algorithm that computes the parity of a nonnegative integer. (Recall that the parity of n is the “evenness” or “oddness” of n.) See Figure 5.16 for the parity algorithm.
We’ve already used (weak) induction to prove the correctness of recursive algorithms that, given an input of size n, call themselves on an input of size n − 1. (That’s how we proved the correctness of the factorial algorithm fact from Example 5.9.) But for recursive algorithms that call themselves on smaller inputs but not necessarily of size n − 1, like parity, we can use strong induction to prove their correctness.
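The parity algorithm of Figure 5.16 is short enough to transcribe directly. Here is a minimal Python sketch, assuming (as the figure does) that the input is a nonnegative integer.

```python
def parity(n):
    """parity(n) from Figure 5.16, assuming n >= 0 is an integer."""
    if n <= 1:
        return n
    # The recursive call is on n - 2, not n - 1, which is why the
    # correctness proof uses strong induction and two base cases.
    return parity(n - 2)
```

Outside the assumed domain the function misbehaves: parity(-1) returns -1, whereas (-1) % 2 evaluates to 1 in Python.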
Example 5.11 (The correctness of parity)
Claim: For any nonnegative integer n ≥ 0, parity(n) = n mod 2.
Proof. Write P(n) to denote the property that parity(n) = n mod 2. We proceed by strong induction on n to show that P(n) holds for all n ≥ 0:
base cases (n = 0 and n = 1): By inspection of the algorithm, parity(0) returns 0 in Line 2, and, indeed, 0 mod 2 = 0. Similarly, we have parity(1) = 1, and 1 mod 2 = 1 too. Thus P(0) and P(1) hold.
inductive case (n ≥ 2): Assume the inductive hypothesis P(0) ∧ P(1) ∧ · · · ∧ P(n − 1). Namely, assume that
for any integer 0 ≤ k < n, we have parity(k) = k mod 2.
We must prove P(n); that is, we must prove parity(n) = n mod 2:
parity(n) = parity(n − 2)      by inspection (specifically because n ≥ 2 and by Line 4)
          = (n − 2) mod 2      by the inductive hypothesis P(n − 2)
          = n mod 2,
where (n − 2) mod 2 = n mod 2 by Definition 2.9. (Note that the inductive hypothesis applies for k := n − 2 because n ≥ 2 and therefore 0 ≤ n − 2 < n.)
Writing tip: While anything that can be proven using weak induction can also be proven using strong induction, you should still use the tool that’s best suited to the job: generally, the one that makes the argument easiest to understand.
Figure 5.16: A simple parity algorithm.
parity(n): // assume that n ≥ 0 is an integer.
1: if n ≤ 1 then
2:     return n
3: else
4:     return parity(n − 2)
There are two things to note about the proof in Example 5.11. First, using strong induction instead of weak induction made sense because the inductive case relied on P(n − 2) to prove P(n); we did not show P(n − 1) ⇒ P(n). Second, we needed two base cases: the “reason” that P(1) holds is not that P(−1) was true. (In fact, P(−1) is false: parity(−1) isn’t equal to 1! Think about why.) The inductive case of the proof in Example 5.11 does not correctly apply for n = 1, and therefore we had to handle that case separately.
5.3.2 Some Further Examples of Strong Induction
We’ll continue this section with several more examples of proofs by strong induction. We’ll first turn to a proof about prime factorization of integers, and then look at one geometric and one algorithmic claim.
Prime factorization
Recall that an integer n ≥ 2 is called prime if the only positive integers that evenly divide it are 1 and n itself. It’s a basic fact about numbers that any positive integer can be uniquely expressed as the product of primes (Theorem 5.5). While proving the uniqueness requires a bit more work (see Section 7.3.3), we can give a proof using strong induction to show that a prime factorization exists.
Theorem 5.5 (Prime Factorization Theorem)
Let n ∈ Z≥1 be a positive integer. Then there exist k ≥ 0 prime numbers $p_1, p_2, \ldots, p_k$ such that $n = \prod_{i=1}^{k} p_i$. Furthermore, up to reordering, the primes $p_1, p_2, \ldots, p_k$ are unique.
(The prime factorization theorem is also sometimes called the Fundamental Theorem of Arithmetic.)
Example 5.12 (Prime factorization)
Let P(n) denote the first part of Theorem 5.5, namely the claim
there exist k ≥ 0 prime numbers $p_1, p_2, \ldots, p_k$ such that $n = \prod_{i=1}^{k} p_i$.
We will prove that P(n) holds for any integer n ≥ 1, by strong induction on n.
base case (n = 1): Recall that the product of zero multiplicands is 1. (See Section 2.2.7.) Thus we can write n as the product of zero prime numbers. Thus P(1) holds.
inductive case (n ≥ 2): We assume the inductive hypothesis; namely, we assume that P(n′) holds for any positive integer n′ where 1 ≤ n′ ≤ n − 1. We must prove P(n). There are two cases:
• If n is prime, then there’s nothing to do: define $p_1 := n$, and we’re done immediately. (We’ve written n as the product of 1 prime number.)
• If n is not prime, then by definition n can be written as the product n = a · b, for positive integers a and b satisfying 2 ≤ a ≤ n − 1 and 2 ≤ b ≤ n − 1. (The definition of (non)primality says that n = a · b for a ∉ {1, n}; it should be easy to convince yourself that neither a nor b can be smaller than 2 or larger than n − 1.) By the inductive hypotheses P(a) and P(b), we have
$a = q_1 \cdot q_2 \cdots q_l$ and $b = r_1 \cdot r_2 \cdots r_m$   (∗)
for prime numbers $q_1, \ldots, q_l$ and $r_1, \ldots, r_m$. By (∗) and the fact that n = a · b,
$n = q_1 \cdot q_2 \cdots q_l \cdot r_1 \cdot r_2 \cdots r_m$.
Because each $q_i$ and $r_i$ is prime, we have now written n as the product of l + m prime numbers, and P(n) holds.
The theorem follows.
Taking it further: As with any inductive proof, it may be useful to view the proof from Example 5.12 as a recursive algorithm, as shown in Figure 5.17.
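One way to make this recursive view concrete is sketched below in Python. The proof only asserts that factors a and b exist for composite n; the sketch assumes one particular way of finding them (trial division for the smallest divisor), which is a choice made here for illustration and not part of the proof. The name prime_factor is likewise illustrative.

```python
import math


def prime_factor(n):
    """Return a list of primes whose product is n, for an integer n >= 1."""
    if n == 1:
        return []  # the product of zero primes is 1, so P(1) holds
    # Assumed search strategy: look for the smallest divisor a >= 2.
    for a in range(2, math.isqrt(n) + 1):
        if n % a == 0:
            b = n // a
            # Both a and b lie between 2 and n - 1, so the recursive calls
            # correspond to the inductive hypotheses P(a) and P(b).
            return prime_factor(a) + prime_factor(b)
    return [n]  # no divisor up to sqrt(n), so n itself is prime
```

A different search strategy could return the same primes in a different order; only the ordering depends on how the factors are found.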
(Notice that there’s some magic in the “algorithm,” in the sense that Line 7 doesn’t tell us how to find the values of a and b, but we do know that such values exist, by definition.) We can think of the inductive case of an inductive proof as “making a recursive call” to a proof for a smaller input.
For example, primeFactor(2) returns ⟨2⟩ and primeFactor(5) returns ⟨5⟩, because both 2 and 5 are prime. For another example, the result of primeFactor(10) is ⟨2, 5⟩, because 10 is not prime, but we can write 10 = 2 · 5 and primeFactor(2) returns ⟨2⟩ and primeFactor(5) returns ⟨5⟩. The result of primeFactor(70) could be ⟨7, 2, 5⟩, because 70 is not prime, but we can write 70 = 7 · 10 and primeFactor(7) returns ⟨7⟩ and primeFactor(10) returns ⟨2, 5⟩. Or primeFactor(70) could be ⟨7, 5, 2⟩ because 70 = 35 · 2, and primeFactor(35) returns ⟨7, 5⟩ and primeFactor(2) returns ⟨2⟩. (Which ordering of the values is the output depends on the magic of Line 7. The second part of Theorem 5.5, about the uniqueness of the prime factorization, says that it is only the ordering of these numbers that depends on the magic; the numbers themselves must be the same.)
Triangulating a polygon
We’ll now turn to a proof by strong induction about a geometric question, instead of a numerical one. A convex polygon is, informally, the points “inside” a set of n vertices: imagine stretching a giant rubber band around n points in the plane; the polygon is defined as the set of all points contained inside the rubber band. See Figure 5.18 for an example. Here we will show that an arbitrary convex polygon can be decomposed into a collection of nonoverlapping triangles.
Example 5.13 (Decomposing a polygon into triangles)
Problem: Prove the following claim:
Claim: Any convex polygon P with k ≥ 3 vertices can be decomposed into a set of k − 2 triangles whose interiors do not overlap.
(For an example, and an outline of a possible proof, see Figure 5.19.)
Figure 5.17: The proof of Example 5.12, interpreted as a recursive algorithm.
primeFactor(n):
1: if n = 1 then
2:     return ⟨⟩                        or “P(1) is true!”
3: else if n is prime then
4:     return ⟨n⟩                       or “P(n) is true!”
5: else
6:     find factors a, b where 2 ≤ a ≤ n − 1 and 2 ≤ b ≤ n − 1
7:         such that n = a · b.
8:     ⟨q1, ..., qk⟩ := primeFactor(a)
9:     ⟨r1, ..., rm⟩ := primeFactor(b)
10:    return ⟨q1, ..., qk, r1, ..., rm⟩   or “P(n) is true, because P(a) ∧ P(b)!”
Figure 5.18: A polygon. The dots are called vertices; the lines connecting them are the sides; and the shaded region (excluding the boundary) is the interior.
(a) The original polygon P.
(b) Two vertices u, v of P, and P divided into A and B (above and below the ⟨u, v⟩ line).
(c) The subpolygons A and B divided into triangles, using the inductive hypothesis.
Solution: Let Q(k) denote the claim that any k-vertex polygon can be decomposed into a set of k − 2 interior-disjoint triangles. We’ll give a proof by strong induction on k that Q(k) holds for all k ≥ 3. (Note that strong induction isn’t strictly necessary to prove this claim; we could give an alternative proof using weak induction.)
base case (k = 3): There’s nothing to do: any 3-vertex polygon P is itself a triangle, so the collection {P} is a set of k − 2 = 1 triangles whose interiors do not intersect (vacuously, because there is only one triangle). Thus Q(3) holds.
inductive case (k ≥ 4): We assume the inductive hypothesis: any convex polygon with 3 ≤ l < k vertices can be decomposed into a set of l − 2 interior-disjoint triangles. (That is, we assume Q(3), Q(4), . . . , Q(k − 1).) We must prove Q(k). Let P be an arbitrary k-vertex polygon. Let u and v be any two nonadjacent vertices of P. (Because k ≥ 4, such a pair exists.) Define A as the “above the ⟨u, v⟩ line” piece of P and B as the “below the ⟨u, v⟩ line” piece of P. Notice that P = A ∪ B, both A and B are convex, and the interiors of A and B are disjoint.
Let l be the number of vertices in A. Observe that l ≥ 3 and l < k because u and v are nonadjacent. Also observe that B contains precisely k − l + 2 vertices.
(The “+ 2” is because vertices u and v appear in both A and B.) Note that both 3 ≤ l ≤ k − 1 and 3 ≤ k − l + 2 ≤ k − 1, so we can apply the inductive hypothesis to both l and k − l + 2.
Therefore, by the inductive hypothesis Q(l), the polygon A is decomposable into a set S of l − 2 interior-disjoint triangles. Again by the inductive hypothesis Q(k − l + 2), the polygon B is decomposable into a set T of k − l + 2 − 2 = k − l interior-disjoint triangles. Furthermore, because A and B are interior disjoint, the triangles of S ∪ T all have disjoint interiors. Thus P itself can be decomposed into the union of these two sets of triangles, yielding a total of l − 2 + k − l = k − 2 interior-disjoint triangles.
We’ve shown both Q(3) and Q(3) ∧ · · · ∧ Q(k − 1) ⇒ Q(k) for any k ≥ 4, which completes the proof by strong induction.
Taking it further: The style of triangulation from Example 5.13 has particularly important implications in computer graphics, in which we seek to render representations of complicated real-world scenes using computational techniques. In many computer graphics applications, complex surfaces are decomposed into small triangular regions, which are then rendered individually. See p. 528 for more discussion.
Figure 5.19: An example of the recursive decomposition of a polygon into interior-disjoint triangles.
quickSort(A[1 . . . n]):
1: if n ≤ 1 then
2:     return A
3: else
4:     choose pivot ∈ {1, . . . , n}, somehow.
5:     L := ⟨⟩
6:     R := ⟨⟩
7:     for i ∈ {1, . . . , n} with i ≠ pivot:
8:         if A[i] < A[pivot] then
9:             append A[i] to L
10:        else
11:            append A[i] to R
12:    L := quickSort(L)
13:    R := quickSort(R)
14:    return L + ⟨A[pivot]⟩ + R
(a) The pseudocode.
(b) An example of quick sort. Starting from an array 7 2 4 3 1 6 5 8 9, we (through whatever mechanism) choose 3 as the pivot value, divide the array into the elements < 3 and those > 3, and recursively sort those two pieces.
7 2 4 3 1 6 5 8 9
choose 3 as the pivot value; partition into L and R:
2 1 (L) | 3 | 7 4 6 5 8 9 (R)
recursively sort L and R:
1 2 (L, sorted) | 3 | 4 5 6 7 8 9 (R, sorted)
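The Quick Sort pseudocode in Figure 5.20(a) can be sketched in Python roughly as follows. The sketch assumes 0-based indexing and distinct elements, and it makes the "somehow" of the pivot choice an explicit parameter; the names quick_sort and choose_pivot are illustrative, and defaulting to the first index is an arbitrary choice.

```python
def quick_sort(a, choose_pivot=lambda a: 0):
    """Return the elements of the list a in sorted order (distinct elements assumed)."""
    if len(a) <= 1:
        return a
    pivot = choose_pivot(a)  # any index in range(len(a)) is allowed
    left, right = [], []
    for i, value in enumerate(a):
        if i == pivot:
            continue  # the pivot element goes in neither sublist
        if value < a[pivot]:
            left.append(value)
        else:
            right.append(value)
    # Both sublists are strictly shorter than a, so the recursion terminates.
    return quick_sort(left, choose_pivot) + [a[pivot]] + quick_sort(right, choose_pivot)
```

Passing a different rule, for example `quick_sort(xs, lambda a: len(a) // 2)`, changes which comparisons are made but not the returned result.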
Proving algorithms correct: Quick Sort
We’ve now seen a proof of correctness by strong induction for a simple recursive algorithm (for parity), and proofs of somewhat more complicated non-algorithmic properties. Here we’ll prove the correctness of a somewhat more complicated algorithm, the recursive sorting algorithm called Quick Sort, again using strong induction.
The idea of the Quick Sort algorithm is to select a pivot value x from an input array A; we then partition the elements of A into those less than x (which we then sort recursively), then x itself, and finally the elements of A greater than x (which we again sort recursively). We also need a base case: an input array with fewer than 2 elements is already sorted. (See Figure 5.20(a) for the algorithm.) For example, suppose we wish to sort all 43 U.S. Presidents by birthday. (Grover Cleveland will appear only once.) Barack Obama’s birthday is August 4th. If we choose him as the pivot, then Quick Sort would first divide all the other presidents into two lists, of those with pre–August 4th and post–August 4th birthdays,
before[1 . . . 23] := ⟨George Washington [February 22nd], . . . , George W. Bush [July 6th]⟩
after[1 . . . 19] := ⟨John Adams [October 30th], . . . , Bill Clinton [August 19th]⟩,
and then recursively sort before and after. Then the final sorted list will be
prez[1], . . . , prez[23] (before, in sorted order); prez[24] (Barack Obama); prez[25], . . . , prez[43] (after, in sorted order).
(See Figure 5.20(b) for another example of Quick Sort.)
Even without two Grover Cleveland entries in the array, the simplifying assumption that we’re making about distinct elements actually doesn’t apply for the U.S. Presidents: James Polk and Warren Harding were both born on November 2nd. (Think about how you’d modify the proof that follows to handle duplicates.)
Figure 5.20: Quick Sort: pseudocode, and an example.
While the efficiency of Quick Sort depends crucially on how we choose the pivot value (see Chapter 6), the correctness of the algorithm holds regardless of that choice. For simplicity, we will prove that Quick Sort correctly sorts its input under the assumption that all the elements of the input array A are distinct. (The more general case, in which there may be duplicate elements, is conceptually no harder, but is a bit more tedious.) It is easy to see by inspection of the algorithm that quickSort(A) returns a reordering of the input array A; that is, Quick Sort neither deletes nor inserts elements. Thus the real work is to prove that Quick Sort returns a sorted array:
Example 5.14 (Correctness of Quick Sort)
Claim: For any array A with distinct elements, quickSort(A) returns a sorted array.
Proof. Let P(n) denote the claim that quickSort(A[1 … n]) returns a sorted array for any n-element array A with distinct elements. We’ll prove P(n) for every n ≥ 0, by strong induction on n.
base cases (n = 0 and n = 1): Both P(0) and P(1) are trivial: any array of length 0 or 1 is sorted.
inductive case (n ≥ 2): We assume the inductive hypothesis P(0), …, P(n − 1): for any array B[1 . . . k] with distinct elements and length k < n, quickSort(B) returns a sorted array. We must prove P(n). Let A[1 . . . n] be an arbitrary array with distinct elements. Let pivot ∈ {1, . . . , n} be arbitrary. We must prove that, for any two distinct elements x and y of A, x appears before y in quickSort(A) if and only if x < y. We proceed by cases, based on the relationship between x, y, and A[pivot]. (See Figure 5.21.)
Case 1: x = A[pivot]. The elements appearing after x in quickSort(A) are precisely the elements of R. And R is exactly the set of elements greater than x. Thus x appears before y if and only if y appears in R, which occurs if and only if x < y.
Case 2: y = A[pivot]. Analogously to Case 1, x appears before y if and only if x appears in L, which occurs if and only if x < y.
Case 3: x < A[pivot] and y < A[pivot]. Then both x and y appear in L. Because A[pivot] does not appear in L, we know that L contains at most n − 1 elements, all of which are distinct because they’re a subset of the distinct elements of A. Thus the inductive hypothesis P(|L|) says that x appears before y in quickSort(L) if and only if x < y. And x appears before y in quickSort(A) if and only if x appears before y in quickSort(L).
Case 4: x > A[pivot] and y > A[pivot]. Then both x and y appear in R. An analogous argument to Case 3 shows that x appears before y if and only if x < y.
Case 5: x < A[pivot] and y > A[pivot]. It is immediate both that x appears before y (because x is in L and y is in R) and that x < y.
Case 6: x > A[pivot] and y < A[pivot]. […]
Figure 5.21: The cases of the proof in Example 5.14.
[…]
    else if A[middle] > x then
        return binarySearch(A[1 … middle − 1], x)
    else
        return binarySearch(A[middle + 1 … n], x)
merge(X[1 . . . n], Y[1 . . . m]):
    if n = 0 then
        return Y
    else if m = 0 then
        return X
    else if X[1] < Y[1] then
        return X[1] followed by merge(X[2 . . . n], Y)
    else
        return Y[1] followed by merge(X, Y[2 . . . m])

mergeSort(A[1 . . . n]):
    if n = 1 then
        return A
    else
        L := mergeSort(A[1 . . . ⌊n/2⌋])
        R := mergeSort(A[⌊n/2⌋ + 1 . . . n])
        return merge(L, R)

Figure 5.30: Binary Search, Merge, and Merge Sort, recursively.

5.4 Recursively Defined Structures and Structural Induction
When a thing is done, it’s done. Don’t look back. Look forward to your next objective.
George C. Marshall (1880–1959)
In the proofs that we have written so far in this chapter, we have performed induction on an integer: the number that’s the input to an algorithm, the number of vertices of a polygon, the number of elements in an array. In this section, we will address proofs about recursively defined structures, instead of about integers, using a version of induction called structural induction that proceeds over the defined structure itself, rather than just using numbers.
5.4.1 Recursively Defined Structures
A recursively defined structure, just like a recursive algorithm, is a structure defined in terms of one or more base cases and one or more inductive cases. Any data type that can be understood as either a trivial instance of the type or as being built up from a smaller instance (or smaller instances) of that type can be expressed in this way. For example, basic data structures like a linked list and a binary tree can be defined recursively. So too can well-formed sentences of a formal language—languages like Python, or propositional logic—among many other examples. In this section, we’ll give recursive definitions for some of these examples.
Linked lists
A linked list is a commonly used data structure in which we store a sequence of elements (just like the sequences from Section 2.4).
The reasons that linked lists are useful are best left to a data structures course, but here is a brief synopsis of what a linked list actually is. Each element in the list, called a node, stores a data value and a “pointer” to the rest of the list. A special value, often called null, represents the empty list; the last node in the list stores this value as its pointer to represent that there are no further elements in the list. See Figure 5.31 for an example. (The slashed line in Figure 5.31 represents the null value.) Here is a recursive definition of a linked list:
Example 5.15 (Linked list)
A linked list is either:
1. ⟨⟩, known as the empty list; or
2. ⟨x, L⟩, where x is an arbitrary element and L is a linked list.
For example, Figure 5.31 shows the linked list that consists of 1 followed by the linked list containing 7, 7, and 6 (which is a linked list consisting of 7 followed by a linked list containing 7 and 6, which is a linked list consisting of 7 followed by the linked list containing 6, which is . . . ). That is, Figure 5.31 shows the linked list ⟨1, ⟨7, ⟨7, ⟨6, ⟨⟩⟩⟩⟩⟩.
Figure 5.31: An example linked list (nodes 1, 7, 7, 6).
Binary trees
We can also recursively define a binary tree (see Section 11.4.2). Again, deferring the discussion of why binary trees are useful to a course on data structures, here is a quick summary of what they are. Like a linked list, a binary tree is a collection of nodes that store data values and “pointers” to other nodes. Unlike a linked list, a node in a binary tree stores two pointers to other nodes (or null, representing an empty binary tree). These two pointers are to the left child and right child of the node. The root node is the one at the very top of the tree. See Figure 5.32 for an example; here the root node stores the value 1, and has a left child (the binary tree with root 3) and a right child (the binary tree with root 2).
Here is a recursive definition:
Example 5.16 (Binary trees)
A binary tree is either:
1. the empty tree, denoted by null; or
2. a root node x, a left subtree Tl, and a right subtree Tr, where x is an arbitrary value and Tl and Tr are both binary trees.
Taking it further: In many programming languages, we can explicitly define data types that echo these recursive definitions, where the base case is a trivial instance of the data structure (often nil or None or null). In C, for example, we can define a binary tree with integer-valued nodes as:
    struct binaryTree {
        int root;                          /* value stored at the root node */
        struct binaryTree *leftSubtree;    /* NULL if the left subtree is empty */
        struct binaryTree *rightSubtree;   /* NULL if the right subtree is empty */
    };
The base case—an empty binary tree—is NULL; the inductive case—a binary tree with a root node—has a value stored as its root, and then two binary trees (possibly empty) as its left and right subtrees. (In C, the symbol * means that we’re storing a reference, or pointer, to the subtrees, rather than the subtrees themselves, in the data structure.)
Define the leaves of a binary tree T to be those nodes contained in T whose left subtree and right subtree are both null. Define the internal nodes of T to be all nodes that are not leaves. In Figure 5.32, for example, the leaves are the nodes 5 and 8, and the internal nodes are {1, 2, 3, 4}.
Taking it further: Binary trees with certain additional properties turn out to be very useful ways of organizing data for efficient access. For example, a binary search tree is a binary tree in which each node stores a “key,” and the tree is organized so that, for any node u, the key at node u is larger than all the keys in u’s left subtree and smaller than all the keys in u’s right subtree. (For example, we might store the email address of a student as a key; the tree is then organized alphabetically.) Another special type of binary tree is a heap, in which each node’s key is larger than all the keys in its subtrees.
These two data structures are very useful in making certain common operations very efficient; see p. 529 (for heaps) and p. 1160 (for binary search trees) for more discussion.
Figure 5.32: An example binary tree.
Sentences in a language
In addition to data structures, we can also define sentences in a language using a recursive definition—for example, arithmetic expressions of the type that are understood by a simple calculator; or propositions (as in Chapter 3’s propositional logic):
Example 5.17 (Arithmetic expressions)
An arithmetic expression is any of the following:
1. any integer n;
2. −E, where E is an arithmetic expression; or
3. E ⊙ F, where E and F are arithmetic expressions and ⊙ ∈ {+, −, ·, /} is an operator.
Example 5.18 (Sentences of propositional logic)
A sentence of propositional logic (also known as a well-formed formula, or wff) over the propositional variables X is one of the following:
1. x, for some x ∈ X;
2. ¬P, where P is a wff over X; or
3. P ∨ Q, P ∧ Q, or P ⇒ Q, where P and Q are wffs over X.
We implicitly used the recursive definition of logical propositions from Example 5.18 throughout Chapter 3, but using this recursive definition explicitly allows us to express a number of concepts more concisely. For example, consider a truth assignment f : X → {True, False} that assigns True or False to each variable in X. Then the truth value of a proposition over X under the truth assignment f can be defined recursively for each case of the definition:
• the truth value of x ∈ X under f is f(x);
• the truth value of ¬P under f is True if the truth value of P under f is False, and the truth value of ¬P under f is False if the truth value of P under f is True;
• and so forth.
Taking it further: Linguists interested in syntax spend a lot of energy constructing recursive definitions (like those in Examples 5.17 and 5.18) of grammatical sentences of English.
But one can also give a recursive definition for non-natural languages: in fact, another structure that can be defined recursively is the grammar of a programming language itself. As such, this type of recursive approach to defining (and processing) a grammar plays a key role not just in linguistics but also in computer science. See the discussion on p. 543 for more.
5.4.2 Structural Induction
The recursively defined structures from Section 5.4.1 are particularly amenable to inductive proofs. For example, recall from Example 5.16 that a binary tree is one of the following: (1) the empty tree, denoted by null; or (2) a root node x, a left subtree Tl, and a right subtree Tr, where Tl and Tr are both binary trees. To prove that some property P is true of all binary trees T, we can use (strong) induction on the number n of applications of rule #2 from the definition. Here is an example of such a proof:
Example 5.19 (Internal nodes vs. leaves in binary trees)
Recall that a leaf in a binary tree is a node whose left and right subtrees are both empty; an internal node is any non-leaf node. Write leaves(T) and internals(T) to denote the number of leaves and internal nodes in a binary tree T, respectively.
Claim: In any binary tree T, we have leaves(T) ≤ internals(T) + 1.
Proof. We proceed by strong induction on the number of applications of rule #2 used to generate T. Specifically, let P(n) denote the property that leaves(T) ≤ internals(T) + 1 holds for any binary tree T generated by n applications of rule #2; we’ll prove that P(n) holds for all n ≥ 0, which establishes the claim.
base case (n = 0): The only binary tree generated with 0 applications of rule #2 is the empty tree null. Indeed, leaves(null) = internals(null) = 0, and 0 ≤ 0 + 1.
inductive case (n ≥ 1): Assume the inductive hypothesis P(0) ∧ P(1) ∧ · · · ∧ P(n − 1): for any binary tree B generated using k < n applications of rule #2, we have leaves(B) ≤ internals(B) + 1. We must prove P(n). We’ll handle the case n = 1 separately.
(See Figure 5.33(a).) The only way to make a binary tree T using one application of rule #2 is to use rule #1 for both of T’s subtrees, so T must contain only one node (which is itself a leaf). Then T contains 1 leaf and 0 internal nodes, and indeed 1 ≤ 0 + 1.
Otherwise n ≥ 2. (See Figure 5.33(b).) Observe that the tree T must have been generated by (a) generating a left subtree Tl using some number l of applications of rule #2; (b) generating a right subtree Tr using some number r of applications of rule #2; and then (c) applying rule #2 to a root node x, Tl, and Tr to produce T. Therefore r + l + 1 = n, and therefore r < n and l < n. Ergo, we can apply the inductive hypothesis to both Tl and Tr, and thus
    leaves(Tl) ≤ internals(Tl) + 1     (1)
    leaves(Tr) ≤ internals(Tr) + 1.    (2)
Also observe that, because r + l + 1 = n ≥ 2, either Tr ≠ null or Tl ≠ null, or both. Thus the leaves of T are the leaves of Tl and Tr, and internal nodes of T are the internal nodes of Tl and Tr plus the root x (which cannot be a leaf because at least one of Tl and Tr is not empty). Therefore
    leaves(T) = leaves(Tl) + leaves(Tr)                  (3)
    internals(T) = internals(Tl) + internals(Tr) + 1.    (4)
Putting together these facts, we have
    leaves(T) = leaves(Tl) + leaves(Tr)                  by (3)
              ≤ internals(Tl) + 1 + internals(Tr) + 1    by (1) and (2)
              = internals(T) + 1.                        by (4)
Thus P(n) holds, which completes the proof.
An (abbreviated) reminder of the recursive definition of a binary tree: Rule #1: null is a binary tree; Rule #2: if Tl and Tr are binary trees, then ⟨x, Tl, Tr⟩ is a binary tree.
Figure 5.33: Illustrations of the inductive case for Example 5.19. (a) The only binary tree produced by 1 application of rule #2 has one node, which is a leaf. (b) If T was produced by ≥ 2 applications of rule #2, then at least one of Tl and Tr is not null, and the leaves of T are precisely the leaves of Tl plus the leaves of Tr.
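The claim of Example 5.19 can also be checked empirically on small trees. The following Python sketch is only an illustration, not part of the proof: the class name Node and the helper all_trees are hypothetical names chosen here, and None stands in for null. It counts leaves and internal nodes recursively, following the two rules of the definition, and tests the inequality on every tree shape built with at most six applications of rule #2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Rule #2: a root value plus a left and a right subtree (None plays the role of null)."""
    value: int
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None


def leaves(t):
    """Number of nodes of t whose subtrees are both empty."""
    if t is None:                      # rule #1: the empty tree has no leaves
        return 0
    if t.left is None and t.right is None:
        return 1                       # the root itself is a leaf
    return leaves(t.left) + leaves(t.right)


def internals(t):
    """Number of nodes of t that are not leaves."""
    if t is None or (t.left is None and t.right is None):
        return 0
    return internals(t.left) + internals(t.right) + 1


def all_trees(n):
    """Yield every tree shape generated by exactly n applications of rule #2."""
    if n == 0:
        yield None
        return
    for l in range(n):                 # l applications on the left, n - 1 - l on the right
        for tl in all_trees(l):
            for tr in all_trees(n - 1 - l):
                yield Node(0, tl, tr)


for n in range(7):
    for t in all_trees(n):
        assert leaves(t) <= internals(t) + 1
```

A passing check on small cases is evidence, not a proof; the induction above is what covers every tree.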
Structural induction: the idea
The proof in Example 5.19 is perfectly legitimate, but there is another approach that we can use for recursively defined structures, called structural induction. The basic idea is to perform induction on the structure of an object itself rather than on some integer: instead of a case for n = 0 and a case for n ≥ 1, in a proof by structural induction our cases correspond directly to the cases of the recursive structural definition.
For structural induction to make sense, we must impose some restrictions on the recursive definition. Specifically, the set of structures defined must be well ordered, which intuitively ensures that every invocation of the inductive case of the definition “makes progress” toward the base case(s) of the definition. (More precisely, a set of objects is well ordered if there’s a “least” element among any collection of those objects.) For the type of recursive definitions that we’re considering—where there are base cases in the definition, and all instances of the structure are produced by a finite-length sequence of applications of the inductive rules in the definition—structural induction is a valid technique to prove facts about the recursively defined structure.
Taking it further: More formally, a set S of structures is well ordered if there exists a “smaller than” relationship ≺ between elements of S such that, for any nonempty T ⊆ S, there exists a minimal element m in T—that is, there exists m ∈ T such that no x ∈ T satisfies x ≺ m. (There might be more than one least element in T.) For example, the set Z≥0 is well ordered, using the normal ≤ relationship. However, the set R is not well ordered: for example, the set {x ∈ R : x > 2} has no smallest element using ≤. But the set of binary trees is well ordered; the relation ≺ is “is a subtree of.”
One can prove that a set S is well ordered if and only if a proof by mathematical induction is valid on S (where the base cases are the minimal elements of S, and to prove P(x) we assume the inductive hypotheses P(y) for any y ≺ x).
Proofs by structural induction
Here is the formal definition of a proof by structural induction:
Definition 5.6 (Proof by structural induction)
Suppose that we want to prove that P(x) holds for every x ∈ S, where S is the (well-ordered) set of structures generated by a recursive definition, and P is some property. To give a proof by structural induction of ∀x ∈ S : P(x), we prove the following:
1. Base cases: for every x defined by a base case in the definition of S, prove P(x).
2. Inductive cases: for every x defined in terms of y1, y2, . . . , yk ∈ S by an inductive case in the definition of S, prove that P(y1) ∧ P(y2) ∧ · · · ∧ P(yk) ⇒ P(x).
In a proof by structural induction, we can view both base cases and inductive cases in the same light: each case assumes that the recursively constructed subpieces of a structure x satisfy the stated property, and we prove that x itself also satisfies the property. For a base case, the point is just that there are no recursively constructed pieces, so we actually are not making any assumption.
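The same case-by-case shape appears in recursive functions over a recursively defined structure: one branch per base case, one branch per inductive case. As a minimal sketch, assuming we represent the linked lists of Example 5.15 as nested Python tuples (the names EMPTY, cons, and total are choices made here for illustration, not notation from the text), length and sum can be written as follows.

```python
EMPTY = ()  # the empty list ⟨⟩


def cons(x, rest):
    """Build the linked list ⟨x, L⟩ from an element x and a linked list L."""
    return (x, rest)


def length(lst):
    """Number of elements in lst, following the two cases of the definition."""
    if lst == EMPTY:            # base case: ⟨⟩
        return 0
    _, rest = lst               # inductive case: ⟨x, L′⟩
    return 1 + length(rest)


def total(lst):
    """Sum of the integers in lst (named total to avoid shadowing Python's built-in sum)."""
    if lst == EMPTY:            # base case: ⟨⟩
        return 0
    x, rest = lst               # inductive case: ⟨x, L′⟩
    return x + total(rest)


# The list ⟨1, ⟨7, ⟨7, ⟨6, ⟨⟩⟩⟩⟩⟩ from Figure 5.31.
example = cons(1, cons(7, cons(7, cons(6, EMPTY))))
print(length(example), total(example))  # prints: 4 21
```

A proof by structural induction that these functions are correct would have exactly the same two cases as the code.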
Notice that a proof by structural induction is identical in form to a proof by strong induction on the number of applications of the inductive-case rules used to generate the object. For example, we can immediately rephrase the proof in Example 5.19 to use structural induction instead. While the structure of the proof is identical, structural induction can streamline the proof and make it easier to read:

Example 5.20 (Internal nodes vs. leaves in binary trees, take II)
Claim: In any binary tree T, we have leaves(T) ≤ internals(T) + 1.
Proof. Let P(T) denote the property that leaves(T) ≤ internals(T) + 1 for a binary tree T. We proceed by structural induction on the form of T.
base case (T = null): Then leaves(T) = internals(T) = 0, and indeed 0 ≤ 0 + 1.
inductive case (T has root x, left subtree Tl, and right subtree Tr): We assume the inductive hypotheses P(Tl) and P(Tr), namely
    leaves(Tl) ≤ internals(Tl) + 1     (1)
    leaves(Tr) ≤ internals(Tr) + 1.    (2)
• If x is itself a leaf, then Tl = Tr = null, and therefore leaves(T) = 1 and internals(T) = 0, and indeed 1 ≤ 0 + 1.
• Otherwise x is not a leaf, and either Tr ≠ null or Tl ≠ null, or both. Thus the leaves of T are the leaves of Tl and Tr, and internal nodes of T are the internal nodes of Tl and Tr plus the root x. Therefore
    leaves(T) = leaves(Tl) + leaves(Tr)                  (3)
    internals(T) = internals(Tl) + internals(Tr) + 1.    (4)
Putting together these facts, we have
    leaves(T) = leaves(Tl) + leaves(Tr)                  by (3)
              ≤ internals(Tl) + 1 + internals(Tr) + 1    by (1) and (2)
              = internals(T) + 1.                        by (4)
Thus P(T) holds, which completes the proof.
5.4.3 Some More Examples of Structural Induction: Propositional Logic
We’ll finish this section with two more proofs by structural induction, about propositional logic—using Example 5.18’s recursive definition.
Propositional logic using only ¬ and ∧
First, we’ll give a formal proof using structural induction of the claim that any propositional logic statement can be expressed using ¬ and ∧ as the only logical connectives. (See Exercise 4.68.)
Example 5.21 (All of propositional logic using ¬ and ∧)
Claim: For any logical proposition φ using the connectives {¬, ∧, ∨, ⇒}, there exists a proposition using only {¬, ∧} that is logically equivalent to φ.

Proof. For a logical proposition φ, let A(φ) denote the property that there exists a {¬, ∧}-only proposition logically equivalent to φ. We’ll prove by structural induction on φ that A(φ) holds for any well-formed formula φ (see Example 5.18):
base case: φ is a variable, say φ = x. The proposition x uses no connectives—and thus is vacuously {¬, ∧}-only—and is obviously logically equivalent to itself. Thus A(x) follows.
inductive case I: φ is a negation, say φ = ¬P. We assume the inductive hypothesis A(P). We must prove A(¬P). By the inductive hypothesis, there is a {¬, ∧}-only proposition Q such that Q ≡ P. Consider the proposition ¬Q. Because Q ≡ P, we have that ¬Q ≡ ¬P, and ¬Q contains only the connectives {¬, ∧}. Thus ¬Q is a {¬, ∧}-only proposition logically equivalent to ¬P. Thus A(¬P) follows.
inductive case II: φ is a conjunction, disjunction, or implication, say φ = P1 ∧ P2, φ = P1 ∨ P2, or φ = P1 ⇒ P2. We assume the inductive hypotheses A(P1) and A(P2)—that is, we assume there are {¬, ∧}-only propositions Q1 and Q2 with Q1 ≡ P1 and Q2 ≡ P2. We must prove A(P1 ∧ P2), A(P1 ∨ P2), and A(P1 ⇒ P2). Consider the propositions Q1 ∧ Q2, ¬(¬Q1 ∧ ¬Q2), and ¬(Q1 ∧ ¬Q2). By De Morgan’s Law, and the facts that x ⇒ y ≡ ¬(x ∧ ¬y), P1 ≡ Q1, and P2 ≡ Q2:
    Q1 ∧ Q2          ≡ Q1 ∧ Q2   ≡ P1 ∧ P2
    ¬(¬Q1 ∧ ¬Q2)     ≡ Q1 ∨ Q2   ≡ P1 ∨ P2
    ¬(Q1 ∧ ¬Q2)      ≡ Q1 ⇒ Q2   ≡ P1 ⇒ P2
Because Q1 and Q2 are {¬, ∧}-only, our three propositions are {¬, ∧}-only as well; therefore A(P1 ∧ P2), A(P1 ∨ P2), and A(P1 ⇒ P2) follow.
We’ve shown that A(φ) holds for any proposition φ, so the claim follows.
Taking it further: In the programming language ML, among others, a programmer can use both recursive definitions and a form of recursion that mimics structural induction. For example, we can give a simple implementation of the recursive definition of a well-formed formula from Example 5.18: a well-formed formula is a variable, or the negation of a well-formed formula, or the conjunction of a pair of well-formed formulas (wff * wff), or …. In ML, we can also write a function that mimics the structure of the proof in Example 5.21, using ML’s capability of pattern matching function arguments. See Figure 5.34 for both the recursive definition of the wff datatype and the recursive function simplify, which takes an arbitrary wff as input, and produces a wff that uses only And and Not as output.
Figure 5.34: Well-formed formulas in ML.
datatype wff = Variable of string
             | Not of wff
             | And of (wff * wff)
             | Or of (wff * wff)
             | Implies of (wff * wff);
fun simplify (Variable var)      = Variable var
  | simplify (Not P)             = Not(simplify P)
  | simplify (And (P1, P2))      = And(simplify P1, simplify P2)
  | simplify (Or (P1, P2))       = Not(And(Not(simplify P1), Not(simplify P2)))
  | simplify (Implies (P1, P2))  = Not(And(simplify P1, Not(simplify P2)));
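For readers who want to experiment without an ML system, here is a rough Python counterpart, offered as a sketch under stated assumptions rather than a translation endorsed by the text. Formulas are represented as nested tuples tagged with a connective name (a representation chosen here), and the helper names truth_value, connectives, and equivalent are hypothetical. The function truth_value follows the recursive definition of truth values under a truth assignment f, and simplify follows the cases of the proof in Example 5.21; the final assertions check one formula by brute force over all truth assignments.

```python
from itertools import product

# A wff is a nested tuple (a representation assumed for this sketch):
#   ("var", name), ("not", P), ("and", P, Q), ("or", P, Q), ("implies", P, Q)


def truth_value(phi, f):
    """Truth value of phi under the truth assignment f (a dict from names to bools)."""
    tag = phi[0]
    if tag == "var":
        return f[phi[1]]
    if tag == "not":
        return not truth_value(phi[1], f)
    p, q = truth_value(phi[1], f), truth_value(phi[2], f)
    if tag == "and":
        return p and q
    if tag == "or":
        return p or q
    if tag == "implies":
        return (not p) or q
    raise ValueError(f"unknown connective: {tag!r}")


def simplify(phi):
    """Return a formula equivalent to phi that uses only "not" and "and"."""
    tag = phi[0]
    if tag == "var":                       # base case
        return phi
    if tag == "not":                       # inductive case I
        return ("not", simplify(phi[1]))
    q1, q2 = simplify(phi[1]), simplify(phi[2])
    if tag == "and":                       # inductive case II
        return ("and", q1, q2)
    if tag == "or":                        # De Morgan's Law
        return ("not", ("and", ("not", q1), ("not", q2)))
    if tag == "implies":                   # x ⇒ y ≡ ¬(x ∧ ¬y)
        return ("not", ("and", q1, ("not", q2)))
    raise ValueError(f"unknown connective: {tag!r}")


def connectives(phi):
    """Set of connective tags appearing anywhere in phi."""
    if phi[0] == "var":
        return set()
    return {phi[0]}.union(*(connectives(sub) for sub in phi[1:]))


def equivalent(phi, psi, variables):
    """Brute-force check that phi and psi agree under every truth assignment."""
    for values in product([True, False], repeat=len(variables)):
        f = dict(zip(variables, values))
        if truth_value(phi, f) != truth_value(psi, f):
            return False
    return True


phi = ("implies", ("or", ("var", "p"), ("var", "q")), ("not", ("var", "r")))
simple = simplify(phi)
assert connectives(simple) <= {"not", "and"}
assert equivalent(phi, simple, ["p", "q", "r"])
```

The brute-force check covers only the formula tested; the guarantee for every formula comes from the structural induction in Example 5.21.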

Conjunctive and Disjunctive Normal Forms
Here is another example of a proof by structural induction based on propositional logic, to establish Theorems 3.1 and 3.2, that any proposition is logically equivalent to one that’s in conjunctive or disjunctive normal form.
(Recall that a proposition φ is in conjunctive normal form (CNF) if φ is the conjunction of one or more clauses, where each clause is the disjunction of one or more literals. A literal is a Boolean variable or the negation of a Boolean variable. A proposition φ is in disjunctive normal form (DNF) if φ is the disjunction of one or more clauses, where each clause is the conjunction of one or more literals.)
Theorem 5.6 (CNF/DNF suffice)
Let φ be a Boolean formula that uses the connectives {∧, ∨, ¬, ⇒}. Then:
1. there exists φdnf in disjunctive normal form so that φ and φdnf are logically equivalent.
2. there exists φcnf in conjunctive normal form so that φ and φcnf are logically equivalent.
Perhaps bizarrely, it will turn out to be easier to prove that any proposition is logically equivalent to both one in CNF and one in DNF than to prove either claim on its own. So we will prove both parts of the theorem simultaneously, by structural induction.
We’ll make use of some handy notation in this proof: analogous to summation and product notation, we write ⋀_{i=1}^{n} p_i to denote p1 ∧ p2 ∧ · · · ∧ pn, and similarly ⋁_{i=1}^{n} p_i means p1 ∨ p2 ∨ · · · ∨ pn. Here is the proof:
Example 5.22 (Conjunctive/disjunctive normal form)
Proof. We start by simplifying the task: we use Example 5.21 to ensure that φ contains only the connectives {¬, ∧}. Let C(φ) and D(φ), respectively, denote the property that φ is logically equivalent to a CNF proposition and a DNF proposition, respectively. We now proceed by structural induction on the form of φ—which now can only be a variable, negation, or conjunction—to show that C(φ) ∧ D(φ) holds for any proposition φ.
base case: φ is a variable, say φ = x. We’re done immediately; a single variable is actually in both CNF and DNF. We simply choose φdnf = φcnf = x. Thus C(x) and D(x) follow immediately.
inductive case I: φ is a negation, say φ = ¬P. We assume the inductive hypothesis C(P) ∧ D(P)—that is, we assume that there are propositions Pcnf and Pdnf such that P ≡ Pcnf ≡ Pdnf, where Pcnf is in CNF and Pdnf is in DNF. We must show C(¬P) and D(¬P).
We’ll first show D(¬P)—that is, that ¬P can be rewritten in DNF. By the definition of conjunctive normal form, we know that the proposition Pcnf is of the form Pcnf = ⋀_{i=1}^{n} c_i, where c_i is a clause of the form c_i = ⋁_{j=1}^{m_i} c_i^j, where c_i^j is a variable or its negation. Therefore we have
    ¬P ≡ ¬Pcnf ≡ ¬( ⋀_{i=1}^{n} ⋁_{j=1}^{m_i} c_i^j )    inductive hypothesis C(P) and definition of CNF
       ≡ ⋁_{i=1}^{n} ¬( ⋁_{j=1}^{m_i} c_i^j )             De Morgan’s Law
       ≡ ⋁_{i=1}^{n} ⋀_{j=1}^{m_i} ¬c_i^j                 De Morgan’s Law
Once we delete double negations (that is, if c_i^j = ¬x, then we write ¬c_i^j as x rather than as ¬¬x), this last proposition is in DNF, so D(¬P) follows.
Problem-solving tip: Suppose we want to prove ∀x : P(x) by induction. Here’s a problem-solving strategy that’s highly counterintuitive: it is sometimes easier to prove a stronger statement ∀x : P(x) ∧ Q(x). It seems bizarre that trying to prove more than what we want is easier—but the advantage arises because the inductive hypothesis is a more powerful assumption! For example, I don’t know how to prove that any proposition φ can be expressed in DNF (Theorem 5.6.1) by induction! But I do know how to prove that any proposition φ can be expressed in both DNF and CNF by induction, as is done in Example 5.22.
The construction to show C(¬P)—that is, to give a CNF proposition logically equivalent to ¬P—is strictly analogous; the only change to the argument is that we start from Pdnf instead of Pcnf.
inductive case II: φ is a conjunction, say P ∧ Q. We assume the inductive hypotheses C(P) ∧ D(P) and C(Q) ∧ D(Q)—that is, we assume that there are CNF propositions Pcnf and Qcnf and DNF propositions Pdnf and Qdnf such that P ≡ Pcnf ≡ Pdnf and Q ≡ Qcnf ≡ Qdnf. We must show C(P ∧ Q) and D(P ∧ Q).
• The argument for C(P ∧ Q) is the easier of the two: we have propositions Pcnf and Qcnf in CNF where Pcnf ≡ P and Qcnf ≡ Q. Thus P ∧ Q ≡ Pcnf ∧ Qcnf—and the conjunction of two CNF formulas is itself in CNF. So C(P ∧ Q) follows.
• We have to work a little harder to prove D(P ∧ Q). Recall that, by the inductive hypothesis, there are propositions Pdnf and Qdnf in DNF, where P ≡ Pdnf and Q ≡ Qdnf. By the definition of DNF, these propositions have the form Pdnf = ⋁_{i=1}^{n} c_i and Qdnf = ⋁_{j=1}^{m} d_j, where every c_i and d_j is a clause that is a conjunction of literals. Therefore
    P ∧ Q ≡ Pdnf ∧ Q ≡ ( ⋁_{i=1}^{n} c_i ) ∧ Q          inductive hypothesis D(P) and definition of DNF
          ≡ ⋁_{i=1}^{n} ( c_i ∧ Q )                       distributivity of ∧ over ∨
          ≡ ⋁_{i=1}^{n} ( c_i ∧ ⋁_{j=1}^{m} d_j )         inductive hypothesis D(Q) and definition of DNF
          ≡ ⋁_{i=1}^{n} ⋁_{j=1}^{m} ( c_i ∧ d_j ).        distributivity of ∧ over ∨
Because every c_i and d_j is a conjunction of literals, c_i ∧ d_j is too, and thus this last proposition is in DNF! So D(P ∧ Q) follows—as does the theorem.
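The D(P ∧ Q) step is, computationally, a cross product of clauses. The Python sketch below illustrates just that step under an assumed representation (not one used in the text): a DNF proposition is a list of clauses, each clause is a list of literal strings such as "a" or "¬e", and the function names dnf_and and holds are hypothetical.

```python
from itertools import product


def dnf_and(p_dnf, q_dnf):
    """DNF for P ∧ Q: one clause c_i ∧ d_j for every pair of clauses (c_i, d_j)."""
    return [c + d for c, d in product(p_dnf, q_dnf)]


def holds(dnf, f):
    """Evaluate a DNF proposition under the truth assignment f (a dict from names to bools)."""
    def literal(lit):
        return not f[lit[1:]] if lit.startswith("¬") else f[lit]
    return any(all(literal(lit) for lit in clause) for clause in dnf)


p_dnf = [["a", "b"], ["c"]]      # (a ∧ b) ∨ c
q_dnf = [["d"], ["¬e", "a"]]     # d ∨ (¬e ∧ a)
result = dnf_and(p_dnf, q_dnf)
# result is [["a", "b", "d"], ["a", "b", "¬e", "a"], ["c", "d"], ["c", "¬e", "a"]]

# Brute-force check that result is equivalent to p_dnf ∧ q_dnf.
names = ["a", "b", "c", "d", "e"]
for values in product([True, False], repeat=len(names)):
    f = dict(zip(names, values))
    assert holds(result, f) == (holds(p_dnf, f) and holds(q_dnf, f))
```

Note that the output has n · m clauses, and that a clause may repeat a literal (here "a" appears twice in one clause); the construction does not try to simplify.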

The construction for a conjunction P ∧ Q in Example 5.22 is a little tricky, so let’s illustrate it with a small example:
Example 5.23 (An example of the construction from Example 5.22)
Suppose that we are trying to transform a proposition φ ∧ ψ into DNF. Suppose that we have (recursively) computed φdnf = (p ∧ t) ∨ q and ψdnf = r ∨ (s ∧ t). Then the construction from Example 5.22 lets us construct a proposition equivalent to φ ∧ ψ as:
    φ ∧ ψ ≡ φdnf ∧ ψdnf
          ≡ [(p ∧ t) ∨ (q)] ∧ [(r) ∨ (s ∧ t)]
                where c1 = (p ∧ t), c2 = (q), d1 = (r), and d2 = (s ∧ t)
          ≡ [(p ∧ t) ∧ ((r) ∨ (s ∧ t))] ∨ [(q) ∧ ((r) ∨ (s ∧ t))]
                that is, [c1 ∧ (d1 ∨ d2)] ∨ [c2 ∧ (d1 ∨ d2)]
          ≡ [(p ∧ t ∧ r) ∨ (p ∧ t ∧ s ∧ t)] ∨ [(q ∧ r) ∨ (q ∧ s ∧ t)].
                that is, [(c1 ∧ d1) ∨ (c1 ∧ d2)] ∨ [(c2 ∧ d1) ∨ (c2 ∧ d2)]
Then the construction yields
    (p ∧ t ∧ r) ∨ (p ∧ t ∧ s ∧ t) ∨ (q ∧ r) ∨ (q ∧ s ∧ t)
as the DNF proposition equivalent to φ ∧ ψ.
5.4.4 The Integers, Recursively Defined
Before we end the section, we’ll close our discussion of recursively defined structures and structural induction with one more potentially interesting observation. Although the basic form of induction in Section 5.2 appears fairly different, that basic form of induction can actually be seen as structural induction, too. The key is to view the nonnegative integers Z≥0 as defined recursively:
Definition 5.7 (Nonnegative integers, recursively defined)
A nonnegative integer is either:
1. zero, denoted by 0; or
2. the successor of a nonnegative integer, denoted by s(x) for a nonnegative integer x.
Under this definition, a proof of ∀n ∈ Z≥0 : P(n) by structural induction and a proof of ∀n ∈ Z≥0 : P(n) by weak induction are identical:
• they have precisely the same base case: prove P(0); and
• they have precisely the same inductive case: prove P(n) ⇒ P(s(n))—or, in other words, prove that P(n) ⇒ P(n + 1).
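To make the recursive view of the nonnegative integers concrete, here is a small Python sketch. The class names Zero and Succ and the helpers to_int, from_int, and add are illustrative choices, not notation from the text; Succ(x) plays the role of s(x). Each function has one branch for zero and one branch for a successor, mirroring the base case and inductive case of a proof by weak induction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zero:
    """The base case: zero."""


@dataclass(frozen=True)
class Succ:
    """The inductive case: the successor s(x) of a nonnegative integer x."""
    pred: object  # a Zero or a Succ


def to_int(n):
    """Convert a recursively defined nonnegative integer to a Python int."""
    if isinstance(n, Zero):
        return 0
    return 1 + to_int(n.pred)


def from_int(k):
    """Apply the successor rule k times, starting from zero."""
    n = Zero()
    for _ in range(k):
        n = Succ(n)
    return n


def add(m, n):
    """Addition by recursion on m: 0 + n = n, and s(x) + n = s(x + n)."""
    if isinstance(m, Zero):
        return n
    return Succ(add(m.pred, n))


three = Succ(Succ(Succ(Zero())))
print(to_int(add(three, from_int(4))))  # prints: 7
```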

Computer Science Connections
Grammars, Parsing, and Ambiguity
In interpreters and compilers—systems that translate input source code written in a programming language like Python, Java, or C into a machine-executable format—a key initial step is to parse the input into a format that represents its structure. (A similar step occurs in systems designed to perform natural language processing.) The structured representation of such an expression is called a parse tree, in which the leaves of the tree correspond to the base cases of the recursive structural definition, and the internal nodes correspond to the inductive cases of the definition. We can then use the parse tree for whatever purpose we desire: evaluating arithmetic expressions, simplifying propositional logic, or any other manipulation. (See Figure 5.35.)
Figure 5.35: A parse tree for the arithmetic expression 2 · (3 + 4).
In this setting, a recursively defined structure is written as a context-free grammar (CFG). A grammar consists of a set of rules that can be used to generate a particular example of this defined structure. We’ll take the definition of propositions over the variables {p, q, r} (Example 5.18) as a running example. Here is a CFG for propositions, following that definition precisely. (Here “→” means “can be rewritten as” and “|” means “or.”)
    S → p | q | r                    S can be a propositional variable . . .
      | ¬S                           . . . or the negation of a proposition . . .
      | S ∨ S | S ∧ S | S ⇒ S        . . . or the ∧/∨/⇒ of two propositions.
This type of grammar is called context free because the rules defined by the grammar can be used any time—that is, without regard to the context in which the symbol on the left-hand side of the rule appears.
An expression φ is a valid proposition over the variables {p, q, r} if and only if φ can be generated by a finite-length sequence of applications of the rewriting rules in the grammar. For example, ¬p ∨ p is a valid proposition over {p, q, r}, because we can generate it as follows:
S → S∨S → S∨p → ¬S∨p → ¬p∨p.
(We used the rule S → p twice, the rule S → ¬S once, and the rule S → S ∨ S once.) The parse tree corresponding to this sequence of rule applications is shown in Figure 5.36(a).
A complication that arises with the grammar given above is that it is ambiguous: the same proposition can be produced using a fundamentally different sequence of rule applications, which gives rise to a different parse tree, shown in Figure 5.36(b):
S → ¬S → ¬S∨S → ¬p∨S → ¬p∨p.
The parse tree in Figure 5.36(b) corresponds to ¬(p ∨ p) rather than to (¬p) ∨ p; the latter is the correct “order of operations” because ¬ binds tighter than ∨.
It’s bad news if the grammar of a programming language is ambiguous, because certain valid code is then “allowed” to be interpreted in more than one way. (The classic example is the attachment of else clauses: in code like if P then if Q then X else Y, when should Y be executed? When P is true and Q is false? Or when P is false?) Thus programming language designers develop unambiguous grammars that reflect the desired behavior.3
Figure 5.36: Two parse trees for ¬p ∨ p. (a) The correct order of operations. (b) The wrong order of operations.
3 More on context-free grammars and parsing, and their relationship to compilers and interpreters, can be found in books like Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Prentice Hall, 2nd edition, 2006; Dexter Kozen. Automata and Computability. Springer, 1997; and Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012.
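One common way to avoid this kind of ambiguity is to split the grammar into levels, so that ¬ is handled at a tighter-binding level than ∨. The Python sketch below is an assumption-laden illustration, not the grammar of any particular language: it covers only ¬, ∨, parentheses, and the variables p, q, r, using the hypothetical two-level grammar Expr → Term (∨ Term)* and Term → ¬Term | (Expr) | p | q | r, and it returns the parse tree as nested tuples.

```python
def parse(text):
    """Parse a proposition over p, q, r with ¬, ∨, and parentheses into a nested-tuple tree."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        # Expr → Term (∨ Term)*  : the loosest-binding level
        tree = term()
        while peek() == "∨":
            advance()
            tree = ("or", tree, term())
        return tree

    def term():
        # Term → ¬Term | (Expr) | p | q | r  : ¬ applies only to the next Term
        tok = peek()
        if tok == "¬":
            advance()
            return ("not", term())
        if tok == "(":
            advance()
            tree = expr()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return tree
        if tok in ("p", "q", "r"):
            return ("var", advance())
        raise ValueError(f"unexpected token: {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return tree


print(parse("¬p ∨ p"))    # ('or', ('not', ('var', 'p')), ('var', 'p'))
print(parse("¬(p ∨ p)"))  # ('not', ('or', ('var', 'p'), ('var', 'p')))
```

With this layering, ¬p ∨ p has only one parse, the one in which ¬ applies to p alone; the other reading has to be requested explicitly with parentheses.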

5.4.5 Exercises
5.77 Let L be a linked list (as defined in Example 5.15). Prove by structural induction on L that length(L) returns the number of elements contained in L. (See Figure 5.37 for the algorithm.)
5.78 Let L be a linked list containing integers. Prove by structural induction on L that sum(L) returns the sum of the numbers contained in L. (See Figure 5.37 for the algorithm.)
5.79 In Example 5.15, we gave a recursive definition of a linked list. Here’s a variant of that definition, where we insist that the elements be in increasing order. Define a nonempty sorted list as one of the following:
1. ⟨x, ⟨⟩⟩; or
2. ⟨x, ⟨y, L⟩⟩ where x ≤ y and ⟨y, L⟩ is a nonempty sorted list.
Prove by structural induction that in a nonempty sorted list ⟨x, L⟩, every element z in L satisfies z ≥ x.
A string of balanced parentheses (with a close parenthesis that matches every open parenthesis, and appears to its right) is one of the following:
1. the empty string (consisting of zero characters);
2. a string [ S ] where S is a string of balanced parentheses; or
3. a string S1S2 where S1 and S2 are both strings of balanced parentheses.
For example, [[]][] is a string of balanced parentheses, using Rule 3 on [[]] and []. (Note that [] is a string of balanced parentheses using Rule 2 on the empty string (Rule 1), and therefore [[]] is by using Rule 2 on [].)
5.80 Prove by structural induction that every string of balanced parentheses according to this definition has exactly the same number of open parentheses as close parentheses.
5.81 Prove by structural induction that any prefix of a string of balanced parentheses according to this definition has at least as many open parentheses as it does close parentheses.
5.82 Recall from Definition 5.16 that we defined a binary tree as
1. an empty tree, denoted by null; or
2. a root node x, a left subtree Tl, and a right subtree Tr, where x is an arbitrary value and Tl and Tr are both binary trees.
Recall further that a leaf of a binary tree T is a node in T whose left subtree and right subtree are both null. Prove by structural induction that the algorithm countLeaves(T) in Figure 5.38 returns the number of leaves in a binary tree T.
5.83 Recall that a binary search tree (BST) is a binary tree in which each node stores a “key,” and, for any node u, the key at node u is larger than
all keys in u’s left subtree and smaller than all the keys in u’s right subtree. (See p. 1160.) That is, a BST is either:
1. an empty tree, denoted by null; or
2. a root node x, a left subtree Tl where all elements are less than x, and a right subtree Tr, where all elements
are greater than x, and Tl and Tr are both BSTs.
Prove that the smallest element in a nonempty BST is the bottommost leftmost node. That is, prove that
the smallest element in a BST with root x and left subtree Tl =
    x, if Tl = null;
    the smallest element in Tl, if Tl ≠ null.
A heap is a binary tree where each node stores a priority, and in which every node satisfies the heap property: the priority of a node u must be greater than or equal to the priorities of the roots of both of u’s subtrees. (The restriction only applies for a subtree that is not null.)
5.84 Give a recursive definition of a heap.
5.85 Prove by structural induction that every heap is empty, or that no element of the heap is larger
than its root node. (That is, the root is a maximum element.)
5.86 Prove by structural induction that every heap is empty, or it has a leaf u such that u is no larger than any node in the heap. (That is, the leaf u is a minimum element.)
Figure 5.37: Two algorithms on linked lists.
length(L): // assume L is a linked list.
    if L = ⟨⟩ then
        return 0
    else if L = ⟨x, L′⟩ then
        return 1 + length(L′)
sum(L): // assume L is a linked list containing integers.
    if L = ⟨⟩ then
        return 0
    else if L = ⟨x, L′⟩ then
        return x + sum(L′)
countLeaves(T):
    if T = null then
        return 0
    else
        TL, TR := the left and right subtrees of T
        if TL = TR = null then
            return 1
        else
            return countLeaves(TL) + countLeaves(TR)
Figure 5.38: An algorithm to count leaves in a binary tree.
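For readers who want to run the algorithms of Figures 5.37 and 5.38, here is a minimal Python sketch. The encoding is an assumption made only for illustration: the empty list ⟨⟩ is the empty tuple (), a nonempty list ⟨x, L′⟩ is the pair (x, rest), the empty tree null is None, and a nonempty tree is a triple (value, left, right). The names length, list_sum, and count_leaves are hypothetical stand-ins for the pseudocode.

```python
def length(L):
    """Number of elements in a linked list encoded as () or (x, rest)."""
    if L == ():
        return 0
    x, rest = L
    return 1 + length(rest)


def list_sum(L):
    """Sum of the integers in a linked list encoded as () or (x, rest)."""
    if L == ():
        return 0
    x, rest = L
    return x + list_sum(rest)


def count_leaves(T):
    """Number of leaves in a binary tree encoded as None or (value, left, right)."""
    if T is None:
        return 0
    value, left, right = T
    if left is None and right is None:
        return 1
    return count_leaves(left) + count_leaves(right)


example_list = (3, (1, (4, ())))                # the list <3, <1, <4, <>>>>
leaf = (7, None, None)
example_tree = (1, leaf, (2, leaf, None))       # a tree with two leaves
print(length(example_list), list_sum(example_list), count_leaves(example_tree))
```

Each function has one branch per case of the recursive definition, which is the same case split that a proof by structural induction follows.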

5.4. RECURSIVELY DEFINED STRUCTURES AND STRUCTURAL INDUCTION 545
A 2–3 tree is a data structure, similar in spirit to a binary search tree (see Exercise 5.83) or, more precisely, a balanced form of BST, which is guaranteed to support fast operations like insertions, lookups, and deletions. The name “2–3 tree” comes from the fact that each internal node in the tree must have precisely 2 or 3 children; no node has a single child. Furthermore, all leaves in a 2–3 tree must be at the same “level” of the tree.
5.87 Formally, a 2–3 tree of height h is one of the following:
1. a single node (in which case h = 0, and the node is called a leaf ); or
2. a node with 2 subtrees, both of which are 2–3 trees of height h − 1; or
3. a node with 3 subtrees, all three of which are 2–3 trees of height h − 1.
Prove by structural induction that a 2–3 tree of height h has at least 2^h leaves and at most 3^h leaves. (Therefore a 2–3 tree that contains n leaf nodes has height between log_3 n and log_2 n.)
5.88 A 2–3–4 tree is a similar data structure to a 2–3 tree, except that a tree can be a single node or a node with 2, 3, or 4 subtrees. Give a formal recursive definition of a 2–3–4 tree, and prove that a 2–3–4 tree of height h has at least 2^h leaves and at most 4^h leaves.
The next few exercises give recursive definitions of some familiar arithmetic operations which are usually defined nonrecursively. In each, you’re asked to prove a familiar property by structural induction. Think carefully when you choose the quantity upon which to perform induction, and don’t skip any steps in your proof! You may use the elementary-school facts about addition and multiplication from Figure 5.39 in your proofs:
(a + b) + c = a + (b + c)      Associativity of Addition
a + b = b + a                  Commutativity of Addition
a + 0 = 0 + a = a              Additive Identity
(a · b) · c = a · (b · c)      Associativity of Multiplication
a · b = b · a                  Commutativity of Multiplication
a · 1 = 1 · a = a              Multiplicative Identity
a · 0 = 0 · a = 0              Multiplicative Zero
5.89 Let’s define an even number as either (i) 0, or (ii) 2 + k, where k is an even number. Prove by structural induction that the sum of any two even numbers is an even number.
5.90 Let’s define a power of two as either (i) 1, or (ii) 2 · k, where k is a power of two. Prove by structural induction that the product of any two powers of two is itself a power of two.
5.91 Let a_1, a_2, …, a_k all be even numbers, for an arbitrary integer k ≥ 0. Prove that ∑_{i=1}^{k} a_i is also an even number. (Hint: use weak induction and Exercise 5.89.)
Figure 5.39: A few elementary-school facts about addition and multiplication.
In Chapter 2, we defined b^n (for a base b ∈ R and an exponent n ∈ Z≥0) as denoting the result of multiplying b by itself n times (Definition 2.5). As an alternative to that definition of exponentiation, we could instead give a recursive definition with integer exponents: b^0 := 1 and b^(n+1) := b · b^n, for any nonnegative integer n.
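The recursive definition above translates directly into a recursive function. The following Python sketch (the name power is a hypothetical choice) mirrors the two clauses of the definition and spot-checks one familiar property on a few small integers. A spot check of this kind is only a sanity test; it is not a substitute for an inductive proof.

```python
def power(b, n):
    """Compute b^n from the recursive definition: b^0 := 1 and b^(n+1) := b * b^n."""
    if n == 0:
        return 1
    return b * power(b, n - 1)


# Spot check (not a proof): the product of two powers adds the exponents.
for b in (2, 3, -5):
    for m in range(6):
        for n in range(6):
            assert power(b, m) * power(b, n) == power(b, m + n)
```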
5.92 Using the associativity/commutativity/identity/zero properties in Figure 5.39, prove by induction that b^m · b^n = b^(m+n) for any integers n ≥ 0 and m ≥ 0. Don’t skip any steps.
5.93 Using the facts in Figure 5.39 and Exercise 5.92, prove by induction that (b^m)^n = b^(mn) for any integers n ≥ 0 and m ≥ 0. Again, don’t skip any steps.
Recall Example 5.18, in which we defined a well-formed formula (a “wff”) of propositional logic as a variable; the negation (¬) of a wff; or the conjunction/disjunction/implication (∧, ∨, and ⇒) of two wffs. Assuming we allow the corresponding new connective in the following exercises as part of a wff, give a proof using structural induction (see Example 5.21 for an example) that any wff is logically equivalent to one using only . . .
5.94 Sheffer stroke |, where p | q ≡ ¬(p ∧ q)
5.95 Peirce’s arrow ↓, where p ↓ q ≡ ¬(p ∨ q)
(programming required) In the programming language ML (see Figure 5.34 for more), write a program to translate an arbitrary statement of propositional logic into a logically equivalent statement that has the following special form. (In other words, implement the proof of Exercises 5.94 and 5.95 as a recursive function.)
5.96 | is the only logical connective
5.97 ↓ is the only logical connective
5.98 Call a logical proposition truth-preserving if the proposition is true under the all-true truth assignment. (That is, a proposition is truth-preserving if and only if the first row of its truth table is True.) Prove the following claim by structural induction on the form of the proposition:
Any logical proposition that uses only the logical connectives ∨ and ∧ is truth-preserving.
(A solution to this exercise yields a rigorous solution to Exercise 4.71—there are propositions that cannot be
expressed using only ∧ and ∨. Explain.)
5.99 A palindrome is a string that reads the same front-to-back as it does back-to-front—for example, RACECAR or (ignoring spaces/punctuation) A MAN, A PLAN, A CANAL–PANAMA! or 10011001. Give a recursive definition of the set of palindromic bitstrings.
5.100 Let #0(s) and #1(s) denote the number of 0s and 1s in a bitstring s, respectively. Using your recursive definition from the previous exercise, prove by structural induction that, for any palindromic bitstring s, the value of [#0(s)] · [#1(s)] is an even number.

5.5 Chapter at a Glance
Proofs by Mathematical Induction
Suppose that we want to prove that a property P(n) holds for all n ∈ Z≥0. To give a proof by mathematical induction of the claim ∀n ∈ Z≥0 : P(n), we prove the base case P(0), and we prove the inductive case: for every n ≥ 1, we have P(n − 1) ⇒ P(n).
When writing an inductive proof of the claim ∀n ∈ Z≥0 : P(n), include each of the following steps:
1. A clear statement of the claim to be proven (that is, a clear definition of the property P(n) that will be proven true for all n ≥ 0) and a statement that the proof is by induction, including specifically identifying the variable n upon which induction is being performed. (Some claims involve multiple variables, and it can be confusing if you aren’t clear about which is the variable upon which you are performing induction.)
2. A statement and proof of the base case, that is, a proof of P(0).
3. A statement and proof of the inductive case, that is, a proof of P(n − 1) ⇒ P(n), for a generic value of n ≥ 1. The proof of the inductive case should include all of the following:
(a) a statement of the inductive hypothesis P(n − 1).
(b) a statement of the claim P(n) that needs to be proven.
(c) a proof of P(n), which at some point makes use of the assumed inductive hypothesis P(n − 1).
We can use a proof by mathematical induction on arithmetic properties, like a formula for the sum of the nonnegative integers up to n (that is, ∑_{i=0}^{n} i = n(n+1)/2 for any integer n ≥ 0) or a formula for a geometric series:
if α ∈ R where α ≠ 1, and n ∈ Z≥0, then ∑_{i=0}^{n} α^i = (α^(n+1) − 1)/(α − 1).
(If α = 1, then ∑_{i=0}^{n} α^i = n + 1.) We can also use proofs by mathematical induction to prove the correctness of algorithms, particularly recursive algorithms.
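As a sanity check on the two closed forms, the following Python sketch compares each formula against a direct term-by-term sum for small values of n. The function names are hypothetical, and exact rational arithmetic (fractions.Fraction) is an implementation choice made here so that the comparisons are not affected by floating-point rounding. Agreement on finitely many values illustrates the formulas; the induction proof is what establishes them for every n.

```python
from fractions import Fraction


def sum_first_n(n):
    """Direct computation of 0 + 1 + ... + n."""
    return sum(range(n + 1))


def geometric_sum(alpha, n):
    """Direct computation of alpha^0 + alpha^1 + ... + alpha^n."""
    alpha = Fraction(alpha)
    return sum(alpha ** i for i in range(n + 1))


def geometric_closed_form(alpha, n):
    """Closed form: (alpha^(n+1) - 1) / (alpha - 1), or n + 1 when alpha = 1."""
    alpha = Fraction(alpha)
    if alpha == 1:
        return Fraction(n + 1)
    return (alpha ** (n + 1) - 1) / (alpha - 1)


for n in range(20):
    assert sum_first_n(n) == n * (n + 1) // 2
    for alpha in (Fraction(1, 2), 1, 2, 3, -1):
        assert geometric_sum(alpha, n) == geometric_closed_form(alpha, n)
```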
Strong Induction
Suppose that we want to prove that P(n) holds for all n ∈ Z≥0. To give a proof by strong induction of ∀n ∈ Z≥0 : P(n), we prove the base case P(0), and we prove the inductive case: for every n ≥ 1, we have [P(0) ∧ P(1) ∧ … ∧ P(n − 1)] ⇒ P(n). Strong induction is actually completely equivalent to weak induction; anything that can be proven with one can also be proven with the other.
Generally speaking, using strong induction makes sense when the “reason” that P(n) is true is that P(k) is true for more than one value of k < n (or a single value of k < n with k ≠ n − 1). (For weak induction, the reason that P(n) is true is just P(n − 1).) We can use strong induction to prove many claims, including part of the Prime Factorization Theorem: if n ∈ Z≥1 is a positive integer, then there exist k ≥ 0 prime numbers p_1, p_2, ..., p_k such that n = ∏_{i=1}^{k} p_i.
Recursively Defined Structures and Structural Induction
A recursively defined structure, just like a recursive algorithm, is a structure defined in terms of one or more base cases and one or more inductive cases. Any data type that can be understood as either a trivial instance of the type or as being built up from a smaller instance (or smaller instances) of that type can be expressed in this way. The set of structures defined is well ordered if, intuitively, every invocation of the inductive case of the definition “makes progress” toward the base case(s) of the definition (and, more formally, that every nonempty subset of those structures has a “least” element).
Suppose that we want to prove that P(x) holds for every x ∈ S, where S is the (well-ordered) set of structures generated by a recursive definition. To give a proof by structural induction of ∀x ∈ S : P(x), we prove the following:
1. Base cases: for every x defined by a base case in the definition of S, prove P(x).
2. Inductive cases: for every x defined in terms of y_1, y_2, ..., y_k ∈ S by an inductive case in the definition of S, prove that P(y_1) ∧ P(y_2) ∧ … ∧ P(y_k) ⇒ P(x).
The form of a proof by structural induction that ∀x ∈ S : P(x) for a well-ordered set of structures S is identical to the form of a proof using strong induction. Specifically, the proof by structural induction looks like a proof by strong induction of the claim ∀n ∈ Z≥0 : Q(n), where Q(n) denotes the property “for any structure x ∈ S that is generated using n applications of the inductive-case rules in the definition of S, we have P(x).”
Key Terms and Results
Key Terms
Proofs by Mathematical Induction
• proof by mathematical induction
• base case
• inductive case
• inductive hypothesis
• geometric series
• arithmetic series
• harmonic series
Strong Induction
• strong induction
• prime factorization
Recursively Defined Structures and Structural Induction
• recursively defined structures
• structural induction
• well-ordered set
Key Results
Proofs by Mathematical Induction
1. Suppose that we want to prove that P(n) holds for all n ∈ Z≥0. To give a proof by mathematical induction of ∀n ∈ Z≥0 : P(n), we prove the following:
(a) the base case P(0).
(b) the inductive case: for every n ≥ 1, we have P(n − 1) ⇒ P(n).
2. For any integer n ≥ 0, we have 1 + 2 + ... + n = n(n+1)/2.
3. Let α ∈ R where α ≠ 1, and let n ∈ Z≥0. Then ∑_{i=0}^{n} α^i = (α^(n+1) − 1)/(α − 1). (If α = 1, then ∑_{i=0}^{n} α^i = n + 1.)
Strong Induction
1. Suppose that we want to prove that P(n) holds for all n ∈ Z≥0. To give a proof by strong induction of ∀n ∈ Z≥0 : P(n), we prove the following:
(a) the base case P(0).
(b) the inductive case: for every n ≥ 1, we have [P(0) ∧ P(1) ∧ … ∧ P(n − 1)] ⇒ P(n).
2. The prime factorization theorem: let n ∈ Z≥1 be a positive integer. Then there exist k ≥ 0 prime numbers p_1, p_2, ..., p_k such that n = ∏_{i=1}^{k} p_i. Furthermore, up to reordering, the prime numbers p_1, p_2, ..., p_k are unique.
Recursively Defined Structures and Structural Induction
1. To give a proof by structural induction of ∀x ∈ S : P(x), we prove the following:
(a) the base cases: for every x defined by a base case in the definition of S, we have that P(x).
(b) the inductive cases: for every x defined in terms of y_1, y_2, ..., y_k ∈ S by an inductive case in the definition of S, we have that P(y_1) ∧ P(y_2) ∧ … ∧ P(y_k) ⇒ P(x).
6 Analysis of Algorithms
In which our heroes stay beyond the reach of danger, by calculating precise bounds on how quickly they must move to stay safe.
6.2.1 Big O
Consider two functions f and g. To reiterate, our goal is to compare the rates at which these functions grow. We’ll start by defining what it means for the function f(n) to grow no faster than g(n), written f(n) = O(g(n)).
Taking it further: Philosophers sometimes distinguish between the “is” of identity and the “is” of predication. In a sentence like Barbara Liskov is the 2008 Turing Award winner, we are asserting that Barbara Liskov and the 2008 Turing Award Winner actually refer to the same thing; that is, they are identical. In a sentence like Barbara Liskov is tall, we are asserting that Barbara Liskov (the entity to which Barbara Liskov refers) has the property of being tall; that is, the predicate x is tall is true of Barbara Liskov. One should interpret the “=” in f(n) = O(g(n)) as an “is of predication.” One reasonably accurate way to distinguish these two uses of is is by considering what happens if you reverse the order of the sentence: The 2008 Turing Award Winner is Barbara Liskov is still a (true) well-formed sentence, but Tall is Barbara Liskov sounds very strange. Similarly, for an “is of identity” in a mathematical context, we can say either x^2 − 1 = (x + 1)(x − 1) or (x + 1)(x − 1) = x^2 − 1. But, while “f(n) = O(g(n))” is a well-formed statement, it is nonsensical to say “O(g(n)) = f(n).”
Here is the formal definition:
Definition 6.1 (“Big O”)
Consider two functions f : R≥0 → R≥0 and g : R≥0 → R≥0. We say that f grows no faster than g if there exist constants c > 0 and n0 ≥ 0 such that
∀n ≥ n0 : f(n) ≤ c · g(n).
In this case, we write “f(n) is O(g(n))” or “f(n) = O(g(n)).”
The “=” in “f(n) = O(g(n))” is odd notation, but it’s also very standard. This expression means f(n) has the property of being O(g(n)) and not f(n) is identical to O(g(n)). O is pronounced “big oh.”
The intuition of the definition is that f(n) = O(g(n)) if, for large enough n, we have f(n) ≤ constant · g(n). Figure 6.2 shows five different functions f : R≥0 → R≥0 that all satisfy f(n) = O(n). (In the figure, the value of x is “large enough” once x is outside of the gray box, and the multiplicative constant is equal to 3 in each subplot. For a function like f(x) = 4x, we’d show that f(n) = O(n) by choosing some c ≥ 4 as the multiplicative constant.)
More quantitatively, here are two simple examples of functions that are O(n^2):
Example 6.2 (A square function)
Problem: Prove that the function f(n) = 3n^2 + 2 is O(n^2).
Solution: To prove that f(n) = 3n^2 + 2 satisfies f(n) = O(n^2), we must identify constants c > 0 and n0 ≥ 0 such that ∀n ≥ n0 : 3n^2 + 2 ≤ c · n^2. Let’s select c = 5 and n0 = 1. For all n ≥ 1, observe that 2n^2 ≥ 2. Therefore, for all n ≥ 1, we have
f(n) = 3n^2 + 2 ≤ 3n^2 + 2n^2 = 5n^2 = c · n^2.
Figure 6.2: Five functions that are all O(n). For any x beyond the gray box, we have f(x) ≤ 3x. [Panels: f(x) = x; f(x) = 2x; f(x) = x + 8; f(x) = 10; and f(x) = 25 − x^2 if x < 3.5, f(x) = 0.5x + 11 if x ≥ 3.5.]
Example 6.3 (Another square function)
Problem: Prove that the function g(n) = 4n is also O(n^2).
Solution: We wish to show that 4n ≤ c · n^2 for all n ≥ n0, for constants c > 0 and n0 ≥ 0 that we get to choose. The two functions g(n) and q(n) := n^2 are shown in Figure 6.3. Because the functions cross (with no constant multiplier), we can pick c = 1. Observe that 4n ≤ n^2 if and only if n^2 − 4n = n(n − 4) ≥ 0, that is, for n ≤ 0 or n ≥ 4. Thus c = 1 and n0 = 4 suffice.
Note that, when f (n) = O(g(n)), there are many choices of c and n0 that satisfy the definition. For example, we could have chosen c = 4 and n0 = 1 in Example 6.3. (See Exercise 6.15.)
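To experiment with different choices of c and n0, one can test the inequality of Definition 6.1 on a finite range of inputs. The Python sketch below does this for the functions of Examples 6.2 and 6.3; the helper name holds_on_range and the cutoff n_max are hypothetical choices. Two caveats apply: the check covers only integer n up to n_max, whereas the definition quantifies over all real n ≥ n0, so a True result is evidence rather than a proof, while a False result does show that the proposed pair of constants fails.

```python
def holds_on_range(f, g, c, n0, n_max=10_000):
    """Return True if f(n) <= c * g(n) for every integer n with n0 <= n <= n_max."""
    return all(f(n) <= c * g(n) for n in range(n0, n_max + 1))


def f(n):
    return 3 * n**2 + 2      # the function from Example 6.2


def g(n):
    return 4 * n             # the function from Example 6.3


def square(n):
    return n**2


print(holds_on_range(f, square, c=5, n0=1))   # constants chosen in Example 6.2
print(holds_on_range(g, square, c=1, n0=4))   # constants chosen in Example 6.3
print(holds_on_range(g, square, c=4, n0=1))   # the alternative choice c = 4, n0 = 1
print(holds_on_range(g, square, c=1, n0=1))   # fails at n = 1, because 4 > 1
```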
Example 6.4 (One nonsquare)
Problem: Prove that the function h(n) = n^3 is not O(n^2).
Solution: To show that h(n) = n^3 is not O(n^2), we need to argue that, for all constants n0 and c, there exists an n ≥ n0 such that h(n) > c · n^2, that is, that n^3 > c · n^2.
Fix a purported n0 and c. Let n := max(n0, c + 1). Then n > c by our definition of n, so, by multiplying both sides of n > c by the positive quantity n^2 (positive because n ≥ c + 1 > 1), we have n^3 = n · n^2 > c · n^2. But we also have that n ≥ n0 by our definition of n, and thus we have identified an n ≥ n0 such that n^3 > c · n^2.
Because n0 and c were generic, we have shown that no such constants can exist, and therefore that h(n) = n^3 is not O(n^2).
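The argument in Example 6.4 is constructive: given any purported constants, it names a specific input at which the claimed bound fails. A minimal Python sketch of that construction follows; the function name counterexample is a hypothetical label for the choice n := max(n0, c + 1) made in the solution.

```python
def counterexample(c, n0):
    """Return an n >= n0 with n**3 > c * n**2, using n := max(n0, c + 1)."""
    return max(n0, c + 1)


# Whatever constants are proposed, the construction produces a violating input.
for c in (0.5, 1, 10, 1000):
    for n0 in (0, 1, 50, 5000):
        n = counterexample(c, n0)
        assert n >= n0
        assert n**3 > c * n**2
```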
Some properties of O(·)
Now that we’ve seen a few specific examples, let’s turn to some more general results. There are many useful properties of O(·) that will come in handy later; we’ll start here with a few of these properties, together with a proof of one. (The other proofs are left to you in Exercises 6.18–6.20.)
Lemma 6.1 (Asymptotic equivalence of max and sum)
We have f(n) = O(g(n) + h(n)) if and only if f(n) = O(max(g(n), h(n))).
Proof. We proceed by mutual implication. For the forward direction, suppose f(n) = O(g(n) + h(n)). Then by definition there exist constants c > 0 and n0 ≥ 0 such that
for all n ≥ n0: f(n) ≤ c · [g(n) + h(n)]. (1)
For any a, b ∈ R, we know that a ≤ max(a, b) and b ≤ max(a, b), so (1) implies
for all n ≥ n0: f(n) ≤ c · [max(g(n), h(n)) + max(g(n), h(n))] = 2c · max(g(n), h(n)). (2)
But (2) is the definition of f(n) = O(max(g(n), h(n))), using constants n0′ = n0 and c′ = 2c.
Figure 6.3: A plot of g(n) = 4n and q(n) = n^2. [Plot: the two curves cross at n = 4.]
6.2. ASYMPTOTICS 605

Conversely, suppose f(n) = O(max(g(n), h(n))). Then there exist constants c > 0 and n0 ≥ 0 such that
for all n ≥ n0: f(n) ≤ c · max(g(n), h(n)). (3)
For any a, b ∈ R≥0 we know max(a, b) ≤ max(a, b) + min(a, b) = a + b; thus (3) implies
for all n ≥ n0: f(n) ≤ c · [g(n) + h(n)]. (4)
Thus (4) implies that f(n) = O(g(n) + h(n)), using the same constants, n0′ = n0 and c′ = c.
Problem-solving tip: Don’t force yourself to prove more than you have to! For example, when proving that an asymptotic relationship like f(n) = O(g(n)) holds, all we need to do is identify some pair of constants c, n0 that satisfy Definition 6.1. Don’t work too hard! Choose whatever c or n0 makes your life easiest, even if they’re much bigger than necessary. For asymptotic purposes, we care that the constants c and n0 exist, but we don’t care how big they are.
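Both directions of the proof of Lemma 6.1 rest on one pair of inequalities for nonnegative numbers: max(a, b) ≤ a + b ≤ 2 · max(a, b). The Python sketch below (the name sandwich_holds and the sample values are hypothetical choices) checks this pair on a small grid, and also shows that the left-hand inequality can fail when a value is negative, which is why the nonnegativity of the functions matters in the converse direction.

```python
import itertools


def sandwich_holds(a, b):
    """Check max(a, b) <= a + b <= 2 * max(a, b)."""
    return max(a, b) <= a + b <= 2 * max(a, b)


samples = [0, 0.5, 1, 3, 10, 1e6]
all_ok = all(sandwich_holds(a, b) for a, b in itertools.product(samples, repeat=2))
print(all_ok)

# With a negative value the sum can drop below the max: max(-1, 2) = 2 but -1 + 2 = 1.
print(sandwich_holds(-1, 2))
```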
Lemma 6.2 (Transitivity of O(·))
If f (n) = O(g(n)) and g(n) = O(h(n)), then f (n) = O(h(n)).
Lemma 6.3 (Addition and multiplication preserve O(·)-ness)
If f(n) = O(h1(n)) and g(n) = O(h2(n)), then:
• f(n) + g(n) = O(h1(n) + h2(n)).
• f(n) · g(n) = O(h1(n) · h2(n)).
Asymptotics of polynomials
So far, we’ve discussed properties of O(·) that are general with respect to the form of the functions in question. But because we’re typically concerned with O(·) in the context of the running time of algorithms, and we are generally interested in algorithms that are efficient, we’ll be particularly interested in the asymptotics of polynomials. The most salient point about the growth of a polynomial p(n) is that p(n)’s asymptotic behavior is determined by the degree of p(n); that is, the polynomial p(n) = a_0 + a_1 n + a_2 n^2 + · · · + a_k n^k behaves like n^k, asymptotically:
(If a_k > 0, then indeed p(n) = O(n^k), and it is not possible to improve this bound; that is, in the notation of Section 6.2.2, we have that p(n) = Θ(n^k).)
The proof of Lemma 6.4 is deferred to Exercise 6.21, but we have already seen the intuition in previous examples: every term a_i n^i satisfies a_i n^i ≤ |a_i| · n^k, for any n ≥ 1.
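The term-by-term intuition suggests one concrete pair of constants: adding up the bounds a_i n^i ≤ |a_i| · n^k over all terms gives p(n) ≤ (|a_0| + |a_1| + · · · + |a_k|) · n^k for every n ≥ 1. The Python sketch below tries that choice on a sample polynomial. The function names, the coefficient-list encoding, and the sample coefficients are assumptions made for illustration, and the finite loop is a spot check rather than the proof requested in Exercise 6.21.

```python
def poly(coeffs, n):
    """Evaluate a_0 + a_1 n + ... + a_k n^k, where coeffs = [a_0, a_1, ..., a_k]."""
    return sum(a * n**i for i, a in enumerate(coeffs))


def big_o_constant(coeffs):
    """Candidate constant c = |a_0| + ... + |a_k|, to be paired with n0 = 1."""
    return sum(abs(a) for a in coeffs)


coeffs = [7, -2, 0, 5]            # p(n) = 7 - 2n + 5n^3, so k = 3
k = len(coeffs) - 1
c = big_o_constant(coeffs)

for n in range(1, 1000):
    assert poly(coeffs, n) <= c * n**k
```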
Asymptotics of logarithms and exponentials
We will also often encounter logarithms and exponential functions, so it’s worth identifying a few of their asymptotic properties. Again, we’ll prove one of these properties as an example, and leave proofs of many of the remaining properties to the exercises. The first pair of properties is that logarithmic functions grow more slowly than polynomials, which grow more slowly than exponential functions:
Lemma 6.4 (Asymptotics of polynomials)
Let p(n) = ∑_{i=0}^{k} a_i n^i be a polynomial. Then p(n) = O(n^k).

Lemma 6.5 (log n grows slower than n^0.0000001)
Let ε > 0 be an arbitrary constant, and let f(n) = log n. Then f(n) = O(n^ε).
Lemma 6.6 (n^1000000 grows slower than 1.0000001^n)
Let b > 1 and k ≥ 0 be arbitrary constants, and let p(n) = ∑_{i=0}^{k} a_i n^i be any polynomial. Then p(n) = O(b^n).
The second pair of properties is that two logarithmic functions log_a n and log_b n grow at the same rate (for any bases a > 1 and b > 1) but that two exponential functions a^n and b^n do not (for any bases a and b ≠ a):
Lemma 6.7 (The base of a logarithm doesn’t matter, asymptotically)
Let b > 1 and k > 0 be arbitrary constants. Then f(n) = log_b(n^k) is O(log n).
Lemma 6.8 (The base of an exponential does matter, asymptotically)
Let b ≥ 1 and c ≥ 1 be arbitrary constants. Then f(n) = b^n is O(c^n) if and only if b ≤ c.
Proof of Lemma 6.7. Using standard facts about logarithms, we have that
log_b(n^k) = k · log_b(n)            by (2.2.5): log_b x^y = y log_b x
           = (k / log b) · log n     by the change of base formula (2.2.6): log_b x = (log_c x)/(log_c b).
Thus, for any n ≥ 1, we have that f(n) = (k / log b) · log n. Thus f(n) = O(log n) using the constants n0 = 1 and c = k / log b.
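The identity at the heart of the proof of Lemma 6.7, log_b(n^k) = (k / log b) · log n, can be checked numerically. In the Python sketch below, log is taken to be the natural logarithm (an assumption; any fixed base works the same way), log_base is a hypothetical helper, and math.isclose is used because floating-point logarithms agree only up to rounding.

```python
import math


def log_base(b, x):
    """log_b(x), computed with the change of base formula using natural logs."""
    return math.log(x) / math.log(b)


for b in (1.5, 2, 10):
    for k in (0.5, 1, 3):
        c = k / math.log(b)                     # the constant from the proof
        for n in (1, 2, 100, 10**6):
            lhs = log_base(b, n**k)
            rhs = c * math.log(n)
            assert math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12)
```

Because the two sides are equal for every n ≥ 1, the constant c = k / log b works in both directions of the comparison between log_b(n^k) and log n.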
Lemma 6.7 is the reason that, for example, binary search’s running time is described as O(log n) rather than as O(log_2 n), without any concern for writing the “2”: the base of the logarithm is inconsequential asymptotically, so O(log_√2 n) and O(log_2 n) and O(ln n) all mean exactly the same thing. In contrast, for exponential functions, the base of the exponent does affect the asymptotic behavior: Lemma 6.8 says that, for example, the functions f(n) = 2^n and g(n) = (√2)^n do not grow at the same rate. (See Exercises 6.25–6.28.)
Taking it further: Generally, exponential growth is a problem for computer scientists. Many computational problems that are important and useful to solve seem to require searching a very large space of possible answers: for example, testing the satisfiability of an n-variable logical proposition seems to require looking at about 2^n different truth assignments, and factoring an n-digit number seems to require looking at about 10^n different candidate divisors. The fact that exponential functions grow so quickly is exactly why we do not have algorithms that are practical for even moderately large instances of these problems.
But one of the most famous exponentially growing functions actually helps us to solve problems: the amount of computational power available to a “standard” user of a computer has been growing exponentially for decades: about every 18 months, the processing power of a standard computer has roughly doubled. This trend—dubbed Moore’s Law, after Gordon Moore, the co-founder of Intel—is discussed on p. 613.

6.2.2 Other Asymptotic Relationships: Ω, Θ, ω, and o
There are several basic asymptotic notions (with accompanying notation), based around two core ideas (see Figure 6.4):
f(n) grows no faster than g(n): In other words, ignoring small inputs, for all n we have that f(n) ≤ constant · g(n). This relationship is expressed by the O(·) notation: f(n) = O(g(n)). We can also say that g is an asymptotic upper bound for f: if we plot n against f(n) and g(n), then g(n) will be “above” f(n) for large inputs.
f(n) grows no slower than g(n): The opposite relationship, in which g is an asymptotic lower bound on f, is expressed by Ω(·) notation. Again, ignoring small inputs, f(n) = Ω(g(n)) if for all n we have that f(n) ≥ constant · g(n). (Notice that the inequality swapped directions from the definition of O(·).)
Formal definitions
Here are the formal definitions of four other relationships based on these notions:
Definition 6.2 (“Big Omega”)
A function f grows no slower than g, written f(n) = Ω(g(n)), if there exist constants d > 0 and n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≥ d · g(n).
The two fundamental asymptotic relationships, O(·) and Ω(·), are dual notions; they are related by the property that f(n) = O(g(n)) if and only if g(n) = Ω(f(n)). (The proof is left as Exercise 6.30.)
There are three other pieces of asymptotic notation, corresponding to the situations in which f(n) is both O(g) and Ω(g), or O(g) but not Ω(g), or Ω(g) but not O(g):
Definition 6.3 (“Big Theta”)
A function f grows at the same rate as g, written f(n) = Θ(g(n)), if f(n) = O(g(n)) and f(n) = Ω(g(n)).
Definition 6.4 (“Little o”)
A function f grows (strictly) slower than g, written f(n) = o(g(n)), if f(n) = O(g(n)) but f(n) ≠ Ω(g(n)).
Definition 6.5 (“Little omega”)
A function f grows (strictly) faster than g, written f(n) = ω(g(n)), if f(n) = Ω(g(n)) but f(n) ≠ O(g(n)).
This notation is summarized, in two different ways, in Figure 6.5.
Figure 6.4: A function g(n), a function that’s Ω(g) (grows no slower than g), and a function that’s O(g) (grows no faster than g).
Ω is the Greek letter Omega written in upper case; ω is the same Greek letter written in lower case.

if f(n) = O(g(n)) and f(n) = Ω(g(n)), then f(n) = Θ(g(n))
if f(n) = O(g(n)) and f(n) ≠ Ω(g(n)), then f(n) = o(g(n))
if f(n) ≠ O(g(n)) and f(n) = Ω(g(n)), then f(n) = ω(g(n))
if f(n) ≠ O(g(n)) and f(n) ≠ Ω(g(n)), then —

Column A: ∃c > 0, n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≤ c · g(n)
Column B: ∃d > 0, n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≥ d · g(n)

notation            meaning                            Column A      Column B
f(n) = O(g(n))      f grows no faster than g           yes           don’t care
f(n) = Ω(g(n))      f grows no slower than g           don’t care    yes
f(n) = Θ(g(n))      f grows at the same rate as g      yes           yes
f(n) = o(g(n))      f grows strictly slower than g     yes           no
f(n) = ω(g(n))      f grows strictly faster than g     no            yes
Figure 6.5: Summary of notation for asymptotic notation, in two different ways.
Example 6.5 (f = (n))
Problem: Let f(n) = 3n^2 + 1. Is f(n) = O(n)? Ω(n)? Θ(n)? o(n)? ω(n)? Prove your answers.
Solution: Once we determine whether f(n) = O(n) and whether f(n) = Ω(n), we can answer all parts of the question using Figure 6.5(a).
• f(n) = Ω(n). For n ≥ 1, we have n ≤ n^2 ≤ 3n^2 + 1 = f(n). Thus selecting d = 1 and n0 = 1 satisfies Definition 6.2.
• f(n) ≠ O(n). Let c > 0 be arbitrary. For any n ≥ c/3, we have 3n^2 + 1 > 3n^2 ≥ c · n. Therefore, for any n0 ≥ 0, there exists an n ≥ n0 such that f(n) > c · n. (Namely, for n = max(n0, c/3), we have n ≥ n0 and f(n) > c · n.) Thus, every constant c > 0 fails to satisfy the requirements of Definition 6.1, and therefore f(n) ≠ O(n).
Assembling f(n) = Ω(n) and f(n) ≠ O(n) with Figure 6.5(a), we can also conclude that f(n) = ω(n), f(n) ≠ Θ(n), and f(n) ≠ o(n).
Taking it further: We’ve given definitions of O(·), Ω(·), Θ(·), o(·), and ω(·) that are based on nested quantifiers: there exists a multiplicative constant such that, for all sufficiently large n, . . .. For those with a more calculus-based mindset, we could also give an equivalent definition in terms of limits:
• f(n) = O(g(n)) if lim_{n→∞} f(n)/g(n) is finite;
• f(n) = Ω(g(n)) if lim_{n→∞} f(n)/g(n) is nonzero;
• f(n) = Θ(g(n)) if lim_{n→∞} f(n)/g(n) is finite and nonzero;
• f(n) = o(g(n)) if lim_{n→∞} f(n)/g(n) = 0; and
• f(n) = ω(g(n)) if lim_{n→∞} f(n)/g(n) = ∞.
For the function f(n) = 3n^2 + 1 in Example 6.5, for example, observe that lim_{n→∞} f(n)/n = ∞. Thus f(n) = Ω(n) and f(n) = ω(n), but none of the other asymptotic relationships holds.
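The limit-based view can be explored numerically by tabulating the ratio f(n)/g(n) at a few increasingly large inputs. The Python sketch below does this for f(n) = 3n^2 + 1 against n, n^2, and n^3; the helper name ratios and the sample points are hypothetical choices. A handful of sample ratios can only suggest what a limit is, so this is an aid to intuition and not a way to establish any of the relationships above.

```python
def ratios(f, g, ns=(10, 100, 1_000, 10_000)):
    """Return f(n) / g(n) at a few increasingly large sample points."""
    return [f(n) / g(n) for n in ns]


def f(n):
    return 3 * n**2 + 1          # the function from Example 6.5


vs_n = ratios(f, lambda n: n)        # keeps growing, consistent with f(n) = ω(n)
vs_n2 = ratios(f, lambda n: n**2)    # settles near 3, consistent with f(n) = Θ(n^2)
vs_n3 = ratios(f, lambda n: n**3)    # shrinks toward 0, consistent with f(n) = o(n^3)

print(vs_n)
print(vs_n2)
print(vs_n3)
```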
A (possibly counterintuitive) example
Intuitively, the asymptotic symbols O, Ω, Θ, o, and ω correspond to the numerical
comparison symbols ≤, ≥, =, <, and >—but the correspondence isn’t perfect, as we’ll see in this example:

f ̸= Θ(g), f ̸= o(g), and f ̸= ω(g).)
Let a and b be real numbers. The two inequalities a ≤ b and b ≤ a can be true and false in different combinations:
• When a ≤ b and b ≤ a, then a = b.
• When a ≤ b and b ≰ a, then a < b.
• When a ≰ b and b ≤ a, then a > b.
• (It is not possible to have both a ≰ b and b ≰ a.)
Intuitively, the relationship f(n) = O(g(n)) means (approximately!) that
“the growth rate of f ≤ the growth rate of g.” (A)
And, again, intuitively, f(n) = Ω(g(n)) means (approximately)
“the growth rate of f ≥ the growth rate of g.” (B)
So Definitions 6.3, 6.4, and 6.5 correspond to these three combinations: (A) and (B) is Θ; (A) but not (B) is o; and (B) but not (A) is ω. But be careful! For a, b ∈ R, at least one of a ≤ b and a ≥ b must be true. But it’s possible for both of the inequalities (A) and (B) to be false! The function g(n) = n^2 and the function f(n) from Example 6.6 that equals either n^3 or n depending on the parity of n are an example of a pair of functions for which neither (A) nor (B) is satisfied.
Taking it further: The real numbers satisfy the mathematical property of trichotomy (Greek: “division into three parts”): for a, b ∈ R, exactly one of {a < b, a = b, a > b} holds. Functions compared asymptotically do not obey trichotomy: for two functions f and g, it’s possible for none of {f = o(g), f = Θ(g), f = ω(g)} to hold.
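The failure of trichotomy can be made concrete with a function that alternates between n^3 and n according to the parity of n, compared against g(n) = n^2. The Python sketch below is a hypothetical stand-in, not the book's Example 6.6 itself: which parity receives n^3 is an assumption (odd n here), the function names are invented for illustration, and n0 is assumed to be a nonnegative integer. For any proposed constants, the two helpers return an input at which the upper bound or the lower bound fails.

```python
import math


def f(n):
    """Assumed form: n^3 on odd n, and n on even n."""
    return n**3 if n % 2 == 1 else n


def g(n):
    return n**2


def breaks_upper_bound(c, n0):
    """Return an odd n >= n0 with f(n) > c * g(n), so no c, n0 witness f = O(g)."""
    n = max(n0, math.floor(c) + 1, 1)
    return n if n % 2 == 1 else n + 1


def breaks_lower_bound(d, n0):
    """Return an even n >= n0 with f(n) < d * g(n), so no d, n0 witness f = Ω(g)."""
    n = max(n0, math.ceil(1 / d) + 1, 2)
    return n if n % 2 == 0 else n + 1


for n0 in (0, 7, 1000):
    for c in (0.5, 3, 100.0):
        n = breaks_upper_bound(c, n0)
        assert n >= n0 and f(n) > c * g(n)
    for d in (1 / 1024, 0.25, 0.5, 2):
        n = breaks_lower_bound(d, n0)
        assert n >= n0 and f(n) < d * g(n)
```

On odd inputs f outgrows every constant multiple of n^2, and on even inputs f falls below every positive constant multiple of n^2, so neither (A) nor (B) can hold for this pair.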
Before we begin to apply asymptotic notation to the analysis of algorithms, we’ll close this section with a few notes about the use (and abuse) of asymptotic notation.
Using asymptotics in arithmetic expressions
It is often convenient to use asymptotic notation in arithmetic expressions. We permit ourselves to write something like O(n log n) + O(n^3) = O(n^3), which intuitively means that, given functions that grow no faster than n log n and n^3, their sum grows no faster than n^3 too. When asymptotic notation like O(n^2) appears on the left-hand side of an equality, we interpret it to mean an arbitrary unnamed function that grows no faster than n^2. For example, making log n calls to an algorithm whose running time is O(n) requires log n · O(n) = O(n log n) time.
Using asymptotics with multiple variables
It will also occasionally turn out to be convenient to be able to write asymptotic expressions that depend on more than one variable. Giving a precise technical definition of multivariate asymptotic notation is a bit subtle, but the intuition precisely matches the univariate definitions we’ve already given. We’ll use the notation g(n, m) = O(f(n, m)) to mean “there exists a constant c such that, for all sufficiently large n and m, g(n, m) ≤ c · f(n, m).” For example, the function f(n, m) = n^2 + 3m − 5 satisfies f(n, m) = O(n^2 + m).
A common mistake and some meaningless language
There is a widespread (and incorrect) sloppy use of asymptotic notation: it is unfortunately common for people to use O(·) when they mean Θ(·). You will sometimes encounter claims like:
“I prefer f to g, because f(n) = O(n^2) and g(n) = O(n^3).” (1)
But this statement doesn’t make sense: O(·) defines only an upper bound, so either of f or g might grow more slowly than the other! Saying (1) is like saying
“Alice is richer than Bob, because Alice has at most $1,000,000,000 and Bob has at most $1,000,000.” (2)
(Alice might be richer than Bob, but perhaps they both have twenty bucks each, or perhaps Bob has $1,000,000 and Alice has nothing.) Use O(·) when you mean O(·), and use Θ(·) when you mean Θ(·), and be aware that others may use O(·) improperly. (And, gently, correct them if they’re doing so.)
There’s a related imprecise use of asymptotics that leads to statements that don’t mean anything. For example, consider statements like “f(n) is at least O(n^3)” or “f(n) is at most Ω(n^2).” These sentences have no meaning: they say “f(n) grows at least as fast as at most as fast as n^3” and “f(n) grows at most as fast as at least as fast as n^2.” (?!?) Be careful: use upper bounds as upper bounds, and use lower bounds as lower bounds! Again, by analogy, consider the sentences
“My weight is more than ≤ 100 kilograms” (3)
or “I am shorter than some person who is taller than 4 feet tall.” (4)
or “You could save up to 50% or more!” (5)
None of these sentences says anything!
Thanks to Tom Wexler for suggesting (5).

Computer Science Connections
Moore’s Law
In 1965, Gordon Moore, one of the co-founders of Intel, published an article making a basic prediction (and it’s been reinterpreted many times) that processing power would double roughly once every 18–24 months.2 (It’s been debated and revised over time, by, for example, interpreting “processing power” as the number of transistors, the most basic element of a processor, out of which logic gates like AND, OR, and NOT are built, rather than what we can actually compute.) This prediction later came to be known as Moore’s Law; it’s not a real “law” like Ohm’s Law or the Law of Large Numbers, of course, but rather simply a prediction. That said, it’s proven to be a remarkably robust prediction: for something like 40 to 50 years, it has proven to be a consistent guide to the massive increase in processing power for a typical computer user over the last decades. (See Figure 6.7.)
2 Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.
Figure 6.7: A plot of the number of transistors per processor, for about 15 Intel brand processors introduced over the last 50 years. (Data are from an Intel press release celebrating the 40th anniversary of the original publication of Moore’s Law.) The dashed line indicates the rate of growth we’d see if the number of transistors per processor doubled every two years (starting with the Intel 4004 in 1971).
[Figure 6.7 plot: number of transistors (log scale, 10^2 to 10^9) against year of introduction (1970 to 2010). Labeled points: Intel 4004, Intel 386, Intel Pentium, Intel Pentium 4. Dashed line = doubling every 24 months.]
Claims that “Moore’s Law is just about to end!” have been made for many decades (we’re beginning to run up against physical limits in the size of transistors!), and yet Moore’s Law has still proven to be remarkably accurate over time. Its imminent demise is still predicted today, and yet it’s still a pretty good model of computing power.3 One probable reason that Moore’s Law has held for as long as it has is a little bizarre: the repeated publicity surrounding Moore’s Law! Because chip manufacturing companies “know” that the public generally expects processors to have twice as many transistors in two years, these companies may actually be setting research-and-development targets based on meeting Moore’s Law. (Just as in a physical system, we cannot observe a phenomenon without changing it!)
3 Gordon E. Moore. No exponential is forever: but “forever” can be delayed! In International Solid-State Circuits Conference, 2003.
6.2.3 Exercises
Part of the motivation for asymptotic analysis was that algorithms are typically analyzed ignoring constant factors. Ignoring constant factors in analyzing an algorithm may seem strange: if algorithm A runs twice as fast as B, then A is way faster! But the reason we care more about asymptotic running time is that even an improvement by a factor of 2 is quickly swamped by an asymptotic improvement for even slightly larger inputs. Here are a few examples:
6.1 Suppose that linear search can find an element in a sorted list of n elements in n steps on a particular machine. Binary search (perhaps not implemented especially efficiently) requires 100 log n steps. For what values of n ≥ 2 is linear search faster?
Alice implements Merge Sort so, on a particular machine, it requires exactly ⌈8n log n⌉ steps to sort n elements. Bob implements Heap Sort so it requires exactly ⌈5n log n⌉ steps to sort n elements. Charlie implements Selection Sort so it requires exactly 2n^2 steps to sort n elements. Suppose that Alice can sort 1000 elements in 1 minute.
6.2 How many elements can Bob sort in a minute? How many can Charlie sort in a minute?
6.3 What is the largest value of n that Charlie can sort faster than Alice?
6.4 Charlie, devastated by the news from the last exercise, buys a computer that’s twice the speed of Alice’s. What is the largest value of n that Charlie can sort faster than Alice now?
Let f(n) = 9n + 3 and let g(n) = 3n^3 − n^2. (See the first plot in Figure 6.8.)
6.5 Prove that f(n) = O(n).
6.6 Prove that f(n) = O(n^2).
6.7 Prove that f(n) = O(g(n)).
6.8 Prove that g(n) = O(n^3).
6.9 Prove that g(n) = O(n^4).
6.10 Prove that g(n) is not O(n^2).
6.11 Prove that g(n) is not O(n^(3−ε)), for any ε > 0.
Prove that the following functions are all O(n^2). (See the second plot in Figure 6.8.)
6.12 f(n) = 7n
6.13 g(n) = 3n^2 + sin n
6.14 h(n) = 202
The next few exercises ask you to explore the definition of O(·) in a little more detail.
6.15 Suppose f (n) = O(g(n)). Explain why there are infinitely many choices of c and infinitely many choices of n0 that satisfy the definition of O(·).
Consider two functions f, g : Z≥0 → Z≥0. We defined O(·) notation as follows:
• f(n) = O(g(n)) if there exist constants c > 0 and n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≤ c · g(n).
It turns out that both c and n0 are necessary to the definition. Define the following two pieces of alternative asymptotic notation, leaving out c (using c = 1) and n0 (using n0 = 1) from the definition:
• f(n) = P(g(n)) if there exists a constant n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≤ g(n).
• f(n) = Q(g(n)) if there exists a constant c > 0 such that ∀n ≥ 1 : f(n) ≤ c · g(n).
Prove that P(·) and Q(·) are both different from O(·); that is, we can’t just use either of the new definitions without changing what we meant. Specifically, prove that there exist functions f and g such that . . .
6.16 … either (i) f = O(g) but f ≠ P(g), or (ii) f ≠ O(g) but f = P(g).
6.17 … either (i) f = O(g) but f ≠ Q(g), or (ii) f ≠ O(g) but f = Q(g).
The next several exercises ask you to prove some of the properties of O(·) that we stated without proof earlier in the section. (For a model of a proof of this type of property, see Lemma 6.1 and its proof in this section.)
6.18 Prove Lemma 6.2, the transitivity of O(·): if f (n) = O(g(n)) and g(n) = O(h(n)), then f (n) = O(h(n)).
Prove Lemma 6.3: if f (n) = O(h1(n)) and g(n) = O(h2(n)), then . . .
6.19 . . . prove that f (n) + g(n) = O(h1 (n) + h2 (n)).
6.20 . . . prove that f (n) · g(n) = O(h1 (n) · h2 (n)).
Figure 6.8: Two sets of functions, for Exercises 6.5–6.11 and 6.12–6.14. [Left plot: f(n) = 9n + 3 and g(n) = 3n^3 − n^2 for 0 ≤ n ≤ 5. Right plot: f(n), g(n), and h(n) for 0 ≤ n ≤ 10.]

6.38 Prove or disprove: the all-zero function f (n) = 0 is the only function that is Θ(0).
6.39 Give an example of a function f(n) such that f(n) = Θ(f(n)^2).
6.40 Let k ∈ Z≥0 be any constant. Prove that n^k = o(n!).
6.41 Let f : Z≥0 → Z≥0 be an arbitrary function. Define the function g(n) = f(n) + 1. Prove that g(n) = O(f(n)) if and only if f(n) = Ω(1).
6.42 Fill in each blank in the following table with an example of a function f that satisfies the stated conditions, or argue that it’s impossible to satisfy both conditions:
f(n) is …          | o(n^2) | ≠ o(n^2)
… and ω(n^2)       |        |
… and ≠ ω(n^2)     |        |
6.43 Let f and g be arbitrary functions. Prove that at most one of the three properties f(n) = o(g(n)) and f(n) = Θ(g(n)) and f(n) = ω(g(n)) can hold.
6.44 Complete the proof in Example 6.6: prove that f(n) ≠ Ω(n^2), where f(n) is the function
    f(n) = n^3   if n is even
    f(n) = n     if n is odd.
Many of the properties of O(·) also hold for the other four asymptotic notions. Prove the following transitivity properties for arbitrary functions f, g, and h:
6.45 If f (n) = Ω(g(n)) and g(n) = Ω(h(n)), then f (n) = Ω(h(n)).
6.46 If f (n) = Θ(g(n)) and g(n) = Θ(h(n)), then f (n) = Θ(h(n)).
6.47 If f (n) = o(g(n)) and g(n) = o(h(n)), then f (n) = o(h(n)).
For each of the following purported properties related to symmetry, decide whether you think the statement is true or false, and—in either case—prove your answer.
6.48 Prove or disprove: if f (n) = Ω(g(n)), then g(n) = Ω(f (n)).
6.49 Prove or disprove: if f (n) = Θ(g(n)), then g(n) = Θ(f (n)).
6.50 Prove or disprove: if f (n) = ω(g(n)), then g(n) = ω(f (n)).
Do the same for the following purported properties related to reflexivity:
6.51 Prove or disprove: f (n) = O(f (n)).
6.52 Prove or disprove: f (n) = Ω(f (n)).
6.53 Prove or disprove: f (n) = ω(f (n)).
6.54 Consider the false claim (FC-6.1) below, and the bogus proof that follows. Where, precisely, does the proof of (FC-6.1) go wrong?
False Claim: The function f(n) = n^2 satisfies f(n) = O(n). (FC-6.1)
Bogus proof of (FC-6.1). We proceed by induction on n:
base case (n = 1): Then n^2 = 1. Thus f(1) = O(n) because 1 ≤ n for all n ≥ 1. (Choose c = 1 and n0 = 1.)
inductive case (n ≥ 2): Assume the inductive hypothesis; namely, assume that (n − 1)^2 = O(n). We must show that n^2 = O(n). Here is the proof:
\[
\begin{aligned}
n^2 &= (n-1)^2 + 2n - 1 && \text{by factoring} \\
    &= O(n) + 2n - 1 && \text{by the inductive hypothesis} \\
    &= O(n) + O(n) \\
    &= O(n). && \text{by definition of } O(\cdot) \text{ and Lemma 6.3}
\end{aligned}
\]

6.3 Asymptotic Analysis of Algorithms
If everything seems under control, you’re just not going fast enough.
Mario Andretti (b. 1940)
The main reason that computer scientists are interested in asymptotic analysis is for its application to the analysis of algorithms. When, for example, we compare different algorithms that solve the same problem (say, Merge Sort, Selection Sort, and Insertion Sort), we want to be able to give a meaningful answer to the question “which algorithm is the fastest?” (And different inputs may trigger different behaviors in the algorithms under consideration: when the input array is sorted, for example, Insertion Sort is faster than Merge Sort and Selection Sort; when the input is very far from sorted, Merge Sort is fastest. But typically we still would like to identify a single answer to the question of which algorithm is the fastest.)
When evaluating the running time of an algorithm, we generally follow asymptotic principles. Specifically, we will generally ignore constants in the same two ways that O(·) and its asymptotic siblings do:
• First, we don’t care much about what happens for small inputs: there might be small special-case inputs for which an algorithm is particularly fast, but this fast performance on a few special inputs doesn’t mean that the algorithm is fast in general. For example, consider the algorithm for primality testing in Figure 6.10. Despite its speed on a few special cases (n < 100), we wouldn’t consider isPrime-tunedForDoubleDigits a faster algorithm for primality testing in general than isPrime. We seek general answers to the question “which algorithm is faster?”, which leads us to pay little heed to special cases.
• Second, we typically evaluate the running time of an algorithm not by measuring elapsed time on the “wall clock,” but rather by counting the number of steps that the algorithm takes to complete. (How long a program takes on your laptop, in terms of the wall clock, is affected by all sorts of things unrelated to the algorithm, like whether your virus checker is running while the algorithm executes.) We will generally ignore multiplicative constants in counting the number of steps consumed by an algorithm. One reason is so that we can give a machine-independent answer to the “which algorithm is faster?” question; how much is accomplished by one instruction on an Intel processor may be different from one instruction on an AMD processor, and ignoring constants allows us to compare algorithms in a way that doesn’t depend on grungy details about the particular machine.

isPrime-tunedForDoubleDigits(n):
  if n ∈ {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97} then
    return True
  else if n ≤ 100 then
    return False
  else
    return isPrime(n), from Figure 4.28.
Figure 6.10: A trivially faster algorithm for testing primality.

Definition 6.6 (Running time of an algorithm on a particular input)
Consider an algorithm A and an input x. The running time of algorithm A on input x is the number of primitive steps that A takes when it’s run on input x.

For example, we can consider the running time of the algorithm binarySearch on the input x = ⟨[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31], 4⟩. The precise number of primitive steps in this execution depends on the particular machine on which the algorithm is being run, but it involves successively comparing 4 to 13, then 5, then 2, and finally 3.

Taking it further: Definition 6.6 is intentionally vague about what a “primitive step” is, but it’s probably easiest to think of a single machine instruction as a primitive step. That single machine instruction might add or compare two numbers, increment a counter, return a value, etc. Different hardware systems might have different granularity in their “primitive steps” (perhaps a Mac desktop can “do more” in one machine instruction than an iPhone can do), but, as we just indicated, we’ll look to analyze algorithms independently of this detail.
We typically evaluate an algorithm’s efficiency by counting, asymptotically, the number of primitive steps used by an algorithm’s execution, rather than by using a stopwatch to measure how long the algorithm actually takes to run on a particular input on a particular machine. One reason is that it’s very difficult to properly measure this type of performance; see p. 627 for some discussion about why.
In certain applications, particularly those in scientific computing (the subfield of CS devoted to processing and analyzing real-valued data, where we have to be concerned with issues like accumulated rounding errors in long calculations), it is typical to use a variation on asymptotic analysis. Calculations on integers are substantially cheaper than those involving floating point values; thus in this field one typically doesn’t bother counting integer operations, and instead we only track floating point operations, or flops. Because flops are substantially more expensive, often we’ll keep track of the constant on the leading (highest-degree) term; for example, an algorithm might require (3/2)n^2 + O(n log n) flops or 2n^2 + O(n) flops. (We’d choose the former.)

6.3.1 Worst-Case Analysis
We will generally evaluate the efficiency of an algorithm A by thinking about its performance as the input gets large: what happens to the number of steps consumed by A as a function of the input size n? Furthermore, we generally assume the worst: when we ask about the running time of an algorithm A on an input of size n, we are interested in the running time of A on the input of size n for which A is the slowest.

Definition 6.7 (Worst-case running time of an algorithm)
The worst-case running time of an algorithm A is
\[
T_A(n) = \max_{x \,:\, |x| = n} \bigl[\text{the number of primitive steps used by } A \text{ on input } x\bigr].
\]
We will be interested in the asymptotic behavior of the function T_A(n).

When we perform worst-case analysis of an algorithm (analyzing the asymptotic behavior of the function T_A(n)), we seek to understand the rate at which the running time of the algorithm increases as the input size increases. Because a primary goal of algorithmic analysis is to provide a guarantee on the running time of an algorithm, we will be pessimistic, and think about how quickly A performs on the input of size n that’s the worst for algorithm A.

Taking it further: Occasionally we will perform average-case analysis instead of worst-case analysis: we will compute the expected (average) performance of algorithm A for inputs drawn from an appropriate distribution. It can be difficult to decide on an appropriate distribution, but sometimes this approach makes more sense than being purely pessimistic. See Section 6.3.2.
It’s also worth noting that using asymptotic, worst-case analysis can sometimes be misleading. There are occasions in which an algorithm’s performance in practice is very poor despite a “good” asymptotic running time; for example, because the multiplicative constant suppressed by the O(·) is massive. (And conversely: sometimes an algorithm that’s asymptotically slow in the worst case might perform very well on problem instances that actually show up in real applications.) Asymptotics capture the high-level performance of an algorithm, but constants matter too!

Figure 6.11 shows a sampling of worst-case running times for a number of the algorithms you may have encountered earlier in this book or in previous CS classes. In the rest of this section, we’ll prove some of these results as examples.

Some examples: sorting algorithms
We’ll now turn to a few examples of worst-case analysis of several different sorting and searching algorithms. We’ll start with three sorting algorithms, illustrated in Figure 6.13:
• Selection Sort: repeatedly find the minimum element in the unsorted portion of A; then swap that minimum element into the first slot of the unsorted segment of A.
• Insertion Sort: maintain a sorted prefix of A (initially consisting only of the first element); repeatedly expand the sorted prefix by one element, by continuing to swap the first unsorted element backward in the array until it’s in place.
• Bubble Sort: make n left-to-right passes through A; in each pass, swap each pair of adjacent elements that are out of order.
We’ll start our analysis with Selection Sort, whose pseudocode is shown in Figure 6.12. (The pseudocode for the other algorithms will accompany their analysis.)

Example 6.7 (Selection Sort)
Problem: What is the worst-case running time of Selection Sort?
Solution: The outer for loop’s body (lines 2–6) is executed n times, once each for i = 1 . . . n. We complete the body of the inner for loop (lines 4–5) a total of n − i times in iteration i. Thus the total number of times that we execute lines 4–5 is
\[
\sum_{i=1}^{n} (n - i) = n^2 - \sum_{i=1}^{n} i = n^2 - \frac{n(n+1)}{2} = \frac{n^2 - n}{2},
\]
where \sum_{i=1}^{n} i = n(n+1)/2 by Lemma 5.4.
Notice that the only variation in the running time of Selection Sort based on the particular input array A[1 . . . n] is in line 5; the number of times that minIndex is reassigned can vary from as low as 0 to as high as n − i. The remainder of the algorithm behaves precisely identically regardless of the input array values. Thus, for some constants c1 > 0 and c2 > 0 the total number of primitive steps used by the algorithm is c1·n + c2·n^2 (for lines 1, 2, 3, 4, and 6), plus some number x of executions of line 5, where 0 ≤ x ≤ ∑_{i=1}^{n} (n − i) ≤ n^2, each of which takes a constant c3 number of steps. Thus the total running time is between c1·n + c2·n^2 and c1·n + (c2 + c3)·n^2. The asymptotic worst-case running time of Selection Sort is therefore Θ(n^2).
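The count in Example 6.7 can be checked numerically. The sketch below is only an illustration: the function names `inner_loop_executions` and `closed_form` are made up here, and the code mirrors just the loop structure of Selection Sort (no array is sorted), on the assumption that each pass through the inner loop body counts as one execution of lines 4–5.

```python
def inner_loop_executions(n):
    """Count how many times the inner loop body (lines 4-5) runs for an n-element array."""
    count = 0
    for i in range(1, n + 1):          # line 1: for i := 1 to n
        for j in range(i + 1, n + 1):  # line 3: for j := i + 1 to n
            count += 1                 # one execution of lines 4-5
    return count


def closed_form(n):
    """The value (n^2 - n) / 2 derived in Example 6.7."""
    return (n * n - n) // 2
```

For every n tried, the brute-force count and the closed form agree, which is what the summation argument predicts.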
Figure 6.11: The running time of some sample algorithms.
worst-case running time | sample algorithm(s)
Θ(1)                    | push/pop in a stack
Θ(log n)                | binary search
Θ(√n)                   | isPrimeBetter (p. 454)
Θ(n)                    | linear search, isPrime
Θ(n log n)              | merge sort
Θ(n^2)                  | selection sort, insertion sort, bubble sort
Θ(n^3)                  | naïve matrix multiplication
Θ(2^n)                  | brute-force satisfiability algorithm
selectionSort(A[1 . . . n]):
1: for i := 1 to n:
2:   minIndex := i
3:   for j := i + 1 to n:
4:     if A[j] < A[minIndex] then
5:       minIndex := j
6:   swap A[i] and A[minIndex]
Figure 6.12: Selection Sort.

(a) Selection Sort: 35214, 15234, 12534, 12354, 12345
(b) Insertion Sort: 35214, 35214, 23514, 12354, 12345
(c) Bubble Sort: 35214, 35214, 32514, 32154, 32145, 23145, 21345, 21345, 21345, 12345, …, 12345

We are generally interested in the asymptotic performance of algorithms, so the particular values of the constants c1, c2, and c3 from Example 6.7, which reflect the number of primitive steps corresponding to each line of the pseudocode in Figure 6.12, are irrelevant to our final answer. (One exception is that we may sometimes try to count exactly the number of comparisons between elements of A, or swaps of elements of A; see Exercises 6.55–6.63.)
We’ll now turn to our second sorting algorithm, Insertion Sort (Figure 6.14). Insertion Sort proceeds by maintaining a sorted prefix of the given array (initially the sorted prefix consists only of the first element); it then repeatedly expands the sorted prefix one element at a time, by continuing to swap the first unsorted element backward.

Example 6.8 (Insertion Sort)
Insertion Sort is more sensitive to the structure of its input than Selection Sort: if A is in sorted order, then the while loop in lines 3–5 terminates immediately (because the test A[j] < A[j − 1] fails); whereas if the input array is in reverse sorted order, then the while loop in lines 3–5 completes i − 1 iterations. In fact, the reverse-sorted array is the worst-case input for Insertion Sort: there can be as many as i − 1 iterations of the while loop, and there cannot be more than i − 1 iterations. If the while loop goes through i − 1 iterations, then the total amount of work done is
\[
\sum_{i=1}^{n} \bigl(c + (i-1)d\bigr) = (c - d)n + \sum_{i=1}^{n} id = (c - d)n + d \cdot \frac{n(n+1)}{2} = \Bigl(c - \frac{d}{2}\Bigr)n + \frac{d}{2}\,n^2,
\]
where c and d are constants corresponding to the work of lines 1–2 and 3–5, respectively. This function is Θ(n^2), so Insertion Sort’s worst-case running time is Θ(n^2).
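To see the input sensitivity described in Example 6.8 concretely, here is a minimal Python sketch of Insertion Sort that also counts swaps. The name `insertion_sort_counting` and the choice to count swaps (as a stand-in for iterations of the while loop) are assumptions made for illustration; the code uses 0-based indexing rather than the 1-based indexing of the pseudocode.

```python
def insertion_sort_counting(values):
    """Sort a copy of `values`; return (sorted_list, swaps)."""
    a = list(values)
    swaps = 0
    for i in range(1, len(a)):
        j = i
        # Swap the first unsorted element backward until it is in place.
        while j > 0 and a[j] < a[j - 1]:
            a[j], a[j - 1] = a[j - 1], a[j]
            swaps += 1
            j -= 1
    return a, swaps
```

On an already-sorted input the while loop never runs, so the swap count is 0; on a reverse-sorted input of n elements, the element at 0-based index i moves back i positions, for n(n − 1)/2 swaps in total.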
Figure 6.13: Three sorting algorithms applied to the list 3, 5, 2, 1, 4. Selection Sort repeatedly finds the minimum element in the unsorted segment and swaps it into place. Insertion Sort repeatedly extends a sorted prefix by swapping the next element backward into position. Bubble Sort repeatedly compares adjacent elements and swaps them if they’re out of order.
Figure 6.14: Insertion Sort.
insertionSort(A[1 . . . n]):
1: for i := 2 to n:
2:   j := i
3:   while j > 1 and A[j] < A[j − 1]:
4:     swap A[j] and A[j − 1]
5:     j := j − 1

Finally, we will analyze a third sorting algorithm: Bubble Sort (Figure 6.15), which makes n left-to-right passes through the array; in each pass, adjacent elements that are out of order are swapped. Bubble Sort is a very simple sorting algorithm to analyze. (But, in practice, it is also a comparatively slow sorting algorithm to run!)

bubbleSort(A[1 . . . n]):
1: for i := 1 to n:
2:   for j := 1 to n − i:
3:     if A[j] > A[j + 1] then
4:       swap A[j] and A[j + 1]
Figure 6.15: Bubble Sort.

Example 6.9 (Bubble Sort)
Bubble Sort simply repeatedly compares A[j] and A[j + 1] (swapping the two elements if necessary) for many different values of j. Every time the body of the inner loop, Lines 3–4, is executed, the algorithm does a constant amount of work: exactly one comparison and either zero or one swaps. Thus there are two constants c > 0 and d > 0 such that any particular execution of Lines 3–4 takes an amount of time t satisfying c ≤ t ≤ d. Therefore the total running time of Bubble Sort is somewhere between ∑_{i=1}^{n} ∑_{j=1}^{n−i} c and ∑_{i=1}^{n} ∑_{j=1}^{n−i} d. The summation ∑_{i=1}^{n} (n − i) is Θ(n^2), precisely as we analyzed in Example 6.7, and thus Bubble Sort’s running time is Ω(cn^2) = Ω(n^2) and O(dn^2) = O(n^2). Therefore Bubble Sort is Θ(n^2).

Problem-solving tip: Precisely speaking, the number of primitive steps required to execute, for example, Lines 3–4 of Bubble Sort varies based on whether a swap has to occur. In Example 6.9, we carried through the analysis considering two different constants representing this difference. But, more simply, we could say that Lines 3–4 of Bubble Sort take Θ(1) time, without caring about the particular constants. You can use this simpler approach to streamline arguments like the one in Example 6.9.

Before we close, we’ll mention one more sorting algorithm, Merge Sort, which proceeds recursively by splitting the input array in half, recursively sorting each half, and then “merging” the sorted subarrays into a single sorted array. But we will defer the analysis of Merge Sort to Section 6.4: to analyze recursive algorithms like Merge Sort, we will use recurrence relations which represent the algorithm’s running time itself as a recursive function.

Some more examples: search algorithms
We will now turn to some examples of search algorithms, which determine whether a particular value x appears in an array A. We’ll start with Linear Search (see Figure 6.16), which simply walks through the (possibly unsorted) array A and successively compares each element to the sought value x.

linearSearch(A[1 . . . n], x):
Input: an array A[1 . . . n] and an element x
Output: is x in the (possibly unsorted) array A?
1: for i := 1 to n:
2:   if A[i] = x then
3:     return True
4: return False
Figure 6.16: Linear Search.

Unless otherwise specified (and we will rarely specify otherwise), we are interested in the worst-case behavior of algorithms. This concern with worst-case behavior includes lower bounds! Here’s an example of the analysis of an algorithm that suffers from this confusion:

Example 6.10 (Linear Search, unsatisfactorily analyzed)
Problem: What is incomplete or incorrect in the following analysis of the worst-case running time of Linear Search?
The running time of Linear Search is obviously O(n): we at most iterate over every element of the array, performing a constant number of operations per element. And it’s obviously Ω(1): no matter what the inputs A and x are, the algorithm certainly at least does one operation (setting i := 1 in line 1), even if it immediately returns because A[1] = x.

Solution: The analysis is correct, but it gives a looser lower bound than can be shown: specifically, the running time of Linear Search is Ω(n), and not just Ω(1). If we call linearSearch(A, 42) for an array A[1 . . . n] that does not contain the number 42, then the total number of steps required by the algorithm will be at least n, because every element of A is compared to 42. Performing n comparisons takes Ω(n) time.
Taking it further: When we’re analyzing an algorithm A’s running time, we can generally prove several different lower and upper bounds for A. For example, we might be able to prove that the running time is Ω(1), Ω(log n), Ω(n), O(n^2), and O(n^3). The bound Ω(1) is a loose bound, because it is superseded by the bound Ω(log n). (That is, if f(n) = Ω(log n) then f(n) = Ω(1).) Similarly, O(n^3) is a loose bound, because it is implied by O(n^2).
We seek asymptotic bounds that are as tight as possible, so we always want to prove f(n) = Ω(g(n)) and f(n) = O(h(n)) for the fastest-growing function g and slowest-growing function h that we can. If g = h, then we have proven a tight bound, or, equivalently, that f(n) = Θ(g(n)). Sometimes there are algorithms for which we don’t know a tight bound; we can prove Ω(n) and O(n^2), but the algorithm might be Θ(n) or Θ(n^2) or Θ(n log n log log log n) or whatever. In general, we want to give upper and lower bounds that are as close together as possible.
Here is a terser writeup of the analysis of Linear Search:
Example 6.11 (Linear Search)
The worst case for Linear Search is an array A[1 . . . n] that doesn’t contain the element x. In this case, the algorithm compares x to all n elements of A, taking Θ(n) time.
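The worst case in Example 6.11 is easy to observe by counting comparisons. The following is a minimal sketch, assuming that one comparison of an array element with x is the unit of work; the name `linear_search_counting` is hypothetical.

```python
def linear_search_counting(a, x):
    """Return (found, comparisons) for a left-to-right scan of the list `a`."""
    comparisons = 0
    for item in a:
        comparisons += 1
        if item == x:
            return True, comparisons
    return False, comparisons
```

When x is absent, the count equals the length of the array, which is the Ω(n) behavior that the analysis in Example 6.10 overlooked.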
Binary Search (see Figure 6.17(a)) is another search algorithm for locating a value x in an array A[1 . . . n], if the array is sorted. It proceeds by defining a range of the array in which x would be found if it is present, and then repeatedly halving the size of that range by comparing x to the middle entry in that range. Let’s analyze the running time of Binary Search.
Example 6.12 (Binary Search)
The intuition is fairly straightforward. In every iteration of the while loop in lines 3–10, we halve the range of elements under consideration (that is, |{i : lo ≤ i ≤ hi}|). We can halve a set of size n only log_2 n times before there’s only one element left, and therefore we have at most 1 + log_2 n iterations of the while loop. Each of those iterations takes a constant amount of time, and therefore the total running time is O(log n).
(a) The code.
binarySearch(A[1 . . . n], x):
Input: a sorted array A[1 . . . n]; an element x
Output: is x in the (sorted) array A?
1: lo := 1
2: hi := n
3: while lo ≤ hi:
4:   middle := ⌊(lo + hi)/2⌋
5:   if A[middle] = x then
6:     return True
7:   else if A[middle] > x then
8:     hi := middle − 1
9:   else
10:    lo := middle + 1
11: return False

(b) An illustration of the split. When lo = 1 and hi = n, then middle = ⌊(n + 1)/2⌋. Because ⌊(n + 1)/2⌋ = ⌈n/2⌉, there are ⌈n/2⌉ − 1 elements before middle and ⌊n/2⌋ elements after middle.

Figure 6.17: Binary Search.

To translate this intuition into a more formal proof, suppose that the range of elements under consideration at the beginning of an iteration of the while loop is A[lo, . . . , hi], which contains k = hi − lo + 1 elements. There are ⌈k/2⌉ − 1 elements in A[lo, . . . , middle − 1] and ⌊k/2⌋ elements in A[middle + 1, . . . , hi]. Then, after comparing x to A[middle], one of three things happens:
• we find that x = A[middle], and the algorithm terminates.
• we find that x < A[middle], and we continue on a range of the array that contains ⌈k/2⌉ − 1 ≤ k/2 elements.
• we find that x > A[middle], and we continue on a range of the array that contains ⌊k/2⌋ ≤ k/2 elements.
In any of the three cases, we have at most k/2 elements under consideration in the next iteration of the loop. (See Figure 6.17(b).)
Initially, the number of elements under consideration has size n. Therefore after i iterations, there are at most n/2^i elements left under consideration. (This claim can be proven by induction.) Therefore, after at most log_2 n iterations, there is only one element left under consideration. Once the range contains only one element, we complete at most one more iteration of the while loop. Thus the total number of iterations is at most 1 + log_2 n. Each iteration takes a constant number of steps, and thus the total running time is O(log n).
Notice that analyzing the running time of any single iteration of the while loop in the algorithm was easy; the challenge in determining the running time of binarySearch lies in figuring out how many iterations occur.
Here we have only shown an upper bound on the running time of Binary Search; in Example 6.26, we’ll prove that, in fact, Binary Search takes Θ(log n) time. (Just as for Linear Search, the worst-case input for Binary Search is an n-element array that does not contain the sought value x; in this case, we complete all logarithmically many iterations of the loop, and the running time is therefore Ω(log n) too.)
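The iteration bound can also be checked empirically. Below is a minimal Python sketch of Binary Search that counts iterations of the while loop. The name `binary_search_counting` is hypothetical and the code uses 0-based indexing, so it is an adaptation of the pseudocode rather than a transcription of it.

```python
def binary_search_counting(a, x):
    """Return (found, iterations) for binary search on the sorted list `a`."""
    lo, hi = 0, len(a) - 1
    iterations = 0
    while lo <= hi:
        iterations += 1
        middle = (lo + hi) // 2
        if a[middle] == x:
            return True, iterations
        elif a[middle] > x:
            hi = middle - 1   # continue in the left part of the range
        else:
            lo = middle + 1   # continue in the right part of the range
    return False, iterations
```

For a list of n ≥ 1 elements, the iteration count never exceeds 1 + ⌊log_2 n⌋, which in Python is the same number as `n.bit_length()`. On the 11-element list of primes used earlier in this section, searching for 4 takes four iterations (comparing against 13, 5, 2, and 3) before reporting that 4 is absent.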
6.3.2 Some Other Types of Analysis
So far we have focused on asymptotically analyzing the worst-case running time of algorithms. While this type of analysis is the one most commonly used in the analysis of algorithms, there are other interesting types of questions that we can ask about algorithms. We’ll sketch two of them in this section: instead of being completely pessimistic about the particular input that we get, we might instead consider either the best possible case or the “average” case.
Best-case analysis of running time
Best-case running time simply replaces the “max” from Definition 6.7 with a “min”:
Definition 6.8 (Best-case running time of an algorithm)
The best-case running time of an algorithm A on an input of size n is
\[
T_A^{\text{best}}(n) = \min_{x \,:\, |x| = n} \bigl[\text{the number of primitive steps used by } A \text{ on input } x\bigr].
\]
Best-case analysis is rarely used; knowing that an algorithm might be fast (on inputs for which it is particularly well tuned) doesn’t help much in drawing generalizable conclusions about its performance (on the input that it’s actually called on).
Average-case analysis of running time
The “average” running time of an algorithm A is subtler to state formally, because “average” means that we have to have a notion of which values are more or less likely to be chosen as inputs. (For example, consider sorting. In many settings, an already-sorted array is the most common input type to the sorting algorithm; the programmer just wanted to “make sure” that the input was sorted, even though he might have been pretty confident that it already was.) The simplest way to do average-case analysis is to consider inputs that are chosen uniformly at random from the space of all possible inputs. For example, for sorting algorithms, we would consider each of the n! different orderings of {1, 2, . . . , n} to be equally likely inputs of size n.
“Optimism, n. The doctrine or belief that everything is beautiful, including what is ugly.”
— Ambrose Bierce (1842–≈1913), The Devil’s Dictionary (1911)
Definition 6.9 (Average-case running time of an algorithm)
Let X denote the set of all possible inputs to an algorithm A. The average-case running time of an algorithm A for a uniformly chosen input of size n is
\[ T_A^{\text{avg}}(n) = \frac{1}{|\{y \in X : |y| = n\}|} \cdot \sum_{x \in X \,:\, |x| = n} \bigl[\text{number of primitive steps used by } A \text{ on } x\bigr]. \]
Taking it further: Let ρn be a probability distribution over {x ∈ X : |x| = n}—that is, let ρn be a function such that ρn(x) denotes the fraction of the time that a size-n input to A is x. Definition 6.9 considers the uniform distribution, where ρn (x) = 1/| {x ∈ X : |x| = n} |.
The average-case running time of A on inputs of size n is the expected running time of A for an input x of size n chosen according to the probability distribution ρn. We will explore both probability distributions and expectation in detail in Chapter 10, which is devoted to probability. (If someone refers to the average case of an algorithm without specifying the probability distribution ρ, then they probably mean that ρ is the uniform distribution, as in Definition 6.9.)
We will still consider the asymptotic behavior of the best-case and average-case running times, for the same reasons that we are generally interested in the asymptotic behavior in the worst case.
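To make Definitions 6.7 through 6.9 concrete, here is a minimal Python sketch that evaluates the best, average, and worst cases by brute force over all n! orderings of {1, 2, . . . , n}. It rests on two assumptions that are not in the text: the function names are hypothetical, and the number of element comparisons made by Insertion Sort stands in for the "number of primitive steps."

```python
from itertools import permutations

def insertion_sort_comparisons(values):
    """Run Insertion Sort on a copy of `values`; return the number of
    element comparisons (tests of A[j] < A[j-1]) performed."""
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1              # one test of a[j] < a[j-1]
            if a[j] < a[j - 1]:
                a[j], a[j - 1] = a[j - 1], a[j]
                j -= 1
            else:
                break
    return comparisons

def best_avg_worst(cost, n):
    """Min, mean, and max of `cost` over all n! orderings of 1..n
    (the uniform distribution of Definition 6.9)."""
    costs = [cost(p) for p in permutations(range(1, n + 1))]
    return min(costs), sum(costs) / len(costs), max(costs)

best, avg, worst = best_avg_worst(insertion_sort_comparisons, 5)
print(best, avg, worst)   # best is 4 (sorted input), worst is 10 (reverse-sorted)
```

Brute force is only feasible for tiny n, since the number of orderings grows as n!, but it is a handy way to check a conjectured formula before proving it.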
Best- and average-case analysis of sorting algorithms
We’ll close this section with the best- and average-case analyses of our three sorting
algorithms. (See Figure 6.18 for a reminder of the algorithms.)

case of Insertion Sort is virtually invisible along the x-axis. On the other hand, Figure 6.19(b) suggests that Selection Sort’s performance does not seem to depend very much on the structure of its input. Let’s analyze this algorithm formally:
Example 6.14 (Selection Sort, best- and average-case)
In Selection Sort (see Figure 6.18), the only effect of the input array’s structure is the number of times that line 5 is executed. (That’s why the reverse-sorted input tends to perform ever-so-slightly worse in Figure 6.19(b).) Thus the best- and average-case running time of Selection Sort is Θ(n^2), just like the worst-case running time established in Example 6.7.
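The claim in Example 6.14 can be checked empirically. The following is a rough Python sketch (the function name and the choice of what to count are illustrative assumptions, not part of the text): it counts the comparisons in line 4 of selectionSort and the updates to minIndex in line 5.

```python
def selection_sort_counts(values):
    """Selection Sort on a copy of `values`.
    Returns (sorted list, number of comparisons, number of minIndex updates)."""
    a = list(values)
    n = len(a)
    comparisons = 0
    updates = 0
    for i in range(n):
        min_index = i
        for j in range(i + 1, n):
            comparisons += 1              # line 4: A[j] < A[minIndex]?
            if a[j] < a[min_index]:
                min_index = j             # line 5: the only input-dependent step
                updates += 1
        a[i], a[min_index] = a[min_index], a[i]
    return a, comparisons, updates

for data in ([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], [4, 1, 6, 2, 5, 3]):
    print(data, selection_sort_counts(data)[1:])
```

For every input of size n = 6 the comparison count is 6 · 5/2 = 15; only the number of updates changes with the ordering.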
Figure 6.19(c) suggests that Bubble Sort’s performance varies only by a constant factor; indeed, the worst-, average-, and best-case running times are all Θ(n^2):
Example 6.15 (Bubble Sort, best- and average-case)
Again, the only difference in running time based on the structure of the input array is in how many times line 4 is executed (that is, how many swaps occur). (The number of swaps ranges between 0 for a sorted array and n(n − 1)/2 for a reverse-sorted array.) But line 3 is executed Θ(n^2) times in any case, and Θ(n^2) + 0 and Θ(n^2) + n^2 are both Θ(n^2).
More careful examination of Bubble Sort shows that we can improve the algorithm’s best-case performance without affecting the worst- and average-case performance asymptotically; see Exercise 6.65.
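As a sanity check on Example 6.15, this short Python sketch counts comparisons (line 3 of bubbleSort) and swaps (line 4). The function name is a hypothetical choice for illustration, and the code follows the basic bubbleSort, not the early-stopping variant.

```python
def bubble_sort_counts(values):
    """Basic Bubble Sort on a copy of `values`.
    Returns (sorted list, number of comparisons, number of swaps)."""
    a = list(values)
    n = len(a)
    comparisons = 0
    swaps = 0
    for i in range(1, n + 1):
        for j in range(n - i):            # 0-based version of j := 1 to n - i
            comparisons += 1              # line 3: A[j] > A[j + 1]?
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]   # line 4
                swaps += 1
    return a, comparisons, swaps

print(bubble_sort_counts([1, 2, 3, 4, 5, 6])[1:])   # (15, 0)
print(bubble_sort_counts([6, 5, 4, 3, 2, 1])[1:])   # (15, 15)
```

With n = 6, both inputs incur n(n − 1)/2 = 15 comparisons; the swap count ranges from 0 to 15, matching the range stated in the example.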
Taking it further: The tools from this chapter can be used to analyze the consumption of any resource by an algorithm. So far, the only resource that we have considered is time: how many primitive steps are used by the algorithm on a particular input? The other resource whose consumption is most commonly analyzed is the space used by the algorithm (that is, the amount of memory used by the algorithm).
As with time, we almost always consider the worst-case space use of the algorithm. See the discussion on p. 628 for more on the subfield of CS called computational complexity, which seeks to understand the resources required to solve any particular problem.
While time and space are the resources most frequently analyzed by complexity theorists, there are other resources that are interesting to track, too. For example, randomized algorithms “flip coins” as they run: that is, they make decisions about how to continue based on a randomly generated bit. Generating a truly random bit is expensive, and so we can view randomness itself as a resource, and try to minimize the number of random bits used. And, particularly in mobile processors, power consumption (and therefore the amount of battery life consumed, and the amount of heat generated) may be a more limiting resource than time or space. Thus energy can also be viewed as a resource that an algorithm might consume.[4]
[4] For some of the research from an architecture perspective on power-aware computing, see Stefanos Kaxiras and Margaret Martonosi. Computer Architecture Techniques for Power-Efficiency. Morgan Claypool, 2008.

Computer Science Connections
Time, Space, and Complexity
Computational complexity is the subfield of computer science devoted to the study of the resources required to solve computational problems. Computational complexity is the domain of the most important open question in all of computer science, the P-versus-NP problem. That problem is described elsewhere in this book (see p. 326), but here we’ll describe some of the basic entities that are studied by complexity theorists.
A complexity class is a set of problems that can be solved using a given constraint on resources consumed. Those resources are most typically the time or space used by an algorithm that solves the problem. For example, the complexity class EXPTIME includes precisely those problems solvable in exponential time, that is, O(2^(n^k)) time for some constant integer k.
One of the most important complexity classes is P, which denotes the set of all problems Π for which there is a polynomial-time algorithm A that solves Π. In other words,
Π ∈ P ⇔ there exists an algorithm A and an integer k ∈ Z≥0 such that A solves Π and the worst-case running time of A on an input of size n is O(n^k).
Although the practical efficiency of an algorithm that runs in time Θ(n^1000) is highly suspect, it has turned out that essentially any (non-contrived) problem that has been shown to be in P has actually also had a reasonably efficient algorithm, almost always O(n^5) or better. As a result, one might think of the entire subfield of CS devoted to algorithms as really being devoted to understanding what problems can be solved in polynomial time. (Of course, improving the exponent of the polynomial is always a goal!)
Other commonly studied complexity classes are defined in terms of the space (memory) that they use:
• PSPACE: problems solvable using a polynomial amount of space;
• L: problems solvable using O(log n) space (beyond the input itself); and
• EXPSPACE: problems solvable in exponential space.
While a great deal of effort has been devoted to complexity theory over the last half century, surprisingly little is known about how much time or space is actually required to solve problems, including some very important problems! It is reasonably easy to prove the relationships among the complexity classes shown in Figure 6.21, namely
L ⊆ P ⊆ PSPACE ⊆ EXPTIME ⊆ EXPSPACE.
Although the proofs are trickier, it has also been known since the 1960s that P ≠ EXPTIME (using the “time hierarchy theorem”), and that both L ≠ PSPACE and PSPACE ≠ EXPSPACE (using the “space hierarchy theorem”). But that’s just about all that we know about the relationship among these complexity classes! For example, for all we know L = P or P = PSPACE (but not both, because we do know that L ≠ PSPACE). These foundational complexity-theoretic questions remain open, awaiting the insights of a new generation of computer scientists![5]
Figure 6.21: A few complexity classes, and their relationships. [Nested regions, from outermost to innermost: EXPSPACE, EXPTIME, PSPACE, P, L.]
[5] For more, see any good textbook on computational complexity (also known as complexity theory). For example, Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012; and Christos H. Papadimitriou. Computational Complexity. Addison Wesley, 1994.

6.3.3 Exercises
A comparison-based sorting algorithm reorders its input array A[1 . . . n] with two fundamental operations:
• the comparison of a pair of elements (to determine which one is bigger); and
• the swap of a pair of elements (to exchange their positions in the array).
See Figure 6.22 for another reminder of three comparison-based sorting algorithms: Selection, Insertion, and Bubble Sorts. For each of the following problems, give an exact answer (not an asymptotic one), and prove your answer. For the worst-case input array of size n, how many comparisons are done by these algorithms?
6.55 selectionSort
6.56 insertionSort
6.57 bubbleSort
We’ll now turn to counting swaps. In these exercises, you should count as a “swap” the exchange of an element A[i] with itself. (So if i = minIndex in Line 6 of selectionSort, Line 6 still counts as performing a swap.) For the worst-case input array of size n, how many swaps are done by these algorithms?
6.58 selectionSort
6.59 insertionSort
6.60 bubbleSort
Repeat the previous exercises for the best-case input: that is, for the input array A[1 . . . n] on which the given algorithm performs the best, how many comparisons/swaps does the algorithm do? (If the best-case array for swaps is different from the best-case array for comparisons, say so and explain why, and analyze the number of comparisons/swaps in the two different “best” arrays.) In the best case, how many comparisons and how many swaps are done by these algorithms?
6.61 selectionSort
6.62 insertionSort
6.63 bubbleSort
Two variations of the basic bubbleSort algorithm are shown in Figure 6.23. In the next few exercises, you’ll explore whether they’re asymptotic improvements.
6.64 What’s the worst-case running time of early-stopping-bubbleSort?
6.65 Show that the best-case running time of early-stopping-bubbleSort is asymptotically better than the best-case running time of bubbleSort.
6.66 Show that the running time of forward-backward-bubbleSort on a reverse-sorted array A[1 . . . n] is Θ(n). (The reverse-sorted input is the worst case for both bubbleSort and early-stopping-bubbleSort.)
Prove that the worst-case running time of forward-backward-bubbleSort is . . .
6.67 . . . O(n^2).
6.68 . . . Ω(n^2) (despite the apparent improvement!). To prove this claim, explicitly describe an array A[1 . . . n] for which early-stopping-bubbleSort performs poorly (that is, in Ω(n^2) time) on both A and the reverse of A.
6.69 (programming required) Implement the three versions of Bubble Sort (including the two in Figure 6.23) in a programming language of your choice.
6.70 (programming required) Modify your implementations from Exercise 6.69 to count the number of swaps and comparisons each algorithm performs. Then run all three algorithms on each of the 8! = 40,320 different orderings of the elements {1, 2, . . . , 8}. How do the algorithms’ performances compare, on average?
Figure 6.22: Another reminder of the sorting algorithms.
selectionSort(A[1 . . . n]):
1: for i := 1 to n:
2:     minIndex := i
3:     for j := i + 1 to n:
4:         if A[j] < A[minIndex] then
5:             minIndex := j
6:     swap A[i] and A[minIndex]

insertionSort(A[1 . . . n]):
1: for i := 2 to n:
2:     j := i
3:     while j > 1 and A[j] < A[j − 1]:
4:         swap A[j] and A[j − 1]
5:         j := j − 1

bubbleSort(A[1 . . . n]):
1: for i := 1 to n:
2:     for j := 1 to n − i:
3:         if A[j] > A[j + 1] then
4:             swap A[j] and A[j + 1]
early-stopping-bubbleSort(A[1 . . . n]):
1: for i := 1 to n:
2:     swapped := False
3:     for j := 1 to n − i:
4:         if A[j] > A[j + 1] then
5:             swap A[j] and A[j + 1]
6:             swapped := True
7:     if swapped = False then
8:         return A
forward-backward-bubbleSort(A[1 . . . n]):
1: Construct R[1 . . . n], the reverse of A, where R[i] := A[n − i + 1] for each i.
2: for i := 1 to n:
3:     Run one iteration of lines 2–8 of early-stopping-bubbleSort on A.
4:     Run one iteration of lines 2–8 of early-stopping-bubbleSort on R.
5:     if either A or R is now sorted then
6:         return whichever is sorted
Figure 6.23: Bubble Sort, improved.

In Chapter 9, we will meet a sorting algorithm called Counting Sort that sorts an array A[1 . . . n] where each A[i] ∈ {1, 2, . . . , k} as follows: for each possible value x ∈ {1, 2, . . . , k}, we walk through A to compute c_x := |{i : A[i] = x}|. (We can compute all k values of c_1, . . . , c_k in a single pass through A.) The output array consists of c_1 copies of 1, followed by c_2 copies of 2, and so forth, ending with c_k copies of k. (See Figure 6.24.) Counting sort is particularly good when k is small.
6.71 In terms of n, what is the worst-case running time of countingSort on an input array of n letters from the alphabet (so k = 26, and n is arbitrary)?
6.72 (programming required) Implement Counting Sort and one of the Θ(n^2)-time sorting algorithms from this section. Collect some data to determine, on a particular computer, for what values of k you’d generally prefer Counting Sort over the Θ(n^2)-time algorithm when n = 4096 = 2^12 elements are each chosen uniformly at random from the set {1, 2, . . . , k}.
6.73 Radix Sort is a sorting algorithm based on Counting Sort that proceeds by repeatedly applying Counting Sort to the ith-most significant bit in the input integers, for increasing i. Do some online research to learn more about Radix Sort, then write pseudocode for Radix Sort and compare its running time (in terms of n and k) to Counting Sort.
In Example 5.14, we proved the correctness of Quick Sort, a recursive
sorting algorithm (see Figure 6.25 for a reminder, or Figure 5.20(a) for more
detail). The basic idea is to choose a pivot element of the input array A, then
partition A into those elements smaller than the pivot and those elements
larger than the pivot. We can then recursively sort the two “halves” and
paste them together, around the pivot, to produce a sorted version of A. The
algorithm performs very well if the two “halves” are genuinely about half the
size of A; it performs very poorly if one “half” contains almost all the elements
of A. The running time of the algorithm therefore hinges on how we select the
pivot, in Line 4. (A very good choice of pivot is actually a random element of
A, but here we’ll think only about deterministic rules for choosing a pivot.)
6.74 Suppose that we always choose pivotIndex := 1. (That is, the
first element of the array is the pivot value.) Describe (for an arbitrary
n) an input array A[1 . . . n] that causes quickSort under this pivot rule to make either less or greater empty.
6.75 Argue that, for the array you found in Exercise 6.74, the running time of Quick Sort is Θ(n^2).
6.76 Suppose that we always choose pivotIndex := ⌊n/2⌋. (That is, the middle element of the array is
the pivot value.) What input array A[1 . . . n] causes worst-case performance (that is, one of the two sides of the partition—less or greater—is empty) for this pivot rule?
6.77 A fairly commonly used pivot rule is called the Median of Three rule: we choose pivotIndex ∈ {1, ⌊n/2⌋, n} so that A[pivotIndex] is the median of the three values A[1], A[⌊n/2⌋], and A[n]. Argue that there is still an input array of size n that results in Ω(n^2) running time for Quick Sort.
6.78 Earlier we described a linear-search algorithm that looks for an element x in an array A[1 . . . n] by comparing x to A[i] for each i = 1, 2, . . . , n. (See Figure 6.16.) But if A is sorted, we can determine that x is not in A earlier, as shown in Figure 6.26: once we’ve passed where x “should” be, we know that it’s not in A. (Our original version omitted lines 4–5.) What is the worst-case running time of the early-stopping version of linear search?
6.79 Consider the algorithm in Figure 6.26 for counting the number of times the letter Z appears in a given string s. What is the worst-case running time of this algorithm on an input string of length n? Assume that testing whether Z is in s (line 2) and removing a letter from s (line 4) both take c · |s| time, for some constant c.
Figure 6.24: Counting Sort.
countingSort(A[1 . . . n]):
// assume each A[i] ∈ {1, 2, . . . , k}
1: for v := 1 to k:
2:     count[v] := 0
3: for i := 1 to n:
4:     count[A[i]] := count[A[i]] + 1
5: i := 1
6: for v := 1 to k:
7:     for t := 1 to count[v]:
8:         A[i] := v
9:         i := i + 1
quickSort(A[1 . . . n]):
1: if n ≤ 1 then
2:     return A
3: else
4:     Choose pivotIndex ∈ {1, . . . , n}, somehow.
5:     Let less (those elements smaller than A[pivotIndex]), same and greater be empty arrays.
6:     for i := 1 to n:
7:         compare A[i] to A[pivotIndex], and append A[i] to the appropriate array less, same, or greater.
8:     return quickSort(less) + same + quickSort(greater).
Figure 6.25: A high-level reminder of Quick Sort.
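For experimenting with the pivot rules in Exercises 6.74 through 6.77, it can help to have the high-level Quick Sort in runnable form with the pivot rule passed in as a parameter. The Python sketch below is one possible rendering under that assumption; the names quick_sort, first_element, and middle_element are hypothetical, and the pivot rules return 0-based indices.

```python
def quick_sort(a, choose_pivot_index):
    """Quick Sort in the style of Figure 6.25. `choose_pivot_index` takes a
    list of length >= 2 and returns a 0-based index into it."""
    if len(a) <= 1:
        return list(a)
    pivot = a[choose_pivot_index(a)]
    less, same, greater = [], [], []
    for x in a:                       # partition around the pivot value
        if x < pivot:
            less.append(x)
        elif x > pivot:
            greater.append(x)
        else:
            same.append(x)
    return (quick_sort(less, choose_pivot_index)
            + same
            + quick_sort(greater, choose_pivot_index))

def first_element(a):
    """Pivot rule pivotIndex := 1 (index 0 in 0-based Python)."""
    return 0

def middle_element(a):
    """Pivot rule pivotIndex := floor(n/2), shifted to a 0-based index."""
    return len(a) // 2 - 1

print(quick_sort([3, 1, 4, 1, 5, 9, 2, 6], first_element))
print(quick_sort([3, 1, 4, 1, 5, 9, 2, 6], middle_element))
```

Passing the rule as a function keeps the partitioning logic fixed, so any difference in behavior between runs is due to the pivot choice alone.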
early-stopping-linearSearch(A[1 . . . n], x):
1: for i := 1 to n:
2:     if A[i] = x then
3:         return True
4:     else if A[i] > x then
5:         return False
6: return False

countZ(s):
1: z := 0
2: while there exists i such that s_i = Z:
3:     z := z + 1
4:     remove s_i from s (that is, set s := s_1 . . . s_{i−1} s_{i+1} . . . s_n)
5: return z

Figure 6.26: Linear Search and counting ZZZs.

6.4 Recurrence Relations: Analyzing Recursive Algorithms
Democracy is the recurrent suspicion that more than half of the people are right more than half the time.
E. B. White (1899–1985)
The nonrecursive algorithms in Section 6.3 could be analyzed by simple counting and manipulation of summations. First we figured out the number of iterations of each loop, and then figured out how long each iteration takes. By summing this work over the iterations and simplifying the summation, we were able to compute the running time of the algorithm. Determining the running time of a recursive algorithm is harder. Instead of merely containing loops that can be analyzed as above, the algorithm’s running time on an input of size n depends on the same algorithm’s running time for inputs of size smaller than n.
We’ll use the classical recursive sorting algorithm Merge Sort (Figure 6.27) as an example. Merge Sort sorts an array by recursively sorting the first half, recursively sorting the second half, and finally “merging” the resulting sorted lists. (On an input array of size 1, Merge Sort just returns the array as is.) You’ll argue in Exercise 6.100 that merging two n/2-element arrays takes Θ(n) time, but what does that mean for the overall running time of Merge Sort?

mergeSort(A[1 . . . n]):
1: if n = 1 then
2:     return A
3: else
4:     L := mergeSort(A[1 . . . ⌊n/2⌋])
5:     R := mergeSort(A[⌊n/2⌋ + 1 . . . n])
6:     return merge(L, R)

Figure 6.27: Merge Sort. The merge function takes two sorted arrays and combines them into a single sorted array. (See Exercise 5.72 or 6.100.)

We can think about Merge Sort’s running time by drawing a picture of all of the work that is done in its execution, in the form of a recursion tree:
Definition 6.10 (Recursion tree)
The recursion tree for a recursive algorithm A is a tree that shows all of the recursive calls spawned by a call to A on an input of size n. Each node in the tree is annotated with the amount of work, aside from any recursive calls, done by that call.
Figure 6.28 shows the recursion tree for Merge Sort. For ease, we will assume that n is an exact power of 2. We denote by c · n the amount of time needed to process an n-element array aside from the recursive calls (that is, the time to split and merge).

Figure 6.28: The recursion tree for Merge Sort. The size of the input itself is shown in the shaded square node; the Θ(n) amount of time required for splitting and merging an n-element input is shown in the oval adjacent to that node, as c · n. [Tree diagram with 1 + log2 n levels: input sizes n, n/2, n/4, . . . , 2, 1, with per-node work c · n, c · n/2, c · n/4, . . . , c · 2, c · 1.]

There are many different ways to analyze the total amount of work done by Merge Sort on an n-element input array, but one of the easiest is to use the recursion tree:
Example 6.16 (Analyzing Merge Sort via recursion tree)
Problem: How quickly does Merge Sort run on an n-element input array? (Assume that n is a power of two.)
Solution: The total amount of work done by Merge Sort is precisely the sum of the circled values contained in the tree. (At the root, by definition the total work aside from the recursive calls is c · n; inductively, the work done in the recursive calls is the sum of the circled values in the left and right subtrees.)
The easiest way to sum up the work in the tree is to sum “row-wise.” (See Figure 6.29.) The first “row” of the tree (one call on an input of size n) generates cn work. The second row (two calls on inputs of size n/2) generates 2 · (cn/2) = cn work. The third row (four calls on inputs of size n/4) generates 4 · (cn/4) = cn work.
In general, row #k of the tree contains 2^(k−1) calls on inputs of size n/2^(k−1), and generates 2^(k−1) · c · n/2^(k−1) = cn work; that is, the work at the kth level of the tree is cn, independent of the value of k. There are 1 + log2 n rows in the tree, and so the total work in this tree is
\[ \sum_{k=1}^{1+\log_2 n} 2^{k-1} \cdot c \cdot \frac{n}{2^{k-1}} = \sum_{k=1}^{1+\log_2 n} cn = cn(1 + \log_2 n) \]
and thus is Θ(n log n) in total.
Taking it further: Here’s a different argument as to why Merge Sort requires Θ(n log n) time: every element of the input array is merged once in an array of size 1, once in an array of size 2, once in an array of size 4, once in an array of size 8, etc. So each element is merged log2 n times, and thus the total work is Θ(n · log2 n).

Figure 6.29: The row-wise sum of the tree in Figure 6.28. [Tree diagram with 1 + log2 n levels and row sums 1 · cn = cn, 2 · (cn/2) = cn, 4 · (cn/4) = cn, . . . , n · c = cn.]

6.4.1 Recurrence Relations
Recursion trees are an excellent way to gain intuition about the running time of a recursive algorithm, and to analyze it. We now turn to another way of thinking about recursion trees, which suggests a rigorous (and in many ways easier to use) approach to analyzing recursive algorithms: the recurrence relation. Because at least one of the steps in a recursive algorithm A is to call A on a smaller input, the running time of A on an input of size n depends on A’s running time for inputs of size smaller than n. We will therefore express A’s running time recursively, too:
Definition 6.11 (Recurrence relation)
A recurrence relation (sometimes simply called a recurrence) is a function T(n) that is defined (for some n) in terms of the values of T(k) for input values k < n.
A recurrence relation is called a recurrence relation because T recurs (“occurs again”) on the right-hand side of the equation. That’s the same reason that recursion is called recursion.
Here’s a first example, about compounding interest in a bank account:
Example 6.17 (Compound interest)
Suppose that, in year #0, Alice puts $1000 in a bank account that pays 2% annual compound interest. Writing A(n) to denote the balance of Alice’s account in year #n, we have
A(0) = 1000
A(n) = 1.02 · A(n − 1).
If Bob opens a bank account with the same interest rate, and deposits $10 into the account each year (starting in year #0), then Bob’s balance is given by the recurrence
B(0) = 10
B(n) = 1.02 · B(n − 1) + 10.
In computer science, the most common type of recurrence relation that we’ll encounter is one where T(n) denotes the worst-case number of steps taken by a particular recursive algorithm on an input of size n. Here are a few examples:
Example 6.18 (Factorial)
Let T(n) denote the worst-case running time of fact (Figure 6.30). Then:
T(1) = d
T(n) = T(n − 1) + c
where c is a constant denoting the work of the comparison–conditional–multiplication–return, and d is a constant denoting the work of the comparison–conditional–return.

fact(n):
1: if n = 1 then
2:     return 1
3: else
4:     return n · fact(n − 1)

Figure 6.30: A recursive algorithm for factorial.

Example 6.19 (Merge Sort)
Let T(n) denote the worst-case running time of Merge Sort (Figure 6.27) on an input array containing n elements. Then, for a constant c, we have:
T(1) = c
T(n) = T(⌊n/2⌋) + T(⌈n/2⌉) + cn.
Just as for nonrecursive algorithms, we will generally be interested in the asymptotic running times of these recursive algorithms, so we will usually not fret about the particular values of the constants in recurrences. We will often abuse notation and use a single constant to represent different Θ(1)-time operations, for example.
In Example 6.19, for instance, we are being sloppy in our recurrence, using a single variable c to represent two different values.
The use of one constant to have two different meanings (plus the ‘=’ sign) is an abuse of notation, but when we care about asymptotic values, this abuse doesn’t matter. We will even sometimes write 1 to stand for this constant. (See Exercise 6.126.)
Here’s another recurrence relation, for the recursive version of Binary Search:
Example 6.20 (Binary Search)
Let T(n) denote the worst-case running time of the recursive binarySearch (Figure 6.31) on an n-element array. Then:
\[ T(0) = c \qquad T(n) = \begin{cases} T\!\left(\frac{n}{2}\right) + c & \text{if } n \text{ is even} \\ T\!\left(\frac{n-1}{2}\right) + c & \text{if } n \text{ is odd.} \end{cases} \]

binarySearch(A[1 . . . n], x):
1: if n ≤ 0 then
2:     return False
3: middle := ⌊(1 + n)/2⌋
4: if A[middle] = x then
5:     return True
6: else if A[middle] > x then
7:     return binarySearch(A[1 . . . middle − 1], x)
8: else
9:     return binarySearch(A[middle + 1 . . . n], x)
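To see how the recurrence of Example 6.20 relates to the recursive binarySearch, the following Python sketch counts recursive calls and compares the count to the recurrence. Several choices here are assumptions made for illustration: the constant c is set to 1, the helper names are hypothetical, and index ranges replace the subarrays of the pseudocode so that no copying occurs.

```python
def binary_search_calls(a, x):
    """Recursive Binary Search on the sorted list `a`.
    Returns (found, number of calls made, including the first)."""
    def search(lo, hi):
        n = hi - lo                        # valid range is a[lo:hi]
        if n <= 0:
            return False, 1
        middle = lo + (1 + n) // 2 - 1     # floor((1+n)/2), shifted to 0-based
        if a[middle] == x:
            return True, 1
        if a[middle] > x:
            found, calls = search(lo, middle)
        else:
            found, calls = search(middle + 1, hi)
        return found, calls + 1
    return search(0, len(a))

def T(n, c=1):
    """The recurrence from Example 6.20, evaluated directly."""
    if n == 0:
        return c
    if n % 2 == 0:
        return T(n // 2, c) + c
    return T((n - 1) // 2, c) + c

for n in (1, 2, 3, 4, 7, 8, 1000):
    a = list(range(n))
    # Searching for a value larger than every element always recurses
    # into the right-hand side, which has floor(n/2) elements.
    print(n, binary_search_calls(a, n)[1], T(n))
```

For this particular unsuccessful search, the call count equals T(n) with c = 1, and the printed values grow by one each time n roughly doubles, which is consistent with logarithmic growth.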
Although our interest in recurrence relations will be almost exclusively about the running times of recursive algorithms, there are other interesting recurrence relations, too. The most famous of these is the recurrence for the Fibonacci numbers (which will turn out to have some interesting CS applications, too):
Figure 6.31: Binary Search, recursively.
Example 6.21 (Fibonacci numbers)
The Fibonacci numbers are defined by
f_1 = 1
f_2 = 1
f_n = f_{n−1} + f_{n−2}   for n ≥ 3.
The first several Fibonacci numbers are 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, . . . .
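The Fibonacci recurrence translates almost line for line into code. Here is a small Python sketch (the function name is a hypothetical choice) that builds the sequence iteratively from the two base cases:

```python
def fibonacci(n):
    """Return the list [f_1, f_2, ..., f_n] defined by the recurrence
    f_1 = f_2 = 1 and f_i = f_{i-1} + f_{i-2} for i >= 3."""
    fs = []
    for i in range(1, n + 1):
        if i <= 2:
            fs.append(1)                   # base cases
        else:
            fs.append(fs[-1] + fs[-2])     # recursive case
    return fs

print(fibonacci(11))   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```

Storing earlier values, instead of recomputing them recursively, means that each f_i is computed exactly once.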

6.4.2 Solving Recurrences: Induction
When we solve a recurrence relation, we find a closed-form (that is, nonrecursive) equivalent expression. Because recurrence relations are recursively defined quantities, induction is the easiest way to prove that a conjectured solution is correct. (The hard part is figuring out what solution to conjecture, as we’ll see.)
In the remainder of this section, we will solve all of the recurrences from Section 6.4.1, starting with Alice and Bob and their bank accounts:
Example 6.22 (Compound interest)
Recall the recurrences from Example 6.17:
A(0) = 1000    A(n) = 1.02 · A(n − 1)    (Alice)
B(0) = 10    B(n) = 1.02 · B(n − 1) + 10.    (Bob)
The recurrence for Alice is the easier of the two to solve: we can prove relatively straightforwardly by induction that A(n) = 1000 · 1.02^n for any n ≥ 1.
For Bob, the analysis is a little trickier. Here’s some intuition: at time n, Bob has had $10 sitting in his account since year #0 (earning interest for n years); $10 in his account since year #1 (earning interest for n − 1 years); etc. A $10 deposit that has accumulated interest for i years has, as with Alice, grown to 10 · 1.02^i. Thus the total amount of money in Bob’s account in year #n will be
\[ \sum_{i=0}^{n} \bigl(10 \cdot 1.02^{i}\bigr) = 10 \cdot \Bigl(\sum_{i=0}^{n} 1.02^{i}\Bigr) = 10 \cdot \frac{1.02^{n+1} - 1}{1.02 - 1} = 510 \cdot 1.02^{n} - 500 \]
where the second equality follows from Theorem 5.2 (the analysis of a geometric series). Let’s prove the property that B(n) = 510 · 1.02^n − 500, by induction on n:
base case (n = 0): Then B(0) = 10, and indeed 510 · 1.02^0 − 500 = 510 − 500 = 10.
inductive case (n ≥ 1): We assume the inductive hypothesis B(n − 1) = 510 · 1.02^(n−1) − 500; we must show that B(n) = 510 · 1.02^n − 500. Then:
B(n) = 1.02 · B(n − 1) + 10                                            definition of B(n)
     = 1.02 · (510 · 1.02^(n−1) − 500) + 10                            inductive hypothesis
     = 1.02 · 510 · 1.02^(n−1) − 1.02 · 500 + 10 = 510 · 1.02^n − 510 + 10    multiplying through
     = 510 · 1.02^n − 500,                                             simplifying
precisely as desired.
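The closed forms in Example 6.22 are easy to check numerically by iterating the two recurrences. The sketch below is a minimal illustration in Python; the function names are hypothetical, and floating-point values are compared with a tolerance rather than exactly.

```python
import math

def alice(n):
    """Iterate A(0) = 1000, A(n) = 1.02 * A(n - 1)."""
    balance = 1000.0
    for _ in range(n):
        balance = 1.02 * balance
    return balance

def bob(n):
    """Iterate B(0) = 10, B(n) = 1.02 * B(n - 1) + 10."""
    balance = 10.0
    for _ in range(n):
        balance = 1.02 * balance + 10
    return balance

for n in range(0, 31):
    # Compare the iterated values with the closed forms from the example.
    assert math.isclose(alice(n), 1000 * 1.02 ** n)
    assert math.isclose(bob(n), 510 * 1.02 ** n - 500)
print("closed forms agree for n = 0, ..., 30")
```

A check like this is not a proof (that is what the induction is for), but it catches algebra slips in a conjectured solution quickly.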
Taking it further: As Example 6.22 suggests, some familiar kinds of summations like arithmetic and geometric series can be expressed using recurrence relations. Other familiar summations can also
be expressed using recurrence relations; for example, the sum of the first n integers is given by the recurrence T(1) = 1 and T(n) = T(n − 1) + n. (See Section 5.2 for some closed-form solutions.)
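As a brief worked illustration of that last recurrence, unrolling T(1) = 1 and T(n) = T(n − 1) + n gives the familiar arithmetic series:

```latex
\begin{align*}
T(n) &= T(n-1) + n \\
     &= T(n-2) + (n-1) + n \\
     &= \cdots \\
     &= T(1) + 2 + 3 + \cdots + n \\
     &= \sum_{i=1}^{n} i \;=\; \frac{n(n+1)}{2}.
\end{align*}
```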

Factorial
One good way to generate a conjecture that we then prove correct by induction is
by “iterating” the recurrence: expand out a few layers of the recursion to see what the values of T(n) are for a few small values of n. We’ll illustrate this technique with the simplest recurrence from the last section, for the recursive factorial function.
Example 6.23 (Factorial)
Problem: Recall the recurrence from Example 6.18:
T(1) = d    T(n) = T(n − 1) + c.
Give an exact closed-form (nonrecursive) solution for T(n).
Solution: See Figure 6.32 for the recursion tree, which may help give some intuition.
Let’s iterate the recurrence a few times:
• T(1)=d
• T(2) = c+T(1) = c+d
• T(3) = c+T(2) = 2c+d
• T(4) = c+T(3) = 3c+d.
From these small values, we conjecture that T(n) = (n − 1)c + d.
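Before attempting a proof, it can be reassuring to test a conjecture like this one mechanically. The following Python sketch iterates the recurrence for sample constants and compares the result with the conjectured closed form; the function name and the particular values c = 3 and d = 7 are arbitrary choices made for illustration.

```python
def fact_steps(n, c, d):
    """Evaluate T(1) = d, T(n) = T(n - 1) + c by iterating upward from 1."""
    steps = d                      # T(1)
    for _ in range(2, n + 1):
        steps += c                 # T(k) = T(k - 1) + c
    return steps

c, d = 3, 7                        # arbitrary sample constants
for n in range(1, 21):
    assert fact_steps(n, c, d) == (n - 1) * c + d
print([fact_steps(n, c, d) for n in range(1, 5)])   # [7, 10, 13, 16]
```

Agreement on the first twenty values is only evidence, not a proof; the induction argument is what establishes the formula for every n.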
Let’s prove this conjecture correct by induction. For the base case (n = 1), we have T(1) = d by definition of the recurrence, which is 0 · c + d, as desired. For the inductive case, assume the inductive hypothesis T(n − 1) = (n − 2)c + d. We want to show that T(n) = (n − 1)c + d. Here’s the proof:
T(n) = T(n − 1) + c          by definition of the recurrence
     = (n − 2)c + d + c      by the inductive hypothesis
     = (n − 1)c + d.         by algebraic manipulation
Thus T(n) = (n − 1)c + d.

Figure 6.32: The (agonizingly simple) recursion tree for fact. [A single chain of n levels, for inputs n, n − 1, n − 2, . . . , 2, 1, with work c at every level except the last, which has work d.]

Problem-solving tip: Try iterating a recurrence to generate its first few values. Once we have a few values, we can often conjecture a general solution (which we then prove correct via induction).

Merge Sort
Recall the Merge Sort recurrence, where T(n) = T(⌈n/2⌉) + T(⌊n/2⌋) + cn and T(1) = c. It will be easier to address the case in which n is an exact power of 2 first (so that the floors and ceilings don’t complicate the picture), so we’ll start with that case first, and generalize later:
Example 6.24 (Merge Sort, for powers of 2)
Problem: Recall the Merge Sort recurrence from Example 6.19:
$$T(1) = c \qquad T(n) = T(\lceil n/2 \rceil) + T(\lfloor n/2 \rfloor) + cn.$$
For convenience, assume that $n$ is an exact power of two. Give an exact closed-form (nonrecursive) solution for $T(n)$.
Solution: Because $n$ is an exact power of two, we can write $n = 2^k$ for some $k \in \mathbb{Z}^{\ge 0}$. (Note that for $n = 2^k$ with $k \ge 1$ we have $\lceil n/2 \rceil = \lfloor n/2 \rfloor = n/2 = 2^{k-1}$.) Define $R(k) = T(2^k)$; then $R(0) = T(1) = c$ and $R(k) = T(2^k) = 2 \cdot T(2^{k-1}) + c \cdot 2^k = 2 \cdot R(k-1) + c \cdot 2^k$, so we can instead solve the recurrence
$$R(0) = c \qquad R(k) = 2 \cdot R(k-1) + c \cdot 2^k.$$
Problem-solving tip: A useful technique for solving recurrences is to do a variable substitution. If you can express the recurrence in terms of a different variable and solve the new recurrence easily, you can then substitute back into the original recurrence to solve it. Transforming an unfamiliar recurrence into a familiar one will make life easy!
Iterating $R$ a few times, we see
• $R(0) = c$
• $R(1) = c \cdot 2^1 + 2 \cdot R(0) = 4c$
• $R(2) = c \cdot 2^2 + 2 \cdot R(1) = 12c$
• $R(3) = c \cdot 2^3 + 2 \cdot R(2) = 32c$
We conjecture
$$R(k) = (1 + k) 2^k \cdot c. \qquad (∗)$$
(How might we get to this conjecture? The pattern from iterating $R$ matches it. Alternatively, looking at the recursion tree might help: there are $k + 1$ levels of the tree, and there are $2^i$ copies of $2^{k-i} \cdot c$ work in the $i$th row of the tree, so that’s $(k + 1) \cdot 2^{k-i} \cdot 2^i \cdot c = (k + 1) 2^k c$. Or, we’d expect a solution that’s the product of $\approx k$ and $\approx 2^k$ so that we get $T(n) \approx n \log n$. And if we check the $k = 0$ case, where $R(0) = c$, it looks like we’d better multiply by $k + 1$ rather than $k$.)
Let’s prove (∗), by induction on $k$. In the base case, $R(0) = c$ and indeed we have that $(1 + 0) 2^0 \cdot c = 1 \cdot 1 \cdot c$. In the inductive case, we have
$$\begin{aligned}
R(k) &= 2R(k-1) + c \cdot 2^k && \text{by definition of the recurrence} \\
&= 2(1 + k - 1) 2^{k-1} \cdot c + c \cdot 2^k && \text{by the inductive hypothesis} \\
&= 2k \cdot 2^{k-1} \cdot c + 2^k \cdot c \\
&= (k + 1) 2^k \cdot c.
\end{aligned}$$
Thus $R(k) = (k + 1) 2^k \cdot c$, completing the inductive case and the proof of (∗). Because we defined $R(k) = T(2^k)$, we can conclude that $T(n) = R(\log_2 n)$, by substituting. Thus $T(n) = (1 + \log_2 n) \cdot 2^{\log_2 n} \cdot c = n(1 + \log_2 n) \cdot c$.
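As a sanity check on the closed form from Example 6.24, here is a minimal Python sketch that evaluates the Merge Sort recurrence directly and compares it with $n(1 + \log_2 n) \cdot c$ for small powers of two. The constant `C = 3` and the function names are assumptions made for illustration only; any positive constant would do.

```python
from functools import lru_cache

C = 3  # hypothetical value for the constant c; any positive integer works


@lru_cache(maxsize=None)
def merge_sort_T(n: int) -> int:
    """T(1) = c and T(n) = T(ceil(n/2)) + T(floor(n/2)) + c*n."""
    if n == 1:
        return C
    return merge_sort_T((n + 1) // 2) + merge_sort_T(n // 2) + C * n


def closed_form(n: int) -> int:
    """n * (1 + log2 n) * c, for n an exact power of two."""
    k = n.bit_length() - 1  # k = log2(n) when n is a power of two
    if n != 1 << k:
        raise ValueError("n must be an exact power of two")
    return n * (1 + k) * C


# The recurrence and the closed form agree for n = 1, 2, 4, ..., 1024.
for k in range(11):
    assert merge_sort_T(2 ** k) == closed_form(2 ** k)
```

A check like this does not replace the inductive proof; it only guards against arithmetic slips in the conjecture.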
Thinking only about powers of two in Example 6.24 made our life simpler, but it leaves a hole in the analysis: what is the running time of Merge Sort when the input array’s length is not precisely a power of two? The more general analysis is actually simple, given the result we just derived:
Example 6.25 (Merge Sort, for general n)
Problem: Solve the Merge Sort recurrence (asymptotically), for any integer $n \ge 1$:
$$T(1) = c \qquad T(n) = T(\lceil n/2 \rceil) + T(\lfloor n/2 \rfloor) + cn.$$
Solution: We’ll use the fact that $T(n) \ge T(n')$ if $n \ge n'$; that is, $T$ is monotonic. (See Exercise 6.101.) So let $k$ be the nonnegative integer such that $2^k \le n < 2^{k+1}$. Then
$$\begin{aligned}
T(n) &\ge T(2^k) && \text{monotonicity} \\
&= ((\log_2 2^k) + 1) \cdot 2^k \cdot c && \text{Example 6.24} \\
&> (\log_2 \tfrac{n}{2} + 1) \cdot \tfrac{n}{2} \cdot c. && \text{definition of } k \text{: we have } \tfrac{n}{2} < 2^k
\end{aligned}$$
Thus we know $T(n) = \Omega(n \log n)$. Similarly,
$$\begin{aligned}
T(n) &\le T(2^{k+1}) && \text{monotonicity} \\
&= ((\log_2 2^{k+1}) + 1) \cdot 2^{k+1} \cdot c && \text{Example 6.24} \\
&\le (\log_2 2n + 1) \cdot 2n \cdot c. && \text{definition of } k \text{: we have } 2n \ge 2^{k+1}
\end{aligned}$$
Thus $T(n) = O(n \log n)$. Combining these facts yields that $T(n) = \Theta(n \log n)$.
Binary Search
There is a very simple intuitive argument for why Binary Search takes logarithmic time, which we used in Example 6.12: In the worst case, when the sought item $x$ isn’t in the array, we repeatedly compare $x$ to the middle of the valid range of the array, and halve the size of that valid range. We can halve an $n$-element range exactly $\log_2 n$ times, and thus the running time of Binary Search is logarithmic.
While this intuitive argument is plausible, there’s a subtle but nontrivial issue: the so-called “halving” in this description isn’t actually exactly halving. If there are $n$ elements in the valid range, then after comparing $x$ to the middle element of the range, we will end up with a valid range of size either $\frac{n}{2}$ or $\frac{n-1}{2}$, depending on the parity of $n$, and not exactly $\frac{n}{2}$. (We have already shown that Binary Search’s worst-case running time is $O(\log n)$, in Example 6.12, because if there are $n$ elements in the valid range, then after so-called halving we end up with a valid range of size at most $\frac{n}{2}$. The issue here is that we have not ruled out the possibility that the running time might be faster than $\Theta(\log n)$, because we’ve “better-than-halved” at every stage.)
We can resolve this issue by rigorously analyzing the correct recurrence relation, and we can prove that the running time is in fact $\Theta(\log n)$.
Example 6.26 (Binary Search)
Problem: Solve the Binary Search recurrence:
$$T(0) = 1 \qquad T(n) = \begin{cases} T(\frac{n}{2}) + 1 & \text{if } n \text{ is even} \\ T(\frac{n-1}{2}) + 1 & \text{if } n \text{ is odd.} \end{cases}$$
(Note that we’ve changed the additive constants to 1 instead of $c$; changing it back to $c$ would only have the effect of multiplying the entire solution by $c$.)
Problem-solving tip: When solving a new recurrence, we can try to generate conjectures (to prove correct via induction) by iterating the recurrence, drawing out the recursion tree, or by straight-up guessing a solution (or recognizing a similar pattern to previously seen recurrences). To generate my conjecture for Example 6.26, I actually wrote a program that implemented the recurrence. I ran the program for $n \in \{1, 2, \ldots, 1000\}$ and printed out the smallest integer $n$ for which $T(n) = 1$, then the smallest for which $T(n) = 2$, etc. (See Figure 6.33.) The conjecture followed from the observation that the breakpoints all happened at $n = 2^k - 1$ for an integer $k$.
Figure 6.33: A plot of $n$ versus $T(n)$ for the binary search recurrence. (Plateaus labeled in the plot: $T(128 \ldots 255) = 9$, $T(256 \ldots 511) = 10$, and $T(512 \ldots 1023) = 11$.)
Solution: We conjecture that $T(n) = \lfloor \log_2 n \rfloor + 2$ for all $n \ge 1$. We’ll prove the conjecture correct by strong induction on $n$.
For the base case ($n = 1$), we have $T(1) = T(0) + 1 = 1 + 1 = 2$ by definition of the recurrence, and indeed $2 = \lfloor 0 \rfloor + 2 = \lfloor \log_2 1 \rfloor + 2$.
For the inductive case ($n \ge 2$), assume the inductive hypothesis, that $T(k) = \lfloor \log_2 k \rfloor + 2$ for any $k$ with $1 \le k < n$. We’ll proceed in two cases:
• If $n$ is even:
$$\begin{aligned}
T(n) &= T(\tfrac{n}{2}) + 1 && \text{by definition of the recurrence} \\
&= \lfloor \log_2(\tfrac{n}{2}) \rfloor + 2 + 1 && \text{by the inductive hypothesis} \\
&= \lfloor (\log_2 n) - 1 \rfloor + 3 && \text{because } \log(\tfrac{a}{b}) = \log a - \log b \text{, and } \log_2 2 = 1 \\
&= \lfloor \log_2 n \rfloor + 2. && \text{because } \lfloor x + 1 \rfloor = \lfloor x \rfloor + 1
\end{aligned}$$
• If $n$ is odd:
$$\begin{aligned}
T(n) &= T(\tfrac{n-1}{2}) + 1 && \text{by definition of the recurrence} \\
&= \lfloor \log_2(\tfrac{n-1}{2}) \rfloor + 2 + 1 && \text{by the inductive hypothesis} \\
&= \lfloor \log_2(n - 1) \rfloor + 2 && \text{by the same manipulations as in the even case} \\
&= \lfloor \log_2 n \rfloor + 2. && \text{because } \lfloor \log_2(n - 1) \rfloor = \lfloor \log_2 n \rfloor \text{ for any odd integer } n > 1
\end{aligned}$$
Because we’ve shown that $T(n) = \lfloor \log_2 n \rfloor + 2$ in either case, we’ve proven the claim. Therefore $T(n) = \Theta(\log n)$.
As a general matter, the appearance of floors and ceilings inside a recurrence won’t matter to the asymptotic running time, nor will small additive adjustments inside the recursive term. For example, $T(n) = T(\lceil \frac{n}{2} \rceil) + 1$ and $T(n) = T(\lfloor \frac{n}{2} \rfloor - 2) + 1$ both have $T(n) = \Theta(\log n)$ solutions. Intuitively, floors and ceilings don’t change this type of recurrence because they don’t affect the total depth of the recursion tree by more than a $\Theta(1)$ number of calls, and a $\Theta(1)$ difference in depth is asymptotically irrelevant. Typically, understanding the running time for the “pure” version of the recurrence will give a correct understanding of the more complicated version. As such, we’ll often be sloppy in our notation, and write $T(n) = T(\frac{n}{2}) + 1$ when we really mean $T(\lfloor \frac{n}{2} \rfloor)$ or $T(\lceil \frac{n}{2} \rceil)$. (This abuse of notation is fairly common.)
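In the spirit of the problem-solving tip for Example 6.26, the following is a small Python sketch of a program that implements the Binary Search recurrence and compares it against the conjecture $T(n) = \lfloor \log_2 n \rfloor + 2$. It is not the author's original program; the function names are hypothetical, and `n.bit_length() - 1` is used as an exact integer stand-in for $\lfloor \log_2 n \rfloor$.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def binary_search_T(n: int) -> int:
    """T(0) = 1; T(n) = T(n/2) + 1 for even n, T((n-1)/2) + 1 for odd n."""
    if n == 0:
        return 1
    if n % 2 == 0:
        return binary_search_T(n // 2) + 1
    return binary_search_T((n - 1) // 2) + 1


def conjecture(n: int) -> int:
    """floor(log2 n) + 2, computed with exact integer arithmetic (n >= 1)."""
    return (n.bit_length() - 1) + 2


# The conjecture matches the recurrence on the range used in the text.
for n in range(1, 1001):
    assert binary_search_T(n) == conjecture(n)

# For each value v of T, record the largest n <= 1000 with T(n) = v.
last_n_with_value = {}
for n in range(1, 1001):
    last_n_with_value[binary_search_T(n)] = n
```

Every plateau that ends before the cutoff of 1000 ends at an $n$ of the form $2^k - 1$, which is the pattern that suggests the floor-of-logarithm conjecture.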
Taking it further: There’s a general theorem called the “sloppiness” theorem, which states conditions under which it is safe to ignore floors and ceilings in recurrence relations. (As long as we actually prove inductively that our conjectured solution to a recurrence relation is correct, it’s always fine in generating conjectures.) As a rough guideline, as long as $T(n)$ is monotonic ($n \le n' \Rightarrow T(n) \le T(n')$) and doesn’t grow too quickly ($T(n)$ is $O(n^k)$ for some constant $k$), then this “sloppiness” is fine. The details of the theorem, and its precise assumptions, are presented in many algorithms textbooks.
6.4.3 The Fibonacci Numbers
We’ll close with another example of a recurrence relation—the Fibonacci recurrence— that we will analyze using induction. But this time we will solve the recurrence exactly (that is, nonasymptotically):
Example 6.27 (The Fibonacci Numbers)
Problem: Recall the Fibonacci numbers, defined by the recurrence
$$f_1 = 1 \qquad f_2 = 1 \qquad f_n = f_{n-1} + f_{n-2}.$$
Prove that $f_n$ grows exponentially: that is, prove that there exist $a \in \mathbb{R}^{>0}$ and $r \in \mathbb{R}^{>1}$ such that $f_n \ge a r^n$.
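Brainstorming: Let’s start in the middle: suppose that we’ve somehow magically figured out values of $a$ and $r$ to make the base cases ($n \in \{1, 2\}$) work, and we’re in the middle of an inductive proof. (There are two base cases because $f_2 \ne f_1 + f_0$; $f_0$ isn’t even defined!) We’d be able to prove this:
$$f_n = f_{n-1} + f_{n-2} \ge a r^{n-1} + a r^{n-2} = a r^{n-2}(r + 1). \qquad \text{inductive hypothesis/algebra}$$
But what we want to prove is $f_n \ge a r^n$. So we’d be done if only $r + 1 = r^2$, that is, if $r^2 - r - 1 = 0$. But we get to pick the value of $r$ (!). Using the quadratic formula, we find that there are two solutions to this equation, which we’ll name $\varphi$ and $\hat{\varphi}$:
$$\varphi = \frac{1 + \sqrt{5}}{2} \qquad \hat{\varphi} = \frac{1 - \sqrt{5}}{2}.$$
Let’s use $r = \varphi$. To get the base cases to work, we would need to have $f_1 = 1 \ge a\varphi$ and $f_2 = 1 \ge a\varphi^2 = a(1 + \varphi)$. Because $1 + \varphi > \varphi$, the latter is the harder one to achieve. To ensure that $a(1 + \varphi) \le 1$, we must have
$$a \le \frac{1}{1 + \varphi} = \frac{1}{1 + \frac{1 + \sqrt{5}}{2}} = \frac{2}{3 + \sqrt{5}}.$$
Figure 6.34: Some brainstorming for Example 6.27.
Problem-solving tip: Sometimes starting in the middle of a proof helps! You still need to go back and connect the dots, but imagining that you’ve gotten somewhere may help you figure out how to get there.
Example 6.27 (The Fibonacci Numbers, continued)
Solution: Based on the brainstorming in Figure 6.34 (which identifies a value $\varphi$ such that $\varphi + 1 = \varphi^2$ and a corresponding value for $a$), we’ll prove the following claim:
Claim: $f_n \ge \frac{2}{3 + \sqrt{5}} \cdot \varphi^n$, where $\varphi = \frac{1 + \sqrt{5}}{2}$.
Proof (by strong induction on $n$). There are two base cases:
• For $n = 1$, we have $\frac{2}{3 + \sqrt{5}} \cdot \varphi = \frac{2}{3 + \sqrt{5}} \cdot \frac{1 + \sqrt{5}}{2} = \frac{1 + \sqrt{5}}{3 + \sqrt{5}} < 1 = f_1$.
• For $n = 2$, we have
$$\begin{aligned}
\frac{2}{3 + \sqrt{5}} \cdot \varphi^2 &= \frac{2}{3 + \sqrt{5}} \cdot (1 + \varphi) && \text{we chose } \varphi \text{ so that } \varphi + 1 = \varphi^2 \\
&= \frac{2}{3 + \sqrt{5}} \cdot \frac{3 + \sqrt{5}}{2} = 1 = f_2.
\end{aligned}$$
For the inductive case ($n \ge 3$), we assume the inductive hypothesis, namely that $f_k \ge \frac{2}{3 + \sqrt{5}} \cdot \varphi^k$ for $1 \le k \le n - 1$. Then:
$$\begin{aligned}
f_n &= f_{n-1} + f_{n-2} && \text{definition of the Fibonaccis} \\
&\ge \frac{2}{3 + \sqrt{5}} \cdot \varphi^{n-1} + \frac{2}{3 + \sqrt{5}} \cdot \varphi^{n-2} && \text{inductive hypothesis, twice} \\
&= \frac{2}{3 + \sqrt{5}} \cdot \varphi^{n-2} \cdot (\varphi + 1) && \text{factoring} \\
&= \frac{2}{3 + \sqrt{5}} \cdot \varphi^{n-2} \cdot \varphi^2 && \text{we chose } \varphi \text{ so that } \varphi + 1 = \varphi^2 \\
&= \frac{2}{3 + \sqrt{5}} \cdot \varphi^n.
\end{aligned}$$
Therefore the claim follows by induction.
Taking it further: The value $\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.61803\cdots$ is called the golden ratio. It has a number of interesting characteristics, including both remarkable mathematical and aesthetic properties. For example, a rectangle whose side lengths are in the ratio $\varphi$-to-1 can be divided into a square and a rectangle whose side lengths are in the ratio 1-to-$\varphi$. That’s because, for these rectangles to have the same ratios, we need $\frac{\varphi}{1} = \frac{1}{\varphi - 1}$; that is, we need $\varphi(\varphi - 1) = 1$, which means $\varphi^2 - \varphi = 1$. (See Figure 6.35.) The golden ratio, it has been argued, describes proportions in famous works of art ranging from the Acropolis to Leonardo da Vinci’s drawings.
Figure 6.35: Some golden rectangles. (a) A rectangle with sides in ratio $\varphi$-to-1, with a 1-by-1 square inscribed. (b) Repeatedly inscribing a square in the “leftover” rectangle. (c) The same rectangles, rotated and shifted to share a lower-left corner.
A closed-form formula for the Fibonaccis
While Example 6.27 establishes a lower bound on the Fibonacci numbers (in asymptotic notation, it proves that $f_n = \Omega(\varphi^n)$), we have not yet established a closed-form solution for the $n$th Fibonacci number. Here’s a solution that does so, based on the following ideas. The trick will be to make use of $\hat{\varphi}$. The inductive case would go through perfectly, just as in Example 6.27, if we tried to prove $f_n = a\varphi^n + b\hat{\varphi}^n$, for constants $a$ and $b$. But what about the base cases? For $f_1$, we would need $1 = a\varphi + b\hat{\varphi}$; for $f_2$, we would need $1 = a\varphi^2 + b\hat{\varphi}^2 = a(1 + \varphi) + b(1 + \hat{\varphi})$. That’s two linear equations with two unknowns, and some algebra will reveal that $a = \frac{1}{\sqrt{5}}$ and $b = \frac{-1}{\sqrt{5}}$ solves these equations.
Let’s use these ideas to give a closed-form solution for the Fibonaccis, and a proof:
Example 6.28 (A closed-form solution for the Fibonaccis)
Problem: Prove the following claim:
Claim: $f_n = \frac{\varphi^n - \hat{\varphi}^n}{\sqrt{5}}$, where $\varphi = \frac{1 + \sqrt{5}}{2}$ and $\hat{\varphi} = \frac{1 - \sqrt{5}}{2}$.
Solution: Proof (by strong induction on $n$). For the base cases ($n = 1$ and $n = 2$):
• For $n = 1$, we have
$$\begin{aligned}
\frac{\varphi^1 - \hat{\varphi}^1}{\sqrt{5}} &= \frac{\frac{1 + \sqrt{5}}{2} - \frac{1 - \sqrt{5}}{2}}{\sqrt{5}} && \text{definition of } \varphi \text{ and } \hat{\varphi} \\
&= \frac{\frac{2\sqrt{5}}{2}}{\sqrt{5}} = 1 && \text{algebra} \\
&= f_1.
\end{aligned}$$
• For $n = 2$, we have that
$$\begin{aligned}
\frac{\varphi^2 - \hat{\varphi}^2}{\sqrt{5}} &= \frac{1 + \varphi - (1 + \hat{\varphi})}{\sqrt{5}} && \varphi^2 = 1 + \varphi \text{ and } \hat{\varphi}^2 = 1 + \hat{\varphi} \\
&= \frac{\varphi - \hat{\varphi}}{\sqrt{5}} = 1 && \text{by the previous case} \\
&= f_2.
\end{aligned}$$
For the inductive case ($n \ge 3$), we assume the inductive hypothesis: for any $k < n$, we have $f_k = \frac{\varphi^k - \hat{\varphi}^k}{\sqrt{5}}$. Then:
$$\begin{aligned}
f_n &= f_{n-1} + f_{n-2} && \text{definition of the Fibonaccis} \\
&= \frac{\varphi^{n-1} - \hat{\varphi}^{n-1}}{\sqrt{5}} + \frac{\varphi^{n-2} - \hat{\varphi}^{n-2}}{\sqrt{5}} && \text{inductive hypothesis} \\
&= \frac{\varphi^{n-2}(\varphi + 1) - \hat{\varphi}^{n-2}(\hat{\varphi} + 1)}{\sqrt{5}} && \text{factoring} \\
&= \frac{\varphi^{n-2} \cdot \varphi^2 - \hat{\varphi}^{n-2} \cdot \hat{\varphi}^2}{\sqrt{5}} && \varphi + 1 = \varphi^2 \text{ and } \hat{\varphi} + 1 = \hat{\varphi}^2 \\
&= \frac{\varphi^n - \hat{\varphi}^n}{\sqrt{5}}.
\end{aligned}$$
[…] $b > 1$, $c > 0$, and $k \ge 0$.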
Why do these recurrences come up frequently? Consider a recursive algorithm that
has the following structure: if the input is small—say, n = 1—then we compute the solution directly; otherwise, to solve an instance of size n:
• we make $a$ different recursive calls on inputs of size $\frac{n}{b}$; and
• to construct the smaller instances and then to reconstruct the solution to the given instance from the recursive solutions, we spend $\Theta(n^k)$ time.
(These algorithms are usually called divide-and-conquer algorithms: they “divide” their input into $a$ pieces, and then recursively “conquer” those subproblems.) To be precise, the recurrence often has ceilings and floors as part of its recursive calls, but for now assume that $n$ is an exact power of $b$, so that the floors and ceilings don’t matter.
Here are a few examples of recursive algorithms with recurrences of this form:
Example 6.29 (Binary Search)
We spend c = Θ(1) time to compare the sought element to the middle of the range; we then make one recursive call to search for the element in the appropriate half of the array. If n is an exact power of two, then the recurrence is
$$T(n) = T(\tfrac{n}{2}) + c.$$
(So $a = 1$, $b = 2$, and $k = 0$, because $c = c \cdot 1 = c \cdot n^0$.)
Example 6.30 (Merge Sort)
We spend Θ(1) time to divide the array in half. We make two recursive calls on the left and right subarrays, and then spend Θ(n) time to merge the resulting sorted subarrays into a single sorted array. If n is an exact power of two, then the recurrence is
$$T(n) = 2T(\tfrac{n}{2}) + c \cdot n.$$
(So $a = 2$, $b = 2$, and $k = 1$.)

Figure 6.44: The recursion tree for a recurrence relation $T(n) = aT(\frac{n}{b}) + c \cdot n^k$, of the master method’s form. Assume that $n$ is an exact power of $b$. (The root has input size $n$; each of its $a$ children has input size $\frac{n}{b}$; each of their children has input size $\frac{n}{b^2}$; and so on, down to leaves with input size 1.)
6.5.1 The Master Method: Some Intuition
The Master Method is a technique that allows us to solve any recurrence relation of the form $T(n) = aT(\frac{n}{b}) + c \cdot n^k$ very easily. The Master Method is based on examining the recursion tree for this recurrence (see Figure 6.44), and the Master Theorem (Theorem 6.10) that describes the total amount of work represented by this tree.
Here’s the intuition for the Master Method. Let’s think about the ith level of the recursion tree (again, see Figure 6.44)—in other words, the work done by the recursive calls that are i levels beneath the root of the recursion tree. Observe the following:
There are $a^i$ different calls at level $i$. There is $1 = a^0$ call at the 0th level, then $a = a^1$ calls at the 1st level, then $a^2$ calls at the 2nd level, and so forth.
Each of the calls at the $i$th level operates on an input of size $\frac{n}{b^i}$. The input size is $n = \frac{n}{b^0}$ at the 0th level, then $\frac{n}{b}$ at the 1st level, then $\frac{n}{b^2}$ at the 2nd, and so forth.
Thus the total amount of work in the $i$th level of the tree is $a^i \cdot c \cdot (\frac{n}{b^i})^k$. Or, simplifying, the total work at this level is $cn^k \cdot (\frac{a}{b^k})^i$.
Thus the total amount of work contained within the entire tree is
$$\sum_i \left[ cn^k \cdot \left( \frac{a}{b^k} \right)^i \right] = cn^k \cdot \sum_i \left( \frac{a}{b^k} \right)^i. \qquad (∗)$$
(We’ll worry about the bounds on the summation later.)
Note that (∗) expresses the total work in the recursion tree as a geometric sum $\sum_i r^i$, in which the ratio between terms is given by $r := \frac{a}{b^k}$. (See Section 5.2.2.) As with any geometric sum, the critical question is how the ratio compares to 1: if $r < 1$, then the terms of the sum are getting smaller and smaller as $i$ increases; if $r > 1$, then the terms of the sum are getting bigger and bigger as $i$ increases. (And if $r = 1$, then each term is simply equal to 1.)
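To make the level-by-level accounting concrete, here is a short Python sketch that lists the work at each level of the recursion tree, $a^i \cdot c \cdot (n/b^i)^k$, and checks that the levels sum to the value of the recurrence. The function names `level_work` and `total_work` are hypothetical, and the sketch assumes that $n$ is an exact power of $b$ and that $a$, $b$, $c$, and $k$ are integers.

```python
from fractions import Fraction


def level_work(a: int, b: int, k: int, c: int, n: int) -> list:
    """Work at levels 0, 1, ..., log_b(n): a^i calls, each costing c * (n / b^i)^k."""
    levels = []
    size, calls = Fraction(n), 1
    while size >= 1:
        levels.append(calls * c * size ** k)
        size /= b   # inputs shrink by a factor of b at each level
        calls *= a  # the number of calls grows by a factor of a
    return levels


def total_work(a: int, b: int, k: int, c: int, n: int) -> int:
    """T(1) = c and T(n) = a * T(n / b) + c * n^k, for n an exact power of b."""
    if n == 1:
        return c
    return a * total_work(a, b, k, c, n // b) + c * n ** k


# Ratio a / b^k is 2, 1, and 1/2, respectively: growing, constant, shrinking.
growing = level_work(2, 2, 0, 1, 8)
constant = level_work(2, 2, 1, 1, 8)
shrinking = level_work(2, 2, 2, 1, 8)
assert sum(shrinking) == total_work(2, 2, 2, 1, 8)
```

Consecutive entries of each list differ by exactly the factor $r = a / b^k$, which is why the comparison of $r$ with 1 decides which end of the tree dominates.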
The Master Theorem has three cases, each of which corresponds to one of these three natural cases for the summation in (∗): its terms increase exponentially with i, its
6.5. RECURRENCERELATIONS:THEMASTERMETHOD 649
terms decrease exponentially with i, or its terms are constant with respect to i. In these cases, respectively, almost all of the work is done at the leaves of the tree; almost all of the work is done at the root of the tree; or the work is spread evenly across the levels of the tree. (Here “almost all the work” means “a constant fraction of the work,” which means that the total work in the tree is asymptotically equivalent to the work done solely at the root or at the leaves.)
A trio of examples
Before we prove the general theorem, we’ll solve a few recurrences that illustrate
the cases of the Master Method, and then we’ll prove the result in general. The three example recurrences are
$$T(n) = 2T(\tfrac{n}{2}) + 1$$
$$T(n) = 2T(\tfrac{n}{2}) + n$$
$$\text{and} \quad T(n) = 2T(\tfrac{n}{2}) + n^2,$$
all with $T(1) = 1$. Figure 6.45 shows the recursion trees for these recurrences.
In each of these recurrences, we divide the input by two at every level of the recursion. Thus, the total depth of the recursion tree is $\log_2 n$. (Assume $n$ is an exact power of two.) In the recursion tree for any one of these recurrences, consider the $i$th level of the tree beneath the root. (The root of the recursion tree has depth 0.) We have divided $n$ by 2 a total of $i$ times, and thus the input size at that level is $\frac{n}{2^i}$. Furthermore, there are $2^i$ different calls at the $i$th level of the tree.
Solving the three recurrences
To solve each recurrence, we will sum the total amount of work generated at each
level of the tree. The recursion trees for each of these three recurrences are shown in Figures 6.46, 6.47, and 6.48.
Figure 6.45: The recursion trees for three different recurrences: $T(n) = 2T(\frac{n}{2}) + f(n)$, for $f(n) \in \{1, n, n^2\}$. The annotation in each row of the tree shows both the number of calls at that level of the tree, plus the additional work done by each call at that level. (Level $i$ has $2^i$ calls, each doing 1 or $\frac{n}{2^i}$ or $(\frac{n}{2^i})^2$ work; the bottom level has $n$ calls, each doing 1 unit of work.)
Figure 6.46: The recursion tree for $T(n) = 2T(\frac{n}{2}) + 1$, with the “row-wise” sums of work. The work at each level is twice the work at the level above it; thus the work is increasing exponentially at each level of the tree. (Row sums: $1 \cdot 1 = 1$, $2 \cdot 1 = 2$, $4 \cdot 1 = 4$, …, $2^i \cdot 1 = 2^i$, …, $2^{\log_2 n} \cdot 1 = n$.)
Figure 6.47: The recursion tree for $T(n) = 2T(\frac{n}{2}) + n$. The work at each level is exactly $n$; thus the work is constant across the levels of the tree. (Row sums: $1 \cdot \frac{n}{1} = n$, $2 \cdot \frac{n}{2} = n$, $4 \cdot \frac{n}{4} = n$, …, $2^i \cdot \frac{n}{2^i} = n$, …, $2^{\log_2 n} \cdot 1 = n$.)
Figure 6.48: The recursion tree for $T(n) = 2T(\frac{n}{2}) + n^2$. The work at each level is half of the work at the level above it; thus the work is decreasing exponentially at each level of the tree. (Row sums: $1 \cdot (\frac{n}{1})^2 = \frac{n^2}{1}$, $2 \cdot (\frac{n}{2})^2 = \frac{n^2}{2}$, $4 \cdot (\frac{n}{4})^2 = \frac{n^2}{4}$, …, $2^i \cdot (\frac{n}{2^i})^2 = \frac{n^2}{2^i}$, …, $2^{\log_2 n} \cdot 1 = \frac{n^2}{n}$.)
Example 6.31 (Solving $T(n) = 2T(\frac{n}{2}) + 1$)
Figure 6.46 shows the recursion tree for this recurrence. There are $2^i$ different calls at the $i$th level, each of which is on an input of size $\frac{n}{2^i}$, and we do 1 unit of work for each of these $2^i$ calls. Thus the total amount of work at level $i$ is $2^i$. The total amount of work in the entire tree is therefore
$$T(n) = \sum_{i=0}^{\log_2 n} 2^i = \frac{2^{1 + \log_2 n} - 1}{2 - 1} = 2 \cdot 2^{\log_2 n} - 1 = 2n - 1$$
by Theorem 5.2. And, indeed, $T(n) = \Theta(n)$.
Example 6.32 (Solving $T(n) = 2T(\frac{n}{2}) + n$)
Figure 6.47 shows the recursion tree. There are $2^i$ calls at the $i$th level of the recursion tree, on inputs of size $\frac{n}{2^i}$. We do $\frac{n}{2^i}$ units of work at each call, so the total work at the $i$th level is $2^i \cdot (\frac{n}{2^i}) = n$. Note that the amount of work at level $i$ is independent of the level $i$. The total amount of work in the tree is therefore
$$T(n) = \sum_{i=0}^{\log_2 n} \underbrace{n}_{\text{work at level } i} = n \cdot \sum_{i=0}^{\log_2 n} 1 = n(1 + \log_2 n) = \Theta(n \log n).$$
Example 6.33 (Solving $T(n) = 2T(\frac{n}{2}) + n^2$)
Figure 6.48 shows the recursion tree. There are $2^i$ calls at the $i$th level of the tree, and we do $(\frac{n}{2^i})^2$ work at each call at this level. Thus the work represented by the $i$th row of the recursion tree is $(\frac{n}{2^i})^2 \cdot 2^i = \frac{n^2}{2^i}$. The total amount of work in the tree is therefore
$$T(n) = \sum_{i=0}^{\log_2 n} (\tfrac{1}{2})^i n^2 = n^2 \cdot \sum_{i=0}^{\log_2 n} (\tfrac{1}{2})^i.$$
Notice that $\sum_{i=0}^{\log_2 n} (\frac{1}{2})^i = 1 + \frac{1}{2} + \frac{1}{4} + \cdots + \frac{1}{2^{\log_2 n}}$, which is certainly at least 1. But, by the fact that $1 + \frac{1}{2} + \frac{1}{4} + \cdots + \frac{1}{2^\ell} < 2$ (see Theorem 5.2), we also know $\sum_{i=0}^{\log_2 n} (\frac{1}{2})^i \le 2$. Therefore $n^2 \le T(n) \le 2n^2$, which allows us to conclude that $T(n) = \Theta(n^2)$.
6.5.2 The Master Method: The Formal Statement and a Proof
Examples 6.31, 6.32, and 6.33 were designed to build the necessary intuition about the three different cases of the master method: work increases exponentially across levels of the recursion tree; work stays constant across levels; or work decreases exponentially across levels. Precisely the same intuition will yield the proof of the Master Theorem. Here is the formal statement of the Master Theorem, which generalizes the idea of these examples to all recurrences of the form $T(n) = aT(\frac{n}{b}) + cn^k$:
Theorem 6.10 (Master Theorem)
Consider the recurrence
$$T(1) = c \qquad T(n) = a \cdot T(n/b) + c \cdot n^k$$
for constants $a \ge 1$, $b > 1$, $c > 0$, and $k \ge 0$. Then:
Case (i), “the leaves dominate”: if $b^k < a$, then $T(n) = \Theta(n^{\log_b a})$.
Case (ii), “all levels are equal”: if $b^k = a$, then $T(n) = \Theta(n^k \cdot \log n)$.
Case (iii), “the root dominates”: if $b^k > a$, then $T(n) = \Theta(n^k)$.
(As we discussed previously, we are abusing notation by using $c$ to denote two different constants in this theorem statement. Again, as you’ll prove in Exercise 6.126, the recurrence with $T(1) = d$ for a constant $d > 0$ possibly different than $c$ has precisely the same asymptotic solution.)
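The case analysis in Theorem 6.10 is mechanical enough to express as a small function. The sketch below is an illustration only: the function name `master_theorem` and its return format (case label, exponent of $n$, and whether a $\log n$ factor appears) are assumptions made here, not anything defined in the text.

```python
from math import log2


def master_theorem(a: float, b: float, k: float):
    """Classify T(n) = a*T(n/b) + c*n^k per Theorem 6.10.

    Returns (case, exponent, has_log_factor), meaning that
    T(n) = Theta(n^exponent), times log n if has_log_factor is True.
    """
    if a < 1 or b <= 1 or k < 0:
        raise ValueError("need a >= 1, b > 1, and k >= 0")
    if b ** k < a:
        return ("i", log2(a) / log2(b), False)  # leaves dominate: n^(log_b a)
    if b ** k == a:
        return ("ii", k, True)                  # all levels equal: n^k log n
    return ("iii", k, False)                    # root dominates: n^k


print(master_theorem(2, 2, 0))  # the recurrence T(n) = 2T(n/2) + 1
print(master_theorem(2, 2, 1))  # the recurrence T(n) = 2T(n/2) + n
print(master_theorem(2, 2, 2))  # the recurrence T(n) = 2T(n/2) + n^2
```

The exact comparison `b ** k == a` is reliable for small integer parameters like those in this section; with non-integer `b` or `k`, floating-point rounding could misclassify a borderline case, so a tolerance would be needed there.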
Proving the theorem
While the Master Theorem holds even when the input $n$ is not an exact power of $b$ (we just have to fix the recurrence by adding floors or ceilings so that it still makes sense), we will prove the result for exact powers of $b$ only.7 We will show that the total amount of work contained in the recursion tree is
$$T(n) = cn^k \cdot \sum_{i=0}^{\log_b n} \left( \frac{a}{b^k} \right)^i. \qquad (†)$$
7 A full proof of the Master Theorem, including for the case when $n$ is not an exact power of $b$, can be found in Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
As before, the formula (†) should make intuitive the fact that $a = b^k$ (that is, $\frac{a}{b^k} = 1$) is the critical value. The value of $\frac{a}{b^k}$ corresponds to whether the work at each level of the tree is increasing ($\frac{a}{b^k} > 1$), steady ($\frac{a}{b^k} = 1$), or decreasing ($\frac{a}{b^k} < 1$). The summation in (†) is a geometric sum, and as we saw in Chapter 5 geometric sums behave fundamentally differently based on whether their ratio is less than, equal to, or greater than one.
Proof of Theorem 6.10 (for $n$ an exact power of $b$). For all three cases, we begin by examining the recursion tree (Figure 6.44). Summing the total amount of work in the tree “row-wise,” we see that there are $a^i$ nodes at the $i$th level of the tree (where, again, the root is at level zero), each of which corresponds to an input of size $n/b^i$ and therefore contributes $c \cdot (n/b^i)^k$ work to the total. The tree continues until the inputs are of size 1, that is, until $n/b^i = 1$, or when $i = \log_b n$. Thus the total amount of work in the tree is
$$T(n) = \sum_{i=0}^{\log_b n} a^i \cdot c \cdot \left( \frac{n}{b^i} \right)^k = cn^k \sum_{i=0}^{\log_b n} \left( \frac{a}{b^k} \right)^i.$$
(See the note at the end of this proof for another justification for this summation, or see Exercise 6.127.) We’ll examine this summation in each of the three cases, depending on the value of $\frac{a}{b^k}$, and we’ll handle the cases in order of ease, rather than in numerical order:
Case (ii): If $a = b^k$, then (†) says that
$$T(n) = cn^k \sum_{i=0}^{\log_b n} \left( \frac{a}{b^k} \right)^i = cn^k \sum_{i=0}^{\log_b n} 1 = cn^k (1 + \log_b n).$$
Thus the total work is $\Theta(n^k \log n)$.
Case (iii): If $a < b^k$, then (†) is a geometric sum whose ratio is strictly less than 1. Corollary 5.3 states that any geometric sum whose ratio is strictly between 0 and 1 is $\Theta(1)$. (Namely, the summation $\sum_{i=0}^{\log_b n} (\frac{a}{b^k})^i$ is lower-bounded by 1 and upper-bounded by $\frac{1}{1 - a/b^k}$, both of which are positive constants when $a < b^k$.) Therefore:
$$T(n) = cn^k \sum_{i=0}^{\log_b n} \left( \frac{a}{b^k} \right)^i = cn^k \cdot \Theta(1). \qquad \text{by Corollary 5.3}$$
Therefore the total work is $\Theta(n^k)$.
Case (i): If $a > b^k$, then (†) is a geometric sum whose ratio is strictly larger than one. But we can make this summation look more like Case (iii), using a little algebraic manipulation. Notice that, for any $\alpha \ne 0$, we can rewrite $\sum_{i=0}^{m} \alpha^i$ as follows:
$$\sum_{i=0}^{m} \alpha^i = \alpha^m \cdot \sum_{i=0}^{m} \alpha^{i-m} = \alpha^m \cdot \sum_{i=0}^{m} \left( \frac{1}{\alpha} \right)^{m-i} = \alpha^m \cdot \sum_{j=0}^{m} \left( \frac{1}{\alpha} \right)^j \qquad (‡)$$
where the last equality follows by reindexing the summation (so that we set $j = m - i$). Applying this manipulation to (†), we have
$$\begin{aligned}
T(n) &= cn^k \sum_{i=0}^{\log_b n} \left( \frac{a}{b^k} \right)^i && \text{by (†)} \\
&= cn^k \cdot \left( \frac{a}{b^k} \right)^{\log_b n} \cdot \sum_{j=0}^{\log_b n} \left( \frac{b^k}{a} \right)^j && \text{by (‡)} \\
&= n^k \cdot \left( \frac{a}{b^k} \right)^{\log_b n} \cdot \Theta(1) && \text{Corollary 5.3, because } \tfrac{b^k}{a} < 1 \\
&= n^k \cdot \frac{a^{\log_b n}}{(b^k)^{\log_b n}} \cdot \Theta(1) \\
&= n^k \cdot \frac{a^{\log_b n}}{n^k} \cdot \Theta(1) && (b^k)^{\log_b n} = b^{k \log_b n} = b^{\log_b (n^k)} = n^k \\
&= a^{\log_b n} \cdot \Theta(1).
\end{aligned}$$
Therefore the total work is $\Theta(a^{\log_b n})$. And $a^{\log_b n} = n^{\log_b a}$, which we can verify by log manipulations:
$$a^{\log_b n} = b^{\log_b [a^{\log_b n}]} = b^{[\log_b n] \cdot [\log_b a]} = b^{[\log_b a] \cdot [\log_b n]} = b^{\log_b [n^{\log_b a}]} = n^{\log_b a}.$$
Therefore the total work in this case is $\Theta(a^{\log_b n}) = \Theta(n^{\log_b a})$.
Taking it further: Another way to make the formula (†), which was the entire basis of the Master Theorem, a little more intuitive is to consider iterating the recurrence a few times:
$$\begin{aligned}
T(n) &= cn^k + a \cdot T(\tfrac{n}{b}) && = \sum_{i=0}^{0} c a^i \left( \tfrac{n}{b^i} \right)^k + a T\left( \tfrac{n}{b} \right) \\
&= cn^k + a \left[ c \left( \tfrac{n}{b} \right)^k + a T(\tfrac{n}{b^2}) \right] \\
&= cn^k + ac \left( \tfrac{n}{b} \right)^k + a^2 T(\tfrac{n}{b^2}) && = \sum_{i=0}^{1} c a^i \left( \tfrac{n}{b^i} \right)^k + a^2 T\left( \tfrac{n}{b^2} \right) \\
&= cn^k + ac \left( \tfrac{n}{b} \right)^k + a^2 \left[ c \left( \tfrac{n}{b^2} \right)^k + a T(\tfrac{n}{b^3}) \right] \\
&= cn^k + ac \left( \tfrac{n}{b} \right)^k + a^2 c \left( \tfrac{n}{b^2} \right)^k + a^3 T(\tfrac{n}{b^3}) && = \sum_{i=0}^{2} c a^i \left( \tfrac{n}{b^i} \right)^k + a^3 T\left( \tfrac{n}{b^3} \right).
\end{aligned}$$
At every iteration, we generate another term of the form $c a^i (n/b^i)^k$. Eventually $n/b^i$ will equal 1 (specifically when $i = \log_b n$) and the recursion will terminate. By iterating the recurrence $\log_b n$ times, we would get to
$$T(n) = \sum_{i=0}^{(\log_b n) - 1} c a^i \left( \frac{n}{b^i} \right)^k + a^{\log_b n} \, T\left( \frac{n}{b^{\log_b n}} \right). \qquad (6.10.1)$$
Because $T(n/b^{\log_b n}) = T(1) = c = 1^k c = (n/b^{\log_b n})^k c$, from (6.10.1) we can conclude
$$T(n) = \sum_{i=0}^{(\log_b n) - 1} c a^i \left( \frac{n}{b^i} \right)^k + a^{\log_b n} (n/b^{\log_b n})^k c = \sum_{i=0}^{\log_b n} c a^i \left( \frac{n}{b^i} \right)^k,$$
which is precisely the summation (†).
The Master Method: a few examples
We’ll conclude with a few easy examples using the Master Method, reproducing the recursion-tree analysis of Examples 6.31, 6.32, and 6.33:
Example 6.34 (Solving $T(n) = 2T(n/2) + \{1, n, n^2\}$)
Recall the recurrences
$$T(n) = 2T(\tfrac{n}{2}) + 1 \qquad (1)$$
$$T(n) = 2T(\tfrac{n}{2}) + n \qquad (2)$$
$$T(n) = 2T(\tfrac{n}{2}) + n^2, \qquad (3)$$
all with $T(1) = 1$.
For (1), we have $a = 2$, $b = 2$, $c = 1$, and $k = 0$; because $b^k = 2^0 = 1 < 2 = a$, case (i) of the Master Method says that $T(n) = \Theta(n^{\log_2 2}) = \Theta(n)$.
For (2), we have $a = 2$, $b = 2$, $c = 1$, and $k = 1$; because $b^k = 2^1 = 2 = a$, case (ii) of the Master Method says that $T(n) = \Theta(n^1 \log n) = \Theta(n \log n)$.
For (3), we have $a = 2$, $b = 2$, $c = 1$, and $k = 2$; because $b^k = 2^2 = 4 > 2 = a$, case (iii) of the Master Method says that $T(n) = \Theta(n^2)$.
Taking it further: Although we’ve mostly presented “algorithmic design” and “algorithmic analysis” as two separate phases, in fact there’s interplay between these pieces. See p. 655 for a discussion of a partic- ular computational problem—matrix multiplication—and algorithms for it, including a straightforward but slow algorithm and another that (with inspiration from the Master Method) improves upon that slow algorithm.

Computer Science Connections
Divide-and-Conquer Algorithms and Matrix Multiplication
Matrix multiplication (see Definition 2.43) is a fundamental operation with wide-ranging applications throughout CS: in computer graphics, in data mining, and in social-network analysis, just to name a few. Often the matrices in question are quite large—perhaps a matrix of hyperlinks among thousands or millions of web pages, for example. Thus asymptotic improvements to matrix multiplication algorithms have potential practical importance, too.
For simplicity, we’ll concentrate on multiplying square ($n$-by-$n$) matrices. The obvious algorithm for matrix multiplication simply follows the definition: separately for each of the $n^2$ entries in the output matrix, perform the $\Theta(n)$ multiplications/additions to compute the entry. (See Figure 6.49.) But, in the spirit of this section, what might we be able to do with a recursive algorithm?
There is indeed a nice way to think about matrix multiplication recursively. To multiply two n-by-n matrices M and N, divide M and N each into four quarters, which we can label M11, M12, . . ., as follows:
$$M = \begin{bmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{bmatrix}, \qquad N = \begin{bmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{bmatrix}.$$
Each of these quarters $M_{11}, M_{12}, \ldots$ is an $\frac{n}{2}$-by-$\frac{n}{2}$ matrix. It turns out that
$$MN = \begin{bmatrix} (MN)_{11} & (MN)_{12} \\ (MN)_{21} & (MN)_{22} \end{bmatrix} = \begin{bmatrix} M_{11}N_{11} + M_{12}N_{21} & M_{11}N_{12} + M_{12}N_{22} \\ M_{21}N_{11} + M_{22}N_{21} & M_{21}N_{12} + M_{22}N_{22} \end{bmatrix}.$$
This fact suggests a recursive, divide-and-conquer algorithm for multiplying matrices, with the recurrence $T(n) = 8T(\frac{n}{2}) + n^2$. (It takes $c \cdot n^2$ time to combine the result of the recursive calls.) By the Master Method ($a = 8$, $b = 2$, $k = 2$; case (i)), we have $T(n) = \Theta(n^{\log_2 8}) = \Theta(n^3)$, so not an improvement over Figure 6.49 at all!
Figure 6.49: The naïve algorithm for matrix multiplication for $n$-by-$n$ matrices. For matrices $M \in \mathbb{R}^{n \times n}$ and $N \in \mathbb{R}^{n \times n}$, the product is a matrix $P \in \mathbb{R}^{n \times n}$ where $P_{i,j} := \sum_{k=1}^{n} M_{i,k} N_{k,j}$.
matmult($M \in \mathbb{R}^{n \times n}$, $N \in \mathbb{R}^{n \times n}$):
    for i = 1, 2, …, n:
        for j = 1, 2, …, n:
            P[i,j] := 0
            for k = 1, 2, …, n:
                P[i,j] := P[i,j] + M[i,k] · N[k,j]
    return P
But, in a major algorithmic breakthrough, in 1969 Volker Strassen found a way to use seven recursive calls instead of eight. (See Figure 6.50.) This change makes the recurrence $T(n) = 7T(\frac{n}{2}) + n^2$; now the Master Method ($a = 7$, $b = 2$, $k = 2$; still case (i)) says that $T(n) = \Theta(n^{\log_2 7}) = \Theta(n^{2.8073\cdots})$, a nice improvement! (For example, $1000^{\log_2 7}$ is only about 25% of $1000^3$.)
Figure 6.50: The multiplications for Strassen’s Algorithm. After we compute $A, B, \ldots, G$ recursively, we then add/subtract the results as indicated. (This addition/subtraction takes $c \cdot n^2$ time.)
Compute these values recursively:
    $A := (M_{11} + M_{22})(N_{11} + N_{22})$
    $B := (M_{21} + M_{22}) N_{11}$
    $C := M_{11}(N_{12} - N_{22})$
    $D := M_{22}(N_{21} - N_{11})$
    $E := (M_{11} + M_{12}) N_{22}$
    $F := (M_{21} - M_{11})(N_{11} + N_{12})$
    $G := (M_{12} - M_{22})(N_{21} + N_{22})$.
Then compute $MN$ as
$$\begin{bmatrix} A + D - E + G & C + E \\ B + D & A - B + C + F \end{bmatrix}.$$
Once the Master Method–style recurrence is in mind, one can investigate other Strassen-like algorithms (making fewer recursive calls, and combining them more cleverly). In 1978, Victor Pan gave a further running-time improvement using this style of algorithm (though more complicatedly!), using a total of 143,640 recursive calls on inputs of size $\frac{n}{70}$ (!), plus $\Theta(n^2)$ additional work. Using the Master Method, that algorithm yields a running time of $\Theta(n^{\log_{70} 143{,}640}) = \Theta(n^{2.7951\cdots})$. Algorithms continued to improve for several years, culminating in 1990 with an $\Theta(n^{2.3754\cdots})$-time algorithm due to Don Coppersmith and Shmuel Winograd. That algorithm was the best known for two decades, but in the last few years some new researchers with new insights have come along, and the exponent is now down to 2.373. For whatever it’s worth, many people think that there might be an $\Theta(n^2)$ algorithm for multiplying $n$-by-$n$ matrices, but no one has found it yet!8
For more about matrix multiplication and the recent algorithmic improvements, see the following survey paper by Virginia Vassilevska Williams, one of the researchers responsible for the reinvigorated progress in improving this exponent:
8 Virginia Vassilevska Williams. An overview of the recent progress on matrix multiplication. ACM SIGACT News, 43(4), December 2012.
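The seven products of Figure 6.50 translate fairly directly into code. The following Python sketch is a minimal illustration rather than a tuned implementation: it assumes square matrices stored as lists of lists whose side length is an exact power of two, it recurses all the way down to 1-by-1 matrices, and the helper names (`add`, `sub`, `split`, `strassen`, `naive_matmult`) are hypothetical.

```python
def add(X, Y):
    """Entrywise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Entrywise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Quarter M into (M11, M12, M21, M22)."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def strassen(M, N):
    """Multiply n-by-n matrices (n a power of two) with seven recursive calls."""
    if len(M) == 1:
        return [[M[0][0] * N[0][0]]]
    M11, M12, M21, M22 = split(M)
    N11, N12, N21, N22 = split(N)
    A = strassen(add(M11, M22), add(N11, N22))
    B = strassen(add(M21, M22), N11)
    C = strassen(M11, sub(N12, N22))
    D = strassen(M22, sub(N21, N11))
    E = strassen(add(M11, M12), N22)
    F = strassen(sub(M21, M11), add(N11, N12))
    G = strassen(sub(M12, M22), add(N21, N22))
    top_left = add(sub(add(A, D), E), G)      # A + D - E + G
    top_right = add(C, E)                     # C + E
    bottom_left = add(B, D)                   # B + D
    bottom_right = add(add(sub(A, B), C), F)  # A - B + C + F
    top = [l + r for l, r in zip(top_left, top_right)]
    bottom = [l + r for l, r in zip(bottom_left, bottom_right)]
    return top + bottom


def naive_matmult(M, N):
    """The definition-following algorithm of Figure 6.49, for comparison."""
    n = len(M)
    return [[sum(M[i][k] * N[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

In practice, implementations usually switch to the naive algorithm below some cutoff size, because the additions and the list slicing outweigh the savings on small matrices; that cutoff is omitted here to keep the sketch close to the recurrence $T(n) = 7T(\frac{n}{2}) + n^2$.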

6.5.3 Exercises
The following recurrence relations follow the form of the Master Method. Solve each.
6.109 $T(n) = 4T(n/3) + n^2$
6.110 $T(n) = 3T(n/4) + n^2$
6.111 $T(n) = 2T(n/3) + n^4$
6.112 $T(n) = 3T(n/3) + n$
6.113 $T(n) = 16T(n/4) + n^2$
6.114 $T(n) = 2T(n/4) + 1$
6.115 $T(n) = 4T(n/2) + 1$
6.116 $T(n) = 3T(n/3) + 1$
6.117 $T(n) = 2T(n/2) + n^2$
6.118 $T(n) = 2T(n/2) + n$
6.119 $T(n) = 2T(n/4) + n^2$
6.120 $T(n) = 2T(n/4) + n$
6.121 $T(n) = 4T(n/2) + n^2$
6.122 $T(n) = 4T(n/2) + n$
6.123 $T(n) = 4T(n/4) + n^2$
6.124 $T(n) = 4T(n/4) + n$
6.125 Solve the recurrence T(1) = 1 and T(n) = 1 + 4T(n/4) (see Exercise 6.82, regarding the number of regions defined by quadtrees), using the Master Method.
6.126 Prove that the recurrences $T(n) = aT(\frac{n}{b}) + c \cdot n^k$ and $T(1) = d$ and $S(n) = aS(\frac{n}{b}) + n^k$ and $S(1) = 1$ have the same asymptotic solution, for any constants $a \ge 1$, $b > 1$, $c > 0$, $d > 0$, and $k \ge 0$.
6.127 Consider the Master Method recurrence $T(n) = aT(\frac{n}{b}) + n^k$ and $T(1) = 1$. Using induction, prove the summation (†) from the proof of the Master Theorem: prove that
$$T(n) = n^k \cdot \sum_{i=0}^{\log_b n} \left( \frac{a}{b^k} \right)^i$$
for any $n$ that’s an exact power of $b$.
6.128 The Master Method does not apply for the recurrence $T(n) = 2T(\frac{n}{2}) + n \log n$, but the same idea (considering the summation of all the work in the recursion tree) will still work. Prove that $T(n) = \Theta(n \log^2 n)$ by analyzing the summation analogous to (†).
Each of the following problems gives a brief description of an algorithm for an interesting problem in computer science. (Sometimes the recurrence relation is explicitly written; sometimes it’s up to you to write down the recurrence.) For each, state the recurrence (if it’s missing) and give a Θ-bound on the running time. If the Master Method applies, you may use it. If not, give a proof by induction.
6.129 The Towers of Hanoi is a classic puzzle, as follows. There are three posts (the “towers”); post A starts with n concentric discs stacked from top-to-bottom in order of increasing radius, so that the smallest disc is on top. We must move all the discs to post B, never placing a disc of larger radius on top of a disc of smaller radius. The easiest way to solve this puzzle is with recursion: (i) recursively move the top n − 1 discs from A to C; (ii) move the nth disc from A to B; and (iii) recursively move the n − 1 discs from C to B. The total number of moves made satisfies T(n) = 2T(n − 1) + 1 and T(1) = 1. Prove that T(n) = 2^n − 1.
6.130 Suppose we are given a sorted array A[1 . . . n], and we wish to determine where in A the element x belongs—that is, the index i such that A[i − 1] < x ≤ A[i]. (Binary Search solves this problem.) Here’s a sketch of an algorithm rootSearch to solve this problem:
• if n is small (say, less than 100), find the index by brute force. Otherwise:
• define mileposts := A[√n], A[2√n], A[3√n], . . . , A[n] to be a list of every (√n)th element of A.
• recursively, find post := rootSearch(mileposts, x).
• return rootSearch(A[(post − 1)√n, . . . , post√n], x).
(Note that rootSearch makes two recursive calls.) Find a recurrence relation for the running time of this algorithm, and solve it.
6.131 A van Emde Boas tree is a recursive data structure (with somewhat similar inspiration to the previous exercise) that allows us to insert, delete, and look up keys drawn from a set U = {1, 2, . . . , u} quickly. (It solves the same problem that binary search trees solve, but our running time will be in terms of the size of the universe U rather than in terms of the number of keys stored.) A van Emde Boas tree achieves a running time given by T(n) = T(√n) + 1 and T(1) = 1. Solve this recurrence. (Hint: define R(k) := T(2^k). Solving R(k) is easy!)

6.6 Chapter at a Glance
Asymptotics
Asymptotic analysis considers the rate of growth of functions, ignoring multiplicative constant factors and concentrating on the long-run behavior of the function on large inputs. Consider two functions f : R≥0 → R≥0 and g : R≥0 → R≥0. Then f(n) = O(g(n)) (“f grows no faster than g”) if there exist c > 0 and n0 ≥ 0 such that f(n) ≤ c · g(n) for all n ≥ n0. Some useful properties of O(·):
• f(n) = O(g(n) + h(n)) if and only if f(n) = O(max(g(n), h(n))).
• if f(n) = O(g(n)) and g(n) = O(h(n)), then f(n) = O(h(n)).
• if f(n) = O(h1(n)) and g(n) = O(h2(n)), then f(n) + g(n) = O(h1(n) + h2(n)) and f(n) · g(n) = O(h1(n) · h2(n)).
• a polynomial p(n) = a_k n^k + ··· + a_1 n + a_0 satisfies p(n) = O(n^k).
• log n = O(n^ε) for any ε > 0.
• for any base b and exponent k, we have log_b(n^k) = O(log n).
• for constants b, c ≥ 1, we have b^n = O(c^n) if and only if b ≤ c.
There are several other forms of asymptotic notation, to capture other relationships between functions. A function f grows no slower than g, written f(n) = Ω(g(n)), if there exist constants d > 0 and n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≥ d · g(n). Two functions f and g satisfy f(n) = O(g(n)) if and only if g(n) = Ω(f(n)).
A function f grows at the same rate as g, written f (n) = Θ(g(n)), if f (n) = O(g(n)) and
f (n) = Ω(g(n)); it grows (strictly) slower than g, written f (n) = o(g(n)), if f (n) = O(g(n)) but f (n) ̸= Ω(g(n)); and it grows (strictly) faster than g, written f (n) = ω(g(n)), if f (n) = Ω(g(n)) but f (n) ̸= O(g(n)). Many of the properties of O have analogous properties for Ω, Θ, o, and ω. One possibly surprising point is that there are functions that are incomparable: there are functions f and g such that neither f (n) = O(g(n)) nor f (n) = Ω(g(n)).
Asymptotic Analysis of Algorithms
Our main interest in asymptotics is in the analysis of algorithms, so that we can make statements about which of two algorithms that solve the same problem is faster. The running time of an algorithm is a count of the number of primitive steps that the algorithm takes to complete on a particular input. (Think of one machine instruction as a primitive step.)
We generally evaluate the efficiency of an algorithm A using worst-case analysis: as a function of n, how many primitive steps does A take on the input of size n for which A is the slowest? (A primary goal of algorithmic analysis is to provide a guarantee on the running time of an algorithm, so we will be pessimistic.) We can also analyze the space used by an algorithm, in the same way. Sometimes we will instead consider the average-case running time of an algorithm A, which is the running time of A averaged over all inputs of size n. Almost never will we consider an algorithm’s running time on the input of size n for which A is the fastest (known as best-case analysis); this type of analysis is rarely used.
Recurrence Relations: Analyzing Recursive Algorithms
Typically, for nonrecursive algorithms, we compute the running time by inspecting the algorithm and writing down a summation corresponding to the operations done in each iteration of each loop, summed over the iterations, and then simplifying. For recursive algorithms, we typically record the work using a recurrence relation that expresses the (worst-case) running time on inputs of size n in terms of the (worst-case) running time on inputs of size less than n. (For small inputs, the running time is a constant—say, T(1) = c.) For example, ignoring floors and ceilings, T(1) = c and T(n) = 2T(n/2) + cn is the recurrence relation for Merge Sort. (Almost always, we can safely ignore floors and ceilings.)
A solution to a recurrence relation is a closed-form (nonrecursive) expression for T(n). Recurrence relations can be solved by conjecturing a solution and proving that conjecture correct by induction.
A recurrence relation can be represented using a recursion tree, where each node is annotated with the work that is performed there, aside from the recursive calls. Recurrence relations can also be solved by summing up all of the work contained within the recursion tree.
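The conjecture-and-verify approach can be tried out numerically before writing the induction proof. The following is a minimal sketch, assuming n is an exact power of 2 and taking T(1) = c; the function names and the conjectured closed form c·n·log2(n) + c·n are illustrative choices, and the loop only spot-checks the conjecture rather than proving it.

```python
def merge_sort_recurrence(n, c=1):
    """Evaluate T(1) = c and T(n) = 2*T(n/2) + c*n for n an exact power of 2."""
    if n == 1:
        return c
    return 2 * merge_sort_recurrence(n // 2, c) + c * n


def conjectured_closed_form(n, c=1):
    """Conjectured solution: T(n) = c*n*log2(n) + c*n, for n an exact power of 2."""
    log2_n = n.bit_length() - 1  # exact log base 2 when n is a power of 2
    return c * n * log2_n + c * n


# Compare the recurrence with the conjecture for n = 1, 2, 4, ..., 1024.
for exponent in range(11):
    n = 2 ** exponent
    assert merge_sort_recurrence(n) == conjectured_closed_form(n)
```

A check like this cannot replace the proof by induction, but it catches a wrong conjecture quickly.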
Recurrence Relations: The Master Method
A particularly common type of recurrence relation is one of the form T(n) = aT(n/b) + c·n^k, for constants a ≥ 1, b > 1, c > 0, and k ≥ 0. This type of recurrence arises in divide-and-conquer algorithms that solve an instance of size n by making a different recursive calls (where a is the constant in the recurrence) on inputs of size n/b, and reconstructing the solution to the given instance in Θ(n^k) time.
[Figure: a recursion tree with a root of size n, two children of size n/2, four grandchildren of size n/4, …, and leaves of size 1, spanning 1 + log_2 n levels.]
The Master Theorem states that the solution to any such recurrence relation is given by:
1. if b^k < a, then T(n) = Θ(n^(log_b a)). “The leaves dominate.”
2. if b^k = a, then T(n) = Θ(n^k · log n). “All levels are equal.”
3. if b^k > a, then T(n) = Θ(n^k). “The root dominates.”
The proof follows by building the recursion tree, and summing the work at each level of the tree; the cases correspond to whether the work increases exponentially (Case 1), stays constant (Case 2), or decreases exponentially (Case 3) across levels of the tree.
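The three cases of the Master Theorem amount to a comparison of b^k with a, which can be sketched as a small function. This is an illustrative helper, not part of the text: the name master_theorem and the returned tuple format are assumptions, and the exact equality test b**k == a is only reliable when a, b, and k are integers (or otherwise exactly representable).

```python
from math import log


def master_theorem(a, b, k):
    """Classify T(n) = a*T(n/b) + c*n**k by comparing b**k with a.

    Returns (label, exponent, has_log_factor), meaning that
    T(n) = Theta(n**exponent * log n) if has_log_factor is True,
    and T(n) = Theta(n**exponent) otherwise.
    """
    if a < 1 or b <= 1 or k < 0:
        raise ValueError("need a >= 1, b > 1, and k >= 0")
    if b ** k < a:
        return ("the leaves dominate", log(a, b), False)  # Case 1
    if b ** k == a:
        return ("all levels are equal", k, True)          # Case 2
    return ("the root dominates", k, False)               # Case 3


print(master_theorem(2, 2, 1))  # Merge Sort: ('all levels are equal', 1, True)
print(master_theorem(1, 2, 0))  # T(n) = T(n/2) + c, a binary-search-style recurrence
print(master_theorem(3, 2, 1))  # a hypothetical recurrence that falls in Case 1
```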

Key Terms and Results
Key Terms
Asymptotics
• asymptotic analysis
• O (big oh)
• Ω (big omega)
• Θ (big theta)
• ω (little omega)
• o (little oh)
Analysis of Algorithms
• running time
• worst-case analysis
• average-case analysis
• best-case analysis
Recurrence Relations
• recurrence relation
• recursion tree
• iterating a recurrence
Master Method
• Master Theorem
• “the leaves dominate”
• “all levels are equal”
• “the root dominates”
Key Results
Asymptotics
1. Some sample useful properties of O(·):
• f(n) = O(g(n) + h(n)) ⇔ f(n) = O(max(g(n), h(n))).
• O(·) is transitive.
• any degree-k polynomial satisfies p(n) = O(n^k).
• log n = O(n^ε) for any ε > 0.
• if f(n) = O(g(n)) then log f(n) = O(log g(n)).
• for any b and k, we have log_b(n^k) = O(log n).
• for constants b, c ≥ 1, we have b^n = O(c^n) ⇔ b ≤ c.
2. Two functions f and g satisfy f(n) = O(g(n)) if and only if g(n) = Ω(f(n)).
3. There are pairs of functions f and g such that neither f(n) = O(g(n)) nor f(n) = Ω(g(n)).
Analysis of Algorithms
1. We generally evaluate the efficiency of an algorithm A using worst-case analysis: what happens (asymptotically) to the number of steps consumed by A as a function of the input size n on the input of size n for which A is the slowest?
2. Typically we can analyze the running time of a nonrecursive algorithm by simple counting and manipulation of summations.
Recurrence Relations
1. The running time of a recursive algorithm can be expressed using a recurrence relation, which can be solved by figuring out a conjecture of a closed-form formula for the relation, and then verifying by induction.
Master Method
1. Recurrence relations of the form T(n) = aT(n/b) + cn^k (and T(1) = c) can be solved using the Master Method:
Case 1: if b^k < a, then T(n) = Θ(n^(log_b a)).
Case 2: if b^k = a, then T(n) = Θ(n^k · log n).
Case 3: if b^k > a, then T(n) = Θ(n^k).

7
Number Theory
In which, after becoming separated, our heroes arrange a place to meet, by sending messages that stay secret even as snooping spies listen in.

702
CHAPTER 7. NUMBER THEORY
7.1 Why You Might Care
When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind.
Sir William Thomson, Lord Kelvin (1824–1907)
A chapter about numbers (particularly when it’s so far along in this book!) probably seems a little bizarre—after all, what is there to say about numbers that you didn’t figure out by elementary school?!? But, more so than any other chapter of the book, the technical material in this chapter leads directly to a single absolutely crucial (and ubiquitous!) modern application of computer science: cryptography, which deals with protocols to allow multiple parties to communicate securely, even in the presence of eavesdropping adversaries (or worse!). Cryptographic systems are used throughout our daily lives—both in the security layers that connect us as users to servers (for example, in banking online or in registering for courses at a college), and in the backend systems that, we hope, protect our data even when we aren’t interacting with it.
cryptography (Greek): kryptos “concealed/secret” + graph “writing.”
Our goal in this chapter will be to build up the technical machinery necessary to define and understand the RSA cryptosystem, one of the most commonly used cryptographic systems today. (RSA is named after the initials of its three discoverers, Rivest, Shamir, and Adleman.) By the end of the chapter, in Section 7.5, we’ll be able to give a full treatment of RSA, along with sketched outlines of a few other important ideas from cryptography. (Later in the book, in Chapter 9, we’ll also encounter the historical codebreaking work of Alan Turing and colleagues, which deciphered the German encryption in World War II—a major part of the Allied victory. See p. 960.)
To get there, we’ll need to develop some concepts and tools from number theory. (“Number theory” is just a slightly fancy name for “arithmetic on integers.”) Our focus will be on modular arithmetic: that is, the numbers on which we’ll be doing arithmetic will be a set of integers {0, 1, 2, . . . , n − 1}, where—like on a clock—the numbers “wrap around” from n − 1 back to 0. In other words, we’ll interpret numerical expressions modulo n, always considering each expression via its remainder when we divide by n.
We begin in Section 7.2 with formal definitions of modular arithmetic, and the adaptation of some basic ideas from elementary-school arithmetic to this new setting. We’ll then turn in Section 7.3 to primality (when a number has no divisors other than 1 and itself) and relative primality (when two numbers have no common divisors other than 1). Modular arithmetic begins to diverge more substantially when we start to think about division: there’s no integer that’s one fifth of 3 . . . but, on a clock where we treat 12:00 as 0, there is an integer that’s a fifth of 3—namely 3, because 3 + 3 + 3 + 3 + 3 is 3 (because 3:00pm is 15 hours after midnight—so 5 · 3 is 3, modulo 12). In Section 7.4, we’ll explore exactly what division means in modular arithmetic—and some special features of division that arise when n is a prime number.
As we go, we’ll see a few other applications of number theory: to error-correcting codes, secret sharing, and the apparently unrelated task of generating all 4-letter sequences (AAAA to ZZZZ). And, finally, we’ll put the pieces together to explore RSA.

7.2 Modular Arithmetic
Among those whom I like or admire, I can find no common denominator, but among those whom I love, I can: all of them make me laugh.
W. H. Auden (1907–1973)
We will start with a few reminders of some basic arithmetic definitions from Chapter 2—about multiplication, division, and modular arithmetic—as these concepts are the foundations for all the work that we’ll do in this chapter. We’ll also introduce a few algorithms for computing these basic arithmetic quantities, including one of the oldest known algorithms: the Euclidean algorithm, from about 2300 years ago, which computes the greatest common divisor of two integers n and m (that is, the largest integer that evenly divides both n and m).
7.2.1 Remainders: A Reminder
Let’s start with a few simple facts about integers. Every integer is 0 or 1 more than some even number. Every integer is 0, 1, or 2 more than a multiple of three. Every integer is at most 3 more than a multiple of four. And, in general, for any integer k ≥ 1, every integer is r more than a multiple of k, for some r ∈ {0, 1, . . . , k − 1}. We’ll begin with a precise statement and proof of the general version of this property:

Theorem 7.1 (Floors and Remainders: “The Division Theorem”)
Let k ≥ 1 and n be integers. Then there exist integers d and r such that (i) 0 ≤ r < k, and (ii) kd + r = n. Furthermore, the values of d and r satisfying (i) and (ii) are unique.

Before we prove the theorem, let’s look at a few examples of what it claims:

Example 7.1 (Some examples of the Division Theorem)
For k = 202 and n = 379, the theorem states that there exist integers r ∈ {0, 1, …, 201} and d with 202d + r = 379. Specifically, those values are r = 177 and d = 1, because 202 · 1 + 177 = 379.
Here are a few more examples, still with k = 202:
n = 55057   n = 507   n = 177   n = 404   n = −507   n = −404
d = 272     d = 2     d = 0     d = 2     d = −3     d = −2
r = 113     r = 103   r = 177   r = 0     r = 99     r = 0
You can verify that, in each of these six columns, indeed we have 202d + r = n.

Now let’s give a proof of the general result:

Proof of Theorem 7.1. Consider a fixed integer k ≥ 1. Let P(n) denote the claim
P(n) := there exist integers d and r such that 0 ≤ r < k and kd + r = n.
We must prove that P(n) holds for all integers n. We’ll first prove the result for nonnegative n (by strong induction on n), and then show the claim for n < 0 (making use of the result for nonnegative n).
Case I: n ≥ 0. We’ll prove that P(n) holds for all n ≥ 0 by strong induction on n.
• For the base cases (0 ≤ n < k), we simply select d := 0 and r := n. Indeed, these values guarantee that 0 ≤ r < k and kd + r = k · 0 + n = 0 + n = n.
• For the inductive case (n ≥ k), we assume the inductive hypotheses—namely, we assume P(n′) for any 0 ≤ n′ < n—and we must prove P(n). Because n ≥ k and k > 0, it is immediate that n′ := n − k satisfies 0 ≤ n′ < n. Thus, by the inductive hypothesis P(n′), there exist integers d′ and r′ such that 0 ≤ r′ < k and kd′ + r′ = n′. Let d := d′ + 1 and r := r′. Then 0 ≤ r < k and kd + r = k(d′ + 1) + r′ = (kd′ + r′) + k = n′ + k = n.
Case II: n < 0. Then −n > 0, so by Case I there exist integers d′ and r′ such that 0 ≤ r′ < k and kd′ + r′ = −n. There are two subcases:
Case IIA: r′ ≠ 0. Then let d := −d′ − 1 and r := k − r′. (Because k > r′ > 0, we have 0 < k − r′ < k.) Thus
kd + r = k(−d′ − 1) + k − r′        definition of d and r
       = −kd′ − k + k − r′
       = −(kd′ + r′)
       = −(−n) = n.                 definition of d′ and r′
Case IIB: r′ = 0. Then let d := −d′ and r := r′ = 0. Therefore
kd + r = −d′k + r′                  definition of d and r
       = −(−n) = n.                 definition of d′ and r′
We have thus proven that P(n) holds for all integers n: Case I handled n ≥ 0, and Case II handled n < 0. (We have not yet proven the uniqueness of the integers r and d; this proof of uniqueness is left to you in Exercise 7.4.)

This theorem now allows us to give a more careful definition of modular arithmetic. (In Definition 2.9, we gave the slightly less formal definition of n mod k as the remainder when we divide n by k.)
Incidentally, the integer d whose existence is guaranteed by Theorem 7.1 is ⌊n/k⌋: for any k ≥ 1, we can write the integer n as n = ⌊n/k⌋ · k + (n mod k).
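The values of d and r promised by Theorem 7.1 can be computed directly in Python, whose built-in divmod returns the pair (⌊n/k⌋, n mod k). This is a minimal sketch: the wrapper name division_theorem is an illustrative assumption. It relies on the fact that Python’s integer division rounds toward −∞, so the remainder is nonnegative whenever k ≥ 1, even for negative n; languages whose integer division truncates toward zero (C, for instance) can produce a negative remainder for negative n, which would not satisfy condition (i).

```python
def division_theorem(n, k):
    """Return the unique (d, r) with 0 <= r < k and k*d + r == n, for k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    d, r = divmod(n, k)  # d is floor(n / k); r is n mod k
    assert 0 <= r < k and k * d + r == n  # conditions (i) and (ii) of Theorem 7.1
    return d, r


# The values of n from Example 7.1, all with k = 202.
for n in [379, 55057, 507, 177, 404, -507, -404]:
    print(n, division_theorem(n, 202))
```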
Taking it further: One of the tasks that we can accomplish conveniently using modular arithmetic is base conversion of integers. We’re used to writing numbers in decimal (“base 10”), where each digit is “worth” a factor of 10 more than the digit to its right. (For example, the number we write “31” means 1 · 10^0 + 3 · 10^1 = 1 + 30.) Computers store numbers in binary (“base 2”) representation, and we can convert between bases using modular arithmetic. For more, see the discussion on p. 714.

7.2.2 Computing n mod k and ⌊n/k⌋
So far, we’ve taken arithmetic operations for granted—ignoring how we’d figure out the numerical value of an arithmetic expression like 2^1024 − 3^256 · 5^202, which is simple to write—but not so instantaneous to calculate. (Quick! Is 2^1024 − 3^256 · 5^202 evenly divisible by 7?) Indeed, many of us spent a lot of time in elementary-school math classes learning algorithms for basic arithmetic operations like addition, multiplication, long division, and exponentiation (even if back then nobody told us that they were called algorithms). Thinking about algorithms for some basic arithmetic operations will be useful, for multiple reasons: because they’re surprisingly relevant for proving some useful facts about modular arithmetic, and because computing them efficiently turns out to be crucial in the cryptographic systems that we’ll explore in Section 7.5.
We’ll start with the algorithm shown in Figure 7.1 that computes n mod k (and simultaneously computes ⌊n/k⌋ too). The very basic idea for this algorithm was implicit in the proof of Theorem 7.1: we repeatedly subtract k from n until we reach a number in the range {0, 1, . . . , k − 1}.

Example 7.2 (An example of mod-and-div)
Let’s compute mod-and-div(64, 5). We start with r := 64 and d := 0, and repeatedly decrease r by 5 and increase d by 1 until r < 5. Here are the values in each iteration:
r   64  59  54  49  44  39  34  29  24  19  14   9   4
d    0   1   2   3   4   5   6   7   8   9  10  11  12
Thus mod-and-div(64, 5) returns 4 and 12—and, indeed, we can write 64 = 12 · 5 + 4, where 4 = 64 mod 5 and 12 = ⌊64/5⌋.
Similarly, mod-and-div(20, 17) starts with d = 0 and r = 20, and executes one (and only one) iteration of the loop, returning d = 1 and r = 3.

mod-and-div(n, k):
Input: integers n ≥ 0 and k ≥ 1
Output: n mod k and ⌊n/k⌋
r := n; d := 0
while r ≥ k:
    r := r − k; d := d + 1
return r, d
Figure 7.1: An algorithm to compute n mod k and ⌊n/k⌋.

Some programming languages—Pascal, for one (admittedly dated) example—use div to denote integer division, so that 15 div 7 is 2.

Definition 7.1 (Modulus (reprise))
For integers k > 0 and n, the quantity n mod k is the unique integer r such that 0 ≤ r < k and kd + r = n for some integer d (whose existence is guaranteed by Theorem 7.1).

7.2.3 Congruences, Divisors, and Common Divisors
We argued in Lemma 7.2 that mod-and-div(n, k), which repeatedly subtracts k from n in a loop, correctly computes the value of n mod k. We gave a proof by induction in Lemma 7.2, but we could have instead argued for the correctness of the algorithm, perhaps more intuitively, via the following fact: For any integers a ≥ 0 and k ≥ 1, we have (a + k) mod k = a mod k. That is, the remainder when we divide an integer a by k isn’t changed by adding an exact multiple of k to a. This property follows from the definition of mod, but it’s also a special case of a useful general property of modular arithmetic, which we’ll state (along with some other similar facts) in Theorem 7.3. Here are a few examples of this more general property:

Example 7.3 (The mod of a sum, and the sum of the mods)
Consider the following expressions of the form (a + b) mod k.
• (17 + 43) mod 7 = 60 mod 7 = 4. (Note 17 mod 7 = 3, 43 mod 7 = 1, and 3 + 1 = 4.)
• (18 + 42) mod 9 = 60 mod 9 = 6. (Note 18 mod 9 = 0, 42 mod 9 = 6, and 0 + 6 = 6.)
• (25 + 25) mod 6 = 50 mod 6 = 2. (Note 25 mod 6 = 1, 25 mod 6 = 1, and 1 + 1 = 2.)
At this point it might be tempting to conjecture that (a + b) mod k is always equal to (a mod k) + (b mod k), but be careful—this claim has a bug, as this example shows:
• (18 + 49) mod 5 = 67 mod 5 = 2. (Note 18 mod 5 = 3, 49 mod 5 = 4, but 3 + 4 ≠ 2.)
Instead, it turns out that (a + b) mod k = [(a mod k) + (b mod k)] mod k—we had to add an “extra” mod k at the end. Here are some of the useful general properties of modular arithmetic:

Theorem 7.3 (Properties of modular arithmetic)
For integers a and b and k > 0:
k mod k = 0                                          (7.3.1)
(a + b) mod k = [(a mod k) + (b mod k)] mod k        (7.3.2)
ab mod k = [(a mod k) · (b mod k)] mod k             (7.3.3)
a^b mod k = [(a mod k)^b] mod k.                     (7.3.4)
We’ll omit proofs of these properties, though we could give a formal proof based on the definitions of mod. (Exercise 7.17 asks you to give a formal proof for one of these properties, namely (7.3.2).) Again notice the “extra” mod k at the end of the last three of these equations—it is not the case that ab mod k = (a mod k) · (b mod k) in general. For example, 14 mod 6 = 2 and 5 mod 6 = 5, but (2 · 5) mod 6 = 4 ̸= 2 · 5.
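Both the repeated-subtraction algorithm of Figure 7.1 and the properties in Theorem 7.3 are easy to spot-check by machine. The sketch below is illustrative only: the function name mod_and_div mirrors the figure, the ranges in the loop are arbitrary small choices, and the check of (7.3.4) assumes a nonnegative exponent b. A finite check like this is evidence, not a proof.

```python
def mod_and_div(n, k):
    """Repeated subtraction, following Figure 7.1: return (n mod k, floor(n / k))."""
    if n < 0 or k < 1:
        raise ValueError("need n >= 0 and k >= 1")
    r, d = n, 0
    while r >= k:
        r -= k  # subtracting k does not change the remainder mod k
        d += 1  # count how many copies of k have been removed
    return r, d


# The two computations described in Example 7.2.
assert mod_and_div(64, 5) == (4, 12)
assert mod_and_div(20, 17) == (3, 1)

# Spot-check (7.3.1)-(7.3.4) using Python's % operator.
for k in range(1, 13):
    assert k % k == 0                                          # (7.3.1)
    for a in range(-20, 21):
        for b in range(0, 8):
            assert (a + b) % k == ((a % k) + (b % k)) % k      # (7.3.2)
            assert (a * b) % k == ((a % k) * (b % k)) % k      # (7.3.3)
            assert (a ** b) % k == ((a % k) ** b) % k          # (7.3.4), b >= 0
```

Note that mod_and_div takes time proportional to ⌊n/k⌋, so it illustrates the definition rather than offering a practical way to compute remainders of large numbers.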
In the cryptographic applications that we will explore later in this chapter, it will turn out to be important to perform “modular exponentiation” efficiently—that is, we’ll need to compute b^e mod n very quickly, even when e is fairly large. Fortunately, (7.3.4) will help us do this computation efficiently; see Exercises 7.23–7.25.
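One standard way to exploit these properties is repeated squaring, reducing mod n after every multiplication so that intermediate values never grow beyond n^2. The sketch below is an assumption about how such a routine might look (it is not necessarily the approach that the exercises develop), and the name mod_exp is hypothetical; Python’s built-in three-argument pow(b, e, n) computes the same quantity.

```python
def mod_exp(b, e, n):
    """Compute b**e mod n by repeated squaring, for integers e >= 0 and n >= 1."""
    if e < 0 or n < 1:
        raise ValueError("need e >= 0 and n >= 1")
    result = 1 % n  # handles n == 1, where every value is 0 mod n
    base = b % n    # (7.3.4): we may reduce the base mod n first
    while e > 0:
        if e % 2 == 1:
            result = (result * base) % n  # (7.3.3): reduce after multiplying
        base = (base * base) % n          # square the base, again reducing mod n
        e //= 2
    return result


# The quantity from the "Quick!" question earlier in this section, taken mod 7.
value = (mod_exp(2, 1024, 7) - mod_exp(3, 256, 7) * mod_exp(5, 202, 7)) % 7
print(value)  # nonzero exactly when the expression is not divisible by 7
```

The loop runs once per bit of e, so the number of multiplications grows with log e rather than with e itself.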
Congruences
We’ve now talked a little bit (in Theorem 7.3, for example) about two numbers a and b that have the same remainder when we divide them by k—that is, with a mod k = b mod k. There’s useful terminology, and notation, for this kind of equivalence:

Definition 7.2 (Congruence)
Two integers a and b are congruent mod k, written a ≡_k b, if a mod k = b mod k.

Typically a ≡_k b is read as “a is equivalent to b mod k” or “a is congruent to b mod k.” If you’re reading the statement a ≡_k b out loud, it’s polite to pause slightly, as if there were a comma, before the “mod k” part.

Taking it further: Some people write a ≡_k b using the notation a ≡ b (mod k). This notation is used to mean the same thing as our notation a ≡_k b, but note the somewhat unusual precedence in this alternate notation: it says that
[a ≡ b] (mod k)
(and it does not, as it might appear, say that the quantity a and the quantity [b mod k] are equivalent).

Divisors, factors, and multiples
We now return to the divisibility of one number by another, when the first is an exact multiple of the second. As with the previous topics in this section, we gave some preliminary definitions in Chapter 2 of divisibility (and related terminology), but we’ll again repeat the definitions here, and also go into a little bit more detail.

Definition 7.3 (Divisibility, Factors, and Multiples (reprise))
For two integers k > 0 and n, we write k | n to denote the proposition that n mod k = 0. If k | n, we say that k divides n (or that k evenly divides n), that n is a multiple of k, and that k is a factor of n.

(For example, we can say that 42 | 714, that 6 and 17 are factors of 714, and that 714 is a multiple of 7.) Here are a few useful properties of division:

Theorem 7.4 (Properties of divisibility)
For integers a and b and c:
a | 0                                     (7.4.1)
1 | a                                     (7.4.2)
a | a                                     (7.4.3)
a | b and b | c  ⇒  a | c                 (7.4.4)
a | b and b | a  ⇒  a = b or a = −b       (7.4.5)
a | b and a | c  ⇒  a | (b + c)           (7.4.6)
a | b  ⇒  a | bc                          (7.4.7)
ab | c  ⇒  a | c and b | c                (7.4.8)

These properties generally follow fairly directly from the definition of divisibility.
A few are left to you in the exercises, and we’ll address a few others in Chapter 8, which introduces relations. (Facts (7.4.3), (7.4.4), and a version of (7.4.5) are certain standard properties of some relations that the “divides” relation happens to have: reflexivity, transitivity, and so-called antisymmetry. See Chapter 8.) To give the flavor of these arguments, here’s one of the proofs, that ab | c implies that a | c and b | c:
Proof of (7.4.8). Assume ab | c. Then, by definition of mod (and by Theorem 7.1), there exists an integer k such that c = (ab) · k. Taking both sides mod a, we have
c mod a = abk mod a                            k is the integer such that c = (ab) · k
        = [(a mod a) · (bk mod a)] mod a       (7.3.3)
        = [0 · (bk mod a)] mod a               (7.3.1)
        = 0 mod a                              0 · x = 0 for any x
        = 0.                                   0 mod a = 0 for any a
Thus c mod a = 0, so a | c. Analogously, because b · (ak) = c, we have that b | c too.
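The properties in Theorem 7.4 can also be checked exhaustively over a small range, which is a useful sanity check while working through the proofs. This sketch makes two simplifying assumptions that are not in the theorem itself: it restricts a, b, and c to the positive integers 1 through 24 (so the conclusion of (7.4.5) reduces to a = b), and it uses a hypothetical helper divides(k, n) that follows Definition 7.3 for k > 0.

```python
def divides(k, n):
    """Definition 7.3: for k > 0, k | n exactly when n mod k == 0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return n % k == 0


small = range(1, 25)
for a in small:
    assert divides(a, 0) and divides(1, a) and divides(a, a)   # (7.4.1)-(7.4.3)
    for b in small:
        if divides(a, b) and divides(b, a):
            assert a == b                                      # (7.4.5), positive case
        for c in small:
            if divides(a, b) and divides(b, c):
                assert divides(a, c)                           # (7.4.4)
            if divides(a, b) and divides(a, c):
                assert divides(a, b + c)                       # (7.4.6)
            if divides(a, b):
                assert divides(a, b * c)                       # (7.4.7)
            if divides(a * b, c):
                assert divides(a, c) and divides(b, c)         # (7.4.8)
```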
Greatest common divisors and least common multiples
We now turn to our last pair of definitions involving division: for two integers, we’ll be interested in two related quantities—the largest number that divides both of them, and the smallest number that they both divide.

Definition 7.4 (Greatest Common Divisor (GCD))
The greatest common divisor of two positive integers n and m, denoted gcd(n, m), is the largest d ∈ Z≥1 such that d | n and d | m.

Definition 7.5 (Least Common Multiple (LCM))
The least common multiple of two positive integers n and m, denoted lcm(n, m), is the smallest d ∈ Z≥1 such that n | d and m | d.

Here are some examples of both GCDs and LCMs, for a few pairs of small numbers:

Example 7.4 (Examples of GCDs)
The GCD of 6 and 27 is 3, because 3 divides both 6 and 27 (and no integer k ≥ 4 divides both). Similarly, we have gcd(1, 9) = 1, gcd(12, 18) = 6, gcd(202, 505) = 101, and gcd(11, 202) = 1.

Example 7.5 (Examples of LCMs)
The LCM of 6 and 27 is 54, because 6 and 27 both divide 54 (and no k ≤ 53 is divided by both). Similarly, we have lcm(1, 9) = 9, lcm(12, 18) = 36, lcm(202, 505) = 1010, and lcm(11, 202) = 2222.
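Definitions 7.4 and 7.5 translate almost word for word into brute-force code: search the candidates and keep the largest common divisor or the smallest common multiple. The function names below are illustrative assumptions, and the sketch is deliberately naive; it is meant to mirror the definitions, not to be fast.

```python
def gcd_brute(n, m):
    """Largest d >= 1 with d | n and d | m, straight from Definition 7.4."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    return max(d for d in range(1, min(n, m) + 1) if n % d == 0 and m % d == 0)


def lcm_brute(n, m):
    """Smallest d >= 1 with n | d and m | d, straight from Definition 7.5."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    # A common multiple is at least max(n, m), and n * m is always a common multiple.
    return next(d for d in range(max(n, m), n * m + 1) if d % n == 0 and d % m == 0)


print(gcd_brute(6, 27), lcm_brute(6, 27))  # 3 54
```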

Both of these concepts should be (at least vaguely!) familiar from elementary school, specifically from when you learned about how to manipulate fractions:
• We can rewrite the fraction 38/133 as 2/7, by dividing both numerator and denominator by the common factor 19—and we can’t reduce it further because 19 is the greatest common divisor of 38 and 133. (We have “reduced the fraction to lowest terms.”)
• We can rewrite the sum 5/12 + 7/18 as 15/36 + 14/36 (which equals 29/36) by rewriting both fractions with a denominator that’s a common multiple of the denominators of the two addends—and we couldn’t have chosen a smaller denominator, because 36 is the least common multiple of 12 and 18. (We have “put the fractions over the lowest common denominator.”)
In the remainder of this section, we’ll turn to the task of efficiently computing the greatest common divisor of two integers. (Using this algorithm, we can also find least common multiples quickly, because GCDs and LCMs turn out to be closely related quantities: for any positive integers a and b, we have lcm(a, b) · gcd(a, b) = a · b.)
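The two fraction manipulations above, and the identity lcm(a, b) · gcd(a, b) = a · b, can be reproduced with Python’s standard library. In this sketch, Fraction (which always stores a fraction in lowest terms) and math.gcd are real standard-library features, while the helper name lcm is an illustrative assumption that simply rearranges the identity.

```python
from fractions import Fraction
from math import gcd


def lcm(a, b):
    """Least common multiple of positive integers, via lcm(a, b) * gcd(a, b) == a * b."""
    return a * b // gcd(a, b)


# Reducing to lowest terms divides out the GCD of numerator and denominator.
print(gcd(38, 133), Fraction(38, 133))         # 19 2/7

# Adding fractions uses a common multiple of the denominators.
print(lcm(12, 18))                             # 36
print(Fraction(5, 12) + Fraction(7, 18))       # 29/36
```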
7.2.4 Computing Greatest Common Divisors
The “obvious” way to compute the greatest common divisor of two positive integers n and m is to try all candidate divisors d ∈ {1, 2, . . . , min(n, m)} and to return the largest value of d that indeed evenly divides both n and m. This algorithm is slow—very slow!—but there is a faster way to solve the problem. Amazingly, a faster algorithm for computing GCDs has been known for approximately 2300 years: the Euclidean algorithm, named after the Greek geometer Euclid, who lived in the 3rd century bce. (Euclid is also the namesake of the Euclidean distance between points in the plane—see Exercise 2.174—among a number of other things in mathematics.) The algorithm is shown in Figure 7.2.

Euclid(n, m):
Input: positive integers n and m ≥ n
Output: gcd(n, m)
if m mod n = 0 then
    return n
else
    return Euclid(m mod n, n)
Figure 7.2: The Euclidean algorithm for GCDs.

Taking it further: Euclid described his algorithm in his book Elements, from c. 300 bce, a multivolume opus covering the fundamentals of mathematics, particularly geometry, logic, and proofs. Most people view the Euclidean algorithm as the oldest nontrivial algorithm that’s still in use today; there are some older not-quite-fully-specified procedures for basic arithmetic operations like multiplication that date back close to 2000 bce, but they’re not quite laid out as algorithms.
Donald Knuth—the 1974 Turing Award winner, the inventor of TeX (the underlying system that was used to typeset virtually all scholarly materials in computer science—and this book!), and a genius of expository writing about computer science in general and algorithms in particular—describes the history of the Euclidean algorithm (among many other things!) in The Art of Computer Programming,1 his own modern-day version of a multivolume opus covering the fundamentals of computer science, particularly algorithms, programming, and proofs.
Among the fascinating things that Knuth points out about the Euclidean algorithm is that Euclid’s “proof” of correctness only handles the case of up to three iterations of the algorithm—because, Knuth argues, Euclid predated the idea of mathematical induction by hundreds of years. (And Euclid’s version of the algorithm is quite hard to read, in part because Euclid didn’t have a notion of zero, or the idea that 1 is a divisor of any positive integer n.)

1 Donald E. Knuth. The Art of Computer Programming: Seminumerical Algorithms (Volume 2). Addison-Wesley Longman, 3rd edition, 1997.

“Knuth” rhymes with “Duluth” (a city in Minnesota that Minnesotans make fun of for having harsh weather): the “K” is pronounced.

Here are three small examples of the Euclidean algorithm in action:
Example 7.6 (GCDs using the Euclidean Algorithm)
Let’s compute the GCD of 17 and 42.
Euclid(17, 42) = Euclid(42 mod 17, 17)     42 mod 17 = 8 ≠ 0, so we’re in the else case.
               = Euclid(17 mod 8, 8)       17 mod 8 = 1 ≠ 0, so we’re in the else case again.
               = 1.                        8 mod 1 = 0, so we’re done, and we return 1.
Indeed, the only positive integer that divides both 17 and 42 is 1, so gcd(17, 42) = 1.
Here’s another example, for 48 and 1024:
Euclid(48, 1024) = Euclid(1024 mod 48, 48)     1024 mod 48 = 16 ≠ 0, so we’re in the else case.
                 = 16.                         48 mod 16 = 0, so we return 16.
And here’s one last example (written more compactly), for 91 and 287:
Euclid(91, 287) = Euclid(287 mod 91, 91) = Euclid(91 mod 14, 14) = 7.
(Here 287 mod 91 = 14 and 91 mod 14 = 7.)
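The pseudocode in Figure 7.2 carries over to Python nearly line for line, and running it reproduces the three computations in Example 7.6. This is a minimal sketch: the lowercase name euclid and the explicit precondition check are illustrative additions. The recursive call respects the same precondition, because in the else case 1 ≤ m mod n < n.

```python
def euclid(n, m):
    """The Euclidean algorithm of Figure 7.2, for positive integers with n <= m."""
    if not 1 <= n <= m:
        raise ValueError("need positive integers with n <= m")
    if m % n == 0:
        return n
    return euclid(m % n, n)  # here 1 <= m % n < n, so the precondition still holds


print(euclid(17, 42))    # 1
print(euclid(48, 1024))  # 16
print(euclid(91, 287))   # 7
```

Each recursive call replaces the pair (n, m) with (m mod n, n), so the smaller argument strictly decreases and the recursion must terminate.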
Before we try to prove the correctness of the Euclidean algorithm, let’s spend a few moments on the intuition behind it. The basic idea is that any common divisor of two numbers must also evenly divide their difference. For example, does 7 divide both 63 and 133? If so, then it would have to be the case that 7 | 63 and that 7 also divides the “gap” between 133 and 63. (That’s because 63 = 7 · 9, and if 7k = 133, then 7(k − 9) = 133 − 63.) More generally, suppose that d is a common divisor of n and m ≥ n. Then it must be the case that d divides m − cn, for any integer c where cn < m. In particular, d divides m − ⌊m/n⌋ · n; that is, d divides m mod n. (We’ve only argued that if d is a common divisor of n and m then d must also divide m mod n, but actually the converse holds too; we’ll formalize this fact in the proof.) See Figure 7.3 for a visualization of this idea.

Figure 7.3: The intuition behind the Euclidean algorithm: d is a common divisor of 63 and 133 if and only if d also divides 133 − 63 and 133 − 63 · 2 = 133 − 126. Indeed d = 7 is a common divisor of 63 and 133, but 9 is not (because 9 does not divide 133 − 126 = 7).

Making the intuition formal
We will now make this intuition formal, and give a full proof of the correctness of the Euclidean algorithm: that is, we will establish that Euclid(n, m) = gcd(n, m) for any positive integers n and m ≥ n, with a proof by induction. There’s a crucial lemma that we’ll need to prove first, based on the intuition we just described: we need to show that for any n and m ≥ n where m mod n ≠ 0, we have gcd(n, m) = gcd(n, m mod n). We will prove this fact by proving that the common divisors of {n, m} are identical to the common divisors of {n, m mod n}. (Thus the greatest common divisor of these two pairs of integers will be identical.)

Lemma 7.5 (When n ∤ m, the same divisors of n divide m and m mod n)
Let n and m be positive integers such that n ≤ m and n ∤ m. Let d | n be an arbitrary divisor of n. Then d | m if and only if d | (m mod n).

Here’s a concrete example before we prove the lemma:

Example 7.7 (An example of Lemma 7.5)
Consider n = 42 and m = 98. Then n ≤ m and n ∤ m, as Lemma 7.5 requires. The divisors of 42 are {1, 2, 3, 6, 7, 14, 21, 42}. Of these divisors, the ones that also divide 98 are {1, 2, 7, 14}.
The lemma claims that the common divisors of 42 and 98 mod 42 = 14 are also precisely {1, 2, 7, 14}. And they are: because 14 | 42, all divisors of 14—namely, 1, 2, 7, and 14—are common divisors of 14 and 42.

Proof of Lemma 7.5. By the assumption that d | n, we know that there’s an integer a such that n = ad. Let r := m mod n, so that m = cn + r for an integer c (as guaranteed by Theorem 7.1). We must prove that d | m if and only if d | r.
For the forward direction, suppose that d | m. (We must prove that d | r.) By definition, there exists an integer b such that m = bd. But n = ad and m = bd, so
m = cn + r  ⇔  bd = c(ad) + r  ⇔  r = (b − ac)d
for integers a, b, and c. Thus r is a multiple of d, and therefore d | r.
For the converse, suppose that d | r. (We must prove that d | m.) By definition, we have that r = bd for some integer b. But then n = ad and r = bd, so
m = cn + r = c(ad) + bd = (ac + b)d
for integers a, b, and c. Thus d | m.

Corollary 7.6
Let n and m ≥ n be positive integers where n ∤ m. Then gcd(n, m) = gcd(m mod n, n).

Proof. Lemma 7.5 establishes that the set of common divisors of ⟨n, m⟩ is identical to the set of common divisors of ⟨n, m mod n⟩. Therefore the maxima of these two sets of divisors—that is, gcd(n, m) and gcd(m mod n, n)—are also equal.

Putting it together: the correctness of the Euclidean algorithm
Using this corollary, we can now prove the correctness of the Euclidean algorithm:

Theorem 7.7 (Correctness of the Euclidean algorithm)
For arbitrary positive integers n and m with n ≤ m, we have Euclid(n, m) = gcd(n, m).

Proof. We’ll proceed by strong induction on n, the smaller input. Define the property
P(n) := for any m ≥ n, we have Euclid(n, m) = gcd(n, m).
We’ll prove that P(n) holds for all integers n ≥ 1.
Base case (n = 1): P(1) follows because both gcd(1, m) = 1 and Euclid(1, m) = 1: for any m, the only positive integer divisor of 1 is 1 itself (and indeed 1 | m), and thus gcd(1, m) = 1. Observe that Euclid(1, m) = 1, too, because m mod 1 = 0 for any m.
Inductive case (n ≥ 2): We assume the inductive hypotheses (that P(n′) holds for any 1 ≤ n′ < n) and must prove P(n). Let m ≥ n be arbitrary. There are two subcases, based on whether n | m or n ∤ m:
• If n | m (that is, if m = cn for an integer c), then m mod n = 0 and thus, by inspection of the algorithm, Euclid(n, m) = n. Because n | n (and there is no d > n that divides n evenly), indeed n is the GCD of n and m = cn.
• If n ∤ m (that is, if m mod n ≠ 0), then
    Euclid(n, m) = Euclid(m mod n, n)    by inspection of the algorithm
                 = gcd(m mod n, n)       by the inductive hypothesis P(m mod n)
                 = gcd(n, m).            by Corollary 7.6
  Note that (m mod n) ≤ n − 1 by the definition of mod (anything mod n is less than n), so we can invoke the inductive hypothesis P(m mod n) in the second step of this proof.
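As a sanity check on Theorem 7.7, here is a minimal Python sketch. It assumes the algorithm has the shape used in the proof above (return n when m mod n = 0, and otherwise recurse on ⟨m mod n, n⟩); the function names euclid and gcd_brute_force are illustrative, not taken from the text. The loop compares the two functions on small inputs, which is evidence for the theorem rather than a proof of it.

```python
def euclid(n: int, m: int) -> int:
    """Euclidean algorithm for positive integers n <= m."""
    r = m % n
    if r == 0:
        return n            # n divides m, so gcd(n, m) = n
    return euclid(r, n)     # Corollary 7.6: gcd(n, m) = gcd(m mod n, n)


def gcd_brute_force(n: int, m: int) -> int:
    """Largest d in {1, ..., min(n, m)} that divides both n and m."""
    return max(d for d in range(1, min(n, m) + 1) if n % d == 0 and m % d == 0)


# Spot-check the claim of Theorem 7.7 on all pairs 1 <= n <= m <= 60.
for n in range(1, 61):
    for m in range(n, 61):
        assert euclid(n, m) == gcd_brute_force(n, m)

print(euclid(63, 133))  # the pair from Figure 7.3; prints 7
```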
Theorem 7.7 establishes the correctness of the Euclidean algorithm, but we introduced this algorithm because the brute-force algorithm (simply testing every candidate divisor d) was too slow. Indeed, the Euclidean algorithm is very efficient:

Theorem 7.8 (Efficiency of Euclidean Algorithm)
For arbitrary positive integers n and m with n ≤ m, the recursion tree of Euclid(n, m) has depth at most log n + log m.

(The ability to efficiently compute gcd(n, m) using the Euclidean algorithm, at least assuming we use the efficient algorithm to compute m mod n from Exercises 7.11–7.16, will be crucial in the RSA cryptographic system in Section 7.5.) You’ll prove Theorem 7.8 by induction in Exercise 7.34, and you’ll show that the recursion tree can be as deep as Ω(log n + log m), using the Fibonacci numbers, in Exercise 7.37.

Problem-solving tip: In Theorem 7.8, it’s not obvious what quantity upon which to perform induction; after all, there are two input variables, n and m. It is often useful to combine multiple inputs into a single “measure of progress” toward the base case, perhaps performing induction on the quantity n + m or the quantity n · m.
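The bound in Theorem 7.8 can also be explored empirically. The sketch below is an illustration only: it interprets “depth” as the number of recursive calls made after the initial call (an assumption, since the text does not fix the convention here), and the name recursive_calls is hypothetical. It checks the bound on consecutive Fibonacci numbers, the inputs that Exercise 7.37 points to as slow cases.

```python
import math


def recursive_calls(n: int, m: int) -> int:
    """Number of recursive calls Euclid(n, m) makes after the initial call (n <= m)."""
    calls = 0
    while m % n != 0:
        n, m = m % n, n     # the recursive call Euclid(m mod n, n)
        calls += 1
    return calls


# Consecutive Fibonacci numbers 1, 2, 3, 5, 8, ...
fib = [1, 2]
while len(fib) < 40:
    fib.append(fib[-1] + fib[-2])

for a, b in zip(fib, fib[1:]):
    # Small tolerance guards against floating-point rounding in log2.
    assert recursive_calls(a, b) <= math.log2(a) + math.log2(b) + 1e-9

print(recursive_calls(fib[-2], fib[-1]))
```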

Computer Science Connections
Converting Between Bases, Binary Representation, and Generating Strings
For a combination of historical and anatomical reasons (we have ten fingers and ten toes!), we generally use a base ten, or decimal, system to represent numbers. Moving from right to left, there’s a 1’s place, a 10’s place, a 100’s place, and so forth; thus 2048 denotes 8 · 1 + 4 · 10 + 0 · 100 + 2 · 1000. This representation is an example of a positional system, in which each place/position has a value, and the symbol in that position tells us how many of that value the number has. Some ancient cultures used non-decimal positional systems, some of which survive to the present day: for example, the Sumerians and Babylonians used a base 60 system, and, even today, 60 seconds make a minute, and 60 minutes make an hour.
In general, to represent a number n in base b ≥ 2, we write a sequence of elements of {0, 1, …, b − 1}, say [d_k d_{k−1} ··· d_2 d_1 d_0]_b. (We’ll write the base explicitly as a subscript, for clarity.) Moving from right to left, the ith position is “worth” b^i, so this number’s value is ∑_{i=0}^{k} b^i d_i. For example,
    [1234]_5 = 4 · 5^0 + 3 · 5^1 + 2 · 5^2 + 1 · 5^3 = 4 + 15 + 50 + 125 = 194
    [1234]_8 = 4 · 8^0 + 3 · 8^1 + 2 · 8^2 + 1 · 8^3 = 4 + 24 + 128 + 512 = 668.
We can use modular arithmetic to quickly convert from one base to another. For simplicity, we’ll describe how to convert from base 10 into an arbitrary base b, though it’s not that much harder to convert from an arbitrary base instead. To start, notice that (∑_{i=0}^{k} b^i d_i) mod b = d_0. (The value b^i d_i is divisible by b for any i ≥ 1.) Therefore, to represent n in base b, we must have d_0 := n mod b. Similarly, (∑_{i=0}^{k} b^i d_i) mod b^2 = b d_1 + d_0; thus we must choose d_1 := ((n − d_0)/b) mod b. (Note that n − d_0 must be divisible by b, because of our choice of d_0.) An algorithm following this strategy is shown in Figure 7.4. (We could also have written this algorithm without using division; see Exercise 7.5.) For example, to convert 145 to binary (base 2), we execute baseConvert(145, 2). Here are the values of n, i, and d_i in each iteration:

    n                 145   72   36   18    9    4    2    1    0
    i                   0    1    2    3    4    5    6    7    8
    d_i := n mod 2      1    0    0    0    1    0    0    1    -

Thus 145 can be written as [10010001]_2.
We can use the base conversion algorithm in Figure 7.4 to convert decimal numbers (base 10) into binary (base 2), the internal representation in computers. Or we can convert into octal (base 8) or hexadecimal (base 16), two other frequently used representations for numbers in programming. But we can also use baseConvert for seemingly unrelated problems. Consider the task of enumerating all 4-letter strings from the alphabet. The “easy” way to write a program to accomplish this task, with four nested loops, is painful to write, and it becomes utterly unwieldy if we needed all 10-letter strings instead.
But, instead, let’s count from 0 up to 26^4 − 1 (there are 26^4 different 4-letter strings) and convert each number into base 26. We can then translate each number into a sequence of letters, with the ith digit acting as an index into the alphabet that tells us which letter to put in position i. See Figure 7.5.
Latin: decim “ten.” Note that digit is ambiguous in English between “place in a number” and “finger or toe.”
baseConvert(n, b):
Input: integers n and b ≥ 2
Output: n, represented in base b
    i := 0
    while n > 0:
        d_i := n mod b
        n := (n − d_i)/b
        i := i + 1
    return [d_{i−1} d_{i−2} ··· d_1 d_0]_b
Figure 7.4: Base conversion algorithm, from base 10 to base b.
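The following Python sketch mirrors the loop of Figure 7.4; it is one possible rendering, not the book’s own code. The names base_convert and base_value are illustrative, and returning the digits as a list (most significant first) is an assumption made for convenience.

```python
def base_convert(n: int, b: int) -> list[int]:
    """Digits of n >= 0 in base b >= 2, most significant first (after Figure 7.4)."""
    digits = []                 # digits[i] plays the role of d_i
    while n > 0:
        d = n % b               # d_i := n mod b
        digits.append(d)
        n = (n - d) // b        # exact, because n - d is divisible by b
    return digits[::-1]         # reverse so that d_0 comes last


def base_value(digits: list[int], b: int) -> int:
    """Inverse direction: the value of [d_k ... d_1 d_0] in base b."""
    value = 0
    for d in digits:
        value = value * b + d
    return value


print(base_convert(145, 2))         # [1, 0, 0, 1, 0, 0, 0, 1]
print(base_value([1, 2, 3, 4], 5))  # 194
```

Note that this sketch returns an empty list for n = 0, matching the pseudocode’s loop, which never executes in that case.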
    n in base 10      base 26               string
    0              →  [0 0 0 0]_26       →  AAAA
    1              →  [0 0 0 1]_26       →  AAAB
    2              →  [0 0 0 2]_26       →  AAAC
    ⋮
    25             →  [0 0 0 25]_26      →  AAAZ
    26             →  [0 0 1 0]_26       →  AABA
    27             →  [0 0 1 1]_26       →  AABB
    ⋮
    1234           →  [0 1 21 12]_26     →  ABVM
    ⋮
    456,974        →  [25 25 25 24]_26   →  ZZZY
    456,975        →  [25 25 25 25]_26   →  ZZZZ
Figure 7.5: Generating all 4-letter strings. For each n = 0, n = 1, …, n = 456,975, we convert n to a number in base 26; we then interpret each digit [i]_26 ∈ {0, 1, …, 25} as an element of {A, B, …, Z}.
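The string-generation idea of Figure 7.5 can be sketched in a few lines of Python. The function names nth_string and all_strings are hypothetical, and the sketch uses Python’s built-in divmod to peel off base-26 digits rather than calling a separate baseConvert routine.

```python
import string


def nth_string(n: int, length: int = 4) -> str:
    """The n-th string over A..Z of the given length, reading n in base 26."""
    letters = []
    for _ in range(length):
        n, d = divmod(n, 26)                        # peel off the least significant digit
        letters.append(string.ascii_uppercase[d])   # digit d indexes into the alphabet
    return "".join(reversed(letters))


def all_strings(length: int = 4):
    """Yield every string of the given length, in the order of Figure 7.5."""
    for n in range(26 ** length):
        yield nth_string(n, length)


print(nth_string(0), nth_string(27), nth_string(1234), nth_string(456_975))
# AAAA AABB ABVM ZZZZ
```

Changing length to 10 requires no new loops, which is the point of the counting approach.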

7.2.5 Exercises
Using paper and pencil only, follow the proof of Theorem 7.1 or use mod-and-div (see Figure 7.6a) to compute
integers r ∈ {0, 1, . . . , k − 1} and d such that kd + r = n, for:
7.1 k = 17, n = 202
7.2 k = 99, n = 2017
7.3 k = 99, n = −2017
7.4 When we proved Theorem 7.1, we showed that for integers k ≥ 1 and n, there exist integers r and d such that 0 ≤ r < k and kd + r = n. We stated but did not prove that r and d are unique. Prove that they are; that is, prove the following, for any integers k ≥ 1, n, r, d, r′, and d′: if 0 ≤ r […]
[…] 6 is a repdigit_b for some base b ≥ 2, where n = [22···2]_b.
7.8 Prove that no odd number n is a repdigit_b of the form [22 ··· 2]_b, for any base b.
7.9 Write R(n) to denote the number of bases b, for 2 ≤ b ≤ n − 1, such that n is a repdigit_b. Conjecture a condition on n such that R(n) = 1, and prove your conjecture.
Recall the mod-and-div(n, k) algorithm, reproduced in Figure 7.6(a), that computes n mod k and ⌊n/k⌋ by repeatedly subtracting k from n until the result is less than k.
7.10 As written, the mod-and-div algorithm fails when given a negative value of n. Follow Case II of Theorem 7.1’s proof to extend the algorithm for n < 0 too.
The mod-and-div algorithm is slow: this algorithm computes an integer d such that kd ≤ n < k(d + 1) by performing linear search for d. A faster version of this algorithm, called mod-and-div-faster, finds d using binary search instead; see Figure 7.6(b).
7.11 The code for mod-and-div-faster as written uses division, by averaging lo and hi. Modify the algorithm so that it uses only addition, subtraction, multiplication, and comparison.
7.12 The code for mod-and-div-faster as written uses hi := n + 1 as the initial upper bound. Why is this assignment acceptable for the correctness of the algorithm? Explain briefly.
7.13 Describe an algorithm that finds a better upper bound hi, by repeatedly doubling hi until it’s large enough.
7.14 Let k be arbitrary. Describe an input n for which the doubling search from the last exercise yields a significant improvement on the running time of the algorithm for inputs k and n.
7.15 (programming required) Implement, in a programming language of your choice, all three of these algorithms (mod-and-div, mod-and-div-faster, and the doubling-search tweaked version of mod-and-div-faster from the previous exercises) to compute n mod k and ⌊n/k⌋.
7.16 Run the three algorithms from the previous exercise to compute the following values: 2^32 mod 202, 2^32 mod 2020, and 2^32 mod 315. How do their speeds compare?

(a) mod-and-div(n, k):
Input: integers n ≥ 0 and k ≥ 1
Output: n mod k and ⌊n/k⌋
    r := n; d := 0
    while r ≥ k:
        r := r − k; d := d + 1
    return r, d

(b) mod-and-div-faster(n, k):
Input: integers n ≥ 0 and k ≥ 1
Output: n mod k and ⌊n/k⌋
    lo := 0; hi := n + 1
    while lo < hi − 1:
        mid := ⌊(lo + hi)/2⌋
        if mid · k ≤ n then lo := mid
        else hi := mid
    return (n − k · lo), lo

Figure 7.6: A reminder of the algorithm to compute n mod k and ⌊n/k⌋, and a faster version.

7.17 Prove (7.3.2): for integers k > 0, a, and b, we have a + b mod k = [(a mod k) + (b mod k)] mod k. Begin your proof as follows: We can write a = ck + r and b = dk + t for r, t ∈ {0, …, k − 1} (as guaranteed by Theorem 7.1). Then use mod-and-div and Lemma 7.2.
Prove the following properties of modular arithmetic and divisibility, for any positive integers a, b, and c:
7.18 a mod b = (a mod bc) mod b
7.19 (7.4.1): a | 0
7.20 (7.4.2): 1 | a
7.21 (7.4.6): if a | b and a | c, then a | (b + c).
7.22 (7.4.7): if a | c, then a | bc.

Consider the “repeated squaring” algorithm for modular exponentiation shown in Figure 7.7. Observe that this algorithm computes b^e mod n with a recursion tree of depth Θ(log e).
7.23 Use this algorithm to compute 3^80 mod 5 without using a calculator. (You should never have to keep track of a number larger than 5 except for the exponent itself when you’re doing these calculations!)
7.24 Write down a recurrence relation representing the number of multiplications done by mod-exp(b, e, n). Prove, using this recurrence, that the number of multiplications done is between log e and 2 log e.
7.25 (programming required) Implement mod-exp in a programming language of your choice. Also implement a version of mod-exp that computes b^e and then, after that computation is complete, takes the result mod n. Compare the speeds of these two algorithms in computing 3^k mod 5, for k = 80, k = 800, k = 8000, …, k = 8,000,000. Explain.
There’s a category of numerical tricks often called “divisibility rules” that you may have seen: quick ways of testing whether a given number is evenly divisible by some small k. The test for whether an integer n is divisible by 3 is this: add up the digits of n; n is divisible by 3 if and only if this sum is divisible by 3. For example, 6,007,023 is divisible by 3 because 6 + 0 + 0 + 7 + 0 + 2 + 3 = 18, and 3 | 18. (Indeed 3 · 2,002,341 = 6,007,023.) This test relies on the following claim: for any sequence ⟨x_0, x_1, …, x_{n−1}⟩ ∈ {0, 1, …, 9}^n, we have
    [∑_{i=0}^{n−1} 10^i x_i] mod 3 = [∑_{i=0}^{n−1} x_i] mod 3.
(For example, 6,007,023 is represented as x_0 = 3, x_1 = 2, x_2 = 0, x_3 = 7, x_4 = 0, x_5 = 0, and x_6 = 6.)
7.26 Prove that the test for divisibility by 3 is correct. First prove that 10^i mod 3 = 1 for any integer i ≥ 0; then prove the stated claim. Your proof should make heavy use of the properties in Theorem 7.3.
7.27 The divisibility test for 9 is to add up the digits of the given number, and test whether that sum is divisible by 9. State and prove the condition that ensures that this test is correct.
Using paper and pencil only, use the Euclidean algorithm to compute the GCDs of the following pairs of numbers:
7.28 n = 111, m = 202
7.29 n = 333, m = 2017
7.30 n = 156, m = 360
7.31 (programming required) Implement the Euclidean algorithm in a language of your choice.
7.32 (programming required) Early in Section 7.2.4, we discussed a brute-force algorithm to compute gcd(n, m): try all d ∈ {1, 2, …, min(n, m)} and return the largest d such that d | n and d | m. Implement this algorithm, and compare its performance to the Euclidean algorithm as follows: for both algorithms, find the largest n for which you can compute gcd(n, n − 1) in less than 1 second on your computer.
Let’s analyze the running time of the Euclidean algorithm for GCDs, to prove Theorem 7.8.
mod-exp(b, e, n):
Input: integers n ≥ 1, b, and e ≥ 0
Output: b^e mod n
    if e = 0 then
        return 1
    else if e is even then
        result := mod-exp(b, e/2, n)
        return (result · result) mod n
    else
        result := mod-exp(b, e − 1, n)
        return (b · result) mod n
Figure 7.7: Modular exponentiation via repeated squaring.
7.33 Let n and m be arbitrary positive integers where n ≤ m. Prove that m mod n ≤ m/2. (Hint: what happens if n ≤ m/2? What happens if m/2 […]
[…] n is prime. An algorithm called the Sieve of Eratosthenes, which computes a list of all prime numbers up to a given integer, exploits this redundancy to save some computation. The Sieve generates its list of prime numbers by successively eliminating (“sieving”) all multiples of each discovered prime: for example, once we know that 2 is prime and that 4 is a multiple of 2, we will never have to test whether 4 | n in determining whether n is prime. (If n isn’t prime because 4 | n, then n is also divisible by 2; that is, 4 is never the smallest integer greater than 1 that evenly divides n, so we never have to bother testing 4 as a candidate divisor.) See Exercises 7.38–7.42 and Figure 7.15.
Taking it further: The Sieve of Eratosthenes is one of the earliest known algorithms, dating back to about 200 bce. (The date isn’t clear, in part because none of Eratosthenes’s work survived; the algorithm was reported, and attributed to Eratosthenes, by Nicomachus about 300 years later.) The Euclidean algorithm for greatest common divisors from Section 7.2, which dates from c. 300 bce, is one of the few older algorithms that are known.2
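To make the sieving idea concrete, here is a minimal Python sketch of the Sieve of Eratosthenes. It is not the version in Figure 7.15 (which is not reproduced here); the function name sieve and the choice to start crossing out at p · p (every smaller multiple of p has a smaller prime factor and was already eliminated) are assumptions of this sketch.

```python
def sieve(limit: int) -> list[int]:
    """All primes <= limit, by the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_candidate = [True] * (limit + 1)
    is_candidate[0] = is_candidate[1] = False
    for p in range(2, limit + 1):
        if is_candidate[p]:                          # p survived, so p is prime
            for multiple in range(p * p, limit + 1, p):
                is_candidate[multiple] = False       # sieve out multiples of p
    return [q for q in range(2, limit + 1) if is_candidate[q]]


print(sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```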
The distribution of the primes
For a positive integer n, let primes(n) denote the number of prime numbers less than
or equal to n. Thus, for example, we have
0 = primes(1)
1 = primes(2)
2 = primes(3) = primes(4)
3 = primes(5) = primes(6), and
4 = primes(7) = primes(8) = primes(9) = primes(10).
Or, to state it recursively: we have primes(1) := 0, and, for n ≥ 2, we have
    primes(n) := primes(n − 1)        if n is composite
                 1 + primes(n − 1)    if n is prime.
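The recursive definition of primes(n) translates directly into a table-filling loop. In the sketch below, the helper is_prime uses simple trial division and primes_table is a hypothetical name; both are illustrative choices, suitable only for small n.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def primes_table(limit: int) -> list[int]:
    """table[n] = primes(n) for 1 <= n <= limit (table[0] is unused padding)."""
    table = [0, 0]                        # primes(1) := 0
    for n in range(2, limit + 1):
        # primes(n) is primes(n - 1), plus 1 exactly when n is prime
        table.append(table[n - 1] + (1 if is_prime(n) else 0))
    return table


print(primes_table(10)[1:])  # [0, 1, 2, 2, 3, 3, 4, 4, 4, 4]
```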
Figure 7.8(a) displays the value of primes(n) for moderately small n. An additional fact that we’ll state without proof is the Prime Number Theorem (illustrated in Figure 7.8(b)), which describes the behavior of primes(n) for large n:

Theorem 7.9 (Prime Number Theorem)
As n gets large, the ratio between primes(n) and n/ln n approaches 1.

Formal proofs of the Prime Number Theorem are complicated beasts (far more complicated than we’ll want to deal with here!), but even an intuitive understanding of the theorem is useful. Informally, this theorem says that, given an integer n, approximately a 1/ln n fraction of the numbers “close to” n are prime. (See Exercise 7.45.)

The Sieve of Eratosthenes is named after Eratosthenes, a Greek mathematician who lived in the 3rd century bce. For more, see
2 Donald E. Knuth. The Art of Computer Programming: Seminumerical Algorithms (Volume 2). Addison-Wesley Longman, 3rd edition, 1997.

[Figure 7.8(a): a plot of n vs. primes(n) := |{q ≤ n : q is prime}|, for 0 ≤ n ≤ 2500.]
Example 7.8 (Using the Prime Number Theorem)
Problem: Using the estimate primes(n) ≈ n/ln n, calculate (approximately) how many
10-digit integers are prime.
Solution: By definition, there are exactly primes(999,999,999) primes with 9 or fewer digits, and primes(9,999,999,999) primes with 10 or fewer digits. Thus the number of 10-digit primes is
    primes(9,999,999,999) − primes(999,999,999) ≈ 9,999,999,999/ln 9,999,999,999 − 999,999,999/ln 999,999,999
                                                ≈ 434,294,499 − 48,254,956 = 386,039,543.
Thus, roughly 386 million of the 9 billion 10-digit numbers (about 4.3%) are prime. (Exercise 7.46 asks you to consider how far off this estimate is.)
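The back-of-the-envelope calculation in Example 7.8 is easy to reproduce. The sketch below uses Python’s math.log (the natural logarithm); the name primes_estimate is illustrative. Its output may differ from the rounded figures in the example in the last few digits, which does not affect the conclusion of roughly 386 million.

```python
import math


def primes_estimate(n: int) -> float:
    """The Prime Number Theorem's approximation n / ln n to primes(n)."""
    return n / math.log(n)


ten_digit_estimate = primes_estimate(9_999_999_999) - primes_estimate(999_999_999)
share = ten_digit_estimate / 9_000_000_000   # there are 9 billion 10-digit numbers

print(f"{ten_digit_estimate:,.0f} primes, about {share:.1%} of all 10-digit numbers")
```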
The density of the primes is potentially interesting for its own sake, but there’s also a practical reason that we’ll care about the Prime Number Theorem. In the RSA cryptosystem (see Section 7.5), one of the first steps of the protocol involves choosing two large prime numbers p and q. The bigger p and q are, the more secure the encryption, so we would want p and q to be pretty big: say, both approximately 2^2048. The Prime Number Theorem tells us that, roughly, one out of every ln 2^2048 ≈ 1420 integers around 2^2048 is prime. Thus, we can find a prime in this range by repeatedly choosing a random integer n of the right size and testing n for primality, using some efficient primality testing algorithm. (More about testing algorithms soon.) Approximately one out of every 1420 integers we try will turn out to be prime, so on average we’ll only need to try about 2840 values of n before we find primes to use as p and q.
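The figures 1420 and 2840 in this estimate follow from a one-line computation, sketched here in Python. The variable names are illustrative, and the sketch only reproduces the arithmetic; it does not search for primes.

```python
import math

bits = 2048
# ln(2**2048) = 2048 * ln 2; computed this way to avoid converting 2**2048 to a float.
gap = bits * math.log(2)

# Finding two primes (p and q) takes about twice as many tries as finding one.
expected_tries = 2 * round(gap)

print(round(gap), expected_tries)  # 1420 2840
```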
[Figure 7.8(b): a plot of n vs. the ratio of n/ln n and primes(n), for 0 ≤ n ≤ 2500.]
Figure 7.8: The distribution of primes. The Prime Number Theorem states that the ratio primes(n)/(n/ln n), in (b), converges (slowly!) to 1.

Problem-solving tip: Back-of-the-envelope calculations are often great as plausibility checks: although the Prime Number Theorem doesn’t state a formal bound on how different primes(n) and n/ln n are, you can see whether a solution to a problem “smells right” with an approximation like this one.
720 CHAPTER 7. NUMBER THEORY
Prime factorization
Recall that any integer can be factored into the product of primes. For example, we can write 2001 = 3 · 23 · 29 and 202 = 2 · 101 and 507 = 3 · 13 · 13 and 55057 = 55057. (All of {2, 3, 13, 23, 29, 101, 55057} are prime.) The Fundamental Theorem of Arithmetic (Theorem 5.5) states that any integer n can be factored into a product of primes, and that, up to reordering, there is a unique prime factorization of n. (In other words, any two prime factorizations of an integer n can differ in the ordering of the factors, for example 202 = 101 · 2 and 202 = 2 · 101, but they can differ only in ordering.) We proved the “there exists” part of the theorem in Example 5.12 using induction; a bit later in this section, we’ll prove uniqueness. (The proof uses some properties of prime numbers that are most easily seen using an extension of the Euclidean algorithm that we’ll introduce shortly; we’ll defer the proof until we’ve established those properties.)
Relative primality
An integer n is prime if it has no divisors except 1 and n itself. Here we will introduce a related concept for pairs of integers: two numbers that do not share any divisors except 1.

Definition 7.7 (Relative primality)
Two positive integers n and m are called relatively prime if gcd(n, m) = 1; that is, if 1 is the only positive integer that evenly divides both n and m.

Here are a few small examples:

Example 7.9 (Some relatively prime integers)
The integers 21 and 25 are relatively prime, as 21 = 3 · 7 and 25 = 5 · 5 have no common divisor (other than 1). Similarly, 5 and 6 are relatively prime, as are 17 and 35. (But 12 and 21 are not relatively prime, because they’re both divisible by 3.)

There will be a number of useful facts about relatively prime numbers that you’ll prove in the exercises: for example, a prime number p and any integer n are relatively prime unless p | n; and, more generally, two numbers are relatively prime if and only if their prime factorizations do not share any factors.

Taking it further: Let f(x) be a polynomial. One of the special characteristics of prime numbers is that f(x) has some special properties when we evaluate f(x) normally, or if we take the result of evaluating the polynomial mod p for some prime number p. In particular, if f(x) is a polynomial of degree k, then either f(a) ≡_p 0 for every a ∈ {0, 1, …, p − 1} or there are at most k values a ∈ {0, 1, …, p − 1} such that f(a) ≡_p 0. (We saw this property in Section 2.5.3 when we didn’t take the result modulo the prime p.) As a consequence, if we have two polynomials f(x) and g(x) of degree k, then if f and g are not equivalent modulo p, then there are at most k values of a ∈ {0, 1, …, p − 1} for which f(a) ≡_p g(a).
We can use the fact that polynomials of degree k “behave” in the same way modulo p (with respect to the number of roots, and the number of places that two polynomials agree) to give efficient solutions to two problems: secret sharing, in which n people wish to “distribute” shares of a secret so that any k of them can reconstruct the secret (but no set of k − 1 can); and a form of error-correcting codes, as we discussed in Section 4.2. The basic idea will be that by using a polynomial f(x) and evaluating f(x) mod p for a prime p, we’ll be able to use small numbers (less than p) to accomplish everything that we’d be able to accomplish by evaluating f(x) without the modulus. See the discussions of secret sharing on p. 730 and of Reed–Solomon codes on p. 731.
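Definition 7.7 reduces relative primality to a single GCD computation, so the pairs in Example 7.9 can be checked mechanically. This sketch uses math.gcd from Python’s standard library; the function name relatively_prime is illustrative.

```python
from math import gcd


def relatively_prime(n: int, m: int) -> bool:
    """Definition 7.7: n and m are relatively prime when gcd(n, m) = 1."""
    return gcd(n, m) == 1


for pair in [(21, 25), (5, 6), (17, 35), (12, 21)]:
    print(pair, relatively_prime(*pair))
# (21, 25) True
# (5, 6) True
# (17, 35) True
# (12, 21) False
```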

7.3.2 A Structural Fact and the Extended Euclidean Algorithm
Given an integer n ≥ 2, quickly determining whether n is prime seems tricky: we’ve seen some easy algorithms for this problem, but they’re pretty slow. And, though there are efficient but complicated algorithms for primality testing, we haven’t seen (and, really, nobody knows) a genuinely simple algorithm that’s also efficient. On the other hand, the analogous question about relative primality—given integers n and m, are n and m relatively prime?—is easy. In fact, we already know everything we need to solve this problem efficiently, just from the definition: n and m are relatively prime if and only if their GCD is 1, which occurs if and only if Euclid(n, m) = 1. So we can efficiently test whether n and m are relatively prime by testing whether Euclid(n, m) = 1.
We will start this section with a structural property about GCDs. (Right now it shouldn’t be at all clear what this claim has to do with anything in the last paragraph, but stick with it! The connection will come along soon.) Here’s the claim:

Lemma 7.10 (There are multiples of n and m that add up to gcd(n, m))
Let n and m be any positive integers, and let r = gcd(n, m). Then there exist integers x and y such that xn + ym = r.

Here are a few examples of the multiples guaranteed by this lemma:

Example 7.10 (Some examples of Lemma 7.10)
In Example 7.9, we saw that {5, 6} and {17, 35} are both relatively prime (that is, gcd(5, 6) = gcd(17, 35) = 1) and that gcd(12, 21) = 3. Also note that gcd(48, 1024) = 16 (from Example 7.6), and gcd(16, 48) = 16. For these pairs, we have:
    (−1) · 5 + 1 · 6        = −5 + 6         = 1  = gcd(5, 6)
    33 · 17 + (−16) · 35    = 561 − 560      = 1  = gcd(17, 35)
    2 · 12 + (−1) · 21      = 24 − 21        = 3  = gcd(12, 21)
    (−21) · 48 + 1 · 1024   = −1008 + 1024   = 16 = gcd(48, 1024)
    1 · 16 + 0 · 48         = 16 + 0         = 16 = gcd(16, 48).
Note that for the second example in the table, the pair {17, 35}, we could have chosen −2 and 1 instead of 33 and −16, as −2 · 17 + 1 · 35 = 1 = 33 · 17 + (−16) · 35.
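The rows of Example 7.10, together with the alternative coefficients for {17, 35}, can be verified with a short Python check. This sketch only confirms the listed values (using math.gcd from the standard library); it does not show how to find such coefficients, which is the job of the Extended Euclidean algorithm.

```python
from math import gcd

# (x, n, y, m) rows from Example 7.10, plus the alternative pair for {17, 35}.
rows = [(-1, 5, 1, 6), (33, 17, -16, 35), (2, 12, -1, 21),
        (-21, 48, 1, 1024), (1, 16, 0, 48), (-2, 17, 1, 35)]

for x, n, y, m in rows:
    assert x * n + y * m == gcd(n, m)        # xn + ym = gcd(n, m)
    print(f"{x}*{n} + {y}*{m} = {gcd(n, m)}")
```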
Note that the integers x and y whose existence is guaranteed by Lemma 7.10 are not necessarily positive! (In fact, in Example 7.10 the only time that we didn’t have a negative coefficient for one of the numbers was for the pair {16, 48}, where gcd(16, 48) = 16 = 1 · 16 + 0 · 48.) Also, observe that there may be more than one pair of values for x and y that satisfy Lemma 7.10; in fact, you’ll show in Exercise 7.58 that there are always infinitely many values of {x, y} that satisfy the lemma.
Although, if you stare at it long enough, Example 7.10 might give a tiny hint about why Lemma 7.10 is true, a proof still seems distant. But, in fact, we’ll be able to prove the claim based on what looks like a digression: a mild extension to the Euclidean algorithm. For a little bit of a hint as to how, let’s look at one more example of the Euclidean algorithm, but interpreting it as a guide to find the integers in Lemma 7.10:

Example 7.11 (An example of Lemma 7.10, using the Euclidean algorithm)
Let’s find integers x and y such that 91x + 287y = gcd(91, 287).
By running Euclid(91, 287), we make the recursive calls Euclid(14, 91) and Euclid(7, 14), which returns 7. Putting these calls into a small table, and using Definition 7.1’s implied equality m = ⌊m/n⌋ · n + (m mod n), slightly rearranged, we have:

    m     n     m mod n   ⌊m/n⌋   m mod n = m − ⌊m/n⌋ · n
    287   91    14        3       14 = 287 − 3 · 91     (1)
    91    14    7         6       7 = 91 − 6 · 14       (2)
    14    7     0

Notice that 7 = gcd(91, 287) = Euclid(91, 287). Using (1) and (2), we can rewrite 7 as:
    7 = 91 − 6 · 14                                       by (2)
      = 91 − 6 · (287 − 3 · 91) = −6 · 287 + 19 · 91.     by (1) and simplification
Thus x := 19 and y := −6 satisfy the requirement that 91x + 287y = gcd(91, 287).
The Extended Euclidean algorithm
The Extended Euclidean algorithm, shown in Figure 7.9, follows the outline of Example 7.11, applying these algebraic manipulations recursively. Lemma 7.10 will follow from a proof that this extended version of the Euclidean algorithm actually computes three integers x, y, r such that gcd(n, m) = r = xn + ym.

extended-Euclid(n, m):
Input: positive integers n and m ≥ n.
Output: x, y, r ∈ Z where gcd(n, m) = r = xn + ym
    if m mod n = 0 then
        return 1, 0, n        // 1 · n + 0 · m = n = gcd(n, m)
    else
        x, y, r := extended-Euclid(m mod n, n)
        return y − ⌊m/n⌋ · x, x, r
Figure 7.9: The Extended Euclidean algorithm.

Here are two examples:

Example 7.12 (Running the Extended Euclidean Algorithm I)
Evaluating extended-Euclid(12, 18) recursively computes extended-Euclid(6, 12) = ⟨1, 0, 6⟩, and then computes its result from ⟨1, 0, 6⟩ and the values of n = 12 and m = 18:
    extended-Euclid(12, 18)
        = y − ⌊m/n⌋ · x, x, r       where x, y, r = extended-Euclid(18 mod 12, 12) and n = 12, m = 18
                                    (because 18 mod 12 ≠ 0, we make a recursive call)
        = 0 − ⌊18/12⌋ · 1, 1, 6     as extended-Euclid(6, 12) = 1, 0, 6 (because 12 mod 6 = 0), so x = 1, y = 0, r = 6
        = −1, 1, 6.
The recursive call returned x = 1, y = 0, and r = 6, and the else case of the algorithm tells us that our result is ⟨y − ⌊m/n⌋ · x, x, r⟩ where m = 18 and n = 12. Plugging these values into the formula for the result, we see that extended-Euclid(12, 18) returns ⟨−1, 1, 6⟩; and, indeed, gcd(12, 18) = 6 and −1 · 12 + 1 · 18 = 6.
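Figure 7.9 translates almost line for line into Python. The sketch below is one such rendering (the name extended_euclid and the tuple return type are choices made here, not taken from the text); Python’s integer division m // n plays the role of ⌊m/n⌋ for positive inputs.

```python
def extended_euclid(n: int, m: int) -> tuple[int, int, int]:
    """Return (x, y, r) with r = gcd(n, m) = x*n + y*m, for positive n <= m."""
    if m % n == 0:
        return 1, 0, n                       # 1*n + 0*m = n = gcd(n, m)
    x, y, r = extended_euclid(m % n, n)      # r = x*(m mod n) + y*n
    return y - (m // n) * x, x, r            # substitute m mod n = m - (m // n)*n


print(extended_euclid(12, 18))   # (-1, 1, 6), as in Example 7.12
print(extended_euclid(91, 287))  # (19, -6, 7), matching Example 7.11
```

The comment on the last line records the algebra behind the return value: replacing m mod n by m − ⌊m/n⌋ · n in r = x · (m mod n) + y · n and collecting terms gives r = (y − ⌊m/n⌋ · x) · n + x · m.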

Example 7.13 (Running the Extended Euclidean Algorithm II)
For a slightly more complicated example, let’s compute extended-Euclid(18, 30):
    extended-Euclid(18, 30)
        = y − ⌊m/n⌋ · x, x, r           where x, y, r = extended-Euclid(30 mod 18, 18) and n = 18, m = 30
        = 1 − ⌊30/18⌋ · (−1), −1, 6     as extended-Euclid(12, 18) = −1, 1, 6 by Example 7.12 (via the call extended-Euclid(18 mod 12, 12) = extended-Euclid(6, 12) = 1, 0, 6), so x = −1, y = 1, r = 6
        = 1 − 1 · (−1), −1, 6
        = 2, −1, 6.
Again, as required, we have gcd(18, 30) = 6 and 2 · 18 + −1 · 30 = 36 − 30 = 6.

We’re now ready to state the correctness of the Extended Euclidean algorithm:

Theorem 7.11 (Correctness of the Extended Euclidean Algorithm)
For arbitrary positive integers n and m with n ≤ m, extended-Euclid(n, m) returns three integers x, y, r such that r = gcd(n, m) = xn + ym.

The proof, which is fairly straightforward by induction, is left to you as Exercise 7.60. And once you’ve proven this theorem, Lemma 7.10 (which merely stated that there exist integers x, y, r with r = gcd(n, m) = xn + ym for any n and m) is immediate.

Problem-solving tip: A nice way, particularly for computer scientists, to prove a theorem of the form “there exists x such that P(x)” is to actually give an algorithm that computes such an x!

Note also that the Extended Euclidean algorithm is an efficient algorithm: you already proved in Exercise 7.34 that the depth of the recursion tree for Euclid(n, m) is upper bounded by O(log n + log m), and the running time of extended-Euclid(n, m) is asymptotically the same as Euclid(n, m). (The only quantity that we need to use in extended-Euclid that we didn’t need in Euclid is ⌊m/n⌋, but we already had to find m mod n in Euclid, so if we used mod-and-div(m, n) to compute m mod n, then we “for free” also get the value of ⌊m/n⌋.)

7.3.3 The Uniqueness of Prime Factorization
Lemma 7.10 (that there are multiples of n and m that add up to gcd(n, m)) and the Extended Euclidean algorithm (which computes those coefficients) will turn out to be helpful in proving some facts that are apparently unrelated to greatest common divisors. Here’s a claim about divisibility related to prime numbers in that vein, which we’ll be able to use to prove that prime factorizations are unique:

Lemma 7.12 (When a prime divides a product)
Let p be prime, and let a and b be integers. Then p | ab if and only if p | a or p | b.

Proof. We’ll proceed by mutual implication.
For the backward direction, assume p | a. (The case for p | b is strictly analogous.) Then a = kp for some integer k, and thus ab = kpb, which is obviously divisible by p.
For the forward direction, assume that p | ab and suppose that p ∤ a. We must show that p | b. Because p is prime and p ∤ a, we know that gcd(p, a) = 1 (see Exercise 7.47), and, in particular, extended-Euclid(p, a) returns the GCD 1 and two integers n and m such that 1 = pm + an. Multiplying both sides by b yields b = pmb + anb, and thus
    b mod p = (pmb + anb) mod p
            = (pmb mod p + anb mod p) mod p     (7.3.2)
            = (0 + anb mod p) mod p             (7.4.7)
            = (0 + 0) mod p                     p | ab by assumption, and (7.4.7) again
            = 0.
That is, we’ve shown that if p ∤ a, then p | b. (And ¬x ⇒ y is equivalent to x ∨ y.)
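Lemma 7.12 really does depend on p being prime, and a small exhaustive check makes that visible. In this Python sketch, the names divides and lemma_holds and the search bound of 50 are arbitrary illustrative choices; a finite check like this one supports the lemma but is no substitute for the proof.

```python
def divides(d: int, n: int) -> bool:
    """True when d evenly divides n."""
    return n % d == 0


def lemma_holds(p: int, bound: int = 50) -> bool:
    """Check 'p | ab iff (p | a or p | b)' for all 1 <= a, b <= bound."""
    return all(
        divides(p, a * b) == (divides(p, a) or divides(p, b))
        for a in range(1, bound + 1)
        for b in range(1, bound + 1)
    )


print(lemma_holds(7))   # True: 7 is prime
print(lemma_holds(6))   # False: 6 | 2*3, yet 6 divides neither 2 nor 3
```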
We can use this fact to prove that an integer’s prime factorization is unique. (We’ll prove only the uniqueness part of the theorem here; see Example 5.12 for the “there exists a prime factorization” part.)
Taking it further: Back when we defined prime numbers, we were very careful to specify that 1 is neither prime nor composite. You may well have found this insistence to be silly and arbitrary and pedantic; after all, the only positive integers that evenly divide 1 are 1 and, well, 1 itself, so it sure seems like 1 ought to be prime. But there was a good reason that we chose to exclude 1 from the list of primes: it makes the uniqueness of prime factorization true! If we’d listed 1 as a prime number, there would be many different ways to prime factor, say, 202: for example, 202 = 2 · 101 and 202 = 1 · 2 · 101 and 202 = 1 · 1 · 2 · 101, and so forth. So we’d have to have restated the theorem about uniqueness of prime factorization (“. . . is unique up to reordering and the number of times that we multiply by 1”), which is a much more cumbersome statement. This theorem is the reason that 1 is not defined as a prime number, in this book or in any other mathematical treatment.
Proof(ofuniqueness). We’llproceedbystronginductiononn.
For the base case (n = 1), we can write 1 as the product of zero prime numbers—
recall that ∏i∈∅ i = 1—and this representation is unique. (The product of one or more primes is greater than 1, as all primes are at least 2.)
For the inductive case (n ≥ 2), we assume the inductive hypotheses, namely that any n′ < n has a unique prime factorization. We must prove that the prime factoriza- tion of n is also unique. We consider two subcases: CaseI:nisprime. Thenthestatementholdsimmediately:theonlyprimefactoriza- tion is p1 = n. (Suppose that there were a different way of prime factoring n, as n = ∏li=1 qi for prime numbers ⟨q1,q2,...,ql⟩. We’d have to have l ≥ 2 for this fac- torization to differ from p1 = n, but then each qi satisfies qi > 1 and qi < n and qi | n—contradicting what it means for n to be prime.) Problem-solving tip: When you define something, you genuinely get to choose how to define it! When you can make a choice in the definition that makes your life easier, do it! Theorem 7.13 (Prime Factorization Theorem (Reprise)) Let n ∈ Z≥1 be any positive integer. There exist k ≥ 0 prime numbers p1, p2, . . . , pk such that n = ∏ki=1 pi. Further, up to reordering, the prime numbers p1, p2, . . . , pk are unique. Case II: n is composite. Then suppose that p1,p2,...,pk and q1,q2,...,ql are two se- quences of prime numbers such that n = ∏ki=1 pi = ∏li=1 qi. Without loss of generality, assume that both sequences are sorted in increasing order, so that p1 ≤ p2 ≤ ··· ≤ pk andq1 ≤ q2 ≤ ··· ≤ ql. Wemustprovethatthesetwo sequences are actually equal. • CaseIIA:p1 = q1.Definen′ := n = n =∏ki=2pi =∏li=2qi astheproductofall p1 q1 7.3. PRIMALITYANDRELATIVEPRIMALITY 725 the other prime numbers (excluding the primes p1 and q1 = p1). By the induc- tive hypothesis, n′ has a unique prime factorization, and thus p2, p3, . . . , pk and q2, q3, . . . , ql are identical. • Case IIB: p1 ̸= q1. Without loss of generality, suppose p1 < q1. But p1 | n, and therefore p1 | ∏li=1 qi. By Lemma 7.12, there exists an i such that p1 | qi. But 2 ≤ p1 < q1 ≤ qi. This contradicts the assumption that qi was prime. Taking it further: How difficult is it to factor a number n? 
Does there exist an efficient algorithm for factoring—that is, one that computes the prime factorization of n in a number of steps that’s proportional to O(log^k n) for some k? We don’t know. But it is generally believed that the answer is no, that factoring large numbers cannot be done efficiently. The (believed) difficulty of factoring is a crucial pillar of widely used cryptographic systems, including the ones that we’ll encounter in Section 7.5. There are known algorithms that factor large numbers efficiently on so-called quantum computers (see the discussion on p. 1016)—but nobody knows how to build quantum computers. And, while there’s no known efficient algorithm for factoring large numbers on classical computers, there’s also no proof of hardness for this problem. (And most modern cryptographic systems count on the difficulty of the factoring problem—which is only a conjecture!)

7.3.4 The Chinese Remainder Theorem

We’ll close this section with another ancient result about modular arithmetic, called the Chinese Remainder Theorem, from around 1750 years ago. Here’s the basic idea. If n is some nonnegative integer, then knowing that, say, when n is divided by 7 its remainder is 4 gives you a small clue about n’s value: one seventh of integers have the right value mod 7. Knowing n mod 2 and n mod 13 gives you more clues. The Chinese Remainder Theorem says that knowing n mod k for enough values of k will (almost) let you figure out the value of n exactly—at least, if those values of k are all relatively prime. Here’s a concrete example:

Example 7.14 (An example of the Chinese Remainder Theorem)
Problem: What nonnegative integers n satisfy the following conditions?

    n mod 2 = 0        n mod 3 = 2        n mod 5 = 1.

Solution: Suppose n ∈ {0, 1, ..., 29}. Then there are only six possible values for which n mod 5 = 1, namely {0 + 1, 5 + 1, 10 + 1, 15 + 1, 20 + 1, 25 + 1} = {1, 6, 11, 16, 21, 26}. Of these, the only even values are 6, 16, and 26. And we have 6 mod 3 = 0, 16 mod 3 = 1, and 26 mod 3 = 2. Thus n = 26.
Notice that, for any integer k, we have k ≡_b k + 30 for all three moduli b ∈ {2, 3, 5}. Therefore any n ≡_30 26 will satisfy the given conditions.

(Margin note: The name of the Chinese Remainder Theorem comes from its early discovery by the Chinese mathematician Sun Tzu, who lived around the 5th century. This Sun Tzu is a different Sun Tzu from the one who wrote The Art of War about 800 years prior.)

Sorted by n:

    n         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
    n mod 2   0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
    n mod 3   0  1  2  0  1  2  0  1  2  0  1  2  0  1  2  0  1  2  0  1  2  0  1  2  0  1  2  0  1  2
    n mod 5   0  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0  1  2  3  4

Sorted by the remainders:

    n         0  6 12 18 24 10 16 22 28  4 20 26  2  8 14 15 21 27  3  9 25  1  7 13 19  5 11 17 23 29
    n mod 2   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
    n mod 3   0  0  0  0  0  1  1  1  1  1  2  2  2  2  2  0  0  0  0  0  1  1  1  1  1  2  2  2  2  2
    n mod 5   0  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0  1  2  3  4

Figure 7.10: The remainders of all n ∈ {0, 1, ..., 29}, modulo 2, 3, and 5—sorted by n (above) and by the remainders (below).

The basic point of Example 7.14 is that every value of n ∈ {0, ..., 29} has a unique “profile” of remainders mod 2, 3, and 5. (See Figure 7.10.) Crucially, every one of the 30 possible profiles of remainders occurs in Figure 7.10, and no profile appears more than once. (The fact that there are exactly 30 possible profiles follows from the Product Rule for counting; see Section 9.2.1.) The Chinese Remainder Theorem states the general property that’s illustrated in these particular tables: each “remainder profile” occurs once and only once.

Here is a formal statement of the theorem. We refer to a constraint of the form x mod n = a as a congruence, following Definition 7.2. We also write Z_k := {0, 1, ..., k − 1}.

Theorem 7.14 (Chinese Remainder Theorem: two congruences)
Let n and m be any two relatively prime integers. For any a ∈ Z_n and b ∈ Z_m, there exists one and only one integer x ∈ Z_nm such that x mod n = a and x mod m = b.

Proof. To show that there exists an integer x satisfying x mod n = a and x mod m = b, we’ll give a proof by construction—specifically, we’ll compute the value of x given the values of {a, b, n, m}. The simple algorithm is shown in Figure 7.11.

    Input: relatively prime n, m ∈ Z; a ∈ Z_n; b ∈ Z_m.
    Output: x such that x mod n = a and x mod m = b.
    c, d, r := extended-Euclid(n, m)
    return x := (adm + bcn) mod nm

Figure 7.11: An algorithm for the Chinese Remainder Theorem. (Ensure that m ≥ n by swapping n and m if necessary.)

We must argue that x mod n = a and x mod m = b. Note that gcd(n, m) = 1 because n and m are relatively prime by assumption. Thus, by the correctness of the Extended Euclidean algorithm, we have

    cn + dm = 1.          (∗)

Multiplying both sides of (∗) by a, we know that

    acn + adm = a.        (†)

Recall that we defined x := (adm + bcn) mod nm. Let’s now show that x mod n = a:

    x mod n = ((adm + bcn) mod nm) mod n     definition of x
            = (adm + bcn) mod n              Exercise 7.18
            = (adm + 0) mod n                bcn mod n = 0 because n | bcn
            = (adm + acn) mod n              acn mod n = 0 because n | acn too!
            = a mod n                        (†)
            = a.                             a ∈ {0, 1, ..., n − 1} by assumption, so a mod n = a

We can argue that x ≡_m adm + bcn ≡_m bdm + bcn ≡_m b completely analogously, where the last equivalence follows by multiplying both sides of (∗) by b instead. Thus we’ve now established that there exists an x ∈ Z_nm with x mod n = a and x mod m = b (because we computed such an x).

To prove that there is a unique such x, suppose that x mod n = x′ mod n and x mod m = x′ mod m for two integers x, x′ ∈ Z_nm. We will prove that x = x′—which establishes that there’s actually only one element of Z_nm with this property. By assumption, we know that (x − x′) mod n = 0 and (x − x′) mod m = 0, or, in other words, we know that n | (x − x′) and m | (x − x′). By Exercise 7.70 and the fact that n and m are relatively prime, then, we know that nm | (x − x′). And because both x, x′ ∈ Z_nm, the difference x − x′ lies strictly between −nm and nm, and the only multiple of nm in that range is 0. We’ve therefore shown that x = x′.
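The construction in the proof translates almost line for line into code. The sketch below is one possible Python rendering, under the assumption that extended-Euclid(n, m) returns a triple ⟨c, d, r⟩ with cn + dm = r = gcd(n, m); the names extended_euclid and crt_pair are illustrative, and the recursive Euclid shown here is a standard textbook variant rather than necessarily the exact one used earlier in the chapter.

```python
def extended_euclid(n: int, m: int) -> tuple:
    """Return (c, d, r) with c*n + d*m == r == gcd(n, m), for n, m >= 0."""
    if m == 0:
        return 1, 0, n
    c, d, r = extended_euclid(m, n % m)
    # Here c*m + d*(n mod m) == r, and n mod m == n - (n // m) * m,
    # so regroup to get the coefficients of n and m.
    return d, c - (n // m) * d, r


def crt_pair(a: int, n: int, b: int, m: int) -> int:
    """Return the unique x in Z_{nm} with x mod n == a and x mod m == b."""
    c, d, r = extended_euclid(n, m)
    if r != 1:
        raise ValueError("moduli must be relatively prime")
    # c*n + d*m == 1, so a*d*m is congruent to a (mod n) and b*c*n to b (mod m).
    return (a * d * m + b * c * n) % (n * m)


if __name__ == "__main__":
    x = crt_pair(4, 5, 5, 6)
    print(x, x % 5, x % 6)
```

For instance, the constraints x mod 5 = 4 and x mod 6 = 5 yield x = 29. Applying crt_pair twice, first to the moduli 2 and 3 and then to 6 and 5, recovers n = 26 for the three conditions of Example 7.14.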
Some examples

Here are two concrete examples of using the Chinese Remainder Theorem (and, specifically, of using the algorithm from Figure 7.11):

Example 7.15 (The Chinese Remainder Theorem, in action)
Let’s use the algorithm from the proof of the Chinese Remainder Theorem to find the integer x ∈ Z_30 that satisfies x mod 5 = 4 and x mod 6 = 5.
Note that 5 and 6 are relatively prime, and extended-Euclid(5, 6) returns ⟨−1, 1, 1⟩. (Indeed, we have that 5 · −1 + 6 · 1 = 1 = gcd(5, 6).) Thus we compute x from the values of ⟨n, m, a, b, c, d⟩ = ⟨5, 6, 4, 5, −1, 1⟩ as

    adm + bcn = 4 · 1 · 6 + 5 · −1 · 5 = 24 − 25 = −1.

Thus x := −1 mod 30 = 29. And, indeed, 29 mod 5 = 4 and 29 mod 6 = 5.

Writing tip: Now that we’ve done a lot of manipulations with modular arithmetic, in proofs we will start to omit some simple steps that are by now tedious—like those using (7.3.2) to say that y + z mod n is equal to ((y mod n) + (z mod n)) mod n.

Example 7.16 (A second example of the Chinese Remainder Theorem)
Problem: We are told that x mod 7 = 1 and x mod 9 = 5. What is the value of x?
Solution: We find extended-Euclid(7, 9) = ⟨4, −3, 1⟩ by tracing the algorithm’s execution. The algorithm in Figure 7.11 computes x := adm + bcn mod nm, where n = 7 and m = 9 are the given moduli; a = 1 and b = 5 are the given remainders; and c = 4 and d = −3 are the computed multipliers from extended-Euclid. Thus

    x := (1 · −3 · 9) + (5 · 4 · 7) mod 7 · 9
       = −27 + 140 mod 63
       = 113 mod 63
       = 50.

Indeed, 50 mod 7 = 1 and 50 mod 9 = 5. Thus x ≡_63 50.

Generalizing to k congruences

We’ve now shown the Chinese Remainder Theorem for two congruences, but Example 7.14 had three constraints (x mod 2, x mod 3, and x mod 5). In fact, the generalization of the Chinese Remainder Theorem to k congruences, for any k ≥ 1, is also true—again, as long as the moduli are pairwise relatively prime (that is, any two of the moduli share no common divisors greater than 1).

We can prove this generalization fairly directly, using induction and the two-congruence case. The basic idea will be to repeatedly use Theorem 7.14 to combine a pair of congruences into a single congruence, until there are no pairs left to combine. Here’s a concrete example:

Example 7.17 (The Chinese Remainder Theorem, with 3 congruences)
Let’s describe the values of x that satisfy the congruences

    x mod 2 = 1        x mod 3 = 2        x mod 5 = 4.        (∗)

To do so, we first identify values of y that satisfy the first two congruences, ignoring the third. Note that 2 and 3 are relatively prime, and extended-Euclid(2, 3) = ⟨−1, 1, 1⟩. Thus, y mod 2 = 1 and y mod 3 = 2 if and only if

    y mod (2 · 3) = (1 · 1 · 3 + 2 · −1 · 2) mod (2 · 3) = 5.

In other words, y ∈ Z_6 satisfies the congruences y mod 2 = 1 and y mod 3 = 2 if and only if y satisfies the single congruence y mod 6 = 5.
Thus the values of x that satisfy (∗) are precisely the values of x that satisfy

    x mod 6 = 5        x mod 5 = 4.        (†)

And, in Example 7.15, we showed that values of x that satisfy (†) are precisely those with x mod 30 = 29.

Now, using the idea from this example, we’ll prove the general version of the Chinese Remainder Theorem:

Theorem 7.15 (Chinese Remainder Theorem: General version)
Let n_1, n_2, ..., n_k be a collection of pairwise relatively prime integers, for some k ≥ 1, and let N := ∏_{i=1}^{k} n_i. For any ⟨a_1, ..., a_k⟩ with each a_i ∈ Z_{n_i}, there exists one and only one integer x ∈ Z_N such that x mod n_i = a_i for all 1 ≤ i ≤ k.

Proof. We proceed by induction on k.

Base case (k = 1): Then there’s only one constraint, namely x mod n_1 = a_1, and obviously x := a_1 is the only element of Z_N = Z_{n_1} that satisfies this congruence.

Inductive case (k ≥ 2): We assume the inductive hypothesis, namely that there exists a unique x ∈ Z_M satisfying any set of k − 1 congruences whose moduli have product M. To make use of this assumption, we will convert the k given congruences into k − 1 equivalent congruences, as follows: by Theorem 7.14, there exists a (unique) value y∗ ∈ Z_{n_1 n_2} such that y∗ mod n_1 = a_1 and y∗ mod n_2 = a_2. In Exercise 7.69 you’ll prove that n_1 n_2 is also relatively prime to every other n_i, and, in Exercise 7.79, you will show that a value x ∈ Z_N satisfies x mod n_1 = a_1 and x mod n_2 = a_2 if and only if x satisfies x mod n_1 n_2 = y∗. More formally, given the A-constraints (on the left), define the B-constraints (on the right):

    x mod n_1 = a_1     (1A)
    x mod n_2 = a_2     (2A)          x mod n_1 n_2 = y∗     (1-and-2B)
    x mod n_3 = a_3     (3A)          x mod n_3 = a_3        (3B)
    x mod n_4 = a_4     (4A)          x mod n_4 = a_4        (4B)
        ...                               ...
    x mod n_k = a_k.    (kA)          x mod n_k = a_k.       (kB)

Observe that the product of the moduli is the same for both the A-constraints and the B-constraints: N := n_1 · n_2 · n_3 ··· n_k for A, and (n_1 n_2) · n_3 ··· n_k for B. Thus:

• By Exercise 7.79, an integer x ∈ Z_N satisfies the A-constraints if and only if x satisfies the B-constraints.
• By the inductive hypothesis—which applies by Exercise 7.69—there’s a unique x ∈ Z_N that satisfies the B-constraints.

Therefore there is a unique x ∈ Z_N that satisfies the A-constraints, as desired.

Here we gave an inductive argument for the general version of the Chinese Remainder Theorem (based on the 2-congruence version), but we could also give a version of the proof that directly echoes Theorem 7.14’s proof. See Exercise 7.107.

Taking it further: One interesting implication of the Chinese Remainder Theorem is that we could choose to represent integers efficiently in a very different way from binary representation, instead using something called modular representation. In modular representation, we store an integer n as a sequence of values of n mod b, for a set of relatively prime values of b. To be concrete, consider the set {11, 13, 15, 17, 19}, and let N := 11 · 13 · 15 · 17 · 19 = 692,835 be their product.
The Chinese Remainder Theorem tells us that we can uniquely represent any n ∈ Z_N as ⟨n mod 11, n mod 13, n mod 15, n mod 17, n mod 19⟩. For example, 2^17 = ⟨7, 6, 2, 2, 10⟩, and 17 = ⟨6, 4, 2, 0, 17⟩. Perhaps surprisingly, the representation of 2^17 + 17 is ⟨2, 10, 4, 2, 8⟩ and 17 · 2^17 = ⟨9, 11, 4, 0, 18⟩, which are really nothing more than the result of doing component-wise addition/multiplication (modulo that component’s corresponding modulus):

    mod    11   13   15   17   19                 mod    11   13   15   17   19
         ⟨  7,   6,   2,   2,  10 ⟩                    ⟨  7,   6,   2,   2,  10 ⟩
       + ⟨  6,   4,   2,   0,  17 ⟩        and       · ⟨  6,   4,   2,   0,  17 ⟩
       = ⟨ 13,  10,   4,   2,  27 ⟩                  = ⟨ 42,  24,   4,   0, 170 ⟩
       ≡ ⟨  2,  10,   4,   2,   8 ⟩                  ≡ ⟨  9,  11,   4,   0,  18 ⟩.

This representation has some advantages over the normal binary representation: the numbers in each component stay small, and multiplying k pairs of 5-bit numbers is significantly faster than multiplying one pair of 5k-bit numbers. (Also, the components can be calculated in parallel!) But there are some other operations that are slowed down by this representation. (See Exercises 7.145–7.146.)

Computer Science Connections
Secret Sharing

Although encryption/decryption is probably the most natural cryptographic problem, there are many other important problems in the same general vein. Here we’ll introduce and solve a different cryptographic problem—using a solution due to Adi Shamir (the S of the RSA cryptosystem, which we’ll see in Section 7.5).³

Imagine a shared resource, collectively owned by some group, that the group wishes to keep secure—for example, the launch codes for the U.S.’s nuclear weapons. In the post-apocalyptic world in which you’re imagining these codes being used, where many top officials are probably dead, we’ll need to ensure that any, say, k = 3 of the cabinet members (out of the n = 15 cabinet positions) can launch the weapons. But you’d also like to guarantee that no single rogue secretary can destroy the world!
In secret sharing, we seek a scheme by which we distribute “shares” of the secret s ∈ S to a group of n people such that the following properties hold:

1. If any k of these n people cooperate, then—by combining their k shares of the secret—they can compute the secret s (preferably efficiently).
2. If any k′ < k of these n people cooperate, then by combining their k′ shares they learn nothing about the secret s. (Informally, to “learn nothing” about the secret means that no k′ shares of the secret allow one to infer that s comes from any particular S′ ⊂ S.)

(Note that just “splitting up the bits” of the secret violates condition 2.)

The basic idea will be to define a polynomial f(x), and distribute the value of f(i) as the ith “share” of the secret; the secret itself will be f(0). Why will this be useful? Imagine that f(x) = ax + b. (The secret is thus f(0) = a · 0 + b = b.) Knowing that f(1) = 17 tells you that a + b = 17, but it doesn’t tell you anything about b itself: for every possible value of the secret, there’s a value of a that makes a + b = 17. But knowing f(1) = 17 and f(2) = 42 lets you solve for a = 25, b = −8. If f(x) = ax^2 + bx + c, then knowing f(x_1) and f(x_2) gives you two equations and three unknowns—but you can solve for c if you know the value of f(x) for three different values of x. In general, knowing k values of a polynomial f of degree k lets you compute f(0), but any k − 1 values of f are consistent with any value of f(0). And this result remains true if, instead of using the value f(x) as the share of the secret, we instead use f(x) mod p, for some prime p. (See p. 731.)

³ Adi Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, November 1979.

Here’s a concrete example, to distribute shares of a secret m ∈ {0, 1, 2, 3, 4}:

• Choose a_1, ..., a_k uniformly and independently at random from {0, 1, 2, 3, 4}.
• Let f(x) = m + ∑_{i=1}^{k} a_i x^i. Distribute ⟨n, f(n) mod 5⟩ as “share” #n.

For example, let k := 3, and suppose you know that f(1) mod 5 = 1 and f(2) mod 5 = 2.
These facts don’t help you figure out f(0): there are polynomials {f_0, f_1, ..., f_4} with f_b(0) = b that are all consistent with those observations! (See Figure 7.12.) To put this fact another way, given points ⟨x_1, y_1⟩ and ⟨x_2, y_2⟩ for x_1, x_2 ≠ 0, for any y-intercept b, there exists an f(x) such that f(x_1) ≡_p y_1, f(x_2) ≡_p y_2, and f(0) ≡_p b. But three people can reconstruct the secret! There’s only one quadratic that passes through three given points.

    f_0(x) = 0 + 1x + 0x^2
    f_1(x) = 1 + 2x + 3x^2
    f_2(x) = 2 + 3x + 1x^2
    f_3(x) = 3 + 4x + 4x^2
    f_4(x) = 4 + 0x + 2x^2

Figure 7.12: Let f(x) := a + bx + cx^2. Even knowing f(1) ≡_5 1 and f(2) ≡_5 2, we don’t know f(0) mod 5; there are polynomials consistent with f(0) ≡_5 m for every m ∈ {0, 1, 2, 3, 4}. Here we see f_b(x) mod 5. (These polynomials can be hard to visualize, because their values “wrap around” from 5 to 0.)

Computer Science Connections
Error Correction with Reed–Solomon Codes

Earlier (see Chapter 4), we discussed error-correcting codes: we encode a message m as a codeword c(m), so that m is (efficiently) recoverable from c(m), or even from a mildly corrupted codeword c′ ≈ c(m). (Note the difference in motivation with cryptography: in error-correcting codes, we want a codeword that makes computing the original message very easy; in cryptography, we want a ciphertext that makes computing the original message very hard.) The key property that we seek is that if m_1 ≠ m_2, then c(m_1) and c(m_2) are “very different,” so that decoding c′ simply corresponds to finding the m that minimizes the difference between c′ and c(m).

There, we discussed Reed–Solomon codes, one of the classic schemes for error-correcting codes. Under Reed–Solomon codes, to encode a message m ∈ Z^k, we define the polynomial p_m(x) := ∑_{i=1}^{k} m_i x^i, and encode m as ⟨p_m(1), p_m(2), ..., p_m(n)⟩.
(We choose n much bigger than k, to achieve the desired error-correction properties.) For example, for the messages m_1 = ⟨1, 3, 2⟩ and m_2 = ⟨3, 0, 3⟩, we have p_{m_1}(x) = x + 3x^2 + 2x^3 and p_{m_2}(x) = 3x + 3x^3. For n = 6, we have the codewords (for m_1 and m_2, respectively)

    ⟨6, 30, 84, 180, 330, 546⟩   and   ⟨6, 30, 90, 204, 390, 666⟩.

The key point is that two distinct polynomials of degree k agree on at most k inputs, which means that the codewords for m_1 and m_2 will be very different. (Here p_{m_1}(x) and p_{m_2}(x) agree on x ∈ {1, 2}, but not on x ∈ {3, 4, 5, 6}.) The theorem upon which this difference rests is important enough to be called the Fundamental Theorem of Algebra; see Figure 7.13.

While this fact about Reed–Solomon codes is nice, it’s already evident that the numbers in the codewords get really big—546 and 666 are very big relative to the integers in the original messages! In real Reed–Solomon codes, there’s another trick that’s used: every value is stored modulo a prime. Let q be a prime. We’ll actually encode our message m as

    ⟨p_m(1) mod q, p_m(2) mod q, ..., p_m(n) mod q⟩.

In fact, we now encode a message m ∈ Z_q^k with a codeword in Z_q^n. And it turns out that everything important about polynomials remains true if we take all values modulo a prime q! (See Figure 7.14.)

The combined message of Reed–Solomon error-correcting codes and the Shamir secret-sharing scheme (p. 730) is the following. Suppose that there is a degree-k polynomial p that is unknown to you, and suppose that you are given the evaluation of this polynomial on n distinct points.

if n […] k: Then you can find the degree-k polynomial consistent with the largest number of these points. (Errors corrected!)
Theorem 7.16
Let f(x) be a polynomial of degree k. Then either f(a) = 0 for every a ∈ Z, or the equation f(x) = 0 has at most k solutions for x ∈ Z.

Corollary 7.17
Let f and g ≠ f be polynomials of degree k. Then |{x : f(x) = g(x)}| ≤ k.

Figure 7.13: The Fundamental Theorem of Algebra. The corollary follows because the polynomial h(x) = f(x) − g(x) also has degree at most k, and {x : f(x) = g(x)} is precisely the set {x : h(x) = 0}.

Theorem 7.18
Let f(x) be a polynomial of degree k, and let q be a prime number. Then either f(a) mod q = 0 for every a ∈ Z_q, or the equation f(x) = 0 has at most k solutions for x ∈ Z_q.

Corollary 7.19
Let f and g ≠ f be polynomials of degree k. Then |{x : f(x) ≡_q g(x)}| ≤ k.

Figure 7.14: The Fundamental Theorem of Algebra, modulo a prime.
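The Reed–Solomon encoding described in this box is easy to reproduce. The following Python sketch evaluates p_m(x) = ∑ m_i x^i (with i starting at 1, as in the text) at x = 1, ..., n, optionally reducing each value modulo a prime q. The function name rs_encode and the choice q = 7 in the usage lines are illustrative assumptions; the sketch covers encoding only, not decoding or error correction.

```python
def rs_encode(message, n, q=None):
    """Evaluate p_m(x) = sum_i m_i * x**i (i = 1..k) at x = 1..n.

    If q is given, each value is reduced modulo q (q is assumed to be prime).
    """
    codeword = []
    for x in range(1, n + 1):
        value = sum(m_i * x**i for i, m_i in enumerate(message, start=1))
        codeword.append(value if q is None else value % q)
    return codeword


if __name__ == "__main__":
    c1 = rs_encode([1, 3, 2], 6)
    c2 = rs_encode([3, 0, 3], 6)
    print(c1)
    print(c2)
    # Inputs x on which the two codewords agree:
    print([x for x, (u, v) in enumerate(zip(c1, c2), start=1) if u == v])
    # The same codewords with every value stored modulo the (arbitrary) prime 7:
    print(rs_encode([1, 3, 2], 6, q=7))
```

With the messages ⟨1, 3, 2⟩ and ⟨3, 0, 3⟩ and n = 6, this reproduces the two codewords given in the text, which agree only at x ∈ {1, 2}.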

7.3.5 Exercises
The Sieve of Eratosthenes returns a list of all prime numbers up to a given integer n by creating a list of candidate primes ⟨2, 3, . . . , n⟩, and repeatedly marking the first unmarked number p as prime and striking out all entries in the list that are multiples of p. See the Sieve in action in Figure 7.15.
7.38 Write pseudocode to describe the Sieve of Eratosthenes.
7.39 Run the algorithm, by hand, to find all primes less than 100.
7.40 (programming required) Implement the Sieve of Eratosthenes in a programming language of your
choice. Use your program to compute all primes up to 100,000. How many are there?
7.41 (programming required) Earlier, we suggested another algorithm to compute all primes up to
n := 100,000: for each i = 2, 3, . . . , n, test whether i is divisible by any integer between 2 and √i. Implement this algorithm too, and compare their execution times. What happens for n := 500,000?
7.42 Assume that each number k is crossed off by the Sieve of Eratosthenes every time a divisor of it is found. (For example, 6 is crossed off when 2 is the prime in question, and when 3 is the prime in question.) Prove that the total number of crossings-out by sieve(n) is ≤ Hn · n, where Hn is the nth harmonic number. (See Definition 5.4.)
Use the Prime Number Theorem to . . .
7.43 . . . estimate the number of primes between 2^127 + 1 and 2^128.
7.44 . . . estimate the (2^128)th-largest prime.
7.45 . . . argue that, roughly, the probability that a randomly chosen number close to n is prime is about
1/ ln n. (Hint: what does primes(n) − primes(n − 1) represent?)
7.46 Using the same technique as in Example 7.8, estimate the number of 6-digit primes. Then, using the Sieve or some other custom-built program, determine how far off the estimate was.
Let p be an arbitrary prime number and let a be an arbitrary nonnegative integer. Prove the following facts.
7.47 If p ∤ a, then gcd(p, a) = 1.
7.48 For any positive integer k, we have p | a^k if and only if p | a. (Hint: use induction and Lemma 7.12.)
7.49 For any integers n, m ∈ {1, ..., p − 1}, we have that p ∤ nm.
7.50 For any integer m and any prime number q distinct from p (that is, p ≠ q), we have m ≡_p a and m ≡_q a if and only if m ≡_pq a. (Hint: think first about the case a = 0; then generalize.)
7.51 If 0 ≤ a […] 1. In other words, when a and n are not relatively prime, then a^−1 fails to exist in Z_n. (That’s because any multiple xa of a will also be divisible by d, and so xa mod n will also be divisible by d, and therefore xa mod n will not equal 1.) In fact, not being relatively prime to n is the only way to fail to have a multiplicative inverse in Z_n, as we’ll prove. (Note that 0 ∈ Z_n is not relatively prime to n, because gcd(n, 0) ≠ 1.)
Theorem 7.20 (Existence of Multiplicative Inverses)
Let n ≥ 2 and a ∈ Z_n. Then a^−1 exists in Z_n if and only if n and a are relatively prime.

Proof. By definition, a multiplicative inverse of a exists in Z_n precisely when there exists an integer x such that ax ≡_n 1. (The definition actually requires x ∈ Z_n, not just x ∈ Z, but see Exercise 7.98.) But ax ≡_n 1 means that ax is one more than a multiple of n—that is, there exists some integer y such that ax + yn = 1. In other words,

    a^−1 exists in Z_n if and only if there exist integers x, y such that ax + yn = 1.        (∗)

Observe that (∗) echoes the form of Lemma 7.10 (and thus also echoes the output of the Extended Euclidean algorithm), and we can use this fact to prove the theorem. We’ll prove the two directions of the implication separately:

If a^−1 exists in Z_n, then a and n are relatively prime. We’ll prove the contrapositive. Suppose that a and n are not relatively prime—that is, suppose that gcd(a, n) = d for some d > 1. We will show that a^−1 does not exist in Z_n. Because d | a and d | n, there exist integers c and k such that a = cd and n = kd. But then, for any integers x and y, we have that

    ax + yn = cdx + ykd = d(cx + yk)

and thus d | (ax + yn). Because d > 1 does not divide 1, there are no integers x, y for which ax + yn = 1 and therefore, by (∗), a^−1 does not exist in Z_n.

If a and n are relatively prime, then a^−1 exists in Z_n. Suppose that a and n are relatively prime. Then gcd(a, n) = 1 by definition. Thus, by the correctness of the Extended Euclidean algorithm (Theorem 7.11), the output of extended-Euclid(a, n) is ⟨x, y, 1⟩ for integers x, y such that xa + yn = gcd(a, n) = 1. The fact that extended-Euclid(a, n) outputs integers x and y such that xa + yn = 1 means that such an x and y must exist—and so, by (∗), a^−1 exists in Z_n.
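The statement of Theorem 7.20 can be checked empirically for small moduli. The Python sketch below searches Z_n exhaustively for an inverse of a and compares the outcome with the condition gcd(a, n) = 1. The function names are illustrative, and the exhaustive search is deliberately naive: it is meant only to mirror the definition of a multiplicative inverse, not to compute inverses efficiently.

```python
from math import gcd


def has_inverse(a: int, n: int) -> bool:
    """Return True if some x in Z_n satisfies a*x ≡ 1 (mod n), by exhaustive search."""
    return any((a * x) % n == 1 for x in range(n))


def theorem_holds(n: int) -> bool:
    """Check, for every a in Z_n, that a is invertible exactly when gcd(a, n) == 1."""
    return all(has_inverse(a, n) == (gcd(a, n) == 1) for a in range(n))


if __name__ == "__main__":
    print(all(theorem_holds(n) for n in range(2, 60)))
    print([a for a in range(9) if has_inverse(a, 9)])  # the invertible elements of Z_9
```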

Note that this theorem is consistent with the examples that we saw previously: we found 1^−1 and 2^−1 but not 3^−1 in Z_9 (Examples 7.19 and 7.20; 1 and 2 are relatively prime to 9, but 3 is not), and we found multiplicative inverses for all nonzero elements of Z_7 (Example 7.21; all of {1, 2, ..., 6} are relatively prime to 7).
Two implications of Theorem 7.20
There are two useful implications of this result. First, when the modulus is prime, multiplicative inverses exist for all nonzero elements of Z_n, because every nonzero a ∈ Z_n and n are relatively prime for any prime number n.

Corollary 7.21
If p is prime, then every nonzero a ∈ Z_p has a multiplicative inverse in Z_p.

(We saw an example of this corollary in Example 7.21, where we identified the multiplicative inverses of all nonzero elements in Z_7.)

The second useful implication of Theorem 7.20 is that, whenever the multiplicative inverse of a exists in Z_n, we can efficiently compute a^−1 in Z_n using the Extended Euclidean algorithm—specifically, by running the (simple!) algorithm in Figure 7.18. (This problem also nicely illustrates a case in which proving a structural fact vastly improves the efficiency of a calculation—the algorithm in Figure 7.18 is way faster than building the entire multiplication table, as we did in Example 7.21.)

    inverse(a, n):
    Input: a ∈ Z_n and n ≥ 2
    Output: a^−1 in Z_n, if it exists
    x, y, d := extended-Euclid(a, n)
    if d = 1 then
        return x mod n        // xa + yn = 1, so xa ≡_n 1.
    else
        return “no inverse for a exists in Z_n.”

Figure 7.18: An algorithm for computing multiplicative inverses using the Extended Euclidean algorithm.

Corollary 7.22
For any n ≥ 2 and a ∈ Z_n, inverse(a, n) returns the value of a^−1 in Z_n.

Proof. We just proved that a^−1 exists if and only if extended-Euclid(a, n) returns ⟨x, y, 1⟩. In this case, we have xa + yn = 1 and therefore xa ≡_n 1. Defining a^−1 := x mod n ensures that a · (x mod n) ≡_n 1, as required. (Again, see Exercise 7.98.)

Here’s an example, replicating the calculation of 5^−1 in Z_7 from Example 7.21:

Example 7.22 (5^−1 in Z_7, again)
To compute 5^−1, we run the Extended Euclidean algorithm on 5 and 7:

    extended-Euclid(5, 7)
        calls extended-Euclid(7 mod 5, 5), where 7 mod 5 = 2
            calls extended-Euclid(5 mod 2, 2), where 5 mod 2 = 1, which returns ⟨1, 0, 1⟩
        returns ⟨−2, 1, 1⟩
    returns ⟨3, −2, 1⟩.

The Extended Euclidean algorithm returns ⟨3, −2, 1⟩, implying that 3 · 5 + −2 · 7 = 1 = gcd(5, 7). Therefore inverse(5, 7) returns 3 mod 7 = 3. And, indeed, 3 · 5 ≡_7 1.
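The inverse algorithm of Figure 7.18 can be sketched in Python as follows. This is one possible rendering, assuming that extended-Euclid(a, n) returns ⟨x, y, d⟩ with xa + yn = d = gcd(a, n); the recursive extended_euclid below is a standard variant supplied so that the block is self-contained, and returning None when no inverse exists is an illustrative choice in place of the figure’s message.

```python
def extended_euclid(n: int, m: int) -> tuple:
    """Return (x, y, d) with x*n + y*m == d == gcd(n, m), for n, m >= 0."""
    if m == 0:
        return 1, 0, n
    x, y, d = extended_euclid(m, n % m)
    # x*m + y*(n mod m) == d and n mod m == n - (n // m) * m; regroup the terms.
    return y, x - (n // m) * y, d


def inverse(a: int, n: int):
    """Return a^-1 in Z_n, or None if a and n are not relatively prime."""
    x, y, d = extended_euclid(a, n)
    if d == 1:
        return x % n  # x*a + y*n == 1, so x*a ≡ 1 (mod n)
    return None


if __name__ == "__main__":
    print(inverse(5, 7))
    print(inverse(3, 9))
```

Since Python 3.8, the built-in call pow(a, -1, n) computes the same modular inverse and raises ValueError when none exists, which gives a convenient cross-check.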

Example 7.23 (7^−1 in Z_9)
In Example 7.16, we saw that extended-Euclid(7, 9) = ⟨4, −3, 1⟩. Thus 7 and 9 are relatively prime, and 7^−1 in Z_9 is 4 mod 9 = 4. And indeed 7 · 4 = 28 ≡_9 1.
7.4.3 Fermat’s Little Theorem

We’ll now make use of the results that we’ve developed so far—specifically Corollary 7.21—to prove a surprising and very useful theorem, called Fermat’s Little Theorem, which states that a^(p−1) is equivalent to 1 mod p, for any prime number p and any a ≠ 0. (And we’ll see why this result is useful for cryptography in Section 7.5.)

(Margin note: Fermat’s Little Theorem is named after Pierre de Fermat, a 17th-century French mathematician.)

Taking it further: Fermat’s Little Theorem is the second-most famous theorem named after Pierre de Fermat. His more famous theorem is called Fermat’s Last Theorem, which states the following:

    For any integer k ≥ 3, there are no positive integers x, y, z satisfying x^k + y^k = z^k.

There are integer solutions to the equation x^k + y^k = z^k when k = 2—the so-called Pythagorean triples, like ⟨3, 4, 5⟩ (where 3^2 + 4^2 = 9 + 16 = 25 = 5^2) and ⟨7, 24, 25⟩ (where 7^2 + 24^2 = 49 + 576 = 625 = 25^2). But Fermat’s Last Theorem states that there are no integer solutions when the exponent is larger than 2.

The history of Fermat’s Last Theorem is convoluted and about as fascinating as the history of any mathematical statement can be. In the 17th century, Fermat conjectured his theorem, and scrawled—in the margin of one of his books on mathematics—the words “I have discovered a truly marvelous proof, which this margin is too narrow to contain . . ..” The conjecture, and Fermat’s assertion, were found after Fermat’s death—but the proof that Fermat claimed to have discovered was never found. And it seems almost certain that he did not have a correct proof of this claim. Some 350 years later, in 1995, the mathematician Andrew Wiles published a proof of Fermat’s Last Theorem, building on work by a number of other 20th-century mathematicians.

The history of Fermat’s Last Theorem—including the history of Fermat’s conjecture and the centuries-long quest for a proof—has been the subject of a number of books written for a nonspecialist audience; see, for example, the book by Simon Singh.⁴
Before we can prove Fermat’s Little Theorem itself, we’ll need a preliminary result. We will show that, for any prime p and any nonzero a ∈ Z , the first p − 1 nonzero
4 Simon Singh. Fer- mat’s Last Theorem: The Story of a Riddle That Confounded the World’s Greatest Minds for 358 Years. Fourth Estate Ltd., 2002.
p
multiples of a—that is, {a, 2a, 3a, . . . , (p − 1)a}—are precisely the p − 1 nonzero ele-
ments of Zp. Or, to state this claim in a slightly different way, we will prove that the function f : Zp → Zp defined by f (k) = ak mod p is both one-to-one and onto (and also satisfies f (0) = 0). Here is a formal statement of the result:
Lemma 7.23 ({1, 2, …, p − 1} and {1a, 2a, …, (p − 1)a} are equivalent mod p)
For prime $p$ and any $a \in \mathbb{Z}_p$ where $a \neq 0$, we have
$\{1 \cdot a \bmod p,\ 2 \cdot a \bmod p,\ \ldots,\ (p-1) \cdot a \bmod p\} = \{1, 2, \ldots, p-1\}$.

Before we dive into a proof, let’s check an example:
Example 7.24 ({ai mod 11} vs. {i mod 11})
Consider the prime p = 11 and two values of a, namely a = 2 and a = 5. Then, taking all results modulo 11, we have
i            1   2   3   4   5   6   7   8   9  10
2i           2   4   6   8  10  12  14  16  18  20
2i mod 11    2   4   6   8  10   1   3   5   7   9
5i           5  10  15  20  25  30  35  40  45  50
5i mod 11    5  10   4   9   3   8   2   7   1   6
Note that every number from {1, 2, …, p − 1} = {1, 2, …, 10} appears (once and only once) in the {2i mod 11} and {5i mod 11} rows of this table, exactly as desired. That is,
$\{1, 2, 3, \ldots, 10\} \equiv_{11} \{2, 4, 6, \ldots, 20\} \equiv_{11} \{5, 10, 15, \ldots, 50\}$.
We can also observe examples of this result in the multiplication table for Z7. (See Figure 7.19 for a reminder.) We can see that every (nonzero) row {a, 2a, 3a, 4a, 5a, 6a} contains all six numbers {1, 2, 3, 4, 5, 6}, in some order, in the six nonzero columns.
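The claim of Lemma 7.23 is easy to check by brute force for small primes. The following is a minimal Python sketch, not part of the text; the function names `nonzero_multiples_mod` and `is_permutation_of_nonzero_elements` are hypothetical names chosen for illustration.

```python
def nonzero_multiples_mod(a, p):
    """Return [1*a mod p, 2*a mod p, ..., (p-1)*a mod p]."""
    return [(i * a) % p for i in range(1, p)]


def is_permutation_of_nonzero_elements(a, p):
    """Check whether the multiples above are exactly {1, ..., p-1}, each once."""
    return sorted(nonzero_multiples_mod(a, p)) == list(range(1, p))


# The two rows of the table in Example 7.24.
print(nonzero_multiples_mod(2, 11))  # [2, 4, 6, 8, 10, 1, 3, 5, 7, 9]
print(nonzero_multiples_mod(5, 11))  # [5, 10, 4, 9, 3, 8, 2, 7, 1, 6]

# Every nonzero row of the multiplication table for Z_7.
print(all(is_permutation_of_nonzero_elements(a, 7) for a in range(1, 7)))  # True

# The property can fail for a composite modulus: 2 has no inverse in Z_6.
print(is_permutation_of_nonzero_elements(2, 6))  # False
```

The last line suggests why primality matters in the lemma: with modulus 6, the multiples of 2 are 2, 4, 0, 2, 4, so both a zero and duplicates appear.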
Proof of Lemma 7.23. Consider any prime $p$, and any nonzero $a \in \mathbb{Z}_p$. We must prove that $\{a, 2a, \ldots, (p-1)a\} \equiv_p \{1, 2, \ldots, p-1\}$.
We will first argue that the set $\{1 \cdot a \bmod p,\ 2 \cdot a \bmod p,\ \ldots,\ (p-1) \cdot a \bmod p\}$ contains no duplicates; that is, the value of $i \cdot a \bmod p$ is different for every $i$. Let $i, j \in \{1, 2, \ldots, p-1\}$ be arbitrary. We will show that $ia \equiv_p ja$ implies that $i = j$, which establishes this first claim. Suppose that $ia \equiv_p ja$. Then, multiplying both sides by $a^{-1}$, we have that $iaa^{-1} \equiv_p jaa^{-1}$, which immediately yields $i \equiv_p j$ because $a \cdot a^{-1} \equiv_p 1$; and because $i$ and $j$ both lie in $\{1, 2, \ldots, p-1\}$, we have $i = j$. (Note that, because $p$ is prime, by Corollary 7.21, we know that $a^{-1}$ exists in $\mathbb{Z}_p$.) Therefore, for any $i, j \in \{1, 2, \ldots, p-1\}$, if $i \neq j$ then $ai \not\equiv_p aj$.
We now need only show that $ia \bmod p \neq 0$ for any $i \in \{1, 2, \ldots, p-1\}$. But that fact is straightforward to see: $ia \bmod p = 0$ if and only if $p \mid ia$, but $p$ is prime and $i < p$ and $a < p$, so $p$ cannot divide $ia$. (See Exercise 7.49.)

      0  1  2  3  4  5  6
  0   0  0  0  0  0  0  0
  1   0  1  2  3  4  5  6
  2   0  2  4  6  1  3  5
  3   0  3  6  2  5  1  4
  4   0  4  1  5  2  6  3
  5   0  5  3  1  6  4  2
  6   0  6  5  4  3  2  1
Figure 7.19: The multiplication table for Z7: a reminder.

With this preliminary result in hand, we turn to Fermat’s Little Theorem itself:

Theorem 7.24 (Fermat’s Little Theorem)
Let $p$ be prime, and let $a \in \mathbb{Z}_p$ where $a \neq 0$. Then $a^{p-1} \equiv_p 1$.

As with the previous lemma, we’ll start with a few examples of this claim, and then give a proof of the general result. (While this property admittedly might seem a bit mysterious, it turns out to follow fairly closely from Lemma 7.23, as we’ll see.)

Example 7.25 (Some examples of Fermat’s Little Theorem)
Here are a few examples, for the prime numbers 7 and 19:
$2^6 \bmod 7 = 64 \bmod 7 = (7 \cdot 9 + 1) \bmod 7 = 1$
$3^6 \bmod 7 = 729 \bmod 7 = (104 \cdot 7 + 1) \bmod 7 = 1$
$4^{18} \bmod 19 = 68719476736 \bmod 19 = (3616814565 \cdot 19 + 1) \bmod 19 = 1$.

The proof of Fermat’s Little Theorem
We’ll now turn to a proof of the theorem: for any prime $p$ and any nonzero $a \in \mathbb{Z}_p$, we have that $a^{p-1} \equiv_p 1$:
Proof of Fermat’s Little Theorem (Theorem 7.24). Note that, because $p$ is prime, by Corollary 7.21, the multiplicative inverses $1^{-1}, 2^{-1}, \ldots, (p-1)^{-1}$ all exist in $\mathbb{Z}_p$. By Lemma 7.23, we know that $\{1 \cdot a \bmod p,\ 2 \cdot a \bmod p,\ \ldots,\ (p-1) \cdot a \bmod p\}$ and $\{1, 2, \ldots, p-1\}$ are the same set, and thus have the same product:
$1 \cdot 2 \cdot 3 \cdots (p-1) \equiv_p (1 \cdot a) \cdot (2 \cdot a) \cdot (3 \cdot a) \cdots ((p-1) \cdot a)$. (1)
Multiplying both sides of (1) by the product of all $p - 1$ multiplicative inverses of $\{1, \ldots, p-1\}$ (that is, multiplying by $1^{-1} \cdot 2^{-1} \cdots (p-1)^{-1}$), we have
$1 \cdot 2 \cdot 3 \cdots (p-1) \cdot 1^{-1} \cdot 2^{-1} \cdots (p-1)^{-1} \equiv_p (1 \cdot a) \cdot (2 \cdot a) \cdot (3 \cdot a) \cdots ((p-1) \cdot a) \cdot 1^{-1} \cdot 2^{-1} \cdots (p-1)^{-1}$. (2)
Rearranging the left-hand side of (2) and replacing $b \cdot b^{-1}$ by 1 for each $b \in \{1, \ldots, p-1\}$, we simply get 1:
$1 \equiv_p (1 \cdot a) \cdot (2 \cdot a) \cdot (3 \cdot a) \cdots ((p-1) \cdot a) \cdot 1^{-1} \cdot 2^{-1} \cdots (p-1)^{-1}$. (3)
Rearranging the right-hand side of (3) and again replacing each $b \cdot b^{-1}$ by 1, we are left only with $p - 1$ copies of $a$:
$1 \equiv_p a^{p-1}$.

Note that Fermat’s Little Theorem is an implication, not an equivalence. It states that if $p$ is prime, then for every $a \in \{1, \ldots, p-1\}$ (that is, for every $a \in \mathbb{Z}_p$ relatively prime to $p$) we have $a^{p-1} \equiv_p 1$. The converse does not always hold: if $a^{n-1} \equiv_n 1$ for every $a \in \mathbb{Z}_n$ that’s relatively prime to $n$, we cannot conclude that $n$ is prime. For example, $a^{560} \equiv_{561} 1$ for every $a \in \{1, 2, \ldots, 560\}$ with gcd(a, 561) = 1, but 561 is not prime! (See Exercise 7.110.) A number like 561, which passes the test in Fermat’s Little Theorem but is not prime, is called a Fermat pseudoprime or a Carmichael number.
Carmichael numbers are named after Robert Carmichael, an American mathematician who first discovered these numbers, in the early 20th century.

Taking it further: Let $n \geq 2$ be an integer, and suppose that we need to determine whether $n$ is prime. There’s a test for primality that’s implicitly suggested by Fermat’s Little Theorem (for “many” different values of $a \in \mathbb{Z}_n$, test to make sure that $a^{n-1} \bmod n = 1$), but this test sometimes incorrectly identifies composite numbers as prime, because of the Carmichael numbers. (For speed, we generally test a few randomly chosen values of $a \in \mathbb{Z}_n$ instead of trying many of them, but of course testing fewer values of $a$ certainly can’t prevent us from incorrectly identifying Carmichael numbers as prime.) However, there are some tests for primality that have a similar spirit but that aren’t fooled by certain inputs in this way. See the discussion on p. 742 for a description of a randomized algorithm called the Miller–Rabin test that checks primality using this approach.

Computer Science Connections
Miller–Rabin Primality Test
Fermat’s Little Theorem says that $a^{n-1} \equiv_n 1$ for any prime $n$ and any nonzero $a \in \mathbb{Z}_n$, which makes the randomized algorithm in Figure 7.20 tempting as a way to test for primality. It’s clear that bogus-isPrime?(p) returns “prime” for any prime $p$, by Fermat’s Little Theorem, but what’s not clear is the false negative probability (the chance of failing to detect that a composite $n$ is composite). Unfortunately, the probability can be terrible for particular values of $n$: for example, $n$ = 118,901,521 is not prime, but the only $a$ for which $a^{n-1} \not\equiv_n 1$ are multiples of 271, 541, or 811, which make up less than 0.7% of $\{1, 2, \ldots, n-1\}$. (See the discussion of Carmichael numbers, and Exercise 7.110. And Carmichael numbers whose prime factors are all > 271 give even worse performance.)

bogus-isPrime?(n, k):
Input: n is a candidate prime number; k is a “certainty parameter” telling us how many tests to perform before giving up and reporting n as prime.
repeat
    choose a ∈ {1, 2, …, n − 1} randomly
until $a^{n-1} \not\equiv_n 1$ or we’ve tried k times
return “prime” if every $a^{n-1} \equiv_n 1$; else return “composite”
Figure 7.20: A bogus primality tester based on Fermat’s Little Theorem.

We can, however, give a randomized primality test using modular arithmetic that doesn’t get fooled for any particular input integer. The Miller–Rabin primality test5 is based on the following fact (see Exercise 7.51):
    if $p$ is prime, then $x^2 \equiv_p 1$ if and only if $x \in \{1, p-1\}$. (1)
Or, taking the contrapositive,
    if $a^2 \equiv_n 1$ for $a \notin \{1, n-1\}$, then $n$ is not prime. (2)
The basic idea of Miller–Rabin is to look for an $a \in \mathbb{Z}_n$ with this property. (See Figure 7.21.) Consider a candidate prime number $n \geq 3$; we may assume that $n$ is odd, because an even $n \geq 3$ is certainly not prime. Thus $n - 1$ is even, and we can write $n - 1 = 2^r d$, where $d$ is an odd number and $r \geq 1$. (For $n = 561$, for example, we can write $n - 1 = 560 = 2^4 \cdot 35$, so $r = 4$ and $d = 35$.) Let $a \in \mathbb{Z}_n$ with $a \neq 0$. Define the sequence
    $a^d,\ (a^d)^2 = a^{2d},\ (a^{2d})^2 = a^{4d},\ \ldots,\ (a^{2^{r-1}d})^2 = a^{2^r d} = a^{n-1}$, (3)
with each entry taken modulo $n$. For example, for $n = 561$ (so $r = 4$ and $d = 35$) and $a = 4$, this sequence (modulo $n$) would be
    ⟨166, 67, 1, 1, 1⟩,
where $a^d \equiv_n 4^{35} \equiv_n 166$, $a^{2d} \equiv_n 166^2 \equiv_n 67$, $a^{4d} \equiv_n 67^2 \equiv_n 1$, $a^{8d} \equiv_n 1^2 \equiv_n 1$, and $a^{16d} \equiv_n 1^2 \equiv_n 1$.
By Fermat’s Little Theorem, we know $n$ is not prime if $a^{n-1} \not\equiv_n 1$. Thus if (3) ends with something $\not\equiv_n 1$, we know that $n$ is not prime. And if there’s a 1 that appears immediately after an entry $x$ where $x \bmod n \notin \{1, n-1\}$ in (3), then we also know that $n$ is not prime: $x^2 \equiv_n 1$ but $x \bmod n \notin \{1, n-1\}$, so by (2) we know that $n$ is not prime. The key fact, which we won’t prove here, is that many different values of $a \in \mathbb{Z}_n$ result in one of these two violations:6
    Fact: If $n$ is not prime, then for at least $\frac{n-1}{2}$ different nonzero values of $a \in \mathbb{Z}_n$, the sequence (3) contains a 1 following an entry $x \notin \{1, n-1\}$ or the sequence (3) doesn’t end with 1.
This fact then allows us to test for $n$’s primality by trying $k$ different randomly chosen values of $a$; the probability that every one of these tests fails when $n$ is not prime is at most $1/2^k$.

miller-rabin-isPrime?(n, k):
Input: n is a candidate prime number; k is a “certainty parameter”
write $n - 1$ as $2^r d$ for an odd number d
while we’ve done fewer than k tests:
    choose a random a ∈ {1, …, n − 1}
    σ := ⟨$a^d, a^{2d}, a^{4d}, a^{8d}, \ldots, a^{2^r d}$⟩ mod n
    if σ ≠ ⟨…, 1⟩ or if σ = ⟨…, x, 1, …⟩ for some x ∉ {1, n − 1} then
        return “composite”
return “prime”
Figure 7.21: Miller–Rabin primality test.

The original version of this test, due to Miller, is a nonrandom version of this algorithm that relies on a (still!) unproven assumption in mathematics; it was subsequently modified by Rabin to remove the assumption (but at the cost of making it random instead). See the references in footnote 5.
5 Gary L. Miller. Riemann’s hypothesis and tests for primality. Journal of Computer and System Sciences, 13(3):300–317, 1976; and Michael O. Rabin. Probabilistic algorithm for testing primality. Journal of Number Theory, 12(1):128–138, 1980.
6 For a proof of this fact, see Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
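To make the contrast between the two tests concrete, here is a minimal Python sketch of both bogus-isPrime? and miller-rabin-isPrime?. It is an illustration under stated assumptions rather than a definitive implementation: the function names (`fermat_test`, `squaring_sequence`, `is_witness`, `miller_rabin_is_prime`) and the optional `rng` parameter are hypothetical, and the handling of n < 4 and of even n is an added convenience that the figures do not specify.

```python
import random
from math import gcd


def fermat_test(n, k, rng=random):
    """Sketch of bogus-isPrime?: True ("prime") unless some random a
    satisfies a^(n-1) mod n != 1."""
    for _ in range(k):
        a = rng.randrange(1, n)            # a in {1, ..., n-1}
        if pow(a, n - 1, n) != 1:
            return False                   # a shows that n is composite
    return True


def squaring_sequence(a, n):
    """Sequence (3): [a^d, a^(2d), ..., a^(2^r d)] mod n, where
    n - 1 = 2^r * d with d odd (n is assumed odd and at least 3)."""
    r, d = 0, n - 1
    while d % 2 == 0:
        r, d = r + 1, d // 2
    seq = [pow(a, d, n)]
    for _ in range(r):
        seq.append(seq[-1] * seq[-1] % n)  # square the previous entry
    return seq


def is_witness(a, n):
    """True if a exposes n as composite by either violation in the text."""
    seq = squaring_sequence(a, n)
    if seq[-1] != 1:                       # a^(n-1) mod n != 1
        return True
    # A 1 that immediately follows some x outside {1, n-1}.
    return any(cur == 1 and prev not in (1, n - 1)
               for prev, cur in zip(seq, seq[1:]))


def miller_rabin_is_prime(n, k, rng=random):
    """Sketch of miller-rabin-isPrime?; small and even n handled directly."""
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0:
        return False
    return not any(is_witness(rng.randrange(1, n), n) for _ in range(k))


print(squaring_sequence(4, 561))  # [166, 67, 1, 1, 1]
# 561 passes the Fermat condition for every a relatively prime to it...
print(all(pow(a, 560, 561) == 1 for a in range(1, 561) if gcd(a, 561) == 1))  # True
# ...but a = 4 is a Miller-Rabin witness, because 67^2 mod 561 = 1.
print(is_witness(4, 561))  # True
```

The built-in three-argument `pow(a, e, n)` performs the modular exponentiation, so no intermediate value grows beyond roughly n squared.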

7.4.4 Exercises
7.80 Following Example 7.18, identify the numbers that are half of every element in $\mathbb{Z}_9$. (That is, for each $a \in \mathbb{Z}_9$, find $b \in \mathbb{Z}_9$ such that $2b = a$.)
We talked extensively in this section about multiplicative inverses, but there can be inverses for other operations, too. The next few exercises explore the additive inverse in Zn. Notice that the additive identity in Zn is 0: for any a ∈ Zn, we have a + 0 ≡n 0 + a ≡n a. The additive inverse of a ∈ Zn is typically denoted −a.
7.81 Give an algorithm to find the additive inverse of any a ∈ Zn. (Be careful: the additive inverse of a has to be a value from Zn, so you can’t just say that 3’s additive inverse is negative 3!)
Given your solution to the previous exercise, prove the following properties:
7.82 For any a ∈ Zn, we have −(−a) ≡n a.
7.83 For any $a, b \in \mathbb{Z}_n$, we have $a \cdot (-b) \equiv_n (-a) \cdot b$.
7.84 For any $a, b \in \mathbb{Z}_n$, we have $a \cdot b \equiv_n (-a) \cdot (-b)$.
In regular arithmetic, for a number $x \in \mathbb{R}$, a square root of $x$ is a number $y$ such that $y^2 = x$. If $x = 0$, there’s only one such $y$, namely $y = 0$. If $x < 0$, there’s no such $y$. If $x > 0$, there are two such values $y$ (one positive and one negative). Consider the following claim, and prove or disprove it.
7.85 Let $n \geq 2$ be arbitrary. Then (i) there exists one and only one $b \in \mathbb{Z}_n$ such that $b^2 \equiv_n 0$; and (ii) for any $a \in \mathbb{Z}_n$ with $a \neq 0$, there is not exactly one $b \in \mathbb{Z}_n$ such that $b^2 \equiv_n a$. (Hint: think about Exercise 7.81.)
Using paper and pencil (and brute-force calculation), compute the following multiplicative inverses (or state that the inverse doesn’t exist):
7.86 $4^{-1}$ in $\mathbb{Z}_{11}$
7.87 $7^{-1}$ in $\mathbb{Z}_{11}$
7.88 $0^{-1}$ in $\mathbb{Z}_{11}$
7.89 $5^{-1}$ in $\mathbb{Z}_{15}$
7.90 $7^{-1}$ in $\mathbb{Z}_{15}$
7.91 $9^{-1}$ in $\mathbb{Z}_{15}$
7.92 Prove that the multiplicative inverse is unique: that is, for arbitrary $n \geq 2$ and $a \in \mathbb{Z}_n$, suppose that $ax \equiv_n 1$ and $ay \equiv_n 1$. Prove that $x \equiv_n y$.
Write down the full multiplication table (as in Figure 7.17) for the following:
7.93 $\mathbb{Z}_5$
7.94 $\mathbb{Z}_6$
7.95 $\mathbb{Z}_8$
For arbitrary $n \geq 2$ and $a \in \mathbb{Z}_n$:
7.96 Prove or disprove the following: $(n-1)^{-1} = n - 1$ in $\mathbb{Z}_n$.
7.97 Prove that $(a^{-1})^{-1} = a$: that is, $a$ is the multiplicative inverse of the multiplicative inverse of $a$.
7.98 Prove that there exists $x \in \mathbb{Z}$ with $ax \equiv_n 1$ if and only if there exists $y \in \mathbb{Z}_n$ with $ay \equiv_n 1$.
7.99 Suppose that the multiplicative inverse $a^{-1}$ exists in $\mathbb{Z}_n$. Let $k \in \mathbb{Z}_n$ be any exponent. Prove that $a^k$ has a multiplicative inverse in $\mathbb{Z}_n$, and, in particular, prove that the multiplicative inverse of $a^k$ is the $k$th power of the multiplicative inverse of $a$. (That is, prove that $(a^k)^{-1} \equiv_n (a^{-1})^k$.)
Using paper and pencil and the algorithm based on the Extended Euclidean algorithm, compute the following multiplicative inverses (or explain why they don’t exist). See Figure 7.22 for a reminder.
7.100 $17^{-1}$ in $\mathbb{Z}_{23}$
7.101 $7^{-1}$ in $\mathbb{Z}_{25}$
7.102 $9^{-1}$ in $\mathbb{Z}_{33}$
7.103 (programming required) Implement inverse(a, n) from Figure 7.18 in a language of your choice.
extended-Euclid(n, m):
Input: positive integers n and m ≥ n.
Output: x, y, r ∈ Z where gcd(n, m) = r = xn + ym
if m mod n = 0 then
    return 1, 0, n    // 1·n + 0·m = n = gcd(n, m)
else
    x, y, r := extended-Euclid(m mod n, n)
    return y − ⌊m/n⌋ · x, x, r

inverse(a, n):
Input: $a \in \mathbb{Z}_n$ and n ≥ 2
Output: $a^{-1}$ in $\mathbb{Z}_n$, if it exists
x, y, d := extended-Euclid(a, n)
if d = 1 then
    return x mod n    // xa + yn = 1, so $xa \equiv_n 1$.
else
    return “no inverse for a exists in Zn.”
Figure 7.22: A reminder of two algorithms.

7.104 Prove or disprove the converse of Corollary 7.21: if $n$ is composite, then there exists $a \in \mathbb{Z}_n$ (with $a \neq 0$) that does not have a multiplicative inverse in $\mathbb{Z}_n$.

7.105 Let $p$ be an arbitrary prime number. What value does the quantity $2^{p+1} \bmod p$ have? Be as specific as you can. Explain.
7.106 It turns out that $247^{248} \bmod 249 = 4$. From this, you can conclude at least one of the following: 247 is not prime; 247 is prime; 249 is not prime; or 249 is prime. Which one(s)? Explain.
7.107 Reprove the general version of the Chinese Remainder Theorem with a single constructive argument, as in the 2-congruence case, instead of using induction. Namely, assume $n_1, n_2, \ldots, n_k$ are pairwise relatively prime, and let $a_i \in \mathbb{Z}_{n_i}$. Let $N := \prod_{i=1}^{k} n_i$. Let $N_i := N/n_i$ (more precisely, let $N_i$ be the product of all the $n_j$s except $n_i$) and let $d_i$ be the multiplicative inverse of $N_i$ in $\mathbb{Z}_{n_i}$. Prove that $x := \sum_{i=1}^{k} a_i N_i d_i$ satisfies the congruence $x \bmod n_i = a_i$ for all $1 \leq i \leq k$.
The totient function $\varphi : \mathbb{Z}^{\geq 1} \to \mathbb{Z}^{\geq 0}$, sometimes called Euler’s totient function after the 18th-century Swiss mathematician Leonhard Euler, is defined as
$\varphi(n)$ := the number of $k$ with $1 \leq k \leq n$ such that $k$ and $n$ have no common divisors other than 1.
For example, $\varphi(6) = 2$ because 1 and 5 have no common divisors (other than 1) with 6, but all of {2, 3, 4, 6} do share a common divisor with 6. There’s a generalization of Fermat’s Little Theorem, sometimes called the Fermat–Euler Theorem or Euler’s Theorem, that states the following: if $a$ and $n$ are relatively prime, then $a^{\varphi(n)} \equiv_n 1$.
7.108 Using the Fermat–Euler theorem, argue that (i) Fermat’s Little Theorem holds.
(ii) $a^{-1}$ in $\mathbb{Z}_n$ is $a^{\varphi(n)-1} \bmod n$, for any $a \in \mathbb{Z}_n$ that is relatively prime to $n$. Verify the latter claim for the multiplicative inverses of $a \in \{7, 17, 31\}$ in $\mathbb{Z}_{60}$.
7.109 (programming required) Implicitly, the Fermat–Euler theorem gives a different way to compute the multiplicative inverse of $a$ in $\mathbb{Z}_n$:
1. compute $\varphi(n)$ [say by brute force, though there are somewhat faster ways; see Exercises 9.34–9.36]; and
2. compute $a^{\varphi(n)-1} \bmod n$ [perhaps using repeated squaring; see Figure 7.7].
Implement this algorithm to compute $a^{-1}$ in $\mathbb{Z}_n$ in a programming language of your choice.
Recall that a Carmichael number is a composite number that passes the (bogus) primality test suggested by Fermat’s Little Theorem. In other words, a Carmichael number $n$ is an integer that is composite but such that, for any $a \in \mathbb{Z}_n$ that’s relatively prime to $n$, we have $a^{n-1} \bmod n = 1$.
7.110 (programming required) Write a program to verify that 561 is (a) not prime, but (b) satisfies $a^{560} \bmod 561 = 1$ for every $a \in \{1, \ldots, 560\}$ that’s relatively prime to 561. (That is, verify that 561 is a Carmichael number.)
7.111 Suppose $n$ is a composite integer. Argue that there exists at least one integer $a \in \{1, 2, \ldots, n-1\}$ such that $a^{n-1} \not\equiv_n 1$. (In other words, there’s always at least one nonzero $a \in \mathbb{Z}_n$ with $a^{n-1} \not\equiv_n 1$ when $n$ is composite. Thus, although the probability of error in bogus-isPrime? from p. 742 may be very high for particular composite integers $n$, the probability of success is nonzero, at least!)
The following theorem is due to Alwin Korselt, from 1899: an integer $n$ is a Carmichael number if and only if $n$ is composite, squarefree, and for all prime numbers $p$ that divide $n$, we have that $p - 1 \mid n - 1$. (An integer $n$ is squarefree if there is no integer $d \geq 2$ such that $d^2 \mid n$.)
7.112 (programming required) Use Korselt’s theorem (and a program) to find all Carmichael numbers less than 10,000.
7.113 Use Korselt’s theorem to prove that all Carmichael numbers are odd.
7.114 (programming required) Implement the Miller–Rabin primality test (see p. 742) in a language of
your choice.

7.5 Cryptography
Three may keep a secret, if two of them are dead. Benjamin Franklin (1706–1790)
In the rest of this chapter, we will make use of the number-theoretic machinery that we’ve now developed to explore cryptography. Imagine that a sender, named Alice, is trying to send a secret message to a receiver, named Bob. The goal of cryptography
is to ensure that the message itself is kept secret even if an eavesdropper—named Eve—overhears the transmission to Bob. To achieve this goal, Alice does not directly transmit the message m that she wishes to send to Bob; instead, she encrypts m in some way. The resulting encrypted message c is what’s transmitted to Bob. (The original message m is called plaintext; the encrypted message c that’s sent to Bob is called the ciphertext.) Bob then decrypts c to recover the original message m. A diagram of the basic structure of a cryptographic system is shown in Figure 7.23.
Traditionally, cryptographic systems are described using an imagined crew of people whose names start with consecutive letters of the alphabet. We’ll stick with these traditional names: Alice, Bob, Charlie, etc.
[Figure 7.23: The outline of a cryptographic system. Alice encrypts the plaintext m (using information about Bob) to produce the ciphertext c, which is transmitted to Bob; Bob decrypts c (using Bob’s private information) to recover the plaintext m; Eve tries to decrypt c without Bob’s private information.]
The two obvious crucial properties of a cryptographic system are that (i) Bob can compute m from c, and (ii) Eve cannot compute m from c. (Of course, to make (i) and (ii) true simultaneously, it will have to be the case that Bob has some information that Eve doesn’t have; otherwise the task would be impossible!)
One-time pads
The simplest idea for a cryptographic system is for Alice and Bob to agree on a shared secret key that they will use as the basis for their communication. The easiest implementation of this idea is what’s called a one-time pad, which works as follows. Alice and Bob agree in advance on an integer $n$, denoting the length of the message that they would like to communicate. They also agree in advance on a secret bitstring $k \in \{0, 1\}^n$, where each bit $k_i \in \{0, 1\}$ is chosen independently and uniformly, so that every one of the $2^n$ different $n$-bit strings has a $\frac{1}{2^n}$ chance of being chosen as $k$. To encrypt a plaintext message $m \in \{0, 1\}^n$, Alice computes the bitwise exclusive or of $m$ and $k$; in other words, the $i$th bit of the ciphertext is $m_i \oplus k_i$. To decrypt the ciphertext $c \in \{0, 1\}^n$, Bob computes the bitwise XOR of $c$ and $k$.
The pad in the name comes from spycraft: spies might carry physical pads of paper, where each sheet has a fresh secret key written on it. The one-time in the name derives from the fact that this system is secure only if the same key is never reused, as we’ll discuss.

Example 7.26 (A One-Time Pad)
• Alice and Bob agree (in advance) on the secret key k = 10111000.
• To transmit the message m = 01101110, Alice finds the bitwise XOR of m and k:
            m     01101110
            k     10111000
    c = m ⊕ k     11010110
• To decrypt the ciphertext c = 11010110, Bob finds the bitwise XOR of c and k:
            c     11010110
            k     10111000
        c ⊕ k     01101110
Observe that c ⊕ k = 01101110 is indeed precisely m = 01101110, as desired.
The reason that Bob can decrypt the ciphertext to recover the original message m is simple: for any bits a and b, it’s the case that (a ⊕ b) ⊕ b = a. (See Figure 7.24.) The fact that Eve cannot recover m from c relies on the fact that, for any message m and every ciphertext c, there is precisely one secret key k such that m ⊕ k = c. (So Eve is just as likely to see a particular ciphertext regardless of what the message is, and therefore she gains no information about m by seeing c. See Exercise 7.116.) Thus the one-time pad is perfectly secure as a cryptographic system—if Alice and Bob only use it once! If Alice and Bob reuse the same key to exchange many different messages, then Eve can use frequency analysis to get a handle on the key, and therefore can begin to decode the allegedly secret messages. (See Exercises 10.72–10.76 or Exercise 7.117.)
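The one-time pad is short enough to sketch in a few lines of Python. The sketch below represents bitstrings as Python strings of '0' and '1' characters, which is an assumption made for readability; the function names `generate_key` and `xor_bits` are hypothetical. It reproduces Example 7.26 and then illustrates one reason that reusing a key is dangerous.

```python
import secrets


def generate_key(n):
    """Choose each of the n key bits independently and uniformly at random."""
    return "".join(secrets.choice("01") for _ in range(n))


def xor_bits(x, y):
    """Bitwise XOR of two equal-length bitstrings."""
    if len(x) != len(y):
        raise ValueError("bitstrings must have the same length")
    return "".join("1" if a != b else "0" for a, b in zip(x, y))


# Example 7.26: encryption and decryption are the same operation.
k = "10111000"
m = "01101110"
c = xor_bits(m, k)
print(c)               # 11010110
print(xor_bits(c, k))  # 01101110

# Reusing k for a second message leaks information: the key cancels out,
# so Eve learns the XOR of the two plaintexts without knowing k.
m2 = "00001111"
c2 = xor_bits(m2, k)
print(xor_bits(c, c2) == xor_bits(m, m2))  # True
```

The `secrets` module is used for key generation because it draws from a cryptographically strong source; any source of independent, uniform bits would fit the description in the text.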
Taking it further: One of the earliest encryption schemes is now known as a Caesar Cipher, after Julius Caesar, who used it in his correspondence. It can be understood as a cryptographic system that uses a one-time pad more than once. The Caesar cipher works as follows. The sender and receiver agree on a shift x, an integer, as their secret key. The ith letter in the alphabet (from A = 0 through Z = 25) will be shifted forward by x positions in the alphabet. The shift “wraps around,” so that we encode letter i as letter (i + x) mod 26. For example, if x = 3 then A→D, L→O, Y→B, etc. To send a text message m consisting of multiple letters from the alphabet, the same shift is applied to each letter. (For convenience, we’ll leave nonalphabetic characters unchanged.) For example, the ciphertext XF BSF EJTDPWFSFE; GMFF BU PODF! was generated with the shift x = 1 from the message WE ARE DISCOVERED; FLEE AT ONCE!. Because we’ve reused the same shift x for each letter of the message, the Caesar Cipher is susceptible to being broken based on frequency analysis. (In the XF BSF EJTDPWFSFE; GMFF BU PODF! example, F is by far the most common letter in the ciphertext, and E is by far the most common letter in English text. From these two facts, you might infer that x = 1 is the most probable secret key. See Exercise 7.117.)
Millennia later, the Enigma machine, the encryption system used by the Germans during World War II, was (as with Caesar) a substitution cipher, but one where the shift changed with each letter. (But not in completely unpredictable ways, as in a one-time pad!) See p. 960 for more.
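The Caesar cipher and the frequency-analysis attack described above can be sketched as follows. This is an illustrative Python sketch, not the book’s code: the function names `caesar_shift` and `guess_shift` are hypothetical, only uppercase letters A through Z are shifted, and the guess simply assumes that the most common ciphertext letter stands for E.

```python
from collections import Counter


def caesar_shift(text, x):
    """Shift each uppercase letter forward by x positions (mod 26);
    leave nonalphabetic characters unchanged."""
    shifted = []
    for ch in text:
        if "A" <= ch <= "Z":
            shifted.append(chr((ord(ch) - ord("A") + x) % 26 + ord("A")))
        else:
            shifted.append(ch)
    return "".join(shifted)


def guess_shift(ciphertext):
    """Guess the shift by assuming the most common letter encodes E."""
    letters = [ch for ch in ciphertext if "A" <= ch <= "Z"]
    most_common_letter = Counter(letters).most_common(1)[0][0]
    return (ord(most_common_letter) - ord("E")) % 26


plaintext = "WE ARE DISCOVERED; FLEE AT ONCE!"
ciphertext = caesar_shift(plaintext, 1)
print(ciphertext)                     # XF BSF EJTDPWFSFE; GMFF BU PODF!

x = guess_shift(ciphertext)
print(x)                              # 1
print(caesar_shift(ciphertext, -x))   # WE ARE DISCOVERED; FLEE AT ONCE!
```

Decryption is just a shift by −x; Python’s `%` operator returns a nonnegative result for a positive modulus, so the wraparound works in both directions. On short messages the single-letter guess can easily be wrong, in which case an attacker could simply try all 26 shifts.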
 a   b   a⊕b   (a⊕b)⊕b
 0   0    0       0
 0   1    1       0
 1   0    1       1
 1   1    0       1
Figure 7.24: The truth table for (a ⊕ b) ⊕ b = a.

Public-key cryptography
In addition to being single-use-only, there’s another strange thing about the one-time pad: if Alice and Bob are somehow able to communicate an n-bit string securely (as they must to share the secret key k), it doesn’t seem particularly impressive that they can then communicate the n-bit string m securely.
Public-key cryptography is an idea to get around this oddity. Here is the idea, in a nutshell. Every participant will have a public key and a private (or secret) key, which will somehow be related to the public key. A user’s public key is completely public: for example, posted on the web. If Alice wishes to send a message m to Bob, then Alice will (somehow!) encrypt her message to Bob using Bob’s public key, producing ciphertext c. Bob, who of course knows Bob’s secret key, can decrypt c to reconstruct m; Eve, not knowing Bob’s secret key, cannot decrypt c.
This idea sounds a little crazy, but we will be able to make it work. Or, at least, we will make it work on the assumption that Eve has only limited computational power—and on the assumption that certain computational problems, like factoring large numbers, require a lot of computational power to solve. (For example, Bob’s secret key cannot be easily computable from Bob’s public key—otherwise Eve could easily figure out Bob’s secret key and then run whatever decryption algorithm Bob uses!)
7.5.1 The RSA Cryptosystem
The basic idea of public-key cryptography was discussed in abstract terms in the 1970s—especially by Whitfield Diffie, Martin Hellman, and Ralph Merkle—and, after some significant contributions by a number of researchers, a cryptosystem successfully implementing public-key cryptography was discovered by Ron Rivest, Adi Shamir, and Leonard Adleman.7 The RSA cryptosystem, named after the first initials of their three last names, is one of the most famous, and widely used, cryptographic protocols today. The previous sections of this chapter will serve as the building blocks for the RSA system, which we’ll explore in the rest of this section.8
Taking it further: The RSA cryptosystem is named after its three 1978 discoverers, and the Turing Award—the highest honor in computer science, roughly equivalent to the Nobel Prize of computer science—was conferred on Rivest, Shamir, and Adleman in 2002 for this discovery. But there is also a “shadow history” of the advances in cryptography made in the second half of the 20th century.
The British government’s signal intelligence agency, called Government Communications Headquarters (GCHQ), had been working to solve precisely the same set of research questions about cryptography as academic researchers like R., S., and A. (GCHQ was perhaps best known for its success in World War II, in breaking the Enigma Code of the German military; see p. 960 for more discussion.) And it turned out that several British cryptographers at GCHQ (Clifford Cocks, James Ellis, and Malcolm Williamson) had discovered the RSA protocol several years before 1978. But their discovery was classified by the British government, and thus we call this protocol “RSA” instead of “CEW.”
See the excellent book by Simon Singh for more on the history of cryptography, including both the published and classified advances in cryptographic systems.8 Also see the discussion on p. 753 of the Diffie–Hellman key exchange protocol, one of the first (published) modern breakthroughs in cryptography, which allows Alice and Bob to solve another apparently impossible problem: exchanging secret information while communicating only over an insecure channel.
In RSA, as for any public-key cryptosystem, we must define three algorithmic components. (These three algorithms for the RSA cryptosystem are shown in Figure 7.25; an overview of the system is shown in Figure 7.26.) They are:
• key generation: how do Alice and Bob construct their public/private key pairs?
• encryption: when Alice wishes to send a message to Bob, how does she encode it?
• decryption: when Bob receives ciphertext from Alice, how does he decode it?
The very basic idea of RSA is the following. (The details of the protocols are in Figure 7.25.) To encrypt a numerical message $m$ for Bob, Alice will compute $c := m^e \bmod n$, where Bob’s public key is ⟨e, n⟩. To decrypt the ciphertext $c$ that he receives, Bob will compute $c^d \bmod n$, where Bob’s private key is ⟨d, n⟩. (Of course, there’s an important relationship among the quantities e, d, and n!)
7 R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21:120–126, February 1978.
8 Simon Singh. The Code Book: The Secret History of Codes and Code-breaking. Fourth Estate Ltd., 1999.

Key Generation:
1. Bob chooses two large primes, $p$ and $q$, and defines $n := pq$.
2. Bob chooses $e \neq 1$ such that $e$ and $(p-1)(q-1)$ are relatively prime.
3. Bob computes $d := e^{-1}$ modulo $(p-1)(q-1)$.
4. Bob publishes ⟨e, n⟩ as his public key; Bob’s secret key is ⟨d, n⟩.
Encryption: If Alice wants to send message m to Bob,
1. Alice finds Bob’s public key, say ⟨$e_{Bob}$, $n_{Bob}$⟩, as he published it.
2. To send message $m \in \{0, \ldots, n_{Bob} - 1\}$, Alice computes $c := m^{e_{Bob}} \bmod n_{Bob}$.
3. Alice transmits $c$ to Bob.
Decryption: When Bob receives ciphertext c,
1. Bob computes $m := c^{d_{Bob}} \bmod n_{Bob}$, where ⟨$d_{Bob}$, $n_{Bob}$⟩ is Bob’s secret key.
Figure 7.25: The RSA cryptosystem.

An example of RSA key generation, encryption, and decryption
Later we will prove that the message that Bob decrypts is always the same as the message that Alice originally sent. But we’ll start with an example. First, Bob generates a public and private key, using the protocol in Figure 7.25. (All three phases can be implemented efficiently, using techniques from this chapter; see Exercises 7.129–7.132.)
Example 7.27 (Generating an RSA keypair for Bob)
For good security properties, we’d want to pick seriously large prime numbers p and q, but to make the computation easier to see we’ll choose very small primes.
1. Suppose we choose the “large” primes p = 13 and q = 17. Then n := 13 · 17 = 221.
2. We now must choose a value of e ≠ 1 that is relatively prime to (p − 1)(q − 1) = 12 · 16 = 192. Note that gcd(2, 192) = 2 ≠ 1, so e = 2 fails. Similarly gcd(3, 192) = 3 and gcd(4, 192) = 4. But gcd(5, 192) = 1. We pick e := 5.
3. We now compute d := inverse(e, (p − 1)(q − 1)), that is, $d := e^{-1}$ in $\mathbb{Z}_{(p-1)(q-1)}$:
    extended-Euclid(5, 192)
      = y − ⌊m/n⌋ · x, x, r    where x, y, r := extended-Euclid(192 mod 5 = 2, 5) = −2, 1, 1 (exactly as in Example 7.22), and m = 192, n = 5
      = 77, −2, 1.
   Thus inverse(5, 192) returns 77 mod 192 = 77.
   (Indeed, 5 · 77 = 385 = 192 · 2 + 1, so $5 \cdot 77 \equiv_{192} 1$.) Thus we set d := 77.
Thus Bob’s public key is ⟨e, n⟩ = ⟨5, 221⟩, and Bob’s secret key is ⟨d, n⟩ = ⟨77, 221⟩.
It may seem strange that n is part of both Bob’s secret key and Bob’s public key; it’s usually done this way for symmetry, but also to support digital signatures. When Alice sends Bob a message, she can encrypt it using her own secret key; Bob can then decrypt the message using Alice’s public key to verify that Alice was indeed the person who sent the message.
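The key-generation steps of Example 7.27 can be traced in code. The following Python sketch transcribes extended-Euclid and inverse from Figure 7.22 and wraps them in a key-generation helper; the name `rsa_keypair`, its signature, and the use of exceptions for invalid inputs are assumptions made for illustration, and the sketch does not check that p and q are actually prime.

```python
from math import gcd


def extended_euclid(n, m):
    """Return (x, y, r) with gcd(n, m) = r = x*n + y*m, following Figure 7.22."""
    if m % n == 0:
        return 1, 0, n                      # 1*n + 0*m = n = gcd(n, m)
    x, y, r = extended_euclid(m % n, n)
    return y - (m // n) * x, x, r


def inverse(a, n):
    """Return the multiplicative inverse of a in Z_n, or raise if none exists."""
    x, _, d = extended_euclid(a, n)
    if d != 1:
        raise ValueError(f"no inverse for {a} exists in Z_{n}")
    return x % n


def rsa_keypair(p, q, e):
    """Return (public key, secret key) = ((e, n), (d, n)) for primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if e == 1 or gcd(e, phi) != 1:
        raise ValueError("e must differ from 1 and be relatively prime to (p-1)(q-1)")
    d = inverse(e, phi)
    return (e, n), (d, n)


print(extended_euclid(5, 192))  # (77, -2, 1)
print(rsa_keypair(13, 17, 5))   # ((5, 221), (77, 221))
```

Rejecting e = 2 works as in the example: `rsa_keypair(13, 17, 2)` raises an error because gcd(2, 192) = 2.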

Bob now publishes his public key somewhere, keeping his secret key to himself. If Alice now wishes to send a message to Bob, she uses his public key, as follows:
Example 7.28 (Encrypting a message with RSA)
To send message m = 202 to Bob, whose public key is ⟨e, n⟩ = ⟨5, 221⟩, Alice computes $m^e \bmod n = 202^5 \bmod 221 = 336{,}323{,}216{,}032 \bmod 221 = 206$.
Thus she sends Bob the ciphertext c := 206.
When Bob receives an encrypted message, he uses his secret key to decrypt it:
Example 7.29 (Decrypting a message with RSA)
When Bob, whose secret key is ⟨d, n⟩ = ⟨77, 221⟩, receives the ciphertext c = 206 from
Alice, he decrypts it as
$c^d \bmod n = 206^{77} \bmod 221$.
Computing $206^{77} \bmod 221$ by hand is a bit tedious, but we can calculate it with “repeated squaring” (using the fact that $b^{2k} \bmod n = (b^2 \bmod n)^k \bmod n$ and $b^{2k+1} \bmod n = b \cdot (b^{2k} \bmod n) \bmod n$; see Exercises 7.23–7.25):
\begin{align*}
206^{77} \bmod 221 &= 206 \cdot (\underbrace{206^2 \bmod 221}_{=4})^{38} \bmod 221 \\
&= 206 \cdot (\underbrace{4^2 \bmod 221}_{=16})^{19} \bmod 221 \\
&= 206 \cdot 16 \cdot (\underbrace{16^2 \bmod 221}_{=35})^{9} \bmod 221 \\
&= 206 \cdot 16 \cdot 35 \cdot (\underbrace{35^2 \bmod 221}_{=120})^{4} \bmod 221 \\
&= 206 \cdot 16 \cdot 35 \cdot (\underbrace{120^2 \bmod 221}_{=35})^{2} \bmod 221 \\
&= 206 \cdot 16 \cdot 35 \cdot (\underbrace{35^2 \bmod 221}_{=120}) \bmod 221 \\
&= \underbrace{206 \cdot 16 \cdot 35 \cdot 120}_{=13{,}843{,}200} \bmod 221 \\
&= 202.
\end{align*}
Thus Bob decrypts the ciphertext 206 as $202 = 206^{77} \bmod 221$. Indeed, then, the message that Bob receives is precisely 202, the same message that Alice sent!
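The repeated-squaring calculation above follows a simple recursive pattern, which the following Python sketch mirrors. The function name `mod_exp` is a hypothetical choice; Python’s built-in three-argument `pow` computes the same quantity and is used here only as a cross-check.

```python
def mod_exp(b, k, n):
    """Compute b^k mod n by repeated squaring, using
    b^(2k) mod n = (b^2 mod n)^k mod n  and
    b^(2k+1) mod n = b * (b^(2k) mod n) mod n."""
    if k == 0:
        return 1 % n
    if k % 2 == 0:
        return mod_exp((b * b) % n, k // 2, n)
    return (b * mod_exp(b, k - 1, n)) % n


# Example 7.28: Alice encrypts m = 202 with Bob's public key <5, 221>.
c = mod_exp(202, 5, 221)
print(c)                     # 206

# Example 7.29: Bob decrypts c = 206 with his secret key <77, 221>.
print(mod_exp(c, 77, 221))   # 202

# Cross-check against the built-in modular exponentiation.
print(mod_exp(206, 77, 221) == pow(206, 77, 221))  # True
```

Because the exponent at least halves after every two recursive calls, the number of multiplications grows with the number of bits of the exponent rather than with the exponent itself, which is why the by-hand calculation needed only a handful of squarings.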
We’ve now illustrated the full RSA protocol: generating a key, and encrypting and decrypting a message. Here’s one more chance to work through the full pipeline:
[Diagram: Alice encrypts the plaintext m as the ciphertext $c := m^e \bmod n$ using Bob’s public key ⟨e, n⟩; Bob decrypts by computing $c^d \bmod n$ using his private key ⟨d, n⟩; Eve eavesdrops on the transmission.]
Example 7.30 (RSA, again, from end to end)
Problem: Bob generates a public/private keypair using the primes p = 11 and q = 13, choosing the smallest valid value of e. You encrypt the message 95 to send to Bob (using his generated public key). What ciphertext do you send to Bob?
Solution: For ⟨p, q⟩ = ⟨11, 13⟩, we have pq = 143 and (p − 1)(q − 1) = 120. Because 120 is divisible by 2, 3, 4, 5, and 6 but gcd(120, 7) = 1, we choose e := 7. We find d := inverse(7, 120) = 103. Then Bob’s public key is ⟨e, n⟩ = ⟨7, 143⟩ and Bob’s private key is ⟨d, n⟩ = ⟨103, 143⟩.
To send Bob the message m = 95, we compute $m^e \bmod n = 95^7 \bmod 143$, which is 17. Thus the ciphertext is c := 17. (Bob would decrypt this ciphertext as $c^d \bmod n = 17^{103} \bmod 143$, which indeed is 95.)
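The whole pipeline of Example 7.30 fits in a few lines of Python. In this sketch the helper names `smallest_valid_e` and `rsa_end_to_end` are hypothetical, and the modular inverse is computed with the built-in `pow(e, -1, phi)` (available in Python 3.8 and later) as a stand-in for the book’s inverse algorithm.

```python
from math import gcd


def smallest_valid_e(phi):
    """Return the smallest e >= 2 that is relatively prime to phi."""
    e = 2
    while gcd(e, phi) != 1:
        e += 1
    return e


def rsa_end_to_end(p, q, m):
    """Generate keys from primes p and q, then encrypt and decrypt m.
    Returns (e, d, n, ciphertext, decrypted message)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    e = smallest_valid_e(phi)
    d = pow(e, -1, phi)        # multiplicative inverse of e modulo phi
    c = pow(m, e, n)           # encryption with the public key <e, n>
    decrypted = pow(c, d, n)   # decryption with the secret key <d, n>
    return e, d, n, c, decrypted


print(rsa_end_to_end(11, 13, 95))  # (7, 103, 143, 17, 95)
```

Running the same helper on the primes of Example 7.27 with the message of Example 7.28, `rsa_end_to_end(13, 17, 202)`, gives e = 5, d = 77, n = 221, ciphertext 206, and decrypted message 202, matching the earlier examples.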
7.5.2 The Correctness of RSA
Examples 7.27–7.29 gave one instance of the RSA cryptosystem working properly, in the sense that decrypt(encrypt(m)) turned out to be the original message m itself—but, of course, we want this property to be true in general. Let’s prove that it is. Before we give the full statement of correctness, we’ll prove an intermediate lemma:
Figure 7.26: A schematic of the RSA cryptosystem, where n = pq and de ≡(p−1)(q−1) 1, for two prime numbers p and q.
Lemma 7.25 (Correctness of RSA: decrypting the ciphertext, modulo p or q)
Suppose e, d, p, q, n are all as specified in the RSA key generation protocol: that is, n = pq for primes p and q, and ed ≡_{(p−1)(q−1)} 1. Let m ∈ Z_n be any message. Then
m′ := [(m^e mod n)^d mod n] (the decryption of the encryption of m) satisfies both m′ ≡_p m and m′ ≡_q m.
Proof. We'll prove m′ ≡_p m; because p and q are symmetric in the definition, m′ ≡_q m follows immediately. Recall that we chose d so that ed ≡_{(p−1)(q−1)} 1, and thus we have ed = k(p − 1)(q − 1) + 1 for some integer k. Hence

\[
\begin{aligned}
[(m^{e} \bmod n)^{d} \bmod n] \bmod p
 &= (m^{ed} \bmod n) \bmod p && \text{by (7.3.4)} \\
 &= (m^{k(p-1)(q-1)+1} \bmod pq) \bmod p && \text{by definition of } e, d, n, \text{ and } k \\
 &= m^{k(p-1)(q-1)+1} \bmod p && \text{by Exercise 7.18} \\
 &= [m \cdot m^{k(p-1)(q-1)}] \bmod p && a^{k+1} = a \cdot a^{k} \\
 &= [(m \bmod p) \cdot (m^{k(p-1)(q-1)} \bmod p)] \bmod p && \text{by (7.3.3)} \\
 &= [(m \bmod p) \cdot ((m^{k(q-1)} \bmod p)^{p-1} \bmod p)] \bmod p && \text{by (7.3.4)}
\end{aligned}
\]

Although it's not completely obvious, we're actually almost done: we've now shown

\[
[(m^{e} \bmod n)^{d} \bmod n] \bmod p
 = [(m \bmod p) \cdot \underbrace{((m^{k(q-1)} \bmod p)^{p-1} \bmod p)}_{\text{highlighted}}] \bmod p. \tag{$*$}
\]

If only the highlighted portion of the right-hand side of (∗) were equal to 1, we'd have shown exactly the desired result, because the right-hand side would then equal [(m mod p) · 1] mod p = m mod p mod p = m mod p, which is exactly what we had to prove! And the good news is that the highlighted portion of (∗) matches the form of Fermat's Little Theorem: the highlighted expression is a^{p−1} mod p, where a := m^{k(q−1)} mod p, and Fermat's Little Theorem tells us a^{p−1} mod p = 1 as long as a ≢_p 0 (that is, as long as p ∤ a). (We'll also have to handle the case when a is divisible by p, but we'll be able to do that separately.) Here are the two cases:

• If a ≢_p 0, then we can use Fermat's Little Theorem:

\[
\begin{aligned}
[(m^{e} \bmod n)^{d} \bmod n] \bmod p
 &= [(m \bmod p) \cdot (a^{p-1} \bmod p)] \bmod p && \text{by } (*) \\
 &= [(m \bmod p) \cdot 1] \bmod p && \text{by Fermat's Little Theorem} \\
 &= m \bmod p.
\end{aligned}
\]

• If a ≡_p 0, then notice that m^{k(q−1)} mod p = 0 and thus that p | m^{k(q−1)}. Therefore:

\[
\begin{aligned}
[(m^{e} \bmod n)^{d} \bmod n] \bmod p
 &= [(m \bmod p) \cdot (a^{p-1} \bmod p)] \bmod p && \text{by } (*) \\
 &= [(m \bmod p) \cdot 0] \bmod p && \text{by the assumption that } a \equiv_p 0 \\
 &= 0 \\
 &= m \bmod p,
\end{aligned}
\]

where the last equality follows because p is prime and p | m^{k(q−1)}; thus Exercise 7.48 tells us that p | m as well.

We've now established that [(m^e mod n)^d mod n] mod p = m mod p in both cases, and thus the lemma follows.

Problem-solving tip: If there's a proof outline that will establish a desired claim except in one or two special cases, then try to “break off” those special cases and handle them separately. Here we handled the “normal” case a ≢_p 0 using Fermat's Little Theorem, and broke off the special a ≡_p 0 case and handled it separately.
Using Lemma 7.25 to do most of the work, we can now prove the main theorem:

Theorem 7.26 (Correctness of RSA)
Suppose that Bob's RSA public key is ⟨e, n⟩ and his corresponding private key is ⟨d, n⟩. Let m ∈ Z_n be any message. Then decrypt_Bob(encrypt_Bob(m)) = m.
Proof. Note that decrypt_Bob(encrypt_Bob(m)) = (m^e mod n)^d mod n. By Lemma 7.25,
(m^e mod n)^d mod n ≡_p m and (m^e mod n)^d mod n ≡_q m.
By Exercise 7.50, together these facts imply that (m^e mod n)^d mod n ≡_{pq} m as well. Because n = pq and m […]

[…] string->intlist(s, n) and intlist->string(L, n) that convert between strings of characters and a list of elements from Z_n. You may do this conversion in many ways, but it must be the case that these operations are inverses of each other: if string->intlist(s∗, n) = L∗, then intlist->string(L∗, n) = s∗. (Hint: the easiest way to do this conversion is to view text encoded as a sequence of ASCII symbols, each of which is an element of {0, 1, . . . , 255}. Thus you can view your input text as a number written in base 256. Your output is a number written in base n. Use baseConvert from p. 714.)
7.143 (programming required) Demonstrate that your implementations from Exercises 7.140, 7.141, and 7.142 are working properly by generating keys, encrypting, and decrypting using the primes p = 5,277,019,477,592,911 and q = 7,502,904,222,052,693, and the message “THE SECRET OF BEING BORING IS TO SAY EVERYTHING.” (Voltaire (1694–1778)).
Complete the last missing piece of your RSA implementation:
7.144 (programming required) Prime generation. The key generation implementation from Exercise 7.140 relies on being given two prime numbers. Write a function that, given a (sufficiently large) range of possible numbers between nmin and nmax, repeatedly does the following: choose a random integer between nmin and nmax, and test whether it’s prime using the Miller–Rabin test (see Exercise 7.114).
The Chinese Remainder Theorem tells us that m ∈ Z_{pq} is uniquely described by its value modulo p and q: that is, m mod p and m mod q fully describe m. Here's one way to improve the efficiency of RSA using this observation: rather than computing m := c^d mod pq directly, compute a := c^d mod p and b := c^d mod q. Then use the algorithm implicit in Theorem 7.14 to compute the value m with m mod p = a and m mod q = b.
7.145 (programming required) Modify your implementation of RSA to use the above idea.
7.146 Actually, instead of computing a := c^d mod p and b := c^d mod q, we could have computed a := c^(d mod (p−1)) mod p and b := c^(d mod (q−1)) mod q. Explain why this modification is valid. (This change can improve the efficiency of RSA, because now both the base and the exponent may be substantially smaller than they were in the regular RSA implementation.)
7.6 Chapter at a Glance Modular Arithmetic
Given integers k ≥ 1 and n, there exist unique integers d and r such that 0 ≤ r < k and kd + r = n. The value of d is 􏰄 n 􏰅, the (whole) number of times k goes into n; the value of k r is n mod k, the remainder when we divide n by k. Two integers a and b are equivalent or congruent mod n, written a ≡n b, if a and b have the same remainder when divided by n—that is, when a mod n = b mod n. For expressions taken mod n, we can always freely “reduce” mod n (subtracting multiples of n) before performing addition or multiplication. (See Theorem 7.3.) We write k | n to denote the proposition that n mod k = 0. If k | n, we say that k (evenly) divides n, that k is a factor of n, and that n is a multiple of k. See Theorem 7.4 for some useful properties of divisibility: for example, if a | b then, for any integer c, it’s also the case that a divides bc as well. The greatest common divisor gcd(n, m) of two positive integers n and m is the largest d that evenly divides both n and m; the least common multiple is the smallest d ∈ Z≥1 that n and m both evenly divide. GCDs can be computed efficiently using the Euclidean algorithm. (See Figure 7.28.) Primality and Relative Primality An integer p ≥ 2 is prime if the only positive integers that evenly divide it are 1 and p itself; an integer n ≥ 2 that is not prime is called composite. (Note that 1 is neither prime nor composite.) Let primes(n) denote the number of prime numbers less or equal than n. The Prime Number Theorem states that, as n gets large, the ratio between primes(n) and n converges (slowly!) to 1. Every positive integer can be factored into a product log n Figure 7.28: The Eu- clidean algorithm for GCDs. of zero or more prime numbers, and that factorization is unique up to the ordering of the factors. Two positive integers n and m are called relatively prime if they have no common factors aside from 1—that is, if gcd(n, m) = 1. 
Euclid(n, m):
Input: positive integers n and m ≥ n
Output: gcd(n, m)
  if m mod n = 0 then
    return n
  else
    return Euclid(m mod n, n)

A tweak to the Euclidean algorithm, called the Extended Euclidean algorithm, takes arbitrary positive integers n and m as input, and (efficiently) computes three integers x, y, r such that r = gcd(n, m) = xn + ym. (See Figure 7.29.) We can determine whether n and m are relatively prime using the (Extended) Euclidean algorithm.

Figure 7.29: The Extended Euclidean algorithm.

extended-Euclid(n, m):
Input: positive integers n and m ≥ n.
Output: x, y, r ∈ Z where gcd(n, m) = r = xn + ym
  if m mod n = 0 then
    return 1, 0, n    // 1·n + 0·m = n = gcd(n, m)
  else
    x, y, r := extended-Euclid(m mod n, n)
    return y − ⌊m/n⌋ · x, x, r

Let n1, n2, . . . , nk be a collection of integers, any pair of which is relatively prime. Let N := ∏_{i=1}^{k} n_i. Writing Z_m := {0, 1, . . . , m − 1}, the Chinese Remainder Theorem states that, for any sequence of values ⟨a1, . . . , ak⟩ with each a_i ∈ Z_{n_i}, there exists one and only one integer x ∈ Z_N such that x mod n_i = a_i for all 1 ≤ i ≤ k.

Multiplicative Inverses

For any integer n ≥ 2, let Z_n denote the set {0, 1, . . . , n − 1}. Let a ∈ Z_n be arbitrary. The multiplicative inverse of a in Z_n is the number a⁻¹ ∈ Z_n such that a · a⁻¹ ≡_n 1, if any such number exists. (If no such number exists, then a⁻¹ is undefined.) For example, the multiplicative inverse of 2 in Z_9 is 2⁻¹ = 5 because 2 · 5 = 10 ≡_9 1; the multiplicative inverse of 1 in Z_9 is 1⁻¹ = 1 because 1 · 1 ≡_9 1; and the multiplicative inverse of 3 in Z_9 is undefined (because 3a ≢_9 1 for any a ∈ Z_9).

Let n ≥ 2 and a ∈ Z_n. The multiplicative inverse a⁻¹ exists in Z_n if and only if n and a are relatively prime. Furthermore, when a⁻¹ exists, we can find it using the Extended Euclidean algorithm. We compute ⟨x, y, r⟩ := extended-Euclid(a, n); when gcd(a, n) = 1 (as it is when a and n are relatively prime), the returned values satisfy xa + yn = 1, and thus a⁻¹ := x mod n is the multiplicative inverse of a in Z_n.
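The recipe for multiplicative inverses summarized above can be sketched directly in Python. The function extended_euclid follows the pseudocode of Figure 7.29; the wrapper mod_inverse and its choice to return None when no inverse exists are illustrative assumptions rather than anything specified in the text.

```python
def extended_euclid(n, m):
    """Return (x, y, r) with r = gcd(n, m) = x*n + y*m, for positive n <= m."""
    if m % n == 0:
        return 1, 0, n                       # 1*n + 0*m = n = gcd(n, m)
    x, y, r = extended_euclid(m % n, n)      # r = x*(m mod n) + y*n
    return y - (m // n) * x, x, r


def mod_inverse(a, n):
    """Inverse of a in Z_n, or None if a and n are not relatively prime."""
    if a == 0:
        return None                          # 0 never has an inverse for n >= 2
    x, y, r = extended_euclid(a, n)
    if r != 1:
        return None
    return x % n


print(mod_inverse(2, 9))     # the example from the text: 2 * 5 = 10, which is 1 mod 9
print(mod_inverse(3, 9))     # undefined, because gcd(3, 9) = 3
print(mod_inverse(7, 120))   # the RSA private exponent from Example 7.30
```

The recursion depth is small for the sizes used here; for very large inputs an iterative version (or Python's built-in pow(a, -1, n)) would be the more practical choice.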
For a prime number p, every nonzero a ∈ Z_p has a multiplicative inverse in Z_p. Fermat's Little Theorem states that, for any prime p and any integer a with p ∤ a, the (p − 1)st power of a must equal 1 modulo p. (That is: for prime p and nonzero a ∈ Z_p, we have a^{p−1} ≡_p 1. For example, because 17 is prime, Fermat's Little Theorem (or arithmetic!) tells us that 5^16 mod 17 = 1.)

Cryptography

A sender (“Alice”) wants to send a private message to a receiver (“Bob”), but they can only communicate using a channel that can be overheard by an eavesdropper (“Eve”). In cryptography, Alice encrypts the message m (the “plaintext”) and transmits the encrypted version c (the “ciphertext”); Bob then decrypts it to recover the original message m. The simplest way to achieve this goal is with a one-time pad: Alice and Bob agree on a shared secret bitstring k; the ciphertext is the bitwise XOR of m and k, and Bob decrypts by computing the bitwise XOR of c and k.

A more useful infrastructure is public-key cryptography, in which Alice and Bob do not have to communicate a secret in advance. Every user has a public key and a (mathematically related) private key; to communicate with Bob, Alice uses Bob's public key for encryption (and Bob uses his private key for decryption). The RSA cryptosystem is a widely used protocol for public-key cryptography; it works as follows:
• Key generation: Bob finds large primes p and q; he chooses an e ≠ 1 that's relatively prime to (p − 1)(q − 1); and he computes d := e⁻¹ modulo (p − 1)(q − 1). Bob's public key is ⟨e, n⟩ and his private key is ⟨d, n⟩, where n := pq.
• Encryption: When Alice wants to send m to Bob, she encrypts m as c := m^e mod n.
• Decryption: Bob decrypts c as c^d mod n.
By our choices of n, p, q, d, and e, Fermat's Little Theorem allows us to prove that Bob's decryption of the encryption of message m is always the original message m itself.
And, under commonly held beliefs about the difficulty of factoring large numbers (and computing “e-th roots mod n”), Eve cannot compute m without spending an implausibly large amount of computation time.

Key Terms and Results

Key Terms
Key Results

Modular Arithmetic
1. For any integers k ≥ 1 and n, there exist unique integers d and r such that 0 ≤ r […]

[…] blue;
to get the first result shown in Figure 8.11.) The third key operation in relational databases, called join, corresponds closely to the composition of relations. In a join, we combine two relations by insisting that an identified shared column of the two relations matches. Unlike with the composition of relations, we continue to include that matching column in the resulting table:
• join: for two binary relations X ⊆ S × T and Y ⊆ T × U, the join of X and Y, denoted X ⋈ Y, is a ternary relation on S × T × U, defined as X ⋈ Y := {⟨a, c, b⟩ ∈ S × T × U : ⟨a, c⟩ ∈ X and ⟨c, b⟩ ∈ Y}.
In SQL syntax, this operation is denoted by INNER JOIN; for example, with S and T as in Figure 8.6, we can generate the second table in Figure 8.11 with
SELECT * FROM T INNER JOIN S ON T.senator = S.senator;
The era of relational databases is generally seen as starting with a massively influential paper by Edgar Codd:
1 Edgar F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377– 387, 1970.
“SQL” is short for Structured Query Language; it’s pronounced either like “sequel” or by spelling out the letters (to rhyme with “Bless you, Mel!”).
name      red   green   blue
Green       0     128      0
Lime        0     255      0
Magenta   255       0    255
Maroon    128       0      0
Navy        0       0    128
Olive     128     128      0
Purple    128       0    128
Red       255       0      0
Teal        0     128    128
White     255     255    255
Yellow    255     255      0
Figure 8.10: Some RGB colors.
We will only just brush the surface of relational databases here—there’s a full course’s worth of material on databases (and then some!) that we’ve left out. For more, see a good book on databases, like
2 Avi Silberschatz, Henry F. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill, 6th edition, 2010.
Figure 8.11: Selecting colors with green > blue and projecting to name, red; and joining S and T from Figure 8.6.

name     red
Lime       0
Yellow   255
Green      0
Olive    128

senator   party   state
Crapo     R       ID
Risch     R       ID
Durbin    D       IL
…

8.2.5 Exercises
Here are a few English-language descriptions of relations on a particular set. For each, write out (by exhaustive enumeration) the full set of pairs in the relation, as we did in Example 8.5.
8.1 divides, written |, on {1,2,…,8} (so ⟨d,n⟩ ∈ | if and only if n mod d = 0, as in Example 8.4).
8.2 subset, written ⊂, on P({1, 2, 3}) (so ⟨S, T⟩ ∈ ⊂ if and only if S ̸= T and ∀x : x ∈ S ⇒ x ∈ T).
8.3 isProperPrefix on bitstrings of length ≤ 3. See Example 8.5, but here we are considering proper prefixes only. A string x is a prefix, but not a proper prefix, of itself: more formally, x is a proper prefix of y if y starts with precisely the symbols of x, followed by one or more other symbols.
For two strings x and y, we say that x is a substring of y if the symbols of x appear consecutively somewhere in y. We say that x is a subsequence of y if the symbols of x appear in order, but not necessarily consecutively, in y. (For example, 001 is a substring of 1001 but not of 0101. But 001 is a subsequence of 1001 and also of 0101.) A string x is called a proper substring/subsequence of y if x is a substring/subsequence of y but x ≠ y. Again, write out (by exhaustive enumeration) the full set of pairs in these relations:
8.4 isProperSubstring on bitstrings of length ≤ 3
8.5 isProperSubsequence on bitstrings of length ≤ 3
Let ⊆ and ⊂ denote the subset and proper subset relations on P(Z). (That is, we have ⟨A, B⟩ ∈ ⊂ if A ⊆ B but A ≠ B.) What relation is represented by each of the following?
8.6 ⊆ ∪ ⊂
8.7 ⊆ − ⊂
8.8 ⊂ − ⊆
8.9 ⊂ ∩ ⊆
8.10 ∼⊂
Problem-solving tip: It's easy to miss an element of these relations if you solve these problems by hand. Consider writing a small program to enumerate all the pairs meeting the descriptions in Exercises 8.1–8.5.
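The kind of small enumeration program that the problem-solving tip has in mind might look like the following sketch. The helper names enumerate_relation and bitstrings are hypothetical, and the demonstration uses the (not necessarily proper) isPrefix relation on bitstrings of length at most 2, chosen here only as an illustration so that the exercises themselves are left to the reader.

```python
from itertools import product


def enumerate_relation(universe, related):
    """All pairs (x, y) over the universe for which related(x, y) holds."""
    return {(x, y) for x in universe for y in universe if related(x, y)}


def bitstrings(max_len):
    """All bitstrings of length 0, 1, ..., max_len (the empty string is '')."""
    return ["".join(bits)
            for k in range(max_len + 1)
            for bits in product("01", repeat=k)]


# isPrefix: x is a prefix of y exactly when y starts with x.
is_prefix = enumerate_relation(bitstrings(2), lambda x, y: y.startswith(x))

for pair in sorted(is_prefix):
    print(pair)
```

Swapping in a different universe and predicate adapts the same two helpers to other relations described in words.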
Consider the following two relations on {1, 2, 3, 4, 5, 6}: R = {⟨2, 2⟩, ⟨5, 1⟩, ⟨2, 3⟩, ⟨5, 2⟩, ⟨2, 1⟩} and S = {⟨3, 4⟩, ⟨5, 3⟩, ⟨6, 6⟩, ⟨1, 4⟩, ⟨4, 3⟩}. What pairs are in the following relations?
8.11 R⁻¹
8.12 S⁻¹
8.13 R ◦ R
8.14 R ◦ S
8.15 S ◦ R
8.16 R ◦ S⁻¹
8.17 S ◦ R⁻¹
8.18 S⁻¹ ◦ S
Five so-called mother sauces of French cooking were codified by the chef Auguste Escoffier in the early 20th century. (Many other sauces—“daughter” or “secondary” sauces—used in French cooking are derived from these basic recipes.) They are:
• Sauce Béchamel is made of milk, butter, and flour.
• Sauce Espagnole is made of stock, butter, and flour.
• Sauce Hollandaise is made of egg, butter, and lemon juice.
• Sauce Velouté is made of stock, butter, and flour.
• Sauce Tomate is made of tomatoes, butter, and flour.
8.19 Write down the “is an ingredient of” relation on Ingredients × Sauces using the tabular representation of relations introduced in Figure 8.2.
8.20 Writing R to denote the relation that you enumerated in Exercise 8.19, what is R ◦ R⁻¹? Give both a list of elements and an English-language description of what R ◦ R⁻¹ represents.
8.21 Again for the R from Exercise 8.19, what is R⁻¹ ◦ R? Again, give both a list of elements and a description of the meaning.
Suppose that a Registrar’s office has computed the following relations:
taughtIn ⊆ Classes × Rooms
taking ⊆ Students × Classes
at ⊆ Classes × Times.
For the following exercises, express the given additional relation using taughtIn, taking, and at, plus relation composition and/or inversion (and no other tools).
8.22 R ⊆ Students × Times, where ⟨s, t⟩ ∈ R indicates that student s is taking a class at time t.
8.23 R ⊆ Rooms × Times, where ⟨r, t⟩ ∈ R indicates that there is a class in room r at time t.
8.24 R ⊆ Students × Students, where ⟨s, s′⟩ ∈ R indicates that students s and s′ are taking at least one class in common.
8.25 R ⊆ Students × Students, where ⟨s, s′ ⟩ ∈ R indicates that there’s at least one time when s and s′ are both taking a class (but not necessarily the same class).

Let parent ⊆ People × People denote the relation {⟨p, c⟩ : p is a parent of c}. What familial relationships are represented by the following relations?
For the sake of simplicity in the following questions, assume that there are no divorces, remarriages, widows, widowers, adoptions, single parents, etc. That is, you should assume that each child has exactly two parents, and any two children who share one parent share both parents.
8.26 parent ◦ parent
8.27 (parent⁻¹) ◦ (parent⁻¹)
8.28 parent ◦ (parent⁻¹)
8.29 (parent⁻¹) ◦ parent
8.30 parent ◦ parent ◦ (parent⁻¹) ◦ (parent⁻¹)
8.31 parent ◦ (parent⁻¹) ◦ parent ◦ (parent⁻¹)
8.32 Suppose that the relations R ⊆ Z × Z and S ⊆ Z × Z contain, respectively, n pairs and m pairs of elements. In terms of n and m, what’s the largest possible size of R ◦ S? The smallest?
Consider the following claims about the composition of relations.
8.33 For arbitrary relations R, S, and T, prove that R ◦ (S ◦ T) = (R ◦ S) ◦ T.
8.34 For arbitrary relations R and S, prove that (R ◦ S)⁻¹ = (S⁻¹ ◦ R⁻¹).
8.35 Let R be any relation on A × B. Prove or disprove: ⟨x, x⟩ ∈ R ◦ R⁻¹ for every x ∈ A.
8.36 What set is represented by the relation ≤ ◦ ≥, where ≤ and ≥ are relations on R?
8.37 What set is represented by the relation successor ◦ predecessor, where successor = {⟨n, n + 1⟩ : n ∈ Z}
and predecessor = {⟨n, n − 1⟩ : n ∈ Z}?
Suppose that R ⊆ A × B and T ⊆ B × C are relations. Prove the following:
8.38 If R and T are both functions, then T ◦ R is a function too.
8.39 If R and T are both one-to-one functions, then T ◦ R is one-to-one too.
8.40 If R and T are both onto functions, then T ◦ R is onto too.
The next few exercises ask you to address the converse of the last few. Supposing that T ◦ R has the listed property, can you infer that both relations R and T have the same property? Only R? Only T? Neither? Prove your answers.
8.41 T ◦ R is a function. Must T be a function? R? Both?
8.42 T ◦ R is a one-to-one function and R and T are both functions. Must T be one-to-one? R? Both?
8.43 T ◦ R is an onto function and R and T are both functions. Must T be onto? R? Both?
On p. 815, we introduced three operations on relations that are used frequently in relational databases:
• select, which chooses a subset of the elements of an n-ary relation. For R ⊆ A1 × · · · × An and a
function φ : A1 × · · · × An → {True, False}, we can select only those elements of R that satisfy φ.
• project, which turns an n-ary relation into an n′-ary relation for some n′ ≤ n by eliminating components. For R ⊆ A1 × · · · × An and S ⊆ {1, 2, …, n}, we can project R into a smaller set of columns by removing the ith component of each tuple in R for any i ∉ S.
• join, which combines two binary relations R ⊆ A × B and S ⊆ B × C into a single ternary relation containing triples ⟨a, b, c⟩ such that ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ S.
For example, let R = {⟨1,2,3⟩,⟨4,5,6⟩}, let S = {⟨6,7⟩,⟨6,8⟩}, and let T = {⟨7,9⟩,⟨7,10⟩}. Then
• select(R, xzEven) = {⟨4, 5, 6⟩} for xzEven(x, y, z) = (2 | x) ∧ (2 | z).
• project(R, {1, 2}) = {⟨1, 2⟩, ⟨4, 5⟩} and project(R, {1, 3}) = {⟨1, 3⟩, ⟨4, 6⟩}.
• join(S, T) = {⟨6, 7, 9⟩, ⟨6, 7, 10⟩}.
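The three operations can be sketched in a few lines of Python, representing a relation as a set of tuples. This is only an illustration under that representation (the parameter name keep and the 1-indexed column convention are choices made here to match the notation above); it reproduces the three worked results for R, S, and T.

```python
def select(R, phi):
    """Keep only the tuples of R that satisfy the predicate phi."""
    return {t for t in R if phi(*t)}


def project(R, keep):
    """Keep only the columns whose (1-indexed) positions are in `keep`."""
    columns = sorted(keep)
    return {tuple(t[i - 1] for i in columns) for t in R}


def join(X, Y):
    """Join binary relations X and Y on X's second and Y's first component."""
    return {(a, b, c) for (a, b) in X for (b2, c) in Y if b == b2}


R = {(1, 2, 3), (4, 5, 6)}
S = {(6, 7), (6, 8)}
T = {(7, 9), (7, 10)}


def xz_even(x, y, z):
    return x % 2 == 0 and z % 2 == 0


print(select(R, xz_even))
print(project(R, {1, 2}), project(R, {1, 3}))
print(join(S, T))
```

Because Python sets are unordered, the printed order of tuples may vary from run to run; the contents are what matter.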
Color     R     G     B
White     255   255   255
Red       255     0     0
Lime        0   255     0
Blue        0     0   255
Cyan        0   255   255
Magenta   255     0   255
Yellow    255   255     0
Black       0     0     0
Gray      128   128   128
Maroon    128     0     0
Green       0   128     0
Navy        0     0   128
Teal        0   128   128
Purple    128     0   128
Olive     128   128     0
Figure 8.12: A 4-ary relation C (see Example 8.16).

Solve the following using the relation operators ⁻¹ (inverse), ◦ (composition), select, project, and join:
8.44 Recall from Example 8.15 the “betweenness” relation, defined as the ternary relation B := {⟨x, y, z⟩ ∈ ℝ^3 : x ≤ y ≤ z or x ≥ y ≥ z}. Show how to construct B using only ≤, the relation operators (⁻¹, ◦, join, select, project), and standard set-theoretic operations (∪, ∩, ∼, −).
Using the relation C defined in Figure 8.12, and select/project/join, write a set that corresponds to the following:
8.45 the names of all colors that have red component 0.
8.46 the names of all pairs of colors whose amount of blue is the same.
8.47 the names of all colors that are more blue than red.
Let X denote the set of color names from Example 8.16. Define three relations Red, Green, and Blue on X × {0,1,…,255} such that ⟨x,r,g,b⟩ ∈ C if and only if ⟨x,r⟩ ∈ Red, ⟨x,g⟩ ∈ Green, and ⟨x,b⟩ ∈ Blue.
8.48 Repeat Exercise 8.46 using only ⁻¹, ◦, and the relations Red, Green, Blue, ≤, and =.
8.49 Do the same for Exercise 8.47 (or, at least, compute the set of ⟨x, x⟩ such that x is the name of a color that's more blue than red). (You may construct a relation R on colors, and then take R ∩ =.)

8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity
Pride destroys all symmetry and grace, and affectation is a more terrible enemy to fine faces than the small-pox.
Sir Richard Steele (1672–1729)
Let R ⊆ A × A be a relation on a single set A (as in the successor or ≤ relations on Z, or the is a (blood) relative of relation on people). We've seen a two-column approach to visualizing a relation R ⊆ A × B, but this layout is misleading when the sets A and B are identical. (Weirdly, we'd have to draw each element twice, in both the A column and the B column.) Instead, it will be more convenient to visualize a relation R ⊆ A × A without differentiated columns, using a directed graph: we simply write down each element of A, and draw an arrow from a1 to a2 for every pair ⟨a1, a2⟩ ∈ R. (See Chapter 11 for much more on directed graphs.) A few small examples are shown in Figure 8.13.

Figure 8.13: Visualizations of three relations, from Example 8.5 (prefixes of bitstrings), Example 8.11 (months), and Example 8.12 (⟨x, x^2 mod 11⟩). Panels: (a) isPrefix; (b) Months of the same length; (c) ⟨x, x^2 mod 11⟩ for x ∈ Z_11.

This directed-graph visualization of relations will provide a useful way of thinking intuitively about relations in general, and about some specific types of relations in particular. There are several important structural properties that some relations on A have (and that some relations do not), and we'll explore these properties throughout this section. We'll consider three basic categories of properties:
reflexivity: whether elements are related to themselves. That is, is an element x necessarily related to x itself?
symmetry: whether order matters in the relation. That is, if x and y are related, are y and x necessarily related too?
transitivity: whether chains of related pairs are themselves related. That is, if x and y are related and y and z are related, are x and z necessarily related too?
These properties turn out to characterize several important types of relations (for example, some relations divide A into clusters of “equivalent” elements, as in Figure 8.13(b), while other relations “order” A in some consistent way, as in Figure 8.13(a)), and we'll see these special types of relations in Section 8.4. But first we'll examine these three categories of properties in turn, and then we'll define closures of relations, which expand any relation R as little as possible while ensuring that the expansion of R has any particular desired subset of these properties.

8.3.1 Reflexivity
The reflexivity of a relation R ⊆ A × A is based on whether elements of A are related to themselves. That is, are pairs ⟨a, a⟩ in R? The relation R is reflexive if ⟨a, a⟩ is always in R (for every a ∈ A), and it's irreflexive if ⟨a, a⟩ is never in R (for any a ∈ A). (See Definition 8.6.)
Using the visualization style from Figure 8.13, a relation is reflexive if every element a ∈ A has a “loop” from a back to itself, and it's irreflexive if no a ∈ A has a loop back to itself. (See Figure 8.14.)
Example 8.17 (Reflexivity of =, ≡_17, and ⟨x, x^2 mod 11⟩)
The relations = and ≡_17 on Z (that is, the relations {⟨x, y⟩ : x = y} and {⟨x, y⟩ : x mod 17 = y mod 17}) are both reflexive, because x = x and x mod 17 = x mod 17 for any x ∈ Z. But the relation R := {⟨x, x^2 mod 11⟩ : x ∈ Z_11} from Figure 8.13(c) is not reflexive, because (among other examples) we have ⟨7, 7⟩ ∉ R.
Note that there are relations that are neither reflexive nor irreflexive. For example, the relation S = {⟨0, 1⟩, ⟨1, 1⟩} on {0, 1} isn’t reflexive (because ⟨0, 0⟩ ∈/ S), but it’s also not irreflexive (because ⟨1, 1⟩ ∈ S).
Example 8.18 (A few arithmetic relations)
Problem: Which of the following relations on Z≥1 are reflexive? Irreflexive?
1. divides: R1 = {⟨n, m⟩ : m mod n = 0}
2. greater than: R2 = {⟨n, m⟩ : n > m}
3. less than or equal to: R3 = {⟨n, m⟩ : n ≤ m}
4. square: R4 = {⟨n, m⟩ : n^2 = m}
5. equivalent mod 5: R5 = {⟨n, m⟩ : n mod 5 = m mod 5}
Solution:
1. reflexive. For any positive integer n, we have that n mod n = 0. Thus ⟨n, n⟩ ∈ R1 for any n.
2. irreflexive. For any n ∈ Z≥1, we have that n ≯ n. Thus ⟨n, n⟩ ∉ R2 for any n.
3. reflexive. For any positive integer n, we have n ≤ n, so every ⟨n, n⟩ ∈ R3.
4. neither. The square relation is not reflexive because ⟨9, 9⟩ ∉ R4 and it is also not irreflexive because ⟨1, 1⟩ ∈ R4, for example. (That's because 9 ≠ 9^2, but 1 = 1^2.)
5. reflexive. For any n ∈ Z≥1, we have n mod 5 = n mod 5, so ⟨n, n⟩ ∈ R5.
Note again that, as with square, it is possible to be neither reflexive nor irreflexive. (But it’s not possible to be both reflexive and irreflexive, as long as A ̸= ∅: for any a ∈ A, if ⟨a, a⟩ ∈ R, then R is not irreflexive; if ⟨a, a⟩ ∈/ R, then R is not reflexive.)
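For a relation on a finite set, both properties can be checked by brute force. The sketch below represents a relation as a Python set of pairs; the function names are illustrative, and restricting the arithmetic relations of Example 8.18 to {1, …, 20} is an assumption made here only so that they become finite and checkable.

```python
def is_reflexive(R, A):
    """True if (a, a) is in R for every a in A."""
    return all((a, a) in R for a in A)


def is_irreflexive(R, A):
    """True if (a, a) is in R for no a in A."""
    return all((a, a) not in R for a in A)


# The relation S = {<0,1>, <1,1>} on {0, 1}: neither reflexive nor irreflexive.
S = {(0, 1), (1, 1)}
print(is_reflexive(S, {0, 1}), is_irreflexive(S, {0, 1}))

# Finite restrictions of three relations from Example 8.18.
A = range(1, 21)
divides = {(n, m) for n in A for m in A if m % n == 0}
greater = {(n, m) for n in A for m in A if n > m}
square = {(n, m) for n in A for m in A if n * n == m}

for name, R in [("divides", divides), ("greater", greater), ("square", square)]:
    print(name, is_reflexive(R, A), is_irreflexive(R, A))
```

A check on a finite restriction can refute a property of the full relation on Z≥1 but cannot by itself prove it; the arguments in the solution above are what establish the general claims.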
Latin: re “back” + flect “bend.”
Definition 8.6 (Reflexive and Irreflexive Relations)
A relation R on A is reflexive if, for every x ∈ A, we have that ⟨x,x⟩ ∈ R. A relation R on A is irreflexive if, for every x ∈ A, we have that ⟨x, x⟩ ∈/ R.
Figure 8.14: A relation on A is reflexive if every a ∈ A has a self-loop (the dark arrows in the left panel), and it is irreflexive if no a ∈ A does (as in the right panel).

8.3.2 Symmetry
The symmetry of a relation R ⊆ A × A is based on whether the order of the elements in a pair matters. That is, if the pair ⟨a, b⟩ is in R, is the pair ⟨b, a⟩ always also in R? (Or is it never in R? Or sometimes but not always?) The relation R is symmetric if, for every a and b, the pairs ⟨a, b⟩ and ⟨b, a⟩ are both in R or both not in R.
There are two accompanying notions: a relation R is antisymmetric if the only time ⟨a, b⟩ and ⟨b, a⟩ are both in R is when a = b, and R is asymmetric if ⟨a, b⟩ and ⟨b, a⟩ are never both in R (whether a = b or a ≠ b). (See Definition 8.7 for the formal definitions.)
Again thinking about the visualization from Figure 8.13: a relation is symmetric if every arrow a → b is matched by an arrow b → a in the opposite direction. It's antisymmetric if there are no matched bidirectional pairs of arrows between two distinct elements a and b; and it's asymmetric if there also aren't even any self-loops. (An a-to-a self-loop is, in a weird way, a “pair” of arrows a → b and b → a, just with a = b.) See Figure 8.15.
Greek: syn “same” + metron “measure.”
An important etymological note: anti- means “against” rather than “not.” Asymmetric (no ⟨a, b⟩, ⟨b, a⟩ ∈ R) is different from antisymmetric (if ⟨a, b⟩, ⟨b, a⟩ ∈ R then a = b) is different from not symmetric (there is some ⟨a, b⟩ ∈ R but ⟨b, a⟩ ∉ R).
Figure 8.15: R is symmetric if every a → b is matched by b → a (as in the left panel). R is antisymmetric if no a ↔ b exists for a ≠ b (as in the middle or right panel), and asymmetric if it also has no self-loops (as in the right panel).
zeugma, n.: grammatical device in which words are used in parallel construction syntactically, but not semantically, as in Yesterday, Alice caught a rainbow trout and hell from Bob for fishing all day.
Definition 8.7 (Symmetric, Antisymmetric, and Asymmetric Relations)
A relation R on A is symmetric if, for every a, b ∈ A, if ⟨a, b⟩ ∈ R then ⟨b, a⟩ ∈ R.
A relation R on A is antisymmetric if, for every a, b ∈ A such that ⟨a, b⟩ ∈ R and ⟨b, a⟩ ∈ R, we have a = b.
A relation R on A is asymmetric if, for every a, b ∈ A, if ⟨a, b⟩ ∈ R then ⟨b, a⟩ ∉ R.
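Definition 8.7 translates almost word for word into code when a relation is stored as a finite set of pairs. The following is a minimal sketch; the function names are illustrative, and the two sample relations are the divides and greater than relations of Example 8.18 restricted (as an assumption for the sake of finiteness) to {1, …, 12}.

```python
def is_symmetric(R):
    """Every pair (a, b) in R is matched by (b, a) in R."""
    return all((b, a) in R for (a, b) in R)


def is_antisymmetric(R):
    """Whenever both (a, b) and (b, a) are in R, we have a == b."""
    return all(a == b for (a, b) in R if (b, a) in R)


def is_asymmetric(R):
    """No pair (a, b) in R is matched by (b, a) in R, not even when a == b."""
    return all((b, a) not in R for (a, b) in R)


A = range(1, 13)
divides = {(n, m) for n in A for m in A if m % n == 0}
greater = {(n, m) for n in A for m in A if n > m}

for name, R in [("divides", divides), ("greater", greater)]:
    print(name, is_symmetric(R), is_antisymmetric(R), is_asymmetric(R))
```

Note that every asymmetric relation is also antisymmetric: if no matched pair exists at all, then in particular none exists with a ≠ b.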
Example 8.19 (Some symmetric relations)
The relations
{⟨w, w′⟩ : w and w′ have the same length} (on the set of English words)
{⟨s, s′⟩ : s and s′ sat next to each other in class today} (on the set of students)
are both symmetric. If w contains the same number of letters as w′, then w′ also contains the same number of letters as w. And if I sat next to you, then you sat next to me! (The first relation is also reflexive, as ZEUGMA contains the same number of letters as ZEUGMA, but the latter is irreflexive, as no student sits beside herself in class.)
Example 8.20 (A few arithmetic relations, again)
Problem: Which of these relations from Example 8.18 (see below for a reminder) are symmetric? Antisymmetric? Asymmetric?
R1 = {⟨n, m⟩ : m mod n = 0}
R2 = {⟨n, m⟩ : n > m}
R3 = {⟨n, m⟩ : n ≤ m}
R4 = {⟨n, m⟩ : n^2 = m}
R5 = {⟨n, m⟩ : n mod 5 = m mod 5}.
Solution:
1. antisymmetric. Because n mod m = m mod n = 0 if and only if n = m, if ⟨n, m⟩ ∈ R1 and ⟨m, n⟩ ∈ R1 then n = m. But the relation is neither symmetric (for example, 3 | 6 but 6 ∤ 3) nor asymmetric (for example, 3 | 3).
2. asymmetric. If x […]

[…] R2 = {⟨n, m⟩ : n > m}
R3 = {⟨n, m⟩ : n ≤ m}
R4 = {⟨n, m⟩ : n^2 = m}
R5 = {⟨n, m⟩ : n mod 5 = m mod 5}
Solution:
1. transitive. Suppose that a | b and b | c. We need to show that a | c. But that's easy: by definition a | b and b | c mean that b = ak and c = bl for integers k and l. Therefore c = a · (kl), and thus a | c. (This fact was Theorem 7.4.4.)
2. transitive. If x > y and y > z, then we know x > z.
3. transitive. Just as in (2), R3 is transitive: if x ≤ y and y ≤ z, then x ≤ z.
4. not transitive. The square relation isn't transitive, because, for example, we have ⟨2, 4⟩ ∈ R4 and ⟨4, 16⟩ ∈ R4, but ⟨2, 16⟩ ∉ R4. (That's because 2^2 = 4 and 4^2 = 16 but 2^2 ≠ 16.)
5. transitive. The “equivalent mod 5” relation is transitive because equality is: if n mod 5 = m mod 5 and m mod 5 = p mod 5, then n mod 5 = p mod 5.

Figure 8.17: A relation on A is transitive if every triangle is closed. The left panel shows a relation that is not transitive (the dark arrows form an open triangle). The right panel shows a transitive relation, with a highlighted closed triangle.

The relations
{⟨w, w′⟩ : w and w′ have the same length} (on the set of English words)
{⟨s, s′⟩ : s arrived in class before s′ today} (on the set of students)
[…]
While we can understand the transitivity of a relation R directly from Definition 8.8, we can also think about the transitivity of R by considering the relationship between R and R ◦ R (that is, R and the composition of R with itself). (Earlier we saw how to view the symmetry of R by connecting R and its inverse R⁻¹.)
Again, you’ll prove this theorem in the exercises (Exercise 8.85).
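The theorem statement itself is not reproduced in this passage. The sketch below assumes the standard characterization, namely that R is transitive exactly when R ◦ R ⊆ R, and checks on two finite relations that this test agrees with a direct check of the definition. The function names are illustrative, and compose follows the function-composition convention used in the exercises (in T ◦ R, the relation R is applied first).

```python
def compose(R, S):
    """R ◦ S: pairs (a, c) with (a, b) in S and (b, c) in R for some b."""
    return {(a, c) for (a, b) in S for (b2, c) in R if b == b2}


def is_transitive(R):
    """Direct check: (a, b) and (b, c) in R force (a, c) in R."""
    return all((a, c) in R for (a, b) in R for (b2, c) in R if b == b2)


def is_transitive_via_composition(R):
    """Assumed characterization: R is transitive iff R ◦ R is a subset of R."""
    return compose(R, R) <= R


# Finite restrictions (to {1, ..., 20}) of two relations from Example 8.18.
A = range(1, 21)
square = {(n, m) for n in A for m in A if n * n == m}
at_most = {(n, m) for n in A for m in A if n <= m}

for name, R in [("square", square), ("at_most", at_most)]:
    print(name, is_transitive(R), is_transitive_via_composition(R))
```

For the square relation, the composition contains ⟨2, 16⟩ (via 4), which is not in the relation itself; that is the same open triangle used in the solution above.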
Taking it further: Imagine a collection of n people who have individual preferences over k candidates. That is, we have n relations R1, R2, …, Rn, each of which is a relation on the set {1, 2, …, k}. We wish to aggregate these individual preferences into a single preference relation for the collection of people. Although this description is much more technical than our everyday usage, the problem that we've described here is well known: it's otherwise known as voting. (Economists also call this topic the theory of social choice.) Some interesting and troubling paradoxes arise in voting problems, related to transitivity, or, more precisely, to the absence of transitivity.
Suppose that we have three candidates: Alice, Bob, and Charlie. For simplicity, let’s suppose that we also have exactly three voters: #1, #2, and #3. (This paradox also arises when there are many more voters.) Consider the situation in which Voter #1 thinks Alice > Bob > Charlie; Voter #2 thinks Charlie > Alice > Bob; and Voter #3 thinks Bob > Charlie > Alice. Then, in head-to-head runoffs between pairs of candidates, the results would be:
• Alice beats Bob: 2 votes (namely #1 and #2) for Alice, to 1 vote (just #3) for Bob.
• Bob beats Charlie: 2 votes (namely #1 and #3) for Bob, to 1 vote (just #2) for Charlie.
• Charlie beats Alice: 2 votes (namely #2 and #3) for Charlie, to 1 vote (just #1) for Alice.
That’s pretty weird: we have taken strict preferences (each of which is certainly transitive!) from each of the voters, and aggregated them into a nontransitive set of societal preferences. This phenomenon—no candidate would win a head-to-head vote against every other candidate—is called the Condorcet paradox. (The Condorcet criterion declares the winner of a vote to be the candidate who would win a runoff election against any other individual candidate.)
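The three head-to-head results above can be recomputed mechanically. This is only an illustrative sketch: the ballot encoding (a list ordered from most to least preferred) and the helper names `prefers` and `majority_relation` are assumptions made for the example, not anything defined in the text.

```python
from itertools import permutations

# Each ballot lists the candidates from most preferred to least preferred.
ballots = [
    ["Alice", "Bob", "Charlie"],   # Voter #1
    ["Charlie", "Alice", "Bob"],   # Voter #2
    ["Bob", "Charlie", "Alice"],   # Voter #3
]


def prefers(ballot, x, y):
    """True if this voter ranks candidate x above candidate y."""
    return ballot.index(x) < ballot.index(y)


def majority_relation(ballots):
    """Return the set of pairs (x, y) where x wins a head-to-head runoff against y."""
    candidates = ballots[0]
    beats = set()
    for x, y in permutations(candidates, 2):
        votes_for_x = sum(prefers(b, x, y) for b in ballots)
        if votes_for_x > len(ballots) - votes_for_x:
            beats.add((x, y))
    return beats


beats = majority_relation(ballots)

# Transitivity check: (a, b) and (b, c) in beats should force (a, c) in beats.
transitive = all((a, c) in beats
                 for (a, b) in beats for (b2, c) in beats if b == b2)

print(sorted(beats))
print(transitive)
```

The aggregated relation contains Alice over Bob and Bob over Charlie, but Charlie over Alice rather than Alice over Charlie, so the transitivity check fails.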
The Condorcet paradox is troubling, but an even more troubling result says that, more or less, there’s no good way of designing a voting system! Arrow’s Theorem, proven around 1950, states that there’s no way to aggregate individual preferences to society-level preferences in a way that’s consistent with three “obviously desirable” properties of a voting system: (1) if every voter prefers candidate A to candidate B, then A beats B; (2) there’s no “dictator” (a single voter whose preferences of the candidates directly determines the outcome of the vote); and (3) “independence of irrelevant alternatives” (if candidate A beats B when candidate C is in the race, then A still beats B if C were to drop out of the race).3
8.3.4 Properties of Asymptotic Relationships
Now that we’ve introduced the three categories of properties of relations (reflexivity, symmetry, and transitivity), let’s consider one more set of relations in light of these properties: the asymptotics of functions. Recall from Chapter 6 that, for two functions
The Condorcet paradox is named after the 18th-century French philosopher/mathematician Marquis de Condorcet (rhymes with gone for hay). Arrow’s Theorem is named after Kenneth Arrow, a 20th-century American economist (who won the 1972 Nobel Prize in Economics, largely for this theorem). See
3 Kenneth Arrow. Social Choice and Individual Values. Wiley, 1951.
Theorem 8.2 (Transitivity in terms of self-composition)
Let R ⊆ A × A be a relation. Then R is transitive if and only if R ◦ R ⊆ R.

824 CHAPTER 8. RELATIONS
f : R≥0 → R≥0 and g : R≥0 → R≥0, we say that
f(n) is O(g(n)) if and only if ∃n0 ≥ 0, c > 0 : [∀n ≥ n0 : f(n) ≤ c · g(n)].
f(n) is Θ(g(n)) if and only if f(n) is O(g(n)) and g(n) is O(f(n)).
f(n) is o(g(n)) if and only if f(n) is O(g(n)) and g(n) is not O(f(n)).
(Actually we previously phrased the definitions of Θ(·) and o(·) in terms of Ω(·), but the definition we’ve given here is completely equivalent, as proven in Exercise 6.30.) We can view these asymptotic properties as relations on the set F := {f : R≥0 → R≥0} of functions.
Example 8.24 (O and Θ and o: reflexivity)
O is reflexive: For any function f, we can easily show that f = O(f) by choosing the constants n0 := 1 and c := 1, because it is immediate that ∀n ≥ 1 : f(n) ≤ 1 · f(n). Therefore O is reflexive, because every function f satisfies f = O(f).
Θ is reflexive: This fact follows immediately from the fact that O is reflexive:
The standard asymptotic notation doesn’t match the standard notation for relations—we write f = Θ(g) rather than f Θ g
or ⟨f,g⟩ ∈ Θ—but Θ genuinely is a relation on F, in
the sense that some pairs of functions are related by Θ and some pairs are not. And O and o are relations on F in the same way.
Θ is reflexive ⇔ ∀f ∈ F : f = Θ(f)   (definition of reflexivity)
⇔ ∀f ∈ F : f = O(f) and f = O(f)   (definition of Θ)
⇔ ∀f ∈ F : f = O(f)   (p ∧ p ≡ p)
⇔ O is reflexive.   (definition of reflexivity)
o is irreflexive: This fact follows by similar logic: for any function f ∈ F,
f = o(f) ⇔ f = O(f) and f ≠ O(f).   (definition of o(·))
But p ∧ ¬p ≡ False (including when p is “f = O(f )”), so o is irreflexive.
Example 8.25 (O and Θ and o: symmetry)
O is not symmetric, antisymmetric, or asymmetric: Define the functions t1(n) = n and t2(n) = n² and t3(n) = 2n². O is not symmetric because, for example, t1 = O(t2) but t2 ≠ O(t1). O is not asymmetric because, for example, t1 = O(t1). And O is not antisymmetric because, for example, t2 = O(t3) and t3 = O(t2) but t2 ≠ t3.
Θ is symmetric: This fact follows immediately from the definition: for arbitrary f and g,
f = Θ(g) ⇔ f = O(g) and g = O(f)   (definition of Θ)
⇔ g = O(f) and f = O(g)   (p ∧ q ≡ q ∧ p)
⇔ g = Θ(f).   (definition of Θ)
(Θ is not anti/asymmetric, because t2 = Θ(t3) for t2(n) and t3(n) as defined above.)
o is asymmetric: This fact follows immediately, by similar logic: for arbitrary f and g, we have f = o(g) and g = o(f) if and only if f = O(g) and g ≠ O(f) and g = O(f) and f ≠ O(g), which is a contradiction! So if f = o(g) then g ≠ o(f). Therefore o is asymmetric.

You proved in Exercises 6.18, 6.46, and 6.47 that O, Θ, and o are all transitive, so we won’t repeat the proofs here.
In sum, then, we’ve argued that O is reflexive and transitive (but not symmetric, asymmetric, or antisymmetric); o is irreflexive, asymmetric, and transitive; and Θ is reflexive, symmetric, and transitive.
Taking it further: Among the computer scientists, philosophers, and mathematicians who study formal logic, there’s a special kind of logic called modal logic that’s of significant interest. Modal logic extends the type of logic we introduced in Chapter 3 to also include logical statements about whether a true proposition is necessarily true or accidentally true. For example, the proposition Canada won the 2014 Olympic gold medal in curling is true—but the gold-medal game could have turned out differently and, if it had, that proposition would have been false. But Either it rained yesterday or it didn’t rain yesterday is true, and there’s no possible scenario in which this proposition would have turned out to be false. We say that the former statement is “accidentally” true (it was an “accident” of fate that the game turned out the way it did), but the latter is “necessarily” true.
In modal logic, we evaluate the truth value of a particular logical statement multiple times, once in each of a set W of so-called possible worlds. Each possible world assigns truth values to every atomic proposition. Thus every logical proposition φ of the form we saw in Chapter 3 has a truth value in each possible world w ∈ W. But there’s another layer to modal logic. In addition to the set W, we are also given a relation R ⊆ W × W, where ⟨w, w′⟩ ∈ R indicates that w′ is possible relative to w. In addition to the basic logical connectives from normal logic, we can also write two more types of propositions:
✸φ (“possibly φ”): ✸φ is true in w if ∃w′ ∈ W such that ⟨w, w′⟩ ∈ R and φ is true in w′.
✷φ (“necessarily φ”): ✷φ is true in w if ∀w′ ∈ W such that ⟨w, w′⟩ ∈ R, φ is true in w′.
Of course, these operators can be nested, so we might have a proposition like ✷(✸p ⇒ ✷p).
Different assumptions about the relation R will allow us to use modal logic to model different types of interesting phenomena. For example, we might want to insist that ✷φ ⇒ φ (“if φ is necessarily true, then φ is true”: that is, if φ is true in every world w′ ∈ W possible relative to w, then φ is true in w). This axiom corresponds to the relation R being reflexive: w is always possible relative to w. Symmetry and transitivity correspond to the axioms φ ⇒ ✷✸φ and ✷φ ⇒ ✷✷φ.
The general framework of modal logic (with different assumptions about R) has been used to represent logics of knowledge (where ✷φ corresponds to “I know φ”); logics of provability (where ✷φ corresponds to “we can prove φ”); and logics of possibility and necessity (where ✷φ corresponds to “necessarily φ” and ✸φ to “possibly φ”). Others have also studied temporal logics (where ✷φ corresponds to “always φ” and ✸φ to “eventually φ”); these logical formalisms have proven to be very useful in formally analyzing the correctness of programs.4
For a good introduction to modal logic, see
4 G. E. Hughes and M. J. Cresswell. A New Introduction to Modal Logic. Routledge, 1996.
8.3.5 Closures of Relations
Until now, in this section we’ve discussed some important properties that certain relations R ⊆ A × A may or may not happen to have. We’ll close this section by looking at how to “force” the relation R to have one or more of these properties. Specifically, we will introduce the closure of a relation with respect to a property like symmetry: we’ll take a relation R and expand it into a relation R′ that has the desired property, while adding as few pairs to R as possible. That is, the symmetric closure of R is the smallest set R′ ⊇ R such that the relation R′ is symmetric.
Taking it further: In general, a set S is said to be closed under the operation f if, whenever we apply f to
an arbitrary element of S (or to an arbitrary k-tuple of elements from S, if f takes k arguments), then the result is also an element of S. For example, the integers are closed under + and ·, because the sum of two integers is always an integer, as is their product. But the integers are not closed under /: for example, 2/3 is not an integer even though 2, 3 ∈ Z. The closure of S under f is the smallest superset of S that is closed under f .

Here are the formal definitions:
Definition 8.9 (Reflexive, symmetric, and transitive closures)
Let R ⊆ A × A be a relation.
• The reflexive closure of R is the smallest relation R′ ⊇ R such that R′ is reflexive.
• The symmetric closure of R is the smallest relation R′′ ⊇ R such that R′′ is symmetric.
• The transitive closure of R is the smallest relation R+ ⊇ R such that R+ is transitive.
We’ll illustrate these definitions with an example of the symmetric, reflexive, and transitive closures of a small relation, and then return to a few of our running examples of arithmetic relations.
Example 8.26 (Closures of a small relation)
Consider the relation R := {⟨1, 5⟩, ⟨2, 2⟩, ⟨2, 4⟩, ⟨4, 1⟩, ⟨4, 2⟩} on {1, 2, 3, 4, 5}. Then we have the following closures of R. (See Figure 8.18 for visualizations.)
reflexive closure = R ∪ {⟨1, 1⟩, ⟨3, 3⟩, ⟨4, 4⟩, ⟨5, 5⟩}.
symmetric closure = R ∪ {⟨5, 1⟩ (because of ⟨1, 5⟩), ⟨1, 4⟩ (because of ⟨4, 1⟩)}.
transitive closure = R ∪ {⟨2, 1⟩ (because of ⟨2, 4⟩ and ⟨4, 1⟩), ⟨4, 4⟩ (because of ⟨4, 2⟩ and ⟨2, 4⟩), ⟨4, 5⟩ (because of ⟨4, 1⟩ and ⟨1, 5⟩), ⟨2, 5⟩ (because of ⟨2, 4⟩ and ⟨4, 5⟩)}.
Figure 8.18: A relation R, and several closures. (a) The relation R. (b) The reflexive closure of R. (c) The symmetric closure of R. (d) The transitive closure of R. In each, the dark arrows had to be added to R to achieve the desired property.
It’s worth noting that ⟨2, 5⟩ had to be in the transitive closure R+ of R, even though
there was no x such that ⟨2, x⟩ ∈ R and ⟨x, 5⟩ ∈ R. There’s one more intermediate step in the chain of reasoning: the pair ⟨4, 5⟩ had to be in R+ because ⟨4, 1⟩, ⟨1, 5⟩ ∈ R, and therefore both ⟨2, 4⟩ and ⟨4, 5⟩ had to be in R+—so ⟨2, 5⟩ had to be in R+ as well.
Example 8.27 (Closures of divides)
Recall the “divides” relation R = {⟨n, m⟩ : m mod n = 0}. Because R is both reflexive and transitive, the reflexive closure and transitive closure of R are both just R itself. The symmetric closure of R is the set of pairs ⟨n, m⟩ where one of n and m is a divisor of the other (in either order): {⟨n, m⟩ : n mod m = 0 or m mod n = 0}.
Example 8.28 (Closures of >)
Recall the “greater than” relation {⟨n, m⟩ : n > m}. The reflexive closure of > is ≥, that is, the set {⟨n, m⟩ : n ≥ m}. The symmetric closure of > is the relation ≠, that is, the set {⟨n, m⟩ : n > m or m > n} = {⟨n, m⟩ : n ≠ m}. The relation > is already transitive, so the transitive closure of > is > itself.
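The claims in Example 8.28 can be sanity-checked on a finite slice of the integers. The sketch below assumes the range −3, …, 3 as a stand-in for Z (a finite check like this is evidence, not a proof) and stores each relation as a Python set of pairs; all variable names are chosen just for this illustration.

```python
# Finite stand-in for the integers; any finite range would do for this check.
A = range(-3, 4)

greater = {(n, m) for n in A for m in A if n > m}

# Closures built directly from their characterizations.
reflexive_closure = greater | {(a, a) for a in A}
symmetric_closure = greater | {(m, n) for (n, m) in greater}

# The relations that Example 8.28 says these closures should equal.
greater_or_equal = {(n, m) for n in A for m in A if n >= m}
not_equal = {(n, m) for n in A for m in A if n != m}

# > is already transitive, so it has no violations to repair.
already_transitive = all((a, c) in greater
                         for (a, b) in greater for (b2, c) in greater if b == b2)

print(reflexive_closure == greater_or_equal)
print(symmetric_closure == not_equal)
print(already_transitive)
```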
(a) The relation R.
(b) ⟨0, 1⟩ and ⟨1, 2⟩ mean that we must add ⟨0, 2⟩.
(c) ⟨1, 2⟩ and ⟨2, 3⟩ mean that we must add ⟨1, 3⟩.
(d) ⟨0, 2⟩, which we added in (b), and ⟨2, 3⟩ mean that we must now add ⟨0, 3⟩ too.
Computing the closures of a relation
How did we compute the closures in the last few examples? The approach itself is simple: starting with R′ = R, we repeatedly look for a violation of the desired property in R′ (an element of R′ required by the property but missing from R′), and repair that violation by adding the necessary element to R′. For the reflexive and symmetric closures, this idea is straightforward: the violations of reflexivity are precisely those elements of {⟨a, a⟩ : a ∈ A} not already in R, and the violations of symmetry are precisely those elements of R⁻¹ that are not already in R.
Figure 8.19: Computing the transitive closure of the relation {⟨0, 1⟩, ⟨1, 2⟩, ⟨2, 3⟩}. Note that in panel (d), we could have instead argued that we had to add ⟨0, 3⟩ because of ⟨0, 1⟩ and ⟨1, 3⟩ (from panel (c)), rather than because of ⟨0, 2⟩ (from panel (b)) and ⟨2, 3⟩.
For the transitive closure, things are slightly trickier: as
we resolve existing violations by adding missing pairs to
the relation, new violations of transitivity can crop up. (See Figure 8.19.) Thus, to compute the transitive closure, we can simply iterate as described above: starting with R′ := R, repeatedly add to R′ any missing ⟨a, c⟩ with ⟨a, b⟩, ⟨b, c⟩ ∈ R′, until there are no more violations of transitivity. (While we won’t prove it here, it’s an important fact that the order in which we add elements to the transitive closure turns out
not to affect the final result.) See Figure 8.20 for algorithms
to compute these closures for R ⊆ A × A for a finite set A. (Note that these algorithms are not guaranteed to terminate if A is infinite! Also, there are faster ways to find the transitive closure based on graph algorithms—see Chapter 11—but the basic idea is captured here.)
symmetric-closure(R):
Input: a relation R ⊆ A × A
Output: the smallest symmetric R′ ⊇ R
return R ∪ R⁻¹
Alternatively, here’s another way to view the transitive closure of R ⊆ A × A. The relation R ◦ R denotes precisely those pairs ⟨a, c⟩ where ⟨a, b⟩, ⟨b, c⟩ ∈ R for some b ∈ A. Thus the “direct” violations of transitivity are pairs that are in R ◦ R but not R. But, as we saw in Figure 8.19, the relation R ∪ (R ◦ R) might have violations of transitivity, too: that is, a pair ⟨a, d⟩ ∉ R ∪ (R ◦ R) but where ⟨a, b⟩ ∈ R and ⟨b, d⟩ ∈ R ◦ R for some b ∈ A. So we have to add R ◦ R ◦ R as well. And so on! In other words, the transitive closure R+ of R is given by R+ = R ∪ R² ∪ R³ ∪ ···, where R^k := R ◦ R ◦ ··· ◦ R is the result of composing R with itself k times. Thus:
• the reflexive closure of R is R ∪ {⟨a, a⟩ : a ∈ A}.
• the symmetric closure of R is R ∪ R⁻¹.
• the transitive closure of R is R ∪ R² ∪ R³ ∪ ···.
(Exercise 8.104 asks you to prove correctness, and Exercise 8.105 asks you to show that
Figure 8.20: Algorithms to compute reflexive, symmetric, and transitive closures of a relation R ⊆ A × A, when A is finite.
reflexive-closure(R):
Input: a relation R ⊆ A × A
Output: the smallest reflexive R′ ⊇ R
return R ∪ {⟨a, a⟩ : a ∈ A}
transitive-closure(R):
Input: a relation R ⊆ A × A
Output: the smallest transitive R′ ⊇ R
R′ := R
while there exist a, b, c ∈ A such that ⟨a, b⟩ ∈ R′ and ⟨b, c⟩ ∈ R′ and ⟨a, c⟩ ∉ R′:
    R′ := R′ ∪ {⟨a, c⟩}
return R′

the transitive closure can be much bigger than the relation itself.)
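The closure algorithms for a finite set A can be sketched in Python as follows. This is one possible rendering, not the book’s own code: relations are assumed to be sets of pairs, the `universe` parameter stands for the set A, and the transitive version adds every currently missing pair in each round rather than one pair at a time (which is harmless, since the order of additions does not affect the final result). The relation R is the one from Example 8.26.

```python
def reflexive_closure(rel, universe):
    """R together with every pair (a, a) for a in the underlying set."""
    return set(rel) | {(a, a) for a in universe}


def symmetric_closure(rel):
    """R together with its inverse."""
    return set(rel) | {(b, a) for (a, b) in rel}


def transitive_closure(rel):
    """Repeatedly repair violations of transitivity until none remain."""
    closure = set(rel)
    while True:
        # All pairs (a, c) required by transitivity but currently missing.
        missing = {(a, c)
                   for (a, b) in closure for (b2, c) in closure
                   if b == b2 and (a, c) not in closure}
        if not missing:
            return closure
        closure |= missing


# The relation from Example 8.26, on the set {1, 2, 3, 4, 5}.
R = {(1, 5), (2, 2), (2, 4), (4, 1), (4, 2)}
A = {1, 2, 3, 4, 5}

print(sorted(reflexive_closure(R, A) - R))
print(sorted(symmetric_closure(R) - R))
print(sorted(transitive_closure(R) - R))
```

Each printed list shows only the pairs that had to be added, which can be compared against the sets given in Example 8.26.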
Closures with respect to multiple properties at once
In addition to defining the closure of a relation R with respect to one of the three properties (reflexivity, symmetry, or transitivity), we can also define the closure with respect to two or more of these properties simultaneously. Any subset of these properties makes sense in this context, but the two most common combinations require reflexivity and transitivity, with or without requiring symmetry:
Definition 8.10 (Reflexive (symmetric) transitive closure)
Let R ⊆ A × A be a relation.
• The reflexive transitive closure of R is the smallest relation R∗ ⊇ R such that R∗ is both reflexive and transitive.
• The reflexive symmetric transitive closure of R is the smallest relation R≡ ⊇ R such that R≡ is reflexive, symmetric, and transitive.
Example 8.29 (Parent)
Consider the relation parent := {⟨p, c⟩ : p is a parent of c} over a set S. (This example makes sense if we think of S as a set of people where “parent” has biological meaning, or if we think of S as a set of nodes in a tree.) Then:
• The transitive closure of parent is
parent ∪ grandparent ∪ greatgrandparent ∪ greatgreatgrandparent ∪ ··· .
• The reflexive transitive closure of parent is ancestor. That is, ⟨x, y⟩ is in the reflexive transitive closure of parent if and only if x is a direct ancestor of y, counting x as a direct ancestor of x herself. (Compared to the transitive closure, the reflexive transitive closure also includes the relation yourself := {⟨x, x⟩ : x ∈ S}.)
Example 8.30 (Adjacent seating at a concert)
Consider a set S of people attending a concert held in a theater with rows of seats. Let R denote the relation of “sat immediately to the right of,” so that ⟨x, y⟩ ∈ R if and only if x sat one seat to y’s right in the same row. (See Figure 8.21.)
The transitive closure of R is “sat (not necessarily immediately) to the right of.” The symmetric closure of R is “sat immediately next to.” The symmetric transitive closure of R is “sat in the same row as.” The reflexive symmetric transitive closure of R is also “sat in the same row as.” (You sit in the same row as yourself.)
As we discussed previously, we can think of the transitive closure R+ of the relation R as the result of repeating R one or more times: in other words, we have that R+ := R ∪ R² ∪ R³ ∪ ···. The reflexive transitive closure of R also adds {⟨a, a⟩ : a ∈ A} to the closure, which we can view as the result of repeating R zero or more times. In other words, we have that the reflexive transitive closure R∗ is R∗ = R⁰ ∪ R+, where R⁰ := {⟨a, a⟩ : a ∈ A} represents the “zero-hop” application of R.
Figure 8.21: The sat-immediately-to-the-right-of relation.
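The “one or more hops” and “zero or more hops” views can be computed by accumulating powers of R. The sketch below assumes a finite relation stored as a set of pairs; the names `compose`, `transitive_closure_by_powers`, and `reflexive_transitive_closure` are hypothetical helpers for this illustration. The loop stops as soon as the next power contributes no new pairs, which is safe because every later power is then already covered as well.

```python
def compose(r, s):
    """Pairs (a, c) such that (a, b) is in r and (b, c) is in s for some b."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def transitive_closure_by_powers(rel):
    """Accumulate R, R^2, R^3, ... until a power adds nothing new."""
    closure = set(rel)
    power = set(rel)                  # currently R^k with k = 1
    while True:
        power = compose(power, rel)   # now R^(k+1)
        if power <= closure:
            return closure
        closure |= power


def reflexive_transitive_closure(rel, universe):
    """R* = R^0 ∪ R+, where R^0 is the set of zero-hop pairs (a, a)."""
    zero_hop = {(a, a) for a in universe}
    return zero_hop | transitive_closure_by_powers(rel)


# The chain from Figure 8.19, on the set {0, 1, 2, 3}.
chain = {(0, 1), (1, 2), (2, 3)}
plus = transitive_closure_by_powers(chain)
star = reflexive_transitive_closure(chain, range(4))

print(sorted(plus))
print(sorted(star))
```

On this chain, R+ comes out as “strictly less than” on {0, 1, 2, 3} and R∗ as “less than or equal to.”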
Taking it further: The basic idea underlying the (reflexive) transitive closure of a relation R—allowing (zero or) one or more repetitions of a relation R—also comes up in a widely useful tool for pattern matching in text, called regular expressions. Using regular expressions, you can search a text file for lines that match certain kinds of patterns (like: find all violations in the dictionary of the “I before E except after C” rule), or apply some operation to all files with a certain name (like: remove all .txt files). For more discussion of regular expressions more generally, and a little more on the connection between (reflexive) transitive closure and regular expressions, see p. 830.
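As a small illustration of the dictionary search mentioned above, the following sketch uses Python’s standard `re` module with the pattern `.*(CIE|[^C]EI).*`, which appears in the regular-expressions discussion. The four-word list and the helper name `violates_rule` are made up for the example; a real search would read words from a dictionary file instead.

```python
import re

# A word violates "I before E except after C" if it contains CIE,
# or contains EI preceded by some letter other than C.
I_BEFORE_E_VIOLATION = re.compile(r".*(CIE|[^C]EI).*")


def violates_rule(word):
    """True if the whole (uppercased) word matches the violation pattern."""
    return I_BEFORE_E_VIOLATION.fullmatch(word.upper()) is not None


# A tiny hypothetical word list standing in for a dictionary.
words = ["weird", "glacier", "believe", "receive"]
violations = [w for w in words if violates_rule(w)]

print(violations)
```

Here `fullmatch` requires the entire word to match, which is why the pattern begins and ends with `.*` to allow arbitrary text around the offending letters.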
We’ll end with one last example of closures of an arithmetic relation:
Example 8.31 (Closures of the successor relation)
Problem: The successor relation on the integers is {⟨n, n + 1⟩ : n ∈ Z}. What are the reflexive, symmetric, transitive, reflexive transitive, and reflexive symmetric transitive closures of this relation?
Solution:
• The reflexive closure of successor is the relation {⟨n, m⟩ : m = n or m = n + 1}, that is, pairs of integers where the second component is equal to or one greater than the first component.
• The symmetric closure of successor is {⟨n, m⟩ : m = n − 1 or m = n + 1}, that is, pairs of integers where the second component is exactly one less or one greater than the first component.
• The transitive closure of successor is the relation <, that is, the relation {⟨n, m⟩ : n < m}. In fact, the infinite version of Figure 8.19 illustrates why: for any n, we have ⟨n, n + 1⟩ and ⟨n + 1, n + 2⟩ in successor, so the transitive closure includes ⟨n, n + 2⟩. But ⟨n + 2, n + 3⟩ is in successor, so the transitive closure also includes ⟨n, n + 3⟩. But ⟨n + 3, n + 4⟩ is in successor, so the transitive closure also includes ⟨n, n + 4⟩. And so forth! (See Exercise 8.106 for a formal proof.)
• The reflexive transitive closure of the successor relation {⟨x, x + 1⟩ : x ∈ Z} is ≤.
• Finally, the reflexive symmetric transitive closure of successor is actually Z × Z: that is, every pair of integers is in this relation.
Incidentally, we can view ≤ (the reflexive transitive closure of successor) as either the reflexive closure of < (the transitive closure of successor), or we can view ≤ as the transitive closure of {⟨n, m⟩ : m = n or m = n + 1} (the reflexive closure of successor). It’s true in general that the reflexive closure of the transitive closure equals the transitive closure of the reflexive closure.

Computer Science Connections
Regular Expressions
Regular expressions (sometimes called regexps or regexes for short) are a mechanism to express pattern-matching searches in strings. (Their name is also a bit funny; more on that below.) Regular expressions are used by a number of useful utilities on Unix-based systems, like grep (which prints all lines of a file that match a given pattern) and sed (which can perform search-and-replace operations for particular patterns). And many programming languages have a capability for regular-expression processing; they’re a tremendously handy tool for text processing.
Let Σ denote an alphabet of symbols. (For convenience, think of Σ = {A, B, . . . , Z}, but generally it’s the set of all ASCII characters.) Let Σ∗ denote the set of all finite-length strings of symbols from Σ. (Note that the ∗ notation echoes the notation for the reflexive transitive closure: Σ∗ is the set of elements resulting from “repeating” Σ zero or more times.)
The basics of regular expressions are shown in Figure 8.22. Essentially the syntax of regular expressions (recursively) defines a relation Matches ⊆ Regexps × Σ∗, where certain strings match a given pattern α. Figure 8.22 says that, for example, {s : ⟨αβ, s⟩ ∈ Matches} is precisely the set of strings that can be written xy where ⟨α, x⟩ and ⟨β, y⟩ are in Matches. There’s some other shorthand for common constructions, too: for example, a list of characters in square brackets matches any of those characters (for example, [AEIOU] is shorthand for (A|E|I|O|U)). (Other syntax allows a range of characters or everything but a list of characters: for example, [A-Z] for all letters, and [^AEIOU] for consonants.) A few other regexp operators correspond to the types of closures that we introduced in this section. (See Figure 8.23.)
For example, the following regular expressions match words in a dictionary that have some vaguely interesting properties:
1. .*(CIE|[^C]EI).*
2. .*[^AEIOU][^AEIOU][^AEIOU][^AEIOU][^AEIOU].*
3. [^AEIOU]*A[^AEIOU]*E[^AEIOU]*I[^AEIOU]*O[^AEIOU]*U[^AEIOU]*
Respectively, these regexps match (1) words that violate the “I before E except after C” rule (like WEIRD or GLACIER); (2) words with five consecutive consonants (like LENGTHS or WITCHCRAFT); and (3) words with all five vowels, once each, in alphabetical order (like FACETIOUS and ABSTEMIOUS).
The odd-sounding name “regular expression” derives from a related notion, called a “regular language.” A language L ⊆ Σ∗ is a subset of all strings; in the subfield of theoretical computer science called formal language theory, we’re interested in how easy it is to determine whether a given string x ∈ Σ∗ is in L or not, for a particular language L. (Some example languages: the set of words containing only one type of vowel, or the set of binary strings with the same number of 1s and 0s.) A regular language is one for which it’s possible to determine whether x ∈ L by reading the string from left to right and, at each step, remembering only a constant amount of information about what you’ve seen so far. (The set of univocalic words is regular; the set of “balanced” bitstrings is not.)5
We have only hinted at the depth of regular languages, regular expressions, and formal language theory here. There’s a whole courseload of material about these languages: for a bit more, see p. 846; for a lot more, see a good textbook on computational complexity and formal languages, like
5 Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012; and Dexter Kozen. Automata and Computability. Springer, 1997.

Figure 8.22: The basics of regexps.
A      matches the single character A
B      matches the single character B
...
Z      matches the single character Z
.      matches any single character in Σ
αβ     matches any string xy where x matches α and y matches β
α|β    matches any string x where x matches α or x matches β

Figure 8.23: Some more regexp operators. The + operator is roughly analogous to transitive closure (α+ matches any string that consists of one or more repetitions of α), while ? is roughly analogous to the reflexive closure and * to the reflexive transitive closure. The only difference is that here we’re combining repetitions by concatenation rather than by composition.
α?     matches any string that matches α or the empty string
α+     matches any string x1x2 . . . xk, with k ≥ 1, where each xi matches α
α*     matches any string x1x2 . . . xk, with k ≥ 0, where each xi matches α

8.3.6 Exercises
8.50 Draw a directed graph representing the relation {⟨x, x² mod 13⟩ : x ∈ Z13}.
8.51 Repeat for {⟨x, 3x mod 13⟩ : x ∈ Z15}.
8.52 Repeat for {⟨x, 3x mod 15⟩ : x ∈ Z15}.
Which of the following relations on {0, 1, 2, 3, 4} are reflexive? Irreflexive? Neither? 8.53 􏰈⟨x,x⟩ : x5 ≡5 x􏰉 8.54 {⟨x,y⟩:x+y≡50} 8.55 {⟨x,y⟩:thereexistszsuchthatx·z≡5 y} 8.56 􏰈⟨x, y⟩ : there exists z such that x2 · z2 ≡5 y􏰉 Let R ⊆ A × A and T ⊆ A × A be relations. Prove or disprove the following: 8.57 R is reflexive if and only if R−1 is reflexive. 8.58 if R and T are both reflexive, then R ◦ T is reflexive. 8.59 if R ◦ T is reflexive, then R and T are both reflexive. 8.60 R is irreflexive if and only if R−1 is irreflexive. 8.61 if R and T are both irreflexive, then R ◦ T is irreflexive. Which relations from Exercises 8.53–8.56 on {0, 1, 2, 3, 4} are symmetric? Antisymmetric? Asymmetric? Explain. 8.62 􏰈⟨x,x⟩ : x5 ≡5 x􏰉 8.63 {⟨x,y⟩:x+y≡50} 8.64 {⟨x,y⟩:thereexistszsuchthatx·z≡5 y} 8.65 􏰈⟨x, y⟩ : there exists z such that x2 · z2 ≡5 y􏰉 Prove Theorem 8.1, connecting the symmetry/asymmetry/antisymmetry of a relation R to the inverse R−1 of R. 8.66 ProvethatRissymmetricifandonlyifR∩R−1 =R=R−1. 8.67 ProvethatRisantisymmetricifandonlyifR∩R−1 ⊆{⟨a,a⟩:a∈A}. 8.68 Prove that R is asymmetric if and only if R ∩ R−1 = ∅. 8.69 Be careful: it’s possible for a relation R ⊆ A × A to be both symmetric and antisymmetric! Describe, as precisely as possible, the set of relations on A that are both. 8.70 Prove or disprove: if R is asymmetric, then R is antisymmetric. Fill in each cell in Figure 8.24 with a relation on {0, 1} that satisfies the given criteria. Or, if the criteria are inconsistent, explain why there is no such a relation. 8.71 a reflexive, symmetric relation on {0, 1} . 8.72 a reflexive, antisymmetric relation on {0, 1} . 8.73 a reflexive, asymmetric relation on {0, 1} . 8.74 an irreflexive, symmetric relation on {0, 1} . 8.75 an irreflexive, antisymmetric relation on {0, 1} . 8.76 an irreflexive, asymmetric relation on {0, 1} . 8.77 a symmetric relation on {0, 1} that’s neither reflexive nor irreflexive. 
8.78 an antisymmetric relation on {0, 1} that’s neither reflexive nor irreflexive. 8.79 an asymmetric relation on {0, 1} that’s neither reflexive nor irreflexive. Figure 8.24: Some fill-in-the-blank relations. reflexive Exer. 8.71 Exer. 8.72 Exer. 8.73 irreflexive Exer. 8.74 Exer. 8.75 Exer. 8.76 neither Exer. 8.77 Exer. 8.78 Exer. 8.79 Which relations from Exercises 8.53–8.56 on {0, 1, 2, 3, 4} are transitive? Explain. 8.80 􏰈⟨x,x⟩:x5≡5x􏰉. 8.81 {⟨x,y⟩:x+y≡5 0}. 8.82 {⟨x,y⟩:thereexistszsuchthatx·z≡5 y}. 8.83 􏰈⟨x, y⟩ : there exists z such that x2 · z2 ≡5 y􏰉. Formally prove the following statements about a relation R ⊆ A × A, using the definitions of the given properties. 8.84 Prove that, if R is irreflexive and transitive, then R is asymmetric. 8.85 Prove Theorem 8.2: show that R is transitive if and only if R ◦ R ⊆ R. 8.86 Theorem 8.2 cannot be stated with an = instead of ⊆ (although I actually made this mistake in a previous draft!). Give an example of a transitive relation R where R ◦ R ⊂ R (that is, where R ◦ R ̸= R). symmetric antisymmetric asymmetric 832 CHAPTER 8. RELATIONS The following exercises describe a relation with certain properties. For each, say whether it is possible for a relation R ⊆ A × A to simultaneously have all of the stated properties. If so, describe as precisely as possible what structure the relation R must have. If not, prove that it is impossible. 8.87 Is it possible for R to be simultaneously symmetric, transitive, and irreflexive? 8.88 Is it possible for R to be simultaneously transitive and a function? 8.89 Identify all relations R on {0, 1} that are transitive. 8.90 Of the transitive relations on {0, 1} from Exercise 8.89, which are also reflexive and symmetric? Consider the relation R := {⟨2, 4⟩, ⟨4, 3⟩, ⟨4, 4⟩} on the set {1, 2, 3, 4}. 8.91 What is the reflexive closure of R? 8.92 What is the symmetric closure of R? 8.93 What is the transitive closure of R? 8.94 What is the reflexive transitive closure of R? 
8.95 What is the reflexive symmetric transitive closure of R?

Now consider the relation T := {⟨1, 2⟩, ⟨1, 3⟩, ⟨2, 1⟩, ⟨2, 3⟩, ⟨3, 1⟩, ⟨3, 2⟩, ⟨3, 4⟩, ⟨4, 5⟩} on {1, 2, 3, 4, 5}.
8.96 What is the reflexive closure of T?
8.97 What is the symmetric closure of T?
8.98 What is the transitive closure of T?

8.99 What is the symmetric closure of ≥?

The next few exercises ask you to implement relations (and the standard relation operations) in a programming language of your choice. Don’t worry too much about efficiency in your implementation; it’s okay to run in time Θ(n^3), Θ(n^4), or even Θ(n^5) when relation R is on a set of size n.
8.100 (programming required) Develop a basic implementation of relations on a set A. Also implement inverse (R^{−1}) and composition (R ◦ T, where both R and T are subsets of A × A).
8.101 (programming required) Write functions reflexive?, irreflexive?, symmetric?, antisymmetric?, asymmetric?, and transitive? to test whether a given relation R has the specified property.
8.102 (programming required) Implement the closure algorithms (reproduced in Figure 8.25) for relations.
8.103 (programming required) Using your implementations from the last few exercises, verify your answers to Exercises 8.71–8.79 (see Figure 8.24).

8.104 Prove that the transitive closure of R is indeed R^+ := R ∪ R^2 ∪ R^3 ∪ ···, as follows: show that if S ⊇ R is any transitive relation, then R^k ⊆ S. (We’d also need to prove that R^+ is transitive, but you can omit this part of the proof. You may find a recursive definition of R^k most helpful: R^1 = R and R^k = R ◦ R^{k−1}.)
8.105 Give an example of a relation R ⊆ A × A, for a finite set A, such that the transitive closure of R contains at least c · |R|^2 pairs, for some constant c > 0. Make c as big as you can.
8.106 Recall the relation successor := {⟨x, x + 1⟩ : x ∈ Z^{≥0}}. Prove by induction on k that, for any integer x ≥ 0 and any positive integer k, we have that ⟨x, x + k⟩ is in the transitive closure of successor. (In other words, you’re showing that the transitive closure of successor is <. Note that you cannot rely on the algorithm in Figure 8.25 because Z^{≥0} is not finite!)
8.107 We talked about the X closure of a relation R, for X being any nonempty subset of the properties of reflexivity, symmetry, and transitivity. But we didn’t define the “antisymmetric closure” of a relation R, and with good reason! Why doesn’t the antisymmetric closure make sense?

Figure 8.25: A reminder of algorithms to compute the reflexive, symmetric, and transitive closures of a relation on a finite set.

reflexive-closure(R):
  Input: a relation R ⊆ A × A
  Output: the smallest reflexive R′ ⊇ R
  return R ∪ {⟨a, a⟩ : a ∈ A}

symmetric-closure(R):
  Input: a relation R ⊆ A × A
  Output: the smallest symmetric R′ ⊇ R
  return R ∪ R^{−1}

transitive-closure(R):
  Input: a relation R ⊆ A × A
  Output: the smallest transitive R′ ⊇ R
  R′ := R
  while there exist a, b, c ∈ A such that ⟨a, b⟩ ∈ R′ and ⟨b, c⟩ ∈ R′ and ⟨a, c⟩ ∉ R′:
    R′ := R′ ∪ {⟨a, c⟩}
  return R′
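The three closure algorithms in Figure 8.25 can be sketched in a few lines of Python. This is only one possible rendering, under some assumptions of ours: a relation is stored as a set of pairs, the function names are hypothetical, and the transitive closure adds every forced pair in each round (rather than one pair at a time, as in the figure). The small relation used to try it out is made up for illustration.

```python
def reflexive_closure(R, A):
    """Smallest reflexive relation on A containing R."""
    return set(R) | {(a, a) for a in A}


def symmetric_closure(R):
    """Smallest symmetric relation containing R: R together with its inverse."""
    return set(R) | {(b, a) for (a, b) in R}


def transitive_closure(R):
    """Smallest transitive relation containing R."""
    closure = set(R)
    while True:
        # Pairs (a, c) forced by some (a, b) and (b, c) already in the closure.
        forced = {(a, c) for (a, b) in closure for (b2, c) in closure if b == b2}
        if forced <= closure:      # nothing new: the relation is now transitive
            return closure
        closure |= forced


# A hypothetical three-pair relation on {1, 2, 3, 4}.
chain = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(chain)))
print(sorted(symmetric_closure(chain)))
print(sorted(reflexive_closure(chain, {1, 2, 3, 4})))
```

Each round of the loop scans all pairs of pairs, so this sketch is far from the fastest approach, but it stays within the generous running times that the exercises allow.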

8.4 Special Relations: Equivalence Relations and Partial/Total Orders
Talking with you is sort of the conversational equivalent of an out of body experience.
Bill Watterson (b. 1958), Calvin & Hobbes
In Section 8.3, we introduced three key categories of properties that a particular relation R ⊆ A × A might have: (ir)reflexivity, (a/anti)symmetry, and transitivity. Here we’ll consider relations R that have one of two particular combinations of those three categories of properties. Two very different “flavors” of relations emerge from these two particular constellations of properties:
• equivalence relations (reflexive, symmetric, and transitive), which divide the elements of A into one or more groups of equivalent elements, so that all elements in the same group are “the same” under R; and
• order relations (reflexive or irreflexive, antisymmetric, and transitive), which “rank” the elements of A, so that some elements of A are “more R” than others.
In this section, we’ll give formal definitions of these two types of relations, and look at a few applications.
8.4.1 Equivalence Relations
An equivalence relation R ⊆ A × A separates the elements of A into one or more groups, where any two elements in the same group are equivalent according to R:

Definition 8.11 (Equivalence relation)
An equivalence relation is a relation that is reflexive, symmetric, and transitive.

The most important equivalence relation that you’ve seen is equality (=): certainly, for any objects a, b, and c, we have that (i) a = a; (ii) a = b if and only if b = a; and (iii) if a = b and b = c, then a = c.
The relation sat in the same row as (see Example 8.30) is also an equivalence relation: it’s reflexive (you sat in the same row as you yourself), symmetric (anyone you sat in the same row as also sat in the same row as you), and transitive (you sat in the same row as anyone who sat in the same row as someone who sat in the same row as you). And we already saw another example in Example 8.11: the relation
{⟨m1, m2⟩ : months m1 and m2 have the same number of days (in some years)}
(see Figure 8.26 for a reminder) is also an equivalence relation. It’s tedious but simple to verify by checking all pairs that the relation in Figure 8.26 is reflexive, symmetric, and transitive. (See also Exercises 8.115–8.117.)

Figure 8.26: The months-of-the-same-length relation (a reminder).

Here are a few more examples of equivalence relations:

Example 8.32 (Some equivalence relations)
All of the following are equivalence relations:
1. The set of pairs from {0, 1, …, 23} with the same representation on a 12-hour clock:
{⟨0, 0⟩, ⟨0, 12⟩, ⟨12, 0⟩, ⟨12, 12⟩, ⟨1, 1⟩, ⟨1, 13⟩, ⟨13, 1⟩, ⟨13, 13⟩, …, ⟨11, 11⟩, ⟨11, 23⟩, ⟨23, 11⟩, ⟨23, 23⟩}.
2. The asymptotic relation Θ (that is, for two functions f and g, we have ⟨f , g⟩ ∈ Θ if and only if f is Θ(g)). We argued in Examples 8.24–8.25 and Exercise 6.46 that Θ is reflexive, symmetric, and transitive.
3. The relation ≡ on logical propositions, where P ≡ Q if and only if P and Q are true under precisely the same set of truth assignments. (We even used the word “equivalent” in defining ≡, which we called logical equivalence back in Chapter 3.)
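For a relation on a small finite set, the three requirements of Definition 8.11 can be checked by brute force. The following is a minimal sketch, assuming that a relation is stored as a set of pairs; the function names are ours, not part of the text. It tests the 12-hour-clock relation from the first item of Example 8.32.

```python
def is_reflexive(R, A):
    return all((a, a) in R for a in A)


def is_symmetric(R):
    return all((b, a) in R for (a, b) in R)


def is_transitive(R):
    # Whenever (a, b) and (b, d) are both in R, (a, d) must be in R too.
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)


def is_equivalence(R, A):
    return is_reflexive(R, A) and is_symmetric(R) and is_transitive(R)


hours = range(24)
same_clock = {(h1, h2) for h1 in hours for h2 in hours if h1 % 12 == h2 % 12}
at_most = {(h1, h2) for h1 in hours for h2 in hours if h1 <= h2}

print(is_equivalence(same_clock, hours))   # True
print(is_equivalence(at_most, hours))      # False: ≤ is not symmetric
```

A check like this only works when the underlying set is finite; it says nothing about relations such as Θ or logical equivalence, whose properties have to be proven.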
Example 8.33 (All equivalence relations on a small set)
Problem: List all equivalence relations on the set {a, b, c}.
Solution: There are five different equivalence relations on this set:
• {⟨a, a⟩, ⟨b, b⟩, ⟨c, c⟩}: “no element is equivalent to any other”
• {⟨a, a⟩, ⟨a, b⟩, ⟨b, a⟩, ⟨b, b⟩, ⟨c, c⟩}: “a and b are equivalent, but they’re different from c”
• {⟨a, a⟩, ⟨a, c⟩, ⟨b, b⟩, ⟨c, a⟩, ⟨c, c⟩}: “a and c are equivalent, but they’re different from b”
• {⟨a, a⟩, ⟨b, b⟩, ⟨b, c⟩, ⟨c, b⟩, ⟨c, c⟩}: “b and c are equivalent, but they’re different from a”
• {⟨a, a⟩, ⟨a, b⟩, ⟨a, c⟩, ⟨b, a⟩, ⟨b, b⟩, ⟨b, c⟩, ⟨c, a⟩, ⟨c, b⟩, ⟨c, c⟩}: “all elements are equivalent”
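The count in Example 8.33 is small enough to confirm by exhaustive search: there are only 2^9 = 512 relations on a three-element set, and we can test each one. The sketch below assumes relations are stored as sets of pairs; the helper name is_equivalence is our own.

```python
from itertools import combinations, product


def is_equivalence(R, A):
    reflexive = all((a, a) in R for a in A)
    symmetric = all((b, a) in R for (a, b) in R)
    transitive = all((a, d) in R for (a, b) in R for (c, d) in R if b == c)
    return reflexive and symmetric and transitive


A = ["a", "b", "c"]
pairs = list(product(A, repeat=2))               # all 9 pairs in A × A

# Every subset of A × A is a relation on A; keep the equivalence relations.
relations = [set(subset)
             for k in range(len(pairs) + 1)
             for subset in combinations(pairs, k)]
equivalences = [R for R in relations if is_equivalence(R, A)]

print(len(relations))                            # 512
print(len(equivalences))                         # 5
print(sorted(len(R) for R in equivalences))      # [3, 5, 5, 5, 9]
```

The sizes 3, 5, 5, 5, and 9 match the five relations listed in the solution. This brute-force approach becomes hopeless quickly, because a set of size n has 2^(n^2) relations.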
Equivalence classes
The descriptions of the quintet of equivalence relations on the set {a, b, c} from Example 8.33 make more explicit the other way that we’ve talked about an equivalence relation R on A: as a relation that carves up A into one or more equivalence classes, where any two elements of the same equivalence class are related by R (and no two elements of different classes are). Here’s the formal definition:
Definition 8.12 (Equivalence class)
Let R ⊆ A × A be an equivalence relation. The equivalence class of a ∈ A is defined as the set {b ∈ A : ⟨a, b⟩ ∈ R} of elements related to a under R. The equivalence class of a ∈ A under R is denoted by [a]_R or, when R is clear from context, just by [a].

The equivalence classes of an equivalence relation on A form a partition of the set A; that is, every element of A is in one and only one equivalence class. (See Definition 2.30 for a reminder of the definition of “partition.”)
Example 8.34 (Equivalent mod 5)
Define the relation ≡_5 on Z, so that ⟨x, y⟩ ∈ ≡_5 if and only if x mod 5 = y mod 5. It’s easy to check that all three requirements (reflexivity, symmetry, and transitivity) are met; see Examples 8.18, 8.20, and 8.23. There are five equivalence classes under ≡_5:
{0, 5, 10, …}, {1, 6, 11, …}, {2, 7, 12, …}, {3, 8, 13, …}, and {4, 9, 14, …},
corresponding to the five possible values mod 5.
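To see these classes emerge mechanically, here is a minimal sketch that groups a finite collection of elements into equivalence classes. It assumes that the supplied test really is an equivalence relation (so comparing against a single representative of each class is enough); the function name and the choice of the range 0 through 14 are ours.

```python
def equivalence_classes(elements, related):
    """Partition `elements` into classes, assuming `related` is an equivalence relation."""
    classes = []
    for x in elements:
        for cls in classes:
            if related(x, cls[0]):     # one representative per class suffices
                cls.append(x)
                break
        else:
            classes.append([x])        # x is not equivalent to anything seen so far
    return classes


classes = equivalence_classes(range(15), lambda x, y: x % 5 == y % 5)
for cls in classes:
    print(cls)
# [0, 5, 10]
# [1, 6, 11]
# [2, 7, 12]
# [3, 8, 13]
# [4, 9, 14]
```

Every element lands in exactly one list, which is the partition property described above, restricted to the finite sample.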
Example 8.35 (Some equivalence classes)
The five different equivalence relations on {a, b, c} in Example 8.33 correspond to five different sets of equivalence classes:
• {{a}, {b}, {c}}: “no element is equivalent to any other”
• {{a, b}, {c}}: “a and b are equivalent, but they’re different from c”
• {{a, c}, {b}}: “a and c are equivalent, but they’re different from b”
• {{a}, {b, c}}: “b and c are equivalent, but they’re different from a”
• {{a, b, c}}: “all elements are equivalent”
An example: equivalence of rational numbers
Back in Chapter 2, we defined the rational numbers (that is, fractions) as the set Q := Z × Z^{≠0}; that is, as two-element sequences of integers, respectively called the numerator and the denominator, where the denominator must be nonzero. (See Example 2.39.) Here we will give a formal treatment of two rational numbers like ⟨17, 34⟩ and ⟨101, 202⟩ being equivalent, in the sense that 17/34 = 101/202 = 1/2:
Example 8.36 (Equivalence of rationals by reducing to lowest terms)
Problem: Formally define a relation ≡ on Q that captures the notion of equality for fractions, and prove that ≡ is an equivalence relation.
Solution: We define two rationals ⟨a, b⟩ and ⟨c, d⟩ as equivalent if and only if ad = bc; that is, we define the relation ≡ as the set {⟨⟨a, b⟩, ⟨c, d⟩⟩ : ad = bc}.
To show that ≡ is an equivalence relation, we must prove that ≡ is reflexive, symmetric, and transitive. These three properties follow fairly straightforwardly from

the fact that the relation = on integers is an equivalence relation. We’ll prove symmetry (reflexivity and transitivity can be proven analogously): for arbitrary ⟨a, b⟩, ⟨c, d⟩ ∈ Q we have
⟨a, b⟩ ≡ ⟨c, d⟩ ⇒ ad = bc    (definition of ≡)
⇒ bc = ad    (symmetry of =)
⇒ ⟨c, d⟩ ≡ ⟨a, b⟩.    (definition of ≡)

Taking it further: Recall that the equivalence class of a rational ⟨a, b⟩ ∈ Q under ≡, denoted [⟨a, b⟩]_≡, represents the set of all rationals equivalent to ⟨a, b⟩. For example,
[⟨17, 34⟩]_≡ = {⟨1, 2⟩, ⟨−1, −2⟩, ⟨2, 4⟩, ⟨−2, −4⟩, …, ⟨17, 34⟩, …}.
For equivalence relations like ≡ for Q, we may agree to associate an equivalence class with a canonical element of that class: here, the representative that’s “in lowest terms.” So we might agree to write ⟨1, 2⟩ to denote the equivalence class [⟨1, 2⟩], for example. This idea doesn’t matter too much for the rationals, but it plays an important (albeit rather technical) role in figuring out how to define the real numbers in a mathematically coherent way. One standard way of defining the real numbers is as the equivalence classes of converging infinite sequences of rational numbers, called Cauchy sequences after the 19th-century French mathematician Augustin Louis Cauchy. (Two converging infinite sequences of rational numbers are defined to be equivalent if they converge to the same limit; that is, if the two sequences eventually differ by less than ε, for all ε > 0.) Thus when we write π, we’re actually secretly denoting an infinitely large set of equivalent converging infinite sequences of rational numbers, but we’re representing that equivalence class using a particular canonical form. Actually producing a coherent definition of the real numbers is a surprisingly recent development in mathematics, dating back less than 150 years. For more, see a good textbook on the subfield of math called analysis. [6]

[6] For example, this book is a classic: Walter Rudin. Principles of Mathematical Analysis. McGraw–Hill, third edition, 1976.

Coarsening and refining equivalence relations
An equivalence relation ≡ on A slices up the elements of A into equivalence classes, that is, disjoint subsets of A such that any two elements of the same class are related by ≡. For example, you might consider two restaurants equivalent if they serve food from the same cuisine (Thai, Indian, Ethiopian, Chinese, British, Minnesotan, . . .). But, given ≡, we can imagine further subdividing the equivalence classes under ≡ by making finer-grained distinctions (that is, refining ≡), perhaps dividing Indian into North Indian and South Indian, and Chinese into Americanized Chinese and Authentic Chinese. Or we could make ≡ less specific (that is, coarsening ≡) by combining some of the equivalence classes, perhaps having only two equivalence classes, Delicious (Thai, Indian, Ethiopian, Chinese) and Okay (British, Minnesotan). See Figure 8.27.

Figure 8.27: Refining/coarsening an equivalence relation. Panels: (a) An equivalence relation ≡. (b) A coarsening of ≡. (c) A refinement of ≡. In (a), dots represent elements; each colored region denotes an equivalence class under ≡. Panel (b) shows a new equivalence relation formed by merging classes from ≡; (c) shows a new equivalence relation formed by subdividing classes from ≡.
Definition 8.13 (Coarsening/refining equivalence relations)
Consider two equivalence relations ≡_c and ≡_r on the same set A. We say that ≡_r is a refinement of ≡_c, or that ≡_c is a coarsening of ≡_r, if (a ≡_r b) ⇒ (a ≡_c b) for any ⟨a, b⟩ ∈ A × A. We can also refer to ≡_c as coarser than ≡_r, and ≡_r as finer than ≡_c.

For example, equivalence mod 10 is a refinement of equivalence mod 5: whenever n ≡_10 m (that is, when n mod 10 = m mod 10) we know for certain that n mod 5 = m mod 5 too. (In other words, we have (n ≡_10 m) ⇒ (n ≡_5 m).) An equivalence class of the coarser relation is formed from the union of one or more equivalence classes of the finer relation. Here ≡_10 is a refinement of ≡_5, and, for example, the equivalence class [3]_≡5 is the union of two equivalence classes from ≡_10, namely [3]_≡10 ∪ [8]_≡10.
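The implication in Definition 8.13 can be spot-checked on a finite sample of integers. The sketch below is only an illustration on the sample −20, …, 20 (a choice of ours); it does not prove the claim for all of Z, and the function and variable names are hypothetical.

```python
def refines(finer, coarser, elements):
    """True if (a finer b) implies (a coarser b) for every pair drawn from `elements`."""
    return all(coarser(a, b)
               for a in elements for b in elements
               if finer(a, b))


def mod10(a, b):
    return a % 10 == b % 10


def mod5(a, b):
    return a % 5 == b % 5


sample = range(-20, 21)
print(refines(mod10, mod5, sample))   # True: equivalence mod 10 refines equivalence mod 5
print(refines(mod5, mod10, sample))   # False: 3 and 8 agree mod 5 but not mod 10

# Within the sample, the class of 3 under mod 5 is the union of two mod-10 classes.
class_3_mod5 = {x for x in sample if mod5(x, 3)}
union = {x for x in sample if mod10(x, 3)} | {x for x in sample if mod10(x, 8)}
print(class_3_mod5 == union)          # True
```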
Taking it further: A deterministic finite automaton (DFA) is a simple model of a so-called “machine” that has a finite amount of memory, and processes an input string by moving from state to state according to a fixed set of rules. DFAs can be used for a variety of applications (for example, in computer architecture, compilers, or in modeling simple behavior in computer games). And they can also be understood in terms of equivalence relations. See p. 846 for more.
Example 8.37 (Refining/coarsening equivalence relations on {a, b, c})
In Example 8.35, we considered five different equivalence relations on {a, b, c}:
{{a}, {b}, {c}}
{{a, b}, {c}}    {{a, c}, {b}}    {{a}, {b, c}}
{{a, b, c}}
Of these, all three equivalence relations in the middle row refine the one-class equivalence relation {{a, b, c}} and coarsen the three-class equivalence relation {{a}, {b}, {c}}. (And the three-class equivalence relation {{a}, {b}, {c}} also refines the one-class equivalence relation {{a, b, c}}.)
Taking it further: This is a very meta comment, but we can think of “is a refinement of” as a relation on equivalence relations on a set A. In fact, the relation “is a refinement of” is reflexive, antisymmetric, and transitive: ≡ refines ≡; if ≡_1 refines ≡_2 and ≡_2 refines ≡_1 then ≡_1 and ≡_2 are precisely the same relation on A; and if ≡_1 refines ≡_2 and ≡_2 refines ≡_3 then ≡_1 refines ≡_3. Thus “is a refinement of” is, as per the definition to follow in the next section, a partial order on equivalence relations on the set A. Thus, for example, there is a minimal element according to the “is a refinement of” relation on the set of equivalence relations on any finite set A: that is, an equivalence relation ≡_min such that ≡_min is refined by no relation aside from ≡_min itself. (Similarly, there’s a maximal relation ≡_max that refines no relation except itself.) See Exercises 8.118 and 8.119.
8.4.2 Partial and Total Orders
An equivalence relation ≡ on a set A has properties that “feel like” a form of equality, differing from = only in that there might be multiple elements that are unequal but nonetheless cannot be distinguished by ≡. Here we’ll introduce a different special type of relation, more akin to ≤ than =, that instead describes a consistent order among the elements of A:

Definition 8.14 (Partial Order)
Let A be a set. A relation ≼ on A that is reflexive, antisymmetric, and transitive is called a partial order. (A relation ≺ on A that is irreflexive, antisymmetric, and transitive is called a strict partial order.)
(Actually, the requirement of antisymmetry in a strict partial order is redundant; see Exercise 8.84.) Here are a few examples, from arithmetic and sets:
Example 8.38 (Some (strict) partial orders on Z: |, >, and ≤)
In Examples 8.18, 8.20, and 8.23, we showed that the following relations are all antisymmetric, transitive, and either reflexive or irreflexive:
1. divides (reflexive): R1 = {⟨n, m⟩ : m mod n = 0} is a partial order.
2. greater than (irreflexive): R2 = {⟨n, m⟩ : n > m} is a strict partial order.
3. less than or equal to (reflexive): R3 = {⟨n, m⟩ : n ≤ m} is a partial order.
Example 8.39 (The subset relation)
Consider the relation ⊆ on the set P ({0, 1}), which consists of the following pairs of sets:
• {} ⊆ {}, {} ⊆ {0}, {} ⊆ {1}, and {} ⊆ {0, 1}.
• {0} ⊆ {0} and {0} ⊆ {0, 1}.
• {1} ⊆ {1} and {1} ⊆ {0, 1}.
• {0, 1} ⊆ {0, 1}.
It’s easy to verify that ⊆ is reflexive, antisymmetric, and transitive. (One easy way to see this fact is via Figure 8.28, which abbreviates the visualizations in Figure 8.13 by leaving out an a-to-c arrow if their relationship is implied by transitivity because of a-to-b and b-to-c arrows. We’ll see more of this type of abbreviated diagram in a moment.)
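The “easy to verify” claim about ⊆ can also be confirmed by brute force. In this sketch (the names are ours, and sets are represented as Python frozensets so that they can appear inside pairs), we build the subset relation on P({0, 1}) and test the three properties from Definition 8.14 directly.

```python
from itertools import combinations


def is_partial_order(R, A):
    reflexive = all((a, a) in R for a in A)
    # Antisymmetry: if both (a, b) and (b, a) are in R, then a and b must be equal.
    antisymmetric = all(a == b for (a, b) in R if (b, a) in R)
    transitive = all((a, d) in R for (a, b) in R for (c, d) in R if b == c)
    return reflexive and antisymmetric and transitive


base = [0, 1]
power_set = [frozenset(c)
             for k in range(len(base) + 1)
             for c in combinations(base, k)]

subset_of = {(s, t) for s in power_set for t in power_set if s <= t}
proper_subset_of = {(s, t) for s in power_set for t in power_set if s < t}

print(len(subset_of))                                   # 9
print(is_partial_order(subset_of, power_set))           # True
print(is_partial_order(proper_subset_of, power_set))    # False: ⊂ is not reflexive
```

The proper-subset relation fails only the reflexivity test; it is irreflexive and transitive, so it is a strict partial order instead.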
Comparability and total orders
Note that, in a partial order ≼, there can be two elements a, b ∈ A such that neither a ≼ b nor b ≼ a. For example, for the subset relation from Example 8.39 we have {0} ⊈ {1} and {1} ⊈ {0}, and for the divides relation we have 17 ∤ 21 and 21 ∤ 17. In this case, the relation ≼ does not say which of these elements is “smaller.” This phenomenon is the reason that ≼ is called a partial order, because it only specifies how some pairs compare.
Figure 8.28: The ⊆ relation on P({0, 1}): A ⊆ B if we can get from A to B by following arrows in this diagram.

There’s a very misleading common-language use of “incomparable” (or “beyond compare”) to mean “unequaled,” as in Cheese from France is incomparable to cheese from Wisconsin. Be careful! “Incomparable” means “cannot be compared” and not “cannot be matched.”
Definition 8.15 (Comparability)
Let ≼ be a partial order on A. We say that two elements a ∈ A and b ∈ A are comparable under ≼ if either a ≼ b or b ≼ a. Otherwise we say that a and b are incomparable.

When there are no incomparable pairs under ≼, then we call ≼ a total order:

Definition 8.16 (Total Order)
A relation ≼ on A is a total order if it’s a partial order and every pair of elements in A is comparable. (A relation ≺ is a strict total order if ≺ is a strict partial order and every pair of distinct elements in A is comparable.)

A few examples of partial and total orders
Here are a few examples of orders, related to strings and to asymptotics:

Example 8.40 (Ordering strings)
Problem: Let Σ^* denote the set of all (finite-length) strings of letters. Which of the following relations on Σ^* are partial orders? Total orders? Which are strict?
1. ⟨x, y⟩ ∈ R if |x| ≥ |y|. (The length of a string x, that is, the number of letters in x, is denoted |x|.)
2. ⟨x, y⟩ ∈ S if x comes alphabetically no later than y. (See Example 3.46.)
3. ⟨x, y⟩ ∈ T if the number of As in x is less than the number of As in y.
Solution:
1. The relation {⟨x, y⟩ : |x| ≥ |y|} is reflexive and transitive, but it is not antisymmetric: for example, both ⟨PASCAL, RASCAL⟩ and ⟨RASCAL, PASCAL⟩ are in the relation, but RASCAL ≠ PASCAL. So this relation isn’t a partial order.
2. The relation “comes alphabetically no later than” is reflexive (every word w comes alphabetically no later than w), antisymmetric (the only word that comes alphabetically no later than w and no earlier than w is w itself), and transitive (if w_1 is alphabetically no later than w_2 and w_2 is no later than w_3, then indeed w_1 is no later than w_3). Thus S is a partial order.
In fact, any two words are comparable under S: either w is a prefix of w′ (and ⟨w, w′⟩ ∈ S) or there’s a smallest index i in which w_i ≠ w′_i (and either ⟨w, w′⟩ ∈ S or ⟨w′, w⟩ ∈ S, depending on whether w_i is earlier or later in the alphabet than w′_i). Thus S is actually a total order.
3. The relation “contains fewer As than” is irreflexive (any word w contains exactly the same number of As as it contains, not fewer than that!) and transitive (writing a_w for the number of As in w: if we have a_w < a_w′ and a_w′ < a_w′′, then we also have a_w < a_w′′). Therefore the relation is antisymmetric (by Exercise 8.84), and thus T is a strict partial order. But neither ⟨PASCAL, RASCAL⟩ nor ⟨RASCAL, PASCAL⟩ is in T: both words contain 2 As, so neither has fewer than the other. Thus RASCAL and PASCAL are incomparable, and T is not a (strict) total order.

Example 8.41 (O and o as orders?)
We’ve argued that o is irreflexive (Example 8.24), transitive (Exercise 6.47), and asymmetric (Example 8.25). Thus o is a strict partial order. But o is not a (strict) total order: we saw a function f(n) in Example 6.6 such that f(n) ≠ o(n^2) and n^2 ≠ o(f(n)), so these two functions are incomparable.
And, though we showed that O is reflexive and transitive (Exercise 6.18), we showed that O is not antisymmetric (Example 8.25), because, for example, the functions f(n) = n^2 and g(n) = 2n^2 are O of each other. Thus O is not a partial order.

Taking it further: A relation like O that is both reflexive and transitive (but not necessarily antisymmetric) is sometimes called a preorder. Although O is not a partial order, it very much has an “ordering-like” feel to it: it does rank functions by their growth rate, but there are clusters of functions that are all equivalent under O. We can think of O as defining a partial order on the equivalence classes under Θ.
We saw another preorder in Example 8.40, with the relation R (“x is at least as long as y”): although there are many pairs of nonidentical strings x and y where ⟨x, y⟩, ⟨y, x⟩ ∈ R, it is only because of ties in lengths that R fails to be a partial order (indeed, a total order).

Hasse diagrams
Let R be any relation on A. For k ≥ 1, we will call a sequence ⟨a_1, a_2, …, a_k⟩ ∈ A^k a cycle if ⟨a_1, a_2⟩, ⟨a_2, a_3⟩, …, ⟨a_{k−1}, a_k⟩ ∈ R and ⟨a_k, a_1⟩ ∈ R. A cycle is a sequence of elements, each of which is related by R to the next element in the sequence (where the last element is related to the first). For a partial order ≼, there are cycles with k = 1 (because a partial order is reflexive, a_1 ≼ a_1 for any a_1), but there are no longer cycles. (You’ll prove this fact in Exercise 8.130.)
Recall the “directed graph” visualization of a relation R ⊆ A × A that we introduced earlier (see Figure 8.13): we write down every element of A, and then, for every pair ⟨a_1, a_2⟩ ∈ R, we draw an arrow from a_1 to a_2. For a relation R that’s a partial order, we’ll introduce a simplified visualization, called a Hasse diagram, that allows us to figure out the full relation R but makes the diagram dramatically cleaner.
Let ≼ be a partial order. Consider three elements a, b, and c such that a ≼ b and b ≼ c and a ≼ c. Then the very fact that ≼ is a partial order means that a ≼ c can be inferred from the fact that a ≼ b and b ≼ c. (That’s just transitivity.) Thus we will omit from the diagram any arrows that can be inferred via transitivity. Similarly, we will leave out self-loops, which can be inferred from reflexivity. Finally, as we discussed above, there are no nontrivial cycles (that is, there are no cycles other than self-loops) in a partial order. Thus we will arrange the elements so that when a ≼ b we will draw a physically below b in the diagram; all arrows will implicitly point upward in the diagram.
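The pruning rules just described (drop self-loops, and drop any arrow implied by transitivity) translate directly into a short computation. The following is a minimal sketch, assuming the partial order is given as a finite set of pairs; the function name hasse_edges is ours, and the divides relation on {1, …, 8} is used purely as a small illustration.

```python
def hasse_edges(R):
    """Pairs of the partial order R that a Hasse diagram must actually draw."""
    elements = {x for pair in R for x in pair}
    return {
        (a, c)
        for (a, c) in R
        if a != c                                     # self-loops follow from reflexivity
        and not any((a, b) in R and (b, c) in R       # (a, c) follows from transitivity
                    for b in elements - {a, c})
    }


divides = {(n, m) for n in range(1, 9) for m in range(1, 9) if m % n == 0}
print(len(divides))                  # 20 pairs in the full relation
print(sorted(hasse_edges(divides)))
# [(1, 2), (1, 3), (1, 5), (1, 7), (2, 4), (2, 6), (3, 6), (4, 8)]
```

Only 8 of the 20 pairs need to be drawn; the rest can be recovered by following lines upward and adding the self-loops back.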
Here are two examples:

Example 8.42 (A small Hasse diagram)
A Hasse diagram for the partial order
{⟨0, 0⟩, ⟨0, 1⟩, ⟨0, 2⟩, ⟨0, 3⟩, ⟨0, 4⟩, ⟨1, 1⟩, ⟨2, 2⟩, ⟨2, 3⟩, ⟨2, 4⟩, ⟨3, 3⟩, ⟨3, 4⟩, ⟨4, 4⟩}
is shown in Figure 8.29. Note that we’ve omitted all arrow directions (they all point up), all five self-loops (they can be inferred from reflexivity), and the pairs ⟨0, 3⟩, ⟨0, 4⟩, and ⟨2, 4⟩ (they can be inferred from transitivity).

Figure 8.29: A small Hasse diagram.

Hasse diagrams are named after Helmut Hasse, a 20th-century German mathematician.

Example 8.43 (Hasse diagram for divides)
A Hasse diagram for the relation | (divides) on the set {1, 2, . . . , 32} is shown in Figure 8.30. Again, the diagram omits arrow directions, self-loops, and “indirect” connections that can be inferred by transitivity. For example, the fact that 2 | 20 is implicitly represented by the arrows 2 → 4 → 20 (or 2 → 10 → 20).

Figure 8.30: A Hasse diagram for “divides” on {1, 2, . . . , 32}. The darker lines represent the Hasse diagram; the lighter arrows give the full picture of the relation, including all of the relationships that can be inferred from the fact that the relation is a partial order.

Which arrows must be shown in a Hasse diagram? Those arrows that cannot be inferred by the definition of a partial order; so we must draw direct connections for all those relationships that are not “short circuits” of pairs of other relationships. In other words, we must draw lines for all those pairs ⟨a, c⟩ of distinct elements where a ≼ c and there is no b ∉ {a, c} such that a ≼ b and b ≼ c. Such a c is called an immediate successor of a.

Warning! When a ≼ b holds for a partial order ≼, we think of a as “smaller” than b under ≼, a view that can be a little misleading if, for example, the partial order in question is ≥ instead of ≤. One example of this oddity: for ≥, the immediate successor of 42 is 41.

Minimal/maximal elements in a partial order
Consider the partial order ≼ := {⟨1, 1⟩, ⟨1, 2⟩, ⟨1, 3⟩, ⟨1, 4⟩, ⟨2, 2⟩, ⟨2, 4⟩, ⟨3, 3⟩, ⟨4, 4⟩}, that is, the divides relation on the set {1, 2, 3, 4}. There’s a strong sense in which 1 is the “smallest” element under ≼: every element a satisfies 1 ≼ a. And there’s a slightly weaker sense in which 3 and 4 are both “largest” elements under ≼: no element a other than 3 itself satisfies 3 ≼ a, and no element a other than 4 itself satisfies 4 ≼ a. These ideas inspire two related pairs of definitions:

Definition 8.17 (Minimum/maximum element)
For a partial order ≼ on A:
• a minimum element is x ∈ A such that, for every y ∈ A, we have x ≼ y.
• a maximum element is x ∈ A such that, for every y ∈ A, we have y ≼ x.

Definition 8.18 (Minimal/maximal element)
For a partial order ≼ on A:
• a minimal element is x ∈ A such that, for every y ∈ A with y ≠ x, we have y ⋠ x.
• a maximal element is x ∈ A such that, for every y ∈ A with y ≠ x, we have x ⋠ y.

Note that x being a minimal element does not demand that every other element be larger than x; it demands only that no element is smaller! (Again, we’re talking about a partial order, so x ⋠ y doesn’t imply that y ≼ x.) In other words, a minimal element is one for which every other element y either satisfies x ≼ y or is incomparable to x.

Example 8.44 (Minimal/maximal/maximum/minimum elements in “divides”)
For the divides relation on {1, 2, . . . , 32} (Example 8.43 and Figure 8.30):
• 1 is a minimum element. (Every n ∈ {1, 2, . . . , 32} satisfies 1 | n.)
• 1 is also a minimal element. (No n ∈ {1, 2, . . . , 32} satisfies n | 1, except n = 1 itself.)
• There is no maximum element. (No n ∈ {1, 2, . . . , 32} aside from 32 satisfies 32 | n, so 32 is the only candidate, but 31 ∤ 32.)
• There are a slew of maximal elements: each of {17, 18, . . . , 32} is a maximal element. (None of these elements divides any n ∈ {1, 2, . . . , 32} other than itself.)
(You’ll prove that any minimum element is also minimal, and that there can be at most one minimum element in a partial order, in Exercises 8.143 and 8.144.)
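Definitions 8.17 and 8.18 differ only in where the quantifier sits, which is easy to see when they are written as code. The sketch below assumes the partial order is supplied as a two-argument test leq(x, y) meaning x ≼ y; the function names are ours. It recomputes the four answers of Example 8.44.

```python
def minimum_elements(A, leq):
    return [x for x in A if all(leq(x, y) for y in A)]


def maximum_elements(A, leq):
    return [x for x in A if all(leq(y, x) for y in A)]


def minimal_elements(A, leq):
    # No *other* element y satisfies y ≼ x.
    return [x for x in A if not any(leq(y, x) for y in A if y != x)]


def maximal_elements(A, leq):
    # No *other* element y satisfies x ≼ y.
    return [x for x in A if not any(leq(x, y) for y in A if y != x)]


def divides(n, m):
    return m % n == 0


A = range(1, 33)
print(minimum_elements(A, divides))   # [1]
print(minimal_elements(A, divides))   # [1]
print(maximum_elements(A, divides))   # []
print(maximal_elements(A, divides))   # [17, 18, ..., 32]
```

“Minimum” asks that x be below everything; “minimal” asks only that nothing else be below x, which is why the second pair of functions can return several elements.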
We’ve already seen partial orders that don’t have minimum or maximum elements, but every partial order must have at least one minimal element and at least one maximal element, as long as the partial order is over a set A that’s finite:

Theorem 8.3 (Every (finite) partial order has a minimal/maximal element)
Let ≼ ⊆ A × A be a partial order on a finite set A. Then ≼ has at least one minimal element and at least one maximal element.

Proof. We’ll prove that there’s a minimal element; the proof for the maximal element is completely analogous. Our proof is constructive; we’ll give an algorithm to find a minimal element. (See Figure 8.31.)

Figure 8.31: An algorithm to find a minimal element.
  Input: a partial order ≼ on a finite set A
  Output: a ∈ A that’s minimal under ≼
  i := 1
  x_1 := an arbitrarily chosen element in A
  while there exists any y ≠ x_i with y ≼ x_i:
    x_{i+1} := any such y (with y ≠ x_i and y ≼ x_i)
    i := i + 1
  return x_i

It’s easy to see that if this algorithm terminates, then it returns a minimal element. After all, the while loop only terminates if we’ve found an x_i ∈ A such that there’s no y ≠ x_i with y ≼ x_i, which is precisely the definition of x_i being a minimal element. Thus the real work is in proving that this algorithm actually terminates.
We claim that after |A| iterations of the while loop (that is, after we’ve defined x_1, x_2, . . . , x_{|A|+1}) we must have found a minimal element. Suppose not. Then we have found elements x_1 ≽ x_2 ≽ · · · ≽ x_{|A|+1}, where x_{i+1} ≠ x_i for each i. Because there are only |A| different elements in A, in a sequence of |A| + 1 elements we must have encountered the same element more than once. (This argument implicitly makes use of the pigeonhole principle, which we’ll see in much greater detail in Chapter 9.) But that’s a cycle containing two or more elements! And Exercise 8.130 asks you to show that there are no such cycles in a partial order.

A maximal whatzit is any whatzit that loses its whatzitness if we add anything to it. A maximum whatzit is the largest possible whatzit. If you’ve studied calculus, you’ve seen a similar distinction under a different name: maximal corresponds to a local maximum; maximum corresponds to a global maximum.

Note that Theorem 8.3 only claimed that a minimal element must exist in a partial order on a finite set A. The claim would be false without that assumption! If A is an infinite set, then there may be no minimal element in A under a partial order. (See Exercise 8.141.)
We can identify minimal and maximal elements of a partial order very easily from the Hasse diagram: they’re simply the elements that aren’t connected to anything above them (the maximal elements), and those that aren’t connected to anything below them (the minimal elements). And, indeed, there are always topmost element(s) and bottommost element(s) in a Hasse diagram, and thus there are always maximal/minimal elements in any partial order, if the set of elements is finite, at least!

8.4.3 Topological Ordering
Partial orders can be used to specify constraints on the order in which certain tasks must be completed. For example, the printer must be loaded with paper before the document can be printed; the document must be written before the document can be printed; the paper must be purchased before the printer can be loaded with paper. Or, as another example: a computer science major at a certain college in the midwest must take courses following the prerequisite structure specified in Figure 8.32. But, while these types of constraints impose a partial order on elements, the jobs must actually be completed in some sequence. (Likewise, the courses must be taken in some sequence, at least for a major who avoids “doubling up” on CS courses in the same term.)
Problem-solving tip: A good visualization of data often makes an apparently complicated statement much simpler. Another way of stating Theorem 8.3 and its proof: start anywhere, and follow lines downward in the Hasse diagram; eventually, you must run out of elements below you, and you can’t go any lower. Thus there’s at least one bottommost element in any (finite) Hasse diagram.

Figure 8.32: The CS major at a certain college in the midwest. (Courses shown: intro to CS, data structures, math of CS, software design, programming languages, algorithms, organization & architecture, computability & complexity.)

The task we face here is to extend a partial order into a total order; that is, to create a total order that obeys all of the constraints of the partial order, while making comparable all previously incomparable pairs.

Definition 8.19 (Consistency of a total order with a partial order)
A total order ≼_total is consistent with the partial order ≼ if a ≼ b implies that a ≼_total b.

In general, there are many total orders that are consistent with a given partial order. Here’s an example:

Example 8.45 (Ordering CS classes)
The following course orderings are consistent with the prerequisites in Figure 8.32. (There are many other valid orderings, too.)
• intro to CS → data structures → math of CS → organization & architecture → software design → programming languages → algorithms → computability & complexity.
• intro to CS → data structures → software design → programming languages → math of CS → algorithms → computability & complexity → organization & architecture.
The first of these orderings corresponds to reading the elements of the Hasse diagram from the bottom-to-top (and left-to-right within a “row”); the second corresponds to completing the top row left-to-right (first recursively completing the requirements to make the next element of the top row valid).

As in these examples, we can construct a total order that’s consistent with any given partial order on the set A.
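One way to carry out such a construction on a small example is to repeatedly pick an element that nothing else still waiting must precede, and place it next. The sketch below is a simple, unoptimized rendering of that idea, applied to the printing tasks described at the start of this section; the function name, the task labels, and the representation of constraints as a set of (earlier, later) pairs are our own assumptions.

```python
def topological_order(elements, before):
    """List `elements` so that a comes before b whenever (a, b) is in `before`.

    `before` holds strict constraints and is assumed to contain no cycle;
    if it does contain one, the call to next() raises StopIteration.
    """
    remaining = list(elements)
    order = []
    while remaining:
        # A minimal element: no other remaining element has to come before it.
        minimal = next(x for x in remaining
                       if not any((y, x) in before for y in remaining if y != x))
        order.append(minimal)
        remaining.remove(minimal)
    return order


tasks = ["print document", "load printer", "write document", "buy paper"]
before = {
    ("load printer", "print document"),
    ("write document", "print document"),
    ("buy paper", "load printer"),
}

order = topological_order(tasks, before)
print(order)
# ['write document', 'buy paper', 'load printer', 'print document']
```

Other answers are equally valid (buying the paper first, for instance); which one comes out depends only on the order in which the tasks happen to be listed.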
Such an ordering of A is called a topological ordering of A. (Some people will refer to a topological ordering as a topological sort of A.) We’ll prove this result inductively, by repeatedly identifying a minimal element a from the set of unprocessed elements, and then adding constraints to make a be a minimum element (and not just a minimal element). Theorem 8.4 (Extending any partial order to a total order) Let A be any finite set with a partial order ≼. Then there is a total order ≼total on A that’s consistent with ≼. Proof. We’llproceedbyinductionon|A|. For the base case (|A| = 1), the task is trivial: there’s simply nothing to do! The relation ≼ must be {⟨a, a⟩}, where A = {a}, because partial orders are reflexive. And the relation {⟨a, a⟩} is a total order on {a} that’s consistent with ≼. Figure 8.33: A sketch of the proof of Theorem 8.4. First, we identify some minimal element a∗ in ≼ (left panel). Then we turn a∗ into a minimum element by adding constraints (thick lines in the right panel), and then we inductively find a total ordering of the remaining partial order (the shaded box at right). ·· ··· a• · · ∗ ·· ··· ·· a• ∗ For the inductive case (|A| ≥ 2), we assume the inductive hypothesis (for any set A′ ′ ′ ′ of size |A | = |A| − 1 and any partial order on A , there’s a total order on A consistent with that partial order). We must show how to extend ≼ to be a total order on all of A. Here’s the idea: we’ll remove some element of A that can go first in the total order, inductively find a total order of all the remaining elements, and then add the removed element to the beginning of the order. Morespecifically,leta∗ ∈Abeanarbitraryminimalelementunder≼onA—in other words, let a∗ be any element such that no b ∈ A − {a∗} satisfies b ≼ a∗. Such an element is guaranteed to exist by Theorem 8.3. Add any missing pair ⟨a∗, b⟩ to ≼. 
It’s easy to see that ≼ is still a partial order on A: by the definition of a minimal element, we haven’t introduced any violations of transitivity or antisymmetry. Now, inductively, we extend the partial order ≼ on A − {a∗} to a total order; the result is a total order on A that’s consistent with ≼. (See Figure 8.33.)

(Slightly more formally: note that ≼′ := {⟨x, y⟩ ∈ (A − {a∗}) × (A − {a∗}) : x ≼ y} is a partial order on A − {a∗}; by the inductive hypothesis, there exists a total order ≼′total on A − {a∗} consistent with ≼′. Define ≼total := {⟨x, y⟩ ∈ A × A : ⟨x, y⟩ ∈ ≼′total or x = a∗}. It’s easy to verify that ≼total is a total order on A that’s consistent with ≼.)

Taking it further: Deciding the order in which to compute the cells of a spreadsheet (where a cell might depend on a list of other cells’ contents) is solved using a topological ordering. In this setting, let C denote the set of cells in the spreadsheet, and define a relation R ⊆ C × C where ⟨c, c′⟩ ∈ R if we need to know the value in cell c before we can compute the value for c′. (For example, if cell C4’s value is determined by the formula A1 + B1 + C1, then the three pairs ⟨A1, C4⟩, ⟨B1, C4⟩, and ⟨C1, C4⟩ are all in R.) Note that it’s not possible to compute all the values in a spreadsheet if there’s a cell x whose value depends on cell y, which depends on · · · , which depends on cell x. In other words, the “depends on” relationship cannot have a cycle! Furthermore, we’re in trouble if there’s a cell x whose value depends on x itself. In other words, we can compute the values in a spreadsheet if and only if R is irreflexive and transitive: that is, if R is a strict partial order.

Another problem that can be solved using the idea of topological ordering is that of hidden-surface removal in computer graphics: we have a 3-dimensional “scene” of objects that we’d like to display on a 2-dimensional screen, as if it were being viewed from a camera.
We need to figure out which of the objects are invisible from the camera (and therefore need not be drawn) because they’re “behind” other objects. One classic algorithm, called the painter’s algorithm, solves this problem using ideas from relations and topological ordering. See the discussion on p. 847.

Computer Science Connections
Deterministic Finite Automata (DFAs)

As we hinted at previously (see the discussion of regular expressions on p. 830), there are some interesting computational applications of finite-state machines, a formal model for a computational device that uses a fixed (finite) amount of memory to respond to input. Variations on these machines can be used in building very simple characters in a video game, in computer architecture, in software systems to do automatic speech recognition, and other tasks. They can also identify which strings match a given regular expression. In fact, for a set of strings L, it’s a theorem that there exists a finite-state machine M that recognizes precisely the strings in L if and only if there’s a regular expression α that matches precisely the strings in L.

Formally, a deterministic finite automaton (DFA), the simplest version of a finite-state machine, is a quintuple M = ⟨Σ, Q, δ, s, F⟩, where:
• Σ is a finite alphabet, the set of input symbols the machine can handle;
• Q is a finite set of states; the machine is always in one of these states. (The fact that Q is finite corresponds to M having only finite memory.)
• δ : Q × Σ → Q is a transition function: when the machine is in state q ∈ Q and sees an input symbol a ∈ Σ, the machine moves into state δ(q, a).
• s ∈ Q is the start state, where M begins before having seen any input.
• F ⊆ Q is the set of final states. If, after processing a string x, M ends up in a state q ∈ F, then M accepts x; if M ends in a state q ∉ F, then M rejects x.

An example of a DFA that accepts all bitstrings whose first two symbols are the same is shown in Figure 8.34.
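The five components of a DFA translate almost directly into code. The following is a minimal sketch, assuming that δ is stored as a Python dictionary keyed by (state, symbol) pairs; the names `dfa_accepts`, `DELTA_834`, and `first_two_same` are illustrative choices, and the table encodes the machine of Figure 8.34.

```python
def dfa_accepts(delta, start, finals, x):
    """Run a DFA on the string x and report whether it ends in a final state."""
    state = start
    for symbol in x:
        # One step of the machine: move to delta(q, a).
        state = delta[(state, symbol)]
    return state in finals


# Transition function of the DFA in Figure 8.34 (alphabet {0, 1}).
DELTA_834 = {
    ("a", "0"): "b",       ("a", "1"): "c",
    ("b", "0"): "win",     ("b", "1"): "lose",
    ("c", "0"): "lose",    ("c", "1"): "win",
    ("win", "0"): "win",   ("win", "1"): "win",
    ("lose", "0"): "lose", ("lose", "1"): "lose",
}


def first_two_same(x):
    """Accept exactly the bitstrings whose first two symbols are equal."""
    return dfa_accepts(DELTA_834, start="a", finals={"win"}, x=x)
```

Note that strings with fewer than two symbols stop in state a, b, or c, none of which is final, so they are rejected.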
• Σ = {0, 1}
• Q = {a, b, c, win, lose}
• δ is defined by the following table:

  state   on 0   on 1
  a       b      c
  b       win    lose
  c       lose   win
  win     win    win
  lose    lose   lose

• the start state is a.
• the only final state is win.

Figure 8.34: A DFA accepting all bitstrings whose first two symbols are the same, both by defining all five components, and by a picture. The start state is marked with an unattached incoming arrow; from state q on input symbol a, the arrow leaving q with label a points to δ(q, a). Final states are circled.

We can also understand DFAs, and the sorts of sets of strings that they can recognize, by thinking about equivalence relations. To see this connection, suppose that we wish to identify binary strings representing integers that are evenly divisible by 3. (So 11 and 1001 and 1111 are all “yes” because 3 | 3 and 3 | 9 and 3 | 15, but 10001 is “no” because 3 ∤ 17.)

Here’s one way to solve this problem. Let’s define an equivalence relation on binary strings, where x ≡ y if and only if, for any bitstring z, we have that (xz is divisible by 3) ⇔ (yz is divisible by 3). In other words, two bitstrings x and y are equivalent if, no matter what additional bitstring suffix we add to both of them, the two resulting bitstrings are either both divisible by three or both not divisible by three. For example, it turns out that 11 ≡ 1001 (11 and 1001 are both “yes”; 110 and 10010 are both “yes”; 111 and 10011 are both “no”; 1110 and 100110 are both “no”; etc.). Similarly, we have 1000 ≡ 10. It’s not hard to prove that ≡ is an equivalence relation. It’s also true, though a bit harder to prove, that there are only three equivalence classes for ≡. (Those equivalence classes are: bitstrings that are 0 mod 3, those that are 1 mod 3, and those that are 2 mod 3.) Thus we can actually figure out whether a bitstring is evenly divisible by 3 with the simple DFA in Figure 8.35.
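One way to realize such a three-state machine in code is sketched below. It assumes that each state simply records the remainder mod 3 of the number read so far (the labels 0, 1, 2 and the names `DELTA_MOD3` and `divisible_by_3` are illustrative, not taken from the figure). Appending a bit b to a string with value v produces the value 2v + b, so remainder r moves to (2r + b) mod 3.

```python
# States are the remainders 0, 1, 2 of the number read so far (an assumed labeling).
# Appending bit b to a string with value v gives value 2*v + b, so the
# remainder r moves to (2*r + b) mod 3.
DELTA_MOD3 = {(r, b): (2 * r + int(b)) % 3 for r in (0, 1, 2) for b in "01"}


def divisible_by_3(bits):
    """Return True if the bitstring `bits` represents a multiple of 3."""
    state = 0                       # start state: nothing read yet, value 0
    for b in bits:
        state = DELTA_MOD3[(state, b)]
    return state == 0               # the only final state is remainder 0
```

Two bitstrings that leave this machine in the same state behave identically under every suffix, which is the sense in which the states correspond to the equivalence classes of ≡.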
The three states of this machine, going from left to right, correspond to the three equivalence classes for ≡, namely [0], [1], and [10]. (For a set of strings that cannot be recognized by a DFA, for example bitstrings with an equal number of 0s and 1s, there are an infinite number of equivalence classes for ≡.)⁷

Figure 8.35: A DFA for bitstrings representing numbers divisible by 3. The input is divisible by three if and only if we end up in the leftmost state.

These particular DFAs merely hint at the kind of problem that can be solved with this kind of machine. For much more, see a good textbook in formal languages, such as:
⁷ Dexter Kozen. Automata and Computability. Springer, 1997; and Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012.

Computer Science Connections
The Painter’s Algorithm and Hidden-Surface Removal

At a high level, the goal in computer graphics is to take a 3-dimensional scene, a set of objects in R³ (with differing shapes, colors, surface reflectivities, textures, etc.), as seen from a particular vantage point (a point and a direction, also in R³). The task is then to project the scene into a 2-dimensional image. There are a lot of components to this task, and we’ve already talked a bit about some of them: typically we’ll approximate the shapes of the objects using a large collection of triangles (see p. 528), and then compute where each triangle shows up in the camera’s view, in R², via rotation (see p. 249).

Even after triangulation and rotation, we are still left with another important step: when two triangles overlap in the 2-dimensional image, we have to figure out which to draw, that is, which one is obscured by the other. This task is also known as hidden-surface removal: we want to omit whatever pieces of the image aren’t visible.
For example, when we wish to render the humble forest scene in Figure 8.36, we have to draw trees in front of and behind the house, and one particular tree in front of another. One approach to hidden-surface removal is called the Painter’s Algorithm, named after a hypothetical artist at an easel: we can “paint” the shapes in the image “from back to front,” simply painting over faraway shapes with the closer ones as we go.

How might we implement this approach? Let S be the set of shapes that we have to draw. We can compute a relation obscures ⊆ S × S, where a pair ⟨s1, s2⟩ ∈ obscures tells us that we have to draw s2 before we draw s1. We seek a total order on S that is consistent with the obscures relation; we’ll draw the shapes in this order.

Unfortunately obscures isn’t a total order, or even a partial order! The biggest problem with obscures is that we can have “cycles of obscurity”: s1 obscures s2 which obscures s3 which, eventually, obscures a shape sk that obscures s1. (See Figure 8.37; although it may look like an M. C. Escher drawing, there’s nothing strange going on, just three triangles that overlap a bit like a pretzel.) This issue can be resolved using some geometric algorithms specific to the particular task: we’ll split up shapes in each cycle of obscurity (splitting the black triangle into a left-half and a right-half object, for example) so that we no longer have any cycles. (Again see Figure 8.37.)

We now have an expanded set S′ of shapes, and a cycle-free relation obscures on S′. We can use this relation to compute the order in which to draw the shapes, as follows:
• compute the reflexive, transitive closure of obscures on S′. The resulting relation is a partial order on S′.
• extend this partial order to a total order on S′, using Theorem 8.4.
We now have a total ordering on the shapes that respects the obscures relation, so we can draw the shapes in precisely this order.⁸

Figure 8.36: A house in a golden wood.
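The ordering step can be sketched in code by following the proof of Theorem 8.4: repeatedly pick a minimal element among the shapes not yet drawn. This is a minimal sketch under stated assumptions, not a production renderer: shapes are represented by hypothetical string labels, `obscures` is assumed to be a cycle-free set of pairs (front, back), and the function name `painting_order` is an illustrative choice. It works directly with the pairs in `obscures` rather than building the closure explicitly, since a shape is safe to paint exactly when every shape it covers has already been painted.

```python
def painting_order(shapes, obscures):
    """Return the shapes in a back-to-front order consistent with `obscures`.

    `obscures` is a set of pairs (front, back): `front` covers part of `back`,
    so `back` has to be painted first. Assumes there are no cycles of obscurity.
    """
    remaining = set(shapes)
    order = []
    while remaining:
        # A shape is ready when everything it obscures has already been painted,
        # i.e., it is a minimal element among the remaining shapes.
        ready = [s for s in remaining
                 if not any(front == s and back in remaining
                            for (front, back) in obscures)]
        if not ready:
            raise ValueError("cycle of obscurity: split a shape first")
        chosen = min(ready)  # any ready shape works; min() keeps the result deterministic
        order.append(chosen)
        remaining.remove(chosen)
    return order


# A made-up three-shape scene: the tree is in front of the house,
# which is in front of the sky.
scene = {"sky", "house", "tree"}
covers = {("house", "sky"), ("tree", "house")}
draw_order = painting_order(scene, covers)
```

Scanning all pairs on every round is quadratic or worse; it is written this way to mirror the inductive proof rather than to be fast.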
Figure 8.37: A cycle of obscurity, and splitting one of the cycle’s pieces to break the cycle.

While the Painter’s Algorithm does correctly accomplish hidden-surface removal, it’s pretty slow (particularly as we’ve described it here). For example, when there are many layers to a scene, we actually have to “paint” each pixel in the resulting image many many times. Every computation of a pixel’s color before the last is a waste of time. You can learn about cleverer approaches to hidden-surface removal, like the “z-buffer,” in a good textbook on computer graphics, such as:
⁸ John F. Hughes, Andries van Dam, Morgan McGuire, David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt Akeley. Computer Graphics: Principles and Practice. Addison-Wesley, 3rd edition, 2013.

8.4.4 Exercises

List all equivalence relations . . .
8.108 . . . on {0, 1}.
8.109 . . . on {0, 1, 2, 3}.

Are the following relations on P({0, 1, 2, 3}) equivalence relations? If so, list the equivalence classes under the relation; if not, explain why not.
8.110 ⟨A, B⟩ ∈ R1 if and only if (i) A and B are nonempty and the largest element in A equals the largest element in B, or (ii) A = B = ∅.
8.111 ⟨A, B⟩ ∈ R2 if and only if the sum of the elements in A equals the sum of the elements in B.
8.112 ⟨A, B⟩ ∈ R3 if and only if the sum of the elements in A equals the sum of the elements in B and the largest element in A equals the largest element in B. (That is, R3 = R1 ∩ R2.)
8.113 ⟨A, B⟩ ∈ R4 if and only if A ∩ B ≠ ∅.
8.114 ⟨A, B⟩ ∈ R5 if and only if |A| = |B|.

In Example 8.11, we considered the relation M := {⟨m, d⟩ : in some years, month m has d days}, and computed the pairs in the relation M−1 ◦ M. By checking all the requirements (or by visual inspection of Figure 8.13(b)), we see that M−1 ◦ M is an equivalence relation. But it turns out that the fact that M−1 ◦ M is an equivalence relation says something particular about M, and is not true in general. Let R ⊆ A × B be an arbitrary relation.
Prove or disprove whether R−1 ◦ R must have the three required properties of an equivalence relation (at least one of these is false!):
8.115 Prove or disprove: R−1 ◦ R must be reflexive.
8.116 Prove or disprove: R−1 ◦ R must be symmetric.
8.117 Prove or disprove: R−1 ◦ R must be transitive.

Let A be any set. There exist two equivalence relations ≡coarsest and ≡finest with the following property: if ≡ is an equivalence relation on A, then (i) ≡ refines ≡coarsest, and (ii) ≡finest refines ≡.
8.118 Identify ≡coarsest, prove that it’s an equivalence relation, and prove property (i) above.
8.119 Identify ≡finest, prove that it’s an equivalence relation, and prove property (ii) above.
8.120 In many programming languages, there are two distinct but related notions of “equality”: has the same value as and is the same object as. In Python, these are denoted as == and is, respectively; in Java, they are .equals() and ==, respectively. (Confusingly!) (For example, in Python, 1776 + 1 is 1777 is false, but 1776 + 1 == 1777 is true.) Does one of these equality relations refine the other? Explain.

8.121 List all partial orders on {0, 1}.
8.122 List all partial orders on {0, 1, 2}.

Are the following relations on P({0, 1, 2, 3}) partial orders, strict partial orders, or neither? Explain.
8.123 ⟨A, B⟩ ∈ R1 ⇔ ∑a∈A a ≤ ∑b∈B b
8.124 ⟨A, B⟩ ∈ R2 ⇔ ∏a∈A a ≤ ∏b∈B b
8.125 ⟨A, B⟩ ∈ R3 ⇔ A ⊆ B
8.126 ⟨A, B⟩ ∈ R4 ⇔ A ⊇ B
8.127 ⟨A, B⟩ ∈ R5 ⇔ |A| < |B|

8.128 Prove that ≼ is a partial order if and only if ≼−1 is a partial order.
8.129 Prove that if ≼ is a partial order, then {⟨a, b⟩ : a ≼ b and a ≠ b} is a strict partial order.
8.130 A cycle in a relation R is a sequence of k distinct elements a0, a1, . . . , ak−1 ∈ A where ⟨ai, a(i+1) mod k⟩ ∈ R for each i ∈ {0, 1, . . . , k − 1}. A cycle is nontrivial if k ≥ 2. Prove that there are no nontrivial cycles in any transitive, antisymmetric relation R. (Hint: use induction on the length k of the cycle.)

Let S ⊆ Z≥1 × Z≥1 be a collection of points.
Define the relation R ⊆ S × S as follows: ⟨⟨a, b⟩, ⟨x, y⟩⟩ ∈ R if and only if a ≤ x and b ≤ y. (You can think of ⟨a, b⟩ ∈ S as an a-by-b picture frame, and ⟨f, f′⟩ ∈ R if and only if f fits inside f′. Or you can think of ⟨a, b⟩ ∈ S as a job that you’d get a “happiness points” from doing and that pays you b dollars, and ⟨j, j′⟩ ∈ R if and only if j generates no more happiness and pays no more than j′.)
8.131 Show that R might not be a total order by identifying two incomparable elements of Z≥1 × Z≥1.
8.132 Prove that R must be a partial order.

8.133 Write out all pairs in the relation represented by the Hasse diagram in Figure 8.38(a).
8.134 Repeat for Figure 8.38(b).
8.135 Draw the Hasse diagram for the partial order ⊆ on the set P({1, 2, 3}).
8.136 Draw the Hasse diagram for the partial order ≼ on the set S := {0, 1} ∪ {0, 1}² ∪ {0, 1}³, where, for two bitstrings x, y ∈ S, we have x ≼ y if and only if x is a prefix of y.

Figure 8.38: Some Hasse diagrams, (a) and (b), each on the elements 1, 2, 3, 4, 5.

Let ≼ be a partial order on A. Recall that an immediate successor of a ∈ A is an element c such that (i) a ≼ c, and (ii) there is no b ∉ {a, c} such that a ≼ b and b ≼ c. In this case a is said to be an immediate predecessor of c.
8.137 For the partial order ≥ on Z≥1, identify all the immediate predecessor(s) and immediate successor(s) of 202.
8.138 For the partial order | (divides) on Z≥1, identify all the immediate predecessor(s) and immediate successor(s) of 202.
8.139 Give an example of a strict partial order on Z≥1 such that every integer has exactly two different immediate successors.
8.140 Prove that for a partial order ≼ on A when A is finite there must be an a ∈ A that has fewer than two immediate successors.
8.141 Consider the partial order ≥ on the set Z≥0. Argue that there is no maximal element in Z.
8.142 Note that there is a minimal element under the partial order ≥ on Z≥0—namely 0, which is also the minimum element. Give an example of a partial order on an infinite set that has neither a minimal nor a maximal element. 8.143 Let ≼ be a partial order on a set A. Prove that there is at most one minimum element in A under ≼. (That is, prove that if a ∈ A and b ∈ A are both minimum elements, then a = b.) 8.144 Let ≼ be a partial order on a set A, and let a ∈ A be a minimum element under ≼. Prove that a is also a minimal element. Here’s a (surprisingly addictive) word game that can be played with a set of Scrabble tiles. Each player has a set of words that she “owns”; there is also a set of individual tiles in the middle of the table. At any moment, a player can form a new word by taking both (1) one or more tiles from the middle, and (2) zero or more words owned by any of the players; and reordering those letters to form a new word, which the player now owns. For example, from the word GRAMPS and the letters R and O, a player could make the word PROGRAMS. Define a relation ≼ on the set W of English words (of three or more letters), as follows: w ≼ w′ if w′ can be formed from word w plus one or more individual letters. For example, we showed above that GRAMPS ≼ PROGRAMS. 8.145 Give a description (in English) of what it means for a word w to be a minimal element under ≼, and what it means for a word w′ to be a maximal element under ≼. 8.146 (programming required) Write a program that, given a word w, finds all immediate successors of w. (You can find a dictionary of English words on the web, or /usr/share/dict/words on Unix-based operating systems.) Report all immediate successors of GRAMPS using your dictionary. 8.147 (programming required) Write a program to find the English word that is the longest minimal element under ≼ (that is, out of all minimal elements, find the one that contains the most letters). 
(If you’re bored and decide to waste time playing this game: it’s more fun if you forbid stealing words with “trivial” changes, like changing COMPUTER into COMPUTERS. Each player should also get a fair share of the tiles, originally face down; anyone can flip a new tile into the middle of the table at any time.) 8.148 Consider a spreadsheet containing a set of cells C. A cell c can contain a formula that depends on zero or more other cells. Write ≼ to denote the relation {⟨p, s⟩ : cell s depends on cell p}. For example, the value of cell C2 might be the result of the formula A2 ∗ B1; here A2 ≼ C2 and B1 ≼ C2. A spreadsheet is only meaningful if ≼ is a strict partial order. Give a description (in English) of what it means for a cell c to be a minimal element under ≼, and what it means for a cell c′ to be a maximal element under ≼. (a) 8.149 List all total orders consistent with the partial order reproduced in Figure 8.39(a). 8.150 Repeat for the partial order reproduced in Figure 8.39(b). A chain in a partial order ≼ on A is a set C ⊆ A such that ≼ imposes a total order on C—that is, writing the elements of C as C = {c1,c2,...,ck} [in an appropriate order], we have c1 ≼ c2 ≼ ··· ≼ ck. 8.151 Identify all chains of k ≥ 2 elements in the partial order in Figure 8.39(a). 8.152 Repeat for the partial order reproduced in Figure 8.39(b). An antichain in a partial order ≼ on A is a set S ⊆ A such that no two distinct elements in S are comparable under ≼—that is, for any distinct a, b ∈ S we have a ̸≼ b. 8.153 Identify all antichains S with |S| ≥ 2 in the partial order in Figure 8.39(a). (b) 8.154 Repeat for the partial order reproduced in Figure 8.39(b). 8.155 Consider the set A := {1, 2, . . . , n}. Consider the following claim: there exists a relation ≼ on the set A that is both an equivalence relation and a partial order. Either prove that the claim is true (and describe, as precisely as possible, the structure of any such relation ≼) or disprove the claim. 
Figure 8.39: Reproductions of the Hasse diagrams from Figure 8.38.

8.5 Chapter at a Glance

Formal Introduction

A (binary) relation on A × B is a subset of A × B. For a relation R on A × B, we can write ⟨a, b⟩ ∈ R or a R b. When A and B are both finite, we can describe R using a two-column table, where a row containing a and b corresponds to ⟨a, b⟩ ∈ R. Or we can view R graphically: draw all elements of A in one column, all elements of B in a second column, and draw a line connecting a ∈ A to b ∈ B whenever ⟨a, b⟩ ∈ R.

We’ll frequently be interested in a relation that’s a subset of A × A, where the two sets are the same. In this case, we may refer to a subset of A × A as simply a relation on A. For a relation R ⊆ A × A, it’s more convenient to visualize R using a directed graph, without separated columns: we simply draw each element of A, with an arrow from a1 to a2 whenever ⟨a1, a2⟩ ∈ R.

The inverse of a relation R ⊆ A × B is a new relation, denoted R−1, that “flips around” every pair in R: the relation R−1 := {⟨b, a⟩ : ⟨a, b⟩ ∈ R} is a subset of B × A. The composition of two relations R ⊆ A × B and S ⊆ B × C is a new relation, denoted S ◦ R, that, informally, represents the successive “application” of R and S. A pair ⟨a, c⟩ is related under S ◦ R ⊆ A × C if and only if there exists an element b ∈ B such that ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ S.

For sets A and B, a function f from A to B, written f : A → B, is a special kind of relation on A × B where, for every a ∈ A, there exists one and only one element b ∈ B such that ⟨a, b⟩ ∈ f.

An n-ary relation is a generalization of a binary relation (n = 2) to describe a relationship among n-tuples, rather than just pairs. An n-ary relation on the set A1 × A2 × · · · × An is just a subset of A1 × A2 × · · · × An; an n-ary relation on a set A is a subset of Aⁿ.

Properties of Relations: Reflexivity, Symmetry, and Transitivity

A relation R on A is reflexive if, for every a ∈ A, we have that ⟨a, a⟩ ∈ R. It’s irreflexive if ⟨a, a⟩ ∉ R for every a ∈ A.
(In the visualization described above, where we draw an arrow a1 → a2 whenever ⟨a1, a2⟩ ∈ R, reflexivity corresponds to every element having a “self-loop” and irreflexivity corresponds to no self-loops.) Note that a relation might be neither reflexive nor irreflexive. A relation R on A is symmetric if, for every a,b ∈ A, we have ⟨a,b⟩ ∈ R if and only if ⟨b, a⟩ ∈ R. The relation is antisymmetric if the only time both ⟨a, b⟩ ∈ R and ⟨b, a⟩ ∈ R is when a = b, and it’s asymmetric if it’s never the case that ⟨a,b⟩ ∈ R and ⟨b,a⟩ ∈ R whether a ̸= b or a = b. Note that, while asymmetry implies antisymmetry, they are different properties—and they’re both different from “not symmetric”; a relation might not be symmetric, antisymmetric, or asymmetric. (In the visualization, a relation is symmetric if every arrow a → b is matched by an arrow b → a; it’s antisymmetric if there are no matched bidirectional pairs of arrows between a and b ̸= a; and it’s asymmetric if it’s antisymmetric and furthermore there aren’t even any self-loops.) An alternative view is that a relation R is symmetric if and only if R ∩ R−1 = R = R−1; it’s antisymmetric if and only if R ∩ R−1 ⊆ {⟨a, a⟩ : a ∈ A}; and it’s asymmetric if and only ifR∩R−1 =∅. A relation R on A is transitive if, for every a,b,c ∈ A, if ⟨a,b⟩ ∈ R and ⟨b,c⟩ ∈ R, then ⟨a, c⟩ ∈ R too. In the visualization, R is transitive if there are no “open triangles”: in a chain of connected elements, every element is also connected to all “downstream” connections. The relation R is transitive if and only if R ◦ R ⊆ R. For a relation R ⊆ A × A, the closure of R with respect to some property is the smallest relation R′ ⊇ R that has the named property. For example, the symmetric closure of R is the smallest relation R′′ ⊇ R such that R′′ is symmetric. We also define the reflexive closure R′; the transitive closure R+; the reflexive transitive closure R∗; and the reflexive symmetric transitive closure R≡. 
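The set-based characterizations in this summary can be checked mechanically on small examples. The sketch below assumes that a relation is stored as a Python set of pairs; all function names are illustrative. Symmetry, antisymmetry, and transitivity are tested exactly as stated above (R = R−1, R ∩ R−1 containing only pairs ⟨a, a⟩, and R ◦ R ⊆ R), and the transitive closure is built by repeatedly adding missing pairs until nothing changes.

```python
def inverse(R):
    """R^-1: flip every pair of R."""
    return {(b, a) for (a, b) in R}


def compose(S, R):
    """S ∘ R: apply R first, then S."""
    return {(a, c) for (a, b) in R for (b2, c) in S if b == b2}


def is_reflexive(R, A):
    return all((a, a) in R for a in A)


def is_symmetric(R):
    return R == inverse(R)


def is_antisymmetric(R):
    # Pairs present in both directions must be of the form (a, a).
    return all(a == b for (a, b) in R & inverse(R))


def is_transitive(R):
    return compose(R, R) <= R


def transitive_closure(R):
    """Smallest transitive relation containing R (for a finite R)."""
    closure = set(R)
    while True:
        missing = compose(closure, closure) - closure
        if not missing:
            return closure
        closure |= missing


# Small illustrative relations on A = {1, 2, 3}.
A = {1, 2, 3}
LEQ = {(a, b) for a in A for b in A if a <= b}   # the usual <= order
STEP = {(1, 2), (2, 3)}                          # not transitive: (1, 3) is missing
```

Here LEQ passes the reflexive, antisymmetric, and transitive checks, so it is a partial order, while STEP becomes transitive only after (1, 3) is added.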
When A is finite, we can compute any of these closures by repeatedly adding any missing elements to the set. The reflexive closure of R is given by R ∪ {⟨a, a⟩ : a ∈ A}; the symmetric closure of R is R ∪ R−1; and the transitive closure of R is R ∪ R² ∪ R³ ∪ · · · .

Special Relations: Equivalence Relations and Partial/Total Orders

There are two special kinds of relations that emerge from particular combinations of these properties: equivalence relations and partial/total orders.

Equivalence relations: An equivalence relation is a relation ≡ that’s reflexive, symmetric, and transitive. Such a relation partitions the elements of A into one or more categories, called equivalence classes; any two elements in the same equivalence class are related by ≡, and no two elements in different equivalence classes are related.

A refinement of ≡ is another equivalence relation ≡r on the same set A where a ≡ b whenever a ≡r b. Each equivalence class of ≡ is partitioned into one or more equivalence classes by ≡r, but no equivalence class of ≡r intersects with more than one equivalence class of ≡. We also call ≡ a coarsening of ≡r.

Partial and total orders: A partial order is a reflexive, antisymmetric, and transitive relation ≼. (A strict partial order ≺ is irreflexive, antisymmetric, and transitive.) Elements a and b are comparable under ≼ if either a ≼ b or b ≼ a; otherwise they’re incomparable. A Hasse diagram is a simplified visual representation of a partial order where we draw a physically below c whenever a ≼ c, and we omit the a → c arrow if there’s some other element b such that a ≼ b ≼ c. (We also omit self-loops.)

For a partial order ≼ on A, a minimum element is an element a ∈ A such that, for every b ∈ A, we have a ≼ b; a minimal element is an a ∈ A such that, for every b ∈ A with b ≠ a, we have b ⋠ a. (Maximum and maximal elements are defined analogously.) Every minimum element is also minimal, but a minimal element a isn’t minimum unless a is comparable with every other element.
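The difference between minimal and minimum elements is easy to see on a small example. The following sketch assumes the partial order is supplied as a Python function `leq(a, b)` that returns True when a ≼ b; the function names and the two example sets under the “divides” order are illustrative choices.

```python
def minimal_elements(A, leq):
    """Elements a with no other element b of A satisfying b ≼ a."""
    return {a for a in A if not any(b != a and leq(b, a) for b in A)}


def minimum_elements(A, leq):
    """Elements a satisfying a ≼ b for every b in A (at most one exists)."""
    return {a for a in A if all(leq(a, b) for b in A)}


def divides(a, b):
    return b % a == 0


# Under "divides", 2 and 3 are both minimal in {2, 3, 4, 6, 12}, but they are
# incomparable with each other, so there is no minimum element.
no_minimum = {2, 3, 4, 6, 12}

# In {1, 2, 3, 6}, the element 1 divides everything, so it is the minimum
# (and therefore also the only minimal element).
has_minimum = {1, 2, 3, 6}
```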
There’s at least one minimal element in any partial order on a finite set. A total order is a partial order under which all pairs of elements are comparable. A total order ≼total is consistent with the partial order ≼ if a ≼ b implies that a ≼total b. For any partial order ≼ on a finite set A, there is a total order ≼total on A that’s consistent with ≼. Such an ordering of A is called a topological ordering of A.

Key Terms and Results

Key Terms

Formal Introduction
• (binary) relation
• inverse (of a relation)
• composition (of two relations)
• functions (as relations)
• n-ary relation

Properties of Relations
• reflexivity
• irreflexivity
• symmetry
• asymmetry
• antisymmetry
• transitivity
• closures (of a relation)

Special Relations
• equivalence relation
• equivalence class
• coarsening, refinement
• partial order
• comparability
• total order
• Hasse diagram
• minimal/maximal element
• minimum/maximum element
• consistency (of a total order with a partial order)
• topological ordering

Key Results

Formal Introduction
1. For relations R ⊆ A × B and S ⊆ B × C, the relations R−1 ⊆ B × A and S ◦ R ⊆ A × C (the inverse of R and the composition of R and S) are defined as
   R−1 := {⟨b, a⟩ : ⟨a, b⟩ ∈ R}
   S ◦ R := {⟨a, c⟩ : ∃b ∈ B such that ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ S}.
2. A function f : A → B is a special case of a relation on A × B, where, for every a ∈ A, there exists one and only one element b ∈ B such that ⟨a, b⟩ ∈ f.

Properties of Relations
1. A relation R is symmetric if and only if R ∩ R−1 = R = R−1; it’s antisymmetric if and only if R ∩ R−1 ⊆ {⟨a, a⟩ : a ∈ A}; and it’s asymmetric if and only if R ∩ R−1 = ∅.
2. A relation R is transitive if and only if R ◦ R ⊆ R.
3. The reflexive closure of R is R ∪ {⟨a, a⟩ : a ∈ A}; the symmetric closure of R is R ∪ R−1; and the transitive closure of R is R ∪ R² ∪ R³ ∪ · · · .

Special Relations
1. For a partial order ≼ ⊆ A × A on a finite set A, there is at least one minimal element and at least one maximal element under ≼.
2. Let A be any finite set with a partial order ≼. Then there is a total order ≼total (a topological ordering of A) on A that’s consistent with ≼.

9 Counting

In which our heroes encounter many choices, some of which may lead them to live more happily than others, and a precise count of their number of options is calculated.

9.1 Why You Might Care

How do I love thee? Let me count the ways.
Elizabeth Barrett Browning (1806–1861)

This chapter is devoted to the apparently trivial task of counting. By “counting,” we mean the following problem: given a potentially convoluted description of a set S, compute the cardinality of S, that is, compute the number of elements in S. It may seem bizarre that counting could somehow be harder than at the preschool level (just count! one, two, three), but it will turn out that we can solve surprisingly subtle problems with some useful and general (and subtle) techniques.

We’ll start in Section 9.2 by introducing basic counting techniques: how to compute the cardinality of a union A ∪ B of two sets, or sequences from the Cartesian product A × B of two sets. We then turn in Section 9.3 to one of the best counting strategies: being lazy! If we can show that |A| = |B| and we already know the value of |B|, then figuring out |A| is easy; we’ll often use functions to relate two sets so that we can then lazily compute the size of the apparently harder-to-count set. Finally, in Section 9.4, we will explore combinations (“how many ways are there to choose an unordered collection of k items out of a set of n possibilities?”) and permutations (“how many ways are there to put a set of n items into some order?”).

Why does counting matter in computer science? There are, again, surprisingly many applications. Here are a few examples. One common (though very basic) style of algorithm is a brute-force algorithm, which finds the best whatzit by trying every possible whatzit and seeing which one is best.
Determining whether a brute-force algorithm is fast enough depends on counting how many possible whatzits there are. A more advanced algorithmic design technique, called dynamic programming, can be used to design efficient recursive solutions to problems, as long as there aren’t too many distinct subproblems. Counting techniques are even powerful enough to establish a mind-bending result about computability: we will be able to prove that there are more problems than computer programs, which means that there are some problems that cannot be solved by any program!

Probability (see Chapter 10) has a plethora of applications in computer science, ranging from randomized algorithms in sorting (algorithms that process their input by making random decisions about how to act) to models of random noise in speech recognition or random errors in typing (if I’m trying to type the letter p, what is the chance that I accidentally type o instead?). We can think of the probability of some event X happening, roughly, as two counting problems: the numerator and denominator of the ratio

  (the number of ways X can happen) / (the number of ways X can either happen or not happen).

There are many other applications of counting scattered throughout computer science, and we will discuss a few more along the way: breaking cryptographic systems, compressing audio/image/video files, and changing the addressing scheme on the internet because we’ve run out of smaller addresses, to name a few.

9.2 Counting Unions and Sequences

If a man who cannot count finds a four-leaf clover, is he entitled to happiness?
Stanislaw J. Lec (1909–1966)

Suppose that we have two sets A and B from which we must choose an element. There are two different natural scenarios that meet this one-sentence description: we must choose a total of one element from either A or B, or we must choose one element from each of A and B. For example, consider a restaurant that offers soups A = {chicken noodle, beer cheese, minestrone, .
. .} and salads B = {caesar, house, arugula, . . .}. A lunch special that includes soup or salad involves choosing an x ∈ A ∪ B. A dinner special including soup and salad involves choosing an x ∈ A and also choosing a y ∈ B, that is, choosing an element ⟨x, y⟩ ∈ A × B.

In Section 9.2.1, we’ll start with two basic rules for computing these cardinalities:
• Sum Rule: If A and B are disjoint, then |A ∪ B| = |A| + |B|.
• Product Rule: The number of pairs ⟨x, y⟩ with x ∈ A and y ∈ B is |A × B| = |A| · |B|.
These rules will handle the simple restaurant scenarios above, but there are a pair of extensions that we’ll introduce to handle slightly more complex situations. The first (Section 9.2.2) extends the Sum Rule to allow us to calculate the cardinality of a union of two sets even if those sets may contain elements in common:
• Inclusion–Exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|.
The second extension (Section 9.2.3) generalizes the Product Rule to allow us to calculate the cardinality of a set of pairs ⟨x, y⟩ even if the choice of x changes the list (but not the number) of possible choices for y:
• Generalized Product Rule: Consider pairs ⟨x, y⟩ of the following form: we can choose any x ∈ A, and, for each such x, there are precisely n different choices for y. Then the total number of pairs meeting this description is |A| · n.
The remainder of this section will give the details of these four rules, and how to use these rules individually and in combination.

9.2.1 The Basics: The Sum and Product Rules

Sum Rule: counting unions

Our first rule addresses the union of two sets: if two sets A and B are disjoint, then the cardinality of their union is simply the sum of their sizes:

Theorem 9.1 (Sum Rule)
Let A and B be sets. If A ∩ B = ∅, then |A ∪ B| = |A| + |B|.
More generally, consider a collection of k ≥ 1 sets A1, A2, . . . , Ak. If these sets are all disjoint (that is, if Ai ∩ Aj = ∅ whenever i ≠ j), then the cardinality of their union is the sum of their cardinalities: |A1 ∪ A2 ∪ · · · ∪ Ak| = |A1| + |A2| + · · · + |Ak|.

The Sum Rule captures an intuitive fact: if a box contains some red things and some blue things, then the total number of things in the box is the number of red things plus the number of blue things. Here are a few examples that use this rule:

Example 9.1 (Counting disjoint unions)
• Let A := {1, 2} and B := {3, 4, 5, 6}. Thus |A| = 2 and |B| = 4. Observe that the sets A and B are disjoint. By the Sum Rule, |A ∪ B| = |A| + |B| = 2 + 4 = 6. Indeed, we have A ∪ B = {1, 2, 3, 4, 5, 6}, which contains 6 elements.
• There are 11 starters on your school’s women’s soccer team. Suppose there are 8 nonstarters on the team. The total number of people on the team is 19 = 11 + 8.
• At a certain school in the midwest, there are currently 30 computer science majors who are studying abroad. There are 89 computer science majors who are studying on campus. Then the total number of computer science majors is 119 = 89 + 30.
• Consider a computer lab that contains 32 Macs and 14 PCs and 1 PDP-8 (a 1960s-era machine, one of the first computers that was sold commercially). Then the total number of computers in the lab is 47 = 32 + 14 + 1.

Example 9.2 (Students in classes)
Problem: During this term, there are 19 students taking Data Structures, and 39 students taking Mathematics of Computer Science. Let S denote the set of students taking Data Structures or Mathematics of Computer Science this term. What is |S|?
Solution: There isn’t enough information to answer the question!
• If there are no students who are taking both classes (that is, if DS ∩ MOCS = ∅), then |S| = |DS| + |MOCS| = 19 + 39 = 58.
• But, for all we know from the problem statement, every student in Data Structures is also taking Mathematics of Computer Science. In this case, we have DS ⊂ MOCS and thus S = DS ∪ MOCS = MOCS; therefore |S| = |MOCS| = 39.
(The Inclusion–Exclusion Rule, in Section 9.2.2, formalizes the calculation of |A ∪ B| in terms of |A|, |B|, and |A ∩ B|, in the manner that we just considered.)

Taking it further: The logic that we used in Example 9.2 to conclude that there were at most 58 students in the two classes combined is an application of the general fact that |A ∪ B| ≤ |A| + |B|. While this fact is pretty simple, it turns out to be remarkably useful in proving facts about probability. The Union Bound states that the probability that any of A_1, A_2, ..., A_k occurs is at most p_1 + p_2 + ··· + p_k, where p_i denotes the probability that A_i occurs. The Union Bound turns out to be useful when each A_i is a “bad event” that we’re worried might happen, and these bad events may have complicated probabilistic dependencies. If we can show that the probability of each particular one of these bad events is at most some very small ε, then we can use the Union Bound to conclude that the probability of experiencing any bad event is at most k · ε. (See Exercise 10.141, for example.)

Using the Sum Rule in less obvious settings
As a general strategy for solving counting problems, we can try to find a way to apply the Sum Rule, even if it does not superficially seem to be applicable. If we can find a way to partition an apparently complicated set S into simple disjoint sets S_1, S_2, ..., S_k such that ⋃_{i=1}^{k} S_i = S, then we can use the Sum Rule to find |S|.

In this spirit, here’s a somewhat more complex example of using the Sum Rule, where we have to figure out the subsets ourselves: let’s determine how many 8-bit strings contain precisely two ones. (The full list of the bitstrings meeting this condition appears in Figure 9.1.)

Example 9.3 (8-bit strings with exactly 2 ones)
Problem: How many elements of {0, 1}^8 have precisely two 1s?
Solution: Obviously, we can just count the number of bitstrings in Figure 9.1, which yields the answer: there are 28 such bitstrings. But let’s use the Sum Rule instead. What does a bitstring x ∈ {0, 1}^8 with two ones look like?
There must be two indices i and j, say with i > j, such that x_i = x_j = 1, and all other components of x must be 0:

\[ x = \underbrace{00\cdots0}_{j-1\text{ zeros}} \; \overbrace{1}^{\text{one in position } j} \; \underbrace{00\cdots0}_{i-j-1\text{ zeros}} \; \overbrace{1}^{\text{one in position } i} \; \underbrace{00\cdots0}_{8-i\text{ zeros}}. \]

(For example, the bitstring 01001000 has ones in positions j = 2 and i = 5, interspersed with an initial block of j − 1 = 1 zero, a block of i − j − 1 = 2 between-the-ones zeros, and a block of 8 − i = 3 final zeros.)

We are going to divide the set of 8-bit strings with two 1s based on the index i. That is, suppose that x ∈ {0, 1}^8 contains two ones, and the second 1 in x appears in bit position #i. Then there are i − 1 positions in which the first one could appear: any of the slots j ∈ {1, 2, ..., i − 1} that come before i. (See Figure 9.1, where the (i − 1)st column contains all i − 1 bitstrings whose second 1 appears in position #i. For example, column #3 contains the 3 bitstrings with x_{4,5,6,7,8} = 10000: that is, 10010000, 01010000, and 00110000.) Because every x with exactly two ones has an index i of its second 1, we can use the Sum Rule to say that the answer to the given question is

\[ \sum_{i=1}^{8} \big[\text{number of bitstrings with the second 1 in position } i\big] = \sum_{i=1}^{8} (i-1) = 0 + 1 + \cdots + 7 = 28. \]

(We’ll also see another way to solve this example later, in Example 9.39.)

Figure 9.1: All bitstrings in {0, 1}^8 that contain exactly two ones.

    11000000  01100000  00110000  00011000  00001100  00000110  00000011
              10100000  01010000  00101000  00010100  00001010  00000101
                        10010000  01001000  00100100  00010010  00001001
                                  10001000  01000100  00100010  00010001
                                            10000100  01000010  00100001
                                                      10000010  01000001
                                                                10000001

Problem-solving tip: When you’re trying to find the cardinality of a complicated set S, try to find a way to split S into a collection of simpler disjoint sets, and then apply the Sum Rule.
Let’s also generalize this example to bitstrings of arbitrary length:
Example 9.4 (k-bit strings with exactly 2 ones)
Consider the set S := {x ∈ {0, 1}^k : x has precisely two 1s}. As in Example 9.3, every bitstring x ∈ S has an index i of its second 1; we’ll use the value of i to partition S into sets that can be easily counted, and then use the Sum Rule to find |S|. Specifically, for each index i with 1 ≤ i ≤ k, define the set

\[ S_i = \{x \in S : x_i = 1 \text{ and } x_{i+1} = x_{i+2} = \cdots = x_k = 0\} = \big\{x \in \{0,1\}^k : \big(\exists j \le i-1 : x_i = x_j = 1 \text{ and } x \text{ has no other 1s}\big)\big\}. \]

Observe that |S_i| = i − 1: there are i − 1 different possible values of j. Also, observe that S = ⋃_{i=1}^{k} S_i and that, for any i ≠ i′, the sets S_i and S_{i′} are disjoint. Thus

\[ |S| = \Big|\bigcup_{i=1}^{k} S_i\Big| = \sum_{i=1}^{k} |S_i| = \sum_{i=1}^{k} (i-1) = \frac{k(k-1)}{2} \qquad (\ast) \]

by the Sum Rule and the formula for the sum of the first n integers (Example 5.4).

As a check of our formula, let’s verify our solution for some small values of k:
• For k = 2, (∗) says there are 2(2−1)/2 = 1 strings with two 1s. Indeed, there’s just one: 11.
• For k = 3, indeed there are 3(3−1)/2 = 3 strings with two 1s: 011, 101, and 110.
• For k = 4, there are 4·3/2 = 6 such strings: 1100, 1010, 0110, 1001, 0101, and 0011.
Note that (∗) matches Example 9.3: for k = 8, we have 28 = 8·7/2 strings with two 1s.

Problem-solving tip: Check to make sure your formulas are reasonable by testing them for small inputs (as we did in Example 9.4).
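The partition used in Examples 9.3 and 9.4 can also be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function name two_ones_by_second_index is our own illustrative choice. It enumerates every k-bit string by brute force, groups the strings with exactly two 1s by the position i of the second 1, and compares the total against k(k − 1)/2.

```python
from itertools import product

def two_ones_by_second_index(k):
    """Group the k-bit strings with exactly two 1s by the 1-based position i of the second 1."""
    groups = {i: [] for i in range(1, k + 1)}
    for bits in product("01", repeat=k):
        if bits.count("1") == 2:
            # Position of the last 1, found by searching the reversed string.
            i = k - bits[::-1].index("1")
            groups[i].append("".join(bits))
    return groups

groups = two_ones_by_second_index(8)
sizes = [len(groups[i]) for i in range(1, 9)]
print(sizes)        # the sizes |S_i| = i - 1, for i = 1, ..., 8
print(sum(sizes))   # Sum Rule over the disjoint sets S_i: 28

# Compare the brute-force total with k(k-1)/2 for several small k.
for k in range(2, 11):
    total = sum(len(g) for g in two_ones_by_second_index(k).values())
    assert total == k * (k - 1) // 2
```

Brute force takes time proportional to 2^k, so this is only a sanity check for small k; the formula (∗) is what scales.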
Product Rule: counting sequences
Our second basic counting rule addresses the Cartesian product of sets. Recall that, for sets A and B, the Cartesian product A × B consists of all pairs ⟨a, b⟩ with a ∈ A and b ∈ B. (For example, {1, 2, 3} × {x, y} = {⟨1, x⟩, ⟨1, y⟩, ⟨2, x⟩, ⟨2, y⟩, ⟨3, x⟩, ⟨3, y⟩}.) The cardinality of A × B is the product of the cardinalities of A and B:

Theorem 9.2 (Product Rule)
Let A and B be sets. Then |A × B| = |A| · |B|.

More generally, consider a collection of k arbitrary sets A_1, A_2, ..., A_k, and consider the set of k-element sequences where, for each i, the ith component is an element of A_i. The number of such sequences is given by the product of the sets’ cardinalities:
|A_1 × A_2 × ··· × A_k| = |A_1| · |A_2| · ··· · |A_k|.
Here are a few examples of counting using the Product Rule:

Example 9.5 (Counting sequences)
• Let A := {1, 2} and B := {3, 4, 5, 6}. By the Product Rule, |A × B| = |A| · |B| = 2 · 4 = 8. Indeed, A × B = {⟨1, 3⟩, ⟨1, 4⟩, ⟨1, 5⟩, ⟨1, 6⟩, ⟨2, 3⟩, ⟨2, 4⟩, ⟨2, 5⟩, ⟨2, 6⟩}, which contains 8 elements.
• At a certain school in the midwest, there are currently 56 senior computer science majors and 63 junior computer science majors. Then the number of ways to choose a pair of class representatives, one senior and one junior, is 56 · 63 = 3528.
• Consider a tablet computer that is sold with three different options: a choice of protective cover, a choice of stylus, and a color. If there are 7 different styles of protective cover, 5 different styles of stylus, and 3 different colors, then there are 7 · 5 · 3 = 105 different configurations of the computer.

Like the Sum Rule, the Product Rule should be reasonably intuitive: if we are choosing a pair ⟨a, b⟩ from A × B, then we have |A| different choices of the first component a and, for each of those |A| choices, we have |B| choices for the second component b. (Thinking of A as A = {a_1, a_2, ..., a_{|A|}}, we can even view {⟨a, b⟩ : a ∈ A, b ∈ B} as
{⟨a_1, b⟩ : b ∈ B} ∪ {⟨a_2, b⟩ : b ∈ B} ∪ ··· ∪ {⟨a_{|A|}, b⟩ : b ∈ B}.
By the Sum Rule, this set has cardinality |B| + |B| + ··· + |B|, with one term for each element of A; in other words, it has cardinality |A| · |B|.) Here are a few more examples:
Example 9.6 (32-bit strings)
Problem: How many different 32-bit strings are there?
Solution: The set of 32-bit strings is {0, 1}^32, that is, elements of

\[ \underbrace{\{0,1\} \times \{0,1\} \times \{0,1\} \times \cdots \times \{0,1\}}_{32 \text{ times}}. \]

Because |{0, 1}| = 2, the Product Rule lets us conclude that |{0, 1}^32| is

\[ \underbrace{2 \cdot 2 \cdot 2 \cdots 2}_{32 \text{ times}} = 2^{32}. \]

(We can use the same type of analysis to show that there are 2^4 = 16 strings of 4 bits; for concreteness, they’re all listed in Figure 9.2.)

Figure 9.2: The set of all 4-bit strings.

    0000  0001  0010  0011
    0100  0101  0110  0111
    1000  1001  1010  1011
    1100  1101  1110  1111

Example 9.7 (Number of possible shortened URLs)
A URL-shortening service like bit.ly or snipurl.com allows a user to compress a long URL into a much shorter sequence of characters. (The shorter URL can then be used in emails or tweets or other contexts in which a long URL is unwieldy.) For example, by entering the URL of Alan Turing’s Wikipedia page into bit.ly, I got the URL http://bit.ly/1o6HPM as a shortened form of http://en.wikipedia.org/wiki/Alan_Turing.
If a shortened URL consists of 6 characters, each of which is a digit, lowercase letter, or uppercase letter, the number of possible shortened URLs is, using the Product Rule,
|C × C × C × C × C × C| = |C| · |C| · |C| · |C| · |C| · |C| = |C|^6,
where C = {0, . . . , 9} ∪ {a, . . . , z} ∪ {A, . . . , Z} is the set of possible characters. Because |C| = 10 + 26 + 26 = 62 via the Sum Rule, we know that there are 62^6 = 56,800,235,584 possible shortened 6-character URLs.
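The arithmetic in Example 9.7 is easy to reproduce. Here is a small Python sketch under our own naming (the helper count_sequences is hypothetical, introduced only for illustration): it builds the character set C from Python's standard string constants, uses the Sum Rule to get |C|, and then applies the Product Rule for sequences of a fixed length.

```python
import string

# The character set C from Example 9.7: digits, lowercase letters, uppercase letters.
C = set(string.digits) | set(string.ascii_lowercase) | set(string.ascii_uppercase)

def count_sequences(alphabet_size, length):
    """Product Rule for a fixed set: the number of sequences of the given length."""
    return alphabet_size ** length

print(len(C))                      # 10 + 26 + 26 = 62, since the three pieces are disjoint
print(count_sequences(len(C), 6))  # 62**6 = 56800235584 six-character codes
```

Because the three character classes are disjoint, the size of the set union equals the sum 10 + 26 + 26, which is exactly what the Sum Rule predicts.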
Taking it further: The point of a URL-shortening service is to translate long URLs into short ones, but it’s theoretically impossible for every URL to be shortened by this service: there are more possible URLs of length k than there are URLs of length strictly less than k. A similar issue arises with file compression algorithms, like ZIP, that try to reduce the space required to store a file. See the discussion on p. 938.
Product Rule: counting sequences from a fixed set
This use of the Product Rule (to count the number of sequences of length k with elements all drawn from a fixed set S, rather than having a different set of options for each component) is common enough that we’ll note it as a separate rule:

Theorem 9.3 (Product Rule: sequences of elements from a single set S)
For any set S and any k ∈ Z^{≥1}, the number of k-tuples from the set S^k = S × S × ··· × S (k times) is |S^k| = |S|^k.

A notational reminder regarding Theorem 9.3: S^k is the set S × S × ··· × S, that is, the set of k-tuples where each component is an element of S. On the other hand, |S|^k is the number |S| raised to the kth power.
Here’s another example using this special case of the Product Rule:
Example 9.8 (MAC addresses)
Problem: A media access control address, or MAC address, is a unique identifier for a network adapter, like an ethernet card or wireless card. A MAC address consists of a sequence of six groups of pairs of hexadecimal digits. (A hexadecimal digit is one of 0123456789ABCDEF.) For example, F7:DE:F1:B6:A4:38 is a MAC address. (The pairs of digits are traditionally separated by colons when written down.) How many different MAC addresses are there?
Solution: There are 16 different hexadecimal digits. Thus, using the Product Rule, there are 16 · 16 = 256 different pairs of hexadecimal digits, ranging from 00 to FF. Using the Product Rule again, as in Example 9.7, we see that there are 256^6 different sequences of six pairs of hexadecimal digits. Thus there are 256^6 = [16^2]^6 = [(2^4)^2]^6 = 2^48 = 281,474,976,710,656 total different MAC addresses.
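As a quick check of the chain of equalities in Example 9.8, the sketch below (our own illustration; the helper is_mac_address is hypothetical and only tests the colon-separated format described in the example) applies the Product Rule twice and confirms that the result equals 2^48.

```python
HEX_DIGITS = "0123456789ABCDEF"

pairs = len(HEX_DIGITS) ** 2   # Product Rule: 16 * 16 pairs, from 00 through FF
mac_addresses = pairs ** 6     # Product Rule again: six pairs per address

def is_mac_address(text):
    """Check for six colon-separated pairs of uppercase hexadecimal digits."""
    groups = text.split(":")
    return len(groups) == 6 and all(
        len(g) == 2 and all(ch in HEX_DIGITS for ch in g) for g in groups
    )

print(pairs)                           # 256
print(mac_addresses == 2 ** 48)        # True
print(is_mac_address("F7:DE:F1:B6:A4:38"))
```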
Taking it further: In addition to the numerical addresses assigned to particular hardware devices (the MAC addresses from Example 9.8), each device that’s connected to the internet is also assigned an address, akin to a mailing address, that’s used to identify the destination of a packet of information. But we’ve had to make a major change to the way that information is transmitted across the internet because of a counting problem: we’ve run out of addresses! See the discussion on p. 919.

9.2.2 Inclusion–Exclusion: Unions of Nondisjoint Sets
The counting techniques that we’ve introduced so far have some important restrictions. We can only use the Sum Rule to calculate |A ∪ B| when A and B are disjoint. And we are only able to use the Product Rule to calculate the number of sequences when the set of options for the second component does not depend on the choice that we made in the first component. In the remainder of this section, we will extend our techniques to remove these restrictions so that we can handle more general problems. Let’s start with a specific example of the cardinality of the union of nondisjoint sets:
Example 9.9 (Primes and odds)
Consider the set O = {1, 3, 5, 7, 9} of odd numbers less than 10 and the set P = {2, 3, 5, 7} of prime numbers less than 10. What is |O ∪ P|?
It might be tempting to use the Sum Rule to conclude that |O ∪ P| = |O| + |P| = 5 + 4 = 9. But this conclusion is incorrect, because P ∩ O = {3, 5, 7} ≠ ∅, so the Sum Rule doesn’t apply. In particular, O ∪ P = {1, 2, 3, 5, 7, 9}, so |O ∪ P| = 6.
The issue with the naïve application of the Sum Rule in Example 9.9 is called double counting: in the expression |O| + |P|, we counted the elements in the intersection O ∩ P twice, which gave us the incorrect total count. The idea underlying the Inclusion–Exclusion Rule is to correct for this error: to compute the size of the union of two sets A and B, we extend the Sum Rule to correct for the double counting by subtracting |A ∩ B| from the final result. (See Figure 9.3.) This counting rule is called inclusion–exclusion because we include (add) the cardinalities of the two individual sets, and then exclude (subtract) the cardinality of the intersection of the pairs:

Theorem 9.4 (Inclusion–Exclusion)
Let A and B be sets. Then |A ∪ B| = |A| + |B| − |A ∩ B|.

Figure 9.3: The Inclusion–Exclusion Rule.
(a) Two sets A and B; we seek |A ∪ B|.
(b) Calculating |A| + |B| counts elements in the dark-shaded region A ∩ B twice.
(c) We correct for the double-counted intersection by subtracting its cardinality.

Problem-solving tip: Sometimes the easiest way to solve a problem (in CS or in life!) is to find an imperfect approximation to the solution, and then correct for whatever inaccuracies result. Inclusion–Exclusion is a good example of this estimate-and-fix strategy.

Here are a few small examples:

Example 9.10 (Counting not necessarily disjoint unions)
• Let A := {1, 2, 3} and B := {3, 4, 5, 6}. Thus A ∩ B = {3}, and so |A| = 3 and |B| = 4 and |A ∩ B| = 1. By the Inclusion–Exclusion Rule, |A ∪ B| = |A| + |B| − |A ∩ B| = 3 + 4 − 1 = 6. Indeed, we have A ∪ B = {1, 2, 3, 4, 5, 6}, which contains 6 elements.
• At a certain school in the midwest, there are 119 computer science majors and 65 math majors. There are 7 students double majoring in CS and math. Thus a total of 119 + 65 − 7 = 177 different students are majoring in either of the two fields.
• There are 21 consonants (BCDFGHJKLMNPQRSTVWXYZ) in English. There are 6 vowels in English (AEIOUY). There is one letter that’s both a vowel and a consonant (Y). Thus there are 21 + 6 − 1 = 26 total letters.
• Let E be the set of even integers between 1 and 100. Let O be the set of odd integers between 1 and 100. Note that |E| = 50, |O| = 50, and |E ∩ O| = 0. Thus |E ∪ O| = 50 + 50 − 0 = 100.
Here’s an example that uses Inclusion–Exclusion to compute the cardinality of a slightly more complicated set:
Example 9.11 (ATM machine PIN numbers)
Problem: A certain bank’s customers can select a 4-digit number (called a PIN) to access their accounts, but the bank insists that the PIN may not start with the same digit repeated three times (for example, 7770) or end with the same digit repeated three times (for example, 0111). How many invalid PINs are there?
Solution: Let S denote the set of PINs that start with three repeated digits. Let E denote the set of PINs that end with three repeated digits. Then the set of invalid PINs is S ∪ E.
• Note that |S| = 100: we can view a PIN in S as a sequence of two digits ⟨x, y⟩ ∈ {0, 1, . . . , 9}^2, with x repeated three times in the PIN. (So ⟨3, 1⟩ corresponds to the PIN 3331.) By the Product Rule, there are 10^2 = 100 such codes.
• Similarly, |E| = 100: we can think of an element of E as a sequence of two digits ⟨x, y⟩ ∈ {0, 1, . . . , 9}^2, where y is repeated three times in the PIN.
If S ∩ E were empty, then we could apply the Sum Rule to compute |S ∪ E|. But there are PINs that are in both S and E:
• A 4-digit number ⟨x, y, z, w⟩ is in S ∩ E if and only if x = y = z (because ⟨x, y, z, w⟩ ∈ S) and y = z = w (because ⟨x, y, z, w⟩ ∈ E). That is, any 4-digit number that consists of the same digit repeated four times is in S ∩ E. Thus
S ∩ E = {0000, 1111, 2222, 3333, 4444, 5555, 6666, 7777, 8888, 9999}, and |S ∩ E| = 10.
(See Figure 9.4 for S, E, and S ∩ E.) Applying the Inclusion–Exclusion Rule, we see that the set S ∪ E of invalid PINs has cardinality |S| + |E| − |S ∩ E| = 100 + 100 − 10 = 190. (So 10,000 − 190 = 9810 PINs are valid.)
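With only 10,000 possible PINs, the count in Example 9.11 can be confirmed by direct enumeration. The following Python sketch is our own illustration (not the book's code): it builds S and E as explicit sets and compares the size of their union with the Inclusion–Exclusion calculation.

```python
# All 4-digit PINs, written as strings with leading zeros.
pins = [f"{n:04d}" for n in range(10_000)]

S = {p for p in pins if p[0] == p[1] == p[2]}   # PINs that start with a tripled digit
E = {p for p in pins if p[1] == p[2] == p[3]}   # PINs that end with a tripled digit

invalid = S | E
print(len(S), len(E), len(S & E))               # 100 100 10
print(len(invalid))                             # 190, by direct enumeration
print(len(S) + len(E) - len(S & E))             # 190, by Inclusion-Exclusion
print(len(pins) - len(invalid))                 # 9810 valid PINs
```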
The basic Sum Rule is actually a special case of the Inclusion–Exclusion Rule: if A and B are disjoint, then A ∩ B = ∅ and thus |A ∩ B| = 0, so |A ∪ B| = |A| + |B| − |A ∩ B| = |A| + |B| − 0 = |A| + |B|.
Figure 9.4: Invalid PINs, starting or ending with the same digit repeated three times. [Venn diagram with two overlapping regions. First three positions match only: 0001, 0002, ..., 9997, 9998. Both: 0000, 1111, ..., 9999. Last three positions match only: 0111, 0222, ..., 9777, 9888.]

Inclusion–Exclusion for three sets
Theorem 9.4 describes how to calculate the cardinality of the union of two sets,
but this idea can be generalized. The basic idea is simple: we will try counting in the easiest way possible, and then we’ll correct for any overcounting or undercounting.
For example, we can compute the cardinality of the union of three sets A ∪ B ∪ C using a more complicated version of Inclusion–Exclusion:
• We add (include) the three singleton sets (|A| + |B| + |C|), but this sum counts any element contained in more than one of the three sets more than once.
• So we subtract (exclude) the three pairwise intersections (|A ∩ B| + |A ∩ C| + |B ∩ C|) from the sum. But we’re not done: imagine an element contained in all three of A, B, and C; such an element was included three times and then excluded three times, so it hasn’t been counted at all.
• So we add (include) the three-way intersection |A ∩ B ∩ C|.
This calculation yields the following three-set rule for inclusion–exclusion. (Or see Figure 9.5 for a visual illustration of why this calculation is correct.)

Theorem 9.5 (Inclusion–Exclusion for three sets)
Let A, B, and C be sets. Then |A ∪ B ∪ C| is given by
|A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|.

Figure 9.5: The Inclusion–Exclusion Rule for three sets A, B, and C. See Theorem 9.5.
(a) If we start to compute |A ∪ B ∪ C| as |A| + |B| + |C|, we correctly count the light-shaded regions, but we count elements in the medium-shaded regions twice, and elements in the dark-shaded region three times.
(b) Subtracting the sum of the sizes of the pairwise intersections |A ∩ B| + |B ∩ C| + |A ∩ C| almost corrects for the double counting from (a), but it also triple counts the elements of A ∩ B ∩ C.
(c) The result of (a) minus (b) hasn’t counted the elements of A ∩ B ∩ C at all, so we can achieve the final count by adding |A ∩ B ∩ C|.

Here are a couple of small examples of the three-set version of inclusion–exclusion:

Example 9.12 (Counting three-set unions)
• Let A := {0, 1, 2, 3, 4} and B := {0, 2, 4, 6} and C := {0, 3, 6}. Then

\[ |A \cup B \cup C| = \underbrace{5}_{|A|} + \underbrace{4}_{|B|} + \underbrace{3}_{|C|} - \underbrace{3}_{|A \cap B| = |\{0,2,4\}|} - \underbrace{2}_{|A \cap C| = |\{0,3\}|} - \underbrace{2}_{|B \cap C| = |\{0,6\}|} + \underbrace{1}_{|A \cap B \cap C| = |\{0\}|} = 12 - 7 + 1 = 6, \]

by Inclusion–Exclusion. Indeed, A ∪ B ∪ C = {0, 1, 2, 3, 4, 6}. (See Figure 9.6.)

Figure 9.6: Some small sets. [Venn diagram of A, B, and C: 1 lies only in A; 2 and 4 lie in A ∩ B only; 3 lies in A ∩ C only; 6 lies in B ∩ C only; 0 lies in all three sets.]

• Consider the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, and EIGHT. Let E be the set of these words containing at least one E, let T be the words containing a T, and let R be the words containing an R. Then
E = {ONE, THREE, FIVE, SEVEN, EIGHT}
T = {TWO, THREE, EIGHT}
R = {THREE, FOUR}
E ∩ T = {THREE, EIGHT}
E ∩ R = {THREE}
T ∩ R = {THREE}
E ∩ T ∩ R = {THREE}

\[ |E \cup T \cup R| = \underbrace{5}_{|E|} + \underbrace{3}_{|T|} + \underbrace{2}_{|R|} - \underbrace{2}_{|E \cap T|} - \underbrace{1}_{|E \cap R|} - \underbrace{1}_{|T \cap R|} + \underbrace{1}_{|E \cap T \cap R|} = 7, \]

and, indeed, seven of the eight words are in E ∪ T ∪ R (the only one missing is SIX).
We’ll close with a slightly bigger example, about integers divisible by 2, 3, or 5:
Example 9.13 (Divisibility)
Problem: How many integers between 1 and 1000, inclusive, are evenly divisible by any of 2, 3, or 5?
Solution: Define the following sets:
A = {n ∈ {1, . . . , 1000} : 2 | n}
B = {n ∈ {1, . . . , 1000} : 3 | n}
C = {n ∈ {1, . . . , 1000} : 5 | n}.
We must compute |A ∪ B ∪ C|.
• It’s fairly easy to see that |A| = 500, |B| = 333, and |C| = 200, because A = {2n : 1 ≤ n ≤ 500}, B = {3n : 1 ≤ n ≤ 333}, and C = {5n : 1 ≤ n ≤ 200}.
• Observe that A ∩ B is the set of integers between 1 and 1000 that are divisible by both 2 and 3, that is, the set of integers divisible by 6. By the same logic that we used to compute |A|, |B|, and |C|, we see
  – |A ∩ B| = |{6n : 1 ≤ n ≤ 166}| = 166,
  – |A ∩ C| = |{10n : 1 ≤ n ≤ 100}| = 100, and
  – |B ∩ C| = |{15n : 1 ≤ n ≤ 66}| = 66.
• And, using the same approach, we can conclude that A ∩ B ∩ C = {n : 30 | n} = {30n : 1 ≤ n ≤ 33}, so |A ∩ B ∩ C| = 33.
Therefore, using the Inclusion–Exclusion Rule, |A ∪ B ∪ C| is

\[ \underbrace{500}_{|A|} + \underbrace{333}_{|B|} + \underbrace{200}_{|C|} - \underbrace{166}_{|A \cap B|} - \underbrace{100}_{|A \cap C|} - \underbrace{66}_{|B \cap C|} + \underbrace{33}_{|A \cap B \cap C|} = 734. \]

Problem-solving tip: To verify a calculation like this one, it’s a good idea (and very easy!) to write a short program.
We can further generalize the inclusion–exclusion principle to calculate the cardinality of the union of an arbitrary number of sets. (See Exercises 9.30 and 9.181.)
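In the spirit of the problem-solving tip, here is one possible short program for checking Example 9.13. It is a sketch under our own assumptions: the function union_size is a hypothetical helper, and it implements the standard alternating-sum form of inclusion–exclusion for any number of sets (add the sizes of intersections of an odd number of sets, subtract those of an even number), which is the generalization the text defers to the exercises.

```python
from itertools import combinations

def union_size(sets):
    """Size of the union, via inclusion-exclusion over every nonempty subcollection."""
    total = 0
    for r in range(1, len(sets) + 1):
        sign = 1 if r % 2 == 1 else -1   # include odd-sized groups, exclude even-sized ones
        for group in combinations(sets, r):
            total += sign * len(set.intersection(*group))
    return total

universe = range(1, 1001)
A = {n for n in universe if n % 2 == 0}
B = {n for n in universe if n % 3 == 0}
C = {n for n in universe if n % 5 == 0}

print(union_size([A, B, C]))   # 734, by inclusion-exclusion
print(len(A | B | C))          # 734, by building the union directly
```

The helper examines every nonempty subcollection, so its running time grows exponentially in the number of sets; it is meant as a check on hand calculations rather than as an efficient way to compute a union.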

9.2.3 The Generalized Product Rule
The Product Rule (Theorem 9.2) tells us how to compute the number of 2-element sequences where the first element is drawn from the set A and the second from the set B; specifically, it says that |A × B| is |A| · |B|. But there are many types of sequences that do not precisely fit this setting: the Product Rule only describes the set of sequences where each component is selected from a fixed set of options. If the set of options for choice #2 depends on choice #1, then we cannot directly apply the Product Rule. However, the basic principle of the Product Rule still applies if the number of different choices for the second component is the same regardless of the choice of the first component, even if the particular set of choices can differ:

Theorem 9.6 (Generalized Product Rule)
Let S denote a set of sequences, each of length k, where for each index i ∈ {1, ..., k} the following condition holds: for each choice of the first i − 1 components of the sequence, there are exactly n_i choices for the ith component. Then |S| = ∏_{i=1}^{k} n_i.

Here are a few examples using the Generalized Product Rule:

Example 9.14 (Gold, silver, and bronze)
Problem: A set S of eight sprinters qualifies for the finals of the 100-meter dash in the Olympics. One will win the gold medal, another the silver, and a third the bronze. How many different trios of medalists are possible?
Solution: It “feels” like we can solve this problem using the Product Rule, by choosing a sequence of three elements from S, where we forbid duplication in our choices. But our choice of gold, silver, and bronze medalists would be from
S × (S − {the gold medalist}) × (S − {the gold and silver medalists})
and the Product Rule doesn’t permit the set of choices for the second component to depend on the first choice, or the options for the third choice to depend on the first two choices.
Instead, observe that there are 8 choices for the gold medalist. For each of those choices, there are 7 choices for the silver medalist. For each of these pairs of gold and silver medalists, there are 6 choices for the bronze medalist. Thus, by the Generalized Product Rule, the total number of trios of medalists is 8 · 7 · 6 = 336.

Example 9.15 (Opening moves in a chess game)
In White’s very first move in a chess game, there are n_1 = 10 pieces that can move: any of White’s 8 pawns or 2 knights. Each of these pieces has n_2 = 2 legal moves: the pawns can move forward either 1 or 2 squares, and each knight can jump two squares forward and one square to either the left or the right. (See Figure 9.7.) Thus there are n_1 · n_2 = 10 · 2 = 20 legal first moves.

Figure 9.7: The valid first moves in a chess game. [Two chessboard diagrams of the starting position, with ranks 1 through 8 and files a through h.]

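The count in Example 9.14 can be double-checked by listing every possible podium. The sketch below is our own illustration; the single-letter sprinter labels are hypothetical stand-ins for the eight finalists. It uses itertools.permutations to generate each ordered trio of distinct sprinters, and compares the number of trios with the product n_1 · n_2 · n_3 from the Generalized Product Rule.

```python
from itertools import permutations
from math import prod

sprinters = list("ABCDEFGH")   # hypothetical labels for the eight finalists

# Every ordered trio (gold, silver, bronze) in which no sprinter appears twice.
podiums = list(permutations(sprinters, 3))

print(len(podiums))       # 336, by enumeration
print(prod([8, 7, 6]))    # 336, by the Generalized Product Rule
```

Note that the set of available silver medalists differs from one gold medalist to the next, but its size is always 7; that constant count is all the Generalized Product Rule requires.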
Example 9.16 (Students in classes)
At a certain school in the midwest, each of 2023 students enrolls in exactly 3 classes per term. The set
Enrollments := {⟨s, c⟩ : s is a student enrolled in class c during the current term}
has cardinality 2023 · 3 = 6069, by the Generalized Product Rule: for each of the
n1 = 2023 choices of student, there are n2 = 3 choices of classes. (Note that the original Product Rule does not apply, because the set Enrollments is not a Cartesian product: in general, two students are not enrolled in the same classes—just the same number of classes.)
Although we didn’t say we were doing so, we actually used the underlying idea of the Generalized Product Rule in Example 9.11. Let’s make its use explicit here:
Example 9.17 (4-digit PINs starting with a triplicated digit)
Let S ⊆ {0, 1, . . . , 9}^4 denote the set of 4-digit PINs that start with three repeated digits. We claim that |S| = 100, as follows:
• There are n_1 = 10 choices for the first digit.
• There is only n_2 = 1 choice for the second digit: it must match the first digit.
• There’s also only n_3 = 1 choice for the third digit: it must match the first two.
• There are n_4 = 10 choices for the fourth digit.
Thus there are n_1 · n_2 · n_3 · n_4 = 10 · 1 · 1 · 10 = 100 elements of S.
Permutations
The Generalized Product Rule sheds some light on a concept that arises in a wide
range of contexts: a permutation of a set S, which is any ordering of the elements of S.
As a first example, let’s list all the permutations of the set {1, 2, . . . , n} for a few small values of n:
• for n = 1, there’s just one ordering: ⟨1⟩.
• for n = 2, there are two orderings: ⟨1, 2⟩ and ⟨2, 1⟩.
• for n = 3, there are six: ⟨1, 2, 3⟩, ⟨1, 3, 2⟩, ⟨2, 1, 3⟩, ⟨2, 3, 1⟩, ⟨3, 1, 2⟩, and ⟨3, 2, 1⟩.
• for n = 4, there are twenty-four: six with 1 as the first element (which can then be
followed by any of the six permutations of ⟨2, 3, 4⟩), six with 2 as the first element, six with 3 first, and six with 4 first, yielding a total of 4 · 6 = 24 orderings.
Definition 9.1 (Permutation)
A permutation of a set S is a sequence of elements from S that is of length |S| and contains no repetitions. In other words, a permutation of S is an ordering of the elements of S.
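The small cases listed above are easy to regenerate by machine. The following Python sketch (our own illustration; all_permutations is a hypothetical helper name) lists every ordering of {1, 2, ..., n} for n from 1 to 4 and prints the number of orderings next to n!, the value that Theorem 9.7 establishes for the general case.

```python
from itertools import permutations
from math import factorial

def all_permutations(n):
    """Every ordering of {1, 2, ..., n}, as a list of tuples."""
    return list(permutations(range(1, n + 1)))

for n in range(1, 5):
    perms = all_permutations(n)
    print(n, len(perms), factorial(n))   # the last two columns agree: 1, 2, 6, 24

print(all_permutations(3))               # the six orderings of {1, 2, 3}
```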

How many permutations of an n-element set are there? There are several ways to see the general pattern, including recursively, but it may be easiest to use the Generalized Product Rule to count the number of permutations:

Theorem 9.7 (Number of permutations)
Let S be any set, and write n := |S|. The number of different permutations of S is n!.

Proof. There are n choices for the first element of a permutation of S. For the second element, there are n − 1 choices (all but the element chosen first). There are n − 2 choices for the third slot (all but the elements chosen first and second). In general, for the ith element, there are n − i + 1 choices. Thus the number of permutations of S is

\[ \prod_{i=1}^{n} (n - i + 1) = \prod_{j=1}^{n} j = n! \]

by the Generalized Product Rule.

Here’s a small example for a concrete set S:

Example 9.18 (10-digit numbers)
Problem: What fraction of integers between 0 and 9,999,999,999 (all written as 10-digit numbers, including any leading zeros) have no repeated digits?
Solution: We seek a 10-digit sequence with no repetitions, that is, a permutation of {0, 1, . . . , 9}. There are 10! = 3,628,800 such permutations, by Theorem 9.7. There are a total of 10^10 integers between 0 and 9,999,999,999, by the Product Rule. Thus the fraction of these integers with no repeated digits is 10!/10^10 ≈ 0.00036, about one out of every 2750 integers in this range.

Taking it further: A permutation of a set S is an ordering of that set S, so thinking about permutations is closely related to thinking about sorting algorithms that put an out-of-order array into a specified order. By using the counting techniques of this section, we can prove that algorithms must take a certain amount of time to sort; see the discussion on p. 920.
We will also return to permutations frequently later in the chapter. For example, in Section 9.4, we will address counting questions like the following: how many different 13-card hands can be drawn from a standard 52-card deck of playing cards? (Here’s one way to think about it: we can lay out the 52 cards in any order, that is, in any permutation of the cards, and then pick the first 13 of them as a hand. We’ll have to correct for the fact that any ordering of the first 13 cards, and, for that matter, any ordering of the last 39, will count as the same hand. But permutations will also help us to think about this correction!)

9.2.4 Combining Products and Sums
Suppose that we select a pair ⟨a, b⟩ from a set of possible choices. The Product Rule tells us how many ways to make these choices if the particular choice of a does not affect the set of options from which b is chosen. The Generalized Product Rule tells us how many ways to make these choices if the particular choice of a does not affect the size of the set of options from which b is chosen. But if the number of options for the choice of b differs based on the choice of a, even the Generalized Product Rule does not apply. In this case, we can use a combination of the Sum Rule and the Generalized Product Rule to calculate the number of results. We’ll close this section with a few examples of these somewhat more complex counting questions.
Example 9.19 (Ordering coffee)
A certain coffeeshop sells the following espresso-based drinks: americano∗, cappuccino, espresso∗, latte, macchiato, mocha.
The drinks marked with an asterisk do not contain milk; the others do. All drinks can be made with either decaf or regular espresso. All milk-containing drinks can be made with any of {soy, skim, 2%, whole} milk. How many different drinks are sold by this coffeeshop?
We can think of a chosen drink as a sequence of the form
⟨drink type, milk type (or “none”), espresso type⟩.
There are 4 · 4 · 2 = 32 choices of milk-based drinks (4 drink types, 4 milk types, and 2 espresso types). There are 2 · 1 · 2 = 4 choices of non-milk-based drinks (2 drink types, 1 “milk” type [“none”], and 2 espresso types). Thus the total number of different drinks sold by this coffeeshop is 32 + 4 = 36.
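To make the split in Example 9.19 concrete, here is a short Python sketch of our own (the constant names are illustrative, not from the text). It builds the two disjoint groups of drinks as explicit lists of ⟨drink type, milk type, espresso type⟩ triples, counting within each group by the Product Rule and across the groups by the Sum Rule.

```python
from itertools import product

MILK_DRINKS = ["cappuccino", "latte", "macchiato", "mocha"]
NO_MILK_DRINKS = ["americano", "espresso"]
MILKS = ["soy", "skim", "2%", "whole"]
ESPRESSOS = ["decaf", "regular"]

# Product Rule within each group; Sum Rule across the two disjoint groups.
with_milk = list(product(MILK_DRINKS, MILKS, ESPRESSOS))
without_milk = list(product(NO_MILK_DRINKS, ["none"], ESPRESSOS))
menu = with_milk + without_milk

print(len(with_milk), len(without_milk), len(menu))   # 32 4 36
```

The two groups cannot overlap because their drink types differ, which is what justifies adding the two counts.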
Example 9.20 (Text numbers)
Problem: In the United States, a text message can be sent either to a regular 10-digit phone number, or to a so-called short code, which is a 5- or 6-digit number. Neither a phone number nor a short code can start with a 0 or a 1. How many different textable numbers are there in the United States?
Solution: Let D = {2, 3, …, 9}. Note |D| = 8. The set of valid textable numbers is:
\[
\underbrace{D \times (D \cup \{0,1\})^{9}}_{\text{phone numbers}}
\;\cup\;
\underbrace{D \times (D \cup \{0,1\})^{4}}_{\text{5-digit short codes}}
\;\cup\;
\underbrace{D \times (D \cup \{0,1\})^{5}}_{\text{6-digit short codes}}.
\]
The Product Rule tells us that |D × (D ∪ {0, 1})^i| = |D| · |D ∪ {0, 1}|^i = 8 · 10^i for any i. (To be totally pedantic: we’re using the Sum Rule to conclude that |D ∪ {0, 1}| = |D| + |{0, 1}| = 10, because D and {0, 1} are disjoint.) Therefore:
\[
\begin{aligned}
&\left| D \times (D \cup \{0,1\})^{9} \,\cup\, D \times (D \cup \{0,1\})^{4} \,\cup\, D \times (D \cup \{0,1\})^{5} \right| \\
&\quad= \left| D \times (D \cup \{0,1\})^{9} \right| + \left| D \times (D \cup \{0,1\})^{4} \right| + \left| D \times (D \cup \{0,1\})^{5} \right|
 && \text{Sum Rule: the three types of numbers are disjoint because they have different lengths} \\
&\quad= 8 \cdot 10^{9} + 8 \cdot 10^{4} + 8 \cdot 10^{5}
 && \text{Product Rule, as described in the previous paragraph} \\
&\quad= 8{,}000{,}880{,}000.
\end{aligned}
\]
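The arithmetic in Example 9.20 can be checked with a short script. This is a sketch under the example's assumptions (first digit from {2, …, 9}, remaining digits unrestricted); the function names are hypothetical, and the brute-force check covers only the 5-digit short codes to keep the enumeration small.

```python
def count_numbers(length):
    """Product Rule: 8 choices for the first digit, 10 for each of the others."""
    return 8 * 10 ** (length - 1)

def brute_force_count(length):
    """Enumerate all digit strings of the given length and filter by first digit."""
    return sum(
        1
        for k in range(10 ** length)
        if str(k).zfill(length)[0] not in "01"
    )

# Sum Rule: the three kinds of numbers have different lengths, so they are disjoint.
total = count_numbers(10) + count_numbers(5) + count_numbers(6)
print(total)
```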
Problem-solving tip: When you’re confronted with a counting problem that appears complicated, try to find a nice way of splitting the problem into several disjoint options. Often a difficult counting problem is actually the sum of two simple counting problems.

Combining sums and products: prefix-free codes
We’ll end the section with two somewhat more complicated counting problems, where we’re asked to calculate the number of objects meeting some particular condition: sets of bitstrings such that no string is a prefix of another, and results of a best-of-five series of games. In both cases, we can give a solution based entirely on a brute-force approach by simply enumerating all possible sequences, eliminating any that don’t meet the stated condition, and counting the uneliminated sequences one by one. But there are also ways to break down the set of objects of interest into subsets that we can count using the Sum and (Generalized) Product Rules.
A prefix-free code is a set C of bitstrings with the property that no x ∈ C is a prefix of any other y ∈ C. (For example, if 010 ∈ C, then we must have 0101 ∉ C, because 010 is a prefix of 0101.)
Let’s compute the number of prefix-free codes where all of the codewords are only 1 or 2 bits long:
Example 9.21 (Prefix-free codes)
One simple way to find the number of prefix-free codes C ⊆ {0, 1}^1 ∪ {0, 1}^2 is to write down all subsets of S := {0, 1}^1 ∪ {0, 1}^2, and then check each subset to eliminate any set that violates the prefix rule. (See Figure 9.8, which was generated by a computer program; there are 25 codes in the table that pass the prefix test.) There are 2^|S| = 2^6 = 64 subsets of S: we can describe each subset of S as an element of {yes, no}^|S|, where the ith component tells us whether the ith element of S is in the set. The Product Rule tells us that |{yes, no}^|S|| = 2^6 = 64. (See Lemma 9.10.)
Here’s a different approach, involving more thinking and less brute-force calculation. Let’s partition the set of valid codes into four classes based on whether 0 ∈ C and 1 ∈ C:
• If 0 ∉ C and 1 ∉ C, then any subset of {00, 01, 10, 11} can be in C.
• If 0 ∉ C and 1 ∈ C, then any subset of {00, 01} can also be in C.
• If 0 ∈ C and 1 ∉ C, then any subset of {10, 11} can also be in C.
• If 0 ∈ C and 1 ∈ C, then no 2-bit strings can be included.
By the Product Rule, there are, respectively, 2^4 and 2^2 and 2^2 and 2^0 choices corresponding to these classes. (The four classes correspond to the four columns of Figure 9.8.) By the Sum Rule, the total number of prefix-free codes using 1- and 2-bit strings is 16 + 4 + 4 + 1 = 25.
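The brute-force approach described in Example 9.21 is easy to carry out by machine. The sketch below is one possible way to do it (it is not the program that generated Figure 9.8, and the function names are illustrative): it tests all 64 subsets of S for the prefix-free property.

```python
from itertools import combinations

# The set S = {0,1}^1 ∪ {0,1}^2 from Example 9.21.
STRINGS = ["0", "1", "00", "01", "10", "11"]

def is_prefix_free(code):
    """True if no string in the code is a prefix of a different string in it."""
    return not any(x != y and y.startswith(x) for x in code for y in code)

def count_prefix_free(strings):
    """Count the prefix-free subsets, including the empty set."""
    return sum(
        1
        for size in range(len(strings) + 1)
        for code in combinations(strings, size)
        if is_prefix_free(code)
    )

print(count_prefix_free(STRINGS))
```

The result should agree with the 16 + 4 + 4 + 1 breakdown obtained from the Sum and Product Rules.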
Figure 9.8: All 64 subsets of {0, 1, 00, 01, 10, 11}, with an indication of whether each subset is prefix-free or not. In each row (a subset), if the set is not prefix-free, then one violation found in the set is listed. [Table with columns 0, 1, 00, 01, 10, 11, ok? not reproduced.]
Computer Science Connections
Running out of IP addresses, and IPv6
A crucial component of the internet is the assignment of an address to every machine connected to the network. This address is called an IP address, where “IP” stands for Internet Protocol—the algorithm by which packets of information are handled while they’re being transmitted across the internet. Each packet of information to be transmitted stores a variety of pieces of information, including (1) some basic header information; (2) a source address (the sender of the information); (3) a destination address (the intended recipient of the information); and (4) the data to be transmitted (the “payload”).
The subfield of computer science called computer networking is devoted to everything about how the internet (or some smaller network) works: design of the network, physical systems, protocols for routing, and more.1 Here we are going to concentrate on the IP address itself, and a particular issue related to how many (or how few!) addresses there are.
1 For more, see a good textbook on computer networks, like James F. Kurose and Keith W. Ross. Computer Networking: A Top-Down Approach. Addison–Wesley, 6th edition, 2013.
Each device on the internet that can send or receive information needs an address by which to do so. For almost the entire history of the internet, an IP address has simply been a 32-bit string. These IP addresses are typically represented as an element of {0, …, 255}^4 instead of as an element of {0, 1}^32, by converting 8 bits at a time into base-10 numbers, and then writing each 8-bit chunk separated by periods. For example, the site cs.carleton.edu is associated with the IP address 137.22.4.23, that is,
\[
\underbrace{10001001}_{137} \,.\, \underbrace{00010110}_{22} \,.\, \underbrace{00000100}_{4} \,.\, \underbrace{00010111}_{23}.
\]
You can find the IP address of your favorite site using a tool called nslookup on most machines, which checks a so-called name server to translate a site’s name (like whitehouse.gov) into an IP address (like 173.223.132.110).
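The conversion between the dotted base-10 form and the underlying 32-bit string can be sketched in a few lines. The function names below are illustrative, and the sketch only handles well-formed IPv4 addresses.

```python
def dotted_to_bits(address):
    """Convert a dotted IPv4 address like '137.22.4.23' into a 32-bit string."""
    chunks = [int(part) for part in address.split(".")]
    if len(chunks) != 4 or any(not 0 <= c <= 255 for c in chunks):
        raise ValueError("expected four numbers in the range 0..255")
    # Each chunk becomes exactly 8 bits, padded with leading zeros.
    return "".join(format(c, "08b") for c in chunks)

def bits_to_dotted(bits):
    """Convert a 32-bit string back into dotted form, 8 bits at a time."""
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    return ".".join(str(int(bits[i:i + 8], 2)) for i in range(0, 32, 8))

print(dotted_to_bits("137.22.4.23"))
```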
As an easy counting problem, we can check that there are only 2^32 = 4,294,967,296 different possible 32-bit IP addresses: about 4.3 billion addresses. Every machine connected to the internet needs to be addressable to receive data, so that means that we can only support about 4.3 billion connected devices. In the 1990s and 2000s, more and more people began to have machines connected to the internet, and each person also began to have more and more devices that they wanted to connect. It became clear that we were facing a dire shortage of IP addresses! As such, a new version of the Internet Protocol (version six, hence called IPv6) has been introduced.
In IPv6, instead of using 32-bit addresses, we now use 128-bit addresses. There are some tricky elements to the transition from 32-bit to 128-bit addresses (your computer better keep working!), but there are now 2^128 different addresses available. That’s 340,282,366,920,938,463,463,374,607,431,768,211,456 ≈ 3.4 × 10^38, which should hold us for a few millennia. For example, whitehouse.gov is associated with a 32-bit address 173.223.132.110, and a 128-bit address 2600:1408:0010:019a:0fc4, represented by 5 blocks of 4 hexadecimal numbers; that is, as an element of
\[
\left( \{0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f\}^{4} \right)^{5}.
\]
There are some strategies from computer networking for conserving addresses by “translation,” so that several computers c1, c2, … can be connected via an access point p, where p is the only machine that has a public, visible IP address. All of those computers’ traffic is handled by p, but p must be able to reroute the traffic it receives to the correct one of the ci computers. For more information, see the Kurose–Ross textbook cited previously.
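The two address-space sizes quoted in this sidebar are simple powers of two, so they can be reproduced directly. The variable names below are illustrative.

```python
# Number of distinct bitstrings of each address length (Product Rule).
ipv4_count = 2 ** 32
ipv6_count = 2 ** 128

print(f"{ipv4_count:,}")     # exact count of 32-bit addresses
print(f"{ipv6_count:,}")     # exact count of 128-bit addresses
print(f"{ipv6_count:.1e}")   # the same count in scientific notation
```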

Computer Science Connections
A Lower Bound for Comparison-Based Sorting
Most people who encounter the sorting problem (given an array A[1 . . . n], rearrange A so that it’s in ascending order) initially devise a quadratic-time algorithm. (For simplicity, suppose that we’re sorting distinct elements.) The most common examples of Θ(n^2)-time algorithms are Selection Sort, Insertion Sort, and Bubble Sort. Then, after a lot of thought (and, usually, some help), those people often are able to devise an O(n log n)-time sorting algorithm, like Merge Sort, Quick Sort, or Heap Sort. (See Section 6.3.)
But suppose that you were extra impatient with the speed of your sorting algorithm, and you were extra, extra clever. Could you do asymptotically better than O(n log n) in the worst case? The answer, we’ll show, is no—with a footnote: any “comparison-based” sorting algorithm requires Ω(n log n) time. (The footnote is that it depends on what we mean by “sort,” as we’ll see.)
A Warm-up: Selection Sort
First, recall Selection Sort, shown in Figure 9.10. One way to analyze its running time is as we did in Example 6.7: there are n iterations, and in the (n − i)th iteration we require i steps. In other words, the running time of Selection Sort is ∑_{i=1}^{n} i. We could repeat the straightforward inductive proof that ∑_{i=1}^{n} i = n(n + 1)/2, but instead Figure 9.11 gives a more visual way of seeing this result. Figure 9.11(a) shows a shaded triangle that represents the running time of Selection Sort: ∑_{i=1}^{n} i, where row i of the triangle has i steps in it. Figure 9.11(b) shows that this triangle is contained within an n-by-n square and also contains an (n/2)-by-(n/2) square. Thus the area of the triangle is upper bounded by n · n = n^2 and lower bounded by (n/2) · (n/2) = n^2/4, and therefore is Θ(n^2). This picture is a visual representation of a more algebraic proof:
\[
\sum_{i=1}^{n} i \;\le\; \sum_{i=1}^{n} n \;=\; n^{2},
\qquad\text{and}\qquad
\sum_{i=1}^{n} i \;\ge\; \sum_{i=\frac{n}{2}+1}^{n} i \;\ge\; \sum_{i=\frac{n}{2}+1}^{n} \frac{n}{2} \;=\; \frac{n^{2}}{4}.
\]
selectionSort(A[1 . . . n]):
    for i := 1 to n:
        minIndex := i
        for j := i + 1 to n:
            if A[j] < A[minIndex] then
                minIndex := j
        swap A[i] and A[minIndex]
Figure 9.10: Selection Sort.
Figure 9.11: A visual representation of the proof that Selection Sort runs in Θ(n^2) time. (a) Selection Sort’s running time: n rows, where row i contains i steps. (b) The analysis of the running time. [Figure not reproduced.]
While the analysis of Selection Sort isn’t necessary for our main proof, the style of analysis from Figure 9.11 will be useful in a moment.
There Are No O(n) Comparison-Based Sorting Algorithms
All of the sorting algorithms that we’ve encountered in the book are comparison-based sorting algorithms: they proceed by repeatedly comparing the values of two elements xi and xj from the input array without considering the values themselves. Depending on the result of the comparison, the algorithm may then swap some elements of the array. (Comparison-based sorting algorithms probably include every sorting algorithm that you’ve ever seen, except counting, radix, and bucket sorts.)
One way to view a comparison-based sorting algorithm is through a decision tree, like the one shown in Figure 9.12 for Selection Sort on a 3-element array. The internal nodes encode the comparisons made by the algorithm. The leaves correspond to sorted orders: the output of the sorting algorithm.
[Figure 9.12: decision tree for Selection Sort on a 3-element array with elements a, b, c. Internal nodes are comparisons such as “a ≶ b?”, “a ≶ c?”, and “b ≶ c?”. Figure not reproduced.]
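The Selection Sort pseudocode of Figure 9.10 translates almost line for line into Python. The sketch below also counts comparisons, which is an addition for illustration: it counts only the A[j] < A[minIndex] tests, so the total is n(n − 1)/2 rather than the n(n + 1)/2 step count used in the text. Both are Θ(n^2), and both fit between the n^2/4 and n^2 bounds once n ≥ 2.

```python
def selection_sort(values):
    """Return (sorted copy of values, number of element comparisons made)."""
    a = list(values)          # work on a copy; Python lists are 0-indexed
    comparisons = 0
    n = len(a)
    for i in range(n):
        min_index = i
        for j in range(i + 1, n):
            comparisons += 1
            if a[j] < a[min_index]:
                min_index = j
        # Swap the smallest remaining element into position i.
        a[i], a[min_index] = a[min_index], a[i]
    return a, comparisons

result, count = selection_sort([5, 2, 9, 1, 7, 3])
print(result, count)
```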
b |Z|. More relevantly for computer science, we can prove that there are strictly more problems than there are computer programs, and therefore that there are problems that cannot be solved by a computer. See the discussion on p. 937.
Lemma 9.10 is the reason for the power set’s name: the cardinality of P(X) is 2 to the power of |X|.
Lemma 9.10 (Cardinality of the Power Set)
Let X be any finite set. Then |P(X)| = 2^|X|.
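Lemma 9.10 can be spot-checked on small sets by listing every subset. This is a sketch, with power_set as an illustrative helper name; it confirms the count for small n and is of course not a proof.

```python
from itertools import combinations

def power_set(xs):
    """Return a list of all subsets of xs, each as a frozenset."""
    items = list(xs)
    return [
        frozenset(subset)
        for size in range(len(items) + 1)
        for subset in combinations(items, size)
    ]

for n in range(6):
    print(n, len(power_set(range(n))), 2 ** n)
```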

9.3.2 The Division Rule
When we introduced the Inclusion–Exclusion Rule, we used an approach to counting that we might call count first, apologize later: to compute the cardinality of a set A ∪ B, we found |A| + |B| and then “fixed” our count by subtracting the number of elements that we’d counted twice (namely, subtracting |A ∩ B|). Here we’ll consider an analogous count-and-correct rule, called the Division Rule, that applies when we count every element of a set multiple times (and where each element is recounted the same number of times); we’ll then correct our total by dividing by this “redundancy factor.” Let’s start with some informal examples:
Example 9.28 (Some redundant counting, informally)
• Suppose that the Juggling Club on campus sells 99 juggling torches to its members, in sets of three. Then there are 33 people who purchased torches.
• There are 42 people at a party. Suppose that every person shakes hands with every other person. How many handshakes have occurred? There are many ways to solve this problem, but here’s an approach that uses division: each person shakes hands with all 41 other people, for a total of (42 people) · (41 shakes/person) = 1722 shakes. But each handshake involves two people, so we’ve counted every shake exactly twice; thus there are actually a total of 861 = 1722/2 = (42 · 41)/2 handshakes.
• In Game 5 of the 1997 NBA Finals, the Chicago Bulls had 10 players who were on the court for some portion of the game. The number of minutes played by these ten were ⟨45, 44, 26, 24, 24, 24, 23, 23, 4, 3⟩. The total number of minutes played was 45 + 44 + 26 + 24 + 24 + 24 + 23 + 23 + 4 + 3 = 240. In basketball, five players are on the court at a time. Thus the game lasted 240/5 = 48 minutes.
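The handshake and basketball calculations above can be reproduced in a few lines. In this sketch the handshakes are also counted directly, as unordered pairs of people, to confirm that dividing by the redundancy factor of 2 gives the same answer; the function names are illustrative.

```python
from itertools import combinations

def handshakes_by_division(people):
    """Count (person, other person) pairs, then divide out the double counting."""
    return people * (people - 1) // 2

def handshakes_by_enumeration(people):
    """Count unordered pairs of distinct people directly."""
    return sum(1 for _ in combinations(range(people), 2))

minutes = [45, 44, 26, 24, 24, 24, 23, 23, 4, 3]
game_length = sum(minutes) // 5   # five players are on the court at a time

print(handshakes_by_division(42), handshakes_by_enumeration(42), game_length)
```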
We’ll phrase the Division Rule using the same general structure as the Mapping Rule, in terms of a function that maps from one set to another. Specifically, if we have a function f : A → B that always maps exactly the same number of elements of A to each element of B (for instance, exactly three torches are mapped to any particular juggler in Example 9.28), then |A| and |B| differ exactly by that factor:
Theorem 9.11 (Division Rule)
Let A and B be arbitrary sets. Suppose that there exists a function f : A → B such that, for every b ∈ B, there are exactly k elements a1, …, ak ∈ A such that f(ai) = b. (That is, |{a ∈ A : f(a) = b}| = k for all b ∈ B.) Then |A| = k · |B|.
(The Division Rule with k = 1 simply is the bijection case of the Mapping Rule: what it means for f : A → B to be a bijection is precisely that |{a ∈ A : f(a) = b}| = 1 for every b ∈ B. If such a function f exists, then both the Mapping Rule and the Division Rule say that |A| = 1 · |B|.)
Here are two simple examples to illustrate the formal version of the Division Rule:

Example 9.29 (Redundant counting, formally)
• Let M be the set of members of the Juggling Club, and let T be the set of torches bought by the members of the club. Consider the function boughtBy : T → M. Assuming that each member bought precisely three torches (that is, assuming that |{t ∈ T : boughtBy(t) = m}| = 3 for every m ∈ M), then |T| = 3 · |M|.
• Consider the sets A = {0, 1, …, 31} and B = {0, 1, …, 15}. Define the function f : A → B as f(n) = ⌊n/2⌋. For each b ∈ B, there are exactly two input values whose output under f is b, namely 2b and 2b + 1. Thus by the Division Rule |A| = 2 · |B|.
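The hypothesis of the Division Rule (every element of B has exactly k preimages) can be checked mechanically for finite sets. The helper below, preimage_factor, is a hypothetical name introduced for this sketch; it returns k when the hypothesis holds and None otherwise.

```python
from collections import Counter

def preimage_factor(domain, codomain, f):
    """Return k if every b in codomain has exactly k preimages under f, else None."""
    counts = Counter(f(a) for a in domain)
    if not set(counts) <= set(codomain):
        raise ValueError("f maps some element outside the codomain")
    # A Counter reports 0 for elements of the codomain that are never hit.
    sizes = {counts[b] for b in codomain}
    return sizes.pop() if len(sizes) == 1 else None

A = range(32)
B = range(16)
k = preimage_factor(A, B, lambda n: n // 2)
print(k, len(A) == k * len(B))
```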
This basic idea—if we’ve counted each thing k times, then dividing our total count by k gives us the number of things—is pretty obvious, and it’ll also turn out to be surprisingly useful. Here’s a sequence of examples, starting with a warm-up exercise and continuing with two (slightly less obvious) applications of the Division Rule:
Example 9.30 (Rearranging PERL, PEER, and SMALLTALK)
Problem: How many different ways can you arrange the letters of …
1. … the name of the programming language PERL?
2. … the word PEER?
3. … the name of the programming language SMALLTALK?
Solution:
PERL: There are 4 different letters, and any permutation of them is a different ordering. Thus there are 4! = 4 · 3 · 2 · 1 = 24 orderings. (See Theorem 9.7.)
PEER: We’ll answer this question using the solution for PERL. Define the function L->E as follows: given a 4-character input string, it produces a 4-character output string in which every L has been replaced by an E. For example, L->E(PERL) = PERE. Let S denote the orderings of the word PERL, and let T denote the orderings of PEER. Note that the function L->E : S → T has the property that, for every t ∈ T, there are exactly two strings x ∈ S such that L->E(x) = t. (For example, L->E(PERL) = PERE and L->E(PLRE) = PERE.) See Figure 9.19. Thus, by the Division Rule, there are 4!/2 = 24/2 = 12 ways to order the letters of PEER.
ELPR, LEPR → EEPR
ELRP, LERP → EERP
EPLR, LPER → EPER
EPRL, LPRE → EPRE
ERLP, LREP → EREP
ERPL, LRPE → ERPE
PELR, PLER → PEER
PERL, PLRE → PERE
PREL, PRLE → PREE
RELP, RLEP → REEP
REPL, RLPE → REPE
RPEL, RPLE → RPEE
Figure 9.19: The 24 different orderings of PERL and the 12 different orderings of PEER. The function that replaces L by E is displayed by the arrows.
SMALLTALK: There are 9! different orderings of the nine “letters” in the word S M A1 L1 L2 T A2 L3 K. (We are writing L1 and L2 and L3 to denote three different “letters,” and similarly for A1 and A2.) We will use the Division Rule repeatedly to “erase” subscripts:
• The function that erases subscripts on the As maps two inputs to each output: one with A1 before A2, and one with A2 before A1. Thus there are 9!/2 different orderings of the “letters” in the word S M A L1 L2 T A L3 K.
• The function that takes an ordering of S M A L1 L2 T A L3 K and erases the subscripts on the Ls maps precisely six inputs to each output: one for each of the 3! possible orderings of the Ls.
Thus there are 9!/(2 · 3!) = 362,880/12 = 30,240 different orderings of the letters in the word S M A L L T A L K.
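The three answers in Example 9.30 can be confirmed by brute force: generate every permutation of the letters and keep only the distinct strings. The sketch below also checks the two-to-one property of the L->E function; the Python function names are illustrative stand-ins.

```python
from collections import Counter
from itertools import permutations

def distinct_orderings(word):
    """Return the set of distinct strings obtained by rearranging word."""
    return {"".join(p) for p in permutations(word)}

def l_to_e_preimage_counts():
    """For each ordering of PEER, count the orderings of PERL that map to it."""
    return Counter(w.replace("L", "E") for w in distinct_orderings("PERL"))

for word in ["PERL", "PEER", "SMALLTALK"]:
    print(word, len(distinct_orderings(word)))
```

Enumerating SMALLTALK touches all 9! = 362,880 permutations, which is still fast, but this approach does not scale to long words; the Division Rule gives the count without enumeration.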

Counting orderings when some elements are indistinguishable
Although we phrased Example 9.30 in terms of the number of ways to rearrange the letters of some particular words, there’s a very general idea that underlies the PEER and SMALLTALK examples. We’ll state the underlying idea as a theorem:
Theorem 9.12 (Rearranging with duplicates)
The number of ways to rearrange a sequence containing k distinct elements {x1, …, xk}, where element xi appears ni times, is
\[
\frac{(n_1 + n_2 + \cdots + n_k)!}{(n_1!) \cdot (n_2!) \cdots (n_k!)}.
\]
For example, PERL has k = 4 distinct elements, which appear nP = nE = nR = nL = 1 time each; the theorem says that there are (1 + 1 + 1 + 1)!/(1! · 1! · 1! · 1!) = 4! ways to arrange the letters. On the other hand, SMALLTALK has k = 6 distinct elements, which appear nA = 2, nL = 3, and nS = nM = nT = nK = 1 times each; the theorem says that there are (2 + 3 + 1 + 1 + 1 + 1)!/(2! · 3! · 1! · 1! · 1! · 1!) = 9!/(2! · 3!) ways to arrange the letters. Let’s prove the theorem:
Proof of Theorem 9.12. Let’s handle a simpler case first: suppose that we have n different elements that we can put into any order, and precisely k of these n elements are indistinguishable. Then there are exactly n!/k! different orderings of those n elements.
To see this fact, imagine “decorating” each of those k items with some kind of artificial distinguishing mark, like the numerical subscripts of the letters of SMALLTALK from Example 9.30. Then there are n! different orderings of the n elements. The erase function that eliminates our artificial distinguishing marks has k! inputs that yield the same output: namely, one for each ordering of the k artificially marked elements. Therefore, by the Division Rule, there are n!/k! different orderings of the elements, without the distinguishing markers.
The full theorem is just a mild generalization of this argument, to allow us to consider more than one set of indistinguishable elements. (In particular, we could give a formal proof by induction on the number of elements with ni ≥ 2.) In total, there are (n1 + n2 + · · · + nk)! different orderings of the elements themselves, but there are n1! equivalent orderings of the first element, n2! of the second, and so forth. The function that “erases subscripts” as in Example 9.30 therefore maps (n1!) · (n2!) · · · · · (nk!) different equivalent orderings to each output, and thus the total number of orderings is, by the Division Rule,
\[
\frac{(n_1 + n_2 + \cdots + n_k)!}{(n_1!) \cdot (n_2!) \cdots (n_k!)}.
\]
Here’s another simple example that we can solve using this theorem:
Example 9.31 (Writing 232,848 as a sequence of prime factors)
Problem: How many ways can we write 232,848 as a product p1 p2 · · · pk, where each pi is prime? (The set of prime factors, and the number of occurrences of each factor, are the same in every product, because the prime factorization of any positive integer is unique. But the order may change: for example, we can write 6 = 3 · 2 or 6 = 2 · 3.)
Solution: The prime factorization of 232,848 is 232,848 = 2^4 · 3^3 · 7^2 · 11. Thus a product of primes that equals 232,848 consists of 4 copies of two, 3 copies of three, 2 copies of seven, and one copy of eleven, in some order. (For example, 2 · 2 · 7 · 3 · 3 · 7 · 2 · 11 · 3 · 2.) By Theorem 9.12, the number of orderings of these elements is
\[
\frac{(4 + 3 + 2 + 1)!}{4! \cdot 3! \cdot 2! \cdot 1!} = \frac{10!}{4! \cdot 3! \cdot 2!} = \frac{3{,}628{,}800}{24 \cdot 6 \cdot 2} = 12{,}600.
\]
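The formula of Theorem 9.12 is straightforward to evaluate by machine. The sketch below applies it to the words from Example 9.30 and to the prime factors of 232,848; count_rearrangements and prime_factors are illustrative helper names, and the factoring routine is plain trial division, adequate only for small inputs.

```python
from collections import Counter
from math import factorial

def count_rearrangements(items):
    """Theorem 9.12: (n1 + ... + nk)! divided by n1! * ... * nk!."""
    counts = Counter(items)
    result = factorial(sum(counts.values()))
    for multiplicity in counts.values():
        result //= factorial(multiplicity)   # each division is exact
    return result

def prime_factors(n):
    """Return the prime factors of n, with repetition, in increasing order."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

print(prime_factors(232848))
print(count_rearrangements(prime_factors(232848)))
```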
A slightly more complicated example
Here is one final example of the Division Rule, in which we’ll use this approach on a slightly more complicated problem:
Example 9.32 (Assigning partners)
Problem: The professor divides the n students in a CS class into n/2 partnerships, with two students per partnership. (Assume that n is even.) The order of partners within a pair doesn’t matter, nor does the order of the partnerships. (That is, the listings
Paul and George, John and Ringo
Ringo and John, George and Paul
represent exactly the same set of partnerships.) How many ways are there to divide the class into partnerships?
Problem-solving tip: There are often many different ways to solve a given problem, and you can use whatever approach makes the most sense to you! For example, Exercise 9.106 explores a completely different way to solve Example 9.32, based on the Generalized Product Rule instead of the Division Rule.
Solution: Let’s line up the students in some order, and then pair the first two students, then pair the third and fourth, and so on. There are n! different orderings of the students, but there are fewer than n! possible partnerships, because we’ve double counted each set of pairs in two different ways:
• there are two equivalent orderings of the first pair of students, and two equivalent orderings of the second pair, and so on.
• the ordering of the pairs doesn’t matter, so the partnerships themselves can be listed in any order at all (without changing who’s paired with whom).
Each of the n/2 pairs can be listed in 2 orders, so (by the Product Rule) there are 2^(n/2) different possible within-pair orderings. And there are (n/2)! different orderings of the pairs. Applying the Division Rule, then, we see that there are
\[
\frac{n!}{(n/2)! \cdot 2^{n/2}} \qquad (\ast)
\]
total possible ways to assign partners.
Let’s make sure that (∗) checks out for some small values of n. For n = 2, there’s just one pairing, and indeed (∗) is 2!/(1! · 2^1) = 2/2 = 1. For n = 4, the formula (∗) yields 4!/(2! · 2^2) = (4 · 3 · 2)/8 = 3 pairings; indeed, for the quartet Paul, John, George, and Ringo, there are three possible partners for Paul (and once Paul is assigned a partner there are no further choices to be made). See Figure 9.20 for an illustration: we try all 4! = 24 orderings of the four people, then we reorder the names within each pair, and finally we reorder the pairs.
Orderings: AB CD, AB DC, BA CD, BA DC, CD AB, CD BA, DC AB, DC BA. Reordered within pairs: AB CD, CD AB. With pairs sorted: AB CD.
Orderings: AC BD, AC DB, BD AC, BD CA, CA BD, CA DB, DB AC, DB CA. Reordered within pairs: AC BD, BD AC. With pairs sorted: AC BD.
Orderings: AD BC, AD CB, BC AD, BC DA, CB AD, CB DA, DA BC, DA CB. Reordered within pairs: AD BC, BC AD. With pairs sorted: AD BC.
Figure 9.20: Partnerships for n = 4 students: the 4! orderings, then the orderings sorted within pairs, and then with the pairs sorted.
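The process illustrated in Figure 9.20 (try every ordering, sort within each pair, then sort the pairs) can be run directly for small n and compared against the formula from Example 9.32. This is a sketch with illustrative function names; the enumeration visits all n! orderings, so it is only practical for small classes.

```python
from itertools import permutations
from math import factorial

def pairings_by_canonicalizing(people):
    """Collect the distinct sets of partnerships over all orderings of people."""
    seen = set()
    for order in permutations(people):
        # Pair the first two, the next two, and so on; sort within each pair.
        pairs = [tuple(sorted(order[i:i + 2])) for i in range(0, len(order), 2)]
        # Sort the pairs themselves so that their order does not matter.
        seen.add(tuple(sorted(pairs)))
    return seen

def pairings_by_formula(n):
    """n! / ((n/2)! * 2^(n/2)), for even n."""
    return factorial(n) // (factorial(n // 2) * 2 ** (n // 2))

beatles = ["George", "John", "Paul", "Ringo"]
for pairing in sorted(pairings_by_canonicalizing(beatles)):
    print(pairing)
```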

9.3.3 The Pigeonhole Principle
We’ll close this section with a very simple—but also surprisingly useful—theorem based on the Mapping Rule, called the pigeonhole principle. Here are a few informal examples to introduce the underlying idea:
Example 9.33 (What happens when there are more things than kinds of things)
• If there are more socks in your drawer than there are colors of socks in your drawer, then you must have two socks of the same color.
• If there are only 5 possible letter grades and there are 6 or more students in a class, then there must be two students who receive the same letter grade.
• If you take 9 or more CS courses during the 8 semesters that you’re in college, then there must be at least one semester in which you doubled up on CS courses.
• In the antiquated language in which this result is generally stated: if there are n pigeonholes, and n + 1 pigeons that are placed into those pigeonholes, then there must be at least one pigeonhole that contains more than one pigeon.
Here is the general statement of the theorem, along with its proof:
Theorem 9.13 (Pigeonhole Principle)
Let A and B be sets with |A| > |B|, and let f : A → B be any function. Then there exist distinct elements a ∈ A and a′ ∈ A such that f(a) = f(a′).
Proof. We can prove the Pigeonhole Principle using the Mapping Rule. Given the sets A and B, and the function f : A → B, the Mapping Rule tells us that
if f : A → B is one-to-one, then |A| ≤ |B|. (1)
Taking the contrapositive of (1), we have
if |A| > |B|, then f : A → B is not one-to-one. (2)
By assumption, we have that |A| > |B|, so f : A → B is not one-to-one. The theorem follows by the definition of a one-to-one function: the fact that f : A → B is not one-to-one means precisely that there is some b ∈ B that’s “hit” twice by f. In other words, there exist a ∈ A and a′ ∈ A such that a ≠ a′ and f(a) = f(a′).
A pigeonhole refers to one of the “cells” in a grid of compartments that are open in the front, and which can house either snail mail or, back in the day, roosting pigeons. (There’s also a related verb: to pigeonhole someone/something is to categorize that person/thing into one of a small number of, misleadingly simple, groups.)
A slight generalization of this idea is also sometimes useful: if there are n total objects, each of which has one of k types, then there must be a type that has at least ⌈n/k⌉ objects. (We’ll omit the proof, but the idea is very similar to Theorem 9.13.)
Theorem 9.14 (Pigeonhole Principle: Extended Version)
Let A and B be sets, and let f : A → B be any function. Then there exists some b ∈ B such that the set {a ∈ A : f(a) = b} contains at least ⌈|A|/|B|⌉ elements.
(Another less formal way of stating this fact is “the maximum must exceed the average”: the number of elements in A that “hit” a particular b ∈ B is |A|/|B| on average, and there must be some element of B that’s hit at least this many times.)
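For finite sets, the crowded pigeonhole promised by Theorems 9.13 and 9.14 can simply be found by grouping. The sketch below uses a hypothetical helper fullest_hole; it assumes at least one pigeon, and the modular assignments in the usage lines are arbitrary stand-ins for a real function f.

```python
from collections import defaultdict
from math import ceil

def fullest_hole(pigeons, hole_of):
    """Group pigeons by hole_of(pigeon); return (hole, pigeons in it) for a fullest hole."""
    holes = defaultdict(list)
    for pigeon in pigeons:
        holes[hole_of(pigeon)].append(pigeon)
    return max(holes.items(), key=lambda item: len(item[1]))

# 9 CS courses spread over 8 semesters: some semester has at least 2 courses.
semester, courses = fullest_hole(range(9), lambda c: c % 8)
print(semester, courses, ceil(9 / 8))
```

Whatever function is supplied as hole_of, the extended version guarantees that the returned list has at least ⌈|A|/|B|⌉ entries, where |B| is the number of available holes.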
We’ll start with two simpler examples of the pigeonhole principle, and close with a slightly more complicated application. (In the last example, the slightly tricky part of applying the pigeonhole principle is figuring out what corresponds to the “holes.”)
Example 9.34 (Congressional voting)
Suppose that there were 5 different bills upon which the House of Representatives voted yesterday. (There are 435 representatives in the U.S. House.) The pigeonhole principle implies that there are two representatives who voted identically on yesterday’s bills. A representative’s vote can be expressed as an element of {aye, nay, abstain}^5, which has cardinality 3^5 = 243. Because 243 < 435, the pigeonhole principle says that there are two representatives with the same voting record.
Example 9.35 (Logical equivalence)
Let S be a set of 17 different logical propositions over the Boolean variables p and q. A truth table for a proposition φ ∈ S is an element of {True, False}^4 (the rows of the truth table correspond to each of the four truth assignments for p and q), and there are only |{True, False}^4| = 2^4 = 16 different such values. Therefore, our 17 different propositions have only 16 different possible truth tables, so, by the pigeonhole principle, there must be two different propositions that have the same truth table.
Example 9.36 (Points in a square)
Problem: Suppose that there are n^2 + 1 points in a 1-by-1 square, as in Figure 9.21(a). Show that there must be two points within distance √2/n of each other.
Solution: We will use the pigeonhole principle. Divide the unit square into n^2 equal-sized disjoint subsquares, each with dimension (1/n)-by-(1/n). (To prevent overlap, we’ll say that every shared boundary line is included in the square to the left or below the shared line.) There are n^2 subsquares, and n^2 + 1 points. By the pigeonhole principle, at least one subsquare contains two or more points. (See Figure 9.21(b).) Notice that the farthest apart that two points in a subsquare can be is when they are at opposite corners of the subsquare. In this case, they are 1/n apart in x-coordinate and 1/n apart in y-coordinate; in other words, they are separated by a distance of
\[
\sqrt{\left(\tfrac{1}{n}\right)^{2} + \left(\tfrac{1}{n}\right)^{2}} = \sqrt{\tfrac{2}{n^{2}}} = \frac{\sqrt{2}}{n}.
\]
Figure 9.21: Putting n^2 + 1 points in the unit square. (a) 17 points in a 1-by-1 square. (b) The square divided into 16 subsquares, and one of the several doubly occupied subsquares. [Figure not reproduced.]
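The subsquare argument of Example 9.36 also gives a simple procedure for actually finding two close points. The sketch below is an illustration under two assumptions: points are (x, y) pairs with coordinates in [0, 1], and boundary points are assigned by rounding down (a slightly different tie-breaking convention than in the example, which does not affect the distance bound). The name close_pair and the random test points are hypothetical.

```python
import random
from math import dist, sqrt

def close_pair(points, n):
    """Given more than n*n points in the unit square, return two in one subsquare."""
    occupant = {}
    for x, y in points:
        # Index of the (1/n)-by-(1/n) subsquare; clamp so that 1.0 lands in the last one.
        cell = (min(int(x * n), n - 1), min(int(y * n), n - 1))
        if cell in occupant:
            return occupant[cell], (x, y)
        occupant[cell] = (x, y)
    return None   # possible only if there are at most n*n points

random.seed(0)
n = 4
points = [(random.random(), random.random()) for _ in range(n * n + 1)]
p, q = close_pair(points, n)
print(p, q, dist(p, q) <= sqrt(2) / n)
```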
Taking it further: The pigeonhole principle can be used to show that compression of data files (for example, ZIP files or compressed image formats like GIF) must either lose information about the original data (so-called lossy compression) or must, for some input files, actually cause the “compressed” version to be larger than the original file. See the discussion on p. 938.
Computer Science Connections
Infinite Cardinalities (and Problems that Can’t Be Solved by Any Program)
Recall the Mapping Rule: for any two sets A and B, a bijection f : A → B exists if and only if |A| = |B|. Although we were thinking about finite sets when we stated this rule, the statement holds even for infinite sets A and B; we can even think of this rule as defining what it means for two sets to have the same cardinality. Those sets S such that |S| = |Z|, called countable sets, will turn out to be particularly important.
Surprisingly, some sets that “seem” much bigger or much smaller than the integers have the same cardinality as Z. For example, the set of nonnegative integers has the same cardinality as the set of all integers! (See Figure 9.22 for a bijection between these sets.) This fact is very strange: after all, we’re looking at sets A and B where A is a proper subset of B and we’ve now established that |A| = |B|! But, indeed, because we have a bijection between A and B, they really are the same size.
Define the function f : Z≥0 → Z as f(n) = ⌈n/2⌉ · (−1)^n. Then:
f(0) = ⌈0/2⌉ · (−1)^0 = 0 · 1 = 0
f(1) = ⌈1/2⌉ · (−1)^1 = 1 · −1 = −1
f(2) = ⌈2/2⌉ · (−1)^2 = 1 · 1 = 1
f(3) = ⌈3/2⌉ · (−1)^3 = 2 · −1 = −2
f(4) = ⌈4/2⌉ · (−1)^4 = 2 · 1 = 2
…
Figure 9.22: A bijection between Z≥0 and Z. Thus |Z≥0| = |Z|.
Or consider a Python program p. Think of the source code of p as a file, which thus represents p as a sequence of characters, each of which is represented as a sequence of bits, which can therefore be interpreted as an integer written in binary. (See Figure 9.23.) Therefore there is a bijection f between the integers and the set of Python programs, where f(i) is the ith-largest Python program (sorted numerically by its binary representation).
characters: print "hello wo
character codes: 112 114 105 110 116 32 34 104 101 108 108 111 32 119 111
binary: 1110000 1110010 1101001 1101110 1110100 100000 100010 1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111
Figure 9.23: Converting a Python program into an integer. This program corresponds to the integer whose binary representation is 1110000 1110010 1101001 1101110 · · · .
With all of these sets that have the same cardinality, it might be tempting to think that all infinite sets have the same cardinality as Z. But they don’t!
Theorem 9.15
The set of all subsets of Z≥0 (that is, P(Z≥0)) is strictly bigger than Z≥0.
Proof. Suppose for a contradiction that f : Z≥0 → P(Z≥0) is an onto function. We’ll show that there’s a set S ∈ P(Z≥0) such that for every n ∈ Z≥0 we have f(n) ≠ S. Define the set S as follows:
S := {i ∈ Z≥0 : i ∉ f(i)}. (So i ∈ S ⇔ the set f(i) does not contain i.)
Observe that the set S differs from f(i) for every i: specifically, for every i we have i ∈ S ⇔ i ∉ f(i). Thus S is never “hit” by f, contradicting the assumption that f was onto. Therefore there is no onto function f : Z≥0 → P(Z≥0), and, by the Mapping Rule, |Z≥0| < |P(Z≥0)|. (This argument is called a proof by diagonalization; see Figure 9.24.)
        0  1  2  3  4
f(0)    1  0  1  0  1  ···
f(1)    0  0  0  1  1  ···
f(2)    0  1  1  0  1  ···
f(3)    1  1  0  1  1  ···
f(4)    1  0  1  0  0  ···
 ⋮
Figure 9.24: Diagonalization. Suppose that f : Z≥0 → P(Z≥0). In a table, write row n corresponding to f(n), so that f(n) has a “1” in column j when j ∈ f(n). Define S := {i : i ∉ f(i)}; that is, the opposite of the diagonal element. For this table we have 0 ∉ S (because 0 ∈ f(0)), 1 ∈ S (because 1 ∉ f(1)), etc.
We can think of any subset of Z as defining a problem that we might want to write a Python program to solve. For example, the set {0, 2, 4, 6, …} is the problem of identifying even numbers. The set {1, 2, 4, 8, 16, …} is exact powers of 2. The set {2, 3, 5, 7, 11, …} is prime numbers.
What does all of this say? There are more problems than there are Python programs! And thus there are problems that cannot be solved by any program!4
4 Problems that can’t be solved by any computer program are called uncomputable. Section 4.4.4 identifies some particular uncomputable problems, or see a good book on computability, like Dexter Kozen. Automata and Computability. Springer, 1997; and Michael Sipser. Introduction to the Theory of Computation. Course Technology, 3rd edition, 2012.
Computer Science Connections
Lossy and Lossless Compression
The task in compression is to take a large (potentially massively large!) piece of data and to represent it, somehow, using a smaller amount of space. Compression techniques are tremendously common, for a wide variety of data: text, images, audio, and video, for example. There are two fundamentally different approaches to compression of an original data file d into a compressed form d′: lossy and lossless compression.
Lossy Compression. In lossy compression, d′ does not represent exactly all of the information in d; that is, we’ve “lost” some information through compression. (That’s why the compression is called “lossy.”) In fact, many of the standard file formats for images, audio, and video are just standard methods for lossy compression. For example, JPEG is a lossy image compression format, and MP3 is a lossy audio compression format. The general goal with a lossy compression technique is to maintain, to the extent possible, “perceptual indistinguishability.” For example, a digital audio stream can be represented precisely as a sequence of intensities at each time t (“how loud is the sound at time t?”).
A lossy compression technique for sound might round the intensities: instead of representing an intensity as one of $2^{16}$ values (“a 16-bit sound,” which is CD quality), we could round to the nearest of $2^{8}$ values. (This idea is called quantization; see Example 2.56.) As long as the lost precision is smaller than the level of human perception, the new audio file would “sound the same” as the original.

Lossless Compression. In lossless compression, the precise contents of the original data file d can be reconstructed when the compressed data file d′ is uncompressed. This approach is the one commonly used, for example, when compressing text using a program like ZIP.

The typical idea of lossless compression is to exploit redundancy in the stored data and to avoid wasting space storing the “same” information twice. For example, take the complete works of Shakespeare. By replacing every occurrence of the word “the” with “QQ” (two letters that don’t occur consecutively in Shakespeare) the resulting file takes “only” about 99.2% of the original size. We can then set up a “translation table” telling us that QQ → the when we’re decompressing.

(Margin note: The word “the” appears over 20,000 times in the complete works of Shakespeare. The words “thee,” “them,” “their,” “they,” “there,” and “these” also appear over 1000 times each.)

(Margin note: Here’s an example of a lossless “compression” function making a file bigger: I downloaded the complete works of Shakespeare from Project Gutenberg, http://www.gutenberg.org. It took 5,590,193 bytes uncompressed, and 2,035,948 bytes when run through gzip. But shakespeare.zip.zip.zip (2,035,779 bytes), run through gzip three times, is actually bigger than shakespeare.zip.zip (2,035,417 bytes).)

One interesting fact about lossless compression, though, is that it is impossible to actually compress every input file into a smaller size:

Theorem 9.16
Let C be any lossless compression function. Then there exists an input file d such that C(d) takes up at least as much space as d.

Proof.
Suppose that C compresses all n-bit inputs into n − 1 or fewer bits. That is, $C : \{0,1\}^n \to \bigcup_{i=0}^{n-1} \{0,1\}^i$. Observe that the domain has size $2^n$ and the range has size $\sum_{i=0}^{n-1} 2^i = 2^n - 1$. By the pigeonhole principle, there must be two distinct input files $d_1$ and $d_2$ such that $C(d_1) = C(d_2)$. But this C cannot be a lossless compression technique: if the compressed versions of the files are identical, the decompressed versions must be identical too!

9.3.4 Exercises

9.57 Use the idea of Example 9.23 to determine how many bitstrings $x \in \{0,1\}^7$ fail all three Hamming code tests: those marked “✗ ✗ ✗” in the table in Example 9.23, or satisfying these three conditions:
$$x_2 + x_3 + x_4 \not\equiv_2 x_5 \qquad x_1 + x_3 + x_4 \not\equiv_2 x_6 \qquad x_1 + x_2 + x_4 \not\equiv_2 x_7.$$

9.58 Prove that the set P of legal positions in a chess game satisfies $|P| \le 13^{64}$. (Hint: Define a one-to-one function from P to $\{1, 2, \ldots, 13\}^{64}$.)

Let Σ be a nonempty set. A string over Σ is a sequence of elements of Σ; that is, $x \in \Sigma^n$ for some n ≥ 0.

9.59 How many strings of length n over the alphabet {A, B, . . . , Z, ␣} are there? How many contain exactly 2 “words” (that is, contain exactly one space ␣ that is not in the first or last position)?

9.60 Let n ≥ 3. How many n-symbol strings over this alphabet contain exactly 3 “words”? (Hint: use Example 9.4 to account for n-symbol strings with exactly two ␣s; then use Inclusion–Exclusion to prevent initial/final/consecutive spaces, as in ␣ABC· · · , · · · XYZ␣, and · · · JKL␣␣MNO· · · .)

A string over the alphabet {[, ]} is called a string of balanced parentheses if two conditions hold: (i) every [ is later closed by a ]; and (ii) every ] closes a previous [. (You must close everything, and you never close something you didn’t open.) Let $B_n \subseteq \{[, ]\}^n$ denote the set of strings of balanced parentheses that contain n symbols.

9.61 Show that $|B_n| \le 2^n$: define a one-to-one function $f : B_n \to \{0, 1\}^n$ and use the Mapping Rule.
9.62 Show that $|B_n| \ge 2^{n/4}$ by defining a one-to-one function $g : \{0, 1\}^{n/4} \to B_n$ and using the Mapping Rule. (Hint: consider [][] and [[]].)

A certain college in the midwest requires its users’ passwords to be 15 characters long. Inspired by an XKCD comic (see http://xkcd.com/936/), a certain faculty member at this college now creates his passwords by choosing three 5-letter English words from the dictionary, without spaces. (An example password is ADOBESCORNADORN, from the words ADOBE and SCORN and ADORN.) There are 8636 five-letter words in the dictionary that he found.

9.63 How many passwords can be made from any 15 (uppercase-only) letters? How many passwords can be made by pasting together three 5-letter words from this dictionary?

9.64 How many passwords can be made by pasting together three distinct 5-letter words from this dictionary? (For example, the password ADOBESCUBAADOBE is forbidden because ADOBE is repeated.)

The faculty member in question has a hard time remembering the order of the words in his password, so he’s decided to ensure that the three words he chooses from this dictionary are different and appear in alphabetical order in his password. (For example, the password ADOBESCUBAFOXES is forbidden because SCUBA is alphabetically after FOXES.)

9.65 How many passwords fit this criterion? Solve this problem as follows. Let P denote the set of three-distinct-word passwords (the set from Exercise 9.64). Let A denote the set of three-distinct-alphabetical-word passwords. Define a function f : P → A that sorts. Then use the Division Rule.

9.66 After play-in games, the NCAA basketball tournament involves 64 teams, arranged in a bracket that specifies who plays whom in each round. (The winner of each game goes on to the next round; the loser is eliminated. See Figure 9.25.) How many different outcomes (that is, lists of winners of all games) of the tournament are there?
A palindrome over Σ is a string $x \in \Sigma^n$ that reads the same backward and forward, like 0110, TESTSET, or (ignoring spaces and punctuation) SIT ON A POTATO PAN, OTIS!.

9.67 How many 6-letter palindromes (elements of $\{A, B, \ldots, Z\}^6$) are there?

9.68 How many 7-letter palindromes (elements of $\{A, B, \ldots, Z\}^7$) are there?

9.69 Let n ≥ 1 be an integer, and let $P_n$ denote the set of palindromes over Σ of length n. Define a bijection $f : P_n \to \Sigma^k$ (for some k ≥ 0 that you choose). Prove that f is a bijection, and use this bijection to write a formula for $|P_n|$ for arbitrary $n \in \mathbb{Z}_{\ge 1}$.

Let n be a positive integer. Recall an integer k ≥ 1 is a factor of n if k | n. The integer n is called squarefree if there’s no integer m ≥ 2 such that $m^2 \mid n$.

9.70 How many positive integer factors does 100 have? How many are squarefree?

9.71 How many positive integer factors does 12! have? (Hint: calculate the prime factorization of 12!.)

9.72 How many squarefree factors does 12! have? Explain your answer.

9.73 (programming required) Write a program that, given $n \in \mathbb{Z}_{\ge 1}$, finds all squarefree factors of n.

[Figure 9.25: An 8-team tournament bracket with teams A, B, C, D, E, F, G, and H. In the first round, A plays B, C plays D, etc. The A/B winner plays the C/D winner in the second round, and so forth.]

9.74 Consider two sets A and B. Consider the following claim: if there is a function f : A → B that is not onto, then |A| < |B|. Why does this claim not follow directly from the Mapping Rule?

The genre-counting problem (Example 9.24) considered a function f : {1, 2, . . . , n} → {1, 2, 3, 4, 5}. When n = 5 . . .

9.75 How many different functions f : {1, 2, . . . , 5} → {1, 2, . . . , 5} are there?

9.76 How many one-to-one functions f : {1, 2, . . . , 5} → {1, 2, . . . , 5} are there?

9.77 How many bijections f : {1, 2, . . . , 5} → {1, 2, . . . , 5} are there?

9.78 Let n ≥ 1 and m ≥ n be integers. Consider the set G of functions g : {1, 2, . . . , n} → {1, 2, . . . , m}. How many functions are in G? How many one-to-one functions are there in G?
How many bijections?

9.79 Show that the number of bijections f : A → B is equal to the number of bijections g : B → A. (Hint: define a bijection between {bijections f : A → B} and {bijections g : B → A}, and use the bijection case of the Mapping Rule!)

9.80 A Universal Product Code (UPC) is a numerical representation of the bar codes used in stores, with an error-detecting feature to handle misscanned codes. A UPC is a 12-digit number $\langle x_1, x_2, \ldots, x_{12} \rangle$ where $\left[ \sum_{i=1}^{6} (3x_{2i-1} + x_{2i}) \right] \bmod 10 = 0$. (That is, the even-indexed digits plus three times the odd-indexed digits should be divisible by 10.) Prove that there exists a bijection between the set of 11-digit numbers and the set of valid 12-digit UPC codes. Use this fact to determine the number of valid UPC codes.

9.81 A strictly increasing sequence of integers is $\langle i_1, i_2, \ldots, i_k \rangle$ where $i_1 < i_2 < \cdots < i_k$. How many strictly increasing sequences start with 1 and end with 1024? (That is, we have $i_1 = 1$ and $i_k = 1024$. The value of k can be anything you want; you should count both ⟨1, 1024⟩ and ⟨1, 2, 3, 4, . . . , 1023, 1024⟩.)

A subsequence of a sequence $x = \langle x_1, x_2, \ldots, x_n \rangle$ is a sequence $\langle x_{i_1}, x_{i_2}, \ldots, x_{i_k} \rangle$ of k ≥ 0 elements of x, where $\langle i_1, i_2, \ldots, i_k \rangle$ is a strictly increasing sequence. For example, PYTHON is a subsequence of PYTHAGOREAN and BASIC is a subsequence of BRAINSICKNESS.

9.82 Suppose the components of $x = \langle x_1, x_2, \ldots, x_n \rangle$ are all different (as in PYTHON but not PYTHAGOREAN). Use the Mapping Rule to figure out how many subsequences of x there are.

9.83 Suppose the components of $x = \langle x_1, x_2, \ldots, x_n \rangle$ are all different, except for a single pair of identical elements that are separated by k other elements. For example, PYTHAGOREAN has n = 11 and k = 4, because there are four entries (GORE) between the As (at index 5 and 10), which are the only repeated entries. In terms of n and k, how many subsequences of x are there?
As Example 9.23 describes, the Hamming Code adds 3 different parity bits to a 4-bit message m, where each added bit corresponds to the parity of a carefully chosen subset of the message bits, creating a 7-bit codeword c. Let k and n, respectively, denote the number of bits in the message and the codeword. (For the Hamming Code, we have k = 4 and n = 7.) A decoding algorithm takes a received (and possibly corrupted) codeword c′ and determines which message has a corresponding codeword c that is most similar to c′. (See Section 4.2, or Figure 9.26 for a brief reminder. See also Exercises 4.25–4.28.) We can view the decoding algorithm as a function decode : P({1, 2, . . . , n − k}) → {0, 1, 2, . . . , n}, where decode(S) tells us which bit (if any) to flip in the received codeword when S is the set of mismatched parity bits. (If decode(S) = 0, then no bits should be flipped.)

9.84 Argue using the Mapping Rule (that is, without reference to the precise function in Figure 9.26) that for the Hamming Code’s parameters (n = 7 and k = 4) there exists a bijection decode : P({1, 2, . . . , n − k}) → {0, 1, 2, . . . , n}.

9.85 Suppose that we choose n = 9 and k = 4. Does there exist a bijection from P({1, 2, . . . , n − k}) to {0, 1, 2, . . . , n}? Why or why not?

9.86 Suppose that we choose n = 31. For what value(s) of k does there exist a bijection from P({1, 2, . . . , n − k}) to {0, 1, 2, . . . , n}? Prove your answer.

9.87 Prove that, for any n that is not one less than a power of 2, there does not exist a bijection from P({1, 2, . . . , n − k}) to {0, 1, 2, . . . , n}.

Figure 9.26: Decoding the Hamming Code. Every single-bit error is corrected.

The Hamming code. For the message m = ⟨a, b, c, d⟩, we compute three parity bits:
• parity bit #1: b ⊕ c ⊕ d
• parity bit #2: a ⊕ c ⊕ d
• parity bit #3: a ⊕ b ⊕ d
and send c := ⟨a, b, c, d, parity #1, parity #2, parity #3⟩. Having received a (possibly corrupted) codeword c′, we compute what the parity bits would have been for the received message bits, and check for mismatches between the computed and received parity bits:

parity bit mismatches    error (which bit to flip)
{}                       no error!
{1}                      parity #1
{2}                      parity #2
{3}                      parity #3
{1, 2}                   bit c
{1, 3}                   bit b
{2, 3}                   bit a
{1, 2, 3}                bit d

In the corporate and political worlds, there’s a dubious technique called URL squatting, where someone creates a website whose name is very similar to a popular site and uses it to skim the traffic generated by poor-typing internet users. For example, Google owns the addresses gogle.com and googl.com, which redirect to google.com. (But, as of this writing, someone else owns oogle.com, goole.com, and googe.com.) Consider an n-letter company name. How many single-typo manglings of the name are there if we consider the following kinds of errors? Consider only uppercase letters throughout. (If your answers depend on the particular n-letter company name, then say how they depend on that name. Note that no transposition errors are possible for the company name MMM, for example.)

9.88 one-letter substitutions
9.89 one-letter insertions
9.90 one-pair transpositions (two adjacent letters written in the wrong order)
9.91 one-letter deletions

How many different ways can you arrange the letters of the following words?

9.92 PASCAL
9.93 GRACEHOPPER
9.94 ALANTURING
9.95 CHARLESBABBAGE
9.96 ADALOVELACE
9.97 PEERTOPEERSYSTEM

9.98 (programming required) Write a function that, given an input string, computes the number of ways to rearrange the string’s letters. Use your program to verify your answers to the last few exercises.

9.99 (programming required) In Example 9.31, we analyzed the number of ways to write a particular integer n as the product of primes. (Because the prime factorization of n is unique, the only difference between these products is the order in which the primes appear.)
Write a program, in a language of your choice, to compute the number $x_n$ of ways we can write a given number n as $p_1 \cdot p_2 \cdots p_k$, where each $p_i$ is prime. For what number n ≤ 10,000 is $x_n$ the greatest?

In Chapter 3, we discussed the application of Boolean logic to AI-based approaches to playing games like Tic-Tac-Toe. (See p. 344, or Figure 9.27 for a 2-by-2 version of the game [Tic-Tac; the 3-by-3 version is Tic-Tac-Toe].) Specifically, recall the Tic-Tac-Toe game tree: the root of the tree is the empty board, and the children of any node in the tree are the boards that result from any move made in any of the empty squares. We talked briefly about why chess is hard to solve using an approach like this. (In brief: it’s huge.) The next few problems will explore why a little bit of cleverness helps a lot in solving even something as simple as Tic-Tac-Toe.

9.100 Tic-Tac-Toe ends when either player completes a row, column, or diagonal. But for this question, assume that even after somebody wins the game, the board is completely filled in before the game ends. (That is, every leaf of the game tree has a completely filled board.) How many leaves are in the game tree?

9.101 Continue to assume that the board is completely filled in before the game ends. How many distinct leaves are there in the tree? (That is, suppose that the order in which O fills his or her squares doesn’t matter; if the same squares are filled, the boards count as the same.)

9.102 Continue to assume that the board is completely filled in before the game ends. Extend your answer to Exercise 9.100: how many total boards appear in the game tree (as leaves or as internal nodes)? (Hint: it may be easiest to compute the number of boards after k moves, and add up your numbers for k = 0, 1, . . . , 9.)

9.103 Continue to assume that the board is completely filled in before the game ends. How many distinct total boards (internal nodes or leaves) are there in the tree?

There are still two optimizations left that we haven’t tried.
The first is using the symmetry of the board to help us: for example, there are really only three first moves that can be made in Tic-Tac-Toe: a corner, the middle of the board, and the middle of a side. The second optimization is to truncate the tree when there’s a winner. These are both a bit tedious to track by hand, but it’s manageable with a small program.

9.104 (programming required) We can cut the size of the game tree down to less than a third of the original size (actually substantially more!) by exploiting symmetry in plays. (We’re down to a third of the original size just within the first move.) Write a program to compute the entire Tic-Tac-Toe game tree, and use it to determine the number of unique boards (counting as equivalent two boards that match with respect to rotational or reflectional symmetry) in the game tree. How many boards are now in the tree?

9.105 (programming required) We can reduce the size of the game tree just a bit further by not expanding the portions of the game tree where one of the players has already won. Extend your implementation from the last exercise so that no moves are made in any board in which O or X has already won. How many boards are in the tree now?

[Figure 9.27: A portion of the game tree for Tic-Tac. (The missing 75% is rotated, but otherwise identical.)]

Recall Example 9.32: we must put n students (where n is even) into n/2 partnerships. (We don’t care about the order of the partnerships, nor about the order of partners within a pair.) Here is an alternative way of solving this problem:

9.106 Consider sorting the n people alphabetically by name. Repeat the following n/2 times: for the unmatched person p whose name is alphabetically first, choose a partner for p from the set of all other unmatched people. How many choices are there in iteration i?
How many choices are there, in total?

9.107 Algebraically prove the following identity. (Hint: what does $(n/2)! \cdot 2^{n/2}$ represent?)
$$\prod_{i=1}^{n/2} (n - 2i + 1) = \frac{n!}{(n/2)! \cdot 2^{n/2}}$$

Think of an n-gene chromosome as a permutation of the numbers {1, 2, . . . , n}, representing the order in which these n genes appear. The following questions ask you to determine how many chromosome-level rearrangement events of a particular form there are. (See, for example, Figure 3.38.)

9.108 A prefix reversal inverts the order of the first j genes, for some j > 1 and j ≤ n. For example, for the chromosome ⟨5, 9, 6, 2, 1, 4, 7, 3, 8⟩ we could get the result ⟨6, 9, 5, 2, 1, 4, 7, 3, 8⟩ or ⟨1, 2, 6, 9, 5, 4, 7, 3, 8⟩ from a prefix reversal. How many different prefix reversals are there for a 1000-gene chromosome?

9.109 A reversal inverts the order of the genes between index i and index j, for some i and j > i. For example, for the chromosome ⟨5, 9, 6, 2, 1, 4, 7, 3, 8⟩ we could get the result ⟨6, 9, 5, 2, 1, 4, 7, 3, 8⟩ or ⟨5, 9, 6, 4, 1, 2, 7, 3, 8⟩ from a reversal. How many different reversals are there for a 1000-gene chromosome?

9.110 A transposition takes the genes between indices i and j and places them between indices k and k + 1, for some i and j > i and k ∉ {i, i + 1, . . . , j}. For example, for the chromosome ⟨5, 9, 6, 2, 1, 4, 7, 3, 8⟩ we could get the result ⟨5, 1, 4, 7, 3, 9, 6, 2, 8⟩ or ⟨1, 4, 5, 9, 6, 2, 7, 3, 8⟩ from a transposition. How many different transpositions are there for a 1000-gene chromosome?

A cellular automaton is a formalism that’s sometimes used to model complex systems, like the spatial distribution of populations, for example. Here is the model, in its simplest form. We start from an n-by-n toroidal lattice of cells: a two-dimensional grid that “wraps around” so that there’s no edge. (Think of a donut.) Each cell is connected to its eight immediate neighbors.
Cellular automata are a model of evolution over time: our model
will proceed in a sequence of time steps. At every time step, each cell
u is in one of two states: active or inactive. A cell’s state may change
from time t to time t + 1. More precisely, each cell u has an update rule
that describes u’s state at time t + 1 given the state of u and each of u’s
neighbors at time t. (For example, see Figure 9.28.)
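As one concrete instance of such a model, the following Python sketch advances an n-by-n toroidal lattice by a single time step, using the Game of Life rule summarized in the caption of Figure 9.28 as every cell's update rule. The function name `life_step` and the list-of-lists encoding (1 for active, 0 for inactive) are illustrative assumptions, and the sketch assumes n ≥ 3 so that a cell's eight neighbors are distinct cells.

```python
def life_step(grid):
    """Return the configuration one time step after `grid`.

    `grid` is an n-by-n list of lists of 0/1 values (assumed n >= 3).
    """
    n = len(grid)
    new_grid = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            # Count active cells among the eight neighbors. The "% n"
            # wraps indices around, so the lattice has no edge.
            active = sum(
                grid[(r + dr) % n][(c + dc) % n]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            if grid[r][c] == 1:
                # An active cell survives only with 2 or 3 active neighbors.
                new_grid[r][c] = 1 if active in (2, 3) else 0
            else:
                # An inactive cell becomes active with exactly 3.
                new_grid[r][c] = 1 if active == 3 else 0
    return new_grid
```

Because every index is taken modulo n, the corner and border cells need no special handling: the wrap-around is what makes the lattice a torus.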
9.111 An update rule is a function that takes the state of a cell and the state of its eight neighbors as input, and produces the new state of the cell as output. How many different update rules are there?

9.112 Let’s call an update rule a strictly cardinal update rule if (as in the Game of Life) the state of a cell u at time t + 1 depends only on the following: (i) the state of cell u at time t, and (ii) the number of active neighbors of cell u at time t. How many different strictly cardinal update rules are there?
Suppose that we have a 10-by-10 lattice of 100 cells, and we have an update rule $f_u$ for every cell u. (These update rules might be the same or differ from cell to cell.) Suppose we start the system at time t = 0 in an initial configuration $M_0$, and derive the configuration $M_t$ at time t ≥ 1 by computing
$$M_t(u) = f_u(\text{the states of } u \text{ and of } u\text{'s neighbors in } M_{t-1}).$$
Let’s consider the possible outcomes of the sequence $M_0, M_1, M_2, \ldots$. Say that this sequence exhibits eventual convergence if the following holds: there exists a time t ≥ 0 such that, for all times t′ ≥ t, we have $M_{t'} = M_t$. (So the Life example in Figure 9.28 exhibits eventual convergence.) Otherwise, we’ll say that this sequence oscillates.

Figure 9.28: In the Game of Life, each cell has an identical update rule: an active cell with ≤ 1 live neighbors dies (from “loneliness”), a live cell with ≥ 4 live neighbors dies (from “overcrowding”), and a dead cell with exactly three living neighbors becomes alive.
9.113 Given $M_0$ and the $f_u$’s, we’d like to know what the long-run behavior of this system is: does it eventually converge or does it oscillate? Prove that, for a sufficiently large value of K, we have eventual convergence if and only if the following algorithm returns True. Also compute the smallest value of K for which this algorithm is guaranteed to be correct.
• Start with $M := M_0$ and t := 0.
• Repeat the following K times: update M to the next time step (that is, for each u compute the updated M′(u) by evaluating $f_u$ on u and u’s neighbor cells in M).
• If M would be unchanged by one additional round of updates, return True. Else return False.
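The three bulleted steps translate almost line for line into code. The sketch below is only a transcription, not a solution to the exercise: K stays a parameter (choosing it is the point of Exercise 9.113), and `step` is a hypothetical function, supplied by the caller, that applies every cell's update rule once and returns the new configuration.

```python
def eventually_converges(step, initial, K):
    """Run the three-step procedure from Exercise 9.113.

    step    -- hypothetical function mapping a configuration to the next one
    initial -- the starting configuration M_0
    K       -- number of rounds to run before testing for a fixed point
    """
    config = initial
    for _ in range(K):
        config = step(config)
    # One additional round of updates: unchanged means a fixed point.
    return step(config) == config


if __name__ == "__main__":
    # Toy "configurations" that are just integers, to exercise the function.
    print(eventually_converges(lambda m: min(m + 1, 3), 0, K=5))
    print(eventually_converges(lambda m: 1 - m, 0, K=5))
```

The function is only as trustworthy as the K passed to it: with too small a K it can return False for a system that would have converged a few steps later.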
9.114 Suppose that we place 1234 items into 17 buckets. (For example, consider hashing 1234 items into a 17-cell hash table.) Call the number of items in a bucket its occupancy, and the maximum occupancy the number of items in the most-occupied bucket. What’s the smallest possible maximum occupancy?

9.115 Consider a function f : A → B. Fill in the blank with a statement relating |A| and |B|, and then prove the resulting claim: if ________, then, for some b ∈ B, we have |{a ∈ A : f(a) = b}| ≥ 202.
9.116 Suppose that we quantize a set of values from $S = \{1, 2, \ldots, n\}$ into $\{k_1, k_2, \ldots, k_5\} \subset S$. (See Example 2.56.) Namely, we choose these 5 values and then define a function $q : S \to \{k_1, k_2, \ldots, k_5\}$. The maximum error of this quantization is $\max_{x \in S} |x - q(x)|$. Use the Pigeonhole Principle (or the “the maximum must exceed the average” generalization) to determine the smallest possible maximum error.
Imagine a round-robin chess tournament for 150 players, each of whom plays 7 games. (In other words, each player is guaranteed to participate in precisely 7 games with 7 different opponents. Remember that each game has two players.)

9.117 There are 20 possible first moves for White in a chess game, and 20 possible first moves for Black in response. (See Example 9.15.) Prove that there must be two different games in the tournament that began with the same first two moves (one by White and one by Black).
9.118 Suppose that would-be draws in this tournament are resolved by a coin flip, so that every game has a winner and a loser. Prove that there must be two participants in such a tournament who have precisely the same sequence of wins and losses (for example, WWWLLLW).
A win–loss record reports a number of wins and a number of losses (for example, 6 wins and 1 loss, or 3 wins and 4 losses), without reference to the order of these results.
9.119 Continuing to suppose that there are no draws in this tournament, identify as large a value of k as you can for which the following claim is true, and prove that it’s true for your value of k: there is some win–loss record that is shared by at least k competitors.
9.120 Now suppose that draws are allowed, so that competitors have a win–loss–draw record (for example, 2 wins, 1 loss, and 4 draws). Identify the largest k for which there is some win–loss–draw record that is shared by at least k competitors, and prove that this claim holds for the k you’ve identified.
9.4 Combinations and Permutations
Not everything that can be counted counts, and not everything that counts can be counted.
William Bruce Cameron (1921–2002)
So far in this chapter, we’ve been working to develop a toolbox of general techniques for counting problems: the Sum Rule and Inclusion–Exclusion, the (Generalized) Product Rule, the Mapping Rule, and the Division Rule. This section will be different; instead of a new technique, here we will devote our attention to a particularly common kind of counting problem: the number of ways to choose a subset from a given set of candidate elements. Let’s start with an illustrative example:
Example 9.37 (Printing t-shirts)
Problem: Suppose you run a t-shirt shop. There is a collection of jobs that you’re asked to run, but there’s limited time so you must choose which ones to actually print. There are 17 requested jobs {a, b, . . . , q}, but there is only time to print 4 different jobs. How many ways are there to select 4 of these 17 candidate jobs?

Solution: There are two answers, depending on how we interpret the problem: does the order of the printed jobs matter, or does it only matter whether a job was printed? (Are we choosing an ordered 4-tuple? Or an unordered subset of size 4?)

Order matters: Then the Generalized Product Rule immediately gives us the answer: there are 17 choices for the first job, 16 for the second job, 15 for the third, and 14 for the fourth; thus there are 17 · 16 · 15 · 14 total choices.

Another way to write 17 · 16 · 15 · 14 is $\frac{17!}{13!}$: every multiplicand between 1 and 13 appears in both the numerator and denominator, leaving only {17, 16, 15, 14} uncancelled. We can justify the $\frac{17!}{13!}$ version of the answer using the Division Rule: we choose one of the 17! orderings of all 17 jobs, and then print the first 4 jobs in this order. But we’ve counted each 4-job ordering 13! times (once for each ordering of the 13 unprinted jobs), so we must divide by 13!.

Order doesn’t matter: As in the previous case, there are $\frac{17!}{13!}$ ways of choosing an ordered sequence of 4 jobs. Because order doesn’t matter, we have counted each set of four chosen jobs 4! times, once for each ordering of them. By the Division Rule, then, there are $\frac{17!}{13! \cdot 4!}$ ways of selecting 4 unordered jobs from a set of 17.
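Both answers are small enough to confirm by brute force. The following sketch, using only Python's standard library (`math.perm` and `math.comb` require Python 3.8 or later), compares the factorial formulas against a direct enumeration of ordered and unordered selections. The variable names are illustrative.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

jobs = "abcdefghijklmnopq"  # the 17 requested jobs a, b, ..., q

# The formulas from the example.
ordered = factorial(17) // factorial(13)
unordered = factorial(17) // (factorial(13) * factorial(4))

# Brute-force counts: list every ordered 4-tuple and every 4-element subset.
ordered_by_enumeration = sum(1 for _ in permutations(jobs, 4))
unordered_by_enumeration = sum(1 for _ in combinations(jobs, 4))

assert ordered == 17 * 16 * 15 * 14 == perm(17, 4) == ordered_by_enumeration
assert unordered == comb(17, 4) == unordered_by_enumeration
assert ordered == unordered * factorial(4)  # the Division Rule step

print(ordered, unordered)  # prints: 57120 2380
```

The last assertion restates the Division Rule step directly: each unordered set of 4 jobs corresponds to exactly 4! ordered sequences.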
Two different fundamental notions of choice are illustrated by Example 9.37: permutations, in which the order of the chosen elements matters, and combinations, in which the order doesn’t matter. These two notions will be our focus in this section. Here’s another example to further illustrate combinations:
Example 9.38 (Arranging letters of a bitstring)
Problem: How many different ways can you arrange the symbols in the “word” 000111? What about the “word” 00…011…1 containing k zeros and n − k ones?

Solution: This problem is just another application of the techniques we used for PERL and PEER and SMALLTALK in Example 9.30. (We can think of the word 000111 just like a word like DEEDED: two different letters, appearing three times each.) There are 6 total characters in the word, with each of the two symbols appearing 3 times, so the total number of arrangements is $\frac{6!}{3! \cdot 3!}$. (See Theorem 9.12.)

For the general version of the problem (the word 00…011…1, with k zeros and n − k ones) we have a total of n characters, so there are n! ways of writing them down. But k! orderings of the zeros, and (n − k)! orderings of the ones, are identical. Hence, by the Division Rule, the total number of orderings is $\frac{n!}{k! \cdot (n-k)!}$.
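For short words, the Division Rule argument can be checked by listing all n! orderings and discarding duplicates. In the sketch below, the helper names `distinct_arrangements` and `zeros_ones_formula` are hypothetical; the brute-force approach is practical only for small n, because it really does generate all n! permutations.

```python
from itertools import permutations
from math import factorial


def distinct_arrangements(word):
    """Count distinct orderings of `word` by brute force (small words only)."""
    return len(set(permutations(word)))


def zeros_ones_formula(n, k):
    """n! / (k! * (n - k)!): arrangements of k zeros and n - k ones."""
    return factorial(n) // (factorial(k) * factorial(n - k))


if __name__ == "__main__":
    print(distinct_arrangements("000111"), zeros_ones_formula(6, 3))
    print(distinct_arrangements("0011111"), zeros_ones_formula(7, 2))
```

For 000111, both counts come out to 720/36 = 20.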
Combinations
The quantity that we computed in Example 9.38 is called the number of combinations of k elements chosen from a set of n candidates, written $\binom{n}{k}$ (see Definition 9.2).

As we just argued in Example 9.38, the quantity $\binom{n}{k}$ denotes the number of ways to choose a k-element subset of a set of n elements. For convenience, define $\binom{n}{k} := 0$ whenever n < 0 or k < 0 or k > n: there are zero ways to choose a k-element subset of a set of n elements under these circumstances.
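A direct transcription of this convention into code might look like the following sketch. The function name `choose` is our own. For comparison, Python's built-in `math.comb(n, k)` already evaluates to 0 when k > n, but it raises a `ValueError` for negative arguments instead of returning 0.

```python
from math import factorial


def choose(n, k):
    """Return n choose k, using 0 for the out-of-range cases."""
    if n < 0 or k < 0 or k > n:
        return 0  # zero ways to choose under these circumstances
    return factorial(n) // (factorial(k) * factorial(n - k))


if __name__ == "__main__":
    print(choose(8, 2))   # an in-range case
    print(choose(7, 8))   # k > n: defined to be 0
    print(choose(-1, 0))  # n < 0: defined to be 0
```

Handling the boundary cases once, inside the function, means that callers never need their own checks before asking for a count.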
Taking it further: When there are annoying complications (or divide-by-zero errors or the like) in the boundary cases of a definition, it’s often easiest to tweak the definition to make those cases less special. (Here, for example, instead of having $\binom{7}{8}$ be undefined, we treat $\binom{7}{8}$ as 0.)
A similar idea in programming can make life much simpler when you encounter data structures with complicated edge conditions—for example, a node in a linked list that might not have a successor. A sentinel is a “fake” element that you might add to the boundary of a data structure that makes the edge elements of the data structure less special. For example, in image processing, we might augment an n-by-m image with an extra 0th and (m + 1)st column, and an extra 0th and (n + 1)st row, of blank pixels. Once these “border pixels” are added, every pixel in the image has a neighbor in each cardinal direction. Thus there’s no special code required for edge pixels in code to, for example, apply a blur filter to the image.
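The border-pixel idea can be sketched in a few lines. The code below is an illustration under our own assumptions, not a standard image-processing routine: an image is a list of rows of numbers, the hypothetical helper `pad_with_border` adds the sentinel rows and columns, and the “blur” simply averages each pixel with its four cardinal neighbors.

```python
def pad_with_border(image, fill=0):
    """Return a copy of an n-by-m image with a one-pixel border of `fill`."""
    m = len(image[0])
    blank_row = [fill] * (m + 2)
    return (
        [blank_row[:]]
        + [[fill] + list(row) + [fill] for row in image]
        + [blank_row[:]]
    )


def blur(image):
    """Average each pixel with its four cardinal neighbors."""
    padded = pad_with_border(image)
    n, m = len(image), len(image[0])
    result = []
    for r in range(1, n + 1):      # rows of the original image
        row = []
        for c in range(1, m + 1):  # columns of the original image
            total = (
                padded[r][c]
                + padded[r - 1][c] + padded[r + 1][c]
                + padded[r][c - 1] + padded[r][c + 1]
            )
            row.append(total / 5)
        result.append(row)
    return result
```

Notice that the inner loop contains no `if` statements for edge pixels: the sentinel border guarantees that every index it touches exists. One consequence of this particular design is that the blank border pixels are averaged in as zeros, so the edges of the blurred image come out darker. A different sentinel value would change that behavior without changing the loop.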
Definition 9.2 (Combinations)
Consider nonnegative integers n and k with k ≤ n. The quantity $\binom{n}{k}$ is defined as
$$\binom{n}{k} := \frac{n!}{k! \cdot (n-k)!},$$
and is read as “n choose k.”

(Margin note: The quantity $\binom{n}{k}$ is also sometimes called a binomial coefficient, for reasons that we’ll see in Section 9.4.3. It’s also sometimes denoted C(n, k) (“C” as in “Combination”).)

Here are a few small examples of counting problems that use combinations:

Example 9.39 (8-bit strings with 2 ones)
How many different 8-bit strings have exactly 2 ones?

We solved this precise problem in Example 9.3 using the Sum Rule, but combinations give us an easier way to answer this question. We must choose 2 out of 8 indices to make equal to one. There are
$$\binom{8}{2} = \frac{8!}{2! \cdot (8-2)!} = \frac{8!}{2! \cdot 6!} = \frac{8 \cdot 7}{2} = 28$$
such choices of indices, and thus $\binom{8}{2} = 28$ different 8-bit bitstrings with exactly 2 ones. These 28 strings are shown in Figure 9.29.

Figure 9.29: All 8-bit bitstrings with exactly 2 ones.
11000000
10100000
10010000
10001000
10000100
10000010
10000001
01100000
01010000
01001000
01000100
01000010
01000001
00110000
00101000
00100100
00100010
00100001
00011000
00010100
00010010
00010001
00001100
00001010
00001001
00000110
00000101
00000011
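The list in Figure 9.29 can be regenerated mechanically. The following Python sketch (the function name bitstrings_with_ones is hypothetical) chooses 2 of the 8 indices in every possible way using the standard library’s itertools.combinations, and compares the count against math.comb.

```python
from itertools import combinations
from math import comb


def bitstrings_with_ones(n, k):
    """List every n-bit string containing exactly k ones."""
    strings = []
    for one_positions in combinations(range(n), k):  # which indices become 1
        bits = ["0"] * n
        for i in one_positions:
            bits[i] = "1"
        strings.append("".join(bits))
    return strings


strings = bitstrings_with_ones(8, 2)
print(len(strings), comb(8, 2))  # 28 28
print(strings[0], strings[-1])   # 11000000 00000011
```

Since combinations yields index pairs in lexicographic order, the strings come out in the same order as in the figure.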

Example 9.40 (32-bit strings with < 3 ones)
How many different 32-bit strings have fewer than 3 ones?
We will use the Sum Rule, plus the formula for combinations. (We can partition the set of 32-bit strings that have fewer than 3 ones into those with 0, 1, or 2 ones.) Thus there are $\binom{32}{0} + \binom{32}{1} + \binom{32}{2} = 1 + 32 + \frac{32 \cdot 31}{2} = 1 + 32 + 496 = 529$ total such strings. (Recall that 0! = 1, so $\binom{32}{0} = \frac{32!}{0! \cdot (32-0)!} = \frac{32!}{0! \cdot 32!} = \frac{32!}{1 \cdot 32!} = \frac{32!}{32!} = 1$.)
Finally, here’s an example of counting using combinations that relates counting to probability. (There’s much more about probability in Chapter 10.) If we flip an unbiased coin (in other words, a coin that comes up heads with probability $\frac{1}{2}$ and tails with probability $\frac{1}{2}$ each time we flip it), then every sequence of coin flips is equally likely. The probability that an “event” E happens when we flip an unbiased coin is the fraction of possible flip sequences for which E actually occurs.
Example 9.41 (Exactly 50% heads)
Suppose we flip an unbiased coin 10 times. What is the probability that precisely 5 flips come up heads?
There are $2^{10} = 1024$ total sequences, of which $\binom{10}{5} = \frac{10!}{5! \cdot 5!} = 252$ have precisely 5 heads. Thus there’s a $\frac{252}{1024} \approx 0.2461$ chance of exactly half of the flips being heads.
9.4.1 Four Different Ways to Select k out of n Options
In Example 9.37, we saw two different ways in which we can imagine choosing a subset of k distinct elements from a set S of n candidates, depending on whether the order in which we choose those k elements matters. There is another dichotomy that can arise in counting problems: we can imagine circumstances in which we choose k elements from a set S, but where repetition is allowed (that is, we can choose the same element more than once). In other scenarios, repetition might not make sense. Here are some examples of all four situations (see also Figure 9.30):
• You order a two-scoop ice cream cone from a list of flavors. Order matters: a chocolate scoop on top of a mint scoop ≠ mint on top of chocolate. Repetition is allowed: you can choose vanilla for both scoops.
• Your soccer game is tied, and you must choose 5 of your 11 players to take penalty kicks to break the tie. Order matters: the kicks are taken in sequence, so Pelé then Maradona ≠ Maradona then Pelé. Repetition is forbidden: each player is allowed to take only one kick.
• You order a three-salad salad sampler from a list of salads. Order doesn’t matter: salads are served on a round plate, so it doesn’t matter which one is “first.” Repetition is allowed: you can choose the Caesar as two or all three of your salads.
• You select a starting lineup of 5 basketball players from your 13-person team. Order doesn’t matter: all 5 chosen players are equivalent in starting the game. Repetition is forbidden: you must choose five different players.
Figure 9.30: Four ways of choosing 2 elements from the candidates A, B, and C—depending on whether we can choose the same element more than once, and whether the order of choices matters.
  order matters, repetition allowed (9 ways): A, then A; A, then B; B, then A; A, then C; C, then A; B, then B; B, then C; C, then B; C, then C.
  order matters, repetition not allowed (6 ways): A, then B; B, then A; A, then C; C, then A; B, then C; C, then B.
  order irrelevant, repetition allowed (6 ways): A and A; A and B; A and C; B and B; B and C; C and C.
  order irrelevant, repetition not allowed (3 ways): A and B; A and C; B and C.
Here we will consider all four types of counting problems—ordered/unordered choice with/without repetition—and do a few examples. See Figure 9.31 for a summary of the number of ways to make these different types of choices.
When order matters and repetition is forbidden
Suppose that we choose a sequence of k distinct elements from a set S: that is, the order of the selected elements matters and repetition is not allowed.
(For example, in a player draft for a sports league, no player can be chosen more than once—“repetition is forbidden”—and the outcome of the draft depends not just on whether Babe Ruth was chosen, but also whether it was the Eagles or the Wildcats that selected him.) In other words, we make k successive selections from S, but no candidate can be chosen more than once. Such a sequence is sometimes called a k-permutation of S—an ordered sequence of k distinct elements of S. (Recall from Definition 9.1 that a permutation of a set S is an ordering of S’s elements.)
Margin note: Some people denote the number of ways of choosing an ordered sequence of k distinct selections from a set of n options by P(n, k), because “permutation” starts with “P.”
Figure 9.31: Four ways of selecting k of n items, and the number of ways to make that selection.
  repetition forbidden: order matters, $\frac{n!}{(n-k)!}$; order doesn’t matter, $\binom{n}{k}$.
  repetition allowed: order matters, $n^k$; order doesn’t matter, $\binom{n+k-1}{k}$.
There are $\frac{n!}{(n-k)!}$ different k-permutations of an n-element set S, by the Generalized Product Rule. (Specifically, there are
$$\underbrace{n}_{\text{choices of first element}} \cdot \underbrace{(n-1)}_{\text{choices of second element}} \cdot \; \cdots \; \cdot \underbrace{(n-k+1)}_{\text{choices of } k\text{th element}}$$
total choices, and $\frac{n!}{(n-k)!} = n \cdot (n-1) \cdot (n-2) \cdot \cdots \cdot (n-k+1)$.)
Example 9.42 (4 of 10)
Suppose that you are asked to place four of the cards {A♥, 2♥, …, 10♥} on the table, arranged from left to right in an order of your choosing. There are $10 \cdot 9 \cdot 8 \cdot 7 = \frac{10!}{(10-4)!}$ such arrangements: order matters (A234♥ ≠ 432A♥) and repetition is not allowed (4444♥ isn’t a valid arrangement, because you only have one 4♥ card).
When order matters and repetition is allowed
Suppose that we simply choose a sequence of k (not necessarily distinct) elements: that is, order matters and repetition is allowed. In other words, we make k successive selections from S, and we’re allowed to make the same choice multiple times.
(For example, suppose you and k − 1 friends go to a Chinese restaurant with n items on the menu, and each of you orders something for dinner. You’re allowed to order the same dish as your friends—“repetition is allowed”—but you getting the Tofu with Black Bean Sauce and your vegan friend getting Twice-Cooked Pork is definitely different from the other way around.)
Then there are $n^k$ different ways to make this choice, by the Product Rule: at every stage, there are n possible choices, and there are k stages.
Example 9.43 (4 of 10, a second way)
Suppose that you are asked to create a 4-digit integer. There are $10^4$ such integers: order matters (1234 ≠ 4321) and repetition is allowed (4444 is a valid 4-digit number).
When order doesn’t matter and repetition is forbidden
Suppose that we choose an unordered set of k distinct elements: that is, order does not matter and repetition is not allowed. (For example, suppose you and n − 1 friends enter a raffle in which k identical new cell phones will be given away. Each of you puts your name on one of n cards that are placed in a hat, and k cards are drawn to choose the winners. Cards for winners are not put back into the hat after they’re drawn, so nobody can win twice—“repetition is forbidden”—but Alice and Bob winning is the same as Bob and Alice winning.)
When we choose an unordered set of k distinct elements from a set of n options, there are $\binom{n}{k}$ different ways to make this choice, by the definition of combination. Such a subset is sometimes called a k-combination of S—an unordered set of k distinct elements of S. (Recall from Definition 9.2 that a combination of elements from a set S is precisely an unordered subset of elements from S.)
Example 9.44 (4 of 10, another way)
Suppose that you’re asked to create a 10-bit number with exactly 4 ones. You do so by starting with 0000000000 and choosing 4 indices to change from 0 to 1.
There are $\binom{10}{4}$ such bitstrings: the order in which you choose a bit to make a 1 doesn’t matter (changing bit #2 and then bit #7 to 1 yields the same bitstring as changing bit #7 and then bit #2 to 1) and repetition is not allowed (you have to change 4 different bits to 1).
When order doesn’t matter and repetition is allowed
While these three types of selecting k out of n elements are the most frequent, the fourth possibility can sometimes arise, too: order doesn’t matter but repetition is allowed. Let’s build some intuition for this case with a longer example:
Example 9.45 (Taking notes on six sheets of paper in three classes)
Problem: You discover that your school notebook has only k = 6 sheets of paper left in it. You are attending n = 3 different classes today: Archaeology (A), Buddhism (B), and Computer Science (C). How many ways are there to allocate your six sheets of paper across your three classes? (No paper splitting or hoarding: each sheet must be allocated to one and only one class!)
(Here’s another way to phrase the question: you must choose how many pages to assign to A, how many to B, and how many to C. That is, you must choose three nonnegative integers a, b, and c with a + b + c = 6. How many ways can you do it?)
Margin note (Problem-solving tip): When you encounter a problem that seems completely novel, run through the techniques you know about and try them on for size, even if they’re not an obvious fit. The type of counting in Example 9.45 doesn’t seem like it has a lot to do with combinations, but by changing the way you view this problem it can be transformed into a problem you’ve seen before.
Solution: The 28 ways of allocating your paper are shown in the following table, sorted by the number of pages allocated to Archaeology (and breaking ties by the number of pages allocated to Buddhism). The allocations are shown in three ways:
• Pages are represented by the class name.
• Pages are represented by ✷, with | marking divisions between classes: we allocate the number of pages before the first divider to A, the number between the dividers to B, and the number after the second divider to C.
• Pages are represented by 0, with 1 marking divisions between classes: as in the ✷-and-| representation, we allocate pages before the first 1 to A, those between the 1s to B, and those after the second 1 to C.
Here are the 28 different allocations:
  class names    A | B | C      0s and 1s
  AAAAAA         ✷✷✷✷✷✷||      00000011
  AAAAAB         ✷✷✷✷✷|✷|      00000101
  AAAAAC         ✷✷✷✷✷||✷      00000110
  AAAABB         ✷✷✷✷|✷✷|      00001001
  AAAABC         ✷✷✷✷|✷|✷      00001010
  AAAACC         ✷✷✷✷||✷✷      00001100
  AAABBB         ✷✷✷|✷✷✷|      00010001
  AAABBC         ✷✷✷|✷✷|✷      00010010
  AAABCC         ✷✷✷|✷|✷✷      00010100
  AAACCC         ✷✷✷||✷✷✷      00011000
  AABBBB         ✷✷|✷✷✷✷|      00100001
  AABBBC         ✷✷|✷✷✷|✷      00100010
  AABBCC         ✷✷|✷✷|✷✷      00100100
  AABCCC         ✷✷|✷|✷✷✷      00101000
  AACCCC         ✷✷||✷✷✷✷      00110000
  ABBBBB         ✷|✷✷✷✷✷|      01000001
  ABBBBC         ✷|✷✷✷✷|✷      01000010
  ABBBCC         ✷|✷✷✷|✷✷      01000100
  ABBCCC         ✷|✷✷|✷✷✷      01001000
  ABCCCC         ✷|✷|✷✷✷✷      01010000
  ACCCCC         ✷||✷✷✷✷✷      01100000
  BBBBBB         |✷✷✷✷✷✷|      10000001
  BBBBBC         |✷✷✷✷✷|✷      10000010
  BBBBCC         |✷✷✷✷|✷✷      10000100
  BBBCCC         |✷✷✷|✷✷✷      10001000
  BBCCCC         |✷✷|✷✷✷✷      10010000
  BCCCCC         |✷|✷✷✷✷✷      10100000
  CCCCCC         ||✷✷✷✷✷✷      11000000
All three versions of this table accurately represent the full set of 28 allocations, but let’s concentrate on the representation in the second and third columns—particularly the third. The 0-and-1 representation in the third column contains exactly the same strings as Figure 9.29, which listed all $28 = \binom{8}{2}$ of the 8-bit strings that contain exactly 2 ones.
In a moment, we’ll state a theorem that generalizes this example into a formula for the number of ways to select k out of n elements when order doesn’t matter but repetition is allowed. But, first, here’s a slightly different way of thinking about the result in Example 9.45 that may be more intuitive.
Suppose that we’re trying to allocate a total of k pages among n classes. Imagine placing the k pages into a three-ring binder along with n − 1 “divider tabs” (the kind that separate sections of a binder), as in Figure 9.32. There are now n + k − 1 things in your binder. (In Example 9.45, there were 6 pages and 2 dividers, so 8 total things are in the binder.) The ways of allocating the pages precisely correspond to the ways of ordering the things in the binder—that is, choosing which of the n + k − 1 things in the binder should be blank sheets of paper, and which should be dividers. So there are $\binom{n+k-1}{k}$ ways of doing so. In Example 9.45, we had n = 3 and k = 6, so there were $\binom{8}{6} = 28$ ways of doing this allocation.
Figure 9.32: Any ordering of 6 pieces of paper and 2 divider tabs defines three sections (before, between, and after the dividers).
While the description in Example 9.45 wasn’t stated in precisely these terms, our paper-allocation task was really a task about choosing with repetition: six times (once for each piece of paper), we select one of the elements of the set {A, B, C} of classes. We may select the same class as many times as we wish (“repetition is allowed”), and the pieces of paper are indistinguishable (“order doesn’t matter”). Here is the general statement of the number of ways to select k out of n elements for this scenario:
Theorem 9.17 (Choosing with repetition when order doesn’t matter)
The number of ways to select k out of n elements when order doesn’t matter but repetition is allowed is $\binom{n+k-1}{k}$.
Proof. We’ll give a proof based on the Mapping Rule. We can represent a particular choice of k elements from the set of n candidates as a sequence $x \in (\mathbb{Z}_{\ge 0})^n$ such that $\sum_{i=1}^{n} x_i = k$. (Specifically, $x_i$ tells us how many times we chose element i.) Define
$$X := \{x \in (\mathbb{Z}_{\ge 0})^n : \textstyle\sum_{i=1}^{n} x_i = k\} \quad \text{and} \quad S := \{x \in \{0,1\}^{n+k-1} : x \text{ contains exactly } n-1 \text{ ones and } k \text{ zeros}\}.$$
We claim that there is a bijection between X and S. Specifically, define f : X → S as
$$f(x_1, x_2, \ldots, x_n) = \underbrace{00\cdots0}_{x_1 \text{ times}} \; 1 \; \underbrace{00\cdots0}_{x_2 \text{ times}} \; 1 \; \cdots \; 1 \; \underbrace{00\cdots0}_{x_n \text{ times}}.$$
(This representation is precisely the one in Example 9.45.) It’s easy to see that f is a bijection: every element of S corresponds to one and only one element of X. As we argued in Example 9.38, the cardinality of S is $\binom{n+k-1}{k}$, and so, by the Mapping Rule, the cardinality of X is too.
Here’s another example of this type of choice:
Example 9.46 (4 of 10, one last way)
Suppose that you have decided to buy 4 total drinks for a group of 10 of your friends. (You may buy multiple drinks for the same friend.) You can think of lining your friends up and performing a total of 13 successive actions, each of which is either (a) buying a drink for the friend immediately in front of you, or (b) shouting “next!”. Of your 13 actions, 4 must be drink purchases. (The other 9 must be shouts of “next!”) There are $\binom{13}{4}$ ways to choose these actions.
Choosing k of n elements, summarized
We’ve now discussed four notions of choosing k elements from a set of n candidates, depending on whether we could choose the same option more than once and whether the order of our choices mattered:
• order matters and repetition is allowed: $n^k$ ways.
• order matters and repetition is forbidden: $\frac{n!}{(n-k)!}$ ways.
• order doesn’t matter and repetition is allowed: $\binom{n+k-1}{k}$ ways.
• order doesn’t matter and repetition is forbidden: $\binom{n}{k}$ ways.
(Or see Figure 9.31 for a summary.) We’ve also considered the same example—choosing 4 of 10 options—in each setting, and the number of ways to do so was different in each of the four different scenarios:
• order matters and repetition is allowed: $10{,}000 = 10^4$ ways.
• order matters and repetition is forbidden: $5040 = 10 \cdot 9 \cdot 8 \cdot 7$ ways.
• order doesn’t matter and repetition is allowed: $715 = \binom{13}{4}$ ways.
• order doesn’t matter and repetition is forbidden: $210 = \binom{10}{4}$ ways.
Taking it further: In CS, we frequently encounter tasks where we must identify the best solution from a set of possibilities. For example, we might want to find the longest increasing subsequence (LIS) of a sequence of n integers. A brute-force algorithm is one that solves the problem by literally trying every possible solution and selecting the best. (For LIS, there are $2^n$ subsequences, so this algorithm is very slow.) But if there’s a certain kind of structure and enough repetition in the subproblems that arise in a naïve recursive solution, a more advanced algorithmic design technique called dynamic programming can yield a much faster algorithm. And counting the number of subproblems—and the number of distinct subproblems!—is what establishes when algorithms using brute force or dynamic programming are good enough. See the discussion on p. 959.
9.4.2 Some Properties of $\binom{n}{k}$, and Combinatorial Proofs
Of the four ways of choosing k elements from n candidates that we explored in Section 9.4.1, perhaps the most common is the setting when order doesn’t matter and repetition is forbidden. In this section, we’ll explore some of the remarkable mathematical properties of the numbers—the values of $\binom{n}{k}$—that arise in this scenario.
The properties that we’ll prove here (and those that you’ll establish in the exercises) will be equalities of the form x = y for two expressions x and y. We’ll generally be able to give two very different styles of proof that x = y. One type of proof uses algebra, typically using the definition of $\binom{n}{k}$ and algebraic manipulations to show that x and y are equal. The other type of proof will be a more story-based approach, called a combinatorial proof, where we argue that x = y by explaining how x and y are really just two ways of looking at the same set.
The algebraic approach may appear more straightforward, but combinatorial proofs can be more fun. Here’s a first example:
Definition 9.3 (Combinatorial Proof)
A combinatorial proof establishes that two quantities x and y are equal by defining a set S and proving that |S| = x and |S| = y by counting |S| in two different ways.
Theorem 9.18 (A symmetry in choosing)
For any positive integer n and any integer k ∈ {0, 1, …, n}, we have $\binom{n}{k} = \binom{n}{n-k}$.
Proof #1 of $\binom{n}{k} = \binom{n}{n-k}$, via algebra. We simply follow our noses through the definition:
$$\begin{aligned}
\binom{n}{k} &= \frac{n!}{k! \cdot (n-k)!} && \text{definition of combinations} \\
&= \frac{n!}{(n-k)! \cdot k!} && \text{commutativity of multiplication} \\
&= \frac{n!}{(n-k)! \cdot (n-(n-k))!} && \text{antisimplification: } k = n - (n-k) \\
&= \binom{n}{n-k}. && \text{definition of combinations}
\end{aligned}$$
Here is a second proof of Theorem 9.18—this time a combinatorial proof. The basic idea is that we will construct a set S such that we can prove that $|S| = \binom{n}{k}$ and we can prove that $|S| = \binom{n}{n-k}$. (Thus we can conclude $\binom{n}{k} = \binom{n}{n-k}$.)
Proof #2 of $\binom{n}{k} = \binom{n}{n-k}$, via a combinatorial proof: Suppose that n students submit implementations of Bubble Sort in a computer science class. The instructor has k gold stars, and he will affix a gold star to each of k different implementations. Let S be the set of ways to affix gold stars. Here are two ways of computing |S|:
• First, we claim that $|S| = \binom{n}{k}$. Specifically, the instructor will choose k of the n submissions and affix gold stars to the k chosen elements. There are $\binom{n}{k}$ ways of doing so.
• Second, we claim that $|S| = \binom{n}{n-k}$. Specifically, the instructor will choose n − k of the n submissions that he will not adorn with gold stars. The remaining unchosen submissions will be adorned. There are $\binom{n}{n-k}$ ways of choosing the unadorned submissions.
But |S| is the same regardless of how we count it! So $\binom{n}{k} = |S| = \binom{n}{n-k}$ and the theorem follows.
(Another way to think about the combinatorial proof: an n-bit string with k ones is an n-bit string with n − k zeros; the number of choices for where the ones go is identical to the number of choices for where the zeros go.)
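The gold-star argument can be checked by brute force for small values. The Python sketch below (the function name complement_pairs is hypothetical, and submissions are simply numbered 0 through n − 1) lists every way to choose k starred submissions, takes the complement of each choice, and confirms that the complements are exactly the (n − k)-element subsets.

```python
from itertools import combinations
from math import comb


def complement_pairs(n, k):
    """Return all k-subsets of {0, ..., n-1} and the set of their complements."""
    universe = set(range(n))
    chosen = [frozenset(c) for c in combinations(range(n), k)]
    unchosen = {frozenset(universe - c) for c in chosen}
    return chosen, unchosen


chosen, unchosen = complement_pairs(6, 2)
all_4_subsets = {frozenset(c) for c in combinations(range(6), 4)}
print(len(chosen), len(unchosen), unchosen == all_4_subsets)  # 15 15 True
print(comb(6, 2) == comb(6, 4))                               # True
```

Taking complements pairs each starred set with exactly one unstarred set, which is the same "count one set in two ways" idea in executable form. It is a spot check for particular n and k, not a proof.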
A combinatorial proof requires creativity—what set S should we consider?—but the argument that the proof is correct is generally comparatively straightforward. Thus the challenge in proving an identity with a combinatorial proof is a challenge of narrative: we must find a story in which the two sides of the equation both capture the set described by that story.
Margin note (Problem-solving tip): The hard part in a combinatorial proof is coming up with a story that explains both sides of the equation. Understanding what the more complicated side of the equation means is often a good place to start.
Pascal’s Identity
Here’s another example claim with both algebraic and combinatorial proofs:
Theorem 9.19 (Pascal’s Identity)
For any integer n ≥ 1 and any k ∈ {0, 1, …, n}:
$$\binom{n-1}{k} + \binom{n-1}{k-1} = \binom{n}{k}.$$
Margin note: Pascal’s identity is named after Blaise Pascal, a 17th-century French mathematician. The programming language Pascal was also named in his honor.
Proof #1 of Pascal’s Identity (algebra). Observe that if k = 0 or k = n, the identity follows immediately: by definition, we have $\binom{n}{0} = 1 = 1 + 0 = \binom{n-1}{0} + \binom{n-1}{-1}$ and similarly $\binom{n}{n} = 1 = 0 + 1 = \binom{n-1}{n} + \binom{n-1}{n-1}$. For the non-boundary cases, we’ll manipulate the left-hand side until it’s equal to the right-hand side:
$$\begin{aligned}
\binom{n-1}{k} + \binom{n-1}{k-1}
&= \frac{(n-1)!}{k! \cdot (n-1-k)!} + \frac{(n-1)!}{(k-1)! \cdot (n-k)!} && \text{definition of combinations} \\
&= \frac{(n-1)!}{k! \cdot (n-1-k)!} \cdot \frac{n-k}{n-k} + \frac{(n-1)!}{(k-1)! \cdot (n-k)!} \cdot \frac{k}{k} && \text{multiplying by } 1 = \tfrac{x}{x} \\
&= \frac{(n-1)! \cdot (n-k)}{k! \cdot (n-k)!} + \frac{(n-1)! \cdot k}{k! \cdot (n-k)!} && (k-1)! \cdot k = k! \text{ and } (n-1-k)! \cdot (n-k) = (n-k)! \\
&= \frac{(n-1)! \cdot [(n-k) + k]}{k! \cdot (n-k)!} && \text{factoring} \\
&= \frac{n!}{k! \cdot (n-k)!} && n-k+k = n \text{, and } (n-1)! \cdot n = n! \\
&= \binom{n}{k}. && \text{definition of combinations}
\end{aligned}$$
Proof #2 of Pascal’s Identity (combinatorial proof). For the case of k = 0 or k = n, the argument is the same as in Proof #1. Otherwise, consider a set of n ≥ 1 employees, one of whom is named Babbage. How many ways can we select a subset of k different employees? Here are two different ways of counting the number of these subsets:
• We choose k of the n employees. There are $\binom{n}{k}$ ways to do so.
• We decide whether to include Babbage, and then fill in the rest of the team:
  – If we pick Babbage, we need to pick k − 1 further employees from the n − 1 other (non-Babbage) employees; thus there are $\binom{n-1}{k-1}$ ways to select a team that includes Babbage.
  – If we don’t pick Babbage, we pick all k employees from the n − 1 others; thus there are $\binom{n-1}{k}$ ways to select a team that does not include Babbage.
  By the Sum Rule, there are therefore $\binom{n-1}{k-1} + \binom{n-1}{k}$ ways to choose a team.
Because we’ve counted the cardinality of one set in two different ways, the two sizes must be equal. Therefore $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$ and the theorem follows.
Taking it further: World War II was perhaps the first major historical moment in which computer science—and, by the end of the war, the computer—was central to the story. The German military used a complex cryptographic device called the Enigma machine for encryption of military communication during the war. The Enigma machine, which was partially mechanical and partially electrical, had a large (though not unfathomably large) set of possible physical configurations, each corresponding to a different cryptographic “key.” Among the first applications of an electronic computer—and the reason that one of the first computers was designed and built in the first place—was in breaking these codes, in part by exhaustively exploring the set of possible keys. As such, understanding the number of different keys in the system (a counting problem!) was crucial to the Allies’ success in breaking the Enigma code. For more, see the discussion on p. 960.
9.4.3 The Binomial Theorem
The quantity $\binom{n}{k}$ is sometimes called a binomial coefficient, for reasons that we’ll see in this section. First, a reminder: the product of two binomials (x + y) and (a + b) is xa + xb + ya + yb.
(You may have once learned the “FOIL” mnemonic for the terms of the product: first = xa; outer = xb; inner = ya; and last = yb.)
Margin note: A binomial (Latin bi “two” + nom “name”) is a special kind of polynomial—poly “many” + nom “name”—that has precisely two terms.
Thus when we square x + y—that is, multiply it by itself—we get
$$(x+y) \cdot (x+y) = xx + xy + yx + yy = 1 \cdot x^2 + 2 \cdot xy + 1 \cdot y^2.$$
Observe that the three coefficients of these terms, in order, are $\langle 1, 2, 1 \rangle = \langle \binom{2}{0}, \binom{2}{1}, \binom{2}{2} \rangle$. The binomial theorem is a general statement of this pattern: when we multiply out the expression $(x+y)^n$, the coefficient of the $x^k y^{n-k}$ term is $\binom{n}{k}$:
Theorem 9.20 (The Binomial Theorem)
For any $a \in \mathbb{R}$, any $b \in \mathbb{R}$, and any $n \in \mathbb{Z}_{\ge 0}$, we have
$$(a+b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}.$$
Before we prove the binomial theorem, let’s start with some intuition about why these coefficients arise. For example, let’s compute $(x+y)^4 = (x+y) \cdot (x+y) \cdot (x+y) \cdot (x+y)$, without doing any simplification by combining like terms:
$$\begin{aligned}
(x+y) \cdot (x+y) \cdot (x+y) \cdot (x+y)
&= (xx + xy + yx + yy) \cdot (x+y) \cdot (x+y) \\
&= (xxx + xyx + yxx + yyx + xxy + xyy + yxy + yyy) \cdot (x+y) \\
&= xxxx + xyxx + yxxx + yyxx + xxyx + xyyx + yxyx + yyyx \\
&\quad + xxxy + xyxy + yxxy + yyxy + xxyy + xyyy + yxyy + yyyy.
\end{aligned}$$
Every term of the resulting expression consists of 4 multiplicands, one from each of the 4 copies of (x + y). How many of these 16 terms contain, say, 2 copies of x and 2 copies of y? There are 6—yyxx, xyyx, yxyx, xyxy, yxxy, and xxyy—which is just the number of elements of $\{x, y\}^4$ that contain precisely two copies of x. While the symbols are different, it’s easy to see that this quantity is precisely the number of elements of $\{0, 1\}^4$ that contain precisely two ones—which is just $\binom{4}{2}$.
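The term-counting argument above can be replayed in a few lines of Python. This is only a sketch: the function name expansion_term_counts is hypothetical, and each unsimplified term is modeled as a tuple holding one symbol picked from each copy of (x + y).

```python
from collections import Counter
from itertools import product
from math import comb


def expansion_term_counts(n):
    """Count the unsimplified terms of (x + y)^n by how many x's each contains."""
    counts = Counter()
    for term in product("xy", repeat=n):  # one symbol from each copy of (x + y)
        counts[term.count("x")] += 1
    return counts


counts = expansion_term_counts(4)
print([counts[k] for k in range(5)])   # [1, 4, 6, 4, 1]
print([comb(4, k) for k in range(5)])  # [1, 4, 6, 4, 1]

two_x_terms = ["".join(t) for t in product("xy", repeat=4) if t.count("x") == 2]
print(two_x_terms)  # ['xxyy', 'xyxy', 'xyyx', 'yxxy', 'yxyx', 'yyxx']
```

The six strings printed last are the same six terms listed in the text, in a different order.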
We will prove the Binomial Theorem in generality in a moment, but to build a little bit of intuition for the proof, let’s look at a special case first:
Example 9.47 (The coefficients of $(x+y)^3$)
We’re going to show that $(x+y)^3 = x^3 + 3x^2y + 3xy^2 + y^3$ in the same style that we’ll use in the full proof of the Binomial Theorem. We’ll start with the observation, made previously, that $(x+y)^2 = x^2 + 2xy + y^2 = \binom{2}{0}x^2 + \binom{2}{1}xy + \binom{2}{2}y^2$. A key step will make use of Theorem 9.19 to move from the coefficients of $(x+y)^2$ to the coefficients of $(x+y)^3$.
Margin note (Problem-solving tip): When you’re asked to solve a problem for a general value of n, one good way to get started is to try to solve it for a specific small value of n—and then try to generalize your solution to an arbitrary n. It’s often easier to generalize from a particular n to a general n than to give a fully general answer “from scratch.”
$$\begin{aligned}
(x+y)^3 &= (x+y) \cdot (x+y)^2 \\
&= (x+y) \cdot \left[\binom{2}{0}x^2 + \binom{2}{1}xy + \binom{2}{2}y^2\right] \\
&= \underbrace{\binom{2}{0}x^3 + \binom{2}{1}x^2y + \binom{2}{2}xy^2}_{x \cdot \left[\binom{2}{0}x^2 + \binom{2}{1}xy + \binom{2}{2}y^2\right]} + \underbrace{\binom{2}{0}x^2y + \binom{2}{1}xy^2 + \binom{2}{2}y^3}_{y \cdot \left[\binom{2}{0}x^2 + \binom{2}{1}xy + \binom{2}{2}y^2\right]}
\end{aligned}$$
which, collecting like terms, simplifies to
$$(x+y)^3 = \binom{2}{0}x^3 + \left[\binom{2}{1} + \binom{2}{0}\right]x^2y + \left[\binom{2}{2} + \binom{2}{1}\right]xy^2 + \binom{2}{2}y^3.$$
By Theorem 9.19, we have that $\binom{2}{1} + \binom{2}{0} = \binom{3}{1}$ and $\binom{2}{2} + \binom{2}{1} = \binom{3}{2}$, so
$$(x+y)^3 = \binom{2}{0}x^3 + \binom{3}{1}x^2y + \binom{3}{2}xy^2 + \binom{2}{2}y^3.$$
Because $\binom{n}{n} = 1$ and $\binom{n}{0} = 1$ for any n, we have that $\binom{2}{0} = \binom{3}{0}$ and $\binom{2}{2} = \binom{3}{3}$, and thus
$$(x+y)^3 = \binom{3}{0}x^3 + \binom{3}{1}x^2y + \binom{3}{2}xy^2 + \binom{3}{3}y^3 = x^3 + 3x^2y + 3xy^2 + y^3.$$
The combination notation can sometimes obscure the structure of the proof; for further intuition, here is what this proof looks like, without the notational overhead:
$$\begin{aligned}
(x+y)^3 &= (x+y) \cdot (x+y)^2 \\
&= (x+y) \cdot (x^2 + 2xy + y^2) \\
&= (x^3 + 2x^2y + xy^2) + (x^2y + 2xy^2 + y^3) \\
&= x^3 + (2+1)x^2y + (1+2)xy^2 + y^3 \\
&= x^3 + 3x^2y + 3xy^2 + y^3.
\end{aligned}$$
Proof of the Binomial Theorem
We’re now ready to give a proof of the general form of the Binomial Theorem. Our proof will use mathematical induction on the exponent, and the structure of the inductive case of the proof will precisely mimic that of Example 9.47.
Proof of Binomial Theorem. Let a and b be arbitrary real numbers. We wish to prove that, for any integer n ≥ 0,
$$(a+b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}.$$
We proceed by induction on n.
The base case (n = 0) is straightforward: anything to the 0th power is 1, so in particular $(a+b)^0 = 1$. And $\sum_{i=0}^{0} \binom{0}{i} a^i b^{0-i} = \binom{0}{0} \cdot 1 \cdot 1 = 1$.
For the inductive case (n ≥ 1), we assume the inductive hypothesis $(a+b)^{n-1} = \sum_{i=0}^{n-1} \binom{n-1}{i} a^i b^{n-1-i}$. We must prove that $(a+b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}$. Our proof echoes the structure of Example 9.47:
$$\begin{aligned}
(a+b)^n &= (a+b) \cdot (a+b)^{n-1} && \text{definition of exponentiation} \\
&= (a+b) \cdot \left[\sum_{i=0}^{n-1} \binom{n-1}{i} a^i b^{n-1-i}\right] && \text{inductive hypothesis} \\
&= a \cdot \left[\sum_{i=0}^{n-1} \binom{n-1}{i} a^i b^{n-1-i}\right] + b \cdot \left[\sum_{i=0}^{n-1} \binom{n-1}{i} a^i b^{n-1-i}\right] && \text{distributing the multiplication} \\
&= \left[\sum_{i=0}^{n-1} \binom{n-1}{i} a^{i+1} b^{n-1-i}\right] + \left[\sum_{i=0}^{n-1} \binom{n-1}{i} a^i b^{n-i}\right] && \text{distributing the multiplication, again} \\
&= \left[\sum_{j=1}^{n} \binom{n-1}{j-1} a^j b^{n-j}\right] + \left[\sum_{i=0}^{n-1} \binom{n-1}{i} a^i b^{n-i}\right]. && \text{reindexing the first summation } (j := i+1)
\end{aligned}$$
By separating out the i = 0 and j = n terms from the two summations, and then combining like terms, we have
$$\begin{aligned}
(a+b)^n &= \left[\sum_{j=1}^{n-1} \binom{n-1}{j-1} a^j b^{n-j}\right] + \left[\sum_{i=1}^{n-1} \binom{n-1}{i} a^i b^{n-i}\right] + \binom{n-1}{n-1} a^n b^{n-n} + \binom{n-1}{0} a^0 b^{n-0} \\
&= \left[\sum_{j=1}^{n-1} \left(\binom{n-1}{j-1} + \binom{n-1}{j}\right) a^j b^{n-j}\right] + \binom{n-1}{n-1} a^n b^{n-n} + \binom{n-1}{0} a^0 b^{n-0}.
\end{aligned}$$
Applying Theorem 9.19 to substitute $\binom{n}{j}$ for $\binom{n-1}{j} + \binom{n-1}{j-1}$ and using the fact that $\binom{n-1}{n-1} = 1 = \binom{n}{n}$ and $\binom{n-1}{0} = 1 = \binom{n}{0}$, we have
$$\begin{aligned}
(a+b)^n &= \left[\sum_{j=1}^{n-1} \binom{n}{j} a^j b^{n-j}\right] + \binom{n-1}{n-1} a^n b^{n-n} + \binom{n-1}{0} a^0 b^{n-0} && \tbinom{n}{j} = \tbinom{n-1}{j-1} + \tbinom{n-1}{j} \\
&= \left[\sum_{j=1}^{n-1} \binom{n}{j} a^j b^{n-j}\right] + \binom{n}{n} a^n b^{n-n} + \binom{n}{0} a^0 b^{n-0} && \tbinom{n-1}{n-1} = 1 = \tbinom{n}{n} \text{ and } \tbinom{n-1}{0} = 1 = \tbinom{n}{0} \\
&= \sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j}, && \text{incorporating the } j = 0 \text{ and } j = n \text{ terms back into the summation}
\end{aligned}$$
which proves the theorem.
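As a sanity check on the statement just proved (not a substitute for the induction), the following Python sketch evaluates the right-hand side of the Binomial Theorem for a few integer inputs and compares it with $(a+b)^n$ computed directly. The function name binomial_expansion and the particular test values are assumptions made for illustration; integers are used so that the comparison is exact.

```python
from math import comb


def binomial_expansion(a, b, n):
    """Right-hand side of the Binomial Theorem: sum of C(n, i) * a^i * b^(n - i)."""
    return sum(comb(n, i) * a**i * b**(n - i) for i in range(n + 1))


for a, b, n in [(2, 3, 5), (-1, 4, 7), (10, -10, 3), (1, 1, 10)]:
    lhs = (a + b) ** n
    rhs = binomial_expansion(a, b, n)
    print(a, b, n, lhs, rhs, lhs == rhs)
```

The last test case, with a = b = 1, evaluates the sum of all the binomial coefficients $\binom{10}{i}$, which comes out to $2^{10} = 1024$.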
9.4.4 Pascal’s Triangle
Much of this section has been devoted to understanding the binomial coefficients, through the Binomial Theorem and through combinatorial proofs of a number of their other properties. We’ll close our discussion of binomial coefficients with a visual representation of these quantities, called Pascal’s triangle. Pascal’s triangle arranges the binomial coefficients in a classical and very useful way: the nth row of Pascal’s triangle consists of all of the n + 1 binomial coefficients $\binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}$, in order. Figure 9.33 shows the first nine rows of Pascal’s triangle:
Margin note: Like Pascal’s identity, Pascal’s triangle is named after the 17th-century French mathematician Blaise Pascal.
Figure 9.33: The first several rows of Pascal’s triangle, in both “choose” notation and in numerical form. (Row n, for n = 0, 1, …, 8, lists $\binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}$; the numerical values are shown here.)
                  1
                1   1
              1   2   1
            1   3   3   1
          1   4   6   4   1
        1   5  10  10   5   1
      1   6  15  20  15   6   1
    1   7  21  35  35  21   7   1
  1   8  28  56  70  56  28   8   1
Many of the properties of the binomial coefficients that we’ve established previously can be seen by looking at patterns visible in Pascal’s triangle—as can some others that we’ll prove here, or that you’ll prove in the exercises.
For example, Figure 9.34 gives visualizations of two properties that we’ve already proven. Theorem 9.18 states that $\binom{n}{k} = \binom{n}{n-k}$; this theorem is reflected by the fact that the numerical values of Pascal’s triangle are symmetric around a vertical line drawn down through the middle of the triangle. And Theorem 9.19 (“Pascal’s Identity”), which states that $\binom{n-1}{k} + \binom{n-1}{k-1} = \binom{n}{k}$, is illustrated by the fact that each entry in Pascal’s triangle is the sum of the two elements immediately above it (up-and-left and up-and-right).
Figure 9.34: Theorems 9.18 and 9.19 reflected in Pascal’s triangle.
There are many other notable properties of the binomial coefficients, many of which we can see more easily by looking at Pascal’s triangle. Here’s one example; a number of other properties are left to you in the exercises. Let’s look at the row sums of Pascal’s triangle—that is, computing $\binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{n}$ for different values of n. (See Figure 9.35.) From calculating the row sum for a few small values of n, we see that the sum of the nth row appears to equal $2^n$. (Incidentally, the sum of the squares of the numbers in any particular row in Pascal’s triangle also has a special form, as you’ll see in Exercise 9.170.) Indeed, the power-of-two pattern for the row sums of Pascal’s triangle that we observe in Figure 9.35 holds for arbitrary n—and we’ll prove this theorem here, in several different ways.
Figure 9.35: The row sums of Pascal’s triangle.
  1 = 1
  1 + 1 = 2
  1 + 2 + 1 = 4
  1 + 3 + 3 + 1 = 8
  1 + 4 + 6 + 4 + 1 = 16
  1 + 5 + 10 + 10 + 5 + 1 = 32
  1 + 6 + 15 + 20 + 15 + 6 + 1 = 64
Theorem 9.21 (Sum of a row of Pascal’s triangle)
$$\sum_{i=0}^{n} \binom{n}{i} = 2^n.$$
Proof #1 (algebraic/inductive) [sketch]. We can gain a bit of intuition for this claim from Theorem 9.19 (Pascal’s Identity): each entry $\binom{n}{k}$ in the nth row is added into exactly two entries in the (n + 1)st row, namely $\binom{n+1}{k}$ and $\binom{n+1}{k+1}$.
Therefore the values in row #n of k k+1 Pascal’s triangle each contribute twice to the values in row #(n + 1), and therefore the (n + 1)st row’s sum is twice the sum of the nth row. This intuition can be turned into an inductive proof, which you’ll give in Exercise 9.169. Proof#2(combinatorial). LetS:={1,2,...,n}beasetwithnelements.Let’scountthe number of subsets of S in two different ways. On one hand, there are 2n such subsets: there is a bijection between subsets of S and |S|-bit strings. (See Lemma 9.10.) On the other hand, let’s account for the subsets of S by first choosing a size k of the subset, and then counting the number of subsets of that size. By the Sum Rule, the total number of subsets of S is exactly ∑n (the number of subsets of S of size k). k=0 By definition, there are exactly 􏰀nk􏰁 subsets of size k. Therefore the total number of subsets is ∑nk=0 􏰀nk􏰁. Thus 2n = ∑nk=0 􏰀nk􏰁. Proof#3(makingcleveruseoftheBinomialTheorem). We’llstartfromtheright-handside of the theorem statement, and begin with a completely unexpected, but obviously true, antisimplification: 2n =(1+1)n =∑n 􏰀ni􏰁1i1n−i i=0 =∑n 􏰀ni􏰁. i=0 obviously2=1+1;therefore2n =(1+1)n binomialtheorem 1k =1foranyvalueofk You’ll explore some of the many other interesting and useful properties of Pascal’s triangle, and of the binomial coefficients in general, in the exercises. 9.4. COMBINATIONSANDPERMUTATIONS 959 Traveling Salesman Problem (TSP): Input: A set C of n cities, and distance function d giving the driving time between any two cities. Output: An ordering π of C such that the sum of the driving times ∑i d(πi,πi+1) is minimized. Cheapest Vertical Seam (CVS): Input: An n-by-n grid of integers. Output: A path from the top row to the bottom row, moving in direc- tion {ւ, ↓, ց} at each step, such that the sum of the integers along the path is minimized. 
Figure 9.36: Two problems.

Computer Science Connections
Brute Force Algorithms and Dynamic Programming

In an optimization problem, we’re given a set S of valid solutions and some measure of quality f : S → ℝ, and asked to compute the element x ∈ S that’s the best according to f. (That is, we want to find the x ∈ S that optimizes f(x).) Two examples are shown in Figure 9.36: the traveling salesman problem (TSP), the problem solved every day by delivery drivers, who have to visit a given list of addresses and return to the depot; and the cheapest vertical seam (CVS) problem, which arises in a remarkable computer graphics application.5 (For an example of the latter problem, see Figure 9.37.)

5 Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. In ACM SIGGRAPH, 2007.

Figure 9.37: A small example of CVS.
9 8 7 1 9
7 3 2 9 1
2 8 5 6 9
4 7 5 3 4
3 8 2 8 1

For both TSP and CVS, there are very simple, but very slow, brute-force algorithms that solve the problem by computing the list of all possible solutions (all orderings of the cities; all top-to-bottom paths) and identifying the best of these possible solutions. It’s by now a reasonably straightforward counting exercise to show that there are n! orderings and between $2^n \cdot n$ and $3^n \cdot n$ paths (it takes some work to avoid counting paths that fall off the left/right edges of the grid). These running times are unimpressive (even n around 100 would require decades of computing time), and this is, more or less, the best known algorithm for TSP! (See p. 326.)

But we can do better for CVS, with another view of the problem. Given a grid G, define best(i, j) as the cost of the cheapest path from grid cell ⟨i, j⟩ to the bottom of the grid. Then we can solve the CVS problem using a recursive algorithm that computes best(i, j) for every cell ⟨i, j⟩, as in Figure 9.38. Unfortunately, this algorithm is just as slow as the brute-force approach: to compute best(i, j), we make three recursive calls, at least two of which remain inside the grid. Thus the running time T(i) to find best(n − i, j), with i rows beneath cell ⟨n − i, j⟩, is given by the recurrence T(1) = 1 and T(i) ≥ 2T(i − 1) + 1, which implies $T(n) \ge 2^n - 1$: exponential in n, and just as slow as before.

Figure 9.38: A recursive algorithm for CVS. (To solve CVS itself, return the smallest best(1, j) over all 1 ≤ j ≤ n.)
best(i, j): // Assume $G_{1 \ldots n, 1 \ldots n}$ is given.
  if j < 1 or j > n then
    return +∞ (outside the grid)
  else if i = n then
    return $G_{i,j}$ (in the last row)
  else
    return the minimum of: $G_{i,j}$ + best(i + 1, j − 1), $G_{i,j}$ + best(i + 1, j), $G_{i,j}$ + best(i + 1, j + 1).

But a key algorithmic observation is that the number of different cells in the grid is much smaller: there are only $n^2$ different cells! So, while the algorithm in Figure 9.38 does take $\Omega(2^n)$ time, it actually “should” require only $\Theta(n^2)$ time, as long as we avoid recomputing best(i, j) multiple times for the same value of ⟨i, j⟩! Once we’ve figured out best(3, 7) (because we needed that value to figure out best(2, 6)), we don’t bother recomputing best(3, 7) when we need it again (while we’re computing best(2, 7) and best(2, 8)); instead, we just remember the value and reuse it without doing any further computation.

The most straightforward way to implement this basic idea is called memoization: we build a data structure in which we check to see whether we’ve already stored the value of best(i, j) before computing the value via the three recursive calls, and we always add all values we compute to the data structure before returning them. A slightly more efficient way of implementing this idea is called dynamic programming, where we transform this recursive solution into one using loops, and build up the values of best(i, j) from the bottom up. (See Figure 9.39.) In general, dynamic programming is an algorithmic design technique that can save us a massive amount of computation, as long as the number of different problems encountered in the recursive solution is small.
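The two strategies just described can be sketched in Python. This is only an illustrative sketch, not the book's own code: the function names cvs_memo and cvs_dp are invented here, the grid is assumed to be a square list of lists, and indices run from 0 (Python's convention) rather than from 1 as in the figures.

```python
from functools import lru_cache
import math


def cvs_memo(grid):
    """Cheapest vertical seam cost via memoized recursion (0-indexed)."""
    n = len(grid)

    @lru_cache(maxsize=None)  # remembers best(i, j) once it has been computed
    def best(i, j):
        if j < 0 or j >= n:      # outside the grid
            return math.inf
        if i == n - 1:           # in the last row
            return grid[i][j]
        return grid[i][j] + min(best(i + 1, j - 1),
                                best(i + 1, j),
                                best(i + 1, j + 1))

    return min(best(0, j) for j in range(n))


def cvs_dp(grid):
    """Same quantity, built bottom-up with loops (dynamic programming)."""
    n = len(grid)
    # T[i][j] will hold best(i, j); the last row is already correct.
    T = [row[:] for row in grid]
    for i in range(n - 2, -1, -1):
        for j in range(n):
            # Slicing keeps only the in-range neighbors in row i + 1.
            below = T[i + 1][max(j - 1, 0):min(j + 2, n)]
            T[i][j] = grid[i][j] + min(below)
    return min(T[0])


example = [[3, 1, 4],
           [1, 5, 9],
           [2, 6, 5]]
print(cvs_memo(example), cvs_dp(example))
```

Both functions touch each of the n² cells a constant number of times; the memoized version pays some overhead for the cache lookups, which is the sense in which the loop-based version is "slightly more efficient."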
Figure 9.39: A dynamic programming algorithm for CVS.
CVS($G_{1 \ldots n, 1 \ldots n}$):
  for j := 1, ..., n:
    T[n, j] := $G_{n,j}$
  for i := n − 1, ..., 1:
    for j := 1, ..., n:
      T[i, j] := the minimum of: (Treat T[·, j] = ∞ if j is out of range.)
        $G_{i,j}$ + T[i + 1, j − 1], $G_{i,j}$ + T[i + 1, j], $G_{i,j}$ + T[i + 1, j + 1].
  return $\min_j$ T[1, j].

Computer Science Connections
The Enigma Machine and the First Computer

The Enigma machine was a physical cryptographic device used by the Germans during World War II to communicate between German high command and their military units in the field. The basic structure of the machine involved rotors and cables. A rotor was a 26-slot physical wheel that encoded a permutation π; when the wire corresponding to input i is active, the output wire corresponding to $\pi_i$ is active. A plugboard allowed an arbitrary matching of keys on the keyboard to the inputs to the rotors; a cable was what actually connected a key to the first rotor. (The machine did not require any cables in the plugboard; if there was no cable, then the key pressed was what went into the rotor in the first place.) The basic encryption in the Enigma machine proceeded as follows:

1. The user pressed a key, say A, on the keyboard. If there was a cable from the A key, then the key would be remapped to the other end of the cable; otherwise the procedure proceeded using the A. (See Figure 9.40.)
2. The pressed key was permuted by rotor #1; the output of rotor #1 was permuted by rotor #2; the output of rotor #2 was permuted by rotor #3. (See Figure 9.41.) The output of rotor #3 was “reflected” by a fixed permutation, and then the reflector’s output passed through the three rotors, in reverse order and backward: the output of the reflector was permuted by rotor #3, then by #2, and then by #1. (See Figure 9.42.)
3. A light corresponding to the output of rotor #1, passed through the plugboard cable if present, lights up; the illuminated letter is the encoding.

The tricky part is that the rotors rotate by one notch when the key is pressed, so that the encoding changes with every keypress. The “secret key” that the two communicating entities needed to agree upon was which rotors to use in which order (5 · 4 · 3 = 60; there were 5 standard rotors in an Enigma), what the initial position of the rotors should be ($26^3 = 17{,}576$), and what plugboard matching to use ($\frac{26!}{13! \cdot 2^{13}} \approx 8 \times 10^{12}$ choices if all 26 letters were matched; see Example 9.32). Interestingly, almost all of the complexity came from the plugboards.

Figure 9.40: The effect of the plugboard. Each of the 26 keys is either mapped to itself (like W here), or is matched with another key (like Q ↔ D here). Pressing an unmatched key x yields x itself; pressing a matched key x yields whatever letter is matched to x.

Figure 9.41: The effect of a rotor. Each rotor encodes a permutation of the letters; when the input letter i comes into the rotor, the output $\pi_i$ comes out. (Here, for example, an input B turns into an output of H.) After each keypress, the top portion of the rotor would rotate by one notch, so that B would now turn into G.

Figure 9.42: The Enigma machine’s operation. The operator types an A, which (after going through the plugboard) is permuted by rotor #1, rotor #2, rotor #3, the fixed permutation of the machine, rotor #3, rotor #2, and rotor #1. It then (after passing through the plugboard) lights up the output, Q. The rotors advance by one notch, and encoding continues with the next letter.

Perhaps surprisingly, the fact that there were so many possible settings for the Enigma led to the invention of one of the first programmable computers, by Alan Turing at Bletchley Park, in England, during the war. Turing built a machine that could test many of these configurations, by brute force. (If there were fewer possibilities, it could have been cracked by hand; if there were many more, it couldn’t have been cracked by brute force.) Turing and his team developed a device called the Bombe to exhaustively try to compute the shared German secret key, each day! Many other cryptographic tricks related to the way the Enigma was being used were also part of the analysis. For example, the construction of the device meant that no letter could encrypt to itself; this fact was exploited in the analysis. Another crucial part of the code breaking was a known plaintext attack on the Enigma: the British also used knowledge of what the Germans tended to communicate (like weather reports) to narrow their search.

9.4.5 Exercises

For two strings x and y, let’s call a shuffle of x and y any interleaving of the letters of the two strings (that maintains the order of the letters within each string, but may repeatedly alternate between blocks of x letters and blocks of y letters). For example, the words ALE and LID can be shuffled into ALLIED or ALLIDE or ALLIDE or LIDALE. How many different strings can be produced as shuffles of the following pairs of words?
9.121 BACK and FORTH
9.122 DAY and NIGHT
9.123 SUPPLY and DEMAND
9.124 LIFE and DEATH
9.125 ON and ON
9.126 OUT and OUT
9.127 (programming required) Write a program, in a language of your choice, that computes all shuffles of two given words x and y. A recursive approach works well: a shuffle consists either of the first character of x followed by a shuffle of $x_{2 \ldots |x|}$ and y, or the first character of y followed by a shuffle of x and $y_{2 \ldots |y|}$. (Be sure to eliminate any duplicates from your resulting list.)

The next few questions ask you to think about shuffles of generic strings, instead of particular words. (Assume that the alphabet is an arbitrarily large set; you are not restricted to the 26 letters in English.)
Consider two strings x and y, and let n := |x| + |y| be the total number of characters between them. Note that the number of distinct shuffles of x and y may depend both on the lengths of x and y and on the particular strings themselves; for example, if some letters are shared between or within the two strings, there may be fewer possible shuffles.
9.128 In terms of n, what is the maximum possible number of different shuffles of x and y?
9.129 In terms of n, what’s the minimum possible number of distinct shuffles of x and y?
9.130 What is the largest possible number of different shuffles of three strings of length a, b, and c?

9.131 How many 42-bit strings have exactly 16 ones?
9.132 How many 23-bit strings have at most 3 ones? (The coincidental arithmetic structure of the answer actually turns out to be helpful for error-correcting codes; see Exercise 4.30.)
9.133 How many 32-bit strings have a number of ones within ±2 of the number of zeros?
9.134 The set of 64-bit strings with k ones is largest for k = 32. What’s the smallest m for which (the number of 64-bit strings with ≤ m ones) ≥ (the number of 64-bit strings with exactly 32 ones)?
9.135 What is the smallest even integer n for which the following statement is true? If we flip an unbiased coin n times, as in Example 9.41, the probability that we get exactly n/2 heads is less than 10%.

A bridge hand consists of 13 cards from a standard 52-card deck, with 13 ranks (2 through ace) and 4 suits (♣, ♦, ♥, and ♠). (That is, the cards in the deck are {2, 3, ..., 10, J, Q, K, A} × {♣, ♦, ♥, ♠}.) How many different bridge hands are there that meet the following conditions?
9.136 A void in spades: a 13-card hand that contains only cards from the suits ♣, ♦, and ♥.
9.137 A singleton in hearts: exactly one of the 13 cards comes from the suit ♥.
9.138 All four kings.
9.139 No queens at all.
9.140 Exactly two jacks.
9.141 Exactly two jacks and exactly two queens.
9.142 A bridge hand has high honors if it contains the five highest-ranked cards {10, J, Q, K, A} in the same suit. How many bridge hands have high honors? (Warning: be careful about double counting!)

Many bridge players evaluate their hands by the following system of points. First, give yourself one high-card point for a jack, two for a queen, three for a king, and four for an ace. Furthermore, give yourself three distribution points for each void (a suit in which you have zero cards), two points for a singleton (a suit with one card), and one point for a doubleton (a suit with two cards).
9.143 How many bridge hands have a high-card point count of zero?
9.144 How many bridge hands have a high-card point count of zero and a distribution point count of zero? What fraction of all bridge hands is this?

How many ways are there to choose 32 out of 202 options if . . .
9.145 . . . repetition is allowed and order matters?
9.146 . . . repetition is forbidden and order matters?
9.147 . . . repetition is allowed and order doesn’t matter?
9.148 . . . repetition is forbidden and order doesn’t matter?

The first 10 prime numbers are {2, 3, 5, 7, 11, 13, 17, 19, 23, 29}. How many different integers have exactly . . .
9.149 . . . 5 prime factors (all from this set), where all of these factors are different?
9.150 . . . 5 prime factors (all from this set)? (Note that 32 = 2 · 2 · 2 · 2 · 2 is an example.)

How many different integers have exactly 10 prime factors . . .
9.151 . . . all of which come from the set of the first 20 prime numbers?
9.152 . . . all of which come from the set of the first 20 prime numbers, and where all 10 of these factors are different from each other?

Suppose that we have two sequences $\langle x_1, x_2, \ldots, x_n \rangle$ and $\langle y_1, y_2, \ldots, y_{2n} \rangle$ of data points, perhaps representing a sequence of intensities from two streams of speech. We wish to align x to y by matching up elements of x to elements of y.
(For example, y might represent a reference stream, where we’re trying to match x up to it.) We insist that each element of x is assigned to one and only one element of y. (See Figure 9.43.)
9.153 How many ways are there to assign each of the n elements of x to one of the 2n elements of y?
9.154 How many ways are there to assign each of the n elements of x to one of the 2n elements of y so that no element of y is matched to more than one element of x?

Figure 9.43: An alignment between two sequences, for Exercises 9.153–9.156. (Thanks to Roni Khardon, from whom I learned a version of the exercises.) (a) An alignment that doesn’t respect order. (b) An alignment that does respect order.

In many applications, we can only consider alignments of the elements of x and y that “maintain order”: that is, we can’t have $x_5$ assigned to an element of y that comes after the element assigned to $x_6$. (If f : {1, ..., n} → {1, ..., 2n} represents the alignment, then we require that i ≤ j implies that f(i) ≤ f(j).)
9.155 How many ways are there to assign each of the n elements of x to one of the 2n elements of y in a way that maintains order?
9.156 How many ways are there to assign each of the n elements of x to one of the 2n elements of y in a way that maintains order so that no element of y is matched to more than one element of x?

9.157 Consider the equation a + b + c = 202. How many solutions are there where a, b, and c are all nonnegative integers?
9.158 How many different solutions are there to the equation a + b + c + d + e = 8, where all of {a, b, c, d, e} have to be nonnegative integers?
9.159 What about for a + b + c + d + e = 88, again where all variables must be nonnegative integers?
9.160 What about for a + 2b + c = 128, again where a, b, and c must be nonnegative integers? (Hint: sum over the possible values of b and use Theorem 9.17.)

The Association for Computing Machinery (the ACM), a major professional society for computer scientists, puts on student programming competitions regularly. Teams of students spend a few hours working on some programming problems (of various levels of difficulty).
9.161 Suppose that, at a certain college in the Midwest, there are 141 computer science majors. A programming contest team consists of 3 students. How many ways are there to choose a team?
9.162 Suppose that, at a certain programming contest, teams are given 10 problems to try to solve. When the contest begins, each of the 3 members of the team has to choose a problem to think about first. (More than one team member can think about the same problem.) How many ways are there for the 3 team members to choose a problem to think about first?
9.163 In most programming contests, teams are scored by the number of problems they correctly solve. (There are tiebreakers based on time and certain penalties.) A team can submit multiple solutions to the same problem. Suppose that a particular team has calculated that they have time to code up and submit 20 different attempted answers to the 10 questions in the contest. How many different ways can they allocate their 20 submissions across the 10 problems? (The order of their submissions doesn’t matter.)
9.164 Solve the following problem, posed by Adi Shamir in his original paper on secret sharing:6
Eleven scientists are working on a secret project. They wish to lock up the documents in a cabinet so that the cabinet can be opened if and only if six or more of the scientists are present. What is the smallest number of locks needed? What is the smallest number of keys to the locks each scientist must carry?

6 See the discussion on p. 730, or Adi Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, November 1979.
9.165 In machine learning, we try to use a collection of training data (for example, a large collection of ⟨image, letter⟩ pairs of images of handwritten letters and the English letter that they represent) to compute a predictor that will do well on predicting answers on a set of novel test data. One danger in such a system is overfitting: we might build a predictor that’s overly affected by idiosyncrasies of the training data. One way to address the risk of overfitting is a technique called cross-validation: we divide the training data into several subsets, and then, for each subset S, train our predictor based on ∼S and test it on S. We might then average the parameters of our predictor across the subsets S. In ten-fold cross-validation on an n-element training set, we would split our n training examples into disjoint sets $S_1, S_2, \ldots, S_{10}$ where $|S_i| = \frac{n}{10}$.
How many ways are there to split an n-element set into disjoint subsets $S_1, S_2, \ldots, S_{10}$ of size $\frac{n}{10}$ each? (Note the order of the subsets themselves doesn’t matter, nor does the order of the elements within a subset.)
9.166 Consider the set of bitstrings $x \in \{0, 1\}^{n+k}$ with n zeros and k ones with the additional condition that no ones are adjacent. (For n = 3 and k = 2, for example, the legal bitstrings are 00101, 01001, 01010, 10001, 10010, and 10100.) Prove by induction on n that the number of such bitstrings is $\binom{n+1}{k}$.
9.167 Consider the set of bitstrings $x \in \{0, 1\}^{n+k}$ with n zeros and k ones with the additional condition that every block of ones has even length. (For n = 3 and k = 2, for example, the legal bitstrings are 00011, 00110, 01100, 11000.) Prove that, for any even k, the number of such bitstrings is $\binom{n + (k/2)}{n}$.
9.168 Prove that $k \cdot \binom{n}{k} = n \cdot \binom{n-1}{k-1}$ twice, using both an algebraic and a combinatorial proof.
9.169 Using induction on n, prove Theorem 9.21; that is, prove that
$\sum_{i=0}^{n} \binom{n}{i} = 2^n$.
9.170 Prove the following identity about the squares of the binomial coefficients:
$\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}$.
(For example, for n = 4, this identity states that $\binom{4}{0}^2 + \binom{4}{1}^2 + \binom{4}{2}^2 + \binom{4}{3}^2 + \binom{4}{4}^2 = 1^2 + 4^2 + 6^2 + 4^2 + 1^2 = 70$ is equal to $\binom{8}{4}$. And, indeed, $\binom{8}{4} = \frac{8!}{4! \cdot 4!} = 70$.) Use a combinatorial proof.
9.171 Prove the following identity by algebraic manipulation:
$\binom{n}{m} \binom{m}{k} = \binom{n}{k} \binom{n-k}{m-k}$.
9.172 Now prove the identity from Exercise 9.171 with a combinatorial proof. (Hint: think about choosing a team of m people from a pool of n candidates, and picking k managers from the team that you’ve chosen.)
9.173 Prove the following identity, using an algebraic, inductive, or combinatorial proof:
$\sum_{k=0}^{n} \binom{k}{m} = \binom{n+1}{m+1}$.
Recall that $\binom{a}{b} = 0$ for any b < 0 or b > a, so many of the terms of the summation are zero. For example, for m = 3 and n = 5, the claim states that $\binom{6}{4} = \binom{0}{3} + \binom{1}{3} + \binom{2}{3} + \binom{3}{3} + \binom{4}{3} + \binom{5}{3} = 0 + 0 + 0 + \binom{3}{3} + \binom{4}{3} + \binom{5}{3}$.
9.174 Prove the following identity about the binomial coefficients and the Fibonacci numbers (where $f_i$ is the ith Fibonacci number), by induction on n:
$\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n-k}{k} = f_{n+1}$.
9.175 Prove Vandermonde’s identity:
$\binom{n+m}{k} = \sum_{r=0}^{k} \binom{m}{k-r} \cdot \binom{n}{r}$.
(Hint: suppose you have a deck of n red cards and m black cards, from which you choose a hand of k total cards.)

A common subsequence of two strings x and y is a string z that’s a subsequence of both. A subsequence of an n-character string corresponds to a subset of {1, 2, . . . , n}, indicating which indices are included (and which aren’t). (See Exercise 9.82.) For example, BASIC is a common subsequence of BRAINSICKNESS and BIOACOUSTICS.
9.176 Suppose that you have been asked to find the number of common subsequences of two n-character strings $x, y \in \Sigma^n$, by brute force. An algorithm to do so is shown in Figure 9.44(a). How many times do we execute Line 3 (testing whether a = b)?
9.177 Using the fact that common subsequences must have the same length, we can modify the algorithm as shown in Figure 9.44(b). Now how many times do we execute Line 4 (testing whether a = b)?
9.178 Using Stirling’s approximation of the factorial function, which states that $n! \approx \sqrt{2 \pi n} \, (n/e)^n$ (where π = 3.1415 · · · and e = 2.7182 · · · ), argue that Figure 9.44(b) is an improvement on Figure 9.44(a).
9.179 Use the Binomial Theorem to prove the following identity:
$\sum_{k=0}^{n} (-1)^k \cdot \binom{n}{k} = 0$.
9.180 Use the Binomial Theorem to prove the following identity:
$\sum_{k=0}^{n} \frac{\binom{n}{k}}{2^k} = \left(\frac{3}{2}\right)^n$.
9.181 In Section 9.2.2, we introduced the Inclusion–Exclusion rule for counting the union of 2 or 3 sets:
|A ∪ B| = |A| + |B| − |A ∩ B|
|A ∪ B ∪ C| = |A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|
Exercise 9.30 asked you to give a formula for a 4-set intersection, but here’s a completely general solution:
$\left| \bigcup_{i=1}^{k} A_i \right| = \sum_{i=1}^{k} \left[ (-1)^{i+1} \cdot \sum_{j_1 < j_2 < \cdots < j_i} |A_{j_1} \cap A_{j_2} \cap \cdots \cap A_{j_i}| \right]$.

Pigeonhole principle: if |A| > |B|, and f : A → B, then there exist distinct a and a′ ∈ A such that f(a) = f(a′). That is, if there are more pigeons than holes, and we place the pigeons into the holes, then there must be (at least) one hole containing more than one pigeon.
Combinations and Permutations
Consider nonnegative integers n and k with k ≤ n. The quantity $\binom{n}{k}$ is defined as
$\binom{n}{k} := \frac{n!}{k! \cdot (n-k)!}$,
and is read as “n choose k.” The quantity $\binom{n}{k}$ denotes the number of ways to choose a k-element subset of a set of n elements, called a combination, when each element can be selected at most once and the order of the selected elements doesn’t matter. The quantity $\binom{n}{k}$ is also sometimes called a binomial coefficient.
Depending on whether we allow the same candidate to be chosen more than once and whether we care about the order in which the candidates are chosen, there are many versions of selecting k out of a set of n candidates:
• If the order of the selected elements doesn’t matter and repetition of the chosen elements is not allowed, then there are $\binom{n}{k}$ ways to choose.
• If order matters and repetition is not allowed, there are $\frac{n!}{(n-k)!}$ ways.
• If order matters and repetition is allowed, there are $n^k$ ways.
• If order doesn’t matter and repetition is allowed, there are $\binom{n+k-1}{k}$ ways.

A combinatorial proof establishes that two quantities x and y are equal by defining a set S and proving that |S| = x and |S| = y by counting |S| in two different ways. We can give combinatorial proofs of the following facts about the binomial coefficients, among others:
$\binom{n}{k} = \binom{n}{n-k}$,   $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$,   $\sum_{i=0}^{n} \binom{n}{i} = 2^n$.

The binomial theorem states that, for any a, b ∈ ℝ and any $n \in \mathbb{Z}^{\ge 0}$,
$(a + b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}$.
We can prove the binomial theorem by induction on the exponent n.

Many of the interesting properties of the binomial coefficients can be seen by looking at patterns visible in Pascal’s triangle, which arranges the binomial coefficients so that the nth row contains the n + 1 binomial coefficients $\binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}$. See Figure 9.45 for the first few rows of Pascal’s triangle.

$\binom{0}{0}$
$\binom{1}{0}$ $\binom{1}{1}$
$\binom{2}{0}$ $\binom{2}{1}$ $\binom{2}{2}$
$\binom{3}{0}$ $\binom{3}{1}$ $\binom{3}{2}$ $\binom{3}{3}$
$\binom{4}{0}$ $\binom{4}{1}$ $\binom{4}{2}$ $\binom{4}{3}$ $\binom{4}{4}$
$\binom{5}{0}$ $\binom{5}{1}$ $\binom{5}{2}$ $\binom{5}{3}$ $\binom{5}{4}$ $\binom{5}{5}$
$\binom{6}{0}$ $\binom{6}{1}$ $\binom{6}{2}$ $\binom{6}{3}$ $\binom{6}{4}$ $\binom{6}{5}$ $\binom{6}{6}$
Figure 9.45: The first several rows of Pascal’s triangle.
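As a small computational companion to Figure 9.45, the following Python sketch builds the rows of Pascal's triangle using only Pascal's identity (each interior entry is the sum of the two entries above it) and then checks the three facts listed above on the rows it produces. The function name pascal_rows is an illustrative choice, not something from the text, and checking a few rows is of course no substitute for the proofs.

```python
import math


def pascal_rows(num_rows):
    """Return rows 0 .. num_rows - 1 of Pascal's triangle as lists of ints."""
    rows = []
    for n in range(num_rows):
        if n == 0:
            row = [1]
        else:
            prev = rows[-1]
            # Pascal's identity: C(n, k) = C(n-1, k-1) + C(n-1, k).
            row = [1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1]
        rows.append(row)
    return rows


rows = pascal_rows(9)
for n, row in enumerate(rows):
    assert row == [math.comb(n, k) for k in range(n + 1)]  # matches n!/(k!(n-k)!)
    assert row == row[::-1]                                 # symmetry: C(n,k) = C(n,n-k)
    assert sum(row) == 2 ** n                               # row sum is 2^n
    print(row)
```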

Key Terms and Results

Key Terms

Counting Unions and Sequences
• Sum Rule
• Product Rule
• double counting
• Inclusion–Exclusion
• Generalized Product Rule
• permutation

Using Functions to Count
• Mapping Rule
• Division Rule
• pigeonhole principle

Combinations and Permutations
• combinations
• $\binom{n}{k}$ / binomial coefficient
• permutations
• binomial theorem
• combinatorial proof
• Pascal’s triangle

Key Results

Counting Unions and Sequences
1. The Sum Rule: if the sets $A_1, A_2, \ldots, A_k$ are all disjoint, then $\left| \bigcup_{i=1}^{k} A_i \right| = \sum_{i=1}^{k} |A_i|$. The Inclusion–Exclusion Rule allows us to handle nondisjoint sets; for example, for any sets A, B we have |A ∪ B| = |A| + |B| − |A ∩ B|.
2. The Product Rule: $|A_1 \times A_2 \times \cdots \times A_k| = \prod_{i=1}^{k} |A_i|$. For any set S and any $k \in \mathbb{Z}^{\ge 1}$, we have $|S^k| = |S|^k$.
3. The Generalized Product Rule: if S is a set of sequences of length k, where, for each choice of the first i − 1 components of the sequence, there are exactly $n_i$ choices for the ith component, then $|S| = \prod_{i=1}^{k} n_i$.

Using Functions to Count
1. The Mapping Rule: an onto function f : A → B means |A| ≥ |B|; a one-to-one function f : A → B means |A| ≤ |B|; and a bijection f : A → B means |A| = |B|.
2. For any set S, $|\mathcal{P}(S)| = 2^{|S|}$.
3. The Division Rule: if f : A → B satisfies |{a ∈ A : f(a) = b}| = k for all b ∈ B, then |A| = k · |B|.
4. The number of ways to arrange a sequence containing elements $\{x_1, \ldots, x_k\}$, where $x_i$ appears $n_i$ times, is $\frac{(n_1 + n_2 + \cdots + n_k)!}{(n_1!) \cdot (n_2!) \cdot \cdots \cdot (n_k!)}$.
5. Pigeonhole principle: if f : A → B and |A| > |B|, then there exist a ∈ A and a′ ∈ A with a′ ≠ a such that f(a) = f(a′).

Combinations and Permutations
1. There are four versions of selecting k out of n candidates, depending on whether the order of the chosen elements matters and whether we can choose the same element twice. (See Figure 9.31.) The binomial coefficient $\binom{n}{k}$ denotes the number of ways to choose when repetition is forbidden and order doesn’t matter (called combinations).
2. Some useful properties: $\binom{n}{k} = \binom{n}{n-k}$ and $\binom{n-1}{k} + \binom{n-1}{k-1} = \binom{n}{k}$ and $\sum_{i=0}^{n} \binom{n}{i} = 2^n$.
3. The binomial theorem: $(a + b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}$.
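The four versions of selecting k out of n candidates can be checked by brute force for small n and k. The sketch below is illustrative only: the function names count_by_enumeration and count_by_formula are invented for this example. It enumerates the selections with Python's standard itertools module and compares the counts with the four formulas $n^k$, $\frac{n!}{(n-k)!}$, $\binom{n}{k}$, and $\binom{n+k-1}{k}$.

```python
import itertools
import math


def count_by_enumeration(n, k):
    """Count selections of k out of n candidates by listing every one."""
    items = range(n)
    return {
        ("order matters", "repetition allowed"):
            sum(1 for _ in itertools.product(items, repeat=k)),
        ("order matters", "repetition forbidden"):
            sum(1 for _ in itertools.permutations(items, k)),
        ("order doesn't matter", "repetition forbidden"):
            sum(1 for _ in itertools.combinations(items, k)),
        ("order doesn't matter", "repetition allowed"):
            sum(1 for _ in itertools.combinations_with_replacement(items, k)),
    }


def count_by_formula(n, k):
    """The same four counts, computed from the closed-form formulas."""
    return {
        ("order matters", "repetition allowed"): n ** k,
        ("order matters", "repetition forbidden"):
            math.factorial(n) // math.factorial(n - k),
        ("order doesn't matter", "repetition forbidden"): math.comb(n, k),
        ("order doesn't matter", "repetition allowed"): math.comb(n + k - 1, k),
    }


for version, count in count_by_enumeration(5, 3).items():
    print(version, count, count_by_formula(5, 3)[version])
```

Enumeration takes time proportional to the number of selections, so it is only practical for small inputs; the formulas are what make a question like "32 out of 202 options" answerable.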

10 Probability
In which our heroes evade threats and conquer their fears by flipping coins, rolling dice, and spinning the wheels of chance.

10.1 Why You Might Care
Fortune can, for her pleasure, fools advance, And toss them on the wheels of Chance.
Juvenal (c. 55–c. 127)
This chapter introduces probability, the study of randomness. Our focus, as will be no surprise by this point of the book, is on building a formal mathematical framework for analyzing random processes. We’ll begin with a definition of the basics of probability: defining a random process that chooses one particular outcome from a set of possibilities (any one of which occurs some fraction of the time). We’ll then analyze the likelihood that a particular event occurs; in other words, asking whether the chosen outcome has some particular property that we care about. We then consider independence and dependence of events, and conditional probability: how, if at all, does knowing that the randomly chosen outcome has one particular property change our calculation of the probability that it has a different property? (For example, perhaps 90% of all email is spam. Does knowing that a particular email contains the word ENLARGE make that email more than 90% likely to be spam?) Finally, we’ll turn to random variables and expectation, which give quantitative measurements of random processes: for example, if we flip a coin 1000 times, how many heads would we see (on average)? How many runs of 10 or more consecutive heads? Probabilistic questions are surprisingly difficult to have good intuition about; the focus of the chapter will be on the tools required to rigorously settle these questions.
Probability is relevant almost everywhere in computer science. One broad application is in randomized algorithms to solve computational problems. In the same way that the best strategy to use in a game of rock–paper–scissors involves randomness (throw rock 1/3 of the time, throw paper 1/3 of the time, throw scissors 1/3 of the time), there are some problems (for example, finding the median element of an unsorted array, or testing whether a given large integer is a prime number) for which the best known algorithm (the fastest, the simplest, the easiest to understand, . . . ) proceeds by making random choices. The same idea occurs in data structures: a hash table is an excellent data structure for many applications, and it’s best when it assigns elements to (approximately) random cells of a table. (See Section 10.1.1.) Randomization can also be used for symmetry breaking: we can ensure that 1000 identical drones do not clog the airwaves by all trying to communicate simultaneously: each drone will choose to try to communicate at a random time. And we can generate more realistic computer graphics of flame or hair or, say, a field of grass by, for each blade, randomly perturbing the shape and configuration of an idealized piece of grass.
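To make the idea of a randomized algorithm concrete, here is a hedged Python sketch of one well-known approach to the median problem mentioned above, usually called randomized selection or "quickselect." It is offered only as an illustration of an algorithm that proceeds by making random choices; the text does not say which algorithm it has in mind, and the function names quickselect and median are choices made for this example.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element of values (k = 0 is the minimum)."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k is out of range")
    while True:
        pivot = random.choice(items)          # the random choice
        smaller = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        larger = [v for v in items if v > pivot]
        if k < len(smaller):
            items = smaller                   # answer is among the smaller values
        elif k < len(smaller) + len(equal):
            return pivot                      # the pivot itself is the answer
        else:
            k -= len(smaller) + len(equal)    # skip past everything <= pivot
            items = larger


def median(values):
    """Lower median of a nonempty collection, found without fully sorting."""
    return quickselect(values, (len(values) - 1) // 2)


print(median([7, 1, 5, 3, 9]))
```

The answer returned is always correct; what the randomness affects is the running time, which depends on how evenly the randomly chosen pivots split the data. Analyzing quantities like that is exactly what the tools of this chapter are for.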
As a rough approximation, we can divide probabilistic applications in CS into two broad categories: those uses in which the randomness is internally generated by our algorithms or data structures, and those cases in which the randomness comes “from the outside.” The first type we discussed above. In the latter category, consider circumstances in which we wish to build some sort of computational model that addresses some real-world phenomenon. For example, we might wish to model social behavior (a social network of friendships), or traffic on a road network or on the internet, or to build a speech recognition system. Because these applications interact with extremely complex real-world behaviors, we will typically think of them as being generated according to some deterministic (nonrandom) underlying rule, but with hard-to-model variation that is valuably thought of as generated by a random process. In systems for speech recognition, it works well to treat a particular “frame” of the speech stream (perhaps tens of milliseconds in duration) as a noisy version of the sound that the speaker intended to produce, where the noise is essentially a random perturbation of the intended sound.
Finally, you should care about probability because any well-educated person must understand something about probability. You need probability to understand political polls, weather forecasting, news reports about medical studies, wagers that you might place (either with real money or by choosing which of two alternatives is a better option), and many other subjects. Probability is everywhere!
10.1.1 Hashing: A Running Example
Throughout this chapter, we will consider a running sequence of examples that are about hash tables, a highly useful data structure that also conveniently illustrates a wide variety of probabilistic concepts. So we’ll start here with a short primer on hash tables. (See also p. 267, or a good textbook on data structures.)
A hash table is a data structure that stores a set of elements in a table T[1 . . . m]— that is, an array of size m. (Remember that, throughout this book, arrays are indexed starting at 1, not 0.) The set of possible elements is called the universe or the keyspace. We will be asked to store in this table a particular small subset of the keyspace. (For example, the keyspace might be the set of all 8-letter strings; we might be asked to store the user IDs of all students on campus.) We use a hash function h to determine in which cell of the table T[1 . . . m] each element will be stored. The hash function h takes elements of the keyspace as input, and produces as output an index identifying a cell in T. To store an element x in T using hash function h, we compute h(x) and place x into the cell T[h(x)]. (We say that the element x hashes to the cell T[h(x)].)
We must somehow handle collisions, when we’re asked to store two different elements that hash to the same cell of T. We will usually consider the simplest solution, where we use a strategy called chaining to resolve collisions. To implement chaining, we store all elements that hash to a cell in that cell, in an unsorted list. Thus, to find whether an element y is stored in the hash table T, we look one-by-one through the list of elements stored in T[h(y)].
Example 10.1 (A small hash table)
Let the keyspace be {1, 2, 3, 4}, and consider a 2-cell hash table with the hash function h given by h(x) = (x mod 2) + 1. (Thus h(1) = h(3) = 2 and h(2) = h(4) = 1.)
• If we store the elements {1, 4}, then the table would be T[1] = [4] and T[2] = [1].
• If we store the elements {2, 4}, then the table would be T[1] = [2, 4] and T[2] = [].
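The two tables in Example 10.1 can be reproduced with a short sketch of chaining. This is only an illustration, not a production hash table: the helper names `build_table` and `contains` are hypothetical, and index 0 of the Python list is left unused so that cells are numbered 1 through m as in the text.

```python
def h(x):
    """Hash function from Example 10.1: maps a key to cell 1 or cell 2."""
    return (x % 2) + 1


def build_table(keys, m, hash_fn):
    """Store keys in a table T[1..m], resolving collisions by chaining."""
    # Index 0 is unused so that cells are numbered 1..m, as in the text.
    table = [[] for _ in range(m + 1)]
    for x in keys:
        table[hash_fn(x)].append(x)   # append x to the unsorted list in its cell
    return table


def contains(table, hash_fn, y):
    """Look one-by-one through the list stored in cell T[h(y)] only."""
    return y in table[hash_fn(y)]


T = build_table([1, 4], 2, h)
print(T[1], T[2])   # [4] [1]
T = build_table([2, 4], 2, h)
print(T[1], T[2])   # [2, 4] []
```

Note that a lookup for y never inspects any cell other than T[h(y)], which is why the lengths of the individual lists matter so much in the analysis of hash tables.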

More formally, we are given a finite set K called the keyspace, and we are also given a positive integer m representing the table size. We will base the data structure on a hash function h : K → {1, . . . , m}. For the purposes of this chapter, we choose h randomly, specifically choosing the hash function so that each function from K to {1, . . . , m} is equally likely to be chosen as h.
Let’s continue our above example with a randomly chosen hash function. For the moment, we’ll treat the process of randomly choosing a hash function informally. (The precise definitions of what it means to choose randomly, and what it means for certain “events” to occur, will be given in the following sections.)
Example 10.2 (A small hash table)
As before, let K = {1, 2, 3, 4} and m = 2. There are $m^{|K|} = 2^4 = 16$ different functions h : K → {1, 2}, and each of these functions is equally likely to be chosen. (The functions are listed in Figure 10.1.) Each of these functions is chosen a $\frac{1}{16}$ fraction of the time. Thus:
• a $\frac{6}{16} = \frac{3}{8}$ fraction of the time, the hash function is “perfectly balanced” (that is, it hashes an equal share of the keys to each cell). (These functions are marked with a ‘B’ in Figure 10.1.)
• a $\frac{8}{16} = \frac{1}{2}$ fraction of the time, we have h(4) = h(1). (These functions are marked with an ‘A’ in Figure 10.1.)
• a $\frac{1}{16}$ fraction of the time, the hash function hashes every element of K into cell #2. (This one function is marked with a ‘C’ in Figure 10.1.)
Figure 10.1: All functions from {1, 2, 3, 4} to {1, 2}. Each row is a different function h; the ith column records the value of h(i). The letters mark some functions as described in Example 10.2.
h(1) h(2) h(3) h(4)   marks
1    1    1    1      A
1    1    1    2
1    1    2    1      A
1    1    2    2      B
1    2    1    1      A
1    2    1    2      B
1    2    2    1      A B
1    2    2    2
2    1    1    1
2    1    1    2      A B
2    1    2    1      B
2    1    2    2      A
2    2    1    1      B
2    2    1    2      A
2    2    2    1
2    2    2    2      A C
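The three fractions in Example 10.2 can be checked by brute force. The following is a minimal sketch, assuming that each function h is represented as a tuple (h(1), h(2), h(3), h(4)); the helper name `fraction_where` is hypothetical.

```python
from fractions import Fraction
from itertools import product

K = [1, 2, 3, 4]
m = 2

# Each tuple (h(1), h(2), h(3), h(4)) is one function h : K -> {1, ..., m}.
functions = list(product(range(1, m + 1), repeat=len(K)))


def fraction_where(condition):
    """Fraction of the equally likely hash functions that satisfy `condition`."""
    matches = [f for f in functions if condition(f)]
    return Fraction(len(matches), len(functions))


balanced = fraction_where(lambda f: f.count(1) == f.count(2))   # marked 'B'
same_cell = fraction_where(lambda f: f[3] == f[0])              # h(4) = h(1), marked 'A'
all_in_two = fraction_where(lambda f: all(v == 2 for v in f))   # marked 'C'

print(len(functions), balanced, same_cell, all_in_two)   # 16 3/8 1/2 1/16
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the values in the example.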
Taking it further: In practice, the function h will not be chosen completely at random, for a variety of practical reasons (for example, we’d have to write down the whole function to remember it!), but throughout this chapter we will model hash tables as if h is chosen completely randomly. The assumption that the hash function is chosen randomly, with every function K → {1, 2, . . . , m} equally likely to be chosen, is called the simple uniform hashing assumption. It is very common to make this assumption when analyzing hash tables.
It may be easier to think of choosing a random hash function using an iterative process instead: for every key x ∈ K, we choose a number $i_x$ uniformly at random and independently from {1, 2, . . . , m}. (The definitions of “uniformly” and “independently” are coming in the next few sections. Informally, this description means that each number in {1, 2, . . . , m} is equally likely to be chosen as $i_x$, regardless of what choices were made for previous numbers.) Now define the function h as follows: on input x, output $i_x$. One can prove that this process is completely identical to the process illustrated in Example 10.2: write down every function from K to {1, 2, . . . , m} (there are $m^{|K|}$ of them), and pick one of these functions at random.
After we’ve chosen the hash function h, a set of actual keys $\{x_1, \ldots, x_n\} \subseteq K$ will be given to us, and we will store the element $x_i$ in the table slot $T[h(x_i)]$. Notice that the only randomly determined quantity is the hash function h. Everything else (the keyspace K, the table size m, and the set of to-be-stored elements) is fixed.
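The iterative description of choosing h (one independent, uniform choice per key) translates directly into code. This is a sketch under the simple uniform hashing assumption, not how practical hash functions are built; the name `random_hash_function` and the seed are arbitrary choices made for this illustration.

```python
import random


def random_hash_function(keyspace, m, rng):
    """Choose i_x uniformly and independently from {1, ..., m} for each key x."""
    choices = {x: rng.randint(1, m) for x in keyspace}   # randint is inclusive
    return lambda x: choices[x]                          # on input x, output i_x


rng = random.Random(2018)   # arbitrary seed, used only for reproducibility
K = [1, 2, 3, 4]
m = 2
trials = 100_000

same = 0
for _ in range(trials):
    h = random_hash_function(K, m, rng)
    if h(4) == h(1):
        same += 1

# Example 10.2 says that h(4) = h(1) a 1/2 fraction of the time,
# so the observed frequency should be close to 0.5.
print(same / trials)
```

The dictionary `choices` is exactly the “written down” function mentioned in the text: storing it takes space proportional to the size of the keyspace, which is one reason that truly random hash functions are a modeling device rather than a practical tool.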

10.2 Probability, Outcomes, and Events
Anyone who does not know how to make the most of his luck has no right to complain if it passes by him.
Miguel de Cervantes (1547–1616)
This section will give formal definitions of the fundamental concepts in probability, giving us a framework to use in thinking about the many computational applications that involve chance. These definitions are somewhat technical, but they’ll allow us to reason about some fairly sophisticated probabilistic settings fairly quickly.
10.2.1 Outcomes and Probability
Here’s the very rough outline of the relevant definitions; we’ll give more details in a moment. Imagine a scenario in which some quantity is determined in some random way. We will consider a set S of possible outcomes. Each outcome has an associated probability, which is a number between 0 and 1. The set S is called the sample space.
In any particular result of this scenario, one outcome from S is selected randomly (by “nature”); the frequency with which a particular outcome is chosen is given by that outcome’s associated probability. (Sometimes we might talk about the process by which a sequence of random quantities is selected, and the realization as the actual choice made according to this process.) For example, for flipping an unweighted coin we would have S = {Heads, Tails}, where Heads has probability 0.5 and Tails has probability 0.5. Our particular outcome might be Heads.
Here are the formal definitions:
Warning! It is very rare to have good intuition or instincts about probability questions. Try to hold yourself back from jumping to conclusions too quickly, and instead use the systematic approaches to probabilistic questions that are introduced in this chapter.
Definition 10.1 (Outcomes and sample space)
An outcome of a probabilistic process is the sequence of results for all randomly determined quantities. (An outcome can also be called a realization of the probabilistic process.) The sample space S is the set of all outcomes.
Definition 10.2 (Probability function)
Let S be a sample space. A probability function Pr : S → R describes, for each outcome s ∈ S, the fraction of the time that s occurs. (We denote probabilities using square brackets, so the probability of s ∈ S is written Pr [s].) We insist that the following two conditions hold of the probability function Pr:
$$\sum_{s \in S} \Pr[s] = 1 \qquad (10.1)$$
$$\Pr[s] \geq 0 \text{ for all } s \in S. \qquad (10.2)$$
Intuitively, condition (10.1) says that something has to happen: when we flip a coin, then either it comes up heads or it comes up tails. (And so Pr [Heads] + Pr [Tails] = 1.) The other condition, (10.2), formalizes the idea that Pr [s] denotes the fraction of the time that the outcome s occurs: the least frequently that an outcome can occur is never.
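Conditions (10.1) and (10.2) are easy to check mechanically for a finite sample space. The sketch below assumes that a probability function is stored as a Python dictionary from outcomes to exact fractions; the function name `is_probability_function` is hypothetical.

```python
from fractions import Fraction


def is_probability_function(pr):
    """Check conditions (10.1) and (10.2) for a dict mapping outcomes to probabilities."""
    sums_to_one = sum(pr.values()) == 1               # condition (10.1)
    nonnegative = all(p >= 0 for p in pr.values())    # condition (10.2)
    return sums_to_one and nonnegative


fair_coin = {"Heads": Fraction(1, 2), "Tails": Fraction(1, 2)}
# The values below sum to 1, but one of them is negative, so (10.2) fails.
broken = {"Heads": Fraction(3, 2), "Tails": Fraction(-1, 2)}

print(is_probability_function(fair_coin))   # True
print(is_probability_function(broken))      # False
```

The second dictionary shows why both conditions are needed: summing to 1 alone does not rule out a negative “probability.”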

The probability function Pr is also sometimes called a probability distribution over S. (This function “distributes” one unit of probability across the set S of all possible outcomes, as in (10.1).)
Taking it further: Bizarrely, in quantum computation—an as-yet-theoretical type of computation based on quantum mechanics—we can have outcomes whose probabilities are not restricted to be real numbers between 0 and 1. This model is (very!) difficult to wrap one’s mind around, but a computer based on this idea turns out to let us solve interesting problems, and faster than on “normal” computers. For example, we can factor large numbers efficiently on a quantum computer. (Though we don’t know how to build quantum computers of any nontrivial size.) See p. 1016 for some discussion.
A few examples: cards, coins, and words
Here are a few examples of sample spaces with probabilities naturally associated with each outcome:
Example 10.3 (One card from the deck)
We draw one card from a perfectly shuffled deck of 52 cards. Then we can denote the sample space as S = {2, 3, …, 10, J, Q, K, A} × {♣, ♦, ♥, ♠}. Each card c ∈ S has $\Pr[c] = \frac{1}{52}$. Note that condition (10.1) is satisfied because
$$\sum_{c \in S} \Pr[c] = \sum_{c \in S} \frac{1}{52} = 52 \cdot \frac{1}{52} = 1,$$
and (10.2) is obviously satisfied because $\Pr[c] = \frac{1}{52} \geq 0$ for each c.
Example 10.4 (Coin flips)
You flip a quarter and Bill Gates flips a platinum trillion-dollar coin. Assume that both coins are fair (equally likely to come up Heads and Tails) and that flips of the quarter and the platinum coin do not affect each other in any way. Then the four outcomes are—writing the quarter’s result first—⟨Heads, Heads⟩, ⟨Heads, Tails⟩, ⟨Tails, Heads⟩, and ⟨Tails, Tails⟩. Each of these four outcomes has probability 0.25.
Example 10.5 (A word on the page)
Consider the following sentence, which (excluding spaces and the final period) contains a total of 29 symbols (namely N, o, w, i, s, t, . . . , t):
Now is the winter of our discontent.
We are going to select a word from this sentence, according to the following process: choose one of the 29 non-space symbols from the sentence with equal likelihood; the selected word is the one in which the selected symbol appears. (Thus longer words will be chosen more frequently than shorter words, because longer words contain more symbols—and are therefore more likely to be selected.)
“Now is the winter of our discontent / Made glorious summer by this sun of York; / And all the clouds that lour’d upon our house / In the deep bosom of the ocean buried.”
William Shakespeare (1564–1616), King Richard III
The sample space is S = {Now, is, the, winter, of, our, discontent}. There are 3 + 2 + 3 + 6 + 2 + 3 + 10 = 29 total symbols, and thus $\Pr[\text{Now}] = \frac{3}{29}$, $\Pr[\text{is}] = \frac{2}{29}$, and so on, through $\Pr[\text{discontent}] = \frac{10}{29}$. Again, the conditions for being a probability are satisfied: each outcome’s probability is nonnegative, and $\sum_{w \in S} \Pr[w] = 1$.
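The probabilities in Example 10.5 follow mechanically from the word lengths, as this small sketch shows. It assumes that the sentence is stored without its final period, so that only the 29 letters are counted; the variable names are arbitrary.

```python
from fractions import Fraction

sentence = "Now is the winter of our discontent"   # final period omitted
words = sentence.split()
total = sum(len(w) for w in words)                 # total number of symbols

# A word is selected exactly when one of its own symbols is selected.
pr = {w: Fraction(len(w), total) for w in words}

print(total, pr["Now"], pr["discontent"], sum(pr.values()))   # 29 3/29 10/29 1
```

A dictionary keyed by word works here only because all seven words are distinct; a sentence with a repeated word would need its probabilities accumulated per word instead.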

Examples 10.3 and 10.4 are scenarios of uniform probability, in which each outcome in the sample space is chosen with equal likelihood. (Specifically, each s ∈ S has probability $\Pr[s] = \frac{1}{|S|}$.) Example 10.5 illustrates nonuniform probability, in which some outcomes occur more frequently than others.
Note that for a single sample space S, we can have many distinct processes by which we choose an outcome from S. For example:
Example 10.6 (Two ways of choosing from S = {0, 1, 2, . . . , 7})
One process for selecting an element of S is to flip three fair coins and treat their results as a binary number (HHH = 111 → 7, HHT = 110 → 6, …, TTT = 000 → 0). This process gives a uniform distribution over S: each sequence of coin flips occurs with the same probability. For example, $\Pr[4] = \frac{1}{8} = 0.125$ and $\Pr[7] = \frac{1}{8} = 0.125$.
A second process for selecting an element of S is to flip 7 fair coins and to let the outcome be the number of heads that we see in those 7 flips (HHHHHHH → 7, HHHHHHT → 6, HHHHHTH → 6, …, TTTTTTT → 0). This process gives a nonuniform distribution over S, because the number of sequences that have k heads is different for different values of k. For example:
$$\Pr[4] = \frac{\binom{7}{4}}{2^7} = \frac{35}{128} \approx 0.2734, \quad \text{but} \quad \Pr[7] = \frac{\binom{7}{7}}{2^7} = \frac{1}{128} \approx 0.0078.$$
As a word of warning, notice that probabilistic statements about a particular realization don’t make sense; the only kind of probabilistic statement that makes sense is a statement about a probabilistic process. If you happen to be one of the ≈ 10% of the population that’s red–green colorblind, and a friend says “what are the odds that you’re colorblind!?”, the correct answer is: the probability is 1 (because it happened!).
10.2.2 Events
Many of the probabilistic questions that we’ll ask are about whether the realization has some particular property, rather than whether a single particular outcome occurs. For example, we might ask for the probability of getting more heads than tails in 1000 flips of a fair coin. Or we might ask for the probability that a hand of seven cards (dealt from a perfectly shuffled deck) contains at least two pairs. There may be many different outcomes in the sample space that have the property in question. Thus, often we will be interested in the probability of a set of outcomes, rather than the probability of a single outcome. Such a set of outcomes is called an event:
Definition 10.3 (Event)
Let S be a sample space with probability function Pr. An event is a subset of S. The probability of an event E is the sum of the probabilities of the outcomes in E, and it is written $\Pr[E] = \sum_{s \in E} \Pr[s]$.
The probability of an event E ⊆ S follows by a probabilistic version of the Sum Rule, from counting: because one (and only one) outcome is chosen in a particular realization, the probability of either outcome x or y occurring is Pr [x] + Pr [y].
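The two processes of Example 10.6, and the sum in Definition 10.3, can both be checked by enumerating every equally likely sequence of coin flips. This is a sketch only: the helper names are hypothetical, and the event E = {4, 5, 6, 7} (“the outcome is at least 4”) is an illustrative choice that does not appear in the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def distribution(num_coins, outcome_of):
    """Exact distribution of outcome_of(flips) over all equally likely flip sequences."""
    counts = Counter(outcome_of(flips) for flips in product("HT", repeat=num_coins))
    total = 2 ** num_coins
    return {s: Fraction(c, total) for s, c in counts.items()}


def binary_value(flips):
    """Read the flips as a binary number, with H = 1 and T = 0."""
    return int("".join("1" if f == "H" else "0" for f in flips), 2)


def number_of_heads(flips):
    return flips.count("H")


def pr_event(pr, event):
    """Definition 10.3: sum the probabilities of the outcomes in the event."""
    return sum(pr[s] for s in event)


three_coins = distribution(3, binary_value)       # first process: uniform
seven_coins = distribution(7, number_of_heads)    # second process: nonuniform
E = {4, 5, 6, 7}                                  # illustrative event

print(three_coins[4], seven_coins[4], seven_coins[7])        # 1/8 35/128 1/128
print(pr_event(three_coins, E), pr_event(seven_coins, E))    # 1/2 1/2
```

The two distributions disagree on individual outcomes, yet this particular event happens to have probability 1/2 under both: half of the 3-bit numbers are at least 4, and by symmetry 7 flips show at least 4 heads exactly half of the time.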
Note that the notation in Definition 10.3 generalizes the function Pr by allowing us to write either elements of S or subsets of S as inputs to Pr. That is, previously we considered a function Pr : S → [0, 1]; we have now “extended” our notation so that it’s a function Pr : P(S) → [0, 1]. (To be more precise, we’re actually extending the notation to be a function Pr : (S ∪ P(S)) → [0, 1], because we’re still letting ourselves write outcomes as arguments too.)
A few examples
Here are a few examples of events and their probabilities:
Example 10.7 (At least one head)
You and Bill Gates each flip fair coins, as in Example 10.4. Define the event H = {⟨Heads, Heads⟩, ⟨Heads, Tails⟩, ⟨Tails, Heads⟩} as “at least one coin comes up heads.” Then Pr [H] = 0.25 + 0.25 + 0.25 = 0.75.
Example 10.8 (Aces up)
Problem: Suppose that you draw one card from a perfectly shuffled deck, as in Example 10.3. What is the probability that you draw an ace?
Solution: The event in question is E = {A♣, A♦, A♥, A♠}. Each of these four outcomes has a probability of $\frac{1}{52}$, so $\Pr[E] = \frac{1}{52} + \frac{1}{52} + \frac{1}{52} + \frac{1}{52} = \frac{4}{52} = \frac{1}{13}$.
Our mixture of Pr [outcome] and Pr [event] is an abuse of notation; we’re mixing the type of input willy nilly. But, because Pr [x] for an outcome x and Pr [{x}] for the singleton event {x} are identical, we can write probabilities this way without risk of confusion.
Example 10.9 (Full house)
Problem: You’re dealt 5 cards from a shuffled deck, so that each set of 5 cards is equally likely to be your hand. A hand is a full house if 3 cards share one rank, and the other 2 cards share a second rank. (For example, the hand 3♥, 3♠, 9♥, 9♣, 3♣ is a full house.) What’s the probability of being dealt a full house?
Solution: There are $\binom{52}{5}$ possible hands, each of which is dealt with probability $1 / \binom{52}{5}$. Thus the key question is a counting question: how many full houses are there?
We can compute this number using the Generalized Product Rule; specifically, we can view a full house as the result of the following sequence of selections:
• we choose the rank of which to have three of a kind;
• we choose which 3 of the 4 cards of that rank are in the hand;
• we choose the rank of the pair (any of the 12 remaining ranks); and
• we choose which 2 of the 4 cards of that rank are in the hand.
Thus there are $\binom{13}{1} \cdot \binom{4}{3} \cdot \binom{12}{1} \cdot \binom{4}{2}$ full houses, and the probability of a full house is
$$\frac{\binom{13}{1} \cdot \binom{4}{3} \cdot \binom{12}{1} \cdot \binom{4}{2}}{\binom{52}{5}} = \frac{3744}{2598960} \approx 0.00144.$$
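The arithmetic in Example 10.9 can be double-checked with Python’s built-in binomial coefficient function. This sketch assumes Python 3.8 or later (for `math.comb`); the variable names are arbitrary.

```python
from fractions import Fraction
from math import comb

# One factor per selection in the Generalized Product Rule:
# rank of the triple, 3 of its 4 cards, rank of the pair, 2 of its 4 cards.
full_houses = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)
hands = comb(52, 5)
probability = Fraction(full_houses, hands)

print(full_houses, hands)              # 3744 2598960
print(round(float(probability), 5))    # 0.00144
```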
Here’s a slightly more complex example, with multiple events of interest:

Taking it further: Section 10.2.3 has been devoted to tree diagrams, a systematic way of analyzing probabilistic settings in which a sequence of random choices is made. Typically we think of (or at least model) these random choices as being made “by nature”: if you flip a coin, you act as though the universe “chooses” (via microdrafts of wind, the precise topology of the ground where the coin bounces, etc.) whether the coin will come up Heads or Tails.
But, in many scenarios in computer science, we want to generate the randomness ourselves, perhaps in a program: choose a random element of the set A; go left with probability $\frac{1}{2}$ and go right with probability $\frac{1}{2}$; generate a random 8-symbol password. The process of actually generating a sequence of “random” numbers on a computer is difficult, and (perhaps surprisingly) very closely tied to notions of cryptographic security. A pseudorandom generator is an algorithm that produces a sequence of bits that seem to be random, at least to someone examining the sequence of generated bits with limited computational power. It turns out that building a difficult-to-break encryption system is in a sense equivalent to building a difficult-to-distinguish-from-random pseudorandom generator.1
For more, see:
1 Oded Goldreich. Foundations of Cryptography. Cambridge University Press, 2006.
10.2.4 Some Common Probability Distributions
We’ll end this section by spending a few words on some of the common probabilistic processes (and therefore some common probability distributions) that arise in computer science applications.
Uniform distribution
Under the uniform distribution, every outcome is equally likely. We can define a uniform distribution for any finite sample space S:
Definition 10.4 (Uniform distribution)
Let S be a finite sample space. Under the uniform distribution, the probability of any particular outcome s ∈ S is given by $\Pr[s] = \frac{1}{|S|}$.
Some familiar examples of the uniform distribution include:
• flipping a fair coin ($\Pr[\text{Heads}] = \Pr[\text{Tails}] = \frac{1}{2}$).
• rolling a fair 6-sided die ($\Pr[1] = \Pr[2] = \Pr[3] = \Pr[4] = \Pr[5] = \Pr[6] = \frac{1}{6}$).
• choosing one card from a shuffled deck ($\Pr[c] = \frac{1}{52}$ for any card c).
Note that, if outcomes are chosen uniformly at random, then the probability of an event is simply its fraction of the sample space. That is, for any event E ⊆ S, we have $\Pr[E] = \frac{|E|}{|S|}$.
Taking it further: We often make use of a uniform distribution in randomized algorithms. For example, in randomized quicksort or randomized select applied to an array A[1 . . . n], a key step is to choose a “pivot” value uniformly at random from A, and then use the chosen value to guide subsequent operation of the algorithm. (See Exercises 10.24–10.27.)
Bernoulli distribution
The next several distributions are related to “flipping coins” in various ways. “Coin flipping” is a common informal way of referring to any probabilistic process in which we have one or more trials, where each trial has the same “success probability,” also known as “getting heads.” We will refer to flipping an actual coin as a coin flip, but we will also refer to other probabilistic processes that succeed with some fixed probability as a coin flip. We will consider a (possibly) biased coin: that is, a coin that comes up heads with probability p, and comes up tails with probability 1 − p. The coin is called fair if $p = \frac{1}{2}$; that is, if the probability distribution is uniform. We can call the coin p-biased when Pr [heads] = p. It’s important that the result of one trial has no effect on the success probability of any subsequent trial. (That is, these flips are independent; see Section 10.3.)
The first coin-related distribution is simply the one associated with a single trial:
Definition 10.5 (Bernoulli distribution)
The Bernoulli distribution with parameter p is the probability distribution that results from flipping one p-biased coin. Thus the sample space is {H, T}, where Pr [H] = p and Pr [T] = 1 − p.
The Bernoulli distribution is named after Jacob Bernoulli, a 17th-century Swiss mathematician.
Taking it further: Imagine a sequence of Bernoulli trials performed with p = 0.01, and another sequence of Bernoulli trials performed with p = 0.48. The former sequence will consist almost entirely of zeros; the latter will be about half zeros and about half ones. There’s a precise technical sense in which the second sequence contains more information than the first, measured in terms of the entropy of the sequence. See p. 1017 for some discussion.
Binomial distribution
A somewhat more interesting distribution results from considering a sequence of flips of a biased coin. Consider the following probabilistic process: perform n flips of a p-biased coin, and then count the number of heads in those flips. The binomial distribution with parameters n and p is a distribution over the sample space {0, 1, …, n}, where Pr [k] denotes the probability of getting precisely k heads in those flips. Figure 10.6 shows several examples of binomial distributions, for different settings of the parameters n and p. Each panel of Figure 10.6 shows the probability Pr [k] of getting precisely k heads in n flips of a p-biased coin, for each k in the sample space.
Figure 10.6: Several binomial distributions, for different values of n and p. (Plots omitted; each panel shows Pr [k] against k. The panels are (a) n = 10, p = 0.5; (b) n = 15, p = 0.5; (c) n = 20, p = 0.5; (d) n = 10, p = 0.25; (e) n = 10, p = 0.75; (f) n = 10, p = 0.85.)
If we flip a p-biased coin n times, what is the probability of the event of getting exactly k heads? For example, consider the outcome
$$\underbrace{HH \cdots H}_{k \text{ times}} \; \underbrace{TT \cdots T}_{n-k \text{ times}}.$$
The probability of this outcome is $p^k \cdot (1-p)^{n-k}$: the first k flips must come up heads, and the next n − k flips must come up tails. In fact, any ordering of k heads and n − k tails has probability $p^k \cdot (1-p)^{n-k}$. One way to see this fact is by imagining the probability tree, which is a binary tree with left branches (heads) having probability p and right branches (tails) having probability 1 − p. The outcomes in question have k left branches and n − k right branches, and thus have probability $p^k \cdot (1-p)^{n-k}$. There are $\binom{n}{k}$ different outcomes with k heads: a sequence of n flips, out of which we choose which k come up heads. Therefore:
Definition 10.6 (Binomial distribution)
The binomial distribution with parameters n and p is a distribution over the sample space {0, 1, …, n}, where for each k ∈ {0, 1, …, n} we have
$$\Pr[k] = \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k}.$$
For an unbiased coin, when $p = \frac{1}{2}$, the expression for Pr [k] from Definition 10.6 simplifies to $\Pr[k] = \binom{n}{k} / 2^n$, because $(\frac{1}{2})^k \cdot (1 - \frac{1}{2})^{n-k} = (\frac{1}{2})^k \cdot (\frac{1}{2})^{n-k} = (\frac{1}{2})^n$.
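The formula in Definition 10.6 can be evaluated exactly, which makes it easy to confirm that the probabilities sum to 1 and that the fair-coin simplification holds. This is a minimal sketch; the function name `binomial_pr` is hypothetical, and exact fractions are used to avoid rounding error.

```python
from fractions import Fraction
from math import comb


def binomial_pr(n, p, k):
    """Probability of exactly k heads in n flips of a p-biased coin (Definition 10.6)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


half = Fraction(1, 2)

# Seven fair flips, exactly four heads.
print(binomial_pr(7, half, 4))                                       # 35/128

# The probabilities over the whole sample space {0, ..., n} sum to 1.
print(sum(binomial_pr(10, Fraction(1, 4), k) for k in range(11)))    # 1

# For a fair coin, the formula simplifies to C(n, k) / 2^n.
print(binomial_pr(10, half, 5) == Fraction(comb(10, 5), 2**10))      # True
```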
(Figure 10.7, plots omitted: three panels showing Pr [k] against k = 1, 2, …, 10, …, for (a) p = 0.3, (b) p = 0.5, and (c) p = 0.7.)
Geometric distribution
Another interesting coin-derived distribution comes from the “waiting time” before we see heads for the first time. Consider a p-biased coin, and continue to flip it until we get a heads. The output of this probabilistic process is the number of flips that were required, and the geometric distribution with parameter p is defined by this process. (The name “geometric” comes from the fact that the probability of needing k flips looks a lot like a geometric series, from Chapter 5.) See Figure 10.7 for a few such distributions.
What is the probability of needing precisely k flips to get heads for the first time? We would have to have k − 1 initial flips come up tails, and then one flip come up heads. As with the binomial distribution, one nice way to think about the probability of this outcome uses the probability tree. This tree has left branches (heads) having probability p and right branches (tails) having probability 1 − p; the outcome k follows k − 1 right branches and one left branch, and thus has probability $(1-p)^{k-1} \cdot p$. Therefore:
Notice that the geometric distribution is our first example of an infinite sample space: every positive integer is a possible result.
Figure 10.7: Several geometric distributions, for different values of p. Although these plots are truncated at k = 10, the distribution continues infinitely: Pr [k] > 0 for all positive integers k.
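The waiting-time process, and the probability $(1-p)^{k-1} \cdot p$ derived above for needing exactly k flips, can be compared in a short simulation. This is a sketch under arbitrary choices (p = 0.3, a fixed seed, 100,000 trials); the function names are hypothetical.

```python
import random


def geometric_pr(p, k):
    """Probability that the first heads appears on flip k: k - 1 tails, then heads."""
    return (1 - p) ** (k - 1) * p


def flips_until_heads(p, rng):
    """Flip a p-biased coin until it comes up heads; return the number of flips."""
    flips = 1
    while rng.random() >= p:   # this flip is tails, which has probability 1 - p
        flips += 1
    return flips


rng = random.Random(10)        # arbitrary seed, used only for reproducibility
p = 0.3
trials = 100_000
samples = [flips_until_heads(p, rng) for _ in range(trials)]

# Compare the formula with the observed frequency for the first few values of k.
for k in range(1, 5):
    print(k, round(geometric_pr(p, k), 4), samples.count(k) / trials)
```

The observed frequencies should track the formula closely but not exactly. The loop in `flips_until_heads` has no fixed upper bound on its running time, which mirrors the fact that every positive integer is a possible outcome.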
Definition 10.7 (Geometric distribution)
Let p be a real number satisfying 0 < p ≤ 1. The geometric distribution with parameter p is a distribution over the sample space $\mathbb{Z}^{\geq 1} = \{1, 2, 3, \ldots\}$, where for each k we have $\Pr[k] = (1-p)^{k-1} \cdot p$.
Computer Science Connections
Quantum Computing
“Anyone who is not shocked by quantum theory has not understood it.”
attributed to Niels Bohr (1885–1962)
As the 20th-century revolution in physics brought about by the discovery of quantum mechanics unfolded, some researchers working at the boundary of physics and computer science developed a new model of computation based on these quantum ideas. This model of quantum computation relies deeply on some very deep physics, far too deep for one page, but here is a brief summary, without any of the details of the physics.
The most basic element of data in a quantum computer is a quantum bit, or qubit. Like a bit (the basic element of data on a classical computer), a qubit can be in one of two basic states. These two states are written as |0⟩ and |1⟩. (A classical bit is in state 0 or 1.) The quantum magic is that a qubit can be in both states simultaneously, in what’s called a superposition of these basic states. A qubit will be in a state α|0⟩ + β|1⟩, where α and β are “weights” where $|\alpha|^2 + |\beta|^2 = 1$. (Actually, the weights α and β are complex numbers, but the basic idea will come across if we think of them as real numbers, possibly negative, instead.) Thus, while there are only two states of a bit, there are infinitely many states that a qubit can be in.
So a qubit’s state contains a huge amount of information. But, by the laws of quantum physics, we are limited in how we can extract that information from a qubit. Specifically, we can measure a qubit, but we only see 0 or 1 as the output. When we measure a qubit α|0⟩ + β|1⟩, the probability that we see 0 is $|\alpha|^2$; the probability that we see 1 is $|\beta|^2$. For example, we might have a qubit in the state $\frac{1}{2}|0\rangle + \frac{\sqrt{3}}{2}|1\rangle$. (Note $(\frac{1}{2})^2 + (\frac{\sqrt{3}}{2})^2 = \frac{1}{4} + \frac{3}{4} = 1$.) When we measure this qubit, 25% of the time we’d see a 0, and 75% of the time we’d see a 1.
There are two more crucial points. First, when there are multiple qubits (say n of them), the qubits’ state is a superposition of $2^n$ basic states. (For example, two qubits are in a state $\alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle$.) Second, even though we only see one value when we measure qubits, there can be “cancellation” (or interference) among coefficients. There are notable restrictions on how we can operate on qubits, based on constraints of physics, but at a very rough level, we can run an operation on an n-qubit quantum computer in parallel in each of the $2^n$ basic states and, if the process is designed properly, still read something useful from our single measurement.2
Why does anyone care about any of this? The main interest in quantum computation stems from a major breakthrough, Shor’s algorithm (named after its discoverer, Peter Shor): an algorithm that solves the factoring problem (given a large integer n, determine n’s prime factorization) efficiently on a quantum computer. An efficient factoring algorithm is deeply problematic for most currently deployed cryptographic systems (see Chapter 7), so a functional quantum computer would be a big deal. But, at least as of this writing, no one has been able to build a quantum computer of any appreciable size. So at the moment, at least, it’s a theoretical device, but there’s active research both on the physics side (can we actually build one?) and on the algorithmic side (what else could we do if we did build one?).
This cursory description of qubits and quantum computation is nowhere close to a full accounting of how qubits work, or what a quantum computer might do. For much more, see the wonderful text
2 Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.
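The measurement rule for a single qubit described above (see 0 with probability $|\alpha|^2$ and 1 with probability $|\beta|^2$) amounts to a few lines of arithmetic. The sketch below is ordinary classical code that only computes these two probabilities; it does not simulate a quantum computer, and the function name `measurement_probabilities` is hypothetical.

```python
import math


def measurement_probabilities(alpha, beta):
    """Return (Pr[see 0], Pr[see 1]) for the qubit alpha|0> + beta|1>."""
    norm = abs(alpha) ** 2 + abs(beta) ** 2
    if not math.isclose(norm, 1.0):
        raise ValueError("weights must satisfy |alpha|^2 + |beta|^2 = 1")
    # abs() also handles complex weights, returning their magnitude.
    return abs(alpha) ** 2, abs(beta) ** 2


pr0, pr1 = measurement_probabilities(0.5, math.sqrt(3) / 2)
print(round(pr0, 2), round(pr1, 2))   # 0.25 0.75
```

Negating a weight (say, using −0.5 in place of 0.5) leaves both probabilities unchanged, which is a small hint of why the weights carry more information than a single measurement reveals.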
Computer Science Connections
Information, Charles Dickens, and the Entropy of English
Consider the following two (identical-length) sequences of letters and spaces, one from Charles Dickens’s A Tale of Two Cities and one generated by uniformly randomly choosing a sequence of elements of {A, . . . , Z, ␣}:
IT WAS THE BEST OF TIMES, IT WAS THE WORST OF TIMES, IT WAS THE AGE OF WISDOM, IT WAS THE AGE OF FOOLISHNESS, IT WAS THE EPOCH OF BELIEF, IT WAS THE EPOCH OF INCREDULITY.
TUYSSUWWYVOZULF XZQBSFS AFNBMAOOGWZPAHGREAYC SUSCMBOWDCNCYEJBHPVCRO MLVTGVHTVCZXHSCQFULCMBO CDIWTXOCUPKTFZVNBHRGDWAKZSZPFTZKEWKWIH O QFIUWTCDKUBTQSPLXSYXGQZA DLXBHKFILFPZ.
Which sequence contains more information? It is very tempting to choose the first (information about contrast, and irony, and the opposition of ideas!), but, in a precise technical sense, Random contains far more information than Dickens. The basic reason is that, in Dickens, certain letters occur far more frequently than others: E occurs 17 times and there are six letters that don’t appear at all. (In Random, all 26 letters appear.) With such a lopsided distribution, you already know a lot about what letter is (probably) going to come next, and so there’s less new information conveyed by a typical letter.
Formally, the entropy of a sequence of letters (or bits, or whatever) is a measure of “how surprising” each element of the sequence is, averaged over the sequence. We’ll convert the “unit of surprise” into a real number between zero and one, where zero corresponds to “the next letter is 100% predictable” and one corresponds to “we have absolutely no idea what the next letter will be.” Formally, the entropy H of a probability distribution over S is given by
$$H = -\sum_{x \in S} \Pr[x] \cdot \log(\Pr[x]).$$
For example, if we produce a sequence of coin flips where each flip comes up heads with probability p (see Figure 10.8), then the entropy of the sequence will be $-\big(p \log p + (1-p) \log(1-p)\big)$, as shown in Figure 10.9.
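The entropy formula is short enough to evaluate directly. The sketch below assumes that the logarithm is taken base 2, which the text does not state explicitly but which reproduces the entropies quoted in the caption of Figure 10.8; the function names are hypothetical.

```python
from math import log2


def entropy(probabilities):
    """Entropy of a distribution, using base-2 logarithms (an assumption; see above)."""
    # Outcomes with probability 0 are skipped, because log2(0) is undefined.
    return -sum(p * log2(p) for p in probabilities if p > 0)


def coin_entropy(p):
    """Entropy of one flip of a coin that comes up heads with probability p."""
    return entropy([p, 1 - p])


for p in (0.25, 0.5, 0.9):
    print(p, round(coin_entropy(p), 4))
# 0.25 0.8113
# 0.5 1.0
# 0.9 0.469
```

The fair coin attains the maximum value of 1, and the entropy falls off symmetrically as p moves toward 0 or 1, matching the shape described for Figure 10.9.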
This definition of entropy comes from the 1940s, in a paper by Claude Shannon,3 and has found all sorts of useful applications since. Here is one example: the entropy of a sequence of bits expresses a theoretical limit on the compressibility of that sequence. (And that theoretical limit is, in fact, achievable.) That is, if the entropy of a string of n bits is very low (say around 0.25), then with some clever algorithms we can represent that string (without any error) using only about $\frac{n}{4}$ bits. But we can’t represent it in fewer bits with perfect fidelity (“lossless” compression; see p. 938).
There is significant redundancy in English text, as we’ve already mentioned, based on the nonuniformity in the probability distribution of individual letters. But there’s even more redundancy based on the fact that the probability that the ith character of an English document is an H is affected by whether the (i − 1)st character was a T. (In the language of Section 10.3, these events are not independent.) If you’ve seen the letters TH in succession, you can make a very good bet that E is coming next. Compression schemes for English make use of this phenomenon.4
Figure 10.8: A sequence of bits, produced independently at random with probability p = 0.25 (top), p = 0.5 (middle), and p = 0.9 (bottom) of a one. Their entropies are, respectively, 0.8113, 1.0000, and 0.4690.
Figure 10.9: The entropy of a biased coin whose heads probability is p. (Plot omitted; the horizontal axis is the probability p, from 0 to 1.00, and the vertical axis is the entropy, from 0 to 1.0.)
3 Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
For more about entropy, compressibility, and information generally, see a textbook about information theory. A great classic reference is:
4 Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.
10.2.5 Exercises
Philippe flips a fair coin 100 times.
Let the outcome be the number of heads that he sees.
10.1 What is the sample space?
10.2 What is Pr [0]?
10.3 What is Pr [50]?
10.4 What is Pr [64]?
Philippe now flips his fair coin n times. He is interested in the event “there are (strictly) more heads than tails.” What’s the probability of this event for the following values of n?
10.5 n = 2
10.6 n = 3
10.7 n = 1001 (Hint: Pr [k] = Pr [1001 − k].)
10.8 an arbitrary positive integer n
Bridget plays Bridge. Bridge is a card game played with a standard 52-card deck. Each player is initially dealt a hand of 13 cards; assume a fair deal in which each of the $\binom{52}{13}$ hands is equally likely.
10.9 What is the probability of being dealt both A♣ and A♦?
10.10 Suppose Bridget receives a uniformly drawn hand of 13 cards, in a uniformly random order. Because your ex-friend Peter was trying to cheat at poker with this deck, the A♣ card is marked. You observe that the card in the fourth-from-the-right position in Bridget’s hand is A♣. What is the probability that Bridget also has the A♦ in her hand?
Most casual bridge players sort their hands by suit (♠, ♥, ♣, ♦ from left to right), and decreasing from left to right by rank within each suit. (So one might have a hand like ♠AK4 ♥983 ♣AKQ ♦AJ98, reading from left to right.) Professional players are taught not to sort their hands, because doing so causes which card they play to leak information about the rest of their hand to the other players. Suppose Bridget receives a uniformly drawn hand of 13 cards, and sorts the cards in her hand. Peter’s card marking is still present, and you observe the A♣ in a particular position in Bridget’s hand. In the following scenarios, what is the probability that Bridget also has the A♦ in her hand? (That is: out of all hands for which A♣ is highest/lowest/etc. card, what fraction also have the A♦?)
10.11 A♣ is the fourth-from-the-right (that is, fourth-from-the-lowest) card
10.12 A♣ is the rightmost (that is, lowest) card
10.13 A♣ is the leftmost (that is, highest) card
Chrissie plays Cribbage. Cribbage is a card game played with a standard 52-card deck. For the purposes of these questions, assume that a player is dealt one of the $\binom{52}{4}$ different 4-card hands, chosen uniformly at random. Cribbage hands are awarded points for having a variety of special configurations:
• A flush is a hand with all four cards from the same suit.
• A run is a set of at least 3 cards with consecutive rank. (For example, the hand 3♥, 9♣, 10♦, J♣ contains a run.)
• A pair is a set of two cards with identical rank.
Aces are low in Cribbage, so A,2,3 is a valid run, but Q,K,A is not.
10.14 What’s the probability that Chrissie is dealt a flush?
10.15 What’s the probability that Chrissie is dealt a run of length 4?
10.16 What’s the probability of getting two runs of length 3 but not a run of 4? (For example, the hand 9♥, 9♣, 10♦, J♣ contains two runs of length 3: the first is 9♥, 10♦, J♣ and the second is 9♣, 10♦, J♣.)
10.17 What’s the probability of getting one (and only one) run of length 3 (and not a run of length 4)?
10.18 What’s the probability of getting at least one pair? (Hint: Pr [getting a pair] = 1 − Pr [getting no pair].)
10.19 What’s the probability of getting two or more pairs? (In cribbage, any two cards with the same rank count as a pair; for example, the hand 2♥2♦2♠8♣ has three pairs: 2♥2♦ and 2♥2♠ and 2♦2♠.)
10.20 (programming required) Write a program to approximately verify your calculations from these Cribbage exercises, as follows: generate 1,000,000 random hands from a standard deck, and count the number of those samples in which there’s a flush, run (of the three flavors), pair, or multiple pairs.
10.21 (programming required) Modify your program to exactly verify your calculations: exhaustively generate all 4-card hands, and count the number of hands with the various features (flushes, runs, pairs).
10.22 A fifteen is a subset of cards whose ranks sum to 15, where an A counts as 1 and each of {10, J, Q, K} counts as 10. (For example, the hand 3♥, 2♣, 5♦, J♣ contains two fifteens: 3♥ + 2♣ + J♣ = 15 and 5♦ + J♣ = 15.) What’s the probability a 4-card hand contains at least one fifteen? (Hint: use a program.)
10.23 A bitstring x ∈ {0, 1}^5 is stored in vulnerable memory, subject to corruption (for example, on a spacecraft). An α-ray strikes the memory and resets one bit to a random value (both the new value and which bit is affected are chosen uniformly at random). A second α-ray strikes the memory and resets one bit (again chosen uniformly at random). What’s the probability that the resulting bitstring is identical to x?
Recall the quick sort algorithm for sorting an array A: we choose a “pivot” value x; we partition A into those elements less than x and those greater than x; and we return x and those two sublists, recursively sorted, in the correct order. (See Figure 10.10.) This algorithm is efficient if the two sublists are close to equal in size. There are many ways to choose the pivot value, but one common (and good!) strategy is to choose x randomly from A. Assume that the elements of A are all distinct. If we select pivot in Line 4 by choosing uniformly at random from the set {1, . . . , n}:
10.24 As a function of n, what is the probability that |L| ≤ 3n/4 and |R| ≤ 3n/4? (You may assume that n is divisible by 4.)
10.25 As a function of n and α ∈ [0, 1], what is the probability |L| ≤ αn and |R| ≤ αn? (You may neglect issues of integrality: assume αn is an integer.)
Suppose that we choose pivot in Line 4 by choosing three elements p_1, p_2, p_3 uniformly at random from the set {1, . . . , n}, and taking as pivot the p_i whose corresponding element of A is the median of the three. (Assume that the same index can be chosen as both p_1 and p_3, for example.) For example, for the array A = ⟨94, 32, 29, 85, 64, 8, 12, 99⟩, we might randomly choose p_1 = 1, p_2 = 7, and p_3 = 2. Then the pivot will be p_3 because A[p_3] = 32 is between A[p_2] = 12 and A[p_1] = 94. Under this “median of three” strategy:
10.26 What is the probability that |L| ≤ 3n/4 and |R| ≤ 3n/4? Assume n is large; for ease, you may neglect issues of integrality in your answer.
10.27 As a function of α ∈ [0, 1], what is the probability |L| ≤ αn and |R| ≤ αn? Again, you may assume that n is large, and you may neglect issues of integrality in your answer.
Suppose that Team Emacs and Team VI play a best-of-five series of softball games. Emacs, being better than VI, wins each game with probability 60%. (Margin note: “Emacs” rhymes with “ski wax”; “VI” rhymes with “knee-high.” The teams are named after two text editors frequently used by computer scientists to write programs or emails or textbooks.)
10.28 Use a tree diagram to compute the probability that Team Emacs wins the series.
10.29 What is the probability that the series goes five games? (That is, what is the probability that neither team wins 3 of the first 4 games?)
10.30 Update your last two answers if Team Emacs wins each game with probability 70%.
(Calculus required.) Now assume that Team Emacs wins each game with probability p, for an arbitrary value p ∈ [0, 1]. For the following questions, write down a formula expressing the probability of the listed event. Also find the value of p that maximizes the probability, and the probability of the specified event for this maximizing p.
10.31 There is a fifth game in the series.
10.32 There is a fourth game of the series.
10.33 There is a fourth game of the series and Team Emacs wins that fourth game.
Let S be a sample space, and let Pr : S → [0, 1] be an arbitrary function satisfying the requirements of being a probability function (Definition 10.2). That is, we have $\sum_{s \in S} \Pr[s] = 1$ and Pr [s] ≥ 0 for all s ∈ S. Argue briefly that the following properties hold.
10.34 For any outcome s ∈ S, we have Pr [s] ≤ 1.
10.35 For any event A ⊆ S, we have Pr [Ā] = 1 − Pr [A]. (Recall that Ā = S − A.)
10.36 For any events A, B ⊆ S, we have Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B].
10.37 The Union Bound: for any events A_1, A_2, . . . , A_n, we have $\Pr[\bigcup_i A_i] \le \sum_i \Pr[A_i]$.

Figure 10.10: Quick Sort, briefly. (See Figure 5.20(a) for more detail.)
quickSort(A[1 . . . n]):
1: if n ≤ 1 then
2:     return A
3: else
4:     choose pivot ∈ {1, . . . , n}, somehow.
5:     L := list of all A[i] where A[i] < A[pivot].
6:     R := list of all A[i] where A[i] > A[pivot].
7:     return quickSort(L) + ⟨A[pivot]⟩ + quickSort(R)
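For experimenting with the quick sort exercises, the pseudocode of Figure 10.10 can be transcribed fairly directly. The sketch below makes the “somehow” of Line 4 a uniformly random choice, as in the exercises; the function name `quick_sort` and the optional `rng` parameter (useful for passing a seeded `random.Random` instance) are illustrative choices, not part of the text. As in the exercises, it assumes that all elements are distinct.

```python
import random


def quick_sort(items, rng=random):
    """Return a sorted copy of `items`, which must contain distinct elements."""
    if len(items) <= 1:
        return list(items)
    # Line 4 of the pseudocode: choose the pivot index uniformly at random.
    pivot = items[rng.randrange(len(items))]
    smaller = [value for value in items if value < pivot]  # the list L
    larger = [value for value in items if value > pivot]   # the list R
    return quick_sort(smaller, rng) + [pivot] + quick_sort(larger, rng)


if __name__ == "__main__":
    print(quick_sort([94, 32, 29, 85, 64, 8, 12, 99]))
    # prints [8, 12, 29, 32, 64, 85, 94, 99]
```

Recording `len(smaller)` and `len(larger)` at the top level over many runs is one way to check an answer about the sizes of L and R empirically.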
Imagine n identical computers that share a single radio frequency for use as a network connection. Each of the n computers would like to send a packet of information out across the network, but if two or more different computers simultaneously try to send a message, no message gets through. Here you’ll explore another use of randomization: using randomness for symmetry breaking.
10.38 Suppose that each computer flips a coin that comes up heads with probability p. What is the probability that exactly one of the n machines’ coins comes up heads (and thus that machine can send its message)? Your answer should be a formula that’s in terms of n and p.
(The next two exercises require calculus.)
10.39 Given the formula you found in Exercise 10.38, what p should you choose to maximize the probability of a message being successfully sent?
10.40 What is the probability of success if you choose p as in Exercise 10.39? What is the limit of this quantity as n grows large? (You may use the following fact: (1 − 1/m)^m → e^{−1} as m → ∞.)

We hash items into a 10-slot hash table using a hash function h that uniformly assigns elements to {1, . . . , 10}. Compute the probability of the following events if we hash 3 elements into the 10-slot table:
10.41 no collisions occur
10.42 all 3 elements have the same hash value
Suppose that we resolve collisions by linear probing, wherein an element x that hashes to an occupied cell h(x) is placed in the first unoccupied cell after h(x). (That is, we try to put x into h(x), then h(x) + 1, then h(x) + 2, and so forth—wrapping back around to the beginning of the table after the 10th slot. See Figure 10.11.) If we hash 3 elements into the 10-slot table, what is the probability that . . .
10.43 at least 2 adjacent slots are filled. (Count slot #10 as adjacent to #1.)
10.44 3 adjacent slots are filled.
One issue with resolving collisions by linear probing is called clustering: if there’s a large block of occupied slots in the hash table, then there’s a relatively high chance that the next element placed into the table extends that block.
10.45 Suppose that we currently have a single block of k adjacent slots full in an n-slot hash table, and all other slots are empty. What’s the probability that the next element inserted into the hash table extends that block (that is, leaves k + 1 adjacent slots full)?
10.46 (programming required) Write a program to hash 5000 elements into a 10,007-slot hash table using linear probing. Record which cell x_{5000} ends up occupying; that is, how many hops from h(x_{5000}) is x_{5000}? Run your program 2048 times, and report how far, on average, x_{5000} moved from h(x_{5000}). Also report the maximum distance that x_{5000} moved.
Because linear probing suffers from this clustering issue, other mechanisms for resolving collisions are sometimes used. Another choice is called quadratic probing: we change the cell number we try by an increasing step size at every stage, instead of by one every time. Specifically, to hash x into an n-slot table, first try to store x in h(x); if that cell is full, try putting x into h(x) + i^2, wrapping back around to the beginning of the table as usual, for i = 1, 2, . . .. (Linear probing tried slot h(x) + i instead.)
10.47 (programming required) Modify your program from Exercise 10.46 to use quadratic probing instead, and report the same statistics: the mean and maximum number of cells probed for x_{5000}.
10.48 In about one paragraph, explain the differences that you observed between linear and quadratic probing. A concern called secondary clustering arises in quadratic probing: if h(x) = h(y) for two elements
x and y, then the sequence of cells probed for x and y is identical. These sequences were also identical for linear probing. In your answer, explain why secondary clustering from quadratic probing is less of a concern than the clustering from linear probing.
A fourth way of handling collisions in hash tables (after chaining, linear probing, and quadratic probing) is what’s called double hashing: we move forward by the same number of slots at every stage, but that number is randomly chosen, as the output of a different hash function. Specifically, to hash x into an n-slot table, first try to store x in h(x); if that cell is full, try putting x into h(x) + i · g(x), wrapping
back around to the beginning of the table as usual, for i = 1, 2, . . .. (Here g is a different hash function, crucially one whose output is never zero.) See Figure 10.13.
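The three open-addressing schemes differ only in the offset added to h(x) on the ith try, which the following sketch makes explicit. It is a hedged illustration rather than a full hash table: slots are numbered 1 through n as in the text, the hash values are passed in directly instead of being computed, and every name here (`probe_sequence`, `linear`, `quadratic`, `double_hashing`, `insert`) is hypothetical.

```python
from itertools import islice


def probe_sequence(start, table_size, offset):
    """Yield the slots tried for an element whose hash value is `start`.

    `offset(i)` is the distance from `start` on the i-th try (offset(0) == 0).
    Slots are numbered 1..table_size, wrapping around past the last slot.
    """
    i = 0
    while True:
        yield (start - 1 + offset(i)) % table_size + 1
        i += 1


def linear(start, table_size):
    return probe_sequence(start, table_size, lambda i: i)


def quadratic(start, table_size):
    return probe_sequence(start, table_size, lambda i: i * i)


def double_hashing(start, second_hash, table_size):
    # second_hash plays the role of g(x); it must never be zero.
    return probe_sequence(start, table_size, lambda i: i * second_hash)


def insert(table, probes):
    """Fill the first empty slot along `probes` and return its slot number.

    Gives up after len(table) probes (a simplification chosen for this sketch).
    """
    for slot in islice(probes, len(table)):
        if not table[slot - 1]:
            table[slot - 1] = True
            return slot
    raise RuntimeError("no empty slot among the first len(table) probes")


if __name__ == "__main__":
    print(list(islice(linear(4, 10), 5)))             # [4, 5, 6, 7, 8]
    print(list(islice(quadratic(4, 10), 5)))          # [4, 5, 8, 3, 10]
    print(list(islice(double_hashing(4, 3, 10), 5)))  # [4, 7, 10, 3, 6]
```

With slots 4 and 5 already occupied, `insert(table, linear(4, 10))` returns 6, matching the situation described for linear probing with h(x) = 4.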
10.49 (programming required) Modify your program from Exercises 10.46 and 10.47 to use double hashing. Again report the mean and maximum number of cells probed for x_{5000}.
10.50 In about one paragraph, explain the differences you observe between chaining, linear probing, quadratic probing, and double hashing. Is there any reason you wouldn’t always use double hashing?
Consider a randomized algorithm that solves a problem on a particular input correctly with probability p, and it’s wrong with probability 1 − p. Assume that each run of the algorithm is independent of every other run, so that we can think of each run as being an (independent) coin flip of a p-biased coin (where heads means “correct answer”).
10.51 (Requires calculus.) Suppose that the probability p is unknown to you. You observe that exactly k out of n trials gave the correct answer. Then the number k of correct answers follows a binomial distribution with parameters n and p: that is, the probability that exactly k runs give the correct answer is
$\binom{n}{k} \cdot p^k \cdot (1 - p)^{n-k}$. (∗)
Prove that the maximum likelihood estimate of p is p = k/n; that is, prove that (∗) is maximized by p = k/n.
10.52 (Requires calculus.) Suppose that the probability p is unknown to you. You observe that it takes n trials before the first time you get a correct answer. Then n follows a geometric distribution with parameter p: that is, the probability that n runs were required is given by
$(1 - p)^{n-1} \, p$. (†)
Prove that the maximum likelihood estimate of p is p = 1/n; that is, prove that (†) is maximized by p = 1/n.

Figure 10.11: A reminder of linear probing. If h(x) = 4, then we try to store x in slot 4, then 5, then 6. Because slot 6 is empty, x is placed into that slot.
Figure 10.12: Quadratic probing. We try to store x in slot h(x), then h(x) + 1^2, then h(x) + 2^2, etc.
Figure 10.13: Double hashing. We try to store x in slot h(x), then h(x) + g(x), then h(x) + 2g(x), etc. (wrapping around the table as necessary). (Figure labels: h(x) = 4, g(x) = 1; h(y) = 4, g(y) = 3; slots 1 through 10.)

10.3 Independence and Conditional Probability
If your parents never had children, chances are you won’t, either.
Dick Cavett (b. 1936)
Imagine that you’re interviewing to be a consultant for Premier Passenger Pigeon Purveyors, a company that pitches its products to prospective pigeon purchasers using online advertising: specifically, by displaying ads to users of a particular search engine on the web. PPPP makes $50 profit from each sale, and, from historical data, they have determined that 0.02% of searchers who see an ad buy a pigeon. The interviewer asks you how much PPPP should be willing to pay to advertise to a searcher. A good answer is $0.01: on average, PPPP earns $50 · 0.0002 = $0.01 per ad, so paying anything up to a penny per ad yields a profit, on average. But you realize that there’s a better answer (and, by giving it, you get the job): it depends on what the user is searching for! A user who searches for BIRD or PIGEON or BUYING A PET TO COMBAT LONELINESS is far more likely to respond to a PPPP ad than an average user, while a user who searches for ORNITHOPHOBIA is much less likely to respond to an ad.
It is a general phenomenon in probability that knowing that event A has occurred may tell you that an event B is much more likely (or much less likely) to occur than you’d previously known. In this section, we’ll discuss when knowing that an event A has occurred does or does not affect the probability that B occurs (that is, whether A and B are dependent or independent, respectively). We’ll then introduce conditional probability, which allows us to state and manipulate quantities like “the probability that B happens given that A happens.”
10.3.1 Independence and Dependence of Events
We’ll start with independence and dependence of events. Intuitively, two events A and B are dependent if A’s occurrence/nonoccurrence gives us some information about whether B occurs; in contrast, A and B are independent when A occurs with the same probability when B occurs as it does when B does not occur. More formally:
Definition 10.8 (Independent and dependent events)
Two events A and B are independent if and only if Pr [A ∩ B] = Pr [A] · Pr [B]. The events A and B are called dependent if they are not independent.
If A and B are dependent events, then we can also say that A and B are correlated; independent events are said to be uncorrelated.
This definition is phrased a bit differently from the intuition above, but a little manipulation of the equation from Definition 10.8 may help to make the connection clearer. Assume for the moment that Pr [B] ≠ 0. (Exercise 10.70 addresses the case of Pr [B] = 0.) Dividing both sides of the equality Pr [A] · Pr [B] = Pr [A ∩ B] by Pr [B], we see that the events A and B are independent if and only if
$$\Pr[A] = \frac{\Pr[A \cap B]}{\Pr[B]}.$$

The left-hand side (Pr [A]) denotes the fraction of the time that A occurs. The right-hand side (Pr [A ∩ B] /Pr [B]) denotes the fraction of the time when B occurs that A occurs too. If these two fractions are equal, then A occurs with the same probability when B occurs as it does when B does not occur. (And if these two fractions are equal, then both when B occurs and when B does not occur, A occurs with probability Pr [A], that is, the probability of A without reference to B.)
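When every outcome in a small sample space is equally likely, the defining equation Pr [A ∩ B] = Pr [A] · Pr [B] can be checked by simply counting outcomes. The sketch below does so with exact fractions for two fair dice; the helper names `probability` and `are_independent`, and the particular events chosen, are illustrative assumptions rather than anything defined in the text.

```python
from fractions import Fraction
from itertools import product


def probability(event, sample_space):
    """Pr[event] when every outcome in sample_space is equally likely."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))


def are_independent(event_a, event_b, sample_space):
    """Check the definition: Pr[A and B] == Pr[A] * Pr[B]."""
    both = probability(lambda o: event_a(o) and event_b(o), sample_space)
    product_of_two = probability(event_a, sample_space) * probability(event_b, sample_space)
    return both == product_of_two


# Two fair dice; each outcome is a pair (red roll, blue roll).
ROLLS = list(product(range(1, 7), repeat=2))


def red_is_odd(outcome):
    return outcome[0] % 2 == 1


def sum_is_odd(outcome):
    return (outcome[0] + outcome[1]) % 2 == 1


def sum_is_twelve(outcome):
    return outcome[0] + outcome[1] == 12


if __name__ == "__main__":
    print(are_independent(red_is_odd, sum_is_odd, ROLLS))     # True
    print(are_independent(red_is_odd, sum_is_twelve, ROLLS))  # False
```

Using `Fraction` rather than floating-point numbers keeps the equality test exact, which matters because independence is an exact equality rather than an approximate one.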
Examples of independent and dependent events
To establish that two events A and B are independent, we can simply compute
Pr [A], Pr [B], and Pr [A ∩ B], and show that the product of the first two quantities is equal to the third. Here are a few examples:
Example 10.14 (Some independent events)
The following pairs of events are independent:
1. I flip a fair penny and a fair nickel. Define the following events:
• Event A: The penny is heads.
• Event B: The nickel is heads.
Then Pr [A] = 0.5 and Pr [B] = 0.5 and Pr [A ∩ B] = 0.25 = 0.5 · 0.5.
2. I draw a card from a randomly shuffled deck. Define the following events:
• Event A: I draw an ace.
• Event B: I draw a heart.
For these events, we have
$$\Pr[A] = \Pr[\{A\clubsuit, A\diamondsuit, A\heartsuit, A\spadesuit\}] = \frac{1}{13}$$
$$\Pr[B] = \Pr[\{A\heartsuit, 2\heartsuit, \ldots, K\heartsuit\}] = \frac{1}{4}$$
$$\Pr[A \cap B] = \Pr[\{A\heartsuit\}] = \frac{1}{52} = \frac{1}{4} \cdot \frac{1}{13}.$$
3. I roll a fair red die and a fair blue die. Define the following events:
• Event A: The red die is odd.
• Event B: The sum of the rolled numbers is odd.
Then, writing outcomes as ⟨the red roll, the blue roll⟩, we have
$$\Pr[A] = \Pr[\{1,3,5\} \times \{1,2,3,4,5,6\}] = \frac{18}{36} = 0.5$$
$$\Pr[B] = \Pr\bigl[\underbrace{\{1,3,5\} \times \{2,4,6\}}_{\text{red odd, blue even}} \cup \underbrace{\{2,4,6\} \times \{1,3,5\}}_{\text{red even, blue odd}}\bigr] = \frac{18}{36} = 0.5$$
Observe that A ∩ B = {1, 3, 5} × {2, 4, 6}, and so Pr [A ∩ B] = 9/36 = (0.5) · (0.5).
Any time the processes by which A and B come to happen are completely unrelated, it’s certainly true that A and B are independent. But events can also be independent in other circumstances, as we saw in Example 10.14.3: both events in this example in

3. I flip a fair penny and a fair nickel. Define the following events:
• Event A: The penny is heads.
• Event B: Both coins are heads.
Then Pr [A] = 0.5 and Pr [B] = 0.25 and Pr [A ∩ B] = 0.25 = Pr [B] ≠ Pr [A] · Pr [B].
Correlation of events
The pairs of dependent events from Example 10.15 are of two different qualitative types. Knowing that the first event occurred can make the second event more likely to occur (“rolling an odd number” and “rolling a prime number” for the dice) or less likely to occur (“rolling an even number” and “rolling a prime number”):

Definition 10.9 (Positive and negative correlation)
When two events A and B satisfy Pr [A ∩ B] > Pr [A] · Pr [B], we say that A and B are positively correlated. When Pr [A ∩ B] < Pr [A] · Pr [B], we say that A and B are negatively correlated. (If Pr [A ∩ B] = Pr [A] · Pr [B], then A and B are uncorrelated.)

At the extreme, knowing that the first event occurred can ensure that the second event definitely does not occur (“drawing a heart” and “drawing a spade” from Example 10.15) or can ensure that the second event definitely does occur (“both coins are heads” and “the first coin is heads” from Example 10.15).

Here are some further examples in which you’re asked to figure out whether certain pairs of events are independent or dependent:

Example 10.16 (Encryption by random substitution)
Problem: One simple form of encryption for text is a substitution cipher, in which (in the simplest version) we choose a permutation of the alphabet, and then replace each letter with its permuted variant. (For example, we might permute the letters as ABCDE· · · → XENBG· · · ; thus DECADE would be written as BGNXBG.) Suppose we choose a random permutation for this mapping, so that each of the 26! orderings of the alphabet is equally likely. Are the following events Q and Z independent or dependent?
• Q = “the letter Q is mapped to itself (that is, Q is ‘rewritten’ as Q).”
• Z = “the letter Z is mapped to itself.”
Solution: We must compute Pr [Q], Pr [Z], and Pr [Q ∩ Z]. Because each permutation is equally likely to be chosen, we have
$$\Pr[Q] = \frac{\#\text{ permutations } \pi_{1,2,\ldots,26} \text{ where } \pi_{17} = 17}{\#\text{ permutations } \pi_{1,2,\ldots,26}} = \frac{25!}{26!} = \frac{1}{26}$$
because we can choose any of 25! orderings of all non-Q letters. Similarly,
$$\Pr[Z] = \frac{\#\text{ permutations } \pi_{1,2,\ldots,26} \text{ where } \pi_{26} = 26}{\#\text{ permutations } \pi_{1,2,\ldots,26}} = \frac{25!}{26!} = \frac{1}{26}.$$
To compute Pr [Q ∩ Z], we need to count the number of permutations π_{1,2,...,26} with both π_17 = 17 and π_26 = 26. Any of the 24 other letters can go into any of the remaining 24 slots of the permutation, so there are 24! such permutations. Thus
$$\Pr[Q \cap Z] = \frac{\#\text{ permutations } \pi_{1,2,\ldots,26} \text{ where } \pi_{17} = 17 \text{ and } \pi_{26} = 26}{\#\text{ permutations } \pi_{1,2,\ldots,26}} = \frac{24!}{26!} = \frac{1}{25 \cdot 26}.$$
Thus we have
$$\Pr[Q \cap Z] = \frac{1}{25 \cdot 26} \quad\text{and}\quad \Pr[Q] \cdot \Pr[Z] = \frac{1}{26} \cdot \frac{1}{26} = \frac{1}{26 \cdot 26}.$$
There’s only a small difference between 1/(26 · 26) ≈ 0.00148 and 1/(25 · 26) ≈ 0.00154, but they’re indubitably different, and thus Q and Z are not independent.
(Incidentally, substitution ciphers are susceptible to frequency analysis: the most common letters in English-language texts are ETAOIN (almost universally in texts of reasonable length) and the frequencies of various letters are surprisingly consistent. See Exercises 10.72–10.76.)

Example 10.17 (Matched flips of two fair coins)
Problem: I flip two fair coins (independently). Consider the following events:
• Event A: the first flip comes up heads.
• Event B: the second flip comes up heads.
• Event C: the two flips match (are both heads or are both tails).
Which pairs of these events are independent, if any?
Solution: The sample space is {HH, HT, TH, TT}, and the events from the problem statement are given by A = {HH, HT}, B = {HH, TH}, and C = {HH, TT}. Thus A ∩ B = A ∩ C = B ∩ C = {HH}; that is, HH is the only outcome that results in more than one of these events being true. (See Figure 10.15.) Because the coins are fair, every outcome in this sample space has probability 1/4. Focusing on the events A and B, we have
Pr [A] = Pr [{HH, HT}] = 1/2
Pr [B] = Pr [{HH, TH}] = 1/2
Pr [A ∩ B] = Pr [{HH}] = 1/4.
Thus Pr [A] · Pr [B] = 1/2 · 1/2 = 1/4, and Pr [A ∩ B] = 1/4. Because Pr [A] · Pr [B] = Pr [A ∩ B], the two events are independent. The calculation is identical for the other two pairs of events, and so A and B are independent; A and C are independent; and B and C are independent.
Figure 10.15: Two coin flips and three events.

Example 10.18 (Matched flips of two biased coins)
Problem: How would your answers to Example 10.17 change if the coins are p-biased instead of fair?
Solution: The sample space and events remain as in Example 10.17 (see Figure 10.16), but the outcomes now have different probabilities:
outcome    probability
HH         p · p
HT         p · (1 − p)
TH         (1 − p) · p
TT         (1 − p) · (1 − p)
Using these outcome probabilities, we compute the event probabilities as follows:
Pr [A] = Pr [{HH, HT}] = p · p + p · (1 − p) = p (1)
Pr [B] = Pr [{HH, TH}] = p · p + (1 − p) · p = p (2)
Pr [C] = Pr [{HH, TT}] = p · p + (1 − p) · (1 − p) = p^2 + (1 − p)^2. (3)
Because A ∩ B = B ∩ C = A ∩ C = {HH}, we also have
Pr [A ∩ B] = Pr [B ∩ C] = Pr [A ∩ C] = Pr [HH] = p^2. (4)
Thus A and B are still independent, because Pr [A] · Pr [B] = p · p = p^2 = Pr [A ∩ B] by (1), (2), and (4). But what about the events A and C? By (1), (3), and (4), we have
Pr [A] · Pr [C] = p · (p^2 + (1 − p)^2) and Pr [A ∩ C] = p^2.
By a bit of algebra, we see that Pr [A ∩ C] = Pr [A] · Pr [C] if and only if
p^2 = p(p^2 + (1 − p)^2)
⇔ 0 = p(p^2 + (1 − p)^2) − p^2
⇔ 0 = 2p^3 − 3p^2 + p
⇔ 0 = p(2p − 1)(p − 1).
So the events A and C are independent (that is, Pr [A ∩ C] = Pr [A] · Pr [C]) if and only if p ∈ {0, 1/2, 1}.
Thus events A and B are independent for any value of p, while events A and C (and similarly B and C) are independent if and only if p ∈ {0, 1/2, 1}.
Figure 10.16: The flips and events, again. Recall the events: A: 1st flip heads. B: 2nd flip heads. C: flips match.

Taking it further: While any two of the events from Example 10.17 (or Example 10.18 with p = 1/2) are independent, the third event is not independent of the other two. Another way to describe this situation is that the events A and B ∩ C are not independent: in particular, Pr [A ∩ (B ∩ C)] /Pr [B ∩ C] = 1 ≠ Pr [A].
A set of events A_1, A_2, . . . , A_n is said to be pairwise independent if, for any two indices i and j ≠ i, the events A_i and A_j are independent. More generally, these events are said to be k-wise independent if, for any subset S of up to k of these events, the events in S are all independent. (And we say that the set of events is fully independent if every subset of any size satisfies this property.)
Sometimes it will turn out that we “really” care only about pairwise independence. For example, if we think about a hash table that uses a “random” hash function, we’re usually only concerned with the question “do elements x and y collide?”, which is a question about just one pair of events. Generally, we can create a pairwise-independent random hash function more cheaply than creating a fully independent random hash function. If we view random bits as a scarce resource (like time and space, in the style of Chapter 6), then this savings is valuable.

10.3.2 Conditional Probability
In Section 10.3.1, we discussed the black-and-white distinction between pairs of independent events and dependent events: if A and B are independent, then knowing whether or not B happened gives you no information about whether A happened; if A and B are dependent, then the probability that A happens if B happened is different from the probability that A happens if B did not happen. But how does knowing that B occurred change your estimate of the probability of A?
Think about events like “the sky is clear” and “it is very windy” and “it will rain today”: sometimes B means that A is less likely or even impossible; sometimes B means that A is more likely or even certain. Here we will discuss quantitatively how one event’s probability is affected by the knowledge of another event.
The conditional probability of A given B represents the probability of A occurring if we know that B occurred:
Definition 10.10 (Conditional probability)
The conditional probability of A given B, written Pr [A|B], is given by
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}.$$
(The quantity Pr [A|B] is also sometimes called the probability of A conditioned on B.) We will treat Pr [A|B] as undefined when Pr [B] = 0.
Here are a few simple examples:
Example 10.19 (Odds and primes)
I choose a number uniformly at random from {1, 2, . . . , 10}. Define these two events:
• Event A: The chosen number is odd.
• Event B: The chosen number is prime.
For these events, we have
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{\Pr[\{3,5,7\}]}{\Pr[\{2,3,5,7\}]} = \frac{3}{4}$$
and
$$\Pr[B \mid A] = \frac{\Pr[A \cap B]}{\Pr[A]} = \frac{\Pr[\{3,5,7\}]}{\Pr[\{1,3,5,7,9\}]} = \frac{3}{5}.$$
Example 10.20 (Dominoes)
Problem: Shuffle the dominoes in Figure 10.17, and draw one uniformly at random.
1. What is the probability that you drew a domino with a 2 ([image]) on it?
2. You make a draw and see the domino [image]. (Imagine the shaded side of the domino is covered by your hand.) What’s the probability your domino has a 2?
3. You make a draw and see that the domino is [image]. What is the probability that you drew a domino with a 2?
Figure 10.17: Some dominoes.

Here’s one more example, where we condition on slightly more complex events.
Example 10.21 (Coin flips)
Problem: Flip a fair coin 10 times (with all flips independent: the ith flip has no effect on the jth flip for j ≠ i). Write H to denote the event of getting at least 9 heads.
1. What is Pr [H]?
2. Let A be the event “the first flip comes up heads.” What is Pr [H|A]?
3. Let B be the event “the first flip comes up tails.” What is Pr [H|B]?
4. Let C be the event “the first three flips come up heads.” What is Pr [H|C]?
5. Let D be the event “we get at least 8 heads.” What is Pr [H|D]?
Solution:
1. Observe that every outcome (every element of {H, T}^10) is equally likely, each with probability 1/2^10. The number of sequences of 10 flips with 9 or 10 heads is $\binom{10}{9} + \binom{10}{10} = 10 + 1 = 11$, so Pr [H] = 11/2^10 ≈ 0.0107.
For the conditional probabilities, we will compute Pr [H ∩ X] and Pr [X] for each of the stated events X. The final answer is their ratio. Because each outcome is equally likely, we only have to compute the cardinality of the given events (and the cardinality of their intersection with H) to answer the questions.
2. For A (the first flip comes up H), we have |A ∩ H| = 10: there are 9 outcomes with one Tails that start with a Heads (HTHHHHHHHH, HHTHHHHHHH, . . ., HHHHHHHHHT) and 1 outcome with zero Tails (HHHHHHHHHH). Thus Pr [A ∩ H] = 10/2^10. Obviously Pr [A] = 1/2. Thus
$$\Pr[H \mid A] = \frac{\Pr[A \cap H]}{\Pr[A]} = \frac{10/2^{10}}{1/2} = \frac{10}{2^9} \approx 0.01953.$$
3. For B (the first flip comes up T), we’ve already “used up” the single permitted non-heads in the first flip, so there’s only one outcome in B ∩ H, namely THHHHHHHHH. And, again, obviously Pr [B] = 1/2. Therefore we have
$$\Pr[H \mid B] = \frac{\Pr[B \cap H]}{\Pr[B]} = \frac{1/2^{10}}{1/2} = \frac{1}{2^9} \approx 0.00195.$$
4. For C (the first three flips come up H), we have Pr [C] = 1/8. The outcomes in C ∩ H are exactly those that start with HHH followed by 6+ heads in the last 7 flips. There are $\binom{7}{7} + \binom{7}{6} = 8$ such outcomes. Thus
$$\Pr[H \mid C] = \frac{\Pr[C \cap H]}{\Pr[C]} = \frac{8/2^{10}}{1/8} = \frac{64}{2^{10}} \approx 0.0625.$$
5. For D (there are at least 8 heads), we have Pr [H ∩ D] = Pr [H] = 11/2^10. (There are no outcomes in which we get 9+ heads but fail to get 8+ heads!) The probability of getting 8+ heads in 10 fair flips is
$$\Pr[D] = \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = \frac{45 + 10 + 1}{2^{10}} = \frac{56}{2^{10}}.$$
And therefore
$$\Pr[H \mid D] = \frac{\Pr[D \cap H]}{\Pr[D]} = \frac{11/2^{10}}{56/2^{10}} = \frac{11}{56} \approx 0.1964.$$
To repeat the word of warning from early in this chapter: it can be very difficult to have good intuition about probability questions.
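One way to guard against faulty intuition in small cases like Example 10.21 is to enumerate the sample space and count. The sketch below lists all 2^10 equally likely outcomes of ten fair flips and computes conditional probabilities as exact fractions; the names `OUTCOMES`, `conditional_probability`, and the event helpers are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# All 2**10 equally likely outcomes of ten fair flips, as tuples of "H"/"T".
OUTCOMES = list(product("HT", repeat=10))


def conditional_probability(event, given):
    """Pr[event | given] under the uniform distribution on OUTCOMES.

    Raises ZeroDivisionError if `given` never occurs, mirroring the
    convention that Pr[A|B] is undefined when Pr[B] = 0.
    """
    given_outcomes = [o for o in OUTCOMES if given(o)]
    both = [o for o in given_outcomes if event(o)]
    return Fraction(len(both), len(given_outcomes))


def at_least_nine_heads(outcome):
    return outcome.count("H") >= 9


def at_least_eight_heads(outcome):
    return outcome.count("H") >= 8


def first_flip_heads(outcome):
    return outcome[0] == "H"


if __name__ == "__main__":
    print(conditional_probability(at_least_nine_heads, first_flip_heads))      # 5/256
    print(conditional_probability(at_least_nine_heads, at_least_eight_heads))  # 11/56
```

The first value, 5/256, is 10/2^9 in lowest terms. Counting outcomes in this way only works because all outcomes are equally likely; for biased coins each outcome would need to be weighted by its probability.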
For example, the last problem in Example 10.21 asked for the probability of getting 9+ heads in 10 flips conditioned on getting 8+ heads. It may be easy to talk yourself into believing that, of the times that we get 8+ heads, there’s a ≈ 50% chance of getting 9 or more heads. (“Put aside the first 8 heads, and look at one of the other flips: it’s heads with probability 1/2, so we get a 9th heads with probability 1/2.”) But this intuition is blatantly wrong. Another way of thinking about the calculation in the last part of Example 10.21 is to observe that there are 56 outcomes with 8, 9, or 10 heads. Only 11 of these outcomes have 9 or 10 heads. Each outcome is equally likely. So if we’re promised that one of the 56 outcomes occurred, then there’s an 11/56 chance that one of the 11 occurred.
Taking it further: So far, we have considered only random processes in which each outcome that can occur does so with probability ε > 0; that is, there have been no infinitesimal probabilities. But we can imagine scenarios in which infinitesimal probabilities make sense.
For example, imagine a probabilistic process that chooses a real number x between 0 and 1, where
each element of the sample space S = {x : 0 ≤ x ≤ 1} is equally likely to be chosen. We can make
probabilistic statements like Pr [x ≤ 0.5] = 1 —half the time, we end up with x ≤ 0.5, half the time we 2
end up with x ≥ 0.5—but for any particular value c, the probability that x = c is zero! (Perhaps bizarrely, Pr [x ≤ 0.5] = Pr [x < 0.5]. Indeed, Pr [x = 0.5] cannot be ε > 0, for any ε. Every possible outcome has to have that same probability ε of occurring, and for any ε > 0 there are more than 1 real numbers between 0 and 1. So we’d violate (10.1) if we had Pr[x = 0.5] > 0.) ε
To handle infinitesimal probabilities, we need calculus. We can describe the above circumstance with a probability density function p : S → [0, 1], so th􏰮at, in place of (10.1), we require
x∈S p(x)dx = 1.
(For a uniformly chosen x ∈ [0, 1], we have p(x) = 1; for a uniformly chosen x ∈ [0, 100], we have
For example, the “zooming in” view of conditional probability from Figure 10.18 doesn’t quite work in the infinitesimal case. In fact, we can consider questions about Pr 􏰂A|B􏰃 even when Pr [B] = 0, like what is the probability that a uniformly chosen x ∈ [0, 100] is an integer, conditioned on x being a rational number?. (And Exercise 10.70—if Pr [B] = 0, then A and B are independent—isn’t true with infinitesimal probabilities.) But details of this infinitesimal version of probability theory are generally outside of our concern here, and are best left to a calculus-based/analysis-based textbook on probability.
The restriction to non-infinitesimal probabilities is generally a reasonable one to make for CS ap-
plications, but it is a genuine restriction. (It’s worth noting that we have encountered an infinite sample
space before—just one that didn’t have any infinitesimal probabilities. In a geometric distribution with
p(x) = 1 .) Some of the statements that we’ve made in this chapter don’t apply in the infinitesimal case. 100
parameter 1 , for example, any positive integer k is a possible outcome, with Pr [k] = 1/2k , which is a 2
finite, albeit very small, probability for any positive integer k.)
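Returning to the finite setting of Example 10.21, the conditional probabilities computed there can be double-checked by brute force. The following is a minimal sketch, not part of the original text: it enumerates all 2^10 outcomes and counts, exactly as the “cardinality” argument in the solution does. The helper names (OUTCOMES, conditional, heads) are chosen here purely for illustration.

```python
from fractions import Fraction
from itertools import product

# All 2^10 equally likely outcomes of 10 fair coin flips.
OUTCOMES = list(product("HT", repeat=10))

def conditional(event, given):
    """Pr[event | given] under the uniform distribution on OUTCOMES."""
    given_outcomes = [o for o in OUTCOMES if given(o)]
    both = [o for o in given_outcomes if event(o)]
    return Fraction(len(both), len(given_outcomes))

def heads(outcome):
    """Number of heads in one outcome."""
    return outcome.count("H")

H = lambda o: heads(o) >= 9               # 9 or more heads
A = lambda o: o[0] == "H"                 # first flip is heads
B = lambda o: o[0] == "T"                 # first flip is tails
C = lambda o: o[:3] == ("H", "H", "H")    # first three flips are heads
D = lambda o: heads(o) >= 8               # at least 8 heads

for name, event in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(f"Pr[H|{name}] =", conditional(H, event))
```

Fraction reduces its results to lowest terms, so 10/2^9 is reported as 5/256 and 64/2^10 as 1/16; the value for D is 11/56, matching the count of 11 outcomes out of 56.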
10.3.3 Bayes’ Rule and Calculating with Conditional Probability
Here, we’ll briefly introduce a few simple but useful ways of thinking about conditional probability: the connection between independence of events and conditional probability; a few ways of thinking about plain (unconditional) probability using conditional probability; and, finally, Bayes’ Rule, a tremendously useful formula that relates Pr[A|B] and Pr[B|A].
Relating independence of events and conditional probability
Consider two events A and B for which Pr[B] ≠ 0. Observe that A and B are independent if and only if Pr[A|B] = Pr[A]:
A and B are independent ⇔ Pr[A] · Pr[B] = Pr[A ∩ B]    (definition of independence)
⇔ Pr[A] = Pr[A ∩ B] / Pr[B]    (dividing by Pr[B])
⇔ Pr[A] = Pr[A|B].    (definition of Pr[A|B])
(Note that this calculation doesn’t work when Pr[B] = 0: we can’t divide by 0, and Pr[A|B] is undefined. But see Exercise 10.70.) Notice again that this relationship is an if-and-only-if relationship: when A and B are not independent, then Pr[A] and Pr[A|B] must be different. Here is a small example:
Example 10.22 (Self-mapped letters in substitution ciphers)
In Example 10.16, we showed that, for a random permutation π of the alphabet, the events Q (Q is mapped to itself by π) and Z (Z is mapped to itself by π) were not independent: specifically, Pr[Q] = 1/26, Pr[Z] = 1/26, and Pr[Q ∩ Z] = 1/(25·26). Thus
Pr[Q|Z] = Pr[Q ∩ Z] / Pr[Z] = (1/(25·26)) / (1/26) = 1/25.
Compare Pr[Q|Z] = 1/25 to Pr[Q] = 1/26: thus, knowing that Z is mapped to itself makes it slightly more likely that Q is also mapped to itself. The reason that Z makes Q slightly more probable is that, when Z occurs, Z cannot be mapped to Q, so there are only 25 letters “competing” to be mapped to Q instead of 26.
Problem-solving tip: Often it is easier to get intuition about a probabilistic statement by imagining an absurdly small variant of the problem. Here, for example, imagine a 2-letter alphabet Q,Z. Then if Z is mapped to itself then Q must also be mapped to itself. So Pr[Q] = 1/2, but Pr[Q|Z] = 1.
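The “absurdly small variant” idea can also be tried by machine. The sketch below is an illustration rather than anything from the original text: it enumerates every permutation of a small n-letter alphabet and computes Pr[Q] and Pr[Q|Z] exactly. The function name fixed_point_probs and the choice of letter 0 as “Q” and letter n − 1 as “Z” are assumptions made for this example.

```python
from fractions import Fraction
from itertools import permutations

def fixed_point_probs(n):
    """Return (Pr[Q], Pr[Q|Z]) for a uniformly random permutation of n letters.

    Letters are numbered 0..n-1; letter 0 plays the role of Q and letter
    n-1 plays the role of Z (labels chosen only for illustration; n >= 2).
    """
    perms = list(permutations(range(n)))
    q = [p for p in perms if p[0] == 0]             # Q is mapped to itself
    z = [p for p in perms if p[n - 1] == n - 1]     # Z is mapped to itself
    q_and_z = [p for p in z if p[0] == 0]           # both are mapped to themselves
    return Fraction(len(q), len(perms)), Fraction(len(q_and_z), len(z))

print(fixed_point_probs(2))   # the 2-letter alphabet from the tip
print(fixed_point_probs(5))
```

For n letters the two values come out as 1/n and 1/(n − 1), which is the same pattern as 1/26 versus 1/25 in the example. Enumeration is only feasible for small n, since there are n! permutations.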
Intersections and conditional probability
The definition of conditional probability (Definition 10.10) states that
Pr[A|B] = Pr[A ∩ B] / Pr[B].
Multiplying both sides of this equality by Pr [B] yields a useful way of thinking about the probability of intersections:
Theorem 10.2 (The Chain Rule)
Let A and B be arbitrary events. Then
Pr[A ∩ B] = Pr[B] · Pr[A|B].
And, more generally, for events A1, A2, . . . , Ak, we have
Pr[A1 ∩ A2 ∩ A3 ∩ ··· ∩ Ak] = Pr[A1] · Pr[A2|A1] · Pr[A3|A1 ∩ A2] · ··· · Pr[Ak|A1 ∩ ··· ∩ Ak−1].
If we’re interested in the probability that A and B occur, then we need it to be the case that A occurs—and, conditioned on A occurring, B occurs too.

Example 10.23 (Drawing a heart flush in poker)
Problem: A flush in poker is a 5-card hand, all of which are the same suit. What is the probability of drawing a heart flush from a randomly shuffled deck?
Solution: We can draw any heart first. We have to keep drawing hearts to get a flush, so for 2 ≤ k ≤ 5, the kth card we draw must be one of the remaining 14 − k hearts from the 53 − k cards left in the deck. That is, writing Hi to denote the event that the ith card drawn is a heart:
Pr[H1 ∩ H2 ∩ H3 ∩ H4 ∩ H5]
= Pr[H1] · Pr[H2|H1] · Pr[H3|H1 ∩ H2] · Pr[H4|H1 ∩ H2 ∩ H3] · Pr[H5|H1 ∩ H2 ∩ H3 ∩ H4]
= (13/52) · (12/51) · (11/50) · (10/49) · (9/48)
= 154440/311875200 ≈ 0.00049519807.
(We could also have directly computed this quantity via counting: there are (13 choose 5) hands of 5 hearts, and (52 choose 5) total hands. Thus the fraction of all hands that are heart flushes is
(13 choose 5) / (52 choose 5) = (13!/(5!·8!)) / (52!/(5!·47!)) = (13!·47!) / (8!·52!) = (13·12·11·10·9) / (52·51·50·49·48),
which is the same quantity that we found above.)
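As a quick check that the two routes in Example 10.23 agree, here is a small sketch using exact rational arithmetic. It is illustrative only; the function names are made up for this example.

```python
from fractions import Fraction
from math import comb

def heart_flush_chain_rule():
    """Multiply Pr[kth card is a heart | the first k-1 cards were hearts]."""
    prob = Fraction(1)
    for k in range(1, 6):
        # 14 - k hearts remain among the 53 - k cards still in the deck.
        prob *= Fraction(14 - k, 53 - k)
    return prob

def heart_flush_counting():
    """Count 5-heart hands and divide by the number of 5-card hands."""
    return Fraction(comb(13, 5), comb(52, 5))

print(heart_flush_chain_rule(), float(heart_flush_chain_rule()))
print(heart_flush_counting())
```

Both functions return the same fraction, 154440/311875200 in lowest terms, so the chain-rule product and the counting argument are two views of one quantity.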
We can use the chain rule to compute the probability of an event A by making the (obvious!) observation that another event B either occurs or doesn’t occur:
Theorem 10.3 (The Law of Total Probability)
Let A and B be arbitrary events. Then
Pr[A] = Pr[A|B] · Pr[B] + Pr[A|B̄] · Pr[B̄].
Proof. We’ll proceed by splitting A into two disjoint subsets, A ∩ B and A − B (which is otherwise known as A ∩ B̄):
Pr[A] = Pr[(A ∩ B) ∪ (A ∩ B̄)]    (A = (A ∩ B) ∪ (A ∩ B̄))
= Pr[A ∩ B] + Pr[A ∩ B̄]    (A ∩ B and A ∩ B̄ are disjoint)
= Pr[A|B] · Pr[B] + Pr[A|B̄] · Pr[B̄].    (the chain rule)
Thus the theorem follows.
Here’s a simple example of using the law of total probability:

Example 10.24 (Binary Symmetric Channel)
We wish to transmit a 1-bit message from a sender to a receiver. The sender’s message is 0 with probability 0.3, and it’s 1 with probability 0.7. The sender sends this data using a communication channel that corrupts (that is, flips) every transmitted bit with probability 0.25. Then the probability that the receiver receives a “1” message is
Pr[receive 1] = Pr[receive 1|send 1] · Pr[send 1] + Pr[receive 1|send 0] · Pr[send 0]
= (0.75 · 0.7) + (0.25 · 0.3)
= 0.525 + 0.075 = 0.6.
Taking it further: The binary symmetric channel is given this name because it transmits a bit (it’s binary) and it corrupts a 0 with the same probability as it corrupts a 1 (it’s symmetric). (See Figure 10.19; view each arrow in the channel as transforming a particular input bit to a particular output bit, with the indicated probability.) The binary symmetric channel is one of the most basic forms of a noisy communication channel (that is, a channel that does not perfectly transmit its input without any chance of corruption). The subfield of information theory is devoted to analyzing topics like the (theoretical) efficiency of communication channels, including the binary symmetric channel. For much more, see a textbook on information theory.5
Figure 10.19: The binary symmetric channel. (Diagram: input bits 0 and 1 on the left, output bits 0 and 1 on the right, joined by arrows labeled p and 1 − p.)
5 Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.
Bayes’ Rule
Bayes’ Rule is a simple, but tremendously useful, rule for “flipping around” a conditional probability statement. It allows us to express the conditional probability of A given B in terms of the conditional probability of B given A:
Theorem 10.4 (Bayes’ Rule)
For any two events A and B:
Pr[A|B] = Pr[B|A] · Pr[A] / Pr[B].
Bayes’ Rule is named after Thomas Bayes, an 18th-century English mathematician.
Proof. Applying the chain rule to break apart Pr[A ∩ B] “in both orders,” we have
Pr[A ∩ B] = Pr[A|B] · Pr[B]
Pr[B ∩ A] = Pr[B|A] · Pr[A].
The left-hand sides of these equations are identical because A ∩ B = B ∩ A (and therefore Pr[A ∩ B] = Pr[B ∩ A]), so their right-hand sides are equal, too:
Pr[A|B] · Pr[B] = Pr[B|A] · Pr[A].
Dividing both sides of this equality by Pr[B] yields the desired equation:
Pr[A|B] = Pr[B|A] · Pr[A] / Pr[B].

Here are a couple of simple examples of using Bayes’ Rule:
Example 10.25 (Binary Symmetric Channel, again)
As in Example 10.24, assume a sender transmits a 0 with probability 0.3 and a 1 with probability 0.7 across a channel that corrupts every bit with probability 0.25. We showed in Example 10.24 that Pr[receive 1] = 0.6 and thus Pr[receive 0] = 0.4. Then the probability that the receiver receiving a “1” message was indeed sent a 1 is
Pr[message sent was 1|receive 1] = Pr[receive 1|send 1] · Pr[send 1] / Pr[receive 1]    (by Bayes’ Rule)
= (0.75 · 0.7) / 0.6 = 0.875.
And the probability that the receiver receiving a “0” message was indeed sent a 0 is
Pr[message sent was 0|receive 0] = Pr[receive 0|send 0] · Pr[send 0] / Pr[receive 0]    (by Bayes’ Rule)
= (0.75 · 0.3) / 0.4 = 0.5625.
(Qualitatively, these numbers tell us that most of the received ones were actually sent as ones, but barely more than half of the received zeros were actually sent as zeros.)
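The two binary-symmetric-channel examples combine the Law of Total Probability (to get Pr[receive 1]) with Bayes’ Rule (to flip the conditioning). The following sketch, with a hypothetical function name and parameter names, carries out both steps with exact fractions so the results can be compared digit for digit with the numbers above.

```python
from fractions import Fraction

def channel_posteriors(p_send_1, p_flip):
    """Return (Pr[receive 1], Pr[sent 1 | receive 1], Pr[sent 0 | receive 0]).

    p_send_1: prior probability that the sender's bit is 1.
    p_flip:   probability that the channel corrupts (flips) a bit.
    """
    p_send_0 = 1 - p_send_1
    p_keep = 1 - p_flip
    # Law of Total Probability: condition on which bit was sent.
    p_recv_1 = p_keep * p_send_1 + p_flip * p_send_0
    p_recv_0 = 1 - p_recv_1
    # Bayes' Rule: Pr[sent b | receive b] = Pr[receive b | sent b] * Pr[sent b] / Pr[receive b].
    post_1 = p_keep * p_send_1 / p_recv_1
    post_0 = p_keep * p_send_0 / p_recv_0
    return p_recv_1, post_1, post_0

recv_1, post_1, post_0 = channel_posteriors(Fraction(7, 10), Fraction(1, 4))
print(float(recv_1), float(post_1), float(post_0))   # 0.6 0.875 0.5625
```

Changing the prior to 1/2 makes both posteriors equal to 0.75, which suggests that the asymmetry between the two answers comes from the sender’s prior rather than from the channel.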
Example 10.26 (9+ heads, again)
We flip a fair coin 10 times. As in Example 10.21, let A denote the event that the first flip comes up heads and let H denote the event that there are 9 or more heads in the 10 flips. (There we showed Pr[H] = 11/2^10, Pr[A] = 1/2, and Pr[H|A] = 10/2^9.) Then
Pr[A|H] = Pr[H|A] · Pr[A] / Pr[H] = ((10/2^9) · (1/2)) / (11/2^10) = 10/11.
Taking it further: A speech recognition system is supposed to “listen” to speech in a language like English, and recognize the words that are being spoken. Bayes’ Rule allows us to think about two different types of evidence that such a system uses in deciding what words it “thinks” are being said; see p. 1036.
A particularly important application of Bayes’ Rule is in “updating” one’s beliefs about the world by observing new information. (Here “beliefs” take the form of a probability distribution.) One starts with a prior distribution which one then updates based on evidence to produce a posterior distribution. Here are two examples:
Example 10.27 (Alice the CS major)
We are interested in whether a student (let’s call her Alice) is a computer science major. Our prior for Alice might be Pr[CS major] = 0.05 because 5% of students are CS majors. We learn that Alice took Ancient Philosophy. If we know that 10% of students as a whole take Ancient Philosophy, and 50% of CS majors do, then
Pr[CS major|phil] = Pr[phil|CS major] · Pr[CS major] / Pr[phil] = (0.5 · 0.05) / 0.10 = 0.25.
Our posterior distribution (that is, the updated guess) is that there is now a 25% chance that Alice is a CS major.
The prior (pre = before) is your best guess of the probability of the event prior to seeing the produced evidence; the posterior (post = after) is your best guess after seeing the evidence.
Example 10.28 (Flipping a coin to decide which coin to flip)
I have two coins in an opaque bag. The coins are visually indistinguishable, but one coin is fair (Pr [H] = 0.5); the other coin is 0.75-biased (Pr [H] = 0.75). I pull one of the two coins out at random.
• My prior distribution is that there is a 50% chance I’m holding the fair coin, and a 50% chance I’m holding the biased coin. (That is, Pr[biased] = Pr[fair] = 0.5.)
I flip the coin that I’m holding. It comes up heads.
• The evidence is the Heads flip.
Because the biased coin is more likely to produce Heads flips than the fair coin is (and we saw Heads), this evidence should make us view it as more likely that the coin that I’m holding is the biased coin. Let’s compute my posterior probability:
• The posterior probability of an event is the probability of that event conditioned on the observed evidence. So we wish to compute Pr[biased|H]:
Pr[biased|H] = Pr[H|biased] · Pr[biased] / Pr[H]    (Bayes’ Rule)
= Pr[H|biased] · Pr[biased] / (Pr[H|biased] · Pr[biased] + Pr[H|fair] · Pr[fair])    (Law of Total Probability)
= 0.75 · Pr[biased] / ((0.75 · Pr[biased]) + (0.5 · Pr[fair]))    (the given biases of the coins: 0.75 for biased, 0.5 for fair)
= (0.75 · 0.5) / ((0.75 · 0.5) + (0.5 · 0.5))    (Pr[biased] = Pr[fair] = 0.5, as defined by the prior)
= 0.375 / (0.375 + 0.25) = 0.6.
So the posterior probability is Pr[biased|H] = 0.6 and Pr[fair|H] = 0.4.
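The prior-to-posterior calculation in Examples 10.27 and 10.28 follows one pattern: multiply each hypothesis’s prior by the likelihood of the evidence under that hypothesis, then divide by the total. Here is a minimal sketch of that pattern; the function name posterior and the dictionary-based interface are assumptions made for illustration.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Update a prior over mutually exclusive, exhaustive hypotheses.

    prior:      dict mapping hypothesis -> Pr[hypothesis]
    likelihood: dict mapping hypothesis -> Pr[evidence | hypothesis]
    Returns a dict mapping hypothesis -> Pr[hypothesis | evidence].
    """
    # Law of Total Probability gives Pr[evidence].
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    # Bayes' Rule for each hypothesis.
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Example 10.28: one Heads flip from a coin that is fair or 0.75-biased.
coins = posterior(
    prior={"biased": Fraction(1, 2), "fair": Fraction(1, 2)},
    likelihood={"biased": Fraction(3, 4), "fair": Fraction(1, 2)},
)
print(coins["biased"], coins["fair"])   # 3/5 2/5

# Example 10.27: Pr[phil | not CS] = 3/38 is not stated in the example; it is
# derived here from 0.10 = 0.5 * 0.05 + x * 0.95.
alice = posterior(
    prior={"CS": Fraction(1, 20), "not CS": Fraction(19, 20)},
    likelihood={"CS": Fraction(1, 2), "not CS": Fraction(3, 38)},
)
print(alice["CS"])   # 1/4
```

Because the output of one update is itself a distribution, it can be fed back in as the prior for the next piece of evidence, for instance a second coin flip.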
Taking it further: The idea of Bayesian reasoning is used frequently in many applications of computer science—any time a computational system weighs various pieces of evidence in deciding what kind of action to take in a particular situation. One of the most noticeable examples of this type of reasoning occurs in Bayesian spam filters; see p. 1037 for more.

Computer Science Connections
Speech Recognition, Bayes’ Rule, and Language Models
A software system for speech recognition must solve the following problem: given an audio stream S of spoken English as input, produce as output a transcript W of the words in S. There will be many candidate transcripts of S, and generally the task of the system is to produce the most likely sequence of words given the audio stream; that is, to find the W∗ maximizing Pr[W∗|S].
Using Bayes’ Rule, we can rephrase Pr[W∗|S] into an expression that’s easier to understand:
the W∗ maximizing Pr[W∗|S]
= the W∗ maximizing Pr[S|W∗] · Pr[W∗] / Pr[S]    (Bayes’ Rule)
= the W∗ maximizing Pr[S|W∗] · Pr[W∗].    (Pr[S] is the same for each W∗)
Figure 10.20: A spectrogram representation of an audio stream: the x-axis represents time, the y-axis represents frequency, and the darkness of the shading denotes the intensity of sound at that particular frequency at that particular time. (See p. 234 for more discussion.) The task is to turn this representation into its most probable sequence of words; in this case, the sentence “I prefer agglomerative clustering.”
Thus there are two valuable sources of data for evaluating a candidate W:
• Pr[S|W], the likelihood of the observation: the probability that this sound stream would have been produced if W were the sequence of words; and
• Pr[W], the probability of this output: the probability of this sequence of words being uttered at all.
For example, even if the audio stream is a better acoustic match for the phrase whirled Siri string, you’d want your system to prefer the phrase World Series ring, because an English speaker is far more likely to say the latter phrase than the former. (That is, Pr[World Series ring] is much higher than Pr[whirled Siri string].) Of course, we still must take into account the audio stream S; otherwise, regardless of the audio, we’d end up with a system that produced precisely the same output sentence (the most common sentence in English: I’m sorry!, or whatever it is) for any input sound stream.
Generally speaking, the quantity Pr[S|W] would be estimated by an acoustic model of the vocal tract: if I’m trying to say Camp Utah seance, what is the probability that I produce a particular stream S of sounds?
The quantity Pr[W] is estimated by what’s called a language model. We would acquire a large collection of English text, and then try to use that data to estimate how likely a particular sequence is. The simplest language model is the unigram model:
• from a giant data set with N total words, for each word w we count up the number of times n(w) that w appears.
• if W = w1, w2, …, wk, we estimate Pr[W] as (n(w1)/N) · (n(w2)/N) · ··· · (n(wk)/N).
A more complex language model might use bigrams (two-word sequences) instead; we count the number of occurrences of wi, wi+1 consecutively in the giant data set, and estimate Pr[W] based on these counts. Other more complex language models are used in real systems.6 There’s also a great deal of complication with avoiding overfitting of the language model to the training data. (In addition to speech recognition, a variety of other natural language processing problems are generally solved with the same general approach.)
For much more, see
6 Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, 2nd edition, 2008.
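The unigram estimate described above is short enough to write out directly. The following is a toy sketch only: the eight-word “corpus” and the function name unigram_model are invented for illustration, and a real system would use a vastly larger data set along with the smoothing and overfitting safeguards mentioned in the text.

```python
from collections import Counter
from fractions import Fraction

def unigram_model(corpus_words):
    """Build a unigram estimator of Pr[W] from a list of corpus words."""
    counts = Counter(corpus_words)   # n(w) for each word w
    total = len(corpus_words)        # N, the total number of words

    def prob(sentence_words):
        """Estimate Pr[W] as the product of n(w_i)/N over the words of W."""
        p = Fraction(1)
        for w in sentence_words:
            p *= Fraction(counts[w], total)
        return p

    return prob

# Toy corpus (hypothetical): N = 8, n("the") = 3, n("cat") = 1.
prob = unigram_model("the cat sat on the mat the end".split())
print(prob(["the", "cat"]))   # 3/64
print(prob(["the", "dog"]))   # 0, because "dog" never appears in the corpus
```

The second result shows why an unsmoothed model is fragile: a single unseen word drives the whole estimate to zero, no matter how plausible the rest of the sentence is.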

Computer Science Connections
Bayesian Modeling and Spam Filtering
There are, it’s estimated, a few hundred billion email messages sent on earth per day. Of those, a significant fraction are unsolicited, unwanted bulk messages; that is, what’s commonly known as spam. Somewhere between 50% and 95% of emails are currently spam. (It’s hard to be precise; statistics and definitions of spam vary, and there’s change over time as certain spammers are shut down, or not.)
See statistics on email and spam produced by the Radicati Group, for example: www.radicati.com.
The basic idea of a spam filter is to estimate the probability that a particular message m is spam. The email client, or possibly the individual user, can choose a threshold p; a message m for which Pr[m is spam] ≥ p is placed into a spam folder. The choice of p depends on the user’s relative concern about false positives (nonspam messages that end up being incorrectly treated as spam) versus false negatives (spam messages that end up being incorrectly left in the inbox).
So, how might a spam filter actually make its decisions? Here’s one approach, based fundamentally on Bayes’ Rule. Consider a message consisting of words w1, w2, . . . , wn; we must compute Pr[spam|w1, w2, . . . , wn]. Using Bayes’ Rule, we turn around this probability:
Pr[spam|w1, w2, . . . , wn] = Pr[w1, w2, . . . , wn|spam] · Pr[spam] / Pr[w1, w2, . . . , wn].
And, by the law of total probability (every message is either spam or not spam), we can further rewrite this probability as
Pr[spam|w1, w2, . . . , wn] = Pr[w1, w2, . . . , wn|spam] · Pr[spam] / (Pr[w1, w2, . . . , wn|spam] · Pr[spam] + Pr[w1, w2, . . . , wn|not spam] · Pr[not spam]).
That is, we want to know: what is the probability that the sequence of words w1, . . . , wn would have been generated in a spam message, relative to the probability that w1, . . . , wn would have been generated in a spam or nonspam message? (These “relative probabilities” are weighted by the background probability of spam-vs.-nonspam messages.)
A naïve Bayes classifier uses an additional assumption: that the appearance of every word in an email is an independent event. That is, we’re going to estimate Pr[w1, w2, . . . , wn] as if the probability of each wi appearing does not depend on any other word appearing. (Obviously that assumption isn’t right: the probability of the word MORTGAGE appearing is not independent of the probability of the word RATE appearing, in either spam or nonspam.)
Pr[w1, w2, . . . , wn|spam] ≈ Pr[w1|spam] · Pr[w2|spam] · ··· · Pr[wn|spam].
Thus a naïve Bayes classifier estimates the probability of a message being generated as spam by multiplying a measure of “how spammy” each word is. A spam filter would still need to have two numbers associated with each word wi, namely Pr[wi|spam] and Pr[wi|nonspam]. We can estimate these numbers from a training set of spam/nonspam emails, with some sort of “smoothing” mechanism to improve our estimate of the spamminess of a word that doesn’t appear in any of the training emails.7
It’s a good test of your probabilistic intuition to ask: supposing that we have a spam filter that correctly classifies 90% of email messages as spam/nonspam, and 95% of email messages are spam, what fraction of email in your inbox is nonspam? The answer, by Bayes’ Rule:
Pr[nonspam|inbox] = Pr[inbox|nonspam] · Pr[nonspam] / (Pr[inbox|nonspam] · Pr[nonspam] + Pr[inbox|spam] · Pr[spam])
= (0.9 · 0.05) / (0.9 · 0.05 + 0.1 · 0.95)
= 0.045 / (0.045 + 0.095)
= 0.3214 · · · .
In other words, a full two thirds of messages in your inbox would be spam!
For more about the training of these estimates, and about text classification (the broader version of the problem that we’re trying to solve in spam filtering), again see:
7 Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, 2nd edition, 2008.
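To make the naïve Bayes recipe concrete, here is a rough sketch of a trainer and scorer. Several choices in it are assumptions rather than anything the text prescribes: the “smoothing” is plain add-one (Laplace) smoothing, the training set is a four-message toy example, sums of logarithms are used in place of products to avoid numerical underflow, and the function names train and spam_probability are invented. It also assumes the training set contains at least one spam and one nonspam message.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (list_of_words, is_spam) pairs. Returns a model tuple."""
    word_counts = {True: Counter(), False: Counter()}
    label_counts = Counter()
    for words, is_spam in messages:
        label_counts[is_spam] += 1
        word_counts[is_spam].update(words)
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, label_counts, vocab

def spam_probability(model, words):
    """Estimate Pr[spam | words] under the naive independence assumption."""
    word_counts, label_counts, vocab = model
    total_msgs = sum(label_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Start from the log of the background probability Pr[label] ...
        score = math.log(label_counts[label] / total_msgs)
        for w in words:
            # ... and add log Pr[w | label], with add-one smoothing so that
            # unseen words get a small nonzero probability (an assumed choice).
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        log_score[label] = score
    # Normalize as in the law-of-total-probability denominator.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    nonspam = math.exp(log_score[False] - top)
    return spam / (spam + nonspam)

# Toy training data, invented for illustration.
model = train([
    (["cheap", "mortgage", "rate"], True),
    (["mortgage", "offer", "now"], True),
    (["meeting", "tomorrow", "morning"], False),
    (["lunch", "tomorrow"], False),
])
print(spam_probability(model, ["mortgage", "rate"]))
print(spam_probability(model, ["meeting", "tomorrow"]))
```

A filter would then compare the returned estimate against the threshold p described above; with this toy data the first message scores well above 0.5 and the second well below it.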

10.3.4 Exercises
Choose one of the 12 months of the year uniformly at random. (That is, choose a number uniformly from the set {1, 2, . . . , 12}.) Indicate whether the following pairs of events are independent or dependent. Justify your answers.
10.53 “The month number is even” and “the month number is divisible by 3.”
10.54 “The month number is even” and “the month number is divisible by 5.”
10.55 “The month number is even” and “the month number is divisible by 6.”
10.56 “The month number is even” and “the month number is divisible by 7.”
We flip a fair coin 6 times. Which of these events are independent or dependent? Justify your answers.
10.57 “The number of heads is even” and “the number of heads is divisible by 3.”
10.58 “The number of heads is even” and “the number of heads is divisible by 4.”
10.59 “The number of heads is even” and “the number of heads is divisible by 5.”
10.60 We flip three fair coins, called a, b, and c. Are the events “The number of heads in {a, b} is odd” and “The number of heads in {b, c} is odd” independent or dependent?
10.61 How (if at all) would your answer to the previous exercise change if the three coins are p-biased? (That is, assume Pr[a = H], Pr[b = H], and Pr[c = H] are all equal to p.)
Consider the list of words and the events in Figure 10.21. Choose a word at random from this list. Which of these pairs of events are independent? For the pairs that are dependent, indicate whether the events are positively or negatively correlated. Justify your answers.
10.62 A and B
10.63 A and C
10.64 B and C
10.65 A and D
10.66 A and E
10.67 A ∩ B and E
10.68 A ∩ C and E
10.69 A ∩ D and E
Figure 10.21: A word list from which we choose a random word, and some events.
(a) Some words: ABIDES, BASES, CAJOLED, DATIVE, EXUDE, FEDORA, GASOLINES, HABANERO
(b) Some events:
A: “the first letter of the word is a consonant.”
B: “the second letter of the word is a consonant.”
C: “the second letter of the word is a vowel.”
D: “the last letter of the word is a consonant.”
E: “the word has even length.”
Let A and B be arbitrary events in a finite sample space.
10.70 Prove that if Pr[B] = 0, then A and B are independent.
10.71 Prove that A and B are independent if and only if A and B̄ are independent.
A substitution cipher (see Example 10.16) is a simple cryptographic scheme in which we choose a permutation π of the alphabet, and replace each letter i with πi. (Decryption is the same process, but backward: replace πi by i.) However, substitution ciphers are susceptible to frequency analysis, in which an eavesdropper who observes the encrypted message (the ciphertext) infers that the most common letter in the ciphertext probably corresponds to the most common letter in English text (the letter E), the second-most common to the second-most common (T), and so on.
10.72 (programming required) Write a program that generates a random permutation π of the alphabet, and encrypts a given input text using π. (Leave all non-alphabetic characters unchanged.)
10.73 (programming required) Write a program that takes a text as input, converts it to upper case, and produces as output a vector ⟨fA, fB, . . . , fZ⟩, where f• is the fraction of letters in the input text that are the letter •. (So f will be a probability distribution over the alphabet.)
10.74 (programming required) Write a program that, given a reference text and a text encrypted with an unknown substitution cipher, attempts to decrypt by mapping the most common encrypted letters, in order, to the most common reference letters. You can find useful reference files (for example, the complete works of Shakespeare) from Project Gutenberg, http://www.gutenberg.org/.
A Caesar cipher is a special kind of substitution cipher in which the permutation π is generated by choosing a numerical shift s and moving all letters s steps forward in the alphabet, wrapping back to the beginning of the alphabet as necessary. (For example, with a shift of 5, A → F and W → B.)
10.75 (programming required) Write a Caesar cipher encryption program that encrypts a given input text file with a randomly chosen shift in {0, 1, . . . , 25}.
10.76 (programming required) If you run your decryption program from Exercise 10.74 on Caesar-ciphered text, you’ll find that your program generally doesn’t work perfectly. Write a Caesar-cipher-decrypting program that takes advantage of the fact that every letter is shifted by the same amount. Find the most probable s: the s that minimizes the difference in the probabilities of each letter from the reference text and the deciphered text. That is, minimize ∑i |f′i − fi+s|, where f comes from the ciphertext and f′ comes from the reference text.

Flip n fair coins. For any two distinct indices i and j with 1 ≤ i < j ≤ n, define the event Ai,j as
Ai,j := (the ith coin flip came up heads) XOR (the jth coin flip came up heads).
For example, for n = 4 and the outcome ⟨T,T,H,H⟩, the events A1,3, A1,4, A2,3, and A2,4 all occur; A1,2 and A3,4 do not. Thus, from n independent coin flips, we’ve defined Ω(n^2) different events: (n choose 2), to be specific. In the next few exercises, you’ll show that these (n choose 2) events are pairwise independent, but not fully independent.
10.77 Let i and j > i be arbitrary. Show that Pr[Ai,j] = 1/2.
10.78 Let i and j > i be arbitrary, and let i′ and j′ > i′ be arbitrary. Show that any two distinct events Ai,j and Ai′,j′ are independent. That is, show that Pr[Ai,j|Ai′,j′] = Pr[Ai,j|Āi′,j′] = 1/2 if {i,j} ≠ {i′,j′}.
10.79 Show that there is a set of three distinct A events that are not mutually independent. That is, identify three events Ai,j, Ai′,j′, and Ai′′,j′′ where the sets {i,j}, {i′,j′}, and {i′′,j′′} are all different (though not necessarily disjoint). Then show that if you know the value of Ai,j and Ai′,j′, the probability of Ai′′,j′′ ≠ 1/2.
Suppose that you have the dominoes in Figure 10.22, and you shuffle them and draw one domino uniformly at random. (More specifically, you choose any particular domino with probability 1/12. After you’ve chosen the domino, you choose an orientation, with a 50–50 chance of either side pointing to the left.) What are the following conditional probabilities? (“Even total” means that the sum of the two halves of the domino is even. “Doubles” means that the two halves are the same.)
10.80 Pr[even total|doubles]
10.81 Pr[doubles|even total]
10.82 Pr[doubles|at least one ]
10.83 Pr[at least one |doubles]
10.84 Pr[total ≥ 7|doubles]
10.85 Pr[doubles|total ≥ 7]
10.86 Pr[even total|total ≥ 7]
10.87 Pr[doubles|left half of drawn domino is ]
Figure 10.22: Some dominoes.
10.88 Suppose A and B are mutually exclusive events; that is, A ∩ B = ∅. Prove or disprove the following claim: A and B cannot be independent.
10.89 Let A and B be two events such that Pr[A|B] = Pr[B|A]. Which of the following is true? (a) A and B must be independent; (b) A and B must not be independent; or (c) A and B may or may not be independent (there’s not enough information to tell). Justify your answer briefly.
Suppose, as we have done throughout the chapter, that h : K → {1, . . . , n} is a random hash function.
10.90 Suppose that there are currently k cells in the array that are occupied. Consider a key x ∈ K not currently stored in the hash table. What is the probability that the cell h(x) into which x hashes is empty?
10.91 Suppose that you insert n distinct values x1, x2, . . . , xn into an initially empty n-slot hash table. What is the probability that there are no collisions? (Hint: if the first i elements have had no collisions, what is the probability that the (i + 1)st hashed element does not cause a collision? Use Theorem 10.2 and Exercise 10.90.)
There’s a disease BCF (“base-case failure”) that afflicts a small but very unfortunate fraction of the population. One in a thousand people in the population have BCF. Explain your answers to the following questions:
10.92 Doctor Genius has invented a BCF-detection test. Her test, though, isn’t perfect:
• it has false negatives: if you do have BCF, then her test says that you’re not sick with probability 0.01.
• it has false positives: if you don’t have BCF, then her test says that you’re sick with probability 0.03.
What is the probability p that Dr. Genius gives a random person x an erroneous diagnosis?
10.93 “Doctor” Quack has invented a BCF-detection test, too. He was a little confused by the statement “one in a thousand people in the population have BCF,” so his test is this: no matter who the patient is, with probability 1/1000 report “sick” and with probability 999/1000 report “not sick.” What is p now?
Alice wishes to send a 3-bit message 011 to Bob, over a noisy channel that corrupts (flips) each transmitted bit independently with some probability. To combat the possibility of her transmitted message differing from the received message, she adds a parity bit to the end of her message (so that the transmitted message is 0110). [Bob checks that he receives a message with an even number of 1s, and if so interprets the first three received bits as the message.]
10.94 Assume that each bit is flipped with probability 1%. Conditioned on receiving a message with an even number of 1s, what is the probability that the message Bob received is the message that Alice sent?
10.95 What if the probability of error is 10% per bit?
Suppose, as in Example 10.28, I have two coins: one fair and one p-biased. I pull one uniformly at random from an opaque bag, and flip it. What is Pr[I pulled the biased coin|the following observed flips]? Justify your answers.
10.96 p = 2/3, and I observe a single Heads flip.
10.97 p = 3/4, and I observe the flip sequence HHHT.
10.98 p = 3/4, and I observe the flip sequence HTTTHT.

A Bloom filter is a probabilistic data structure designed to store a set of elements from a universe U, allowing very quick query operations to determine whether a particular element has been stored.8 Specifically, it supports the operations Insert(x), which adds x to the stored set, and Lookup(x), which reports whether x was previously stored. But, unlike most data structures for this problem, we will allow ourselves to (occasionally) make mistakes in lookups, in exchange for making these operations fast.
Here’s how a Bloom filter works. We will choose k different hash functions h1, …, hk : U → {1, …, m}, and we will maintain an array of m bits, all initially set to zero. The operations are implemented as follows:
• To insert x into the data structure, we set the k slots h1(x), h2(x), …, hk(x) of the array to 1. (If any of these slots was already set to 1, we leave it as a 1.)
• To look up x in the data structure, we check that the k slots h1(x), h2(x), …, hk(x) of the array are all set to 1. If they’re all 1s, we report “yes”; if any one of them is a 0, we report “no.”
For an example, see Figure 10.23. Note that there can be a false positive in a lookup: if all k slots corresponding to a query element x happen to have been set to 1 because of other insertions, then x will incorrectly be reported to be present.
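The two operations just described translate almost line for line into code. Below is a minimal sketch, using the two hash functions given in the caption of Figure 10.23 (h1(x) = x mod 13 + 1 and h2(x) = x² mod 13 + 1); the class name and method names are chosen here for illustration, and a practical Bloom filter would use far better hash functions and a packed bit array.

```python
class BloomFilter:
    """An m-slot Bloom filter; slots are numbered 1..m, as in the text."""

    def __init__(self, m, hash_functions):
        self.bits = [0] * (m + 1)             # index 0 is unused
        self.hash_functions = hash_functions

    def insert(self, x):
        # Set all k slots h1(x), ..., hk(x) to 1 (slots already set stay 1).
        for h in self.hash_functions:
            self.bits[h(x)] = 1

    def lookup(self, x):
        # Report "yes" only if every one of the k slots is set to 1.
        return all(self.bits[h(x)] == 1 for h in self.hash_functions)

# The k = 2 hash functions from the caption of Figure 10.23.
h1 = lambda x: x % 13 + 1
h2 = lambda x: (x * x) % 13 + 1

bf = BloomFilter(13, [h1, h2])
bf.insert(3)    # sets slots 4 and 10
bf.insert(7)    # sets slots 8 and 11
print(bf.lookup(3), bf.lookup(15), bf.lookup(10))   # True False True
```

The lookup for 10 returns True even though 10 was never inserted: its slots are 11 and 10, both of which were set by the insertions of 7 and 3. That is exactly the false positive described above. A lookup can never return False for an element that was inserted, because bits are only ever set, never cleared.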
As usual, we treat each of the k hash functions as independently assigning each element of U to a uniformly chosen slot of the array. Suppose that we have an m-slot Bloom filter, with k independent hash functions, and we insert n elements into the data structure.
10.99 Suppose we have k = 1 hash functions, and we’ve inserted n = 1 element into the Bloom filter. Consider any particular slot of the m-slot table. What is the probability that this particular slot is still set to 0? (That is, what is the probability that this slot is not the slot set to 1 when the single element was inserted?)
10.100 Let the number k of hash functions be an arbitrary number k ≥ 1, but continue to suppose that we’ve inserted only n = 1 element in the Bloom filter. What is the probability a particular slot is still set to 0 after this insertion?
10.101 Let the number k of hash functions be an arbitrary number k ≥ 1, and suppose that we’ve inserted an arbitrary number n ≥ 1 of elements into the Bloom filter. What is the probability a particular slot is still set to 0 after these insertions?
Define the false-positive rate of a Bloom filter (with m slots, k hash functions, and n inserted elements) to be the probability that we incorrectly report that y is in the table when we query for an uninserted element y.
8 Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, July 1970.
9 Prosenjit Bose, Hua Guo, Evangelos Kranakis, Anil Maheshwari, Pat Morin, Jason Morrison, Michiel Smid, and Yihui Tang. On the false-positive rate of Bloom filters. Information Processing Letters, 108(4):210–213, 2008; and Ken Christensen, Allen Roginsky, and Miguel Jimeno. A new analysis of the false positive rate of a Bloom filter. Information Processing Letters, 110:944–949, 2010.

Figure 10.23: An example of a Bloom filter with k = 2 hash functions: $h_1(x) = x \bmod 13 + 1$ and $h_2(x) = x^2 \bmod 13 + 1$.

(a) The table initially; after inserting 3; and after inserting 7. Note $h_1(3) = 4$, $h_2(3) = 10$, $h_1(7) = 8$, and $h_2(7) = 11$.

slot                1  2  3  4  5  6  7  8  9 10 11 12 13
initially           0  0  0  0  0  0  0  0  0  0  0  0  0
after inserting 3   0  0  0  1  0  0  0  0  0  1  0  0  0
after inserting 7   0  0  0  1  0  0  0  1  0  1  1  0  0

(b) Testing for 3 (yes!), 15 (no!), and 10 (yes!?!). Note $h_1(15) = 3$, $h_2(15) = 5$, $h_1(10) = 11$, and $h_2(10) = 10$, so 10 is a false positive.

slot                1  2  3  4  5  6  7  8  9 10 11 12 13
testing for 3       0  0  0  1  0  0  0  1  0  1  1  0  0
testing for 15      0  0  0  1  0  0  0  1  0  1  1  0  0
testing for 10      0  0  0  1  0  0  0  1  0  1  1  0  0

For many years (starting with Bloom’s original paper about Bloom filters), people in computer science believed that the false-positive rate was precisely $p^k$, where p = (1 − [your answer to Exercise 10.101]). The justification was the following. Let $B_i$ denote the event “slot $h_i(y)$ is occupied.” We have a false positive if and only if $B_1, B_2, \ldots, B_k$ are all true. Thus
\[ \text{the false-positive rate} = \Pr[B_1 \text{ and } B_2 \text{ and } \cdots \text{ and } B_k]. \]
You showed in the previous exercise that $\Pr[B_i] = p$. Everything up until here is correct; the next step in the argument, however, was not! Therefore, because the $B_i$ events are independent,
\[ \text{the false-positive rate} = \Pr[B_1 \text{ and } B_2 \text{ and } \cdots \text{ and } B_k] = \Pr[B_1] \cdot \Pr[B_2] \cdots \Pr[B_k] = p^k. \]
But it turns out that Bi and Bj are not independent!9 (This error is a prime example of how hard it is to have perfect intuition about probability!)
10.102 Let m = 2, k = 2, and n = 1. Compute by hand the false-positive rate. (Hint: there are “only” 16 different outcomes, each of which is equally likely: the random hash functions assign values in {1,2} to h1(x), h2(x), h1(y), and h2(y). In each of these 16 cases, determine whether a false positive occurred.)
10.103 Compute $p^2$, the answer you would have gotten by using
false-positive rate = (1 − [your answer to Exercise 10.101])$^2$.
Which is bigger: $p^2$ or [your answer to Exercise 10.102]? In approximately one paragraph, explain the difference, including an explanation of why the events $B_1$ and $B_2$ are not independent.
10.104 While the actual false-positive rate is not exactly $p^k$, it turns out that $p^k$ is a very good approximation to the false-positive rate as long as m is sufficiently big and k is sufficiently small. Write a program that creates a Bloom filter with m = 1,000,000 slots and k = 20 hash functions. Insert n = 100,000 elements, and estimate the false positive probability by querying for n additional uninserted elements $y \notin X$. What is the false-positive rate that you observe in your experiment? How does it compare to $p^k$?

10.4 Random Variables and Expectation
Acts of sacrifice, charity and penance are not to be given up but should be performed. . . . All these activities should be performed without any expectation of result.
Bhagavad Gita 18:5–6
Thus far, we have been considering whether or not something occurs—that is, using the language of probability, we have been interested in events. But often we will also be interested in how many? questions and not just did it or did it not? questions. How many heads came up in 1000 coin flips? How many times do we have to flip a coin before
it comes up heads for the 1000th time? For a randomly ordered array A[1 . . . n] of the integers {1, . . . , n}, for how many indices i is A[i] < A[i + 1]? To address these types of questions, we will introduce the concept of a random variable, which measures some numerical quantity that varies from outcome to outcome. We will also consider the expectation of a random variable, which is the value of that variable averaged over all of the outcomes in the sample space.

10.4.1 Random Variables
We begin with the definition of a random variable itself:

Definition 10.11 (Random variable)
A random variable X assigns a numerical value to every outcome in the sample space S. (In other words, a random variable is a function $X : S \to \mathbb{R}$.)

Warning! A “random variable” is one of the worst-named concepts in this entire book. A random variable is not a variable; rather, it’s a function that maps each outcome to a numerical value. But everyone calls it a random variable, so that’s what we’ll call it, too.

Here are a few simple examples:

Example 10.29 (Counting heads in 3 flips)
Suppose that we flip a fair coin independently, three times. (Then the sample space is $S = \{H, T\}^3$, and $\Pr[x] = \frac{1}{8}$ for any $x \in S$.) Define the random variables
X = the number of heads
Y = the number of initial consecutive tails.
These random variables take on the following values:

X(HHH) = 3    Y(HHH) = 0
X(HHT) = 2    Y(HHT) = 0
X(HTH) = 2    Y(HTH) = 0
X(HTT) = 1    Y(HTT) = 0
X(THH) = 2    Y(THH) = 1
X(THT) = 1    Y(THT) = 1
X(TTH) = 1    Y(TTH) = 2
X(TTT) = 0    Y(TTT) = 3.

Example 10.30 (Word length, and number of vowels)
Select a word from the sample space {Now, is, the, winter, of, our, discontent} by choosing word w with probability proportional to the number of letters in w, as in Example 10.5. Define a random variable L to denote the number of letters in the word chosen. Thus L(discontent) = 10 and L(winter) = 6, for example.
We can also define a random variable V to denote the number of vowels in the word chosen. Thus V(discontent) = 3 and V(winter) = 2, for example. Here are the values for these two random variables for each outcome in the sample space:

w            Pr[w]    L(w)   V(w)
Now          3/29     3      1
is           2/29     2      1
the          3/29     3      1
winter       6/29     6      2
of           2/29     2      1
our          3/29     3      2
discontent   10/29    10     3

Although it’s an abuse of notation, often we just write X to denote the value of a random variable X for a realization chosen according to the probability distribution Pr. (So we might write “X = 3 with probability $\frac{1}{8}$” or “there are L letters in the chosen word.”)

We can state probability questions about events based on random variables, as the following example illustrates:

Example 10.31 (More word length and vowel counts)
Choose a word as in Example 10.30. Define L as the number of letters in the word, and define V as the number of vowels in the word. Then Pr[L = 3] denotes the probability that we choose an outcome w for which L(w) = 3. (In other words, L = 3 denotes the event {w : L(w) = 3}.) Thus (see the table in Example 10.30)
\[ \Pr[L = 3] = \Pr[\{\text{Now}, \text{the}, \text{our}\}] = \tfrac{9}{29} \qquad \Pr[V = 3] = \Pr[\{\text{discontent}\}] = \tfrac{10}{29}. \]

We will also abuse notation by performing arithmetic on random variables (remember, these are functions!): for two random variables X and Y, we write X + Y as a new random variable that, for any outcome x, denotes the sum of X(x) and Y(x). We will interpret similarly any other arithmetic expression that involves random variables. (The notational analogue here is writing “sin + cos” to denote the function f(x) = sin(x) + cos(x).) Here’s a small example:

Example 10.32 (Number of consonants)
We can express the number of consonants in the randomly chosen word from our running example (see Example 10.30) as L − V. For example, L − V = 1 when the chosen word is our, and L − V = 4 when the chosen word is winter.
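Because a random variable is just a function on outcomes, the running word example translates almost directly into code. The sketch below is illustrative only: the names `pr`, `L`, `V`, `prob`, and `consonants` are chosen for this example, and vowels are assumed to be the letters a, e, i, o, u.

```python
from fractions import Fraction

WORDS = ["Now", "is", "the", "winter", "of", "our", "discontent"]
TOTAL_LETTERS = sum(len(w) for w in WORDS)  # 29 letters in the sentence


def pr(w):
    """Probability of choosing word w: proportional to its number of letters."""
    return Fraction(len(w), TOTAL_LETTERS)


def L(w):
    """Random variable: the number of letters in the chosen word."""
    return len(w)


def V(w):
    """Random variable: the number of vowels in the chosen word."""
    return sum(1 for ch in w.lower() if ch in "aeiou")


def prob(event):
    """Pr[event], where event is a predicate on outcomes, e.g. lambda w: L(w) == 3."""
    return sum((pr(w) for w in WORDS if event(w)), Fraction(0))


def consonants(w):
    """The random variable L - V, defined pointwise on outcomes."""
    return L(w) - V(w)


if __name__ == "__main__":
    print(prob(lambda w: L(w) == 3))   # Pr[L = 3]
    print(prob(lambda w: V(w) == 3))   # Pr[V = 3]
    print(consonants("our"), consonants("winter"))
```

Writing an event such as "L = 3" as a predicate on outcomes mirrors the set {w : L(w) = 3} used in Example 10.31.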
Indicator random variables
One special type of random variable that will come up frequently is an indicator random variable, which only takes on the values 0 and 1. (Such a random variable “indicates” whether a particular event has occurred.) Here’s a simple example:

Example 10.33 (Indicator random variables in coin flips)
Suppose that we flip three fair coins independently. Let $X_1$ be an indicator random variable that reports whether the first flip came up heads. Similarly, let $X_2$ and $X_3$ be indicator random variables for the second and third flips. Then:

outcome   X1   X2   X3
HHH       1    1    1
HHT       1    1    0
HTH       1    0    1
HTT       1    0    0
THH       0    1    1
THT       0    1    0
TTH       0    0    1
TTT       0    0    0

Note that the total number of heads is given by the random variable $X_1 + X_2 + X_3$.

Independence of random variables
Just as with independence for events, we will often be concerned with whether knowing the value of one random variable tells us something about the value of another. Two random variables X and Y are independent if every two events of the form “X = x” and “Y = y” are independent: for every value x and y, it must be the case that
\[ \Pr[X = x \text{ and } Y = y] = \Pr[X = x] \cdot \Pr[Y = y]. \]
For example:

Example 10.34 (Some independent/dependent random variables)
The random variables $X_2$ and $X_3$ from Example 10.33 (we flip 3 fair coins independently; $X_2$ and $X_3$ indicate whether the 2nd and 3rd flips are heads) are independent. You can check all four possibilities; for example,
\[ \Pr[X_2 = 1 \text{ and } X_3 = 1] = \tfrac{1}{4} = \tfrac{1}{2} \cdot \tfrac{1}{2} = \Pr[X_2 = 1] \cdot \Pr[X_3 = 1] \]
and
\[ \Pr[X_2 = 1 \text{ and } X_3 = 0] = \tfrac{1}{4} = \tfrac{1}{2} \cdot \tfrac{1}{2} = \Pr[X_2 = 1] \cdot \Pr[X_3 = 0]. \]
On the other hand, the random variables X and Y from Example 10.29 (we flip 3 fair coins independently; X is the number of heads and Y is the number of consecutive initial tails) are not independent; for example,
\[ \Pr[X = 3] \cdot \Pr[Y = 3] = \tfrac{1}{8} \cdot \tfrac{1}{8} \quad \text{but} \quad \Pr[X = 3 \text{ and } Y = 3] = 0. \]

10.4.2 Expectation
A random variable X measures a numerical quantity that varies from realization to realization.
We will often be interested in the “average” value of X, which is otherwise known as the random variable’s expectation:

Definition 10.12 (Expectation)
The expectation of a random variable X, denoted E[X], is the average value of X, defined as
\[ E[X] = \sum_{x \in S} X(x) \cdot \Pr[x]. \]
The expectation of X is also sometimes called the mean of X.

We can equivalently write $E[X] = \sum_y \bigl(y \cdot \Pr[X = y]\bigr)$ by summing over each possible value y that X can take on, rather than by summing over outcomes. In other words, E[X] is the average value of X over all outcomes (where the average is weighted, with weights defined by the probability function).

The alternate version of the summation for expectation in Definition 10.12 follows by collecting together each outcome x that has the same value of the random variable X(x):
\[ \sum_{x \in S} X(x) \cdot \Pr[x] = \sum_{y \in \mathbb{R}} \; \sum_{x \in S :\, X(x) = y} y \cdot \Pr[x] = \sum_{y \in \mathbb{R}} y \cdot \sum_{x \in S :\, X(x) = y} \Pr[x] = \sum_{y \in \mathbb{R}} y \cdot \Pr[X = y]. \]

For example:

Example 10.35 (Expectation of a Bernoulli random variable)
Let X be an indicator random variable for a Bernoulli trial with parameter p; that is, X = 1 with probability p and X = 0 with probability 1 − p. Then E[X] is precisely
\begin{align*}
E[X] &= 1 \cdot \Pr[X = 1] + 0 \cdot \Pr[X = 0] && \text{definition of expectation (alternative version)} \\
&= 1 \cdot p + 0 \cdot (1 - p) && \text{definition of a Bernoulli trial with parameter } p \\
&= p.
\end{align*}

Example 10.36 (Counting heads in 3 flips, again)
Problem: Recall Example 10.29, where the random variable X denotes the number of heads in three independent flips of a fair coin. (The sample space was $S = \{H, T\}^3$, and $\Pr[x] = \frac{1}{8}$ for any $x \in S$.) What is E[X]?
Solution: The expectation of X is
\begin{align*}
E[X] &= \sum_{x \in \{H,T\}^3} \Pr[x] \cdot X(x) \\
&= \tfrac{1}{8} X(HHH) + \tfrac{1}{8} X(HHT) + \tfrac{1}{8} X(HTH) + \tfrac{1}{8} X(HTT) + \tfrac{1}{8} X(THH) + \tfrac{1}{8} X(THT) + \tfrac{1}{8} X(TTH) + \tfrac{1}{8} X(TTT) \\
&= \tfrac{1}{8} \cdot \bigl[3 + 2 + 2 + 1 + 2 + 1 + 1 + 0\bigr] \\
&= \tfrac{12}{8} = 1.5.
\end{align*}
In other words, in three flips of a fair coin, we expect 1.5 flips to come up Heads.

Warning! Just because E[X] = 1.5 doesn’t mean that Pr[X = 1.5] is big! (If you ever flip three fair coins and see exactly 1.5 heads, it might be a sign that the world is ending.) Remember that “average” and “typical” aren’t the same thing!

Example 10.37 (Counting letters and vowels, again)
Recall the probabilistic process of choosing a word from the sentence Now is the winter of our discontent in proportion to word length. Recall also the random variables from Example 10.30: L denotes the chosen word’s length, and V the number of vowels in the chosen word. (See Figure 10.24 for a reminder.) Then we have
\[ E[L] = 3 \cdot \tfrac{3}{29} + 2 \cdot \tfrac{2}{29} + 3 \cdot \tfrac{3}{29} + 6 \cdot \tfrac{6}{29} + 2 \cdot \tfrac{2}{29} + 3 \cdot \tfrac{3}{29} + 10 \cdot \tfrac{10}{29} = \tfrac{171}{29} \approx 5.8966 \]
\[ E[V] = 1 \cdot \tfrac{3}{29} + 1 \cdot \tfrac{2}{29} + 1 \cdot \tfrac{3}{29} + 2 \cdot \tfrac{6}{29} + 1 \cdot \tfrac{2}{29} + 2 \cdot \tfrac{3}{29} + 3 \cdot \tfrac{10}{29} = \tfrac{58}{29} = 2. \]

outcome      Pr       L    V
Now          3/29     3    1
is           2/29     2    1
the          3/29     3    1
winter       6/29     6    2
of           2/29     2    1
our          3/29     3    2
discontent   10/29    10   3

Figure 10.24: A reminder of the sample space, probabilities, and random variables for Example 10.37.

Taking it further: If we think about it without a great deal of care, there’s something apparently curious about the result from Example 10.37. We’ve plopped down our thumb on a random letter in the sentence Now is the winter of our discontent, and we’ve computed that the word that our thumb lands on has an average length of about 5.9 letters. That seems a little puzzling, because there are 7 words in the sentence, with an average word length of $\frac{29}{7} \approx 4.1429$ letters. But there’s a good reason for this discrepancy: longer words are more likely to be chosen because they have more letters, and therefore the average word that’s chosen has more letters than average.
An analogous phenomenon occurs in many other settings, too. When you’re driving, you spend most of your time on longer-than-average trips. Most people in Canada live in a larger-than-average-sized Canadian city. Most 3rd-grade students in California are in a larger-than-average-size 3rd-grade class. (In fact, this broader phenomenon is sometimes called the class-size paradox.)
Perhaps even more jarringly, a random person x knows fewer people than the average number of people known by someone x knows; that is, on average, your friends are more popular than you are.10 (Why? A very popular person (call her Oprah) is, by definition, the friend of many people, and therefore Oprah’s astronomical popularity is averaged into the popularity of many people x. In computing the popularity of a randomly chosen person x, Oprah only contributes her popularity once, for x = Oprah, but she contributes it many times to the popularity of x’s friends.)
This phenomenon may illustrate an example of a sampling bias, in which we try to draw a uniform sample from a population but we end up with some kind of bias that overweights some members of the population at the expense of others. Sampling biases are a widespread concern in any statistical approach to understanding a population. For example, consider a telephone-based political poll that collects voters’ preferences for candidates one evening by randomly dialing phone numbers until somebody answers, and records the answerer’s preference. This poll will overweight those people who are sitting around at home during the evening, which correlates with the voter’s age, which correlates with the voter’s political affiliation.

10 Scott L. Feld. Why your friends have more friends than you do. American Journal of Sociology, 96(6):1464–1477, May 1991.

Example 10.38 (Number of aces in a bridge hand)
Problem: Suppose that we are dealt a 13-card hand from a standard 52-card deck. What is the expected number of aces in our hand?
Solution: Later we’ll solve this problem more easily (see Example 10.41), but here we’ll do it the hard way. We’ll compute the probability of getting 0, 1, . . . , 4 aces:
• There are $\binom{52}{13}$ different hands.
• There are $\binom{4}{k} \cdot \binom{48}{13-k}$ hands with exactly k aces. (We have to pick k ace cards from the 4 aces in the deck, and 13 − k non-ace cards from the 48 non-aces.)
Because each hand is equally likely to be chosen, therefore
\[ \Pr[\text{drawing exactly } k \text{ aces}] = \frac{\binom{4}{k} \cdot \binom{48}{13-k}}{\binom{52}{13}}. \]
And thus, letting A be a random variable denoting the number of aces, we have
\begin{align*}
E[A] &= \sum_{h} \Pr[\text{being dealt hand } h] \cdot (\text{number of aces in } h) \\
&= \sum_{i=0}^{4} i \cdot \Pr[A = i] && \text{reordering sum by collecting all hands with the same number of aces} \\
&= \frac{0 \cdot \binom{4}{0} \binom{48}{13} + 1 \cdot \binom{4}{1} \binom{48}{12} + 2 \cdot \binom{4}{2} \binom{48}{11} + 3 \cdot \binom{4}{3} \binom{48}{10} + 4 \cdot \binom{4}{4} \binom{48}{9}}{\binom{52}{13}} \\
&= \frac{0 \cdot 1 \cdot \binom{48}{13} + 1 \cdot 4 \cdot \binom{48}{12} + 2 \cdot 6 \cdot \binom{48}{11} + 3 \cdot 4 \cdot \binom{48}{10} + 4 \cdot 1 \cdot \binom{48}{9}}{\binom{52}{13}} \\
&= \frac{0 + 278{,}674{,}137{,}872 + 271{,}142{,}404{,}416 + 78{,}488{,}590{,}752 + 6{,}708{,}426{,}560}{635{,}013{,}559{,}600} \\
&= \frac{635{,}013{,}559{,}600}{635{,}013{,}559{,}600} = 1.
\end{align*}
That is, the expected number of aces in a 13-card hand is precisely 1.

A useful property of expectation
We’ve now seen several examples of computing the expectation of random variables by directly following the definition of expectation. Here we’ll introduce a transformation that can often make expectation calculations easier, at least for positive integer–valued random variables:

Theorem 10.5 (A new formula for expectation, for nonnegative integers)
Let $X : S \to \mathbb{Z}^{\ge 0}$ be a random variable. Then $E[X] = \sum_{i=1}^{\infty} \Pr[X \ge i]$.

(Note that by definition $E[X] = \sum_{i=0}^{\infty} i \cdot \Pr[X = i]$, so we’re trading the multiplication of i for the replacement of = by ≥.)
The proof will follow by changing the order of summation in the expectation formula. We’ll give an algebraic proof in a moment, but it may be easier to follow the idea by looking at a visualization first. See Figure 10.25.

We can use this theorem to find the expected value of a geometric random variable:

Example 10.39 (Expectation of a geometric random variable)
Let X be a geometric random variable with parameter p. (That is, X measures the number of flips of a p-biased coin before we get Heads for the first time.)
Then E[X] is precisely $\frac{1}{p}$:
\begin{align*}
E[X] &= \sum_{i=1}^{\infty} \Pr[X \ge i] && \text{Theorem 10.5 } \bigl(E[X] = \textstyle\sum_{i=1}^{\infty} \Pr[X \ge i]\bigr) \\
&= \sum_{i=1}^{\infty} \Pr[\text{fail to get heads in } i - 1 \text{ flips}] && \text{definition of geometric random variable} \\
&= \sum_{i=1}^{\infty} (1 - p)^{i-1} && \text{need } i - 1 \text{ consecutive tails flips} \\
&= \sum_{i=0}^{\infty} (1 - p)^{i} && \text{changing index of summation} \\
&= \frac{1}{1 - (1 - p)} = \frac{1}{p}. && \text{formula for geometric summations}
\end{align*}
For example, we expect to flip a fair coin (with $p = \frac{1}{2}$) twice before we get heads.

10.4.3 Linearity of Expectation
Here’s a very useful general property of expectation, called linearity of expectation: the expectation of a sum is the sum of the expectations. (A linear function is a function f that satisfies f(a + b) = f(a) + f(b); for example, f(x) = 3x or f(x) = 0.) The usefulness of Linearity of Expectation will come from the way in which it lets us “break down” a complicated random variable into the sum of a collection of simple random variables. (We can then compute $E[\text{Complicated}] = E[\sum_i \text{Simple}_i] = \sum_i E[\text{Simple}_i]$.)

Theorem 10.6 (Linearity of Expectation)
Consider a sample space S, and let $X : S \to \mathbb{R}$ and $Y : S \to \mathbb{R}$ be any two random variables. Then E[X + Y] = E[X] + E[Y].

We’ll see several useful examples soon, but let’s start with the proof:
Proof. We’ll be able to prove this theorem by just invoking the definition of expectation and following our algebraic noses:
\begin{align*}
E[X + Y] &= \sum_{s \in S} (X + Y)(s) \cdot \Pr[s] && \text{definition of expectation} \\
&= \sum_{s \in S} \bigl[X(s) + Y(s)\bigr] \cdot \Pr[s] && \text{definition of the random variable } X + Y \\
&= \Bigl[\sum_{s \in S} X(s) \cdot \Pr[s]\Bigr] + \Bigl[\sum_{s \in S} Y(s) \cdot \Pr[s]\Bigr] && \text{distributing the multiplication; rearranging} \\
&= E[X] + E[Y]. && \text{definition of expectation}
\end{align*}
Therefore E[X + Y] = E[X] + E[Y], as desired.

Notice that Theorem 10.6 does not impose any requirement of independence on the random variables X and Y: even if X and Y are highly correlated (positively or negatively), we still can use linearity of expectation to conclude that E[X + Y] = E[X] + E[Y].
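As a small sanity check of the remark that linearity needs no independence, the sketch below recomputes expectations exactly over the eight outcomes of Example 10.29, where X (number of heads) and Y (number of initial consecutive tails) are dependent. The helper names `expectation` and `prob` are chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space of Example 10.29: three independent fair flips, each outcome has probability 1/8.
OUTCOMES = ["".join(flips) for flips in product("HT", repeat=3)]
PR = {s: Fraction(1, 8) for s in OUTCOMES}


def X(s):
    """Number of heads in the outcome s."""
    return s.count("H")


def Y(s):
    """Number of initial consecutive tails in the outcome s."""
    return len(s) - len(s.lstrip("T"))


def expectation(rv):
    """E[rv] by the definition: sum over outcomes of rv(s) * Pr[s]."""
    return sum((rv(s) * PR[s] for s in OUTCOMES), Fraction(0))


def prob(event):
    """Pr[event] for a predicate on outcomes."""
    return sum((PR[s] for s in OUTCOMES if event(s)), Fraction(0))


if __name__ == "__main__":
    print(expectation(X), expectation(Y), expectation(lambda s: X(s) + Y(s)))
    # X and Y are dependent: the joint probability differs from the product.
    print(prob(lambda s: X(s) == 3 and Y(s) == 3), prob(lambda s: X(s) == 3) * prob(lambda s: Y(s) == 3))
```

Even though Pr[X = 3 and Y = 3] differs from Pr[X = 3] · Pr[Y = 3], the expectation of the sum still equals the sum of the expectations.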
There are many apparently complicated problems in which using linearity of expectation makes a solution totally straightforward. Here are a few examples:

Example 10.40 (Expectation of a binomial random variable)
Problem: We have a p-biased coin (that is, Pr[heads] = p) that we flip 1000 times. What is the expected number of heads that come up in these 1000 flips?
Solution: The intuition is fairly straightforward: a p-fraction of flips are heads, so we should expect 1000p heads in 1000 flips. But doing the math requires a bit of work.

An abandoned first attempt: Let’s compute the probability that there are exactly k heads in a sequence of 1000 flips, and then apply the definition of expectation directly. There are $\binom{1000}{k}$ sequences of 1000 flips that have exactly k heads, and the probability of any one of these sequences is $p^k (1 - p)^{1000 - k}$, so
\begin{align*}
E[\text{number of heads}] &= \sum_{k=0}^{1000} k \cdot \Pr[\text{number of heads} = k] && \text{definition of expectation} \\
&= \sum_{k=0}^{1000} k \cdot \binom{1000}{k} \cdot p^k \cdot (1 - p)^{1000 - k}. && \text{above analysis of } \Pr[\text{number of heads} = k]
\end{align*}
We could try to simplify this expression (but it turns out to be pretty hard!). Instead, let’s start over with a different approach.

A second try: Here’s a strategy that ends up being much easier. Define 1000 random variables $X_1, X_2, \ldots, X_{1000}$, where $X_i$ is the indicator random variable
\[ X_i = \begin{cases} 1 & \text{if the } i\text{th flip of the coin comes up Heads} \\ 0 & \text{if the } i\text{th flip of the coin comes up Tails.} \end{cases} \]
The total number of heads in the 1000 coin flips is given by the random variable $X = X_1 + X_2 + \cdots + X_{1000}$. We can use this definition of X and linearity of expectation to compute the expected number of heads much more easily:
\begin{align*}
E[\text{number of heads}] = E[X] &= E\Bigl[\sum_{i=1}^{1000} X_i\Bigr] && \text{definition of } X \\
&= \sum_{i=1}^{1000} E[X_i] && \text{linearity of expectation} \\
&= \sum_{i=1}^{1000} p && \text{Example 10.35 (expectation of a Bernoulli variable)} \\
&= 1000p.
\end{align*}

Problem-solving tip: Often, the easiest way to compute an expectation is by finding a way to express the quantity of interest in terms of a sum of indicator random variables.
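The sum from the abandoned first attempt is hard to simplify by hand, but a computer can evaluate it exactly, which gives a quick check that both approaches agree. The sketch below uses exact rational arithmetic; the function names are invented for this illustration, and `math.comb` assumes Python 3.8 or later.

```python
from fractions import Fraction
from math import comb


def expected_heads_by_definition(n, p):
    """Evaluate sum_{k=0}^{n} k * C(n, k) * p^k * (1 - p)^(n - k) exactly."""
    p = Fraction(p)
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))


def expected_heads_by_linearity(n, p):
    """n indicator variables, each with expectation p."""
    return n * Fraction(p)


if __name__ == "__main__":
    p = Fraction(1, 3)  # an arbitrary bias chosen for the demonstration
    print(expected_heads_by_definition(1000, p) == expected_heads_by_linearity(1000, p))
```

The agreement is exact rather than approximate because `Fraction` avoids floating-point rounding.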
Example 10.41 (Number of aces in a bridge hand, better)
Recall Example 10.38, where we showed that the number A of aces in a randomly chosen 13-card hand from a standard 52-card deck has E[A] = 1. Here is a much easier way of solving that problem:
Number your cards from 1 to 13. Let $A_i$ be an indicator random variable that reports whether the ith card in your hand is an ace. Then $A = A_1 + A_2 + \cdots + A_{13}$. Note that $\Pr[A_i = 1] = \frac{1}{13}$ (there are $\frac{4}{52} = \frac{1}{13}$ aces in the deck), so
\begin{align*}
E[A] &= E[A_1 + A_2 + \cdots + A_{13}] \\
&= E[A_1] + E[A_2] + \cdots + E[A_{13}] && \text{linearity of expectation} \\
&= 13 \cdot \tfrac{1}{13} && \Pr[A_i = 1] = \tfrac{1}{13} \text{ as above, and so } E[A_i] = \tfrac{1}{13} \text{ (Example 10.35)} \\
&= 1.
\end{align*}
(The random variables $A_i$ and $A_j$ are correlated, but, again, linearity of expectation doesn’t care! We can still use it to conclude that $E[A_i + A_j] = E[A_i] + E[A_j]$.)

Some examples about hashing
Here are two more problems about expectation, both involving hashing:

Example 10.42 (Hashing)
Problem: Suppose that we hash 1000 elements into a 1000-slot hash table, using a completely random hash function, resolving collisions by chaining. (See Section 10.1.1.) How many empty slots do we expect?
Solution: Let’s compute the probability that some particular slot is empty:
\begin{align*}
\Pr[\text{slot } i \text{ is empty}] &= \Pr[\text{none of the 1000 elements hash to slot } i] \\
&= \Pr[\text{every element } j \in \{1, 2, \ldots, 1000\} \text{ hashes to a slot other than } i] \\
&= \prod_{j=1}^{1000} \Pr[\text{element } j \text{ hashes to a slot other than } i] && \text{elements are hashed independently} \\
&= \prod_{j=1}^{1000} \frac{999}{1000} && \text{elements are hashed uniformly, and there are 999 other slots} \\
&= \Bigl(\frac{999}{1000}\Bigr)^{1000} = 0.3677\cdots.
\end{align*}
We’ll finish with the by-now-familiar calculation that also concluded the last two examples: we define a collection of indicator random variables and use linearity of expectation. Let $X_i$ be an indicator random variable that’s 1 if slot i is empty and 0 if slot i is full. Then the expected number of empty slots is
\[ E\Bigl[\sum_{i=1}^{1000} X_i\Bigr] = \sum_{i=1}^{1000} E[X_i] = 1000 \cdot \Bigl(\frac{999}{1000}\Bigr)^{1000} \approx 367.7. \]
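The calculation in Example 10.42 can be compared against a quick experiment. The sketch below is a hedged illustration, not part of the original analysis: it models a "completely random hash function" by drawing each element's slot with Python's `random` module, and the function names, trial count, and seed are arbitrary choices.

```python
import random


def expected_empty_slots(n_elements, n_slots):
    """n_slots * Pr[a particular slot is empty], by linearity of expectation."""
    return n_slots * (1 - 1 / n_slots) ** n_elements


def simulate_empty_slots(n_elements, n_slots, trials, seed=0):
    """Average number of empty slots over several simulated hash tables."""
    rng = random.Random(seed)
    total_empty = 0
    for _ in range(trials):
        # Each element is sent to an independent, uniformly random slot.
        occupied = {rng.randrange(n_slots) for _ in range(n_elements)}
        total_empty += n_slots - len(occupied)
    return total_empty / trials


if __name__ == "__main__":
    print(expected_empty_slots(1000, 1000))
    print(simulate_empty_slots(1000, 1000, trials=50))
```

The simulated average should land near the computed expectation, though any single run of the experiment will fluctuate around it.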
Taking it further: If we stated the question from Example 10.42 in full generality, we would ask: if we hash n elements into n slots, how many empty slots are there in expectation? Using the same approach as in Example 10.42, we’d find that the fraction of empty slots is, in expectation, $(1 - 1/n)^n$. Using calculus, it’s possible to show that $(1 - 1/n)^n$ approaches $1/e \approx 0.367879$ as $n \to \infty$. So, for large n, we’d expect to have $\frac{n}{e}$ empty slots when we hash n elements into n slots.
We can also turn this hashing problem on its head: we’ve been asking “if we hash n elements into n slots, how many slots do we expect to find empty?” Instead we can ask “how many elements do we expect to have to hash into n slots before all n slots are full?” This problem is called the coupon-collector problem; see Exercises 10.136–10.137 for more.

Let’s also consider a second example about hashing, this time counting the (expected) number of collisions, rather than the (expected) number of empty slots:

Example 10.43 (Expected collisions in a hash table)
Problem: Hash n elements $A = \{x_1, \ldots, x_n\}$ into an m-slot hash table. Recall that a collision between two elements $x_i$ and $x_j$ (for $i \ne j$) occurs when $h(x_i) = h(x_j)$.
1. Consider two elements $x_i \ne x_j$. What’s Pr[there’s a collision between $x_i$ and $x_j$]?
2. What is the expected number of collisions among the elements of A?
Solution:
1. A collision between $x_i$ and $x_j$ occurs precisely when, for some index k, we have $h(x_i) = k$ and $h(x_j) = k$. Thus:
\begin{align*}
\Pr[\text{collision between } x_i \text{ and } x_j] &= \Pr\bigl[[h(x_i) = h(x_j) = 1] \text{ or } [h(x_i) = h(x_j) = 2] \text{ or } \cdots \text{ or } [h(x_i) = h(x_j) = m]\bigr] \\
&= \sum_{k=1}^{m} \Pr[h(x_i) = k \text{ and } h(x_j) = k] && \text{by the sum rule; these events are disjoint} \\
&= \sum_{k=1}^{m} \Pr[h(x_i) = k] \cdot \Pr[h(x_j) = k] && \text{hashing assumption: hash values are independent} \\
&= \sum_{k=1}^{m} \frac{1}{m} \cdot \frac{1}{m} && \text{hashing assumption: hash values are uniform} \\
&= \frac{m}{m^2} = \frac{1}{m}.
\end{align*}
So the probability that a particular pair of elements collides is precisely $\frac{1}{m}$.
2. Given (1), we can again compute the expected number of collisions using indicator random variables and linearity of expectation. The number of collisions between elements of A is precisely the number of unordered pairs $\{x_i, x_j\}$ that collide. For indices i and j > i, then, define $X_{i,j}$ as the indicator random variable
\[ X_{i,j} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ collide} \\ 0 & \text{if they do not.} \end{cases} \]
Thus the expected number of collisions among the elements of A is given by
\[ E\Bigl[\sum_{1 \le i < j \le n} X_{i,j}\Bigr] \]
[…] 1 and A[j] < A[j − 1]:
        swap A[j] and A[j − 1]
        j := j − 1

Note that the number of indicator random variables in this sum is
\[ \sum_{i=2}^{n} \sum_{j=1}^{i-1} 1 = \sum_{i=2}^{n} (i - 1) = \sum_{i=1}^{n-1} i = \frac{(n-1) \cdot n}{2} = \binom{n}{2}. \]
Thus by linearity of expectation we have
\[ E[X] = \binom{n}{2} \cdot E[X_{i,j}] = \binom{n}{2} \cdot \frac{1}{2}. \]

10.4.4 Conditional Expectation
Just as we did with conditional probability in Section 10.3, we can define a notion of conditional expectation: that is, the average value of a random variable X when a particular event occurs.

Definition 10.13 (Conditional expectation)
The conditional expectation of a random variable X given an event E, denoted $E[X \mid E]$, is the average value of X over all outcomes where E occurs:
\[ E[X \mid E] = \sum_{x \in E} X(x) \cdot \Pr[x \mid E]. \]

In the original definition of expectation, we summed over all x in the whole sample space; here we sum only over the outcomes in the event E. Furthermore, here we weight the value of X by $\Pr[x \mid E]$ rather than by $\Pr[x]$. We’ll omit the details, but conditional expectation has analogous properties to those of the original (nonconditional) version of expectation, including linearity of expectation. Here’s a brief example of computing some conditional expectations:

Example 10.46 (Hearts in Poker)
Problem: In Texas Hold ’Em, a particular variant of poker, after a standard deck of cards is randomly shuffled, you are dealt two “personal” cards, and then five “community” cards are dealt. Let P denote the number of your personal cards that are hearts, and let C denote the number of community cards that are hearts. What are the following?
1. $E[P]$
2. $E[C]$
3. $E[C \mid P = 0]$
4. $E[C \mid P = 2]$
Solution:
1 & 2. Each card that’s dealt has a $\frac{13}{52} = \frac{1}{4}$ chance of being a heart. By linearity of expectation, then, $E[P] = \frac{2}{4} = 0.5$ and $E[C] = \frac{5}{4} = 1.25$. (Implicitly, we’re defining indicator random variables for “the ith card is a heart,” so $P = P_1 + P_2$ and $C = C_1 + \cdots + C_5$.)
3. Given that 2 of the 39 non-heart cards were dealt as your personal cards, there are still 13 undealt hearts among the remaining 50 undealt cards. Thus there is a $\frac{13}{50} = 0.26$ chance that any particular undealt card is a heart. Thus, again by linearity of expectation, we have that $E[C \mid P = 0] = 5 \cdot \frac{13}{50} = 1.30$.
4. Similarly, there are 11 undealt hearts among the remaining 50 undealt cards. Thus there is an $\frac{11}{50} = 0.22$ chance that any particular undealt card is a heart, and $E[C \mid P = 2] = 5 \cdot \frac{11}{50} = 1.10$.

We’ll omit the proof, but it’s worth noting a useful property that connects expectation to conditional expectation, an analogy to the law of total probability:

Theorem 10.7 (Law of Total Expectation)
For any random variable X and any event E:
\[ E[X] = E[X \mid E] \cdot \Pr[E] + E[X \mid \overline{E}] \cdot (1 - \Pr[E]). \]
That is, the expectation of X is the (weighted) average of the expectation of X when E occurs and when E does not occur.

Taking it further: One tremendously valuable use of probability is in randomized algorithms, which flip some coins as part of solving some problem. There is a massive variety in the ways that randomization is used in these algorithms, but one example, the computation of the median element of an unsorted array of numbers, is discussed on p. 1060. (We’ll make use of Theorem 10.7.) Median finding is a nice example of a problem for which there is a very simple, efficient algorithm that makes random choices in its solution. (There are deterministic algorithms that solve this problem just as efficiently, but they are much more complicated than this randomized algorithm.)

10.4.5 Deviation from Expectation
Let X be a random variable. By definition, the value of E[X] is the average value that X takes on, where we’re averaging over many different realizations. But how far away from E[X] is X, on average? That is, what is the average difference between (a) X, and (b) the average value of X? We might care about this quantity in applications like political polling or scientific experimentation, for example. Suppose X is a random variable defined as follows:
\[ X = \begin{cases} -1 & \text{the voter will vote for the Democratic candidate} \\ 0 & \text{the voter will vote for neither the Democratic nor Republican candidates} \\ +1 & \text{the voter will vote for the Republican candidate} \end{cases} \]
for a voter chosen uniformly at random from the population. If E[X] < 0, then the Democrat will beat the Republican in the election; if E[X] > 0, then the Republican will beat the Democrat. We might estimate E[X] by calling, say, 500 uniformly chosen voters from the population and averaging their responses. We’d like to know whether our estimate is accurate (that is, if our estimate is close to E[X]). This kind of question is the core of statistical reasoning. We’ll only begin to touch on these questions, but here are a few of the most important concepts.

Definition 10.14 (Variance)
Let X be a random variable. The variance of X is
\[ \operatorname{var}(X) = E\bigl[(X - E[X])^2\bigr]. \]
The standard deviation is $\operatorname{std}(X) = \sqrt{\operatorname{var}(X)}$.
(Exercise: why didn’t we just define std (X) = E [X − E [X]]?) Here’s a simple example:
Example 10.47 (Variance/standard deviation of a Bernoulli random variable)
Let X be the outcome of flipping a p-biased coin. (That is, X is a Bernoulli random variable.) We previously showed that E[X] = p, so the variance of X is
\begin{align*}
\operatorname{var}(X) &= E\bigl[(X - E[X])^2\bigr] && \text{definition of variance} \\
&= E\bigl[(X - p)^2\bigr] && \text{expectation of a Bernoulli random variable (Example 10.35)} \\
&= \Pr[X = 0] \cdot (0 - p)^2 + \Pr[X = 1] \cdot (1 - p)^2 && \text{definition of expectation} \\
&= (1 - p) \cdot (0 - p)^2 + p \cdot (1 - p)^2 && \text{definition of Bernoulli random variable} \\
&= (1 - p) p^2 + p (1 - p)^2 \\
&= (1 - p) p \cdot (p + 1 - p) \\
&= (1 - p) p.
\end{align*}
Thus the standard deviation is $\operatorname{std}(X) = \sqrt{\operatorname{var}(X)} = \sqrt{(1 - p) p}$.
(For example, for a fair coin, the standard deviation is √(1 − 0.5)0.5 = √0.25 = 0.5: an average coin flip is 0.5 units away from the mean 0.5. In fact, every coin flip is that far away from the mean!)
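The algebra in Example 10.47 can be checked mechanically. The following Python sketch computes the variance straight from Definition 10.14 using exact fractions; the helper names (`expectation`, `variance`, `bernoulli`) and the dictionary representation of a distribution are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction

def expectation(dist):
    """E[X] for a finite distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    """var(X) = E[(X - E[X])^2], computed directly from Definition 10.14."""
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def bernoulli(p):
    """A p-biased coin flip: value 1 with probability p, value 0 otherwise."""
    return {0: 1 - p, 1: p}

# Compare the definition against the closed form (1 - p) * p for a few biases.
for p in [Fraction(1, 2), Fraction(1, 3), Fraction(9, 10)]:
    assert variance(bernoulli(p)) == (1 - p) * p
```

Using `Fraction` rather than floating point keeps the comparison exact, so the check passes only if the two expressions agree as rational numbers.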
Here’s another simple example, illustrating the fact that two random variables can have the same mean but wildly different variances:
Example 10.48 (Roulette bets)
Here are two bets available to a player in roulette (see Figure 10.27 for a reminder):
• Bet $1 on “red”: If the spin lands on one of the 18 red numbers, you get $2 back; otherwise you get nothing.
• Bet $1 on “17”: If the spin lands on the number 17, you get $36 back; otherwise you get nothing.
Figure 10.27: A reminder of the roulette outcomes. A number in the set {0, 00, 1, 2, …, 36} is chosen uniformly at random by a spinning wheel; there are 18 red numbers and 18 black numbers; 0 and 00 are neither red nor black.
Let X denote the payoff from playing the first bet, so X = 0 with probability 20/38 and X = 2 with probability 18/38. Let Y denote the payoff from playing the second bet, so Y = 0 with probability 37/38 and Y = 36 with probability 1/38. The expectations match:
\begin{align*}
E[X] &= \tfrac{20}{38} \cdot 0 + \tfrac{18}{38} \cdot 2 = \tfrac{36}{38} \\
E[Y] &= \tfrac{37}{38} \cdot 0 + \tfrac{1}{38} \cdot 36 = \tfrac{36}{38}.
\end{align*}

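But the variances are very different:
\begin{align*}
\mathrm{var}(X) &= \tfrac{20}{38} \cdot \left(0 - \tfrac{36}{38}\right)^2 + \tfrac{18}{38} \cdot \left(2 - \tfrac{36}{38}\right)^2 = 0.9972\cdots \\
\mathrm{var}(Y) &= \tfrac{37}{38} \cdot \left(0 - \tfrac{36}{38}\right)^2 + \tfrac{1}{38} \cdot \left(36 - \tfrac{36}{38}\right)^2 = 33.2077\cdots.
\end{align*}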
Generally speaking, the expectation of a random variable measures “how good it is” (on average), while the variance measures “how risky it is.”
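As a quick check of Example 10.48, here is a small Python sketch that recomputes both expectations and both variances with exact fractions. The variable names `red_bet` and `seventeen_bet` and the two helper functions are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def expectation(dist):
    """E[X] for a finite distribution given as {payoff: probability}."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    """var(X) = E[(X - E[X])^2]."""
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

# Payoff distributions for the two $1 bets (38 equally likely wheel slots).
red_bet = {0: Fraction(20, 38), 2: Fraction(18, 38)}
seventeen_bet = {0: Fraction(37, 38), 36: Fraction(1, 38)}

print("E[X] =", expectation(red_bet), " E[Y] =", expectation(seventeen_bet))
print("var(X) =", float(variance(red_bet)))
print("var(Y) =", float(variance(seventeen_bet)))
```

Both bets have the same expected payoff of 36/38 (which `Fraction` reduces to 18/19), while the exact variances are 360/361 and 11988/361, matching the decimal values in the example.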
Variance, the squared expectation, and the expectation of the square
Here’s a useful property of variance, which sometimes helps us avoid tedium in calculations. We can write var (X) as the difference between the expectation of the square of X and the square of the expectation of X:
Theorem 10.8 (Variance = expectation of the square minus the square of the expectation)
For any random variable X, we have
$$\mathrm{var}(X) = E[X^2] - (E[X])^2.$$
Proof. Writing $\mu := E[X]$, we have
\begin{align*}
\mathrm{var}(X) &= E\big[(X - \mu)^2\big] && \text{definition of variance} \\
&= E\big[X^2 - 2X\mu + \mu^2\big] && \text{multiplying out} \\
&= E[X^2] + E[-2X\mu] + E[\mu^2] && \text{linearity of expectation} \\
&= E[X^2] - 2\mu \cdot E[X] + \mu^2 && \text{Exercise 10.151} \\
&= E[X^2] - 2\mu \cdot \mu + \mu^2 && \text{definition of } \mu = E[X] \\
&= E[X^2] - \mu^2 \\
&= E[X^2] - (E[X])^2.
\end{align*}
Here is a simple example in which Theorem 10.8 eases the computation:
Example 10.49 (Variance/standard deviation of a uniform random variable)
Problem: Let X be the result of a roll of a fair die. What is var (X)?
Solution: Because Pr [X = k] = 1/6 for all k ∈ {1, …, 6}, we have that
$$E[X] = \tfrac{1}{6} \cdot (1 + 2 + 3 + 4 + 5 + 6) = \tfrac{1}{6} \cdot 21 = 3.5.$$
Similarly, we can compute $E[X^2]$ as follows:
$$E[X^2] = \tfrac{1}{6} \cdot (1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = \tfrac{1}{6} \cdot 91 \approx 15.1666\cdots.$$
Therefore, by Theorem 10.8,
$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = \tfrac{91}{6} - \tfrac{49}{4} = \tfrac{35}{12} \approx 2.9166\cdots,$$
and $\mathrm{std}(X) = \sqrt{35/12} \approx 1.7078\cdots$.
(In Exercise 10.150, you’ll show that the standard deviation of the average result
of two independent dice rolls is much smaller.)
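To see Theorem 10.8 at work on the die of Example 10.49, the following Python sketch computes the variance two ways: from the definition and from the expectation-of-the-square formula. The variable names are illustrative assumptions; only the numbers come from the example.

```python
from fractions import Fraction

# A fair six-sided die: each face k has probability 1/6.
die = {k: Fraction(1, 6) for k in range(1, 7)}

mean = sum(p * k for k, p in die.items())               # E[X]
mean_of_square = sum(p * k**2 for k, p in die.items())  # E[X^2]

# Definition 10.14: var(X) = E[(X - E[X])^2].
var_by_definition = sum(p * (k - mean) ** 2 for k, p in die.items())

# Theorem 10.8: var(X) = E[X^2] - (E[X])^2.
var_by_theorem = mean_of_square - mean**2

assert var_by_definition == var_by_theorem == Fraction(35, 12)
```

The two routes agree exactly, and the second needs only the two sums already computed in the example.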
Taking it further: Suppose that we need to estimate the fraction of [very complicated objects] that have [easy-to-verify property]: would I win a higher fraction of chess games with Opening Move A or B? Roughly how many different truth assignments satisfy Boolean formula φ? Roughly how many integers in {2, 3, . . . , n − 1} evenly divide n? Is the array A “mostly” sorted?
One nice way to approximate the answer to these questions is the Monte Carlo method, one of the simplest ways to use randomization in computation. The basic idea is to generate many random candidate elements (chess games, truth assignments, possible divisors, etc.) and test each one; we can then estimate the answer to the question of interest by calculating the fraction of those random candidates that have the property in question. See p. 1062 for more discussion.
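As a minimal sketch of the Monte Carlo idea, the Python code below estimates the fraction of integers in {2, …, n − 1} that evenly divide n by sampling random candidates and testing each one. The function names, the choice n = 360, the sample count, and the fixed seed are all assumptions made for this illustration.

```python
import random

def estimate_divisor_fraction(n, samples, rng):
    """Monte Carlo estimate of the fraction of {2, ..., n-1} that divides n."""
    hits = 0
    for _ in range(samples):
        candidate = rng.randint(2, n - 1)  # uniform over {2, ..., n-1}, inclusive
        if n % candidate == 0:
            hits += 1
    return hits / samples

def exact_divisor_fraction(n):
    """Exact answer by brute force, for comparison on small n."""
    divisors = sum(1 for d in range(2, n) if n % d == 0)
    return divisors / (n - 2)

rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = estimate_divisor_fraction(360, 20_000, rng)
exact = exact_divisor_fraction(360)
print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

For a small n like this the brute-force count is cheap, so the sampling is only for illustration; the method earns its keep when the candidate space is far too large to enumerate.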

Computer Science Connections
A Randomized Algorithm for Finding Medians
The median element of an array A[1 . . . n] is the item that would appear in the ⌈n/2⌉th slot of the sorted order if we sorted A. For example, the median of [1, 3, 5, 7, 9] is 5, and the median of [4, 3, 2, 1] is 2. (We arbitrarily chose to find the ⌈n/2⌉th element instead of the ⌊n/2⌋th.) This description already suggests a solution to the median problem: sort A, and then return A[⌈n/2⌉]. But we can do better than the sorting-based approach: we’ll give a faster algorithm for finding the median element of an unsorted array. Our algorithm will be randomized, and the expected running time of the algorithm will be linear.
It will turn out to be easier to solve a generalization of the median problem, called Select. See Figure 10.28.
A recursive solution to Select is given in Figure 10.29; we can solve the median problem by calling randSelect(A[1 . . . n], ⌈n/2⌉). A proof of correctness of the algorithm—that is, a proof that randSelect actually solves the Select problem—is reasonably straightforward by induction. (In fact, correctness is guaranteed regardless of how we choose x in Line 3 of the algorithm.) But we still have to analyze the running time.
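The recursive procedure of Figure 10.29 might be sketched in Python as follows. This is an illustrative translation, not the book’s code: the names `rand_select` and `median` are hypothetical, lists are 0-indexed internally while `i` stays 1-indexed as in the text, and the elements of A are assumed to be distinct (the pseudocode discards anything equal to the pivot).

```python
import random

def rand_select(A, i, rng=random):
    """Return the element that would sit in slot i (1-indexed) of sorted A.

    Assumption: the elements of A are distinct.
    """
    n = len(A)
    if not 1 <= i <= n:
        raise ValueError("i must be in {1, ..., n}")
    if n == 1:
        return A[0]
    pivot = A[rng.randrange(n)]            # Line 3: choose a random pivot
    losers = [y for y in A if y < pivot]   # Line 4: elements below the pivot
    winners = [y for y in A if y > pivot]  # Line 5: elements above the pivot
    l = len(losers)
    if i <= l:
        return rand_select(losers, i, rng)
    elif i == l + 1:
        return pivot
    else:
        return rand_select(winners, i - l - 1, rng)

def median(A, rng=random):
    """Median as defined in the text: slot ceil(n/2) of the sorted order."""
    return rand_select(A, (len(A) + 1) // 2, rng)

print(median([1, 3, 5, 7, 9]), median([4, 3, 2, 1]))
```

The returned value does not depend on which pivots are drawn; only the amount of work does, which is why the analysis that follows is about expected running time rather than correctness.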
Figure 10.28: The Select problem.
Select:
Given: an array A[1…n] and an index k ∈ {1, …, n}.
Output: the element x in A such that, if you were to sort A, x would appear in the kth slot of the sorted array.

randSelect(A[1 . . . n], i):
// Find the element of A that belongs in the ith slot of the sorted order.
// If i ∉ {1, 2, …, n}, then error.
1: if n = 1 then
2:     return A[1]. (If i ≠ 1, then error.)
3: choose x ∈ {1, …, n} randomly
4: Losers[1…l] := {y ∈ A : y < A[x]}
5: Winners[1…w] := {y ∈ A : y > A[x]}
6: if i < l + 1 then
7:     return randSelect(Losers, i)
8: else if i = l + 1 then
9:     return A[x]
10: else if i > l + 1 then
11:     return randSelect(Winners, i − l − 1)
Figure 10.29: Randomized Median finding. (We build Losers and Winners by going through A element-by-element.)

Running Time: The Big Picture
Think about an invocation of randSelect(A), and imagine the array A in sorted order and divided into quartiles:
[Diagram: the sorted array A with positions 0, n/4, n/2, 3n/4, and n marked; the middle half, from n/4 to 3n/4, is shaded.]
Here are two crucial observations:
1. Suppose that the element A[x] chosen in step 3 (call A[x] the pivot) falls within the shaded region of the quartile picture above. Then we know that |Losers| ≤ 3n/4 and |Winners| ≤ 3n/4.
2. The shaded region contains half of the elements of A.
(Why? To put it briefly: because half of the elements of A are in the middle half of the array A.) So what? Let’s think intuitively for a moment, and defer the formal analysis. Whenever we choose an element from the middle half of the sorted order, the next recursive call is on an array of size at most 3/4 the size of the original input. Also observe that the running time of any particular call (aside from the recursive call) is linear in the input size. Thus, if we got lucky every time and picked an element from the middle half of the array, we’d have a recurrence like the following:
T(1) = 1
T(n) ≤ n + T(3n/4)
That’s a classic Master Method recurrence with a solution of T(n) = Θ(n). (Actually the master method only says that T(n) = O(n), because we have an inequality in the recurrence. But it’s trivial that the running time is Ω(n) as well, because just building Losers and Winners at the root takes Ω(n) time.)
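For readers who would rather not invoke the Master Method, the same upper bound can be seen by unrolling the recurrence into a geometric series. The following is a sketch that ignores floors and ceilings and stops the recursion once the subproblem size reaches 1:

```latex
\begin{align*}
T(n) &\le n + T\!\left(\tfrac{3}{4}n\right) \\
     &\le n + \tfrac{3}{4}n + T\!\left(\left(\tfrac{3}{4}\right)^{2} n\right) \\
     &\le \cdots \\
     &\le n \sum_{j=0}^{\infty} \left(\tfrac{3}{4}\right)^{j} + T(1)
      = \frac{n}{1 - 3/4} + 1
      = 4n + 1 .
\end{align*}
```

Each level of the recursion costs at most 3/4 of the level above it, so the total is dominated by the work at the root, which is the intuition behind the linear bound.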

Running Time: Making it Formal
We engaged in wishful thinking in the last paragraph: it’s obviously not
true that we get a pivot in the middle half of the array every time. In fact, it’s only half the time! But this isn’t so bad: even if we imagine that picking a pivot outside the middle half yields zero progress at all toward the base case, we’d only double the estimate of the running time! Let’s make this formal. Define
$C_n$ := the number of comparisons performed by randSelect on an input of size n.
Notice that $C_n$ is a random variable: the number of comparisons that are performed depends on which pivots are chosen! But we can analyze $E[C_n]$.
Before we start, let’s make one quick observation: the expected running time of this algorithm is monotonic in its input size. That is, $E[C_n] \le E[C_{n'}]$ if $n \le n'$. (This fact is tedious to prove rigorously, but is still pretty obvious.)
Theorem: $E[C_n] \le 8n$.
Proof (by strong induction on n).
Base case (n = 1): In fact, when n = 1, the algorithm performs zero comparisons, and indeed 0 ≤ 8.
Inductive case (n ≥ 2): We assume the inductive hypothesis, namely that for any $n' < n$, we have that $E[C_{n'}] \le 8n'$. We must prove that $E[C_n] \le 8n$.
Let’s consider the comparisons that are made on an input array of size n. First, there are n comparisons performed in Lines 4–5, to compute Losers and Winners. Then there are whatever comparisons are made in the recursive call. Because we’re trying to compute a worst-case bound, we’ll make do with the following observation:
$$C_n \le n + C_{\max(|Losers|, |Winners|)}.$$
Let M denote the event that our pivot is in the middle half of A (that is, falls in the shaded region of the diagram on the previous page). Writing $C_{\max}$ as shorthand for $C_{\max(|Losers|, |Winners|)}$, we have:
\begin{align*}
E[C_n] &\le E[n + C_{\max}] && \text{the above accounting of the comparisons} \\
&= n + E[C_{\max}] && \text{linearity of expectation} \\
&= n + E[C_{\max} \mid M] \cdot \Pr[M] + E[C_{\max} \mid \overline{M}] \cdot \Pr[\overline{M}] && \text{Law of Total Expectation (Theorem 10.7)} \\
&= n + \tfrac{1}{2} \cdot \big( E[C_{\max} \mid M] + E[C_{\max} \mid \overline{M}] \big) && \text{Crucial observation \#2: } \Pr[M] = \Pr[\overline{M}] = \tfrac{1}{2} \\
&\le n + \tfrac{1}{2} \cdot \big( E[C_{3n/4}] + E[C_n] \big). && \text{Crucial observation \#1: if } M \text{ occurs, we recurse on} \le \tfrac{3n}{4} \text{ elements; else it’s certainly on} \le n \text{ elements}
\end{align*}
Thus we have argued that $E[C_n] \le n + \tfrac{1}{2} \cdot E[C_{3n/4}] + \tfrac{1}{2} \cdot E[C_n]$, and therefore (starting with the previous inequality and subtracting $\tfrac{1}{2} \cdot E[C_n]$ from both sides, and then multiplying both sides by 2)
$$E[C_n] \le 2n + E[C_{3n/4}].$$
The inductive hypothesis says that $E[C_{3n/4}] \le 8 \cdot \tfrac{3n}{4} = 6n$, so we therefore have
$$E[C_n] \le 2n + 6n = 8n.$$

10.4.6 Exercises
Choose a word in S = {Computers, are, useless, They, can, only, give, you, answers} (a quote attributed to Pablo Picasso) by choosing a word w with probability proportional to the number of letters in w. Let L be a random variable denoting the number of letters in the chosen word, and let V be a random variable denoting the number of vowels.
10.105 Give a table of outcomes and their probabilities, together with the values of L and V.
10.106 What is Pr [L = 4]? What is E [V | L = 4]?
10.107 Are L and V independent?
10.108 What are E [L] and E [V]?
10.109 What is var (L)?
10.110 What is var (V)?
Flip a fair coin 16 times. Define the following two random variables:
• let H be an indicator random variable that’s 1 if at least one of the 16 flips comes up heads, and 0 otherwise.
• let R be a random variable equal to the length of the longest “run” in the flips. (A run of length k is a sequence of k consecutive flips that all come up Heads, or k consecutive flips that all come up Tails.)
10.111 What’s E [H]?
10.112 What’s E [R]? (Hint: write a program, not by simulating many sequences of 16 coin flips, but rather by listing exhaustively all outcomes.)
10.113 Are H and R independent?
In 1975, a physicist named Michael Winkelmann invented a dice-based game with the following three (fair) dice:
Blue die: sides 1, 2, 5, 6, 7, 9
Red die: sides 1, 3, 4, 5, 8, 9
Black die: sides 2, 3, 4, 6, 7, 8
There are some weird properties of these dice, as you’ll see.
10.114 Choose one of the three dice at random, roll it, and call the result X. Show that Pr [X = k] = 1/9 for any k ∈ {1, …, 9}.
10.115 Choose one of the three dice at random, roll it, and call the result X. Put that die back in the pile and again (independently) choose one of the three dice at random, roll it, and call the result Y. Show that Pr [9X − Y = k] = 1/81 for any k ∈ {0, …, 80}.
10.116 Roll each die. Call the results B (blue), R (red), and K (black). Compute E [B], E [R], and E [K].
10.117 Define B, R, and K as in the last exercise. Compute Pr [B > R | B ≠ R], Pr [R > K | R ≠ K], and Pr [K > B | K ≠ B]. In particular, show that all three of these probabilities (strictly) exceed 1/2.
The last exercise demonstrates that the red, blue, and black dice are nontransitive, using the language of relations (Chapter 8): you’d bet on Blue beating Red and you’d bet on Red beating Black, but (surprisingly) you’d want to bet on Black beating Blue. Here’s another, even weirder, example of nontransitive dice. (And if you’re clever and mildly unscrupulous, you can win some serious money in bets with your friends using these dice.)
Kelly die: sides 3, 3, 3, 3, 3, 6
Lime die: sides 2, 2, 2, 5, 5, 5
Mint die: sides 1, 4, 4, 4, 4, 4
These dice are fair; each side comes up with probability 1/6. Roll each die, and call the resulting values K, L, and M.
10.118 Show that the expectation of each of these three random variables is identical.
10.119 Show that Pr [K > L], Pr [L > M], and Pr [M > K] are all strictly greater than 1/2.
You can think of the last exercise as showing that, if you had to bet on which of K or L would roll a higher number,
you should bet on K. (And likewise for L over M, and for M over K.) Now let’s think about rolling each die twice and
adding the two rolled values together. Roll each die twice, and call the resulting values K1, K2, L1, L2, M1, and M2,
respectively.
10.120 Show that the expectation of the three values K1 + K2, L1 + L2, and M1 + M2 are identical.
10.121 (programming required) Show that the following probabilities are all strictly less than 1/2:
Pr [K1 + K2 > L1 + L2], Pr [L1 + L2 > M1 + M2], and Pr [M1 + M2 > K1 + K2].
(Notice that which die won switched directions, and all we did was go from rolling the dice once to rolling them twice!) To show this result, write a program to check how many of the $6^4$ outcomes cause K1 + K2 > L1 + L2, etc.
Suppose that you are dealt a 5-card hand from a standard deck. For the purposes of the next two questions, a pair consists of any two cards with the same rank, so ♣A♥A♦A23 contains three pairs (♥A♦A and ♣A♦A and ♣A♥A). Let P denote the number of pairs in your hand.
10.122 Compute E [P] “the hard way,” by computing Pr [P = 0], Pr [P = 1], Pr [P = 2], and so forth. (There can be as many as 6 pairs in your hand, if you have four-of-a-kind.)
10.123 Compute E [P] “the easy way,” by defining an indicator random variable $R_{i,j}$ that’s 1 if and only if cards #i and #j are a pair, computing $E[R_{i,j}]$, and using linearity of expectation.

In bridge, you are dealt a 13-card hand from a standard deck. A hand’s high-card points are awarded for face cards: 4 for an ace, 3 for a king, 2 for a queen, and 1 for a jack. A hand’s distribution points are awarded for having a small number of cards in a particular suit: 1 point for a “doubleton” (only two cards in a suit), 2 points for a “singleton” (only one card in a suit), and 3 points for a “void” (no cards in a suit).
10.124 What is the expected number of high-card points in a bridge hand? (Hint: define some simple random variables, and use linearity of expectation.)
10.125 What is the expected number of distribution points for hearts in a bridge hand? (Hint: calculate the probability of having exactly 2 hearts, exactly 1 heart, or no hearts in a hand.)
10.126 Using the results of the last two exercises and linearity of expectation, find the expected number of points (including both high-card and distribution points) in a bridge hand.
We’ve shown linearity of expectation—the expectation of a sum equals the sum of the expectations—even when the random variables in question aren’t independent. It turns out that the expectation of a product equals the product of the expectations when the random variables are independent, but not in general when they’re dependent.
10.127 Let X and Y be independent random variables. Prove that E [X · Y] = E [X] · E [Y].
On the other hand, suppose that X and Y are dependent random variables. Prove that . . .
10.128 . . . E [X · Y] is not necessarily equal to E [X] · E [Y].
10.129 . . . E [X · Y] is also not necessarily unequal to E [X] · E [Y].
We showed in Example 10.39 that the expected number of flips of a p-biased coin before we get Heads is precisely 1/p.
10.130 How many flips would you expect to have to make before you see 1000 heads in total (not necessarily consecutive)? (Hint: define a random variable $X_i$ denoting the number of coin flips after the (i − 1)st Heads before you get another Heads. Then use linearity of expectation.)
10.131 How many flips would you expect to make before you see two consecutive heads?
In Insertion Sort, we showed in Example 10.45 that the expected number of swaps is $\binom{n}{2}/2$ for a randomly sorted input. With respect to comparisons, it’s fairly easy to see that each element participates in one more comparison than it does swap, with one exception: those elements that are swapped all the way back to the beginning of the array. Here you’ll precisely analyze the expected number of comparisons.
10.132 What is the probability that the ith element of the array is swapped all the way back to the beginning of the array?
10.133 What’s the expected number of comparisons done by Insertion Sort on a randomly sorted n-element input?
Suppose we hash n elements into a 100,000-slot hash table, resolving collisions by chaining.
10.134 Use Example 10.43 to identify the smallest n for which the expected number of collisions first reaches 1. What is the smallest n for which the expected number of collisions exceeds 100,000?
10.135 (programming required) Write a program to empirically test your answers from the last exercise, by doing k = 1000 trials of loading [your first answer from Exercise 10.134] elements into a 100,000-slot hash table. Also do k = 100 trials of loading [your second answer from Exercise 10.134] elements. On average, how many collisions did you see?
Consider an m-slot hash table that resolves collisions by chaining. In the next few problems, we’ll figure out the expected number of elements that must be hashed into this table before every slot is “hit,” that is, until every cell of the hash table is full.
10.136 Suppose that the hash table currently has i − 1 filled slots, for some number i ∈ {1, …, m}. What is the probability that the next element that’s hashed falls into an unoccupied slot? Let the random variable $X_i$ denote the number of elements that are hashed until one more cell is filled. What is $E[X_i]$?
10.137 Argue that the total number X of elements hashed before the entire hash table is full is given by $X = \sum_{i=1}^{m} X_i$. Using Exercise 10.136 and linearity of expectation, prove that $E[X] = m \cdot H_m$.
(Recall that $H_m$ denotes the mth harmonic number, where $H_m := \sum_{i=1}^{m} \frac{1}{i}$. See Definition 5.4.)
The problem you’ve addressed in the last two exercises is called the coupon collector problem among computer scientists: imagine, say, a cereal company that puts one of n coupons into each box of cereal that it sells, choosing which coupon type goes into each box randomly. How many boxes of cereal must a serial cereal eater buy before he collects a complete set of the n coupons?
Figure 10.32: A reminder of Insertion Sort.
insertionSort(A[1 . . . n]):
1: for i := 2 to n:
2:     j := i
3:     while j > 1 and A[j] < A[j − 1]:
4:         swap A[j] and A[j − 1]
5:         j := j − 1

True story: some nostalgic friends and I were trying to remember all of the possible responses on a Magic 8 Ball, a pseudopsychic toy that reveals one of 20 answers uniformly at random when it’s shaken, things like {ask again later, signs point to yes, don’t count on it, …}. We found a toy shop with a Magic 8 Ball in stock and started asking it questions. We hoped to have learned all 20 different answers before we got kicked out of the store.
10.138 What is the probability that we’d get 20 different answers in our first 20 trials?
10.139 In expectation, how many trials would we need before we found all 20 answers? (Use the result on coupon collecting from Exercise 10.137.)
In Exercise 10.139, you determined the number of trials that, on average, are necessary to get all 20 answers. But how likely are we to succeed with a certain number of trials?
10.140 Suppose we perform 200 trials. What is the probability that a particular answer (for example, “ask again later”) was never revealed in any of those 200 trials?
10.141 Use the Union Bound (Exercise 10.37) and the previous exercise to argue that the probability that we need more than 200 trials to see all 20 answers is less than 0.1%.
10.142 Suppose that one random bit in a 32-bit number is corrupted (that is, flipped from 0 to 1 or from 1 to 0). What is the expected size of the error (thinking of the change of the value in binary)? What about for a random bit in an n-bit number?
10.143 Suppose that the numbers {1, …, n} are randomly ordered, that is, we choose a random permutation π of {1, …, n}. For a particular index i, what is the probability that $\pi_i = i$ (that is, the ith biggest element is in the ith position)?
10.144 Let X be a random variable denoting the number of indices i for which $\pi_i = i$. What is E [X]? (Hint: define indicator random variables and use linearity of expectation.)
Markov’s inequality states that, for a random variable X that is always nonnegative (that is, for any x in the sample space, we have X(x) ≥ 0), the following statement is true, for any α ≥ 1:
$$\Pr[X \ge \alpha] \le \frac{E[X]}{\alpha}.$$
(Markov’s inequality is named after Andrey Markov, a 19th-to-20th-century Russian mathematician. A number of other important ideas in probability are also named after him, like Markov processes, Hidden Markov models, and more.)
10.145 Prove Markov’s inequality. (Hint: use conditional expectation.)
10.146 The median of a random variable X is a value x such that Pr [X ≤ x] ≥ 1/2 and Pr [X ≥ x] ≥ 1/2. Using Markov’s inequality, prove that the median of a nonnegative random variable X is at most 2 · E [X].
Take a fair coin, and repeatedly flip it until it comes up heads. Let K be a random variable indicating the number of flips performed. (We’ve already shown that E [K] = 2, in Example 10.39.) You are offered a chance to play a gambling game, for the low low price of y dollars to enter. A fair coin will be flipped until it comes up heads, and you will be paid $(3/2)^K$ dollars if K flips were required. (So there’s a 1/2 chance that you’ll be paid $1.50 because the first flip comes up heads; a 1/4 chance that you’ll be paid $2.25 = $(1.50)^2$ because the first flip comes up tails and the second comes up heads, and so forth.)
10.147 Assuming that you care only about expected value (that is, you’re willing to play if and only if $E[(3/2)^K] \ge y$), what value of y is the break-even point? (In other words, what is $E[(3/2)^K]$?)
10.148 Let’s sweeten the deal slightly: you’ll be paid $2^K$ dollars if K flips are required. Assuming that you still care only about expected value, then what value of y is the break-even point? (Be careful!)
10.149 Let X be the number of heads flipped in 4 independent flips of a fair coin. What is var (X)?
10.150 Let Y be the average of two independent rolls of a fair die. What is var (Y)?
10.151 Let a ∈ R, and let X be a random variable. Prove that E [a · X] = a · E [X].
10.152 Let a ∈ R, and let X be a random variable. Prove that $\mathrm{var}(a \cdot X) = a^2 \cdot \mathrm{var}(X)$.
10.153 Prove that var (X + Y) = var (X) + var (Y) for two independent random variables X and Y. (Hint: use Exercise 10.127.)
10.154 Let X be a random variable following a binomial distribution with parameters n and p. (That is, X is the number of heads found in n flips of a p-biased coin.) Using Exercise 10.153 and the same logic as in Example 10.40, show that E [X] = np and var (X) = np(1 − p).
10.155 Flip a p-biased coin n times, and let Y be a random variable denoting the fraction of those n flips that came up heads. What are E [Y] and var (Y)?
In the next few exercises, you’ll find the variance of a geometric random variable. This derivation will require a little more work than the result from Exercise 10.154 (about the variance of a binomial random variable); in particular, we’ll need a preliminary result about summations first:
10.156 (Calculus required.) Prove the following two formulas, for any real number r with 0 ≤ r < 1:
$$\sum_{i=0}^{\infty} i r^i = \frac{r}{(1 - r)^2} \qquad\qquad \sum_{i=0}^{\infty} i^2 r^i = \frac{r(1 + r)}{(1 - r)^3}.$$
(Hint: use the geometric series formula $\sum_{i=0}^{n} r^i = \frac{r^{n+1} - 1}{r - 1}$ from Theorem 5.2, differentiate, and take the limit as n grows. Repeat for the second derivative.)
10.157 Let X be a geometric random variable with parameter p. (That is, X denotes the number of flips of a p-biased coin we need before we see heads for the first time.) What is var (X)? (Hint: compute both $E[X]^2$ and $E[X^2]$. The previous exercise will help with at least one of those computations.)
Recall from Chapter 3 that a proposition is in 3-conjunctive normal form (3CNF) if it is the conjunction of clauses, where each clause is the disjunction of three different variables/negated variables. For example, (¬p ∨ q ∨ r) ∧ (¬q ∨ ¬r ∨ x) is in 3CNF. Recall further that a proposition φ is satisfiable if it’s possible to give a truth assignment for the variables of φ to true/false so that φ itself turns out to be true.
We’ve previously discussed that it is believed to be computationally very difficult to determine whether a proposition φ is satisfiable (see p. 326), and it’s believed to be very hard to determine whether φ is satisfiable even if φ is in 3CNF. But you’ll show here an easy way to satisfy “most” clauses of a proposition φ in 3CNF, using randomization.
10.158 Let φ be a proposition in 3CNF. Consider a random truth assignment for φ, that is, each variable is set independently to True with probability 1/2. Prove that a particular clause of φ is true under this truth assignment with probability ≥ 7/8.
10.159 Suppose that φ has m clauses and n variables. Prove that the expected number of satisfied clauses under a random truth assignment is at least 7m/8.
10.160 Prove the following general statement about any random variable: Pr [X ≥ E [X]] > 0. (Hint: use conditional expectation.) Then, using this general fact and Exercise 10.159, argue that, for any 3CNF proposition φ, there exists a truth assignment that satisfies at least 7/8 of φ’s clauses.
Taking it further: One can also show that there’s a very good chance (at least 8/m) that a random truth assignment satisfies at least 7m/8 clauses, and therefore we expect to find such a truth assignment within m/8 random trials. This algorithm is called Johnson’s algorithm, named after the researcher David Johnson; for details of this and other randomized algorithms for satisfiability, see a good book on randomized algorithms.11
11 Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005; Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1995; and Jon Kleinberg and Éva Tardos. Algorithm Design. Addison–Wesley, 2006.

10.5 Chapter at a Glance
Probability, Outcomes, and Events
Imagine a process by which some quantities of interest are determined in some random way. An outcome, or realization, of this probabilistic process is the sequence of results for all randomly determined quantities. The sample space S is the set of all possible outcomes. A probability function Pr : S → R describes, for each outcome s ∈ S, the fraction of the time that s occurs. The probability function Pr must satisfy two conditions: (i) $\sum_{s \in S} \Pr[s] = 1$, and (ii) Pr [s] ≥ 0 for every s ∈ S.
An event is a subset of S, and the probability of an event E, written Pr [E], is the sum of the probabilities of the outcomes contained in E. We have that Pr [S] = 1 and Pr [∅] = 0. For events A and B, writing $\overline{A}$ (“not A”) to denote the event $\overline{A} = S - A$, we have that $\Pr[\overline{A}] = 1 - \Pr[A]$, and Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B].
We can use a tree diagram to represent a sequence of random choices, where internal nodes of the tree correspond to random decisions made by the probabilistic process; leaves correspond to the outcomes in the sample space. Every edge leaving an internal node is labeled with the probability of the corresponding random decision; the probability of a particular outcome is precisely equal to the product of the labels on the edges leading from the root to its corresponding leaf.
The uniform distribution is the probability distribution in which all outcomes in the sample space S are equally likely, that is, when Pr [s] = 1/|S| for each s ∈ S. (Nonuniform probability is when this equality does not hold.)
The Bernoulli distribution with parameter p is the probability distribution that results from flipping one coin, where the sample space is {H, T} and Pr [H] = p (and thus Pr [T] = 1 − p). Such a coin is called p-biased. Each coin flip is called a trial; the flip is called fair if p = 1/2.
The binomial distribution with parameters n and p is a distribution over the sample space {0, 1, …, n} determined by flipping a p-biased coin n times and counting the number of times the coin comes up heads. Here $\Pr[k] = \binom{n}{k} \cdot p^k \cdot (1 - p)^{n-k}$ denotes the probability that there are precisely k heads in the n flips.
The geometric distribution with parameter p is a distribution over the positive integers, where the output is determined by the number of flips of a p-biased coin required before we first see a heads; thus $\Pr[k] = (1 - p)^{k-1} \cdot p$ for any integer k ≥ 1.
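The binomial and geometric formulas above are easy to sanity-check numerically. The Python sketch below (function names and the parameter choices n = 10, p = 1/3 are assumptions for illustration) confirms with exact fractions that the binomial probabilities sum to 1 and that a partial sum of the geometric probabilities equals $1 - (1 - p)^{50}$.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p, k):
    """Pr[exactly k heads in n flips of a p-biased coin]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(p, k):
    """Pr[the first heads occurs on flip k], for an integer k >= 1."""
    return (1 - p) ** (k - 1) * p

p = Fraction(1, 3)
n = 10

# The binomial probabilities over {0, ..., n} form a distribution.
binomial_total = sum(binomial_pmf(n, p, k) for k in range(n + 1))

# The first 50 geometric probabilities fall short of 1 by exactly (1 - p)^50.
geometric_partial = sum(geometric_pmf(p, k) for k in range(1, 51))

assert binomial_total == 1
assert geometric_partial == 1 - (1 - p) ** 50
```

The geometric distribution has infinitely many outcomes, so a program can only check a finite prefix; the missing mass $(1 - p)^{50}$ is the probability that the first 50 flips are all tails.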
Independence and Conditional Probability
When there are multiple events of interest, then one useful way of understanding the relationship between two events is to understand whether one event’s occurrence changes the likelihood of the other event also occurring. When there’s no change, the events are called independent; when there is a change in the probability, the events are called dependent. More formally, two events A and B are independent (or uncorrelated) if and only if Pr [A ∩ B] = Pr [A] · Pr [B]. Otherwise the events A and B are called dependent (or correlated). Intuitively, A and B are dependent if A’s occurrence/nonoccurrence tells us something about whether B occurs. When knowing that A occurred makes B more likely to occur, we say that A and B are positively correlated; when A makes B less likely to occur, we say that A and B are negatively correlated.
The conditional probability of A given B is
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}.$$
(Treat Pr [A | B] as undefined when Pr [B] = 0.) Intuitively, we can think of Pr [A | B] as “zooming” the universe down to the set B. Two events A and B for which Pr [B] ≠ 0 are independent if and only if Pr [A | B] = Pr [A].
There are a few useful equivalences based on conditional probability. For any events A and B, the chain rule says that Pr [A ∩ B] = Pr [B] · Pr [A | B]; more generally,
$$\Pr[A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_k] = \Pr[A_1] \cdot \Pr[A_2 \mid A_1] \cdot \Pr[A_3 \mid A_1 \cap A_2] \cdots \Pr[A_k \mid A_1 \cap \cdots \cap A_{k-1}].$$
The law of total probability says that $\Pr[A] = \Pr[A \mid B] \cdot \Pr[B] + \Pr[A \mid \overline{B}] \cdot \Pr[\overline{B}]$.
Bayes’ Rule is a particularly useful rule that allows us to “flip around” a conditional probability statement: for any two events A and B, we have
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]}.$$
Random Variables and Expectation
The probabilistic statements that we’ve considered so far are about events (“whether or not” questions); we can also consider probabilistic questions about “how much” or “how often.” A random variable X assigns a numerical value to every outcome in the sample space S—that is, a random variable is a function X : S → R. (Often we write X to denote the value of a random variable X for a realization chosen according to Pr, or perform arithmetic on random variables.) An indicator random variable is a {0, 1}- valued random variable. Two random variables X and Y are independent if every two events of the form “X = x” and “Y = y” are independent.
The expectation of a random variable X, denoted E [X], is the average value of X, defined as E [X] = ∑_{x∈S} X(x) · Pr [x]. A Bernoulli random variable with parameter p has expectation p. A binomial random variable with parameters p and n has expectation pn. A geometric random variable with parameter p has expectation 1/p.
Linearity of expectation is the very useful fact that the expectation of a sum is the sum of the expectations. That is, for random variables X : S → R and Y : S → R, we have E [X + Y] = E [X] + E [Y]. (Note that there is no requirement of independence on X and Y!) Another useful fact is that, for a nonnegative integer-valued random variable X : S → Z≥0, we have E [X] = ∑_{i=1}^{∞} Pr [X ≥ i].
The conditional expectation of a random variable X given an event E is the average value of X over outcomes where E occurs, defined as E [X | E] = ∑_{x∈E} X(x) · Pr [x | E]. The variance of a random variable X is
var (X) = E [(X − E [X])^2] = E [X^2] − (E [X])^2.
The standard deviation is std (X) = √var (X).
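The definitions of expectation and variance can be checked by brute force on a small sample space. The following Python sketch is only an illustration: the sample space (two fair dice) and the random variables first, larger, and total are assumptions chosen for the example. It confirms linearity of expectation for two random variables that are not independent, the two equivalent formulas for variance, and the tail-sum formula for a nonnegative integer-valued random variable.

```python
from fractions import Fraction
from itertools import product

# Uniform sample space: all 36 ordered rolls of two fair dice.
S = list(product(range(1, 7), repeat=2))
PR = Fraction(1, len(S))   # probability of each single outcome

def expectation(X):
    """E[X] = sum over outcomes s of X(s) * Pr[s]."""
    return sum(X(s) * PR for s in S)

def variance(X):
    """var(X) = E[(X - E[X])^2]."""
    mu = expectation(X)
    return expectation(lambda s: (X(s) - mu) ** 2)

# Random variables are just functions from outcomes to numbers.
first = lambda s: s[0]                    # value of the first die
larger = lambda s: max(s)                 # larger of the two dice (depends on first)
total = lambda s: first(s) + larger(s)    # their sum

# Tail-sum formula: E[X] = sum over i >= 1 of Pr[X >= i].
# (The larger die is at most 6, so terms with i > 6 are all zero.)
tail_sum = sum(sum(PR for s in S if larger(s) >= i) for i in range(1, 7))
```

Note that first and larger are dependent (a large first die forces a large maximum), yet the expectation of their sum still equals the sum of their expectations.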

Key Terms and Results
Key Terms
Probability, Outcomes, and Events
• outcome/realization
• sample space
• probability function/distribution
• event
• tree diagram
• uniform vs. nonuniform probability
• fair vs. biased coin flips
• uniform distribution
• Bernoulli distribution
• binomial distribution
• geometric distribution
Independence and Conditional Probability
• independent/uncorrelated events
• dependent/correlated events
• positive/negative correlation
• conditional probability
• chain rule
• law of total probability
• Bayes’ Rule
Random Variables and Expectation
• random variable
• indicator random variable
• independent random variables
• expectation
• linearity of expectation
• conditional expectation
• variance
• standard deviation
Key Results
Probability, Outcomes, and Events
1. For a sample space S and events A and B, writing Ā (“not A”) to denote the event S − A, we have that Pr [S] = 1, Pr [∅] = 0, Pr [Ā] = 1 − Pr [A], and Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B].
2. Under the uniform distribution, Pr [s] = 1/|S| for every s ∈ S. Consider parameters p and n. Under a Bernoulli distribution, Pr [H] = p and Pr [T] = 1 − p. Under a binomial distribution, Pr [k] = (n choose k) · p^k · (1 − p)^(n−k). Under a geometric distribution, Pr [k] = (1 − p)^(k−1) · p.
Independence and Conditional Probability
1. Events A and B are independent if and only if Pr [A ∩ B] = Pr [A] · Pr [B], or, equivalently, if Pr [A | B] = Pr [A].
2. The chain rule: Pr [A ∩ B] = Pr [B] · Pr [A | B].
3. The law of total probability: Pr [A] = Pr [A | B] · Pr [B] + Pr [A | B̄] · Pr [B̄].
4. Bayes’ Rule: Pr [A | B] = Pr [B | A] · Pr [A] / Pr [B].
Random Variables and Expectation
1. The expectation of a random variable X is the average value of X, defined as E [X] = ∑_{x∈S} X(x) · Pr [x].
2. A Bernoulli random variable with parameter p has expectation p. A binomial random variable with parameters p and n has expectation pn. A geometric random variable with parameter p has expectation 1/p.
3. Linearity of expectation: for any two random variables X and Y, we have E [X + Y] = E [X] + E [Y]. (Note that there is no requirement of independence on X and Y!)
4. For a random variable X : S → Z≥0, we have that E [X] = ∑_{i=1}^{∞} Pr [X ≥ i].
5. For a random variable X, we have var (X) = E [(X − E [X])^2] = E [X^2] − (E [X])^2.

11
Graphs and Trees
In which our heroes explore the many twisting paths through the gnarled forest, emerging in the happy and peaceful land in which their computational adventures will continue.

1102 CHAPTER 11. GRAPHS AND TREES
11.1 Why You Might Care
Oh what a tangled web we weave, When first we practise to deceive!
Sir Walter Scott (1771–1832), Marmion (1808)
It’s possible to make graphs sound hopelessly abstract and utterly uninteresting: a graph is a pair ⟨V, E⟩, where V is a nonempty collection of entities called nodes and E is a collection of edges that join pairs of nodes. But graphs are fascinating—at least, when the entities and the relationship represented by the edges are themselves interesting! Here are a few of the many examples of types of graphs:
• social networks like Facebook (or LinkedIn or Pinterest or …): the nodes are people, and an edge between two people represents a friendship (or at least a “friendship”).
• the world-wide web: the nodes are web pages, and an edge represents a hyperlink from one page to another. These hyperlinks between pages form the basis for the ranking of web pages by search engines like Google.1
• dating networks: nodes represent people; an edge connects two people who have been involved in a romantic relationship. These networks have implications for the spread of certain communicable diseases, particularly sexually transmitted infections.
• road networks and other transportation networks: edges represent roads; nodes represent intersections. For example, United Parcel Service (UPS) saves gas (and money!) by using a route-finding algorithm through this network that avoids turns across traffic.2
• food webs: nodes represent species within a particular ecosystem, and an edge from one species to another indicates that the first species preys on the latter.
• co-purchase networks: nodes are products that are sold by a retailer like Walmart or Amazon; an edge between two products indicates the number of customers who bought both products. These networks have implications for recommender systems, the “people who bought x also bought y” feature of Amazon.
• the internet: nodes are computers (personal computers, servers, and other networking hardware like routers), and edges represent physical wires connecting two machines together. When you request a video from youtube.com, the computers involved in the network must collectively construct a path along which YouTube’s bits can flow so that they reach your computer.
Graphs are ubiquitous. Indeed, any pairwise relationship among entities is really underlyingly a graph: web pages and links, computers and fiber optic cables, kidney patients/donors and compatibility for transplants. The applications are innumerable, and this chapter will barely scratch the surface. Graphs and graph-theoretic reasoning will arise again and again well beyond the end of this book.
1 Sergey Brin and Larry Page. The anatomy of a large-scale hypertextual web search engine. In 7th International World-Wide Web Conference, 1998.
2 Joel Lovell. Left-hand-turn elimination. The New York Times, 9 December 2007.

11.2 Formal Introduction
The Bible tells us to love our neighbors, and also to love our enemies; probably because they are generally the same people.
G. K. Chesterton (1874–1936)
We begin by defining the terminology for the two different basic types of graphs. In both, we have a set of entities called nodes, some pairs of which are joined by a relationship called an edge. (A node can also be called a vertex.) The two types of graph differ in whether the relationship represented by an edge is “between two nodes” or “from one node to another.” In an undirected graph, the relationship denoted by the edges is symmetric (for example, “u and v are genetically related”):
Definition 11.1 (Undirected Graph)
An undirected graph is a pair G = ⟨V, E⟩ where V is a nonempty set of vertices or nodes, and E ⊆ {{u, v} : u, v ∈ V} is a set of edges joining pairs of vertices.
The second basic kind of graph is a directed graph, in which the relationship denoted by the edges need not be reciprocated (for example, “u has texted v”):
Definition 11.2 (Directed Graph)
A directed graph is a pair G = ⟨V, E⟩ where V is a nonempty set of nodes, and E ⊆ V × V is a set of edges joining (ordered) pairs of vertices.
In other words, in a directed graph an edge is an ordered pair of vertices (“an edge from u to v”) and in an undirected graph an edge is an unordered pair of vertices (“an edge between u and v”). Think about the difference between Twitter followers (directed) and Facebook friendships (undirected): Alice can follow Bob without Bob following Alice, but they’re either friends or they’re not friends.
vertex, n.: a node. plural: vertices.
We will use the terms node/nodes and vertex/vertices interchangeably throughout this chapter. (Both terms are used commonly in CS.) A graph can also be called a network; edges are also sometimes called links, or occasionally arcs in directed graphs.
Graphs are generally drawn with nodes represented as circles, and edges represented by lines. Each edge in directed graphs is drawn with an arrow indicating its orientation (“which way it goes”). Here is an example of each:
Example 11.1 (A sample undirected graph)
Here is an undirected graph:
[Figure: an undirected graph drawn on the twelve nodes A through L.]
This graph contains:
• 12 nodes: {A,B,C,D,E,F,G,H,I,J,K,L}.
• 10 edges: {{A,B}, {B,C}, {C,D}, {E,F}, {E,H}, {F,G}, {G,H}, {I,J}, {J,K}, {K,L}}.
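One way to make Definition 11.1 concrete is to write the graph of Example 11.1 down as a pair of sets. The sketch below is a minimal illustration, assuming Python frozensets as the representation of unordered pairs; the function name is_undirected_graph is hypothetical.

```python
# Nodes and edges of the graph from Example 11.1.
V = set("ABCDEFGHIJKL")

# Each edge is an unordered pair {u, v}; a frozenset makes {A, B} and {B, A}
# the same object, exactly as in the set-based definition.
E = {frozenset(pair) for pair in
     ["AB", "BC", "CD", "EF", "EH", "FG", "GH", "IJ", "JK", "KL"]}

def is_undirected_graph(nodes, edges):
    """Check Definition 11.1: V is nonempty and every edge joins nodes of V.

    An edge with one element would be {u, u}, a self-loop, which the
    definition itself does not rule out.
    """
    return len(nodes) > 0 and all(
        edge <= nodes and 1 <= len(edge) <= 2 for edge in edges
    )
```

Because the edges are stored in a set, listing the same pair twice (in either order) would not change E, which matches the observation later in this section that the set-based definitions cannot express parallel edges.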

Example 11.2 (Streets of Manhattan: a sample directed graph)
The following directed graph contains 9 nodes, each corresponding to an intersection of a “street” running east–west and an “avenue” running north–south in Manhattan:
[Figure: a 3-by-3 grid of nodes, one for each intersection of 43rd, 42nd, and 41st Streets with 9th, 8th, and 7th Avenues, joined by directed edges.]
There are 14 edges in this graph. There’s something potentially tricky in counting to 14: edges in a directed graph are ordered pairs, so there are two edges between 42nd & 9th and 42nd & 8th, one in each direction: ⟨42nd & 9th, 42nd & 8th⟩ and ⟨42nd & 8th, 42nd & 9th⟩. The pair of nodes 42nd & 8th and 42nd & 7th is similar.
For many of the concepts that we’ll explore in this chapter, it will turn out that there are no substantive differences between the ideas for directed and undirected graphs. To avoid being tedious and unhelpfully repetitive, whenever it’s possible we’ll state definitions and results about both undirected and directed graphs simultaneously. But doing so will require a little abuse of notation: we’ll allow ourselves to write an edge as an ordered pair ⟨u, v⟩ even for an undirected graph. In an undirected graph, we will agree to understand both ⟨u, v⟩ and ⟨v, u⟩ as meaning {u, v}.
Simple graphs
For many of the real-world phenomena that we will be interested in modeling, it will make sense to make a simplifying assumption about the edges in our graphs. Specifically, we will typically restrict our attention to so-called simple graphs, which forbid two different kinds of edges: edges that connect nodes to themselves, and edges that are precise duplicates of other existing edges. (See Figure 11.1.)
Figure 11.1: Parallel edges and self-loops. (a) Undirected graphs. (b) Directed graphs.
Definition 11.3 (Self-loops and parallel edges)
A self-loop is an edge from a node u to itself. Two edges are called parallel if they both go from the same node u and both go to the same node v.
Note that the edges ⟨u, v⟩ and ⟨v, u⟩ are not parallel in a directed graph: directed edges are parallel only if they both go from the same node and to the same node, in the same orientation.
Definition 11.4 (Simple graph)
A graph is simple if it contains no parallel edges and no self-loops.
In general, the particular real-world phenomenon that we seek to model will dictate whether self-loops, parallel edges, or both will make sense. Here are a few examples:

Example 11.3 (Self-loops and parallel edges)
Problem: Suppose that we construct a graph to model each of the following phenomena. In which settings do self-loops or parallel edges make sense?
1. A social network: nodes correspond to people; (undirected) edges represent friendships.
2. The web: nodes correspond to web pages; (directed) edges represent links.
3. The flight network for a commercial airline: nodes correspond to airports; (directed) edges denote flights scheduled by the airline in the next month.
4. The email network at a college: nodes correspond to students; there is a (directed) edge ⟨u, v⟩ if u has sent at least one email to v within the last year.
Solution:
1. Neither self-loops nor parallel edges make sense. A self-loop would correspond to a person being a friend of himself, and parallel edges between two people would correspond to them being friends “twice.” (But two people are either friends or not friends.)
2. Both self-loops and parallel edges are reasonable. It is easy to imagine a web page p that contains a hyperlink to p itself. It is also easy to imagine a web page p that contains two separate links to another web page q. (For example, as of this writing, the “CNN” logo on www.cnn.com links to www.cnn.com. And, as of the end of this sentence, this page has three distinct references to www.cnn.com.)
3. In the flight network, many parallel edges will exist: there are generally many scheduled commercial flights from one airport to another; for example, there are dozens of flights every week from BOS (Boston, MA) to SFO (San Francisco, CA) on most major airlines. However, there are no self-loops: a commercial flight from an airport back to the same airport doesn’t go anywhere!
4. Self-loops are reasonable but parallel edges are not. A student u has either sent email to v in the last year or she has not, so parallel edges don’t make sense in this network. However, self-loops exist if any student has sent an email to herself (as many people do to remind themselves to do something later).
Throughout, we assume that all graphs are simple unless otherwise noted.
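To make Definitions 11.3 and 11.4 concrete, here is a small Python sketch that tests whether a collection of edges is simple. It assumes that edges arrive as a list of (u, v) pairs, so that duplicates can actually occur; the function names and the sample data are hypothetical.

```python
from collections import Counter

def self_loops(edges):
    """Edges that go from a node to itself."""
    return [(u, v) for (u, v) in edges if u == v]

def parallel_edges(edges, directed=True):
    """Edges that occur more than once.

    In a directed graph, (u, v) and (v, u) are different edges. In an
    undirected graph they both denote {u, v}, so we sort each pair first.
    """
    key = (lambda e: e) if directed else (lambda e: tuple(sorted(e)))
    counts = Counter(key(e) for e in edges)
    return [e for e, count in counts.items() if count > 1]

def is_simple(edges, directed=True):
    """A graph is simple if it has no self-loops and no parallel edges."""
    return not self_loops(edges) and not parallel_edges(edges, directed)

# A tiny flight network: two BOS-to-SFO flights are parallel edges.
flights = [("BOS", "SFO"), ("BOS", "SFO"), ("SFO", "BOS")]

# A tiny email network: a student emailing herself is a self-loop.
emails = [("u", "v"), ("v", "u"), ("u", "u")]
```

The directed flag matters: the pair of edges ("u", "v") and ("v", "u") is fine in a directed graph, but in an undirected graph it would be the same edge listed twice.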
Taking it further: Actually, the way that we phrased our definitions of graphs in Definitions 11.1 and 11.2 doesn’t even allow us to consider parallel edges. (Our definitions do allow self-loops, though.) That’s because we defined the edges as a subset E of V × V or {{u, v} : u, v ∈ V}, and sets don’t allow duplication, which means that we can’t have ⟨u, v⟩ in E “twice.” There are alternate ways to formalize graphs that do permit parallel edges, but they’re needlessly complicated for the applications that we’ll focus on in this chapter.
11.2.1 Neighborhoods and Degree
Imagine a social network in which two people, Ursula and Victor, are friends—or, more generally, imagine an undirected graph in which nodes u and v are joined by an edge. Here’s the vocabulary for referring to these nodes and the edge between them:
Definition 11.5 (Adjacency, neighbors, endpoints, incidence)
For an edge e = {u, v} in an undirected graph (see Figure 11.2), we say that:
• the nodes u and v are adjacent;
• the node v is a neighbor of the node u (and vice versa);
• the nodes u and v are the endpoints of the edge e; and
• the nodes u and v are both incident to the edge e.
It’s important to distinguish between two distinct concepts:
• the direct connection between two nodes u and v that are adjacent (that is, a single edge that joins u and v directly); and
• an indirect connection between two nodes that follows a sequence of edges.
At the moment, we’re talking only about the first kind of connection, a direct connection via a single edge. (A multihop connection is called a path; we’ll talk about paths in Section 11.3.) Here’s an example of the vocabulary from Definition 11.5:
Example 11.4 (Disney World to Disney Land)
Here is a small portion of the U.S. Interstate system between Orlando, FL and Los Angeles, CA. Each of the roads is labeled by its name.
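Figure 11.2: Two nodes joined by an edge (nodes u and v, joined by the edge e).
[Figure: the road network, with each edge labeled by its road. I10 (west) joins Los Angeles and Lake City, FL; I75 joins Lake City and Tampa; I4 (west) joins Tampa and Orlando; I10 (east) joins Lake City and Jacksonville; I95 joins Jacksonville and Daytona Beach; I4 (east) joins Daytona Beach and Orlando.]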
In this graph:
• Orlando is adjacent to Tampa and Daytona Beach.
• None of the other nodes (Lake City, Jacksonville, Los Angeles) is a neighbor of Orlando. Orlando is also not a neighbor of itself.
• The endpoints of edge I75 are Tampa and Lake City.
• Jacksonville is incident to I95, as is Daytona Beach.
The neighborhood of a node is the set of all nodes adjacent to it:
Definition 11.6 (Neighborhood)
Let G = ⟨V, E⟩ be an undirected graph, and let u ∈ V be a node. The neighborhood of u is the set {v ∈ V : {u, v} ∈ E}, that is, the set of all neighbors of u.

For example, in the graph from Example 11.4 (reproduced in abbreviated form in Figure 11.3), the neighborhood of Lake City (LC) is {Los Angeles (LA), Tampa (TA), Jacksonville (JA)}. Or, for a graph G that represents a social network, the neighborhood of a node u is the set of people who are u’s friends.
Degree
It’s also common to refer to the number of neighbors that a node has (without reference to which particular nodes happen to be that node’s neighbors):
Definition 11.7 (Degree)
The degree of a node u in an undirected graph G is the size of the neighborhood of u in G, that is, the number of nodes adjacent to u.
Figure 11.3: The road network from Example 11.4, abbreviated. (The nodes are LA, LC, TA, OR, JA, and DB.)
For example, in the graph in Figure 11.3, Lake City (LC) has degree 3 and Los Angeles (LA) has degree 1. Or, in a social network, the degree of a node u is the popularity of u: the number of friends that u has. Here are a few practice questions:
Example 11.5 (Neighborhood and degree)
Problem: Consider the following graph:
[Figure: an undirected graph on the nodes A through H. Consistent with the solution below, its edges are {A,B}, {B,C}, {B,E}, {C,E}, {E,F}, {D,G}, and {G,H}.]
1. What are the neighbors of node C?
2. What nodes, if any, have degree equal to one?
3. What node has the highest degree in this graph?
4. What nodes, if any, are in the neighborhoods of both nodes B and E?
Solution:
1. Node C has two neighbors, namely the nodes B and E.
2. The nodes with degree one are those with precisely one neighbor. These nodes are: A, D, F, and H. (Their solitary neighbors are, respectively: B, G, E, and G.)
3. We simply count neighbors for each node, and we find that nodes B and E both have degree three, and are tied as the nodes with the highest degree.
4. The neighborhood of node B is {A, C, E}, and the neighborhood of node E is {B, C, F}. Taking the intersection of those sets yields the one node in the neighborhood of both B and E, namely node C.
Taking it further: Consider a population of people (say, the current residents of Canada) represented as a social network, in an undirected graph whose edges represent friendship. For a node in the social network (also known as a person), we can calculate many numbers that may be interesting: height, age, income, number of cigarettes smoked per day, self-reported happiness, etc. Then, for any one of these numerical properties, we can consider the distribution over the population: for example, the distribution of heights, or the distribution of ages. (The height distribution will follow a roughly bell-shaped curve; the age distribution is more complicated, both because of death and because of variation in the birth rate over time.) Another interesting numerical property of a person u is the degree of u: that is, the number of friends that u has. The degree distribution of a graph describes how popularity varies across the nodes of the network. The degree distribution has some interesting properties, very different from the distribution of heights or ages. See p. 1123 for some discussion.
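Definitions 11.6 and 11.7 translate almost word for word into code. The sketch below is a minimal illustration on the road network of Example 11.4; the edge list is an assumption read off from the descriptions in that example (which road joins which cities), and the names ROADS, neighborhood, and degree are hypothetical.

```python
from collections import Counter

# The road network of Example 11.4, as described in the text:
# each road (edge) is listed with its two endpoints.
ROADS = {
    "I10 (west)": ("Los Angeles", "Lake City"),
    "I75": ("Lake City", "Tampa"),
    "I4 (west)": ("Tampa", "Orlando"),
    "I10 (east)": ("Lake City", "Jacksonville"),
    "I95": ("Jacksonville", "Daytona Beach"),
    "I4 (east)": ("Daytona Beach", "Orlando"),
}
E = {frozenset(ends) for ends in ROADS.values()}
V = set().union(*E)   # every node that is an endpoint of some edge

def neighborhood(u, edges):
    """The set of all v such that {u, v} is an edge."""
    return {v for edge in edges if u in edge for v in edge if v != u}

def degree(u, edges):
    """The size of u's neighborhood."""
    return len(neighborhood(u, edges))

# Degree distribution: how many nodes have each degree.
degree_distribution = Counter(degree(u, E) for u in V)
```

Here the degree distribution says that one node has degree 1, four nodes have degree 2, and one node has degree 3.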
The Handshaking Lemma
Before we move on from degree, we’ll prove a basic but valuable fact, colloquially called the “handshaking lemma.” (We can represent a group of people, some pairs of whom shake hands, using an undirected graph: an edge joins u and v if and only if u and v shook hands; the theorem describes the number of shakes.) The handshaking lemma relates the sum of nodes’ degrees to the number of edges in the graph:
Theorem 11.1 (“Handshaking Lemma”)
Let G = ⟨V, E⟩ be an undirected graph. Then
∑_{u∈V} degree(u) = 2|E|.
For example, Figure 11.4 shows our road network from Example 11.4, with all nodes labeled by their degree. This graph has |E| = 6 edges, and the sum of the nodes’ degrees is 1 + 3 + 2 + 2 + 2 + 2 = 12, and indeed 12 = 2 · 6. Here is a proof:
Figure 11.4: The road network from Figure 11.3, with nodes labeled by their degree (1, 3, 2, 2, 2, and 2).
“Look on every exit as being an entrance somewhere else.”
Tom Stoppard (b. 1937), Rosencrantz and Guildenstern are Dead (1966)
Proof of Theorem 11.1. Every edge has two endpoints! Or, more formally, imagine looping over each edge to compute all nodes’ degrees:
initialize d_u to 0 for each node u
for each edge {u, v} ∈ E:
    d_u := d_u + 1
    d_v := d_v + 1
In each iteration of the for loop, we increment two different d• values; thus, after i iterations, we have that ∑_u d_u = 2i. (We could give a fully rigorous proof of this fact by induction.) We complete |E| iterations of the for loop, one for each edge, and thus at the end of the algorithm we have that ∑_{u∈V} d_u = 2|E|. Furthermore, after the loop, it’s clear that d_u = degree(u) for every node u. Thus
∑_{u∈V} d_u = ∑_{u∈V} degree(u) = 2|E|.
Here’s a useful corollary of Theorem 11.1 (the proof is left to you as Exercise 11.17):
Corollary 11.2
Let n_odd denote the number of nodes whose degree is odd. Then n_odd is even.
(For example, for the graph in Figure 11.4, we have n_odd = 2: the two nodes with odd degree are those with degree 1 and 3. And 2 is an even number.)
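The edge-by-edge loop used in the proof of Theorem 11.1 can be run directly. The following Python sketch applies it to the graph of Example 11.1 (whose nodes and edges are listed explicitly in the text) and checks both the handshaking lemma and Corollary 11.2; the function name degrees is a hypothetical choice.

```python
def degrees(nodes, edges):
    """Compute every node's degree by looping over the edges.

    Each edge {u, v} contributes exactly 1 to d[u] and 1 to d[v].
    """
    d = {u: 0 for u in nodes}
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return d

# The graph from Example 11.1: 12 nodes and 10 edges.
V = list("ABCDEFGHIJKL")
E = [tuple(e) for e in "AB BC CD EF EH FG GH IJ JK KL".split()]

d = degrees(V, E)
degree_sum = sum(d.values())

# Corollary 11.2: the number of odd-degree nodes is even.
n_odd = sum(1 for u in V if d[u] % 2 == 1)
```

For this graph the degree sum is 20, which is twice the number of edges, and the four odd-degree nodes are A, D, I, and L.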

Neighborhoods and degree: directed graphs
The definitions of adjacency, neighbors, and degree from Definitions 11.5–11.7 were all for undirected graphs. Here we’ll introduce the analogous notions for directed graphs, all of which are slightly more complicated because they must account for the orientation of each edge. We start with the directed version of “neighbors”:
Definition 11.8 (Neighbors in directed graphs)
For an edge ⟨u, v⟩ from node u to node v in a directed graph, we say that:
• the node v is an out-neighbor of the node u; and
• the node u is an in-neighbor of the node v.
For example, if G represents a flight network (with nodes as airports and directed edges corresponding to flights), then the out-neighbors of node u are those airports that have direct flights from u, and the in-neighbors of u are those airports that have direct flights to u. (See Figure 11.5.) Now, using these definitions, we can define the analogues of neighborhoods and degree in directed graphs:
Figure 11.5: The in- and out-neighbors of a node u. (a) in-neighbors. (b) out-neighbors.
Definition 11.9 (Neighborhoods and degrees in directed graphs)
For a node u in a directed graph, we say that:
• the in-neighborhood of u is {v : ⟨v, u⟩ ∈ E}, the set of in-neighbors of u;
• the in-degree of u is its number of in-neighbors (its in-neighborhood’s cardinality);
• the out-neighborhood of u is {v : ⟨u, v⟩ ∈ E}, the set of out-neighbors of u; and
• the out-degree of u is its number of out-neighbors (its out-neighborhood’s cardinality).
Here are a few practice questions about in- and out-neighborhoods:
Example 11.6 (Neighborhood and degree in a directed graph)
Problem: Consider the following directed graph:
[Figure: a directed graph on the nodes A through H.]
1. What are the in-neighbors of node C? The out-neighbors of C?
2. What nodes, if any, are in both the in-neighborhood and out-neighborhood of node E?
3. What nodes, if any, have in-degree zero? Out-degree zero?
Solution:
1. Node C has one in-neighbor, namely B, and two out-neighbors, namely D and E.
2. Node E has three in-neighbors (B, C, and F) and two out-neighbors (B and F). So nodes B and F are in both E’s in-neighborhood and E’s out-neighborhood.
3. Node A has no in-neighbors, so A’s in-degree is zero. Node G has no out-neighbors, so G’s out-degree is zero.
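The directed definitions can be sketched in code in the same way as the undirected ones. The edge set below is an assumption: it contains only those edges of the graph in Example 11.6 that touch nodes C and E, as implied by the solution above, so it is enough to reproduce answers 1 and 2 but not answer 3. The function names are hypothetical.

```python
# Directed edges (u, v) meaning "an edge from u to v". This is a partial
# edge set: only the edges into or out of C and E, per the solution.
E = {
    ("B", "C"), ("C", "D"), ("C", "E"),
    ("B", "E"), ("F", "E"), ("E", "B"), ("E", "F"),
}
V = {u for edge in E for u in edge}

def in_neighborhood(u, edges):
    """All v with an edge from v to u."""
    return {a for (a, b) in edges if b == u}

def out_neighborhood(u, edges):
    """All v with an edge from u to v."""
    return {b for (a, b) in edges if a == u}

def in_degree(u, edges):
    return len(in_neighborhood(u, edges))

def out_degree(u, edges):
    return len(out_neighborhood(u, edges))

# Nodes that are both in-neighbors and out-neighbors of E.
both = in_neighborhood("E", E) & out_neighborhood("E", E)
```

Every directed edge leaves exactly one node and enters exactly one node, so summing the out-degrees (or the in-degrees) over all nodes counts each edge once.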

11.2.2 Representing Graphs: Data Structures
The graphs that we’ve considered so far have been presented visually: as a picture, with nodes drawn as circles and edges drawn as lines or arrows. But, of course, when we represent a graph on a computer, we’ll need to use some data structure to store a network, not just some image file. Here we will give a brief summary of the two major data structures used to represent graphs. If you’ve had a course on data structures, then this material may be a review; if not, it will be a preview.
Taking it further: A visual representation is great for some smaller networks, and a well-designed layout can sometimes make even large networks easy to understand at a glance. Graph drawing is the problem of algorithmically laying out the nodes of a graph well, that is, in an aesthetic and informative manner. There’s a physics analogy that’s often used in laying out graphs, in which we imagine nodes “attracting” and “repelling” each other depending on the presence or absence of edges. See p. 1124 for some discussion, including an application of this graph-drawing idea to the 9/11 Memorial in New York City. Some other gorgeous visualizations of network (and other!) data can be found online at sites like Flowing Data (http://flowingdata.com/), Information Is Beautiful (http://informationisbeautiful.net), or some of the beautiful books on data visualization like the Atlas of Science.3
3 Katy Börner. Atlas of Science: Visualizing What We Know. MIT Press, 2010.
The most straightforward data structure for a graph is just a list of nodes and a list of edges. But this straightforward representation suffers for some standard, natural questions that are typically asked about graphs. Many of the natural questions that we will find ourselves asking are things like: What are all of the neighbors of A? or Are B and C joined by an edge? There are two standard data structures for graphs, each of which is tailored to make it possible to answer one of these two questions quickly.
Adjacency lists
The first standard data structure for graphs is an adjacency list, which, as the name implies, stores, for each node u, a list of the nodes adjacent to u:
Definition 11.10 (Adjacency list)
In an adjacency list of a graph G = ⟨V, E⟩, for each node u ∈ V, we store an unsorted list of all of u’s neighbors in the graph.
The schematic for an adjacency list is illustrated in Figure 11.6: each node in the graph corresponds to a row of the table, which points to an unsorted list of that node’s neighbors. (These lists are unsorted so that it’s faster to add a new edge to the data structure.)
Figure 11.6: A schematic of an adjacency list: an array containing all nodes in the graph, in which the entry for a node u points to a linked list of u’s neighbors (v1, v2, v3, v4), and the entry for a node x with no neighbors points to an empty list.
There’s no significant difference between adjacency lists for undirected graphs and for directed graphs: for an undirected graph, we list the neighbors for each node u; for a directed graph, we list the out-neighbors of each node. (Every edge ⟨u, v⟩ in a directed graph appears only once in the data structure, in u’s list. Every edge {u, v} in an undirected graph is represented twice: v appears in u’s list, and u appears in v’s list. This observation is another way of thinking of the proof of Theorem 11.1.)

Here are example adjacency lists for two graphs, one undirected and one directed:
Example 11.7 (Two sample adjacency lists)
Consider the following two graphs:
[Figure: an undirected graph on the nodes Allie, Ben, Camille, Derek, and Evie, and a directed graph on the nodes A, B, C, D, and E.]
The adjacency lists for these two graphs are as follows.
Allie: Evie, Ben
Ben: Allie, Evie
Camille: (none)
Derek: (none)
Evie: Allie, Ben

A: B
B: C, D
C: E, A
D: (none)
E: C
Note that the order of the (out-)neighbors of any particular node isn’t specified: for example, we could just as well have said that Evie’s neighbors were [Ben, Allie] as [Allie, Ben].
Adjacency matrices
The second standard data structure for representing graphs is an adjacency matrix:
Definition 11.11 (Adjacency matrix)
In an adjacency matrix of a graph G = ⟨V, E⟩, we store the graph using a |V|-by-|V| table. The ith row of the table corresponds to the neighbors of node i. A True (or 1) in column j indicates that the edge ⟨i, j⟩ is in E; a False (or 0) indicates that ⟨i, j⟩ ∉ E.
In a directed graph, the ith row corresponds to the out-neighbors of node i, so that the ⟨i, j⟩th entry of the matrix corresponds to the presence/absence of an edge from i to j. The ith column corresponds to the in-neighbors of i. Here are two examples of adjacency matrices, for the graphs from Example 11.7:
Example 11.8 (Two sample adjacency matrices)
The following adjacency matrices represent the graphs from Example 11.7:
          Allie  Ben  Camille  Derek  Evie
Allie       0     1      0       0     1
Ben         1     0      0       0     1
Camille     0     0      0       0     0
Derek       0     0      0       0     0
Evie        1     1      0       0     0

     A  B  C  D  E
A    0  1  0  0  0
B    0  0  1  1  0
C    1  0  0  0  1
D    0  0  0  0  0
E    0  0  1  0  0

The adjacency matrix has two properties that are worth a note. (See Figure 11.7.)
• The main diagonal contains all zeros: a 1 in the ⟨i, i⟩th position of the matrix would correspond to an edge between node i and node i, that is, a self-loop, which is forbidden in a simple graph.
• For an undirected graph, the matrix is symmetric: the ⟨i, j⟩th position of the matrix records the presence or absence of an edge from i to j, which is identical to the presence or absence of an edge from j to i in an undirected graph. Adjacency matrices are not necessarily symmetric in directed graphs: there may be an edge from u to v without an edge from v to u.
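Both data structures are easy to build from a list of edges. The following Python sketch is one possible construction, not the only one: it uses a dictionary of Python lists in place of the array of linked lists in Figure 11.6, and a list of lists of 0s and 1s for the matrix. The function names are hypothetical; the two graphs are those of Example 11.7.

```python
def adjacency_list(nodes, edges, directed=True):
    """Map each node to an (unsorted) list of its neighbors.

    For a directed graph, u's list holds the out-neighbors of u.
    For an undirected graph, each edge is recorded in both lists.
    """
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)
    return adj

def adjacency_matrix(nodes, edges, directed=True):
    """Build a |V|-by-|V| table with a 1 in row i, column j iff <i, j> is an edge."""
    index = {u: i for i, u in enumerate(nodes)}
    M = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        M[index[u]][index[v]] = 1
        if not directed:
            M[index[v]][index[u]] = 1
    return M

# The two graphs of Example 11.7.
PEOPLE = ["Allie", "Ben", "Camille", "Derek", "Evie"]
FRIENDSHIPS = [("Allie", "Evie"), ("Allie", "Ben"), ("Ben", "Evie")]

LETTERS = ["A", "B", "C", "D", "E"]
ARROWS = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("C", "A"), ("E", "C")]

friends_matrix = adjacency_matrix(PEOPLE, FRIENDSHIPS, directed=False)
arrows_matrix = adjacency_matrix(LETTERS, ARROWS, directed=True)

# The two properties noted above: zero diagonal, and symmetry when undirected.
n = len(PEOPLE)
zero_diagonal = all(friends_matrix[i][i] == 0 for i in range(n))
symmetric = all(friends_matrix[i][j] == friends_matrix[j][i]
                for i in range(n) for j in range(n))
```

The directed matrix is not symmetric: for example, there is an edge from A to B but none from B to A.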
Choosing between adjacency lists and matrices
Which of the two data structures that we’ve seen for graphs should we choose? Are adjacency lists better than adjacency matrices, or the other way around? Recall the two basic questions about graphs that we wish to answer quickly:
(A) is v a neighbor of u?
(B) what are all of u’s neighbors?
Figuring out the details of how efficiently we can answer these questions with an adjacency list or an adjacency matrix is better suited to a data-structures textbook than this one, but here’s a brief summary of the reasoning.
Adjacency Lists: An adjacency list is perfectly tailored to answering Question (B): we’ve stored precisely the list of u’s neighbors for each node u, so we simply iterate through that list to output u’s neighborhood. To answer Question (A), we need to search through that same unsorted list to see if v is present. In both cases, we have to spend constant time finding u’s list in the table, and then we examine a list of length degree(u) to answer the question.
Adjacency Matrices: An adjacency matrix is perfect for answering Question (A): we just look at the appropriate spot in the table. If the ⟨u, v⟩th entry is True, then the edge ⟨u, v⟩ exists. This lookup takes constant time. Answering Question (B) requires looking at one entire row of the table, entry by entry. There are |V| entries in the row, so this loop requires |V| operations.
Thus adjacency matrices solve Question (A) faster, while adjacency lists are faster at solving Question (B). In addition to the time to answer these questions, we’d also want the space—the amount of memory—consumed by the data structure to be as small as possible. (You can think of “the amount of memory” as the total number of boxes that appear in the diagrams in Figures 11.6 and 11.7.)
Figure 11.7: A schematic of an adjacency matrix. The entry in row i, column j records whether ⟨i, j⟩ ∈ E, and the entry in row j, column i records whether ⟨j, i⟩ ∈ E (these are identical in undirected graphs). The main diagonal contains all zeros: ⟨j, j⟩ ∉ E (for simple graphs).
Meta-problem-solving tip: The answer to “which is better?” in a class or textbook is almost always It depends! After all, why would we waste time/pages on a solution that’s always worse!? (The only plausible answer is that it warms us up conceptually for a better but more complex solution.) The real question here is: what does it depend on?

Example 11.9 (Space consumption for adjacency lists and matrices)
Problem: Consider a graph G = ⟨V, E⟩ stored using an adjacency list or an adjacency matrix. In terms of the number of nodes and the number of edges in G (that is, in terms of |V| and |E|), how much memory is used by these data structures?
Solution: An adjacency matrix is a |V|-by-|V| table, and thus contains exactly |V|^2 cells. (Of them, the |V| cells on the diagonal are always 0, but they’re still there!) An adjacency list is a |V|-element table pointing to |V| lists; the length of the list for node u is exactly degree(u). Thus the total number of cells in the data structure is
|V| + ∑_{u∈V} degree(u).
In an undirected graph we have ∑_u degree(u) = 2|E|, by Theorem 11.1; in a directed graph we have ∑_u out-degree(u) = |E| by Exercise 11.18. Thus the total amount of memory used is |V| + 2|E| for an undirected graph and |V| + |E| for a directed graph.
Here’s the summary of the efficiency differences between these data structures (using asymptotic notation from Chapter 6):
                                  adjacency list         adjacency matrix
is v a neighbor of u?             1 + Θ(degree(u))       Θ(1) *
what are all of u’s neighbors?    1 + Θ(degree(u)) *     Θ(|V|)
space                             Θ(|V| + |E|) *         Θ(|V|^2)
The better data structure in each row is marked with an asterisk. (Note that, in a simple graph, we have that degree(u) ≤ |V| and |E| ≤ |V|^2.) So, is an adjacency list or an adjacency matrix better? It depends!
First, it depends on what kind of questions—Question (A) or Question (B) listed previously, for example—we want to answer: if we will ask few “is v a neighbor of u?” questions, then adjacency lists will be faster. If we will ask many of those questions, then we probably prefer adjacency matrices. Similarly, it might depend on how much, if at all, the graph changes over time: adjacency lists are harder to update than adjacency matrices.
Second, it depends on how many edges are present in the graph. If the total number of edges in the graph is relatively small—and thus most nodes have only a few neighbors—then degree(u) will generally be small, and the adjacency list will win. If the total number of edges in the graph is relatively large, then degree(u) will generally be larger, and the adjacency matrix will perform better. (Many of the most interesting real-world graphs are sparse: for example, the typical degree of a person in a social network like Facebook is perhaps a few hundred or at most a few thousand—very small in relation to the hundreds of millions of Facebook users.)
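To make the comparison concrete, here is a minimal Python sketch of both representations for a small undirected graph, together with the cell counts from Example 11.9. The function names and the four-node sample graph are illustrative choices, not notation from the text, and the sketch assumes a simple undirected graph.

```python
def build_adjacency_list(nodes, edges):
    """Map each node to the list of its neighbors (undirected graph)."""
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def build_adjacency_matrix(nodes, edges):
    """Return (index, matrix); matrix[i][j] is 1 exactly when the nodes
    at positions i and j are joined by an edge."""
    index = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return index, matrix


def list_cells(adj):
    """Cells in the adjacency list: one per node, plus one per list entry."""
    return len(adj) + sum(len(neighbors) for neighbors in adj.values())


def matrix_cells(matrix):
    """Cells in the adjacency matrix: every entry counts, zeros included."""
    return sum(len(row) for row in matrix)


# A hypothetical sample graph with |V| = 4 and |E| = 4.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = build_adjacency_list(nodes, edges)
index, matrix = build_adjacency_matrix(nodes, edges)

print(list_cells(adj))                 # |V| + 2|E| = 4 + 8 = 12
print(matrix_cells(matrix))            # |V|^2 = 16
print("C" in adj["A"])                 # scans a list of length degree(A)
print(matrix[index["A"]][index["D"]])  # a single table lookup
```

On this tiny graph the two structures use a similar number of cells; the gap only becomes dramatic when |E| is much smaller than |V|².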

11.2.3 Relationships between Graphs: Isomorphism and Subgraphs
Now that we have the general definitions, we’ll turn to a few more specific properties that certain graphs have. We’ll start in this section with two different relationships between pairs of graphs—when two graphs are “the same” and when one is “part” of another; in Section 11.2.4, we’ll look at single graphs with a particular structure.
Graph isomorphism
When two graphs G and H are identical except for how we happen to have arranged
the nodes when we drew them on the page (and except for the names that we happen to have assigned to the nodes), then we call the graphs isomorphic. Informally, G and H are isomorphic if there’s a way to relabel (and rearrange) the nodes of G so that G and H are exactly identical. More formally:
Definition 11.12 (Graph isomorphism)
Consider two graphs G = ⟨V, E⟩ and H = ⟨U, F⟩. We say that G and H are isomorphic if there exists a bijection f : V → U such that for all a ∈ V and b ∈ V,
    ⟨a, b⟩ ∈ E ⇔ ⟨f(a), f(b)⟩ ∈ F.
(Greek: iso “same”; morph “form.”)
(By abusing notation as we described earlier, this definition works for either undirected or directed graphs G and H.) Here are some small examples:

Example 11.10 (Two isomorphic graphs)
Let’s show that the following two directed graphs are isomorphic. (The first graph’s edges could also have been written as {⟨a, b⟩ : a < b and a evenly divides b}.)
[Figure: two directed graphs, the first on nodes 1, 2, ..., 6 and the second on nodes A, B, ..., F.]
To do so, define the following bijection f : {1, 2, ..., 6} → {A, B, ..., F}:

    x       1  2  3  4  5  6
    f(x)    A  D  C  F  B  E

The tables of edges in the graphs now match exactly, so they are isomorphic:

             1  2  3  4  5  6                   A  D  C  F  B  E
    1           ✓  ✓  ✓  ✓  ✓     f(1) = A         ✓  ✓  ✓  ✓  ✓
    2                 ✓     ✓     f(2) = D               ✓     ✓
    3                       ✓     f(3) = C                     ✓
    4                             f(4) = F
    5                             f(5) = B
    6                             f(6) = E

Example 11.11 (Isomorphic graphs)
Problem: Which pairs, if any, of the following graphs are isomorphic?
[Figure: three 10-node graphs. The first has nodes A through J, the second has nodes 0 through 9, and the third has nodes Q through Z.]
Solution: The first two graphs are isomorphic. The easiest way to see this fact is to show the mapping between the nodes of the two graphs:

    A  B  C  D  E  F  G  H  I  J
    1  2  3  4  5  0  7  9  6  8

It’s easy to verify that all 15 edges now match up between the first two graphs. But the third graph is not isomorphic to either of the others. The easiest justification is that node S in the third graph has degree 5, and no node in either of the first two graphs has degree 5. No matter how we reshuffle the nodes of graph #3, there will still be a node of degree 5—so the third graph can never match the others.

Problem-solving tip: When you’re trying to prove or disprove a claim about graphs, you may find it useful to test out the claim against the following four “trivial” graphs:
[Figure: four small “trivial” graphs.]
A lot of bogus claims about graphs turn out to be false on one of these four examples—or, unexpectedly, the so-called Petersen graph, the first graph in Example 11.11. (The Petersen graph is named after Julius Petersen, a 19th-century Danish mathematician.) It’s a good idea to try out any conjecture on all five of these graphs before you let yourself start to believe it!

Taking it further: In general, it’s easy to test whether two graphs are isomorphic by brute force (try all permutations!), but no substantially better algorithms are known. The computational complexity of the graph isomorphism problem has been studied extensively over the last few decades, and there has been substantial progress—but no complete resolution. It’s easy to convince someone that two graphs G and H are isomorphic: we can simply describe the relabeling of the nodes of G so that the resulting graphs are identical. (The “convincee” then just needs to verify that the edges really do match up.) When G and H are not isomorphic, it might be easy to demonstrate their nonisomorphism: for example, if they have a different number of nodes or edges, or if the degrees in G aren’t identical to the degrees in H. But the graphs may have identical degree distributions and yet not be isomorphic; see Exercise 11.49.

Subgraphs
When a graph H is isomorphic to a graph G, we can think of having created H by moving around some of the nodes and edges of G. When H is a subgraph of G, we can think of having created H by deleting some of the nodes and edges of G. (Of course, it doesn’t make sense to delete either endpoint of an edge e without also deleting the edge e.) Here’s the definition, for either undirected or directed graphs:

Definition 11.13 (Subgraph)
Let G = ⟨V, E⟩ be a graph. A subgraph of G is a graph G′ = ⟨V′, E′⟩ where V′ ⊆ V and E′ ⊆ E such that every edge ⟨u, v⟩ ∈ E′ satisfies u ∈ V′ and v ∈ V′.
(Note that Definition 11.13 uses the abuse of notation that we mentioned earlier: we “ought” to have written {u, v} ∈ E′ for the case that G is undirected.)

For example, consider the graph G = ⟨V, E⟩ with nodes V = {A, B, C, D} and edges E = {{A, B}, {A, C}, {B, C}, {C, D}}. Then the graph G′ with nodes {B, C, D} and edges {{B, C}, {C, D}} is a subgraph of G. In fact, G has many different subgraphs:

Example 11.12 (All 3-node subgraphs of G)
Here are all of the 3-node subgraphs of the graph G with nodes V = {A, B, C, D} and edges E = {{A, B}, {A, C}, {B, C}, {C, D}}. (There are many other subgraphs—about 50 total—when we consider subgraphs with 1, 2, 3, or 4 nodes.)
[Figure: 18 subgraphs, arranged in one row for each of the node sets A, B, C; A, B, D; A, C, D; and B, C, D.]

Taking it further: One of the earliest applications of a formal, mathematical perspective to networks—a collaboration between a psychologist and mathematician, in the 1950s—was based on subgraphs. Consider a signed social network, an undirected graph where each edge is labeled with ‘+’ to indicate friends, or ‘−’ to indicate enemies. (See Figure 11.8(a).) The adages “the enemy of my enemy is my friend” and “the friend of my friend is my friend” correspond to the claim that the subgraphs in Figure 11.8(b) would not appear. Dorwin Cartwright (the psychologist) and Frank Harary (the mathematician) proved some very interesting structural properties of any signed social network G that does not have either triangle in Figure 11.8(b) as a subgraph—a property that they called “structural balance”—and in the process helped launch much of the mathematical and computational work on graphs that’s followed.⁴
Figure 11.8: Signed social networks. (a) A signed network from 1941, on the nodes Ger, Jap, Ita, US, and UK. (b) Two triangles, one with edge labels +, +, − and one with edge labels −, −, −.
⁴ For more about signed networks and these results, see Dorwin Cartwright and Frank Harary. Structural balance: a generalization of Heider’s theory. Psychological Review, 63(5):277–293, 1956.

We sometimes refer to a special kind of subgraph: the subgraph of G = ⟨V, E⟩ induced by a set V′ ⊆ V of nodes is the subgraph of G where every edge between nodes in V′ is retained. The first subgraph in each row of Example 11.12 is the induced subgraph for its nodes. Here’s a brief description of one application of (induced) subgraphs:

Example 11.13 (Motifs in biological networks)
At any particular moment in any particular cell, some of the genes in the organism’s DNA are being expressed—that is, some genes are “turned on” and the proteins that they code for are being produced by the cell. Furthermore, one gene g can regulate another gene g′: when g is being expressed, gene g can cause the expression of gene g′ to increase or decrease over the baseline level. A great deal of recent biological research has allowed us to construct gene-regulation networks for different such settings: that is, a directed graph G whose nodes are genes, and whose edges represent the regulation of one gene by another.
Consider the induced subgraph of a particular set of genes in such a graph G—that is, the interactions among the particular genes in that set. Certain patterns of these subgraphs, called motifs, occur significantly more frequently in gene-regulation networks than would be expected by chance. Biologists generally believe that these repeated patterns indicate something important in the way that our genes work, so computational biologists have been working hard to build efficient algorithms to identify induced subgraphs that are overrepresented in a network.

11.2.4 Special Types of Graphs: Complete, Bipartite, Regular, and Planar Graphs
In Section 11.2.3, we looked at two ways in which a pair of graphs might be related. Here, we’ll consider special characteristics that a single graph might have—that is, subcategories of graphs with some particular structural properties. These special types of graphs arise frequently in various applications.
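Before turning to these special types, here is a minimal Python sketch of the brute-force isomorphism test mentioned in Section 11.2.3 (“try all permutations”). It checks Definition 11.12 directly for directed graphs given as lists of ordered pairs. The function names are illustrative, and the edge list for the second graph is an assumption: it is reconstructed from the bijection stated in Example 11.10, because the figure itself is not available here. The running time grows like |V|!, so this is only practical for very small graphs.

```python
from itertools import permutations


def is_isomorphism(f, edges_g, edges_h):
    """Check Definition 11.12 for a candidate bijection f (a dict).

    Because f is a bijection on the nodes, the condition
    <a, b> in E  <=>  <f(a), f(b)> in F  holds for all a, b exactly
    when the image of E under f equals F.
    """
    mapped = {(f[a], f[b]) for (a, b) in edges_g}
    return mapped == set(edges_h)


def find_isomorphism(nodes_g, edges_g, nodes_h, edges_h):
    """Try every bijection from nodes_g to nodes_h; return one that works,
    or None if the graphs are not isomorphic."""
    if len(nodes_g) != len(nodes_h):
        return None
    if len(set(edges_g)) != len(set(edges_h)):
        return None
    for image in permutations(nodes_h):
        f = dict(zip(nodes_g, image))
        if is_isomorphism(f, edges_g, edges_h):
            return f
    return None


# First graph of Example 11.10: an edge <a, b> whenever a < b and a divides b.
nodes_g = [1, 2, 3, 4, 5, 6]
edges_g = [(a, b) for a in nodes_g for b in nodes_g if a < b and b % a == 0]

# Second graph: edge list assumed from the bijection given in the example.
nodes_h = ["A", "B", "C", "D", "E", "F"]
edges_h = [("A", "D"), ("A", "C"), ("A", "F"), ("A", "B"), ("A", "E"),
           ("D", "F"), ("D", "E"), ("C", "E")]

print(find_isomorphism(nodes_g, edges_g, nodes_h, edges_h))
```

Under these assumptions the search recovers the mapping 1→A, 2→D, 3→C, 4→F, 5→B, 6→E from Example 11.10. For undirected graphs, one option is to store each edge as a frozenset of its two endpoints and map both endpoints through f in the same way.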
Complete graphs
Our first special type of graph is a complete graph (also called a clique), which is an undirected graph in which every possible edge exists:

Definition 11.14 (Complete graph/clique)
A complete graph or clique is an undirected graph G = ⟨V, E⟩ such that {u, v} ∈ E for any two distinct nodes u ∈ V and v ∈ V.
(In CS, the word clique usually rhymes with bleak or sleek. In common-language usage, the word usually rhymes with slick or flick.)

See Figure 11.9 for examples of complete graphs of varying sizes. (In everyday usage, a clique is a small, tight-knit, and exclusionary group of friends that doesn’t mingle with outsiders. If you think about a graph as a social network, the common-language meaning is similar to Definition 11.14.) Observe that an undirected graph with n nodes has (n choose 2) unordered pairs of nodes, and therefore an n-node complete graph has (n choose 2) = n(n − 1)/2 edges. A complete graph with n nodes is sometimes denoted by K_n.
Figure 11.9: Complete graphs with 3, 5, 8, and 16 nodes.
There are two different prevailing explanations for the K_n notation:
• the K is as in complete—or, rather, as in komplett; the notation was invented by a German speaker.
• the K is in honor of Kazimierz Kuratowski, a 20th-century Polish mathematician who made major contributions to the study of graphs (among other mathematical topics).

The word clique can also refer to a subgraph that’s complete—that is, in which every possible edge actually exists. For example, the graph G = ⟨V, E⟩ with V = {A, B, C, D} and E = {{A, B}, {A, C}, {B, C}, {C, D}} contains a 3-node clique {A, B, C}. Here’s one small example of an interesting application in which cliques arise:

Example 11.14 (Collaboration networks and cliques)
Imagine a setting in which different groups of people can work together in different teams, with each person allowed to participate in multiple teams. For example:
• actors in movies. (A “team” is the cast of a single movie.)
• scientific researchers. (A “team” is the set of coauthors of a published paper.)
• employees of a company. (A “team” is a group that worked on a specific project.)
A collaboration network is a graph G that represents a setting like these: the nodes of G are the people involved; there is an edge between any two people who have worked together on at least one team. (You may have heard of a challenge in the collaboration network: in the Kevin Bacon Game, you’re given the name of some actor A; your job is to find a sequence of edges that connects A to the “Kevin Bacon” node in the movie collaboration network. There’s a similar game that computer scientists play in the scientific collaboration network, trying to connect themselves to the Hungarian polymath Paul Erdős. See p. 438.)
For example, for the teams listed below, we get the collaboration network at right:
• Tigers: Deborah, George, Hicham, Josh, Lauren
• Unicorns: Anita, Bev, Eva, Fernan
• Vultures: Cathy, Eva, Kelly
Notice that each team results in a clique inside the collaboration graph—every pair of members of that team is joined by an edge—in this case, creating a K_5, K_4, and K_3 in the graph:
[Figure: the collaboration network, with nodes labeled by first initial, followed by three copies highlighting the cliques formed by the Tigers, the Unicorns, and the Vultures.]

Bipartite graphs
Our second special kind of graph is a bipartite graph. In a bipartite graph, the nodes can be divided into two groups such that no edges join two nodes that are in the same group: that is, there are two “kinds” of nodes, and all edges join a node of Type A to a node of Type B. Formally:

Definition 11.15 (Bipartite graph)
A bipartite graph is an undirected graph G = ⟨V, E⟩ such that V can be partitioned into two disjoint sets L and R where, for every edge e ∈ E, one endpoint of e is in L and the other endpoint of e is in R.
(Latin: bi “two”; part “part.”)

For example, consider the graph G = ⟨V, E⟩ whose nodes are V = {A, B, C, D, E, F} and whose edges are E = {{A, B}, {A, C}, {C, E}, {D, E}}. The graph G is bipartite: for example, we can split the nodes into two groups—the vowels {A, E} and the consonants {B, C, D, F}—such that every edge joins a vowel and a consonant. (There’s another split that would also have worked: {A, E, F} and {B, C, D}.) See Figure 11.10 for a visualization of the vowel–consonant split. Bipartite graphs are traditionally drawn with the nodes arranged in two columns, one for each part: left (“L”) and right (“R”). But notice that the definition only requires that it be possible to divide the nodes into two groups, with no within-group edges.
Figure 11.10: A bipartite graph.

Example 11.15 (Bipartite or nonbipartite?)
Problem: Which of the following graphs are bipartite?
[Figure: five graphs, labeled (a) through (e).]
Solution: All of them except (c)! Although (d) and (e) are the only graphs drawn in the “two-column” format, both (a) and (b) can be rearranged into two columns. In fact, aside from node positioning, graphs (a) and (d) are identical. And, similarly, graphs (b) and (e) are isomorphic! Only (c) is not bipartite: if we attempt to put the topmost node in one group, then the next-highest two nodes must both be in the other group—but they’re joined by an edge themselves, and so we’re stuck.

Many interesting real-world phenomena can be modeled using bipartite graphs:

Example 11.16 (Bipartite graphs as models)
Here are just a few of the scenarios that are naturally modeled using bipartite graphs:
• dating relationships in a strictly heterosexual community: the nodes are the boys B and the girls G; every edge connects some boy to some girl.
• nodes are courses and students; an edge joins a student to each class she’s taken.
• affiliation networks: people and organizations are the nodes; an edge connects person p and organization o if p is a member of o.

There’s one further refinement of bipartite graphs that we’ll mention: a complete bipartite graph is a bipartite graph in which every possible edge exists. In other words, a complete bipartite graph has the form G = ⟨L ∪ R, E⟩ where {l, r} ∈ E for every node l ∈ L and r ∈ R. A complete bipartite graph with l nodes in the left group and r nodes in the right group is sometimes denoted by K_{l,r}. See Figure 11.11 for a few examples. (Note again that, as with the K_{2,4} in Figure 11.11, we don’t have to draw a bipartite graph in two-column format—if it’s bipartite, then it’s still bipartite no matter how we draw it!)
Figure 11.11: Complete bipartite graphs of varying sizes: K_{1,4}, K_{4,4}, K_{8,4}, K_{8,8}, and K_{2,4}.

Regular graphs
Our next type of graph is defined in terms of the degree of its nodes: a regular graph is one in which all of the nodes have an identical number of neighbors. (Most of the time one talks about regular graphs that are undirected, but we can speak of regular directed graphs, too; we’d generally require that all in-degrees match each other and all out-degrees match each other.)

Definition 11.16 (Regular graph)
Let d ≥ 0 be an integer. A d-regular graph is a graph G such that every node has degree precisely equal to d. If G is d-regular for some d, then we say that G is a regular graph.

For example, consider the graph G = ⟨V, E⟩ whose nodes are V = {A, B, C, D, E, F} and whose edges are E = {{A, B}, {A, E}, {B, C}, {C, F}, {D, E}, {D, F}}. The graph G is 2-regular: you can check that each node has exactly two neighbors. As another example, note that the complete graph K_n is (n − 1)-regular, as each node has all n − 1 other nodes as neighbors. Or see Figure 11.12 for another example of a regular graph.
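Definitions 11.14, 11.15, and 11.16 each translate into a short check. The following Python sketch is one way to write them for a simple undirected graph given as a node list and an edge list; the function names are illustrative rather than taken from the text, and the bipartiteness test uses a standard breadth-first two-coloring, which is one of several ways to find the partition.

```python
from collections import deque
from itertools import combinations


def neighbors_of(nodes, edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def is_complete(nodes, edges):
    """Definition 11.14: every two distinct nodes are joined by an edge."""
    adj = neighbors_of(nodes, edges)
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def bipartition(nodes, edges):
    """Return sets (L, R) as in Definition 11.15, or None if none exists."""
    adj = neighbors_of(nodes, edges)
    side = {}
    for start in nodes:
        if start in side:
            continue
        side[start] = 0                  # begin a new connected piece
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in side:
                    side[v] = 1 - side[u]    # put v opposite u
                    queue.append(v)
                elif side[v] == side[u]:     # edge inside one group
                    return None
    left = {u for u in nodes if side[u] == 0}
    right = {u for u in nodes if side[u] == 1}
    return left, right


def regular_degree(nodes, edges):
    """Return d if the graph is d-regular (Definition 11.16), else None."""
    adj = neighbors_of(nodes, edges)
    degrees = {len(adj[u]) for u in nodes}
    return degrees.pop() if len(degrees) == 1 else None


letters = ["A", "B", "C", "D", "E", "F"]
bipartite_edges = [("A", "B"), ("A", "C"), ("C", "E"), ("D", "E")]
regular_edges = [("A", "B"), ("A", "E"), ("B", "C"),
                 ("C", "F"), ("D", "E"), ("D", "F")]
k5_nodes = [1, 2, 3, 4, 5]
k5_edges = list(combinations(k5_nodes, 2))

print(bipartition(letters, bipartite_edges))
print(regular_degree(letters, regular_edges))
print(is_complete(k5_nodes, k5_edges), regular_degree(k5_nodes, k5_edges))
print(bipartition(k5_nodes, k5_edges))
```

On the bipartite example from the text, this sketch happens to find the split {A, E, F} and {B, C, D} rather than the vowel–consonant split, since the isolated node F can go on either side. The 6-node example comes out 2-regular, and K_5 is complete, 4-regular, and not bipartite.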
There are many real-world examples in which regular graphs are useful: for example, imagine constructing a physical network of computers in which each machine only has the capacity for a fixed number of connections. Here are two other useful applications of regular graphs:

Example 11.17 (Scheduling sports with a regular graph)
You are the League Commissioner for an intramural ultimate frisbee league. There are 10 teams in the league, each of which should play four games. No two teams should play each other twice. Suppose that you construct an undirected graph G = ⟨V, E⟩, where V = {1, 2, ..., 10} is the set of teams, and E is the set of games to be played. If G is a 4-regular graph, then all of the listed requirements are met. Figure 11.12 is a randomly generated example of such a graph; you could use that graph to set the league schedule.
Figure 11.12: A 4-regular 10-node graph.

A 1-regular graph is called a perfect matching, because each node is “matched” with one—and only one—neighbor. (If every node has degree at most 1, then the graph is just called a matching.) Matchings have a variety of applications—for example, see p. 960 for their role in the Enigma machine—but here’s another specific use of matchings, in assigning partnerships:

Example 11.18 (Matchings for CS partnerships)
Each of n students in an Intro CS class submits a list of people whom they’d like to have as a partner for the final project. Define the following undirected graph G:
• the set V of nodes is {1, 2, ..., n}, one per student.
• the set E of edges includes {u, v} if both of the following are true: student u wants to work with student v, and student v wants to work with student u.
The instructor can assign partnerships by finding a 1-regular graph G′ = ⟨V, E′⟩ with E′ ⊆ E—that is, a subgraph of G that includes all of the nodes of G. For example:
[Figure: a graph G on nodes 1, 2, ..., 12, followed by two 1-regular subgraphs of G that are (among others) valid partner assignments.]
(Incidentally, Example 9.32 asked: how many perfect matchings are there in K_n?)

Planar graphs
Our last special type of graph is a planar graph, which is one that can be drawn on a sheet of paper without any lines crossing:

Definition 11.17 (Planar graph)
A planar graph is a graph G such that it is possible to draw G on a plane (that is, on a piece of paper) such that no edges cross.

It’s important to note that a graph is planar if it is possible to draw it with no crossing edges; just because a graph is drawn with edges crossing does not mean that it isn’t planar. Here is an example of a planar graph:

Example 11.19 (New England, in a plane)
Here are two copies of the same graph—one drawn with edge crossings, and another with the nodes rearranged to avoid edge crossing:
[Figure: the same graph on the nodes ME, NH, VT, MA, RI, CT, and NY, drawn twice.]

Example 11.19 shows one of the most famous types of planar graph, one derived from a map: we can think of the countries on a map as nodes, and we draw an edge between two country–nodes if those two countries share a border. (See p. 437 for a discussion of the four-color theorem for maps, which we could have phrased as a result about planar graphs instead.)
There are other applications of planar graphs in computer science, too. For example, we can view a circuit (see Section 3.3.3) as a graph, where the logic gates correspond to nodes and the wires correspond to edges. Most modern circuits are now printed on a board (where the “ink” is the conducting material that serves as the wire), and the question of whether a particular circuit can be printed on a single layer is precisely the question of whether its corresponding graph is planar. (If it’s not planar, we’d like to minimize the number of edges that cross, or more specifically the number of layers we’d need in the circuit.)
Here’s one more set of planarity challenges for you to try:

Example 11.20 (Two planar challenges)
Problem: Are these graphs planar?
[Figure: graph 1, on nodes A through G, and graph 2, on nodes H through M, each drawn with edge crossings.]
Solution: Yes, both: we can rearrange the nodes so that there are no edges that cross.
[Figure: crossing-free drawings of graph 1 and graph 2.]

Taking it further: Determining how to lay out a planar graph without edge crossings can be an interesting amusement—see www.planarity.net for a surprisingly fun game based on planar graphs.
So far we haven’t seen any examples of graphs that can’t be rearranged so that no edges cross. But, if you play around long enough, you should be able to convince yourself that neither K_5 nor K_{3,3} is planar; see Figure 11.13. And, while this shouldn’t be at all obvious, it turns out that K_5 and K_{3,3} are in a sense the only “reasons” that a graph can be nonplanar. A theorem known as Kuratowski’s Theorem—after the Polish mathematician who may have lent his initial to the notation for complete graphs—says that every graph is planar unless it “contains” K_5 or K_{3,3} for a subgraph-like notion of “containment.” (It’s not exactly the subgraph relation, because there are graphs that do not contain K_5 or K_{3,3} as subgraphs but nonetheless are nonplanar in some sense “because” of one of them. For example, the Petersen Graph from Example 11.11—see Figure 11.13(c)—is nonplanar, but it doesn’t have K_5 as a subgraph. But if we “collapse” together the nodes A/F, B/G, C/H, D/I, and E/J into “supernodes” then the resulting graph is K_5.)
Figure 11.13: Nonplanar graphs. (a) K_5. (b) K_{3,3}. (c) The Petersen graph.
Computer Science Connections
Degree Distributions and the Heavy Tail
When we think about massive graphs like the World-Wide Web (with nodes representing web pages and edges representing hyperlinks from one page to another) or an online social network (with nodes representing people and edges representing “friendships”), it is interesting to look at how properties of individual nodes are distributed across the population. We can look at the distribution of any node-by-node property—the physical height of Twitter users, or the number of words of text per web page, for example. But in addition to demographic properties like height and length, we can also look at the distribution of network-type properties. The degree distribution of a graph G shows, for each possible degree d, the number of nodes in G whose degree is d.
While one might initially expect degree distributions to look similar to the distribution of heights, it turns out that the degree distribution of an online social network has very different properties. Figure 11.14 shows the degree distribution (in linear, log–log, and cumulative form) for members of the University of North Carolina.⁵
Figure 11.14 shows, for each value of k, the number of people who have precisely k Facebook friends. About 350 people have only 1 friend, which is the most common number of friends to have. There are about 750,000 friendships represented in this dataset; the average degree is ≈ 84. But, looking at the far-right end of Figure 11.14(a) and 11.14(b), we see a handful of people with very high degrees: 2000, 2500, 3000, and even ≈ 3800. One of the interesting facts about degree distributions in real social networks (or the web) is that there are people whose popularity is massively larger than average: the highest-degree person in this dataset is about 3800/84 ≈ 45 times more popular than average. (Imagine the tallest person at the University of North Carolina being 45 times taller than average!)
Significant research by computer scientists (and many others!) interested in the structure of social networks and the world-wide web has focused on this so-called heavy-tailed degree distribution.⁶ Some of the literature debates the particular form of this distribution; for example, whether the distribution has the particular form of a power law, where the number of people with degree k is roughly k^(−α) for some small constant α, usually around 2.
Figure 11.14: The degree distribution of ≈ 18,000 Facebook users at the University of North Carolina. (a) The degree distribution. (b) A log–log plot of the degree distribution. (c) The cumulative degree distribution. In each panel, the horizontal axis is the degree k; in panels (a) and (b), the vertical axis is the number of users with degree k. Figure 11.14(b) shows a log–log plot of the same data as the linear plot in Figure 11.14(a). Figure 11.14(c) shows a log–log plot of the cumulative degree distribution: the number of people with degree ≥ k, whereas Figures 11.14(a) and 11.14(b) showed the number with degree = k.
⁵ From the Facebook5 dataset, from Mason Porter via the International Network for Social Network Analysis: Amanda L. Traud, Peter J. Mucha, and Mason A. Porter. Social structure of Facebook networks. CoRR, abs/1102.2166, 2011.
⁶ You can read more about power laws and heavy-tailed degree distributions: David A. Easley and Jon M. Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010.
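The two quantities plotted in Figure 11.14, the number of nodes with degree exactly k and the number with degree at least k, are simple to compute from an edge list. The following Python sketch assumes a simple undirected graph; the function names and the four-node toy graph are illustrative, and nothing here reproduces the Facebook data.

```python
from collections import Counter


def degree_distribution(nodes, edges):
    """Map each degree d to the number of nodes whose degree is d."""
    degree = {u: 0 for u in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def cumulative_distribution(dist):
    """Map each degree k that occurs to the number of nodes with degree >= k."""
    result = {}
    running = 0
    for k in sorted(dist, reverse=True):   # sweep from the largest degree down
        running += dist[k]
        result[k] = running
    return result


def average_degree(nodes, edges):
    """Each edge contributes to two degrees, so the average is 2|E| / |V|."""
    return 2 * len(edges) / len(nodes)


# A hypothetical toy graph: degrees are A: 2, B: 2, C: 3, D: 1.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

dist = degree_distribution(nodes, edges)
print(sorted(dist.items()))                            # [(1, 1), (2, 2), (3, 1)]
print(sorted(cumulative_distribution(dist).items()))   # [(1, 4), (2, 3), (3, 1)]
print(average_degree(nodes, edges))                    # 2.0
```

Plotting either dictionary with both axes on a logarithmic scale gives the kind of picture described for Figures 11.14(b) and 11.14(c).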
Computer Science Connections
Graph Drawing, Graph Layouts, and the 9/11 Memorial
Visual representations of most large graphs are too cluttered for a human viewer to process: there are just too many nodes and edges crammed into a small space to see much of anything. Visually presenting a graph like Facebook (billions of nodes, tens of billions of edges) without it looking like a grade-school scribble is daunting. But there is an entire subfield of computer science called graph drawing, which is devoted to taking networks and producing good—clear, aesthetic, informative—images of the networks.
In some large graphs, each node has a “natural location” and thus it is clear where on the page it should be placed. For example, graphs may represent data in which the nodes have a precise location situated in the physical world. When we have that kind of layout information for each node, presenting the graph well is easier. (See Figure 11.15.) But many large graphs do not have obvious coordinates associated with each node: while you and your college classmates do have geographic locations (dorm rooms), it’s not clear that your dorm really best describes “where” you fit in the social scene of your institution.
For graphs whose nodes don’t have obvious coordinates, we have to do something else. One approach that’s often used in graph drawing is to arrange the nodes based on a physics analogy, as follows. Imagine each node as a charged particle: any two nodes that are joined by an edge are pulled together by an attractive force, and any two nodes that are not joined by an edge are pushed apart by a repulsive force. Then figuring out how to place nodes on the page can be done by starting them in a random configuration and letting the attractive/repulsive forces move the nodes around until they’re “happy” in their current positions.
An idea like this one was actually used in designing the 9/11 memorial at the site of the World Trade Center. The memorial was designed with bronze panels inscribed with the 2982 names of victims. A team of computer scientists, architects, and visual artists collaborated to organize the names in a meaningful way. Families were invited to submit “meaningful adjacencies” between victims—which would cause two names to be as close together in the bronze panels as possible. (One of the other algorithmic issues regarding the layout of this memorial was that the designers wanted the names to be placed at evenly spaced intervals on the bronze panels; this constraint added to the computational complexity of the process.) The team used an algorithm to organize the names in an arrangement that respected these requests, which was then used in the final design of the memorial.⁷
Figure 11.15: A visualization of selected European train routes, where each node’s position corresponds to the city’s spatial location. Image reproduced with permission from RGBAlpha/Getty Images, Inc.
⁷ In addition to the broader news reports on the wrenching emotional and historical aspects of the 9/11 Memorial, the algorithmic aspects of the memorial were also covered in the popular press. You can read more about it here: Nick Paumgarten. The names. The New Yorker, 16 May 2011.

11.2.5 Exercises
For each of the following, draw a graph G = ⟨V, E⟩ for the following sets of nodes and edges. Does it make sense to use a directed or undirected graph? Is the graph you’ve drawn simple?
11.1 nodes V = {1, 2, ..., 10}; an edge connects x and y if gcd(x, y) = 1.
11.2 nodes V = {1, 2, ..., 10}; an edge connects x and y if x divides y.
11.3 nodes V = {1, 2, ..., 10}; an edge connects x and y if x < y.
For the following undirected graphs, list the edges of the graph, and identify the node(s) with the highest degree. For the directed graphs, identify the node(s) with the highest in-degree, and the node(s) with the highest out-degree.
11.4, 11.5, 11.6, 11.7 [The graphs for these exercises are given as figures.]
Consider a graph G = ⟨V, E⟩ with n := |V| nodes. State your answers in terms of n.
Justify.
11.8 If G is an undirected, simple graph, what’s the largest that |E| can be? The smallest?
11.9 If G is a directed, simple graph, what’s the largest that |E| can be? The smallest?
11.10 How do your answers to Exercise 11.9 change if self-loops are allowed?
11.11 How do your answers to Exercise 11.9 change if self-loops and parallel edges are allowed?
The anthropologist Robin Dunbar has argued that humans have a mental capacity for only ≈ 150 friends.⁸ (This argument is based in part on the physical size of the human brain, and cross-species comparisons; 150 is now occasionally known as Dunbar’s Number.) Suppose that Alice has exactly 150 friends, and each of her friends has exactly 150 friends—that is, a friend of Alice knows Alice and 149 other people. (Note that Alice’s friends’ sets of friends can overlap.) Let S denote the set of people that Alice knows directly or with whom Alice has a mutual friend.
11.12 What’s the largest possible value of |S|?
11.13 What’s the smallest possible value of |S|?
Continue to assume that everyone has precisely 150 friends. Let S_k denote the set of all people that Bob knows via a chain of k or fewer intermediate friends:
• Bob’s friends are in S_0;
• the people in S_0 and the friends of people in S_0 are in S_1;
• the people in S_1 and the friends of people in S_1 are in S_2;
and so forth.
11.14 Let k ≥ 0 be arbitrary. What’s the largest possible value of |S_k|?
11.15 Let k ≥ 0 be arbitrary. What’s the smallest possible |S_k|?
Prove the following properties of graphs, related to Theorem 11.1 or degree more generally:
11.16 Let u be a node in an undirected graph G. Prove that u’s degree is at most the sum of the degrees of u’s neighbors.
11.17 Prove Corollary 11.2: in an undirected graph G = ⟨V, E⟩, let n_odd denote the number of nodes whose degree is odd. Prove that n_odd is an even number. That is: prove that |{u ∈ V : degree(u) mod 2 = 1}| mod 2 = 0.
11.18 Prove the analog of Theorem 11.1 for directed graphs: for a directed graph G = ⟨V, E⟩,
    ∑_{u∈V} in-degree(u) = ∑_{u∈V} out-degree(u) = |E|.
⁸ Robin Dunbar. How Many Friends Does One Person Need?: Dunbar’s Number and Other Evolutionary Quirks. Harvard University Press, 2010. Thanks to Michael Kearns, from whom I learned a somewhat related version of these exercises.
A linked list is a data structure consisting of a collection of nodes, each of which contains two fields: a data field (whatever the node stores) and a next field that is either null or points to a node in the linked list. A particular node is designated as the head node. Note that a circular linked list in which a node points back to a previously encountered node meets this definition. See Figure 11.16. Define a not-necessarily-simple directed graph G = ⟨V, E⟩, where V is the set of all nodes reachable by following any number of next pointers starting at the head node, and ⟨u, v⟩ ∈ E if u’s next field points to v. Observe that each node u in G has out-degree d ∈ {0, 1}. Describe a 5-node linked list in which . . .
11.19 . . . every node has in-degree d = 1.
11.20 . . . some node has in-degree d = 2.
11.21 . . . the resulting graph G is not simple.
11.22 (This exercise is a tougher algorithmic challenge.) You are given access to the head node h of an n-node linked list. The value of n is unknown to you. The only operations permitted are (a) to save a node; (b) test whether two saved nodes are the same or different; and (c) given a node u, fetch the node pointed to by u.next. Give an algorithm to determine whether the given list is circular using only a constant amount of memory—that is, remembering only a constant number of nodes at a time.
A doubly linked list has n nodes with data and two pointers, previous and next, to other nodes (or null). (See Figure 11.17 for an example.)
Let C_n denote an n-node doubly linked list with nodes {1, 2, . . . , n}, where, for each node u,
• u’s next node is v = (u mod n) + 1
• v’s previous node is u.
Define a directed graph G_n = ⟨V, E⟩, where V is the set {1, 2, . . . , n} of nodes, and every node has two edges leaving it: one edge ⟨u, u.next⟩, and one edge ⟨u, u.previous⟩.
11.23 Draw G_5.
11.24 Give an example of a G_n that contains a self-loop.
11.25 Give an example of a G_n that contains parallel edges.

Write down an adjacency list representing each of the following graphs.
11.26 [graph figure omitted]
11.27 [graph figure omitted]
11.28 [graph figure omitted]
11.29 [graph figure omitted]

Now give an adjacency matrix for the graphs shown in the above exercises:
11.30 Exercise 11.26
11.31 Exercise 11.27
11.32 Exercise 11.28
11.33 Exercise 11.29

11.34 Suppose that a (possibly directed or undirected) simple graph G is represented by an adjacency list. Suppose further that, for every node u in G, the list of (out-)neighbors of u has a different length. True or False: G must be a directed graph. Justify your answer.
11.35 Describe a directed graph G meeting the specifications of Exercise 11.34.

Figure 11.16: A linked list. Each rectangle is a node, and shows two fields: data on the left and next on the right.
Figure 11.17: A doubly linked list. Each rectangle is a node, and shows three fields: previous on the left, data in the middle, and next on the right.

The density of a graph G = ⟨V, E⟩ is the fraction of all possible edges that actually exist: that is,
    density = |E| / [your answer to the first part of Exercise 11.8/Exercise 11.9].

Taking it further: Informally, a dense graph is one for which most pairs of nodes are joined by an edge, and a sparse graph is one in which few pairs of nodes are joined by an edge. We will use these terms informally; a graph is dense if its density is close to 1, and sparse if its density is close to 0. Some people define graphs as dense if |E| = Θ(|V|²) and as sparse if |E| = O(|V|).
(These asymptotic definitions only make sense for a family of graphs—one for each size n.) There are (families of) graphs that are neither sparse nor dense according to this definition; see Exercise 6.37.

As a function of n, what are the densities of the following undirected graphs, with nodes V = {1, 2, . . . , n}? (See Figure 11.18 for small versions of each of these graphs.)
11.36 an n-node path: E = {{1, 2}, {2, 3}, . . . , {n − 1, n}}.
11.37 an n-node cycle: E = {{1, 2}, {2, 3}, . . . , {n − 1, n}, {n, 1}}.
11.38 n/3 disconnected triangles (assume that n mod 3 = 0):
    E = {{1, 2}, {2, 3}, {3, 1}, {4, 5}, {5, 6}, {6, 4}, . . . , {n − 2, n − 1}, {n − 1, n}, {n, n − 2}},
    that is, a triangle on 1, 2, 3; a triangle on 4, 5, 6; . . . ; and a triangle on n − 2, n − 1, n.
11.39 3 separate (n/3)-node cliques (assume that n mod 3 = 0): E = {{x, y} : x mod 3 = y mod 3}.

Figure 11.18: A 12-node path, cycle, collection of n/3 triangles, and collection of three (n/3)-node cliques.

A hypercube H_n is a graph in which the 2^n different nodes are all elements of {0, 1}^n. There is an edge between x and y if they differ in only one bit position. (Using the language of Chapter 4.2, there’s an edge between any two nodes whose Hamming distance is 1.)
11.40 Draw H_3.
11.41 Write down an adjacency list for H_4.
11.42 Write down an adjacency matrix for H_4.
11.43 In terms of n, how many edges does H_n have? What is its density?

Decide whether the following pairs of graphs are isomorphic, and prove your answers.
11.44 [graph figures omitted]
11.45 [graph figures omitted]
11.46 G_1 = ⟨V_1, E_1⟩, where V_1 = {10, 11, 12, 13, 14, 15} and ⟨x, y⟩ ∈ E_1 if and only if x and y are not relatively prime.
      G_2 = ⟨V_2, E_2⟩, where V_2 = {20, 21, 22, 23, 24, 25} and ⟨x, y⟩ ∈ E_2 if and only if x and y are not relatively prime.

Prove or disprove the following claims about isomorphism:
11.47 All 5-node graphs with degrees 1, 1, 1, 1, and 0 are isomorphic.
11.48 All 5-node graphs with degrees 4, 4, 4, 3, and 3 are isomorphic.
11.49 All 5-node graphs with degrees 3, 3, 2, 2, and 2 are isomorphic.
11.50 All n-node, 3-regular graphs are isomorphic.

The computational problem of finding the largest clique (complete graph) that’s a subgraph of a given graph G is believed to be very difficult. But for small graphs it’s possible to do, even by brute force. For each of the following graphs, identify the size of the largest clique that’s a subgraph of the given graph.
11.51 [graph figure omitted]
11.52 [graph figure omitted]
11.53 [graph figure omitted]

11.54 Consider the collaboration network (see Example 11.14) in Figure 11.19. Assuming that the nodes correspond to actors in movies, what is the smallest number of movies that could possibly have generated this collaboration network?
11.55 Are you certain that there weren’t more movies than [your answer to the previous exercise] that generated this graph? Explain.

Figure 11.19: A collaboration network.

For which integers n are the following graphs bipartite? Prove your answers.
11.56 V = {1, 2, . . . , n}; E = {⟨i, i − 1⟩ : i ≥ 2}.
11.57 V = {0, 1, . . . , n − 1}; E = {⟨i, i + 1 mod n⟩ : i ≥ 1}.
11.58 K_n. That is, a complete graph of n nodes: V = {1, 2, . . . , n}; E = {{u, v} : u ∈ V and v ∈ V}.
11.59 V = {0, 1, . . . , 2n − 1}; E = {⟨i, (i + n) mod 2n⟩ : i ∈ V}.

Are either of the following graphs bipartite? Explain.
11.60 [graph figure omitted]
11.61 [graph figure omitted]

Consider a bipartite graph with a set L of nodes in the left column and a set R of nodes in the right column, where |L| = |R|. Prove or disprove the following claims:
11.62 The sum of the degrees of the nodes in L must equal the sum of the degrees of the nodes in R.
11.63 The sum of the degrees of the nodes in L must be even.
11.64 The sum of the degrees of all nodes (that is, all nodes in L ∪ R) must be an even number.

Suppose that G is a complete bipartite graph with n nodes—that is, G = K_{|L|,|R|} for |L| + |R| = n.
11.65 What’s the largest number of edges that can appear in G?
11.66 What’s the smallest number of edges that can appear in G? (Careful!)
11.67 Prove or disprove: any graph that does not contain a triangle (that is, three nodes a, b, and c with the edges {a, b} and {b, c} and {c, a} in the graph) as a subgraph is bipartite.
11.68 Definition 11.16 describes a regular undirected graph. In a directed regular graph, we require that there be two integers d_in and d_out such that every node’s in-degree is d_in and every node’s out-degree is d_out. Prove that we must have d_in = d_out.

Show that both of the following graphs are planar.
11.69 [graph figure omitted]
11.70 [graph figure omitted]
11.71 Prove that any 2-regular graph is planar.

11.3 Paths, Connectivity, and Distances

    Well, you can go west to the next intersection, get onto the turnpike, go north through the toll gate at Augusta, ’til you come to that intersection . . . well, no. You keep right on this tar road; it changes to dirt now and again. Just keep the river on your left. You’ll come to a crossroads and . . . let me see. Then again, you can take that scenic coastal route that the tourists use. And after you get to Bucksport . . . well, let me see now. Millinocket. Come to think of it, you can’t get there from here.
    Marshall Dodge (1935–1982) and Robert Bryan (b. 1931), “Which Way to Millinocket?” Bert and I (1958)

One of the most basic questions that one can ask about a graph is whether it is possible to get from some given node s to some given node t by following a sequence of edges. Is there some chain of friends that connects Barack Obama to Phil Collins? Can you get from Missoula to Madison by car? (And, if there is a way to get from s to t, what is the shortest way to get there?) These basic questions concern the existence of paths in the graph, as formalized in Definition 11.18.

(Note that this definition includes both directed and undirected graphs: if the edges are directed, we have to follow them “in the right direction.”) For example, in both of the graphs shown in Figure 11.21, there is no path from A to X. But, in both, the sequence ⟨A, C, E, Z⟩ is a path of length 3 from A to Z.
In both cases, the edges traversed by the path are {⟨A, C⟩, ⟨C, E⟩, ⟨E, Z⟩}. Notice that the length of a path is the number of edges that it traverses, which is one fewer than the number of nodes in the path.

Taking it further: A common mistake made by novice (and not-so-novice) programmers is an off-by-one error in specifying the bounds on a loop, by iterating either one time too many or one time too few. These errors are also sometimes called fencepost errors: if you build a 10-yard fence with posts placed every yard, then there are eleven fenceposts (at yard 0, yard 1, . . ., yard 10). Be careful! A path ⟨A, C, E, Z⟩ contains four nodes, but it traverses three edges (A → C, C → E, and E → Z) and has length 3.

Figure 11.20: Paths in undirected and directed graphs.
Figure 11.21: Two graphs with paths from A to Z.

Definition 11.18 (Path)
Consider a (directed or undirected) graph G = ⟨V, E⟩. A path in G is a sequence ⟨u_1, u_2, . . . , u_k⟩ of k ≥ 1 nodes such that:
• u_i ∈ V for every i ∈ {1, . . . , k}, and
• ⟨u_i, u_{i+1}⟩ ∈ E for every i ∈ {1, . . . , k − 1}.
(See Figure 11.20.) We say that such a sequence of nodes is a path from u_1 to u_k, and that this path has length k − 1. We also say that this path traverses the edges ⟨u_i, u_{i+1}⟩.

Here’s an example of finding paths in a small graph:

Example 11.21 (Finding paths)
Problem: Consider the following undirected graph: [graph on nodes A–H; figure omitted]
1. Is there a path from node H to node E?
2. Name three different paths from node D to node F. What is the length of each path?
Solution:
1. Yes; ⟨H, A, F, G, E⟩ is a path from node H to E.
2. The following sequences are paths from D to F:
• ⟨D, B, E, G, F⟩, which has length 4.
• ⟨D, B, C, E, G, F⟩, which has length 5.
Finding a third path might seem harder, but Definition 11.18 did not require that the nodes in a path be distinct from each other. (In other words, nothing forbade the repetition of nodes in a path.)
So a third path from D to F is:
• ⟨D, B, C, E, B, C, E, G, F⟩, which has length 8.

We will often restrict our attention to paths that never go back to a vertex that they’ve already visited, which are called simple paths:

Definition 11.19 (Simple Path)
A path ⟨u_1, u_2, . . . , u_k⟩ is simple if all of the nodes u_1, . . . , u_k are distinct.

Of the three paths identified in Example 11.21, the first two are simple paths, but the third path is not simple because it repeated nodes {B, C, E}.

11.3.1 Connectivity in Undirected Graphs

The most basic question about two nodes in a graph is whether it’s possible to get from one to another—that is, are these two nodes connected? We start with a formal definition of connectivity for undirected graphs, because the relevant notions are simpler in the undirected setting.

Definition 11.20 (Connected nodes and connected graphs)
Let G = ⟨V, E⟩ be an undirected graph.
• Two nodes u ∈ V and v ∈ V are connected if there exists a path from u to v.
• The graph G is connected if u and v are connected for any two nodes u ∈ V and v ∈ V.
• The graph G is called disconnected if it is not connected.

For example, Figure 11.22 shows one disconnected graph—there’s no path from A to H, for example—and one connected graph. You can check that the second graph is connected by testing all pairs of nodes. (Exercise 11.87 asks you to show that connectivity is symmetric in an undirected graph: if there exists a path from u to v, then there exists a path from v to u.)

Figure 11.22: A disconnected and connected undirected graph.

Example 11.22 (Connectivity of an undirected graph)
Problem: Is the following graph connected? [graph on nodes 1–8; figure omitted]

Problem-solving tip: Sometimes it’s very helpful to redraw a graph that you’re given, with nodes placed more meaningfully. For example, the graph from Example 11.22 can be redrawn as: [redrawn graph; figure omitted]

Solution: No: odd-numbered nodes have edges only to other odd-numbered nodes, and even-numbered nodes have edges only to other even-numbered nodes.
So there is no path from, for example, node 1 to node 2; this graph is disconnected.

(Problem-solving tip, continued: the redrawn version of the graph from Example 11.22 is obtained just by sliding the even-numbered nodes down. This visualization makes it clear that the graph is disconnected.)

Connected components

More generally, we will talk about the connected components of an undirected graph G = ⟨V, E⟩—“subsections” of the graph in which all pairs of nodes are connected.

Definition 11.21 (Connected component)
In an undirected graph G = ⟨V, E⟩, a connected component is a set C ⊆ V such that:
(i) any two nodes s ∈ C and t ∈ C are connected.
(ii) for any node x ∈ V − C, adding x to C would make (i) false.

A subset C ⊆ V of nodes is a connected component of an undirected graph G = ⟨V, E⟩ if, intuitively, it forms its own “section” of the graph: any two nodes in C are connected, and no node in C is connected to any node not in C. For example, Figure 11.23 shows a graph with three connected components—one with 4 nodes, one with 3 nodes, and one with just a single node. Note that we could have defined a “connected graph” in terms of the definition of connected components (instead of Definition 11.20): an undirected graph G = ⟨V, E⟩ is connected if it contains only one connected component, namely the entire node set V.

Figure 11.23: A graph’s connected components. (a) The original graph. (b) Component #1. (c) Component #2. (d) Component #3.

Example 11.23 (Connected components of an undirected graph)
Problem: What are the connected components of the following graph? [graph on nodes A–H; figure omitted]
Solution: The set S = {A, B, C, G, H} is a connected component; there are paths from every node u ∈ S to every node v ∈ S, and furthermore no node in S is connected to any node not in S.
To be thorough, here are paths connecting each pair of nodes from S:

         A      B               C            G            H
    A    ⟨A⟩    ⟨A, C, G, B⟩    ⟨A, C⟩       ⟨A, C, G⟩    ⟨A, C, G, H⟩
    B           ⟨B⟩             ⟨B, G, C⟩    ⟨B, G⟩       ⟨B, H⟩
    C                           ⟨C⟩          ⟨C, G⟩       ⟨C, G, H⟩
    G                                        ⟨G⟩          ⟨G, H⟩
    H                                                     ⟨H⟩

Note that we haven’t bothered to write down a path from u to v when we’d already recorded a path from v to u, because the graph is undirected and paths are symmetric. We also had many choices of paths for many of these entries: for example, other paths from B to H included ⟨B, G, H⟩ or ⟨B, G, H, B, G, H⟩.

There’s a second connected component in the graph: the nodes {D, E, F}. It’s easy to check that both clauses of Definition 11.21 are also satisfied for this set.

Observe that, in any undirected graph G = ⟨V, E⟩, there is a path from each node u ∈ V to itself. Namely, the path is ⟨u⟩, and it has length 0. Check Definition 11.18!

Taking it further: There are many computational settings in which undirected paths are relevant; here’s one example, in brief. In computer vision, we try to build algorithms to process—“understand,” even—images. For example, before it can decide how to react to them, a self-driving car must partition the image of the world from a front-facing camera into separate objects: painted lines on the road, trees, other cars, pedestrians, etc. Here’s a crude way to get started (real systems use far more sophisticated techniques): define a graph whose nodes are the image’s pixels; there is an edge between pixels p and p′ if (i) the two pixels are adjacent in the image, and (ii) the colors of p and p′ are within a threshold of acceptable difference. The connected components of this graph are a (very rough!) approximation to the “objects” in the image.

This description misses all sorts of crucial features of good algorithms for the image-segmentation problem, but even as stated it may be familiar from a different context: the “region fill” tool in image-manipulation software uses something very much like what we’ve just described.
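The crude pixel-graph idea just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real segmentation method: the function name segment, the tiny grayscale image, the choice of 4-neighbor adjacency, and the threshold value are all hypothetical. The sketch labels each pixel with the number of the connected component that contains it.

```python
from collections import deque

def segment(image, threshold):
    """Label the connected components of the pixel graph of a grayscale image.

    Assumptions (not from the text): image is a list of rows of integers, two
    pixels are "adjacent" if they touch horizontally or vertically, and their
    colors are "close" if the values differ by at most threshold.
    """
    rows, cols = len(image), len(image[0])
    label = [[None] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if label[r][c] is not None:
                continue  # this pixel already belongs to a component
            # Everything connected to (r, c) forms one new component.
            label[r][c] = next_label
            pending = deque([(r, c)])
            while pending:
                y, x = pending.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and label[ny][nx] is None
                            and abs(image[ny][nx] - image[y][x]) <= threshold):
                        label[ny][nx] = next_label
                        pending.append((ny, nx))
            next_label += 1
    return label, next_label

# A hypothetical 3-by-4 image with a dark region, a bright region, and a gray strip.
image = [
    [10, 12, 90, 91],
    [11, 13, 92, 90],
    [50, 50, 50, 93],
]
labels, count = segment(image, threshold=5)
for row in labels:
    print(row)
print(count, "components")
```

Each outer-loop iteration that finds an unlabeled pixel discovers exactly one connected component in the sense of Definition 11.21: the inner loop keeps adding pixels until no further pixel is connected to the ones already labeled.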
11.3.2 Connectivity in Directed Graphs

Recall that we have to follow edges “in the right direction” in a directed graph G: as in Definition 11.18, a path from u_1 to u_k in G is a sequence ⟨u_1, u_2, . . . , u_k⟩ where every pair ⟨u_i, u_{i+1}⟩ is an edge in G. Thus notions of connectivity in directed graphs are more complicated: the existence of a path from u to v does not imply the existence of a path from v to u. We will speak of a node t as being reachable from a node s if it’s possible to go from s to t, and of pairs of nodes as being strongly connected when it’s possible to “go in both directions” between them (see Definition 11.22).

For example, you can check that the first graph in Figure 11.24 is strongly connected by testing for directed paths between all pairs of nodes, in both directions. But the second graph in Figure 11.24 is not strongly connected: there’s no path from any node in the right-hand side (nodes {M, N, O, P}) to any node in the left-hand side (nodes {I, J, K, L}).

Figure 11.24: Two directed graphs, one that’s strongly connected and one that’s not.

Strongly connected components

As with undirected graphs, for a directed graph we will divide the graph into “sections”—subsets of the nodes—each of which is strongly connected. These sections are called strongly connected components of the graph (see Definition 11.23).

Figure 11.25 shows an example of a directed graph G and the three strongly connected components in G. The easiest strongly connected component to identify is {A, B, C, D}: we can go counterclockwise around the loop A → B → C → D → A, so we can go from any one of these four nodes to any other, and we can’t get from any of these four nodes to any of the other nodes. The other two strongly connected components are {E, F, H} and, separately, {G} on its own. The reason is that G is not strongly connected to any other node: we can’t get from G to any other node. (We can go around the E → F → H → E loop, so these three nodes are together in the other strongly connected component.)
Definition 11.22 (Reachability and strongly connected nodes/graphs)
Let G = ⟨V, E⟩ be a directed graph.
• A node u ∈ V is reachable from a node v ∈ V if there is a directed path from v to u.
• Two nodes u ∈ V and v ∈ V are strongly connected if u is reachable from v, and v is reachable from u.
• The graph G is strongly connected if every pair of nodes in V is strongly connected.

Definition 11.23 (Strongly connected component)
In a directed graph G = ⟨V, E⟩, a strongly connected component (SCC) is a set C ⊆ V such that:
(i) any two nodes s ∈ C and t ∈ C are strongly connected.
(ii) for any node x ∈ V − C, adding x to C would make (i) false.

Figure 11.25: A graph and its strongly connected components. (a) The original graph. (b) SCC #1. (c) SCC #2. (d) SCC #3.

Here’s another example of finding strongly connected components:

Example 11.24 (Finding strongly connected components)
Problem: What are the strongly connected components of the following graph? [directed graph on nodes A–F; figure omitted]
Solution: The three nodes {C, D, E} form a strongly connected component: there is a path from any one of them to any other of them (C → D → E → C → D → E · · ·), and furthermore there is no path from any of {C, D, E} to any other node in the graph.
In fact, every other node in the graph is alone in a strongly connected component by itself. For example, while there is a path from A to every node in the graph, there is no path from any other node to A. (There is a path from A to A, so the set {A} is a strongly connected component.) Thus the four strongly connected components of the graph are {A}, {B}, {F}, and {C, D, E}.

Here’s an example that shows why the second clause of Definition 11.23 is crucial:

Example 11.25 (A non-SCC)
Problem: In the following graph, the set S := {A, B, C, E, F} is not a strongly connected component. Why not?
[directed graph on nodes A–G; figure omitted]
Solution: It is indeed the case that there is a path in both directions between any two nodes in S: we can just keep “going around” clockwise in S and we eventually reach every other node in S. So S satisfies Definition 11.23(i). But it fails to satisfy Definition 11.23(ii): if we considered the set S+ := S ∪ {D}, it is still the case that there is a path in both directions between any nodes in S+. Thus S is not a strongly connected component!
On the other hand, S+ = {A, B, C, D, E, F} is a strongly connected component: we can’t add any other node (specifically G; it’s the only other node) to S+ without falsifying this property—because there’s no path from G to A, for example. Thus the two strongly connected components are {A, B, C, D, E, F} and {G}.

Taking it further: There are many computational settings in which directed paths, reachability, and strongly connected components are relevant. For example, for a spreadsheet, consider a directed graph whose nodes are the spreadsheet’s cells, and an edge ⟨u, v⟩ indicates that u’s contents affect the contents of cell v; when a user changes the content of cell c, we must update all cells that are reachable from node c. For a chess-playing program, consider a directed graph whose nodes are board configurations, and there’s an edge ⟨u, v⟩ if a legal move in u can result in v; any configuration u that’s unreachable from the starting board configuration can never occur in chess, and thus your program doesn’t have to bother evaluating what move to make in position u.
See p. 1142 for a discussion of another application of reachability and strongly connected components: the structure of the world-wide web, understood with respect to the directed paths in the graph defined by the pages and the hyperlinks of the web.

11.3.3 Shortest Paths and Distance

So far we have concentrated on the basic question of connectivity: for a given pair of nodes, does any path exist from one node to the other?
Here we address a more refined question: what is the shortest path that goes from one node to the other? (Recall that the length of a path ⟨u_1, u_2, . . . , u_k⟩ is k − 1, the number of edges that it traverses.)

Definition 11.24 (Shortest Paths)
Let G = ⟨V, E⟩ be a graph (undirected or directed), and let s ∈ V and t ∈ V be two nodes. A path from s to t is a shortest path if its length is the smallest out of all s-to-t paths.

Definition 11.25 (Distance)
The distance from s to t is the length of a shortest path from s to t. If there is no path from s to t, then we say that the distance from s to t is infinite (written as “∞”).

Observe that there may be more than one shortest path from a node s to a node t, if there are multiple paths that are tied in length. For example, consider the undirected graph in Figure 11.26. We have the following distances from node A in this graph:

    node:              A   B   C   D   E   F
    distance from A:   0   1   2   2   1   1

The distance from A to A is 0 because ⟨A⟩ is a path from A to A. This graph also has an example of a pair of nodes connected by two different shortest paths, going from A to C (via either B or E).

For the directed graph in Figure 11.27, we have the following distances from node G:

    node:              G   H   I   J   K   L
    distance from G:   0   1   2   3   1   ∞

Again, there’s a path from G to G of length zero, so the distance from G to G is 0. Note that there’s no G-to-J path of length two (because the edge from J to K goes in the wrong direction), so the distance from G to J is 3 (via K and I, or via H and I). Similarly, there is no directed path from G to L, so the distance is infinite.

Figure 11.26: An undirected graph.
Figure 11.27: A directed graph.

Here’s another example of finding shortest paths in a small graph:
Example 11.26 (Shortest paths in directed graphs)
Problem: Find the shortest path from A to L in the graph with this adjacency list:
    A: B, D, E, F, G
    B: C, D, I
    C: B, D, I
    D: E
    E: A, F
    F:
    G: F
    H: E, F
    I: B, H, K
    J: C, K
    K: L
    L: F
Solution: The nodes at distance 1 from A are B, D, E, F, and G. There’s no edge from any of those nodes to L—or indeed to K, which is L’s only in-neighbor. Thus the distance from A to L cannot be any smaller than 4. But there is an edge from I to K, and one from B to I. We can assemble these edges into the path ⟨A, B, I, K, L⟩. This path has length 4. So the distance from A to L is 4. (Drawing the graph, as on the right, with nodes arranged by their distance from A, can make these facts easier to see.)

11.3.4 Finding Paths: Breadth-First Search (BFS)

There are many aspects of graphs that are valuable for interesting computational applications, but perhaps the single most important graph algorithm is breadth-first search (BFS). BFS is a path-finding algorithm: it explores outward from a given source node s in a given graph G until it finds every node reachable from s in G. BFS can be used to solve all sorts of graph-related problems, as we’ll see.

Here’s the intuition of the algorithm. (See Figure 11.28.) We maintain a set L of nodes that are reachable from the given node s (the shaded nodes in Figure 11.28). To start, we set L := {s}. Now we find all as-yet-undiscovered neighbors of nodes in L, and add those nodes (the dark-shaded nodes in Figure 11.28) to L: if ⟨u, v⟩ ∈ E and you can reach the node u from s, then you can also reach v from s, via u. But now we’ve found some more nodes that can be reached from s, which means that we can also reach, from s, any nodes that are directly connected to them. So we’ll repeat that process with the updated list L. And we’ll do it again, and again, and again, until we stop finding new nodes.
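The round-by-round intuition above can be sketched directly in Python. This is a minimal illustration rather than the book’s pseudocode: the function name bfs_layers is hypothetical, and recording a distance for each discovered node is an addition made here for illustration. The graph is the adjacency list from Example 11.26.

```python
def bfs_layers(graph, s):
    """Return a dict mapping each node reachable from s to its distance from s."""
    distance = {s: 0}   # plays the role of the set L, with a distance per node
    layer = [s]         # the nodes discovered in the most recent round
    while layer:
        next_layer = []
        for u in layer:
            for v in graph[u]:
                if v not in distance:   # v is as yet undiscovered
                    distance[v] = distance[u] + 1
                    next_layer.append(v)
        layer = next_layer              # repeat with the newly found nodes
    return distance

# The directed graph of Example 11.26, as out-neighbor lists.
graph = {
    "A": ["B", "D", "E", "F", "G"],
    "B": ["C", "D", "I"],
    "C": ["B", "D", "I"],
    "D": ["E"],
    "E": ["A", "F"],
    "F": [],
    "G": ["F"],
    "H": ["E", "F"],
    "I": ["B", "H", "K"],
    "J": ["C", "K"],
    "K": ["L"],
    "L": ["F"],
}
distance = bfs_layers(graph, "A")
print(distance["L"])       # the distance from A to L found in Example 11.26
print("J" in distance)     # J is never discovered: it is unreachable from A
```

The loop stops as soon as a round discovers nothing new, which matches the “until we stop finding new nodes” condition in the text.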
Problem-solving tip: In solving any graph problem with a small graph, a good first move is to draw the graph.

Figure 11.28: The intuition of breadth-first search: the steps of BFS on a small graph, starting at node A.

Observe that BFS discovers nodes in order of their distance from the source node. Every expansion of L takes the full breadth of the frontier and expands it out by one more “layer” in the graph. (That’s why the algorithm is called breadth-first search.) You can think of BFS as throwing a pebble onto the graph at the node s, and then watching the ripples expanding out from s.

Breadth-first search is presented more formally in Figure 11.29. (While we’ve described BFS in terms of undirected graphs for simplicity, it works equally well for directed graphs. The only change is that Line 6 should say “for every out-neighbor” for a directed graph.)

Here’s another example of breadth-first search in action, running the algorithm in full detail (precisely as specified in Figure 11.29):

Example 11.27 (Sample run of BFS, in detail)
We’ll trace BFS starting at node A in the following graph (shown here in the form of a picture and as an adjacency list): [graph on nodes A–H; figure omitted]

Figure 11.29: The pseudocode for breadth-first search.

Breadth-First Search (BFS):
Input: a graph G = ⟨V, E⟩ and a source node s ∈ V
Output: the set of nodes reachable from s in G
 1: Frontier := ⟨s⟩    // Frontier will be a list of nodes to process, in order.
 2: Known := ∅         // Known will be the set of already-processed nodes.
 3: while Frontier is nonempty:
 4:     u := the first node in Frontier
 5:     remove u from Frontier
 6:     for every neighbor v of u:
 7:         if v is in neither Frontier nor Known then
 8:             add v to the end of Frontier
 9:     add u to Known
10: return Known

The adjacency list of the graph for Example 11.27:
    A: B, C
    B: A, C, G
    C: A, B, E, F, G
    D: H
    E: C
    F: C, G
    G: B, C, F
    H: D

    Known                 Frontier      Explanation
    {}                    ⟨A⟩           initialization (Lines 1–2)
    {A}                   ⟨B, C⟩        processing A (Lines 4–9)
    {A, B}                ⟨C, G⟩        processing B (Lines 4–9)
    {A, B, C}             ⟨G, E, F⟩     processing C (Lines 4–9)
    {A, B, C, G}          ⟨E, F⟩        processing G (Lines 4–9)
    {A, B, C, G, E}       ⟨F⟩           processing E (Lines 4–9)
    {A, B, C, G, E, F}    ⟨⟩            processing F (Lines 4–9)

Because Frontier is now empty, the while loop in BFS terminates. The algorithm returns the set Known, {A, B, C, G, E, F}.

Correctness of BFS

We’ll prove two important properties of BFS. The first is correctness: the set that BFS returns is precisely those nodes that are reachable from the starting node. The second is efficiency: BFS finds this set quickly. The first claim might seem obvious—and thus proving it may feel annoyingly pedantic—but there’s a bit of subtlety to the argument, and it’s good practice at using induction in proofs besides.

Proof (of Theorem 11.3). We’ll prove the result by showing two set inclusions: the discovered nodes form a subset of the reachable nodes, and the reachable nodes form a subset of the discovered nodes. Both proofs will use induction, though on different quantities.

Claim #1: BFS(G, s) ⊆ {t ∈ V : t is reachable from s in G}. By inspection, we see that (i) BFS returns the set of nodes that end up in the Known set, and (ii) the only way that a node ends up in Known is having previously been in Frontier.
Thus it will suffice to prove the following property for all k ≥ 0, by strong induction on k:
    Q(k) := if a node t ∈ V is added to the list Frontier during the kth iteration of the while loop of BFS, then there is a path from s to t.

Base case (k = 0): If the node t was added to Frontier during the 0th iteration of the while loop—that is, before the while loop begins—then t was added in Line 1 of BFS. Therefore t is actually the node s itself. There is a path from s to s itself in any graph, and thus Q(0) holds.

Inductive case (k ≥ 1): We assume the inductive hypotheses Q(0), . . . , Q(k − 1), and we must prove Q(k). Consider a node t that was added to Frontier during the kth iteration of the while loop—in other words, t was added in the for loop (Lines 6–8) because t is a neighbor of some node u that was already in Frontier. That is, we know that ⟨u, t⟩ ∈ E and that u was added to Frontier in the (k′)th iteration, for some k′ < k. By the inductive hypothesis Q(k′), there is a path P from s to u. Therefore there is a path from s to t, too: follow the edges of P from s to u, and then follow the edge ⟨u, t⟩.

Claim #2: BFS(G, s) ⊇ {t ∈ V : t is reachable from s in G}. If a node t is reachable from s in G, then by definition the distance from s to t is some integer d ≥ 0. Furthermore, by inspection of the algorithm, we see that any node that’s added to Frontier is eventually moved to Known. Thus it will suffice to prove the following property for all d ≥ 0, by (weak) induction on d:
    R(d) := if a node t ∈ V is at distance d from s, then t is eventually added to Frontier.

Base case (d = 0): We must prove R(0): any node t at distance 0 is eventually added to Frontier. But the only node at distance 0 from s is s itself, and BFS adds s itself to Frontier in Line 1 of the algorithm.

Problem-solving tip: The hard part here is figuring out on what quantity to do induction. One way to approach this question is to figure out a recursive way of stating the correctness claim. Q: why is there a path to every node added to Frontier?
(A: there was a path to every previous node in Frontier, and there’s an edge from some previously added node to this one!) Q: why is every node u reachable from s eventually added to Frontier? (A: because a neighbor of u that’s closer to s is eventually added to Frontier, and every neighbor of a node in Frontier is eventually added to Frontier!)

Theorem 11.3 (Correctness of BFS)
Let G = ⟨V, E⟩ be any graph, and let s ∈ V be an arbitrary node. Then the set of nodes discovered by BFS(G, s) is exactly {t ∈ V : t is reachable from s in G}.

Inductive case (d ≥ 1): We assume the inductive hypothesis R(d − 1), and we must prove R(d). Let t be a node at distance d from s. Then by definition of distance there is a shortest path P of length d from s to t. Let u be the node immediately before t in P. Then the distance from s to u must be d − 1, and therefore by the inductive hypothesis R(d − 1) the node u is added to Frontier in some iteration of the while loop. There are at most |V| iterations of the loop, and thus eventually u is the first node in Frontier. In that iteration, the node t is added to Frontier (if it had not already been added). Thus R(d) follows.

(In the exercises, you’ll show how to modify BFS so that it actually computes distances from s, using an idea very similar to the proof of Claim #2 of Theorem 11.3.)

Running time of BFS

Proof (of Theorem 11.4). See Figure 11.30 for a reminder of the algorithm. Lines 1, 2, and 10 take Θ(1) time, so the only question is how long the while loop takes. In the worst case, every node in the graph is reachable from the node from which BFS is run. In this case, there is one iteration of the while loop for every node u ∈ V. How long does the body of the while loop (Lines 4–9) take for a particular node u?
• Lines 4, 5, and 9 take Θ(1) time.
• The for loop in Lines 6–8 has one iteration for each neighbor of u. (In an adjacency list, the loop simply steps through the list of neighbors, one by one.)
Each for-loop iteration takes Θ(1) time, and there are degree(u) iterations for node u. Therefore, ignoring multiplicative constants, the worst-case running time of BFS is 1 + ∑ 􏰖1 + degree(u)􏰗 Figure 11.30: A reminder of BFS. 11.3. PATHS,CONNECTIVITY,ANDDISTANCES 1139 Theorem 11.4 (Efficiency of BFS) For a graph G = ⟨V, E⟩ represented using an adjacency list, BFS takes Θ(|V| + |E|) time. Breadth-First Search (BFS): Input: agraphG=⟨V,E⟩andasourcenodes∈V Output: the set of nodes reachable from s in G 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: Frontier := ⟨s⟩ Known := ∅ while Frontier is nonempty: u := the first node in Frontier remove u from Frontier for every neighbor v of u: if v is in neither Frontier nor Known then add v to the end of Frontier add u to Known return Known u∈V = 1 + 􏰖 ∑ 1􏰗 + 􏰖 ∑ degree(u)􏰗 u∈V u∈V = 1 + |V| + 2|E| or 1 + |V| + |E| for a directed graph = Θ(|V| + |E|). rearranging the summation Theorem 11.1/Exercise 11.18 Taking it further: BFS arises in applications throughout computer science, from network routing to arti- ficial intelligence. Another application of BFS occurs (hidden from your view) as you use programming languages like Python and Java, through a language feature called garbage collection. In garbage-collected languages, when you as a programmer are done using whatever data you’ve stored in some chunk of memory, you just “drop it on the floor”; the “garbage collector” comes along to reclaim that memory for other use in the future of your program. The garbage collector runs BFS-like algorithms to determine whether a particular piece of memory is actually trash. See p. 1143 for more. 1140 CHAPTER 11. GRAPHS AND TREES 11.3.5 Finding Paths: Depth-First Search (DFS) Another important algorithm for exploring graphs is called depth-first search (DFS), which can be described informally as follows. Instead of exploring outward from the source node s in “layers” as in BFS, we will try to explore a new node at every stage of the search. 
We start at s, and at every stage we move to an unvisited neighbor of our current node. If at any stage we’re stuck at a node u that has no unvisited neighbors, we go back from u to the node from which we first reached u and continue exploring from there. Here is an example of DFS in a small graph, informally:

Example 11.28 (Sample run of depth-first search)
We start exploring node A; in each frame, the dark-shaded node is the current node. Previously discovered nodes are lightly shaded. Arrows indicate the steps of the exploration.

In each of the first four frames, we move from the current node to a neighbor that is unexplored. (We pick the alphabetically first node if there’s a choice.) The current node E has no unvisited neighbors, so we backtrack from E to D to find D’s unvisited neighbor F. We backtrack from F to D to B to discover the new node C. We backtrack from C to B to A; there are no further unexplored nodes from any of these nodes, and thus the algorithm terminates.

Intuitively, depth-first search is a close match for the way that you would explore a maze: you start at the entrance, follow a passageway to a location you’ve never visited before; using breadcrumbs or a pencil, you remember where you’ve been and backtrack if you get stuck. You may have heard of another algorithm for mazes:

Place your right hand on the wall as you go in the entrance. Continue to walk forward, always keeping your right hand on the wall. Eventually, you will get out of the maze.

In fact, this right-hand-on-the-wall algorithm is identical in spirit to DFS: whenever you encounter a choice, you always choose the first (right-most) unexplored passageway, and if you ever get stuck at a dead end you turn around and go back from whence you came.
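The informal description above (move to an unvisited neighbor, and backtrack when stuck) can be sketched recursively. This is only a minimal illustration, not the book's pseudocode: the name `dfs_order`, the dictionary-of-lists graph representation, and the small example graph are all assumptions made here, and neighbors are tried in alphabetical order to mirror the tie-breaking rule of Example 11.28.

```python
def dfs_order(graph, source):
    """Return the nodes in the order a depth-first exploration first visits them.

    `graph` maps each node to a list of its neighbors (an assumed
    representation chosen for this sketch).
    """
    visited = []        # nodes, in order of first visit
    seen = {source}     # nodes discovered so far

    def explore(u):
        visited.append(u)
        for v in sorted(graph[u]):   # alphabetically first neighbor first
            if v not in seen:
                seen.add(v)
                explore(v)           # move forward to an unvisited neighbor
        # Leaving the loop means u has no unvisited neighbors left:
        # returning from this call is the "backtrack" step.

    explore(source)
    return visited


# A hypothetical undirected graph (not the one pictured in Example 11.28).
example = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],            # not reachable from A
}

print(dfs_order(example, "A"))   # ['A', 'B', 'D', 'C', 'E']
```

In this run the search goes A, B, D, C, then backtracks from C to D to reach E; node F is never visited because no path leads to it. For very deep graphs a recursive sketch like this can hit Python's recursion limit, which is one reason an explicit stack is often preferred.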
[Figure for Example 11.28: seven frames showing the exploration on a graph with nodes A through H.]

We can implement DFS with only a small change to BFS, as shown in Figure 11.31: instead of putting a newly discovered node u at the end of the list Frontier of nodes from which to explore (as in BFS), we put a newly discovered node u at the beginning of Frontier. (In other words, BFS treats the list Frontier as a queue, first in and first out, while DFS treats the list Frontier as a stack, last in and first out.) Another small change is necessary, to allow a node already in Frontier to be “moved” earlier in the list of nodes to explore.

Because this alteration of BFS changes only the order in which the nodes in Frontier are explored, DFS does precisely the same work as BFS, and is correct for the same reasons: DFS returns precisely the set of nodes reachable from the given source node s. (With a little more cleverness in moving nodes to the front of Frontier, DFS can also be implemented in Θ(|V| + |E|) time.)

Depth-First Search (DFS):
Input: a graph G = ⟨V, E⟩ and a source node s ∈ V
Output: the set of nodes reachable from s in G
 1: Frontier := ⟨s⟩
 2: Known := ∅
 3: while Frontier is nonempty:
 4:     u := the first node in Frontier
 5:     remove u from Frontier
 6:     if u is not in Known then
 7:         for every neighbor v of u:
 8:             if v is not in Known then
 9:                 add v to the start of Frontier
10:         add u to Known
11: return Known
Figure 11.31: The pseudocode for depth-first search. The only changes from BFS are in Lines 6, 8, and 9.

Here’s a fully detailed example of DFS:

Example 11.29 (Sample run of DFS, in detail)
We’ll trace DFS starting at node A in this graph, given here by its adjacency list:

A: B, C
B: A, G
C: A, E
D: H
E: C, F, G
F: E, G
G: B, E, F
H: D

Known                   Frontier         Explanation
{}                      ⟨A⟩              initialization
{A}                     ⟨B, C⟩           processing A
{A, B}                  ⟨G, C⟩           processing B (A known ⇒ not re-added)
{A, B, G}               ⟨E, F, C⟩        processing G (B known ⇒ not re-added)
{A, B, G, E}            ⟨C, F, F, C⟩     processing E (G known ⇒ not re-added)
{A, B, G, E, C}         ⟨F, F, C⟩        processing C (A, E known ⇒ not re-added)
{A, B, G, E, C, F}      ⟨F, C⟩           processing F

There are two more iterations that remove the last two entries in Frontier (making no changes to Known and adding nothing further to Frontier), because both F and C are already in Known. The while loop then terminates, and DFS returns {A, B, G, E, C, F}.

Computer Science Connections
The Bowtie Structure of the Web

As the web has grown more and more central in the daily lives of us all, it has garnered increasing attention from researchers in computer science. A great deal of work has been performed to characterize the web in terms of its degree distribution (see p. 1123) or in terms of the “small-world phenomenon” (see p. 438). But one foundational and influential paper sought to characterize the web’s structure in terms of its strongly connected components.9 In the early days of the web, eight researchers from AltaVista, IBM, and Compaq downloaded around 200 million web pages, comprising about 1.5 billion links. They then analyzed the structure of the resulting graph, by categorizing the pages:

1. Let core denote those web pages contained in the largest SCC of the web graph. Like many other networks (for example, social networks and collaboration networks), the web graph has a giant component that contains many more nodes than the second-largest SCC. Denote by core those nodes in the largest SCC in the web graph.
2. Let in denote those web pages p such that (i) p ∉ core, and (ii) there is a path from p to some node in core.
That is, there is a path from p to every page in core, but there’s no path from any node in core to p.
3. Let out denote those web pages p such that (i) p ∉ core, and (ii) there is a path from some node in core to p. That is, there is a path from every page in core to page p, but there’s no path from p to any node in core.

When displayed graphically, as in Figure 11.32, these categories of web pages look like a bowtie, and so the paper by Broder et al. came to be known as “the bowtie paper.” To complete the picture of the bowtie structure of the web, we must note that not all web pages are included in Figure 11.32. There are three further categories of nodes:

4. Let tubes denote those pages p that (i) are reachable from a node of in (that is, there’s a page q ∈ in that has a path to p), and (ii) can reach a node of out (that is, there’s a page q ∈ out to which p has a path), and (iii) p ∉ core.
5. Let tendrils denote those pages p that are either reachable from a node of in, or can reach a node of out, but not both.
6. Let disconnected denote those pages p that are not in core, in, out, tubes, or tendrils; that is, those pages p that can neither reach nor be reached by any node in those sets.

One of the unexpected facts found by Broder et al. was the extent to which the web is actually not particularly well connected. In particular, if we were to choose web pages p and q uniformly at random from the web graph, there was only a roughly 24% chance that a directed path from p to q exists, which is far lower than the “small world” phenomenon would suggest.

9 Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. Computer Networks, 33(1–6):309–320, 2000.

Figure 11.32: The “bowtie structure” of the web graph, in its basic form, with in on the left, core in the middle, and out on the right. Broder et al. found that roughly 25% of web pages fell into each of these categories: 56M pages (of 200M) in core, 43M pages in in, and 43M pages in out.
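Because core, in, and out are all defined through reachability, they can be computed with two searches from any single page in core: one following links forward, and one following them backward. The sketch below is a hedged illustration rather than the method of the bowtie paper: the function names, the dictionary-of-lists representation, and the six-page toy graph are assumptions, and the caller is assumed to supply a page already known to lie in core (finding the largest SCC is not shown).

```python
from collections import deque


def reachable(graph, start):
    """Return the set of nodes reachable from `start`, found by BFS."""
    known = {start}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        for v in graph[u]:
            if v not in known:
                known.add(v)
                frontier.append(v)
    return known


def reverse(graph):
    """Return a copy of the directed graph with every edge flipped."""
    flipped = {u: [] for u in graph}
    for u, targets in graph.items():
        for v in targets:
            flipped[v].append(u)
    return flipped


def bowtie(graph, core_node):
    """Split pages into core, in, out, and everything else.

    `core_node` is assumed (by the caller) to be a page in the core SCC.
    """
    forward = reachable(graph, core_node)             # pages core can reach
    backward = reachable(reverse(graph), core_node)   # pages that can reach core
    core = forward & backward
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": set(graph) - forward - backward,     # tubes, tendrils, disconnected
    }


# Hypothetical toy web graph: every page appears as a key.
toy = {
    "i1": ["c1", "t1"],   # links into the core, plus a tendril
    "c1": ["c2"],
    "c2": ["c1", "o1"],
    "o1": [],
    "t1": [],
    "d1": [],             # disconnected page
}

print(bowtie(toy, "c1"))
```

On the toy graph this puts c1 and c2 in core, i1 in in, o1 in out, and leaves t1 and d1 in the remaining bucket. Separating that bucket into tubes, tendrils, and disconnected would take further reachability queries from the in and out sets.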
Figure 11.33: The remainder of the “bowtie structure” of the web graph, showing tendrils, tubes, and disconnected alongside in, core, and out. There were about 44M pages in tendrils and tubes, and about 17M pages in disconnected.

Computer Science Connections
Garbage Collection

[Figure 11.34, margin text:] Suppose that Node(data,next) creates a new node for a singly linked list, with data data and with a pointer next to the next node in the list. Imagine executing the five numbered lines of code in Figure 11.34. Then the state of memory after executing lines 1–4 is the list L → 2 → 3 → 5 → 7. But when we execute line 5, the state of memory becomes L → 2 → 5 → 7, with the node holding 3 left behind. The node with data = 3 is garbage now: there is no way to access that memory again, because there is no way for the programmer to refer to it.

In many modern programming languages, including Python and Java, the burden of managing memory is lifted from the shoulders of the programmer. When a new object is needed, the programmer just creates it. After a program has been running for a while, there may be objects that were stored in memory but are now inaccessible because the programmer has no way to refer to them ever again. This stored but inaccessible data is called garbage. Figure 11.34 shows an example of garbage being created. In Python- and Java-like languages, the system provides a garbage collector that periodically runs to clean up the garbage, which allows that memory to be reused for future allocations. (In contrast, in languages like C or C++, when you as a programmer are done using a chunk of memory, it’s your responsibility to declare to the system that you’re done using that memory by explicitly “deallocating” or “freeing” it.)

There are many sophisticated garbage-collection algorithms that are employed in real systems, but fundamentally the algorithmic idea is based on finding reachable nodes in a graph.
There is a root set of memory locations that are reachable: essentially every variable that’s defined in any currently active function call on the stack. Furthermore, if a memory location l is pointed to by a reachable memory location, then l too is reachable.

Two simpler algorithms that are sometimes used in garbage collection are based on some corresponding simple graph-theoretic approaches. Here’s a brief description of these two garbage-collection algorithms:10

Reference counting: For each block b of memory, we maintain a reference count of the number of other blocks of memory (or root set variables) that refer to b. When the garbage collector runs, any block b that has a reference count equal to 0 is marked as garbage and reclaimed for future use.

Mark-and-sweep: When the garbage collector runs, we iteratively mark each block b that is accessible. Specifically, for every variable v in the root set, we mark the block to which v refers. Then, for any block b that is marked, we also mark any block to which b refers. Once the marking process is completed, we sweep through memory, and reclaim all unmarked blocks.

In graph-theoretic terms, we view memory as a directed graph, with an edge from each block b to the block(s) to which b refers. Reference counting declares as garbage any node with in-degree 0; mark-and-sweep declares as garbage any node that is not reached by BFS starting from the root set.

Reference counting is a simpler algorithm, but it has a problem with cyclical structures. If two inaccessible blocks of memory refer to each other, they both have nonzero reference count, and therefore won’t be marked as garbage. An example is shown in Figure 11.35. There are issues of efficiency with mark-and-sweep (the entire system has to pause while the garbage collector runs), and so other, more sophisticated algorithms are generally used in real systems.

Figure 11.34: Garbage being created.
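The difference between the two approaches shows up as soon as unreachable blocks point at each other. The sketch below follows the graph-theoretic descriptions above (in-degree 0 for reference counting; not reached by BFS from the root set for mark-and-sweep). It is an illustration under stated assumptions, not how any real collector is implemented: the function names, the dictionary that maps each block to the blocks it refers to, and the five-block memory layout are all hypothetical.

```python
from collections import deque


def reference_counting_garbage(refs, roots):
    """Blocks with reference count 0, counting references from blocks and roots.

    `refs` maps each block to the list of blocks it refers to; `roots` lists
    the blocks that root-set variables refer to. Both are assumed inputs.
    """
    count = {block: 0 for block in refs}
    for targets in refs.values():
        for t in targets:
            count[t] += 1
    for t in roots:
        count[t] += 1
    return {block for block, c in count.items() if c == 0}


def mark_and_sweep_garbage(refs, roots):
    """Blocks that are not reached by BFS starting from the root set."""
    marked = set(roots)
    frontier = deque(roots)
    while frontier:                      # mark phase
        block = frontier.popleft()
        for t in refs[block]:
            if t not in marked:
                marked.add(t)
                frontier.append(t)
    return set(refs) - marked            # sweep phase: everything unmarked


# Hypothetical memory: one root variable refers to block 1.
# Blocks 3 and 4 refer to each other but are unreachable from the root.
memory = {1: [2], 2: [], 3: [4], 4: [3], 5: []}
root_targets = [1]

print(reference_counting_garbage(memory, root_targets))   # {5}
print(mark_and_sweep_garbage(memory, root_targets))       # {3, 4, 5}
```

Blocks 3 and 4 each have reference count 1 because of the cycle between them, so reference counting never reclaims them, while mark-and-sweep reclaims both along with block 5.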
You can learn more about garbage collection in any good textbook on programming languages. A few of these are:

10 Michael L. Scott. Programming Language Pragmatics. Morgan Kaufmann Publishers, 3rd edition, 2009; and Kenneth C. Louden and Kenneth A. Lambert. Programming Languages: Principles and Practices. Course Technology, 3rd edition, 2011.

Figure 11.35: A memory diagram with six blocks of memory, and two root set variables x and y. Reference counting would show block #6 with a reference count of zero, and therefore it would be reclaimed. Mark-and-sweep would mark blocks #1, #4, and #5; thus it would reclaim blocks #2, #3, and #6.

The code for Figure 11.34:
1: L = Node(7,NULL)
2: L = Node(5,L)
3: L = Node(3,L)
4: L = Node(2,L)
5: L.next = L.next.next

11.3.6 Exercises

[Panels of Figure 11.36: (a) and (b) are drawings of graphs on nodes A through H; (e) is an adjacency matrix with rows and columns A through H; (c) and (d) are the following adjacency lists.]

(c)  A: B, E, F    B: A       C: D          D: C, F    E: A       F: A, D
(d)  A: B, C       B: A, C    C: A, B, F    D: E       E: B, D    F: C

For the graphs defined in Figure 11.36, identify the following specified objects (or indicate why no such thing exists):
11.72 a path from D to B in Figure 11.36(a)
11.73 two different paths from C to H in Figure 11.36(a)
11.74 a path from C to B in Figure 11.36(b)
11.75 two different paths from A to H in Figure 11.36(b)
11.76 a path from D to H in Figure 11.36(b) that is not simple.
11.77 a path from B to C in the graph defined by the adjacency list in Figure 11.36(c)
11.78 a shortest path from B to F in Figure 11.36(d)
11.79 a non-shortest path from B to C in the graph defined by the adjacency matrix in Figure 11.36(e)
11.80 all nodes reachable from A in Figure 11.36(d)
11.81 all nodes reachable from A in Figure 11.36(e)

Which of these graphs are (strongly) connected? Explain your answers. Identify all of the connected components for the undirected graphs, and all of the strongly connected components for the directed graphs.
11.82 Figure 11.36(a)
11.83 Figure 11.36(b) (strong connectivity)
11.84 Figure 11.36(c)
11.85 Figure 11.36(d) (strong connectivity)
11.86 Figure 11.36(e)

Figure 11.36: Several graphs.

Let G = ⟨V, E⟩ be an undirected graph, and let s ∈ V and t ∈ V be any two nodes in G. Prove the following:
11.87 If there’s a path of length k from s to t, then there’s a path of length k from t to s.
11.88 Every shortest path between s and t is a simple path.

For a directed graph G = ⟨V, E⟩, the diameter of G is the largest node-to-node distance in the graph. That is,

diameter(G) = max_{s∈V, t∈V} d(s, t),

where d(s, t) denotes the length of the shortest path from node s to node t in G. (Although the context is different, our version of “diameter” matches the idea from geometry: the diameter of a circle is the distance between the two points in the circle that are farthest apart. That’s still true for a graph.) Prove your answers:
11.89 In terms of n, what is the smallest diameter that an n-node undirected graph can have?
11.90 In terms of n, what is the largest diameter that a connected n-node undirected graph can have? Give an example of a graph where the diameter is this large. (In other words, assuming that G is connected, what’s the largest possible distance between two nodes in G? Note that, without the restriction that the graph be connected, the answer would be ∞.)

Consider an n-node 3-regular undirected graph G. (That is, we’re considering a graph G = ⟨V, E⟩ with |V| = n, where each node u ∈ V has degree exactly equal to 3.) In terms of n:
11.91 What is the largest possible number of connected components in a 3-regular graph?
11.92 What is the smallest possible number of connected components in a 3-regular graph?
11.93 Describe a connected 3-regular graph with n nodes with a diameter that’s at least n/8.
11.94 Describe a connected 3-regular graph with n nodes with a diameter that’s at most 8 log n.

11.95 Prove or disprove: let G = ⟨L ∪ R, E⟩ be a bipartite graph with |L| = |R|. Suppose that every node in the graph (that is, all nodes in L and R) has at least one neighbor. Then the graph is connected.

Consider an undirected graph G. Recall that a simple path from s to t in G is a path that does not go through any node more than once. A Hamiltonian path from s to t in G is a path from s to t that goes through each node of G precisely once. In general, finding Hamiltonian paths in a graph is believed to be computationally very difficult. But there are some specific graphs in which it’s easy to find one.
11.96 Find a Hamiltonian path in the Petersen graph.
11.97 Let Kn be a complete graph, and let s and t be two distinct nodes in the graph. How many different Hamiltonian paths are there from s to t?
11.98 Let Kn,m be a complete bipartite graph with n + m nodes, and let s and t be two distinct nodes in the graph. How many different Hamiltonian paths are there from s to t? (Careful; your answer may depend on s and t.)

The diameter of an undirected graph G = ⟨V, E⟩ is defined as the maximum distance between any two nodes s ∈ V and t ∈ V. (See Exercises 11.89 and 11.90.) The maximum distance is one measure of how far a graph “sprawls,” but another way of measuring this idea is by looking at the average distance instead. That is, for a pair of distinct nodes ⟨s, t⟩ chosen uniformly at random from the set V, what’s the distance from s to t, on average? Formally, the average distance of a graph G = ⟨V, E⟩ is defined as

the average distance of G = (∑_{s∈V} ∑_{t∈V: t≠s} distance(s, t)) / (n(n − 1)).

(There are n(n − 1) ordered pairs of distinct nodes.) Often the average distance is a bit harder to calculate than the maximum distance, but in the next few exercises you’ll look at the average distance for a pair of simple graphs.
11.99 Consider an n-node cycle, where n is odd.
(We’ll see a formal definition of a cycle in Section 11.4, but for now just look at the 15-node example in Figure 11.37(a).) Compute the average distance in this n-node graph. (Hint: every node is positioned symmetrically, so you can just figure out the average distance from some particular node u.)
11.100 What is the average distance for an n-node cycle where n is even? (See the 16-node example in Figure 11.37(b).)
11.101 What is the average distance for an n-node path? (See the 15-node example in Figure 11.37(c).) (Hint: for any particular integer k, how many pairs of nodes have distance k? Then simplify the summation.)
11.102 (programming required) Write a program, in a language of your choice, to verify your answers to the last three exercises: build a graph of the appropriate size and structure, sum all of the node-to-node distances, and compute their average.

Figure 11.37: Three graphs. (a) A 15-node cycle. (b) A 16-node cycle. (c) A 15-node path.

Suppose that G is an undirected graph with n nodes. Answer the following questions in terms of n:
11.103 If G is disconnected, what is the largest possible number of edges that G can contain?
11.104 If G is connected, what is the smallest possible number of edges that G can contain?

Suppose that G is a directed graph with n nodes. Answer the following questions in terms of n:
11.105 If G is strongly connected, what is the smallest number of edges that G can contain?
11.106 If every node of G is in its own strongly connected component (that is, there are n different SCCs, one per node), what is the largest number of edges that G can contain?

Hamiltonian paths are named after William Rowan Hamilton, a 19th-century Irish mathematician/physicist.

[Figure for Exercise 11.96: the Petersen graph, with nodes labeled A through J.]

A metric on a set V is a function d : V × V → R≥0 that obeys the following conditions (see Exercise 4.6 for more):
• reflexivity: for any u ∈ V and v ∈ V, we have d(u, u) = 0 and d(u, v) ≠ 0 whenever u ≠ v.
• symmetry: for any u ∈ V and v ∈ V, we have d(u, v) = d(v, u).
• triangle inequality: for any u ∈ V and v ∈ V and z ∈ V, we have d(u, v) ≤ d(u, z) + d(z, v).
Let dG(u, v) denote the distance (shortest path length) between nodes u ∈ V and v ∈ V for a graph G = ⟨V, E⟩.
11.107 Prove that dG is a metric if G is any connected undirected graph.
11.108 Prove that dG is not necessarily a metric for a directed graph G, even if G is strongly connected.

11.109 Definition 11.23 defined a strongly connected component in a graph G = ⟨V, E⟩ as a set C ⊆ V such that: (i) any two nodes s ∈ C and t ∈ C are strongly connected; and (ii) for any node x ∈ V − C, adding x to C would make (i) false. Suppose that we’d instead defined clause (i) as “for any two nodes s ∈ C and t ∈ C, the node t is reachable from node s.” (But we don’t require that s be reachable from t.) This alternate definition is equivalent to the original. Why?
11.110 Prove that the strongly connected components (SCCs) of a directed graph partition the nodes of the graph: that is, prove that the relation R(u, v) denoting mutual reachability (u is reachable from v, and v is reachable from u) is an equivalence relation (reflexive, symmetric, and transitive).

Consider the directed graphs represented in Figure 11.38, one by picture and one by adjacency list. Identify the strongly connected components . . .
11.111 . . . in Figure 11.38(a).
11.112 . . . in Figure 11.38(b).

Suppose that we run breadth-first search from the following nodes. What is the last node that BFS discovers? (If there’s a tie, then list all the tied nodes.)
11.113 BFS from node A in Figure 11.38(a).
11.114 BFS from node B in Figure 11.38(a).
11.115 BFS from node 0 in Figure 11.38(b).
11.116 BFS from node 12 in Figure 11.38(b).
Figure 11.38: Two graphs. (a) A drawing of a graph on nodes A through G. (b) The graph given by the following adjacency list:
 0: 3, 7
 1: 9, 2, 5
 2: 1, 10, 9
 3: 0, 7, 1
 4: 10, 7
 5: 1
 6: 7, 11
 7: 0, 4, 6, 8
 8: 11, 12
 9: 1
10: 2, 4
11: 6, 8
12: 8

Breadth-first search as described in Figure 11.29 finds all nodes reachable from a given source node in a given graph, and, in fact, it discovers nodes in increasing order of their distance from s. But we didn’t actually record distances during the computation.
11.117 Modify the pseudocode for BFS to compute distances instead of just whether a path exists, by annotating every node added to Frontier with its distance from the source node s.
11.118 Argue that in your modified version of BFS, there are never more than two different distances stored in the Frontier.
11.119 Argue that the claim from the previous exercise may be false for depth-first search.

11.120 Consider a graph G represented by an adjacency matrix M. What does the ⟨i, j⟩th entry of MM (the matrix that results from squaring the matrix M) represent?

A word chain is a sequence ⟨w1, w2, . . . , wk⟩ of words, where each wi is a word in English, and wi+1 is one letter different from wi. For example, a word chain from FROWN to SMILE for my dictionary is

FROWN → FLOWN → FLOWS → SLOWS → SLOTS → SLITS → SKITS → SKITE → SMITE → SMILE.

(SKITE is a word of Scottish origin, meaning “an oblique blow.”)
11.121 (programming required) Write a program that uses a BFS-like algorithm to find a shortest word chain between two given words w1 and w2 of the same length. (You can find a dictionary of English words on the web, or /usr/share/dict/words on Unix-based operating systems. You’ll want to cull your dictionary to only words of the right length before you start.) There are faster solutions that involve searching “in both directions” out from w1 and into w2 until you find a match, but BFS from w1 will work.

11.4 Trees

I think that I shall never see
A poem lovely as a tree.
Joyce Kilmer (1886–1918), “Trees,” Trees and Other Poems (1914)

Informally, a tree is a graph that grows from a root, branching outward and eventually leading to the leaves. (We computer scientists are always upside down compared to botanists: unlike an oak or maple or tamarack, the root of a tree in CS is at the top, and it grows downward toward the leaves.) See Figure 11.39. Trees arise very frequently in computer science: to name just a few examples, they’re the class hierarchies of object-oriented programming, the binary search trees of data structures (see p. 1160), the game trees describing the progression of Tic-Tac-Toe or chess (p. 344), the parse trees that describe formal or natural languages (p. 543), and the recursion trees that describe the execution of recursive algorithms (Section 6.4). Trees are also frequently used in computational models of important phenomena from outside of CS: for example, in reconstructing evolutionary phylogenies (in computational biology), or in reconstructing the paths by which rumors spread from the originator of the information (in social network analysis). In this section, we’ll introduce trees formally, including definitions, properties, algorithms, and applications, as a special type of graph.

11.4.1 Cycles

Before we can define trees properly, we must first define another notion about graphs in general: a cycle, which is a way to get from a node back to itself (see Definition 11.26).

Figure 11.40 shows examples of an undirected and directed graph with a cycle ⟨A, B, C, A⟩. Note that the edges ⟨s, t⟩ and ⟨t, s⟩ in a directed graph are different; in an undirected graph, the edges {s, t} and {t, s} are the same. Thus a cycle in a directed graph can use both ⟨s, t⟩ and ⟨t, s⟩, but a cycle in an undirected graph cannot use both ⟨s, t⟩ and ⟨t, s⟩. In Figure 11.40, the path ⟨C, E, C⟩ is a cycle in the directed graph, but is not a cycle in the undirected graph because it reuses an edge.
Definition 11.26 (Cycle)
A cycle ⟨u1, u2, . . . , uk, u1⟩ is a path of length ≥ 2 from a node u1 back to node u1 that does not traverse the same edge twice.

Just as for any other path, the length of the cycle ⟨u1, u2, . . . , uk, u1⟩ is the number of edges it traverses, that is, k.

Figure 11.39: A small tree, with its root and its leaves labeled.

Figure 11.40: Two graphs with cycles ⟨A, B, C, A⟩.

Technically speaking, the definition of a cycle in Definition 11.26 says that the undirected graph in Figure 11.40 has six different cycles:
• ⟨A, B, C, A⟩, ⟨C, A, B, C⟩, and ⟨B, C, A, B⟩ (going clockwise), and
• ⟨A, C, B, A⟩, ⟨C, B, A, C⟩, and ⟨B, A, C, B⟩ (going counterclockwise).

However, we will adopt the convention that there is one and only one cycle in this graph. Because we can “start anywhere” in a cycle, we consider a cycle to be defined only by the relative ordering of the nodes involved, regardless of where we start. In an undirected graph, we can “go either direction” (clockwise or counterclockwise), so we also ignore the direction of travel in distinguishing cycles. In a directed graph, the direction of travel does matter; we may be able to go in one direction around a cycle without being able to go in the other. In other words, we say that Figure 11.41(a) and Figure 11.41(b) have one cycle each, while Figure 11.41(c) has two.

Figure 11.41: Some cycles, in three graphs (a), (b), and (c) on the nodes A, B, and C.

A cycle is by definition forbidden from traversing the same edge twice. A simple cycle also does not visit any node more than once:

Definition 11.27 (Simple cycle)
A cycle ⟨u1, u2, . . . , uk, u1⟩ is simple if each ui is distinct; that is, no nodes in the cycle are duplicated aside from the last node (which equals the first node).

(We’ve now used the word “simple” in three different contexts: simple graphs have no parallel edges or self-loops, and simple paths and cycles have no repeated vertices. Intuitively, all three definitions correspond to an entity that’s not unnecessarily complicated.) For one example, see Figure 11.42; here are two more:

Figure 11.42: In this graph, ⟨D, B, A, C, E, A, D⟩ is a non-simple cycle. This graph also has two simple cycles: ⟨D, B, A, D⟩ and ⟨C, E, A, C⟩.

Example 11.30 (Finding cycles)
Problem: Identify all simple cycles in the following graphs:

[Figure: the two graphs for this example, labeled 1 and 2.]

Solution: A nice way to identify cycles systematically is to look for cycles of all possible lengths: 2-node cycles, 3-node cycles, etc. (Actually 2-node cycles are possible only in directed graphs. Exercise: why?) Here are the simple cycles in these graphs:

1. ⟨B, E, C, B⟩
   ⟨B, D, F, C, B⟩
   ⟨C, F, G, E, C⟩
   ⟨B, D, F, G, E, B⟩
   ⟨B, D, F, G, E, C, B⟩

2. ⟨I, J, I⟩
   ⟨J, L, J⟩
   ⟨J, M, L, J⟩
   ⟨J, K, M, L, J⟩

Note that (to name one of several examples) the sequence ⟨I, J, L, J, I⟩ is also a cycle in the second graph, because it traverses four distinct directed edges and goes from node I to I, but this cycle is not simple, because node J is repeated.

We can use a modification of breadth-first search to identify cycles algorithmically. Specifically, suppose that we wish to find out whether a node u is involved in a cycle in a directed graph. We run BFS starting at node u, and if we ever encounter a node v that has u as a neighbor, then we have found a cycle involving node u. (An extra modification is necessary for undirected graphs; see Exercise 11.129.)

Taking it further: Kidneys are the most frequently transplanted organ today, in part because, unlike for other organs, humans generally have a “spare”: we’re born with two kidneys, but only need one functioning kidney to live a healthy life. Thus patients suffering from kidney failure may be able to get a transplant from friends or family members who are willing to donate one of their kidneys. But this potential transplant relies on the donor and the patient being compatible in dimensions like blood type and the physical size of the organs.
Recently a computational solution to the problem of incompatibility has emerged, using algorithms based on finding (short) cycles in a particular graph: there is now a national exchange for matching up two (or a few) patients with willing-but-incompatible donors, and doing a multiway transplant. See p. 1159 for more discussion.

Acyclic graphs
While cycles are important on their own, their relevance for trees is actually when they don’t exist:

Definition 11.28 (Acyclic Graphs) A graph is acyclic if it contains no cycles.

Let’s prove a useful structural fact about acyclic graphs. (Recall that we are considering finite graphs, where the set of nodes in the graph is finite. The following claim would be false if graphs could have an infinite number of nodes!)

Lemma 11.5 (Every acyclic graph has a node with degree 0 or 1) Let G = ⟨V, E⟩ be an acyclic undirected graph. Then there exists a node in V whose degree is zero or one.

Proof. We’ll give a constructive proof of the claim; specifically, we’ll give an algorithm that finds a node with the stated property:

1: let u_0 be an arbitrary node in the graph, and let i := 0
2: while the current node u_i has unvisited neighbors:
3:     let u_{i+1} be a neighbor of u_i that has not previously been visited.
4:     increment i

Observe that this process must terminate in at most |V| iterations, because we must visit a new node in each step. Suppose that this algorithm goes through k iterations of the while loop, and let t be the last node visited by the algorithm. (So t = u_k.)
• If k = 0, then t = u_0 has degree zero, so the claim follows immediately.
• If k ≥ 1, then we’ll argue that t has degree one. Because the algorithm terminated, there cannot be an edge between t and any unvisited node. Furthermore, if there were an edge from t to any previously visited node u_j for j < k − 1, then there would be a cycle in the graph, namely ⟨u_j, u_{j+1}, …, u_{k−1}, u_k, u_j⟩. Therefore t’s only neighbor is u_{k−1}, and the degree of t is one.

For directed graphs, the claim analogous to Lemma 11.5 is that every directed acyclic graph contains a node with out-degree zero. (You’ll prove it in Exercise 11.130.)

Taking it further: A directed acyclic graph (often just called a DAG) is, perhaps obviously, a directed graph that contains no cycles. A DAG G corresponds to a (strict) partial order (see Chapter 8); a cycle in G corresponds to a violation of transitivity. In fact, we can think of any directed graph G = ⟨V, E⟩ as a relation: specifically, the edge set E is a subset of V × V. Like transitivity and acyclicity, many of the concepts that we explored in Chapter 8 have analogues in the world of graphs.

11.4.2 Trees
With the definition of cycles in hand, we can now define trees themselves:

Definition 11.29 (Tree) A tree is an undirected graph that is connected and acyclic.

We will also sometimes talk about graphs that satisfy only the latter requirement: a forest is an undirected graph that is acyclic (but not necessarily connected). Every connected component of a forest is a tree, and note that a tree is itself a forest.

(An irrelevant note about Chinese: the character for tree is 木; the character for forest is 森, a disconnected collection of trees!)

Several examples of trees are shown in Figure 11.43: all six graphs have a single connected component and contain no cycles. Therefore all six are trees.

Figure 11.43: Some sample trees. [Six panels, (a) through (f).]

We’ll prove several structural facts about trees in this section, beginning with one concerning the number of edges in a tree. To start, let’s look at the number of nodes and edges in each of the trees in Figure 11.43:

tree    number of nodes    number of edges
(a)     4                  3
(b)     11                 10
(c)     4                  3
(d)     5                  4
(e)     1                  0
(f)     7                  6

In each of these trees, the number of nodes is one more than the number of edges, and that’s no coincidence; here’s the statement and proof of the general fact:

Theorem 11.6 (Number of edges in a tree) Let T = ⟨V, E⟩ be a tree. Then |E| = |V| − 1.

Proof. Let P(n) denote the property that any n-node tree has precisely n − 1 edges. We will prove that P(n) holds for all n ≥ 1 by induction on n.

Base case (n = 1): We must prove P(1): any 1-node tree has 1 − 1 = 0 edges. But the only 1-node (simple) graph is the one shown in Figure 11.43(e), which has zero edges, and so we’re done immediately.

Inductive case (n ≥ 2): We assume the inductive hypothesis P(n − 1); that is, every (n − 1)-node tree has n − 2 edges. We must prove P(n). Consider an arbitrary tree T = ⟨V, E⟩ with |V| = n. By definition, T is acyclic and connected. By Lemma 11.5, then, there exists a node u ∈ V with degree 0 or 1 in T. Furthermore, because T is connected and has at least two nodes, the degree of u cannot be 0. Thus u is a node with degree(u) = 1. Let v ∈ V be the unique neighbor of u in T. Let T′ be T with node u and the edge {u, v} between u and v deleted. (See Figure 11.44.) We claim that the graph T′ = ⟨V − {u}, E − {{u, v}}⟩ is a tree, too. The acyclicity and connectivity of T′ both follow from the fact that T was acyclic and connected, and the fact that the eliminated node u was of degree 1. The tree T′ contains n − 1 nodes, and thus, by the inductive hypothesis P(n − 1), contains n − 2 edges. Therefore T, whose edges are precisely the edges of T′ plus the eliminated edge {u, v}, contains precisely (n − 2) + 1 = n − 1 edges.

Figure 11.44: A tree T, with a node u of degree 1 and its neighbor v. The tree T′ is T without the node u and the edge {u, v}.

An immediate consequence of Theorem 11.6 is that every tree is teetering on the edge of being disconnected and of having a cycle (see Figure 11.45):

Corollary 11.7 (A tree with an edge added or removed is not a tree) Let T = ⟨V, E⟩ be any tree. Then:
1. adding any edge e ∉ E to T creates a cycle; and
2. removing any edge e ∈ E from T disconnects the graph.

[Figure 11.45 panels:] (a) Imagine adding the dashed edge, or removing the edge marked with ✗.
(b) Adding an edge creates a cycle. (c) Removing an edge disconnects the graph.
Figure 11.45: Adding/removing an edge from a tree.

Proof.
1. Define the graph G = ⟨V, E ∪ {e}⟩ as the result of adding the new edge e to the tree T. Because adding an edge to a graph can never disrupt connectivity and T was already connected, we know that G must be connected too. Thus if G were acyclic, then G would be a tree. But G has one more edge than T (specifically, G has (|V| − 1) + 1 = |V| edges) and therefore isn’t a tree by Theorem 11.6. Hence G contains a cycle.
2. The proof is similar: let G′ be T with e removed. Removing an edge cannot create a cycle, so G′ is acyclic. But G′ has too few edges to be a tree by Theorem 11.6, so G′ must be disconnected.

(Here’s an alternative proof of Corollary 11.7.1. Let {u, v} be an edge not in the tree T. Because T is connected, there is already a (simple) path P from u to v in T. If we add {u, v} to T, then there is a cycle: follow P from u to v and then follow the new edge from v back to u. Therefore G contains a cycle.)

Rooted trees
We often designate a particular node of a tree T as the root, which is traditionally drawn as the topmost node. (Note that we could designate any node as the root and, just like that mobile of zoo animals from your crib from infancy, “hang” the tree by that node.) We will adopt the standard convention that, whenever we draw trees, the vertically highest node is the root.

There’s a lot of terminology about trees in computer science that’s borrowed from the world of family trees:
• For a node u in a tree with root r ≠ u, the parent of u is the unique neighbor of u that is closer to r than u is. (The root is the only node that has no parent.)
• A node v is one of the children of a node u if v’s parent is u.
• A node v is a sibling of a node u ≠ v if v and u have the same parent.
A node with zero children is called a leaf. A node with one or more children is called an internal node.
(Note that the root is an internal node unless the tree is the trivial one-node graph.) See Figure 11.46 for an illustration of all of these definitions. Note that Figure 11.46 is correct only when the root is the topmost node in the image; with a different root, all of the panels could change.

Figure 11.46: The root, leaves, and internal nodes of the tree; the parent, children, and siblings of a particular node. [Six panels, each showing the same tree on nodes A through I: (a) The root. (b) The leaves. (c) The internal nodes. (d) The parent of E. (e) The children of E. (f) The sibling(s) of E.]

Here’s a concrete example:

Example 11.31 (A sample tree)
Here are two trees. (The second tree is just the first, rerooted to make E the new root.)

First tree (root A), listed by level: A; B, C; D, E, F, G; H, I, J; K, L, M. The children of A are B and C; of B are D, E, F; of C is G; of E are H, I; of G is J; of I are K, L, M.
Second tree (root E), listed by level: E; H, I, B; K, L, M, A, D, F; C; G; J. The children of E are H, I, B; of I are K, L, M; of B are A, D, F; of A is C; of C is G; of G is J.

Then we have:

                  first tree (root A)          second tree (root E)
Root:             A                            E
Leaves:           {D, F, H, J, K, L, M}        {D, F, H, J, K, L, M}
Internal nodes:   {A, B, C, E, G, I}           {A, B, C, E, G, I}
Parent of B:      A                            E
Children of B:    {D, E, F}                    {A, D, F}
Parent of A:      none                         B
Children of A:    {B, C}                       {C}

While the leaves and internal nodes are identical in these two trees, note that if we’d rerooted the tree at any of the erstwhile leaves instead, the new root would become an internal node instead of a leaf. For example, if we reroot this tree at H, then the leaves would be {D, F, J, K, L, M} and the internal nodes would be {A, B, C, E, G, H, I}.

Subtrees, descendants, and ancestors
Let T be a rooted tree, and let u be any node in T. The subtree rooted at u consists of u and all those nodes and edges “below” u in T. (In other words, a node v is in the subtree rooted at u if and only if the path from v to the root of T passes through u; the subtree is the induced subgraph of these nodes.) Such a node v in the subtree rooted at u is called a descendant of u if v ≠ u. The node u is called an ancestor of v. See Figure 11.47 for illustrations of these three definitions.
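The terminology above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names (`root_tree`, `children`, `ancestors`, `descendants`, `leaves`) are hypothetical, the edge list is the tree of Example 11.31, and the tree is rooted by a breadth-first pass that records each node's parent.

```python
from collections import deque

# Undirected edges of the tree from Example 11.31.
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("B", "F"),
         ("C", "G"), ("E", "H"), ("E", "I"), ("G", "J"),
         ("I", "K"), ("I", "L"), ("I", "M")]


def root_tree(edges, root):
    """Return a dict mapping each node to its parent (None for the root)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbors.get(u, ()):
            if v not in parent:        # v is one step farther from the root than u
                parent[v] = u
                queue.append(v)
    return parent


def children(parent, u):
    """Nodes whose parent is u."""
    return {v for v, p in parent.items() if p == u}


def ancestors(parent, u):
    """Follow parent pointers from u up to the root."""
    result = set()
    while parent[u] is not None:
        u = parent[u]
        result.add(u)
    return result


def descendants(parent, u):
    """Nodes other than u that have u as an ancestor."""
    return {v for v in parent if u in ancestors(parent, v)}


def leaves(parent):
    """Nodes with zero children."""
    return {u for u in parent if not children(parent, u)}
```

Calling `root_tree(EDGES, "A")` and then `root_tree(EDGES, "E")` shows how the same undirected tree yields different parents and children under different roots, as in Example 11.31. The `descendants` helper recomputes ancestors for every node, which is simple rather than efficient.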
Figure 11.47: Ancestors, descendants, and subtrees. [Three panels on the tree with nodes A through I: (a) Ancestors of E. (b) Descendants of E. (c) Subtree rooted at E, which contains E, F, G, H, and I.]

Here’s an example:

Example 11.32 (Descendants and ancestors)
Recall the trees from Example 11.31 (the first rooted at A, the second rooted at E). Then we have:

                      first tree (root A)               second tree (root E)
Descendants of B:     {D, E, F, H, I, K, L, M}          {A, C, D, F, G, J}
Ancestors of B:       {A}                               {E}
Descendants of H:     none                              none
Ancestors of H:       {A, B, E}                         {E}

Subtree rooted at B in the first tree: B has children D, E, F; E has children H, I; I has children K, L, M.
Subtree rooted at B in the second tree: B has children A, D, F; A has child C; C has child G; G has child J.

We have one final pair of definitions to (at last!) conclude our parade of terminology about rooted trees, related to how “tall” a tree is. Consider a rooted tree T with root node r. The depth of a node u is the distance from u to r. The height of a tree is the maximum, over all nodes u in the tree, of the depth of node u. For example, every node in the tree in Figure 11.48 is labeled by its depth: the root has depth 0, its children have depth 1, their children (the “grandchildren” of the root) have depth 2, and so forth. The height of the tree is the largest depth of any of its nodes; in this case, the height is 4.

Figure 11.48: A rooted tree’s nodes, labeled by depth. [A has depth 0; B and C have depth 1; D and E have depth 2; F, G, and H have depth 3; I has depth 4.]

Taking it further: Alternatively, we could give several of the definitions about rooted trees recursively. For example, we could define ancestors and descendants of a node u in a rooted tree T as follows:
• A node v is an ancestor of u if (i) v is the parent of u; or (ii) v is the parent of any ancestor of u.
• A node v is a descendant of u if (i) v is a child of u; or (ii) v is a child of any descendant of u.
We can also think of the depth of a node, or the height of a tree, recursively. The depth of the root is zero; the depth of a node with a parent p is 1 + (the depth of p).
For height:
• the height of a one-node tree T is zero; and
• the height of a tree T with root r with children {c_1, c_2, …, c_k} is 1 + max_{i ∈ {1, …, k}} (the height of the subtree rooted at c_i).

Binary trees
We’ll often encounter a special type of tree in which nodes have a limited number of children:

Definition 11.30 (Binary trees and k-ary trees) A binary tree is a rooted tree in which each node has 0, 1, or 2 children. More generally, if every node in a rooted tree T has k or fewer children, then T is called a k-ary tree. (In other words, a binary tree is 2-ary.)

Figure 11.49: The trees from Figure 11.43, repeated. All but (d) are binary trees.

For example, consider the trees in Figure 11.49. Of them, only the tree in Figure 11.49(d) is not a binary tree, because its root has four children. (This tree is a 4-ary tree.) But the other five trees are all binary: in each, every internal node has either 1 child or 2 children.

In a binary tree, the possible children of a node are called its left child and right child. (Even for a node u in a binary tree that has only one child, we’ll insist that the lone child be designated as either the left child of u or the right child of u.) For a node u, we say that u’s left subtree is the subtree rooted at u’s left child; the right subtree is analogous.

11.4.3 Tree Traversal
We will sometimes want to list all of the nodes contained in a tree T. There are three standard algorithms that are used for this purpose, called pre-order, in-order, and post-order traversal. While these algorithms can be generalized to non-binary trees, they’re easier to understand for binary trees (and they’re most frequently deployed for binary trees), so we’ll consider them that way.

All three algorithms are recursive, and all three algorithms execute precisely the same steps, just in a different order. On an empty tree T, we do nothing; on a non-empty tree T, all three algorithms perform the following steps:
• we “visit” the root of the tree T. (You can think of “visiting” the root as printing out the contents of the root node, or as adding it to the end of an accumulating list of the nodes that we’ve encountered in the tree.)
• we recursively traverse the left subtree of T, finding all nodes there.
• we recursively traverse the right subtree of T, finding all nodes there.
But the three traversal algorithms execute the three steps in different orders, either visiting the root before both recursive calls (“pre-order”); between the recursive calls (“in-order”); or after both recursive calls (“post-order”). We always recurse on the left subtree before we recurse on the right subtree. The details are shown in Figure 11.50.

Let’s take a look at an example of traversing a small tree using these algorithms. First we’ll look at the pre-order traversal, in which the first node visited in any subtree is the root of that subtree:

Example 11.33 (Traversing a small tree: pre-order traversal)
Let’s determine the order of nodes’ visits by a pre-order traversal of the following tree: the root A has left child B and right child C; B has only a left child, D; and C has left child E and right child F.

In a pre-order traversal, we first visit the root, then pre-order-traverse the left subtree, then pre-order-traverse the right subtree. In other words, we first visit the root A, then pre-order-traverse the subtree rooted at B (containing B and D), then pre-order-traverse the subtree rooted at C (containing C, E, and F):

Step #1: visit the root. We visit the root A.
Step #2: pre-order-traverse the left subtree. To pre-order-traverse the subtree rooted at B, we first visit the root B, then pre-order-traverse the left subtree (the single node D), then pre-order-traverse the (empty) right subtree. In order, these steps visit B and D.
Step #3: pre-order-traverse the right subtree. To pre-order-traverse the subtree rooted at C, we first visit C, then pre-order-traverse the left subtree (the single node E), and then pre-order-traverse the right subtree (the single node F). Pre-order-traversing E just results in visiting E, and pre-order-traversing F just visits F. In order, these steps visit C, E, and F.
Putting this all together, the pre-order traversal of the tree visits the nodes in this order:

A (step #1), B, D (step #2), C, E, F (step #3).

pre-order-traverse(T):
1: if T is empty then
2:     do nothing.
3: else
4:     visit the root of T
5:     pre-order-traverse(T’s left subtree)
6:     pre-order-traverse(T’s right subtree)

in-order-traverse(T):
1: if T is empty then
2:     do nothing.
3: else
4:     in-order-traverse(T’s left subtree)
5:     visit the root of T
6:     in-order-traverse(T’s right subtree)

post-order-traverse(T):
1: if T is empty then
2:     do nothing.
3: else
4:     post-order-traverse(T’s left subtree)
5:     post-order-traverse(T’s right subtree)
6:     visit the root of T

Figure 11.50: Three different algorithms to traverse a binary tree.

Here are examples of the other two traversal algorithms, on the same tree:

Example 11.34 (Traversing a small tree: in-order and post-order traversals)
Problem: Recall the tree from Example 11.33 (root A with children B and C; B has left child D; C has children E and F).
1. In what order are the nodes visited by an in-order traversal of this tree?
2. What about a post-order traversal?

Solution:
1. We first traverse the subtree rooted at B, then visit A, then traverse the subtree rooted at C. Traversing the subtree rooted at B visits D and B: first the left subtree, then the root. Traversing the subtree rooted at C visits E, then C, then F. Thus an in-order traversal visits the nodes in the order D, B, A, E, C, F.
2. For a post-order traversal, the root of each subtree is the last node traversed in that subtree: we first traverse the subtree rooted at B, then traverse the subtree rooted at C, then visit A. Traversing the subtree rooted at B visits D and B: first the left subtree, then the nonexistent right subtree, then the root. Traversing the subtree rooted at C visits E, then F, then C. Thus a post-order traversal visits the tree’s nodes in the order D, B, E, F, C, A.

Here’s another example, of using traversals to reconstruct a binary tree:

Example 11.35 (Trees from traversals)
Problem: Here is the output of all three traversals on a binary tree T. What’s T?
pre-order traversal:    9, 2, 7, 4, 5, 3
in-order traversal:     2, 9, 5, 4, 3, 7
post-order traversal:   2, 5, 3, 4, 7, 9

Solution: We’ll reassemble T from the root down. The root is first in the pre-order traversal (and last in the post-order), so 9 is the root. The root separates the left subtree from the right subtree in the in-order traversal; thus the left subtree contains just 2 and the right contains {3, 4, 5, 7}. So the tree has the following form: the root 9 has left child 2, and its right subtree is made up of the nodes {3, 4, 5, 7}.

The post-order 5, 3, 4, 7 and in-order 5, 4, 3, 7 show that 7 is the root of the unknown portion of the tree and that 7’s right subtree is empty. The last three nodes are pre-ordered 4, 5, 3; in-ordered 5, 4, 3; and post-ordered 5, 3, 4. In sum, that says that 4 is the root, 5 is the left subtree, and 3 is the right subtree. Assembling these pieces yields the final tree: the root 9 has left child 2 and right child 7; node 7 has left child 4 and no right child; node 4 has left child 5 and right child 3.

Taking it further: One particularly important type of binary tree is the binary search tree (BST), a widely used data structure, and one that’s probably very familiar if you’ve taken a course on data structures. A BST is a binary tree in which each node has some associated “key” (a piece of data), and the nodes of the tree are stored in a particular sorted order: all nodes in the left subtree have a key smaller than the root, and all nodes in the right subtree have a key larger than the root. Thus an in-order traversal of a binary search tree yields the tree’s keys in sorted order. For more, see p. 1160.
An even more specific form of binary search tree, called a balanced binary search tree, adds an additional structural property related to the depth of nodes in the tree. See p. 643 for a discussion of one scheme for balanced binary search trees, called AVL trees.

11.4.4 Spanning Trees
Let G = ⟨V, E⟩ be an undirected graph.
For example, imagine that each node in V represents a dorm room on your campus, and each edge in E denotes a possible fiber optic cable that can be laid to build an ethernet connection throughout the residence halls. A reasonable goal is to actually place only some of those possible cables, a subset E′ ⊆ E, while ensuring that network traffic can be sent between any two dorm rooms; that is, ensuring that the resulting network is connected. In other words, one seeks a spanning tree of the graph G:

Definition 11.31 (Spanning tree) Let G = ⟨V, E⟩ be a connected undirected graph. A spanning tree of G is a tree T = ⟨V, E′⟩ with the same nodes as G and with edges E′ ⊆ E that are a subset of G’s edges.

A spanning tree of G is called “spanning” because it connects (that is, spans) all nodes in G. Figure 11.51 shows a small example: the first panel shows a small graph G; the remaining panels show the 8 different spanning trees of G.

Figure 11.51: All 8 spanning trees of the graph shown in the first panel. [Nine panels, each on nodes A, B, C, E, F: the original graph, followed by its 8 spanning trees.]

A graph G has a spanning tree if and only if G is connected: we can be sure to only remove “redundant” edges that aren’t required for connectivity, and removing edges from G can never cause a disconnected graph to become connected. (For disconnected graphs, people sometimes talk about a spanning forest: a forest F = ⟨V, E′⟩ with E′ ⊆ E, where the connected components of the original graph G and the connected components of the forest F are identical.)

Although we didn’t talk about it this way when we introduced breadth- and depth-first search (see Figures 11.29 and 11.31), these algorithms can find spanning trees, with a small change: as we explore the graph, we include in E′ every edge ⟨u, v⟩ that leads from a previously known node u to a newly discovered node v.
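As a rough illustration of that small change, here is a Python sketch of breadth-first search that records the edge used to discover each new node. The function name `bfs_spanning_tree`, the adjacency-dictionary representation, and the four-node `EXAMPLE_GRAPH` are all assumptions made for this sketch; they do not come from Figures 11.29 or 11.31.

```python
from collections import deque


def bfs_spanning_tree(adj, start):
    """Return the edges of a spanning tree of the connected graph `adj`.

    `adj` maps each node to a list of its neighbors (undirected graph).
    """
    known = {start}
    tree_edges = []
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in known:          # v is newly discovered via the edge {u, v}
                known.add(v)
                tree_edges.append((u, v))
                frontier.append(v)
    return tree_edges


# A hypothetical connected graph: a 4-cycle A-B-D-C-A plus the chord B-C.
EXAMPLE_GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
```

On `EXAMPLE_GRAPH`, starting from A, the recorded edges are A-B, A-C, and B-D: three edges for four nodes, matching Theorem 11.6. If the input graph is disconnected, this sketch returns a spanning tree of only the component containing `start`.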
We’ll also see some other ways to find spanning trees in Section 11.5.2, but here’s another, conceptually simpler technique. To find a spanning tree in a connected graph G, we repeatedly find an edge that can be deleted without disconnecting G (that is, an edge that’s in a cycle) and delete it. See Figure 11.52 for the algorithm.

Cycle Elimination Algorithm:
Input: a connected graph G = ⟨V, E⟩
Output: a spanning tree of G
1: while there exists a cycle C in G:
2:     let e be an arbitrary edge traversed by C
3:     remove e from E
4: return the resulting graph ⟨V, E⟩.

Figure 11.52: The pseudocode for an algorithm to find a spanning tree.

Here’s an example:

Example 11.36 (Finding a spanning tree via cycle elimination)
Here are the iterations of the Cycle Elimination algorithm in computing a spanning tree of a given connected graph. In each iteration, we’ve selected an arbitrary cycle (lightly shaded) and then selected an arbitrary edge from that cycle (heavily shaded) and removed it.
[Figure: a graph on nodes A through H, shown at the start and after each of step #1, step #2, and step #3.]
After three iterations, the resulting graph has no cycles, and remains connected; the resulting graph is a spanning tree of the original graph.

We can prove that the Cycle Elimination algorithm correctly finds spanning trees, given an arbitrary connected graph as input:

Theorem 11.8 (Correctness of the Cycle Elimination algorithm) Given any connected graph G = ⟨V, E⟩, the Cycle Elimination algorithm returns a spanning tree T of G.

Proof. The algorithm only deletes edges from G, so certainly T = ⟨V, E′⟩ satisfies E′ ⊆ E. We need to prove that T is a tree: that is, T is acyclic and T is connected.

Acyclicity: As long as there’s a cycle remaining, the algorithm stays in the while loop. Thus we only exit the loop when the remaining graph is acyclic. (And the loop terminates in at most |E| iterations, because an edge is deleted in every iteration.)

Connectivity: We claim that the graph is connected throughout the algorithm. It’s true at the beginning of the algorithm, by assumption. Now consider an iteration in which we delete the edge {u, v} from a cycle C. Let s and t be arbitrary nodes; we will argue that there is still a path from s to t. Before we deleted {u, v}, there was a path P from s to t. If P didn’t traverse the edge {u, v}, then P is still a path from s to t. Otherwise, we can still get from s to t by going “the long way around” the cycle C instead of following the single edge {u, v}. (See Figure 11.53.) Thus there is still a path from any node s to any node t, and so the graph stays connected.

Figure 11.53: Maintaining connectivity in the Cycle Elimination Algorithm. [Two panels showing nodes s, t, u, v and the cycle C: (a) The short way from s to t, via {u, v}. (b) The long way from s to t.]

Computer Science Connections
Directed Graphs, Cycles, and Kidney Transplants

[Figure 11.54, panel (a): The graph of compatibilities, with patients #1 through #5 and donors #1 through #5. A directed edge goes from every patient to her corresponding donor. There is a directed edge from a donor to a patient if that patient can receive a kidney from that donor.]

Kidneys are essential to human life; they play an essential filtering role in the body without which we would all die. Although we are born with two kidneys, humans need only one functioning kidney to live healthy lives. Because we’re all naturally equipped with a “spare,” kidney transplants are the most common form of transplant surgery performed today. Thousands of lives are saved annually through kidney transplants.
Typically a patient in need of a kidney identifies a friend or relative who is willing to donate. If the patient and donor are compatible (for example, blood type and physical size of the donor’s kidney must be appropriate), then medical teams perform two simultaneous operations: one to remove the “spare” kidney from the donor, and one to implant it in the patient.
(Some patients instead receive kidneys from strangers who chose to donate their organs in case of an untimely death.)
Unfortunately, many patients who need kidneys have a friend or relative willing to donate to them, but they are incompatible with their prospective donor’s kidney. These patients may spend years on a waiting list for a transplant, undergoing painful, expensive, and only partially effective dialysis while they wait and hope.
In recent years, medical personnel have begun a program of kidney exchanges. Suppose that a patient p1 is incompatible with her prospective donor d1, another patient p2 is incompatible with his prospective donor d2, but the pairs ⟨p1, d2⟩ and ⟨p2, d1⟩ are each compatible. Four teams of doctors can then do a “paired exchange” with four surgeries, in which d1 donates to p2 and d2 donates to p1. (To ensure that everybody follows through, the surgeries must be simultaneous: if d1 donates to p2 before d2 undergoes surgery, then d2 has no incentive to go through the surgery, as d2’s friend p2 has already received his kidney.) We can even consider larger exchanges (three or more simultaneous donations), though as the number of surgeries increases, the logistical difficulty increases as well.
Deciding which transplants to complete is done using a graph-based algorithm. Each patient pi comes to the system with a donor di who is willing to donate to pi. Define a directed graph G as follows. There is a node for each patient pi and a node for each donor di. Add a directed edge ⟨pi, di⟩ for every i. Also add a directed edge ⟨di, pj⟩ if donor di is compatible with patient pj. A cycle in G then corresponds to a set of surgeries that can be completed: every donor in the cycle donates a kidney, and every patient in the cycle receives a compatible kidney. See Figure 11.54 for an example.
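The graph construction can be sketched in a few lines of Python. This is a toy illustration, not the algorithm used by any real exchange: the function names, the 0-based indexing of patient/donor pairs, and the compatibility data in `EXAMPLE_COMPATIBLE` are all invented for the example. The sketch builds the edges ⟨p_i, d_i⟩ and ⟨d_i, p_j⟩ and then lists the shortest useful cycles, the paired exchanges.

```python
def build_exchange_graph(n, compatible):
    """Build the directed graph for n patient/donor pairs.

    `compatible` is a set of (i, j) pairs meaning that donor i can donate to
    patient j. Nodes are ("p", i) and ("d", i); the result maps each node to
    the set of nodes that its out-edges point to.
    """
    graph = {}
    for i in range(n):
        graph[("p", i)] = {("d", i)}                        # edge <p_i, d_i>
        graph[("d", i)] = {("p", j) for j in range(n)       # edges <d_i, p_j>
                           if (i, j) in compatible}
    return graph


def paired_exchanges(n, graph):
    """List all (i, j) with i < j such that p_i, d_i, p_j, d_j, p_i is a cycle."""
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if ("p", j) in graph[("d", i)] and ("p", i) in graph[("d", j)]]


# Hypothetical data: donor 0 matches patient 1, donor 1 matches patient 0,
# and donor 2 matches patient 0 (but nobody matches patient 2).
EXAMPLE_COMPATIBLE = {(0, 1), (1, 0), (2, 0)}
EXAMPLE_EXCHANGE_GRAPH = build_exchange_graph(3, EXAMPLE_COMPATIBLE)
```

With this data, `paired_exchanges(3, EXAMPLE_EXCHANGE_GRAPH)` reports the single exchange between pairs 0 and 1. Longer cycles, and the choice of a best set of node-disjoint cycles, need more machinery than this sketch provides.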
The algorithm that’s actually used in the real kidney exchange network in the United States computes a set of node-disjoint cycles that will be performed.¹¹ To limit the number of simultaneous surgeries that are required, the algorithm seeks a set of cycles of length 4 or length 6 (that is, 2 or 3 transplants) in G that maximizes the total number of nodes included. (The constraint on cycle length makes the computational problem much more difficult, so the algorithm requires significant computational power to compute the surgeries to complete.)

Figure 11.54: An example of a kidney exchange network, and the cycle-based algorithm to select transplants. [Panel (b): The selected transplants, on the same patients #1 through #5 and donors #1 through #5. We “cover” this graph with two cycles; if we perform the transplants highlighted (the darker donor-to-patient edges), then every patient receives a compatible kidney.]

11 David Abraham, Avrim Blum, and Tuomas Sandholm. Clearing algorithms for barter exchange markets: Enabling nationwide kidney exchanges. In Proceedings of the ACM Conference on Electronic Commerce (EC), 2007.

Computer Science Connections
Binary Search Trees

Trees are the basis of many important data structures, of which binary search trees are perhaps most frequently used. Binary search trees are data structures that implement the abstract data type called a dictionary: we have a set of keys, each of which has a corresponding value. (For example, the keys might be words and the values definitions, or they might be student names and GPAs, or usernames and encrypted passwords.) The data structure must support operations like insert(k, v) (add a new key/value pair) and lookup(k) (report the value associated with key k, if any).
A binary search tree (BST) is a binary tree for which every node u satisfies the BST condition illustrated in Figure 11.55: every node v in u’s left subtree has a key that is less than u’s key, and every node v in u’s right subtree has a key that is greater than u’s key. (For simplicity, assume that all keys are distinct.) An example of a binary search tree is shown in Figure 11.56. Incidentally, the BST condition implies the following claim: an in-order traversal of a binary search tree visits the keys in sorted order. This claim can be proven formally by induction, but the intuition is straightforward: an in-order traversal of a node with key x first visits nodes < x (while traversing the left subtree), then x itself, and then nodes > x (while traversing the right subtree). Because, recursively, the nodes of the left and right subtrees
are themselves visited in sorted order, the entire tree’s keys are visited in sorted order.
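The sorted-order claim is easy to try out. Below is a minimal Python sketch, assuming a deliberately simple representation that is not from the text: an empty tree is `None`, and a nonempty tree is a tuple `(left, key, right)`. The keys in `EXAMPLE_BST` are arbitrary integers chosen so that the BST condition holds.

```python
def in_order(tree):
    """Return the keys of `tree` in the order an in-order traversal visits them."""
    if tree is None:                 # empty tree: nothing to visit
        return []
    left, key, right = tree
    # left subtree first, then the root, then the right subtree
    return in_order(left) + [key] + in_order(right)


# A hypothetical BST: root 5, left subtree on {1, 3, 4}, right subtree on {8, 9}.
EXAMPLE_BST = (((None, 1, None), 3, (None, 4, None)),
               5,
               (None, 8, (None, 9, None)))
```

Here `in_order(EXAMPLE_BST)` returns `[1, 3, 4, 5, 8, 9]`, the keys in sorted order. Building a new list at every level keeps the sketch short at the cost of extra copying.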
Binary search trees are good data structures for dictionaries because insert and lookup can be implemented simply and efficiently. If we perform a lookup for a key k in an empty BST T, we return “not found.” (For simplicity, we allow a BST to be empty; that is, to contain zero nodes.) Otherwise, compare k to the key r stored in the root node of T:
• if k = r, then return the value stored at the root.
• if k < r, then perform a lookup for k in the left subtree.
• if k > r, then perform a lookup for k in the right subtree.
The BST condition guarantees that we find the node with key k if it’s in the tree. (You can prove this fact by induction.) The insert operation can be implemented similarly, by adding a new node exactly where a lookup for the key k would have found k.
The worst-case running time of lookup and insert is proportional to the height of the binary search tree. More “balanced” BSTs, in which every internal node has a left subtree with roughly the same height as its right subtree, have better performance. (There are many different BSTs with the same set of keys; for example, another BST that has the same keys as the BST in Figure 11.56 is shown in Figure 11.57.)
Most software therefore uses balanced binary search trees instead: for example, AVL trees or red–black trees.¹² (See p. 643 for further discussion of AVL trees, and a proof of their efficiency.)
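To see how much the height can vary for a fixed set of keys, here is a small Python experiment. It is a sketch under assumptions of its own: trees are `None` or tuples `(left, key, right)`, the helper names are hypothetical, no rebalancing is performed, and an empty tree is given height −1 purely so that a one-node tree comes out with height 0.

```python
def insert(tree, key):
    """Insert key into a BST represented as None or (left, key, right)."""
    if tree is None:
        return (None, key, None)
    left, root_key, right = tree
    if key < root_key:
        return (insert(left, key), root_key, right)
    if key > root_key:
        return (left, root_key, insert(right, key))
    return tree                          # key already present: nothing to do


def height(tree):
    """Height of a tree: 0 for a single node, 1 + the tallest subtree otherwise."""
    if tree is None:
        return -1                        # convention so that one node has height 0
    left, _, right = tree
    return 1 + max(height(left), height(right))


def build(keys):
    """Insert the keys one at a time, in the given order, into an empty BST."""
    tree = None
    for key in keys:
        tree = insert(tree, key)
    return tree
```

Inserting the keys 1 through 7 in increasing order produces a chain of height 6, while inserting the same keys in the order 4, 2, 6, 1, 3, 5, 7 produces a tree of height 2. Balanced schemes such as AVL trees rearrange the tree during insertion so that the first situation cannot arise.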
Figure 11.55: The binary search tree condition. For every node with key x: all keys in the left subtree of the node are < x; and all keys in the right subtree of the node are > x.
Figure 11.56: A binary search tree storing a set of 10 keys (Evan, Hanan, Isaac, Joseph, Mikenna, Milan, Morgan, Noah, Qwill, Yasin). The key is shown in each node; the accompanying value isn’t drawn.
Figure 11.57: Another binary search tree with the same set of keys.
12 See the details in any good textbook on data structures, or in Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.

11.4.5 Exercises
Identify all of the simple cycles in the following graphs. [The three graphs, each drawn on nodes A through H, are not reproduced here.]
11.122
11.123
11.124
Consider an undirected graph G with n nodes. In terms of n . . .
11.125 . . . what is the longest simple cycle that G can contain? Explain.
11.126 . . . what is the longest cycle (not necessarily simple) that G can contain? Explain.
11.128 Let u be a node in an n-node complete directed graph: all edges except for self-loops are present. How many simple cycles is node u involved in?
11.129 A small modification to BFS can detect cycles involving a node s in a directed graph, as shown in Figure 11.58. However, this modification doesn’t quite work for undirected graphs. Give an example of an acyclic graph in which the algorithm in Figure 11.58 falsely claims that there is a cycle. Then describe briefly how to modify this algorithm to correctly detect cycles involving node s in undirected graphs.
Recall Lemma 11.5: in any acyclic undirected graph, there exists a node whose degree is zero or one. Prove the following two extensions/variations of this lemma:
11.130 Prove that every directed acyclic graph contains a node with out-degree zero.
11.131 Prove that there are two nodes of degree 1 in any acyclic undirected graph that contains at least one edge.
Recall Definition 11.26: a cycle ⟨u0,u1,…,uk,u0⟩ is a path of length ≥ 2 from a node u0 back to node u0 that does not traverse the same edge twice. At various times in class, I’ve tried to define cycles in all of the following ways—and they’re all bogus definitions, in the sense that they describe something different from Definition 11.26. For each of the following broken definitions, explain why I was wrong:
11.132 A cycle is a simple path from s to s.
11.133 A cycle is a path of length ≥ 2 from s to s.
11.134 A cycle is a path from s to s that doesn’t traverse any edge more than once.
11.135 A cycle is a path from s to s that includes at least 3 distinct nodes.
11.136 A cycle is a path of length ≥ 2 from s to s that doesn’t traverse any edge twice consecutively.
11.137 Definition 11.28 defines an acyclic graph as one containing no cycles, but it would have been
equivalent to define acyclic graphs as those containing no simple cycles. Prove that G has a cycle if and only if G has a simple cycle.
Recall that G = ⟨V, E⟩ is a regular graph if every u ∈ V has degree(u) = d, for some fixed constant d.
11.138 Identify two different regular graphs that are trees.
11.139 It turns out that there are two and only two different trees T that are regular graphs. Prove that
there are no other regular graphs that are trees.
Prove your answers to the following questions, and simplify your answer as n gets large. (For handling large n, a useful fact from calculus: $\sum_{i=0}^{n} \frac{1}{i!}$ approaches e = 2.71828⋯ as n grows.)
11.127 In the n-node complete graph Kn, how many simple cycles is a particular node u involved in?
Input: a graph G = ⟨V, E⟩ and a source node s ∈ V
Output: is s involved in a cycle in G?
Frontier := ⟨s⟩
Known := ∅
while Frontier is nonempty:
    u := the first node in Frontier
    remove u from Frontier
    if s is a neighbor of u then
        return “s is involved in a cycle.”
    for every neighbor v of u:
        if v is in neither Frontier nor Known then
            add v to the end of Frontier
    add u to Known
return “s is not involved in a cycle.”
Figure 11.58: BFS modified (slightly buggily) to detect cycles involving the node s.
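For readers who want to experiment with the algorithm in Figure 11.58, here is a minimal Python sketch of it for directed graphs only. The function name `s_on_cycle_directed` and the dictionary-of-out-neighbors representation are assumptions made for this illustration; they do not come from the text, and the sketch deliberately does not address the undirected case that Exercise 11.129 asks about.

```python
from collections import deque


def s_on_cycle_directed(graph, s):
    """Transcription of Figure 11.58 for a DIRECTED graph.

    `graph` maps each node to a list of its out-neighbors
    (a hypothetical representation chosen for this sketch).
    Returns True if s lies on a directed cycle, False otherwise.
    """
    frontier = deque([s])   # nodes discovered but not yet processed
    known = set()           # nodes already processed
    while frontier:
        u = frontier.popleft()
        # If some edge leads from u back to s, we have found a cycle through s.
        if s in graph.get(u, []):
            return True
        for v in graph.get(u, []):
            if v not in frontier and v not in known:
                frontier.append(v)
        known.add(u)
    return False


# A directed triangle through s, a directed path, and a cycle that avoids s.
triangle = {"s": ["a"], "a": ["b"], "b": ["s"]}
path = {"s": ["a"], "a": ["b"], "b": []}
cycle_elsewhere = {"s": ["a"], "a": ["b"], "b": ["a"]}
```

Calling `s_on_cycle_directed(triangle, "s")` reports a cycle, while the other two graphs do not, because no edge ever leads back to s in them.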

A triangle is a simple cycle containing exactly three nodes. A square is a simple cycle containing exactly four nodes.
11.140 What is the largest number of triangles possible in an undirected graph of n nodes?
11.141 What is the largest number of squares possible in an undirected graph of n nodes?
Let’s analyze the largest number of edges that are possible in an n-node undirected graph that contains no triangles.
11.142 Consider a triangle-free graph G = ⟨V, E⟩. For nodes u ∈ V and v ∈ V, argue that if {u, v} ∈ E, then we have degree(u) + degree(v) ≤ |V|.
11.143 Prove the following claim by induction on the number of nodes in the graph: if G = ⟨V, E⟩ is triangle-free, then $|E| \le |V|^2/4$. (Hint: use the previous exercise.)
11.144 Give an example of an n-node triangle-free graph that contains $n^2/4$ edges.
Consider the following adjacency lists. Is the graph that each represents a tree? Justify your answers.
11.145 11.146 11.147 11.148
Prove or disprove the following claims about trees:
11.149 There is a node of degree equal to 2 in any tree with ≥ 3 nodes.
11.150 In any rooted binary tree (all nodes have 0, 1, or 2 children), there are an even number of leaves.
11.151 If a graph G = ⟨V, E⟩ has |V| − 1 edges, then G must be a forest.
11.152 The following pair of definitions is subtly broken: the root of a tree is a node that is not a child,
and a leaf is a node that is a child but not a parent. What’s broken?
For the tree in Figure 11.59, with node A as the root . . .
11.153 . . . what are the leaves?
11.154 . . . which nodes are internal nodes?
11.155 . . . what are the parent, children, and siblings of node D?
11.156 . . . what are the descendants of node D?
11.157 . . . what are the ancestors of node F?
11.158 . . . what is the height of the tree?
11.159 Let T be an arbitrary n-node rooted tree, with root r and with l different leaves. Prove or disprove: if we reroot T at a new node r′ ≠ r, then the number of leaves remains exactly l.
A complete binary tree of height h
has “no holes”: reading from top-to-
bottom and left-to-right, every node
exists. Complete binary trees form a
subset of nearly complete binary trees: a
nearly complete binary tree has every
node until the last row, which is allowed
to stop early. (See Figure 11.60, and see
also p. 529 for a discussion of heaps,
which are a data structure represented as a nearly complete binary tree.)
11.160 Prove by induction that a complete binary tree of height h contains precisely $2^{h+1} - 1$ nodes.
11.161 How many leaves does a nearly complete binary tree of height h have? Give the smallest and
largest possible values, and explain.
11.162 What is the diameter of a nearly complete binary tree of height h? Again, give the smallest and largest possible values, and explain your answer. (Recall that the diameter of a graph G = ⟨V, E⟩ is $\max_{s,t \in V} d(s, t)$, where d(s, t) denotes the length of the shortest path from s to t in G.)
Figure 11.59: A rooted tree.
Figure 11.60: A complete and nearly complete binary tree of height 3.
11.145
A: B, E
B: A
C: D
D: C, F
E: A
F: D
11.146
A: C
B: C, E
C: A, B, F
D: E
E: B, D
F: C
11.147
A: D
B: E, F
C: D, F
D: A, C
E: B
F: B, C
11.148
A: C, D, F
B: F
C: A, E, F
D: A
E: C
F: A, B, C

Suppose that we “rerooted” a complete binary tree of height h by instead designating one of the erstwhile leaves as the root. In the rerooted tree, what are the following quantities?
11.163 the height
11.164 the diameter
11.165 the number of leaves
Justify your answers to the following questions: describe a 1000-node binary tree with . . .
11.166 . . . height as large as possible.
11.167 . . . height as small as possible.
11.168 . . . as many leaves as possible.
11.169 . . . as few leaves as possible.
11.170 What is the largest possible height for an n-node binary tree in which every node has precisely zero or two children? Justify your answer.
In what order are nodes of the tree in Figure 11.61 traversed . . .
11.171 . . . by a pre-order traversal?
11.172 . . . by an in-order traversal?
11.173 . . . by a post-order traversal?
11.174 Draw the binary tree with in-order traversal 4, 1, 2, 3, 5; pre-order traversal 1, 4, 3, 2, 5; and post-order traversal 4, 2, 5, 3, 1.
11.175 Do the same for the tree with in-order traversal 1, 3, 5, 4, 2; pre-order traversal 1, 3, 5, 2, 4; and post-order traversal 4, 2, 5, 3, 1.
11.176 Describe (that is, fully explain the structure of) an n-node binary tree T for which the pre-order and in-order traversals of T result in precisely the same ordering of T’s nodes. (That is, pre-order-traverse(T) = in-order-traverse(T).)
11.177 Describe a binary tree T for which the pre-order and post-order traversals result in precisely the same ordering of T’s nodes. (That is, pre-order-traverse(T) = post-order-traverse(T).)
11.178 Prove that there are two distinct binary trees T and T′ such that pre-order and post-order traversals are both identical on the trees T and T′. (That is, pre-order-traverse(T) = pre-order-traverse(T′) and post-order-traverse(T) = post-order-traverse(T′) but T ≠ T′.)
11.179 Give a recursive algorithm to reconstruct a tree from the in-order and post-order traversals.
11.180 Argue that we didn’t leave out any spanning trees of G in Figure 11.51, reproduced here for your
convenience:
How many spanning trees do the following graphs have? Explain.
Figure 11.61: A rooted tree.
[Figure 11.51, reproduced: the original graph on nodes A, B, C, E, and F, followed by its spanning trees.]
11.181 [Figure: a graph on nodes A through H.]
11.182 [Figure: a graph on nodes A through H.]

11.5 Weighted Graphs
Force without wisdom falls of its own weight.
Horace (65–8 bce), Odes (23 bce)
Many real-world situations are naturally modeled by different edges having different “weights”: the price of an airplane flight, the closeness of a friendship, the physical length of a road, the time required to transmit data across an internet connection. These graphs are called weighted graphs:
Definition 11.32 (Weighted graph)
A weighted graph is a graph G = ⟨V, E⟩ and a weight function $w : E \to \mathbb{R}_{\ge 0}$, so that each edge e ∈ E has a weight w(e) ≥ 0. For simplicity of notation, we’ll often write $w_e$ instead of w(e); we’ll also sometimes refer to $w_e$ as the length of the edge e.
The length of a path in a weighted graph is the sum of the lengths of the edges traversed by the path. (A shortest path is, as before, one with the smallest length.)
Definition 11.32 considers only nonnegative weights (every $w_e \ge 0$), which is a genuine restriction. (For example, the “signed” social networks from Figure 11.8(a) have positive and negative weights signifying friendship and enmity.) Some, but not all, of the results that we’ll discuss in this section carry over to the setting of negative weights.
Either undirected or directed graphs can be weighted. Aside from the length of a path, all of the other notions and terminology from unweighted graphs carry over: neighbors and degree, paths and connectivity, and so forth. Weighted graphs can be represented just as unweighted graphs were: we typically store the weight of edge
⟨u, v⟩ directly in the ⟨u, v⟩th entry of the adjacency matrix, or attach the edge weight as an additional slot in the adjacency list entries. Here’s an example:
Example 11.37 (A weighted graph)
Here’s the highway system from Example 11.4, where each road is labeled with its length:
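[Figure: the highway network drawn as a weighted graph, with the following labeled roads.]
Los Angeles ↔ Lake City, FL: 2350 miles
Lake City, FL ↔ Tampa: 180 miles
Lake City, FL ↔ Jacksonville: 60 miles
Jacksonville ↔ Daytona Beach: 90 miles
Daytona Beach ↔ Orlando: 55 miles
Orlando ↔ Tampa: 85 miles
There are two simple paths between Orlando and Lake City:
• Orlando ↔ Tampa ↔ Lake City: 85 + 180 = 265 miles.
• Orlando ↔ Daytona Beach ↔ Jacksonville ↔ Lake City: 55 + 90 + 60 = 205 miles.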
The second path is shorter, even though it traverses more edges, as 265 > 205.
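To make the representation concrete, here is a small Python sketch that stores the highway network of Example 11.37 as an adjacency list with the weight attached to each entry, and computes the length of a path as the sum of its edge weights. The names `ROADS`, `build_adjacency`, and `path_length` are hypothetical, chosen only for this illustration.

```python
# Undirected highway graph from Example 11.37; each weight is a length in miles.
ROADS = [
    ("Los Angeles", "Lake City", 2350),
    ("Lake City", "Tampa", 180),
    ("Lake City", "Jacksonville", 60),
    ("Jacksonville", "Daytona Beach", 90),
    ("Daytona Beach", "Orlando", 55),
    ("Orlando", "Tampa", 85),
]


def build_adjacency(edges):
    """Build an adjacency list in which each neighbor carries the edge weight."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w   # store the weight in both directions,
        adj.setdefault(v, {})[u] = w   # because the graph is undirected
    return adj


def path_length(adj, path):
    """Length of a path = sum of the weights of the edges it traverses."""
    return sum(adj[u][v] for u, v in zip(path, path[1:]))


adj = build_adjacency(ROADS)
via_tampa = path_length(adj, ["Orlando", "Tampa", "Lake City"])
via_daytona = path_length(
    adj, ["Orlando", "Daytona Beach", "Jacksonville", "Lake City"]
)
```

Here `via_tampa` is 265 and `via_daytona` is 205, matching the two path lengths computed in the example.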
Taking it further: The primary job of a web search engine is to respond to a user’s search query (“give me web pages about Horace”) with a list of relevant pages. There’s a complex question of data structures, parallel computing, and networking infrastructure in solving even the simplest part of this task: identifying the set R of web pages (out of many billions) that contain the search term. A subtler challenge, and one at least as important, is figuring out how to rank the set R. What pages in R are the “most important,” the ones that we should display on the first page of results? See p. 1174 for some discussion of how Google uses a weighted graph (and probability) to do this ranking.

11.5.1 Shortest Paths in Weighted Graphs: Dijkstra’s Algorithm
A shortest path from s to t in a weighted graph is the path connecting s and t that has shortest total length. In many natural applications where shortest paths are useful, we have weights on edges: you want the shortest walking route from the bar back to your apartment, for example, not necessarily the one with the fewest turns. In Example 11.37, we already saw a case in which the shortest path used more edges than necessary. Thus we cannot directly use breadth-first search to compute distances in weighted graphs.
But we can compute distances using an algorithm that’s very similar in
spirit to BFS. The basic idea of breadth-first search is to “expand outward” from the source node s in layers, accumulating a set of nodes u for which we know the distance from s to u. We add nodes in increasing order of their distance from s, and eventually we’ve computed distances from s to all nodes in the graph. (See Figure 11.62.) The trouble for weighted graphs is that the order in which BFS builds up its knowledge about shortest paths doesn’t always work (as in Example 11.37). But we can use a cleverer way of building up knowledge about the network to find shortest paths in weighted graphs, too.
The algorithm that we’ll describe is due to Edsger Dijkstra, and hence it is known as Dijkstra’s algorithm. The key idea of Dijkstra’s algorithm has parallels with BFS:
Suppose that we know the distance from a source node s to every node in some set S of nodes. (Assume that s ∈ S.) We will find some node not in S for which we can determine the shortest path from s.
For now, let’s not worry about where this set S came from; the key point is just that we are assuming that we know distances to certain nodes (those in S), and we seek to leverage that existing knowledge to learn the distance to some other node (not previously in S). We’ll then add that new node to S and iterate.
Before we state the formal result, let’s look at an example:
Example 11.38 (An example of distances)
Consider the following weighted, undirected graph (with edge weights marked on the edges):
Figure 11.62: The intuition of BFS. Assume the shaded set S contains every node within distance d of s, and that u ∉ S is a neighbor of v ∈ S. The distance from s to u must be d + 1.
Edsger Dijkstra was a 20th-century Dutch computer scientist: one of the founders of theoretical computer science, and the 1972 Turing Award winner.
Irrelevant quotation: “Computer science is no more about computers than astronomy is about telescopes.” (attributed to Edsger W. Dijkstra, 1930–2002)
Irrelevant challenge: Name a common English word that, like DIJKSTRA, has at least five (or 6 or even 7, which is technically possible) consecutive consonants. (Not SYZYGY or RHYTHMS; Y is a vowel if it’s used as a vowel!)
[Figure: the weighted, undirected graph for Example 11.38, on nodes A through H, with the nodes of S = {A, B, C} shaded.]
Suppose we know the distances from A to every node in the shaded set S = {A, B, C}: d(A, A) = 0, d(A, B) = 1, and d(A, C) = 3.

We wish to expand our set of known nodes by adding a neighbor of an already shaded node. The candidate nodes that are neighbors of nodes with known distances are {D, E, F}. In particular, their candidate distances are:
node         distance
F (via B)    $d(A, B) + w_{B,F} = 1 + 6 = 7$
E (via A)    $d(A, A) + w_{A,E} = 0 + 9 = 9$
E (via C)    $d(A, C) + w_{C,E} = 3 + 8 = 11$
D (via C)    $d(A, C) + w_{C,D} = 3 + 5 = 8$
Let’s argue that we can now conclude that d(A, F) = 7.
The key reason is that, to get from A to F, we have to “escape” the set of shaded nodes, and every “escape route” (path to F) must reach its last shaded node v (that’s d(A, v)) and then follow an edge to its first unshaded node u (that’s $w_{v,u}$). Because this table tells us that every path out of the shaded region has length at least 7, and we’ve found a path from A to F with exactly that length, we conclude that d(A, F) = 7.
Figure 11.63: The graph for Example 11.38, repeated and rotated. We’ve computed that d(A, A) = 0 and d(A, B) = 1 and d(A, C) = 3.
Computing the distance to a new node
The same basic reasoning that we used in Example 11.38 will allow us to prove a general observation that’s the foundation of Dijkstra’s algorithm:
Lemma 11.9 (Foundation of Dijkstra’s Algorithm)
Let G = ⟨V, E⟩ be a graph with edge weights w, let S ⊂ V be a set of nodes, and let s ∈ S be a source node. Let d(s, v) denote the distance from s to v for every node v in S. For a node u ∉ S, define
$$d_u := \min_{v \in S \,:\, u \text{ is a neighbor of } v} \bigl( d(s, v) + w_{v,u} \bigr).$$
Let u∗ be the node u ∉ S for which $d_u$ is minimized. Then the distance from s to u∗ is $d_{u^*}$.
Before we prove the lemma, let’s restate the claim in slightly less notation-heavy English. (See Figure 11.64.) We have a set S of nodes (the shaded nodes in the picture) for which we know the distance from s. We examine all unshaded nodes u that are neighbors of shaded nodes v. For each shaded/unshaded pair, we’ve computed the sum of the distance d(s, v) and the edge weight $w_{v,u}$. And we’ve chosen the pair ⟨v∗, u∗⟩ that minimizes this quantity.
The lemma says that the shortest path from s to this particular u∗ must have length precisely equal to $d_{u^*} := d(s, v^*) + w_{v^*,u^*}$. The intuition matches the argument in Example 11.38: to get from s to u∗, we have to somehow “escape” the set of shaded nodes, and, by the way that we chose u∗, every “escape route” must have length at least $d_{u^*}$.
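The selection step that Lemma 11.9 describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `next_node` and the list-of-triples input are hypothetical, and the data below is limited to the four shaded-to-unshaded edges listed in the candidate table of Example 11.38.

```python
def next_node(known, crossing_edges):
    """One application of Lemma 11.9.

    known: dict mapping each node v in S to its known distance d(s, v).
    crossing_edges: list of (v, u, weight) with v in S and u not in S.
    Returns (u_star, d_u_star): the unshaded node minimizing d_u, and d_u itself.
    """
    d = {}
    for v, u, w in crossing_edges:
        candidate = known[v] + w            # d(s, v) + w_{v,u}
        if u not in d or candidate < d[u]:
            d[u] = candidate                # keep the minimum over all v in S
    u_star = min(d, key=d.get)
    return u_star, d[u_star]


# Known distances and crossing edges taken from the table in Example 11.38.
known = {"A": 0, "B": 1, "C": 3}
crossing = [("B", "F", 6), ("A", "E", 9), ("C", "E", 8), ("C", "D", 5)]
result = next_node(known, crossing)
```

With this input, `result` is `("F", 7)`, which agrees with the conclusion d(A, F) = 7 reached in the example.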
Figure 11.64: The intuition for Lemma 11.9.
Proof of Lemma 11.9. We must show that the distance from s to u∗ is $d_{u^*}$, and we’ll do it in two steps: by showing that the distance is no more than $d_{u^*}$, and by showing that the distance is no less than $d_{u^*}$.
The distance from s to u∗ is ≤ $d_{u^*}$. We must argue that there is a path of length $d(s, v^*) + w_{v^*,u^*}$ from s to u∗. By assumption and the fact that v∗ ∈ S, we know that d(s, v∗) is the distance from s to v∗, so there must exist a path P of length d(s, v∗) from s to v∗. (It’s the curved line in Figure 11.64.) By tacking u∗ onto the end of P, we’ve constructed a path from s to u∗ via v∗ with length $d(s, v^*) + w_{v^*,u^*}$.
The distance from s to u∗ is ≥ $d_{u^*}$. Consider an arbitrary path P from s to u∗. We must show that P has length at least $d(s, v^*) + w_{v^*,u^*}$.
What does P look like? The node s is in the set S, so P starts out at s ∈ S, then wanders around for a while inside S, then crosses outside of S for the first time, wanders around outside S for a while, and eventually ends up at u∗ ∉ S. Nothing prevents P from re-entering (and later re-exiting) S after its first departure (indeed, it can go in and out of S several times), but it definitely has to leave S at least once. Thus P has to look like the following:
[Figure: three panels showing (a) the entire path P, (b) the portion of P up to the first exit from S, and (c) the portion of P after the first exit from S.]
Therefore we know that the length of P
= (the length of P up to the first exit) + (the length of P after the first exit)
≥ (the length of the shortest path exiting S) + (the length of P after the first exit)
      [P up to the first exit is a path exiting S, so its length is at least the length of the shortest such path]
≥ $d(s, v^*) + w_{v^*,u^*}$ + (the length of P after the first exit)
      [we chose u∗ and v∗ so that $d(s, v^*) + w_{v^*,u^*}$ is exactly the length of the shortest path exiting S]
≥ $d(s, v^*) + w_{v^*,u^*} + 0$
      [all edge weights are nonnegative, so all path lengths are ≥ 0 too]
= $d_{u^*}$.
      [definition of $d_{u^*}$]
Thus the length of P is at least $d_{u^*}$.
We’ve therefore argued that the distance from s to u∗ is both ≤ $d_{u^*}$ and ≥ $d_{u^*}$. Thus the distance is precisely $d_{u^*}$, and the lemma follows.
Dijkstra’s Algorithm
With Lemma 11.9 proven, we can now put together the pieces of the entire algorithm. The lemma describes a way to take a set S of nodes with known distance from the source node s, and correctly calculate the distance from s to a new node u ∉ S.
Problem-solving tip: When we want to prove that x = y, it’s sometimes easier to prove x ≥ y and x ≤ y separately.
In Dijkstra’s algorithm, the idea is to apply the calculation from Lemma 11.9 repeatedly to find all distances from the given source node s. We’ll need a base case to get started, but that’s straightforward: we start with the set of nodes with known distance from s as S = {s}, where the distance from s to s is zero. The full algorithm is shown in Figure 11.65.
Dijkstra’s Algorithm:
Input: a weighted graph G = ⟨V, E⟩, nonnegative edge weights $w_e \ge 0$, and a source node s ∈ V.
Output: the distance from s to every node in G
1: Let S := {s} and let d(s, s) := 0.    // S is the set of nodes with known distances.
2: while there exists a node in S with a neighbor not in S:
3:     for every node u ∉ S, define $d_u := \min_{v \in S \,:\, u \text{ is a neighbor of } v} \bigl( d(s, v) + w_{v,u} \bigr)$.
4:     u∗ := the node with the smallest $d_u$.
5:     Add u∗ to S and set d(s, u∗) := $d_{u^*}$.
6: for every node u ∈ V − S:
7:     d(s, u) := ∞
8: return the recorded values d(s, u).
Figure 11.65: The pseudocode for Dijkstra’s algorithm.
Before we prove the algorithm’s correctness, let’s run through an example:
Example 11.39 (Dijkstra’s algorithm in action)
Let’s run Dijkstra’s algorithm on the network from Example 11.37, with the graph rotated for compactness. We’ll start from the Orlando (OR) node. Here is the execution:
[Figure: six panels of the road network (nodes DB, JA, LA, LC, OR, TA), one per step, marking the nodes with known distances from OR.]
• Initially, only OR has a known distance: d(OR, OR) = 0. A “candidate” node for the next iteration has unknown distance, but has a neighbor with known distance.
• Of the candidate nodes, DB has the smallest value as per Lemma 11.9. So its distance can now be recorded: d(OR, DB) = 55.
• Next, TA is recorded: d(OR, TA) = 85.
• Next, JA is recorded: d(OR, JA) = 145.
• Next, LC is recorded: d(OR, LC) = 205.
• Finally, LA is recorded: d(OR, LA) = 2555.
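As a cross-check on the execution above, here is a minimal Python sketch that follows the structure of Figure 11.65 directly: in each iteration it scans every edge leaving the known set and records the node with the smallest candidate distance. The function name `dijkstra` and the nested-dictionary graph representation are assumptions of this sketch, and no attempt is made to use the faster data structures that an algorithms text would cover.

```python
import math


def dijkstra(adj, s):
    """Distances from s, following Figure 11.65 step by step.

    adj: dict mapping each node to {neighbor: nonnegative edge weight}.
    Returns a dict of distances; unreachable nodes get math.inf.
    """
    dist = {s: 0}                     # the keys of dist play the role of S
    while True:
        best_u, best_d = None, math.inf
        for v, d_v in dist.items():   # every node v in S ...
            for u, w in adj[v].items():
                # ... and every neighbor u not in S: candidate d(s, v) + w_{v,u}
                if u not in dist and d_v + w < best_d:
                    best_u, best_d = u, d_v + w
        if best_u is None:            # no node in S has a neighbor outside S
            break
        dist[best_u] = best_d         # add u* to S and record d(s, u*)
    for u in adj:                     # remaining nodes are unreachable
        dist.setdefault(u, math.inf)
    return dist


# The road network of Example 11.37, using the abbreviations of Example 11.39.
edges = [("LA", "LC", 2350), ("LC", "TA", 180), ("LC", "JA", 60),
         ("JA", "DB", 90), ("DB", "OR", 55), ("OR", "TA", 85)]
adj = {}
for a, b, w in edges:
    adj.setdefault(a, {})[b] = w
    adj.setdefault(b, {})[a] = w

distances = dijkstra(adj, "OR")
```

Because Python dictionaries preserve insertion order, `list(distances)` also shows the order in which nodes were recorded: OR, DB, TA, JA, LC, LA.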

The correctness of Dijkstra’s Algorithm
Theorem 11.10 (Correctness of Dijkstra’s Algorithm)
Let G = ⟨V, E⟩ be a graph with nonnegative edge weights $w_e$ for each edge. Let s ∈ V be a source node, and let d(s, •) := Dijkstra(G, w, s) be the values computed by Dijkstra’s Algorithm. Then, for every node u, we have that d(s, u) is the length of the shortest path from s to u in G under w.
We’ll now prove the correctness of the algorithm, using Lemma 11.9 and induction:
Proof. Looking at the algorithm, we see that Dijkstra’s Algorithm records finite distances from s in Line 1 (for s itself) and Line 5 (for other nodes reachable from s). Suppose that Dijkstra’s algorithm executes n iterations of the loop in Line 2, thus recording n + 1 total distances in Lines 1 and 5, say in the order $u_0, u_1, \ldots, u_n$. Let P(i) denote the claim that $d(s, u_i)$ is the length of the shortest s-to-$u_i$ path. We claim by strong induction on i that P(i) holds for all i ∈ {0, 1, …, n}.
Base case (i = 0): We must prove that $d(s, u_0)$ is recorded correctly. The 0th node $u_0$ is recorded in Line 1, so $u_0$ is the source node s itself. And the shortest path from s to s in any graph with nonnegative edge weights is the 0-hop path ⟨s⟩, of length 0.
Inductive case (i ≥ 1): We assume the inductive hypothesis P(0), P(1), …, P(i − 1): that is, all recorded distances $d(s, u_0), d(s, u_1), \ldots, d(s, u_{i-1})$ are correct. We must prove P(i): that is, that the recorded distance $d(s, u_i)$ is correct. But this follows immediately from Lemma 11.9: the algorithm chooses $u_i$ as the u ∉ S minimizing
$$d_u := \min_{v \in S \,:\, u \text{ is a neighbor of } v} \bigl( d(s, v) + w_{v,u} \bigr),$$
where $S = \{u_0, u_1, \ldots, u_{i-1}\}$. Lemma 11.9 states precisely that this value $d_u$ is the length of the shortest path from s to u.
Finally, observe that any node u that’s only discovered in Line 6 is not reachable from s, and so indeed d(s, u) = ∞. (A fully detailed argument that the ∞ values are correct can follow the structure in Theorem 11.3, which proved the correctness of BFS.)
Taking it further: Dijkstra’s algorithm as written in Figure 11.65 can be straightforwardly implemented to run in O(|V| · |E|) time: each iteration of the while loop (Line 2) can look at each edge to compute the smallest $d_u$. But with cleverer data structures, Dijkstra’s algorithm can be made to run in O(|E| log |V|) time. This improved running-time analysis is a standard topic in a course on algorithms, as are other shortest-path algorithms: for example, handling the case in which edge weights can be negative (it’s worth thinking about where the proof of Lemma 11.9 fails if an edge e can have $w_e < 0$), or computing distances between all pairs of nodes instead of just every distance from a single source. Any good algorithms text should cover these algorithms and their analysis.
Before we leave Dijkstra’s algorithm, it’s worth reflecting on its similarities with BFS. In both cases, we start from a seed set S of nodes for which we know the distance from s, namely S = {s}. Then we build up the set of nodes for which we know the distance from s by finding the unknown nodes that are closest to s, and adding them to S. Of course, BFS is conceptually simpler, but Dijkstra’s algorithm solves a more complicated problem. It’s a worthwhile exercise to think about what happens if Dijkstra’s algorithm is run on an unweighted graph. (How does it relate to BFS?)
11.5.2 Spanning Trees in Weighted Graphs: Minimum Spanning Trees
Recall from Definition 11.31 that a spanning tree of a connected graph G = ⟨V, E⟩ is a tree T = ⟨V, E′⟩ where E′ ⊆ E. As with shortest paths, in many of the applications in which spanning trees are interesting, we actually want to find a spanning tree whose edges have minimum possible total cost. For example, when a college wants to put down networking cable in a new dorm building, they wish to ensure that the resulting network is connected, while minimizing the cost of construction. Formally, in a weighted graph, the cost of a spanning tree T is the sum of the weights of its edges: $\sum_{e \in E'} w_e$.
A minimum spanning tree (MST) is a spanning tree whose cost is as small as possible. Here are two small examples:
Example 11.40 (Some minimum spanning trees)
Consider the following two graphs (the road network from Example 11.37 and the larger connected component from Example 11.38):
[Figure: the two weighted graphs.]
Here are the minimum spanning trees. (For the first, every spanning tree omits exactly one edge from the lone cycle; the cheapest tree omits the most expensive edge.)
[Figure: the minimum spanning trees of the two graphs.]
As with shortest paths in weighted graphs, the question of how to find a minimum spanning tree most efficiently is more appropriate to an algorithms text than this book. But, between the Cycle Elimination Algorithm (Figure 11.52) and Example 11.40, we’ve already done almost all the work to develop a first algorithm. Assume throughout that all edge weights are distinct. (This assumption lets us refer to “the most expensive edge” in a set of edges. Removing this assumption complicates the language that we have to use, but it doesn’t fundamentally change anything about the MST problem or its solution.)
Lemma 11.11 (The “cycle rule”)
Let C be a cycle in a connected undirected graph G = ⟨V, E⟩, and let e be the heaviest edge in C. Then e is not in any minimum spanning tree of G.
Proof. Consider a spanning tree T of G, and suppose that e = {u, v} is included in T. We’ll show that T is not a minimum spanning tree. (Thus the only minimum spanning trees of G do not include e.)
By definition, the spanning tree T is connected. If we delete {u, v} from T, the resulting graph will have two connected components, one containing u and the other containing v. (This fact follows by Corollary 11.7.) Call those connected components U and V, respectively. See Figure 11.66(a).
Imagine following the cycle C from u to v the “long way” around C. This part of C starts at u, wanders around U for a while, and eventually crosses over into V, before finally arriving at v. Let a ∈ U be the last node in U and b the first node in V as we go around C. (Note that C might go back and forth between U and V multiple times, but define a and b based on the first time C leaves U.) See Figure 11.66(b).
Now define the graph T′ as T with the edge {u, v} removed and with the edge {a, b} inserted instead. Crucially, T′ is a spanning tree of G, because we’ve only swapped which edge connects the connected sets U and V; thus T′ remains connected and acyclic.
Now observe that the cost of T′ is less than the cost of T, because the edge {u, v} is heavier than the edge {a, b}. (Both {u, v} and {a, b} are in the cycle C, and by assumption {u, v} is the heaviest edge in C.) Therefore T′ is a cheaper spanning tree than T, and thus T isn’t a minimum spanning tree.
Figure 11.66: The cycle rule for MSTs. (a) Removing the edge {u, v} splits the tree into two connected components: U (the nodes on the u side of the tree) and V (the nodes on the v side of the tree). (b) C is a cycle with {u, v} as its heaviest edge. Some other edge {a, b} from the cycle has a ∈ U and b ∈ V.
Finding MSTs by removing cycles
Lemma 11.11 immediately suggests that we can find minimum spanning trees using a modification of the Cycle Elimination Algorithm:
Weighted Cycle Elimination Algorithm
Input: a weighted connected graph G = ⟨V, E⟩ with edge weights $w_e$
Output: a minimum spanning tree of G
while there exists a cycle C in G:
    let e be the heaviest edge traversed by C
    remove e from E
return the resulting graph ⟨V, E⟩.
While the Weighted Cycle Elimination Algorithm is correct and reasonably efficient, there are more efficient algorithms based on Lemma 11.11. One such algorithm is called Kruskal’s Algorithm, named after its discoverer Joseph Kruskal. The key idea of Kruskal’s Algorithm is that by sorting the edges in increasing order, we can be more efficient: we add edges in increasing order of their weight, as long as doing so doesn’t create a cycle.
Joseph Kruskal was a 20th-century American computer scientist/mathematician/statistician. He published his MST algorithm in 1956.
Kruskal’s Algorithm
Input: a weighted connected graph G = ⟨V, E⟩ with distinct edge weights $w_e$
Output: a minimum spanning tree of G
1: Sort the edges e in increasing order of weight.
2: S := ∅
3: for each edge e (taken in increasing order of weight):
4:     if the graph ⟨V, S ∪ {e}⟩ doesn’t contain a cycle then
5:         add e to S
6: return the resulting graph ⟨V, S⟩
Figure 11.67: Kruskal’s Algorithm.
The insight of Kruskal’s Algorithm is that, by considering edges in increasing order of weight, if including an edge e creates a cycle, then we know that e must be the heaviest edge in that cycle. See Figure 11.67. Kruskal’s algorithm is reasonably efficient: the sorting step takes O(|E| log |E|) time, and each of the |E| iterations of the for loop can be implemented using one call to BFS to test for a cycle. (And, in fact, there are some cleverer ways to implement Line 4 so that the entire algorithm runs in O(|E| log |E|) time.) Here’s an example:
Example 11.41 (Sample run of Kruskal’s algorithm)
In each panel, the highlighted edge is being considered for inclusion in the tree. Black edges have already been included; light edges have not yet been considered.
[Figure: nine panels showing the same weighted graph on nodes A, B, C, D, and E, with edge weights 1 through 7.]
• The original graph.
• We examine the cheapest edge {A, C}. It doesn’t create a cycle, so we keep it.
• We examine the next cheapest edge {B, C}. It doesn’t create a cycle, so we keep it.
• We examine the next cheapest edge {C, D}. It doesn’t create a cycle, so we keep it.
• We examine the next cheapest edge {A, B}. It creates a cycle ⟨A, B, C, A⟩, so we discard it.
• The next edge is {D, E}; we keep it.
• The next edge is {B, D}; it creates a cycle, so we discard it.
• The last edge is {C, E}; it creates a cycle, so we discard it too.
• The final spanning tree.
Here is the general statement of correctness for both algorithms:
Theorem 11.12 (Correctness of minimum spanning tree algorithms)
The Weighted Cycle Elimination Algorithm and Kruskal’s Algorithm both return a minimum spanning tree for any weighted connected undirected graph.
Proof. The correctness of the Weighted Cycle Elimination Algorithm follows immediately from Lemma 11.11 (the cycle rule) and from Theorem 11.8 (the correctness of the Cycle Elimination Algorithm): the heaviest edge in any cycle does not appear in any MST, and we terminate with a spanning tree when we repeatedly eliminate any edge from an arbitrarily chosen cycle.
For Kruskal’s algorithm, consider an edge e that is not retained; that is, when e is considered, it is not included in the set S. The only reason that e wasn’t included is that adding it would create a cycle C involving e and previously included edges. But because the edges are considered in increasing order of weight, that means that e is the heaviest edge in C. Thus by Lemma 11.11, Kruskal’s algorithm removes only edges not contained in any minimum spanning tree. Because it only excludes edges that create cycles, the resulting graph is also connected, and thus a minimum spanning tree.
Taking it further: There are several other commonly used algorithms for minimum spanning trees, using different structural properties than the Cycle Rule. For much more on these other algorithms, and for the clever data structures that allow Kruskal’s Algorithm to be implemented in O(|E| log |E|) time, see any good textbook on algorithms.
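Kruskal’s Algorithm can be sketched in Python as follows. Two assumptions should be flagged. First, the cycle test uses a simple union-find structure rather than the BFS-based test mentioned in the text; this is one of the standard “cleverer” approaches, swapped in here for brevity. Second, the edge weights 1 through 7 for the graph of Example 11.41 are reconstructed from the order in which the example considers the edges, so they should be read as illustrative. The name `kruskal` is likewise hypothetical.

```python
def kruskal(nodes, edges):
    """Return the MST edges, assuming a connected graph with distinct weights.

    edges: list of (weight, u, v) tuples for an undirected graph.
    """
    parent = {x: x for x in nodes}    # union-find forest: each node starts alone

    def find(x):
        """Representative of the component that currently contains x."""
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):     # consider edges in increasing weight order
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # different components: no cycle is created
            parent[root_u] = root_v   # merge the two components
            tree.append((w, u, v))
        # otherwise u and v are already connected, so the edge is discarded
    return tree


# Graph of Example 11.41; weights assumed from the order of consideration.
example_edges = [(1, "A", "C"), (2, "B", "C"), (3, "C", "D"), (4, "A", "B"),
                 (5, "D", "E"), (6, "B", "D"), (7, "C", "E")]
mst = kruskal("ABCDE", example_edges)
total_cost = sum(w for w, _, _ in mst)
```

The returned tree keeps {A, C}, {B, C}, {C, D}, and {D, E}, and discards {A, B}, {B, D}, and {C, E}, in agreement with the sample run.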
Computer Science Connections
Random Walks and Ranking Web Pages
When Google launched as a web search engine, one of its major innovations over its competition was in how it ranked the pages returned in response to a user’s query. Here are two key ideas in Google’s ranking system, called PageRank (named after Larry Page, one of Google’s founders):
• view a link from page u to page v as an implicit “endorsement” of v by u.
• not all endorsements are equal: if a page u is endorsed by many other pages, then being endorsed by u is a bigger deal.
These points can be restated more glibly as: a page is important if it is pointed to by many important pages. The idea of PageRank is to break this apparent circularity using the Random Surfer Model. Imagine a hypothetical web user who starts at a random web page, and, at every time step, clicks on a randomly chosen link from the page she’s currently visiting. The more frequently that this hypothetical user visits page u, the more important we’ll say u is.
The Random Surfer explores the web using what’s called a random walk on the web graph. In its simplest form, a random walk on a directed graph G = ⟨V, E⟩ visits a sequence $u_0, u_1, u_2, \ldots$ of nodes in G as follows:
1. choose a node $u_0 \in V$, uniformly at random.
2. in step t = 1, 2, ..., the next node $u_t$ is chosen by picking a node uniformly at random from the out-neighborhood of the previous node $u_{t-1}$.
(See Figure 11.68(a) for an example.) As you’ll explore in Exercises 11.204–11.208, under mild assumptions about G, there’s a special probability distribution p over the nodes of the graph, called the stationary distribution of the graph, that has the following property: if we choose an initial node u with probability p(u), and we then take one step of the random walk from u, the resulting probability distribution over the nodes is still p. And, it turns out, we can approximate p by the probability distribution observed simply by running the random walk for many steps, as in Figure 11.68(b).
We’ll use p as our measure of importance. We’ve already made a lot of progress toward the stated goals: p(u) is higher the more in-neighbors u has, but p(u) will be increased even more when the in-neighbors of u have a high probability themselves. In Figure 11.68(c), for example, we see that p(D) > p(B) and p(D) > p(C), despite B and C having higher in-degree than D.
But there are a few complications that we still have to address to get to the fullPageRankmodel.13 OneisthattheRandomSurferhasnowheretogoif she ends up at a page u that has no out-neighbors. (The random walk’s next step isn’t even defined.) In this case, we’ll have the Random Surfer jump to
Figure 11.68: A random walk.
You can find more about the Random Surfer model and PageRank (including interesting questions about how to calculate it on a graph with nodes numbering in the billions) in a good textbook on data mining, like
13 Jure Leskovec, Anand Rajaraman, andJeffUllman. MiningofMassive Datasets. Cambridge University Press, 2nd edition, 2014.
There are also many other ingredients in Google’s ranking recipe beyond PageRank, though PageRank was an early and important one.
0.5
A
0.5
0.33 0.33
0.33
D 1.0 E
a completely random page (each of the |V| nodes is chosen with probability 1 ).Second,thismodelallowstheRandomSurfertogetstuckina“dead
|V|
end” if there’s a group of nodes that has no edges leaving it. Thus—and this
1.0
(a) A sample 5-node graph. Edges are annotated with their probabilities in a
random walk; we can view the resulting weighted graph as defining the process.
node steps
A 166,653
B 166,652
C 166,155
D 250,270
E 250,271
(b) The number of steps spent at each node in a computer-generated 1,000,000-step random walk starting at A. This particular walk began ABABABABABABACEDCEDEDBABAC.
1 1.0 6
C
0.33 .33
4 1.0 4
0.33
1.0
0.5 111
6
0.5
1 6
(c) The stationary distribution for G. ABCDE
A B C D E
(d) The updated link probabilities, with random restarts.
0.03 0.88 0.03 0.03 0.03
0.45 0.03 0.03 0.31 0.03
0.45 0.03 0.03 0.31 0.03
0.03 0.03 0.03 0.03 0.88
0.03 0.03 0.88 0.31 0.03
change probably makes the Random Surfer more realistic anyway—we’ll add a restart probability of 15% to every stage of the random walk: with probability 85%, we behave as previously described; with probability 15%, we jump to a randomly chosen node. (See Figure 11.68(d) for the updated probabilities.)
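The Random Surfer model described here can be explored numerically. The sketch below is an illustration under stated assumptions, not Google’s method: it repeatedly applies one step of the walk to a probability distribution (often called power iteration), the names `walk_step` and `stationary` are hypothetical, and the edge list is the one read off the sample graph of Figure 11.68 (A to B and C; B to A; C to E; D to B, C, and E; E to D).

```python
def walk_step(dist, out_links, restart=0.0):
    """Advance a probability distribution over nodes by one random-walk step."""
    nodes = list(out_links)
    new_dist = {v: 0.0 for v in nodes}
    for u in nodes:
        # A page with no out-links sends the surfer to a uniformly random page.
        targets = out_links[u] or nodes
        # With probability 1 - restart, follow a uniformly chosen out-link.
        for v in targets:
            new_dist[v] += dist[u] * (1 - restart) / len(targets)
        # With probability restart, jump to a uniformly chosen node.
        for v in nodes:
            new_dist[v] += dist[u] * restart / len(nodes)
    return new_dist


def stationary(out_links, restart=0.0, iterations=200):
    """Approximate the stationary distribution by iterating walk_step."""
    dist = {v: 1 / len(out_links) for v in out_links}
    for _ in range(iterations):
        dist = walk_step(dist, out_links, restart)
    return dist


# Out-neighbors in the sample graph, as inferred from Figure 11.68.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["E"], "D": ["B", "C", "E"], "E": ["D"]}
plain = stationary(graph)                 # no restarts
ranked = stationary(graph, restart=0.15)  # 15% restart probability
print({v: round(p, 3) for v, p in plain.items()})
```

Without restarts the result approaches the distribution 1/6, 1/6, 1/6, 1/4, 1/4; with restarts the values shift somewhat, but D still outranks B and C.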

11.5.3 Exercises
For the following graphs, find all shortest paths between the given nodes. Give both the path length and the path itself.
11.183 From A to E:
11.184 From A to E:
11.185 From A to E:
11.186 From A to H:
11.187 From A to H:
[Figures: the weighted graphs for Exercises 11.183–11.187.]
11.188 Let n be arbitrary. Give an example of an n-node weighted graph G = ⟨V, E⟩ with designated nodes s ∈ V and t ∈ V in which both of the following conditions hold:
(i) all edge weights are distinct (for any e ∈ E and e′ ∈ E, we have w(e) ≠ w(e′) if e ≠ e′), and (ii) for some α > 1 and c > 0, there are at least c · α^n different shortest paths between s and t.
Suppose that we are running Dijkstra’s Algorithm on the graph shown in Figure 11.69 to compute distances from the node A. So far Dijkstra’s Algorithm has computed four distances:
d(A, A) = 0 d(A, B) = 1 d(A, C) = 3 d(A, F) = 7
If we continue Dijkstra’s algorithm for further iterations, it records the distance for a new node in each iteration.
11.189 What is the next node recorded, and what is its distance?
11.190 What is the next node (after the one from Exercise 11.189) for which Dijkstra’s algorithm records a distance, and what is its distance? List all subsequently discovered nodes, and their distances.
11.191 Trace Dijkstra’s algorithm on the graph shown in Figure 11.69 to compute distances from the node H. List all discovered nodes and their distances, in the order in which they’re discovered.
11.192 Identify exactly where in the proof of correctness for Dijkstra’s algorithm (specifically, in the proof of Lemma 11.9) the argument fails if edge weights can be negative. Then give an example of a graph with negative edge weights in which Dijkstra’s algorithm fails.
Suppose that G = ⟨V, E⟩ is a weighted, directed graph in which nodes represent physical states of a system, and an edge ⟨u, v⟩ indicates that one can move from state u to state v. The weight w⟨u,v⟩ of edge ⟨u, v⟩ denotes the multiplicative cost of the exchange: one can trade w⟨u,v⟩ units of u for 1 unit of v. For example, if there’s an edge ⟨A, B⟩ with weight 1.04, then one can trade 2.08 units of energy in state A for 2 units of energy in state B.
Suppose that we wish to find a shortest multiplicative path (SMP) from a given node s to a given node t in G, where the cost of the path is the product of the edge weights along it. For example, in Figure 11.70, the SMP from A to D is A → B → C → D at cost 1.1 · 1.5 · 1.4 = 2.31, which is better than A → B → D at cost 1.1 · 2.5 = 2.75.
11.193 Describe how to modify Dijkstra’s algorithm to find an SMP in a given weighted graph G. Alternatively, describe how to modify a given weighted graph G into a graph G′ so that Dijkstra’s algorithm run on G′ finds an SMP in G.
11.194 As you argued in Exercise 11.192, Dijkstra’s algorithm may fail if edge weights are negative. State the condition that guarantees that your algorithm from Exercise 11.193 properly computes SMPs.
Figure 11.69: A weighted graph.
Figure 11.70: A weighted graph.

List all minimum spanning trees of the following graphs. (Note that some have edges with nondistinct weights.)
11.195
11.196
11.197
11.198
11.199
[Figures: the weighted graphs for Exercises 11.195–11.199.]

Consider the undirected 9-node complete graph K9. There are (9 choose 2) = (9 · 8)/2 = 36 unordered pairs of nodes in this graph, so there are 36 different edges in the graph. Suppose that you’re asked to assign each of these 36 edges a distinct weight from the set {1, 2, . . . , 36}. (You get to choose which edges have which weights.)
11.200 What’s the cheapest possible minimum spanning tree of K9?
11.201 What’s the most expensive edge that can appear in a minimum spanning tree of K9?
11.202 What’s the costliest possible minimum spanning tree of K9?
11.203 Generalize Exercises 11.200 and 11.202: what are the cheapest and most expensive possible MSTs for the graph Kn if all edges have distinct weights chosen from {1, 2, . . . , (n choose 2)}? (Hint: see Exercise 9.173.)
Recall from p. 1174 that a random walk in a graph G = ⟨V, E⟩ proceeds as follows: we start at a node u0 ∈ V, and, at every time step, we select as the next node ui+1 a uniformly chosen (out-)neighbor of ui. Suppose we choose an initial node u0 according to a probability distribution p, and we then take one step of the random walk from u0 to get a new node u1. The probability distribution p is a stationary distribution if it satisfies the following condition: for every node s ∈ V, we have that Pr [u0 = s] = Pr [u1 = s] = p(s). Such a distribution is called “stationary” because, if p is the probability distribution before a step of the walk, then p is still the probability distribution after a step of the walk (and thus the distribution “hasn’t moved”; that is, it is stationary).

Figure 11.71: Some undirected graphs upon which a random walk can be performed. (a) A graph on nodes A, B, C. (b) A graph on nodes D, E, F, G, H, I. (c) A graph on nodes J, K, L, M.

11.204 Argue that p(A) = p(B) = p(C) = 1/3 is a stationary distribution for the graph in Figure 11.71(a).
11.205 Argue that the graph in Figure 11.71(b) has at least two distinct stationary distributions.

Suppose that we start a random walk at node A in the graph in Figure 11.71(a). The following chart shows the probability of being at any particular node after each step of the random walk:

    step    0    1     2     3     4      5       6
    A       1    0     2/4   2/8   6/16   10/32   22/64
    B       0    1/2   1/4   3/8   5/16   11/32   21/64
    C       0    1/2   1/4   3/8   5/16   11/32   21/64

Let pk(u) denote the probability that this random walk is at node u after its kth step. Although we’ll skip the proof, the following theorem turns out to be true of random walks on undirected graphs G:

    If G is connected and nonbipartite, then a unique stationary distribution p exists for this random walk on G (regardless of which node we choose as the initial node for the walk). Furthermore, the stationary distribution is the limit of the probability distributions pk of where the random walk is in the kth step.

11.206 (programming required) Write a random-walk simulator: take an undirected graph G as input, and simulate 2000 steps of a random walk starting at an arbitrary node. Repeat 2000 times, and report the fraction of walks that are at each node. What are your results on the graph from Figure 11.71(a)?
11.207 Argue that the above process doesn’t converge to a unique stationary distribution in a bipartite graph. (For example, what’s p1000 if a random walk starts at node J in the graph in Figure 11.71(c)? Node K?)
11.208 Let G = ⟨V, E⟩ be an arbitrary connected undirected graph. For any u ∈ V, define
    p(u) := degree(u) / (2 · |E|).
Prove that the probability distribution p is a stationary distribution for the random walk on G.

11.6 Chapter at a Glance

Formal Introduction
A graph is a pair G = ⟨V, E⟩ where V is a set of vertices or nodes, and E is a set of edges. In a directed graph, the edges E ⊆ V × V are ordered pairs of vertices; in an undirected graph, the edges E ⊆ {{u, v} : u, v ∈ V} are unordered pairs. A directed edge ⟨u, v⟩ goes from u to v; an undirected edge {u, v} goes between u and v. We sometimes write ⟨u, v⟩ even for an undirected graph. A simple graph has no parallel edges joining the same two nodes and also has no self loops joining a node to itself.
For an edge e = ⟨u, v⟩, we say that u and v are adjacent;
v is a neighbor of u; u and v are the endpoints of e; and u
and v are both incident to e. The neighborhood of a node u
is {v : ⟨u, v⟩ ∈ E}, its set of neighbors. The degree of u is
the cardinality of u’s neighborhood. In a directed graph,
the in-neighbors of u are the nodes that have an edge
pointing to u; the out-neighbors are the nodes to which u
has an edge pointing; and the in-degree and out-degree of u are the number of in- and out-neighbors, respectively.
An adjacency list stores a graph using an array with |V| entries; the slot for node u is a linked list of u’s neighbors. An adjacency matrix stores the graph using a two-dimensional Boolean array of size |V| × |V|; the value in ⟨row u, column v⟩ indicates whether the edge ⟨u, v⟩ exists.
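The two representations can be compared side by side in a few lines of Python. This is a small sketch under stated assumptions: Python lists stand in for the linked lists and arrays described above, the function names are hypothetical, and the four-node undirected graph is made up for illustration.

```python
def adjacency_list(nodes, edges):
    """Map each node to the list of its neighbors (undirected graph)."""
    neighbors = {u: [] for u in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return neighbors


def adjacency_matrix(nodes, edges):
    """Build a |V| x |V| Boolean table; entry [i][j] says whether an edge exists."""
    index = {u: i for i, u in enumerate(nodes)}
    matrix = [[False] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = True
        matrix[index[v]][index[u]] = True
    return matrix


# Hypothetical example graph.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("C", "D")]
as_list = adjacency_list(nodes, edges)
as_matrix = adjacency_matrix(nodes, edges)

# Each edge contributes 1 to the degree of both endpoints.
degree_sum = sum(len(ns) for ns in as_list.values())
print(as_list["A"], as_matrix[0][1], degree_sum)  # ['B', 'C'] True 6
```

The list uses space proportional to |V| + |E|, while the matrix always uses |V|² entries but answers "is there an edge between u and v?" with a single lookup.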
Two graphs are isomorphic if they are identical except for the naming of the nodes. A subgraph of G contains a subset V′ of G’s nodes and a subset E′ of G’s edges joining elements of V′. An induced subgraph is a subgraph in which every edge that joins elements of V′ is included in E′. A complete graph or clique is a graph Kn in which every possible edge exists. A bipartite graph is one in which nodes can be partitioned into sets L and R such that every edge joins a node in L to a node in R. A regular graph is one in which every node has identical degree. A planar graph is one that can be drawn on paper without any edges crossing.
Paths, Connectivity, and Distances
A path is a sequence of k ≥ 1 nodes ⟨v1, v2, …, vk⟩, where ⟨vi, vi+1⟩ ∈ E for every index i ∈ {1, 2, …, k − 1}. The path is simple if all the vi’s are distinct. This path has length k − 1 (the number of edges that it traverses) and is a path from v1 to vk.
In an undirected graph, nodes u and v are connected if there exists a path from u to v. A connected component of G = ⟨V, E⟩ is a set S ⊆ V such that (i) every u ∈ S and v ∈ S are connected; and (ii) for every w ∉ S, the set S ∪ {w} does not satisfy condition (i). The entire graph is connected if it has only one connected component, namely V.
In a directed graph, node u is reachable from node v if there exists a path from v to u; u and v are strongly connected if each is reachable from the other. A strongly connected component is a set S of nodes such that any two nodes in S are strongly connected and no node x ∉ S is strongly connected to any node s ∈ S.
Connectivity can be tested in Θ(|V| + |E|) time using breadth-first search (BFS; see Figure 11.72) or depth-first search (DFS). The distance from node s to node t is the length of a shortest path from s to t. BFS can also be used to compute distances.
Trees
A cycle ⟨v1,v2,…,vk,v1⟩ is a path of length ≥2 from a node v1 back to itself that does not traverse the same edge twice. The length of the cycle is k. The cycle is simple if each vi is distinct. Cycles can be identified using BFS.
A graph is acyclic if it contains no cycles. Every acyclic graph has a node of degree 0 or 1. A tree is a connected, acyclic graph. (A forest is any acyclic graph.) A tree has one more node than it has edges. A tree becomes disconnected if any edge is deleted; it becomes cyclic if any edge is added.
One node in a tree can be designated as the root. Every node other than the root has a parent (its neighbor that’s closer to the root). If p is v’s parent, then v is one of p’s children. Two nodes with the same parent are siblings. A leaf is a node with no children; an internal node is a node with children. The depth of a node is its distance from the root; the height of the entire tree is the depth of the deepest node. The descendants of u are those nodes whose paths to the root go through u; the ancestors of u are those nodes through which u’s path to the root goes. The subtree rooted at u is the induced subgraph consisting of u and all descendants of u.
All nodes in binary trees have at most two children, called left and right. A traversal of a binary tree visits every node of the tree. An in-order traversal recursively traverses the root’s left subtree, visits the root, and recursively traverses the root’s right subtree. A pre-order traversal visits the root and recursively traverses the root’s left and right subtrees; a post-order traversal recursively traverses the root’s left and right subtrees and then visits the root.
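The three traversal orders differ only in where the root is visited relative to its subtrees, which a short Python sketch makes concrete. The `Node` class, the function names, and the five-node example tree are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def in_order(node):
    """Left subtree, then the root, then the right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)


def pre_order(node):
    """The root first, then the left and right subtrees."""
    if node is None:
        return []
    return [node.value] + pre_order(node.left) + pre_order(node.right)


def post_order(node):
    """The left and right subtrees first, then the root."""
    if node is None:
        return []
    return post_order(node.left) + post_order(node.right) + [node.value]


# Hypothetical tree: B is the root, with children A and D; D has children C and E.
tree = Node("B", Node("A"), Node("D", Node("C"), Node("E")))
print(in_order(tree))    # ['A', 'B', 'C', 'D', 'E']
print(pre_order(tree))   # ['B', 'A', 'D', 'C', 'E']
print(post_order(tree))  # ['A', 'C', 'E', 'D', 'B']
```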
A spanning tree of a connected graph G = ⟨V, E⟩ is a graph T = ⟨V, E′ ⊆ E⟩ that’s a tree. A spanning tree can be found by repeatedly identifying a cycle in G and deleting any edge in that cycle.
Weighted Graphs
In a weighted graph, each edge e has a weight we ∈ R≥0. (Although graphs with negative edge weights are possible, we haven’t addressed them in any detail.) The length of a path in a weighted graph is the sum of the weights of the edges that it traverses. Shortest paths in weighted graphs can be found with Dijkstra’s Algorithm (Figure 11.65), which expands a set of nodes of known distance one by one. Minimum spanning trees (spanning trees of the smallest possible total weight) in weighted graphs can be found with Kruskal’s Algorithm (Figure 11.67) or by repeatedly identifying a cycle in G and deleting the heaviest edge in that cycle.
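The idea of expanding the set of nodes of known distance one by one can be sketched in Python. This is an illustration rather than the algorithm exactly as given in Figure 11.65: using a binary heap to pick the next-closest node is an implementation assumption, the function name `dijkstra` is hypothetical, and the example graph is made up. Edge weights are assumed to be nonnegative.

```python
import heapq


def dijkstra(graph, source):
    """Return shortest-path distances from source.

    graph maps each node to a dict {neighbor: nonnegative edge weight}.
    """
    distance = {}               # nodes whose distance is finalized
    frontier = [(0, source)]    # candidate (distance, node) pairs
    while frontier:
        d, u = heapq.heappop(frontier)
        if u in distance:
            continue            # u was already finalized via a shorter path
        distance[u] = d         # d is the smallest candidate, so it is final
        for v, weight in graph[u].items():
            if v not in distance:
                heapq.heappush(frontier, (d + weight, v))
    return distance


# Hypothetical weighted directed graph.
example = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 6},
    "C": {"D": 3},
    "D": {},
}
print(dijkstra(example, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 6}
```

Nodes are finalized in increasing order of distance (A, B, C, D here), which is the property the correctness proof relies on and the reason negative weights break the algorithm.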
Figure 11.72: Breadth-first search.

Breadth-First Search (BFS):
Input: a graph G = ⟨V, E⟩ and a source node s ∈ V
Output: the set of nodes reachable from s in G

Frontier := ⟨s⟩    // Frontier will be a list of nodes to process, in order.
Known := ∅         // Known will be the set of already-processed nodes.
while Frontier is nonempty:
    u := the first node in Frontier
    remove u from Frontier
    for every neighbor v of u:
        if v is in neither Frontier nor Known then
            add v to the end of Frontier
    add u to Known
return Known
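The BFS pseudocode above translates almost line for line into Python. The sketch below is one possible rendering, with two labeled departures: it records each node’s distance from the source (the chapter notes that BFS can compute distances), and a single dictionary plays the role of "in Frontier or in Known". The function name and the example graph are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return {node: distance from source} for every node reachable from source.

    graph maps each node to a list of its neighbors.
    """
    frontier = deque([source])   # nodes to process, in order
    distance = {source: 0}       # every node that is in Frontier or in Known
    while frontier:
        u = frontier.popleft()
        for v in graph[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                frontier.append(v)
    return distance


# Hypothetical undirected graph; E is isolated and so unreachable from A.
example = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
    "E": [],
}
print(bfs_distances(example, "A"))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

The keys of the returned dictionary are exactly the set Known that the pseudocode returns; each node and each edge is handled a constant number of times, matching the Θ(|V| + |E|) running time.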

Key Terms and Results

Key Terms

Formal Introduction
• undirected and directed graphs
• nodes/vertices, edges
• parallel edges, self loops
• simple graphs
• adjacent node, incident edge
• (in/out-)neighbors, neighborhood
• (in/out-)degree
• adjacency list, adjacency matrix
• isomorphic graphs
• subgraphs
• complete, bipartite, regular, planar graphs

Paths and Connectivity
• path
• connected (nodes), connected (graph)
• connected component
• reachability
• strongly connected component
• shortest path/distance
• breadth-first search (BFS)
• depth-first search (DFS)

Trees
• cycle
• tree, forest
• root, leaf, internal node, child, parent, sibling, ancestor, descendant, depth, height, subtree
• spanning tree

Weighted Graphs
• Dijkstra’s algorithm
• minimum spanning trees
• Kruskal’s algorithm

Key Results

Formal Introduction
1. The “handshaking lemma”: for any undirected graph G = ⟨V, E⟩, we have ∑u∈V degree(u) = 2|E|.
2. Representing G with an adjacency matrix requires Θ(|V|^2) space; we can answer “what are all of u’s neighbors?” in Θ(|V|) time and “is there an edge between u and v?” in Θ(1) time. Representing G = ⟨V, E⟩ with an adjacency list requires Θ(|V| + |E|) space; both questions take 1 + Θ(degree(u)) time.

Paths, Connectivity, and Distances
1. Connectivity can be tested using breadth-first search (BFS) (Figure 11.29) or depth-first search (DFS) (Figure 11.31). BFS can also be used to compute the distance between nodes in a graph, and it runs in Θ(|V| + |E|) time.

Trees
1. Any tree with n nodes has exactly n − 1 edges. Adding any edge to a tree creates a cycle; deleting any edge disconnects the graph.
2. A spanning tree of a graph G can be found by repeatedly identifying a cycle in G and deleting an arbitrary edge in that cycle.

Weighted Graphs
1. Shortest paths in weighted graphs can be found with Dijkstra’s Algorithm (Figure 11.65) if all edges have nonnegative weights.
2. Minimum spanning trees in weighted graphs can be found with Kruskal’s Algorithm (Figure 11.67) or by repeatedly identifying a cycle in G and deleting the heaviest edge in that cycle.

12 Index
2–3 and 2–3–4 trees, 545 9/11 Memorial, 1124 123456791, 752 987654263, 752
∀ (universal quantifier), 333 ff. absolute value, 205, 427, 429 abstract algebra, 736
adjacency, see graphs
Adleman, Leonard, 747
affirming the consequent, see fallacy algorithms, 265 ff., see also random-
ized algorithms
asymptotic analysis, 617 ff.
brute force, 326, 515, 902, 959 divide and conquer, 647 ff., 655 dynamic programming, 515, 902,
959
greedy algorithms, 422, 918 recurrence relations, 633 ff. time, space, and power, 626
Alice and Bob, 745 ff. ambiguity
in natural language, 308, 309, 314 order of operations, 543, 805 order of quantification, 351 ff., 360 prefix-free/Huffman codes, 918
analysis (mathematics), 836 antisymmetry, 820 ff. approximate equality, 205 Ariane 5 rocket, 464 arithmetic mean, 439, 456 arithmetic series, 512 Arrow’s Theorem, 823 artificial intelligence
computer vision, 1132
game trees, 344, 941
assertions, 360, 517
associativity, 321, 545, 736 assuming the antecedent, see proofs asymmetry, 820 ff.
asymptotics
analysis of algorithms, 617 ff. asymptotic analysis, 603 ff. asymptotic relationships viewed as
relations, 823 ff.
best- and average-case running
time, 623 ff.
master method, 648 ff.
O (Big O), 604 ff.
o, Ω, ω, and Θ, 608 ff. polynomials, logs, and exponen-
tials, 606 ff. recurrence relations, 633 ff. worst-case analysis, 618 ff.
automata, 846, 942
automated theorem proving, 424 average distance in a graph, 1145 average-case analysis, see running
time
AVL trees, 643 ff.
axiom of extensionality, 229
Bacon, Kevin, 438, 1117
balanced binary search trees, 643 Bayes’ Rule, 1033 ff.
begging the question, see fallacy Bernoulli distribution, 1013 ff., 1044,
1057 betweenness, 812
BFS, see breadth-first search biased coins, 1014 ff.
big O, big Ω, and big Θ, 604 ff., 823 ff. bigrams, 1036
bijections, 262, 928, 937
binary numbers, see integers
binary relation, see relations
Binary Search, see searching
binary search trees, see trees
binary symmetric channel, 1033, 1034 binary trees, see trees
binomial coefficients, see combinations binomial distribution, 1014 ff., 1049 Binomial Theorem, 954 ff.
bipartite graphs, 1118 ff.
complete bipartite graphs, 1119 birthday paradox, 526, 1052 bitmaps, 243
bits/bitstrings, 203, 240
Bletchley Park, 960
Bloom filters, 1039
Booleans, 203, 305, see also logic bound (vs. free) variables, 336 breadth-first search, 1136 ff.
finding cycles, 1149 brute force, see algorithms Bubble Sort, see sorting Buffon’s needle, 1062 bugs, 217, 464, 517, 1129
C (programming language), 327, 345, 534
Caesar Cipher, see cryptography cardinality, 222–223, 903 ff.
infinite, 937
Carmichael numbers, 741, 742, 744 Cartesian product (×), 237 catchphrase, 1165

Cauchy sequences, 836
ceiling, 206
cellular automata, 942
Chain Rule (probability), 1031 ff. checkers, 344, 437, 925 checksum, 403
chess, 237, 344, 518–519, 913, 924, 1135 Chinese Remainder Theorem, 725 ff. circle packing, 416
circuits
printing and planar graphs, 1121 representing logical propositions,
322, 329
using nand gates, 445
class-size paradox, 1045
cliques, 1117 ff.
closure, 736, 825 ff.
clustering, 234
coarsening equivalence relations,
836 ff.
codomain (of a function), 255 collaboration networks, 1117 collaborative filtering, 236 combinations, 945 ff.
k-combinations, 948 ff. Binomial Theorem, 954 ff. Pascal’s identity, 953, 957 Pascal’s Triangle, 957
combinatorial proof, 951 ff. commutativity, 246, 321, 352, 545, 736 comparability, see partial orders comparison-based sorting, see sorting compilers, 327, 543
complement (of a set), 226
complete graphs, 1117 ff.
complexity, see computational com-
plexity
composite numbers, see prime num-
bers composition
of functions, 258, 811
of relations, 807, 823 compression
entropy and compressibility, 1017 Huffman coding, 918 impossibility of lossless compres-
sion, 938
lossy vs. lossless, 938 quantization of images, 254, 268 URL shortening, 907
computability, 449
computational biology
genome rearrangements, 359, 942 motifs in gene networks, 1116
computational complexity and cryptography, 752 complexity classes, 628 graph isomorphism, 1115 input size, 706
P vs. NP, 326
regular languages, 830, 846 computational geometry, 251 computational linguistics, see natural
language processing computer architecture, 322 ff., 445
and running times, 618 Moore’s Law, 613
power consumption, 626 representation of numbers, 217
computer graphics hidden-surface removal, 847 morphing, 252
rotation matrices, 249 triangulation, 528
computer security, 752, 753 computer vision, 1132 computing networking, 919 conditional expectation, 1055 ff. conditional probability, 1027 ff.
Bayes’ Rule, 1033
Chain Rule, 1031
Law of Total Probability, 1032
Condorcet paradox, 823
congruences (modular), 707 ff., 726 ff.,
835
conjunctive normal form, 323 ff.,
441 ff., 540 ff.
connectivity (in graphs), 1130 ff.
connected component, 1131 ff.
reachability, 1133 ff.
constructive proofs, 432 constructivism, 433
context-free grammar, 543 contradiction, 318
contrapositive, 320, 428, see also proofs converse, 320
Cook–Levin Theorem, 326 correlation, 1021
correlation vs. causation, 463
positive and negative, 1024 countable sets, 937 counterexamples, 432 ff.
counting
Binomial Theorem, 954 ff. combinations, 945 ff. combinatorial proofs, 951 ff. combining products and sums,
915 ff.
Division Rule, 931 ff.
double counting, 909 ff. Generalized Product Rule, 913 ff. inclusion–exclusion, 909 ff.
for 3+ sets, 911
Mapping Rule, 927 ff.
order, 946 ff.
Pascal’s Triangle, 957 ff. permutations, 947 ff. Pigeonhole Principle, 935 ff. Product Rule (sequences), 906 repetition, 946 ff.
Sum Rule (unions), 903
Counting Sort, see sorting coupon collector problem, 1064 crossword puzzles, 358 cryptography, 745 ff.
and pseudorandomness, 1013 Caesar Cipher, 746, 1038 Diffie–Hellman key exchange, 753 digital signatures, 748
Enigma Machine and WWII, 960 frequency analysis, 1025, 1038 key exchange, 753 man-in-the-middle attack, 753 one-time pads, 745
public-key cryptography, 746 ff. RSA cryptosystem, 454, 747 ff. secret sharing, 730
substitution cipher, 1024, 1031, 1038
Currying, 357 cycles, 840, 1147 ff.
acyclic graphs, 1149 ff.
cycle elimination algorithm, 1158 cycle rule for minimum spanning
trees, 1170
kidney transplants, 1159
simple cycles, 1148
weighted cycle elimination algo-
rithm, 1170
DAG (directed acyclic graph), 1150
data mining, see machine learning data visualization, 1110 databases, 347, 815, 817

De Morgan’s Laws, 322
decision problems, 448
Deep Blue, 344
degree (in a graph), 1107 ff., 1109
degree distribution, 1123
regular graphs, 1119
degree (of a polynomial), 264 density (of a graph), 615, 1127 denying the hypothesis, see fallacy dependent events, 1021 ff. depth-first search, 1140 ff. Descartes, René, 239
deterministic finite automata, 846 DFS, see depth-first search diagonalization, 937
diameter, 1144
Diffie–Hellman key exchange, 753 Dijkstra’s algorithm, 1165 ff. directed graphs, 818 disconnected, see connectivity in
graphs
disjoint sets, 230, 416 disjunctive normal form, 323 ff.,
441 ff., 540 ff. distance, see also metrics
Euclidean, see Euclidean distance Hamming, see Hamming distance in a graph, 1135 ff.
Manhattan, see Manhattan distance minimum distance of a code, 407 ff.
divide and conquer, see algorithms divisibility, 210, 516, 841
and modular arithmetic, 708 ff. common divisors, 709 ff. divisibility rules, 316, 425, 716 Division Theorem, 703
division, see mod in Zn, 735
Division Rule, 931 ff.
domain (of a function), 255
dot product, 241 ff.
Dunbar’s number, 1125
dynamic programming, see algo-
rithms dynamic scope, 345
∃ (existential quantifier), 333 ff.
e (base of natural logarithm), 209 edges, see graphs
efficiency, see running time, see also
empty set, 226
Enigma Machine, 960 entropy, 1017
equivalence relations, 833 ff.
equivalence classes, 834
refinements and coarsenings, 836 Eratosthenes, 718, 732
Erdős numbers, 438
Erdős, Paul, 438, 1117 error-correcting codes, 405 ff.
Golay code, 422
Hamming code, 412 ff., 926 messages and codewords, 405 ff. minimum distance and rate, 407 ff. Reed–Solomon codes, 418, 731 repetition code, 410 ff.
upper bounds on rates, 415
error-detecting codes, 405 ff. credit card numbers, 403, 419 UPC, 940
Euclid, 446, 447, 710 Euclidean algorithm, 710, 722
efficiency, 713, 716
Extended Euclidean algorithm, 722 Euclidean distance, 250, 456
Euler’s Theorem, 744
Euler, Leonhard, 440, 744
even numbers, 430
evenly divides, see divisibility events (probability), 1007 ff.
correlated, 1021
independent events, 1021 ff. exclusive or (⊕), 211, 308 ff. existential quantifier (∃), 333 ff. expectation, 1044 ff.
average-case analysis of algorithms, 624 ff.
conditional expectation, 1055 ff. coupon collector problem, 1064 deviation from expectation, 1056 ff.
Markov’s inequality, 1065 Law of Total Expectation, 1056 linearity of expectation, 1048 ff.
exponentials, 206 ff., 545 asymptotics, 606 ff. modular, 716
EXPSPACE (complexity class), 628 EXPTIME (complexity class), 628 Extended Euclidean algorithm, 722
factorial, 423–424, 515–516, 633, 636, 915, 921
Stirling’s approximation, 964 factors, see divisibility, see also prime
factorization fallacy, 458 ff.
affirming the consequent, 460 begging the question, 462 denying the hypothesis, 461 false dichotomy, 427, 461 proving true, 460
false dichotomy, see fallacy fencepost error, 1129
Fermat pseudoprime, 741 Fermat’s Last Theorem, 739 Fermat’s Little Theorem, 739 ff. Fermat–Euler Theorem, 744 Fibonacci numbers, 252, 530, 634,
640–642, 644, 963
algorithms, 646
and the Euclidean algorithm, 716
filter, 233
finite-state machines, 846
float (floating point number), 217, 618 floor, 206
Division Theorem, 703 forests, 1150
spanning forests, 1157
formal language theory, see computa-
tional complexity
formal methods, 424, 825
Four Color Theorem, 437, 1121 fractals, 502, 508–510, 519–520, 532 free (vs. bound) variables, 336 frequency analysis, 1025 functions, 253 ff.
algorithms, 265 ff.
characteristic function of a set, 806 composition, 258 domain/codomain, 255
growth rates, 603 ff.
inverses, 262
one-to-one/onto functions, 259 ff. range/image, 256 ff.
viewed as relations, 810 ff.
visual representation, 258
vs. macros, 345
Fundamental Theorem of Arithmetic, 720
computational complexity
Facebook, 1123
Gödel’s Incompleteness Theorem, 346
Gödel, Kurt, 346
game trees, 344, 941
garbage collection, 627, 1143
Gates, Bill, 359, 438, 1006
GCD, see greatest common divisor GCHQ, 747
Generalized Product Rule, 913 ff. geometric distribution, 1015 ff., 1048 geometric mean, 439, 456
geometric series, 510 ff.
infinite, 512
master method, 648 ff.
giant component, 1142
Goldbach’s conjecture, 303, 350, 360 golden ratio, 641
Google, 1174
grammars, 535, 543
graph drawing, 1121, 1124
graphs, 1103 ff.
acyclic graphs, 1149 ff. adjacency lists, 1110 ff. adjacency matrices, 1111 ff. bipartite graphs, 1118 ff. breadth-first search, 1136 ff. complete graphs, 1117 ff. connected components, 1131 ff. connectivity, 1130 ff.
cycles, 1147 ff.
data structures, 1110 ff. degree, 1107, 1109 ff.
Handshaking Lemma, 1108
regular graphs, 1119
density, 1127
depth-first search, 1140 ff.
forests, 1150
isomorphism, 1114 ff.
matchings, 934, 942, 960, 1120, 1159 neighborhoods, 1106 ff., 1109 ff. paths, 1129 ff.
shortest paths, 1135 ff. planar graphs, 1121 ff. shortest paths
Dijkstra’s algorithm, 1165 ff. simple graphs, 1104 subgraphs, 1115 ff.
trees, see trees
undirected vs. directed, 1103 ff. weighted graphs, 1164 ff.
Dijkstra’s algorithm, 1165 ff. greatest common divisor, 709 ff., see
also Euclidean algorithm
Hn, see harmonic number Halting Problem, 346, 451 ff., 455 Hamiltonian path, 1145 Hamming code, 412 ff.
number of valid codewords, 926 Hamming distance, 404 Hamming, Richard, 404, 412 Handshaking Lemma, 1108 harmonic number, 512–514 hashing, 267, 942, 1003–1004, 1050,
1064
Bloom filters, 1039
collisions, 1003 ff., 1010, 1020, 1039,
1051
and pairwise independence, 1026 chaining, 1003
clustering, 1010, 1020
double hashing, 1020
linear probing, 1010, 1020 quadratic probing, 1020
simple uniform hashing, 1004 Hasse diagrams, 840
heaps, 269, 529, 544 heavy-tailed distribution, 1123 Heron’s method, 218, 439 hidden-surface removal, 847 higher-order functions, 233, 357 Hopper, Grace, 464
Huffman coding, 918 hypercube, 1127
I (identity matrix), 244 idempotence, 321 identity
identity function, 263 identity matrix, 244 multiplicative identity, 735 of a binary operator, 315, 545
if and only if (⇔), 308 ff. image (of a function), 256 image processing
blur filter, 218 dithering, 330 quantization, 254 segmentation, 1132
imaginary numbers, 207
implication (⇒), 306 ff.
in-degree, see degree
in-neighbor, see neighbors (in graphs) inclusion–exclusion, 909 ff. incomparability, 610, 838
incompleteness (logic), 346 independent events, 1021 ff.
pairwise independence, 1026 induction, see proofs
checklist for inductive proofs, 507 generating conjectures, 508
proofs about algorithms, 514 ff. strengthening the inductive hypoth-
esis, 540 infix notation, 805
information retrieval, 248 information theory, 1017, 1033 injective functions, see one-to-one
functions
Insertion Sort, see sorting integers, 203 ff.
algorithms for arithmetic, 705, 715 efficiency, 706
division, see modular arithmetic primes and composites, see prime
numbers
recursive definition, 542 representation
binary numbers, 316, 506, 520, 530, 706, 714
different bases, 530, 714 ints, 217
modular representation, 729 unary, 706
successor relation, 829 internet addresses, 919 intersection (of sets), 227 intervals, see real numbers invalid inference, 458 inverse
additive, 743 multiplicative, 735 ff. of a function, 262
of a matrix, 252
of a relation, 806 ff., 821 of an implication, 320
IP addresses, 919 irrationals, see rationals
irrationality of √2, 431 irreflexivity, 819 ff.
isomorphism (of graphs), 1114 ff.
Jaccard coefficient, 236
Java (programming language), 256,
311, 327, 1143 Johnson’s algorithm, 1066

Kn, see complete graphs Kn,n, see bipartite graphs Kasparov, Garry, 344 keyspace, see hashing kidney transplants, 1159 Knuth, Donald, 710 Kruskal’s algorithm, 1171 Kuratowski’s Theorem, 1122
L (complexity class), 628
latchstring, 1165
law of the excluded middle, 317
Law of Total Expectation, 1056
Law of Total Probability, 1032
least common multiple, 709 ff.
length (of a vector), 241
lexical scope, 345
lexicographic ordering, 349, 806
Liar’s Paradox, 225
linearity of expectation, 1048
linked lists, 544
  adjacency lists for graphs, 1110 ff.
  as graphs, 1125
  recursive definition, 533
list, see sequence
little o and little ω, 608 ff., 823 ff.
logarithms, 208–209
  asymptotics, 606 ff.
  discrete logarithm, 753
  polylogarithmic functions, 615, 706
logic
  Boolean logic, 203, 736
  consistency, 346
  fuzzy logic, 314
  incompleteness, 346
  logical equivalence, 319, 338
  logical fallacy, see fallacy
  modal logic, 825
  predicate logic, 331 ff.
    games against the demon, 354
    nested quantifiers, 349 ff.
    order of quantification, 350 ff.
    predicates, 331 ff.
    quantifiers, 333 ff.
    theorems in predicate logic, 337 ff.
  propositional logic, 303 ff.
    atomic vs. compound propositions, 304
    logical connectives, 305 ff.
    propositions, 303 ff.
    recursive definition of a well-formed formula, 535
    satisfiability, 318
    tautology, 317 ff.
    truth assignment, 311
    truth tables, 311 ff.
    truth values, 303, 535
    universal set of operators, 456
  temporal logic, 825
longest common subsequence, 515, 964
loop invariants, 517
machine learning
  classification problems, 927, 1037
  clustering, 234
  cross-validation, 963
macros, 345
Manhattan distance, 241, 250, 456
map, 233
Mapping Rule, 927 ff.
MapReduce, 233
maps, 437, 1121
mark-and-sweep, 1143
Markov’s inequality, 1065
master method, 648 ff.
matchings, see graphs
matrices, 243 ff.
  adjacency matrices for graphs, 1111 ff.
  identity matrix, 244
  inverse of a matrix, 252
  matrix multiplication, 245 ff.
    Strassen’s algorithm, 655
  rotation matrices, 249
  term–document matrix, 248
maximal element, 841 ff.
maximum element, 228, 266, 841 ff.
mazes, 1140
median (of an array), 1060 ff.
memoization, 959
memory management, 1143
Merge Sort, see sorting
metrics, 404, 419–420, 1145
Milgram, Stanley, 438
Miller–Rabin test, 454, 742
minimal element, 841 ff.
minimum element, 228, 841 ff.
minimum spanning trees, 1170 ff.
  cycle rule, 1170
  Kruskal’s algorithm, 1171
  weighted cycle elimination algorithm, 1170
ML (programming language), 357, 539
modal logic, 825
modular arithmetic, 209–211, 703 ff.
  Division Theorem, 703
  mod-and-div algorithm, 705 ff., 715
  modular congruences, 707
  modular exponentiation, 716
  modular products, 707
  modular sums, 707
  multiplicative inverse, 735 ff.
  primitive roots, 753
modus ponens, 317
modus tollens, 318
Monte Carlo method, 1062
Monty Hall Problem, 1012
Moore’s Law, 613
multiples, see divisibility
multiplicative identity, 735
multiplicative inverse, 735 ff.
multitasking, 627
naïve Bayes classifier, 1037
nand (not and), 445
n-ary relations, 812 ff.
  expressing n-ary relations as binary relations, 813
natural language processing
  ambiguity, 314
  language model, 1036
  speech processing, 234, 925
  speech recognition, 1036
  text classification, 1037
  text-to-speech systems, 925
natural logarithm, see logarithms
neighbors (in graphs), 1106, 1109
nested quantifiers, 349 ff.
  games against the demon, 354
  negations, 352
  order of quantification, 350 ff.
Newton’s method, 218
nodes, see graphs
nonconstructive proofs, 432
NP (complexity class), 326, 461, 628
number theory, see modular arithmetic
numerical methods, see scientific computing
O (Big O), 604 ff., 823 ff.
o (little o), 608, 823 ff.
off-by-one error, 1129
Omega (Ω) (asymptotics), 608, 823 ff.
omega (ω) (asymptotics), 608, 823 ff.
one-time pads, 745
one-to-one functions, 260, 928
onto functions, 259, 928
operating systems, 358
  multitasking, 627
  virtual memory, 455
optimizing compilers, 327
orders, see partial orders
out-degree, see degree
out-neighbor, see neighbors (in graphs)
outcome (probability), 1005
overfitting, 1036
overflow, 217, 464
P, see power set
P (complexity class), 326, 461, 628
PageRank, 1174
Painter’s Algorithm, 847
pairwise independence, 1026
palindromes, 545, 939
paradoxes
  birthday paradox, 1052
  class-size paradox, 1045
  Liar’s paradox, 225
  nontransitive dice, 1063
  paradoxes of translation, 304
  Russell’s paradox, 225
  Simpson’s Paradox, 467
  voting paradoxes, 823
parallel edges, 1104
parity, 211, 412 ff., 522–523, 530
parsing, 543
partial orders, 837 ff.
  chains and antichains, 849
  comparability, 838
  extending to a total order, 843 ff.
  Hasse diagrams, 840
  immediate successors, 841
  minimal/maximal elements, 841
  minimum/maximum element, 841
  strict partial order, 838
  topological ordering, 843 ff.
  total orders, 838
    consistency with a partial order, 843 ff.
partition (of a set), 231
  bipartite graphs, 1118
  equivalence relations, 835
Pascal’s identity, 953, 957
Pascal’s Triangle, 957 ff.
paths (in graphs), 1129 ff.
  breadth-first search, 1136 ff.
  connected graphs, 1130 ff.
  depth-first search, 1140 ff.
  Dijkstra’s algorithm, 1165 ff.
  internet routing, 919
  shortest paths, 1135 ff.
  simple paths, 1130
Pentium chip, 464, 613
perfect matchings, see graphs
perfect square, 207
Perl (programming language), 446
permutations, 532, 914–915, 921
  k-permutations, 947 ff.
Petersen graph, 1115, 1122
Pigeonhole Principle, 935 ff., 938
planar graphs, 1121 ff.
  Kuratowski’s Theorem, 1122
polylogarithmic, 615, 706
polynomials, 263 ff., 418, see also P (complexity class)
  asymptotics, 606 ff.
  evaluating modulo a prime, 720, 730, 731
postfix notation, 805
Postscript (programming language), 805
power set, 232
  as a relation, 804
  cardinality, 930
power-law distribution, 1123
powers, see exponentials
precedence of operators, 227, 310, 336, 543
predicate logic, see logic
predicates, 331 ff., 806, see also logic
prefix notation, 805
prefix-free codes, 917
preorder, 840
prime numbers, 211, 449, 717 ff.
  Carmichael numbers, 741, 744
  distribution of the primes, 718
  infinitude of primes, 447
  primality testing, 447, 454, 617, 717
    efficient algorithms, 742
  prime factorization, 720, 752
    cryptography, 454, 752
    existence of, 523–524
    Shor’s algorithm, 1016
    uniqueness of, 723–725
  Prime Number Theorem, 718
  Sieve of Eratosthenes, 718, 732
priority queues, 529
probability
  Bayes’ Rule, 1033 ff.
  conditional expectation, 1055 ff.
  conditional probability, 1027 ff.
  coupon collector problem, 1064
  events, 1007 ff.
  expectation, 1044 ff.
  infinitesimal probabilities, 1030
  Law of Total Expectation, 1056
  Law of Total Probability, 1032
  linearity of expectation, 1048 ff.
  Markov’s inequality, 1065
  Monty Hall Problem, 1012
  outcomes, 1005 ff.
  probability functions, 1005 ff.
  random variables, 1041 ff.
  random walks, 1174
  standard deviation, 1056 ff.
  tree diagrams, 1010 ff.
  variance, 1056 ff.
probability distributions
  Bernoulli, 1013 ff.
  binomial, 1014 ff.
  entropy, 1017
  geometric, 1015 ff.
  posterior distribution, 1034
  prior distribution, 1034
  uniform, 1013 ff.
product, 216 ff.
  of a set, 228
product of sums, see conjunctive normal form
Product Rule, 906
  cardinality of Sk, 908
programming languages
  compile-time optimization, 327
  Currying, 357
  garbage collection, 627, 1143
  higher-order functions, 233, 357
  parsing, 543
  scoping/functions/macros, 345
  short-circuit evaluation, 327
  syntactic sugar, 322
proofs, 423 ff.
  by assuming the antecedent, 341, 426
  by cases, 415, 427 ff.
  by construction, 411, 432 ff.
  by contradiction, 416, 430 ff.
  by contrapositive, 428 ff.
  by induction, 503 ff.
  by mutual implication, 429
  by strong induction, 521 ff.
  by structural induction, 535 ff.
  combinatorial proofs, 951 ff.
  direct, 425 ff.
  nonconstructive, 432
  strategy for proofs, 433 ff.
  unprovable true statements, 346
  “without loss of generality”, 427
  writing proofs, 435 ff.
proper subset and superset, 229
propositional logic, see logic
proving true, see fallacy
pseudocode, 265
pseudorandom generator, 1013
PSPACE (complexity class), 628
public-key cryptography, see cryptography
Pythagorean Theorem, 435, 445–446, 456
  incorrect published proof, 468
Python (programming language), 217, 233, 256, 315, 316, 345, 357, 449 ff., 937, 1143
Q, see rationals
quadtrees, 645
quantifiers, 333 ff.
  negating quantifiers, 340 ff.
  nested quantifiers, 349 ff.
  vacuous quantification, 342
quantum computation, 1016
Quick Sort, see sorting
R, see real numbers
Radix Sort, see sorting
raising to a power, see exponentials
Random Surfer Model, 1174
random variables, 1041 ff.
  expectation, 1044 ff.
  independent random variables, 1043
  indicator random variables, 1043
random walks, 1174, 1176
randomized algorithms, 626
  Buffon’s needle, 1062
  finding medians, 1060
  Johnson’s algorithm, 1066
  Monte Carlo method, 1062
  primality testing (Miller–Rabin), 742
  Quick Sort, 1018
range (of a function), 256
rate (of a code), 407 ff.
rationals, 203 ff., 238, 426, 429, 710
  in lowest terms, 710, 835
real numbers, 203 ff.
  absolute value/floor/ceiling, 205 ff.
  approximate equality (≈), 205, 803
  defining via infinite sequences, 836
  exponentiation, 206 ff.
  floats (representation), 217
  intervals, 205
  logarithms, 208 ff.
  trichotomy, 611
realization, see outcome (probability)
recommender system, 236
recurrence relations, 633 ff.
  iterating, 636
  master method, 648 ff.
  sloppiness, 640
  solving by induction, 635
  variable substitution, 637
recursion tree, 631, 648 ff.
recursively defined structures, 533 ff.
Reed–Solomon codes, 418, 731
reference counting, 1143
refining equivalence relations, 836 ff.
reflexivity, 405, 819 ff.
  reflexive closure, 826 ff.
regular expressions, 830, 846
regular graphs, 1119
relational databases, see databases
relations
  n-ary relations, 812 ff.
  binary relations, 804 ff.
  closures, 825 ff.
  composition, 807 ff.
  equivalence relations, 833 ff.
  functions as relations, 810 ff.
  inverses, 806 ff.
  partial orders, 837 ff.
  reflexivity, 819
  relational databases, 815
  symmetry, 820
  total orders, 838 ff.
  transitivity, 822 ff.
  visual representation, 805 ff., 818 ff.
    Hasse diagrams, 840 ff.
  vs. predicates, 806
relative primality, 720 ff., 737 ff.
  Chinese Remainder Theorem, 725 ff.
  Extended Euclidean algorithm, 722
remainder, see mod
repeated squaring, 646, 716, 749
repetition code, 410 ff.
Rivest, Ron, 747
roots (of a polynomial), 264, 418, 731
RSA cryptosystem, 454, 747 ff.
  breaking the encryption, 752
Rubik’s cube, 736, 922
running time, 617 ff.
  average case, 624 ff., 1054
  best case, 623 ff.
  worst case, 618 ff.
Russell’s paradox, 225
Russell, Bertrand, 225
sample space (probability), 1005
sampling bias, 1045
satisfiability, 318, 326, 450, 803, 1066
scalars, 240
SCC, see strongly connected components
Scheme (programming language), 233, 238, 322, 357, 805
scientific computing, 618
  Newton’s method, 218
searching
  Binary Search, 517, 532, 622 ff., 634, 638–640, 647
  Linear Search, 621 ff.
  Ternary Search, 645
secret sharing, 730, 962
select, see median
Selection Sort, see sorting
self-loops, 1104
self-reference, 225, 304, 346, 448, 1174, 1207
sentinels, 945
sequences, 237 ff.
  Sn (sequence of elements from the same set), 239
    cardinality, 906, 913
sets, 222 ff.
  cardinality, 222 ff., 903 ff.
  characteristic function, 806
  complement, 226
  disjointness, see disjoint sets, see also partitions
  empty set, 226
  intersection, 227
  set difference, 227
  singleton set, 226
  subsets/supersets, 229 ff., see also power set
  union, 227
    inclusion–exclusion, 909 ff.
  Venn diagrams, 226
  well-ordered, 537
Shamir, Adi, 730, 747, 962
Shannon, Claude, 1017
Sheffer stroke (|), 445
Shor’s algorithm, 1016
short-circuit evaluation, 327
Sierpinski triangle/carpet, 519 ff.
Sieve of Eratosthenes, 718
signed social networks, 1116
Simpson’s Paradox, 467
six degrees of separation, 438
small-world phenomenon, 438, 1142
social networks, 1116, 1123
  Dunbar’s number, 1125
sorting
  Bubble Sort, 621, 626, 629
  comparison-based, 629, 920
  Counting Sort, 630, 921
  Insertion Sort, 508, 620, 625, 629
    average-case analysis, 1054, 1064
    correctness using loop invariants, 517
  lower bounds, 920–921
  Merge Sort, 532, 631–632, 634, 636–638, 646, 647
  Quick Sort, 630, 645
    correctness (for any pivot rule), 526 ff.
    randomized pivot selection, 1018
  Radix Sort, 630
  Selection Sort, 619, 626, 629, 920
spam filter, 1037
spanning trees, 1157 ff.
  cycle elimination algorithm, 1158
  minimum spanning trees, 1170 ff.
speech processing, see natural language processing
sphere packing, 416
spreadsheets, 845, 849, 1135
SQL (programming language), 815
square roots, 218, see exponentials
  Heron’s method, 439
standard deviation, 1056 ff.
Strassen’s algorithm, 655
strings, 239
  generating all strings of a given length, 714
  regular expressions, 830
strong induction, see proofs
strongly connected components, 1133 ff.
structural induction, see proofs
subgraphs, see graphs
subset, 229
sum of products, see disjunctive normal form
Sum Rule, 903
summations, 212 ff.
  arithmetic, 512
  geometric, 510 ff., 648 ff.
    infinite, 512
  harmonic, 512 ff.
  of a set, 228
  reindexing summations, 213
  reversing nested summations, 215, 1046
superset, 230
surjective functions, see onto functions
symmetry, 405, 804, 820 ff.
  symmetric closure, 826 ff.
syntactic sugar, 322
tautology, 317 ff.
temporal logic, 825
The Book, 438
Therac-25, 464
Theta (Θ) (asymptotics), 608, 823 ff.
tic-tac-toe, 344, 941
topological ordering, 843 ff.
total orders, 838 ff., see also partial orders
totient function, 744, 924
Towers of Hanoi, 656
transitivity, 822 ff.
  nontransitive dice, 1063
  nontransitivity in voting, 823
  signed social networks, 1116
  transitive closure, 826 ff.
Traveling Salesman Problem, 959
trees, 1147 ff.
  2–3 and 2–3–4 trees, 545
  AVL trees, 643 ff.
  binary search trees, 643, 1160
  binary trees, 534, 643 ff., 1154 ff.
    complete binary trees, 1162 ff.
    heaps, 269, 529
  decision trees, 921
  forests, 1150
  game trees, 344, 941
  in counting problems, 918
  parse trees, 543
  quadtrees, 645
  recursion trees, 631 ff.
  recursive definitions of trees, 534, 1154
  rooted trees, 1151 ff.
  spanning trees, 1157 ff.
    minimum spanning trees, 1170 ff.
  subtrees, 1153 ff.
  tree traversal, 1154 ff.
  van Emde Boas trees, 656
triangle inequality, 405
triangulation, 524–526, 528
truth tables, 311 ff.
truth values, 303 ff.
tsktsks, 1165
tuple, see sequence
Turing Award, 224, 404, 604, 710, 747, 805, 1165
Turing machines, 346, 449
Turing, Alan, 448, 960
unary numbers, see integers
uncomputability, 346, 448–452, 455, 937
undecidability, see uncomputability
underflow, 217
Unicode, 923
uniform distribution, 1007, 1013 ff.
unigrams, 1036
union (of sets), 227
Union Bound, 904
unit vector, 241
universal quantifier (∀), 333 ff.
URL squatting, 941
vacuous quantification, 342
valid inference, 458
van Emde Boas trees, 656
variance, 1056 ff.
Vector Space Model, 248

vectors, 239 ff.
  dot product, 241 ff.
Venn diagrams, 226
virtual memory, 455
Von Koch snowflake, 502, 509
Voronoi diagram, 251
voting systems, 823
wall clocks, 627
well-ordered set, 537
“without loss of generality”, 427
World War II, 960, 1116
World-Wide Web, 1123, 1142
  Google PageRank, 1174
worst-case analysis, see running time
xor, see exclusive or
Z, see integers
Zn, 734 ff.
zero (of a binary operator), 315, 545 zyzzyvas, 806

WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.
