Computer Organization and Design: The Hardware/Software Interface

In Praise of Computer Organization and Design: The Hardware/
Software Interface, Fifth Edition

“Textbook selection is often a frustrating act of compromise—pedagogy, content
coverage, quality of exposition, level of rigor, cost. Computer Organization and
Design is the rare book that hits all the right notes across the board, without
compromise. It is not only the premier computer organization textbook, it is a
shining example of what all computer science textbooks could and should be.”

—Michael Goldweber, Xavier University

“I have been using Computer Organization and Design for years, from the very
first edition. The new Fifth Edition is yet another outstanding improvement on an
already classic text. The evolution from desktop computing to mobile computing
to Big Data brings new coverage of embedded processors such as the ARM, new
material on how software and hardware interact to increase performance, and
cloud computing. All this without sacrificing the fundamentals.”

—Ed Harcourt, St. Lawrence University

“To Millennials: Computer Organization and Design is the computer architecture
book you should keep on your (virtual) bookshelf. The book is both old and new,
because it develops venerable principles—Moore’s Law, abstraction, common case
fast, redundancy, memory hierarchies, parallelism, and pipelining—but illustrates
them with contemporary designs, e.g., ARM Cortex A8 and Intel Core i7.”

—Mark D. Hill, University of Wisconsin-Madison

“The new edition of Computer Organization and Design keeps pace with advances
in emerging embedded and many-core (GPU) systems, where tablets and
smartphones are quickly becoming our new desktops. This text acknowledges
these changes, but continues to provide a rich foundation of the fundamentals
in computer organization and design which will be needed for the designers of
hardware and software that power this new class of devices and systems.”

—Dave Kaeli, Northeastern University

“The Fifth Edition of Computer Organization and Design provides more than an
introduction to computer architecture. It prepares the reader for the changes necessary
to meet the ever-increasing performance needs of mobile systems and big data
processing at a time that difficulties in semiconductor scaling are making all systems
power constrained. In this new era for computing, hardware and software must be
co-designed and system-level architecture is as critical as component-level optimizations.”

—Christos Kozyrakis, Stanford University

“Patterson and Hennessy brilliantly address the issues in ever-changing computer
hardware architectures, emphasizing on interactions among hardware and software
components at various abstraction levels. By interspersing I/O and parallelism concepts
with a variety of mechanisms in hardware and software throughout the book, the new
edition achieves an excellent holistic presentation of computer architecture for the
PostPC era. This book is an essential guide to hardware and software professionals
facing energy efficiency and parallelization challenges in Tablet PC to cloud computing.”

—Jae C. Oh, Syracuse University

David A. Patterson has been teaching computer architecture at the University of
California, Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair
of Computer Science. His teaching has been honored by the Distinguished Teaching
Award from the University of California, the Karlstrom Award from ACM, and the
Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson
received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award
for contributions to RISC, and he shared the IEEE Johnson Information Storage Award
for contributions to RAID. He also shared the IEEE John von Neumann Medal and
the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the
American Academy of Arts and Sciences, the Computer History Museum, ACM,
and IEEE, and he was elected to the National Academy of Engineering, the National
Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on
the Information Technology Advisory Committee to the U.S. President, as chair of the
CS division in the Berkeley EECS department, as chair of the Computing Research
Association, and as President of ACM. This record led to Distinguished Service Awards
from ACM and CRA.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first
VLSI reduced instruction set computer, and the foundation of the commercial
SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks
(RAID) project, which led to dependable storage systems from many companies.
He was also involved in the Network of Workstations (NOW) project, which led to
cluster technology used by Internet companies and later to cloud computing. These
projects earned three dissertation awards from ACM. His current research projects
are Algorithm-Machine-People and Algorithms and Specializers for Provably Optimal
Implementations with Resilience and Efficiency. The AMP Lab is developing scalable
machine learning algorithms, warehouse-scale-computer-friendly programming
models, and crowd-sourcing tools to gain valuable insights quickly from big data in
the cloud. The ASPIRE Lab uses deep hardware and software co-tuning to achieve the
highest possible performance and energy efficiency for mobile and rack computing
systems.

John L. Hennessy is the tenth president of Stanford University, where he has been
a member of the faculty since 1977 in the departments of electrical engineering and
computer science. Hennessy is a Fellow of the IEEE and ACM; a member of the
National Academy of Engineering, the National Academy of Sciences, and the American
Philosophical Society; and a Fellow of the American Academy of Arts and Sciences.
Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to
RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000
John von Neumann Award, which he shared with David Patterson. He has also received
seven honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students.
After completing the project in 1984, he took a leave from the university to cofound
MIPS Computer Systems (now MIPS Technologies), which developed one of the first
commercial RISC microprocessors. As of 2006, over 2 billion MIPS microprocessors have
been shipped in devices ranging from video games and palmtop computers to laser printers
and network switches. Hennessy subsequently led the DASH (Directory Architecture
for Shared Memory) project, which prototyped the first scalable cache coherent
multiprocessor; many of the key ideas have been adopted in modern multiprocessors.
In addition to his technical activities and university responsibilities, he has continued to
work with numerous start-ups both as an early-stage advisor and an investor.

Computer Organization and Design

The Hardware/Software Interface

David A. Patterson
University of California, Berkeley

John L. Hennessy
Stanford University

AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO

SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier

With contributions by
Perry Alexander
The University of Kansas

Peter J. Ashenden
Ashenden Designs Pty Ltd

Jason D. Bakos
University of South Carolina

Javier Bruguera
Universidade de Santiago de Compostela

Jichuan Chang
Hewlett-Packard

Matthew Farrens
University of California, Davis

David Kaeli
Northeastern University

Nicole Kaiyan
University of Adelaide

David Kirk
NVIDIA

James R. Larus
School of Computer and
Communications Science at EPFL

Jacob Leverich
Hewlett-Packard

Kevin Lim
Hewlett-Packard

John Nickolls
NVIDIA

John Oliver
Cal Poly, San Luis Obispo

Milos Prvulovic
Georgia Tech

Partha Ranganathan
Hewlett-Packard

Fifth Edition

Library of Congress Cataloging-in-Publication Data
Patterson, David A.
Computer organization and design: the hardware/software interface/David A. Patterson, John L. Hennessy. — 5th ed.
p. cm. — (The Morgan Kaufmann series in computer architecture and design)
Rev. ed. of: Computer organization and design/John L. Hennessy, David A. Patterson. 1998.
Summary: “Presents the fundamentals of hardware technologies, assembly language, computer arithmetic, pipelining, memory hierarchies
and I/O”— Provided by publisher.
ISBN 978-0-12-407726-3 (pbk.)
1. Computer organization. 2. Computer engineering. 3. Computer interfaces. I. Hennessy, John L. II. Hennessy, John L. Computer
organization and design. III. Title.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-407726-3

Acquiring Editor: Todd Green
Development Editor: Nate McFadden
Project Manager: Lisa Jones
Designer: Russell Purdy

Morgan Kaufmann is an imprint of Elsevier
The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB
225 Wyman Street, Waltham, MA 02451, USA

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how
to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the
Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted
herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in
research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience
and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the publisher nor the authors, contributors, or editors, assume any liability for any injury and/
or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods,
products, instructions, or ideas contained in the material herein.

For information on all MK publications visit our
website at www.mkp.com

Printed and bound in the United States of America

13 14 15 16 10 9 8 7 6 5 4 3 2 1


To Linda,
who has been, is, and always will be the love of my life

Acknowledgments

Figures 1.7, 1.8 Courtesy of iFixit (www.ifixit.com).

Figure 1.9 Courtesy of Chipworks (www.chipworks.com).

Figure 1.13 Courtesy of Intel.

Figures 1.10.1, 1.10.2, 4.15.2 Courtesy of the Charles Babbage
Institute, University of Minnesota Libraries, Minneapolis.

Figures 1.10.3, 4.15.1, 4.15.3, 5.12.3, 6.14.2 Courtesy of IBM.

Figure 1.10.4 Courtesy of Cray Inc.

Figure 1.10.5 Courtesy of Apple Computer, Inc.

Figure 1.10.6 Courtesy of the Computer History Museum.

Figures 5.17.1, 5.17.2 Courtesy of Museum of Science, Boston.

Figure 5.17.4 Courtesy of MIPS Technologies, Inc.

Figure 6.15.1 Courtesy of NASA Ames Research Center.


Contents

Preface

Chapters

1 Computer Abstractions and Technology

1.1 Introduction
1.2 Eight Great Ideas in Computer Architecture
1.3 Below Your Program
1.4 Under the Covers
1.5 Technologies for Building Processors and Memory
1.6 Performance
1.7 The Power Wall
1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors
1.9 Real Stuff: Benchmarking the Intel Core i7
1.10 Fallacies and Pitfalls
1.11 Concluding Remarks
1.12 Historical Perspective and Further Reading
1.13 Exercises

2 Instructions: Language of the Computer

2.1 Introduction
2.2 Operations of the Computer Hardware
2.3 Operands of the Computer Hardware
2.4 Signed and Unsigned Numbers
2.5 Representing Instructions in the Computer
2.6 Logical Operations
2.7 Instructions for Making Decisions
2.8 Supporting Procedures in Computer Hardware
2.9 Communicating with People
2.10 MIPS Addressing for 32-Bit Immediates and Addresses
2.11 Parallelism and Instructions: Synchronization
2.12 Translating and Starting a Program
2.13 A C Sort Example to Put It All Together
2.14 Arrays versus Pointers
2.15 Advanced Material: Compiling C and Interpreting Java
2.16 Real Stuff: ARMv7 (32-bit) Instructions
2.17 Real Stuff: x86 Instructions
2.18 Real Stuff: ARMv8 (64-bit) Instructions
2.19 Fallacies and Pitfalls
2.20 Concluding Remarks
2.21 Historical Perspective and Further Reading
2.22 Exercises

3 Arithmetic for Computers

3.1 Introduction
3.2 Addition and Subtraction
3.3 Multiplication
3.4 Division
3.5 Floating Point
3.6 Parallelism and Computer Arithmetic: Subword Parallelism
3.7 Real Stuff: Streaming SIMD Extensions and Advanced Vector Extensions in x86
3.8 Going Faster: Subword Parallelism and Matrix Multiply
3.9 Fallacies and Pitfalls
3.10 Concluding Remarks
3.11 Historical Perspective and Further Reading
3.12 Exercises

4 The Processor

4.1 Introduction
4.2 Logic Design Conventions
4.3 Building a Datapath
4.4 A Simple Implementation Scheme
4.5 An Overview of Pipelining
4.6 Pipelined Datapath and Control
4.7 Data Hazards: Forwarding versus Stalling
4.8 Control Hazards
4.9 Exceptions
4.10 Parallelism via Instructions
4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines
4.12 Going Faster: Instruction-Level Parallelism and Matrix Multiply
4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline and More Pipelining Illustrations
4.14 Fallacies and Pitfalls
4.15 Concluding Remarks
4.16 Historical Perspective and Further Reading
4.17 Exercises

5 Large and Fast: Exploiting Memory Hierarchy

5.1 Introduction
5.2 Memory Technologies
5.3 The Basics of Caches
5.4 Measuring and Improving Cache Performance
5.5 Dependable Memory Hierarchy
5.6 Virtual Machines
5.7 Virtual Memory
5.8 A Common Framework for Memory Hierarchy
5.9 Using a Finite-State Machine to Control a Simple Cache
5.10 Parallelism and Memory Hierarchies: Cache Coherence
5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks
5.12 Advanced Material: Implementing Cache Controllers
5.13 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies
5.14 Going Faster: Cache Blocking and Matrix Multiply
5.15 Fallacies and Pitfalls
5.16 Concluding Remarks
5.17 Historical Perspective and Further Reading
5.18 Exercises

6 Parallel Processors from Client to Cloud

6.1 Introduction
6.2 The Difficulty of Creating Parallel Processing Programs
6.3 SISD, MIMD, SIMD, SPMD, and Vector
6.4 Hardware Multithreading
6.5 Multicore and Other Shared Memory Multiprocessors
6.6 Introduction to Graphics Processing Units
6.7 Clusters, Warehouse Scale Computers, and Other Message-Passing Multiprocessors
6.8 Introduction to Multiprocessor Network Topologies
6.9 Communicating to the Outside World: Cluster Networking
6.10 Multiprocessor Benchmarks and Performance Models
6.11 Real Stuff: Benchmarking Intel Core i7 versus NVIDIA Tesla GPU
6.12 Going Faster: Multiple Processors and Matrix Multiply
6.13 Fallacies and Pitfalls
6.14 Concluding Remarks
6.15 Historical Perspective and Further Reading
6.16 Exercises

Appendices

A Assemblers, Linkers, and the SPIM Simulator

A.1 Introduction
A.2 Assemblers
A.3 Linkers
A.4 Loading
A.5 Memory Usage
A.6 Procedure Call Convention
A.7 Exceptions and Interrupts
A.8 Input and Output
A.9 SPIM
A.10 MIPS R2000 Assembly Language
A.11 Concluding Remarks
A.12 Exercises

B The Basics of Logic Design

B.1 Introduction
B.2 Gates, Truth Tables, and Logic Equations
B.3 Combinational Logic
B.4 Using a Hardware Description Language
B.5 Constructing a Basic Arithmetic Logic Unit
B.6 Faster Addition: Carry Lookahead
B.7 Clocks
B.8 Memory Elements: Flip-Flops, Latches, and Registers
B.9 Memory Elements: SRAMs and DRAMs
B.10 Finite-State Machines
B.11 Timing Methodologies
B.12 Field Programmable Devices
B.13 Concluding Remarks
B.14 Exercises

Index

Online Content

C Graphics and Computing GPUs

C.1 Introduction
C.2 GPU System Architectures
C.3 Programming GPUs
C.4 Multithreaded Multiprocessor Architecture
C.5 Parallel Memory System
C.6 Floating Point Arithmetic
C.7 Real Stuff: The NVIDIA GeForce 8800
C.8 Real Stuff: Mapping Applications to GPUs
C.9 Fallacies and Pitfalls
C.10 Concluding Remarks
C.11 Historical Perspective and Further Reading

D Mapping Control to Hardware

D.1 Introduction
D.2 Implementing Combinational Control Units
D.3 Implementing Finite-State Machine Control
D.4 Implementing the Next-State Function with a Sequencer
D.5 Translating a Microprogram to Hardware
D.6 Concluding Remarks
D.7 Exercises

E A Survey of RISC Architectures for Desktop, Server, and Embedded Computers

E.1 Introduction
E.2 Addressing Modes and Instruction Formats
E.3 Instructions: The MIPS Core Subset
E.4 Instructions: Multimedia Extensions of the Desktop/Server RISCs
E.5 Instructions: Digital Signal-Processing Extensions of the Embedded RISCs
E.6 Instructions: Common Extensions to MIPS Core
E.7 Instructions Unique to MIPS-64
E.8 Instructions Unique to Alpha
E.9 Instructions Unique to SPARC v9
E.10 Instructions Unique to PowerPC
E.11 Instructions Unique to PA-RISC 2.0
E.12 Instructions Unique to ARM
E.13 Instructions Unique to Thumb
E.14 Instructions Unique to SuperH
E.15 Instructions Unique to M32R
E.16 Instructions Unique to MIPS-16
E.17 Concluding Remarks

Glossary
Further Reading

Preface

The most beautiful thing we can experience is the mysterious. It is the
source of all true art and science.

Albert Einstein, What I Believe, 1930

About This Book
We believe that learning in computer science and engineering should reflect
the current state of the field, as well as introduce the principles that are shaping
computing. We also feel that readers in every specialty of computing need
to appreciate the organizational paradigms that determine the capabilities,
performance, energy, and, ultimately, the success of computer systems.

Modern computer technology requires professionals of every computing
specialty to understand both hardware and software. The interaction between
hardware and software at a variety of levels also offers a framework for understanding
the fundamentals of computing. Whether your primary interest is hardware or
software, computer science or electrical engineering, the central ideas in computer
organization and design are the same. Thus, our emphasis in this book is to show
the relationship between hardware and software and to focus on the concepts that
are the basis for current computers.

The recent switch from uniprocessor to multicore microprocessors confirmed
the soundness of this perspective, given since the first edition. While programmers
could once ignore the advice and rely on computer architects, compiler writers, and silicon
engineers to make their programs run faster or be more energy-efficient without
change, that era is over. For programs to run faster, they must become parallel.
While the goal of many researchers is to make it possible for programmers to be
unaware of the underlying parallel nature of the hardware they are programming,
it will take many years to realize this vision. Our view is that for at least the next
decade, most programmers are going to have to understand the hardware/software
interface if they want programs to run efficiently on parallel computers.

The audience for this book includes those with little experience in assembly
language or logic design who need to understand basic computer organization as
well as readers with backgrounds in assembly language and/or logic design who
want to learn how to design a computer or understand how a system works and
why it performs as it does.

About the Other Book
Some readers may be familiar with Computer Architecture: A Quantitative
Approach, popularly known as Hennessy and Patterson. (This book in turn is
often called Patterson and Hennessy.) Our motivation in writing the earlier book
was to describe the principles of computer architecture using solid engineering
fundamentals and quantitative cost/performance tradeoffs. We used an approach
that combined examples and measurements, based on commercial systems, to
create realistic design experiences. Our goal was to demonstrate that computer
architecture could be learned using quantitative methodologies instead of a
descriptive approach. It was intended for the serious computing professional who
wanted a detailed understanding of computers.

A majority of the readers for this book do not plan to become computer
architects. The performance and energy efficiency of future software systems will
be dramatically affected, however, by how well software designers understand the
basic hardware techniques at work in a system. Thus, compiler writers, operating
system designers, database programmers, and most other software engineers need
a firm grounding in the principles presented in this book. Similarly, hardware
designers must understand clearly the effects of their work on software applications.

Thus, we knew that this book had to be much more than a subset of the material
in Computer Architecture, and the material was extensively revised to match the
different audience. We were so happy with the result that the subsequent editions of
Computer Architecture were revised to remove most of the introductory material;
hence, there is much less overlap today than with the first editions of both books.

Changes for the Fifth Edition
We had six major goals for the fifth edition of Computer Organization and Design:
demonstrate the importance of understanding hardware with a running example;
highlight major themes across the topics using margin icons that are introduced
early; update examples to reflect the changeover from the PC era to the PostPC era;
spread the material on I/O throughout the book rather than isolating it in a single
chapter; update the technical content to reflect changes in the industry since the
publication of the fourth edition in 2009; and put appendices and optional sections
online instead of including a CD, to lower costs and to make this edition viable as an
electronic book.

Before discussing the goals in detail, let’s look at the table below. It
shows the hardware and software paths through the material. Chapters 1, 4, 5, and
6 are found on both paths, no matter what the experience or the focus. Chapter 1
discusses the importance of energy and how it motivates the switch from single
core to multicore microprocessors and introduces the eight great ideas in computer
architecture. Chapter 2 is likely to be review material for the hardware-oriented,
but it is essential reading for the software-oriented, especially for those readers
interested in learning more about compilers and object-oriented programming
languages. Chapter 3 is for readers interested in constructing a datapath or in
learning more about floating-point arithmetic. Some will skip parts of Chapter 3,
either because they don’t need them or because they offer a review. However, we
introduce the running example of matrix multiply in this chapter, showing how
subword parallelism offers a fourfold improvement, so don’t skip Sections 3.6 to 3.8.
Chapter 4 explains pipelined processors. Sections 4.1, 4.5, and 4.10 give overviews
and Section 4.12 gives the next performance boost for matrix multiply for those with
a software focus. Those with a hardware focus, however, will find that this chapter
presents core material; they may also, depending on their background, want to read
Appendix B on logic design first. The last chapter, on multicores, multiprocessors,
and clusters, is mostly new content and should be read by everyone. It was
significantly reorganized in this edition to make the flow of ideas more natural
and to include much more depth on GPUs, warehouse scale computers, and the
hardware-software interface of network interface cards that are key to clusters.

Chapter or Appendix: Sections

1. Computer Abstractions and Technology: 1.1 to 1.11; 1.12 (History)
2. Instructions: Language of the Computer: 2.1 to 2.14; 2.15 (Compilers & Java); 2.16 to 2.20; 2.21 (History)
E. RISC Instruction-Set Architectures: E.1 to E.17
3. Arithmetic for Computers: 3.1 to 3.5; 3.6 to 3.8 (Subword Parallelism); 3.9 to 3.10 (Fallacies); 3.11 (History)
B. The Basics of Logic Design: B.1 to B.13
4. The Processor: 4.1 (Overview); 4.2 (Logic Conventions); 4.3 to 4.4 (Simple Implementation); 4.5 (Pipelining Overview); 4.6 (Pipelined Datapath); 4.7 to 4.9 (Hazards, Exceptions); 4.10 to 4.12 (Parallel, Real Stuff); 4.13 (Verilog Pipeline Control); 4.14 to 4.15 (Fallacies); 4.16 (History)
D. Mapping Control to Hardware: D.1 to D.6
5. Large and Fast: Exploiting Memory Hierarchy: 5.1 to 5.10; 5.11 (Redundant Arrays of Inexpensive Disks); 5.12 (Verilog Cache Controller); 5.13 to 5.16; 5.17 (History)
6. Parallel Processors from Client to Cloud: 6.1 to 6.8; 6.9 (Networks); 6.10 to 6.14; 6.15 (History)
A. Assemblers, Linkers, and the SPIM Simulator: A.1 to A.11
C. Graphics Processor Units: C.1 to C.13

Software focus and Hardware focus columns (icon markings not reproduced): each group of
sections is rated as Read carefully, Review or read, Read if have time, Read for culture,
or Reference.

The first of the six goals for this fifth edition was to demonstrate the importance
of understanding modern hardware to get good performance and energy efficiency
with a concrete example. As mentioned above, we start with subword parallelism
in Chapter 3 to improve matrix multiply by a factor of 4. We double performance
in Chapter 4 by unrolling the loop to demonstrate the value of instruction level
parallelism. Chapter 5 doubles performance again by optimizing for caches using
blocking. Finally, Chapter 6 demonstrates a speedup of 14 from 16 processors by
using thread-level parallelism. All four optimizations in total add just 24 lines of C
code to our initial matrix multiply example.
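
The starting point for such a sequence of optimizations can be sketched as follows.
This is a minimal, unoptimized illustration and not necessarily the book's own listing:
the function name dgemm_naive, the column-major layout, and the restriction to square
n-by-n matrices are assumptions made here for the example.

```c
#include <stddef.h>

/* Hypothetical baseline (names and layout are assumptions, not the book's code):
   computes C = C + A * B for n-by-n matrices of doubles stored in column-major
   order, so element (i, j) lives at index i + j * n. */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double cij = C[i + j * n];              /* running sum for C(i, j) */
            for (size_t k = 0; k < n; ++k) {
                cij += A[i + k * n] * B[k + j * n]; /* A(i, k) * B(k, j) */
            }
            C[i + j * n] = cij;
        }
    }
}
```

Calling dgemm_naive(2, A, B, C) with C initialized to zero leaves the product of A and B
in C. Each of the optimizations named above (subword parallelism, loop unrolling, cache
blocking, thread-level parallelism) would restructure these three loops rather than
change the arithmetic they perform.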

The second goal was to help readers separate the forest from the trees by
identifying eight great ideas of computer architecture early and then pointing out
all the places they occur throughout the rest of the book. We use (hopefully)
easy-to-remember margin icons and highlight the corresponding word in the text to
remind readers of these eight themes. There are nearly 100 citations in the book.
No chapter has fewer than seven examples of great ideas, and no idea is cited fewer than
five times. Performance via parallelism, pipelining, and prediction are the three
most popular great ideas, followed closely by Moore’s Law. The processor chapter
(4) is the one with the most examples, which is not a surprise since it probably
received the most attention from computer architects. The one great idea found in
every chapter is performance via parallelism, which is a pleasant observation given
the recent emphasis on parallelism in the field and in editions of this book.

The third goal was to recognize, in this edition’s examples and material, the
generational change in computing from the PC era to the PostPC era. Thus,
Chapter 1 dives into the guts of a tablet computer rather than a PC, and Chapter 6
describes the computing infrastructure of the cloud. We also feature the ARM,
which is the instruction set of choice in the personal mobile devices of the PostPC
era, as well as the x86 instruction set that dominated the PC era and (so far)
dominates cloud computing.

The fourth goal was to spread the I/O material throughout the book rather
than have it in its own chapter, much as we spread parallelism throughout all the
chapters in the fourth edition. Hence, I/O material in this edition can be found in
Sections 1.4, 4.9, 5.2, 5.5, 5.11, and 6.9. The thought is that readers (and instructors)
are more likely to cover I/O if it’s not segregated to its own chapter.

This is a fast-moving field, and, as is always the case for our new editions, an
important goal is to update the technical content. The running examples are the ARM
Cortex A8 and the Intel Core i7, reflecting our PostPC era. Other highlights include
an overview of the new 64-bit instruction set of ARMv8, a tutorial on GPUs that
explains their unique terminology, more depth on the warehouse scale computers
that make up the cloud, and a deep dive into 10 Gigabit Ethernet cards.

To keep the main book short and compatible with electronic books, we placed
the optional material as online appendices instead of on a companion CD as in
prior editions.

Finally, we updated all the exercises in the book.

While some elements changed, we have preserved useful book elements from
prior editions. To make the book work better as a reference, we still place definitions
of new terms in the margins at their first occurrence. The “Understanding Program
Performance” sections help readers understand the performance of their programs
and how to improve it, just as the “Hardware/Software Interface” sections help
readers understand the tradeoffs at this interface.
“The Big Picture” section remains so that the reader sees the forest despite all the
trees. “Check Yourself” sections help readers to confirm their comprehension of the
material on the first time through, with answers provided at the end of each chapter.
This edition still includes the green MIPS reference card, which was inspired by the
“Green Card” of the IBM System/360. This card has been updated and should be a
handy reference when writing MIPS assembly language programs.

Instructor Support
We have collected a great deal of material to help instructors teach courses using
this book. Solutions to exercises, fi gures from the book, lecture slides, and other
materials are available to adopters from the publisher. Check the publisher’s Web
site for more information:

textbooks.elsevier.com/9780124077263

Concluding Remarks
If you read the following acknowledgments section, you will see that we went to
great lengths to correct mistakes. Since a book goes through many printings, we
have the opportunity to make even more corrections. If you uncover any remaining,
resilient bugs, please contact the publisher by electronic mail at cod5bugs@mkp.com
or by low-tech mail using the address found on the copyright page.

This edition is the second break in the long-standing collaboration between
Hennessy and Patterson, which started in 1989. The demands of running one of
the world’s great universities meant that President Hennessy could no longer make
the substantial commitment to create a new edition. The remaining author felt
once again like a tightrope walker without a safety net. Hence, the people in the
acknowledgments and Berkeley colleagues played an even larger role in shaping
the contents of this book. Nevertheless, this time around there is only one author
to blame for the new material in what you are about to read.

Acknowledgments for the Fifth Edition
With every edition of this book, we are very fortunate to receive help from many
readers, reviewers, and contributors. Each of these people has helped to make this
book better.

Chapter 6 was so extensively revised that we did a separate review for ideas and
contents, and I made changes based on the feedback from every reviewer. I’d like to
thank Christos Kozyrakis of Stanford University for suggesting using the network
interface for clusters to demonstrate the hardware-software interface of I/O and
for suggestions on organizing the rest of the chapter; Mario Flagsilk of Stanford
University for providing details, diagrams, and performance measurements of the
NetFPGA NIC; and the following for suggestions on how to improve the chapter:
David Kaeli of Northeastern University, Partha Ranganathan of HP Labs,
David Wood of the University of Wisconsin, and my Berkeley colleagues Siamak
Faridani, Shoaib Kamil, Yunsup Lee, Zhangxi Tan, and Andrew Waterman.

Special thanks goes to Rimas Avizenis of UC Berkeley, who developed the
various versions of matrix multiply and supplied the performance numbers as well.
As I worked with his father while I was a graduate student at UCLA, it was a nice
symmetry to work with Rimas at UCB.

I also wish to thank my longtime collaborator Randy Katz of UC Berkeley, who
helped develop the concept of great ideas in computer architecture as part of the
extensive revision of an undergraduate class that we did together.

I’d like to thank David Kirk, John Nickolls, and their colleagues at NVIDIA
(Michael Garland, John Montrym, Doug Voorhies, Lars Nyland, Erik Lindholm,
Paulius Micikevicius, Massimiliano Fatica, Stuart Oberman, and Vasily Volkov)
for writing the first in-depth appendix on GPUs. I’d like to express again my
appreciation to Jim Larus, recently named Dean of the School of Computer and
Communications Science at EPFL, for his willingness to contribute his expertise
on assembly language programming, as well as for welcoming readers of this book
to use the simulator he developed and maintains.

I am also very grateful to Jason Bakos of the University of South Carolina,
who updated and created new exercises for this edition, working from originals
prepared for the fourth edition by Perry Alexander (The University of Kansas);
Javier Bruguera (Universidade de Santiago de Compostela); Matthew Farrens
(University of California, Davis); David Kaeli (Northeastern University); Nicole
Kaiyan (University of Adelaide); John Oliver (Cal Poly, San Luis Obispo); Milos
Prvulovic (Georgia Tech); and Jichuan Chang, Jacob Leverich, Kevin Lim, and
Partha Ranganathan (all from Hewlett-Packard).

Additional thanks goes to Jason Bakos for developing the new lecture slides.

I am grateful to the many instructors who have answered the publisher’s surveys,
reviewed our proposals, and attended focus groups to analyze and respond to our
plans for this edition. They include the following individuals: Focus Groups in
2012: Bruce Barton (Suffolk County Community College), Jeff Braun (Montana
Tech), Ed Gehringer (North Carolina State), Michael Goldweber (Xavier University),
Ed Harcourt (St. Lawrence University), Mark Hill (University of Wisconsin,
Madison), Patrick Homer (University of Arizona), Norm Jouppi (HP Labs), Dave
Kaeli (Northeastern University), Christos Kozyrakis (Stanford University),
Zachary Kurmas (Grand Valley State University), Jae C. Oh (Syracuse University),
Lu Peng (LSU), Milos Prvulovic (Georgia Tech), Partha Ranganathan (HP
Labs), David Wood (University of Wisconsin), Craig Zilles (University of Illinois
at Urbana-Champaign). Surveys and Reviews: Mahmoud Abou-Nasr (Wayne State
University), Perry Alexander (The University of Kansas), Hakan Aydin (George
Mason University), Hussein Badr (State University of New York at Stony Brook),
Mac Baker (Virginia Military Institute), Ron Barnes (George Mason University),
Douglas Blough (Georgia Institute of Technology), Kevin Bolding (Seattle Pacific
University), Miodrag Bolic (University of Ottawa), John Bonomo (Westminster
College), Jeff Braun (Montana Tech), Tom Briggs (Shippensburg University), Scott
Burgess (Humboldt State University), Fazli Can (Bilkent University), Warren R.
Carithers (Rochester Institute of Technology), Bruce Carlton (Mesa Community
College), Nicholas Carter (University of Illinois at Urbana-Champaign), Anthony
Cocchi (The City University of New York), Don Cooley (Utah State University),
Robert D. Cupper (Allegheny College), Edward W. Davis (North Carolina State
University), Nathaniel J. Davis (Air Force Institute of Technology), Molisa Derk
(Oklahoma City University), Derek Eager (University of Saskatchewan), Ernest
Ferguson (Northwest Missouri State University), Rhonda Kay Gaede (The University
of Alabama), Etienne M. Gagnon (UQAM), Costa Gerousis (Christopher Newport
University), Paul Gillard (Memorial University of Newfoundland), Michael
Goldweber (Xavier University), Georgia Grant (College of San Mateo), Merrill Hall
(The Master’s College), Tyson Hall (Southern Adventist University), Ed Harcourt
(St. Lawrence University), Justin E. Harlow (University of South Florida), Paul F.
Hemler (Hampden-Sydney College), Martin Herbordt (Boston University), Steve
J. Hodges (Cabrillo College), Kenneth Hopkinson (Cornell University), Dalton
Hunkins (St. Bonaventure University), Baback Izadi (State University of New
York—New Paltz), Reza Jafari, Robert W. Johnson (Colorado Technical University),
Bharat Joshi (University of North Carolina, Charlotte), Nagarajan Kandasamy
(Drexel University), Rajiv Kapadia, Ryan Kastner (University of California,
Santa Barbara), E.J. Kim (Texas A&M University), Jihong Kim (Seoul National
University), Jim Kirk (Union University), Geoffrey S. Knauth (Lycoming College),
Manish M. Kochhal (Wayne State), Suzan Koknar-Tezel (Saint Joseph’s University),
Angkul Kongmunvattana (Columbus State University), April Kontostathis (Ursinus
College), Christos Kozyrakis (Stanford University), Danny Krizanc (Wesleyan
University), Ashok Kumar, S. Kumar (The University of Texas), Zachary Kurmas
(Grand Valley State University), Robert N. Lea (University of Houston), Baoxin
Li (Arizona State University), Li Liao (University of Delaware), Gary Livingston
(University of Massachusetts), Michael Lyle, Douglas W. Lynn (Oregon Institute
of Technology), Yashwant K Malaiya (Colorado State University), Bill Mark
(University of Texas at Austin), Ananda Mondal (Claflin University), Alvin Moser
(Seattle University), Walid Najjar (University of California, Riverside), Danial J.
Neebel (Loras College), John Nestor (Lafayette College), Jae C. Oh (Syracuse
University), Joe Oldham (Centre College), Timour Paltashev, James Parkerson
(University of Arkansas), Shaunak Pawagi (SUNY at Stony Brook), Steve Pearce, Ted
Pedersen (University of Minnesota), Lu Peng (Louisiana State University), Gregory
D Peterson (The University of Tennessee), Milos Prvulovic (Georgia Tech), Partha
Ranganathan (HP Labs), Dejan Raskovic (University of Alaska, Fairbanks), Brad
Richards (University of Puget Sound), Roman Rozanov, Louis Rubinfield (Villanova
University), Md Abdus Salam (Southern University), Augustine Samba (Kent State
University), Robert Schaefer (Daniel Webster College), Carolyn J. C. Schauble
(Colorado State University), Keith Schubert (CSU San Bernardino), William
L. Schultz, Kelly Shaw (University of Richmond), Shahram Shirani (McMaster
University), Scott Sigman (Drury University), Bruce Smith, David Smith, Jeff W.
Smith (University of Georgia, Athens), Mark Smotherman (Clemson University),
Philip Snyder (Johns Hopkins University), Alex Sprintson (Texas A&M), Timothy
D. Stanley (Brigham Young University), Dean Stevens (Morningside College),
Nozar Tabrizi (Kettering University), Yuval Tamir (UCLA), Alexander Taubin
(Boston University), Will Thacker (Winthrop University), Mithuna Thottethodi
(Purdue University), Manghui Tu (Southern Utah University), Dean Tullsen
(UC San Diego), Rama Viswanathan (Beloit College), Ken Vollmar (Missouri
State University), Guoping Wang (Indiana-Purdue University), Patricia Wenner
(Bucknell University), Kent Wilken (University of California, Davis), David Wolfe
(Gustavus Adolphus College), David Wood (University of Wisconsin, Madison),
Ki Hwan Yum (University of Texas, San Antonio), Mohamed Zahran (City College
of New York), Gerald D. Zarnett (Ryerson University), Nian Zhang (South Dakota
School of Mines & Technology), Jiling Zhong (Troy University), Huiyang Zhou
(The University of Central Florida), Weiyu Zhu (Illinois Wesleyan University).

A special thanks also goes to Mark Smotherman for making multiple passes to
find technical and writing glitches that significantly improved the quality of this
edition.

We wish to thank the extended Morgan Kaufmann family for agreeing to publish
this book again under the able leadership of Todd Green and Nate McFadden: I
certainly couldn’t have completed the book without them. We also want to extend
thanks to Lisa Jones, who managed the book production process, and Russell
Purdy, who did the cover design. The new cover cleverly connects the PostPC Era
content of this edition to the cover of the first edition.

The contributions of the nearly 150 people we mentioned here have helped
make this fifth edition what I hope will be our best book yet. Enjoy!

David A. Patterson

1
Computer Abstractions and Technology

Civilization advances by extending the number of important operations which we
can perform without thinking about them.
Alfred North Whitehead, An Introduction to Mathematics, 1911

1.1 Introduction
1.2 Eight Great Ideas in Computer Architecture
1.3 Below Your Program
1.4 Under the Covers
1.5 Technologies for Building Processors and Memory
1.6 Performance
1.7 The Power Wall
1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors
1.9 Real Stuff: Benchmarking the Intel Core i7
1.10 Fallacies and Pitfalls
1.11 Concluding Remarks
1.12 Historical Perspective and Further Reading
1.13 Exercises

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.

1.1 Introduction

Welcome to this book! We’re delighted to have this opportunity to convey the
excitement of the world of computer systems. This is not a dry and dreary field,
where progress is glacial and where new ideas atrophy from neglect. No! Computers
are the product of the incredibly vibrant information technology industry, all
aspects of which are responsible for almost 10% of the gross national product of
the United States, and whose economy has become dependent in part on the rapid
improvements in information technology promised by Moore’s Law. This unusual
industry embraces innovation at a breath-taking rate. In the last 30 years, there have
been a number of new computers whose introduction appeared to revolutionize
the computing industry; these revolutions were cut short only because someone
else built an even better computer.

This race to innovate has led to unprecedented progress since the inception
of electronic computing in the late 1940s. Had the transportation industry kept
pace with the computer industry, for example, today we could travel from New
York to London in a second for a penny. Take just a moment to contemplate how
such an improvement would change society (living in Tahiti while working in San
Francisco, going to Moscow for an evening at the Bolshoi Ballet) and you can
appreciate the implications of such a change.

Computers have led to a third revolution for civilization, with the information
revolution taking its place alongside the agricultural and the industrial revolutions.
The resulting multiplication of humankind’s intellectual strength and reach
naturally has affected our everyday lives profoundly and changed the ways in which
the search for new knowledge is carried out. There is now a new vein of scientific
investigation, with computational scientists joining theoretical and experimental
scientists in the exploration of new frontiers in astronomy, biology, chemistry, and
physics, among others.

The computer revolution continues. Each time the cost of computing improves
by another factor of 10, the opportunities for computers multiply. Applications that
were economically infeasible suddenly become practical. In the recent past, the
following applications were “computer science fiction.”

■ Computers in automobiles: Until microprocessors improved dramatically
in price and performance in the early 1980s, computer control of cars was
ludicrous. Today, computers reduce pollution, improve fuel efficiency via
engine controls, and increase safety through blind spot warnings, lane
departure warnings, moving object detection, and air bag inflation to protect
occupants in a crash.

■ Cell phones: Who would have dreamed that advances in computer
systems would lead to more than half of the planet having mobile phones,
allowing person-to-person communication to almost anyone anywhere in
the world?

■ Human genome project: The cost of computer equipment to map and analyze
human DNA sequences was hundreds of millions of dollars. It’s unlikely that
anyone would have considered this project had the computer costs been 10
to 100 times higher, as they would have been 15 to 25 years earlier. Moreover,
costs continue to drop; you will soon be able to acquire your own genome,
allowing medical care to be tailored to you.

■ World Wide Web: Not in existence at the time of the first edition of this book,
the web has transformed our society. For many, the web has replaced libraries
and newspapers.

■ Search engines: As the content of the web grew in size and in value, finding
relevant information became increasingly important. Today, many people
rely on search engines for such a large part of their lives that it would be a
hardship to go without them.

Clearly, advances in this technology now affect almost every aspect of our
society. Hardware advances have allowed programmers to create wonderfully
useful software, which explains why computers are omnipresent. Today’s science
fiction suggests tomorrow’s killer applications: already on their way are glasses that
augment reality, the cashless society, and cars that can drive themselves.

Classes of Computing Applications and Their Characteristics
Although a common set of hardware technologies (see Sections 1.4 and 1.5) is used
in computers ranging from smart home appliances to cell phones to the largest
supercomputers, these different applications have different design requirements
and employ the core hardware technologies in different ways. Broadly speaking,
computers are used in three different classes of applications.

Personal computers (PCs) are possibly the best known form of computing,
which readers of this book have likely used extensively. Personal computers
emphasize delivery of good performance to single users at low cost and usually
execute third-party software. This class of computing, which is only about 35 years
old, drove the evolution of many computing technologies!

Servers are the modern form of what were once much larger computers, and
are usually accessed only via a network. Servers are oriented to carrying large
workloads, which may consist of either single complex applications (usually a
scientific or engineering application) or handling many small jobs, such as would
occur in building a large web server. These applications are usually based on
software from another source (such as a database or simulation system), but are
often modified or customized for a particular function. Servers are built from the
same basic technology as desktop computers, but provide for greater computing,
storage, and input/output capacity. In general, servers also place a greater emphasis
on dependability, since a crash is usually more costly than it would be on a
single-user PC.

Servers span the widest range in cost and capability. At the low end, a server
may be little more than a desktop computer without a screen or keyboard and
cost a thousand dollars. These low-end servers are typically used for file storage,
small business applications, or simple web serving (see Section 6.10). At the other
extreme are supercomputers, which at the present consist of tens of thousands of
processors and many terabytes of memory, and cost tens to hundreds of millions
of dollars. Supercomputers are usually used for high-end scientific and engineering
calculations, such as weather forecasting, oil exploration, protein structure
determination, and other large-scale problems. Although such supercomputers
represent the peak of computing capability, they represent a relatively small fraction
of the servers and a relatively small fraction of the overall computer market in
terms of total revenue.

Embedded computers are the largest class of computers and span the widest
range of applications and performance. Embedded computers include the
microprocessors found in your car, the computers in a television set, and the
networks of processors that control a modern airplane or cargo ship. Embedded
computing systems are designed to run one application or one set of related
applications that are normally integrated with the hardware and delivered as a
single system; thus, despite the large number of embedded computers, most users
never really see that they are using a computer!

personal computer (PC): A computer designed for use by an individual, usually
incorporating a graphics display, a keyboard, and a mouse.

server: A computer used for running larger programs for multiple users, often
simultaneously, and typically accessed only via a network.

supercomputer: A class of computers with the highest performance and cost; they
are configured as servers and typically cost tens to hundreds of millions of dollars.

terabyte (TB): Originally 1,099,511,627,776 (2^40) bytes, although communications
and secondary storage systems developers started using the term to mean
1,000,000,000,000 (10^12) bytes. To reduce confusion, we now use the term
tebibyte (TiB) for 2^40 bytes, defining terabyte (TB) to mean 10^12 bytes.
Figure 1.1 shows the full range of decimal and binary values and names.

embedded computer: A computer inside another device used for running one
predetermined application or collection of software.

Embedded applications often have unique application requirements that
combine a minimum performance with stringent limitations on cost or power. For
example, consider a music player: the processor need only be as fast as necessary
to handle its limited function, and beyond that, minimizing cost and power are the
most important objectives. Despite their low cost, embedded computers often have
lower tolerance for failure, since the results can vary from upsetting (when your
new television crashes) to devastating (such as might occur when the computer in a
plane or cargo ship crashes). In consumer-oriented embedded applications, such as
a digital home appliance, dependability is achieved primarily through simplicity:
the emphasis is on doing one function as perfectly as possible. In large embedded
systems, techniques of redundancy from the server world are often employed.
Although this book focuses on general-purpose computers, most concepts apply
directly, or with slight modifications, to embedded computers.

Elaboration: Elaborations are short sections used throughout the text to provide more
detail on a particular subject that may be of interest. Uninterested readers may skip
over an elaboration, since the subsequent material will never depend on the contents
of the elaboration.

Many embedded processors are designed using processor cores, a version of a
processor written in a hardware description language, such as Verilog or VHDL (see
Chapter 4). The core allows a designer to integrate other application-specific hardware
with the processor core for fabrication on a single chip.

Welcome to the PostPC Era
The continuing march of technology brings about generational changes in
computer hardware that shake up the entire information technology industry.
Since the last edition of the book we have undergone such a change, as significant
in the past as the switch starting 30 years ago to personal computers. Replacing the

FIGURE 1.1 The 2^X vs. 10^Y bytes ambiguity was resolved by adding a binary notation for
all the common size terms. In the last column we note how much larger the binary term is than its
corresponding decimal term, which is compounded as we head down the chart. These prefixes work for bits
as well as bytes, so gigabit (Gb) is 10^9 bits while gibibit (Gib) is 2^30 bits.

Decimal term  Abbreviation  Value   Binary term  Abbreviation  Value  % Larger
kilobyte      KB            10^3    kibibyte     KiB           2^10    2%
megabyte      MB            10^6    mebibyte     MiB           2^20    5%
gigabyte      GB            10^9    gibibyte     GiB           2^30    7%
terabyte      TB            10^12   tebibyte     TiB           2^40   10%
petabyte      PB            10^15   pebibyte     PiB           2^50   13%
exabyte       EB            10^18   exbibyte     EiB           2^60   15%
zettabyte     ZB            10^21   zebibyte     ZiB           2^70   18%
yottabyte     YB            10^24   yobibyte     YiB           2^80   21%
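
The "% Larger" column can be checked with a few lines of arithmetic. The following is only a sketch: the function names percent_larger and percent_larger_rounded are hypothetical, and the index k (1 for kilo/kibi through 8 for yotta/yobi) is a convention assumed here, not one used in the text. It relies on the fact that each step down the chart multiplies the binary-to-decimal ratio by 2^10 / 10^3 = 1.024.

```c
/* Percentage by which the binary prefix 2^(10*k) exceeds the decimal prefix
   10^(3*k); k = 1 for kilo/kibi, 2 for mega/mebi, ..., 8 for yotta/yobi.
   Illustrative helper, not taken from the book. */
double percent_larger(int k)
{
    double ratio = 1.0;
    for (int i = 0; i < k; ++i)
        ratio *= 1.024;   /* 2^10 / 10^3 = 1.024 for each step down the chart */
    return (ratio - 1.0) * 100.0;
}

/* Round to the nearest whole percent, which appears to be how the
   figure reports its last column. */
int percent_larger_rounded(int k)
{
    return (int)(percent_larger(k) + 0.5);
}
```

For example, percent_larger_rounded(4) gives the terabyte/tebibyte row, where 2^40 is about 10% larger than 10^12; the gap widens with each row because the factor of 1.024 compounds.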

[Figure 1.2: chart of units per year in millions (vertical axis, ticks from 200 to 1400)
for 2007 through 2012 (horizontal axis); series: tablet, smart phone, PC (not including
tablet), and cell phone (not including smart phone).]

FIGURE 1.2 The number manufactured per year of tablets and smart phones, which
reflect the PostPC era, versus personal computers and traditional cell phones. Smart phones
represent the recent growth in the cell phone industry, and they passed PCs in 2011. Tablets are the fastest
growing category, nearly doubling between 2011 and 2012. Recent PCs and traditional cell phone categories
are relatively flat or declining.

PC is the personal mobile device (PMD). PMDs are battery operated with wireless
connectivity to the Internet and typically cost hundreds of dollars, and, like PCs,
users can download software (“apps”) to run on them. Unlike PCs, they no longer
have a keyboard and mouse, and are more likely to rely on a touch-sensitive screen
or even speech input. Today’s PMD is a smart phone or a tablet computer, but
tomorrow it may include electronic glasses. Figure 1.2 shows the rapid growth over
time of tablets and smart phones versus that of PCs and traditional cell phones.

Taking over from the traditional server is Cloud Computing, which relies upon
giant datacenters that are now known as Warehouse Scale Computers (WSCs).
Companies like Amazon and Google build these WSCs containing 100,000 servers
and then let companies rent portions of them so that they can provide software
services to PMDs without having to build WSCs of their own. Indeed, Software as
a Service (SaaS) deployed via the cloud is revolutionizing the software industry just
as PMDs and WSCs are revolutionizing the hardware industry. Today’s software
developers will often have a portion of their application that runs on the PMD and
a portion that runs in the Cloud.

What You Can Learn in This Book
Successful programmers have always been concerned about the performance of
their programs, because getting results to the user quickly is critical in creating
successful software. In the 1960s and 1970s, a primary constraint on computer
performance was the size of the computer’s memory. Thus, programmers often
followed a simple credo: minimize memory space to make programs fast. In the

Personal mobile devices (PMDs) are small wireless devices to connect to the
Internet; they rely on batteries for power, and software is installed by downloading
apps. Conventional examples are smart phones and tablets.

Cloud Computing refers to large collections of servers that provide services over
the Internet; some providers rent dynamically varying numbers of servers as a utility.

Software as a Service (SaaS) delivers software and data as a service over the
Internet, usually via a thin program such as a browser that runs on local client
devices, instead of binary code that must be installed, and runs wholly on that
device. Examples include web search and social networking.

last decade, advances in computer design and memory technology have greatly
reduced the importance of small memory size in most applications other than
those in embedded computing systems.

Programmers interested in performance now need to understand the issues
that have replaced the simple memory model of the 1960s: the parallel nature
of processors and the hierarchical nature of memories. Moreover, as we explain
in Section 1.7, today’s programmers need to worry about energy efficiency of
their programs running either on the PMD or in the Cloud, which also requires
understanding what is below your code. Programmers who seek to build
competitive versions of software will therefore need to increase their knowledge of
computer organization.

We are honored to have the opportunity to explain what’s inside this revolutionary
machine, unraveling the software below your program and the hardware under the
covers of your computer. By the time you complete this book, we believe you will
be able to answer the following questions:

■ How are programs written in a high-level language, such as C or Java,
translated into the language of the hardware, and how does the hardware
execute the resulting program? Comprehending these concepts forms the
basis of understanding the aspects of both the hardware and software that
affect program performance.

■ What is the interface between the software and the hardware, and how does
software instruct the hardware to perform needed functions? These concepts
are vital to understanding how to write many kinds of software.

■ What determines the performance of a program, and how can a programmer
improve the performance? As we will see, this depends on the original
program, the software translation of that program into the computer’s
language, and the effectiveness of the hardware in executing the program.

■ What techniques can be used by hardware designers to improve performance?
This book will introduce the basic concepts of modern computer design. The
interested reader will find much more material on this topic in our advanced
book, Computer Architecture: A Quantitative Approach.

■ What techniques can be used by hardware designers to improve energy
efficiency? What can the programmer do to help or hinder energy efficiency?

■ What are the reasons for and the consequences of the recent switch from
sequential processing to parallel processing? This book gives the motivation,
describes the current hardware mechanisms to support parallelism, and
surveys the new generation of “multicore” microprocessors (see Chapter 6).

■ Since the first commercial computer in 1951, what great ideas did computer
architects come up with that lay the foundation of modern computing?

multicore microprocessor: A microprocessor containing multiple processors
(“cores”) in a single integrated circuit.

Without understanding the answers to these questions, improving the
performance of your program on a modern computer or evaluating what features
might make one computer better than another for a particular application will be
a complex process of trial and error, rather than a scientific procedure driven by
insight and analysis.

This first chapter lays the foundation for the rest of the book. It introduces the
basic ideas and definitions, places the major components of software and hardware
in perspective, shows how to evaluate performance and energy, introduces
integrated circuits (the technology that fuels the computer revolution), and explains
the shift to multicores.

In this chapter and later ones, you will likely see many new words, or words
that you may have heard but are not sure what they mean. Don’t panic! Yes, there
is a lot of special terminology used in describing modern computers, but the
terminology actually helps, since it enables us to describe precisely a function or
capability. In addition, computer designers (including your authors) love using
acronyms, which are easy to understand once you know what the letters stand for!
To help you remember and locate terms, we have included a highlighted definition
of every term in the margins the first time it appears in the text. After a short
time of working with the terminology, you will be fluent, and your friends will
be impressed as you correctly use acronyms such as BIOS, CPU, DIMM, DRAM,
PCIe, SATA, and many others.

To reinforce how the software and hardware systems used to run a program will
affect performance, we use a special section, Understanding Program Performance,
throughout the book to summarize important insights into program performance.
The first one appears below.

Understanding Program Performance

The performance of a program depends on a combination of the effectiveness of the
algorithms used in the program, the software systems used to create and translate
the program into machine instructions, and the effectiveness of the computer in
executing those instructions, which may include input/output (I/O) operations.
This table summarizes how the hardware and software affect performance.

Hardware or software        How this component affects performance        Where is this
component                                                                 topic covered?

Algorithm                   Determines both the number of source-level    Other books!
                            statements and the number of I/O operations
                            executed

Programming language,       Determines the number of computer             Chapters 2 and 3
compiler, and architecture  instructions for each source-level statement

Processor and memory        Determines how fast instructions can be       Chapters 4, 5, and 6
system                      executed

I/O system (hardware and    Determines how fast I/O operations may be     Chapters 4, 5, and 6
operating system)           executed

acronym: A word constructed by taking the initial letters of a string of words.
For example: RAM is an acronym for Random Access Memory, and CPU is an
acronym for Central Processing Unit.

To demonstrate the impact of the ideas in this book, we improve the performance
of a C program that multiplies a matrix times a vector in a sequence of
chapters. Each step leverages understanding how the underlying hardware
really works in a modern microprocessor to improve performance by a factor
of 200!

■ In the category of data level parallelism, in Chapter 3 we use subword
parallelism via C intrinsics to increase performance by a factor of 3.8.

■ In the category of instruction level parallelism, in Chapter 4 we use loop
unrolling to exploit multiple instruction issue and out-of-order execution
hardware to increase performance by another factor of 2.3.

■ In the category of memory hierarchy optimization, in Chapter 5 we use
cache blocking to increase performance on large matrices by another factor
of 2.5.

■ In the category of thread level parallelism, in Chapter 6 we use parallel for
loops in OpenMP to exploit multicore hardware to increase performance by
another factor of 14.
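
To give a feel for one of these steps, here is a minimal sketch of cache blocking. It is not the program developed in later chapters: we assume a square matrix-matrix multiply on row-major arrays as a stand-in, and the names `matmul_naive`, `matmul_blocked`, and the `block` parameter are ours. Both functions compute the same result; the blocked version only reorders the work so that each tile of data is reused while it is likely still in the cache.

```c
#include <stddef.h>

/* Baseline: C = C + A * B for n x n matrices stored row-major in 1-D arrays. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = c[i * n + j];
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Cache-blocked variant (illustrative): same arithmetic, performed one
 * block x block tile at a time. `block` must be at least 1. */
void matmul_blocked(size_t n, size_t block,
                    const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += block)
        for (size_t jj = 0; jj < n; jj += block)
            for (size_t kk = 0; kk < n; kk += block) {
                /* Clamp tile edges so n need not be a multiple of block. */
                size_t i_end = (ii + block < n) ? ii + block : n;
                size_t j_end = (jj + block < n) ? jj + block : n;
                size_t k_end = (kk + block < n) ? kk + block : n;
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t j = jj; j < j_end; ++j) {
                        double sum = c[i * n + j];
                        for (size_t k = kk; k < k_end; ++k)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
            }
}
```

Whether blocking helps, and which block size is best, depends on the cache sizes of the machine at hand, so any speedup should be measured rather than assumed.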

Check Yourself sections are designed to help readers assess whether they
comprehend the major concepts introduced in a chapter and understand the
implications of those concepts. Some Check Yourself questions have simple answers;
others are for discussion among a group. Answers to the specific questions can
be found at the end of the chapter. Check Yourself questions appear only at the
end of a section, making it easy to skip them if you are sure you understand the
material.

1. The number of embedded processors sold every year greatly outnumbers
the number of PC and even PostPC processors. Can you confirm or deny
this insight based on your own experience? Try to count the number of
embedded processors in your home. How does it compare with the number
of conventional computers in your home?

2. As mentioned earlier, both the software and hardware affect the performance
of a program. Can you think of examples where each of the following is the
right place to look for a performance bottleneck?

■ The algorithm chosen
■ The programming language or compiler
■ The operating system
■ The processor
■ The I/O system and devices

Check Yourself


1.2 Eight Great Ideas in Computer Architecture

We now introduce eight great ideas that computer architects have invented in
the last 60 years of computer design. These ideas are so powerful they have lasted
long after the first computer that used them, with newer architects demonstrating
their admiration by imitating their predecessors. These great ideas are themes that
we will weave through this and subsequent chapters as examples arise. To point
out their influence, in this section we introduce icons and highlighted terms that
represent the great ideas and we use them to identify the nearly 100 sections of the
book that feature use of the great ideas.

Design for Moore’s Law
The one constant for computer designers is rapid change, which is driven largely by
Moore’s Law. It states that integrated circuit resources double every 18–24 months.
Moore’s Law resulted from a 1965 prediction of such growth in IC capacity made
by Gordon Moore, one of the founders of Intel. As computer designs can take years,
the resources available per chip can easily double or quadruple between the start
and finish of the project. Like a skeet shooter, computer architects must anticipate
where the technology will be when the design finishes rather than design for where
it starts. We use an “up and to the right” Moore’s Law graph to represent designing
for rapid change.
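
The "double or quadruple" remark can be checked with a line of arithmetic. As a rough sketch, assume idealized exponential growth, with R(t) the resources available t months after the project starts and T the doubling period in months (both symbols are ours, not the text's):

```latex
\[
  \frac{R(t)}{R(0)} = 2^{\,t/T}
  \qquad\Longrightarrow\qquad
  \left.\frac{R(t)}{R(0)}\right|_{t = 24,\; T = 24} = 2,
  \quad
  \left.\frac{R(t)}{R(0)}\right|_{t = 36,\; T = 18} = 4
\]
```

So under these assumptions a two-year project at the slow end of the range sees resources double, while a three-year project at the fast end sees them quadruple.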

Use Abstraction to Simplify Design
Both computer architects and programmers had to invent techniques to make
themselves more productive, for otherwise design time would lengthen as
dramatically as resources grew by Moore’s Law. A major productivity technique for
hardware and software is to use abstractions to represent the design at different
levels of representation; lower-level details are hidden to offer a simpler model at
higher levels. We’ll use the abstract painting icon to represent this second great
idea.

Make the Common Case Fast
Making the common case fast will tend to enhance performance better than
optimizing the rare case. Ironically, the common case is often simpler than the
rare case and hence is often easier to enhance. This common sense advice implies
that you know what the common case is, which is only possible with careful
experimentation and measurement (see Section 1.6). We use a sports car as the
icon for making the common case fast, as the most common trip has one or two
passengers, and it’s surely easier to make a fast sports car than a fast minivan!


Performance via Parallelism
Since the dawn of computing, computer architects have offered designs that get
more performance by performing operations in parallel. We’ll see many examples
of parallelism in this book. We use multiple jet engines of a plane as our icon for
parallel performance.

Performance via Pipelining
A particular pattern of parallelism is so prevalent in computer architecture that
it merits its own name: pipelining. For example, before fire engines, a “bucket
brigade” would respond to a fire, which many cowboy movies show in response to
a dastardly act by the villain. The townsfolk form a human chain to carry water
from a source to the fire, as they can move buckets up the chain much more quickly
than individuals can run back and forth. Our pipeline icon is a sequence of pipes,
with each section representing one stage of the pipeline.
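
The bucket brigade can be put into numbers with a small, idealized model. We assume every hand-off (stage) takes one time step and that stages never stall; the function names are ours. Without pipelining, each bucket travels the whole chain before the next one starts; with pipelining, a new bucket enters the chain every step once it is full.

```c
/* Steps needed when each bucket passes through all `stages` hand-offs
 * before the next bucket starts (no overlap). */
unsigned long unpipelined_steps(unsigned long buckets, unsigned long stages)
{
    return buckets * stages;
}

/* Steps needed by the bucket brigade: the first bucket takes `stages`
 * steps to arrive, and every later bucket arrives one step after the
 * previous one. */
unsigned long pipelined_steps(unsigned long buckets, unsigned long stages)
{
    return (buckets == 0) ? 0 : stages + (buckets - 1);
}
```

For 100 buckets and a 5-person chain, the model gives 500 steps without pipelining and 104 with it. Note that the time for any single bucket is unchanged; it is the rate of delivery that improves.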

Performance via Prediction
Following the saying that it can be better to ask for forgiveness than to ask for
permission, the final great idea is prediction. In some cases it can be faster on
average to guess and start working rather than wait until you know for sure,
assuming that the mechanism to recover from a misprediction is not too expensive
and your prediction is relatively accurate. We use the fortune-teller’s crystal ball as
our prediction icon.
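
The two conditions in that sentence, an inexpensive recovery and a reasonably accurate guess, can be made concrete with a simple expected-time model. This is only a sketch under our own assumptions: a correct guess costs just the work itself, a wrong guess costs the work plus a fixed recovery penalty, and all names and parameters are hypothetical.

```c
/* Average time when we guess and start working immediately.
 * accuracy is the fraction of guesses that turn out to be correct (0..1). */
double expected_time_with_prediction(double accuracy, double work_time,
                                     double recovery_penalty)
{
    return work_time + (1.0 - accuracy) * recovery_penalty;
}

/* Time when we first wait until the answer is known for sure. */
double time_without_prediction(double wait_time, double work_time)
{
    return wait_time + work_time;
}

/* Returns 1 if guessing is faster on average than waiting, else 0. */
int prediction_pays_off(double accuracy, double wait_time,
                        double work_time, double recovery_penalty)
{
    return expected_time_with_prediction(accuracy, work_time, recovery_penalty)
         < time_without_prediction(wait_time, work_time);
}
```

With a wait of 2 units, work of 10 units, and a recovery penalty of 5 units, guessing wins at 90% accuracy but loses at 50%, which is the trade-off the paragraph describes.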

Hierarchy of Memories
Programmers want memory to be fast, large, and cheap, as memory speed often
shapes performance, capacity limits the size of problems that can be solved, and the
cost of memory today is often the majority of computer cost. Architects have found
that they can address these conflicting demands with a hierarchy of memories, with
the fastest, smallest, and most expensive memory per bit at the top of the hierarchy
and the slowest, largest, and cheapest per bit at the bottom. As we shall see in
Chapter 5, caches give the programmer the illusion that main memory is nearly
as fast as the top of the hierarchy and nearly as big and cheap as the bottom of
the hierarchy. We use a layered triangle icon to represent the memory hierarchy.
The shape indicates speed, cost, and size: the closer to the top, the faster and more
expensive per bit the memory; the wider the base of the layer, the bigger the memory.
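
A back-of-the-envelope model shows where the illusion of speed comes from. We assume just two levels and a fixed fraction of accesses (the hit rate) served by the fast one; the function name and the sample numbers are ours and are not measurements from the text.

```c
/* Average access time for a two-level hierarchy, assuming a fraction
 * hit_rate of accesses is served by the fast level and the rest by the
 * slow level. Times are in nanoseconds. */
double average_access_ns(double hit_rate, double fast_ns, double slow_ns)
{
    return hit_rate * fast_ns + (1.0 - hit_rate) * slow_ns;
}
```

For example, with an assumed 1 ns fast level, a 50 ns slow level, and 95% of accesses hitting the fast level, the average is about 3.45 ns, much closer to the fast level than to the slow one.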

Dependability via Redundancy
Computers not only need to be fast; they need to be dependable. Since any physical
device can fail, we make systems dependable by including redundant components that
can take over when a failure occurs and help detect failures. We use the tractor-trailer
as our icon, since the dual tires on each side of its rear axles allow the truck to continue
driving even when one tire fails. (Presumably, the truck driver heads immediately to a
repair facility so the flat tire can be fixed, thereby restoring redundancy!)
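
One common way to realize this idea, shown here only as an illustration and not as a scheme taken from the text, is to keep three copies of a value and take a bitwise majority vote. A single faulty copy is outvoted (the redundant copies take over), and any disagreement signals that a failure has occurred. The function names are ours.

```c
#include <stdint.h>

/* Bitwise majority of three redundant copies: each result bit takes the
 * value held by at least two of the three inputs. */
uint32_t majority_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Failure detection: returns 1 if the copies are not all identical. */
int copies_disagree(uint32_t a, uint32_t b, uint32_t c)
{
    return !(a == b && b == c);
}
```

Like the truck with a flat tire, a system that has detected a disagreement keeps running on the majority value but should repair the faulty copy to restore redundancy.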


1.3 Below Your Program

A typical application, such as a word processor or a large database system, may
consist of millions of lines of code and rely on sophisticated software libraries that
implement complex functions in support of the application. As we will see, the
hardware in a computer can only execute extremely simple low-level instructions.
To go from a complex application to the simple instructions involves several layers
of software that interpret or translate high-level operations into simple computer
instructions, an example of the great idea of abstraction.

Figure 1.3 shows that these layers of software are organized primarily in a
hierarchical fashion, with applications being the outermost ring and a variety of
systems software sitting between the hardware and applications software.

There are many types of systems software, but two types of systems software
are central to every computer system today: an operating system and a compiler.
An operating system interfaces between a user’s program and the hardware
and provides a variety of services and supervisory functions. Among the most
important functions are:

■ Handling basic input and output operations

■ Allocating storage and memory

■ Providing for protected sharing of the computer among multiple applications
using it simultaneously.

Examples of operating systems in use today are Linux, iOS, and Windows.

In Paris they simply stared when I spoke to them in French; I never did succeed
in making those idiots understand their own language.
Mark Twain, The Innocents Abroad, 1869

systems software: Software that provides services that are commonly useful,
including operating systems, compilers, loaders, and assemblers.

operating system: Supervising program that manages the resources of a computer
for the benefit of the programs that run on that computer.

[Figure 1.3: concentric circles labeled, from outermost to innermost,
Applications software, Systems software, Hardware]

FIGURE 1.3 A simplified view of hardware and software as hierarchical layers, shown as
concentric circles with hardware in the center and applications software outermost. In
complex applications, there are often multiple layers of application software as well. For example, a database
system may run on top of the systems software hosting an application, which in turn runs on top of the
database.


Compilers perform another vital function: the translation of a program written
in a high-level language, such as C, C++, Java, or Visual Basic into instructions
that the hardware can execute. Given the sophistication of modern programming
languages and the simplicity of the instructions executed by the hardware, the
translation from a high-level language program to hardware instructions is
complex. We give a brief overview of the process here and then go into more depth
in Chapter 2 and in Appendix A.

From a High-Level Language to the Language of Hardware
To actually speak to electronic hardware, you need to send electrical signals. The
easiest signals for computers to understand are on and off, and so the computer
alphabet is just two letters. Just as the 26 letters of the English alphabet do not limit
how much can be written, the two letters of the computer alphabet do not limit
what computers can do. The two symbols for these two letters are the numbers 0
and 1, and we commonly think of the computer language as numbers in base 2, or
binary numbers. We refer to each “letter” as a binary digit or bit. Computers are
slaves to our commands, which are called instructions. Instructions, which are just
collections of bits that the computer understands and obeys, can be thought of as
numbers. For example, the bits

1000110010100000

tell one computer to add two numbers. Chapter 2 explains why we use numbers
for instructions and data; we don’t want to steal that chapter’s thunder, but using
numbers for both instructions and data is a foundation of computing.
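
To see that a collection of bits really can be read as a number, here is a small sketch that interprets a string of 0s and 1s in base 2. The function name is ours, and the code assumes the string is short enough to fit in an unsigned long.

```c
/* Interpret a string of '0' and '1' characters as an unsigned base-2
 * number, reading the most significant bit first. Conversion stops at the
 * first character that is not a binary digit. */
unsigned long bits_to_number(const char *bits)
{
    unsigned long value = 0;
    for (; *bits == '0' || *bits == '1'; ++bits)
        value = value * 2 + (unsigned long)(*bits - '0');
    return value;
}
```

Applied to the instruction bits shown above, `bits_to_number("1000110010100000")` yields 36000 (hexadecimal 8CA0): the same pattern is an instruction to one reader and an ordinary number to another.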

The first programmers communicated to computers in binary numbers, but this
was so tedious that they quickly invented new notations that were closer to the way
humans think. At first, these notations were translated to binary by hand, but this
process was still tiresome. Using the computer to help program the computer, the
pioneers invented programs to translate from symbolic notation to binary. The first of
these programs was named an assembler. This program translates a symbolic version
of an instruction into the binary version. For example, the programmer would write

add A,B

and the assembler would translate this notation into

1000110010100000

This instruction tells the computer to add the two numbers A and B. The name coined
for this symbolic language, still used today, is assembly language. In contrast, the
binary language that the machine understands is the machine language.

Although a tremendous improvement, assembly language is still far from the
notations a scientist might like to use to simulate fluid flow or that an accountant
might use to balance the books. Assembly language requires the programmer
to write one line for every instruction that the computer will follow, forcing the
programmer to think like the computer.

compiler: A program that translates high-level language statements into assembly
language statements.

binary digit: Also called a bit. One of the two numbers in base 2 (0 or 1) that
are the components of information.

instruction: A command that computer hardware understands and obeys.

assembler: A program that translates a symbolic version of instructions into the
binary version.

assembly language: A symbolic representation of machine instructions.

machine language: A binary representation of machine instructions.

The recognition that a program could be written to translate a more powerful
language into computer instructions was one of the great breakthroughs in the
early days of computing. Programmers today owe their productivity, and their
sanity, to the creation of high-level programming languages and compilers
that translate programs in such languages into instructions. Figure 1.4 shows the
relationships among these programs and languages, which are more examples of
the power of abstraction.

high-level programming language: A portable language such as C, C++, Java, or
Visual Basic that is composed of words and algebraic notation that can be
translated by a compiler into assembly language.

FIGURE 1.4 C program compiled into assembly language and then assembled into binary
machine language. Although the translation from high-level language to binary machine language is
shown in two steps, some compilers cut out the middleman and produce binary machine language directly.
These languages and this program are examined in more detail in Chapter 2.


High-level language program (in C)

    swap(int v[], int k)
    {int temp;
       temp = v[k];
       v[k] = v[k+1];
       v[k+1] = temp;
    }

Compiler

Assembly language program (for MIPS)

    swap:
          multi $2, $5,4
          add   $2, $4,$2
          lw    $15, 0($2)
          lw    $16, 4($2)
          sw    $16, 0($2)
          sw    $15, 4($2)
          jr    $31

Assembler

Binary machine language program (for MIPS)

    00000000101000100000000100011000
    00000000100000100001000000100001
    10001101111000100000000000000000
    10001110000100100000000000000100
    10101110000100100000000000000000
    10101101111000100000000000000100
    00000011111000000000000000001000
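
For readers who want to try the program in Figure 1.4, here is a version of swap that should compile with a current C compiler. The only change to the figure's code is an explicit void return type; the helper element_offset_bytes is our own addition, included to show the address arithmetic that, as we read the figure, the first assembly instruction performs when it multiplies k by 4 (the size of an int on that MIPS target).

```c
#include <stddef.h>

/* Exchange v[k] and v[k+1]. The caller must ensure both indices are valid. */
void swap(int v[], int k)
{
    int temp;
    temp = v[k];      /* corresponds to the first load in the figure  */
    v[k] = v[k + 1];  /* second load, then first store                */
    v[k + 1] = temp;  /* second store                                 */
}

/* Hypothetical helper: byte offset of element k from the start of an
 * int array, i.e. k times the size of one element. */
size_t element_offset_bytes(int k)
{
    return (size_t)k * sizeof(int);
}
```

Calling `swap(v, 1)` on the array {10, 20, 30} leaves it as {10, 30, 20}.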


A compiler enables a programmer to write this high-level language expression:

A + B

The compiler would compile it into this assembly language statement:

add A,B

As shown above, the assembler would translate this statement into the binary
instructions that tell the computer to add the two numbers A and B.

High-level programming languages offer several important benefits. First, they
allow the programmer to think in a more natural language, using English words
and algebraic notation, resulting in programs that look much more like text than
like tables of cryptic symbols (see Figure 1.4). Moreover, they allow languages to be
designed according to their intended use. Hence, Fortran was designed for scientific
computation, Cobol for business data processing, Lisp for symbol manipulation,
and so on. There are also domain-specific languages for even narrower groups of
users, such as those interested in simulation of fluids, for example.

The second advantage of programming languages is improved programmer
productivity. One of the few areas of widespread agreement in software development
is that it takes less time to develop programs when they are written in languages
that require fewer lines to express an idea. Conciseness is a clear advantage of
high-level languages over assembly language.

The final advantage is that programming languages allow programs to be
independent of the computer on which they were developed, since compilers and
assemblers can translate high-level language programs to the binary instructions of
any computer. These three advantages are so strong that today little programming
is done in assembly language.

1.4 Under the Covers

Now that we have looked below your program to uncover the underlying software,
let’s open the covers of your computer to learn about the underlying hardware. The
underlying hardware in any computer performs the same basic functions: inputting
data, outputting data, processing data, and storing data. How these functions are
performed is the primary topic of this book, and subsequent chapters deal with
different parts of these four tasks.

When we come to an important point in this book, a point so important that
we hope you will remember it forever, we emphasize it by identifying it as a Big
Picture item. We have about a dozen Big Pictures in this book, the first being the
five components of a computer that perform the tasks of inputting, outputting,
processing, and storing data.

Two key components of computers are input devices, such as the microphone,
and output devices, such as the speaker. As the names suggest, input feeds the
computer, and output is the result of computation sent to the user. Some devices,
such as wireless networks, provide both input and output to the computer.

Chapters 5 and 6 describe input/output (I/O) devices in more detail, but let’s
take an introductory tour through the computer hardware, starting with the
external I/O devices.

input device: A mechanism through which the computer is fed information, such as
a keyboard.

output device: A mechanism that conveys the result of a computation to a user,
such as a display, or to another computer.

FIGURE 1.5 The organization of a computer, showing the five classic components. The
processor gets instructions and data from memory. Input writes data to memory, and output reads data from
memory. Control sends the signals that determine the operations of the datapath, memory, input, and output.

The BIG Picture

The five classic components of a computer are input, output, memory,
datapath, and control, with the last two sometimes combined and called
the processor. Figure 1.5 shows the standard organization of a computer.
This organization is independent of hardware technology: you can place
every piece of every computer, past and present, into one of these five
categories. To help you keep all this in perspective, the five components of
a computer are shown on the front page of each of the following chapters,
with the portion of interest to that chapter highlighted.


Through the Looking Glass
The most fascinating I/O device is probably the graphics display. Most personal
mobile devices use liquid crystal displays (LCDs) to get a thin, low-power display.
The LCD is not the source of light; instead, it controls the transmission of light.
A typical LCD includes rod-shaped molecules in a liquid that form a twisting
helix that bends light entering the display, from either a light source behind the
display or less often from reflected light. The rods straighten out when a current is
applied and no longer bend the light. Since the liquid crystal material is between
two screens polarized at 90 degrees, the light cannot pass through unless it is bent.
Today, most LCD displays use an active matrix that has a tiny transistor switch at
each pixel to precisely control current and make sharper images. A red-green-blue
mask associated with each dot on the display determines the intensity of the
three-color components in the final image; in a color active matrix LCD, there are three
transistor switches at each point.

The image is composed of a matrix of picture elements, or pixels, which can
be represented as a matrix of bits, called a bit map. Depending on the size of the
screen and the resolution, the display matrix in a typical tablet ranges in size from
1024 × 768 to 2048 × 1536. A color display might use 8 bits for each of the three
colors (red, blue, and green), for 24 bits per pixel, permitting millions of different
colors to be displayed.
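
These figures are easy to turn into storage requirements. The sketch below, with function names of our own choosing, computes how many bytes a bit map needs and how many distinct colors a given pixel depth allows; it assumes pixels are packed with no padding between them.

```c
/* Bytes needed to hold a width x height bit map at the given depth,
 * rounded up to a whole byte. Assumes no padding between pixels or rows. */
unsigned long framebuffer_bytes(unsigned long width, unsigned long height,
                                unsigned bits_per_pixel)
{
    return (width * height * bits_per_pixel + 7) / 8;
}

/* Number of distinct values a pixel of the given depth can take.
 * bits_per_pixel must be smaller than the width of unsigned long. */
unsigned long distinct_colors(unsigned bits_per_pixel)
{
    return 1UL << bits_per_pixel;
}
```

At 24 bits per pixel, a 1024 × 768 display needs 2,359,296 bytes and a 2048 × 1536 display needs 9,437,184 bytes, while each pixel can take one of 16,777,216 colors, the "millions" mentioned above.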

The computer hardware support for graphics consists mainly of a raster refresh
buffer, or frame buffer, to store the bit map. The image to be represented onscreen
is stored in the frame buffer, and the bit pattern per pixel is read out to the graphics
display at the refresh rate. Figure 1.6 shows a frame buffer with a simplified design
of just 4 bits per pixel.
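
A frame buffer with 4 bits per pixel can be sketched in a few lines of C by packing two pixels into each byte. Everything here is illustrative: the tiny 8 × 4 resolution, the packing order, and the names fb_set_pixel and fb_get_pixel are our assumptions, not a description of any real display hardware.

```c
#include <stddef.h>
#include <stdint.h>

enum { FB_WIDTH = 8, FB_HEIGHT = 4 };  /* hypothetical tiny display */

/* Two 4-bit pixels per byte: even-numbered pixels in the low half,
 * odd-numbered pixels in the high half. */
static uint8_t frame_buffer[FB_WIDTH * FB_HEIGHT / 2];

/* Store a 4-bit shade at (x, y) without disturbing the neighboring pixel
 * that shares the same byte. Coordinates must be within the display. */
void fb_set_pixel(size_t x, size_t y, uint8_t shade)
{
    size_t index = y * FB_WIDTH + x;
    uint8_t *byte = &frame_buffer[index / 2];
    if (index % 2 == 0)
        *byte = (uint8_t)((*byte & 0xF0) | (shade & 0x0F));
    else
        *byte = (uint8_t)((*byte & 0x0F) | ((shade & 0x0F) << 4));
}

/* Read back the 4-bit shade at (x, y), as the display hardware would on
 * each refresh. */
uint8_t fb_get_pixel(size_t x, size_t y)
{
    size_t index = y * FB_WIDTH + x;
    uint8_t byte = frame_buffer[index / 2];
    return (uint8_t)((index % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
}
```

Storing the patterns 0011 and 1101 used in Figure 1.6 is then `fb_set_pixel(0, 0, 0x3)` and `fb_set_pixel(1, 1, 0xD)`.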

The goal of the bit map is to faithfully represent what is on the screen. The
challenges in graphics systems arise because the human eye is very good at detecting
even subtle changes on the screen.

liquid crystal display: A display technology using a thin layer of liquid polymers
that can be used to transmit or block light according to whether a charge is
applied.

pixel: The smallest individual picture element. Screens are composed of hundreds
of thousands to millions of pixels, organized in a matrix.

[Figure 1.6: a frame buffer (left) and a raster scan CRT display (right), each
with coordinates X0 and X1 marked along its horizontal axis]

FIGURE 1.6 Each coordinate in the frame buffer on the left determines the shade of the
corresponding coordinate for the raster scan CRT display on the right. Pixel (X0, Y0) contains
the bit pattern 0011, which is a lighter shade on the screen than the bit pattern 1101 in pixel (X1, Y1).

active matrix display: A liquid crystal display using a transistor to control the
transmission of light at each individual pixel.

Through computer displays I have landed an airplane on the deck of a moving
carrier, observed a nuclear particle hit a potential well, flown in a rocket at
nearly the speed of light and watched a computer reveal its innermost workings.
Ivan Sutherland, the “father” of computer graphics, Scientific American, 1984


Touchscreen
While PCs also use LCD displays, the tablets and smartphones of the PostPC era
have replaced the keyboard and mouse with touch-sensitive displays, which have
the wonderful user interface advantage of letting users point directly at what they
are interested in rather than indirectly with a mouse.

While there are a variety of ways to implement a touch screen, many tablets
today use capacitive sensing. Since people are electrical conductors, if an insulator
like glass is covered with a transparent conductor, touching distorts the electrostatic
field of the screen, which results in a change in capacitance. This technology can
allow multiple touches simultaneously, which allows gestures that can lead to
attractive user interfaces.

Opening the Box
Figure 1.7 shows the contents of the Apple iPad 2 tablet computer. Unsurprisingly,
of the five classic components of the computer, I/O dominates this reading device.
The list of I/O devices includes a capacitive multitouch LCD display, front facing
camera, rear facing camera, microphone, headphone jack, speakers, accelerometer,
gyroscope, Wi-Fi network, and Bluetooth network. The datapath, control, and
memory are a tiny portion of the components.

The small rectangles in Figure 1.8 contain the devices that drive our advancing
technology, called integrated circuits and nicknamed chips. The A5 package seen
in the middle of Figure 1.8 contains two ARM processors that operate with a
clock rate of 1 GHz. The processor is the active part of the computer, following the
instructions of a program to the letter. It adds numbers, tests numbers, signals I/O
devices to activate, and so on. Occasionally, people call the processor the CPU, for
the more bureaucratic-sounding central processor unit.

Descending even lower into the hardware, Figure 1.9 reveals details of a
microprocessor. The processor logically comprises two main components: datapath
and control, the respective brawn and brain of the processor. The datapath performs
the arithmetic operations, and control tells the datapath, memory, and I/O devices
what to do according to the wishes of the instructions of the program. Chapter 4
explains the datapath and control for a higher-performance design.

The A5 package in Figure 1.8 also includes two memory chips, each with
2 gibibits of capacity, thereby supplying 512 MiB. The memory is where the
programs are kept when they are running; it also contains the data needed by the
running programs. The memory is built from DRAM chips. DRAM stands for
dynamic random access memory. Multiple DRAMs are used together to contain
the instructions and data of a program. In contrast to sequential access memories,
such as magnetic tapes, the RAM portion of the term DRAM means that memory
accesses take basically the same amount of time no matter what portion of the
memory is read.
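
The step from two 2-gibibit chips to 512 MiB is worth spelling out, since it mixes bits and bytes. Using the binary prefixes, where 1 Gibit is 2^30 bits and 1 MiB is 2^20 bytes:

```latex
\[
  2 \times 2\ \text{Gibit} = 2 \times 2 \times 2^{30}\ \text{bits} = 2^{32}\ \text{bits},
  \qquad
  \frac{2^{32}\ \text{bits}}{8\ \text{bits/byte}} = 2^{29}\ \text{bytes}
  = 512 \times 2^{20}\ \text{bytes} = 512\ \text{MiB}
\]
```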

Descending into the depths of any component of the hardware reveals insights
into the computer. Inside the processor is another type of memory—cache memory.

integrated circuit: Also called a chip. A device combining dozens to millions of
transistors.

central processor unit (CPU): Also called processor. The active part of the
computer, which contains the datapath and control and which adds numbers, tests
numbers, signals I/O devices to activate, and so on.

datapath: The component of the processor that performs arithmetic operations.

control: The component of the processor that commands the datapath, memory, and
I/O devices according to the instructions of the program.

memory: The storage area in which programs are kept when they are running and
that contains the data needed by the running programs.

dynamic random access memory (DRAM): Memory built as an integrated circuit; it
provides random access to any location. Access times are 50 nanoseconds and cost
per gigabyte in 2012 was $5 to $10.


FIGURE 1.7 Components of the Apple iPad 2 A1395. The metal back of the iPad (with the reversed
Apple logo in the middle) is in the center. At the top is the capacitive multitouch screen and LCD display. To
the far right is the 3.8 V, 25 watt-hour, polymer battery, which consists of three Li-ion cell cases and offers
10 hours of battery life. To the far left is the metal frame that attaches the LCD to the back of the iPad. The
small components surrounding the metal back in the center are what we think of as the computer; they
are often L-shaped to fit compactly inside the case next to the battery. Figure 1.8 shows a close-up of the
L-shaped board to the lower left of the metal case, which is the logic printed circuit board that contains the
processor and the memory. The tiny rectangle below the logic board contains a chip that provides wireless
communication: Wi-Fi, Bluetooth, and FM tuner. It fits into a small slot in the lower left corner of the logic
board. Near the upper left corner of the case is another L-shaped component, which is a front-facing camera
assembly that includes the camera, headphone jack, and microphone. Near the right upper corner of the case
is the board containing the volume control and silent/screen rotation lock button along with a gyroscope and
accelerometer. These last two chips combine to allow the iPad to recognize 6-axis motion. The tiny rectangle
next to it is the rear-facing camera. Near the bottom right of the case is the L-shaped speaker assembly. The
cable at the bottom is the connector between the logic board and the camera/volume control board. The
board between the cable and the speaker assembly is the controller for the capacitive touchscreen. (Courtesy
iFixit, www.ifixit.com)

FIGURE 1.8 The logic board of Apple iPad 2 in Figure 1.7. The photo highlights five integrated circuits.
The large integrated circuit in the middle is the Apple A5 chip, which contains dual ARM processor cores
that run at 1 GHz as well as 512 MB of main memory inside the package. Figure 1.9 shows a photograph of
the processor chip inside the A5 package. The similar sized chip to the left is the 32 GB flash memory chip
for non-volatile storage. There is an empty space between the two chips where a second flash chip can be
installed to double storage capacity of the iPad. The chips to the right of the A5 include power controller and
I/O controller chips. (Courtesy iFixit, www.ifixit.com)



FIGURE 1.9 The processor integrated circuit inside the A5 package. The size of the chip is 12.1 by 10.1 mm, and
it was manufactured originally in a 45-nm process (see Section 1.5). It has two identical ARM processors or
cores in the middle left of the chip and a PowerVR graphical processor unit (GPU) with four datapaths in the
upper left quadrant. To the left and bottom side of the ARM cores are interfaces to main memory (DRAM).
(Courtesy Chipworks, www.chipworks.com)

Cache memory consists of a small, fast memory that acts as a buffer for the DRAM
memory. (The nontechnical definition of cache is a safe place for hiding things.)
Cache is built using a different memory technology, static random access memory
(SRAM). SRAM is faster but less dense, and hence more expensive, than DRAM
(see Chapter 5). SRAM and DRAM are two layers of the memory hierarchy.

cache memory: A small, fast memory that acts as a buffer for a slower, larger
memory.

static random access memory (SRAM): Also memory built as an integrated circuit,
but faster and less dense than DRAM.



As mentioned above, one of the great ideas to improve design is abstraction.
One of the most important abstractions is the interface between the hardware
and the lowest-level software. Because of its importance, it is given a special
name: the instruction set architecture, or simply architecture, of a computer.
The instruction set architecture includes anything programmers need to know to
make a binary machine language program work correctly, including instructions,
I/O devices, and so on. Typically, the operating system will encapsulate the
details of doing I/O, allocating memory, and other low-level system functions
so that application programmers do not need to worry about such details. The
combination of the basic instruction set and the operating system interface
provided for application programmers is called the application binary interface
(ABI).

An instruction set architecture allows computer designers to talk about
functions independently from the hardware that performs them. For example,
we can talk about the functions of a digital clock (keeping time, displaying the
time, setting the alarm) independently from the clock hardware (quartz crystal,
LED displays, plastic buttons). Computer designers distinguish architecture from
an implementation of an architecture along the same lines: an implementation is
hardware that obeys the architecture abstraction. These ideas bring us to another
Big Picture.
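
The same distinction between an interface and its implementations can be illustrated in software. This is only a loose analogy of our own, not how an instruction set architecture is specified: the function pointer type add_fn plays the role of the architecture (what an add must do), and two quite different functions play the role of implementations. Code written against the interface, such as sum_array, gets identical results from either one.

```c
#include <stddef.h>
#include <stdint.h>

/* The "architecture": any function of this type must return a + b,
 * wrapping around modulo 2^32. */
typedef uint32_t (*add_fn)(uint32_t, uint32_t);

/* Implementation 1: use the machine's own add. */
uint32_t add_native(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Implementation 2: build addition from XOR (sum without carries) and
 * AND plus shift (the carries), repeating until no carry remains. */
uint32_t add_ripple(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t carry = (a & b) << 1;
        a ^= b;
        b = carry;
    }
    return a;
}

/* "Software" written only against the interface: it runs unchanged on
 * either implementation. */
uint32_t sum_array(add_fn add, const uint32_t *values, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total = add(total, values[i]);
    return total;
}
```

The two implementations may differ greatly in cost and speed, yet `sum_array(add_native, ...)` and `sum_array(add_ripple, ...)` agree on every input.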

instruction set architecture: Also called architecture. An abstract interface
between the hardware and the lowest-level software that encompasses all the
information necessary to write a machine language program that will run
correctly, including instructions, registers, memory access, I/O, and so on.

application binary interface (ABI): The user portion of the instruction set plus
the operating system interfaces used by application programmers. It defines a
standard for binary portability across computers.

implementation: Hardware that obeys the architecture abstraction.

The BIG Picture

Both hardware and software consist of hierarchical layers using abstraction,
with each lower layer hiding details from the level above. One key interface
between the levels of abstraction is the instruction set architecture: the
interface between the hardware and low-level software. This abstract
interface enables many implementations of varying cost and performance
to run identical software.

A Safe Place for Data
Thus far, we have seen how to input data, compute using the data, and display
data. If we were to lose power to the computer, however, everything would be lost
because the memory inside the computer is volatile; that is, when it loses power,
it forgets. In contrast, a DVD disk doesn’t forget the movie when you turn off the
power to the DVD player, and is thus a nonvolatile memory technology.

volatile memory
Storage, such as DRAM,
that retains data only if it
is receiving power.

nonvolatile memory
A form of memory that
retains data even in the
absence of a power source
and that is used to store
programs between runs.
A DVD disk is nonvolatile.


To distinguish between the volatile memory used to hold data and programs
while they are running and this nonvolatile memory used to store data and
programs between runs, the term main memory or primary memory is used for
the former, and secondary memory for the latter. Secondary memory forms the
next lower layer of the memory hierarchy. DRAMs have dominated main memory
since 1975, but magnetic disks dominated secondary memory starting even earlier.
Because of their size and form factor, personal mobile devices use flash memory,
a nonvolatile semiconductor memory, instead of disks. Figure 1.8 shows the chip
containing the flash memory of the iPad 2. While slower than DRAM, it is much
cheaper than DRAM in addition to being nonvolatile. Although costing more per
bit than disks, it is smaller, it comes in much smaller capacities, it is more rugged,
and it is more power efficient than disks. Hence, flash memory is the standard
secondary memory for PMDs. Alas, unlike disks and DRAM, flash memory bits
wear out after 100,000 to 1,000,000 writes. Thus, file systems must keep track of
the number of writes and have a strategy to avoid wearing out storage, such as by
moving popular data. Chapter 5 describes disks and flash memory in more detail.

Communicating with Other Computers
We’ve explained how we can input, compute, display, and save data, but there is
still one missing item found in today’s computers: computer networks. Just as the
processor shown in Figure 1.5 is connected to memory and I/O devices, networks
interconnect whole computers, allowing computer users to extend the power of
computing by including communication. Networks have become so popular that
they are the backbone of current computer systems; a new personal mobile device
or server without a network interface would be ridiculed. Networked computers
have several major advantages:

■ Communication: Information is exchanged between computers at high
speeds.

■ Resource sharing: Rather than each computer having its own I/O devices,
computers on the network can share I/O devices.

■ Nonlocal access: By connecting computers over long distances, users need not
be near the computer they are using.

Networks vary in length and performance, with the cost of communication
increasing according to both the speed of communication and the distance that
information travels. Perhaps the most popular type of network is Ethernet. It can
be up to a kilometer long and transfer at up to 40 gigabits per second. Its length and
speed make Ethernet useful to connect computers on the same floor of a building;

main memory Also
called primary memory.
Memory used to hold
programs while they are
running; typically consists
of DRAM in today’s
computers.

secondary memory
Nonvolatile memory
used to store programs
and data between runs;
typically consists of flash
memory in PMDs and
magnetic disks in servers.

magnetic disk Also
called hard disk. A form
of nonvolatile secondary
memory composed of
rotating platters coated
with a magnetic recording
material. Because they
are rotating mechanical
devices, access times are
about 5 to 20 milliseconds
and cost per gigabyte in
2012 was $0.05 to $0.10.

flash memory
A nonvolatile
semiconductor memory. It
is cheaper and slower
than DRAM but more
expensive per bit and
faster than magnetic disks.
Access times are about 5
to 50 microseconds and
cost per gigabyte in 2012
was $0.75 to $1.00.

24 Chapter 1 Computer Abstractions and Technology

hence, it is an example of what is generically called a local area network. Local area
networks are interconnected with switches that can also provide routing services
and security. Wide area networks cross continents and are the backbone of the
Internet, which supports the web. They are typically based on optical fibers and are
leased from telecommunication companies.

Networks have changed the face of computing in the last 30 years, both by
becoming much more ubiquitous and by making dramatic increases in performance.
In the 1970s, very few individuals had access to electronic mail, the Internet and
web did not exist, and physically mailing magnetic tapes was the primary way to
transfer large amounts of data between two locations. Local area networks were
almost nonexistent, and the few existing wide area networks had limited capacity
and restricted access.

As networking technology improved, it became much cheaper and had a much
higher capacity. For example, the first standardized local area network technology,
developed about 30 years ago, was a version of Ethernet that had a maximum capacity
(also called bandwidth) of 10 million bits per second, typically shared by tens of, if
not a hundred, computers. Today, local area network technology offers a capacity
of 1 to 40 gigabits per second, usually shared by at most a few computers.
Optical communications technology has allowed similar growth in the capacity of
wide area networks, from hundreds of kilobits to gigabits and from hundreds of
computers connected to a worldwide network to millions of computers connected.
The dramatic rise in the deployment of networking, combined with these
increases in capacity, has made network technology central to the information
revolution of the last 30 years.

Over the last decade, another innovation in networking has been reshaping the way
computers communicate. Wireless technology has become widespread, which enabled
the PostPC Era. The ability to make a radio in the same low-cost semiconductor
technology (CMOS) used for memory and microprocessors enabled a significant
improvement in price, leading to an explosion in deployment. Currently available
wireless technologies, called by the IEEE standard name 802.11, allow for transmission
rates from 1 to nearly 100 million bits per second. Wireless technology is quite a bit
different from wire-based networks, since all users in an immediate area share the
airwaves.

■ Semiconductor DRAM memory, flash memory, and disk storage differ
significantly. For each technology, list its volatility, approximate relative
access time, and approximate relative cost compared to DRAM.

1.5 Technologies for Building Processors and Memory

Processors and memory have improved at an incredible rate, because computer
designers have long embraced the latest in electronic technology to try to win the
race to design a better computer. Figure 1.10 shows the technologies that have

local area network
(LAN) A network
designed to carry data
within a geographically
confined area, typically
within a single building.

wide area network
(WAN) A network
extended over hundreds
of kilometers that can
span a continent.

Check
Yourself

FIGURE 1.10 Relative performance per unit cost of technologies used in computers over
time. Source: Computer Museum, Boston, with 2013 extrapolated by the authors. See Section 1.12.

FIGURE 1.11 Growth of capacity per DRAM chip over time. The y-axis is measured in kibibits ($2^{10}$ bits). The DRAM industry
quadrupled capacity almost every three years, a 60% increase per year, for 20 years. In recent years, the rate has slowed down and is somewhat
closer to doubling every two years to three years.
[Plot: x-axis, year of introduction (1976 to 2012); y-axis, kibibit capacity (logarithmic scale, 100 to 10,000,000). Labeled points include 16K, 64K, 256K, 16M, 64M, 128M, 256M, 512M, 1G, and 2G.]


been used over time, with an estimate of the relative performance per unit cost for
each technology. Since this technology shapes what computers will be able to do
and how quickly they will evolve, we believe all computer professionals should be
familiar with the basics of integrated circuits.

A transistor is simply an on/off switch controlled by electricity. The integrated
circuit (IC) combined dozens to hundreds of transistors into a single chip. When
Gordon Moore predicted the continuous doubling of resources, he was predicting
the growth rate of the number of transistors per chip. To describe the tremendous
increase in the number of transistors from hundreds to millions, the adjective very
large scale is added to the term, creating the abbreviation VLSI, for very large-scale
integrated circuit.

This rate of increasing integration has been remarkably stable. Figure 1.11 shows
the growth in DRAM capacity since 1977. For decades, the industry has consistently
quadrupled capacity every 3 years, resulting in an increase in excess of 16,000 times!
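The arithmetic behind this claim can be checked with a short sketch. It is written in Python purely for illustration, and the function name `growth_factor` is hypothetical. Assuming capacity multiplies by four every 3 years, seven quadruplings (about 21 years) already exceed a factor of 16,000, and the equivalent yearly growth is a little under 60%.

```python
def growth_factor(years: float, factor: float = 4.0, period_years: float = 3.0) -> float:
    """Cumulative growth when capacity multiplies by `factor` every `period_years`."""
    return factor ** (years / period_years)


# Equivalent compound growth per year for "4x every 3 years".
annual_rate = growth_factor(1) - 1
print(f"annual growth: {annual_rate:.1%}")

# Seven quadruplings, i.e. 21 years of 4x-per-3-years growth.
print(f"growth over 21 years: {growth_factor(21):,.0f}x")
```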

To understand how to manufacture integrated circuits, we start at the beginning.
The manufacture of a chip begins with silicon, a substance found in sand. Because
silicon does not conduct electricity well, it is called a semiconductor. With a special
chemical process, it is possible to add materials to silicon that allow tiny areas to
transform into one of three devices:

■ Excellent conductors of electricity (using either microscopic copper or
aluminum wire)

transistor An on/off
switch controlled by an
electric signal.

very large-scale
integrated (VLSI)
circuit A device
containing hundreds of
thousands to millions of
transistors.

silicon A natural
element that is a
semiconductor.

semiconductor
A substance that does not
conduct electricity well.

Year  Technology used in computers          Relative performance/unit cost
1951  Vacuum tube                           1
1965  Transistor                            35
1975  Integrated circuit                    900
1995  Very large-scale integrated circuit   2,400,000
2013  Ultra large-scale integrated circuit  250,000,000,000


■ Excellent insulators from electricity (like plastic sheathing or glass)

■ Areas that can conduct or insulate under special conditions (as a switch)

Transistors fall in the last category. A VLSI circuit, then, is just billions of
combinations of conductors, insulators, and switches manufactured in a single
small package.

The manufacturing process for integrated circuits is critical to the cost of the
chips and hence important to computer designers. Figure 1.12 shows that process.
The process starts with a silicon crystal ingot, which looks like a giant sausage.
Today, ingots are 8–12 inches in diameter and about 12–24 inches long. An ingot
is finely sliced into wafers no more than 0.1 inches thick. These wafers then go
through a series of processing steps, during which patterns of chemicals are placed
on each wafer, creating the transistors, conductors, and insulators discussed earlier.
Today’s integrated circuits contain only one layer of transistors but may have from
two to eight levels of metal conductor, separated by layers of insulators.

silicon crystal ingot
A rod composed of a
silicon crystal that is
between 8 and 12 inches
in diameter and about 12
to 24 inches long.

wafer A slice from a
silicon ingot no more than
0.1 inches thick, used to
create chips.

FIGURE 1.12 The chip manufacturing process. After being sliced from the silicon ingot, blank
wafers are put through 20 to 40 steps to create patterned wafers (see Figure 1.13). These patterned wafers are
then tested with a wafer tester, and a map of the good parts is made. Then, the wafers are diced into dies (see
Figure 1.9). In this figure, one wafer produced 20 dies, of which 17 passed testing. (X means the die is bad.)
The yield of good dies in this case was 17/20, or 85%. These good dies are then bonded into packages and
tested one more time before shipping the packaged parts to customers. One bad packaged part was found
in this final test.
[Diagram, in order: silicon ingot, slicer, blank wafers, 20 to 40 processing steps, patterned wafers, wafer tester, tested wafer, dicer, tested dies, bond die to package, packaged dies, part tester, tested packaged dies, ship to customers.]

A single microscopic flaw in the wafer itself or in one of the dozens of patterning
steps can result in that area of the wafer failing. These defects, as they are called,
make it virtually impossible to manufacture a perfect wafer. The simplest way to
cope with imperfection is to place many independent components on a single
wafer. The patterned wafer is then chopped up, or diced, into these components,

defect A microscopic
flaw in a wafer or in
patterning steps that can
result in the failure of the
die containing that defect.

FIGURE 1.13 A 12-inch (300 mm) wafer of Intel Core i7 (Courtesy Intel). The number of
dies on this 300 mm (12 inch) wafer at 100% yield is 280, each 20.7 by 10.5 mm. The several dozen partially
rounded chips at the boundaries of the wafer are useless; they are included because it’s easier to create the
masks used to pattern the silicon. This die uses a 32-nanometer technology, which means that the smallest
features are approximately 32 nm in size, although they are typically somewhat smaller than the actual feature
size, which refers to the size of the transistors as “drawn” versus the final manufactured size.


called dies and more informally known as chips. Figure 1.13 shows a photograph
of a wafer containing microprocessors before they have been diced; earlier, Figure
1.9 shows an individual microprocessor die.

Dicing enables you to discard only those dies that were unlucky enough to
contain the flaws, rather than the whole wafer. This concept is quantified by the
yield of a process, which is defined as the percentage of good dies from the total
number of dies on the wafer.

The cost of an integrated circuit rises quickly as the die size increases, due both
to the lower yield and the smaller number of dies that fit on a wafer. To reduce the
cost, using the next generation process shrinks a large die as it uses smaller sizes for
both transistors and wires. This improves the yield and the die count per wafer. A
32-nanometer (nm) process was typical in 2012, which means essentially that the
smallest feature size on the die is 32 nm.

die The individual
rectangular sections that
are cut from a wafer, more
informally known as
chips.

yield The percentage of
good dies from the total
number of dies on the
wafer.


Once you’ve found good dies, they are connected to the input/output pins of a
package, using a process called bonding. These packaged parts are tested a final time,
since mistakes can occur in packaging, and then they are shipped to customers.

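Elaboration: The cost of an integrated circuit can be expressed in three simple
equations:

$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{(1 + (\text{Defects per area} \times \text{Die area}/2))^2}$$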

The first equation is straightforward to derive. The second is an approximation,
since it does not subtract the area near the border of the round wafer that cannot
accommodate the rectangular dies (see Figure 1.13). The final equation is based on
empirical observations of yields at integrated circuit factories, with the exponent related
to the number of critical processing steps.

Hence, depending on the defect rate and the size of the die and wafer, costs are
generally not linear in the die area.
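The three cost equations in this Elaboration can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the wafer cost, wafer diameter, and defect density below are made-up inputs rather than figures from the text. The sketch shows that doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller fraction of them are good.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Approximation: wafer area / die area (ignores unusable edge dies)."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    return wafer_area / die_area_mm2


def die_yield(defects_per_mm2: float, die_area_mm2: float) -> float:
    """Yield = 1 / (1 + defects per area * die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_mm2 * die_area_mm2 / 2.0) ** 2


def cost_per_die(wafer_cost: float, wafer_diameter_mm: float,
                 die_area_mm2: float, defects_per_mm2: float) -> float:
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    good_dies = (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
                 * die_yield(defects_per_mm2, die_area_mm2))
    return wafer_cost / good_dies


# Hypothetical inputs chosen only to show the trend.
WAFER_COST = 5000.0      # assumed cost per wafer, arbitrary currency units
WAFER_DIAMETER = 300.0   # mm
DEFECT_DENSITY = 0.001   # assumed defects per square millimeter

small = cost_per_die(WAFER_COST, WAFER_DIAMETER, 100.0, DEFECT_DENSITY)
large = cost_per_die(WAFER_COST, WAFER_DIAMETER, 200.0, DEFECT_DENSITY)
print(f"100 mm^2 die: {small:.2f}")
print(f"200 mm^2 die: {large:.2f}  (ratio {large / small:.2f})")
```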

A key factor in determining the cost of an integrated circuit is volume. Which of
the following are reasons why a chip made in high volume should cost less?

1. With high volumes, the manufacturing process can be tuned to a particular
design, increasing the yield.

2. It is less work to design a high-volume part than a low-volume part.

3. The masks used to make the chip are expensive, so the cost per chip is lower
for higher volumes.

4. Engineering development costs are high and largely independent of volume;
thus, the development cost per die is lower with high-volume parts.

5. High-volume parts usually have smaller die sizes than low-volume parts and
therefore have higher yield per wafer.

1.6 Performance

Assessing the performance of computers can be quite challenging. The scale and
intricacy of modern software systems, together with the wide range of performance
improvement techniques employed by hardware designers, have made performance
assessment much more difficult.

When trying to choose among different computers, performance is an important
attribute. Accurately measuring and comparing different computers is critical to

Check
Yourself

Airplane          Passenger  Cruising range  Cruising speed  Passenger throughput
                  capacity   (miles)         (m.p.h.)        (passengers × m.p.h.)
Boeing 777        375        4630            610             228,750
Boeing 747        470        4150            610             286,700
BAC/Sud Concorde  132        4000            1350            178,200
Douglas DC-8-50   146        8720            544             79,424

FIGURE 1.14 The capacity, range, and speed for a number of commercial airplanes. The last
column shows the rate at which the airplane transports passengers, which is the capacity times the cruising
speed (ignoring range and takeoff and landing times).


purchasers and therefore to designers. The people selling computers know this as
well. Often, salespeople would like you to see their computer in the best possible
light, whether or not this light accurately reflects the needs of the purchaser’s
application. Hence, understanding how best to measure performance and the
limitations of performance measurements is important in selecting a computer.

The rest of this section describes different ways in which performance can be
determined; then, we describe the metrics for measuring performance from the
viewpoint of both a computer user and a designer. We also look at how these metrics
are related and present the classical processor performance equation, which we will
use throughout the text.

Defining Performance
When we say one computer has better performance than another, what do we
mean? Although this question might seem simple, an analogy with passenger
airplanes shows how subtle the question of performance can be. Figure 1.14
lists some typical passenger airplanes, together with their cruising speed, range,
and capacity. If we wanted to know which of the planes in this table had the best
performance, we would first need to define performance. For example, considering
different measures of performance, we see that the plane with the highest cruising
speed was the Concorde (retired from service in 2003), the plane with the longest
range is the DC-8, and the plane with the largest capacity is the 747.

Let’s suppose we define performance in terms of speed. This still leaves two
possible definitions. You could define the fastest plane as the one with the highest
cruising speed, taking a single passenger from one point to another in the least time.
If you were interested in transporting 450 passengers from one point to another,
however, the 747 would clearly be the fastest, as the last column of the figure shows.
Similarly, we can define computer performance in several different ways.
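To make the point concrete, the following Python sketch ranks the airplanes of Figure 1.14 under each definition. The data are the values listed in that figure; the variable names and the dictionary layout are illustrative choices, not part of the text.

```python
# (passenger capacity, cruising range in miles, cruising speed in m.p.h.)
planes = {
    "Boeing 777": (375, 4630, 610),
    "Boeing 747": (470, 4150, 610),
    "BAC/Sud Concorde": (132, 4000, 1350),
    "Douglas DC-8-50": (146, 8720, 544),
}

# Passenger throughput = capacity x cruising speed (passengers x m.p.h.).
throughput = {name: cap * speed for name, (cap, _rng, speed) in planes.items()}

# A different plane "wins" depending on which metric defines performance.
fastest = max(planes, key=lambda name: planes[name][2])
longest_range = max(planes, key=lambda name: planes[name][1])
largest = max(planes, key=lambda name: planes[name][0])
best_throughput = max(throughput, key=throughput.get)

print("highest cruising speed:", fastest)
print("longest range:", longest_range)
print("largest capacity:", largest)
print("highest passenger throughput:", best_throughput)
```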

If you were running a program on two different desktop computers, you’d say
that the faster one is the desktop computer that gets the job done first. If you were
running a datacenter that had several servers running jobs submitted by many
users, you’d say that the faster computer was the one that completed the most
jobs during a day. As an individual computer user, you are interested in reducing
response time—the time between the start and completion of a task—also referred

response time Also
called execution time.
The total time required
for the computer to
complete a task, including
disk accesses, memory
accesses, I/O activities,
operating system
overhead, CPU execution
time, and so on.


to as execution time. Datacenter managers are often interested in increasing
throughput or bandwidth—the total amount of work done in a given time. Hence,
in most cases, we will need different performance metrics as well as different sets
of applications to benchmark personal mobile devices, which are more focused on
response time, versus servers, which are more focused on throughput.

Throughput and Response Time

Do the following changes to a computer system increase throughput, decrease
response time, or both?

1. Replacing the processor in a computer with a faster version

2. Adding additional processors to a system that uses multiple processors
for separate tasks—for example, searching the web

Decreasing response time almost always improves throughput. Hence, in case
1, both response time and throughput are improved. In case 2, no one task gets
work done faster, so only throughput increases.

If, however, the demand for processing in the second case was almost
as large as the throughput, the system might force requests to queue up. In
this case, increasing the throughput could also improve response time, since
it would reduce the waiting time in the queue. Thus, in many real computer
systems, changing either execution time or throughput often affects the other.

In discussing the performance of computers, we will be primarily concerned with
response time for the first few chapters. To maximize performance, we want to
minimize response time or execution time for some task. Thus, we can relate
performance and execution time for a computer X:

$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$

This means that for two computers X and Y, if the performance of X is greater than
the performance of Y, we have

$$\text{Performance}_X > \text{Performance}_Y$$

$$\frac{1}{\text{Execution time}_X} > \frac{1}{\text{Execution time}_Y}$$

$$\text{Execution time}_Y > \text{Execution time}_X$$

That is, the execution time on Y is longer than that on X, if X is faster than Y.

throughput Also called
bandwidth. Another
measure of performance,
it is the number of tasks
completed per unit time.

EXAMPLE

ANSWER

In discussing a computer design, we often want to relate the performance of two
different computers quantitatively. We will use the phrase “X is n times faster than
Y”—or equivalently “X is n times as fast as Y”—to mean

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = n$$

If X is n times as fast as Y, then the execution time on Y is n times as long as it is
on X:

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

Relative Performance

If computer A runs a program in 10 seconds and computer B runs the same
program in 15 seconds, how much faster is A than B?

We know that A is n times as fast as B if
$$\frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = n$$

Thus the performance ratio is

$$\frac{15}{10} = 1.5$$

and A is therefore 1.5 times as fast as B.
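The same calculation can be written as a small Python helper. The function name `times_as_fast` is hypothetical; it simply applies the definition that X is n times as fast as Y when the execution time on Y is n times the execution time on X.

```python
def times_as_fast(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times as fast as Y.

    Performance is the reciprocal of execution time, so
    Performance_X / Performance_Y = exec_time_y / exec_time_x.
    """
    return exec_time_y / exec_time_x


# Computer A: 10 seconds, computer B: 15 seconds (values from the example).
n = times_as_fast(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n} times as fast as B")
```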

In the above example, we could also say that computer B is 1.5 times slower than
computer A, since

$$\frac{\text{Performance}_A}{\text{Performance}_B} = 1.5$$

means that

$$\frac{\text{Performance}_A}{1.5} = \text{Performance}_B$$

EXAMPLE

ANSWER


For simplicity, we will normally use the terminology as fast as when we try to
compare computers quantitatively. Because performance and execution time are
reciprocals, increasing performance requires decreasing execution time. To avoid
the potential confusion between the terms increasing and decreasing, we usually
say “improve performance” or “improve execution time” when we mean “increase
performance” and “decrease execution time.”

Measuring Performance
Time is the measure of computer performance: the computer that performs the
same amount of work in the least time is the fastest. Program execution time is
measured in seconds per program. However, time can be defined in different ways,
depending on what we count. The most straightforward definition of time is called
wall clock time, response time, or elapsed time. These terms mean the total time
to complete a task, including disk accesses, memory accesses, input/output (I/O)
activities, operating system overhead—everything.

Computers are often shared, however, and a processor may work on several
programs simultaneously. In such cases, the system may try to optimize throughput
rather than attempt to minimize the elapsed time for one program. Hence, we
often want to distinguish between the elapsed time and the time over which the
processor is working on our behalf. CPU execution time or simply CPU time,
which recognizes this distinction, is the time the CPU spends computing for this
task and does not include time spent waiting for I/O or running other programs.
(Remember, though, that the response time experienced by the user will be the
elapsed time of the program, not the CPU time.) CPU time can be further divided
into the CPU time spent in the program, called user CPU time, and the CPU time
spent in the operating system performing tasks on behalf of the program, called
system CPU time. Differentiating between system and user CPU time is difficult to
do accurately, because it is often hard to assign responsibility for operating system
activities to one user program rather than another and because of the functionality
differences among operating systems.
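As a rough illustration of these distinctions, the Python sketch below measures elapsed time, user CPU time, and system CPU time for one small task. It relies on the standard library calls `time.perf_counter()` and `os.times()`; the function names `busy_work` and `measure`, the loop size, and the use of `time.sleep` as a stand-in for waiting on I/O are all assumptions made for this example. The reported CPU times have coarse resolution on many systems, so treat the numbers as indicative only.

```python
import os
import time


def busy_work(n: int) -> int:
    """CPU-bound loop: adds to both elapsed time and user CPU time."""
    return sum(i * i for i in range(n))


def measure(n: int = 200_000, wait_seconds: float = 0.05):
    """Return (elapsed, user_cpu, system_cpu) in seconds for one run."""
    wall_start = time.perf_counter()
    cpu_start = os.times()

    busy_work(n)              # the processor computes on our behalf
    time.sleep(wait_seconds)  # stand-in for waiting on I/O: elapsed time only

    cpu_end = os.times()
    elapsed = time.perf_counter() - wall_start
    user_cpu = cpu_end.user - cpu_start.user
    system_cpu = cpu_end.system - cpu_start.system
    return elapsed, user_cpu, system_cpu


elapsed, user_cpu, system_cpu = measure()
print(f"elapsed (wall clock) time: {elapsed:.3f} s")
print(f"user CPU time:             {user_cpu:.3f} s")
print(f"system CPU time:           {system_cpu:.3f} s")
```

The elapsed time includes the wait, while the CPU times do not, which is the gap the text describes between what the user experiences and what the processor spends on the task.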

For consistency, we maintain a distinction between performance based on
elapsed time and that based on CPU execution time. We will use the term system
performance to refer to elapsed time on an unloaded system and CPU performance
to refer to user CPU time. We will focus on CPU performance in this chapter,
although our discussions of how to summarize performance can be applied to
either elapsed time or CPU time measurements.

Different applications are sensitive to different aspects of the performance of a
computer system. Many applications, especially those running on servers, depend
as much on I/O performance, which, in turn, relies on both hardware and software.
Total elapsed time measured by a wall clock is the measurement of interest. In

CPU execution
time Also called CPU
time. The actual time the
CPU spends computing
for a specific task.

user CPU time The
CPU time spent in a
program itself.

system CPU time The
CPU time spent in
the operating system
performing tasks on
behalf of the program.

Understanding
Program

Performance

some application environments, the user may care about throughput, response
time, or a complex combination of the two (e.g., maximum throughput with a
worst-case response time). To improve the performance of a program, one must
have a clear definition of what performance metric matters and then proceed to
look for performance bottlenecks by measuring program execution and looking
for the likely bottlenecks. In the following chapters, we will describe how to search
for bottlenecks and improve performance in various parts of the system.

Although as computer users we care about time, when we examine the details
of a computer it’s convenient to think about performance in other metrics. In
particular, computer designers may want to think about a computer by using a
measure that relates to how fast the hardware can perform basic functions. Almost
all computers are constructed using a clock that determines when events take
place in the hardware. These discrete time intervals are called clock cycles (or
ticks, clock ticks, clock periods, clocks, cycles). Designers refer to the length of a
clock period both as the time for a complete clock cycle (e.g., 250 picoseconds, or
250 ps) and as the clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse of the
clock period. In the next subsection, we will formalize the relationship between the
clock cycles of the hardware designer and the seconds of the computer user.
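Because clock rate and clock period are inverses, converting between them is a single division. The short Python sketch below uses hypothetical helper names and checks the 250 ps and 4 GHz pair quoted above.

```python
import math


def clock_rate_hz(period_seconds: float) -> float:
    """Clock rate is the inverse of the clock period."""
    return 1.0 / period_seconds


def clock_period_seconds(rate_hz: float) -> float:
    """Clock period is the inverse of the clock rate."""
    return 1.0 / rate_hz


rate = clock_rate_hz(250e-12)        # 250 picoseconds per cycle
period = clock_period_seconds(4e9)   # 4 GHz

print(f"250 ps period -> {rate / 1e9:.2f} GHz")
print(f"4 GHz rate    -> {period * 1e12:.1f} ps")
```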

1. Suppose we know that an application that uses both personal mobile
devices and the Cloud is limited by network performance. For the following
changes, state whether only the throughput improves, both response time
and throughput improve, or neither improves.

a. An extra network channel is added between the PMD and the Cloud,
increasing the total network throughput and reducing the delay to obtain
network access (since there are now two channels).

b. The networking software is improved, thereby reducing the network
communication delay, but not increasing throughput.

c. More memory is added to the computer.

2. Computer C’s performance is 4 times as fast as the performance of computer
B, which runs a given application in 28 seconds. How long will computer C
take to run that application?

CPU Performance and Its Factors
Users and designers often examine performance using different metrics. If we could
relate these different metrics, we could determine the effect of a design change
on the performance as experienced by the user. Since we are confining ourselves
to CPU performance at this point, the bottom-line performance measure is CPU

clock cycle Also called
tick, clock tick, clock
period, clock, or cycle.
The time for one clock
period, usually of the
processor clock, which
runs at a constant rate.

clock period The length
of each clock cycle.

Check
Yourself


execution time. A simple formula relates the most basic metrics (clock cycles and
clock cycle time) to CPU time:

$$\text{CPU execution time for a program} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

Alternatively, because clock rate and clock cycle time are inverses,
$$\text{CPU execution time for a program} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

This formula makes it clear that the hardware designer can improve performance
by reducing the number of clock cycles required for a program or the length of
the clock cycle. As we will see in later chapters, the designer often faces a trade-off
between the number of clock cycles needed for a program and the length of each
cycle. Many techniques that decrease the number of clock cycles may also increase
the clock cycle time.

Improving Performance

Our favorite program runs in 10 seconds on computer A, which has a 2 GHz
clock. We are trying to help a computer designer build a computer, B, which will
run this program in 6 seconds. The designer has determined that a substantial
increase in the clock rate is possible, but this increase will affect the rest of the
CPU design, causing computer B to require 1.2 times as many clock cycles as
computer A for this program. What clock rate should we tell the designer to
target?

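Let’s first find the number of clock cycles required for the program on A:

$$\text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{Clock rate}_A}$$

$$10\ \text{seconds} = \frac{\text{CPU clock cycles}_A}{2 \times 10^9\ \frac{\text{cycles}}{\text{second}}}$$

$$\text{CPU clock cycles}_A = 10\ \text{seconds} \times 2 \times 10^9\ \frac{\text{cycles}}{\text{second}} = 20 \times 10^9\ \text{cycles}$$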

EXAMPLE

ANSWER

CPU time for B can be found using this equation:

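$$\text{CPU time}_B = \frac{1.2 \times \text{CPU clock cycles}_A}{\text{Clock rate}_B}$$

$$6\ \text{seconds} = \frac{1.2 \times 20 \times 10^9\ \text{cycles}}{\text{Clock rate}_B}$$

$$\text{Clock rate}_B = \frac{1.2 \times 20 \times 10^9\ \text{cycles}}{6\ \text{seconds}} = \frac{0.2 \times 20 \times 10^9\ \text{cycles}}{\text{second}} = \frac{4 \times 10^9\ \text{cycles}}{\text{second}} = 4\ \text{GHz}$$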

To run the program in 6 seconds, B must have twice the clock rate of A.
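The steps of this example can be replayed in Python as a check. The variable and function names are illustrative; the numbers (10 seconds, 2 GHz, 1.2 times as many cycles, 6 seconds) come from the example.

```python
import math


def cpu_clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """CPU clock cycles = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz


# Computer A: 10 seconds at 2 GHz.
cycles_a = cpu_clock_cycles(10.0, 2e9)

# Computer B needs 1.2 times as many cycles and must finish in 6 seconds.
cycles_b = 1.2 * cycles_a
target_time_b = 6.0
clock_rate_b = cycles_b / target_time_b  # clock rate = cycles / CPU time

print(f"cycles on A: {cycles_a:.3e}")
print(f"required clock rate for B: {clock_rate_b / 1e9:.2f} GHz")
```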

Instruction Performance
The performance equations above did not include any reference to the number of
instructions needed for the program. However, since the compiler clearly generated
instructions to execute, and the computer had to execute the instructions to run
the program, the execution time must depend on the number of instructions in a
program. One way to think about execution time is that it equals the number of
instructions executed multiplied by the average time per instruction. Therefore, the
number of clock cycles required for a program can be written as

$$\text{CPU clock cycles} = \text{Instructions for a program} \times \text{Average clock cycles per instruction}$$

The term clock cycles per instruction, which is the average number of clock
cycles each instruction takes to execute, is often abbreviated as CPI. Since different
instructions may take different amounts of time depending on what they do, CPI is
an average of all the instructions executed in the program. CPI provides one way of
comparing two different implementations of the same instruction set architecture,
since the number of instructions executed for a program will, of course, be the
same.

Using the Performance Equation

EXAMPLE

Suppose we have two implementations of the same instruction set architecture.
Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program,
and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same
program. Which computer is faster for this program and by how much?

clock cycles per instruction (CPI) Average number of clock cycles per
instruction for a program or program fragment.

ANSWER

We know that each computer executes the same number of instructions for
the program; let’s call this number I. First, find the number of processor clock
cycles for each computer:

$$\text{CPU clock cycles}_A = I \times 2.0$$

$$\text{CPU clock cycles}_B = I \times 1.2$$

Now we can compute the CPU time for each computer:

$$\text{CPU time}_A = \text{CPU clock cycles}_A \times \text{Clock cycle time} = I \times 2.0 \times 250 \text{ ps} = 500 \times I \text{ ps}$$

Likewise, for B:

$$\text{CPU time}_B = I \times 1.2 \times 500 \text{ ps} = 600 \times I \text{ ps}$$

Clearly, computer A is faster. The amount faster is given by the ratio of the
execution times:

$$\frac{\text{CPU performance}_A}{\text{CPU performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{600 \times I \text{ ps}}{500 \times I \text{ ps}} = 1.2$$

We can conclude that computer A is 1.2 times as fast as computer B for this
program.

The Classic CPU Performance Equation
We can now write this basic performance equation in terms of instruction count
(the number of instructions executed by the program), CPI, and clock cycle time:

$$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$

or, since the clock rate is the inverse of clock cycle time:

$$\text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}$$

These formulas are particularly useful because they separate the three key factors
that affect performance. We can use these formulas to compare two different
implementations or to evaluate a design alternative if we know its impact on these
three parameters.
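
As a rough illustration, both forms of the classic CPU performance equation can be written as small Python functions and applied to the two implementations compared earlier (250 ps with a CPI of 2.0, and 500 ps with a CPI of 1.2). The function names and the instruction count of one million are assumptions made for this sketch; any instruction count gives the same ratio because it cancels.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_time: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time (seconds)."""
    return instruction_count * cpi * clock_cycle_time


def cpu_time_from_rate(instruction_count: float, cpi: float, clock_rate: float) -> float:
    """Same equation, using clock rate (Hz) = 1 / clock cycle time."""
    return instruction_count * cpi / clock_rate


I = 1_000_000                             # hypothetical instruction count
time_a = cpu_time(I, cpi=2.0, clock_cycle_time=250e-12)
time_b = cpu_time(I, cpi=1.2, clock_cycle_time=500e-12)

# Performance ratio A/B equals the execution time ratio B/A.
speedup_a_over_b = time_b / time_a

# The clock-rate form gives the same time: 250 ps corresponds to 4 GHz.
time_a_alt = cpu_time_from_rate(I, cpi=2.0, clock_rate=4e9)

print(f"A: {time_a:.3e} s, B: {time_b:.3e} s")
print(f"A is {speedup_a_over_b:.2f} times as fast as B")
```

Keeping the three factors as separate arguments mirrors the point of the equation: a design change can be evaluated by adjusting only the factors it affects.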

instruction count The number of instructions executed by the program.

Comparing Code Segments

EXAMPLE

A compiler designer is trying to decide between two code sequences for a
particular computer. The hardware designers have supplied the following facts:

         CPI for each instruction class
         A       B       C
CPI      1       2       3

For a particular high-level language statement, the compiler writer is
considering two code sequences that require the following instruction counts:

                 Instruction counts for each instruction class
Code sequence    A       B       C
1                2       1       2
2                4       1       1

Which code sequence executes the most instructions? Which will be faster?
What is the CPI for each sequence?

ANSWER

Sequence 1 executes 2 + 1 + 2 = 5 instructions. Sequence 2 executes 4 + 1 +
1 = 6 instructions. Therefore, sequence 1 executes fewer instructions.

We can use the equation for CPU clock cycles based on instruction count
and CPI to find the total number of clock cycles for each sequence:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i)$$

This yields

$$\text{CPU clock cycles}_1 = (2 \times 1) + (1 \times 2) + (2 \times 3) = 2 + 2 + 6 = 10 \text{ cycles}$$

$$\text{CPU clock cycles}_2 = (4 \times 1) + (1 \times 2) + (1 \times 3) = 4 + 2 + 3 = 9 \text{ cycles}$$

So code sequence 2 is faster, even though it executes one extra instruction. Since
code sequence 2 takes fewer overall clock cycles but has more instructions, it
must have a lower CPI. The CPI values can be computed by

$$\text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}}$$

$$\text{CPI}_1 = \frac{\text{CPU clock cycles}_1}{\text{Instruction count}_1} = \frac{10}{5} = 2.0$$

$$\text{CPI}_2 = \frac{\text{CPU clock cycles}_2}{\text{Instruction count}_2} = \frac{9}{6} = 1.5$$
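
The code-sequence comparison above can be reproduced with a short Python sketch. The dictionary layout and the function names `clock_cycles` and `average_cpi` are our own choices for illustration; the CPI values and instruction counts are the ones given in the example.

```python
# CPI for each instruction class, as supplied by the hardware designers.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two candidate code sequences.
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def clock_cycles(counts: dict, cpi_by_class: dict) -> int:
    """Sum of CPI_i x C_i over all instruction classes."""
    return sum(cpi_by_class[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict, cpi_by_class: dict) -> float:
    """Total clock cycles divided by total instruction count."""
    return clock_cycles(counts, cpi_by_class) / sum(counts.values())


for seq_id, counts in SEQUENCES.items():
    print(
        f"Sequence {seq_id}: {sum(counts.values())} instructions, "
        f"{clock_cycles(counts, CPI_BY_CLASS)} cycles, "
        f"CPI = {average_cpi(counts, CPI_BY_CLASS):.1f}"
    )
```

The sequence with more instructions needs fewer cycles because its mix leans toward the cheaper instruction class.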

The BIG Picture

Figure 1.15 shows the basic measurements at different levels in the
computer and what is being measured in each case. We can see how these
factors are combined to yield execution time measured in seconds per
program:

$$\text{Time} = \text{Seconds/Program} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Always bear in mind that the only complete and reliable measure of
computer performance is time. For example, changing the instruction set
to lower the instruction count may lead to an organization with a slower
clock cycle time or higher CPI that offsets the improvement in instruction
count. Similarly, because CPI depends on type of instructions executed,
the code that executes the fewest number of instructions may not be the
fastest.

Components of performance            Units of measure
CPU execution time for a program     Seconds for the program
Instruction count                    Instructions executed for the program
Clock cycles per instruction (CPI)   Average number of clock cycles per instruction
Clock cycle time                     Seconds per clock cycle

FIGURE 1.15 The basic components of performance and how each is measured.

How can we determine the value of these factors in the performance equation?
We can measure the CPU execution time by running the program, and the clock
cycle time is usually published as part of the documentation for a computer. The
instruction count and CPI can be more difficult to obtain. Of course, if we know
the clock rate and CPU execution time, we need only one of the instruction count
or the CPI to determine the other.
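
The last point follows directly from rearranging the performance equation. The sketch below shows both rearrangements; the function names and the sample measurements (10 seconds, 2 GHz, 10 billion instructions) are hypothetical values chosen for illustration, not data from the text.

```python
def cpi_from_measurements(cpu_time_s: float, clock_rate_hz: float,
                          instruction_count: float) -> float:
    """CPI = (CPU time x clock rate) / instruction count."""
    return cpu_time_s * clock_rate_hz / instruction_count


def instruction_count_from_measurements(cpu_time_s: float, clock_rate_hz: float,
                                        cpi: float) -> float:
    """Instruction count = (CPU time x clock rate) / CPI."""
    return cpu_time_s * clock_rate_hz / cpi


# Hypothetical measurements: 10 s of CPU time on a 2 GHz clock.
cpu_time_s = 10.0
clock_rate_hz = 2e9

# If a profiler reports 10 billion instructions, the CPI follows.
cpi = cpi_from_measurements(cpu_time_s, clock_rate_hz, instruction_count=10e9)

# Conversely, a known CPI gives back the instruction count.
count = instruction_count_from_measurements(cpu_time_s, clock_rate_hz, cpi)

print(f"CPI = {cpi:.2f}, instruction count = {count:.3g}")
```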

We can measure the instruction count by using software tools that profile the
execution or by using a simulator of the architecture. Alternatively, we can use
hardware counters, which are included in most processors, to record a variety of
measurements, including the number of instructions executed, the average CPI,
and often, the sources of performance loss. Since the instruction count depends
on the architecture, but not on the exact implementation, we can measure the
instruction count without knowing all the details of the implementation. The CPI,
however, depends on a wide variety of design details in the computer, including
both the memory system and the processor structure (as we will see in Chapter 4
and Chapter 5), as well as on the mix of instruction types executed in an application.
Thus, CPI varies by application, as well as among implementations with the same
instruction set.

The above example shows the danger of using only one factor (instruction count)
to assess performance. When comparing two computers, you must look at all three
components, which combine to form execution time. If some of the factors are
identical, like the clock rate in the above example, performance can be determined
by comparing all the nonidentical factors. Since CPI varies by instruction mix,
both instruction count and CPI must be compared, even if clock rates are identical.
Several exercises at the end of this chapter ask you to evaluate a series of computer
and compiler enhancements that affect clock rate, CPI, and instruction count. In
Section 1.10, we’ll examine a common performance measurement that does not
incorporate all the terms and can thus be misleading.

Understanding Program Performance

The performance of a program depends on the algorithm, the language, the
compiler, the architecture, and the actual hardware. The following table summarizes
how these components affect the factors in the CPU performance equation.

Hardware or software component: Algorithm
Affects what? Instruction count, possibly CPI
How? The algorithm determines the number of source program
instructions executed and hence the number of processor
instructions executed. The algorithm may also affect the CPI,
by favoring slower or faster instructions. For example, if the
algorithm uses more divides, it will tend to have a higher CPI.

Hardware or software component: Programming language
Affects what? Instruction count, CPI
How? The programming language certainly affects the instruction
count, since statements in the language are translated to
processor instructions, which determine instruction count. The
language may also affect the CPI because of its features; for
example, a language with heavy support for data abstraction
(e.g., Java) will require indirect calls, which will use higher CPI
instructions.

Hardware or software component: Compiler
Affects what? Instruction count, CPI
How? The efficiency of the compiler affects both the instruction
count and average cycles per instruction, since the compiler
determines the translation of the source language instructions
into computer instructions. The compiler’s role can be very
complex and affect the CPI in complex ways.

Hardware or software component: Instruction set architecture
Affects what? Instruction count, clock rate, CPI
How? The instruction set architecture affects all three aspects of
CPU performance, since it affects the instructions needed for a
function, the cost in cycles of each instruction, and the overall
clock rate of the processor.

Elaboration: Although you might expect that the minimum CPI is 1.0, as we’ll see in
Chapter 4, some processors fetch and execute multiple instructions per clock cycle. To
reflect that approach, some designers invert CPI to talk about IPC, or instructions per
clock cycle. If a processor executes on average 2 instructions per clock cycle, then it has
an IPC of 2 and hence a CPI of 0.5.

instruction mix A measure of the dynamic frequency of instructions across
one or many programs.

[Figure: Clock Rate (MHz) and Power (watts) plotted for the 80286 (1982),
80386 (1985), 80486 (1989), Pentium (1993), Pentium Pro (1997), Pentium 4
Willamette (2001), Pentium 4 Prescott (2004), Core 2 Kentsfield (2007),
Core i5 Clarkdale (2010), and Core i5 Ivy Bridge (2012).]

FIGURE 1.16 Clock rate and Power for Intel x86 microprocessors over eight generations
and 25 years. The Pentium 4 made a dramatic jump in clock rate and power but less so in performance. The
Prescott thermal problems led to the abandonment of the Pentium 4 line. The Core 2 line reverts to a simpler
pipeline with lower clock rates and multiple processors per chip. The Core i5 pipelines follow in its footsteps.

Elaboration: Although clock cycle time has traditionally been fixed, to save energy
or temporarily boost performance, today’s processors can vary their clock rates, so we
would need to use the average clock rate for a program. For example, the Intel Core i7
will temporarily increase clock rate by about 10% until the chip gets too warm. Intel calls
this Turbo mode.

Check Yourself

A given application written in Java runs 15 seconds on a desktop processor. A new
Java compiler is released that requires only 0.6 as many instructions as the old
compiler. Unfortunately, it increases the CPI by a factor of 1.1. How fast can we
expect the application to run using this new compiler? Pick the right answer from
the three choices below:

a. $\frac{15 \times 0.6}{1.1} = 8.2$ sec

b. $15 \times 0.6 \times 1.1 = 9.9$ sec

c. $\frac{15 \times 1.1}{0.6} = 27.5$ sec

1.7 The Power Wall

Figure 1.16 shows the increase in clock rate and power of eight generations of Intel
microprocessors over 30 years. Both clock rate and power increased rapidly for
decades, and then flattened off recently. The reason they grew together is that they
are correlated, and the reason for their recent slowing is that we have run into the
practical power limit for cooling commodity microprocessors.

Although power provides a limit to what we can cool, in the PostPC Era the
really critical resource is energy. Battery life can trump performance in the personal
mobile device, and the architects of warehouse scale computers try to reduce the
costs of powering and cooling 100,000 servers as the costs are high at this scale. Just
as measuring time in seconds is a safer measure of program performance than a
rate like MIPS (see Section 1.10), the energy metric joules is a better measure than
a power rate like watts, which is just joules/second.

The dominant technology for integrated circuits is called CMOS (complementary
metal oxide semiconductor). For CMOS, the primary source of energy consumption
is so-called dynamic energy, that is, energy that is consumed when transistors
switch states from 0 to 1 and vice versa. The dynamic energy depends on the
capacitive loading of each transistor and the voltage applied:

$$\text{Energy} \propto \text{Capacitive load} \times \text{Voltage}^2$$

This equation is the energy of a pulse during the logic transition of 0 → 1 → 0 or
1 → 0 → 1. The energy of a single transition is then

$$\text{Energy} \propto 1/2 \times \text{Capacitive load} \times \text{Voltage}^2$$

The power required per transistor is just the product of energy of a transition and
the frequency of transitions:

$$\text{Power} \propto 1/2 \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$

Frequency switched is a function of the clock rate. The capacitive load per transistor
is a function of both the number of transistors connected to an output (called the
fanout) and the technology, which determines the capacitance of both wires and
transistors.

With regard to Figure 1.16, how could clock rates grow by a factor of 1000
while power grew by only a factor of 30? Energy and thus power can be reduced by
lowering the voltage, which occurred with each new generation of technology, and
power is a function of the voltage squared. Typically, the voltage was reduced about
15% per generation. In 20 years, voltages have gone from 5 V to 1 V, which is why
the increase in power is only 30 times.

Relative Power

EXAMPLE

Suppose we developed a new, simpler processor that has 85% of the capacitive
load of the more complex older processor. Further, assume that it has adjustable
voltage so that it can reduce voltage 15% compared to the older processor, which
results in a 15% shrink in frequency. What is the impact on dynamic power?

ANSWER

$$\frac{\text{Power}_{\text{new}}}{\text{Power}_{\text{old}}} = \frac{\langle \text{Capacitive load} \times 0.85 \rangle \times \langle \text{Voltage} \times 0.85 \rangle^2 \times \langle \text{Frequency switched} \times 0.85 \rangle}{\text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}}$$

Thus the power ratio is

$$0.85^4 = 0.52$$

Hence, the new processor uses about half the power of the old processor.
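
The dynamic power relation and the ratio in this example can be checked numerically. The sketch below assumes normalized units (load, voltage, and frequency of the old processor all set to 1.0), which is our simplification; since the relation is a proportionality, only the ratio between the two processors is meaningful. The function name `dynamic_power` is hypothetical.

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Proportional dynamic power: 1/2 x capacitive load x voltage^2 x frequency switched.
    The result is in arbitrary units; use it only to form ratios."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


# Old processor in normalized units (an assumption for this sketch).
power_old = dynamic_power(capacitive_load=1.0, voltage=1.0, frequency=1.0)

# New processor: 85% of the load, 15% lower voltage, 15% lower frequency.
power_new = dynamic_power(capacitive_load=0.85, voltage=0.85, frequency=0.85)

ratio = power_new / power_old
print(f"Power ratio new/old = {ratio:.2f}")
```

Because voltage enters squared, the same 15% reduction contributes twice to the ratio, which is why lowering voltage was such an effective lever in earlier technology generations.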

The problem today is that further lowering of the voltage appears to make the
transistors too leaky, like water faucets that cannot be completely shut off. Even
today about 40% of the power consumption in server chips is due to leakage. If
transistors started leaking more, the whole process could become unwieldy.

To try to address the power problem, designers have already attached large
devices to increase cooling, and they turn off parts of the chip that are not used in
a given clock cycle. Although there are many more expensive ways to cool chips
and thereby raise their power to, say, 300 watts, these techniques are generally
too expensive for personal computers and even servers, not to mention personal
mobile devices.

Since computer designers slammed into a power wall, they needed a new way
forward. They chose a different path from the way they designed microprocessors
for their first 30 years.

Elaboration: Although dynamic energy is the primary source of energy consumption
in CMOS, static energy consumption occurs because of leakage current that flows even
when a transistor is off. In servers, leakage is typically responsible for 40% of the energy
consumption. Thus, increasing the number of transistors increases power dissipation,
even if the transistors are always off. A variety of design techniques and technology
innovations are being deployed to control leakage, but it’s hard to lower voltage further.

Elaboration: Power is a challenge for integrated circuits for two reasons. First, power
must be brought in and distributed around the chip; modern microprocessors use
hundreds of pins just for power and ground! Similarly, multiple levels of chip interconnect
are used solely for power and ground distribution to portions of the chip. Second, power
is dissipated as heat and must be removed. Server chips can burn more than 100 watts,
and cooling the chip and the surrounding system is a major expense in Warehouse Scale
Computers (see Chapter 6).

1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors

The power limit has forced a dramatic change in the design of microprocessors.
Figure 1.17 shows the improvement in response time of programs for desktop
microprocessors over time. Since 2002, the rate has slowed from a factor of 1.5 per
year to a factor of 1.2 per year.

Rather than continuing to decrease the response time of a single program
running on the single processor, as of 2006 all desktop and server companies are
shipping microprocessors with multiple processors per chip, where the benefit is
often more on throughput than on response time. To reduce confusion between the
words processor and microprocessor, companies refer to processors as “cores,” and
such microprocessors are generically called multicore microprocessors. Hence, a
“quadcore” microprocessor is a chip that contains four processors or four cores.

In the past, programmers could rely on innovations in hardware, architecture,
and compilers to double performance of their programs every 18 months without
having to change a line of code. Today, for programmers to get significant
improvement in response time, they need to rewrite their programs to take
advantage of multiple processors. Moreover, to get the historic benefit of running
faster on new microprocessors, programmers will have to continue to improve
performance of their code as the number of cores increases.

To reinforce how the software and hardware systems work hand in hand, we use
a special section, Hardware/Software Interface, throughout the book, with the first
one appearing below. These elements summarize important insights at this critical
interface.

Hardware/Software Interface

Parallelism has always been critical to performance in computing, but it was
often hidden. Chapter 4 will explain pipelining, an elegant technique that runs
programs faster by overlapping the execution of instructions. This is one example of
instruction-level parallelism, where the parallel nature of the hardware is abstracted
away so the programmer and compiler can think of the hardware as executing
instructions sequentially.

Forcing programmers to be aware of the parallel hardware and to explicitly
rewrite their programs to be parallel had been the “third rail” of computer
architecture, for companies in the past that depended on such a change in behavior
failed (see Section 6.15). From this historical perspective, it’s startling that the
whole IT industry has bet its future that programmers will finally successfully
switch to explicitly parallel programming.

Up to now, most software has been like music written for a solo performer; with
the current generation of chips we’re getting a little experience with duets and
quartets and other small ensembles; but scoring a work for large orchestra and
chorus is a different kind of challenge.
Brian Hayes, Computing in a Parallel Universe, 2007.

[Figure: Performance (vs. VAX-11/780) on a logarithmic axis, plotted by year
from 1978 to 2014, with three growth segments labeled 25%/year, 52%/year, and
22%/year. The plotted machines range from the VAX-11/780 (5 MHz) through
Intel Core i7 and Xeon processors with 4 to 6 cores.]

FIGURE 1.17 Growth in processor performance since the mid-1980s. This chart plots performance relative to the VAX 11/780
as measured by the SPECint benchmarks (see Section 1.10). Prior to the mid-1980s, processor performance growth was largely
technology-driven and averaged about 25% per year. The increase in growth to about 52% since then is attributable to more advanced architectural and
organizational ideas. The higher annual performance improvement of 52% since the mid-1980s meant performance was about a factor of seven
higher in 2002 than it would have been had it stayed at 25%. Since 2002, the limits of power, available instruction-level parallelism, and long
memory latency have slowed uniprocessor performance recently, to about 22% per year.

Why has it been so hard for programmers to write explicitly parallel programs?
The first reason is that parallel programming is by definition performance
programming, which increases the difficulty of programming. Not only does the
program need to be correct, solve an important problem, and provide a useful
interface to the people or other programs that invoke it, the program must also be
fast. Otherwise, if you don’t need performance, just write a sequential program.

The second reason is that fast for parallel hardware means that the programmer
must divide an application so that each processor has roughly the same amount to
do at the same time, and that the overhead of scheduling and coordination doesn’t
fritter away the potential performance benefits of parallelism.

As an analogy, suppose the task was to write a newspaper story. Eight reporters
working on the same story could potentially write a story eight times faster. To achieve
this increased speed, one would need to break up the task so that each reporter had
something to do at the same time. Thus, we must schedule the sub-tasks. If anything
went wrong and just one reporter took longer than the seven others did, then the
benefits of having eight writers would be diminished. Thus, we must balance the
load evenly to get the desired speedup. Another danger would be if reporters had to
spend a lot of time talking to each other to write their sections. You would also fall
short if one part of the story, such as the conclusion, couldn’t be written until all of
the other parts were completed. Thus, care must be taken to reduce communication
and synchronization overhead. For both this analogy and parallel programming, the
challenges include scheduling, load balancing, time for synchronization, and overhead
for communication between the parties. As you might guess, the challenge is stiffer with
more reporters for a newspaper story and more processors for parallel programming.
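
The reporter analogy can be made concrete with a toy model. The sketch below is our own simplification, not a model from the text: it assumes every worker starts at the same time, that elapsed time is set by the slowest worker, and that communication and synchronization add a single fixed overhead. The function names and the task times (in arbitrary hours) are hypothetical.

```python
def parallel_time(task_times: list, overhead: float = 0.0) -> float:
    """Elapsed time when each worker handles one task concurrently:
    the slowest worker finishes last, plus any coordination overhead."""
    return max(task_times) + overhead


def speedup(task_times: list, overhead: float = 0.0) -> float:
    """Sequential time (one worker doing everything) over parallel time."""
    return sum(task_times) / parallel_time(task_times, overhead)


balanced = [1.0] * 8              # eight reporters with equal sections
unbalanced = [1.0] * 7 + [2.0]    # one reporter takes twice as long

print(f"Balanced load:           {speedup(balanced):.2f}x")
print(f"One slow reporter:       {speedup(unbalanced):.2f}x")
print(f"Balanced, 1 h overhead:  {speedup(balanced, overhead=1.0):.2f}x")
```

Even in this crude model, a single slow worker or a modest amount of coordination time roughly halves the benefit of having eight workers.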

To reflect this sea change in the industry, the next five chapters in this edition of the
book each have a section on the implications of the parallel revolution to that chapter:

■ Chapter 2, Section 2.11: Parallelism and Instructions: Synchronization. Usually
independent parallel tasks need to coordinate at times, such as to say when
they have completed their work. This chapter explains the instructions used
by multicore processors to synchronize tasks.

■ Chapter 3, Section 3.6: Parallelism and Computer Arithmetic: Subword
Parallelism. Perhaps the simplest form of parallelism to build involves
computing on elements in parallel, such as when multiplying two vectors.
Subword parallelism takes advantage of the resources supplied by Moore’s
Law to provide wider arithmetic units that can operate on many operands
simultaneously.

■ Chapter 4, Section 4.10: Parallelism via Instructions. Given the difficulty of
explicitly parallel programming, tremendous effort was invested in the 1990s
in having the hardware and the compiler uncover implicit parallelism, initially
via pipelining. This chapter describes some of these aggressive techniques,
including fetching and executing multiple instructions simultaneously and
guessing on the outcomes of decisions, and executing instructions speculatively
using prediction.

■ Chapter 5, Section 5.10: Parallelism and Memory Hierarchies: Cache
Coherence. One way to lower the cost of communication is to have all
processors use the same address space, so that any processor can read or
write any data. Given that all processors today use caches to keep a temporary
copy of the data in faster memory near the processor, it’s easy to imagine that
parallel programming would be even more difficult if the caches associated
with each processor had inconsistent values of the shared data. This chapter
describes the mechanisms that keep the data in all caches consistent.

■ Chapter 5, Section 5.11: Parallelism and Memory Hierarchy: Redundant
Arrays of Inexpensive Disks. This section describes how using many disks
in conjunction can offer much higher throughput, which was the original
inspiration of Redundant Arrays of Inexpensive Disks (RAID). The real
popularity of RAID proved to be the much greater dependability offered
by including a modest number of redundant disks. The section explains the
differences in performance, cost, and dependability between the different
RAID levels.

In addition to these sections, there is a full chapter on parallel processing. Chapter 6
goes into more detail on the challenges of parallel programming; presents the
two contrasting approaches to communication of shared addressing and explicit
message passing; describes a restricted model of parallelism that is easier to
program; discusses the difficulty of benchmarking parallel processors; introduces
a new simple performance model for multicore microprocessors; and, finally,
describes and evaluates four examples of multicore microprocessors using this
model.

As mentioned above, Chapters 3 to 6 use matrix vector multiply as a running
example to show how each type of parallelism can significantly increase performance.

Appendix C describes an increasingly popular hardware component that
is included with desktop computers, the graphics processing unit (GPU). Invented
to accelerate graphics, GPUs are becoming programming platforms in their
own right. As you might expect, given these times, GPUs rely on parallelism.
Appendix C describes the NVIDIA GPU and highlights parts of its parallel
programming environment.

1.9 Real Stuff: Benchmarking the Intel Core i7

Each chapter has a section entitled “Real Stuff” that ties the concepts in the book
with a computer you may use every day. These sections cover the technology
underlying modern computers. For this first “Real Stuff” section, we look at
how integrated circuits are manufactured and how performance and power are
measured, with the Intel Core i7 as the example.

SPEC CPU Benchmark
A computer user who runs the same programs day in and day out would be the
perfect candidate to evaluate a new computer. The set of programs run would form
a workload. To evaluate two computer systems, a user would simply compare
the execution time of the workload on the two computers. Most users, however,
are not in this situation. Instead, they must rely on other methods that measure
the performance of a candidate computer, hoping that the methods will reflect
how well the computer will perform with the user’s workload. This alternative is
usually followed by evaluating the computer using a set of benchmarks: programs
specifically chosen to measure performance. The benchmarks form a workload that
the user hopes will predict the performance of the actual workload. As we noted
above, to make the common case fast, you first need to know accurately which case
is common, so benchmarks play a critical role in computer architecture.

SPEC (System Performance Evaluation Cooperative) is an effort funded and
supported by a number of computer vendors to create standard sets of benchmarks
for modern computer systems. In 1989, SPEC originally created a benchmark
set focusing on processor performance (now called SPEC89), which has evolved
through five generations. The latest is SPEC CPU2006, which consists of a set of 12
integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).
The integer benchmarks vary from part of a C compiler to a chess program to a
quantum computer simulation. The floating-point benchmarks include structured
grid codes for finite element modeling, particle method codes for molecular
dynamics, and sparse linear algebra codes for fluid dynamics.

I thought [computers] would be a universally applicable idea, like a book is. But
I didn’t think it would develop as fast as it did, because I didn’t envision we’d
be able to get as many parts on a chip as we finally got. The transistor came along
unexpectedly. It all happened much faster than we expected.
J. Presper Eckert, coinventor of ENIAC, speaking in 1991

workload A set of programs run on a computer that is either the actual
collection of applications run by a user or constructed from real programs to
approximate such a mix. A typical workload specifies both the programs and the
relative frequencies.

benchmark A program selected for use in comparing computer performance.

Figure 1.18 describes the SPEC integer benchmarks and their execution time
on the Intel Core i7 and shows the factors that explain execution time: instruction
count, CPI, and clock cycle time. Note that CPI varies by more than a factor of 5.

To simplify the marketing of computers, SPEC decided to report a single number
to summarize all 12 integer benchmarks. Dividing the execution time of a reference
processor by the execution time of the measured computer normalizes the execution
time measurements; this normalization yields a measure, called the SPECratio, which
has the advantage that bigger numeric results indicate faster performance. That is,
the SPECratio is the inverse of execution time. A CINT2006 or CFP2006 summary
measurement is obtained by taking the geometric mean of the SPECratios.

Elaboration: When comparing two computers using SPECratios, use the geometric
mean so that it gives the same relative answer no matter what computer is used to
normalize the results. If we averaged the normalized execution time values with an
arithmetic mean, the results would vary depending on the computer we choose as the
reference.

FIGURE 1.18 SPECINTC2006 benchmarks running on a 2.66 GHz Intel Core i7 920. As the equation on page 35 explains,
execution time is the product of the three factors in this table: instruction count in billions, clocks per instruction (CPI), and clock cycle time in
nanoseconds. SPECratio is simply the reference time, which is supplied by SPEC, divided by the measured execution time. The single number
quoted as SPECINTC2006 is the geometric mean of the SPECratios.

Description                         Name         Instruction   CPI    Clock cycle time    Execution   Reference   SPECratio
                                                 count x 10^9         (seconds x 10^-9)   time        time
                                                                                          (seconds)   (seconds)
Interpreted string processing       perl         2252          0.60   0.376               508         9770        19.2
Block-sorting compression           bzip2        2390          0.70   0.376               629         9650        15.4
GNU C compiler                      gcc          794           1.20   0.376               358         8050        22.5
Combinatorial optimization          mcf          221           2.66   0.376               221         9120        41.2
Go game (AI)                        go           1274          1.10   0.376               527         10490       19.9
Search gene sequence                hmmer        2616          0.60   0.376               590         9330        15.8
Chess game (AI)                     sjeng        1948          0.80   0.376               586         12100       20.7
Quantum computer simulation         libquantum   659           0.44   0.376               109         20720       190.0
Video compression                   h264avc      3793          0.50   0.376               713         22130       31.0
Discrete event simulation library   omnetpp      367           2.10   0.376               290         6250        21.5
Games/path finding                  astar        1250          1.00   0.376               470         7020        14.9
XML parsing                         xalancbmk    1045          0.70   0.376               275         6900        25.1
Geometric mean                      –            –             –      –                   –           –           25.7


The formula for the geometric mean is

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

where Execution time ratio_i is the execution time, normalized to the reference computer,
for the ith program of a total of n in the workload, and

$$\prod_{i=1}^{n} a_i \quad \text{means the product} \quad a_1 \times a_2 \times \cdots \times a_n$$
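As a rough check on Figure 1.18, the sketch below recomputes execution time and SPECratio for each benchmark from the table's own columns and then takes the geometric mean of the printed SPECratios. The data are copied from the figure; the names `FIGURE_1_18`, `execution_time`, and `spec_ratio` are our own, and small differences from the printed values are expected because the table entries are rounded.

```python
from math import prod

CLOCK_CYCLE_NS = 0.376  # clock cycle time from Figure 1.18, in nanoseconds

# (name, instruction count x 10^9, CPI, reference time in s, printed SPECratio)
FIGURE_1_18 = [
    ("perl",       2252, 0.60,  9770,  19.2),
    ("bzip2",      2390, 0.70,  9650,  15.4),
    ("gcc",         794, 1.20,  8050,  22.5),
    ("mcf",         221, 2.66,  9120,  41.2),
    ("go",         1274, 1.10, 10490,  19.9),
    ("hmmer",      2616, 0.60,  9330,  15.8),
    ("sjeng",      1948, 0.80, 12100,  20.7),
    ("libquantum",  659, 0.44, 20720, 190.0),
    ("h264avc",    3793, 0.50, 22130,  31.0),
    ("omnetpp",     367, 2.10,  6250,  21.5),
    ("astar",      1250, 1.00,  7020,  14.9),
    ("xalancbmk",  1045, 0.70,  6900,  25.1),
]

def execution_time(ic_billions, cpi, cycle_ns=CLOCK_CYCLE_NS):
    """Seconds = (IC x 10^9) x CPI x (cycle x 10^-9 s); the powers of ten cancel."""
    return ic_billions * cpi * cycle_ns

def spec_ratio(reference_s, measured_s):
    """Reference time divided by measured execution time."""
    return reference_s / measured_s

def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))

for name, ic, cpi, ref, printed in FIGURE_1_18:
    t = execution_time(ic, cpi)
    print(f"{name:<11} time = {t:6.1f} s  SPECratio = {spec_ratio(ref, t):6.1f}"
          f"  (printed {printed})")

summary = geometric_mean([row[4] for row in FIGURE_1_18])
print(f"geometric mean of printed SPECratios = {summary:.1f}")
```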

SPEC Power Benchmark
Given the increasing importance of energy and power, SPEC added a benchmark
to measure power. It reports power consumption of servers at different workload
levels, divided into 10% increments, over a period of time. Figure 1.19 shows the
results for a server using Intel Nehalem processors similar to the above.

FIGURE 1.19 SPECpower_ssj2008 running on a dual socket 2.66 GHz Intel Xeon X5650
with 16 GB of DRAM and one 100 GB SSD disk.

Target load %   Performance (ssj_ops)   Average power (watts)
100%            865,618                 258
90%             786,688                 242
80%             698,051                 224
70%             607,826                 204
60%             521,391                 185
50%             436,757                 170
40%             345,919                 157
30%             262,071                 146
20%             176,061                 135
10%             86,784                  121
0%              0                       80
Overall sum     4,787,166               1922

∑ssj_ops / ∑power = 2490

SPECpower started with another SPEC benchmark for Java business applications
(SPECJBB2005), which exercises the processors, caches, and main memory as well
as the Java virtual machine, compiler, garbage collector, and pieces of the operating
system. Performance is measured in throughput, and the units are business
operations per second. Once again, to simplify the marketing of computers, SPEC
boils these numbers down to a single number, called “overall ssj_ops per watt.” The
formula for this single summarizing metric is

$$\text{overall ssj\_ops per watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Bigg/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$

where ssj_ops_i is performance at each 10% increment and power_i is power
consumed at each performance level.
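The summary metric can be reproduced from the rows of Figure 1.19. The following is a minimal sketch; the data are copied from the figure, while the names `FIGURE_1_19` and `overall_ssj_ops_per_watt` are our own and not part of any SPEC tool.

```python
# (target load %, ssj_ops, average power in watts), copied from Figure 1.19
FIGURE_1_19 = [
    (100, 865_618, 258),
    (90,  786_688, 242),
    (80,  698_051, 224),
    (70,  607_826, 204),
    (60,  521_391, 185),
    (50,  436_757, 170),
    (40,  345_919, 157),
    (30,  262_071, 146),
    (20,  176_061, 135),
    (10,   86_784, 121),
    (0,         0,  80),
]

def overall_ssj_ops_per_watt(rows):
    """Sum of ssj_ops over all load levels divided by the sum of average power."""
    total_ops = sum(ops for _, ops, _ in rows)
    total_watts = sum(watts for _, _, watts in rows)
    return total_ops / total_watts

print(overall_ssj_ops_per_watt(FIGURE_1_19))
```

Note that the metric is a ratio of sums, not an average of the per-level ratios, so the idle (0%) level contributes power to the denominator but no operations to the numerator.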

1.10 Fallacies and Pitfalls

The purpose of a section on fallacies and pitfalls, which will be found in every
chapter, is to explain some commonly held misconceptions that you might
encounter. We call them fallacies. When discussing a fallacy, we try to give a
counterexample. We also discuss pitfalls, or easily made mistakes. Often pitfalls are
generalizations of principles that are only true in a limited context. The purpose
of these sections is to help you avoid making these mistakes in the computers you
may design or use. Cost/performance fallacies and pitfalls have ensnared many a
computer architect, including us. Accordingly, this section suffers no shortage of
relevant examples. We start with a pitfall that traps many designers and reveals an
important relationship in computer design.

Pitfall: Expecting the improvement of one aspect of a computer to increase overall
performance by an amount proportional to the size of the improvement.

The great idea of making the common case fast has a demoralizing corollary
that has plagued designers of both hardware and software. It reminds us that the
opportunity for improvement is affected by how much time the event consumes.

A simple design problem illustrates it well. Suppose a program runs in 100
seconds on a computer, with multiply operations responsible for 80 seconds of this
time. How much do I have to improve the speed of multiplication if I want my
program to run five times faster?

The execution time of the program after making the improvement is given by
the following simple equation known as Amdahl’s Law:

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

For this problem:

$$\text{Execution time after improvement} = \frac{80 \text{ seconds}}{n} + (100 - 80 \text{ seconds})$$

Science must begin with myths, and the criticism of myths.
Sir Karl Popper, The Philosophy of Science, 1957

Amdahl’s Law A rule stating that the performance enhancement possible
with a given improvement is limited by the amount that the improved feature
is used. It is a quantitative version of the law of diminishing returns.


Since we want the performance to be five times faster, the new execution time
should be 20 seconds, giving

$$20 \text{ seconds} = \frac{80 \text{ seconds}}{n} + 20 \text{ seconds}$$

$$0 = \frac{80 \text{ seconds}}{n}$$

That is, there is no amount by which we can enhance multiply to achieve a fivefold
increase in performance, if multiply accounts for only 80% of the workload. The
performance enhancement possible with a given improvement is limited by the amount
that the improved feature is used. In everyday life this concept also yields what we call
the law of diminishing returns.
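The multiply example can be sketched in a few lines. The function names `time_after_improvement` and `required_improvement` are our own; the second simply solves Amdahl's Law for the amount of improvement n and returns `None` when no finite n can reach the target.

```python
def time_after_improvement(affected_s, unaffected_s, improvement):
    """Amdahl's Law: affected time shrinks by `improvement`, the rest does not."""
    return affected_s / improvement + unaffected_s

def required_improvement(affected_s, unaffected_s, target_s):
    """Improvement factor n needed to reach `target_s`, or None if unreachable."""
    remaining = target_s - unaffected_s  # time budget left for the affected part
    if remaining <= 0:
        return None  # the unaffected time alone already uses up the target
    return affected_s / remaining

# 100-second program, 80 seconds of which are multiplies.
print(time_after_improvement(80, 20, 4))    # multiply made 4 times faster
print(required_improvement(80, 20, 50))     # n needed to run 2 times faster
print(required_improvement(80, 20, 20))     # n needed to run 5 times faster
```

Under these definitions the last call returns `None`: the 20 seconds not spent in multiply already equal the 20-second target, which is the point made in the text.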

We can use Amdahl’s Law to estimate performance improvements when we
know the time consumed for some function and its potential speedup. Amdahl’s
Law, together with the CPU performance equation, is a handy tool for evaluating
potential enhancements. Amdahl’s Law is explored in more detail in the exercises.

Amdahl’s Law is also used to argue for practical limits to the number of parallel
processors. We examine this argument in the Fallacies and Pitfalls section of
Chapter 6.

Fallacy: Computers at low utilization use little power.
Power efficiency matters at low utilizations because server workloads vary.
Utilization of servers in Google’s warehouse scale computer, for example, is
between 10% and 50% most of the time and at 100% less than 1% of the time. Even
given five years to learn how to run the SPECpower benchmark well, the specially
configured computer with the best results in 2012 still uses 33% of the peak power
at 10% of the load. Systems in the field that are not configured for the SPECpower
benchmark are surely worse.

Since servers’ workloads vary but use a large fraction of peak power, Luiz
Barroso and Urs Hölzle [2007] argue that we should redesign hardware to achieve
“energy-proportional computing.” If future servers used, say, 10% of peak power at
10% workload, we could reduce the electricity bill of datacenters and become good
corporate citizens in an era of increasing concern about CO2 emissions.
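To make the gap concrete, the sketch below uses the average power column of Figure 1.19 to compute what fraction of peak power that server draws at a given load, and what an idealized energy-proportional server would draw. The idealized model (power exactly proportional to load) is an assumption made here for illustration, and the function names are our own. The server in Figure 1.19 is not the 2012 best-result machine cited above, so its fraction differs from the 33% figure.

```python
# Average power in watts at each target load %, copied from Figure 1.19
AVERAGE_WATTS = {
    100: 258, 90: 242, 80: 224, 70: 204, 60: 185, 50: 170,
    40: 157, 30: 146, 20: 135, 10: 121, 0: 80,
}

def fraction_of_peak(load_percent):
    """Measured power at this load as a fraction of power at 100% load."""
    return AVERAGE_WATTS[load_percent] / AVERAGE_WATTS[100]

def energy_proportional_watts(load_percent):
    """Power of a hypothetical server whose power scales linearly with load."""
    return AVERAGE_WATTS[100] * load_percent / 100

for load in (50, 10, 0):
    print(f"{load:3d}% load: {fraction_of_peak(load):.0%} of peak measured, "
          f"{energy_proportional_watts(load):.1f} W if energy-proportional")
```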

Fallacy: Designing for performance and designing for energy efficiency are
unrelated goals.

Since energy is power multiplied by time, it is often the case that hardware or software
optimizations that take less time save energy overall even if the optimization takes
a bit more energy when it is used. One reason is that all of the rest of the computer is
consuming energy while the program is running, so even if the optimized portion
uses a little more energy, the reduced time can save the energy of the whole system.
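A small numeric sketch shows why. Every number below is hypothetical and chosen only to illustrate the argument: we assume the rest of the system draws a constant 100 W, and that an optimization raises the power of one component from 20 W to 25 W while cutting run time from 10 s to 8 s.

```python
def energy_joules(power_watts, seconds):
    """Energy is power multiplied by time (1 W for 1 s is 1 J)."""
    return power_watts * seconds

REST_OF_SYSTEM_W = 100.0  # hypothetical constant draw of everything else

# Hypothetical baseline: component at 20 W, program runs for 10 s.
baseline = energy_joules(REST_OF_SYSTEM_W + 20.0, 10.0)

# Hypothetical optimized version: component at 25 W, program runs for 8 s.
optimized = energy_joules(REST_OF_SYSTEM_W + 25.0, 8.0)

print(baseline, optimized)
```

With these assumed values the optimized run uses less total energy even though the optimized component draws more power, because the whole system is powered for a shorter time.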

Pitfall: Using a subset of the performance equation as a performance metric.
We have already warned about the danger of predicting performance based on
simply one of clock rate, instruction count, or CPI. Another common mistake
is to use only two of the three factors to compare performance. Although using
two of the three factors may be valid in a limited context, the concept is also
easily misused. Indeed, nearly all proposed alternatives to the use of time as the
performance metric have led eventually to misleading claims, distorted results, or
incorrect interpretations.

One alternative to time is MIPS (million instructions per second). For a given
program, MIPS is simply

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6}$$

Since MIPS is an instruction execution rate, MIPS specifies performance inversely
to execution time; faster computers have a higher MIPS rating. The good news
about MIPS is that it is easy to understand, and faster computers mean bigger
MIPS, which matches intuition.

There are three problems with using MIPS as a measure for comparing computers.
First, MIPS specifies the instruction execution rate but does not take into account
the capabilities of the instructions. We cannot compare computers with different
instruction sets using MIPS, since the instruction counts will certainly differ.
Second, MIPS varies between programs on the same computer; thus, a computer
cannot have a single MIPS rating. For example, by substituting for execution time,
we see the relationship between MIPS, clock rate, and CPI:

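$$\text{MIPS} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$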

The CPI varied by a factor of 5 for SPEC CPU2006 on an Intel Core i7 computer
in Figure 1.18, so MIPS does as well. Finally, and most importantly, if a new
program executes more instructions but each instruction is faster, MIPS can vary
independently from performance!

Check Yourself

Consider the following performance measurements for a program:

Measurement         Computer A   Computer B
Instruction count   10 billion   8 billion
Clock rate          4 GHz        4 GHz
CPI                 1.0          1.1

a. Which computer has the higher MIPS rating?

b. Which computer is faster?

million instructions per second (MIPS) A measurement of program execution speed
based on the number of millions of instructions. MIPS is computed as the
instruction count divided by the product of the execution time and 10^6.
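The two formulas above can be applied directly to the Check Yourself measurements. The sketch below is one way to do so; the function names `execution_time` and `mips` are our own, and the inputs are the instruction count, clock rate, and CPI given for Computers A and B.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """Seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

def mips(instruction_count, execution_time_s):
    """Million instructions per second = instruction count / (time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)

# Measurements from the Check Yourself table.
time_a = execution_time(10e9, 1.0, 4e9)
time_b = execution_time(8e9, 1.1, 4e9)
mips_a = mips(10e9, time_a)
mips_b = mips(8e9, time_b)

print(f"A: {time_a:.2f} s, {mips_a:.0f} MIPS")
print(f"B: {time_b:.2f} s, {mips_b:.0f} MIPS")
```

Comparing the two printed lines shows that the MIPS rating and the execution time can rank the same pair of computers differently.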

1.11 Concluding Remarks

Although it is difficult to predict exactly what level of cost/performance computers
will have in the future, it’s a safe bet that they will be much better than they are
today. To participate in these advances, computer designers and programmers
must understand a wider variety of issues.

Both hardware and software designers construct computer systems in hierarchical
layers, with each lower layer hiding details from the level above. This great idea
of abstraction is fundamental to understanding today’s computer systems, but it
does not mean that designers can limit themselves to knowing a single abstraction.
Perhaps the most important example of abstraction is the interface between
hardware and low-level software, called the instruction set architecture. Maintaining
the instruction set architecture as a constant enables many implementations of
that architecture—presumably varying in cost and performance—to run identical
software. On the downside, the architecture may preclude introducing innovations
that require the interface to change.

There is a reliable method of determining and reporting performance by using
the execution time of real programs as the metric. This execution time is related to
other important measurements we can make by the following equation:

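$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$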

We will use this equation and its constituent factors many times. Remember,
though, that individually the factors do not determine performance: only the
product, which equals execution time, is a reliable measure of performance.

The BIG Picture

Execution time is the only valid and unimpeachable measure of
performance. Many other metrics have been proposed and found wanting.
Sometimes these metrics are flawed from the start by not reflecting
execution time; other times a metric that is valid in a limited context
is extended and used beyond that context or without the additional
clarification needed to make it valid.

Where … the ENIAC
is equipped with
18,000 vacuum tubes
and weighs 30 tons,
computers in the
future may have 1,000
vacuum tubes and
perhaps weigh just 1½
tons.
Popular Mechanics,
March 1949


The key hardware technology for modern processors is silicon. Equal in
importance to an understanding of integrated circuit technology is an understanding
of the expected rates of technological change, as predicted by Moore’s Law. While
silicon fuels the rapid advance of hardware, new ideas in the organization of
computers have improved price/performance. Two of the key ideas are exploiting
parallelism in the program, typically today via multiple processors, and exploiting
locality of accesses to a memory hierarchy, typically via caches.

Energy efficiency has replaced die area as the most critical resource of
microprocessor design. Conserving power while trying to increase performance
has forced the hardware industry to switch to multicore microprocessors, thereby
forcing the software industry to switch to programming parallel hardware.
Parallelism is now required for performance.

Computer designs have always been measured by cost and performance, as well
as other important factors such as energy, dependability, cost of ownership, and
scalability. Although this chapter has focused on cost, performance, and energy,
the best designs will strike the appropriate balance for a given market among all
the factors.

Road Map for This Book
At the bottom of these abstractions are the five classic components of a computer:
datapath, control, memory, input, and output (refer to Figure 1.5). These five
components also serve as the framework for the rest of the chapters in this book:

■ Datapath: Chapter 3, Chapter 4, Chapter 6, and Appendix C

■ Control: Chapter 4, Chapter 6, and Appendix C

■ Memory: Chapter 5

■ Input: Chapters 5 and 6

■ Output: Chapters 5 and 6

As mentioned above, Chapter 4 describes how processors exploit implicit
parallelism, Chapter 6 describes the explicitly parallel multicore microprocessors
that are at the heart of the parallel revolution, and Appendix C describes
the highly parallel graphics processor chip. Chapter 5 describes how a memory
hierarchy exploits locality. Chapter 2 describes instruction sets—the interface
between compilers and the computer—and emphasizes the role of compilers and
programming languages in using the features of the instruction set. Appendix A
provides a reference for the instruction set of Chapter 2. Chapter 3 describes how
computers handle arithmetic data. Appendix B introduces logic design.

1.12 Historical Perspective and Further Reading

For each chapter in the text, a section devoted to a historical perspective can be
found online on a site that accompanies this book. We may trace the development
of an idea through a series of computers or describe some important projects, and
we provide references in case you are interested in probing further.

The historical perspective for this chapter provides a background for some of the
key ideas presented in this opening chapter. Its purpose is to give you the human
story behind the technological advances and to place achievements in their historical
context. By understanding the past, you may be better able to understand the forces
that will shape computing in the future. Each Historical Perspective section online
ends with suggestions for further reading, which are also collected separately online
under the section “Further Reading.” The rest of Section 1.12 is found online.

1.13 Exercises

An active field of science is like an immense anthill; the individual almost
vanishes into the mass of minds tumbling over each other, carrying information
from place to place, passing it around at the speed of light.
Lewis Thomas, “Natural Science,” in The Lives of a Cell, 1974

The relative time ratings of exercises are shown in square brackets after each
exercise number. On average, an exercise rated [10] will take you twice as long as
one rated [5]. Sections of the text that should be read before attempting an exercise
will be given in angled brackets; for example, <§1.4> means you should have read
Section 1.4, Under the Covers, to help you solve this exercise.

1.1 [2] <§1.1> Aside from the smart cell phones used by a billion people, list and
describe four other types of computers.

1.2 [5] <§1.2> The eight great ideas in computer architecture are similar to ideas
from other fields. Match the eight ideas from computer architecture, “Design for
Moore’s Law”, “Use Abstraction to Simplify Design”, “Make the Common Case
Fast”, “Performance via Parallelism”, “Performance via Pipelining”, “Performance
via Prediction”, “Hierarchy of Memories”, and “Dependability via Redundancy” to
the following ideas from other fields:

a. Assembly lines in automobile manufacturing

b. Suspension bridge cables

c. Aircraft and marine navigation systems that incorporate wind information

d. Express elevators in buildings

e. Library reserve desk

f. Increasing the gate area on a CMOS transistor to decrease its switching time

g. Adding electromagnetic aircraft catapults (which are electrically-powered
as opposed to current steam-powered models), allowed by the increased power
generation offered by the new reactor technology

h. Building self-driving cars whose control systems partially rely on existing sensor
systems already installed into the base vehicle, such as lane departure systems and
smart cruise control systems

1.3 [2] <§1.3> Describe the steps that transform a program written in a high-level
language such as C into a representation that is directly executed by a computer
processor.

1.4 [2] <§1.4> Assume a color display using 8 bits for each of the primary colors
(red, green, blue) per pixel and a frame size of 1280 × 1024.
a. What is the minimum size in bytes of the frame buffer to store a frame?
b. How long would it take, at a minimum, for the frame to be sent over a 100
Mbit/s network?

1.5 [4] <§1.6> Consider three different processors P1, P2, and P3 executing
the same instruction set. P1 has a 3 GHz clock rate and a CPI of 1.5. P2 has a
2.5 GHz clock rate and a CPI of 1.0. P3 has a 4.0 GHz clock rate and has a CPI
of 2.2.
a. Which processor has the highest performance expressed in instructions per second?
b. If the processors each execute a program in 10 seconds, find the number of
cycles and the number of instructions.
c. We are trying to reduce the execution time by 30% but this leads to an increase
of 20% in the CPI. What clock rate should we have to get this time reduction?

1.6 [20] <§1.6> Consider two different implementations of the same instruction
set architecture. The instructions can be divided into four classes according to
their CPI (class A, B, C, and D). P1 with a clock rate of 2.5 GHz and CPIs of 1, 2, 3,
and 3, and P2 with a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2.
Given a program with a dynamic instruction count of 1.0E6 instructions divided
into classes as follows: 10% class A, 20% class B, 50% class C, and 20% class D,
which implementation is faster?
a. What is the global CPI for each implementation?
b. Find the clock cycles required in both cases.


1.7 [15] <§1.6> Compilers can have a profound impact on the performance
of an application. Assume that for a program, compiler A results in a dynamic
instruction count of 1.0E9 and has an execution time of 1.1 s, while compiler B
results in a dynamic instruction count of 1.2E9 and an execution time of 1.5 s.

a. Find the average CPI for each program given that the processor has a clock cycle
time of 1 ns.

b. Assume the compiled programs run on two different processors. If the execution
times on the two processors are the same, how much faster is the clock of the
processor running compiler A’s code versus the clock of the processor running
compiler B’s code?

c. A new compiler is developed that uses only 6.0E8 instructions and has an
average CPI of 1.1. What is the speedup of using this new compiler versus using
compiler A or B on the original processor?

1.8 The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6
GHz and voltage of 1.25 V. Assume that, on average, it consumed 10 W of static
power and 90 W of dynamic power.

The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz and voltage
of 0.9 V. Assume that, on average, it consumed 30 W of static power and 40 W of
dynamic power.

1.8.1 [5] <§1.7> For each processor find the average capacitive loads.

1.8.2 [5] <§1.7> Find the percentage of the total dissipated power comprised by
static power and the ratio of static power to dynamic power for each technology.

1.8.3 [15] <§1.7> If the total dissipated power is to be reduced by 10%, how much
should the voltage be reduced to maintain the same leakage current? Note: power
is defined as the product of voltage and current.

1.9 Assume for arithmetic, load/store, and branch instructions, a processor has
CPIs of 1, 12, and 5, respectively. Also assume that on a single processor a program
requires the execution of 2.56E9 arithmetic instructions, 1.28E9 load/store
instructions, and 256 million branch instructions. Assume that each processor has
a 2 GHz clock frequency.

Assume that, as the program is parallelized to run over multiple cores, the number
of arithmetic and load/store instructions per processor is divided by 0.7 × p (where
p is the number of processors) but the number of branch instructions per processor
remains the same.

1.9.1 [5] <§1.7> Find the total execution time for this program on 1, 2, 4, and 8
processors, and show the relative speedup of the 2, 4, and 8 processor result relative
to the single processor result.


1.9.2 [10] <§§1.6, 1.8> If the CPI of the arithmetic instructions was doubled,
what would the impact be on the execution time of the program on 1, 2, 4, or 8
processors?

1.9.3 [10] <§§1.6, 1.8> To what should the CPI of load/store instructions be
reduced in order for a single processor to match the performance of four processors
using the original CPI values?

1.10 Assume a 15 cm diameter wafer has a cost of 12, contains 84 dies, and has
0.020 defects/cm². Assume a 20 cm diameter wafer has a cost of 15, contains 100
dies, and has 0.031 defects/cm².

1.10.1 [10] <§1.5> Find the yield for both wafers.

1.10.2 [5] <§1.5> Find the cost per die for both wafers.

1.10.3 [5] <§1.5> If the number of dies per wafer is increased by 10% and the
defects per area unit increases by 15%, find the die area and yield.

1.10.4 [5] <§1.5> Assume a fabrication process improves the yield from 0.92 to
0.95. Find the defects per area unit for each version of the technology given a die
area of 200 mm².

1.11 The results of the SPEC CPU2006 bzip2 benchmark running on an AMD
Barcelona show an instruction count of 2.389E12, an execution time of 750 s, and a
reference time of 9650 s.

1.11.1 [5] <§§1.6, 1.9> Find the CPI if the clock cycle time is 0.333 ns.

1.11.2 [5] <§1.9> Find the SPECratio.

1.11.3 [5] <§§1.6, 1.9> Find the increase in CPU time if the number of instructions
of the benchmark is increased by 10% without affecting the CPI.

1.11.4 [5] <§§1.6, 1.9> Find the increase in CPU time if the number of instructions
of the benchmark is increased by 10% and the CPI is increased by 5%.

1.11.5 [5] <§§1.6, 1.9> Find the change in the SPECratio for this change.

1.11.6 [10] <§1.6> Suppose that we are developing a new version of the AMD
Barcelona processor with a 4 GHz clock rate. We have added some additional
instructions to the instruction set in such a way that the number of instructions
has been reduced by 15%. The execution time is reduced to 700 s and the new
SPECratio is 13.7. Find the new CPI.

1.11.7 [10] <§1.6> This CPI value is larger than obtained in 1.11.1 as the clock
rate was increased from 3 GHz to 4 GHz. Determine whether the increase in the
CPI is similar to that of the clock rate. If they are dissimilar, why?

1.11.8 [5] <§1.6> By how much has the CPU time been reduced?


1.11.9 [10] <§1.6> For a second benchmark, libquantum, assume an execution
time of 960 ns, CPI of 1.61, and clock rate of 3 GHz. If the execution time is
reduced by an additional 10% without affecting the CPI and with a clock rate of
4 GHz, determine the number of instructions.

1.11.10 [10] <§1.6> Determine the clock rate required to give a further 10%
reduction in CPU time while maintaining the number of instructions and with the
CPI unchanged.

1.11.11 [10] <§1.6> Determine the clock rate if the CPI is reduced by 15% and
the CPU time by 20% while the number of instructions is unchanged.

1.12 Section 1.10 cites as a pitfall the utilization of a subset of the performance
equation as a performance metric. To illustrate this, consider the following two
processors. P1 has a clock rate of 4 GHz, average CPI of 0.9, and requires the
execution of 5.0E9 instructions. P2 has a clock rate of 3 GHz, an average CPI of
0.75, and requires the execution of 1.0E9 instructions.
1.12.1 [5] <§§1.6, 1.10> One usual fallacy is to consider the computer with the
largest clock rate as having the largest performance. Check if this is true for P1 and
P2.
1.12.2 [10] <§§1.6, 1.10> Another fallacy is to consider that the processor executing
the largest number of instructions will need a larger CPU time. Considering that
processor P1 is executing a sequence of 1.0E9 instructions and that the CPI of
processors P1 and P2 do not change, determine the number of instructions that P2
can execute in the same time that P1 needs to execute 1.0E9 instructions.
1.12.3 [10] <§§1.6, 1.10> A common fallacy is to use MIPS (millions of
instructions per second) to compare the performance of two different processors,
and consider that the processor with the largest MIPS has the largest performance.
Check if this is true for P1 and P2.
1.12.4 [10] <§1.10> Another common performance figure is MFLOPS (millions
of floating-point operations per second), defined as
MFLOPS = No. FP operations / (execution time × 1E6)
but this figure has the same problems as MIPS. Assume that 40% of the instructions
executed on both P1 and P2 are floating-point instructions. Find the MFLOPS
figures for the programs.

1.13 Another pitfall cited in Section 1.10 is expecting to improve the overall
performance of a computer by improving only one aspect of the computer. Consider
a computer running a program that requires 250 s, with 70 s spent executing FP
instructions, 85 s spent executing L/S instructions, and 40 s spent executing branch
instructions.

1.13.1 [5] <§1.10> By how much is the total time reduced if the time for FP
operations is reduced by 20%?


1.13.2 [5] <§1.10> By how much is the time for INT operations reduced if the
total time is reduced by 20%?

1.13.3 [5] <§1.10> Can the total time be reduced by 20% by reducing only
the time for branch instructions?

1.14 Assume a program requires the execution of 50 × 10^6 FP instructions,
110 × 10^6 INT instructions, 80 × 10^6 L/S instructions, and 16 × 10^6 branch
instructions. The CPI for each type of instruction is 1, 1, 4, and 2, respectively.
Assume that the processor has a 2 GHz clock rate.

1.14.1 [10] <§1.10> By how much must we improve the CPI of FP instructions if
we want the program to run two times faster?

1.14.2 [10] <§1.10> By how much must we improve the CPI of L/S instructions
if we want the program to run two times faster?

1.14.3 [5] <§1.10> By how much is the execution time of the program improved
if the CPI of INT and FP instructions is reduced by 40% and the CPI of L/S and
Branch is reduced by 30%?

1.15 [5] <§1.8> When a program is adapted to run on multiple processors in
a multiprocessor system, the execution time on each processor is comprised of
computing time and the overhead time required for locked critical sections and/or
to send data from one processor to another.

Assume a program requires t = 100 s of execution time on one processor. When run
on p processors, each processor requires t/p s, as well as an additional 4 s of overhead,
irrespective of the number of processors. Compute the per-processor execution
time for 2, 4, 8, 16, 32, 64, and 128 processors. For each case, list the corresponding
speedup relative to a single processor and the ratio between actual speedup versus
ideal speedup (speedup if there was no overhead).

Answers to Check Yourself

§1.1, page 10: Discussion questions: many answers are acceptable.
§1.4, page 24: DRAM memory: volatile, short access time of 50 to 70 nanoseconds,
and cost per GB is $5 to $10. Disk memory: nonvolatile, access times are 100,000
to 400,000 times slower than DRAM, and cost per GB is 100 times cheaper than
DRAM. Flash memory: nonvolatile, access times are 100 to 1000 times slower than
DRAM, and cost per GB is 7 to 10 times cheaper than DRAM.
§1.5, page 28: 1, 3, and 4 are valid reasons. Answer 5 can be generally true because
high volume can make the extra investment to reduce die size by, say, 10% a good
economic decision, but it doesn’t have to be true.
§1.6, page 33: 1. a: both, b: latency, c: neither. 7 seconds.
§1.6, page 40: b.
§1.10, page 51: a. Computer A has the higher MIPS rating. b. Computer B is faster.

2 Instructions: Language of the Computer

I speak Spanish to God, Italian to women, French to men, and German to my horse.
Charles V, Holy Roman Emperor (1500–1558)

2.1 Introduction
2.2 Operations of the Computer Hardware
2.3 Operands of the Computer Hardware
2.4 Signed and Unsigned Numbers
2.5 Representing Instructions in the Computer
2.6 Logical Operations
2.7 Instructions for Making Decisions
2.8 Supporting Procedures in Computer Hardware
2.9 Communicating with People
2.10 MIPS Addressing for 32-Bit Immediates and Addresses
2.11 Parallelism and Instructions: Synchronization
2.12 Translating and Starting a Program
2.13 A C Sort Example to Put It All Together
2.14 Arrays versus Pointers
2.15 Advanced Material: Compiling C and Interpreting Java
2.16 Real Stuff: ARMv7 (32-bit) Instructions
2.17 Real Stuff: x86 Instructions
2.18 Real Stuff: ARMv8 (64-bit) Instructions
2.19 Fallacies and Pitfalls
2.20 Concluding Remarks
2.21 Historical Perspective and Further Reading
2.22 Exercises

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.

The Five Classic Components of a Computer


2.1 Introduction

To command a computer’s hardware, you must speak its language. The words of a
computer’s language are called instructions, and its vocabulary is called an instruction
set. In this chapter, you will see the instruction set of a real computer, both in the form
written by people and in the form read by the computer. We introduce instructions in
a top-down fashion. Starting from a notation that looks like a restricted programming
language, we refine it step-by-step until you see the real language of a real computer.
Chapter 3 continues our downward descent, unveiling the hardware for arithmetic
and the representation of floating-point numbers.

You might think that the languages of computers would be as diverse as those of
people, but in reality computer languages are quite similar, more like regional dialects
than like independent languages. Hence, once you learn one, it is easy to pick up others.

The chosen instruction set comes from MIPS Technologies, and is an elegant
example of the instruction sets designed since the 1980s. To demonstrate how
easy it is to pick up other instruction sets, we will take a quick look at three other
popular instruction sets.

1. ARMv7 is similar to MIPS. More than 9 billion chips with ARM processors
were manufactured in 2011, making it the most popular instruction set in
the world.

2. The second example is the Intel x86, which powers both the PC and the
cloud of the PostPC Era.

3. The third example is ARMv8, which extends the address size of the ARMv7
from 32 bits to 64 bits. Ironically, as we shall see, this 2013 instruction set is
closer to MIPS than it is to ARMv7.

This similarity of instruction sets occurs because all computers are constructed
from hardware technologies based on similar underlying principles and because
there are a few basic operations that all computers must provide. Moreover,
computer designers have a common goal: to find a language that makes it easy
to build the hardware and the compiler while maximizing performance and
minimizing cost and energy. This goal is time honored; the following quote
was written before you could buy a computer, and it is as true today as it was in 1947:

It is easy to see by formal-logical methods that there exist certain [instruction
sets] that are in abstract adequate to control and cause the execution of any
sequence of operations . . . . The really decisive considerations from the present
point of view, in selecting an [instruction set], are more of a practical nature:
simplicity of the equipment demanded by the [instruction set], and the clarity of
its application to the actually important problems together with the speed of its
handling of those problems.

Burks, Goldstine, and von Neumann, 1947

instruction set: The vocabulary of commands understood by a given architecture.


The “simplicity of the equipment” is as valuable a consideration for today’s
computers as it was for those of the 1950s. The goal of this chapter is to teach
an instruction set that follows this advice, showing both how it is represented
in hardware and the relationship between high-level programming languages
and this more primitive one. Our examples are in the C programming language;
Section 2.15 shows how these would change for an object-oriented language
like Java.

By learning how to represent instructions, you will also discover the secret of
computing: the stored-program concept. Moreover, you will exercise your “foreign
language” skills by writing programs in the language of the computer and running
them on the simulator that comes with this book. You will also see the impact of
programming languages and compiler optimization on performance. We conclude
with a look at the historical evolution of instruction sets and an overview of other
computer dialects.

We reveal our first instruction set a piece at a time, giving the rationale along
with the computer structures. This top-down, step-by-step tutorial weaves the
components with their explanations, making the computer’s language more
palatable. Figure 2.1 gives a sneak preview of the instruction set covered in this
chapter.

2.2 Operations of the Computer Hardware

Every computer must be able to perform arithmetic. The MIPS assembly language
notation

add a, b, c

instructs a computer to add the two variables b and c and to put their sum in a.
This notation is rigid in that each MIPS arithmetic instruction performs only
one operation and must always have exactly three variables. For example, suppose
we want to place the sum of four variables b, c, d, and e into variable a. (In this
section we are being deliberately vague about what a “variable” is; in the next
section we’ll explain in detail.)

The following sequence of instructions adds the four variables:

add a, b, c # The sum of b and c is placed in a
add a, a, d # The sum of b, c, and d is now in a
add a, a, e # The sum of b, c, d, and e is now in a

Thus, it takes three instructions to sum the four variables.
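
As a rough illustration in C (not something the text itself provides), the three-instruction sequence can be mimicked with a hypothetical helper, `add3`, that accepts exactly one destination and two sources, just as the MIPS add does. The names `add3` and `sum_four` are assumptions made only for this sketch.

```c
#include <stdio.h>

/* Hypothetical helper mimicking a MIPS-style add:
 * exactly one destination and two source operands. */
static void add3(int *dest, int src1, int src2) {
    *dest = src1 + src2;
}

/* Sum four values using only three-operand adds,
 * mirroring the three-instruction sequence in the text. */
static int sum_four(int b, int c, int d, int e) {
    int a;
    add3(&a, b, c); /* add a, b, c : a = b + c         */
    add3(&a, a, d); /* add a, a, d : a = b + c + d     */
    add3(&a, a, e); /* add a, a, e : a = b + c + d + e */
    return a;
}

/* Print the sum so the sketch can be exercised from a caller. */
static void print_sum_four(int b, int c, int d, int e) {
    printf("%d\n", sum_four(b, c, d, e));
}
```

For instance, `sum_four(1, 2, 3, 4)` evaluates to 10 after exactly three calls to `add3`, one per MIPS instruction.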
The words to the right of the sharp symbol (#) on each line above are comments
for the human reader, so the computer ignores them. Note that unlike other
programming languages, each line of this language can contain at most one

stored-program concept: The idea that instructions and data of many types can be
stored in memory as numbers, leading to the stored-program computer.

There must certainly be instructions for performing the fundamental arithmetic operations.
Burks, Goldstine, and von Neumann, 1947


MIPS operands

Name: 32 registers
Example: $s0–$s7, $t0–$t9, $zero, $a0–$a3, $v0–$v1, $gp, $fp, $sp, $ra, $at
Comments: Fast locations for data. In MIPS, data must be in registers to perform arithmetic,
register $zero always equals 0, and register $at is reserved by the assembler to
handle large constants.

Name: 2^30 memory words
Example: Memory[0], Memory[4], . . . , Memory[4294967292]
Comments: Accessed only by data transfer instructions. MIPS uses byte addresses, so
sequential word addresses differ by 4. Memory holds data structures, arrays, and
spilled registers.

MIPS assembly language

Instruction                       Example             Meaning                                Comments

Arithmetic
add                               add $s1,$s2,$s3     $s1 = $s2 + $s3                        Three register operands
subtract                          sub $s1,$s2,$s3     $s1 = $s2 - $s3                        Three register operands
add immediate                     addi $s1,$s2,20     $s1 = $s2 + 20                         Used to add constants

Data transfer
load word                         lw $s1,20($s2)      $s1 = Memory[$s2 + 20]                 Word from memory to register
store word                        sw $s1,20($s2)      Memory[$s2 + 20] = $s1                 Word from register to memory
load half                         lh $s1,20($s2)      $s1 = Memory[$s2 + 20]                 Halfword memory to register
load half unsigned                lhu $s1,20($s2)     $s1 = Memory[$s2 + 20]                 Halfword memory to register
store half                        sh $s1,20($s2)      Memory[$s2 + 20] = $s1                 Halfword register to memory
load byte                         lb $s1,20($s2)      $s1 = Memory[$s2 + 20]                 Byte from memory to register
load byte unsigned                lbu $s1,20($s2)     $s1 = Memory[$s2 + 20]                 Byte from memory to register
store byte                        sb $s1,20($s2)      Memory[$s2 + 20] = $s1                 Byte from register to memory
load linked word                  ll $s1,20($s2)      $s1 = Memory[$s2 + 20]                 Load word as 1st half of atomic swap
store condition. word             sc $s1,20($s2)      Memory[$s2 + 20] = $s1; $s1 = 0 or 1   Store word as 2nd half of atomic swap
load upper immed.                 lui $s1,20          $s1 = 20 * 2^16                        Loads constant in upper 16 bits

Logical
and                               and $s1,$s2,$s3     $s1 = $s2 & $s3                        Three reg. operands; bit-by-bit AND
or                                or $s1,$s2,$s3      $s1 = $s2 | $s3                        Three reg. operands; bit-by-bit OR
nor                               nor $s1,$s2,$s3     $s1 = ~($s2 | $s3)                     Three reg. operands; bit-by-bit NOR
and immediate                     andi $s1,$s2,20     $s1 = $s2 & 20                         Bit-by-bit AND reg with constant
or immediate                      ori $s1,$s2,20      $s1 = $s2 | 20                         Bit-by-bit OR reg with constant
shift left logical                sll $s1,$s2,10      $s1 = $s2 << 10                        Shift left by constant
shift right logical               srl $s1,$s2,10      $s1 = $s2 >> 10                        Shift right by constant

Conditional branch
branch on equal                   beq $s1,$s2,25      if ($s1 == $s2) go to PC + 4 + 100     Equal test; PC-relative branch
branch on not equal               bne $s1,$s2,25      if ($s1 != $s2) go to PC + 4 + 100     Not equal test; PC-relative
set on less than                  slt $s1,$s2,$s3     if ($s2 < $s3) $s1 = 1; else $s1 = 0   Compare less than; for beq, bne
set on less than unsigned         sltu $s1,$s2,$s3    if ($s2 < $s3) $s1 = 1; else $s1 = 0   Compare less than unsigned
set less than immediate           slti $s1,$s2,20     if ($s2 < 20) $s1 = 1; else $s1 = 0    Compare less than constant
set less than immediate unsigned  sltiu $s1,$s2,20    if ($s2 < 20) $s1 = 1; else $s1 = 0    Compare less than constant unsigned

Unconditional jump
jump                              j 2500              go to 10000                            Jump to target address
jump register                     jr $ra              go to $ra                              For switch, procedure return
jump and link                     jal 2500            $ra = PC + 4; go to 10000              For procedure call

FIGURE 2.1 MIPS assembly language revealed in this chapter. This information is also found in Column 1 of the MIPS Reference Data Card at the front of this book.

instruction. Another difference from C is that comments always terminate at the end of a line.

The natural number of operands for an operation like addition is three: the two numbers being added together and a place to put the sum. Requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple: hardware for a variable number of operands is more complicated than hardware for a fixed number. This situation illustrates the first of three underlying principles of hardware design:

Design Principle 1: Simplicity favors regularity.

We can now show, in the two examples that follow, the relationship of programs written in higher-level programming languages to programs in this more primitive notation.

EXAMPLE
Compiling Two C Assignment Statements into MIPS

This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved from C, this example and the next few work for either high-level programming language:

a = b + c;
d = a - e;

The translation from C to MIPS assembly language instructions is performed by the compiler.
Show the MIPS code produced by a compiler.

ANSWER
A MIPS instruction operates on two source operands and places the result in one destination operand. Hence, the two simple statements above compile directly into these two MIPS assembly language instructions:

add a, b, c
sub d, a, e

EXAMPLE
Compiling a Complex C Assignment into MIPS

A somewhat complex statement contains the five variables f, g, h, i, and j:

f = (g + h) - (i + j);

What might a C compiler produce?

ANSWER
The compiler must break this statement into several assembly instructions, since only one operation is performed per MIPS instruction. The first MIPS instruction calculates the sum of g and h. We must place the result somewhere, so the compiler creates a temporary variable, called t0:

add t0,g,h # temporary variable t0 contains g + h

Although the next operation is subtract, we need to calculate the sum of i and j before we can subtract. Thus, the second instruction places the sum of i and j in another temporary variable created by the compiler, called t1:

add t1,i,j # temporary variable t1 contains i + j

Finally, the subtract instruction subtracts the second sum from the first and places the difference in the variable f, completing the compiled code:

sub f,t0,t1 # f gets t0 - t1, which is (g + h) - (i + j)

Check Yourself
For a given function, which programming language likely takes the most lines of code? Put the three representations below in order.

1. Java
2. C
3. MIPS assembly language

Elaboration: To increase portability, Java was originally envisioned as relying on a software interpreter. The instruction set of this interpreter is called Java bytecodes (see Section 2.15), which is quite different from the MIPS instruction set. To get performance close to the equivalent C program, Java systems today typically compile Java bytecodes into the native instruction sets like MIPS.
Because this compilation is normally done much later than for C programs, such Java compilers are often called Just In Time (JIT) compilers. Section 2.12 shows how JITs are used later than C compilers in the start-up process, and Section 2.13 shows the performance consequences of compiling versus interpreting Java programs.

2.3 Operands of the Computer Hardware

Unlike programs in high-level languages, the operands of arithmetic instructions are restricted; they must be from a limited number of special locations built directly in hardware called registers. Registers are primitives used in hardware design that are also visible to the programmer when the computer is completed, so you can think of registers as the bricks of computer construction. The size of a register in the MIPS architecture is 32 bits; groups of 32 bits occur so frequently that they are given the name word in the MIPS architecture.

word: The natural unit of access in a computer, usually a group of 32 bits; corresponds to the size of a register in the MIPS architecture.

One major difference between the variables of a programming language and registers is the limited number of registers, typically 32 on current computers, like MIPS. (See Section 2.21 for the history of the number of registers.) Thus, continuing in our top-down, stepwise evolution of the symbolic representation of the MIPS language, in this section we have added the restriction that the three operands of MIPS arithmetic instructions must each be chosen from one of the 32 32-bit registers.

The reason for the limit of 32 registers may be found in the second of our three underlying design principles of hardware technology:

Design Principle 2: Smaller is faster.

A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer when they must travel farther.
Guidelines such as “smaller is faster” are not absolutes; 31 registers may not be faster than 32. Yet, the truth behind such observations causes computer designers to take them seriously. In this case, the designer must balance the craving of programs for more registers with the designer’s desire to keep the clock cycle fast. Another reason for not using more than 32 is the number of bits it would take in the instruction format, as Section 2.5 demonstrates.

Chapter 4 shows the central role that registers play in hardware construction; as we shall see in this chapter, effective use of registers is critical to program performance.

Although we could simply write instructions using numbers for registers, from 0 to 31, the MIPS convention is to use two-character names following a dollar sign to represent a register. Section 2.8 will explain the reasons behind these names. For now, we will use $s0, $s1, . . . for registers that correspond to variables in C and Java programs and $t0, $t1, . . . for temporary registers needed to compile the program into MIPS instructions.

EXAMPLE
Compiling a C Assignment Using Registers

It is the compiler’s job to associate program variables with registers. Take, for instance, the assignment statement from our earlier example:

f = (g + h) - (i + j);

The variables f, g, h, i, and j are assigned to the registers $s0, $s1, $s2, $s3, and $s4, respectively. What is the compiled MIPS code?
ANSWER
The compiled program is very similar to the prior example, except we replace the variables with the register names mentioned above plus two temporary registers, $t0 and $t1, which correspond to the temporary variables above:

add $t0,$s1,$s2 # register $t0 contains g + h
add $t1,$s3,$s4 # register $t1 contains i + j
sub $s0,$t0,$t1 # f gets $t0 - $t1, which is (g + h) - (i + j)

Memory Operands

Programming languages have simple variables that contain single data elements, as in these examples, but they also have more complex data structures: arrays and structures. These complex data structures can contain many more data elements than there are registers in a computer. How can a computer represent and access such large structures?

Recall the five components of a computer introduced in Chapter 1 and repeated on page 61. The processor can keep only a small amount of data in registers, but computer memory contains billions of data elements. Hence, data structures (arrays and structures) are kept in memory.

As explained above, arithmetic operations occur only on registers in MIPS instructions; thus, MIPS must include instructions that transfer data between memory and registers. Such instructions are called data transfer instructions. To access a word in memory, the instruction must supply the memory address. Memory is just a large, single-dimensional array, with the address acting as the index to that array, starting at 0. For example, in Figure 2.2, the address of the third data element is 2, and the value of Memory[2] is 10.

data transfer instruction: A command that moves data between memory and registers.

address: A value used to delineate the location of a specific data element within a memory array.

[Figure: Processor connected to Memory]

Address   Data
0         1
1         101
2         10
3         100

FIGURE 2.2 Memory addresses and contents of memory at those locations.
If these elements were words, these addresses would be incorrect, since MIPS actually uses byte addressing, with each word representing four bytes. Figure 2.3 shows the memory addressing for sequential word addresses.

The data transfer instruction that copies data from memory to a register is traditionally called load. The format of the load instruction is the name of the operation followed by the register to be loaded, then a constant and register used to access memory. The sum of the constant portion of the instruction and the contents of the second register forms the memory address. The actual MIPS name for this instruction is lw, standing for load word.

EXAMPLE
Compiling an Assignment When an Operand Is in Memory

Let’s assume that A is an array of 100 words and that the compiler has associated the variables g and h with the registers $s1 and $s2 as before. Let’s also assume that the starting address, or base address, of the array is in $s3. Compile this C assignment statement:

g = h + A[8];

ANSWER
Although there is a single operation in this assignment statement, one of the operands is in memory, so we must first transfer A[8] to a register. The address of this array element is the sum of the base of the array A, found in register $s3, plus the number to select element 8. The data should be placed in a temporary register for use in the next instruction. Based on Figure 2.2, the first compiled instruction is

lw $t0,8($s3) # Temporary reg $t0 gets A[8]

(We’ll be making a slight adjustment to this instruction, but we’ll use this simplified version for now.) The following instruction can operate on the value in $t0 (which equals A[8]) since it is in a register.
The instruction must add h (contained in $s2) to A[8] (contained in $t0) and put the sum in the register corresponding to g (associated with $s1):

add $s1,$s2,$t0 # g = h + A[8]

The constant in a data transfer instruction (8) is called the offset, and the register added to form the address ($s3) is called the base register.

Hardware/Software Interface: In addition to associating variables with registers, the compiler allocates data structures like arrays and structures to locations in memory. The compiler can then place the proper starting address into the data transfer instructions.

Since 8-bit bytes are useful in many programs, virtually all architectures today address individual bytes. Therefore, the address of a word matches the address of one of the 4 bytes within the word, and addresses of sequential words differ by 4. For example, Figure 2.3 shows the actual MIPS addresses for the words in Figure 2.2; the byte address of the third word is 8.

In MIPS, words must start at addresses that are multiples of 4. This requirement is called an alignment restriction, and many architectures have it. (Chapter 4 suggests why alignment leads to faster data transfers.)

alignment restriction: A requirement that data be aligned in memory on natural boundaries.

Computers divide into those that use the address of the leftmost or “big end” byte as the word address versus those that use the rightmost or “little end” byte. MIPS is in the big-endian camp. Since the order matters only if you access the identical data both as a word and as four bytes, few need to be aware of the endianness. (Appendix A shows the two options to number bytes in a word.)

Byte addressing also affects the array index.
To get the proper byte address in the code above, the offset to be added to the base register $s3 must be 4 × 8, or 32, so that the load address will select A[8] and not A[8/4]. (See the related pitfall on page 160 of Section 2.19.)

The instruction complementary to load is traditionally called store; it copies data from a register to memory. The format of a store is similar to that of a load: the name of the operation, followed by the register to be stored, then the offset to select the array element, and finally the base register. Once again, the MIPS address is specified in part by a constant and in part by the contents of a register. The actual MIPS name is sw, standing for store word.

Hardware/Software Interface: As the addresses in loads and stores are binary numbers, we can see why the DRAM for main memory comes in binary sizes rather than in decimal sizes. That is, in gebibytes ($2^{30}$) or tebibytes ($2^{40}$), not in gigabytes ($10^{9}$) or terabytes ($10^{12}$); see Figure 1.1.

[Figure: Processor connected to Memory]

Byte Address   Data
0              1
4              101
8              10
12             100

FIGURE 2.3 Actual MIPS memory addresses and contents of memory for those words. The changed addresses are highlighted to contrast with Figure 2.2. Since MIPS addresses each byte, word addresses are multiples of 4: there are 4 bytes in a word.

EXAMPLE
Compiling Using Load and Store

Assume variable h is associated with register $s2 and the base address of the array A is in $s3. What is the MIPS assembly code for the C assignment statement below?

A[12] = h + A[8];

ANSWER
Although there is a single operation in the C statement, now two of the operands are in memory, so we need even more MIPS instructions.
The first two instructions are the same as in the prior example, except this time we use the proper offset for byte addressing in the load word instruction to select A[8], and the add instruction places the sum in $t0:

lw $t0,32($s3) # Temporary reg $t0 gets A[8]
add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[8]

The final instruction stores the sum into A[12], using 48 (4 × 12) as the offset and register $s3 as the base register.

sw $t0,48($s3) # Stores h + A[8] back into A[12]

Load word and store word are the instructions that copy words between memory and registers in the MIPS architecture. Other brands of computers use other instructions along with load and store to transfer data. An architecture with such alternatives is the Intel x86, described in Section 2.17.

Hardware/Software Interface: Many programs have more variables than computers have registers. Consequently, the compiler tries to keep the most frequently used variables in registers and places the rest in memory, using loads and stores to move variables between registers and memory. The process of putting less commonly used variables (or those needed later) into memory is called spilling registers.

The hardware principle relating size and speed suggests that memory must be slower than registers, since there are fewer registers. This is indeed the case; data accesses are faster if data is in registers instead of memory. Moreover, data is more useful when in a register. A MIPS arithmetic instruction can read two registers, operate on them, and write the result. A MIPS data transfer instruction only reads one operand or writes one operand, without operating on it. Thus, registers take less time to access and have higher throughput than memory, making data in registers both faster to access and simpler to use. Accessing registers also uses less energy than accessing memory.
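
To make the byte-offset arithmetic of the load/store example concrete, here is a minimal C sketch of A[12] = h + A[8]. It models memory as an array of 32-bit words reached through byte addresses. The names `memory`, `MEM_WORDS`, `load_word`, `store_word`, and `assign_a12` are hypothetical and exist only for this illustration; a real simulator would be considerably more involved.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64 /* hypothetical, deliberately tiny memory */

/* Word storage; a byte address maps to index (address / 4). */
static uint32_t memory[MEM_WORDS];

/* Model of lw: read the word at byte address base + offset. */
static uint32_t load_word(uint32_t base, int32_t offset) {
    uint32_t addr = base + (uint32_t)offset;
    assert(addr % 4 == 0);         /* alignment restriction */
    assert(addr / 4 < MEM_WORDS);  /* stay inside the toy memory */
    return memory[addr / 4];
}

/* Model of sw: write a word to byte address base + offset. */
static void store_word(uint32_t value, uint32_t base, int32_t offset) {
    uint32_t addr = base + (uint32_t)offset;
    assert(addr % 4 == 0);
    assert(addr / 4 < MEM_WORDS);
    memory[addr / 4] = value;
}

/* A[12] = h + A[8], with h in s2 and the base address of A in s3. */
static void assign_a12(uint32_t s2, uint32_t s3) {
    uint32_t t0 = load_word(s3, 32); /* lw  $t0,32($s3): 4 * 8 = 32 selects A[8] */
    t0 = s2 + t0;                    /* add $t0,$s2,$t0                          */
    store_word(t0, s3, 48);          /* sw  $t0,48($s3): 4 * 12 = 48 selects A[12] */
}
```

If A starts at byte address 16 and A[8] holds 100, then `assign_a12(5, 16)` leaves 105 in A[12]. Passing an offset of 8 instead of 32 would read A[2], which is the mistake the byte-addressing discussion warns about.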
To achieve highest performance and conserve energy, an instruction set architecture must have a sufficient number of registers, and compilers must use registers efficiently.

Constant or Immediate Operands

Many times a program will use a constant in an operation, for example, incrementing an index to point to the next element of an array. In fact, more than half of the MIPS arithmetic instructions have a constant as an operand when running the SPEC CPU2006 benchmarks.

Using only the instructions we have seen so far, we would have to load a constant from memory to use one. (The constants would have been placed in memory when the program was loaded.) For example, to add the constant 4 to register $s3, we could use the code

lw $t0, AddrConstant4($s1) # $t0 = constant 4
add $s3,$s3,$t0 # $s3 = $s3 + $t0 ($t0 == 4)

assuming that $s1 + AddrConstant4 is the memory address of the constant 4.

An alternative that avoids the load instruction is to offer versions of the arithmetic instructions in which one operand is a constant. This quick add instruction with one constant operand is called add immediate or addi. To add 4 to register $s3, we just write

addi $s3,$s3,4 # $s3 = $s3 + 4

Constant operands occur frequently, and by including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory.

The constant zero has another role, which is to simplify the instruction set by offering useful variations. For example, the move operation is just an add instruction where one operand is zero. Hence, MIPS dedicates a register $zero to be hard-wired to the value zero. (As you might expect, it is register number 0.) Using frequency to justify the inclusions of constants is another example of the great idea of making the common case fast.
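
The roles of addi and $zero can be sketched with a toy register file in C. This is only an illustration under stated assumptions: the array `reg` and the helpers `write_reg`, `mips_add`, and `mips_addi` are hypothetical names, and the register numbers in the enum follow the usual MIPS numbering ($zero is 0, $t0 is 8, $s3 is 19), which this section has not yet introduced.

```c
#include <stdint.h>

/* Register numbers under the usual MIPS convention (an assumption here). */
enum { ZERO = 0, T0 = 8, S3 = 19 };

/* Toy register file: 32 registers of 32 bits each. */
static uint32_t reg[32];

/* Writes to $zero are discarded, so it always reads as 0. */
static void write_reg(int rd, uint32_t value) {
    if (rd != ZERO) {
        reg[rd] = value;
    }
}

/* add rd, rs, rt: both sources come from registers. */
static void mips_add(int rd, int rs, int rt) {
    write_reg(rd, reg[rs] + reg[rt]);
}

/* addi rt, rs, imm: the constant travels inside the instruction,
 * so no load from memory is needed. The signed 16-bit constant is
 * widened to 32 bits before the addition. */
static void mips_addi(int rt, int rs, int16_t imm) {
    write_reg(rt, reg[rs] + (uint32_t)(int32_t)imm);
}
```

With `reg[S3]` holding 10, `mips_addi(S3, S3, 4)` leaves 14 in $s3, and `mips_add(T0, S3, ZERO)` then copies that value into $t0, which is the move operation built from an add with $zero.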
Check Yourself
Given the importance of registers, what is the rate of increase in the number of registers in a chip over time?

1. Very fast: They increase as fast as Moore’s law, which predicts doubling the number of transistors on a chip every 18 months.
2. Very slow: Since programs are usually distributed in the language of the computer, there is inertia in instruction set architecture, and so the number of registers increases only as fast as new instruction sets become viable.

Elaboration: Although the MIPS registers in this book are 32 bits wide, there is a 64-bit version of the MIPS instruction set with 32 64-bit registers. To keep them straight, they are officially called MIPS-32 and MIPS-64. In this chapter, we use a subset of MIPS-32. Appendix E shows the differences between MIPS-32 and MIPS-64. Sections 2.16 and 2.18 show the much more dramatic difference between the 32-bit address ARMv7 and its 64-bit successor, ARMv8.

Elaboration: The MIPS offset plus base register addressing is an excellent match to structures as well as arrays, since the register can point to the beginning of the structure and the offset can select the desired element. We’ll see such an example in Section 2.13.

Elaboration: The register in the data transfer instructions was originally invented to hold an index of an array with the offset used for the starting address of an array. Thus, the base register is also called the index register. Today’s memories are much larger and the software model of data allocation is more sophisticated, so the base address of the array is normally passed in a register since it won’t fit in the offset, as we shall see.

Elaboration: Since MIPS supports negative constants, there is no need for subtract immediate in MIPS.

2.4 Signed and Unsigned Numbers

First, let’s quickly review how a computer represents numbers. Humans are taught to think in base 10, but numbers may be represented in any base.
For example, 123 base 10 = 1111011 base 2. Numbers are kept in computer hardware as a series of high and low electronic signals, and so they are considered base 2 numbers. (Just as base 10 numbers are called decimal numbers, base 2 numbers are called binary numbers.)

A single digit of a binary number is thus the “atom” of computing, since all information is composed of binary digits or bits. This fundamental building block can be one of two values, which can be thought of as several alternatives: high or low, on or off, true or false, or 1 or 0.

Generalizing the point, in any number base, the value of ith digit d is

\[ d \times \text{Base}^{i} \]

where i starts at 0 and increases from right to left. This representation leads to an obvious way to number the bits in the word: simply use the power of the base for that bit. We subscript decimal numbers with ten and binary numbers with two. For example,

1011two

represents

\[
\begin{aligned}
&(1 \times 2^{3}) + (0 \times 2^{2}) + (1 \times 2^{1}) + (1 \times 2^{0})_{\text{ten}} \\
&= (1 \times 8) + (0 \times 4) + (1 \times 2) + (1 \times 1)_{\text{ten}} \\
&= 8 + 0 + 2 + 1_{\text{ten}} \\
&= 11_{\text{ten}}
\end{aligned}
\]

binary digit: Also called binary bit. One of the two numbers in base 2, 0 or 1, that are the components of information.

We number the bits 0, 1, 2, 3, . . . from right to left in a word. The drawing below shows the numbering of bits within a MIPS word and the placement of the number 1011two:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  1  1
(32 bits wide)

Since words are drawn vertically as well as horizontally, leftmost and rightmost may be unclear. Hence, the phrase least significant bit is used to refer to the rightmost bit (bit 0 above) and most significant bit to the leftmost bit (bit 31).

The MIPS word is 32 bits long, so we can represent $2^{32}$ different 32-bit patterns.
It is natural to let these combinations represent the numbers from 0 to $2^{32} - 1$ (4,294,967,295ten):

0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten
. . .                                        . . .
1111 1111 1111 1111 1111 1111 1111 1101two = 4,294,967,293ten
1111 1111 1111 1111 1111 1111 1111 1110two = 4,294,967,294ten
1111 1111 1111 1111 1111 1111 1111 1111two = 4,294,967,295ten

That is, 32-bit binary numbers can be represented in terms of the bit value times a power of 2 (here $x_i$ means the ith bit of x):

\[ (x_{31} \times 2^{31}) + (x_{30} \times 2^{30}) + (x_{29} \times 2^{29}) + \dots + (x_{1} \times 2^{1}) + (x_{0} \times 2^{0}) \]

For reasons we will shortly see, these positive numbers are called unsigned numbers.

least significant bit: The rightmost bit in a MIPS word.

most significant bit: The leftmost bit in a MIPS word.

Hardware/Software Interface: Base 2 is not natural to human beings; we have 10 fingers and so find base 10 natural. Why didn’t computers use decimal? In fact, the first commercial computer did offer decimal arithmetic. The problem was that the computer still used on and off signals, so a decimal digit was simply represented by several binary digits. Decimal proved so inefficient that subsequent computers reverted to all binary, converting to base 10 only for the relatively infrequent input/output events.

Keep in mind that the binary bit patterns above are simply representatives of numbers. Numbers really have an infinite number of digits, with almost all being 0 except for a few of the rightmost digits. We just don’t normally show leading 0s.

Hardware can be designed to add, subtract, multiply, and divide these binary bit patterns. If the number that is the proper result of such operations cannot be represented by these rightmost hardware bits, overflow is said to have occurred. It’s up to the programming language, the operating system, and the program to determine what to do if overflow occurs.
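
As one possible illustration of a program deciding what to do about overflow, the C sketch below adds two unsigned 32-bit numbers and reports whether the true sum needed more than 32 bits. The helper name `add_u32_overflows` is hypothetical. The sketch relies on the fact that C defines arithmetic on unsigned types to wrap around modulo $2^{32}$ for a 32-bit type; other languages make different choices.

```c
#include <stdbool.h>
#include <stdint.h>

/* Add two unsigned 32-bit numbers.
 * Stores the 32 retained bits in *sum and returns true when the
 * proper result could not be represented in those 32 bits. */
static bool add_u32_overflows(uint32_t a, uint32_t b, uint32_t *sum) {
    *sum = a + b;    /* unsigned arithmetic wraps modulo 2^32 */
    return *sum < a; /* a wrapped sum is smaller than either operand */
}
```

Adding 1 to 4,294,967,295 with this helper stores 0 and returns true, while adding 1 and 2 stores 3 and returns false.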
Computer programs calculate both positive and negative numbers, so we need a representation that distinguishes the positive from the negative. The most obvious solution is to add a separate sign, which conveniently can be represented in a single bit; the name for this representation is sign and magnitude.

Alas, sign and magnitude representation has several shortcomings. First, it’s not obvious where to put the sign bit. To the right? To the left? Early computers tried both. Second, adders for sign and magnitude may need an extra step to set the sign because we can’t know in advance what the proper sign will be. Finally, a separate sign bit means that sign and magnitude has both a positive and a negative zero, which can lead to problems for inattentive programmers. As a result of these shortcomings, sign and magnitude representation was soon abandoned.

In the search for a more attractive alternative, the question arose as to what would be the result for unsigned numbers if we tried to subtract a large number from a small one. The answer is that it would try to borrow from a string of leading 0s, so the result would have a string of leading 1s.

Given that there was no obvious better alternative, the final solution was to pick the representation that made the hardware simple: leading 0s mean positive, and leading 1s mean negative. This convention for representing signed binary numbers is called two’s complement representation:

0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
0000 0000 0000 0000 0000 0000 0000 0010two = 2ten
. . .                                        . . .
0111 1111 1111 1111 1111 1111 1111 1101two = 2,147,483,645ten
0111 1111 1111 1111 1111 1111 1111 1110two = 2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = 2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0000two = -2,147,483,648ten
1000 0000 0000 0000 0000 0000 0000 0001two = -2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0010two = -2,147,483,646ten
. . .                                        . . .
1111 1111 1111 1111 1111 1111 1111 1101two = -3ten
1111 1111 1111 1111 1111 1111 1111 1110two = -2ten
1111 1111 1111 1111 1111 1111 1111 1111two = -1ten

The positive half of the numbers, from 0 to 2,147,483,647ten ($2^{31} - 1$), use the same representation as before. The following bit pattern (1000 . . . 0000two) represents the most negative number -2,147,483,648ten ($-2^{31}$). It is followed by a declining set of negative numbers: -2,147,483,647ten (1000 . . . 0001two) down to -1ten (1111 . . . 1111two).

Two’s complement does have one negative number, -2,147,483,648ten, that has no corresponding positive number. Such imbalance was also a worry to the inattentive programmer, but sign and magnitude had problems for both the programmer and the hardware designer. Consequently, every computer today uses two’s complement binary representations for signed numbers.

Two’s complement representation has the advantage that all negative numbers have a 1 in the most significant bit. Consequently, hardware needs to test only this bit to see if a number is positive or negative (with the number 0 considered positive). This bit is often called the sign bit. By recognizing the role of the sign bit, we can represent positive and negative 32-bit numbers in terms of the bit value times a power of 2:

\[ (x_{31} \times -2^{31}) + (x_{30} \times 2^{30}) + (x_{29} \times 2^{29}) + \dots + (x_{1} \times 2^{1}) + (x_{0} \times 2^{0}) \]

The sign bit is multiplied by $-2^{31}$, and the rest of the bits are then multiplied by positive versions of their respective base values.
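
The formula can be evaluated directly, bit by bit. The following C sketch does exactly that; the function name `twos_complement_value` is hypothetical, and a 64-bit accumulator is used only so that the intermediate sums cannot themselves overflow.

```c
#include <stdint.h>

/* Interpret a 32-bit pattern as a two's complement number:
 * bit 31 contributes -2^31 and bits 30..0 contribute +2^i. */
static int64_t twos_complement_value(uint32_t bits) {
    int64_t value = 0;

    /* Bits 0 through 30 carry positive weights. */
    for (int i = 0; i < 31; i++) {
        if ((bits >> i) & 1u) {
            value += (int64_t)1 << i;
        }
    }

    /* The sign bit carries the weight -2^31. */
    if ((bits >> 31) & 1u) {
        value -= (int64_t)1 << 31;
    }
    return value;
}
```

Applied to the all-ones pattern 0xFFFFFFFF, the positive weights sum to $2^{31} - 1$ and the sign bit subtracts $2^{31}$, giving -1, in agreement with the last row of the table above.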
EXAMPLE: Binary to Decimal Conversion
What is the decimal value of this 32-bit two’s complement number?

    1111 1111 1111 1111 1111 1111 1111 1100two

ANSWER: Substituting the number’s bit values into the formula above:

\[
\begin{aligned}
&(1 \times -2^{31}) + (1 \times 2^{30}) + (1 \times 2^{29}) + \cdots + (1 \times 2^{2}) + (0 \times 2^{1}) + (0 \times 2^{0}) \\
&\quad = -2^{31} + 2^{30} + 2^{29} + \cdots + 2^{2} + 0 + 0 \\
&\quad = -2{,}147{,}483{,}648_{\text{ten}} + 2{,}147{,}483{,}644_{\text{ten}} \\
&\quad = -4_{\text{ten}}
\end{aligned}
\]

We’ll see a shortcut to simplify conversion from negative to positive soon.

Just as an operation on unsigned numbers can overflow the capacity of hardware to represent the result, so can an operation on two’s complement numbers. Overflow occurs when the leftmost retained bit of the binary bit pattern is not the same as the infinite number of digits to the left (the sign bit is incorrect): a 0 on the left of the bit pattern when the number is negative or a 1 when the number is positive.

Hardware/Software Interface
Signed versus unsigned applies to loads as well as to arithmetic. The function of a signed load is to copy the sign repeatedly to fill the rest of the register (called sign extension), but its purpose is to place a correct representation of the number within that register. Unsigned loads simply fill with 0s to the left of the data, since the number represented by the bit pattern is unsigned.

When loading a 32-bit word into a 32-bit register, the point is moot; signed and unsigned loads are identical. MIPS does offer two flavors of byte loads: load byte (lb) treats the byte as a signed number and thus sign-extends to fill the 24 leftmost bits of the register, while load byte unsigned (lbu) works with unsigned integers. Since C programs almost always use bytes to represent characters rather than consider bytes as very short signed integers, lbu is used practically exclusively for byte loads.

Hardware/Software Interface
Unlike the numbers discussed above, memory addresses naturally start at 0 and continue to the largest address. Put another way, negative addresses make no sense.
Thus, programs want to deal sometimes with numbers that can be positive or negative and sometimes with numbers that can be only positive. Some programming languages reflect this distinction. C, for example, names the former integers (declared as int in the program) and the latter unsigned integers (unsigned int). Some C style guides even recommend declaring the former as signed int to keep the distinction clear.

Let’s examine two useful shortcuts when working with two’s complement numbers. The first shortcut is a quick way to negate a two’s complement binary number. Simply invert every 0 to 1 and every 1 to 0, then add one to the result. This shortcut is based on the observation that the sum of a number and its inverted representation must be 111 . . . 111two, which represents −1. Since \(x + \bar{x} = -1\), therefore \(x + \bar{x} + 1 = 0\) or \(\bar{x} + 1 = -x\). (We use the notation \(\bar{x}\) to mean invert every bit in x from 0 to 1 and vice versa.)

EXAMPLE: Negation Shortcut
Negate 2ten, and then check the result by negating −2ten.

ANSWER:

    2ten = 0000 0000 0000 0000 0000 0000 0000 0010two

Negating this number by inverting the bits and adding one,

    1111 1111 1111 1111 1111 1111 1111 1101two + 1two
    = 1111 1111 1111 1111 1111 1111 1111 1110two
    = −2ten

Going the other direction,

    1111 1111 1111 1111 1111 1111 1111 1110two

is first inverted and then incremented:

    0000 0000 0000 0000 0000 0000 0000 0001two + 1two
    = 0000 0000 0000 0000 0000 0000 0000 0010two
    = 2ten

Our next shortcut tells us how to convert a binary number represented in n bits to a number represented with more than n bits. For example, the immediate field in the load, store, branch, add, and set on less than instructions contains a two’s complement 16-bit number, representing −32,768ten (\(-2^{15}\)) to 32,767ten (\(2^{15} - 1\)).
To add the immediate field to a 32-bit register, the computer must convert that 16-bit number to its 32-bit equivalent. The shortcut is to take the most significant bit from the smaller quantity (the sign bit) and replicate it to fill the new bits of the larger quantity. The old nonsign bits are simply copied into the right portion of the new word. This shortcut is commonly called sign extension.

EXAMPLE: Sign Extension Shortcut
Convert 16-bit binary versions of 2ten and −2ten to 32-bit binary numbers.

ANSWER: The 16-bit binary version of the number 2 is

    0000 0000 0000 0010two = 2ten

It is converted to a 32-bit number by making 16 copies of the value in the most significant bit (0) and placing that in the left-hand half of the word. The right half gets the old value:

    0000 0000 0000 0000 0000 0000 0000 0010two = 2ten

Let’s negate the 16-bit version of 2 using the earlier shortcut. Thus,

    0000 0000 0000 0010two

becomes

    1111 1111 1111 1101two + 1two
    = 1111 1111 1111 1110two

Creating a 32-bit version of the negative number means copying the sign bit 16 times and placing it on the left:

    1111 1111 1111 1111 1111 1111 1111 1110two = −2ten

This trick works because positive two’s complement numbers really have an infinite number of 0s on the left and negative two’s complement numbers have an infinite number of 1s. The binary bit pattern representing a number hides leading bits to fit the width of the hardware; sign extension simply restores some of them.

Summary
The main point of this section is that we need to represent both positive and negative integers within a computer word, and although there are pros and cons to any option, the unanimous choice since 1965 has been two’s complement.

Elaboration: For signed decimal numbers, we used “−” to represent negative because there are no limits to the size of a decimal number.
Given a fixed word size, binary and hexadecimal (see Figure 2.4) bit strings can encode the sign; hence we do not normally use “+” or “−” with binary or hexadecimal notation.

Check Yourself
What is the decimal value of this 64-bit two’s complement number?

    1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1000two

1) −4ten
2) −8ten
3) −16ten
4) 18,446,744,073,709,551,609ten

Elaboration: Two’s complement gets its name from the rule that the unsigned sum of an n-bit number and its n-bit negative is \(2^n\); hence, the negation or complement of a number x is \(2^n - x\), or its “two’s complement.”

A third alternative representation to two’s complement and sign and magnitude is called one’s complement. The negative of a one’s complement is found by inverting each bit, from 0 to 1 and from 1 to 0, or \(\bar{x}\). This relation helps explain its name since the complement of x is \(2^n - x - 1\). It was also an attempt to be a better solution than sign and magnitude, and several early scientific computers did use the notation. This representation is similar to two’s complement except that it also has two 0s: 00 . . . 00two is positive 0 and 11 . . . 11two is negative 0. The most negative number, 10 . . . 000two, represents −2,147,483,647ten, and so the positives and negatives are balanced. One’s complement adders did need an extra step to subtract a number, and hence two’s complement dominates today.

A final notation, which we will look at when we discuss floating point in Chapter 3, is to represent the most negative value by 00 . . . 000two and the most positive value by 11 . . . 11two, with 0 typically having the value 10 . . . 00two. This is called a biased notation, since it biases the number such that the number plus the bias has a non-negative representation.

one’s complement: A notation that represents the most negative value by 10 . . . 000two and the most positive value by 01 . . .
11two, leaving an equal number of negatives and positives but ending up with two zeros, one positive (00 . . . 00two) and one negative (11 . . . 11two). The term is also used to mean the inversion of every bit in a pattern: 0 to 1 and 1 to 0.

biased notation: A notation that represents the most negative value by 00 . . . 000two and the most positive value by 11 . . . 11two, with 0 typically having the value 10 . . . 00two, thereby biasing the number such that the number plus the bias has a non-negative representation.

2.5 Representing Instructions in the Computer

We are now ready to explain the difference between the way humans instruct computers and the way computers see instructions.

Instructions are kept in the computer as a series of high and low electronic signals and may be represented as numbers. In fact, each piece of an instruction can be considered as an individual number, and placing these numbers side by side forms the instruction.

Since registers are referred to in instructions, there must be a convention to map register names into numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0 to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register 18, . . . , $t0 means register 8, $t1 means register 9, and so on. We’ll describe the convention for the rest of the 32 registers in the following sections.

EXAMPLE: Translating a MIPS Assembly Instruction into a Machine Instruction
Let’s do the next step in the refinement of the MIPS language as an example. We’ll show the real MIPS language version of the instruction represented symbolically as

    add $t0,$s1,$s2

first as a combination of decimal numbers and then of binary numbers.

ANSWER: The decimal representation is

    0    17    18    8    0    32

Each of these segments of an instruction is called a field.
The first and last fields (containing 0 and 32 in this case) in combination tell the MIPS computer that this instruction performs addition. The second field gives the number of the register that is the first source operand of the addition operation (17 = $s1), and the third field gives the other source operand for the addition (18 = $s2). The fourth field contains the number of the register that is to receive the sum (8 = $t0). The fifth field is unused in this instruction, so it is set to 0. Thus, this instruction adds register $s1 to register $s2 and places the sum in register $t0.

This instruction can also be represented as fields of binary numbers as opposed to decimal:

    000000  10001   10010   01000   00000   100000
    6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

This layout of the instruction is called the instruction format. As you can see from counting the number of bits, this MIPS instruction takes exactly 32 bits, the same size as a data word. In keeping with our design principle that simplicity favors regularity, all MIPS instructions are 32 bits long.

To distinguish it from assembly language, we call the numeric version of instructions machine language and a sequence of such instructions machine code.

It would appear that you would now be reading and writing long, tedious strings of binary numbers. We avoid that tedium by using a higher base than binary that converts easily into binary. Since almost all computer data sizes are multiples of 4, hexadecimal (base 16) numbers are popular. Since base 16 is a power of 2, we can trivially convert by replacing each group of four binary digits by a single hexadecimal digit, and vice versa. Figure 2.4 converts between hexadecimal and binary.

instruction format: A form of representation of an instruction composed of fields of binary numbers.

machine language: Binary representation used for communication within a computer system.
hexadecimal: Numbers in base 16.

    Hexadecimal  Binary     Hexadecimal  Binary     Hexadecimal  Binary     Hexadecimal  Binary
    0hex         0000two    4hex         0100two    8hex         1000two    chex         1100two
    1hex         0001two    5hex         0101two    9hex         1001two    dhex         1101two
    2hex         0010two    6hex         0110two    ahex         1010two    ehex         1110two
    3hex         0011two    7hex         0111two    bhex         1011two    fhex         1111two

FIGURE 2.4 The hexadecimal-binary conversion table. Just replace one hexadecimal digit by the corresponding four binary digits, and vice versa. If the length of the binary number is not a multiple of 4, go from right to left.

Because we frequently deal with different number bases, to avoid confusion we will subscript decimal numbers with ten, binary numbers with two, and hexadecimal numbers with hex. (If there is no subscript, the default is base 10.) By the way, C and Java use the notation 0xnnnn for hexadecimal numbers.

EXAMPLE: Binary to Hexadecimal and Back
Convert the following hexadecimal and binary numbers into the other base:

    eca8 6420hex
    0001 0011 0101 0111 1001 1011 1101 1111two

ANSWER: Using Figure 2.4, the answer is just a table lookup one way:

    eca8 6420hex
    1110 1100 1010 1000 0110 0100 0010 0000two

And then the other direction:

    0001 0011 0101 0111 1001 1011 1101 1111two
    1357 9bdfhex

MIPS Fields
MIPS fields are given names to make them easier to discuss:

    op      rs      rt      rd      shamt   funct
    6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

Here is the meaning of each name of the fields in MIPS instructions:
■ op: Basic operation of the instruction, traditionally called the opcode.
■ rs: The first register source operand.
■ rt: The second register source operand.
■ rd: The register destination operand. It gets the result of the operation.
■ shamt: Shift amount. (Section 2.6 explains shift instructions and this term; it will not be used until then, and hence the field contains zero in this section.)
■ funct: Function. This field, often called the function code, selects the specific variant of the operation in the op field.

opcode: The field that denotes the operation and format of an instruction.

A problem occurs when an instruction needs longer fields than those shown above. For example, the load word instruction must specify two registers and a constant. If the address were to use one of the 5-bit fields in the format above, the constant within the load word instruction would be limited to only \(2^5\) or 32. This constant is used to select elements from arrays or data structures, and it often needs to be much larger than 32. This 5-bit field is too small to be useful.

Hence, we have a conflict between the desire to keep all instructions the same length and the desire to have a single instruction format. This leads us to the final hardware design principle:

Design Principle 3: Good design demands good compromises.

The compromise chosen by the MIPS designers is to keep all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions. For example, the format above is called R-type (for register) or R-format. A second type of instruction format is called I-type (for immediate) or I-format and is used by the immediate and data transfer instructions. The fields of I-format are

    op      rs      rt      constant or address
    6 bits  5 bits  5 bits  16 bits

The 16-bit address means a load word instruction can load any word within a region of \(\pm 2^{15}\) or 32,768 bytes (\(\pm 2^{13}\) or 8192 words) of the address in the base register rs. Similarly, add immediate is limited to constants no larger than \(\pm 2^{15}\). We see that more than 32 registers would be difficult in this format, as the rs and rt fields would each need another bit, making it harder to fit everything in one word.
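The limit imposed by the 16-bit field can be sketched in C, together with the sign extension shortcut from Section 2.4 that widens the field to 32 bits. This is an illustrative sketch only: the helper names `fits_imm16` and `sign_extend16` are hypothetical, and the exact range checked is the two's complement range −32,768 to 32,767 stated in Section 2.4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can `value` be stored in a 16-bit two's complement field? */
bool fits_imm16(int32_t value)
{
    return value >= -32768 && value <= 32767;
}

/* Hypothetical helper: widen a 16-bit field to a 32-bit pattern by
   replicating bit 15 (the sign bit) into the upper 16 bits. */
uint32_t sign_extend16(uint16_t field)
{
    uint32_t wide = field;          /* lower 16 bits are copied unchanged */
    if (wide & 0x8000u)
        wide |= 0xFFFF0000u;        /* negative: fill the upper half with 1s */
    return wide;
}
```

With these definitions, an offset such as 1200 fits in the field, while 32,768 does not; the 16-bit pattern for −2 widens to the 32-bit pattern of all 1s ending in 1110.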
Let’s look at the load word instruction from page 71:

    lw $t0,32($s3)    # Temporary reg $t0 gets A[8]

Here, 19 (for $s3) is placed in the rs field, 8 (for $t0) is placed in the rt field, and 32 is placed in the address field. Note that the meaning of the rt field has changed for this instruction: in a load word instruction, the rt field specifies the destination register, which receives the result of the load.

Although multiple formats complicate the hardware, we can reduce the complexity by keeping the formats similar. For example, the first three fields of the R-type and I-type formats are the same size and have the same names; the length of the fourth field in I-type is equal to the sum of the lengths of the last three fields of R-type.

In case you were wondering, the formats are distinguished by the values in the first field: each format is assigned a distinct set of values in the first field (op) so that the hardware knows whether to treat the last half of the instruction as three fields (R-type) or as a single field (I-type). Figure 2.5 shows the numbers used in each field for the MIPS instructions covered so far.

    Instruction      Format  op     rs   rt   rd    shamt  funct  address
    add              R       0      reg  reg  reg   0      32ten  n.a.
    sub (subtract)   R       0      reg  reg  reg   0      34ten  n.a.
    add immediate    I       8ten   reg  reg  n.a.  n.a.   n.a.   constant
    lw (load word)   I       35ten  reg  reg  n.a.  n.a.   n.a.   address
    sw (store word)  I       43ten  reg  reg  n.a.  n.a.   n.a.   address

FIGURE 2.5 MIPS instruction encoding. In the table above, “reg” means a register number between 0 and 31, “address” means a 16-bit address, and “n.a.” (not applicable) means this field does not appear in this format. Note that add and sub instructions have the same value in the op field; the hardware uses the funct field to decide the variant of the operation: add (32) or subtract (34).
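How the op field selects between the two layouts can be sketched in C as a field decoder. This is a rough illustration, not the hardware's actual logic: the struct `mips_fields` and the function `decode` are hypothetical names, and treating op = 0 as R-type is an assumption that holds only for the instructions in Figure 2.5.

```c
#include <stdint.h>

/* Hypothetical container for the fields of one 32-bit MIPS instruction. */
struct mips_fields {
    unsigned op, rs, rt;        /* first 16 bits: shared by R-type and I-type */
    unsigned rd, shamt, funct;  /* last 16 bits when R-type */
    unsigned address;           /* last 16 bits when I-type */
    int is_r_type;
};

struct mips_fields decode(uint32_t word)
{
    struct mips_fields f = {0};
    f.op = (word >> 26) & 0x3Fu;          /* 6 bits */
    f.rs = (word >> 21) & 0x1Fu;          /* 5 bits */
    f.rt = (word >> 16) & 0x1Fu;          /* 5 bits */
    f.is_r_type = (f.op == 0);            /* assumption: true for add and sub only */
    if (f.is_r_type) {
        f.rd    = (word >> 11) & 0x1Fu;   /* 5 bits */
        f.shamt = (word >> 6)  & 0x1Fu;   /* 5 bits */
        f.funct = word & 0x3Fu;           /* 6 bits */
    } else {
        f.address = word & 0xFFFFu;       /* 16 bits, not yet sign-extended */
    }
    return f;
}
```

For instance, decoding the word 0x8E680020 under these assumptions yields op 35, rs 19, rt 8, and address 32, the fields of lw $t0,32($s3).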
EXAMPLE: Translating MIPS Assembly Language into Machine Language
We can now take an example all the way from what the programmer writes to what the computer executes. If $t1 has the base of the array A and $s2 corresponds to h, the assignment statement

    A[300] = h + A[300];

is compiled into

    lw  $t0,1200($t1)   # Temporary reg $t0 gets A[300]
    add $t0,$s2,$t0     # Temporary reg $t0 gets h + A[300]
    sw  $t0,1200($t1)   # Stores h + A[300] back into A[300]

What is the MIPS machine language code for these three instructions?

ANSWER: For convenience, let’s first represent the machine language instructions using decimal numbers. From Figure 2.5, we can determine the three machine language instructions:

    op   rs   rt   rd   address/shamt   funct
    35   9    8         1200
    0    18   8    8    0               32
    43   9    8         1200

The lw instruction is identified by 35 (see Figure 2.5) in the first field (op). The base register 9 ($t1) is specified in the second field (rs), and the destination register 8 ($t0) is specified in the third field (rt). The offset to select A[300] (1200 = 300 × 4) is found in the final field (address).

The add instruction that follows is specified with 0 in the first field (op) and 32 in the last field (funct). The three register operands (18, 8, and 8) are found in the second, third, and fourth fields and correspond to $s2, $t0, and $t0.

The sw instruction is identified with 43 in the first field. The rest of this final instruction is identical to the lw instruction.

Since 1200ten = 0000 0100 1011 0000two, the binary equivalent to the decimal form is:

    100011  01001  01000  0000 0100 1011 0000
    000000  10010  01000  01000  00000  100000
    101011  01001  01000  0000 0100 1011 0000

Note the similarity of the binary representations of the first and last instructions. The only difference is in the third bit from the left, which is 0 for lw and 1 for sw.
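The translation above can be checked with a short C sketch that packs the decimal fields into 32-bit words. The function names `encode_r` and `encode_i` are hypothetical, and each field is masked to its width on the assumption that callers may pass wider values.

```c
#include <stdint.h>

/* Hypothetical helper: pack R-format fields (6, 5, 5, 5, 5, 6 bits). */
uint32_t encode_r(unsigned op, unsigned rs, unsigned rt,
                  unsigned rd, unsigned shamt, unsigned funct)
{
    return ((uint32_t)(op    & 0x3Fu) << 26) |
           ((uint32_t)(rs    & 0x1Fu) << 21) |
           ((uint32_t)(rt    & 0x1Fu) << 16) |
           ((uint32_t)(rd    & 0x1Fu) << 11) |
           ((uint32_t)(shamt & 0x1Fu) << 6)  |
            (uint32_t)(funct & 0x3Fu);
}

/* Hypothetical helper: pack I-format fields (6, 5, 5, 16 bits). */
uint32_t encode_i(unsigned op, unsigned rs, unsigned rt, unsigned address)
{
    return ((uint32_t)(op & 0x3Fu) << 26) |
           ((uint32_t)(rs & 0x1Fu) << 21) |
           ((uint32_t)(rt & 0x1Fu) << 16) |
            (uint32_t)(address & 0xFFFFu);
}
```

Under these assumptions, `encode_i(35, 9, 8, 1200)` and `encode_i(43, 9, 8, 1200)` give 0x8D2804B0 and 0xAD2804B0, which differ only in the third bit from the left, and `encode_r(0, 18, 8, 8, 0, 32)` gives 0x02484020.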
Hardware/Software Interface
The desire to keep all instructions the same size is in conflict with the desire to have as many registers as possible. Any increase in the number of registers uses up at least one more bit in every register field of the instruction format. Given these constraints and the design principle that smaller is faster, most instruction sets today have 16 or 32 general purpose registers.

MIPS machine language

    Name        Format  Example                                        Comments
    add         R       0       18      19      17      0       32      add $s1,$s2,$s3
    sub         R       0       18      19      17      0       34      sub $s1,$s2,$s3
    addi        I       8       18      17      100                     addi $s1,$s2,100
    lw          I       35      18      17      100                     lw $s1,100($s2)
    sw          I       43      18      17      100                     sw $s1,100($s2)
    Field size          6 bits  5 bits  5 bits  5 bits  5 bits  6 bits  All MIPS instructions are 32 bits long
    R-format    R       op      rs      rt      rd      shamt   funct   Arithmetic instruction format
    I-format    I       op      rs      rt      address                 Data transfer format

FIGURE 2.6 MIPS architecture revealed through Section 2.5. The two MIPS instruction formats so far are R and I. The first 16 bits are the same: both contain an op field, giving the base operation; an rs field, giving one of the sources; and the rt field, which specifies the other source operand, except for load word, where it specifies the destination register. R-format divides the last 16 bits into an rd field, specifying the destination register; the shamt field, which Section 2.6 explains; and the funct field, which specifies the specific operation of R-format instructions. I-format combines the last 16 bits into a single address field.

Figure 2.6 summarizes the portions of MIPS machine language described in this section. As we shall see in Chapter 4, the similarity of the binary representations of related instructions simplifies hardware design. These similarities are another example of regularity in the MIPS architecture.

The BIG Picture
Today’s computers are built on two key principles:
1. Instructions are represented as numbers.
2. Programs are stored in memory to be read or written, just like data.

These principles lead to the stored-program concept; its invention let the computing genie out of its bottle. Figure 2.7 shows the power of the concept; specifically, memory can contain the source code for an editor program, the corresponding compiled machine code, the text that the compiled program is using, and even the compiler that generated the machine code.

One consequence of instructions as numbers is that programs are often shipped as files of binary numbers. The commercial implication is that computers can inherit ready-made software provided they are compatible with an existing instruction set. Such “binary compatibility” often leads industry to align around a small number of instruction set architectures.

[Figure 2.7: a processor connected to a memory that holds an accounting program (machine code), an editor program (machine code), a C compiler (machine code), payroll data, book text, and source code in C for the editor program.]

FIGURE 2.7 The stored-program concept. Stored programs allow a computer that performs accounting to become, in the blink of an eye, a computer that helps an author write a book. The switch happens simply by loading memory with programs and data and then telling the computer to begin executing at a given location in memory. Treating instructions in the same way as data greatly simplifies both the memory hardware and the software of computer systems. Specifically, the memory technology needed for data can also be used for programs, and programs like compilers, for instance, can translate code written in a notation far more convenient for humans into code that the computer can understand.

Check Yourself
What MIPS instruction does this represent? Choose from one of the four options below.

    op   rs   rt   rd   shamt   funct
    0    8    9    10   0       34

1. sub $t0, $t1, $t2
2. add $t2, $t0, $t1
3. sub $t2, $t1, $t0
4. sub $t2, $t0, $t1

2.6 Logical Operations

“Contrariwise,” continued Tweedledee, “if it was so, it might be; and if it were so, it would be; but as it isn’t, it ain’t. That’s logic.”
Lewis Carroll, Alice’s Adventures in Wonderland, 1865

Although the first computers operated on full words, it soon became clear that it was useful to operate on fields of bits within a word or even on individual bits. Examining characters within a word, each of which is stored as 8 bits, is one example of such an operation (see Section 2.9). It follows that operations were added to programming languages and instruction set architectures to simplify, among other things, the packing and unpacking of bits into words. These instructions are called logical operations. Figure 2.8 shows logical operations in C, Java, and MIPS.

    Logical operations  C operators  Java operators  MIPS instructions
    Shift left          <<           <<              sll
    Shift right         >>           >>>             srl
    Bit-by-bit AND      &            &               and, andi
    Bit-by-bit OR       |            |               or, ori
    Bit-by-bit NOT      ~            ~               nor

FIGURE 2.8 C and Java logical operators and their corresponding MIPS instructions. MIPS implements NOT using a NOR with one operand being zero.

The first class of such operations is called shifts. They move all the bits in a word to the left or right, filling the emptied bits with 0s. For example, if register $s0 contained

    0000 0000 0000 0000 0000 0000 0000 1001two = 9ten

and the instruction to shift left by 4 was executed, the new value would be:

    0000 0000 0000 0000 0000 0000 1001 0000two = 144ten

The dual of a shift left is a shift right. The two MIPS shift instructions are called shift left logical (sll) and shift right logical (srl). The following instruction performs the operation above, assuming that the original value was in register $s0 and the result should go in register $t2:

    sll $t2,$s0,4    # reg $t2 = reg $s0 << 4 bits

We delayed explaining the shamt field in the R-format. Used in shift instructions, it stands for shift amount. Hence, the machine language version of the instruction above is

    op   rs   rt   rd   shamt   funct
    0    0    16   10   4       0

The encoding of sll is 0 in both the op and funct fields, rd contains 10 (register $t2), rt contains 16 (register $s0), and shamt contains 4. The rs field is unused and thus is set to 0.

Shift left logical provides a bonus benefit. Shifting left by i bits gives the same result as multiplying by \(2^i\), just as shifting a decimal number by i digits is equivalent to multiplying by \(10^i\). For example, the above sll shifts by 4, which gives the same result as multiplying by \(2^4\) or 16. The first bit pattern above represents 9, and 9 × 16 = 144, the value of the second bit pattern.

Another useful operation that isolates fields is AND. (We capitalize the word to avoid confusion between the operation and the English conjunction.) AND is a bit-by-bit operation that leaves a 1 in the result only if both bits of the operands are 1. For example, if register $t2 contains

    0000 0000 0000 0000 0000 1101 1100 0000two

and register $t1 contains

    0000 0000 0000 0000 0011 1100 0000 0000two

then, after executing the MIPS instruction

    and $t0,$t1,$t2    # reg $t0 = reg $t1 & reg $t2

the value of register $t0 would be

    0000 0000 0000 0000 0000 1100 0000 0000two

As you can see, AND can apply a bit pattern to a set of bits to force 0s where there is a 0 in the bit pattern. Such a bit pattern in conjunction with AND is traditionally called a mask, since the mask “conceals” some bits.

AND: A logical bit-by-bit operation with two operands that calculates a 1 only if there is a 1 in both operands.

To place a value into one of these seas of 0s, there is the dual to AND, called OR. It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1.
To elaborate, if the registers $t1 and $t2 are unchanged from the preceding example, the result of the MIPS instruction

    or $t0,$t1,$t2    # reg $t0 = reg $t1 | reg $t2

is this value in register $t0:

    0000 0000 0000 0000 0011 1101 1100 0000two

The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if one operand bit is a 0, and vice versa. Using our prior notation, it calculates \(\bar{x}\).

In keeping with the three-operand format, the designers of MIPS decided to include the instruction NOR (NOT OR) instead of NOT. If one operand is zero, then it is equivalent to NOT: A NOR 0 = NOT (A OR 0) = NOT (A).

If the register $t1 is unchanged from the preceding example and register $t3 has the value 0, the result of the MIPS instruction

    nor $t0,$t1,$t3    # reg $t0 = ~ (reg $t1 | reg $t3)

is this value in register $t0:

    1111 1111 1111 1111 1100 0011 1111 1111two

Figure 2.8 above shows the relationship between the C and Java operators and the MIPS instructions. Constants are useful in AND and OR logical operations as well as in arithmetic operations, so MIPS also provides the instructions and immediate (andi) and or immediate (ori). Constants are rare for NOR, since its main use is to invert the bits of a single operand; thus, the MIPS instruction set architecture has no immediate version of NOR.

Elaboration: The full MIPS instruction set also includes exclusive or (XOR), which sets the bit to 1 when two corresponding bits differ, and to 0 when they are the same. C allows bit fields or fields to be defined within words, both allowing objects to be packed within a word and to match an externally enforced interface such as an I/O device. All fields must fit within a single word. Fields are unsigned integers that can be as short as 1 bit. C compilers insert and extract fields using logical instructions in MIPS: and, or, sll, and srl.
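The register examples in this section can be replayed in C using the operators listed in Figure 2.8. The sketch below is illustrative only: the function names (`sll`, `srl`, `nor`, `extract_field`, `insert_field`) are hypothetical wrappers, shift amounts are assumed to lie between 0 and 31, and field positions are assumed to stay within one 32-bit word.

```c
#include <stdint.h>

/* Hypothetical wrappers named after the MIPS instructions they imitate. */
uint32_t sll(uint32_t x, unsigned shamt) { return (uint32_t)(x << shamt); }
uint32_t srl(uint32_t x, unsigned shamt) { return (uint32_t)(x >> shamt); }
uint32_t nor(uint32_t a, uint32_t b)     { return (uint32_t)~(a | b); }

/* Mask with the lowest `width` bits set (width assumed 1..32). */
static uint32_t low_mask(unsigned width)
{
    return (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

/* Isolate a field: shift it down to bit 0, then AND with a mask. */
uint32_t extract_field(uint32_t word, unsigned low, unsigned width)
{
    return srl(word, low) & low_mask(width);
}

/* Place a value into a field: AND clears the old bits, OR sets the new ones. */
uint32_t insert_field(uint32_t word, unsigned low, unsigned width, uint32_t value)
{
    uint32_t mask = sll(low_mask(width), low);
    return (word & ~mask) | (sll(value, low) & mask);
}
```

With $t1 = 0x00003C00 and $t2 = 0x00000DC0 as in the text, `t1 & t2` is 0x00000C00, `t1 | t2` is 0x00003DC0, and `nor(t1, 0)` is 0xFFFFC3FF. The two field helpers show one way a compiler might combine and, or, sll, and srl to unpack and pack bit fields.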
Elaboration: Logical AND immediate and logical OR immediate put 0s into the upper 16 bits to form a 32-bit constant, unlike add immediate, which does sign extension.

OR: A logical bit-by-bit operation with two operands that calculates a 1 if there is a 1 in either operand.

NOT: A logical bit-by-bit operation with one operand that inverts the bits; that is, it replaces every 1 with a 0, and every 0 with a 1.

NOR: A logical bit-by-bit operation with two operands that calculates the NOT of the OR of the two operands. That is, it calculates a 1 only if there is a 0 in both operands.

Check Yourself
Which operations can isolate a field in a word?
1. AND
2. A shift left followed by a shift right

2.7 Instructions for Making Decisions

What distinguishes a computer from a simple calculator is its ability to make decisions. Based on the input data and the values created during computation, different instructions execute. Decision making is commonly represented in programming languages using the if statement, sometimes combined with go to statements and labels. MIPS assembly language includes two decision-making instructions, similar to an if statement with a go to. The first instruction is

    beq register1, register2, L1

This instruction means go to the statement labeled L1 if the value in register1 equals the value in register2. The mnemonic beq stands for branch if equal. The second instruction is

    bne register1, register2, L1

It means go to the statement labeled L1 if the value in register1 does not equal the value in register2. The mnemonic bne stands for branch if not equal. These two instructions are traditionally called conditional branches.

EXAMPLE: Compiling if-then-else into Conditional Branches
In the following code segment, f, g, h, i, and j are variables.
If the five variables f through j correspond to the five registers $s0 through $s4, what is the compiled MIPS code for this C if statement?

    if (i == j) f = g + h; else f = g - h;

ANSWER: Figure 2.9 shows a flowchart of what the MIPS code should do. The first expression compares for equality, so it would seem that we would want the branch if registers are equal instruction (beq). In general, the code will be more efficient if we test for the opposite condition to branch over the code that performs the subsequent then part of the if (the label Else is defined below) and so we use the branch if registers are not equal instruction (bne):

    bne $s3,$s4,Else    # go to Else if i ≠ j

“The utility of an automatic computer lies in the possibility of using a given sequence of instructions repeatedly, the number of times it is iterated being dependent upon the results of the computation . . . . This choice can be made to depend upon the sign of a number (zero being reckoned as plus for machine purposes). Consequently, we introduce an [instruction] (the conditional transfer [instruction]) which will, depending on the sign of a given number, cause the proper one of two routines to be executed.”
Burks, Goldstine, and von Neumann, 1947

The next assignment statement performs a single operation, and if all the operands are allocated to registers, it is just one instruction:

    add $s0,$s1,$s2    # f = g + h (skipped if i ≠ j)

We now need to go to the end of the if statement. This example introduces another kind of branch, often called an unconditional branch. This instruction says that the processor always follows the branch. To distinguish between conditional and unconditional branches, the MIPS name for this type of instruction is jump, abbreviated as j (the label Exit is defined below).

    j Exit    # go to Exit

The assignment statement in the else portion of the if statement can again be compiled into a single instruction.
We just need to append the label Else to this instruction. We also show the label Exit that is after this instruction, showing the end of the if-then-else compiled code:

Else: sub $s0,$s1,$s2 # f = g - h (skipped if i = j)
Exit:

Notice that the assembler relieves the compiler and the assembly language programmer from the tedium of calculating addresses for branches, just as it does for calculating data addresses for loads and stores (see Section 2.12).

FIGURE 2.9 Illustration of the options in the if statement above. The left box corresponds to the then part of the if statement, and the right box corresponds to the else part. [Flowchart labels: i == j?; i = j: f = g + h; i ≠ j: Else: f = g - h; Exit:]

conditional branch An instruction that requires the comparison of two values and that allows for a subsequent transfer of control to a new address in the program based on the outcome of the comparison.

Compilers frequently create branches and labels where they do not appear in the programming language. Avoiding the burden of writing explicit labels and branches is one benefit of writing in high-level programming languages and is a reason coding is faster at that level.

Loops

Decisions are important both for choosing between two alternatives (found in if statements) and for iterating a computation (found in loops). The same assembly instructions are the building blocks for both cases.

Compiling a while Loop in C

Here is a traditional loop in C:

while (save[i] == k) i += 1;

Assume that i and k correspond to registers $s3 and $s5 and the base of the array save is in $s6. What is the MIPS assembly code corresponding to this C segment?

The first step is to load save[i] into a temporary register. Before we can load save[i] into a temporary register, we need to have its address. Before we can add i to the base of array save to form the address, we must multiply the index i by 4 due to the byte addressing problem.
Fortunately, we can use shift left logical, since shifting left by 2 bits multiplies by 2^2 or 4 (see page 88 in the prior section). We need to add the label Loop to it so that we can branch back to that instruction at the end of the loop:

Loop: sll $t1,$s3,2 # Temp reg $t1 = i * 4

To get the address of save[i], we need to add $t1 and the base of save in $s6:

add $t1,$t1,$s6 # $t1 = address of save[i]

Now we can use that address to load save[i] into a temporary register:

lw $t0,0($t1) # Temp reg $t0 = save[i]

The next instruction performs the loop test, exiting if save[i] ≠ k:

bne $t0,$s5, Exit # go to Exit if save[i] ≠ k

The next instruction adds 1 to i:

addi $s3,$s3,1 # i = i + 1

The end of the loop branches back to the while test at the top of the loop. We just add the Exit label after it, and we’re done:

j Loop # go to Loop
Exit:

(See the exercises for an optimization of this sequence.)

Such sequences of instructions that end in a branch are so fundamental to compiling that they are given their own buzzword: a basic block is a sequence of instructions without branches, except possibly at the end, and without branch targets or branch labels, except possibly at the beginning. One of the first phases of compilation is breaking the program into basic blocks.

The test for equality or inequality is probably the most popular test, but sometimes it is useful to see if a variable is less than another variable. For example, a for loop may want to test to see if the index variable is less than 0. Such comparisons are accomplished in MIPS assembly language with an instruction that compares two registers and sets a third register to 1 if the first is less than the second; otherwise, it is set to 0. The MIPS instruction is called set on less than, or slt.
For example,

slt $t0, $s3, $s4 # $t0 = 1 if $s3 < $s4

means that register $t0 is set to 1 if the value in register $s3 is less than the value in register $s4; otherwise, register $t0 is set to 0.

Constant operands are popular in comparisons, so there is an immediate version of the set on less than instruction. To test if register $s2 is less than the constant 10, we can just write

slti $t0,$s2,10 # $t0 = 1 if $s2 < 10

MIPS compilers use the slt, slti, beq, bne, and the fixed value of 0 (always available by reading register $zero) to create all relative conditions: equal, not equal, less than, less than or equal, greater than, greater than or equal.

basic block A sequence of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning).

Heeding von Neumann’s warning about the simplicity of the “equipment,” the MIPS architecture doesn’t include branch on less than because it is too complicated; either it would stretch the clock cycle time or it would take extra clock cycles per instruction. Two faster instructions are more useful.

Comparison instructions must deal with the dichotomy between signed and unsigned numbers. Sometimes a bit pattern with a 1 in the most significant bit represents a negative number and, of course, is less than any positive number, which must have a 0 in the most significant bit. With unsigned integers, on the other hand, a 1 in the most significant bit represents a number that is larger than any that begins with a 0. (We’ll soon take advantage of this dual meaning of the most significant bit to reduce the cost of the array bounds checking.) MIPS offers two versions of the set on less than comparison to handle these alternatives.
Set on less than (slt) and set on less than immediate (slti) work with signed integers. Unsigned integers are compared using set on less than unsigned (sltu) and set on less than immediate unsigned (sltiu).

Signed versus Unsigned Comparison

Suppose register $s0 has the binary number

1111 1111 1111 1111 1111 1111 1111 1111 two

and that register $s1 has the binary number

0000 0000 0000 0000 0000 0000 0000 0001 two

What are the values of registers $t0 and $t1 after these two instructions?

slt $t0, $s0, $s1 # signed comparison
sltu $t1, $s0, $s1 # unsigned comparison

The value in register $s0 represents -1ten if it is an integer and 4,294,967,295ten if it is an unsigned integer. The value in register $s1 represents 1ten in either case. Then register $t0 has the value 1, since -1ten < 1ten, and register $t1 has the value 0, since 4,294,967,295ten > 1ten.

Treating signed numbers as if they were unsigned gives us a low cost way of checking if 0 ≤ x < y, which matches the index out-of-bounds check for arrays. The key is that negative integers in two’s complement notation look like large numbers in unsigned notation; that is, the most significant bit is a sign bit in the former notation but a large part of the number in the latter. Thus, an unsigned comparison of x < y also checks if x is negative as well as if x is less than y.

Bounds Check Shortcut

Use this shortcut to reduce an index-out-of-bounds check: jump to IndexOutOfBounds if $s1 ≥ $t2 or if $s1 is negative.

The checking code just uses sltu to do both checks:

sltu $t0,$s1,$t2 # $t0=0 if $s1>=length or $s1<0
beq $t0,$zero,IndexOutOfBounds #if bad, goto Error

Case/Switch Statement

Most programming languages have a case or switch statement that allows the programmer to select one of many alternatives depending on a single value. The simplest way to implement switch is via a sequence of conditional tests, turning the switch statement into a chain of if-then-else statements.
Sometimes the alternatives may be more efficiently encoded as a table of addresses of alternative instruction sequences, called a jump address table or jump table, and the program needs only to index into the table and then jump to the appropriate sequence. The jump table is then just an array of words containing addresses that correspond to labels in the code. The program loads the appropriate entry from the jump table into a register. It then needs to jump using the address in the register. To support such situations, computers like MIPS include a jump register instruction (jr), meaning an unconditional jump to the address specified in a register. Then it jumps to the proper address using this instruction. We’ll see an even more popular use of jr in the next section.

jump address table Also called jump table. A table of addresses of alternative instruction sequences.

Although there are many statements for decisions and loops in programming languages like C and Java, the bedrock statement that implements them at the instruction set level is the conditional branch.

Elaboration: If you have heard about delayed branches, covered in Chapter 4, don’t worry: the MIPS assembler makes them invisible to the assembly language programmer.

Check Yourself
I. C has many statements for decisions and loops, while MIPS has few. Which of the following do or do not explain this imbalance? Why?
1. More decision statements make code easier to read and understand.
2. Fewer decision statements simplify the task of the underlying layer that is responsible for execution.
3. More decision statements mean fewer lines of code, which generally reduces coding time.
4. More decision statements mean fewer lines of code, which generally results in the execution of fewer operations.

II.
Why does C provide two sets of operators for AND (& and &&) and two sets of operators for OR (| and ||), while MIPS doesn’t?
1. Logical operations AND and OR implement & and |, while conditional branches implement && and ||.
2. The previous statement has it backwards: && and || correspond to logical operations, while & and | map to conditional branches.
3. They are redundant and mean the same thing: && and || are simply inherited from the programming language B, the predecessor of C.

2.8 Supporting Procedures in Computer Hardware

A procedure or function is one tool programmers use to structure programs, both to make them easier to understand and to allow code to be reused. Procedures allow the programmer to concentrate on just one portion of the task at a time; parameters act as an interface between the procedure and the rest of the program and data, since they can pass values and return results. We describe the equivalent to procedures in Java in Section 2.15, but Java needs everything from a computer that C needs. Procedures are one way to implement abstraction in software.

procedure A stored subroutine that performs a specific task based on the parameters with which it is provided.

You can think of a procedure like a spy who leaves with a secret plan, acquires resources, performs the task, covers his or her tracks, and then returns to the point of origin with the desired result. Nothing else should be perturbed once the mission is complete. Moreover, a spy operates on only a “need to know” basis, so the spy can’t make assumptions about his employer.

Similarly, in the execution of a procedure, the program must follow these six steps:
1. Put parameters in a place where the procedure can access them.
2. Transfer control to the procedure.
3. Acquire the storage resources needed for the procedure.
4. Perform the desired task.
5.
Put the result value in a place where the calling program can access it.
6. Return control to the point of origin, since a procedure can be called from several points in a program.

As mentioned above, registers are the fastest place to hold data in a computer, so we want to use them as much as possible. MIPS software uses the following convention for procedure calling in allocating its 32 registers:

■ $a0–$a3: four argument registers in which to pass parameters
■ $v0–$v1: two value registers in which to return values
■ $ra: one return address register to return to the point of origin

In addition to allocating these registers, MIPS assembly language includes an instruction just for the procedures: it jumps to an address and simultaneously saves the address of the following instruction in register $ra. The jump-and-link instruction (jal) is simply written

jal ProcedureAddress

The link portion of the name means that an address or link is formed that points to the calling site to allow the procedure to return to the proper address. This “link,” stored in register $ra (register 31), is called the return address. The return address is needed because the same procedure could be called from several parts of the program.

To support such situations, computers like MIPS use the jump register instruction (jr), introduced above to help with case statements, meaning an unconditional jump to the address specified in a register:

jr $ra

jump-and-link instruction An instruction that jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS).

return address A link to the calling site that allows a procedure to return to the proper address; in MIPS it is stored in register $ra.

The jump register instruction jumps to the address stored in register $ra, which is just what we want.
Thus, the calling program, or caller, puts the parameter values in $a0–$a3 and uses jal X to jump to procedure X (sometimes named the callee). The callee then performs the calculations, places the results in $v0 and $v1, and returns control to the caller using jr $ra.

Implicit in the stored-program idea is the need to have a register to hold the address of the current instruction being executed. For historical reasons, this register is almost always called the program counter, abbreviated PC in the MIPS architecture, although a more sensible name would have been instruction address register. The jal instruction actually saves PC + 4 in register $ra to link to the following instruction to set up the procedure return.

Using More Registers

Suppose a compiler needs more registers for a procedure than the four argument and two return value registers. Since we must cover our tracks after our mission is complete, any registers needed by the caller must be restored to the values that they contained before the procedure was invoked. This situation is an example in which we need to spill registers to memory, as mentioned in the Hardware/Software Interface section above.

The ideal data structure for spilling registers is a stack, a last-in-first-out queue. A stack needs a pointer to the most recently allocated address in the stack to show where the next procedure should place the registers to be spilled or where old register values are found. The stack pointer is adjusted by one word for each register that is saved or restored. MIPS software reserves register 29 for the stack pointer, giving it the obvious name $sp. Stacks are so popular that they have their own buzzwords for transferring data to and from the stack: placing data onto the stack is called a push, and removing data from the stack is called a pop.

By historical precedent, stacks “grow” from higher addresses to lower addresses.
This convention means that you push values onto the stack by subtracting from the stack pointer. Adding to the stack pointer shrinks the stack, thereby popping values off the stack.

Compiling a C Procedure That Doesn’t Call Another Procedure

Let’s turn the example on page 65 from Section 2.2 into a C procedure:

int leaf_example (int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}

What is the compiled MIPS assembly code?

caller The program that instigates a procedure and provides the necessary parameter values.

callee A procedure that executes a series of stored instructions based on parameters provided by the caller and then returns control to the caller.

program counter (PC) The register containing the address of the instruction in the program being executed.

stack A data structure for spilling registers organized as a last-in-first-out queue.

stack pointer A value denoting the most recently allocated address in a stack that shows where registers should be spilled or where old register values can be found. In MIPS, it is register $sp.

push Add element to stack.

pop Remove element from stack.

The parameter variables g, h, i, and j correspond to the argument registers $a0, $a1, $a2, and $a3, and f corresponds to $s0. The compiled program starts with the label of the procedure:

leaf_example:

The next step is to save the registers used by the procedure. The C assignment statement in the procedure body is identical to the example on page 68, which uses two temporary registers. Thus, we need to save three registers: $s0, $t0, and $t1.
We “push” the old values onto the stack by creating space for three words (12 bytes) on the stack and then storing them:

addi $sp, $sp, -12 # adjust stack to make room for 3 items
sw $t1, 8($sp) # save register $t1 for use afterwards
sw $t0, 4($sp) # save register $t0 for use afterwards
sw $s0, 0($sp) # save register $s0 for use afterwards

Figure 2.10 shows the stack before, during, and after the procedure call. The next three statements correspond to the body of the procedure, which follows the example on page 68:

add $t0,$a0,$a1 # register $t0 contains g + h
add $t1,$a2,$a3 # register $t1 contains i + j
sub $s0,$t0,$t1 # f = $t0 - $t1, which is (g + h)-(i + j)

To return the value of f, we copy it into a return value register:

add $v0,$s0,$zero # returns f ($v0 = $s0 + 0)

Before returning, we restore the three old values of the registers we saved by “popping” them from the stack:

lw $s0, 0($sp) # restore register $s0 for caller
lw $t0, 4($sp) # restore register $t0 for caller
lw $t1, 8($sp) # restore register $t1 for caller
addi $sp,$sp,12 # adjust stack to delete 3 items

The procedure ends with a jump register using the return address:

jr $ra # jump back to calling routine

In the previous example, we used temporary registers and assumed their old values must be saved and restored. To avoid saving and restoring a register whose value is never used, which might happen with a temporary register, MIPS software separates 18 of the registers into two groups:

■ $t0–$t9: temporary registers that are not preserved by the callee (called procedure) on a procedure call
■ $s0–$s7: saved registers that must be preserved on a procedure call (if used, the callee saves and restores them)

This simple convention reduces register spilling.
In the example above, since the caller does not expect registers $t0 and $t1 to be preserved across a procedure call, we can drop two stores and two loads from the code. We still must save and restore $s0, since the callee must assume that the caller needs its value.

Nested Procedures

Procedures that do not call others are called leaf procedures. Life would be simple if all procedures were leaf procedures, but they aren’t. Just as a spy might employ other spies as part of a mission, who in turn might use even more spies, so do procedures invoke other procedures. Moreover, recursive procedures even invoke “clones” of themselves. Just as we need to be careful when using registers in procedures, more care must also be taken when invoking nonleaf procedures.

For example, suppose that the main program calls procedure A with an argument of 3, by placing the value 3 into register $a0 and then using jal A. Then suppose that procedure A calls procedure B via jal B with an argument of 7, also placed in $a0. Since A hasn’t finished its task yet, there is a conflict over the use of register $a0. Similarly, there is a conflict over the return address in register $ra, since it now has the return address for B. Unless we take steps to prevent the problem, this conflict will eliminate procedure A’s ability to return to its caller.

One solution is to push all the other registers that must be preserved onto the stack, just as we did with the saved registers. The caller pushes any argument registers ($a0–$a3) or temporary registers ($t0–$t9) that are needed after the call. The callee pushes the return address register $ra and any saved registers ($s0–$s7) used by the callee. The stack pointer $sp is adjusted to account for the number of registers placed on the stack. Upon the return, the registers are restored from memory and the stack pointer is readjusted.
FIGURE 2.10 The values of the stack pointer and the stack (a) before, (b) during, and (c) after the procedure call. The stack pointer always points to the “top” of the stack, or the last word in the stack in this drawing. [Figure labels: High address, Low address; Contents of register $t1, Contents of register $t0, Contents of register $s0; $sp shown in each of (a), (b), and (c).]

Compiling a Recursive C Procedure, Showing Nested Procedure Linking

Let’s tackle a recursive procedure that calculates factorial:

int fact (int n)
{
    if (n < 1) return (1);
    else return (n * fact(n - 1));
}

What is the MIPS assembly code?

The parameter variable n corresponds to the argument register $a0. The compiled program starts with the label of the procedure and then saves two registers on the stack, the return address and $a0:

fact:
addi $sp, $sp, -8 # adjust stack for 2 items
sw $ra, 4($sp) # save the return address
sw $a0, 0($sp) # save the argument n

The first time fact is called, sw saves an address in the program that called fact. The next two instructions test whether n is less than 1, going to L1 if n ≥ 1.

slti $t0,$a0,1 # test for n < 1
beq $t0,$zero,L1 # if n >= 1, go to L1

If n is less than 1, fact returns 1 by putting 1 into a value register: it adds 1 to
0 and places that sum in $v0. It then pops the two saved values off the stack
and jumps to the return address:

addi $v0,$zero,1 # return 1
addi $sp,$sp,8 # pop 2 items off stack
jr $ra # return to caller

Before popping two items off the stack, we could have loaded $a0 and
$ra. Since $a0 and $ra don’t change when n is less than 1, we skip those
instructions.

If n is not less than 1, the argument n is decremented and then fact is
called again with the decremented value:

L1: addi $a0,$a0,-1 # n >= 1: argument gets (n - 1)
jal fact # call fact with (n - 1)


The next instruction is where fact returns. Now the old return address and
old argument are restored, along with the stack pointer:

lw $a0, 0($sp) # return from jal: restore argument n
lw $ra, 4($sp) # restore the return address
addi $sp, $sp, 8 # adjust stack pointer to pop 2 items

Next, the value register $v0 gets the product of old argument $a0 and
the current value of the value register. We assume a multiply instruction is
available, even though it is not covered until Chapter 3:

mul $v0,$a0,$v0 # return n * fact (n - 1)

Finally, fact jumps again to the return address:

jr $ra # return to the caller

A C variable is generally a location in storage, and its interpretation depends both
on its type and storage class. Examples include integers and characters (see Section
2.9). C has two storage classes: automatic and static. Automatic variables are local to
a procedure and are discarded when the procedure exits. Static variables exist across
exits from and entries to procedures. C variables declared outside all procedures
are considered static, as are any variables declared using the keyword static. The
rest are automatic. To simplify access to static data, MIPS software reserves another
register, called the global pointer, or $gp.

Figure 2.11 summarizes what is preserved across a procedure call. Note that
several schemes preserve the stack, guaranteeing that the caller will get the same
data back on a load from the stack as it stored onto the stack. The stack above $sp
is preserved simply by making sure the callee does not write above $sp; $sp is
itself preserved by the callee adding exactly the same amount that was subtracted
from it; and the other registers are preserved by saving them on the stack (if they
are used) and restoring them from there.

global pointer The register that is reserved to point to the static area.

Preserved                        Not preserved
Saved registers: $s0–$s7         Temporary registers: $t0–$t9
Stack pointer register: $sp      Argument registers: $a0–$a3
Return address register: $ra     Return value registers: $v0–$v1
Stack above the stack pointer    Stack below the stack pointer

FIGURE 2.11 What is and what is not preserved across a procedure call. If the software relies
on the frame pointer register or on the global pointer register, discussed in the following subsections, they
are also preserved.

Allocating Space for New Data on the Stack
The final complexity is that the stack is also used to store variables that are local
to the procedure but do not fit in registers, such as local arrays or structures. The
segment of the stack containing a procedure’s saved registers and local variables is
called a procedure frame or activation record. Figure 2.12 shows the state of the
stack before, during, and after the procedure call.

Some MIPS software uses a frame pointer ($fp) to point to the first word of
the frame of a procedure. A stack pointer might change during the procedure, and
so references to a local variable in memory might have different offsets depending
on where they are in the procedure, making the procedure harder to understand.
Alternatively, a frame pointer offers a stable base register within a procedure for
local memory-references. Note that an activation record appears on the stack
whether or not an explicit frame pointer is used. We’ve been avoiding using $fp by
avoiding changes to $sp within a procedure: in our examples, the stack is adjusted
only on entry and exit of the procedure.

procedure frame Also called activation record. The segment of the stack containing a
procedure’s saved registers and local variables.

frame pointer A value denoting the location of the saved registers and local variables
for a given procedure.

FIGURE 2.12 Illustration of the stack allocation (a) before, (b) during, and (c) after the
procedure call. The frame pointer ($fp) points to the first word of the frame, often a saved argument
register, and the stack pointer ($sp) points to the top of the stack. The stack is adjusted to make room for
all the saved registers and any memory-resident local variables. Since the stack pointer may change during
program execution, it’s easier for programmers to reference variables via the stable frame pointer, although it
could be done just with the stack pointer and a little address arithmetic. If there are no local variables on the
stack within a procedure, the compiler will save time by not setting and restoring the frame pointer. When a
frame pointer is used, it is initialized using the address in $sp on a call, and $sp is restored using $fp. This
information is also found in Column 4 of the MIPS Reference Data Card at the front of this book.
[Figure labels: High address, Low address; $fp, $sp; Saved argument registers (if any), Saved return
address, Saved saved registers (if any), Local arrays and structures (if any).]


Allocating Space for New Data on the Heap
In addition to automatic variables that are local to procedures, C programmers
need space in memory for static variables and for dynamic data structures. Figure
2.13 shows the MIPS convention for allocation of memory. The stack starts in the
high end of memory and grows down. The first part of the low end of memory is
reserved, followed by the home of the MIPS machine code, traditionally called
the text segment. Above the code is the static data segment, which is the place
for constants and other static variables. Although arrays tend to be a fixed length
and thus are a good match to the static data segment, data structures like linked
lists tend to grow and shrink during their lifetimes. The segment for such data
structures is traditionally called the heap, and it is placed next in memory. Note
that this allocation allows the stack and heap to grow toward each other, thereby
allowing the efficient use of memory as the two segments wax and wane.

text segment The segment of a UNIX object file that contains the machine language code
for routines in the source file.

FIGURE 2.13 The MIPS memory allocation for program and data. These addresses are only
a software convention, and not part of the MIPS architecture. The stack pointer is initialized to 7fff
fffchex and grows down toward the data segment. At the other end, the program code (“text”) starts at
0040 0000hex. The static data starts at 1000 0000hex. Dynamic data, allocated by malloc in C and by
new in Java, is next. It grows up toward the stack in an area called the heap. The global pointer, $gp, is set to
an address to make it easy to access data. It is initialized to 1000 8000hex so that it can access from 1000
0000hex to 1000 ffffhex using the positive and negative 16-bit offsets from $gp. This information is also
found in Column 4 of the MIPS Reference Data Card at the front of this book.
[Figure: memory layout from high to low addresses: Stack ($sp = 7fff fffchex), Dynamic data, Static data
(starting at 1000 0000hex, with $gp = 1000 8000hex), Text (pc = 0040 0000hex), Reserved.]

C allocates and frees space on the heap with explicit functions. malloc()
allocates space on the heap and returns a pointer to it, and free() releases
space on the heap to which the pointer points. Memory allocation is controlled by
programs in C, and it is the source of many common and difficult bugs. Forgetting
to free space leads to a “memory leak,” which eventually uses up so much memory
that the operating system may crash. Freeing space too early leads to “dangling
pointers,” which can cause pointers to point to things that the program never
intended. Java uses automatic memory allocation and garbage collection just to
avoid such bugs.

Figure 2.14 summarizes the register conventions for the MIPS assembly
language. This convention is another example of making the common case fast:
most procedures can be satisfied with up to 4 arguments, 2 registers for a return
value, 8 saved registers, and 10 temporary registers without ever going to memory.

Name       Register number   Usage                                          Preserved on call?
$zero      0                 The constant value 0                           n.a.
$v0–$v1    2–3               Values for results and expression evaluation   no
$a0–$a3    4–7               Arguments                                      no
$t0–$t7    8–15              Temporaries                                    no
$s0–$s7    16–23             Saved                                          yes
$t8–$t9    24–25             More temporaries                               no
$gp        28                Global pointer                                 yes
$sp        29                Stack pointer                                  yes
$fp        30                Frame pointer                                  yes
$ra        31                Return address                                 yes

FIGURE 2.14 MIPS register conventions. Register 1, called $at, is reserved for the assembler (see
Section 2.12), and registers 26–27, called $k0–$k1, are reserved for the operating system. This information
is also found in Column 2 of the MIPS Reference Data Card at the front of this book.

Elaboration: What if there are more than four parameters? The MIPS convention is
to place the extra parameters on the stack just above the frame pointer. The procedure
then expects the first four parameters to be in registers $a0 through $a3 and the rest
in memory, addressable via the frame pointer.

As mentioned in the caption of Figure 2.12, the frame pointer is convenient because
all references to variables in the stack within a procedure will have the same offset.
The frame pointer is not necessary, however. The GNU MIPS C compiler uses a frame
pointer, but the C compiler from MIPS does not; it treats register 30 as another save
register ($s8).

Elaboration: Some recursive procedures can be implemented iteratively without using
recursion. Iteration can significantly improve performance by removing the overhead
associated with recursive procedure calls. For example, consider a procedure used to
accumulate a sum:

int sum (int n, int acc) {
    if (n > 0)
        return sum(n - 1, acc + n);
    else
        return acc;
}

Consider the procedure call sum(3,0). This will result in recursive calls to
sum(2,3), sum(1,5), and sum(0,6), and then the result 6 will be returned four
times. This recursive call of sum is referred to as a tail call, and this example use of
tail recursion can be implemented very efficiently (assume $a0 = n and $a1 = acc):

sum: slti $t0, $a0, 1 # test if n <= 0
bne $t0, $zero, sum_exit # go to sum_exit if n <= 0
add $a1, $a1, $a0 # add n to acc
addi $a0, $a0, -1 # subtract 1 from n
j sum # go to sum
sum_exit:
add $v0, $a1, $zero # return value acc
jr $ra # return to caller

Check Yourself

Which of the following statements about C and Java are generally true?

1. C programmers manage data explicitly, while it’s automatic in Java.

2. C leads to more pointer bugs and memory leak bugs than does Java.

2.9 Communicating with People

!(@ | = > (wow open tab at bar is great)
Fourth line of the keyboard poem “Hatless Atlas,” 1991 (some give names to ASCII
characters: “!” is “wow,” “(” is open, “|” is bar, and so on).

Computers were invented to crunch numbers, but as soon as they became
commercially viable they were used to process text. Most computers today offer
8-bit bytes to represent characters, with the American Standard Code for
Information Interchange (ASCII) being the representation that nearly everyone
follows. Figure 2.15 summarizes ASCII.

ASCII            ASCII            ASCII            ASCII            ASCII            ASCII
value Character  value Character  value Character  value Character  value Character  value Character
 32   space       48   0           64   @           80   P           96   `          112   p
 33   !           49   1           65   A           81   Q           97   a          113   q
 34   "           50   2           66   B           82   R           98   b          114   r
 35   #           51   3           67   C           83   S           99   c          115   s
 36   $           52   4           68   D           84   T          100   d          116   t
 37   %           53   5           69   E           85   U          101   e          117   u
 38   &           54   6           70   F           86   V          102   f          118   v
 39   '           55   7           71   G           87   W          103   g          119   w
 40   (           56   8           72   H           88   X          104   h          120   x
 41   )           57   9           73   I           89   Y          105   i          121   y
 42   *           58   :           74   J           90   Z          106   j          122   z
 43   +           59   ;           75   K           91   [          107   k          123   {
 44   ,           60   <           76   L           92   \          108   l          124   |
 45   -           61   =           77   M           93   ]          109   m          125   }
 46   .           62   >           78   N           94   ^          110   n          126   ~
 47   /           63   ?           79   O           95   _          111   o          127   DEL

FIGURE 2.15 ASCII representation of characters. Note that upper- and lowercase letters differ by exactly 32; this observation can
lead to shortcuts in checking or changing upper- and lowercase. Values not shown include formatting characters. For example, 8 represents a
backspace, 9 represents a tab character, and 13 a carriage return. Another useful value is 0 for null, the value the programming language C uses
to mark the end of a string. This information is also found in Column 3 of the MIPS Reference Data Card at the front of this book.
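
The shortcut mentioned in the caption can be sketched in C. This is a minimal illustration, not library code: the helper names are hypothetical, and the functions assume plain ASCII input (they ignore locales and non-ASCII characters).

```c
#include <stdbool.h>

/* Hypothetical ASCII-only helpers; not locale-aware. */
static bool is_ascii_lower(char c) { return c >= 'a' && c <= 'z'; }
static bool is_ascii_upper(char c) { return c >= 'A' && c <= 'Z'; }

/* 'a' is 97 and 'A' is 65, so the two cases differ by exactly 32. */
static char to_ascii_upper(char c) {
    return is_ascii_lower(c) ? (char)(c - 32) : c;
}

static char to_ascii_lower(char c) {
    return is_ascii_upper(c) ? (char)(c + 32) : c;
}
```

Characters outside the two letter ranges, such as digits, are returned unchanged.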

ASCII versus Binary Numbers

We could represent numbers as strings of ASCII digits instead of as integers.
How much does storage increase if the number 1 billion is represented in
ASCII versus a 32-bit integer?

One billion is 1,000,000,000, so it would take 10 ASCII digits, each 8 bits long.
Thus the storage expansion would be (10 × 8)/32 or 2.5. Beyond the expansion
in storage, the hardware to add, subtract, multiply, and divide such decimal
numbers is difficult and would consume more energy. Such difficulties explain
why computing professionals are raised to believe that binary is natural and
that the occasional decimal computer is bizarre.
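
The arithmetic in the answer can be checked with a short C sketch. The function names are hypothetical, and the accounting follows the example: 8 bits per decimal digit, with no sign and no terminating byte.

```c
#include <stdio.h>
#include <string.h>

/* Bits needed to store `value` as ASCII decimal digits, 8 bits per digit. */
static unsigned ascii_bits(unsigned long value) {
    char digits[32];
    snprintf(digits, sizeof digits, "%lu", value);
    return (unsigned)strlen(digits) * 8u;
}

/* Ratio of ASCII storage to a 32-bit binary integer. */
static double ascii_expansion(unsigned long value) {
    return ascii_bits(value) / 32.0;
}
```

For one billion, ascii_bits gives 80 and ascii_expansion gives 2.5, matching the answer above.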

A series of instructions can extract a byte from a word, so load word and store
word are sufficient for transferring bytes as well as words. Because of the popularity
of text in some programs, however, MIPS provides instructions to move bytes. Load
byte (lb) loads a byte from memory, placing it in the rightmost 8 bits of a register.
Store byte (sb) takes a byte from the rightmost 8 bits of a register and writes it to
memory. Thus, we copy a byte with the sequence

lb $t0,0($sp) # Read byte from source
sb $t0,0($gp) # Write byte to destination

Characters are normally combined into strings, which have a variable number
of characters. There are three choices for representing a string: (1) the first position
of the string is reserved to give the length of a string, (2) an accompanying variable
has the length of the string (as in a structure), or (3) the last position of a string is
indicated by a character used to mark the end of a string. C uses the third choice,
terminating a string with a byte whose value is 0 (named null in ASCII). Thus,
the string “Cal” is represented in C by the following 4 bytes, shown as decimal
numbers: 67, 97, 108, 0. (As we shall see, Java uses the first option.)
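
The byte layout of “Cal” can be inspected directly in C. In this small sketch the array name and the length helper are hypothetical; the helper does what the C convention requires, which is to scan for the 0 byte.

```c
#include <stddef.h>

/* The literal "Cal" occupies 4 bytes: 'C', 'a', 'l', and the terminating 0. */
static const char cal[] = "Cal";

/* Hypothetical helper: count characters up to, not including, the null byte. */
static size_t c_string_length(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n += 1;
    return n;
}
```

Because the length is not stored anywhere, finding it costs one pass over the string.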

EXAMPLE

ANSWER

Compiling a String Copy Procedure, Showing How to Use C Strings

The procedure strcpy copies string y to string x using the null byte
termination convention of C:

void strcpy (char x[], char y[])
{
    int i;

    i = 0;
    while ((x[i] = y[i]) != '\0') /* copy & test byte */
        i += 1;
}

What is the MIPS assembly code?

Below is the basic MIPS assembly code segment. Assume that base addresses
for arrays x and y are found in $a0 and $a1, while i is in $s0. strcpy
adjusts the stack pointer and then saves the saved register $s0 on the stack:

strcpy:
addi $sp,$sp,-4 # adjust stack for 1 more item
sw $s0, 0($sp) # save $s0

To initialize i to 0, the next instruction sets $s0 to 0 by adding 0 to 0 and
placing that sum in $s0:

add $s0,$zero,$zero # i = 0 + 0

This is the beginning of the loop. The address of y[i] is first formed by adding
i to y[]:

L1: add $t1,$s0,$a1 # address of y[i] in $t1

Note that we don’t have to multiply i by 4 since y is an array of bytes and not
of words, as in prior examples.

To load the character in y[i], we use load byte unsigned, which puts the
character into $t2:

lbu $t2, 0($t1) # $t2 = y[i]

A similar address calculation puts the address of x[i] in $t3, and then the
character in $t2 is stored at that address.

EXAMPLE

ANSWER

add $t3,$s0,$a0 # address of x[i] in $t3
sb $t2, 0($t3) # x[i] = y[i]

Next, we exit the loop if the character was 0. That is, we exit if it is the last
character of the string:

beq $t2,$zero,L2 # if y[i] == 0, go to L2

If not, we increment i and loop back:

addi $s0, $s0,1 # i = i + 1
j L1 # go to L1

If we don’t loop back, it was the last character of the string; we restore $s0 and
the stack pointer, and then return.

L2: lw $s0, 0($sp) # y[i] == 0: end of string.
# Restore old $s0

addi $sp,$sp,4 # pop 1 word off stack
jr $ra # return

String copies usually use pointers instead of arrays in C to avoid the operations
on i in the code above. See Section 2.14 for an explanation of arrays versus
pointers.
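
A pointer-based copy of the kind mentioned here might look like the following sketch. The name strcpy_ptr is hypothetical, chosen to avoid clashing with the library strcpy, and the caller is assumed to provide a destination large enough for the source string and its terminating byte.

```c
/* Hypothetical pointer version of the string copy. */
static void strcpy_ptr(char *x, const char *y) {
    /* Copy one byte, advance both pointers, and stop once the 0 byte
       has been copied. No index variable i is needed. */
    while ((*x++ = *y++) != '\0')
        ;
}
```

The two pointer increments replace the index update and the two address additions of the array version.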

Since the procedure strcpy above is a leaf procedure, the compiler could
allocate i to a temporary register and avoid saving and restoring $s0. Hence,
instead of thinking of the $t registers as being just for temporaries, we can think of
them as registers that the callee should use whenever convenient. When a compiler
finds a leaf procedure, it exhausts all temporary registers before using registers it
must save.

Characters and Strings in Java
Unicode is a universal encoding of the alphabets of most human languages. Figure
2.16 gives a list of Unicode alphabets; there are almost as many alphabets in Unicode
as there are useful symbols in ASCII. To be more inclusive, Java uses Unicode for
characters. By default, it uses 16 bits to represent a character.

Latin       Malayalam                             Tagbanwa                 General Punctuation
Greek       Sinhala                               Khmer                    Spacing Modifier Letters
Cyrillic    Thai                                  Mongolian                Currency Symbols
Armenian    Lao                                   Limbu                    Combining Diacritical Marks
Hebrew      Tibetan                               Tai Le                   Combining Marks for Symbols
Arabic      Myanmar                               Kangxi Radicals          Superscripts and Subscripts
Syriac      Georgian                              Hiragana                 Number Forms
Thaana      Hangul Jamo                           Katakana                 Mathematical Operators
Devanagari  Ethiopic                              Bopomofo                 Mathematical Alphanumeric Symbols
Bengali     Cherokee                              Kanbun                   Braille Patterns
Gurmukhi    Unified Canadian Aboriginal Syllabic  Shavian                  Optical Character Recognition
Gujarati    Ogham                                 Osmanya                  Byzantine Musical Symbols
Oriya       Runic                                 Cypriot Syllabary        Musical Symbols
Tamil       Tagalog                               Tai Xuan Jing Symbols    Arrows
Telugu      Hanunoo                               Yijing Hexagram Symbols  Box Drawing
Kannada     Buhid                                 Aegean Numbers           Geometric Shapes

FIGURE 2.16 Example alphabets in Unicode. Unicode version 4.0 has more than 160 “blocks,”
which is their name for a collection of symbols. Each block is a multiple of 16. For example, Greek starts at
0370hex, and Cyrillic at 0400hex. The first three columns show 48 blocks that correspond to human languages
in roughly Unicode numerical order. The last column has 16 blocks that are multilingual and are not in order.
A 16-bit encoding, called UTF-16, is the default. A variable-length encoding, called UTF-8, keeps the ASCII
subset as eight bits and uses 16 or 32 bits for the other characters. UTF-32 uses 32 bits per character. To learn
more, see www.unicode.org.

The MIPS instruction set has explicit instructions to load and store such 16-bit
quantities, called halfwords. Load half (lh) loads a halfword from memory,
placing it in the rightmost 16 bits of a register. Like load byte, load half (lh) treats
the halfword as a signed number and thus sign-extends to fill the 16 leftmost bits
of the register, while load halfword unsigned (lhu) works with unsigned integers.
Thus, lhu is the more popular of the two. Store half (sh) takes a halfword from the
rightmost 16 bits of a register and writes it to memory. We copy a halfword with
the sequence

lhu $t0,0($sp) # Read halfword (16 bits) from source
sh $t0,0($gp) # Write halfword (16 bits) to destination
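
The difference between lh and lhu can be modeled in C. This is a sketch with hypothetical function names, not a description of the hardware; it assumes the usual two's complement behavior when a 16-bit pattern is converted to int16_t.

```c
#include <stdint.h>

/* Models lh: the halfword is treated as signed, so bit 15 is copied
   into the 16 leftmost bits of the 32-bit result. */
static uint32_t load_half_signed(uint16_t halfword) {
    return (uint32_t)(int32_t)(int16_t)halfword;
}

/* Models lhu: the 16 leftmost bits of the result are filled with 0s. */
static uint32_t load_half_unsigned(uint16_t halfword) {
    return (uint32_t)halfword;
}
```

For the halfword 8001hex the signed model yields FFFF8001hex while the unsigned model yields 00008001hex, which is why character codes are normally loaded with lhu.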

Strings are a standard Java class with special built-in support and predefined
methods for concatenation, comparison, and conversion. Unlike C, Java includes a
word that gives the length of the string, similar to Java arrays.

Elaboration: MIPS software tries to keep the stack aligned to word addresses,
allowing the program to always use lw and sw (which must be aligned) to access the
stack. This convention means that a char variable allocated on the stack occupies 4
bytes, even though it needs less. However, a C string variable or an array of bytes will
pack 4 bytes per word, and a Java string variable or array of shorts packs 2 halfwords
per word.

Elaboration: Reflecting the international nature of the web, most web pages today
use Unicode instead of ASCII.

I. Which of the following statements about characters and strings in C and
Java are true?

1. A string in C takes about half the memory as the same string in Java.

2. Strings are just an informal name for single-dimension arrays of
characters in C and Java.

3. Strings in C and Java use null (0) to mark the end of a string.

4. Operations on strings, like length, are faster in C than in Java.

II. Which type of variable that can contain 1,000,000,000_ten takes the most
memory space?

1. int in C

2. string in C

3. string in Java

2.10 MIPS Addressing for 32-bit Immediates
and Addresses

Although keeping all MIPS instructions 32 bits long simplifies the hardware, there
are times where it would be convenient to have a 32-bit constant or 32-bit address.
This section starts with the general solution for large constants, and then shows the
optimizations for instruction addresses used in branches and jumps.

Check
Yourself

32-Bit Immediate Operands
Although constants are frequently short and fit into the 16-bit field, sometimes they
are bigger. The MIPS instruction set includes the instruction load upper immediate
(lui) specifically to set the upper 16 bits of a constant in a register, allowing a
subsequent instruction to specify the lower 16 bits of the constant. Figure 2.17
shows the operation of lui.

Loading a 32-Bit Constant

What is the MIPS assembly code to load this 32-bit constant into register $s0?

0000 0000 0011 1101 0000 1001 0000 0000

First, we would load the upper 16 bits, which is 61 in decimal, using lui:

lui $s0, 61 # 61 decimal = 0000 0000 0011 1101 binary

The value of register $s0 afterward is

0000 0000 0011 1101 0000 0000 0000 0000

The next step is to insert the lower 16 bits, whose decimal value is 2304:

ori $s0, $s0, 2304 # 2304 decimal = 0000 1001 0000 0000

The final value in register $s0 is the desired value:

0000 0000 0011 1101 0000 1001 0000 0000
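
The two-step load can be mirrored in C to check the bit patterns. This is only a model of the register contents; the function names are hypothetical.

```c
#include <stdint.h>

/* Models lui: the immediate goes to the upper 16 bits, lower 16 bits are 0. */
static uint32_t lui_model(uint16_t imm) {
    return (uint32_t)imm << 16;
}

/* Models ori: the 16-bit immediate is zero-extended, then ORed in. */
static uint32_t ori_model(uint32_t rs, uint16_t imm) {
    return rs | (uint32_t)imm;
}

/* Hypothetical helper: build any 32-bit constant from its two halves. */
static uint32_t load_constant(uint32_t value) {
    uint16_t upper = (uint16_t)(value >> 16);
    uint16_t lower = (uint16_t)(value & 0xFFFFu);
    return ori_model(lui_model(upper), lower);
}
```

With the values from the example, lui_model(61) is 003D0000hex and ORing in 2304 gives 003D0900hex, which is 4,000,000 in decimal.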

EXAMPLE

ANSWER

The machine language version of lui $t0, 255 # $t0 is register 8:

001111 00000 01000 0000 0000 1111 1111

Contents of register $t0 after executing lui $t0, 255:

0000 0000 1111 1111 0000 0000 0000 0000

FIGURE 2.17 The effect of the lui instruction. The instruction lui transfers the 16-bit immediate constant field value into the
leftmost 16 bits of the register, filling the lower 16 bits with 0s.

Either the compiler or the assembler must break large constants into pieces and
then reassemble them into a register. As you might expect, the immediate field’s
size restriction may be a problem for memory addresses in loads and stores as
well as for constants in immediate instructions. If this job falls to the assembler,
as it does for MIPS software, then the assembler must have a temporary register
available in which to create the long values. This need is a reason for the register
$at (assembler temporary), which is reserved for the assembler.

Hence, the symbolic representation of the MIPS machine language is no longer
limited by the hardware, but by whatever the creator of an assembler chooses to
include (see Section 2.12). We stick close to the hardware to explain the architecture
of the computer, noting when we use the enhanced language of the assembler that
is not found in the processor.

Elaboration: Creating 32-bit constants needs care. The instruction addi copies the
left-most bit of the 16-bit immediate field of the instruction into the upper 16 bits of a
word. Logical or immediate from Section 2.6 loads 0s into the upper 16 bits and hence
is used by the assembler in conjunction with lui to create 32-bit constants.
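
The hazard can be seen with a lower half whose left-most bit is 1. The C sketch below uses hypothetical function names and ignores the overflow exception that a real addi can raise; it models only the sign extension of the immediate.

```c
#include <stdint.h>

/* Models addi: the 16-bit immediate is sign-extended before the add.
   Overflow trapping is deliberately not modeled. */
static uint32_t addi_model(uint32_t rs, uint16_t imm) {
    uint32_t extended = (imm & 0x8000u) ? (0xFFFF0000u | imm) : imm;
    return rs + extended;  /* unsigned arithmetic wraps modulo 2^32 */
}

/* Models ori: the 16-bit immediate is zero-extended before the OR. */
static uint32_t ori_model(uint32_t rs, uint16_t imm) {
    return rs | (uint32_t)imm;
}
```

Starting from an upper half of 003D0000hex and a lower half of 8900hex, ori_model produces 003D8900hex, while addi_model produces 003C8900hex because the sign-extended immediate subtracts 1 from the upper half.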

Addressing in Branches and Jumps
The MIPS jump instructions have the simplest addressing. They use the final MIPS
instruction format, called the J-type, which consists of 6 bits for the operation field
and the rest of the bits for the address field. Thus,

j 10000 # go to location 10000

could be assembled into this format (it’s actually a bit more complicated, as we will
see):

2 10000

6 bits 26 bits

where the value of the jump opcode is 2 and the jump address is 10000.
Unlike the jump instruction, the conditional branch instruction must specify
two operands in addition to the branch address. Thus,

bne $s0,$s1,Exit # go to Exit if $s0 ≠ $s1

is assembled into this instruction, leaving only 16 bits for the branch address:

5 16 17 Exit

6 bits 5 bits 5 bits 16 bits

Hardware/
Software
Interface

If addresses of the program had to fit in this 16-bit field, it would mean that no
program could be bigger than 2^16, which is far too small to be a realistic option
today. An alternative would be to specify a register that would always be added
to the branch address, so that a branch instruction would calculate the following:

Program counter = Register + Branch address

This sum allows the program to be as large as 2^32 and still be able to use
conditional branches, solving the branch address size problem. Then the question
is, which register?

The answer comes from seeing how conditional branches are used. Conditional
branches are found in loops and in if statements, so they tend to branch to a
nearby instruction. For example, about half of all conditional branches in SPEC
benchmarks go to locations less than 16 instructions away. Since the program
counter (PC) contains the address of the current instruction, we can branch within
±2^15 words of the current instruction if we use the PC as the register to be added
to the address. Almost all loops and if statements are much smaller than 2^16 words,
so the PC is the ideal choice.

This form of branch addressing is called PC-relative addressing. As we shall see
in Chapter 4, it is convenient for the hardware to increment the PC early to point
to the next instruction. Hence, the MIPS address is actually relative to the address
of the following instruction (PC + 4) as opposed to the current instruction (PC).
It is yet another example of making the common case fast, which in this case is
addressing nearby instructions.

Like most recent computers, MIPS uses PC-relative addressing for all conditional
branches, because the destination of these instructions is likely to be close to the
branch. On the other hand, jump-and-link instructions invoke procedures that
have no reason to be near the call, so they normally use other forms of addressing.
Hence, the MIPS architecture offers long addresses for procedure calls by using the
J-type format for both jump and jump-and-link instructions.

Since all MIPS instructions are 4 bytes long, MIPS stretches the distance of the
branch by having PC-relative addressing refer to the number of words to the next
instruction instead of the number of bytes. Thus, the 16-bit field can branch four
times as far by interpreting the field as a relative word address rather than as a
relative byte address. Similarly, the 26-bit field in jump instructions is also a word
address, meaning that it represents a 28-bit byte address.

Elaboration: Since the PC is 32 bits, 4 bits must come from somewhere else for
jumps. The MIPS jump instruction replaces only the lower 28 bits of the PC, leaving
the upper 4 bits of the PC unchanged. The loader and linker (Section 2.12) must be
careful to avoid placing a program across an address boundary of 256 MB (64 million
instructions); otherwise, a jump must be replaced by a jump register instruction preceded
by other instructions to load the full 32-bit address into a register.
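
The jump address formation and the 256 MB restriction can be sketched in C. The function names are hypothetical, and the sketch assumes the upper 4 bits are taken from PC + 4, the already incremented program counter described above for branches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Jump target: upper 4 bits from PC + 4, lower 28 bits from the
   26-bit word address shifted left by 2. */
static uint32_t jump_target(uint32_t pc, uint32_t addr26) {
    return ((pc + 4u) & 0xF0000000u) | ((addr26 & 0x03FFFFFFu) << 2);
}

/* True if a single jump at `pc` can reach `target`: the target must be
   word aligned and lie in the same 256 MB region as PC + 4. */
static bool reachable_by_jump(uint32_t pc, uint32_t target) {
    return ((pc + 4u) & 0xF0000000u) == (target & 0xF0000000u)
        && (target & 3u) == 0;
}
```

When reachable_by_jump is false, the full 32-bit address has to be loaded into a register and a jump register instruction used instead.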

PC-relative addressing: An addressing regime in which the address is the sum of
the program counter (PC) and a constant in the instruction.

Showing Branch Offset in Machine Language

The while loop on pages 92–93 was compiled into this MIPS assembler code:

Loop: sll $t1,$s3,2 # Temp reg $t1 = 4 * i
add $t1,$t1,$s6 # $t1 = address of save[i]
lw $t0,0($t1) # Temp reg $t0 = save[i]
bne $t0,$s5, Exit # go to Exit if save[i] ≠ k
addi $s3,$s3,1 # i = i + 1
j Loop # go to Loop
Exit:

If we assume we place the loop starting at location 80000 in memory, what is
the MIPS machine code for this loop?

The assembled instructions and their addresses are:

EXAMPLE

ANSWER

80000 0 0 19 9 2 0

80004 0 9 22 9 0 32

80008 35 9 8 0

80012 5 8 21 2

80016 8 19 19 1

80020 2 20000

80024 . . .

Remember that MIPS instructions have byte addresses, so addresses of
sequential words differ by 4, the number of bytes in a word. The bne instruction
on the fourth line adds 2 words or 8 bytes to the address of the following
instruction (80016), specifying the branch destination relative to that following
instruction (8 + 80016) instead of relative to the branch instruction (12 +
80012) or using the full destination address (80024). The jump instruction on
the last line does use the full address (20000 × 4 = 80000), corresponding to
the label Loop.
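
The field values in this example can be recomputed with a few lines of C. This is a sketch with hypothetical function names; it assumes the target is close enough for the 16-bit field and that addresses are word aligned.

```c
#include <stdint.h>

/* 16-bit branch field: word distance from the instruction after the branch. */
static int16_t branch_field(uint32_t branch_pc, uint32_t target) {
    int32_t byte_distance = (int32_t)(target - (branch_pc + 4u));
    return (int16_t)(byte_distance / 4);
}

/* Address reached by a taken branch: (PC + 4) + 4 * sign-extended field. */
static uint32_t branch_target(uint32_t branch_pc, int16_t field) {
    return branch_pc + 4u + (uint32_t)((int32_t)field * 4);
}

/* 26-bit jump field: the word address formed from the low 28 address bits. */
static uint32_t jump_field(uint32_t target) {
    return (target & 0x0FFFFFFFu) >> 2;
}
```

For the bne at 80012 with target 80024 the field is 2, and for the jump to 80000 the field is 20000, matching the assembled values listed above.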

Most conditional branches are to a nearby location, but occasionally they branch
far away, farther than can be represented in the 16 bits of the conditional branch
instruction. The assembler comes to the rescue just as it did with large addresses
or constants: it inserts an unconditional jump to the branch target, and inverts the
condition so that the branch decides whether to skip the jump.

Branching Far Away

Given a branch on register $s0 being equal to register $s1,

beq $s0, $s1, L1

replace it by a pair of instructions that offers a much greater branching distance.

These instructions replace the short-address conditional branch:

bne $s0, $s1, L2
j L1
L2:

MIPS Addressing Mode Summary
Multiple forms of addressing are generically called addressing modes. Figure 2.18
shows how operands are identified for each addressing mode. The MIPS addressing
modes are the following:

1. Immediate addressing, where the operand is a constant within the instruction
itself

2. Register addressing, where the operand is a register

3. Base or displacement addressing, where the operand is at the memory location
whose address is the sum of a register and a constant in the instruction

4. PC-relative addressing, where the branch address is the sum of the PC and a
constant in the instruction

5. Pseudodirect addressing, where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC

Hardware/
Software
Interface

EXAMPLE

ANSWER

addressing mode: One of several addressing regimes delimited by their varied use
of operands and/or addresses.

Although we show MIPS as having 32-bit addresses, nearly all microprocessors
(including MIPS) have 64-bit address extensions (see Appendix E and Section
2.18). These extensions were in response to the needs of software for larger
programs. The process of instruction set extension allows architectures to expand in
a way that lets software move compatibly upward to the next generation
of architecture.

Hardware/
Software
Interface

[Figure 2.18 diagram. Recoverable labels:
1. Immediate addressing: instruction fields op, rs, rt, Immediate.
2. Register addressing: instruction fields op, rs, rt, rd, ..., funct; the operand is in Registers.
3. Base addressing: instruction fields op, rs, rt, Address; Register + Address selects a byte, halfword, or word in Memory.
4. PC-relative addressing: instruction fields op, rs, rt, Address; PC + Address selects a word in Memory.
5. Pseudodirect addressing: the Address field combined with the PC selects a word in Memory.]

FIGURE 2.18 Illustration of the five MIPS addressing modes. The operands are shaded in color.
The operand of mode 3 is in memory, whereas the operand for mode 2 is a register. Note that versions of
load and store access bytes, halfwords, or words. For mode 1, the operand is 16 bits of the instruction itself.
Modes 4 and 5 address instructions in memory, with mode 4 adding a 16-bit address shifted left 2 bits to the
PC and mode 5 concatenating a 26-bit address shifted left 2 bits with the 4 upper bits of the PC. Note that a
single operation can use more than one addressing mode. Add, for example, uses both immediate (addi)
and register (add) addressing.

Decoding Machine Language
Sometimes you are forced to reverse-engineer machine language to create the
original assembly language. One example is when looking at a “core dump.” Figure
2.19 shows the MIPS encoding of the fields for the MIPS machine language. This
figure helps when translating by hand between assembly language and machine
language.

Decoding Machine Code

What is the assembly language statement corresponding to this machine
instruction?

00af8020hex

The first step in converting hexadecimal to binary is to find the op fields:

(Bits: 31 28 26 5 2 0)
0000 0000 1010 1111 1000 0000 0010 0000

We look at the op field to determine the operation. Referring to Figure 2.19,
when bits 31–29 are 000 and bits 28–26 are 000, it is an R-format instruction.
Let’s reformat the binary instruction into R-format fields, listed in Figure 2.20:

op rs rt rd shamt funct
000000 00101 01111 10000 00000 100000

The bottom portion of Figure 2.19 determines the operation of an R-format
instruction. In this case, bits 5–3 are 100 and bits 2–0 are 000, which means
this binary pattern represents an add instruction.

We decode the rest of the instruction by looking at the field values. The
decimal values are 5 for the rs field, 15 for rt, and 16 for rd (shamt is unused).
Figure 2.14 shows that these numbers represent registers $a1, $t7, and $s0.
Now we can reveal the assembly instruction:

add $s0,$a1,$t7
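
The field extraction done by hand above can be written as shifts and masks in C. This is a minimal sketch: the struct and function names are hypothetical, and only the R-format layout from Figure 2.20 is handled.

```c
#include <stdint.h>

/* R-format fields, using the field names from the text. */
struct r_fields {
    unsigned op, rs, rt, rd, shamt, funct;
};

static struct r_fields decode_r_format(uint32_t word) {
    struct r_fields f;
    f.op    = (word >> 26) & 0x3Fu;  /* bits 31-26 */
    f.rs    = (word >> 21) & 0x1Fu;  /* bits 25-21 */
    f.rt    = (word >> 16) & 0x1Fu;  /* bits 20-16 */
    f.rd    = (word >> 11) & 0x1Fu;  /* bits 15-11 */
    f.shamt = (word >>  6) & 0x1Fu;  /* bits 10-6  */
    f.funct =  word        & 0x3Fu;  /* bits 5-0   */
    return f;
}
```

Applied to 00af8020hex, the sketch yields op 0, rs 5, rt 15, rd 16, shamt 0, and funct 32, the same values used to arrive at add $s0,$a1,$t7.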

EXAMPLE

ANSWER

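op(31:26): rows are bits 31–29, columns are bits 28–26

Row 0(000): col 0 R-format; col 1 Bltz/gez; col 2 jump; col 3 jump & link; col 4 branch eq; col 5 branch ne; col 6 blez; col 7 bgtz
Row 1(001): col 0 add immediate; col 1 addiu; col 2 set less than imm.; col 3 set less than imm. unsigned; col 4 andi; col 5 ori; col 6 xori; col 7 load upper immediate
Row 2(010): col 0 TLB; col 1 FlPt
Row 3(011): (no entries)
Row 4(100): col 0 load byte; col 1 load half; col 2 lwl; col 3 load word; col 4 load byte unsigned; col 5 load half unsigned; col 6 lwr
Row 5(101): col 0 store byte; col 1 store half; col 2 swl; col 3 store word; col 6 swr
Row 6(110): col 0 load linked word; col 1 lwc1
Row 7(111): col 0 store cond. word; col 1 swc1

op(31:26)=010000 (TLB), rs(25:21): rows are bits 25–24, columns are bits 23–21

Row 0(00): col 0 mfc0; col 2 cfc0; col 4 mtc0; col 6 ctc0
Row 1(01): (no entries)
Row 2(10): (no entries)
Row 3(11): (no entries)

op(31:26)=000000 (R-format), funct(5:0): rows are bits 5–3, columns are bits 2–0

Row 0(000): col 0 shift left logical; col 2 shift right logical; col 3 sra; col 4 sllv; col 6 srlv; col 7 srav
Row 1(001): col 0 jump register; col 1 jalr; col 4 syscall; col 5 break
Row 2(010): col 0 mfhi; col 1 mthi; col 2 mflo; col 3 mtlo
Row 3(011): col 0 mult; col 1 multu; col 2 div; col 3 divu
Row 4(100): col 0 add; col 1 addu; col 2 subtract; col 3 subu; col 4 and; col 5 or; col 6 xor; col 7 not or (nor)
Row 5(101): col 2 set l.t.; col 3 set l.t. unsigned
Row 6(110): (no entries)
Row 7(111): (no entries)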

FIGURE 2.19 MIPS instruction encoding. This notation gives the value of a field by row and by column. For example, the top portion
of the figure shows load word in row number 4 (100_two for bits 31–29 of the instruction) and column number 3 (011_two for bits 28–26 of the
instruction), so the corresponding value of the op field (bits 31–26) is 100011_two. Underscore means the field is used elsewhere. For example,
R-format in row 0 and column 0 (op = 000000_two) is defined in the bottom part of the figure. Hence, subtract in row 4 and column
2 of the bottom section means that the funct field (bits 5–0) of the instruction is 100010_two and the op field (bits 31–26) is 000000_two. The
floating point value in row 2, column 1 is defined in Figure 3.18 in Chapter 3. Bltz/gez is the opcode for four instructions found
in Appendix A: bltz, bgez, bltzal, and bgezal. This chapter describes instructions given in full name using color, while Chapter 3
describes instructions given in mnemonics using color. Appendix A covers all instructions.

Figure 2.20 shows all the MIPS instruction formats. Figure 2.1 on page 64 shows
the MIPS assembly language revealed in this chapter. The remaining hidden portion
of MIPS instructions deals mainly with arithmetic and real numbers, which are
covered in the next chapter.

I. What is the range of addresses for conditional branches in MIPS (K = 1024)?

1. Addresses between 0 and 64K - 1

2. Addresses between 0 and 256K - 1

3. Addresses up to about 32K before the branch to about 32K after

4. Addresses up to about 128K before the branch to about 128K after

II. What is the range of addresses for jump and jump and link in MIPS
(M = 1024K)?

1. Addresses between 0 and 64M - 1

2. Addresses between 0 and 256M - 1

3. Addresses up to about 32M before the branch to about 32M after

4. Addresses up to about 128M before the branch to about 128M after

5. Anywhere within a block of 64M addresses where the PC supplies the
upper 6 bits

6. Anywhere within a block of 256M addresses where the PC supplies the
upper 4 bits

III. What is the MIPS assembly language instruction corresponding to the
machine instruction with the value 0000 0000_hex?

1. j

2. R-format

3. addi

4. sll

5. mfc0

6. Undefined opcode: there is no legal instruction that corresponds to 0

Check
Yourself

Name        Fields                                      Comments
Field size  6 bits  5 bits  5 bits  5 bits  5 bits  6 bits  All MIPS instructions are 32 bits long
R-format    op  rs  rt  rd  shamt  funct                Arithmetic instruction format
I-format    op  rs  rt  address/immediate               Transfer, branch, imm. format
J-format    op  target address                          Jump instruction format

FIGURE 2.20 MIPS instruction formats.

2.11 Parallelism and Instructions:
Synchronization

Parallel execution is easier when tasks are independent, but often they need to
cooperate. Cooperation usually means some tasks are writing new values that
others must read. To know when a task is finished writing so that it is safe for
another to read, the tasks need to synchronize. If they don’t synchronize, there is a
danger of a data race, where the results of the program can change depending on
how events happen to occur.

For example, recall the analogy of the eight reporters writing a story on page 44 of
Chapter 1. Suppose one reporter needs to read all the prior sections before writing
a conclusion. Hence, he or she must know when the other reporters have finished
their sections, so that there is no danger of sections being changed afterwards. That
is, they had better synchronize the writing and reading of each section so that the
conclusion will be consistent with what is printed in the prior sections.

In computing, synchronization mechanisms are typically built with user-level
software routines that rely on hardware-supplied synchronization instructions. In
this section, we focus on the implementation of lock and unlock synchronization
operations. Lock and unlock can be used straightforwardly to create regions
where only a single processor can operate, called a mutual exclusion, as well as to
implement more complex synchronization mechanisms.

The critical ability we require to implement synchronization in a multiprocessor
is a set of hardware primitives with the ability to atomically read and modify a
memory location. That is, nothing else can interpose itself between the read and
the write of the memory location. Without such a capability, the cost of building
basic synchronization primitives will be high and will increase unreasonably as the
processor count increases.

There are a number of alternative formulations of the basic hardware primitives,
all of which provide the ability to atomically read and modify a location, together
with some way to tell if the read and write were performed atomically. In general,
architects do not expect users to employ the basic hardware primitives, but
instead expect that the primitives will be used by system programmers to build a
synchronization library, a process that is often complex and tricky.

Let’s start with one such hardware primitive and show how it can be used to
build a basic synchronization primitive. One typical operation for building
synchronization operations is the atomic exchange or atomic swap, which
interchanges a value in a register for a value in memory.

To see how to use this to build a basic synchronization primitive, assume that
we want to build a simple lock where the value 0 is used to indicate that the lock
is free and 1 is used to indicate that the lock is unavailable. A processor tries to set
the lock by doing an exchange of 1, which is in a register, with the memory address
corresponding to the lock. The value returned from the exchange instruction is 1
if some other processor had already claimed access, and 0 otherwise. In the latter
case, the value is also changed to 1, preventing any competing exchange in another
processor from also retrieving a 0.

data race: Two memory accesses form a data race if they are from different
threads to the same location, at least one is a write, and they occur one after
another.

For example, consider two processors that each try to do the exchange
simultaneously: this race is broken, since exactly one of the processors will perform
the exchange first, returning 0, and the second processor will return 1 when it does
the exchange. The key to using the exchange primitive to implement synchronization
is that the operation is atomic: the exchange is indivisible, and two simultaneous
exchanges will be ordered by the hardware. It is impossible for two processors
trying to set the synchronization variable in this manner to both think they have
simultaneously set the variable.

Implementing a single atomic memory operation introduces some challenges in
the design of the processor, since it requires both a memory read and a write in a
single, uninterruptible instruction.

An alternative is to have a pair of instructions in which the second instruction
returns a value showing whether the pair of instructions was executed as if the pair
were atomic. The pair of instructions is effectively atomic if it appears as if all other
operations executed by any processor occurred before or after the pair. Thus, when
an instruction pair is effectively atomic, no other processor can change the value
between the instruction pair.

In MIPS this pair of instructions includes a special load called a load linked and
a special store called a store conditional. These instructions are used in sequence:
if the contents of the memory location specified by the load linked are changed
before the store conditional to the same address occurs, then the store conditional
fails. The store conditional is defined to both store the value of a (presumably
different) register in memory and to change the value of that register to a 1 if it
succeeds and to a 0 if it fails. Since the load linked returns the initial value, and the
store conditional returns 1 only if it succeeds, the following sequence implements
an atomic exchange on the memory location specified by the contents of $s1:

again: addi $t0,$zero,1     # copy locked value
       ll   $t1,0($s1)      # load linked
       sc   $t0,0($s1)      # store conditional
       beq  $t0,$zero,again # branch if store fails
       add  $s4,$zero,$t1   # put load value in $s4

Any time a processor intervenes and modifies the value in memory between the
ll and sc instructions, the sc returns 0 in $t0, causing the code sequence to try
again. At the end of this sequence the contents of $s4 and the memory location
specified by $s1 have been atomically exchanged.
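The lock protocol described above (0 means free, 1 means unavailable, and the lock is claimed by exchanging in a 1) can be sketched in C. This is only an illustration under stated assumptions: it uses the C11 `<stdatomic.h>` function `atomic_exchange` in place of the ll/sc sequence, and the type and function names (`xchg_lock`, `xchg_try_lock`, `xchg_acquire`, `xchg_release`) are hypothetical, not taken from the text.

```c
#include <assert.h>
#include <stdatomic.h>

/* Lock word: 0 = free, 1 = unavailable, as in the text. */
typedef struct {
    atomic_int state;
} xchg_lock;                      /* hypothetical type name */

/* One attempt to claim the lock: atomically swap in a 1 and look at
   the value that was there before. Returns 1 if we got the lock. */
static int xchg_try_lock(xchg_lock *lock) {
    int old = atomic_exchange(&lock->state, 1);
    return old == 0;              /* old == 1 means someone else holds it */
}

/* Keep retrying the exchange until it returns 0. */
static void xchg_acquire(xchg_lock *lock) {
    while (!xchg_try_lock(lock)) {
        /* spin */
    }
}

/* Release by storing 0; only the current holder should call this. */
static void xchg_release(xchg_lock *lock) {
    assert(atomic_load(&lock->state) == 1);
    atomic_store(&lock->state, 0);
}
```

Two simultaneous calls to `xchg_try_lock` cannot both return 1, because the exchanges are ordered and only the first one can observe the old value 0.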

Elaboration: Although it was presented for multiprocessor synchronization, atomic
exchange is also useful for the operating system in dealing with multiple processes
in a single processor. To make sure nothing interferes in a single processor, the store
conditional also fails if the processor does a context switch between the two instructions
(see Chapter 5).


An advantage of the load linked/store conditional mechanism is that it can be used
to build other synchronization primitives, such as atomic compare and swap or atomic
fetch-and-increment, which are used in some parallel programming models. These
involve more instructions between the ll and the sc, but not too many.

Since the store conditional will fail after either another attempted store to the load
linked address or any exception, care must be taken in choosing which instructions are
inserted between the two instructions. In particular, only register-register instructions
can safely be permitted; otherwise, it is possible to create deadlock situations where
the processor can never complete the sc because of repeated page faults. In addition,
the number of instructions between the load linked and the store conditional should be
small to minimize the probability that either an unrelated event or a competing processor
causes the store conditional to fail frequently.
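As a rough illustration of building another primitive from a retry loop, the sketch below implements fetch-and-increment in C. It assumes the C11 function `atomic_compare_exchange_weak` as a stand-in for the ll/sc pair (the weak form is allowed to fail spuriously, much as a store conditional can), and the function name `fetch_and_increment` is chosen here for illustration. Only register-style arithmetic (`old + 1`) happens between the read and the conditional write.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically add 1 to *addr and return the value it held beforehand. */
static int fetch_and_increment(atomic_int *addr) {
    assert(addr);
    int old = atomic_load(addr);          /* plays the role of ll */
    /* Plays the role of sc: store old + 1 only if *addr still equals old.
       On failure, old is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(addr, &old, old + 1)) {
        /* retry with the refreshed value of old */
    }
    return old;
}
```

C11 also provides `atomic_fetch_add`, which does this directly; the loop is written out here only to mirror the retry structure described in the text.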

Check Yourself: When do you use primitives like load linked and store conditional?

1. When cooperating threads of a parallel program need to synchronize to get
proper behavior for reading and writing shared data

2. When cooperating processes on a uniprocessor need to synchronize for
reading and writing shared data

2.12 Translating and Starting a Program

This section describes the four steps in transforming a C program in a file on disk
into a program running on a computer. Figure 2.21 shows the translation hierarchy.
Some systems combine these steps to reduce translation time, but these are the
logical four phases that programs go through. This section follows this translation
hierarchy.

Compiler
The compiler transforms the C program into an assembly language program, a
symbolic form of what the machine understands. High-level language programs
take many fewer lines of code than assembly language, so programmer productivity
is much higher.

In 1975, many operating systems and assemblers were written in assembly
language because memories were small and compilers were inefficient. The
million-fold increase in memory capacity per single DRAM chip has reduced
program size concerns, and optimizing compilers today can produce assembly
language programs nearly as well as an assembly language expert, and sometimes
even better for large programs.


assembly language
A symbolic language that
can be translated into
binary machine language.


Assembler
Since assembly language is an interface to higher-level software, the assembler
can also treat common variations of machine language instructions as if they
were instructions in their own right. The hardware need not implement these
instructions; however, their appearance in assembly language simplifies translation
and programming. Such instructions are called pseudoinstructions.

As mentioned above, the MIPS hardware makes sure that register $zero always
has the value 0. That is, whenever register $zero is used, it supplies a 0, and the
programmer cannot change the value of register $zero. Register $zero is used
to create the assembly language instruction that copies the contents of one register
to another. Thus the MIPS assembler accepts this instruction even though it is not
found in the MIPS architecture:

move $t0,$t1 # register $t0 gets register $t1

pseudoinstruction
A common variation
of assembly language
instructions often treated
as if it were an instruction
in its own right.

[Figure 2.21 diagram: C program → Compiler → Assembly language program → Assembler → Object: Machine language module, joined by Object: Library routine (machine language) → Linker → Executable: Machine language program → Loader → Memory]

FIGURE 2.21 A translation hierarchy for C. A high-level language program is first compiled into
an assembly language program and then assembled into an object module in machine language. The linker
combines multiple modules with library routines to resolve all references. The loader then places the machine
code into the proper memory locations for execution by the processor. To speed up the translation process,
some steps are skipped or combined. Some compilers produce object modules directly, and some systems use
linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention
for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked
library routines are x.a, dynamically linked library routines are x.so, and executable files by default are
called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, and .EXE to the same effect.

The assembler converts this assembly language instruction into the machine
language equivalent of the following instruction:

add $t0,$zero,$t1 # register $t0 gets 0 + register $t1
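To make the "machine language equivalent" concrete, here is a minimal C sketch that packs the R-format fields for the add instruction the text gives as the expansion of move. The helper names `encode_r` and `assemble_move` are hypothetical; the field layout (op, rs, rt, rd, shamt, funct), the register numbers ($zero = 0, $t0 = 8, $t1 = 9), and the add function code 20hex are the standard MIPS values. Real assemblers may choose a different expansion for move, so treat this as an illustration of the text's example only.

```c
#include <assert.h>
#include <stdint.h>

enum { REG_ZERO = 0, REG_T0 = 8, REG_T1 = 9 };   /* MIPS register numbers */

/* Pack the six R-format fields: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6). */
static uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                         uint32_t rd, uint32_t shamt, uint32_t funct) {
    assert(op < 64 && rs < 32 && rt < 32 && rd < 32 && shamt < 32 && funct < 64);
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct;
}

/* move dst, src  is assembled as  add dst, $zero, src  (op 0, funct 0x20). */
static uint32_t assemble_move(uint32_t dst, uint32_t src) {
    return encode_r(0, REG_ZERO, src, dst, 0, 0x20);
}
```

With these definitions, `assemble_move(REG_T0, REG_T1)` yields the word 0009 4020hex, which is the encoding of add $t0,$zero,$t1.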

The MIPS assembler also converts blt (branch on less than) into the two
instructions slt and bne mentioned in the example on page 95. Other examples
include bgt, bge, and ble. It also converts branches to faraway locations into a
branch and jump. As mentioned above, the MIPS assembler allows 32-bit constants
to be loaded into a register despite the 16-bit limit of the immediate instructions.

In summary, pseudoinstructions give MIPS a richer set of assembly language
instructions than those implemented by the hardware. The only cost is reserving
one register, $at, for use by the assembler. If you are going to write assembly
programs, use pseudoinstructions to simplify your task. To understand the MIPS
architecture and be sure to get best performance, however, study the real MIPS
instructions found in Figures 2.1 and 2.19.

Assemblers will also accept numbers in a variety of bases. In addition to binary
and decimal, they usually accept a base that is more succinct than binary yet
converts easily to a bit pattern. MIPS assemblers use hexadecimal.

Such features are convenient, but the primary task of an assembler is assembly
into machine code. The assembler turns the assembly language program into an
object file, which is a combination of machine language instructions, data, and
information needed to place instructions properly in memory.

To produce the binary version of each instruction in the assembly language
program, the assembler must determine the addresses corresponding to all labels.
Assemblers keep track of labels used in branches and data transfer instructions
in a symbol table. As you might expect, the table contains pairs of symbols and
addresses.
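A symbol table of this kind can be sketched in a few lines of C. The version below is a simplified illustration, not how any particular assembler is written: it assumes every instruction occupies 4 bytes, that the module starts at address 0, and that the input is just an array holding each instruction's label (or a null pointer when there is none). All names (`struct symtab`, `symtab_add`, `symtab_lookup`, `first_pass`) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 16

struct symbol { const char *name; unsigned addr; };
struct symtab { struct symbol syms[MAX_SYMS]; int count; };

/* Record one (label, address) pair. */
static void symtab_add(struct symtab *t, const char *name, unsigned addr) {
    assert(t->count < MAX_SYMS);
    t->syms[t->count].name = name;
    t->syms[t->count].addr = addr;
    t->count++;
}

/* Returns 1 and stores the address if the label is defined in this module;
   returns 0 if it is undefined here (for example, an external reference). */
static int symtab_lookup(const struct symtab *t, const char *name, unsigned *addr) {
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->syms[i].name, name) == 0) {
            *addr = t->syms[i].addr;
            return 1;
        }
    }
    return 0;
}

/* First pass: labels[i] is the label attached to instruction i, or NULL.
   Instruction i is assumed to sit at byte address 4 * i. */
static void first_pass(struct symtab *t, const char *const labels[], int n) {
    for (int i = 0; i < n; i++) {
        if (labels[i]) {
            symtab_add(t, labels[i], (unsigned)i * 4u);
        }
    }
}
```

A second pass would then call `symtab_lookup` for each branch or data transfer instruction; labels that are not found are the ones left for the linker.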

The object file for UNIX systems typically contains six distinct pieces:

■ The object file header describes the size and position of the other pieces of the
object file.

■ The text segment contains the machine language code.

■ The static data segment contains data allocated for the life of the program.
(UNIX allows programs to use both static data, which is allocated throughout
the program, and dynamic data, which can grow or shrink as needed by the
program. See Figure 2.13.)

■ The relocation information identifies instructions and data words that depend
on absolute addresses when the program is loaded into memory.

■ The symbol table contains the remaining labels that are not defined, such as
external references.

■ The debugging information contains a concise description of how the modules
were compiled so that a debugger can associate machine instructions with C
source files and make data structures readable.

symbol table A table
that matches names of
labels to the addresses of
the memory words that
instructions occupy.

The next subsection shows how to attach such routines that have already been
assembled, such as library routines.

Linker
What we have presented so far suggests that a single change to one line of one
procedure requires compiling and assembling the whole program. Complete
retranslation is a terrible waste of computing resources. This repetition is
particularly wasteful for standard library routines, because programmers would
be compiling and assembling routines that by definition almost never change. An
alternative is to compile and assemble each procedure independently, so that a
change to one line would require compiling and assembling only one procedure.
This alternative requires a new systems program, called a link editor or linker,
which takes all the independently assembled machine language programs and
“stitches” them together.

There are three steps for the linker:

1. Place code and data modules symbolically in memory.

2. Determine the addresses of data and instruction labels.

3. Patch both the internal and external references.

The linker uses the relocation information and symbol table in each object
module to resolve all undefined labels. Such references occur in branch instructions,
jump instructions, and data addresses, so the job of this program is much like that
of an editor: it finds the old addresses and replaces them with the new addresses.
Editing is the origin of the name “link editor,” or linker for short. The reason a
linker is useful is that it is much faster to patch code than it is to recompile and
reassemble.

If all external references are resolved, the linker next determines the memory
locations each module will occupy. Recall that Figure 2.13 on page 104 shows
the MIPS convention for allocation of program and data to memory. Since the
files were assembled in isolation, the assembler could not know where a module’s
instructions and data would be placed relative to other modules. When the linker
places a module in memory, all absolute references, that is, memory addresses that
are not relative to a register, must be relocated to reflect its true location.

The linker produces an executable file that can be run on a computer. Typically,
this file has the same format as an object file, except that it contains no unresolved
references. It is possible to have partially linked files, such as library routines, that
still have unresolved addresses and hence result in object files.

linker Also called
link editor. A systems
program that combines
independently assembled
machine language
programs and resolves all
undefined labels into an
executable file.

executable file
A functional program in
the format of an object
file that contains no
unresolved references.
It can contain symbol
tables and debugging
information. A “stripped
executable” does not
contain that information.
Relocation information
may be included for the
loader.

EXAMPLE

Linking Object Files

Link the two object files below. Show updated addresses of the first few
instructions of the completed executable file. We show the instructions in
assembly language just to make the example understandable; in reality, the
instructions would be numbers.

Note that in the object files we have highlighted the addresses and symbols
that must be updated in the link process: the instructions that refer to the
addresses of procedures A and B and the instructions that refer to the addresses
of data words X and Y.

Object file header
                         Name       Procedure A
                         Text size  100hex
                         Data size  20hex
Text segment             Address    Instruction
                         0          lw $a0, 0($gp)
                         4          jal 0
                         …          …
Data segment             0          (X)
                         …          …
Relocation information   Address    Instruction type   Dependency
                         0          lw                 X
                         4          jal                B
Symbol table             Label      Address
                         X          –
                         B          –

Object file header
                         Name       Procedure B
                         Text size  200hex
                         Data size  30hex
Text segment             Address    Instruction
                         0          sw $a1, 0($gp)
                         4          jal 0
                         …          …
Data segment             0          (Y)
                         …          …
Relocation information   Address    Instruction type   Dependency
                         0          sw                 Y
                         4          jal                A
Symbol table             Label      Address
                         Y          –
                         A          –

ANSWER

Procedure A needs to find the address for the variable labeled X to put in the
load instruction and to find the address of procedure B to place in the jal
instruction. Procedure B needs the address of the variable labeled Y for the
store instruction and the address of procedure A for its jal instruction.

From Figure 2.13 on page 104, we know that the text segment starts
at address 40 0000hex and the data segment at 1000 0000hex. The text of
procedure A is placed at the first address and its data at the second. The object
file header for procedure A says that its text is 100hex bytes and its data is 20hex
bytes, so the starting address for procedure B text is 40 0100hex, and its data
starts at 1000 0020hex.

Executable file header
                 Text size      300hex
                 Data size      50hex
Text segment     Address        Instruction
                 0040 0000hex   lw $a0, 8000hex($gp)
                 0040 0004hex   jal 40 0100hex
                 …              …
                 0040 0100hex   sw $a1, 8020hex($gp)
                 0040 0104hex   jal 40 0000hex
                 …              …
Data segment     Address
                 1000 0000hex   (X)
                 …              …
                 1000 0020hex   (Y)
                 …              …


Now the linker updates the address fields of the instructions. It uses the
instruction type field to know the format of the address to be edited. We have
two types here:

1. The jals are easy because they use pseudodirect addressing. The jal at
address 40 0004hex gets 40 0100hex (the address of procedure B) in its
address field, and the jal at 40 0104hex gets 40 0000hex (the address of
procedure A) in its address field.

2. The load and store addresses are harder because they are relative to a base
register. This example uses the global pointer as the base register. Figure 2.13
shows that $gp is initialized to 1000 8000hex. To get the address 1000 0000hex
(the address of word X), we place 8000hex in the address field of lw at address
40 0000hex. Similarly, we place 8020hex in the address field of sw at address
40 0100hex to get the address 1000 0020hex (the address of word Y).

Elaboration: Recall that MIPS instructions are word aligned, so jal drops the right
two bits to increase the instruction’s address range. Thus, it uses 26 bits to create a
28-bit byte address. Hence, the actual address in the lower 26 bits of the jal instruction
in this example is 10 0040hex, rather than 40 0100hex.
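The address arithmetic in this example can be checked with a small C sketch. The two helper functions are hypothetical names introduced here; they assume a 16-bit sign-extended offset field for lw and sw and a 26-bit word-address field for jal, as the text describes.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit offset field for lw/sw such that $gp + sign_extend(field) == target. */
static uint16_t gp_offset_field(uint32_t target, uint32_t gp) {
    int32_t diff = (int32_t)(target - gp);        /* signed distance from $gp */
    assert(diff >= -32768 && diff <= 32767);      /* must fit in 16 bits */
    return (uint16_t)diff;                        /* keep the low 16 bits */
}

/* 26-bit address field for jal: the byte address with its two low bits dropped. */
static uint32_t jal_field(uint32_t target) {
    assert((target & 3u) == 0);                   /* instructions are word aligned */
    return (target >> 2) & 0x03FFFFFFu;
}
```

With $gp at 1000 8000hex, word X at 1000 0000hex lies 8000hex bytes below $gp, so the field holds 8000hex (that is, -32768 when sign extended), and word Y gives 8020hex. For the jal to procedure B at 40 0100hex, the stored field is 10 0040hex, matching the Elaboration.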

Loader
Now that the executable file is on disk, the operating system reads it to memory and
starts it. The loader follows these steps in UNIX systems:

1. Reads the executable file header to determine size of the text and data
segments.

2. Creates an address space large enough for the text and data.

3. Copies the instructions and data from the executable file into memory.

4. Copies the parameters (if any) to the main program onto the stack.

5. Initializes the machine registers and sets the stack pointer to the first free
location.

6. Jumps to a start-up routine that copies the parameters into the argument
registers and calls the main routine of the program. When the main routine
returns, the start-up routine terminates the program with an exit system
call.

Sections A.3 and A.4 in Appendix A describe linkers and loaders in more detail.

Dynamically Linked Libraries
The first part of this section describes the traditional approach to linking libraries
before the program is run. Although this static approach is the fastest way to call
library routines, it has a few disadvantages:

■ The library routines become part of the executable code. If a new version of
the library is released that fixes bugs or supports new hardware devices, the
statically linked program keeps using the old version.

■ It loads all routines in the library that are called anywhere in the executable,
even if those calls are not executed. The library can be large relative to the
program; for example, the standard C library is 2.5 MB.

These disadvantages lead to dynamically linked libraries (DLLs), where the
library routines are not linked and loaded until the program is run. Both the
program and library routines keep extra information on the location of nonlocal
procedures and their names. In the initial version of DLLs, the loader ran a dynamic
linker, using the extra information in the file to find the appropriate libraries and to
update all external references.

loader A systems
program that places an
object program in main
memory so that it is ready
to execute.

dynamically linked
libraries (DLLs) Library
routines that are linked
to a program during
execution.


Virtually every
problem in computer
science can be solved
by another level of
indirection.
David Wheeler


The downside of the initial version of DLLs was that it still linked all routines
of the library that might be called, versus only those that are called during the
running of the program. This observation led to the lazy procedure linkage version
of DLLs, where each routine is linked only after it is called.

Like many innovations in our field, this trick relies on a level of indirection.
Figure 2.22 shows the technique. It starts with the nonlocal routines calling a set of
dummy routines at the end of the program, with one entry per nonlocal routine.
These dummy entries each contain an indirect jump.

The first time the library routine is called, the program calls the dummy entry
and follows the indirect jump. It points to code that puts a number in a register to
identify the desired library routine and then jumps to the dynamic linker/loader.
The linker/loader finds the desired routine, remaps it, and changes the address in
the indirect jump location to point to that routine. It then jumps to it. When the
routine completes, it returns to the original calling site. Thereafter, the call to the
library routine jumps indirectly to the routine without the extra hops.

[Figure 2.22 diagram. (a) First call to DLL routine: jal in the program text → dummy entry (lw, jr) reading an address from data → li ID, j → dynamic linker/loader (remap DLL routine), j → DLL routine, jr. (b) Subsequent calls to DLL routine: jal in the program text → dummy entry (lw, jr) reading an address from data → DLL routine, jr.]

FIGURE 2.22 Dynamically linked library via lazy procedure linkage. (a) Steps for the first time
a call is made to the DLL routine. (b) The steps to find the routine, remap it, and link it are skipped on
subsequent calls. As we will see in Chapter 5, the operating system may avoid copying the desired routine by
remapping it using virtual memory management.
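The level of indirection used by lazy procedure linkage can be imitated in C with a function pointer. The sketch below is only an analogy under stated assumptions: `lib_square` stands in for a library routine, `square_slot` for the indirect jump location, `square_stub` for the code that hands control to the dynamic linker/loader, and `resolve_count` simply counts how often that slow path runs. None of these names come from the text, and a real dynamic linker patches addresses in the program image rather than a C variable.

```c
#include <assert.h>

static int resolve_count = 0;       /* times the "dynamic linker" path ran */

/* Stand-in for the real library routine. */
static int lib_square(int x) { return x * x; }

static int square_stub(int x);      /* forward declaration of the dummy entry */

/* The indirect jump location: it initially points at the dummy entry. */
static int (*square_slot)(int) = square_stub;

/* Dummy entry: find the routine, patch the slot, then jump to the routine. */
static int square_stub(int x) {
    assert(square_slot == square_stub);  /* only reached before patching */
    resolve_count++;
    square_slot = lib_square;            /* later calls skip this stub */
    return lib_square(x);
}

/* What a call site does: jump through the slot. */
static int call_square(int x) { return square_slot(x); }
```

The first `call_square` pays for the lookup and patch; every later call costs one indirect jump, which is the trade-off summarized in the text.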

In summary, DLLs require extra space for the information needed for dynamic
linking, but do not require that whole libraries be copied or linked. They pay a good
deal of overhead the first time a routine is called, but only a single indirect jump
thereafter. Note that the return from the library pays no extra overhead. Microsoft’s
Windows relies extensively on dynamically linked libraries, and it is also the default
when executing programs on UNIX systems today.

Starting a Java Program
The discussion above captures the traditional model of executing a program,
where the emphasis is on fast execution time for a program targeted to a specific
instruction set architecture, or even a specific implementation of that architecture.
Indeed, it is possible to execute Java programs just like C. Java was invented with
a different set of goals, however. One was to run safely on any computer, even if it
might slow execution time.

Figure 2.23 shows the typical translation and execution steps for Java. Rather
than compile to the assembly language of a target computer, Java is compiled first
to instructions that are easy to interpret: the Java bytecode instruction set (see
Section 2.15). This instruction set is designed to be close to the Java language
so that this compilation step is trivial. Virtually no optimizations are performed.
Like the C compiler, the Java compiler checks the types of data and produces the
proper operation for each type. Java programs are distributed in the binary version
of these bytecodes.

A software interpreter, called a Java Virtual Machine (JVM), can execute Java
bytecodes. An interpreter is a program that simulates an instruction set architecture.
For example, the MIPS simulator used with this book is an interpreter. There is no
need for a separate assembly step since either the translation is so simple that the
compiler fills in the addresses or JVM finds them at runtime.

Java bytecode
Instruction from an
instruction set designed
to interpret Java
programs.

Java Virtual Machine
(JVM) The program that
interprets Java bytecodes.

[Figure 2.23 diagram: Java program → Compiler → Class files (Java bytecodes) → Java Virtual Machine, which links to Java library routines (machine language) and can invoke the Just In Time compiler to produce Compiled Java methods (machine language)]

FIGURE 2.23 A translation hierarchy for Java. A Java program is first compiled into a binary
version of Java bytecodes, with all addresses defined by the compiler. The Java program is now ready to run
on the interpreter, called the Java Virtual Machine (JVM). The JVM links to desired methods in the Java
library while the program is running. To achieve greater performance, the JVM can invoke the JIT compiler,
which selectively compiles methods into the native machine language of the machine on which it is running.
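To give a feel for what "a program that simulates an instruction set architecture" means, here is a deliberately tiny interpreter in C. It is a sketch only: the four opcodes (`OP_PUSH`, `OP_ADD`, `OP_MUL`, `OP_HALT`) form a made-up stack machine invented for this illustration and are not Java bytecodes, and the function name `interpret` is likewise hypothetical.

```c
#include <assert.h>

/* Hypothetical opcodes for a toy stack machine (not Java bytecodes). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Fetch, decode, and execute instructions until OP_HALT;
   returns the value on top of the stack at that point. */
static int interpret(const int *code) {
    int stack[32];
    int sp = 0;                       /* number of values on the stack */
    int pc = 0;                       /* index of the next instruction */
    for (;;) {
        switch (code[pc++]) {         /* fetch and decode */
        case OP_PUSH:                 /* operand follows the opcode */
            assert(sp < 32);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:                  /* pop two values, push their sum */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:                  /* pop two values, push their product */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}
```

Every simulated instruction costs several native instructions for the fetch, the switch, and the bookkeeping, which hints at why interpretation is slower than running compiled code.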

The upside of interpretation is portability. The availability of software Java virtual
machines meant that most people could write and run Java programs shortly
after Java was announced. Today, Java virtual machines are found in hundreds of
millions of devices, in everything from cell phones to Internet browsers.

The downside of interpretation is lower performance. The incredible advances in
performance of the 1980s and 1990s made interpretation viable for many important
applications, but the factor of 10 slowdown when compared to traditionally
compiled C programs made Java unattractive for some applications.

To preserve portability and improve execution speed, the next phase of Java
development was compilers that translated while the program was running. Such
Just In Time compilers (JIT) typically profile the running program to find where
the “hot” methods are and then compile them into the native instruction set on
which the virtual machine is running. The compiled portion is saved for the next
time the program is run, so that it can run faster each time it is run. This balance
of interpretation and compilation evolves over time, so that frequently run Java
programs suffer little of the overhead of interpretation.

As computers get faster so that compilers can do more, and as researchers
invent better ways to compile Java on the fly, the performance gap between Java
and C or C++ is closing. Section 2.15 goes into much greater depth on the
implementation of Java, Java bytecodes, JVM, and JIT compilers.

Check Yourself: Which of the advantages of an interpreter over a translator do you think was most
important for the designers of Java?

1. Ease of writing an interpreter

2. Better error messages

3. Smaller object code

4. Machine independence

2.13 A C Sort Example to Put It All Together

One danger of showing assembly language code in snippets is that you will have no
idea what a full assembly language program looks like. In this section, we derive
the MIPS code from two procedures written in C: one to swap array elements and
one to sort them.

Just In Time compiler
(JIT) The name
commonly given to a
compiler that operates at
runtime, translating the
interpreted code segments
into the native code of the
computer.


The Procedure swap
Let’s start with the code for the procedure swap in Figure 2.24. This procedure
simply swaps two locations in memory. When translating from C to assembly
language by hand, we follow these general steps:

1. Allocate registers to program variables.

2. Produce code for the body of the procedure.

3. Preserve registers across the procedure invocation.

This section describes the swap procedure in these three pieces, concluding by
putting all the pieces together.

Register Allocation for swap
As mentioned on pages 98–99, the MIPS convention on parameter passing is to
use registers $a0, $a1, $a2, and $a3. Since swap has just two parameters, v and
k, they will be found in registers $a0 and $a1. The only other variable is temp,
which we associate with register $t0 since swap is a leaf procedure (see page 100).
This register allocation corresponds to the variable declarations in the first part of
the swap procedure in Figure 2.24.

Code for the Body of the Procedure swap
The remaining lines of C code in swap are

temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;

void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

FIGURE 2.24 A C procedure that swaps two locations in memory. This subsection uses this
procedure in a sorting example.

Recall that the memory address for MIPS refers to the byte address, and so
words are really 4 bytes apart. Hence we need to multiply the index k by 4 before
adding it to the address. Forgetting that sequential word addresses differ by 4 instead
of by 1 is a common mistake in assembly language programming. Hence the first step
is to get the address of v[k] by multiplying k by 4 via a shift left by 2:

sll $t1, $a1,2 # reg $t1 = k * 4
add $t1, $a0,$t1 # reg $t1 = v + (k * 4)
# reg $t1 has the address of v[k]

Now we load v[k] using $t1, and then v[k+1] by adding 4 to $t1:

lw $t0, 0($t1) # reg $t0 (temp) = v[k]
lw $t2, 4($t1) # reg $t2 = v[k + 1]
# refers to next element of v

Next we store $t0 and $t2 to the swapped addresses:

sw $t2, 0($t1) # v[k] = reg $t2
sw $t0, 4($t1) # v[k+1] = reg $t0 (temp)

Now we have allocated registers and written the code to perform the operations
of the procedure. What is missing is the code for preserving the saved registers
used within swap. Since we are not using saved registers in this leaf procedure,
there is nothing to preserve.

The Full swap Procedure
We are now ready for the whole routine, which includes the procedure label and
the return jump. To make it easier to follow, we identify in Figure 2.25 each block
of code with its purpose in the procedure.

Procedure body

swap: sll $t1, $a1, 2    # reg $t1 = k * 4
      add $t1, $a0, $t1  # reg $t1 = v + (k * 4)
                         # reg $t1 has the address of v[k]
      lw  $t0, 0($t1)    # reg $t0 (temp) = v[k]
      lw  $t2, 4($t1)    # reg $t2 = v[k + 1]
                         # refers to next element of v
      sw  $t2, 0($t1)    # v[k] = reg $t2
      sw  $t0, 4($t1)    # v[k+1] = reg $t0 (temp)

Procedure return

      jr  $ra            # return to calling routine

FIGURE 2.25 MIPS assembly code of the procedure swap in Figure 2.24.
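The byte-address arithmetic in this routine can be mirrored step by step in C. The sketch below is an illustration only: the function name `swap_by_address` is hypothetical, and it assumes 4-byte elements (hence `int32_t`) so that shifting the index left by 2 matches the sll in the MIPS code. Ordinary C indexing does this scaling for you; the explicit byte pointer is used here just to expose it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap v[k] and v[k+1], computing the element address the way the MIPS code does. */
static void swap_by_address(int32_t v[], int k) {
    assert(k >= 0);
    unsigned char *base = (unsigned char *)v;             /* $a0: byte address of v */
    int32_t *p = (int32_t *)(base + ((size_t)k << 2));    /* sll, add: v + (k * 4) */
    int32_t t0 = p[0];                                    /* lw $t0, 0($t1) */
    int32_t t2 = p[1];                                    /* lw $t2, 4($t1) */
    p[0] = t2;                                            /* sw $t2, 0($t1) */
    p[1] = t0;                                            /* sw $t0, 4($t1) */
}
```

Dropping the shift and adding k directly would address a location k bytes past the start of the array instead of k words, which is the common mistake the text warns about.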

The Procedure sort
To ensure that you appreciate the rigor of programming in assembly language, we’ll
try a second, longer example. In this case, we’ll build a routine that calls the swap
procedure. This program sorts an array of integers, using bubble or exchange sort,
which is one of the simplest if not the fastest sorts. Figure 2.26 shows the C version
of the program. Once again, we present this procedure in several steps, concluding
with the full procedure.

void sort (int v[], int n)
{
    int i, j;
    for (i = 0; i < n; i += 1) {
        for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
            swap(v,j);
        }
    }
}

FIGURE 2.26 A C procedure that performs a sort on the array v.

Register Allocation for sort
The two parameters of the procedure sort, v and n, are in the parameter registers
$a0 and $a1, and we assign register $s0 to i and register $s1 to j.

Code for the Body of the Procedure sort
The procedure body consists of two nested for loops and a call to swap that includes
parameters. Let’s unwrap the code from the outside to the middle.

The first translation step is the first for loop:

for (i = 0; i < n; i += 1) {

[…]

The second for loop looks like this in C:

for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {

The initialization portion of this loop is again one instruction:

addi $s1, $s0, -1 # j = i - 1

The decrement of j at the end of the loop is also one instruction:

addi $s1, $s1, -1 # j -= 1

The loop test has two parts. We exit the loop if either condition fails, so the first
test must exit the loop if it fails (j < 0):

for2tst: slti $t0, $s1, 0       # reg $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)

This branch will skip over the second condition test. If it doesn’t skip, j ≥ 0.

The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤
v[j + 1]. First we create the address by multiplying j by 4 (since we need a byte
address) and add it to the base address of v:

sll $t1, $s1, 2 # reg $t1 = j * 4
add $t2, $a0, $t1 # reg $t2 = v + (j * 4)

Now we load v[j]:

lw $t3, 0($t2) # reg $t3 = v[j]

Since we know that the second element is just the following word, we add 4 to
the address in register $t2 to get v[j + 1]:

lw $t4, 4($t2) # reg $t4 = v[j + 1]

The test of v[j] ≤ v[j + 1] is the same as v[j + 1] ≥ v[j], so the
two instructions of the exit test are

slt $t0, $t4, $t3 # reg $t0 = 0 if $t4 ≥ $t3
beq $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3

The bottom of the loop jumps back to the inner loop test:

j for2tst # jump to test of inner loop

Combining the pieces, the skeleton of the second for loop looks like this:

         addi $s1, $s0, -1       # j = i - 1
for2tst: slti $t0, $s1, 0        # reg $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2  # go to exit2 if $s1 < 0 (j < 0)
         sll  $t1, $s1, 2        # reg $t1 = j * 4
         add  $t2, $a0, $t1      # reg $t2 = v + (j * 4)
         lw   $t3, 0($t2)        # reg $t3 = v[j]
         lw   $t4, 4($t2)        # reg $t4 = v[j + 1]
         slt  $t0, $t4, $t3      # reg $t0 = 0 if $t4 ≥ $t3
         beq  $t0, $zero, exit2  # go to exit2 if $t4 ≥ $t3
         . . .
         (body of second for loop)
         . . .
         addi $s1, $s1, -1       # j -= 1
         j    for2tst            # jump to test of inner loop
exit2:

The Procedure Call in sort
The next step is the body of the second for loop:

swap(v,j);

Calling swap is easy enough:

jal swap

Passing Parameters in sort
The problem comes when we want to pass parameters because the sort procedure
needs the values in registers $a0 and $a1, yet the swap procedure needs to have its
parameters placed in those same registers. One solution is to copy the parameters
for sort into other registers earlier in the procedure, making registers $a0 and $a1
available for the call of swap. (This copy is faster than saving and restoring on the
stack.) We first copy $a0 and $a1 into $s2 and $s3 during the procedure:

move $s2, $a0 # copy parameter $a0 into $s2
move $s3, $a1 # copy parameter $a1 into $s3

Then we pass the parameters to swap with these two instructions:

move $a0, $s2 # first swap parameter is v
move $a1, $s1 # second swap parameter is j

Preserving Registers in sort
The only remaining code is the saving and restoring of registers. Clearly, we must
save the return address in register $ra, since sort is a procedure and is called itself.
The sort procedure also uses the saved registers $s0, $s1, $s2, and $s3, so they
must be saved.
The prologue of the sort procedure is then

addi $sp,$sp,-20 # make room on stack for 5 registers
sw $ra,16($sp)   # save $ra on stack
sw $s3,12($sp)   # save $s3 on stack
sw $s2, 8($sp)   # save $s2 on stack
sw $s1, 4($sp)   # save $s1 on stack
sw $s0, 0($sp)   # save $s0 on stack

The tail of the procedure simply reverses all these instructions, then adds a jr to
return.

The Full Procedure sort
Now we put all the pieces together in Figure 2.27, being careful to replace references
to registers $a0 and $a1 in the for loops with references to registers $s2 and $s3.
Once again, to make the code easier to follow, we identify each block of code with
its purpose in the procedure. In this example, nine lines of the sort procedure in
C became 35 lines in the MIPS assembly language.

Elaboration: One optimization that works with this example is procedure inlining.
Instead of passing arguments in parameters and invoking the code with a jal
instruction, the compiler would copy the code from the body of the swap procedure
where the call to swap appears in the code. Inlining would avoid four instructions in
this example. The downside of the inlining optimization is that the compiled code
would be bigger if the inlined procedure is called from several locations. Such a code
expansion might turn into lower performance if it increased the cache miss rate; see
Chapter 5.
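
To make the inlining idea concrete, here is a minimal sketch of what the source would look like if the substitution were done by hand. It assumes that swap(v, k) exchanges v[k] and v[k + 1], as the loop test implies; the name sort_inlined is hypothetical.

```c
/* Sketch of procedure inlining done by hand: the body of swap replaces
 * the call, so no parameters are moved into $a0/$a1 and no jal/jr pair
 * is executed. sort_inlined is an illustrative name only. */
void sort_inlined(int v[], int n)
{
    int i, j;
    for (i = 0; i < n; i += 1) {
        for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
            int temp = v[j];      /* body of swap(v, j), copied in place */
            v[j] = v[j + 1];
            v[j + 1] = temp;
        }
    }
}
```

A compiler applying this optimization makes the same substitution on its internal representation rather than on the source text, and the trade-off described above (larger code when there are many call sites) applies either way.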
Saving registers
sort:    addi $sp, $sp, -20      # make room on stack for 5 registers
         sw   $ra, 16($sp)       # save $ra on stack
         sw   $s3, 12($sp)       # save $s3 on stack
         sw   $s2, 8($sp)        # save $s2 on stack
         sw   $s1, 4($sp)        # save $s1 on stack
         sw   $s0, 0($sp)        # save $s0 on stack

Procedure body

Move parameters
         move $s2, $a0           # copy parameter $a0 into $s2 (save $a0)
         move $s3, $a1           # copy parameter $a1 into $s3 (save $a1)

Outer loop
         move $s0, $zero         # i = 0
for1tst: slt  $t0, $s0, $s3      # reg $t0 = 0 if $s0 ≥ $s3 (i ≥ n)
         beq  $t0, $zero, exit1  # go to exit1 if $s0 ≥ $s3 (i ≥ n)

Inner loop
         addi $s1, $s0, -1       # j = i - 1
for2tst: slti $t0, $s1, 0        # reg $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2  # go to exit2 if $s1 < 0 (j < 0)
         sll  $t1, $s1, 2        # reg $t1 = j * 4
         add  $t2, $s2, $t1      # reg $t2 = v + (j * 4)
         lw   $t3, 0($t2)        # reg $t3 = v[j]
         lw   $t4, 4($t2)        # reg $t4 = v[j + 1]
         slt  $t0, $t4, $t3      # reg $t0 = 0 if $t4 ≥ $t3
         beq  $t0, $zero, exit2  # go to exit2 if $t4 ≥ $t3

Pass parameters and call
         move $a0, $s2           # 1st parameter of swap is v (old $a0)
         move $a1, $s1           # 2nd parameter of swap is j
         jal  swap               # swap code shown in Figure 2.25

Inner loop
         addi $s1, $s1, -1       # j -= 1
         j    for2tst            # jump to test of inner loop

Outer loop
exit2:   addi $s0, $s0, 1        # i += 1
         j    for1tst            # jump to test of outer loop

Restoring registers
exit1:   lw   $s0, 0($sp)        # restore $s0 from stack
         lw   $s1, 4($sp)        # restore $s1 from stack
         lw   $s2, 8($sp)        # restore $s2 from stack
         lw   $s3, 12($sp)       # restore $s3 from stack
         lw   $ra, 16($sp)       # restore $ra from stack
         addi $sp, $sp, 20       # restore stack pointer

Procedure return
         jr   $ra                # return to calling routine

FIGURE 2.27 MIPS assembly version of procedure sort in Figure 2.26.

Figure 2.28 shows the impact of compiler optimization on sort program
performance, compile time, clock cycles, instruction count, and CPI.
Note that unoptimized code has the best CPI, and O1 optimization has the lowest
instruction count, but O3 is the fastest, reminding us that time is the only accurate
measure of program performance.

Figure 2.29 compares the impact of programming languages, compilation versus
interpretation, and algorithms on performance of sorts. The fourth column shows
that the unoptimized C program is 8.3 times faster than the interpreted Java code
for Bubble Sort. Using the JIT compiler makes Java 2.1 times faster than the
unoptimized C and within a factor of 1.13 of the highest optimized C code.
(Section 2.15 gives more details on interpretation versus compilation of Java and
the Java and MIPS code for Bubble Sort.) The ratios aren't as close for Quicksort
in Column 5, presumably because it is harder to amortize the cost of runtime
compilation over the shorter execution time. The last column demonstrates the
impact of a better algorithm, offering a performance increase of three orders of
magnitude when sorting 100,000 items. Even comparing interpreted Java in
Column 5 to the C compiler at highest optimization in Column 4, Quicksort beats
Bubble Sort by a factor of 50 (0.05 × 2468, or 123 times faster than the unoptimized
C code versus 2.41 times faster).

Elaboration: The MIPS compilers always save room on the stack for the arguments
in case they need to be stored, so in reality they always decrement $sp by 16 to
make room for all four argument registers (16 bytes). One reason is that C provides
a vararg option that allows a pointer to pick, say, the third argument to a
procedure. When the compiler encounters the rare vararg, it copies the four
argument registers onto the stack into the four reserved locations.
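
For readers who have not met the vararg facility, the following is a small sketch using the standard <stdarg.h> macros. The function name pick_arg is hypothetical, and how va_list is represented is up to each compiler and calling convention; the point is only that the callee walks its arguments one after another, which is simple when they sit in consecutive stack locations.

```c
#include <stdarg.h>

/* Sketch: return the k-th (1-based) int among the variable arguments.
 * The caller must pass at least k int arguments after k.
 * pick_arg is an illustrative name, not a library function. */
int pick_arg(int k, ...)
{
    va_list ap;                     /* the "pointer" that steps through arguments */
    int value = 0;
    int i;

    va_start(ap, k);                /* start just after the last fixed parameter */
    for (i = 0; i < k; i += 1)
        value = va_arg(ap, int);    /* fetch the next int argument */
    va_end(ap);
    return value;
}
```

A call such as pick_arg(3, 10, 20, 30) steps past the first two variable arguments and returns the third.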
Understanding Program Performance

gcc optimization             Relative      Clock cycles  Instruction count  CPI
                             performance   (millions)    (millions)
None                         1.00          158,615       114,938            1.38
O1 (medium)                  2.37           66,990        37,470            1.79
O2 (full)                    2.38           66,521        39,993            1.66
O3 (procedure integration)   2.41           65,747        44,993            1.46

FIGURE 2.28 Comparing performance, instruction count, and CPI using compiler
optimization for Bubble Sort. The programs sorted 100,000 words with the array
initialized to random values. These programs were run on a Pentium 4 with a clock rate of
3.06 GHz and a 533 MHz system bus with 2 GB of PC2100 DDR SDRAM. It used Linux
version 2.4.20.

2.14 Arrays versus Pointers

A challenge for any new C programmer is understanding pointers. Comparing
assembly code that uses arrays and array indices to the assembly code that uses
pointers offers insights about pointers. This section shows C and MIPS assembly
versions of two procedures to clear a sequence of words in memory: one using
array indices and one using pointers. Figure 2.30 shows the two C procedures.

The purpose of this section is to show how pointers map into MIPS instructions,
and not to endorse a dated programming style. We'll see the impact of modern
compiler optimization on these two procedures at the end of the section.

Array Version of Clear
Let's start with the array version, clear1, focusing on the body of the loop and
ignoring the procedure linkage code. We assume that the two parameters array
and size are found in the registers $a0 and $a1, and that i is allocated to register
$t0.

The initialization of i, the first part of the for loop, is straightforward:

move $t0,$zero # i = 0 (register $t0 = 0)

To set array[i] to 0 we must first get its address.
Start by multiplying i by 4 to get the byte address:

loop1: sll $t1,$t0,2 # $t1 = i * 4

Since the starting address of the array is in a register, we must add it to the index
to get the address of array[i] using an add instruction:

add $t2,$a0,$t1 # $t2 = address of array[i]

Language  Execution     Optimization  Bubble Sort  Quicksort    Speedup
          method                      relative     relative     Quicksort vs.
                                      performance  performance  Bubble Sort
C         Compiler      None          1.00         1.00         2468
          Compiler      O1            2.37         1.50         1562
          Compiler      O2            2.38         1.50         1555
          Compiler      O3            2.41         1.91         1955
Java      Interpreter   –             0.12         0.05         1050
          JIT compiler  –             2.13         0.29          338

FIGURE 2.29 Performance of two sort algorithms in C and Java using interpretation and
optimizing compilers relative to unoptimized C version. The last column shows the
advantage in performance of Quicksort over Bubble Sort for each language and execution
option. These programs were run on the same system as in Figure 2.28. The JVM is Sun
version 1.3.1, and the JIT is Sun Hotspot version 1.3.1.

Finally, we can store 0 in that address:

sw $zero, 0($t2) # array[i] = 0

This instruction is the end of the body of the loop, so the next step is to increment i:

addi $t0,$t0,1 # i = i + 1

The loop test checks if i is less than size:

slt $t3,$t0,$a1 # $t3 = (i < size)
bne $t3,$zero,loop1 # if (i < size) go to loop1

We have now seen all the pieces of the procedure. Here is the MIPS code for
clearing an array using indices:

       move $t0,$zero      # i = 0
loop1: sll $t1,$t0,2       # $t1 = i * 4
       add $t2,$a0,$t1     # $t2 = address of array[i]
       sw $zero, 0($t2)    # array[i] = 0
       addi $t0,$t0,1      # i = i + 1
       slt $t3,$t0,$a1     # $t3 = (i < size)
       bne $t3,$zero,loop1 # if (i < size) go to loop1

(This code works as long as size is greater than 0; ANSI C requires a test of size
before the loop, but we'll skip that legality here.)
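
The caveat about size can be seen by writing the assembly loop back in C. Because the slt/bne test sits at the bottom, the MIPS code behaves like a do-while loop, which always executes its body once. The sketch below (the name clear1_guarded is hypothetical) adds the initial test that a conforming compiler would emit, so that it matches the for loop of clear1 even when size is 0 or negative.

```c
/* Sketch: the index-based MIPS loop expressed in C. The inner do-while
 * is the loop as assembled above; the surrounding if is the test of size
 * that the text skips. clear1_guarded is an illustrative name only. */
void clear1_guarded(int array[], int size)
{
    int i = 0;                  /* move $t0,$zero */
    if (i < size) {             /* extra test before entering the loop */
        do {
            array[i] = 0;       /* sll, add, sw */
            i = i + 1;          /* addi */
        } while (i < size);     /* slt, bne */
    }
}
```

Without the if, a call with size equal to 0 would still store one word, writing to array[0] even though the caller asked for nothing to be cleared.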
void clear1(int array[], int size)
{
  int i;
  for (i = 0; i < size; i += 1)
    array[i] = 0;
}

void clear2(int *array, int size)
{
  int *p;
  for (p = &array[0]; p < &array[size]; p = p + 1)
    *p = 0;
}

FIGURE 2.30 Two C procedures for setting an array to all zeros. Clear1 uses indices,
while clear2 uses pointers. The second procedure needs some explanation for those
unfamiliar with C. The address of a variable is indicated by &, and the object pointed to by a
pointer is indicated by *. The declarations declare that array and p are pointers to integers.
The first part of the for loop in clear2 assigns the address of the first element of array to the
pointer p. The second part of the for loop tests to see if the pointer is pointing beyond the last
element of array. Incrementing a pointer by one, in the last part of the for loop, means
moving the pointer to the next sequential object of its declared size. Since p is a pointer to
integers, the compiler will generate MIPS instructions to increment p by four, the number of
bytes in a MIPS integer. The assignment in the loop places 0 in the object pointed to by p.

Pointer Version of Clear
The second procedure that uses pointers allocates the two parameters array and
size to the registers $a0 and $a1 and allocates p to register $t0. The code for the
second procedure starts with assigning the pointer p to the address of the first
element of the array:

move $t0,$a0 # p = address of array[0]

The next code is the body of the for loop, which simply stores 0 into the word that
p points to:

loop2: sw $zero,0($t0) # Memory[p] = 0

This instruction implements the body of the loop, so the next code is the iteration
increment, which changes p to point to the next word:

addi $t0,$t0,4 # p = p + 4

Incrementing a pointer by 1 means moving the pointer to the next sequential
object in C. Since p is a pointer to integers, each of which uses 4 bytes, the compiler
increments p by 4.

The loop test is next.
The first step is calculating the address just past the last element of array. Start
with multiplying size by 4 to get its byte offset:

sll $t1,$a1,2 # $t1 = size * 4

and then we add the product to the starting address of the array to get the address
of the first word after the array:

add $t2,$a0,$t1 # $t2 = address of array[size]

The loop test is simply to see if p is less than the address of the first word after
the array:

slt $t3,$t0,$t2 # $t3 = (p<&array[size])
bne $t3,$zero,loop2 # if (p<&array[size]) go to loop2

With all the pieces completed, we can show a pointer version of the code to zero
an array:

       move $t0,$a0        # p = address of array[0]
loop2: sw $zero,0($t0)     # Memory[p] = 0
       addi $t0,$t0,4      # p = p + 4
       sll $t1,$a1,2       # $t1 = size * 4
       add $t2,$a0,$t1     # $t2 = address of array[size]
       slt $t3,$t0,$t2     # $t3 = (p<&array[size])
       bne $t3,$zero,loop2 # if (p<&array[size]) go to loop2

As in the first example, this code assumes size is greater than 0.

Note that this program calculates the address of the end of the array in every
iteration of the loop, even though it does not change. A faster version of the code
moves this calculation outside the loop:

       move $t0,$a0        # p = address of array[0]
       sll $t1,$a1,2       # $t1 = size * 4
       add $t2,$a0,$t1     # $t2 = address of array[size]
loop2: sw $zero,0($t0)     # Memory[p] = 0
       addi $t0,$t0,4      # p = p + 4
       slt $t3,$t0,$t2     # $t3 = (p<&array[size])
       bne $t3,$zero,loop2 # if (p<&array[size]) go to loop2

Comparing the Two Versions of Clear
Comparing the two code sequences side by side illustrates the difference between
array indices and pointers (the changes introduced by the pointer version are
highlighted):

Left: array-index version
       move $t0,$zero      # i = 0
loop1: sll $t1,$t0,2       # $t1 = i * 4
       add $t2,$a0,$t1     # $t2 = &array[i]
       sw $zero, 0($t2)    # array[i] = 0
       addi $t0,$t0,1      # i = i + 1
       slt $t3,$t0,$a1     # $t3 = (i < size)
       bne $t3,$zero,loop1 # if (i < size) go to loop1

Right: pointer version
       move $t0,$a0        # p = &array[0]
       sll $t1,$a1,2       # $t1 = size * 4
       add $t2,$a0,$t1     # $t2 = &array[size]
loop2: sw $zero,0($t0)     # Memory[p] = 0
       addi $t0,$t0,4      # p = p + 4
       slt $t3,$t0,$t2     # $t3 = (p<&array[size])
       bne $t3,$zero,loop2 # if (p<&array[size]) go to loop2

The version on the left must have the "multiply" and add inside the loop because
i is incremented and each address must be recalculated from the new index. The
memory pointer version on the right increments the pointer p directly. The pointer
version moves the scaling shift and the array bound addition outside the loop,
thereby reducing the instructions executed per iteration from 6 to 4. This manual
optimization corresponds to the compiler optimization of strength reduction
(shift instead of multiply) and induction variable elimination (eliminating array
address calculations within loops). Section 2.15 describes these two and many
other optimizations.

Elaboration: As mentioned earlier, a C compiler would add a test to be sure that size is
greater than 0. One way would be to add a jump just before the first instruction of the loop
to the slt instruction.

Understanding Program Performance
People used to be taught to use pointers in C to get greater efficiency than that
available with arrays: "Use pointers, even if you can't understand the code."
Modern optimizing compilers can produce code for the array version that is just
as good. Most programmers today prefer that the compiler do the heavy lifting.

2.15 Advanced Material: Compiling C and Interpreting Java

This section gives a brief overview of how the C compiler works and how Java
is executed. Because the compiler will significantly affect the performance of a
computer, understanding compiler technology today is critical to understanding
performance. Keep in mind that the subject of compiler construction is usually
taught in a one- or two-semester course, so our introduction will necessarily only
touch on the basics.

The second part of this section is for readers interested in seeing how an object
oriented language like Java executes on a MIPS architecture. It shows the Java
byte-codes used for interpretation and the MIPS code for the Java version of some
of the C segments in prior sections, including Bubble Sort. It covers both the Java
Virtual Machine and JIT compilers.

object oriented language: A programming language that is oriented around
objects rather than actions, or data versus logic.

The rest of Section 2.15 can be found online.

2.16 Real Stuff: ARMv7 (32-bit) Instructions

ARM is the most popular instruction set architecture for embedded devices, with
more than 9 billion devices in 2011 using ARM, and recent growth has been
2 billion per year. Standing originally for the Acorn RISC Machine, later changed
to Advanced RISC Machine, ARM came out the same year as MIPS and followed
similar philosophies. Figure 2.31 lists the similarities. The principal difference is
that MIPS has more registers and ARM has more addressing modes.

There is a similar core of instruction sets for arithmetic-logical and data transfer
instructions for MIPS and ARM, as Figure 2.32 shows.

                                         ARM               MIPS
Date announced                           1985              1985
Instruction size (bits)                  32                32
Address space (size, model)              32 bits, flat     32 bits, flat
Data alignment                           Aligned           Aligned
Data addressing modes                    9                 3
Integer registers (number, model, size)  15 GPR × 32 bits  31 GPR × 32 bits
I/O                                      Memory mapped     Memory mapped

FIGURE 2.31 Similarities in ARM and MIPS instruction sets.

                   Instruction name               ARM                 MIPS
Register-register  Add                            add                 addu, addiu
                   Add (trap if overflow)         adds; swivs         add
                   Subtract                       sub                 subu
                   Subtract (trap if overflow)    subs; swivs         sub
                   Multiply                       mul                 mult, multu
                   Divide                         —                   div, divu
                   And                            and                 and
                   Or                             orr                 or
                   Xor                            eor                 xor
                   Load high part register        —                   lui
                   Shift left logical             lsl¹                sllv, sll
                   Shift right logical            lsr¹                srlv, srl
                   Shift right arithmetic         asr¹                srav, sra
                   Compare                        cmp, cmn, tst, teq  slt/i, slt/iu
Data transfer      Load byte signed               ldrsb               lb
                   Load byte unsigned             ldrb                lbu
                   Load halfword signed           ldrsh               lh
                   Load halfword unsigned         ldrh                lhu
                   Load word                      ldr                 lw
                   Store byte                     strb                sb
                   Store halfword                 strh                sh
                   Store word                     str                 sw
                   Read, write special registers  mrs, msr            move
                   Atomic Exchange                swp, swpb           ll;sc

FIGURE 2.32 ARM register-register and data transfer instructions equivalent to MIPS
core. Dashes mean the operation is not available in that architecture or not synthesized in a
few instructions. If there are several choices of instructions equivalent to the MIPS core, they
are separated by commas. ARM includes shifts as part of every data operation instruction, so
the shifts with superscript 1 are just a variation of a move instruction, such as lsr¹. Note that
ARM has no divide instruction.

Addressing Modes
Figure 2.33 shows the data addressing modes supported by ARM. Unlike MIPS,
ARM does not reserve a register to contain 0. Although MIPS has just three simple
data addressing modes (see Figure 2.18), ARM has nine, including fairly complex
calculations. For example, ARM has an addressing mode that can shift one register
by any amount, add it to the other registers to form the address, and then update
one register with this new address.

Addressing mode                              ARM  MIPS
Register operand                             X    X
Immediate operand                            X    X
Register + offset (displacement or based)    X    X
Register + register (indexed)                X    —
Register + scaled register (scaled)          X    —
Register + offset and update register        X    —
Register + register and update register      X    —
Autoincrement, autodecrement                 X    —
PC-relative data                             X    —

FIGURE 2.33 Summary of data addressing modes. ARM has separate register indirect
and register + offset addressing modes, rather than just putting 0 in the offset of the latter
mode. To get greater addressing range, ARM shifts the offset left 1 or 2 bits if the data size is
halfword or word.
Compare and Conditional Branch
MIPS uses the contents of registers to evaluate conditional branches. ARM uses the
traditional four condition code bits stored in the program status word: negative,
zero, carry, and overflow. They can be set on any arithmetic or logical instruction;
unlike earlier architectures, this setting is optional on each instruction. An explicit
option leads to fewer problems in a pipelined implementation. ARM uses
conditional branches to test condition codes to determine all possible unsigned
and signed relations.

CMP subtracts one operand from the other and the difference sets the condition
codes. Compare negative (CMN) adds one operand to the other, and the sum sets
the condition codes. TST performs logical AND on the two operands to set all
condition codes but overflow, while TEQ uses exclusive OR to set the first three
condition codes.

One unusual feature of ARM is that every instruction has the option of executing
conditionally, depending on the condition codes. Every instruction starts with a
4-bit field that determines whether it will act as a no operation instruction (nop)
or as a real instruction, depending on the condition codes. Hence, conditional
branches are properly considered as conditionally executing the unconditional
branch instruction. Conditional execution allows avoiding a branch to jump over
a single instruction. It takes less code space and time to simply conditionally
execute one instruction.

Figure 2.34 shows the instruction formats for ARM and MIPS. The principal
differences are the 4-bit conditional execution field in every instruction and the
smaller register field, because ARM has half the number of registers.

Unique Features of ARM
Figure 2.35 shows a few arithmetic-logical instructions not found in MIPS.
Since ARM does not have a dedicated register for 0, it has separate opcodes to
perform some operations that MIPS can do with $zero. In addition, ARM has
support for multiword arithmetic.

ARM's 12-bit immediate field has a novel interpretation. The eight
least-significant bits are zero-extended to a 32-bit value, then rotated right the
number of bits specified in the first four bits of the field multiplied by two. One
advantage is that this scheme can represent all powers of two in a 32-bit word.
Whether this split actually catches more immediates than a simple 12-bit field
would be an interesting study.

Operand shifting is not limited to immediates. The second register of all
arithmetic and logical processing operations has the option of being shifted before
being operated on. The shift options are shift left logical, shift right logical, shift
right arithmetic, and rotate right.

[Figure 2.34: diagram of the ARM and MIPS instruction formats for register-register,
data transfer, branch, and jump/call instructions, with fields marked as opcode,
register, or constant.]

FIGURE 2.34 Instruction formats, ARM and MIPS. The differences result from whether
the architecture has 16 or 32 registers.

ARM also has instructions to save groups of registers, called block loads and
stores. Under control of a 16-bit mask within the instructions, any of the 16 registers
can be loaded or stored into memory in a single instruction. These instructions can
save and restore registers on procedure entry and return. These instructions can
also be used for block memory copy, and today block copies are the most important
use of such instructions.
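
The 12-bit immediate interpretation described above can be sketched in a few lines of C. This is only an illustration of the rule as stated in the text (an 8-bit value rotated right by twice the 4-bit count); the function name arm_decode_imm12 is hypothetical and the sketch does not model any other part of the instruction encoding.

```c
#include <stdint.h>

/* Sketch: expand a 12-bit ARM immediate field into its 32-bit value.
 * Bits 7..0 hold the value; bits 11..8 hold a rotate count that is
 * doubled before use. arm_decode_imm12 is an illustrative name only. */
uint32_t arm_decode_imm12(uint32_t imm12)
{
    uint32_t value = imm12 & 0xFFu;               /* zero-extended low 8 bits */
    uint32_t rot = ((imm12 >> 8) & 0xFu) * 2u;    /* rotate amount: 0, 2, ..., 30 */

    if (rot == 0)
        return value;                             /* avoid a shift by 32 bits */
    return (value >> rot) | (value << (32u - rot));   /* rotate right */
}
```

For example, a field of 0x401 has rotate count 4 (so 8 bit positions) and value 1, which decodes to 0x01000000; every power of two can be reached this way by choosing a value of 1 or 2 and an even rotation.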
2.17 Real Stuff: x86 Instructions

Beauty is altogether in the eye of the beholder.
Margaret Wolfe Hungerford, Molly Bawn, 1877

Designers of instruction sets sometimes provide more powerful operations than
those found in ARM and MIPS. The goal is generally to reduce the number of
instructions executed by a program. The danger is that this reduction can occur at
the cost of simplicity, increasing the time a program takes to execute because the
instructions are slower. This slowness may be the result of a slower clock cycle time
or of requiring more clock cycles than a simpler sequence.

The path toward operation complexity is thus fraught with peril. Section 2.19
demonstrates the pitfalls of complexity.

Evolution of the Intel x86
ARM and MIPS were the vision of single small groups in 1985; the pieces of these
architectures fit nicely together, and the whole architecture can be described
succinctly. Such is not the case for the x86; it is the product of several independent
groups who evolved the architecture over 35 years, adding new features to the
original instruction set as someone might add clothing to a packed bag. Here are
important x86 milestones.

Name                               Definition                                 ARM       MIPS
Load immediate                     Rd = Imm                                   mov       addi $0,
Not                                Rd = ~(Rs1)                                mvn       nor $0,
Move                               Rd = Rs1                                   mov       or $0,
Rotate right                       Rd = Rs >> i, Rd[0..i-1] = Rs[31-i..31]    ror
And not                            Rd = Rs1 & ~(Rs2)                          bic
Reverse subtract                   Rd = Rs2 - Rs1                             rsb, rsc
Support for multiword integer add  CarryOut, Rd = Rd + Rs1 + OldCarryOut      adcs      —
Support for multiword integer sub  CarryOut, Rd = Rd - Rs1 + OldCarryOut      sbcs      —

FIGURE 2.35 ARM arithmetic/logical instructions not found in MIPS.
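
The multiword add row of Figure 2.35 is easier to appreciate with a worked case. The sketch below adds two 64-bit integers held as pairs of 32-bit words, carrying from the low word into the high word, which is the pattern an add that records its carry followed by an add-with-carry supports. The function name add64_words and the word order (index 0 is the low word) are assumptions made for the illustration.

```c
#include <stdint.h>

/* Sketch: rd = rd + rs1 for 64-bit values stored as two 32-bit words,
 * low word at index 0. add64_words is an illustrative name only. */
void add64_words(uint32_t rd[2], const uint32_t rs1[2])
{
    uint32_t low = rd[0] + rs1[0];              /* low add; wraps modulo 2^32 */
    uint32_t carry = (low < rd[0]) ? 1u : 0u;   /* CarryOut of the low add */

    rd[0] = low;
    rd[1] = rd[1] + rs1[1] + carry;             /* Rd = Rd + Rs1 + OldCarryOut */
}
```

On an architecture without a carry flag, the comparison that recovers the carry costs extra instructions, which is the gap the figure is pointing out.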

■ 1978: The Intel 8086 architecture was announced as an assembly
language-compatible extension of the then-successful Intel 8080, an 8-bit
microprocessor. The 8086 is a 16-bit architecture, with all internal registers
16 bits wide. Unlike MIPS, the registers have dedicated uses, and hence the
8086 is not considered a general-purpose register architecture.

■ 1980: The Intel 8087 floating-point coprocessor is announced. This
architecture extends the 8086 with about 60 floating-point instructions. Instead
of using registers, it relies on a stack (see Section 2.21 and Section 3.7).

■ 1982: The 80286 extended the 8086 architecture by increasing the address
space to 24 bits, by creating an elaborate memory-mapping and protection
model (see Chapter 5), and by adding a few instructions to round out the
instruction set and to manipulate the protection model.

■ 1985: The 80386 extended the 80286 architecture to 32 bits. In addition to
a 32-bit architecture with 32-bit registers and a 32-bit address space, the
80386 added new addressing modes and additional operations. The added
instructions make the 80386 nearly a general-purpose register machine. The
80386 also added paging support in addition to segmented addressing (see
Chapter 5). Like the 80286, the 80386 has a mode to execute 8086 programs
without change.

■ 1989–95: The subsequent 80486 in 1989, Pentium in 1992, and Pentium
Pro in 1995 were aimed at higher performance, with only four instructions
added to the user-visible instruction set: three to help with multiprocessing
(Chapter 6) and a conditional move instruction.

■ 1997: After the Pentium and Pentium Pro were shipping, Intel announced that
it would expand the Pentium and the Pentium Pro architectures with MMX
(Multi Media Extensions). This new set of 57 instructions uses the
floating-point stack to accelerate multimedia and communication applications.
MMX instructions typically operate on multiple short data elements at a time,
in the tradition of single instruction, multiple data (SIMD) architectures (see
Chapter 6). Pentium II did not introduce any new instructions.

■ 1999: Intel added another 70 instructions, labeled SSE (Streaming SIMD
Extensions) as part of Pentium III. The primary changes were to add eight
separate registers, double their width to 128 bits, and add a single precision
floating-point data type. Hence, four 32-bit floating-point operations can be
performed in parallel. To improve memory performance, SSE includes cache
prefetch instructions plus streaming store instructions that bypass the caches
and write directly to memory.

■ 2001: Intel added yet another 144 instructions, this time labeled SSE2. The
new data type is double precision arithmetic, which allows pairs of 64-bit
floating-point operations in parallel. Almost all of these 144 instructions are
versions of existing MMX and SSE instructions that operate on 64 bits of data
in parallel. Not only does this change enable more multimedia operations;
it gives the compiler a different target for floating-point operations than
the unique stack architecture. Compilers can choose to use the eight SSE
registers as floating-point registers like those found in other computers. This
change boosted the floating-point performance of the Pentium 4, the first
microprocessor to include SSE2 instructions.

general-purpose register (GPR): A register that can be used for addresses or for
data with virtually any instruction.

■ 2003: A company other than Intel enhanced the x86 architecture this time.
AMD announced a set of architectural extensions to increase the address
space from 32 to 64 bits. Similar to the transition from a 16- to 32-bit address
space in 1985 with the 80386, AMD64 widens all registers to 64 bits. It also
increases the number of registers to 16 and increases the number of 128-bit
SSE registers to 16. The primary ISA change comes from adding a new
mode called long mode that redefines the execution of all x86 instructions
with 64-bit addresses and data. To address the larger number of registers, it
adds a new prefix to instructions. Depending how you count, long mode also
adds four to ten new instructions and drops 27 old ones. PC-relative data
addressing is another extension. AMD64 still has a mode that is identical
to x86 (legacy mode) plus a mode that restricts user programs to x86 but
allows operating systems to use AMD64 (compatibility mode). These modes
allow a more graceful transition to 64-bit addressing than the HP/Intel IA-64
architecture.

■ 2004: Intel capitulates and embraces AMD64, relabeling it Extended Memory
64 Technology (EM64T). The major difference is that Intel added a 128-bit
atomic compare and swap instruction, which probably should have been
included in AMD64. At the same time, Intel announced another generation of
media extensions. SSE3 adds 13 instructions to support complex arithmetic,
graphics operations on arrays of structures, video encoding, floating-point
conversion, and thread synchronization (see Section 2.11). AMD added SSE3
in subsequent chips and the missing atomic swap instruction to AMD64 to
maintain binary compatibility with Intel.

■ 2006: Intel announces 54 new instructions as part of the SSE4 instruction set
extensions. These extensions perform tweaks like sum of absolute differences,
dot products for arrays of structures, sign or zero extension of narrow data to
wider sizes, population count, and so on. They also added support for virtual
machines (see Chapter 5).

■ 2007: AMD announces 170 instructions as part of SSE5, including 46
instructions of the base instruction set that adds three operand instructions
like MIPS.

■ 2011: Intel ships the Advanced Vector Extension that expands the SSE
register width from 128 to 256 bits, thereby redefining about 250 instructions
and adding 128 new instructions.

Th is history illustrates the impact of the “golden handcuff s” of compatibility on
the x86, as the existing soft ware base at each step was too important to jeopardize
with signifi cant architectural changes.

Whatever the artistic failures of the x86, keep in mind that this instruction set
largely drove the PC generation of computers and still dominates the cloud portion
of the PostPC Era. Manufacturing 350M x86 chips per year may seem small
compared to 9 billion ARMv7 chips, but many companies would love to control
such a market. Nevertheless, this checkered ancestry has led to an architecture that
is difficult to explain and impossible to love.

Brace yourself for what you are about to see! Do not try to read this section
with the care you would need to write x86 programs; the goal instead is to give you
familiarity with the strengths and weaknesses of the world’s most popular desktop
architecture.

Rather than show the entire 16-bit, 32-bit, and 64-bit instruction set, in this
section we concentrate on the 32-bit subset that originated with the 80386. We start
our explanation with the registers and addressing modes, move on to the integer
operations, and conclude with an examination of instruction encoding.

x86 Registers and Data Addressing Modes

The registers of the 80386 show the evolution of the instruction set (Figure 2.36).
The 80386 extended all 16-bit registers (except the segment registers) to 32 bits,
prefixing an E to their name to indicate the 32-bit version. We’ll refer to them
generically as GPRs (general-purpose registers). The 80386 contains only eight
GPRs. This means MIPS programs can use four times as many and ARMv7 twice
as many.

Figure 2.37 shows that the arithmetic, logical, and data transfer instructions are
two-operand instructions. There are two important differences here. The x86
arithmetic and logical instructions must have one operand act as both a source
and a destination; ARMv7 and MIPS allow separate registers for source and
destination. This restriction puts more pressure on the limited registers, since one
source register must be modified. The second important difference is that one of
the operands can be in memory. Thus, virtually any instruction may have one
operand in memory, unlike ARMv7 and MIPS.
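
The register pressure caused by two-operand instructions can be sketched in C. This is only an illustration under stated assumptions: the function names add3, add2, and sum_preserving_a are hypothetical, and C variables stand in for registers. It shows that a two-operand add overwrites one source, so keeping that source alive costs an extra copy.

```c
#include <stdint.h>

/* Three-operand style (MIPS, ARMv7): the destination is separate,
   so both sources survive the operation. */
uint32_t add3(uint32_t src1, uint32_t src2) {
    return src1 + src2;
}

/* Two-operand style (x86): the first operand is both a source and
   the destination, so its old value is lost. */
void add2(uint32_t *dst_and_src, uint32_t src) {
    *dst_and_src += src;
}

/* To compute a + b and still keep a, the two-operand style needs an
   extra copy first, which occupies one more register. */
uint32_t sum_preserving_a(uint32_t a, uint32_t b) {
    uint32_t tmp = a;   /* plays the role of "mov tmp, a" */
    add2(&tmp, b);      /* plays the role of "add tmp, b" */
    return tmp;
}
```

With only eight GPRs available, that extra temporary is a noticeable cost.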

Data memory-addressing modes, described in detail below, offer two sizes of
addresses within the instruction. These so-called displacements can be 8 bits or 32
bits.

Although a memory operand can use any addressing mode, there are restrictions
on which registers can be used in a mode. Figure 2.38 shows the x86 addressing
modes and which GPRs cannot be used with each mode, as well as how to get the
same effect using MIPS instructions.
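
The most general mode in Figure 2.38, base plus scaled index with displacement, can be written as a short C function. This is a minimal sketch: the name x86_effective_address and its parameter names are hypothetical, and the arithmetic is assumed to wrap modulo 2^32 as 32-bit address arithmetic does.

```c
#include <stdint.h>

/* Effective address = base + (2^scale * index) + displacement.
   scale must be 0, 1, 2, or 3, as in Figure 2.38.
   Passing scale = 0 and displacement = 0 gives the simpler modes. */
uint32_t x86_effective_address(uint32_t base, uint32_t index,
                               unsigned scale, int32_t displacement) {
    /* Shifting left by scale multiplies the index by 1, 2, 4, or 8. */
    return base + (index << scale) + (uint32_t)displacement;
}
```

For example, with a scale of 2 the index is multiplied by 4, which is the multiply and add that the MIPS equivalent has to perform with separate instructions before the load.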

Name      Use
EAX       GPR 0
ECX       GPR 1
EDX       GPR 2
EBX       GPR 3
ESP       GPR 4
EBP       GPR 5
ESI       GPR 6
EDI       GPR 7
CS        Code segment pointer
SS        Stack segment pointer (top of stack)
DS        Data segment pointer 0
ES        Data segment pointer 1
FS        Data segment pointer 2
GS        Data segment pointer 3
EIP       Instruction pointer (PC)
EFLAGS    Condition codes

FIGURE 2.36 The 80386 register set. Starting with the 80386, the top eight registers were extended
to 32 bits and could also be used as general-purpose registers.

Source/destination operand type    Second source operand
Register                           Register
Register                           Immediate
Register                           Memory
Memory                             Register
Memory                             Immediate

FIGURE 2.37 Instruction types for the arithmetic, logical, and data transfer instructions.
The x86 allows the combinations shown. The only restriction is the absence of a memory-memory mode.
Immediates may be 8, 16, or 32 bits in length; a register is any one of the 14 major registers in Figure 2.36
(not EIP or EFLAGS).

x86 Integer Operations
The 8086 provides support for both 8-bit (byte) and 16-bit (word) data types. The
80386 adds 32-bit addresses and data (double words) in the x86. (AMD64 adds
64-bit addresses and data, called quad words; we’ll stick to the 80386 in this section.)
The data type distinctions apply to register operations as well as memory accesses.

Almost every operation works on both 8-bit data and on one longer data size.
That size is determined by the mode and is either 16 bits or 32 bits.

Clearly, some programs want to operate on data of all three sizes, so the 80386
architects provided a convenient way to specify each version without expanding
code size significantly. They decided that either 16-bit or 32-bit data dominates
most programs, and so it made sense to be able to set a default large size. This
default data size is set by a bit in the code segment register. To override the default
data size, an 8-bit prefix is attached to the instruction to tell the machine to use the
other large size for this instruction.
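
The size selection just described can be summarized in a few lines of C. This is a sketch under stated assumptions, not Intel's definition: the function operand_size_bits and its parameters are hypothetical names, w_bit stands for the 1-bit w field described with Figure 2.41, and default_is_32 models the default-size bit.

```c
#include <stdbool.h>

/* Returns the operand width in bits for one instruction.
   w_bit:           0 selects byte data, 1 selects the "large" size.
   default_is_32:   models the default-size bit; true means 32-bit data.
   has_size_prefix: true when the size-override prefix is attached. */
int operand_size_bits(int w_bit, bool default_is_32, bool has_size_prefix) {
    if (w_bit == 0) {
        return 8;                       /* byte operations ignore the default */
    }
    /* The prefix flips the default to the other large size. */
    bool use_32 = has_size_prefix ? !default_is_32 : default_is_32;
    return use_32 ? 32 : 16;
}
```

A program that is mostly 32-bit therefore pays the extra prefix byte only on its occasional 16-bit operations.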

The prefix solution was borrowed from the 8086, which allows multiple prefixes
to modify instruction behavior. The three original prefixes override the default
segment register, lock the bus to support synchronization (see Section 2.11), or
repeat the following instruction until the register ECX counts down to 0. This last
prefix was intended to be paired with a byte move instruction to move a variable
number of bytes. The 80386 also added a prefix to override the default address size.
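
The effect of pairing the repeat prefix with a byte move can be sketched in C. This is an illustration only: the function name rep_move_bytes is hypothetical, the parameters are ordinary C variables named after the registers ESI, EDI, and ECX, and the sketch assumes addresses always ascend.

```c
#include <stdint.h>

/* Repeated byte move: copy one byte, advance both pointers,
   decrement the counter, and stop when the counter reaches 0. */
void rep_move_bytes(uint8_t *edi, const uint8_t *esi, uint32_t ecx) {
    while (ecx != 0) {
        *edi = *esi;    /* M[EDI] = M[ESI] */
        edi += 1;       /* EDI = EDI + 1   */
        esi += 1;       /* ESI = ESI + 1   */
        ecx -= 1;       /* count down toward 0 */
    }
}
```

The whole loop corresponds to a single prefixed instruction, which is what makes the prefix look powerful.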

The x86 integer operations can be divided into four major classes:

1. Data movement instructions, including move, push, and pop

2. Arithmetic and logic instructions, including test, integer, and decimal
arithmetic operations

3. Control flow, including conditional branches, unconditional jumps, calls,
and returns

4. String instructions, including string move and string compare

Mode: Register indirect
  Description: Address is in a register.
  Register restrictions: Not ESP or EBP
  MIPS equivalent:
    lw $s0,0($s1)

Mode: Based mode with 8- or 32-bit displacement
  Description: Address is contents of base register plus displacement.
  Register restrictions: Not ESP
  MIPS equivalent:
    lw $s0,100($s1)    # <= 16-bit displacement

Mode: Base plus scaled index
  Description: The address is Base + (2^Scale x Index), where Scale has the value 0, 1, 2, or 3.
  Register restrictions: Base: any GPR; Index: not ESP
  MIPS equivalent:
    mul $t0,$s2,4
    add $t0,$t0,$s1
    lw $s0,0($t0)

Mode: Base plus scaled index with 8- or 32-bit displacement
  Description: The address is Base + (2^Scale x Index) + displacement, where Scale has the value 0, 1, 2, or 3.
  Register restrictions: Base: any GPR; Index: not ESP
  MIPS equivalent:
    mul $t0,$s2,4
    add $t0,$t0,$s1
    lw $s0,100($t0)    # <= 16-bit displacement

FIGURE 2.38 x86 32-bit addressing modes with register restrictions and the equivalent MIPS code. The Base plus Scaled Index addressing mode, not found in ARM or MIPS, is included to avoid the multiplies by 4 (scale factor of 2) to turn an index in a register into a byte address (see Figures 2.25 and 2.27). A scale factor of 1 is used for 16-bit data, and a scale factor of 3 for 64-bit data. A scale factor of 0 means the address is not scaled. If the displacement is longer than 16 bits in the second or fourth modes, then the MIPS equivalent mode would need two more instructions: a lui to load the upper 16 bits of the displacement and an add to sum the upper address with the base register $s1. (Intel gives two different names, Based and Indexed, to what is called Based addressing mode, but they are essentially identical and we combine them here.)

The first two categories are unremarkable, except that the arithmetic and logic instruction operations allow the destination to be either a register or a memory location. Figure 2.39 shows some typical x86 instructions and their functions.

Conditional branches on the x86 are based on condition codes or flags, like ARMv7. Condition codes are set as a side effect of an operation; most are used to compare the value of a result to 0. Branches then test the condition codes.
PC-relative branch addresses must be specified in the number of bytes, since unlike ARMv7 and MIPS, 80386 instructions are not all 4 bytes in length.

Instruction          Function
je name              if equal(condition code) {EIP=name}; EIP-128 <= name < EIP+128
jmp name             EIP=name
call name            SP=SP-4; M[SP]=EIP+5; EIP=name;
movw EBX,[EDI+45]    EBX=M[EDI+45]
push ESI             SP=SP-4; M[SP]=ESI
pop EDI              EDI=M[SP]; SP=SP+4
add EAX,#6765        EAX=EAX+6765
test EDX,#42         Set condition code (flags) with EDX and 42
movsl                M[EDI]=M[ESI]; EDI=EDI+4; ESI=ESI+4

FIGURE 2.39 Some typical x86 instructions and their functions. A list of frequent operations appears in Figure 2.40. The CALL saves the EIP of the next instruction on the stack. (EIP is the Intel PC.)

String instructions are part of the 8080 ancestry of the x86 and are not commonly executed in most programs. They are often slower than equivalent software routines (see the fallacy on page 159).

Figure 2.40 lists some of the integer x86 instructions. Many of the instructions are available in both byte and word formats.

x86 Instruction Encoding

Saving the worst for last, the encoding of instructions in the 80386 is complex, with many different instruction formats. Instructions for the 80386 may vary from 1 byte, when there are no operands, up to 15 bytes.

Figure 2.41 shows the instruction format for several of the example instructions in Figure 2.39. The opcode byte usually contains a bit saying whether the operand is 8 bits or 32 bits. For some instructions, the opcode may include the addressing mode and the register; this is true in many instructions that have the form “register = register op immediate.” Other instructions use a “postbyte” or extra opcode byte, labeled “mod, reg, r/m,” which contains the addressing mode information. This postbyte is used for many of the instructions that address memory.
The base plus scaled index mode uses a second postbyte, labeled “sc, index, base.”

Figure 2.42 shows the encoding of the two postbyte address specifiers for both 16-bit and 32-bit mode. Unfortunately, to understand fully which registers and which addressing modes are available, you need to see the encoding of all addressing modes and sometimes even the encoding of the instructions.

Instruction           Meaning

Control               Conditional and unconditional branches
  jnz, jz             Jump if condition to EIP + 8-bit offset; JNE (for JNZ), JE (for JZ) are alternative names
  jmp                 Unconditional jump; 8-bit or 16-bit offset
  call                Subroutine call; 16-bit offset; return address pushed onto stack
  ret                 Pops return address from stack and jumps to it
  loop                Loop branch: decrement ECX; jump to EIP + 8-bit displacement if ECX ≠ 0

Data transfer         Move data between registers or between register and memory
  move                Move between two registers or between register and memory
  push, pop           Push source operand on stack; pop operand from stack top to a register
  les                 Load ES and one of the GPRs from memory

Arithmetic, logical   Arithmetic and logical operations using the data registers and memory
  add, sub            Add source to destination; subtract source from destination; register-memory format
  cmp                 Compare source and destination; register-memory format
  shl, shr, rcr       Shift left; shift logical right; rotate right with carry condition code as fill
  cbw                 Convert byte in eight rightmost bits of EAX to 16-bit word in right of EAX
  test                Logical AND of source and destination sets condition codes
  inc, dec            Increment destination, decrement destination
  or, xor             Logical OR; exclusive OR; register-memory format

String                Move between string operands; length given by a repeat prefix
  movs                Copies from string source to destination by incrementing ESI and EDI; may be repeated
  lods                Loads a byte, word, or doubleword of a string into the EAX register

FIGURE 2.40 Some typical operations on the x86. Many operations use register-memory format, where either the source or the destination may be memory and the other may be a register or immediate operand.

Instruction                 Fields, left to right (width in bits)
a. JE EIP + displacement    JE (4) | Condition (4) | Displacement (8)
b. CALL                     CALL (8) | Offset (32)
c. MOV EBX, [EDI + 45]      MOV (6) | d (1) | w (1) | r/m Postbyte (8) | Displacement (8)
d. PUSH ESI                 PUSH (5) | Reg (3)
e. ADD EAX, #6765           ADD (4) | Reg (3) | w (1) | Immediate (32)
f. TEST EDX, #42            TEST (7) | w (1) | Postbyte (8) | Immediate (32)

FIGURE 2.41 Typical x86 instruction formats. Figure 2.42 shows the encoding of the postbyte. Many instructions contain the 1-bit field w, which says whether the operation is a byte or a double word. The d field in MOV is used in instructions that may move to or from memory and shows the direction of the move. The ADD instruction requires 32 bits for the immediate field, because in 32-bit mode, the immediates are either 8 bits or 32 bits. The immediate field in the TEST is 32 bits long because there is no 8-bit immediate for test in 32-bit mode. Overall, instructions may vary from 1 to 15 bytes in length. The long length comes from extra 1-byte prefixes, having both a 4-byte immediate and a 4-byte displacement address, using an opcode of 2 bytes, and using the scaled index mode specifier, which adds another byte.

reg  w = 0  w = 1       r/m  mod = 0                 mod = 1                         mod = 2                           mod = 3
            16b   32b        16b          32b        16b            32b              16b             32b
0    AL     AX    EAX   0    addr=BX+SI   =EAX       mod=0 + disp8  mod=0 + disp8    mod=0 + disp16  mod=0 + disp32    same as reg field
1    CL     CX    ECX   1    addr=BX+DI   =ECX       mod=0 + disp8  mod=0 + disp8    mod=0 + disp16  mod=0 + disp32    same as reg field
2    DL     DX    EDX   2    addr=BP+SI   =EDX       mod=0 + disp8  mod=0 + disp8    mod=0 + disp16  mod=0 + disp32    same as reg field
3    BL     BX    EBX   3    addr=BP+DI   =EBX       mod=0 + disp8  mod=0 + disp8    mod=0 + disp16  mod=0 + disp32    same as reg field
4    AH     SP    ESP   4    addr=SI      =(sib)     SI+disp8       (sib)+disp8      SI+disp16       (sib)+disp32      same as reg field
5    CH     BP    EBP   5    addr=DI      =disp32    DI+disp8       EBP+disp8        DI+disp16       EBP+disp32        same as reg field
6    DH     SI    ESI   6    addr=disp16  =ESI       BP+disp8       ESI+disp8        BP+disp16       ESI+disp32        same as reg field
7    BH     DI    EDI   7    addr=BX      =EDI       BX+disp8       EDI+disp8        BX+disp16       EDI+disp32        same as reg field

FIGURE 2.42 The encoding of the first address specifier of the x86: mod, reg, r/m. The first four columns show the encoding of the 3-bit reg field, which depends on the w bit from the opcode and whether the machine is in 16-bit mode (8086) or 32-bit mode (80386). The remaining columns explain the mod and r/m fields. The meaning of the 3-bit r/m field depends on the value in the 2-bit mod field and the address size. Basically, the registers used in the address calculation are listed in the sixth and seventh columns, under mod = 0, with mod = 1 adding an 8-bit displacement and mod = 2 adding a 16-bit or 32-bit displacement, depending on the address mode. The exceptions are 1) r/m = 6 when mod = 1 or mod = 2 in 16-bit mode selects BP plus the displacement; 2) r/m = 5 when mod = 1 or mod = 2 in 32-bit mode selects EBP plus displacement; and 3) r/m = 4 in 32-bit mode when mod does not equal 3, where (sib) means use the scaled index mode shown in Figure 2.38. When mod = 3, the r/m field indicates a register, using the same encoding as the reg field combined with the w bit.

x86 Conclusion

Intel had a 16-bit microprocessor two years before its competitors’ more elegant architectures, such as the Motorola 68000, and this head start led to the selection of the 8086 as the CPU for the IBM PC. Intel engineers generally acknowledge that the x86 is more difficult to build than computers like ARMv7 and MIPS, but the large market meant in the PC Era that AMD and Intel could afford more resources to help overcome the added complexity. What the x86 lacks in style, it made up for in market size, making it beautiful from the right perspective.

Its saving grace is that the most frequently used x86 architectural components are not too difficult to implement, as AMD and Intel have demonstrated by rapidly improving performance of integer programs since 1978. To get that performance, compilers must avoid the portions of the architecture that are hard to implement fast.

In the PostPC Era, however, despite considerable architectural and manufacturing expertise, x86 has not yet been competitive in the personal mobile device.

2.18 Real Stuff: ARMv8 (64-bit) Instructions

Of the many potential problems in an instruction set, the one that is almost impossible to overcome is having too small a memory address. While the x86 was successfully extended first to 32-bit addresses and then later to 64-bit addresses, many of its brethren were left behind. For example, the 16-bit address MOS Technology 6502 powered the Apple II, but even given this head start with the first commercially successful personal computer, its lack of address bits condemned it to the dustbin of history.

ARM architects could see the writing on the wall of their 32-bit address computer, and began design of the 64-bit address version of ARM in 2007. It was finally revealed in 2013. Rather than some minor cosmetic changes to make all the registers 64 bits wide, which is basically what happened to the x86, ARM did a complete overhaul. The good news is that if you know MIPS it will be very easy to pick up ARMv8, as the 64-bit version is called.

First, as compared to MIPS, ARM dropped virtually all of the unusual features of v7:

■ There is no conditional execution field, as there was in nearly every instruction in v7.

■ The immediate field is simply a 12-bit constant, rather than essentially an input to a function that produces a constant as in v7.

■ ARM dropped Load Multiple and Store Multiple instructions.

■ The PC is no longer one of the registers, which resulted in unexpected branches if you wrote to it.

Second, ARM added missing features that are useful in MIPS:

■ V8 has 32 general-purpose registers, which compiler writers surely love. Like MIPS, one register is hardwired to 0, although in load and store instructions it instead refers to the stack pointer.

■ Its addressing modes work for all word sizes in ARMv8, which was not the case in ARMv7.

■ It includes a divide instruction, which was omitted from ARMv7.

■ It adds the equivalent of MIPS branch if equal and branch if not equal.

As the philosophy of the v8 instruction set is much closer to MIPS than it is to v7, our conclusion is that the main similarity between ARMv7 and ARMv8 is the name.

2.19 Fallacies and Pitfalls

Fallacy: More powerful instructions mean higher performance.

Part of the power of the Intel x86 is the prefixes that can modify the execution of the following instruction. One prefix can repeat the following instruction until a counter counts down to 0. Thus, to move data in memory, it would seem that the natural instruction sequence is to use move with the repeat prefix to perform 32-bit memory-to-memory moves.

An alternative method, which uses the standard instructions found in all computers, is to load the data into the registers and then store the registers back to memory. This second version of the program, with the code replicated to reduce loop overhead, copies about 1.5 times as fast. A third version, which uses the larger floating-point registers instead of the integer registers of the x86, copies about 2.0 times as fast as the complex move instruction.

Fallacy: Write in assembly language to obtain the highest performance.

At one time compilers for programming languages produced naïve instruction sequences; the increasing sophistication of compilers means the gap between compiled code and code produced by hand is closing fast.
In fact, to compete with current compilers, the assembly language programmer needs to understand the concepts in Chapters 4 and 5 thoroughly (processor pipelining and memory hierarchy).

This battle between compilers and assembly language coders is another situation in which humans are losing ground. For example, C offers the programmer a chance to give a hint to the compiler about which variables to keep in registers versus spilled to memory. When compilers were poor at register allocation, such hints were vital to performance. In fact, some old C textbooks spent a fair amount of time giving examples that effectively use register hints. Today’s C compilers generally ignore such hints, because the compiler does a better job at allocation than the programmer does.

Even if writing by hand resulted in faster code, the dangers of writing in assembly language are the longer time spent coding and debugging, the loss in portability, and the difficulty of maintaining such code. One of the few widely accepted axioms of software engineering is that coding takes longer if you write more lines, and it clearly takes many more lines to write a program in assembly language than in C or Java. Moreover, once it is coded, the next danger is that it will become a popular program. Such programs always live longer than expected, meaning that someone will have to update the code over several years and make it work with new releases of operating systems and new models of machines. Writing in higher-level language instead of assembly language not only allows future compilers to tailor the code to future machines; it also makes the software easier to maintain and allows the program to run on more brands of computers.

Fallacy: The importance of commercial binary compatibility means successful instruction sets don’t change.

While backwards binary compatibility is sacrosanct, Figure 2.43 shows that the x86 architecture has grown dramatically. The average is more than one instruction per month over its 35-year lifetime!

Pitfall: Forgetting that sequential word addresses in machines with byte addressing do not differ by one.

Many an assembly language programmer has toiled over errors made by assuming that the address of the next word can be found by incrementing the address in a register by one instead of by the word size in bytes. Forewarned is forearmed!

Pitfall: Using a pointer to an automatic variable outside its defining procedure.

A common mistake in dealing with pointers is to pass a result from a procedure that includes a pointer to an array that is local to that procedure. Following the stack discipline in Figure 2.12, the memory that contains the local array will be reused as soon as the procedure returns. Pointers to automatic variables can lead to chaos.

2.20 Concluding Remarks

The two principles of the stored-program computer are the use of instructions that are indistinguishable from numbers and the use of alterable memory for programs. These principles allow a single machine to aid environmental scientists, financial advisers, and novelists in their specialties. The selection of a set of instructions that the machine can understand demands a delicate balance among the number of instructions needed to execute a program, the number of clock cycles needed by an instruction, and the speed of the clock. As illustrated in this chapter, three design principles guide the authors of instruction sets in making that delicate balance:

1. Simplicity favors regularity. Regularity motivates many features of the MIPS instruction set: keeping all instructions a single size, always requiring three register operands in arithmetic instructions, and keeping the register fields in the same place in each instruction format.

2. Smaller is faster.
The desire for speed is the reason that MIPS has 32 registers rather than many more.

3. Good design demands good compromises. One MIPS example was the compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length.

Less is more.
Robert Browning, Andrea del Sarto, 1855

[Figure 2.43: plot of the number of x86 instructions (vertical axis, 0 to 1000) against year (horizontal axis, 1978 to 2012).]

FIGURE 2.43 Growth of x86 instruction set over time. While there is clear technical value to some of these extensions, this rapid change also increases the difficulty for other companies to try to build compatible processors.

We also saw the great idea of making the common case fast applied to instruction sets as well as computer architecture. Examples of making the common MIPS case fast include PC-relative addressing for conditional branches and immediate addressing for larger constant operands.

Above this machine level is assembly language, a language that humans can read. The assembler translates it into the binary numbers that machines can understand, and it even “extends” the instruction set by creating symbolic instructions that aren’t in the hardware. For instance, constants or addresses that are too big are broken into properly sized pieces, common variations of instructions are given their own name, and so on.
Figure 2.44 lists the MIPS instructions we have covered so far, both real and pseudoinstructions. Hiding details from the higher level is another example of the great idea of abstraction.

MIPS instructions                  Name    Format    Pseudo MIPS                     Name    Format
add                                add     R         move                            move    R
subtract                           sub     R         multiply                        mult    R
add immediate                      addi    I         multiply immediate              multi   I
load word                          lw      I         load immediate                  li      I
store word                         sw      I         branch less than                blt     I
load half                          lh      I         branch less than or equal       ble     I
load half unsigned                 lhu     I         branch greater than             bgt     I
store half                         sh      I         branch greater than or equal    bge     I
load byte                          lb      I
load byte unsigned                 lbu     I
store byte                         sb      I
load linked                        ll      I
store conditional                  sc      I
load upper immediate               lui     I
and                                and     R
or                                 or      R
nor                                nor     R
and immediate                      andi    I
or immediate                       ori     I
shift left logical                 sll     R
shift right logical                srl     R
branch on equal                    beq     I
branch on not equal                bne     I
set less than                      slt     R
set less than immediate            slti    I
set less than immediate unsigned   sltiu   I
jump                               j       J
jump register                      jr      R
jump and link                      jal     J

FIGURE 2.44 The MIPS instruction set covered so far, with the real MIPS instructions on the left and the pseudoinstructions on the right. Appendix A (Section A.10) describes the full MIPS architecture. Figure 2.1 shows more details of the MIPS architecture revealed in this chapter. The information given here is also found in Columns 1 and 2 of the MIPS Reference Data Card at the front of the book.

Each category of MIPS instructions is associated with constructs that appear in programming languages:

■ Arithmetic instructions correspond to the operations found in assignment statements.

■ Transfer instructions are most likely to occur when dealing with data structures like arrays or structures.

■ Conditional branches are used in if statements and in loops.

■ Unconditional jumps are used in procedure calls and returns and for case/switch statements.

These instructions are not born equal; the popularity of the few dominates the many.
For example, Figure 2.45 shows the popularity of each class of instructions for SPEC CPU2006. The varying popularity of instructions plays an important role in the chapters about datapath, control, and pipelining.

Instruction class     MIPS examples                       HLL correspondence                                      Frequency
                                                                                                                  Integer   Ft. pt.
Arithmetic            add, sub, addi                      Operations in assignment statements                     16%       48%
Data transfer         lw, sw, lb, lbu, lh, lhu, sb, lui   References to data structures, such as arrays           35%       36%
Logical               and, or, nor, andi, ori, sll, srl   Operations in assignment statements                     12%       4%
Conditional branch    beq, bne, slt, slti, sltiu          If statements and loops                                 34%       8%
Jump                  j, jr, jal                          Procedure calls, returns, and case/switch statements    2%        0%

FIGURE 2.45 MIPS instruction classes, examples, correspondence to high-level program language constructs, and percentage of MIPS instructions executed by category for the average integer and floating point SPEC CPU2006 benchmarks. Figure 3.26 in Chapter 3 shows average percentage of the individual MIPS instructions executed.

After we explain computer arithmetic in Chapter 3, we reveal the rest of the MIPS instruction set architecture.

2.21 Historical Perspective and Further Reading

This section surveys the history of instruction set architectures (ISAs) over time, and we give a short history of programming languages and compilers. ISAs include accumulator architectures, general-purpose register architectures, stack architectures, and a brief history of ARM and the x86. We also review the controversial subjects of high-level-language computer architectures and reduced instruction set computer architectures. The history of programming languages includes Fortran, Lisp, Algol, C, Cobol, Pascal, Simula, Smalltalk, C++, and Java, and the history of compilers includes the key milestones and the pioneers who achieved them. The rest of Section 2.21 is found online.

2.22 Exercises

Appendix A describes the MIPS simulator, which is helpful for these exercises. Although the simulator accepts pseudoinstructions, try not to use pseudoinstructions for any exercises that ask you to produce MIPS code. Your goal should be to learn the real MIPS instruction set, and if you are asked to count instructions, your count should reflect the actual instructions that will be executed and not the pseudoinstructions.

There are some cases where pseudoinstructions must be used (for example, the la instruction when an actual value is not known at assembly time). In many cases, they are quite convenient and result in more readable code (for example, the li and move instructions). If you choose to use pseudoinstructions for these reasons, please add a sentence or two to your solution stating which pseudoinstructions you have used and why.

2.1 [5] <§2.2> For the following C statement, what is the corresponding MIPS
assembly code? Assume that the variables f, g, h, and i are given and could be
considered 32-bit integers as declared in a C program. Use a minimal number of
MIPS assembly instructions.

f = g + (h − 5);

2.2 [5] <§2.2> For the MIPS assembly instructions below, what is a
corresponding C statement?

add f, g, h

add f, i, f


2.3 [5] <§§2.2, 2.3> For the following C statement, what is the corresponding
MIPS assembly code? Assume that the variables f, g, h, i, and j are assigned to
registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base addresses
of the arrays A and B are in registers $s6 and $s7, respectively.

B[8] = A[i−j];

2.4 [5] <§§2.2, 2.3> For the MIPS assembly instructions below, what is the
corresponding C statement? Assume that the variables f, g, h, i, and j are assigned
to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base addresses
of the arrays A and B are in registers $s6 and $s7, respectively.

sll $t0, $s0, 2 # $t0 = f * 4
add $t0, $s6, $t0 # $t0 = &A[f]
sll $t1, $s1, 2 # $t1 = g * 4
add $t1, $s7, $t1 # $t1 = &B[g]
lw $s0, 0($t0) # f = A[f]
addi $t2, $t0, 4
lw $t0, 0($t2)
add $t0, $t0, $s0
sw $t0, 0($t1)

2.5 [5] <§§2.2, 2.3> For the MIPS assembly instructions in Exercise 2.4, rewrite
the assembly code to minimize the number of MIPS instructions (if possible)
needed to carry out the same function.

2.6 The table below shows 32-bit values of an array stored in memory.

Address    Data
24         2
28         4
32         3
36         6
40         1


2.6.1 [5] <§§2.2, 2.3> For the memory locations in the table above, write C
code to sort the data from lowest to highest, placing the lowest value in the
smallest memory location shown in the figure. Assume that the data shown
represents the C variable called Array, which is an array of type int, and that
the first number in the array shown is the first element in the array. Assume
that this particular machine is a byte-addressable machine and a word consists
of four bytes.

2.6.2 [5] <§§2.2, 2.3> For the memory locations in the table above, write MIPS
code to sort the data from lowest to highest, placing the lowest value in the smallest
memory location. Use a minimum number of MIPS instructions. Assume the base
address of Array is stored in register $s6.

2.7 [5] <§2.3> Show how the value 0xabcdef12 would be arranged in memory
of a little-endian and a big-endian machine. Assume the data is stored starting at
address 0.

2.8 [5] <§2.4> Translate 0xabcdef12 into decimal.

2.9 [5] <§§2.2, 2.3> Translate the following C code to MIPS. Assume that the
variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4,
respectively. Assume that the base addresses of the arrays A and B are in registers $s6
and $s7, respectively. Assume that the elements of the arrays A and B are 4-byte
words:

B[8] = A[i] + A[j];

2.10 [5] <§§2.2, 2.3> Translate the following MIPS code to C. Assume that the
variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4,
respectively. Assume that the base addresses of the arrays A and B are in registers $s6
and $s7, respectively.

addi $t0, $s6, 4
add $t1, $s6, $0
sw $t1, 0($t0)
lw $t0, 0($t0)
add $s0, $t1, $t0

2.11 [5] <§§2.2, 2.5> For each MIPS instruction, show the value of the opcode
(OP), source register (RS), and target register (RT) fields. For the I-type instructions,
show the value of the immediate field, and for the R-type instructions, show the
value of the destination register (RD) field.

2.12 Assume that registers $s0 and $s1 hold the values 0x80000000 and
0xD0000000, respectively.

2.12.1 [5] <§2.4> What is the value of $t0 for the following assembly code?

add $t0, $s0, $s1

2.12.2 [5] <§2.4> Is the result in $t0 the desired result, or has there been overflow?

2.12.3 [5] <§2.4> For the contents of registers $s0 and $s1 as specified above,
what is the value of $t0 for the following assembly code?

sub $t0, $s0, $s1

2.12.4 [5] <§2.4> Is the result in $t0 the desired result, or has there been overflow?

2.12.5 [5] <§2.4> For the contents of registers $s0 and $s1 as specified above,
what is the value of $t0 for the following assembly code?

add $t0, $s0, $s1
add $t0, $t0, $s0

2.12.6 [5] <§2.4> Is the result in $t0 the desired result, or has there been
overflow?

2.13 Assume that $s0 holds the value 128ten.

2.13.1 [5] <§2.4> For the instruction add $t0, $s0, $s1, what is the range(s) of
values for $s1 that would result in overflow?

2.13.2 [5] <§2.4> For the instruction sub $t0, $s0, $s1, what is the range(s) of
values for $s1 that would result in overflow?

2.13.3 [5] <§2.4> For the instruction sub $t0, $s1, $s0, what is the range(s) of
values for $s1 that would result in overflow?

2.14 [5] <§§2.2, 2.5> Provide the type and assembly language instruction for the
following binary value: 0000 0010 0001 0000 1000 0000 0010 0000two

2.15 [5] <§§2.2, 2.5> Provide the type and hexadecimal representation of
the following instruction: sw $t1, 32($t2)

2.16 [5] <§2.5> Provide the type, assembly language instruction, and binary
representation of the instruction described by the following MIPS fields:

op=0, rs=3, rt=2, rd=3, shamt=0, funct=34

2.17 [5] <§2.5> Provide the type, assembly language instruction, and binary
representation of the instruction described by the following MIPS fields:

op=0x23, rs=1, rt=2, const=0x4

2.18 Assume that we would like to expand the MIPS register file to 128 registers
and expand the instruction set to contain four times as many instructions.

2.18.1 [5] <§2.5> How would this affect the size of each of the bit fields in
the R-type instructions?

2.18.2 [5] <§2.5> How would this affect the size of each of the bit fields in
the I-type instructions?

2.18.3 [5] <§§2.5, 2.10> How could each of the two proposed changes decrease
the size of a MIPS assembly program? On the other hand, how could the proposed
change increase the size of a MIPS assembly program?

2.19 Assume the following register contents:

$t0 = 0xAAAAAAAA, $t1 = 0x12345678

2.19.1 [5] <§2.6> For the register values shown above, what is the value of $t2
for the following sequence of instructions?

sll $t2, $t0, 44
or $t2, $t2, $t1

2.19.2 [5] <§2.6> For the register values shown above, what is the value of $t2
for the following sequence of instructions?

sll $t2, $t0, 4
andi $t2, $t2, -1

2.19.3 [5] <§2.6> For the register values shown above, what is the value of $t2
for the following sequence of instructions?

srl $t2, $t0, 3
andi $t2, $t2, 0xFFEF

2.20 [5] <§2.6> Find the shortest sequence of MIPS instructions that extracts bits
16 down to 11 from register $t0 and uses the value of this field to replace bits 31
down to 26 in register $t1 without changing the other 26 bits of register $t1.

2.21 [5] <§2.6> Provide a minimal set of MIPS instructions that may be used to
implement the following pseudoinstruction:

not $t1, $t2 // bit-wise invert

2.22 [5] <§2.6> For the following C statement, write a minimal sequence of MIPS
assembly instructions that does the identical operation. Assume $t1 = A, $t2 = B,
and $s1 is the base address of C.

A = C[0] << 4;

2.23 [5] <§2.7> Assume $t0 holds the value 0x00101000. What is the value of
$t2 after the following instructions?

slt $t2, $0, $t0
bne $t2, $0, ELSE
j DONE

ELSE: addi $t2, $t2, 2
DONE:

2.24 [5] <§2.7> Suppose the program counter (PC) is set to 0x2000 0000. Is it
possible to use the jump (j) MIPS assembly instruction to set the PC to the address
0x4000 0000? Is it possible to use the branch-on-equal (beq) MIPS assembly
instruction to set the PC to this same address?

2.25 The following instruction is not included in the MIPS instruction set:

rpt $t2, loop # if(R[rs]>0) R[rs]=R[rs]−1, PC=PC+4+BranchAddr

2.25.1 [5] <§2.7> If this instruction were to be implemented in the MIPS
instruction set, what is the most appropriate instruction format?

2.25.2 [5] <§2.7> What is the shortest sequence of MIPS instructions that
performs the same operation?

2.26 Consider the following MIPS loop:

LOOP: slt $t2, $0, $t1
beq $t2, $0, DONE
subi $t1, $t1, 1
addi $s2, $s2, 2
j LOOP

DONE:

2.26.1 [5] <§2.7> Assume that the register $t1 is initialized to the value 10. What
is the value in register $s2 assuming $s2 is initially zero?

2.26.2 [5] <§2.7> For each of the loops above, write the equivalent C code
routine. Assume that the registers $s1, $s2, $t1, and $t2 are integers A, B, i, and
temp, respectively.

2.26.3 [5] <§2.7> For the loops written in MIPS assembly above, assume that
the register $t1 is initialized to the value N. How many MIPS instructions are
executed?

2.27 [5] <§2.7> Translate the following C code to MIPS assembly code. Use a
minimum number of instructions. Assume that the values of a, b, i, and j are in
registers $s0, $s1, $t0, and $t1, respectively. Also, assume that register $s2 holds
the base address of the array D.

for(i=0; i<a; i++)
    for(j=0; j<b; j++)
        D[4*j] = i + j;

2.28 [5] <§2.7> How many MIPS instructions does it take to implement the C
code from Exercise 2.27? If the variables a and b are initialized to 10 and 1 and all
elements of D are initially 0, what is the total number of MIPS instructions that is
executed to complete the loop?

2.29 [5] <§2.7> Translate the following loop into C. Assume that the C-level
integer i is held in register $t1, $s2 holds the C-level integer called result, and
$s0 holds the base address of the integer MemArray.

      addi $t1, $0, 0
LOOP: lw   $s1, 0($s0)
      add  $s2, $s2, $s1
      addi $s0, $s0, 4
      addi $t1, $t1, 1
      slti $t2, $t1, 100
      bne  $t2, $0, LOOP

2.30 [5] <§2.7> Rewrite the loop from Exercise 2.29 to reduce the number of
MIPS instructions executed.

2.31 [5] <§2.8> Implement the following C code in MIPS assembly. What is the
total number of MIPS instructions needed to execute the function?

int fib(int n){
    if (n==0)
        return 0;
    else if (n == 1)
        return 1;
    else
        return fib(n-1) + fib(n-2);
}

2.32 [5] <§2.8> Functions can often be implemented by compilers “in-line.” An
in-line function is one whose body is copied into the program space,
allowing the overhead of the function call to be eliminated. Implement an “in-line”
version of the C code above in MIPS assembly. What is the reduction in the total
number of MIPS assembly instructions needed to complete the function? Assume
that the C variable n is initialized to 5.

2.33 [5] <§2.8> For each function call, show the contents of the stack after the
function call is made. Assume the stack pointer is originally at address 0x7ffffffc,
and follow the register conventions as specified in Figure 2.11.

2.34 Translate function f into MIPS assembly language. If you need to use
registers $t0 through $t7, use the lower-numbered registers first. Assume the
function declaration for func is “int func(int a, int b);”. The code for function
f is as follows:

int f(int a, int b, int c, int d){
    return func(func(a,b),c+d);
}

2.35 [5] <§2.8> Can we use the tail-call optimization in this function? If no,
explain why not. If yes, what is the difference in the number of executed instructions
in f with and without the optimization?

2.36 [5] <§2.8> Right before your function f from Exercise 2.34 returns, what do
we know about contents of registers $t5, $s3, $ra, and $sp? Keep in mind that
we know what the entire function f looks like, but for function func we only know
its declaration.

2.37 [5] <§2.9> Write a program in MIPS assembly language to convert an ASCII
number string containing positive and negative integer decimal strings, to an
integer. Your program should expect register $a0 to hold the address of a null-
terminated string containing some combination of the digits 0 through 9. Your
program should compute the integer value equivalent to this string of digits, then
place the number in register $v0. If a non-digit character appears anywhere in the
string, your program should stop with the value −1 in register $v0. For example,
if register $a0 points to a sequence of three bytes 50ten, 52ten, 0ten (the null-
terminated string “24”), then when the program stops, register $v0 should contain
the value 24ten.

2.38 [5] <§2.9> Consider the following code:

lbu $t0, 0($t1)

sw $t0, 0($t2)

Assume that the register $t1 contains the address 0x1000 0000 and the register
$t2 contains the address 0x1000 0010. Note the MIPS architecture utilizes
big-endian addressing. Assume that the data (in hexadecimal) at address 0x1000
0000 is: 0x11223344. What value is stored at the address pointed to by register
$t2?

2.39 [5] <§2.10> Write the MIPS assembly code that creates the 32-bit constant
0010 0000 0000 0001 0100 1001 0010 0100two and stores that value to

2.40 [5] <§§2.6, 2.10> If the current value of the PC is 0x00000000, can you use
a single jump instruction to get to the PC address as shown in Exercise 2.39?

2.41 [5] <§§2.6, 2.10> If the current value of the PC is 0x00000600, can you use
a single branch instruction to get to the PC address as shown in Exercise 2.39?

2.42 [5] <§§2.6, 2.10> If the current value of the PC is 0x1FFFf000, can you use
a single branch instruction to get to the PC address as shown in Exercise 2.39?

2.43 [5] <§2.11> Write the MIPS assembly code to implement the following C
code:

lock(lk);

shvar=max(shvar,x);

unlock(lk);

Assume that the address of the lk variable is in $a0, the address of the shvar
variable is in $a1, and the value of variable x is in $a2. Your critical section should
not contain any function calls. Use ll/sc instructions to implement the lock()
operation, and the unlock() operation is simply an ordinary store instruction.

2.44 [5] <§2.11> Repeat Exercise 2.43, but this time use ll/sc to perform
an atomic update of the shvar variable directly, without using lock() and
unlock(). Note that in this problem there is no variable lk.

2.45 [5] <§2.11> Using your code from Exercise 2.43 as an example, explain what
happens when two processors begin to execute this critical section at the same
time, assuming that each processor executes exactly one instruction per cycle.

2.46 Assume for a given processor the CPI of arithmetic instructions is 1,
the CPI of load/store instructions is 10, and the CPI of branch instructions is
3. Assume a program has the following instruction breakdowns: 500 million
arithmetic instructions, 300 million load/store instructions, 100 million branch
instructions.

2.46.1 [5] <§2.19> Suppose that new, more powerful arithmetic instructions are
added to the instruction set. On average, through the use of these more powerful
arithmetic instructions, we can reduce the number of arithmetic instructions
needed to execute a program by 25%, at the cost of increasing the clock cycle
time by only 10%. Is this a good design choice? Why?

2.46.2 [5] <§2.19> Suppose that we find a way to double the performance of
arithmetic instructions. What is the overall speedup of our machine? What if we
find a way to improve the performance of arithmetic instructions by 10 times?

2.47 Assume that for a given program 70% of the executed instructions are
arithmetic, 10% are load/store, and 20% are branch.

2.47.1 [5] <§2.19> Given this instruction mix and the assumption that an
arithmetic instruction requires 2 cycles, a load/store instruction takes 6 cycles, and
a branch instruction takes 3 cycles, find the average CPI.

2.47.2 [5] <§2.19> For a 25% improvement in performance, how many cycles, on
average, may an arithmetic instruction take if load/store and branch instructions
are not improved at all?

2.47.3 [5] <§2.19> For a 50% improvement in performance, how many cycles, on
average, may an arithmetic instruction take if load/store and branch instructions
are not improved at all?

Answers to Check Yourself

§2.2, page 66: MIPS, C, Java
§2.3, page 72: 2) Very slow
§2.4, page 79: 2) −8ten
§2.5, page 87: 4) sub $t2, $t0, $t1
§2.6, page 89: Both. AND with a mask pattern of 1s leaves 0s everywhere but
the desired field. Shifting left by the correct amount removes the bits from the left
of the field. Shifting right by the appropriate amount puts the field into the
rightmost bits of the word, with 0s in the rest of the word. Note that AND leaves the
field where it was originally, and the shift pair moves the field into the rightmost
part of the word.
§2.7, page 96: I. All are true. II. 1).
§2.8, page 106: Both are true.
§2.9, page 111: I. 1) and 2) II. 3)
§2.10, page 120: I. 4) ±128K. II. 6) a block of 256M. III. 4) sll
§2.11, page 123: Both are true.
§2.12, page 132: 4) Machine independence.

3
Numerical precision
is the very soul of
science.
Sir D’arcy Wentworth Thompson
On Growth and Form, 1917

Arithmetic for Computers

3.1 Introduction
3.2 Addition and Subtraction
3.3 Multiplication
3.4 Division
3.5 Floating Point
3.6 Parallelism and Computer Arithmetic: Subword Parallelism
3.7 Real Stuff: Streaming SIMD Extensions and Advanced Vector Extensions in x86
3.8 Going Faster: Subword Parallelism and Matrix Multiply
3.9 Fallacies and Pitfalls
3.10 Concluding Remarks
3.11 Historical Perspective and Further Reading
3.12 Exercises

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.

The Five Classic Components of a Computer

178 Chapter 3 Arithmetic for Computers

3.1 Introduction

Computer words are composed of bits; thus, words can be represented as binary
numbers. Chapter 2 shows that integers can be represented either in decimal or
binary form, but what about the other numbers that commonly occur? For example:

■ What about fractions and other real numbers?

■ What happens if an operation creates a number bigger than can be represented?

■ And underlying these questions is a mystery: How does hardware really
multiply or divide numbers?

The goal of this chapter is to unravel these mysteries, including representation of
real numbers, arithmetic algorithms, hardware that follows these algorithms, and
the implications of all this for instruction sets. These insights may explain quirks
that you have already encountered with computers. Moreover, we show how to use
this knowledge to make arithmetic-intensive programs go much faster.

3.2 Addition and Subtraction

Addition is just what you would expect in computers. Digits are added bit by bit
from right to left, with carries passed to the next digit to the left, just as you would
do by hand. Subtraction uses addition: the appropriate operand is simply negated
before being added.

Subtraction: Addition’s Tricky Pal
No. 10, Top Ten Courses for Athletes at a Football Factory, David Letterman et al., Book of Top Ten Lists, 1990

Binary Addition and Subtraction

EXAMPLE

Let’s try adding 6ten to 7ten in binary and then subtracting 6ten from 7ten in binary.

ANSWER

  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
= 0000 0000 0000 0000 0000 0000 0000 1101two = 13ten

The 4 bits to the right have all the action; Figure 3.1 shows the sums and
carries. The carries are shown in parentheses, with the arrows showing how
they are passed.

Subtracting 6ten from 7ten can be done directly:

0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
– 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten

or via addition using the two’s complement representation of −6:
0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 1111 1111 1111 1111 1111 1111 1111 1010two = –6ten
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
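
The second form, subtraction by adding the negated operand, can be checked with a few lines of C. This is only an illustrative sketch: the function names negate32 and sub_via_add are our own, and an unsigned 32-bit integer is assumed as a stand-in for a 32-bit register because C defines its arithmetic to wrap modulo 2^32.

```c
#include <stdint.h>

/* Two's complement negation: invert every bit, then add 1. */
uint32_t negate32(uint32_t x) {
    return ~x + 1u;
}

/* Subtract by adding the negated second operand: a - b = a + (-b).
   The carry out of bit 31 is simply dropped by the 32-bit wraparound. */
uint32_t sub_via_add(uint32_t a, uint32_t b) {
    return a + negate32(b);
}
```

Calling negate32(6) yields the bit pattern shown above for −6ten, and sub_via_add(7, 6) yields 1.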

FIGURE 3.1 Binary addition, showing carries from right to left. The rightmost bit adds 1
to 0, resulting in the sum of this bit being 1 and the carry out from this bit being 0. Hence, the operation
for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit and a carry out of 1. The
third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1 and a sum bit of 1. The fourth bit is 1 +
0 + 0, yielding a 1 sum and no carry.
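
The bit-by-bit process that Figure 3.1 describes can be sketched in C as follows. The function name ripple_add and the carry_out_bits output are hypothetical, introduced only so the carries can be inspected; real hardware produces them as signals between adder stages rather than as a stored word.

```c
#include <stdint.h>

/* Add two 32-bit words one bit at a time, right to left.
   Bit i of *carry_out_bits records the carry out of bit position i. */
uint32_t ripple_add(uint32_t a, uint32_t b, uint32_t *carry_out_bits) {
    uint32_t sum = 0, carries = 0;
    uint32_t carry = 0;                        /* carry in to bit 0 */
    for (int i = 0; i < 32; i++) {
        uint32_t abit = (a >> i) & 1u;
        uint32_t bbit = (b >> i) & 1u;
        uint32_t total = abit + bbit + carry;  /* 0, 1, 2, or 3 */
        sum |= (total & 1u) << i;              /* sum bit for this position */
        carry = total >> 1;                    /* carry passed to the left */
        carries |= carry << i;
    }
    *carry_out_bits = carries;
    return sum;
}
```

For 7 + 6 the carries out of bits 1 and 2 are set, matching the caption: the second and third digits each generate a carry, and the fourth does not.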

Recall that overflow occurs when the result from an operation cannot be
represented with the available hardware, in this case a 32-bit word. When can
overflow occur in addition? When adding operands with different signs, overflow
cannot occur. The reason is that the sum must be no larger than one of the operands.
For example, −10 + 4 = −6. Since the operands fit in 32 bits and the sum is no
larger than an operand, the sum must fit in 32 bits as well. Therefore, no overflow
can occur when adding positive and negative operands.

There are similar restrictions to the occurrence of overflow during subtract, but
it’s just the opposite principle: when the signs of the operands are the same, overflow
cannot occur. To see this, remember that c − a = c + (−a) because we subtract by
negating the second operand and then add. Therefore, when we subtract operands
of the same sign we end up by adding operands of different signs. From the prior
paragraph, we know that overflow cannot occur in this case either.

Knowing when overflow cannot occur in addition and subtraction is all well and
good, but how do we detect it when it does occur? Clearly, adding or subtracting
two 32-bit numbers can yield a result that needs 33 bits to be fully expressed.

The lack of a 33rd bit means that when overflow occurs, the sign bit is set with
the value of the result instead of the proper sign of the result. Since we need just one
extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two
positive numbers and the sum is negative, or vice versa. This spurious sum means
a carry out occurred into the sign bit.

Overflow occurs in subtraction when we subtract a negative number from a
positive number and get a negative result, or when we subtract a positive number
from a negative number and get a positive result. Such a ridiculous result means a
borrow occurred from the sign bit. Figure 3.2 shows the combination of operations,
operands, and results that indicate an overflow.
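
The sign rules just described translate directly into a software check. The C sketch below is one possible rendering, not the hardware's method: the names add_overflows and sub_overflows are our own, and the arithmetic is done on unsigned copies of the operands because signed overflow is undefined behavior in C, whereas unsigned arithmetic wraps.

```c
#include <stdbool.h>
#include <stdint.h>

/* a + b overflows only when both operands have the same sign and the
   sign of the sum differs from it. */
bool add_overflows(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                      /* wraps, never traps */
    /* Bit 31 is set when the sum's sign differs from both operands' signs. */
    return (((ua ^ sum) & (ub ^ sum)) >> 31) != 0;
}

/* a - b overflows only when the operands have different signs and the
   sign of the result differs from the sign of a. */
bool sub_overflows(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t diff = ua - ub;                     /* wraps, never traps */
    return (((ua ^ ub) & (ua ^ diff)) >> 31) != 0;
}
```

Each function tests only the sign bits, so it covers exactly the four overflow cases: two for addition and two for subtraction.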

We have just seen how to detect overflow for two’s complement numbers in a
computer. What about overflow with unsigned integers? Unsigned integers are
commonly used for memory addresses where overflows are ignored.

The computer designer must therefore provide a way to ignore overflow in
some cases and to recognize it in others. The MIPS solution is to have two kinds of
arithmetic instructions to recognize the two choices:

■ Add (add), add immediate (addi), and subtract (sub) cause exceptions on
overflow.

■ Add unsigned (addu), add immediate unsigned (addiu), and subtract
unsigned (subu) do not cause exceptions on overflow.

Because C ignores overflows, the MIPS C compilers will always generate the
unsigned versions of the arithmetic instructions addu, addiu, and subu, no
matter what the type of the variables. The MIPS Fortran compilers, however, pick
the appropriate arithmetic instructions, depending on the type of the operands.
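
The non-trapping behavior of addu corresponds to unsigned arithmetic in C, which wraps modulo 2^32. The sketch below is illustrative only; the names add_wrap and addu_carries are hypothetical. The second function shows one way software can still find out whether an unsigned sum wrapped, using the fact that NOT a equals 2^32 − 1 − a.

```c
#include <stdbool.h>
#include <stdint.h>

/* Like addu: the sum wraps modulo 2^32 and nothing is signaled. */
uint32_t add_wrap(uint32_t a, uint32_t b) {
    return a + b;
}

/* ~a is 2^32 - 1 - a, the largest value that can be added to a
   without wrapping, so the sum wraps exactly when b exceeds it. */
bool addu_carries(uint32_t a, uint32_t b) {
    return (uint32_t)~a < b;
}
```

The comparison needs no wider type than the operands themselves, which is why the same idea fits in a short register-only instruction sequence.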

Appendix B describes the hardware that performs addition and subtraction,
which is called an Arithmetic Logic Unit or ALU.

Elaboration: A constant source of confusion for addiu is its name and what happens
to its immediate field. The u stands for unsigned, which means addition cannot cause an
overflow exception. However, the 16-bit immediate field is sign extended to 32 bits, just
like addi, slti, and sltiu. Thus, the immediate field is signed, even if the operation
is “unsigned.”
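
A small C model may make the point concrete. It is a sketch under our own assumptions: sign_extend16 and addiu_model are hypothetical names, and the 16-bit immediate is passed as the raw field value taken from the instruction word.

```c
#include <stdint.h>

/* Sign-extend a 16-bit immediate field to 32 bits:
   if bit 15 is set, fill the upper 16 bits with 1s. */
uint32_t sign_extend16(uint16_t imm) {
    return (imm & 0x8000u) ? (0xFFFF0000u | imm) : imm;
}

/* Model of addiu rt, rs, imm: the immediate is sign extended,
   the sum wraps modulo 2^32, and no overflow exception is raised. */
uint32_t addiu_model(uint32_t rs, uint16_t imm) {
    return rs + sign_extend16(imm);
}
```

With an immediate field of 0xFFFF, addiu_model(10, 0xFFFF) returns 9, so the "unsigned" instruction has effectively added −1.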

Hardware/Software Interface

The computer designer must decide how to handle arithmetic overflows. Although
some languages like C and Java ignore integer overflow, languages like Ada and
Fortran require that the program be notified. The programmer or the programming
environment must then decide what to do when overflow occurs.

MIPS detects overflow with an exception, also called an interrupt on many
computers. An exception or interrupt is essentially an unscheduled procedure
call. The address of the instruction that overflowed is saved in a register, and the
computer jumps to a predefined address to invoke the appropriate routine for that
exception. The interrupted address is saved so that in some situations the program
can continue after corrective code is executed. (Section 4.9 covers exceptions in
more detail; Chapter 5 describes other situations where exceptions and interrupts
occur.)

MIPS includes a register called the exception program counter (EPC) to contain
the address of the instruction that caused the exception. The instruction move from
system control (mfc0) is used to copy EPC into a general-purpose register so that
MIPS software has the option of returning to the offending instruction via a jump
register instruction.

Arithmetic Logic Unit (ALU): Hardware that performs addition, subtraction, and
usually logical operations such as AND and OR.

exception: Also called interrupt on many computers. An unscheduled event that
disrupts program execution; used to detect overflow.

interrupt: An exception that comes from outside of the processor. (Some
architectures use the term interrupt for all exceptions.)

FIGURE 3.2 Overflow conditions for addition and subtraction.

Operation   Operand A   Operand B   Result indicating overflow
A + B       ≥ 0         ≥ 0         < 0
A + B       < 0         < 0         ≥ 0
A – B       ≥ 0         < 0         < 0
A – B       < 0         ≥ 0         ≥ 0

Summary

A major point of this section is that, independent of the representation, the finite
word size of computers means that arithmetic operations can create results that are
too large to fit in this fixed word size. It’s easy to detect overflow in unsigned
numbers, although these are almost always ignored because programs don’t want to
detect overflow for address arithmetic, the most common use of natural numbers.
Two’s complement presents a greater challenge, yet some software systems require
detection of overflow, so today all computers have a way to detect it.

Check Yourself

Some programming languages allow two’s complement integer arithmetic on
variables declared byte and half, whereas MIPS only has integer arithmetic
operations on full words. As we recall from Chapter 2, MIPS does have data transfer
operations for bytes and halfwords. What MIPS instructions should be generated
for byte and halfword arithmetic operations?

1. Load with lbu, lhu; arithmetic with add, sub, mult, div; then store using
sb, sh.

2. Load with lb, lh; arithmetic with add, sub, mult, div; then store using
sb, sh.

3. Load with lb, lh; arithmetic with add, sub, mult, div, using AND to mask
result to 8 or 16 bits after each operation; then store using sb, sh.

Elaboration: One feature not generally found in general-purpose microprocessors is
saturating operations. Saturation means that when a calculation overflows, the result
is set to the largest positive number or most negative number, rather than a modulo
calculation as in two’s complement arithmetic. Saturation is likely what you want for
media operations. For example, the volume knob on a radio set would be frustrating if,
as you turned it, the volume would get continuously louder for a while and then
immediately very soft. A knob with saturation would stop at the highest volume no
matter how far you turned it. Multimedia extensions to standard instruction sets often
offer saturating arithmetic.

Elaboration: MIPS can trap on overflow, but unlike many other computers, there is
no conditional branch to test overflow. A sequence of MIPS instructions can discover
overflow. For signed addition, the sequence is the following (see the Elaboration on
page 89 in Chapter 2 for a description of the xor instruction):

    addu $t0, $t1, $t2           # $t0 = sum, but don’t trap
    xor  $t3, $t1, $t2           # Check if signs differ
    slt  $t3, $t3, $zero         # $t3 = 1 if signs differ
    bne  $t3, $zero, No_overflow # $t1, $t2 signs ≠,
                                 # so no overflow
    xor  $t3, $t0, $t1           # signs =; sign of sum match too?
                                 # $t3 negative if sum sign different
    slt  $t3, $t3, $zero         # $t3 = 1 if sum sign different
    bne  $t3, $zero, Overflow    # All 3 signs ≠; goto overflow

For unsigned addition ($t0 = $t1 + $t2), the test is

    addu $t0, $t1, $t2           # $t0 = sum
    nor  $t3, $t1, $zero         # $t3 = NOT $t1
                                 # (2’s comp – 1: 2^32 – $t1 – 1)
    sltu $t3, $t3, $t2           # (2^32 – $t1 – 1) < $t2
                                 # ⇒ 2^32 – 1 < $t1 + $t2
    bne  $t3, $zero, Overflow    # if(2^32–1<$t1+$t2) goto overflow

Elaboration: In the preceding text, we said that you copy EPC into a register via
mfc0 and then return to the interrupted code via jump register.
This directive leads to an interesting question: since you must first transfer EPC to a
register to use with jump register, how can jump register return to the interrupted code
and restore the original values of all registers? Either you restore the old registers first,
thereby destroying your return address from EPC, which you placed in a register for use
in jump register, or you restore all registers but the one with the return address so that
you can jump, meaning an exception would result in changing that one register at any
time during program execution! Neither option is satisfactory.

To rescue the hardware from this dilemma, MIPS programmers agreed to reserve
registers $k0 and $k1 for the operating system; these registers are not restored on
exceptions. Just as the MIPS compilers avoid using register $at so that the assembler
can use it as a temporary register (see Hardware/Software Interface in Section 2.10),
compilers also abstain from using registers $k0 and $k1 to make them available for the
operating system. Exception routines place the return address in one of these registers
and then use jump register to restore the instruction address.

Elaboration: The speed of addition is increased by determining the carry in to the
high-order bits sooner. There are a variety of schemes to anticipate the carry so that
the worst-case scenario is a function of the log2 of the number of bits in the adder.
These anticipatory signals are faster because they go through fewer gates in sequence,
but it takes many more gates to anticipate the proper carry. The most popular is carry
lookahead, which Section B.6 in Appendix B describes.

3.3 Multiplication

Now that we have completed the explanation of addition and subtraction, we are
ready to build the more vexing operation of multiplication. First, let’s review the
multiplication of decimal numbers in longhand to remind ourselves of the steps of
multiplication and the names of the operands.
For reasons that will become clear shortly, we limit this decimal example to using
only the digits 0 and 1. Multiplying 1000ten by 1001ten:

Multiplicand           1000ten
Multiplier        x    1001ten
                       1000
                      0000
                     0000
                    1000
Product             1001000ten

The first operand is called the multiplicand and the second the multiplier. The
final result is called the product. As you may recall, the algorithm learned in
grammar school is to take the digits of the multiplier one at a time from right to
left, multiplying the multiplicand by the single digit of the multiplier, and shifting
the intermediate product one digit to the left of the earlier intermediate products.

The first observation is that the number of digits in the product is considerably
larger than the number in either the multiplicand or the multiplier. In fact, if we
ignore the sign bits, the length of the multiplication of an n-bit multiplicand and an
m-bit multiplier is a product that is n + m bits long. That is, n + m bits are required
to represent all possible products. Hence, like add, multiply must cope with
overflow because we frequently want a 32-bit product as the result of multiplying
two 32-bit numbers.

In this example, we restricted the decimal digits to 0 and 1. With only two choices,
each step of the multiplication is simple:

1. Just place a copy of the multiplicand (1 × multiplicand) in the proper place if
the multiplier digit is a 1, or

2. Place 0 (0 × multiplicand) in the proper place if the digit is 0.

Although the decimal example above happens to use only 0 and 1, multiplication
of binary numbers must always use 0 and 1, and thus always offers only these two
choices.

Now that we have reviewed the basics of multiplication, the traditional next step is
to provide the highly optimized multiply hardware. We break with tradition in the
belief that you will gain a better understanding by seeing the evolution of the
multiply hardware and algorithm through multiple generations.
For now, let’s assume that we are multiplying only positive numbers.

Multiplication is vexation, Division is as bad;
The rule of three doth puzzle me, And practice drives me mad.
Anonymous, Elizabethan manuscript, 1570

Sequential Version of the Multiplication Algorithm and Hardware

This design mimics the algorithm we learned in grammar school; Figure 3.3 shows
the hardware. We have drawn the hardware so that data flows from top to bottom
to resemble more closely the paper-and-pencil method.

Let’s assume that the multiplier is in the 32-bit Multiplier register and that the
64-bit Product register is initialized to 0. From the paper-and-pencil example above,
it’s clear that we will need to move the multiplicand left one digit each step, as it
may be added to the intermediate products. Over 32 steps, a 32-bit multiplicand
would move 32 bits to the left. Hence, we need a 64-bit Multiplicand register,
initialized with the 32-bit multiplicand in the right half and zero in the left half.
This register is then shifted left 1 bit each step to align the multiplicand with the
sum being accumulated in the 64-bit Product register.

Figure 3.4 shows the three basic steps needed for each bit. The least significant
bit of the multiplier (Multiplier0) determines whether the multiplicand is added to
the Product register. The left shift in step 2 has the effect of moving the intermediate
operands to the left, just as when multiplying with paper and pencil. The shift right
in step 3 gives us the next bit of the multiplier to examine in the following iteration.
These three steps are repeated 32 times to obtain the product. If each step took a
clock cycle, this algorithm would require almost 100 clock cycles to multiply two
32-bit numbers. The relative importance of arithmetic operations like multiply
varies with the program, but addition and subtraction may be anywhere from 5 to
100 times more popular than multiply.
Accordingly, in many applications, multiply can take multiple clock cycles without
significantly affecting performance. Yet Amdahl’s Law (see Section 1.10) reminds us
that even a moderate frequency for a slow operation can limit performance.

FIGURE 3.3 First version of the multiplication hardware. The Multiplicand register, ALU,
and Product register are all 64 bits wide, with only the Multiplier register containing 32 bits.
(Appendix B describes ALUs.) The 32-bit multiplicand starts in the right half of the Multiplicand
register and is shifted left 1 bit on each step. The multiplier is shifted in the opposite direction at
each step. The algorithm starts with the product initialized to 0. Control decides when to shift the
Multiplicand and Multiplier registers and when to write new values into the Product register.

This algorithm and hardware are easily refined to take 1 clock cycle per step. The
speed-up comes from performing the operations in parallel: the multiplier and
multiplicand are shifted while the multiplicand is added to the product if the
multiplier bit is a 1. The hardware just has to ensure that it tests the right bit of
the multiplier and gets the preshifted version of the multiplicand. The hardware is
usually further optimized to halve the width of the adder and registers by noticing
where there are unused portions of registers and adders. Figure 3.5 shows the
revised hardware.

Start
1. Test Multiplier0.
   1a. If Multiplier0 = 1, add multiplicand to product and place the result in Product register.
   If Multiplier0 = 0, go directly to step 2.
2. Shift the Multiplicand register left 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition? No (< 32 repetitions): return to step 1. Yes (32 repetitions): Done.

FIGURE 3.4 The first multiplication algorithm, using the hardware shown in Figure 3.3.
If the least significant bit of the multiplier is 1, add the multiplicand to the product. If not, go to the next step. Shift the multiplicand left and the multiplier right in the next two steps. These three steps are repeated 32 times.

Hardware/Software Interface

Replacing arithmetic by shifts can also occur when multiplying by constants. Some compilers replace multiplies by short constants with a series of shifts and adds. Because one bit to the left represents a number twice as large in base 2, shifting the bits left has the same effect as multiplying by a power of 2. As mentioned in Chapter 2, almost every compiler will perform the strength reduction optimization of substituting a left shift for a multiply by a power of 2.

A Multiply Algorithm

EXAMPLE
Using 4-bit numbers to save space, multiply \(2_{ten} \times 3_{ten}\), or \(0010_{two} \times 0011_{two}\).

ANSWER
Figure 3.6 shows the value of each register for each of the steps labeled according to Figure 3.4, with the final value of \(0000\ 0110_{two}\) or \(6_{ten}\). Color is used to indicate the register values that change on that step, and the bit circled is the one examined to determine the operation of the next step.

[Figure 3.5 diagram. Labeled blocks: Multiplicand register (32 bits); 32-bit ALU; Product register (64 bits, shift right, write); Control test.]

FIGURE 3.5 Refined version of the multiplication hardware. Compare with the first version in Figure 3.3. The Multiplicand register, ALU, and Multiplier register are all 32 bits wide, with only the Product register left at 64 bits. Now the product is shifted right. The separate Multiplier register also disappeared. The multiplier is placed instead in the right half of the Product register. These changes are highlighted in color. (The Product register should really be 65 bits to hold the carry out of the adder, but it's shown here as 64 bits to highlight the evolution from Figure 3.3.)

Signed Multiplication

So far, we have dealt with positive numbers.
The easiest way to understand how to deal with signed numbers is to first convert the multiplier and multiplicand to positive numbers and then remember the original signs. The algorithms should then be run for 31 iterations, leaving the signs out of the calculation. As we learned in grammar school, we need to negate the product only if the original signs disagree.

It turns out that the last algorithm will work for signed numbers, provided that we remember that we are dealing with numbers that have infinite digits, and we are only representing them with 32 bits. Hence, the shifting steps would need to extend the sign of the product for signed numbers. When the algorithm completes, the lower word would have the 32-bit product.

Faster Multiplication

Moore's Law has provided so much more in resources that hardware designers can now build much faster multiplication hardware. Whether the multiplicand is to be added or not is known at the beginning of the multiplication by looking at each of the 32 multiplier bits. Faster multiplications are possible by essentially providing one 32-bit adder for each bit of the multiplier: one input is the multiplicand ANDed with a multiplier bit, and the other is the output of a prior adder.

A straightforward approach would be to connect the outputs of adders on the right to the inputs of adders on the left, making a stack of adders 32 high. An alternative way to organize these 32 additions is in a parallel tree, as Figure 3.7 shows. Instead of waiting for 32 add times, we wait just \(\log_2(32)\), or five, 32-bit add times.
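To see where the five add times come from, here is a minimal Python sketch of the tree organization. It is a software model, not a hardware description. It forms one partial product per multiplier bit (the multiplicand ANDed with that bit, shifted to the bit's weight) and then adds the terms pairwise, level by level. The function name `tree_multiply` and the `levels` counter are illustrative assumptions, not names from the text.

```python
def tree_multiply(multiplicand, multiplier, bits=32):
    """Multiply two unsigned `bits`-bit integers by summing the partial
    products in a balanced binary tree. Returns (product, levels)."""
    # One partial product per multiplier bit: the multiplicand if the bit
    # is 1, otherwise 0, shifted left to that bit's weight.
    terms = [
        (multiplicand if (multiplier >> i) & 1 else 0) << i
        for i in range(bits)
    ]
    levels = 0
    while len(terms) > 1:
        # The additions within one level are independent of each other,
        # so hardware could perform them all at the same time.
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])  # an odd term passes through unchanged
        terms = paired
        levels += 1
    return terms[0], levels


product, levels = tree_multiply(12345, 6789)
print(product, levels)
```

With 32 terms the list shrinks 32, 16, 8, 4, 2, 1, so the loop runs five times and performs 16 + 8 + 4 + 2 + 1 = 31 additions in total, which matches the 31 adders mentioned in the caption of Figure 3.7.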
| Iteration | Step | Multiplier | Multiplicand | Product |
|---|---|---|---|---|
| 0 | Initial values | 0011 | 0000 0010 | 0000 0000 |
| 1 | 1a: 1 ⇒ Prod = Prod + Mcand | 0011 | 0000 0010 | 0000 0010 |
| | 2: Shift left Multiplicand | 0011 | 0000 0100 | 0000 0010 |
| | 3: Shift right Multiplier | 0001 | 0000 0100 | 0000 0010 |
| 2 | 1a: 1 ⇒ Prod = Prod + Mcand | 0001 | 0000 0100 | 0000 0110 |
| | 2: Shift left Multiplicand | 0001 | 0000 1000 | 0000 0110 |
| | 3: Shift right Multiplier | 0000 | 0000 1000 | 0000 0110 |
| 3 | 1: 0 ⇒ No operation | 0000 | 0000 1000 | 0000 0110 |
| | 2: Shift left Multiplicand | 0000 | 0001 0000 | 0000 0110 |
| | 3: Shift right Multiplier | 0000 | 0001 0000 | 0000 0110 |
| 4 | 1: 0 ⇒ No operation | 0000 | 0001 0000 | 0000 0110 |
| | 2: Shift left Multiplicand | 0000 | 0010 0000 | 0000 0110 |
| | 3: Shift right Multiplier | 0000 | 0010 0000 | 0000 0110 |

FIGURE 3.6 Multiply example using algorithm in Figure 3.4. The bit examined to determine the next step is circled in color.

In fact, multiply can go even faster than five add times because of the use of carry save adders (see Section B.6 in Appendix B) and because it is easy to pipeline such a design to be able to support many multiplies simultaneously (see Chapter 4).

Multiply in MIPS

MIPS provides a separate pair of 32-bit registers to contain the 64-bit product, called Hi and Lo. To produce a properly signed or unsigned product, MIPS has two instructions: multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit product, the programmer uses move from lo (mflo). The MIPS assembler provides a pseudoinstruction for multiply that specifies three general-purpose registers; it generates mflo and mfhi instructions to place the product into registers.

Summary

Multiplication hardware simply shifts and adds, as derived from the paper-and-pencil method learned in grammar school. Compilers even use shift instructions for multiplications by powers of 2. With much more hardware we can do the adds in parallel, and do them much faster.
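The shift-and-add loop of Figure 3.4 can be sketched in a few lines of Python. This is a minimal model under stated assumptions: the function name `shift_add_multiply` and the `bits` parameter are hypothetical, and Python integers stand in for the Multiplicand, Multiplier, and Product registers, with a mask imitating the 2n-bit register width.

```python
def shift_add_multiply(multiplicand, multiplier, bits=32):
    """Unsigned multiply following the three steps of Figure 3.4."""
    mask = (1 << (2 * bits)) - 1    # Multiplicand and Product are 2*bits wide
    product = 0                     # Product register starts at 0
    for _ in range(bits):           # one repetition per multiplier bit
        if multiplier & 1:                               # step 1: test Multiplier0
            product = (product + multiplicand) & mask    # step 1a: add
        multiplicand = (multiplicand << 1) & mask        # step 2: shift left
        multiplier >>= 1                                 # step 3: shift right
    return product


# The 4-bit example of Figure 3.6: 0010two x 0011two.
print(format(shift_add_multiply(0b0010, 0b0011, bits=4), "08b"))
```

The same idea underlies the compiler optimization mentioned in the summary: in Python, as in C, `x << 3` gives the same result as `x * 8` for a nonnegative integer `x`.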
Hardware/Software Interface

Both MIPS multiply instructions ignore overflow, so it is up to the software to check to see if the product is too big to fit in 32 bits. There is no overflow if Hi is 0 for multu or the replicated sign of Lo for mult. The instruction move from hi (mfhi) can be used to transfer Hi to a general-purpose register to test for overflow.

[Figure 3.7 diagram. Inputs: Mplier31 • Mcand, Mplier30 • Mcand, Mplier29 • Mcand, Mplier28 • Mcand, ..., Mplier3 • Mcand, Mplier2 • Mcand, Mplier1 • Mcand, Mplier0 • Mcand, each 32 bits wide, feeding a tree of 32-bit adders. Outputs: Product63, Product62, ..., Product47..16, ..., Product1, Product0.]

FIGURE 3.7 Fast multiplication hardware. Rather than use a single 32-bit adder 31 times, this hardware "unrolls the loop" to use 31 adders and then organizes them to minimize delay.

3.4 Division

The reciprocal operation of multiply is divide, an operation that is even less frequent and even more quirky. It even offers the opportunity to perform a mathematically invalid operation: dividing by 0.

Let's start with an example of long division using decimal numbers to recall the names of the operands and the grammar school division algorithm. For reasons similar to those in the previous section, we limit the decimal digits to just 0 or 1. The example is dividing \(1{,}001{,}010_{ten}\) by \(1000_{ten}\):

                     1001ten    Quotient
Divisor 1000ten ) 1001010ten    Dividend
                 −1000
                     10
                     101
                     1010
                    −1000
                       10ten    Remainder

Divide's two operands, called the dividend and divisor, and the result, called the quotient, are accompanied by a second result, called the remainder. Here is another way to express the relationship between the components:

\[\text{Dividend} = \text{Quotient} \times \text{Divisor} + \text{Remainder}\]

where the remainder is smaller than the divisor. Infrequently, programs use the divide instruction just to get the remainder, ignoring the quotient.
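The relationship above is easy to check mechanically. The short Python sketch below does so for the long-division example, once with the digits read as decimal (as in the text) and once with the same digit strings read as binary; the binary reading is an assumption added here for illustration, and `check_division` is a hypothetical helper name.

```python
def check_division(dividend, divisor):
    """Return (quotient, remainder) for nonnegative operands and verify
    Dividend = Quotient x Divisor + Remainder with Remainder < Divisor."""
    quotient, remainder = divmod(dividend, divisor)
    assert dividend == quotient * divisor + remainder
    assert 0 <= remainder < divisor
    return quotient, remainder


# Digits read as decimal, as in the worked example.
print(check_division(1001010, 1000))

# The same digit strings read as binary (74 divided by 8).
q, r = check_division(0b1001010, 0b1000)
print(format(q, "b"), format(r, "b"))
```

Both readings produce the digit strings 1001 and 10 for the quotient and remainder, which is why the restricted decimal example is a faithful preview of binary division.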
The basic grammar school division algorithm tries to see how big a number can be subtracted, creating a digit of the quotient on each attempt. Our carefully selected decimal example uses only the numbers 0 and 1, so it's easy to figure out how many times the divisor goes into the portion of the dividend: it's either 0 times or 1 time. Binary numbers contain only 0 or 1, so binary division is restricted to these two choices, thereby simplifying binary division.

Let's assume that both the dividend and the divisor are positive and hence the quotient and the remainder are nonnegative. The division operands and both results are 32-bit values, and we will ignore the sign for now.

A Division Algorithm and Hardware

Figure 3.8 shows hardware to mimic our grammar school algorithm. We start with the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move the divisor to the right one digit, so we start with the divisor placed in the left half of the 64-bit Divisor register and shift it right 1 bit each step to align it with the dividend. The Remainder register is initialized with the dividend.

Divide et impera.
Latin for "Divide and rule," ancient political maxim cited by Machiavelli, 1532

dividend: A number being divided.
divisor: A number that the dividend is divided by.
quotient: The primary result of a division; a number that when multiplied by the divisor and added to the remainder produces the dividend.
remainder: The secondary result of a division; a number that when added to the product of the quotient and the divisor produces the dividend.

Figure 3.9 shows three steps of the first division algorithm. Unlike a human, the computer isn't smart enough to know in advance whether the divisor is smaller than the dividend. It must first subtract the divisor in step 1; remember that this is how we performed the comparison in the set on less than instruction.
If the result is positive, the divisor was smaller than or equal to the dividend, so we generate a 1 in the quotient (step 2a). If the result is negative, the next step is to restore the original value by adding the divisor back to the remainder and generate a 0 in the quotient (step 2b). The divisor is shifted right and then we iterate again. The remainder and quotient will be found in their namesake registers after the iterations are complete.

A Divide Algorithm

EXAMPLE
Using a 4-bit version of the algorithm to save pages, let's try dividing \(7_{ten}\) by \(2_{ten}\), or \(0000\ 0111_{two}\) by \(0010_{two}\).

ANSWER
Figure 3.10 shows the value of each register for each of the steps, with the quotient being \(3_{ten}\) and the remainder \(1_{ten}\). Notice that the test in step 2 of whether the remainder is positive or negative simply tests whether the sign bit of the Remainder register is a 0 or 1. The surprising requirement of this algorithm is that it takes n + 1 steps to get the proper quotient and remainder.

[Figure 3.8 diagram. Labeled blocks: Divisor register (64 bits, shift right); 64-bit ALU; Remainder register (64 bits, write); Quotient register (32 bits, shift left); Control test.]

FIGURE 3.8 First version of the division hardware. The Divisor register, ALU, and Remainder register are all 64 bits wide, with only the Quotient register being 32 bits. The 32-bit divisor starts in the left half of the Divisor register and is shifted right 1 bit each iteration. The remainder is initialized with the dividend. Control decides when to shift the Divisor and Quotient registers and when to write the new value into the Remainder register.

[Figure 3.9 flowchart:]
Start
1. Subtract the Divisor register from the Remainder register and place the result in the Remainder register.
Test Remainder.
2a. If Remainder ≥ 0: shift the Quotient register to the left, setting the new rightmost bit to 1.
2b. If Remainder < 0: restore the original value by adding the Divisor register to the Remainder register and placing the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
3. Shift the Divisor register right 1 bit.
33rd repetition? No (< 33 repetitions): return to step 1. Yes (33 repetitions): Done.

FIGURE 3.9 A division algorithm, using the hardware in Figure 3.8. If the remainder is positive, the divisor did go into the dividend, so step 2a generates a 1 in the quotient. A negative remainder after step 1 means that the divisor did not go into the dividend, so step 2b generates a 0 in the quotient and adds the divisor to the remainder, thereby reversing the subtraction of step 1. The final shift, in step 3, aligns the divisor properly, relative to the dividend for the next iteration. These steps are repeated 33 times.

This algorithm and hardware can be refined to be faster and cheaper. The speed-up comes from shifting the operands and the quotient simultaneously with the subtraction. This refinement halves the width of the adder and registers by noticing where there are unused portions of registers and adders. Figure 3.11 shows the revised hardware.

Signed Division

So far, we have ignored signed numbers in division. The simplest solution is to remember the signs of the divisor and dividend and then negate the quotient if the signs disagree.
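Before turning to the register trace, the unsigned algorithm of Figure 3.9 can be sketched in Python. This is a minimal model, not the hardware: the name `restoring_divide` and the `bits` parameter are assumptions, and a Python integer comparison stands in for testing the sign bit of the Remainder register.

```python
def restoring_divide(dividend, divisor, bits=32):
    """Unsigned restoring division following the steps of Figure 3.9.
    Returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    remainder = dividend           # Remainder register starts as the dividend
    divisor <<= bits               # divisor starts in the left half
    quotient = 0
    quotient_mask = (1 << bits) - 1
    for _ in range(bits + 1):      # n + 1 repetitions
        remainder -= divisor                       # step 1: subtract
        if remainder >= 0:                         # test Remainder
            quotient = ((quotient << 1) | 1) & quotient_mask   # step 2a
        else:
            remainder += divisor                   # step 2b: restore
            quotient = (quotient << 1) & quotient_mask
        divisor >>= 1                              # step 3: shift Divisor right
    return quotient, remainder


# The 4-bit example: 0000 0111two divided by 0010two.
print(restoring_divide(0b0111, 0b0010, bits=4))
```

The first repetition always produces a 0 quotient bit, because the divisor sits entirely in the upper half while the dividend occupies only the lower half. That leading 0 is shifted out of the n-bit quotient, which is one way to see why n + 1 repetitions are needed.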
| Iteration | Step | Quotient | Divisor | Remainder |
|---|---|---|---|---|
| 0 | Initial values | 0000 | 0010 0000 | 0000 0111 |
| 1 | 1: Rem = Rem – Div | 0000 | 0010 0000 | 1110 0111 |
| | 2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0 | 0000 | 0010 0000 | 0000 0111 |
| | 3: Shift Div right | 0000 | 0001 0000 | 0000 0111 |
| 2 | 1: Rem = Rem – Div | 0000 | 0001 0000 | 1111 0111 |
| | 2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0 | 0000 | 0001 0000 | 0000 0111 |
| | 3: Shift Div right | 0000 | 0000 1000 | 0000 0111 |
| 3 | 1: Rem = Rem – Div | 0000 | 0000 1000 | 1111 1111 |
| | 2b: Rem < 0 ⇒ +Div, sll Q, Q0 = 0 | 0000 | 0000 1000 | 0000 0111 |
| | 3: Shift Div right | 0000 | 0000 0100 | 0000 0111 |
| 4 | 1: Rem = Rem – Div | 0000 | 0000 0100 | 0000 0011 |
| | 2a: Rem ≥ 0 ⇒ sll Q, Q0 = 1 | 0001 | 0000 0100 | 0000 0011 |
| | 3: Shift Div right | 0001 | 0000 0010 | 0000 0011 |
| 5 | 1: Rem = Rem – Div | 0001 | 0000 0010 | 0000 0001 |
| | 2a: Rem ≥ 0 ⇒ sll Q, Q0 = 1 | 0011 | 0000 0010 | 0000 0001 |
| | 3: Shift Div right | 0011 | 0000 0001 | 0000 0001 |

FIGURE 3.10 Division example using the algorithm in Figure 3.9. The bit examined to determine the next step is circled in color.

[Figure 3.11 diagram. Labeled blocks: Divisor register (32 bits); 32-bit ALU; Remainder register (64 bits, shift left, shift right, write); Control test.]

FIGURE 3.11 An improved version of the division hardware. The Divisor register, ALU, and Quotient register are all 32 bits wide, with only the Remainder register left at 64 bits. Compared to Figure 3.8, the ALU and Divisor registers are halved and the remainder is shifted left. This version also combines the Quotient register with the right half of the Remainder register. (As in Figure 3.5, the Remainder register should really be 65 bits to make sure the carry out of the adder is not lost.)

Elaboration: The one complication of signed division is that we must also set the sign of the remainder. Remember that the following equation must always hold:

\[\text{Dividend} = \text{Quotient} \times \text{Divisor} + \text{Remainder}\]

To understand how to set the sign of the remainder, let's look at the example of dividing all the combinations of \(\pm 7_{ten}\) by \(\pm 2_{ten}\).
The first case is easy:

\[+7 \div +2: \quad \text{Quotient} = +3, \ \text{Remainder} = +1\]

Checking the results:

\[+7 = 3 \times 2 + (+1) = 6 + 1\]

If we change the sign of the dividend, the quotient must change as well:

\[-7 \div +2: \quad \text{Quotient} = -3\]

Rewriting our basic formula to calculate the remainder:

\[\text{Remainder} = (\text{Dividend} - \text{Quotient} \times \text{Divisor}) = -7 - (-3 \times +2) = -7 - (-6) = -1\]

So,

\[-7 \div +2: \quad \text{Quotient} = -3, \ \text{Remainder} = -1\]

Checking the results again:

\[-7 = -3 \times 2 + (-1) = -6 - 1\]

The reason the answer isn't a quotient of −4 and a remainder of +1, which would also fit this formula, is that the absolute value of the quotient would then change depending on the sign of the dividend and the divisor! Clearly, if \(-(x \div y) \neq (-x) \div y\), programming would be an even greater challenge. This anomalous behavior is avoided by following the rule that the dividend and remainder must have the same signs, no matter what the signs of the divisor and quotient.

We calculate the other combinations by following the same rule:

\[+7 \div -2: \quad \text{Quotient} = -3, \ \text{Remainder} = +1\]
\[-7 \div -2: \quad \text{Quotient} = +3, \ \text{Remainder} = -1\]

Thus the correctly signed division algorithm negates the quotient if the signs of the operands are opposite and makes the sign of the nonzero remainder match the dividend.

Faster Division

Moore's Law applies to division hardware as well as multiplication, so we would like to be able to speed up division by throwing hardware at it. We used many adders to speed up multiply, but we cannot do the same trick for divide. The reason is that we need to know the sign of the difference before we can perform the next step of the algorithm, whereas with multiply we could calculate the 32 partial products immediately.

There are techniques to produce more than one bit of the quotient per step. The SRT division technique tries to predict several quotient bits per step, using a table lookup based on the upper bits of the dividend and remainder. It relies on subsequent steps to correct wrong predictions.
A typical value today is 4 bits. The key is guessing the value to subtract. With binary division, there is only a single choice. These algorithms use 6 bits from the remainder and 4 bits from the divisor to index a table that determines the guess for each step.

The accuracy of this fast method depends on having proper values in the lookup table. The fallacy on page 231 in Section 3.9 shows what can happen if the table is incorrect.

Divide in MIPS

You may have already observed that the same sequential hardware can be used for both multiply and divide in Figures 3.5 and 3.11. The only requirement is a 64-bit register that can shift left or right and a 32-bit ALU that adds or subtracts. Hence, MIPS uses the 32-bit Hi and 32-bit Lo registers for both multiply and divide.

As we might expect from the algorithm above, Hi contains the remainder, and Lo contains the quotient after the divide instruction completes.

To handle both signed integers and unsigned integers, MIPS has two instructions: divide (div) and divide unsigned (divu). The MIPS assembler allows divide instructions to specify three registers, generating the mflo or mfhi instructions to place the desired result into a general-purpose register.

Summary

The common hardware support for multiply and divide allows MIPS to provide a single pair of 32-bit registers that are used both for multiply and divide. We accelerate division by predicting multiple quotient bits and then correcting mispredictions later. Figure 3.12 summarizes the enhancements to the MIPS architecture for the last two sections.
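The sign rule for division described earlier in this section (negate the quotient when the operand signs differ, and give the remainder the sign of the dividend) can be modeled in Python as follows. This is a sketch of the arithmetic rule only; it ignores the 32-bit register width, and the name `signed_divide` is an assumption. Python's own `//` and `%` operators round toward negative infinity, so they are applied to the magnitudes rather than to the signed operands.

```python
def signed_divide(dividend, divisor):
    """Return (quotient, remainder): the quotient is negated when the
    operand signs differ, and the remainder follows the dividend's sign."""
    if divisor == 0:
        raise ZeroDivisionError("division by 0")
    # Divide the magnitudes, then fix up the signs afterward.
    quotient, remainder = divmod(abs(dividend), abs(divisor))
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    if dividend < 0:
        remainder = -remainder
    return quotient, remainder


for dividend in (7, -7):
    for divisor in (2, -2):
        q, r = signed_divide(dividend, divisor)
        # The defining relationship holds in all four sign combinations.
        assert dividend == q * divisor + r
        print(dividend, divisor, q, r)
```

For comparison, Python evaluates `-7 // 2` as −4 and `-7 % 2` as 1. That is exactly the alternative answer the elaboration rejects, because the magnitude of the quotient then depends on the operand signs.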
MIPS assembly language

| Category | Instruction | Example | Meaning | Comments |
|---|---|---|---|---|
| Arithmetic | add | add $s1,$s2,$s3 | $s1 = $s2 + $s3 | Three operands; overflow detected |
| | subtract | sub $s1,$s2,$s3 | $s1 = $s2 – $s3 | Three operands; overflow detected |
| | add immediate | addi $s1,$s2,100 | $s1 = $s2 + 100 | + constant; overflow detected |
| | add unsigned | addu $s1,$s2,$s3 | $s1 = $s2 + $s3 | Three operands; overflow undetected |
| | subtract unsigned | subu $s1,$s2,$s3 | $s1 = $s2 – $s3 | Three operands; overflow undetected |
| | add immediate unsigned | addiu $s1,$s2,100 | $s1 = $s2 + 100 | + constant; overflow undetected |
| | move from coprocessor register | mfc0 $s1,$epc | $s1 = $epc | Copy Exception PC + special regs |
| | multiply | mult $s2,$s3 | Hi, Lo = $s2 × $s3 | 64-bit signed product in Hi, Lo |
| | multiply unsigned | multu $s2,$s3 | Hi, Lo = $s2 × $s3 | 64-bit unsigned product in Hi, Lo |
| | divide | div $s2,$s3 | Lo = $s2 / $s3, Hi = $s2 mod $s3 | Lo = quotient, Hi = remainder |
| | divide unsigned | divu $s2,$s3 | Lo = $s2 / $s3, Hi = $s2 mod $s3 | Unsigned quotient and remainder |
| | move from Hi | mfhi $s1 | $s1 = Hi | Used to get copy of Hi |
| | move from Lo | mflo $s1 | $s1 = Lo | Used to get copy of Lo |
| Data transfer | load word | lw $s1,20($s2) | $s1 = Memory[$s2 + 20] | Word from memory to register |
| | store word | sw $s1,20($s2) | Memory[$s2 + 20] = $s1 | Word from register to memory |
| | load half unsigned | lhu $s1,20($s2) | $s1 = Memory[$s2 + 20] | Halfword memory to register |
| | store half | sh $s1,20($s2) | Memory[$s2 + 20] = $s1 | Halfword register to memory |
| | load byte unsigned | lbu $s1,20($s2) | $s1 = Memory[$s2 + 20] | Byte from memory to register |
| | store byte | sb $s1,20($s2) | Memory[$s2 + 20] = $s1 | Byte from register to memory |
| | load linked word | ll $s1,20($s2) | $s1 = Memory[$s2 + 20] | Load word as 1st half of atomic swap |
| | store conditional word | sc $s1,20($s2) | Memory[$s2 + 20] = $s1; $s1 = 0 or 1 | Store word as 2nd half of atomic swap |
| | load upper immediate | lui $s1,100 | $s1 = 100 × 2^16 | Loads constant in upper 16 bits |
| Logical | AND | AND $s1,$s2,$s3 | $s1 = $s2 & $s3 | Three reg. operands; bit-by-bit AND |
| | OR | OR $s1,$s2,$s3 | $s1 = $s2 \| $s3 | Three reg. operands; bit-by-bit OR |
| | NOR | NOR $s1,$s2,$s3 | $s1 = ~($s2 \| $s3) | Three reg. operands; bit-by-bit NOR |
| | AND immediate | ANDi $s1,$s2,100 | $s1 = $s2 & 100 | Bit-by-bit AND with constant |
| | OR immediate | ORi $s1,$s2,100 | $s1 = $s2 \| 100 | Bit-by-bit OR with constant |
| | shift left logical | sll $s1,$s2,10 | $s1 = $s2 << 10 | Shift left by constant |
| | shift right logical | srl $s1,$s2,10 | $s1 = $s2 >> 10 | Shift right by constant |
| Conditional branch | branch on equal | beq $s1,$s2,25 | if ($s1 == $s2) go to PC + 4 + 100 | Equal test; PC-relative branch |
| | branch on not equal | bne $s1,$s2,25 | if ($s1 != $s2) go to PC + 4 + 100 | Not equal test; PC-relative |
| | set on less than | slt $s1,$s2,$s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 | Compare less than; two's complement |
| | set less than immediate | slti $s1,$s2,100 | if ($s2 < 100) $s1 = 1; else $s1 = 0 | Compare < constant; two's complement |
| | set less than unsigned | sltu $s1,$s2,$s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 | Compare less than; natural numbers |
| | set less than immediate unsigned | sltiu $s1,$s2,100 | if ($s2 < 100) $s1 = 1; else $s1 = 0 | Compare < constant; natural numbers |
| Unconditional jump | jump | j 2500 | go to 10000 | Jump to target address |
| | jump register | jr $ra | go to $ra | For switch, procedure return |
| | jump and link | jal 2500 | $ra = PC + 4; go to 10000 | For procedure call |

FIGURE 3.12 MIPS core architecture. The memory and registers of the MIPS architecture are not included for space reasons, but this section added the Hi and Lo registers to support multiply and divide. MIPS machine language is listed in the MIPS Reference Data Card at the front of this book.

Hardware/Software Interface

MIPS divide instructions ignore overflow, so software must determine whether the quotient is too large. In addition to overflow, division can also result in an improper calculation: division by 0. Some computers distinguish these two anomalous events. MIPS software must check the divisor to discover division by 0 as well as overflow.

Elaboration: An even faster algorithm does not immediately add the divisor back if the remainder is negative. It simply adds the divisor to the shifted remainder in the following step, since \((r + d) \times 2 - d = r \times 2 + d \times 2 - d = r \times 2 + d\), where \(r\) is the negative remainder and \(d\) is the divisor. This nonrestoring division algorithm, which takes 1 clock cycle per step, is explored further in the exercises; the algorithm above is called restoring division. A third algorithm that doesn't save the result of the subtract if it's negative is called a nonperforming division algorithm. It averages one-third fewer arithmetic operations.
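The software checks that the Hardware/Software Interface note above calls for can be sketched as follows. This is a Python model of the logic, not MIPS code, and `checked_signed_div` is a hypothetical name. It relies on one fact about 32-bit two's complement arithmetic: the only signed quotient that cannot be represented is \(-2^{31} \div -1\), since \(+2^{31}\) is one more than the largest 32-bit signed value.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def checked_signed_div(dividend, divisor):
    """Quotient of two 32-bit signed integers, truncated toward zero,
    with the two anomalous cases rejected before dividing."""
    for operand in (dividend, divisor):
        if not INT32_MIN <= operand <= INT32_MAX:
            raise ValueError("operands must fit in 32 signed bits")
    if divisor == 0:
        raise ZeroDivisionError("division by 0")
    if dividend == INT32_MIN and divisor == -1:
        # The true quotient, +2**31, exceeds INT32_MAX.
        raise OverflowError("quotient does not fit in 32 bits")
    quotient = abs(dividend) // abs(divisor)
    return -quotient if (dividend < 0) != (divisor < 0) else quotient


print(checked_signed_div(-7, 2))
```

Raising two different exception types mirrors the remark that some computers distinguish the two events; a real MIPS routine would branch on the same two tests before issuing the divide.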
3.5 Floating Point

Speed gets you nowhere if you're headed the wrong way.
American proverb

Going beyond signed and unsigned integers, programming languages support numbers with fractions, which are called reals in mathematics. Here are some examples of reals:

\(3.14159265\ldots_{ten}\) (pi)
\(2.71828\ldots_{ten}\) (e)
\(0.000000001_{ten}\) or \(1.0_{ten} \times 10^{-9}\) (seconds in a nanosecond)
\(3{,}155{,}760{,}000_{ten}\) or \(3.15576_{ten} \times 10^{9}\) (seconds in a typical century)

Notice that in the last case, the number didn't represent a small fraction, but it was bigger than we could represent with a 32-bit signed integer. The alternative notation for the last two numbers is called scientific notation, which has a single digit to the left of the decimal point. A number in scientific notation that has no leading 0s is called a normalized number, which is the usual way to write it. For example, \(1.0_{ten} \times 10^{-9}\) is in normalized scientific notation, but \(0.1_{ten} \times 10^{-8}\) and \(10.0_{ten} \times 10^{-10}\) are not.

scientific notation: A notation that renders numbers with a single digit to the left of the decimal point.
normalized: A number in floating-point notation that has no leading 0s.

Just as we can show decimal numbers in scientific notation, we can also show binary numbers in scientific notation:

\[1.0_{two} \times 2^{-1}\]

To keep a binary number in normalized form, we need a base that we can increase or decrease by exactly the number of bits the number must be shifted to have one nonzero digit to the left of the decimal point. Only a base of 2 fulfills our need. Since the base is not 10, we also need a new name for decimal point; binary point will do fine.

Computer arithmetic that supports such numbers is called floating point because it represents numbers in which the binary point is not fixed, as it is for integers. The programming language C uses the name float for such numbers.
Just as in scientific notation, numbers are represented as a single nonzero digit to the left of the binary point. In binary, the form is

\[1.xxxxxxxxx_{two} \times 2^{yyyy}\]

(Although the computer represents the exponent in base 2 as well as the rest of the number, to simplify the notation we show the exponent in decimal.)

A standard scientific notation for reals in normalized form offers three advantages. It simplifies exchange of data that includes floating-point numbers; it simplifies the floating-point arithmetic algorithms to know that numbers will always be in this form; and it increases the accuracy of the numbers that can be stored in a word, since the unnecessary leading 0s are replaced by real digits to the right of the binary point.

Floating-Point Representation

A designer of a floating-point representation must find a compromise between the size of the fraction and the size of the exponent, because a fixed word size means you must take a bit from one to add a bit to the other. This tradeoff is between precision and range: increasing the size of the fraction enhances the precision of the fraction, while increasing the size of the exponent increases the range of numbers that can be represented. As our design guideline from Chapter 2 reminds us, good design demands good compromise.

Floating-point numbers are usually a multiple of the size of a word. The representation of a MIPS floating-point number is shown below, where s is the sign of the floating-point number (1 meaning negative), exponent is the value of the 8-bit exponent field (including the sign of the exponent), and fraction is the 23-bit number. As we recall from Chapter 2, this representation is sign and magnitude, since the sign is a separate bit from the rest of the number.
| Bit 31 | Bits 30–23 | Bits 22–0 |
|---|---|---|
| s | exponent | fraction |
| 1 bit | 8 bits | 23 bits |

In general, floating-point numbers are of the form

\[(-1)^S \times F \times 2^E\]

F involves the value in the fraction field and E involves the value in the exponent field; the exact relationship to these fields will be spelled out soon. (We will shortly see that MIPS does something slightly more sophisticated.)

floating point: Computer arithmetic that represents numbers in which the binary point is not fixed.
fraction: The value, generally between 0 and 1, placed in the fraction field. The fraction is also called the mantissa.
exponent: In the numerical representation system of floating-point arithmetic, the value that is placed in the exponent field.

These chosen sizes of exponent and fraction give MIPS computer arithmetic an extraordinary range. Fractions almost as small as \(2.0_{ten} \times 10^{-38}\) and numbers almost as large as \(2.0_{ten} \times 10^{38}\) can be represented in a computer. Alas, extraordinary differs from infinite, so it is still possible for numbers to be too large. Thus, overflow interrupts can occur in floating-point arithmetic as well as in integer arithmetic. Notice that overflow here means that the exponent is too large to be represented in the exponent field.

Floating point offers a new kind of exceptional event as well. Just as programmers will want to know when they have calculated a number that is too large to be represented, they will want to know if the nonzero fraction they are calculating has become so small that it cannot be represented; either event could result in a program giving incorrect answers. To distinguish it from overflow, we call this event underflow. This situation occurs when the negative exponent is too large to fit in the exponent field.

One way to reduce chances of underflow or overflow is to offer another format that has a larger exponent.
In C this number is called double, and operations on doubles are called double precision floating-point arithmetic; single precision floating point is the name of the earlier format.

The representation of a double precision floating-point number takes two MIPS words, as shown below, where s is still the sign of the number, exponent is the value of the 11-bit exponent field, and fraction is the 52-bit number in the fraction field.

overflow (floating-point): A situation in which a positive exponent becomes too large to fit in the exponent field.
underflow (floating-point): A situation in which a negative exponent becomes too large to fit in the exponent field.
double precision: A floating-point value represented in two 32-bit words.
single precision: A floating-point value represented in a single 32-bit word.

MIPS double precision allows numbers almost as small as \(2.0_{ten} \times 10^{-308}\) and almost as large as \(2.0_{ten} \times 10^{308}\). Although double precision does increase the exponent range, its primary advantage is its greater precision because of the much larger fraction.

These formats go beyond MIPS. They are part of the IEEE 754 floating-point standard, found in virtually every computer invented since 1980. This standard has greatly improved both the ease of porting floating-point programs and the quality of computer arithmetic.

To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and a 23-bit fraction), and 53 bits long in double precision (1 + 52). To be precise, we use the term significand to represent the 24- or 53-bit number that is 1 plus the fraction, and fraction when we mean the 23- or 52-bit number. Since 0 has no leading 1, it is given the reserved exponent value 0 so that the hardware won't attach a leading 1 to it.
[Double precision layout, two 32-bit words:]

| Bit 31 | Bits 30–20 | Bits 19–0 |
|---|---|---|
| s | exponent | fraction |
| 1 bit | 11 bits | 20 bits |

| Bits 31–0 (second word) |
|---|
| fraction (continued) |
| 32 bits |

Thus \(00 \ldots 00_{two}\) represents 0; the representation of the rest of the numbers uses the form from before with the hidden 1 added:

\[(-1)^S \times (1 + \text{Fraction}) \times 2^E\]

where the bits of the fraction represent a number between 0 and 1 and E specifies the value in the exponent field, to be given in detail shortly. If we number the bits of the fraction from left to right s1, s2, s3, …, then the value is

\[(-1)^S \times (1 + (s1 \times 2^{-1}) + (s2 \times 2^{-2}) + (s3 \times 2^{-3}) + (s4 \times 2^{-4}) + \ldots) \times 2^E\]

Figure 3.13 shows the encodings of IEEE 754 floating-point numbers. Other features of IEEE 754 are special symbols to represent unusual events. For example, instead of interrupting on a divide by 0, software can set the result to a bit pattern representing +∞ or −∞; the largest exponent is reserved for these special symbols. When the programmer prints the results, the program will print an infinity symbol. (For the mathematically trained, the purpose of infinity is to form topological closure of the reals.)

IEEE 754 even has a symbol for the result of invalid operations, such as 0/0 or subtracting infinity from infinity. This symbol is NaN, for Not a Number. The purpose of NaNs is to allow programmers to postpone some tests and decisions to a later time in the program when they are convenient.

The designers of IEEE 754 also wanted a floating-point representation that could be easily processed by integer comparisons, especially for sorting. This desire is why the sign is in the most significant bit, allowing a quick test of less than, greater than, or equal to 0. (It's a little more complicated than a simple integer sort, since this notation is essentially sign and magnitude rather than two's complement.)
Placing the exponent before the significand also simplifies the sorting of floating-point numbers using integer comparison instructions, since numbers with bigger exponents look larger than numbers with smaller exponents, as long as both exponents have the same sign.

| Single precision exponent | Single precision fraction | Double precision exponent | Double precision fraction | Object represented |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | Nonzero | 0 | Nonzero | ± denormalized number |
| 1–254 | Anything | 1–2046 | Anything | ± floating-point number |
| 255 | 0 | 2047 | 0 | ± infinity |
| 255 | Nonzero | 2047 | Nonzero | NaN (Not a Number) |

FIGURE 3.13 IEEE 754 encoding of floating-point numbers. A separate sign bit determines the sign. Denormalized numbers are described in the Elaboration on page 222. This information is also found in Column 4 of the MIPS Reference Data Card at the front of this book.

Negative exponents pose a challenge to simplified sorting. If we use two's complement or any other notation in which negative exponents have a 1 in the most significant bit of the exponent field, a negative exponent will look like a big number. For example, \(1.0_{two} \times 2^{-1}\) would be represented as

| s (bit 31) | exponent (bits 30–23) | fraction (bits 22–0) |
|---|---|---|
| 0 | 1111 1111 | 0000 0000 0000 0000 0000 000 |

(Remember that the leading 1 is implicit in the significand.) The value \(1.0_{two} \times 2^{+1}\) would look like the smaller binary number

| s (bit 31) | exponent (bits 30–23) | fraction (bits 22–0) |
|---|---|---|
| 0 | 0000 0001 | 0000 0000 0000 0000 0000 000 |

The desirable notation must therefore represent the most negative exponent as \(00 \ldots 00_{two}\) and the most positive as \(11 \ldots 11_{two}\). This convention is called biased notation, with the bias being the number subtracted from the normal, unsigned representation to determine the real value.
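The sorting property that motivates biased notation can be observed directly. The Python sketch below uses the standard `struct` module to obtain the 32-bit single precision pattern of a value; the helper names `float_bits` and `exponent_field` are assumptions made for illustration. It covers nonnegative values only, since, as noted earlier, the sign-and-magnitude format makes negative numbers slightly more complicated.

```python
import struct


def float_bits(x):
    """Return the 32-bit IEEE 754 single precision pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def exponent_field(x):
    """Return the 8-bit biased exponent field of x (bits 30-23)."""
    return (float_bits(x) >> 23) & 0xFF


# 1.0two x 2^-1 and 1.0two x 2^+1: the negative exponent now has the
# smaller field value, so the bit patterns compare in the right order.
print(exponent_field(0.5), exponent_field(2.0))
print(hex(float_bits(0.5)), hex(float_bits(2.0)))

# For nonnegative values, sorting by the integer bit pattern gives the
# same order as sorting by numeric value.
values = [0.5, 2.0, 1.0, 0.75, 1.0e-3, 3.0e10]
assert sorted(values) == sorted(values, key=float_bits)
```

With a two's complement exponent, the pattern for 0.5 would have compared as larger than the pattern for 2.0, which is the problem the two tables above illustrate.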
IEEE 754 uses a bias of 127 for single precision, so an exponent of −1 is represented by the bit pattern of the value \(-1 + 127_{\text{ten}}\), or \(126_{\text{ten}} = 0111\ 1110_{\text{two}}\), and +1 is represented by \(1 + 127\), or \(128_{\text{ten}} = 1000\ 0000_{\text{two}}\). The exponent bias for double precision is 1023. Biased exponent means that the value represented by a floating-point number is really

\[ (-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})} \]

The range of single precision numbers is then from as small as

\[ \pm 1.00000000000000000000000_{\text{two}} \times 2^{-126} \]

to as large as

\[ \pm 1.11111111111111111111111_{\text{two}} \times 2^{+127} \]

Let’s demonstrate.

Floating-Point Representation

Show the IEEE 754 binary representation of the number \(-0.75_{\text{ten}}\) in single and double precision.

The number \(-0.75_{\text{ten}}\) is also \(-3/4_{\text{ten}}\) or \(-3/2^{2}_{\text{ten}}\). It is also represented by the binary fraction \(-11_{\text{two}}/2^{2}_{\text{ten}}\) or \(-0.11_{\text{two}}\). In scientific notation, the value is \(-0.11_{\text{two}} \times 2^{0}\), and in normalized scientific notation, it is \(-1.1_{\text{two}} \times 2^{-1}\).

The general representation for a single precision number is

\[ (-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - 127)} \]

Adding the bias 127 to the exponent of \(-1.1_{\text{two}} \times 2^{-1}\) gives an exponent field of 126, so the number is

\[ (-1)^{1} \times (1 + .1000\ 0000\ 0000\ 0000\ 0000\ 000_{\text{two}}) \times 2^{(126 - 127)} \]

The single precision binary representation of \(-0.75_{\text{ten}}\) is then

Bit 31 (sign, 1 bit): 1
Bits 30–23 (exponent, 8 bits): 0111 1110
Bits 22–0 (fraction, 23 bits): 100 0000 0000 0000 0000 0000

The double precision representation is

\[ (-1)^{1} \times (1 + .1000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000_{\text{two}}) \times 2^{(1022 - 1023)} \]

First word, bit 31 (sign, 1 bit): 1
First word, bits 30–20 (exponent, 11 bits): 011 1111 1110
First word, bits 19–0 (fraction, 20 bits): 1000 0000 0000 0000 0000
Second word (fraction continued, 32 bits): 0000 0000 0000 0000 0000 0000 0000 0000

Now let’s try going the other direction.
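Before converting in the other direction, the encoding of \(-0.75_{\text{ten}}\) worked out above can be checked mechanically. The following C sketch is an illustration rather than part of the original example. It assumes that float and double on the host machine are IEEE 754 single and double precision, and the helper names (float_bits, double_bits, sign_of, exponent_field, fraction_field) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 32 bits of a float into an unsigned integer.
   Assumption: float is IEEE 754 single precision on this host. */
uint32_t float_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Same idea for a 64-bit double (assumed IEEE 754 double precision). */
uint64_t double_bits(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Single precision fields: 1 sign bit, 8 exponent bits, 23 fraction bits. */
uint32_t sign_of(uint32_t bits)        { return bits >> 31; }
uint32_t exponent_field(uint32_t bits) { return (bits >> 23) & 0xFFu; }
uint32_t fraction_field(uint32_t bits) { return bits & 0x7FFFFFu; }
```

Under those assumptions, float_bits(-0.75f) yields the pattern 1 0111 1110 100…0 (hexadecimal BF400000), so the sign is 1, the exponent field is 126, and only the leftmost fraction bit is set, in agreement with the hand derivation.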
Converting Binary to Decimal Floating Point

What decimal number is represented by this single precision float?

Bit 31 (sign): 1
Bits 30–23 (exponent): 1000 0001
Bits 22–0 (fraction): 010 0000 0000 0000 0000 0000

The sign bit is 1, the exponent field contains 129, and the fraction field contains \(1 \times 2^{-2} = 1/4\), or 0.25. Using the basic equation,

\[
\begin{aligned}
(-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})} &= (-1)^{1} \times (1 + 0.25) \times 2^{(129 - 127)} \\
&= -1 \times 1.25 \times 2^{2} \\
&= -1.25 \times 4 \\
&= -5.0
\end{aligned}
\]

In the next few subsections, we will give the algorithms for floating-point addition and multiplication. At their core, they use the corresponding integer operations on the significands, but extra bookkeeping is necessary to handle the exponents and normalize the result. We first give an intuitive derivation of the algorithms in decimal and then give a more detailed, binary version in the figures.

Elaboration: Following IEEE guidelines, the IEEE 754 committee was reformed 20 years after the standard to see what changes, if any, should be made. The revised standard IEEE 754-2008 includes nearly all of IEEE 754-1985 and adds a 16-bit format (“half precision”) and a 128-bit format (“quadruple precision”). No hardware has yet been built that supports quadruple precision, but it will surely come. The revised standard also adds decimal floating-point arithmetic, which IBM mainframes have implemented.

Elaboration: In an attempt to increase range without removing bits from the significand, some computers before the IEEE 754 standard used a base other than 2. For example, the IBM 360 and 370 mainframe computers use base 16. Since changing the IBM exponent by one means shifting the significand by 4 bits, “normalized” base 16 numbers can have up to 3 leading bits of 0s! Hence, hexadecimal digits mean that up to 3 bits must be dropped from the significand, which leads to surprising problems in the accuracy of floating-point arithmetic.
IBM mainframes now support IEEE 754 as well as the hex format.

Floating-Point Addition

Let’s add numbers in scientific notation by hand to illustrate the problems in floating-point addition: \(9.999_{\text{ten}} \times 10^{1} + 1.610_{\text{ten}} \times 10^{-1}\). Assume that we can store only four decimal digits of the significand and two decimal digits of the exponent.

Step 1. To be able to add these numbers properly, we must align the decimal point of the number that has the smaller exponent. Hence, we need a form of the smaller number, \(1.610_{\text{ten}} \times 10^{-1}\), that matches the larger exponent. We obtain this by observing that there are multiple representations of an unnormalized floating-point number in scientific notation:

\[ 1.610_{\text{ten}} \times 10^{-1} = 0.1610_{\text{ten}} \times 10^{0} = 0.01610_{\text{ten}} \times 10^{1} \]

The number on the right is the version we desire, since its exponent matches the exponent of the larger number, \(9.999_{\text{ten}} \times 10^{1}\). Thus, the first step shifts the significand of the smaller number to the right until its corrected exponent matches that of the larger number. But we can represent only four decimal digits so, after shifting, the number is really

\[ 0.016_{\text{ten}} \times 10^{1} \]

Step 2. Next comes the addition of the significands:

     9.999ten
  +  0.016ten
    10.015ten

The sum is \(10.015_{\text{ten}} \times 10^{1}\).

Step 3. This sum is not in normalized scientific notation, so we need to adjust it:

\[ 10.015_{\text{ten}} \times 10^{1} = 1.0015_{\text{ten}} \times 10^{2} \]

Thus, after the addition we may have to shift the sum to put it into normalized form, adjusting the exponent appropriately. This example shows shifting to the right, but if one number were positive and the other were negative, it would be possible for the sum to have many leading 0s, requiring left shifts. Whenever the exponent is increased or decreased, we must check for overflow or underflow; that is, we must make sure that the exponent still fits in its field.

Step 4. Since we assumed that the significand can be only four digits long (excluding the sign), we must round the number.
In our grammar school algorithm, the rules truncate the number if the digit to the right of the desired point is between 0 and 4 and add 1 to the digit if the number to the right is between 5 and 9. The number

\[ 1.0015_{\text{ten}} \times 10^{2} \]

is rounded to four digits in the significand to

\[ 1.002_{\text{ten}} \times 10^{2} \]

since the fourth digit to the right of the decimal point was between 5 and 9. Notice that if we have bad luck on rounding, such as adding 1 to a string of 9s, the sum may no longer be normalized and we would need to perform step 3 again.

Figure 3.14 shows the algorithm for binary floating-point addition that follows this decimal example. Steps 1 and 2 are similar to the example just discussed: adjust the significand of the number with the smaller exponent and then add the two significands. Step 3 normalizes the results, forcing a check for overflow or underflow. The test for overflow and underflow in step 3 depends on the precision of the operands. Recall that the pattern of all 0 bits in the exponent is reserved and used for the floating-point representation of zero. Moreover, the pattern of all 1 bits in the exponent is reserved for indicating values and situations outside the scope of normal floating-point numbers (see the Elaboration on page 222). For the example below, remember that for single precision, the maximum exponent is 127, and the minimum exponent is −126.

Binary Floating-Point Addition

Try adding the numbers \(0.5_{\text{ten}}\) and \(-0.4375_{\text{ten}}\) in binary using the algorithm in Figure 3.14.

Let’s first look at the binary version of the two numbers in normalized scientific notation, assuming that we keep 4 bits of precision:

\[
\begin{aligned}
0.5_{\text{ten}} &= 1/2_{\text{ten}} = 1/2^{1}_{\text{ten}} = 0.1_{\text{two}} = 0.1_{\text{two}} \times 2^{0} = 1.000_{\text{two}} \times 2^{-1} \\
-0.4375_{\text{ten}} &= -7/16_{\text{ten}} = -7/2^{4}_{\text{ten}} = -0.0111_{\text{two}} = -0.0111_{\text{two}} \times 2^{0} = -1.110_{\text{two}} \times 2^{-2}
\end{aligned}
\]

Now we follow the algorithm:

Step 1.
The significand of the number with the lesser exponent (\(-1.11_{\text{two}} \times 2^{-2}\)) is shifted right until its exponent matches the larger number:

\[ -1.110_{\text{two}} \times 2^{-2} = -0.111_{\text{two}} \times 2^{-1} \]

Step 2. Add the significands:

\[ 1.000_{\text{two}} \times 2^{-1} + (-0.111_{\text{two}} \times 2^{-1}) = 0.001_{\text{two}} \times 2^{-1} \]

Step 3. Normalize the sum, checking for overflow or underflow:

\[ 0.001_{\text{two}} \times 2^{-1} = 0.010_{\text{two}} \times 2^{-2} = 0.100_{\text{two}} \times 2^{-3} = 1.000_{\text{two}} \times 2^{-4} \]

Since \(127 \ge -4 \ge -126\), there is no overflow or underflow. (The biased exponent would be −4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)

Step 4. Round the sum:

\[ 1.000_{\text{two}} \times 2^{-4} \]

The sum already fits exactly in 4 bits, so there is no change to the bits due to rounding.

This sum is then

\[ 1.000_{\text{two}} \times 2^{-4} = 0.0001000_{\text{two}} = 0.0001_{\text{two}} = 1/2^{4}_{\text{ten}} = 1/16_{\text{ten}} = 0.0625_{\text{ten}} \]

This sum is what we would expect from adding \(0.5_{\text{ten}}\) to \(-0.4375_{\text{ten}}\).

[Figure 3.14 flowchart:
Start
1. Compare the exponents of the two numbers; shift the smaller number to the right until its exponent would match the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
Overflow or underflow? Yes: Exception. No: go to step 4
4. Round the significand to the appropriate number of bits
Still normalized? No: go back to step 3. Yes: Done]

FIGURE 3.14 Floating-point addition. The normal path is to execute steps 3 and 4 once, but if rounding causes the sum to be unnormalized, we must repeat step 3.

Many computers dedicate hardware to run floating-point operations as fast as possible. Figure 3.15 sketches the basic organization of hardware for floating-point addition.

Floating-Point Multiplication

Now that we have explained floating-point addition, let’s try floating-point multiplication. We start by multiplying decimal numbers in scientific notation by hand: \(1.110_{\text{ten}} \times 10^{10} \times 9.200_{\text{ten}} \times 10^{-5}\).
Assume that we can store only four digits of the significand and two digits of the exponent.

Step 1. Unlike addition, we calculate the exponent of the product by simply adding the exponents of the operands together:

\[ \text{New exponent} = 10 + (-5) = 5 \]

Let’s do this with the biased exponents as well to make sure we obtain the same result: 10 + 127 = 137, and −5 + 127 = 122, so

\[ \text{New exponent} = 137 + 122 = 259 \]

This result is too large for the 8-bit exponent field, so something is amiss! The problem is with the bias because we are adding the biases as well as the exponents:

\[ \text{New exponent} = (10 + 127) + (-5 + 127) = (5 + 2 \times 127) = 259 \]

Accordingly, to get the correct biased sum when we add biased numbers, we must subtract the bias from the sum:

\[ \text{New exponent} = 137 + 122 - 127 = 259 - 127 = 132 = (5 + 127) \]

and 5 is indeed the exponent we calculated initially.

[Figure 3.15 block diagram. Blocks, from top to bottom: small ALU (compare exponents, exponent difference) and control; three multiplexors; shift right (shift smaller number right); big ALU (add); increment or decrement, shift left or right (normalize); rounding hardware (round).]

FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. The steps of Figure 3.14 correspond to each block, from top to bottom. First, the exponent of one operand is subtracted from the other using the small ALU to determine which is larger and by how much. This difference controls the three multiplexors; from left to right, they select the larger exponent, the significand of the smaller number, and the significand of the larger number. The smaller significand is shifted right, and then the significands are added together using the big ALU. The normalization step then shifts the sum left or right and increments or decrements the exponent. Rounding then creates the final result, which may require normalizing again to produce the actual final result.
Step 2. Next comes the multiplication of the significands:

        1.110ten
     ×  9.200ten
          0000
         0000
        2220
       9990
      10212000ten

There are three digits to the right of the decimal point for each operand, so the decimal point is placed six digits from the right in the product significand:

\[ 10.212000_{\text{ten}} \]

Assuming that we can keep only three digits to the right of the decimal point, the product is \(10.212_{\text{ten}} \times 10^{5}\).

Step 3. This product is unnormalized, so we need to normalize it:

\[ 10.212_{\text{ten}} \times 10^{5} = 1.0212_{\text{ten}} \times 10^{6} \]

Thus, after the multiplication, the product can be shifted right one digit to put it in normalized form, adding 1 to the exponent. At this point, we can check for overflow and underflow. Underflow may occur if both operands are small, that is, if both have large negative exponents.

Step 4. We assumed that the significand is only four digits long (excluding the sign), so we must round the number. The number

\[ 1.0212_{\text{ten}} \times 10^{6} \]

is rounded to four digits in the significand to

\[ 1.021_{\text{ten}} \times 10^{6} \]

Step 5. The sign of the product depends on the signs of the original operands. If they are both the same, the sign is positive; otherwise, it’s negative. Hence, the product is

\[ +1.021_{\text{ten}} \times 10^{6} \]

The sign of the sum in the addition algorithm was determined by addition of the significands, but in multiplication, the sign of the product is determined by the signs of the operands.

[Figure 3.16 flowchart:
Start
1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent
2. Multiply the significands
3. Normalize the product if necessary, shifting it right and incrementing the exponent
Overflow or underflow? Yes: Exception. No: go to step 4
4. Round the significand to the appropriate number of bits
Still normalized? No: go back to step 3. Yes: go to step 5
5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ make the sign negative
Done]

FIGURE 3.16 Floating-point multiplication.
The normal path is to execute steps 3 and 4 once, but if rounding causes the product to be unnormalized, we must repeat step 3.

Once again, as Figure 3.16 shows, multiplication of binary floating-point numbers is quite similar to the steps we have just completed. We start with calculating the new exponent of the product by adding the biased exponents, being sure to subtract one bias to get the proper result. Next is multiplication of significands, followed by an optional normalization step. The size of the exponent is checked for overflow or underflow, and then the product is rounded. If rounding leads to further normalization, we once again check for exponent size. Finally, set the sign bit to 1 if the signs of the operands were different (negative product) or to 0 if they were the same (positive product).

Binary Floating-Point Multiplication

Let’s try multiplying the numbers \(0.5_{\text{ten}}\) and \(-0.4375_{\text{ten}}\), using the steps in Figure 3.16.

In binary, the task is multiplying \(1.000_{\text{two}} \times 2^{-1}\) by \(-1.110_{\text{two}} \times 2^{-2}\).

Step 1. Adding the exponents without bias:

\[ -1 + (-2) = -3 \]

or, using the biased representation:

\[ (-1 + 127) + (-2 + 127) - 127 = (-1 - 2) + (127 + 127 - 127) = -3 + 127 = 124 \]

Step 2. Multiplying the significands:

        1.000two
     ×  1.110two
          0000
         1000
        1000
       1000
       1110000two

The product is \(1.110000_{\text{two}} \times 2^{-3}\), but we need to keep it to 4 bits, so it is \(1.110_{\text{two}} \times 2^{-3}\).

Step 3. Now we check the product to make sure it is normalized, and then check the exponent for overflow or underflow. The product is already normalized and, since \(127 \ge -3 \ge -126\), there is no overflow or underflow. (Using the biased representation, \(254 \ge 124 \ge 1\), so the exponent fits.)

Step 4. Rounding the product makes no change:

\[ 1.110_{\text{two}} \times 2^{-3} \]

Step 5. Since the signs of the original operands differ, make the sign of the product negative.
Hence, the product is

\[ -1.110_{\text{two}} \times 2^{-3} \]

Converting to decimal to check our results:

\[ -1.110_{\text{two}} \times 2^{-3} = -0.001110_{\text{two}} = -0.00111_{\text{two}} = -7/2^{5}_{\text{ten}} = -7/32_{\text{ten}} = -0.21875_{\text{ten}} \]

The product of \(0.5_{\text{ten}}\) and \(-0.4375_{\text{ten}}\) is indeed \(-0.21875_{\text{ten}}\).

Floating-Point Instructions in MIPS

MIPS supports the IEEE 754 single precision and double precision formats with these instructions:

■ Floating-point addition, single (add.s) and addition, double (add.d)
■ Floating-point subtraction, single (sub.s) and subtraction, double (sub.d)
■ Floating-point multiplication, single (mul.s) and multiplication, double (mul.d)
■ Floating-point division, single (div.s) and division, double (div.d)
■ Floating-point comparison, single (c.x.s) and comparison, double (c.x.d), where x may be equal (eq), not equal (neq), less than (lt), less than or equal (le), greater than (gt), or greater than or equal (ge)
■ Floating-point branch, true (bc1t) and branch, false (bc1f)

Floating-point comparison sets a bit to true or false, depending on the comparison condition, and a floating-point branch then decides whether or not to branch, depending on the condition.

The MIPS designers decided to add separate floating-point registers, called $f0, $f1, $f2, …, used either for single precision or double precision. Hence, they included separate loads and stores for floating-point registers: lwc1 and swc1. The base registers for floating-point data transfers, which are used for addresses, remain integer registers. The MIPS code to load two single precision numbers from memory, add them, and then store the sum might look like this:

lwc1 $f4,c($sp) # Load 32-bit F.P. number into F4
lwc1 $f6,a($sp) # Load 32-bit F.P. number into F6
add.s $f2,$f4,$f6 # F2 = F4 + F6 single precision
swc1 $f2,b($sp) # Store 32-bit F.P. number from F2

A double precision register is really an even-odd pair of single precision registers, using the even register number as its name.
Thus, the pair of single precision registers $f2 and $f3 also form the double precision register named $f2.

Figure 3.17 summarizes the floating-point portion of the MIPS architecture revealed in this chapter, with the additions to support floating point shown in color. Similar to Figure 2.19 in Chapter 2, Figure 3.18 shows the encoding of these instructions.

MIPS floating-point operands

Name | Example | Comments
32 floating-point registers | $f0, $f1, $f2, . . . , $f31 | MIPS floating-point registers are used in pairs for double precision numbers.
2^30 memory words | Memory[0], Memory[4], . . . , Memory[4294967292] | Accessed only by data transfer instructions. MIPS uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls.

MIPS floating-point assembly language

Category | Instruction | Example | Meaning | Comments
Arithmetic | FP add single | add.s $f2,$f4,$f6 | $f2 = $f4 + $f6 | FP add (single precision)
Arithmetic | FP subtract single | sub.s $f2,$f4,$f6 | $f2 = $f4 – $f6 | FP sub (single precision)
Arithmetic | FP multiply single | mul.s $f2,$f4,$f6 | $f2 = $f4 × $f6 | FP multiply (single precision)
Arithmetic | FP divide single | div.s $f2,$f4,$f6 | $f2 = $f4 / $f6 | FP divide (single precision)
Arithmetic | FP add double | add.d $f2,$f4,$f6 | $f2 = $f4 + $f6 | FP add (double precision)
Arithmetic | FP subtract double | sub.d $f2,$f4,$f6 | $f2 = $f4 – $f6 | FP sub (double precision)
Arithmetic | FP multiply double | mul.d $f2,$f4,$f6 | $f2 = $f4 × $f6 | FP multiply (double precision)
Arithmetic | FP divide double | div.d $f2,$f4,$f6 | $f2 = $f4 / $f6 | FP divide (double precision)
Data transfer | load word copr. 1 | lwc1 $f1,100($s2) | $f1 = Memory[$s2 + 100] | 32-bit data to FP register
Data transfer | store word copr. 1 | swc1 $f1,100($s2) | Memory[$s2 + 100] = $f1 | 32-bit data to memory
Conditional branch | branch on FP true | bc1t 25 | if (cond == 1) go to PC + 4 + 100 | PC-relative branch if FP cond.
Conditional branch | branch on FP false | bc1f 25 | if (cond == 0) go to PC + 4 + 100 | PC-relative branch if not cond.
Conditional branch | FP compare single (eq,ne,lt,le,gt,ge) | c.lt.s $f2,$f4 | if ($f2 < $f4) cond = 1; else cond = 0 | FP compare less than single precision
Conditional branch | FP compare double (eq,ne,lt,le,gt,ge) | c.lt.d $f2,$f4 | if ($f2 < $f4) cond = 1; else cond = 0 | FP compare less than double precision

MIPS floating-point machine language

Name | Format | Example | Comments
add.s | R | 17 16 6 4 2 0 | add.s $f2,$f4,$f6
sub.s | R | 17 16 6 4 2 1 | sub.s $f2,$f4,$f6
mul.s | R | 17 16 6 4 2 2 | mul.s $f2,$f4,$f6
div.s | R | 17 16 6 4 2 3 | div.s $f2,$f4,$f6
add.d | R | 17 17 6 4 2 0 | add.d $f2,$f4,$f6
sub.d | R | 17 17 6 4 2 1 | sub.d $f2,$f4,$f6
mul.d | R | 17 17 6 4 2 2 | mul.d $f2,$f4,$f6
div.d | R | 17 17 6 4 2 3 | div.d $f2,$f4,$f6
lwc1 | I | 49 20 2 100 | lwc1 $f2,100($s4)
swc1 | I | 57 20 2 100 | swc1 $f2,100($s4)
bc1t | I | 17 8 1 25 | bc1t 25
bc1f | I | 17 8 0 25 | bc1f 25
c.lt.s | R | 17 16 4 2 0 60 | c.lt.s $f2,$f4
c.lt.d | R | 17 17 4 2 0 60 | c.lt.d $f2,$f4
Field size | | 6 bits, 5 bits, 5 bits, 5 bits, 5 bits, 6 bits | All MIPS instructions 32 bits

FIGURE 3.17 MIPS floating-point architecture revealed thus far. See Appendix A, Section A.10, for more detail. This information is also found in column 2 of the MIPS Reference Data Card at the front of this book.

op(31:26):

31–29 \ 28–26 | 0(000) | 1(001) | 2(010) | 3(011) | 4(100) | 5(101) | 6(110) | 7(111)
0(000) | Rfmt | Bltz/gez | j | jal | beq | bne | blez | bgtz
1(001) | addi | addiu | slti | sltiu | ANDi | ORi | xORi | lui
2(010) | TLB | FlPt | | | | | |
3(011) | | | | | | | |
4(100) | lb | lh | lwl | lw | lbu | lhu | lwr |
5(101) | sb | sh | swl | sw | | | swr |
6(110) | lwc0 | lwc1 | | | | | |
7(111) | swc0 | swc1 | | | | | |

op(31:26) = 010001 (FlPt), (rt(16:16) = 0 => c = f, rt(16:16) = 1 => c = t), rs(25:21):

25–24 \ 23–21 | 0(000) | 1(001) | 2(010) | 3(011) | 4(100) | 5(101) | 6(110) | 7(111)
0(00) | mfc1 | | cfc1 | | mtc1 | | ctc1 |
1(01) | bc1.c | | | | | | |
2(10) | f = single | f = double | | | | | |
3(11) | | | | | | | |

op(31:26) = 010001 (FlPt), (f above: 10000 => f = s, 10001 => f = d), funct(5:0):

5–3 \ 2–0 | 0(000) | 1(001) | 2(010) | 3(011) | 4(100) | 5(101) | 6(110) | 7(111)
0(000) | add.f | sub.f | mul.f | div.f | | abs.f | mov.f | neg.f
1(001) | | | | | | | |
2(010) | | | | | | | |
3(011) | | | | | | | |
4(100) | cvt.s.f | cvt.d.f | | | cvt.w.f | | |
5(101) | | | | | | | |
6(110) | c.f.f | c.un.f | c.eq.f | c.ueq.f | c.olt.f | c.ult.f | c.ole.f | c.ule.f
7(111) | c.sf.f | c.ngle.f | c.seq.f | c.ngl.f | c.lt.f | c.nge.f | c.le.f | c.ngt.f

FIGURE 3.18 MIPS floating-point instruction encoding. This notation gives the value of a field by row and by column. For example,
in the top portion of the figure, lw is found in row number 4 (100two for bits 31–29 of the instruction) and column number 3 (011two for bits
28–26 of the instruction), so the corresponding value of the op field (bits 31–26) is 100011two. Underscore means the field is used elsewhere.
For example, FlPt in row 2 and column 1 (op = 010001two) is defined in the bottom part of the figure. Hence sub.f in row 0 and column 1 of
the bottom section means that the funct field (bits 5–0 of the instruction) is 000001two and the op field (bits 31–26) is 010001two. Note that the
5-bit rs field, specified in the middle portion of the figure, determines whether the operation is single precision (f = s, so rs = 10000) or double
precision (f = d, so rs = 10001). Similarly, bit 16 of the instruction determines if the bc1.c instruction tests for true (bit 16 = 1 => bc1.t)
or false (bit 16 = 0 => bc1.f). Instructions in color are described in Chapter 2 or this chapter, with Appendix A covering all instructions.
This information is also found in column 2 of the MIPS Reference Data Card at the front of this book.
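As a cross-check on these encodings, the fields of an R-format floating-point instruction can be packed into a 32-bit word with shifts and ORs. The C sketch below is only an illustration: the function name encode_fp_r and its parameter names are hypothetical, while the field widths (6, 5, 5, 5, 5, and 6 bits) and the example values 17 16 6 4 2 0 for add.s $f2,$f4,$f6 are the ones listed in Figure 3.17.

```c
#include <stdint.h>

/* Pack six fields, left to right, into one 32-bit instruction word.
   Widths follow the "Field size" row of Figure 3.17: 6, 5, 5, 5, 5, 6 bits.
   The parameter names are illustrative labels, not taken from the figure. */
uint32_t encode_fp_r(uint32_t op, uint32_t fmt, uint32_t f3,
                     uint32_t f4, uint32_t f5, uint32_t funct) {
    return ((op    & 0x3Fu) << 26) |   /* bits 31-26 */
           ((fmt   & 0x1Fu) << 21) |   /* bits 25-21: 16 = single, 17 = double */
           ((f3    & 0x1Fu) << 16) |   /* bits 20-16 */
           ((f4    & 0x1Fu) << 11) |   /* bits 15-11 */
           ((f5    & 0x1Fu) <<  6) |   /* bits 10-6  */
            (funct & 0x3Fu);           /* bits 5-0   */
}
```

For example, encode_fp_r(17, 16, 6, 4, 2, 0) packs the add.s $f2,$f4,$f6 row; its top six bits are 010001two, the FlPt opcode, and its low six bits are the funct value 000000two for add.f.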


One issue that architects face in supporting floating-point arithmetic is whether
to use the same registers used by the integer instructions or to add a special set
for floating point. Because programs normally perform integer operations and
floating-point operations on different data, separating the registers will only
slightly increase the number of instructions needed to execute a program. The
major impact is to create a separate set of data transfer instructions to move data
between floating-point registers and memory.

The benefits of separate floating-point registers are having twice as many
registers without using up more bits in the instruction format, having twice the
register bandwidth by having separate integer and floating-point register sets, and
being able to customize registers to floating point; for example, some computers
convert all sized operands in registers into a single internal format.

Compiling a Floating-Point C Program into MIPS Assembly Code

Let’s convert a temperature in Fahrenheit to Celsius:

float f2c (float fahr)
{
    return ((5.0/9.0) * (fahr - 32.0));
}

Assume that the floating-point argument fahr is passed in $f12 and the
result should go in $f0. (Unlike integer registers, floating-point register 0 can
contain a number.) What is the MIPS assembly code?

We assume that the compiler places the three floating-point constants in
memory within easy reach of the global pointer $gp. The first two instructions
load the constants 5.0 and 9.0 into floating-point registers:

f2c:
lwc1 $f16,const5($gp) # $f16 = 5.0 (5.0 in memory)
lwc1 $f18,const9($gp) # $f18 = 9.0 (9.0 in memory)

They are then divided to get the fraction 5.0/9.0:

div.s $f16, $f16, $f18 # $f16 = 5.0 / 9.0


(Many compilers would divide 5.0 by 9.0 at compile time and save the single
constant 5.0/9.0 in memory, thereby avoiding the divide at runtime.) Next, we
load the constant 32.0 and then subtract it from fahr ($f12):

lwc1 $f18, const32($gp) # $f18 = 32.0
sub.s $f18, $f12, $f18 # $f18 = fahr – 32.0

Finally, we multiply the two intermediate results, placing the product in $f0 as
the return result, and then return:

mul.s $f0, $f16, $f18 # $f0 = (5/9)*(fahr – 32.0)
jr $ra # return
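The remark about dividing 5.0 by 9.0 at compile time can be illustrated in C as well. What follows is a sketch under stated assumptions, not the book's code: the names f2c_folded and FIVE_NINTHS are hypothetical, and the constants carry the f suffix so that the arithmetic stays in single precision, as in the MIPS code above (unsuffixed constants such as 5.0 are double precision in C).

```c
/* 5.0f / 9.0f is a constant expression, so a compiler can evaluate it once
   at compile time; only one subtract and one multiply remain at run time. */
static const float FIVE_NINTHS = 5.0f / 9.0f;

float f2c_folded(float fahr) {
    return FIVE_NINTHS * (fahr - 32.0f);
}
```

Because 5/9 has no exact binary representation, a call such as f2c_folded(212.0f) should be expected to return a value very close to 100 rather than exactly 100.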

Now let’s perform floating-point operations on matrices, code commonly
found in scientific programs.

Compiling Floating-Point C Procedure with Two-Dimensional
Matrices into MIPS

Most floating-point calculations are performed in double precision. Let’s perform
matrix multiply of C = C + A * B. It is commonly called DGEMM,
for Double precision, General Matrix Multiply. We’ll see versions of DGEMM
again in Section 3.8 and subsequently in Chapters 4, 5, and 6. Let’s assume C,
A, and B are all square matrices with 32 elements in each dimension.

void mm (double c[][32], double a[][32], double b[][32])
{
    int i, j, k;
    for (i = 0; i != 32; i = i + 1)
        for (j = 0; j != 32; j = j + 1)
            for (k = 0; k != 32; k = k + 1)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

The array starting addresses are parameters, so they are in $a0, $a1, and $a2.
Assume that the integer variables are in $s0, $s1, and $s2, respectively.
What is the MIPS assembly code for the body of the procedure?

Note that c[i][j] is used in the innermost loop above. Since the loop index
is k, the index does not affect c[i][j], so we can avoid loading and storing
c[i][j] each iteration. Instead, the compiler loads c[i][j] into a register
outside the loop, accumulates the sum of the products of a[i][k] and
b[k][j] in that same register, and then stores the sum into c[i][j] upon
termination of the innermost loop.
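Expressed in C, this transformation amounts to keeping the running sum in a local variable. The sketch below is our own illustration of that idea, assuming the 32 × 32 size stated in the example; the names mm_hoisted, N, and sum are hypothetical.

```c
#define N 32  /* matrix dimension assumed by the example */

void mm_hoisted(double c[N][N], double a[N][N], double b[N][N]) {
    for (int i = 0; i != N; i = i + 1) {
        for (int j = 0; j != N; j = j + 1) {
            double sum = c[i][j];              /* load c[i][j] once, before the k loop */
            for (int k = 0; k != N; k = k + 1)
                sum = sum + a[i][k] * b[k][j]; /* accumulate in a local (a register) */
            c[i][j] = sum;                     /* store once, after the k loop ends */
        }
    }
}
```

The additions happen in the same order as in the original triple loop, so the result is the same; only the number of loads and stores of c[i][j] changes.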

We keep the code simpler by using the assembly language pseudoinstructions
li (which loads a constant into a register), and l.d and s.d (which the
assembler turns into a pair of data transfer instructions, lwc1 or swc1, to a
pair of floating-point registers).

The body of the procedure starts with saving the loop termination value of
32 in a temporary register and then initializing the three for loop variables:

mm:…
li $t1, 32 # $t1 = 32 (row size/loop end)
li $s0, 0 # i = 0; initialize 1st for loop
L1: li $s1, 0 # j = 0; restart 2nd for loop
L2: li $s2, 0 # k = 0; restart 3rd for loop

To calculate the address of c[i][j], we need to know how a 32 × 32,
two-dimensional array is stored in memory. As you might expect, its layout is the
same as if there were 32 single-dimension arrays, each with 32 elements. So the
first step is to skip over the i “single-dimensional arrays,” or rows, to get the
one we want. Thus, we multiply the index in the first dimension by the size of
the row, 32. Since 32 is a power of 2, we can use a shift instead:

sll $t2, $s0, 5 # $t2 = i * 2^5 (size of row of c)

Now we add the second index to select the jth element of the desired row:

addu $t2, $t2, $s1 # $t2 = i * size(row) + j

To turn this sum into a byte index, we multiply it by the size of a matrix element
in bytes. Since each element is 8 bytes for double precision, we can instead shift
left by 3:

sll $t2, $t2, 3 # $t2 = byte offset of [i][j]

Next we add this sum to the base address of c, giving the address of c[i][j],
and then load the double precision number c[i][j] into $f4:

addu $t2, $a0, $t2 # $t2 = byte address of c[i][j]
l.d $f4, 0($t2) # $f4 = 8 bytes of c[i][j]
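The address arithmetic performed by these instructions can be mirrored in C. The sketch below is illustrative, with a hypothetical helper name; it assumes 8-byte doubles and 32 elements per row, as in the example.

```c
#include <stddef.h>

/* Byte offset of element [i][j] in a row-major 32 x 32 array of 8-byte
   doubles, that is (i * 32 + j) * 8, using the same shift, add, shift
   sequence as the sll/addu/sll instructions. */
size_t byte_offset_32x32(size_t i, size_t j) {
    size_t index = (i << 5) + j;   /* i * 2^5 + j: element index */
    return index << 3;             /* times 2^3 bytes per double */
}
```

Adding this offset to the base address of the array gives the address that the final addu places in $t2.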

The following five instructions are virtually identical to the last five: calculate
the address and then load the double precision number b[k][j].

L3: sll $t0, $s2, 5 # $t0 = k * 2^5 (size of row of b)
addu $t0, $t0, $s1 # $t0 = k * size(row) + j
sll $t0, $t0, 3 # $t0 = byte offset of [k][j]
addu $t0, $a2, $t0 # $t0 = byte address of b[k][j]
l.d $f16, 0($t0) # $f16 = 8 bytes of b[k][j]

Similarly, the next five instructions are like the last five: calculate the address
and then load the double precision number a[i][k].

sll $t0, $s0, 5 # $t0 = i * 2^5 (size of row of a)
addu $t0, $t0, $s2 # $t0 = i * size(row) + k
sll $t0, $t0, 3 # $t0 = byte offset of [i][k]
addu $t0, $a1, $t0 # $t0 = byte address of a[i][k]
l.d $f18, 0($t0) # $f18 = 8 bytes of a[i][k]

Now that we have loaded all the data, we are finally ready to do some
floating-point operations! We multiply elements of a and b located in registers $f18
and $f16, and then accumulate the sum in $f4.

mul.d $f16, $f18, $f16 # $f16 = a[i][k] * b[k][j]
add.d $f4, $f4, $f16 # $f4 = c[i][j] + a[i][k] * b[k][j]

The final block increments the index k and loops back if the index is not 32.
If it is 32, and thus the end of the innermost loop, we need to store the sum
accumulated in $f4 into c[i][j].

addiu $s2, $s2, 1 # k = k + 1
bne $s2, $t1, L3 # if (k != 32) go to L3
s.d $f4, 0($t2) # c[i][j] = $f4

Similarly, these final four instructions increment the index variable of the
middle and outermost loops, looping back if the index is not 32 and exiting if
the index is 32.

addiu $s1, $s1, 1 # j = j + 1
bne $s1, $t1, L2 # if (j != 32) go to L2
addiu $s0, $s0, 1 # i = i + 1
bne $s0, $t1, L1 # if (i != 32) go to L1
…

Figure 3.22 below shows the x86 assembly language code for a slightly different
version of DGEMM in Figure 3.21.

Elaboration: The array layout discussed in the example, called row-major order, is
used by C and many other programming languages. Fortran instead uses column-major
order, whereby the array is stored column by column.
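The difference between the two layouts comes down to which index is scaled. A small C sketch, using hypothetical function names and zero-based indices for both layouts (Fortran's own indexing conventions are not modeled here):

```c
#include <stddef.h>

/* Element index of [i][j] when a rows x cols matrix is stored row by row. */
size_t row_major_index(size_t i, size_t j, size_t cols) {
    return i * cols + j;   /* skip i full rows, then j elements */
}

/* Element index of [i][j] when the same matrix is stored column by column. */
size_t column_major_index(size_t i, size_t j, size_t rows) {
    return j * rows + i;   /* skip j full columns, then i elements */
}
```

For a 32 × 32 matrix, element [1][2] is at index 34 in row-major order and at index 65 in column-major order, so code that walks along a row touches adjacent memory in the first layout and strides by a full column in the second.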

Elaboration: Only 16 of the 32 MIPS floating-point registers could originally be used
for double precision operations: $f0, $f2, $f4, …, $f30. Double precision is computed
using pairs of these single precision registers. The odd-numbered floating-point registers
were used only to load and store the right half of 64-bit floating-point numbers. MIPS-32
added l.d and s.d to the instruction set. MIPS-32 also added “paired single” versions of
all floating-point instructions, where a single instruction results in two parallel floating-point
operations on two 32-bit operands inside 64-bit registers (see Section 3.6). For example,
add.ps $f0, $f2, $f4 is equivalent to add.s $f0, $f2, $f4 followed by add.s
$f1, $f3, $f5.


Elaboration: Another reason for separate integer and floating-point registers is that
microprocessors in the 1980s didn’t have enough transistors to put the floating-point unit
on the same chip as the integer unit. Hence, the floating-point unit, including the
floating-point registers, was optionally available as a second chip. Such optional accelerator
chips are called coprocessors, and explain the acronym for floating-point loads in MIPS:
lwc1 means load word to coprocessor 1, the floating-point unit. (Coprocessor 0 deals
with virtual memory, described in Chapter 5.) Since the early 1990s, microprocessors
have integrated floating point (and just about everything else) on chip, and hence the term
coprocessor joins accumulator and core memory as quaint terms that date the speaker.

Elaboration: As mentioned in Section 3.4, accelerating division is more challenging
than multiplication. In addition to SRT, another technique to leverage a fast multiplier
is Newton’s iteration, where division is recast as finding the zero of a function to find
the reciprocal 1/c, which is then multiplied by the other operand. Iteration techniques
cannot be rounded properly without calculating many extra bits. A TI chip solved this
problem by calculating an extra-precise reciprocal.
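The Newton iteration described above can be sketched in C. Finding the zero of f(x) = 1/x − c leads to the update x ← x × (2 − c × x), which needs only multiplies and a subtract. This is a minimal sketch under stated assumptions, not how any particular chip does it: the function names, the linear starting guess, and the restriction of c to the range 0.5 to 1.0 are all illustrative choices that do not come from the text.

```c
#include <assert.h>

/* Approximate 1/c with Newton's iteration x = x * (2 - c*x).
   Assumption: 0.5 <= c <= 1.0, as for a significand that has been scaled.
   The linear seed 48/17 - (32/17)*c is a common textbook starting guess. */
double newton_reciprocal(double c, int iterations)
{
    double x = 48.0 / 17.0 - (32.0 / 17.0) * c;   /* initial estimate of 1/c */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - c * x);                    /* the error roughly squares each step */
    return x;
}

/* Divide a by c by multiplying a with the approximate reciprocal.
   The quotient is close to a/c but is not guaranteed to be correctly rounded. */
double newton_divide(double a, double c, int iterations)
{
    return a * newton_reciprocal(c, iterations);
}
```

With five iterations, newton_divide(3.0, 0.75, 5) lands within a few units in the last place of 4.0. As the Elaboration notes, getting the correctly rounded quotient would take extra bits that this sketch does not carry.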

Elaboration: Java embraces IEEE 754 by name in its definition of Java floating-point
data types and operations. Thus, the code in the first example could have well been
generated for a class method that converted Fahrenheit to Celsius.

The second example above uses multiple dimensional arrays, which are not explicitly
supported in Java. Java allows arrays of arrays, but each array may have its own length,
unlike multiple dimensional arrays in C. Like the examples in Chapter 2, a Java version
of this second example would require a good deal of checking code for array bounds,
including a new length calculation at the end of row access. It would also need to check
that the object reference is not null.

Accurate Arithmetic
Unlike integers, which can represent exactly every number between the smallest and
largest number, floating-point numbers are normally approximations for a number
they can’t really represent. The reason is that an infinite variety of real numbers
exists between, say, 0 and 1, but no more than $2^{53}$ can be represented exactly in
double precision floating point. The best we can do is get the floating-point
representation close to the actual number. Thus, IEEE 754 offers several modes of
rounding to let the programmer pick the desired approximation.

Rounding sounds simple enough, but to round accurately requires the hardware
to include extra bits in the calculation. In the preceding examples, we were vague
on the number of bits that an intermediate representation can occupy, but clearly,
if every intermediate result had to be truncated to the exact number of digits, there
would be no opportunity to round. IEEE 754, therefore, always keeps two extra bits
on the right during intermediate additions, called guard and round, respectively.
Let’s do a decimal example to illustrate their value.

guard The first of two extra bits kept on the right during intermediate
calculations of floating-point numbers; used to improve rounding accuracy.

round Method to make the intermediate floating-point result fit the
floating-point format; the goal is typically to find the nearest number that
can be represented in the format.


Rounding with Guard Digits

EXAMPLE

Add $2.56_{\text{ten}} \times 10^{0}$ to $2.34_{\text{ten}} \times 10^{2}$, assuming that we have three significant
decimal digits. Round to the nearest decimal number with three significant
decimal digits, first with guard and round digits, and then without them.

ANSWER

First we must shift the smaller number to the right to align the exponents, so
$2.56_{\text{ten}} \times 10^{0}$ becomes $0.0256_{\text{ten}} \times 10^{2}$. Since we have guard and round digits,
we are able to represent the two least significant digits when we align
exponents. The guard digit holds 5 and the round digit holds 6. The sum is

$$
\begin{array}{r}
2.3400_{\text{ten}} \\
+\ 0.0256_{\text{ten}} \\
\hline
2.3656_{\text{ten}}
\end{array}
$$

Thus the sum is $2.3656_{\text{ten}} \times 10^{2}$. Since we have two digits to round, we want
values 0 to 49 to round down and 51 to 99 to round up, with 50 being the
tiebreaker. Rounding the sum up with three significant digits yields $2.37_{\text{ten}} \times 10^{2}$.

Doing this without guard and round digits drops two digits from the
calculation. The new sum is then

$$
\begin{array}{r}
2.34_{\text{ten}} \\
+\ 0.02_{\text{ten}} \\
\hline
2.36_{\text{ten}}
\end{array}
$$

The answer is $2.36_{\text{ten}} \times 10^{2}$, off by 1 in the last digit from the sum above.
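The decimal example can be replayed with integer arithmetic. In this sketch the significands are held as scaled integers: hundredths for the three-digit format and ten-thousandths once the guard and round digits are attached. The function names and the scaling are assumptions made for illustration only; real hardware works on binary significands.

```c
#include <assert.h>

/* big_hundredths: larger operand's significand in units of 0.01 (234 means 2.34).
   small_ten_thousandths: smaller operand after alignment, in units of 0.0001
   (256 means 0.0256, that is, 2.56 shifted right two places). */

/* Keep the guard and round digits during the add, then round to nearest.
   A tie (exactly 50) is rounded up here for simplicity. */
int add_with_guard_digits(int big_hundredths, int small_ten_thousandths)
{
    int sum = big_hundredths * 100 + small_ten_thousandths; /* 23400 + 256 = 23656 */
    return (sum + 50) / 100;                                /* back to hundredths */
}

/* Drop the two shifted-off digits before the add. */
int add_without_guard_digits(int big_hundredths, int small_ten_thousandths)
{
    return big_hundredths + small_ten_thousandths / 100;    /* 234 + 2 */
}
```

Calling add_with_guard_digits(234, 256) gives 237 and add_without_guard_digits(234, 256) gives 236, matching the two sums worked out by hand.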

Since the worst case for rounding would be when the actual number is halfway
between two floating-point representations, accuracy in floating point is normally
measured in terms of the number of bits in error in the least significant bits of the
significand; the measure is called the number of units in the last place, or ulp. If
a number were off by 2 in the least significant bits, it would be called off by 2 ulps.
Provided there are no overflow, underflow, or invalid operation exceptions, IEEE
754 guarantees that the computer uses the number that is within one-half ulp.

Elaboration: Although the example above really needed just one extra digit, multiply
can need two. A binary product may have one leading 0 bit; hence, the normalizing step
must shift the product one bit left. This shifts the guard digit into the least significant bit
of the product, leaving the round bit to help accurately round the product.

IEEE 754 has four rounding modes: always round up (toward +∞), always round down
(toward −∞), truncate, and round to nearest even. The final mode determines what to
do if the number is exactly halfway in between. The U.S. Internal Revenue Service (IRS)
always rounds 0.50 dollars up, possibly to the benefit of the IRS. A more equitable way
would be to round up this case half the time and round down the other half. IEEE 754
says that if the least significant bit retained in a halfway case would be odd, add one;
if it’s even, truncate. This method always creates a 0 in the least significant bit in the
tie-breaking case, giving the rounding mode its name. This mode is the most commonly
used, and the only one that Java supports.

units in the last place (ulp) The number of bits in error in the least
significant bits of the significand between the actual number and the number
that can be represented.

The goal of the extra rounding bits is to allow the computer to get the same results
as if the intermediate results were calculated to infinite precision and then rounded. To
support this goal and round to the nearest even, the standard has a third bit in addition
to guard and round; it is set whenever there are nonzero bits to the right of the round
bit. This sticky bit allows the computer to see the difference between $0.50 \ldots 00_{\text{ten}}$
and $0.50 \ldots 01_{\text{ten}}$ when rounding.

The sticky bit may be set, for example, during addition, when the smaller number is
shifted to the right. Suppose we added $5.01_{\text{ten}} \times 10^{-1}$ to $2.34_{\text{ten}} \times 10^{2}$ in the example
above. Even with guard and round, we would be adding 0.0050 to 2.34, with a sum of
2.3450. The sticky bit would be set, since there are nonzero bits to the right. Without the
sticky bit to remember whether any 1s were shifted off, we would assume the number
is equal to 2.345000 … 00 and round to the nearest even of 2.34. With the sticky bit
to remember that the number is larger than 2.345000 … 00, we round instead to 2.35.
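Round to nearest even and the effect of the sticky information can be observed from C by converting doubles to single precision. The sketch below assumes the default rounding mode and IEEE 754 single and double formats; the helper name to_single is hypothetical. Single precision has a 24-bit significand, so above 2^24 = 16,777,216 only even integers are representable and every odd integer is an exact halfway case.

```c
#include <assert.h>

/* Round a double to single precision.
   volatile forces a real single precision store, so the rounding
   actually happens even on machines that keep extra precision in registers. */
float to_single(double x)
{
    volatile float f = (float)x;
    return f;
}
```

Under those assumptions, to_single(16777217.0) returns 16777216.0f and to_single(16777219.0) returns 16777220.0f: both are ties, and each goes to the neighbor whose last significand bit is 0. By contrast, to_single(16777217.0000001) returns 16777218.0f, because the nonzero bits beyond the halfway point (what the sticky bit records) show that the value is above the tie.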

Elaboration: PowerPC, SPARC64, AMD SSE5, and Intel AVX architectures provide a
single instruction that does a multiply and add on three registers: a = a + (b × c).
Obviously, this instruction allows potentially higher floating-point performance for this
common operation. Equally important is that instead of performing two roundings (after
the multiply and then after the add), which would happen with separate instructions,
the multiply add instruction can perform a single rounding after the add. A single
rounding step increases the precision of multiply add. Such operations with a single
rounding are called fused multiply add. It was added to the IEEE 754-2008 standard
(see Section 3.11).

Summary
The Big Picture that follows reinforces the stored-program concept from Chapter 2;
the meaning of the information cannot be determined just by looking at the bits, for
the same bits can represent a variety of objects. This section shows that computer
arithmetic is finite and thus can disagree with natural arithmetic. For example, the
IEEE 754 standard floating-point representation

$$(-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$$

is almost always an approximation of the real number. Computer systems must
take care to minimize this gap between computer arithmetic and arithmetic in the
real world, and programmers at times need to be aware of the implications of this
approximation.

sticky bit A bit used in rounding in addition to guard and round that is
set whenever there are nonzero bits to the right of the round bit.

fused multiply add A floating-point instruction that performs both a
multiply and an add, but rounds only once after the add.

The BIG Picture

Bit patterns have no inherent meaning. They may represent signed integers,
unsigned integers, floating-point numbers, instructions, and so on. What is
represented depends on the instruction that operates on the bits in the word.
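The point that bits carry no inherent meaning can be made concrete in C by viewing one 32-bit pattern through different types. This is a small sketch that assumes float is the 32-bit IEEE 754 single precision format and that int32_t is two's complement; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Interpret 32 bits as a single precision floating-point number.
   memcpy copies the bits unchanged; it is the portable way to do this in C. */
float bits_as_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Interpret the same 32 bits as a two's complement signed integer. */
int32_t bits_as_signed(uint32_t bits)
{
    int32_t i;
    memcpy(&i, &bits, sizeof i);
    return i;
}
```

Under these assumptions the pattern 0x3F800000 is 1.0 when read as a float and 1,065,353,216 when read as an integer, while 0xFFFFFFFB is 4,294,967,291 as an unsigned integer and −5 as a signed one. Which value is "meant" depends on the instruction (here, the C type) applied to the bits.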


C type        Java type  Data transfers  Operations
int           int        lw, sw, lui     addu, addiu, subu, mult, div, AND, ANDi, OR, ORi, NOR, slt, slti
unsigned int  -          lw, sw, lui     addu, addiu, subu, multu, divu, AND, ANDi, OR, ORi, NOR, sltu, sltiu
char          -          lb, sb, lui     add, addi, sub, mult, div, AND, ANDi, OR, ORi, NOR, slt, slti
-             char       lh, sh, lui     addu, addiu, subu, multu, divu, AND, ANDi, OR, ORi, NOR, sltu, sltiu
float         float      lwc1, swc1      add.s, sub.s, mul.s, div.s, c.eq.s, c.lt.s, c.le.s
double        double     l.d, s.d        add.d, sub.d, mul.d, div.d, c.eq.d, c.lt.d, c.le.d

Hardware/Software Interface

In the last chapter, we presented the storage classes of the programming language C
(see the Hardware/Software Interface section in Section 2.7). The table above shows
some of the C and Java data types, the MIPS data transfer instructions, and instructions
that operate on those types that appear in Chapter 2 and this chapter. Note that Java
omits unsigned integers.

Check Yourself

The revised IEEE 754-2008 standard added a 16-bit floating-point format with five
exponent bits. What do you think is the likely range of numbers it could represent?

1. $1.0000\ 00 \times 2^{0}$ to $1.1111\ 1111\ 11 \times 2^{31}$, 0

2. $\pm 1.0000\ 0000\ 0 \times 2^{-14}$ to $\pm 1.1111\ 1111\ 1 \times 2^{15}$, $\pm 0$, $\pm\infty$, NaN

3. $\pm 1.0000\ 0000\ 00 \times 2^{-14}$ to $\pm 1.1111\ 1111\ 11 \times 2^{15}$, $\pm 0$, $\pm\infty$, NaN

4. $\pm 1.0000\ 0000\ 00 \times 2^{-15}$ to $\pm 1.1111\ 1111\ 11 \times 2^{14}$, $\pm 0$, $\pm\infty$, NaN

Elaboration: To accommodate comparisons that may include NaNs, the standard
includes ordered and unordered as options for compares. Hence, the full MIPS instruction
set has many flavors of compares to support NaNs. (Java does not support unordered
compares.)

The major difference between computer numbers and numbers in the
real world is that computer numbers have limited size and hence limited
precision; it’s possible to calculate a number too big or too small to be
represented in a word. Programmers must remember these limits and
write programs accordingly.


In an attempt to squeeze every last bit of precision from a floating-point operation,
the standard allows some numbers to be represented in unnormalized form. Rather than
having a gap between 0 and the smallest normalized number, IEEE allows denormalized
numbers (also known as denorms or subnormals). They have the same exponent as
zero but a nonzero fraction. They allow a number to degrade in significance until it
becomes 0, called gradual underflow. For example, the smallest positive single precision
normalized number is

$$1.0000\ 0000\ 0000\ 0000\ 0000\ 000_{\text{two}} \times 2^{-126}$$

but the smallest single precision denormalized number is

$$0.0000\ 0000\ 0000\ 0000\ 0000\ 001_{\text{two}} \times 2^{-126}, \text{ or } 1.0_{\text{two}} \times 2^{-149}$$

For double precision, the denorm gap goes from $1.0 \times 2^{-1022}$ to $1.0 \times 2^{-1074}$.

The possibility of an occasional unnormalized operand has given headaches to
floating-point designers who are trying to build fast floating-point units. Hence, many
computers cause an exception if an operand is denormalized, letting software complete
the operation. Although software implementations are perfectly valid, their lower
performance has lessened the popularity of denorms in portable floating-point software.
Moreover, if programmers do not expect denorms, their programs may surprise them.
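Gradual underflow can be watched from C by repeatedly halving the smallest normalized single precision number. The sketch assumes IEEE 754 single precision with denormals enabled (the usual default; some compiler options, such as fast-math modes, flush them to zero) and the default rounding mode. The helper name is hypothetical; FLT_MIN from float.h is the smallest positive normalized float, 2^−126.

```c
#include <assert.h>
#include <float.h>

/* Halve x the given number of times, storing every intermediate
   result in single precision (volatile prevents extra precision). */
float halve_repeatedly(float x, int times)
{
    volatile float v = x;
    for (int i = 0; i < times; i++)
        v = v / 2.0f;
    return v;
}
```

Starting from FLT_MIN, one halving gives 2^−127, which is denormalized but still nonzero. After 23 halvings the value is 2^−149, the smallest denorm quoted above, and the 24th halving finally produces 0.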

3.6 Parallelism and Computer Arithmetic: Subword Parallelism

Since every desktop microprocessor by definition has its own graphical displays,
as transistor budgets increased it was inevitable that support would be added for
graphics operations.

Many graphics systems originally used 8 bits to represent each of the three
primary colors plus 8 bits for a location of a pixel. The addition of speakers and
microphones for teleconferencing and video games suggested support of sound as
well. Audio samples need more than 8 bits of precision, but 16 bits are sufficient.

Every microprocessor has special support so that bytes and halfwords take up
less space when stored in memory (see Section 2.9), but due to the infrequency of
arithmetic operations on these data sizes in typical integer programs, there was
little support beyond data transfers. Architects recognized that many graphics
and audio applications would perform the same operation on vectors of this data.
By partitioning the carry chains within a 128-bit adder, a processor could use
parallelism to perform simultaneous operations on short vectors of sixteen 8-bit
operands, eight 16-bit operands, four 32-bit operands, or two 64-bit operands. The
cost of such partitioned adders was small.
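The idea of partitioning a carry chain can be imitated in software on an ordinary 32-bit word, which makes the hardware trick easier to picture. The sketch below treats a 32-bit word as four independent 8-bit lanes; it is an illustration with a hypothetical function name, not a description of how any adder is actually wired.

```c
#include <assert.h>
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed into a 32-bit word, with each
   lane wrapping around on its own. The carry chain is cut at every lane
   boundary: the low 7 bits of each lane are added normally, and the top
   bit of each lane is then combined with XOR so that no carry can cross
   into the neighboring lane. */
uint32_t add_4x8(uint32_t a, uint32_t b)
{
    const uint32_t HIGH = 0x80808080u;            /* top bit of every lane */
    uint32_t low = (a & ~HIGH) + (b & ~HIGH);     /* carries stop at bit 7 of each lane */
    return low ^ ((a ^ b) & HIGH);                /* top-bit sum, carry-out discarded */
}
```

For example, add_4x8(0x01FF7F80, 0x01010180) yields 0x02008000: the lanes 0xFF + 0x01 and 0x80 + 0x80 each wrap to 0x00 without disturbing their neighbors. An ordinary 32-bit add of the same words gives 0x03008100, because carries ripple across the lane boundaries.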

Given that the parallelism occurs within a wide word, the extensions are
classified as subword parallelism. It is also classified under the more general name
of data level parallelism. They have also been called vector or SIMD, for single
instruction, multiple data (see Section 6.6). The rising popularity of multimedia
applications led to arithmetic instructions that support narrower operations that
can easily operate in parallel.

For example, ARM added more than 100 instructions in the NEON multimedia
instruction extension to support subword parallelism, which can be used either
with ARMv7 or ARMv8. It added 256 bytes of new registers for NEON that can be
viewed as 32 registers 8 bytes wide or 16 registers 16 bytes wide. NEON supports
all the subword data types you can imagine except 64-bit floating-point numbers:

■ 8-bit, 16-bit, 32-bit, and 64-bit signed and unsigned integers

■ 32-bit floating-point numbers

Figure 3.19 gives a summary of the basic NEON instructions.

FIGURE 3.19 Summary of ARM NEON instructions for subword parallelism. We use the curly brackets {} to show optional
variations of the basic operations: {S8,U8,8} stand for signed and unsigned 8-bit integers or 8-bit data where type doesn’t matter, of which 16
fit in a 128-bit register; {S16,U16,16} stand for signed and unsigned 16-bit integers or 16-bit type-less data, of which 8 fit in a 128-bit register;
{S32,U32,32} stand for signed and unsigned 32-bit integers or 32-bit type-less data, of which 4 fit in a 128-bit register; {S64,U64,64} stand for
signed and unsigned 64-bit integers or type-less 64-bit data, of which 2 fit in a 128-bit register; {F32} stand for signed and unsigned 32-bit
floating-point numbers, of which 4 fit in a 128-bit register. Vector Load reads one n-element structure from memory into 1, 2, 3, or 4 NEON
registers. It loads a single n-element structure to one lane (see Section 6.6), and elements of the register that are not loaded are unchanged.
Vector Store writes one n-element structure into memory from 1, 2, 3, or 4 NEON registers.

Elaboration: In addition to signed and unsigned integers, ARM includes “fixed-point”
format of four sizes called I8, I16, I32, and I64, of which 16, 8, 4, and 2 fit in a
128-bit register, respectively. A portion of the fixed point is for the fraction (to the right of
the binary point) and the rest of the data is the integer portion (to the left of the binary
point). The location of the binary point is up to the software. Many ARM processors do
not have floating-point hardware and thus floating-point operations must be performed by
library routines. Fixed-point arithmetic can be significantly faster than software
floating-point routines, but more work for the programmer.

Data transfer                Arithmetic                                    Logical/Compare
VLDR.F32                     VADD.F32, VADD{L,W}{S8,U8,S16,U16,S32,U32}    VAND.64, VAND.128
VSTR.F32                     VSUB.F32, VSUB{L,W}{S8,U8,S16,U16,S32,U32}    VORR.64, VORR.128
VLD{1,2,3,4}.{I8,I16,I32}    VMUL.F32, VMULL{S8,U8,S16,U16,S32,U32}        VEOR.64, VEOR.128
VST{1,2,3,4}.{I8,I16,I32}    VMLA.F32, VMLAL{S8,U8,S16,U16,S32,U32}        VBIC.64, VBIC.128
VMOV.{I8,I16,I32,F32}, #imm  VMLS.F32, VMLSL{S8,U8,S16,U16,S32,U32}        VORN.64, VORN.128
VMVN.{I8,I16,I32,F32}, #imm  VMAX.{S8,U8,S16,U16,S32,U32,F32}              VCEQ.{I8,I16,I32,F32}
VMOV.{I64,I128}              VMIN.{S8,U8,S16,U16,S32,U32,F32}              VCGE.{S8,U8,S16,U16,S32,U32,F32}
VMVN.{I64,I128}              VABS.{S8,S16,S32,F32}                         VCGT.{S8,U8,S16,U16,S32,U32,F32}


3.7 Real Stuff: Streaming SIMD Extensions and Advanced Vector Extensions in x86

The original MMX (MultiMedia eXtension) and SSE (Streaming SIMD Extension)
instructions for the x86 included similar operations to those found in ARM NEON.
Chapter 2 notes that in 2001 Intel added 144 instructions to its architecture as
part of SSE2, including double precision floating-point registers and operations. It
includes eight 64-bit registers that can be used for floating-point operands. AMD
expanded the number to 16 registers, called XMM, as part of AMD64, which
Intel relabeled EM64T for its use. Figure 3.20 summarizes the SSE and SSE2
instructions.

In addition to holding a single precision or double precision number in a
register, Intel allows multiple floating-point operands to be packed into a single
128-bit SSE2 register: four single precision or two double precision. Thus, the 16
floating-point registers for SSE2 are actually 128 bits wide. If the operands can be
arranged in memory as 128-bit aligned data, then 128-bit data transfers can load
and store multiple operands per instruction. This packed floating-point format is
supported by arithmetic operations that can operate simultaneously on four singles
(PS) or two doubles (PD).

Data transfer                       Arithmetic                      Compare
MOV{A/U}{SS/PS/SD/PD} xmm, mem/xmm  ADD{SS/PS/SD/PD} xmm,mem/xmm    CMP{SS/PS/SD/PD}
                                    SUB{SS/PS/SD/PD} xmm,mem/xmm
MOV{H/L}{PS/PD} xmm, mem/xmm        MUL{SS/PS/SD/PD} xmm,mem/xmm
                                    DIV{SS/PS/SD/PD} xmm,mem/xmm
                                    SQRT{SS/PS/SD/PD} mem/xmm
                                    MAX{SS/PS/SD/PD} mem/xmm
                                    MIN{SS/PS/SD/PD} mem/xmm

FIGURE 3.20 The SSE/SSE2 floating-point instructions of the x86. xmm means one operand is
a 128-bit SSE2 register, and mem/xmm means the other operand is either in memory or it is an SSE2 register.
We use the curly brackets {} to show optional variations of the basic operations: {SS} stands for Scalar Single
precision floating point, or one 32-bit operand in a 128-bit register; {PS} stands for Packed Single precision
floating point, or four 32-bit operands in a 128-bit register; {SD} stands for Scalar Double precision floating
point, or one 64-bit operand in a 128-bit register; {PD} stands for Packed Double precision floating point, or
two 64-bit operands in a 128-bit register; {A} means the 128-bit operand is aligned in memory; {U} means
the 128-bit operand is unaligned in memory; {H} means move the high half of the 128-bit operand; and {L}
means move the low half of the 128-bit operand.


In 2011 Intel doubled the width of the registers again, now called YMM, with
Advanced Vector Extensions (AVX). Thus, a single operation can now specify eight
32-bit floating-point operations or four 64-bit floating-point operations. The
legacy SSE and SSE2 instructions now operate on the lower 128 bits of the YMM
registers. Thus, to go from 128-bit to 256-bit operations, you prepend the letter
“v” (for vector) to the SSE2 assembly language operations and then use the
YMM register names instead of the XMM register names. For example, the SSE2
instruction to perform two 64-bit floating-point additions

addpd %xmm0, %xmm4

becomes

vaddpd %ymm0, %ymm4

which now produces four 64-bit floating-point additions.

Elaboration: AVX also added three address instructions to x86. For example, vaddpd
can now specify

vaddpd %ymm0, %ymm1, %ymm4 # %ymm4 = %ymm1 + %ymm0

instead of the standard two address version

addpd %xmm0, %xmm4 # %xmm4 = %xmm4 + %xmm0

(Unlike MIPS, the destination is on the right in x86.) Three addresses can reduce the
number of registers and instructions needed for a computation.

3.8 Going Faster: Subword Parallelism and Matrix Multiply

To demonstrate the performance impact of subword parallelism, we’ll run the same
code on the Intel Core i7 first without AVX and then with it. Figure 3.21 shows an
unoptimized version of a matrix-matrix multiply written in C. As we saw in Section
3.5, this program is commonly called DGEMM, which stands for Double precision
GEneral Matrix Multiply. Starting with this edition, we have added a new section
entitled “Going Faster” to demonstrate the performance benefit of adapting software
to the underlying hardware, in this case the Sandy Bridge version of the Intel Core
i7 microprocessor. This new section in Chapters 3, 4, 5, and 6 will incrementally
improve DGEMM performance using the ideas that each chapter introduces.

Figure 3.22 shows the x86 assembly language output for the inner loop of Figure
3.21. The five floating-point instructions start with a v like the AVX instructions,
but note that they use the XMM registers instead of YMM, and they include sd in
the name, which stands for scalar double precision. We’ll define the subword parallel
instructions shortly.


FIGURE 3.21 Unoptimized C version of a double precision matrix multiply, widely known as DGEMM for
Double-precision GEneral Matrix Multiply (GEMM). Because we are passing the matrix dimension as the parameter
n, this version of DGEMM uses single dimensional versions of matrices C, A, and B and address arithmetic to get better
performance instead of using the more intuitive two-dimensional arrays that we saw in Section 3.5. The comments remind
us of this more intuitive notation.

1.  void dgemm (int n, double* A, double* B, double* C)
2.  {
3.    for (int i = 0; i < n; ++i)
4.      for (int j = 0; j < n; ++j)
5.      {
6.        double cij = C[i+j*n]; /* cij = C[i][j] */
7.        for( int k = 0; k < n; k++ )
8.          cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
9.        C[i+j*n] = cij; /* C[i][j] = cij */
10.     }
11. }

FIGURE 3.22 The x86 assembly language for the body of the nested loops generated by compiling the
unoptimized C code in Figure 3.21. Although it is dealing with just 64 bits of data, the compiler uses the AVX version of
the instructions instead of SSE2, presumably so that it can use three addresses per instruction instead of two (see the Elaboration
in Section 3.7).

1.  vmovsd (%r10),%xmm0             # Load 1 element of C into %xmm0
2.  mov    %rsi,%rcx                # register %rcx = %rsi
3.  xor    %eax,%eax                # register %eax = 0
4.  vmovsd (%rcx),%xmm1             # Load 1 element of B into %xmm1
5.  add    %r9,%rcx                 # register %rcx = %rcx + %r9
6.  vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A
7.  add    $0x1,%rax                # register %rax = %rax + 1
8.  cmp    %eax,%edi                # compare %eax to %edi
9.  vaddsd %xmm1,%xmm0,%xmm0        # Add %xmm1, %xmm0
10. jg     30                       # jump if %edi > %eax
11. add    $0x1,%r11d               # register %r11 = %r11 + 1
12. vmovsd %xmm0,(%r10)             # Store %xmm0 into C element
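A quick way to convince yourself that the address arithmetic in Figure 3.21 matches the comments is to run it on a tiny case. The sketch below repeats the loop nest of Figure 3.21 without its line numbers and adds a hypothetical helper, dgemm_2x2_element, that is not part of the original figure. Because element (i, j) lives at index i + j*n, the 2 × 2 test matrices are written out column by column, and C starts at zero since the routine accumulates into it.

```c
#include <assert.h>

/* Same loop nest as Figure 3.21: C = C + A * B for n x n matrices. */
void dgemm(int n, double *A, double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double cij = C[i + j * n];              /* cij = C[i][j] */
            for (int k = 0; k < n; k++)
                cij += A[i + k * n] * B[k + j * n]; /* cij += A[i][k]*B[k][j] */
            C[i + j * n] = cij;                     /* C[i][j] = cij */
        }
}

/* Hypothetical helper: multiply two fixed 2 x 2 matrices and return
   element (i, j) of the product. */
double dgemm_2x2_element(int i, int j)
{
    double A[4] = {1.0, 3.0, 2.0, 4.0};   /* A = [1 2; 3 4], stored by columns */
    double B[4] = {5.0, 7.0, 6.0, 8.0};   /* B = [5 6; 7 8], stored by columns */
    double C[4] = {0.0, 0.0, 0.0, 0.0};   /* start from zero: dgemm adds into C */
    dgemm(2, A, B, C);
    return C[i + j * 2];
}
```

The product of these two matrices is [19 22; 43 50], so dgemm_2x2_element(0, 1) returns 22.0 and dgemm_2x2_element(1, 0) returns 43.0.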


FIGURE 3.23 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel
instructions for the x86. Figure 3.24 shows the assembly language produced by the compiler for the inner loop.

1.  #include <x86intrin.h>
2.  void dgemm (int n, double* A, double* B, double* C)
3.  {
4.    for ( int i = 0; i < n; i+=4 )
5.      for ( int j = 0; j < n; j++ ) {
6.        __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */
7.        for( int k = 0; k < n; k++ )
8.          c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
9.                 _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
10.                _mm256_broadcast_sd(B+k+j*n)));
11.       _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
12.     }
13. }

While compiler writers may eventually be able to routinely produce high-quality
code that uses the AVX instructions of the x86, for now we must “cheat” by
using C intrinsics that more or less tell the compiler exactly how to produce good
code. Figure 3.23 shows the enhanced version of Figure 3.21 for which the GNU C
compiler produces AVX code. Figure 3.24 shows annotated x86 code that is the
output of compiling using gcc with the -O3 level of optimization.

The declaration on line 6 of Figure 3.23 uses the __m256d data type, which tells
the compiler the variable will hold 4 double-precision floating-point values. The
intrinsic _mm256_load_pd() also on line 6 uses AVX instructions to load 4
double-precision floating-point numbers in parallel (_pd) from the matrix C into
c0. The address calculation C+i+j*n on line 6 represents element C[i+j*n].
Symmetrically, the final step on line 11 uses the intrinsic _mm256_store_pd()
to store 4 double-precision floating-point numbers from c0 into the matrix C.
As we’re going through 4 elements each iteration, the outer for loop on line 4
increments i by 4 instead of by 1 as on line 3 of Figure 3.21.

Inside the loops, on line 9 we first load 4 elements of A again using
_mm256_load_pd(). To multiply these elements by one element of B, on line 10 we first
use the intrinsic _mm256_broadcast_sd(), which makes 4 identical copies
of the scalar double precision number (in this case an element of B) in one of the
YMM registers. We then use _mm256_mul_pd() on line 9 to multiply the four
double-precision results in parallel. Finally, _mm256_add_pd() on line 8 adds
the 4 products to the 4 sums in c0.

Figure 3.24 shows resulting x86 code for the body of the inner loops produced
by the compiler. You can see the five AVX instructions that correspond to the C
intrinsics mentioned above: they all start with v, and four of the five use pd for
parallel double precision. The code is very similar to that in Figure 3.22 above:
both use 12 instructions, the integer instructions are nearly identical (but different
registers), and the floating-point instruction differences are generally just going
from scalar double (sd) using XMM registers to parallel double (pd) with YMM
registers. The one exception is line 4 of Figure 3.24. Every element of A must be
multiplied by one element of B. One solution is to place four identical copies of the
64-bit B element side-by-side into the 256-bit YMM register, which is just what the
instruction vbroadcastsd does.

FIGURE 3.24 The x86 assembly language for the body of the nested loops generated by compiling the
optimized C code in Figure 3.23. Note the similarities to Figure 3.22, with the primary difference being that the
five floating-point operations are now using YMM registers and using the pd versions of the instructions for parallel
double precision instead of the sd version for scalar double precision.

1.  vmovapd      (%r11),%ymm0        # Load 4 elements of C into %ymm0
2.  mov          %rbx,%rcx           # register %rcx = %rbx
3.  xor          %eax,%eax           # register %eax = 0
4.  vbroadcastsd (%rax,%r8,1),%ymm1  # Make 4 copies of B element
5.  add          $0x8,%rax           # register %rax = %rax + 8
6.  vmulpd       (%rcx),%ymm1,%ymm1  # Parallel mul %ymm1,4 A elements
7.  add          %r9,%rcx            # register %rcx = %rcx + %r9
8.  cmp          %r10,%rax           # compare %r10 to %rax
9.  vaddpd       %ymm1,%ymm0,%ymm0   # Parallel add %ymm1, %ymm0
10. jne          50                  # jump if %r10 != %rax
11. add          $0x1,%esi           # register %esi = %esi + 1
12. vmovapd      %ymm0,(%r11)        # Store %ymm0 into 4 C elements

For matrices of dimensions of 32 by 32, the unoptimized DGEMM in Figure 3.21
runs at 1.7 GigaFLOPS (FLoating point Operations Per Second) on one core of a
2.6 GHz Intel Core i7 (Sandy Bridge). The optimized code in Figure 3.23 performs
at 6.4 GigaFLOPS. The AVX version is 3.85 times as fast, which is very close to the
factor of 4.0 increase that you might hope for from performing 4 times as many
operations at a time by using subword parallelism.

Elaboration: As mentioned in the Elaboration in Section 1.6, Intel offers Turbo mode
that temporarily runs at a higher clock rate until the chip gets too hot. This Intel Core i7
(Sandy Bridge) can increase from 2.6 GHz to 3.3 GHz in Turbo mode. The results above
are with Turbo mode turned off. If we turn it on, we improve all the results by the increase
in the clock rate of 3.3/2.6 = 1.27 to 2.1 GFLOPS for unoptimized DGEMM and 8.1
GFLOPS with AVX. Turbo mode works particularly well when using only a single core of an
eight-core chip, as in this case, as it lets that single core use much more than its fair
share of power since the other cores are idle.


3.9 Fallacies and Pitfalls

Arithmetic fallacies and pitfalls generally stem from the difference between the
limited precision of computer arithmetic and the unlimited precision of natural
arithmetic.

Fallacy: Just as a left shift instruction can replace an integer multiply by a
power of 2, a right shift is the same as an integer division by a power of 2.

Recall that a binary number c, where $x_i$ means the ith bit, represents the number

$$\ldots + (x_3 \times 2^3) + (x_2 \times 2^2) + (x_1 \times 2^1) + (x_0 \times 2^0)$$

Shifting the bits of c right by n bits would seem to be the same as dividing by
$2^n$. And this is true for unsigned integers. The problem is with signed integers. For
example, suppose we want to divide $-5_{\text{ten}}$ by $4_{\text{ten}}$; the quotient should be $-1_{\text{ten}}$. The
two’s complement representation of $-5_{\text{ten}}$ is

1111 1111 1111 1111 1111 1111 1111 1011two

According to this fallacy, shifting right by two should divide by $4_{\text{ten}}$ ($2^2$):

0011 1111 1111 1111 1111 1111 1111 1110two

With a 0 in the sign bit, this result is clearly wrong. The value created by the shift
right is actually $1{,}073{,}741{,}822_{\text{ten}}$ instead of $-1_{\text{ten}}$.

A solution would be to have an arithmetic right shift that extends the sign bit
instead of shifting in 0s. A 2-bit arithmetic shift right of $-5_{\text{ten}}$ produces

1111 1111 1111 1111 1111 1111 1111 1110two

The result is $-2_{\text{ten}}$ instead of $-1_{\text{ten}}$; close, but no cigar.
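The three outcomes above can be reproduced in C. This is a sketch with hypothetical function names. Two caveats apply: C integer division truncates toward zero, which is what gives −1 here, and right-shifting a negative signed value is implementation-defined in C, although common compilers such as GCC and Clang perform an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Integer division in C truncates toward zero: -5 / 4 is -1. */
int32_t divide_by_4(int32_t x)
{
    return x / 4;
}

/* Logical shift right: zeros enter from the left.
   C guarantees this behavior for unsigned types. */
uint32_t logical_shift_right_2(int32_t x)
{
    return (uint32_t)x >> 2;
}

/* Shift of a negative signed value: implementation-defined in C.
   Assumption: the compiler replicates the sign bit (arithmetic shift). */
int32_t arithmetic_shift_right_2(int32_t x)
{
    return x >> 2;
}
```

For x = −5 the division returns −1, the logical shift returns 1,073,741,822, and (on compilers that shift arithmetically) the arithmetic shift returns −2. For nonnegative values all three agree, which is why the fallacy is easy to believe.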

Pitfall: Floating-point addition is not associative.

Associativity holds for a sequence of two’s complement integer additions, even if the
computation overflows. Alas, because floating-point numbers are approximations
of real numbers and because computer arithmetic has limited precision, it does
not hold for floating-point numbers. Given the great range of numbers that can be
represented in floating point, problems occur when adding two large numbers of
opposite signs plus a small number. For example, let’s see if $c + (a + b) = (c + a) + b$.
Assume $c = -1.5_{\text{ten}} \times 10^{38}$, $a = 1.5_{\text{ten}} \times 10^{38}$, and $b = 1.0$, and that these are
all single precision numbers.

Thus mathematics may be defined as the subject in which we never know what we
are talking about, nor whether what we are saying is true.
Bertrand Russell, Recent Words on the Principles of Mathematics, 1901


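$$
\begin{aligned}
c + (a + b) &= -1.5_{\text{ten}} \times 10^{38} + (1.5_{\text{ten}} \times 10^{38} + 1.0) \\
            &= -1.5_{\text{ten}} \times 10^{38} + (1.5_{\text{ten}} \times 10^{38}) \\
            &= 0.0 \\
(c + a) + b &= (-1.5_{\text{ten}} \times 10^{38} + 1.5_{\text{ten}} \times 10^{38}) + 1.0 \\
            &= (0.0_{\text{ten}}) + 1.0 \\
            &= 1.0
\end{aligned}
$$

Since floating-point numbers have limited precision and result in approximations
of real results, $1.5_{\text{ten}} \times 10^{38}$ is so much larger than $1.0_{\text{ten}}$ that $1.5_{\text{ten}} \times 10^{38} + 1.0$ is still
$1.5_{\text{ten}} \times 10^{38}$. That is why the sum of c, a, and b is 0.0 or 1.0, depending on the order
of the floating-point additions, so $c + (a + b) \neq (c + a) + b$. Therefore,
floating-point addition is not associative.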
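The same experiment takes only a few lines of C. The function names below are hypothetical, and the sketch assumes IEEE 754 single precision for float; volatile is used so that every intermediate sum is really stored in single precision rather than kept in a wider register.

```c
#include <assert.h>

/* Computes c + (a + b), rounding each intermediate to single precision. */
float sum_c_then_ab(float c, float a, float b)
{
    volatile float ab = a + b;    /* 1.5e38 + 1.0 rounds back to 1.5e38 */
    volatile float r = c + ab;
    return r;
}

/* Computes (c + a) + b, rounding each intermediate to single precision. */
float sum_ca_then_b(float c, float a, float b)
{
    volatile float ca = c + a;    /* the two large values cancel exactly */
    volatile float r = ca + b;
    return r;
}
```

With c = −1.5e38f, a = 1.5e38f, and b = 1.0f, the first function returns 0.0 and the second returns 1.0, the two answers derived above.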

Fallacy: Parallel execution strategies that work for integer data types also work
for floating-point data types.

Programs have typically been written first to run sequentially before being rewritten
to run concurrently, so a natural question is, “Do the two versions get the same
answer?” If the answer is no, you presume there is a bug in the parallel version that
you need to track down.

This approach assumes that computer arithmetic does not affect the results when
going from sequential to parallel. That is, if you were to add a million numbers
together, you would get the same results whether you used 1 processor or 1000
processors. This assumption holds for two’s complement integers, since integer
addition is associative. Alas, since floating-point addition is not associative, the
assumption does not hold.

A more vexing version of this fallacy occurs on a parallel computer where the
operating system scheduler may use a different number of processors depending
on what other programs are running. Because a varying number of processors
from run to run would cause the floating-point sums to be calculated in different
orders, getting slightly different answers each time despite running identical code
with identical input may flummox unaware parallel programmers.
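
The effect of summation order can be imitated on one processor. The sketch below is an illustration under our own assumptions, not code from a real parallel runtime: `sum_chunked` stands in for several processors that each reduce a contiguous slice before the partial sums are combined, `float` is assumed to be IEEE 754 single precision, and every name (including the helper `fill_example`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums x[0..n-1] left to right, as a single processor would. */
float sum_sequential(const float *x, size_t n)
{
    volatile float s = 0.0f;      /* volatile: round every step to float */
    for (size_t i = 0; i < n; i++)
        s = s + x[i];
    return s;
}

/* Sums x in `parts` contiguous chunks, then adds the partial sums in order.
   Assumes parts > 0 and that n is a multiple of parts. */
float sum_chunked(const float *x, size_t n, size_t parts)
{
    size_t len = n / parts;
    volatile float total = 0.0f;
    for (size_t p = 0; p < parts; p++)
        total = total + sum_sequential(x + p * len, len);
    return total;
}

/* The same two strategies for unsigned integers, which wrap modulo 2^32. */
uint32_t isum_sequential(const uint32_t *x, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

uint32_t isum_chunked(const uint32_t *x, size_t n, size_t parts)
{
    size_t len = n / parts;
    uint32_t total = 0;
    for (size_t p = 0; p < parts; p++)
        total += isum_sequential(x + p * len, len);
    return total;
}

/* Example data: one large value followed by ones. */
void fill_example(float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = 1.0f;
    if (n > 0)
        x[0] = 1.0e8f;
}
```

For 1024 elements produced by `fill_example`, the sequential sum stays at 100000000 because each added 1.0 is less than half the spacing (8) between adjacent single precision values near $10^{8}$, while the four-chunk sum is 100000768, since the three chunks of ones are first reduced to 256 each. The integer versions always agree, even when the total wraps.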

Given this quandary, programmers who write parallel code with floating-point
numbers need to verify whether the results are credible even if they don’t give the
same exact answer as the sequential code. The field that deals with such issues is
called numerical analysis, which is the subject of textbooks in its own right. Such
concerns are one reason for the popularity of numerical libraries such as LAPACK
and ScaLAPACK, which have been validated in both their sequential and parallel
forms.

Pitfall: The MIPS instruction add immediate unsigned (addiu) sign-extends
its 16-bit immediate field.

3.9 Fallacies and Pitfalls 231

Despite its name, add immediate unsigned (addiu) is used to add constants to
signed integers when we don’t care about overflow. MIPS has no subtract immediate
instruction, and negative numbers need sign extension, so the MIPS architects
decided to sign-extend the immediate field.
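
A small model makes the consequence concrete. This is a sketch in C of the behavior described above, not a description of any particular MIPS implementation; the helper names `sign_extend16` and `addiu` are ours.

```c
#include <stdint.h>

/* Sign-extends a 16-bit immediate field to a 32-bit value. */
int32_t sign_extend16(uint16_t imm)
{
    int32_t v = imm;            /* 0 .. 65535 */
    if (v & 0x8000)
        v -= 0x10000;           /* bit 15 set: the field encodes a negative */
    return v;
}

/* Models addiu: rs plus the sign-extended immediate, wrapping modulo 2^32
   and never signaling overflow. */
uint32_t addiu(uint32_t rs, uint16_t imm)
{
    return (uint32_t)(rs + (uint32_t)sign_extend16(imm));
}
```

With this model, an immediate field of 0xFFFF acts as −1, so `addiu(10, 0xFFFF)` yields 9 rather than 10 + 65535. That is exactly how a compiler can use addiu to subtract a small constant.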

Fallacy: Only theoretical mathematicians care about floating-point accuracy.

Newspaper headlines of November 1994 prove this statement is a fallacy (see
Figure 3.25). The following is the inside story behind the headlines.

The Pentium used a standard floating-point divide algorithm that generates
multiple quotient bits per step, using the most significant bits of divisor and
dividend to guess the next 2 bits of the quotient. The guess is taken from a lookup
table containing −2, −1, 0, +1, or +2. The guess is multiplied by the divisor and
subtracted from the remainder to generate a new remainder. Like nonrestoring
division, if a previous guess gets too large a remainder, the partial remainder is
adjusted in a subsequent pass.

Evidently, there were five elements of the table from the 80486 that Intel
engineers thought could never be accessed, and they optimized the logic to return
0 instead of 2 in these situations on the Pentium. Intel was wrong: while the first 11
bits were always correct, errors would show up occasionally in bits 12 to 52, or the
4th to 15th decimal digits.

FIGURE 3.25 A sampling of newspaper and magazine articles from November 1994,
including the New York Times, San Jose Mercury News, San Francisco Chronicle, and
Infoworld. The Pentium floating-point divide bug even made the “Top 10 List” of the David Letterman
Late Show on television. Intel eventually took a $300 million write-off to replace the buggy chips.

A math professor at Lynchburg College in Virginia, Thomas Nicely, discovered the
bug in September 1994. After calling Intel technical support and getting no official
reaction, he posted his discovery on the Internet. This post led to a story in a trade
magazine, which in turn caused Intel to issue a press release. It called the bug a glitch
that would affect only theoretical mathematicians, with the average spreadsheet
user seeing an error every 27,000 years. IBM Research soon counterclaimed that the
average spreadsheet user would see an error every 24 days. Intel soon threw in the
towel by making the following announcement on December 21:

“We at Intel wish to sincerely apologize for our handling of the recently publicized
Pentium processor flaw. The Intel Inside symbol means that your computer has
a microprocessor second to none in quality and performance. Thousands of Intel
employees work very hard to ensure that this is true. But no microprocessor is
ever perfect. What Intel continues to believe is technically an extremely minor
problem has taken on a life of its own. Although Intel firmly stands behind the
quality of the current version of the Pentium processor, we recognize that many
users have concerns. We want to resolve these concerns. Intel will exchange the
current version of the Pentium processor for an updated version, in which this
floating-point divide flaw is corrected, for any owner who requests it, free of
charge anytime during the life of their computer.”

Analysts estimate that this recall cost Intel $500 million, and Intel engineers did not
get a Christmas bonus that year.

This story brings up a few points for everyone to ponder. How much cheaper
would it have been to fix the bug in July 1994? What was the cost to repair the
damage to Intel’s reputation? And what is the corporate responsibility in disclosing
bugs in a product so widely used and relied upon as a microprocessor?

3.10 Concluding Remarks

Over the decades, computer arithmetic has become largely standardized, greatly
enhancing the portability of programs. Two’s complement binary integer arithmetic is
found in every computer sold today, and any computer that includes floating-point
support offers the IEEE 754 binary floating-point arithmetic.

Computer arithmetic is distinguished from paper-and-pencil arithmetic by the
constraints of limited precision. This limit may result in invalid operations through
calculating numbers larger or smaller than the predefined limits. Such anomalies, called
“overflow” or “underflow,” may result in exceptions or interrupts, emergency events
similar to unplanned subroutine calls. Chapters 4 and 5 discuss exceptions in more detail.
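
To make the integer case concrete, here is a minimal sketch in C of the test for two’s complement addition overflow: the operands have the same sign and the sum has the opposite sign. The function name `add_overflows` is ours. The sum is formed in unsigned arithmetic because signed overflow is undefined behavior in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a + b does not fit in a 32-bit two's complement integer. */
bool add_overflows(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;     /* wraps modulo 2^32, never traps */

    /* Bit 31 of (ua ^ sum) & (ub ^ sum) is 1 exactly when both operands
       share a sign bit that differs from the sign bit of the sum. */
    return (((ua ^ sum) & (ub ^ sum)) >> 31) != 0;
}
```

Hardware that raises an exception on overflow performs the equivalent check on the sign bits; software that wants to avoid the exception can test first, as above.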

Floating-point arithmetic has the added challenge of being an approximation
of real numbers, and care needs to be taken to ensure that the computer number
selected is the representation closest to the actual number. The challenges of
imprecision and limited representation of floating point are part of the inspiration
for the field of numerical analysis. The recent switch to parallelism shines the
searchlight on numerical analysis again, as solutions that were long considered
safe on sequential computers must be reconsidered when trying to find the fastest
algorithm for parallel computers that still achieves a correct result.

Data-level parallelism, specifically subword parallelism, offers a simple path to
higher performance for programs that are intensive in arithmetic operations for
either integer or floating-point data. We showed that we could speed up matrix
multiply nearly fourfold by using instructions that could execute four
floating-point operations at a time.
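
The idea behind subword parallelism can be imitated in portable C without any SIMD instructions. The sketch below is an illustration of the principle only, not the instructions used for the matrix multiply: it treats one 32-bit word as four independent 8-bit lanes and adds all four at once while keeping carries from crossing lane boundaries. The name `add_4x8` is ours.

```c
#include <stdint.h>

/* Adds four 8-bit lanes packed into 32-bit words. Each lane wraps
   modulo 256 and no carry propagates into the neighboring lane. */
uint32_t add_4x8(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu;  /* low 7 bits of every lane */
    const uint32_t top  = 0x80808080u;  /* top bit of every lane */

    /* Add the low 7 bits of each lane; the largest result, 0xFE,
       still fits in the lane, so nothing spills over. */
    uint32_t partial = (a & low7) + (b & low7);

    /* Each top bit is a XOR b XOR carry-in; the carry-in is already
       sitting in the top bit of partial. */
    return partial ^ ((a ^ b) & top);
}
```

A real SIMD unit does the same job by cutting the carry chain of a wide adder at the lane boundaries, which is why one instruction can perform several narrow operations.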

With the explanation of computer arithmetic in this chapter comes a description
of much more of the MIPS instruction set. One point of confusion is the instructions
covered in these chapters versus instructions executed by MIPS chips versus the
instructions accepted by MIPS assemblers. Two figures try to make this clear.

Figure 3.26 lists the MIPS instructions covered in this chapter and Chapter 2.
We call the set of instructions on the left-hand side of the figure the MIPS core. The
instructions on the right we call the MIPS arithmetic core. On the left of Figure 3.27
are the instructions the MIPS processor executes that are not found in Figure 3.26.
We call the full set of hardware instructions MIPS-32. On the right of Figure 3.27
are the instructions accepted by the assembler that are not part of MIPS-32. We call
this set of instructions Pseudo MIPS.

Figure 3.28 gives the popularity of the MIPS instructions for SPEC CPU2006
integer and floating-point benchmarks. All instructions are listed that were
responsible for at least 0.2% of the instructions executed.

Note that although programmers and compiler writers may use MIPS-32 to
have a richer menu of options, MIPS core instructions dominate integer SPEC
CPU2006 execution, and the integer core plus arithmetic core dominate SPEC
CPU2006 floating point, as the table below shows.

Instruction subset       Integer   Fl. pt.
MIPS core                  98%       31%
MIPS arithmetic core        2%       66%
Remaining MIPS-32           0%        3%

For the rest of the book, we concentrate on the MIPS core instructions (the integer
instruction set excluding multiply and divide) to make the explanation of computer
design easier. As you can see, the MIPS core includes the most popular MIPS
instructions; be assured that understanding a computer that runs the MIPS core
will give you sufficient background to understand even more ambitious computers.
No matter what the instruction set or its size (MIPS, ARM, x86), never forget that
bit patterns have no inherent meaning. The same bit pattern may represent a signed
integer, unsigned integer, floating-point number, string, instruction, and so on. In
stored program computers, it is the operation on the bit pattern that determines its
meaning.
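
A short C sketch shows one word read three ways. It assumes that `float` is 32-bit IEEE 754 and that integers and floats share a byte order, which holds on mainstream machines; the pattern 0xBF800000 and the helper names are our own choices for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets the 32 bits as a two's complement integer. */
int32_t as_signed(uint32_t bits)
{
    int32_t v;
    memcpy(&v, &bits, sizeof v);    /* copy the bits, do not convert */
    return v;
}

/* Reinterprets the 32 bits as a single precision floating-point number. */
float as_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);    /* copy the bits, do not convert */
    return f;
}
```

For the pattern 0xBF800000, the unsigned reading is 3,212,836,864, `as_signed` gives −1,082,130,432, and `as_float` gives −1.0 (sign 1, exponent field 127, fraction 0). Nothing in the word itself says which reading is intended; the instruction that operates on it decides.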


MIPS core instructions             Name   Format  MIPS arithmetic core                  Name   Format

add                                add    R       multiply                              mult   R
add immediate                      addi   I       multiply unsigned                     multu  R
add unsigned                       addu   R       divide                                div    R
add immediate unsigned             addiu  I       divide unsigned                       divu   R
subtract                           sub    R       move from Hi                          mfhi   R
subtract unsigned                  subu   R       move from Lo                          mflo   R
AND                                AND    R       move from system control (EPC)        mfc0   R
AND immediate                      ANDi   I       floating-point add single             add.s  R
OR                                 OR     R       floating-point add double             add.d  R
OR immediate                       ORi    I       floating-point subtract single        sub.s  R
NOR                                NOR    R       floating-point subtract double        sub.d  R
shift left logical                 sll    R       floating-point multiply single        mul.s  R
shift right logical                srl    R       floating-point multiply double        mul.d  R
load upper immediate               lui    I       floating-point divide single          div.s  R
load word                          lw     I       floating-point divide double          div.d  R
store word                         sw     I       load word to floating-point single    lwc1   I
load halfword unsigned             lhu    I       store word to floating-point single   swc1   I
store halfword                     sh     I       load word to floating-point double    ldc1   I
load byte unsigned                 lbu    I       store word to floating-point double   sdc1   I
store byte                         sb     I       branch on floating-point true         bc1t   I
load linked (atomic update)        ll     I       branch on floating-point false        bc1f   I
store cond. (atomic update)        sc     I       floating-point compare single         c.x.s  R
branch on equal                    beq    I         (x = eq, neq, lt, le, gt, ge)
branch on not equal                bne    I       floating-point compare double         c.x.d  R
jump                               j      J         (x = eq, neq, lt, le, gt, ge)
jump and link                      jal    J
jump register                      jr     R
set less than                      slt    R
set less than immediate            slti   I
set less than unsigned             sltu   R
set less than immediate unsigned   sltiu  I

FIGURE 3.26 The MIPS instruction set. This book concentrates on the instructions in the left column. This information is also found
in columns 1 and 2 of the MIPS Reference Data Card at the front of this book.


Remaining MIPS-32                   Name     Format  Pseudo MIPS                                      Name   Format

exclusive or (rs ⊕ rt)              xor      R       absolute value                                   abs    rd,rs
exclusive or immediate              xori     I       negate (signed or unsigned)                      negs   rd,rs
shift right arithmetic              sra      R       rotate left                                      rol    rd,rs,rt
shift left logical variable         sllv     R       rotate right                                     ror    rd,rs,rt
shift right logical variable        srlv     R       multiply and don’t check oflw (signed or uns.)   muls   rd,rs,rt
shift right arithmetic variable     srav     R       multiply and check oflw (signed or uns.)         mulos  rd,rs,rt
move to Hi                          mthi     R       divide and check overflow                        div    rd,rs,rt
move to Lo                          mtlo     R       divide and don’t check overflow                  divu   rd,rs,rt
load halfword                       lh       I       remainder (signed or unsigned)                   rems   rd,rs,rt
load byte                           lb       I       load immediate                                   li     rd,imm
load word left (unaligned)          lwl      I       load address                                     la     rd,addr
load word right (unaligned)         lwr      I       load double                                      ld     rd,addr
store word left (unaligned)         swl      I       store double                                     sd     rd,addr
store word right (unaligned)        swr      I       unaligned load word                              ulw    rd,addr
load linked (atomic update)         ll       I       unaligned store word                             usw    rd,addr
store cond. (atomic update)         sc       I       unaligned load halfword (signed or uns.)         ulhs   rd,addr
move if zero                        movz     R       unaligned store halfword                         ush    rd,addr
move if not zero                    movn     R       branch                                           b      Label
multiply and add (S or uns.)        madds    R       branch on equal zero                             beqz   rs,L
multiply and subtract (S or uns.)   msubs    I       branch on compare (signed or unsigned)           bxs    rs,rt,L
branch on ≥ zero and link           bgezal   I         (x = lt, le, gt, ge)
branch on < zero and link           bltzal   I       set equal                                        seq    rd,rs,rt
jump and link register              jalr     R       set not equal                                    sne    rd,rs,rt
branch compare to zero              bxz      I       set on compare (signed or unsigned)              sxs    rd,rs,rt
branch compare to zero likely       bxzl     I         (x = lt, le, gt, ge)
  (x = lt, le, gt, ge)                               load to floating point (s or d)                  l.f    rd,addr
branch compare reg likely           bxl      I       store from floating point (s or d)               s.f    rd,addr
trap if compare reg                 tx       R
trap if compare immediate           txi      I
  (x = eq, neq, lt, le, gt, ge)
return from exception               rfe      R
system call                         syscall  I
break (cause exception)             break    I
move from FP to integer             mfc1     R
move to FP from integer             mtc1     R
FP move (s or d)                    mov.f    R
FP move if zero (s or d)            movz.f   R
FP move if not zero (s or d)        movn.f   R
FP square root (s or d)             sqrt.f   R
FP absolute value (s or d)          abs.f    R
FP negate (s or d)                  neg.f    R
FP convert (w, s, or d)             cvt.f.f  R
FP compare un (s or d)              c.xn.f   R

FIGURE 3.27 Remaining MIPS-32 and Pseudo MIPS instruction sets. f means single (s) or double (d) precision
floating-point instructions, and s means signed and unsigned (u) versions. MIPS-32 also has FP instructions for multiply
and add/sub (madd.f/msub.f), ceiling (ceil.f), truncate (trunc.f), round (round.f), and reciprocal (recip.f). The underscore
represents the letter to include to represent that datatype.

Core MIPS                     Name   Integer  Fl. pt.  Arithmetic core + MIPS-32        Name   Integer  Fl. pt.

add                           add     0.0%     0.0%    FP add double                    add.d   0.0%    10.6%
add immediate                 addi    0.0%     0.0%    FP subtract double               sub.d   0.0%     4.9%
add unsigned                  addu    5.2%     3.5%    FP multiply double               mul.d   0.0%    15.0%
add immediate unsigned        addiu   9.0%     7.2%    FP divide double                 div.d   0.0%     0.2%
subtract unsigned             subu    2.2%     0.6%    FP add single                    add.s   0.0%     1.5%
AND                           AND     0.2%     0.1%    FP subtract single               sub.s   0.0%     1.8%
AND immediate                 ANDi    0.7%     0.2%    FP multiply single               mul.s   0.0%     2.4%
OR                            OR      4.0%     1.2%    FP divide single                 div.s   0.0%     0.2%
OR immediate                  ORi     1.0%     0.2%    load word to FP double           l.d     0.0%    17.5%
NOR                           NOR     0.4%     0.2%    store word to FP double          s.d     0.0%     4.9%
shift left logical            sll     4.4%     1.9%    load word to FP single           l.s     0.0%     4.2%
shift right logical           srl     1.1%     0.5%    store word to FP single          s.s     0.0%     1.1%
load upper immediate          lui     3.3%     0.5%    branch on floating-point true    bc1t    0.0%     0.2%
load word                     lw     18.6%     5.8%    branch on floating-point false   bc1f    0.0%     0.2%
store word                    sw      7.6%     2.0%    floating-point compare double    c.x.d   0.0%     0.6%
load byte                     lbu     3.7%     0.1%    multiply                         mul     0.0%     0.2%
store byte                    sb      0.6%     0.0%    shift right arithmetic           sra     0.5%     0.3%
branch on equal (zero)        beq     8.6%     2.2%    load half                        lhu     1.3%     0.0%
branch on not equal (zero)    bne     8.4%     1.4%    store half                       sh      0.1%     0.0%
jump and link                 jal     0.7%     0.2%
jump register                 jr      1.1%     0.2%
set less than                 slt     9.9%     2.3%
set less than immediate       slti    3.1%     0.3%
set less than unsigned        sltu    3.4%     0.8%
set less than imm. uns.       sltiu   1.1%     0.1%

FIGURE 3.28 The frequency of the MIPS instructions for SPEC CPU2006 integer and floating point. All
instructions that accounted for at least 0.2% of the instructions are included in the table. Pseudoinstructions are converted
into MIPS-32 before execution, and hence do not appear here.

3.11 Historical Perspective and Further Reading

This section surveys the history of the floating point going back to von Neumann,
including the surprisingly controversial IEEE standards effort, plus the rationale
for the 80-bit stack architecture for floating point in the x86. See the rest of
Section 3.11 online.

Gresham’s Law (“Bad money drives out Good”) for computers would say, “The
Fast drives out the Slow even if the Fast is wrong.”
W. Kahan, 1992

3.12 Exercises

3.1 [5] <§3.2> What is 5ED4 − 07A4 when these values represent unsigned 16-bit
hexadecimal numbers? The result should be written in hexadecimal. Show your
work.

3.2 [5] <§3.2> What is 5ED4 − 07A4 when these values represent signed 16-bit
hexadecimal numbers stored in sign-magnitude format? The result should be
written in hexadecimal. Show your work.

3.3 [10] <§3.2> Convert 5ED4 into a binary number. What makes base 16
(hexadecimal) an attractive numbering system for representing values in
computers?

3.4 [5] <§3.2> What is 4365 − 3412 when these values represent unsigned 12-bit
octal numbers? The result should be written in octal. Show your work.

3.5 [5] <§3.2> What is 4365 − 3412 when these values represent signed 12-bit
octal numbers stored in sign-magnitude format? The result should be written in
octal. Show your work.

3.6 [5] <§3.2> Assume 185 and 122 are unsigned 8-bit decimal integers. Calculate
185 − 122. Is there overflow, underflow, or neither?

3.7 [5] <§3.2> Assume 185 and 122 are signed 8-bit decimal integers stored in
sign-magnitude format. Calculate 185 + 122. Is there overflow, underflow, or
neither?

3.8 [5] <§3.2> Assume 185 and 122 are signed 8-bit decimal integers stored in
sign-magnitude format. Calculate 185 − 122. Is there overflow, underflow, or
neither?

3.9 [10] <§3.2> Assume 151 and 214 are signed 8-bit decimal integers stored in
two’s complement format. Calculate 151 + 214 using saturating arithmetic. The
result should be written in decimal. Show your work.

3.10 [10] <§3.2> Assume 151 and 214 are signed 8-bit decimal integers stored in
two’s complement format. Calculate 151 − 214 using saturating arithmetic. The
result should be written in decimal. Show your work.

3.11 [10] <§3.2> Assume 151 and 214 are unsigned 8-bit integers. Calculate 151
+ 214 using saturating arithmetic. The result should be written in decimal. Show
your work.

3.12 [20] <§3.3> Using a table similar to that shown in Figure 3.6, calculate the
product of the octal unsigned 6-bit integers 62 and 12 using the hardware described
in Figure 3.3. You should show the contents of each register on each step.

Never give in, never
give in, never, never,
never—in nothing,
great or small, large or
petty—never give in.
Winston Churchill,
address at Harrow
School, 1941


3.13 [20] <§3.3> Using a table similar to that shown in Figure 3.6, calculate the
product of the hexadecimal unsigned 8-bit integers 62 and 12 using the hardware
described in Figure 3.5. You should show the contents of each register on each step.

3.14 [10] <§3.3> Calculate the time necessary to perform a multiply using the
approach given in Figures 3.3 and 3.4 if an integer is 8 bits wide and each step
of the operation takes 4 time units. Assume that in step 1a an addition is always
performed: either the multiplicand will be added, or a zero will be. Also assume
that the registers have already been initialized (you are just counting how long it
takes to do the multiplication loop itself). If this is being done in hardware, the
shifts of the multiplicand and multiplier can be done simultaneously. If this is being
done in software, they will have to be done one after the other. Solve for each case.

3.15 [10] <§3.3> Calculate the time necessary to perform a multiply using the
approach described in the text (31 adders stacked vertically) if an integer is 8 bits
wide and an adder takes 4 time units.

3.16 [20] <§3.3> Calculate the time necessary to perform a multiply using the
approach given in Figure 3.7 if an integer is 8 bits wide and an adder takes 4 time
units.

3.17 [20] <§3.3> As discussed in the text, one possible performance enhancement
is to do a shift and add instead of an actual multiplication. Since 9 × 6, for example,
can be written (2 × 2 × 2 + 1) × 6, we can calculate 9 × 6 by shifting 6 to the left 3
times and then adding 6 to that result. Show the best way to calculate 0x33 × 0x55
using shifts and adds/subtracts. Assume both inputs are 8-bit unsigned integers.

3.18 [20] <§3.4> Using a table similar to that shown in Figure 3.10, calculate
74 divided by 21 using the hardware described in Figure 3.8. You should show
the contents of each register on each step. Assume both inputs are unsigned 6-bit
integers.

3.19 [30] <§3.4> Using a table similar to that shown in Figure 3.10, calculate
74 divided by 21 using the hardware described in Figure 3.11. You should show
the contents of each register on each step. Assume A and B are unsigned 6-bit
integers. This algorithm requires a slightly different approach than that shown in
Figure 3.9. You will want to think hard about this, do an experiment or two, or else
go to the web to figure out how to make this work correctly. (Hint: one possible
solution involves using the fact that Figure 3.11 implies the remainder register can
be shifted either direction.)

3.20 [5] <§3.5> What decimal number does the bit pattern 0x0C000000
represent if it is a two’s complement integer? An unsigned integer?

3.21 [10] <§3.5> If the bit pattern 0x0C000000 is placed into the Instruction
Register, what MIPS instruction will be executed?

3.22 [10] <§3.5> What decimal number does the bit pattern 0x0C000000
represent if it is a floating-point number? Use the IEEE 754 standard.


3.23 [10] <§3.5> Write down the binary representation of the decimal number
63.25 assuming the IEEE 754 single precision format.

3.24 [10] <§3.5> Write down the binary representation of the decimal number
63.25 assuming the IEEE 754 double precision format.

3.25 [10] <§3.5> Write down the binary representation of the decimal number
63.25 assuming it was stored using the single precision IBM format (base 16,
instead of base 2, with 7 bits of exponent).

3.26 [20] <§3.5> Write down the binary bit pattern to represent $-1.5625 \times 10^{-1}$
assuming a format similar to that employed by the DEC PDP-8 (the leftmost 12
bits are the exponent stored as a two’s complement number, and the rightmost 24
bits are the fraction stored as a two’s complement number). No hidden 1 is used.
Comment on how the range and accuracy of this 36-bit pattern compares to the
single and double precision IEEE 754 standards.

3.27 [20] <§3.5> IEEE 754-2008 contains a half precision that is only 16 bits
wide. The leftmost bit is still the sign bit, the exponent is 5 bits wide and has a bias
of 15, and the mantissa is 10 bits long. A hidden 1 is assumed. Write down the
bit pattern to represent $-1.5625 \times 10^{-1}$ assuming a version of this format, which
uses an excess-16 format to store the exponent. Comment on how the range and
accuracy of this 16-bit floating-point format compares to the single precision IEEE
754 standard.

3.28 [20] <§3.5> The Hewlett-Packard 2114, 2115, and 2116 used a format
with the leftmost 16 bits being the fraction stored in two’s complement format,
followed by another 16-bit field which had the leftmost 8 bits as an extension of the
fraction (making the fraction 24 bits long), and the rightmost 8 bits representing
the exponent. However, in an interesting twist, the exponent was stored in
sign-magnitude format with the sign bit on the far right! Write down the bit pattern to
represent $-1.5625 \times 10^{-1}$ assuming this format. No hidden 1 is used. Comment on
how the range and accuracy of this 32-bit pattern compares to the single precision
IEEE 754 standard.

3.29 [20] <§3.5> Calculate the sum of $2.6125 \times 10^{1}$ and $4.150390625 \times 10^{-1}$
by hand, assuming A and B are stored in the 16-bit half precision described in
Exercise 3.27. Assume 1 guard, 1 round bit, and 1 sticky bit, and round to the
nearest even. Show all the steps.

3.30 [30] <§3.5> Calculate the product of $-8.0546875 \times 10^{0}$ and $-1.79931640625 \times 10^{-1}$
by hand, assuming A and B are stored in the 16-bit half precision format
described in Exercise 3.27. Assume 1 guard, 1 round bit, and 1 sticky bit, and round
to the nearest even. Show all the steps; however, as is done in the example in the
text, you can do the multiplication in human-readable format instead of using the
techniques described in Exercises 3.12 through 3.14. Indicate if there is overflow
or underflow. Write your answer in both the 16-bit floating-point format described
in Exercise 3.27 and also as a decimal number. How accurate is your result? How
does it compare to the number you get if you do the multiplication on a calculator?


3.31 [30] <§3.5> Calculate by hand $8.625 \times 10^{1}$ divided by $-4.875 \times 10^{0}$. Show
all the steps necessary to achieve your answer. Assume there is a guard, a round bit,
and a sticky bit, and use them if necessary. Write the final answer in both the 16-bit
floating-point format described in Exercise 3.27 and in decimal and compare the
decimal result to that which you get if you use a calculator.

3.32 [20] <§3.9> Calculate $(3.984375 \times 10^{-1} + 3.4375 \times 10^{-1}) + 1.771 \times 10^{3}$
by hand, assuming each of the values are stored in the 16-bit half precision format
described in Exercise 3.27 (and also described in the text). Assume 1 guard, 1
round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and
write your answer in both the 16-bit floating-point format and in decimal.

3.33 [20] <§3.9> Calculate $3.984375 \times 10^{-1} + (3.4375 \times 10^{-1} + 1.771 \times 10^{3})$
by hand, assuming each of the values are stored in the 16-bit half precision format
described in Exercise 3.27 (and also described in the text). Assume 1 guard, 1
round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and
write your answer in both the 16-bit floating-point format and in decimal.

3.34 [10] <§3.9> Based on your answers to 3.32 and 3.33, does
$(3.984375 \times 10^{-1} + 3.4375 \times 10^{-1}) + 1.771 \times 10^{3} = 3.984375 \times 10^{-1} + (3.4375 \times 10^{-1} + 1.771 \times 10^{3})$?

3.35 [30] <§3.9> Calculate $(3.41796875 \times 10^{-3} \times 6.34765625 \times 10^{-3}) \times 1.05625 \times 10^{2}$
by hand, assuming each of the values are stored in the 16-bit half precision
format described in Exercise 3.27 (and also described in the text). Assume 1 guard,
1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and
write your answer in both the 16-bit floating-point format and in decimal.

3.36 [30] <§3.9> Calculate $3.41796875 \times 10^{-3} \times (6.34765625 \times 10^{-3} \times 1.05625 \times 10^{2})$
by hand, assuming each of the values are stored in the 16-bit half precision
format described in Exercise 3.27 (and also described in the text). Assume 1 guard,
1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and
write your answer in both the 16-bit floating-point format and in decimal.

3.37 [10] <§3.9> Based on your answers to 3.35 and 3.36, does
$(3.41796875 \times 10^{-3} \times 6.34765625 \times 10^{-3}) \times 1.05625 \times 10^{2} = 3.41796875 \times 10^{-3} \times (6.34765625 \times 10^{-3} \times 1.05625 \times 10^{2})$?

3.38 [30] <§3.9> Calculate $1.666015625 \times 10^{0} \times (1.9760 \times 10^{4} + -1.9744 \times 10^{4})$
by hand, assuming each of the values are stored in the 16-bit half precision
format described in Exercise 3.27 (and also described in the text). Assume 1 guard,
1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps, and
write your answer in both the 16-bit floating-point format and in decimal.

3.39 [30] <§3.9> Calculate $(1.666015625 \times 10^{0} \times 1.9760 \times 10^{4}) + (1.666015625 \times 10^{0} \times -1.9744 \times 10^{4})$
by hand, assuming each of the values are stored in the
16-bit half precision format described in Exercise 3.27 (and also described in the
text). Assume 1 guard, 1 round bit, and 1 sticky bit, and round to the nearest even.
Show all the steps, and write your answer in both the 16-bit floating-point format
and in decimal.


3.40 [10] <§3.9> Based on your answers to 3.38 and 3.39, does
$(1.666015625 \times 10^{0} \times 1.9760 \times 10^{4}) + (1.666015625 \times 10^{0} \times -1.9744 \times 10^{4}) = 1.666015625 \times 10^{0} \times (1.9760 \times 10^{4} + -1.9744 \times 10^{4})$?

3.41 [10] <§3.5> Using the IEEE 754 floating-point format, write down the bit
pattern that would represent −1/4. Can you represent −1/4 exactly?

3.42 [10] <§3.5> What do you get if you add −1/4 to itself 4 times? What is −1/4
× 4? Are they the same? What should they be?

3.43 [10] <§3.5> Write down the bit pattern in the fraction of value 1/3 assuming
a floating-point format that uses binary numbers in the fraction. Assume there are
24 bits, and you do not need to normalize. Is this representation exact?

3.44 [10] <§3.5> Write down the bit pattern in the fraction assuming a floating-point
format that uses Binary Coded Decimal (base 10) numbers in the fraction
instead of base 2. Assume there are 24 bits, and you do not need to normalize. Is
this representation exact?

3.45 [10] <§3.5> Write down the bit pattern assuming that we are using base 15
numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9
and A–F. Base 15 numbers would use 0–9 and A–E.) Assume there are 24 bits, and
you do not need to normalize. Is this representation exact?

3.46 [20] <§3.5> Write down the bit pattern assuming that we are using base 30
numbers in the fraction instead of base 2. (Base 16 numbers use the symbols 0–9
and A–F. Base 30 numbers would use 0–9 and A–T.) Assume there are 20 bits, and
you do not need to normalize. Is this representation exact?

3.47 [45] <§§3.6, 3.7> The following C code implements a four-tap FIR filter on
input array sig_in. Assume that all arrays are 16-bit fixed-point values.

for (i = 3; i < 128; i++)
    sig_out[i] = sig_in[i-3] * f[0] + sig_in[i-2] * f[1]
               + sig_in[i-1] * f[2] + sig_in[i] * f[3];

Assume you are to write an optimized implementation of this code in assembly
language on a processor that has SIMD instructions and 128-bit registers. Without
knowing the details of the instruction set, briefly describe how you would
implement this code, maximizing the use of sub-word operations and minimizing
the amount of data that is transferred between registers and memory. State all
your assumptions about the instructions you use.

Answers to Check Yourself

§3.2, page 182: 2.
§3.5, page 221: 3.

4 The Processor

In a major matter, no details are small.
French Proverb

4.1 Introduction
4.2 Logic Design Conventions
4.3 Building a Datapath
4.4 A Simple Implementation Scheme
4.5 An Overview of Pipelining
4.6 Pipelined Datapath and Control
4.7 Data Hazards: Forwarding versus Stalling
4.8 Control Hazards
4.9 Exceptions
4.10 Parallelism via Instructions
4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines
4.12 Going Faster: Instruction-Level Parallelism and Matrix Multiply
4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline and More Pipelining Illustrations
4.14 Fallacies and Pitfalls
4.15 Concluding Remarks
4.16 Historical Perspective and Further Reading
4.17 Exercises

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.

The Five Classic Components of a Computer

4.1 Introduction

Chapter 1 explains that the performance of a computer is determined by three key
factors: instruction count, clock cycle time, and clock cycles per instruction (CPI).
Chapter 2 explains that the compiler and the instruction set architecture determine
the instruction count required for a given program.
However, the implementation of the processor determines both the clock cycle time
and the number of clock cycles per instruction. In this chapter, we construct the
datapath and control unit for two different implementations of the MIPS instruction set.

This chapter contains an explanation of the principles and techniques used in
implementing a processor, starting with a highly abstract and simplified overview
in this section. It is followed by a section that builds up a datapath and constructs
a simple version of a processor sufficient to implement an instruction set like MIPS.
The bulk of the chapter covers a more realistic pipelined MIPS implementation,
followed by a section that develops the concepts necessary to implement more
complex instruction sets, like the x86.

For the reader interested in understanding the high-level interpretation of
instructions and its impact on program performance, this initial section and
Section 4.5 present the basic concepts of pipelining. Recent trends are covered in
Section 4.10, and Section 4.11 describes the recent Intel Core i7 and ARM Cortex-A8
architectures. Section 4.12 shows how to use instruction-level parallelism to more
than double the performance of the matrix multiply from Section 3.8. These
sections provide enough background to understand the pipeline concepts at a high level.

For the reader interested in understanding the processor and its performance in
more depth, Sections 4.3, 4.4, and 4.6 will be useful. Those interested in learning
how to build a processor should also cover 4.2, 4.7, 4.8, and 4.9. For readers with an
interest in modern hardware design, Section 4.13 describes how hardware design
languages and CAD tools are used to implement hardware, and then how to use a
hardware design language to describe a pipelined implementation. It also gives
several more illustrations of how pipelining hardware executes.
A Basic MIPS Implementation

We will be examining an implementation that includes a subset of the core MIPS instruction set:

■ The memory-reference instructions load word (lw) and store word (sw)
■ The arithmetic-logical instructions add, sub, AND, OR, and slt
■ The instructions branch equal (beq) and jump (j), which we add last

This subset does not include all the integer instructions (for example, shift, multiply, and divide are missing), nor does it include any floating-point instructions. However, it illustrates the key principles used in creating a datapath and designing the control. The implementation of the remaining instructions is similar.

In examining the implementation, we will have the opportunity to see how the instruction set architecture determines many aspects of the implementation, and how the choice of various implementation strategies affects the clock rate and CPI for the computer. Many of the key design principles introduced in Chapter 1 can be illustrated by looking at the implementation, such as Simplicity favors regularity. In addition, most concepts used to implement the MIPS subset in this chapter are the same basic ideas that are used to construct a broad spectrum of computers, from high-performance servers to general-purpose microprocessors to embedded processors.

An Overview of the Implementation

In Chapter 2, we looked at the core MIPS instructions, including the integer arithmetic-logical instructions, the memory-reference instructions, and the branch instructions. Much of what needs to be done to implement these instructions is the same, independent of the exact class of instruction. For every instruction, the first two steps are identical:

1. Send the program counter (PC) to the memory that contains the code and fetch the instruction from that memory.
2. Read one or two registers, using fields of the instruction to select the registers to read.
For the load word instruction, we need to read only one register, but most other instructions require reading two registers.

After these two steps, the actions required to complete the instruction depend on the instruction class. Fortunately, for each of the three instruction classes (memory-reference, arithmetic-logical, and branches), the actions are largely the same, independent of the exact instruction. The simplicity and regularity of the MIPS instruction set simplify the implementation by making the execution of many of the instruction classes similar.

For example, all instruction classes, except jump, use the arithmetic-logical unit (ALU) after reading the registers. The memory-reference instructions use the ALU for an address calculation, the arithmetic-logical instructions for the operation execution, and branches for comparison. After using the ALU, the actions required to complete various instruction classes differ. A memory-reference instruction will need to access the memory either to read data for a load or write data for a store. An arithmetic-logical or load instruction must write the data from the ALU or memory back into a register. Lastly, for a branch instruction, we may need to change the next instruction address based on the comparison; otherwise, the PC should be incremented by 4 to get the address of the next instruction.

Figure 4.1 shows the high-level view of a MIPS implementation, focusing on the various functional units and their interconnection. Although this figure shows most of the flow of data through the processor, it omits two important aspects of instruction execution.

First, in several places, Figure 4.1 shows data going to a particular unit as coming from two different sources.
For example, the value written into the PC can come from one of two adders, the data written into the register file can come from either the ALU or the data memory, and the second input to the ALU can come from a register or the immediate field of the instruction. In practice, these data lines cannot simply be wired together; we must add a logic element that chooses from among the multiple sources and steers one of those sources to its destination. This selection is commonly done with a device called a multiplexor, although this device might better be called a data selector. Appendix B describes the multiplexor, which selects from among several inputs based on the setting of its control lines. The control lines are set based primarily on information taken from the instruction being executed.

The second omission in Figure 4.1 is that several of the units must be controlled depending on the type of instruction. For example, the data memory must be read on a load and written on a store. The register file must be written only on a load or an arithmetic-logical instruction. And, of course, the ALU must perform one of several operations. (Appendix B describes the detailed design of the ALU.) Like the multiplexors, control lines that are set on the basis of various fields in the instruction direct these operations.

[Diagram: PC, instruction memory, registers, ALU, data memory, and two adders]
FIGURE 4.1 An abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them. All instructions start by using the program counter to supply the instruction address to the instruction memory. After the instruction is fetched, the register operands used by an instruction are specified by fields of that instruction. Once the register operands have been fetched, they can be operated on to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or a compare (for a branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a register. If the operation is a load or store, the ALU result is used as an address to either store a value from the registers or load a value from memory into the registers. The result from the ALU or memory is written back into the register file. Branches require the use of the ALU output to determine the next instruction address, which comes either from the ALU (where the PC and branch offset are summed) or from an adder that increments the current PC by 4. The thick lines interconnecting the functional units represent buses, which consist of multiple signals. The arrows are used to guide the reader in knowing how information flows. Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot where the lines cross.

Figure 4.2 shows the datapath of Figure 4.1 with the three required multiplexors added, as well as control lines for the major functional units. A control unit, which has the instruction as an input, is used to determine how to set the control lines for the functional units and two of the multiplexors. The third multiplexor, which determines whether PC + 4 or the branch destination address is written into the PC, is set based on the Zero output of the ALU, which is used to perform the comparison of a beq instruction. The regularity and simplicity of the MIPS instruction set mean that a simple decoding process can be used to determine how to set the control lines.

[Diagram: the datapath of Figure 4.1 with three multiplexors, a control unit, and the control signals MemRead, MemWrite, RegWrite, Branch, Zero, and ALU operation]
FIGURE 4.2 The basic implementation of the MIPS subset, including the necessary multiplexors and control lines. The top multiplexor (“Mux”) controls what value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled by the gate that “ANDs” together the Zero output of the ALU and a control signal that indicates that the instruction is a branch. The middle multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine whether the second ALU input is from the registers (for an arithmetic-logical instruction or a branch) or from the offset field of the instruction (for a load or store). The added control lines are straightforward and determine the operation performed at the ALU, whether the data memory should read or write, and whether the registers should perform a write operation. The control lines are shown in color to make them easier to see.

In the remainder of the chapter, we refine this view to fill in the details, which requires that we add further functional units, increase the number of connections between units, and, of course, enhance a control unit to control what actions are taken for different instruction classes. Sections 4.3 and 4.4 describe a simple implementation that uses a single long clock cycle for every instruction and follows the general form of Figures 4.1 and 4.2. In this first design, every instruction begins execution on one clock edge and completes execution on the next clock edge.

While easier to understand, this approach is not practical, since the clock cycle must be severely stretched to accommodate the longest instruction. After designing the control for this simple computer, we will look at pipelined implementation with all its complexities, including exceptions.
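The selection of the next PC value described for Figure 4.2 can be sketched in a few lines of Python. This is only an illustrative model, not a hardware description from the text; the function name `next_pc` and its parameter names are assumptions introduced here.

```python
def next_pc(pc, branch_target, branch, zero):
    """Model the top multiplexor of Figure 4.2 (hypothetical helper).

    pc            -- current program counter
    branch_target -- address computed by the branch adder
    branch        -- control signal: the instruction is a branch
    zero          -- Zero output of the ALU (operands compared equal)
    """
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF   # adder that increments the PC by 4
    select = branch and zero            # AND gate that drives the mux select
    return branch_target if select else pc_plus_4


# A taken branch replaces the PC with the branch target;
# every other case falls through to PC + 4.
print(hex(next_pc(0x100, 0x200, branch=True, zero=True)))    # 0x200
print(hex(next_pc(0x100, 0x200, branch=True, zero=False)))   # 0x104
```

The point of the sketch is that neither signal alone redirects the PC: a non-branch instruction whose ALU result happens to be zero still continues at PC + 4.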
Check Yourself
How many of the five classic components of a computer (shown on page 243) do Figures 4.1 and 4.2 include?

4.2 Logic Design Conventions

To discuss the design of a computer, we must decide how the hardware logic implementing the computer will operate and how the computer is clocked. This section reviews a few key ideas in digital logic that we will use extensively in this chapter. If you have little or no background in digital logic, you will find it helpful to read Appendix B before continuing.

The datapath elements in the MIPS implementation consist of two different types of logic elements: elements that operate on data values and elements that contain state. The elements that operate on data values are all combinational, which means that their outputs depend only on the current inputs. Given the same input, a combinational element always produces the same output. The ALU shown in Figure 4.1 and discussed in Appendix B is an example of a combinational element. Given a set of inputs, it always produces the same output because it has no internal storage.

Other elements in the design are not combinational, but instead contain state. An element contains state if it has some internal storage. We call these elements state elements because, if we pulled the power plug on the computer, we could restart it accurately by loading the state elements with the values they contained before we pulled the plug. Furthermore, if we saved and restored the state elements, it would be as if the computer had never lost power. Thus, these state elements completely characterize the computer. In Figure 4.1, the instruction and data memories, as well as the registers, are all examples of state elements.

combinational element An operational element, such as an AND gate or an ALU.
state element A memory element, such as a register or a memory.

A state element has at least two inputs and one output.
The required inputs are the data value to be written into the element and the clock, which determines when the data value is written. The output from a state element provides the value that was written in an earlier clock cycle. For example, one of the logically simplest state elements is a D-type flip-flop (see Appendix B), which has exactly these two inputs (a value and a clock) and one output. In addition to flip-flops, our MIPS implementation uses two other types of state elements: memories and registers, both of which appear in Figure 4.1. The clock is used to determine when the state element should be written; a state element can be read at any time.

Logic components that contain state are also called sequential, because their outputs depend on both their inputs and the contents of the internal state. For example, the output from the functional unit representing the registers depends both on the register numbers supplied and on what was written into the registers previously. The operation of both the combinational and sequential elements and their construction are discussed in more detail in Appendix B.

Clocking Methodology

A clocking methodology defines when signals can be read and when they can be written. It is important to specify the timing of reads and writes, because if a signal is written at the same time it is read, the value of the read could correspond to the old value, the newly written value, or even some mix of the two! Computer designs cannot tolerate such unpredictability. A clocking methodology is designed to make hardware predictable.

For simplicity, we will assume an edge-triggered clocking methodology. An edge-triggered clocking methodology means that any values stored in a sequential logic element are updated only on a clock edge, which is a quick transition from low to high or vice versa (see Figure 4.3).
Because only state elements can store a data value, any collection of combinational logic must have its inputs come from a set of state elements and its outputs written into a set of state elements. The inputs are values that were written in a previous clock cycle, while the outputs are values that can be used in a following clock cycle.

clocking methodology The approach used to determine when data is valid and stable relative to the clock.
edge-triggered clocking A clocking scheme in which all state changes occur on a clock edge.

[Diagram: state element 1, combinational logic, and state element 2, with one clock cycle marked]
FIGURE 4.3 Combinational logic, state elements, and the clock are closely related. In a synchronous digital system, the clock determines when elements with state will write values into internal storage. Any inputs to a state element must reach a stable value (that is, have reached a value from which they will not change until after the clock edge) before the active clock edge causes the state to be updated. All state elements in this chapter, including memory, are assumed to be positive edge-triggered; that is, they change on the rising clock edge.

Figure 4.3 shows the two state elements surrounding a block of combinational logic, which operates in a single clock cycle: all signals must propagate from state element 1, through the combinational logic, and to state element 2 in the time of one clock cycle. The time necessary for the signals to reach state element 2 defines the length of the clock cycle.

For simplicity, we do not show a write control signal when a state element is written on every active clock edge. In contrast, if a state element is not updated on every clock, then an explicit write control signal is required. Both the clock signal and the write control signal are inputs, and the state element is changed only when the write control signal is asserted and a clock edge occurs.
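The update rule for a state element with a write control signal can be illustrated with a small behavioral model. The sketch below is an assumption-laden simplification: the class name `Register`, the `tick` method, and the representation of the clock as the integers 0 and 1 are all introduced here for illustration, and real timing constraints such as setup and hold times are ignored.

```python
class Register:
    """Behavioral model of a positive edge-triggered 32-bit state element."""

    def __init__(self, value=0):
        self.value = value    # output: the value written in an earlier cycle
        self._clock = 0       # last clock level seen, used to detect edges

    def tick(self, clock, data_in, write=True):
        """Present a clock level, a data input, and the write control signal."""
        rising_edge = self._clock == 0 and clock == 1
        if rising_edge and write:              # both conditions are required
            self.value = data_in & 0xFFFFFFFF
        self._clock = clock
        return self.value


r = Register(5)
r.tick(1, 9, write=False)   # rising edge, write deasserted: state stays 5
r.tick(0, 9)                # falling edge: no update
r.tick(1, 9)                # rising edge, write asserted: state becomes 9
r.tick(1, 7)                # clock held high, no edge: state stays 9
print(r.value)              # 9
```

Omitting the `write` argument models an element that is written on every active clock edge, which is the case where the figures do not draw a write control signal.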
We will use the word asserted to indicate a signal that is logically high and assert to specify that a signal should be driven logically high, and deassert or deasserted to represent logically low. We use the terms assert and deassert because when we implement hardware, at times 1 represents logically high and at times it can represent logically low.

control signal A signal used for multiplexor selection or for directing the operation of a functional unit; contrasts with a data signal, which contains information that is operated on by a functional unit.
asserted The signal is logically high or true.
deasserted The signal is logically low or false.

An edge-triggered methodology allows us to read the contents of a register, send the value through some combinational logic, and write that register in the same clock cycle. Figure 4.4 gives a generic example. It doesn’t matter whether we assume that all writes take place on the rising clock edge (from low to high) or on the falling clock edge (from high to low), since the inputs to the combinational logic block cannot change except on the chosen clock edge. In this book we use the rising clock edge. With an edge-triggered timing methodology, there is no feedback within a single clock cycle, and the logic in Figure 4.4 works correctly. In Appendix B, we briefly discuss additional timing constraints (such as setup and hold times) as well as other timing methodologies.

[Diagram: a state element whose output feeds combinational logic that drives the same state element's input]
FIGURE 4.4 An edge-triggered methodology allows a state element to be read and written in the same clock cycle without creating a race that could lead to indeterminate data values. Of course, the clock cycle still must be long enough so that the input values are stable when the active clock edge occurs. Feedback cannot occur within one clock cycle because of the edge-triggered update of the state element. If feedback were possible, this design could not work properly. Our designs in this chapter and the next rely on the edge-triggered timing methodology and on structures like the one shown in this figure.

For the 32-bit MIPS architecture, nearly all of these state and logic elements will have inputs and outputs that are 32 bits wide, since that is the width of most of the data handled by the processor. We will make it clear whenever a unit has an input or output that is other than 32 bits in width. The figures will indicate buses, which are signals wider than 1 bit, with thicker lines. At times, we will want to combine several buses to form a wider bus; for example, we may want to obtain a 32-bit bus by combining two 16-bit buses. In such cases, labels on the bus lines will make it clear that we are concatenating buses to form a wider bus. Arrows are also added to help clarify the direction of the flow of data between elements. Finally, color indicates a control signal as opposed to a signal that carries data; this distinction will become clearer as we proceed through this chapter.

Check Yourself
True or false: Because the register file is both read and written on the same clock cycle, any MIPS datapath using edge-triggered writes must have more than one copy of the register file.

Elaboration: There is also a 64-bit version of the MIPS architecture, and, naturally enough, most paths in its implementation would be 64 bits wide.

4.3 Building a Datapath

A reasonable way to start a datapath design is to examine the major components required to execute each class of MIPS instructions. Let’s start at the top by looking at which datapath elements each instruction needs, and then work our way down through the levels of abstraction. When we show the datapath elements, we will also show their control signals. We use abstraction in this explanation, starting from the bottom up.

Figure 4.5a shows the first element we need: a memory unit to store the instructions of a program and supply instructions given an address.
Figure 4.5b also shows the program counter (PC), which as we saw in Chapter 2 is a register that holds the address of the current instruction. Lastly, we will need an adder to increment the PC to the address of the next instruction. This adder, which is combinational, can be built from the ALU described in detail in Appendix B simply by wiring the control lines so that the control always specifies an add operation. We will draw such an ALU with the label Add, as in Figure 4.5, to indicate that it has been permanently made an adder and cannot perform the other ALU functions.

datapath element A unit used to operate on or hold data within a processor. In the MIPS implementation, the datapath elements include the instruction and data memories, the register file, the ALU, and adders.
program counter (PC) The register containing the address of the instruction in the program being executed.

To execute any instruction, we must start by fetching the instruction from memory. To prepare for executing the next instruction, we must also increment the program counter so that it points at the next instruction, 4 bytes later. Figure 4.6 shows how to combine the three elements from Figure 4.5 to form a datapath that fetches instructions and increments the PC to obtain the address of the next sequential instruction.

Now let’s consider the R-format instructions (see Figure 2.20 on page 120). They all read two registers, perform an ALU operation on the contents of the registers, and write the result to a register. We call these instructions either R-type instructions or arithmetic-logical instructions (since they perform arithmetic or logical operations). This instruction class includes add, sub, AND, OR, and slt, which were introduced in Chapter 2. Recall that a typical instance of such an instruction is add $t1,$t2,$t3, which reads $t2 and $t3 and writes $t1.
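To make the "reads $t2 and $t3 and writes $t1" statement concrete, the sketch below extracts the register numbers from the 32-bit encoding of add $t1,$t2,$t3. The field positions follow the standard MIPS R-format from Chapter 2 (opcode, rs, rt, rd, shamt, funct); the function name `decode_r_format` and the dictionary layout are assumptions made for this illustration.

```python
def decode_r_format(instruction):
    """Split a 32-bit MIPS R-format word into its six fields (hypothetical helper)."""
    return {
        "opcode": (instruction >> 26) & 0x3F,   # bits 31-26
        "rs":     (instruction >> 21) & 0x1F,   # bits 25-21, first source register
        "rt":     (instruction >> 16) & 0x1F,   # bits 20-16, second source register
        "rd":     (instruction >> 11) & 0x1F,   # bits 15-11, destination register
        "shamt":  (instruction >> 6) & 0x1F,    # bits 10-6, shift amount
        "funct":  instruction & 0x3F,           # bits 5-0, selects the ALU operation
    }


# add $t1,$t2,$t3 assembles to 0x014B4820; $t1, $t2, $t3 are registers 9, 10, 11.
fields = decode_r_format(0x014B4820)
print(fields["rs"], fields["rt"], fields["rd"])   # 10 11 9
```

The two 5-bit source fields are what the datapath sends to the register file's read ports, and the 5-bit destination field selects the register to be written.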
The processor’s 32 general-purpose registers are stored in a structure called a register file. A register file is a collection of registers in which any register can be read or written by specifying the number of the register in the file. The register file contains the register state of the computer. In addition, we will need an ALU to operate on the values read from the registers.

register file A state element that consists of a set of registers that can be read and written by supplying a register number to be accessed.

R-format instructions have three register operands, so we will need to read two data words from the register file and write one data word into the register file for each instruction. For each data word to be read from the registers, we need an input to the register file that specifies the register number to be read and an output from the register file that will carry the value that has been read from the registers. To write a data word, we will need two inputs: one to specify the register number to be written and one to supply the data to be written into the register. The register file always outputs the contents of whatever register numbers are on the Read register inputs. Writes, however, are controlled by the write control signal, which must be asserted for a write to occur at the clock edge. Figure 4.7a shows the result; we need a total of four inputs (three for register numbers and one for data) and two outputs (both for data). The register number inputs are 5 bits wide to specify one of 32 registers (32 = 2^5), whereas the data input and two data output buses are each 32 bits wide.

Figure 4.7b shows the ALU, which takes two 32-bit inputs and produces a 32-bit result, as well as a 1-bit signal if the result is 0. The 4-bit control signal of the ALU is described in detail in Appendix B; we will review the ALU control shortly when we need to know how to set it.

[Diagram panels: a. Instruction memory; b. Program counter; c. Adder]
FIGURE 4.5 Two state elements are needed to store and access instructions, and an adder is needed to compute the next instruction address. The state elements are the instruction memory and the program counter. The instruction memory need only provide read access because the datapath does not write instructions. Since the instruction memory is only read, we treat it as combinational logic: the output at any time reflects the contents of the location specified by the address input, and no read control signal is needed. (We will need to write the instruction memory when we load the program; this is not hard to add, and we ignore it for simplicity.) The program counter is a 32-bit register that is written at the end of every clock cycle and thus does not need a write control signal. The adder is an ALU wired to always add its two 32-bit inputs and place the sum on its output.

[Diagram: the PC supplies the read address to the instruction memory and feeds an adder whose other input is 4]
FIGURE 4.6 A portion of the datapath used for fetching instructions and incrementing the program counter. The fetched instruction is used by other parts of the datapath.

[Diagram panels: a. Registers, with 5-bit inputs Read register 1, Read register 2, and Write register, a Write data input, a RegWrite control, and outputs Read data 1 and Read data 2; b. ALU, with a 4-bit ALU operation control and outputs Zero and ALU result]
FIGURE 4.7 The two elements needed to implement R-format ALU operations are the register file and the ALU. The register file contains all the registers and has two read ports and one write port. The design of multiported register files is discussed in Section B.8 of Appendix B. The register file always outputs the contents of the registers corresponding to the Read register inputs on the outputs; no other control inputs are needed. In contrast, a register write must be explicitly indicated by asserting the write control signal.
Remember that writes are edge-triggered, so that all the write inputs (i.e., the value to be written, the register number, and the write control signal) must be valid at the clock edge. Since writes to the register file are edge-triggered, our design can legally read and write the same register within a clock cycle: the read will get the value written in an earlier clock cycle, while the value written will be available to a read in a subsequent clock cycle. The inputs carrying the register number to the register file are all 5 bits wide, whereas the lines carrying data values are 32 bits wide. The operation to be performed by the ALU is controlled with the ALU operation signal, which will be 4 bits wide, using the ALU designed in Appendix B. We will use the Zero detection output of the ALU shortly to implement branches. The overflow output will not be needed until Section 4.9, when we discuss exceptions; we omit it until then.

Next, consider the MIPS load word and store word instructions, which have the general form lw $t1,offset_value($t2) or sw $t1,offset_value($t2). These instructions compute a memory address by adding the base register, which is $t2, to the 16-bit signed offset field contained in the instruction. If the instruction is a store, the value to be stored must also be read from the register file where it resides in $t1. If the instruction is a load, the value read from memory must be written into the register file in the specified register, which is $t1. Thus, we will need both the register file and the ALU from Figure 4.7.

In addition, we will need a unit to sign-extend the 16-bit offset field in the instruction to a 32-bit signed value, and a data memory unit to read from or write to. The data memory must be written on store instructions; hence, data memory has read and write control signals, an address input, and an input for the data to be written into memory. Figure 4.8 shows these two elements.
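The address calculation for lw and sw can be sketched as follows. This is a minimal model under stated assumptions: the helper names `sign_extend_16` and `effective_address` are hypothetical, and 32-bit wraparound is modeled with a mask because Python integers are unbounded.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, offset16):
    """Add the base register to the sign-extended offset, as the ALU does."""
    return (base + sign_extend_16(offset16)) & 0xFFFFFFFF


# lw $t1,8($t2) and lw $t1,-4($t2) with $t2 = 0x1000
print(hex(effective_address(0x1000, 8)))        # 0x1008
print(hex(effective_address(0x1000, 0xFFFC)))   # 0xffc  (0xFFFC encodes -4)
```

Without the sign extension, the offset 0xFFFC would be treated as +65532 and the second access would land far above the base address instead of 4 bytes below it.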
The beq instruction has three operands, two registers that are compared for equality, and a 16-bit offset used to compute the branch target address relative to the branch instruction address. Its form is beq $t1,$t2,offset. To implement this instruction, we must compute the branch target address by adding the sign-extended offset field of the instruction to the PC. There are two details in the definition of branch instructions (see Chapter 2) to which we must pay attention:

■ The instruction set architecture specifies that the base for the branch address calculation is the address of the instruction following the branch. Since we compute PC + 4 (the address of the next instruction) in the instruction fetch datapath, it is easy to use this value as the base for computing the branch target address.
■ The architecture also states that the offset field is shifted left 2 bits so that it is a word offset; this shift increases the effective range of the offset field by a factor of 4.

To deal with the latter complication, we will need to shift the offset field by 2.

As well as computing the branch target address, we must also determine whether the next instruction is the instruction that follows sequentially or the instruction at the branch target address. When the condition is true (i.e., the operands are equal), the branch target address becomes the new PC, and we say that the branch is taken. If the operands are not equal, the incremented PC should replace the current PC (just as for any other normal instruction); in this case, we say that the branch is not taken.

Thus, the branch datapath must do two operations: compute the branch target address and compare the register contents. (Branches also affect the instruction fetch portion of the datapath, as we will deal with shortly.) Figure 4.9 shows the structure of the datapath segment that handles branches.
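The two operations of the branch datapath can be modeled together in a short sketch. It is illustrative only: the names `sign_extend_16` and `beq_next_pc` are assumptions, the comparison is done by subtraction to mirror the use of the ALU's Zero output, and delayed branches are ignored, as in the text.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def beq_next_pc(pc, reg_a, reg_b, offset16):
    """Return the PC that follows a beq located at address pc."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF                     # base for the target
    word_offset = sign_extend_16(offset16) << 2           # shift left 2 bits
    branch_target = (pc_plus_4 + word_offset) & 0xFFFFFFFF
    zero = ((reg_a - reg_b) & 0xFFFFFFFF) == 0            # subtract, test for 0
    return branch_target if zero else pc_plus_4


print(hex(beq_next_pc(0x100, 5, 5, 3)))        # taken: 0x104 + 12 = 0x110
print(hex(beq_next_pc(0x100, 5, 6, 3)))        # not taken: 0x104
print(hex(beq_next_pc(0x100, 5, 5, 0xFFFF)))   # taken, offset -1: 0x100
```

The last call shows why PC + 4 is the base: an offset of -1 word branches back to the beq itself, not to the instruction before it.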
To compute the branch target address, the branch datapath includes a sign extension unit from Figure 4.8 and an adder. To perform the compare, we need to use the register file shown in Figure 4.7a to supply the two register operands (although we will not need to write into the register file). In addition, the comparison can be done using the ALU we designed in Appendix B. Since that ALU provides an output signal that indicates whether the result was 0, we can send the two register operands to the ALU with the control set to do a subtract. If the Zero signal out of the ALU unit is asserted, we know that the two values are equal. Although the Zero output always signals if the result is 0, we will be using it only to implement the equal test of branches. Later, we will show exactly how to connect the control signals of the ALU for use in the datapath.

sign-extend To increase the size of a data item by replicating the high-order sign bit of the original data item in the high-order bits of the larger, destination data item.
branch target address The address specified in a branch, which becomes the new program counter (PC) if the branch is taken. In the MIPS architecture the branch target is given by the sum of the offset field of the instruction and the address of the instruction following the branch.
branch taken A branch where the branch condition is satisfied and the program counter (PC) becomes the branch target. All unconditional jumps are taken branches.
branch not taken (or untaken branch) A branch where the branch condition is false and the program counter (PC) becomes the address of the instruction that sequentially follows the branch.

The jump instruction operates by replacing the lower 28 bits of the PC with the lower 26 bits of the instruction shifted left by 2 bits. Simply concatenating 00 to the jump offset accomplishes this shift, as described in Chapter 2.
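The bit manipulation performed by the jump instruction can be written out directly. The sketch below follows the description above; the function name `jump_target` is hypothetical, and which PC value supplies the upper 4 bits (the current PC or the incremented one) is left to the caller as an assumption.

```python
def jump_target(pc, instruction):
    """Replace the lower 28 bits of pc with the 26-bit jump field followed by 00."""
    address26 = instruction & 0x03FFFFFF      # lower 26 bits of the instruction
    lower28 = address26 << 2                  # same as concatenating 00 at the low end
    return (pc & 0xF0000000) | lower28        # keep only the upper 4 bits of the PC


# 0x08000004 has opcode 2 (j) and a 26-bit field of 4, so the target is word 4
# within the 256 MB region selected by the upper 4 bits of the PC.
print(hex(jump_target(0x40000008, 0x08000004)))   # 0x40000010
```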
Elaboration: In the MIPS instruction set, branches are delayed, meaning that the instruction immediately following the branch is always executed, independent of whether the branch condition is true or false. When the condition is false, the execution looks like a normal branch. When the condition is true, a delayed branch first executes the instruction immediately following the branch in sequential instruction order before jumping to the specified branch target address. The motivation for delayed branches arises from how pipelining affects branches (see Section 4.8). For simplicity, we generally ignore delayed branches in this chapter and implement a nondelayed beq instruction.

delayed branch A type of branch where the instruction immediately following the branch is always executed, independent of whether the branch condition is true or false.

[Diagram panels: a. Data memory unit, with Address and Write data inputs, a Read data output, and MemRead and MemWrite controls; b. Sign extension unit, with a 16-bit input and a 32-bit output]
FIGURE 4.8 The two units needed to implement loads and stores, in addition to the register file and ALU of Figure 4.7, are the data memory unit and the sign extension unit. The memory unit is a state element with inputs for the address and the write data, and a single output for the read result. There are separate read and write controls, although only one of these may be asserted on any given clock. The memory unit needs a read signal, since, unlike the register file, reading the value of an invalid address can cause problems, as we will see in Chapter 5. The sign extension unit has a 16-bit input that is sign-extended into a 32-bit result appearing on the output (see Chapter 2). We assume the data memory is edge-triggered for writes. Standard memory chips actually have a write enable signal that is used for writes. Although the write enable is not edge-triggered, our edge-triggered design could easily be adapted to work with real memory chips.
See Section B.8 of Appendix B for further discussion of how real memory chips work. 256 Chapter 4 The Processor Creating a Single Datapath Now that we have examined the datapath components needed for the individual instruction classes, we can combine them into a single datapath and add the control to complete the implementation. Th is simplest datapath will attempt to execute all instructions in one clock cycle. Th is means that no datapath resource can be used more than once per instruction, so any element needed more than once must be duplicated. We therefore need a memory for instructions separate from one for data. Although some of the functional units will need to be duplicated, many of the elements can be shared by diff erent instruction fl ows. To share a datapath element between two diff erent instruction classes, we may need to allow multiple connections to the input of an element, using a multiplexor and control signal to select among the multiple inputs. Read register 1 Registers ALU Zero RegWrite Read data 1 Read data 2 ALU operation 4 To branch control logic Add Sum Branch target PC + 4 from instruction datapath Sign- extend 16 32 Instruction Shift left 2 Read register 2 Write register Write data FIGURE 4.9 The datapath for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended, lower 16 bits of the instruction (the branch displacement), shifted left 2 bits. Th e unit labeled Shift left 2 is simply a routing of the signals between input and output that adds 00two to the low-order end of the sign-extended off set fi eld; no actual shift hardware is needed, since the amount of the “shift ” is constant. Since we know that the off set was sign-extended from 16 bits, the shift will throw away only “sign bits.” Control logic is used to decide whether the incremented PC or branch target should replace the PC, based on the Zero output of the ALU. 
EXAMPLE Building a Datapath

The operations of the arithmetic-logical (or R-type) instruction datapath and the memory instruction datapath are quite similar. The key differences are the following:

■ The arithmetic-logical instructions use the ALU, with the inputs coming from the two registers. The memory instructions can also use the ALU to do the address calculation, although the second input is the sign-extended 16-bit offset field from the instruction.

■ The value stored into a destination register comes from the ALU (for an R-type instruction) or the memory (for a load).

Show how to build a datapath for the operational portion of the memory-reference and arithmetic-logical instructions that uses a single register file and a single ALU to handle both types of instructions, adding any necessary multiplexors.

ANSWER To create a datapath with only a single register file and a single ALU, we must support two different sources for the second ALU input, as well as two different sources for the data stored into the register file. Thus, one multiplexor is placed at the ALU input and another at the data input to the register file. Figure 4.10 shows the operational portion of the combined datapath.

Now we can combine all the pieces to make a simple datapath for the core MIPS architecture by adding the datapath for instruction fetch (Figure 4.6), the datapath for R-type and memory instructions (Figure 4.10), and the datapath for branches (Figure 4.9). Figure 4.11 shows the datapath we obtain by composing the separate pieces. The branch instruction uses the main ALU for comparison of the register operands, so we must keep the adder from Figure 4.9 for computing the branch target address. An additional multiplexor is required to select either the sequentially following instruction address (PC + 4) or the branch target address to be written into the PC.

Now that we have completed this simple datapath, we can add the control unit.
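The three sharing points in the combined datapath are all two-input multiplexors, which can be sketched as follows. This is an informal model with hypothetical function names; the select names (ALUSrc, MemtoReg, PCSrc) and the assignment of sources to inputs 0 and 1 follow Figures 4.10, 4.11, and 4.16.

```python
def mux2(select: int, in0: int, in1: int) -> int:
    """Two-input multiplexor: select = 0 passes in0, select = 1 passes in1."""
    return in1 if select else in0

def alu_second_operand(alu_src: int, read_data2: int, sign_ext_imm: int) -> int:
    """ALUSrc chooses the register operand (0) or the sign-extended offset (1)."""
    return mux2(alu_src, read_data2, sign_ext_imm)

def register_write_data(mem_to_reg: int, alu_result: int, mem_read_data: int) -> int:
    """MemtoReg chooses the ALU result (0) or the data memory output (1)."""
    return mux2(mem_to_reg, alu_result, mem_read_data)

def next_pc(pc_src: int, pc_plus_4: int, branch_target: int) -> int:
    """PCSrc chooses PC + 4 (0) or the branch target (1)."""
    return mux2(pc_src, pc_plus_4, branch_target)
```

Each multiplexor needs exactly one select line because it has two inputs, which is why the control signals for these three points are single bits.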
The control unit must be able to take inputs and generate a write signal for each state element, the selector control for each multiplexor, and the ALU control. The ALU control is different in a number of ways, and it will be useful to design it first before we design the rest of the control unit.

FIGURE 4.10 The datapath for the memory instructions and the R-type instructions. [Diagram: the register file, ALU, sign extension unit, and data memory, with a multiplexor controlled by ALUSrc at the second ALU input and a multiplexor controlled by MemtoReg at the register Write data input.] This example shows how a single datapath can be assembled from the pieces in Figures 4.7 and 4.8 by adding multiplexors. Two multiplexors are needed, as described in the example.

FIGURE 4.11 The simple datapath for the core MIPS architecture combines the elements required by different instruction classes. [Diagram: the datapath of Figure 4.10 joined with the PC, instruction memory, PC + 4 adder, Shift left 2 unit, branch target adder, and a multiplexor controlled by PCSrc.] The components come from Figures 4.6, 4.9, and 4.10. This datapath can execute the basic instructions (load-store word, ALU operations, and branches) in a single clock cycle. Just one additional multiplexor is needed to integrate branches. The support for jumps will be added later.

Check Yourself

I. Which of the following is correct for a load instruction? Refer to Figure 4.10.
a. MemtoReg should be set to cause the data from memory to be sent to the register file.
b. MemtoReg should be set to cause the correct register destination to be sent to the register file.
c. We do not care about the setting of MemtoReg for loads.

II. The single-cycle datapath conceptually described in this section must have separate instruction and data memories, because
a. the formats of data and instructions are different in MIPS, and hence different memories are needed.
b. having separate memories is less expensive.
c. the processor operates in one cycle and cannot use a single-ported memory for two different accesses within that cycle.

4.4 A Simple Implementation Scheme

In this section, we look at what might be thought of as the simplest possible implementation of our MIPS subset. We build this simple implementation using the datapath of the last section and adding a simple control function. This simple implementation covers load word (lw), store word (sw), branch equal (beq), and the arithmetic-logical instructions add, sub, AND, OR, and set on less than. We will later enhance the design to include a jump instruction (j).

The ALU Control

The MIPS ALU in Appendix B defines the following six combinations of four control inputs:

ALU control lines | Function
0000 | AND
0001 | OR
0010 | add
0110 | subtract
0111 | set on less than
1100 | NOR

Depending on the instruction class, the ALU will need to perform one of these first five functions. (NOR is needed for other parts of the MIPS instruction set not found in the subset we are implementing.) For load word and store word instructions, we use the ALU to compute the memory address by addition. For the R-type instructions, the ALU needs to perform one of the five actions (AND, OR, subtract, add, or set on less than), depending on the value of the 6-bit funct (or function) field in the low-order bits of the instruction (see Chapter 2). For branch equal, the ALU must perform a subtraction.

We can generate the 4-bit ALU control input using a small control unit that has as inputs the function field of the instruction and a 2-bit control field, which we call ALUOp.
ALUOp indicates whether the operation to be performed should be add (00) for loads and stores, subtract (01) for beq, or determined by the operation encoded in the funct field (10). The output of the ALU control unit is a 4-bit signal that directly controls the ALU by generating one of the 4-bit combinations shown previously.

In Figure 4.12, we show how to set the ALU control inputs based on the 2-bit ALUOp control and the 6-bit function code. Later in this chapter we will see how the ALUOp bits are generated from the main control unit.

This style of using multiple levels of decoding (that is, the main control unit generates the ALUOp bits, which then are used as input to the ALU control that generates the actual signals to control the ALU) is a common implementation technique. Using multiple levels of control can reduce the size of the main control unit. Using several smaller control units may also potentially increase the speed of the control unit. Such optimizations are important, since the speed of the control unit is often critical to clock cycle time.

There are several different ways to implement the mapping from the 2-bit ALUOp field and the 6-bit funct field to the four ALU operation control bits. Because only a small number of the 64 possible values of the function field are of interest and the function field is used only when the ALUOp bits equal 10, we can use a small piece of logic that recognizes the subset of possible values and causes the correct setting of the ALU control bits.
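One way to picture that "small piece of logic" is as a short list of input patterns, each with don't-care positions, checked in order. The sketch below is a software model only, with hypothetical names; its rows are copied from the compact truth table the chapter gives as Figure 4.13, and real hardware would evaluate all rows in parallel as gates rather than scanning a list.

```python
# Each row: (ALUOp pattern, funct pattern, 4-bit Operation).
# 'X' marks a don't-care input bit.
ALU_CONTROL_ROWS = [
    ("00", "XXXXXX", 0b0010),  # lw, sw: add
    ("X1", "XXXXXX", 0b0110),  # beq: subtract
    ("1X", "XX0000", 0b0010),  # R-type add
    ("1X", "XX0010", 0b0110),  # R-type subtract
    ("1X", "XX0100", 0b0000),  # R-type AND
    ("1X", "XX0101", 0b0001),  # R-type OR
    ("1X", "XX1010", 0b0111),  # R-type set on less than
]

def matches(pattern: str, bits: str) -> bool:
    """True if every non-X position of the pattern equals the input bit."""
    return all(p == "X" or p == b for p, b in zip(pattern, bits))

def alu_control(alu_op: int, funct: int) -> int:
    """Return the 4-bit ALU control value for a 2-bit ALUOp and 6-bit funct."""
    op_bits = format(alu_op & 0b11, "02b")
    funct_bits = format(funct & 0b111111, "06b")
    for op_pattern, funct_pattern, operation in ALU_CONTROL_ROWS:
        if matches(op_pattern, op_bits) and matches(funct_pattern, funct_bits):
            return operation
    raise ValueError("no truth-table row covers this input combination")
```

Because the table lists only the entries of interest, inputs outside the implemented subset (for example, the funct code for NOR) fall through to the error case in this model.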
As a step in designing this logic, it is useful to create a truth table for the interesting combinations of the function code field and the ALUOp bits, as we've done in Figure 4.13; this truth table shows how the 4-bit ALU control is set depending on these two input fields. Since the full truth table is very large ($2^8 = 256$ entries) and we don't care about the value of the ALU control for many of these input combinations, we show only the truth table entries for which the ALU control must have a specific value. Throughout this chapter, we will use this practice of showing only the truth table entries for outputs that must be asserted and not showing those that are all deasserted or don't care. (This practice has a disadvantage, which we discuss in Section D.2 of Appendix D.)

Instruction opcode | ALUOp | Instruction operation | Funct field | Desired ALU action | ALU control input
LW | 00 | load word | XXXXXX | add | 0010
SW | 00 | store word | XXXXXX | add | 0010
Branch equal | 01 | branch equal | XXXXXX | subtract | 0110
R-type | 10 | add | 100000 | add | 0010
R-type | 10 | subtract | 100010 | subtract | 0110
R-type | 10 | AND | 100100 | AND | 0000
R-type | 10 | OR | 100101 | OR | 0001
R-type | 10 | set on less than | 101010 | set on less than | 0111

FIGURE 4.12 How the ALU control bits are set depends on the ALUOp control bits and the different function codes for the R-type instruction. The opcode, listed in the first column, determines the setting of the ALUOp bits. All the encodings are shown in binary. Notice that when the ALUOp code is 00 or 01, the desired ALU action does not depend on the function code field; in this case, we say that we "don't care" about the value of the function code, and the funct field is shown as XXXXXX. When the ALUOp value is 10, then the function code is used to set the ALU control input. See Appendix B.

Because in many instances we do not care about the values of some of the inputs, and because we wish to keep the tables compact, we also include don't-care terms.
A don't-care term in this truth table (represented by an X in an input column) indicates that the output does not depend on the value of the input corresponding to that column. For example, when the ALUOp bits are 00, as in the first row of Figure 4.13, we always set the ALU control to 0010, independent of the function code. In this case, then, the function code inputs will be don't cares in this line of the truth table. Later, we will see examples of another type of don't-care term. If you are unfamiliar with the concept of don't-care terms, see Appendix B for more information.

truth table: From logic, a representation of a logical operation by listing all the values of the inputs and then in each case showing what the resulting outputs should be.

don't-care term: An element of a logical function in which the output does not depend on the values of all the inputs. Don't-care terms may be specified in different ways.

ALUOp1 | ALUOp0 | F5 | F4 | F3 | F2 | F1 | F0 | Operation
0 | 0 | X | X | X | X | X | X | 0010
X | 1 | X | X | X | X | X | X | 0110
1 | X | X | X | 0 | 0 | 0 | 0 | 0010
1 | X | X | X | 0 | 0 | 1 | 0 | 0110
1 | X | X | X | 0 | 1 | 0 | 0 | 0000
1 | X | X | X | 0 | 1 | 0 | 1 | 0001
1 | X | X | X | 1 | 0 | 1 | 0 | 0111

FIGURE 4.13 The truth table for the 4 ALU control bits (called Operation). The inputs are the ALUOp and function code field (F5 through F0). Only the entries for which the ALU control is asserted are shown. Some don't-care entries have been added. For example, the ALUOp does not use the encoding 11, so the truth table can contain entries 1X and X1, rather than 10 and 01. Note that when the function field is used, the first 2 bits (F5 and F4) of these instructions are always 10, so they are don't-care terms and are replaced with XX in the truth table.

Once the truth table has been constructed, it can be optimized and then turned into gates. This process is completely mechanical. Thus, rather than show the final steps here, we describe the process and the result in Section D.2 of Appendix D.

Designing the Main Control Unit

Now that we have described how to design an ALU that uses the function code and a 2-bit signal as its control inputs, we can return to looking at the rest of the control. To start this process, let's identify the fields of an instruction and the control lines that are needed for the datapath we constructed in Figure 4.11. To understand how to connect the fields of an instruction to the datapath, it is useful to review the formats of the three instruction classes: the R-type, branch, and load-store instructions. Figure 4.14 shows these formats.

There are several major observations about this instruction format that we will rely on:

■ The op field, which as we saw in Chapter 2 is called the opcode, is always contained in bits 31:26. We will refer to this field as Op[5:0].

■ The two registers to be read are always specified by the rs and rt fields, at positions 25:21 and 20:16. This is true for the R-type instructions, branch equal, and store.

■ The base register for load and store instructions is always in bit positions 25:21 (rs).

■ The 16-bit offset for branch equal, load, and store is always in positions 15:0.

■ The destination register is in one of two places. For a load it is in bit positions 20:16 (rt), while for an R-type instruction it is in bit positions 15:11 (rd). Thus, we will need to add a multiplexor to select which field of the instruction is used to indicate the register number to be written.

The first design principle from Chapter 2, simplicity favors regularity, pays off here in specifying control.

opcode: The field that denotes the operation and format of an instruction.

a. R-type instruction
Field | 0 | rs | rt | rd | shamt | funct
Bit positions | 31:26 | 25:21 | 20:16 | 15:11 | 10:6 | 5:0

b. Load or store instruction
Field | 35 or 43 | rs | rt | address
Bit positions | 31:26 | 25:21 | 20:16 | 15:0

c. Branch instruction
Field | 4 | rs | rt | address
Bit positions | 31:26 | 25:21 | 20:16 | 15:0

FIGURE 4.14 The three instruction classes (R-type, load and store, and branch) use two different instruction formats. The jump instructions use another format, which we will discuss shortly. (a) Instruction format for R-format instructions, which all have an opcode of 0. These instructions have three register operands: rs, rt, and rd. Fields rs and rt are sources, and rd is the destination. The ALU function is in the funct field and is decoded by the ALU control design in the previous section. The R-type instructions that we implement are add, sub, AND, OR, and slt. The shamt field is used only for shifts; we will ignore it in this chapter. (b) Instruction format for load (opcode = 35 in decimal) and store (opcode = 43 in decimal) instructions. The register rs is the base register that is added to the 16-bit address field to form the memory address. For loads, rt is the destination register for the loaded value. For stores, rt is the source register whose value should be stored into memory. (c) Instruction format for branch equal (opcode = 4). The registers rs and rt are the source registers that are compared for equality. The 16-bit address field is sign-extended, shifted, and added to PC + 4 to compute the branch target address.

Using this information, we can add the instruction labels and extra multiplexor (for the Write register number input of the register file) to the simple datapath. Figure 4.15 shows these additions plus the ALU control block, the write signals for state elements, the read signal for the data memory, and the control signals for the multiplexors. Since all the multiplexors have two inputs, they each require a single control line.

Figure 4.15 shows seven single-bit control lines plus the 2-bit ALUOp control signal.
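The fixed bit positions in Figure 4.14 mean that every field can be pulled out of the 32-bit instruction word before the opcode is even known. A minimal sketch, with hypothetical function and key names, assuming the instruction is held as an unsigned 32-bit integer:

```python
def decode_fields(instruction: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields of Figure 4.14."""
    return {
        "op":    (instruction >> 26) & 0x3F,  # bits 31:26
        "rs":    (instruction >> 21) & 0x1F,  # bits 25:21
        "rt":    (instruction >> 16) & 0x1F,  # bits 20:16
        "rd":    (instruction >> 11) & 0x1F,  # bits 15:11
        "shamt": (instruction >> 6) & 0x1F,   # bits 10:6
        "funct": instruction & 0x3F,          # bits 5:0
        "imm16": instruction & 0xFFFF,        # bits 15:0
    }

def write_register(fields: dict, reg_dst: int) -> int:
    """Model of the extra multiplexor: rt when RegDst = 0, rd when RegDst = 1."""
    return fields["rd"] if reg_dst else fields["rt"]
```

All fields are extracted unconditionally, just as the wires in the datapath always carry these bits; the control signals decide later which of them are actually used.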
We have already defined how the ALUOp control signal works, and it is useful to define what the seven other control signals do informally before we determine how to set these control signals during instruction execution. Figure 4.16 describes the function of these seven control lines.

Now that we have looked at the function of each of the control signals, we can look at how to set them. The control unit can set all but one of the control signals based solely on the opcode field of the instruction. The PCSrc control line is the exception. That control line should be asserted if the instruction is branch on equal (a decision that the control unit can make) and the Zero output of the ALU, which is used for equality comparison, is asserted. To generate the PCSrc signal, we will need to AND together a signal from the control unit, which we call Branch, with the Zero signal out of the ALU.

FIGURE 4.15 The datapath of Figure 4.11 with all necessary multiplexors and all control lines identified. [Diagram: the datapath of Figure 4.11 with instruction fields [31:0], [25:21], [20:16], [15:11], [15:0], and [5:0] labeled, the RegDst multiplexor on the Write register input, and the ALU control block driven by ALUOp.] The control lines are shown in color. The ALU control block has also been added. The PC does not require a write control, since it is written once at the end of every clock cycle; the branch control logic determines whether it is written with the incremented PC or the branch target address.

Signal name | Effect when deasserted | Effect when asserted
RegDst | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite | None. | The register on the Write register input is written with the value on the Write data input.
ALUSrc | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, lower 16 bits of the instruction.
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.
MemRead | None. | Data memory contents designated by the address input are put on the Read data output.
MemWrite | None. | Data memory contents designated by the address input are replaced by the value on the Write data input.
MemtoReg | The value fed to the register Write data input comes from the ALU. | The value fed to the register Write data input comes from the data memory.

FIGURE 4.16 The effect of each of the seven control signals. When the 1-bit control to a two-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Remember that the state elements all have the clock as an implicit input and that the clock is used in controlling writes. Gating the clock externally to a state element can create timing problems. (See Appendix B for further discussion of this problem.)

These nine control signals (seven from Figure 4.16 and two for ALUOp) can now be set on the basis of six input signals to the control unit, which are the opcode bits 31 to 26. Figure 4.17 shows the datapath with the control unit and the control signals.

Before we try to write a set of equations or a truth table for the control unit, it will be useful to try to define the control function informally. Because the setting of the control lines depends only on the opcode, we define whether each control signal should be 0, 1, or don't care (X) for each of the opcode values.
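Since the control lines depend only on the opcode, the informal definition can be written down as a lookup table. The sketch below is one possible software rendering, not the gate-level design: the dictionary and function names are hypothetical, `None` stands in for a don't-care (X), and the values are the ones the chapter tabulates in Figure 4.18.

```python
X = None  # don't care: the datapath ignores this signal for the instruction

CONTROL_TABLE = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
             MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    43: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    4:  dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
             MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}

def main_control(opcode: int) -> dict:
    """Return the control settings for one of the four implemented opcodes."""
    if opcode not in CONTROL_TABLE:
        raise ValueError(f"opcode {opcode} is outside the implemented subset")
    return dict(CONTROL_TABLE[opcode])

def pc_src(branch: int, zero: int) -> int:
    """PCSrc is derived: Branch from the control unit ANDed with ALU Zero."""
    return branch & zero
```

Note that `pc_src` is kept outside the table on purpose: it is the one control line that depends on a datapath result (Zero) as well as on the opcode.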
Figure 4.18 defines how the control signals should be set for each opcode; this information follows directly from Figures 4.12, 4.16, and 4.17.

FIGURE 4.17 The simple datapath with the control unit. [Datapath diagram: the datapath of Figure 4.15 with a Control block driven by Instruction [31:26] and producing RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite.] The input to the control unit is the 6-bit opcode field from the instruction. The outputs of the control unit consist of three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, and MemtoReg), three signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, and MemWrite), a 1-bit signal used in determining whether to possibly branch (Branch), and a 2-bit control signal for the ALU (ALUOp). An AND gate is used to combine the branch control signal and the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that PCSrc is now a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures.

Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0
lw | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
sw | X | 1 | X | 0 | 0 | 1 | 0 | 0 | 0
beq | X | 0 | X | 0 | 0 | 0 | 1 | 0 | 1

FIGURE 4.18 The setting of the control lines is completely determined by the opcode fields of the instruction. The first row of the table corresponds to the R-format instructions (add, sub, AND, OR, and slt). For all these instructions, the source register fields are rs and rt, and the destination register field is rd; this defines how the signals ALUSrc and RegDst are set. Furthermore, an R-type instruction writes a register (RegWrite = 1), but neither reads nor writes data memory. When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise, the PC is replaced by the branch target if the Zero output of the ALU is also high. The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control should be generated from the funct field. The second and third rows of this table give the control signal settings for lw and sw. The ALUSrc and ALUOp fields are set to perform the address calculation. MemRead and MemWrite are set to perform the memory access. Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register. The branch instruction is similar to an R-format operation, since it sends the rs and rt registers to the ALU. The ALUOp field for branch is set for a subtract (ALUOp = 01), which is used to test for equality. Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0: since the register is not being written, the value of the data on the register data write port is not used. Thus, the entry MemtoReg in the last two rows of the table is replaced with X for don't care. Don't cares can also be added to RegDst when RegWrite is 0. This type of don't care must be added by the designer, since it depends on knowledge of how the datapath works.

Operation of the Datapath

With the information contained in Figures 4.16 and 4.18, we can design the control unit logic, but before we do that, let's look at how each instruction uses the datapath. In the next few figures, we show the flow of three different instruction classes through the datapath. The asserted control signals and active datapath elements are highlighted in each of these. Note that a multiplexor whose control is 0 has a definite action, even if its control line is not highlighted. Multiple-bit control signals are highlighted if any constituent signal is asserted.

Figure 4.19 shows the operation of the datapath for an R-type instruction, such as add $t1,$t2,$t3. Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow of information:

1. The instruction is fetched, and the PC is incremented.

2. Two registers, $t2 and $t3, are read from the register file; also, the main control unit computes the setting of the control lines during this step.

3. The ALU operates on the data read from the register file, using the function code (bits 5:0, which is the funct field, of the instruction) to generate the ALU function.

4. The result from the ALU is written into the register file using bits 15:11 of the instruction to select the destination register ($t1).

FIGURE 4.19 The datapath in operation for an R-type instruction, such as add $t1,$t2,$t3. [Datapath diagram as in Figure 4.17.] The control lines, datapath units, and connections that are active are highlighted.

Similarly, we can illustrate the execution of a load word, such as lw $t1, offset($t2), in a style similar to Figure 4.19. Figure 4.20 shows the active functional units and asserted control lines for a load.
We can think of a load instruction as operating in five steps (similar to how the R-type executed in four):

1. An instruction is fetched from the instruction memory, and the PC is incremented.

2. A register ($t2) value is read from the register file.

3. The ALU computes the sum of the value read from the register file and the sign-extended, lower 16 bits of the instruction (offset).

4. The sum from the ALU is used as the address for the data memory.

5. The data from the memory unit is written into the register file; the register destination is given by bits 20:16 of the instruction ($t1).

FIGURE 4.20 The datapath in operation for a load instruction. [Datapath diagram as in Figure 4.17.] The control lines, datapath units, and connections that are active are highlighted. A store instruction would operate very similarly. The main difference would be that the memory control would indicate a write rather than a read, the second register value read would be used for the data to store, and the operation of writing the data memory value to the register file would not occur.

Finally, we can show the operation of the branch-on-equal instruction, such as beq $t1, $t2, offset, in the same fashion. It operates much like an R-format instruction, but the ALU output is used to determine whether the PC is written with PC + 4 or the branch target address. Figure 4.21 shows the four steps in execution:

1. An instruction is fetched from the instruction memory, and the PC is incremented.

2. Two registers, $t1 and $t2, are read from the register file.

3. The ALU performs a subtract on the data values read from the register file. The value of PC + 4 is added to the sign-extended, lower 16 bits of the instruction (offset) shifted left by two; the result is the branch target address.

4. The Zero result from the ALU is used to decide which adder result to store into the PC.

FIGURE 4.21 The datapath in operation for a branch-on-equal instruction. [Datapath diagram as in Figure 4.17.] The control lines, datapath units, and connections that are active are highlighted. After using the register file and ALU to perform the compare, the Zero output is used to select the next program counter from between the two candidates.

Finalizing Control

Now that we have seen how the instructions operate in steps, let's continue with the control implementation. The control function can be precisely defined using the contents of Figure 4.18. The outputs are the control lines, and the input is the 6-bit opcode field, Op[5:0]. Thus, we can create a truth table for each of the outputs based on the binary encoding of the opcodes.

Figure 4.22 shows the logic in the control unit as one large truth table that combines all the outputs and that uses the opcode bits as inputs. It completely specifies the control function, and we can implement it directly in gates in an automated fashion. We show this final step in Section D.2 in Appendix D.
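The step-by-step descriptions of R-type, load, store, and branch execution can be tied together in a small behavioral model that executes one instruction per call. This is a sketch under stated assumptions, not the hardware: the function names are hypothetical, instruction and data memory are modeled as Python dictionaries keyed by byte address, registers as a list of 32 unsigned values, and special handling of register 0 is omitted.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits as a signed Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a signed Python integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

# ALU operations selected by the funct field of R-type instructions.
R_TYPE_ALU = {
    0b100000: lambda a, b: a + b,                                  # add
    0b100010: lambda a, b: a - b,                                  # sub
    0b100100: lambda a, b: a & b,                                  # AND
    0b100101: lambda a, b: a | b,                                  # OR
    0b101010: lambda a, b: int(to_signed32(a) < to_signed32(b)),  # slt
}

def step(pc: int, regs: list, imem: dict, dmem: dict) -> int:
    """Execute the instruction at pc and return the next PC."""
    # Step 1: fetch the instruction and increment the PC.
    instr = imem[pc]
    pc_plus_4 = (pc + 4) & MASK32
    op = (instr >> 26) & 0x3F
    rs, rt, rd = (instr >> 21) & 0x1F, (instr >> 16) & 0x1F, (instr >> 11) & 0x1F
    funct, offset = instr & 0x3F, sign_extend16(instr)
    # Step 2: read the two source registers.
    a, b = regs[rs], regs[rt]
    if op == 0:                      # R-type: ALU result goes to rd
        regs[rd] = R_TYPE_ALU[funct](a, b) & MASK32
    elif op == 35:                   # lw: memory data goes to rt
        regs[rt] = dmem.get((a + offset) & MASK32, 0)
    elif op == 43:                   # sw: rt is stored at the ALU address
        dmem[(a + offset) & MASK32] = b
    elif op == 4:                    # beq: subtract, then test Zero
        zero = ((a - b) & MASK32) == 0
        if zero:
            return (pc_plus_4 + (offset << 2)) & MASK32
    else:
        raise ValueError(f"opcode {op} is outside the implemented subset")
    return pc_plus_4
```

Each call corresponds to one clock cycle of the single-cycle design: all of the steps listed in the text happen inside a single invocation, and the returned value is what gets written into the PC at the end of the cycle.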
Input or output | Signal name | R-format | lw | sw | beq
Inputs | Op5 | 0 | 1 | 1 | 0
Inputs | Op4 | 0 | 0 | 0 | 0
Inputs | Op3 | 0 | 0 | 1 | 0
Inputs | Op2 | 0 | 0 | 0 | 1
Inputs | Op1 | 0 | 1 | 1 | 0
Inputs | Op0 | 0 | 1 | 1 | 0
Outputs | RegDst | 1 | 0 | X | X
Outputs | ALUSrc | 0 | 1 | 1 | 0
Outputs | MemtoReg | 0 | 1 | X | X
Outputs | RegWrite | 1 | 1 | 0 | 0
Outputs | MemRead | 0 | 1 | 0 | 0
Outputs | MemWrite | 0 | 0 | 1 | 0
Outputs | Branch | 0 | 0 | 0 | 1
Outputs | ALUOp1 | 1 | 0 | 0 | 0
Outputs | ALUOp0 | 0 | 0 | 0 | 1

FIGURE 4.22 The control function for the simple single-cycle implementation is completely specified by this truth table. The top half of the table gives the combinations of input signals that correspond to the four opcodes, one per column, that determine the control output settings. (Remember that Op[5:0] corresponds to bits 31:26 of the instruction, which is the op field.) The bottom portion of the table gives the outputs for each of the four opcodes. Thus, the output RegWrite is asserted for two different combinations of the inputs. If we consider only the four opcodes shown in this table, then we can simplify the truth table by using don't cares in the input portion. For example, we can detect an R-format instruction with the expression $\overline{Op5} \cdot \overline{Op2}$ (that is, Op5 and Op2 both 0), since this is sufficient to distinguish the R-format instructions from lw, sw, and beq. We do not take advantage of this simplification, since the rest of the MIPS opcodes are used in a full implementation.

Now that we have a single-cycle implementation of most of the MIPS core instruction set, let's add the jump instruction to show how the basic datapath and control can be extended to handle other instructions in the instruction set.

EXAMPLE Implementing Jumps

Figure 4.17 shows the implementation of many of the instructions we looked at in Chapter 2. One class of instructions missing is that of the jump instruction. Extend the datapath and control of Figure 4.17 to include the jump instruction. Describe how to set any new control lines.

ANSWER The jump instruction, shown in Figure 4.23, looks somewhat like a branch instruction but computes the target PC differently and is not conditional.
Like a branch, the low-order 2 bits of a jump address are always 00two. Th e next lower 26 bits of this 32-bit address come from the 26-bit immediate fi eld in the instruction. Th e upper 4 bits of the address that should replace the PC come from the PC of the jump instruction plus 4. Th us, we can implement a jump by storing into the PC the concatenation of ■ the upper 4 bits of the current PC + 4 (these are bits 31:28 of the sequentially following instruction address) ■ the 26-bit immediate fi eld of the jump instruction ■ the bits 00two Figure 4.24 shows the addition of the control for jump added to Figure 4.17. An additional multiplexor is used to select the source for the new PC value, which is either the incremented PC (PC + 4), the branch target PC, or the jump target PC. One additional control signal is needed for the additional multiplexor. Th is control signal, called Jump, is asserted only when the instruction is a jump— that is, when the opcode is 2. EXAMPLE ANSWER Field 000010 address Bit positions 31:26 25:0 FIGURE 4.23 Instruction format for the jump instruction (opcode = 2). Th e destination address for a jump instruction is formed by concatenating the upper 4 bits of the current PC + 4 to the 26-bit address fi eld in the jump instruction and adding 00 as the 2 low-order bits. single-cycle implementation Also called single clock cycle implementation. An implementation in which an instruction is executed in one clock cycle. While easy to understand, it is too slow to be practical. 4.4 A Simple Implementation Scheme 271 Why a Single-Cycle Implementation Is Not Used Today Although the single-cycle design will work correctly, it would not be used in modern designs because it is ineffi cient. To see why this is so, notice that the clock cycle must have the same length for every instruction in this single-cycle design. Of course, the longest possible path in the processor determines the clock cycle. 
This path is almost certainly a load instruction, which uses five functional units in series: the instruction memory, the register file, the ALU, the data memory, and the register file. Although the CPI is 1 (see Chapter 1), the overall performance of a single-cycle implementation is likely to be poor, since the clock cycle is too long.

FIGURE 4.24 The simple control and datapath are extended to handle the jump instruction. An additional multiplexor (at the upper right) is used to choose between the jump target and either the branch target or the sequential instruction following this one. This multiplexor is controlled by the jump control signal. The jump target address is obtained by shifting the lower 26 bits of the jump instruction left 2 bits, effectively adding 00 as the low-order bits, and then concatenating the upper 4 bits of PC + 4 as the high-order bits, thus yielding a 32-bit address.

The penalty for using the single-cycle design with a fixed clock cycle is significant, but might be considered acceptable for this small instruction set. Historically, early computers with very simple instruction sets did use this implementation technique. However, if we tried to implement the floating-point unit or an instruction set with more complex instructions, this single-cycle design wouldn’t work well at all.
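The next-PC selection that Figure 4.24 adds for jumps can be sketched in a few lines. This is an illustrative Python model, not the hardware; the names MASK32, jump_target, branch_target, and next_pc are assumptions, and addresses are treated as 32-bit unsigned integers.

```python
# Illustrative model of next-PC selection in the datapath of Figure 4.24.
# All names here are hypothetical; addresses are 32-bit unsigned values.

MASK32 = 0xFFFFFFFF


def jump_target(pc: int, instruction: int) -> int:
    """Concatenate bits 31:28 of PC + 4, the 26-bit address field, and 00."""
    upper4 = ((pc + 4) & MASK32) & 0xF0000000  # bits 31:28 of PC + 4
    field26 = instruction & 0x03FFFFFF         # instruction bits 25:0
    return upper4 | (field26 << 2)             # shifting left by 2 appends 00


def branch_target(pc: int, instruction: int) -> int:
    """PC + 4 plus the sign-extended 16-bit offset shifted left by 2."""
    offset = instruction & 0xFFFF
    if offset & 0x8000:                        # sign-extend 16 bits
        offset -= 0x10000
    return (pc + 4 + (offset << 2)) & MASK32


def next_pc(pc: int, instruction: int, jump: bool, branch: bool, zero: bool) -> int:
    """Model the PC multiplexors: Jump wins, then Branch AND Zero, else PC + 4."""
    if jump:
        return jump_target(pc, instruction)
    if branch and zero:
        return branch_target(pc, instruction)
    return (pc + 4) & MASK32
```

For instance, a jump word 0x08100008 fetched at address 0x00400000 gives jump_target(0x00400000, 0x08100008) == 0x00400020: the address field 0x0100008 is shifted left by 2 and combined with the upper 4 bits of PC + 4.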
Because we must assume that the clock cycle is equal to the worst-case delay for all instructions, it’s useless to try implementation techniques that reduce the delay of the common case but do not improve the worst-case cycle time. A single-cycle implementation thus violates the great idea from Chapter 1 of making the common case fast.

In the next section, we’ll look at another implementation technique, called pipelining, that uses a datapath very similar to the single-cycle datapath but is much more efficient by having a much higher throughput. Pipelining improves efficiency by executing multiple instructions simultaneously.

Check Yourself
Look at the control signals in Figure 4.22. Can you combine any together? Can any control signal output in the figure be replaced by the inverse of another? (Hint: take into account the don’t cares.) If so, can you use one signal for the other without adding an inverter?

4.5 An Overview of Pipelining

Never waste time.
American proverb

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is nearly universal.

pipelining: An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.

This section relies heavily on one analogy to give an overview of the pipelining terms and issues. If you are interested in just the big picture, you should concentrate on this section and then skip to Sections 4.10 and 4.11 to see an introduction to the advanced pipelining techniques used in recent processors such as the Intel Core i7 and ARM Cortex-A8. If you are interested in exploring the anatomy of a pipelined computer, this section is a good introduction to Sections 4.6 through 4.9.

Anyone who has done a lot of laundry has intuitively used pipelining. The nonpipelined approach to laundry would be as follows:

1. Place one dirty load of clothes in the washer.
2. When the washer is finished, place the wet load in the dryer.
3. When the dryer is finished, place the dry load on a table and fold.
4. When folding is finished, ask your roommate to put the clothes away.

When your roommate is done, start over with the next dirty load.

The pipelined approach takes much less time, as Figure 4.25 shows. As soon as the washer is finished with the first load and that load is placed in the dryer, you load the washer with the second dirty load. When the first load is dry, you place it on the table to start folding, move the wet load to the dryer, and put the next dirty load into the washer. Next you have your roommate put the first load away, you start folding the second load, the dryer has the third load, and you put the fourth load into the washer. At this point all steps (called stages in pipelining) are operating concurrently. As long as we have separate resources for each stage, we can pipeline the tasks.

The pipelining paradox is that the time from placing a single dirty sock in the washer until it is dried, folded, and put away is not shorter for pipelining; the reason pipelining is faster for many loads is that everything is working in parallel, so more loads are finished per hour. Pipelining improves throughput of our laundry system. Hence, pipelining would not decrease the time to complete one load of laundry, but when we have many loads of laundry to do, the improvement in throughput decreases the total time to complete the work.

FIGURE 4.25 The laundry analogy for pipelining. Ann, Brian, Cathy, and Don each have dirty clothes to be washed, dried, folded, and put away. The washer, dryer, “folder,” and “storer” each take 30 minutes for their task. Sequential laundry takes 8 hours for 4 loads of wash, while pipelined laundry takes just 3.5 hours. We show the pipeline stage of different loads over time by showing copies of the four resources on this two-dimensional time line, but we really have just one of each resource.

If all the stages take about the same amount of time and there is enough work to do, then the speed-up due to pipelining is equal to the number of stages in the pipeline, in this case four: washing, drying, folding, and putting away. Therefore, pipelined laundry is potentially four times faster than nonpipelined: 20 loads would take about 5 times as long as 1 load, while 20 loads of sequential laundry takes 20 times as long as 1 load. It’s only 2.3 times faster in Figure 4.25 (8 hours versus 3.5 hours), because we only show 4 loads. Notice that at the beginning and end of the workload in the pipelined version in Figure 4.25, the pipeline is not completely full; this start-up and wind-down affects performance when the number of tasks is not large compared to the number of stages in the pipeline. If the number of loads is much larger than 4, then the stages will be full most of the time and the increase in throughput will be very close to 4.

The same principles apply to processors where we pipeline instruction execution. MIPS instructions classically take five steps:

1. Fetch instruction from memory.
2. Read registers while decoding the instruction. The regular format of MIPS instructions allows reading and decoding to occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.

Hence, the MIPS pipeline we explore in this chapter has five stages. The following example shows that pipelining speeds up instruction execution just as it speeds up the laundry.

Single-Cycle versus Pipelined Performance

EXAMPLE
To make this discussion concrete, let’s create a pipeline.
In this example, and in the rest of this chapter, we limit our attention to eight instructions: load word (lw), store word (sw), add (add), subtract (sub), AND (and), OR (or), set less than (slt), and branch on equal (beq).

Compare the average time between instructions of a single-cycle implementation, in which all instructions take one clock cycle, to a pipelined implementation. The operation times for the major functional units in this example are 200 ps for memory access, 200 ps for ALU operation, and 100 ps for register file read or write. In the single-cycle model, every instruction takes exactly one clock cycle, so the clock cycle must be stretched to accommodate the slowest instruction.

ANSWER
Figure 4.26 shows the time required for each of the eight instructions. The single-cycle design must allow for the slowest instruction (in Figure 4.26 it is lw), so the time required for every instruction is 800 ps. Similarly to Figure 4.25, Figure 4.27 compares nonpipelined and pipelined execution of three load word instructions. Thus, the time between the first and fourth instructions in the nonpipelined design is 3 × 800 ps or 2400 ps.

All the pipeline stages take a single clock cycle, so the clock cycle must be long enough to accommodate the slowest operation. Just as the single-cycle design must take the worst-case clock cycle of 800 ps, even though some instructions can be as fast as 500 ps, the pipelined execution clock cycle must have the worst-case clock cycle of 200 ps, even though some stages take only 100 ps. Pipelining still offers a fourfold performance improvement: the time between the first and fourth instructions is 3 × 200 ps or 600 ps.

We can turn the pipelining speed-up discussion above into a formula.
If the stages are perfectly balanced, then the time between instructions on the pipelined processor, assuming ideal conditions, is equal to

\[ \text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}} \]

Under ideal conditions and with a large number of instructions, the speed-up from pipelining is approximately equal to the number of pipe stages; a five-stage pipeline is nearly five times faster.

The formula suggests that a five-stage pipeline should offer nearly a fivefold improvement over the 800 ps nonpipelined time, or a 160 ps clock cycle. The example shows, however, that the stages may be imperfectly balanced. Moreover, pipelining involves some overhead, the source of which will be clearer shortly. Thus, the time per instruction in the pipelined processor will exceed the minimum possible, and speed-up will be less than the number of pipeline stages.

Instruction class                   Instruction fetch   Register read   ALU operation   Data access   Register write   Total time
Load word (lw)                      200 ps              100 ps          200 ps          200 ps        100 ps           800 ps
Store word (sw)                     200 ps              100 ps          200 ps          200 ps                         700 ps
R-format (add, sub, AND, OR, slt)   200 ps              100 ps          200 ps                        100 ps           600 ps
Branch (beq)                        200 ps              100 ps          200 ps                                         500 ps

FIGURE 4.26 Total time for each instruction calculated from the time for each component. This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.

Moreover, even our claim of fourfold improvement for our example is not reflected in the total execution time for the three instructions: it’s 1400 ps versus 2400 ps. Of course, this is because the number of instructions is not large. What would happen if we increased the number of instructions? We could extend the previous figures to 1,000,003 instructions. We would add 1,000,000 instructions in the pipelined example; each instruction adds 200 ps to the total execution time.
The total execution time would be 1,000,000 × 200 ps + 1400 ps, or 200,001,400 ps. In the nonpipelined example, we would add 1,000,000 instructions, each taking 800 ps, so total execution time would be 1,000,000 × 800 ps + 2400 ps, or 800,002,400 ps. Under these conditions, the ratio of total execution times for real programs on nonpipelined to pipelined processors is close to the ratio of times between instructions:

\[ \frac{800{,}002{,}400\ \text{ps}}{200{,}001{,}400\ \text{ps}} \simeq \frac{800\ \text{ps}}{200\ \text{ps}} \simeq 4.00 \]

[Figure 4.27: two timing diagrams for the sequence lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Top: nonpipelined, instructions start 800 ps apart. Bottom: pipelined, instructions start 200 ps apart.]
FIGURE 4.27 Single-cycle, nonpipelined execution (top) versus pipelined execution (bottom). Both use the same hardware components, whose time is listed in Figure 4.26. In this case, we see a fourfold speed-up on average time between instructions, from 800 ps down to 200 ps. Compare this figure to Figure 4.25. For the laundry, we assumed all stages were equal. If the dryer were slowest, then the dryer stage would set the stage time. The pipeline stage times of a computer are also limited by the slowest resource, either the ALU operation or the memory access. We assume the write to the register file occurs in the first half of the clock cycle and the read from the register file occurs in the second half. We use this assumption throughout this chapter.
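The arithmetic in this example can be checked with a short script. The sketch below assumes the unit delays of Figure 4.26 and an ideal five-stage pipeline with no hazards or stalls; the names UNIT_PS, USES, and the helper functions are hypothetical.

```python
# Back-of-the-envelope timing model for the example (delays from Figure 4.26).
# Assumes an ideal pipeline: no hazards, no stalls, one instruction per cycle.

UNIT_PS = {"fetch": 200, "reg_read": 100, "alu": 200,
           "data_access": 200, "reg_write": 100}

# Functional units used by each instruction class (Figure 4.26).
USES = {
    "lw":       ("fetch", "reg_read", "alu", "data_access", "reg_write"),
    "sw":       ("fetch", "reg_read", "alu", "data_access"),
    "R-format": ("fetch", "reg_read", "alu", "reg_write"),
    "beq":      ("fetch", "reg_read", "alu"),
}


def instruction_time(kind: str) -> int:
    """Total time in ps for one instruction class."""
    return sum(UNIT_PS[unit] for unit in USES[kind])


# Single-cycle clock is set by the slowest instruction (lw, 800 ps).
SINGLE_CYCLE_CLOCK = max(instruction_time(kind) for kind in USES)
# Pipelined clock is set by the slowest stage (200 ps).
PIPELINED_CLOCK = max(UNIT_PS.values())
NUM_STAGES = len(UNIT_PS)


def single_cycle_total(n: int) -> int:
    """Total time in ps for n instructions on the single-cycle design."""
    return n * SINGLE_CYCLE_CLOCK


def pipelined_total(n: int) -> int:
    """Total time in ps: fill the pipeline once, then one instruction per cycle."""
    return (NUM_STAGES + n - 1) * PIPELINED_CLOCK
```

With n = 3 the two functions give 2400 ps and 1400 ps, and with n = 1,000,003 they give 800,002,400 ps and 200,001,400 ps, matching the figures in the text; the ratio approaches the clock ratio of 4 as n grows.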
Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction, but instruction throughput is the important metric because real programs execute billions of instructions.

Designing Instruction Sets for Pipelining

Even with this simple explanation of pipelining, we can get insight into the design of the MIPS instruction set, which was designed for pipelined execution.

First, all MIPS instructions are the same length. This restriction makes it much easier to fetch instructions in the first pipeline stage and to decode them in the second stage. In an instruction set like the x86, where instructions vary from 1 byte to 15 bytes, pipelining is considerably more challenging. Recent implementations of the x86 architecture actually translate x86 instructions into simple operations that look like MIPS instructions and then pipeline the simple operations rather than the native x86 instructions! (See Section 4.10.)

Second, MIPS has only a few instruction formats, with the source register fields being located in the same place in each instruction. This symmetry means that the second stage can begin reading the register file at the same time that the hardware is determining what type of instruction was fetched. If MIPS instruction formats were not symmetric, we would need to split stage 2, resulting in six pipeline stages. We will shortly see the downside of longer pipelines.

Third, memory operands only appear in loads or stores in MIPS. This restriction means we can use the execute stage to calculate the memory address and then access memory in the following stage. If we could operate on the operands in memory, as in the x86, stages 3 and 4 would expand to an address stage, memory stage, and then execute stage.

Fourth, as discussed in Chapter 2, operands must be aligned in memory.
Hence, we need not worry about a single data transfer instruction requiring two data memory accesses; the requested data can be transferred between processor and memory in a single pipeline stage.

Pipeline Hazards

There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called hazards, and there are three different types.

Structural Hazards

The first hazard is called a structural hazard. It means that the hardware cannot support the combination of instructions that we want to execute in the same clock cycle. A structural hazard in the laundry room would occur if we used a washer-dryer combination instead of a separate washer and dryer, or if our roommate was busy doing something else and wouldn’t put clothes away. Our carefully scheduled pipeline plans would then be foiled.

structural hazard: When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.

As we said above, the MIPS instruction set was designed to be pipelined, making it fairly easy for designers to avoid structural hazards when designing a pipeline. Suppose, however, that we had a single memory instead of two memories. If the pipeline in Figure 4.27 had a fourth instruction, we would see that in the same clock cycle the first instruction is accessing data from memory while the fourth instruction is fetching an instruction from that same memory. Without two memories, our pipeline could have a structural hazard.

Data Hazards

Data hazards occur when the pipeline must be stalled because one step must wait for another to complete. Suppose you found a sock at the folding station for which no match existed. One possible strategy is to run down to your room and search through your clothes bureau to see if you can find the match.
Obviously, while you are doing the search, loads that have completed drying and are ready to fold, as well as those that have finished washing and are ready to dry, must wait.

In a computer pipeline, data hazards arise from the dependence of one instruction on an earlier one that is still in the pipeline (a relationship that does not really exist when doing laundry). For example, suppose we have an add instruction followed immediately by a subtract instruction that uses the sum ($s0):

    add $s0, $t0, $t1
    sub $t2, $s0, $t3

Without intervention, a data hazard could severely stall the pipeline. The add instruction doesn’t write its result until the fifth stage, meaning that we would have to waste three clock cycles in the pipeline.

Although we could try to rely on compilers to remove all such hazards, the results would not be satisfactory. These dependences happen just too often and the delay is just too long to expect the compiler to rescue us from this dilemma.

The primary solution is based on the observation that we don’t need to wait for the instruction to complete before trying to resolve the data hazard. For the code sequence above, as soon as the ALU creates the sum for the add, we can supply it as an input for the subtract. Adding extra hardware to retrieve the missing item early from the internal resources is called forwarding or bypassing.

data hazard: Also called a pipeline data hazard. When a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.

forwarding: Also called bypassing. A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.

Forwarding with Two Instructions

EXAMPLE
For the two instructions above, show what pipeline stages would be connected by forwarding. Use the drawing in Figure 4.28 to represent the datapath during the five stages of the pipeline. Align a copy of the datapath for each instruction, similar to the laundry pipeline in Figure 4.25.

ANSWER
Figure 4.29 shows the connection to forward the value in $s0 after the execution stage of the add instruction as input to the execution stage of the sub instruction.

[Figure 4.28: pipeline diagram for add $s0, $t0, $t1 showing the stages IF, ID, EX, MEM, WB along a time axis from 200 to 1000 ps.]
FIGURE 4.28 Graphical representation of the instruction pipeline, similar in spirit to the laundry pipeline in Figure 4.25. Here we use symbols representing the physical resources with the abbreviations for pipeline stages used throughout the chapter. The symbols for the five stages: IF for the instruction fetch stage, with the box representing instruction memory; ID for the instruction decode/register file read stage, with the drawing showing the register file being read; EX for the execution stage, with the drawing representing the ALU; MEM for the memory access stage, with the box representing data memory; and WB for the write-back stage, with the drawing showing the register file being written. The shading indicates the element is used by the instruction. Hence, MEM has a white background because add does not access the data memory. Shading on the right half of the register file or memory means the element is read in that stage, and shading of the left half means it is written in that stage. Hence the right half of ID is shaded in the second stage because the register file is read, and the left half of WB is shaded in the fifth stage because the register file is written.

[Figure 4.29: pipeline diagrams for add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each showing IF, ID, EX, MEM, WB, with a forwarding connection between them.]
FIGURE 4.29 Graphical representation of forwarding. The connection shows the forwarding path from the output of the EX stage of add to the input of the EX stage for sub, replacing the value from register $s0 read in the second stage of sub.

In this graphical representation of events, forwarding paths are valid only if the destination stage is later in time than the source stage. For example, there cannot be a valid forwarding path from the output of the memory access stage in the first instruction to the input of the execution stage of the following, since that would mean going backward in time.

Forwarding works very well and is described in detail in Section 4.7. It cannot prevent all pipeline stalls, however. For example, suppose the first instruction was a load of $s0 instead of an add. As we can imagine from looking at Figure 4.29, the desired data would be available only after the fourth stage of the first instruction in the dependence, which is too late for the input of the third stage of sub. Hence, even with forwarding, we would have to stall one stage for a load-use data hazard, as Figure 4.30 shows. This figure shows an important pipeline concept, officially called a pipeline stall, but often given the nickname bubble. We shall see stalls elsewhere in the pipeline. Section 4.7 shows how we can handle hard cases like these, using either hardware detection and stalls or software that reorders code to try to avoid load-use pipeline stalls, as this example illustrates.

Reordering Code to Avoid Pipeline Stalls

EXAMPLE
Consider the following code segment in C:

    a = b + e;
    c = b + f;

Here is the generated MIPS code for this segment, assuming all variables are in memory and are addressable as offsets from $t0:

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

load-use data hazard: A specific form of data hazard in which the data being loaded by a load instruction has not yet become available when it is needed by another instruction.

pipeline stall: Also called bubble.
A stall initiated in order to resolve a hazard.

[Figure 4.30: pipeline diagrams for lw $s0, 20($t1) followed by sub $t2, $s0, $t3, with a one-stage bubble inserted before sub.]
FIGURE 4.30 We need a stall even with forwarding when an R-format instruction following a load tries to use the data. Without the stall, the path from memory access stage output to execution stage input would be going backward in time, which is impossible. This figure is actually a simplification, since we cannot know until after the subtract instruction is fetched and decoded whether or not a stall will be necessary. Section 4.7 shows the details of what really happens in the case of a hazard.

Find the hazards in the preceding code segment and reorder the instructions to avoid any pipeline stalls.

ANSWER
Both add instructions have a hazard because of their respective dependence on the immediately preceding lw instruction. Notice that bypassing eliminates several other potential hazards, including the dependence of the first add on the first lw and any hazards for store instructions. Moving up the third lw instruction to become the third instruction eliminates both hazards:

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

On a pipelined processor with forwarding, the reordered sequence will complete in two fewer cycles than the original version.

Forwarding yields another insight into the MIPS architecture, in addition to the four mentioned earlier under Designing Instruction Sets for Pipelining. Each MIPS instruction writes at most one result and does this in the last stage of the pipeline. Forwarding is harder if there are multiple results to forward per instruction or if there is a need to write a result early on in instruction execution.

Elaboration: The name “forwarding” comes from the idea that the result is passed forward from an earlier instruction to a later instruction.
“Bypassing” comes from passing the result around the register file to the desired unit.

Control Hazards

The third type of hazard is called a control hazard, arising from the need to make a decision based on the results of one instruction while others are executing.

Suppose our laundry crew was given the happy task of cleaning the uniforms of a football team. Given how filthy the laundry is, we need to determine whether the detergent and water temperature setting we select is strong enough to get the uniforms clean but not so strong that the uniforms wear out sooner. In our laundry pipeline, we have to wait until after the second stage to examine the dry uniform to see if we need to change the washer setup or not. What to do?

Here is the first of two solutions to control hazards in the laundry room and its computer equivalent.

Stall: Just operate sequentially until the first batch is dry and then repeat until you have the right formula.

This conservative option certainly works, but it is slow.

control hazard: Also called branch hazard. When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.

The equivalent decision task in a computer is the branch instruction. Notice that we must begin fetching the instruction following the branch on the very next clock cycle. Nevertheless, the pipeline cannot possibly know what the next instruction should be, since it only just received the branch instruction from memory! Just as with laundry, one possible solution is to stall immediately after we fetch a branch, waiting until the pipeline determines the outcome of the branch and knows what instruction address to fetch from.
Let’s assume that we put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC during the second stage of the pipeline (see Section 4.8 for details). Even with this extra hardware, the pipeline involving conditional branches would look like Figure 4.31. The lw instruction, executed if the branch fails, is stalled one extra 200 ps clock cycle before starting.

[Figure 4.31: pipeline diagram for add $4, $5, $6; beq $1, $2, 40; a one-stage bubble; then or $7, $8, $9.]
FIGURE 4.31 Pipeline showing stalling on every conditional branch as solution to control hazards. This example assumes the conditional branch is taken, and the instruction at the destination of the branch is the OR instruction. There is a one-stage pipeline stall, or bubble, after the branch. In reality, the process of creating a stall is slightly more complicated, as we will see in Section 4.8. The effect on performance, however, is the same as would occur if a bubble were inserted.

Performance of “Stall on Branch”

EXAMPLE
Estimate the impact on the clock cycles per instruction (CPI) of stalling on branches. Assume all other instructions have a CPI of 1.

ANSWER
Figure 3.27 in Chapter 3 shows that branches are 17% of the instructions executed in SPECint2006. Since the other instructions have a CPI of 1, and branches took one extra clock cycle for the stall, then we would see a CPI of 1.17 and hence a slowdown of 1.17 versus the ideal case.

If we cannot resolve the branch in the second stage, as is often the case for longer pipelines, then we’d see an even larger slowdown if we stall on branches.
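The CPI estimate above follows from a one-line formula, sketched here in Python. The function name cpi_with_branch_stalls and its parameters are hypothetical, and the model assumes every branch pays the same fixed stall and nothing else stalls.

```python
def cpi_with_branch_stalls(branch_fraction: float,
                           stall_cycles: int = 1,
                           base_cpi: float = 1.0) -> float:
    """Average CPI when every branch adds a fixed number of stall cycles.

    Assumes all non-branch instructions run at base_cpi and that each
    branch costs base_cpi plus stall_cycles extra clock cycles.
    """
    return base_cpi + branch_fraction * stall_cycles
```

With 17% branches and a one-cycle stall this gives a CPI of about 1.17, as in the example. Under the same assumptions, a pipeline that needed two stall cycles per branch would come out near 1.34; that second figure is an illustration of the trend, not a number taken from the text.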
The cost of this option is too high for most computers to use and motivates a second solution to the control hazard using one of our great ideas from Chapter 1:

Predict: If you’re pretty sure you have the right formula to wash uniforms, then just predict that it will work and wash the second load while waiting for the first load to dry.

This option does not slow down the pipeline when you are correct. When you are wrong, however, you need to redo the load that was washed while guessing the decision.

Computers do indeed use prediction to handle branches. One simple approach is to predict always that branches will be untaken. When you’re right, the pipeline proceeds at full speed. Only when branches are taken does the pipeline stall. Figure 4.32 shows such an example.

[Figure 4.32: two pipeline diagrams. Top: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), with instructions starting 200 ps apart. Bottom: add $4, $5, $6; beq $1, $2, 40; a one-stage bubble; then or $7, $8, $9.]
FIGURE 4.32 Predicting that branches are not taken as a solution to control hazard. The top drawing shows the pipeline when the branch is not taken. The bottom drawing shows the pipeline when the branch is taken. As we noted in Figure 4.31, the insertion of a bubble in this fashion simplifies what actually happens, at least during the first clock cycle immediately following the branch. Section 4.8 will reveal the details.

A more sophisticated version of branch prediction would have some branches predicted as taken and some as untaken.
In our analogy, the dark or home uniforms might take one formula while the light or road uniforms might take another. In the case of programming, at the bottom of loops are branches that jump back to the top of the loop. Since they are likely to be taken and they branch backward, we could always predict taken for branches that jump to an earlier address.

Such rigid approaches to branch prediction rely on stereotypical behavior and don’t account for the individuality of a specific branch instruction. Dynamic hardware predictors, in stark contrast, make their guesses depending on the behavior of each branch and may change predictions for a branch over the life of a program. Following our analogy, in dynamic prediction a person would look at how dirty the uniform was and guess at the formula, adjusting the next prediction depending on the success of recent guesses.

One popular approach to dynamic prediction of branches is keeping a history for each branch as taken or untaken, and then using the recent past behavior to predict the future. As we will see later, the amount and type of history kept have become extensive, with the result being that dynamic branch predictors can correctly predict branches with more than 90% accuracy (see Section 4.8). When the guess is wrong, the pipeline control must ensure that the instructions following the wrongly guessed branch have no effect and must restart the pipeline from the proper branch address. In our laundry analogy, we must stop taking new loads so that we can restart the load that we incorrectly predicted.

As in the case of all other solutions to control hazards, longer pipelines exacerbate the problem, in this case by raising the cost of misprediction. Solutions to control hazards are described in more detail in Section 4.8.

Elaboration: There is a third approach to the control hazard, called delayed decision.
In our analogy, whenever you are going to make such a decision about laundry, just place a load of nonfootball clothes in the washer while waiting for football uniforms to dry. As long as you have enough dirty clothes that are not affected by the test, this solution works fine.

Called the delayed branch in computers, and mentioned above, this is the solution actually used by the MIPS architecture. The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay. It is hidden from the MIPS assembly language programmer because the assembler can automatically arrange the instructions to get the branch behavior desired by the programmer. MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction. In our example, the add instruction before the branch in Figure 4.31 does not affect the branch and can be moved after the branch to fully hide the branch delay. Since delayed branches are useful when the branches are short, no processor uses a delayed branch of more than one cycle. For longer branch delays, hardware-based branch prediction is usually used.

branch prediction: A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.

Pipeline Overview Summary

Pipelining is a technique that exploits parallelism among the instructions in a sequential instruction stream. It has the substantial advantage that, unlike programming a multiprocessor, it is fundamentally invisible to the programmer. In the next few sections of this chapter, we cover the concept of pipelining using the MIPS instruction subset from the single-cycle implementation in Section 4.4 and show a simplified version of its pipeline.
We then look at the problems that pipelining introduces and the performance attainable under typical situations.

If you wish to focus more on the software and the performance implications of pipelining, you now have sufficient background to skip to Section 4.10. Section 4.10 introduces advanced pipelining concepts, such as superscalar and dynamic scheduling, and Section 4.11 examines the pipelines of recent microprocessors. Alternatively, if you are interested in understanding how pipelining is implemented and the challenges of dealing with hazards, you can proceed to examine the design of a pipelined datapath and the basic control, explained in Section 4.6. You can then use this understanding to explore the implementation of forwarding and stalls in Section 4.7. You can then read Section 4.8 to learn more about solutions to branch hazards, and then see how exceptions are handled in Section 4.9.

Check Yourself
For each code sequence below, state whether it must stall, can avoid stalls using only forwarding, or can execute without stalling or forwarding.

Sequence 1          Sequence 2          Sequence 3
lw   $t0,0($t0)     add  $t1,$t0,$t0    addi $t1,$t0,#1
add  $t1,$t0,$t0    addi $t2,$t0,#5     addi $t2,$t0,#2
                    addi $t4,$t1,#5     addi $t3,$t0,#2
                                        addi $t3,$t0,#4
                                        addi $t5,$t0,#5

Understanding Program Performance
Outside the memory system, the effective operation of the pipeline is usually the most important factor in determining the CPI of the processor and hence its performance. As we will see in Section 4.10, understanding the performance of a modern multiple-issue pipelined processor is complex and requires understanding more than just the issues that arise in a simple pipelined processor. Nonetheless, structural, data, and control hazards remain important in both simple pipelines and more sophisticated ones.
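The Check Yourself sequences above can also be classified mechanically. The following is a minimal sketch, not a full hazard detector: the function names are hypothetical, only lw, add, and addi are parsed, and it assumes the five-stage pipeline of this chapter, in which a result used one or two instructions later needs forwarding, a load result used by the very next instruction forces a stall, and a result used three or more instructions later can simply be read from the register file.

```python
import re


def parse(instruction):
    """Split an instruction into (opcode, destination, source registers).

    Only lw, add, and addi are handled; for these the first register
    named is the destination and the remaining ones are sources.
    """
    opcode, operands = instruction.split(None, 1)
    registers = re.findall(r"\$\w+", operands)
    return opcode, registers[0], registers[1:]


def classify(sequence):
    """Classify a straight-line sequence for the five-stage pipeline."""
    parsed = [parse(instruction) for instruction in sequence]
    needs_forwarding = False
    for i, (opcode, destination, _) in enumerate(parsed):
        # A consumer 1 or 2 instructions later reads the register file
        # before the producer has written its result back.
        for distance in (1, 2):
            if i + distance >= len(parsed):
                break
            if destination in parsed[i + distance][2]:
                if opcode == "lw" and distance == 1:
                    return "must stall"  # load-use: the data arrives too late
                needs_forwarding = True
    return "forwarding only" if needs_forwarding else "no stall or forwarding"


SEQUENCES = {
    "Sequence 1": ["lw $t0,0($t0)", "add $t1,$t0,$t0"],
    "Sequence 2": ["add $t1,$t0,$t0", "addi $t2,$t0,#5", "addi $t4,$t1,#5"],
    "Sequence 3": ["addi $t1,$t0,#1", "addi $t2,$t0,#2", "addi $t3,$t0,#2",
                   "addi $t3,$t0,#4", "addi $t5,$t0,#5"],
}

for name, sequence in SEQUENCES.items():
    print(f"{name}: {classify(sequence)}")
```

The sketch ignores special cases such as writes to register $0, so it is a way to check your reasoning rather than a substitute for it.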
For modern pipelines, structural hazards usually revolve around the floating-point unit, which may not be fully pipelined, while control hazards are usually more of a problem in integer programs, which tend to have higher branch frequencies as well as less predictable branches. Data hazards can be performance bottlenecks in both integer and floating-point programs. Often it is easier to deal with data hazards in floating-point programs because the lower branch frequency and more regular memory access patterns allow the compiler to try to schedule instructions to avoid hazards. It is more difficult to perform such optimizations in integer programs that have less regular memory access, involving more use of pointers. As we will see in Section 4.10, there are more ambitious compiler and hardware techniques for reducing data dependences through scheduling.

The BIG Picture
Pipelining increases the number of simultaneously executing instructions and the rate at which instructions are started and completed. Pipelining does not reduce the time it takes to complete an individual instruction, also called the latency. For example, the five-stage pipeline still takes 5 clock cycles for the instruction to complete. In the terms used in Chapter 1, pipelining improves instruction throughput rather than individual instruction execution time or latency.

latency (pipeline): The number of stages in a pipeline or the number of stages between two instructions during execution.

Instruction sets can either simplify or make life harder for pipeline designers, who must already cope with structural, control, and data hazards. Branch prediction and forwarding help make a computer fast while still getting the right answers.

4.6 Pipelined Datapath and Control

Figure 4.33 shows the single-cycle datapath from Section 4.4 with the pipeline stages identified.
“There is less in this than meets the eye.” Tallulah Bankhead, remark to Alexander Woollcott, 1922

The division of an instruction into five stages means a five-stage pipeline, which in turn means that up to five instructions will be in execution during any single clock cycle. Thus, we must separate the datapath into five pieces, with each piece named corresponding to a stage of instruction execution:

1. IF: Instruction fetch
2. ID: Instruction decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back

In Figure 4.33, these five components correspond roughly to the way the datapath is drawn; instructions and data move generally from left to right through the five stages as they complete execution. Returning to our laundry analogy, clothes get cleaner, drier, and more organized as they move through the line, and they never move backward.

There are, however, two exceptions to this left-to-right flow of instructions:

■ The write-back stage, which places the result back into the register file in the middle of the datapath
■ The selection of the next value of the PC, choosing between the incremented PC and the branch address from the MEM stage

Data flowing from right to left does not affect the current instruction; these reverse data movements influence only later instructions in the pipeline. Note that

FIGURE 4.33 The single-cycle datapath from Section 4.4 (similar to Figure 4.17).
Each step of the instruction can be mapped onto the datapath from left to right. The only exceptions are the update of the PC and the write-back step, shown in color, which sends either the ALU result or the data from memory to the left to be written into the register file. (Normally we use color lines for control, but these are data lines.)

the first right-to-left flow of data can lead to data hazards and the second leads to control hazards.

One way to show what happens in pipelined execution is to pretend that each instruction has its own datapath, and then to place these datapaths on a timeline to show their relationship. Figure 4.34 shows the execution of the instructions in Figure 4.27 by displaying their private datapaths on a common timeline. We use a stylized version of the datapath in Figure 4.33 to show the relationships in Figure 4.34.

Figure 4.34 seems to suggest that three instructions need three datapaths. Instead, we add registers to hold data so that portions of a single datapath can be shared during instruction execution.

For example, as Figure 4.34 shows, the instruction memory is used during only one of the five stages of an instruction, allowing it to be shared by following instructions during the other four stages. To retain the value of an individual instruction for its other four stages, the value read from instruction memory must be saved in a register. Similar arguments apply to every pipeline stage, so we must place registers wherever there are dividing lines between stages in Figure 4.33. Returning to our laundry analogy, we might have a basket between each pair of stages to hold the clothes for the next step.
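The role of the registers placed between stages can be sketched as a chain of latches that all shift on the same clock edge. The code below is an illustrative model only: the function names are hypothetical, each register is named for the two stages it separates, and each one holds just a label for the instruction rather than real data and control bits.

```python
# One register sits on each dividing line between adjacent stages.
REGISTER_NAMES = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]


def clock_edge(pipeline_registers, fetched):
    """On a clock edge every register takes over what the stage before it held."""
    return [fetched] + pipeline_registers[:-1]


def simulate(program):
    """Return the register contents after each clock cycle (hazards ignored)."""
    registers = [None] * len(REGISTER_NAMES)
    history = []
    # Extra empty fetches let the last instruction drain out of the pipeline.
    for fetched in program + [None] * len(REGISTER_NAMES):
        registers = clock_edge(registers, fetched)
        history.append(dict(zip(REGISTER_NAMES, registers)))
    return history


PROGRAM = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]

for cycle, snapshot in enumerate(simulate(PROGRAM), start=1):
    print(f"after CC {cycle}: {snapshot}")
```

After the third clock cycle each of the three loads sits in a different register, which is how a single shared datapath can work on all three at once.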
FIGURE 4.34 Instructions being executed using the single-cycle datapath in Figure 4.33, assuming pipelined execution. (Diagram: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program execution order, against clock cycles CC 1 through CC 7, each instruction drawn as IM, Reg, ALU, DM, Reg.) Similar to Figures 4.28 through 4.30, this figure pretends that each instruction has its own datapath, and shades each portion according to use. Unlike those figures, each stage is labeled by the physical resource used in that stage, corresponding to the portions of the datapath in Figure 4.33. IM represents the instruction memory and the PC in the instruction fetch stage, Reg stands for the register file and sign extender in the instruction decode/register file read stage (ID), and so on. To maintain proper time order, this stylized datapath breaks the register file into two logical parts: registers read during register fetch (ID) and registers written during write back (WB). This dual use is represented by drawing the unshaded left half of the register file using dashed lines in the ID stage, when it is not being written, and the unshaded right half in dashed lines in the WB stage, when it is not being read. As before, we assume the register file is written in the first half of the clock cycle and the register file is read during the second half.

Figure 4.35 shows the pipelined datapath with the pipeline registers highlighted. All instructions advance during each clock cycle from one pipeline register to the next. The registers are named for the two stages separated by that register. For example, the pipeline register between the IF and ID stages is called IF/ID.

Notice that there is no pipeline register at the end of the write-back stage. All instructions must update some state in the processor (the register file, memory, or the PC), so a separate pipeline register is redundant to the state that is updated.
For example, a load instruction will place its result in 1 of the 32 registers, and any later instruction that needs that data will simply read the appropriate register.

Of course, every instruction updates the PC, whether by incrementing it or by setting it to a branch destination address. The PC can be thought of as a pipeline register: one that feeds the IF stage of the pipeline. Unlike the shaded pipeline registers in Figure 4.35, however, the PC is part of the visible architectural state; its contents must be saved when an exception occurs, while the contents of the pipeline registers can be discarded. In the laundry analogy, you could think of the PC as corresponding to the basket that holds the load of dirty clothes before the wash step.

To show how the pipelining works, throughout this chapter we show sequences of figures to demonstrate operation over time. These extra pages would seem to require much more time for you to understand. Fear not; the sequences take much

FIGURE 4.35 The pipelined version of the datapath in Figure 4.33. The pipeline registers, in color, separate each pipeline stage. They are labeled by the stages that they separate; for example, the first is labeled IF/ID because it separates the instruction fetch and instruction decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC address.
We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 128, 97, and 64 bits, respectively.

less time than it might appear, because you can compare them to see what changes occur in each clock cycle. Section 4.7 describes what happens when there are data hazards between pipelined instructions; ignore them for now.

Figures 4.36 through 4.38, our first sequence, show the active portions of the datapath highlighted as a load instruction goes through the five stages of pipelined execution. We show a load first because it is active in all five stages. As in Figures 4.28 through 4.30, we highlight the right half of registers or memory when they are being read and highlight the left half when they are being written.

We show the instruction abbreviation lw with the name of the pipe stage that is active in each figure. The five stages are the following:

1. Instruction fetch: The top portion of Figure 4.36 shows the instruction being read from memory using the address in the PC and then being placed in the IF/ID pipeline register. The PC address is incremented by 4 and then written back into the PC to be ready for the next clock cycle. This incremented address is also saved in the IF/ID pipeline register in case it is needed later for an instruction, such as beq. The computer cannot know which type of instruction is being fetched, so it must prepare for any instruction, passing potentially needed information down the pipeline.
2. Instruction decode and register file read: The bottom portion of Figure 4.36 shows the instruction portion of the IF/ID pipeline register supplying the 16-bit immediate field, which is sign-extended to 32 bits, and the register numbers to read the two registers. All three values are stored in the ID/EX pipeline register, along with the incremented PC address.
We again transfer everything that might be needed by any instruction during a later clock cycle.
3. Execute or address calculation: Figure 4.37 shows that the load instruction reads the contents of register 1 and the sign-extended immediate from the ID/EX pipeline register and adds them using the ALU. That sum is placed in the EX/MEM pipeline register.
4. Memory access: The top portion of Figure 4.38 shows the load instruction reading the data memory using the address from the EX/MEM pipeline register and loading the data into the MEM/WB pipeline register.
5. Write-back: The bottom portion of Figure 4.38 shows the final step: reading the data from the MEM/WB pipeline register and writing it into the register file in the middle of the figure.

This walk-through of the load instruction shows that any information needed in a later pipe stage must be passed to that stage via a pipeline register. Walking through a store instruction shows the similarity of instruction execution, as well as passing the information for later stages. Here are the five pipe stages of the store instruction:

FIGURE 4.36 IF and ID: First and second pipe stages of an instruction, with the active portions of the datapath in Figure 4.35 highlighted. (Top panel: lw in instruction fetch. Bottom panel: lw in instruction decode.)
The highlighting convention is the same as that used in Figure 4.28. As in Section 4.2, there is no confusion when reading and writing registers, because the contents change only on the clock edge. Although the load needs only the top register in stage 2, the processor doesn’t know what instruction is being decoded, so it sign-extends the 16-bit constant and reads both registers into the ID/EX pipeline register. We don’t need all three operands, but it simplifies control to keep all three.

1. Instruction fetch: The instruction is read from memory using the address in the PC and then is placed in the IF/ID pipeline register. This stage occurs before the instruction is identified, so the top portion of Figure 4.36 works for store as well as load.
2. Instruction decode and register file read: The instruction in the IF/ID pipeline register supplies the register numbers for reading two registers and extends the sign of the 16-bit immediate. These three 32-bit values are all stored in the ID/EX pipeline register. The bottom portion of Figure 4.36 for load instructions also shows the operations of the second stage for stores. These first two stages are executed by all instructions, since it is too early to know the type of the instruction.
3. Execute and address calculation: Figure 4.39 shows the third step; the effective address is placed in the EX/MEM pipeline register.
4. Memory access: The top portion of Figure 4.40 shows the data being written to memory. Note that the register containing the data to be stored was read in an earlier stage and stored in ID/EX. The only way to make the data available during the MEM stage is to place the data into the EX/MEM pipeline register in the EX stage, just as we stored the effective address into EX/MEM.
FIGURE 4.37 EX: The third pipe stage of a load instruction, highlighting the portions of the datapath in Figure 4.35 used in this pipe stage. (Panel: lw in execution.) The register is added to the sign-extended immediate, and the sum is placed in the EX/MEM pipeline register.

FIGURE 4.38 MEM and WB: The fourth and fifth pipe stages of a load instruction, highlighting the portions of the datapath in Figure 4.35 used in this pipe stage. (Top panel: lw in memory access. Bottom panel: lw in write-back.) Data memory is read using the address in the EX/MEM pipeline registers, and the data is placed in the MEM/WB pipeline register. Next, data is read from the MEM/WB pipeline register and written into the register file in the middle of the datapath. Note: there is a bug in this design that is repaired in Figure 4.41.

5. Write-back: The bottom portion of Figure 4.40 shows the final step of the store. For this instruction, nothing happens in the write-back stage.
Since every instruction behind the store is already in progress, we have no way to accelerate those instructions. Hence, an instruction passes through a stage even if there is nothing to do, because later instructions are already progressing at the maximum rate.

The store instruction again illustrates that to pass something from an early pipe stage to a later pipe stage, the information must be placed in a pipeline register; otherwise, the information is lost when the next instruction enters that pipeline stage. For the store instruction we needed to pass one of the registers read in the ID stage to the MEM stage, where it is stored in memory. The data was first placed in the ID/EX pipeline register and then passed to the EX/MEM pipeline register.

Load and store illustrate a second key point: each logical component of the datapath (such as instruction memory, register read ports, ALU, data memory, and register write port) can be used only within a single pipeline stage. Otherwise, we would have a structural hazard (see page 277). Hence these components, and their control, can be associated with a single pipeline stage.

Now we can uncover a bug in the design of the load instruction. Did you see it? Which register is changed in the final stage of the load? More specifically, which

FIGURE 4.39 EX: The third pipe stage of a store instruction. (Panel: sw in execution.) Unlike the third stage of the load instruction in Figure 4.37, the second register value is loaded into the EX/MEM pipeline register to be used in the next stage.
Although it wouldn’t hurt to always write this second register into the EX/MEM pipeline register, we write the second register only on a store instruction to make the pipeline easier to understand.

FIGURE 4.40 MEM and WB: The fourth and fifth pipe stages of a store instruction. (Top panel: sw in memory access. Bottom panel: sw in write-back.) In the fourth stage, the data is written into data memory for the store. Note that the data comes from the EX/MEM pipeline register and that nothing is changed in the MEM/WB pipeline register. Once the data is written in memory, there is nothing left for the store instruction to do, so nothing happens in stage 5.

instruction supplies the write register number? The instruction in the IF/ID pipeline register supplies the write register number, yet this instruction occurs considerably after the load instruction!

Hence, we need to preserve the destination register number in the load instruction. Just as store passed the register contents from the ID/EX to the EX/MEM pipeline registers for use in the MEM stage, load must pass the register number from the ID/EX through EX/MEM to the MEM/WB pipeline register for use in the WB stage.
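The bug and its repair can be reproduced in a small model. This is a hedged sketch under simplifying assumptions: the instruction sequence and all names are illustrative, each pipeline register holds only a label and a destination register number, and hazards are ignored. The only difference between the two designs is which pipeline register supplies the write register number in the WB stage.

```python
# Illustrative program; "dest" is the register number each instruction writes.
PROGRAM = [
    {"text": "lw $10, 20($1)", "dest": 10},
    {"text": "sub $11, $2, $3", "dest": 11},
    {"text": "add $12, $3, $4", "dest": 12},
    {"text": "lw $13, 24($1)", "dest": 13},
    {"text": "add $14, $5, $6", "dest": 14},
]


def run(program, corrected):
    """Return (instruction, register number written) pairs seen in the WB stage."""
    if_id = id_ex = ex_mem = mem_wb = None
    writes = []
    # Four empty fetches are more than enough to drain the pipeline.
    for fetched in program + [None] * 4:
        # Clock edge: every pipeline register takes its predecessor's contents.
        mem_wb, ex_mem, id_ex, if_id = ex_mem, id_ex, if_id, fetched
        if mem_wb is not None:
            # Buggy design: the number comes from whatever is now in IF/ID.
            # Corrected design: the number travelled along to MEM/WB.
            supplier = mem_wb if corrected else if_id
            register = supplier["dest"] if supplier is not None else None
            writes.append((mem_wb["text"], register))
    return writes


print("buggy:    ", run(PROGRAM, corrected=False)[0])
print("corrected:", run(PROGRAM, corrected=True)[0])
```

In the buggy version the first load’s data lands in the register named by the instruction three places behind it, which is exactly the instruction sitting in IF/ID while the load is in WB.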
Another way to think about the passing of the register number is that to share the pipelined datapath, we need to preserve the instruction read during the IF stage, so each pipeline register contains a portion of the instruction needed for that stage and later stages.

Figure 4.41 shows the correct version of the datapath, passing the write register number first to the ID/EX register, then to the EX/MEM register, and finally to the MEM/WB register. The register number is used during the WB stage to specify the register to be written. Figure 4.42 is a single drawing of the corrected datapath, highlighting the hardware used in all five stages of the load word instruction in Figures 4.36 through 4.38. See Section 4.8 for an explanation of how to make the branch instruction work as expected.

FIGURE 4.41 The corrected pipelined datapath to handle the load instruction properly. The write register number now comes from the MEM/WB pipeline register along with the data. The register number is passed from the ID pipe stage until it reaches the MEM/WB pipeline register, adding five more bits to the last three pipeline registers. This new path is shown in color.

Graphically Representing Pipelines

Pipelining can be difficult to understand, since many instructions are simultaneously executing in a single datapath in every clock cycle. To aid understanding, there are two basic styles of pipeline figures: multiple-clock-cycle pipeline diagrams, such as Figure 4.34 on page 288, and single-clock-cycle pipeline diagrams, such as Figures 4.36 through 4.40.
The multiple-clock-cycle diagrams are simpler but do not contain all the details. For example, consider the following five-instruction sequence:

lw  $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw  $13, 24($1)
add $14, $5, $6

Figure 4.43 shows the multiple-clock-cycle pipeline diagram for these instructions. Time advances from left to right across the page in these diagrams, and instructions advance from the top to the bottom of the page, similar to the laundry pipeline in Figure 4.25. A representation of the pipeline stages is placed in each portion along the instruction axis, occupying the proper clock cycles. These stylized datapaths represent the five stages of our pipeline graphically, but a rectangle naming each pipe stage works just as well. Figure 4.44 shows the more traditional version of the multiple-clock-cycle pipeline diagram. Note that Figure 4.43 shows the physical resources used at each stage, while Figure 4.44 uses the name of each stage.

FIGURE 4.42 The portion of the datapath in Figure 4.41 that is used in all five stages of a load instruction.

Single-clock-cycle pipeline diagrams show the state of the entire datapath during a single clock cycle, and usually all five instructions in the pipeline are identified by labels above their respective pipeline stages. We use this type of figure to show the details of what is happening within the pipeline during each clock cycle; typically, the drawings appear in groups to show pipeline operation over a sequence of clock cycles. We use multiple-clock-cycle diagrams to give overviews of pipelining situations.
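Both diagram styles can be generated from the same data. The sketch below is illustrative only: the function names are hypothetical, it assumes no hazards or stalls, and it uses stage names rather than drawings. It prints a multiple-clock-cycle diagram for the five-instruction sequence and then extracts one column of it, taking clock cycle 5 as an example, which is the information a single-clock-cycle diagram conveys.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "add $12, $3, $4",
    "lw $13, 24($1)",
    "add $14, $5, $6",
]


def multi_cycle_diagram(program):
    """One row per instruction; cell k names the stage occupied in cycle k + 1."""
    total_cycles = len(program) + len(STAGES) - 1
    rows = []
    for index, instruction in enumerate(program):
        cells = [""] * total_cycles
        for offset, stage in enumerate(STAGES):
            cells[index + offset] = stage  # instruction i enters IF in cycle i + 1
        rows.append((instruction, cells))
    return rows


def single_cycle_slice(program, cycle):
    """Map each stage to the instruction occupying it in a 1-based clock cycle."""
    column = cycle - 1
    return {cells[column]: instruction
            for instruction, cells in multi_cycle_diagram(program)
            if cells[column]}


rows = multi_cycle_diagram(PROGRAM)
header = "".join(f"CC{n:<4}" for n in range(1, len(rows[0][1]) + 1))
print(" " * 18 + header)
for instruction, cells in rows:
    print(f"{instruction:<18}" + "".join(f"{cell:<6}" for cell in cells))

print(single_cycle_slice(PROGRAM, 5))
```

Five instructions in a five-stage pipeline occupy nine clock cycles in total, and in clock cycle 5 every stage is busy with a different instruction.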
(Section 4.13 gives more illustrations of single-clock diagrams if you would like to see more details about Figure 4.43.) A single-clock-cycle diagram represents a vertical slice through a set of multiple-clock-cycle diagrams, showing the usage of the datapath by each of the instructions in the pipeline at the designated clock cycle. For example, Figure 4.45 shows the single-clock-cycle diagram corresponding to clock cycle 5 of Figures 4.43 and 4.44. Obviously, the single-clock-cycle diagrams have more detail and take significantly more space to show the same number of clock cycles. The exercises ask you to create such diagrams for other code sequences.

Check Yourself
A group of students were debating the efficiency of the five-stage pipeline when one student pointed out that not all instructions are active in every stage of the pipeline. After deciding to ignore the effects of hazards, they made the following four statements. Which ones are correct?

FIGURE 4.43 Multiple-clock-cycle pipeline diagram of five instructions. (Diagram: lw $10, 20($1); sub $11, $2, $3; add $12, $3, $4; lw $13, 24($1); add $14, $5, $6 in program execution order, against clock cycles CC 1 through CC 9, each instruction drawn as IM, Reg, ALU, DM, Reg.) This style of pipeline representation shows the complete execution of instructions in a single figure. Instructions are listed in instruction execution order from top to bottom, and clock cycles move from left to right. Unlike Figure 4.28, here we show the pipeline registers between each stage. Figure 4.44 shows the traditional way to draw this diagram.
FIGURE 4.44 Traditional multiple-clock-cycle pipeline diagram of five instructions in Figure 4.43. (Diagram: the same five instructions against clock cycles CC 1 through CC 9, each row labeled with the stage names Instruction fetch, Instruction decode, Execution, Data access, and Write-back.)

FIGURE 4.45 The single-clock-cycle diagram corresponding to clock cycle 5 of the pipeline in Figures 4.43 and 4.44. (Stage labels in the diagram: Instruction fetch, add $14, $5, $6; Instruction decode, lw $13, 24($1); Execution, add $12, $3, $4; Memory, sub $11, $2, $3; Write-back, lw $10, 20($1).) As you can see, a single-clock-cycle figure is a vertical slice through a multiple-clock-cycle diagram.

1. Allowing jumps, branches, and ALU instructions to take fewer stages than the five required by the load instruction will increase pipeline performance under all circumstances.
2. Trying to allow some instructions to take fewer cycles does not help, since the throughput is determined by the clock cycle; the number of pipe stages per instruction affects latency, not throughput.
3. You cannot make ALU instructions take fewer cycles because of the write-back of the result, but branches and jumps can take fewer cycles, so there is some opportunity for improvement.
4.
Instead of trying to make instructions take fewer cycles, we should explore making the pipeline longer, so that instructions take more cycles, but the cycles are shorter. This could improve performance.

Pipelined Control

Just as we added control to the single-cycle datapath in Section 4.3, we now add control to the pipelined datapath. We start with a simple design that views the problem through rose-colored glasses.

The first step is to label the control lines on the existing datapath. Figure 4.46 shows those lines. We borrow as much as we can from the control for the simple datapath in Figure 4.17. In particular, we use the same ALU control logic, branch logic, destination-register-number multiplexor, and control lines. These functions are defined in Figures 4.12, 4.16, and 4.18. We reproduce the key information in Figures 4.47 through 4.49 on a single page to make the following discussion easier to follow.

As was the case for the single-cycle implementation, we assume that the PC is written on each clock cycle, so there is no separate write signal for the PC. By the same argument, there are no separate write signals for the pipeline registers (IF/ID, ID/EX, EX/MEM, and MEM/WB), since the pipeline registers are also written during each clock cycle.

To specify control for the pipeline, we need only set the control values during each pipeline stage. Because each control line is associated with a component active in only a single pipeline stage, we can divide the control lines into five groups according to the pipeline stage.

1. Instruction fetch: The control signals to read instruction memory and to write the PC are always asserted, so there is nothing special to control in this pipeline stage.

2. Instruction decode/register file read: As in the previous stage, the same thing happens at every clock cycle, so there are no optional control lines to set.

3.
Execution/address calculation: The signals to be set are RegDst, ALUOp, and ALUSrc (see Figures 4.47 and 4.48). The signals select the Result register, the ALU operation, and either Read data 2 or a sign-extended immediate for the ALU.

"In the 6600 Computer, perhaps even more than in any previous computer, the control system is the difference." James Thornton, Design of a Computer: The Control Data 6600, 1970

[Figure 4.46 diagram: the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, and with the control signals PCSrc, RegWrite, RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, and MemtoReg labeled.]

FIGURE 4.46 The pipelined datapath of Figure 4.41 with the control signals identified. This datapath borrows the control logic for PC source, register destination number, and ALU control from Section 4.4. Note that we now need the 6-bit funct field (function code) of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register. Recall that these 6 bits are also the 6 least significant bits of the immediate field in the instruction, so the ID/EX pipeline register can supply them from the immediate field since sign extension leaves these bits unchanged.

Instruction opcode | ALUOp | Instruction operation | Function code | Desired ALU action | ALU control input
LW | 00 | load word | XXXXXX | add | 0010
SW | 00 | store word | XXXXXX | add | 0010
Branch equal | 01 | branch equal | XXXXXX | subtract | 0110
R-type | 10 | add | 100000 | add | 0010
R-type | 10 | subtract | 100010 | subtract | 0110
R-type | 10 | AND | 100100 | AND | 0000
R-type | 10 | OR | 100101 | OR | 0001
R-type | 10 | set on less than | 101010 | set on less than | 0111

FIGURE 4.47 A copy of Figure 4.12.
This figure shows how the ALU control bits are set depending on the ALUOp control bits and the different function codes for the R-type instruction.

4. Memory access: The control lines set in this stage are Branch, MemRead, and MemWrite. The branch equal, load, and store instructions set these signals, respectively. Recall that PCSrc in Figure 4.48 selects the next sequential address unless control asserts Branch and the ALU result was 0.

5. Write-back: The two control lines are MemtoReg, which decides between sending the ALU result or the memory value to the register file, and RegWrite, which writes the chosen value.

Since pipelining the datapath leaves the meaning of the control lines unchanged, we can use the same control values. Figure 4.49 has the same values as in Section 4.4, but now the nine control lines are grouped by pipeline stage.

Signal name | Effect when deasserted (0) | Effect when asserted (1)
RegDst | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite | None. | The register on the Write register input is written with the value on the Write data input.
ALUSrc | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, lower 16 bits of the instruction.
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.
MemRead | None. | Data memory contents designated by the address input are put on the Read data output.
MemWrite | None. | Data memory contents designated by the address input are replaced by the value on the Write data input.
MemtoReg | The value fed to the register Write data input comes from the ALU. | The value fed to the register Write data input comes from the data memory.

FIGURE 4.48 A copy of Figure 4.16.
The function of each of seven control signals is defined. The ALU control lines (ALUOp) are defined in the second column of Figure 4.47. When a 1-bit control to a 2-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Note that PCSrc is controlled by an AND gate in Figure 4.46. If the Branch signal and the ALU Zero signal are both set, then PCSrc is 1; otherwise, it is 0. Control sets the Branch signal only during a beq instruction; otherwise, PCSrc is set to 0.

Instruction | EX: RegDst | EX: ALUOp1 | EX: ALUOp0 | EX: ALUSrc | MEM: Branch | MEM: MemRead | MEM: MemWrite | WB: RegWrite | WB: MemtoReg
R-format | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0
lw | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1
sw | X | 0 | 0 | 1 | 0 | 0 | 1 | 0 | X
beq | X | 0 | 1 | 0 | 1 | 0 | 0 | 0 | X

FIGURE 4.49 The values of the control lines are the same as in Figure 4.18, but they have been shuffled into three groups corresponding to the last three pipeline stages (EX: execution/address calculation; MEM: memory access; WB: write-back).

Implementing control means setting the nine control lines to these values in each stage for each instruction. The simplest way to do this is to extend the pipeline registers to include control information.

Since the control lines start with the EX stage, we can create the control information during instruction decode. Figure 4.50 shows that these control signals are then used in the appropriate pipeline stage as the instruction moves down the pipeline, just as the destination register number for loads moves down the pipeline in Figure 4.41. Figure 4.51 shows the full datapath with the extended pipeline registers and with the control lines connected to the proper stage. (Section 4.13 gives more examples of MIPS code executing on pipelined hardware using single-clock diagrams, if you would like to see more details.)
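The idea of creating the control values in the ID stage and carrying them along in the pipeline registers can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the dictionary layout and the function names `decode` and `advance` are assumptions made for this sketch, while the signal names and values are taken from Figure 4.49 ("X" marks a don't-care).

```python
# Control values from Figure 4.49, grouped by the stage that consumes them.
# The dictionary layout and helper names are illustrative assumptions.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw": {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
           "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
           "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw": {"EX": {"RegDst": "X", "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
           "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
           "WB": {"RegWrite": 0, "MemtoReg": "X"}},
    "beq": {"EX": {"RegDst": "X", "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
            "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
            "WB": {"RegWrite": 0, "MemtoReg": "X"}},
}


def decode(instr_class):
    """ID stage: build the EX, M, and WB control fields written into ID/EX."""
    fields = CONTROL[instr_class]
    return {group: dict(values) for group, values in fields.items()}


def advance(id_ex):
    """Use each control group in its own stage and pass the rest along."""
    used_in_ex = id_ex["EX"]                            # consumed in EX
    ex_mem = {"M": id_ex["M"], "WB": id_ex["WB"]}       # five lines move on
    used_in_mem = ex_mem["M"]                           # consumed in MEM
    mem_wb = {"WB": ex_mem["WB"]}                       # two lines move on
    used_in_wb = mem_wb["WB"]                           # consumed in WB
    return used_in_ex, used_in_mem, used_in_wb


ex, mem, wb = advance(decode("lw"))
print(len(ex), len(mem), len(wb))  # 4 3 2
```

The 4/3/2 split printed at the end matches the grouping of the nine control lines into the EX, MEM, and WB stages.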
4.7 Data Hazards: Forwarding versus Stalling

"What do you mean, why's it got to be built? It's a bypass. You've got to build bypasses." Douglas Adams, The Hitchhiker's Guide to the Galaxy, 1979

The examples in the previous section show the power of pipelined execution and how the hardware performs the task. It's now time to take off the rose-colored glasses and look at what happens with real programs. The instructions in Figures 4.43 through 4.45 were independent; none of them used the results calculated by any of the others. Yet in Section 4.5, we saw that data hazards are obstacles to pipelined execution.

[Figure 4.50 diagram: Control generates the EX, M, and WB control fields from the instruction in IF/ID; ID/EX holds WB, M, and EX; EX/MEM holds WB and M; MEM/WB holds WB.]

FIGURE 4.50 The control lines for the final three stages. Note that four of the nine control lines are used in the EX phase, with the remaining five control lines passed on to the EX/MEM pipeline register extended to hold the control lines; three are used during the MEM stage, and the last two are passed to MEM/WB for use in the WB stage.

Let's look at a sequence with many dependences, shown in color:

    sub $2, $1,$3     # Register $2 written by sub
    and $12,$2,$5     # 1st operand($2) depends on sub
    or  $13,$6,$2     # 2nd operand($2) depends on sub
    add $14,$2,$2     # 1st($2) & 2nd($2) depend on sub
    sw  $15,100($2)   # Base ($2) depends on sub

The last four instructions are all dependent on the result in register $2 of the first instruction. If register $2 had the value 10 before the subtract instruction and −20 afterwards, the programmer intends that −20 will be used in the following instructions that refer to register $2.
[Figure 4.51 diagram: the pipelined datapath with Control in the ID stage feeding the EX, M, and WB control fields of the ID/EX pipeline register; the signals RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, PCSrc, RegWrite, and MemtoReg are connected in the stages that use them.]

FIGURE 4.51 The pipelined datapath of Figure 4.46, with the control signals connected to the control portions of the pipeline registers. The control values for the last three stages are created during the instruction decode stage and then placed in the ID/EX pipeline register. The control lines for each pipe stage are used, and remaining control lines are then passed to the next pipeline stage.

How would this sequence perform with our pipeline? Figure 4.52 illustrates the execution of these instructions using a multiple-clock-cycle pipeline representation. To demonstrate the execution of this instruction sequence in our current pipeline, the top of Figure 4.52 shows the value of register $2, which changes during the middle of clock cycle 5, when the sub instruction writes its result.

The last potential hazard can be resolved by the design of the register file hardware: What happens when a register is read and written in the same clock cycle? We assume that the write is in the first half of the clock cycle and the read is in the second half, so the read delivers what is written. As is the case for many implementations of register files, we have no data hazard in this case.

Figure 4.52 shows that the values read for register $2 would not be the result of the sub instruction unless the read occurred during clock cycle 5 or later.
Thus, the instructions that would get the correct value of −20 are add and sw; the AND and OR instructions would get the incorrect value 10! Using this style of drawing, such problems become apparent when a dependence line goes backward in time.

[Figure 4.52 diagram: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) against clock cycles CC 1 through CC 9. Value of register $2 by cycle: 10, 10, 10, 10, 10/−20, −20, −20, −20, −20.]

FIGURE 4.52 Pipelined dependences in a five-instruction sequence using simplified datapaths to show the dependences. All the dependent actions are shown in color, and "CC 1" at the top of the figure means clock cycle 1. The first instruction writes into $2, and all the following instructions read $2. This register is written in clock cycle 5, so the proper value is unavailable before clock cycle 5. (A read of a register during a clock cycle returns the value written at the end of the first half of the cycle, when such a write occurs.) The colored lines from the top datapath to the lower ones show the dependences. Those that must go backward in time are pipeline data hazards.

As mentioned in Section 4.5, the desired result is available at the end of the EX stage or clock cycle 3. When is the data actually needed by the AND and OR instructions? At the beginning of the EX stage, or clock cycles 4 and 5, respectively. Thus, we can execute this segment without stalls if we simply forward the data as soon as it is available to any units that need it before it is available to read from the register file.

How does forwarding work? For simplicity in the rest of this section, we consider only the challenge of forwarding to an operation in the EX stage, which may be either an ALU operation or an effective address calculation.
This means that when an instruction tries to use a register in its EX stage that an earlier instruction intends to write in its WB stage, we actually need the values as inputs to the ALU.

A notation that names the fields of the pipeline registers allows for a more precise notation of dependences. For example, "ID/EX.RegisterRs" refers to the number of one register whose value is found in the pipeline register ID/EX; that is, the one from the first read port of the register file. The first part of the name, to the left of the period, is the name of the pipeline register; the second part is the name of the field in that register. Using this notation, the two pairs of hazard conditions are

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

The first hazard in the sequence on page 304 is on register $2, between the result of sub $2,$1,$3 and the first read operand of and $12,$2,$5. This hazard can be detected when the and instruction is in the EX stage and the prior instruction is in the MEM stage, so this is hazard 1a:

    EX/MEM.RegisterRd = ID/EX.RegisterRs = $2

EXAMPLE: Dependence Detection

Classify the dependences in this sequence from page 304:

    sub $2, $1, $3     # Register $2 set by sub
    and $12, $2, $5    # 1st operand($2) set by sub
    or  $13, $6, $2    # 2nd operand($2) set by sub
    add $14, $2, $2    # 1st($2) & 2nd($2) set by sub
    sw  $15, 100($2)   # Index($2) set by sub

ANSWER: As mentioned above, the sub-and is a type 1a hazard. The remaining hazards are as follows:

■ The sub-or is a type 2b hazard: MEM/WB.RegisterRd = ID/EX.RegisterRt = $2
■ The two dependences on sub-add are not hazards because the register file supplies the proper data during the ID stage of add.
■ There is no data hazard between sub and sw because sw reads $2 the clock cycle after sub writes $2.
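The classification in this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than a description of the hardware: the tuple layout and the function name `classify` are hypothetical, and the sketch applies only the four register-number comparisons 1a through 2b, using the distance between two instructions to decide whether the producer sits in EX/MEM (one instruction ahead) or MEM/WB (two ahead) while the consumer is in EX.

```python
# Each entry: (text, destination register or None, rs, rt).
# sw writes no register, so its destination is None; its base register is rs.
SEQUENCE = [
    ("sub $2,$1,$3", 2, 1, 3),
    ("and $12,$2,$5", 12, 2, 5),
    ("or $13,$6,$2", 13, 6, 2),
    ("add $14,$2,$2", 14, 2, 2),
    ("sw $15,100($2)", None, 2, 15),
]


def classify(producer_idx, consumer_idx, seq=SEQUENCE):
    """Return the hazard labels (1a, 1b, 2a, 2b) for one producer/consumer pair."""
    _, rd, _, _ = seq[producer_idx]
    _, _, rs, rt = seq[consumer_idx]
    distance = consumer_idx - producer_idx
    # Three or more instructions apart: the register file supplies the value.
    if rd is None or distance not in (1, 2):
        return []
    prefix = "1" if distance == 1 else "2"   # 1: EX/MEM, 2: MEM/WB
    labels = []
    if rd == rs:
        labels.append(prefix + "a")
    if rd == rt:
        labels.append(prefix + "b")
    return labels


for consumer in range(1, len(SEQUENCE)):
    print(SEQUENCE[consumer][0], classify(0, consumer))
```

For this sequence the sketch reports 1a for sub-and, 2b for sub-or, and no hazard for sub-add and sub-sw, in agreement with the answer above.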
Because some instructions do not write registers, this policy is inaccurate; sometimes it would forward when it shouldn't. One solution is simply to check to see if the RegWrite signal will be active: examining the WB control field of the pipeline register during the EX and MEM stages determines whether RegWrite is asserted. Recall that MIPS requires that every use of $0 as an operand must yield an operand value of 0. In the event that an instruction in the pipeline has $0 as its destination (for example, sll $0, $1, 2), we want to avoid forwarding its possibly nonzero result value. Not forwarding results destined for $0 frees the assembly programmer and the compiler of any requirement to avoid using $0 as a destination. The conditions above thus work properly as long as we add EX/MEM.RegisterRd ≠ 0 to the first hazard condition and MEM/WB.RegisterRd ≠ 0 to the second.

Now that we can detect hazards, half of the problem is resolved, but we must still forward the proper data.

Figure 4.53 shows the dependences between the pipeline registers and the inputs to the ALU for the same code sequence as in Figure 4.52. The change is that the dependence begins from a pipeline register, rather than waiting for the WB stage to write the register file. Thus, the required data exists in time for later instructions, with the pipeline registers holding the data to be forwarded.

If we can take the inputs to the ALU from any pipeline register rather than just ID/EX, then we can forward the proper data. By adding multiplexors to the input of the ALU, and with the proper controls, we can run the pipeline at full speed in the presence of these data dependences.

For now, we will assume the only instructions we need to forward are the four R-format instructions: add, sub, AND, and OR. Figure 4.54 shows a close-up of the ALU and pipeline register before and after adding forwarding.
Figure 4.55 shows the values of the control lines for the ALU multiplexors that select either the register file values or one of the forwarded values.

This forwarding control will be in the EX stage, because the ALU forwarding multiplexors are found in that stage. Thus, we must pass the operand register numbers from the ID stage via the ID/EX pipeline register to determine whether to forward values. We already have the rt field (bits 20–16). Before forwarding, the ID/EX register had no need to include space to hold the rs field. Hence, rs (bits 25–21) is added to ID/EX.

Let's now write both the conditions for detecting hazards and the control signals to resolve them:

1. EX hazard:

    if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

    if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

[Figure 4.53 diagram: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) against clock cycles CC 1 through CC 9. Value of register $2 by cycle: 10, 10, 10, 10, 10/−20, −20, −20, −20, −20. Value of EX/MEM by cycle: X, X, X, −20, X, X, X, X, X. Value of MEM/WB by cycle: X, X, X, X, −20, X, X, X, X.]

FIGURE 4.53 The dependences between the pipeline registers move forward in time, so it is possible to supply the inputs to the ALU needed by the AND instruction and OR instruction by forwarding the results found in the pipeline registers. The values in the pipeline registers show that the desired value is available before it is written into the register file. We assume that the register file forwards values that are read and written during the same clock cycle, so the add does not stall, but the values come from the register file instead of a pipeline register.
Register file "forwarding" (that is, the read gets the value of the write in that clock cycle) is why clock cycle 5 shows register $2 having the value 10 at the beginning and −20 at the end of the clock cycle. As in the rest of this section, we handle all forwarding except for the value to be stored by a store instruction.

[Figure 4.54 diagram: (a) No forwarding: Registers, ID/EX, ALU, EX/MEM, Data memory, and MEM/WB. (b) With forwarding: multiplexors on both ALU inputs controlled by ForwardA and ForwardB from a forwarding unit whose inputs are Rs, Rt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd.]

FIGURE 4.54 On the top are the ALU and pipeline registers before adding forwarding. On the bottom, the multiplexors have been expanded to add the forwarding paths, and we show the forwarding unit. The new hardware is shown in color. This figure is a stylized drawing, however, leaving out details from the full datapath such as the sign extension hardware. Note that the ID/EX.RegisterRt field is shown twice, once to connect to the Mux and once to the forwarding unit, but it is a single signal. As in the earlier discussion, this ignores forwarding of a store value to a store instruction. Also note that this mechanism works for slt instructions as well.

Note that the EX/MEM.RegisterRd field is the register destination for either an ALU instruction (which comes from the Rd field of the instruction) or a load (which comes from the Rt field).

This case forwards the result from the previous instruction to either input of the ALU. If the previous instruction is going to write to the register file, and the write register number matches the read register number of ALU inputs A or B, provided it is not register 0, then steer the multiplexor to pick the value instead from the pipeline register EX/MEM.

2.
MEM hazard:

    if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

    if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

As mentioned above, there is no hazard in the WB stage, because we assume that the register file supplies the correct result if the instruction in the ID stage reads the same register written by the instruction in the WB stage. Such a register file performs another form of forwarding, but it occurs within the register file.

One complication is potential data hazards between the result of the instruction in the WB stage, the result of the instruction in the MEM stage, and the source operand of the instruction in the ALU stage. For example, when summing a vector of numbers in a single register, a sequence of instructions will all read and write to the same register:

    add $1,$1,$2
    add $1,$1,$3
    add $1,$1,$4
    . . .

Mux control | Source | Explanation
ForwardA = 00 | ID/EX | The first ALU operand comes from the register file.
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00 | ID/EX | The second ALU operand comes from the register file.
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result.

FIGURE 4.55 The control values for the forwarding multiplexors in Figure 4.54. The signed immediate that is another input to the ALU is described in the Elaboration at the end of this section.

In this case, the result is forwarded from the MEM stage because the result in the MEM stage is the more recent result.
Thus, the control for the MEM hazard would be as follows (the addition is the not(...) clause, which suppresses MEM/WB forwarding whenever an EX hazard exists for the same operand):

    if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

    if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

Figure 4.56 shows the hardware necessary to support forwarding for operations that use results during the EX stage. Note that the EX/MEM.RegisterRd field is the register destination for either an ALU instruction (which comes from the Rd field of the instruction) or a load (which comes from the Rt field).

[Figure 4.56 diagram: PC, Instruction memory, IF/ID, Control, Registers, ID/EX, forwarding multiplexors, ALU, EX/MEM, Data memory, MEM/WB, and the forwarding unit. IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are carried into ID/EX; the forwarding unit takes Rs, Rt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd.]

FIGURE 4.56 The datapath modified to resolve hazards via forwarding. Compared with the datapath in Figure 4.51, the additions are the multiplexors to the inputs to the ALU. This figure is a more stylized drawing, however, leaving out details from the full datapath, such as the branch hardware and the sign extension hardware.

Section 4.13 shows two pieces of MIPS code with hazards that cause forwarding, if you would like to see more illustrated examples using single-cycle pipeline drawings.

Elaboration: Forwarding can also help with hazards when store instructions are dependent on other instructions. Since they use just one data value during the MEM stage, forwarding is easy. However, consider loads immediately followed by stores, useful when performing memory-to-memory copies in the MIPS architecture. Since copies are frequent, we need to add more forwarding hardware to make them run faster.
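The EX hazard and MEM hazard conditions for ForwardA and ForwardB given earlier in this section can be collected into one small Python function. This is a minimal sketch, assuming the pipeline-register fields are passed in as plain integers and booleans; the function and parameter names are hypothetical. It encodes the priority rule that a match in EX/MEM (the more recent result) wins over a match in MEM/WB, and that results destined for register 0 are never forwarded.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB) as 2-bit strings, encoded as in Figure 4.55."""

    def select(source_reg):
        ex_hazard = (ex_mem_reg_write and ex_mem_rd != 0
                     and ex_mem_rd == source_reg)
        mem_hazard = (mem_wb_reg_write and mem_wb_rd != 0
                      and mem_wb_rd == source_reg)
        if ex_hazard:
            return "10"   # most recent result: forward from EX/MEM
        if mem_hazard:
            return "01"   # older result: forward from MEM/WB
        return "00"       # no hazard: use the value read from the register file

    return select(id_ex_rs), select(id_ex_rt)


# Third add of the vector sum (add $1,$1,$4): both earlier adds write $1,
# so the first operand must come from EX/MEM, not MEM/WB.
print(forwarding_unit(1, 4, True, 1, True, 1))    # ('10', '00')

# or $13,$6,$2 in EX while sub $2,... is in WB and and $12,... is in MEM.
print(forwarding_unit(6, 2, True, 12, True, 2))   # ('00', '01')
```

Checking the EX hazard first and falling through to the MEM hazard is one way to express the not(...) clause: MEM/WB is selected only when EX/MEM does not also match.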
If we were to redraw Figure 4.53, replacing the sub and AND instructions with lw and sw, we would see that it is possible to avoid a stall, since the data exists in the MEM/WB register of a load instruction in time for its use in the MEM stage of a store instruction. We would need to add forwarding into the memory access stage for this option. We leave this modification as an exercise to the reader.

In addition, the signed-immediate input to the ALU, needed by loads and stores, is missing from the datapath in Figure 4.56. Since central control decides between register and immediate, and since the forwarding unit chooses the pipeline register for a register input to the ALU, the easiest solution is to add a 2:1 multiplexor that chooses between the ForwardB multiplexor output and the signed immediate. Figure 4.57 shows this addition.

[Figure 4.57 diagram: Registers, ID/EX, forwarding multiplexors, an added 2:1 multiplexor controlled by ALUSrc on the second ALU input, ALU, EX/MEM, Data memory, MEM/WB, and the forwarding unit.]

FIGURE 4.57 A close-up of the datapath in Figure 4.54 shows a 2:1 multiplexor, which has been added to select the signed immediate as an ALU input.

Data Hazards and Stalls

As we said in Section 4.5, one case where forwarding cannot save the day is when an instruction tries to read a register following a load instruction that writes the same register. Figure 4.58 illustrates the problem. The data is still being read from memory in clock cycle 4 while the ALU is performing the operation for the following instruction. Something must stall the pipeline for the combination of load followed by an instruction that reads its result.

Hence, in addition to a forwarding unit, we need a hazard detection unit.
It operates during the ID stage so that it can insert the stall between the load and its use. Checking for load instructions, the control for the hazard detection unit is this single condition:

    if (ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
    (ID/EX.RegisterRt = IF/ID.RegisterRt)))
    stall the pipeline

[Figure 4.58 diagram: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 against clock cycles CC 1 through CC 9.]

FIGURE 4.58 A pipelined sequence of instructions. Since the dependence between the load and the following instruction (and) goes backward in time, this hazard cannot be solved by forwarding. Hence, this combination must result in a stall by the hazard detection unit.

"If at first you don't succeed, redefine success." Anonymous

The first line tests to see if the instruction is a load: the only instruction that reads data memory is a load. The next two lines check to see if the destination register field of the load in the EX stage matches either source register of the instruction in the ID stage. If the condition holds, the instruction stalls one clock cycle. After this 1-cycle stall, the forwarding logic can handle the dependence and execution proceeds. (If there were no forwarding, then the instructions in Figure 4.58 would need another stall cycle.)

If the instruction in the ID stage is stalled, then the instruction in the IF stage must also be stalled; otherwise, we would lose the fetched instruction. Preventing these two instructions from making progress is accomplished simply by preventing the PC register and the IF/ID pipeline register from changing.
Provided these registers are preserved, the instruction in the IF stage will continue to be read using the same PC, and the registers in the ID stage will continue to be read using the same instruction fields in the IF/ID pipeline register. Returning to our favorite analogy, it's as if you restart the washer with the same clothes and let the dryer continue tumbling empty. Of course, like the dryer, the back half of the pipeline starting with the EX stage must be doing something; what it is doing is executing instructions that have no effect: nops.

How can we insert these nops, which act like bubbles, into the pipeline? In Figure 4.49, we see that deasserting all nine control signals (setting them to 0) in the EX, MEM, and WB stages will create a "do nothing" or nop instruction. By identifying the hazard in the ID stage, we can insert a bubble into the pipeline by changing the EX, MEM, and WB control fields of the ID/EX pipeline register to 0. These benign control values are percolated forward at each clock cycle with the proper effect: no registers or memories are written if the control values are all 0.

Figure 4.59 shows what really happens in the hardware: the pipeline execution slot associated with the AND instruction is turned into a nop and all instructions beginning with the AND instruction are delayed one cycle. Like an air bubble in a water pipe, a stall bubble delays everything behind it and proceeds down the instruction pipe one stage each cycle until it exits at the end. In this example, the hazard forces the AND and OR instructions to repeat in clock cycle 4 what they did in clock cycle 3: AND reads registers and decodes, and OR is refetched from instruction memory. Such repeated work is what a stall looks like, but its effect is to stretch the time of the AND and OR instructions and delay the fetch of the add instruction.

Figure 4.60 highlights the pipeline connections for both the hazard detection unit and the forwarding unit.
As before, the forwarding unit controls the ALU multiplexors to replace the value from a general-purpose register with the value from the proper pipeline register. The hazard detection unit controls the writing of the PC and IF/ID registers plus the multiplexor that chooses between the real control values and all 0s. The hazard detection unit stalls and deasserts the control fields if the load-use hazard test above is true. Section 4.13 gives an example of MIPS code with hazards that causes stalling, illustrated using single-clock pipeline diagrams, if you would like to see more details.

nop: An instruction that does no operation to change state.

The BIG Picture: Although the compiler generally relies upon the hardware to resolve hazards and thereby ensure correct execution, the compiler must understand the pipeline to achieve the best performance. Otherwise, unexpected stalls will reduce the performance of the compiled code.

[Figure 4.59 diagram: lw $2, 20($1); and becomes nop (bubble); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2 against clock cycles CC 1 through CC 10.]

FIGURE 4.59 The way stalls are really inserted into the pipeline. A bubble is inserted beginning in clock cycle 4, by changing the and instruction to a nop. Note that the and instruction is really fetched and decoded in clock cycles 2 and 3, but its EX stage is delayed until clock cycle 5 (versus the unstalled position in clock cycle 4). Likewise the OR instruction is fetched in clock cycle 3, but its ID stage is delayed until clock cycle 5 (versus the unstalled clock cycle 4 position). After insertion of the bubble, all the dependences go forward in time and no further hazards occur.
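The load-use test performed by the hazard detection unit, and its effect on execution time, can be sketched in Python as follows. This is an illustrative model under simplifying assumptions: the names `load_use_stall`, `PROGRAM`, and `total_cycles` are hypothetical, the pipeline is the five-stage design with forwarding, there are no branches, and each load followed immediately by a dependent instruction costs exactly one bubble.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Hazard detection test evaluated while the consumer is in the ID stage."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


# (text, is_load, rs, rt) for the sequence in Figure 4.58.
# For lw, the rt field names the register being loaded.
PROGRAM = [
    ("lw $2,20($1)", True, 1, 2),
    ("and $4,$2,$5", False, 2, 5),
    ("or $8,$2,$6", False, 2, 6),
    ("add $9,$4,$2", False, 4, 2),
    ("slt $1,$6,$7", False, 6, 7),
]


def total_cycles(program, stages=5):
    """Clock cycles to run the program: pipeline fill plus one per load-use stall."""
    stalls = 0
    # Only the instruction directly behind a load can see it in ID/EX.
    for previous, current in zip(program, program[1:]):
        if load_use_stall(previous[1], previous[3], current[2], current[3]):
            stalls += 1
    return len(program) + (stages - 1) + stalls


print(total_cycles(PROGRAM))  # 10
```

Only the lw/and pair triggers the test, so the five instructions take 10 clock cycles instead of 9; the or instruction also reads $2, but by the time it reaches EX the loaded value can be forwarded from MEM/WB.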
Elaboration: Regarding the remark earlier about setting control lines to 0 to avoid writing registers or memory: only the signals RegWrite and MemWrite need be 0, while the other control signals can be don’t cares.

[Figure 4.60: pipelined datapath diagram. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt as inputs and drives PCWrite, IF/DWrite, and the multiplexor that selects between the control values and 0; the forwarding unit drives the ALU input multiplexors.]

FIGURE 4.60 Pipelined control overview, showing the two multiplexors for forwarding, the hazard detection unit, and the forwarding unit. Although the ID and EX stages have been simplified (the sign-extended immediate and branch logic are missing), this drawing gives the essence of the forwarding hardware requirements.

4.8 Control Hazards

There are a thousand hacking at the branches of evil to one who is striking at the root.
Henry David Thoreau, Walden, 1854

Thus far, we have limited our concern to hazards involving arithmetic operations and data transfers. However, as we saw in Section 4.5, there are also pipeline hazards involving branches. Figure 4.61 shows a sequence of instructions and indicates when the branch would occur in this pipeline. An instruction must be fetched at every clock cycle to sustain the pipeline, yet in our design the decision about whether to branch doesn’t occur until the MEM pipeline stage. As mentioned in Section 4.5, this delay in determining the proper instruction to fetch is called a control hazard or branch hazard, in contrast to the data hazards we have just examined.

This section on control hazards is shorter than the previous sections on data hazards.
The reasons are that control hazards are relatively simple to understand, they occur less frequently than data hazards, and there is nothing as effective against control hazards as forwarding is against data hazards. Hence, we use simpler schemes. We look at two schemes for resolving control hazards and one optimization to improve these schemes.

[Figure 4.61: multiple-clock-cycle pipeline diagram over clock cycles CC 1 through CC 9 for the sequence 40 beq $1, $3, 28; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7).]

FIGURE 4.61 The impact of the pipeline on the branch instruction. The numbers to the left of the instruction (40, 44, …) are the addresses of the instructions. Since the branch instruction decides whether to branch in the MEM stage (clock cycle 4 for the beq instruction above), the three sequential instructions that follow the branch will be fetched and begin execution. Without intervention, those three following instructions will begin execution before beq branches to lw at location 72. (Figure 4.31 assumed extra hardware to reduce the control hazard to one clock cycle; this figure uses the nonoptimized datapath.)

Assume Branch Not Taken

As we saw in Section 4.5, stalling until the branch is complete is too slow. One improvement over branch stalling is to predict that the branch will not be taken and thus continue execution down the sequential instruction stream. If the branch is taken, the instructions that are being fetched and decoded must be discarded. Execution continues at the branch target. If branches are untaken half the time, and if it costs little to discard the instructions, this optimization halves the cost of control hazards.

To discard instructions, we merely change the original control values to 0s, much as we did to stall for a load-use data hazard.
The difference is that we must also change the three instructions in the IF, ID, and EX stages when the branch reaches the MEM stage; for load-use stalls, we just change control to 0 in the ID stage and let them percolate through the pipeline. Discarding instructions, then, means we must be able to flush instructions in the IF, ID, and EX stages of the pipeline.

Reducing the Delay of Branches

One way to improve branch performance is to reduce the cost of the taken branch. Thus far, we have assumed the next PC for a branch is selected in the MEM stage, but if we move the branch execution earlier in the pipeline, then fewer instructions need be flushed. The MIPS architecture was designed to support fast single-cycle branches that could be pipelined with a small branch penalty. The designers observed that many branches rely only on simple tests (equality or sign, for example) and that such tests do not require a full ALU operation but can be done with at most a few gates. When a more complex branch decision is required, a separate instruction that uses an ALU to perform a comparison is required, a situation that is similar to the use of condition codes for branches (see Chapter 2).

Moving the branch decision up requires two actions to occur earlier: computing the branch target address and evaluating the branch decision. The easy part of this change is to move up the branch address calculation. We already have the PC value and the immediate field in the IF/ID pipeline register, so we just move the branch adder from the EX stage to the ID stage; of course, the branch target address calculation will be performed for all instructions, but only used when needed.

The harder part is the branch decision itself. For branch equal, we would compare the two registers read during the ID stage to see if they are equal. Equality can be tested by first exclusive ORing their respective bits and then ORing all the results; an OR output of 0 means the two registers are equal.
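The XOR-then-OR equality test can be mimicked bit by bit in software. The following is a minimal sketch for illustration only; the function name `registers_equal` and the explicit loop are assumptions made here to mirror the gate structure, not something the text prescribes.

```python
def registers_equal(a: int, b: int, width: int = 32) -> bool:
    """Model of the ID-stage equality test for two register values."""
    mask = (1 << width) - 1
    diff = (a ^ b) & mask            # bit i is 1 exactly where a and b differ
    any_difference = 0
    for i in range(width):           # OR all the XOR outputs together
        any_difference |= (diff >> i) & 1
    return any_difference == 0       # OR output of 0 means "equal"


print(registers_equal(0x1234, 0x1234))
print(registers_equal(0x1234, 0x1235))
```

No carry chain is involved, which is why this test is much faster than a full ALU subtraction and can fit in the ID stage.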
flush To discard instructions in a pipeline, usually due to an unexpected event.

Moving the branch test to the ID stage implies additional forwarding and hazard detection hardware, since a branch dependent on a result still in the pipeline must still work properly with this optimization. For example, to implement branch on equal (and its inverse), we will need to forward results to the equality test logic that operates during ID. There are two complicating factors:

1. During ID, we must decode the instruction, decide whether a bypass to the equality unit is needed, and complete the equality comparison so that if the instruction is a branch, we can set the PC to the branch target address. Forwarding for the operands of branches was formerly handled by the ALU forwarding logic, but the introduction of the equality test unit in ID will require new forwarding logic. Note that the bypassed source operands of a branch can come from either the EX/MEM or MEM/WB pipeline latches.

2. Because the values in a branch comparison are needed during ID but may be produced later in time, it is possible that a data hazard can occur and a stall will be needed. For example, if an ALU instruction immediately preceding a branch produces one of the operands for the comparison in the branch, a stall will be required, since the EX stage for the ALU instruction will occur after the ID cycle of the branch. By extension, if a load is immediately followed by a conditional branch that depends on the load result, two stall cycles will be needed, as the result from the load appears at the end of the MEM cycle but is needed at the beginning of ID for the branch.

Despite these difficulties, moving the branch execution to the ID stage is an improvement, because it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched.
The exercises explore the details of implementing the forwarding path and detecting the hazard.

To flush instructions in the IF stage, we add a control line, called IF.Flush, that zeros the instruction field of the IF/ID pipeline register. Clearing the register transforms the fetched instruction into a nop, an instruction that has no action and changes no state.

EXAMPLE

Pipelined Branch

Show what happens when the branch is taken in this instruction sequence, assuming the pipeline is optimized for branches that are not taken and that we moved the branch execution to the ID stage:

36 sub $10, $4, $8
40 beq $1, $3, 7 # PC-relative branch to 40 + 4 + 7 * 4 = 72
44 and $12, $2, $5
48 or $13, $2, $6
52 add $14, $4, $2
56 slt $15, $6, $7
. . .
72 lw $4, 50($7)

ANSWER

Figure 4.62 shows what happens when a branch is taken. Unlike Figure 4.61, there is only one pipeline bubble on a taken branch.

[Figure 4.62: datapath diagrams for clock 3 (IF: and $12, $2, $5; ID: beq $1, $3, 7; EX: sub $10, $4, $8) and clock 4 (IF: lw $4, 50($7); ID: bubble (nop); EX: beq $1, $3, 7; MEM: sub $10, . . .).]

FIGURE 4.62 The ID stage of clock cycle 3 determines that a branch must be taken, so it selects 72 as the next PC
address and zeros the instruction fetched for the next clock cycle. Clock cycle 4 shows the instruction at location 72 being
fetched and the single bubble or nop instruction in the pipeline as a result of the taken branch. (Since the nop is really sll $0, $0, 0, it’s
arguable whether or not the ID stage in clock 4 should be highlighted.)
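The target address arithmetic in the example (40 + 4 + 7 * 4 = 72) can be checked with a short sketch of the ID-stage branch adder. The helper names `sign_extend16` and `branch_target` are hypothetical; the computation itself follows the comment in the instruction listing: sign-extend the 16-bit immediate, shift it left by 2, and add it to PC + 4.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """PC-relative target: (PC + 4) + (sign-extended immediate << 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


# beq $1, $3, 7 at address 40
print(branch_target(40, 7))
```

A negative immediate works the same way; for instance, an immediate of -8 in a branch at address 72 would give 72 + 4 - 32 = 44.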


Dynamic Branch Prediction
Assuming a branch is not taken is one simple form of branch prediction. In that case,
we predict that branches are untaken, flushing the pipeline when we are wrong. For
the simple five-stage pipeline, such an approach, possibly coupled with compiler-based
prediction, is probably adequate. With deeper pipelines, the branch penalty
increases when measured in clock cycles. Similarly, with multiple issue (see Section
4.10), the branch penalty increases in terms of instructions lost. This combination
means that in an aggressive pipeline, a simple static prediction scheme will probably
waste too much performance. As we mentioned in Section 4.5, with more hardware
it is possible to try to predict branch behavior during program execution.

One approach is to look up the address of the instruction to see if a branch was
taken the last time this instruction was executed, and, if so, to begin fetching new
instructions from the same place as the last time. This technique is called dynamic
branch prediction.

One implementation of that approach is a branch prediction buffer or branch
history table. A branch prediction buffer is a small memory indexed by the lower
portion of the address of the branch instruction. The memory contains a bit that
says whether the branch was recently taken or not.

This is the simplest sort of buffer; we don’t know, in fact, if the prediction is
the right one, since it may have been put there by another branch that has the same
low-order address bits. However, this doesn’t affect correctness. Prediction is just
a hint that we hope is correct, so fetching begins in the predicted direction. If the
hint turns out to be wrong, the incorrectly predicted instructions are deleted, the
prediction bit is inverted and stored back, and the proper sequence is fetched and
executed.

This simple 1-bit prediction scheme has a performance shortcoming: even if a
branch is almost always taken, we can predict incorrectly twice, rather than once,
when it is not taken. The following example shows this dilemma.

EXAMPLE

Loops and Prediction

Consider a loop branch that branches nine times in a row, then is not taken
once. What is the prediction accuracy for this branch, assuming the prediction
bit for this branch remains in the prediction buffer?

ANSWER

The steady-state prediction behavior will mispredict on the first and last loop
iterations. Mispredicting the last iteration is inevitable since the prediction
bit will indicate taken, as the branch has been taken nine times in a row at
that point. The misprediction on the first iteration happens because the bit is
flipped on prior execution of the last iteration of the loop, since the branch
was not taken on that exiting iteration. Thus, the prediction accuracy for this
branch that is taken 90% of the time is only 80% (two incorrect predictions and
eight correct ones).

dynamic branch prediction Prediction of branches at runtime using runtime information.

branch prediction buffer Also called branch history table. A small memory that
is indexed by the lower portion of the address of the branch instruction and that
contains one or more bits indicating whether the branch was recently taken or not.
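The 80% figure in the loop example can be reproduced with a small simulation of a 1-bit predictor. This is a sketch under simple assumptions: the function name `one_bit_predictor` is hypothetical, only a single branch is modeled (so there is no aliasing in the buffer), and one warm-up pass through the loop is run first so that the prediction bit holds the value left by the previous loop exit.

```python
def one_bit_predictor(outcomes, prediction=False):
    """Return (number of correct predictions, final prediction bit)."""
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # 1-bit scheme: remember only the last outcome
    return correct, prediction


loop_pass = [True] * 9 + [False]  # taken nine times, then not taken once

# Warm-up pass: afterward the bit says "not taken" (the loop just exited).
_, bit = one_bit_predictor(loop_pass)

# Steady-state pass through the same loop.
correct, bit = one_bit_predictor(loop_pass, bit)
accuracy = correct / len(loop_pass)
print(f"{correct} of {len(loop_pass)} correct")
```

The two mispredictions fall on the first iteration (the bit still says "not taken") and on the last iteration (the bit says "taken"), matching the reasoning in the answer.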

Ideally, the accuracy of the predictor would match the taken branch frequency for
these highly regular branches. To remedy this weakness, 2-bit prediction schemes
are often used. In a 2-bit scheme, a prediction must be wrong twice before it is
changed. Figure 4.63 shows the finite-state machine for a 2-bit prediction scheme.

A branch prediction buffer can be implemented as a small, special buffer accessed
with the instruction address during the IF pipe stage. If the instruction is predicted
as taken, fetching begins from the target as soon as the PC is known; as mentioned
earlier in this section, it can be as early as the ID stage. Otherwise, sequential fetching and
executing continue. If the prediction turns out to be wrong, the prediction bits are
changed as shown in Figure 4.63.
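To see why 2 bits help, the loop branch from the earlier example (taken nine times, then not taken once) can be run through a 2-bit predictor. The sketch below uses a saturating counter (0 to 3, predicting taken when the counter is 2 or 3), which is one common way to realize a 2-bit scheme; it is an assumption made for illustration and may not match the exact transitions drawn in Figure 4.63. The function name `two_bit_predictor` is hypothetical.

```python
def two_bit_predictor(outcomes, counter=0):
    """Return (number of correct predictions, final counter value)."""
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2      # upper half of the range: predict taken
        if predicted_taken == taken:
            correct += 1
        # Move toward 3 on a taken branch and toward 0 on an untaken one.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct, counter


loop_pass = [True] * 9 + [False]  # taken nine times, then not taken once

_, state = two_bit_predictor(loop_pass)            # warm-up pass
correct, state = two_bit_predictor(loop_pass, state)  # steady-state pass
print(f"{correct} of {len(loop_pass)} correct")
```

In steady state the single not-taken outcome at loop exit only weakens the prediction, so the first iteration of the next pass is still predicted taken and only the exit iteration is mispredicted.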

Elaboration: As we described in Section 4.5, in a five-stage pipeline we can make the
control hazard a feature by redefining the branch. A delayed branch always executes the
following instruction, but the second instruction following the branch will be affected by
the branch.

Compilers and assemblers try to place an instruction that always executes after the
branch in the branch delay slot. The job of the software is to make the successor
instructions valid and useful. Figure 4.64 shows the three ways in which the branch
delay slot can be scheduled.

branch delay slot The slot directly after a delayed branch instruction, which in the
MIPS architecture is filled by an instruction that does not affect the branch.

[Figure 4.63: state diagram with four states, two labeled "Predict taken" and two labeled "Predict not taken", connected by transitions labeled "Taken" and "Not taken".]

FIGURE 4.63 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a branch that
strongly favors taken or not taken (as many branches do) will be mispredicted only once. The 2 bits are used
to encode the four states in the system. The 2-bit scheme is a general instance of a counter-based predictor,
which is incremented when the prediction is accurate and decremented otherwise, and uses the mid-point of
its range as the division between taken and not taken.


The limitations on delayed branch scheduling arise from (1) the restrictions on the
instructions that are scheduled into the delay slots and (2) our ability to predict at
compile time whether a branch is likely to be taken or not.

Delayed branching was a simple and effective solution for a five-stage pipeline
issuing one instruction each clock cycle. As processors go to both longer pipelines
and issuing multiple instructions per clock cycle (see Section 4.10), the branch delay
becomes longer, and a single delay slot is insufficient. Hence, delayed branching has
lost popularity compared to more expensive but more flexible dynamic approaches.
Simultaneously, the growth in available transistors per chip due to Moore’s Law has
made dynamic prediction relatively cheaper.

a. From before

add $s1, $s2, $s3
if $s2 = 0 then
    Delay slot

Becomes

if $s2 = 0 then
    add $s1, $s2, $s3

b. From target

sub $t4, $t5, $t6
. . .
add $s1, $s2, $s3
if $s1 = 0 then
    Delay slot

Becomes

add $s1, $s2, $s3
if $s1 = 0 then
    sub $t4, $t5, $t6

c. From fall-through

add $s1, $s2, $s3
if $s1 = 0 then
    Delay slot
sub $t4, $t5, $t6

Becomes

add $s1, $s2, $s3
if $s1 = 0 then
    sub $t4, $t5, $t6

FIGURE 4.64 Scheduling the branch delay slot. For each strategy, the code before
scheduling is shown first, and the scheduled code follows “Becomes.” In (a), the delay slot is scheduled with an independent
instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not
possible. In the code sequences for (b) and (c), the use of $s1 in the branch condition prevents the add
instruction (whose destination is $s1) from being moved into the branch delay slot. In (b) the branch delay
slot is scheduled from the target of the branch; usually the target instruction will need to be copied because
it can be reached by another path. Strategy (b) is preferred when the branch is taken with high probability,
such as a loop branch. Finally, the branch may be scheduled from the not-taken fall-through as in (c). To
make this optimization legal for (b) or (c), it must be OK to execute the sub instruction when the branch
goes in the unexpected direction. By “OK” we mean that the work is wasted, but the program will still execute
correctly. This is the case, for example, if $t4 were an unused temporary register when the branch goes in
the unexpected direction.


Elaboration: A branch predictor tells us whether or not a branch is taken, but still
requires the calculation of the branch target. In the five-stage pipeline, this calculation
takes one cycle, meaning that taken branches will have a 1-cycle penalty. Delayed
branches are one approach to eliminate that penalty. Another approach is to use a
cache to hold the destination program counter or destination instruction using a branch
target buffer.

The 2-bit dynamic prediction scheme uses only information about a particular branch.
Researchers noticed that using information about both a local branch and the global
behavior of recently executed branches together yields greater prediction accuracy for
the same number of prediction bits. Such predictors are called correlating predictors.
A typical correlating predictor might have two 2-bit predictors for each branch, with the
choice between predictors made based on whether the last executed branch was taken
or not taken. Thus, the global branch behavior can be thought of as adding additional
index bits for the prediction lookup.
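A correlating predictor of the kind just described can be sketched as follows. This is an illustrative model only: the class name `CorrelatingPredictor`, the 4-bit index, the initial counter values, and the use of saturating counters for the 2-bit predictors are all assumptions made for this example. Each table entry holds two 2-bit counters, and the outcome of the most recently executed branch selects which of the two is used.

```python
class CorrelatingPredictor:
    """Two 2-bit counters per entry, selected by the last branch outcome."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # table[index][last_taken] is a saturating counter in the range 0..3
        self.table = [[1, 1] for _ in range(1 << index_bits)]
        self.last_taken = 0  # 1 if the most recently executed branch was taken

    def _index(self, pc):
        return (pc >> 2) & self.mask  # low-order bits of the word address

    def predict(self, pc):
        return self.table[self._index(pc)][self.last_taken] >= 2

    def update(self, pc, taken):
        counters = self.table[self._index(pc)]
        value = counters[self.last_taken]
        counters[self.last_taken] = min(3, value + 1) if taken else max(0, value - 1)
        self.last_taken = int(taken)  # global history for the next lookup


# A branch that strictly alternates between taken and not taken.
predictor = CorrelatingPredictor()
outcomes = [True, False] * 10
correct = 0
for taken in outcomes:
    if predictor.predict(0x40) == taken:
        correct += 1
    predictor.update(0x40, taken)
print(f"{correct} of {len(outcomes)} correct")
```

An alternating branch is a hard case for a single 2-bit counter, but here the last outcome acts as an extra index bit, so each of the two counters sees a consistent direction and the predictor settles after the first misprediction.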

A more recent innovation in branch prediction is the use of tournament predictors. A
tournament predictor uses multiple predictors, tracking, for each branch, which predictor
yields the best results. A typical tournament predictor might contain two predictions for
each branch index: one based on local information and one based on global branch
behavior. A selector would choose which predictor to use for any given prediction. The
selector can operate similarly to a 1- or 2-bit predictor, favoring whichever of the two
predictors has been more accurate. Some recent microprocessors use such elaborate
predictors.

Elaboration: One way to reduce the number of conditional branches is to add
conditional move instructions. Instead of changing the PC with a conditional branch, the
instruction conditionally changes the destination register of the move. If the condition
fails, the move acts as a nop. For example, one version of the MIPS instruction set
architecture has two new instructions called movn (move if not zero) and movz (move
if zero). Thus, movn $8, $11, $4 copies the contents of register 11 into register 8,
provided that the value in register 4 is nonzero; otherwise, it does nothing.

The ARMv7 instruction set has a condition field in most instructions. Hence, ARM
programs could have fewer conditional branches than MIPS programs.
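The semantics of movn and movz described above are simple enough to state as code. The sketch below models the register file as a Python list; the function names mirror the instruction mnemonics, and the operand order (destination, source, condition register) follows the movn $8, $11, $4 example in the text. The register values are made up for illustration.

```python
def movn(regs, rd, rs, rt):
    """Move if not zero: copy regs[rs] into regs[rd] when regs[rt] != 0."""
    if regs[rt] != 0:
        regs[rd] = regs[rs]


def movz(regs, rd, rs, rt):
    """Move if zero: copy regs[rs] into regs[rd] when regs[rt] == 0."""
    if regs[rt] == 0:
        regs[rd] = regs[rs]


regs = [0] * 32
regs[11] = 99
regs[4] = 1
movn(regs, 8, 11, 4)   # register 4 is nonzero, so register 8 receives 99
print(regs[8])

regs[4] = 0
regs[11] = 7
movn(regs, 8, 11, 4)   # register 4 is zero, so movn acts as a nop
print(regs[8])
```

Because the instruction always flows through the pipeline and only the register write is conditional, no control hazard arises.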

Pipeline Summary
We started in the laundry room, showing principles of pipelining in an everyday
setting. Using that analogy as a guide, we explained instruction pipelining
step-by-step, starting with the single-cycle datapath and then adding pipeline
registers, forwarding paths, data hazard detection, branch prediction, and flushing
instructions on exceptions. Figure 4.65 shows the final evolved datapath and control.
We now are ready for yet another control hazard: the sticky issue of exceptions.

branch target buffer A structure that caches the destination PC or destination
instruction for a branch. It is usually organized as a cache with tags, making it more
costly than a simple prediction buffer.

correlating predictor A branch predictor that combines local behavior of a particular
branch and global information about the behavior of some recent number of
executed branches.

tournament branch predictor A branch predictor with multiple predictions for each
branch and a selection mechanism that chooses which predictor to enable for a given branch.

Check Yourself

Consider three branch prediction schemes: predict not taken, predict taken, and
dynamic prediction. Assume that they all have zero penalty when they predict
correctly and two cycles when they are wrong. Assume that the average prediction
accuracy of the dynamic predictor is 90%. Which predictor is the best choice for
the following branches?

1. A branch that is taken with 5% frequency

2. A branch that is taken with 95% frequency

3. A branch that is taken with 70% frequency
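One way to check an answer to the question above is to compute the expected misprediction penalty for each scheme. The sketch below works in penalty cycles per 100 executions of the branch to keep the arithmetic in integers. It assumes that the dynamic predictor achieves its 90% average accuracy on every branch, which the question does not strictly guarantee; the function names are hypothetical.

```python
PENALTY_CYCLES = 2          # cost of one misprediction, from the question
DYNAMIC_ACCURACY_PCT = 90   # average accuracy of the dynamic predictor


def misprediction_cost(taken_pct):
    """Penalty cycles per 100 executions of the branch, for each scheme."""
    return {
        "predict not taken": PENALTY_CYCLES * taken_pct,
        "predict taken": PENALTY_CYCLES * (100 - taken_pct),
        "dynamic": PENALTY_CYCLES * (100 - DYNAMIC_ACCURACY_PCT),
    }


def best_scheme(taken_pct):
    costs = misprediction_cost(taken_pct)
    return min(costs, key=costs.get)


for pct in (5, 95, 70):
    print(pct, misprediction_cost(pct), best_scheme(pct))
```

A static scheme wins only when the branch is biased more strongly than the dynamic predictor is accurate; for the 70% branch, both static choices mispredict more often than 10% of the time.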

FIGURE 4.65 The final datapath and control for this chapter. Note that this is a stylized figure rather than a detailed datapath, so
it’s missing the ALUsrc Mux from Figure 4.57 and the multiplexor controls from Figure 4.51.

4.9 Exceptions

To make a computer with automatic program-interruption facilities behave
[sequentially] was not an easy matter, because the number of instructions in various
stages of processing when an interrupt signal occurs may be large.
Fred Brooks, Jr., Planning a Computer System: Project Stretch, 1962

Control is the most challenging aspect of processor design: it is both the hardest
part to get right and the hardest part to make fast. One of the hardest parts of
control is implementing exceptions and interrupts: events other than branches
or jumps that change the normal flow of instruction execution. They were initially
created to handle unexpected events from within the processor, like arithmetic
overflow. The same basic mechanism was extended for I/O devices to communicate
with the processor, as we will see in Chapter 5.

Many architectures and authors do not distinguish between interrupts and
exceptions, often using the older name interrupt to refer to both types of events.
For example, the Intel x86 uses interrupt. We follow the MIPS convention, using
the term exception to refer to any unexpected change in control flow without
distinguishing whether the cause is internal or external; we use the term interrupt
only when the event is externally caused. Here are five examples showing whether
the situation is internally generated by the processor or externally generated:

Type of event                                    From where?   MIPS terminology
I/O device request                               External      Interrupt
Invoke the operating system from user program    Internal      Exception
Arithmetic overflow                              Internal      Exception
Using an undefined instruction                   Internal      Exception
Hardware malfunctions                            Either        Exception or interrupt

Many of the requirements to support exceptions come from the specific
situation that causes an exception to occur. Accordingly, we will return to this
topic in Chapter 5, when we will better understand the motivation for additional
capabilities in the exception mechanism. In this section, we deal with the control
implementation for detecting two types of exceptions that arise from the portions
of the instruction set and implementation that we have already discussed.

Detecting exceptional conditions and taking the appropriate action is often
on the critical timing path of a processor, which determines the clock cycle time
and thus performance. Without proper attention to exceptions during design of
the control unit, attempts to add exceptions to a complicated implementation
can significantly reduce performance, as well as complicate the task of getting the
design correct.

How Exceptions Are Handled in the MIPS Architecture
The two types of exceptions that our current implementation can generate are
execution of an undefined instruction and an arithmetic overflow. We’ll use
arithmetic overflow in the instruction add $1, $2, $1 as the example exception
in the next few pages. The basic action that the processor must perform when an
exception occurs is to save the address of the offending instruction in the exception
program counter (EPC) and then transfer control to the operating system at some
specified address.

The operating system can then take the appropriate action, which may involve
providing some service to the user program, taking some predefined action in
response to an overflow, or stopping the execution of the program and reporting an
error. After performing whatever action is required because of the exception, the
operating system can terminate the program or may continue its execution, using
the EPC to determine where to restart the execution of the program. In Chapter 5,
we will look more closely at the issue of restarting the execution.

exception Also called interrupt. An unscheduled event that disrupts program
execution; used to detect overflow.

interrupt An exception that comes from outside of the processor. (Some
architectures use the term interrupt for all exceptions.)

For the operating system to handle the exception, it must know the reason for
the exception, in addition to the instruction that caused it. There are two main
methods used to communicate the reason for an exception. The method used in
the MIPS architecture is to include a status register (called the Cause register),
which holds a field that indicates the reason for the exception.

A second method is to use vectored interrupts. In a vectored interrupt, the
address to which control is transferred is determined by the cause of the exception.
For example, to accommodate the two exception types listed above, we might
define the following two exception vector addresses:

Exception type           Exception vector address (in hex)
Undefined instruction    8000 0000hex
Arithmetic overflow      8000 0180hex

The operating system knows the reason for the exception by the address at which
it is initiated. With the two addresses above, the entry points are separated by
180hex (384) bytes, or 96 instructions, and the
operating system must record the reason for the exception and may perform some
limited processing in this sequence. When the exception is not vectored, a single
entry point for all exceptions can be used, and the operating system decodes the
status register to find the cause.

We can perform the processing required for exceptions by adding a few extra
registers and control signals to our basic implementation and by slightly extending
control. Let’s assume that we are implementing the exception system used in the
MIPS architecture, with the single entry point being the address 8000 0180hex.
(Implementing vectored exceptions is no more difficult.) We will need to add two
additional registers to our current MIPS implementation:

■ EPC: A 32-bit register used to hold the address of the affected instruction.
(Such a register is needed even when exceptions are vectored.)

■ Cause: A register used to record the cause of the exception. In the MIPS
architecture, this register is 32 bits, although some bits are currently unused.
Assume there is a five-bit field that encodes the two possible exception
sources mentioned above, with 10 representing an undefined instruction and
12 representing arithmetic overflow.
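The two ways of communicating the reason for an exception can be contrasted in a small model. The sketch below is illustrative only: the function names and the dictionary representation of processor state are hypothetical, while the addresses and the cause codes (10 and 12) are the ones given in the text. EPC is modeled as holding the address of the affected instruction, as the first bullet states.

```python
SINGLE_ENTRY_POINT = 0x8000_0180
VECTOR_ADDRESS = {
    "undefined instruction": 0x8000_0000,
    "arithmetic overflow": 0x8000_0180,
}
CAUSE_CODE = {
    "undefined instruction": 10,
    "arithmetic overflow": 12,
}


def enter_exception(kind, offending_pc, vectored=False):
    """Return the register values set when an exception is taken."""
    state = {"EPC": offending_pc}
    if vectored:
        # The entry address itself tells the operating system the reason.
        state["PC"] = VECTOR_ADDRESS[kind]
    else:
        # Single entry point: the reason is recorded in the Cause register.
        state["Cause"] = CAUSE_CODE[kind]
        state["PC"] = SINGLE_ENTRY_POINT
    return state


def decode_cause(cause):
    """What a handler at the single entry point might do first."""
    for kind, code in CAUSE_CODE.items():
        if code == cause:
            return kind
    return "unknown"


state = enter_exception("arithmetic overflow", 0x4C)
print(hex(state["PC"]), decode_cause(state["Cause"]))
```

With the single entry point, the extra decoding step happens in software; with vectoring, the hardware selects among several addresses instead.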

Exceptions in a Pipelined Implementation
A pipelined implementation treats exceptions as another form of control hazard.
For example, suppose there is an arithmetic overflow in an add instruction. Just as
we did for the taken branch in the previous section, we must flush the instructions
that follow the add instruction from the pipeline and begin fetching instructions
from the new address. We will use the same mechanism we used for taken branches,
but this time the exception causes the deasserting of control lines.

vectored interrupt An interrupt for which the address to which control is
transferred is determined by the cause of the exception.

When we dealt with branch mispredict, we saw how to flush the instruction
in the IF stage by turning it into a nop. To flush instructions in the ID stage, we
use the multiplexor already in the ID stage that zeros control signals for stalls. A
new control signal, called ID.Flush, is ORed with the stall signal from the hazard
detection unit to flush during ID. To flush the instruction in the EX phase, we use
a new signal called EX.Flush to cause new multiplexors to zero the control lines. To
start fetching instructions from location 8000 0180hex, which is the MIPS exception
address, we simply add an additional input to the PC multiplexor that sends
8000 0180hex to the PC. Figure 4.66 shows these changes.

This example points out a problem with exceptions: if we do not stop execution
in the middle of the instruction, the programmer will not be able to see the original
value of register $1 that helped cause the overflow because it will be clobbered as
the destination register of the add instruction. Because of careful planning, the
overflow exception is detected during the EX stage; hence, we can use the EX.Flush
signal to prevent the instruction in the EX stage from writing its result in the WB
stage. Many exceptions require that we eventually complete the instruction that
caused the exception as if it executed normally. The easiest way to do this is to flush
the instruction and restart it from the beginning after the exception is handled.

The final step is to save the address of the offending instruction in the exception
program counter (EPC). In reality, we save the address + 4, so the exception handling
software routine must first subtract 4 from the saved value. Figure 4.66 shows
a stylized version of the datapath, including the branch hardware and necessary
accommodations to handle exceptions.

EXAMPLE

Exception in a Pipelined Computer

Given this instruction sequence,

40hex sub $11, $2, $4
44hex and $12, $2, $5
48hex or $13, $2, $6
4Chex add $1, $2, $1
50hex slt $15, $6, $7
54hex lw $16, 50($7)
. . .

assume the instructions to be invoked on an exception begin like this:

80000180hex sw $26, 1000($0)
80000184hex sw $27, 1004($0)
. . .

Show what happens in the pipeline if an overflow exception occurs in the add
instruction.

ANSWER

Figure 4.67 shows the events, starting with the add instruction in the EX stage.
The overflow is detected during that phase, and 8000 0180hex is forced into the
PC. Clock cycle 7 shows that the add and following instructions are flushed,
and the first instruction of the exception code is fetched. Note that the address
of the instruction following the add is saved: 4Chex + 4 = 50hex.
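The transition from clock cycle 6 to clock cycle 7 in this example can be traced with a small model of the three flush signals. This is a sketch for illustration, not the book's control logic: the function `overflow_in_ex` and the dictionary of stage contents are hypothetical, and the model only covers the case in which the overflow is detected for the instruction in EX.

```python
HANDLER_ENTRY = 0x8000_0180
BUBBLE = "bubble (nop)"


def overflow_in_ex(stages, ex_pc, handler_first_instruction):
    """Given the stage contents in the cycle that detects the overflow,
    return (stage contents in the next cycle, EPC value, next PC)."""
    epc = ex_pc + 4  # the hardware saves the offending address + 4
    next_stages = {
        "IF": handler_first_instruction,  # fetched from HANDLER_ENTRY
        "ID": BUBBLE,    # IF.Flush zeroed the instruction that was in IF
        "EX": BUBBLE,    # ID.Flush zeroed the control for the one in ID
        "MEM": BUBBLE,   # EX.Flush zeroed the control for the add itself
        "WB": stages["MEM"],  # older instructions still complete
    }
    return next_stages, epc, HANDLER_ENTRY


clock6 = {
    "IF": "lw $16, 50($7)",
    "ID": "slt $15, $6, $7",
    "EX": "add $1, $2, $1",
    "MEM": "or $13, $2, $6",
    "WB": "and $12, $2, $5",
}
clock7, epc, next_pc = overflow_in_ex(clock6, 0x4C, "sw $26, 1000($0)")
print(clock7)
print(hex(epc), hex(next_pc))
```

A handler that wants to restart the add would subtract 4 from the saved value to recover 4Chex, as the text notes.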

FIGURE 4.66 The datapath with controls to handle exceptions. The key additions include a new input with the value 8000 0180hex
in the multiplexor that supplies the new PC value; a Cause register to record the cause of the exception; and an Exception PC register to save
the address of the instruction that caused the exception. The 8000 0180hex input to the multiplexor is the initial address to begin fetching
instructions in the event of an exception. Although not shown, the ALU overflow signal is an input to the control unit.


Clock 6 (IF, ID, EX, MEM, WB): lw $16, 50($7); slt $15, $6, $7; add $1, $2, $1; or $13, . . .; and $12, . . .
Clock 7 (IF, ID, EX, MEM, WB): sw $26, 1000($0); bubble (nop); bubble; bubble; or $13, . . .

FIGURE 4.67 The result of an exception due to arithmetic overflow in the add instruction. The overflow is detected during
the EX stage of clock 6, saving the address following the add in the EPC register (4C + 4 = 50hex). Overflow causes all the Flush signals to be set
near the end of this clock cycle, deasserting control values (setting them to 0) for the add. Clock cycle 7 shows the instructions converted to
bubbles in the pipeline plus the fetching of the first instruction of the exception routine (sw $26, 1000($0)) from instruction location
8000 0180hex. Note that the AND and OR instructions, which are prior to the add, still complete. Although not shown, the ALU overflow signal
is an input to the control unit.


We mentioned five examples of exceptions on page 326, and we will see others
in Chapter 5. With five instructions active in any clock cycle, the challenge is
to associate an exception with the appropriate instruction. Moreover, multiple
exceptions can occur simultaneously in a single clock cycle. The solution is to
prioritize the exceptions so that it is easy to determine which is serviced first. In
most MIPS implementations, the hardware sorts exceptions so that the earliest
instruction is interrupted.
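
To make the "earliest instruction first" rule concrete, here is a minimal sketch in Python. It assumes the simple five-stage in-order pipeline of this chapter, where the instruction in the latest stage is the earliest in program order; the function name first_exception and the dictionary layout are hypothetical and do not describe how any real MIPS implementation encodes pending exceptions.

```python
# Pipeline stages listed from the youngest instruction (IF) to the oldest (WB).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def first_exception(pending):
    """Pick the exception belonging to the earliest instruction in program order.

    `pending` maps a stage name to the cause raised in that stage during one
    clock cycle. Returns (stage, cause), or None if nothing is pending.
    """
    for stage in reversed(STAGES):      # oldest instruction is checked first
        if stage in pending:
            return stage, pending[stage]
    return None


# Two exceptions raised in the same clock cycle by different instructions.
same_cycle = {"ID": "undefined instruction", "EX": "arithmetic overflow"}
print(first_exception(same_cycle))      # ('EX', 'arithmetic overflow')
```

The instruction in EX was fetched before the one in ID, so its exception is serviced first even though both are detected in the same cycle.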

I/O device requests and hardware malfunctions are not associated with a specific
instruction, so the implementation has some flexibility as to when to interrupt the
pipeline. Hence, the mechanism used for other exceptions works just fine.

The EPC captures the address of the interrupted instructions, and the MIPS
Cause register records all possible exceptions in a clock cycle, so the exception
software must match the exception to the instruction. An important clue is knowing
in which pipeline stage a type of exception can occur. For example, an undefined
instruction is discovered in the ID stage, and invoking the operating system
occurs in the EX stage. Exceptions are collected in the Cause register in a pending
exception field so that the hardware can interrupt based on later exceptions, once
the earliest one has been serviced.

The hardware and the operating system must work in conjunction so that
exceptions behave as you would expect. The hardware contract is normally to
stop the offending instruction in midstream, let all prior instructions complete,
flush all following instructions, set a register to show the cause of the exception,
save the address of the offending instruction, and then jump to a prearranged
address. The operating system contract is to look at the cause of the exception and
act appropriately. For an undefined instruction, hardware failure, or arithmetic
overflow exception, the operating system normally kills the program and returns
an indicator of the reason. For an I/O device request or an operating system service
call, the operating system saves the state of the program, performs the desired task,
and, at some point in the future, restores the program to continue execution. In
the case of I/O device requests, we may often choose to run another task before
resuming the task that requested the I/O, since that task may often not be able to
proceed until the I/O is complete. Exceptions are why the ability to save and restore
the state of any task is critical. One of the most important and frequent uses of
exceptions is handling page faults and TLB exceptions; Chapter 5 describes these
exceptions and their handling in more detail.

Elaboration: The difficulty of always associating the correct exception with the correct
instruction in pipelined computers has led some computer designers to relax this
requirement in noncritical cases. Such processors are said to have imprecise interrupts
or imprecise exceptions. In the example above, PC would normally have 58hex at the start
of the clock cycle after the exception is detected, even though the offending instruction
is at address 4Chex. A processor with imprecise exceptions might put 58hex into EPC and
leave it up to the operating system to determine which instruction caused the problem.
MIPS and the vast majority of computers today support precise interrupts or precise
exceptions. (One reason is to support virtual memory, which we shall see in Chapter 5.)

Hardware/
Software
Interface

imprecise interrupt Also called imprecise exception. Interrupts or exceptions
in pipelined computers that are not associated with the exact instruction
that was the cause of the interrupt or exception.

Elaboration: Although MIPS uses the exception entry address 8000 0180hex for
almost all exceptions, it uses the address 8000 0000hex to improve performance of the
exception handler for TLB-miss exceptions (see Chapter 5).

Which exception should be recognized first in this sequence?

1. add $1, $2, $1 # arithmetic overflow
2. XXX $1, $2, $1 # undefined instruction
3. sub $1, $2, $1 # hardware error

4.10 Parallelism via Instructions

Be forewarned: this section is a brief overview of fascinating but advanced
topics. If you want to learn more details, you should consult our more advanced
book, Computer Architecture: A Quantitative Approach, fifth edition, where the
material covered in these 13 pages is expanded to almost 200 pages (including
appendices)!

Pipelining exploits the potential parallelism among instructions. This
parallelism is called instruction-level parallelism (ILP). There are two primary
methods for increasing the potential amount of instruction-level parallelism. The
first is increasing the depth of the pipeline to overlap more instructions. Using our
laundry analogy and assuming that the washer cycle was longer than the others
were, we could divide our washer into three machines that perform the wash, rinse,
and spin steps of a traditional washer. We would then move from a four-stage to a
six-stage pipeline. To get the full speed-up, we need to rebalance the remaining steps
so they are the same length, in processors or in laundry. The amount of parallelism
being exploited is higher, since there are more operations being overlapped.
Performance is potentially greater since the clock cycle can be shorter.

Another approach is to replicate the internal components of the computer so
that it can launch multiple instructions in every pipeline stage. The general name
for this technique is multiple issue. A multiple-issue laundry would replace our
household washer and dryer with, say, three washers and three dryers. You would
also have to recruit more assistants to fold and put away three times as much
laundry in the same amount of time. The downside is the extra work to keep all the
machines busy and transferring the loads to the next pipeline stage.

Check
Yourself

instruction-level parallelism The parallelism among instructions.

multiple issue A scheme
whereby multiple
instructions are launched
in one clock cycle.

precise interrupt Also
called precise exception.
An interrupt or exception
that is always associated
with the correct
instruction in pipelined
computers.


Launching multiple instructions per stage allows the instruction execution rate to
exceed the clock rate or, stated alternatively, the CPI to be less than 1. As mentioned
in Chapter 1, it is sometimes useful to flip the metric and use IPC, or instructions
per clock cycle. Hence, a 4 GHz four-way multiple-issue microprocessor can execute
a peak rate of 16 billion instructions per second and have a best-case CPI of 0.25,
or an IPC of 4. Assuming a five-stage pipeline, such a processor would have 20
instructions in execution at any given time. Today’s high-end microprocessors
attempt to issue from three to six instructions in every clock cycle. Even moderate
designs will aim at a peak IPC of 2. There are typically, however, many constraints
on what types of instructions may be executed simultaneously, and what happens
when dependences arise.
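
The peak figures quoted above follow from simple arithmetic, sketched here in Python. The function name peak_rates and its parameter names are hypothetical; the sketch assumes every issue slot is filled in every cycle, which real programs rarely achieve.

```python
def peak_rates(clock_hz: int, issue_width: int, pipeline_stages: int):
    """Best-case figures for a multiple-issue pipeline with no stalls."""
    instructions_per_second = clock_hz * issue_width
    best_case_cpi = 1 / issue_width            # IPC equals issue_width
    in_flight = issue_width * pipeline_stages  # instructions in execution
    return instructions_per_second, best_case_cpi, in_flight


# The 4 GHz, four-way, five-stage example from the text.
ips, cpi, in_flight = peak_rates(4_000_000_000, 4, 5)
print(ips)        # 16000000000 instructions per second
print(cpi)        # 0.25
print(in_flight)  # 20
```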

There are two major ways to implement a multiple-issue processor, with the
major difference being the division of work between the compiler and the hardware.
Because the division of work dictates whether decisions are being made statically
(that is, at compile time) or dynamically (that is, during execution), the approaches
are sometimes called static multiple issue and dynamic multiple issue. As we will
see, both approaches have other, more commonly used names, which may be less
precise or more restrictive.

There are two primary and distinct responsibilities that must be dealt with in a
multiple-issue pipeline:

1. Packaging instructions into issue slots: how does the processor determine
how many instructions and which instructions can be issued in a given
clock cycle? In most static issue processors, this process is at least partially
handled by the compiler; in dynamic issue designs, it is normally dealt with
at runtime by the processor, although the compiler will often have already
tried to help improve the issue rate by placing the instructions in a beneficial
order.

2. Dealing with data and control hazards: in static issue processors, the compiler
handles some or all of the consequences of data and control hazards statically.
In contrast, most dynamic issue processors attempt to alleviate at least some
classes of hazards using hardware techniques operating at execution time.

Although we describe these as distinct approaches, in reality one approach often
borrows techniques from the other, and neither approach can claim to be perfectly
pure.

The Concept of Speculation
One of the most important methods for finding and exploiting more ILP is
speculation. Based on the great idea of prediction, speculation is an approach
that allows the compiler or the processor to “guess” about the properties of an
instruction, so as to enable execution to begin for other instructions that may
depend on the speculated instruction. For example, we might speculate on the
outcome of a branch, so that instructions after the branch could be executed earlier.

static multiple issue An
approach to implementing
a multiple-issue processor
where many decisions
are made by the compiler
before execution.

dynamic multiple
issue An approach to
implementing a multiple-
issue processor where
many decisions are made
during execution by the
processor.

issue slots The positions from which instructions could issue in a given
clock cycle; by analogy, these correspond to positions at the starting
blocks for a sprint.

speculation An
approach whereby the
compiler or processor
guesses the outcome of an
instruction to remove it as
a dependence in executing
other instructions.


Another example is that we might speculate that a store that precedes a load does
not refer to the same address, which would allow the load to be executed before the
store. The difficulty with speculation is that it may be wrong. So, any speculation
mechanism must include both a method to check if the guess was right and a
method to unroll or back out the effects of the instructions that were executed
speculatively. The implementation of this back-out capability adds complexity.

Speculation may be done in the compiler or by the hardware. For example, the
compiler can use speculation to reorder instructions, moving an instruction across
a branch or a load across a store. The processor hardware can perform the same
transformation at runtime using techniques we discuss later in this section.

The recovery mechanisms used for incorrect speculation are rather different.
In the case of speculation in software, the compiler usually inserts additional
instructions that check the accuracy of the speculation and provide a fix-up routine
to use when the speculation is incorrect. In hardware speculation, the processor
usually buffers the speculative results until it knows they are no longer speculative.
If the speculation is correct, the instructions are completed by allowing the
contents of the buffers to be written to the registers or memory. If the speculation is
incorrect, the hardware flushes the buffers and re-executes the correct instruction
sequence.
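
The buffer-then-commit behavior of hardware speculation can be sketched in software. The Python below is only a toy model under simplifying assumptions (one outstanding speculation, register results only); the class name SpeculativeRegisterFile and its methods are hypothetical and do not correspond to any particular processor.

```python
class SpeculativeRegisterFile:
    """Toy model: speculative results wait in a buffer until resolved."""

    def __init__(self, registers):
        self.registers = dict(registers)  # committed, programmer-visible state
        self.buffer = []                  # speculative (register, value) pairs

    def write_speculative(self, reg, value):
        # The result is held aside; committed state is not touched yet.
        self.buffer.append((reg, value))

    def resolve(self, speculation_correct):
        if speculation_correct:
            # Correct guess: let the buffered results update the registers.
            for reg, value in self.buffer:
                self.registers[reg] = value
        # In either case the buffer is emptied; on a wrong guess this is the
        # flush, after which the correct instruction sequence would re-execute.
        self.buffer.clear()


rf = SpeculativeRegisterFile({"$t0": 1, "$t1": 2})
rf.write_speculative("$t0", 99)
rf.resolve(speculation_correct=False)
print(rf.registers["$t0"])   # 1, the wrong-path result was discarded

rf.write_speculative("$t1", 7)
rf.resolve(speculation_correct=True)
print(rf.registers["$t1"])   # 7, the speculative result was committed
```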

Speculation introduces one other possible problem: speculating on certain
instructions may introduce exceptions that were formerly not present. For
example, suppose a load instruction is moved in a speculative manner, but the
address it uses is not legal when the speculation is incorrect. The result would be
an exception that should not have occurred. The problem is complicated by the
fact that if the load instruction were not speculative, then the exception must
occur! In compiler-based speculation, such problems are avoided by adding
special speculation support that allows such exceptions to be ignored until it is
clear that they really should occur. In hardware-based speculation, exceptions
are simply buffered until it is clear that the instruction causing them is no longer
speculative and is ready to complete; at that point the exception is raised, and
normal exception handling proceeds.

Since speculation can improve performance when done properly and decrease
performance when done carelessly, significant effort goes into deciding when it
is appropriate to speculate. Later in this section, we will examine both static and
dynamic techniques for speculation.

Static Multiple Issue
Static multiple-issue processors all use the compiler to assist with packaging
instructions and handling hazards. In a static issue processor, you can think of the
set of instructions issued in a given clock cycle, which is called an issue packet, as
one large instruction with multiple operations. This view is more than an analogy.
Since a static multiple-issue processor usually restricts what mix of instructions can
be initiated in a given clock cycle, it is useful to think of the issue packet as a single
instruction allowing several operations in certain predefined fields. This view led to
the original name for this approach: Very Long Instruction Word (VLIW).

issue packet The set of instructions that issues together in one clock cycle; the packet
may be determined statically by the compiler or dynamically by the processor.

Most static issue processors also rely on the compiler to take on some
responsibility for handling data and control hazards. The compiler’s responsibilities
may include static branch prediction and code scheduling to reduce or prevent all
hazards. Let’s look at a simple static issue version of a MIPS processor, before we
describe the use of these techniques in more aggressive processors.

An Example: Static Multiple Issue with the MIPS ISA

To give a flavor of static multiple issue, we consider a simple two-issue MIPS
processor, where one of the instructions can be an integer ALU operation or
branch and the other can be a load or store. Such a design is like that used in some
embedded MIPS processors. Issuing two instructions per cycle will require fetching
and decoding 64 bits of instructions. In many static multiple-issue processors, and
essentially all VLIW processors, the layout of simultaneously issuing instructions
is restricted to simplify the decoding and instruction issue. Hence, we will require
that the instructions be paired and aligned on a 64-bit boundary, with the ALU
or branch portion appearing first. Furthermore, if one instruction of the pair
cannot be used, we require that it be replaced with a nop. Thus, the instructions
always issue in pairs, possibly with a nop in one slot. Figure 4.68 shows how the
instructions look as they go into the pipeline in pairs.

Static multiple-issue processors vary in how they deal with potential data and
control hazards. In some designs, the compiler takes full responsibility for removing
all hazards, scheduling the code and inserting no-ops so that the code executes
without any need for hazard detection or hardware-generated stalls. In others,
the hardware detects data hazards and generates stalls between two issue packets,
while requiring that the compiler avoid all dependences within an instruction pair.
Even so, a hazard generally forces the entire issue packet containing the dependent
instruction to stall. Whether the software must handle all hazards or only try to
reduce the fraction of hazards between separate issue packets, the appearance of
having a large single instruction with multiple operations is reinforced. We will
assume the second approach for this example.

Instruction type            Pipe stages
ALU or branch instruction   IF  ID  EX  MEM WB
Load or store instruction   IF  ID  EX  MEM WB
ALU or branch instruction       IF  ID  EX  MEM WB
Load or store instruction       IF  ID  EX  MEM WB
ALU or branch instruction           IF  ID  EX  MEM WB
Load or store instruction           IF  ID  EX  MEM WB
ALU or branch instruction               IF  ID  EX  MEM WB
Load or store instruction               IF  ID  EX  MEM WB

FIGURE 4.68 Static two-issue pipeline in operation. The ALU and data transfer instructions
are issued at the same time. Here we have assumed the same five-stage structure as used for the single-issue
pipeline. Although this is not strictly necessary, it does have some advantages. In particular, keeping the
register writes at the end of the pipeline simplifies the handling of exceptions and the maintenance of a
precise exception model, which become more difficult in multiple-issue processors.

Very Long Instruction Word (VLIW) A style of instruction set architecture that launches
many operations that are defined to be independent in a single wide instruction, typically with
many separate opcode fields.

To issue an ALU and a data transfer operation in parallel, the first need for
additional hardware (beyond the usual hazard detection and stall logic) is extra
ports in the register file (see Figure 4.69). In one clock cycle we may need to read
two registers for the ALU operation and two more for a store, and also one write
port for an ALU operation and one write port for a load. Since the ALU is tied
up for the ALU operation, we also need a separate adder to calculate the effective
address for data transfers. Without these extra resources, our two-issue pipeline
would be hindered by structural hazards.

Clearly, this two-issue processor can improve performance by up to a factor of
two. Doing so, however, requires that twice as many instructions be overlapped
in execution, and this additional overlap increases the relative performance loss
from data and control hazards. For example, in our simple five-stage pipeline,
loads have a use latency of one clock cycle, which prevents one instruction from
using the result without stalling. In the two-issue, five-stage pipeline the result of
a load instruction cannot be used on the next clock cycle. This means that the next
two instructions cannot use the load result without stalling. Furthermore, ALU
instructions that had no use latency in the simple five-stage pipeline now have a
one-instruction use latency, since the results cannot be used in the paired load or
store. To effectively exploit the parallelism available in a multiple-issue processor,
more ambitious compiler or hardware scheduling techniques are needed, and static
multiple issue requires that the compiler take on this role.

FIGURE 4.69 A static two-issue datapath. The additions needed for double issue are highlighted: another 32 bits from instruction
memory, two more read ports and one more write port on the register file, and another ALU. Assume the bottom ALU handles address
calculations for data transfers and the top ALU handles everything else.

Simple Multiple-Issue Code Scheduling

How would this loop be scheduled on a static two-issue pipeline for MIPS?

Loop: lw   $t0, 0($s1)       # $t0=array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch $s1!=0

Reorder the instructions to avoid as many pipeline stalls as possible. Assume
branches are predicted, so that control hazards are handled by the hardware.

The first three instructions have data dependences, and so do the last two.
Figure 4.70 shows the best schedule for these instructions. Notice that just
one pair of instructions has both issue slots used. It takes four clocks per loop
iteration; at four clocks to execute five instructions, we get the disappointing
CPI of 0.8 versus the best case of 0.5, or an IPC of 1.25 versus 2.0. Notice
that in computing CPI or IPC, we do not count any nops executed as useful
instructions. Doing so would improve CPI, but not performance!
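
The CPI and IPC figures can be recomputed by counting only the filled slots of the schedule in Figure 4.70. This is a small Python sketch; the list-of-pairs representation, the NOP marker, and the function name cpi_and_ipc are assumptions made for illustration.

```python
NOP = None  # marker for an empty issue slot

# Each tuple is one issue packet: (ALU or branch slot, data transfer slot).
schedule = [
    (NOP, "lw $t0, 0($s1)"),
    ("addi $s1,$s1,-4", NOP),
    ("addu $t0,$t0,$s2", NOP),
    ("bne $s1,$zero,Loop", "sw $t0, 4($s1)"),
]


def cpi_and_ipc(packets):
    """Count one clock per packet and only non-nop slots as instructions."""
    cycles = len(packets)
    useful = sum(1 for packet in packets for slot in packet if slot is not NOP)
    return cycles / useful, useful / cycles


cpi, ipc = cpi_and_ipc(schedule)
print(cpi)  # 0.8
print(ipc)  # 1.25
```

Counting the three nops as instructions would give 4 / 8 = 0.5, which looks like the best case but describes no extra useful work.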

use latency Number
of clock cycles between
a load instruction and
an instruction that can
use the result of the
load without stalling the
pipeline.

EXAMPLE

ANSWER

FIGURE 4.70 The scheduled code as it would look on a two-issue MIPS pipeline. The empty
slots are no-ops.

        ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:                               lw $t0, 0($s1)              1
        addi $s1,$s1,-4                                         2
        addu $t0,$t0,$s2                                        3
        bne $s1,$zero,Loop          sw $t0, 4($s1)              4


An important compiler technique to get more performance from loops
is loop unrolling, where multiple copies of the loop body are made. After
unrolling, there is more ILP available by overlapping instructions from different
iterations.

loop unrolling A technique to get more performance from loops that access arrays, in
which multiple copies of the loop body are made and instructions from
different iterations are scheduled together.

FIGURE 4.71 The unrolled and scheduled code of Figure 4.70 as it would look on a static
two-issue MIPS pipeline. The empty slots are no-ops. Since the first instruction in the loop decrements
$s1 by 16, the addresses loaded are the original value of $s1, then that address minus 4, minus 8, and minus 12.

Loop Unrolling for Multiple-Issue Pipelines

See how well loop unrolling and scheduling work in the example above. For
simplicity assume that the loop index is a multiple of four.

To schedule the loop without any delays, it turns out that we need to make
four copies of the loop body. After unrolling and eliminating the unnecessary
loop overhead instructions, the loop will contain four copies each of lw, addu,
and sw, plus one addi and one bne. Figure 4.71 shows the unrolled and
scheduled code.

During the unrolling process, the compiler introduced additional registers
($t1, $t2, $t3). The goal of this process, called register renaming, is to
eliminate dependences that are not true data dependences, but could either
lead to potential hazards or prevent the compiler from flexibly scheduling
the code. Consider how the unrolled code would look using only $t0. There
would be repeated instances of lw $t0,0($s1), addu $t0,$t0,$s2
followed by sw $t0,4($s1), but these sequences, despite using $t0, are
actually completely independent: no data values flow between one set of these
instructions and the next set. This case is what is called an antidependence or
name dependence, which is an ordering forced purely by the reuse of a name,
rather than a real data dependence that is also called a true dependence.

Renaming the registers during the unrolling process allows the compiler
to move these independent instructions subsequently so as to better schedule
the code. The renaming process eliminates the name dependences, while
preserving the true dependences.

EXAMPLE

ANSWER

antidependence Also called name dependence. An ordering forced by the
reuse of a name, typically a register, rather than by a true dependence that
carries a value between two instructions.

        ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:   addi $s1,$s1,-16            lw $t0, 0($s1)              1
                                    lw $t1,12($s1)              2
        addu $t0,$t0,$s2            lw $t2, 8($s1)              3
        addu $t1,$t1,$s2            lw $t3, 4($s1)              4
        addu $t2,$t2,$s2            sw $t0, 16($s1)             5
        addu $t3,$t3,$s2            sw $t1,12($s1)              6
                                    sw $t2, 8($s1)              7
        bne $s1,$zero,Loop          sw $t3, 4($s1)              8

Notice now that 12 of the 14 instructions in the loop execute as pairs. It takes
8 clocks for 4 loop iterations, or 2 clocks per iteration, which yields a CPI of 8/14
= 0.57. Loop unrolling and scheduling with dual issue gave us an improvement
factor of almost 2, partly from reducing the loop control instructions and partly
from dual issue execution. The cost of this performance improvement is using four
temporary registers rather than one, as well as a significant increase in code size.
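
One way to convince yourself that the unrolled, renamed loop still computes the same result is to model both versions in software. The Python sketch below is an assumption-laden illustration, not a MIPS simulator: memory is a dictionary keyed by byte address, 32-bit wraparound of addu is ignored, and the first lw of the unrolled body is taken to read the old value of $s1, which is how the Figure 4.71 caption describes the addresses. The function names are hypothetical.

```python
def original_loop(memory, s1, s2):
    """The five-instruction loop: one array element per iteration."""
    memory = dict(memory)
    while True:
        t0 = memory[s1]        # lw   $t0, 0($s1)
        t0 = t0 + s2           # addu $t0, $t0, $s2
        memory[s1] = t0        # sw   $t0, 0($s1)
        s1 -= 4                # addi $s1, $s1, -4
        if s1 == 0:            # bne  $s1, $zero, Loop
            return memory


def unrolled_loop(memory, s1, s2):
    """Four elements per iteration, with $t0-$t3 as renamed temporaries."""
    memory = dict(memory)
    while True:
        old_s1 = s1
        s1 -= 16               # addi $s1, $s1, -16
        t0 = memory[old_s1]    # lw $t0, 0($s1), issued with the addi (old $s1)
        t1 = memory[s1 + 12]   # lw $t1, 12($s1)
        t2 = memory[s1 + 8]    # lw $t2, 8($s1)
        t3 = memory[s1 + 4]    # lw $t3, 4($s1)
        t0, t1, t2, t3 = t0 + s2, t1 + s2, t2 + s2, t3 + s2   # four addu
        memory[s1 + 16] = t0   # sw $t0, 16($s1)
        memory[s1 + 12] = t1   # sw $t1, 12($s1)
        memory[s1 + 8] = t2    # sw $t2, 8($s1)
        memory[s1 + 4] = t3    # sw $t3, 4($s1)
        if s1 == 0:            # bne $s1, $zero, Loop
            return memory


# Eight words at byte addresses 4..32; the element count is a multiple of four.
array = {address: address * 10 for address in range(4, 36, 4)}
print(original_loop(array, 32, 5) == unrolled_loop(array, 32, 5))  # True
```

Because each temporary carries a value only from its own lw to its own sw, the four copies can be interleaved freely, which is what the schedule exploits.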

Dynamic Multiple-Issue Processors
Dynamic multiple-issue processors are also known as superscalar processors, or
simply superscalars. In the simplest superscalar processors, instructions issue in
order, and the processor decides whether zero, one, or more instructions can issue
in a given clock cycle. Obviously, achieving good performance on such a processor
still requires the compiler to try to schedule instructions to move dependences
apart and thereby improve the instruction issue rate. Even with such compiler
scheduling, there is an important difference between this simple superscalar
and a VLIW processor: the code, whether scheduled or not, is guaranteed by
the hardware to execute correctly. Furthermore, compiled code will always run
correctly independent of the issue rate or pipeline structure of the processor. In
some VLIW designs, this has not been the case, and recompilation was required
when moving across different processor models; in other static issue processors,
code would run correctly across different implementations, but often so poorly as
to make recompilation effectively required.

Many superscalars extend the basic framework of dynamic issue decisions to
include dynamic pipeline scheduling. Dynamic pipeline scheduling chooses
which instructions to execute in a given clock cycle while trying to avoid hazards
and stalls. Let’s start with a simple example of avoiding a data hazard. Consider the
following code sequence:

lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20

Even though the sub instruction is ready to execute, it must wait for the lw
and addu to complete first, which might take many clock cycles if memory is slow.
(Chapter 5 explains cache misses, the reason that memory accesses are sometimes
very slow.) Dynamic pipeline scheduling allows such hazards to be avoided either
fully or partially.

Dynamic Pipeline Scheduling

Dynamic pipeline scheduling chooses which instructions to execute next, possibly
reordering them to avoid stalls. In such processors, the pipeline is divided into
three major units: an instruction fetch and issue unit, multiple functional units
(a dozen or more in high-end designs in 2013), and a commit unit. Figure 4.72
shows the model. The first unit fetches instructions, decodes them, and sends
each instruction to a corresponding functional unit for execution. Each functional
unit has buffers, called reservation stations, which hold the operands and the
operation. (The Elaboration discusses an alternative to reservation stations used
by many recent processors.) As soon as the buffer contains all its operands and
the functional unit is ready to execute, the result is calculated. When the result is
completed, it is sent to any reservation stations waiting for this particular result
as well as to the commit unit, which buffers the result until it is safe to put the
result into the register file or, for a store, into memory. The buffer in the commit
unit, often called the reorder buffer, is also used to supply operands, in much the
same way as forwarding logic does in a statically scheduled pipeline. Once a result
is committed to the register file, it can be fetched directly from there, just as in a
normal pipeline.

superscalar An advanced pipelining technique that enables the processor to execute more
than one instruction per clock cycle by selecting them during execution.

dynamic pipeline scheduling Hardware support for reordering the order of instruction
execution so as to avoid stalls.

The combination of buffering operands in the reservation stations and results
in the reorder buffer provides a form of register renaming, just like that used by
the compiler in our earlier loop-unrolling example on page 338. To see how this
conceptually works, consider the following steps:

commit unit The unit in a dynamic or out-of-order execution pipeline that
decides when it is safe to release the result of an operation to programmer-visible
registers and memory.

reservation station A buffer within a functional unit that holds the operands and the
operation.

reorder buffer The buffer that holds results in a dynamically scheduled
processor until it is safe to store the results to memory or a register.

[Figure 4.72 diagram: an instruction fetch and decode unit (in-order issue) feeds the reservation stations of the
functional units (integer, integer, . . ., floating point, load-store; out-of-order execute), whose results go to
the commit unit (in-order commit).]

FIGURE 4.72 The three primary units of a dynamically scheduled pipeline. The final step of
updating the state is also called retirement or graduation.


1. When an instruction issues, it is copied to a reservation station for the
appropriate functional unit. Any operands that are available in the register
file or reorder buffer are also immediately copied into the reservation station.
The instruction is buffered in the reservation station until all the operands
and the functional unit are available. For the issuing instruction, the register
copy of the operand is no longer required, and if a write to that register
occurred, the value could be overwritten.

2. If an operand is not in the register file or reorder buffer, it must be waiting to
be produced by a functional unit. The name of the functional unit that will
produce the result is tracked. When that unit eventually produces the result,
it is copied directly into the waiting reservation station from the functional
unit bypassing the registers.

These steps effectively use the reorder buffer and the reservation stations to
implement register renaming.

Conceptually, you can think of a dynamically scheduled pipeline as analyzing
the data flow structure of a program. The processor then executes the instructions
in some order that preserves the data flow order of the program. This style of
execution is called an out-of-order execution, since the instructions can be
executed in a different order than they were fetched.
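
The data flow view can be illustrated with the four-instruction sequence shown earlier (lw, addu, sub, slti). The Python sketch below is a deliberately simplified model with stated assumptions: every instruction is already issued at cycle 0, functional units are unlimited, and the latencies are made-up numbers in which the lw suffers a 10-cycle cache miss. The function name finish_times is hypothetical.

```python
def finish_times(program, latency):
    """Cycle at which each result is ready if only true dependences matter.

    `program` lists (name, destination, sources) in fetch order. An
    instruction starts once all of its source operands are available.
    """
    ready = {}   # register -> cycle when its newest value is available
    done = {}
    for name, dest, sources in program:
        start = max((ready.get(src, 0) for src in sources), default=0)
        done[name] = start + latency[name]
        ready[dest] = done[name]   # later readers wait for this producer
    return done


program = [
    ("lw",   "$t0", ["$s2"]),          # lw   $t0, 20($s2)
    ("addu", "$t1", ["$t0", "$t2"]),   # addu $t1, $t0, $t2
    ("sub",  "$s4", ["$s4", "$t3"]),   # sub  $s4, $s4, $t3
    ("slti", "$t5", ["$s4"]),          # slti $t5, $s4, 20
]
latency = {"lw": 10, "addu": 1, "sub": 1, "slti": 1}  # assumed cycle counts

print(finish_times(program, latency))
# {'lw': 10, 'addu': 11, 'sub': 1, 'slti': 2}
```

Under these assumptions the sub and slti finish long before the addu, even though they were fetched after it, because neither depends on the slow load.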

To make programs behave as if they were running on a simple in-order pipeline,
the instruction fetch and decode unit is required to issue instructions in order,
which allows dependences to be tracked, and the commit unit is required to write
results to registers and memory in program fetch order. This conservative mode is
called in-order commit. Hence, if an exception occurs, the computer can point to
the last instruction executed, and the only registers updated will be those written
by instructions before the instruction causing the exception. Although the front
end (fetch and issue) and the back end (commit) of the pipeline run in order,
the functional units are free to initiate execution whenever the data they need is
available. Today, all dynamically scheduled pipelines use in-order commit.
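In-order commit can be illustrated with a small reorder buffer modeled as a circular queue. The sketch below is a simplification under stated assumptions: the capacity `ROB_SIZE`, the 32-entry register array, and the function names are hypothetical, and memory writes and exceptions are left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define ROB_SIZE 8    /* hypothetical capacity */
#define NUM_REGS 32   /* hypothetical number of visible registers */

typedef struct {
    int  dest;   /* architectural destination register */
    long value;  /* result, valid once done is true */
    bool done;   /* set when the functional unit finishes */
} RobEntry;

typedef struct {
    RobEntry entry[ROB_SIZE];
    size_t   head;            /* oldest instruction still in flight */
    size_t   count;           /* number of occupied entries */
    long     regs[NUM_REGS];  /* programmer-visible register state */
} Rob;

/* Issue in program order. Returns the slot index, or -1 if the buffer is full. */
int rob_issue(Rob *r, int dest) {
    if (r->count == ROB_SIZE)
        return -1;
    size_t slot = (r->head + r->count) % ROB_SIZE;
    r->entry[slot] = (RobEntry){ .dest = dest, .value = 0, .done = false };
    r->count++;
    return (int)slot;
}

/* Functional units may complete in any order. */
void rob_complete(Rob *r, int slot, long value) {
    r->entry[slot].value = value;
    r->entry[slot].done  = true;
}

/* Commit only from the head, and stop at the first unfinished instruction,
 * so visible state is always updated in fetch order.
 * Returns the number of instructions committed. */
int rob_commit(Rob *r) {
    int committed = 0;
    while (r->count > 0 && r->entry[r->head].done) {
        RobEntry *e = &r->entry[r->head];
        r->regs[e->dest] = e->value;
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        committed++;
    }
    return committed;
}
```

If the third of three issued instructions completes first, `rob_commit` commits nothing, because the head entry is still unfinished. Its result reaches the register array only after the two older instructions have also completed.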

Dynamic scheduling is often extended by including hardware-based speculation,
especially for branch outcomes. By predicting the direction of a branch, a
dynamically scheduled processor can continue to fetch and execute instructions
along the predicted path. Because the instructions are committed in order, we know
whether or not the branch was correctly predicted before any instructions from the
predicted path are committed. A speculative, dynamically scheduled pipeline can
also support speculation on load addresses, allowing load-store reordering, and
using the commit unit to avoid incorrect speculation. In the next section, we will
look at the use of dynamic scheduling with speculation in the Intel Core i7 design.

out-of-order execution: A situation in pipelined execution when an instruction
blocked from executing does not cause the following instructions to wait.

in-order commit: A commit in which the results of pipelined execution are written
to the programmer-visible state in the same order that instructions are fetched.


Given that compilers can also schedule code around data dependences, you might
ask why a superscalar processor would use dynamic scheduling. There are three
major reasons. First, not all stalls are predictable. In particular, cache misses
(see Chapter 5) in the memory hierarchy cause unpredictable stalls. Dynamic
scheduling allows the processor to hide some of those stalls by continuing to
execute instructions while waiting for the stall to end.

Second, if the processor speculates on branch outcomes using dynamic branch
prediction, it cannot know the exact order of instructions at compile time, since it
depends on the predicted and actual behavior of branches. Incorporating dynamic
speculation to exploit more instruction-level parallelism (ILP) without incorporating
dynamic scheduling would significantly restrict the benefits of speculation.

Third, as the pipeline latency and issue width change from one implementation
to another, the best way to compile a code sequence also changes. For example, how
to schedule a sequence of dependent instructions is affected by both issue width and
latency. The pipeline structure affects both the number of times a loop must be unrolled
to avoid stalls and the process of compiler-based register renaming. Dynamic
scheduling allows the hardware to hide most of these details. Thus, users and software
distributors do not need to worry about having multiple versions of a program for
different implementations of the same instruction set. Similarly, old legacy code will
get much of the benefit of a new implementation without the need for recompilation.

Both pipelining and multiple-issue execution increase peak instruction
throughput and attempt to exploit instruction-level parallelism (ILP).
Data and control dependences in programs, however, place an upper limit
on sustained performance because the processor must sometimes wait for
a dependence to be resolved. Software-centric approaches to exploiting
ILP rely on the ability of the compiler to find and reduce the effects of such
dependences, while hardware-centric approaches rely on extensions to the
pipeline and issue mechanisms. Speculation, performed by the compiler
or the hardware, can increase the amount of ILP that can be exploited via
prediction, although care must be taken since speculating incorrectly is
likely to reduce performance.

The BIG Picture

Understanding Program Performance


Modern, high-performance microprocessors are capable of issuing several instructions
per clock; unfortunately, sustaining that issue rate is very difficult. For example, despite
the existence of processors with four to six issues per clock, very few applications can
sustain more than two instructions per clock. There are two primary reasons for this.

First, within the pipeline, the major performance bottlenecks arise from
dependences that cannot be alleviated, thus reducing the parallelism among
instructions and the sustained issue rate. Although little can be done about true data
dependences, often the compiler or hardware does not know precisely whether a
dependence exists or not, and so must conservatively assume the dependence exists.
For example, code that makes use of pointers, particularly in ways that may lead to
aliasing, will lead to more implied potential dependences. In contrast, the greater
regularity of array accesses often allows a compiler to deduce that no dependences
exist. Similarly, branches that cannot be accurately predicted whether at runtime or
compile time will limit the ability to exploit ILP. Often, additional ILP is available, but
the ability of the compiler or the hardware to find ILP that may be widely separated
(sometimes by the execution of thousands of instructions) is limited.

Second, losses in the memory hierarchy (the topic of Chapter 5) also limit the
ability to keep the pipeline full. Some memory system stalls can be hidden, but
limited amounts of ILP also limit the extent to which such stalls can be hidden.
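The remark about pointers and aliasing in the first reason can be made concrete with a short C sketch. The function names are hypothetical. The first version must allow for the possibility that `dst` and `src` overlap, so every store to `dst[i]` might change `src[0]`. The second uses the C99 `restrict` qualifier to promise that they do not overlap, which removes the potential dependence.

```c
/* dst and src may overlap: the compiler must assume that the store to
 * dst[i] can change src[0], so src[0] is conceptually reloaded on every
 * iteration and the iterations form a chain of potential dependences. */
void add_first(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] += src[0];
}

/* restrict is the programmer's promise that dst and src never overlap.
 * The compiler is then free to load src[0] once and treat the
 * iterations as independent. Calling this with overlapping arguments
 * would be undefined behavior. */
void add_first_noalias(int *restrict dst, const int *restrict src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] += src[0];
}
```

Calling `add_first(b, b, 3)` on `{1, 2, 3}` yields `{2, 4, 5}` rather than `{2, 3, 4}`, because the first store really does change the value read by later iterations. That is exactly the dependence the compiler has to assume whenever it cannot rule out aliasing.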

Energy Efficiency and Advanced Pipelining
The downside to the increasing exploitation of instruction-level parallelism via
dynamic multiple issue and speculation is potential energy inefficiency. Each
innovation was able to turn more transistors into performance, but they often did
so very inefficiently. Now that we have hit the power wall, we are seeing designs
with multiple processors per chip where the processors are not as deeply pipelined
or as aggressively speculative as their predecessors.

The belief is that while the simpler processors are not as fast as their sophisticated
brethren, they deliver better performance per joule, so that they can deliver more
performance per chip when designs are constrained more by energy than they are
by number of transistors.

Figure 4.73 shows the number of pipeline stages, the issue width, speculation level,
clock rate, cores per chip, and power of several past and recent microprocessors. Note
the drop in pipeline stages and power as companies switch to multicore designs.

Elaboration: A commit unit controls updates to the register file and memory. Some
dynamically scheduled processors update the register file immediately during execution,
using extra registers to implement the renaming function and preserving the older copy of a
register until the instruction updating the register is no longer speculative. Other processors
buffer the result, typically in a structure called a reorder buffer, and the actual update to the
register file occurs later as part of the commit. Stores to memory must be buffered until
commit time either in a store buffer (see Chapter 5) or in the reorder buffer. The commit unit
allows the store to write to memory from the buffer when the buffer has a valid address and
valid data, and when the store is no longer dependent on predicted branches.

Hardware/Software Interface


Elaboration: Memory accesses benefit from nonblocking caches, which continue
servicing cache accesses during a cache miss (see Chapter 5). Out-of-order execution
processors need the cache design to allow instructions to execute during a miss.

Check Yourself

State whether the following techniques or components are associated primarily
with a software- or hardware-based approach to exploiting ILP. In some cases, the
answer may be both.

1. Branch prediction

2. Multiple issue

3. VLIW

4. Superscalar

5. Dynamic scheduling

6. Out-of-order execution

7. Speculation

8. Reorder buffer

9. Register renaming

Microprocessor | Year | Clock Rate | Pipeline Stages | Issue Width | Out-of-Order/Speculation | Cores/Chip | Power
Intel 486 | 1989 | 25 MHz | 5 | 1 | No | 1 | 5 W
Intel Pentium | 1993 | 66 MHz | 5 | 2 | No | 1 | 10 W
Intel Pentium Pro | 1997 | 200 MHz | 10 | 3 | Yes | 1 | 29 W
Intel Pentium 4 Willamette | 2001 | 2000 MHz | 22 | 3 | Yes | 1 | 75 W
Intel Pentium 4 Prescott | 2004 | 3600 MHz | 31 | 3 | Yes | 1 | 103 W
Intel Core | 2006 | 2930 MHz | 14 | 4 | Yes | 2 | 75 W
Intel Core i5 Nehalem | 2010 | 3300 MHz | 14 | 4 | Yes | 1 | 87 W
Intel Core i5 Ivy Bridge | 2012 | 3400 MHz | 14 | 4 | Yes | 8 | 77 W

FIGURE 4.73 Record of Intel Microprocessors in terms of pipeline complexity, number of cores, and power. The Pentium
4 pipeline stages do not include the commit stages. If we included them, the Pentium 4 pipelines would be even deeper.

4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines

Figure 4.74 describes the two microprocessors we examine in this section, whose
targets are the two bookends of the PostPC Era.

Processor | ARM A8 | Intel Core i7 920
Market | Personal Mobile Device | Server, Cloud
Thermal design power | 2 Watts | 130 Watts
Clock rate | 1 GHz | 2.66 GHz
Cores/Chip | 1 | 4
Floating point? | No | Yes
Multiple Issue? | Dynamic | Dynamic
Peak instructions/clock cycle | 2 | 4
Pipeline Stages | 14 | 14
Pipeline schedule | Static In-order | Dynamic Out-of-order with Speculation
Branch prediction | 2-level | 2-level
1st level caches / core | 32 KiB I, 32 KiB D | 32 KiB I, 32 KiB D
2nd level cache / core | 128 – 1024 KiB | 256 KiB
3rd level cache (shared) | None | 2 – 8 MiB

FIGURE 4.74 Specification of the ARM Cortex-A8 and the Intel Core i7 920.

The ARM Cortex-A8
The ARM Cortex-A8 runs at 1 GHz with a 14-stage pipeline. It uses dynamic
multiple issue, with two instructions per clock cycle. It is a static in-order pipeline,
in that instructions issue, execute, and commit in order. The pipeline consists of
three sections for instruction fetch, instruction decode, and execute. Figure 4.75
shows the overall pipeline.

The first three stages fetch two instructions at a time and try to keep a
12-instruction entry prefetch buffer full. It uses a two-level branch predictor with
a 512-entry branch target buffer, a 4096-entry global history buffer, and an
8-entry return stack to predict future returns. When the branch prediction is
wrong, it empties the pipeline, resulting in a 13-clock-cycle misprediction penalty.
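To see what a 13-cycle misprediction penalty can cost, the usual first-order estimate multiplies the fraction of instructions that are branches by the misprediction rate and by the penalty. The sketch below is only that arithmetic. The function name is hypothetical, and the example inputs in the comment (20% branches, 5% mispredicted) are illustrative assumptions, not measurements of the A8.

```c
/* First-order estimate of the CPI added by branch mispredictions:
 *   branch_fraction  - fraction of executed instructions that are branches
 *   mispredict_rate  - fraction of those branches that are mispredicted
 *   penalty_cycles   - cycles lost per misprediction (13 on the A8)
 * Example with assumed inputs: 0.20 * 0.05 * 13 = 0.13 cycles per instruction. */
double mispredict_cpi(double branch_fraction, double mispredict_rate,
                      int penalty_cycles) {
    return branch_fraction * mispredict_rate * (double)penalty_cycles;
}
```

Under those assumed inputs the penalty alone adds about 0.13 to the CPI, a noticeable fraction of the ideal CPI of 0.5 for a two-issue pipeline.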

The five stages of the decode pipeline determine if there are dependences
between a pair of instructions, which would force sequential execution, and
which pipeline of the execution stages should receive each instruction.

The six stages of the instruction execution section offer one pipeline for load
and store instructions and two pipelines for arithmetic operations, although only
the first of the pair can handle multiplies. Either instruction from the pair can be
issued to the load-store pipeline. The execution stages have full bypassing between
the three pipelines.

Figure 4.76 shows the CPI of the A8 using small versions of programs derived
from the SPEC2000 benchmarks. While the ideal CPI is 0.5, the best case here is
1.4, the median case is 2.0, and the worst case is 5.2. For the median case, 80% of
the stalls are due to the pipelining hazards and 20% are stalls due to the memory
hierarchy. Pipeline stalls are caused by branch mispredictions, structural hazards,
and data dependences between pairs of instructions. Given the static pipeline of the
A8, it is up to the compiler to try to avoid structural hazards and data dependences.
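The median-case figures above can be turned into a per-category breakdown with a little arithmetic: the stall component is the measured CPI minus the ideal CPI, and it is then divided according to the stated 80%/20% split. The helper below is a sketch of that calculation; the type and function names are hypothetical.

```c
/* Stall cycles per instruction, split by cause. */
typedef struct {
    double pipeline;  /* stalls from pipelining hazards */
    double memory;    /* stalls from the memory hierarchy */
} StallCpi;

/* Split the stall CPI (measured minus ideal) by the given pipeline share.
 * For the A8 median case: measured 2.0, ideal 0.5, pipeline share 0.80,
 * giving 1.5 stall cycles per instruction, of which about 1.2 are
 * pipeline stalls and about 0.3 are memory hierarchy stalls. */
StallCpi split_stalls(double measured_cpi, double ideal_cpi,
                      double pipeline_share) {
    double stall = measured_cpi - ideal_cpi;
    StallCpi s;
    s.pipeline = stall * pipeline_share;
    s.memory   = stall * (1.0 - pipeline_share);
    return s;
}
```

In the median case, then, the pipeline hazards alone cost more than twice the ideal CPI, which is why compiler scheduling matters so much on a static in-order design.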

Elaboration: The Cortex-A8 is a configurable core that supports the ARMv7 instruction
set architecture. It is delivered as an IP (Intellectual Property) core. IP cores are the
dominant form of technology delivery in the embedded, personal mobile device, and
related markets; billions of ARM and MIPS processors have been created from these
IP cores.

Note that IP cores are different from the cores in the Intel i7 multicore computers. An
IP core (which may itself be a multicore) is designed to be incorporated with other logic
(hence it is the “core” of a chip), including application-specific processors (such as an
encoder or decoder for video), I/O interfaces, and memory interfaces, and then fabricated
to yield a processor optimized for a particular application. Although the processor core is
almost identical, the resultant chips have many differences. One parameter is the size
of the L2 cache, which can vary by a factor of eight.

The Intel Core i7 920
x86 microprocessors employ sophisticated pipelining approaches. The Core i7 uses
both dynamic multiple issue and dynamic pipeline scheduling with out-of-order
execution and speculation for its 14-stage pipeline. These processors, however,
are still faced with the challenge of implementing the complex x86 instruction
set, described in Chapter 2. Intel fetches x86 instructions and translates them into
internal MIPS-like instructions, which Intel calls micro-operations. The
micro-operations are then executed by a sophisticated, dynamically scheduled, speculative
pipeline capable of sustaining an execution rate of up to six micro-operations per
clock cycle. This section focuses on that micro-operation pipeline.

[Figure: A8 pipeline diagram. Instruction fetch stages F0-F2 (AGU, RAM + TLB, BTB, GHB, RS,
12-entry fetch queue); instruction decode stages D0-D4; instruction execute and load/store
stages E0-E5 (ALU/MUL pipe 0, ALU pipe 1, LS pipe 0 or 1, each with BP update), fed from
the architectural register file. Branch mispredict penalty = 13 cycles.]

FIGURE 4.75 The A8 pipeline. The first three stages fetch instructions into a 12-entry instruction fetch
buffer. The Address Generation Unit (AGU) uses a Branch Target Buffer (BTB), Global History Buffer (GHB),
and a Return Stack (RS) to predict branches to try to keep the fetch queue full. Instruction decode is five
stages and instruction execution is six stages.


When we consider the design of sophisticated, dynamically scheduled processors, the
design of the functional units, the cache and register file, instruction issue, and overall
pipeline control become intermingled, making it difficult to separate the datapath from
the pipeline. Because of this, many engineers and researchers have adopted the term
microarchitecture to refer to the detailed internal architecture of a processor.

The Intel Core i7 uses a scheme for resolving antidependences and incorrect
speculation that uses a reorder buffer together with register renaming. Register
renaming explicitly renames the architectural registers in a processor (16 in the case
of the 64-bit version of the x86 architecture) to a larger set of physical registers. The
Core i7 uses register renaming to remove antidependences. Register renaming requires
the processor to maintain a map between the architectural registers and the physical
registers, indicating which physical register is the most current copy of an architectural
register. By keeping track of the renamings that have occurred, register renaming offers
another approach to recovery in the event of incorrect speculation: simply undo the
mappings that have occurred since the first incorrectly speculated instruction. This
will cause the state of the processor to return to the last correctly executed instruction,
keeping the correct mapping between the architectural and physical registers.
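The map-and-undo idea in this paragraph can be sketched as a rename table with an undo log. This is an illustrative model only, not Intel's implementation: the physical register count, the bump allocator, the log size, and all names are assumptions made for the example. Only the 16 architectural registers come from the text.

```c
#define ARCH_REGS 16   /* architectural registers in 64-bit x86 (from the text) */
#define PHYS_REGS 64   /* hypothetical physical register count */
#define UNDO_SIZE 64   /* hypothetical undo-log capacity */

typedef struct {
    int map[ARCH_REGS];  /* current physical register for each architectural one */
    int next_free;       /* simplistic bump allocator for physical registers */
    struct { int arch; int old_phys; } undo[UNDO_SIZE];  /* renamings, oldest first */
    int undo_len;
} RenameTable;

void rename_init(RenameTable *t) {
    for (int i = 0; i < ARCH_REGS; i++)
        t->map[i] = i;           /* identity mapping to start */
    t->next_free = ARCH_REGS;
    t->undo_len = 0;
}

/* Source operands read the most current mapping. */
int rename_src(const RenameTable *t, int arch) {
    return t->map[arch];
}

/* A destination gets a fresh physical register; the old mapping is logged.
 * Returns the new physical register, or -1 if resources are exhausted. */
int rename_dest(RenameTable *t, int arch) {
    if (t->next_free == PHYS_REGS || t->undo_len == UNDO_SIZE)
        return -1;
    t->undo[t->undo_len].arch = arch;
    t->undo[t->undo_len].old_phys = t->map[arch];
    t->undo_len++;
    t->map[arch] = t->next_free++;
    return t->map[arch];
}

/* Remember the log position before speculating past a branch. */
int rename_checkpoint(const RenameTable *t) {
    return t->undo_len;
}

/* On a misprediction, undo every renaming made since the checkpoint,
 * newest first, and release the physical registers they used. */
void rename_rollback(RenameTable *t, int checkpoint) {
    while (t->undo_len > checkpoint) {
        t->undo_len--;
        t->map[t->undo[t->undo_len].arch] = t->undo[t->undo_len].old_phys;
        t->next_free--;
    }
}
```

Because each destination write receives its own physical register, a later write to the same architectural register cannot overwrite a value that an earlier instruction still needs to read, which is how renaming removes antidependences. A real design would also free physical registers at commit, which this sketch omits.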

microarchitecture: The organization of the processor, including the major functional
units, their interconnection, and control.

architectural registers: The registers of a processor that are visible to the instruction
set; for example, in MIPS, these are the 32 integer and 16 floating-point registers.

Figure 4.77 shows the overall organization and pipeline of the Core i7. Below are
the eight steps an x86 instruction goes through for execution.

1. Instruction fetch: The processor uses a multilevel branch target buffer to
achieve a balance between speed and prediction accuracy. There is also a
return address stack to speed up function return. Mispredictions cause a
penalty of about 15 cycles. Using the predicted address, the instruction fetch
unit fetches 16 bytes from the instruction cache.

2. The 16 bytes are placed in the predecode instruction buffer: The predecode
stage transforms the 16 bytes into individual x86 instructions. This predecode
is nontrivial since the length of an x86 instruction can be from 1 to 15 bytes
and the predecoder must look through a number of bytes before it knows the
instruction length. Individual x86 instructions are placed into the 18-entry
instruction queue.

3. Micro-op decode: Individual x86 instructions are translated into
micro-operations (micro-ops). Three of the decoders handle x86 instructions that
translate directly into one micro-op. For x86 instructions that have more complex
semantics, there is a microcode engine that is used to produce the micro-op
sequence; it can produce up to four micro-ops every cycle and continues until
the necessary micro-op sequence has been generated. The micro-ops are placed
according to the order of the x86 instructions in the 28-entry micro-op buffer.

4. The micro-op buffer performs loop stream detection: If there is a small
sequence of instructions (less than 28 instructions or 256 bytes in length)
that comprises a loop, the loop stream detector will find the loop and directly
issue the micro-ops from the buffer, eliminating the need for the instruction
fetch and instruction decode stages to be activated.

5. Perform the basic instruction issue: Looking up the register location in the
register tables, renaming the registers, allocating a reorder buffer entry, and
fetching any results from the registers or reorder buffer before sending the
micro-ops to the reservation stations.

6. The i7 uses a 36-entry centralized reservation station shared by six functional
units. Up to six micro-ops may be dispatched to the functional units every
clock cycle.

7. The individual function units execute micro-ops and then results are sent
back to any waiting reservation station as well as to the register retirement
unit, where they will update the register state, once it is known that the
instruction is no longer speculative. The entry corresponding to the
instruction in the reorder buffer is marked as complete.

8. When one or more instructions at the head of the reorder buffer have been
marked as complete, the pending writes in the register retirement unit are
executed, and the instructions are removed from the reorder buffer.

[Figure: stacked bar chart of CPI per benchmark, divided into ideal CPI, pipeline stalls, and
memory hierarchy stalls; the vertical axis runs from 1.00 to 6.00.
Benchmarks, left to right: twolf, bzip2, gzip, parser, gap, perlbmk, gcc, crafty, vpr, vortex, eon, mcf.
Total CPI, left to right: 1.41, 1.63, 1.69, 1.70, 1.85, 1.95, 2.01, 2.07, 2.11, 2.41, 3.20, 5.17.]

FIGURE 4.76 CPI on ARM Cortex A8 for the Minnespec benchmarks, which are small versions of the SPEC2000
benchmarks. These benchmarks use the much smaller inputs to reduce running time by several orders of magnitude. The smaller size
significantly underestimates the CPI impact of the memory hierarchy (See Chapter 5).

[Figure: block diagram of the Core i7 pipeline and memory components. Front end: 32 KB
instruction cache (four-way associative), 128-entry instruction TLB (four-way), instruction fetch
hardware, 16-byte pre-decode + macro-op fusion and fetch buffer, 18-entry instruction queue,
complex and simple macro-op decoders, microcode, 28-entry micro-op loop stream detect buffer.
Back end: 128-entry reorder buffer, 36-entry reservation station, retirement register file,
functional units (ALU/shift, SSE shuffle/ALU, 128-bit FMUL/FDIV, load address, store address,
store data), memory order buffer, store & load. Memory: 64-entry data TLB (4-way associative),
32-KB dual-ported data cache (8-way associative), 512-entry unified L2 TLB (4-way), 256 KB
unified L2 cache (eight-way), 8 MB all-core shared and inclusive L3 cache (16-way associative),
uncore arbiter (handles scheduling and clock/power state differences).]

FIGURE 4.77 The Core i7 pipeline with memory components. The total pipeline depth is 14
stages, with branch mispredictions costing 17 clock cycles. This design can buffer 48 loads and 32 stores. The
six independent units can begin execution of a ready RISC operation each clock cycle.

Elaboration: Hardware in the second and fourth steps can combine or fuse operations
together to reduce the number of operations that must be performed. Macro-op fusion
in the second step takes x86 instruction combinations, such as compare followed by a
branch, and fuses them into a single operation. Microfusion in the fourth step combines
micro-operation pairs such as load/ALU operation and ALU operation/store and issues
them to a single reservation station (where they can still issue independently), thus
increasing the usage of the buffer. In a study of the Intel Core architecture, which also
incorporated microfusion and macrofusion, Bird et al. [2007] discovered that microfusion
had little impact on performance, while macrofusion appears to have a modest positive
impact on integer performance and little impact on floating-point performance.

Performance of the Intel Core i7 920
Figure 4.78 shows the CPI of the Intel Core i7 for each of the SPEC2006 benchmarks.
While the ideal CPI is 0.25, the best case here is 0.44, the median case is 0.79, and
the worst case is 2.67.

While it is difficult to differentiate between pipeline stalls and memory stalls
in a dynamic out-of-order execution pipeline, we can show the effectiveness of
branch prediction and speculation. Figure 4.79 shows the percentage of branches
mispredicted and the percentage of the work (measured by the numbers of micro-ops
dispatched into the pipeline) that does not retire (that is, their results are
annulled) relative to all micro-op dispatches. The min, median, and max of branch
mispredictions are 0%, 2%, and 10%. For wasted work, they are 1%, 18%, and 39%.

The wasted work in some cases closely matches the branch misprediction rates,
such as for gobmk and astar. In several instances, such as mcf, the wasted work
seems relatively larger than the misprediction rate. This divergence is likely due
to the memory behavior. With very high data cache miss rates, mcf will dispatch
many instructions during an incorrect speculation as long as sufficient reservation
stations are available for the stalled memory references. When a branch among the
many speculated instructions is finally mispredicted, the micro-ops corresponding
to all these instructions will be flushed.

[Figure: stacked bar chart of CPI per benchmark, divided into ideal CPI and stalls/misspeculation.
Benchmarks, left to right: libquantum, h264ref, hmmer, perlbench, bzip2, xalancbmk, sjeng, gobmk,
astar, gcc, omnetpp, mcf.
CPI, left to right: 0.44, 0.59, 0.61, 0.65, 0.74, 0.77, 0.82, 1.02, 1.06, 1.23, 2.12, 2.67.]

FIGURE 4.78 CPI of Intel Core i7 920 running SPEC2006 integer benchmarks.

[Figure: paired bar chart of branch misprediction % and wasted work % per benchmark; the
vertical axis runs from 0% to 40%. Benchmarks, left to right: libquantum, h264ref, hmmer,
perlbench, bzip2, xalancbmk, sjeng, gobmk, astar, gcc, omnetpp, mcf.]

FIGURE 4.79 Percentage of branch mispredictions and wasted work due to unfruitful
speculation of Intel Core i7 920 running SPEC2006 integer benchmarks.

Understanding Program Performance

The Intel Core i7 combines a 14-stage pipeline and aggressive multiple issue to
achieve high performance. By keeping the latencies for back-to-back operations
low, the impact of data dependences is reduced. What are the most serious potential
performance bottlenecks for programs running on this processor? The following
list includes some potential performance problems, the last three of which can
apply in some form to any high-performance pipelined processor.

■ The use of x86 instructions that do not map to a few simple micro-operations

■ Branches that are difficult to predict, causing misprediction stalls and restarts
when speculation fails

■ Long dependences (typically caused by long-running instructions or the
memory hierarchy) that lead to stalls

■ Performance delays arising in accessing memory (see Chapter 5) that cause
the processor to stall

4.12 Going Faster: Instruction-Level Parallelism and Matrix Multiply

Returning to the DGEMM example from Chapter 3, we can see the impact of
instruction-level parallelism by unrolling the loop so that the multiple-issue,
out-of-order execution processor has more instructions to work with. Figure 4.80 shows
the unrolled version of Figure 3.23, which contains the C intrinsics to produce the
AVX instructions.

Like the unrolling example in Figure 4.71 above, we are going to unroll the loop
4 times. (We use the constant UNROLL in the C code to control the amount of
unrolling in case we want to try other values.) Rather than manually unrolling the
loop in C by making 4 copies of each of the intrinsics in Figure 3.23, we can rely
on the gcc compiler to do the unrolling at -O3 optimization. We surround each
intrinsic with a simple for loop that runs for 4 iterations (lines 9, 15, and 20) and replace the
scalar C0 in Figure 3.23 with a 4-element array c[] (lines 8, 10, 16, and 21).

Figure 4.81 shows the assembly language output of the unrolled code. As
expected, in Figure 4.81 there are 4 versions of each of the AVX instructions in
Figure 3.24, with one exception. We only need 1 copy of the vbroadcastsd
instruction, since we can use the four copies of the B element in register %ymm0
repeatedly throughout the loop. Thus, the 5 AVX instructions in Figure 3.24
become 17 in Figure 4.81, and the 7 integer instructions appear in both, although
the constants and addressing change to account for the unrolling. Hence, despite
unrolling 4 times, the number of instructions in the body of the loop only doubles:
from 12 to 24.
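The instruction counts quoted above follow a simple pattern: each unrolled copy needs its own load, multiply, add, and store, while the single broadcast and the 7 integer instructions are shared. The sketch below just encodes that arithmetic. The type and function names are hypothetical, and extending the formula beyond the two cases given in the text (no unrolling and unrolling by 4) is an assumption rather than something measured from compiler output.

```c
/* Instruction counts for the DGEMM loop nest body. */
typedef struct {
    int avx;      /* AVX instructions */
    int integer;  /* integer (address, compare, branch) instructions */
    int total;
} LoopCounts;

/* Each unrolled copy contributes 4 AVX instructions (load, mul, add, store);
 * one vbroadcastsd is shared by all copies; the 7 integer instructions
 * are unchanged by unrolling. */
LoopCounts dgemm_loop_counts(int unroll) {
    LoopCounts c;
    c.avx     = 4 * unroll + 1;
    c.integer = 7;
    c.total   = c.avx + c.integer;
    return c;
}
```

With `unroll` set to 1 this gives 5 AVX instructions and 12 in total, and with 4 it gives 17 and 24, matching the figures in the text.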

Figure 4.82 shows the performance increase of DGEMM for 32×32 matrices in
going from unoptimized to AVX and then to AVX with unrolling. Unrolling more
than doubles performance, going from 6.4 GFLOPS to 14.6 GFLOPS. Optimizations
for subword parallelism and instruction-level parallelism result in an overall
speedup of 8.8 versus the unoptimized DGEMM in Figure 3.21.

Elaboration: As mentioned in the Elaboration in Section 3.8, these results are with
Turbo mode turned off. If we turn it on, as in Chapter 3, we improve all the results by the
temporary increase in the clock rate of 3.3/2.6 = 1.27, to 2.1 GFLOPS for unoptimized
DGEMM, 8.1 GFLOPS with AVX, and 18.6 GFLOPS with unrolling and AVX. As mentioned
in Section 3.8, Turbo mode works particularly well in this case because it is using only
a single core of an eight-core chip.

 1 #include <x86intrin.h>
 2 #define UNROLL (4)
 3
 4 void dgemm (int n, double* A, double* B, double* C)
 5 {
 6   for ( int i = 0; i < n; i+=UNROLL*4 )
 7     for ( int j = 0; j < n; j++ ) {
 8       __m256d c[4];
 9       for ( int x = 0; x < UNROLL; x++ )
10         c[x] = _mm256_load_pd(C+i+x*4+j*n);
11
12       for( int k = 0; k < n; k++ )
13       {
14         __m256d b = _mm256_broadcast_sd(B+k+j*n);
15         for (int x = 0; x < UNROLL; x++)
16           c[x] = _mm256_add_pd(c[x],
17                    _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
18       }
19
20       for ( int x = 0; x < UNROLL; x++ )
21         _mm256_store_pd(C+i+x*4+j*n, c[x]);
22     }
23 }

FIGURE 4.80 Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel
instructions for the x86 (Figure 3.23) and loop unrolling to create more opportunities
for instruction-level parallelism. Figure 4.81 shows the assembly language produced by the
compiler for the inner loop, which unrolls the three for-loop bodies to expose instruction-level
parallelism.

 1 vmovapd (%r11),%ymm4              # Load 4 elements of C into %ymm4
 2 mov %rbx,%rax                     # register %rax = %rbx
 3 xor %ecx,%ecx                     # register %ecx = 0
 4 vmovapd 0x20(%r11),%ymm3          # Load 4 elements of C into %ymm3
 5 vmovapd 0x40(%r11),%ymm2          # Load 4 elements of C into %ymm2
 6 vmovapd 0x60(%r11),%ymm1          # Load 4 elements of C into %ymm1
 7 vbroadcastsd (%rcx,%r9,1),%ymm0   # Make 4 copies of B element
 8 add $0x8,%rcx                     # register %rcx = %rcx + 8
 9 vmulpd (%rax),%ymm0,%ymm5         # Parallel mul %ymm0, 4 A elements
10 vaddpd %ymm5,%ymm4,%ymm4          # Parallel add %ymm5, %ymm4
11 vmulpd 0x20(%rax),%ymm0,%ymm5     # Parallel mul %ymm0, 4 A elements
12 vaddpd %ymm5,%ymm3,%ymm3          # Parallel add %ymm5, %ymm3
13 vmulpd 0x40(%rax),%ymm0,%ymm5     # Parallel mul %ymm0, 4 A elements
14 vmulpd 0x60(%rax),%ymm0,%ymm0     # Parallel mul %ymm0, 4 A elements
15 add %r8,%rax                      # register %rax = %rax + %r8
16 cmp %r10,%rcx                     # compare %r10 to %rcx
17 vaddpd %ymm5,%ymm2,%ymm2          # Parallel add %ymm5, %ymm2
18 vaddpd %ymm0,%ymm1,%ymm1          # Parallel add %ymm0, %ymm1
19 jne 68                            # jump if %r10 != %rcx
20 add $0x1,%esi                     # register %esi = %esi + 1
21 vmovapd %ymm4,(%r11)              # Store %ymm4 into 4 C elements
22 vmovapd %ymm3,0x20(%r11)          # Store %ymm3 into 4 C elements
23 vmovapd %ymm2,0x40(%r11)          # Store %ymm2 into 4 C elements
24 vmovapd %ymm1,0x60(%r11)          # Store %ymm1 into 4 C elements

FIGURE 4.81 The x86 assembly language for the body of the nested loops generated by compiling
the unrolled C code in Figure 4.80.

Elaboration: There are no pipeline stalls despite the reuse of register %ymm5 in lines
9 to 17 of Figure 4.81 because the Intel Core i7 pipeline renames the registers.

Check Yourself

Are the following statements true or false?

1. The Intel Core i7 uses a multiple-issue pipeline to directly execute x86 instructions.

2. Both the A8 and the Core i7 use dynamic multiple issue.

3. The Core i7 microarchitecture has many more registers than x86 requires.

4. The Intel Core i7 uses less than half the pipeline stages of the earlier Intel Pentium 4
Prescott (see Figure 4.73).
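For readers without AVX hardware, the effect of the unrolling in Figure 4.80 can be sketched in plain scalar C. The two functions below are illustrations written for this purpose, not code from Chapter 3: `dgemm_scalar` is a straightforward column-major reference, and `dgemm_unrolled` keeps `UNROLL` independent accumulators in the same way Figure 4.80 keeps the array `c[]`. Both use the indexing of Figure 4.80, and the unrolled version assumes that `n` is a multiple of `UNROLL`.

```c
#define UNROLL 4

/* Reference version: one accumulator, so each add depends on the previous one.
 * Column-major indexing as in Figure 4.80: C[i+j*n] += A[i+k*n] * B[k+j*n]. */
void dgemm_scalar(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[i + j * n];
            for (int k = 0; k < n; k++)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

/* Unrolled version: UNROLL independent accumulators per iteration of k,
 * giving a multiple-issue processor several independent multiply-add
 * chains to overlap. Assumes n is a multiple of UNROLL. */
void dgemm_unrolled(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += UNROLL)
        for (int j = 0; j < n; j++) {
            double c[UNROLL];
            for (int x = 0; x < UNROLL; x++)
                c[x] = C[i + x + j * n];

            for (int k = 0; k < n; k++) {
                double b = B[k + j * n];  /* shared by all copies, like %ymm0 */
                for (int x = 0; x < UNROLL; x++)
                    c[x] += A[i + x + k * n] * b;
            }

            for (int x = 0; x < UNROLL; x++)
                C[i + x + j * n] = c[x];
        }
}
```

Each element of C is accumulated in the same order of k in both versions, so the two produce identical results. The gain comes only from exposing independent work within each iteration.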

[Figure: bar chart of DGEMM performance in GFLOPS (vertical axis 0 to 16.0):
unoptimized 1.7, AVX 6.4, AVX+unroll 14.6.]

FIGURE 4.82 Performance of three versions of DGEMM for 32×32 matrices. Subword
parallelism and instruction-level parallelism have led to a speedup of almost a factor of 9 over the
unoptimized code in Figure 3.21.

4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design
Language to Describe and Model a Pipeline and More Pipelining Illustrations

Modern digital design is done using hardware description languages and modern
computer-aided synthesis tools that can create detailed hardware designs from the
descriptions using both libraries and logic synthesis. Entire books are written on
such languages and their use in digital design. This section, which appears online,
gives a brief introduction and shows how a hardware design language, Verilog in
this case, can be used to describe the MIPS control both behaviorally and in a
form suitable for hardware synthesis. It then provides a series of behavioral models
in Verilog of the MIPS five-stage pipeline. The initial model ignores hazards, and
additions to the model highlight the changes for forwarding, data hazards, and
branch hazards.

We then provide about a dozen illustrations using the single-cycle graphical
pipeline representation for readers who want to see more detail on how pipelines
work for a few sequences of MIPS instructions.

4.14 Fallacies and Pitfalls

Fallacy: Pipelining is easy.
Our books testify to the subtlety of correct pipeline execution. Our advanced book
had a pipeline bug in its first edition, despite its being reviewed by more than 100
people and being class-tested at 18 universities. The bug was uncovered only when
someone tried to build the computer in that book. The fact that the Verilog to
describe a pipeline like that in the Intel Core i7 will be many thousands of lines is
an indication of the complexity. Beware!

Fallacy: Pipelining ideas can be implemented independent of technology.
When the number of transistors on-chip and the speed of transistors made a
five-stage pipeline the best solution, the delayed branch (see the Elaboration
on page 255) was a simple solution to control hazards. With longer pipelines,
superscalar execution, and dynamic branch prediction, it is now redundant. In
the early 1990s, dynamic pipeline scheduling took too many resources and was
not required for high performance, but as transistor budgets continued to double
due to Moore’s Law and logic became much faster than memory, multiple
functional units and dynamic pipelining made more sense. Today, concerns about
power are leading to less aggressive designs.

Pitfall: Failure to consider instruction set design can adversely impact pipelining.
Many of the difficulties of pipelining arise because of instruction set complications.
Here are some examples:

■ Widely variable instruction lengths and running times can lead to imbalance
among pipeline stages and severely complicate hazard detection in a design
pipelined at the instruction set level. This problem was overcome, initially
in the DEC VAX 8500 in the late 1980s, using the micro-operations and
micropipelined scheme that the Intel Core i7 employs today. Of course, the
overhead of translation and maintaining correspondence between the micro-
operations and the actual instructions remains.

■ Sophisticated addressing modes can lead to different sorts of problems.
Addressing modes that update registers complicate hazard detection. Other
addressing modes that require multiple memory accesses substantially
complicate pipeline control and make it difficult to keep the pipeline flowing
smoothly.

■ Perhaps the best example is the DEC Alpha and the DEC NVAX. In
comparable technology, the newer instruction set architecture of the Alpha
allowed an implementation whose performance is more than twice as fast
as NVAX. In another example, Bhandarkar and Clark [1991] compared the
MIPS M/2000 and the DEC VAX 8700 by counting clock cycles of the SPEC
benchmarks; they concluded that although the MIPS M/2000 executes more
instructions, the VAX on average executes 2.7 times as many clock cycles, so
the MIPS is faster.

4.15 Concluding Remarks

As we have seen in this chapter, both the datapath and control for a processor can be
designed starting with the instruction set architecture and an understanding of the
basic characteristics of the technology. In Section 4.3, we saw how the datapath for
a MIPS processor could be constructed based on the architecture and the decision
to build a single-cycle implementation. Of course, the underlying technology also
affects many design decisions by dictating what components can be used in the
datapath, as well as whether a single-cycle implementation even makes sense.

Pipelining improves throughput but not the inherent execution time, or
instruction latency, of instructions; for some instructions, the latency is similar
in length to the single-cycle approach. Multiple instruction issue adds additional
datapath hardware to allow multiple instructions to begin every clock cycle, but at
an increase in effective latency. Pipelining was presented as reducing the clock cycle
time of the simple single-cycle datapath. Multiple instruction issue, in comparison,
clearly focuses on reducing clock cycles per instruction (CPI).
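
The distinction between throughput and latency can be made concrete with a small sketch. The stage latencies below are hypothetical values chosen only for illustration (they are not figures from this chapter), the function names are our own, and the model assumes an ideal pipeline with no stalls.

```python
# Hypothetical stage latencies in picoseconds (illustrative assumptions only).
STAGES_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(stages):
    """Single-cycle clock period: one instruction traverses every stage."""
    return sum(stages.values())


def pipelined_cycle_time(stages):
    """Pipelined clock period: set by the slowest stage."""
    return max(stages.values())


def pipelined_latency(stages):
    """Time for one instruction to pass through every pipeline stage."""
    return len(stages) * pipelined_cycle_time(stages)


def total_time(n_instructions, stages, pipelined):
    """Execution time for n instructions, assuming no stalls."""
    if pipelined:
        # Fill the pipeline once, then complete one instruction per cycle.
        cycles = n_instructions + len(stages) - 1
        return cycles * pipelined_cycle_time(stages)
    return n_instructions * single_cycle_time(stages)


if __name__ == "__main__":
    print("single-cycle clock (ps):", single_cycle_time(STAGES_PS))
    print("pipelined clock (ps):   ", pipelined_cycle_time(STAGES_PS))
    print("pipelined latency (ps): ", pipelined_latency(STAGES_PS))
```

Under these assumed numbers the pipelined clock is a quarter of the single-cycle clock, yet one instruction takes longer to traverse the pipeline than it does in the single-cycle design: throughput improves while instruction latency does not.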

Pipelining and multiple issue both attempt to exploit instruction-level
parallelism. The presence of data and control dependences, which can become
hazards, is the primary limitation on how much parallelism can be exploited.
Scheduling and speculation via prediction, both in hardware and in software, are
the primary techniques used to reduce the performance impact of dependences.

We showed that unrolling the DGEMM loop four times exposed more
instructions that could take advantage of the out-of-order execution engine of the
Core i7 to more than double performance.

The switch to longer pipelines, multiple instruction issue, and dynamic
scheduling in the mid-1990s has helped sustain the 60% per year processor
performance increase that started in the early 1980s. As mentioned in Chapter
1, these microprocessors preserved the sequential programming model, but
they eventually ran into the power wall. Thus, the industry has been forced to
switch to multiprocessors, which exploit parallelism at much coarser levels (the
subject of Chapter 6). This trend has also caused designers to reassess the energy-
performance implications of some of the inventions since the mid-1990s, resulting
in a simplification of pipelines in the more recent versions of microarchitectures.

To sustain the advances in processing performance via parallel processors,
Amdahl’s law suggests that another part of the system will become the bottleneck.
That bottleneck is the topic of the next chapter: the memory hierarchy.
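
As a reminder of why the unimproved part of a system ends up dominating, here is a minimal sketch of Amdahl's law. The function name and the example fractions are assumptions made for illustration; they are not measurements from this book.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Overall speedup when only parallel_fraction of the work is sped up n-fold."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the time benefits from more processors.
    for n in (2, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

However many processors are added in this hypothetical case, the speedup stays below 10, because the remaining 10% of the time is untouched.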

instruction latency: The inherent execution time for an instruction.

Nine-tenths of wisdom
consists of being wise
in time.
American proverb


4.16 Historical Perspective and Further Reading

This section, which appears online, discusses the history of the first pipelined
processors, the earliest superscalars, and the development of out-of-order and
speculative techniques, as well as important developments in the accompanying
compiler technology.

4.17 Exercises

4.1 Consider the following instruction:
Instruction: AND Rd,Rs,Rt

Interpretation: Reg[Rd] = Reg[Rs] AND Reg[Rt]

4.1.1 [5] <§4.1> What are the values of control signals generated by the control in
Figure 4.2 for the above instruction?

4.1.2 [5] <§4.1> Which resources (blocks) perform a useful function for this
instruction?

4.1.3 [10] <§4.1> Which resources (blocks) produce outputs, but their outputs
are not used for this instruction? Which resources produce no outputs for this
instruction?

4.2 The basic single-cycle MIPS implementation in Figure 4.2 can only implement
some instructions. New instructions can be added to an existing Instruction Set
Architecture (ISA), but the decision whether or not to do that depends, among
other things, on the cost and complexity the proposed addition introduces into the
processor datapath and control. The first three problems in this exercise refer to the
new instruction:

Instruction: LWI Rt,Rd(Rs)

Interpretation: Reg[Rt] = Mem[Reg[Rd]+Reg[Rs]]

4.2.1 [10] <§4.1> Which existing blocks (if any) can be used for this instruction?

4.2.2 [10] <§4.1> Which new functional blocks (if any) do we need for this
instruction?

4.2.3 [10] <§4.1> What new signals do we need (if any) from the control unit to
support this instruction?


4.3 When processor designers consider a possible improvement to the processor
datapath, the decision usually depends on the cost/performance trade-off. In
the following three problems, assume that we are starting with a datapath from
Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem, and Control blocks have
latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, and 100 ps, respectively,
and costs of 1000, 30, 10, 100, 200, 2000, and 500, respectively.

Consider the addition of a multiplier to the ALU. This addition will add 300 ps to the
latency of the ALU and will add a cost of 600 to the ALU. The result will be 5% fewer
instructions executed since we will no longer need to emulate the MUL instruction.

4.3.1 [10] <§4.1> What is the clock cycle time with and without this improvement?

4.3.2 [10] <§4.1> What is the speedup achieved by adding this improvement?

4.3.3 [10] <§4.1> Compare the cost/performance ratio with and without this
improvement.

4.4 Problems in this exercise assume that logic blocks needed to implement a
processor’s datapath have the following latencies:

I-Mem Add Mux ALU Regs D-Mem Sign-Extend Shift-Left-2

200ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps

4.4.1 [10] <§4.3> If the only thing we need to do in a processor is fetch consecutive
instructions (Figure 4.6), what would the cycle time be?

4.4.2 [10] <§4.3> Consider a datapath similar to the one in Figure 4.11, but for a
processor that only has one type of instruction: unconditional PC-relative branch.
What would the cycle time be for this datapath?

4.4.3 [10] <§4.3> Repeat 4.4.2, but this time we need to support only conditional
PC-relative branches.

The remaining three problems in this exercise refer to the datapath element
Shift-left-2:

4.4.4 [10] <§4.3> Which kinds of instructions require this resource?

4.4.5 [20] <§4.3> For which kinds of instructions (if any) is this resource on the
critical path?

4.4.6 [10] <§4.3> Assuming that we only support beq and add instructions,
discuss how changes in the given latency of this resource affect the cycle time of the
processor. Assume that the latencies of other resources do not change.


4.5 For the problems in this exercise, assume that there are no pipeline stalls and
that the breakdown of executed instructions is as follows:

add addi not beq lw sw

20% 20% 0% 25% 25% 10%

4.5.1 [10] <§4.3> In what fraction of all cycles is the data memory used?

4.5.2 [10] <§4.3> In what fraction of all cycles is the input of the sign-extend
circuit needed? What is this circuit doing in cycles in which its input is not needed?

4.6 When silicon chips are fabricated, defects in materials (e.g., silicon) and
manufacturing errors can result in defective circuits. A very common defect is for
one wire to affect the signal in another. This is called a cross-talk fault. A special
class of cross-talk faults is when a signal is connected to a wire that has a constant
logical value (e.g., a power supply wire). In this case we have a stuck-at-0 or a
stuck-at-1 fault, and the affected signal always has a logical value of 0 or 1, respectively.
The following problems refer to bit 0 of the Write Register input on the register file
in Figure 4.24.
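
The effect of a stuck-at fault on a register-number wire can be modeled in a few lines. This is only an illustrative sketch: the function names are our own, and the model assumes the fault simply forces one bit of the 5-bit register number to a constant.

```python
REG_NUMBER_BITS = 5  # MIPS register numbers are 5 bits wide


def apply_stuck_at(value, bit, stuck_value):
    """Return value as seen on a bus whose given bit is stuck at 0 or 1."""
    if stuck_value not in (0, 1):
        raise ValueError("stuck_value must be 0 or 1")
    mask = 1 << bit
    return (value | mask) if stuck_value else (value & ~mask)


def affected_registers(bit, stuck_value):
    """Register numbers that the fault would change into a different number."""
    return [r for r in range(1 << REG_NUMBER_BITS)
            if apply_stuck_at(r, bit, stuck_value) != r]


if __name__ == "__main__":
    # With bit 0 stuck at 0, a write aimed at register 5 lands in register 4.
    print(apply_stuck_at(5, 0, 0))
    print(affected_registers(0, 1))
```

In this model, a stuck-at-0 fault on bit 0 redirects every odd register number to the even number below it, and a stuck-at-1 fault redirects every even number to the odd number above it.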

4.6.1 [10] <§§4.3, 4.4> Let us assume that processor testing is done by filling the
PC, registers, and data and instruction memories with some values (you can choose
which values), letting a single instruction execute, then reading the PC, memories,
and registers. These values are then examined to determine if a particular fault is
present. Can you design a test (values for PC, memories, and registers) that would
determine if there is a stuck-at-0 fault on this signal?

4.6.2 [10] <§§4.3, 4.4> Repeat 4.6.1 for a stuck-at-1 fault. Can you use a single
test for both stuck-at-0 and stuck-at-1? If yes, explain how; if no, explain why not.

4.6.3 [60] <§§4.3, 4.4> If we know that the processor has a stuck-at-1 fault on
this signal, is the processor still usable? To be usable, we must be able to convert
any program that executes on a normal MIPS processor into a program that works
on this processor. You can assume that there is enough free instruction memory
and data memory to let you make the program longer and store additional
data. Hint: the processor is usable if every instruction “broken” by this fault can
be replaced with a sequence of “working” instructions that achieve the same
effect.

4.6.4 [10] <§§4.3, 4.4> Repeat 4.6.1, but now the fault to test for is whether
the “MemRead” control signal becomes 0 if RegDst control signal is 0, no fault
otherwise.

4.6.5 [10] <§§4.3, 4.4> Repeat 4.6.4, but now the fault to test for is whether the
“Jump” control signal becomes 0 if RegDst control signal is 0, no fault otherwise.


4.7 In this exercise we examine in detail how an instruction is executed in a
single-cycle datapath. Problems in this exercise refer to a clock cycle in which the
processor fetches the following instruction word:

10101100011000100000000000010100.

Assume that data memory is all zeros and that the processor’s registers have the
following values at the beginning of the cycle in which the above instruction word
is fetched:

r0 r1 r2 r3 r4 r5 r6 r8 r12 r31

0 -1 2 -3 -4 10 6 8 2 -16
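
When working with a raw instruction word like the one above, it can help to split it into fields mechanically. The following sketch assumes the word uses the MIPS I-format layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate); the function name and the dictionary keys are our own choices.

```python
def decode_i_format(word_bits):
    """Split a 32-bit MIPS instruction word (as a bit string) into I-format fields."""
    if len(word_bits) != 32 or set(word_bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 32 binary digits")
    word = int(word_bits, 2)
    imm = word & 0xFFFF
    if imm & 0x8000:  # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": imm,
    }


if __name__ == "__main__":
    fields = decode_i_format("10101100011000100000000000010100")
    print(fields)
```

The opcode field then identifies the instruction, and the rs, rt, and immediate fields select the register values and constant used in the problems that follow.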

4.7.1 [5] <§4.4> What are the outputs of the sign-extend and the jump “Shift left
2” unit (near the top of Figure 4.24) for this instruction word?

4.7.2 [10] <§4.4> What are the values of the ALU control unit’s inputs for this
instruction?

4.7.3 [10] <§4.4> What is the new PC address after this instruction is executed?
Highlight the path through which this value is determined.

4.7.4 [10] <§4.4> For each Mux, show the values of its data output during the
execution of this instruction and these register values.

4.7.5 [10] <§4.4> For the ALU and the two add units, what are their data input
values?

4.7.6 [10] <§4.4> What are the values of all inputs for the “Registers” unit?

4.8 In this exercise, we examine how pipelining affects the clock cycle time of the
processor. Problems in this exercise assume that individual stages of the datapath
have the following latencies:

IF ID EX MEM WB

250ps 350ps 150ps 300ps 200ps

Also, assume that instructions executed by the processor are broken down as
follows:

alu beq lw sw

45% 20% 20% 15%

4.8.1 [5] <§4.5> What is the clock cycle time in a pipelined and non-pipelined
processor?

4.8.2 [10] <§4.5> What is the total latency of an LW instruction in a pipelined
and non-pipelined processor?


4.8.3 [10] <§4.5> If we can split one stage of the pipelined datapath into two new
stages, each with half the latency of the original stage, which stage would you split
and what is the new clock cycle time of the processor?

4.8.4 [10] <§4.5> Assuming there are no stalls or hazards, what is the utilization
of the data memory?

4.8.5 [10] <§4.5> Assuming there are no stalls or hazards, what is the utilization
of the write-register port of the “Registers” unit?

4.8.6 [30] <§4.5> Instead of a single-cycle organization, we can use a multi-cycle
organization where each instruction takes multiple cycles but one instruction
finishes before another is fetched. In this organization, an instruction only goes
through stages it actually needs (e.g., SW only takes 4 cycles because it does not
need the WB stage). Compare clock cycle times and execution times with single-
cycle, multi-cycle, and pipelined organization.

4.9 In this exercise, we examine how data dependences affect execution in the
basic 5-stage pipeline described in Section 4.5. Problems in this exercise refer to the
following sequence of instructions:

or r1,r2,r3
or r2,r1,r4
or r1,r1,r2

Also, assume the following cycle times for each of the options related to forwarding:

Without Forwarding   With Full Forwarding   With ALU-ALU Forwarding Only

250ps                300ps                  290ps

4.9.1 [10] <§4.5> Indicate dependences and their type.

4.9.2 [10] <§4.5> Assume there is no forwarding in this pipelined processor.
Indicate hazards and add nop instructions to eliminate them.

4.9.3 [10] <§4.5> Assume there is full forwarding. Indicate hazards and add NOP
instructions to eliminate them.

4.9.4 [10] <§4.5> What is the total execution time of this instruction sequence
without forwarding and with full forwarding? What is the speedup achieved by
adding full forwarding to a pipeline that had no forwarding?

4.9.5 [10] <§4.5> Add nop instructions to this code to eliminate hazards if there
is ALU-ALU forwarding only (no forwarding from the MEM to the EX stage).

4.9.6 [10] <§4.5> What is the total execution time of this instruction sequence
with only ALU-ALU forwarding? What is the speedup over a no-forwarding
pipeline?


4.10 In this exercise, we examine how resource hazards, control hazards, and
Instruction Set Architecture (ISA) design can affect pipelined execution. Problems
in this exercise refer to the following fragment of MIPS code:

sw r16,12(r6)
lw r16,8(r6)
beq r5,r4,Label # Assume r5!=r4
add r5,r1,r4
slt r5,r15,r4

Assume that individual pipeline stages have the following latencies:

IF ID EX MEM WB

200ps 120ps 150ps 190ps 100ps

4.10.1 [10] <§4.5> For this problem, assume that all branches are perfectly
predicted (this eliminates all control hazards) and that no delay slots are used. If we
only have one memory (for both instructions and data), there is a structural hazard
every time we need to fetch an instruction in the same cycle in which another
instruction accesses data. To guarantee forward progress, this hazard must always
be resolved in favor of the instruction that accesses data. What is the total execution
time of this instruction sequence in the 5-stage pipeline that only has one memory?
We have seen that data hazards can be eliminated by adding nops to the code. Can
you do the same with this structural hazard? Why?

4.10.2 [20] <§4.5> For this problem, assume that all branches are perfectly
predicted (this eliminates all control hazards) and that no delay slots are used.
If we change load/store instructions to use a register (without an offset) as the
address, these instructions no longer need to use the ALU. As a result, MEM and
EX stages can be overlapped and the pipeline has only 4 stages. Change this code to
accommodate this changed ISA. Assuming this change does not affect clock cycle
time, what speedup is achieved in this instruction sequence?

4.10.3 [10] <§4.5> Assuming stall-on-branch and no delay slots, what speedup is
achieved on this code if branch outcomes are determined in the ID stage, relative to
the execution where branch outcomes are determined in the EX stage?

4.10.4 [10] <§4.5> Given these pipeline stage latencies, repeat the speedup
calculation from 4.10.2, but take into account the (possible) change in clock cycle
time. When EX and MEM are done in a single stage, most of their work can be
done in parallel. As a result, the resulting EX/MEM stage has a latency that is the
larger of the original two, plus 20 ps needed for the work that could not be done
in parallel.

4.10.5 [10] <§4.5> Given these pipeline stage latencies, repeat the speedup
calculation from 4.10.3, taking into account the (possible) change in clock cycle
time. Assume that the latency of the ID stage increases by 50% and the latency of the EX
stage decreases by 10ps when branch outcome resolution is moved from EX to ID.


4.10.6 [10] <§4.5> Assuming stall-on-branch and no delay slots, what is the new
clock cycle time and execution time of this instruction sequence if beq address
computation is moved to the MEM stage? What is the speedup from this change?
Assume that the latency of the EX stage is reduced by 20 ps and the latency of the
MEM stage is unchanged when branch outcome resolution is moved from EX to
MEM.

4.11 Consider the following loop.
loop: lw r1,0(r1)
and r1,r1,r2
lw r1,0(r1)
lw r1,0(r1)
beq r1,r0,loop

Assume that perfect branch prediction is used (no stalls due to control hazards),
that there are no delay slots, and that the pipeline has full forwarding support. Also
assume that many iterations of this loop are executed before the loop exits.

4.11.1 [10] <§4.6> Show a pipeline execution diagram for the third iteration of
this loop, from the cycle in which we fetch the first instruction of that iteration up
to (but not including) the cycle in which we can fetch the first instruction of the
next iteration. Show all instructions that are in the pipeline during these cycles (not
just those from the third iteration).

4.11.2 [10] <§4.6> How often (as a percentage of all cycles) do we have a cycle in
which all five pipeline stages are doing useful work?

4.12 This exercise is intended to help you understand the cost/complexity/
performance trade-offs of forwarding in a pipelined processor. Problems in this
exercise refer to pipelined datapaths from Figure 4.45. These problems assume
that, of all the instructions executed in a processor, the following fraction of these
instructions have a particular type of RAW data dependence. The type of RAW
data dependence is identified by the stage that produces the result (EX or MEM)
and the instruction that consumes the result (1st instruction that follows the one
that produces the result, 2nd instruction that follows, or both). We assume that the
register write is done in the first half of the clock cycle and that register reads are
done in the second half of the cycle, so “EX to 3rd” and “MEM to 3rd” dependences
are not counted because they cannot result in data hazards. Also, assume that the
CPI of the processor is 1 if there are no data hazards.

EX to 1st Only   MEM to 1st Only   EX to 2nd Only   MEM to 2nd Only   EX to 1st and MEM to 2nd   Other RAW Dependences

5%               20%               5%               10%               10%                        10%


Assume the following latencies for individual pipeline stages. For the EX stage,
latencies are given separately for a processor without forwarding and for a processor
with different kinds of forwarding.

IF       ID       EX (no FW)   EX (full FW)   EX (FW from EX/MEM only)   EX (FW from MEM/WB only)   MEM      WB

150 ps   100 ps   120 ps       150 ps         140 ps                     130 ps                     120 ps   100 ps

4.12.1 [10] <§4.7> If we use no forwarding, what fraction of cycles are we stalling
due to data hazards?

4.12.2 [5] <§4.7> If we use full forwarding (forward all results that can be
forwarded), what fraction of cycles are we stalling due to data hazards?

4.12.3 [10] <§4.7> Let us assume that we cannot afford to have three-input Muxes
that are needed for full forwarding. We have to decide if it is better to forward
only from the EX/MEM pipeline register (next-cycle forwarding) or only from
the MEM/WB pipeline register (two-cycle forwarding). Which of the two options
results in fewer data stall cycles?

4.12.4 [10] <§4.7> For the given hazard probabilities and pipeline stage latencies,
what is the speedup achieved by adding full forwarding to a pipeline that had no
forwarding?

4.12.5 [10] <§4.7> What would be the additional speedup (relative to a processor
with forwarding) if we added time-travel forwarding that eliminates all data
hazards? Assume that the yet-to-be-invented time-travel circuitry adds 100 ps to
the latency of the full-forwarding EX stage.

4.12.6 [20] <§4.7> Repeat 4.12.3 but this time determine which of the two
options results in shorter time per instruction.

4.13 This exercise is intended to help you understand the relationship between
forwarding, hazard detection, and ISA design. Problems in this exercise refer to
the following sequence of instructions, and assume that it is executed on a 5-stage
pipelined datapath:

add r5,r2,r1
lw r3,4(r5)
lw r2,0(r2)
or r3,r5,r3
sw r3,0(r5)

4.13.1 [5] <§4.7> If there is no forwarding or hazard detection, insert nops to
ensure correct execution.


4.13.2 [10] <§4.7> Repeat 4.13.1 but now use nops only when a hazard cannot be
avoided by changing or rearranging these instructions. You can assume register R7
can be used to hold temporary values in your modified code.

4.13.3 [10] <§4.7> If the processor has forwarding, but we forgot to implement
the hazard detection unit, what happens when this code executes?

4.13.4 [20] <§4.7> If there is forwarding, for the first five cycles during the
execution of this code, specify which signals are asserted in each cycle by hazard
detection and forwarding units in Figure 4.60.

4.13.5 [10] <§4.7> If there is no forwarding, what new inputs and output signals
do we need for the hazard detection unit in Figure 4.60? Using this instruction
sequence as an example, explain why each signal is needed.

4.13.6 [20] <§4.7> For the new hazard detection unit from 4.13.5, specify which
output signals it asserts in each of the first five cycles during the execution of this
code.

4.14 This exercise is intended to help you understand the relationship between
delay slots, control hazards, and branch execution in a pipelined processor. In
this exercise, we assume that the following MIPS code is executed on a pipelined
processor with a 5-stage pipeline, full forwarding, and a predict-taken branch
predictor:

lw r2,0(r1)
label1: beq r2,r0,label2 # not taken once, then taken
lw r3,0(r2)
beq r3,r0,label1 # taken
add r1,r3,r1
label2: sw r1,0(r2)

4.14.1 [10] <§4.8> Draw the pipeline execution diagram for this code, assuming
there are no delay slots and that branches execute in the EX stage.

4.14.2 [10] <§4.8> Repeat 4.14.1, but assume that delay slots are used. In the
given code, the instruction that follows the branch is now the delay slot instruction
for that branch.

4.14.3 [20] <§4.8> One way to move the branch resolution one stage earlier is to
not need an ALU operation in conditional branches. The branch instructions would
be “bez rd,label” and “bnez rd,label”, which would branch if the register has
or does not have a zero value, respectively. Change this code to use these branch
instructions instead of beq. You can assume that register R8 is available for you
to use as a temporary register, and that an seq (set if equal) R-type instruction can
be used.


Section 4.8 describes how the severity of control hazards can be reduced by moving
branch execution into the ID stage. This approach involves a dedicated comparator
in the ID stage, as shown in Figure 4.62. However, this approach potentially adds
to the latency of the ID stage, and requires additional forwarding logic and hazard
detection.

4.14.4 [10] <§4.8> Using the first branch instruction in the given code as an
example, describe the hazard detection logic needed to support branch execution
in the ID stage as in Figure 4.62. Which type of hazard is this new logic supposed
to detect?

4.14.5 [10] <§4.8> For the given code, what is the speedup achieved by moving
branch execution into the ID stage? Explain your answer. In your speedup
calculation, assume that the additional comparison in the ID stage does not affect
clock cycle time.

4.14.6 [10] <§4.8> Using the first branch instruction in the given code as an
example, describe the forwarding support that must be added to support branch
execution in the ID stage. Compare the complexity of this new forwarding unit to
the complexity of the existing forwarding unit in Figure 4.62.

4.15 The importance of having a good branch predictor depends on how often
conditional branches are executed. Together with branch predictor accuracy, this
will determine how much time is spent stalling due to mispredicted branches. In
this exercise, assume that the breakdown of dynamic instructions into various
instruction categories is as follows:

R-Type BEQ JMP LW SW

40% 25% 5% 25% 5%

Also, assume the following branch predictor accuracies:

Always-Taken Always-Not-Taken 2-Bit

45% 55% 85%
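
The relationship between branch frequency, predictor accuracy, and CPI can be written as a one-line formula. The sketch below is an assumption-laden illustration: `penalty_cycles` depends on the pipeline stage in which branch outcomes are resolved, which you must determine for the design in question, and the numbers in the usage example are hypothetical rather than taken from the tables above.

```python
def extra_cpi(branch_fraction, accuracy, penalty_cycles):
    """Added CPI from mispredictions: branch frequency x miss rate x penalty."""
    if not (0.0 <= branch_fraction <= 1.0 and 0.0 <= accuracy <= 1.0):
        raise ValueError("fractions must be between 0 and 1")
    return branch_fraction * (1.0 - accuracy) * penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 90% accuracy, 3-cycle penalty.
    print(round(extra_cpi(0.20, 0.90, 3), 3))
```

The same function can be reused for each predictor by changing only the accuracy argument.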

4.15.1 [10] <§4.8> Stall cycles due to mispredicted branches increase the
CPI. What is the extra CPI due to mispredicted branches with the always-taken
predictor? Assume that branch outcomes are determined in the EX stage, that there
are no data hazards, and that no delay slots are used.

4.15.2 [10] <§4.8> Repeat 4.15.1 for the “always-not-taken” predictor.

4.15.3 [10] <§4.8> Repeat 4.15.1 for the 2-bit predictor.

4.15.4 [10] <§4.8> With the 2-bit predictor, what speedup would be achieved if
we could convert half of the branch instructions in a way that replaces a branch
instruction with an ALU instruction? Assume that correctly and incorrectly
predicted instructions have the same chance of being replaced.


4.15.5 [10] <§4.8> With the 2-bit predictor, what speedup would be achieved if
we could convert half of the branch instructions in a way that replaced each branch
instruction with two ALU instructions? Assume that correctly and incorrectly
predicted instructions have the same chance of being replaced.

4.15.6 [10] <§4.8> Some branch instructions are much more predictable than
others. If we know that 80% of all executed branch instructions are easy-to-predict
loop-back branches that are always predicted correctly, what is the accuracy of the
2-bit predictor on the remaining 20% of the branch instructions?

4.16 This exercise examines the accuracy of various branch predictors for the
following repeating pattern (e.g., in a loop) of branch outcomes: T, NT, T, T, NT
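
A small simulator is a convenient way to check hand-worked predictor traces. The sketch below models a conventional 2-bit saturating counter; the state numbering (0 and 1 predict not taken, 2 and 3 predict taken) is our own assumption and may not correspond one-to-one to the state layout drawn in Figure 4.63, so choose `start_state` accordingly.

```python
def simulate_two_bit(outcomes, start_state=1):
    """Simulate a 2-bit saturating counter; return a list of per-branch hits.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    outcomes is an iterable of booleans (True = taken).
    """
    state = start_state
    hits = []
    for taken in outcomes:
        predicted_taken = state >= 2
        hits.append(predicted_taken == taken)
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return hits


def accuracy(hits):
    """Fraction of correct predictions."""
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Hypothetical trace: a branch that is always taken, starting from state 0.
    trace = simulate_two_bit([True] * 4, start_state=0)
    print(trace, accuracy(trace))
```

To study a repeating pattern, pass several repetitions of it and compare the accuracy over the later repetitions with the accuracy over the first.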

4.16.1 [5] <§4.8> What is the accuracy of always-taken and always-not-taken
predictors for this sequence of branch outcomes?

4.16.2 [5] <§4.8> What is the accuracy of the two-bit predictor for the first 4
branches in this pattern, assuming that the predictor starts off in the bottom left
state from Figure 4.63 (predict not taken)?

4.16.3 [10] <§4.8> What is the accuracy of the two-bit predictor if this pattern is
repeated forever?

4.16.4 [30] <§4.8> Design a predictor that would achieve a perfect accuracy if
this pattern is repeated forever. Your predictor should be a sequential circuit with
one output that provides a prediction (1 for taken, 0 for not taken) and no inputs
other than the clock and the control signal that indicates that the instruction is a
conditional branch.

4.16.5 [10] <§4.8> What is the accuracy of your predictor from 4.16.4 if it is
given a repeating pattern that is the exact opposite of this one?

4.16.6 [20] <§4.8> Repeat 4.16.4, but now your predictor should be able to
eventually (after a warm-up period during which it can make wrong predictions)
start perfectly predicting both this pattern and its opposite. Your predictor should
have an input that tells it what the real outcome was. Hint: this input lets your
predictor determine which of the two repeating patterns it is given.

4.17 This exercise explores how exception handling affects pipeline design. The
first three problems in this exercise refer to the following two instructions:

Instruction 1 Instruction 2

BNE R1, R2, Label LW R1, 0(R1)

4.17.1 [5] <§4.9> Which exceptions can each of these instructions trigger? For
each of these exceptions, specify the pipeline stage in which it is detected.


4.17.2 [10] <§4.9> If there is a separate handler address for each exception, show
how the pipeline organization must be changed to be able to handle this exception.
You can assume that the addresses of these handlers are known when the processor
is designed.

4.17.3 [10] <§4.9> If the second instruction is fetched right after the first
instruction, describe what happens in the pipeline when the first instruction causes
the first exception you listed in 4.17.1. Show the pipeline execution diagram from
the time the first instruction is fetched until the time the first instruction of the
exception handler is completed.

4.17.4 [20] <§4.9> In vectored exception handling, the table of exception handler
addresses is in data memory at a known (fixed) address. Change the pipeline to
implement this exception handling mechanism. Repeat 4.17.3 using this modified
pipeline and vectored exception handling.

4.17.5 [15] <§4.9> We want to emulate vectored exception handling (described
in 4.17.4) on a machine that has only one fixed handler address. Write the code
that should be at that fixed address. Hint: this code should identify the exception,
get the right address from the exception vector table, and transfer execution to that
handler.

4.18 In this exercise we compare the performance of 1-issue and 2-issue
processors, taking into account program transformations that can be made to
optimize for 2-issue execution. Problems in this exercise refer to the following loop
(written in C):

for(i=0;i!=j;i+=2)
    b[i]=a[i]-a[i+1];

When writing MIPS code, assume that variables are kept in registers as follows, and
that all registers except those indicated as Free are used to keep various variables,
so they cannot be used for anything else.

i j a b c Free

R5 R6 R1 R2 R3 R10, R11, R12

4.18.1 [10] <§4.10> Translate this C code into MIPS instructions. Your translation
should be direct, without rearranging instructions to achieve better performance.

4.18.2 [10] <§4.10> If the loop exits after executing only two iterations, draw a
pipeline diagram for your MIPS code from 4.18.1 executed on a 2-issue processor
shown in Figure 4.69. Assume the processor has perfect branch prediction and can
fetch any two instructions (not just consecutive instructions) in the same cycle.

4.18.3 [10] <§4.10> Rearrange your code from 4.18.1 to achieve better
performance on a 2-issue statically scheduled processor from Figure 4.69.


4.18.4 [10] <§4.10> Repeat 4.18.2, but this time use your MIPS code from 4.18.3.

4.18.5 [10] <§4.10> What is the speedup of going from a 1-issue processor to
a 2-issue processor from Figure 4.69? Use your code from 4.18.1 for both 1-issue
and 2-issue, and assume that 1,000,000 iterations of the loop are executed. As in
4.18.2, assume that the processor has perfect branch predictions, and that a 2-issue
processor can fetch any two instructions in the same cycle.

4.18.6 [10] <§4.10> Repeat 4.18.5, but this time assume that in the 2-issue
processor one of the instructions to be executed in a cycle can be of any kind, and
the other must be a non-memory instruction.

4.19 This exercise explores energy efficiency and its relationship with performance.
Problems in this exercise assume the following energy consumption for activity in
Instruction memory, Registers, and Data memory. You can assume that the other
components of the datapath spend a negligible amount of energy.

I-Mem    1 Register Read    Register Write    D-Mem Read    D-Mem Write
140pJ    70pJ               60pJ              140pJ         120pJ

Assume that components in the datapath have the following latencies. You can
assume that the other components of the datapath have negligible latencies.

I-Mem    Control    Register Read or Write    ALU     D-Mem Read or Write
200ps    150ps      90ps                      90ps    250ps

4.19.1 [10] <§§4.3, 4.6, 4.14> How much energy is spent to execute an ADD
instruction in a single-cycle design and in the 5-stage pipelined design?

4.19.2 [10] <§§4.6, 4.14> What is the worst-case MIPS instruction in terms of
energy consumption, and what is the energy spent to execute it?

4.19.3 [10] <§§4.6, 4.14> If energy reduction is paramount, how would you
change the pipelined design? What is the percentage reduction in the energy spent
by an LW instruction after this change?

4.19.4 [10] <§§4.6, 4.14> What is the performance impact of your changes from
4.19.3?

4.19.5 [10] <§§4.6, 4.14> We can eliminate the MemRead control signal and have
the data memory be read in every cycle, i.e., we can permanently have MemRead=1.
Explain why the processor still functions correctly after this change. What is the
effect of this change on clock frequency and energy consumption?

4.19.6 [10] <§§4.6, 4.14> If an idle unit spends 10% of the power it would spend
if it were active, what is the energy spent by the instruction memory in each cycle?
What percentage of the overall energy spent by the instruction memory does this
idle energy represent?


Answers to Check Yourself

§4.1, page 248: 3 of 5: Control, Datapath, Memory. Input and Output are missing.
§4.2, page 251: false. Edge-triggered state elements make simultaneous reading and
writing both possible and unambiguous.
§4.3, page 257: I. a. II. c.
§4.4, page 272: Yes, Branch and ALUOp0 are identical. In addition, MemtoReg and
RegDst are inverses of one another. You don’t need an inverter; simply use the other
signal and flip the order of the inputs to the multiplexor!
§4.5, page 285: 1. Stall on the lw result. 2. Bypass the first add result written into
$t1. 3. No stall or bypass required.
§4.6, page 298: Statements 2 and 4 are correct; the rest are incorrect.
§4.8, page 324: 1. Predict not taken. 2. Predict taken. 3. Dynamic prediction.
§4.9, page 332: The first instruction, since it is logically executed before the others.
§4.10, page 344: 1. Both. 2. Both. 3. Software. 4. Hardware. 5. Hardware. 6.
Hardware. 7. Both. 8. Hardware. 9. Both.
§4.11, page 353: First two are false and the last two are true.


5
Ideally one would desire an indefinitely large memory capacity such that any
particular … word would be immediately available. … We are … forced to recognize
the possibility of constructing a hierarchy of memories, each of which has greater
capacity than the preceding but which is less quickly accessible.
A. W. Burks, H. H. Goldstine, and J. von Neumann
Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, 1946

Large and Fast: Exploiting Memory Hierarchy

5.1 Introduction 374
5.2 Memory Technologies 378
5.3 The Basics of Caches 383
5.4 Measuring and Improving Cache Performance 398
5.5 Dependable Memory Hierarchy 418
5.6 Virtual Machines 424
5.7 Virtual Memory 427
5.8 A Common Framework for Memory Hierarchy 454
5.9 Using a Finite-State Machine to Control a Simple Cache 461
5.10 Parallelism and Memory Hierarchies: Cache Coherence 466
5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks 470
5.12 Advanced Material: Implementing Cache Controllers 470
5.13 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory

Computer Organization and Design. DOI: http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
© 2013 Elsevier Inc. All rights reserved.
The Five Classic Components of a Computer

374 Chapter 5 Large and Fast: Exploiting Memory Hierarchy

5.1 Introduction

From the earliest days of computing, programmers have wanted unlimited
amounts of fast memory. The topics in this chapter aid programmers by creating
that illusion. Before we look at creating the illusion, let’s consider a simple analogy
that illustrates the key principles and mechanisms that we use.

Suppose you were a student writing a term paper on important historical
developments in computer hardware. You are sitting at a desk in a library with
a collection of books that you have pulled from the shelves and are examining.
You find that several of the important computers that you need to write about are
described in the books you have, but there is nothing about the EDSAC. Therefore,
you go back to the shelves and look for an additional book. You find a book on
early British computers that covers the EDSAC. Once you have a good selection of
books on the desk in front of you, there is a good probability that many of the topics
you need can be found in them, and you may spend most of your time just using
the books on the desk without going back to the shelves. Having several books on
the desk in front of you saves time compared to having only one book there and
constantly having to go back to the shelves to return it and take out another.

The same principle allows us to create the illusion of a large memory that we
can access as fast as a very small memory. Just as you did not need to access all the
books in the library at once with equal probability, a program does not access all of
its code or data at once with equal probability. Otherwise, it would be impossible
to make most memory accesses fast and still have large memory in computers, just
as it would be impossible for you to fit all the library books on your desk and still
find what you wanted quickly.

This principle of locality underlies both the way in which you did your work in
the library and the way that programs operate. The principle of locality states that
programs access a relatively small portion of their address space at any instant of
time, just as you accessed a very small portion of the library’s collection. There are
two different types of locality:

■ Temporal locality (locality in time): if an item is referenced, it will tend to be
referenced again soon. If you recently brought a book to your desk to look at,
you will probably need to look at it again soon.

■ Spatial locality (locality in space): if an item is referenced, items whose
addresses are close by will tend to be referenced soon. For example, when
you brought out the book on early English computers to find out about the
EDSAC, you also noticed that there was another book shelved next to it about
early mechanical computers, so you also brought back that book and, later
on, found something useful in that book. Libraries put books on the same
topic together on the same shelves to increase spatial locality. We’ll see how
memory hierarchies use spatial locality a little later in this chapter.

temporal locality: The principle stating that if a data location is referenced
then it will tend to be referenced again soon.

spatial locality: The locality principle stating that if a data location is
referenced, data locations with nearby addresses will tend to be referenced soon.


Just as accesses to books on the desk naturally exhibit locality, locality in
programs arises from simple and natural program structures. For example,
most programs contain loops, so instructions and data are likely to be accessed
repeatedly, showing high amounts of temporal locality. Since instructions are
normally accessed sequentially, programs also show high spatial locality. Accesses
to data also exhibit a natural spatial locality. For example, sequential accesses to
elements of an array or a record will naturally have high degrees of spatial locality.
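A rough way to see the spatial locality of array accesses is to count how often two consecutive accesses fall in the same block of memory. The sketch below is only an illustration: the 64-byte block, the 8-byte elements, and the 64 × 64 array stored row by row are assumptions chosen for the example, not values from the text.

```python
# Toy model of spatial locality: how often does an access land in the
# same block as the access just before it?
BLOCK_BYTES = 64      # assumed block size
ELEM_BYTES = 8        # assumed element size
ROWS, COLS = 64, 64   # assumed array shape, stored row by row


def address(row, col):
    """Byte address of element (row, col) in a row-major array starting at 0."""
    return (row * COLS + col) * ELEM_BYTES


def same_block_fraction(addresses):
    """Fraction of accesses that fall in the same block as the previous one."""
    blocks = [a // BLOCK_BYTES for a in addresses]
    same = sum(1 for prev, cur in zip(blocks, blocks[1:]) if prev == cur)
    return same / (len(blocks) - 1)


# Sequential walk through the array (row by row) versus a strided walk
# (column by column, so consecutive accesses are a full row apart).
row_order = [address(r, c) for r in range(ROWS) for c in range(COLS)]
col_order = [address(r, c) for c in range(COLS) for r in range(ROWS)]

print("row by row:      ", same_block_fraction(row_order))
print("column by column:", same_block_fraction(col_order))
```

Under these assumptions, the sequential walk stays in the same block for most consecutive accesses, while the strided walk never does.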

We take advantage of the principle of locality by implementing the memory
of a computer as a memory hierarchy. A memory hierarchy consists of multiple
levels of memory with different speeds and sizes. The faster memories are more
expensive per bit than the slower memories and thus are smaller.

Figure 5.1 shows the faster memory is close to the processor and the slower,
less expensive memory is below it. The goal is to present the user with as much
memory as is available in the cheapest technology, while providing access at the
speed offered by the fastest memory.

The data is similarly hierarchical: a level closer to the processor is generally a
subset of any level further away, and all the data is stored at the lowest level. By
analogy, the books on your desk form a subset of the library you are working in,
which is in turn a subset of all the libraries on campus. Furthermore, as we move
away from the processor, the levels take progressively longer to access, just as we
might encounter in a hierarchy of campus libraries.

A memory hierarchy can consist of multiple levels, but data is copied between
only two adjacent levels at a time, so we can focus our attention on just two levels.

memory hierarchy: A structure that uses multiple levels of memories; as the
distance from the processor increases, the size of the memories and the access
time both increase.

[Figure 5.1 diagram: the processor sits above a stack of memory levels.]

Speed     Size      Cost ($/bit)   Current technology
Fastest   Smallest  Highest        SRAM
                                   DRAM
Slowest   Biggest   Lowest         Magnetic disk

FIGURE 5.1 The basic structure of a memory hierarchy. By implementing the memory system as
a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but can
be accessed as if it were all built from the fastest memory. Flash memory has replaced disks in many personal
mobile devices, and may lead to a new level in the storage hierarchy for desktop and server computers; see
Section 5.2.


The upper level (the one closer to the processor) is smaller and faster than the lower
level, since the upper level uses technology that is more expensive. Figure 5.2 shows
that the minimum unit of information that can be either present or not present in
the two-level hierarchy is called a block or a line; in our library analogy, a block of
information is one book.

If the data requested by the processor appears in some block in the upper level,
this is called a hit (analogous to your finding the information in one of the books
on your desk). If the data is not found in the upper level, the request is called a miss.
The lower level in the hierarchy is then accessed to retrieve the block containing the
requested data. (Continuing our analogy, you go from your desk to the shelves to
find the desired book.) The hit rate, or hit ratio, is the fraction of memory accesses
found in the upper level; it is often used as a measure of the performance of the
memory hierarchy. The miss rate (1 − hit rate) is the fraction of memory accesses
not found in the upper level.

Since performance is the major reason for having a memory hierarchy, the time
to service hits and misses is important. Hit time is the time to access the upper level
of the memory hierarchy, which includes the time needed to determine whether
the access is a hit or a miss (that is, the time needed to look through the books on
the desk). The miss penalty is the time to replace a block in the upper level with
the corresponding block from the lower level, plus the time to deliver this block to
the processor (or the time to get another book from the shelves and place it on the
desk). Because the upper level is smaller and built using faster memory parts, the
hit time will be much smaller than the time to access the next level in the hierarchy,
which is the major component of the miss penalty. (The time to examine the books
on the desk is much smaller than the time to get up and get a new book from the
shelves.)
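The terms just defined combine into a simple estimate of the average time per access for a two-level hierarchy: every access pays the hit time, and the fraction that misses also pays the miss penalty. The sketch below assumes this standard relationship; the function name and the cycle counts are illustrative choices, not measurements from the text.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average time per access: hit time plus miss rate times miss penalty.

    All times must use the same unit (for example, clock cycles).
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Assumed values: a 1-cycle hit time and a 100-cycle miss penalty.
for hit_rate in (0.90, 0.95, 0.99):
    miss_rate = 1.0 - hit_rate
    cycles = average_access_time(hit_time=1, miss_rate=miss_rate, miss_penalty=100)
    print(f"hit rate {hit_rate:.2f}: about {cycles:.1f} cycles per access")
```

Even with these made-up numbers, the pattern is the point: because the miss penalty is so much larger than the hit time, small changes in hit rate move the average access time a great deal.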

block (or line): The minimum unit of information that can be either present or
not present in a cache.

hit rate: The fraction of memory accesses found in a level of the memory
hierarchy.

miss rate: The fraction of memory accesses not found in a level of the memory
hierarchy.

hit time: The time required to access a level of the memory hierarchy,
including the time needed to determine whether the access is a hit or a miss.

miss penalty: The time required to fetch a block into a level of the memory
hierarchy from the lower level, including the time to access the block,
transmit it from one level to the other, insert it in the level that experienced
the miss, and then pass the block to the requestor.

[Figure 5.2 diagram; recoverable labels: Processor, Data is transferred.]

FIGURE 5.2 Every pair of levels in the memory hierarchy can be thought of as having an
upper and lower level. Within each level, the unit of information that is present or not is called a block or
a line. Usually we transfer an entire block when we copy something between levels.


As we will see in this chapter, the concepts used to build memory systems affect
many other aspects of a computer, including how the operating system manages
memory and I/O, how compilers generate code, and even how applications use
the computer. Of course, because all programs spend much of their time accessing
memory, the memory system is necessarily a major factor in determining
performance. The reliance on memory hierarchies to achieve performance
has meant that programmers, who used to be able to think of memory as a flat,
random access storage device, now need to understand that memory is a hierarchy
to get good performance. We show how important this understanding is in later
examples, such as Figure 5.18 on page 408, and Section 5.14, which shows how to
double matrix multiply performance.

Since memory systems are critical to performance, computer designers devote a
great deal of attention to these systems and develop sophisticated mechanisms for
improving the performance of the memory system. In this chapter, we discuss the
major conceptual ideas, although we use many simplifications and abstractions to
keep the material manageable in length and complexity.

The BIG Picture

Programs exhibit both temporal locality, the tendency to reuse recently
accessed data items, and spatial locality, the tendency to reference data
items that are close to other recently accessed items. Memory hierarchies
take advantage of temporal locality by keeping more recently accessed
data items closer to the processor. Memory hierarchies take advantage of
spatial locality by moving blocks consisting of multiple contiguous words
in memory to upper levels of the hierarchy.

Figure 5.3 shows that a memory hierarchy uses smaller and faster
memory technologies close to the processor. Thus, accesses that hit in the
highest level of the hierarchy can be processed quickly. Accesses that miss
go to lower levels of the hierarchy, which are larger but slower. If the hit
rate is high enough, the memory hierarchy has an effective access time
close to that of the highest (and fastest) level and a size equal to that of the
lowest (and largest) level.

In most systems, the memory is a true hierarchy, meaning that data
cannot be present in level i unless it is also present in level i + 1.

Check Yourself

Which of the following statements are generally true?

1. Memory hierarchies take advantage of temporal locality.

2. On a read, the value returned depends on which blocks are in the cache.

3. Most of the cost of the memory hierarchy is at the highest level.

4. Most of the capacity of the memory hierarchy is at the lowest level.


5.2 Memory Technologies

There are four primary technologies used today in memory hierarchies. Main
memory is implemented from DRAM (dynamic random access memory), while
levels closer to the processor (caches) use SRAM (static random access memory).
DRAM is less costly per bit than SRAM, although it is substantially slower. The
price difference arises because DRAM uses significantly less area per bit of memory,
and DRAMs thus have larger capacity for the same amount of silicon; the speed
difference arises from several factors described in Section B.9 of Appendix B.
The third technology is flash memory. This nonvolatile memory is the secondary
memory in Personal Mobile Devices. The fourth technology, used to implement
the largest and slowest level in the hierarchy in servers, is magnetic disk. The access
time and price per bit vary widely among these technologies, as the table below
shows, using typical values for 2012:

Memory technology             Typical access time        $ per GiB in 2012
SRAM semiconductor memory     0.5–2.5 ns                 $500–$1000
DRAM semiconductor memory     50–70 ns                   $10–$20
Flash semiconductor memory    5,000–50,000 ns            $0.75–$1.00
Magnetic disk                 5,000,000–20,000,000 ns    $0.05–$0.10

We describe each memory technology in the remainder of this section.

[Figure 5.3 diagram; recoverable labels: CPU; Level 1, Level 2, ..., Level n (levels in
the memory hierarchy); increasing distance from the CPU in access time; size of the
memory at each level.]

FIGURE 5.3 This diagram shows the structure of a memory hierarchy: as the distance
from the processor increases, so does the size. This structure, with the appropriate operating
mechanisms, allows the processor to have an access time that is determined primarily by level 1 of the
hierarchy and yet have a memory as large as level n. Maintaining this illusion is the subject of this chapter.
Although the local disk is normally the bottom of the hierarchy, some systems use tape or a file server over a
local area network as the next levels of the hierarchy.


SRAM Technology
SRAMs are simply integrated circuits that are memory arrays with (usually) a
single access port that can provide either a read or a write. SRAMs have a fixed
access time to any datum, though the read and write access times may differ.

SRAMs don’t need to refresh and so the access time is very close to the cycle
time. SRAMs typically use six to eight transistors per bit to prevent the information
from being disturbed when read. SRAM needs only minimal power to retain the
charge in standby mode.

In the past, most PCs and server systems used separate SRAM chips for either
their primary, secondary, or even tertiary caches. Today, thanks to Moore’s Law, all
levels of caches are integrated onto the processor chip, so the market for separate
SRAM chips has nearly evaporated.

DRAM Technology
In a SRAM, as long as power is applied, the value can be kept indefinitely. In a
dynamic RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor.
A single transistor is then used to access this stored charge, either to read the
value or to overwrite the charge stored there. Because DRAMs use only a single
transistor per bit of storage, they are much denser and cheaper per bit than SRAM.
As DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must
periodically be refreshed. That is why this memory structure is called dynamic, as
opposed to the static storage in an SRAM cell.

To refresh the cell, we merely read its contents and write it back. The charge
can be kept for several milliseconds. If every bit had to be read out of the DRAM
and then written back individually, we would constantly be refreshing the DRAM,
leaving no time for accessing it. Fortunately, DRAMs use a two-level decoding
structure, and this allows us to refresh an entire row (which shares a word line)
with a read cycle followed immediately by a write cycle.

Figure 5.4 shows the internal organization of a DRAM, and Figure 5.5 shows
how the density, cost, and access time of DRAMs have changed over the years.

The row organization that helps with refresh also helps with performance. To
improve performance, DRAMs buffer rows for repeated access. The buffer acts
like an SRAM; by changing the address, random bits can be accessed in the buffer
until the next row access. This capability improves the access time significantly,
since the access time to bits in the row is much lower. Making the chip wider also
improves the memory bandwidth of the chip. When the row is in the buffer, it
can be transferred by successive addresses at whatever the width of the DRAM is
(typically 4, 8, or 16 bits), or by specifying a block transfer and the starting address
within the buffer.

To further improve the interface to processors, DRAMs added clocks and are
properly called Synchronous DRAMs or SDRAMs. The advantage of SDRAMs
is that the use of a clock eliminates the time for the memory and processor to
synchronize. The speed advantage of synchronous DRAMs comes from the ability
to transfer the bits in the burst without having to specify additional address bits.
Instead, the clock transfers the successive bits in a burst. The fastest version is called
Double Data Rate (DDR) SDRAM. The name means data transfers on both the
rising and falling edge of the clock, thereby getting twice as much bandwidth as you
might expect based on the clock rate and the data width. The latest version of this
technology is called DDR4. A DDR4-3200 DRAM can do 3200 million transfers
per second, which means it has a 1600 MHz clock.

Sustaining that much bandwidth requires clever organization inside the DRAM.
Instead of just a faster row buffer, the DRAM can be internally organized to read or

FIGURE 5.5 DRAM size increased by multiples of four approximately once every three
years until 1996, and thereafter considerably slower. The improvements in access time have been
slower but continuous, and cost roughly tracks density improvements, although cost is often affected by other
issues, such as availability and demand. The cost per gibibyte is not adjusted for inflation.

Year         Chip size      $ per GiB     Total access time to    Average column access
introduced                                a new row/column        time to existing row
1980         64 Kibibit     $1,500,000    250 ns                  150 ns
1983         256 Kibibit    $500,000      185 ns                  100 ns
1985         1 Mebibit      $200,000      135 ns                  40 ns
1989         4 Mebibit      $50,000       110 ns                  40 ns
1992         16 Mebibit     $15,000       90 ns                   30 ns
1996         64 Mebibit     $10,000       60 ns                   12 ns
1998         128 Mebibit    $4,000        60 ns                   10 ns
2000         256 Mebibit    $1,000        55 ns                   7 ns
2004         512 Mebibit    $250          50 ns                   5 ns
2007         1 Gibibit      $50           45 ns                   1.25 ns
2010         2 Gibibit      $30           40 ns                   1 ns
2012         4 Gibibit      $1            35 ns                   0.8 ns

FIGURE 5.4 Internal organization of a DRAM. Modern DRAMs are organized in banks, typically
four for DDR3. Each bank consists of a series of rows. Sending a PRE (precharge) command opens or closes a
bank. A row address is sent with an Act (activate), which causes the row to transfer to a buffer. When the row
is in the buffer, it can be transferred by successive column addresses at whatever the width of the DRAM is
(typically 4, 8, or 16 bits in DDR3) or by specifying a block transfer and the starting address. Each command,
as well as block transfers, is synchronized with a clock.

[Figure 5.4 diagram; recoverable labels: Bank, Row, Column, Act, Pre, Rd/Wr.]


write from multiple banks, with each having its own row buffer. Sending an address
to several banks permits them all to read or write simultaneously. For example,
with four banks, there is just one access time and then accesses rotate between
the four banks to supply four times the bandwidth. This rotating access scheme is
called address interleaving.
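The rotation between banks can be pictured as a mapping from addresses to banks. The sketch below assumes one common arrangement, in which consecutive word addresses go to consecutive banks; the function names and the four-bank configuration are illustrative assumptions, and a real DRAM may choose the interleaving bits differently.

```python
NUM_BANKS = 4  # assumed number of banks, matching the four-bank example


def bank_of(word_address, num_banks=NUM_BANKS):
    """Bank that holds a word when consecutive words rotate across banks."""
    return word_address % num_banks


def index_within_bank(word_address, num_banks=NUM_BANKS):
    """Position of the word inside its bank."""
    return word_address // num_banks


# A sequential run of addresses visits the banks in rotation, so each bank
# has time to finish its own access before it is asked for another word.
for addr in range(8):
    print(f"address {addr}: bank {bank_of(addr)}, index {index_within_bank(addr)}")
```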

Although Personal Mobile Devices like the iPad (see Chapter 1) use individual
DRAMs, memory for servers is commonly sold on small boards called dual inline
memory modules (DIMMs). DIMMs typically contain 4–16 DRAMs, and they are
normally organized to be 8 bytes wide for server systems. A DIMM using DDR4-3200
SDRAMs could transfer at 8 × 3200 = 25,600 megabytes per second. Such
DIMMs are named after their bandwidth: PC25600. Since a DIMM can have so
many DRAM chips that only a portion of them are used for a particular transfer, we
need a term to refer to the subset of chips in a DIMM that share common address
lines. To avoid confusion with the internal DRAM names of row and banks, we use
the term memory rank for such a subset of chips in a DIMM.
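The DDR arithmetic in this section can be checked with a few lines of code. The sketch below uses hypothetical helper names; it only restates the two relationships given in the text, namely two transfers per clock cycle and an 8-byte-wide DIMM.

```python
def ddr_clock_mhz(mega_transfers_per_second):
    """Clock rate of a DDR device: data moves on both clock edges,
    so there are two transfers per clock cycle."""
    return mega_transfers_per_second / 2


def dimm_bandwidth_mb_per_s(mega_transfers_per_second, width_bytes=8):
    """Peak DIMM bandwidth in megabytes per second for a given width."""
    return mega_transfers_per_second * width_bytes


print(ddr_clock_mhz(3200))            # clock rate in MHz for DDR4-3200
print(dimm_bandwidth_mb_per_s(3200))  # peak MB/s for an 8-byte-wide DIMM
```

These are peak figures; sustained bandwidth depends on how well accesses keep the row buffers and banks busy.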

Elaboration: One way to measure the performance of the memory system behind the
caches is the Stream benchmark [McCalpin, 1995]. It measures the performance of
long vector operations. They have no temporal locality and they access arrays that are
larger than the cache of the computer being tested.

Flash Memory
Flash memory is a type of electrically erasable programmable read-only memory
(EEPROM).

Unlike disks and DRAM, but like other EEPROM technologies, writes can wear out
flash memory bits. To cope with such limits, most flash products include a controller
to spread the writes by remapping blocks that have been written many times to less
trodden blocks. This technique is called wear leveling. With wear leveling, personal
mobile devices are very unlikely to exceed the write limits in the flash. Such wear
leveling lowers the potential performance of flash, but it is needed unless higher-level
software monitors block wear. Flash controllers that perform wear leveling can
also improve yield by mapping out memory cells that were manufactured incorrectly.
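To make the remapping idea concrete, here is a deliberately simplified sketch of wear leveling. It is not how any particular flash controller works: the class name, the swap threshold, and the policy of swapping with the least-written block are all assumptions for illustration, and the model tracks only write counts, ignoring the extra writes needed to move data between blocks.

```python
class ToyWearLeveler:
    """Toy model: remap a heavily written block to the least-written one."""

    def __init__(self, num_blocks, threshold=4):
        self.mapping = list(range(num_blocks))  # logical block -> physical block
        self.writes = [0] * num_blocks          # write count per physical block
        self.threshold = threshold              # allowed gap before remapping

    def write(self, logical):
        """Record one write to a logical block; return the physical block used."""
        physical = self.mapping[logical]
        coldest = min(range(len(self.writes)), key=self.writes.__getitem__)
        if self.writes[physical] - self.writes[coldest] >= self.threshold:
            # Swap the two logical blocks' physical locations so that the
            # frequently written logical block now lands on the coldest block.
            other = self.mapping.index(coldest)
            self.mapping[logical], self.mapping[other] = coldest, physical
            physical = coldest
        self.writes[physical] += 1
        return physical


# Hammer a single logical block and see how the writes spread out.
leveler = ToyWearLeveler(num_blocks=4)
for _ in range(1000):
    leveler.write(0)
print(leveler.writes)
```

Without remapping, all 1000 writes would land on one physical block; in this toy model they end up spread almost evenly across the four blocks.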

Disk Memory
As Figure 5.6 shows, a magnetic hard disk consists of a collection of platters, which
rotate on a spindle at 5400 to 15,000 revolutions per minute. The metal platters are
covered with magnetic recording material on both sides, similar to the material found
on a cassette or videotape. To read and write information on a hard disk, a movable arm
containing a small electromagnetic coil called a read-write head is located just above
each surface. The entire drive is permanently sealed to control the environment inside
the drive, which, in turn, allows the disk heads to be much closer to the drive surface.

Each disk surface is divided into concentric circles, called tracks. There are
typically tens of thousands of tracks per surface. Each track is in turn divided into

track: One of thousands of concentric circles that makes up the surface of a
magnetic disk.


sectors that contain the information; each track may have thousands of sectors.
Sectors are typically 512 to 4096 bytes in size. The sequence recorded on the
magnetic media is a sector number, a gap, the information for that sector including
error correction code (see Section 5.5), a gap, the sector number of the next sector,
and so on.

The disk heads for each surface are connected together and move in conjunction,
so that every head is over the same track of every surface. The term cylinder is used
to refer to all the tracks under the heads at a given point on all surfaces.

FIGURE 5.6 A disk showing 10 disk platters and the read/write heads. The diameter of
today’s disks is 2.5 or 3.5 inches, and there are typically one or two platters per drive today.

To access data, the operating system must direct the disk through a three-stage
process. The first step is to position the head over the proper track. This operation is
called a seek, and the time to move the head to the desired track is called the seek time.

Disk manufacturers report minimum seek time, maximum seek time, and average
seek time in their manuals. The first two are easy to measure, but the average is open to
wide interpretation because it depends on the seek distance. The industry calculates
average seek time as the sum of the time for all possible seeks divided by the number
of possible seeks. Average seek times are usually advertised as 3 ms to 13 ms, but,
depending on the application and scheduling of disk requests, the actual average seek
time may be only 25% to 33% of the advertised number because of locality of disk

sector: One of the segments that make up a track on a magnetic disk;
a sector is the smallest amount of information that is read or written on
a disk.

seek: The process of positioning a read/write head over the proper
track on a disk.


references. This locality arises both because of successive accesses to the same file and
because the operating system tries to schedule such accesses together.

Once the head has reached the correct track, we must wait for the desired sector
to rotate under the read/write head. This time is called the rotational latency or
rotational delay. The average latency to the desired information is halfway around
the disk. Disks rotate at 5400 RPM to 15,000 RPM. The average rotational latency
at 5400 RPM is

$$
\text{Average rotational latency}
= \frac{0.5\ \text{rotation}}{5400\ \text{RPM}}
= \frac{0.5\ \text{rotation}}{5400\ \text{RPM} \Big/ \left(60\ \dfrac{\text{seconds}}{\text{minute}}\right)}
= 0.0056\ \text{seconds} = 5.6\ \text{ms}
$$

The last component of a disk access, transfer time, is the time to transfer a block
of bits. The transfer time is a function of the sector size, the rotation speed, and the
recording density of a track. Transfer rates in 2012 were between 100 and 200 MB/sec.
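Putting the three stages together gives a rough estimate of the time for one disk access. The sketch below adds seek time, average rotational latency, and transfer time; the helper names are hypothetical, the 6 ms seek time and 4096-byte sector are assumed example values within the ranges quoted in this section, 1 MB is taken as 10^6 bytes, and controller overhead and queuing are ignored.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half a rotation, in milliseconds."""
    rotations_per_second = rpm / 60
    return 0.5 / rotations_per_second * 1000


def transfer_time_ms(num_bytes, mb_per_second):
    """Time to transfer num_bytes at a given rate (1 MB taken as 10**6 bytes)."""
    return num_bytes / (mb_per_second * 10**6) * 1000


def disk_access_time_ms(seek_ms, rpm, num_bytes, mb_per_second):
    """Seek time + average rotational latency + transfer time."""
    return (seek_ms
            + rotational_latency_ms(rpm)
            + transfer_time_ms(num_bytes, mb_per_second))


# Assumed example: 6 ms average seek, 5400 RPM, one 4096-byte sector at 100 MB/sec.
print(round(rotational_latency_ms(5400), 1), "ms average rotational latency")
print(round(disk_access_time_ms(6.0, 5400, 4096, 100), 2), "ms total")
```

With these assumed values the mechanical delays (seek and rotation) dominate, and the transfer of a single sector contributes only a small fraction of the total.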

One complication is that most disk controllers have a built-in cache that stores
sectors as they are passed over; transfer rates from the cache are typically higher,
and were up to 750 MB/sec (6 Gbit/sec) in 2012.

Alas, where block numbers are located is no longer intuitive. The assumptions of
the sector-track-cylinder model above are that nearby blocks are on the same track,
blocks in the same cylinder take less time to access since there is no seek time,
and some tracks are closer than others. The reason for the change was the raising
of the level of the disk interfaces. To speed up sequential transfers, these higher-level
interfaces organize disks more like tapes than like random access devices.
The logical blocks are ordered in serpentine fashion across a single surface, trying
to capture all the sectors that are recorded at the same bit density to try to get best
performance. Hence, sequential blocks may be on different tracks.

In summary, the two primary differences between magnetic disks and
semiconductor memory technologies are that disks have a slower access time because
they are mechanical devices (flash is 1000 times as fast and DRAM is 100,000 times
as fast), yet they are cheaper per bit because they have very high storage capacity at a
modest cost (disk is 10 to 100 times cheaper). Magnetic disks are nonvolatile like flash,
but unlike flash there is no write wear-out problem. However, flash is much more
rugged and hence a better match to the jostling inherent in personal mobile devices.

5.3 The Basics of Caches

In our library example, the desk acted as a cache: a safe place to store things
(books) that we needed to examine. Cache was the name chosen to represent the
level of the memory hierarchy between the processor and main memory in the first
commercial computer to have this extra level. The memories in the datapath in
Chapter 4 are simply replaced by caches. Today, although this remains the dominant

rotational latency Also
called rotational delay.
Th e time required for
the desired sector of a
disk to rotate under the
read/write head; usually
assumed to be half the
rotation time.

Cache: a safe place
for hiding or storing
things.
Webster’s New World
Dictionary of the
American Language,
Th ird College Edition,
1988

384 Chapter 5 Large and Fast: Exploiting Memory Hierarchy

use of the word cache, the term is also used to refer to any storage managed to take
advantage of locality of access. Caches fi rst appeared in research computers in the
early 1960s and in production computers later in that same decade; every general-
purpose computer built today, from servers to low-power embedded processors,
includes caches.

In this section, we begin by looking at a very simple cache in which the processor
requests are each one word and the blocks also consist of a single word. (Readers
already familiar with cache basics may want to skip to Section 5.4.) Figure 5.7 shows
such a simple cache, before and after requesting a data item that is not initially in
the cache. Before the request, the cache contains a collection of recent references
X1, X2, …, Xn−1, and the processor requests a word Xn that is not in the cache. This
request results in a miss, and the word Xn is brought from memory into the cache.

In looking at the scenario in Figure 5.7, there are two questions to answer: How
do we know if a data item is in the cache? Moreover, if it is, how do we find it? The
answers are related. If each word can go in exactly one place in the cache, then it
is straightforward to find the word if it is in the cache. The simplest way to assign
a location in the cache for each word in memory is to assign the cache location
based on the address of the word in memory. This cache structure is called direct
mapped, since each memory location is mapped directly to exactly one location in
the cache. The typical mapping between addresses and cache locations for a
direct-mapped cache is usually simple. For example, almost all direct-mapped caches use
this mapping to find a block:

(Block address) modulo (Number of blocks in the cache)

If the number of entries in the cache is a power of 2, then modulo can be
computed simply by using the low-order $\log_2$(cache size in blocks) bits of the
address. Thus, an 8-block cache uses the three lowest bits ($8 = 2^3$) of the block
address. For example, Figure 5.8 shows how the memory addresses between 1ten
(00001two) and 29ten (11101two) map to locations 1ten (001two) and 5ten (101two) in a
direct-mapped cache of eight words.
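
As a minimal sketch of the mapping just described, the following Python fragment computes the cache entry both with modulo and with a bit mask, for the eight addresses marked in Figure 5.8. The function names are illustrative and not taken from the text.

```python
def cache_index(block_address: int, num_blocks: int) -> int:
    """Direct-mapped placement: (block address) modulo (number of blocks)."""
    return block_address % num_blocks


def cache_index_low_bits(block_address: int, num_blocks: int) -> int:
    """Same result from the low-order log2(num_blocks) bits (power-of-2 sizes only)."""
    if num_blocks <= 0 or num_blocks & (num_blocks - 1):
        raise ValueError("num_blocks must be a power of 2")
    return block_address & (num_blocks - 1)


# The eight memory addresses marked in Figure 5.8, mapped into an 8-entry cache.
for address in (1, 9, 17, 25, 5, 13, 21, 29):
    index = cache_index(address, 8)
    assert index == cache_index_low_bits(address, 8)
    print(f"{address:05b} -> entry {index:03b}")
```

The mask form works only because the number of entries is a power of 2; for any other size the modulo form would be needed.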

Because each cache location can contain the contents of a number of different
memory locations, how do we know whether the data in the cache corresponds
to a requested word? That is, how do we know whether a requested word is in the
cache or not? We answer this question by adding a set of tags to the cache. The
tags contain the address information required to identify whether a word in the
cache corresponds to the requested word. The tag needs only to contain the upper
portion of the address, corresponding to the bits that are not used as an index into
the cache. For example, in Figure 5.8 we need only have the upper 2 of the 5 address
bits in the tag, since the lower 3-bit index field of the address selects the block.
Architects omit the index bits because they are redundant, since by definition the
index field of any address of a cache block must be that block number.
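
The split into tag and index can be sketched in a few lines of Python for the 5-bit addresses of Figure 5.8. The helper names `split_address` and `join_address` are assumptions made for this illustration.

```python
INDEX_BITS = 3  # eight-entry cache, as in Figure 5.8


def split_address(address: int) -> tuple[int, int]:
    """Return (tag, index) for a word address in the eight-entry cache."""
    index = address & ((1 << INDEX_BITS) - 1)  # lower 3 bits select the block
    tag = address >> INDEX_BITS                # upper bits are stored as the tag
    return tag, index


def join_address(tag: int, index: int) -> int:
    """Rebuild the full address from a stored tag and the entry's index."""
    return (tag << INDEX_BITS) | index


tag, index = split_address(0b11101)
print(f"tag={tag:02b} index={index:03b}")  # tag=11 index=101
```

Because `join_address` recovers the original address from the tag and the entry number alone, storing the index bits in the cache would add no information.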

We also need a way to recognize that a cache block does not have valid
information. For instance, when a processor starts up, the cache does not have good
data, and the tag fields will be meaningless. Even after executing many instructions,
some of the cache entries may still be empty, as in Figure 5.7. Thus, we need to
know that the tag should be ignored for such entries. The most common method is
to add a valid bit to indicate whether an entry contains a valid address. If the bit is
not set, there cannot be a match for this block.

direct-mapped cache: A cache structure in which each memory location is mapped to
exactly one location in the cache.

tag: A field in a table used for a memory hierarchy that contains the address
information required to identify whether the associated block in the hierarchy
corresponds to a requested word.

[Figure 5.7: two panels, "a. Before the reference to Xn" and "b. After the reference
to Xn"; the cache entries that survive extraction are Xn−2 and Xn−1.]

FIGURE 5.7 The cache just before and just after a reference to a word Xn that is not
initially in the cache. This reference causes a miss that forces the cache to fetch Xn
from memory and insert it into the cache.

[Figure 5.8: diagram of an eight-entry cache (entries 000 to 111) above a memory in
which addresses 00001, 01001, 10001, 11001, 00101, 01101, 10101, and 11101 are marked.]

FIGURE 5.8 A direct-mapped cache with eight entries showing the addresses of memory
words between 0 and 31 that map to the same cache locations. Because there are eight
words in the cache, an address X maps to the direct-mapped cache word X modulo 8. That
is, the low-order $\log_2(8) = 3$ bits are used as the cache index. Thus, addresses
00001two, 01001two, 10001two, and 11001two all map to entry 001two of the cache, while
addresses 00101two, 01101two, 10101two, and 11101two all map to entry 101two of the cache.

For the rest of this section, we will focus on explaining how a cache deals with
reads. In general, handling reads is a little simpler than handling writes, since reads
do not have to change the contents of the cache. After seeing the basics of how
reads work and how cache misses can be handled, we’ll examine the cache designs
for real computers and detail how these caches handle writes.

valid bit: A field in the tables of a memory hierarchy that indicates that the
associated block in the hierarchy contains valid data.

The BIG Picture

Caching is perhaps the most important example of the big idea of
prediction. It relies on the principle of locality to try to find the
desired data in the higher levels of the memory hierarchy, and provides
mechanisms to ensure that when the prediction is wrong it finds and
uses the proper data from the lower levels of the memory hierarchy. The
hit rates of the cache prediction on modern computers are often higher
than 95% (see Figure 5.47).

Accessing a Cache
Below is a sequence of nine memory references to an empty eight-block cache,
including the action for each reference. Figure 5.9 shows how the contents of the
cache change on each miss. Since there are eight blocks in the cache, the low-order
three bits of an address give the block number:

Decimal address   Binary address   Hit or miss    Assigned cache block
of reference      of reference     in cache       (where found or placed)
22                10110two         miss (5.9b)    (10110two mod 8) = 110two
26                11010two         miss (5.9c)    (11010two mod 8) = 010two
22                10110two         hit            (10110two mod 8) = 110two
26                11010two         hit            (11010two mod 8) = 010two
16                10000two         miss (5.9d)    (10000two mod 8) = 000two
3                 00011two         miss (5.9e)    (00011two mod 8) = 011two
16                10000two         hit            (10000two mod 8) = 000two
18                10010two         miss (5.9f)    (10010two mod 8) = 010two
16                10000two         hit            (10000two mod 8) = 000two

Since the cache is empty, several of the first references are misses; the caption of
Figure 5.9 describes the actions for each memory reference. On the eighth reference
we have conflicting demands for a block. The word at address 18 (10010two) should
be brought into cache block 2 (010two). Hence, it must replace the word at address
26 (11010two), which is already in cache block 2 (010two). This behavior allows a
cache to take advantage of temporal locality: recently referenced words replace less
recently referenced words.

Index  V  Tag    Data                               Index  V  Tag    Data
000    N                                            000    N
001    N                                            001    N
010    N                                            010    N
011    N                                            011    N
100    N                                            100    N
101    N                                            101    N
110    N                                            110    Y  10two  Memory (10110two)
111    N                                            111    N
a. The initial state of the cache after power-on    b. After handling a miss of address (10110two)

Index  V  Tag    Data                               Index  V  Tag    Data
000    N                                            000    Y  10two  Memory (10000two)
001    N                                            001    N
010    Y  11two  Memory (11010two)                  010    Y  11two  Memory (11010two)
011    N                                            011    N
100    N                                            100    N
101    N                                            101    N
110    Y  10two  Memory (10110two)                  110    Y  10two  Memory (10110two)
111    N                                            111    N
c. After handling a miss of address (11010two)      d. After handling a miss of address (10000two)

Index  V  Tag    Data                               Index  V  Tag    Data
000    Y  10two  Memory (10000two)                  000    Y  10two  Memory (10000two)
001    N                                            001    N
010    Y  11two  Memory (11010two)                  010    Y  10two  Memory (10010two)
011    Y  00two  Memory (00011two)                  011    Y  00two  Memory (00011two)
100    N                                            100    N
101    N                                            101    N
110    Y  10two  Memory (10110two)                  110    Y  10two  Memory (10110two)
111    N                                            111    N
e. After handling a miss of address (00011two)      f. After handling a miss of address (10010two)

FIGURE 5.9 The cache contents are shown after each reference request that misses, with
the index and tag fields shown in binary for the sequence of addresses listed in the
table under Accessing a Cache. The cache is initially empty, with all valid bits (V entry
in cache) turned off (N). The processor requests the following addresses: 10110two (miss),
11010two (miss), 10110two (hit), 11010two (hit), 10000two (miss), 00011two (miss),
10000two (hit), 10010two (miss), and 10000two (hit). The figures show the cache contents
after each miss in the sequence has been handled. When address 10010two (18) is
referenced, the entry for address 11010two (26) must be replaced, and a reference to
11010two will cause a subsequent miss. The tag field will contain only the upper portion
of the address. The full address of a word contained in cache block i with tag field j
for this cache is $j \times 8 + i$, or equivalently the concatenation of the tag field j
and the index i. For example, in cache f above, index 010two has tag 10two and
corresponds to address 10010two.
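
The nine-reference sequence can be replayed with a short simulation. This is a sketch of a direct-mapped cache with one-word blocks, assuming word addresses as in the example; the function name and data layout are illustrative, not the book's.

```python
def simulate_direct_mapped(addresses, num_blocks=8):
    """Replay word addresses through a direct-mapped cache with one-word blocks."""
    valid = [False] * num_blocks
    tags = [0] * num_blocks
    outcomes = []
    for address in addresses:
        index = address % num_blocks   # low-order bits select the entry
        tag = address // num_blocks    # remaining upper bits form the tag
        if valid[index] and tags[index] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            valid[index] = True        # fetch from memory, replacing any old entry
            tags[index] = tag
    return outcomes, valid, tags


references = [22, 26, 22, 26, 16, 3, 16, 18, 16]
outcomes, valid, tags = simulate_direct_mapped(references)
for address, outcome in zip(references, outcomes):
    print(f"{address:2d}  {address:05b}  {outcome:4s}  block {address % 8:03b}")
```

After the run, entry 010 holds tag 10 (address 18) rather than tag 11 (address 26), matching the replacement described for the eighth reference.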

This situation is directly analogous to needing a book from the shelves and
having no more space on your desk: some book already on your desk must be
returned to the shelves. In a direct-mapped cache, there is only one place to put the
newly requested item and hence only one choice of what to replace.

We know where to look in the cache for each possible address: the low-order bits
of an address can be used to find the unique cache entry to which the address could
map. Figure 5.10 shows how a referenced address is divided into

■ A tag field, which is used to compare with the value of the tag field of the
cache

■ A cache index, which is used to select the block

The index of a cache block, together with the tag contents of that block, uniquely
specifies the memory address of the word contained in the cache block. Because
the index field is used as an address to reference the cache, and because an n-bit
field has $2^n$ values, the total number of entries in a direct-mapped cache must be a
power of 2. In the MIPS architecture, since words are aligned to multiples of four
bytes, the least significant two bits of every address specify a byte within a word.
Hence, the least significant two bits are ignored when selecting a word in the block.

The total number of bits needed for a cache is a function of the cache size and
the address size, because the cache includes both the storage for the data and the
tags. The size of the block above was one word, but normally it is several. For the
following situation:

■ 32-bit addresses

■ A direct-mapped cache

■ The cache size is $2^n$ blocks, so n bits are used for the index

■ The block size is $2^m$ words ($2^{m+2}$ bytes), so m bits are used for the word within
the block, and two bits are used for the byte part of the address

the size of the tag field is

$$32 - (n + m + 2).$$

The total number of bits in a direct-mapped cache is

$$2^n \times (\text{block size} + \text{tag size} + \text{valid field size}).$$

Since the block size is $2^m$ words ($2^{m+5}$ bits), and we need 1 bit for the valid field, the
number of bits in such a cache is

$$2^n \times (2^m \times 32 + (32 - n - m - 2) + 1) = 2^n \times (2^m \times 32 + 31 - n - m).$$

Although this is the actual size in bits, the naming convention is to exclude the size
of the tag and valid field and to count only the size of the data. Thus, the cache in
Figure 5.10 is called a 4 KiB cache.
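
The formula can be turned into a small function for experimenting with different cache shapes. This is a sketch under the assumptions stated above (32-bit words, byte addresses, one valid bit per block); the function name is made up for the illustration.

```python
def direct_mapped_cache_bits(n: int, m: int, address_bits: int = 32) -> int:
    """Total bits (data + tag + valid) for 2**n blocks of 2**m words each."""
    data_bits = (2 ** m) * 32              # 32-bit words
    tag_bits = address_bits - (n + m + 2)  # 2 bits address the byte within a word
    valid_bits = 1
    return (2 ** n) * (data_bits + tag_bits + valid_bits)


# 1024 one-word blocks (n = 10, m = 0): 4 KiB of data, 20-bit tags.
total = direct_mapped_cache_bits(n=10, m=0)
print(total, "bits in total,", 1024 * 32, "of them data")
```

For this shape the tag and valid bits add 21 bits of overhead to every 32-bit data word, which is why the conventional name counts only the data.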

[Figure 5.10: block diagram of the cache. The 32-bit address (bit positions 31 to 0) is
divided into a 20-bit tag (bits 31 to 12), a 10-bit index (bits 11 to 2), and a 2-bit
byte offset (bits 1 to 0). The index selects one of 1024 entries (0 to 1023), each
holding a valid bit, a 20-bit tag, and 32 bits of data; the outputs are Hit and Data.]

FIGURE 5.10 For this cache, the lower portion of the address is used to select a cache
entry consisting of a data word and a tag. This cache holds 1024 words or 4 KiB. We
assume 32-bit addresses in this chapter. The tag from the cache is compared against the
upper portion of the address to determine whether the entry in the cache corresponds to
the requested address. Because the cache has $2^{10}$ (or 1024) words and a block size of
one word, 10 bits are used to index the cache, leaving 32 − 10 − 2 = 20 bits to
be compared against the tag. If the tag and upper 20 bits of the address are equal and
the valid bit is on, then the request hits in the cache, and the word is supplied to the
processor. Otherwise, a miss occurs.

Bits in a Cache

EXAMPLE

How many total bits are required for a direct-mapped cache with 16 KiB of
data and 4-word blocks, assuming a 32-bit address?

ANSWER

We know that 16 KiB is 4096 ($2^{12}$) words. With a block size of 4 words ($2^2$),
there are 1024 ($2^{10}$) blocks. Each block has 4 × 32 or 128 bits of data plus a
tag, which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the total cache size is

$$2^{10} \times (4 \times 32 + (32 - 10 - 2 - 2) + 1) = 2^{10} \times 147 = 147 \text{ Kibibits}$$

or 18.4 KiB for a 16 KiB cache. For this cache, the total number of bits in the
cache is about 1.15 times as many as needed just for the storage of the data.

Mapping an Address to a Multiword Cache Block

EXAMPLE

Consider a cache with 64 blocks and a block size of 16 bytes. To what block
number does byte address 1200 map?

ANSWER

We saw the formula on page 384. The block is given by

(Block address) modulo (Number of blocks in the cache)

where the address of the block is

$$\left\lfloor \frac{\text{Byte address}}{\text{Bytes per block}} \right\rfloor$$

Notice that this block address is the block containing all addresses between

$$\left\lfloor \frac{\text{Byte address}}{\text{Bytes per block}} \right\rfloor \times \text{Bytes per block}$$

and

$$\left\lfloor \frac{\text{Byte address}}{\text{Bytes per block}} \right\rfloor \times \text{Bytes per block} + (\text{Bytes per block} - 1)$$

Thus, with 16 bytes per block, byte address 1200 is block address

$$\left\lfloor \frac{1200}{16} \right\rfloor = 75$$

which maps to cache block number (75 modulo 64) = 11. In fact, this block
maps all addresses between 1200 and 1215.
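
The same mapping can be checked mechanically. The sketch below assumes a direct-mapped cache and uses integer floor division for the block address; the function name is illustrative.

```python
def map_byte_address(byte_address: int, bytes_per_block: int, num_blocks: int):
    """Return (block address, cache block number, first byte, last byte)."""
    block_address = byte_address // bytes_per_block   # floor division
    cache_block = block_address % num_blocks
    first_byte = block_address * bytes_per_block
    last_byte = first_byte + (bytes_per_block - 1)
    return block_address, cache_block, first_byte, last_byte


print(map_byte_address(1200, 16, 64))  # (75, 11, 1200, 1215)
```

Any byte address from 1200 through 1215 yields the same block address and therefore the same cache block.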

Larger blocks exploit spatial locality to lower miss rates. As Figure 5.11 shows,
increasing the block size usually decreases the miss rate. The miss rate may go up
eventually if the block size becomes a significant fraction of the cache size, because
the number of blocks that can be held in the cache will become small, and there will
be a great deal of competition for those blocks. As a result, a block will be bumped
out of the cache before many of its words are accessed. Stated alternatively, spatial
locality among the words in a block decreases with a very large block; consequently,
the benefits in the miss rate become smaller.

[Figure 5.11: line plot with miss rate (0% to 10%) on the vertical axis and block size
(32, 64, 128, 256) on the horizontal axis; the curve labels that survive extraction are
16K, 64K, and 256K.]

FIGURE 5.11 Miss rate versus block size. Note that the miss rate actually goes up if the
block size is too large relative to the cache size. Each line represents a cache of
different size. (This figure is independent of associativity, discussed soon.)
Unfortunately, SPEC CPU2000 traces would take too long if block size were
included, so this data is based on SPEC92.

A more serious problem associated with just increasing the block size is that the
cost of a miss increases. The miss penalty is determined by the time required to fetch
the block from the next lower level of the hierarchy and load it into the cache. The
time to fetch the block has two parts: the latency to the first word and the transfer
time for the rest of the block. Clearly, unless we change the memory system, the
transfer time, and hence the miss penalty, will likely increase as the block size
increases. Furthermore, the improvement in the miss rate starts to decrease as the
blocks become larger. The result is that the increase in the miss penalty overwhelms
the decrease in the miss rate for blocks that are too large, and cache performance
thus decreases. Of course, if we design the memory to transfer larger blocks more
efficiently, we can increase the block size and obtain further improvements in cache
performance. We discuss this topic in the next section.

Elaboration: Although it is hard to do anything about the longer latency component of
the miss penalty for large blocks, we may be able to hide some of the transfer time so
that the miss penalty is effectively smaller. The simplest method for doing this, called
early restart, is simply to resume execution as soon as the requested word of the block
is returned, rather than wait for the entire block. Many processors use this technique
for instruction access, where it works best. Instruction accesses are largely sequential,
so if the memory system can deliver a word every clock cycle, the processor may be
able to restart operation when the requested word is returned, with the memory system
delivering new instruction words just in time. This technique is usually less effective for
data caches because it is likely that the words will be requested from the block in a
less predictable way, and the probability that the processor will need another word from
a different cache block before the transfer completes is high. If the processor cannot
access the data cache because a transfer is ongoing, then it must stall.

An even more sophisticated scheme is to organize the memory so that the requested
word is transferred from the memory to the cache first. The remainder of the block
is then transferred, starting with the address after the requested word and wrapping
around to the beginning of the block. This technique, called requested word first or
critical word first, can be slightly faster than early restart, but it is limited by the same
properties that limit early restart.
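
The wraparound transfer order can be sketched as follows. The function name is an assumption for illustration, and words are numbered from 0 within the block.

```python
def critical_word_first_order(requested_word: int, words_per_block: int) -> list[int]:
    """Order in which a block's words are transferred, starting at the requested word."""
    return [(requested_word + i) % words_per_block for i in range(words_per_block)]


print(critical_word_first_order(5, 8))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

With plain early restart the order would simply be 0 through 7, so a request for word 5 would wait for five earlier words to arrive first.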

Handling Cache Misses
Before we look at the cache of a real system, let’s see how the control unit deals with
cache misses. (We describe a cache controller in detail in Section 5.9.) The control
unit must detect a miss and process the miss by fetching the requested data from
memory (or, as we shall see, a lower-level cache). If the cache reports a hit, the
computer continues using the data as if nothing happened.

cache miss: A request for data from the cache that cannot be filled because the data is
not present in the cache.

Modifying the control of a processor to handle a hit is trivial; misses, however,
require some extra work. The cache miss handling is done in collaboration with
the processor control unit and with a separate controller that initiates the memory
access and refills the cache. The processing of a cache miss creates a pipeline stall
(Chapter 4) as opposed to an interrupt, which would require saving the state of all
registers. For a cache miss, we can stall the entire processor, essentially freezing
the contents of the temporary and programmer-visible registers, while we wait
for memory. More sophisticated out-of-order processors can allow execution of
instructions while waiting for a cache miss, but we’ll assume in-order processors
that stall on cache misses in this section.

Let’s look a little more closely at how instruction misses are handled; the same
approach can be easily extended to handle data misses. If an instruction access
results in a miss, then the content of the Instruction register is invalid. To get the
proper instruction into the cache, we must be able to instruct the lower level in the
memory hierarchy to perform a read. Since the program counter is incremented in
the first clock cycle of execution, the address of the instruction that generates an
instruction cache miss is equal to the value of the program counter minus 4. Once
we have the address, we need to instruct the main memory to perform a read. We
wait for the memory to respond (since the access will take multiple clock cycles),
and then write the words containing the desired instruction into the cache.

We can now define the steps to be taken on an instruction cache miss:

1. Send the original PC value (current PC – 4) to the memory.

2. Instruct main memory to perform a read and wait for the memory to
complete its access.

3. Write the cache entry, putting the data from memory in the data portion of
the entry, writing the upper bits of the address (from the ALU) into the tag
field, and turning the valid bit on.

4. Restart the instruction execution at the first step, which will refetch the
instruction, this time finding it in the cache.

The control of the cache on a data access is essentially identical: on a miss, we
simply stall the processor until the memory responds with the data.

Handling Writes
Writes work somewhat differently. Suppose on a store instruction, we wrote the
data into only the data cache (without changing main memory); then, after the
write into the cache, memory would have a different value from that in the cache.
In such a case, the cache and memory are said to be inconsistent. The simplest way
to keep the main memory and the cache consistent is always to write the data into
both the memory and the cache. This scheme is called write-through.

write-through: A scheme in which writes always update both the cache and the next
lower level of the memory hierarchy, ensuring that data is always consistent between
the two.

The other key aspect of writes is what occurs on a write miss. We first fetch the
words of the block from memory. After the block is fetched and placed into the
cache, we can overwrite the word that caused the miss into the cache block. We also
write the word to main memory using the full address.

Although this design handles writes very simply, it would not provide very
good performance. With a write-through scheme, every write causes the data
to be written to main memory. These writes will take a long time, likely at least
100 processor clock cycles, and could slow down the processor considerably. For
example, suppose 10% of the instructions are stores. If the CPI without cache
misses was 1.0, spending 100 extra cycles on every write would lead to a CPI of
1.0 + 100 × 10% = 11, reducing performance by more than a factor of 10.
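
The CPI arithmetic for the unbuffered write-through case can be written as a one-line model, which makes it easy to try other store frequencies or write latencies. The function and parameter names are assumptions for this sketch.

```python
def cpi_with_write_through(base_cpi: float, store_fraction: float, write_cycles: int) -> float:
    """CPI when every store stalls for write_cycles and no write buffer is present."""
    return base_cpi + write_cycles * store_fraction


cpi = cpi_with_write_through(base_cpi=1.0, store_fraction=0.10, write_cycles=100)
print(f"CPI = {cpi:.1f}, slowdown = {cpi / 1.0:.1f}x")
```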

One solution to this problem is to use a write buffer. A write buffer stores the
data while it is waiting to be written to memory. After writing the data into the
cache and into the write buffer, the processor can continue execution. When a write
to main memory completes, the entry in the write buffer is freed. If the write buffer
is full when the processor reaches a write, the processor must stall until there is an
empty position in the write buffer. Of course, if the rate at which the memory can
complete writes is less than the rate at which the processor is generating writes, no
amount of buffering can help, because writes are being generated faster than the
memory system can accept them.

The rate at which writes are generated may also be less than the rate at which the
memory can accept them, and yet stalls may still occur. This can happen when the
writes occur in bursts. To reduce the occurrence of such stalls, processors usually
increase the depth of the write buffer beyond a single entry.
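
The effect of bursts and buffer depth can be explored with a toy timing model. Everything here is an assumption for illustration: memory completes one write at a time in a fixed number of cycles, an entry is freed when its write completes, and `gaps` gives the cycles of other work before each store.

```python
from collections import deque


def write_buffer_stalls(gaps, depth, write_cycles):
    """Total cycles the processor stalls waiting for a free write-buffer entry."""
    completion_times = deque()  # completion times of writes still in the buffer
    now = 0
    memory_free_at = 0
    stalls = 0
    for gap in gaps:
        now += gap
        # Free the entries whose memory writes have already completed.
        while completion_times and completion_times[0] <= now:
            completion_times.popleft()
        if len(completion_times) == depth:
            # Buffer full: stall until the oldest outstanding write completes.
            wait_until = completion_times.popleft()
            stalls += wait_until - now
            now = wait_until
        start = max(now, memory_free_at)  # memory handles one write at a time
        memory_free_at = start + write_cycles
        completion_times.append(memory_free_at)
    return stalls


# Two bursts of four back-to-back stores, separated by long quiet periods.
bursty = [100, 1, 1, 1] * 2
for depth in (1, 2, 4):
    print(depth, write_buffer_stalls(bursty, depth, write_cycles=10))
```

In this trace memory is busy well under half the time on average, yet a one-entry buffer still stalls during each burst; a buffer as deep as the burst removes the stalls.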

The alternative to a write-through scheme is a scheme called write-back. In a
write-back scheme, when a write occurs, the new value is written only to the block
in the cache. The modified block is written to the lower level of the hierarchy when
it is replaced. Write-back schemes can improve performance, especially when
processors can generate writes as fast or faster than the writes can be handled by
main memory; a write-back scheme is, however, more complex to implement than
write-through.
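
To make the difference in memory traffic concrete, the sketch below counts the writes that reach memory under each policy. It assumes a direct-mapped cache with one-word blocks in which every store allocates its block, and it ignores dirty blocks still in the cache when the trace ends; the names are illustrative.

```python
def memory_writes(stores, num_blocks, policy):
    """Count writes that reach memory for a sequence of stored word addresses."""
    tags = [None] * num_blocks
    dirty = [False] * num_blocks
    writes = 0
    for address in stores:
        index, tag = address % num_blocks, address // num_blocks
        if policy == "write-through":
            tags[index] = tag
            writes += 1              # every store also updates memory
        elif policy == "write-back":
            if tags[index] != tag and dirty[index]:
                writes += 1          # modified block written back on replacement
            tags[index] = tag
            dirty[index] = True      # new value lives only in the cache for now
        else:
            raise ValueError(policy)
    return writes


trace = [16] * 10 + [24]             # 16 and 24 share entry 000 of an 8-block cache
print(memory_writes(trace, 8, "write-through"))
print(memory_writes(trace, 8, "write-back"))
```

Ten stores to the same word cost ten memory writes under write-through but only one under write-back, when the block is finally replaced.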

In the rest of this section, we describe caches from real processors, and we
examine how they handle both reads and writes. In Section 5.8, we will describe
the handling of writes in more detail.

Elaboration: Writes introduce several complications into caches that are not present
for reads. Here we discuss two of them: the policy on write misses and efficient
implementation of writes in write-back caches.

Consider a miss in a write-through cache. The most common strategy is to allocate a
block in the cache, called write allocate. The block is fetched from memory and then the
appropriate portion of the block is overwritten. An alternative strategy is to update the portion
of the block in memory but not put it in the cache, called no write allocate. The motivation is
that sometimes programs write entire blocks of data, such as when the operating system
zeros a page of memory. In such cases, the fetch associated with the initial write miss may
be unnecessary. Some computers allow the write allocation policy to be changed on a per
page basis.

Actually implementing stores efficiently in a cache that uses a write-back strategy is
more complex than in a write-through cache. A write-through cache can write the data
into the cache and read the tag; if the tag mismatches, then a miss occurs. Because the
cache is write-through, the overwriting of the block in the cache is not catastrophic, since
memory has the correct value. In a write-back cache, we must first write the block back
to memory if the data in the cache is modified and we have a cache miss. If we simply
overwrote the block on a store instruction before we knew whether the store had hit in
the cache (as we could for a write-through cache), we would destroy the contents of the
block, which is not backed up in the next lower level of the memory hierarchy.

write buffer: A queue that holds data while the data is waiting to be written to memory.

write-back: A scheme that handles writes by updating values only to the block in the
cache, then writing the modified block to the lower level of the hierarchy when the
block is replaced.

In a write-back cache, because we cannot overwrite the block, stores either require
two cycles (a cycle to check for a hit followed by a cycle to actually perform the write) or
require a write buffer to hold that data—effectively allowing the store to take only one
cycle by pipelining it. When a store buffer is used, the processor does the cache lookup
and places the data in the store buffer during the normal cache access cycle. Assuming
a cache hit, the new data is written from the store buffer into the cache on the next
unused cache access cycle.

By comparison, in a write-through cache, writes can always be done in one cycle.
We read the tag and write the data portion of the selected block. If the tag matches
the address of the block being written, the processor can continue normally, since the
correct block has been updated. If the tag does not match, the processor generates a
write miss to fetch the rest of the block corresponding to that address.

Many write-back caches also include write buffers that are used to reduce the miss
penalty when a miss replaces a modified block. In such a case, the modified block is
moved to a write-back buffer associated with the cache while the requested block is read
from memory. The write-back buffer is later written back to memory. Assuming another
miss does not occur immediately, this technique halves the miss penalty when a dirty
block must be replaced.

An Example Cache: The Intrinsity FastMATH Processor
The Intrinsity FastMATH is an embedded microprocessor that uses the MIPS
architecture and a simple cache implementation. Near the end of the chapter, we
will examine the more complex cache designs of ARM and Intel microprocessors,
but we start with this simple, yet real, example for pedagogical reasons. Figure 5.12
shows the organization of the Intrinsity FastMATH data cache.

This processor has a 12-stage pipeline. When operating at peak speed, the
processor can request both an instruction word and a data word on every clock.
To satisfy the demands of the pipeline without stalling, separate instruction
and data caches are used. Each cache is 16 KiB, or 4096 words, with 16-word
blocks.

Read requests for the cache are straightforward. Because there are separate
data and instruction caches, we need separate control signals to read and write
each cache. (Remember that we need to update the instruction cache when a miss
occurs.) Thus, the steps for a read request to either cache are as follows:

1. Send the address to the appropriate cache. The address comes either from
the PC (for an instruction) or from the ALU (for data).

2. If the cache signals hit, the requested word is available on the data lines.
Since there are 16 words in the desired block, we need to select the right one.
A block index field is used to control the multiplexor (shown at the bottom
of the figure), which selects the requested word from the 16 words in the
indexed block.

3. If the cache signals miss, we send the address to the main memory. When
the memory returns with the data, we write it into the cache and then read it
to fulfill the request.

For writes, the Intrinsity FastMATH offers both write-through and write-back,
leaving it up to the operating system to decide which strategy to use for an
application. It has a one-entry write buffer.

What cache miss rates are attained with a cache structure like that used by the
Intrinsity FastMATH? Figure 5.13 shows the miss rates for the instruction and
data caches. The combined miss rate is the effective miss rate per reference for
each program after accounting for the differing frequency of instruction and data
accesses.

[Figure 5.12: block diagram of the cache. The 32-bit address (bit positions 31 to 0) is
divided into an 18-bit tag (bits 31 to 14), an 8-bit index (bits 13 to 6), a 4-bit block
offset (bits 5 to 2), and a 2-bit byte offset (bits 1 to 0). The index selects one of 256
entries, each holding a valid bit (V), an 18-bit tag, and 512 bits of data; a multiplexor
(Mux) selects one 32-bit word, and the outputs are Hit and Data.]

FIGURE 5.12 The 16 KiB caches in the Intrinsity FastMATH each contain 256 blocks with
16 words per block. The tag field is 18 bits wide and the index field is 8 bits wide,
while a 4-bit field (bits 5–2) is used to index the block and select the word from the
block using a 16-to-1 multiplexor. In practice, to eliminate the multiplexor, caches use
a separate large RAM for the data and a smaller RAM for the tags, with the block offset
supplying the extra address bits for the large data RAM. In this case, the large RAM is
32 bits wide and must have 16 times as many words as blocks in the cache.
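
The field boundaries given in the caption can be expressed as shifts and masks. The sketch below is only an illustration of that address layout (the function names are hypothetical), with `select_word` standing in for the 16-to-1 multiplexor.

```python
def decode_fastmath_address(address: int) -> dict[str, int]:
    """Split a 32-bit byte address into the fields shown in Figure 5.12."""
    return {
        "byte_offset": address & 0x3,          # bits 1-0
        "block_offset": (address >> 2) & 0xF,  # bits 5-2: word within the 16-word block
        "index": (address >> 6) & 0xFF,        # bits 13-6: one of 256 blocks
        "tag": (address >> 14) & 0x3FFFF,      # bits 31-14: 18-bit tag
    }


def select_word(block_words: list[int], address: int) -> int:
    """Model of the 16-to-1 multiplexor: pick the requested word from a block."""
    return block_words[decode_fastmath_address(address)["block_offset"]]


fields = decode_fastmath_address(0x12345678)
print({name: hex(value) for name, value in fields.items()})
```

The four field widths (18 + 8 + 4 + 2) account for all 32 address bits, so the fields can be concatenated to recover the original address.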


Although miss rate is an important characteristic of cache designs, the ultimate
measure will be the effect of the memory system on program execution time; we’ll
see how miss rate and execution time are related shortly.

Elaboration: A combined cache with a total size equal to the sum of the two split
caches will usually have a better hit rate. This higher rate occurs because the combined
cache does not rigidly divide the number of entries that may be used by instructions
from those that may be used by data. Nonetheless, almost all processors today use
split instruction and data caches to increase cache bandwidth to match what modern
pipelines expect. (There may also be fewer conflict misses; see Section 5.8.)

Here are miss rates for caches the size of those found in the Intrinsity FastMATH
processor, and for a combined cache whose size is equal to the sum of the two caches:

■ Total cache size: 32 KiB
■ Split cache effective miss rate: 3.24%
■ Combined cache miss rate: 3.18%

The miss rate of the split cache is only slightly worse.

The advantage of doubling the cache bandwidth, by supporting both an instruction
and data access simultaneously, easily overcomes the disadvantage of a slightly
increased miss rate. This observation cautions us that we cannot use miss rate as the
sole measure of cache performance, as Section 5.4 shows.

Summary
We began the previous section by examining the simplest of caches: a direct-mapped
cache with a one-word block. In such a cache, both hits and misses are simple, since
a word can go in exactly one location and there is a separate tag for every word. To
keep the cache and memory consistent, a write-through scheme can be used, so
that every write into the cache also causes memory to be updated. The alternative
to write-through is a write-back scheme that copies a block back to memory when
it is replaced; we’ll discuss this scheme further in upcoming sections.

split cache: A scheme in which a level of the memory hierarchy is composed of two
independent caches that operate in parallel with each other, with one handling
instructions and one handling data.

Instruction miss rate    Data miss rate    Effective combined miss rate
0.4%                     11.4%             3.2%

FIGURE 5.13 Approximate instruction and data miss rates for the Intrinsity FastMATH
processor for SPEC CPU2000 benchmarks. The combined miss rate is the effective miss rate seen
for the combination of the 16 KiB instruction cache and 16 KiB data cache. It is obtained by weighting the
instruction and data individual miss rates by the frequency of instruction and data references.
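The weighting described in the caption can be sketched in a few lines of Python. The function name and the figure of 0.36 data references per instruction are assumptions made for illustration; the actual reference mix behind Figure 5.13 is not given here, so the result lands near, but not exactly on, the 3.2% in the figure.

```python
def combined_miss_rate(instr_miss_rate, data_miss_rate, data_refs_per_instr):
    """Effective miss rate per memory reference for a split cache.

    Each instruction makes one instruction reference plus, on average,
    data_refs_per_instr data references.
    """
    total_refs = 1.0 + data_refs_per_instr
    misses = instr_miss_rate + data_refs_per_instr * data_miss_rate
    return misses / total_refs


# Miss rates from Figure 5.13. The 0.36 data references per instruction
# is an assumed value, not a measured FastMATH number.
rate = combined_miss_rate(0.004, 0.114, 0.36)
print(f"combined miss rate: {rate:.2%}")
```

Because instruction references outnumber data references, the combined rate sits much closer to the instruction miss rate than to the data miss rate.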


To take advantage of spatial locality, a cache must have a block size larger than
one word. The use of a larger block decreases the miss rate and improves the
efficiency of the cache by reducing the amount of tag storage relative to the amount
of data storage in the cache. Although a larger block size decreases the miss rate, it
can also increase the miss penalty. If the miss penalty increased linearly with the
block size, larger blocks could easily lead to lower performance.

To avoid performance loss, the bandwidth of main memory is increased to
transfer cache blocks more efficiently. Common methods for increasing bandwidth
external to the DRAM are making the memory wider and interleaving. DRAM
designers have steadily improved the interface between the processor and memory
to increase the bandwidth of burst mode transfers to reduce the cost of larger cache
block sizes.

Check Yourself

The speed of the memory system affects the designer’s decision on the size of
the cache block. Which of the following cache designer guidelines are generally
valid?

1. The shorter the memory latency, the smaller the cache block

2. The shorter the memory latency, the larger the cache block

3. The higher the memory bandwidth, the smaller the cache block

4. The higher the memory bandwidth, the larger the cache block

5.4 Measuring and Improving Cache Performance

In this section, we begin by examining ways to measure and analyze cache
performance. We then explore two different techniques for improving cache
performance. One focuses on reducing the miss rate by reducing the probability
that two different memory blocks will contend for the same cache location. The
second technique reduces the miss penalty by adding an additional level to the
hierarchy. This technique, called multilevel caching, first appeared in high-end
computers selling for more than $100,000 in 1990; since then it has become
common on personal mobile devices selling for a few hundred dollars!


CPU time can be divided into the clock cycles that the CPU spends executing
the program and the clock cycles that the CPU spends waiting for the memory
system. Normally, we assume that the costs of cache accesses that are hits are part
of the normal CPU execution cycles. Thus,

$$\text{CPU time} = (\text{CPU execution clock cycles} + \text{Memory-stall clock cycles}) \times \text{Clock cycle time}$$

The memory-stall clock cycles come primarily from cache misses, and we make
that assumption here. We also restrict the discussion to a simplified model of the
memory system. In real processors, the stalls generated by reads and writes can be
quite complex, and accurate performance prediction usually requires very detailed
simulations of the processor and memory system.

Memory-stall clock cycles can be defined as the sum of the stall cycles coming
from reads plus those coming from writes:

$$\text{Memory-stall clock cycles} = \text{Read-stall cycles} + \text{Write-stall cycles}$$

The read-stall cycles can be defined in terms of the number of read accesses per
program, the miss penalty in clock cycles for a read, and the read miss rate:

$$\text{Read-stall cycles} = \frac{\text{Reads}}{\text{Program}} \times \text{Read miss rate} \times \text{Read miss penalty}$$

Writes are more complicated. For a write-through scheme, we have two sources of
stalls: write misses, which usually require that we fetch the block before continuing
the write (see the Elaboration on page 394 for more details on dealing with writes),
and write buffer stalls, which occur when the write buffer is full when a write
occurs. Thus, the cycles stalled for writes equal the sum of these two:

$$\text{Write-stall cycles} = \left(\frac{\text{Writes}}{\text{Program}} \times \text{Write miss rate} \times \text{Write miss penalty}\right) + \text{Write buffer stalls}$$

Because the write buffer stalls depend on the proximity of writes, and not just
the frequency, it is not possible to give a simple equation to compute such stalls.
Fortunately, in systems with a reasonable write buffer depth (e.g., four or more
words) and a memory capable of accepting writes at a rate that significantly exceeds
the average write frequency in programs (e.g., by a factor of 2), the write buffer
stalls will be small, and we can safely ignore them. If a system did not meet these
criteria, it would not be well designed; instead, the designer should have used either
a deeper write buffer or a write-back organization.


Write-back schemes also have potential additional stalls arising from the need
to write a cache block back to memory when the block is replaced. We will discuss
this more in Section 5.8.

In most write-through cache organizations, the read and write miss penalties are
the same (the time to fetch the block from memory). If we assume that the write
buffer stalls are negligible, we can combine the reads and writes by using a single
miss rate and the miss penalty:

$$\text{Memory-stall clock cycles} = \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty}$$

We can also factor this as

$$\text{Memory-stall clock cycles} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$$
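The step from separate read and write terms to a single miss rate can be spelled out. The following is a sketch under the two assumptions already stated (equal read and write miss penalties, negligible write buffer stalls). The symbols $R$, $W$, $m_r$, $m_w$, and $P$ are shorthand introduced here for reads per program, writes per program, read miss rate, write miss rate, and miss penalty; they are not notation from the text.

```latex
\begin{aligned}
\text{Memory-stall clock cycles}
  &= R \, m_r \, P + W \, m_w \, P \\
  &= (R + W) \times \frac{R \, m_r + W \, m_w}{R + W} \times P \\
  &= \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty},
\end{aligned}
\qquad\text{and}\qquad
\frac{\text{Misses}}{\text{Instruction}}
  = \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate}.
```

The single miss rate is therefore the access-weighted average of the read and write miss rates, and the per-instruction form only regroups the same product.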

Let’s consider a simple example to help us understand the impact of cache
performance on processor performance.

Calculating Cache Performance

EXAMPLE

Assume the miss rate of an instruction cache is 2% and the miss rate of the data
cache is 4%. If a processor has a CPI of 2 without any memory stalls and the
miss penalty is 100 cycles for all misses, determine how much faster a processor
would run with a perfect cache that never missed. Assume the frequency of all
loads and stores is 36%.

ANSWER

The number of memory miss cycles for instructions in terms of the instruction
count (I) is

$$\text{Instruction miss cycles} = I \times 2\% \times 100 = 2.00 \times I$$

As the frequency of all loads and stores is 36%, we can find the number of
memory miss cycles for data references:

$$\text{Data miss cycles} = I \times 36\% \times 4\% \times 100 = 1.44 \times I$$

The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I. This is
more than three cycles of memory stall per instruction. Accordingly, the total
CPI including memory stalls is 2 + 3.44 = 5.44. Since there is no change in
instruction count or clock rate, the ratio of the CPU execution times is

$$\frac{\text{CPU time with stalls}}{\text{CPU time with perfect cache}} = \frac{I \times \text{CPI}_{\text{stall}} \times \text{Clock cycle}}{I \times \text{CPI}_{\text{perfect}} \times \text{Clock cycle}} = \frac{\text{CPI}_{\text{stall}}}{\text{CPI}_{\text{perfect}}} = \frac{5.44}{2}$$

The performance with the perfect cache is better by 5.44/2 = 2.72.

What happens if the processor is made faster, but the memory system is not? The
amount of time spent on memory stalls will take up an increasing fraction of the
execution time; Amdahl’s Law, which we examined in Chapter 1, reminds us of
this fact. A few simple examples show how serious this problem can be. Suppose
we speed up the computer in the previous example by reducing its CPI from 2 to 1
without changing the clock rate, which might be done with an improved pipeline.
The system with cache misses would then have a CPI of 1 + 3.44 = 4.44, and the
system with the perfect cache would be 4.44/1 = 4.44 times as fast.

The amount of execution time spent on memory stalls would have risen from
3.44/5.44 = 63% to 3.44/4.44 = 77%.

Similarly, increasing the clock rate without changing the memory system also
increases the performance lost due to cache misses.

The previous examples and equations assume that the hit time is not a factor in
determining cache performance. Clearly, if the hit time increases, the total time to
access a word from the memory system will increase, possibly causing an increase in
the processor cycle time. Although we will see additional examples of what can increase


hit time shortly, one example is increasing the cache size. A larger cache could clearly
have a longer access time, just as, if your desk in the library was very large (say, 3 square
meters), it would take longer to locate a book on the desk. An increase in hit time
likely adds another stage to the pipeline, since it may take multiple cycles for a cache
hit. Although it is more complex to calculate the performance impact of a deeper
pipeline, at some point the increase in hit time for a larger cache could dominate the
improvement in hit rate, leading to a decrease in processor performance.
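The perfect-cache comparison worked through earlier in this section can be checked numerically. The sketch below is only an illustration: the function name `cpi_with_stalls` is invented here, and the inputs are the 2% and 4% miss rates, 36% load/store frequency, and 100-cycle miss penalty from that example. It should reproduce the total CPIs of 5.44 and 4.44 and the stall fractions of roughly 63% and 77%.

```python
def cpi_with_stalls(base_cpi, instr_miss_rate, data_miss_rate,
                    data_refs_per_instr, miss_penalty):
    """Return (total CPI, memory-stall cycles per instruction)."""
    instr_stalls = instr_miss_rate * miss_penalty
    data_stalls = data_refs_per_instr * data_miss_rate * miss_penalty
    stalls = instr_stalls + data_stalls
    return base_cpi + stalls, stalls


# Same memory system, two processors: base CPI of 2 and base CPI of 1.
for base_cpi in (2.0, 1.0):
    total, stalls = cpi_with_stalls(base_cpi, 0.02, 0.04, 0.36, 100)
    speedup = total / base_cpi       # a perfect cache runs at base_cpi
    stall_fraction = stalls / total  # share of execution time spent stalled
    print(f"base CPI {base_cpi:.0f}: total CPI {total:.2f}, "
          f"perfect cache {speedup:.2f}x as fast, "
          f"{stall_fraction:.0%} of time in memory stalls")
```

The stall cycles per instruction stay fixed at 3.44 in both runs; only the base CPI changes, which is why the stalled fraction grows as the processor gets faster.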

To capture the fact that the time to access data for both hits and misses affects
performance, designers sometimes use average memory access time (AMAT) as
a way to examine alternative cache designs. Average memory access time is the
average time to access memory considering both hits and misses and the frequency
of different accesses; it is equal to the following:

$$\text{AMAT} = \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty}$$

Calculating Average Memory Access Time

EXAMPLE

Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of
20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access
time (including hit detection) of 1 clock cycle. Assume that the read and write
miss penalties are the same and ignore other write stalls.

ANSWER

The average memory access time per instruction is

$$\text{AMAT} = \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty} = 1 + 0.05 \times 20 = 2 \text{ clock cycles}$$

or 2 ns.
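As a quick check of the arithmetic, the AMAT formula can be written as a one-line function. This is a minimal sketch; the function and variable names are chosen here for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as hit_time."""
    return hit_time + miss_rate * miss_penalty


# Values from the example: 1-cycle hit, 5% miss rate, 20-cycle miss penalty.
cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
clock_cycle_ns = 1.0
print(f"AMAT = {cycles:.1f} clock cycles = {cycles * clock_cycle_ns:.1f} ns")
```

Because the hit time is paid on every access, a design change that lowers the miss rate but lengthens the hit time can raise AMAT; calling `amat` with two candidate parameter sets is a simple way to compare them.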

The next subsection discusses alternative cache organizations that decrease
miss rate but may sometimes increase hit time; additional examples appear in
Section 5.15, Fallacies and Pitfalls.

Reducing Cache Misses by More Flexible Placement of Blocks
So far, when we place a block in the cache, we have used a simple placement scheme:
A block can go in exactly one place in the cache. As mentioned earlier, it is called
direct mapped because there is a direct mapping from any block address in memory
to a single location in the upper level of the hierarchy. However, there is actually a
whole range of schemes for placing blocks. Direct mapped, where a block can be
placed in exactly one location, is at one extreme.


At the other extreme is a scheme where a block can be placed in any location
in the cache. Such a scheme is called fully associative, because a block in memory
may be associated with any entry in the cache. To find a given block in a fully
associative cache, all the entries in the cache must be searched because a block
can be placed in any one. To make the search practical, it is done in parallel with
a comparator associated with each cache entry. These comparators significantly
increase the hardware cost, effectively making fully associative placement practical
only for caches with small numbers of blocks.

The middle range of designs between direct mapped and fully associative
is called set associative. In a set-associative cache, there are a fixed number of
locations where each block can be placed. A set-associative cache with n locations
for a block is called an n-way set-associative cache. An n-way set-associative cache
consists of a number of sets, each of which consists of n blocks. Each block in the
memory maps to a unique set in the cache given by the index field, and a block can
be placed in any element of that set. Thus, a set-associative placement combines
direct-mapped placement and fully associative placement: a block is directly
mapped into a set, and then all the blocks in the set are searched for a match. For
example, Figure 5.14 shows where block 12 may be placed in a cache with eight
blocks total, according to the three block placement policies.

Remember that in a direct-mapped cache, the position of a memory block is
given by

(Block number) modulo (Number of blocks in the cache)

fully associative cache: A cache structure in which a block can be placed in any
location in the cache.

set-associative cache: A cache that has a fixed number of locations (at least two)
where each block can be placed.

[Figure 5.14: three panels, each showing the Data and Tag arrays of an eight-block cache and the
locations checked for memory block 12: direct mapped (blocks numbered 0–7), set associative
(sets numbered 0–3), and fully associative.]

FIGURE 5.14 The location of a memory block whose address is 12 in a cache with eight
blocks varies for direct-mapped, set-associative, and fully associative placement. In direct-mapped
placement, there is only one cache block where memory block 12 can be found, and that block is
given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block
12 must be in set (12 mod 4) = 0; the memory block could be in either element of the set. In a fully associative
placement, the memory block for block address 12 can appear in any of the eight cache blocks.


In a set-associative cache, the set containing a memory block is given by

(Block number) modulo (Number of sets in the cache)

Since the block may be placed in any element of the set, all the tags of all the elements
of the set must be searched. In a fully associative cache, the block can go anywhere,
and all tags of all the blocks in the cache must be searched.
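The two modulo rules can be folded into one function, since a direct-mapped cache is the one-way case and a fully associative cache is the case with a single set. The sketch below is illustrative only: the function name and the convention of numbering the cache blocks of set s as s × ways through s × ways + ways − 1 are assumptions made here, not something the text specifies.

```python
def candidate_slots(block_number, num_blocks, ways):
    """Return (set index, list of cache block slots) for a memory block.

    ways = 1 is direct mapped; ways = num_blocks is fully associative.
    Slot numbering within the cache is an illustrative convention.
    """
    num_sets = num_blocks // ways
    set_index = block_number % num_sets   # (Block number) modulo (Number of sets)
    first = set_index * ways
    return set_index, list(range(first, first + ways))


# Memory block 12 in an eight-block cache, as in Figure 5.14.
for name, ways in (("direct mapped", 1), ("two-way", 2), ("fully associative", 8)):
    set_index, slots = candidate_slots(12, 8, ways)
    print(f"{name}: set {set_index}, tags to search in slots {slots}")
```

The length of the returned slot list is also the number of tag comparisons needed for a lookup, which is where the hardware cost of higher associativity comes from.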

We can also think of all block placement strategies as a variation on set
associativity. Figure 5.15 shows the possible associativity structures for an eight-block
cache. A direct-mapped cache is simply a one-way set-associative cache:
each cache entry holds one block and each set has one element. A fully associative
cache with m entries is simply an m-way set-associative cache; it has one set with m
blocks, and an entry can reside in any block within that set.

The advantage of increasing the degree of associativity is that it usually decreases
the miss rate, as the next example shows. The main disadvantage, which we discuss
in more detail shortly, is a potential increase in the hit time.

[Figure 5.15: four configurations of an eight-block cache, each set drawn as a row of Tag/Data pairs:
one-way set associative (direct mapped), two-way set associative, four-way set associative, and
eight-way set associative (fully associative).]

FIGURE 5.15 An eight-block cache configured as direct mapped, two-way set associative,
four-way set associative, and fully associative. The total size of the cache in blocks is equal to the
number of sets times the associativity. Thus, for a fixed cache size, increasing the associativity decreases
the number of sets while increasing the number of elements per set. With eight blocks, an eight-way
set-associative cache is the same as a fully associative cache.


Misses and Associativity in Caches

EXAMPLE

Assume there are three small caches, each consisting of four one-word blocks.
One cache is fully associative, a second is two-way set-associative, and the
third is direct-mapped. Find the number of misses for each cache organization
given the following sequence of block addresses: 0, 8, 0, 6, and 8.

ANSWER

The direct-mapped case is easiest. First, let’s determine to which cache block
each block address maps:

Block address    Cache block
0                (0 modulo 4) = 0
6                (6 modulo 4) = 2
8                (8 modulo 4) = 0

Now we can fill in the cache contents after each reference, using a blank entry to
mean that the block is invalid, colored text to show a new entry added to the cache
for the associated reference, and plain text to show an old entry in the cache:

Address of memory   Hit       Contents of cache blocks after reference
block accessed      or miss   0            1            2            3
0                   miss      Memory[0]
8                   miss      Memory[8]
0                   miss      Memory[0]
6                   miss      Memory[0]                 Memory[6]
8                   miss      Memory[8]                 Memory[6]

The direct-mapped cache generates five misses for the five accesses.

The set-associative cache has two sets (with indices 0 and 1) with two
elements per set. Let’s first determine to which set each block address maps:

Block address    Cache set
0                (0 modulo 2) = 0
6                (6 modulo 2) = 0
8                (8 modulo 2) = 0

Because we have a choice of which entry in a set to replace on a miss, we need
a replacement rule. Set-associative caches usually replace the least recently
used block within a set; that is, the block that was used furthest in the past
is replaced. (We will discuss other replacement rules in more detail shortly.)
Using this replacement rule, the contents of the set-associative cache after each
reference look like this:

Address of memory   Hit       Contents of cache blocks after reference
block accessed      or miss   Set 0        Set 0        Set 1        Set 1
0                   miss      Memory[0]
8                   miss      Memory[0]    Memory[8]
0                   hit       Memory[0]    Memory[8]
6                   miss      Memory[0]    Memory[6]
8                   miss      Memory[8]    Memory[6]

Notice that when block 6 is referenced, it replaces block 8, since block 8 has
been less recently referenced than block 0. The two-way set-associative cache
has four misses, one less than the direct-mapped cache.

The fully associative cache has four cache blocks (in a single set); any
memory block can be stored in any cache block. The fully associative cache has
the best performance, with only three misses:

Address of memory   Hit       Contents of cache blocks after reference
block accessed      or miss   Block 0      Block 1      Block 2      Block 3
0                   miss      Memory[0]
8                   miss      Memory[0]    Memory[8]
0                   hit       Memory[0]    Memory[8]
6                   miss      Memory[0]    Memory[8]    Memory[6]
8                   hit       Memory[0]    Memory[8]    Memory[6]

For this series of references, three misses is the best we can do, because three
unique block addresses are accessed. Notice that if we had eight blocks in the
cache, there would be no replacements in the two-way set-associative cache
(check this for yourself), and it would have the same number of misses as the
fully associative cache. Similarly, if we had 16 blocks, all 3 caches would have
the same number of misses. Even this trivial example shows that cache size and
associativity are not independent in determining cache performance.
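The miss counts in this example, and the "check this for yourself" claims about larger caches, can be reproduced with a small simulator. What follows is a minimal sketch that assumes one-word blocks and LRU replacement within each set, as in the example; the function name and the use of Python's `OrderedDict` to track recency are implementation choices made here, not the text's.

```python
from collections import OrderedDict


def count_misses(block_addresses, num_blocks, ways):
    """Count misses in an LRU set-associative cache of one-word blocks."""
    num_sets = num_blocks // ways
    # One OrderedDict per set; the least recently used block is always first.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        entries = sets[block % num_sets]
        if block in entries:
            entries.move_to_end(block)         # hit: now most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)    # evict least recently used
            entries[block] = True
    return misses


trace = [0, 8, 0, 6, 8]
for name, ways in (("direct mapped", 1), ("two-way", 2), ("fully associative", 4)):
    print(f"4 blocks, {name}: {count_misses(trace, 4, ways)} misses")
print(f"8 blocks, two-way: {count_misses(trace, 8, 2)} misses")
print(f"16 blocks, direct mapped: {count_misses(trace, 16, 1)} misses")
```

With four blocks the three organizations give five, four, and three misses; with eight blocks the two-way cache drops to three, and with 16 blocks even the direct-mapped cache does.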

How much of a reduction in the miss rate is achieved by associativity?
Figure 5.16 shows the improvement for a 64 KiB data cache with a 16-word block,
and associativity ranging from direct mapped to eight-way. Going from one-way
to two-way associativity decreases the miss rate by about 15%, but there is little
further improvement in going to higher associativity.


Locating a Block in the Cache
Now, let’s consider the task of finding a block in a cache that is set associative.
Just as in a direct-mapped cache, each block in a set-associative cache includes
an address tag that gives the block address. The tag of every cache block within
the appropriate set is checked to see if it matches the block address from the
processor. Figure 5.17 decomposes the address. The index value is used to select
the set containing the address of interest, and the tags of all the blocks in the set
must be searched. Because speed is of the essence, all the tags in the selected set are
searched in parallel. As in a fully associative cache, a sequential search would make
the hit time of a set-associative cache too slow.

If the total cache size is kept the same, increasing the associativity increases the
number of blocks per set, which is the number of simultaneous compares needed
to perform the search in parallel: each increase by a factor of 2 in associativity
doubles the number of blocks per set and halves the number of sets. Accordingly,
each factor-of-2 increase in associativity decreases the size of the index by 1 bit and
increases the size of the tag by 1 bit. In a fully associative cache, there is effectively
only one set, and all the blocks must be checked in parallel. Thus, there is no index,
and the entire address, excluding the block offset, is compared against the tag of
every block. In other words, we search the entire cache without any indexing.
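The trade between index bits and tag bits can be seen by splitting one address under several associativities. The sketch below assumes power-of-two sizes and a 32-bit byte address. The cache geometry (256 blocks of 64 bytes, like the FastMATH caches) is taken from earlier in the chapter, while the function name and the sample address 0x12345678 are arbitrary choices for illustration.

```python
def split_address(addr, num_blocks, block_bytes, ways, addr_bits=32):
    """Split a byte address into fields for a set-associative cache.

    Returns ((tag, index, offset), (tag_bits, index_bits, offset_bits)).
    All sizes are assumed to be powers of two.
    """
    num_sets = num_blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2(block size in bytes)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return (tag, index, offset), (tag_bits, index_bits, offset_bits)


# 256 blocks of 64 bytes; associativity from direct mapped to fully associative.
for ways in (1, 2, 4, 256):
    fields, widths = split_address(0x12345678, 256, 64, ways)
    print(f"{ways:>3}-way: tag/index/offset bits = {widths}, fields = {fields}")
```

Each doubling of `ways` moves one bit from the index to the tag, and in the fully associative case the index width is zero, so the whole block address serves as the tag.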

In a direct-mapped cache, only a single comparator is needed, because the entry can
be in only one block, and we access the cache simply by indexing. Figure 5.18 shows
that in a four-way set-associative cache, four comparators are needed, together with
a 4-to-1 multiplexor to choose among the four potential members of the selected set.
The cache access consists of indexing the appropriate set and then searching the tags
of the set. The costs of an associative cache are the extra comparators and any delay
imposed by having to do the compare and select from among the elements of the set.

Associativity    Data miss rate
1                10.3%
2                8.6%
4                8.3%
8                8.1%

FIGURE 5.16 The data cache miss rates for an organization like the Intrinsity FastMATH
processor for SPEC CPU2000 benchmarks with associativity varying from one-way to
eight-way. These results for 10 SPEC CPU2000 programs are from Hennessy and Patterson (2003).

[Figure 5.17: the address fields, from most significant to least significant: Tag | Index | Block offset.]

FIGURE 5.17 The three portions of an address in a set-associative or direct-mapped
cache. The index is used to select the set, then the tag is used to choose the block by comparison with the
blocks in the selected set. The block offset is the address of the desired data within the block.


Elaboration: A Content Addressable Memory (CAM) is a circuit that combines
comparison and storage in a single device. Instead of supplying an address and reading
a word like a RAM, you supply the data and the CAM looks to see if it has a copy and
returns the index of the matching row. CAMs mean that cache designers can afford to
implement much higher set associativity than if they needed to build the hardware out
of SRAMs and comparators. In 2013, the greater size and power of CAM generally leads
to 2-way and 4-way set associativity being built from standard SRAMs and comparators,
with 8-way and above built using CAMs.

[Figure 5.18: a four-way set-associative cache. The address supplies a 22-bit tag and an 8-bit index;
the index selects one of 256 sets (0–255), each with four entries holding a valid bit (V), a tag, and
data. Four comparators check the tags in parallel; their outputs produce Hit and control a 4-to-1
multiplexor that selects Data.]

FIGURE 5.18 The implementation of a four-way set-associative cache requires four
comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set
(if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks
of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output
enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the
output. The Output enable signal comes from the comparators, causing the element that matches to drive the
data outputs. This organization eliminates the need for the multiplexor.


Choosing Which Block to Replace
When a miss occurs in a direct-mapped cache, the requested block can go in
exactly one position, and the block occupying that position must be replaced. In
an associative cache, we have a choice of where to place the requested block, and
hence a choice of which block to replace. In a fully associative cache, all blocks are
candidates for replacement. In a set-associative cache, we must choose among the
blocks in the selected set.

The most commonly used scheme is least recently used (LRU), which we used
in the previous example. In an LRU scheme, the block replaced is the one that has
been unused for the longest time. The set-associative example earlier in this section
uses LRU, which is why the final reference replaced Memory[0] instead of Memory[6].

LRU replacement is implemented by keeping track of when each element in a
set was used relative to the other elements in the set. For a two-way set-associative
cache, tracking when the two elements were used can be implemented by keeping
a single bit in each set and setting the bit to indicate an element whenever that
element is referenced. As associativity increases, implementing LRU gets harder; in
Section 5.8, we will see an alternative scheme for replacement.
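For the two-way case, the single-bit scheme is small enough to sketch directly. In the version below the bit records which way is least recently used, which is simply the complement of a bit that records the way just referenced. The class and method names are hypothetical, and the set stores only tags, not data.

```python
class TwoWaySet:
    """One set of a two-way set-associative cache with a single LRU bit."""

    def __init__(self):
        self.tags = [None, None]   # None marks an invalid entry
        self.lru = 0               # index of the least recently used way

    def access(self, tag):
        """Return True on a hit, False on a miss (after filling the block)."""
        if tag in self.tags:
            way = self.tags.index(tag)
            hit = True
        else:
            way = self.lru         # replace the least recently used way
            self.tags[way] = tag
            hit = False
        self.lru = 1 - way         # the other way is now least recently used
        return hit


# Block addresses 0, 8, 0, 6, 8 all map to set 0 of the earlier two-way example.
set0 = TwoWaySet()
results = [set0.access(block) for block in (0, 8, 0, 6, 8)]
print(results, set0.tags)
```

With more than two ways a single bit no longer orders all the elements, which is one reason exact LRU becomes harder to implement as associativity grows.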

Size of Tags versus Set Associativity

EXAMPLE

Increasing associativity requires more comparators and more tag bits per
cache block. Assuming a cache of 4096 blocks, a 4-word block size, and a
32-bit address, find the total number of sets and the total number of tag bits
for caches that are direct mapped, two-way and four-way set associative, and
fully associative.

ANSWER

Since there are 16 ($= 2^4$) bytes per block, a 32-bit address yields $32 - 4 = 28$ bits
to be used for index and tag. The direct-mapped cache has the same number
of sets as blocks, and hence 12 bits of index, since $\log_2(4096) = 12$; hence, the
total number is $(28 - 12) \times 4096 = 16 \times 4096 \approx 66$ K tag bits.

Each doubling of associativity decreases the number of sets by a factor of 2 and
thus decreases the number of bits used to index the cache by 1 and increases
the number of bits in the tag by 1. Thus, for a two-way set-associative cache,
there are 2048 sets, and the total number of tag bits is $(28 - 11) \times 2 \times 2048 =
34 \times 2048 \approx 70$ Kbits. For a four-way set-associative cache, the total number
of sets is 1024, and the total number is $(28 - 10) \times 4 \times 1024 = 72 \times 1024 \approx
74$ K tag bits.

For a fully associative cache, there is only one set with 4096 blocks, and the
tag is 28 bits, leading to $28 \times 4096 \times 1 \approx 115$ K tag bits.
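The same bookkeeping can be done mechanically, which gives the exact bit counts behind the rounded figures. This is a sketch that assumes power-of-two sizes; the function name is made up for illustration.

```python
def tag_storage_bits(num_blocks, block_bytes, ways, addr_bits=32):
    """Return (number of sets, tag bits per block, total tag bits)."""
    num_sets = num_blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2(bytes per block)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = addr_bits - offset_bits - index_bits
    return num_sets, tag_bits, tag_bits * num_blocks


# 4096 blocks of 4 words (16 bytes), 32-bit addresses, as in the example.
for name, ways in (("direct mapped", 1), ("two-way", 2),
                   ("four-way", 4), ("fully associative", 4096)):
    sets, per_block, total = tag_storage_bits(4096, 16, ways)
    print(f"{name}: {sets} sets, {per_block}-bit tags, {total} tag bits")
```

The exact totals are 65,536, 69,632, 73,728, and 114,688 bits, which round to the 66 K, 70 K, 74 K, and 115 K quoted in the answer.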

least recently used (LRU): A replacement scheme in which the block replaced is the
one that has been unused for the longest time.


Reducing the Miss Penalty Using Multilevel Caches
All modern computers make use of caches. To close the gap further between the
fast clock rates of modern processors and the increasingly long time required to
access DRAMs, most microprocessors support an additional level of caching. This
second-level cache is normally on the same chip and is accessed whenever a miss
occurs in the primary cache. If the second-level cache contains the desired data,
the miss penalty for the first-level cache will be essentially the access time of the
second-level cache, which will be much less than the access time of main memory.
If neither the primary nor the secondary cache contains the data, a main memory
access is required, and a larger miss penalty is incurred.

How significant is the performance improvement from the use of a secondary
cache? The next example shows us.

Performance of Multilevel Caches

EXAMPLE

Suppose we have a processor with a base CPI of 1.0, assuming all references
hit in the primary cache, and a clock rate of 4 GHz. Assume a main memory
access time of 100 ns, including all the miss handling. Suppose the miss rate
per instruction at the primary cache is 2%. How much faster will the processor
be if we add a secondary cache that has a 5 ns access time for either a hit or
a miss and is large enough to reduce the miss rate to main memory to 0.5%?

ANSWER

The miss penalty to main memory is

$$\frac{100\ \text{ns}}{0.25\ \frac{\text{ns}}{\text{clock cycle}}} = 400 \text{ clock cycles}$$

The effective CPI with one level of caching is given by

$$\text{Total CPI} = \text{Base CPI} + \text{Memory-stall cycles per instruction}$$

For the processor with one level of caching,

$$\text{Total CPI} = 1.0 + \text{Memory-stall cycles per instruction} = 1.0 + 2\% \times 400 = 9$$

With two levels of caching, a miss in the primary (or first-level) cache can be
satisfied either by the secondary cache or by main memory. The miss penalty
for an access to the second-level cache is

$$\frac{5\ \text{ns}}{0.25\ \frac{\text{ns}}{\text{clock cycle}}} = 20 \text{ clock cycles}$$


If the miss is satisfied in the secondary cache, then this is the entire miss
penalty. If the miss needs to go to main memory, then the total miss penalty is
the sum of the secondary cache access time and the main memory access time.

Thus, for a two-level cache, total CPI is the sum of the stall cycles from both
levels of cache and the base CPI:

$$\begin{aligned}
\text{Total CPI} &= 1 + \text{Primary stalls per instruction} + \text{Secondary stalls per instruction} \\
&= 1 + 2\% \times 20 + 0.5\% \times 400 = 1 + 0.4 + 2.0 = 3.4
\end{aligned}$$

Thus, the processor with the secondary cache is faster by

$$\frac{9.0}{3.4} = 2.6$$

Alternatively, we could have computed the stall cycles by summing the stall
cycles of those references that hit in the secondary cache ((2% − 0.5%) ×
20 = 0.3). Those references that go to main memory, which must include the
cost to access the secondary cache as well as the main memory access time, are
(0.5% × (20 + 400) = 2.1). The sum, 1.0 + 0.3 + 2.1, is again 3.4.
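Both ways of counting the stall cycles can be checked side by side. The sketch below uses the numbers from the example (4 GHz clock, 100 ns main memory, 5 ns secondary cache, 2% and 0.5% miss rates per instruction); the variable and function names are chosen here for illustration.

```python
def penalty_in_cycles(access_time_ns, clock_rate_ghz):
    """Convert an access time to clock cycles (1 GHz = 1 cycle per ns)."""
    return access_time_ns * clock_rate_ghz


clock_ghz = 4.0
main_penalty = penalty_in_cycles(100, clock_ghz)   # 400 cycles
l2_penalty = penalty_in_cycles(5, clock_ghz)       # 20 cycles

base_cpi = 1.0
l1_miss_rate = 0.02       # primary-cache misses per instruction
main_miss_rate = 0.005    # misses per instruction that reach main memory

cpi_one_level = base_cpi + l1_miss_rate * main_penalty

# First accounting: every primary miss pays the L2 access,
# and misses that reach main memory pay the main memory access on top.
cpi_two_level = (base_cpi + l1_miss_rate * l2_penalty
                 + main_miss_rate * main_penalty)

# Second accounting: split the misses into L2 hits and L2 misses.
cpi_alternative = (base_cpi
                   + (l1_miss_rate - main_miss_rate) * l2_penalty
                   + main_miss_rate * (l2_penalty + main_penalty))

print(f"one level:  CPI {cpi_one_level:.1f}")
print(f"two levels: CPI {cpi_two_level:.1f} (alternative count {cpi_alternative:.1f})")
print(f"speedup from the secondary cache: {cpi_one_level / cpi_two_level:.1f}")
```

The two accountings agree because they group the same cycles differently: the 20-cycle secondary access is charged either to all primary misses at once or separately to the references that hit and miss in the secondary cache.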

The design considerations for a primary and secondary cache are significantly
different, because the presence of the other cache changes the best choice versus
a single-level cache. In particular, a two-level cache structure allows the primary
cache to focus on minimizing hit time to yield a shorter clock cycle or fewer
pipeline stages, while allowing the secondary cache to focus on miss rate to reduce
the penalty of long memory access times.

The effect of these changes on the two caches can be seen by comparing each
cache to the optimal design for a single level of cache. In comparison to a single-level
cache, the primary cache of a multilevel cache is often smaller. Furthermore,
the primary cache may use a smaller block size, to go with the smaller cache size and
also to reduce the miss penalty. In comparison, the secondary cache will be much
larger than in a single-level cache, since the access time of the secondary cache is
less critical. With a larger total size, the secondary cache may use a larger block size
than appropriate with a single-level cache. It often uses higher associativity than
the primary cache given the focus of reducing miss rates.

multilevel cache: A memory hierarchy with multiple levels of caches, rather than
just a cache and main memory.

Understanding Program Performance

Sorting has been exhaustively analyzed to find better algorithms: Bubble Sort,
Quicksort, Radix Sort, and so on. Figure 5.19(a) shows instructions executed per
item sorted for Radix Sort versus Quicksort. As expected, for large arrays, Radix
Sort has an algorithmic advantage over Quicksort in terms of number of operations.
Figure 5.19(b) shows time per key instead of instructions executed. We see that the
lines start on the same trajectory as in Figure 5.19(a), but then the Radix Sort line
diverges as the data to sort increases. What is going on? Figure 5.19(c) answers by
looking at the cache misses per item sorted: Quicksort consistently has many fewer
misses per item to be sorted.

[Figure 5.19: three plots of Radix Sort and Quicksort against Size (K items to sort), from 4 to 4096:
(a) Instructions/item, (b) Clock cycles/item, (c) Cache misses/item.]

FIGURE 5.19 Comparing Quicksort and Radix Sort by (a) instructions executed per item
sorted, (b) time per item sorted, and (c) cache misses per item sorted. This data is from a
paper by LaMarca and Ladner [1996]. Due to such results, new versions of Radix Sort have been invented
that take memory hierarchy into account, to regain its algorithmic advantages (see Section 5.15). The basic
idea of cache optimizations is to use all the data in a block repeatedly before it is replaced on a miss.

Alas, standard algorithmic analysis often ignores the impact of the memory
hierarchy. As faster clock rates and Moore’s Law allow architects to squeeze all of
the performance out of a stream of instructions, using the memory hierarchy well
is critical to high performance. As we said in the introduction, understanding the
behavior of the memory hierarchy is critical to understanding the performance of
programs on today’s computers.

Software Optimization via Blocking
Given the importance of the memory hierarchy to program performance, not
surprisingly many software optimizations were invented that can dramatically
improve performance by reusing data within the cache and hence lower miss rates
due to improved temporal locality.

When dealing with arrays, we can get good performance from the memory
system if we store the array in memory so that accesses to the array are sequential
in memory. Suppose that we are dealing with multiple arrays, however, with some
arrays accessed by rows and some by columns. Storing the arrays row-by-row
(called row major order) or column-by-column (column major order) does not
solve the problem because both rows and columns are used in every loop iteration.

Instead of operating on entire rows or columns of an array, blocked algorithms
operate on submatrices or blocks. The goal is to maximize accesses to the data
loaded into the cache before the data are replaced; that is, improve temporal locality
to reduce cache misses.

For example, the inner loops of DGEMM (lines 4 through 9 of Figure 3.21 in
Chapter 3) are

```c
for (int j = 0; j < n; ++j)
{
    double cij = C[i+j*n];               /* cij = C[i][j] */
    for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n];      /* cij += A[i][k]*B[k][j] */
    C[i+j*n] = cij;                      /* C[i][j] = cij */
}
```

It reads all N-by-N elements of B, reads the same N elements in what corresponds to
one row of A repeatedly, and writes what corresponds to one row of N elements of C.
(The comments make the rows and columns of the matrices easier to identify.)
Figure 5.20 gives a snapshot of the accesses to the three arrays. A dark shade indicates
a recent access, a light shade indicates an older access, and white means not yet accessed.

FIGURE 5.20 A snapshot of the three arrays C, A, and B when N = 6 and i = 1. The age of
accesses to the array elements is indicated by shade: white means not yet touched, light means
older accesses, and dark means newer accesses. Compared to Figure 5.22, elements of A and B
are read repeatedly to calculate new elements of C. The variables i, j, and k are shown along
the rows or columns used to access the arrays.

[Figure 5.20 panels: x (array C), indexed by i and j; y (array A), indexed by i and k;
z (array B), indexed by k and j. Each axis runs from 0 to 5.]

The number of capacity misses clearly depends on N and the size of the cache. If it can
hold all three N-by-N matrices, then all is well, provided there are no cache conflicts.
We purposely picked the matrix size to be 32 by 32 in DGEMM for Chapters 3 and 4 so
that this would be the case. Each matrix is 32 × 32 = 1024 elements and each element
is 8 bytes, so the three matrices occupy 24 KiB, which comfortably fit in the 32 KiB data
cache of the Intel Core i7 (Sandy Bridge).

If the cache can hold one N-by-N matrix and one row of N, then at least the ith row of A
and the array B may stay in the cache. Less than that and misses may occur for both B
and C. In the worst case, there would be $2N^3 + N^2$ memory words accessed for $N^3$
operations.

To ensure that the elements being accessed can fit in the cache, the original code is
changed to compute on a submatrix. Hence, we essentially invoke the version of DGEMM
from Figure 4.80 in Chapter 4 repeatedly on matrices of size BLOCKSIZE by BLOCKSIZE.
BLOCKSIZE is called the blocking factor.

Figure 5.21 shows the blocked version of DGEMM. The function do_block is DGEMM from
Figure 3.21 with three new parameters si, sj, and sk to specify the starting position of
each submatrix of A, B, and C. The two inner loops of do_block now compute in steps of
size BLOCKSIZE rather than the full length of B and C. The gcc optimizer removes any
function call overhead by “inlining” the function; that is, it inserts the code directly to
avoid the conventional parameter passing and return address bookkeeping instructions.

```c
#define BLOCKSIZE 32

/* Multiply one BLOCKSIZE-by-BLOCKSIZE submatrix of A by one of B and
   accumulate into the matching submatrix of C (column-major storage). */
void do_block (int n, int si, int sj, int sk,
               double *A, double *B, double *C)
{
    for (int i = si; i < si+BLOCKSIZE; ++i)
        for (int j = sj; j < sj+BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];              /* cij = C[i][j] */
            for (int k = sk; k < sk+BLOCKSIZE; k++)
                cij += A[i+k*n] * B[k+j*n];     /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                     /* C[i][j] = cij */
        }
}

/* As written, the loops cover the matrices only if n is a multiple of BLOCKSIZE. */
void dgemm (int n, double* A, double* B, double* C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
```

FIGURE 5.21 Cache blocked version of DGEMM in Figure 3.21. Assume C is initialized to zero.
The do_block function is basically DGEMM from Chapter 3 with new parameters to specify the
starting positions of the submatrices of BLOCKSIZE. The gcc optimizer can remove the function
overhead instructions by inlining the do_block function.

Figure 5.22 illustrates the accesses to the three arrays using blocking. Looking only at
capacity misses, the total number of memory words accessed is $2N^3/\text{BLOCKSIZE} + N^2$.
This total is an improvement by about a factor of BLOCKSIZE. Hence, blocking exploits a
combination of spatial and temporal locality, since A benefits from spatial locality and B
benefits from temporal locality.

FIGURE 5.22 The age of accesses to the arrays C, A, and B when BLOCKSIZE = 3. Note that,
in contrast to Figure 5.20, fewer elements are accessed.

[Figure 5.22 panels: x (array C), indexed by i and j; y (array A), indexed by i and k;
z (array B), indexed by k and j. Each axis runs from 0 to 5.]

Although we have aimed at reducing cache misses, blocking can also be used to help
register allocation. By taking a small blocking size such that the block can be held in
registers, we can minimize the number of loads and stores in the program, which also
improves performance.

Figure 5.23 shows the impact of cache blocking on the performance of the unoptimized
DGEMM as we increase the matrix size beyond where all three matrices fit in the cache.
The unoptimized performance is halved for the largest matrix. The cache-blocked version
is less than 10% slower even at matrices that are 960x960, or 900 times larger than the
32 × 32 matrices in Chapters 3 and 4.

Matrix size    Unoptimized (GFLOPS)    Blocked (GFLOPS)
32x32          1.7                     1.7
160x160        1.5                     1.6
480x480        1.3                     1.6
960x960        0.8                     1.5

FIGURE 5.23 Performance of unoptimized DGEMM (Figure 3.21) versus cache blocked DGEMM
(Figure 5.21) as the matrix dimension varies from 32x32 (where all three matrices fit in the
cache) to 960x960.

Elaboration: Multilevel caches create several complications. First, there are now several
different types of misses and corresponding miss rates. In the example on pages 410–411,
we saw the primary cache miss rate and the global miss rate, that is, the fraction of
references that missed in all cache levels. There is also a miss rate for the secondary
cache, which is the ratio of all misses in the secondary cache divided by the number of
accesses to it. This miss rate is called the local miss rate of the secondary cache. Because
the primary cache filters accesses, especially those with good spatial and temporal
locality, the local miss rate of the secondary cache is much higher than the global miss
rate. For the example on pages 410–411, we can compute the local miss rate of the
secondary cache as 0.5%/2% = 25%! Luckily, the global miss rate dictates how often we
must access the main memory.

global miss rate The fraction of references that miss in all levels of a multilevel cache.

local miss rate The fraction of references to one level of a cache that miss; used in
multilevel hierarchies.

Elaboration: With out-of-order processors (see Chapter 4), performance is more complex,
since they execute instructions during the miss penalty. Instead of instruction miss rates
and data miss rates, we use misses per instruction, and this formula:

$$\frac{\text{Memory-stall cycles}}{\text{Instruction}} = \frac{\text{Misses}}{\text{Instruction}} \times (\text{Total miss latency} - \text{Overlapped miss latency})$$

There is no general way to calculate overlapped miss latency, so evaluations of memory
hierarchies for out-of-order processors inevitably require simulation of the processor and
the memory hierarchy. Only by seeing the execution of the processor during each miss can
we see if the processor stalls waiting for data or simply finds other work to do. A guideline
is that the processor often hides the miss penalty for an L1 cache miss that hits in the L2
cache, but it rarely hides a miss to the L2 cache.

Elaboration: The performance challenge for algorithms is that the memory hierarchy varies
between different implementations of the same architecture in cache size, associativity,
block size, and number of caches. To cope with such variability, some recent numerical
libraries parameterize their algorithms and then search the parameter space at runtime to
find the best combination for a particular computer. This approach is called autotuning.

Check Yourself

Which of the following is generally true about a design with multiple levels of caches?

1. First-level caches are more concerned about hit time, and second-level caches are
more concerned about miss rate.

2. First-level caches are more concerned about miss rate, and second-level caches are
more concerned about hit time.

Summary
In this section, we focused on four topics: cache performance, using associativity to
reduce miss rates, the use of multilevel cache hierarchies to reduce miss penalties, and
software optimizations to improve effectiveness of caches.

The memory system has a significant effect on program execution time. The number of
memory-stall cycles depends on both the miss rate and the miss penalty. The challenge,
as we will see in Section 5.8, is to reduce one of these factors without significantly
affecting other critical factors in the memory hierarchy.

To reduce the miss rate, we examined the use of associative placement schemes. Such
schemes can reduce the miss rate of a cache by allowing more flexible placement of
blocks within the cache. Fully associative schemes allow blocks to be placed anywhere,
but also require that every block in the cache be searched to satisfy a request. The
higher costs make large fully associative caches impractical. Set-associative caches are
a practical alternative, since we need only search among the elements of a unique set
that is chosen by indexing. Set-associative caches have higher miss rates but are faster
to access. The amount of associativity that yields the best performance depends on both
the technology and the details of the implementation.

We looked at multilevel caches as a technique to reduce the miss penalty by allowing a
larger secondary cache to handle misses to the primary cache. Second-level caches have
become commonplace as designers find that limited silicon and the goals of high clock
rates prevent primary caches from becoming large. The secondary cache, which is often
ten or more times larger than the primary cache, handles many accesses that miss in the
primary cache. In such cases, the miss penalty is that of the access time to the secondary
cache (typically < 10 processor cycles) versus the access time to memory (typically > 100
processor cycles). As with
associativity, the design tradeoffs between the size of the secondary cache and its
access time depend on a number of aspects of the implementation.

Finally, given the importance of the memory hierarchy in performance, we
looked at how to change algorithms to improve cache behavior, with blocking
being an important technique when dealing with large arrays.
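
The blocking idea from this section can be checked on a small case. The following is a minimal sketch, not the book's benchmark code: the names dgemm_naive, dgemm_blocked, and blocked_vs_naive_max_diff, and the constants BLOCK and DIM, are illustrative assumptions, and the blocking factor is set to 4 only to keep the example small. It runs the straightforward triple loop and a blocked loop nest on the same column-major inputs and reports how far the results differ.

```c
#define BLOCK 4   /* illustrative blocking factor (assumption, not the text's 32) */
#define DIM   8   /* illustrative matrix dimension; must be a multiple of BLOCK */

/* Straightforward column-major DGEMM: C = C + A * B. */
void dgemm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double cij = C[i + j * n];
            for (int k = 0; k < n; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

/* Multiply one BLOCK-by-BLOCK submatrix pair and accumulate into C. */
static void do_block(int n, int si, int sj, int sk,
                     const double *A, const double *B, double *C)
{
    for (int i = si; i < si + BLOCK; ++i)
        for (int j = sj; j < sj + BLOCK; ++j) {
            double cij = C[i + j * n];
            for (int k = sk; k < sk + BLOCK; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

/* Blocked DGEMM; assumes n is a multiple of BLOCK. */
void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int sj = 0; sj < n; sj += BLOCK)
        for (int si = 0; si < n; si += BLOCK)
            for (int sk = 0; sk < n; sk += BLOCK)
                do_block(n, si, sj, sk, A, B, C);
}

/* Run both versions on the same small integer-valued inputs and return the
   largest absolute difference between the two results. */
double blocked_vs_naive_max_diff(void)
{
    static double A[DIM * DIM], B[DIM * DIM], C1[DIM * DIM], C2[DIM * DIM];
    for (int i = 0; i < DIM * DIM; ++i) {
        A[i] = (double)(i % 7);
        B[i] = (double)(i % 5) - 2.0;
        C1[i] = 0.0;
        C2[i] = 0.0;
    }
    dgemm_naive(DIM, A, B, C1);
    dgemm_blocked(DIM, A, B, C2);

    double worst = 0.0;
    for (int i = 0; i < DIM * DIM; ++i) {
        double d = C1[i] - C2[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Because the test inputs are small integers, every partial sum is exact and the two versions should agree exactly. With general floating-point data, the blocked version adds the products in a different order, so small rounding differences are possible.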

5.5 Dependable Memory Hierarchy

Implicit in all the prior discussion is that the memory hierarchy doesn’t forget. Fast
but undependable is not very attractive. As we learned in Chapter 1, the one great
idea for dependability is redundancy. In this section we’ll first define the terms
and measures associated with failure, and then show how redundancy
can make nearly unforgettable memories.

Defining Failure
We start with an assumption that you have a specification of proper service. Users
can then see a system alternating between two states of delivered service with
respect to the service specification:

1. Service accomplishment, where the service is delivered as specified

2. Service interruption, where the delivered service is different from the
specified service

Transitions from state 1 to state 2 are caused by failures, and transitions from state
2 to state 1 are called restorations. Failures can be permanent or intermittent. The
latter is the more difficult case; it is harder to diagnose the problem when a system
oscillates between the two states. Permanent failures are far easier to diagnose.

This definition leads to two related terms: reliability and availability.
Reliability is a measure of the continuous service accomplishment (or, equivalently,
of the time to failure) from a reference point. Hence, mean time to failure (MTTF)
is a reliability measure. A related term is annual failure rate (AFR), which is just the
percentage of devices that would be expected to fail in a year for a given MTTF.
When MTTF gets large it can be misleading, while AFR leads to better intuition.

EXAMPLE

MTTF vs. AFR of Disks

Some disks today are quoted to have a 1,000,000-hour MTTF. As 1,000,000
hours is 1,000,000/(365 × 24) ≈ 114 years, it would seem like they practically
never fail. Warehouse scale computers that run Internet services such as
Search might have 50,000 servers. Assume each server has 2 disks. Use AFR to
calculate how many disks we would expect to fail per year.

ANSWER

One year is 365 × 24 = 8760 hours. A 1,000,000-hour MTTF means an AFR
of 8760/1,000,000 = 0.876%. With 100,000 disks, we would expect 876 disks to
fail per year, or on average more than 2 disk failures per day!
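
The arithmetic in this answer can be sketched in a few lines of C. The function names annual_failure_rate and expected_failures_per_year are illustrative, and the sketch assumes, as the example does, that AFR is simply the hours in a year divided by the MTTF.

```c
/* Annual failure rate as a fraction (0.00876 means 0.876%), assuming
   AFR = hours per year / MTTF, as in the worked example. */
double annual_failure_rate(double mttf_hours)
{
    const double hours_per_year = 365.0 * 24.0;   /* 8760 hours */
    return hours_per_year / mttf_hours;
}

/* Expected number of device failures per year across a fleet. */
double expected_failures_per_year(double mttf_hours, long devices)
{
    return annual_failure_rate(mttf_hours) * (double)devices;
}
```

Calling expected_failures_per_year(1000000.0, 100000) reproduces the example's fleet of 50,000 servers with 2 disks each.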

Service interruption is measured as mean time to repair (MTTR). Mean time
between failures (MTBF) is simply the sum of MTTF + MTTR. Although MTBF
is widely used, MTTF is often the more appropriate term. Availability is then a
measure of service accomplishment with respect to the alternation between the two
states of accomplishment and interruption. Availability is statistically quantified as

$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$

Note that reliability and availability are actually quantifiable measures, rather than
just synonyms for dependability. Shrinking MTTR can help availability as much as
increasing MTTF. For example, tools for fault detection, diagnosis, and repair can
help reduce the time to repair faults and thereby improve availability.
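
As a small illustration of the availability formula, the sketch below computes availability from MTTF and MTTR. The function name availability and the sample values in the note that follows are hypothetical, chosen only to show that halving MTTR has the same effect as doubling MTTF.

```c
/* Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit. */
double availability(double mttf, double mttr)
{
    return mttf / (mttf + mttr);
}
```

With a hypothetical MTTF of 1000 hours and MTTR of 10 hours, availability(2000.0, 10.0) and availability(1000.0, 5.0) give the same value, since 2000/2010 and 1000/1005 are the same fraction.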

We want availability to be very high. One shorthand is to quote the number of
“nines of availability” per year. For example, a very good Internet service today
offers 4 or 5 nines of availability. Given 365 days per year, which is 365 × 24 ×
60 = 525,600 ≈ 526,000 minutes, then the shorthand is decoded as follows:

One nine: 90% => 36.5 days of repair/year
Two nines: 99% => 3.65 days of repair/year
Three nines: 99.9% => 526 minutes of repair/year
Four nines: 99.99% => 52.6 minutes of repair/year
Five nines: 99.999% => 5.26 minutes of repair/year

and so on.
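
The decoding of the "nines" shorthand above can be sketched as follows. The function name downtime_minutes_per_year is illustrative; the sketch assumes a 365-day year, as the text does.

```c
/* Allowed repair time in minutes per 365-day year for a given number of
   nines of availability; nines = 3 means 99.9% availability. */
double downtime_minutes_per_year(int nines)
{
    double unavailable = 1.0;
    for (int i = 0; i < nines; ++i)
        unavailable /= 10.0;                     /* 0.1, 0.01, 0.001, ... */
    return unavailable * 365.0 * 24.0 * 60.0;    /* 525,600 minutes per year */
}
```

For three nines this gives about 525.6 minutes, which the list above rounds to 526; dividing the one-nine result by 24 × 60 gives the 36.5 days in the first entry.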
To increase MTTF, you can improve the quality of the components or design
systems to continue operation in the presence of components that have failed.
Hence, failure needs to be defined with respect to a context, as failure of a component
may not lead to a failure of the system. To make this distinction clear, the term fault
is used to mean failure of a component. Here are three ways to improve MTTF:

1. Fault avoidance: Preventing fault occurrence by construction.

2. Fault tolerance: Using redundancy to allow the service to comply with the
service specification despite faults occurring.

3. Fault forecasting: Predicting the presence and creation of faults, allowing
the component to be replaced before it fails.


The Hamming Single Error Correcting, Double Error
Detecting Code (SEC/DED)
Richard Hamming invented a popular redundancy scheme for memory, for which
he received the Turing Award in 1968. To invent redundant codes, it is helpful
to talk about how “close” correct bit patterns can be. What we call the Hamming
distance is just the minimum number of bits that are different between any two
correct bit patterns. For example, the distance between 011011 and 001111 is two.
What happens if the minimum distance between members of a code is two, and
we get a one-bit error? It will turn a valid pattern in a code to an invalid one. Thus,
if we can detect whether members of a code are valid or not, we can detect single
bit errors, and can say we have a single bit error detection code.
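
A minimal sketch of the distance calculation described above, with illustrative function names hamming_distance and min_distance. The distance between two patterns is the number of 1s in their exclusive OR, and the distance of a code is the smallest distance over all pairs of valid patterns.

```c
#include <stddef.h>

/* Number of bit positions in which two patterns differ. */
unsigned hamming_distance(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;    /* 1 bits mark the positions that differ */
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;     /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Minimum distance over all pairs of valid patterns in a code.
   Returns 0 if the code has fewer than two members. */
unsigned min_distance(const unsigned *code, size_t members)
{
    unsigned best = 0;
    for (size_t i = 0; i < members; ++i)
        for (size_t j = i + 1; j < members; ++j) {
            unsigned d = hamming_distance(code[i], code[j]);
            if (best == 0 || d < best)
                best = d;
        }
    return best;
}
```

For the patterns in the text, hamming_distance(0x1B, 0x0F) compares 011011 with 001111. The four 3-bit patterns with an even number of 1s (000, 011, 101, 110) form a code whose minimum distance is two, so any single bit error produces an invalid pattern.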

Hamming used a parity code for error detection. In a parity code, the number
of 1s in a word is counted; the word has odd parity if the number of 1s is odd and
even otherwise. When a word is written into memory, the parity bit is also written
(1 for odd, 0 for even). That is, the parity of the N+1 bit word should always be even.
Then, when the word is read out, the parity bit is read and checked. If the parity of the
memory word and the stored parity bit do not match, an error has occurred.

EXAMPLE

Calculate the parity of a byte with the value 31ten and show the pattern stored to
memory. Assume the parity bit is on the right. Suppose the most significant bit
was inverted in memory, and then you read it back. Did you detect the error?
What happens if the two most significant bits are inverted?

ANSWER

31ten is 00011111two, which has five 1s. To make parity even, we need to write a 1
in the parity bit, or 000111111two. If the most significant bit is inverted when we
read it back, we would see 100111111two which has seven 1s. Since we expect
even parity and calculated odd parity, we would signal an error. If the two most
significant bits are inverted, we would see 110111111two which has eight 1s or
even parity and we would not signal an error.

If there are 2 bits of error, then a 1-bit parity scheme will not detect any errors,
since the parity will match the data with two errors. (Actually, a 1-bit parity scheme
can detect any odd number of errors; however, the probability of having 3 errors is
much lower than the probability of having two, so, in practice, a 1-bit parity code is
limited to detecting a single bit of error.)
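
The parity example can be sketched in C as follows. The function names odd_ones, encode_even_parity, and parity_error are illustrative; as in the example, the parity bit is placed on the right of the byte, giving a 9-bit stored word.

```c
#include <stdint.h>

/* Returns 1 if the value contains an odd number of 1 bits, else 0. */
unsigned odd_ones(unsigned value)
{
    unsigned parity = 0;
    while (value != 0) {
        parity ^= value & 1u;
        value >>= 1;
    }
    return parity;
}

/* Append an even-parity bit on the right of a byte, giving a 9-bit word. */
unsigned encode_even_parity(uint8_t byte)
{
    return ((unsigned)byte << 1) | odd_ones(byte);
}

/* Returns 1 if a stored 9-bit word fails the even-parity check. */
unsigned parity_error(unsigned word9)
{
    return odd_ones(word9 & 0x1FFu);
}
```

Encoding 31 gives 000111111. Flipping the leftmost bit of that word (XOR with 0x100) makes parity_error report an error, while flipping the two leftmost bits (XOR with 0x180) leaves the parity even and goes undetected.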

error detection code A code that enables the detection of an error in data, but not
the precise location and, hence, correction of the error.

Of course, a parity code cannot correct errors, which Hamming wanted to do
as well as detect them. If we used a code that had a minimum distance of 3, then
any single bit error would be closer to the correct pattern than to any other valid
pattern. He came up with an easy to understand mapping of data into a distance 3
code that we call Hamming Error Correction Code (ECC) in his honor. We use extra
parity bits to allow the position identification of a single error. Here are the steps to
calculate Hamming ECC:

1. Start numbering bits from 1 on the left, as opposed to the traditional
numbering of the rightmost bit being 0.

2. Mark all bit positions that are powers of 2 as parity bits (positions 1, 2, 4, 8,
16, …).

3. All other bit positions are used for data bits (positions 3, 5, 6, 7, 9, 10, 11, 12,
13, 14, 15, …).

4. The position of each parity bit determines the sequence of bits that it checks
(Figure 5.24 shows this coverage graphically):

■ Bit 1 (0001two) checks bits (1,3,5,7,9,11,…), which are the bits where the rightmost
bit of the address is 1 (0001two, 0011two, 0101two, 0111two, 1001two, 1011two,…).

■ Bit 2 (0010two) checks bits (2,3,6,7,10,11,14,15,…), which are the bits
where the second bit to the right in the address is 1.

■ Bit 4 (0100two) checks bits (4–7, 12–15, 20–23,…), which are the bits where
the third bit to the right in the address is 1.

■ Bit 8 (1000two) checks bits (8–15, 24–31, 40–47,…), which are the bits
where the fourth bit to the right in the address is 1.

Note that each data bit is covered by two or more parity bits.

5. Set parity bits to create even parity for each group.

Bit position         1   2   3   4   5   6   7   8   9  10  11  12
Encoded data bits   p1  p2  d1  p4  d2  d3  d4  p8  d5  d6  d7  d8
p1 coverage          X       X       X       X       X       X
p2 coverage              X   X           X   X           X   X
p4 coverage                      X   X   X   X                   X
p8 coverage                                      X   X   X   X   X

FIGURE 5.24 Parity bits, data bits, and field coverage in a Hamming ECC code for
eight data bits.

In what seems like a magic trick, you can then determine whether bits are
incorrect by looking at the parity bits. Using the 12 bit code in Figure 5.24, if the
value of the four parity calculations (p8,p4,p2,p1) was 0000, then there was no
error. However, if the pattern was, say, 1010, which is 10ten, then Hamming ECC
tells us that bit 10 (d6) is an error. Since the number is binary, we can correct the
error just by inverting the value of bit 10.


EXAMPLE

Assume one byte data value is 10011010two. First show the Hamming ECC code
for that byte, and then invert bit 10 and show that the ECC code finds and
corrects the single bit error.

ANSWER

Leaving spaces for the parity bits, the 12 bit pattern is _ _ 1 _ 0 0 1 _ 1 0 1 0.
Position 1 checks bits 1,3,5,7,9, and 11 of _ _ 1 _ 0 0 1 _ 1 0 1 0, which hold
four 1s. To make the group even parity, we should set bit 1 to 0.
Position 2 checks bits 2,3,6,7,10,11, which is 0 _ 1 _ 0 0 1 _ 1 0 1 0 or odd parity,
so we set position 2 to a 1.
Position 4 checks bits 4,5,6,7,12, which is 0 1 1 _ 0 0 1 _ 1 0 1 0 or odd parity,
so we set it to a 1.
Position 8 checks bits 8,9,10,11,12, which is 0 1 1 1 0 0 1 _ 1 0 1 0 or even parity,
so we set it to a 0.
The final code word is 011100101010. Inverting bit 10 changes it to
011100101110.
Parity bit 1 is 0 (bits 1,3,5,7,9,11 of 011100101110 hold four 1s, so even parity;
this group is OK).
Parity bit 2 is 1 (bits 2,3,6,7,10,11 of 011100101110 hold five 1s, so odd parity;
there is an error somewhere).
Parity bit 4 is 1 (bits 4,5,6,7,12 of 011100101110 hold two 1s, so even parity;
this group is OK).
Parity bit 8 is 0 (bits 8,9,10,11,12 of 011100101110 hold three 1s, so odd parity;
there is an error somewhere).
Parity bits 2 and 8 are incorrect. As 2 + 8 = 10, bit 10 must be wrong. Hence,
we can correct the error by inverting bit 10: 011100101010. Voila!
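
The worked example can be reproduced with a short sketch. The names hamming12_encode, hamming12_syndrome, and hamming12_correct are illustrative, and the choice to hold the 12-bit code word in an unsigned integer with position 1 as the leftmost (most significant) of the 12 bits is an assumption made to match the text's numbering.

```c
#define CODE_BITS 12u

/* Bit at position pos, numbering 1..12 from the left as in the text. */
static unsigned bit_at(unsigned code, unsigned pos)
{
    return (code >> (CODE_BITS - pos)) & 1u;
}

/* Parity (1 = odd) of the group checked by parity position p (1, 2, 4, or 8). */
static unsigned group_parity(unsigned code, unsigned p)
{
    unsigned parity = 0;
    for (unsigned pos = 1; pos <= CODE_BITS; ++pos)
        if (pos & p)
            parity ^= bit_at(code, pos);
    return parity;
}

/* Place 8 data bits in the non-power-of-2 positions, then set each parity
   bit so that its group has even parity. */
unsigned hamming12_encode(unsigned data8)
{
    unsigned code = 0;
    int next = 7;   /* index of the next data bit, most significant first */
    for (unsigned pos = 1; pos <= CODE_BITS; ++pos) {
        int is_parity_position = (pos & (pos - 1)) == 0;
        if (!is_parity_position) {
            code |= ((data8 >> next) & 1u) << (CODE_BITS - pos);
            --next;
        }
    }
    for (unsigned p = 1; p <= CODE_BITS; p <<= 1)
        if (group_parity(code, p))
            code |= 1u << (CODE_BITS - p);
    return code;
}

/* Sum of the parity positions whose groups have odd parity.
   0 means no error was detected; otherwise it is the position in error. */
unsigned hamming12_syndrome(unsigned code)
{
    unsigned syndrome = 0;
    for (unsigned p = 1; p <= CODE_BITS; p <<= 1)
        if (group_parity(code, p))
            syndrome += p;
    return syndrome;
}

/* Return the code word with a single-bit error (if any) inverted back. */
unsigned hamming12_correct(unsigned code)
{
    unsigned s = hamming12_syndrome(code);
    if (s >= 1 && s <= CODE_BITS)
        code ^= 1u << (CODE_BITS - s);
    return code;
}
```

Here hamming12_encode(0x9A) encodes 10011010, and XOR with 0x004 inverts bit 10 (counting from the left of the 12-bit word). The sketch handles only single-bit errors; two errors would be miscorrected, which is what the extra parity bit discussed next addresses.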

Hamming did not stop at single bit error correction code. At the cost of one more
bit, we can make the minimum Hamming distance in a code be 4. This means
we can correct single bit errors and detect double bit errors. The idea is to add a
parity bit that is calculated over the whole word. Let’s use a four-bit data word as
an example, which would only need 7 bits for single bit error correction. Hamming
parity bits H (p1 p2 p3) are computed (even parity as usual) plus the even parity
over the entire word, p4:

1  2  3  4  5  6  7  8
p1 p2 d1 p3 d2 d3 d4 p4

Then the algorithm to correct one error and detect two is just to calculate parity
over the ECC groups (H) as before plus one more over the whole group (p4). There
are four cases:

1. H is even and p4 is even, so no error occurred.

2. H is odd and p4 is odd, so a correctable single error occurred. (p4 should
calculate odd parity if one error occurred.)

3. H is even and p4 is odd, a single error occurred in p4 bit, not in the rest of the
word, so correct the p4 bit.

4. H is odd and p4 is even, a double error occurred. (p4 should calculate even
parity if two errors occurred.)
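
The four cases can be sketched for the 8-bit layout p1 p2 d1 p3 d2 d3 d4 p4. All names below (secded_encode, secded_check, secded_repair, and the enum values) are illustrative, and storing the word in an unsigned integer with position 1 as the leftmost of the 8 bits is an assumption made to match the text's numbering.

```c
#include <stddef.h>

enum secded_status {
    SECDED_NO_ERROR,          /* case 1: H even, p4 even */
    SECDED_SINGLE_CORRECTED,  /* case 2: H odd,  p4 odd  */
    SECDED_P4_CORRECTED,      /* case 3: H even, p4 odd  */
    SECDED_DOUBLE_DETECTED    /* case 4: H odd,  p4 even */
};

/* Bit at position pos of an 8-bit word, numbering 1..8 from the left. */
static unsigned bit_at8(unsigned word, unsigned pos)
{
    return (word >> (8u - pos)) & 1u;
}

/* Hamming syndrome over positions 1..7: the sum of the parity positions
   (1, 2, 4) whose groups have odd parity; 0 corresponds to "H is even". */
static unsigned syndrome7(unsigned word)
{
    unsigned s = 0;
    for (unsigned p = 1; p <= 4; p <<= 1) {
        unsigned parity = 0;
        for (unsigned pos = 1; pos <= 7; ++pos)
            if (pos & p)
                parity ^= bit_at8(word, pos);
        if (parity)
            s += p;
    }
    return s;
}

/* Parity of all eight bits, including p4; 1 means odd. */
static unsigned overall_parity8(unsigned word)
{
    unsigned parity = 0;
    for (unsigned pos = 1; pos <= 8; ++pos)
        parity ^= bit_at8(word, pos);
    return parity;
}

/* Encode data bits d1..d4 (d1 most significant) as p1 p2 d1 p3 d2 d3 d4 p4. */
unsigned secded_encode(unsigned data4)
{
    unsigned word = 0;
    word |= ((data4 >> 3) & 1u) << 5;   /* d1 at position 3 */
    word |= ((data4 >> 2) & 1u) << 3;   /* d2 at position 5 */
    word |= ((data4 >> 1) & 1u) << 2;   /* d3 at position 6 */
    word |= (data4 & 1u) << 1;          /* d4 at position 7 */
    unsigned s = syndrome7(word);       /* groups that are currently odd */
    for (unsigned p = 1; p <= 4; p <<= 1)
        if (s & p)
            word |= 1u << (8u - p);     /* set p1, p2, p3 for even parity */
    if (overall_parity8(word))
        word |= 1u;                     /* p4 at position 8 */
    return word;
}

/* Classify a stored word; if corrected is not NULL, it receives the word
   after any single-bit correction (unchanged for a double error). */
enum secded_status secded_check(unsigned word, unsigned *corrected)
{
    unsigned s = syndrome7(word);
    unsigned odd = overall_parity8(word);
    unsigned fixed = word;
    enum secded_status status;

    if (s == 0 && !odd) {
        status = SECDED_NO_ERROR;
    } else if (s != 0 && odd) {
        fixed ^= 1u << (8u - s);        /* invert the bit the syndrome points to */
        status = SECDED_SINGLE_CORRECTED;
    } else if (s == 0 && odd) {
        fixed ^= 1u;                    /* only p4 itself is wrong */
        status = SECDED_P4_CORRECTED;
    } else {
        status = SECDED_DOUBLE_DETECTED;
    }
    if (corrected != NULL)
        *corrected = fixed;
    return status;
}

/* Convenience wrapper: return the word after any single-bit correction. */
unsigned secded_repair(unsigned word)
{
    unsigned fixed = word;
    (void)secded_check(word, &fixed);
    return fixed;
}
```

Inverting any one bit of an encoded word should land in case 2 or case 3 and be repaired, while inverting any two bits should land in case 4, where the error is reported but not corrected.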

Single Error Correcting / Double Error Detecting (SEC/DED) is common in
memory for servers today. Conveniently, eight byte data blocks can get SEC/DED
with just one more byte, which is why many DIMMs are 72 bits wide.

Elaboration: To calculate how many bits are needed for SEC, let p be the total number of
parity bits and d the number of data bits in a p + d bit word. If p error correction bits are to
point to the error bit (p + d cases) plus one case to indicate that no error exists, we need:

$2^p \ge p + d + 1$ bits, and thus $p \ge \log_2(p + d + 1)$.

For example, 8 bits of data means d = 8 and $2^p \ge p + 8 + 1$, so p = 4. Similarly,
p = 5 for 16 bits of data, 6 for 32 bits, 7 for 64 bits, and so on.
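
The inequality in this Elaboration can be solved by a simple search. The function name sec_parity_bits is illustrative; it returns the smallest p for which 2 to the power p is at least p + d + 1.

```c
/* Smallest number of parity bits p such that 2^p >= p + d + 1,
   where d is the number of data bits. */
unsigned sec_parity_bits(unsigned d)
{
    unsigned p = 0;
    while ((1ul << p) < (unsigned long)p + d + 1ul)
        ++p;
    return p;
}
```

For 64 data bits this gives 7; adding the one extra whole-word parity bit for double error detection gives 8 check bits, which matches the 72-bit DIMM width mentioned earlier.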

Elaboration: In very large systems, the possibility of multiple errors as well as
complete failure of a single wide memory chip becomes significant. IBM introduced
chipkill to solve this problem, and many very large systems use this technology. (Intel
calls their version SDDC.) Similar in nature to the RAID approach used for disks (see
Section 5.11), Chipkill distributes the data and ECC information, so that the complete
failure of a single memory chip can be handled by supporting the reconstruction of the
missing data from the remaining memory chips. Assuming a 10,000-processor cluster
with 4 GiB per processor, IBM calculated the following rates of unrecoverable memory
errors in three years of operation:

■ Parity only—about 90,000, or one unrecoverable (or undetected) failure every 17
minutes.

■ SEC/DED only—about 3500, or about one undetected or unrecoverable failure
every 7.5 hours.

■ Chipkill—6, or about one undetected or unrecoverable failure every 2 months.

Hence, Chipkill is a requirement for warehouse-scale computers.

Elaboration: While single or double bit errors are typical for memory systems, networks
can have bursts of bit errors. One solution is called Cyclic Redundancy Check. For a
block of k bits, a transmitter generates an n-k bit frame check sequence. It transmits
n bits exactly divisible by some number. The receiver divides the frame by that number. If
there is no remainder, it assumes there is no error. If there is, the receiver rejects the
message, and asks the transmitter to send again. As you might guess from Chapter 3,
it is easy to calculate division for some binary numbers with a shift register, which made
CRC codes popular even when hardware was more precious. Going even further,
Reed-Solomon codes use Galois fields to correct multibit transmission errors, but now data is
considered coefficients of a polynomial and the code space is values of a polynomial.
The Reed-Solomon calculation is considerably more complicated than binary division!
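
The transmit-and-divide scheme can be sketched with a small CRC. This is an illustration under stated assumptions, not a particular network standard: it uses an 8-bit frame check sequence with the generator polynomial x^8 + x^2 + x + 1 (0x07), an initial remainder of 0, and no final XOR, and the names crc8, crc8_make_frame, and crc8_frame_ok are hypothetical. The inner loop mimics a shift register doing modulo-2 division one bit at a time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Remainder of the message (shifted left by 8 bits) divided by the generator
   x^8 + x^2 + x + 1, using modulo-2 arithmetic, processed bit by bit. */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t rem = 0;
    for (size_t i = 0; i < len; ++i) {
        rem ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            if (rem & 0x80u)
                rem = (uint8_t)((rem << 1) ^ 0x07u);   /* subtract the generator */
            else
                rem = (uint8_t)(rem << 1);
        }
    }
    return rem;
}

/* Transmitter: copy the message and append the 1-byte frame check sequence.
   The frame buffer must have room for len + 1 bytes. */
void crc8_make_frame(const uint8_t *msg, size_t len, uint8_t *frame)
{
    memcpy(frame, msg, len);
    frame[len] = crc8(msg, len);
}

/* Receiver: accept the frame only if dividing it leaves no remainder. */
int crc8_frame_ok(const uint8_t *frame, size_t frame_len)
{
    return crc8(frame, frame_len) == 0;
}
```

With these parameters, a frame built by crc8_make_frame divides evenly, so crc8_frame_ok accepts it; inverting any single bit of the frame leaves a nonzero remainder and the frame is rejected.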


5.6 Virtual Machines

Virtual Machines (VM) were first developed in the mid-1960s, and they have
remained an important part of mainframe computing over the years. Although
largely ignored in the single user PC era in the 1980s and 1990s, they have recently
gained popularity due to

■ The increasing importance of isolation and security in modern systems

■ The failures in security and reliability of standard operating systems

■ The sharing of a single computer among many unrelated users, in particular
for cloud computing

■ The dramatic increases in raw speed of processors over the decades, which
makes the overhead of VMs more acceptable

The broadest definition of VMs includes basically all emulation methods that
provide a standard software interface, such as the Java VM. In this section, we are
interested in VMs that provide a complete system-level environment at the binary
instruction set architecture (ISA) level. Although some VMs run different ISAs in
the VM from the native hardware, we assume they always match the hardware. Such
VMs are called (Operating) System Virtual Machines. IBM VM/370, VirtualBox,
VMware ESX Server, and Xen are examples.

System virtual machines present the illusion that the users have an entire
computer to themselves, including a copy of the operating system. A single
computer runs multiple VMs and can support a number of different operating
systems (OSes). On a conventional platform, a single OS “owns” all the hardware
resources, but with a VM, multiple OSes all share the hardware resources.

The software that supports VMs is called a virtual machine monitor (VMM) or
hypervisor; the VMM is the heart of virtual machine technology. The underlying
hardware platform is called the host, and its resources are shared among the guest
VMs. The VMM determines how to map virtual resources to physical resources: a
physical resource may be time-shared, partitioned, or even emulated in software.
The VMM is much smaller than a traditional OS; the isolation portion of a VMM
is perhaps only 10,000 lines of code.

Although our interest here is in VMs for improving protection, VMs provide
two other benefits that are commercially significant:

1. Managing software. VMs provide an abstraction that can run the complete
software stack, even including old operating systems like DOS. A typical
deployment might be some VMs running legacy OSes, many running the
current stable OS release, and a few testing the next OS release.

2. Managing hardware. One reason for multiple servers is to have each
application running with the compatible version of the operating system
on separate computers, as this separation can improve dependability. VMs
allow these separate software stacks to run independently yet share hardware,
thereby consolidating the number of servers. Another example is that some
VMMs support migration of a running VM to a different computer, either
to balance load or to evacuate from failing hardware.

Hardware/Software Interface

Amazon Web Services (AWS) uses the virtual machines in its cloud computing
offering EC2 for five reasons:

1. It allows AWS to protect users from each other while sharing the same server.

2. It simplifies software distribution within a warehouse scale computer. A
customer installs a virtual machine image configured with the appropriate
software, and AWS distributes it to all the instances a customer wants to use.

3. Customers (and AWS) can reliably “kill” a VM to control resource usage
when customers complete their work.

4. Virtual machines hide the identity of the hardware on which the customer is
running, which means AWS can keep using old servers and introduce new,
more efficient servers. The customer expects performance for instances to
match their ratings in “EC2 Compute Units,” which AWS defines: to “provide
the equivalent CPU capacity of a 1.0–1.2 GHz 2007 AMD Opteron or 2007
Intel Xeon processor.” Thanks to Moore’s Law, newer servers clearly offer
more EC2 Compute Units than older ones, but AWS can keep renting old
servers as long as they are economical.

5. Virtual Machine Monitors can control the rate that a VM uses the processor,
the network, and disk space, which allows AWS to offer many price points
of instances of different types running on the same underlying servers.
For example, in 2012 AWS offered 14 instance types, from small standard
instances at $0.08 per hour to high I/O quadruple extra large instances at
$3.10 per hour.

Hardware/Software Interface

In general, the cost of processor virtualization depends on the workload.
User-level processor-bound programs have essentially zero virtualization overhead,
because the OS is rarely invoked, so nearly everything runs at native speed.
I/O-intensive workloads are generally also OS-intensive, executing many system
calls and privileged instructions that can result in high virtualization overhead.
On the other hand, if the I/O-intensive workload is also I/O-bound, the cost of
processor virtualization can be completely hidden, since the processor is often
idle waiting for I/O.

The overhead is determined by both the number of instructions that must be
emulated by the VMM and how long each takes to emulate. Hence, when the guest
VMs run the same ISA as the host, as we assume here, the goal of the architecture
and the VMM is to run almost all instructions directly on the native hardware.

Requirements of a Virtual Machine Monitor
What must a VM monitor do? It presents a software interface to guest software, it
must isolate the state of guests from each other, and it must protect itself from guest
software (including guest OSes). The qualitative requirements are:

■ Guest software should behave on a VM exactly as if it were running on the
native hardware, except for performance-related behavior or limitations of
fixed resources shared by multiple VMs.

■ Guest software should not be able to change allocation of real system resources
directly.

To “virtualize” the processor, the VMM must control just about everything—access
to privileged state, I/O, exceptions, and interrupts—even though the guest VM and
OS currently running are temporarily using them.

For example, in the case of a timer interrupt, the VMM would suspend the
currently running guest VM, save its state, handle the interrupt, determine which
guest VM to run next, and then load its state. Guest VMs that rely on a timer
interrupt are provided with a virtual timer and an emulated timer interrupt by the
VMM.
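
The timer-interrupt sequence described above can be sketched as a toy round-robin scheduler. This is only an illustration under simplifying assumptions, not how any particular VMM is built: the names `GuestVM`, `ToyVMM`, and `on_timer_interrupt` are hypothetical, and a guest's entire processor state is reduced to a single program counter.

```python
from dataclasses import dataclass

@dataclass
class GuestVM:
    name: str
    saved_pc: int = 0             # stands in for the full saved processor state
    virtual_timer_ticks: int = 0  # emulated timer interrupts seen by this guest

class ToyVMM:
    """Hypothetical round-robin VMM: one physical processor, several guests."""

    def __init__(self, guests):
        self.guests = list(guests)
        self.current = 0                        # index of the guest now running
        self.cpu_pc = self.guests[0].saved_pc   # state loaded on the real processor

    def on_timer_interrupt(self):
        running = self.guests[self.current]
        running.saved_pc = self.cpu_pc          # suspend the guest and save its state
        running.virtual_timer_ticks += 1        # emulate the timer interrupt for it
        self.current = (self.current + 1) % len(self.guests)  # pick the next guest
        self.cpu_pc = self.guests[self.current].saved_pc      # load its state
        return self.guests[self.current].name

vmm = ToyVMM([GuestVM("A", saved_pc=100), GuestVM("B", saved_pc=200)])
vmm.cpu_pc = 104                  # guest A made some progress before the interrupt
next_guest = vmm.on_timer_interrupt()
```

After the call, guest B's saved state is on the processor and guest A's progress is preserved for its next turn.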

To be in charge, the VMM must be at a higher privilege level than the guest
VM, which generally runs in user mode; this also ensures that the execution of
any privileged instruction will be handled by the VMM. The basic requirements of
system virtual machines are:

■ At least two processor modes, system and user.

■ A privileged subset of instructions that is available only in system mode,
resulting in a trap if executed in user mode; all system resources must be
controllable only via these instructions.

(Lack of) Instruction Set Architecture Support for Virtual
Machines
If VMs are planned for during the design of the ISA, it’s relatively easy both to
reduce the number of instructions that must be executed by a VMM and to improve
their emulation speed. An architecture that allows the VM to execute directly on
the hardware earns the title virtualizable, and the IBM 370 architecture proudly
bears that label.

Alas, since VMs have been considered for PC and server applications only fairly
recently, most instruction sets were created without virtualization in mind. These
culprits include x86 and most RISC architectures, including ARMv7 and MIPS.

Because the VMM must ensure that the guest system only interacts with virtual
resources, a conventional guest OS runs as a user mode program on top of the
VMM. Then, if a guest OS attempts to access or modify information related to
hardware resources via a privileged instruction (for example, reading or writing
a status bit that enables interrupts), it will trap to the VMM. The VMM can then
effect the appropriate changes to corresponding real resources.

Hence, if any instruction that tries to read or write such sensitive information
traps when executed in user mode, the VMM can intercept it and support a virtual
version of the sensitive information, as the guest OS expects.

In the absence of such support, other measures must be taken. A VMM must
take special precautions to locate all problematic instructions and ensure that they
behave correctly when executed by a guest OS, thereby increasing the complexity
of the VMM and reducing the performance of running the VM.

Protection and Instruction Set Architecture
Protection is a joint effort of architecture and operating systems, but architects
had to modify some awkward details of existing instruction set architectures when
virtual memory became popular.

For example, the x86 instruction POPF loads the flag registers from the top of
the stack in memory. One of the flags is the Interrupt Enable (IE) flag. If you run
the POPF instruction in user mode, rather than trapping, it simply changes all the
flags except IE. In system mode, it does change the IE. Since a guest OS runs in user
mode inside a VM, this is a problem, as it expects to see a changed IE.
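
The POPF behavior can be made concrete with a small model. This is a deliberately simplified sketch: the function `popf` and the 16-bit flags mask are illustrative assumptions, and real x86 flag handling has more cases than shown. The one hardware fact relied on is that x86 keeps its interrupt-enable flag in bit 9 of the flags register.

```python
IE = 1 << 9          # interrupt-enable flag: bit 9 of the x86 flags register
ALL_FLAGS = 0xFFFF   # simplification: treat the low 16 bits as "the flags"

def popf(flags, stack_top, system_mode):
    """Return the new flags value after a simplified POPF."""
    if system_mode:
        return stack_top & ALL_FLAGS            # every flag, including IE, is loaded
    # User mode: no trap occurs; all flags are loaded except IE, which is kept.
    return (stack_top & ALL_FLAGS & ~IE) | (flags & IE)

old_flags = IE | 0x1      # interrupts enabled, carry flag set
wanted = 0x0              # the guest OS wants to clear every flag, including IE

in_system = popf(old_flags, wanted, system_mode=True)
in_user = popf(old_flags, wanted, system_mode=False)
```

In this model `in_system` has IE cleared, while `in_user` silently keeps IE set. Because no trap occurs, the VMM never gets a chance to update the guest's virtual IE.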

Historically, IBM mainframe hardware and VMM took three steps to improve
performance of virtual machines:

1. Reduce the cost of processor virtualization.

2. Reduce interrupt overhead cost due to the virtualization.

3. Reduce interrupt cost by steering interrupts to the proper VM without
invoking VMM.

AMD and Intel tried to address the first point in 2006 by reducing the cost of
processor virtualization. It will be interesting to see how many generations of
architecture and VMM modifications it will take to address all three points, and
how long before virtual machines of the 21st century will be as efficient as the IBM
mainframes and VMMs of the 1970s.

5.7 Virtual Memory

… a system has been devised to make the core drum combination appear to the
programmer as a single level store, the requisite transfers taking place
automatically.
Kilburn et al., One-level storage system, 1962

In earlier sections, we saw how caches provided fast access to recently used portions
of a program’s code and data. Similarly, the main memory can act as a “cache” for
the secondary storage, usually implemented with magnetic disks. This technique is
called virtual memory. Historically, there were two major motivations for virtual
memory: to allow efficient and safe sharing of memory among multiple programs,
such as for the memory needed by multiple virtual machines for cloud computing,
and to remove the programming burdens of a small, limited amount of main
memory. Five decades after its invention, it’s the former reason that reigns today.

Of course, to allow multiple virtual machines to share the same memory, we
must be able to protect the virtual machines from each other, ensuring that a
program can only read and write the portions of main memory that have been
assigned to it. Main memory need contain only the active portions of the many
virtual machines, just as a cache contains only the active portion of one program.
Thus, the principle of locality enables virtual memory as well as caches, and virtual
memory allows us to efficiently share the processor as well as the main memory.

We cannot know which virtual machines will share the memory with other
virtual machines when we compile them. In fact, the virtual machines sharing
the memory change dynamically while the virtual machines are running. Because
of this dynamic interaction, we would like to compile each program into its
own address space—a separate range of memory locations accessible only to this
program. Virtual memory implements the translation of a program’s address space
to physical addresses. This translation process enforces protection of a program’s
address space from other virtual machines.

The second motivation for virtual memory is to allow a single user program
to exceed the size of primary memory. Formerly, if a program became too large
for memory, it was up to the programmer to make it fit. Programmers divided
programs into pieces and then identified the pieces that were mutually exclusive.
These overlays were loaded or unloaded under user program control during
execution, with the programmer ensuring that the program never tried to access
an overlay that was not loaded and that the overlays loaded never exceeded the
total size of the memory. Overlays were traditionally organized as modules, each
containing both code and data. Calls between procedures in different modules
would lead to overlaying of one module with another.

As you can well imagine, this responsibility was a substantial burden on
programmers. Virtual memory, which was invented to relieve programmers of
this difficulty, automatically manages the two levels of the memory hierarchy
represented by main memory (sometimes called physical memory to distinguish it
from virtual memory) and secondary storage.

Although the concepts at work in virtual memory and in caches are the same,
their differing historical roots have led to the use of different terminology. A virtual
memory block is called a page, and a virtual memory miss is called a page fault.
With virtual memory, the processor produces a virtual address, which is translated
by a combination of hardware and software to a physical address, which in turn can
be used to access main memory. Figure 5.25 shows the virtually addressed memory
with pages mapped to main memory. This process is called address mapping or
address translation. Today, the two memory hierarchy levels controlled by virtual
memory are usually DRAMs and flash memory in personal mobile devices and
DRAMs and magnetic disks in servers (see Section 5.2). If we return to our library
analogy, we can think of a virtual address as the title of a book and a physical
address as the location of that book in the library, such as might be given by the
Library of Congress call number.

virtual memory A technique that uses main memory as a “cache” for secondary
storage.

physical address An address in main memory.

protection A set of mechanisms for ensuring that multiple processes sharing the
processor, memory, or I/O devices cannot interfere, intentionally or
unintentionally, with one another by reading or writing each other’s data. These
mechanisms also isolate the operating system from a user process.

page fault An event that occurs when an accessed page is not present in main
memory.

virtual address An address that corresponds to a location in virtual space and is
translated by address mapping to a physical address when memory is accessed.

Virtual memory also simplifies loading the program for execution by providing
relocation. Relocation maps the virtual addresses used by a program to different
physical addresses before the addresses are used to access memory. This relocation
allows us to load the program anywhere in main memory. Furthermore, all virtual
memory systems in use today relocate the program as a set of fixed-size blocks
(pages), thereby eliminating the need to find a contiguous block of memory to
allocate to a program; instead, the operating system need only find a sufficient
number of pages in main memory.

In virtual memory, the address is broken into a virtual page number and a page
offset. Figure 5.26 shows the translation of the virtual page number to a physical
page number. The physical page number constitutes the upper portion of the
physical address, while the page offset, which is not changed, constitutes the lower
portion. The number of bits in the page offset field determines the page size. The
number of pages addressable with the virtual address need not match the number
of pages addressable with the physical address. Having a larger number of virtual
pages than physical pages is the basis for the illusion of an essentially unbounded
amount of virtual memory.
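
The split into page number and offset is plain bit manipulation. The sketch below assumes 4 KiB pages (a 12-bit offset, as in Figure 5.26); the function names and the physical page number `0x2F` are made up for illustration.

```python
PAGE_OFFSET_BITS = 12                  # 4 KiB pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

def split_virtual_address(va):
    """Split a virtual address into (virtual page number, page offset)."""
    return va >> PAGE_OFFSET_BITS, va & (PAGE_SIZE - 1)

def make_physical_address(ppn, offset):
    """Concatenate a physical page number with the unchanged page offset."""
    return (ppn << PAGE_OFFSET_BITS) | offset

vpn, offset = split_virtual_address(0x00403ABC)
pa = make_physical_address(0x2F, offset)   # 0x2F: arbitrary physical page number
```

Here the virtual page number is `0x403`, the offset is `0xABC`, and the resulting physical address is `0x2FABC`. Only the upper bits change during translation.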

address translation Also called address mapping. The process by which a virtual
address is mapped to an address used to access memory.

[Figure 5.25: virtual addresses pass through address translation and map either to
physical addresses or to disk addresses.]

FIGURE 5.25 In virtual memory, blocks of memory (called pages) are mapped from one
set of addresses (called virtual addresses) to another set (called physical addresses).
The processor generates virtual addresses while the memory is accessed using physical addresses. Both the
virtual memory and the physical memory are broken into pages, so that a virtual page is mapped to a physical
page. Of course, it is also possible for a virtual page to be absent from main memory and not be mapped to
a physical address; in that case, the page resides on disk. Physical pages can be shared by having two virtual
addresses point to the same physical address. This capability is used to allow two different programs to share
data or code.


Many design choices in virtual memory systems are motivated by the high cost
of a page fault. A page fault to disk will take millions of clock cycles to process.
(The table on page 378 shows that main memory latency is about 100,000 times
quicker than disk.) This enormous miss penalty, dominated by the time to get the
first word for typical page sizes, leads to several key decisions in designing virtual
memory systems:

■ Pages should be large enough to try to amortize the high access time. Sizes
from 4 KiB to 16 KiB are typical today. New desktop and server systems are
being developed to support 32 KiB and 64 KiB pages, but new embedded
systems are going in the other direction, to 1 KiB pages.

■ Organizations that reduce the page fault rate are attractive. The primary
technique used here is to allow fully associative placement of pages in
memory.

■ Page faults can be handled in software because the overhead will be small
compared to the disk access time. In addition, software can afford to use clever
algorithms for choosing how to place pages because even small reductions in
the miss rate will pay for the cost of such algorithms.

■ Write-through will not work for virtual memory, since writes take too long.
Instead, virtual memory systems use write-back.
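
A rough calculation shows why even tiny changes in the page fault rate matter. The numbers below are illustrative assumptions, not measurements: 100 cycles for a main memory access and 1,000,000 cycles for a page fault, in line with the "millions of clock cycles" order of magnitude mentioned above. The function name is hypothetical.

```python
def average_access_cycles(memory_cycles, fault_penalty_cycles, fault_rate):
    """Average cost of a memory access when a fraction fault_rate of them page-fault."""
    return memory_cycles + fault_rate * fault_penalty_cycles

# Assumed costs: 100-cycle memory access, 1,000,000-cycle page fault.
one_in_ten_thousand = average_access_cycles(100, 1_000_000, 1e-4)
one_in_ten_million = average_access_cycles(100, 1_000_000, 1e-7)
```

Under these assumptions, one fault per 10,000 accesses roughly doubles the average access cost, while one fault per 10 million accesses adds only about 0.1 cycle. This is why software can afford elaborate placement and replacement algorithms.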

[Figure 5.26: a 32-bit virtual address (bits 31 to 0) consists of a virtual page
number (bits 31 to 12) and a page offset (bits 11 to 0). Translation replaces the
virtual page number with a physical page number (bits 29 to 12 of the physical
address); the page offset is copied unchanged.]

FIGURE 5.26 Mapping from a virtual to a physical address. The page size is $2^{12}$ bytes = 4 KiB. The
number of physical pages allowed in memory is $2^{18}$, since the physical page number has 18 bits in it. Thus,
main memory can have at most 1 GiB, while the virtual address space is 4 GiB.

The next few subsections address these factors in virtual memory design.

Elaboration: We present the motivation for virtual memory as many virtual machines
sharing the same memory, but virtual memory was originally invented so that many
programs could share a computer as part of a time-sharing system. Since many readers
today have no experience with time-sharing systems, we use virtual machines to motivate
this section.

Elaboration: For servers and even PCs, 32-bit address processors are problematic.
Although we normally think of virtual addresses as much larger than physical addresses,
the opposite can occur when the processor address size is small relative to the state
of the memory technology. No single program or virtual machine can benefit, but a
collection of programs or virtual machines running at the same time can benefit from
not having to be swapped to memory or by running on parallel processors.

Elaboration: The discussion of virtual memory in this book focuses on paging,
which uses fixed-size blocks. There is also a variable-size block scheme called
segmentation. In segmentation, an address consists of two parts: a segment number
and a segment offset. The segment number is mapped to a physical address, and
the offset is added to find the actual physical address. Because the segment can
vary in size, a bounds check is also needed to make sure that the offset is within
the segment. The major use of segmentation is to support more powerful methods
of protection and sharing in an address space. Most operating system textbooks
contain extensive discussions of segmentation compared to paging and of the use
of segmentation to logically share the address space. The major disadvantage of
segmentation is that it splits the address space into logically separate pieces that
must be manipulated as a two-part address: the segment number and the offset.
Paging, in contrast, makes the boundary between page number and offset invisible
to programmers and compilers.

Segments have also been used as a method to extend the address space without
changing the word size of the computer. Such attempts have been unsuccessful because
of the awkwardness and performance penalties inherent in a two-part address, of which
programmers and compilers must be aware.

Many architectures divide the address space into large fixed-size blocks that simplify
protection between the operating system and user programs and increase the efficiency
of implementing paging. Although these divisions are often called “segments,” this
mechanism is much simpler than variable block size segmentation and is not visible to
user programs; we discuss it in more detail shortly.

Placing a Page and Finding It Again
Because of the incredibly high penalty for a page fault, designers reduce page fault
frequency by optimizing page placement. If we allow a virtual page to be mapped
to any physical page, the operating system can then choose to replace any page
it wants when a page fault occurs. For example, the operating system can use a
sophisticated algorithm and complex data structures that track page usage to try
to choose a page that will not be needed for a long time. The ability to use a clever
and flexible replacement scheme reduces the page fault rate and simplifies the use
of fully associative placement of pages.

segmentation A variable-size address mapping scheme in which an address consists
of two parts: a segment number, which is mapped to a physical address, and a
segment offset.

As mentioned in Section 5.4, the difficulty in using fully associative placement
is in locating an entry, since it can be anywhere in the upper level of the hierarchy.
A full search is impractical. In virtual memory systems, we locate pages by using a
table that indexes the memory; this structure is called a page table, and it resides
in memory. A page table is indexed with the page number from the virtual address
to discover the corresponding physical page number. Each program has its own
page table, which maps the virtual address space of that program to main memory.
In our library analogy, the page table corresponds to a mapping between book
titles and library locations. Just as the card catalog may contain entries for books
in another library on campus rather than the local branch library, we will see that
the page table may contain entries for pages not present in memory. To indicate the
location of the page table in memory, the hardware includes a register that points to
the start of the page table; we call this the page table register. Assume for now that
the page table is in a fixed and contiguous area of memory.

Hardware/Software Interface

The page table, together with the program counter and the registers, specifies
the state of a virtual machine. If we want to allow another virtual machine to use
the processor, we must save this state. Later, after restoring this state, the virtual
machine can continue execution. We often refer to this state as a process. The
process is considered active when it is in possession of the processor; otherwise, it
is considered inactive. The operating system can make a process active by loading
the process’s state, including the program counter, which will initiate execution at
the value of the saved program counter.

The process’s address space, and hence all the data it can access in memory, is
defined by its page table, which resides in memory. Rather than save the entire page
table, the operating system simply loads the page table register to point to the page
table of the process it wants to make active. Each process has its own page table,
since different processes use the same virtual addresses. The operating system is
responsible for allocating the physical memory and updating the page tables, so
that the virtual address spaces of different processes do not collide. As we will see
shortly, the use of separate page tables also provides protection of one process from
another.

page table The table containing the virtual to physical address translations in a
virtual memory system. The table, which is stored in memory, is typically indexed
by the virtual page number; each entry in the table contains the physical page
number for that virtual page if the page is currently in memory.
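
The role of the page table register in switching between processes can be sketched as follows. The class `ToyMMU` and the dictionary-based page tables are hypothetical stand-ins for hardware and in-memory tables; the point is only that a context switch changes one register rather than copying a table.

```python
class ToyMMU:
    """Hypothetical model: the page table register selects the active page table."""

    def __init__(self):
        self.page_table_register = None

    def translate(self, vpn):
        # Look up the virtual page number in whichever table is currently active.
        return self.page_table_register[vpn]

# Both processes use virtual page 0, but the OS assigned them different physical pages.
page_table_p1 = {0: 7, 1: 3}
page_table_p2 = {0: 12}

mmu = ToyMMU()
mmu.page_table_register = page_table_p1   # make process 1 active
p1_page = mmu.translate(0)
mmu.page_table_register = page_table_p2   # context switch: only the register changes
p2_page = mmu.translate(0)
```

The same virtual page number yields physical page 7 for the first process and 12 for the second, so the two address spaces do not collide.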

Figure 5.27 uses the page table register, the virtual address, and the indicated page
table to show how the hardware can form a physical address. A valid bit is used
in each page table entry, just as we did in a cache. If the bit is off, the page is not
present in main memory and a page fault occurs. If the bit is on, the page is in
memory and the entry contains the physical page number.

Because the page table contains a mapping for every possible virtual page, no
tags are required. In cache terminology, the index that is used to access the page
table consists of the full block address, which is the virtual page number.
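
The lookup just described can be sketched in a few lines. This is an illustrative model, assuming 4 KiB pages; `PageTableEntry`, `PageFault`, and `translate` are hypothetical names, and the table is shrunk to four entries for readability.

```python
from dataclasses import dataclass

PAGE_OFFSET_BITS = 12                       # 4 KiB pages

class PageFault(Exception):
    """Raised when the valid bit of the selected entry is off."""

@dataclass
class PageTableEntry:
    valid: bool = False
    physical_page_number: int = 0

def translate(page_table, virtual_address):
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    entry = page_table[vpn]                 # direct index: no tag comparison needed
    if not entry.valid:
        raise PageFault(vpn)                # page is not present in main memory
    return (entry.physical_page_number << PAGE_OFFSET_BITS) | offset

# Tiny 4-entry table for illustration; a 32-bit address space would need 2**20 entries.
page_table = [PageTableEntry() for _ in range(4)]
page_table[1] = PageTableEntry(valid=True, physical_page_number=0x5)
physical = translate(page_table, 0x1234)
```

Virtual address `0x1234` lies in virtual page 1, which maps to physical page `0x5`, giving `0x5234`. Any address in an invalid page raises `PageFault` instead.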

[Figure 5.27: the page table register points to the start of the page table. The
20-bit virtual page number (bits 31 to 12 of the virtual address) indexes the page
table; each entry holds a valid bit and a physical page number. If the valid bit is 0,
the page is not present in memory. The physical page number (bits 29 to 12 of the
physical address) is combined with the unchanged 12-bit page offset.]

FIGURE 5.27 The page table is indexed with the virtual page number to obtain the
corresponding portion of the physical address. We assume a 32-bit address. The page table pointer
gives the starting address of the page table. In this figure, the page size is $2^{12}$ bytes, or 4 KiB. The virtual
address space is $2^{32}$ bytes, or 4 GiB, and the physical address space is $2^{30}$ bytes, which allows main memory
of up to 1 GiB. The number of entries in the page table is $2^{20}$, or 1 million entries. The valid bit for each entry
indicates whether the mapping is legal. If it is off, then the page is not present in memory. Although the
page table entry shown here need only be 19 bits wide, it would typically be rounded up to 32 bits for ease of
indexing. The extra bits would be used to store additional information that needs to be kept on a per-page
basis, such as protection.


Page Faults
If the valid bit for a virtual page is off, a page fault occurs. The operating system
must be given control. This transfer is done with the exception mechanism, which
we saw in Chapter 4 and will discuss again later in this section. Once the operating
system gets control, it must find the page in the next level of the hierarchy (usually
flash memory or magnetic disk) and decide where to place the requested page in
main memory.

The virtual address alone does not immediately tell us where the page is on disk.
Returning to our library analogy, we cannot find the location of a library book on
the shelves just by knowing its title. Instead, we go to the catalog and look up the
book, obtaining an address for the location on the shelves, such as the Library of
Congress call number. Likewise, in a virtual memory system, we must keep track
of the location on disk of each page in virtual address space.

Because we do not know ahead of time when a page in memory will be replaced,
the operating system usually creates the space on flash memory or disk for all the
pages of a process when it creates the process. This space is called the swap space.
At that time, it also creates a data structure to record where each virtual page is
stored on disk. This data structure may be part of the page table or may be an
auxiliary data structure indexed in the same way as the page table. Figure 5.28
shows the organization when a single table holds either the physical page number
or the disk address.

The operating system also creates a data structure that tracks which processes
and which virtual addresses use each physical page. When a page fault occurs,
if all the pages in main memory are in use, the operating system must choose a
page to replace. Because we want to minimize the number of page faults, most
operating systems try to choose a page that they hypothesize will not be needed
in the near future. Using the past to predict the future, operating systems follow
the least recently used (LRU) replacement scheme, which we mentioned in Section
5.4. The operating system searches for the least recently used page, assuming that
a page that has not been used in a long time is less likely to be needed than a more
recently accessed page. The replaced pages are written to swap space on the disk.
In case you are wondering, the operating system is just another process, and these
tables controlling memory are in memory; the details of this seeming contradiction
will be explained shortly.
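
The page fault path with LRU replacement can be sketched as below. This is a toy model under stated assumptions: `ToyPager` is a hypothetical name, it tracks exact LRU order (which real operating systems only approximate), and page contents are placeholder strings created on first use rather than preallocated in swap space.

```python
from collections import OrderedDict

class ToyPager:
    """Hypothetical pager with exact LRU replacement over a few physical pages."""

    def __init__(self, num_physical_pages):
        self.num_physical_pages = num_physical_pages
        self.resident = OrderedDict()   # virtual page -> contents, LRU page first
        self.swap_space = {}            # virtual page -> contents stored "on disk"
        self.page_faults = 0

    def access(self, vpn):
        if vpn in self.resident:
            self.resident.move_to_end(vpn)            # now the most recently used
            return self.resident[vpn]
        self.page_faults += 1                         # valid bit off: page fault
        if len(self.resident) >= self.num_physical_pages:
            victim, contents = self.resident.popitem(last=False)  # LRU page
            self.swap_space[victim] = contents        # write replaced page to swap
        self.resident[vpn] = self.swap_space.get(vpn, f"page {vpn}")
        return self.resident[vpn]

pager = ToyPager(num_physical_pages=2)
for vpn in [0, 1, 0, 2]:     # page 1 is the least recently used when page 2 arrives
    pager.access(vpn)
```

After this access sequence, pages 0 and 2 are resident and page 1 has been written to the swap space.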

swap space The space on the disk reserved for the full virtual memory space of a
process.

Hardware/Software Interface

Implementing a completely accurate LRU scheme is too expensive, since it requires
updating a data structure on every memory reference. Instead, most operating
systems approximate LRU by keeping track of which pages have and which pages
have not been recently used. To help the operating system estimate the LRU pages,
some computers provide a reference bit or use bit, which is set whenever a page
is accessed. The operating system periodically clears the reference bits and later
records them so it can determine which pages were touched during a particular
time period. With this usage information, the operating system can select a page
that is among the least recently referenced (detected by having its reference bit off).
If this bit is not provided by the hardware, the operating system must find another
way to estimate which pages have been accessed.

reference bit Also called use bit. A field that is set whenever a page is accessed
and that is used to implement LRU or other replacement schemes.
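
One way to picture the reference-bit approximation is the sketch below. It is an assumption-laden model, not a description of any specific operating system: `ReferenceBitTracker` and its methods are hypothetical names, and the policy shown simply treats pages untouched during the last sampling period as replacement candidates.

```python
class ReferenceBitTracker:
    """Hypothetical model of hardware reference bits sampled by the OS."""

    def __init__(self, pages):
        self.reference_bit = {p: False for p in pages}
        self.history = {p: [] for p in pages}    # OS-side record, oldest sample first

    def touch(self, page):
        self.reference_bit[page] = True          # hardware sets the bit on any access

    def sample_and_clear(self):
        for page, bit in self.reference_bit.items():
            self.history[page].append(bit)       # OS records the bit for this period
            self.reference_bit[page] = False     # and clears it for the next period

    def replacement_candidates(self):
        # Pages whose bit was off in the most recent period were not used recently.
        return [p for p, h in self.history.items() if h and not h[-1]]

tracker = ReferenceBitTracker(pages=[0, 1, 2])
tracker.touch(0)
tracker.touch(2)
tracker.sample_and_clear()
candidates = tracker.replacement_candidates()
```

Only page 1 went untouched during the period, so it is the sole candidate. The per-access cost is one bit set by hardware, far cheaper than maintaining exact LRU order.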

[Figure 5.28: the virtual page number indexes the page table. Each entry holds a
valid bit and a physical page or disk address. Entries with valid bit 1 point to
pages in physical memory; entries with valid bit 0 point to pages in disk storage.]

FIGURE 5.28 The page table maps each page in virtual memory to either a page in main
memory or a page stored on disk, which is the next level in the hierarchy. The virtual page
number is used to index the page table. If the valid bit is on, the page table supplies the physical page number
(i.e., the starting address of the page in memory) corresponding to the virtual page. If the valid bit is off, the
page currently resides only on disk, at a specified disk address. In many systems, the table of physical page
addresses and disk page addresses, while logically one table, is stored in two separate data structures. Dual
tables are justified in part because we must keep the disk addresses of all the pages, even if they are currently
in main memory. Remember that the pages in main memory and the pages on disk are the same size.


Elaboration: With a 32-bit virtual address, 4 KiB pages, and 4 bytes per page table
entry, we can compute the total page table size:

$$\text{Number of page table entries} = \frac{2^{32}}{2^{12}} = 2^{20}$$

$$\text{Size of page table} = 2^{20}\ \text{page table entries} \times 2^{2}\ \frac{\text{bytes}}{\text{page table entry}} = 4\ \text{MiB}$$

That is, we would need to use 4 MiB of memory for each program in execution at any
time. This amount is not so bad for a single process. What if there are hundreds of
processes running, each with their own page table? And how should we handle 64-bit
addresses, which by this calculation would need $2^{52}$ words?
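
The same arithmetic can be checked mechanically. The helper `page_table_bytes` is a hypothetical name; it assumes a single flat page table with one entry per virtual page, as in the calculation above.

```python
def page_table_bytes(virtual_address_bits, page_offset_bits, bytes_per_entry):
    """Size of a flat page table with one entry per virtual page."""
    entries = 2 ** (virtual_address_bits - page_offset_bits)
    return entries * bytes_per_entry

MIB = 2 ** 20
size_32_bit = page_table_bytes(32, 12, 4)      # 32-bit addresses, 4 KiB pages
entries_64_bit = 2 ** (64 - 12)                # entries needed for 64-bit addresses
hundred_processes = 100 * size_32_bit          # one table per process
```

With these parameters a single table is 4 MiB, 100 processes need 400 MiB of page tables, and a flat table for 64-bit addresses would need $2^{52}$ entries.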

A range of techniques is used to reduce the amount of storage required for the page
table. The five techniques below aim at reducing the total maximum storage required as
well as minimizing the main memory dedicated to page tables:

1. The simplest technique is to keep a limit register that restricts the size of the
page table for a given process. If the virtual page number becomes larger than
the contents of the limit register, entries must be added to the page table. This
technique allows the page table to grow as a process consumes more space.
Thus, the page table will only be large if the process is using many pages of
virtual address space. This technique requires that the address space expand in
only one direction.

2. Allowing growth in only one direction is not sufficient, since most languages require
two areas whose size is expandable: one area holds the stack and the other area
holds the heap. Because of this duality, it is convenient to divide the page table
and let it grow from the highest address down, as well as from the lowest address
up. This means that there will be two separate page tables and two separate
limits. The use of two page tables breaks the address space into two segments.
The high-order bit of an address usually determines which segment and thus which
page table to use for that address. Since the high-order address bit specifies the
segment, each segment can be as large as one-half of the address space. A
limit register for each segment specifies the current size of the segment, which
grows in units of pages. This type of segmentation is used by many architectures,
including MIPS. Unlike the type of segmentation discussed in the third elaboration
on page 431, this form of segmentation is invisible to the application program,
although not to the operating system. The major disadvantage of this scheme is
that it does not work well when the address space is used in a sparse fashion
rather than as a contiguous set of virtual addresses.

3. Another approach to reducing the page table size is to apply a hashing function
to the virtual address so that the page table need be only the size of the number
of physical pages in main memory. Such a structure is called an inverted page
table. Of course, the lookup process is slightly more complex with an inverted
page table, because we can no longer just index the page table.

4. Multiple levels of page tables can also be used to reduce the total amount of
page table storage. The first level maps large fixed-size blocks of virtual address
space, perhaps 64 to 256 pages in total. These large blocks are sometimes
called segments, and this first-level mapping table is sometimes called a
segment table, though the segments are again invisible to the user. Each entry
in the segment table indicates whether any pages in that segment are allocated
and, if so, points to a page table for that segment. Address translation happens
by first looking in the segment table, using the highest-order bits of the address.
If the segment address is valid, the next set of high-order bits is used to index
the page table indicated by the segment table entry. This scheme allows the
address space to be used in a sparse fashion (multiple noncontiguous segments
can be active) without having to allocate the entire page table. Such schemes
are particularly useful with very large address spaces and in software systems
that require noncontiguous allocation. The primary disadvantage of this two-level
mapping is the more complex process for address translation.

5. To reduce the actual main memory tied up in page tables, most modern systems
also allow the page tables to be paged. Although this sounds tricky, it works
by using the same basic ideas of virtual memory and simply allowing the page
tables to reside in the virtual address space. In addition, there are some small
but critical problems, such as a never-ending series of page faults, which must
be avoided. How these problems are overcome is both very detailed and typically
highly processor specific. In brief, these problems are avoided by placing all the
page tables in the address space of the operating system and placing at least
some of the page tables for the operating system in a portion of main memory
that is physically addressed and is always present and thus never on disk.
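
The two-level scheme in item 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not a model of any particular processor: the 32-bit address split (10-bit segment index, 10-bit page index, 12-bit page offset) and the names `TwoLevelPageTable`, `map_page`, and `translate` are hypothetical, and a dictionary stands in for the indexed segment table.

```python
PAGE_OFFSET_BITS = 12     # assumed 4 KiB pages
PAGE_INDEX_BITS = 10      # assumed 1024 pages per segment
PAGE_INDEX_MASK = (1 << PAGE_INDEX_BITS) - 1


class TwoLevelPageTable:
    """Segment table whose entries point to per-segment page tables."""

    def __init__(self):
        # Only segments that contain at least one mapped page get a page table.
        self.segment_table = {}

    def map_page(self, virtual_page, physical_page):
        segment = virtual_page >> PAGE_INDEX_BITS
        page = virtual_page & PAGE_INDEX_MASK
        self.segment_table.setdefault(segment, {})[page] = physical_page

    def translate(self, virtual_address):
        offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
        virtual_page = virtual_address >> PAGE_OFFSET_BITS
        # The highest-order bits select the segment table entry.
        page_table = self.segment_table.get(virtual_page >> PAGE_INDEX_BITS)
        if page_table is None:
            return None  # no pages allocated in this segment
        # The next set of bits indexes the page table for that segment.
        physical_page = page_table.get(virtual_page & PAGE_INDEX_MASK)
        if physical_page is None:
            return None  # a real system would raise a page fault here
        return (physical_page << PAGE_OFFSET_BITS) | offset


table = TwoLevelPageTable()
table.map_page(0x00001, 0x2A)   # a page near the bottom of the address space
table.map_page(0xFFFFF, 0x2B)   # a page at the very top
```

In this sketch, two widely separated pages cost only two second-level tables, which is the sparse-allocation benefit the item describes.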

What about Writes?
The difference between the access time to the cache and main memory is tens to
hundreds of cycles, and write-through schemes can be used, although we need a
write buffer to hide the latency of the write from the processor. In a virtual memory
system, writes to the next level of the hierarchy (disk) can take millions of processor
clock cycles; therefore, building a write buffer to allow the system to write-through
to disk would be completely impractical. Instead, virtual memory systems must use
write-back, performing the individual writes into the page in memory, and copying
the page back to disk when it is replaced in the memory.

A write-back scheme has another major advantage in a virtual memory system.
Because the disk transfer time is small compared with its access time, copying back
an entire page is much more efficient than writing individual words back to the disk.

A write-back operation, although more efficient than transferring individual words, is
still costly. Thus, we would like to know whether a page needs to be copied back when
we choose to replace it. To track whether a page has been written since it was read into
the memory, a dirty bit is added to the page table. The dirty bit is set when any word
in a page is written. If the operating system chooses to replace the page, the dirty bit
indicates whether the page needs to be written out before its location in memory can be
given to another page. Hence, a modified page is often called a dirty page.

Making Address Translation Fast: the TLB
Since the page tables are stored in main memory, every memory access by a program
can take at least twice as long: one memory access to obtain the physical address
and a second access to get the data. The key to improving access performance is to
rely on locality of reference to the page table. When a translation for a virtual page
number is used, it will probably be needed again in the near future, because the
references to the words on that page have both temporal and spatial locality.

Accordingly, modern processors include a special cache that keeps track of recently
used translations. This special address translation cache is traditionally referred to as
a translation-lookaside buffer (TLB), although it would be more accurate to call it
a translation cache. The TLB corresponds to that little piece of paper we typically use
to record the location of a set of books we look up in the card catalog; rather than
continually searching the entire catalog, we record the location of several books and
use the scrap of paper as a cache of Library of Congress call numbers.

Figure 5.29 shows that each tag entry in the TLB holds a portion of the virtual
page number, and each data entry of the TLB holds a physical page number.

translation-lookaside buffer (TLB): A cache that keeps track of recently used address mappings to try to avoid an access to the page table.

[Figure 5.29 diagram: a TLB (fields: Valid, Dirty, Ref, Tag, Physical page address) shown beside the page table (fields: Valid, Dirty, Ref, Physical page or disk address). The page table is indexed by virtual page number, and its entries point into physical memory or disk storage.]

FIGURE 5.29 The TLB acts as a cache of the page table for the entries that map to
physical pages only. The TLB contains a subset of the virtual-to-physical page mappings that are in the
page table. The TLB mappings are shown in color. Because the TLB is a cache, it must have a tag field. If there
is no matching entry in the TLB for a page, the page table must be examined. The page table either supplies a
physical page number for the page (which can then be used to build a TLB entry) or indicates that the page
resides on disk, in which case a page fault occurs. Since the page table has an entry for every virtual page, no
tag field is needed; in other words, unlike a TLB, a page table is not a cache.

Because we access the TLB instead of the page table on every reference, the TLB
will need to include other status bits, such as the dirty and the reference bits.

On every reference, we look up the virtual page number in the TLB. If we get a
hit, the physical page number is used to form the address, and the corresponding
reference bit is turned on. If the processor is performing a write, the dirty bit is also
turned on. If a miss in the TLB occurs, we must determine whether it is a page fault
or merely a TLB miss. If the page exists in memory, then the TLB miss indicates
only that the translation is missing. In such cases, the processor can handle the TLB
miss by loading the translation from the page table into the TLB and then trying the
reference again. If the page is not present in memory, then the TLB miss indicates
a true page fault. In this case, the processor invokes the operating system using an
exception. Because the TLB has many fewer entries than the number of pages in
main memory, TLB misses will be much more frequent than true page faults.
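
The lookup sequence just described can be sketched in a few lines. This is an illustrative model only: the `PageFault` class, the `translate` function, the dictionary-based `tlb` and `page_table`, and the 4 KiB page size are assumptions made for the example, and the sketch ignores the question of which TLB entry to replace.

```python
PAGE_OFFSET_BITS = 12  # assumed 4 KiB pages


class PageFault(Exception):
    """Raised when the referenced page is not present in main memory."""


def translate(virtual_address, is_write, tlb, page_table):
    """Return (physical_address, event) for one memory reference.

    tlb maps virtual page number -> {"ppn", "ref", "dirty"}.
    page_table maps virtual page number -> physical page, or None if on disk.
    """
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    event = "TLB hit"
    if vpn not in tlb:
        ppn = page_table.get(vpn)
        if ppn is None:
            # True page fault: the operating system must bring the page in.
            raise PageFault(hex(virtual_address))
        # Merely a TLB miss: load the translation, then retry the reference.
        tlb[vpn] = {"ppn": ppn, "ref": False, "dirty": False}
        event = "TLB miss"
    entry = tlb[vpn]
    entry["ref"] = True           # reference bit is turned on for every access
    if is_write:
        entry["dirty"] = True     # dirty bit is also turned on for a write
    return (entry["ppn"] << PAGE_OFFSET_BITS) | offset, event


tlb = {}
page_table = {0x4: 0x9, 0x5: None}   # virtual page 5 resides on disk
first = translate(0x4010, False, tlb, page_table)   # translation not yet cached
second = translate(0x4014, True, tlb, page_table)   # same page, now in the TLB
```

The second reference to the same page hits in the TLB, which is the locality the TLB relies on; a reference to virtual page 5 would raise `PageFault` instead.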

TLB misses can be handled either in hardware or in software. In practice, with
care there can be little performance difference between the two approaches, because
the basic operations are the same in either case.

After a TLB miss occurs and the missing translation has been retrieved from the
page table, we will need to select a TLB entry to replace. Because the reference and
dirty bits are contained in the TLB entry, we need to copy these bits back to the page
table entry when we replace an entry. These bits are the only portion of the TLB
entry that can be changed. Using write-back (that is, copying these entries back at
miss time rather than when they are written) is very efficient, since we expect the
TLB miss rate to be small. Some systems use other techniques to approximate the
reference and dirty bits, eliminating the need to write into the TLB except to load
a new table entry on a miss.
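
A minimal sketch of this write-back step, combined with random selection of the entry to replace, might look as follows. The function name `replace_tlb_entry` and the dictionary layout are hypothetical, and the bits are merged with a logical OR so that state recorded in the page table earlier is not lost, which is one possible policy rather than a description of any specific hardware.

```python
import random


def replace_tlb_entry(tlb, page_table, new_vpn, rng=random):
    """Evict a randomly chosen TLB entry and load the translation for new_vpn.

    tlb is a list of dicts with keys "vpn", "ppn", "ref", "dirty".
    page_table maps vpn -> dict with keys "ppn", "ref", "dirty".
    """
    victim_index = rng.randrange(len(tlb))   # random choice of entry to replace
    victim = tlb[victim_index]

    # Write-back: only the reference and dirty bits can have changed,
    # so they are all that is copied to the victim's page table entry.
    pte = page_table[victim["vpn"]]
    pte["ref"] = pte["ref"] or victim["ref"]
    pte["dirty"] = pte["dirty"] or victim["dirty"]

    tlb[victim_index] = {
        "vpn": new_vpn,
        "ppn": page_table[new_vpn]["ppn"],
        "ref": False,
        "dirty": False,
    }
    return victim_index


page_table = {
    7: {"ppn": 0x30, "ref": False, "dirty": False},
    8: {"ppn": 0x31, "ref": False, "dirty": False},
}
# A one-entry TLB whose entry for page 7 has been both read and written.
tlb = [{"vpn": 7, "ppn": 0x30, "ref": True, "dirty": True}]
replace_tlb_entry(tlb, page_table, new_vpn=8)
```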

Some typical values for a TLB might be

■ TLB size: 16–512 entries

■ Block size: 1–2 page table entries (typically 4–8 bytes each)

■ Hit time: 0.5–1 clock cycle

■ Miss penalty: 10–100 clock cycles

■ Miss rate: 0.01%–1%
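
To get a feel for these ranges, the usual average-access-time relation (hit time plus miss rate times miss penalty) can be applied to the TLB. The list above does not state this relation, so treat the sketch as an assumption-laden estimate; the function name is hypothetical and the inputs are simply the ends of the ranges listed.

```python
def average_translation_cycles(hit_time, miss_rate, miss_penalty):
    """Average clock cycles spent on address translation per reference."""
    return hit_time + miss_rate * miss_penalty


# Most favorable end of each range listed above.
best = average_translation_cycles(hit_time=0.5, miss_rate=0.0001, miss_penalty=10)
# Least favorable end of each range listed above.
worst = average_translation_cycles(hit_time=1.0, miss_rate=0.01, miss_penalty=100)
```

Even at the unfavorable end, the miss term adds only about one cycle per reference on average, which is why a small TLB is effective.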

Designers have used a wide variety of associativities in TLBs. Some systems use
small, fully associative TLBs because a fully associative mapping has a lower miss
rate; furthermore, since the TLB is small, the cost of a fully associative mapping is
not too high. Other systems use large TLBs, often with small associativity. With
a fully associative mapping, choosing the entry to replace becomes tricky since
implementing a hardware LRU scheme is too expensive. Furthermore, since TLB
misses are much more frequent than page faults and thus must be handled more
cheaply, we cannot afford an expensive software algorithm, as we can for page faults.
As a result, many systems provide some support for randomly choosing an entry
to replace. We’ll examine replacement schemes in a little more detail in Section 5.8.

The Intrinsity FastMATH TLB

To see these ideas in a real processor, let’s take a closer look at the TLB of the
Intrinsity FastMATH. The memory system uses 4 KiB pages and a 32-bit address
space; thus, the virtual page number is 20 bits long, as in the top of Figure 5.30.
The physical address is the same size as the virtual address. The TLB contains 16
entries, it is fully associative, and it is shared between the instruction and data
references. Each entry is 64 bits wide and contains a 20-bit tag (which is the virtual
page number for that TLB entry), the corresponding physical page number (also 20
bits), a valid bit, a dirty bit, and other bookkeeping bits. Like most MIPS systems,
it uses software to handle TLB misses.
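
The field widths quoted above follow directly from the page size and address width. A short sketch of the arithmetic, with a hypothetical helper `split_virtual_address` and an arbitrary sample address:

```python
PAGE_SIZE = 4 * 1024      # 4 KiB pages, as stated in the text
ADDRESS_BITS = 32         # 32-bit address space, as stated in the text

OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # log2 of the page size
VPN_BITS = ADDRESS_BITS - OFFSET_BITS      # remaining bits form the page number


def split_virtual_address(address):
    """Return (virtual page number, page offset) for a 32-bit address."""
    return address >> OFFSET_BITS, address & (PAGE_SIZE - 1)


vpn, offset = split_virtual_address(0x12345678)
```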

Figure 5.30 shows the TLB and one of the caches, while Figure 5.31 shows the
steps in processing a read or write request. When a TLB miss occurs, the MIPS
hardware saves the page number of the reference in a special register and generates
an exception. The exception invokes the operating system, which handles the miss
in software. To find the physical address for the missing page, the TLB miss routine
indexes the page table using the page number of the virtual address and the page
table register, which indicates the starting address of the active process page table.
Using a special set of system instructions that can update the TLB, the operating
system places the physical address from the page table into the TLB. A TLB miss
takes about 13 clock cycles, assuming the code and the page table entry are in the
instruction cache and data cache, respectively. (We will see the MIPS TLB code
on page 449.) A true page fault occurs if the page table entry does not have a valid
physical address. The hardware maintains an index that indicates the recommended
entry to replace; the recommended entry is chosen randomly.

There is an extra complication for write requests: namely, the write access bit in
the TLB must be checked. This bit prevents the program from writing into pages
for which it has only read access. If the program attempts a write and the write
access bit is off, an exception is generated. The write access bit forms part of the
protection mechanism, which we will discuss shortly.

Integrating Virtual Memory, TLBs, and Caches
Our virtual memory and cache systems work together as a hierarchy, so that data
cannot be in the cache unless it is present in main memory. The operating system
helps maintain this hierarchy by flushing the contents of any page from the cache
when it decides to migrate that page to disk. At the same time, the OS modifies the
page tables and TLB, so that an attempt to access any data on the migrated page
will generate a page fault.

Under the best of circumstances, a virtual address is translated by the TLB and
sent to the cache where the appropriate data is found, retrieved, and sent back to
the processor. In the worst case, a reference can miss in all three components of the
memory hierarchy: the TLB, the page table, and the cache. The following example
illustrates these interactions in more detail.

[Figure 5.30 diagram: the 32-bit virtual address is split into a virtual page number and a page offset. The virtual page number is compared against every TLB tag (each TLB entry holds Valid, Dirty, Tag, and Physical page number) to signal a TLB hit. The resulting physical address is divided into a physical address tag, a cache index, a block offset, and a byte offset, which access the cache (Valid, Tag, Data) to signal a cache hit and deliver the data.]

FIGURE 5.30 The TLB and cache implement the process of going from a virtual address to a data item in the Intrinsity
FastMATH. This figure shows the organization of the TLB and the data cache, assuming a 4 KiB page size. This diagram focuses on a read;
Figure 5.31 describes how to handle writes. Note that unlike Figure 5.12, the tag and data RAMs are split. By addressing the long but narrow
data RAM with the cache index concatenated with the block offset, we select the desired word in the block without a 16:1 multiplexor. While
the cache is direct mapped, the TLB is fully associative. Implementing a fully associative TLB requires that every TLB tag be compared against
the virtual page number, since the entry of interest can be anywhere in the TLB. (See content addressable memories in the Elaboration on
page 408.) If the valid bit of the matching entry is on, the access is a TLB hit, and bits from the physical page number together with bits from
the page offset form the index that is used to access the cache.

[Figure 5.31 flowchart: Virtual address -> TLB access -> TLB hit? If no, TLB miss exception. If yes, the physical address is available -> Write? If no, try to read data from cache: on a cache hit, deliver data to the CPU; otherwise, cache miss stall while the block is read. If yes, check whether the write access bit is on: if not, write protection exception; if so, try to write data to cache: on a cache hit, write data into cache, update the dirty bit, and put the data and the address into the write buffer; otherwise, cache miss stall while the block is read.]

FIGURE 5.31 Processing a read or a write-through in the Intrinsity FastMATH TLB and cache. If the TLB generates a hit,
the cache can be accessed with the resulting physical address. For a read, the cache generates a hit or miss and supplies the data or causes a stall
while the data is brought from memory. If the operation is a write, a portion of the cache entry is overwritten for a hit and the data is sent to
the write buffer if we assume write-through. A write miss is just like a read miss except that the block is modified after it is read from memory.
Write-back requires writes to set a dirty bit for the cache block, and a write buffer is loaded with the whole block only on a read miss or write
miss if the block to be replaced is dirty. Notice that a TLB hit and a cache hit are independent events, but a cache hit can only occur after a TLB
hit occurs, which means that the data must be present in memory. The relationship between TLB misses and cache misses is examined further
in the following example and the exercises at the end of this chapter.

Overall Operation of a Memory Hierarchy

EXAMPLE

In a memory hierarchy like that of Figure 5.30, which includes a TLB and a
cache organized as shown, a memory reference can encounter three different
types of misses: a TLB miss, a page fault, and a cache miss. Consider all
the combinations of these three events with one or more occurring (seven
possibilities). For each possibility, state whether this event can actually occur
and under what circumstances.

ANSWER

Figure 5.32 shows all combinations and whether each is possible in practice.
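
The reasoning behind Figure 5.32 can also be checked mechanically. The sketch below enumerates the seven combinations and applies the two constraints the figure relies on: a translation cannot be in the TLB unless the page is in memory, and data cannot be in the cache unless the page is in memory. The function name `classify` and the tuple encoding are illustrative assumptions.

```python
from itertools import product


def classify(tlb_hit, page_table_hit, cache_hit):
    """Label one combination of TLB, page table, and cache outcomes."""
    if tlb_hit and not page_table_hit:
        return "impossible"   # no TLB translation for a page not in memory
    if cache_hit and not page_table_hit:
        return "impossible"   # data cannot be cached if its page is not in memory
    return "possible"


# Every combination of hit (True) and miss (False) with at least one miss.
outcomes = {
    combo: classify(*combo)
    for combo in product([True, False], repeat=3)
    if not all(combo)
}
impossible = sorted(c for c, label in outcomes.items() if label == "impossible")
```

Under these two rules, three of the seven combinations come out impossible, matching the count given in the figure caption.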

Elaboration: Figure 5.32 assumes that all memory addresses are translated to
physical addresses before the cache is accessed. In this organization, the cache is
physically indexed and physically tagged (both the cache index and tag are physical,
rather than virtual, addresses). In such a system, the amount of time to access memory,
assuming a cache hit, must accommodate both a TLB access and a cache access; of
course, these accesses can be pipelined.

Alternatively, the processor can index the cache with an address that is completely
or partially virtual. This is called a virtually addressed cache, and it uses tags that
are virtual addresses; hence, such a cache is virtually indexed and virtually tagged. In
such caches, the address translation hardware (TLB) is unused during the normal cache
access, since the cache is accessed with a virtual address that has not been translated
to a physical address. This takes the TLB out of the critical path, reducing cache latency.
When a cache miss occurs, however, the processor needs to translate the address to a
physical address so that it can fetch the cache block from main memory.

virtually addressed cache: A cache that is accessed with a virtual address rather than a physical address.

TLB    Page table   Cache   Possible? If so, under what circumstance?
Hit    Hit          Miss    Possible, although the page table is never really checked if TLB hits.
Miss   Hit          Hit     TLB misses, but entry found in page table; after retry, data is found in cache.
Miss   Hit          Miss    TLB misses, but entry found in page table; after retry, data misses in cache.
Miss   Miss         Miss    TLB misses and is followed by a page fault; after retry, data must miss in cache.
Hit    Miss         Miss    Impossible: cannot have a translation in TLB if page is not present in memory.
Hit    Miss         Hit     Impossible: cannot have a translation in TLB if page is not present in memory.
Miss   Miss         Hit     Impossible: data cannot be allowed in cache if the page is not in memory.

FIGURE 5.32 The possible combinations of events in the TLB, virtual memory system,
and cache. Three of these combinations are impossible, and one is possible (TLB hit, virtual memory hit,
cache miss) but never detected.

When the cache is accessed with a virtual address and pages are shared between
processes (which may access them with different virtual addresses), there is the
possibility of aliasing. Aliasing occurs when the same object has two names—in this
case, two virtual addresses for the same page. This ambiguity creates a problem, because
a word on such a page may be cached in two different locations, each corresponding
to different virtual addresses. This ambiguity would allow one program to write the data
without the other program being aware that the data had changed. Completely virtually
addressed caches either introduce design limitations on the cache and TLB to reduce
aliases or require the operating system, and possibly the user, to take steps to ensure
that aliases do not occur.

A common compromise between these two design points is caches that are virtually
indexed—sometimes using just the page-offset portion of the address, which is really
a physical address since it is not translated—but use physical tags. These designs,
which are virtually indexed but physically tagged, attempt to achieve the performance
advantages of virtually indexed caches with the architecturally simpler advantages of a
physically addressed cache. For example, there is no alias problem in this case. Figure
5.30 assumed a 4 KiB page size, but it’s really 16 KiB, so the Intrinsity FastMATH can
use this trick. To pull it off, there must be careful coordination between the minimum
page size, the cache size, and associativity.
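
One way to state the required coordination is that the cache index and block offset must fit entirely within the untranslated page offset, that is, the cache size divided by the associativity must not exceed the page size. The sketch below checks that condition. The function name is hypothetical, and the 16 KiB direct-mapped cache used for the FastMATH is an assumption made for the example rather than a figure quoted in this passage.

```python
KiB = 1024


def index_fits_in_page_offset(cache_bytes, associativity, page_bytes):
    """True if the index and block offset bits lie entirely within the page offset."""
    return cache_bytes // associativity <= page_bytes


# Assumed 16 KiB direct-mapped cache (associativity 1).
with_16k_pages = index_fits_in_page_offset(16 * KiB, 1, 16 * KiB)
with_4k_pages = index_fits_in_page_offset(16 * KiB, 1, 4 * KiB)
# Raising associativity shrinks the index, so a 4 KiB page can suffice.
four_way_4k_pages = index_fits_in_page_offset(16 * KiB, 4, 4 * KiB)
```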

Implementing Protection with Virtual Memory
Perhaps the most important function of virtual memory today is to allow sharing of
a single main memory by multiple processes, while providing memory protection
among these processes and the operating system. The protection mechanism must
ensure that although multiple processes are sharing the same main memory, one
renegade process cannot write into the address space of another user process or into
the operating system either intentionally or unintentionally. The write access bit in
the TLB can protect a page from being written. Without this level of protection,
computer viruses would be even more widespread.

To enable the operating system to implement protection in the virtual memory
system, the hardware must provide at least the three basic capabilities summarized
below. Note that the first two are the same requirements as needed for virtual
machines (Section 5.6).

1. Support at least two modes that indicate whether the running process is a
user process or an operating system process, variously called a supervisor
process, a kernel process, or an executive process.

2. Provide a portion of the processor state that a user process can read but not
write. This includes the user/supervisor mode bit, which dictates whether
the processor is in user or supervisor mode, the page table pointer, and the
TLB. To write these elements, the operating system uses special instructions
that are only available in supervisor mode.

3. Provide mechanisms whereby the processor can go from user mode to
supervisor mode and vice versa. The first direction is typically accomplished
by a system call exception, implemented as a special instruction (syscall in
the MIPS instruction set) that transfers control to a dedicated location in
supervisor code space. As with any other exception, the program counter
from the point of the system call is saved in the exception PC (EPC), and
the processor is placed in supervisor mode. To return to user mode from the
exception, use the return from exception (ERET) instruction, which resets to
user mode and jumps to the address in EPC.

aliasing: A situation in which two addresses access the same object; it can occur in virtual memory when there are two virtual addresses for the same physical page.

physically addressed cache: A cache that is addressed by a physical address.

supervisor mode: Also called kernel mode. A mode indicating that a running process is an operating system process.

By using these mechanisms and storing the page tables in the operating system’s
address space, the operating system can change the page tables while preventing a
user process from changing them, ensuring that a user process can access only the
storage provided to it by the operating system.

We also want to prevent a process from reading the data of another process. For
example, we wouldn’t want a student program to read the grades while they were
in the processor’s memory. Once we begin sharing main memory, we must provide
the ability for a process to protect its data from both reading and writing by another
process; otherwise, sharing the main memory will be a mixed blessing!

Remember that each process has its own virtual address space. Thus, if the
operating system keeps the page tables organized so that the independent virtual
pages map to disjoint physical pages, one process will not be able to access another’s
data. Of course, this also requires that a user process be unable to change the page
table mapping. The operating system can assure safety if it prevents the user process
from modifying its own page tables. However, the operating system must be able
to modify the page tables. Placing the page tables in the protected address space of
the operating system satisfies both requirements.

When processes want to share information in a limited way, the operating system
must assist them, since accessing the information of another process requires
changing the page table of the accessing process. The write access bit can be used
to restrict the sharing to just read sharing, and, like the rest of the page table, this
bit can be changed only by the operating system. To allow another process, say, P1,
to read a page owned by process P2, P2 would ask the operating system to create
a page table entry for a virtual page in P1’s address space that points to the same
physical page that P2 wants to share. The operating system could use the write
protection bit to prevent P1 from writing the data, if that was P2’s wish. Any bits
that determine the access rights for a page must be included in both the page table
and the TLB, because the page table is accessed only on a TLB miss.
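
The P1/P2 sharing arrangement can be sketched with two small page tables. This is a simplified model under stated assumptions: `os_share_read_only`, `access`, and `WriteProtectionFault` are hypothetical names, page tables are plain dictionaries, and the page numbers are arbitrary.

```python
class WriteProtectionFault(Exception):
    """Raised when a process writes a page whose write access bit is off."""


def os_share_read_only(owner_table, owner_vpn, reader_table, reader_vpn):
    """Operating system service: map the owner's physical page into the
    reader's page table with the write access bit turned off."""
    physical_page = owner_table[owner_vpn]["ppn"]
    reader_table[reader_vpn] = {"ppn": physical_page, "write_access": False}


def access(page_table, vpn, is_write):
    """Return the physical page for a reference, enforcing write protection."""
    entry = page_table[vpn]
    if is_write and not entry["write_access"]:
        raise WriteProtectionFault(f"write to read-only virtual page {vpn:#x}")
    return entry["ppn"]


p2_table = {0x20: {"ppn": 0x55, "write_access": True}}   # P2 owns the page
p1_table = {}
os_share_read_only(p2_table, 0x20, p1_table, 0x70)       # P2 asks the OS to share it

p1_read = access(p1_table, 0x70, is_write=False)
p2_write = access(p2_table, 0x20, is_write=True)
try:
    access(p1_table, 0x70, is_write=True)
    p1_write_blocked = False
except WriteProtectionFault:
    p1_write_blocked = True
```

Both processes reach the same physical page through different virtual pages, but only P2 may write it, because only the operating system can change the write access bit.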

system call: A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.

Elaboration: When the operating system decides to change from running process
P1 to running process P2 (called a context switch or process switch), it must ensure
that P2 cannot get access to the page tables of P1 because that would compromise
protection. If there is no TLB, it suffices to change the page table register to point to P2’s
page table (rather than to P1’s); with a TLB, we must clear the TLB entries that belong to
P1, both to protect the data of P1 and to force the TLB to load the entries for P2. If the
process switch rate were high, this could be quite inefficient. For example, P2 might load
only a few TLB entries before the operating system switched back to P1. Unfortunately,
P1 would then find that all its TLB entries were gone and would have to pay TLB misses
to reload them. This problem arises because the virtual addresses used by P1 and P2
are the same, and we must clear out the TLB to avoid confusing these addresses.

A common alternative is to extend the virtual address space by adding a process
identifier or task identifier. The Intrinsity FastMATH has an 8-bit address space ID (ASID)
field for this purpose. This small field identifies the currently running process; it is kept
in a register loaded by the operating system when it switches processes. The process
identifier is concatenated to the tag portion of the TLB, so that a TLB hit occurs only if
both the page number and the process identifier match. This combination eliminates the
need to clear the TLB, except on rare occasions.
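
A small sketch shows why tagging entries with a process identifier removes the need to clear the TLB on a process switch. The class `AsidTlb` and its methods are hypothetical, and a dictionary keyed by (ASID, virtual page number) stands in for concatenating the identifier with the tag.

```python
class AsidTlb:
    """TLB whose tag is the pair (address space ID, virtual page number)."""

    def __init__(self):
        self.entries = {}        # (asid, vpn) -> physical page number
        self.current_asid = 0

    def context_switch(self, asid):
        # The OS loads the ASID register; no TLB entries are cleared.
        self.current_asid = asid & 0xFF   # 8-bit ASID field, as in the text

    def insert(self, vpn, ppn):
        self.entries[(self.current_asid, vpn)] = ppn

    def lookup(self, vpn):
        # Hit only if both the page number and the process identifier match.
        return self.entries.get((self.current_asid, vpn))


tlb = AsidTlb()
tlb.context_switch(1)            # run P1
tlb.insert(0x10, 0x80)
tlb.context_switch(2)            # run P2
p2_before = tlb.lookup(0x10)     # P1's entry for the same page does not match
tlb.insert(0x10, 0x90)
p2_after = tlb.lookup(0x10)
tlb.context_switch(1)            # back to P1
p1_after = tlb.lookup(0x10)      # P1's entry is still present
```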

Similar problems can occur for a cache, since on a process switch the cache will
contain data from the running process. These problems arise in different ways for
physically addressed and virtually addressed caches, and a variety of different solutions,
such as process identifiers, are used to ensure that a process gets its own data.

Handling TLB Misses and Page Faults
Although the translation of virtual to physical addresses with a TLB is
straightforward when we get a TLB hit, as we saw earlier, handling TLB misses and
page faults is more complex. A TLB miss occurs when no entry in the TLB matches
a virtual address. Recall that a TLB miss can indicate one of two possibilities:

1. The page is present in memory, and we need only create the missing TLB
entry.

2. The page is not present in memory, and we need to transfer control to the
operating system to deal with a page fault.

MIPS traditionally handles a TLB miss in software. It brings in the page table
entry from memory and then re-executes the instruction that caused the TLB miss.
Upon re-executing, it will get a TLB hit. If the page table entry indicates the page is
not in memory, this time it will get a page fault exception.

Handling a TLB miss or a page fault requires using the exception mechanism
to interrupt the active process, transferring control to the operating system, and
later resuming execution of the interrupted process. A page fault will be recognized
sometime during the clock cycle used to access memory. To restart the instruction
after the page fault is handled, the program counter of the instruction that caused
the page fault must be saved. Just as in Chapter 4, the exception program counter
(EPC) is used to hold this value.

context switch: A changing of the internal state of the processor to allow a different process to use the processor that includes saving the state needed to return to the currently executing process.

In addition, a TLB miss or page fault exception must be asserted by the end
of the same clock cycle that the memory access occurs, so that the next clock
cycle will begin exception processing rather than continue normal instruction
execution. If the page fault was not recognized in this clock cycle, a load instruction
could overwrite a register, and this could be disastrous when we try to restart the
instruction. For example, consider the instruction lw $1,0($1): the computer
must be able to prevent the write pipeline stage from occurring; otherwise, it could
not properly restart the instruction, since the contents of $1 would have been
destroyed. A similar complication arises on stores. We must prevent the write into
memory from actually completing when there is a page fault; this is usually done
by deasserting the write control line to the memory.

Between the time we begin executing the exception handler in the operating
system and the time that the operating system has saved all the state of the process,
the operating system is particularly vulnerable. For example, if another exception
occurred when we were processing the first exception in the operating system, the
control unit would overwrite the exception program counter, making it impossible
to return to the instruction that caused the page fault! We can avoid this disaster
by providing the ability to disable and enable exceptions. When an exception first
occurs, the processor sets a bit that disables all other exceptions; this could happen
at the same time the processor sets the supervisor mode bit. The operating system
will then save just enough state to allow it to recover if another exception occurs:
namely, the exception program counter (EPC) and Cause registers. EPC and Cause
are two of the special control registers that help with exceptions, TLB misses, and
page faults; Figure 5.33 shows the rest. The operating system can then re-enable
exceptions. These steps make sure that exceptions will not cause the processor
to lose any state and thereby be unable to restart execution of the interrupting
instruction.

Once the operating system knows the virtual address that caused the page fault, it
must complete three steps:

1. Look up the page table entry using the virtual address and find the location
of the referenced page on disk.

2. Choose a physical page to replace; if the chosen page is dirty, it must be
written out to disk before we can bring a new virtual page into this physical
page.

3. Start a read to bring the referenced page from disk into the chosen physical
page.
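
The three steps above can be sketched as a single routine. This is a simplified model, not operating system code: the function `handle_page_fault`, the dictionary layouts, and passing the victim page in as a parameter (rather than choosing it with a replacement policy) are all assumptions made for the example, and disk transfers are modeled as dictionary copies.

```python
def handle_page_fault(vpn, page_table, frames, disk, victim_vpn):
    """Bring virtual page vpn into memory, replacing victim_vpn.

    page_table: vpn -> {"valid", "dirty", "frame", "disk_addr"}
    frames: physical page number -> page contents
    disk: disk address -> page contents
    """
    # Step 1: use the page table entry to find the page's location on disk.
    entry = page_table[vpn]
    disk_addr = entry["disk_addr"]

    # Step 2: replace the chosen physical page; write it out first if dirty.
    victim = page_table[victim_vpn]
    frame = victim["frame"]
    if victim["dirty"]:
        disk[victim["disk_addr"]] = frames[frame]
    victim.update(valid=False, dirty=False, frame=None)

    # Step 3: read the referenced page from disk into the chosen physical page.
    frames[frame] = disk[disk_addr]
    entry.update(valid=True, dirty=False, frame=frame)
    return frame


page_table = {
    1: {"valid": True, "dirty": True, "frame": 0, "disk_addr": 100},
    2: {"valid": False, "dirty": False, "frame": None, "disk_addr": 200},
}
frames = {0: "page 1, modified"}
disk = {100: "page 1, original", 200: "page 2"}
frame = handle_page_fault(2, page_table, frames, disk, victim_vpn=1)
```

Because the victim page is dirty in this usage, its contents reach the disk before the physical page is reused; a clean victim would skip that write.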

exception enable: Also called interrupt enable. A signal or action that controls whether the process responds to an exception or not; necessary for preventing the occurrence of exceptions during intervals before the processor has safely saved the state needed to restart.

Of course, this last step will take millions of processor clock cycles (so will the
second if the replaced page is dirty); accordingly, the operating system will usually
select another process to execute in the processor until the disk access completes.
Because the operating system has saved the state of the process, it can freely give
control of the processor to another process.

When the read of the page from disk is complete, the operating system can
restore the state of the process that originally caused the page fault and execute
the instruction that returns from the exception. This instruction will reset the
processor from kernel to user mode, as well as restore the program counter. The
user process then re-executes the instruction that faulted, accesses the requested
page successfully, and continues execution.

Page fault exceptions for data accesses are difficult to implement properly in a
processor because of a combination of three characteristics:

1. They occur in the middle of instructions, unlike instruction page faults.

2. The instruction cannot be completed before handling the exception.

3. After handling the exception, the instruction must be restarted as if nothing
had occurred.

Making instructions restartable, so that the exception can be handled and the
instruction later continued, is relatively easy in an architecture like the MIPS.
Because each instruction writes only one data item and this write occurs at the end
of the instruction cycle, we can simply prevent the instruction from completing (by
not writing) and restart the instruction at the beginning.

Let’s look in more detail at MIPS. When a TLB miss occurs, the MIPS hardware
saves the page number of the reference in a special register called BadVAddr and
generates an exception.

restartable instruction: An instruction that can resume execution after an exception is resolved without the exception’s affecting the result of the instruction.

Register   CP0 register number   Description
EPC        14                    Where to restart after exception
Cause      13                    Cause of exception
BadVAddr   8                     Address that caused exception
Index      0                     Location in TLB to be read or written
Random     1                     Pseudorandom location in TLB
EntryLo    2                     Physical page address and flags
EntryHi    10                    Virtual page address
Context    4                     Page table address and page number

FIGURE 5.33 MIPS control registers. These are considered to be in coprocessor 0, and hence are
read using mfc0 and written using mtc0.

The exception invokes the operating system, which handles the miss in software.
Control is transferred to address 8000 0000hex, the location of the TLB miss handler.
To fi nd the physical address for the missing page, the TLB miss routine indexes the
page table using the page number of the virtual address and the page table register,
which indicates the starting address of the active process page table. To make this
indexing fast, MIPS hardware places everything you need in the special Context
register: the upper 12 bits have the address of the base of the page table, and the
next 18 bits have the virtual address of the missing page. Each page table entry is
one word, so the last 2 bits are 0. Th us, the fi rst two instructions copy the Context
register into the kernel temporary register $k1 and then load the page table entry
from that address into $k1. Recall that $k0 and $k1 are reserved for the operating
system to use without saving; a major reason for this convention is to make the TLB
miss handler fast. Below is the MIPS code for a typical TLB miss handler:

TLBmiss:
    mfc0  $k1,Context    # copy address of PTE into temp $k1
    lw    $k1,0($k1)     # put PTE into temp $k1
    mtc0  $k1,EntryLo    # put PTE into special register EntryLo
    tlbwr                # put EntryLo into TLB entry at Random
    eret                 # return from TLB miss exception

As shown above, MIPS has a special set of system instructions to update the
TLB. The instruction tlbwr copies from control register EntryLo into the TLB
entry selected by the control register Random. Random implements random
replacement, so it is basically a free-running counter. A TLB miss takes about a
dozen clock cycles.
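
The Context register layout described above can be modeled in a few lines. The following Python sketch is only an illustration under stated assumptions: the names `make_context` and `tlb_miss_lookup`, and the dictionary standing in for memory, are hypothetical and are not MIPS definitions. It uses the field widths given in the text (12 bits of page table base, 18 bits of page number, 2 zero bits) to show why the handler can use the Context value directly as the address of a one-word page table entry.

```python
PTE_BASE_BITS = 12  # upper bits: base of the page table (widths as given in the text)
VPN_BITS = 18       # next bits: page number of the reference that missed
PTE_SHIFT = 2       # each PTE is one 4-byte word, so the low 2 bits are 0


def make_context(pte_base: int, vpn: int) -> int:
    """Pack the page table base and page number into a Context-style word."""
    assert 0 <= pte_base < (1 << PTE_BASE_BITS)
    assert 0 <= vpn < (1 << VPN_BITS)
    return (pte_base << (VPN_BITS + PTE_SHIFT)) | (vpn << PTE_SHIFT)


def tlb_miss_lookup(context: int, memory: dict) -> int:
    """Model of `mfc0 $k1,Context` then `lw $k1,0($k1)`: Context is the PTE address."""
    return memory[context]


# Hypothetical values chosen only for illustration.
context = make_context(pte_base=0x801, vpn=5)
memory = {context: 0x1234}          # toy memory: PTE address -> PTE value
pte = tlb_miss_lookup(context, memory)
```

Because the hardware does the shifting and concatenation when it fills Context, the handler needs no arithmetic of its own, which is one reason it can be so short.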

Note that the TLB miss handler does not check to see if the page table entry is
valid. Because the exception for TLB entry missing is much more frequent than
a page fault, the operating system loads the TLB from the page table without
examining the entry and restarts the instruction. If the entry is invalid, another
and different exception occurs, and the operating system recognizes the page fault.
This method makes the frequent case of a TLB miss fast, at a slight performance
penalty for the infrequent case of a page fault.

Once the process that generated the page fault has been interrupted, it transfers
control to 8000 0180hex, a different address than the TLB miss handler. This is
the general address for exception; TLB miss has a special entry point to lower the
penalty for a TLB miss. The operating system uses the exception Cause register
to diagnose the cause of the exception. Because the exception is a page fault, the
operating system knows that extensive processing will be required. Thus, unlike a
TLB miss, it saves the entire state of the active process. This state includes all the
general-purpose and floating-point registers, the page table address register, the
EPC, and the exception Cause register. Since exception handlers do not usually use
the floating-point registers, the general entry point does not save them, leaving that
to the few handlers that need them.

handler: Name of a software routine invoked to “handle” an exception or
interrupt.

450 Chapter 5 Large and Fast: Exploiting Memory Hierarchy

Figure 5.34 sketches the MIPS code of an exception handler. Note that we
save and restore the state in MIPS code, taking care when we enable and disable
exceptions, but we invoke C code to handle the particular exception.

The virtual address that caused the fault depends on whether the fault was an
instruction or data fault. The address of the instruction that generated the fault is
in the EPC. If it was an instruction page fault, the EPC contains the virtual address
of the faulting page; otherwise, the faulting virtual address can be computed by
examining the instruction (whose address is in the EPC) to find the base register
and offset field.

Elaboration: This simplified version assumes that the stack pointer (sp) is valid. To
avoid the problem of a page fault during this low-level exception code, MIPS sets aside
a portion of its address space that cannot have page faults, called unmapped. The
operating system places the exception entry point code and the exception stack in
unmapped memory. MIPS hardware translates virtual addresses 8000 0000hex to
BFFF FFFFhex to physical addresses simply by ignoring the upper bits of the virtual
address, thereby placing these addresses in the low part of physical memory. Thus, the
operating system places exception entry points and exception stacks in unmapped memory.
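
As a rough illustration of translating by ignoring the upper bits, the Python sketch below maps an address in the unmapped range to a physical address with a single mask. The function name is hypothetical, and the choice of mask is an assumption: it drops the top three bits, which matches the usual description of the MIPS32 unmapped segments, whereas the text above says only that the upper bits are ignored.

```python
UNMAPPED_START = 0x8000_0000
UNMAPPED_END = 0xBFFF_FFFF
PHYS_MASK = 0x1FFF_FFFF  # assumption: "upper bits" means the top 3 address bits


def unmapped_to_physical(vaddr: int) -> int:
    """Translate an unmapped virtual address without consulting TLB or page table."""
    if not (UNMAPPED_START <= vaddr <= UNMAPPED_END):
        raise ValueError("address is outside the unmapped region")
    return vaddr & PHYS_MASK


# The general exception entry point lands in low physical memory.
entry_point_phys = unmapped_to_physical(0x8000_0180)
```

Since no table lookup is involved, an access to this region cannot miss in the TLB or cause a page fault, which is the property the exception entry code depends on.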

Elaboration: The code in Figure 5.34 shows the MIPS-32 exception return sequence.
The older MIPS-I architecture uses rfe and jr instead of eret.

Elaboration: For processors with more complex instructions that can touch many
memory locations and write many data items, making instructions restartable is much
harder. Processing one instruction may generate a number of page faults in the middle
of the instruction. For example, x86 processors have block move instructions that touch
thousands of data words. In such processors, instructions often cannot be restarted
from the beginning, as we do for MIPS instructions. Instead, the instruction must be
interrupted and later continued midstream in its execution. Resuming an instruction in
the middle of its execution usually requires saving some special state, processing the
exception, and restoring that special state. Making this work properly requires careful
and detailed coordination between the exception-handling code in the operating system
and the hardware.

Elaboration: Rather than pay an extra level of indirection on every memory access, the
VMM maintains a shadow page table that maps directly from the guest virtual address
space to the physical address space of the hardware. By detecting all modifications to
the guest’s page table, the VMM can ensure the shadow page table entries being used
by the hardware for translations correspond to those of the guest OS environment, with
the exception of the correct physical pages substituted for the real pages in the guest
tables. Hence, the VMM must trap any attempt by the guest OS to change its page table
or to access the page table pointer. This is commonly done by write protecting the guest
page tables and trapping any access to the page table pointer by a guest OS. As noted
above, the latter happens naturally if accessing the page table pointer is a privileged
operation.

unmapped: A portion of the address space that cannot have page faults.


Save state

    # Save GPR
    addi  $k1, $sp, -XCPSIZE   # save space on stack for state
    sw    $sp, XCT_SP($k1)     # save $sp on stack
    sw    $v0, XCT_V0($k1)     # save $v0 on stack
    …                          # save $v1, $ai, $si, $ti,… on stack
    sw    $ra, XCT_RA($k1)     # save $ra on stack

    # Save hi, lo
    mfhi  $v0                  # copy Hi
    mflo  $v1                  # copy Lo
    sw    $v0, XCT_HI($k1)     # save Hi value on stack
    sw    $v1, XCT_LO($k1)     # save Lo value on stack

    # Save exception registers
    mfc0  $a0, $cr             # copy cause register
    sw    $a0, XCT_CR($k1)     # save $cr value on stack
    …                          # save $v1,….
    mfc0  $a3, $sr             # copy status register
    sw    $a3, XCT_SR($k1)     # save $sr on stack

    # Set sp
    move  $sp, $k1             # sp = sp - XCPSIZE

Enable nested exceptions

    andi  $v0, $a3, MASK1      # $v0 = $sr & MASK1, enable exceptions
    mtc0  $v0, $sr             # $sr = value that enables exceptions

Call C exception handler

    # Set $gp
    move  $gp, GPINIT          # set $gp to point to heap area

    # Call C code
    move  $a0, $sp             # arg1 = pointer to exception stack
    jal   xcpt_deliver         # call C code to handle exception

Restoring state

    # Restore most GPR, hi, lo
    move  $at, $sp             # temporary value of $sp
    lw    $ra, XCT_RA($at)     # restore $ra from stack
    …                          # restore $t0,…., $a1
    lw    $a0, XCT_A0($at)     # restore $a0 from stack

    # Restore status register
    lw    $v0, XCT_SR($at)     # load old $sr from stack
    li    $v1, MASK2           # mask to disable exceptions
    and   $v0, $v0, $v1        # $v0 = $sr & MASK2, disable exceptions
    mtc0  $v0, $sr             # set status register

Exception return

    # Restore $sp and rest of GPR used as temporary registers
    lw    $sp, XCT_SP($at)     # restore $sp from stack
    lw    $v0, XCT_V0($at)     # restore $v0 from stack
    lw    $v1, XCT_V1($at)     # restore $v1 from stack
    lw    $k1, XCT_EPC($at)    # copy old $epc from stack
    lw    $at, XCT_AT($at)     # restore $at from stack

    # Restore EPC and return
    mtc0  $k1, $epc            # restore $epc
    eret                       # return to interrupted instruction

FIGURE 5.34 MIPS code to save and restore state on an exception.


Elaboration: The final portion of the architecture to virtualize is I/O. This is by far
the most difficult part of system virtualization because of the increasing number of
I/O devices attached to the computer and the increasing diversity of I/O device types.
Another difficulty is the sharing of a real device among multiple VMs, and yet another
comes from supporting the myriad of device drivers that are required, especially if
different guest OSes are supported on the same VM system. The VM illusion can be
maintained by giving each VM generic versions of each type of I/O device driver, and then
leaving it to the VMM to handle real I/O.

Elaboration: In addition to virtualizing the instruction set for a virtual machine,
another challenge is virtualization of virtual memory, as each guest OS in every virtual
machine manages its own set of page tables. To make this work, the VMM separates
the notions of real and physical memory (which are often treated synonymously), and
makes real memory a separate, intermediate level between virtual memory and physical
memory. (Some use the terms virtual memory, physical memory, and machine memory
to name the same three levels.) The guest OS maps virtual memory to real memory
via its page tables, and the VMM page tables map the guest’s real memory to physical
memory. The virtual memory architecture is specified either via page tables, as in IBM
VM/370 and the x86, or via the TLB structure, as in MIPS.

Summary
Virtual memory is the name for the level of memory hierarchy that manages
caching between the main memory and secondary memory. Virtual memory
allows a single program to expand its address space beyond the limits of main
memory. More importantly, virtual memory supports sharing of the main memory
among multiple, simultaneously active processes, in a protected manner.

Managing the memory hierarchy between main memory and disk is challenging
because of the high cost of page faults. Several techniques are used to reduce the
miss rate:

1. Pages are made large to take advantage of spatial locality and to reduce the
miss rate.

2. The mapping between virtual addresses and physical addresses, which is
implemented with a page table, is made fully associative so that a virtual
page can be placed anywhere in main memory.

3. The operating system uses techniques, such as LRU and a reference bit, to
choose which pages to replace.


Writes to secondary memory are expensive, so virtual memory uses a write-back
scheme and also tracks whether a page is unchanged (using a dirty bit) to avoid
writing unchanged pages.

The virtual memory mechanism provides address translation from a virtual
address used by the program to the physical address space used for accessing
memory. This address translation allows protected sharing of the main memory
and provides several additional benefits, such as simplifying memory allocation.
Ensuring that processes are protected from each other requires that only the
operating system can change the address translations, which is implemented by
preventing user programs from changing the page tables. Controlled sharing of
pages among processes can be implemented with the help of the operating system
and access bits in the page table that indicate whether the user program has read or
write access to a page.

If a processor had to access a page table resident in memory to translate every
access, virtual memory would be too expensive, as caches would be pointless!
Instead, a TLB acts as a cache for translations from the page table. Addresses are
then translated from virtual to physical using the translations in the TLB.

Caches, virtual memory, and TLBs all rely on a common set of principles and
policies. The next section discusses this common framework.

Understanding Program Performance

Although virtual memory was invented to enable a small memory to act as a large
one, the performance difference between secondary memory and main memory
means that if a program routinely accesses more virtual memory than it has
physical memory, it will run very slowly. Such a program would be continuously
swapping pages between memory and disk, called thrashing. Thrashing is a disaster
if it occurs, but it is rare. If your program thrashes, the easiest solution is to run it on
a computer with more memory or buy more memory for your computer. A more
complex choice is to re-examine your algorithm and data structures to see if you
can change the locality and thereby reduce the number of pages that your program
uses simultaneously. This set of popular pages is informally called the working set.

A more common performance problem is TLB misses. Since a TLB might
handle only 32–64 page entries at a time, a program could easily see a high TLB
miss rate, as the processor can reach at most a quarter mebibyte without a TLB
miss: 64 × 4 KiB = 0.25 MiB. For example, TLB misses are often a challenge for Radix
Sort. To try to alleviate this problem, most computer architectures now support
variable page sizes. For example, in addition to the standard 4 KiB page, MIPS
hardware supports 16 KiB, 64 KiB, 256 KiB, 1 MiB, 4 MiB, 16 MiB, 64 MiB, and
256 MiB pages. Hence, if a program uses large page sizes, it can access more
memory directly without TLB misses.

The practical challenge is getting the operating system to allow programs to
select these larger page sizes. Once again, the more complex solution to reducing
TLB misses is to re-examine the algorithm and data structures to reduce the
working set of pages; given the importance of memory accesses to performance
and the frequency of TLB misses, some programs with large working sets have
been redesigned with that goal.
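
The arithmetic behind the quarter-mebibyte figure generalizes to any page size. The short Python sketch below computes how much memory a TLB can map at once; the function name `tlb_reach` is an assumption made for this example, not a term from the text. It assumes every entry maps a page of the same size.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size_bytes: int) -> int:
    """Bytes of memory that can be accessed without a TLB miss."""
    return entries * page_size_bytes


# 64 entries with the standard 4 KiB page versus a 4 MiB large page.
small_page_reach = tlb_reach(64, 4 * KIB)
large_page_reach = tlb_reach(64, 4 * MIB)
```

With 4 KiB pages the 64-entry TLB covers 0.25 MiB; with 4 MiB pages the same TLB covers 256 MiB, which is why larger pages can help programs with large working sets.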

Check Yourself

Match the definitions in the right column to the terms in the left column.

1. L1 cache       a. A cache for a cache
2. L2 cache       b. A cache for disks
3. Main memory    c. A cache for a main memory
4. TLB            d. A cache for page table entries

5.8 A Common Framework for Memory Hierarchy

By now, you’ve recognized that the different types of memory hierarchies have a
great deal in common. Although many of the aspects of memory hierarchies differ
quantitatively, many of the policies and features that determine how a hierarchy
functions are similar qualitatively. Figure 5.35 shows how some of the quantitative
characteristics of memory hierarchies can differ. In the rest of this section, we will
discuss the common operational alternatives for memory hierarchies, and how
these determine their behavior. We will examine these policies as a series of four
questions that apply between any two levels of a memory hierarchy, although for
simplicity we will primarily use terminology for caches.

Feature                     Typical values  Typical values  Typical values for       Typical values
                            for L1 caches   for L2 caches   paged memory             for a TLB
Total size in blocks        250–2000        2,500–25,000    16,000–250,000           40–1024
Total size in kilobytes     16–64           125–2000        1,000,000–1,000,000,000  0.25–16
Block size in bytes         16–64           64–128          4000–64,000              4–32
Miss penalty in clocks      10–25           100–1000        10,000,000–100,000,000   10–1000
Miss rates (global for L2)  2%–5%           0.1%–2%         0.00001%–0.0001%         0.01%–2%

FIGURE 5.35 The key quantitative design parameters that characterize the major elements of memory hierarchy in a
computer. These are typical values for these levels as of 2012. Although the range of values is wide, this is partially because many of the values
that have shifted over time are related; for example, as caches become larger to overcome larger miss penalties, block sizes also grow. While not
shown, server microprocessors today also have L3 caches, which can be 2 to 8 MiB and contain many more blocks than L2 caches. L3 caches
lower the L2 miss penalty to 30 to 40 clock cycles.


Question 1: Where Can a Block Be Placed?
We have seen that block placement in the upper level of the hierarchy can use a range
of schemes, from direct mapped to set associative to fully associative. As mentioned
above, this entire range of schemes can be thought of as variations on a set-associative
scheme where the number of sets and the number of blocks per set varies:

Scheme name        Number of sets                                 Blocks per set
Direct mapped      Number of blocks in cache                      1
Set associative    Number of blocks in the cache / Associativity  Associativity (typically 2–16)
Fully associative  1                                              Number of blocks in the cache
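
The table can be read as one formula with a single parameter. The Python sketch below is a minimal illustration, with function names chosen for this example: it computes the number of sets from the associativity and then the set to which a block address maps, assuming the usual modulo placement.

```python
def num_sets(num_blocks: int, associativity: int) -> int:
    """Number of sets = number of blocks in the cache / associativity."""
    assert num_blocks % associativity == 0
    return num_blocks // associativity


def set_index(block_address: int, num_blocks: int, associativity: int) -> int:
    """Set that a block address maps to, using modulo placement."""
    return block_address % num_sets(num_blocks, associativity)


# An 8-block cache and block address 12 under the three schemes.
direct_mapped = set_index(12, num_blocks=8, associativity=1)      # 8 sets
two_way = set_index(12, num_blocks=8, associativity=2)            # 4 sets
fully_associative = set_index(12, num_blocks=8, associativity=8)  # 1 set
```

Direct mapped and fully associative are simply the two extremes: associativity 1 gives as many sets as blocks, and associativity equal to the number of blocks gives a single set.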

The advantage of increasing the degree of associativity is that it usually decreases
the miss rate. The improvement in miss rate comes from reducing misses that
compete for the same location. We will examine these in more detail shortly. First,
let’s look at how much improvement is gained. Figure 5.36 shows the miss rates
for several cache sizes as associativity varies from direct mapped to eight-way set
associative. The largest gains are obtained in going from direct mapped to two-way
set associative, which yields between a 20% and 30% reduction in the miss rate.
As cache sizes grow, the relative improvement from associativity increases only
slightly; since the overall miss rate of a larger cache is lower, the opportunity for
improving the miss rate decreases and the absolute improvement in the miss rate
from associativity shrinks significantly. The potential disadvantages of associativity,
as we mentioned earlier, are increased cost and slower access time.

[Figure 5.36: plot of miss rate (vertical axis, 0 to 15%) against associativity
(one-way, two-way, four-way, eight-way), with one curve for each cache size: 1 KiB,
2 KiB, 4 KiB, 8 KiB, 16 KiB, 32 KiB, 64 KiB, and 128 KiB.]

FIGURE 5.36 The data cache miss rates for each of eight cache sizes improve as the
associativity increases. While the benefit of going from one-way (direct mapped) to two-way set
associative is significant, the benefits of further associativity are smaller (e.g., 1%–10% improvement going
from two-way to four-way versus 20%–30% improvement going from one-way to two-way). There is even
less improvement in going from four-way to eight-way set associative, which, in turn, comes very close to
the miss rates of a fully associative cache. Smaller caches obtain a significantly larger absolute benefit from
associativity because the base miss rate of a small cache is larger. Figure 5.16 explains how this data was
collected.

Question 2: How Is a Block Found?
The choice of how we locate a block depends on the block placement scheme, since
that dictates the number of possible locations. We can summarize the schemes as
follows:

Associativity    Location method                       Comparisons required
Direct mapped    Index                                 1
Set associative  Index the set, search among elements  Degree of associativity
Full             Search all cache entries              Size of the cache
                 Separate lookup table                 0
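
To make the comparison counts in the table concrete, here is a small Python sketch of the index-then-search method. It is an illustrative model, not a hardware description: the function name `find_block` and the list-of-lists cache representation are assumptions, and the loop counts tag comparisons one at a time even though real hardware performs them in parallel with one comparator per way.

```python
def find_block(sets, block_address):
    """Look up a block; return (hit, number of tag comparisons made).

    `sets` is a list of sets; each set is a list of stored tags, with
    None marking an empty way.
    """
    n = len(sets)
    index = block_address % n   # index the set
    tag = block_address // n    # remaining bits identify the block
    comparisons = 0
    for stored_tag in sets[index]:  # search among the elements of the set
        comparisons += 1
        if stored_tag == tag:
            return True, comparisons
    return False, comparisons


# Direct mapped: 4 sets of 1 block each; block 13 maps to set 1 with tag 3.
direct_result = find_block([[None], [3], [None], [None]], 13)

# Fully associative: 1 set of 4 blocks; every entry may need to be checked.
full_result = find_block([[7, 13, 2, 9]], 9)
```

The worst-case comparison count equals the number of ways in the indexed set, which is 1 for direct mapped, the degree of associativity for set associative, and the size of the cache for fully associative.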

The choice among direct-mapped, set-associative, or fully associative mapping
in any memory hierarchy will depend on the cost of a miss versus the cost of
implementing associativity, both in time and in extra hardware. Including the
L2 cache on the chip enables much higher associativity, because the hit times are
not as critical and the designer does not have to rely on standard SRAM chips as
the building blocks. Fully associative caches are prohibitive except for small sizes,
where the cost of the comparators is not overwhelming and where the absolute
miss rate improvements are greatest.

In virtual memory systems, a separate mapping table (the page table) is kept
to index the memory. In addition to the storage required for the table, using an
index table requires an extra memory access. The choice of full associativity for
page placement and the extra table is motivated by these facts:

1. Full associativity is beneficial, since misses are very expensive.

2. Full associativity allows software to use sophisticated replacement schemes
that are designed to reduce the miss rate.

3. The full map can be easily indexed with no extra hardware and no searching
required.

Therefore, virtual memory systems almost always use fully associative placement.
Set-associative placement is often used for caches and TLBs, where the access
combines indexing and the search of a small set. A few systems have used
direct-mapped caches because of their advantage in access time and simplicity. The
advantage in access time occurs because finding the requested block does not
depend on a comparison. Such design choices depend on many details of the
implementation, such as whether the cache is on-chip, the technology used for
implementing the cache, and the critical role of cache access time in determining
the processor cycle time.

Question 3: Which Block Should Be Replaced on a Cache Miss?
When a miss occurs in an associative cache, we must decide which block to replace.
In a fully associative cache, all blocks are candidates for replacement. If the cache is
set associative, we must choose among the blocks in the set. Of course, replacement
is easy in a direct-mapped cache because there is only one candidate.

There are two primary strategies for replacement in set-associative or fully
associative caches:

■ Random: Candidate blocks are randomly selected, possibly using some hardware
assistance. For example, MIPS supports random replacement for TLB misses.

■ Least recently used (LRU): The block replaced is the one that has been unused
for the longest time.

In practice, LRU is too costly to implement for hierarchies with more than a small
degree of associativity (two to four, typically), since tracking the usage information
is costly. Even for four-way set associativity, LRU is often approximated, for
example, by keeping track of which pair of blocks is LRU (which requires 1 bit),
and then tracking which block in each pair is LRU (which requires 1 bit per pair).
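
The pair-based approximation just described fits in three bits per four-way set. The Python sketch below is one possible reading of that scheme, with a hypothetical class name and method names; it is not taken from any particular processor. One bit records which pair is least recently used, and one bit per pair records which block within that pair is least recently used.

```python
class PseudoLRU4:
    """Three-bit LRU approximation for one four-way set (ways 0 to 3)."""

    def __init__(self):
        self.lru_pair = 0          # 0: ways 0-1 are the LRU pair, 1: ways 2-3
        self.lru_in_pair = [0, 0]  # per pair: which of its two ways is LRU

    def touch(self, way: int) -> None:
        """Record an access to `way`."""
        pair, position = divmod(way, 2)
        self.lru_pair = 1 - pair               # the other pair is now older
        self.lru_in_pair[pair] = 1 - position  # the sibling way is now older

    def victim(self) -> int:
        """Way to replace: the LRU way within the LRU pair."""
        pair = self.lru_pair
        return 2 * pair + self.lru_in_pair[pair]


plru = PseudoLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
victim_after_fill = plru.victim()   # way 0, which is also the true LRU way

plru.touch(0)
victim_after_reuse = plru.victim()  # way 2, although true LRU would pick way 1
```

The second result shows why this is only an approximation: after the access order 1, 2, 3, 0, exact LRU would evict way 1, but the three bits cannot represent the full ordering and select way 2 instead.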

For larger associativity, either LRU is approximated or random replacement is
used. In caches, the replacement algorithm is in hardware, which means that the
scheme should be easy to implement. Random replacement is simple to build in
hardware, and for a two-way set-associative cache, random replacement has a miss
rate about 1.1 times higher than LRU replacement. As the caches become larger, the
miss rate for both replacement strategies falls, and the absolute difference becomes
small. In fact, random replacement can sometimes be better than the simple LRU
approximations that are easily implemented in hardware.

In virtual memory, some form of LRU is always approximated, since even a tiny
reduction in the miss rate can be important when the cost of a miss is enormous.
Reference bits or equivalent functionality are often provided to make it easier for
the operating system to track a set of less recently used pages. Because misses are
so expensive and relatively infrequent, approximating this information primarily
in software is acceptable.

Question 4: What Happens on a Write?
A key characteristic of any memory hierarchy is how it deals with writes. We have
already seen the two basic options:

■ Write-through: The information is written to both the block in the cache and
the block in the lower level of the memory hierarchy (main memory for a
cache). The caches in Section 5.3 used this scheme.

■ Write-back: The information is written only to the block in the cache. The
modified block is written to the lower level of the hierarchy only when it
is replaced. Virtual memory systems always use write-back, for the reasons
discussed in Section 5.7.

Both write-back and write-through have their advantages. The key advantages of
write-back are the following:

■ Individual words can be written by the processor at the rate that the cache,
rather than the memory, can accept them.

■ Multiple writes within a block require only one write to the lower level in the
hierarchy.

■ When blocks are written back, the system can make effective use of a
high-bandwidth transfer, since the entire block is written.

Write-through has these advantages:

■ Misses are simpler and cheaper because they never require a block to be
written back to the lower level.

■ Write-through is easier to implement than write-back, although to be
practical, a write-through cache will still need to use a write buffer.
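
The second write-back advantage, that multiple writes within a block cost only one write to the lower level, can be checked with a small count. The Python sketch below is a simplified model under several assumptions: the function name is hypothetical, the cache is direct mapped with write allocate, the input is a stream of block addresses written by the processor, and any dirty blocks still in the cache at the end are counted as one final write each.

```python
def lower_level_writes(write_blocks, num_blocks, policy):
    """Count writes that reach the lower level for a stream of processor writes."""
    if policy == "write-through":
        return len(write_blocks)  # every write also goes to the lower level
    if policy != "write-back":
        raise ValueError("policy must be 'write-through' or 'write-back'")
    cache = {}  # cache index -> block address held there (dirty by construction)
    writes = 0
    for block in write_blocks:
        index = block % num_blocks
        if index in cache and cache[index] != block:
            writes += 1           # replacing a dirty block: one block write
        cache[index] = block
    return writes + len(cache)    # assumed final flush of remaining dirty blocks


stream = [0, 0, 0, 1, 1, 4]       # block addresses written, in order
through_count = lower_level_writes(stream, num_blocks=4, policy="write-through")
back_count = lower_level_writes(stream, num_blocks=4, policy="write-back")
```

For this stream, write-through sends six writes to the lower level, while write-back sends three block writes: one when block 4 replaces block 0, and two when the remaining dirty blocks are flushed.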

The BIG Picture

Caches, TLBs, and virtual memory may initially look very different, but
they rely on the same two principles of locality, and they can be understood
by their answers to four questions:

Question 1: Where can a block be placed?
Answer: One place (direct mapped), a few places (set associative),
or any place (fully associative).

Question 2: How is a block found?
Answer: There are four methods: indexing (as in a direct-mapped
cache), limited search (as in a set-associative cache), full
search (as in a fully associative cache), and a separate
lookup table (as in a page table).

Question 3: What block is replaced on a miss?
Answer: Typically, either the least recently used or a random block.

Question 4: How are writes handled?
Answer: Each level in the hierarchy can use either write-through
or write-back.


In virtual memory systems, only a write-back policy is practical because of the long
latency of a write to the lower level of the hierarchy. The rate at which writes are
generated by a processor generally exceeds the rate at which the memory system can
process them, even allowing for physically and logically wider memories and burst
modes for DRAM. Consequently, today lowest-level caches typically use write-back.

The Three Cs: An Intuitive Model for Understanding the Behavior of Memory Hierarchies
In this subsection, we look at a model that provides insight into the sources of
misses in a memory hierarchy and how the misses will be affected by changes
in the hierarchy. We will explain the ideas in terms of caches, although the ideas
carry over directly to any other level in the hierarchy. In this model, all misses are
classified into one of three categories (the three Cs):

■ Compulsory misses: These are cache misses caused by the first access to
a block that has never been in the cache. These are also called cold-start
misses.

■ Capacity misses: These are cache misses caused when the cache cannot
contain all the blocks needed during execution of a program. Capacity misses
occur when blocks are replaced and then later retrieved.

■ Conflict misses: These are cache misses that occur in set-associative or
direct-mapped caches when multiple blocks compete for the same set.
Conflict misses are those misses in a direct-mapped or set-associative cache
that are eliminated in a fully associative cache of the same size. These cache
misses are also called collision misses.
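
The three definitions above translate into a simple classification procedure for a trace of block addresses. The Python sketch below is one way to do it, offered as an illustration rather than as the method used to produce the book's data: the function name is hypothetical, and it assumes LRU replacement both in the cache under study and in the fully associative reference cache of the same size. A miss is compulsory if the block has never been referenced, capacity if the fully associative cache also misses, and conflict otherwise.

```python
from collections import OrderedDict


def classify_misses(trace, num_blocks, associativity):
    """Count compulsory, capacity, and conflict misses for an LRU cache."""
    n_sets = num_blocks // associativity
    sets = [OrderedDict() for _ in range(n_sets)]  # cache under study
    full = OrderedDict()   # fully associative reference cache, same size
    seen = set()           # every block referenced so far
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        # Update the fully associative reference cache (LRU order).
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) == num_blocks:
                full.popitem(last=False)   # evict least recently used
            full[block] = True
        # Update the cache under study and classify any miss.
        cache_set = sets[block % n_sets]
        if block in cache_set:
            cache_set.move_to_end(block)
        else:
            if block not in seen:
                counts["compulsory"] += 1
            elif not full_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
            if len(cache_set) == associativity:
                cache_set.popitem(last=False)
            cache_set[block] = True
        seen.add(block)
    return counts


# Blocks 0 and 4 compete for the same set of a 4-block direct-mapped cache.
conflict_example = classify_misses([0, 4, 0, 4], num_blocks=4, associativity=1)

# Five distinct blocks cannot all fit in a 4-block fully associative cache.
capacity_example = classify_misses([0, 1, 2, 3, 4, 0], num_blocks=4, associativity=4)
```

In the first trace the two repeat accesses are conflict misses, since a fully associative cache of the same size would have hit. In the second, the final access to block 0 is a capacity miss, since even full associativity cannot hold five blocks in four entries.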

Figure 5.37 shows how the miss rate divides into the three sources. These sources of
misses can be directly attacked by changing some aspect of the cache design. Since
conflict misses arise directly from contention for the same cache block, increasing
associativity reduces conflict misses. Associativity, however, may slow access time,
leading to lower overall performance.

Capacity misses can easily be reduced by enlarging the cache; indeed, second-level
caches have been growing steadily larger for many years. Of course, when we
make the cache larger, we must also be careful about increasing the access time,
which could lead to lower overall performance. Thus, first-level caches have been
growing slowly, if at all.

Because compulsory misses are generated by the first reference to a block, the
primary way for the cache system to reduce the number of compulsory misses is
to increase the block size. This will reduce the number of references required to
touch each block of the program once, because the program will consist of fewer
cache blocks. As mentioned above, increasing the block size too much can have a
negative effect on performance because of the increase in the miss penalty.

three Cs model: A cache model in which all cache misses are classified into
one of three categories: compulsory misses, capacity misses, and conflict misses.

compulsory miss: Also called cold-start miss. A cache miss caused by the first
access to a block that has never been in the cache.

capacity miss: A cache miss that occurs because the cache, even with full
associativity, cannot contain all the blocks needed to satisfy the request.

conflict miss: Also called collision miss. A cache miss that occurs in a
set-associative or direct-mapped cache when multiple blocks compete for the
same set and that is eliminated in a fully associative cache of the same size.

[Figure 5.37: plot of miss rate per type (vertical axis, 0% to 10%) against cache
size in KiB (4, 8, 16, 32, 64, 128, 256, 512, 1024), with regions labeled Capacity,
One-way, Two-way, and Four-way.]

FIGURE 5.37 The miss rate can be broken into three sources of misses. This graph shows
the total miss rate and its components for a range of cache sizes. This data is for the SPEC CPU2000 integer
and floating-point benchmarks and is from the same source as the data in Figure 5.36. The compulsory
miss component is 0.006% and cannot be seen in this graph. The next component is the capacity miss rate,
which depends on cache size. The conflict portion, which depends both on associativity and on cache size, is
shown for a range of associativities from one-way to eight-way. In each case, the labeled section corresponds
to the increase in the miss rate that occurs when the associativity is changed from the next higher degree to
the labeled degree of associativity. For example, the section labeled two-way indicates the additional misses
arising when the cache has associativity of two rather than four. Thus, the difference in the miss rate incurred
by a direct-mapped cache versus a fully associative cache of the same size is given by the sum of the sections
marked four-way, two-way, and one-way. The difference between eight-way and four-way is so small that it
is difficult to see on this graph.

The BIG Picture

The challenge in designing memory hierarchies is that every change
that potentially improves the miss rate can also negatively affect overall
performance, as Figure 5.38 summarizes. This combination of positive
and negative effects is what makes the design of a memory hierarchy
interesting.

The decomposition of misses into the three Cs is a useful qualitative model. In
real cache designs, many of the design choices interact, and changing one cache
characteristic will often affect several components of the miss rate. Despite such
shortcomings, this model is a useful way to gain insight into the performance of
cache designs.

Which of the following statements (if any) are generally true?

1. Th ere is no way to reduce compulsory misses.

2. Fully associative caches have no confl ict misses.

3. In reducing misses, associativity is more important than capacity.

5.9 Using a Finite-State Machine to Control a
Simple Cache

We can now implement control for a cache, just as we implemented control for
the single-cycle and pipelined datapaths in Chapter 4. Th is section starts with a
defi nition of a simple cache and then a description of fi nite-state machines (FSMs).
It fi nishes with the FSM of a controller for this simple cache. Section 5.12 goes
into more depth, showing the cache and controller in a new hardware description
language.

A Simple Cache
We’re going to design a controller for a simple cache. Here are the key characteristics
of the cache:

■ Direct-mapped cache

■ Write-back using write allocate

■ Block size is 4 words (16 bytes or 128 bits)

■ Cache size is 16 KiB, so it holds 1024 blocks

■ 32-bit addresses

■ The cache includes a valid bit and dirty bit per block

| Design change | Effect on miss rate | Possible negative performance effect |
|---|---|---|
| Increases cache size | Decreases capacity misses | May increase access time |
| Increases associativity | Decreases miss rate due to conflict misses | May increase access time |
| Increases block size | Decreases miss rate for a wide range of block sizes due to spatial locality | Increases miss penalty. Very large block could increase miss rate |

FIGURE 5.38 Memory hierarchy design challenges.

From Section 5.3, we can now calculate the fields of an address for the cache:

■ Cache index is 10 bits

■ Block offset is 4 bits

■ Tag size is 32 − (10 + 4) or 18 bits
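
The field widths above can be checked with a few lines of Python. This is only a minimal sketch: the function names `cache_fields` and `split_address` are hypothetical, and the code assumes a direct-mapped cache whose size and block size are both powers of two, as in the simple cache described here.

```python
def cache_fields(address_bits, cache_bytes, block_bytes):
    """Return (tag_bits, index_bits, offset_bits) for a direct-mapped cache.

    Assumes cache_bytes and block_bytes are powers of two.
    """
    num_blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2 of the block size in bytes
    index_bits = num_blocks.bit_length() - 1     # log2 of the number of blocks
    tag_bits = address_bits - (index_bits + offset_bits)
    return tag_bits, index_bits, offset_bits


def split_address(address, index_bits, offset_bits):
    """Split a byte address into its (tag, index, block offset) fields."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# The simple cache of this section: 32-bit addresses, 16 KiB, 16-byte blocks.
tag_bits, index_bits, offset_bits = cache_fields(32, 16 * 1024, 16)
print(tag_bits, index_bits, offset_bits)

# An arbitrary example address, chosen only for illustration.
print(split_address(0x12345678, index_bits, offset_bits))
```

Recombining the three fields with shifts gives back the original address, which is a quick way to confirm that the widths add up to 32 bits.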

The signals between the processor and the cache are

■ 1-bit Read or Write signal

■ 1-bit Valid signal, saying whether there is a cache operation or not

■ 32-bit address

■ 32-bit data from processor to cache

■ 32-bit data from cache to processor

■ 1-bit Ready signal, saying the cache operation is complete

The interface between the memory and the cache has the same fields as between
the processor and the cache, except that the data fields are now 128 bits wide. The
extra memory width is generally found in microprocessors today, which deal with
either 32-bit or 64-bit words in the processor while the DRAM controller is often
128 bits. Making the cache block match the width of the DRAM simplified the
design. Here are the signals:

■ 1-bit Read or Write signal

■ 1-bit Valid signal, saying whether there is a memory operation or not

■ 32-bit address

■ 128-bit data from cache to memory

■ 128-bit data from memory to cache

■ 1-bit Ready signal, saying the memory operation is complete

Note that the interface to memory is not a fixed number of cycles. We assume a
memory controller that will notify the cache via the Ready signal when the memory
read or write is finished.

Before describing the cache controller, we need to review finite-state machines,
which allow us to control an operation that can take multiple clock cycles.


Finite-State Machines
To design the control unit for the single-cycle datapath, we used a set of truth tables
that specified the setting of the control signals based on the instruction class. For a
cache, the control is more complex because the operation can be a series of steps.
The control for a cache must specify both the signals to be set in any step and the
next step in the sequence.

The most common multistep control method is based on finite-state machines,
which are usually represented graphically. A finite-state machine consists of a set
of states and directions on how to change states. The directions are defined by a
next-state function, which maps the current state and the inputs to a new state.
When we use a finite-state machine for control, each state also specifies a set of
outputs that are asserted when the machine is in that state. The implementation
of a finite-state machine usually assumes that all outputs that are not explicitly
asserted are deasserted. Similarly, the correct operation of the datapath depends on
the fact that a signal that is not explicitly asserted is deasserted, rather than acting
as a don’t care.

Multiplexor controls are slightly different, since they select one of the inputs
whether they are 0 or 1. Thus, in the finite-state machine, we always specify the
setting of all the multiplexor controls that we care about. When we implement
the finite-state machine with logic, setting a control to 0 may be the default and
thus may not require any gates. A simple example of a finite-state machine appears
in Appendix B, and if you are unfamiliar with the concept of a finite-state machine,
you may want to examine Appendix B before proceeding.

A finite-state machine can be implemented with a temporary register that holds
the current state and a block of combinational logic that determines both the
datapath signals to be asserted and the next state. Figure 5.39 shows how such an
implementation might look. Appendix D describes in detail how the finite-state
machine is implemented using this structure. In Section B.3, the combinational
control logic for a finite-state machine is implemented with both a ROM
(read-only memory) and a PLA (programmable logic array). (Also see Appendix B
for a description of these logic elements.)

Elaboration: Note that this simple design is called a blocking cache, in that the
processor must wait until the cache has finished the request. Section 5.12 describes
the alternative, which is called a nonblocking cache.

Elaboration: The style of finite-state machine in this book is called a Moore machine,
after Edward Moore. Its identifying characteristic is that the output depends only on the
current state. For a Moore machine, the box labeled combinational control logic can be
split into two pieces. One piece has the control output and only the state input, while the
other has only the next-state output.

An alternative style of machine is a Mealy machine, named after George Mealy. The
Mealy machine allows both the input and the current state to be used to determine the
output. Moore machines have potential implementation advantages in speed and size
of the control unit. The speed advantages arise because the control outputs, which are
needed early in the clock cycle, do not depend on the inputs, but only on the current
state. In Appendix B, when the implementation of this finite-state machine is taken down
to logic gates, the size advantage can be clearly seen. The potential disadvantage of a
Moore machine is that it may require additional states. For example, in situations where
there is a one-state difference between two sequences of states, the Mealy machine
may unify the states by making the outputs depend on the inputs.

finite-state machine: A sequential logic function consisting of a set of inputs and
outputs, a next-state function that maps the current state and the inputs to a new state,
and an output function that maps the current state and possibly the inputs to a set of
asserted outputs.

next-state function: A combinational function that, given the inputs and the current
state, determines the next state of a finite-state machine.

[Figure 5.39 diagram: a block of combinational control logic takes the inputs from the
cache datapath and the current state from the state register; its outputs are the datapath
control outputs and the next state, which is loaded back into the state register.]

FIGURE 5.39 Finite-state machine controllers are typically implemented using a block of
combinational logic and a register to hold the current state. The outputs of the combinational
logic are the next-state number and the control signals to be asserted for the current state. The inputs to the
combinational logic are the current state and any inputs used to determine the next state. Notice that in the
finite-state machine used in this chapter, the outputs depend only on the current state, not on the inputs. The
Elaboration explains this in more detail.

FSM for a Simple Cache Controller
Figure 5.40 shows the four states of our simple cache controller:

■ Idle: This state waits for a valid read or write request from the processor,
which moves the FSM to the Compare Tag state.

■ Compare Tag: As the name suggests, this state tests to see if the requested read
or write is a hit or a miss. The index portion of the address selects the tag to
be compared. If the data in the cache block referred to by the index portion
of the address is valid, and the tag portion of the address matches the tag,
then it is a hit. Either the data is read from the selected word if it is a load or
written to the selected word if it is a store. The Cache Ready signal is then
set. If it is a write, the dirty bit is set to 1. Note that a write hit also sets the
valid bit and the tag field; while it seems unnecessary, it is included because
the tag is a single memory, so to change the dirty bit we also need to change
the valid and tag fields. If it is a hit and the block is valid, the FSM returns to
the Idle state. A miss first updates the cache tag and then goes either to the
Write-Back state, if the block at this location has dirty bit value of 1, or to the
Allocate state if it is 0.

■ Write-Back: This state writes the 128-bit block to memory using the address
composed from the tag and cache index. We remain in this state waiting for
the Ready signal from memory. When the memory write is complete, the
FSM goes to the Allocate state.

■ Allocate: The new block is fetched from memory. We remain in this state
waiting for the Ready signal from memory. When the memory read is
complete, the FSM goes to the Compare Tag state. Although we could
have gone to a new state to complete the operation instead of reusing the
Compare Tag state, there is a good deal of overlap, including the update of the
appropriate word in the block if the access was a write.
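
The four states can be exercised with a small software model. The following is a rough sketch rather than the controller of Section 5.12: the class name `CacheController`, the `access` method, and the dictionary used as memory are all hypothetical. It also makes two simplifying assumptions: waiting for the memory Ready signal is collapsed into a single step, and the new tag is installed when the block arrives in Allocate rather than in Compare Tag, so that the old tag is still available to form the write-back address.

```python
IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE = "Idle", "Compare Tag", "Write-Back", "Allocate"

INDEX_BITS, OFFSET_BITS, WORDS_PER_BLOCK = 10, 4, 4


class CacheController:
    """Software model of the four-state controller (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory                 # dict: block number -> list of 4 words
        n = 1 << INDEX_BITS
        self.valid = [False] * n
        self.dirty = [False] * n
        self.tag = [0] * n
        self.data = [[0] * WORDS_PER_BLOCK for _ in range(n)]

    def access(self, addr, write=False, value=None):
        """Serve one processor request; return (word, list of states visited)."""
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        word = (addr >> 2) & (WORDS_PER_BLOCK - 1)
        state = COMPARE_TAG                  # Idle saw a valid CPU request
        trace = [IDLE, COMPARE_TAG]
        while True:
            if state == COMPARE_TAG:
                if self.valid[index] and self.tag[index] == tag:
                    if write:                # write hit: update word, set dirty
                        self.data[index][word] = value
                        self.dirty[index] = True
                    trace.append(IDLE)       # Cache Ready, back to Idle
                    return self.data[index][word], trace
                old_is_dirty = self.valid[index] and self.dirty[index]
                state = WRITE_BACK if old_is_dirty else ALLOCATE
            elif state == WRITE_BACK:
                # Address of the old block is composed from its tag and the index.
                old_block = (self.tag[index] << INDEX_BITS) | index
                self.memory[old_block] = list(self.data[index])
                self.dirty[index] = False
                state = ALLOCATE
            else:                            # ALLOCATE: fetch the new block
                new_block = (tag << INDEX_BITS) | index
                self.data[index] = list(self.memory.get(new_block, [0] * WORDS_PER_BLOCK))
                self.tag[index], self.valid[index], self.dirty[index] = tag, True, False
                state = COMPARE_TAG
            trace.append(state)


memory = {}
cache = CacheController(memory)
print(cache.access(0x0000, write=True, value=7))   # clean miss, then write hit
print(cache.access(1 << 14))                       # same index, new tag: dirty miss
```

The second access maps to the same cache index as the first but has a different tag, so the trace passes through Write-Back before Allocate, which is the dirty-miss path of the state list above.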

[Figure 5.40 state diagram. Idle goes to Compare Tag on a valid CPU request. Compare Tag
(if Valid && Hit, set Valid, set Tag, and if Write set Dirty) returns to Idle on a cache hit,
marking Cache Ready; it goes to Allocate on a cache miss when the old block is clean, and to
Write-Back on a cache miss when the old block is dirty. Write-Back (write old block to
memory) stays in place while memory is not ready and goes to Allocate on Memory Ready.
Allocate (read new block from memory) stays in place while memory is not ready and goes
to Compare Tag on Memory Ready.]

FIGURE 5.40 Four states of the simple controller.

This simple model could easily be extended with more states to try to improve
performance. For example, the Compare Tag state does both the compare and the
read or write of the cache data in a single clock cycle. Often the compare and cache
access are done in separate states to try to improve the clock cycle time. Another
optimization would be to add a write buffer so that we could save the dirty block
and then read the new block first so that the processor doesn’t have to wait for two
memory accesses on a dirty miss. The cache would then write the dirty block from
the write buffer while the processor is operating on the requested data.

Section 5.12 goes into more detail about the FSM, showing the full controller
in a hardware description language and a block diagram of this simple cache.

5.10 Parallelism and Memory Hierarchy: Cache Coherence

Given that a multicore multiprocessor means multiple processors on a single chip,
these processors very likely share a common physical address space. Caching shared
data introduces a new problem, because the view of memory held by two different
processors is through their individual caches, which, without any additional
precautions, could end up seeing two different values. Figure 5.41 illustrates the
problem and shows how two different processors can have two different values
for the same location. This difficulty is generally referred to as the cache coherence
problem.
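
The scenario of Figure 5.41 can be replayed in a few lines. This is an illustrative sketch only: the dictionaries `cache_a` and `cache_b` and the helper functions are hypothetical, each cache is modeled as holding single values rather than blocks, and, as in the figure, the caches are write-through with no coherence mechanism at all.

```python
memory = {"X": 0}
cache_a, cache_b = {}, {}          # private caches, initially empty


def read(cache, loc):
    """Return the cached value, fetching from memory only on a miss."""
    if loc not in cache:
        cache[loc] = memory[loc]
    return cache[loc]


def write_through(cache, loc, value):
    """Update the local cache and memory; other caches are not told."""
    cache[loc] = value
    memory[loc] = value


read(cache_a, "X")                 # time step 1: CPU A reads X
read(cache_b, "X")                 # time step 2: CPU B reads X
write_through(cache_a, "X", 1)     # time step 3: CPU A stores 1 into X
stale = read(cache_b, "X")         # CPU B hits on its old copy

print("A sees", cache_a["X"], "B sees", stale, "memory holds", memory["X"])
```

CPU B keeps hitting on its own copy, so nothing ever prompts it to look at memory again; that is the gap a coherence protocol has to close.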

Informally, we could say that a memory system is coherent if any read of a data
item returns the most recently written value of that data item. This definition,
although intuitively appealing, is vague and simplistic; the reality is much more
complex. This simple definition contains two different aspects of memory system
behavior, both of which are critical to writing correct shared memory programs.
The first aspect, called coherence, defines what values can be returned by a read. The
second aspect, called consistency, determines when a written value will be returned
by a read.

Let’s look at coherence first. A memory system is coherent if

1. A read by a processor P to a location X that follows a write by P to X, with no
writes of X by another processor occurring between the write and the read
by P, always returns the value written by P. Thus, in Figure 5.41, if CPU A
were to read X after time step 3, it should see the value 1.

2. A read by a processor to location X that follows a write by another processor
to X returns the written value if the read and write are sufficiently separated
in time and no other writes to X occur between the two accesses. Thus, in
Figure 5.41, we need a mechanism so that the value 0 in the cache of CPU B
is replaced by the value 1 after CPU A stores 1 into memory at address X in
time step 3.

3. Writes to the same location are serialized; that is, two writes to the same
location by any two processors are seen in the same order by all processors.
For example, if CPU B stores 2 into memory at address X after time step 3,
processors can never read the value at location X as 2 and then later read
it as 1.

The first property simply preserves program order; we certainly expect this
property to be true in uniprocessors, for example. The second property defines
the notion of what it means to have a coherent view of memory: if a processor
could continuously read an old data value, we would clearly say that memory was
incoherent.

The need for write serialization is more subtle, but equally important. Suppose
we did not serialize writes, and processor P1 writes location X followed by P2
writing location X. Serializing the writes ensures that every processor will see the
write done by P2 at some point. If we did not serialize the writes, it might be the
case that some processor could see the write of P2 first and then see the write of P1,
maintaining the value written by P1 indefinitely. The simplest way to avoid such
difficulties is to ensure that all writes to the same location are seen in the same
order, which we call write serialization.

Basic Schemes for Enforcing Coherence
In a cache coherent multiprocessor, the caches provide both migration and
replication of shared data items:

■ Migration: A data item can be moved to a local cache and used there in a
transparent fashion. Migration reduces both the latency to access a shared
data item that is allocated remotely and the bandwidth demand on the shared
memory.

■ Replication: When shared data are being simultaneously read, the caches
make a copy of the data item in the local cache. Replication reduces both
latency of access and contention for a read shared data item.

| Time step | Event | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X |
|---|---|---|---|---|
| 0 | | | | 0 |
| 1 | CPU A reads X | 0 | | 0 |
| 2 | CPU B reads X | 0 | 0 | 0 |
| 3 | CPU A stores 1 into X | 1 | 0 | 1 |

FIGURE 5.41 The cache coherence problem for a single memory location (X), read and
written by two processors (A and B). We initially assume that neither cache contains the variable and
that X has the value 0. We also assume a write-through cache; a write-back cache adds some additional but
similar complications. After the value of X has been written by A, A’s cache and the memory both contain the
new value, but B’s cache does not, and if B reads the value of X, it will receive 0!

Supporting migration and replication is critical to performance in accessing
shared data, so many multiprocessors introduce a hardware protocol to maintain
coherent caches. The protocols to maintain coherence for multiple processors are
called cache coherence protocols. Key to implementing a cache coherence protocol
is tracking the state of any sharing of a data block.

The most popular cache coherence protocol is snooping. Every cache that has a
copy of the data from a block of physical memory also has a copy of the sharing
status of the block, but no centralized state is kept. The caches are all accessible via
some broadcast medium (a bus or network), and all cache controllers monitor or
snoop on the medium to determine whether or not they have a copy of a block that
is requested on a bus or switch access.

In the following section we explain snooping-based cache coherence as
implemented with a shared bus, but any communication medium that broadcasts
cache misses to all processors can be used to implement a snooping-based
coherence scheme. This broadcasting to all caches makes snooping protocols
simple to implement but also limits their scalability.

Snooping Protocols
One method of enforcing coherence is to ensure that a processor has exclusive
access to a data item before it writes that item. This style of protocol is called a write
invalidate protocol because it invalidates copies in other caches on a write. Exclusive
access ensures that no other readable or writable copies of an item exist when the
write occurs: all other cached copies of the item are invalidated.

Figure 5.42 shows an example of an invalidation protocol for a snooping bus
with write-back caches in action. To see how this protocol ensures coherence,
consider a write followed by a read by another processor: since the write requires
exclusive access, any copy held by the reading processor must be invalidated (hence
the protocol name). Thus, when the read occurs, it misses in the cache, and the
cache is forced to fetch a new copy of the data. For a write, we require that the
writing processor have exclusive access, preventing any other processor from being
able to write simultaneously. If two processors do attempt to write the same data
simultaneously, one of them wins the race, causing the other processor’s copy to be
invalidated. For the other processor to complete its write, it must obtain a new copy
of the data, which must now contain the updated value. Therefore, this protocol
also enforces write serialization.
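
A toy model makes the sequence of Figure 5.42 concrete. It is a sketch under several assumptions, not a real protocol implementation: the classes `Bus` and `SnoopingCache` are hypothetical, a block is reduced to a single value, each cached copy is either "shared" or "exclusive", and memory is updated whenever an exclusive block becomes shared, which is the simplification the figure itself adopts.

```python
class Bus:
    """Shared broadcast medium that every cache controller can observe."""

    def __init__(self, memory):
        self.memory = memory          # dict: location -> value
        self.caches = []
        self.log = []                 # one entry per bus transaction


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}               # location -> [value, "shared" or "exclusive"]
        bus.caches.append(self)

    def read(self, loc):
        if loc not in self.lines:
            self.bus.log.append(f"Cache miss for {loc}")
            for other in self.bus.caches:
                line = other.lines.get(loc)
                if line is not None and line[1] == "exclusive":
                    # The owner supplies the value; memory is updated as well.
                    self.bus.memory[loc] = line[0]
                    line[1] = "shared"
            self.lines[loc] = [self.bus.memory[loc], "shared"]
        return self.lines[loc][0]

    def write(self, loc, value):
        line = self.lines.get(loc)
        if line is None or line[1] != "exclusive":
            # Gain exclusive access by invalidating every other copy.
            self.bus.log.append(f"Invalidation for {loc}")
            for other in self.bus.caches:
                if other is not self and loc in other.lines:
                    old_value, old_state = other.lines.pop(loc)
                    if old_state == "exclusive":
                        self.bus.memory[loc] = old_value
        self.lines[loc] = [value, "exclusive"]


memory = {"X": 0}
bus = Bus(memory)
cpu_a, cpu_b = SnoopingCache(bus), SnoopingCache(bus)

cpu_a.read("X")
cpu_b.read("X")
cpu_a.write("X", 1)
memory_after_write = memory["X"]              # write-back: memory not yet updated
b_has_copy_after_write = "X" in cpu_b.lines   # B's copy was invalidated
value_seen_by_b = cpu_b.read("X")             # B misses and gets the new value

print(bus.log)
print(value_seen_by_b, memory["X"])
```

Because CPU B's copy is removed by the invalidation, its next read must go to the bus, where CPU A's exclusive copy supplies the up-to-date value.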

| Processor activity | Bus activity | Contents of CPU A’s cache | Contents of CPU B’s cache | Contents of memory location X |
|---|---|---|---|---|
| CPU A reads X | Cache miss for X | 0 | | 0 |
| CPU B reads X | Cache miss for X | 0 | 0 | 0 |
| CPU A writes a 1 to X | Invalidation for X | 1 | | 0 |
| CPU B reads X | Cache miss for X | 1 | 1 | 1 |

FIGURE 5.42 An example of an invalidation protocol working on a snooping bus for a
single cache block (X) with write-back caches. We assume that neither cache initially holds X
and that the value of X in memory is 0. The CPU and memory contents show the value after the processor
and bus activity have both completed. A blank indicates no activity or no copy cached. When the second
miss by B occurs, CPU A responds with the value, canceling the response from memory. In addition, both
the contents of B’s cache and the memory contents of X are updated. This update of memory, which occurs
when a block becomes shared, simplifies the protocol, but it is possible to track the ownership and force the
write-back only if the block is replaced. This requires the introduction of an additional state called “owner,”
which indicates that a block may be shared, but the owning processor is responsible for updating any other
processors and memory when it changes the block or replaces it.

Hardware/Software Interface

One insight is that block size plays an important role in cache coherency. For
example, take the case of snooping on a cache with a block size of eight words,
with a single word alternatively written and read by two processors. Most protocols
exchange full blocks between processors, thereby increasing coherency bandwidth
demands.

Large blocks can also cause what is called false sharing: when two unrelated
shared variables are located in the same cache block, the full block is exchanged
between processors even though the processors are accessing different variables.
Programmers and compilers should lay out data carefully to avoid false sharing.

false sharing: When two unrelated shared variables are located in the same
cache block and the full block is exchanged between processors even
though the processors are accessing different variables.

Elaboration: Although the three coherence properties listed earlier in this section are
sufficient to ensure coherence, the question of when a written value will be seen is also
important. To see why, observe that we cannot require that a read of X in Figure 5.41
instantaneously sees the value written for X by some other processor. If, for example, a
write of X on one processor precedes a read of X on another processor by a very short
time, it may be impossible to ensure that the read returns the value of the data written,
since the written data may not even have left the processor at that point. The issue of
exactly when a written value must be seen by a reader is defined by a memory
consistency model.

We make the following two assumptions. First, a write does not complete (and allow
the next write to occur) until all processors have seen the effect of that write. Second,
the processor does not change the order of any write with respect to any other memory
access. These two conditions mean that if a processor writes location X followed by
location Y, any processor that sees the new value of Y must also see the new value of
X. These restrictions allow the processor to reorder reads, but force the processor to
finish a write in program order.

Elaboration: Since input can change memory behind the caches and since output
could need the latest value in a write-back cache, there is also a cache coherency
problem for I/O with the caches of a single processor as well as just between caches
of multiple processors. The cache coherence problem for multiprocessors and I/O
(see Chapter 6), although similar in origin, has different characteristics that affect the
appropriate solution. Unlike I/O, where multiple data copies are a rare event (one to
be avoided whenever possible), a program running on multiple processors will normally
have copies of the same data in several caches.

Elaboration: In addition to the snooping cache coherence protocol where the status
of shared blocks is distributed, a directory-based cache coherence protocol keeps the
sharing status of a block of physical memory in just one location, called the directory.
Directory-based coherence has slightly higher implementation overhead than snooping,
but it can reduce traffic between caches and thus scale to larger processor counts.

5.11 Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks

This online section describes how using many disks in conjunction can offer much
higher throughput, which was the original inspiration of Redundant Arrays of
Inexpensive Disks (RAID). The real popularity of RAID, however, was due more to
the much greater dependability offered by including a modest number of redundant
disks. The section explains the differences in performance, cost, and dependability
between the different RAID levels.

5.12 Advanced Material: Implementing Cache Controllers

This online section shows how to implement control for a cache, just as we
implemented control for the single-cycle and pipelined datapaths in Chapter 4.
This section starts with a description of finite-state machines and the implementation
of a cache controller for a simple data cache, including a description of the cache
controller in a hardware description language. It then goes into details of an example
cache coherence protocol and the difficulties in implementing such a protocol.


5.13 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

In this section, we will look at the memory hierarchy of the same two microprocessors
described in Chapter 4: the ARM Cortex-A8 and Intel Core i7. This section is based
on Section 2.6 of Computer Architecture: A Quantitative Approach, 5th edition.

Figure 5.43 summarizes the address sizes and TLBs of the two processors. Note
that the A8 has two TLBs with a 32-bit virtual address space and a 32-bit physical
address space. The Core i7 has three TLBs with a 48-bit virtual address and a 44-bit
physical address. Although the 64-bit registers of the Core i7 could hold a larger
virtual address, there was no software need for such a large space, and 48-bit virtual
addresses shrink both the page table memory footprint and the TLB hardware.

| Characteristic | ARM Cortex-A8 | Intel Core i7 |
|---|---|---|
| Virtual address | 32 bits | 48 bits |
| Physical address | 32 bits | 44 bits |
| Page size | Variable: 4, 16, 64 KiB, 1, 16 MiB | Variable: 4 KiB, 2/4 MiB |
| TLB organization | 1 TLB for instructions and 1 TLB for data. Both TLBs are fully associative, with 32 entries, round robin replacement. TLB misses handled in hardware. | 1 TLB for instructions and 1 TLB for data per core. Both L1 TLBs are four-way set associative, LRU replacement. L1 I-TLB has 128 entries for small pages, 7 per thread for large pages. L1 D-TLB has 64 entries for small pages, 32 for large pages. The L2 TLB is four-way set associative, LRU replacement. The L2 TLB has 512 entries. TLB misses handled in hardware. |

FIGURE 5.43 Address translation and TLB hardware for the ARM Cortex-A8 and Intel
Core i7 920. Both processors provide support for large pages, which are used for things like the operating
system or mapping a frame buffer. The large-page scheme avoids using a large number of entries to map a
single object that is always present.

Figure 5.44 shows their caches. Keep in mind that the A8 has just one processor
or core while the Core i7 has four. Both have identically organized 32 KiB, 4-way
set associative, L1 instruction caches (per core) with 64 byte blocks. The A8 uses the
same design for data cache, while the Core i7 keeps everything the same except the
associativity, which it increases to 8-way. Both use an 8-way set associative unified
L2 cache (per core) with 64 byte blocks, although the A8 varies in size from 128 KiB
to 1 MiB while the Core i7 is fixed at 256 KiB. As the Core i7 is used for servers, it
also offers an L3 cache shared by all the cores on the chip. Its size varies depending
on the number of cores. With four cores, as in this case, the size is 8 MiB.

A significant challenge facing cache designers is to support processors like the
A8 and the Core i7 that can execute more than one memory instruction per clock
cycle. A popular technique is to break the cache into banks and allow multiple,
independent, parallel accesses, provided the accesses are to different banks. The
technique is similar to interleaved DRAM banks (see Section 5.2).

The Core i7 has additional optimizations that allow it to reduce the miss
penalty. The first of these is the return of the requested word first on a miss. It also
continues to execute instructions that access the data cache during a cache miss.
Designers who are attempting to hide the cache miss latency commonly use this
technique, called a nonblocking cache, when building out-of-order processors.
They implement two flavors of nonblocking. Hit under miss allows additional cache
hits during a miss, while miss under miss allows multiple outstanding cache misses.
The aim of the first of these two is hiding some miss latency with other work, while
the aim of the second is overlapping the latency of two different misses.

Overlapping a large fraction of miss times for multiple outstanding misses
requires a high-bandwidth memory system capable of handling multiple misses in
parallel. In a personal mobile device, the memory may only be able to take limited
advantage of this capability, but large servers and multiprocessors often have
memory systems capable of handling more than one outstanding miss in parallel.

nonblocking cache: A cache that allows the processor to make references to the cache
while the cache is handling an earlier miss.

| Characteristic | ARM Cortex-A8 | Intel Nehalem |
|---|---|---|
| L1 cache organization | Split instruction and data caches | Split instruction and data caches |
| L1 cache size | 32 KiB each for instructions/data | 32 KiB each for instructions/data per core |
| L1 cache associativity | 4-way (I), 4-way (D) set associative | 4-way (I), 8-way (D) set associative |
| L1 replacement | Random | Approximated LRU |
| L1 block size | 64 bytes | 64 bytes |
| L1 write policy | Write-back, Write-allocate(?) | Write-back, No-write-allocate |
| L1 hit time (load-use) | 1 clock cycle | 4 clock cycles, pipelined |
| L2 cache organization | Unified (instruction and data) | Unified (instruction and data) per core |
| L2 cache size | 128 KiB to 1 MiB | 256 KiB (0.25 MiB) |
| L2 cache associativity | 8-way set associative | 8-way set associative |
| L2 replacement | Random(?) | Approximated LRU |
| L2 block size | 64 bytes | 64 bytes |
| L2 write policy | Write-back, Write-allocate (?) | Write-back, Write-allocate |
| L2 hit time | 11 clock cycles | 10 clock cycles |
| L3 cache organization | n/a | Unified (instruction and data) |
| L3 cache size | n/a | 8 MiB, shared |
| L3 cache associativity | n/a | 16-way set associative |
| L3 replacement | n/a | Approximated LRU |
| L3 block size | n/a | 64 bytes |
| L3 write policy | n/a | Write-back, Write-allocate |
| L3 hit time | n/a | 35 clock cycles |

FIGURE 5.44 Caches in the ARM Cortex-A8 and Intel Core i7 920.

The Core i7 has a prefetch mechanism for data accesses. It looks at a pattern
of data misses and uses this information to try to predict the next address to start
fetching the data before the miss occurs. Such techniques generally work best when
accessing arrays in loops.

The sophisticated memory hierarchies of these chips and the large fraction of
the dies dedicated to caches and TLBs show the significant design effort expended
to try to close the gap between processor cycle times and memory latency.

Performance of the A8 and Core i7 Memory Hierarchies
The memory hierarchy of the Cortex-A8 was simulated with a 1 MiB eight-way
set associative L2 cache using the integer Minnespec benchmarks. As mentioned
in Chapter 4, Minnespec is a set of benchmarks consisting of the SPEC2000
benchmarks but with different inputs that reduce the running times by several
orders of magnitude. Although the use of smaller inputs does not change the
instruction mix, it does affect the cache behavior. For example, on mcf, the most
memory-intensive SPEC2000 integer benchmark, Minnespec has a miss rate for a
32 KiB cache that is only 65% of the miss rate for the full SPEC2000 version. For
a 1 MiB cache the difference is a factor of six! For this reason, one cannot compare
the Minnespec benchmarks against the SPEC2000 benchmarks, much less the even
larger SPEC2006 benchmarks used for the Core i7 in Figure 5.47. Instead, the data
are useful for looking at the relative impact of L1 and L2 misses on overall CPI,
which we used in Chapter 4.

The A8 instruction cache miss rates for these benchmarks (and also for the
full SPEC2000 versions on which Minnespec is based) are very small even for
just the L1: close to zero for most and under 1% for all of them. This low rate
probably results from the computationally intensive nature of the SPEC programs
and the four-way set associative cache that eliminates most conflict misses. Figure
5.45 shows the data cache results for the A8, which have significant L1 and L2
miss rates. The L1 miss penalty for a 1 GHz Cortex-A8 is 11 clock cycles, while
the L2 miss penalty is assumed to be 60 clock cycles. Using these miss penalties,
Figure 5.46 shows the average miss penalty per data access.
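
As a rough illustration of how such per-access penalties can be derived, the sketch below multiplies each miss rate by its penalty. The 11-cycle and 60-cycle penalties come from the text; the function name and the two example miss rates are hypothetical and are not data from the figures. The L2 rate is assumed to be a global miss rate (misses per data reference), matching the way Figure 5.45 reports it.

```python
L1_MISS_PENALTY = 11   # clock cycles for a 1 GHz Cortex-A8, from the text
L2_MISS_PENALTY = 60   # clock cycles, the value assumed in the text


def miss_penalty_per_access(l1_miss_rate, l2_global_miss_rate):
    """Average penalty cycles per data reference contributed by L1 and by L2.

    Both rates are fractions of all data references (global miss rates).
    """
    l1_part = l1_miss_rate * L1_MISS_PENALTY
    l2_part = l2_global_miss_rate * L2_MISS_PENALTY
    return l1_part, l2_part


# Hypothetical rates chosen for illustration only: 5% L1 misses, 1% global L2 misses.
l1_part, l2_part = miss_penalty_per_access(0.05, 0.01)
print(f"L1 contribution: {l1_part:.2f} cycles, L2 contribution: {l2_part:.2f} cycles")
```

Even with an L2 miss rate one-fifth of the L1 rate in this made-up case, the two contributions are comparable, because the L2 penalty is more than five times larger.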

Figure 5.47 shows the miss rates for the caches of the Core i7 using the SPEC2006
benchmarks. The L1 instruction cache miss rate varies from 0.1% to 1.8%,
averaging just over 0.4%. This rate is in keeping with other studies of instruction
cache behavior for the SPECCPU2006 benchmarks, which show low instruction
cache miss rates. With L1 data cache miss rates running 5% to 10%, and sometimes
higher, the importance of the L2 and L3 caches should be obvious. Since the cost
for a miss to memory is over 100 cycles and the average data miss rate in L2 is 4%,
L3 is obviously critical. Assuming about half the instructions are loads or stores,
without L3 the L2 cache misses could add two cycles per instruction to the CPI! In
comparison, the average L3 data miss rate of 1% is still significant but four times
lower than the L2 miss rate and six times less than the L1 miss rate.
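
The two-cycle figure quoted above is the product of the three quantities the text gives. As a rough check, assuming the 4% L2 miss rate is measured per data access and taking the cost of a miss to memory as exactly 100 cycles (the text says "over 100", so this is a lower bound):

```latex
\text{L2 miss stall cycles per instruction}
  = \underbrace{0.5}_{\text{data accesses per instruction}}
    \times \underbrace{0.04}_{\text{L2 data miss rate}}
    \times \underbrace{100}_{\text{cycles per miss to memory}}
  = 2
```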

[Figure 5.45 bar chart. Vertical axis: Miss Rate, 0.0% to 25.0%. Horizontal axis:
twolf, bzip2, gzip, parser, gap, perlbmk, gcc, crafty, vpr, vortex, eon, mcf.
Series: L1 Data Miss Rate, L2 Data Miss Rate.]

FIGURE 5.45 Data cache miss rates for ARM Cortex-A8 when running Minnespec, a small
version of SPEC2000. Applications with larger memory footprints tend to have higher miss rates in both
L1 and L2. Note that the L2 rate is the global miss rate; that is, counting all references, including those that hit
in L1. (See Elaboration in Section 5.4.) Mcf is known as a cache buster. Note that this figure is for the same
systems and benchmarks as Figure 4.76 in Chapter 4.

[Figure 5.46 bar chart. Vertical axis: Miss penalty per data reference (tick labels
0.5 to 4.5). Horizontal axis: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap,
vortex, bzip2. Series: L1 data average memory penalty, L2 data average memory penalty.]

FIGURE 5.46 The average memory access penalty in clock cycles per data memory
reference coming from L1 and L2 is shown for the ARM processor when running Minnespec.
Although the miss rates for L1 are significantly higher, the L2 miss penalty, which is more than five times
higher, means that the L2 misses can contribute significantly.

[Figure 5.47 bar chart. Vertical axis: miss rate, up to 25%. Horizontal axis:
libquantum, h264ref, hmmer, perlbench, bzip2, xalancbmk, sjeng, gobmk, astar, gcc,
omnetpp, mcf. Series: L1 Data Miss Rate, L2 Data Miss Rate, L3 Data Miss Rate.]

FIGURE 5.47 The L1, L2, and L3 data cache miss rates for the Intel Core i7 920 running
the full integer SPECCPU2006 benchmarks.

Elaboration: Because speculation may sometimes be wrong (see Chapter 4), there
are references to the L1 data cache that do not correspond to loads or stores that
eventually complete execution. The data in Figure 5.45 is measured against all data
requests including some that are cancelled. The miss rate when measured against only
completed data accesses is 1.6 times higher (an average of 9.5% versus 5.9% for L1
Dcache misses).

5.14 Going Faster: Cache Blocking and Matrix Multiply

Our next step in the continuing saga of improving performance of DGEMM by
tailoring it to the underlying hardware is to add cache blocking to the subword
parallelism and instruction level parallelism optimizations of Chapters 3 and 4.
Figure 5.48 shows the blocked version of DGEMM from Figure 4.80. The changes
are the same as were made earlier in going from unoptimized DGEMM in Figure
3.21 to blocked DGEMM in Figure 5.21 above. This time we take the unrolled
version of DGEMM from Chapter 4 and invoke it many times on the submatrices
of A, B, and C. Indeed, lines 28–34 and lines 7–8 in Figure 5.48 are identical to
lines 14–20 and lines 5–6 in Figure 5.21, with the exception of incrementing the
for loop in line 7 by the amount unrolled.

 1 #include <x86intrin.h>
 2 #define UNROLL (4)
 3 #define BLOCKSIZE 32
 4 void do_block (int n, int si, int sj, int sk,
 5                double *A, double *B, double *C)
 6 {
 7   for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )
 8     for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
 9       __m256d c[4];
10       for ( int x = 0; x < UNROLL; x++ )
11         c[x] = _mm256_load_pd(C+i+x*4+j*n);
12         /* c[x] = C[i][j] */
13       for( int k = sk; k < sk+BLOCKSIZE; k++ )
14       {
15         __m256d b = _mm256_broadcast_sd(B+k+j*n);
16         /* b = B[k][j] */
17         for (int x = 0; x < UNROLL; x++)
18           c[x] = _mm256_add_pd(c[x], /* c[x]+=A[i][k]*b */
19                  _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
20       }
21
22       for ( int x = 0; x < UNROLL; x++ )
23         _mm256_store_pd(C+i+x*4+j*n, c[x]);
24         /* C[i][j] = c[x] */
25     }
26 }
27
28 void dgemm (int n, double* A, double* B, double* C)
29 {
30   for ( int sj = 0; sj < n; sj += BLOCKSIZE )
31     for ( int si = 0; si < n; si += BLOCKSIZE )
32       for ( int sk = 0; sk < n; sk += BLOCKSIZE )
33         do_block(n, si, sj, sk, A, B, C);
34 }

FIGURE 5.48 Optimized C version of DGEMM from Figure 4.80 using cache blocking. These
changes are the same ones found in Figure 5.21. The assembly language produced by the compiler for the
do_block function is nearly identical to Figure 4.81. Once again, there is no overhead to call the do_block
because the compiler inlines the function call.

Unlike the earlier chapters, we do not show the resulting x86 code because the
inner loop code is nearly identical to Figure 4.81, as the blocking does not affect
the computation, just the order that it accesses data in memory. What does change
is the set of bookkeeping integer instructions that implement the for loops. It
expands from 14 instructions before the inner loop and 8 after the loop for Figure
4.80 to 40 and 28 instructions respectively for the bookkeeping code generated for
Figure 5.48.
Nevertheless, the extra instructions executed pale in comparison to the
performance improvement of reducing cache misses. Figure 5.49 compares the
unoptimized code with versions optimized for subword parallelism, instruction
level parallelism, and caches. Blocking improves performance over unrolled AVX
code by factors of 2 to 2.5 for the larger matrices. When we compare unoptimized
code to the code with all three optimizations, the performance improvement is
factors of 8 to 15, with the largest increase for the largest matrix.

Performance in GFLOPS:

Version                    32x32   160x160   480x480   960x960
Unoptimized                  1.7       1.5       1.3       0.8
AVX                          6.4       3.5       2.3       2.5
AVX + unroll                14.6       6.6       4.7       5.1
AVX + unroll + blocked      13.6      12.7      11.7      12.0

FIGURE 5.49 Performance of four versions of DGEMM from matrix dimensions 32x32 to
960x960. The fully optimized code for the largest matrix is almost 15 times as fast as the unoptimized
version in Figure 3.21 in Chapter 3.

Elaboration: As mentioned in the Elaboration in Section 3.8, these results are with
Turbo mode turned off. As in Chapters 3 and 4, when we turn it on we improve all the
results by the temporary increase in the clock rate of 3.3/2.6 = 1.27. Turbo mode works
particularly well in this case because it is using only a single core of an eight-core
chip. However, if we want to run fast we should use all cores, which we’ll see in
Chapter 6.

5.15 Fallacies and Pitfalls

As one of the most naturally quantitative aspects of computer architecture, the
memory hierarchy would seem to be less vulnerable to fallacies and pitfalls. Not
only have there been many fallacies propagated and pitfalls encountered, but some
have led to major negative outcomes. We start with a pitfall that often traps
students in exercises and exams.

Pitfall: Ignoring memory system behavior when writing programs or when generating code
in a compiler.
This could easily be rewritten as a fallacy: “Programmers can ignore memory
hierarchies in writing code.” The evaluation of sort in Figure 5.19 and of cache
blocking in Section 5.14 demonstrates that programmers can easily double
performance if they factor the behavior of the memory system into the design of
their algorithms.

Pitfall: Forgetting to account for byte addressing or the cache block size in simulating
a cache.

When simulating a cache (by hand or by computer), we need to make sure we
account for the effect of byte addressing and multiword blocks in determining into
which cache block a given address maps. For example, if we have a 32-byte
direct-mapped cache with a block size of 4 bytes, the byte address 36 maps into
block 1 of the cache, since byte address 36 is block address 9 and (9 modulo 8) = 1.
On the other hand, if address 36 is a word address, then it maps into block
(36 mod 8) = 4. Make sure the problem clearly states the base of the address.

In like fashion, we must account for the block size. Suppose we have a cache with
256 bytes and a block size of 32 bytes. Into which block does the byte address 300
fall? If we break the address 300 into fields, we can see the answer:

Bit position:  31 30 29 . . . 11 10  9  8 |  7  6  5 |  4  3  2  1  0
Bit value:      0  0  0 . . .  0  0  0  1 |  0  0  1 |  0  1  1  0  0

Bits 31–5: Block address (its low-order bits 7–5 are the cache block number)
Bits 4–0: Block offset

Byte address 300 is block address $\lfloor 300/32 \rfloor = 9$.

The number of blocks in the cache is $\lfloor 256/32 \rfloor = 8$.

Block number 9 falls into cache block number (9 modulo 8) = 1.

This mistake catches many people, including the authors (in earlier drafts) and
instructors who forget whether they intended the addresses to be in words, bytes,
or block numbers. Remember this pitfall when you tackle the exercises.

Pitfall: Having less set associativity for a shared cache than the number of cores or
threads sharing that cache.
Without extra care, a parallel program running on 2^n processors or threads can
easily allocate data structures to addresses that would map to the same set of a
shared L2 cache. If the cache is at least 2^n-way associative, then these accidental
conflicts are hidden by the hardware from the program. If not, programmers could
face apparently mysterious performance bugs (actually due to L2 conflict misses)
when migrating from, say, a 16-core design to a 32-core design if both use 16-way
associative L2 caches.

Pitfall: Using average memory access time to evaluate the memory hierarchy of an
out-of-order processor.

If a processor stalls during a cache miss, then you can separately calculate the
memory-stall time and the processor execution time, and hence evaluate the
memory hierarchy independently using average memory access time (see page 399).
If the processor continues to execute instructions, and may even sustain more
cache misses during a cache miss, then the only accurate assessment of the
memory hierarchy is to simulate the out-of-order processor along with the memory
hierarchy.

Pitfall: Extending an address space by adding segments on top of an unsegmented
address space.

During the 1970s, many programs grew so large that not all the code and data
could be addressed with just a 16-bit address. Computers were then revised to offer
32-bit addresses, either through an unsegmented 32-bit address space (also called a
flat address space) or by adding 16 bits of segment to the existing 16-bit address.
From a marketing point of view, adding segments that were programmer-visible
and that forced the programmer and compiler to decompose programs into
segments could solve the addressing problem. Unfortunately, there is trouble any
time a programming language wants an address that is larger than one segment,
such as indices for large arrays, unrestricted pointers, or reference parameters.
Moreover, adding segments can turn every address into two words (one for the
segment number and one for the segment offset), causing problems in the use of
addresses in registers.

Fallacy: Disk failure rates in the field match their specifications.

Two recent studies evaluated large collections of disks to compare results in the
field with specifications. One study was of almost 100,000 disks that had quoted
MTTF of 1,000,000 to 1,500,000 hours, or AFR of 0.6% to 0.8%. They found AFRs of
2% to 4% to be common, often three to five times higher than the specified rates
[Schroeder and Gibson, 2007]. A second study of more than 100,000 disks at Google,
which had a quoted AFR of about 1.5%, saw failure rates of 1.7% for drives in their
first year rise to 8.6% for drives in their third year, or about five to six times the
specified rate [Pinheiro, Weber, and Barroso, 2007].

Fallacy: Operating systems are the best place to schedule disk accesses.

As mentioned in Section 5.2, higher-level disk interfaces offer logical block
addresses to the host operating system. Given this high-level abstraction, the best
an OS can do to try to help performance is to sort the logical block addresses into
increasing order. However, since the disk knows the actual mapping of the logical
addresses onto the physical geometry of sectors, tracks, and surfaces, it can reduce
the rotational and seek latencies by rescheduling.

For example, suppose the workload is four reads [Anderson, 2003]:

Operation   Starting LBA   Length
Read                 724        8
Read                 100       16
Read                9987        1
Read                  26      128

The host might reorder the four reads into logical block order:

Operation   Starting LBA   Length
Read                  26      128
Read                 100       16
Read                 724        8
Read                9987        1

Depending on the relative location of the data on the disk, reordering could make
it worse, as Figure 5.50 shows.
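
The host-side reordering described above is just a sort on the starting logical block address, since that is all the OS can see. A minimal C sketch, in which the `disk_read` structure and the `host_order` function are hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* One queued request: starting logical block address and length in blocks. */
struct disk_read {
    unsigned long start_lba;
    unsigned long length;
};

/* qsort comparator: ascending by starting LBA. */
static int by_start_lba(const void *a, const void *b)
{
    const struct disk_read *x = a;
    const struct disk_read *y = b;
    return (x->start_lba > y->start_lba) - (x->start_lba < y->start_lba);
}

/*
 * Reorder a queue the only way a host can: by logical address.
 * The host has no view of sectors, tracks, or surfaces, so this order
 * may or may not match the best physical order on the disk.
 */
void host_order(struct disk_read *queue, size_t count)
{
    assert(queue != NULL);
    qsort(queue, count, sizeof queue[0], by_start_lba);
}
```

Applied to the four reads in the example (724, 100, 9987, 26), this produces the logical block order 26, 100, 724, 9987 shown in the second table. A drive-side scheduler would instead order requests by physical position, which this sketch does not attempt to model.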
The disk-scheduled reads complete in three-quarters of a disk revolution, but the
OS-scheduled reads take three revolutions.

[Figure 5.50 diagram: disk surface marking the locations of LBAs 724, 100, 26, and
9987, with the host-ordered queue and the drive-ordered queue.]

FIGURE 5.50 Example showing OS versus disk schedule accesses, labeled host-ordered
versus drive-ordered. The former takes three revolutions to complete the four reads, while the latter
completes them in just three-fourths of a revolution (from Anderson [2003]).

Pitfall: Implementing a virtual machine monitor on an instruction set architecture that
wasn’t designed to be virtualizable.

Many architects in the 1970s and 1980s weren’t careful to make sure that all
instructions reading or writing information related to hardware resource
information were privileged. This laissez-faire attitude causes problems for VMMs
for all of these architectures, including the x86, which we use here as an example.

Figure 5.51 describes the 18 instructions that cause problems for virtualization
[Robin and Irvine, 2000]. The two broad classes are instructions that

■ Read control registers in user mode, which reveals that the guest operating
system is running in a virtual machine (such as POPF, mentioned earlier)

■ Check protection as required by the segmented architecture but assume that
the operating system is running at the highest privilege level

Problem category: Access sensitive registers without trapping when running in user mode
Problem x86 instructions:
  Store global descriptor table register (SGDT)
  Store local descriptor table register (SLDT)
  Store interrupt descriptor table register (SIDT)
  Store machine status word (SMSW)
  Push flags (PUSHF, PUSHFD)
  Pop flags (POPF, POPFD)

Problem category: When accessing virtual memory mechanisms in user mode, instructions
fail the x86 protection checks
Problem x86 instructions:
  Load access rights from segment descriptor (LAR)
  Load segment limit from segment descriptor (LSL)
  Verify if segment descriptor is readable (VERR)
  Verify if segment descriptor is writable (VERW)
  Pop to segment register (POP CS, POP SS, . . .)
  Push segment register (PUSH CS, PUSH SS, . . .)
  Far call to different privilege level (CALL)
  Far return to different privilege level (RET)
  Far jump to different privilege level (JMP)
  Software interrupt (INT)
  Store segment selector register (STR)
  Move to/from segment registers (MOVE)

FIGURE 5.51 Summary of 18 x86 instructions that cause problems for virtualization
[Robin and Irvine, 2000]. The first five instructions in the top group allow a program in user mode to
read a control register, such as descriptor table registers, without causing a trap. The pop flags instruction
modifies a control register with sensitive information but fails silently when in user mode. The protection
checking of the segmented architecture of the x86 is the downfall of the bottom group, as each of these
instructions checks the privilege level implicitly as part of instruction execution when reading a control
register. The checking assumes that the OS must be at the highest privilege level, which is not the case for
guest VMs. Only the Move to segment register tries to modify control state, and protection checking foils it
as well.

To simplify implementations of VMMs on the x86, both AMD and Intel have
proposed extensions to the architecture via a new mode. Intel’s VT-x provides a
new execution mode for running VMs, an architected definition of the VM
state, instructions to swap VMs rapidly, and a large set of parameters to select the
circumstances where a VMM must be invoked. Altogether, VT-x adds 11 new
instructions for the x86. AMD’s Pacifica makes similar proposals.

An alternative to modifying the hardware is to make small modifications to the
operating system to avoid using the troublesome pieces of the architecture. This
technique is called paravirtualization, and the open source Xen VMM is a good
example.
The Xen VMM provides a guest OS with a virtual machine abstraction that uses
only the easy-to-virtualize parts of the physical x86 hardware on which the VMM
runs.

5.16 Concluding Remarks

The difficulty of building a memory system to keep pace with faster processors
is underscored by the fact that the raw material for main memory, DRAMs, is
essentially the same in the fastest computers as it is in the slowest and cheapest.

It is the principle of locality that gives us a chance to overcome the long latency
of memory access, and the soundness of this strategy is demonstrated at all levels
of the memory hierarchy. Although these levels of the hierarchy look quite
different in quantitative terms, they follow similar strategies in their operation and
exploit the same properties of locality.

Multilevel caches make it possible to use more cache optimizations more easily
for two reasons. First, the design parameters of a lower-level cache are different
from a first-level cache. For example, because a lower-level cache will be much
larger, it is possible to use larger block sizes. Second, a lower-level cache is not
constantly being used by the processor, as a first-level cache is. This allows us to
consider having the lower-level cache do something when it is idle that may be
useful in preventing future misses.

Another trend is to seek software help. Efficiently managing the memory
hierarchy using a variety of program transformations and hardware facilities is a
major focus of compiler enhancements. Two different ideas are being explored.
One idea is to reorganize the program to enhance its spatial and temporal locality.
This approach focuses on loop-oriented programs that use large arrays as the major
data structure; large linear algebra problems are a typical example, such as
DGEMM. By restructuring the loops that access the arrays, substantially improved
locality (and, therefore, cache performance) can be obtained.

Another approach is prefetching.
In prefetching, a block of data is brought into the cache before it is actually
referenced. Many microprocessors use hardware prefetching to try to predict
accesses that may be difficult for software to notice.

prefetching: A technique in which data blocks needed in the future are brought
into the cache early by the use of special instructions that specify the address of
the block.

A third approach is special cache-aware instructions that optimize memory
transfer. For example, the microprocessors in Section 6.10 in Chapter 6 use an
optimization that does not fetch the contents of a block from memory on a write
miss because the program is going to write the full block. This optimization
significantly reduces memory traffic for one kernel.

As we will see in Chapter 6, memory systems are a central design issue for
parallel processors. The growing importance of the memory hierarchy in
determining system performance means that this important area will continue to
be a focus for both designers and researchers for some years to come.

5.17 Historical Perspective and Further Reading

This section, which appears online, gives an overview of memory technologies,
from mercury delay lines to DRAM, the invention of the memory hierarchy,
protection mechanisms, and virtual machines, and concludes with a brief history of
operating systems, including CTSS, MULTICS, UNIX, BSD UNIX, MS-DOS,
Windows, and Linux.

5.18 Exercises

5.1 In this exercise we look at memory locality properties of matrix computation.
The following code is written in C, where elements within the same row are stored
contiguously. Assume each word is a 32-bit integer.

for (I=0; I<8; I++)
  for (J=0; J<8000; J++)
    A[I][J]=B[I][0]+A[J][I];

5.1.1 [5] <§5.1> How many 32-bit integers can be stored in a 16-byte cache block?

5.1.2 [5] <§5.1> References to which variables exhibit temporal locality?

5.1.3 [5] <§5.1> References to which variables exhibit spatial locality?

Locality is affected by both the reference order and data layout. The same computation
can also be written below in Matlab, which differs from C by storing matrix elements
within the same column contiguously in memory.

for I=1:8
  for J=1:8000
    A(I,J)=B(I,1)+A(J,I);
  end
end

5.1.4 [10] <§5.1> How many 16-byte cache blocks are needed to store all 32-bit
matrix elements being referenced?

5.1.5 [5] <§5.1> References to which variables exhibit temporal locality?

5.1.6 [5] <§5.1> References to which variables exhibit spatial locality?

5.2 Caches are important to providing a high-performance memory hierarchy
to processors. Below is a list of 32-bit memory address references, given as word
addresses.

3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253

5.2.1 [10] <§5.3> For each of these references, identify the binary address, the tag,
and the index given a direct-mapped cache with 16 one-word blocks. Also list if each
reference is a hit or a miss, assuming the cache is initially empty.

5.2.2 [10] <§5.3> For each of these references, identify the binary address, the tag,
and the index given a direct-mapped cache with two-word blocks and a total size of 8
blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.

5.2.3 [20] <§§5.3, 5.4> You are asked to optimize a cache design for the given
references. There are three direct-mapped cache designs possible, all with a total of 8
words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks.
In terms of miss rate, which cache design is the best? If the miss stall time is 25 cycles,
and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is
the best cache design?

There are many different design parameters that are important to a cache’s overall
performance. Below are listed parameters for different direct-mapped cache designs.

Cache Data Size: 32 KiB

Cache Block Size: 2 words

Cache Access Time: 1 cycle

5.2.4 [15] <§5.3> Calculate the total number of bits required for the cache listed
above, assuming a 32-bit address. Given that total size, find the total size of the closest
direct-mapped cache with 16-word blocks of equal size or greater. Explain why the
second cache, despite its larger data size, might provide slower performance than the
first cache.

5.2.5 [20] <§§5.3, 5.4> Generate a series of read requests that have a lower miss rate
on a 2 KiB 2-way set associative cache than the cache listed above. Identify one possible
solution that would make the cache listed have an equal or lower miss rate than the 2
KiB cache. Discuss the advantages and disadvantages of such a solution.

5.2.6 [15] <§5.3> The formula shown in Section 5.3 shows the typical method to
index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in
the cache). Assuming a 32-bit address and 1024 blocks in the cache, consider a different
indexing function, specifically (Block address[31:27] XOR Block address[26:22]). Is it
possible to use this to index a direct-mapped cache? If so, explain why and discuss any
changes that might need to be made to the cache. If it is not possible, explain why.

5.3 For a direct-mapped cache design with a 32-bit address, the following bits of the
address are used to access the cache.

Tag Index Offset

31–10 9–5 4–0

5.3.1 [5] <§5.3> What is the cache block size (in words)?

5.3.2 [5] <§5.3> How many entries does the cache have?

5.3.3 [5] <§5.3> What is the ratio between total bits required for such a cache
implementation over the data storage bits?

Starting from power on, the following byte-addressed cache references are recorded.

Address

0 4 16 132 232 160 1024 30 140 3100 180 2180

5.3.4 [10] <§5.3> How many blocks are replaced?

5.3.5 [10] <§5.3> What is the hit ratio?

5.3.6 [20] <§5.3> List the final state of the cache, with each valid entry represented as
a record of <index, tag, data>.

5.4 Recall that we have two write policies and write allocation policies, and their
combinations can be implemented either in L1 or L2 cache. Assume the following
choices for L1 and L2 caches:

L1: Write through, non-write allocate
L2: Write back, write allocate

5.4.1 [5] <§§5.3, 5.8> Buffers are employed between different levels of memory
hierarchy to reduce access latency. For this given configuration, list the possible buffers
needed between L1 and L2 caches, as well as L2 cache and memory.

5.4.2 [20] <§§5.3, 5.8> Describe the procedure of handling an L1 write-miss,
considering the component involved and the possibility of replacing a dirty block.

5.4.3 [20] <§§5.3, 5.8> For a multilevel exclusive cache configuration (a block can
reside in only one of the L1 and L2 caches), describe the procedure of handling an L1
write-miss, considering the component involved and the possibility of replacing a dirty
block.

Consider the following program and cache behaviors.

Data Reads per 1000 Instructions: 250
Data Writes per 1000 Instructions: 100
Instruction Cache Miss Rate: 0.30%
Data Cache Miss Rate: 2%
Block Size (byte): 64

5.4.4 [5] <§§5.3, 5.8> For a write-through, write-allocate cache, what are the minimum
read and write bandwidths (measured by byte per cycle) needed to achieve a CPI of 2?

5.4.5 [5] <§§5.3, 5.8> For a write-back, write-allocate cache, assuming 30% of
replaced data cache blocks are dirty, what are the minimal read and write bandwidths
needed for a CPI of 2?

5.4.6 [5] <§§5.3, 5.8> What are the minimal bandwidths needed to achieve the
performance of CPI=1.5?

5.5 Media applications that play audio or video files are part of a class of workloads
called “streaming” workloads; i.e., they bring in large amounts of data but do not reuse
much of it. Consider a video streaming workload that accesses a 512 KiB working set
sequentially with the following address stream:

0, 2, 4, 6, 8, 10, 12, 14, 16, …

5.5.1 [5] <§§5.4, 5.8> Assume a 64 KiB direct-mapped cache with a 32-byte block.
What is the miss rate for the address stream above? How is this miss rate sensitive to
the size of the cache or the working set? How would you categorize the misses this
workload is experiencing, based on the 3C model?

5.5.2 [5] <§§5.1, 5.8> Re-compute the miss rate when the cache block size is 16 bytes,
64 bytes, and 128 bytes. What kind of locality is this workload exploiting?

5.5.3 [10] <§5.13> “Prefetching” is a technique that leverages predictable address
patterns to speculatively bring in additional cache blocks when a particular cache block
is accessed. One example of prefetching is a stream buffer that prefetches sequentially
adjacent cache blocks into a separate buffer when a particular cache block is brought
in. If the data is found in the prefetch buffer, it is considered as a hit and moved into
the cache and the next cache block is prefetched. Assume a two-entry stream buffer
and assume that the cache latency is such that a cache block can be loaded before the
computation on the previous cache block is completed. What is the miss rate for the
address stream above?

Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI
machine with an average of 1.35 references (both instruction and data) per instruction,
help find the optimal block size given the following miss rates for various block sizes.

8: 4% 16: 3% 32: 2% 64: 1.5% 128: 1%

5.5.4 [10] <§5.3> What is the optimal block size for a miss latency of 20×B cycles?

5.5.5 [10] <§5.3> What is the optimal block size for a miss latency of 24+B cycles?

5.5.6 [10] <§5.3> For constant miss latency, what is the optimal block size?

5.6 In this exercise, we will look at the different ways capacity affects overall
performance. In general, cache access time is proportional to capacity. Assume that
main memory accesses take 70 ns and that memory accesses are 36% of all instructions.
The following table shows data for L1 caches attached to each of two processors, P1 and
P2.

L1 Size L1 Miss Rate L1 Hit Time

P1 2 KiB 8.0% 0.66 ns

P2 4 KiB 6.0% 0.90 ns

5.6.1 [5] <§5.4> Assuming that the L1 hit time determines the cycle times for P1 and
P2, what are their respective clock rates?

5.6.2 [5] <§5.4> What is the Average Memory Access Time for P1 and P2?

5.6.3 [5] <§5.4> Assuming a base CPI of 1.0 without any memory stalls, what is the
total CPI for P1 and P2? Which processor is faster?

For the next three problems, we will consider the addition of an L2 cache to P1 to
presumably make up for its limited L1 cache capacity. Use the L1 cache capacities
and hit times from the previous table when solving these problems. The L2 miss rate
indicated is its local miss rate.

L2 Size L2 Miss Rate L2 Hit Time

1 MiB 95% 5.62 ns

5.6.4 [10] <§5.4> What is the AMAT for P1 with the addition of an L2 cache? Is the
AMAT better or worse with the L2 cache?

5.6.5 [5] <§5.4> Assuming a base CPI of 1.0 without any memory stalls, what is the
total CPI for P1 with the addition of an L2 cache?

5.6.6 [10] <§5.4> Which processor is faster, now that P1 has an L2 cache? If P1 is
faster, what miss rate would P2 need in its L1 cache to match P1’s performance? If P2 is
faster, what miss rate would P1 need in its L1 cache to match P2’s performance?

5.7 This exercise examines the impact of different cache designs, specifically
comparing associative caches to the direct-mapped caches from Section 5.4. For these
exercises, refer to the address stream shown in Exercise 5.2.

5.7.1 [10] <§5.4> Using the sequence of references from Exercise 5.2, show the final
cache contents for a three-way set associative cache with two-word blocks and a total
size of 24 words. Use LRU replacement. For each reference identify the index bits, the
tag bits, the block offset bits, and if it is a hit or a miss.

5.7.2 [10] <§5.4> Using the references from Exercise 5.2, show the final cache
contents for a fully associative cache with one-word blocks and a total size of 8 words.
Use LRU replacement. For each reference identify the index bits, the tag bits, and if it
is a hit or a miss.

5.7.3 [15] <§5.4> Using the references from Exercise 5.2, what is the miss rate for
a fully associative cache with two-word blocks and a total size of 8 words, using LRU
replacement? What is the miss rate using MRU (most recently used) replacement?
Finally what is the best possible miss rate for this cache, given any replacement policy?

Multilevel caching is an important technique to overcome the limited amount of
space that a first level cache can provide while still maintaining its speed. Consider a
processor with the following parameters:

Base CPI, No Memory Stalls: 1.5
Processor Speed: 2 GHz
Main Memory Access Time: 100 ns
First Level Cache Miss Rate per Instruction: 7%
Second Level Cache, Direct-Mapped Speed: 12 cycles
Global Miss Rate with Second Level Cache, Direct-Mapped: 3.5%
Second Level Cache, Eight-Way Set Associative Speed: 28 cycles
Global Miss Rate with Second Level Cache, Eight-Way Set Associative: 1.5%

5.7.4 [10] <§5.4> Calculate the CPI for the processor in the table using: 1) only a
first level cache, 2) a second level direct-mapped cache, and 3) a second level eight-way
set associative cache. How do these numbers change if main memory access time is
doubled? If it is cut in half?

5.7.5 [10] <§5.4> It is possible to have an even greater cache hierarchy than two
levels. Given the processor above with a second level, direct-mapped cache, a designer
wants to add a third level cache that takes 50 cycles to access and will reduce the global
miss rate to 1.3%. Would this provide better performance? In general, what are the
advantages and disadvantages of adding a third level cache?

5.7.6 [20] <§5.4> In older processors such as the Intel Pentium or Alpha 21264, the
second level of cache was external (located on a different chip) from the main processor
and the first level cache. While this allowed for large second level caches, the latency to
access the cache was much higher, and the bandwidth was typically lower because the
second level cache ran at a lower frequency. Assume a 512 KiB off-chip second level
cache has a global miss rate of 4%. If each additional 512 KiB of cache lowered global
miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would
the cache have to be to match the performance of the second level direct-mapped cache
listed above? Of the eight-way set associative cache?

5.8 Mean Time Between Failures (MTBF), Mean Time To Repair (MTTR), and
Mean Time To Failure (MTTF) are useful metrics for evaluating the reliability and
availability of a storage resource. Explore these concepts by answering the questions
about devices with the following metrics.

MTTF MTTR

3 Years 1 Day

5.8.1 [5] <§5.5> Calculate the MTBF for each of the devices in the table.

5.8.2 [5] <§5.5> Calculate the availability for each of the devices in the table.

5.8.3 [5] <§5.5> What happens to availability as the MTTR approaches 0? Is this a
realistic situation?

5.8.4 [5] <§5.5> What happens to availability as the MTTR gets very high, i.e., a
device is difficult to repair? Does this imply the device has low availability?

5.9 This exercise examines the single error correcting, double error detecting (SEC/
DED) Hamming code.

5.9.1 [5] <§5.5> What is the minimum number of parity bits required to protect a
128-bit word using the SEC/DED code?

5.9.2 [5] <§5.5> Section 5.5 states that modern server memory modules (DIMMs)
employ SEC/DED ECC to protect each 64 bits with 8 parity bits. Compute the cost/
performance ratio of this code to the code from 5.9.1. In this case, cost is the relative
number of parity bits needed while performance is the relative number of errors that
can be corrected. Which is better?

5.9.3 Consider a SEC code that protects 8-bit words with 4 parity bits. If we read the
value 0x375, is there an error? If so, correct the error.

5.10 For a high-performance system such as a B-tree index for a database, the page
size is determined mainly by the data size and disk performance. Assume that on
average a B-tree index page is 70% full with fixed-size entries. The utility of a page is
its B-tree depth, calculated as log2(entries). The following table shows that for 16-byte
entries, and a 10-year-old disk with a 10 ms latency and 10 MB/s transfer rate, the
optimal page size is 16 KiB.

Page Size   Page Utility or B-Tree Depth        Index Page Access   Utility/Cost
(KiB)       (Number of Disk Accesses Saved)     Cost (ms)

2           6.49 (or log2(2048/16 × 0.7))       10.2                0.64
4           7.49                                10.4                0.72
8           8.49                                10.8                0.79
16          9.49                                11.6                0.82
32          10.49                               13.2                0.79
64          11.49                               16.4                0.70
128         12.49                               22.8                0.55
256         13.49                               35.6                0.38

5.10.1 [10] <§5.7> What is the best page size if entries now become 128 bytes?

5.10.2 [10] <§5.7> Based on 5.10.1, what is the best page size if pages are half full?

5.10.3 [20] <§5.7> Based on 5.10.2, what is the best page size if using a modern disk
with a 3 ms latency and 100 MB/s transfer rate? Explain why future servers are likely
to have larger pages.


Keeping “frequently used” (or “hot”) pages in DRAM can save disk accesses, but how
do we determine the exact meaning of “frequently used” for a given system? Data
engineers use the cost ratio between DRAM and disk access to quantify the reuse time
threshold for hot pages. The cost of a disk access is $Disk/accesses_per_sec, while the
cost to keep a page in DRAM is $DRAM_MiB/page_size. The typical DRAM and disk
costs and typical database page sizes at several time points are listed below:

Year   DRAM Cost   Page Size   Disk Cost   Disk Access Rate
       ($/MiB)     (KiB)       ($/disk)    (access/sec)

1987   5000        1           15,000      15
1997   15          8           2000        64
2007   0.05        64          80          83
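
The break-even idea in the paragraph above can be sketched in C. This is only an illustration under stated assumptions: the function name and parameter names are hypothetical, and the cost of keeping a page in DRAM is read as the DRAM price per MiB divided by the number of pages per MiB. The reuse time threshold is then the cost of sustaining one disk access per second divided by the cost of the DRAM that holds one page.

```c
/* Break-even reuse interval in seconds (a sketch; all names are illustrative).
   A page that is re-referenced more often than this interval is cheaper to
   keep in DRAM than to re-read from disk on every reference. */
double reuse_threshold_sec(double disk_cost_dollars,
                           double disk_accesses_per_sec,
                           double dram_cost_per_mib,
                           double page_size_kib)
{
    /* Dollars needed to buy a sustained rate of one disk access per second. */
    double cost_per_access_rate = disk_cost_dollars / disk_accesses_per_sec;

    /* Dollars of DRAM occupied by one page (assumes 1 MiB = 1024 KiB). */
    double cost_per_page = dram_cost_per_mib * page_size_kib / 1024.0;

    return cost_per_access_rate / cost_per_page;
}
```

For example, with made-up inputs of a $100 disk that performs 100 accesses per second, DRAM at $1 per MiB, and 8 KiB pages, the call reuse_threshold_sec(100.0, 100.0, 1.0, 8.0) yields 128 seconds.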

5.10.4 [10] <§§5.1, 5.7> What are the reuse time thresholds for these three
technology generations?

5.10.5 [10] <§5.7> What are the reuse time thresholds if we keep using the same 4K
page size? What’s the trend here?

5.10.6 [20] <§5.7> What other factors can be changed to keep using the same page
size (thus avoiding software rewrite)? Discuss their likeliness with current technology
and cost trends.

5.11 As described in Section 5.7, virtual memory uses a page table to track the
mapping of virtual addresses to physical addresses. This exercise shows how this table
must be updated as addresses are accessed. The following data constitutes a stream of
virtual addresses as seen on a system. Assume 4 KiB pages, a 4-entry fully associative
TLB, and true LRU replacement. If pages must be brought in from disk, increment the
next largest page number.

4669, 2227, 13916, 34587, 48870, 12608, 49225

TLB

Valid   Tag   Physical Page Number

1       11    12
1       7     4
1       3     6
0       4     9

Page table

Valid   Physical Page or in Disk

1       5
0       Disk
1       6
1       9
1       11
0       Disk
1       4
0       Disk
1       3
1       12

5.11.1 [10] <§5.7> Given the address stream shown, and the initial TLB and page
table states provided above, show the final state of the system. Also list for each reference
if it is a hit in the TLB, a hit in the page table, or a page fault.

5.11.2 [15] <§5.7> Repeat 5.11.1, but this time use 16 KiB pages instead of 4 KiB
pages. What would be some of the advantages of having a larger page size? What are
some of the disadvantages?

5.11.3 [15] <§§5.4, 5.7> Show the final contents of the TLB if it is 2-way set
associative. Also show the contents of the TLB if it is direct mapped. Discuss the
importance of having a TLB to high performance. How would virtual memory
accesses be handled if there were no TLB?

There are several parameters that impact the overall size of the page table. Listed below
are key page table parameters.

Virtual Address Size Page Size Page Table Entry Size

32 bits 8 KiB 4 bytes

5.11.4 [5] <§5.7> Given the parameters shown above, calculate the total page table
size for a system running 5 applications that utilize half of the memory available.

5.11.5 [10] <§5.7> Given the parameters shown above, calculate the total page table
size for a system running 5 applications that utilize half of the memory available, given
a two level page table approach with 256 entries. Assume each entry of the main page
table is 6 bytes. Calculate the minimum and maximum amount of memory required.

5.11.6 [10] <§5.7> A cache designer wants to increase the size of a 4 KiB virtually
indexed, physically tagged cache. Given the page size shown above, is it possible to
make a 16 KiB direct-mapped cache, assuming 2 words per block? How would the
designer increase the data size of the cache?


5.12 In this exercise, we will examine space/time optimizations for page tables. The
following list provides parameters of a virtual memory system.

Virtual Address (bits)   Physical DRAM Installed   Page Size   PTE Size (byte)

43                       16 GiB                    4 KiB       4

5.12.1 [10] <§5.7> For a single-level page table, how many page table entries (PTEs)
are needed? How much physical memory is needed for storing the page table?

5.12.2 [10] <§5.7> Using a multilevel page table can reduce the physical memory
consumption of page tables, by only keeping active PTEs in physical memory. How
many levels of page tables will be needed in this case? And how many memory
references are needed for address translation if missing in TLB?

5.12.3 [15] <§5.7> An inverted page table can be used to further optimize space
and time. How many PTEs are needed to store the page table? Assuming a hash table
implementation, what are the common case and worst case numbers of memory
references needed for servicing a TLB miss?

The following table shows the contents of a 4-entry TLB.

Entry-ID   Valid   VA Page   Modified   Protection   PA Page

1          1       140       1          RW           30
2          0       40        0          RX           34
3          1       200       1          RO           32
4          1       280       0          RW           31

5.12.4 [5] <§5.7> Under what scenarios would entry 2’s valid bit be set to zero?

5.12.5 [5] <§5.7> What happens when an instruction writes to VA page 30? When
would a software-managed TLB be faster than a hardware-managed TLB?

5.12.6 [5] <§5.7> What happens when an instruction writes to VA page 200?

5.13 In this exercise, we will examine how replacement policies impact miss rate.
Assume a 2-way set associative cache with 4 blocks. To solve the problems in this
exercise, you may find it helpful to draw a table like the one below, as demonstrated for
the address sequence “0, 1, 2, 3, 4.”

Address of Memory   Hit or   Evicted   Contents of Cache Blocks After Reference
Block Accessed      Miss     Block     Set 0     Set 0     Set 1     Set 1

0                   Miss               Mem[0]
1                   Miss               Mem[0]              Mem[1]
2                   Miss               Mem[0]    Mem[2]    Mem[1]
3                   Miss               Mem[0]    Mem[2]    Mem[1]    Mem[3]
4                   Miss     0         Mem[4]    Mem[2]    Mem[1]    Mem[3]

…
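
The bookkeeping in the demonstration table can also be sketched in C. The following is a minimal illustration, not a prescribed solution method: the function name simulate_lru and its interface are hypothetical, block addresses are assumed to be non-negative, and a block is assumed to map to set (block address mod 2). It models only LRU replacement, the policy the demonstration rows follow.

```c
#include <stddef.h>

#define NUM_SETS 2   /* 4 blocks organized as 2 sets of 2 ways */
#define NUM_WAYS 2

/* Simulates a 2-way set associative, 4-block cache with LRU replacement.
   Returns the number of hits. If last_evicted is non-NULL, it receives the
   block evicted by the final reference, or -1 if that reference evicted
   nothing. Block addresses are assumed to be non-negative. */
int simulate_lru(const int *blocks, size_t n, int *last_evicted)
{
    int tag[NUM_SETS][NUM_WAYS] = {{0}};       /* block held in each way */
    int valid[NUM_SETS][NUM_WAYS] = {{0}};
    long last_use[NUM_SETS][NUM_WAYS] = {{0}}; /* time of most recent use */
    int hits = 0;
    int evicted = -1;

    for (size_t i = 0; i < n; i++) {
        int set = blocks[i] % NUM_SETS;
        int way = -1;
        evicted = -1;

        for (int w = 0; w < NUM_WAYS; w++)     /* search the set for a hit */
            if (valid[set][w] && tag[set][w] == blocks[i])
                way = w;

        if (way >= 0) {
            hits++;
        } else {
            for (int w = 0; w < NUM_WAYS && way < 0; w++)
                if (!valid[set][w])            /* use an empty way first */
                    way = w;
            if (way < 0) {                     /* set full: evict the LRU way */
                way = 0;
                for (int w = 1; w < NUM_WAYS; w++)
                    if (last_use[set][w] < last_use[set][way])
                        way = w;
                evicted = tag[set][way];
            }
            tag[set][way] = blocks[i];
            valid[set][way] = 1;
        }
        last_use[set][way] = (long)i + 1;      /* record this reference */
    }
    if (last_evicted)
        *last_evicted = evicted;
    return hits;
}
```

Called on the demonstration sequence 0, 1, 2, 3, 4, the sketch reports no hits and block 0 as the block evicted by the last reference, matching the table rows above.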

Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0

5.13.1 [5] <§§5.4, 5.8> Assuming an LRU replacement policy, how many hits does
this address sequence exhibit?

5.13.2 [5] <§§5.4, 5.8> Assuming an MRU (most recently used) replacement policy,
how many hits does this address sequence exhibit?

5.13.3 [5] <§§5.4, 5.8> Simulate a random replacement policy by flipping a coin. For
example, “heads” means to evict the first block in a set and “tails” means to evict the
second block in a set. How many hits does this address sequence exhibit?

5.13.4 [10] <§§5.4, 5.8> Which address should be evicted at each replacement to
maximize the number of hits? How many hits does this address sequence exhibit if you
follow this “optimal” policy?

5.13.5 [10] <§§5.4, 5.8> Describe why it is difficult to implement a cache replacement
policy that is optimal for all address sequences.

5.13.6 [10] <§§5.4, 5.8> Assume you could make a decision upon each memory
reference whether or not you want the requested address to be cached. What impact
could this have on miss rate?

5.14 To support multiple virtual machines, two levels of memory virtualization are
needed. Each virtual machine still controls the mapping of virtual address (VA) to
physical address (PA), while the hypervisor maps the physical address (PA) of each
virtual machine to the actual machine address (MA). To accelerate such mappings,
a software approach called “shadow paging” duplicates each virtual machine’s page
tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies
consistent. To remove the complexity of shadow page tables, a hardware approach
called nested page table (NPT) explicitly supports two classes of page tables (VA ⇒ PA
and PA ⇒ MA) and can walk such tables purely in hardware.
Consider the following sequence of operations: (1) create process; (2) TLB miss;
(3) page fault; (4) context switch.

5.14.1 [10] <§§5.6, 5.7> What would happen for the given operation sequence for
shadow page table and nested page table, respectively?

5.14.2 [10] <§§5.6, 5.7> Assuming an x86-based 4-level page table in both guest and
nested page table, how many memory references are needed to service a TLB miss for
native vs. nested page table?

5.14.3 [15] <§§5.6, 5.7> Among TLB miss rate, TLB miss latency, page fault rate, and
page fault handler latency, which metrics are more important for shadow page table?
Which are important for nested page table?


Assume the following parameters for a shadow paging system.

TLB Misses per 1000 Instructions: 0.2
NPT TLB Miss Latency: 200 cycles
Page Faults per 1000 Instructions: 0.001
Shadowing Page Fault Overhead: 30,000 cycles

5.14.4 [10] <§5.6> For a benchmark with native execution CPI of 1, what are the CPI
numbers if using shadow page tables vs. NPT (assuming only page table virtualization
overhead)?

5.14.5 [10] <§5.6> What techniques can be used to reduce page table shadowing
induced overhead?

5.14.6 [10] <§5.6> What techniques can be used to reduce NPT induced overhead?

5.15 One of the biggest impediments to widespread use of virtual machines is the
performance overhead incurred by running a virtual machine. Listed below are various
performance parameters and application behavior.

Base CPI: 1.5
Privileged O/S Accesses per 10,000 Instructions: 120
Performance Impact to Trap to the Guest O/S: 15 cycles
Performance Impact to Trap to VMM: 175 cycles
I/O Access per 10,000 Instructions: 30
I/O Access Time (Includes Time to Trap to Guest O/S): 1100 cycles

5.15.1 [10] <§5.6> Calculate the CPI for the system listed above assuming that there
are no accesses to I/O. What is the CPI if the VMM performance impact doubles? If it is
cut in half? If a virtual machine software company wishes to obtain a 10% performance
degradation, what is the longest possible penalty to trap to the VMM?

5.15.2 [10] <§5.6> I/O accesses often have a large impact on overall system
performance. Calculate the CPI of a machine using the performance characteristics
above, assuming a non-virtualized system. Calculate the CPI again, this time using a
virtualized system. How do these CPIs change if the system has half the I/O accesses?
Explain why I/O bound applications have a smaller impact from virtualization.

5.15.3 [30] <§§5.6, 5.7> Compare and contrast the ideas of virtual memory and
virtual machines. How do the goals of each compare? What are the pros and cons of
each? List a few cases where virtual memory is desired, and a few cases where virtual
machines are desired.

5.15.4 [20] <§5.6> Section 5.6 discusses virtualization under the assumption that
the virtualized system is running the same ISA as the underlying hardware. However,
one possible use of virtualization is to emulate non-native ISAs. An example of this is
QEMU, which emulates a variety of ISAs such as MIPS, SPARC, and PowerPC. What
are some of the difficulties involved in this kind of virtualization? Is it possible for an
emulated system to run faster than on its native ISA?

5.16 In this exercise, we will explore the control unit for a cache controller for a
processor with a write buffer. Use the finite state machine found in Figure 5.40 as a
starting point for designing your own finite state machines. Assume that the cache
controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in
Section 5.9), but you will add a write buffer with a capacity of one block.

Recall that the purpose of a write buffer is to serve as temporary storage so that the
processor doesn’t have to wait for two memory accesses on a dirty miss. Rather than
writing back the dirty block before reading the new block, it buffers the dirty block and
immediately begins reading the new block. The dirty block can then be written to main
memory while the processor is working.

5.16.1 [10] <§§5.8, 5.9> What should happen if the processor issues a request that
hits in the cache while a block is being written back to main memory from the write
buffer?

5.16.2 [10] <§§5.8, 5.9> What should happen if the processor issues a request that
misses in the cache while a block is being written back to main memory from the write
buffer?

5.16.3 [30] <§§5.8, 5.9> Design a finite state machine to enable the use of a write
buffer.

5.17 Cache coherence concerns the views of multiple processors on a given cache
block. The following data shows two processors and their read/write operations on two
different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is
32 bits.

P1: X[0]++; X[1] = 3;
P2: X[0] = 5; X[1] += 2;

5.17.1 [15] <§5.10> List the possible values of the given cache block for a correct
cache coherence protocol implementation. List at least one more possible value of the
block if the protocol doesn’t ensure cache coherency.

5.17.2 [15] <§5.10> For a snooping protocol, list a valid operation sequence on each
processor/cache to finish the above read/write operations.

5.17.3 [10] <§5.10> What are the best-case and worst-case numbers of cache misses
needed to execute the listed read/write instructions?

Memory consistency concerns the views of multiple data items. The following data
shows two processors and their read/write operations on different cache blocks (A and
B initially 0).

P1: A = 1; B = 2; A += 2; B++;
P2: C = B; D = A;


5.17.4 [15] <§5.10> List the possible values of C and D for an implementation that
ensures both consistency assumptions on page 470.

5.17.5 [15] <§5.10> List at least one more possible pair of values for C and D if such
assumptions are not maintained.

5.17.6 [15] <§§5.3, 5.10> For various combinations of write policies and write
allocation policies, which combinations make the protocol implementation simpler?

5.18 Chip multiprocessors (CMPs) have multiple cores and their caches on a single
chip. CMP on-chip L2 cache design has interesting trade-offs. The following table
shows the miss rates and hit latencies for two benchmarks with private vs. shared L2
cache designs. Assume L1 cache misses once every 32 instructions.

                                      Private   Shared

Benchmark A misses-per-instruction    0.30%     0.12%
Benchmark B misses-per-instruction    0.06%     0.03%

Assume the following hit latencies:

Private Cache   Shared Cache   Memory

5               20             180

5.18.1 [15] <§5.13> Which cache design is better for each of these benchmarks? Use
data to support your conclusion.

5.18.2 [15] <§5.13> Shared cache latency increases with the CMP size. Choose
the best design if the shared cache latency doubles. Off-chip bandwidth becomes the
bottleneck as the number of CMP cores increases. Choose the best design if off-chip
memory latency doubles.

5.18.3 [10] <§5.13> Discuss the pros and cons of shared vs. private L2 caches for both
single-threaded, multi-threaded, and multiprogrammed workloads, and reconsider
them if having on-chip L3 caches.

5.18.4 [15] <§5.13> Assume both benchmarks have a base CPI of 1 (ideal L2 cache).
If having non-blocking cache improves the average number of concurrent L2 misses
from 1 to 2, how much performance improvement does this provide over a shared L2
cache? How much improvement can be achieved over private L2?

5.18.5 [10] <§5.13> Assume new generations of processors double the number of
cores every 18 months. To maintain the same level of per-core performance, how much
more off-chip memory bandwidth is needed for a processor released in three years?

5.18.6 [15] <§5.13> Consider the entire memory hierarchy. What kinds of
optimizations can improve the number of concurrent misses?

5.19 In this exercise we show the definition of a web server log and examine code
optimizations to improve log processing speed. The data structure for the log is defined
as follows:

struct entry {
    int srcIP;            // remote IP address
    char URL[128];        // request URL (e.g., "GET index.html")
    long long refTime;    // reference time
    int status;           // connection status
    char browser[64];     // client browser name
} log[NUM_ENTRIES];

Assume the following processing function for the log:

topK_sourceIP (int hour);

5.19.1 [5] <§5.15> Which fields in a log entry will be accessed for the given log
processing function? Assuming 64-byte cache blocks and no prefetching, how many
cache misses per entry does the given function incur on average?

5.19.2 [10] <§5.15> How can you reorganize the data structure to improve cache
utilization and access locality? Show your structure definition code.

5.19.3 [10] <§5.15> Give an example of another log processing function that would
prefer a different data structure layout. If both functions are important, how would you
rewrite the program to improve the overall performance? Supplement the discussion
with code snippets and data.

For the problems below, use data from “Cache Performance for SPEC CPU2000
Benchmarks” (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the
pairs of benchmarks shown in the following table.

a. Mesa / gcc

b. mcf / swim

5.19.4 [10] <§5.15> For 64 KiB data caches with varying set associativities, what are
the miss rates broken down by miss types (cold, capacity, and conflict misses) for each
benchmark?

5.19.5 [10] <§5.15> Select the set associativity to be used by a 64 KiB L1 data cache
shared by both benchmarks. If the L1 cache has to be directly mapped, select the set
associativity for the 1 MiB L2 cache.

5.19.6 [20] <§5.15> Give an example in the miss rate table where higher set
associativity actually increases miss rate. Construct a cache configuration and reference
stream to demonstrate this.


Answers to Check Yourself

§5.1, page 377: 1 and 4. (3 is false because the cost of the memory hierarchy varies
per computer, but in 2013 the highest cost is usually the DRAM.)
§5.3, page 398: 1 and 4: A lower miss penalty can enable smaller blocks, since you
don’t have that much latency to amortize, yet higher memory bandwidth usually
leads to larger blocks, since the miss penalty is only slightly larger.
§5.4, page 417: 1.
§5.7, page 454: 1-a, 2-c, 3-b, 4-d.
§5.8, page 461: 2. (Both large block sizes and prefetching may reduce compulsory
misses, so 1 is false.)

6 Parallel Processors from Client to Cloud

“I swing big, with everything I’ve got. I hit big or I miss big. I like to live as big as
I can.”
Babe Ruth, American baseball player

6.1 Introduction
6.2 The Difficulty of Creating Parallel Processing Programs
6.3 SISD, MIMD, SIMD, SPMD, and Vector
6.4 Hardware Multithreading
6.5 Multicore and Other Shared Memory Multiprocessors
6.6 Introduction to Graphics Processing Units
6.7 Clusters, Warehouse Scale Computers, and Other Message-Passing Multiprocessors
6.8 Introduction to Multiprocessor Network Topologies
6.9 Communicating to the Outside World: Cluster Networking
6.10 Multiprocessor Benchmarks and Performance Models
6.11 Real Stuff: Benchmarking Intel Core i7 versus NVIDIA Tesla GPU
6.12 Going Faster: Multiple Processors and Matrix Multiply
6.13 Fallacies and Pitfalls
6.14 Concluding Remarks
6.15 Historical Perspective and Further Reading
6.16 Exercises

http://dx.doi.org/10.1016/B978-0-12-407726-3.00001-1
2013

6.1 Introduction

Computer architects have long sought the “City of Gold” (El Dorado) of
computer design: to create powerful computers simply by connecting many existing
smaller ones. This golden vision is the fountainhead of multiprocessors. Ideally,
customers order as many processors as they can afford and receive a commensurate
amount of performance. Thus, multiprocessor software must be designed to work
with a variable number of processors. As mentioned in Chapter 1, energy has
become the overriding issue for both microprocessors and datacenters. Replacing
large inefficient processors with many smaller, efficient processors can deliver
better performance per joule both in the large and in the small, if software can
efficiently use them. Thus, improved energy efficiency joins scalable performance
in the case for multiprocessors.

Since multiprocessor software should scale, some designs support operation
in the presence of broken hardware; that is, if a single processor fails in a
multiprocessor with n processors, the system would continue to provide service
with n – 1 processors. Hence, multiprocessors can also improve availability (see
Chapter 5).

High performance can mean high throughput for independent tasks, called
task-level parallelism or process-level parallelism. These tasks are independent
single-threaded applications, and they are an important and popular use of
multiple processors. This approach is in contrast to running a single job on
multiple processors. We use the term parallel processing program to refer to a
single program that runs on multiple processors simultaneously.

There have long been scientific problems that have needed much faster
computers, and this class of problems has been used to justify many novel parallel
computers over the decades. Some of these problems can be handled simply today,
using a cluster composed of microprocessors housed in many independent servers
(see Section 6.7). In addition, clusters can serve equally demanding applications
outside the sciences, such as search engines, Web servers, email servers, and
databases.

As described in Chapter 1, multiprocessors have been shoved into the spotlight
because the energy problem means that future increases in performance will
primarily come from explicit hardware parallelism rather than much higher
clock rates or vastly improved CPI. As we said in Chapter 1, they are called
multicore microprocessors instead of multiprocessor microprocessors,
presumably to avoid redundancy in naming. Hence, processors are often called
cores in a multicore chip. The number of cores is expected to increase with
Moore’s Law. These multicores are almost always shared memory multiprocessors
(SMPs), as they usually share a single physical address space. We’ll see SMPs
more in Section 6.5.

Over the Mountains Of
the Moon, Down the
Valley of the Shadow,
Ride, boldly ride the
shade replied— If you
seek for El Dorado!
Edgar Allan Poe,
“El Dorado,”
stanza 4, 1849

multiprocessor A computer system with at least two processors. This computer is in
contrast to a uniprocessor, which has one, and is increasingly hard to find today.

task-level parallelism or process-level parallelism Utilizing multiple processors by
running independent programs simultaneously.

parallel processing program A single program that runs on multiple processors
simultaneously.

cluster A set of computers connected over a local area network that function as a
single large multiprocessor.

The state of technology today means that programmers who care about
performance must become parallel programmers, for sequential code now means
slow code.

The tall challenge facing the industry is to create hardware and software that
will make it easy to write correct parallel processing programs that will execute
efficiently in performance and energy as the number of cores per chip scales.

This abrupt shift in microprocessor design caught many off guard, so there is a
great deal of confusion about the terminology and what it means. Figure 6.1 tries to
clarify the terms serial, parallel, sequential, and concurrent. The columns of this figure
represent the software, which is either inherently sequential or concurrent. The rows
of the figure represent the hardware, which is either serial or parallel. For example, the
programmers of compilers think of them as sequential programs: the steps include
parsing, code generation, optimization, and so on. In contrast, the programmers
of operating systems normally think of them as concurrent programs: cooperating
processes handling I/O events due to independent jobs running on a computer.

The point of these two axes of Figure 6.1 is that concurrent software can run on
serial hardware, such as operating systems for the Intel Pentium 4 uniprocessor,
or on parallel hardware, such as an OS on the more recent Intel Core i7. The same
is true for sequential software. For example, the MATLAB programmer writes
a matrix multiply thinking about it sequentially, but it could run serially on the
Pentium 4 or in parallel on the Intel Core i7.

You might guess that the only challenge of the parallel revolution is figuring out how
to make naturally sequential software have high performance on parallel hardware, but
it is also to make concurrent programs have high performance on multiprocessors as the
number of processors increases. With this distinction made, in the rest of this chapter
we will use parallel processing program or parallel software to mean either sequential
or concurrent software running on parallel hardware. The next section of this chapter
describes why it is hard to create efficient parallel processing programs.

                       Software
                       Sequential                           Concurrent

Hardware   Serial      Matrix Multiply written in MATLAB    Windows Vista Operating System
                       running on an Intel Pentium 4        running on an Intel Pentium 4

           Parallel    Matrix Multiply written in MATLAB    Windows Vista Operating System
                       running on an Intel Core i7          running on an Intel Core i7

FIGURE 6.1 Hardware/software categorization and examples of application perspective
on concurrency versus hardware perspective on parallelism.

multicore microprocessor A microprocessor containing multiple processors (“cores”)
in a single integrated circuit. Virtually all microprocessors today in desktops and
servers are multicore.

shared memory multiprocessor (SMP) A parallel processor with a single physical
address space.


Before proceeding further down the path to parallelism, don’t forget our initial
incursions from the earlier chapters:

■ Chapter 2, Section 2.11: Parallelism and Instructions: Synchronization

■ Chapter 3, Section 3.6: Parallelism and Computer Arithmetic: Subword
Parallelism

■ Chapter 4, Section 4.10: Parallelism via Instructions

■ Chapter 5, Section 5.10: Parallelism and Memory Hierarchy: Cache Coherence

Check Yourself

True or false: To benefit from a multiprocessor, an application must be concurrent.

6.2 The Difficulty of Creating Parallel Processing Programs

The difficulty with parallelism is not the hardware; it is that too few important
application programs have been rewritten to complete tasks sooner on multiprocessors.
It is difficult to write software that uses multiple processors to complete one task
faster, and the problem gets worse as the number of processors increases.

Why has this been so? Why have parallel processing programs been so much
harder to develop than sequential programs?

The first reason is that you must get better performance or better energy
efficiency from a parallel processing program on a multiprocessor; otherwise, you
would just use a sequential program on a uniprocessor, as sequential programming
is simpler. In fact, uniprocessor design techniques such as superscalar and out-of-order
execution take advantage of instruction-level parallelism (see Chapter 4),
normally without the involvement of the programmer. Such innovations reduced
the demand for rewriting programs for multiprocessors, since programmers
could do nothing and yet their sequential programs would run faster on new
computers.

Why is it difficult to write parallel processing programs that are fast, especially
as the number of processors increases? In Chapter 1, we used the analogy of
eight reporters trying to write a single story in hopes of doing the work eight
times faster. To succeed, the task must be broken into eight equal-sized pieces,
because otherwise some reporters would be idle while waiting for the ones with
larger pieces to finish. Another speed-up obstacle could be that the reporters
would spend too much time communicating with each other instead of writing
their pieces of the story. For both this analogy and parallel programming,
the challenges include scheduling, partitioning the work into parallel pieces,
balancing the load evenly between the workers, time to synchronize, and
overhead for communication between the parties. The challenge is stiffer with
more reporters for a newspaper story and with more processors for parallel
programming.

Our discussion in Chapter 1 reveals another obstacle, namely Amdahl’s Law. It
reminds us that even small parts of a program must be parallelized if the program
is to make good use of many cores.

Speed-up Challenge

EXAMPLE

Suppose you want to achieve a speed-up of 90 times faster with 100 processors.
What percentage of the original computation can be sequential?

ANSWER

Amdahl’s Law (Chapter 1) says

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

We can reformulate Amdahl’s Law in terms of speed-up versus the original
execution time:

$$\text{Speed-up} = \frac{\text{Execution time before}}{(\text{Execution time before} - \text{Execution time affected}) + \dfrac{\text{Execution time affected}}{\text{Amount of improvement}}}$$

This formula is usually rewritten assuming that the execution time before is
1 for some unit of time, and the execution time affected by improvement is
considered the fraction of the original execution time:

$$\text{Speed-up} = \frac{1}{(1 - \text{Fraction time affected}) + \dfrac{\text{Fraction time affected}}{\text{Amount of improvement}}}$$

Substituting 90 for speed-up and 100 for amount of improvement into the
formula above:

$$90 = \frac{1}{(1 - \text{Fraction time affected}) + \dfrac{\text{Fraction time affected}}{100}}$$

Then simplifying the formula and solving for fraction time affected:

$$\begin{aligned}
90 \times (1 - 0.99 \times \text{Fraction time affected}) &= 1 \\
90 - (90 \times 0.99 \times \text{Fraction time affected}) &= 1 \\
90 - 1 &= 90 \times 0.99 \times \text{Fraction time affected} \\
\text{Fraction time affected} &= 89/89.1 = 0.999
\end{aligned}$$

Thus, to achieve a speed-up of 90 from 100 processors, the sequential
percentage can only be 0.1%.
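The algebra above can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `amdahl_speedup` and `required_parallel_fraction` are illustrative choices, and the execution time before improvement is assumed to be normalized to 1.

```python
def amdahl_speedup(fraction_affected, improvement):
    """Speed-up = 1 / ((1 - F) + F / improvement), with the original time normalized to 1."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / improvement)


def required_parallel_fraction(target_speedup, processors):
    """Solve Amdahl's Law for the fraction F that must be parallel.

    From 1/S = 1 - F * (1 - 1/P) we get F = (1 - 1/S) / (1 - 1/P).
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


fraction = required_parallel_fraction(90, 100)
print(f"parallel fraction   = {fraction:.4f}")            # same value as 89/89.1
print(f"sequential fraction = {1.0 - fraction:.2%}")
print(f"check: speed-up     = {amdahl_speedup(fraction, 100):.1f}")
```

Solving for the fraction directly, rather than substituting numbers first, makes it easy to try other targets, such as a speed-up of 50 on 100 processors.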

Yet, there are applications with plenty of parallelism, as we shall see next.

Speed-up Challenge: Bigger Problem

EXAMPLE

Suppose you want to perform two sums: one is a sum of 10 scalar variables, and
one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10.
For now let’s assume only the matrix sum is parallelizable; we’ll see soon how to
parallelize scalar sums. What speed-up do you get with 10 versus 40 processors?
Next, calculate the speed-ups assuming the matrices grow to 20 by 20.

ANSWER

If we assume performance is a function of the time for an addition, t, then
there are 10 additions that do not benefit from parallel processors and 100
additions that do. If the time for a single processor is 110t, the execution time
for 10 processors is

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

$$\text{Execution time after improvement} = \frac{100t}{10} + 10t = 20t$$

so the speed-up with 10 processors is 110t/20t = 5.5. The execution time for
40 processors is

$$\text{Execution time after improvement} = \frac{100t}{40} + 10t = 12.5t$$

so the speed-up with 40 processors is 110t/12.5t = 8.8. Thus, for this problem
size, we get about 55% of the potential speed-up with 10 processors, but only
22% with 40.

Look what happens when we increase the matrix. The sequential program now
takes 10t + 400t = 410t. The execution time for 10 processors is

$$\text{Execution time after improvement} = \frac{400t}{10} + 10t = 50t$$

so the speed-up with 10 processors is 410t/50t = 8.2. The execution time for
40 processors is

$$\text{Execution time after improvement} = \frac{400t}{40} + 10t = 20t$$

so the speed-up with 40 processors is 410t/20t = 20.5. Thus, for this larger problem
size, we get 82% of the potential speed-up with 10 processors and 51% with 40.
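The four cases in this example follow one pattern, which the short sketch below tabulates. It is an illustration only: the helper name `speedup` is an assumption, and the time t for one addition is taken as 1.

```python
def speedup(serial_adds, parallel_adds, processors):
    """Speed-up when only `parallel_adds` additions are spread evenly over the processors."""
    time_before = serial_adds + parallel_adds
    time_after = parallel_adds / processors + serial_adds
    return time_before / time_after


for n in (10, 20):                      # matrix dimension: 10 by 10, then 20 by 20
    for p in (10, 40):                  # number of processors
        s = speedup(10, n * n, p)
        # s / p is the share of the potential (linear) speed-up actually achieved
        print(f"{n}x{n} matrix, {p:2d} processors: speed-up {s:4.1f}, {s / p:.0%} of potential")
```

Holding `n` fixed while raising `p` shows the fraction of potential speed-up falling; raising `n` along with `p` recovers much of it.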

These examples show that getting good speed-up on a multiprocessor while
keeping the problem size fixed is harder than getting good speed-up by increasing
the size of the problem. This insight allows us to introduce two terms that describe
ways to scale up.

Strong scaling means measuring speed-up while keeping the problem size fixed.
Weak scaling means that the problem size grows proportionally to the increase in
the number of processors. Let’s assume that the size of the problem, M, is the working
set in main memory, and we have P processors. Then the memory per processor for
strong scaling is approximately M/P, and for weak scaling, it is approximately M.

Note that the memory hierarchy can interfere with the conventional wisdom
about weak scaling being easier than strong scaling. For example, if the weakly
scaled dataset no longer fits in the last-level cache of a multicore microprocessor,
the resulting performance could be much worse than by using strong scaling.

Depending on the application, you can argue for either scaling approach. For
example, the TPC-C debit-credit database benchmark requires that you scale up
the number of customer accounts in proportion to the higher transactions per
minute. The argument is that it’s nonsensical to think that a given customer base
is suddenly going to start using ATMs 100 times a day just because the bank gets a
faster computer. Instead, if you’re going to demonstrate a system that can perform
100 times the number of transactions per minute, you should run the experiment
with 100 times as many customers. Bigger problems often need more data, which
is an argument for weak scaling.

strong scaling: Speed-up achieved on a multiprocessor without increasing the
size of the problem.

weak scaling: Speed-up achieved on a multiprocessor while increasing the size
of the problem proportionally to the increase in the number of processors.

This final example shows the importance of load balancing.

Speed-up Challenge: Balancing Load

EXAMPLE

To achieve the speed-up of 20.5 on the previous larger problem with 40
processors, we assumed the load was perfectly balanced. That is, each of the 40
processors had 2.5% of the work to do. Instead, show the impact on speed-up if
one processor’s load is higher than all the rest. Calculate at twice the load (5%)
and five times the load (12.5%) for that hardest working processor. How well
utilized are the rest of the processors?

ANSWER

If one processor has 5% of the parallel load, then it must do 5% × 400 or 20
additions, and the other 39 will share the remaining 380. Since they are operating
simultaneously, we can just calculate the execution time as a maximum

$$\text{Execution time after improvement} = \text{Max}\left(\frac{380t}{39}, \frac{20t}{1}\right) + 10t = 30t$$

The speed-up drops from 20.5 to 410t/30t = 14. The remaining 39 processors
are utilized less than half the time: while waiting 20t for the hardest working
processor to finish, they only compute for 380t/39 = 9.7t.

If one processor has 12.5% of the load, it must perform 50 additions. The
formula is:

$$\text{Execution time after improvement} = \text{Max}\left(\frac{350t}{39}, \frac{50t}{1}\right) + 10t = 60t$$

The speed-up drops even further to 410t/60t = 7. The rest of the processors
are utilized less than 20% of the time (9t/50t). This example demonstrates the
importance of balancing load, for just a single processor with twice the load
of the others cuts speed-up by a third, and five times the load on just one
processor reduces speed-up by almost a factor of three.
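The Max formula can be expressed as a short sketch that reports both the speed-up and how busy the other processors are. The function name `imbalanced_speedup` and its parameters are illustrative assumptions; as in the example, the time t for one addition is taken as 1 and the remaining parallel work is assumed to be split evenly.

```python
def imbalanced_speedup(serial_adds, parallel_adds, processors, hot_adds):
    """One processor performs `hot_adds` additions; the rest share the remainder evenly.

    Returns (speed-up, utilization of the other processors).
    """
    others = (parallel_adds - hot_adds) / (processors - 1)
    parallel_time = max(others, hot_adds)          # everyone waits for the slowest processor
    time_before = serial_adds + parallel_adds
    time_after = parallel_time + serial_adds
    return time_before / time_after, others / parallel_time


for hot_adds in (10, 20, 50):                      # 2.5%, 5%, and 12.5% of 400 additions
    s, util = imbalanced_speedup(10, 400, 40, hot_adds)
    print(f"hot processor does {hot_adds:2d} adds: speed-up {s:4.1f}, others busy {util:.0%}")
```

The first case (10 additions on every processor) reproduces the balanced speed-up of 20.5, which gives a baseline for the two imbalanced cases.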

Now that we better understand the goals and challenges of parallel processing,
we give an overview of the rest of the chapter. The next section (6.3) describes
a much older classification scheme than in Figure 6.1. In addition, it describes
two styles of instruction set architectures that support running of sequential
applications on parallel hardware, namely SIMD and vector. Section 6.4 then
describes multithreading, a term often confused with multiprocessing, in part
because it relies upon similar concurrency in programs. Section 6.5 describes the
first of the two alternatives of a fundamental parallel hardware characteristic, which
is whether or not all the processors in the system rely upon a single physical address
space. As mentioned above, the two popular versions of these alternatives are called
shared memory multiprocessors (SMPs) and clusters, and this section covers the
former. Section 6.6 describes a relatively new style of computer from the graphics
hardware community, called a graphics-processing unit (GPU), that also assumes
a single physical address space. (Appendix C describes GPUs in even more detail.)
Section 6.7 describes clusters, a popular example of a computer with multiple
physical address spaces. Section 6.8 shows typical topologies used to connect many
processors together, either server nodes in a cluster or cores in a microprocessor.

Section 6.9 describes the hardware and software for communicating between
nodes in a cluster using Ethernet. It shows how to optimize its performance using
custom software and hardware. We next discuss the difficulty of finding parallel
benchmarks in Section 6.10. This section also includes a simple, yet insightful
performance model that helps in the design of applications as well as architectures.
We use this model as well as parallel benchmarks in Section 6.11 to compare a
multicore computer to a GPU. Section 6.12 divulges the final and largest step in
our journey of accelerating matrix multiply. For matrices that don’t fit in the cache,
parallel processing uses 16 cores to improve performance by a factor of 14. We
close with fallacies and pitfalls and our conclusions for parallelism.

In the next section, we introduce acronyms that you probably have already seen
to identify different types of parallel computers.

Check Yourself

True or false: Strong scaling is not bound by Amdahl’s Law.

6.3 SISD, MIMD, SIMD, SPMD, and Vector

One categorization of parallel hardware proposed in the 1960s is still used today. It
was based on the number of instruction streams and the number of data streams.
Figure 6.2 shows the categories. Thus, a conventional uniprocessor has a single
instruction stream and single data stream, and a conventional multiprocessor has
multiple instruction streams and multiple data streams. These two categories are
abbreviated SISD and MIMD, respectively.

While it is possible to write separate programs that run on different processors
on a MIMD computer and yet work together for a grander, coordinated goal,
programmers normally write a single program that runs on all processors of a
MIMD computer, relying on conditional statements when different processors
should execute different sections of code. This style is called Single Program
Multiple Data (SPMD), but it is just the normal way to program a MIMD computer.
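A minimal sketch of the SPMD style follows. It is an illustration under stated assumptions, not a performance example: the names `spmd_program` and `run` are hypothetical, the workers are Python threads standing in for processors, and the `rank` argument plays the role of a processor number.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_program(rank, nprocs, data):
    """The single program that every worker runs; `rank` identifies the worker."""
    my_share = data[rank::nprocs]        # same code, different data for each rank
    partial = sum(my_share)
    if rank == 0:
        # A conditional statement lets one worker execute a different section of code.
        print(f"rank 0 of {nprocs}: I also report progress")
    return partial


def run(nprocs, data):
    """Launch the same program on every worker and combine the partial results."""
    with ThreadPoolExecutor(max_workers=nprocs) as pool:
        partials = list(pool.map(lambda r: spmd_program(r, nprocs, data), range(nprocs)))
    return sum(partials)


print(run(4, list(range(100))))
```

The point is the structure: there is one program text, and differences in behavior come only from the rank-dependent data selection and conditionals.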

The closest we can come to a multiple instruction streams and single data stream
(MISD) processor might be a “stream processor” that would perform a series of
computations on a single data stream in a pipelined fashion: parse the input from
the network, decrypt the data, decompress it, search for a match, and so on. The
inverse of MISD is much more popular. SIMD computers operate on vectors of
data. For example, a single SIMD instruction might add 64 numbers by sending 64
data streams to 64 ALUs to form 64 sums within a single clock cycle. The subword
parallel instructions that we saw in Sections 3.6 and 3.7 are another example of
SIMD; indeed, the middle letter of Intel’s SSE acronym stands for SIMD.

                                       Data Streams
                             Single                     Multiple
Instruction    Single        SISD: Intel Pentium 4      SIMD: SSE instructions of x86
Streams        Multiple      MISD: No examples today    MIMD: Intel Core i7

FIGURE 6.2 Hardware categorization and examples based on number of instruction
streams and data streams: SISD, SIMD, MISD, and MIMD.

SISD or Single Instruction stream, Single Data stream. A uniprocessor.

MIMD or Multiple Instruction streams, Multiple Data streams. A multiprocessor.

SPMD: Single Program, Multiple Data streams. The conventional MIMD
programming model, where a single program runs across all processors.

SIMD or Single Instruction stream, Multiple Data streams. The same instruction
is applied to many data streams, as in a vector processor.

The virtues of SIMD are that all the parallel execution units are synchronized and
they all respond to a single instruction that emanates from a single program counter
(PC). From a programmer’s perspective, this is close to the already familiar SISD.
Although every unit will be executing the same instruction, each execution unit has
its own address registers, and so each unit can have different data addresses. Thus,
in terms of Figure 6.1, a sequential application might be compiled to run on serial
hardware organized as a SISD or in parallel hardware that was organized as a SIMD.

The original motivation behind SIMD was to amortize the cost of the control
unit over dozens of execution units. Another advantage is the reduced instruction
bandwidth and space: SIMD needs only one copy of the code that is being
simultaneously executed, while message-passing MIMDs may need a copy in every
processor, and shared memory MIMD will need multiple instruction caches.

SIMD works best when dealing with arrays in for loops. Hence, for parallelism
to work in SIMD, there must be a great deal of identically structured data, which
is called data-level parallelism. SIMD is at its weakest in case or switch
statements, where each execution unit must perform a different operation on its
data, depending on what data it has. Execution units with the wrong data must be
disabled so that units with proper data may continue. If there are n cases, in these
situations SIMD processors essentially run at 1/nth of peak performance.
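The 1/n behavior can be seen in a small simulation. This is a sketch under simplifying assumptions, not a model of any real SIMD unit: the name `simd_switch` is hypothetical, every case is assumed to cost one step on all lanes, and each lane is assumed to match exactly one case.

```python
def simd_switch(data, cases):
    """Run a switch statement on SIMD-style lanes, one case at a time.

    `cases` is a list of (predicate, operation) pairs. For each case, every lane
    spends one step, but only the lanes whose data matches are enabled.
    Returns (results, fraction of lane-steps that did useful work).
    """
    lanes = len(data)
    results = [None] * lanes
    useful = 0
    for predicate, operation in cases:
        mask = [predicate(x) for x in data]        # which lanes are enabled for this case
        for lane, enabled in enumerate(mask):
            if enabled:
                results[lane] = operation(data[lane])
                useful += 1
    return results, useful / (len(cases) * lanes)


cases = [
    (lambda x: x % 4 == 0, lambda x: x + 1),
    (lambda x: x % 4 == 1, lambda x: x * 2),
    (lambda x: x % 4 == 2, lambda x: x - 1),
    (lambda x: x % 4 == 3, lambda x: x * x),
]
results, utilization = simd_switch(list(range(8)), cases)
print(results)
print(f"utilization = {utilization:.2f}")          # four cases, so 1/4 of peak
```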

The so-called array processors that inspired the SIMD category have faded
into history (see Section 6.15 online), but two current interpretations of SIMD
remain active today.

SIMD in x86: Multimedia Extensions
As described in Chapter 3, subword parallelism for narrow integer data was the
original inspiration of the Multimedia Extension (MMX) instructions of the x86
in 1996. As Moore’s Law continued, more instructions were added, leading first
to Streaming SIMD Extensions (SSE) and now Advanced Vector Extensions (AVX).
AVX supports simultaneous operations on four 64-bit floating-point numbers.
The width of the operation and the registers is encoded in the opcode of these
multimedia instructions. As the data width of the registers and operations grew,
the number of opcodes for multimedia instructions exploded, and now there are
hundreds of SSE and AVX instructions (see Chapter 3).

Vector
An older and, as we shall see, more elegant interpretation of SIMD is called a vector
architecture, which has been closely identified with computers designed by Seymour
Cray starting in the 1970s. It is also a great match to problems with lots of data-level
parallelism. Rather than having 64 ALUs perform 64 additions simultaneously, like
the old array processors, the vector architectures pipelined the ALU to get good
performance at lower cost. The basic philosophy of vector architecture is to collect
data elements from memory, put them in order into a large set of registers, operate
on them sequentially in registers using pipelined execution units, and then write
the results back to memory. A key feature of vector architectures is then a set of
vector registers. Thus, a vector architecture might have 32 vector registers, each
with 64 64-bit elements.

data-level parallelism: Parallelism achieved by performing the same operation
on independent data.

Comparing Vector to Conventional Code

EXAMPLE

Suppose we extend the MIPS instruction set architecture with vector
instructions and vector registers. Vector operations use the same names as
MIPS operations, but with the letter V appended. For example, addv.d
adds two double-precision vectors. The vector instructions take as their input
either a pair of vector registers (addv.d) or a vector register and a scalar
register (addvs.d). In the latter case, the value in the scalar register is used
as the input for all operations: the operation addvs.d will add the contents
of a scalar register to each element in a vector register. The names lv and sv
denote vector load and vector store, and they load or store an entire vector
of double-precision data. One operand is the vector register to be loaded or
stored; the other operand, which is a MIPS general-purpose register, is the
starting address of the vector in memory. Given this short description, show
the conventional MIPS code versus the vector MIPS code for

Y = a × X + Y

where X and Y are vectors of 64 double-precision floating-point numbers,
initially resident in memory, and a is a scalar double-precision variable. (This
example is the so-called DAXPY loop that forms the inner loop of the Linpack
benchmark; DAXPY stands for double precision a × X plus Y.) Assume that
the starting addresses of X and Y are in $s0 and $s1, respectively.

ANSWER

Here is the conventional MIPS code for DAXPY:

        l.d     $f0,a($sp)        # load scalar a
        addiu   $t0,$s0,512       # upper bound of what to load
loop:   l.d     $f2,0($s0)        # load x(i)
        mul.d   $f2,$f2,$f0       # a × x(i)
        l.d     $f4,0($s1)        # load y(i)
        add.d   $f4,$f4,$f2       # a × x(i) + y(i)
        s.d     $f4,0($s1)        # store into y(i)
        addiu   $s0,$s0,8         # increment index to x
        addiu   $s1,$s1,8         # increment index to y
        subu    $t1,$t0,$s0       # compute bound
        bne     $t1,$zero,loop    # check if done

Here is the vector MIPS code for DAXPY:

        l.d     $f0,a($sp)        # load scalar a
        lv      $v1,0($s0)        # load vector x
        mulvs.d $v2,$v1,$f0       # vector-scalar multiply
        lv      $v3,0($s1)        # load vector y
        addv.d  $v4,$v2,$v3       # add y to product
        sv      $v4,0($s1)        # store the result
There are some interesting comparisons between the two code segments in
this example. The most dramatic is that the vector processor greatly reduces the
dynamic instruction bandwidth, executing only 6 instructions versus almost 600
for the traditional MIPS architecture. This reduction occurs both because the vector
operations work on 64 elements at a time and because the overhead instructions
that constitute nearly half the loop on MIPS are not present in the vector code. As
you might expect, this reduction in instructions fetched and executed saves energy.
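The "almost 600" figure comes from counting the instructions in the two listings. The sketch below simply does that arithmetic; the constant and function names are illustrative, and the counts are read off the code in this example (2 setup instructions, a 9-instruction loop body of which 4 are loop overhead, and 6 vector instructions).

```python
ELEMENTS = 64            # vector length in the DAXPY example
SCALAR_SETUP = 2         # l.d of a, addiu for the upper bound
SCALAR_LOOP_BODY = 9     # instructions executed per element in the scalar loop
LOOP_OVERHEAD = 4        # two addiu, subu, bne: bookkeeping rather than DAXPY work
VECTOR_INSTRUCTIONS = 6  # the entire vector version


def scalar_dynamic_count(elements):
    """Dynamic instruction count of the conventional MIPS loop."""
    return SCALAR_SETUP + SCALAR_LOOP_BODY * elements


scalar_total = scalar_dynamic_count(ELEMENTS)
print(f"scalar MIPS: {scalar_total} instructions")
print(f"vector MIPS: {VECTOR_INSTRUCTIONS} instructions")
print(f"ratio:       {scalar_total / VECTOR_INSTRUCTIONS:.0f}x")
print(f"loop overhead share: {LOOP_OVERHEAD / SCALAR_LOOP_BODY:.0%}")
```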

Another important difference is the frequency of pipeline hazards (Chapter 4).
In the straightforward MIPS code, every add.d must wait for a mul.d, every
s.d must wait for the add.d, and every add.d and mul.d must wait on l.d.
On the vector processor, each vector instruction will only stall for the first element
in each vector, and then subsequent elements will flow smoothly down the pipeline.
Thus, pipeline stalls are required only once per vector operation, rather than once
per vector element. In this example, the pipeline stall frequency on MIPS will be
about 64 times higher than it is on the vector version of MIPS. The pipeline stalls
can be reduced on MIPS by using loop unrolling (see Chapter 4). However, the
large difference in instruction bandwidth cannot be reduced.

Since the vector elements are independent, they can be operated on in parallel,
much like subword parallelism for AVX instructions. All modern vector computers
have vector functional units with multiple parallel pipelines (called vector lanes; see
Figures 6.3 and 6.4) that can produce two or more results per clock cycle.

Elaboration: The loop in the example above exactly matched the vector length. When
loops are shorter, vector architectures use a register that reduces the length of vector
operations. When loops are longer, we add bookkeeping code to iterate full-length vector
operations and to handle the leftovers. This latter process is called strip mining.
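Strip mining can be sketched in a few lines. This is an illustration only: `MVL` (maximum vector length) and `daxpy_strip_mined` are assumed names, each inner list operation stands in for one group of vector instructions, and the leftover piece is handled last here, which is one of several possible orderings.

```python
MVL = 64  # assumed maximum vector length, matching the 64-element vector registers


def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Compute y = a*x + y in strips of at most `mvl` elements.

    Returns the list of vector lengths used, one entry per strip.
    """
    n = len(x)
    lengths = []
    for start in range(0, n, mvl):
        vl = min(mvl, n - start)            # value a vector-length register would hold
        end = start + vl
        # One lv / mulvs.d / lv / addv.d / sv group would cover this whole strip.
        y[start:end] = [a * xi + yi for xi, yi in zip(x[start:end], y[start:end])]
        lengths.append(vl)
    return lengths


x = [1.0] * 200
y = [2.0] * 200
print(daxpy_strip_mined(3.0, x, y))         # three full strips and one leftover strip
print(y[0], y[199])
```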

Vector versus Scalar
Vector instructions have several important properties compared to conventional
instruction set architectures, which are called scalar architectures in this context:

■ A single vector instruction specifies a great deal of work: it is equivalent
to executing an entire loop. The instruction fetch and decode bandwidth
needed is dramatically reduced.

■ By using a vector instruction, the compiler or programmer indicates that the
computation of each result in the vector is independent of the computation of
other results in the same vector, so hardware does not have to check for data
hazards within a vector instruction.

■ Vector architectures and compilers have a reputation of making it much
easier than when using MIMD multiprocessors to write efficient applications
when they contain data-level parallelism.

■ Hardware need only check for data hazards between two vector instructions
once per vector operand, not once for every element within the vectors.
Reduced checking can save energy as well as time.

■ Vector instructions that access memory have a known access pattern. If
the vector’s elements are all adjacent, then fetching the vector from a set
of heavily interleaved memory banks works very well. Thus, the cost of the
latency to main memory is seen only once for the entire vector, rather than
once for each word of the vector.

■ Because an entire loop is replaced by a vector instruction whose behavior
is predetermined, control hazards that would normally arise from the loop
branch are nonexistent.

■ The savings in instruction bandwidth and hazard checking plus the efficient
use of memory bandwidth give vector architectures advantages in power and
energy versus scalar architectures.

For these reasons, vector operations can be made faster than a sequence of
scalar operations on the same number of data items, and designers are motivated
to include vector units if the application domain can often use them.

Vector versus Multimedia Extensions
Like multimedia extensions found in the x86 AVX instructions, a vector instruction
specifies multiple operations. However, multimedia extensions typically specify a
few operations while a vector instruction specifies dozens of operations. Unlike
multimedia extensions, the number of elements in a vector operation is not in the
opcode but in a separate register. This distinction means different versions of the
vector architecture can be implemented with a different number of elements just by
changing the contents of that register and hence retain binary compatibility. In
contrast, a large new set of opcodes is added each time the vector length changes in
the multimedia extension architecture of the x86: MMX, SSE, SSE2, AVX, AVX2, … .

Also unlike multimedia extensions, the data transfers need not be contiguous.
Vectors support both strided accesses, where the hardware loads every nth data
element in memory, and indexed accesses, where hardware finds the addresses of
the items to be loaded in a vector register. Indexed accesses are also called gather-
scatter, in that indexed loads gather elements from main memory into contiguous
vector elements and indexed stores scatter vector elements across main memory.
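The three access patterns can be illustrated with a list standing in for main memory. This is a minimal sketch, not hardware behavior: the function names `strided_load`, `gather`, and `scatter` are illustrative, and addresses are simply list indices.

```python
def strided_load(memory, base, stride, length):
    """Load `length` elements starting at `base`, taking every `stride`-th element."""
    return [memory[base + i * stride] for i in range(length)]


def gather(memory, indices):
    """Indexed load: collect scattered elements into a contiguous vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Indexed store: write contiguous vector elements to scattered locations."""
    for i, v in zip(indices, values):
        memory[i] = v


memory = list(range(100, 120))              # memory[i] holds the value 100 + i
print(strided_load(memory, 1, 4, 4))        # elements 1, 5, 9, 13
index_vector = [7, 2, 15]
vector = gather(memory, index_vector)
print(vector)
scatter(memory, index_vector, [v * 10 for v in vector])
print(memory[7], memory[2], memory[15])
```

A strided load is what a column access to a row-major matrix looks like; gather and scatter are the pattern behind sparse-matrix codes.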

Like multimedia extensions, vector architectures easily capture the flexibility
in data widths, so it is easy to make a vector operation work on 32 64-bit data
elements or 64 32-bit data elements or 128 16-bit data elements or 256 8-bit data
elements. The parallel semantics of a vector instruction allows an implementation
to execute these operations using a deeply pipelined functional unit, an array of
parallel functional units, or a combination of parallel and pipelined functional
units. Figure 6.3 illustrates how to improve vector performance by using parallel
pipelines to execute a vector add instruction.

Vector arithmetic instructions usually only allow element N of one vector
register to take part in operations with element N from other vector registers. This
dramatically simplifies the construction of a highly parallel vector unit, which can
be structured as multiple parallel vector lanes. As with a traffic highway, we can
increase the peak throughput of a vector unit by adding more lanes. Figure 6.4
shows the structure of a four-lane vector unit. Thus, going to four lanes from one
lane reduces the number of clocks per vector instruction by roughly a factor of four.
For multiple lanes to be advantageous, both the applications and the architecture
must support long vectors. Otherwise, they will execute so quickly that you’ll run
out of instructions, requiring instruction level parallel techniques like those in
Chapter 4 to supply enough vector instructions.
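How elements are spread across lanes, and why four lanes cut the clock count by roughly four, can be shown with a short sketch. The function names are illustrative, and pipeline start-up time is ignored, which is a simplifying assumption.

```python
import math


def lane_assignment(n_elements, lanes):
    """Element numbers handled by each lane when elements are interleaved across lanes."""
    return [list(range(lane, n_elements, lanes)) for lane in range(lanes)]


def cycles_per_vector_instruction(n_elements, lanes):
    """Clock cycles to issue all element operations, one per lane per cycle (no start-up cost)."""
    return math.ceil(n_elements / lanes)


for lane, elements in enumerate(lane_assignment(12, 4)):
    print(f"lane {lane}: elements {elements}")

print(cycles_per_vector_instruction(64, 1))   # one lane
print(cycles_per_vector_instruction(64, 4))   # four lanes
```

With a short vector the ceiling matters: 10 elements on 4 lanes still need 3 cycles, which is one reason long vectors are needed for multiple lanes to pay off.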

vector lane: One or more vector functional units and a portion of the vector
register file. Inspired by lanes on highways that increase traffic speed, multiple
lanes execute vector operations simultaneously.

Generally, vector architectures are a very efficient way to execute data parallel
processing programs; they are better matches to compiler technology than
multimedia extensions; and they are easier to evolve over time than the multimedia
extensions to the x86 architecture.

Given these classic categories, we next see how to exploit parallel streams of
instructions to improve the performance of a single processor, which we will reuse
with multiple processors.

Check Yourself

True or false: As exemplified in the x86, multimedia extensions can be thought of
as a vector architecture with short vectors that supports only contiguous vector
data transfers.

[Figure 6.3: (a) a single add pipeline taking element pairs A[1]..A[9] and
B[1]..B[9] one per cycle and producing C[0]; (b) four add pipelines producing
C[0], C[1], C[2], and C[3], with element groups A[4]..A[7], B[4]..B[7] and
A[8], A[9], B[8], B[9] interleaved across the four lanes.]

FIGURE 6.3 Using multiple functional units to improve the performance of a single vector
add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline and can complete
one addition per cycle. The vector processor (b) on the right has four add pipelines or lanes and can complete
four additions per cycle. The elements within a single vector add instruction are interleaved across the four
lanes.

Elaboration: Given the advantages of vector, why aren’t vector architectures more
popular outside high-performance computing? There were concerns about the larger
state for vector registers increasing context switch time and the difficulty of handling
page faults in vector loads and stores, and SIMD instructions achieved some of the
benefits of vector instructions. In addition, as long as advances in instruction level
parallelism could deliver on the performance promise of Moore’s Law, there was little
reason to take the chance of changing architecture styles.

Elaboration: Another advantage of vector and multimedia extensions is that it is
relatively easy to extend a scalar instruction set architecture with these instructions to
improve performance of data parallel operations.

Elaboration: The Haswell-generation x86 processors from Intel support AVX2, which
has a gather operation but not a scatter operation.

[Figure 6.4: four lanes, Lane 0 through Lane 3. Lane k contains FP add pipe k,
FP mul pipe k, and the vector-register elements k, k+4, k+8, …; all four lanes
connect to a shared vector load-store unit.]

FIGURE 6.4 Structure of a vector unit containing four lanes. The vector-register storage is
divided across the lanes, with each lane holding every fourth element of each vector register. The figure
shows three vector functional units: an FP add, an FP multiply, and a load-store unit. Each of the vector
arithmetic units contains four execution pipelines, one per lane, which act in concert to complete a single
vector instruction. Note how each section of the vector-register file only needs to provide enough read and
write ports (see Chapter 4) for functional units local to its lane.

6.4 Hardware Multithreading

A concept related to MIMD, especially from the programmer’s perspective, is
hardware multithreading. While MIMD relies on multiple processes or threads
to try to keep multiple processors busy, hardware multithreading allows multiple
threads to share the functional units of a single processor in an overlapping fashion
to try to utilize the hardware resources efficiently. To permit this sharing, the
processor must duplicate the independent state of each thread. For example, each
thread would have a separate copy of the register file and the program counter.
The memory itself can be shared through the virtual memory mechanisms, which
already support multiprogramming. In addition, the hardware must support the
ability to change to a different thread relatively quickly. In particular, a thread
switch should be much more efficient than a process switch, which typically
requires hundreds to thousands of processor cycles while a thread switch can be
instantaneous.

There are two main approaches to hardware multithreading. Fine-grained
multithreading switches between threads on each instruction, resulting in
interleaved execution of multiple threads. This interleaving is often done in a
round-robin fashion, skipping any threads that are stalled at that clock cycle. To
make fine-grained multithreading practical, the processor must be able to switch
threads on every clock cycle. One advantage of fine-grained multithreading is
that it can hide the throughput losses that arise from both short and long stalls,
since instructions from other threads can be executed when one thread stalls. The
primary disadvantage of fine-grained multithreading is that it slows down the
execution of the individual threads, since a thread that is ready to execute without
stalls will be delayed by instructions from other threads.
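The round-robin policy with stalled threads skipped can be sketched as a small simulation. This is a toy model under stated assumptions, not a description of any real processor: the name `fine_grained_schedule` is hypothetical, the processor issues at most one instruction per cycle, and each thread is described only by the number of stall cycles that follow each of its instructions.

```python
def fine_grained_schedule(threads, cycles):
    """Simulate fine-grained multithreading on a single-issue processor.

    `threads[t][k]` is the number of stall cycles that follow instruction k of thread t.
    Returns a trace with one entry per cycle: the thread that issued, or None if idle.
    """
    next_instr = [0] * len(threads)     # index of each thread's next instruction
    ready_at = [0] * len(threads)       # first cycle at which each thread may issue again
    trace = []
    last = -1                           # thread that issued most recently
    for cycle in range(cycles):
        issued = None
        # Round-robin: start with the thread after the last issuer, skip stalled threads.
        for offset in range(1, len(threads) + 1):
            t = (last + offset) % len(threads)
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                stall = threads[t][next_instr[t]]
                next_instr[t] += 1
                ready_at[t] = cycle + 1 + stall
                issued = last = t
                break
        trace.append(issued)
    return trace


# Thread 0 stalls for two cycles after its second instruction; thread 1 never stalls.
print(fine_grained_schedule([[0, 2, 0]], 5))              # alone: idle cycles appear
print(fine_grained_schedule([[0, 2, 0], [0, 0, 0]], 6))   # together: the stall is hidden
```

In the two-thread run, no cycle is idle, but thread 0 finishes later than it would alone, which is the trade-off described above.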

Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading. Coarse-grained multithreading switches threads only on costly
stalls, such as last-level cache misses. This change relieves the need to have thread
switching be extremely fast and is much less likely to slow down the execution of an
individual thread, since instructions from other threads will only be issued when
a thread encounters a costly stall. Coarse-grained multithreading suffers, however,
from a major drawback: it is limited in its ability to overcome throughput losses,
especially from shorter stalls. This limitation arises from the pipeline start-up
costs of coarse-grained multithreading. Because a processor with coarse-grained
multithreading issues instructions from a single thread, when a stall occurs, the
pipeline must be emptied or frozen. The new thread that begins executing after
the stall must fill the pipeline before instructions will be able to complete. Due
to this start-up overhead, coarse-grained multithreading is much more useful for
reducing the penalty of high-cost stalls, where pipeline refill is negligible compared
to the stall time.
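The trade-off between stall time and pipeline refill can be made concrete with a deliberately simple model. Everything numeric here is hypothetical: the 10-cycle refill cost and the two stall lengths are assumed values for illustration, `cycles_lost` is an illustrative name, and switching is assumed to hide the stall completely apart from the refill.

```python
def cycles_lost(stall_cycles, refill_cycles, switch):
    """Issue cycles lost to one stall, with or without switching to another thread."""
    if switch:
        # Another thread runs during the stall, but it must first refill the pipeline.
        return refill_cycles
    return stall_cycles


REFILL = 10  # hypothetical pipeline start-up cost, in cycles

for name, stall in (("short stall", 3), ("last-level cache miss", 200)):
    stay = cycles_lost(stall, REFILL, switch=False)
    swap = cycles_lost(stall, REFILL, switch=True)
    better = "switch" if swap < stay else "do not switch"
    print(f"{name:22s}: lose {stay:3d} cycles waiting, {swap:2d} switching -> {better}")
```

Under these assumptions, switching on a short stall costs more than it saves, while on a long stall the refill is a small fraction of the time recovered.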

hardware multithreading: Increasing utilization of a processor by switching to
another thread when one thread is stalled.

thread: A thread includes the program counter, the register state, and the
stack. It is a lightweight process; whereas threads commonly share a single
address space, processes don’t.

process: A process includes one or more threads, the address space, and the
operating system state. Hence, a process switch usually invokes the operating
system, but not a thread switch.

fine-grained multithreading: A version of hardware multithreading that implies
switching between threads after every instruction.

coarse-grained multithreading: A version of hardware multithreading that implies
switching between threads only after significant events, such as a last-level
cache miss.

Simultaneous multithreading (SMT) is a variation on hardware multithreading
that uses the resources of a multiple-issue, dynamically scheduled pipelined
processor to exploit thread-level parallelism at the same time it exploits
instruction-level parallelism (see Chapter 4). The key insight that motivates SMT is that
multiple-issue processors often have more functional unit parallelism available
than most single threads can effectively use. Furthermore, with register renaming
and dynamic scheduling (see Chapter 4), multiple instructions from independent
threads can be issued without regard to the dependences among them; the resolution
of the dependences can be handled by the dynamic scheduling capability.

Since SMT relies on the existing dynamic mechanisms, it does not switch
resources every cycle. Instead, SMT is always executing instructions from multiple
threads, leaving it up to the hardware to associate instruction slots and renamed
registers with their proper threads.

simultaneous multithreading (SMT): A version of multithreading that lowers the
cost of multithreading by utilizing the resources needed for multiple issue,
dynamically scheduled microarchitecture.

FIGURE 6.5 How four threads use the issue slots of a superscalar processor in different
approaches. The four threads at the top show how each would execute running alone on a standard
superscalar processor without multithreading support. The three examples at the bottom show how they
would execute running together in three multithreading options. The horizontal dimension represents the
instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles.
An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of
gray and color correspond to four different threads in the multithreading processors. The additional pipeline
start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss
in throughput for coarse multithreading.

[Figure 6.5 diagram. Top panels: Thread A, Thread B, Thread C, Thread D. Bottom panels: Coarse MT, Fine MT, SMT. Horizontal axis: issue slots. Vertical axis: time.]

Figure 6.5 conceptually illustrates the differences in a processor's ability to exploit
superscalar resources for the following processor configurations. The top portion shows
how four threads would execute independently on a superscalar with no multithreading
support. The bottom portion shows how the four threads could be combined to execute
on the processor more efficiently using three multithreading options:

■ A superscalar with coarse-grained multithreading

■ A superscalar with fine-grained multithreading

■ A superscalar with simultaneous multithreading

In the superscalar without hardware multithreading support, the use of issue
slots is limited by a lack of instruction-level parallelism. In addition, a major stall,
such as an instruction cache miss, can leave the entire processor idle.

In the coarse-grained multithreaded superscalar, the long stalls are partially
hidden by switching to another thread that uses the resources of the processor.
Although this reduces the number of completely idle clock cycles, the pipeline
start-up overhead still leads to idle cycles, and limitations to ILP mean that not
all issue slots will be used. In the fine-grained case, the interleaving of threads mostly
eliminates idle clock cycles. Because only a single thread issues instructions in a
given clock cycle, however, limitations in instruction-level parallelism still lead to
idle slots within some clock cycles.

[Figure 6.6 plot. Vertical axis: i7 SMT performance and energy efficiency ratio, 0.75 to 2.00. Horizontal axis: PARSEC benchmarks Blackscholes, Bodytrack, Canneal, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, and x264. Series: Speedup, Energy efficiency.]

FIGURE 6.6 The speed-up from using multithreading on one core on an i7 processor
averages 1.31 for the PARSEC benchmarks (see Section 6.9) and the energy efficiency
improvement is 1.07. This data was collected and analyzed by Esmaeilzadeh et al. [2011].

In the SMT case, thread-level parallelism and instruction-level parallelism are
both exploited, with multiple threads using the issue slots in a single clock cycle.
Ideally, the issue slot usage is limited by imbalances in the resource needs and
resource availability over multiple threads. In practice, other factors can restrict
how many slots are used. Although Figure 6.5 greatly simplifies the real operation
of these processors, it does illustrate the potential performance advantages of
multithreading in general and SMT in particular.
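
The contrast between the fine-grained and SMT cases can be sketched with a toy issue-slot model. Everything in it is an assumption made for illustration: a 4-wide processor, four threads, six issue groups per thread, and the rule that a thread's next group cannot start until its current group has issued. The function names are hypothetical, and the model ignores stalls, so it is far simpler than a real pipeline.

```c
#define WIDTH    4   /* issue slots per clock cycle (assumed value) */
#define NTHREADS 4   /* hardware threads (assumed value) */
#define NGROUPS  6   /* issue groups per thread (assumed value) */

/* ilp[t][g]: independent instructions (1..WIDTH) in group g of thread t.
 * Group g+1 of a thread may not start issuing until group g has issued. */

/* Total instructions across all threads. */
int slots_issued(int ilp[NTHREADS][NGROUPS])
{
    int total = 0;
    for (int t = 0; t < NTHREADS; t++)
        for (int g = 0; g < NGROUPS; g++)
            total += ilp[t][g];
    return total;
}

/* Fine-grained MT: each cycle belongs to exactly one thread, which issues
 * one group; slots that group cannot fill stay empty. Every group costs a
 * full cycle, so the cycle count does not depend on the ILP values. */
int cycles_fine_grained(void)
{
    return NTHREADS * NGROUPS;
}

/* SMT: in each cycle, threads share the WIDTH slots. Each thread may issue
 * from at most one group per cycle; priority rotates every cycle. */
int cycles_smt(int ilp[NTHREADS][NGROUPS])
{
    int next[NTHREADS] = {0};  /* index of each thread's current group */
    int left[NTHREADS];        /* instructions not yet issued from it */
    int finished = 0, cycles = 0;

    for (int t = 0; t < NTHREADS; t++)
        left[t] = ilp[t][0];

    while (finished < NTHREADS) {
        int free_slots = WIDTH;
        for (int k = 0; k < NTHREADS && free_slots > 0; k++) {
            int t = (cycles + k) % NTHREADS;      /* rotate priority */
            if (next[t] == NGROUPS)
                continue;                         /* thread has finished */
            int take = left[t] < free_slots ? left[t] : free_slots;
            left[t] -= take;
            free_slots -= take;
            if (left[t] == 0) {                   /* group done: advance */
                next[t]++;
                if (next[t] == NGROUPS)
                    finished++;
                else
                    left[t] = ilp[t][next[t]];
            }
        }
        cycles++;
    }
    return cycles;
}
```

Slot utilization is slots_issued divided by WIDTH times the cycle count. With every group holding a single instruction, the fine-grained model needs 24 cycles and fills a quarter of the slots, while the SMT model needs 6 cycles and fills all of them; with every group already 4 wide, both need 24 cycles.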

Figure 6.6 plots the performance and energy benefits of multithreading on a
single processor of the Intel Core i7 960, which has hardware support for two
threads. The average speed-up is 1.31, which is not bad given the modest extra
resources for hardware multithreading. The average improvement in energy
efficiency is 1.07, which is excellent. In general, you’d be happy with a performance
speed-up being energy neutral.
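
One way to read the two averages together is to ask what they imply for power. The derivation below assumes that "energy efficiency" means the ratio of energy consumed without SMT to energy consumed with it, and treats the two averages as if they described a single run; both are interpretive assumptions rather than statements from the text. It uses only energy = average power × time.

```latex
\[
\frac{P_{\text{SMT}}}{P_{\text{base}}}
  = \frac{E_{\text{SMT}} / T_{\text{SMT}}}{E_{\text{base}} / T_{\text{base}}}
  = \frac{T_{\text{base}} / T_{\text{SMT}}}{E_{\text{base}} / E_{\text{SMT}}}
  = \frac{1.31}{1.07} \approx 1.22
\]
```

Under that reading, average power rises by roughly 22% with SMT, but the work finishes enough sooner that total energy still falls slightly.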

Now that we have seen how multiple threads can utilize the resources of a single
processor more effectively, we next show how to use them to exploit multiple
processors.

Check Yourself

1. True or false: Both multithreading and multicore rely on parallelism to get
more efficiency from a chip.

2. True or false: Simultaneous multithreading (SMT) uses threads to improve
resource utilization of a dynamically scheduled, out-of-order processor.

6.5 Multicore and Other Shared Memory Multiprocessors

While hardware multithreading improved the efficiency of processors at modest
cost, the big challenge of the last decade has been to deliver on the performance
potential of Moore’s Law by efficiently programming the increasing number of
processors per chip.

Given the difficulty of rewriting old programs to run well on parallel hardware,
a natural question is: what can computer designers do to simplify the task? One
answer was to provide a single physical address space that all processors can share,
so that programs need not concern themselves with where their data is, merely that
programs may be executed in parallel. In this approach, all variables of a program
can be made available at any time to any processor. The alternative is to have a
separate address space per processor that requires that sharing must be explicit;
we'll describe this option in Section 6.7. When the physical address space is
common, the hardware typically provides cache coherence to give a consistent
view of the shared memory (see Section 5.8).

As mentioned above, a shared memory multiprocessor (SMP) is one that offers
the programmer a single physical address space across all processors, which is
nearly always the case for multicore chips, although a more accurate term would
have been shared-address multiprocessor. Processors communicate through shared
variables in memory, with all processors capable of accessing any memory location
via loads and stores. Figure 6.7 shows the classic organization of an SMP. Note that
such systems can still run independent jobs in their own virtual address spaces,
even if they all share a physical address space.

Single address space multiprocessors come in two styles. In the first style, the
latency to a word in memory does not depend on which processor asks for it.
Such machines are called uniform memory access (UMA) multiprocessors. In the
second style, some memory accesses are much faster than others, depending on
which processor asks for which word, typically because main memory is divided
and attached to different microprocessors or to different memory controllers on
the same chip. Such machines are called nonuniform memory access (NUMA)
multiprocessors. As you might expect, the programming challenges are harder for
a NUMA multiprocessor than for a UMA multiprocessor, but NUMA machines
can scale to larger sizes and NUMAs can have lower latency to nearby memory.

As processors operating in parallel will normally share data, they also need to
coordinate when operating on shared data; otherwise, one processor could start
working on data before another is finished with it. This coordination is called
synchronization, which we saw in Chapter 2. When sharing is supported with a
single address space, there must be a separate mechanism for synchronization. One
approach uses a lock for a shared variable. Only one processor at a time can acquire
the lock, and other processors interested in shared data must wait until the original
processor unlocks the variable. Section 2.11 of Chapter 2 describes the instructions
for locking in the MIPS instruction set.
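
The lock discipline just described can be sketched in C. This is a minimal illustration, not the mechanism of Section 2.11: it assumes a C11 compiler and uses the standard atomic_flag test-and-set in place of instruction-level primitives, and the names SpinLock, spin_acquire, spin_release, and locked_count are hypothetical. The loop is marked with an OpenMP pragma so that it runs on several threads when OpenMP is enabled; combining C11 atomics with OpenMP works with common compilers but is assumed here rather than guaranteed. Without OpenMP the pragma is ignored and the loop runs serially.

```c
#include <stdatomic.h>

/* A test-and-set spin lock built on C11 atomic_flag (hypothetical helper). */
typedef struct {
    atomic_flag held;
} SpinLock;

static void spin_acquire(SpinLock *l)
{
    /* atomic_flag_test_and_set returns the previous value, so the loop
     * keeps spinning while some other processor already holds the lock. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* busy-wait */
}

static void spin_release(SpinLock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Adds 1 to a shared counter `iterations` times, taking the lock around
 * each update so that only one processor at a time touches the counter. */
long locked_count(long iterations)
{
    SpinLock lock = { ATOMIC_FLAG_INIT };
    long counter = 0;          /* the shared variable guarded by the lock */

    #pragma omp parallel for
    for (long i = 0; i < iterations; i++) {
        spin_acquire(&lock);
        counter += 1;          /* critical section */
        spin_release(&lock);
    }
    return counter;
}
```

Removing the acquire and release calls while keeping the parallel loop would let two processors read the same old value of counter and lose an update, which is the hazard the paragraph above describes.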

uniform memory access (UMA): A multiprocessor in which latency to any word in
main memory is about the same no matter which processor requests the access.

nonuniform memory access (NUMA): A type of single address space multiprocessor
in which some memory accesses are much faster than others depending on which
processor asks for which word.

synchronization: The process of coordinating the behavior of two or more
processes, which may be running on different processors.

lock: A synchronization device that allows access to data to only one processor
at a time.

FIGURE 6.7 Classic organization of a shared memory multiprocessor.

[Figure 6.7 diagram: several Processor blocks, each with its own Cache, connected through an Interconnection Network to Memory and I/O.]

EXAMPLE

A Simple Parallel Processing Program for a Shared Address Space

Suppose we want to sum 64,000 numbers on a shared memory multiprocessor
computer with uniform memory access time. Let's assume we have 64
processors.

ANSWER

The first step is to ensure a balanced load per processor, so we split the set
of numbers into subsets of the same size. We do not allocate the subsets to a
different memory space, since there is a single memory space for this machine;
we just give different starting addresses to each processor. Pn is the number that
identifies the processor, between 0 and 63. All processors start the program by
running a loop that sums their subset of numbers:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i]; /* sum the assigned areas */

(Note the C code i += 1 is just a shorter way to say i = i + 1.)

The next step is to add these 64 partial sums. This step is called a reduction,
where we divide to conquer. Half of the processors add pairs of partial sums, and
then a quarter add pairs of the new partial sums, and so on until we have the
single, final sum. Figure 6.8 illustrates the hierarchical nature of this reduction.

reduction: A function that processes a data structure and returns a single value.

FIGURE 6.8 The last four levels of a reduction that sums results from each processor, from bottom
to top. For all processors whose number i is less than half, add the sum produced by processor
number (i + half) to its sum.

[Figure 6.8 diagram: a tree with processors 0 to 7 at the bottom, then 0 to 3, then 0 and 1, then 0 at the top; the levels are labeled (half = 4), (half = 2), and (half = 1).]

In this example, the two processors must synchronize before the consumer
processor tries to read the result from the memory location written by the
producer processor; otherwise, the consumer may read the old value of the data.
We want each processor to have its own version of the loop counter variable i, so
we must indicate that it is a private variable. Here is the code (half is private also):

half = 64; /* 64 processors in multiprocessor */
do {
    synch(); /* wait for partial sum completion */
    if (half%2 != 0 && Pn == 0)
        sum[0] += sum[half-1]; /* conditional sum needed when half is odd;
                                  Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] += sum[Pn+half];
} while (half > 1); /* exit with final sum in sum[0] */
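
The two steps of the example can be checked on an ordinary single-processor machine with a serial emulation. The sketch below is an assumption-laden stand-in, not the parallel program itself: each "processor" Pn becomes one loop iteration, and finishing one level's loop before starting the next plays the role of synch(). The function names reduce_partial_sums and tree_sum are hypothetical, and a while loop replaces the do-while so that a single-processor input is also handled.

```c
#define P     64     /* processors, as in the example */
#define CHUNK 1000   /* numbers per processor: 64,000 / 64 */

/* Reduction step: combines nprocs partial sums in place, halving the number
 * of active "processors" at each level. The result ends up in sum[0]. */
double reduce_partial_sums(double sum[], int nprocs)
{
    int half = nprocs;
    while (half > 1) {
        /* On real hardware every processor would call synch() here. */
        if (half % 2 != 0)
            sum[0] += sum[half - 1];   /* odd count: processor 0 absorbs
                                          the element with no partner */
        half = half / 2;
        for (int Pn = 0; Pn < half; Pn++)   /* all Pn < half, one level */
            sum[Pn] += sum[Pn + half];
    }
    return sum[0];
}

/* Both steps of the example for an array of P * CHUNK numbers. */
double tree_sum(const double A[])
{
    double sum[P];

    /* Step 1: every processor sums its own CHUNK-element slice. */
    for (int Pn = 0; Pn < P; Pn++) {
        sum[Pn] = 0;
        for (int i = CHUNK * Pn; i < CHUNK * (Pn + 1); i++)
            sum[Pn] += A[i];
    }

    /* Step 2: tree reduction of the P partial sums. */
    return reduce_partial_sums(sum, P);
}
```

Within one level, the processors read only elements at index half or above and write only elements below half, so the emulated iterations do not interfere with one another; that separation is what lets the real processors run a level in parallel between barriers.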

Hardware/Software Interface

Given the long-term interest in parallel programming, there have been hundreds
of attempts to build parallel programming systems. A limited but popular example
is OpenMP. It is just an Application Programmer Interface (API) along with a set of
compiler directives, environment variables, and runtime library routines that can
extend standard programming languages. It offers a portable, scalable, and simple
programming model for shared memory multiprocessors. Its primary goal is to
parallelize loops and to perform reductions.

Most C compilers already have support for OpenMP. The command to use the
OpenMP API with the UNIX C compiler is just:

cc -fopenmp foo.c

OpenMP extends C using pragmas, which are just commands to the C macro
preprocessor like #define and #include. To set the number of processors we
want to use to be 64, as we wanted in the example above, we just use the command

#define P 64 /* define a constant that we’ll use a few times */
#pragma omp parallel num_threads(P)

That is, the runtime libraries should use 64 parallel threads.

To turn the sequential for loop into a parallel for loop that divides the work
equally between all the threads that we told it to use, we just write (assuming sum
is initialized to 0)

#pragma omp parallel for
for (Pn = 0; Pn < P; Pn += 1)
    for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
        sum[Pn] += A[i]; /* sum the assigned areas */

To perform the reduction, we can use another command that tells OpenMP what
the reduction operator is and what variable you need to use to place the result of
the reduction.

#pragma omp parallel for reduction(+ : FinalSum)
for (i = 0; i < P; i += 1)
    FinalSum += sum[i]; /* Reduce to a single number */

Note that it is now up to the OpenMP library to find efficient code to sum 64
numbers using 64 processors.

While OpenMP makes it easy to write simple parallel code, it is not very helpful
with debugging, so many parallel programmers use more sophisticated parallel
programming systems than OpenMP, just as many programmers today use more
productive languages than C.

OpenMP: An API for shared memory multiprocessing in C, C++, or Fortran that
runs on UNIX and Microsoft platforms. It includes compiler directives, a library,
and runtime directives.

Given this tour of classic MIMD hardware and software, our next path is a more
exotic tour of a type of MIMD architecture with a different heritage and thus a
very different perspective on the parallel programming challenge.

Check Yourself

True or false: Shared memory multiprocessors cannot take advantage of task-level
parallelism.

Elaboration: Some writers repurposed the acronym SMP to mean symmetric
multiprocessor, to indicate that the latency from processor to memory was about the
same for all processors. This shift was done to contrast them from large-scale NUMA
multiprocessors, as both classes used a single address space. As clusters proved much
more popular than large-scale NUMA multiprocessors, in this book we restore SMP to
its original meaning, and use it to contrast against machines that use multiple address
spaces, such as clusters.
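
The OpenMP fragments from this section can be assembled into one compilable function. This is a sketch under stated assumptions: the function name omp_sum and the constant N are hypothetical, the num_threads clause is attached directly to the combined parallel for directive, and the loop counters are declared inside the loops so that each thread gets its own copy (in C, an inner-loop counter declared outside the parallel region would otherwise be shared between threads). Compiled without OpenMP support, the pragmas are ignored and the function computes the same result serially.

```c
#define P 64       /* threads requested, as in the text */
#define N 64000    /* numbers to sum, as in the earlier example */

/* Sums the N elements of A: partial sums in parallel, then a reduction. */
double omp_sum(const double A[])
{
    double sum[P] = {0};   /* one partial sum per slice, initialized to 0 */
    double FinalSum = 0;

    /* Each iteration Pn handles one slice of N / P = 1000 elements. */
    #pragma omp parallel for num_threads(P)
    for (int Pn = 0; Pn < P; Pn += 1)
        for (int i = (N / P) * Pn; i < (N / P) * (Pn + 1); i += 1)
            sum[Pn] += A[i];

    /* reduction(+ : FinalSum) gives each thread a private accumulator and
     * adds the accumulators together when the loop ends. */
    #pragma omp parallel for reduction(+ : FinalSum)
    for (int i = 0; i < P; i += 1)
        FinalSum += sum[i];

    return FinalSum;
}
```

With floating-point data, the order in which the reduction combines the per-thread accumulators is up to the OpenMP runtime, so results can differ in the last bits from a strictly sequential sum.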
Elaboration: An alternative to sharing the physical address space would be to have separate physical address spaces but share a common virtual address space, leaving it up to the operating system to handle communication. This approach has been tried, but it has too high an overhead to offer a practical shared memory abstraction to the performance-oriented programmer.

6.6 Introduction to Graphics Processing Units

The original justification for adding SIMD instructions to existing architectures was that many microprocessors were connected to graphics displays in PCs and workstations, so an increasing fraction of processing time was used for graphics. As Moore’s Law increased the number of transistors available to microprocessors, it therefore made sense to improve graphics processing.

A major driving force for improving graphics processing was the computer game industry, both on PCs and in dedicated game consoles such as the Sony PlayStation. The rapidly growing game market encouraged many companies to make increasing investments in developing faster graphics hardware, and this positive feedback loop led graphics processing to improve at a faster rate than general-purpose processing in mainstream microprocessors.

Given that the graphics and game community had different goals than the microprocessor development community, it evolved its own style of processing and terminology. As the graphics processors increased in power, they earned the name Graphics Processing Units or GPUs to distinguish themselves from CPUs.

For a few hundred dollars, anyone can buy a GPU today with hundreds of parallel floating-point units, which makes high-performance computing more accessible. The interest in GPU computing blossomed when this potential was combined with a programming language that made GPUs easier to program.
Hence, many programmers of scientific and multimedia applications today are pondering whether to use GPUs or CPUs. (This section concentrates on using GPUs for computing. To see how GPU computing combines with the traditional role of graphics acceleration, see Appendix C.)

Here are some of the key characteristics as to how GPUs vary from CPUs:

■ GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU. This role allows them to dedicate all their resources to graphics. It's fine for GPUs to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the CPU can do them if needed.

■ The GPU problem sizes are typically hundreds of megabytes to gigabytes, but not hundreds of gigabytes to terabytes.

These differences led to different styles of architecture:

■ Perhaps the biggest difference is that GPUs do not rely on multilevel caches to overcome the long latency to memory, as do CPUs. Instead, GPUs rely on hardware multithreading (Section 6.4) to hide the latency to memory. That is, between the time of a memory request and the time that data arrives, the GPU executes hundreds or thousands of threads that are independent of that request.

■ The GPU memory is thus oriented toward bandwidth rather than latency. There are even special graphics DRAM chips for GPUs that are wider and have higher bandwidth than DRAM chips for CPUs. In addition, GPU memories have traditionally had smaller main memories than conventional microprocessors. In 2013, GPUs typically have 4 to 6 GiB or less, while CPUs have 32 to 256 GiB. Finally, keep in mind that for general-purpose computation, you must include the time to transfer the data between CPU memory and GPU memory, since the GPU is a coprocessor.

■ Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many parallel processors (MIMD) as well as many threads.
Hence, each GPU processor is more highly multithreaded than a typical CPU, plus they have more processors.

Hardware/Software Interface

Although GPUs were designed for a narrower set of applications, some programmers wondered if they could specify their applications in a form that would let them tap the high potential performance of GPUs. After tiring of trying to specify their problems using the graphics APIs and languages, they developed C-inspired programming languages to allow them to write programs directly for the GPUs. An example is NVIDIA's CUDA (Compute Unified Device Architecture), which enables the programmer to write C programs to execute on GPUs, albeit with some restrictions. Appendix C gives examples of CUDA code. (OpenCL is a multicompany initiative to develop a portable programming language that provides many of the benefits of CUDA.)

NVIDIA decided that the unifying theme of all these forms of parallelism is the CUDA Thread. Using this lowest level of parallelism as the programming primitive, the compiler and the hardware can gang thousands of CUDA Threads together to utilize the various styles of parallelism within a GPU: multithreading, MIMD, SIMD, and instruction-level parallelism. These threads are blocked together and executed in groups of 32 at a time. A multithreaded processor inside a GPU executes these blocks of threads, and a GPU consists of 8 to 32 of these multithreaded processors.

An Introduction to the NVIDIA GPU Architecture

We use NVIDIA systems as our example as they are representative of GPU architectures. Specifically, we follow the terminology of the CUDA parallel programming language and use the Fermi architecture as the example. Like vector architectures, GPUs work well only with data-level parallel problems.
Both styles have gather-scatter data transfers, and GPU processors have even more registers than do vector processors. Unlike most vector architectures, GPUs also rely on hardware multithreading within a single multithreaded SIMD processor to hide memory latency (see Section 6.4).

A multithreaded SIMD processor is similar to a Vector Processor, but the former has many parallel functional units instead of just a few that are deeply pipelined, as does the latter.

As mentioned above, a GPU contains a collection of multithreaded SIMD processors; that is, a GPU is a MIMD composed of multithreaded SIMD processors. For example, NVIDIA has four implementations of the Fermi architecture at different price points with 7, 11, 14, or 15 multithreaded SIMD processors. To provide transparent scalability across models of GPUs with differing number of multithreaded SIMD processors, the Thread Block Scheduler hardware assigns blocks of threads to multithreaded SIMD processors. Figure 6.9 shows a simplified block diagram of a multithreaded SIMD processor.

FIGURE 6.9 Simplified block diagram of the datapath of a multithreaded SIMD Processor. It has 16 SIMD lanes. The SIMD Thread Scheduler has many independent SIMD threads that it chooses from to run on this processor.

[Figure 6.9 diagram: an instruction register feeds 16 SIMD Lanes (Thread Processors), each with a 1K × 32 register file and a load-store unit; the load-store units connect through an address coalescing unit and an interconnection network to a 64 KiB Local Memory and to Global Memory.]

Dropping down one more level of detail, the machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions, which we will also call a SIMD thread. It is a traditional thread, but it contains exclusively SIMD instructions. These SIMD threads have their own program counters and they run on a multithreaded SIMD processor. The SIMD Thread Scheduler includes a controller that lets it know which threads of SIMD instructions are ready to run, and then it sends them off to a dispatch unit to be run on the multithreaded SIMD processor. It is identical to a hardware thread scheduler in a traditional multithreaded processor (see Section 6.4), except that it is scheduling threads of SIMD instructions. Thus, GPU hardware has two levels of hardware schedulers:

1. The Thread Block Scheduler that assigns blocks of threads to multithreaded SIMD processors, and

2. the SIMD Thread Scheduler within a SIMD processor, which schedules when SIMD threads should run.

The SIMD instructions of these threads are 32 wide, so each thread of SIMD instructions would compute 32 of the elements of the computation. Since the thread consists of SIMD instructions, the SIMD processor must have parallel functional units to perform the operation. We call them SIMD Lanes, and they are quite similar to the Vector Lanes in Section 6.3.

Elaboration: The number of lanes per SIMD processor varies across GPU generations. With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to complete. Each thread of SIMD instructions is executed in lock step. Staying with the analogy of a SIMD processor as a vector processor, you could say that it has 16 lanes, and the vector length would be 32. This wide but shallow nature is why we use the term SIMD processor instead of vector processor, as it is more intuitive.
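
The two-cycle figure follows directly from the widths quoted above, and a short software emulation shows the counting. The sketch below is illustrative only: the function name simd_add is hypothetical, the assignment of the first 16 elements to the first pass and the next 16 to the second is an assumption about ordering, and the inner loop merely stands in for 16 lanes operating in the same clock cycle. Each element index corresponds to one CUDA Thread.

```c
#define SIMD_WIDTH 32   /* elements per SIMD instruction (from the text) */
#define LANES      16   /* SIMD Lanes per Fermi SIMD processor (from the text) */

/* Emulates one 32-wide SIMD add, c = a + b, on 16 lanes.
 * Returns the number of clock cycles (passes over the lanes) it took. */
int simd_add(const float a[SIMD_WIDTH], const float b[SIMD_WIDTH],
             float c[SIMD_WIDTH])
{
    int cycles = 0;
    for (int base = 0; base < SIMD_WIDTH; base += LANES) {
        /* In hardware these LANES additions happen in the same cycle;
         * the inner loop only emulates that in software. */
        for (int lane = 0; lane < LANES; lane++)
            c[base + lane] = a[base + lane] + b[base + lane];
        cycles++;
    }
    return cycles;   /* SIMD_WIDTH / LANES = 2 for the values above */
}
```

In vector terms, SIMD_WIDTH plays the role of the vector length and LANES the number of vector lanes, which is the analogy the Elaboration draws.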
Since by definition the threads of SIMD instructions are independent, the SIMD Thread Scheduler can pick whatever thread of SIMD instructions is ready, and need not stick with the next SIMD instruction in the sequence within a single thread. Thus, using the terminology of Section 6.4, it uses fine-grained multithreading.

To hold these memory elements, a Fermi SIMD processor has an impressive 32,768 32-bit registers. Just like a vector processor, these registers are divided logically across the vector lanes or, in this case, SIMD Lanes. Each SIMD Thread is limited to no more than 64 registers, so you might think of a SIMD Thread as having up to 64 vector registers, with each vector register having 32 elements and each element being 32 bits wide. Since Fermi has 16 SIMD Lanes, each contains 2048 registers. Each CUDA Thread gets one element of each of the vector registers. Note that a CUDA thread is just a vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD Lane. Beware that CUDA Threads are very different from POSIX threads; you can't make arbitrary system calls or synchronize arbitrarily in a CUDA Thread.

NVIDIA GPU Memory Structures

Figure 6.10 shows the memory structures of an NVIDIA GPU. We call the on-chip memory that is local to each multithreaded SIMD processor Local Memory. It is shared by the SIMD Lanes within a multithreaded SIMD processor, but this memory is not shared between multithreaded SIMD processors. We call the off-chip DRAM shared by the whole GPU and all thread blocks GPU Memory.

Rather than rely on large caches to contain the whole working sets of an application, GPUs traditionally use smaller streaming caches and rely on extensive multithreading of threads of SIMD instructions to hide the long latency to DRAM, since their working sets can be hundreds of megabytes.
Thus, they will not fit in the last level cache of a multicore microprocessor. Given the use of hardware multithreading to hide DRAM latency, the chip area used for caches in system processors is spent instead on computing resources and on the large number of registers to hold the state of the many threads of SIMD instructions.

Elaboration: While hiding memory latency is the underlying philosophy, note that the latest GPUs and vector processors have added caches. For example, the recent Fermi architecture has added caches, but they are thought of as either bandwidth filters to reduce demands on GPU Memory or as accelerators for the few variables whose latency cannot be hidden by multithreading. Local memory for stack frames, function calls, and register spilling is a good match to caches, since latency matters when calling a function. Caches can also save energy, since on-chip cache accesses take much less energy than accesses to multiple, external DRAM chips.

FIGURE 6.10 GPU Memory structures. GPU Memory is shared by the vectorized loops. All threads of SIMD instructions within a thread block share Local Memory.

[Figure 6.10 diagram: each CUDA Thread has Per-CUDA Thread Private Memory; each Thread block has Per-Block Local Memory; a sequence of grids (Grid 0, Grid 1, ...) shares GPU Memory, with Inter-Grid Synchronization between grids.]

Putting GPUs into Perspective

At a high level, multicore computers with SIMD instruction extensions do share similarities with GPUs. Figure 6.11 summarizes the similarities and differences. Both are MIMDs whose processors use multiple SIMD lanes, although GPUs have more processors and many more lanes. Both use hardware multithreading to improve processor utilization, although GPUs have hardware support for many more threads. Both use caches, although GPUs use smaller streaming caches and multicore computers use large multilevel caches that try to contain whole working sets completely. Both use a 64-bit address space, although the physical main memory is much smaller in GPUs.
While GPUs support memory protection at the page level, they do not yet support demand paging.

Feature | Multicore with SIMD | GPU
SIMD processors | 4 to 8 | 8 to 16
SIMD lanes/processor | 2 to 4 | 8 to 16
Multithreading hardware support for SIMD threads | 2 to 4 | 16 to 32
Largest cache size | 8 MiB | 0.75 MiB
Size of memory address | 64-bit | 64-bit
Size of main memory | 8 GiB to 256 GiB | 4 GiB to 6 GiB
Memory protection at level of page | Yes | Yes
Demand paging | Yes | No
Cache coherent | Yes | No

FIGURE 6.11 Similarities and differences between multicore with Multimedia SIMD extensions and recent GPUs.

SIMD processors are also similar to vector processors. The multiple SIMD processors in GPUs act as independent MIMD cores, just as many vector computers have multiple vector processors. This view would consider the Fermi GTX 580 as a 16-core machine with hardware support for multithreading, where each core has 16 lanes. The biggest difference is multithreading, which is fundamental to GPUs and missing from most vector processors.

GPUs and CPUs do not go back in computer architecture genealogy to a common ancestor; there is no Missing Link that explains both. As a result of this uncommon heritage, GPUs have not used the terms common in the computer architecture community, which has led to confusion about what GPUs are and how they work. To help resolve the confusion, Figure 6.12 (from left to right) lists the more descriptive term used in this section, the closest term from mainstream computing, the official NVIDIA GPU term in case you are interested, and then a short description of the term. This “GPU Rosetta Stone” may help relate this section and ideas to more conventional GPU descriptions, such as those found in Appendix C.

While GPUs are moving toward mainstream computing, they can't abandon their responsibility to continue to excel at graphics. Thus, the design of GPUs may make more sense when architects ask, given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Having covered two different styles of MIMD that have a shared address space, we next introduce parallel processors where each processor has its own private address space, which makes it much easier to build much larger systems. The Internet services that you use every day depend on these large scale systems.

More descriptive name | Closest old term outside of GPUs | Official CUDA/NVIDIA GPU term | Book definition

Program abstractions
Vectorizable Loop | Vectorizable Loop | Grid | A vectorizable loop, executed on the GPU, made up of one or more Thread Blocks (bodies of vectorized loop) that can execute in parallel.
Body of Vectorized Loop | Body of a (Strip-Mined) Vectorized Loop | Thread Block | A vectorized loop executed on a multithreaded SIMD Processor, made up of one or more threads of SIMD instructions. They can communicate via Local Memory.
Sequence of SIMD Lane Operations | One iteration of a Scalar Loop | CUDA Thread | A vertical cut of a thread of SIMD instructions corresponding to one element executed by one SIMD Lane. Result is stored depending on mask and predicate register.

Machine object
A Thread of SIMD Instructions | Thread of Vector Instructions | Warp | A traditional thread, but it contains just SIMD instructions that are executed on a multithreaded SIMD Processor. Results stored depending on a per-element mask.
SIMD Instruction | Vector Instruction | PTX Instruction | A single SIMD instruction executed across SIMD Lanes.

Processing hardware
Multithreaded SIMD Processor | (Multithreaded) Vector Processor | Streaming Multiprocessor | A multithreaded SIMD Processor executes threads of SIMD instructions, independent of other SIMD Processors.
Thread Block Scheduler | Scalar Processor | Giga Thread Engine | Assigns multiple Thread Blocks (bodies of vectorized loop) to multithreaded SIMD Processors.
SIMD Thread Scheduler | Thread scheduler in a Multithreaded CPU | Warp Scheduler | Hardware unit that schedules and issues threads of SIMD instructions when they are ready to execute; includes a scoreboard to track SIMD Thread execution.
SIMD Lane | Vector lane | Thread Processor | A SIMD Lane executes the operations in a thread of SIMD instructions on a single element. Results stored depending on mask.

Memory hardware
GPU Memory | Main Memory | Global Memory | DRAM memory accessible by all multithreaded SIMD Processors in a GPU.
Local Memory | Local Memory | Shared Memory | Fast local SRAM for one multithreaded SIMD Processor, unavailable to other SIMD Processors.
SIMD Lane Registers | Vector Lane Registers | Thread Processor Registers | Registers in a single SIMD Lane allocated across a full thread block (body of vectorized loop).

FIGURE 6.12 Quick guide to GPU terms. We use the first column for hardware terms. Four groups cluster these 12 terms. From top to bottom: Program Abstractions, Machine Objects, Processing Hardware, and Memory Hardware.

Elaboration: While the GPU was introduced as having a separate memory from the CPU, both AMD and Intel have announced “fused” products that combine GPUs and CPUs to share a single memory. The challenge will be to maintain the high bandwidth memory in a fused architecture that has been a foundation of GPUs.

Check Yourself

True or false: GPUs rely on graphics DRAM chips to reduce memory latency and thereby increase performance on graphics applications.

6.7 Clusters, Warehouse Scale Computers, and Other Message-Passing Multiprocessors

The alternative approach to sharing an address space is for the processors to each have their own private physical address space. Figure 6.13 shows the classic organization of a multiprocessor with multiple private address spaces.
This alternative multiprocessor must communicate via explicit message passing, which is also the traditional name for this style of computer. Provided the system has routines to send and receive messages, coordination is built in with message passing, since one processor knows when a message is sent, and the receiving processor knows when a message arrives. If the sender needs confirmation that the message has arrived, the receiving processor can then send an acknowledgment message back to the sender.

message passing: Communicating between multiple processors by explicitly sending and receiving information.
send message routine: A routine used by a processor in machines with private memories to pass a message to another processor.
receive message routine: A routine used by a processor in machines with private memories to accept a message from another processor.

[Figure 6.13 diagram: several nodes, each with a processor, a cache, and a memory, joined by an interconnection network]
FIGURE 6.13 Classic organization of a multiprocessor with multiple private address spaces, traditionally called a message-passing multiprocessor. Note that unlike the SMP in Figure 6.7, the interconnection network is not between the caches and memory but is instead between processor-memory nodes.

There have been several attempts to build large-scale computers based on high-performance message-passing networks, and they do offer better absolute communication performance than clusters built using local area networks. Indeed, many supercomputers today use custom networks. The problem is that they are much more expensive than local area networks like Ethernet. Few applications today outside of high performance computing can justify the higher communication performance, given the much higher costs.
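The send, receive, and acknowledge pattern described at the start of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two threads stand in for two processor-memory nodes, `queue.Queue` objects stand in for the network links, and the names `receiver_node` and `send_and_wait_for_ack` are hypothetical rather than routines from any real message-passing library.

```python
import queue
import threading


def receiver_node(inbox, outbox, private_memory):
    """Receiving node: block until a message arrives, store it, then acknowledge."""
    message = inbox.get()                     # receive message routine (blocks)
    private_memory.append(message["data"])    # only this node touches its memory
    outbox.put({"ack_for": message["id"]})    # acknowledgment back to the sender


def send_and_wait_for_ack(data, message_id=1, timeout=2.0):
    """Sending node: send one message, then wait for the acknowledgment."""
    to_receiver = queue.Queue()   # models the link from sender to receiver
    to_sender = queue.Queue()     # models the link from receiver to sender
    receiver_memory = []          # stands in for the receiver's private memory

    node = threading.Thread(
        target=receiver_node,
        args=(to_receiver, to_sender, receiver_memory),
    )
    node.start()
    to_receiver.put({"id": message_id, "data": data})  # send message routine
    ack = to_sender.get(timeout=timeout)               # wait for confirmation
    node.join()
    return ack, receiver_memory


ack, memory = send_and_wait_for_ack("hello", message_id=7)
print(ack, memory)
```

All coordination happens through the two queues: the sender learns that the data arrived only because an explicit acknowledgment message comes back, not by reading the receiver's memory.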
Computers that rely on message passing for communication rather than cache coherent shared memory are much easier for hardware designers to build (see Section 5.8). There is an advantage for programmers as well, in that communication is explicit, which means there are fewer performance surprises than with the implicit communication in cache-coherent shared memory computers. The downside for programmers is that it’s harder to port a sequential program to a message-passing computer, since every communication must be identified in advance or the program doesn’t work. Cache-coherent shared memory allows the hardware to figure out what data needs to be communicated, which makes porting easier.

There are differences of opinion as to which is the shortest path to high performance, given the pros and cons of implicit communication, but there is no confusion in the marketplace today. Multicore microprocessors use shared physical memory and nodes of a cluster communicate with each other using message passing.

Some concurrent applications run well on parallel hardware, independent of whether it offers shared addresses or message passing. In particular, task-level parallelism and applications with little communication like Web search, mail servers, and file servers do not require shared addressing to run well. As a result, clusters have become the most widespread example today of the message-passing parallel computer.

Given the separate memories, each node of a cluster runs a distinct copy of the operating system. In contrast, the cores inside a microprocessor are connected using a high-speed network inside the chip, and a multichip shared-memory system uses the memory interconnect for communication. The memory interconnect has higher bandwidth and lower latency, allowing much better communication performance for shared memory multiprocessors.
Hardware/Software Interface
The weakness of separate memories for user memory from a parallel programming perspective turns into a strength in system dependability (see Section 5.5). Since a cluster consists of independent computers connected through a local area network, it is much easier to replace a computer without bringing down the system in a cluster than in a shared memory multiprocessor. Fundamentally, the shared address means that it is difficult to isolate a processor and replace it without heroic work by the operating system and in the physical design of the server. It is also easy for clusters to scale down gracefully when a server fails, thereby improving dependability. Since the cluster software is a layer that runs on top of the local operating systems running on each computer, it is much easier to disconnect and replace a broken computer.

Given that clusters are constructed from whole computers and independent, scalable networks, this isolation also makes it easier to expand the system without bringing down the application that runs on top of the cluster.

clusters: Collections of computers connected via I/O over standard network switches to form a message-passing multiprocessor.

Their lower cost, higher availability, and rapid, incremental expandability make clusters attractive to Internet service providers, despite their poorer communication performance when compared to large-scale shared memory multiprocessors. The search engines that hundreds of millions of us use every day depend upon this technology. Amazon, Facebook, Google, Microsoft, and others all have multiple datacenters, each with clusters of tens of thousands of servers. Clearly, the use of multiple processors in Internet service companies has been hugely successful.

Warehouse-Scale Computers

Internet services, such as those described above, necessitated the construction of new buildings to house, power, and cool 100,000 servers.
Although they may be classified as just large clusters, their architecture and operation are more sophisticated. They act as one giant computer and cost on the order of $150M for the building, the electrical and cooling infrastructure, the servers, and the networking equipment that connects and houses 50,000 to 100,000 servers. We consider them a new class of computer, called Warehouse-Scale Computers (WSC).

Anyone can build a fast CPU. The trick is to build a fast system.
Seymour Cray, considered the father of the supercomputer.

Hardware/Software Interface
The most popular framework for batch processing in a WSC is MapReduce [Dean, 2008] and its open-source twin Hadoop. Inspired by the Lisp functions of the same name, Map first applies a programmer-supplied function to each logical input record. Map runs on thousands of servers to produce an intermediate result of key-value pairs. Reduce collects the output of those distributed tasks and collapses them using another programmer-defined function. With appropriate software support, both are highly parallel yet easy to understand and to use. Within 30 minutes, a novice programmer can run a MapReduce task on thousands of servers.

For example, one MapReduce program calculates the number of occurrences of every English word in a large collection of documents. Below is a simplified version of that program, which shows just the inner loop and assumes just one occurrence of all English words found in a document:
    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1"); // Produce list of all words

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v); // get integer from key-value pair
        Emit(AsString(result));

The function EmitIntermediate used in the Map function emits each word in the document and the value one. Then the Reduce function sums all the values for each word, using ParseInt() to convert each value to an integer, to get the number of occurrences of that word in all documents. The MapReduce runtime environment schedules map tasks and reduce tasks to the servers of a WSC.

At this extreme scale, which requires innovation in power distribution, cooling, monitoring, and operations, the WSC is a modern descendant of the 1970s supercomputers, making Seymour Cray the godfather of today’s WSC architects. His extreme computers handled computations that could be done nowhere else, but were so expensive that only a few companies could afford them. This time the target is providing information technology for the world instead of high performance computing for scientists and engineers. Hence, WSCs surely play a more important societal role today than Cray’s supercomputers did in the past.

While they share some common goals with servers, WSCs have three major distinctions:

1. Ample, easy parallelism: A concern for a server architect is whether the applications in the targeted marketplace have enough parallelism to justify the amount of parallel hardware and whether the cost is too high for sufficient communication hardware to exploit this parallelism. A WSC architect has no such concern. First, batch applications like MapReduce benefit from the large number of independent data sets that need independent processing, such as billions of Web pages from a Web crawl.
Second, interactive Internet service applications, also known as Software as a Service (SaaS), can benefit from millions of independent users of interactive Internet services. Reads and writes are rarely dependent in SaaS, so SaaS rarely needs to synchronize. For example, search uses a read-only index and email is normally reading and writing independent information. We call this type of easy parallelism Request-Level Parallelism, as many independent efforts can proceed in parallel naturally with little need for communication or synchronization.

software as a service (SaaS): Rather than selling software that is installed and run on customers’ own computers, software is run at a remote site and made available over the Internet, typically via a Web interface to customers. SaaS customers are charged based on use versus on ownership.

2. Operational Costs Count: Traditionally, server architects design their systems for peak performance within a cost budget and worry about energy only to make sure they don’t exceed the cooling capacity of their enclosure. They usually ignore the operational costs of a server, assuming that they pale in comparison to purchase costs. WSCs have longer lifetimes (the building and the electrical and cooling infrastructure are often amortized over 10 or more years), so the operational costs add up: energy, power distribution, and cooling represent more than 30% of the costs of a WSC over 10 years.

3. Scale and the Opportunities/Problems Associated with Scale: To construct a single WSC, you must purchase 100,000 servers along with the supporting infrastructure, which means volume discounts. Hence, WSCs are so massive internally that you get economy of scale even if there are not many WSCs. These economies of scale led to cloud computing, as the lower per unit costs of a WSC meant that cloud companies could rent servers at a profitable rate and still be below what it costs outsiders to do it themselves.
The flip side of the economic opportunity of scale is the need to cope with the failure frequency of scale. Even if a server had a Mean Time To Failure of an amazing 25 years (200,000 hours), the WSC architect would need to design for 5 server failures every day. Section 5.15 mentioned that the annualized disk failure rate (AFR) measured at Google was 2% to 4%. If there were 4 disks per server and their annual failure rate was 2%, the WSC architect should expect to see one disk fail every hour. Thus, fault tolerance is even more important for the WSC architect than the server architect.

The economies of scale uncovered by WSC have realized the long-dreamed-of goal of computing as a utility. Cloud computing means anyone anywhere with good ideas, a business model, and a credit card can tap thousands of servers to deliver their vision almost instantly around the world. Of course, there are important obstacles that could limit the growth of cloud computing (such as security, privacy, standards, and the rate of growth of Internet bandwidth), but we foresee them being addressed so that WSCs and cloud computing can flourish. To put the growth rate of cloud computing into perspective, in 2012 Amazon Web Services announced that it adds enough new server capacity every day to support all of Amazon’s global infrastructure as of 2003, when Amazon was a $5.2Bn annual revenue enterprise with 6000 employees.

Now that we understand the importance of message-passing multiprocessors, especially for cloud computing, we next cover ways to connect the nodes of a WSC together. Thanks to Moore’s Law and the increasing number of cores per chip, we now need networks inside a chip as well, so these topologies are important in the small as well as in the large.

Elaboration: The MapReduce framework shuffles and sorts the key-value pairs at the end of the Map phase to produce groups that all share the same key. These groups are then passed to the Reduce phase.
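The word-count pseudocode earlier in this section, together with the shuffle step described in the Elaboration above, can be imitated on a single machine. The following Python sketch is not the MapReduce or Hadoop API; the names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical, and everything runs sequentially in one process, whereas a real framework would spread the map and reduce calls across many servers.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit (word, "1") for each word, mirroring EmitIntermediate(w, "1")."""
    for word in contents.split():
        yield word, "1"


def shuffle(pairs):
    """Group intermediate values by key and sort the keys, as the framework
    does between the Map and Reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(word, values):
    """Sum the counts for one word, mirroring the ParseInt loop and Emit."""
    return str(sum(int(v) for v in values))


def word_count(documents):
    """Run map over every document, shuffle, then reduce each group."""
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    return {word: reduce_fn(word, values)
            for word, values in shuffle(intermediate).items()}


counts = word_count({"a.txt": "the cat sat", "b.txt": "the dog sat down"})
print(counts)
```

Because each call to `map_fn` depends only on its own document and each call to `reduce_fn` depends only on its own group, both phases could be distributed without any synchronization beyond the shuffle.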
Elaboration: Another form of large-scale computing is grid computing, where the computers are spread across large areas, and then the programs that run across them must communicate via long haul networks. The most popular and unique form of grid computing was pioneered by the SETI@home project. As millions of PCs are idle at any one time doing nothing useful, they could be harvested and put to good uses if someone developed software that could run on those computers and then gave each PC an independent piece of the problem to work on. The first example was the Search for ExtraTerrestrial Intelligence (SETI), which was launched at UC Berkeley in 1999. Over 5 million computer users in more than 200 countries have signed up for SETI@home, with more than 50% outside the US. By the end of 2011, the average performance of the SETI@home grid was 3.5 PetaFLOPS.

Check Yourself
1. True or false: Like SMPs, message-passing computers rely on locks for synchronization.
2. True or false: Clusters have separate memories and thus need many copies of the operating system.

6.8 Introduction to Multiprocessor Network Topologies

Multicore chips require on-chip networks to connect cores together, and clusters require local area networks to connect servers together. This section reviews the pros and cons of different interconnection network topologies.

Network costs include the number of switches, the number of links on a switch to connect to the network, the width (number of bits) per link, and length of the links when the network is mapped into silicon. For example, some cores or servers may be adjacent and others may be on the other side of the chip or the other side of the datacenter. Network performance is multifaceted as well.
It includes the latency on an unloaded network to send and receive a message, the throughput in terms of the maximum number of messages that can be transmitted in a given time period, delays caused by contention for a portion of the network, and variable performance depending on the pattern of communication. Another obligation of the network may be fault tolerance, since systems may be required to operate in the presence of broken components. Finally, in this era of energy-limited systems, the energy efficiency of different organizations may trump other concerns.

Networks are normally drawn as graphs, with each edge of the graph representing a link of the communication network. In the figures in this section, the processor-memory node is shown as a black square and the switch is shown as a colored circle. We assume here that all links are bidirectional; that is, information can flow in either direction. All networks consist of switches whose links go to processor-memory nodes and to other switches. The first network connects a sequence of nodes together:

[ring network diagram]

This topology is called a ring. Since some nodes are not directly connected, some messages will have to hop along intermediate nodes until they arrive at the final destination. Unlike a bus (a shared set of wires that allows broadcasting to all connected devices), a ring is capable of many simultaneous transfers.

Because there are numerous topologies to choose from, performance metrics are needed to distinguish these designs. Two are popular. The first is total network bandwidth, which is the bandwidth of each link multiplied by the number of links. This represents the peak bandwidth. For the ring network above, with P processors, the total network bandwidth would be P times the bandwidth of one link; the total network bandwidth of a bus is just the bandwidth of that bus.
To balance this best bandwidth case, we include another metric that is closer to the worst case: the bisection bandwidth. This metric is calculated by dividing the machine into two halves. Then you sum the bandwidth of the links that cross that imaginary dividing line. The bisection bandwidth of a ring is two times the link bandwidth. It is one times the link bandwidth for the bus. If a single link is as fast as the bus, the ring is only twice as fast as a bus in the worst case, but it is P times faster in the best case.

Since some network topologies are not symmetric, the question arises of where to draw the imaginary line when bisecting the machine. Bisection bandwidth is a worst-case metric, so the answer is to choose the division that yields the most pessimistic network performance. Stated alternatively, calculate all possible bisection bandwidths and pick the smallest. We take this pessimistic view because parallel programs are often limited by the weakest link in the communication chain.

At the other extreme from a ring is a fully connected network, where every processor has a bidirectional link to every other processor. For fully connected networks, the total network bandwidth is $P \times (P - 1)/2$ times the link bandwidth, and the bisection bandwidth is $(P/2)^2$ times the link bandwidth.

The tremendous improvement in performance of fully connected networks is offset by the tremendous increase in cost. This consequence inspires engineers to invent new topologies that are between the cost of rings and the performance of fully connected networks. The evaluation of success depends in large part on the nature of the communication in the workload of parallel programs run on the computer.

The number of different topologies that have been discussed in publications would be difficult to count, but only a few have been used in commercial parallel processors. Figure 6.14 illustrates two of the popular topologies.
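The two metrics above are easy to tabulate for the bus, the ring, and the fully connected network. The Python sketch below simply encodes the formulas from the text; the function names are hypothetical, link bandwidth is normalized to 1 by default, and P is assumed to be even so that the machine splits into two equal halves.

```python
def bus(p, link_bw=1):
    """Bus: one shared set of wires, so both metrics equal the bus bandwidth."""
    return link_bw, link_bw


def ring(p, link_bw=1):
    """Ring of p nodes: p links in total, and any bisection cuts two links."""
    return p * link_bw, 2 * link_bw


def fully_connected(p, link_bw=1):
    """Every pair of nodes has its own link: p*(p-1)/2 links in total, and
    (p/2)^2 links cross between the two halves (p assumed even)."""
    total = p * (p - 1) // 2 * link_bw
    bisection = (p // 2) ** 2 * link_bw
    return total, bisection


for name, topology in [("bus", bus), ("ring", ring),
                       ("fully connected", fully_connected)]:
    total, bisection = topology(64)
    print(f"{name:16s} total={total:5d} bisection={bisection:5d}")
```

For 64 nodes the fully connected network needs 2016 links to reach a bisection bandwidth of 1024 link bandwidths, which illustrates the cost that motivates the intermediate topologies discussed next.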
An alternative to placing a processor at every node in a network is to leave only the switch at some of these nodes. The switches are smaller than processor-memory-switch nodes, and thus may be packed more densely, thereby lessening distance and increasing performance. Such networks are frequently called multistage networks to reflect the multiple steps that a message may travel. Types of multistage networks are as numerous as single-stage networks; Figure 6.15 illustrates two of the popular multistage organizations. A fully connected or crossbar network allows any node to communicate with any other node in one pass through the network. An Omega network uses less hardware than the crossbar network ($2n \log_2 n$ versus $n^2$ switches), but contention can occur between messages, depending on the pattern of communication. For example, the Omega network in Figure 6.15 cannot send a message from P0 to P6 at the same time that it sends a message from P1 to P4.

network bandwidth: Informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.
bisection bandwidth: The bandwidth between two equal parts of a multiprocessor. This measure is for a worst case split of the multiprocessor.
fully connected network: A network that connects processor-memory nodes by supplying a dedicated communication link between every node.
multistage network: A network that supplies a small switch at each node.
crossbar network: A network that allows any node to communicate with any other node in one pass through the network.

Implementing Network Topologies

This simple analysis of all the networks in this section ignores important practical considerations in the construction of a network.
The distance of each link affects the cost of communicating at a high clock rate; generally, the longer the distance, the more expensive it is to run at a high clock rate. Shorter distances also make it easier to assign more wires to the link, as the power to drive many wires is less if the wires are short. Shorter wires are also cheaper than longer wires. Another practical limitation is that the three-dimensional drawings must be mapped onto chips that are essentially two-dimensional media. The final concern is energy. Energy concerns may force multicore chips to rely on simple grid topologies, for example. The bottom line is that topologies that appear elegant when sketched on the blackboard may be impractical when constructed in silicon or in a datacenter.

Now that we understand the importance of clusters and have seen topologies that we can follow to connect them together, we next look at the hardware and software of the interface of the network to the processor.

Check Yourself
True or false: For a ring with P nodes, the ratio of the total network bandwidth to the bisection bandwidth is P/2.

[Figure 6.14 panels: a. 2-D grid or mesh of 16 nodes; b. n-cube tree of 8 nodes ($8 = 2^3$ so $n = 3$)]
FIGURE 6.14 Network topologies that have appeared in commercial parallel processors. The colored circles represent switches and the black squares represent processor-memory nodes. Even though a switch has many links, generally only one goes to the processor. The Boolean n-cube topology is an n-dimensional interconnect with $2^n$ nodes, requiring n links per switch (plus one for the processor) and thus n nearest-neighbor nodes. Frequently, these basic topologies have been supplemented with extra arcs to improve performance and reliability.

6.9 Communicating to the Outside World: Cluster Networking

This online section describes the networking hardware and software used to connect the nodes of a cluster together.
The example is 10 gigabit/second Ethernet connected to the computer using Peripheral Component Interconnect Express (PCIe). It shows both software and hardware optimizations to improve network performance, including zero copy messaging, user space communication, using polling instead of I/O interrupts, and hardware calculation of checksums. While the example is networking, the techniques in this section apply to storage controllers and other I/O devices as well.

After covering the performance of networks at a low level of detail in this online section, the next section shows how to benchmark multiprocessors of all kinds with much higher-level programs.

[Figure 6.15 panels: a. Crossbar connecting P0 through P7; b. Omega network connecting P0 through P7; c. Omega network switch box with inputs A and B and outputs C and D]
FIGURE 6.15 Popular multistage network topologies for eight nodes. The switches in these drawings are simpler than in earlier drawings because the links are unidirectional; data comes in at the left and exits out the right link. The switch box in c can pass A to C and B to D or B to C and A to D. The crossbar uses $n^2$ switches, where n is the number of processors, while the Omega network uses $(n/2) \log_2 n$ of the large switch boxes, each of which is logically composed of four of the smaller switches. In this case, the crossbar uses 64 switches versus 12 switch boxes, or 48 switches, in the Omega network. The crossbar, however, can support any combination of messages between processors, while the Omega network cannot.

6.10 Multiprocessor Benchmarks and Performance Models

As we saw in Chapter 1, benchmarking systems is always a sensitive topic, because it is a highly visible way to try to determine which system is better. The results affect not only the sales of commercial systems, but also the reputation of the designers of those systems.
Hence, all participants want to win the competition, but they also want to be sure that if someone else wins, they deserve to win because they have a genuinely better system. This desire leads to rules to ensure that the benchmark results are not simply engineering tricks for that benchmark, but are instead advances that improve performance of real applications.

To avoid possible tricks, a typical rule is that you can’t change the benchmark. The source code and data sets are fixed, and there is a single proper answer. Any deviation from those rules makes the results invalid.

Many multiprocessor benchmarks follow these traditions. A common exception is to be able to increase the size of the problem so that you can run the benchmark on systems with a widely different number of processors. That is, many benchmarks allow weak scaling rather than require strong scaling, even though you must take care when comparing results for programs running different problem sizes.

Figure 6.16 gives a summary of several parallel benchmarks, also described below:

■ Linpack is a collection of linear algebra routines, and the routines for performing Gaussian elimination constitute what is known as the Linpack benchmark. The DGEMM routine in the example on page 215 represents a small fraction of the source code of the Linpack benchmark, but it accounts for most of the execution time for the benchmark. It allows weak scaling, letting the user pick any size problem. Moreover, it allows the user to rewrite Linpack in almost any form and in any language, as long as it computes the proper result and performs the same number of floating point operations for a given problem size. Twice a year, the 500 computers with the fastest Linpack performance are published at www.top500.org. The first on this list is considered by the press to be the world’s fastest computer.

■ SPECrate is a throughput metric based on the SPEC CPU benchmarks, such as SPEC CPU 2006 (see Chapter 1).
Rather than report performance of the individual programs, SPECrate runs many copies of the program simultaneously. Thus, it measures task-level parallelism, as there is no communication between the tasks. You can run as many copies of the programs as you want, so this is again a form of weak scaling.

■ SPLASH and SPLASH 2 (Stanford Parallel Applications for Shared Memory) were efforts by researchers at Stanford University in the 1990s to put together a parallel benchmark suite similar in goals to the SPEC CPU benchmark suite. It includes both kernels and applications, including many from the high-performance computing community. This benchmark requires strong scaling, although it comes with two data sets.

■ The NAS (NASA Advanced Supercomputing) parallel benchmarks were another attempt from the 1990s to benchmark multiprocessors. Taken from computational fluid dynamics, they consist of five kernels. They allow weak scaling by defining a few data sets. Like Linpack, these benchmarks can be rewritten, but the rules require that the programming language can only be C or Fortran.

■ The recent PARSEC (Princeton Application Repository for Shared Memory Computers) benchmark suite consists of multithreaded programs that use Pthreads (POSIX threads) and OpenMP (Open MultiProcessing; see Section 6.5). They focus on emerging computational domains and consist of nine applications and three kernels. Eight rely on data parallelism, three rely on pipelined parallelism, and one on unstructured parallelism.

■ On the cloud front, the goal of the Yahoo! Cloud Serving Benchmark (YCSB) is to compare performance of cloud data services. It offers a framework that makes it easy for a client to benchmark new data services, using Cassandra and HBase as representative examples. [Cooper, 2010]

Benchmark | Scaling? | Reprogram? | Description
Linpack | Weak | Yes | Dense matrix linear algebra [Dongarra, 1979]
SPECrate | Weak | No | Independent job parallelism [Henning, 2007]
Stanford Parallel Applications for Shared Memory SPLASH 2 [Woo et al., 1995] | Strong (although offers two problem sizes) | No | Complex 1D FFT; Blocked LU Decomposition; Blocked Sparse Cholesky Factorization; Integer Radix Sort; Barnes-Hut; Adaptive Fast Multipole; Ocean Simulation; Hierarchical Radiosity; Ray Tracer; Volume Renderer; Water Simulation with Spatial Data Structure; Water Simulation without Spatial Data Structure
NAS Parallel Benchmarks [Bailey et al., 1991] | Weak | Yes (C or Fortran only) | EP: embarrassingly parallel; MG: simplified multigrid; CG: unstructured grid for a conjugate gradient method; FT: 3-D partial differential equation solution using FFTs; IS: large integer sort
PARSEC Benchmark Suite [Bienia et al., 2008] | Weak | No | Blackscholes: option pricing with Black-Scholes PDE; Bodytrack: body tracking of a person; Canneal: simulated cache-aware annealing to optimize routing; Dedup: next-generation compression with data deduplication; Facesim: simulates the motions of a human face; Ferret: content similarity search server; Fluidanimate: fluid dynamics for animation with SPH method; Freqmine: frequent itemset mining; Streamcluster: online clustering of an input stream; Swaptions: pricing of a portfolio of swaptions; Vips: image processing; x264: H.264 video encoding
Berkeley Design Patterns [Asanovic et al., 2006] | Strong or Weak | Yes | Finite-State Machine; Combinational Logic; Graph Traversal; Structured Grid; Dense Matrix; Sparse Matrix; Spectral Methods (FFT); Dynamic Programming; N-Body; MapReduce; Backtrack/Branch and Bound; Graphical Model Inference; Unstructured Grid

FIGURE 6.16 Examples of parallel benchmarks.

The downside of such traditional restrictions to benchmarks is that innovation is chiefly limited to the architecture and compiler. Better data structures, algorithms, programming languages, and so on often cannot be used, since that would give a misleading result.
The system could win because of, say, the algorithm, and not because of the hardware or the compiler. While these guidelines are understandable when the foundations of computing are relatively stable, as they were in the 1990s and the first half of this decade, they are undesirable during a programming revolution. For this revolution to succeed, we need to encourage innovation at all levels.

Researchers at the University of California at Berkeley have advocated one approach. They identified 13 design patterns that they claim will be part of applications of the future. Frameworks or kernels implement these design patterns. Examples are sparse matrices, structured grids, finite-state machines, MapReduce, and graph traversal. By keeping the definitions at a high level, they hope to encourage innovations at any level of the system. Thus, the system with the fastest sparse matrix solver is welcome to use any data structure, algorithm, and programming language, in addition to novel architectures and compilers.

Pthreads: A UNIX API for creating and manipulating threads. It is structured as a library.

Performance Models

A topic related to benchmarks is performance models. As we have seen with the increasing architectural diversity in this chapter (multithreading, SIMD, GPUs), it would be especially helpful if we had a simple model that offered insights into the performance of different architectures. It need not be perfect, just insightful.

The 3Cs for cache performance from Chapter 5 is an example performance model. It is not a perfect performance model, since it ignores potentially important factors like block size, block allocation policy, and block replacement policy. Moreover, it has quirks. For example, a miss can be ascribed to capacity in one design and to conflict in another cache of the same size.
Yet the 3Cs model has been popular for 25 years, because it offers insight into the behavior of programs, helping both architects and programmers improve their creations based on insights from that model.

To find such a model for parallel computers, let's start with small kernels, like those from the 13 Berkeley design patterns in Figure 6.16. While there are versions with different data types for these kernels, floating point is popular in several implementations. Hence, peak floating-point performance is a limit on the speed of such kernels on a given computer. For multicore chips, peak floating-point performance is the collective peak performance of all the cores on the chip. If there were multiple microprocessors in the system, you would multiply the peak per chip by the total number of chips.

The demands on the memory system can be estimated by dividing this peak floating-point performance by the average number of floating-point operations per byte accessed:

$$\frac{\text{Floating-Point Operations/Sec}}{\text{Floating-Point Operations/Byte}} = \text{Bytes/Sec}$$

The ratio of floating-point operations per byte of memory accessed is called the arithmetic intensity. It can be calculated by taking the total number of floating-point operations for a program divided by the total number of data bytes transferred to main memory during program execution. Figure 6.17 shows the arithmetic intensity of several of the Berkeley design patterns from Figure 6.16.

arithmetic intensity: The ratio of floating-point operations in a program to the number of data bytes accessed by a program from main memory.
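The two definitions above can be written down directly. The following is a minimal sketch in C, not code from the text: the function names `arithmetic_intensity` and `required_bandwidth` are hypothetical, and returning 0.0 when no bytes are transferred is an assumed convention chosen only to avoid dividing by zero.

```c
/* Arithmetic intensity: total floating-point operations divided by the
 * total data bytes transferred to main memory during execution.
 * Returns 0.0 when no bytes were transferred (assumed convention). */
double arithmetic_intensity(double total_flops, double total_dram_bytes)
{
    if (total_dram_bytes <= 0.0)
        return 0.0;
    return total_flops / total_dram_bytes;
}

/* Memory demand in bytes/sec: a floating-point rate (FLOPs/sec) divided
 * by an arithmetic intensity (FLOPs/byte). The intensity must be positive. */
double required_bandwidth(double flops_per_sec, double intensity)
{
    return flops_per_sec / intensity;
}
```

For instance, under these definitions a kernel with an intensity of 0.5 FLOPs/byte would need `required_bandwidth(16e9, 0.5)`, that is 32 GBytes/sec, to sustain 16 GFLOPs/sec.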
[Figure 6.17: an arithmetic intensity axis running from O(1) through O(log(N)) to O(N), with these kernels placed along it: Sparse Matrix (SpMV), Structured Grids (Stencils, PDEs), Structured Grids (Lattice Methods), Spectral Methods (FFTs), Dense Matrix (BLAS3), N-body (Particle Methods).]

FIGURE 6.17 Arithmetic intensity, specified as the number of floating-point operations to run the program divided by the number of bytes accessed in main memory [Williams, Waterman, and Patterson 2009]. Some kernels have an arithmetic intensity that scales with problem size, such as Dense Matrix, but there are many kernels with arithmetic intensities independent of problem size. For kernels in this former case, weak scaling can lead to different results, since it puts much less demand on the memory system.

The Roofline Model

This simple model ties floating-point performance, arithmetic intensity, and memory performance together in a two-dimensional graph [Williams, Waterman, and Patterson 2009]. Peak floating-point performance can be found using the hardware specifications mentioned above. The working sets of the kernels we consider here do not fit in on-chip caches, so peak memory performance may be defined by the memory system behind the caches. One way to find the peak memory performance is the Stream benchmark. (See the Elaboration on page 381 in Chapter 5.)

Figure 6.18 shows the model, which is done once for a computer, not for each kernel. The vertical Y-axis is achievable floating-point performance from 0.5 to 64.0 GFLOPs/second. The horizontal X-axis is arithmetic intensity, varying from 1/8 FLOPs/DRAM byte accessed to 16 FLOPs/DRAM byte accessed. Note that the graph is a log-log scale.

For a given kernel, we can find a point on the X-axis based on its arithmetic intensity. If we draw a vertical line through that point, the performance of the kernel on that computer must lie somewhere along that line.
We can plot a horizontal line showing peak floating-point performance of the computer. Obviously, the actual floating-point performance can be no higher than the horizontal line, since that is a hardware limit.

[Figure 6.18 plot: log-log axes, X-axis "Arithmetic Intensity: FLOPs/Byte Ratio" from 1/8 to 16, Y-axis "Attainable GFLOPs/second" from 0.5 to 64.0; a horizontal line labeled "peak floating-point performance", a diagonal line labeled "peak memory BW (stream)", and vertical lines for Kernel 1 (Memory Bandwidth limited) and Kernel 2 (Computation limited).]

FIGURE 6.18 Roofline Model [Williams, Waterman, and Patterson 2009]. This example has a peak floating-point performance of 16 GFLOPS/sec and a peak memory bandwidth of 16 GB/sec from the Stream benchmark. (Since Stream is actually four measurements, this line is the average of the four.) The dotted vertical line in color on the left represents Kernel 1, which has an arithmetic intensity of 0.5 FLOPs/byte. It is limited by memory bandwidth to no more than 8 GFLOPS/sec on this Opteron X2. The dotted vertical line to the right represents Kernel 2, which has an arithmetic intensity of 4 FLOPs/byte. It is limited only computationally to 16 GFLOPS/sec. (This data is based on the AMD Opteron X2 (Revision F) using dual cores running at 2 GHz in a dual socket system.)

How could we plot the peak memory performance, which is measured in bytes/second? Since the X-axis is FLOPs/byte and the Y-axis FLOPs/second, bytes/second is just a diagonal line at a 45-degree angle in this figure. Hence, we can plot a third line that gives the maximum floating-point performance that the memory system of that computer can support for a given arithmetic intensity. We can express the limits as a formula to plot the line in the graph in Figure 6.18:

$$\text{Attainable GFLOPs/sec} = \min(\text{Peak Memory BW} \times \text{Arithmetic Intensity},\ \text{Peak Floating-Point Performance})$$

The horizontal and diagonal lines give this simple model its name and indicate its value.
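The formula is short enough to sketch as code. The following C fragment is an illustration rather than anything from the text; the names `roofline_gflops` and `is_memory_bound` are hypothetical, and the units (GBytes/sec, GFLOPs/sec, FLOPs/byte) are assumed to match those of Figure 6.18.

```c
/* Attainable GFLOPs/sec under the roofline model:
 *   min(peak memory BW * arithmetic intensity, peak FP performance)
 * peak_bw_gbytes: peak memory bandwidth in GBytes/sec (e.g., from Stream)
 * peak_gflops:    peak floating-point performance in GFLOPs/sec
 * intensity:      arithmetic intensity in FLOPs/byte */
double roofline_gflops(double peak_bw_gbytes, double peak_gflops,
                       double intensity)
{
    double memory_bound = peak_bw_gbytes * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Returns 1 when the kernel sits under the slanted (memory) part of the
 * roof, 0 when it sits under the flat (computational) part. */
int is_memory_bound(double peak_bw_gbytes, double peak_gflops,
                    double intensity)
{
    return peak_bw_gbytes * intensity < peak_gflops;
}
```

With the Figure 6.18 values (16 GB/sec and 16 GFLOPS/sec), `roofline_gflops(16.0, 16.0, 0.5)` gives 8 for Kernel 1 and `roofline_gflops(16.0, 16.0, 4.0)` gives 16 for Kernel 2, matching the limits stated in the caption.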
The roofline sets an upper bound on performance of a kernel depending on its arithmetic intensity. Given a roofline of a computer, you can apply it repeatedly, since it doesn't vary by kernel. If we think of arithmetic intensity as a pole that hits the roof, either it hits the slanted part of the roof, which means performance is ultimately limited by memory bandwidth, or it hits the flat part of the roof, which means performance is computationally limited. In Figure 6.18, kernel 1 is an example of the former, and kernel 2 is an example of the latter.

Note that the ridge point, where the diagonal and horizontal roofs meet, offers an interesting insight into the computer. If it is far to the right, then only kernels with very high arithmetic intensity can achieve the maximum performance of that computer. If it is far to the left, then almost any kernel can potentially hit the maximum performance.

Comparing Two Generations of Opterons

The AMD Opteron X4 (Barcelona) with four cores is the successor to the Opteron X2 with two cores. To simplify board design, they use the same socket. Hence, they have the same DRAM channels and thus the same peak memory bandwidth. In addition to doubling the number of cores, the Opteron X4 also has twice the peak floating-point performance per core: Opteron X4 cores can issue two floating-point SSE2 instructions per clock cycle, while Opteron X2 cores issue at most one. As the two systems we're comparing have similar clock rates (2.2 GHz for Opteron X2 versus 2.3 GHz for Opteron X4), the Opteron X4 has about four times the peak floating-point performance of the Opteron X2 with the same DRAM bandwidth. The Opteron X4 also has a 2 MiB L3 cache, which is not found in the Opteron X2.

In Figure 6.19 the roofline models for both systems are compared. As we would expect, the ridge point moves to the right, from 1 in the Opteron X2 to 5 in the Opteron X4.
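Because the ridge point is where the diagonal and horizontal roofs meet, it can be computed by setting the two terms of the roofline formula equal. The sketch below assumes that reading; the function names `ridge_point` and `scaled_ridge_point` are hypothetical and are not part of the text.

```c
/* Ridge point: the arithmetic intensity (FLOPs/byte) at which
 * peak memory BW * intensity equals peak floating-point performance. */
double ridge_point(double peak_gflops, double peak_bw_gbytes)
{
    return peak_gflops / peak_bw_gbytes;
}

/* How a ridge point moves when peak floating-point performance and peak
 * memory bandwidth are each scaled by a factor between generations. */
double scaled_ridge_point(double old_ridge, double flops_factor,
                          double bw_factor)
{
    return old_ridge * flops_factor / bw_factor;
}
```

Using the Figure 6.18 values for the Opteron X2, `ridge_point(16.0, 16.0)` is 1. Scaling the peak floating-point performance by "about four times" at unchanged bandwidth, `scaled_ridge_point(1.0, 4.0, 1.0)`, gives 4, which is only a rough estimate in the neighborhood of the 5 read from Figure 6.19; the exact peak numbers behind that figure are not given here.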
Because the ridge point moves to the right, kernels need an arithmetic intensity higher than 1 to see a performance gain in the next generation, or their working sets must fit in the caches of the Opteron X4.

[Figure 6.19 plot: log-log axes, X-axis the actual FLOP-to-byte ratio from 1/8 to 16, Y-axis "Attainable GFLOP/s" from 0.5 to 128.0; rooflines for the Opteron X4 (Barcelona) and the Opteron X2.]

FIGURE 6.19 Roofline models of two generations of Opterons. The Opteron X2 roofline, which is the same as in Figure 6.18, is in black, and the Opteron X4 roofline is in color. The bigger ridge point of Opteron X4 means that kernels that were computationally bound on the Opteron X2 could be memory-performance bound on the Opteron X4.

The roofline model gives an upper bound to performance. Suppose your program is far below that bound. What optimizations should you perform, and in what order?

To reduce computational bottlenecks, the following two optimizations can help almost any kernel:

1. Floating-point operation mix. Peak floating-point performance for a computer typically requires an equal number of nearly simultaneous additions and multiplications. That balance is necessary either because the computer supports a fused multiply-add instruction (see the Elaboration on page 220 in Chapter 3) or because the floating-point unit has an equal number of floating-point adders and floating-point multipliers. The best performance also requires that a significant fraction of the instruction mix is floating-point operations and not integer instructions.

2. Improve instruction-level parallelism and apply SIMD. For modern architectures, the highest performance comes when fetching, executing, and committing three to four instructions per clock cycle (see Section 4.10). The goal for this step is to improve the code from the compiler to increase ILP. One way is by unrolling loops, as we saw in Section 4.12. For the x86 architectures, a single AVX instruction can operate on four double-precision operands, so AVX instructions should be used whenever possible (see Sections 3.7 and 3.8).

To reduce memory bottlenecks, the following two optimizations can help:

3. Software prefetching. Usually the highest performance requires keeping many memory operations in flight, which is easier to do by predicting accesses via software prefetch instructions rather than waiting until the data is required by the computation.

4. Memory affinity. Microprocessors today include a memory controller on the same chip with the microprocessor, which improves performance of the memory hierarchy. If the system has multiple chips, this means that some addresses go to the DRAM that is local to one chip, and the rest require accesses over the chip interconnect to access the DRAM that is local to another chip. This split results in non-uniform memory accesses, which we described in Section 6.5. Accessing memory through another chip lowers performance. This optimization tries to allocate data and the threads tasked to operate on that data to the same memory-processor pair, so that the processors rarely have to access the memory of the other chips.

The roofline model can help decide which of these optimizations to perform and the order in which to perform them. We can think of each of these optimizations as a ceiling below the appropriate roofline, meaning that you cannot break through a ceiling without performing the associated optimization.

The computational roofline can be found from the manuals, and the memory roofline can be found from running the Stream benchmark. The computational ceilings, such as floating-point balance, can also come from the manuals for that computer. A memory ceiling, such as memory affinity, requires running experiments on each computer to determine the gap between them.
The good news is that this process only need be done once per computer, for once someone characterizes a computer's ceilings, everyone can use the results to prioritize their optimizations for that computer.

Figure 6.20 adds ceilings to the roofline model in Figure 6.18, showing the computational ceilings in the top graph and the memory bandwidth ceilings on the bottom graph. Although the higher ceilings are not labeled with both optimizations, they are implied in this figure; to break through the highest ceiling, you need to have already broken through all the ones below.

The width of the gap between the ceiling and the next higher limit is the reward for trying that optimization. Thus, Figure 6.20 suggests that optimization 2, which improves ILP, has a large benefit for improving computation on that computer, and optimization 4, which improves memory affinity, has a large benefit for improving memory bandwidth on that computer.

Figure 6.21 combines the ceilings of Figure 6.20 into a single graph. The arithmetic intensity of a kernel determines the optimization region, which in turn suggests which optimizations to try. Note that the computational optimizations and the memory bandwidth optimizations overlap for much of the arithmetic intensity. Three regions are shaded differently in Figure 6.21 to indicate the different optimization strategies. For example, Kernel 2 falls in the blue trapezoid on the right, which suggests working only on the computational optimizations. Kernel 1 falls in the blue-gray parallelogram in the middle, which suggests trying both types of optimizations. Moreover, it suggests starting with optimizations 2 and 4. Note that the Kernel 1 vertical line falls below the floating-point imbalance optimization, so optimization 1 may be unnecessary. If a kernel fell in the gray triangle on the lower left, it would suggest trying just memory optimizations.
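One way to read the regions is as a pair of comparisons against the roofs and ceilings. The sketch below encodes that reading in C; it is an interpretation, not a rule stated in the text, and every name in it (`compute_ceiling_matters`, `memory_ceiling_matters`, `optimization_region`, and the `region` enum) is hypothetical. The ceiling values used in the note afterward are the ones reported in the Figure 6.20 caption.

```c
/* A compute ceiling (GFLOPs/sec) can limit a kernel only if the memory
 * roof at that kernel's arithmetic intensity lies above the ceiling. */
int compute_ceiling_matters(double ceiling_gflops, double peak_bw_gbytes,
                            double intensity)
{
    return peak_bw_gbytes * intensity > ceiling_gflops;
}

/* A memory ceiling (GBytes/sec) can limit a kernel only if, at that
 * kernel's arithmetic intensity, it lies below the flat compute roof. */
int memory_ceiling_matters(double ceiling_gbytes, double peak_gflops,
                           double intensity)
{
    return ceiling_gbytes * intensity < peak_gflops;
}

enum region { MEMORY_ONLY, BOTH, COMPUTE_ONLY };

/* Classify a kernel using the lowest compute and lowest memory ceilings.
 * Assumed reading of Figure 6.21, not a definition from the text. */
enum region optimization_region(double lowest_compute_ceiling_gflops,
                                double lowest_memory_ceiling_gbytes,
                                double peak_gflops, double peak_bw_gbytes,
                                double intensity)
{
    int compute = compute_ceiling_matters(lowest_compute_ceiling_gflops,
                                          peak_bw_gbytes, intensity);
    int memory = memory_ceiling_matters(lowest_memory_ceiling_gbytes,
                                        peak_gflops, intensity);
    if (compute && memory)
        return BOTH;
    /* If neither test fires (a degenerate case), fall back to MEMORY_ONLY. */
    return compute ? COMPUTE_ONLY : MEMORY_ONLY;
}
```

Under this reading, with a 2 GFLOPs/sec lowest compute ceiling and a 4.8 GB/sec lowest memory ceiling on the 16 GFLOPS/sec, 16 GB/sec machine, Kernel 1 (0.5 FLOPs/byte) lands in `BOTH` and Kernel 2 (4 FLOPs/byte) lands in `COMPUTE_ONLY`. The call `compute_ceiling_matters(8.0, 16.0, 0.5)` returns 0, which agrees with the remark that optimization 1 may be unnecessary for Kernel 1.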
[Figure 6.20 plots: two log-log roofline graphs for the AMD Opteron, each with X-axis "Arithmetic Intensity: FLOPs/Byte Ratio" from 1/8 to 16 and Y-axis "Attainable GFLOPs/second" from 0.5 to 64.0. The top graph marks the computational ceilings "1. Fl. Pt. imbalance" and "2. Without ILP or SIMD" below peak floating-point performance. The bottom graph marks the memory ceilings "3. w/out SW prefetching" and "4. w/out Memory Affinity" below peak memory BW (stream).]

FIGURE 6.20 Roofline model with ceilings. The top graph shows the computational "ceilings" of 8 GFLOPs/sec if the floating-point operation mix is imbalanced and 2 GFLOPs/sec if the optimizations to increase ILP and SIMD are also missing. The bottom graph shows the memory bandwidth ceilings of 11 GB/sec without software prefetching and 4.8 GB/sec if memory affinity optimizations are also missing.

Thus far, we have been assuming that the arithmetic intensity is fixed, but that is not really the case. First, there are kernels where the arithmetic intensity increases with problem size, such as for Dense Matrix and N-body problems (see Figure 6.17). Indeed, this can be a reason that programmers have more success with weak scaling than with strong scaling. Second, the effectiveness of the memory hierarchy affects the number of accesses that go to memory, so optimizations that improve cache performance also improve arithmetic intensity. One example is improving temporal locality by unrolling loops and then grouping together statements with similar addresses. Many computers have special cache instructions that allocate data in a cache but do not first fill the data from memory at that address, since it will soon be over-written.
Both these optimizations reduce memory traffic, thereby moving the arithmetic intensity pole to the right by a factor of, say, 1.5. This shift right could put the kernel in a different optimization region.

While the examples above show how to help programmers improve performance, architects can also use the model to decide where they should optimize hardware to improve performance of the kernels that they think will be important.

The next section uses the roofline model to demonstrate the performance difference between a multicore microprocessor and a GPU and to see whether these differences reflect performance of real programs.

[Figure 6.21 plot: log-log axes, X-axis "Arithmetic Intensity: FLOPs/Byte Ratio" from 1/8 to 16, Y-axis "Attainable GFLOPs/second" from 0.5 to 64.0; peak floating-point performance and peak memory BW (stream) roofs, the ceilings "1. Fl. Pt. imbalance", "2. Without ILP or SIMD", "3. w/out SW prefetching", and "4. w/out Memory Affinity", and vertical lines for Kernel 1 and Kernel 2.]

FIGURE 6.21 Roofline model with ceilings, overlapping areas shaded, and the two kernels from Figure 6.18. Kernels whose arithmetic intensity lands in the blue trapezoid on the right should focus on computation optimizations, and kernels whose arithmetic intensity lands in the gray triangle in the lower left should focus on memory bandwidth optimizations. Those that land in the blue-gray parallelogram in the middle need to worry about both. As Kernel 1 falls in the parallelogram in the middle, try optimizing ILP and SIMD, memory affinity, and software prefetching. Kernel 2 falls in the trapezoid on the right, so try optimizing ILP and SIMD and the balance of floating-point operations.

Elaboration: The ceilings are ordered so that lower ceilings are easier to optimize.
Clearly, a programmer can optimize in any order, but following this sequence reduces the chances of wasting effort on an optimization that has no benefit due to other constraints. Like the 3Cs model, as long as the roofline model delivers on insights, a model can have assumptions that may prove optimistic. For example, roofline assumes the load is balanced between all processors.

Elaboration: An alternative to the Stream benchmark is to use the raw DRAM bandwidth as the roofline. While the raw bandwidth definitely is a hard upper bound, actual memory performance is often so far from that boundary that it's not that useful. That is, no program can go close to that bound. The downside to using Stream is that very careful programming may exceed the Stream results, so the memory roofline may not be as hard a limit as the computational roofline. We stick with Stream because few programmers will be able to deliver more memory bandwidth than Stream discovers.

Elaboration: Although the roofline model shown is for multicore processors, it clearly would work for a uniprocessor as well.

Check Yourself

True or false: The main drawback with conventional approaches to benchmarks for parallel computers is that the rules that ensure fairness also slow software innovation.

6.11 Real Stuff: Benchmarking and Rooflines of the Intel Core i7 960 and the NVIDIA Tesla GPU

A group of Intel researchers published a paper [Lee et al., 2010] comparing a quad-core Intel Core i7 960 with multimedia SIMD extensions to the previous generation GPU, the NVIDIA Tesla GTX 280. Figure 6.22 lists the characteristics of the two systems. Both products were purchased in Fall 2009. The Core i7 is in Intel's 45-nanometer semiconductor technology while the GPU is in TSMC's 65-nanometer technology.
Although it might have been fairer to have a comparison by a neutral party or by both interested parties, the purpose of this section is not to determine how much faster one product is than another, but to try to understand the relative value of features of these two contrasting architecture styles.

The rooflines of the Core i7 960 and GTX 280 in Figure 6.23 illustrate the differences in the computers. Not only does the GTX 280 have much higher memory bandwidth and double-precision floating-point performance, but also its double-precision ridge point is considerably to the left. The double-precision ridge point is 0.6 for the GTX 280 versus 3.1 for the Core i7. As mentioned above, it is much easier to hit peak computational performance the further the ridge point of the roofline is to the left. For single-precision performance, the ridge point moves far to the right for both computers, so it's much harder to hit the roof of single-precision performance. Note that the arithmetic intensity of the kernel is based on the bytes that go to main memory, not the bytes that go to cache memory. Thus, as mentioned above, caching can change the arithmetic intensity of a kernel on a particular computer, if most references really go to the cache. Note also that this bandwidth is for unit-stride accesses in both architectures. Real gather-scatter addresses can be slower on the GTX 280 and on the Core i7, as we shall see.

The researchers selected the benchmark programs by analyzing the computational and memory characteristics of four recently proposed benchmark suites and then formulated the set of throughput computing kernels that capture these characteristics. Figure 6.24 shows the performance results, with larger numbers meaning faster. The Rooflines help explain the relative performance in this case study.
| | Core i7-960 | GTX 280 | GTX 480 | Ratio 280/i7 | Ratio 480/i7 |
|---|---|---|---|---|---|
| Number of processing elements (cores or SMs) | 4 | 30 | 15 | 7.5 | 3.8 |
| Clock frequency (GHz) | 3.2 | 1.3 | 1.4 | 0.41 | 0.44 |
| Die size | 263 | 576 | 520 | 2.2 | 2.0 |
| Technology | Intel 45 nm | TSMC 65 nm | TSMC 40 nm | 1.6 | 1.0 |
| Power (chip, not module) | 130 | 130 | 167 | 1.0 | 1.3 |
| Transistors | 700 M | 1400 M | 3030 M | 2.0 | 4.4 |
| Memory bandwidth (GBytes/sec) | 32 | 141 | 177 | 4.4 | 5.5 |
| Single-precision SIMD width | 4 | 8 | 32 | 2.0 | 8.0 |
| Double-precision SIMD width | 2 | 1 | 16 | 0.5 | 8.0 |
| Peak single-precision scalar FLOPS (GFLOP/sec) | 26 | 117 | 63 | 4.6 | 2.5 |
| Peak single-precision SIMD FLOPS (GFLOP/sec) | 102 | 311 to 933 | 515 or 1344 | 3.0–9.1 | 6.6–13.1 |
| (SP 1 add or multiply) | N.A. | (311) | (515) | (3.0) | (6.6) |
| (SP 1 instruction fused multiply-adds) | N.A. | (622) | (1344) | (6.1) | (13.1) |
| (Rare SP dual issue fused multiply-add and multiply) | N.A. | (933) | N.A. | (9.1) | – |
| Peak double-precision SIMD FLOPS (GFLOP/sec) | 51 | 78 | 515 | 1.5 | 10.1 |

FIGURE 6.22 Intel Core i7-960, NVIDIA GTX 280, and GTX 480 specifications. The rightmost columns show the ratios of the Tesla GTX 280 and the Fermi GTX 480 to Core i7. Although the case study is between the Tesla 280 and i7, we include the Fermi 480 to show its relationship to the Tesla 280 since it is described in this chapter. Note that these memory bandwidths are higher than in Figure 6.23 because these are DRAM pin bandwidths and those in Figure 6.23 are at the processors as measured by a benchmark program. (From Table 2 in Lee et al. [2010].)

[Figure 6.23 plots: four log-log roofline graphs of GFlop/s versus arithmetic intensity (1/8 to 32). Top row, double precision: Core i7 960 (Nehalem) with a 51.2 GF/s roof and Stream = 16.4 GB/s; NVIDIA GTX280 with a 78 GF/s roof and Stream = 127 GB/s. Bottom row, single precision: Core i7 960 (Nehalem) with a 102.4 GF/s single-precision roof and the 51.2 GF/s double-precision ceiling; NVIDIA GTX280 with a 624 GF/s single-precision roof and the 78 GF/s double-precision ceiling.]

FIGURE 6.23 Roofline model [Williams, Waterman, and Patterson 2009]. These rooflines show double-precision floating-point performance in the top row and single-precision performance in the bottom row. (The DP FP performance ceiling is also in the bottom row to give perspective.) The Core i7 960 on the left has a peak DP FP performance of 51.2 GFLOP/sec, a SP FP peak of 102.4 GFLOP/sec, and a peak memory bandwidth of 16.4 GBytes/sec. The NVIDIA GTX 280 has a DP FP peak of 78 GFLOP/sec, SP FP peak of 624 GFLOP/sec, and 127 GBytes/sec of memory bandwidth. The dashed vertical line on the left represents an arithmetic intensity of 0.5 FLOP/byte. It is limited by memory bandwidth to no more than 8 DP GFLOP/sec or 8 SP GFLOP/sec on the Core i7. The dashed vertical line to the right has an arithmetic intensity of 4 FLOP/byte. It is limited only computationally to 51.2 DP GFLOP/sec and 102.4 SP GFLOP/sec on the Core i7 and 78 DP GFLOP/sec and 512 SP GFLOP/sec on the GTX 280. To hit the highest computation rate on the Core i7 you need to use all 4 cores and SSE instructions with an equal number of multiplies and adds. For the GTX 280, you need to use fused multiply-add instructions on all multithreaded SIMD processors.

Given that the raw performance specifications of the GTX 280 vary from 2.5× slower (clock rate) to 7.5× faster (cores per chip) while the performance varies from 2.0× slower (Solv) to 15.2× faster (GJK), the Intel researchers decided to find the reasons for the differences:

■ Memory bandwidth.
The GPU has 4.4× the memory bandwidth, which helps explain why LBM and SAXPY run 5.0 and 5.3× faster; their working sets are hundreds of megabytes and hence don't fit into the Core i7 cache. (So as to access memory intensively, they purposely did not use cache blocking as in Chapter 5.) Hence, the slope of the rooflines explains their performance. SpMV also has a large working set, but it only runs 1.9× faster because the double-precision floating point of the GTX 280 is only 1.5× as fast as the Core i7.

■ Compute bandwidth. Five of the remaining kernels are compute bound: SGEMM, Conv, FFT, MC, and Bilat. The GTX is faster by 3.9, 2.8, 3.0, 1.8, and 5.7×, respectively. The first three of these use single-precision floating-point arithmetic, and GTX 280 single precision is 3 to 6× faster. MC uses double precision, which explains why it's only 1.8× faster since DP performance is only 1.5× faster. Bilat uses transcendental functions, which the GTX 280 supports directly. The Core i7 spends two-thirds of its time calculating transcendental functions for Bilat, so the GTX 280 is 5.7× faster. This observation helps point out the value of hardware support for operations that occur in your workload: double-precision floating point and perhaps even transcendentals.

| Kernel | Units | Core i7-960 | GTX 280 | GTX 280/i7-960 |
|---|---|---|---|---|
| SGEMM | GFLOP/sec | 94 | 364 | 3.9 |
| MC | Billion paths/sec | 0.8 | 1.4 | 1.8 |
| Conv | Million pixels/sec | 1250 | 3500 | 2.8 |
| FFT | GFLOP/sec | 71.4 | 213 | 3.0 |
| SAXPY | GBytes/sec | 16.8 | 88.8 | 5.3 |
| LBM | Million lookups/sec | 85 | 426 | 5.0 |
| Solv | Frames/sec | 103 | 52 | 0.5 |
| SpMV | GFLOP/sec | 4.9 | 9.1 | 1.9 |
| GJK | Frames/sec | 67 | 1020 | 15.2 |
| Sort | Million elements/sec | 250 | 198 | 0.8 |
| RC | Frames/sec | 5 | 8.1 | 1.6 |
| Search | Million queries/sec | 50 | 90 | 1.8 |
| Hist | Million pixels/sec | 1517 | 2583 | 1.7 |
| Bilat | Million pixels/sec | 83 | 475 | 5.7 |

FIGURE 6.24 Raw and relative performance measured for the two platforms. In this study, SAXPY is just used as a measure of memory bandwidth, so the right unit is GBytes/sec and not GFLOP/sec.
(Based on Table 3 in [Lee et al., 2010].)

■ Cache benefits. Ray casting (RC) is only 1.6× faster on the GTX because cache blocking with the Core i7 caches prevents it from becoming memory bandwidth bound (see Sections 5.4 and 5.14), as it is on GPUs. Cache blocking can help Search, too. If the index trees are small so that they fit in the cache, the Core i7 is twice as fast. Larger index trees make them memory bandwidth bound. Overall, the GTX 280 runs Search 1.8× faster. Cache blocking also helps Sort. While most programmers wouldn't run Sort on a SIMD processor, it can be written with a 1-bit Sort primitive called split. However, the split algorithm executes many more instructions than a scalar sort does. As a result, the Core i7 runs 1.25× as fast as the GTX 280. Note that caches also help other kernels on the Core i7, since cache blocking allows SGEMM, FFT, and SpMV to become compute bound. This observation re-emphasizes the importance of cache blocking optimizations in Chapter 5.

■ Gather-Scatter. The multimedia SIMD extensions are of little help if the data are scattered throughout main memory; optimal performance comes only when accesses are to data aligned on 16-byte boundaries. Thus, GJK gets little benefit from SIMD on the Core i7. As mentioned above, GPUs offer gather-scatter addressing that is found in a vector architecture but omitted from most SIMD extensions. The memory controller even batches accesses to the same DRAM page together (see Section 5.2). This combination means the GTX 280 runs GJK a startling 15.2× as fast as the Core i7, which is larger than any single physical parameter in Figure 6.22. This observation reinforces the importance of gather-scatter to vector and GPU architectures that is missing from SIMD extensions.

■ Synchronization.
The performance of synchronization is limited by atomic updates, which are responsible for 28% of the total runtime on the Core i7 despite its having a hardware fetch-and-increment instruction. Thus, Hist is only 1.7× faster on the GTX 280. Solv solves a batch of independent constraints in a small amount of computation followed by barrier synchronization. The Core i7 benefits from the atomic instructions and a memory consistency model that ensures the right results even if not all previous accesses to the memory hierarchy have completed. Without the memory consistency model, the GTX 280 version launches some batches from the system processor, which leads to the GTX 280 running 0.5× as fast as the Core i7. This observation points out how synchronization performance can be important for some data parallel problems.

It is striking how often weaknesses in the Tesla GTX 280 that were uncovered by kernels selected by Intel researchers were already being addressed in the successor architecture to Tesla: Fermi has faster double-precision floating-point performance, faster atomic operations, and caches. It was also interesting that the gather-scatter support of vector architectures that predate the SIMD instructions by decades was so important to the effective usefulness of these SIMD extensions, which some had predicted before the comparison. The Intel researchers noted that 6 of the 14 kernels would exploit SIMD better with more efficient gather-scatter support on the Core i7. This study certainly establishes the importance of cache blocking as well.

Now that we have seen a wide range of results of benchmarking different multiprocessors, let's return to our DGEMM example to see in detail how much we have to change the C code to exploit multiple processors.
6.12 Going Faster: Multiple Processors and Matrix Multiply

This section is the final and largest step in our incremental performance journey of adapting DGEMM to the underlying hardware of the Intel Core i7 (Sandy Bridge). Each Core i7 has 8 cores, and the computer we have been using has 2 Core i7s. Thus, we have 16 cores on which to run DGEMM.

Figure 6.25 shows the OpenMP version of DGEMM that utilizes those cores. Note that line 30 is the single line added to Figure 5.48 to make this code run on multiple processors: an OpenMP pragma that tells the compiler to use multiple threads in the outermost for loop. It tells the computer to spread the work of the outermost loop across all the threads.

Figure 6.26 plots a classic multiprocessor speedup graph, showing the performance improvement versus a single thread as the number of threads increases. This graph makes it easy to see the challenges of strong scaling versus weak scaling. When everything fits in the first level data cache, as is the case for 32 × 32 matrices, adding threads actually hurts performance. The 16-threaded version of DGEMM is almost half as fast as the single-threaded version in this case. In contrast, the two largest matrices get a 14 × speedup from 16 threads, and hence the classic two “up and to the right” lines in Figure 6.26.

Figure 6.27 shows the absolute performance increase as we increase the number of threads from 1 to 16. DGEMM now operates at 174 GFLOPS for 960 × 960 matrices. As our unoptimized C version of DGEMM in Figure 3.21 ran this code at just 0.8 GFLOPS, the optimizations in Chapters 3 to 6 that tailor the code to the underlying hardware result in a speedup of over 200 times!

Next up are our warnings of the fallacies and pitfalls of multiprocessing. The computer architecture graveyard is filled with parallel processing projects that have ignored them.

Elaboration: These results are with Turbo mode turned off.
We are using a dual-chip system, so not surprisingly, we can get the full Turbo speedup (3.3/2.6 = 1.27) with either 1 thread (only 1 core on one of the chips) or 2 threads (1 core per chip). As we increase the number of threads and hence the number of active cores, the benefit of Turbo mode decreases, as there is less of the power budget to spend on the active cores. For 4 threads the average Turbo speedup is 1.23, for 8 it is 1.13, and for 16 it is 1.11.

 1  #include <x86intrin.h>
 2  #define UNROLL (4)
 3  #define BLOCKSIZE 32
 4  void do_block (int n, int si, int sj, int sk,
 5                 double *A, double *B, double *C)
 6  {
 7    for ( int i = si; i < si+BLOCKSIZE; i+=UNROLL*4 )
 8      for ( int j = sj; j < sj+BLOCKSIZE; j++ ) {
 9        __m256d c[4];
10        for ( int x = 0; x < UNROLL; x++ )
11          c[x] = _mm256_load_pd(C+i+x*4+j*n);
12        /* c[x] = C[i][j] */
13        for( int k = sk; k < sk+BLOCKSIZE; k++ )
14        {
15          __m256d b = _mm256_broadcast_sd(B+k+j*n);
16          /* b = B[k][j] */
17          for (int x = 0; x < UNROLL; x++)
18            c[x] = _mm256_add_pd(c[x], /* c[x]+=A[i][k]*b */
19                   _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
20        }
21
22        for ( int x = 0; x < UNROLL; x++ )
23          _mm256_store_pd(C+i+x*4+j*n, c[x]);
24          /* C[i][j] = c[x] */
25      }
26  }
27
28  void dgemm (int n, double* A, double* B, double* C)
29  {
30  #pragma omp parallel for
31    for ( int sj = 0; sj < n; sj += BLOCKSIZE )
32      for ( int si = 0; si < n; si += BLOCKSIZE )
33        for ( int sk = 0; sk < n; sk += BLOCKSIZE )
34          do_block(n, si, sj, sk, A, B, C);
35  }

FIGURE 6.25 OpenMP version of DGEMM from Figure 5.48. Line 30 is the only OpenMP code, making the outermost for loop operate in parallel. This line is the only difference from Figure 5.48.

Elaboration: Although the Sandy Bridge supports two hardware threads per core, we do not get more performance from 32 threads. The reason is that a single AVX unit is shared between the two threads multiplexed onto one core, so assigning two threads per core actually hurts performance due to the multiplexing overhead.

[Figure 6.26 plot: speedup relative to 1 core (vertical axis) versus number of threads (horizontal axis) for 960 × 960, 480 × 480, 160 × 160, and 32 × 32 matrices.]

FIGURE 6.26 Performance improvements relative to a single thread as the number of threads increases. The most honest way to present such graphs is to make performance relative to the best version of a single processor program, which we did. This plot is relative to the performance of the code in Figure 5.48 without including OpenMP pragmas.
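The speedups plotted in Figure 6.26 can be recomputed from the GFLOPS values reported in Figure 6.27. The following C sketch does that arithmetic; the table layout and the function names `speedup` and `efficiency` are illustrative assumptions, not part of the book's code.

```c
#include <stddef.h>

/* GFLOPS values as reported in Figure 6.27.
   Rows are matrix sizes; columns are 1, 2, 4, 8, and 16 threads. */
static const double gflops[4][5] = {
    {14, 12, 11, 11,   8},   /* 32 x 32   */
    {13, 20, 31, 61,  60},   /* 160 x 160 */
    {12, 22, 43, 85, 169},   /* 480 x 480 */
    {12, 23, 44, 87, 174},   /* 960 x 960 */
};

/* Thread counts that correspond to the five columns above. */
static const int thread_counts[5] = {1, 2, 4, 8, 16};

/* Speedup of column `col` relative to the single-thread column. */
double speedup(size_t row, size_t col)
{
    return gflops[row][col] / gflops[row][0];
}

/* Parallel efficiency: speedup divided by the number of threads used. */
double efficiency(size_t row, size_t col)
{
    return speedup(row, col) / thread_counts[col];
}
```

Under these numbers, `speedup(3, 4)` (960 × 960, 16 threads) is 174/12, about 14.5, while `speedup(0, 4)` (32 × 32, 16 threads) is 8/14, about 0.57, which matches the text's "14 × speedup" and "almost half as fast" observations.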
[Figure 6.27 data: DGEMM performance in GFLOPS]

Threads   32x32   160x160   480x480   960x960
1         14      13        12        12
2         12      20        22        23
4         11      31        43        44
8         11      61        85        87
16        8       60        169       174

FIGURE 6.27 DGEMM performance versus the number of threads for four matrix sizes. The performance improvement compared to the unoptimized code in Figure 3.21 for the 960 × 960 matrix with 16 threads is an astounding 212 times faster!

6.13 Fallacies and Pitfalls

The many assaults on parallel processing have uncovered numerous fallacies and pitfalls. We cover four here.

Fallacy: Amdahl's Law doesn't apply to parallel computers.

In 1987, the head of a research organization claimed that a multiprocessor machine had broken Amdahl's Law. To try to understand the basis of the media reports, let's see the quote that gave us Amdahl's Law [1967, p. 483]:

A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude.

This statement must still be true; the neglected portion of the program must limit performance. One interpretation of the law leads to the following lemma: portions of every program must be sequential, so there must be an economic upper bound to the number of processors, say, 100. By showing linear speed-up with 1000 processors, this lemma is disproved; hence the claim that Amdahl's Law was broken.

The approach of the researchers was just to use weak scaling: rather than going 1000 times faster on the same data set, they computed 1000 times more work in comparable time. For their algorithm, the sequential portion of the program was constant, independent of the size of the input, and the rest was fully parallel; hence, linear speed-up with 1000 processors.

Amdahl's Law obviously applies to parallel processors.
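The difference between the two readings of "speed-up" in this fallacy can be made concrete with a short calculation. The C sketch below contrasts the fixed-problem-size bound of Amdahl's Law with the scaled (weak scaling) measure, a formulation commonly associated with Gustafson. The function names and the 1% serial fraction used in the note afterward are assumptions chosen for illustration, not figures from the 1987 study.

```c
/* Strong scaling (Amdahl's Law): the problem size is fixed.
   `serial` is the fraction of single-processor time that cannot be
   parallelized; the rest is divided evenly across p processors. */
double amdahl_speedup(double serial, int p)
{
    return 1.0 / (serial + (1.0 - serial) / p);
}

/* Weak scaling: the serial time stays constant while the parallel work
   grows with p. Here `serial` is the serial fraction of the time measured
   on the parallel machine, so p processors complete
   serial + (1 - serial) * p units of work in one unit of time. */
double scaled_speedup(double serial, int p)
{
    return serial + (1.0 - serial) * p;
}
```

With a hypothetical serial fraction of 1%, `amdahl_speedup(0.01, 1000)` is about 91 and can never exceed 100, while `scaled_speedup(0.01, 1000)` is about 990. Both numbers are consistent with Amdahl's Law; they simply answer different questions.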
What this research does point out is that one of the main uses of faster computers is to run larger problems. Just be sure that users really care about those problems, rather than the problems being a justification for buying an expensive computer by finding a problem that just keeps lots of processors busy.

Fallacy: Peak performance tracks observed performance.

The supercomputer industry once used this metric in marketing, and the fallacy is exacerbated with parallel machines. Not only are marketers using the nearly unattainable peak performance of a uniprocessor node, but also they are then multiplying it by the total number of processors, assuming perfect speed-up! Amdahl's Law suggests how difficult it is to reach either peak; multiplying the two together multiplies the sins. The roofline model helps put peak performance in perspective.

Pitfall: Not developing the software to take advantage of, or optimize for, a multiprocessor architecture.

There is a long history of parallel software lagging behind parallel hardware, possibly because the software problems are much harder. We give one example to show the subtlety of the issues, but there are many examples we could choose!

For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. … Demonstration is made of the continued validity of the single processor approach …
Gene Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” Spring Joint Computer Conference, 1967

One frequently encountered problem occurs when software designed for a uniprocessor is adapted to a multiprocessor environment. For example, the Silicon Graphics operating system originally protected the page table with a single lock, assuming that page allocation is infrequent.
In a uniprocessor, this does not represent a performance problem. In a multiprocessor, it can become a major performance bottleneck for some programs. Consider a program that uses a large number of pages that are initialized at start-up, which UNIX does for statically allocated pages. Suppose the program is parallelized so that multiple processes allocate the pages. Because page allocation requires the use of the page table, which is locked whenever it is in use, even an OS kernel that allows multiple threads in the OS will be serialized if the processes all try to allocate their pages at once (which is exactly what we might expect at initialization time!).

This page table serialization eliminates parallelism in initialization and has significant impact on overall parallel performance. This performance bottleneck persists even for task-level parallelism. For example, suppose we split the parallel processing program apart into separate jobs and run them, one job per processor, so that there is no sharing between the jobs. (This is exactly what one user did, since he reasonably believed that the performance problem was due to unintended sharing or interference in his application.) Unfortunately, the lock still serializes all the jobs, so even the independent job performance is poor.

This pitfall indicates the kind of subtle but significant performance bugs that can arise when software runs on multiprocessors. Like many other key software components, the OS algorithms and data structures must be rethought in a multiprocessor context. Placing locks on smaller portions of the page table effectively eliminated the problem.

Fallacy: You can get good vector performance without providing memory bandwidth.

As we saw with the Roofline model, memory bandwidth is quite important to all architectures. DAXPY requires 1.5 memory references per floating-point operation, and this ratio is typical of many scientific codes.
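The 1.5 ratio follows directly from the DAXPY loop body. The C sketch below is a plain scalar version written for illustration (the function names are assumptions); the comments count the memory references and floating-point operations per element.

```c
#include <stddef.h>

/* DAXPY: y = a * x + y on double-precision vectors of length n. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        /* Per element: load x[i], load y[i], store y[i]  -> 3 memory references.
           One multiply and one add                       -> 2 floating-point ops. */
        y[i] = a * x[i] + y[i];
    }
}

/* Memory references per floating-point operation for the loop above: 3 / 2. */
double daxpy_refs_per_flop(void)
{
    return 3.0 / 2.0;
}
```

The scalar `a` is assumed to stay in a register, so it does not add to the per-element memory traffic.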
Even if the floating-point operations took no time, a Cray-1 could not increase the DAXPY performance of the vector sequence used, since it was memory limited. The Cray-1 performance on Linpack jumped when the compiler used blocking to change the computation so that values could be kept in the vector registers. This approach lowered the number of memory references per FLOP and improved the performance by nearly a factor of two! Thus, the memory bandwidth on the Cray-1 became sufficient for a loop that formerly required more bandwidth, which is just what the Roofline model would predict.

6.14 Concluding Remarks

The dream of building computers by simply aggregating processors has been around since the earliest days of computing. Progress in building and using effective and efficient parallel processors, however, has been slow. This rate of progress has been limited by difficult software problems as well as by a long process of evolving the architecture of multiprocessors to enhance usability and improve efficiency.

We have discussed many of the software challenges in this chapter, including the difficulty of writing programs that obtain good speed-up due to Amdahl's Law. The wide variety of different architectural approaches and the limited success and short life of many of the parallel architectures of the past have compounded the software difficulties. We discuss the history of the development of these multiprocessors in online Section 6.15. To go into even greater depth on topics in this chapter, see Chapter 4 of Computer Architecture: A Quantitative Approach, Fifth Edition for more on GPUs and comparisons between GPUs and CPUs and Chapter 6 for more on WSCs.

As we said in Chapter 1, despite this long and checkered past, the information technology industry has now tied its future to parallel computing.
“We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry. … This is not a race. This is a sea change in computing…”
Paul Otellini, Intel President, Intel Developers Forum, 2004

Although it is easy to make the case that this effort will fail like many in the past, there are reasons to be hopeful:

■ Clearly, software as a service (SaaS) is growing in importance, and clusters have proven to be a very successful way to deliver such services. By providing redundancy at a higher level, including geographically distributed datacenters, such services have delivered 24 × 7 × 365 availability for customers around the world.

■ We believe that Warehouse-Scale Computers are changing the goals and principles of server design, just as the needs of mobile clients are changing the goals and principles of microprocessor design. Both are revolutionizing the software industry as well. Performance per dollar and performance per joule drive both mobile client hardware and the WSC hardware, and parallelism is the key to delivering on those sets of goals.

■ SIMD and vector operations are a good match to multimedia applications, which are playing a larger role in the PostPC Era. They share the advantage of being easier for the programmer than classic parallel MIMD programming and being more energy efficient than MIMD. To put into perspective the importance of SIMD versus MIMD, Figure 6.28 plots the number of cores for MIMD versus the number of 32-bit and 64-bit operations per clock cycle in SIMD mode for x86 computers over time. For x86 computers, we expect to see two additional cores per chip about every two years and the SIMD width to double about every four years. Given these assumptions, over the next decade the potential speed-up from SIMD parallelism is twice that of MIMD parallelism.
Given the effectiveness of SIMD for multimedia and its increasing importance in the PostPC Era, that emphasis may be appropriate. Hence, it's at least as important to understand SIMD parallelism as MIMD parallelism, even though the latter has received much more attention.

■ The use of parallel processing in domains such as scientific and engineering computation is popular. This application domain has an almost limitless thirst for more computation. It also has many applications that have lots of natural concurrency. Once again, clusters dominate this application area. For example, using the 2012 Top 500 report, clusters are responsible for more than 80% of the 500 fastest Linpack results.

■ All desktop and server microprocessor manufacturers are building multiprocessors to achieve higher performance, so, unlike in the past, there is no easy path to higher performance for sequential applications. As we said earlier, sequential programs are now slow programs. Hence, programmers who need higher performance must parallelize their codes or write new parallel processing programs.

[Figure 6.28 plot: potential parallel speedup (vertical axis, logarithmic, 1 to 1000) versus year (horizontal axis, 2003 to 2023) for MIMD, SIMD (32 b), SIMD (64 b), MIMD*SIMD (32 b), and MIMD*SIMD (64 b).]

FIGURE 6.28 Potential speed-up via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers. This figure assumes that two cores per chip for MIMD will be added every two years and the number of operations for SIMD will double every four years.

■ In the past, microprocessors and multiprocessors were subject to different definitions of success. When scaling uniprocessor performance, microprocessor architects were happy if single thread performance went up by the square root of the increased silicon area. Thus, they were happy with sublinear performance in terms of resources.
Multiprocessor success used to be defined as linear speed-up as a function of the number of processors, assuming that the cost of purchase or cost of administration of n processors was n times as much as one processor. Now that parallelism is happening on-chip via multicore, we can use the traditional microprocessor metric of being successful with sublinear performance improvement.

■ The success of just-in-time runtime compilation and autotuning makes it feasible to think of software adapting itself to take advantage of the increasing number of cores per chip, which provides flexibility that is not available when limited to static compilers.

■ Unlike in the past, the open source movement has become a critical portion of the software industry. This movement is a meritocracy, where better engineering solutions can win the mind share of the developers over legacy concerns. It also embraces innovation, inviting change to old software and welcoming new languages and software products. Such an open culture could be extremely helpful in this time of rapid change.

To motivate readers to embrace this revolution, we demonstrated the potential of parallelism concretely for matrix multiply on the Intel Core i7 (Sandy Bridge) in the Going Faster sections of Chapters 3 to 6:

■ Data-level parallelism in Chapter 3 improved performance by a factor of 3.85 by executing four 64-bit floating-point operations in parallel using the 256-bit operands of the AVX instructions, demonstrating the value of SIMD.

■ Instruction-level parallelism in Chapter 4 pushed performance up by another factor of 2.3 by unrolling loops 4 times to give the out-of-order execution hardware more instructions to schedule.

■ Cache optimizations in Chapter 5 improved performance of matrices that didn't fit into the L1 data cache by another factor of 2.0 to 2.5 by using cache blocking to reduce cache misses.
■ Thread-level parallelism in this chapter improved performance of matrices that don't fit into a single L1 data cache by another factor of 4 to 14 by utilizing all 16 cores of our multicore chips, demonstrating the value of MIMD. We did this by adding a single line using an OpenMP pragma.

Using the ideas in this book and tailoring the software to this computer added 24 lines of code to DGEMM. For the matrix sizes of 32x32, 160x160, 480x480, and 960x960, the overall performance speedup from these ideas realized in those two-dozen lines of code is factors of 8, 39, 129, and 212!

This parallel revolution in the hardware/software interface is perhaps the greatest challenge facing the field in the last 60 years. You can also think of it as the greatest opportunity, as our Going Faster sections demonstrate. This revolution will provide many new research and business prospects inside and outside the IT field, and the companies that dominate the multicore era may not be the same ones that dominated the uniprocessor era. After understanding the underlying hardware trends and learning to adapt software to them, perhaps you will be one of the innovators who will seize the opportunities that are certain to appear in the uncertain times ahead. We look forward to benefiting from your inventions!

6.15 Historical Perspective and Further Reading

This section online gives the rich and often disastrous history of multiprocessors over the last 50 years.

References

G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong. TCP onloading for data center servers. IEEE Computer, 37(11):48–58, 2004.

B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing, June 10–11, 2010, Indianapolis, Indiana, USA, doi:10.1145/1807128.1807152.
6.16 Exercises

6.1 First, write down a list of the daily activities that you typically do on a
weekday. For instance, you might get out of bed, take a shower, get dressed, eat
breakfast, dry your hair, brush your teeth. Make sure to break down your list so
you have a minimum of 10 activities.

6.1.1 [5] <§6.2> Now consider which of these activities is already exploiting some
form of parallelism (e.g., brushing multiple teeth at the same time, versus one at
a time, carrying one book at a time to school, versus loading them all into your
backpack and then carrying them “in parallel”). For each of your activities, discuss if
they are already working in parallel, but if not, why they are not.

6.1.2 [5] <§6.2> Next, consider which of the activities could be carried out
concurrently (e.g., eating breakfast and listening to the news). For each of your
activities, describe which other activity could be paired with this activity.

6.1.3 [5] <§6.2> For 6.1.2, what could we change about current systems (e.g.,
showers, clothes, TVs, cars) so that we could perform more tasks in parallel?

6.1.4 [5] <§6.2> Estimate how much less time it would take to carry out these
activities if you tried to carry out as many tasks in parallel as possible.

6.2 You are trying to bake 3 blueberry pound cakes. Cake ingredients are as
follows:

1 cup butter, softened
1 cup sugar
4 large eggs
1 teaspoon vanilla extract
1/2 teaspoon salt
1/4 teaspoon nutmeg
1 1/2 cups flour
1 cup blueberries

The recipe for a single cake is as follows:

Step 1: Preheat oven to 325°F (160°C). Grease and flour your cake pan.

Step 2: In large bowl, beat together with a mixer butter and sugar at medium
speed until light and fluffy. Add eggs, vanilla, salt and nutmeg. Beat until
thoroughly blended. Reduce mixer speed to low and add flour, 1/2 cup at a time,
beating just until blended.

Step 3: Gently fold in blueberries. Spread evenly in prepared baking pan. Bake
for 60 minutes.

6.2.1 [5] <§6.2> Your job is to cook 3 cakes as efficiently as possible. Assuming
that you only have one oven large enough to hold one cake, one large bowl, one
cake pan, and one mixer, come up with a schedule to make three cakes as quickly
as possible. Identify the bottlenecks in completing this task.

6.2.2 [5] <§6.2> Assume now that you have three bowls, 3 cake pans and 3 mixers.
How much faster is the process now that you have additional resources?

6.2.3 [5] <§6.2> Assume now that you have two friends that will help you cook,
and that you have a large oven that can accommodate all three cakes. How will this
change the schedule you arrived at in Exercise 6.2.1 above?

6.2.4 [5] <§6.2> Compare the cake-making task to computing 3 iterations
of a loop on a parallel computer. Identify data-level parallelism and task-level
parallelism in the cake-making loop.

6.3 Many computer applications involve searching through a set of data and
sorting the data. A number of efficient searching and sorting algorithms have been
devised in order to reduce the runtime of these tedious tasks. In this problem we
will consider how best to parallelize these tasks.

6.3.1 [10] <§6.2> Consider the following binary search algorithm (a classic divide
and conquer algorithm) that searches for a value X in a sorted N-element array A
and returns the index of matched entry:

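BinarySearch(A[0..N-1], X) {
  low = 0
  high = N - 1
  while (low <= high) {
    mid = (low + high) / 2
    if (A[mid] > X)
      high = mid - 1
    else if (A[mid] < X)
      low = mid + 1
    else
      return mid // found
  }
  return -1 // not found
}

Assume that you have Y cores on a multicore processor to run BinarySearch.
Assuming that Y is much smaller than N, express the speedup factor you might
expect to obtain for values of Y and N. Plot these on a graph.

6.3.2 <§6.2> Next, assume that Y is equal to N. How would this affect your
conclusions in your previous answer? If you were tasked with obtaining the best
speedup factor possible (i.e., strong scaling), explain how you might change this
code to obtain it.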
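For readers who want to experiment with the algorithm in Exercise 6.3 before reasoning about parallel versions, the pseudocode can be written as runnable C. This is a minimal sequential sketch; the function name `binary_search`, the `int` element type, and the choice to return -1 when the value is absent are assumptions made for illustration.

```c
#include <stddef.h>

/* Returns the index of x in the sorted array a[0..n-1], or -1 if x is absent. */
long binary_search(const int *a, size_t n, int x)
{
    size_t low = 0, high = n;   /* search the half-open range [low, high) */

    while (low < high) {
        size_t mid = low + (high - low) / 2;   /* avoids overflow of low + high */
        if (a[mid] > x)
            high = mid;          /* x, if present, lies left of mid */
        else if (a[mid] < x)
            low = mid + 1;       /* x, if present, lies right of mid */
        else
            return (long)mid;    /* found */
    }
    return -1;                   /* not found */
}
```

The half-open range is used instead of the pseudocode's inclusive `high` so that the unsigned index never has to go below zero.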

6.4 Consider the following piece of C code:

for (j=2;j<1000;j++)
  D[j] = D[j-1]+D[j-2];

The MIPS code corresponding to the above fragment is:

      addiu $s2,$zero,7992
      addiu $s1,$zero,16
loop: l.d   $f0, -16($s1)
      l.d   $f2, -8($s1)
      add.d $f4, $f0, $f2
      s.d   $f4, 0($s1)
      addiu $s1, $s1, 8
      bne   $s1, $s2, loop

Instructions have the following associated latencies (in cycles):

add.d   l.d   s.d   addiu
4       6     1     2

6.4.1 [10] <§6.2> How many cycles does it take for all instructions in a single
iteration of the above loop to execute?

6.4.2 [10] <§6.2> When an instruction in a later iteration of a loop depends upon
a data value produced in an earlier iteration of the same loop, we say that there is
a loop carried dependence between iterations of the loop. Identify the loop-carried
dependences in the above code. Identify the dependent program variable and
assembly-level registers. You can ignore the loop induction variable j.

6.4.3 [10] <§6.2> Loop unrolling was described in Chapter 4. Apply loop
unrolling to this loop and then consider running this code on a 2-node distributed
memory message passing system. Assume that we are going to use message passing
as described in Section 6.7, where we introduce a new operation send (x, y) that
sends to node x the value y, and an operation receive( ) that waits for the value being
sent to it. Assume that send operations take a cycle to issue (i.e., later instructions
on the same node can proceed on the next cycle), but take 10 cycles to be received
on the receiving node. Receive instructions stall execution on the node where they
are executed until they receive a message. Produce a schedule for the two nodes
assuming an unroll factor of 4 for the loop body (i.e., the loop body will appear
4 times). Compute the number of cycles it will take for the loop to run on the
message passing system.

6.4.4 [10] <§6.2> The latency of the interconnect network plays a large role in
the efficiency of message passing systems. How fast does the interconnect need to
be in order to obtain any speedup from using the distributed system described in
Exercise 6.4.3?

6.5 Consider the following recursive mergesort algorithm (another classic divide
and conquer algorithm). Mergesort was first described by John Von Neumann in
1945. The basic idea is to divide an unsorted list x of m elements into two sublists
of about half the size of the original list. Repeat this operation on each sublist, and
continue until we have lists of size 1 in length. Then starting with sublists of length
1, “merge” the two sublists into a single sorted list.

Mergesort(m)
  var list left, right, result
  if length(m) ≤ 1
    return m
  else
    var middle = length(m) / 2
    for each x in m up to middle
      add x to left
    for each x in m after middle
      add x to right
    left = Mergesort(left)
    right = Mergesort(right)
    result = Merge(left, right)
    return result

The merge step is carried out by the following code:

Merge(left,right)
  var list result
  while length(left) > 0 and length(right) > 0
    if first(left) ≤ first(right)
      append first(left) to result
      left = rest(left)
    else
      append first(right) to result
      right = rest(right)
  if length(left) > 0
    append left to result
  if length(right) > 0
    append right to result
  return result
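The Mergesort and Merge pseudocode above can also be tried out as sequential C. The sketch below is one possible rendering, not the book's code: it sorts an `int` array in place with a scratch buffer instead of building new lists, and the names `mergesort_ints`, `sort_range`, and `merge_runs` are assumptions chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Merge the sorted runs a[lo..mid) and a[mid..hi) using scratch buffer tmp. */
static void merge_runs(int *a, int *tmp, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;

    while (i < mid && j < hi)                 /* take the smaller head element */
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];        /* copy what remains on the left */
    while (j < hi)  tmp[k++] = a[j++];        /* copy what remains on the right */
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Recursively sort a[lo..hi): split in half, sort each half, then merge. */
static void sort_range(int *a, int *tmp, size_t lo, size_t hi)
{
    if (hi - lo <= 1)
        return;                               /* 0 or 1 elements: already sorted */
    size_t mid = lo + (hi - lo) / 2;
    sort_range(a, tmp, lo, mid);
    sort_range(a, tmp, mid, hi);
    merge_runs(a, tmp, lo, mid, hi);
}

/* Sorts a[0..n-1] in ascending order. Returns 0 on success, -1 if the
   scratch buffer cannot be allocated. */
int mergesort_ints(int *a, size_t n)
{
    if (n < 2)
        return 0;
    int *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    sort_range(a, tmp, 0, n);
    free(tmp);
    return 0;
}
```

The two recursive calls in `sort_range` work on disjoint halves of the array, which is the property the exercises below ask you to reason about.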

6.5.1 [10] <§6.2> Assume that you have Y cores on a multicore processor to run
MergeSort. Assuming that Y is much smaller than length(m), express the speedup
factor you might expect to obtain for values of Y and length(m). Plot these on a
graph.

6.5.2 [10] <§6.2> Next, assume that Y is equal to length(m). How would this
affect your conclusions in your previous answer? If you were tasked with obtaining
the best speedup factor possible (i.e., strong scaling), explain how you might
change this code to obtain it.


6.6 Matrix multiplication plays an important role in a number of applications.
Two matrices can only be multiplied if the number of columns of the first matrix is
equal to the number of rows in the second.

Let’s assume we have an m × n matrix A and we want to multiply it by an n × p
matrix B. We can express their product as an m × p matrix denoted by AB (or A ⋅ B).
If we assign C = AB, and c_{i,j} denotes the entry in C at position (i, j), then

c_{i,j} = \sum_{r=1}^{n} a_{i,r} b_{r,j}

for each element i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p. Now we want to see if we can parallelize
the computation of C. Assume that matrices are laid out in memory sequentially as
follows: a_{1,1}, a_{2,1}, a_{3,1}, a_{4,1}, …, etc.
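The layout described above, in which consecutive memory locations walk down a column, is column-major order. The following C sketch shows a straightforward sequential matrix multiply under that layout, using 0-based indices; the function name `matmul_colmajor` and the parameter order are assumptions for illustration, and no parallelization is attempted.

```c
#include <stddef.h>

/* C = A * B for column-major matrices: A is m x n, B is n x p, C is m x p.
   Element (i, j) of a matrix with `rows` rows is stored at index i + j * rows. */
void matmul_colmajor(size_t m, size_t n, size_t p,
                     const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < p; j++) {          /* column of C */
        for (size_t i = 0; i < m; i++) {      /* row of C */
            double sum = 0.0;
            for (size_t r = 0; r < n; r++)    /* dot product of row i of A and column j of B */
                sum += A[i + r * m] * B[r + j * n];
            C[i + j * m] = sum;
        }
    }
}
```

With this layout, elements of C that share a column (consecutive values of i) are adjacent in memory, which is worth keeping in mind when deciding how to divide the work among cores.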

6.6.1 [10] <§6.5> Assume that we are going to compute C on both a single core
shared memory machine and a 4-core shared-memory machine. Compute the
speedup we would expect to obtain on the 4-core machine, ignoring any memory
issues.

6.6.2 [10] <§6.5> Repeat Exercise 6.6.1, assuming that updates to C incur a cache
miss due to false sharing when consecutive elements in a row (i.e., index i) are
updated.

6.6.3 [10] <§6.5> How would you fix the false sharing issue that can occur?

6.7 Consider the following portions of two different programs running at the
same time on four processors in a symmetric multicore processor (SMP). Assume
that before this code is run, both x and y are 0.

Core 1: x = 2;

Core 2: y = 2;

Core 3: w = x + y + 1;

Core 4: z = x + y;

6.7.1 [10] <§6.5> What are all the possible resulting values of w, x, y, and z? For
each possible outcome, explain how we might arrive at those values. You will need
to examine all possible interleavings of instructions.

6.7.2 [5] <§6.5> How could you make the execution more deterministic so that
only one set of values is possible?

6.8 The dining philosophers problem is a classic problem of synchronization and
concurrency. The general problem is stated as philosophers sitting at a round table
doing one of two things: eating or thinking. When they are eating, they are not
thinking, and when they are thinking, they are not eating. There is a bowl of pasta
in the center. A fork is placed between each pair of adjacent philosophers. The result is that each
philosopher has one fork to her left and one fork to her right. Given the nature of
eating pasta, the philosopher needs two forks to eat, and can only use the forks on
her immediate left and right. The philosophers do not speak to one another.

6.8.1 [10] <§6.7> Describe the scenario where none of the philosophers ever eats (i.e.,
starvation). What is the sequence of events that leads up to this problem?

6.8.2 [10] <§6.7> Describe how we can solve this problem by introducing the
concept of a priority. Can we guarantee that we will treat all the philosophers
fairly? Explain.

Now assume we hire a waiter who is in charge of assigning forks to philosophers.
Nobody can pick up a fork until the waiter says they can. The waiter has global
knowledge of all forks. Further, if we impose the policy that philosophers will
always request to pick up their left fork before requesting to pick up their right
fork, then we can guarantee to avoid deadlock.

6.8.3 [10] <§6.7> We can implement requests to the waiter as either a queue of
requests or as a periodic retry of a request. With a queue, requests are handled in
the order they are received. The problem with using the queue is that we may not
always be able to service the philosopher whose request is at the head of the queue
(due to the unavailability of resources). Describe a scenario with 5 philosophers
where a queue is provided, but service is not granted even though there are forks
available for another philosopher (whose request is deeper in the queue) to eat.

6.8.4 [10] <§6.7> If we implement requests to the waiter by periodically repeating
our request until the resources become available, will this solve the problem
described in Exercise 6.8.3? Explain.

6.9 Consider the following three CPU organizations:

CPU SS: A 2-core superscalar microprocessor that provides out-of-order issue
capabilities on 2 functional units (FUs). Only a single thread can run on each core
at a time.

CPU MT: A fine-grained multithreaded processor that allows instructions from 2
threads to be run concurrently (i.e., there are two functional units), though only
instructions from a single thread can be issued on any cycle.

CPU SMT: An SMT processor that allows instructions from 2 threads to be run
concurrently (i.e., there are two functional units), and instructions from either or
both threads can be issued to run on any cycle.

Assume we have two threads X and Y to run on these CPUs that include the
following operations:

Thread X                                       Thread Y
A1: takes 3 cycles to execute                  B1: takes 2 cycles to execute
A2: no dependences                             B2: conflicts for a functional unit with B1
A3: conflicts for a functional unit with A1    B3: depends on the result of B2
A4: depends on the result of A3                B4: no dependences and takes 2 cycles to execute


Assume all instructions take a single cycle to execute unless noted otherwise or
they encounter a hazard.

6.9.1 [10] <§6.4> Assume that you have 1 SS CPU. How many cycles will it take to
execute these two threads? How many issue slots are wasted due to hazards?

6.9.2 [10] <§6.4> Now assume you have 2 SS CPUs. How many cycles will it take
to execute these two threads? How many issue slots are wasted due to hazards?

6.9.3 [10] <§6.4> Assume that you have 1 MT CPU. How many cycles will it take
to execute these two threads? How many issue slots are wasted due to hazards?

6.10 Virtualization software is being aggressively deployed to reduce the costs of
managing today’s high-performance servers. Companies like VMware, Microsoft,
and IBM have all developed a range of virtualization products. The general concept,
described in Chapter 5, is that a hypervisor layer can be introduced between the
hardware and the operating system to allow multiple operating systems to share
the same physical hardware. The hypervisor layer is then responsible for allocating
CPU and memory resources, as well as handling services typically handled by the
operating system (e.g., I/O).

Virtualization provides an abstract view of the underlying hardware to the hosted
operating system and application software. This will require us to rethink how
multi-core and multiprocessor systems will be designed in the future to support
the sharing of CPUs and memories by a number of operating systems concurrently.

6.10.1 [30] <§6.4> Select two hypervisors on the market today, and compare
and contrast how they virtualize and manage the underlying hardware (CPUs and
memory).

6.10.2 [15] <§6.4> Discuss what changes may be necessary in future multi-core
CPU platforms in order to better match the resource demands placed on these
systems. For instance, can multithreading play an effective role in alleviating the
competition for computing resources?

6.11 We would like to execute the loop below as efficiently as possible. We have
two different machines, a MIMD machine and a SIMD machine.

    for (i = 0; i < 2000; i++)
        for (j = 0; j < 3000; j++)
            X_array[i][j] = Y_array[j][i] + 200;

6.11.1 [10] <§6.3> For a 4 CPU MIMD machine, show the sequence of MIPS
instructions that you would execute on each CPU. What is the speedup for this
MIMD machine?

6.11.2 [20] <§6.3> For an 8-wide SIMD machine (i.e., 8 parallel SIMD functional
units), write an assembly program using your own SIMD extensions to MIPS
to execute the loop. Compare the number of instructions executed on the SIMD
machine to the MIMD machine.

6.12 A systolic array is an example of an MISD machine. A systolic array is a
pipeline network or “wavefront” of data processing elements. Each of these elements
does not need a program counter since execution is triggered by the arrival of data.
Clocked systolic arrays compute in “lock-step” with each processor undertaking
alternate compute and communication phases.

6.12.1 [10] <§6.3> Consider proposed implementations of a systolic array (you
can find these on the Internet or in technical publications). Then attempt to
program the loop provided in Exercise 6.11 using this MISD model. Discuss any
difficulties you encounter.

6.12.2 [10] <§6.3> Discuss the similarities and differences between an MISD and
SIMD machine. Answer this question in terms of data-level parallelism.

6.13 Assume we want to execute the DAXPY loop shown on page 511 in MIPS
assembly on the NVIDIA 8800 GTX GPU described in this chapter. In this problem,
we will assume that all math operations are performed on single-precision
floating-point numbers (we will rename the loop SAXPY). Assume that instructions
take the following number of cycles to execute.

Loads    Stores    Add.S    Mult.S
5        2         3        4

6.13.1 [20] <§6.6> Describe how you will construct warps for the SAXPY loop
to exploit the 8 cores provided in a single multiprocessor.

6.14 Download the CUDA Toolkit and SDK from
http://www.nvidia.com/object/cuda_get.html. Make sure to use the “emurelease”
(Emulation Mode) version of the code (you will not need actual NVIDIA hardware
for this assignment). Build the example programs provided in the SDK, and
confirm that they run on the emulator.

6.14.1 [90] <§6.6> Using the “template” SDK sample as a starting point, write a
CUDA program to perform the following vector operations:

1) a − b (vector-vector subtraction)

2) a ⋅ b (vector dot product)

The dot product of two vectors a = [a1, a2, … , an] and b = [b1, b2, … , bn] is defined as:

$$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

Submit code for each program that demonstrates each operation and verifies the
correctness of the results.
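One way to verify results is to compare them against a plain sequential version. The following is a minimal sketch in C of the two operations defined above, not a CUDA solution; the function names `vector_sub` and `dot_product` are chosen here for illustration and do not come from the SDK.

```c
#include <stddef.h>

/* Element-wise subtraction: out[i] = a[i] - b[i] for i = 0 .. n-1. */
void vector_sub(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] - b[i];
}

/* Dot product: the sum of a[i] * b[i] for i = 0 .. n-1.
   The sum is accumulated in double to limit rounding error
   when the vectors are long. */
double dot_product(const float *a, const float *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)a[i] * (double)b[i];
    return sum;
}
```

Because floating-point addition is not associative, a parallel reduction may differ from this sequential sum in the low-order bits, so a comparison would normally use a small tolerance rather than exact equality.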

6.14.2 [90] <§6.6> If you have GPU hardware available, complete a performance
analysis of your program, examining the computation time for the GPU and a CPU
version of your program for a range of vector sizes. Explain any results you see.


6.15 AMD has recently announced that they will be integrating a graphics
processing unit with their x86 cores in a single package, though with different
clocks for each of the cores. This is an example of a heterogeneous multiprocessor
system, which we expect to see produced commercially in the near future. One
of the key design points will be to allow for fast data communication between
the CPU and the GPU. Presently, communications must be performed between
discrete CPU and GPU chips, but this is changing in AMD’s Fusion architecture.
Presently, the plan is to use multiple (at least 16) PCI Express channels to facilitate
intercommunication. Intel is also jumping into this arena with their Larrabee chip.
Intel is considering using their QuickPath interconnect technology.

6.15.1 [25] <§6.6> Compare the bandwidth and latency associated with these
two interconnect technologies.

6.16 Refer to Figure 6.14b, which shows an n-cube interconnect topology of order
3 that interconnects 8 nodes. One attractive feature of an n-cube interconnection
network topology is its ability to sustain broken links and still provide connectivity.

6.16.1 [10] <§6.8> Develop an equation that computes how many links in the
n-cube (where n is the order of the cube) can fail and we can still guarantee an
unbroken link will exist to connect any node in the n-cube.

6.16.2 [10] <§6.8> Compare the resiliency to failure of n-cube to a fully-
connected interconnection network. Plot a comparison of reliability as a function
of the added number of links for the two topologies.

6.17 Benchmarking is a field of study that involves identifying representative
workloads to run on specific computing platforms in order to be able to objectively
compare performance of one system to another. In this exercise we will compare
two classes of benchmarks: the Whetstone CPU benchmark and the PARSEC
Benchmark suite. Select one program from PARSEC. All programs should be freely
available on the Internet. Consider running multiple copies of Whetstone versus
running the PARSEC Benchmark on any of the systems described in Section 6.11.

6.17.1 [60] <§6.10> What is inherently different between these two classes of
workload when run on these multi-core systems?

6.17.2 [60] <§6.10> In terms of the Roofline Model, how dependent will the
results you obtain when running these benchmarks be on the amount of sharing
and synchronization present in the workload used?

6.18 When performing computations on sparse matrices, latency in the memory
hierarchy becomes much more of a factor. Sparse matrices lack the spatial locality
in the data stream typically found in matrix operations. As a result, new matrix
representations have been proposed.

One of the earliest sparse matrix representations is the Yale Sparse Matrix Format. It
stores an initial sparse m × n matrix, M, in row form using three one-dimensional
arrays. Let R be the number of nonzero entries in M. We construct an array A
of length R that contains all nonzero entries of M (in left-to-right top-to-bottom
order). We also construct a second array IA of length m + 1 (i.e., one entry per row,
plus one). IA(i) contains the index in A of the first nonzero element of row i. Row
i of the original matrix extends from A(IA(i)) to A(IA(i+1)−1). The third array, JA,
contains the column index of each element of A, so it also is of length R.
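To make the three arrays concrete, here is a small sketch in C that stores a hypothetical 3 × 4 matrix (deliberately not the exercise matrix) in this format and reads elements back out. The names `yale_get` and `yale_row_count` are illustrative, and the indices are zero-based, as is usual in C.

```c
#include <stddef.h>

/* A 3 x 4 example matrix (hypothetical, chosen for illustration):
       [10  0  0 20]
       [ 0  0  0  0]
       [ 0 30 40  0]
   It has R = 4 nonzero entries. */
enum { ROWS = 3, COLS = 4, NNZ = 4 };

/* A:  the nonzero values in left-to-right, top-to-bottom order. */
static const float A[NNZ] = { 10.0f, 20.0f, 30.0f, 40.0f };

/* IA: index in A where each row starts, plus one final entry equal to NNZ.
   Row 1 is empty, so IA[1] == IA[2]. */
static const int IA[ROWS + 1] = { 0, 2, 2, 4 };

/* JA: the column index of each value in A. */
static const int JA[NNZ] = { 0, 3, 1, 2 };

/* Return element (row, col), scanning only the stored nonzeros of that row. */
float yale_get(int row, int col)
{
    for (int k = IA[row]; k < IA[row + 1]; k++)
        if (JA[k] == col)
            return A[k];
    return 0.0f;   /* not stored, so the element is zero */
}

/* Number of nonzeros in a row: the difference of adjacent IA entries. */
int yale_row_count(int row)
{
    return IA[row + 1] - IA[row];
}
```

The extra entry at the end of IA is what lets the loop bound `IA[row + 1]` work for the last row without a special case.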

6.18.1 [15] <§6.10> Consider the sparse matrix X below and write C code that
would store this matrix in Yale Sparse Matrix Format.

Row 1 [1, 2, 0, 0, 0, 0]
Row 2 [0, 0, 1, 1, 0, 0]
Row 3 [0, 0, 0, 0, 9, 0]
Row 4 [2, 0, 0, 0, 0, 2]
Row 5 [0, 0, 3, 3, 0, 7]
Row 6 [1, 3, 0, 0, 0, 1]

6.18.2 [10] <§6.10> In terms of storage space, assuming that each element in
matrix X is single-precision floating point, compute the amount of storage used to
store the matrix above in Yale Sparse Matrix Format.

6.18.3 [15] <§6.10> Perform matrix multiplication of Matrix X by Matrix Y
shown below.

[2, 4, 1, 99, 7, 2]

Put this computation in a loop, and time its execution. Make sure to increase
the number of times this loop is executed to get good resolution in your timing
measurement. Compare the runtime of using a naïve representation of the matrix,
and the Yale Sparse Matrix Format.

6.18.4 [15] <§6.10> Can you find a more efficient sparse matrix representation
(in terms of space and computational overhead)?

6.19 In future systems, we expect to see heterogeneous computing platforms
constructed out of heterogeneous CPUs. We have begun to see some appear in the
embedded processing market in systems that contain both floating-point DSPs and
microcontroller CPUs in a multichip module package.

Assume that you have three classes of CPU:

CPU A: A moderate-speed multi-core CPU (with a floating-point unit) that can
execute multiple instructions per cycle.

CPU B: A fast single-core integer CPU (i.e., no floating-point unit) that can
execute a single instruction per cycle.

CPU C: A slow vector CPU (with floating-point capability) that can execute
multiple copies of the same instruction per cycle.


Assume that our processors run at the following frequencies:

CPU A    CPU B    CPU C
1 GHz    3 GHz    250 MHz

CPU A can execute 2 instructions per cycle, CPU B can execute 1 instruction per
cycle, and CPU C can execute 8 instructions (though the same instruction) per
cycle. Assume all operations can complete execution in a single cycle of latency
without any hazards.

All three CPUs have the ability to perform integer arithmetic, though CPU B cannot
perform floating-point arithmetic. CPU A and B have an instruction set similar
to a MIPS processor. CPU C can only perform floating-point add and subtract
operations, as well as memory loads and stores. Assume all CPUs have access to
shared memory and that synchronization has zero cost.

The task at hand is to compare two matrices X and Y that each contain 1024 × 1024
floating-point elements. The output should be a count of the number of indices where
the value in X was larger than or equal to the value in Y.

6.19.1 [10] <§6.11> Describe how you would partition the problem on the 3
different CPUs to obtain the best performance.

6.19.2 [10] <§6.11> What kind of instruction would you add to the vector CPU
C to obtain better performance?

6.20 Assume a quad-core computer system can process database queries at a
steady-state rate of requests per second. Also assume that each transaction takes,
on average, a fixed amount of time to process. The following table shows pairs of
transaction latency and processing rate.

Average Transaction Latency    Maximum Transaction Processing Rate
1 ms                           5000/sec
2 ms                           5000/sec
1 ms                           10,000/sec
2 ms                           10,000/sec

For each of the pairs in the table, answer the following questions:

6.20.1 [10] <§6.11> On average, how many requests are being processed at any
given instant?

6.20.2 [10] <§6.11> If we move to an 8-core system, ideally, what will happen to the
system throughput (i.e., how many queries/second will the computer process)?

6.20.3 [10] <§6.11> Discuss why we rarely obtain this kind of speedup by simply
increasing the number of cores.

Answers to Check Yourself

§6.1, page 504: False. Task-level parallelism can help sequential applications and
sequential applications can be made to run on parallel hardware, although it is
more challenging.
§6.2, page 509: False. Weak scaling can compensate for a serial portion of the
program that would otherwise limit scalability, but not so for strong scaling.
§6.3, page 514: True, but they are missing useful vector features like gather-scatter
and vector length registers that improve the efficiency of vector architectures.
(As an elaboration in this section mentions, the AVX2 SIMD extensions offer
indexed loads via a gather operation but not scatter for indexed stores. The Haswell
generation x86 microprocessor is the first to support AVX2.)
§6.4, page 519: 1. True. 2. True.
§6.5, page 523: False. Since the shared address is a physical address, multiple
tasks each in their own virtual address spaces can run well on a shared memory
multiprocessor.
§6.6, page 531: False. Graphics DRAM chips are prized for their higher bandwidth.
§6.7, page 536: 1. False. Sending and receiving a message is an implicit
synchronization, as well as a way to share data. 2. True.
§6.8, page 538: True.
§6.10, page 550: True. We likely need innovation at all levels of the hardware and
software stack for parallel computing to succeed.

APPENDIX A

Assemblers, Linkers, and the SPIM Simulator

James R. Larus
Microsoft Research
Microsoft

Fear of serious injury cannot alone justify suppression of free speech and assembly.
Louis Brandeis
Whitney v. California, 1927

A.1 Introduction

Encoding instructions as binary numbers is natural and efficient for computers.
Humans, however, have a great deal of difficulty understanding and manipulating
these numbers. People read and write symbols (words) much better than long
sequences of digits. Chapter 2 showed that we need not choose between numbers
and words, because computer instructions can be represented in many ways.
Humans can write and read symbols, and computers can execute the equivalent
binary numbers. This appendix describes the process by which a human-readable
program is translated into a form that a computer can execute, provides a few hints
about writing assembly programs, and explains how to run these programs on
SPIM, a simulator that executes MIPS programs. UNIX, Windows, and Mac OS X
versions of the SPIM simulator are available on the CD.

Assembly language is the symbolic representation of a computer’s binary
encoding, the machine language. Assembly language is more readable than
machine language, because it uses symbols instead of bits. The symbols in assembly
language name commonly occurring bit patterns, such as opcodes and register
specifiers, so people can read and remember them. In addition, assembly language
permits programmers to use labels to identify and name particular memory words
that hold instructions or data.

machine language Binary representation used for communication within a
computer system.

[Figure A.1.1: diagram with boxes labeled Source file, Assembler, Object file,
Program library, Linker, and Executable file.]

FIGURE A.1.1 The process that produces an executable file. An assembler translates a file of
assembly language into an object file, which is linked with other files and libraries into an executable file.

A tool called an assembler translates assembly language into binary instructions.
Assemblers provide a friendlier representation than a computer’s 0s and 1s, which
simplifies writing and reading programs. Symbolic names for operations and
locations are one facet of this representation. Another facet is programming facilities
that increase a program’s clarity. For example, macros, discussed in Section A.2,
enable a programmer to extend the assembly language by defining new operations.

An assembler reads a single assembly language source file and produces an
object file containing machine instructions and bookkeeping information that
helps combine several object files into a program. Figure A.1.1 illustrates how a
program is built. Most programs consist of several files (also called modules)
that are written, compiled, and assembled independently. A program may also use
prewritten routines supplied in a program library. A module typically contains
references to subroutines and data defined in other modules and in libraries. The code
in a module cannot be executed when it contains unresolved references to labels
in other object files or libraries. Another tool, called a linker, combines a collection
of object and library files into an executable file, which a computer can run.

assembler A program that translates a symbolic version of instructions into
the binary version.

macro A pattern-matching and replacement facility that provides a simple
mechanism to name a frequently used sequence of instructions.

unresolved reference A reference that requires more information from an
outside source to be complete.

linker Also called link editor. A systems program that combines independently
assembled machine language programs and resolves all undefined labels into an
executable file.

To see the advantage of assembly language, consider the following sequence of
figures, all of which contain a short subroutine that computes and prints the sum of
the squares of integers from 0 to 100. Figure A.1.2 shows the machine language that
a MIPS computer executes. With considerable effort, you could use the opcode and
instruction format tables in Chapter 2 to translate the instructions into a symbolic
program similar to that shown in Figure A.1.3. This form of the routine is much
easier to read, because operations and operands are written with symbols rather
than with bit patterns. However, this assembly language is still difficult to follow,
because memory locations are named by their address rather than by a symbolic
label.
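As an illustration of the "considerable effort" involved, the sketch below splits one machine word into the fields of the MIPS I-format (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate). The type and function names are hypothetical, and the code handles only this one format; a real disassembler would also need the R and J formats and an opcode table.

```c
#include <stdint.h>

/* Fields of a MIPS I-format instruction: opcode(6) rs(5) rt(5) immediate(16). */
typedef struct {
    unsigned opcode;
    unsigned rs;
    unsigned rt;
    int      imm;     /* 16-bit immediate, sign-extended */
} IFormat;

IFormat decode_iformat(uint32_t word)
{
    IFormat f;
    f.opcode = (word >> 26) & 0x3Fu;   /* bits 31..26 */
    f.rs     = (word >> 21) & 0x1Fu;   /* bits 25..21 */
    f.rt     = (word >> 16) & 0x1Fu;   /* bits 20..16 */
    f.imm    = (int)(word & 0xFFFFu);  /* bits 15..0  */
    if (f.imm & 0x8000)                /* bit 15 set: the value is negative */
        f.imm -= 0x10000;
    return f;
}
```

Applied to the first word of Figure A.1.2, which is 0x27BDFFE0 in hexadecimal, this gives opcode 9 (addiu), rs 29, rt 29, and immediate -32, which agrees with the first line of Figure A.1.3, addiu $29, $29, -32.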

Figure A.1.4 shows assembly language that labels memory addresses with
mnemonic names. Most programmers prefer to read and write this form. Names that
begin with a period, for example .data and .globl, are assembler directives
that tell the assembler how to translate a program but do not produce machine
instructions. Names followed by a colon, such as str: or main:, are labels that
name the next memory location. This program is as readable as most assembly
language programs (except for a glaring lack of comments), but it is still difficult
to follow, because many simple operations are required to accomplish simple tasks
and because assembly language’s lack of control flow constructs provides few hints
about the program’s operation.

By contrast, the C routine in Figure A.1.5 is both shorter and clearer, since
variables have mnemonic names and the loop is explicit rather than constructed with
branches. In fact, the C routine is the only one that we wrote. The other forms of
the program were produced by a C compiler and assembler.

assembler directive An operation that tells the assembler how to translate
a program but does not produce machine instructions; always begins with
a period.

00100111101111011111111111100000
10101111101111110000000000010100
10101111101001000000000000100000
10101111101001010000000000100100
10101111101000000000000000011000
10101111101000000000000000011100
10001111101011100000000000011100
10001111101110000000000000011000
00000001110011100000000000011001
00100101110010000000000000000001
00101001000000010000000001100101
10101111101010000000000000011100
00000000000000000111100000010010
00000011000011111100100000100001
00010100001000001111111111110111
10101111101110010000000000011000
00111100000001000001000000000000
10001111101001010000000000011000
00001100000100000000000011101100
00100100100001000000010000110000
10001111101111110000000000010100
00100111101111010000000000100000
00000011111000000000000000001000
00000000000000000001000000100001

FIGURE A.1.2 MIPS machine language code for a routine to compute and print the sum
of the squares of integers between 0 and 100.

In general, assembly language plays two roles (see Figure A.1.6). The first role
is the output language of compilers. A compiler translates a program written in a
high-level language (such as C or Pascal) into an equivalent program in machine or
assembly language. The high-level language is called the source language, and the
compiler’s output is its target language.

Assembly language’s other role is as a language in which to write programs. This
role used to be the dominant one. Today, however, because of larger main
memories and better compilers, most programmers write in a high-level language and
rarely, if ever, see the instructions that a computer executes. Nevertheless, assembly
language is still important to write programs in which speed or size is critical or to
exploit hardware features that have no analogues in high-level languages.

Although this appendix focuses on MIPS assembly language, assembly
programming on most other machines is very similar. The additional instructions and
address modes in CISC machines, such as the VAX, can make assembly programs
shorter but do not change the process of assembling a program or provide assembly
language with the advantages of high-level languages, such as type-checking and
structured control flow.

source language The high-level language in which a program is originally written.

addiu $29, $29, -32
sw $31, 20($29)
sw $4, 32($29)
sw $5, 36($29)
sw $0, 24($29)
sw $0, 28($29)
lw $14, 28($29)
lw $24, 24($29)
multu $14, $14
addiu $8, $14, 1
slti $1, $8, 101
sw $8, 28($29)
mflo $15
addu $25, $24, $15
bne $1, $0, -9
sw $25, 24($29)
lui $4, 4096
lw $5, 24($29)
jal 1048812
addiu $4, $4, 1072
lw $31, 20($29)
addiu $29, $29, 32
jr $31
move $2, $0

FIGURE A.1.3 The same routine as in Figure A.1.2 written in assembly language. However,
the code for the routine does not label registers or memory locations or include comments.


FIGURE A.1.4 The same routine as in Figure A.1.2 written in assembly language with
labels, but no comments. The commands that start with periods are assembler directives (see pages
A-47–49). .text indicates that succeeding lines contain instructions. .data indicates that they contain
data. .align n indicates that the items on the succeeding lines should be aligned on a 2^n byte boundary.
Hence, .align 2 means the next item should be on a word boundary. .globl main declares that main is
a global symbol that should be visible to code stored in other files. Finally, .asciiz stores a null-terminated
string in memory.

When to Use Assembly Language

The primary reason to program in assembly language, as opposed to an available
high-level language, is that the speed or size of a program is critically important.
For example, consider a computer that controls a piece of machinery, such as a
car’s brakes. A computer that is incorporated in another device, such as a car, is
called an embedded computer. This type of computer needs to respond rapidly
and predictably to events in the outside world. Because a compiler introduces
uncertainty about the time cost of operations, programmers may find it difficult
to ensure that a high-level language program responds within a definite time
interval (say, 1 millisecond after a sensor detects that a tire is skidding). An
assembly language programmer, on the other hand, has tight control over which
instructions execute. In addition, in embedded applications, reducing a program’s
size, so that it fits in fewer memory chips, reduces the cost of the embedded
computer.

A hybrid approach, in which most of a program is written in a high-level
language and time-critical sections are written in assembly language, builds on the
strengths of both languages. Programs typically spend most of their time executing
a small fraction of the program’s source code. This observation is just the principle
of locality that underlies caches (see Section 5.1 in Chapter 5).

Program profiling measures where a program spends its time and can find the
time-critical parts of a program. In many cases, this portion of the program can
be made faster with better data structures or algorithms. Sometimes, however,
significant performance improvements only come from recoding a critical portion of
a program in assembly language.

#include <stdio.h>

int
main (int argc, char *argv[])
{
    int i;
    int sum = 0;

    for (i = 0; i <= 100; i = i + 1) sum = sum + i * i;
    printf ("The sum from 0 .. 100 is %d\n", sum);
}

FIGURE A.1.5 The routine in Figure A.1.2 written in the C programming language.

[Figure A.1.6: diagram with boxes labeled High-level language program, Compiler,
Assembly language program, Assembler, Linker, and Computer.]

FIGURE A.1.6 Assembly language either is written by a programmer or is the output of a compiler.

This improvement is not necessarily an indication that the high-level language’s
compiler has failed. Compilers typically are better than programmers at producing
uniformly high-quality machine code across an entire program. Programmers,
however, understand a program’s algorithms and behavior at a deeper level than a
compiler and can expend considerable effort and ingenuity improving small
sections of the program. In particular, programmers often consider several
procedures simultaneously while writing their code. Compilers typically compile each
procedure in isolation and must follow strict conventions governing the use of
registers at procedure boundaries. By retaining commonly used values in registers,
even across procedure boundaries, programmers can make a program run faster.

Another major advantage of assembly language is the ability to exploit specialized
instructions, for example, string copy or pattern-matching instructions.
Compilers, in most cases, cannot determine that a program loop can be replaced
by a single instruction. However, the programmer who wrote the loop can replace
it easily with a single instruction.

Currently, a programmer’s advantage over a compiler has become difficult to
maintain as compilation techniques improve and machines’ pipelines increase in
complexity (Chapter 4).

The final reason to use assembly language is that no high-level language is
available on a particular computer. Many older or specialized computers do not
have a compiler, so a programmer’s only alternative is assembly language.
Drawbacks of Assembly Language

Assembly language has many disadvantages that strongly argue against its
widespread use. Perhaps its major disadvantage is that programs written in assembly
language are inherently machine-specific and must be totally rewritten to run on
another computer architecture. The rapid evolution of computers discussed in
Chapter 1 means that architectures become obsolete. An assembly language
program remains tightly bound to its original architecture, even after the computer is
eclipsed by new, faster, and more cost-effective machines.

Another disadvantage is that assembly language programs are longer than the
equivalent programs written in a high-level language. For example, the C program
in Figure A.1.5 is 11 lines long, while the assembly program in Figure A.1.4 is
31 lines long. In more complex programs, the ratio of assembly to high-level
language (its expansion factor) can be much larger than the factor of three in this
example. Unfortunately, empirical studies have shown that programmers write
roughly the same number of lines of code per day in assembly as in high-level
languages. This means that programmers are roughly x times more productive in a
high-level language, where x is the assembly language expansion factor.

To compound the problem, longer programs are more difficult to read and
understand, and they contain more bugs. Assembly language exacerbates the
problem because of its complete lack of structure. Common programming idioms, such
as if-then statements and loops, must be built from branches and jumps. The
resulting programs are hard to read, because the reader must reconstruct every
higher-level construct from its pieces and each instance of a statement may be
slightly different. For example, look at Figure A.1.4 and answer these questions:
What type of loop is used? What are its lower and upper bounds?
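The point about loops being built from branches can be mimicked in C itself. The sketch below is an illustration only, not a transcription of Figure A.1.4: it writes a sum-of-squares loop twice, once with a label and a conditional `goto` standing in for a branch instruction, and once with a structured `for` loop. The function names are made up for this example.

```c
/* Sum of the squares of 0..n, written the way assembly code must express it:
   no for or while, only a label and a conditional jump back to that label. */
int sum_of_squares_goto(int n)
{
    int i = 0;
    int sum = 0;
loop:
    sum = sum + i * i;
    i = i + 1;
    if (i <= n) goto loop;   /* plays the role of a conditional branch */
    return sum;
}

/* The same computation with a structured loop, for comparison. */
int sum_of_squares_for(int n)
{
    int sum = 0;
    for (int i = 0; i <= n; i = i + 1)
        sum = sum + i * i;
    return sum;
}
```

In the `goto` version the reader has to work out that the test sits at the bottom of the loop and that the bounds are 0 and n; the `for` version states both on one line.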
Elaboration: Compilers can produce machine language directly instead of relying
on an assembler. These compilers typically execute much faster than those that
invoke an assembler as part of compilation. However, a compiler that generates
machine language must perform many tasks that an assembler normally handles,
such as resolving addresses and encoding instructions as binary numbers. The
tradeoff is between compilation speed and compiler simplicity.

Elaboration: Despite these considerations, some embedded applications are
written in a high-level language. Many of these applications are large and complex
programs that must be extremely reliable. Assembly language programs are longer
and more difficult to write and read than high-level language programs. This greatly
increases the cost of writing an assembly language program and makes it extremely
difficult to verify the correctness of this type of program. In fact, these considerations
led the US Department of Defense, which pays for many complex embedded systems,
to develop Ada, a new high-level language for writing embedded systems.

A.2 Assemblers

An assembler translates a file of assembly language statements into a file of binary
machine instructions and binary data. The translation process has two major parts.
The first step is to find memory locations with labels so that the relationship
between symbolic names and addresses is known when instructions are translated.
The second step is to translate each assembly statement by combining the numeric
equivalents of opcodes, register specifiers, and labels into a legal instruction. As
shown in Figure A.1.1, the assembler produces an output file, called an object file,
which contains the machine instructions, data, and bookkeeping information. An
object file typically cannot be executed, because it references procedures or data in
other files.
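The second step, combining numeric fields into a legal instruction, can be sketched in a few lines of C. The example below packs the six fields of the MIPS R-format (opcode, rs, rt, rd, shamt, funct) into a 32-bit word; the function name `encode_rformat` is an assumption made for illustration, and the other instruction formats are left out.

```c
#include <stdint.h>

/* Pack the six fields of a MIPS R-format instruction into one 32-bit word:
   opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6).
   Each field is masked to its width before being shifted into place. */
uint32_t encode_rformat(unsigned opcode, unsigned rs, unsigned rt,
                        unsigned rd, unsigned shamt, unsigned funct)
{
    return ((uint32_t)(opcode & 0x3Fu) << 26) |
           ((uint32_t)(rs     & 0x1Fu) << 21) |
           ((uint32_t)(rt     & 0x1Fu) << 16) |
           ((uint32_t)(rd     & 0x1Fu) << 11) |
           ((uint32_t)(shamt  & 0x1Fu) <<  6) |
            (uint32_t)(funct  & 0x3Fu);
}
```

For addu $25, $24, $15 from Figure A.1.3 (opcode 0, rs 24, rt 15, rd 25, shamt 0, funct 0x21), the call `encode_rformat(0, 24, 15, 25, 0, 0x21)` produces 0x030FC821, which is the fourteenth word of Figure A.1.2 written in hexadecimal.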
A label is external (also called global) if the labeled object can be referenced from files other than the one in which it is defined. A label is local if the object can be used only within the file in which it is defined. In most assemblers, labels are local by default and must be explicitly declared global. Subroutines and global variables require external labels since they are referenced from many files in a program. Local labels hide names that should not be visible to other modules (for example, static functions in C, which can only be called by other functions in the same file). In addition, compiler-generated names, such as a name for the instruction at the beginning of a loop, are local so that the compiler need not produce unique names in every file.

external label: Also called global label. A label referring to an object that can be referenced from files other than the one in which it is defined.

local label: A label referring to an object that can be used only within the file in which it is defined.

EXAMPLE: Local and Global Labels

Consider the program in Figure A.1.4. The subroutine has an external (global) label main. It also contains two local labels, loop and str, that are only visible within this assembly language file. Finally, the routine also contains an unresolved reference to an external label printf, which is the library routine that prints values. Which labels in Figure A.1.4 could be referenced from another file?

ANSWER

Only global labels are visible outside a file, so the only label that could be referenced from another file is main.

Since the assembler processes each file in a program individually and in isolation, it only knows the addresses of local labels. The assembler depends on another tool, the linker, to combine a collection of object files and libraries into an executable file by resolving external labels. The assembler assists the linker by providing lists of labels and unresolved references.

However, even local labels present an interesting challenge to an assembler. Unlike names in most high-level languages, assembly labels may be used before they are defined. In the example in Figure A.1.4, the label str is used by the la instruction before it is defined. The possibility of a forward reference, like this one, forces an assembler to translate a program in two steps: first find all labels and then produce instructions. In the example, when the assembler sees the la instruction, it does not know where the word labeled str is located or even whether str labels an instruction or datum.

forward reference: A label that is used before it is defined.

An assembler's first pass reads each line of an assembly file and breaks it into its component pieces. These pieces, which are called lexemes, are individual words, numbers, and punctuation characters. For example, the line

    ble $t0, 100, loop

contains six lexemes: the opcode ble, the register specifier $t0, a comma, the number 100, a comma, and the symbol loop.

If a line begins with a label, the assembler records in its symbol table the name of the label and the address of the memory word that the instruction occupies. The assembler then calculates how many words of memory the instruction on the current line will occupy. By keeping track of the instructions' sizes, the assembler can determine where the next instruction goes. To compute the size of a variable-length instruction, like those on the VAX, an assembler has to examine it in detail. However, fixed-length instructions, like those on MIPS, require only a cursory examination. The assembler performs a similar calculation to compute the space required for data statements. When the assembler reaches the end of an assembly file, the symbol table records the location of each label defined in the file.
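The first pass described above can be sketched in a few lines of Python. This is only an illustration under loudly simplifying assumptions: every instruction and every .word statement is taken to occupy exactly one 4-byte word (real pseudoinstructions such as la may expand to more), assembly starts at address 0, and the names lexemes, first_pass, and program are hypothetical.

```python
import re

# Assumption: each instruction or .word occupies one 4-byte word.
WORD_SIZE = 4
# A lexeme is a run of name/number characters or one punctuation mark.
TOKEN = re.compile(r"[$.\w]+|[,():]")


def lexemes(line):
    """Split one source line into lexemes, dropping any '#' comment."""
    code = line.split("#", 1)[0]
    return TOKEN.findall(code)


def first_pass(lines, start=0):
    """Return a symbol table mapping each label to its address."""
    symbols = {}
    address = start
    for line in lines:
        toks = lexemes(line)
        # A leading "name:" defines a label at the current address.
        if len(toks) >= 2 and toks[1] == ":":
            symbols[toks[0]] = address
            toks = toks[2:]
        # Directives other than .word take no space in this sketch.
        if toks and (not toks[0].startswith(".") or toks[0] == ".word"):
            address += WORD_SIZE
    return symbols


program = [
    "       .text",
    "main:  li   $t0, 0",
    "loop:  addi $t0, $t0, 1",
    "       ble  $t0, 100, loop",
    "       la   $a0, str      # forward reference to str",
    "       jr   $ra",
    "str:   .word 0",
]

print(lexemes("ble $t0, 100, loop"))
print(first_pass(program))  # {'main': 0, 'loop': 4, 'str': 20}
```

Because the table is complete before any instruction is encoded, the forward reference to str causes no difficulty: by the time a second pass needs its address, the entry is already present.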
The assembler uses the information in the symbol table during a second pass over the file, which actually produces machine code. The assembler again examines each line in the file. If the line contains an instruction, the assembler combines the binary representations of its opcode and operands (register specifiers or memory address) into a legal instruction. The process is similar to the one used in Section 2.5 in Chapter 2. Instructions and data words that reference an external symbol defined in another file cannot be completely assembled (they are unresolved), since the symbol's address is not in the symbol table. An assembler does not complain about unresolved references, since the corresponding label is likely to be defined in another file.

symbol table: A table that matches names of labels to the addresses of the memory words that instructions occupy.

The BIG Picture

Assembly language is a programming language. Its principal difference from high-level languages such as BASIC, Java, and C is that assembly language provides only a few, simple types of data and control flow. Assembly language programs do not specify the type of value held in a variable. Instead, a programmer must apply the appropriate operations (e.g., integer or floating-point addition) to a value. In addition, in assembly language, programs must implement all control flow with go tos. Both factors make assembly language programming for any machine, MIPS or x86, more difficult and error-prone than writing in a high-level language.

Elaboration: If an assembler's speed is important, this two-step process can be done in one pass over the assembly file with a technique known as backpatching. In its pass over the file, the assembler builds a (possibly incomplete) binary representation of every instruction. If the instruction references a label that has not yet been defined, the assembler records the label and instruction in a table.
When a label is defined, the assembler consults this table to find all instructions that contain a forward reference to the label. The assembler goes back and corrects their binary representation to incorporate the address of the label. Backpatching speeds assembly because the assembler only reads its input once. However, it requires an assembler to hold the entire binary representation of a program in memory so instructions can be backpatched. This requirement can limit the size of programs that can be assembled. The process is complicated by machines with several types of branches that span different ranges of instructions. When the assembler first sees an unresolved label in a branch instruction, it must either use the largest possible branch or risk having to go back and readjust many instructions to make room for a larger branch.

Object File Format

Assemblers produce object files. An object file on UNIX contains six distinct sections (see Figure A.2.1):

■ The object file header describes the size and position of the other pieces of the file.
■ The text segment contains the machine language code for routines in the source file. These routines may be unexecutable because of unresolved references.
■ The data segment contains a binary representation of the data in the source file. The data also may be incomplete because of unresolved references to labels in other files.
■ The relocation information identifies instructions and data words that depend on absolute addresses. These references must change if portions of the program are moved in memory.
■ The symbol table associates addresses with external labels in the source file and lists unresolved references.
■ The debugging information contains a concise description of the way the program was compiled, so a debugger can find which instruction addresses correspond to lines in a source file and print the data structures in readable form.
The assembler produces an object file that contains a binary representation of the program and data and additional information to help link pieces of a program. This relocation information is necessary because the assembler does not know which memory locations a procedure or piece of data will occupy after it is linked with the rest of the program. Procedures and data from a file are stored in a contiguous piece of memory, but the assembler does not know where this memory will be located. The assembler also passes some symbol table entries to the linker. In particular, the assembler must record which external symbols are defined in a file and what unresolved references occur in a file.

backpatching: A method for translating from assembly language to machine instructions in which the assembler builds a (possibly incomplete) binary representation of every instruction in one pass over a program and then returns to fill in previously undefined labels.

text segment: The segment of a UNIX object file that contains the machine language code for routines in the source file.

data segment: The segment of a UNIX object or executable file that contains a binary representation of the initialized data used by the program.

relocation information: The segment of a UNIX object file that identifies instructions and data words that depend on absolute addresses.

absolute address: A variable's or routine's actual address in memory.

Elaboration: For convenience, assemblers assume each file starts at the same address (for example, location 0) with the expectation that the linker will relocate the code and data when they are assigned locations in memory. The assembler produces relocation information, which contains an entry describing each instruction or data word in the file that references an absolute address.
On MIPS, only the subroutine call, load, and store instructions reference absolute addresses. Instructions that use PC-relative addressing, such as branches, need not be relocated.

FIGURE A.2.1 Object file. A UNIX assembler produces an object file with six distinct sections: object file header, text segment, data segment, relocation information, symbol table, and debugging information.

Additional Facilities

Assemblers provide a variety of convenience features that help make assembler programs shorter and easier to write, but do not fundamentally change assembly language. For example, data layout directives allow a programmer to describe data in a more concise and natural manner than its binary representation.

In Figure A.1.4, the directive

    .asciiz "The sum from 0 .. 100 is %d\n"

stores characters from the string in memory. Contrast this line with the alternative of writing each character as its ASCII value (Figure 2.15 in Chapter 2 describes the ASCII encoding for characters):

    .byte 84, 104, 101, 32, 115, 117, 109, 32
    .byte 102, 114, 111, 109, 32, 48, 32, 46
    .byte 46, 32, 49, 48, 48, 32, 105, 115
    .byte 32, 37, 100, 10, 0

The .asciiz directive is easier to read because it represents characters as letters, not binary numbers. An assembler can translate characters to their binary representation much faster and more accurately than a human can. Data layout directives specify data in a human-readable form that the assembler translates to binary. Other layout directives are described in Section A.10.
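The translation that .asciiz performs can be reproduced with a short Python sketch. The function names asciiz_bytes and as_byte_directives are hypothetical, and the grouping of eight values per line simply mirrors the listing above; an assembler emits the bytes directly rather than text.

```python
def asciiz_bytes(text):
    """Return the ASCII byte values of text followed by a terminating 0."""
    return list(text.encode("ascii")) + [0]


def as_byte_directives(values, per_line=8):
    """Format byte values as .byte lines with per_line values on each."""
    lines = []
    for i in range(0, len(values), per_line):
        chunk = values[i:i + per_line]
        lines.append(".byte " + ", ".join(str(v) for v in chunk))
    return lines


for line in as_byte_directives(asciiz_bytes("The sum from 0 .. 100 is %d\n")):
    print(line)
```

Running the sketch on the string from Figure A.1.4 prints the same four .byte lines shown above, including the final 0 that .asciiz appends.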
EXAMPLE: String Directive

Define the sequence of bytes produced by this directive:

    .asciiz "The quick brown fox jumps over the lazy dog"

ANSWER

    .byte 84, 104, 101, 32, 113, 117, 105, 99
    .byte 107, 32, 98, 114, 111, 119, 110, 32
    .byte 102, 111, 120, 32, 106, 117, 109, 112
    .byte 115, 32, 111, 118, 101, 114, 32, 116
    .byte 104, 101, 32, 108, 97, 122, 121, 32
    .byte 100, 111, 103, 0

Macros are a pattern-matching and replacement facility that provides a simple mechanism to name a frequently used sequence of instructions. Instead of repeatedly typing the same instructions every time they are used, a programmer invokes the macro and the assembler replaces the macro call with the corresponding sequence of instructions. Macros, like subroutines, permit a programmer to create and name a new abstraction for a common operation. Unlike subroutines, however, macros do not cause a subroutine call and return when the program runs, since a macro call is replaced by the macro's body when the program is assembled. After this replacement, the resulting assembly is indistinguishable from the equivalent program written without macros.

EXAMPLE: Macros

As an example, suppose that a programmer needs to print many numbers. The library routine printf accepts a format string and one or more values to print as its arguments. A programmer could print the integer in register $7 with the following instructions:

             .data
    int_str: .asciiz "%d"
             .text
             la   $a0, int_str   # Load string address into first arg
             move $a1, $7        # Load value into second arg
             jal  printf         # Call the printf routine

The .data directive tells the assembler to store the string in the program's data segment, and the .text directive tells the assembler to store the instructions in its text segment.

However, printing many numbers in this fashion is tedious and produces a verbose program that is difficult to understand.
An alternative is to introduce a macro, print_int, to print an integer:

             .data
    int_str: .asciiz "%d"
             .text
             .macro print_int($arg)
             la   $a0, int_str   # Load string address into first arg
             move $a1, $arg      # Load macro's parameter ($arg) into second arg
             jal  printf         # Call the printf routine
             .end_macro
             print_int($7)

The macro has a formal parameter, $arg, that names the argument to the macro. When the macro is expanded, the argument from a call is substituted for the formal parameter throughout the macro's body. Then the assembler replaces the call with the macro's newly expanded body. In the first call on print_int, the argument is $7, so the macro expands to the code

    la   $a0, int_str
    move $a1, $7
    jal  printf

In a second call on print_int, say, print_int($t0), the argument is $t0, so the macro expands to

    la   $a0, int_str
    move $a1, $t0
    jal  printf

What does the call print_int($a0) expand to?

ANSWER

    la   $a0, int_str
    move $a1, $a0
    jal  printf

This example illustrates a drawback of macros. A programmer who uses this macro must be aware that print_int uses register $a0 and so cannot correctly print the value in that register.

formal parameter: A variable that is the argument to a procedure or macro; it is replaced by that argument once the macro is expanded.

Hardware/Software Interface

Some assemblers also implement pseudoinstructions, which are instructions provided by an assembler but not implemented in hardware. Chapter 2 contains many examples of how the MIPS assembler synthesizes pseudoinstructions and addressing modes from the spartan MIPS hardware instruction set. For example, Section 2.7 in Chapter 2 describes how the assembler synthesizes the blt instruction from two other instructions: slt and bne. By extending the instruction set, the MIPS assembler makes assembly language programming easier without complicating the hardware.
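The synthesis of blt from slt and bne mentioned above can be pictured as a small rewriting step inside the assembler. The Python sketch below is a hypothetical illustration, not the MIPS assembler's actual code: the function name expand_pseudo is invented, only blt is handled, and the scratch register is assumed to be $at, the register reserved for the assembler.

```python
def expand_pseudo(op, *operands):
    """Expand one pseudoinstruction into hardware instructions.

    Only blt is expanded in this sketch; any other opcode is returned
    unchanged as a single instruction.
    """
    if op == "blt":  # branch to label if rs < rt
        rs, rt, label = operands
        return [
            f"slt $at, {rs}, {rt}",      # $at = 1 when rs < rt, else 0
            f"bne $at, $zero, {label}",  # branch when $at is nonzero
        ]
    return [f"{op} " + ", ".join(operands)]


for instruction in expand_pseudo("blt", "$t0", "$t1", "loop"):
    print(instruction)
```

Because the expansion writes $at, a program that kept its own value in that register would see it overwritten, which is one reason the register is set aside for the assembler.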
Many pseudoinstructions could also be simulated with macros, but the MIPS assembler can generate better code for these instructions because it can use a dedicated register ($at) and is able to optimize the generated code.

Elaboration: Assemblers conditionally assemble pieces of code, which permits a programmer to include or exclude groups of instructions when a program is assembled. This feature is particularly useful when several versions of a program differ by a small amount. Rather than keep these programs in separate files, which greatly complicates fixing bugs in the common code, programmers typically merge the versions into a single file. Code particular to one version is conditionally assembled, so it can be excluded when other versions of the program are assembled.

If macros and conditional assembly are useful, why do assemblers for UNIX systems rarely, if ever, provide them? One reason is that most programmers on these systems write programs in higher-level languages like C. Most of the assembly code is produced by compilers, which find it more convenient to repeat code rather than define macros. Another reason is that other tools on UNIX (such as cpp, the C preprocessor, or m4, a general macro processor) can provide macros and conditional assembly for assembly language programs.

A.3 Linkers

Separate compilation permits a program to be split into pieces that are stored in different files. Each file contains a logically related collection of subroutines and data structures that form a module in a larger program. A file can be compiled and assembled independently of other files, so changes to one module do not require recompiling the entire program. As we discussed above, separate compilation necessitates the additional step of linking to combine object files from separate modules and fixing their unresolved references.
The tool that merges these files is the linker (see Figure A.3.1). It performs three tasks:

■ Searches the program libraries to find library routines used by the program
■ Determines the memory locations that code from each module will occupy and relocates its instructions by adjusting absolute references
■ Resolves references among files

A linker's first task is to ensure that a program contains no undefined labels. The linker matches the external symbols and unresolved references from a program's files. An external symbol in one file resolves a reference from another file if both refer to a label with the same name. Unmatched references mean a symbol was used but not defined anywhere in the program.

Unresolved references at this stage in the linking process do not necessarily mean a programmer made a mistake. The program could have referenced a library routine whose code was not in the object files passed to the linker. After matching symbols in the program, the linker searches the system's program libraries to find predefined subroutines and data structures that the program references. The basic libraries contain routines that read and write data, allocate and deallocate memory, and perform numeric operations. Other libraries contain routines to access a database or manipulate terminal windows. A program that references an unresolved symbol that is not in any library is erroneous and cannot be linked. When the program uses a library routine, the linker extracts the routine's code from the library and incorporates it into the program text segment. This new routine, in turn, may depend on other library routines, so the linker continues to fetch other library routines until no external references are unresolved or a routine cannot be found.

If all external references are resolved, the linker next determines the memory locations that each module will occupy.
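The symbol matching and library search described above amount to a loop that keeps pulling in library modules until nothing is left unresolved. The Python sketch below illustrates that loop under stated assumptions: each module is reduced to a pair (defined symbols, referenced symbols), which is a hypothetical stand-in for a real object file's symbol table, and the names resolve, objects, and library are invented for the example.

```python
def resolve(objects, library):
    """Return the sorted names of all modules needed to link the program.

    objects and library each map a module name to a pair
    (defined_symbols, referenced_symbols).
    """
    linked = dict(objects)
    while True:
        defined = set()
        referenced = set()
        for defs, refs in linked.values():
            defined |= set(defs)
            referenced |= set(refs)
        unresolved = referenced - defined
        if not unresolved:
            return sorted(linked)
        # Pull in every library module that defines a missing symbol.
        progress = False
        for name, (defs, refs) in library.items():
            if name not in linked and unresolved & set(defs):
                linked[name] = (defs, refs)
                progress = True
        if not progress:
            # No library supplies the symbol: the program cannot be linked.
            raise LookupError("undefined: " + ", ".join(sorted(unresolved)))


objects = {"main.o": ({"main"}, {"printf"})}
library = {
    "printf.o": ({"printf"}, {"write"}),
    "write.o": ({"write"}, set()),
    "sqrt.o": ({"sqrt"}, set()),
}
print(resolve(objects, library))  # ['main.o', 'printf.o', 'write.o']
```

In the usage example, printf.o is fetched for main.o and then brings in write.o in turn, while sqrt.o is never referenced and stays out of the program.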
Since the files were assembled in isolation, the assembler could not know where a module's instructions or data would be placed relative to other modules. When the linker places a module in memory, all absolute references must be relocated to reflect its true location. Since the linker has relocation information that identifies all relocatable references, it can efficiently find and backpatch these references.

separate compilation: Splitting a program across many files, each of which can be compiled without knowledge of what is in the other files.

The linker produces an executable file that can run on a computer. Typically, this file has the same format as an object file, except that it contains no unresolved references or relocation information.

FIGURE A.3.1 The linker searches a collection of object files and program libraries to find nonlocal routines used in a program, combines them into a single executable file, and resolves references between routines in different files.

A.4 Loading

A program that links without an error can be run. Before being run, the program resides in a file on secondary storage, such as a disk. On UNIX systems, the operating system kernel brings a program into memory and starts it running. To start a program, the operating system performs the following steps:

1. It reads the executable file's header to determine the size of the text and data segments.
2. It creates a new address space for the program. This address space is large enough to hold the text and data segments, along with a stack segment (see Section A.5).
3. It copies instructions and data from the executable file into the new address space.
4. It copies arguments passed to the program onto the stack.
5. It initializes the machine registers.
In general, most registers are cleared, but the stack pointer must be assigned the address of the first free stack location (see Section A.5).
6. It jumps to a start-up routine that copies the program's arguments from the stack to registers and calls the program's main routine. If the main routine returns, the start-up routine terminates the program with the exit system call.

A.5 Memory Usage

The next few sections elaborate the description of the MIPS architecture presented earlier in the book. Earlier chapters focused primarily on hardware and its relationship with low-level software. These sections focus primarily on how assembly language programmers use MIPS hardware. These sections describe a set of conventions followed on many MIPS systems. For the most part, the hardware does not impose these conventions. Instead, they represent an agreement among programmers to follow the same set of rules so that software written by different people can work together and make effective use of MIPS hardware.

Systems based on MIPS processors typically divide memory into three parts (see Figure A.5.1). The first part, near the bottom of the address space (starting at address 400000hex), is the text segment, which holds the program's instructions. The second part, above the text segment, is the data segment, which is further divided into two parts. Static data (starting at address 10000000hex) contains objects whose size is known to the compiler and whose lifetime (the interval during which a program can access them) is the program's entire execution. For example, in C, global variables are statically allocated, since they can be referenced anytime during a program's execution. The linker both assigns static objects to locations in the data segment and resolves references to these objects.

static data: The portion of memory that contains data whose size is known to the compiler and whose lifetime is the program's entire execution.

FIGURE A.5.1 Layout of memory. From the top of the address space down: the stack segment (starting at 7fffffffhex), the data segment with dynamic data above static data (starting at 10000000hex), the text segment (starting at 400000hex), and a reserved region.

Immediately above static data is dynamic data. This data, as its name implies, is allocated by the program as it executes. In C programs, the malloc library routine finds and returns a new block of memory. Since a compiler cannot predict how much memory a program will allocate, the operating system expands the dynamic data area to meet demand.

Hardware/Software Interface

Because the data segment begins far above the program at address 10000000hex, load and store instructions cannot directly reference data objects with their 16-bit offset fields (see Section 2.5 in Chapter 2). For example, to load the word in the data segment at address 10010020hex into register $v0 requires two instructions:

    lui $s0, 0x1001        # 0x1001 means 1001 base 16
    lw  $v0, 0x0020($s0)   # 0x10010000 + 0x0020 = 0x10010020

(The 0x before a number means that it is a hexadecimal value. For example, 0x8000 is 8000hex or 32,768ten.)

To avoid repeating the lui instruction at every load and store, MIPS systems typically dedicate a register ($gp) as a global pointer to the static data segment. This register contains address 10008000hex, so load and store instructions can use their signed 16-bit offset fields to access the first 64 KB of the static data segment. With this global pointer, a word in that region can be loaded with a single instruction. For example, the word at address 10000020hex is loaded by

    lw $v0, 0x8020($gp)    # 0x10008000 + sign-extended 0x8020 (-0x7fe0) = 0x10000020

Address 10010020hex from the two-instruction example lies just above this 64 KB window, so it still requires the lui and lw pair.

Of course, a global pointer register makes addressing locations 10000000hex–10010000hex faster than other heap locations. The MIPS compiler usually stores global variables in this area, because these variables have fixed locations and fit better than other global data, such as arrays.
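The address arithmetic behind global-pointer addressing can be checked with a short Python sketch. It assumes only what the text states, that $gp holds 10008000hex and that the 16-bit offset field of a load or store is sign-extended; the function names are hypothetical.

```python
GP = 0x10008000  # value of $gp given in the text


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, offset16):
    """Address formed by a load or store: base plus sign-extended offset."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


def gp_offset(address):
    """Return the 16-bit offset field that reaches address from $gp,
    or None when the address lies outside the 64 KB window."""
    delta = address - GP
    if -0x8000 <= delta <= 0x7FFF:
        return delta & 0xFFFF
    return None


print(hex(effective_address(0x10010000, 0x0020)))  # lui/lw pair
print(hex(effective_address(GP, 0x8020)))          # single lw through $gp
print(gp_offset(0x10010020))                       # None: out of range
```

Under sign extension the window reachable from $gp runs from 10000000hex to 1000ffffhex, so the offset 0x8020 reaches 10000020hex, while 10010020hex falls outside the window and needs the two-instruction sequence.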
As the upward arrow in the figure indicates, malloc expands the dynamic area with the sbrk system call, which causes the operating system to add more pages to the program's virtual address space (see Section 5.7 in Chapter 5) immediately above the dynamic data segment.

The third part, the program stack segment, resides at the top of the virtual address space (starting at address 7fffffffhex). Like dynamic data, the maximum size of a program's stack is not known in advance. As the program pushes values onto the stack, the operating system expands the stack segment down toward the data segment.

This three-part division of memory is not the only possible one. However, it has two important characteristics: the two dynamically expandable segments are as far apart as possible, and they can grow to use a program's entire address space.

A.6 Procedure Call Convention

Conventions governing the use of registers are necessary when procedures in a program are compiled separately. To compile a particular procedure, a compiler must know which registers it may use and which registers are reserved for other procedures. Rules for using registers are called register use or procedure call conventions. As the name implies, these rules are, for the most part, conventions followed by software rather than rules enforced by hardware. However, most compilers and programmers try very hard to follow these conventions because violating them causes insidious bugs.

The calling convention described in this section is the one used by the gcc compiler. The native MIPS compiler uses a more complex convention that is slightly faster.

The MIPS CPU contains 32 general-purpose registers that are numbered 0–31. Register $0 always contains the hardwired value 0.

■ Registers $at (1), $k0 (26), and $k1 (27) are reserved for the assembler and operating system and should not be used by user programs or compilers.
■ Registers $a0–$a3 (4–7) are used to pass the first four arguments to routines (remaining arguments are passed on the stack). Registers $v0 and $v1 (2, 3) are used to return values from functions.
■ Registers $t0–$t9 (8–15, 24, 25) are caller-saved registers that are used to hold temporary quantities that need not be preserved across calls (see Section 2.8 in Chapter 2).
■ Registers $s0–$s7 (16–23) are callee-saved registers that hold long-lived values that should be preserved across calls.
■ Register $gp (28) is a global pointer that points to the middle of a 64K block of memory in the static data segment.
■ Register $sp (29) is the stack pointer, which points to the last location on the stack. Register $fp (30) is the frame pointer. The jal instruction writes register $ra (31), the return address from a procedure call. These two registers are explained in the next section.

stack segment: The portion of memory used by a program to hold procedure call frames.

register use convention: Also called procedure call convention. A software protocol governing the use of registers by procedures.

The two-letter abbreviations and names for these registers (for example, $sp for the stack pointer) reflect the registers' intended uses in the procedure call convention. In describing this convention, we will use the names instead of register numbers. Figure A.6.1 lists the registers and describes their intended uses.

Procedure Calls

This section describes the steps that occur when one procedure (the caller) invokes another procedure (the callee). Programmers who write in a high-level language (like C or Pascal) never see the details of how one procedure calls another, because the compiler takes care of this low-level bookkeeping. However, assembly language programmers must explicitly implement every procedure call and return.

Most of the bookkeeping associated with a call is centered around a block of memory called a procedure call frame.
This memory is used for a variety of purposes:

■ To hold values passed to a procedure as arguments
■ To save registers that a procedure may modify, but which the procedure's caller does not want changed
■ To provide space for variables local to a procedure

In most programming languages, procedure calls and returns follow a strict last-in, first-out (LIFO) order, so this memory can be allocated and deallocated on a stack, which is why these blocks of memory are sometimes called stack frames.

caller-saved register: A register saved by the routine making a procedure call.

callee-saved register: A register saved by the routine being called.

procedure call frame: A block of memory that is used to hold values passed to a procedure as arguments, to save registers that a procedure may modify but that the procedure's caller does not want changed, and to provide space for variables local to a procedure.

Figure A.6.2 shows a typical stack frame. The frame consists of the memory between the frame pointer ($fp), which points to the first word of the frame, and the stack pointer ($sp), which points to the last word of the frame. The stack grows down from higher memory addresses, so the frame pointer points above the stack pointer. The executing procedure uses the frame pointer to quickly access values in its stack frame.
For example, an argument in the stack frame can be loaded into register $v0 with the instruction

    lw $v0, 0($fp)

Register name   Number   Usage
$zero           0        constant 0
$at             1        reserved for assembler
$v0             2        expression evaluation and results of a function
$v1             3        expression evaluation and results of a function
$a0             4        argument 1
$a1             5        argument 2
$a2             6        argument 3
$a3             7        argument 4
$t0             8        temporary (not preserved across call)
$t1             9        temporary (not preserved across call)
$t2             10       temporary (not preserved across call)
$t3             11       temporary (not preserved across call)
$t4             12       temporary (not preserved across call)
$t5             13       temporary (not preserved across call)
$t6             14       temporary (not preserved across call)
$t7             15       temporary (not preserved across call)
$s0             16       saved temporary (preserved across call)
$s1             17       saved temporary (preserved across call)
$s2             18       saved temporary (preserved across call)
$s3             19       saved temporary (preserved across call)
$s4             20       saved temporary (preserved across call)
$s5             21       saved temporary (preserved across call)
$s6             22       saved temporary (preserved across call)
$s7             23       saved temporary (preserved across call)
$t8             24       temporary (not preserved across call)
$t9             25       temporary (not preserved across call)
$k0             26       reserved for OS kernel
$k1             27       reserved for OS kernel
$gp             28       pointer to global area
$sp             29       stack pointer
$fp             30       frame pointer
$ra             31       return address (used by function call)

FIGURE A.6.1 MIPS registers and usage convention.

A stack frame may be built in many different ways; however, the caller and callee must agree on the sequence of steps. The steps below describe the calling convention used on most MIPS machines. This convention comes into play at three points during a procedure call: immediately before the caller invokes the callee, just as the callee starts executing, and immediately before the callee returns to the caller. In the first part, the caller puts the procedure call arguments in standard places and invokes the callee to do the following:

1. Pass arguments.
By convention, the first four arguments are passed in registers $a0–$a3. Any remaining arguments are pushed on the stack and appear at the beginning of the called procedure's stack frame.
2. Save caller-saved registers. The called procedure can use these registers ($a0–$a3 and $t0–$t9) without first saving their value. If the caller expects to use one of these registers after a call, it must save its value before the call.
3. Execute a jal instruction (see Section 2.8 of Chapter 2), which jumps to the callee's first instruction and saves the return address in register $ra.

FIGURE A.6.2 Layout of a stack frame. The frame pointer ($fp) points to the first word in the currently executing procedure's stack frame. The stack pointer ($sp) points to the last word of the frame. The first four arguments are passed in registers, so the fifth argument is the first one stored on the stack. From higher to lower memory addresses, the figure shows argument 6, argument 5, saved registers, and local variables; the stack grows toward lower addresses.

Before a called routine starts running, it must take the following steps to set up its stack frame:

1. Allocate memory for the frame by subtracting the frame's size from the stack pointer.
2. Save callee-saved registers in the frame. A callee must save the values in these registers ($s0–$s7, $fp, and $ra) before altering them, since the caller expects to find these registers unchanged after the call. Register $fp is saved by every procedure that allocates a new stack frame. However, register $ra only needs to be saved if the callee itself makes a call. The other callee-saved registers that are used also must be saved.
3. Establish the frame pointer by adding the stack frame's size minus 4 to $sp and storing the sum in register $fp.
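The three set-up steps above can be traced with a small Python model. Everything here is an illustrative assumption: registers are a dictionary, memory is a dictionary from address to word, the frame is 32 bytes, and $ra and $fp are saved at offsets 20 and 16 from $sp, matching the example code later in this section. The function name enter_procedure is hypothetical.

```python
FRAME_SIZE = 32  # bytes; the frame size used by the example routines


def enter_procedure(regs, memory, frame_size=FRAME_SIZE):
    """Model a callee's frame set-up: allocate, save, establish $fp."""
    regs["$sp"] -= frame_size                   # 1. allocate the frame
    memory[regs["$sp"] + 20] = regs["$ra"]      # 2. save callee-saved
    memory[regs["$sp"] + 16] = regs["$fp"]      #    registers $ra, $fp
    regs["$fp"] = regs["$sp"] + frame_size - 4  # 3. $fp = first word


# Arbitrary starting values chosen only for the demonstration.
regs = {"$sp": 0x7FFFF000, "$fp": 0x7FFFF01C, "$ra": 0x00400018}
memory = {}
enter_procedure(regs, memory)
print(hex(regs["$sp"]), hex(regs["$fp"]))
```

After the call, $sp has moved down by 32 bytes and $fp sits 28 bytes above it, on the first word of the new frame, so the frame occupies the addresses from $sp through $fp + 3.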
Hardware/Software Interface

The MIPS register use convention provides callee- and caller-saved registers, because both types of registers are advantageous in different circumstances. Callee-saved registers are better used to hold long-lived values, such as variables from a user's program. These registers are only saved during a procedure call if the callee expects to use the register. On the other hand, caller-saved registers are better used to hold short-lived quantities that do not persist across a call, such as immediate values in an address calculation. During a call, the callee can also use these registers for short-lived temporaries.

Finally, the callee returns to the caller by executing the following steps:

1. If the callee is a function that returns a value, place the returned value in register $v0.
2. Restore all callee-saved registers that were saved upon procedure entry.
3. Pop the stack frame by adding the frame size to $sp.
4. Return by jumping to the address in register $ra.

recursive procedures: Procedures that call themselves either directly or indirectly through a chain of calls.

Elaboration: A programming language that does not permit recursive procedures (procedures that call themselves either directly or indirectly through a chain of calls) need not allocate frames on a stack. In a nonrecursive language, each procedure's frame may be statically allocated, since only one invocation of a procedure can be active at a time. Older versions of Fortran prohibited recursion, because statically allocated frames produced faster code on some older machines. However, on load-store architectures like MIPS, stack frames may be just as fast, because a frame pointer register points directly to the active stack frame, which permits a single load or store instruction to access values in the frame. In addition, recursion is a valuable programming technique.
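The Elaboration's point, that a statically allocated frame is safe only when one invocation is active at a time, can be demonstrated with a short Python sketch. The two functions are hypothetical models, not compiler output: fact_static keeps its argument in one shared frame, while fact_stack pushes a fresh frame for every call.

```python
static_frame = {"n": 0}  # one statically allocated frame shared by all calls


def fact_static(n):
    """Factorial using a single static frame; breaks under recursion."""
    static_frame["n"] = n
    if static_frame["n"] < 1:
        return 1
    # The recursive call overwrites static_frame["n"] before it is reused.
    result = fact_static(static_frame["n"] - 1)
    return static_frame["n"] * result


stack = []  # frames are pushed and popped in last-in, first-out order


def fact_stack(n):
    """Factorial that allocates a new frame on a stack for each call."""
    stack.append({"n": n})  # allocate this invocation's frame
    frame = stack[-1]
    if frame["n"] < 1:
        value = 1
    else:
        value = frame["n"] * fact_stack(frame["n"] - 1)
    stack.pop()             # pop the frame before returning
    return value


print(fact_static(10))  # 0, because every level sees the overwritten n
print(fact_stack(10))   # 3628800
```

With the shared frame, the innermost call leaves n at 0, so every multiplication on the way back uses 0; with one frame per invocation, each level still finds its own n after the call returns.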
Procedure Call Example

As an example, consider the C routine

main ()
{
printf ("The factorial of 10 is %d\n", fact (10));
}

int fact (int n)
{
if (n < 1)
return (1);
else
return (n * fact (n - 1));
}

which computes and prints 10! (the factorial of 10, 10! = 10 × 9 × . . . × 1). fact is a recursive routine that computes n! by multiplying n times (n - 1)!. The assembly code for this routine illustrates how programs manipulate stack frames.

Upon entry, the routine main creates its stack frame and saves the two callee-saved registers it will modify: $fp and $ra. The frame is larger than required for these two registers because the calling convention requires the minimum size of a stack frame to be 24 bytes. This minimum frame can hold four argument registers ($a0–$a3) and the return address $ra, padded to a double-word boundary (24 bytes). Since main also needs to save $fp, its stack frame must be two words larger (remember: the stack pointer is kept doubleword aligned).

.text
.globl main
main:
subu $sp,$sp,32 # Stack frame is 32 bytes long
sw $ra,20($sp) # Save return address
sw $fp,16($sp) # Save old frame pointer
addiu $fp,$sp,28 # Set up frame pointer

The routine main then calls the factorial routine and passes it the single argument 10. After fact returns, main calls the library routine printf and passes it both a format string and the result returned from fact:

li $a0,10 # Put argument (10) in $a0
jal fact # Call factorial function
la $a0,$LC # Put format string in $a0
move $a1,$v0 # Move fact result to $a1
jal printf # Call the print function

Finally, after printing the factorial, main returns.
But first, it must restore the registers it saved and pop its stack frame:

lw $ra,20($sp) # Restore return address
lw $fp,16($sp) # Restore frame pointer
addiu $sp,$sp,32 # Pop stack frame
jr $ra # Return to caller

.rdata
$LC:
.ascii "The factorial of 10 is %d\n\000"

The factorial routine is similar in structure to main. First, it creates a stack frame and saves the callee-saved registers it will use. In addition to saving $ra and $fp, fact also saves its argument ($a0), which it will use for the recursive call:

.text
fact:
subu $sp,$sp,32 # Stack frame is 32 bytes long
sw $ra,20($sp) # Save return address
sw $fp,16($sp) # Save frame pointer
addiu $fp,$sp,28 # Set up frame pointer
sw $a0,0($fp) # Save argument (n)

The heart of the fact routine performs the computation from the C program. It tests whether the argument is greater than 0. If not, the routine returns the value 1. If the argument is greater than 0, the routine recursively calls itself to compute fact(n - 1) and multiplies that value times n:

lw $v0,0($fp) # Load n
bgtz $v0,$L2 # Branch if n > 0
li $v0,1 # Return 1
j $L1 # Jump to code to return

$L2:
lw $v1,0($fp) # Load n
subu $v0,$v1,1 # Compute n – 1
move $a0,$v0 # Move value to $a0

jal fact # Call factorial function

lw $v1,0($fp) # Load n
mul $v0,$v0,$v1 # Compute fact(n-1) * n

Finally, the factorial routine restores the callee-saved registers and returns the
value in register $v0:

$L1: # Result is in $v0
lw $ra, 20($sp) # Restore $ra
lw $fp, 16($sp) # Restore $fp
addiu $sp, $sp, 32 # Pop stack
jr $ra # Return to caller
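To make the role of the saved argument concrete, the recursion can be modeled in C with an explicit stack of frames. This is a rough sketch, not a translation of the assembly: fact_frame, fact_result, fact_explicit, and MAX_FRAMES are hypothetical names, and only the saved argument is modeled because $ra and $fp have no C-level equivalent.

```c
#define MAX_FRAMES 13   /* enough for n <= 12, the largest n whose n! fits in a 32-bit int */

/* Hypothetical stand-in for one stack frame of fact. */
typedef struct { int saved_n; } fact_frame;

typedef struct {
    int value;      /* n!, or -1 if more than MAX_FRAMES frames were needed */
    int max_depth;  /* number of frames live at the deepest point */
} fact_result;

fact_result fact_explicit(int n)
{
    fact_frame stack[MAX_FRAMES];
    fact_result r = { -1, 0 };
    int depth = 0;

    /* "Call" phase: every invocation pushes a frame that saves its argument. */
    for (;;) {
        if (depth == MAX_FRAMES)
            return r;               /* model limit reached */
        stack[depth++].saved_n = n;
        if (n < 1)
            break;                  /* base case: this invocation returns 1 */
        n = n - 1;
    }
    r.max_depth = depth;

    /* "Return" phase: pop frames in LIFO order, multiplying the value
       returned so far by the argument saved in each caller's frame. */
    depth--;                        /* pop the base-case frame */
    r.value = 1;
    while (depth > 0) {
        depth--;
        r.value = r.value * stack[depth].saved_n;
    }
    return r;
}
```

Calling fact_explicit(10) pushes one frame for each of fact(10) down to fact(0), eleven in all, and then unwinds them in the reverse order, just as each assembly-level invocation reloads n from 0($fp) after its recursive call returns.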

Stack in Recursive Procedure

Figure A.6.3 shows the stack at the call fact(7). main runs first, so its frame
is deepest on the stack. main calls fact(10), whose stack frame is next on the
stack. Each invocation recursively invokes fact to compute the next-lowest
factorial. The stack frames parallel the LIFO order of these calls. What does the
stack look like when the call to fact(10) returns?

EXAMPLE


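FIGURE A.6.3 Stack frames during the call of fact(7).
[Figure: the stack grows downward. From oldest to newest, the frames are main, fact (10), fact (9), fact (8), and fact (7). The frame for main holds the old $ra and old $fp; each fact frame holds the old $a0, old $ra, and old $fp.]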


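ANSWER

[Figure: once the call to fact(10) has returned, only the frame for main remains on the stack, holding the old $ra and old $fp.]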

Elaboration: The difference between the MIPS compiler and the gcc compiler is that
the MIPS compiler usually does not use a frame pointer, so this register is available as
another callee-saved register, $s8. This change saves a couple of instructions in the
procedure call and return sequence. However, it complicates code generation, because
a procedure must access its stack frame with $sp, whose value can change during a
procedure’s execution if values are pushed on the stack.

Another Procedure Call Example
As another example, consider the following routine that computes the tak
function, which is a widely used benchmark created by Ikuo Takeuchi. This function
does not compute anything useful, but is a heavily recursive program that illustrates
the MIPS calling convention.

int tak (int x, int y, int z)
{
if (y < x)
return 1 + tak (tak (x - 1, y, z),
tak (y - 1, z, x),
tak (z - 1, x, y));
else
return z;
}

int main ()
{
tak(18, 12, 6);
}

The assembly code for this program is shown below. The tak function first saves its return address in its stack frame and its arguments in callee-saved registers, since the routine may make calls that need to use registers $a0–$a2 and $ra. The function uses callee-saved registers, since they hold values that persist over the lifetime of the function, which includes several calls that could potentially modify registers.

.text
.globl tak
tak:
subu $sp, $sp, 40
sw $ra, 32($sp)
sw $s0, 16($sp) # x
move $s0, $a0
sw $s1, 20($sp) # y
move $s1, $a1
sw $s2, 24($sp) # z
move $s2, $a2
sw $s3, 28($sp) # temporary

The routine then begins execution by testing if y < x. If not, it branches to label L1, which is shown below.

bge $s1, $s0, L1 # if (y < x)

If y < x, then it executes the body of the routine, which contains four recursive calls. The first call uses almost the same arguments as its parent:

addiu $a0, $s0, -1
move $a1, $s1
move $a2, $s2
jal tak # tak (x - 1, y, z)
move $s3, $v0

Note that the result from the first recursive call is saved in register $s3, so that it can be used later.

The function now prepares arguments for the second recursive call.

addiu $a0, $s1, -1
move $a1, $s2
move $a2, $s0
jal tak # tak (y - 1, z, x)

In the instructions below, the result from this recursive call is saved in register $s0. But first we need to read, for the last time, the saved value of the first argument from this register.

addiu $a0, $s2, -1
move $a1, $s0
move $a2, $s1
move $s0, $v0
jal tak # tak (z - 1, x, y)

After the three inner recursive calls, we are ready for the final recursive call. After the call, the function’s result is in $v0 and control jumps to the function’s epilogue.
move $a0, $s3
move $a1, $s0
move $a2, $v0
jal tak # tak (tak(...), tak(...), tak(...))
addiu $v0, $v0, 1
j L2

The code at label L1 is the else branch of the if-then-else statement. It just moves the value of argument z into the return register and falls into the function epilogue.

L1:
move $v0, $s2

The code below is the function epilogue, which restores the saved registers and returns the function’s result to its caller.

L2:
lw $ra, 32($sp)
lw $s0, 16($sp)
lw $s1, 20($sp)
lw $s2, 24($sp)
lw $s3, 28($sp)
addiu $sp, $sp, 40
jr $ra

The main routine calls the tak function with its initial arguments, then takes the computed result (7) and prints it using SPIM’s system call for printing integers.

.globl main
main:
subu $sp, $sp, 24
sw $ra, 16($sp)
li $a0, 18
li $a1, 12
li $a2, 6
jal tak # tak(18, 12, 6)
move $a0, $v0
li $v0, 1 # print_int syscall
syscall
lw $ra, 16($sp)
addiu $sp, $sp, 24
jr $ra

A.7 Exceptions and Interrupts

Section 4.9 of Chapter 4 describes the MIPS exception facility, which responds both to exceptions caused by errors during an instruction’s execution and to external interrupts caused by I/O devices. This section describes exception and interrupt handling in more detail (see footnote 1). In MIPS processors, a part of the CPU called coprocessor 0 records the information the software needs to handle exceptions and interrupts. The MIPS simulator SPIM does not implement all of coprocessor 0’s registers, since many are not useful in a simulator or are part of the memory system, which SPIM does not implement. However, SPIM does provide the following coprocessor 0 registers:

Register name   Register number   Usage
BadVAddr        8                 memory address at which an offending memory reference occurred
Count           9                 timer
Compare         11                value compared against timer that causes interrupt when they match
Status          12                interrupt mask and enable bits
Cause           13                exception type and pending interrupt bits
EPC             14                address of instruction that caused exception
Config          16                configuration of machine

Footnote 1: This section discusses exceptions in the MIPS-32 architecture, which is what SPIM implements in Version 7.0 and later. Earlier versions of SPIM implemented the MIPS-1 architecture, which handled exceptions slightly differently. Converting programs from these versions to run on MIPS-32 should not be difficult, as the changes are limited to the Status and Cause register fields and the replacement of the rfe instruction by the eret instruction.

interrupt handler: A piece of code that is run as a result of an exception or an interrupt.

These seven registers are part of coprocessor 0’s register set. They are accessed by the mfc0 and mtc0 instructions. After an exception, register EPC contains the address of the instruction that was executing when the exception occurred. If the exception was caused by an external interrupt, then the instruction will not have started executing. All other exceptions are caused by the execution of the instruction at EPC, except when the offending instruction is in the delay slot of a branch or jump. In that case, EPC points to the branch or jump instruction and the BD bit is set in the Cause register. When that bit is set, the exception handler must look at EPC + 4 for the offending instruction. However, in either case, an exception handler properly resumes the program by returning to the instruction at EPC.

If the instruction that caused the exception made a memory access, register BadVAddr contains the referenced memory location’s address.

The Count register is a timer that increments at a fixed rate (by default, every 10 milliseconds) while SPIM is running. When the value in the Count register equals the value in the Compare register, a hardware interrupt at priority level 5 occurs.

Figure A.7.1 shows the subset of the Status register fields implemented by the MIPS simulator SPIM.
The interrupt mask field contains a bit for each of the six hardware and two software interrupt levels. A mask bit that is 1 allows interrupts at that level to interrupt the processor. A mask bit that is 0 disables interrupts at that level. When an interrupt arrives, it sets its interrupt pending bit in the Cause register, even if the mask bit is disabled. When an interrupt is pending, it will interrupt the processor when its mask bit is subsequently enabled.

The user mode bit is 0 if the processor is running in kernel mode and 1 if it is running in user mode. On SPIM, this bit is fixed at 1, since the SPIM processor does not implement kernel mode. The exception level bit is normally 0, but is set to 1 after an exception occurs. When this bit is 1, interrupts are disabled and the EPC is not updated if another exception occurs. This bit prevents an exception handler from being disturbed by an interrupt or exception, but it should be reset when the handler finishes. If the interrupt enable bit is 1, interrupts are allowed. If it is 0, they are disabled.

Figure A.7.2 shows the subset of Cause register fields that SPIM implements. The branch delay bit is 1 if the last exception occurred in an instruction executed in the delay slot of a branch. The interrupt pending bits become 1 when an interrupt is raised at a given hardware or software level.
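The way the mask, exception level, and interrupt enable bits combine with the pending bits can be sketched in C. This is a minimal model under stated assumptions: the bit positions are the ones given in Figures A.7.1 and A.7.2 (interrupt enable in bit 0, exception level in bit 1, interrupt mask and pending interrupts in bits 15 to 8), and the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow Figures A.7.1 and A.7.2; all names are hypothetical. */
#define STATUS_IE        (1u << 0)  /* interrupt enable */
#define STATUS_EXL       (1u << 1)  /* exception level */
#define STATUS_IM_SHIFT  8          /* interrupt mask, bits 15..8 of Status */
#define CAUSE_IP_SHIFT   8          /* pending interrupts, bits 15..8 of Cause */

/* Returns true if at least one pending interrupt would interrupt the
   processor: interrupts must be enabled, the exception level bit must be
   clear, and some pending level must have its mask bit set. */
bool interrupt_deliverable(uint32_t status, uint32_t cause)
{
    uint32_t mask    = (status >> STATUS_IM_SHIFT) & 0xffu;
    uint32_t pending = (cause  >> CAUSE_IP_SHIFT)  & 0xffu;

    if (!(status & STATUS_IE))
        return false;
    if (status & STATUS_EXL)
        return false;
    return (mask & pending) != 0;
}
```

A pending bit whose mask bit is 0 simply stays set in Cause; in this model the function starts returning true as soon as a later Status value enables that level.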
The exception code field describes the cause of an exception through the following codes:

Number   Name   Cause of exception
0        Int    interrupt (hardware)
4        AdEL   address error exception (load or instruction fetch)
5        AdES   address error exception (store)
6        IBE    bus error on instruction fetch
7        DBE    bus error on data load or store
8        Sys    syscall exception
9        Bp     breakpoint exception
10       RI     reserved instruction exception
11       CpU    coprocessor unimplemented
12       Ov     arithmetic overflow exception
13       Tr     trap
15       FPE    floating point

FIGURE A.7.1 The Status register. [Figure: interrupt mask in bits 15–8, user mode in bit 4, exception level in bit 1, interrupt enable in bit 0.]

FIGURE A.7.2 The Cause register. [Figure: branch delay in bit 31, pending interrupts in bits 15–8, exception code in bits 6–2.]

Exceptions and interrupts cause a MIPS processor to jump to a piece of code, at address 80000180hex (in the kernel, not user address space), called an exception handler. This code examines the exception’s cause and jumps to an appropriate point in the operating system. The operating system responds to an exception either by terminating the process that caused the exception or by performing some action. A process that causes an error, such as executing an unimplemented instruction, is killed by the operating system. On the other hand, other exceptions such as page faults are requests from a process to the operating system to perform a service, such as bringing in a page from disk. The operating system processes these requests and resumes the process. The final type of exceptions are interrupts from external devices. These generally cause the operating system to move data to or from an I/O device and resume the interrupted process.

The code in the example below is a simple exception handler, which invokes a routine to print a message at each exception (but not interrupts).
This code is similar to the exception handler (exceptions.s) used by the SPIM simulator.

EXAMPLE

Exception Handler

The exception handler first saves register $at, which is used in pseudoinstructions in the handler code, then saves $a0 and $a1, which it later uses to pass arguments. The exception handler cannot store the old values from these registers on the stack, as would an ordinary routine, because the cause of the exception might have been a memory reference that used a bad value (such as 0) in the stack pointer. Instead, the exception handler stores these registers in an exception handler register ($k1, since it can’t access memory without using $at) and two memory locations (save0 and save1). If the exception routine itself could be interrupted, two locations would not be enough since the second exception would overwrite values saved during the first exception. However, this simple exception handler finishes running before it enables interrupts, so the problem does not arise.

.ktext 0x80000180
move $k1, $at # Save $at register
sw $a0, save0 # Handler is not re-entrant and can’t use
sw $a1, save1 # stack to save $a0, $a1
# Don’t need to save $k0/$k1

The exception handler then moves the Cause and EPC registers into CPU registers. The Cause and EPC registers are not part of the CPU register set. Instead, they are registers in coprocessor 0, which is the part of the CPU that handles exceptions. The instruction mfc0 $k0, $13 moves coprocessor 0’s register 13 (the Cause register) into CPU register $k0. Note that the exception handler need not save registers $k0 and $k1, because user programs are not supposed to use these registers. The exception handler uses the value from the Cause register to test whether the exception was caused by an interrupt (see the preceding table). If so, the exception is ignored. If the exception was not an interrupt, the handler calls print_excp to print a message.

mfc0 $k0, $13 # Move Cause into $k0
srl $a0, $k0, 2 # Extract ExcCode field
andi $a0, $a0, 0xf
beqz $a0, done # Branch if ExcCode is Int (0)
move $a0, $k0 # Move Cause into $a0
mfc0 $a1, $14 # Move EPC into $a1
jal print_excp # Print exception error message

Before returning, the exception handler clears the Cause register; resets the Status register to enable interrupts and clear the EXL bit, which allows subsequent exceptions to change the EPC register; and restores registers $a0, $a1, and $at. It then executes the eret (exception return) instruction, which returns to the instruction pointed to by EPC. This exception handler returns to the instruction following the one that caused the exception, so as to not re-execute the faulting instruction and cause the same exception again.

done:
mfc0 $k0, $14 # Bump EPC
addiu $k0, $k0, 4 # Do not re-execute
# faulting instruction
mtc0 $k0, $14 # EPC
mtc0 $0, $13 # Clear Cause register
mfc0 $k0, $12 # Fix Status register
andi $k0, 0xfffd # Clear EXL bit
ori $k0, 0x1 # Enable interrupts
mtc0 $k0, $12
lw $a0, save0 # Restore registers
lw $a1, save1
move $at, $k1
eret # Return to EPC

.kdata
save0: .word 0
save1: .word 0

Elaboration: On real MIPS processors, the return from an exception handler is more complex. The exception handler cannot always jump to the instruction following EPC. For example, if the instruction that caused the exception was in a branch instruction’s delay slot (see Chapter 4), the next instruction to execute may not be the following instruction in memory.

A.8 Input and Output

SPIM simulates one I/O device: a memory-mapped console on which a program can read and write characters. When a program is running, SPIM connects its own terminal (or a separate console window in the X-window version xspim or the Windows version PCSpim) to the processor.
A MIPS program running on SPIM can read the characters that you type. In addition, if the MIPS program writes characters to the terminal, they appear on SPIM’s terminal or console window. One exception to this rule is control-C: this character is not passed to the program, but instead causes SPIM to stop and return to command mode. When the program stops running (for example, because you typed control-C or because the program hit a breakpoint), the terminal is reconnected to SPIM so you can type SPIM commands.

To use memory-mapped I/O (see below), spim or xspim must be started with the -mapped_io flag. PCSpim can enable memory-mapped I/O through a command line flag or the “Settings” dialog.

The terminal device consists of two independent units: a receiver and a transmitter. The receiver reads characters from the keyboard. The transmitter displays characters on the console. The two units are completely independent. This means, for example, that characters typed at the keyboard are not automatically echoed on the display. Instead, a program echoes a character by reading it from the receiver and writing it to the transmitter.

A program controls the terminal with four memory-mapped device registers, as shown in Figure A.8.1. “Memory-mapped” means that each register appears as a special memory location. The Receiver Control register is at location ffff0000hex. Only two of its bits are actually used. Bit 0 is called “ready”: if it is 1, it means that a character has arrived from the keyboard but has not yet been read from the Receiver Data register. The ready bit is read-only: writes to it are ignored. The ready bit changes from 0 to 1 when a character is typed at the keyboard, and it changes from 1 to 0 when the character is read from the Receiver Data register.

Bit 1 of the Receiver Control register is the keyboard “interrupt enable.” This bit may be both read and written by a program. The interrupt enable is initially 0. If it is set to 1 by a program, the terminal requests an interrupt at hardware level 1 whenever a character is typed, and the ready bit becomes 1. However, for the interrupt to affect the processor, interrupts must also be enabled in the Status register (see Section A.7). All other bits of the Receiver Control register are unused.

The second terminal device register is the Receiver Data register (at address ffff0004hex). The low-order eight bits of this register contain the last character typed at the keyboard. All other bits contain 0s. This register is read-only and changes only when a new character is typed at the keyboard. Reading the Receiver Data register resets the ready bit in the Receiver Control register to 0. The value in this register is undefined if the Receiver Control register is 0.

FIGURE A.8.1 The terminal is controlled by four device registers, each of which appears as a memory location at the given address. Only a few bits of these registers are actually used. The others always read as 0s and are ignored on writes.
[Figure: Receiver control (0xffff0000): interrupt enable in bit 1, ready in bit 0, remaining bits unused. Receiver data (0xffff0004): received byte in the low-order 8 bits, remaining bits unused. Transmitter control (0xffff0008): interrupt enable in bit 1, ready in bit 0, remaining bits unused. Transmitter data (0xffff000c): transmitted byte in the low-order 8 bits, remaining bits unused.]

The third terminal device register is the Transmitter Control register (at address ffff0008hex). Only the low-order two bits of this register are used. They behave much like the corresponding bits of the Receiver Control register. Bit 0 is called “ready” and is read-only. If this bit is 1, the transmitter is ready to accept a new character for output. If it is 0, the transmitter is still busy writing the previous character. Bit 1 is “interrupt enable” and is readable and writable.
If this bit is set to 1, then the terminal requests an interrupt at hardware level 0 whenever the transmitter is ready for a new character, and the ready bit becomes 1.

The final device register is the Transmitter Data register (at address ffff000chex). When a value is written into this location, its low-order eight bits (i.e., an ASCII character as in Figure 2.15 in Chapter 2) are sent to the console. When the Transmitter Data register is written, the ready bit in the Transmitter Control register is reset to 0. This bit stays 0 until enough time has elapsed to transmit the character to the terminal; then the ready bit becomes 1 again. The Transmitter Data register should only be written when the ready bit of the Transmitter Control register is 1. If the transmitter is not ready, writes to the Transmitter Data register are ignored (the write appears to succeed but the character is not output).

Real computers require time to send characters to a console or terminal. These time lags are simulated by SPIM. For example, after the transmitter starts to write a character, the transmitter’s ready bit becomes 0 for a while. SPIM measures time in instructions executed, not in real clock time. This means that the transmitter does not become ready again until the processor executes a fixed number of instructions. If you stop the machine and look at the ready bit, it will not change. However, if you let the machine run, the bit eventually changes back to 1.

A.9 SPIM

SPIM is a software simulator that runs assembly language programs written for processors that implement the MIPS-32 architecture, specifically Release 1 of this architecture with a fixed memory mapping, no caches, and only coprocessors 0 and 1 (see footnote 2). SPIM’s name is just MIPS spelled backwards. SPIM can read and immediately execute assembly language files. SPIM is a self-contained system for running MIPS programs. It contains a debugger and provides a few operating system-like services. SPIM is much slower than a real computer (100 or more times). However, its low cost and wide availability cannot be matched by real hardware!

Footnote 2: Earlier versions of SPIM (before 7.0) implemented the MIPS-1 architecture used in the original MIPS R2000 processors. This architecture is almost a proper subset of the MIPS-32 architecture, with the difference being the manner in which exceptions are handled. MIPS-32 also introduced approximately 60 new instructions, which are supported by SPIM. Programs that ran on the earlier versions of SPIM and did not use exceptions should run unmodified on newer versions of SPIM. Programs that used exceptions will require minor changes.

An obvious question is, “Why use a simulator when most people have PCs that contain processors that run significantly faster than SPIM?” One reason is that the processors in PCs are Intel 80x86s, whose architecture is far less regular and far more complex to understand and program than MIPS processors. The MIPS architecture may be the epitome of a simple, clean RISC machine.

In addition, simulators can provide a better environment for assembly programming than an actual machine because they can detect more errors and provide a better interface than can an actual computer.

Finally, simulators are useful tools in studying computers and the programs that run on them. Because they are implemented in software, not silicon, simulators can be examined and easily modified to add new instructions, build new systems such as multiprocessors, or simply collect data.

Simulation of a Virtual Machine

The basic MIPS architecture is difficult to program directly because of delayed branches, delayed loads, and restricted address modes.
This difficulty is tolerable since these computers were designed to be programmed in high-level languages and present an interface designed for compilers rather than assembly language programmers. A good part of the programming complexity results from delayed instructions. A delayed branch requires two cycles to execute (see the Elaborations on pages 284 and 322 of Chapter 4). In the second cycle, the instruction immediately following the branch executes. This instruction can perform useful work that normally would have been done before the branch. It can also be a nop (no operation) that does nothing. Similarly, delayed loads require two cycles to bring a value from memory, so the instruction immediately following a load cannot use the value (see Section 4.2 of Chapter 4).

MIPS wisely chose to hide this complexity by having its assembler implement a virtual machine. This virtual computer appears to have nondelayed branches and loads and a richer instruction set than the actual hardware. The assembler reorganizes (rearranges) instructions to fill the delay slots. The virtual computer also provides pseudoinstructions, which appear as real instructions in assembly language programs. The hardware, however, knows nothing about pseudoinstructions, so the assembler must translate them into equivalent sequences of actual machine instructions. For example, the MIPS hardware only provides instructions to branch when a register is equal to or not equal to 0. Other conditional branches, such as one that branches when one register is greater than another, are synthesized by comparing the two registers and branching when the result of the comparison is true (nonzero).

virtual machine: A virtual computer that appears to have nondelayed branches and loads and a richer instruction set than the actual hardware.
By default, SPIM simulates the richer virtual machine, since this is the machine that most programmers will find useful. However, SPIM can also simulate the delayed branches and loads in the actual hardware. Below, we describe the virtual machine and only mention in passing features that do not belong to the actual hardware. In doing so, we follow the convention of MIPS assembly language programmers (and compilers), who routinely use the extended machine as if it was implemented in silicon.

Getting Started with SPIM

The rest of this appendix introduces SPIM and the MIPS R2000 Assembly language. Many details should never concern you; however, the sheer volume of information can sometimes obscure the fact that SPIM is a simple, easy-to-use program. This section starts with a quick tutorial on using SPIM, which should enable you to load, debug, and run simple MIPS programs.

SPIM comes in different versions for different types of computer systems. The one constant is the simplest version, called spim, which is a command-line-driven program that runs in a console window. It operates like most programs of this type: you type a line of text, hit the return key, and spim executes your command. Despite its lack of a fancy interface, spim can do everything that its fancy cousins can do.

There are two fancy cousins to spim. The version that runs in the X-windows environment of a UNIX or Linux system is called xspim. xspim is an easier program to learn and use than spim, because its commands are always visible on the screen and because it continually displays the machine’s registers and memory. The other fancy version is called PCSpim and runs on Microsoft Windows. The UNIX and Windows versions of SPIM are available online at the publisher’s companion Web site for this book. Tutorials on xspim, PCSpim, spim, and SPIM command-line options are also online.
If you are going to run SPIM on a PC running Microsoft Windows, you should first look at the tutorial on PCSpim on the companion Web site. If you are going to run SPIM on a computer running UNIX or Linux, you should read the tutorial on xspim.

Surprising Features

Although SPIM faithfully simulates the MIPS computer, SPIM is a simulator, and certain things are not identical to an actual computer. The most obvious differences are that instruction timing and the memory systems are not identical. SPIM does not simulate caches or memory latency, nor does it accurately reflect floating-point operation or multiply and divide instruction delays. In addition, the floating-point instructions do not detect many error conditions, which would cause exceptions on a real machine.

Another surprise (which occurs on the real machine as well) is that a pseudoinstruction expands to several machine instructions. When you single-step or examine memory, the instructions that you see are different from the source program. The correspondence between the two sets of instructions is fairly simple, since SPIM does not reorganize instructions to fill slots.

Byte Order

Processors can number bytes within a word so the byte with the lowest number is either the leftmost or rightmost one. The convention used by a machine is called its byte order. MIPS processors can operate with either big-endian or little-endian byte order. For example, in a big-endian machine, the directive .byte 0, 1, 2, 3 would result in a memory word containing

Byte #: 0 1 2 3

while in a little-endian machine, the word would contain

Byte #: 3 2 1 0

SPIM operates with both byte orders. SPIM’s byte order is the same as the byte order of the underlying machine that runs the simulator. For example, on an Intel 80x86, SPIM is little-endian, while on a Macintosh or Sun SPARC, SPIM is big-endian.

System Calls

SPIM provides a small set of operating system-like services through the system call (syscall) instruction.
To request a service, a program loads the system call code (see Figure A.9.1) into register $v0 and arguments into registers $a0–$a3 (or $f12 for floating-point values). System calls that return values put their results in register $v0 (or $f0 for floating-point results). For example, the following code prints "the answer = 5":

.data
str:
.asciiz "the answer = "
.text
li $v0, 4 # system call code for print_str
la $a0, str # address of string to print
syscall # print the string

li $v0, 1 # system call code for print_int
li $a0, 5 # integer to print
syscall # print it

The print_int system call is passed an integer and prints it on the console. print_float prints a single floating-point number; print_double prints a double precision number; and print_string is passed a pointer to a null-terminated string, which it writes to the console.

The system calls read_int, read_float, and read_double read an entire line of input up to and including the newline. Characters following the number are ignored. read_string has the same semantics as the UNIX library routine fgets. It reads up to n − 1 characters into a buffer and terminates the string with a null byte. If fewer than n − 1 characters are on the current line, read_string reads up to and including the newline and again null-terminates the string.
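The buffer rule for read_string can be illustrated with a small C model. This is a sketch of the semantics described above rather than SPIM’s implementation: read_string_model is a hypothetical function, and it copies from an in-memory string standing in for the console input.

```c
#include <stddef.h>

/* Hypothetical model of read_string's buffer handling. Copies at most
   n - 1 characters from input into buf, stops after copying a newline,
   and always null-terminates. Returns the number of characters copied,
   not counting the terminating null byte. */
size_t read_string_model(const char *input, char *buf, size_t n)
{
    size_t i = 0;

    if (n == 0)
        return 0;                   /* no room even for the terminator */

    while (i + 1 < n && input[i] != '\0') {
        buf[i] = input[i];
        i++;
        if (buf[i - 1] == '\n')
            break;                  /* the newline is kept, then reading stops */
    }
    buf[i] = '\0';
    return i;
}
```

With a 16-byte buffer and the input "hi\nrest", the model stores "hi\n" followed by a null byte; with a 3-byte buffer and the input "hello", it stores only "he" and the terminator.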
Service        System call code   Arguments                                            Result
print_int      1                  $a0 = integer
print_float    2                  $f12 = float
print_double   3                  $f12 = double
print_string   4                  $a0 = string
read_int       5                                                                       integer (in $v0)
read_float     6                                                                       float (in $f0)
read_double    7                                                                       double (in $f0)
read_string    8                  $a0 = buffer, $a1 = length
sbrk           9                  $a0 = amount                                         address (in $v0)
exit           10
print_char     11                 $a0 = char
read_char      12                                                                      char (in $v0)
open           13                 $a0 = filename (string), $a1 = flags, $a2 = mode     file descriptor (in $a0)
read           14                 $a0 = file descriptor, $a1 = buffer, $a2 = length    num chars read (in $a0)
write          15                 $a0 = file descriptor, $a1 = buffer, $a2 = length    num chars written (in $a0)
close          16                 $a0 = file descriptor
exit2          17                 $a0 = result

FIGURE A.9.1 System services.

Warning: Programs that use these syscalls to read from the terminal should not use memory-mapped I/O (see Section A.8).

sbrk returns a pointer to a block of memory containing n additional bytes. exit stops the program SPIM is running. exit2 terminates the SPIM program, and the argument to exit2 becomes the value returned when the SPIM simulator itself terminates. print_char and read_char write and read a single character. open, read, write, and close are the standard UNIX library calls.

A.10 MIPS R2000 Assembly Language
A MIPS processor consists of an integer processing unit (the CPU) and a collection of coprocessors that perform ancillary tasks or operate on other types of data, such as floating-point numbers (see Figure A.10.1). SPIM simulates two coprocessors. Coprocessor 0 handles exceptions and interrupts. Coprocessor 1 is the floating-point unit. SPIM simulates most aspects of this unit.

Addressing Modes
MIPS is a load-store architecture, which means that only load and store instructions access memory. Computation instructions operate only on values in registers. The bare machine provides only one memory-addressing mode: c(rx), which uses the sum of the immediate c and register rx as the address.
The virtual machine provides the following addressing modes for load and store instructions:

Format                    Address computation
(register)                contents of register
imm                       immediate
imm (register)            immediate + contents of register
label                     address of label
label ± imm               address of label + or – immediate
label ± imm (register)    address of label + or – (immediate + contents of register)

Most load and store instructions operate only on aligned data. A quantity is aligned if its memory address is a multiple of its size in bytes. Therefore, a halfword object must be stored at even addresses, and a full word object must be stored at addresses that are a multiple of four. However, MIPS provides some instructions to manipulate unaligned data (lwl, lwr, swl, and swr).

FIGURE A.10.1 MIPS R2000 CPU and FPU. [Block diagram: Memory; the CPU (registers $0–$31, arithmetic unit, multiply/divide unit with Lo and Hi); Coprocessor 1 (FPU) (registers $0–$31, arithmetic unit); Coprocessor 0 (traps and memory) (registers BadVAddr, Status, Cause, EPC).]

Elaboration: The MIPS assembler (and SPIM) synthesizes the more complex addressing modes by producing one or more instructions before the load or store to compute a complex address. For example, suppose that the label table referred to memory location 0x10000004 and a program contained the instruction

        lw $a0, table + 4($a1)

The assembler would translate this instruction into the instructions

        lui  $at, 4096
        addu $at, $at, $a1
        lw   $a0, 8($at)

The first instruction loads the upper bits of the label’s address into register $at, which is the register that the assembler reserves for its own use. The second instruction adds the contents of register $a1 to the label’s partial address. Finally, the load instruction uses the hardware address mode to add the sum of the lower bits of the label’s address and the offset from the original instruction to the value in register $at.
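The arithmetic behind that translation can be checked with a short Python sketch. The function name is hypothetical, and the adjustment for a low half of 0x8000 or more is an assumption based on the hardware sign-extending the 16-bit offset; the text above does not describe that case.

```python
def split_address(label_addr, offset):
    """Split label_addr + offset into (lui immediate, signed 16-bit load offset)."""
    total = (label_addr + offset) & 0xFFFFFFFF
    hi = total >> 16          # value loaded into the upper halfword by lui
    lo = total & 0xFFFF       # value carried in the load's offset field
    if lo >= 0x8000:
        # The load sign-extends its offset, so compensate in the upper half.
        lo -= 0x10000
        hi = (hi + 1) & 0xFFFF
    return hi, lo
```

For the example in the text, split_address(0x10000004, 4) gives the immediates 4096 and 8 used by lui and lw.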
Assembler Syntax
Comments in assembler files begin with a sharp sign (#). Everything from the sharp sign to the end of the line is ignored.

Identifiers are a sequence of alphanumeric characters, underbars (_), and dots (.) that do not begin with a number. Instruction opcodes are reserved words that cannot be used as identifiers. Labels are declared by putting them at the beginning of a line followed by a colon, for example:

        .data
item:   .word 1
        .text
        .globl main       # Must be global
main:   lw $t0, item

Numbers are base 10 by default. If they are preceded by 0x, they are interpreted as hexadecimal. Hence, 256 and 0x100 denote the same value.

Strings are enclosed in double quotes ("). Special characters in strings follow the C convention:

■ newline \n
■ tab \t
■ quote \"

SPIM supports a subset of the MIPS assembler directives:

.align n Align the next datum on a 2^n byte boundary. For example, .align 2 aligns the next value on a word boundary. .align 0 turns off automatic alignment of .half, .word, .float, and .double directives until the next .data or .kdata directive.

.ascii str Store the string str in memory, but do not null-terminate it.

.asciiz str Store the string str in memory and null-terminate it.

.byte b1,..., bn Store the n values in successive bytes of memory.

.data <addr> Subsequent items are stored in the data segment.
If the optional argument addr is present, subsequent items are stored starting at address addr.

.double d1,…, dn Store the n floating-point double precision numbers in successive memory locations.

.extern sym size Declare that the datum stored at sym is size bytes large and is a global label. This directive enables the assembler to store the datum in a portion of the data segment that is efficiently accessed via register $gp.

.float f1,…, fn Store the n floating-point single precision numbers in successive memory locations.

.globl sym Declare that label sym is global and can be referenced from other files.

.half h1,…, hn Store the n 16-bit quantities in successive memory halfwords.

.kdata <addr> Subsequent data items are stored in the kernel data segment. If the optional argument addr is present, subsequent items are stored starting at address addr.

.ktext <addr> Subsequent items are put in the kernel text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.

.set noat and .set at The first directive prevents SPIM from complaining about subsequent instructions that use register $at. The second directive re-enables the warning. Since pseudoinstructions expand into code that uses register $at, programmers must be very careful about leaving values in this register.

.space n Allocates n bytes of space in the current segment (which must be the data segment in SPIM).

.text <addr> Subsequent items are put in the user text segment. In SPIM, these items may only be instructions or words (see the .word directive below). If the optional argument addr is present, subsequent items are stored starting at address addr.

.word w1,…, wn Store the n 32-bit quantities in successive memory words.

SPIM does not distinguish various parts of the data segment (.data, .rdata, and .sdata).
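The effect of .align n on the next datum's address, and the alignment rule for halfwords and words, can be sketched in Python as follows. The function names are hypothetical, and the sketch does not model the special meaning of .align 0 (turning automatic alignment off).

```python
def align_up(address, n):
    """Round `address` up to the next multiple of 2**n, as .align n does for the next datum."""
    boundary = 1 << n
    return (address + boundary - 1) & ~(boundary - 1)


def is_aligned(address, size_in_bytes):
    """A quantity is aligned if its address is a multiple of its size in bytes."""
    return address % size_in_bytes == 0
```

For instance, after a single byte stored at 0x10010000, .align 2 would place the next value at align_up(0x10010001, 2), which is the next word boundary.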

Encoding MIPS Instructions
Figure A.10.2 explains how a MIPS instruction is encoded in a binary number. Each column contains instruction encodings for a field (a contiguous group of bits) from an instruction. The numbers at the left margin are values for a field. For example, the j opcode has a value of 2 in the opcode field. The text at the top of a column names a field and specifies which bits it occupies in an instruction. For example, the op field is contained in bits 26–31 of an instruction. This field encodes most instructions. However, some groups of instructions use additional fields to distinguish related instructions. For example, the different floating-point instructions are specified by bits 0–5. The arrows from the first column show which opcodes use these additional fields.
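Reading a field out of an instruction word is a shift and a mask. The following Python sketch shows this for the op field (bits 26–31) and the low six bits; the helper names are hypothetical, and the sample words are illustrative encodings of a jump and of an add.

```python
def bits(word, high, low):
    """Return the field of `word` occupying bit positions high..low (inclusive)."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def opcode(word):
    """The op field, bits 31 to 26."""
    return bits(word, 31, 26)


def funct(word):
    """The low six bits, bits 5 to 0."""
    return bits(word, 5, 0)
```

A word whose op field is 2 is a j instruction; a word whose op field is 0 is decoded further using the low six bits.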

Instruction Format
The rest of this appendix describes both the instructions implemented by actual MIPS hardware and the pseudoinstructions provided by the MIPS assembler. The two types of instructions are easily distinguished. Actual instructions depict the fields in their binary representation. For example, in

Addition (with overflow)

add rd, rs, rt
0 rs rt rd 0 0x20
6 5 5 5 5 6

the add instruction consists of six fields. Each field’s size in bits is the small number below the field. This instruction begins with six bits of 0s. Register specifiers begin with an r, so the next field is a 5-bit register specifier called rs. This is the same register that is the second argument in the symbolic assembly at the left of this line. Another common field is imm16, which is a 16-bit immediate number.
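The six-field layout can be turned into a 32-bit word by packing the fields from left to right. This Python sketch is illustrative only; the function name is hypothetical, and the register numbers used in the note below ($t0 = 8, $t1 = 9, $t2 = 10) follow the usual MIPS register numbering, which is not stated in this section.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack six fields of widths 6, 5, 5, 5, 5, 6 bits into one 32-bit word."""
    fields = ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value %d does not fit in %d bits" % (value, width))
        word = (word << width) | value  # append this field below the previous ones
    return word
```

Under those register numbers, add $t0, $t1, $t2 is encode_r(0, 9, 10, 8, 0, 0x20), which is the word 0x012A4020.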


FIGURE A.10.2 MIPS opcode map. The values of each field are shown to its left. The first column shows the values in base 10, and the second shows base 16 for the op field (bits 31 to 26) in the third column. This op field completely specifies the MIPS operation except for six op values: 0, 1, 16, 17, 18, and 19. These operations are determined by other fields, identified by pointers. The last field (funct) uses “f” to mean “s” if rs = 16 and op = 17 or “d” if rs = 17 and op = 17. The second field (rs) uses “z” to mean “0”, “1”, “2”, or “3” if op = 16, 17, 18, or 19, respectively. If rs = 16, the operation is specified elsewhere: if z = 0, the operations are specified in the fourth field (bits 4 to 0); if z = 1, then the operations are in the last field with f = s. If rs = 17 and z = 1, then the operations are in the last field with f = d.

[Opcode map figure not reproduced. Its columns are op(31:26), rs(25:21), rt(20:16), funct(5:0), and funct(4:0), each listing instruction mnemonics against field values numbered in base 10 and base 16.]

Pseudoinstructions follow roughly the same conventions, but omit instruction
encoding information. For example:

Multiply (without overflow)

mul rdest, rsrc1, src2 pseudoinstruction

In pseudoinstructions, rdest and rsrc1 are registers and src2 is either a register or an immediate value. In general, the assembler and SPIM translate a more general form of an instruction (e.g., add $v1, $a0, 0x55) to a specialized form (e.g., addi $v1, $a0, 0x55).

Arithmetic and Logical Instructions

Absolute value

abs rdest, rsrc pseudoinstruction

Put the absolute value of register rsrc in register rdest.

Addition (with overflow)

add rd, rs, rt
0 rs rt rd 0 0x20

6 5 5 5 5 6

Addition (without overflow)

addu rd, rs, rt
0 rs rt rd 0 0x21

6 5 5 5 5 6

Put the sum of registers rs and rt into register rd.

Addition immediate (with overflow)

addi rt, rs, imm
8 rs rt imm

6 5 5 16

Addition immediate (without overflow)

addiu rt, rs, imm
9 rs rt imm

6 5 5 16

Put the sum of register rs and the sign-extended immediate into register rt.
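The immediate format and the sign extension it relies on can be sketched in Python. The function names are hypothetical; the register numbers in the note ($t0 = 8, $t1 = 9) again assume the usual MIPS numbering.

```python
def encode_i(op, rs, rt, imm):
    """Pack an immediate-format instruction: 6-bit op, 5-bit rs, 5-bit rt, 16-bit imm."""
    if not -0x8000 <= imm <= 0x7FFF:
        raise ValueError("immediate does not fit in a signed 16-bit field")
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def sign_extend16(field):
    """Interpret a 16-bit field as a signed value, as addi and addiu do."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field
```

With those register numbers, addi $t0, $t1, -1 is encode_i(8, 9, 8, -1); the stored field 0xFFFF sign-extends back to -1 when the instruction executes.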


AND

and rd, rs, rt
0 rs rt rd 0 0x24
6 5 5 5 5 6

Put the logical AND of registers rs and rt into register rd.

AND immediate

andi rt, rs, imm
0xc rs rt imm
6 5 5 16

Put the logical AND of register rs and the zero-extended immediate into register rt.

Count leading ones

clo rd, rs
0x1c rs 0 rd 0 0x21
6 5 5 5 5 6

Count leading zeros

clz rd, rs
0x1c rs 0 rd 0 0x20
6 5 5 5 5 6

Count the number of leading ones (zeros) in the word in register rs and put
the result into register rd. If a word is all ones (zeros), the result is 32.
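A Python sketch of these two counts, assuming 32-bit words; the function names simply mirror the mnemonics and the code is a model, not the hardware algorithm.

```python
def clz(word):
    """Count leading zeros in a 32-bit word; an all-zero word gives 32."""
    word &= 0xFFFFFFFF
    return 32 - word.bit_length()


def clo(word):
    """Count leading ones by counting the leading zeros of the complement."""
    return clz(~word)
```

Because clz masks its argument to 32 bits, passing the Python complement ~word is safe even though Python integers are unbounded.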

Divide (with overflow)

div rs, rt
0 rs rt 0 0x1a
6 5 5 10 6

Divide (without overflow)

divu rs, rt
0 rs rt 0 0x1b
6 5 5 10 6

Divide register rs by register rt. Leave the quotient in register lo and the remainder in register hi. Note that if an operand is negative, the remainder is unspecified by the MIPS architecture and depends on the convention of the machine on which SPIM is run.

Divide (with overflow)

div rdest, rsrc1, src2 pseudoinstruction

Divide (without overflow)

divu rdest, rsrc1, src2 pseudoinstruction

Put the quotient of register rsrc1 and src2 into register rdest.

Multiply

mult rs, rt
0 rs rt 0 0x18
6 5 5 10 6

Unsigned multiply

multu rs, rt
0 rs rt 0 0x19
6 5 5 10 6

Multiply registers rs and rt. Leave the low-order word of the product in register
lo and the high-order word in register hi.
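How a 64-bit product is split across hi and lo can be modeled in Python. This is a sketch with hypothetical helper names; registers are represented as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(word):
    """Interpret a 32-bit register value as a two's-complement signed integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


def mult(rs, rt):
    """Signed multiply: return (hi, lo) halves of the 64-bit product."""
    product = to_signed32(rs) * to_signed32(rt)
    return (product >> 32) & MASK32, product & MASK32


def multu(rs, rt):
    """Unsigned multiply: return (hi, lo) halves of the 64-bit product."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32
```

The same bit pattern gives different hi values under the two instructions: 0xFFFFFFFF times 2 is -2 when signed, but 0x1FFFFFFFE when unsigned.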

Multiply (without overflow)

mul rd, rs, rt
0x1c rs rt rd 0 2
6 5 5 5 5 6

Put the low-order 32 bits of the product of rs and rt into register rd.

Multiply (with overflow)

mulo rdest, rsrc1, src2 pseudoinstruction

Unsigned multiply (with overflow)

mulou rdest, rsrc1, src2 pseudoinstruction

Put the low-order 32 bits of the product of register rsrc1 and src2 into register
rdest.


Multiply add

madd rs, rt
0x1c rs rt 0 0
6 5 5 10 6

Unsigned multiply add

maddu rs, rt
0x1c rs rt 0 1
6 5 5 10 6

Multiply registers rs and rt and add the resulting 64-bit product to the 64-bit
value in the concatenated registers lo and hi.

Multiply subtract

msub rs, rt
0x1c rs rt 0 4
6 5 5 10 6

Unsigned multiply subtract

msubu rs, rt
0x1c rs rt 0 5
6 5 5 10 6

Multiply registers rs and rt and subtract the resulting 64-bit product from the 64-bit value in the concatenated registers lo and hi.

Negate value (with overflow)

neg rdest, rsrc pseudoinstruction

Negate value (without overflow)

negu rdest, rsrc pseudoinstruction

Put the negative of register rsrc into register rdest.

NOR

nor rd, rs, rt
0 rs rt rd 0 0x27
6 5 5 5 5 6

Put the logical NOR of registers rs and rt into register rd.

NOT

not rdest, rsrc pseudoinstruction

Put the bitwise logical negation of register rsrc into register rdest.

OR

or rd, rs, rt
0 rs rt rd 0 0x25
6 5 5 5 5 6

Put the logical OR of registers rs and rt into register rd.

OR immediate

ori rt, rs, imm
0xd rs rt imm
6 5 5 16

Put the logical OR of register rs and the zero-extended immediate into register rt.

Remainder

rem rdest, rsrc1, rsrc2 pseudoinstruction

Unsigned remainder

remu rdest, rsrc1, rsrc2 pseudoinstruction

Put the remainder of register rsrc1 divided by register rsrc2 into register rdest.
Note that if an operand is negative, the remainder is unspecified by the MIPS
architecture and depends on the convention of the machine on which SPIM is run.

Shift left logical

sll rd, rt, shamt
0 rs rt rd shamt 0
6 5 5 5 5 6

Shift left logical variable

sllv rd, rt, rs
0 rs rt rd 0 4
6 5 5 5 5 6


Shift right arithmetic

sra rd, rt, shamt
0 rs rt rd shamt 3
6 5 5 5 5 6

Shift right arithmetic variable

srav rd, rt, rs
0 rs rt rd 0 7
6 5 5 5 5 6

Shift right logical

srl rd, rt, shamt
0 rs rt rd shamt 2
6 5 5 5 5 6

Shift right logical variable

srlv rd, rt, rs
0 rs rt rd 0 6
6 5 5 5 5 6

Shift register rt left (right) by the distance indicated by immediate shamt or the
register rs and put the result in register rd. Note that argument rs is ignored for
sll, sra, and srl.
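The difference between the logical and arithmetic right shifts shows up only when the top bit is set. The Python sketch below models the three shifts on 32-bit values; masking the distance to five bits reflects the 5-bit shamt field and is also how the architecture treats the register in the variable forms.

```python
MASK32 = 0xFFFFFFFF


def sll(rt, distance):
    """Shift left logical: zeros enter from the right."""
    return (rt << (distance & 31)) & MASK32


def srl(rt, distance):
    """Shift right logical: zeros enter from the left."""
    return (rt & MASK32) >> (distance & 31)


def sra(rt, distance):
    """Shift right arithmetic: copies of the sign bit enter from the left."""
    rt &= MASK32
    signed = rt - (1 << 32) if rt & 0x80000000 else rt
    return (signed >> (distance & 31)) & MASK32
```

Shifting 0x80000000 right by four gives 0x08000000 with srl but 0xF8000000 with sra.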

Rotate left

rol rdest, rsrc1, rsrc2 pseudoinstruction

Rotate right

ror rdest, rsrc1, rsrc2 pseudoinstruction

Rotate register rsrc1 left (right) by the distance indicated by rsrc2 and put the
result in register rdest.
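A rotate can be expressed with two shifts and an OR. The following Python sketch models the result of the two pseudoinstructions on 32-bit values; it is not a claim about the exact instruction sequence the assembler emits, and reducing the distance modulo 32 is an assumption of the sketch.

```python
MASK32 = 0xFFFFFFFF


def rol(word, distance):
    """Rotate a 32-bit word left: bits shifted out on the left re-enter on the right."""
    word &= MASK32
    distance &= 31
    return ((word << distance) | (word >> (32 - distance))) & MASK32


def ror(word, distance):
    """Rotate right by rotating left by the complementary distance."""
    return rol(word, 32 - (distance & 31))
```

Rotating left and then right by the same distance returns the original word, which makes a convenient sanity check.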

Subtract (with overflow)

sub rd, rs, rt
0 rs rt rd 0 0x22
6 5 5 5 5 6

Subtract (without overflow)

subu rd, rs, rt
0 rs rt rd 0 0x23
6 5 5 5 5 6

Put the difference of registers rs and rt into register rd.

Exclusive OR

xor rd, rs, rt
0 rs rt rd 0 0x26
6 5 5 5 5 6

Put the logical XOR of registers rs and rt into register rd.

XOR immediate

xori rt, rs, imm
0xe rs rt imm
6 5 5 16

Put the logical XOR of register rs and the zero-extended immediate into register rt.

Constant-Manipulating Instructions

Load upper immediate

lui rt, imm
0xf 0 rt imm
6 5 5 16

Load the lower halfword of the immediate imm into the upper halfword of register rt. The lower bits of the register are set to 0.
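The effect of lui, and one way a full 32-bit constant can be assembled from it, is sketched below in Python. Pairing lui with ori is an assumption made for illustration; the text does not say how a particular assembler builds such constants.

```python
def lui(imm):
    """Place the low 16 bits of imm in the upper halfword; the lower halfword is 0."""
    return (imm & 0xFFFF) << 16


def ori(rs, imm):
    """OR a register value with a zero-extended 16-bit immediate."""
    return (rs | (imm & 0xFFFF)) & 0xFFFFFFFF


def load_constant(value):
    """Build a 32-bit constant from its two halves (illustrative lui + ori pairing)."""
    value &= 0xFFFFFFFF
    return ori(lui(value >> 16), value & 0xFFFF)
```

Because ori zero-extends its immediate, the lower half never disturbs the upper half that lui has already set.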

Load immediate

li rdest, imm pseudoinstruction

Move the immediate imm into register rdest.

Comparison Instructions

Set less than

slt rd, rs, rt
0 rs rt rd 0 0x2a
6 5 5 5 5 6


Set less than unsigned

sltu rd, rs, rt
0 rs rt rd 0 0x2b
6 5 5 5 5 6

Set register rd to 1 if register rs is less than rt, and to 0 otherwise.

Set less than immediate

slti rt, rs, imm
0xa rs rt imm
6 5 5 16

Set less than unsigned immediate

sltiu rt, rs, imm
0xb rs rt imm
6 5 5 16

Set register rt to 1 if register rs is less than the sign-extended immediate, and to
0 otherwise.
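Signed and unsigned comparisons differ only in how the same 32 bits are interpreted. A Python sketch of slt and sltu, with registers modeled as unsigned 32-bit integers and a hypothetical helper for the signed view:

```python
MASK32 = 0xFFFFFFFF


def to_signed32(word):
    """Interpret a 32-bit register value as a two's-complement signed integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


def slt(rs, rt):
    """Set less than: compare the operands as signed integers."""
    return int(to_signed32(rs) < to_signed32(rt))


def sltu(rs, rt):
    """Set less than unsigned: compare the operands as unsigned integers."""
    return int((rs & MASK32) < (rt & MASK32))
```

The pattern 0xFFFFFFFF is -1 to slt but the largest possible value to sltu, so the two instructions disagree when it is compared with 1.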

Set equal

seq rdest, rsrc1, rsrc2 pseudoinstruction

Set register rdest to 1 if register rsrc1 equals rsrc2, and to 0 otherwise.

Set greater than equal

sge rdest, rsrc1, rsrc2 pseudoinstruction

Set greater than equal unsigned

sgeu rdest, rsrc1, rsrc2 pseudoinstruction

Set register rdest to 1 if register rsrc1 is greater than or equal to rsrc2, and to
0 otherwise.

Set greater than

sgt rdest, rsrc1, rsrc2 pseudoinstruction

Set greater than unsigned

sgtu rdest, rsrc1, rsrc2 pseudoinstruction

Set register rdest to 1 if register rsrc1 is greater than rsrc2, and to 0 otherwise.

Set less than equal

sle rdest, rsrc1, rsrc2 pseudoinstruction

Set less than equal unsigned

sleu rdest, rsrc1, rsrc2 pseudoinstruction

Set register rdest to 1 if register rsrc1 is less than or equal to rsrc2, and to 0
otherwise.

Set not equal

sne rdest, rsrc1, rsrc2 pseudoinstruction

Set register rdest to 1 if register rsrc1 is not equal to rsrc2, and to 0 otherwise.

Branch Instructions
Branch instructions use a signed 16-bit instruction offset field; hence, they can jump 2^15 − 1 instructions (not bytes) forward or 2^15 instructions backward. The jump instruction contains a 26-bit address field. In actual MIPS processors, branch instructions are delayed branches, which do not transfer control until the instruction following the branch (its “delay slot”) has executed (see Chapter 4). Delayed branches affect the offset calculation, since it must be computed relative to the address of the delay slot instruction (PC + 4), which is when the branch occurs. SPIM does not simulate this delay slot, unless the -bare or -delayed_branch flags are specified.

In assembly code, offsets are not usually specified as numbers. Instead, an instruction branches to a label, and the assembler computes the distance between the branch and the target instructions.
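The distance an assembler computes for the hardware offset field can be sketched in Python. The function name is hypothetical; the sketch follows the delayed-branch convention described above, measuring from PC + 4 and counting instructions rather than bytes.

```python
def branch_offset(branch_addr, target_addr):
    """Return the signed 16-bit offset, in instructions, from PC + 4 to the target."""
    delta = target_addr - (branch_addr + 4)
    if delta % 4 != 0:
        raise ValueError("branch target is not word aligned")
    offset = delta // 4
    if not -(1 << 15) <= offset <= (1 << 15) - 1:
        raise ValueError("target is out of range for a 16-bit offset")
    return offset
```

A branch at 0x00400000 to a label four instructions later (0x00400010) therefore stores an offset of 3, not 4.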

In MIPS-32, all actual (not pseudo) conditional branch instructions have a “likely” variant (for example, beq’s likely variant is beql), which does not execute the instruction in the branch’s delay slot if the branch is not taken. Do not use these instructions; they may be removed in subsequent versions of the architecture. SPIM implements these instructions, but they are not described further.

Branch instruction

b label pseudoinstruction

Unconditionally branch to the instruction at the label.

Branch coprocessor false

bc1f cc label
0x11 8 cc 0 Offset
6 5 3 2 16

Branch coprocessor true

bc1t cc label
0x11 8 cc 1 Offset
6 5 3 2 16

Conditionally branch the number of instructions specified by the offset if the floating-point coprocessor’s condition flag numbered cc is false (true). If cc is omitted from the instruction, condition code flag 0 is assumed.

Branch on equal

beq rs, rt, label
4 rs rt Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register
rs equals rt.

Branch on greater than equal zero

bgez rs, label
1 rs 1 Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register
rs is greater than or equal to 0.

Branch on greater than equal zero and link

bgezal rs, label
1 rs 0x11 Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register rs is greater than or equal to 0. Save the address of the next instruction in register 31.

Branch on greater than zero

bgtz rs, label
7 rs 0 Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register
rs is greater than 0.

Branch on less than equal zero

blez rs, label
6 rs 0 Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register
rs is less than or equal to 0.

Branch on less than and link

bltzal rs, label
1 rs 0x10 Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register
rs is less than 0. Save the address of the next instruction in register 31.

Branch on less than zero

bltz rs, label
1 rs 0 Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register
rs is less than 0.


Branch on not equal

bne rs, rt, label
5 rs rt Offset
6 5 5 16

Conditionally branch the number of instructions specified by the offset if register
rs is not equal to rt.

Branch on equal zero

beqz rsrc, label pseudoinstruction

Conditionally branch to the instruction at the label if rsrc equals 0.

Branch on greater than equal

bge rsrc1, rsrc2, label pseudoinstruction

Branch on greater than equal unsigned

bgeu rsrc1, rsrc2, label pseudoinstruction

Conditionally branch to the instruction at the label if register rsrc1 is greater than
or equal to rsrc2.

Branch on greater than

bgt rsrc1, src2, label pseudoinstruction

Branch on greater than unsigned

bgtu rsrc1, src2, label pseudoinstruction

Conditionally branch to the instruction at the label if register rsrc1 is greater than
src2.

Branch on less than equal

ble rsrc1, src2, label pseudoinstruction

Branch on less than equal unsigned

bleu rsrc1, src2, label pseudoinstruction

Conditionally branch to the instruction at the label if register rsrc1 is less than or
equal to src2.

Branch on less than

blt rsrc1, rsrc2, label pseudoinstruction

Branch on less than unsigned

bltu rsrc1, rsrc2, label pseudoinstruction

Conditionally branch to the instruction at the label if register rsrc1 is less than
rsrc2.

Branch on not equal zero

bnez rsrc, label pseudoinstruction

Conditionally branch to the instruction at the label if register rsrc is not equal to 0.

Jump Instructions

Jump

j target
2 target
6 26

Unconditionally jump to the instruction at target.

Jump and link

jal target
3 target
6 26

Unconditionally jump to the instruction at target. Save the address of the next
instruction in register $ra.
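The 26-bit target field holds a word address, so the byte address is shifted right by two before it is stored. The Python sketch below encodes j and jal and recovers the destination; taking the upper four bits from PC + 4 is the usual MIPS convention and is an assumption here, since this section does not spell it out. The function names are hypothetical.

```python
def encode_jump(op, target_addr):
    """Pack a jump: 6-bit op (2 for j, 3 for jal) and the 26-bit word address."""
    if target_addr % 4 != 0:
        raise ValueError("jump target is not word aligned")
    return (op << 26) | ((target_addr >> 2) & 0x03FFFFFF)


def jump_destination(pc, word):
    """Rebuild the byte address: upper 4 bits from PC + 4, then the field times 4."""
    return ((pc + 4) & 0xF0000000) | ((word & 0x03FFFFFF) << 2)
```

For example, j to address 0x00400000 is encoded as 0x08100000, and jal to the same address as 0x0C100000.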


Jump and link register

jalr rs, rd
0 rs 0 rd 0 9
6 5 5 5 5 6

Unconditionally jump to the instruction whose address is in register rs. Save the
address of the next instruction in register rd (which defaults to 31).

Jump register

jr rs
0 rs 0 8
6 5 15 6

Unconditionally jump to the instruction whose address is in register rs.

Trap Instructions

Trap if equal

teq rs, rt
0 rs rt 0 0x34
6 5 5 10 6

If register rs is equal to register rt, raise a Trap exception.

Trap if equal immediate

teqi rs, imm
1 rs 0xc imm
6 5 5 16

If register rs is equal to the sign-extended value imm, raise a Trap exception.

Trap if not equal

tne rs, rt
0 rs rt 0 0x36
6 5 5 10 6

If register rs is not equal to register rt, raise a Trap exception.

Trap if not equal immediate

tnei rs, imm
1 rs 0xe imm
6 5 5 16

If register rs is not equal to the sign-extended value imm, raise a Trap exception.

Trap if greater equal

tge rs, rt
0 rs rt 0 0x30
6 5 5 10 6

Unsigned trap if greater equal

tgeu rs, rt
0 rs rt 0 0x31
6 5 5 10 6

If register rs is greater than or equal to register rt, raise a Trap exception.

Trap if greater equal immediate

tgei rs, imm
1 rs 8 imm
6 5 5 16

Unsigned trap if greater equal immediate

tgeiu rs, imm
1 rs 9 imm
6 5 5 16

If register rs is greater than or equal to the sign-extended value imm, raise a Trap
exception.

Trap if less than

tlt rs, rt
0 rs rt 0 0x32
6 5 5 10 6

Unsigned trap if less than

tltu rs, rt
0 rs rt 0 0x33
6 5 5 10 6

If register rs is less than register rt, raise a Trap exception.

Trap if less than immediate

tlti rs, imm
1 rs 0xa imm
6 5 5 16


Unsigned trap if less than immediate

tltiu rs, imm
1 rs 0xb imm
6 5 5 16

If register rs is less than the sign-extended value imm, raise a Trap exception.

Load Instructions

Load address

la rdest, address pseudoinstruction

Load computed address—not the contents of the location—into register rdest.

Load byte

lb rt, address
0x20 rs rt Offset
6 5 5 16

Load unsigned byte

lbu rt, address
0x24 rs rt Offset
6 5 5 16

Load the byte at address into register rt. The byte is sign-extended by lb, but not
by lbu.

Load halfword

lh rt, address
0x21 rs rt Offset
6 5 5 16

Load unsigned halfword

lhu rt, address
0x25 rs rt Offset
6 5 5 16

Load the 16-bit quantity (halfword) at address into register rt. The halfword is
sign-extended by lh, but not by lhu.
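The signed and unsigned byte and halfword loads differ only in how the upper bits of the register are filled. A Python sketch, assuming memory is a bytes object and big-endian byte order (both choices are for illustration only; SPIM follows the byte order of its host):

```python
def load(memory, address, size, signed):
    """Load `size` bytes at `address` and widen the value to a 32-bit register.

    signed=True models lb/lh (sign extension); signed=False models lbu/lhu.
    """
    raw = memory[address:address + size]
    value = int.from_bytes(raw, "big", signed=signed)
    return value & 0xFFFFFFFF  # register contents as an unsigned 32-bit pattern
```

Loading the byte 0x80 yields 0xFFFFFF80 with the signed form and 0x00000080 with the unsigned form.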

Load word

lw rt, address
0x23 rs rt Offset
6 5 5 16

Load the 32-bit quantity (word) at address into register rt.

Load word coprocessor 1

lwc1 ft, address
0x31 rs rt Offset
6 5 5 16

Load the word at address into register ft in the floating-point unit.

Load word left

lwl rt, address
0x22 rs rt Offset
6 5 5 16

Load word right

lwr rt, address
0x26 rs rt Offset
6 5 5 16

Load the left (right) bytes from the word at the possibly unaligned address into
register rt.

Load doubleword

ld rdest, address pseudoinstruction

Load the 64-bit quantity at address into registers rdest and rdest + 1.

Unaligned load halfword

ulh rdest, address pseudoinstruction


Unaligned load halfword unsigned

ulhu rdest, address pseudoinstruction

Load the 16-bit quantity (halfword) at the possibly unaligned address into register
rdest. The halfword is sign-extended by ulh, but not by ulhu.

Unaligned load word

ulw rdest, address pseudoinstruction

Load the 32-bit quantity (word) at the possibly unaligned address into register
rdest.

Load linked

ll rt, address
0x30 rs rt Offset
6 5 5 16

Load the 32-bit quantity (word) at address into register rt and start an atomic read-modify-write operation. This operation is completed by a store conditional (sc) instruction, which will fail if another processor writes into the block containing the loaded word. Since SPIM does not simulate multiple processors, the store conditional operation always succeeds.

Store Instructions

Store byte

sb rt, address
0x28 rs rt Offset
6 5 5 16

Store the low byte from register rt at address.

Store halfword

sh rt, address
0x29 rs rt Offset
6 5 5 16

Store the low halfword from register rt at address.

Store word

sw rt, address
0x2b rs rt Offset
6 5 5 16

Store the word from register rt at address.

Store word coprocessor 1

swc1 ft, address
0x39 rs ft Offset
6 5 5 16

Store the floating-point value in register ft of floating-point coprocessor at address.

Store double coprocessor 1

sdc1 ft, address
0x3d rs ft Offset
6 5 5 16

Store the doubleword floating-point value in registers ft and ft + 1 of floating-point coprocessor at address. Register ft must be even numbered.

Store word left

swl rt, address
0x2a rs rt Offset
6 5 5 16

Store word right

swr rt, address
0x2e rs rt Offset
6 5 5 16

Store the left (right) bytes from register rt at the possibly unaligned address.

Store doubleword

sd rsrc, address pseudoinstruction

Store the 64-bit quantity in registers rsrc and rsrc + 1 at address.


Unaligned store halfword

ush rsrc, address pseudoinstruction

Store the low halfword from register rsrc at the possibly unaligned address.

Unaligned store word

usw rsrc, address pseudoinstruction

Store the word from register rsrc at the possibly unaligned address.

Store conditional

sc rt, address
0x38 rs rt Offset
6 5 5 16

Store the 32-bit quantity (word) in register rt into memory at address and complete an atomic read-modify-write operation. If this atomic operation is successful, the memory word is modified and register rt is set to 1. If the atomic operation fails because another processor wrote to a location in the block containing the addressed word, this instruction does not modify memory and writes 0 into register rt. Since SPIM does not simulate multiple processors, the instruction always succeeds.

Data Movement Instructions
Move

move rdest, rsrc pseudoinstruction

Move register rsrc to rdest.

Move from hi

mfhi rd
0 0 rd 0 0x10
6 10 5 5 6

Move from lo

mflo rd
0 0 rd 0 0x12
6 10 5 5 6

The multiply and divide unit produces its result in two additional registers, hi and lo. These instructions move values to and from these registers. The multiply, divide, and remainder pseudoinstructions that make this unit appear to operate on the general registers move the result after the computation finishes.

Move the hi (lo) register to register rd.

Move to hi

mthi rs
0 rs 0 0x11
6 5 15 6

Move to lo

mtlo rs
0 rs 0 0x13
6 5 15 6

Move register rs to the hi (lo) register.

Move from coprocessor 0

mfc0 rt, rd
0x10 0 rt rd 0
6 5 5 5 11

Move from coprocessor 1

mfc1 rt, fs
0x11 0 rt fs 0
6 5 5 5 11

Coprocessors have their own register sets. Th ese instructions move values between
these registers and the CPU’s registers.

Move register rd in a coprocessor (register fs in the FPU) to CPU register rt. The floating-point unit is coprocessor 1.

Move double from coprocessor 1

mfc1.d rdest, frsrc1 pseudoinstruction

Move floating-point registers frsrc1 and frsrc1 + 1 to CPU registers rdest
and rdest + 1.

Move to coprocessor 0

mtc0 rd, rt
0x10 4 rt rd 0
6 5 5 5 11

Move to coprocessor 1

mtc1 rt, fs
0x11 4 rt fs 0
6 5 5 5 11

Move CPU register rt to register rd in a coprocessor (register fs in the FPU).

Move conditional not zero

movn rd, rs, rt
0 rs rt rd 0xb
6 5 5 5 11

Move register rs to register rd if register rt is not 0.

Move conditional zero

movz rd, rs, rt
0 rs rt rd 0xa
6 5 5 5 11

Move register rs to register rd if register rt is 0.
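
The field rows and bit-width rows in this section can be read as a recipe for building the 32-bit instruction word. A minimal Python sketch follows, using movn and movz as the example. The helper names pack, encode_movn, and encode_movz are hypothetical, and the register numbers in the usage lines assume the conventional numbering in which $t0, $t1, and $t2 are registers 8, 9, and 10.

```python
def pack(fields):
    """Pack (value, width) pairs, most significant field first, into one word."""
    word = 0
    total = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
        total += width
    if total != 32:
        raise ValueError(f"fields cover {total} bits, expected 32")
    return word


def encode_movn(rd, rs, rt):
    # Field row: 0 rs rt rd 0xb, width row: 6 5 5 5 11
    return pack([(0, 6), (rs, 5), (rt, 5), (rd, 5), (0xB, 11)])


def encode_movz(rd, rs, rt):
    # Field row: 0 rs rt rd 0xa, width row: 6 5 5 5 11
    return pack([(0, 6), (rs, 5), (rt, 5), (rd, 5), (0xA, 11)])


word_z = encode_movz(rd=8, rs=9, rt=10)  # movz $t0, $t1, $t2
word_n = encode_movn(rd=8, rs=9, rt=10)  # movn $t0, $t1, $t2
print(f"{word_z:#010x} {word_n:#010x}")
```

Checking that the widths sum to 32 catches a mistyped width row before it produces a silently wrong encoding.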

Move conditional on FP false

movf rd, rs, cc
0 rs cc 0 rd 0 1
6 5 3 2 5 5 6

Move CPU register rs to register rd if FPU condition code flag number cc is 0. If
cc is omitted from the instruction, condition code flag 0 is assumed.

Move conditional on FP true

movt rd, rs, cc
0 rs cc 1 rd 0 1
6 5 3 2 5 5 6

Move CPU register rs to register rd if FPU condition code flag number cc is 1. If
cc is omitted from the instruction, condition code flag 0 is assumed.

Floating-Point Instructions
The MIPS has a floating-point coprocessor (numbered 1) that operates on single
precision (32-bit) and double precision (64-bit) floating-point numbers. This
coprocessor has its own registers, which are numbered $f0–$f31. Because these
registers are only 32 bits wide, two of them are required to hold doubles, so only
floating-point registers with even numbers can hold double precision values. The
floating-point coprocessor also has eight condition code (cc) flags, numbered 0–7,
which are set by compare instructions and tested by branch (bc1f or bc1t) and
conditional move instructions.
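
The pairing of 32-bit registers to hold one double can be illustrated in Python. This is a sketch under a stated assumption: the low-order word of the double is placed in the even-numbered register and the high-order word in the following odd-numbered register. The function names are hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF


def double_to_register_pair(value):
    """Split a double into (even_register_word, odd_register_word)."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & MASK32, (bits >> 32) & MASK32  # assumed order: low, then high


def register_pair_to_double(even_word, odd_word):
    """Rebuild the double from the two 32-bit register images."""
    bits = ((odd_word & MASK32) << 32) | (even_word & MASK32)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


even, odd = double_to_register_pair(1.0)  # e.g. the images of $f2 and $f3
restored = register_pair_to_double(even, odd)
```

Because a double always occupies a register and its successor, naming an odd-numbered register as a double operand would straddle two pairs, which is why only even numbers are allowed.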

Values are moved in or out of these registers one word (32 bits) at a time by
lwc1, swc1, mtc1, and mfc1 instructions or one double (64 bits) at a time by ldc1
and sdc1, described above, or by the l.s, l.d, s.s, and s.d pseudoinstructions
described below.

In the actual instructions below, the format field (bits 21–25) is 0x10 for single
precision and 0x11 for double precision. In the pseudoinstructions below, fdest is
a floating-point register (e.g., $f2).
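
To make the role of the format field concrete, here is a small Python sketch that decodes a coprocessor 1 arithmetic instruction word into the fields used by the encodings in this section (opcode 0x11, then format, ft, fs, fd, and function code). The decode_fp function and the FORMATS table are hypothetical names. The format values 0x10, 0x11, and 0x14 are taken from the encodings listed for the add, convert, and similar instructions.

```python
FORMATS = {0x10: "single", 0x11: "double", 0x14: "word"}


def decode_fp(word):
    """Split a 32-bit coprocessor 1 instruction into its fields."""
    if (word >> 26) & 0x3F != 0x11:
        raise ValueError("not a coprocessor 1 instruction")
    return {
        "fmt": FORMATS.get((word >> 21) & 0x1F, "other"),
        "ft": (word >> 16) & 0x1F,
        "fs": (word >> 11) & 0x1F,
        "fd": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


add_d = decode_fp(0x46261100)  # add.d $f4, $f2, $f6
add_s = decode_fp(0x46061100)  # add.s $f4, $f2, $f6
```

The two words differ only in the format field, which is how one function code serves both precisions.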

Floating-point absolute value double

abs.d fd, fs
0x11 0x11 0 fs fd 5
6 5 5 5 5 6

Floating-point absolute value single

abs.s fd, fs
0x11 0x10 0 fs fd 5

Compute the absolute value of the floating-point double (single) in register fs and
put it in register fd.

Floating-point addition double

add.d fd, fs, ft
0x11 0x11 ft fs fd 0
6 5 5 5 5 6

Floating-point addition single

add.s fd, fs, ft
0x11 0x10 ft fs fd 0
6 5 5 5 5 6

Compute the sum of the floating-point doubles (singles) in registers fs and ft and
put it in register fd.

Floating-point ceiling to word

ceil.w.d fd, fs
0x11 0x11 0 fs fd 0xe
6 5 5 5 5 6

ceil.w.s fd, fs
0x11 0x10 0 fs fd 0xe

Compute the ceiling of the floating-point double (single) in register fs, convert to
a 32-bit fixed-point value, and put the resulting word in register fd.

Compare equal double

c.eq.d cc fs, ft
0x11 0x11 ft fs cc 0 FC 2
6 5 5 5 3 2 2 4

Compare equal single

c.eq.s cc fs, ft
0x11 0x10 ft fs cc 0 FC 2
6 5 5 5 3 2 2 4

Compare the floating-point double (single) in register fs against the one in ft
and set the floating-point condition flag cc to 1 if they are equal. If cc is omitted,
condition code flag 0 is assumed.

Compare less than equal double

c.le.d cc fs, ft
0x11 0x11 ft fs cc 0 FC 0xe
6 5 5 5 3 2 2 4

Compare less than equal single

c.le.s cc fs, ft
0x11 0x10 ft fs cc 0 FC 0xe
6 5 5 5 3 2 2 4

Compare the floating-point double (single) in register fs against the one in ft and
set the floating-point condition flag cc to 1 if the first is less than or equal to the
second. If cc is omitted, condition code flag 0 is assumed.

Compare less than double

c.lt.d cc fs, ft
0x11 0x11 ft fs cc 0 FC 0xc
6 5 5 5 3 2 2 4

Compare less than single

c.lt.s cc fs, ft
0x11 0x10 ft fs cc 0 FC 0xc
6 5 5 5 3 2 2 4

Compare the floating-point double (single) in register fs against the one in ft
and set the condition flag cc to 1 if the first is less than the second. If cc is omitted,
condition code flag 0 is assumed.
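
The interplay between the compare instructions and the eight condition flags can be modeled in a few lines of Python. This is a sketch, not a description of SPIM internals: the FpuConditionFlags class and the movt and movf helper functions are hypothetical names, and the model assumes each compare sets its flag to 0 when the condition does not hold. Python floats stand in for the register contents.

```python
class FpuConditionFlags:
    """Eight condition code flags, numbered 0-7, all initially 0."""

    def __init__(self):
        self.cc = [0] * 8

    def c_eq(self, fs, ft, cc=0):
        self.cc[cc] = 1 if fs == ft else 0

    def c_lt(self, fs, ft, cc=0):
        self.cc[cc] = 1 if fs < ft else 0

    def c_le(self, fs, ft, cc=0):
        self.cc[cc] = 1 if fs <= ft else 0


def movt(rd, rs, flags, cc=0):
    # Conditional move on true: the result is rs if the flag is 1, else rd.
    return rs if flags.cc[cc] == 1 else rd


def movf(rd, rs, flags, cc=0):
    # Conditional move on false: the result is rs if the flag is 0, else rd.
    return rs if flags.cc[cc] == 0 else rd


flags = FpuConditionFlags()
flags.c_lt(1.5, 2.0)                      # cc omitted, so flag 0 is used
flags.c_eq(1.5, 2.0, cc=3)                # explicit flag number
picked = movt(rd=10, rs=20, flags=flags)  # flag 0 is 1, so rs is taken
kept = movf(rd=10, rs=20, flags=flags)    # flag 0 is 1, so rd is kept
```

Using distinct flag numbers lets several comparison results stay live at once, so a later branch or conditional move can test any of them.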

Convert single to double

cvt.d.s fd, fs
0x11 0x10 0 fs fd 0x21
6 5 5 5 5 6

Convert integer to double

cvt.d.w fd, fs
0x11 0x14 0 fs fd 0x21
6 5 5 5 5 6

Convert the single precision floating-point number or integer in register fs to a
double precision number and put it in register fd.

Convert double to single

cvt.s.d fd, fs
0x11 0x11 0 fs fd 0x20
6 5 5 5 5 6

Convert integer to single

cvt.s.w fd, fs
0x11 0x14 0 fs fd 0x20
6 5 5 5 5 6

Convert the double precision floating-point number or integer in register fs to a
single precision number and put it in register fd.
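
Converting to single precision can lose information, while converting from single to double cannot. The Python sketch below illustrates this with the struct module, which rounds a Python float (a double) to the nearest single when packing with the "f" format. The function names mirror the mnemonics but are hypothetical, and the rounding shown is Python's, assumed here to match round to nearest.

```python
import struct


def cvt_s_d(value):
    """Round a double to single precision and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def cvt_s_w(value):
    """Convert a 32-bit integer to single precision."""
    return cvt_s_d(float(value))


def cvt_d_s(single_value):
    """Widen a single to double; every single is exactly representable."""
    return float(single_value)


exact = cvt_s_d(0.5)        # 0.5 is representable in both formats
inexact = cvt_s_d(0.1)      # 0.1 is rounded differently in each format
big = cvt_s_w(2**24 + 1)    # needs 25 significant bits, single has 24
```

The last line shows that even integer conversion can round: a single has a 24-bit significand, so not every 32-bit integer survives cvt.s.w unchanged.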

Convert double to integer

cvt.w.d fd, fs
0x11 0x11 0 fs fd 0x24
6 5 5 5 5 6

Convert single to integer

cvt.w.s fd, fs
0x11 0x10 0 fs fd 0x24
6 5 5 5 5 6

Convert the double or single precision floating-point number in register fs to an
integer and put it in register fd.

Floating-point divide double

div.d fd, fs, ft
0x11 0x11 ft fs fd 3
6 5 5 5 5 6

Floating-point divide single

div.s fd, fs, ft
0x11 0x10 ft fs fd 3
6 5 5 5 5 6

Compute the quotient of the floating-point doubles (singles) in registers fs and ft
and put it in register fd.

Floating-point floor to word

floor.w.d fd, fs
0x11 0x11 0 fs fd 0xf
6 5 5 5 5 6

floor.w.s fd, fs
0x11 0x10 0 fs fd 0xf

Compute the floor of the floating-point double (single) in register fs and put the
resulting word in register fd.

Load floating-point double

l.d fdest, address pseudoinstruction

Load floating-point single

l.s fdest, address pseudoinstruction

Load the floating-point double (single) at address into register fdest.

Move floating-point double

mov.d fd, fs
0x11 0x11 0 fs fd 6
6 5 5 5 5 6

Move floating-point single

mov.s fd, fs
0x11 0x10 0 fs fd 6
6 5 5 5 5 6

Move the floating-point double (single) from register fs to register fd.

Move conditional floating-point double false

movf.d fd, fs, cc
0x11 0x11 cc 0 fs fd 0x11
6 5 3 2 5 5 6

Move conditional floating-point single false

movf.s fd, fs, cc
0x11 0x10 cc 0 fs fd 0x11
6 5 3 2 5 5 6

Move the floating-point double (single) from register fs to register fd if condition
code flag cc is 0. If cc is omitted, condition code flag 0 is assumed.

Move conditional floating-point double true

movt.d fd, fs, cc
0x11 0x11 cc 1 fs fd 0x11
6 5 3 2 5 5 6

Move conditional floating-point single true

movt.s fd, fs, cc
0x11 0x10 cc 1 fs fd 0x11
6 5 3 2 5 5 6

Move the floating-point double (single) from register fs to register fd if condition
code flag cc is 1. If cc is omitted, condition code flag 0 is assumed.

Move conditional floating-point double not zero

movn.d fd, fs, rt
0x11 0x11 rt fs fd 0x13
6 5 5 5 5 6

Move conditional floating-point single not zero

movn.s fd, fs, rt
0x11 0x10 rt fs fd 0x13
6 5 5 5 5 6

Move the floating-point double (single) from register fs to register fd if processor
register rt is not 0.

Move conditional floating-point double zero

movz.d fd, fs, rt
0x11 0x11 rt fs fd 0x12
6 5 5 5 5 6

Move conditional floating-point single zero

movz.s fd, fs, rt
0x11 0x10 rt fs fd 0x12
6 5 5 5 5 6

Move the floating-point double (single) from register fs to register fd if processor
register rt is 0.

Floating-point multiply double

mul.d fd, fs, ft
0x11 0x11 ft fs fd 2
6 5 5 5 5 6

Floating-point multiply single

mul.s fd, fs, ft
0x11 0x10 ft fs fd 2
6 5 5 5 5 6

Compute the product of the floating-point doubles (singles) in registers fs and ft
and put it in register fd.

Negate double

neg.d fd, fs
0x11 0x11 0 fs fd 7
6 5 5 5 5 6

Negate single

neg.s fd, fs
0x11 0x10 0 fs fd 7
6 5 5 5 5 6

Negate the floating-point double (single) in register fs and put it in register fd.

Floating-point round to word

round.w.d fd, fs
0x11 0x11 0 fs fd 0xc
6 5 5 5 5 6

round.w.s fd, fs
0x11 0x10 0 fs fd 0xc

Round the floating-point double (single) value in register fs, convert to a 32-bit
fixed-point value, and put the resulting word in register fd.

Square root double

sqrt.d fd, fs
0x11 0x11 0 fs fd 4
6 5 5 5 5 6

Square root single

sqrt.s fd, fs
0x11 0x10 0 fs fd 4
6 5 5 5 5 6

Compute the square root of the floating-point double (single) in register fs and
put it in register fd.

Store floating-point double

s.d fdest, address pseudoinstruction

Store floating-point single

s.s fdest, address pseudoinstruction

Store the floating-point double (single) in register fdest at address.

Floating-point subtract double

sub.d fd, fs, ft
0x11 0x11 ft fs fd 1
6 5 5 5 5 6

Floating-point subtract single

sub.s fd, fs, ft
0x11 0x10 ft fs fd 1
6 5 5 5 5 6

Compute the difference of the floating-point doubles (singles) in registers fs and
ft and put it in register fd.

Floating-point truncate to word

trunc.w.d fd, fs
0x11 0x11 0 fs fd 0xd
6 5 5 5 5 6

trunc.w.s fd, fs
0x11 0x10 0 fs fd 0xd

Truncate the floating-point double (single) value in register fs, convert to a 32-bit
fixed-point value, and put the resulting word in register fd.
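
The four convert-to-word instructions (ceil.w, floor.w, round.w, and trunc.w) differ only in the direction of rounding. The Python sketch below puts them side by side. The function names are hypothetical, round_w assumes round to nearest with ties going to the even integer, and the explicit range check is an illustrative choice, since the text above does not say what happens when the result does not fit in 32 bits.

```python
import math

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def _check32(n):
    # Illustrative guard: reject results that do not fit in a 32-bit word.
    if not INT32_MIN <= n <= INT32_MAX:
        raise OverflowError("result does not fit in a 32-bit word")
    return n


def ceil_w(x):
    return _check32(math.ceil(x))   # toward positive infinity


def floor_w(x):
    return _check32(math.floor(x))  # toward negative infinity


def round_w(x):
    return _check32(round(x))       # to nearest, ties to even (assumed)


def trunc_w(x):
    return _check32(math.trunc(x))  # toward zero


for x in (2.5, -2.5, 3.5):
    print(x, ceil_w(x), floor_w(x), round_w(x), trunc_w(x))
```

For negative inputs trunc_w agrees with ceil_w rather than floor_w, which is the difference that most often causes off-by-one errors when porting code.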

Exception and Interrupt Instructions
Exception return

eret
0x10 1 0 0x18
6 1 19 6

Set the EXL bit in coprocessor 0’s Status register to 0 and return to the instruction
pointed to by coprocessor 0’s EPC register.

System call

syscall
0 0 0xc
6 20 6

Break

break code
0 code 0xd
6 20 6

Cause exception code. Exception 1 is reserved for the debugger.

No operation

nop
0 0 0 0 0 0
6 5 5 5 5 6

Do nothing.

A.11 Concluding Remarks

Programming in assembly language requires a programmer to trade helpful features
of high-level languages (such as data structures, type checking, and control
constructs) for complete control over the instructions that a computer executes.
External constraints on some applications, such as response time or program size,
require a programmer to pay close attention to every instruction. However, the
cost of this level of attention is assembly language programs that are longer, more
time-consuming to write, and more difficult to maintain than high-level language
programs.

Moreover, three trends are reducing the need to write programs in assembly
language. The first trend is toward the improvement of compilers. Modern
compilers produce code that is typically comparable to the best handwritten code,
and is sometimes better. The second trend is the introduction of new processors
that are not only faster, but in the case of processors that execute multiple
instructions simultaneously, also more difficult to program by hand. In addition, the
rapid evolution of the modern computer favors high-level language programs that
are not tied to a single architecture. The third trend is toward increasingly
complex applications, characterized by complex graphic interfaces and many more
features than their predecessors had. Large applications are written by teams of
programmers and require the modularity and semantic checking features provided
by high-level languages.
