

University of Waterloo
Cheriton School of Computer Science

CS 251: Sections 1, 2, and 3

Computer Organization and Design

Slides
for

Fall 2021

These slides are for the use of students taking CS 251 in Fall 2021. You should not share these
files with anyone else, and you agree to delete all your copies of these slides at the end of the
Fall 2021 term. It is your responsibility to ensure that no one else has access to these files.

Copyright 2001, 2009, 2013, 2016, 2019 by Stephen Mann and Prabhakar Ragde. Figures taken from the course text are copyrighted by Elsevier and reprinted
here with permission of the publisher. Some of the figures in the SPARC section are copyrighted by SPARC International, and are reprinted here with permission
of SPARC International.

Introduction 1–1

Introduction
• Course goals:

Understanding of computer architecture, structure and
evolution

• Computer architecture = instruction set architecture plus
computer organization

• Instruction set architecture:
Conceptual structure and functional behaviour of a computing
system as seen by the programmer

• Computer organization:
Physical implementation, described in terms of functional
units, their interconnection, and how the flow of information
among them is controlled

Introduction 1–2

Why We Teach This Course
•Understanding of what’s inside the computer
• Introduction to material in future courses
•Architecture issues influence programming

Introduction 1–3

Example: small code changes, big performance differences
#include <stdio.h>
#define NR 10000
#define NC 10000

int a[NR][NC];

void main() {
int i,j;
for (i=0;i<NR;i++)
for (j=0;j<NC;j++)
a[i][j] = 0; /* row order */
}

versus the same program with the inner statement a[j][i] = 0
(column order): the two versions differ by two characters but run
at very different speeds on a machine with caches (see the Memory
Hierarchies chapter).
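Why the order matters (a note added here, assuming C's usual
row-major layout): element a[i][j] lives at address
a + (i*NC + j)*sizeof(int), so the row-order loop touches
consecutive addresses while the column-order loop jumps
NC*sizeof(int) = 40000 bytes between accesses, landing in a
different cache block every time.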

Memory Hierarchies 7–9

Block Size Read Example
Tag 00 01 10 11

000
001
010
011

100
101
110
111

Dec Tag Indx Block Byte Hit/Miss
LDUR 96 000 011 00 000
LDUR 104 000 011 01 000
LDUR 112 000 011 10 000
LDUR 120 000 011 11 000
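A worked reading of the first row (assuming the 11-bit addresses
split 3/3/2/3 into Tag/Index/Block/Byte): 96 is 000 011 00 000, so
tag 000, index 011, block offset 00. The load of 96 misses and
brings in the whole 4-word block, so the loads of 104, 112, and 120
(block offsets 01, 10, 11 of the same block) all hit.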

Memory Hierarchies 7–10

Block Size Write Example
Tag 00 01 10 11

000
001
010
011

100
101
110
111

Dec Tag Indx Block Byte Hit/Miss
STUR 368 001 011 10 000
LDUR 376 001 011 11 000

Memory Hierarchies 7–11

Block Size Read/Write Example
Tag 00 01 10 11

000
001
010
011

100
101
110
111

Dec Tag Indx Block Byte Hit/Miss
LDUR 96 000 011 00 000
LDUR 104 000 011 01 000
LDUR 96 000 011 00 000
STUR 368 001 011 10 000
LDUR 376 001 011 11 000
LDUR 96 000 011 00 000

Memory Hierarchies 7–12

Block Size Code Example: Arrays
Pseudo-ARM: Sum the elements in an array A

100 ADD X1, X31, X31
104 ADD X2, X31, X31

loop: 108 LDUR X3, A(X1)
112 ADD X2, X2, X3
116 ADDI X1, X1, #1
120 SUBI X4, X1, #1000
124 CBNZ X4, loop
128 NOP

Assume 1 cc per instruction (data forwarding, etc)
LDUR takes 1cc if in cache
Block size 1: fetch from memory takes 17cc extra
Block size 4: fetch from memory takes 20cc (4 words) extra
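A rough count under these assumptions (ignoring instruction fetches
and the setup code): the 5-instruction loop runs 1000 times, about
5000cc of compute. With block size 1 every LDUR misses:
1000 × 17cc = 17000cc extra. With block size 4 only every fourth
LDUR misses: 250 × 20cc = 5000cc extra, so the larger block cuts
the run time by more than half.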

Memory Hierarchies 7–13

Block Size Code Example: Linked List
Pseudo-ARM code: sum elements in a linked list

100 LDUR X1, A
104 ADD X2, X31,X31
108 CBZ X1, done

loop: 112 LDUR X3, [X1,#0]
116 ADD X2, X2,X3
120 LDUR X1, [X1,#8]
124 CBNZ X1, loop

done: 128 nop

Assume 1 cc per instruction (data forwarding, etc)
LDUR takes 1cc if in cache
Block size 1: fetch from memory takes 17cc extra
Block size 4: fetch from memory takes 20cc (4 words) extra
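Here the larger block buys much less (a sketch, assuming the nodes
are scattered in memory): each iteration starts at a fresh node, so
roughly every node costs a full block fetch, about 1000 × 20cc with
block size 4 versus 1000–2000 × 17cc with block size 1 (the node's
two fields may share a block), and the extra words fetched belong
to no useful neighbour. The amortization the array enjoyed
disappears.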

Memory Hierarchies 7–14

Alternate Placement Schemes
• Fully-associative: a block can go anywhere in the cache
• Requires a comparator for each cache entry
• Set-associative: combines two ideas

[Figure: an 8-block cache organized three ways, Direct Map, Set
Associative, and Fully Associative; triangles mark the locations
that must be searched for Tag=13: one block, one set, or every
entry]

Memory Hierarchies 7–15

A 4-way Set-Associative Cache

[Figure: the address supplies a 22-bit tag and an 8-bit index; the
index selects one of 256 sets, the four ways' (V, Tag, Data)
entries are compared in parallel, and a 4-to-1 multiplexor delivers
the data on a hit]

Figure 5.18. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Memory Hierarchies 7–16

2-way set associative cache
Memory Access Cache

Dec Binary Hit/miss
20 10100
18 10010
20 10100
18 10010
22 10110
7 00111

22 10110
28 11100

Index Tag0 Tag1
00
01
10
11
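One way to fill the first rows (assuming 1-word blocks, so the low
2 address bits are the index and the high 3 the tag): 20 = 101|00
misses into set 00; 18 = 100|10 misses into set 10; 20 and 18 then
both hit; 22 = 101|10 misses into Tag1 of set 10; 7 = 001|11 misses
into set 11; 22 now hits; 28 = 111|00 misses into Tag1 of set 00.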

Memory Hierarchies 7–17

4-way set associative cache
Memory Access

Dec Binary Hit/miss
20 10100
18 10010
20 10100
18 10010
22 10110
7 00111

22 10110
28 11100

Cache
Index Tag0 Tag1 Tag2 Tag3

0
1

Memory Hierarchies 7–18

Fully associative cache
Memory Access

Dec Binary Hit/miss
20 10100
18 10010
20 10100
18 10010
22 10110
7 00111

22 10110
28 11100

Cache
Tag0 Tag1 Tag2 Tag3 Tag4 Tag5 Tag6 Tag7

Memory Hierarchies 7–19

Which one to kick out?
•What if all tags are full for an index?

Which one do we replace?

• LRU (Least Recently Used) Schemes
Replace item whose most recent use was furthest in the past

• 2-way set associative example: word accesses
20 (10100), 28 (11100), 20 (10100), 24 (11000)

Index Tag0 Tag1
00
01
10
11

•How much hardware to implement LRU?
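For the 2-way case a single bit per set is enough: point it at the
way just used and evict the other. Exact LRU for higher
associativity needs a full ordering of the ways in each set, which
is why real caches usually approximate it.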

Memory Hierarchies 7–20

Mixing “Ways” and “Blocks”
• Can have both ways and blocks in a cache

Example: 2-way set associative, 4 word line size

Tag0 Block 00 Block 01 Block 10 Block 11

Tag1 Block 00 Block 01 Block 10 Block 11

0000

0001

0010

• Caches typically associative with block size > 1
• Block size determines number of block bits

Memory Hierarchies 7–21

Breaking Address into Byte, Block, Index, and Tag

63 … 0: Tag | Index | Block | Byte

•Number of index bits depends on cache size, block size, and
the number of ways.

•A 2^n-way cache of size 2^k bytes, with a block size of 2^b words:
– 2 byte bits (or possibly 3, for 8-byte words)
– b block bits

Vendors use linesize, which is the sum of byte and block bits
– Index bits i = k−2−b−n = k−ls−n
– Tag is the remaining bits: 64−2−b−i = 64−ls−i
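A worked instance (numbers chosen here, not from the slide): a
2-way, 64KB cache with 4-byte words and 4-word blocks has k = 16,
b = 2, n = 1, so ls = 4, index bits i = 16 − 4 − 1 = 11, and tag
bits 64 − 4 − 11 = 49. Check: 2^11 sets × 2 ways × 16 bytes/block
= 64KB.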

Memory Hierarchies 7–22

Example: 2017 Asus Skylake (2015), i7
Intel i7, 2 cores

Memory Hierarchies 7–23

Example: 2017 Asus Kaby Lake (2016), i7
Intel i7, 4 cores, ROG

Memory Hierarchies 7–24

Example: 2016 Jumper 2
Intel Atom, 4 cores

Memory Hierarchies 7–25

How useful are the ways?
• For small (8 word) caches, fully associative makes sense
• For larger caches, more “ways” means fewer cache misses, but

is the extra hardware worth it?

• Experiment on a 64KB cache, 16-word block:
Associativity Miss Rate

1 10.3%
2 8.6%
4 8.3%
8 8.1%

Memory Hierarchies 7–26

Data cache miss rates

[Figure: miss rate (0% to 15%) versus associativity (one-way,
two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB;
miss rate falls as size and associativity grow]

Data cache miss rates. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Memory Hierarchies 7–27

Calculating Average Memory Access Time
•How long to access memory?

AMAT = Time for a hit + Miss rate × Miss penalty
• Example

1ns clock, miss penalty 20cc, miss rate 0.05 per instruction,
cache access time of 1cc
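Working it out with the slide's numbers: AMAT = 1cc + 0.05 × 20cc
= 2cc = 2ns; the 5% of accesses that miss double the average
access time.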

Memory Hierarchies 7–28

Sorting Example
Impact of cache on performance

[Figure, three panels plotting Radix Sort against Quicksort as size
grows from 4 to 4096 K items to sort: instructions/item (0 to
1200), clock cycles/item (0 to 2000), and cache misses/item (0 to
5)]

Figure 5.19. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

• Left: instructions
• Center: run time
• Right: cache misses

Thus, although radix sort executes fewer instructions,
quicksort is faster because of fewer cache misses

Memory Hierarchies 7–29

Radix Sort
• For each digit, linear pass through data
• If data bigger than cache, no subsets of data remain in cache

Memory Hierarchies 7–30

Quicksort
[Figure: quicksort on data 4x the cache size; the unsorted input is
1 block at 4x cache size, partitioning around a pivot leaves 2
blocks at 2x cache size, then 4 blocks at 1x cache size, which fit
in cache!]

• Initially, data too large for cache…
• …but at some depth, data fits in cache and remains in cache for

all deeper levels

Memory Hierarchies 7–31

Virtual Memory
• Level of memory hierarchy between main memory and

secondary storage (disks)

•Motivations:
– Sharing memory among multiple programs
– Allowing single user program to exceed size of main

memory

•Different terminology:
– Page: virtual memory block
– Page fault: virtual memory miss
• Idea similar to cache:

– Complete program (instructions and data) stored on disk
– Only keep parts you need in memory
– Use large (4KB+) blocks to reduce costs

Memory Hierarchies 7–32

How to run two programs at once?
• Each program starts at address 0 and goes up to 2^64−4 (64 bit)
•How do we run two programs?
• Time slice: each program gets to run for a while, then another

program runs.

•How do we switch between programs?

Memory Hierarchies 7–33

Idea 1: Copy back/forth to disk
•When changing programs, copy current program onto disk and

read next program from disk

[Figure sequence, four frames: Program 1 occupies memory (0 to
2**64) while Program 2 sits on disk; Program 1 is copied out to
disk, Program 2 is copied into memory, and the two programs trade
places]

•At 70MB/s, swapping between two 100MB programs takes
over 2 seconds

Memory Hierarchies 7–37

Idea 2: Keep both programs in memory
•Keep both programs in memory as follows:

– Assume 4GB+ memory
– Assume 2GB program size
– Program 1: 0 to 2^31−4
– Program 2: 2^31 to 2^32−4

MEMORY DISK

0 0

2**64

2**31 2**31

Program 1

Program 2

• Compiler creates Program 2's address space to start at 2^31
•When swapping, just start second program.
• Problem 1: What if the two programs overlap?
• Problem 2: What if both programs need more than 2GB?

Memory Hierarchies 7–38

Idea 3: Keep both programs in memory
•Keep both programs in memory as follows:

– Assume 4GB+ memory
– Assume 2GB program size
– Program 1: 0 to 2^31−4
– Program 2: stored in 2^31 to 2^32−4,
BUT program addresses run from 0 to 2^31−4

MEMORY DISK

0

0

0

2**32

2**31

Program 1

Program 2

•When swapping, just start second program.
• Problem 1: Program 2’s addresses need to be converted to

physical addresses

• Problem 2: What if both programs need more than 2GB?
• Problem 3: Slow to load large programs

Memory Hierarchies 7–39

Idea 4: Virtual Memory
• Like idea 3, but split memory into pages (eg, 4KB pieces).
• Each page of program can be in any page of physical memory.

MEMORY DISK

0

0

0

2**64

4K

24K

4K Program 2
Program 1

•Use page table to map program address to physical address.
• If program wants more space, give it a new page.

If no pages available, replace “lightly used” page

Memory Hierarchies 7–40

Virtual Memory Mappings

[Figure: virtual addresses map through address translation to
physical addresses or to disk addresses]

Figure 5.25. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

[Figure: a 64-bit virtual address (bits 63…12 virtual page number,
bits 11…0 page offset) is translated to a physical address with
the same page offset and a shorter physical page number]

Figure 5.26. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Memory Hierarchies 7–41

Key Decisions for Virtual Memory
•Misses very expensive (millions of cycles)
• Pages are larger (1KB to 4KB up to 64KB)
• Fully-associative schemes pay off, though full search too

expensive

• Page faults handled in software (OS)
•Write-back used (write-through too expensive)

Memory Hierarchies 7–42

Page Replacement Schemes
When a page needs replacement, we must decide which to
replace.

• LRU (Least Recently Used)
• LFU (Least Frequently Used)
• FIFO (First In, First Out)
• Random

Memory Hierarchies 7–43

Page Tables

[Figure: the page table register locates the page table in memory;
the virtual page number indexes it; if the entry's valid bit is 0
the page is not present in memory, otherwise the entry supplies
the physical page number, which is joined to the unchanged page
offset to form the physical address]
Figure 5.27. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Memory Hierarchies 7–44

Page Table Example
4KB pages

Virtual Address Physical Address

0000 … 0000 0000 0000 0000 0100 1000 1111

0000 … 0000 0000 0000 0001 0000 0000 0001

0000 … 0000 0000 0000 0010 0111 1000 1010

Page Table

1 01 0000 0001 0000 0101

0 11 0000 1010 0000 0000

1 00 0000 0000 0000 0001

1 00 0000 0000 0000 0011

V Physical Address

……

0000 … 0000 0000 0000 0000

0000 … 0000 0000 0000 0001

0000 … 0000 0000 0000 0010

0000 … 0000 0000 0000 0011
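The recipe behind the worksheet (a note added here): with 4KB pages
the low 12 bits of an address are the page offset and the remaining
high bits are the virtual page number, so the first address is
offset 0100 1000 1111 in virtual page 0. If that page's entry has
V = 1, its physical page number is prepended to the unchanged
offset to give the physical address; if V = 0, the access
page-faults.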

Memory Hierarchies 7–45

More on Page Tables
• Page replacement policies:

– Optimal: replace page used farthest into the future
– Good approximation: LRU (Least Recently Used)
– Too expensive to implement exactly, use approximation (eg

reference bits)

• Page tables are large (contain mappings for every virtual page)
– Multi-level page tables reduce space needed
– Page tables can themselves be paged
•VM must use writeback (called copyback here)

– Dirty bit

Memory Hierarchies 7–46

Data Initially On Disk

PIC

SWAP

DISK

Page Table

RAM

Memory Hierarchies 7–47

Program Reads File Into Memory

PIC

PIC

PIC

SWAP

DISK

Page Table
D
0

RAM

READ

COPY

Also copied into swap space

Memory Hierarchies 7–48

Picture Is Modified In Memory

PIC

PIC

PIC’

SWAP

DISK

Page Table
D
1

RAM

Dirty bit set to 1, swap copy differs from memory copy

Memory Hierarchies 7–49

If Picture Page In Memory Is Swapped Out…

PIC

PIC’

PIC’

SWAP

DISK

Page Table
D
0

RAM

COPY

Since dirty bit is 1, have to copy to reuse page in memory

Memory Hierarchies 7–50

Picture only in swap, not in memory

PIC

PIC’

SWAP

DISK

Page Table
D
0

RAM

Memory Hierarchies 7–51

Picture Read Into Memory When Needed

PIC

PIC’

PIC’

SWAP

DISK

Page Table
D
0

RAM

COPY

Memory Hierarchies 7–52

A Problem With Virtual Memory
• Convert virtual address to physical address:

look up virtual address in Page Table

• Page Table stored in memory
• To do LDUR A, we have to

– Look up virtual address A in Page Table to get physical
address A’

– Read A’ from physical memory
With virtual memory, LDUR requires two memory accesses
(Four memory accesses if the LDUR instruction is not in cache)

Memory Hierarchies 7–53

Translation Lookaside Buffers

[Figure: the page table holds, for every virtual page number,
Valid, Dirty, and Ref bits plus a physical page number or disk
address; the TLB caches a few of these entries (tag, Valid/Dirty/
Ref, physical page address); valid entries point into physical
memory, invalid page-table entries point to disk storage]

Figure 5.30. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

A cache for page table entries

Memory Hierarchies 7–54

Virtual Memory: Two Memory Accesses

Virtual Address
Page Table

(RAM)

Physical Address

RAM

Reading memory from a virtual address requires two memory
accesses

Memory Hierarchies 7–55

Virtual Memory: TLB and Cache

Virtual Address
Page Table

(RAM)

Physical Address

RAM

TLB Cache

Reading from the TLB and cache avoids reading from memory

Memory Hierarchies 7–56

Virtual Memory: The CPU

[Figure: the CPU's register file and control issue a Virtual
Address to the TLB; the resulting Physical Address goes to the
Cache, and on a miss to the Page Table and RAM]

Memory Hierarchies 7–57

Example (X1=0x1000, X2=0x2000, X6=0x6000, X8=0x8000)
Free pages: 0 1 2 3 4 5 6 7
PT: TLB: (4 lines)
Index: V D R PG V D R TAG PG
00000: 0
00001:
00002: 1
00003:
00004: 2
00005:
00006: 3
00007:
00008:
… …
Code (0):

0x00000100: ADD X11,X2,X3
0x00000104: LDUR X10,[X6,0x000]
0x00000108: LDUR X10,[X6,0x008]
0x0000010c: STUR X10,[X8,0x000]
0x00000110: LDUR X10,[X2,0x000]
0x00000114: LDUR X10,[X1,0x000]
0x00000118: LDUR X10,[X6,0x008]
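A start on the trace (my reading of the setup, with 4K pages):
instruction fetches all fall in virtual page 0, which "Code (0)"
marks as mapped. The first data access, LDUR at X6 = 0x6000, is
virtual page 6: the empty TLB misses, but the page table maps page
6 to physical page 3, so the entry is copied into the TLB and no
page fault occurs. The STUR at X8 = 0x8000 touches virtual page 8,
which is unmapped, so it page-faults and receives the next free
physical page.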

Memory Hierarchies 7–58

VM and Multitasking
•Need two modes: user process or operating system process
•OS processes are privileged (can do more)

– Write page table register
– Modify page tables or TLB
•Need method of switching modes

– User process to OS: System call or exception
– OS to user process: return-from-exception
• To swap processes, OS rewrites page table register and flushes

TLB

Memory Hierarchies 7–59

Memory Hierarchy

Level 1

Level 2

. . .

Level n

Size of memory

CPU Distance from CPU

Things at lower levels in hierarchy also in all higher levels:
•Data: L1 Cache, L2 Cache, L3 Cache, Memory, Disk
•Address: TLB, Page Table, Disk

When executing an instruction, two hierarchies active
• convert virtual address to physical address
• load instruction from memory

(load word does this twice)

Memory Hierarchies 7–60

• Example (4K pages):
LDUR X1,[X2,#20], where 20+X2=0x2000

– Need to convert virtual page 2 to physical page
TLB? Page Table?

– Need to load data at M[0x2000] into X1
Cache? RAM?

•Not all combinations possible

TLB PT RAM* Cache Disk Possible? # Disk Reads # Mem Access
HIT HIT HIT HIT HIT Possible 0 0
Miss Miss Miss Miss HIT
HIT Miss Miss Miss HIT
Miss Miss HIT HIT HIT
Miss HIT Miss Miss HIT
Miss HIT HIT Miss HIT
Miss HIT HIT HIT HIT
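A sample argument for the table (a note added here): a TLB hit with
a page-table miss is impossible, since the TLB only caches entries
present in the page table, and a cache hit with the data absent
from RAM is impossible, since the cache holds copies of RAM. The
all-miss row is the expensive page-fault case: one disk read plus
memory accesses to update the page table and finally fetch the
data.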

Memory Hierarchies 7–61

Virtual Memory: Summary
Problem:
• Program address space is larger than physical memory

2^64 vs 2^34

Program (compiler) layout in memory uses high memory
addresses

•Multiple programs share memory
Solution:
•Map program address (virtual address) to RAM address

(physical address)

•Map blocks (pages) [4KB⇒ 12-bit sub-address]
• PAGE TABLE CONTAINS THIS MAPPING
• Store copy of what’s in RAM on disk (swap space)
•Use reference bits (occasionally cleared) to approximate LRU

Memory Hierarchies 7–62

Virtual Memory: Making it faster
• Problem with virtual memory: doubles memory accesses

Page table stored in RAM

• Solution: TLB–a cache for the page table

Is VP in TLB?
  No: Is VP in PT?
    Yes: TLB{VP} = PT[VP]
    No (page fault: VM):
      PP = new Page
      RAM[PP] = Disk Read
      PT[VP] = PP
      TLB{VP} = PT[VP]
  Yes: PP = TLB{VP}

Then (CACHE): Is PP|PO in Cache?
  No: CACHE{PP|PO} = M[PP|PO]
  Yes: use CACHE{PP|PO}

PO: Page Offset
PP: Physical Page
VP: Virtual Page

[]: Direct Mapped
{}: Associative

VP|PO ⇒ PP|PO
(Virtual ⇒ Physical)

Input/Output 8–1

Input/Output
• Readings: Chapter 1.4, 5.2.
•Our interests:

– Concepts rather than numerical details
– Demands on CPU
– Connection of peripherals to CPU

Input/Output 8–2

Typical Organization of I/O Devices

[Figure: the processor (with cache and interrupts) and main memory
sit on a memory–I/O bus; I/O controllers on the bus drive disks,
graphics output, and a network]

Typical Organization of Bus. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Input/Output 8–3

Demands of I/O Systems
• Supercomputers:

– Single large read at start of batch job
– “Checkpointing” saves during run of job
• Transaction processing:

– Many small changes to large database
– Need good throughput but also reliability
•Unix file systems:

– Mostly to small files
– More reading than writing
• Benchmarks have been developed for all three areas

Input/Output 8–4

Types and Characteristics of I/O Devices
• Characteristics:

– Behaviour: input, output, storage
– Partner: human or machine
– Data rate
• Some types:

– Keyboard: input device, human partner, data rate 10 bits/sec
(15 WPM)

– Graphics display: output device, human partner,
data rate 60Mb/sec to 500 MB/sec

– Disk: storage device, machine partner, data rate
100Kb/sec—1500MB/sec

Input/Output 8–5

Keyboards and Mice
•Keyboards slow enough that each keystroke can generate

interrupt serviced by CPU

•Alternate arrangement: bytes generated by keystrokes can be
put into small buffer that is checked periodically by CPU

•We say the CPU polls the device
•A mouse needs to be polled more often to ensure smooth

animation of cursor on screen

•Mouse generates pulses indicating small movements in four
directions (up,down,left,right)

Input/Output 8–6

Magnetic Disks

[Figure: a disk's platters, the tracks on each platter surface, and
the sectors within each track]

Magnetic disks, platters, sectors. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Typically 1-4 platters, 10,000-50,000 tracks, 100-500 sectors,
Pre 2011: 512 bytes per sector. Since 2011: 4KB per sector.

Input/Output 8–7

Inside a Magnetic Disk
Disk drive image

Input/Output 8–8

Characteristics of Magnetic Disks
• Seek time: time to position read/write heads over desired track
• Rotational latency: must wait for correct sector to rotate under

head

• In practice these two depend highly on mix of requests
• Transfer time: time to read block of bits (typically sector)
• Can add cache to lower these times
• RAID: Redundant Array of Inexpensive Disks

Use redundancy and coding to improve throughput and
reliability

Input/Output 8–9

RAID—Redundant Arrays of Inexpensive Disks

[Figure: RAID levels; RAID 0 stripes blocks 1–9 across the disks,
RAID 1 mirrors each disk (AA BB CC), RAID 3, 4, and 5 add parity
disks (Σ), with RAID 5 rotating parity across the disks, and RAID 6
keeps two parity blocks per stripe. Blue disks mark redundant
information.]

Input/Output 8–10

Flash Memory/SSD
•HDD may be reaching limits

15,000 RPM available; 20,000 RPM has problems

• Flash memory similar to DRAM:
put voltage on line and test, but doesn’t forget

• Four types of cells:
SLC: 1 bit/cell, 100K writes
MLC: 2 bits/cell, 10K writes
TLC: 3 bits/cell, 3K writes
QLC: 4 bits/cell, 1K writes

Input/Output 8–11

Solid State Drive (SSD)
• SSD is groups of flash cells

Cells are grouped into pages (4KB)
Pages grouped into blocks (512KB)

• Read and write into page
• Erase a block (128 pages in a block)
• Speed comes from parallelism

Flash memory: 20MB/s
SSD: 10 in parallel: 200MB/s

Input/Output 8–12

SSD Issues
• Characteristics:

– Can read and write pages
– Can ONLY erase BLOCKS
– Can NOT overwrite page:

have to erase first

• Result: Controller writes to every page before erasing
Once “full”, a 4KB page write requires

– Read 512KB block into temp
– Erase 512KB block
– Write back valid pages
• Result: SSD fast when new, but later is slow (to write)

OS writes lots of temp files, so OS appears to slow down
(OS has properly handled this since Windows 7.)

Input/Output 8–13

SSD Write Example
Starting with 123 pages full and 5 empty…

Write 8KB Text File → Write Two 4KB Pages
Write 8KB Image → Write Two 4KB Pages
Delete Text File → Mark Two Pages Invalid
Write 8KB Image → Copy block to temp, erase block, copy valid
pages back, Write Two 4KB Pages

[Figure: throughout, the block also holds 123 full pages; the text
file's two pages ("CS 251 is a great / course and the prof is
great, too!") are marked INVALID after the delete and reclaimed by
the erase]

Input/Output 8–14

Comparison (Fall 2011 vs Fall 2013 vs Fall 2016)
2011 2013 2016

Size Price $/TB Size Price $/TB Size Price $/TB
HDD 3TB $180 $60 4TB $200 $50 8TB $320 $40
SSD 250GB $400 $1600 1TB $570 $570 2TB $650 $325

•Observations:
– SSDs increasing in capacity quicker than HDDs
– SSDs price per unit storage dropping faster than HDDs
• Speed:

SSDs range from 100 times slower to 10 times faster
(newer SSDs may not be slower than HDDs any more)

•Hybrid models
HDD with 100GB SSD used as cache: HHD

• SSDs are appearing in large data centers (Amazon, Facebook,
Dropbox)

Input/Output 8–15

Networks
• Ethernet: one-wire bus, 10/100/1000Mbit/sec transfer rate, no

central control

•Nodes listen for silence, then transmit
• Collisions resolved by random backoff
• Basis for most LANs (local area networks) when combined

with some switching

• Long-haul networks: distances of 10 to 10,000 km
•Usually packet-switched: information broken up, reassembled
• Protocols for reliable delivery (TCP) and addressing (IP)

Input/Output 8–16

Buses

[Figure: three bus organizations; processor, memory, and disks
attached through control lines and data lines, ranging from a. a
single shared bus to b. and c. arrangements with separate buses for
memory and I/O]

Direct memory mapping. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

A typical output operation (defined from processor point of view)

Input/Output 8–17

Bus Parameters
• Buses are typical bottleneck for I/O throughput
• Limitations on speed: length, number of devices
• Types: processor-memory, I/O, backplane
• Communication schemes: synchronous, asynchronous

– Synchronous have clock connected to control lines, fixed
protocols
Fast, but require similar devices and must be short

– Asynchronous buses require “handshaking” communication
protocol
Handled as two communicating FSMs

Input/Output 8–18

Bus Arbitration
•Who decides who gets access to a shared bus?
• Simplest scheme: the processor
• Better: have multiple bus masters
• Choosing among bus masters is bus arbitration
•Need fairness but also scheme of priorities
• Typical schemes: daisy-chain, centralized, token-passing,

self-selection, collision detection (as in Ethernet)

Input/Output 8–19

Interfacing with Processor and Memory
• Two schemes for processor I/O: memory-mapped, special

instructions

• Polling versus interrupt-driven I/O
•Usually need to have different levels of interrupts depending

on priority

•DMA: direct memory access (device controller transfers data
right into memory without processor involvement)

• Creates problems with caches or VM

Input/Output 8–20

Example: Pentium Bus

[Figure: two CPUs with caches on the Frontside Bus to the North
Bridge, which also connects RAM and the Graphics Card; the South
Bridge fans out to the PCI Bus, USB Bus, IDE Bus (Hard Disk, DVD,
CDROM), Firewire Bus, ISA Bus, Audio, Keyboard, Mouse, Ethernet
Bus/Interface, and SCSI Interface]

Starting Your Computer 9–1

What happens when you turn on your computer?
SoC: System on Chip

[Figure: an SoC containing the CPU, RAM, ROM, Ethernet, and USB on
a high-speed bus, with a Bridge out to the peripheral bus; the
Reset signal enters the SoC from outside]

•High speed synchronous bus (100MHz) connecting CPU,
RAM, ROM, Ethernet, USB, Bridge

• Bridge connects to peripheral bus: Asynchronous, 50MHz,
hand shaking protocol

Starting Your Computer 9–2

Power On
• Power to SoC: 30ms to get stable
• Clock needs time to get started

Power supply generates reset for 0.5 seconds

• Reset
Reset from power supply goes to CPU

• CPU sends reset signal to other devices on SoC
Except RAM (for debugging)

•On Reset:
– PC← 0x0000 0000

PC Address aliased to ROM (not RAM)

Starting Your Computer 9–3

PC starting at 0
Executing ROM code (SoC boot code):

1. Disable watchdog
Watchdog: reset processor if it reaches end of count

2. Configure CPU

• turn off cache
• turn off memory manager
• turn off branch predictor
i.e., get to a known state

3. Configure memory controller

4. Load code into static RAM (flash, bus)

5. Clear boot mode
Execute from RAM rather than ROM

6. Start executing flash boot code loaded in 4th step

Starting Your Computer 9–4

• Flash boot code
Does all of the above again
(board manufacturer knows what’s on board vs SoC)

•Once configured for board, OS loaded and OS execution
begins

Multiprocessors 10–1

Multiprocessors
• Readings: Chapter 6
•Motivations:

– Increased throughput
– Scalability
– Fault tolerance
• Issues:

– How to share data among processors?
– How to coordinate processing?
– What is the optimal number of processors?
• Terminology:

Rather than “multiprocessor”, the term multicore is used

Multiprocessors 10–2

Parallelism is Easy!
•Modern uniprocessor (single core) CPUs are parallel

computers
Pipelining

• Pipelining gives important but limited benefits
•Want to do better:

break task into multiple pieces,
run on multiple cores

Multiprocessors 10–3

Parallelism is Hard!
•How to program multicore machines?
• Large gains needed to justify expense

Book example: to get a factor of 90 speedup with 100 cores,
program must be “99.9%” parallelizable

• To achieve large gains, must balance load
If one processor does more work, rest are idle

• Programming language constructs
Synchronization, data sharing, data locking, …
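The factor-of-90 figure above checks out with Amdahl's Law (a quick
verification, not on the original slide): with parallelizable
fraction p = 0.999 on n = 100 cores, speedup = 1/((1 − p) + p/n) =
1/(0.001 + 0.00999) ≈ 91.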

Multiprocessors 10–4

Hardware Multithreading
• Thread: process of execution

– Potentially, one program can have multiple threads of
execution (parallelism)

– On single core, “swap” between execution of threads
– On multi-core, one process per core (“swap” if needed).
•Hardware multithreading

– Multiple threads run on a single core
– Idea is that when one thread stalls, run another thread
• Fine-grained multithreading: switch threads each instruction
• Coarse-grained multithreading: switch threads on long stalls
• Bottom line: hardware multithreading reduces idle time of

processor

Multiprocessors 10–5

Some Parallelism is Easy
• Instruction/Data classifications:

– SISD—Single Instruction, Single Data
Pentium 4

– SIMD—Single Instruction, Multiple Data
MMX, SSE, SSE2

– (MISD—Multiple Instruction, Single Data)
– MIMD—Multiple Instruction, Multiple Data

“algorithmic” parallelism

Multiprocessors 10–6

SIMD
Single Instruction, Multiple Data
• Example:

vec_res.x = v1.x + v2.x; vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z; vec_res.w = v1.w + v2.w;

vs

movaps xmm0, [v1]
addps xmm0, [v2]
movaps [vec_res], xmm0
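The same computation without dropping to assembler (a sketch added
here, not from the slide), using the SSE intrinsics C compilers
expose; _mm_add_ps is the addps, and the unaligned load/store
intrinsics stand in for movaps:

#include <xmmintrin.h> /* SSE intrinsics */

typedef struct { float x, y, z, w; } vec4;

void vec4_add(vec4 *res, const vec4 *v1, const vec4 *v2)
{
    __m128 a = _mm_loadu_ps(&v1->x); /* load 4 floats at once */
    __m128 b = _mm_loadu_ps(&v2->x);
    _mm_storeu_ps(&res->x, _mm_add_ps(a, b)); /* one SIMD add */
}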

•Works best on data-level parallelism
Data independent from one another

•Works best with arrays in for loops
Numerical computations (matrix multiply)

•MMX, SSE, AVX, …

Multiprocessors 10–7

GPU: an extreme example of SIMD
•GPU is high-performance parallel computer
• Intended for graphics, so no need to support general

computing

• Relies on hardware multithreading rather than caches to
handle memory latency
Execute other threads while waiting for memory

• Relies on memory bandwidth rather than latency
• Intended for graphics, but support for non-graphics (CUDA)

Multiprocessors 10–8

Example: y=ax+y
Computing y = ax + y with a serial loop:
void saxpy_serial(int n, float alpha, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = alpha*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Computing y = ax + y with a CUDA kernel, one thread per element:

__global__
void saxpy_parallel(int n, float alpha, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha*x[i] + y[i];
}
// Invoke parallel SAXPY kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

Multiprocessors 10–8

Multiprocessor Parallelism
What’s required to run a program on multiple cores?

• Compiler support
Compiler recognizes code that can run in parallel (SIMD)

• Programming language support:
Threads, locks, communication

•Hardware support:
Shared memory access
Caches
Networks, Message passing

Multiprocessors 10–9

Communication: Shared-Memory vs. Message Passing
• Shared-memory view: processors have a single common

address space

– Uniform Memory Access: any reference by any processor
takes same time

– Non-Uniform Memory Access: each processor has “nearby”
and “farther away” memory

– UMA easier to program, NUMA more realistic and scalable
– Both require methods of synchronization
•Message-passing view: processors have private memories,

communicate via explicit messages

– Synchronization is more natural
– More suited to network view

Multiprocessors 10–10

Message Passing Multiprocessors
• Some large-scale computers with high-performance message

passing networks
Fast, but expensive

• Some concurrent applications have task level parallelism that
require little communication

– Web search, mail servers, file servers
– Clusters: an example of message-passing parallel computers
– Separate OS on each computer, separate memory
– Separate memories, etc., leads to increased system

dependability, expandability

•Warehouse-scale computing
– Internet services may require 50,000–100,000 computers

Require special building, power, cooling
– Economy of scale⇒ cloud computing

Multiprocessors 10–11

Organization: Bus-Connected vs Network-Connected

[Figure: bus-connected, three processors with caches share a single
bus to one memory and I/O; network-connected, three processors with
caches each pair with their own memory and communicate over a
network]

• Big issue: maintaining data consistency (called cache
coherency/consistency)

•Data read by several processors will show up in several caches
•What happens when that data is changed?

Multiprocessors 10–12

Multiprocessor Cache Coherency: Snooping
• Idea: information to maintain coherency is available on bus
• Each processor’s cache monitors bus
• Simplest idea: write-update (or write-broadcast)

– When a processor writes a block, all other caches containing
that block are updated with new value

– Increases bus traffic, limits scalability, but reduces latency
• Better idea: write-invalidate

– When a processor writes a block, it first issues an
invalidation signal

– Other caches invalidate their copies of that block
– Subsequent writes by same processor result in no bus traffic
– Subsequent reads by other processors are slower
•Variations: build in exclusivity (avoiding need for invalidation

broadcast), transferring data between caches directly

Multiprocessors 10–13

Network Topologies

a. 2D grid or mesh of 16 nodes b. n-cube tree of 8 nodes (8 = 2^3 so n = 3)

Figure 6.14. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Variations: cylinder, 3D torus, mesh of trees; cube-connected
cycles

Multiprocessors 10–14

More Network Topologies

[Figure: a. a crossbar connecting P0–P7; b. an omega network
connecting P0–P7 through stages of switch boxes; c. one omega
network switch box with ports A, B, C, D]

Figure 6.15. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Variations: butterfly, shuffle-exchange, Beneš network

Multiprocessors 10–15

Coherency in Networked Multiprocessors

[Figure: processors with caches, and memories with directories,
joined by a network; panel a. keeps coherency at the cache level,
panel b. at the memory level]

Memory coherency. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

Cache-level coherency Memory-level coherency

Multiprocessors 10–16

The Evolution-Revolution Spectrum

[Figure: architecture ideas arranged along a spectrum from
Evolutionary (left) to Revolutionary (right), including
microprogramming, pipelining, cache, timeshared multiprocessor,
virtual memory, RISC, CC-UMA multiprocessor, parallel processing
multiprocessor, CC-NUMA multiprocessor, Not-CC-NUMA
multiprocessor, message-passing multiprocessor, and massive SIMD]

Evolutionary Revolutionary

Evolution vs Revolution. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.

• Items further right require more effort from programmers/users
• Software, not hardware, is current bottleneck

Other Architecture Ideas 11–1

Overview
Many other ideas in architecture.
We will look at a few:

•Microprogramming
VAX
CISC vs RISC

• SPARC Ideas
Register window
32-bit jump command

Other Architecture Ideas 11–2

Microprogramming
• Idea: control unit for datapath runs program for each assembly

instruction

•Multicycle datapath so far too simple for microprogramming
– Microprogram needs scratch registers
– Microprogram needs conditional branches
•Need to extend datapath, modify to give us extra power

– Extra registers in register file
– Control lines to specify read/write registers
– Control specify constants
– Status bits sent to control

Other Architecture Ideas 11–3

VAX: A microprogrammed architecture
•VAX is an example of a microprogrammed architecture

Some instructions ran faster than others

•Goals
– Direct support for high-level language constructs

(procedure calls, multi-way branching, loop control, array
subscript calculation)

– Minimize code space to save memory
•Variable length instruction encoding (from 1 to 54 bytes)
• Sixteen 32-bit registers, R0 through R15
• Some are special (eg R15 is PC, R14 is stack pointer)
• Large number of instructions (304) and data types

Other Architecture Ideas 11–4

VAX instruction encoding
• First byte is opcode (or first two)
•Opcode implies number of operands, data type
• For each operand, one byte with addressing mode and register

name (4 bits each)

•More information may follow if needed (e.g. constants)
• Example: ADDL3 R1, 737(R2), (R3)[R4]

7 bytes of storage

Other Architecture Ideas 11–5

Examples of complex VAX instructions
•MOVC3: copy a string
• SCANC: scan for character
•AOBLEQ: add one and branch on less than or equal
• INSQUE: insert into queue
• CRC: calculate cyclic redundancy check
• POLY: evaluate polynomial

Other Architecture Ideas 11–6

VAX/ARM instruction set comparison
• Task: copy fifty words of data from one array to another
• Pseudo-code:
for i=0 to 49 do b[i] = a[i];

• Can write programs in each type of assembler, but comparison
requires some assumptions

• Suppose base address of array a is in R3 (VAX) / X13 (ARM), base
address of array b is in R4 / X14

Other Architecture Ideas 11–7

ARM code
ADDI X2, X31, #50 # put 50 into X2
ADDI X3, X13, #0 # copy X13 into X3
ADDI X4, X14, #0 # copy X14 into X4

loop: LDUR X1, [X3,#0] # get word pointed to by X3
STUR X1, [X4,#0] # put word where pointed by X4
ADDI X3, X3, #8 # increment X3
ADDI X4, X4, #8 # increment X4
SUBI X2, X2, #1 # decrement count
CBNZ X2, loop # loop if count not zero

Other Architecture Ideas 11–8

VAX code
CLRL R2 ; R2 is index

LOOP: MOVL (R3)[R2], (R4)[R2]; (note scaling)
AOBLSS #50, R2, LOOP ; add one, branch on <50

Other Architecture Ideas 11–9

Analysis of ARM code
•Use pipelined implementation, with all forwarding possible for
data hazards, including load-use
• Branch hazards require stall on success
• Code rearrangement to get rid of load/store stall and branch stall
•One change: memory can be slow, whole machine stalls until
memory access completes
• Let C be cycle time, M be memory access stall time

Other Architecture Ideas 11–10

Analysis of VAX code
•Assume multicycle implementation similar to microcoded ARM
implementation done in class
•No pipelining
• CLRL, AOBLSS take 4 cycles; AOBLSS is 2-word instruction
•MOVL takes 6 cycles because of operand address calculations
•Above assumptions are reasonable but do not correspond
exactly to reality
– In reality, VAX instructions are fetched by the byte, and the
branch would be an 8-bit displacement
– VAX implementations pipeline the microcode, so the operand
fetches could overlap to some extent

Other Architecture Ideas 11–11

Comparison of ARM/VAX running times
• Reasonable assumption: 2 ns cycle time, 10ns memory stall time
• If no cache, memory time dominates, VAX wins
• If everything in cache, no memory stalls, ARM wins
• If cache is initially empty but large enough to contain
everything, ARM wins since it has better compute time

Other Architecture Ideas 11–12

SPARC: A RISC Example
•Goals
– Support object-oriented languages
– Efficient trapping and subroutine call-return
– RISC philosophy
• Fixed instruction size: 32 bits
• Small number of instruction formats (5 in V8)
• Load-store architecture, delayed branches
• 32 single-precision floating-point registers/16 double-precision
• 40 to 520 integer registers, BUT only 32 accessible at a time

Other Architecture Ideas 11–13

Registers
• 32 registers visible at one time: %r0 to %r31
• Four 8-register banks: inputs, outputs, globals, and locals
•Globals %g0 to %g7 correspond to %r0 to %r7
• Register %r0/%g0 always has value 0
•Outputs %o0 to %o7 correspond to %r8 to %r15
• Locals %l0 to %l7 correspond to %r16 to %r23
• Inputs %i0 to %i7 correspond to %r24 to %r31

Other Architecture Ideas 11–14

Registers
[Figure 1 from the SPARC V9 manual (General-Purpose Registers,
Nonprivileged View): ins i7–i0 are r[31]–r[24], locals l7–l0 are
r[23]–r[16], outs o7–o0 are r[15]–r[8], globals g7–g0 are
r[7]–r[0]]

Programming Note (SPARC V9 manual): The alternate global registers
are present to give trap handlers a set of scratch registers that
are independent of nonprivileged software's registers. The AG bit
in PSTATE allows supervisor software to access the normal global
registers if required (for example, during instruction emulation).

At any time, an instruction can access the 8 globals and a
24-register window into the r registers. A register window
comprises the 8 in and 8 local registers of a particular register
set, together with the 8 in registers of an adjacent register set,
which are addressable from the current window as out registers.
See figure 2.

Other Architecture Ideas 11–15

Register Window
• 32 registers visible at one time: %r0 to %r31
• 40 to 520 registers in V8 implementation
• Inputs, Locals, and Outputs actually a 24-register window into
master register file
•When call a subroutine, can shift register window forward so
outputs become inputs, and get a new set of locals and outputs
•When return from a subroutine, can shift register window back
• Register "stack" managed as a circular buffer
•When overflows, trap to O/S to save off registers to stack in
memory
•More efficient to save a lot of registers at once, and if have a
lot of call-returns can avoid many register saves

Other Architecture Ideas 11–16

Register Windows
[Figure 2 from the SPARC V9 manual (Three Overlapping Windows and
the Eight Global Registers): windows CWP−1, CWP, and CWP+1 each
hold ins r[31]–r[24], locals r[23]–r[16], and outs r[15]–r[8]; one
window's outs overlap the next window's ins; globals r[7]–r[1] are
shared, and r[0] is 0]

The number of windows or register sets, NWINDOWS, is
implementation-dependent and ranges from 3 to 32. The total number
of r registers in a given implementation is 8 (for the globals),
plus 8 (for the alternate globals), plus the number of sets times
16 registers/set. Thus, the minimum number of r registers is 64 (3
sets plus the 16 globals).

Other Architecture Ideas 11–17

Register Window Management
• Register "stack" is a FIFO, managed as a circular array
• Each time we push or pop register file, modify current window
index by 16
•When overflows, trap to O/S to save off registers to stack in
memory
•More efficient to save a lot of registers at once
• Typically can avoid many register saves altogether
•Gives more efficient subroutine calls and interrupt/trap
processing
•Good for real-time systems and certain programming languages
(FORTH, Smalltalk, LISP, C++)

Other Architecture Ideas 11–18

SPARC Instruction Format
[Figure: the three SPARC instruction formats and their bit fields;
Format 1 (op=1), CALL: op, then a 30-bit disp30;
Format 2 (op=0), SETHI and branches: op, rd (or a and cond), op2,
and a 22-bit imm22/disp22;
Format 3 (op=2 or 3), arithmetic, logical, load, store: op, rd,
op3, rs1, then either rs2 (i=0, with opf for floating point) or a
13-bit simm13 (i=1)]

Other Architecture Ideas 11–19

ARM Instruction Set
•ARM instruction set far bigger than done in class
•Number of instructions:
[Figure 2.41. ©2017 Elsevier. Reproduced with permission from
Computer Organization and Design, ARM edition.]
•Will highlight some interesting "extras"

Other Architecture Ideas 11–20

Registers
•Many registers have special purposes
Some are just conventions.
[Figure 2.15. ©2017 Elsevier. Reproduced with permission from
Computer Organization and Design, ARM edition.]

Other Architecture Ideas 11–21

Status Bits
•ARM instructions can set and test condition codes
• Four condition codes: Negative (N), Zero (Z), Overflow (V),
Carry (C)
• B instruction suffixed with condition
[Figure 2.10. ©2017 Elsevier. Reproduced with permission from
Computer Organization and Design, ARM edition.]
Becomes a conditional branch

Other Architecture Ideas 11–22

Shifty Arithmetic Operations
• R-format instructions optionally set the status bits
ADD vs ADDS
• Second register of R-format can be shifted
[R-format layout: 11-bit opcode | 5-bit Rm | 6-bit shamt | 5-bit Rn
| 5-bit Rd]

ADD X1,X2,X3<<3

x86 Instruction Set Architecture 12–1

Intel Pentium: Basic Instruction Set
The instruction set is very large
• 48 move/data instructions
• 12 binary arithmetic instructions
ADD, ADC (add w/carry), SUB, MUL, DIV, INC, CMP
• 6 decimal arithmetic instructions
• 4 logic instructions (AND, OR, XOR, NOT)
• 10 shift/rotate instructions
• 48 control transfer instructions

x86 Instruction Set Architecture 12–2

Instruction Set (cont)
• 37 bit/byte instructions
Testing bits/bytes
• 57 string instructions
• 13 flag control instructions
Set/clear conditional flags used by branches
• 5 segment register instructions
• 6 misc instructions (including NOP)
Additional floating point and SIMD (SSE) instructions

x86 Instruction Set Architecture 12–3

Addressing Modes
• 16-bit, 32-bit (mostly 32 bit)
• Segment-relative addresses
SEGMENT + BASE + (INDEX*SCALE) + DISPLACE

Mode                             Example
Register                         MOV DS,AX
Immediate                        MOV AX,1000H
Direct Addressing                MOV [2000H],AX
Register Indirect                MOV DL,[SI]
Base Addressing                  MOV AX,[BX+4]
Indexed Addressing               MOV [DI-8],BL
Based Indexed Addressing         MOV [BP][SI],AH
Based Indexed with Displacement  MOV CL,[BX+DI+2040H]

Pentium 13–1

IA-32 Evolution
32-bit Pentium evolution starts with 80386
• 1985: 80386
32-bit addressing
• 1989: 80486
– Internal cache,
– Instruction pipeline,
– Integrated math coprocessor
• 1993: Pentium
– Superscalar microarchitecture executes multiple instructions
in parallel;
– Split internal cache (L1)
separate instruction and data caches

Pentium 13–2

• 1995: Pentium Pro
– Backside level 2 cache (L2)
– Convert complex instructions into micro-ops
micro-ops execute on RISC
– Longer pipeline

Pentium 13–3

[Figure: the Pentium Pro microarchitecture; branch prediction
steers instruction prefetch and decode from the instruction cache,
decoded micro-ops enter a RISC-operation queue, then dispatch and
register renaming feed an integer and floating-point operation
queue and a load/store queue; execution units include three
integer ALUs (one with a multiplier), a floating point adder/SSE,
a floating point multiplier/SSE, and a floating point misc unit,
backed by the register file, data cache, and commit unit]

Pentium 13–4

• 1997: Pentium MMX, Pentium II
– MMX instructions (integer SIMD instructions)
• 1999: Pentium III
– SSE: SIMD single precision floats
Useful for 3D graphics
– 10 stage pipeline

Pentium 13–5

• 2000: Pentium 4
– SSE2: double precision floats, quadword ints, 128-bit ints
– 20 stage pipeline
– Branch prediction (4K branch predictor table)
– Looks ahead 126 instructions to find parallel tasks
– 12K Instruction cache stores micro-ops instead of instructions
– L1: 8-KB, 4-way set associative, 64-byte lines
– L2: 256KB (512KB), 8-way set associative, 128-byte lines
– ALU executes common integer ops in one-half clock cycle

[Figure: the Pentium 4 pipeline with clock-cycle counts per stage;
trace cache access including the predictor (5), microoperation
queue, reorder buffer allocation + register renaming (4),
functional unit queues, scheduling + dispatch unit (5), register
file access (2), execution (1), and commit (3)]

Pentium 13–6

• 2003,4: AMD, Intel announce extensions to 64 bit architectures
SSE3 (graphics, video, thread synchronization support).

i7 14–1

i7
• x86 complex, so i7 converts to micro-operations.
• Register renaming maps 16 architectural registers to larger
number of physical registers
[Figure. ©2017 Elsevier. Reproduced with permission from Computer
Organization and Design, ARM edition.]

i7 14–2

Performance on Benchmarks
[Figure: CPI on the benchmarks; the ideal CPI is 0.25. ©2017
Elsevier. Reproduced with permission from Computer Organization
and Design, ARM edition.]

i7 14–3

Branch Misprediction on Benchmarks
[Figure. ©2017 Elsevier. Reproduced with permission from Computer
Organization and Design, ARM edition.]

Conclusions 15–1

Where are we now?
• Power wall
[Figure, similar to Figure 1.16: clock rate (MHz, log scale from 1
to 10,000) and power (watts) for Intel processors from the 80286
(1982, 12.5 MHz, 3.3 W) through the 80386, 80486, Pentium, Pentium
Pro, Pentium 4 Willamette (2001) and Prescott (2004, 3600 MHz,
103 W), Core 2 Kentsfield (2007), Core i5 Clarkdale (2010), and
Core i5 Ivy Bridge (2012, 3400 MHz, 77 W); both curves climb
steeply until about 2004, then flatten. ©2017 Elsevier. Reproduced
with permission from Computer Organization and Design, ARM
edition.]
• Parallelism
GPUs
• Smaller is better (for home market)
•Network applications

Conclusions 15–2

Take Aways
•Understanding hardware can improve compilers
code rearrangement, loop unrolling
•Understanding hardware/OS can improve code
Cache, virtual memory, paging
• Introduction to material you'll see later
– Floating point
– OS issues
– Data structures
– Interrupts, exceptions
– Synchronization, parallelism

Conclusions 15–3

Example: Revisited
#include <stdio.h>
#define NR 10000
#define NC 10000

int a[NR][NC];

void main() {
int i,j;
for (i=0;i<NR;i++)
for (j=0;j<NC;j++)
a[i][j] = 0; /* row order */
}
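Tying it back to the memory hierarchy chapter (a closing note added
here): the row-order loop walks consecutive addresses, so every
word of each cache block fetched gets used; changing the inner
statement to a[j][i] = 0 makes the stride NC ints (40000 bytes),
every access lands in a different block, and nearly every access
misses. Two characters, a large constant factor.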