University of Waterloo
Cheriton School of Computer Science
CS 251: Sections 1, 2, and 3
Computer Organization and Design
Slides
for
Fall 2021
These slides are for the use of students taking CS 251 in Fall 2021. You should not share these
files with anyone else, and you agree to delete all your copies of these slides at the end of the
Fall 2021 term. It is your responsibility to ensure that no one else has access to these files.
Copyright 2001, 2009, 2013, 2016, 2019 by Stephen Mann and Prabhakar Ragde. Figures taken from the course text are copyrighted by Elsevier and reprinted
here with permission of the publisher. Some of the figures in the SPARC section are copyrighted by SPARC International, and are reprinted here with permission
of SPARC International.
Introduction 1–1
Introduction
• Course goals:
Understanding of computer architecture, structure and
evolution
• Computer architecture = instruction set architecture plus
computer organization
• Instruction set architecture:
Conceptual structure and functional behaviour of computing
system as seen by programmer
• Computer organization:
Physical implementation, described in terms of functional
units, their interconnection, how information flow among them
is controlled
Introduction 1–2
Why We Teach This Course
•Understanding of what’s inside the computer
• Introduction to material in future courses
•Architecture issues influence programming
Introduction 1–3
Example: small code changes, big performance differences
#include <stdio.h>
#define NR 10000
#define NC 10000
int a[NR][NC];
void main() {
  int i,j;
  for (i=0;i<NR;i++)
    for (j=0;j<NC;j++)
      a[i][j] = 0;
}
versus the same program with the loops interchanged:
#include <stdio.h>
#define NR 10000
#define NC 10000
int a[NR][NC];
void main() {
  int i,j;
  for (j=0;j<NC;j++)
    for (i=0;i<NR;i++)
      a[i][j] = 0;
}
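C stores a in row-major order, so a[i][j] and a[i][j+1] are adjacent in memory. With i as the outer loop the program sweeps memory sequentially; with j as the outer loop, consecutive accesses are NC × 4 = 40,000 bytes apart, and (as the cache chapters explain) almost every access misses.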
Memory Hierarchies 7–9
Block Size Read Example
Tag 00 01 10 11
000
001
010
011
100
101
110
111
Dec Tag Indx Block Byte Hit/Miss
LDUR 96 000 011 00 000
LDUR 104 000 011 01 000
LDUR 112 000 011 10 000
LDUR 120 000 011 11 000
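Reading the columns: with 8-byte words (3 byte bits), 4-word blocks (2 block bits), and 8 sets (3 index bits), address 96 = 000 011 00 000 has tag 000 and index 011. Starting from an empty cache, the first load misses and brings in the whole 4-word block, so the loads of 104, 112, and 120 then hit.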
Memory Hierarchies 7–10
Block Size Write Example
Tag 00 01 10 11
000
001
010
011
100
101
110
111
Dec Tag Indx Block Byte Hit/Miss
STUR 368 001 011 10 000
LDUR 376 001 011 11 000
Memory Hierarchies 7–11
Block Size Read/Write Example
Tag 00 01 10 11
000
001
010
011
100
101
110
111
Dec Tag Indx Block Byte Hit/Miss
LDUR 96 000 011 00 000
LDUR 104 000 011 01 000
LDUR 96 000 011 00 000
STUR 368 001 011 10 000
LDUR 376 001 011 11 000
LDUR 96 000 011 00 000
Memory Hierarchies 7–12
Block Size Code Example: Arrays
Pseudo-ARM: Sum the elements in an array A
100 ADD X1, X31, X31
104 ADD X2, X31, X31
loop: 108 LDUR X3, A(X1)
112 ADD X2, X2, X3
116 ADDI X1, X1, #1
120 SUBI X4, X1, #1000
124 CBNZ X4, loop
128 NOP
Assume 1 cc per instruction (data forwarding, etc)
LDUR takes 1cc if in cache
Block size 1: fetch from memory takes 17cc extra
Block size 4: fetch from memory takes 20cc (4 words) extra
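Assuming A is block-aligned and the cache starts cold: with block size 1 every one of the 1000 loads misses (1000 × 17 = 17,000 extra cc); with block size 4 only every fourth load misses (250 × 20 = 5,000 extra cc), so the larger block wins on this sequential access pattern.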
Memory Hierarchies 7–13
Block Size Code Example: Linked List
Pseudo-ARM code: sum elements in a linked list
100 LDUR X1, A
104 ADD X2, X31,X31
108 CBZ X1, done
loop: 112 LDUR X3, [X1,#0]
116 ADD X2, X2,X3
120 LDUR X1, [X1,#8]
124 CBNZ X1, loop
done: 128 nop
Assume 1 cc per instruction (data forwarding, etc)
LDUR takes 1cc if in cache
Block size 1: fetch from memory takes 17cc extra
Block size 4: fetch from memory takes 20cc (4 words) extra
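Here the next node can be anywhere in memory, so a fetched block rarely contains the following node: with little spatial locality, the larger block mostly adds transfer time (20 cc instead of 17 cc per miss) without raising the hit rate.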
Memory Hierarchies 7–14
Alternate Placement Schemes
• Fully-associative: a block can go anywhere in the cache
• Requires a comparator for each cache entry
• Set-associative: combines two ideas
[Figure: the same block placed in a direct-mapped, a set-associative, and a fully-associative cache; triangles show the locations to search for Tag=13 — one entry, one set, or every entry.]
Memory Hierarchies 7–15
A 4-way Set-Associative Cache
[Figure 5.18: a 4-way set-associative cache. The address splits into a 22-bit tag and an 8-bit index; the index selects one of 256 sets, the four tags are compared in parallel against the V/Tag/Data entries, and a 4-to-1 multiplexor selects the hit data. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Memory Hierarchies 7–16
2-way set associative cache
Memory Access:
Dec Binary Hit/miss
20 10100
18 10010
20 10100
18 10010
22 10110
7 00111
22 10110
28 11100
Cache:
Index Tag0 Tag1
00
01
10
11
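For this 2-way cache with four sets, the index is the low two address bits and the tag the high three: 20 = 10100 goes to index 00 with tag 101, and 18 = 10010 to index 10 with tag 100. In the 4-way and fully-associative versions that follow, the index shrinks to one bit and then to none, and the tag grows to match.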
Memory Hierarchies 7–17
4-way set associative cache
Memory Access
Dec Binary Hit/miss
20 10100
18 10010
20 10100
18 10010
22 10110
7 00111
22 10110
28 11100
Cache
Index Tag0 Tag1 Tag2 Tag3
0
1
Memory Hierarchies 7–18
Fully associative cache
Memory Access
Dec Binary Hit/miss
20 10100
18 10010
20 10100
18 10010
22 10110
7 00111
22 10110
28 11100
Cache
Tag0 Tag1 Tag2 Tag3 Tag4 Tag5 Tag6 Tag7
Memory Hierarchies 7–19
Which one to kick out?
•What if all tags are full for an index?
Which one do we replace?
• LRU (Least Recently Used) Schemes
Replace item whose most recent use was furthest in the past
• 2-way set associative example: word accesses
20 (10100), 28 (11100), 20 (10100), 24 (11000)
Index Tag0 Tag1
00
01
10
11
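Working it through: 20 (tag 101) and 28 (tag 111) both land in index 00 and fill the two ways; the second access to 20 hits, leaving 28 least recently used; 24 (11000, tag 110, index 00) then misses and evicts 28.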
•How much hardware to implement LRU?
Memory Hierarchies 7–20
Mixing “Ways” and “Blocks”
• Can have both ways and blocks in a cache
Example: 2-way set associative, 4 word line size
[Table: each index (0000, 0001, 0010, …) has two ways, Tag0 and Tag1, and each way holds a four-word line: Block 00, Block 01, Block 10, Block 11.]
• Caches typically associative with block size > 1
• Block size determines number of block bits
Memory Hierarchies 7–21
Breaking Address into Byte, Block, Index, and Tag
63 … 0
Tag | Index | Block | Byte
•Number of index bits depends on cache size, block size, and
the number of ways.
•A 2^n-way cache of size 2^k bytes, with a block size of 2^b words:
– 2 byte bits (or possibly 3, for 8-byte words)
– b block bits
Vendors use line size ls, which is the sum of the byte and block bits
– Index bits: i = k−2−b−n = k−ls−n
– Tag is the remaining bits: 64−2−b−i = 64−ls−i
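A quick check of the formulas: a 64KB cache (k = 16) that is 2-way (n = 1) with 4-word blocks (b = 2, 4-byte words) has ls = 4, so i = 16−4−1 = 11 index bits and 64−4−11 = 49 tag bits.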
Memory Hierarchies 7–22
Example: 2017 Asus Skylake (2015), i7
Intel i7, 2 cores
Memory Hierarchies 7–23
Example: 2017 Asus Kaby Lake (2016), i7
Intel i7, 4 cores, ROG
Memory Hierarchies 7–24
Example: 2016 Jumper 2
Intel Atom, 4 cores
Memory Hierarchies 7–25
How useful are the ways?
• For small (8 word) caches, fully associative makes sense
• For larger caches, more “ways” means fewer cache misses, but
is the extra hardware worth it?
• Experiment on a 64KB cache, 16-word block:
Associativity Miss Rate
1 10.3%
2 8.6%
4 8.3%
8 8.1%
Memory Hierarchies 7–26
Data cache miss rates
[Figure: data cache miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Memory Hierarchies 7–27
Calculating Average Memory Access Time
•How long to access memory?
AMAT = Time for a hit + Miss rate × Miss penalty
• Example
1ns clock, miss penalty 20cc, miss rate 0.05 per instruction,
cache access time of 1cc
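Plugging in: AMAT = 1 cc + 0.05 × 20 cc = 2 cc, i.e. 2 ns with the 1 ns clock.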
Memory Hierarchies 7–28
Sorting Example
Impact of cache on performance
[Figure 5.19: radix sort versus quicksort for 4 to 4096 K items to sort. Left panel: instructions/item (0–1200); center panel: clock cycles/item (0–2000); right panel: cache misses/item (0–5). ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
• Left: instructions
• Center: run time
• Right: cache misses
Thus, although radix sort executes fewer instructions,
quicksort is faster because it has fewer cache misses
Memory Hierarchies 7–29
Radix Sort
• For each digit, linear pass through data
• If the data is bigger than the cache, no subset of the data remains in the cache between passes
Memory Hierarchies 7–30
Quicksort
[Diagram: quicksort recursion on unsorted data — 1 partition at 4× cache size, then 2 partitions at 2× cache size, then 4 partitions at 1× cache size; at that depth each partition is in cache!]
• Initially, data too large for cache…
• …but at some depth, data fits in cache and remains in cache for
all deeper levels
Memory Hierarchies 7–31
Virtual Memory
• Level of memory hierarchy between main memory and
secondary storage (disks)
•Motivations:
– Sharing memory among multiple programs
– Allowing single user program to exceed size of main
memory
•Different terminology:
– Page: virtual memory block
– Page fault: virtual memory miss
• Idea similar to cache:
– Complete program (instructions and data) stored on disk
– Only keep parts you need in memory
– Use large (4KB+) blocks to reduce costs
Memory Hierarchies 7–32
How to run two programs at once?
• Each program starts at address 0 and goes up to 2^64−4 (64 bit)
•How do we run two programs?
• Time slice: each program gets to run for a while, then another
program runs.
•How do we switch between programs?
Memory Hierarchies 7–33
Idea 1: Copy back/forth to disk
•When changing programs, copy current program onto disk and
read next program from disk
[Diagram: Program 1 in memory (addresses 0 to 2^64); Program 2 on disk.]
Memory Hierarchies 7–34
Idea 1: Copy back/forth to disk
•When changing programs, copy current program onto disk and
read next program from disk
[Diagram: Program 1 being copied from memory out to disk.]
Memory Hierarchies 7–35
Idea 1: Copy back/forth to disk
•When changing programs, copy current program onto disk and
read next program from disk
[Diagram: Program 2 being copied from disk into memory.]
Memory Hierarchies 7–36
Idea 1: Copy back/forth to disk
•When changing programs, copy current program onto disk and
read next program from disk
[Diagram: Program 2 now in memory; Program 1 on disk.]
•At 70MB/s, swapping between two 100MB programs takes
over 2 seconds
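(100 MB out plus 100 MB in is 200 MB moved; 200 MB ÷ 70 MB/s ≈ 2.9 s.)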
Memory Hierarchies 7–37
Idea 2: Keep both programs in memory
•Keep both programs in memory as follows:
– Assume 4GB+ memory
– Assume 2GB program size
– Program 1: 0 to 2^31−4
– Program 2: 2^31 to 2^32−4
[Diagram: Program 1 at addresses 0 to 2^31 and Program 2 from 2^31 up in memory; disk idle.]
• Compiler creates Program 2’s address space to start at 2^31
•When swapping, just start the second program.
• Problem 1: What if the two programs overlap?
• Problem 2: What if both programs need more than 2GB?
Memory Hierarchies 7–38
Idea 3: Keep both programs in memory
•Keep both programs in memory as follows:
– Assume 4GB+ memory
– Assume 2GB program size
– Program 1: 0 to 2^31−4
– Program 2: stored in 2^31 to 2^32−4,
BUT program addresses run from 0 to 2^31−4
[Diagram: both programs compiled for address 0; Program 1 placed at physical 0, Program 2 at physical 2^31.]
•When swapping, just start second program.
• Problem 1: Program 2’s addresses need to be converted to
physical addresses
• Problem 2: What if both programs need more than 2GB?
• Problem 3: Slow to load large programs
Memory Hierarchies 7–39
Idea 4: Virtual Memory
• Like idea 3, but split memory into pages (eg, 4KB pieces).
• Each page of program can be in any page of physical memory.
[Diagram: pages of Programs 1 and 2 scattered through physical memory (e.g. a program page at 4K, another at 24K); full copies of both programs on disk.]
•Use page table to map program address to physical address.
• If program wants more space, give it a new page.
If no pages available, replace “lightly used” page
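A minimal sketch of the mapping in C, assuming 4KB pages and a one-level page table held as a simple array (the names here are illustrative, not from the slides):

#include <stdint.h>

#define PAGE_BITS 12                        /* 4KB pages */
#define PAGE_MASK ((1ULL << PAGE_BITS) - 1)

/* page_table[vpn] holds the physical page number of each virtual page */
uint64_t translate(uint64_t vaddr, const uint64_t *page_table) {
    uint64_t vpn    = vaddr >> PAGE_BITS;   /* virtual page number */
    uint64_t offset = vaddr & PAGE_MASK;    /* offset passes through unchanged */
    uint64_t ppn    = page_table[vpn];      /* assumes the page is resident */
    return (ppn << PAGE_BITS) | offset;     /* physical address */
}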
Memory Hierarchies 7–40
Virtual Memory Mappings
[Figure 5.25: virtual addresses map through address translation to physical addresses or to disk addresses. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
[Figure 5.26: a 64-bit virtual address splits into a virtual page number (bits 63–12) and a page offset (bits 11–0); translation replaces the virtual page number with a physical page number and leaves the offset unchanged. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Memory Hierarchies 7–41
Key Decisions for Virtual Memory
•Misses very expensive (millions of cycles)
• Pages are larger (1KB to 4KB up to 64KB)
• Fully-associative schemes pay off, though full search too
expensive
• Page faults handled in software (OS)
•Write-back used (write-through too expensive)
Memory Hierarchies 7–42
Page Replacement Schemes
When a page needs replacement, we must decide which to
replace.
• LRU (Least Recently Used)
• LFU (Least Frequently Used)
• FIFO (First In, First Out)
• Random
Memory Hierarchies 7–43
Page Tables
[Figure 5.27: the page table register points to the page table in memory; the virtual page number indexes the table, whose entries hold a valid bit and a physical page number (if the valid bit is 0, the page is not present in memory); the physical page number is concatenated with the 12-bit page offset to form the physical address. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Memory Hierarchies 7–44
Page Table Example
4KB pages

Virtual Address                             Physical Address
0000 … 0000 0000 0000 0000 0100 1000 1111
0000 … 0000 0000 0000 0001 0000 0000 0001
0000 … 0000 0000 0000 0010 0111 1000 1010

Page Table
Virtual Page                   V  Physical Address
0000 … 0000 0000 0000 0000    1  01 0000 0001 0000 0101
0000 … 0000 0000 0000 0001    0  11 0000 1010 0000 0000
0000 … 0000 0000 0000 0010    1  00 0000 0000 0000 0001
0000 … 0000 0000 0000 0011    1  00 0000 0000 0000 0011
…                              …
Memory Hierarchies 7–45
More on Page Tables
• Page replacement policies:
– Optimal: replace page used farthest into the future
– Good approximation: LRU (Least Recently Used)
– Too expensive to implement exactly, use approximation (eg
reference bits)
• Page tables are large (contain mappings for every virtual page)
– Multi-level page tables reduce space needed
– Page tables can themselves be paged
•VM must use writeback (called copyback here)
– Dirty bit
Memory Hierarchies 7–46
Data Initially On Disk
[Diagram: PIC sits on disk; the page table is empty; nothing in RAM or swap.]
Memory Hierarchies 7–47
Program Reads File Into Memory
[Diagram: PIC is read from disk into a RAM page; the page table entry points at it with dirty bit D = 0.]
Also copied into swap space
Memory Hierarchies 7–48
Picture Is Modified In Memory
[Diagram: the RAM copy is modified to PIC'; the page table entry now has D = 1.]
Dirty bit set to 1, swap copy differs from memory copy
Memory Hierarchies 7–49
If Picture Page In Memory Is Swapped Out…
[Diagram: the modified page PIC' is copied from RAM back to swap; the page table entry is cleared to D = 0.]
Since dirty bit is 1, have to copy to reuse page in memory
Memory Hierarchies 7–50
Picture only in swap, not in memory
[Diagram: PIC' now exists only in swap; the RAM page is free; the original PIC is still on disk.]
Memory Hierarchies 7–51
Picture Read Into Memory When Needed
[Diagram: PIC' is copied from swap back into RAM on demand; the page table entry has D = 0.]
Memory Hierarchies 7–52
A Problem With Virtual Memory
• Convert virtual address to physical address:
look up virtual address in Page Table
• Page Table stored in memory
• To do LDUR A, we have to
– Look up virtual address A in Page Table to get physical
address A’
– Read A’ from physical memory
With virtual memory, LDUR requires two memory accesses
(four memory accesses if the LDUR instruction itself is not in the cache)
Memory Hierarchies 7–53
Translation Lookaside Buffers
[Figure 5.30: the TLB is a small cache of page table entries. Each TLB entry holds a tag (virtual page number) plus valid, dirty, and reference bits and a physical page address; the full page table in physical memory maps every virtual page number to a physical page or disk address. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
A cache for page table entries
Memory Hierarchies 7–54
Virtual Memory: Two Memory Accesses
Virtual Address → Page Table (in RAM) → Physical Address → RAM
Reading memory from a virtual address requires two memory
accesses
Memory Hierarchies 7–55
Virtual Memory: TLB and Cache
Virtual Address → TLB (falling back to the Page Table in RAM) → Physical Address → Cache (falling back to RAM)
Reading from the TLB and cache avoids reading from memory
Memory Hierarchies 7–56
Virtual Memory: The CPU
[Diagram: the CPU’s register file and control issue a virtual address to the TLB; the resulting physical address goes to the cache, with the page table and RAM behind them.]
Memory Hierarchies 7–57
Example (X1=0x1000, X2=0x2000, X6=0x6000, X8=0x8000)
Free pages: 0 1 2 3 4 5 6 7
PT:                      TLB (4 lines):
Index:  V D R PG         V D R TAG PG
00000: 0
00001:
00002: 1
00003:
00004: 2
00005:
00006: 3
00007:
00008:
… …
Code (0):
0x00000100: ADD X11,X2,X3
0x00000104: LDUR X10,[X6,0x000]
0x00000108: LDUR X10,[X6,0x008]
0x0000010c: STUR X10,[X8,0x000]
0x00000110: LDUR X10,[X2,0x000]
0x00000114: LDUR X10,[X1,0x000]
0x00000118: LDUR X10,[X6,0x008]
Memory Hierarchies 7–58
VM and Multitasking
•Need two modes: user process or operating system process
•OS processes are privileged (can do more)
– Write page table register
– Modify page tables or TLB
•Need method of switching modes
– User process to OS: System call or exception
– OS to user process: return-from-exception
• To swap processes, OS rewrites page table register and flushes
TLB
Memory Hierarchies 7–59
Memory Hierarchy
[Diagram: CPU at the top, then Level 1, Level 2, …, Level n; size of memory and distance from the CPU grow with level.]
Things at lower levels in hierarchy also in all higher levels:
•Data: L1 Cache, L2 Cache, L3 Cache, Memory, Disk
•Address: TLB, Page Table, Disk
When executing an instruction, two hierarchies active
• convert virtual address to physical address
• load instruction from memory
(load word does this twice)
Memory Hierarchies 7–60
• Example (4K pages):
LDUR X1,[X2,#20] — where 20+X2 = 0x2000
– Need to convert virtual page 2 to a physical page
TLB? Page Table?
– Need to load the data at M[0x2000] into X1
Cache? RAM?
•Not all combinations possible
TLB PT RAM* Cache Disk Possible? # Disk Reads # Mem Access
HIT HIT HIT HIT HIT Possible 0 0
Miss Miss Miss Miss HIT
HIT Miss Miss Miss HIT
Miss Miss HIT HIT HIT
Miss HIT Miss Miss HIT
Miss HIT HIT Miss HIT
Miss HIT HIT HIT HIT
Memory Hierarchies 7–61
Virtual Memory: Summary
Problem:
• Program address space is larger than physical memory
2^64 vs 2^34
Program (compiler) layout in memory uses high memory
addresses
•Multiple programs share memory
Solution:
•Map program addresses (virtual addresses) to RAM addresses
(physical addresses)
•Map blocks (pages) [4KB ⇒ 12-bit sub-address]
• PAGE TABLE CONTAINS THIS MAPPING
• Store copy of what’s in RAM on disk (swap space)
•Use reference bits (occasionally cleared) to approximate LRU
Memory Hierarchies 7–62
Virtual Memory: Making it faster
• Problem with virtual memory: doubles memory accesses
Page table stored in RAM
• Solution: TLB–a cache for the page table
Flowchart (VP|PO ⇒ PP|PO, virtual ⇒ physical):
Is VP in TLB?
  Yes: PP = TLB{VP}
  No:  Is VP in PT?
         Yes: TLB{VP} = PT[VP]
         No (VM): PP = new page; RAM[PP] = Disk Read; PT[VP] = PP
Is PP|PO in Cache?
  Yes: use CACHE{PP|PO}
  No (CACHE): CACHE{PP|PO} = M[PP|PO]
PO: Page Offset   PP: Physical Page   VP: Virtual Page
[]: Direct Mapped   {}: Associative
Input/Output 8–1
Input/Output
• Readings: Chapter 1.4, 5.2.
•Our interests:
– Concepts rather than numerical details
– Demands on CPU
– Connection of peripherals to CPU
Input/Output 8–2
Typical Organization of I/O Devices
[Figure: the processor with its cache sits on the memory–I/O bus, together with main memory and I/O controllers for disks, graphics output, and the network; devices signal the processor via interrupts. Typical Organization of Bus. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Input/Output 8–3
Demands of I/O Systems
• Supercomputers:
– Single large read at start of batch job
– “Checkpointing” saves during run of job
• Transaction processing:
– Many small changes to large database
– Need good throughput but also reliability
•Unix file systems:
– Mostly accesses to small files
– More reading than writing
• Benchmarks have been developed for all three areas
Input/Output 8–4
Types and Characteristics of I/O Devices
• Characteristics:
– Behaviour: input, output, storage
– Partner: human or machine
– Data rate
• Some types:
– Keyboard: input device, human partner, data rate 10 bits/sec
(15 WPM)
– Graphics display: output device, human partner,
data rate 60 Mb/sec to 500 MB/sec
– Disk: storage device, machine partner,
data rate 100 Kb/sec to 1500 MB/sec
Input/Output 8–5
Keyboards and Mice
•Keyboards slow enough that each keystroke can generate
interrupt serviced by CPU
•Alternate arrangement: bytes generated by keystrokes can be
put into small buffer that is checked periodically by CPU
•We say the CPU polls the device
•A mouse needs to be polled more often to ensure smooth
animation of cursor on screen
•Mouse generates pulses indicating small movements in four
directions (up, down, left, right)
Input/Output 8–6
Magnetic Disks
[Figure: a disk has several platters; each platter is divided into concentric tracks, and each track into sectors. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Typically 1-4 platters, 10,000-50,000 tracks, 100-500 sectors,
Pre 2011: 512 bytes per sector. Since 2011: 4KB per sector.
Input/Output 8–7
Inside a Magnetic Disk
Disk drive image
Input/Output 8–8
Characteristics of Magnetic Disks
• Seek time: time to position read/write heads over desired track
• Rotational latency: must wait for correct sector to rotate under
head
• In practice these two depend highly on mix of requests
• Transfer time: time to read block of bits (typically sector)
• Can add cache to lower these times
• RAID: Redundant Array of Inexpensive Disks
Use redundancy and coding to improve throughput and
reliability
Input/Output 8–9
RAID—Redundant Arrays of Inexpensive Disks
[Diagram: RAID 0 stripes blocks 1–9 across the disks; RAID 1 mirrors each disk (AA BB CC); RAID 3 and RAID 4 add a dedicated parity disk (Σ); RAID 5 distributes parity across the disks; RAID 6 keeps two parity blocks per stripe. Blue disks mark redundant information.]
Input/Output 8–10
Flash Memory/SSD
•HDD may be reaching limits
15,000 RPM available; 20,000 RPM has problems
• Flash memory similar to DRAM:
put voltage on line and test, but doesn’t forget
• Four types of cells:
SLC: 1 bit/cell, 100K writes
MLC: 2 bits/cell, 10K writes
TLC: 3 bits/cell, 3K writes
QLC: 4 bits/cell, 1K writes
Input/Output 8–11
Solid State Drive (SSD)
• SSD is groups of flash cells
Cells are grouped into pages (4KB)
Pages grouped into blocks (512KB)
• Read and write into page
• Erase a block (128 pages in a block)
• Speed comes from parallelism
Flash memory: 20MB/s
SSD: 10 in parallel: 200MB/s
Input/Output 8–12
SSD Issues
• Characteristics:
– Can read and write pages
– Can ONLY erase BLOCKS
– Can NOT overwrite page:
have to erase first
• Result: Controller writes to every page before erasing
Once “full”, a 4KB page write requires
– Read 512KB block into temp
– Erase 512KB block
– Write back valid pages
• Result: SSD fast when new, but later is slow (to write)
OS writes lots of temp files, so OS appears to slow down
(OS has properly handled this since Windows 7.)
Input/Output 8–13
SSD Write Example
Starting with 123 pages full and 5 empty…
Write 8KB text file → write two 4KB pages
Write 8KB image → write two 4KB pages
Delete text file → mark two pages invalid
Write 8KB image → copy block to temp, erase block,
copy valid pages back, write two 4KB pages
[Diagram: the pages holding “CS 251 is a great course and the prof is great, too!” are marked INVALID after the delete; the 123 full pages are copied back at each block erase.]
Input/Output 8–14
Comparison (Fall 2011 vs Fall 2013 vs Fall 2016)
         2011               2013              2016
      Size   Price $/TB   Size  Price $/TB  Size  Price $/TB
HDD   3TB    $180  $60    4TB   $200  $50   8TB   $320  $40
SSD   250GB  $400  $1600  1TB   $570  $570  2TB   $650  $325
•Observations:
– SSDs increasing in capacity quicker than HDDs
– SSDs price per unit storage dropping faster than HDDs
• Speed:
SSDs range from 100 times slower to 10 times faster
(newer SSDs may no longer be slower than HDDs)
•Hybrid models
HDD with 100GB SSD used as cache: HHD
• SSDs are appearing in large data centers (Amazon, Facebook,
Dropbox)
Input/Output 8–15
Networks
• Ethernet: one-wire bus, 10/100/1000Mbit/sec transfer rate, no
central control
•Nodes listen for silence, then transmit
• Collisions resolved by random backoff
• Basis for most LANs (local area networks) when combined
with some switching
• Long-haul networks: distances of 10 to 10,000 km
•Usually packet-switched: information broken up, reassembled
• Protocols for reliable delivery (TCP) and addressing (IP)
Input/Output 8–16
Buses
[Figure 8.07 (Patterson): three bus organizations connecting processor, memory, and disks — from a single shared bus carrying control and data lines (a) to separate processor–memory and I/O buses (b, c). ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
A typical output operation (defined from processor point of view)
Input/Output 8–17
Bus Parameters
• Buses are typical bottleneck for I/O throughput
• Limitations on speed: length, number of devices
• Types: processor-memory, I/O, backplane
• Communication schemes: synchronous, asynchronous
– Synchronous have clock connected to control lines, fixed
protocols
Fast, but require similar devices and must be short
– Asynchronous buses require “handshaking” communication
protocol
Handled as two communicating FSMs
Input/Output 8–18
Bus Arbitration
•Who decides who gets access to a shared bus?
• Simplest scheme: the processor
• Better: have multiple bus masters
• Choosing among bus masters is bus arbitration
•Need fairness but also scheme of priorities
• Typical schemes: daisy-chain, centralized, token-passing,
self-selection, collision detection (as in Ethernet)
Input/Output 8–19
Interfacing with Processor and Memory
• Two schemes for processor I/O: memory-mapped, special
instructions
• Polling versus interrupt-driven I/O
•Usually need to have different levels of interrupts depending
on priority
•DMA: direct memory access (device controller transfers data
right into memory without processor involvement)
• Creates problems with caches or VM
Input/Output 8–20
Example: Pentium Bus
[Diagram: two CPUs with caches on the frontside bus to the North Bridge, which also connects RAM and the graphics card; the South Bridge hangs off the PCI bus and fans out to the USB bus (keyboard, mouse), the IDE bus (hard disk, DVD/CDROM), the Firewire bus, the ISA bus (audio), the Ethernet interface, and the SCSI interface.]
Starting Your Computer 9–1
What happens when you turn on your computer?
SoC: System on Chip
[Diagram: the SoC contains the CPU, RAM, ROM, Ethernet, and USB on a high-speed bus; a bridge links this to the peripheral bus; the reset line from the power supply fans out to the CPU and devices.]
•High speed synchronous bus (100MHz) connecting CPU,
RAM, ROM, Ethernet, USB, Bridge
• Bridge connects to peripheral bus: Asynchronous, 50MHz,
hand shaking protocol
Starting Your Computer 9–2
Power On
• Power to SoC: 30ms to get stable
• Clock needs time to get started
Power supply generates reset for 0.5 seconds
• Reset
Reset from power supply goes to CPU
• CPU sends reset signal to other devices on SoC
Except RAM (for debugging)
•On Reset:
– PC ← 0x0000 0000
PC Address aliased to ROM (not RAM)
Starting Your Computer 9–3
PC starting at 0
Executing ROM code (SoC boot code):
1. Disable watchdog
Watchdog: reset processor if it reaches end of count
2. Configure CPU
• turn off cache
• turn off memory manager
• turn off branch predictor
ie, get to known state
3. Configure memory controller
4. Load code into static RAM (flash, bus)
5. Clear boot mode
Execute from RAM rather than ROM
6. Start executing flash boot code loaded in 4th step
Starting Your Computer 9–4
• Flash boot code
Does all of the above again
(board manufacturer knows what’s on board vs SoC)
•Once configured for board, OS loaded and OS execution
begins
Multiprocessors 10–1
Multiprocessors
• Readings: Chapter 6
•Motivations:
– Increased throughput
– Scalability
– Fault tolerance
• Issues:
– How to share data among processors?
– How to coordinate processing?
– What is the optimal number of processors?
• Terminology:
Rather than “multiprocessor”, the term multicore is used
Multiprocessors 10–2
Parallelism is Easy!
•Modern uniprocessor (single core) CPUs are parallel
computers
Pipelining
• Pipelining gives important but limited benefits
•Want to do better:
break task into multiple pieces,
run on multiple cores
Multiprocessors 10–3
Parallelism is Hard!
•How to program multicore machines?
• Large gains needed to justify expense
Book example: to get a factor of 90 speedup with 100 cores,
the program must be “99.9%” parallelizable (checked below)
• To achieve large gains, must balance load
If one processor does more work, rest are idle
• Programming language constructs
Synchronization, data sharing, data locking, …
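A quick Amdahl’s Law check of the book example: with parallelizable fraction p = 0.999 on 100 cores, speedup = 1/((1−p) + p/100) = 1/(0.001 + 0.00999) ≈ 91.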
Multiprocessors 10–4
Hardware Multithreading
• Thread: process of execution
– Potentially, one program can have multiple threads of
execution (parallelism)
– On single core, “swap” between execution of threads
– On multi-core, one process per core (“swap” if needed).
•Hardware multithreading
– Multiple threads run on a single core
– Idea is that when one thread stalls, run another thread
• Fine-grained multithreading: switch threads each instruction
• Coarse-grained multithreading: switch threads on long stalls
• Bottom line: hardware multithreading reduces idle time of
processor
Multiprocessors 10–5
Some Parallelism is Easy
• Instruction/Data classifications:
– SISD—Single Instruction, Single Data
Pentium 4
– SIMD—Single Instruction, Multiple Data
MMX, SSE, SSE2
– (MISD—Multiple Instruction, Single Data)
– MIMD—Multiple Instruction, Multiple Data
“algorithmic” parallelism
Multiprocessors 10–6
SIMD
Single Instruction, Multiple Data
• Example:
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;
vs
movaps xmm0, [v1]
addps xmm0, [v2]
movaps [vec_res], xmm0
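The SSE version works because movaps/addps treat xmm0 as four packed single-precision floats, so the four scalar additions collapse into one vector add.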
•Works best on data-level parallelism
Data independent from one another
•Works best with arrays in for loops
Numerical computations (matrix multiply)
•MMX, SSE, AVX, …
Multiprocessors 10–7
GPU: an extreme example of SIMD
•GPU is high-performance parallel computer
• Intended for graphics, so no need to support general
computing
• Relies on hardware multithreading rather than caches to
handle memory latency
Execute other threads while waiting for memory
• Relies on memory bandwidth rather than latency
• Intended for graphics, but support for non-graphics (CUDA)
Multiprocessors 10–8
Example: y=ax+y
Computing y = ax + y with a serial loop:
void saxpy_serial(int n, float alpha, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
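Each iteration writes only its own y[i], so the loop is data-parallel. As a sketch (not from the original slide), the same computation with OpenMP, which splits the iterations across cores (compile with -fopenmp):

#include <omp.h>

void saxpy_parallel(int n, float alpha, float *x, float *y)
{
    /* iterations are independent, so they may run in any order */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}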
Multiprocessors 10–8
Multiprocessor Parallelism
What’s required to run a program on multiple cores?
• Compiler support
Compiler recognizes code that can run in parallel (SIMD)
• Programming language support:
Threads, locks, communication
•Hardware support:
Shared memory access
Caches
Networks, Message passing
Multiprocessors 10–9
Communication: Shared-Memory vs. Message Passing
• Shared-memory view: processors have a single common
address space
– Uniform Memory Access: any reference by any processor
takes same time
– Non-Uniform Memory Access: each processor has “nearby”
and “farther away” memory
– UMA easier to program, NUMA more realistic and scalable
– Both require methods of synchronization
•Message-passing view: processors have private memories,
communicate via explicit messages
– Synchronization is more natural
– More suited to network view
Multiprocessors 10–10
Message Passing Multiprocessors
• Some large-scale computers with high-performance message
passing networks
Fast, but expensive
• Some concurrent applications have task-level parallelism that
requires little communication
– Web search, mail servers, file servers
– Clusters: an example of message-passing parallel computers
– Separate OS on each computer, separate memory
– Separate memories, etc., leads to increased system
dependability, expandability
•Warehouse-scale computing
– Internet services may require 50,000–100,000 computers
Require special building, power, cooling
– Economy of scale ⇒ cloud computing
Multiprocessors 10–11
Organization: Bus-Connected vs Network-Connected
[Diagram, left: bus-connected — several processors, each with a cache, share a single bus to one memory and I/O. Right: network-connected — each processor has its own cache and memory, and the nodes communicate over a network.]
• Big issue: maintaining data consistency (called cache
coherency/consistency)
•Data read by several processors will show up in several caches
•What happens when that data is changed?
Multiprocessors 10–12
Multiprocessor Cache Coherency: Snooping
• Idea: information to maintain coherency is available on bus
• Each processor’s cache monitors bus
• Simplest idea: write-update (or write-broadcast)
– When a processor writes a block, all other caches containing
that block are updated with new value
– Increases bus traffic, limits scalability, but reduces latency
• Better idea: write-invalidate
– When a processor writes a block, it first issues an
invalidation signal
– Other caches invalidate their copies of that block
– Subsequent writes by same processor result in no bus traffic
– Subsequent reads by other processors are slower
•Variations: build in exclusivity (avoiding need for invalidation
broadcast), transferring data between caches directly
Multiprocessors 10–13
Network Topologies
a. 2D grid or mesh of 16 nodes   b. n-cube tree of 8 nodes (8 = 2^3, so n = 3)
Figure 6.14. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.
Variations: cylinder, 3D torus, mesh of trees; cube-connected
cycles
Multiprocessors 10–14
More Network Topologies
[Figure 6.15: (a) a crossbar connecting P0–P7; (b) an Omega network connecting P0–P7 through stages of switch boxes; (c) an Omega network switch box with inputs A, B and outputs C, D. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Variations: butterfly, shuffle-exchange, Beneš network
Multiprocessors 10–15
Coherency in Networked Multiprocessors
[Diagram: processors with caches and memories on a network; each memory has a directory that tracks sharing. Panel (a) maintains cache-level coherency; panel (b) maintains memory-level coherency. Memory coherency. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Multiprocessors 10–16
The Evolution-Revolution Spectrum
[Figure: the evolution–revolution spectrum. Along it sit microprogramming, pipelining, cache, timeshared multiprocessor, virtual memory, RISC, CC-UMA multiprocessor, CC-NUMA multiprocessor, Not-CC-NUMA multiprocessor, message-passing multiprocessor, massive SIMD, and parallel processing multiprocessor, running from Evolutionary on the left to Revolutionary on the right.]
Evolution vs Revolution. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.
• Items further right require more effort from programmers/users
• Software, not hardware, is current bottleneck
Other Architecture Ideas 11–1
Overview
Many other ideas in architecture.
We will look at a few:
•Microprogramming
VAX
CISC vs RISC
• SPARC Ideas
Register window
32-bit jump command
Other Architecture Ideas 11–2
Microprogramming
• Idea: control unit for datapath runs program for each assembly
instruction
•Multicycle datapath so far too simple for microprogramming
– Microprogram needs scratch registers
– Microprogram needs conditional branches
•Need to extend datapath, modify to give us extra power
– Extra registers in register file
– Control lines to specify read/write registers
– Control lines to specify constants
– Status bits sent to control
Other Architecture Ideas 11–3
VAX: A microprogrammed architecture
•VAX is an example of a microprogrammed architecture
Some instructions ran faster than others
•Goals
– Direct support for high-level language constructs
(procedure calls, multi-way branching, loop control, array
subscript calculation)
– Minimize code space to save memory
•Variable length instruction encoding (from 1 to 54 bytes)
• Sixteen 32-bit registers, R0 through R15
• Some are special (eg R15 is PC, R14 is stack pointer)
• Large number of instructions (304) and data types
Other Architecture Ideas 11–4
VAX instruction encoding
• First byte is opcode (or first two)
•Opcode implies number of operands, data type
• For each operand, one byte with addressing mode and register
name (4 bits each)
•More information may follow if needed (e.g. constants)
• Example: ADDL3 R1, 737(R2), (R3)[R4]
7 bytes of storage
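One plausible accounting of the 7 bytes (not spelled out on the slide): 1 opcode byte; 1 operand-specifier byte for R1; 1 specifier byte plus a 2-byte word displacement for 737(R2); and an index byte for [R4] plus a register-deferred specifier byte for (R3).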
Other Architecture Ideas 11–5
Examples of complex VAX instructions
•MOVC3: copy a string
• SCANC: scan for character
•AOBLEQ: add one and branch on less than or equal
• INSQUE: insert into queue
• CRC: calculate cyclic redundancy check
• POLY: evaluate polynomial
Other Architecture Ideas 11–6
VAX/ARM instruction set comparison
• Task: copy fifty words of data from one array to another
• Pseudo-code:
for i=0 to 49 do b[i] = a[i];
• Can write programs in each type of assembler, but comparison
requires some assumptions
• Suppose the base address of array a is in R3 (VAX) / X13 (ARM),
and the base address of array b is in R4 (VAX) / X14 (ARM)
Other Architecture Ideas 11–7
ARM code
ADDI X2, X31, #50 # put 50 into X2
ADDI X3, X13, #0 # copy X13 into X3
ADDI X4, X14, #0 # copy X14 into X4
loop: LDUR X1, [X3,#0] # get word pointed to by X3
STUR X1, [X4,#0] # put word where pointed by X4
ADDI X3, X3, #8 # increment X3
ADDI X4, X4, #8 # increment X4
SUBI X2, X2, #1 # decrement count
CBNZ X2, loop # loop if count not zero
Other Architecture Ideas 11–8
VAX code
CLRL R2 ; R2 is index
LOOP: MOVL (R3)[R2], (R4)[R2]; (note scaling)
AOBLSS #50, R2, LOOP ; add one, branch on <50
Other Architecture Ideas 11–9
Analysis of ARM code
•Use pipelined implementation, with all forwarding possible
for data hazards, including load-use
• Branch hazards require stall on success
• Code rearrangement to get rid of load/store stall and branch
stall
•One change: memory can be slow, whole machine stalls until
memory access completes
• Let C be cycle time, M be memory access stall time
Other Architecture Ideas 11–10
Analysis of VAX code
•Assume multicycle implementation similar to microcoded
ARM implementation done in class
•No pipelining
• CLRL, AOBLSS take 4 cycles; AOBLSS is 2-word instruction
•MOVL takes 6 cycles because of operand address calculations
•Above assumptions are reasonable but do not correspond
exactly to reality
– In reality, VAX instructions are fetched by the byte, and the
branch would be an 8-bit displacement
– VAX implementations pipeline the microcode, so the
operand fetches could overlap to some extent
Other Architecture Ideas 11–11
Comparison of ARM/VAX running times
• Reasonable assumption: 2 ns cycle time, 10ns memory stall
time
• If no cache, memory time dominates, VAX wins
• If everything in cache, no memory stalls, ARM wins
• If cache is initially empty but large enough to contain
everything, ARM wins since it has better compute time
Other Architecture Ideas 11–12
SPARC: A RISC Example
•Goals
– Support object-oriented languages
– Efficient trapping and subroutine call-return
– RISC philosophy
• Fixed instruction size: 32 bits
• Small number of instruction formats (5 in V8)
• Load-store architecture, delayed branches
• 32 single-precision floating-point registers/16 double-precision
• 40 to 520 integer registers, BUT only 32 accessible at a time
Other Architecture Ideas 11–13
Registers
• 32 registers visible at one time: %r0 to %r31
• Four 8-register banks: inputs, outputs, globals, and locals
•Globals %g0 to %g7 correspond to %r0 to %r7
• Register %r0/%g0 always has value 0
•Outputs %o0 to %o7 correspond to %r8 to %r15
• Locals %l0 to %l7 correspond to %r16 to %r23
• Inputs %i0 to %i7 correspond to %r24 to %r31
Other Architecture Ideas 11–14
Registers
Figure 1—General-Purpose Registers (Nonprivileged View)
Programming Note:
The alternate global registers are present to give trap handlers a set of scratch registers that are inde-
pendent of nonprivileged software’s registers. The AG bit in PSTATE allows supervisor software to
access the normal global registers if required (for example, during instruction emulation).
5.1.1.2 Windowed r Registers
At any time, an instruction can access the 8 globals and a 24-register window into the r
registers. A register window comprises the 8 in and 8 local registers of a particular register
set, together with the 8 in registers of an adjacent register set, which are addressable from
the current window as out registers. See figure 2 and table 6.
[Figure 1: the 32 visible registers — ins i0–i7 = r[24]–r[31], locals l0–l7 = r[16]–r[23], outs o0–o7 = r[8]–r[15], globals g0–g7 = r[0]–r[7].]
Other Architecture Ideas 11–15
Register Window
• 32 registers visible at one time: %r0 to %r31
• 40 to 520 registers in V8 implementation
• Inputs, Locals, and Outputs actually a 24-register window into
master register file
•When call a subroutine, can shift register window forward so
outputs become inputs, and get a new set of locals and outputs
•When return from a subroutine, can shift register window back
• Register “stack” managed as a circular buffer
•When it overflows, trap to O/S to save off registers to the
stack in memory
•More efficient to save a lot of registers at once, and if have a
lot of call-returns can avoid many register saves
Other Architecture Ideas 11–16
Register Windows
Figure 2—Three Overlapping Windows and the Eight Global Registers
The number of windows or register sets, NWINDOWS, is implementation-dependent and
ranges from 3 to 32 (impl. dep. #2). The total number of r registers in a given implementa-
tion is 8 (for the globals), plus 8 (for the alternate globals), plus the number of sets times
16 registers/set. Thus, the minimum number of r registers is 64 (3 sets plus the 16 globals and alternate globals).
[Figure 2: three overlapping windows — Window (CWP − 1), Window (CWP), and Window (CWP + 1) — each with ins r[24]–r[31], locals r[16]–r[23], and outs r[8]–r[15]; one window’s outs are the next window’s ins. Globals r[1]–r[7] are shared, and r[0] is always 0.]
Other Architecture Ideas 11–17
Register Window Management
• Register “stack” is managed as a circular array
• Each time we push or pop register file, modify current window
index by 16
•When it overflows, trap to O/S to save off registers to the
stack in memory
•More efficient to save a lot of registers at once
• Typically can avoid many register saves altogether
•Gives more efficient subroutine calls and interrupt/trap
processing
•Good for real-time systems and certain programming
languages (FORTH, Smalltalk, LISP, C++)
Other Architecture Ideas 11–18
SPARC Instruction Format
Format 1 (op=1), CALL: op (2 bits) | disp30 (30 bits)
Format 2 (op=0), SETHI and branches: op | rd | op2 | imm22, or op | a | cond | op2 | disp22
Format 3 (op=2 or 3), arithmetic, logical, load, store: op | rd | op3 | rs1 | i=1 | simm13, or op | rd | op3 | rs1 | i=0 | opf | rs2
Other Architecture Ideas 11–19
ARM Instruction Set
•ARM instruction set far bigger than done in class
•Number of instructions:
[Figure 2.41. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
•Will highlight some interesting “extras”
Other Architecture Ideas 11–20
Registers
•Many registers have special purposes
Some are just conventions.
[Figure 2.15. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Other Architecture Ideas 11–21
Status Bits
•ARM instructions can set and test condition codes
• Four condition codes:
Negative (N), Zero (Z), Overflow (V), Carry (C)
• B instruction suffixed with condition
[Figure 2.10. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Becomes a conditional branch
Other Architecture Ideas 11–22
Shifty Arithmetic Operations
• R-format instructions optionally set the status bits
ADD vs ADDS
• Second register of R-format can be shifted
opcode (11 bits) | Rm (5 bits) | shamt (6 bits) | Rn (5 bits) | Rd (5 bits)
ADD X1,X2,X3<<3
x86 Instruction Set Architecture 12–1
Intel Pentium: Basic Instruction Set
The instruction set is very large
• 48 move/data instructions
• 12 binary arithmetic instructions
ADD, ADC (add w/carry), SUB, MUL, DIV, INC, CMP
• 6 decimal arithmetic instructions
• 4 logic instructions (AND, OR, XOR, NOT)
• 10 shift/rotate instructions
• 48 control transfer instructions
x86 Instruction Set Architecture 12–2
Instruction Set (cont)
• 37 bit/byte instructions
Testing bits/bytes
• 57 string instructions
• 13 flag control instructions
Set/clear conditional flags used by branches
• 5 segment register instructions
• 6 misc instructions (including NOP)
Additional floating point and SIMD (SSE) instructions
x86 Instruction Set Architecture 12–3
Addressing Modes
• 16-bit, 32-bit (mostly 32 bit)
• Segment-relative addresses
SEGMENT + BASE + (INDEX*SCALE) + DISPLACEMENT
Mode Example
Register MOV DS,AX
Immediate MOV AX,1000H
Direct Addressing MOV [2000H],AX
Register Indirect MOV DL, [SI]
Base Addressing MOV AX,[BX+4]
Indexed Addressing MOV [DI-8],BL
Based Indexed Addressing MOV [BP][SI],AH
Based Indexed with Displacement MOV CL,[BX+DI+2040H]
Pentium 13–1
IA-32 Evolution
32-bit Pentium evolution starts with 80386
• 1985: 80386
32-bit addressing
• 1989: 80486
– Internal cache,
– Instruction pipeline,
– Integrated math coprocessor
• 1993: Pentium
– Superscalar microarchitecture
executes multiple instructions in parallel;
– Split internal cache (L1)
separate instruction and data caches
Pentium 13–2
• 1995: Pentium Pro
– Backside level 2 cache (L2)
– Convert complex instructions into micro-ops
micro-ops execute on RISC
– Longer pipeline
Pentium 13–3
[Figure: Pentium Pro microarchitecture — the instruction cache feeds instruction prefetch and decode with branch prediction; decoded micro-ops enter a RISC-operation queue, then dispatch and register renaming; an integer and floating-point operation queue and a load/store queue feed the register file, integer ALUs, a multiplier, floating-point adder/SSE, multiplier/SSE, and misc units, and the data cache; a commit unit retires results.]
Pentium 13–4
• 1997: Pentium MMX, Pentium II
– MMX instructions (integer SIMD instructions)
• 1999: Pentium III
– SSE: SIMD single precision floats
Useful for 3D graphics
– 10 stage pipeline
Pentium 13–5
• 2000: Pentium 4
– SSE2: double precision floats, quadword ints, 128-bit ints
– 20 stage pipeline
– Branch prediction (4K branch predictor table)
– Looks ahead 126 instructions to find parallel tasks
– 12K Instruction cache
stores micro-ops instead of instructions
– L1: 8-KB, 4-way set associative, 64-byte lines
– L2: 256KB (512KB), 8-way set associative, 128-byte lines
– ALU executes common integer ops in one-half clock cycle
[Figure: Pentium 4 pipeline — trace cache access (including the predictor), reorder-buffer allocation + register renaming, micro-operation queue, scheduling + dispatch into functional-unit queues, register file access, execution, and commit via the reorder buffer, with stage lengths of 1 to 5 clock cycles.]
Pentium 13–6
• 2003,4: AMD, Intel announce extensions to 64 bit
architectures
SSE3 (graphics, video, thread synchronization support).
i7 14–1
i7
• x86 complex, so i7 converts to micro-operations.
• Register renaming maps 16 architectural registers to larger
number of physical registers
[Figure. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
i7 14–2
Performance on Benchmarks
0.25 CPI ideal
[Figure. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
i7 14–3
Branch Misprediction on Benchmarks
[Figure. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
Conclusions 15–1
Where are we now?
• Power wall
[Figure, similar to Figure 1.16: clock rate (MHz, 1 to 10,000, log scale) and power (watts, 0 to 120) of Intel x86 chips — 80286 (1982), 80386 (1985), 80486 (1989), Pentium (1993), Pentium Pro (1997), Pentium 4 Willamette (2001), Pentium 4 Prescott (2004), Core 2 Kentsfield (2007), Core i5 Clarkdale (2010), Core i5 Ivy Bridge (2012). Clock rates rise from 12.5 MHz to a peak of 3600 MHz and then level off around 3300–3400 MHz; power rises from 3.3 W to a peak of 103 W and then falls back to 77–95 W. ©2017 Elsevier. Reproduced with permission from Computer Organization and Design, ARM edition.]
• Parallelism
GPUs
• Smaller is better (for home market)
•Network applications
Conclusions 15–2
Take Aways
•Understanding hardware can improve compilers
code rearrangement, loop unrolling
•Understanding hardware/OS can improve code
Cache, virtual memory, paging
• Introduction to material you’ll see later
– Floating point
– OS issues
– Data structures
– Interrupts, exceptions
– Synchronization, parallelism
Conclusions 15–3
Example: Revisited
#include <stdio.h>
#define NR 10000
#define NC 10000
int a[NR][NC];
void main() {
  int i,j;
  for (i=0;i<NR;i++)
    for (j=0;j<NC;j++)
      a[i][j] = 0;
}