CSE141-s121-L4-before
Week 2 — Tuesday July 6
ISA (Cont’d) & Performance
CSE-141 Summer Session I 2021
Many slides adapted from Dean Tullsen
Administrative
• Weekly Check-In will be released today and due tomorrow night
• Homework 1
• Due next Thursday 11:59 pm
Be a human …
• 2 min breakout rooms
• Anything exciting you did on the weekend?
• Feel free to not join and hang out in the main
room…
Previously on CSE-141…
The Instruction Set Architecture
I/O systemInstr. Set Proc.
Compiler
Operating
System
Application
Digital Design
Circuit Design
Instruction Set
Architecture
• An ISA is “the agreed-upon interface between all the software that runs on the
machine and the hardware that executes it.”
Hardware
Software
The Instruction Set Architecture
I/O systemInstr. Set Proc.
Compiler
Operating
System
Application
Digital Design
Circuit Design
Instruction Set
Architecture
• An ISA is “the agreed-upon interface between all the software that runs on the
machine and the hardware that executes it.”
Hardware
Software
Key ISA decisions
• operations
– how many?
– which ones?
• operands
– how many?
– location
– types
– how to specify?
• instruction format
– size
– how many formats?
y = x + b
operation
source operands
Destination operand
how does the computer know what
0001 0101 0001 0010 means?
add r5, r1, r2
MIPS Instruction Formats
opcode
opcode
opcode
rs rt rd sa funct
rs rt immediate
target
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits
R-type
I-type
J-type
• An instruction has opcode == 0 ————————-> R-type ✅
• An instruction is R-type ————————-> opcode == 0 ❌
MIPS Instruction Formats
• the opcode tells the machine which format
opcode
opcode
opcode
rs rt rd sa funct
rs rt immediate
target
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits
R-type
I-type
J-type
• An instruction has opcode == 0 ————————-> R-type ✅
• An instruction is R-type ————————-> opcode == 0 ❌
MIPS addressing modes
register direct
add $1, $2, $3
immediate
add $1, $2, #35
base + displacement
lw $1, disp($2)
OP rs rt rd sa funct
OP rs rt immediate
rt
rs
immediate
(R1 = M[R2 + disp])
MIPS addressing modes
register direct
add $1, $2, $3
immediate
add $1, $2, #35
base + displacement
lw $1, disp($2)
OP rs rt rd sa funct
OP rs rt immediate
rt
rs
immediate register indirect ! disp = 0
absolute ! (rs) = 0
(R1 = M[R2 + disp])
Control Flow
• Jumps
• Procedure call (jump subroutine)
• Conditional Branch
• Used to implement, for example, if-then-else logic, loops, etc.
Control Flow
• Jumps
• Procedure call (jump subroutine)
• Conditional Branch
• Used to implement, for example, if-then-else logic, loops, etc.
• A conditional branch must specify two things
• Condition under which the branch is taken
• Location that the branch jumps to if taken (target)
Conditional branch
• How do you specify the destination (target) of a branch/jump?
Conditional branch
• How do you specify the destination (target) of a branch/jump?
• studies show that almost all conditional branches go short distances from the
current program counter (loops, if-then-else).
Conditional branch
• How do you specify the destination (target) of a branch/jump?
• studies show that almost all conditional branches go short distances from the
current program counter (loops, if-then-else).
• we can specify a relative address in much fewer bits than an absolute address
• e.g., beq $1, $2, 100 => if ($1 == $2) PC = (PC+4) + 100 * 4
Conditional branch
• How do you specify the destination (target) of a branch/jump?
• studies show that almost all conditional branches go short distances from the
current program counter (loops, if-then-else).
• we can specify a relative address in much fewer bits than an absolute address
• e.g., beq $1, $2, 100 => if ($1 == $2) PC = (PC+4) + 100 * 4
• How do we specify the condition of the branch?
MIPS conditional branches
MIPS conditional branches
• beq, bne beq r1, r2, addr => if (r1 == r2) goto addr
MIPS conditional branches
• beq, bne beq r1, r2, addr => if (r1 == r2) goto addr
• slt $1, $2, $3 => if ($2 < $3) $1 = 1; else $1 = 0 MIPS conditional branches • beq, bne beq r1, r2, addr => if (r1 == r2) goto addr
• slt $1, $2, $3 => if ($2 < $3) $1 = 1; else $1 = 0 • these, combined with $0, can implement all fundamental branch conditions Always, never, !=, = =, >, <=, >=, <, >(unsigned), <= (unsigned), ... MIPS conditional branches • beq, bne beq r1, r2, addr => if (r1 == r2) goto addr
• slt $1, $2, $3 => if ($2 < $3) $1 = 1; else $1 = 0
• these, combined with $0, can implement all fundamental branch conditions
Always, never, !=, = =, >, <=, >=, <, >(unsigned), <= (unsigned), ...
if (i
Jumps
• need to be able to jump to an absolute address sometimes
• need to be able to do procedure calls and returns
• jump — j 10000 => PC = 10000
• jump and link — jal 100000 => $31 = PC + 4; PC = 10000
Jumps
• need to be able to jump to an absolute address sometimes
• need to be able to do procedure calls and returns
• jump — j 10000 => PC = 10000
• jump and link — jal 100000 => $31 = PC + 4; PC = 10000
• used for procedure calls
OP target
Jumps
• need to be able to jump to an absolute address sometimes
• need to be able to do procedure calls and returns
• jump — j 10000 => PC = 10000
• jump and link — jal 100000 => $31 = PC + 4; PC = 10000
• used for procedure calls
• jump register — jr $31 => PC = $31
• used for returns, but can be useful for lots of other things.
OP target
Branch and Jump Addressing Modes
• Branch (e.g., beq) uses PC-relative addressing mode (uses few bits if address typically
close). That is, it uses base+displacement mode, with the PC being the base. If opcode
is 6 bits, how many bits are available for displacement? How far can you jump?
• Jump uses pseudo-direct addressing mode. 26 bits of the address is in the instruction,
the rest is taken from the PC.
instruction program counter
jump destination address
| |00
6 | 26 4 | 28
Review — Instruction Execution in a CPU
Memory2
20000
20004
80000
Address10
10001100010000110100111000100000
00000000011000010010100000100000
00000000000000000000000000111001
Registers10
R0
R1
R2
R3
R4
R5
…
0
36
60000
45
198
12
…
Program Counter10
20000
Instruction Buffer
op rtrs rd shamt
immediate/disp
in1 in2
out
ALUoperation
Load/Store
Unit
addr
data
C
P
U
M
e
m
o
r
y
MIPS ISA Tradeoffs
What if?
• 64 registers
• 20-bit immediates
opcode
opcode
opcode
rs rt rd sa funct
rs rt immediate
target
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits
R-type
I-type
J-type
RISC Architectures
• MIPS, like SPARC, PowerPC, and Alpha AXP, is a RISC (Reduced Instruction
Set Computer) ISA.
• fixed instruction length
• few instruction formats
• load/store architecture
• RISC architectures worked because they enabled pipelining. They continue
to thrive because they enable parallelism.
Alternative Architectures
Alternative Architectures
• Design alternative:
• provide more powerful operations
• goal is to reduce number of instructions executed
• danger is a slower cycle time and/or a higher CPI (cycles per instruction)
Alternative Architectures
• Design alternative:
• provide more powerful operations
• goal is to reduce number of instructions executed
• danger is a slower cycle time and/or a higher CPI (cycles per instruction)
• Sometimes referred to as “RISC vs. CISC”
• CISC = Complex Instruction Set Computer (as alt to RISC)
• virtually all new instruction sets since 1982 have been RISC
• VAX: minimize code size, make assembly language easy
instructions from 1 to 54 bytes long!
Alternative Architectures
• Design alternative:
• provide more powerful operations
• goal is to reduce number of instructions executed
• danger is a slower cycle time and/or a higher CPI (cycles per instruction)
• Sometimes referred to as “RISC vs. CISC”
• CISC = Complex Instruction Set Computer (as alt to RISC)
• virtually all new instruction sets since 1982 have been RISC
• VAX: minimize code size, make assembly language easy
instructions from 1 to 54 bytes long!
• We’ll look (briefly!) at 80×86
Alternative Architectures
• Design alternative:
• provide more powerful operations
• goal is to reduce number of instructions executed
• danger is a slower cycle time and/or a higher CPI (cycles per instruction)
• Sometimes referred to as “RISC vs. CISC”
• CISC = Complex Instruction Set Computer (as alt to RISC)
• virtually all new instruction sets since 1982 have been RISC
• VAX: minimize code size, make assembly language easy
instructions from 1 to 54 bytes long!
• We’ll look (briefly!) at 80×86
• What is ARM?
80×86
• 1978: The Intel 8086 is announced (16 bit architecture)
• 1980: The 8087 floating point coprocessor is added
• 1982: The 80286 increases address space to 24 bits, +instructions
• 1985: The 80386 extends to 32 bits, new addressing modes
• 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions
(mostly designed for higher performance)
• 1997: MMX is added
• 1999: Pentium III (same architecture)
• 2001: Pentium 4 (144 new multimedia instructions), simultaneous multithreading (hyperthreading)
• 2005: dual core Pentium processors
• 2006: quad core (sort of) Pentium processors
• 2009: Nehalem – eight-core multithreaded processors
• 2015: Skylake – 4-core, multithreaded, added hw security features, transactional memory, …
80×86
• Complexity:
• Instructions from 1 to 17 bytes long
• one operand must act as both a source and destination
• one operand can come from memory
• complex addressing modes
e.g., “base or scaled index with 8 or 32 bit displacement”
• Saving grace:
• the most frequently used instructions are not too difficult to build
• compilers avoid the portions of the architecture that are slow
• Some other tricks we’ll talk about later.
ISA Key Points
• MIPS is a general-purpose register, load-store, fixed-instruction-length
architecture.
• MIPS is optimized for fast pipelined performance, not for low instruction
count
• Historic architectures favored code size over parallelism.
• MIPS most complex addressing mode, for both branches and loads/stores is
base + displacement.
Thought Experiment
• What is the fastest way to send a picture of a black hole to Boston?
• What is the fastest way to send 5 petabytes of data to Boston?
Thought Experiment
• What is the fastest way to send a picture of a black hole to Boston?
• What is the fastest way to send 5 petabytes of data to Boston?
Thought Experiment
• What is the fastest way to send a picture of a black hole to Boston?
• What is the fastest way to send 5 petabytes of data to Boston?
(5 petabytes) / (940 Megabits/second) = 1.35 years
Measuring and Discussing
Computer System Performance
or
“My computer is faster than your computer”
Performance Trends
Specint2000
1.00
10.00
100.00
1000.00
10000.00
85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05
Intel
Alpha
Sparc
Mips
HP PA
Pow er PC
AMD
Year of introduction
P
er
fo
rm
an
ce
• But what is performance?
The bottom line: Performance
Car Speed
160 mph
65 mph
Time to
Bay Area
3.1 hours
7.7 hours
Passengers
2
60
Throughput
(pmph)
320
3900
The bottom line: Performance
° Time to do the task
– execution time, response time, latency
° Tasks per day, hour, week, sec, ns. ..
– throughput, bandwidth
Car Speed
160 mph
65 mph
Time to
Bay Area
3.1 hours
7.7 hours
Passengers
2
60
Throughput
(pmph)
320
3900
Measures of “Performance”
Measures of “Performance”
• Execution Time
Measures of “Performance”
• Execution Time
• Frame Rate
Measures of “Performance”
• Execution Time
• Frame Rate
• Throughput (operations/time)
• Transactions/sec, queries/day, etc.
Measures of “Performance”
• Execution Time
• Frame Rate
• Throughput (operations/time)
• Transactions/sec, queries/day, etc.
• Responsiveness
Measures of “Performance”
• Execution Time
• Frame Rate
• Throughput (operations/time)
• Transactions/sec, queries/day, etc.
• Responsiveness
• Performance / Cost
Measures of “Performance”
• Execution Time
• Frame Rate
• Throughput (operations/time)
• Transactions/sec, queries/day, etc.
• Responsiveness
• Performance / Cost
• Performance / Power
Measures of “Performance”
• Execution Time
• Frame Rate
• Throughput (operations/time)
• Transactions/sec, queries/day, etc.
• Responsiveness
• Performance / Cost
• Performance / Power
• Performance / Energy
How to measure Execution Time?
How to measure Execution Time?
% time program
… program results …
real 0m0.921s
user 0m0.179s
sys 0m0.389s
%
How to measure Execution Time?
• Wall-clock time?
% time program
… program results …
real 0m0.921s
user 0m0.179s
sys 0m0.389s
%
How to measure Execution Time?
• Wall-clock time?
• user CPU time? % time program
… program results …
real 0m0.921s
user 0m0.179s
sys 0m0.389s
%
How to measure Execution Time?
• Wall-clock time?
• user CPU time?
• user + kernel CPU time?
% time program
… program results …
real 0m0.921s
user 0m0.179s
sys 0m0.389s
%
How to measure Execution Time?
• Wall-clock time?
• user CPU time?
• user + kernel CPU time?
• Answer:
% time program
… program results …
real 0m0.921s
user 0m0.179s
sys 0m0.389s
%
Our definition of Performance
Execution TimeX
, for program X
Our definition of Performance
PerformanceX =
1
Execution TimeX
, for program X
Our definition of Performance
• only has meaning in the context of a program or workload
PerformanceX =
1
Execution TimeX
, for program X
Our definition of Performance
• only has meaning in the context of a program or workload
• Not very intuitive as an absolute measure, but most of the time we’re more
interested in relative performance.
PerformanceX =
1
Execution TimeX
, for program X
Relative Performance
• can be confusing
A runs in 12 seconds
B runs in 20 seconds
• A/B = 0.60, so A is 40% faster, or 1.40X faster, or B is 40% slower
• B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
• needs a precise definition
Relative Performance (Speedup), the Definition
PerformanceX
Execution TimeXPerformanceY
Speedup Execution TimeY= = = n(X/Y)
Example
• Machine A runs program C in 9 seconds, Machine B runs the same program
in 6 seconds. What is the speedup we see if we move to Machine B from
Machine A?
Example
• Machine A runs program C in 9 seconds, Machine B runs the same program
in 6 seconds. What is the speedup we see if we move to Machine B from
Machine A?
• Machine B gets a new compiler, and can now run the program in 3 seconds.
Speedup from the new compiler?
What is Time?
What is Time?
CPU Execution Time = CPU clock cycles * Clock cycle time
What is Time?
CPU Execution Time = CPU clock cycles * Clock cycle time
• Every conventional processor has a clock with an associated clock cycle time or clock rate
What is Time?
CPU Execution Time = CPU clock cycles * Clock cycle time
• Every conventional processor has a clock with an associated clock cycle time or clock rate
• Every program runs in an integral number of clock cycles
Cycle Time
MHz = millions of cycles/second, GHz = billions of cycles/second
X MHz = 1000/X nanoseconds cycle time
Y GHz = 1/Y nanoseconds cycle time
How many clock cycles?
How many clock cycles?
Number of CPU clock cycles =
[Instruction count] * [Average Clock Cycles per Instruction (CPI)]
How many clock cycles?
Number of CPU clock cycles =
[Instruction count] * [Average Clock Cycles per Instruction (CPI)]
Computer A runs program C in 3.6 billion cycles. Program C requires 2 billion dynamic
instructions. What is the CPI?
All Together Now
CPU Execution Time = [CPU clock cycles] * [Clock cycle time]
CPU clock cycles = [Instruction count] * [Average Clock Cycles per Instruction (CPI)]
All Together Now
CPU Execution
Time
Instruction
Count
CPI Clock Cycle
Time= X X
All Together Now
CPU Execution
Time
Instruction
Count
CPI Clock Cycle
Time= X X
instructions
cycles/instruction seconds/cycle
seconds
Putting it all together
• IC = 4 billion, 2 GHz processor, execution time of 3 seconds. What is the CPI
for this program?
• Suppose we reduce CPI to 1.2 (through an architectural improvement). What
is the new execution time?
CPU Execution
Time
Instruction
Count
CPI Clock Cycle
Time= X X