Computer Architecture ELEC3441
Computer Performance 2nd Semester, 2020-21
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
University of Hong Kong
How do you measure performance of a computer?
How do you make a computer fast?
HKU EEE ELEC3441 – HS 2
Ways to measure Performance
Execution Time Throughput
Time to finish a task
Number of tasks finish per unit time
n Execution time (response time) ≠ Throughput n We will focus on execution time in this course
HKU EEE ELEC3441 – HS 3
Relative Performance
n Define performance of a computer as
1 ExecutionTime
Performance =
n Computer B is n times faster than Computer A if:
n = PerformanceB = ExecutionTimeA PerformanceA ExecutionTimeB
HKU EEE ELEC3441 – HS 4
Quick Check
n Computer A finishes a task in 5s, Computer B finishes the same task in 4s. Which one is faster, by how much?
PerformanceB = ExecutionTimeA = 5 =1.25 PerformanceA ExecutionTimeB 4
Computer B is 1.25 times faster than Computer A
HKU EEE ELEC3441 – HS 5
Ways to Measure Execution Time
n Wall Clock Time (Elapse Time)
• The total time a user experiences that a computer takes to
finish a task
• Includes OS overhead, I/O, idle time, time shared with other users
n CPU Time
• The time spent on a user task in the CPU
• User CPU + OS CPU time
• Does not include I/O, time spent by other users, etc
n Focus on CPU Time in this course $ time shasum afile
132ecc0e19eec19d5dc775752efeac280cecebdc afile
real 0m20.177s user 0m12.835s sys 0m1.786s
HKU EEE ELEC3441 – HS 6
How can we determine CPU time needed to execute a program?
CPUTime = # of instruction × # of cycle × time program instruction cycle
HKU EEE ELEC3441 – HS
7
The Iron Law
CPU Time – Step 1
CPUTime = CycleCount × CycleTime
CycleCount ClockFrequency
n Most modern CPUs are synchronous digital systems
n The time needs to finish executing a task is determined by the number of cycles needed for that ask, multiply by the cycle time.
Digital system design review…
=
HKU EEE ELEC3441 – HS 8
Synchronous Sequential Circuits
n A synchronous sequential circuit contains exactly 1 clock signal
n All state elements are connected to the same clock signal
• è the state of the entire circuit is updated at the same time n Common form of synchronous sequential circuits:
input
clk
Comb Logic
HKU EEE
Comb Comb Logic Logic
ELEC3441 – HS
Comb output Logic
9
clk
clk
clk
Clock Signal
n A clock signal is particularly important signal in a synchronous sequential circuit
• It controls the action of all DFFs
n A clock signal toggles between ‘0’ and ‘1’
periodically
n The frequency of the toggling determines the maximum speed of the circuit
•
E.g.: in the accumulator example earlier, the output S cannot change faster than the clock frequency
X
S clk
x0 x1 x2 0
HKU EEE
1 clock period
= clock frequency
x0
x0 + x1
x0 +x1 +x2
e.g. Intel CPU runs at 3 GHz, Mobile phone processors at 1 GHz Lab FPGA board at 50 MHz
ELEC3441 – HS 10
clock period
Timing in Synchronous Circuits
clk
clk
a b
c d
y
clk
n In a synchronous sequential circuit, signal changes occur only during clock edge
n All signals are therefore synchronized to change values right after a clock edge
n In the above example, need to make sure correct value of y available BEFORE next clock edge
• Avoid glitches
HKU EEE ELEC3441 – HS 11
Timing in Synchronous Circuits
n In general, the propagation
delay through the
combinational logic between
Comb Logic
Stable before clock edge
any two registers must be
clk
a
b x y
clk
From glitch example
shorter than the clock period
n The longest such path is called the critical path of the circuit
n The critical path determines the maximum clock speed
HKU EEE ELEC3441 – HS
12
clk
clk
CPU Time – Step 1 – Summary
CPUTime = CycleCount × CycleTime = n To improve performance:
1. Increase clock frequency 2. Reduce cycle count
CycleCount ClockFrequency
n Increase clock freq è shorter critical path è less work accomplished in 1 cycleèmore cycles needed
• Engineersneedtradeoffbetweenthetwo
How many cycles does it take to finish a program?
HKU EEE ELEC3441 – HS 13
CPU Time – Step 2 – Cycle Per Instruction (CPI)
CycleCount = InstructionCount × CyclePerInstruction
n Program A has 2000 instructions, each instruction takes 2 cycles to finish. How many cycles does it take to complete Program A?
n Program B has 3000 instructions. 2000 of them takes 2 cycles and 1000 of them takes 1 cycle. How many cycles does the program take to finish?
HKU EEE ELEC3441 – HS 14
Average CPI
n In general, different machine instructions may take different amount of time to complete.
n Assuming n classes of instructions, then total clock cycle:
n
ClockCycle=∑CPIi ×InstructionCounti
i=1
n Weighted average CPI:
CPI=
i
CycleCount InstructionCount
∑n
= CPI×
i
i=1
InstructionCount InstructionCount
HKU EEE
ELEC3441 – HS
15
CPI Example (1)
n The ISA of computer A includes 3 classes of instructions that take different number of cycles to complete. A program P is compiled using compiler J, resulting in the utilization above.
n What is the average CPI of the compiled program?
Class
C1
C2
C3
Cycles
1
4
8
Compiler J
100
50
100
HKU EEE ELEC3441 – HS 16
CPI Example (2)
Class
C1
C2
C3
Cycles
1
4
8
Compiler J
100
50
100
Compiler K
350
100
50
n A newer compiler K was developed to compile same program P, resulting in the utilization above.
n What is the average CPI of the compiled
program using compiler K?
Ans: 2.3
Which compiler was better…?
HKU EEE ELEC3441 – HS 17
CPI Example (3)
Class
C1
C2
C3
#instr
#cycle
CPI
Cycles
1
4
8
Compiler J
100
50
100
250
1100
4.4
Compiler K
350
100
50
500
1150
2.3
n Observation:
• CompilerJresultsinhigherCPI
• CompilerKusesmoreinstructions
n But most importantly:
Compiler J uses fewer cyclesèshorter run time
è better
HKU EEE ELEC3441 – HS 18
Number of Instructions
a=0 b=a+1 c=a+b b=c+b
How many instructions are there in the following code?
If CPI = 1, how many cycles does it take to complete?
# of instr: 4
# of cycles: 4
HKU EEE ELEC3441 – HS 19
Number of Instructions
i=0 loop: a = a + 1
i=i+1
if i < 10 goto loop
How many instructions are there in the following code?
If CPI = 1, how many cycles does it take to complete?
# of STATIC instructions: 4
# of DYNAMIC instructions: 1 + 3 * 10 = 31 # of cycles: 31
HKU EEE ELEC3441 - HS 20
Number of Instructions
How many instructions are there in the following code?
To compute: r = a × b
r=0
for(i=b;i>0;i=i-1) r=a*b
r=r+a
Dynamic # of instructions can be data dependent.
HKU EEE ELEC3441 – HS 21
# of DYNAMIC instructions: ≈ 3b # of cycles: ≈3b
# of instructions: 1 # of cycles: 1 (?)
Instruction Count & CPI
n The number of instructions in a program depends on
• Natureofapplication
• Compilertechniques
• TypeofavailableinstructionofanISA
n Average cycles per instruction depends on • CPUmicroarchitecture
• ISA(CISCvsRISC)
• ThecurrentrunningstateofCPU
n Different instructions may have different CPI • AverageCPIaffectedbyinstructionmix
HKU EEE ELEC3441 – HS 22
Combining All – The Iron Law
CPUTime = # of instruction × # of cycle × time
• Algorithm • Language • Compiler • ISA
program
• Language • Compiler • ISA
• Micro-
architecture
instruction cycle
• ISA
• Hardware
design
HKU EEE
ELEC3441 – HS
23
CISC vs RISC
n CISC: Complex Instruction Set Computer RISC: Reduced Instruction Set Computer
n CISC and RISC are two different computer design strategies:
HKU EEE
ELEC3441 – HS
24
VAX
x86
PA-RISC
MIPS RISC-V
ARM
CISC
SPARC Alpha
RISC
CISC
n ISA includes complex instructions
• E.g. VAX has a POLY instruction that evaluate polynomial in
hardware
n Includes complex addressing mode
• Mem-reg; mem-mem; indirect; relative; double-indirect..
n Hardware implement complex instructions using multiple clock cycles
• microcode
n One promise of CISC ISA is that it allows shorter compiled code and make compiler easier.
• Still relevant today in embedded systems
n Drawback:
• Less attractive as compiler techniques improve
• Complex hardwareèslow
HKU EEE ELEC3441 – HS 25
RISC
n ISA specifies simple instructions • Mostlyregister-registertransfer
• Simpleaddressingmode
n Simpler hardware design
• Allowshardwareoptimization • Fasterhardwareoverall
• Allowseasypipelining
n Simple ISA allows compiler optimization
n Generated code length is generally longer n Most (if not all) ISA after the 80s are RISC
HKU EEE ELEC3441 – HS 26
RISC vs CISC – Iron Law
CPUTime = # of instruction × # of cycle × time program instruction cycle
Microarchitecture
CPI
Cycle Time
CISC
>1
short
RISC – single cycle unpipelined
1
long
RISC – pipelined
1
short
HKU EEE ELEC3441 – HS 27
Amdahl’s Law Review
n Describes the overall speedup of a system due to speed improvement that applies to a portion of the system.
n Let P be the portion of the system that can be spedupbyafactorofS, 0≤P≤1
n Amdahl’s Law stays that the overall speedup is: 1
(1 − P ) + P S
n E.g.: P = 50%, S=100 è speedup = 1.98x
HKU EEE ELEC3441 – HS 28
Amdahl’s Law Example
Class
C1
C2
C3
Cycles
1
4
8
# instr
200
130
60
# cycles
200
520
480
n Q1: a new implementation of C3 reduces its execution length by half to 4 cycles, how much improvement in performance can be achieved?
P= 480 =0.4 S=2 200+520+480
⇒ speedup =
1 = 1.25 (1−0.4)+0.4/2
HKU EEE ELEC3441 – HS 29
Amdahl’s Law Example
Class
C1
C2
C3
Cycles
1
4
8
# instr
200
130
60
# cycles
200
520
480
n Q2: Which instruction class, when its cycle count is reduced by half, will result in most performance improvement?
• LargestCPI?
• Mostused?
• Mostcyclesused?
HKU EEE ELEC3441 – HS 30
Amdahl’s Law Implications
Can we get to a speedup of 10 with P=0.9?
n In most applications, only portion of the computation can be sped up
• improved hardware designs • parallelization
n Amdahl’s Law è max speedup is limited by P
• If only small portion of program can be sped up, then it
doesn’t matter how large S is
HKU EEE ELEC3441 – HS 31
Benchmark Programs
n A benchmark suite is a set of programs used to compare processor performance
n Need to be representative of typical workload n Kernel vs whole application
• Recall Amdahl’s Law
n Avoid over optimization for specific benchmark
n SPEC benchmark
• Several benchmark suites commonly used in computer
architecture research
• E.g. SPEC CPU2006
HKU EEE ELEC3441 – HS 32
SPEC CPU Benchmark
n Programs used to measure performance • Supposedlytypicalofactualworkload
n Standard Performance Evaluation Corp (SPEC) • DevelopsbenchmarksforCPU,I/O,Web,…
n SPEC CPU2006
• Elapsedtimetoexecuteaselectionofprograms • Negligible I/O, so focuses on CPU performance
• Normalizerelativetoreferencemachine
• Summarizeasgeometricmeanofperformanceratios
• CINT2006 (integer) and CFP2006 (floating-point)
EHLKEUCE3E44E1 – HS 33
n
n
ÕExecution time ratioi
i=1
CINT2006 for Intel Core i7 920
HKU EEE ELEC3441 – HS 34
Matrix-Matrix Multiplication
⎡ a0,0 ! a0,N−1 ⎤⎡ b0,0 ! b0,N−1 ⎤ ⎢⎥⎢⎥
⎢ ” # ” ⎥×⎢ ” # ” ⎥ ⎢ a ! a ⎥ ⎢ b ! b ⎥
⎣ N−1,0 N−1,N−1 ⎦ ⎣ N−1,0 N−1,N−1 ⎦
⎡!⎤
⎢ N−1 ⎥ =⎢” ∑ai,kbk,j “⎥
r[i][j] = 0
for (k=0; k