CS3350B Computer Organization Chapter 1: CPU and Memory Part 1: The CPU
Alex Brandt
Department of Computer Science University of Western Ontario, Canada
Wednesday January 13, 2021
Alex Brandt
Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 1 / 35
Outline
1 The Basics
2 Clock Cycles per Instruction (CPI)
3 Power, Trends, Limitations
4 Benchmarks and Profiling
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 2 / 35
Components of a computer
Micro-architecture: the internals of a CPU (control and datapath) will come later in the course. For now, we look at the CPU as a whole.
CPU and memory are highly coupled, especially when we look at performance. This will be seen by the end of this chapter.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 3 / 35
CPUs and Memory: Terminology Review
CPU – Central Processing Unit
Processor – Sometimes the CPU, sometimes the actual circuitry
doing the processing within the CPU.
Memory Word – The typical unit of memory that is stored, passed around, and operated on. Usually 32 or 64 bits in size.
RAM – Random Access Memory. A memory storage technology that can be either static or dynamic.
HDD – Hard Disk Drive.
SSD – Solid State Drive. Drives store data that persists loss of power.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 4 / 35
Programmer’s View of CPU Performance
At a basic level, the running time of some program on a CPU is determined by:
The clock rate of the CPU (e.g. 3.4 GHz), The type of instructions being performed,
ë Addition/Subtraction faster than Multiplication/Division, etc. ë Affects the average clock cycles per instruction (CPI)
Memory access time.
ë Recall processor-memory gap
Assuming CPU clock rate is fixed, programmers can influence program performance by changing the type of instructions they use as well as how their code accesses memory.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 5 / 35
Aside: Changing Instructions for Performance
On 6th-generation Intel CPUs
32-bit integer division ∼26 clock cycles vs. logical bit shift ∼1 clock cycle
ë int i = 1234567; i /= 2; ë int i = 1234567; i >>= 1;
Floating point division ∼14 clock cycles vs. multiplication ∼5 clock cycles
ë float x = 1.2f; x /= 2.0f; ë float x = 1.2f; x *= 0.5f;
Fast Inverse Square Root thanks to Quake: https://en.wikipedia.org/wiki/Fast_inverse_square_root
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 6 / 35
Understanding and Analyzing Performance
Algorithmic analysis: Estimating the complexity of algorithms in some abstract, idealized way. In reality, two different 𝑂(𝑛2) algorithms can have wildly different running times.
Programming language, compiler, architecture:
ë Programming language and corresponding compiler determine the actual machine instructions the CPU will perform.
ë Resulting number and type of machine instructions vary by compiler and language and therefore resulting performance varies.
Processor and Memory: Determines how fast instructions are executed and how fast data moves to and from the processor.
I/O system (including OS): Determines how fast I/O operations are executed
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 7 / 35
Need for Performance Metrics
Purchasing Perspective. For a collection of machine which one has the
ë “best” cost?
ë “best” cost relative to performance?
Design perspective. Given many design options and directions which one has the
ë “best” performance improvement?
ë “best” cost relative to performance improvement?
In either case we need: (i) a basis for comparison, (ii) metrics for evaluation.
Our goal is to understand what factors in the architecture contribute to the overall system performance and the relative importance (and cost) of these factors.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 8 / 35
CPU Performance: Latency
CPU performance is largely measured by latency, throughput, clock frequency.
Want reduced response time (aka execution time, aka latency) – the time between the start and the completion of a task.
– Important to general PC users.
To maximize performance of some code segment (program) 𝑋, we
need to minimize execution time.
performance𝑋 = 1⇑execution_time𝑋
If 𝑋 is n times faster than 𝑌 , then
performance𝑋 execution_time𝑌
performance = execution_time = 𝑛 𝑌𝑋
Note: Can also compare latency of the same single instruction on
different CPUs.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 9 / 35
CPU Performance: Throughput
Want increased throughput – the total amount of work done in a given unit of time.
– Important to users like data center managers.
Again, throughput depends on the code segment being executed.
Different instructions result in different throughput measures.
Decreasing response time usually improves throughput, but other factors are important (task scheduling, memory bandwidth, etc.).
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 10 / 35
CPU Performance: Clock Frequency
Clock Frequency is a code-agnostic measure of CPU performance.
Typically, a faster clock (higher frequency) yields a higher performing CPU.
But the micro-architecture and the instruction set architecture play a large role.
ë They influence how much work the CPU does per cycle (i.e. efficiency)
Example:
CPU 𝐴 runs at 3 GHz and a division takes 20 cycles. CPU 𝐵 runs at 2 GHz and a division takes 10 cycles.
20 cycles ⇑ 3 𝐺𝐻𝑧 = 20 ⇑ 3000000000 𝐻𝑧 = 6.66𝑛𝑠 10 cycles ⇑ 2 𝐺𝐻𝑧 = 10 ⇑ 2000000000 𝐻𝑧 = 5.00𝑛𝑠
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 11 / 35
CPU Clocking
Synchronous digital systems (e.g. a CPU) are governed and controlled by a clock. This clock synchronizes internal circuits, memory states, and data movement.
Length of clock period determined by internal circuits and micro-architecture design. We will look at this with CPU datapaths and pipelining.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 12 / 35
CPU Clocking
Clock period (cycle): duration of a clock cycle (CC)
determines the speed of a computer processor
Caveat: again, not necessarily latency or throughput though e.g., 250ps = 0.25ns = 250 × 10−12𝑠
Clock frequency or rate (CR): cycles per second the inverse of the clock period
e.g., 3.0GHz = 3000MHz = 3.0 × 109Hz
CR = 1 / CC.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 13 / 35
CPU Time
It is important to distinguish elapsed time and the time spent on your task.
ë Wall time vs. CPU time
ë CPU time does not include time waiting for I/O or time spent on other processes
CPU execution time = for a program
or
CPU execution time = for a program
#CPU clock cycles × for a program
#CPU clock cycles ⇑ for a program
clockcycle
We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
clockrate
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 14 / 35
Outline
1 The Basics
2 Clock Cycles per Instruction (CPI)
3 Power, Trends, Limitations
4 Benchmarks and Profiling
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 15 / 35
Instruction Performance
#CPU clock cycles = #Instructions × Average # of clock cycles for a program for a program per instruction
Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute.
ë Different instructions may take different amounts of time depending on what they do.
ë A way to compare two different implementations (micro-architectures) of the same ISA.
ë Calculated by a simple averaging of each instruction type in a program and their corresponding number of cycles.
Average CPI and Effective CPI mean the same thing and are usually shortened to just CPI.
The CPI for a particular instruction type is usually denoted CPI𝑖
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 16 / 35
The Classic Performance Equation
CPU time = Instruction_count × CPI × clock_cycle or
CPU time = Instruction_count × CPI ⇑ clock_rate Keep in mind that the only complete and reliable measure of
computer performance is time.
For example, redesigning the hardware implementation of an instruction set to lower the instruction count may lead to an organization with
ë a slower clock cycle time or, ë higher CPI,
that offsets the improvement in instruction count.
Note: CPI depends on the type of instruction executed, so the code which executes the fewest instructions may not be the fastest.
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 17 / 35
A Simple Example (1/2)
𝑛
Overall effective CPI = ∑(CPI𝑖 × IC𝑖) ⇑ IC
𝑖=1
Op Inst. Freq CPI𝑖 Freq × CPI𝑖 (1) ALU 50% 1 .5 .5
Load
Store Branch
20% 5 1.0
10% 3 .3 .3 20% 2 .4 .4
∑ = 2.2
(1) How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 18 / 35
A Simple Example (1/2)
𝑛
Overall effective CPI = ∑(CPI𝑖 × IC𝑖) ⇑ IC
𝑖=1
Op Inst. Freq CPI𝑖 Freq × CPI𝑖 (1) ALU 50% 1 .5 .5
Load
Store Branch
20% 5 10% 3 20% 2
1.0 .4 .3 .3 .4 .4
∑ = 2.2 1.6
(1) How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 × IC × CC; so 2.2 versus 1.6 which means 37.5% faster
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 18 / 35
A Simple Example (2/2)
𝑛
Overall effective CPI = ∑(CPI𝑖 × IC𝑖) ⇑ IC
Op Freq CPI𝑖 ALU50%1 Load 20% 5
Store 10% 3 Branch 20% 2
(2) How does this CPI compare cycle off the branch time?
𝑖=1
Freq × CPI𝑖 (2) (3) .5 .5
1.0 1.0 1.0 .3 .3 .3 .4 .4
(3) What if two ALU instructions could be executed at once?
∑ = 2.2
with using branch prediction to save a
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 19 / 35
A Simple Example (2/2)
𝑛
Overall effective CPI = ∑(CPI𝑖 × IC𝑖) ⇑ IC
Op Freq CPI𝑖 ALU50%1 Load 20% 5
Store 10% 3 Branch 20% 2
𝑖=1
Freq × CPI𝑖 (2) (3) .5 .5
1.0 1.0 1.0 .3 .3 .3 .4 .2 .4
(3) What if two ALU instructions could be executed at once?
∑ = 2.2 2.0
with using branch prediction to save a
(2) How does this CPI compare
cycle off the branch time?
CPU time new = 2.0 × IC × CC so 2.2 versus 2.0 means 10% faster
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 19 / 35
A Simple Example (2/2)
𝑛
Overall effective CPI = ∑(CPI𝑖 × IC𝑖) ⇑ IC
𝑖=1
Freq × CPI𝑖 (2) (3) .5 .5.25 1.0 1.0 1.0
.3 .3 .3
.4 .2 .4 ∑ = 2.2 2.0 1.95
(2) How does this CPI compare
cycle off the branch time?
CPU time new = 2.0 × IC × CC so 2.2 versus 2.0 means 10% faster
(3) What if two ALU instructions could be executed at once?
CPU time new = 1.95 × IC × CC so 2.2 versus 1.95 means 12.8% faster
Op Freq CPI𝑖 ALU50%1 Load 20% 5
Store 10% 3 Branch 20% 2
with using branch prediction to save a
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 19 / 35
Understanding Program Performance
CPU Time = Instruction_count × CPI × clock_cycle
Instructions CPU Time = ×
Program
Clock cycles Instruction
×
Seconds Clock cycle
The performance of a program depends on the algorithm, the language, the compiler, the architecture, and the actual hardware.
Algorithm Programming language Compiler
Instruction_count CPI clock_cycle
X X X X X X
ISA XXX Processor organization X X
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 20 / 35
Check Yourself
A given application written in Java runs 15 seconds on a desktop processor. A new Java compiler is released that requires only 0.6 as many instructions as the old compiler. Unfortunately, it increases the CPI by a factor of 1.1. How fast can we expect the application to run using this new compiler? Pick the right answer from the three choices below:
a. 15×0.6 = 8.2 sec 1.1
b. 15×0.6×1.1=9.9 sec c. 15×1.1 = 27.5 sec
0.6
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 21 / 35
Check Yourself
A given application written in Java runs 15 seconds on a desktop processor. A new Java compiler is released that requires only 0.6 as many instructions as the old compiler. Unfortunately, it increases the CPI by a factor of 1.1. How fast can we expect the application to run using this new compiler? Pick the right answer from the three choices below:
a. 15×0.6 = 8.2 sec 1.1
b. 15×0.6×1.1=9.9 sec c. 15×1.1 = 27.5 sec
0.6 b!
𝑇𝑖𝑚𝑒1 =𝐼𝐶1 ×𝐶𝑃𝐼1 ×𝐶𝐶1
𝑇𝑖𝑚𝑒2 =𝐼𝐶2 ×𝐶𝑃𝐼2 ×𝐶𝐶2
=(𝐼𝐶1 ×0.6)×(𝐶𝑃𝐼1 ×1.1)×𝐶𝐶1 =𝑇𝑖𝑚𝑒1 ×0.6×1.1
= 15 × 0.6 × 1.1 = 9.9
Alex Brandt
Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 21 / 35
Outline
1 The Basics
2 Clock Cycles per Instruction (CPI)
3 Power, Trends, Limitations
4 Benchmarks and Profiling
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 22 / 35
CPU Power Usage
Depending on an architect’s design goals they may want to look at metrics different from latency, throughput, time, or clock frequency.
↑ Power Usage ⇒ ↑ Temperature ↑ Power Usage ⇒ ↓ Battery Life
Ultrabooks are so (not) hot right now
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 23 / 35
Power Trends
Complementary metal oxide semiconductor (CMOS) integrated circuits, the technology which implements the physical circuity inside modern CPUs, has a (simplified) power equation:
Power = (×30)
Capacitive load × Voltage2 (5𝑉 → 1𝑉 )
× Frequency switched (×1000)
Alex Brandt
Chapter 1: CPU and Memory, Part 1: The CPU
Wednesday January 13, 2021 24 / 35
Reducing Power
Suppose a new CPU has
ë 85% of capacitive load of old CPU
ë 15% voltage and 15% frequency reduction
Pnew
P =
old
Cold × 0.85 × (Vold × 0.85)2 × Fold × 0.85 C ×V2 ×F
4
The Power Wall
ë ë
ë
ë ë
Simply cannot consume any more power.
Decreasing transistor size ⇒ more transistors per chip ⇒ greater power density
Required input voltage may decrease but more transistors use more power.
Reducing voltage further very difficulty.
Removing extra heat very difficult.
=0.85 =0.52
old old old
Alex Brandt
Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 25 / 35
CPU Performance: Relative Performance vs VAX-11
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 26 / 35
Multi-Processors to the Rescue
Moore’s Law failing? Great idea #4: Performance via Parallelism.
Multi-core processors!
ë More than one processor per chip
ë Huge benefits to OS and multiple processes.
Within a single process requires explicit parallel programming ë Compared with instruction level parallelism:
– Hardware executes multiple instructions at once (pipelining/multi-issue)
– Hidden from the programmer ë Hard to do:
– Programming for performance
– Thread management, Load Balancing
– Optimizing communication and synchronization
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 27 / 35
Outline
1 The Basics
2 Clock Cycles per Instruction (CPI)
3 Power, Trends, Limitations
4 Benchmarks and Profiling
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 28 / 35
SPEC CPU Benchmark
Programs used to measure performance ë Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC) ë Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
ë Elapsed time to execute a selection of programs
– Negligible I/O, so focuses on CPU performance
ë Normalize relative to reference machine
ë Summarize as geometric mean of performance ratios
– CINT2006 (integer) and CFP2006 (floating-point) ⟨
⧸︂⟩∏ Execution time ratio𝑖 𝑖=1
⧸︂𝑛 𝑛
Alex Brandt
Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 29 / 35
CINT2006 for Intel Core i7 920
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 30 / 35
Profiling Tools
Many profiling tools
ë gprof (static instrumentation)
ë cachegrind, Dtrace (dynamic instrumentation) ë perf (performance counters)
perf in linux-tools, based on event sampling
ë ë
ë
ë
Keep a list of where “interesting events” (cycle, cache miss, etc) happen
CPU Feature: Counters for hundreds of events
– Performance: Cache misses, branch misses, instructions per cycle, … Intel®64 and IA-32 Architectures Software Developer’s Manual:
Appendix A lists all counters http: //www.intel.com/products/processor/manuals/index.html perf user guide: https://perf.wiki.kernel.org/index.php/Tutorial
Alex Brandt
Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 31 / 35
Exercise 1
void copymatrix1(int** src,
int** dst, int n) {
int i,j;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
dst[i][j] = src[i][j];
}}
copymatrix1 vs copymatrix2 ë What do they do?
ë What is the difference?
ë Which one performs better? Why?
perf stat -e cycles -e cache-misses ./copymatrix1
perf stat -e cycles -e cache-misses ./copymatrix2
ë What does the output like?
ë How to interpret it?
ë Which program performs better?
void copymatrix2(int** src,
int** dst, int n) {
int i,j;
for (j = 0; j < n; j++)
for (i = 0; i < n; i++)
dst[i][j] = src[i][j];
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 32 / 35
Exercise 1: Results
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 33 / 35
Exercise 2
void lower1 (char* s) { void lower2 (char* s) { int i; int i;
for (i = 0; i < strlen(s); i++) int n = strlen(s);
}
if (s[i]>=’A’ && s[i]<=’Z’)
s[i] -= ’A’-’a’;
lower1 vs lower2
ë What do they do?
for (i = 0; i < n; i++)
if (s[i]>=’A’ && s[i]<=’Z’)
s[i] -= ’A’-’a’;
ë What is the difference?
ë Which one performs better? Why?
perf stat -e cycles -e cache-misses ./lower1
perf stat -e cycles -e cache-misses ./lower2
ë What does the output like?
ë How to interpret it?
ë Which program performs better?
}
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 34 / 35
Exercise 2: Results
Alex Brandt Chapter 1: CPU and Memory, Part 1: The CPU Wednesday January 13, 2021 35 / 35