Modern Computer Architecture
Sourced from Introduction: Modern Computer Architecture RRZE 2016.
CMPSC 450
The Stored Program Computer
• The modern design is still based on Turing’s 1936 universal machine, realized as the stored-program (von Neumann) architecture.
Simple instruction execution
• Application Developer sees N adds as the work
• Processor designer has a different perspective on work. Six instructions per add operation.
User Code:
for (int i = 0; i < N; i++) A[i] = A[i] + B[i];
Processor Code:
LOAD r1 = A[i]
LOAD r2 = B[i]
ADD r1 = r1 + r2
STORE A[i] = r1
INCREMENT i
BRANCH -> top if i < N
Simple instruction execution
• Data transfers are a consequence of instruction execution and therefore a secondary resource.
User Code:
for (int i = 0; i < N; i++) A[i] = A[i] + B[i];
• Assuming 8 bytes per element, this moves 24 bytes per addition (16 bytes loaded, 8 bytes stored).
• What is the bottleneck?
Processor Code:
LOAD r1 = A[i]
LOAD r2 = B[i]
ADD r1 = r1 + r2
STORE A[i] = r1
INCREMENT i
BRANCH -> top if i < N
From high level code to actual execution
• Remember: C/C++ is a high-level language!
• The compiler translates it into lower-level assembly code.
Pipelining
• Split a complex instruction into several simple/fast steps
• Each step takes the same amount of time (e.g., one clock cycle)
• Execute different steps of different instructions at the same time
• Allows for shorter cycle times
• Ex: floating-point multiplication takes 5 clock cycles
• It can be broken into 5 stages, so that 5 multiplications are in flight simultaneously
• Once the pipeline is full, one result is produced each clock cycle.
• Drawbacks:
• The pipeline must be filled before full efficiency is reached
• Instructions must be independent to support pipelining
• Instruction scheduling by the compiler/hardware becomes more complex
Five stage pipeline visualization
First result is available after 5 cycles (pipeline latency).
Wind-up/wind-down phases: empty pipeline stages.
Example:
• Consider the following code. Ignoring memory access latency and assuming a multiply takes 3 clock cycles, what is the estimated execution time?
• Without pipelining, each multiply is calculated individually: 3m clock cycles
• With pipelining, the calculation of each multiply is overlapped based on the pipeline depth: m + 3 – 1 ~= m clock cycles (for very large m)
for (int i = 0; i < m; i++) C[i] = A[i] * B[i];
Start time: 0  1  2  3  4  5  6  7  8  9
End time:   3  4  5  6  7  8  9 10 11 12
Pipelining: The Instruction pipeline
• Besides the ALU, instruction execution itself is also pipelined.
• Intel uses up to 30 stages for their instruction pipeline
• Branches can stall this pipeline
• Each unit is itself pipelined
Superscalar Processors – Instruction Level Parallelism
• Multiple units enable use of Instruction Level Parallelism (ILP)
• Issuing m concurrent instructions per cycle: m-way superscalar
• Modern processors are 3-6 way superscalar & can perform multiple floating point instructions per cycle
• Out-of-order execution
• Instructions are decomposed into operations that may issue out of order as their operands become ready, with results re-assembled in program order after execution.
• Goal: keep more functional units busy at all times.
Simultaneous Multi-Threading
• For a given thread, there are still holes (bubbles) in the pipelines:
• Conditional branches / branch mis-prediction
• Memory stalls
• Infrastructure added to keep track of which instructions belong to which threads.
• Does not increase the number of functional units in the CPU.
SIMD processing
• Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on “wide” registers
• x86 SIMD instruction sets:
• SSE: register width = 128 bit -> 2 double-precision floating-point operands
• AVX: register width = 256 bit -> 4 double-precision floating-point operands
• AVX-512: register width = 512 bit -> 8 double-precision floating-point operands
• EX: Adding two registers holding double precision floating point operands
Memory
• DRAM gap – the growing performance gap between main memory and the ALU/functional units.
Caches:
• Solution: bring memory closer to the functional units.
• The cache is organized into cache lines (typically 64 bytes each)
• Caches exploit the concept of locality (i.e., data re-use).
Typical Memory Hierarchy
• Rule of thumb: the further from the functional units, the larger and slower the memory.
Abstracted Microarchitecture
Why Caches Work: Locality
• Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently
• Temporal locality:
• Recently referenced items are likely to be
referenced again in the near future
• Spatial locality:
• Items with nearby addresses tend to be referenced close together in time
Cache Hits and Misses
• Cache Hit – The CPU requests a memory address that already exists in the cache
• Hit time (Core 2):
• 3 clocks for L1
• 14 clocks for L2
• Cache Miss – The CPU requests a memory address that does not exist in the cache.
• Cold Miss – Occurs on first access to a block
• Capacity Miss – Occurs when working set is larger than the cache
• Conflict Miss – Occurs when the cache is large enough, but multiple data objects all map to the same slot
• Miss Penalty (Core 2) approximately 100 cycles
Cache line placement
• Direct Mapped: a line with a given address can be placed in only a single location in the cache
• Fully Associative: a line can be placed anywhere in the cache
• Searching the cache for a memory address is expensive
• Set Associative: a line can be placed anywhere within a set of locations in the cache
• A hybrid of the two approaches above
Direct Mapped Example #1
double sum_array_rows(double a[8][8]) {
    int i, j;
    double sum = 0;
    for (i = 0; i < 8; i++)
        for (j = 0; j < 8; j++)
            sum += a[i][j];
    return sum;
}
Cache (32 bytes/line = 4 doubles)
A[0][4]...A[0][7]
A[1][4]...A[1][7]
A[2][4]...A[2][7]
A[3][4]...A[3][7]
A[4][4]...A[4][7]
A[5][4]...A[5][7]
A[6][4]...A[6][7]
A[7][4]...A[7][7]
• Because ‘j’ is in the inner loop, the per-line access pattern is Miss, Hit, Hit, Hit.
• The first access to a cache line misses and loads it; the next 3 accesses to that line hit. Then the next cache line is accessed and the pattern repeats.
• The same pattern (Miss, Hit, Hit, Hit) repeats for the entire loop: 16 misses in 64 accesses.
Direct Mapped Example #2
double sum_array_cols(double a[8][8]) {
    int i, j;
    double sum = 0;
    for (j = 0; j < 8; j++)
        for (i = 0; i < 8; i++)
            sum += a[i][j];
    return sum;
}
• Because ‘i’ is in the inner loop, this access pattern starts out Miss, Miss, Miss, Miss.
• After the first full sweep over i (j = 0), the iterations j = 1, 2, 3 are all cache hits.
Cache (32 bytes/line = 4 doubles)
A[0][0]...A[0][3]
A[1][0]...A[1][3]
A[2][0]...A[2][3]
A[3][0]...A[3][3]
A[4][0]...A[4][3]
A[5][0]...A[5][3]
A[6][0]...A[6][3]
A[7][0]...A[7][3]