
Modern Computer Architecture
Sourced from Introduction: Modern Computer Architecture RRZE 2016.
CMPSC 450

The Stored Program Computer
• The modern design is still based on Turing's 1936 design.

Simple instruction execution
• The application developer sees N additions as the work.
• The processor designer has a different perspective on work: six instructions per add operation.
User Code:
for (int i = 0; i < N; i++)
    A[i] = A[i] + B[i];

Processor Code:
LOAD r1 = A[i]
LOAD r2 = B[i]
ADD r1 = r1 + r2
STORE A[i] = r1
INCREMENT i
BRANCH -> top if i < N

Simple instruction execution
• Data transfers are a consequence of instruction execution and are therefore a secondary resource.

User Code:
for (int i = 0; i < N; i++)
    A[i] = A[i] + B[i];

• Assuming 8 bytes per element, this loop moves 24 bytes per addition (16 bytes loaded, 8 bytes stored).
• What is the bottleneck?

Processor Code:
LOAD r1 = A[i]
LOAD r2 = B[i]
ADD r1 = r1 + r2
STORE A[i] = r1
INCREMENT i
BRANCH -> top if i < N

From high level code to actual execution
• Remember: C/C++ is a high-level language!
• Code is translated into lower-level assembly code by the compiler.

Pipelining
• Split a complex instruction into several simple/fast steps
• Each step takes the same amount of time (e.g., one clock cycle)
• Execute different steps of different instructions at the same time
• Allows for shorter cycle times
• Example: a floating-point multiplication takes 5 clock cycles
• It can be broken up so that 5 multiplications are worked on simultaneously
• Once the pipeline is full, one result is produced each clock cycle
• Drawbacks:
• The pipeline must be filled before full efficiency is observed
• Instructions must be independent to support pipelining
• Instruction scheduling by the compiler/hardware becomes more complex

Five stage pipeline visualization
• First result is available after 5 cycles (pipeline latency)
• Wind-up/wind-down phases: empty pipeline stages

Example:
• Consider the following code. Ignoring memory access latency and assuming a multiply takes 3 clock cycles, what is the estimated execution time of the code?
for (int i = 0; i < m; i++)
    C[i] = A[i] * B[i];

• Without pipelining, each multiply is calculated individually: 3m clock cycles
• With pipelining, the multiplies overlap according to the pipeline depth: m + 3 - 1 ≈ m clock cycles (for very large m)

Start time   End time
0            3
1            4
2            5
3            6
4            7
5            8
6            9
7            10
8            11
9            12

Pipelining: The Instruction pipeline
• Besides the ALU, instruction execution itself is also pipelined.
• Intel uses up to 30 stages for its instruction pipeline
• Branches can stall this pipeline
• Each unit is itself pipelined

Superscalar Processors – Instruction Level Parallelism
• Multiple units enable the use of Instruction Level Parallelism (ILP)
• Issuing m concurrent instructions per cycle: m-way superscalar
• Modern processors are 3-6 way superscalar and can perform multiple floating-point instructions per cycle
• Out-of-order execution
• The ability to issue deconstructed operations out of order and re-assemble them after complete execution
• Keeps more functional units busy at all times

Simultaneous Multi-Threading
• For a given thread, there are still holes in the pipelines:
• Conditional branches / branch mis-prediction
• Memory stalls
• Infrastructure is added to keep track of which instructions belong to which threads.
• Does not increase the number of functional units in the CPU.

SIMD Processing
• Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers
• x86 SIMD instruction sets:
• SSE: register width = 128 bits -> 2 double-precision floating-point operands
• AVX: register width = 256 bits -> 4 double-precision floating-point operands
• AVX512: register width = 512 bits -> 8 double-precision floating-point operands
• Example: adding two registers holding double-precision floating-point operands

Memory
• DRAM gap – the difference in performance between main memory and the ALU/functional units.

Caches:
• Solution: bring memory closer to the functional units.
• A cache is organized into cache lines (approximately 64 bytes each).
• Caches exploit the concept of locality (i.e., data re-use).

Typical Memory Hierarchy
• Rule of thumb: the further from the functional units, the larger and slower the memory.

Abstracted Microarchitecture

Why Caches Work: Locality
• Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently
• Temporal locality: recently referenced items are likely to be referenced again in the near future
• Spatial locality: items with nearby addresses tend to be referenced close together in time

Cache Hits and Misses
• Cache Hit – the CPU requests a memory address that is already present in the cache
• Hit time (Core 2):
• 3 clocks for L1
• 14 clocks for L2
• Cache Miss – the CPU requests a memory address that is not present in the cache
• Cold Miss – occurs on the first access to a block
• Capacity Miss – occurs when the working set is larger than the cache
• Conflict Miss – occurs when the cache is large enough, but multiple data objects all map to the same slot
• Miss Penalty (Core 2): approximately 100 cycles

Cache line placement
• Direct Mapped: a line with a given address can only be placed in a single location in the cache
• line index = (memory address / line size) mod (number of cache lines)
• Fully Associative: a line can be placed anywhere in the cache
• Searching the cache for a memory address is expensive
• Set Associative: a line can be placed anywhere within a set of locations in the cache
• A hybrid of the two approaches

Direct Mapped Example #1
int sum_array_rows(double a[8][8]) {
    int i, j;
    double sum = 0;
    for (i = 0; i < 8; i++)
        for (j = 0; j < 8; j++)
            sum += a[i][j];
    return sum;
}

Cache (32 bytes/line = 4 doubles):
A[0][4]...A[0][7]
A[1][4]...A[1][7]
A[2][4]...A[2][7]
A[3][4]...A[3][7]
A[4][4]...A[4][7]
A[5][4]...A[5][7]
A[6][4]...A[6][7]
A[7][4]...A[7][7]

• Because 'j' is in the inner loop, the access pattern is Miss, Hit, Hit, Hit.
• There will be 4 memory accesses from the first cache line (A[0][0]...A[0][3]), followed by a replacement (and 3 more accesses); then the next cache line is accessed.
• The same pattern (Miss, Hit, Hit, Hit) repeats for the entirety of the loop.

Direct Mapped Example #2

int sum_array_cols(double a[8][8]) {
    int i, j;
    double sum = 0;
    for (j = 0; j < 8; j++)
        for (i = 0; i < 8; i++)
            sum += a[i][j];
    return sum;
}

Cache (32 bytes/line = 4 doubles):
A[0][0]...A[0][3]
A[1][0]...A[1][3]
A[2][0]...A[2][3]
A[3][0]...A[3][3]
A[4][0]...A[4][3]
A[5][0]...A[5][3]
A[6][0]...A[6][3]
A[7][0]...A[7][3]

• Because 'i' is in the inner loop, the access pattern starts out Miss, Miss, Miss, Miss.
• After the first full sweep over i (j = 0), the iterations j = 1, 2, 3 are all cache hits.