SEC204
1
Computer Architecture and Low Level Programming
Dr. Vasilios Kelefouras
Email: v.kelefouras@plymouth.ac.uk Website: https://www.plymouth.ac.uk/staff/vasilios-kelefouras
School of Computing (University of Plymouth)
Date
28/10/2019
2
Outline
Memory hierarchy
Cache memories
Temporal and spatial locality
Cache design
Cache hit/miss
Direct mapped, set associative, fully associative
Write policies
Replacement policy
Stack memory
3
Memory Wall Problem
Even if you have an incredibly fast processor in your computer, the system’s performance strongly depends on the speed of your DDR
Figure: relative performance (over 1980) of processor vs memory, illustrating the growing gap (taken from https://slideplayer.com/slide/7075269/)
4
Memory Hierarchy
Taken from https://www.researchgate.net/publication/281805561_MTJ- based_hybrid_storage_cells_for_normally-off_and_instant-on_computing/figures?lo=1
5
Cache memories
Wouldn’t it be nice if we could find a balance between fast and cheap memory?
The solution is to add from 1 up to 3 levels of cache memories, which are small, fast, but expensive memories
— The cache goes between the processor and the slower main memory (DDR)
— It keeps a copy of the most frequently used data from the main memory
— Faster reads and writes to the most frequently used addresses
— We only need to access the slower main memory for less frequently used data
Cache memories occupy the largest part of the chip area
They account for a significant share of the total power consumption
Add complexity to the design
Cache memories are of key importance regarding performance
6
Memory Hierarchy (2)
Consider that the CPU needs to perform a load instruction
First it looks at L1 data cache. If the datum is there then it loads it and no other memory is accessed (L1 hit)
If the datum is not in the L1 data cache (L1 miss), then the CPU looks at the L2 cache
If the datum is in L2 (L2 hit) then no other memory is accessed. Otherwise (L2 miss), the CPU looks at L3 etc
L1 cache access time:  1-4 CPU cycles
L2 cache access time:  6-14 CPU cycles
L3 cache access time:  40-70 CPU cycles
DDR access time:       100-200 CPU cycles
7
Data Locality
Regarding static programs, it’s very difficult and time consuming to figure out what data will be the “most frequently accessed” before a program actually runs
In static programs the control flow path is known at compile time
Regarding dynamic programs it is impossible
This makes it hard to know what to store into the small, precious cache memory
But in practice, most programs exhibit locality, which the cache can take advantage of
— The principle of temporal locality says that if a program accesses one memory address, there is a good chance that it will access the same address again
— The principle of spatial locality says that if a program accesses one memory address, there is a good chance that it will also access other nearby addresses
8
Temporal Locality in Program Instructions
The principle of temporal locality says that if a program accesses one memory address, there is a good chance that it will access the same address again
Loops are excellent examples of temporal locality in programs
— The loop body will be executed many times
— The computer will need to access those same few locations of the instruction memory repeatedly
For example:
Loop: lw   $t0, 0($s1)
      add  $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $0, Loop
— Each instruction will be fetched over and over again, once on every loop iteration
9
Temporal Locality in Data
Programs often access the same variables over and over, especially within loops, e.g., below, sum, i and B[k] are repeatedly read/written
Commonly-accessed variables can be kept in registers, but this is not always possible as there is a limited number of registers
The sum and i variables are (a) small in size and (b) reused many times, and therefore it is efficient for them to remain in the CPU's registers
B[k] remains unchanged during the innermost loop and therefore it is efficient for it to remain in a CPU register
The whole A[ ] array is accessed 3 times and therefore it will remain in the cache (depending on its size)
sum = 0;
for (k = 0; k < 3; k++)
for (i = 0; i < N; i++)
sum = sum + A[i] + B[k];
10
Spatial Locality in Program Instructions
The principle of spatial locality says that if a program accesses one memory address, there is a good chance that it will also access other nearby addresses
sub $sp, $sp, 16
sw  $ra, 0($sp)
sw  $s0, 4($sp)
sw  $a0, 8($sp)
sw  $a1, 12($sp)
Every program exhibits spatial locality, because instructions are executed in sequence most of the time (however, branches might occur) - if we execute an instruction at memory location i, then we will probably also execute the next instruction, at memory location i+1
Code fragments such as loops exhibit both temporal and spatial locality
11
Spatial Locality in Data
Programs often access data that are stored in contiguous memory locations
— Arrays, like A[ ] in the code below, are always stored contiguously in memory – this layout is produced by the compiler
sum = 0;
for (i = 0; i < N; i++)
sum = sum + A[i];
Figure: memory hierarchy – main memory holding A[0]..A[7], L2 unified cache, L1 data and instruction caches, register file (RF) and the CPU
12
How caches take advantage of temporal locality
Every time the processor reads from an address in main memory, a copy of that datum is also stored in the cache
— The next time that same address is read, we can use the copy of the data in the cache instead of accessing the slower DDR
— So the first read is a little slower than before since it goes through both main memory and the cache, but subsequent reads are much faster
This takes advantage of temporal locality - commonly accessed data are stored in the faster cache memory
13
How caches take advantage of Spatial locality
When the CPU reads location i from main memory, a copy of that data is placed in the cache
But instead of just copying the contents of location i, it copies several values into the cache at once (cache line)
— If the CPU later does need to read from a location in that cache line, it can access that data from the cache and not the slower main memory, e.g., A[0] and A[3]
— For example, instead of loading just one array element at a time, the cache actually loads four/eight array elements at once
Again, the initial load incurs a performance penalty, but we’re gambling on spatial locality and the chance that the CPU will need the extra data
Figure: cache lines of 256 bits and 32-bit words – one L1 data cache line holds eight array elements (e.g., A[0]..A[7])
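To see the effect of spatial locality in code, compare the two loop orders below (an illustrative sketch, not from the original slides). C stores a 2-D array row by row, so the first loop reads consecutive addresses and uses every word of each cache line, while the second jumps a whole row between accesses and wastes most of every line it fetches.

    #include <stdio.h>
    #define N 1024
    static int A[N][N];                 /* C stores A row by row (row-major order) */

    int main(void)
    {
        long sum = 0;
        int i, j;

        /* Cache-friendly: consecutive iterations touch consecutive addresses,
           so all the words of each fetched cache line are used */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += A[i][j];

        /* Cache-unfriendly: each access jumps N*sizeof(int) bytes ahead,
           so only one word of every fetched cache line is used */
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += A[i][j];

        printf("%ld\n", sum);
        return 0;
    }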
14
Definitions: Hits and misses
A cache hit occurs if the cache contains the data that we’re looking for. Hits are desirable, because the cache can return the data much faster than main memory
A cache miss occurs if the cache does not contain the requested data. This is inefficient, since the CPU must then wait while it accesses the slower next level of memory
There are two basic measurements of cache performance
— The hit rate is the percentage of memory accesses that are handled by the cache
— The miss rate (1 - hit rate) is the percentage of accesses that must be handled by the slower lower level memory
Typical caches have a hit rate of 95% or higher, so in fact most memory accesses will be handled by the cache and will be dramatically faster
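As a back-of-the-envelope illustration (not on the original slide; the specific cycle counts are assumptions drawn from the access-time ranges quoted earlier), the average memory access time (AMAT) can be estimated as hit time plus miss rate times miss penalty:

    #include <stdio.h>

    int main(void)
    {
        double hit_time     = 4.0;    /* L1 access, CPU cycles (assumed)    */
        double miss_penalty = 150.0;  /* going all the way to DDR (assumed) */
        double hit_rate     = 0.95;   /* the typical hit rate quoted above  */

        /* AMAT = hit time + miss rate * miss penalty */
        double amat = hit_time + (1.0 - hit_rate) * miss_penalty;

        printf("AMAT = %.1f cycles\n", amat);   /* 4 + 0.05 * 150 = 11.5 cycles */
        return 0;
    }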
15
Important Questions
1. When we copy a block of data from main memory to the cache, where exactly should we put it?
2. How can we tell if a word is already in the cache, or if it has to be fetched from main memory first?
3. Eventually, the small cache memory might fill up. To load a new block from main memory, we'd have to replace one of the existing blocks in the cache... which one?
4. In a write request, are we going to write to all levels of the memory hierarchy or not?
16
A simple cache design
Caches are divided into blocks, which may be of various sizes
— The number of blocks in a cache memory is always a power of 2
— For now consider that each block contains just one byte (not true in practice). Of course this cannot take advantage of spatial locality
Here is an example of a cache with eight blocks, each holding one byte
Figure: a cache with block indices 000, 001, 010, 011, 100, 101, 110, 111, each block holding 8 bits of data
17
Where should we put data in the cache? (1)
A direct-mapped cache is the simplest approach: each main memory address maps to exactly one cache block
In the following figure
a 16-entry main memory and a 4-entry cache (four 1-entry blocks) are shown
Memory locations 0, 4, 8 and 12 all map to cache block 0
Addresses 1, 5, 9 and 13 map to cache block 1, etc
Figure: a 16-entry main memory (addresses 0-15) and a 4-entry cache (indices 0-3)
18
Where should we put data in the cache? (2)
One way to figure out which cache block a particular memory address should go to is to use the modulo (remainder) operator
— The modulo operation finds the remainder after division of one number by another
Let x be the block number in the cache, y be the block number in main memory (DDR), and n be the number of blocks in the cache; then the mapping is done with the equation
x = y mod n
For instance, with the four-block cache here, address 14 would map to cache block 2, since 14 mod 4 = 2
Figure: a 16-entry main memory (addresses 0-15) and a 4-entry cache (indices 0-3)
19
Where should we put data in the cache? (3)
An equivalent way to find the placement of a memory address in the cache is to look at the least significant k bits of the address
In a four-entry cache we would check the two least significant bits of our memory addresses
Again, you can check that address 14 (1110 in binary) maps to cache block 2 (10 in binary)
Taking the least significant k bits of a binary value is the same as computing that value mod n, where n = 2^k
Figure: 16 memory addresses (0000-1111) mapped onto cache indices 00, 01, 10, 11
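A small C sketch (a toy example assuming a cache of n = 4 blocks, i.e. k = 2 index bits) confirming that masking the least significant k bits gives the same block index as the modulo operation:

    #include <stdio.h>

    int main(void)
    {
        unsigned int n = 4;             /* number of cache blocks (a power of 2)       */
        unsigned int mask = n - 1;      /* n = 2^k, so n - 1 has the k lowest bits set */
        unsigned int addr;

        for (addr = 0; addr < 16; addr++) {
            unsigned int index_mod  = addr % n;      /* the mod trick                */
            unsigned int index_bits = addr & mask;   /* the least significant k bits */
            printf("address %2u -> block %u (mod) = %u (bits)\n",
                   addr, index_mod, index_bits);
        }
        return 0;
    }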
20
How can we find data in the cache?
How can we tell if a word is already in the cache, or if it has to be fetched from main memory first?
If we want to read memory address i, we can use the mod trick to determine which cache block would contain i
But other addresses might also map to the same cache block. How can we distinguish between them? For instance, cache block 2 could contain data from addresses 2, 6, 10 or 14
Figure: a 16-entry main memory (addresses 0-15) and a 4-entry cache (indices 0-3)
21
Adding tags
The solution is to add tags to the cache, which supply the rest of the address bits to let us distinguish between different memory locations that map to the same cache block
Figure: the 16 memory addresses (0000-1111) mapped onto a 4-entry cache with Index, Tag and Data fields
22
What's in the cache?
Now we can tell exactly which addresses of main memory are stored in the cache, by concatenating the cache block tags with the block indices

Index   Tag   Main memory address in cache block
00      00    00 + 00 = 0000
01      11    11 + 01 = 1101
10      01    01 + 10 = 0110
11      01    01 + 11 = 0111

'+' does not refer to addition here
23
The valid bit
Initially, the cache is empty and does not contain valid data, but trash
Thus, we add a valid bit for each cache block
— When the system is initialized, all the valid bits are set to 0
— When data is loaded into a particular cache block, the corresponding valid bit is set to 1

Index   Valid Bit   Tag   Main memory address in cache block
00      1           00    00 + 00 = 0000
01      0           11    Invalid
10      0           01    Invalid
11      1           01    01 + 11 = 0111
24
cache hit
Every memory has its own memory controller, a hardware mechanism responsible for finding the words in memory, loading/storing, etc.
When the CPU tries to read from memory, the address will be sent to the cache controller
— The lowest k bits of the address will index a block in the cache
— If the block is valid and the tag matches the upper (m - k) bits of the m-bit address, then that data will be sent to the CPU
Here is a diagram of a 32-bit memory address and a 2^10-byte cache
Figure: the 32-bit address is split into a 22-bit tag and a 10-bit index; the index selects one of the 1024 cache entries (Valid, Tag, Data), the stored tag is compared (=) with the address tag to produce the Hit signal, and the data is sent to the CPU
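The hit check in the diagram can also be written out as a short C sketch (the 10-bit index and 22-bit tag match the figure; the struct and function names are purely illustrative, not real hardware or any particular simulator):

    #include <stdint.h>
    #include <stdbool.h>

    #define INDEX_BITS 10
    #define NUM_BLOCKS (1 << INDEX_BITS)           /* 1024 entries, as in the figure */

    struct cache_block {
        bool     valid;
        uint32_t tag;                              /* upper 22 bits of the address         */
        uint8_t  data;                             /* one byte per block in this toy cache */
    };

    static struct cache_block cache[NUM_BLOCKS];

    /* Returns true on a cache hit and places the datum in *out */
    bool cache_read(uint32_t address, uint8_t *out)
    {
        uint32_t index = address & (NUM_BLOCKS - 1);   /* lowest k bits        */
        uint32_t tag   = address >> INDEX_BITS;        /* remaining upper bits */

        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data;                  /* hit: send the data to the CPU   */
            return true;
        }
        return false;                                  /* miss: fetch from the next level */
    }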
25
cache miss
In a two level memory hierarchy, L1 cache misses are somewhat expensive, but L2 cache misses are very expensive
However, the slower main memory accesses are inevitable on an L3 cache miss
The simplest thing to do is to stall the pipeline until the data from main memory can be fetched (and also copied into the cache)
26
Copying a block into the cache
After data is read from main memory, putting a copy of that data into the cache is straightforward
— The lowest k bits of the address specify a cache block
— The upper (m - k) address bits are stored in the block’s tag field
— The data from main memory is stored in the block’s data field
— The valid bit is set to 1
Figure: the 32-bit address is split into a 22-bit tag and a 10-bit index; the tag and the data fetched from main memory are written into the indexed cache entry and its valid bit is set to 1
27
What if the cache fills up?
Eventually, the small cache memory might fill up. To load a new block from DDR, we’d have to replace one of the existing blocks in the cache... which one?
— A miss causes a new block to be loaded into the cache, automatically overwriting any previously stored data
— Normally, the least recently used (LRU) replacement policy is used, which assumes that least recently used data are less likely to be requested than the most recently used ones
— So, on a cache miss, the cache throws out the cache line that has been unused for the longest time
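A minimal sketch of how an LRU victim could be chosen within one set (the 4 ways and the per-way time stamp are assumptions for illustration; real hardware usually uses cheaper approximations of LRU):

    #include <stdint.h>

    #define WAYS 4

    struct way {
        int      valid;
        uint32_t tag;
        uint64_t last_used;      /* time stamp of the most recent access */
    };

    /* Pick the victim way on a miss: an invalid way if one exists,
       otherwise the way that has been unused for the longest time (LRU) */
    int choose_victim(struct way set[WAYS])
    {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid)
                return w;
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        }
        return victim;
    }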
28
A more realistic direct mapped cache memory
Normally, one cache line contains 128/256 bits of data. In most programming languages, an integer uses 32 bits (4 bytes)
29
Associativity
The placement policy decides where in the cache a copy of a particular entry of main memory will go
So far, we have seen direct mapped caches only
each entry in main memory can go in just one place in the cache
Although direct mapped caches are simpler and cheaper, they are not performance efficient, as they cause a large number of cache misses
There are three types of caches regarding associativity
Direct mapped
N-way Associative
Fully associative
30
Associative Caches
• Block 12 placed in an 8-block cache:
– Fully associative: block 12 can go anywhere
– Direct mapped: block 12 can go only into block 4 (12 mod 8 = 4)
– 2-way set associative (like having two half-size direct mapped caches): block 12 can go into either block of set 0 (12 mod 4 = 0)
Figure: the three placements shown for an 8-block cache (block numbers 0-7) and a 32-block main memory (block-frame addresses 0-31)
31
Set-associative cache memories
A 4-way associative cache consists of four direct-mapped caches that work in parallel
Normally, L1 caches are 8-way associative, while L2/L3 caches are 16/24-way associative, i.e., 16/24 direct mapped caches working in parallel
A datum is found in one of the four caches by comparing the requested address against the address (tag) stored in each of the four caches
Set associative caches are the most used as they present the best compromise between cost and performance
32
The 4-way set-associative cache architecture
33
Cache misses
• Compulsory miss (or cold miss): first access to a block – Normally, compulsory misses are insignificant
• Capacity miss:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
• Conflict miss:
– Multiple memory locations are mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
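As an illustration of the conflict-miss case above (a hedged sketch: the 32 KB cache size is an assumption, and the actual placement of the arrays depends on the compiler and linker), two arrays laid out exactly one cache size apart map their corresponding elements to the same block of a direct-mapped cache and keep evicting each other, even though the cache is far from full:

    #include <stdio.h>

    #define CACHE_SIZE (32 * 1024)              /* assumed 32 KB direct-mapped cache */
    #define N (CACHE_SIZE / sizeof(int))

    /* If B happens to start exactly CACHE_SIZE bytes after A, then A[i] and B[i]
       map to the same cache block, so each access evicts the other array's line */
    static int A[N], B[N];

    int main(void)
    {
        long sum = 0;
        unsigned int i;
        for (i = 0; i < N; i++)
            sum += A[i] + B[i];                 /* ping-pongs between conflicting lines */
        printf("%ld\n", sum);
        return 0;
    }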
34
Write policies - In a write request, are we going to write to all levels of the memory hierarchy or not?
• Write through – Data are written to all memories in the memory hierarchy
• Disadvantage: Data are written into multiple memories on every write and thus it takes more time
• Write back – Data are written only to the L1 data cache; only when this block is replaced is it written to L2. If the block is later replaced from L2, it is written to main memory
• Requires a “dirty bit”
• More complex to implement, since the cache needs to track which of its locations have been written over, and mark them as dirty for later writing to the lower level memory
35
Write policies (2)
What happens on a write miss?
A decision needs to be made on write misses: whether or not the data should be loaded into the cache. This is defined by the following two approaches:
Write allocate : datum at the missed-write location is loaded to cache, followed by a write-hit operation. In this approach, write misses are similar to read misses
No-write allocate : datum at the missed-write location is not loaded to cache, and is written directly to DDR. In this approach, data are loaded into the cache on read misses only
Both write-through and write-back policies can use either of these write-miss policies, but usually they are paired in this way:
A write-back cache uses write allocate, hoping for subsequent writes (or even reads) to the same location, which is now cached
A write-through cache uses no-write allocate. Here, subsequent writes have no advantage, since they still need to be written directly to DDR
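A rough C sketch of the write-back / write-allocate path for one direct-mapped line (the struct, the helper functions and the 32-byte line size are illustrative assumptions, not any real cache controller's interface):

    #include <stdint.h>
    #include <stdbool.h>

    struct line {
        bool     valid;
        bool     dirty;        /* set when the cached copy differs from main memory */
        uint32_t tag;
        uint8_t  data[32];     /* one 32-byte cache line */
    };

    /* Hypothetical stubs standing in for the traffic to the next memory level */
    static void writeback_line(struct line *l)           { (void)l; }
    static void fetch_line(struct line *l, uint32_t tag) { (void)l; (void)tag; }

    /* Write-back, write-allocate policy (sketch) */
    void cache_write(struct line *l, uint32_t tag, unsigned offset, uint8_t value)
    {
        if (!(l->valid && l->tag == tag)) {     /* write miss                        */
            if (l->valid && l->dirty)
                writeback_line(l);              /* flush the old dirty line first    */
            fetch_line(l, tag);                 /* write allocate: load the new line */
            l->valid = true;
            l->tag   = tag;
            l->dirty = false;
        }
        l->data[offset] = value;                /* write only into the L1 data cache */
        l->dirty = true;                        /* the lower levels are now stale    */
    }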
36
Main Memory
Memory Allocation
37
Stack
There are 4 parts in main memory:
code, global variables, stack, and heap.
Stack is a block of memory that is used for function calls and local variables.
Stack size is fixed during compilation – cannot ask for more during run-time.
Actual data in the stack grows and shrinks as local variables and other pointers are added and removed accordingly.
Every function call is allocated a stack frame – a block of memory – required for holding function specific data. The size of this is determined during compile-time.
In x64, the stack frame size is a multiple of 16 bytes
Little-endian form is used
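A short C check (not from the slides) showing what little-endian form means in practice: the least significant byte of a multi-byte value sits at the lowest address, which is exactly what the debugger memory views later in this lecture show.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t value = 0x11223344;
        uint8_t *bytes = (uint8_t *)&value;    /* view the same 4 bytes one by one */

        /* On a little-endian x86/x64 machine this prints 44 33 22 11 */
        int i;
        for (i = 0; i < 4; i++)
            printf("byte at address %p = %02x\n", (void *)&bytes[i], bytes[i]);

        return 0;
    }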
38
Virtual and Physical Memory
A virtual address is a memory address that a process uses to access its own memory
The virtual address is not the same as the actual physical RAM address in which it is stored
When a process accesses a virtual address, the Memory Management Unit hardware translates the virtual address to physical
The OS determines the mapping from virtual address to physical address
Some of the benefits include
virtual space available is huge compared to physical memory
increased security due to memory isolation
39
How can I find the memory addresses of an array?
We can print the virtual memory addresses only
The hardware addresses are managed by the memory management unit
However, if you know how the cache works you can have a very good guess about where the data are stored
//print virtual memory addresses
for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
        printf("\nThe address of element (%d,%d) is %p", i, j, &A[i][j]);
40
Stack and Heap
Stack memory
The stack is a special area of RAM that can be used by functions as temporary storage...
To save register state.
For local variables.
For large input parameters / return values.
Stack works as Last In First Out (LIFO)
Heap memory
Dynamic allocation
41
Why Stack is needed?
Calling subroutines
Using registers in subroutines
Using local variables in subroutines with the same name as in others
less RAM memory is used, as these local variables are no longer needed once the function ends
Passing and returning large arguments to subroutines
42
Why Stack is needed?
Every function being called has its own space in stack
Every time a function is called, it stores into the stack the following:
the return address – thus the CPU knows where to return afterwards
its input operands (in x86-64, the first 6 operands are passed in registers)
its local variables – this way, we can use the registers in the subroutine without the caller's values being overwritten, e.g., consider using edx in both the caller and the callee
43
Stack related instructions and registers
Registers
ESP Stack pointer – points to the last element used in the stack.
Implicitly manipulated by instructions (e.g. push, pop, call, ret, etc.)
EBP Base pointer – used to reference function parameters and local variable within a stack frame.
Instructions
push M/L/R Pushes an M/L/R value on top of the stack (ESP - 4).
pop R Remove and restore an R value from the top (ESP + 4).
call label Calls a function with a label. This results in pushing the next instruction address onto the stack.
ret Returns to caller function – return value is usually stored in eax register.
44
Every time we call a function...
The operands of the called function are stored into rdi,rsi,rdx,rcx,r8,r9 and then into the stack (if more space is needed).
The return value is stored to rax (eax for 32bit - it is the same register)
Remember this when applying reverse engineering...
45
How Stack Works?
Let’s have a look at the 3rd question of this lab session ...
46
Caller and Callee – example (1)
In the C++ code below, main is the caller function, and addFunc is the callee function.
We will convert this into assembly and see how it works with the memory.
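The code itself appears on the original slide as an image; a minimal sketch consistent with the debugger walkthrough that follows (a function taking two int parameters a and b, called with the values 2 and 4) might look like the following. The exact variable names and literals are assumptions:

    #include <stdio.h>

    int addFunc(int a, int b)          /* callee */
    {
        int sum = a + b;
        return sum;
    }

    int main(void)                     /* caller */
    {
        int result = addFunc(2, 4);    /* the walkthrough later shows eax = 2 + 4 = 6 */
        printf("%d\n", result);
        return 0;
    }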
47
Caller and Callee -example (2)
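The assembly on this slide is also an image in the original; a hedged reconstruction in 32-bit x86 (Intel syntax, assuming the cdecl convention), consistent with the register and stack values in the walkthrough that follows, might look roughly like this. A real compiler's output will differ in detail:

    ; inside main (caller)
        push    b               ; push the value of the last parameter, b, first (LIFO)
        push    a               ; then the value of a
        call    addFunc         ; pushes the return address and jumps to addFunc
        add     esp, 8          ; afterwards the caller discards a and b

    addFunc:                    ; callee
        push    ebp             ; save the caller's EBP
        mov     ebp, esp        ; use EBP to address this stack frame
        mov     eax, [ebp+8]    ; load a (the saved EBP and return address sit below it)
        mov     ebx, [ebp+12]   ; load b
        add     eax, ebx        ; eax = a + b = 2 + 4 = 6, the return value
        pop     ebp             ; restore the caller's EBP
        ret                     ; pop the return address and resume in main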
48
Caller and Callee - example (3)
Figure: stack layout. The main() frame (between %ebp1 and %esp1) sits towards the bottom of the stack (higher addresses); the addFunc() frame (between %ebp2 and %esp2) sits below it and holds b, a, the return address (%rip) and the saved base pointer; addresses increase towards the bottom and the top of the stack is at the lowest address
49
Walking through the code...
We start the debugger
Before running line 12, the
ESP=0x00DEF8C0.
Each address is pointing to 4 bytes of memory.
The decreasing addresses have nothing in them to begin with.
Push the last parameter of the function first (LIFO)!
50
Walking through the code...
The value of b is pushed on the top of the stack.
Last used address is now ESP = 0x00DEF8C0 - 4 = 0x00DEF8BC.
Little Endian – lower bits on the lower addresses (see red highlighted line on the right).
51
Walking through the code...
The value of a is pushed on the top of the stack.
Last used address is now ESP = 0x00DEF8BC - 4 = 0x00DEF8B8.
Little Endian – lower bits on the lower addresses (see red highlighted line on the right).
52
Walking through the code...
What happened here? ⇒ We jumped into the function that we called: addFunc.
Last used address is now ESP = 0x00DEF8B8 - 4 = 0x00DEF8B4.
Why was that weird number pushed to the stack?
53
Walking through the code...
We pushed the address of the instruction that comes after the function call, so that we can resume normal operation after the function call has finished.
54
Walking through the code...
Now, we have saved the EBP state in the stack.
This is because we are going to re-write it in the next statement.
This will enable us to explicitly manipulate bits of memory.
Last used address is now ESP = 0x00DEF8B4 - 4 = 0x00DEF8B0.
55
Walking through the code...
We copied the ESP into the EBP register.
Within addFunc, we will use the EBP register to access the stack rather than ESP.
We want to access the a and b values from the stack next. Where are they?
ESP is still 0x00DEF8B0 – copying ESP into EBP does not change the stack pointer.
56
Walking through the code...
Currently ESP = 0x00DEF8B0.
a is in address 0x00DEF8B8 = ESP + 8
b is in address 0x00DEF8BC = ESP + 12
Why? ⇒ Because after a and b were pushed, we also pushed the return address (the instruction following the addFunc call) and then the old EBP value
57
Walking through the code...
We perform three steps now to add a and b.
The stack is unaltered – we only changed the general purpose registers eax and ebx, using ebp to access the values from the stack.
The resulting sum is in eax = 2 + 4 = 6
58
Walking through the code...
We have now restored EBP to its original state so that the caller function does not see any changes to it, and can therefore use it as if nothing happened.
Pressing next now returns to the next instruction in the main function.
Note that ESP has changed: the saved EBP has been “removed” from the stack (i.e. ESP is no longer tracking that piece of memory).
59
Walking through the code...
After addFunc returns, we reset the ESP to the original starting point by adding 8 (i.e. discarding the places where we put a and b).
The values are still there in the stack, but will be overwritten if we have other functions using the stack.
Thank you
School of Computing (University of Plymouth)
Date
29/10/2019