KIT308/408 (Advanced) Multicore Architecture and Programming
Memory Architecture
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
Purpose of Lecture
Finish talking about cache
Expand our discussion about cache to include
Virtual Memory
The problems with sharing values between caches in multicore systems
View the entire memory hierarchy
These topics are important
For allowing us to share values in multicore programs
When we start to look at other efficiency concerns
Why OO can be bad
Writing code that accesses memory efficiently
Cache
Refresher: Cache
Cache conceptually lies between CPU and RAM
Interface between CPU and cache is much faster than memory bus
Most modern systems have two or three levels of cache
Level 1 (L1), level 2 (L2), and level 3 (L3)
Multiple caches complicate the design, but give a speed increase
L1 cache is faster than L2 cache, but L2 cache is much larger
L1 for very recently used locations
L2 for just recently used locations
And likewise for L2 versus L3
Some modern RAM has its own cache which can be thought of as L4 cache
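As a rough illustration of what any one of these levels does on each access, here is a minimal C++ sketch; it assumes a direct-mapped design with 64-byte lines (real L1/L2/L3 caches are set-associative and their sizes vary by CPU), and previews the tag/hit/miss mechanics covered on the next slide:

#include <cstddef>
#include <cstdint>
#include <vector>

// Toy direct-mapped cache: the address's middle bits pick a line, and the
// upper bits (the tag) record which block currently occupies that line.
struct CacheLine { bool valid = false; uint64_t tag = 0; };

struct ToyCache {
    static const uint64_t LINE_SIZE = 64;   // bytes per line (assumed)
    std::vector<CacheLine> lines;
    explicit ToyCache(std::size_t num_lines) : lines(num_lines) {}

    bool access(uint64_t address) {          // returns true on a hit
        uint64_t    block = address / LINE_SIZE;
        std::size_t index = block % lines.size();
        uint64_t    tag   = block / lines.size();
        if (lines[index].valid && lines[index].tag == tag)
            return true;                     // hit: served without touching RAM
        lines[index].valid = true;           // miss: fetch the block from the
        lines[index].tag   = tag;            // next level down, evicting the old line
        return false;
    }
};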
Locality of Reference
Cache operates under the assumption that if a memory location is referenced, it is likely to be referenced again some time soon after
Often nearby memory locations are also likely to be referenced
Programs that exhibit this property are said to have strong locality of reference
For programs that do not, cache is of no help
Cache can actually make the situation worse due to the extra overhead involved
Cache stores recently used memory locations (and some nearby ones) in a buffer
RAM used for the buffer is much faster to access than main store
The buffer must also store where each value came from (its address tag)
The CPU requests memory locations from the cache, not directly from main store
If the value is in the cache (a hit), it is returned straight away
If not (a miss), the cache reads it from main store and adds it to the buffer
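To make the locality assumption concrete, here is a hedged C++ example: both functions compute the same sum over a matrix stored flat in row-major order (an assumption of this sketch), but the first visits consecutive addresses while the second jumps a whole row between accesses:

#include <cstddef>
#include <vector>

// Same arithmetic, very different cache behaviour once the
// matrix outgrows the cache.
double sum_row_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];   // consecutive addresses: strong locality
    return s;
}

double sum_col_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];   // jumps cols * 8 bytes per access: weak locality
    return s;
}

On a large matrix the row-major version is typically several times faster, purely from cache behaviour.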
Cache Reads / Writes
The CPU requests data from the cache in “words”
e.g. 4 bytes, 8 bytes, 16 bytes, etc.
Due to the locality of reference assumption, cache receives values from main store in blocks that are larger than words
e.g. 64 bytes, 128 bytes, etc.
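These block-sized fetches are why access stride matters. A small sketch, assuming 64-byte cache lines and 4-byte ints (both typical but not universal):

#include <cstddef>
#include <vector>

// With 64-byte lines and 4-byte ints, one miss pulls in 16 ints.
long sum_all(const std::vector<int>& a) {
    long s = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += a[i];          // roughly 1 miss per 16 accesses
    return s;
}

long sum_one_per_line(const std::vector<int>& a) {
    long s = 0;
    for (std::size_t i = 0; i < a.size(); i += 16)
        s += a[i];          // every access lands on a fresh line: ~1 miss each
    return s;
}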
Cache on Multicores
Here each core has its own L1 and L2 cache
But other organisations are possible
e.g. own L1, L2 shared between 2 cores, L3 shared between 8
e.g. shared L1 cache between 2 cores
1. https://www.extremetech.com/wp-content/uploads/2014/09/Haswell-Labeled.jpg (Haswell-E)
Memory Management Unit
More Memory Complexity: The MMU
Many processors contain a Memory Management Unit (MMU) to provide hardware support for other features of memory
Virtual memory is a technique for making main store larger
Main memory is too small for many applications, so a portion of the hard drive is used as extra “RAM”
Memory segmentation
Generally programs should be given their own “safe” piece of memory to call home
Anti-fragmentation features
To avoid memory locations becoming unusable
Which can happen when dynamically allocating and releasing memory, or when beginning, ending, or swapping many processes
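These MMU features are visible from user code through OS calls. A minimal POSIX sketch (assuming Linux or a similar system; MAP_ANONYMOUS is not in strict POSIX) that asks for a page and then changes its protection, which the MMU enforces on every access:

#include <cstdio>
#include <sys/mman.h>

int main() {
    // Ask the OS for one page of virtual memory. The protection bits are
    // checked by the MMU on every access, at no cost to correct code.
    void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    static_cast<char*>(p)[0] = 42;      // fine: the page is writable

    mprotect(p, 4096, PROT_READ);       // revoke write permission
    // static_cast<char*>(p)[0] = 43;   // would now fault (SIGSEGV)

    munmap(p, 4096);
    return 0;
}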
Virtual Memory
[Diagram: the CPU (registers, ALU, control unit, status registers, cache) with an MMU and TLB on the path to the memory interface]
Virtual memory allows secondary storage to act as extra (albeit slower) main store
The virtual memory space is divided into pages
Memory is divided into frames of the same size as pages
Only some of the pages are kept in memory
Virtual memory is user transparent
Programs merely see a large memory
CPU and OS decide which parts of the virtual memory space to keep in RAM
When pages are moved in and out of memory it is referred to as “paging in” and “paging out”
1. https://i.ytimg.com/vi/ICnr87A5m7c/hqdefault.jpg
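The page size is fixed by the CPU and OS rather than by the program; a short POSIX sketch to query it (4 KiB is typical on x86-64, but this is platform dependent):

#include <cstdio>
#include <unistd.h>

int main() {
    long page_bytes = sysconf(_SC_PAGESIZE);    // POSIX page-size query
    printf("page size: %ld bytes\n", page_bytes);
    return 0;
}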
Virtual Memory
Every time a memory location is accessed, the given virtual address must be translated into a real physical address
Enough information needs to be stored to perform this conversion, and to do so quickly
This can actually be quite a slow process, so most processors cache the results of recent translations in the Translation Lookaside Buffer (TLB)
If the page that contains the physical address is in memory (in a frame), then we have a page hit (otherwise it is a page miss)
1. http://www.pling.org.uk/cs/opsimg/pagetable.png
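The translation itself splits the virtual address by the page size. A small worked sketch in C++ (the address and the 4 KiB page size are illustrative assumptions):

#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t PAGE_SIZE = 4096;        // assumed 4 KiB pages
    std::uint64_t vaddr = 0x7ffe12345678;        // an arbitrary example address

    std::uint64_t vpn    = vaddr / PAGE_SIZE;    // virtual page number
    std::uint64_t offset = vaddr % PAGE_SIZE;    // position within the page

    // Translation looks vpn up in the TLB (or, on a TLB miss, the page
    // table) to find the frame number; the offset is carried over unchanged:
    //   physical address = frame_number * PAGE_SIZE + offset
    printf("vpn = %llu, offset = %llu\n",
           (unsigned long long)vpn, (unsigned long long)offset);
    return 0;
}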
Instruction Cycle
The use of virtual memory explains why the highlighted steps in the instruction cycle diagram would take a lot of time
Memory Hierarchy
Registers are faster, more expensive, and smaller in capacity than the L1 cache
The L1 cache is faster, more expensive, and smaller in capacity than the L2 cache
…
Main memory (RAM) is faster, more expensive, and smaller in capacity than secondary storage (HDD, SSD)
1. http://ieeexplore.ieee.org/mediastore/IEEE/content/media/69/7116676/7097722/ooi3-2427795-large.gif
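One way to see the hierarchy from software is a pointer-chasing microbenchmark: each load depends on the previous one, so the prefetcher cannot help and the average time per access reflects whichever level the working set fits in. A sketch (sizes and timings will vary by machine, and OS noise can blur the steps):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

double ns_per_access(std::size_t n) {
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i < n; ++i) next[i] = i;

    // Sattolo's algorithm: permute into one cycle visiting every element
    std::mt19937_64 rng(42);
    for (std::size_t i = n - 1; i > 0; --i)
        std::swap(next[i], next[rng() % i]);

    std::size_t p = 0;
    const std::size_t hops = 20000000;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t h = 0; h < hops; ++h)
        p = next[p];                    // each load depends on the last
    auto t1 = std::chrono::steady_clock::now();

    if (p == n) printf("?");            // keep p live so the loop survives -O2
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
}

int main() {
    // Working sets from 32 KiB (L1-sized) up to 128 MiB (RAM-bound)
    for (std::size_t kb = 32; kb <= 128 * 1024; kb *= 4)
        printf("%8zu KiB: %6.2f ns/access\n",
               kb, ns_per_access(kb * 1024 / sizeof(std::size_t)));
    return 0;
}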
Worst Case Memory Read
When reading from memory to store a value in a register, what’s the worst that could happen?
It’s not in the L1 cache
It’s not in the L2/L3/L4 cache either
Its physical address isn’t in the TLB
Costly translation from virtual to physical address
The page containing the virtual address isn’t in memory
Now what happens?
Read an entire page from disk into memory
Copy cache line containing value into L4/L3/L2/L1 caches
If the location where the cache line is placed contains dirty (modified) values, they need to be written back to memory first
Copy value into register
Cache Coherency
Cache Coherency
When there is more than one CPU or core, each with its own cache, the caches need to be kept consistent
If one updates a cache line, any other caches storing that line need to update their copies
1. https://en.wikipedia.org/wiki/Cache_coherence#/media/File:Cache_Coherency_Generic.png
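This invalidation traffic can be provoked deliberately from software. A C++ sketch of “false sharing”: the two threads never touch each other’s counter, yet with the unpadded layout the shared cache line ping-pongs between the cores’ caches (the 64-byte line size in alignas is an assumption; std::hardware_destructive_interference_size is the portable spelling):

#include <atomic>
#include <cstdio>
#include <thread>

// Both layouts are race-free: each thread only ever touches its own counter.
struct Plain  { std::atomic<long> a{0}; std::atomic<long> b{0}; };  // one shared line
struct Padded { alignas(64) std::atomic<long> a{0};                 // own line
                alignas(64) std::atomic<long> b{0}; };              // own line

int main() {
    Padded c;   // switch to Plain to watch the same code run several times slower
    std::thread t1([&] { for (long i = 0; i < 50000000; ++i)
                             c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < 50000000; ++i)
                             c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    printf("%ld %ld\n", c.a.load(), c.b.load());
    return 0;
}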
Multicore Cache Coherency
As the number of cores increases, cache coherency gets more and more difficult
e.g. what happens if Core P#4 updates a value in its L1 data cache?
1. https://en.wikipedia.org/wiki/Memory_hierarchy#/media/File:Hwloc.png
Snooping Coherence
The individual caches monitor address lines for writes to memory locations they have cached
When a write is observed to a location the cache has a copy of, the cache controller invalidates its own copy
Next time that core wishes to access that location it must be reloaded
Snooping is fast, but isn’t scalable (although it is fine with, say, fewer than 16 processors)
Every request needs to be broadcast to all caches
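A minimal C++ sketch of the invalidate-on-write rule (not a full protocol such as MESI): note the loop over every other cache, which is the broadcast that limits scaling:

#include <vector>

enum class State { Modified, Shared, Invalid };
struct Line { State state = State::Invalid; };

// On a write, the request goes out on the shared bus and every other cache
// holding the line drops its copy; the next read by those cores misses and
// reloads the fresh value.
void bus_write(std::vector<Line*>& snoopers, Line* writer) {
    for (Line* line : snoopers)
        if (line != writer && line->state != State::Invalid)
            line->state = State::Invalid;  // stale copy discarded
    writer->state = State::Modified;       // writer now holds the only valid copy
}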
Directory-Based Coherence
Shared data is placed in a common directory that maintains coherence between caches
The directory acts as a filter through which the processor must ask permission to load an entry from primary memory to its cache
When an entry is changed the directory either updates or invalidates the other caches with that entry
Next time they access that entry, they need to reload those values
Scales well, so is used on large systems
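A C++ sketch of what a directory entry might track; the bit-vector-of-sharers layout is one common scheme, assumed here (and it caps this sketch at 64 caches):

#include <cstdint>

// One entry per memory block: a presence bit for each cache plus a dirty
// flag. On a write the directory messages only the caches whose bits are
// set, rather than broadcasting to all of them.
struct DirectoryEntry {
    std::uint64_t sharers = 0;   // bit i set means cache i holds this block
    bool dirty = false;

    void on_read(int cache_id)  { sharers |= (std::uint64_t{1} << cache_id); }

    void on_write(int cache_id) {
        // send point-to-point invalidations to every other sharer here
        sharers = (std::uint64_t{1} << cache_id);
        dirty = true;
    }
};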