System Software
HPC ARCHITECTURES
System Software and Alternative Architectures
Adrian Jackson
Operating systems
• The operating system is the basic system software
which manages the way applications run on the
hardware.
• Most HPC systems use some flavour of Linux
• Some proprietary versions of Unix still in use
• AIX on IBM
• Many manufacturers now using some version of
Linux
• also the OS of choice for build-your-own clusters
• An HPC version of Windows (Windows Compute
Cluster Server) does exist but isn’t widely used.
Processes
• A process is an instance of a running program, together with
its associated data and other resources.
• Many processes may be active on a machine
• each application may consist of one or more processes
• OS itself also consists of a number of processes
• On a single CPU (without hardware multithreading) only
one process can be running at a given time
• processes must take turns using the CPU
• OS is able to interrupt a running process, and give the CPU
to another process
• mechanism for doing this is in the hardware
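A minimal sketch of this on a POSIX system: fork() creates a second process that is a separate instance of the same program, and each process gets its own copy of the program’s data.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    int x = 42;                 /* each process gets its own copy of x */
    pid_t pid = fork();         /* duplicate the current process */
    if (pid == 0) {
        x = 0;                  /* invisible to the parent: separate memory */
        printf("child:  pid=%d x=%d\n", getpid(), x);
    } else {
        wait(NULL);             /* let the child finish first */
        printf("parent: pid=%d x=%d\n", getpid(), x);
    }
    return 0;
}
```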
Scheduling processes
• When a process is interrupted, the OS has to decide which
process should get the next turn on the CPU.
• this is determined by the scheduling policy
• Process will run for a given length of time before being
interrupted (time-slicing)
• implemented by interrupting CPU at regular intervals
• every 100th of a second is typical
• Next process to run should have something useful to do.
• no use scheduling a process which is waiting (e.g. for disk access)
• Most OSs assign a priority to each process
• high priority processes more likely to get access to CPU
• priorities may change over time
• The act of interrupting one process and giving the CPU to
another is called a context switch
• Some cost in time associated with context switching
• need to save state of old process
• when new process starts running, the caches are full of the old
process’ data
• frequent context switching is bad for application performance
• Many HPC systems run commercial OSs
• scheduling policy is designed for general purpose workloads
• may not be what is required for HPC
• e.g. not designed to minimise application wall-clock time!
• Some HPC systems run a very lightweight OS
• may not implement context switching at all
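A small sketch of the priority mechanism described above: the POSIX nice()/getpriority() calls let a process lower its own priority, making the scheduler less likely to give it the CPU.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void) {
    /* 0 is the default nice value; a higher value means lower priority */
    int before = getpriority(PRIO_PROCESS, 0);
    nice(10);                               /* ask to be deprioritised */
    int after = getpriority(PRIO_PROCESS, 0);
    printf("nice value: %d -> %d\n", before, after);
    return 0;
}
```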
Multiprocessors
• In a distributed memory multiprocessor, each node runs a
separate copy of the operating system
• each copy is unaware of processes on other nodes
• Processes cannot migrate between nodes
• once a process is started on a node, it runs there until it finishes
• In a shared memory multiprocessor, one copy of the
operating system manages all the CPUs
• Processes can migrate between CPUs
• scheduler will try not to do this unless it’s necessary
• memory is shared, so no need to move any data
• may be possible to restrict a process to run on a subset of CPUs
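A sketch of that last point on Linux, using the Linux-specific sched_setaffinity() call to pin the calling process to CPUs 0 and 1:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);           /* allow CPU 0 ... */
    CPU_SET(1, &set);           /* ... and CPU 1 only */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("process restricted to CPUs 0 and 1\n");
    return 0;
}
```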
Process Placement: Distributed Memory
[Figure: in a distributed memory machine each node runs its own copy of the OS; user processes (P) are started on individual nodes and remain there]
Process Placement: Shared Memory
[Figure: in a shared memory machine a single OS manages all the CPUs; user processes can be scheduled on any of them]
Threads
• Classical process has one thread of execution (i.e. one
sequence of instructions).
• Instead, a process may contain multiple threads
• All the threads in a process have access to the same
memory address space
• useful for both single and multiprocessor systems
• makes programming GUIs easier, for example
• context switch is (a bit) cheaper for threads than processes
• OS scheduling policy is aware of threads
• In a shared memory system it is possible for multiple
threads belonging to the same process to be scheduled
on different CPUs
• Details of the relationship between threads and
processes vary between different OSs
• not important to the programmer
• Message passing programs are usually
implemented as multiple processes
• each task is a process
• may be additional threads on each process to handle
communications
• Shared variable programs are usually implemented as
a single process with multiple threads that can be
scheduled on different CPUs
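A minimal pthreads sketch of threads sharing one address space: both threads update the same global counter (under a lock, since shared data needs synchronisation) and may be scheduled on different CPUs by the OS.

```c
#include <stdio.h>
#include <pthread.h>

static int counter = 0;                       /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);            /* protect the shared counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);    /* threads may run on different CPUs */
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);        /* 200000: both saw the same memory */
    return 0;
}
```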
Virtual memory
• Allows memory and disk to be a seamless whole
• processes can use more data than will fit into physical memory
• Allows multiple processes to share physical memory
• Can think of main memory as a cache of what is on disk
• Virtual address space divided into chunks, called pages
• typical size of a page is 4KB to 64KB
• total size of address space can be very large: 2^64 bytes
• When a page is accessed, it is allocated space in the real
physical memory
• a not-recently-used page is copied out to disk to make space for it
• this is an expensive operation: an application may run very slowly if
this occurs frequently
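A small sketch of these ideas on Linux (assuming the default overcommit behaviour): the page size can be queried with sysconf(), and a very large allocation can succeed because it initially reserves only virtual address space; a physical page is allocated when first touched.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);            /* typically 4096 bytes */
    printf("page size: %ld bytes\n", page);

    size_t n = (size_t)64 * 1024 * 1024 * 1024;   /* 64 GB of virtual space */
    char *p = malloc(n);                          /* reserves address space only */
    if (p != NULL) {
        p[0] = 1;                                 /* touching a page backs it with real memory */
        printf("64 GB of virtual address space reserved\n");
        free(p);
    }
    return 0;
}
```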
Batch systems
• Many HPC installations don’t allow users to have interactive access to most of
the machine.
• Users log in to a small subset of processors, or to a separate machine entirely
• this machine is often called a front end
• used for running Unix commands, editing files, compiling code etc.
• Most or all of the processors are only accessible in batch mode
• only run application codes
• not interactive
• these processors are called the back end
• File system usually accessible from both front and back ends
• back end processors may have temporary disk space for use by running
applications
• On a distributed memory machine, the batch system knows how many
processors there are
• keeps a count of free processors
• only starts batch jobs on free processors
• On a shared memory system, the mapping of processes and threads to
processors is dynamic and controlled by the OS scheduling policy
• batch system tries to limit the number of active processes/threads so that
there is at least one processor per thread
• can be hard to prevent users cheating: requesting a number of processors,
and then starting more than this number of threads, for example.
• varying degrees of integration between batch systems and OSs
• some batch systems may co-operate with the OS to restrict the
processes/threads in a job to a dedicated subset of processors
Roadrunner
• IBM machine
• 1.7 Pflop/s peak
• 1.026 Pflop/s achieved
• First petaflop sustained machine
• Hybrid design
• AMD Opteron 2210
• 1.8 GHz
• Dual-core
• 6,480 processors, 12,960 cores
• IBM PowerXCell 8i
• 3.2 GHz
• 1 PPE and 8 SPEs
• 12,960 processors (12,960 PPE cores and 103,680 SPE cores, total of
116,640 cores)
Cell
• Famous cross-over processor
• Designed for PS3 and other media apps
• Partnership between Sony, Toshiba, and IBM
• IBM key developer
• Cost-performance very attractive
• Mass production for games console
• ~256 GFlop/s on chip (26 GFlop/s double precision)
• 1 PPE – PowerPC, 3.2 GHz
• Dual-threaded general-purpose processor
• 8 SPEs – Synergistic Processing Elements
• SIMD, 128 128-bit vector registers, local memory, in-
order
• EIB – Element Interconnect Bus
• 25.6 GB/s per SPE, total > 200 GB/s
Cell overview
[Diagram: the PPU and eight SPUs connected by the EIB]
Scientific Cell
• Identified as co-processor for scientific applications
• PowerXCell 8i
• Variant for double precision
• 102 GFlop/s DP
• 32 GB memory supported
• SPEs can be used independently or in concert
• Suited to vector style operations
• Blades with 2 Cell processors
• Cell is dead
• IBM has no further development plans
K Computer
• 88,128 nodes
• 2.0 GHz 8-core SPARC64 VIIIfx
• 128 GFlop/s
• 16 GB of memory
• 705,024 cores
• Tofu network
– 6D torus
– 10 links per node
– 10 GB/s per link
– 100 GB/s of bandwidth per node
K Computer
• 8 FP ops, or 4 FMA ops, per cycle
• Registers:
• 192 integer
• 256 floating point
• L1 (I/D) cache: 32KB/32KB, 2-way set associative, 128-byte lines
• L2 cache: 6MB, 10-way set associative, 128-byte lines
K Computer – Network
• Multi-scale network
• Remote: 6 links
• Scalable xyz 3D torus
• Local: 4 links
• Fixed size 3D mesh
• connecting 12 nodes
• Total topology
• 6D mesh/torus
• Multi-path routing to avoid
faults
• High Performance and Operability
• Low hop-count (average hop count is
about ½ of conventional 3D torus)
• The 3D Torus/Mesh view is always
provided to an application even when
meshes are divided into arbitrary sizes
• No interference between jobs
• Fault tolerance
• 12 possible alternate paths are used to
bypass faulty nodes
• Redundant nodes can be assigned while
preserving the torus topology
MD specific machines
• ANTON
• MD-GRAPE
• Anton ASIC
• high-throughput interaction subsystem
(HTIS)
• 32 deeply pipelined modules, 800 MHz
• flexible subsystem
• four general-purpose Tensilica cores (each
with cache and scratchpad memory) and
eight specialized but programmable SIMD
geometry cores, 400 MHz.
• 3D torus, 6 inter-node links, total in+out
bandwidth of 607.2 Gbit/s; per-hop
latency is 50 ns
http://en.wikipedia.org/wiki/Tensilica
http://en.wikipedia.org/wiki/SIMD
Future Architectures
• Exascale and beyond
• Accelerator based
• 3D chip stacking, new memory forms, etc…
• Very large numbers of processing units
• Power and cooling concerns
• Data movement high cost
PACS-G: a straw man architecture
• SIMD architecture, for compute oriented apps (N-body, MD), and stencil
apps.
• 4096 cores (64×64), 2 FMA @ 1 GHz
• 2D mesh (+ broadcast/reduction) on-chip network for stencil apps.
• Chip die size: 20mm x 20mm
– Works mainly out of on-chip memory
(512 MB/chip, 128 KB/core),
supplemented by module memory built
from 3D-stacked/Wide-IO DRAM:
bandwidth 1000 GB/s, size 16–32
GB/chip
• No external memory (DIMM/DDR)
• 250 W/chip expected
• 64K chips for 1 EFLOPS (at peak)
PACS-G
• A group of 1024~2048 chips is connected
via an accelerator (inter-chip) network
• 25–50 Gbps per inter-chip link: extending the
on-chip 2D mesh to an external 2D-mesh
network within a group requires
200~400 GB/s (= 32 ch. × 25~50 Gbps × 2,
bidirectional)
A64FX
• Instruction set architecture: Armv8.2-A SVE (512-bit wide SIMD)
• Number of cores: 48 computing cores + 4 assistant cores
• Memory: 32 GiB (HBM2)
• Process technology: 7 nm FinFET
• Number of transistors: about 8.7 billion
• Peak performance:
• Double precision (64-bit) floating point: over 2.7 TFLOPS (DGEMM execution efficiency over 90%)
• Single precision (32-bit) floating point: over 5.4 TFLOPS
• Half precision (16-bit) floating point / 16-bit integer: over 10.8 TFLOPS / 10.8 TOPS
• 8-bit integer: over 21.6 TOPS
• Peak memory bandwidth: 1024 GB/s (STREAM Triad(3) execution efficiency over 80%)
https://www.fujitsu.com/global/about/resources/news/press-releases/2018/0822-02.html
https://www.fujitsu.com/global/about/resources/news/press-releases/2018/0822-02.html#3
High Bandwidth Memory
• Two levels of memory for KNL
• Main memory
• KNL has direct access to all of main memory
• Similar latency/bandwidth to a standard processor
• 6 DDR channels
• MCDRAM
• High bandwidth memory on chip: 16 GB
• Slightly higher latency than main memory (~10% slower)
• 8 MCDRAM controllers/16 channels
Memory Modes
• Cache mode
• MCDRAM cache for DRAM
• Only DRAM address space
• Done in hardware (applications don’t need to be modified)
• Misses more expensive (DRAM and MCDRAM access)
• Flat mode
• MCDRAM and DRAM are both available
• MCDRAM is just memory, in same address space
• Software managed (applications must place data themselves; see the sketch below)
• Hybrid – Part cache/part memory
• 25% or 50% cache
[Diagrams: cache mode – processor, MCDRAM as cache, then DRAM; flat mode – processor with MCDRAM and DRAM side by side in one address space]
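A sketch of using flat mode, assuming the memkind library’s hbwmalloc interface (one common way to do this on KNL; link with -lmemkind): the application explicitly places a bandwidth-critical array in MCDRAM, falling back to DRAM if no high-bandwidth memory is present.

```c
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>             /* memkind's high-bandwidth-memory interface */

int main(void) {
    size_t n = 1 << 20;
    int have_hbm = (hbw_check_available() == 0);  /* 0 means MCDRAM is usable */
    double *a = have_hbm ? hbw_malloc(n * sizeof(double))   /* place in MCDRAM */
                         : malloc(n * sizeof(double));      /* fall back to DRAM */
    if (a == NULL) return 1;

    for (size_t i = 0; i < n; i++) a[i] = 0.0;    /* bandwidth-critical work here */

    if (have_hbm) hbw_free(a); else free(a);
    return 0;
}
```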
Non-volatile memory
• Non-volatile RAM
• 3D XPoint technology
• STT-RAM
• Much larger capacity than DRAM
• Hosted in the DRAM slots, controlled by a standard memory
controller
• Slower than DRAM by a small factor, but significantly
faster than SSDs
• STT-RAM
• Read fast and low energy
• Write slow and high energy
• Trade off between durability and performance
• Can sacrifice data persistence for faster writes
SRAM vs NVRAM
• SRAM used for cache
• High performance but costly
• Die area
• Energy leakage/refresh
• NVRAM technologies offer
• Much smaller implementation area
• No refresh/ no/low energy leakage
NVRAM
• The “memory” usage model allows for the
extension of the main memory
• The data is volatile, like normal DRAM-based main memory
• The “storage” usage model which supports the use
of NVRAM like a classic block device
• E.g. like a very fast SSD
• The “application direct” usage model maps
persistent storage from the NVRAM directly into
the main memory address space
• Direct CPU load/store instructions for persistent main
memory regions
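A sketch of the application direct model, assuming the NVRAM is exposed as a file on a DAX-capable filesystem (the path below is hypothetical): the region is mapped into the address space and then accessed with ordinary load/store instructions.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;
    /* hypothetical persistent-memory mount point */
    int fd = open("/mnt/pmem0/data", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "persistent hello");   /* a plain store into persistent memory */
    msync(p, len, MS_SYNC);          /* make sure the data is durable */

    munmap(p, len);
    close(fd);
    return 0;
}
```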
Using distributed storage
• Without changing applications
• Large memory space/in-memory database etc…
• Local filesystem
• Users manage data themselves
• No global data access/namespace, large number of files
• Still require global filesystem for persistence
[Diagram: compute nodes, each with memory and node-local /tmp storage, connected by the network to the global filesystem]
Using distributed storage
• Without changing applications
• Filesystem buffer
• Pre-load data into NVRAM from filesystem
• Use NVRAM for I/O and write data back to the filesystem at the end (sketched below)
• Requires systemware to preload and postmove data
• Uses filesystem as namespace manager
[Diagram: compute nodes, each with memory and an NVRAM buffer, connected by the network to the global filesystem]
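A sketch of this buffering pattern (all paths hypothetical): stage data in from the global filesystem to the node-local buffer, do the run’s I/O against the local copy, then stage results back out at the end.

```c
#include <stdio.h>

/* copy src to dst with a simple read/write loop; returns 0 on success */
static int copy_file(const char *src, const char *dst) {
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }
    char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
    return 0;
}

int main(void) {
    /* stage in: global filesystem -> node-local NVRAM buffer */
    copy_file("/global/input.dat", "/nvram/input.dat");
    /* ... the application reads and writes under /nvram during the run ... */
    /* stage out: results back to the persistent global filesystem */
    copy_file("/nvram/output.dat", "/global/output.dat");
    return 0;
}
```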
Using distributed storage
• Without changing applications
• Global filesystem
• Requires functionality to create and tear down global filesystems for
individual jobs
• Requires filesystem that works across nodes
• Requires functionality to preload and postmove filesystems
• Need to be able to support multiple filesystems across system
[Diagram: a job-specific global filesystem built from the NVRAM across the nodes, alongside the external filesystem]
Using distributed storage
• With changes to applications
• Object store
• Needs same functionality as global filesystem
• Removes need for POSIX, or POSIX-like functionality
[Diagram: an object store built from the NVRAM across the nodes, alongside the external filesystem]
Using distributed storage
• Without changing applications
• Automatic checkpointing
• Resiliency
• Local checkpointing without hitting the filesystem (sketched below)
• Pause and restart
• Just-in-time scheduling/high priority jobs
• Waiting for something else to happen…
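A sketch of the local checkpointing idea (the NVRAM path is hypothetical): application state is written periodically to fast node-local storage instead of the global filesystem, so taking a checkpoint is cheap.

```c
#include <stdio.h>

/* write the state array to a checkpoint file on node-local NVRAM */
static int checkpoint(const double *state, size_t n, int step) {
    char path[256];
    snprintf(path, sizeof path, "/nvram/ckpt_%d.bin", step);  /* hypothetical path */
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fwrite(state, sizeof(double), n, f);
    fclose(f);
    return 0;
}

int main(void) {
    double state[1000] = {0};
    for (int step = 0; step < 100; step++) {
        /* ... compute one timestep, updating state ... */
        if (step % 10 == 0)
            checkpoint(state, 1000, step);   /* periodic, local, cheap */
    }
    return 0;
}
```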
Using distributed storage
• New usage models
• Resident data sets
• Sharing preloaded data across a range of jobs
• Data analytic workflows
• How to control access/authorisation/security etc.?
• Workflows
• Producer-consumer model
• Remove filesystem from intermediate stages
[Diagram: jobs 1–4 in a producer-consumer workflow, passing data directly rather than through the filesystem]
Using distributed storage
• Workflows
• How to enable different sized applications?
• How to schedule these jobs fairly?
• How to enable secure access?
[Diagram: jobs of different sizes scheduled together in a workflow, sharing data across nodes]
The Challenge of distributed storage
• Enabling all the use cases in multi-user, multi-job
environment is the real challenge
• Heterogeneous scheduling mix
• Different requirements on the NVRAM
• Scheduling across these resources
• Enabling sharing of nodes
• etc….
• Enabling applications to do more I/O
• Large numbers of our applications don’t heavily use I/O at
the moment
• What can we enable if I/O is significantly cheaper?
Benefits of this storage