
System Software

HPC ARCHITECTURES
System Software and Alternative Architectures

Adrian Jackson

a. .ac.uk

Operating systems

• The operating system is the basic system software
which manages the way applications run on the
hardware.

• Most HPC systems use some flavour of Linux

• Some proprietary versions of Unix still in use
• AIX on IBM

• Many manufacturers now using some version of
Linux
• also the OS of choice for build-your-own clusters

• An HPC version of Windows (Compute Cluster Server) does exist but isn’t widely used.

Processes

• A process is an instance of a running program, together with
its associated data and other resources.

• Many processes may be active on a machine
• each application may consist of one or more processes

• OS itself also consists of a number of processes

• On a single CPU (without hardware multithreading) only
one process can be running at a given time
• processes must take turns using the CPU

• OS is able to interrupt a running process, and give the CPU
to another process
• mechanism for doing this is in the hardware

Scheduling processes

• When a process is interrupted, the OS has to decide which process should get the next turn on the CPU.
• this is determined by the scheduling policy

• Process will run for a given length of time before being interrupted (time-slicing)
• implemented by interrupting CPU at regular intervals
• every 100th of a second is typical

• Next process to run should have something useful to do.
• no use scheduling a process which is waiting (e.g. for disk access)

• Most OSs assign a priority to each process
• high priority processes more likely to get access to CPU
• priorities may change over time

• The act of interrupting one process and giving the CPU to
another is called a context switch

• Some cost in time associated with context switching (see the sketch below for how context-switch counts can be measured)
• need to save state of old process
• when the new process starts running, the caches are full of the old process’ data
• frequent context switching is bad for application performance

• Many HPC systems run commercial OSs
• scheduling policy is designed for general purpose workloads
• may not be what is required for HPC
• e.g. not designed to minimise application wall-clock time!

• Some HPC systems run a very lightweight OS
• may not implement context switching at all
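
On Linux and other POSIX systems the effect of the scheduling policy on a process can be observed directly: getrusage() reports how many voluntary and involuntary context switches the process has been through. A minimal sketch (illustrative only; the work in the middle is a placeholder):

```c
/* Minimal sketch: report the context switches this process has experienced. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage usage;

    /* ... do some work here ... */

    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        printf("voluntary context switches:   %ld\n", usage.ru_nvcsw);
        printf("involuntary context switches: %ld\n", usage.ru_nivcsw);
    }
    return 0;
}
```

A large involuntary count relative to the amount of work done suggests the process is being time-sliced against other work, which is exactly what lightweight HPC kernels try to avoid.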

Multiprocessors

• In a distributed memory multiprocessor, each node runs a
separate copy of the operating system
• each copy is unaware of processes on other nodes

• Processes cannot migrate between nodes
• once a process is started on a node, it runs there until it finishes

• In a shared memory multiprocessor, one copy of the
operating system manages all the CPUs

• Processes can migrate between CPUs
• scheduler will try not to do this unless it’s necessary

• memory is shared, so no need to move any data

• may be possible to restrict a process to run on a subset of CPUs
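
The last point can be exercised from within a program. A minimal sketch, specific to GNU/Linux (the choice of CPUs 0-3 is arbitrary), restricting the calling process to a subset of CPUs with sched_setaffinity():

```c
/* Minimal sketch (GNU/Linux): pin the calling process to CPUs 0-3. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 4; cpu++)
        CPU_SET(cpu, &mask);

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("process restricted to CPUs 0-3\n");
    return 0;
}
```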

Process Placement: Distributed Memory

[Diagram: a row of nodes, each with its own CPU and memory, each running a separate copy of the operating system]

Process Placement: Distributed Memory

[Diagram: the same nodes with user processes (P) started on them; each process stays on the node where it was placed, managed by that node’s own OS copy]

Process Placement: Shared Memory

[Diagram: a single operating system image managing all the CPUs, with user processes (P) able to run on, and migrate between, any of them]

Threads

• Classical process has one thread of execution (i.e. one
sequence of instructions).

• Instead, a process may contain multiple threads

• All the threads in a process have access to the same
memory address space
• useful for both single and multiprocessor systems
• makes programming GUIs easier, for example
• context switch is (a bit) cheaper for threads than processes

• OS scheduling policy is aware of threads

• In a shared memory system it is possible for multiple threads belonging to the same process to be scheduled to different CPUs

• Details of the relationship between threads and
processes varies between different OSs
• not important to the programmer

• Message passing programs are usually
implemented as multiple processes
• each task is a process

• may be additional threads on each process to handle
communications

• Shared variable programs are usually implemented as a single process with multiple threads that can be scheduled on different CPUs
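
A minimal sketch of the shared-address-space point above: two POSIX threads in one process filling different halves of the same array (the array name and size are illustrative). Compile with -pthread.

```c
/* Minimal sketch: two threads sharing one process's address space. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];            /* visible to every thread in the process */

static void *fill(void *arg)
{
    long half = (long)arg;        /* 0 = first half, 1 = second half */
    for (long i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        data[i] = (double)i;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long h = 0; h < 2; h++)
        pthread_create(&t[h], NULL, fill, (void *)h);
    for (int h = 0; h < 2; h++)
        pthread_join(t[h], NULL);
    printf("last element = %f\n", data[N - 1]);
    return 0;
}
```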

Virtual memory

• Allows memory and disk to be a seamless whole
• processes can use more data than will fit into physical memory

• Allows multiple processes to share physical memory

• Can think of main memory as a cache of what is on disk

• Virtual address space divided into chunks, called pages
• typical size of a page is 4KB to 64KB

• total size of address space can be very large: 2^64 bytes

• When a page is accessed, it is allocated space in the real
physical memory
• a not-recently-used page is copied out to disk to make space for it
• this is an expensive operation: an application may run very slowly if this occurs frequently
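
A minimal POSIX sketch of how a running program can see the page size, and how much paging has actually cost it (major faults are the ones that required disk access):

```c
/* Minimal sketch: query the page size and report this process's page faults. */
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));

    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        printf("minor page faults: %ld\n", usage.ru_minflt);
        printf("major page faults (needed disk I/O): %ld\n", usage.ru_majflt);
    }
    return 0;
}
```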

Batch systems

• Many HPC installations don’t allow users to have interactive access to most of
the machine.

• Users log in to a small subset of processors, or to a separate machine entirely

• this machine is often called a front end

• used for running Unix commands, editing files, compiling code etc.

• Most or all of the processors are only accessible in batch mode

• only run application codes

• not interactive

• these processors are called the back end

• File system usually accessible from both front and back ends

• back end processors may have temporary disk space for use by running
applications

• On a distributed memory machine, the batch system knows how many
processors there are

• keeps a count of free processors

• only starts batch jobs on free processors

• On a shared memory system, the mapping of processes and threads to
processors is dynamic and controlled by the OS scheduling policy

• batch system tries to limit the number of active processes/threads so that
there is at least one processor per thread

• can be hard to prevent users cheating: requesting a number of processors,
and then starting more than this number of threads, for example.

• varying degrees of integration between batch systems and OSs

• some batch systems may co-operate with the OS to restrict the
processes/threads in a job to a dedicated subset of processors
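
Where the batch system and OS do co-operate like this, a job can check which subset of processors it has actually been given. A minimal GNU/Linux sketch comparing the CPUs present on the node with those in the job’s affinity mask:

```c
/* Minimal sketch (GNU/Linux): CPUs on the node vs CPUs this job may use. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("CPUs on the node:           %ld\n", online);
    printf("CPUs available to this job: %d\n", CPU_COUNT(&mask));
    return 0;
}
```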

Roadrunner
• IBM machine

• 1.7 Pflop/s peak
• 1.026 Pflop/s achieved (Linpack)
• First petaflop sustained machine
• Hybrid design

• AMD Opteron 2210
• 1.8 GHz
• Dual-core
• 6,480 processors, 12,960 cores

• IBM PowerXCell 8i
• 3.2 GHz
• 1 PPE and 8 SPEs
• 12,960 processors (12,960 PPE cores and 103,680 SPE cores, total of 116,640 cores)

Roadrunner

Cell

• Famous cross-over processor

• Designed for PS3 and other media apps

• Partnership between Sony, Toshiba, IBM

• IBM key developer

• Cost-performance very attractive

• Mass production for games console

• ~256 GFlop/s on chip (26 GFlop/s double precision)

• 1 PPE – PowerPC 3.2 GHz

• Dual-threaded general-purpose core

• 8 SPEs – Synergistic Processing Elements

• SIMD, 128 128-bit vector registers, local memory, in-order

• EIB – Element Interconnect Bus (on-chip bus linking PPE, SPEs and memory)

• 25.6 GB/s per SPE, total > 200 GB/s
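
The aggregate figure is consistent with the per-SPE number: 8 SPEs x 25.6 GB/s = 204.8 GB/s.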

Cell overview

[Diagram: the PPU and eight SPUs connected by the EIB]

Scientific Cell

• Identified as co-processor for scientific applications

• PowerXCell 8i

• Variant for double precision

• 102 GFlop/s DP

• 32 GB memory supported

• SPE used independently or in concert

• Suited to vector style operations

• Blades with 2 Cell processors

• Cell is dead

• IBM has no further development plans

K Computer
• 88,128 nodes

• 2.0 GHz 8-core SPARC64 VIIIfx

• 128 GFlop/s

• 16 GB of memory

• 705,024 cores

• Tofu network
– 6D torus
– 10 links per node
– 10 GB/s per link
– 100 GB/s of node bandwidth

K Computer

• 8 FP ops, or 4 FMA ops, per cycle

• Registers:
• 192 integer
• 256 floating point

• L1 (I/D) cache: 32KB/32KB, 2-way set associative, 128-byte lines

• L2 cache: 6MB, 10-way set associative, 128-byte lines
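
These figures are consistent with the per-node peak quoted on the previous slide: 8 cores x 8 flops/cycle x 2.0 GHz = 128 GFlop/s.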

K Computer – Network

• Multi-scale network
• Remote: 6 links

• Scalable xyz 3D torus

• Local: 4 links
• Fixed size 3D mesh

• 12 nodes connection

• Total topology
• 6D mesh/torus
• Multi-path routing to avoid faults

• High Performance and Operability

• Low hop-count (average hop count is about ½ of a conventional 3D torus)

• The 3D torus/mesh view is always provided to an application even when meshes are divided into arbitrary sizes

• No interference between jobs

• Fault tolerance

• 12 possible alternate paths are used to bypass faulty nodes

• Redundant nodes can be assigned preserving the torus topology

MD specific machines

• ANTON

• MD-GRAPE

• Anton ASIC
• high-throughput interaction subsystem (HTIS)
• 32 deeply pipelined modules, 800 MHz
• flexible subsystem
• four general-purpose Tensilica cores (each with cache and scratchpad memory) and eight specialized but programmable SIMD geometry cores, 400 MHz
• 3D torus, 6 inter-node links, total in+out bandwidth of 607.2 Gbit/s; per-hop latency 50 ns

http://en.wikipedia.org/wiki/Tensilica
http://en.wikipedia.org/wiki/SIMD

Future Architectures

• Exascale and beyond

• Accelerator based

• 3D chip stacking, new memory forms, etc…

• Very large numbers of processing units

• Power and cooling concerns

• Data movement high cost

PACS-G: a straw man architecture

• SIMD architecture, for compute oriented apps (N-body, MD), and stencil
apps.

• 4096 cores (64×64), 2FMA@1GHz

• 2D mesh (+ broadcast/reduction) on-chip network for stencil apps.

• Chip die size: 20mm x 20mm
– Applications mainly work on on-chip memory (size 512 MB/chip, 128 KB/core), with module memory provided by 3D-stacked/wide-IO DRAM: bandwidth 1000 GB/s, size 16-32 GB/chip

• No external memory (DIM/DDR)

• 250 W/chips expected

• 64K chips for 1 EFLOPS (at peak)
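
A rough check of the peak figure, counting each FMA as two flops: 4096 cores x 4 flops/cycle x 1 GHz ≈ 16.4 TFlop/s per chip, and 65,536 chips x 16.4 TFlop/s ≈ 1.07 EFlop/s.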

PACS-G

• A group of 1024~2048 chips is connected via an accelerator network (inter-chip network)
• 25-50 Gbps/link for inter-chip: if we extend the 2D mesh network to the (2D-mesh) external network in a group, we need 200~400 GB/s (= 32 ch. x 25~50 Gbps x 2 (bi-directional))

A64FX

• Instruction Set Architecture: Armv8.2-A SVE (512-bit wide SIMD)

• Number of cores: 48 computing cores + 4 assistant cores

• Memory: 32 GiB (HBM2)

• Process technology: 7 nm FinFET

• Number of transistors: about 8.7 billion

• Peak performance (TOPS):
– Double precision (64 bit) floating point operations: over 2.7 TOPS (DGEMM execution efficiency over 90%)
– Single precision (32 bit) floating point operations: over 5.4 TOPS
– Half precision (16 bit) floating point operations / 16 bit integer operations: over 10.8 TOPS
– 8 bit integer operations: over 21.6 TOPS

• Peak memory bandwidth: 1024 GB/second (STREAM Triad(3) execution efficiency over 80%)


https://www.fujitsu.com/global/about/resources/news/press-releases/2018/0822-02.html#3
https://www.fujitsu.com/global/about/resources/news/press-releases/2018/0822-02.html

High Bandwidth Memory

• Two levels of memory for KNL

• Main memory

• KNL has direct access to all of main memory

• Similar latency/bandwidth as you’d see from a standard processor

• 6 DDR channels

• MCDRAM

• High bandwidth memory on chip: 16 GB

• Slightly higher latency than main memory (~10% slower)

• 8 MCDRAM controllers/16 channels

Memory Modes

• Cache mode
• MCDRAM cache for DRAM

• Only DRAM address space

• Done in hardware (applications don’t need to be modified)

• Misses more expensive (DRAM and MCDRAM access)

• Flat mode
• MCDRAM and DRAM are both available

• MCDRAM is just memory, in same address space

• Software managed (applications need to do it themselves)

• Hybrid – Part cache/part memory
• 25% or 50% cache

[Diagrams: in cache mode the processor accesses DRAM through the MCDRAM; in flat mode the processor addresses MCDRAM and DRAM directly]
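
In flat mode that software management can be done with the memkind library (an assumption here: memkind and its hbwmalloc interface are installed; link with -lmemkind). A minimal sketch placing one array in MCDRAM and one in ordinary DRAM; alternatively a whole unmodified application can be bound to MCDRAM with numactl.

```c
/* Minimal sketch (flat mode, memkind library assumed): one array in MCDRAM,
 * one in ordinary DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

#define N (1 << 20)

int main(void)
{
    double *fast = hbw_malloc(N * sizeof(double));  /* MCDRAM, if available */
    double *slow = malloc(N * sizeof(double));      /* ordinary DRAM */
    if (!fast || !slow) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (int i = 0; i < N; i++)
        fast[i] = slow[i] = (double)i;   /* keep bandwidth-bound data in MCDRAM */

    hbw_free(fast);
    free(slow);
    return 0;
}
```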

Non-volatile memory

• Non-volatile RAM
• 3D XPoint technology
• STT-RAM

• Much larger capacity than DRAM
• Hosted in the DRAM slots, controlled by a standard memory controller

• Slower than DRAM by a small factor, but significantly
faster than SSDs

• STT-RAM
• Read fast and low energy
• Write slow and high energy

• Trade off between durability and performance

• Can sacrifice data persistence for faster writes

SRAM vs NVRAM

• SRAM used for cache

• High performance but costly

• Die area

• Energy leakage/refresh

• NVRAM technologies offer

• Much smaller implementation area

• No refresh/ no/low energy leakage

NVRAM

• The “memory” usage model allows for the
extension of the main memory
• The data is volatile like normal DRAM based main memory

• The “storage” usage model which supports the use
of NVRAM like a classic block device
• E.g. like a very fast SSD

• The “application direct” usage model maps
persistent storage from the NVRAM directly into
the main memory address space
• Direct CPU load/store instructions for persistent main memory regions
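
A minimal sketch of the application direct model: map a file from a DAX-mounted NVRAM filesystem into the address space and update it with ordinary stores. The path is hypothetical, and production code would normally use MAP_SYNC plus explicit cache flushes (or a library such as PMDK) rather than msync().

```c
/* Minimal sketch: load/store access to a persistent memory region via mmap.
 * "/mnt/pmem/region0" is a hypothetical file on a DAX-mounted filesystem. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    int fd = open("/mnt/pmem/region0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "persistent data");   /* an ordinary store into persistent memory */
    msync(p, len, MS_SYNC);         /* make sure the update reaches the media  */

    munmap(p, len);
    close(fd);
    return 0;
}
```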

Using distributed storage

• Without changing applications
• Large memory space/in-memory database etc…
• Local filesystem

• Users manage data themselves

• No global data access/namespace, large number of files

• Still require global filesystem for persistence

[Diagram: multiple nodes, each with local memory and a node-local /tmp filesystem, connected by the network to the global filesystem]

Using distributed storage

• Without changing applications
• Filesystem buffer

• Pre-load data into NVRAM from filesystem

• Use NVRAM for I/O and write data back to filesystem at the end

• Requires systemware to preload and postmove data

• Uses filesystem as namespace manager

[Diagram: multiple nodes, each with local memory and an NVRAM buffer, connected by the network to the global filesystem]

Using distributed storage

• Without changing applications
• Global filesystem

• Requires functionality to create and tear down global filesystems for
individual jobs

• Requires filesystem that works across nodes

• Requires functionality to preload and postmove filesystems

• Need to be able to support multiple filesystems across system

[Diagram: node-local memory across the nodes combined into a job-wide filesystem, alongside the external global filesystem]

Using distributed storage

• With changes to applications

• Object store

• Needs same functionality as global filesystem

• Removes need for POSIX, or POSIX-like functionality

[Diagram: node-local memory across the nodes presented as an object store, alongside the external global filesystem]

Using distributed storage

• Without changing applications

• Automatic checkpointing

• Resiliency

• Local checkpointing without hitting the filesystem

• Pause and restart

• Just-in-time scheduling/high priority jobs

• Waiting for something else to happen…

Using distributed storage

• New usage models
• Resident data sets

• Sharing preloaded data across a range of jobs

• Data analytic workflows

• How to control access/authorisation/security/etc….?

• Workflows
• Producer-consumer model

• Remove filesystem from intermediate stages

[Diagram: a producer-consumer workflow of Jobs 1 to 4 and the filesystem]

Using distributed storage

• Workflows

• How to enable different sized applications?

• How to schedule these jobs fairly?

• How to enable secure access?

[Diagram: a workflow with multiple instances of Jobs 1 to 4, of different sizes, sharing the filesystem]

The Challenge of distributed storage

• Enabling all the use cases in multi-user, multi-job
environment is the real challenge
• Heterogeneous scheduling mix

• Different requirements on the NVRAM

• Scheduling across these resources

• Enabling sharing of nodes

• etc….

• Enabling applications to do more I/O
• Large numbers of our applications don’t heavily use I/O at the moment

• What can we enable if I/O is significantly cheaper?

Benefits of this storage