
System Software

HPC ARCHITECTURES
Filesystems and Data Hardware

Adrian Jackson

a. .ac.uk

I/O

• I/O essential for all applications/codes
• Some data must be read in or produced

• Instructions and Data

• Key for data analysis

• Basic hierarchy
• CPU – Cache – Memory – Devices (including I/O)

• Often “forgotten” for HPC systems
• Linpack not I/O bound

• Not based on CPU clock speed or memory size

• Often “forgotten” in programs
• I/O usually happens only at the start and end, so it is seen as unimportant

• Just assumed overhead

Challenges of I/O

• Moves beyond process-memory model
• data in memory has to physically appear on an external device

• Files are very restrictive
• Don’t often map well to common program data structures (i.e. flat file/array)
• Often no description of data in file

• I/O libraries or options system specific
• Hardware different on different systems

• Lots of different formats
• text, binary, big/little endian, Fortran unformatted, …
• Different performance and usability characteristics

• Disk systems are very complicated
• RAID disks, caching on disk, in memory, I/O nodes, network, etc…

Challenges of I/O

• Standard computer hardware
– Possibly multiple disks
– PATA, SATA, SCSI (SAS)

• Optimisations
– RAID (striping and replication)
– Fast disks (SSD or server)

• HPC/Server/SAN hardware
– Many disks
– SCSI (SAS), Fibre channel

• Optimisations
– Striped
– Multiple adapters and network interfaces

• Network filesystems
– Provide access to data from many machines and for many users

• Long term storage
– Tape
– Disk farms

[Diagram: Abstract hardware hierarchy – CPU – Memory – Disk (and other I/O devices or peripherals)]

[Diagram: Actual hardware hierarchy – CPU – Northbridge controller (Memory, Graphics) – Southbridge controller (Disk and other I/O devices or peripherals)]

Performance

Interface                    Throughput/Bandwidth (MB/s)
PATA (IDE)                   133
SATA                         600
Serial Attached SCSI (SAS)   600
Fibre Channel                2,000
NVMe                         3,000

High Performance or Parallel I/O

• Lots of different methods for providing high performance I/O

• Hard to support multiple processes writing to same file
• Basic O/S does not support

• Data cached in units of disk blocks (e.g. 4K) and is not coherent

• Not even sufficient to have processes writing to distinct parts of file

• Even reading can be difficult
• 1024 processes opening a file can overload the filesystem limit on file handles etc…

• File operations tend to lock files, serialising access

• Data is distributed across different processes
• Dependent on number of processors used, etc…

• Parallel file systems may allow multiple access
• but complicated and difficult for the user to manage

HPC/Parallel Systems

• Basic cluster

– Individual nodes

– Network attached filesystem

– Local scratch disks

[Diagram: compute nodes, each with processors/cores and a local disk, connected over a network to a network attached filesystem]

• Multiple I/O systems
– Home and work
– Optimised for production or for user access

• Many options for optimisations
– Filesystem servers, caching, etc…

Hierarchy

[Diagram: compute nodes connected over a network to the I/O system; I/O compute nodes with adapters front groups of disks, managed by the I/O software/system]

Parallel filesystem

• Parallel File System is one in which there are multiple storage resources

• Connected to client resources

• Accessible across clients

• Multiple processes can access the same file simultaneously

• Often optimized for high performance

• Large block sizes (≥ 64kB)

• Relatively slow metadata operations (e.g. fstat()) compared to reads and writes

• Special APIs for direct access and additional optimizations

Parallel filesystem

• Key features

• Multiple hardware storage resources

• Hardware connected to compute resources via high performance network

• High-performance, concurrent access to these I/O resources

• Multiple physical I/O devices and paths ensure sufficient bandwidth for the high performance desired

• Parallel I/O systems include both the hardware and a number of layers of software

• Filesystem clients and servers

• MPI-I/O

• HDF5 etc…

POSIX I/O

• Standard interface to files

• Unix/Linux approach

• Based on systems with single filesystem

• open, close, write, read, etc…

• Does not support parallel or HPC I/O well

• Many network filesystems (e.g. NFS) don’t fully implement it, for performance reasons

• Some work on extending for HPC
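
As a concrete illustration of the interface named above (open, write, read, close), the following is a minimal C sketch of POSIX-style file access; the filename and buffer contents are purely illustrative and not taken from the course material.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello\n";
    char buf[16];

    /* Create (or truncate) a file and write a small buffer to it */
    int fd = open("example.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;
    write(fd, msg, strlen(msg));
    close(fd);

    /* Re-open the same file and read the data back */
    fd = open("example.dat", O_RDONLY);
    if (fd < 0) return 1;
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);

    return (n > 0) ? 0 : 1;
}

Note that nothing in this interface expresses parallelism: every process sees the file as a single byte stream, which is why concurrent access from many processes is hard to support efficiently.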

Lustre

• Open-source parallel file system

• Three functional units
• Object Storage Servers (OSS)

• Store data on one or more Object Storage Targets (OST)

• The OST handles interaction between client data requests and the underlying physical storage

• An OSS typically serves 2-8 targets, each target a local disk system. The capacity of the Lustre file system is the sum of the capacities provided by the targets

• The OSSs operate in parallel, independently of one another

• Metadata Target (MDT)

• One per filesystem, storing all metadata: filenames, directories, permissions, file layout

• Stored on Metadata Server (MDS)

• Clients

• Supports standard POSIX access

Lustre cont.

• Supports different networks

• Infiniband, Ethernet, Myrinet, Quadrics

• Striping

• Data striped across OSTs (round robin)

• File split into units

• Simultaneous read/write to different units

Lustre commands
• Striping cont.

• Improves bandwidths, overall performance available, and maximum file size

• Incurs communication overhead and potential contention, including serialisation if multiple processes access the same units

• lfs command for more information and configuration

adrianj@nid16958:~> lfs df -h

(query number of OSTs)

adrianj@nid16958:~> lfs getstripe dirname

(query stripe count, stripe size)

adrianj@nid16958:~> lfs setstripe dirname 0 -1 -1

(set large file stripe size, start index, stripe count)

adrianj@nid16958:~> lfs setstripe dirname 0 -1 1

(set lots of files stripe size, start index, stripe count)
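
To make the round-robin striping concrete, here is a small hypothetical C sketch (not a Lustre API call) that computes which stripe index, and hence which OST, a given byte offset falls on; the 1 MiB stripe size and stripe count of 4 are assumptions for illustration.

#include <stdio.h>

/* Round-robin layout: stripe unit k of the file lives on OST index k mod stripe_count */
static int ost_for_offset(long long offset, long long stripe_size, int stripe_count)
{
    return (int)((offset / stripe_size) % stripe_count);
}

int main(void)
{
    long long stripe_size = 1LL << 20;  /* assume 1 MiB stripe size */
    int stripe_count = 4;               /* assume striping over 4 OSTs */

    /* A byte at offset 5 MiB sits in stripe unit 5, i.e. stripe index 1 */
    printf("offset 5 MiB -> stripe index %d\n",
           ost_for_offset(5LL << 20, stripe_size, stripe_count));
    return 0;
}

This is also why processes that touch the same stripe unit contend with each other, while processes working on different units can proceed in parallel.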

Example I/O hardware: ARCHER

Lustre on ARCHER

• See white paper on I/O performance on ARCHER:

• http://www.archer.ac.uk/documentation/white-papers/parallelIO/ARCHER_wp_parallelIO.pdf

GPFS (Spectrum Scale)

• IBM General Parallel File System
• Files broken into blocks, striped over disks

• Distributed metadata (including dir tree)

• Extended directory indexes

• Failure aware (partition based)

• Fully POSIX compliant

• Storage pools and policies
• Groups disks

• Tiered on performance, reliability, locality

• Policies move and manage data

• Active management of data and location

• Supports wide range of storage hardware

• High performance

GPFS cont…

• Configuration

• Shared disks (i.e. SAN attached to cluster)

• Network Shared disks (NSD) using NSD servers

• NSD across clusters (higher performance NFS)

AFS

• Andrew File System
• Large/wide scale NFS

• Distributed, transparent

• Designed for scalability

• Server caching
• File cached local, read and writes done locally

• Servers maintain list of open files (callback coherence)

• Local and shared files

• File locking
• Doesn’t support large databases or updating shared files

• Kerberos authentication
• Access control list on directories for users and groups

HDFS

• Hadoop distributed file system
• Distributed filesystem with built in fault tolerance
• Relaxed POSIX implementation to allow data streaming
• Optimised for large scale

• Java based implementation
• Separate data nodes and metadata functionality
• Single NameNode performs filesystem name space operations
• Similar to Lustre decomposition

• Namenode -> MDS server

• Block replication undertaken
• Namenode “RAIDs” data
• Namenode copes with DataNode failures
• Heartbeat and status operations

ZFS
• Filesystem and logical volume manager

• Focus on data integrity
• 128-bit filesystem (very large potential volume)
• Encryption built in (if desired)
• Storage pools for different types of storage
• Cache-like storage hierarchy

• Data integrity
• Blocks are checksummed
• Checksums are also checksummed up the file system tree (see the sketch after this slide)
• Enables detection of corruption of the checksums as well as corruption of the block data
• Full filesystem integrity

• Developed by Sun
• Now maintained as OpenZFS
• Not widely deployed by vendors
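
The checksum-tree idea above can be illustrated with a short sketch. This is not ZFS code: the toy FNV-1a checksum stands in for the fletcher/SHA-256 checksums ZFS actually uses, and the block sizes are arbitrary.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy checksum (FNV-1a) standing in for fletcher4/SHA-256 */
static uint64_t checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

int main(void)
{
    uint8_t block[2][4096] = {{0}};              /* two data blocks */
    memcpy(block[0], "data A", 6);
    memcpy(block[1], "data B", 6);

    /* The parent block holds the checksums of its children... */
    uint64_t child_sums[2] = { checksum(block[0], sizeof block[0]),
                               checksum(block[1], sizeof block[1]) };
    /* ...and the grandparent holds a checksum over those checksums,
       so corruption of either data or checksums is detectable */
    uint64_t parent_sum = checksum(child_sums, sizeof child_sums);

    /* On read, verify from the top of the tree downwards */
    int ok = (checksum(child_sums, sizeof child_sums) == parent_sum)
          && (checksum(block[0], sizeof block[0]) == child_sums[0]);
    printf("block 0 verified: %s\n", ok ? "yes" : "no");
    return 0;
}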

Hierarchical storage management

• HSM moves data between storage levels based on policies

• Data moved independently of users

• May be for backup, archive, staging
• Manage expensive fast storage, maintain data in slow, cheap storage

• Policies may relate to
• Time since last access
• Fixed time
• Events

[Diagram: users see a single file system; HSM migrates data between fast storage (SSD, SCSI RAID), large storage (SATA RAID, disk), and long-term storage (tape, optical disk, offsite storage)]

Software layers

• MPI-I/O
• Fast, direct parallel access to files (multiple simultaneous readers and writers allowed); see the sketch after this list

• HDF5
• Can build on MPI-I/O
• Gives metadata operations (allows addition of structure and metadata to files)

• netCDF
• Can build on HDF5 (and therefore MPI-I/O)
• Specific structures/interfaces/tools for modelling communities (climate/earth science)

• ….
• Other similar formats designed for specific communities
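
As an illustration of the MPI-I/O layer mentioned above, here is a minimal sketch in which every rank writes its own contiguous block of doubles into one shared file with a collective write; the filename, block size, and data values are illustrative only.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;                       /* doubles per rank (illustrative) */
    double *buf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) buf[i] = rank + i;

    /* All ranks open the same file; each writes its own contiguous block */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}

The collective write lets the MPI library and the parallel filesystem coordinate the accesses, rather than every process issuing independent POSIX writes to the same file.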

Non-Filesystem based data
• Database storage

• Traditional SQL databases don’t scale to very large distributed systems

• NoSQL
• Relax ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees (i.e. eventual consistency rather than immediate consistency)

• Non-relational or tabular, i.e. key-value, graph, object storage, etc…
• Support querying, storage, and retrieval of data in SQL-like formats

• NewSQL databases
• Scale the database to large, distributed systems
• Enable SQL querying
• Support ACID guarantees
• Custom SQL engines, data sharding infrastructure, distributed cluster nodes…

Burst Buffer
• Non-volatile memory already becoming part of the HPC hardware stack

• SSDs offer high I/O performance but at a cost
• How to utilise in large scale systems?

• Burst-buffer hardware accelerating the parallel filesystem
• DDN IME (Infinite Memory Engine)

• Cray DataWarp

Burst buffer

[Diagram: compute nodes connected by a high performance network to an external filesystem, shown without and with a burst buffer filesystem placed between the compute nodes and the external filesystem]

Non-volatile memory

• Non-volatile RAM
• 3D XPoint technology (Intel Optane DCPMM)
• STT-RAM

• Much larger capacity than DRAM
• Hosted in the DRAM slots, controlled by a standard memory controller

• Slower than DRAM by a small factor, but significantly faster than SSDs

• STT-RAM
• Read fast and low energy
• Write slow and high energy

• Trade off between durability and performance

• Can sacrifice data persistence for faster writes

SRAM vs NVRAM

• SRAM used for cache

• High performance but costly
• Die area

• Energy leakage

• DRAM lower cost but lower performance
• Higher power/refresh requirement

• NVRAM technologies offer
• Much smaller implementation area

• No refresh/ no/low energy leakage

• Independent read/write cycles

• NVDIMM offers
• Persistency
• Direct access (DAX)

NVDIMMs

• Non-volatile memory already exists

• NVDIMM-N:

• DRAM with NAND Flash on board

• External power source (e.g. supercapacitors)

• Data automatically moved to flash on power failure with capacitor support, moved back when power restored

• Persistence functionality with memory performance (and capacity)

• NVDIMM-F:

• NAND Flash in memory form

• No DRAM

• Accessed through block mode (like SSD)

• NVDIMM-P:

• Combination of N and F

• Direct mapped DRAM and NAND Flash

• Both block and direct memory access possible

• 3D Xpoint, when it comes

• NVDIMM-P like (i.e. direct memory access and block)

• But no DRAM on board

• Likely to be paired with DRAM in the memory channel

• Real differentiator (from NVDIMM-N) likely to be capacity and cost

Performance – STREAM

Mode                 Min BW (GB/s)   Median BW (GB/s)   Max BW (GB/s)
App Direct (DRAM)    142             150                155
App Direct (DCPMM)   32              32                 32
Memory mode          144             146                147
Memory mode          12              12                 12

https://github.com/adrianjhpc/DistributedStream.git

STREAM_TYPE *a, *b, *c;

/* Map a file on the persistent memory device into the address space with
   PMDK's libpmem, and lay the three STREAM arrays out inside the mapping */
pmemaddr = pmem_map_file(path, array_length,
                         PMEM_FILE_CREATE|PMEM_FILE_EXCL,
                         0666, &mapped_len, &is_pmem);

a = pmemaddr;
b = pmemaddr + (*array_size+OFFSET)*BytesPerWord;
c = pmemaddr + (*array_size+OFFSET)*BytesPerWord*2;

/* STREAM triad kernel, then explicitly persist the result array */
#pragma omp parallel for
for (j=0; j<*array_size; j++){
    a[j] = b[j] + scalar*c[j];
}
pmem_persist(a, *array_size*BytesPerWord);

Performance - STREAM

/* Identify where the calling thread is running: rdtscp also returns
   IA32_TSC_AUX, which Linux fills with the NUMA node (upper bits) and
   CPU number (lower 12 bits) */
unsigned long get_processor_and_core(int *socket, int *core){
    unsigned long a, d, c;
    __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
    *socket = (c & 0xFFF000) >> 12;
    *core = c & 0xFFF;
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

/* Build the path to the fsdax mount associated with the local socket */
strcpy(path, "/mnt/pmem_fsdax");
sprintf(path+strlen(path), "%d", socket/2);
sprintf(path+strlen(path), "/");

Optimising data usage

• Reducing data movement
• Time and associated energy cost for moving data to and from external parallel filesystems
• Move compute to data

• Considering full scientific workflow
• Data pre-/post-processing
• Multi-physics/multi-application simulations
• Combined simulation and analytics

• Enable scaling I/O performance with compute nodes

NVDIMM Performance

[Figure: NVDIMM performance for different data access sizes]

Performance IOR

• For comparison: Summit at ORNL

Distributed storage

[Diagram: compute nodes with memory accessing a filesystem over the network, contrasted with nodes that also have node-local NVRAM alongside the external filesystem]

DAOS

• Native object store on non-volatile memory and NVMe devices