Further Fundamentals
aka “How big is HPC anyway…”
CS402/922 High Performance Computing ● https://warwick.ac.uk/fac/sci/dcs/teaching/material/cs402/ ● 11/01/2022
What is an HPC supercomputer?
What makes a supercomputer “super”?
• One big processor wouldn’t be good
• Expensive to make and maintain
• Difficult to design efficiently
• Hard to handle thermal output
• Single point of failure
• What if we just connected lots of small computers together?
[Figure: cluster layout with login and/or management node(s), high-speed network and switches, a file system, and compute nodes each attached through a network interface]
How do we make a supercomputer faster?
Run it on energy drinks!
• Improve the computer architecture (the processors in the compute nodes)
• Improve the interconnects between the compute nodes (and the file system)
• Improve how the compiler interprets the program
• Improve what the algorithm actually does
• Allow for multiple jobs and resources to be used at the same time
Architecture
aka “How do we compute in a supercomputer?”
Computer Architecture
Process faster, better, harder, stronger…
• Many CPUs rely on similar tricks to improve performance:
• Increasing clock speed (number of compute cycles per second)
• More/Smaller transistors (more complex compute can be done in hardware)
• More cores (greater level of parallelism)
• Multiple processors (increase level of parallelism without having to develop bigger chips)
• Intel Xeon is a prime example of this
[Table: Intel Xeon (E5-4669 v3, Q2’15; 36 threads) compared with Xeon Cascade Lake (9242, Q2’19; 96 threads) on clock speed, transistor size, core count and thermal design power]
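A rough way to see how clock speed, core count and processor count combine is the usual back-of-envelope peak-throughput estimate. The sketch below is illustrative only: the function name, the example node and the FLOPs-per-cycle figure are assumptions, not vendor specifications.

```python
def peak_gflops(sockets, cores_per_socket, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s, assuming every core retires
    flops_per_cycle floating-point operations on every clock cycle."""
    return sockets * cores_per_socket * clock_ghz * flops_per_cycle

# Hypothetical node: 2 sockets, 18 cores each, 2.0 GHz, 16 FLOPs per cycle per core.
print(peak_gflops(2, 18, 2.0, 16))  # 1152.0
```

Real codes rarely reach this figure, since memory and network limits usually bite first.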
Intel Xeon Phi Knights Landing (KNL)
Knights of the square processor…
• Over the last decade, GPUs have become more mainstream in HPC
• Huge number of simple cores
• Faster access to data
• Code has to be designed to utilise them
• Intel designed a hybrid called KNL
• Allowed for CPU-type parallelism to work natively
• Configurable processor and memory
• High bandwidth memory → 16GB of MCDRAM
• Large pool of threads → 64 cores/256 threads
ARM Marvell ThunderX2
ThunderX2… sounds like a superhero; or a super villain!
• Memory bandwidth for CPUs is a big issue
• A faster interconnect between cores, cache and main memory is required
• ThunderX2 → ARM-based architecture
• Multiple (8) interconnects to main memory
• Multiple (8) paths to cache
• Measured memory bandwidth of ~116.5GB/s
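To get a feel for what a bandwidth figure like ~116.5GB/s means, the sketch below estimates how long a bandwidth-bound loop would take if it were limited only by memory traffic. The function name and the triad-style example loop are assumptions chosen for illustration; caches and latency are ignored.

```python
def stream_time_seconds(n_elements, bytes_per_element, arrays_touched, bandwidth_gb_s):
    """Lower bound on run time when the loop is limited purely by memory bandwidth."""
    total_bytes = n_elements * bytes_per_element * arrays_touched
    return total_bytes / (bandwidth_gb_s * 1e9)

# a[i] = b[i] + s * c[i] over 10^9 doubles touches three arrays (two reads, one write).
t = stream_time_seconds(1_000_000_000, 8, 3, 116.5)
print(f"{t:.3f} s")  # 0.206 s
```

However fast the cores are, this loop cannot finish sooner than the memory system can deliver the data.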
Shiny graphics?
• Most huge systems still rely on large numbers of CPUs
• GPU usage in HPC is actively increasing
• Huge number of simplified, slower cores
• NVIDIA K80 (Q4’14) → 4992 CUDA cores, 824 MHz
• NVIDIA P100 (Q3’16) → First GPU with HBM2 (~732 GB/s), 3584 CUDA cores, 1329MHz
• NVIDIA A100 (Q2’20) → 40GB of HBM, 6912 CUDA cores, 1410MHz
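The core counts and clock speeds listed here can be turned into a rough single-precision peak. The sketch below assumes each CUDA core completes one fused multiply-add (counted as 2 FLOPs) per cycle; that factor and the function name are assumptions for illustration, and the results are estimates rather than official specifications.

```python
# (CUDA cores, clock in MHz) as quoted in the list above.
GPUS = {
    "K80": (4992, 824),
    "P100": (3584, 1329),
    "A100": (6912, 1410),
}

def peak_fp32_tflops(cuda_cores, clock_mhz, flops_per_core_per_cycle=2):
    """Estimated peak in TFLOP/s; the default of 2 assumes one FMA per core per cycle."""
    return cuda_cores * clock_mhz * 1e6 * flops_per_core_per_cycle / 1e12

for name, (cores, mhz) in GPUS.items():
    print(f"{name}: {peak_fp32_tflops(cores, mhz):.2f} TFLOP/s")
```

Even though each core is slower than a CPU core, the sheer number of them is what drives the estimate up.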
Networking
aka “Let’s talk about it!”
Networking
Just plug everything into everything else‽
• Supercomputers are a collection of interconnected, smaller nodes
• Fugaku → 158,976 nodes
• Summit → 4,608 nodes
• Data needs to be passed between nodes, and to file systems
• In order to reduce network time:
• Utilise faster communication methods
• Reduce number of network switches
• Reduce the amount of buffer time on network cards
• Reduce amount of congestion on network
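One simple way to reason about these factors is a latency-plus-bandwidth cost model. The sketch below is a simplification under stated assumptions: every switch hop adds a fixed delay, congestion is ignored, and the function name and example numbers are hypothetical.

```python
def transfer_time(message_bytes, bandwidth_bytes_per_s, hops, per_hop_latency_s):
    """Estimated seconds to deliver one message: fixed per-hop delay plus time on the wire."""
    return hops * per_hop_latency_s + message_bytes / bandwidth_bytes_per_s

# Hypothetical link: 10 GB/s, 1 microsecond per switch hop.
for size in (8, 1_000_000):
    for hops in (5, 2):
        t = transfer_time(size, 10e9, hops, 1e-6)
        print(f"{size:>9} bytes, {hops} hops: {t * 1e6:.4f} us")
```

For the 8-byte message almost all of the time is per-hop latency, so removing switches helps most; for the 1 MB message the link bandwidth dominates.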
Common Interconnect Types
Red wires to red socket, blue wire to blue socket
100 Gigabit Ethernet
• 100Gb per second (hence the name)
• Relatively new
• Based on the (much) older Ethernet standard
• Fibre optics
InfiniBand
• Around 80GB per second
• Older, but much more common
• Proprietary connectors
• Fibre optics
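Link speeds are usually quoted in bits per second, while data sizes are quoted in bytes, so it is worth being careful with the factor of 8. The sketch below is a minimal illustration; the function names are assumptions and protocol overhead is ignored.

```python
def gigabits_to_gigabytes(gigabits):
    """8 bits per byte: a rate in Gb/s divided by 8 gives GB/s."""
    return gigabits / 8

def seconds_to_send(gigabytes, link_gigabits_per_s):
    """Ideal time to push a payload through a link of the given nominal speed."""
    return gigabytes / gigabits_to_gigabytes(link_gigabits_per_s)

print(gigabits_to_gigabytes(100))  # 12.5
print(seconds_to_send(50, 100))    # 4.0
```

So a nominal 100Gb per second link moves at most 12.5GB of payload per second.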
Common Networking Paradigms
Where is that wire going‽
• Different networking models allow for different communication styles
• Dragonfly → Multiple interconnected groups
• Can be expensive to design and implement
• Fat Tree → Tree structure (compute nodes = leaves, network switches = nodes)
• Can have a large amount of congestion
• Torus → Nodes connected in a looping cube
• Hugely complex and expensive
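The "looping cube" of a torus can be made concrete by listing a node's direct neighbours: one step in each direction along each axis, wrapping around at the edges. This is a minimal sketch; the coordinate scheme and function name are assumptions for illustration.

```python
def torus_neighbours(coord, dims):
    """Direct neighbours of a node in a torus with the given size along each axis."""
    neighbours = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % size  # modulo gives the wrap-around link
            neighbours.append(tuple(nxt))
    return neighbours

# The corner node of a 4x4x4 torus still has six neighbours thanks to the wrap-around.
print(torus_neighbours((0, 0, 0), (4, 4, 4)))
# [(3, 0, 0), (1, 0, 0), (0, 3, 0), (0, 1, 0), (0, 0, 3), (0, 0, 1)]
```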
Compilers
aka “Now you’re speaking my assembly language!”
Translation in progress…
• Programs are translated from human-readable code to machine code
• More complex processors → more complex instructions/machine code
• Compilers need to constantly develop to:
• Better optimise code
• Make the best use of new architectures and instructions
• Exploit hardware optimisations
• Many different compilers…
Have a look at CS325 Compiler Design for more detail!
Intel DPC++/oneAPI
You have the processor, now get the compiler!
• Many HPC machines are built with Intel Xeon CPUs
• Intel built a compiler to exploit:
• New features built into the processor
• Internal knowledge of their processors
• Better optimisations that may not otherwise be possible
• Primarily designed for Intel CPUs
• Can be utilised on other x86 64-bit processors, but may not be optimal
Intel oneAPI
https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html#gs.kmtzdj
Clang and LLVM
Dropping letters with a klang
• Many compilers are closed-source, hugely complex codes (Intel Classic, XL, Cray etc.)
• Increasing diversity in processors (AMD, Intel, ARM, OpenPower etc.)
• LLVM → A compiler infrastructure that connects programming languages to processors through an intermediate representation (IR)
• Clang → C/C++ compiler based on LLVM
• Open source, in active development
https://llvm.org/
https://clang.llvm.org/
Algorithms
aka “Thinking in different ways”
Algorithms
Now we’re thinking in parallel!
• Processors have been designed to be highly parallel
• Need to design algorithms to exploit this
• Design so loops have no dependencies
• Each processor can run on a different iteration of the loop
• More processors → Faster program
• Design so different loops can be run together
• Different processors can work on different parts of the algorithm
• Sometimes, there is no way to parallelise an algorithm
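A loop whose iterations are independent can be handed out to several workers with no change to the result. The sketch below uses Python's standard thread pool purely to illustrate the idea; the function names are assumptions, and in CPython a real speed-up for CPU-bound work would need processes or native code instead of threads.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """One loop iteration: depends only on its own input."""
    return x * x

def parallel_squares(values, workers=4):
    """Run each iteration on whichever worker is free; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))

print(parallel_squares(range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```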
Algorithms
Now we’re thinking in parallel!
• Loop dependencies are the most common issue
• Loop iteration is dependent on the order of operations
• Flow dependency → a=x+y; b=a+c;
• Anti-dependency → b=a+c; a=x+y;
• Output dependency → a=2; x=a+1; a=5;
• Control dependencies → branching statements
• Compiler can’t predict as easily
• Discussed further next time!
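The dependency types can be seen in a few lines of code. The sketch below is illustrative only, with hypothetical function names: the first two functions mirror the flow and anti-dependency statement pairs, and the last two contrast a loop that carries a dependency between iterations with one that does not.

```python
def flow_dependency(x, y, c):
    a = x + y   # writes a
    b = a + c   # reads a, so this statement must run second
    return a, b

def anti_dependency(a, c, x, y):
    b = a + c   # reads the old a
    a = x + y   # overwrites a, so this statement must run second
    return a, b

def running_total(values):
    """Each iteration reads the total written by the previous one (loop-carried)."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out

def doubled(values):
    """No iteration reads what another writes, so any order gives the same result."""
    return [2 * v for v in values]

print(flow_dependency(1, 2, 3))      # (3, 6)
print(anti_dependency(10, 1, 2, 3))  # (5, 11)
print(running_total([1, 2, 3, 4]))   # [1, 3, 6, 10]
print(doubled([1, 2, 3, 4]))         # [2, 4, 6, 8]
```

Only the last loop can be split across processors without further thought; the running total needs its iterations in order, or a different algorithm.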
Workload and Resource Manager
aka “I wanted to use that!”
Workload and Resource Manager
Let’s share this supercomputer!
• Supercomputers consist of a large collection of smaller machines
• Inefficient for a single person to be the only person allowed on a machine
• May not need to use the entire machine
• May not disconnect promptly, wasting time, power and money
• Usually owned by large companies/research centres
• Don’t want free access for all permitted users → may impact performance of other programs
• Need something to manage access to nodes
Slurm
That just sounds weird!
• One of the most common workload managers
• Open source → Used in DCS, Warwick and many other systems
• Normal use case:
• Installed on login node
• Allocate jobs to the queue (managed by login node)
• When resources are free, job is run on the requested resources
• Job tracked by a job ID, outputs and errors are stored in files with job ID
https://slurm.schedmd.com/overview.html
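The queue-then-run behaviour described above can be imitated with a toy first-come, first-served scheduler. This is a sketch for intuition only, not how Slurm is actually implemented: the function name, the job tuples and the strict no-overtaking policy are all assumptions.

```python
from collections import deque

def run_fifo(jobs, total_nodes):
    """jobs: (job_id, nodes_needed, runtime) in submission order.
    Returns {job_id: start_time} under strict first-come, first-served."""
    queue = deque(jobs)
    running = []        # (end_time, nodes) for jobs currently holding nodes
    free = total_nodes
    now = 0
    starts = {}
    while queue:
        job_id, nodes, runtime = queue[0]
        if nodes > total_nodes:
            raise ValueError(f"job {job_id} requests more nodes than exist")
        if nodes <= free:
            # Resources are free: start the job at the head of the queue.
            queue.popleft()
            starts[job_id] = now
            free -= nodes
            running.append((now + runtime, nodes))
        else:
            # Otherwise wait for the next running job to finish and release its nodes.
            running.sort()
            end_time, released = running.pop(0)
            now = max(now, end_time)
            free += released
    return starts

# Four hypothetical jobs on a 4-node machine.
print(run_fifo([(1, 2, 10), (2, 2, 5), (3, 4, 3), (4, 1, 1)], total_nodes=4))
# {1: 0, 2: 0, 3: 10, 4: 13}
```

Job 4 needs only one node but still waits behind job 3, which is the kind of idle time that more sophisticated queue policies try to avoid.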
Portable Batch System (PBS)
Now on PBS…
• PBS – Proprietary software
• OpenPBS – Open-source version based on PBS
• Allows for more control over how the queue is managed
• Can be more difficult for newer users
• Utilised by larger supercomputers → Archer
Interesting related reads
Some of this might even be fun…
• LLVM’s original paper
• C. Lattner. LLVM: An Infrastructure for Multi-Stage Optimization. Thesis, Urbana, IL, December 2002
• Archer2 usage stats → https://www.archer2.ac.uk/news/2021/02/04/archer-code-use.html
• Top500 stats page → https://www.top500.org/statistics/list/
Next lecture: Thread Level Parallelism