Parallel Computing
with GPUs
Dr Paul Richmond
http://paulrichmond.shef.ac.uk/teaching/COM4521/
Context and Hardware Trends
Supercomputing
Software and Parallel Computing
Course Outline
Context of course
[Chart: a single CPU core (~40 GigaFLOPS) vs a GPU with 4992 cores (8.74 TeraFLOPS)]
6 hours of CPU time vs. 1 minute of GPU time
Scale of Performance
Serial computing: 1 CPU core
Parallel computing: 16 CPU cores
Accelerated computing: 16 CPU cores + 4x 4992 GPU cores (an accelerated workstation)
Scale of Performance: Titan Supercomputer
[Figure: relative scale comparison of 1.8 m, 28 m, 650 m and 2.6 km]
Transistors != performance
Moore's Law: a doubling of transistors every couple of years
Not a law, actually an observation
Doesn’t actually say anything about performance
Dennard Scaling
“As transistors get smaller their power density stays constant”
Dynamic power ∝ Frequency x Voltage²
Performance improvements for CPUs traditionally realised by increasing frequency
Voltage is decreased to maintain a steady power
This only works so far: once voltage cannot drop further, power must increase
Increasing power has disastrous implications for cooling
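A minimal worked example (an illustration added here, not on the original slide), assuming dynamic power scales as Frequency x Voltage² and ignoring capacitance and leakage:

    1.4x Frequency with 0.7x Voltage:  1.4 x (0.7)² ≈ 0.69  →  power roughly constant
    1.4x Frequency at fixed Voltage:   1.4 x (1.0)² = 1.40  →  40% more power

While Dennard scaling held, frequency could rise as voltage fell; once voltage stopped scaling, higher frequency meant proportionally higher power.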
Instruction Level Parallelism
Transistors used to build more complex architectures
Use pipelining to overlap instruction execution
IF ID EX MEM WB
   IF ID EX MEM WB
      IF ID EX MEM WB
         IF ID EX MEM WB
            IF ID EX MEM WB
(cycles →)
Instruction Level Parallelism
Pipelining breaks down when one instruction depends on the result of another:
add 1 to R1
copy R1 to R2
IF ID EX MEM WB
            IF ID EX MEM WB   (wasted cycles: the copy must wait for R1)
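The same dependent pair written as C, purely as an illustrative sketch (the stall itself happens at the instruction level, not in the source code):

    int r1 = 0, r2 = 0, r3 = 0;

    /* Dependent pair: the copy must wait for the incremented value of r1,
       so its pipeline stages cannot fully overlap with the add. */
    r1 = r1 + 1;
    r2 = r1;

    /* Independent pair: no data dependency, so the two statements can be
       overlapped in the pipeline with no wasted cycles. */
    r1 = r1 + 1;
    r3 = 7;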
Golden Era of Performance
The 90s saw great improvements to single CPU performance
1980s to 2002: ~100% performance increase every 2 years
2002 to now: ~40% every 2 years
Adapting to Thrive in a New Economy of Memory Abundance, K Bresniker et al.
Why More Cores?
Use extra transistors for multi/many core parallelism
More operations per clock cycle
Power can be kept low
Processor designs can be simple – shorter pipelines (RISC)
GPUs and Many Core Designs
Take the idea of multiple cores to the extreme (many cores)
Dedicate more die space to compute
At the expense of branch prediction, out of order execution, etc.
Simple, Lower Power and Highly Parallel
Very effective for HPC applications
From GTC 2017 Keynote Talk, NVIDIA CEO Jensen Huang
Accelerators
Problem: we still require an OS, IO and scheduling
Solution: a "hybrid system"
The CPU provides management and "accelerators" (or co-processors) such as GPUs provide the compute power
[Diagram: CPU (with DRAM and I/O) connected over PCIe to a GPU/accelerator (with GDRAM and I/O)]
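A minimal sketch of the hybrid model using OpenMP's target offload directives; this course uses CUDA rather than OpenMP offload, so the directives here only illustrate the split between CPU management and accelerator compute, with data crossing PCIe via the map() clauses:

    #include <stdio.h>
    #define N 1024

    int main(void)
    {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* The CPU (host) manages the program and the OS; the map() clauses
           copy a and b to the accelerator and c back, and the loop itself
           runs on the device (or falls back to the host if none exists). */
        #pragma omp target map(to: a, b) map(from: c)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }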
Types of Accelerator
GPUs
Emerged from 3D graphics but now specialised for HPC
Readily available in workstations
Xeon Phis
Many Integrated Cores (MIC) architecture
Based on the original Pentium (P54C) x86 design with wide vector units
Closer to traditional multicore
Simpler programming and compilation
Context and Hardware Trends
Supercomputing
Software and Parallel Computing
Course Outline
Top Supercomputers
[Graph: performance (exponential scale, 100 MFlops to 1 EFlops) of the #1 and the #500 supercomputers by year, 1993 to 2020; for reference, a Volta V100 delivers ~15 TFLOPS SP]
Supercomputing Observations
Exascale computing
1 Exaflop = 1M Teraflops (10^18 FLOPS)
Estimated for 2020
Pace of change
A single desktop GPU today would have been the top supercomputer in 2002
A desktop with a GPU would have made the Top 500 in 2008
A Teraflop of performance required 1 MW of power in 2000
Extrapolating the trend
Current-generation Top 500 performance on every desktop in < 10 years
Trends of HPC
Improvements at individual computer node level are greatest
Better parallelism
Hybrid processing
3D fabrication
Communication costs are increasing
Memory per core is reducing
Supercomputing Observations
https://www.nextplatform.com/2016/11/14/closer-look-2016-top-500-supercomputer-rankings/
Green 500
Top energy efficient supercomputers
HPC Observations
Improvements at individual computer node level are greatest
Better parallelism
Hybrid processing
3D fabrication
Communication costs are increasing
Memory per core is reducing
Throughput > Latency
http://sc16.supercomputing.org/2016/10/07/sc16-invited-talk-spotlight-dr-john-d-mccalpin-presents-memory-bandwidth-system-balance-hpc-systems/
Context and Hardware Trends
Supercomputing
Software and Parallel Computing
Course Outline
Software Challenge
How to use this hardware efficiently?
Software approaches
Parallel languages: some limited impact but not as flexible as sequential programming
Automatic parallelisation of serial code: >30 years of research hasn't solved this yet
Design software with parallelisation in mind
Amdahl’s Law
Speedup of a program is limited by the proportion that can be parallelised
Speedup S = 1 / (1 − P)   (the limit as the number of processors grows)
[Graph: Speedup (S) against the parallel proportion of code (P), 0% to 100%]
Amdahl’s Law cont.
Addition of processing cores gives diminishing returns
Speedup S = 1 / ((1 − P) + P / N)
[Graph: Speedup (S) against the number of processors (N, 1 to 65536) for P = 25%, 50%, 90% and 95%]
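A minimal sketch of the calculation behind these curves; the N and P values are chosen to match the plot, and the speedups are computed from the formula rather than read from the slide:

    #include <stdio.h>

    /* Amdahl's law: speedup with parallel fraction p on n processors */
    static double amdahl(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double fractions[]  = { 0.25, 0.50, 0.90, 0.95 };
        const int    processors[] = { 1, 16, 256, 4096, 65536 };

        for (size_t i = 0; i < sizeof fractions / sizeof fractions[0]; i++) {
            printf("P = %2.0f%%:", fractions[i] * 100.0);
            for (size_t j = 0; j < sizeof processors / sizeof processors[0]; j++)
                printf("  N=%-5d S=%6.2f", processors[j],
                       amdahl(fractions[i], processors[j]));
            printf("\n");
        }
        return 0;
    }

Even with P = 95%, the speedup on 65536 processors stays below 20: adding cores gives diminishing returns.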
Parallel Programming Models
Distributed Memory
Geographically distributed processors (clusters)
Information exchanged via messages (see the sketch after this list)
Shared Memory
Independent tasks share memory space
Asynchronous memory access
Serialisation and synchronisation to ensure correctness
No clear ownership of data
Not necessarily performance oriented
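A minimal sketch of the distributed memory model using MPI, included purely for illustration (MPI is not taught in this course): each process owns its own copy of value, so data must be exchanged with explicit messages.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;  /* only rank 0 holds this value initially */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out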
Types of Parallelism
Bit-level
Parallelism over the size of a word: 8, 16, 32, or 64 bits
Instruction Level (ILP)
Pipelining
Task Parallel
Program consists of many independent tasks
Tasks execute on asynchronous cores
Data Parallel
Program has many similar threads of execution
Each thread performs the same behaviour on different data (see the sketch below)
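A minimal sketch in C with OpenMP (covered later in the course) contrasting the two models; the array size and the two print tasks are arbitrary examples:

    #include <stdio.h>
    #include <omp.h>
    #define N 8

    int main(void)
    {
        int data[N] = {0};

        /* Data parallelism: every thread applies the same operation
           to different elements of the array. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            data[i] = i * i;

        /* Task parallelism: independent pieces of work run concurrently
           on different threads. */
        #pragma omp parallel sections
        {
            #pragma omp section
            printf("task A on thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("task B on thread %d\n", omp_get_thread_num());
        }
        return 0;
    }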
Implications of Parallel Computing
Performance improvements
Speed
Capability (i.e. scale)
Context and Hardware Trends
Supercomputing
Software and Parallel Computing
Course Outline
COM4521/6521 specifics
Designed to give insight into parallel computing
Specifically with GPU accelerators
Knowledge transfers to all many core architectures
What you will learn
How to program in C and manage memory manually
How to use OpenMP to write programs for multi-core CPUs
What a GPU is and how to program it with the CUDA language
How to think about problems in a highly parallel way
How to identify performance limitations in code and address them
Course Mailing List
A google group for the course has been set up
You have already been added if you were registered by 01/02/2018
Mailing list uses:
Requesting help outside of lab classes
Finding out if a lecture has changed
Participating in discussion on course content
https://groups.google.com/a/sheffield.ac.uk/forum/#!forum/com4521-group
Learning Resources
Course website: http://paulrichmond.shef.ac.uk/teaching/COM4521/
Recommended Reading:
Jason Sanders, Edward Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Addison Wesley, 2010.
Brian Kernighan, Dennis Ritchie, "The C Programming Language (2nd Edition)", Prentice Hall, 1988.
Timetable
2 x 1 hour lecture per week (back to back)
Monday 15:00 until 17:00, Broad Lane Lecture Theatre 11
Week 5 first half of the lecture will be in DIA-LT09 (Lecture Theatre 9)
Week 5 second half of the lecture will be MOLE quiz in DIA-206 (Compute room 4)
1 x 2 hour lab per week
Tuesday 9:00 until 11:00 Diamond DIA-206 (Compute room 4)
Week 10 first half of the lab will be an assessed MOLE quiz DIA-206 (Compute room 4)
Assignment
Released in two parts
Part 1
Released week 3
Due for hand in on Tuesday week 7 (20/03/2018) at 17:00
Feedback after Easter.
Part 2
Released week 6
Due for hand in on Tuesday week 12 (15/05/2018) at 17:00
Course Assessment
2 x Multiple Choice quizzes on MOLE (10% each)
Weeks 5 and 10
An assignment (80%)
Part 1 is 30% of the assignment total
Part 2 is 70% of the assignment total
For each assignment part
Half of the marks are for the program and half for a written report
Will require understanding of why you have implemented a particular technique
Will require benchmarking, profiling and explanation to demonstrate that you understand the implications of what you have done
Lab Classes
2 hours every week
Essential in understanding the course content!
Do not expect to complete all exercises within the 2 hours
Coding help from lab demonstrators Robert Chisholm and John Charlton:
http://staffwww.dcs.shef.ac.uk/people/R.Chisholm/
http://www.dcs.shef.ac.uk/cgi-bin/makeperson?J.Charlton
Assignment and lab class help questions should be directed to the Google discussion group
Feedback
After each teaching week you MUST submit the lab register/feedback form
This records your engagement in the course
Ensures that I can see what you have understood and not understood
Allows us to revisit any concepts or ideas with further examples
This only works if you are honest!
Submit this once you have finished with the lab exercises
Your feedback will be used to clarify topics which are assessed in the assignments
Lab Register Link: https://goo.gl/0r73gD
Additional feedback from assignment and MOLE quizzes
Machines Available
Diamond Compute Labs
Visual Studio 2017
NVIDIA CUDA 9.1
VAR Lab
CUDA-enabled machines – same spec as the Diamond high-spec compute room
ShARC
University of Sheffield HPC system
You will need an account (see HPC docs website)
A select number of GPU nodes are available (see gpucomputing.shef.ac.uk)
A special short job queue will be made available
Your own machine
Must have a NVIDIA GPU for CUDA exercises
Virtual machines not an option
IMPORTANT: Follow the website's guidance for installing Visual Studio
http://docs.hpc.shef.ac.uk/en/latest/hpc/getting-started.html#getting-an-account
gpucomputing.shef.ac.uk
http://paulrichmond.shef.ac.uk/teaching/COM4521/visual_studio
Summary
Parallelism is already here in a big way
From mobile to workstation to supercomputers
Parallelism in hardware
It's the only way to use the increasing number of transistors
Trend is for increasing parallelism
Supercomputers
Increased dependency on accelerators
Accelerators are greener
Software approaches
Shared and distributed memory models differ
Programs must be highly parallel to avoid diminishing returns