Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
27 – HAL
GPU
• We are NOT concerned with computer graphics, gaming, etc.
• So why are GPUs part of “HPC”?
History
• 1970s – 1990s: home PCs had co-processors
– floating point: accelerate fl.pt. math
– video display controllers: accelerate display output
• Costs of integrating these within the main CPU declined
– CPUs became more “general purpose”
• 2000s: discrete graphics cards (Graphics Processing Unit)
– accelerate rendering (etc.), particularly for gaming; the battle of NVIDIA vs. ATI
• 2006: AGEIA add-in card to do complex physics calculations; subsequently bought out by NVIDIA
– beginning of the “general purpose GPU” (GPGPU)
• 2007: NVIDIA release CUDA
• 2008: Khronos Group (AMD/ATI backed) release OpenCL
Graphical Processing
• Doing the same thing, on a LOT of data items
• As quickly as possible
• i.e. concurrently
• BUT apply this concurrency to numerical processing
– and nowadays also to varieties of Machine Learning
Vector Arithmetic
• z[i] = A*x[i] + y[i]; for i = 0, 1, 2, 3, … (see the serial C sketch below)
• for video this is per pixel
• pixels per display: say 1920×1080 ==> ~2M pixels
• but each update is INDEPENDENT (and NO ordering)
• so each update can occur CONCURRENTLY
• GPUs have lots of “weak” cores but lots of concurrency
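A minimal serial C sketch of this update, for reference (the function name vec_update and the parameter names are illustrative, not from the module's code):

/* Serial reference: z[i] = A*x[i] + y[i] for every i.
   Each iteration touches only element i, so iterations are
   independent and may run in any order -- or all at once. */
void vec_update(float A, const float *x, const float *y, float *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = A * x[i] + y[i];
}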
The CPU: excellent for general purpose use, handling everything
• out-of-order execution
• speculative execution
• small number of powerful cores
• fast clock, e.g. 3 GHz
But not ideal for any given type of problem
[Figure: CPU vs. GPU schematics; images from NVIDIA]
The GPU: remove all the general-purpose-ness
• reduce to low-capability cores (e.g. 1 GHz)
• NO out-of-order execution
• NO speculative execution
• dramatically increase the number of cores
The result: a product that is very good for a specific job
GPU MOTIVATION
Top500 (Nov 2018)
• what do you notice?
• Green500
CES 2016
Chadwick
– 118 nodes: 33 TF
GPU CODING
CUDA by Example: vector arithmetic
• z = x + y
– vectors: summation element-by-element
– IS IT PARALLELISABLE?

start = clock();
for (i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();

[diagram: element-by-element addition, z_i = x_i + y_i]
• We can do z[i] in any order of i
(slides © High End Compute Ltd)

CUDA by Example: parallel vector arithmetic
• z = x + y

start = clock();
// parallel control to be added
for (i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();

• We can do each of the z[i] independently and thus concurrently
– perhaps each element on a “thread” on each of the GPU cores…

parallel vector arithmetic
• We know how to do this with OpenMP on a number of CPU cores…

start = clock();
#pragma omp parallel for ...
for (i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();

• HOW to do it for a GPU? (sketches of the CUDA and OpenACC routes follow below)
– OpenMP directives?
• good in theory but…
– other directives?
• yes, OpenACC is okay
– any “native” languages?
• yes, CUDA for NVIDIA GPUs
• OpenCL (but harder to learn/implement)

• This course module covers:
– OpenMP directives [done]
– OpenACC directives [done]
– CUDA for NVIDIA GPUs
– OpenCL (overview only)
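To make the “native” CUDA route concrete, below is a minimal sketch of the vector sum as a CUDA kernel plus its host-side off-load. It is illustrative rather than the module's reference code: the kernel name vecAdd, the problem size, and the choice of 256 threads per block are assumptions for the example.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Kernel: each GPU thread computes ONE element of z,
   replacing the serial loop over i */
__global__ void vecAdd(const float *x, const float *y, float *z, int num)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num)                      /* guard: the grid may overshoot num */
        z[i] = x[i] + y[i];
}

int main(void)
{
    int num = 1 << 20;                /* ~1M elements (illustrative size) */
    size_t bytes = num * sizeof(float);

    /* host arrays */
    float *x = (float*)malloc(bytes);
    float *y = (float*)malloc(bytes);
    float *z = (float*)malloc(bytes);
    for (int i = 0; i < num; i++) { x[i] = i; y[i] = 2.0f * i; }

    /* device arrays: GPU memory is separate from host memory */
    float *d_x, *d_y, *d_z;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_z, bytes);

    /* "off-load": copy the inputs across PCI-e to the GPU */
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    /* launch enough threads to cover all num elements */
    int threadsPerBlock = 256;
    int blocks = (num + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_x, d_y, d_z, num);

    /* copy the result back across PCI-e to the host */
    cudaMemcpy(z, d_z, bytes, cudaMemcpyDeviceToHost);
    printf("z[10] = %f (expect 30.0)\n", z[10]);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(x); free(y); free(z);
    return 0;
}

Note the two directions of cudaMemcpy: because the card hangs off PCI-e (see “Remember the physicals” below), inputs must be staged into device memory and results copied back, so transfer bandwidth and latency are part of the cost.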
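For comparison, a minimal sketch of the OpenACC directive route mentioned above: the serial loop is kept and annotated, and the compiler generates the off-load. Again illustrative; the explicit copyin/copyout data clauses are one reasonable choice, not the only one.

void vecadd_acc(const float *restrict x, const float *restrict y,
                float *restrict z, int num)
{
    /* OpenACC: the compiler generates the GPU off-load; copyin/copyout
       make the host<->device (PCI-e) transfers explicit */
    #pragma acc parallel loop copyin(x[0:num], y[0:num]) copyout(z[0:num])
    for (int i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}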
Remember the physicals…
• cannot boot an OS directly from a GPU card
• GPU cards typically “hang off” a node via PCI-e
• What does this imply…?
– an “off-load” programming style
– bandwidth & latency of PCI-e are considerations

Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane