
COMP528 HAL27 – Intro to GPUs

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

27 – HAL

GPU

• We are NOT concerned with computer graphics, gaming etc

• So why are GPUs part of “HPC”?

History
• 1970s – 1990s: home PCs had co-processors

– floating point: accelerate floating-point math

– video display controllers: accelerate the display output

• Costs of integrating these within the main CPU declined
– CPUs became more “general purpose”

• 2000s: discrete graphics cards (Graphics Processing Unit)
– accelerate rendering (etc.), particularly for gaming; the battle of NVIDIA vs. ATI

• 2006: AGEIA add-in card to do complex physics calculations; bought out by NVIDIA
– beginning of the “general purpose GPU” (GPGPU)

• 2007: NVIDIA release CUDA

• 2009: Khronos (AMD/ATI backed) release OpenCL

Graphical Processing

• Doing the same thing, on a LOT of data items

• As quickly as possible

• i.e. concurrently

• BUT apply this concurrency to numerical processing
– and nowadays also to varieties of Machine Learning

Vector Arithmetic

• z[i] = A*x[i] + y[i]; for i=0,1,2,3,…

• for video this is per pixel

• #pixel per display, say 1920×1080 ==> ~2M pixels

• but each update is INDEPENDENT (and NO ordering)

• so each update can occur CONCURRENTLY

• GPUs have lots of “weak” cores but lots of concurrency
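The independence claim above is the whole game, so a minimal serial sketch may help (plain C; `num`, the float arrays, and the scalar `A` being declared elsewhere are assumptions, not from the slides):

/* Each iteration writes its own z[i] and reads only x[i] and y[i]:
   no iteration depends on another, so any order of i -- or all
   iterations at once -- gives the same result. */
for (int i = 0; i < num; i++) {
    z[i] = A * x[i] + y[i];
}

The same pattern, minus the scalar, reappears in the GPU CODING section below.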

CPUs: excellent for general purpose, handling everything
• Out-of-order execution
• Speculative execution
• Small number of powerful cores
• Fast clock cycle, e.g. 3 GHz

But not ideal for any given type of problem

[Figure: CPU vs GPU core layouts; images from NVIDIA]

GPUs: remove all the general-purposeness
Reduce to low-capability cores (e.g. 1 GHz)
• NO out-of-order execution
• NO speculative execution
Dramatically increase the number of cores

 a product that is very good for a specific job
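To put numbers on “dramatically increase the number of cores”: a sketch of mapping one GPU thread per pixel for the 1920×1080 display mentioned earlier (the 256-thread block size is an assumed, commonly used value, not from the slides):

int num = 1920 * 1080;          /* 2,073,600 independent pixel updates */
int threadsPerBlock = 256;      /* assumed typical CUDA block size */
int blocks = (num + threadsPerBlock - 1) / threadsPerBlock;
/* = 8100 blocks of 256 threads: ~2M logical threads, scheduled in
   batches across the GPU's many simple cores */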

GPU MOTIVATION

Top500 (Nov 2018)

• what do you notice?

• Green500

CES 2016

Chadwick
– 118 nodes: 33 TF

GPU CODING


CUDA by Example: vector arithmetic

• z = x + y
– Vectors: summation element-by-element

start = clock();
for (i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();

[Figure: each z[i] formed by adding x[i] + y[i], shown for elements 0–5]

• IS IT PARALLELISABLE?
– We can do z[i] in any order of i

CUDA by Example: parallel vector arithmetic

start = clock();
// parallel control to be added
for (i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();

• We can do each of the z[i] independently and thus concurrently
– perhaps each element on a “thread” on each of the GPU cores…

parallel vector arithmetic

• We know how to do this with OpenMP on a number of CPU cores…

start = clock();
#pragma omp parallel for ...
for (i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();

• HOW to do it for a GPU?
– OpenMP directives?
• good in theory but…
– other directives?
• yes, OpenACC is okay
– any “native” languages?
• yes, CUDA for NVIDIA GPUs
• OpenCL (but harder to learn/implement)

• This course module covers
– OpenMP directives [done]
– OpenACC directives [done]
– “native” languages:
• CUDA for NVIDIA GPUs
• OpenCL (overview only)
Remember the physicals…

• cannot boot an OS directly from a GPU card
• GPU cards typically “hang off” a node via PCI-e
• What does this imply…?
– “off-load” programming style
– bandwidth & latency of PCI-e are considerations

Questions via MS Teams / email

Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane