Microsoft PowerPoint – COMP528 HAL29 GPU threads, blocks, grids, barkla, cuda malloc, data transfers.pptx

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

29 – HAL

Why GPU?
• Doing the same thing, on a LOT of data items: weak but specific cores
• As quickly as possible: concurrently
• Apply this concurrency to numerical processing
• Different ways to program:
  – directives
  – CUDA
  – OpenCL

Terminology

• Host: the CPU (and its memory)
• maybe several cores
• at a given time, each core could be doing something different

• Device: the GPU (and its memory)
• many cores
• designed to run 1000s of threads, lightweight switching
• threads run same code (“kernel”)
• the host code will call N copies of the kernel, one running on each of N threads

• Threads
• lightweight
• will have unique identifier: helps with how to parallelise work of the kernel

• Threads and thread blocks are the main points to understand

• Each thread runs a copy of the kernel
• 1000s (potentially) of threads
  => 1000 copies of kernel
  => potentially 1000x speed-up
  BUT they are "weak cores", so each is less 'speedy' than a single CPU core

• the "potentially 1000x speed-up" refers to "using just 1 thread"
  (which is a very, very bad idea)

Threads & Blocks

• but rather than think of threads, you need to think of threadblocks (or just "blocks")

• threads in a block (see the sketch after this list)
  • can be synchronised
  • have access to memory shared between themselves: threadblock shared memory (faster)
  • also have access to the GPU "global memory" (slower)

• you define (at call to kernel – see later)
  • #threads in a block
  • #blocks
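
As a minimal sketch of those two block-level features (the kernel name, buffer size and launch shape are illustrative, not from the slides), threads of a block can stage data in shared memory and synchronise before reusing it:

__global__ void reverse_chunk(float *data) {
    __shared__ float buf[128];                 // visible to this block's threads only
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[t] = data[base + t];                   // each thread loads one element
    __syncthreads();                           // wait until the whole block has loaded
    data[base + t] = buf[blockDim.x - 1 - t];  // then read a DIFFERENT thread's element
}

// must be launched with 128 threads per block to match 'buf', e.g.
// reverse_chunk <<<blks, 128>>> (data_dev);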

NVIDIA GPU Hierarchy
Hardware
• CUDA core
  • where threads run and the work is performed
  • traditionally lots of support for integer & single precision (float)
  • increasingly more double precision (double) support AND half-int (for…?)

• Streaming Multiprocessor (SM, sometimes SMP)
  • a number of CUDA cores

• the GPU
  • a number of SMs

• "deviceQuery" (CUDA SDK examples) -> gives details of actual GPU (see the sketch below)
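
The same details deviceQuery reports can also be queried from your own code via the CUDA runtime; a minimal sketch (the output format is illustrative):

#include <stdio.h>

int main(void) {
    int n;
    cudaGetDeviceCount(&n);                  // how many GPUs are visible
    for (int dev = 0; dev < n; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev); // fills a struct of device details
        printf("device %d: %s, compute capability %d.%d, %d SMs, %.1f GB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, prop.totalGlobalMem / 1e9);
    }
    return 0;
}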

Barkla – System Architecture

[Diagram: Barkla system architecture — login nodes (login1, login2), interactive viz nodes (viz01, viz02), and, accessible via the batch system, compute nodes (node001–node136), high-memory nodes (himem01, himem02), GPU nodes (gpu01–gpu03) and Xeon Phi nodes (phi01–phi04); "sbatch" and "srun" submit work to the compute side]

Users can "ssh" in to either a login node or an interactive viz node. If doing any GUI or debug runs then use one of the interactive viz nodes. To do compute work (on CPU nodes, CPU nodes with extra RAM, nodes with batch GPUs, or the Xeon Phi nodes) you have to use the "SLURM" batch system, typically "sbatch" for batch jobs and "srun" for interactive runs on the compute nodes.

Note that from outside of Barkla, the login nodes are "barkla4.liv.ac.uk" and "barkla5.liv.ac.uk", and the interactive gpu nodes are "barkla6.liv.ac.uk" and "barkla7.liv.ac.uk".


Barkla GPU

interactive viz nodes
• viz01, viz02
• "Quadro P4000" *2

Barkla GPU

batch gpu nodes
• via partition “gpu”
• e.g. “sinfo -p gpu”:

[mkbane@viz02[barkla] release]$ sinfo -p gpu
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 3-00:00:00 1 alloc gpu01

• deviceQuery (on gpu01) gives the details of the installed GPUs

Barkla GPU: interactive .v. batch

• interactive
• these GPUs are shared
• but you can always get access
• but they are less powerful than batch GPUs

• batch
• (covid19: some resources reserved)
• exclusive use of node but queue to get access
• but more powerful than interactive/viz GPUs

Threads & Blocks

Threads etc
• Hardware
  • # CUDA cores in a SM
  • # SM in a GPU
  • 3 levels: CUDA cores, SM, GPU

• Software [CUDA runtime] mirrors this
  • threads
  • thread blocks (or "blocks")
  • kernel grid

[Image: CUDA core / SM hierarchy, from https://en.wikipedia.org/wiki/Thread_block]


Threads, Blocks & Warps

• Definitive ref:
  http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation

• For today…
  • Thread blocks are the quanta to consider
  • A block is placed on a single SM
  • Can have multiple blocks per SM
  • SM splits a thread block into warps (32 threads)
  • If a warp is blocked (mem access), switch to another warp – latency hiding
  • Warps run same instructions, in lockstep fashion (see the sketch below)

KEY PRINCIPLE:
threads run concurrently => parallelism
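
A practical consequence of that lockstep behaviour (an aside, not from the slides): if threads of the same warp take different branches, the warp executes both paths one after the other with the non-participating threads masked off, so divergent code loses parallel efficiency. A sketch of a kernel that diverges within every warp:

__global__ void divergent(float *v) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i % 2 == 0) {          // even and odd threads of one warp branch differently,
        v[i] = v[i] * 2.0f;    // so the warp runs the 'then' path (odd threads idle)...
    } else {
        v[i] = v[i] + 1.0f;    // ...and then the 'else' path (even threads idle)
    }
}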


CUDA

Steps to CUDA

1. Determine work that has inherent parallelism
2. Move (serial) work to a "kernel"
3. Invoke a parallel kernel by use of CUDA

Based upon "Steps to CUDA"
© High End Compute Ltd

CUDA by Example: CUDA kernel

serial_kernel(x, y, z, num) {
    for (i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}

start = clock();
serial_kernel(x, y, z, num);
finish = clock();

__global__ cuda_kernel(x, y, z) {
    // parallel control via varying index
    my_i = threadIdx.x + blockIdx.x*blockDim.x;
    z[my_i] = x[my_i] + y[my_i];
    // note there is NO 'for' loop over index
}

start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z);
finish = clock();

CUDA by Example: CUDA kernel

__global__ cuda_kernel(x, y, z) {
    // parallel control via varying index
    my_i = threadIdx.x + blockIdx.x*blockDim.x;
    z[my_i] = x[my_i] + y[my_i];
    // note there is NO 'for' loop over index
}

start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z);
finish = clock();

• requests "blks" thread blocks
• with "threadsPerBlock" threads per block
• and each thread running an instance of the "cuda_kernel" (with given args) on a separate thread

• each thread is in a block
• blocks are uniquely numbered "blockIdx.x"
• each thread in a given block has a unique number "threadIdx.x"
• therefore "my_i" will be numbered 0,1,2,3… (each on a different thread, perhaps in a different thread block)

CUDA by Example: CUDA kernel

__global__ cuda_kernel(x, y, z) {
    // parallel control via varying index
    my_i = threadIdx.x + blockIdx.x*blockDim.x;
    z[my_i] = x[my_i] + y[my_i];
    // note there is NO 'for' loop over index
}

start = clock();
cuda_kernel <<<2, 5>>> (x, y, z);
finish = clock();

[Diagram: with <<<2, 5>>>, ten threads each compute one element of z = x + y]

my_i:    0   1   2   3   4   5   6   7   8   9
block:   0   0   0   0   0   1   1   1   1   1
thread:  0   1   2   3   4   0   1   2   3   4

e.g. block 0, thread 0: my_i = 0 + 0*5 = 0
     block 1, thread 1: my_i = 1 + 1*5 = 6
and each thread performs z[my_i] = x[my_i] + y[my_i]
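
To watch this numbering live, CUDA kernels may call printf on the device; a small sketch (the kernel name is illustrative) using the same <<<2, 5>>> launch as above:

#include <stdio.h>

__global__ void who_am_i(void) {
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> my_i = %d\n",
           blockIdx.x, threadIdx.x, my_i);
}

int main(void) {
    who_am_i <<<2, 5>>> ();   // 2 blocks of 5 threads = 10 lines of output
    cudaDeviceSynchronize();  // wait for the kernel (also flushes device printf)
    return 0;
}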


What are we missing?

• getting data on to and off the GPU device

• num: the kernel does not know the problem size (see the sketch below)
  • eg num=100 with blks=3, threadsPerBlock=3: only 9 threads, too few for 100 elements
  • eg num=120 with blks=10, threadsPerBlock=32: 320 threads, more than 120 elements, so threads must check they are in range
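
A common host-side idiom for choosing the launch shape (a sketch, assuming the kernel guards its index as on a later slide): round the number of blocks up so that every element gets a thread:

int threadsPerBlock = 32;
// integer ceiling division: enough blocks to cover all 'num' elements
int blks = (num + threadsPerBlock - 1) / threadsPerBlock;
// e.g. num=120 -> blks=4, i.e. 128 threads; the kernel's 'if (my_i < num)'
// test (later slide) makes the 8 surplus threads do nothing
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z, num);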

[Diagram: kernel running on GPU hardware; host and device connected by PCI Express]

Sequence of data transfer slides
© High End Compute Ltd
(with permission solely for COMP528, U/Liverpool)

__global__ cuda_kernel(x, y, z) {
    // parallel control via varying index
    my_i = threadIdx.x + blockIdx.x*blockDim.x;
    z[my_i] = x[my_i] + y[my_i];
    // note there is NO 'for' loop over index
}

allocVars();
inputData();
preProcessData();
start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z);
finish = clock();
postProcessData();
saveResults();

Based upon "Steps to CUDA"
© High End Compute Ltd


__global__ void cuda_kernel(float *x, float *y, float *z, int num)
{
    // parallel control via varying index
    int my_i = threadIdx.x + blockIdx.x*blockDim.x;
    // handle my_i exceeds num
    if (my_i < num) {
        z[my_i] = x[my_i] + y[my_i];
    }
}

start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z, num);
finish = clock();

Kernel will be off-loaded to GPU
When does GPU start? and stop?
How does GPU physically get the data?

Based upon "Steps to CUDA"
© High End Compute Ltd
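
On "when does the GPU start? and stop?" (an aside about the CUDA runtime, not slide content): a kernel launch is asynchronous, so the clock() calls above bracket only the launch itself; to time the kernel's execution the host must wait for the device to finish:

start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z, num);
cudaDeviceSynchronize();   // block the host until the kernel completes
finish = clock();          // now finish-start covers the kernel's run time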

Memory Matters

• Typically, read/set data on the CPU
• Pre- and post-process data on the CPU

• Run some analysis on the GPU

• Requires data transfer from CPU memory to GPU memory

• Host variables (examples):
  x, y, z
  h_x, h_y, h_z
  x_host, y_host, z_host

• Device variables (examples):
  dev_x, dev_y, dev_z
  x_dev, y_dev, z_dev
  d_x, d_y, d_z
  x_d, y_d, z_d

[Diagram: host variables (x, y, z) and device variables (x_dev, y_dev, z_dev) on either side of PCI Express]

Example: Z = X + Y

[Diagram: host variables (x, y, z) and device variables (x_dev, y_dev, z_dev) on either side of PCI Express; on the device, each element is computed independently:]

Z_dev[0] = X_dev[0] + Y_dev[0]
Z_dev[1] = X_dev[1] + Y_dev[1]
Z_dev[2] = X_dev[2] + Y_dev[2]
Z_dev[3] = X_dev[3] + Y_dev[3]
Z_dev[4] = X_dev[4] + Y_dev[4]
Z_dev[5] = X_dev[5] + Y_dev[5]
…
Z_dev[n-1] = X_dev[n-1] + Y_dev[n-1]

Example recipe for data transfer

1. Declare both host & device variables
2. Initialise on host, copy to device
3. Perform work on the device (CUDA kernel)
4. Copy result back to host
5. Print / post-process / save to file from host

CUDA recipes
© High End Compute Ltd

(with permission solely for
COMP528, U/Liverpool)

1. Declare & alloc both host & device variables
2. Initialise on host, copy to device
3. Perform work on the device (CUDA kernel)

int main(int argc, char *argv[]) {

    float *z, *x, *y;
    int num, i;

    // memory for device (GPU)
    float *x_dev, *y_dev, *z_dev;

    // set up arrays
    num = (int) atoi(argv[1]);
    printf("Forming z=x+y for dimension %d\n", num);
    x = (float *) malloc(num*sizeof(*x));
    y = (float *) malloc(num*sizeof(*y));
    z = (float *) malloc(num*sizeof(*z));

    // set up device memory
    cudaMalloc(&x_dev, num*sizeof(float));
    cudaMalloc(&y_dev, num*sizeof(float));
    cudaMalloc(&z_dev, num*sizeof(float));

    // init vars (the slide's values were lost in extraction; these are illustrative)
    for (i=0; i<num; i++) {
        x[i] = (float) i;
        y[i] = (float) (num - i);
    }

    // copy inputs from host to device
    cudaMemcpy(x_dev, x, num*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y, num*sizeof(float), cudaMemcpyHostToDevice);

    // launch shape (the ceiling-division idiom shown earlier; values illustrative)
    int threadsPerBlock = 32;
    int blks = (num + threadsPerBlock - 1) / threadsPerBlock;
    cuda_kernel <<<blks, threadsPerBlock>>> (z_dev, x_dev, y_dev, num);

    // copy results back
    cudaMemcpy(z, z_dev, num*sizeof(float), cudaMemcpyDeviceToHost);

    // see if any errors
    cudaError err = cudaGetLastError();
    if ( err != cudaSuccess) {
        printf("(1) CUDA RT error: %s \n", cudaGetErrorString(err));
    }

    // use results in 'z' on local host

    /* --- CLEAN UP GPU --- */
    // release device memory
    cudaFree(x_dev); cudaFree(y_dev); cudaFree(z_dev);
}
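
As a usage note (file and binary names assumed, not from the slides): saved as, say, vecadd.cu, the program compiles with NVIDIA's nvcc compiler (e.g. "nvcc -o vecadd vecadd.cu") and takes the array dimension as its single command-line argument (e.g. "./vecadd 1000").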


Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane