COMP528 HAL29: GPU threads, blocks, grids, Barkla, cudaMalloc, data transfers
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
29 – HAL
Why GPU?
• Doing the same thing, on a LOT of data items: weak but specific cores
• As quickly as possible: concurrently
• Apply this concurrency to numerical processing
• Different ways to program:
– directives
– CUDA
– OpenCL
Terminology
• Host: the CPU (and its memory)
• maybe several cores
• at given time, each core could be doing something different
• Device: the GPU (and its memory)
• many cores
• designed to run 1000s of threads, lightweight switching
• threads run same code (“kernel”)
• the host code will call N copies of the kernel, one running on each of N threads
• Threads
• lightweight
• each has a unique identifier: this drives how the work of the kernel is parallelised (see the sketch below)
• Threads and thread blocks are the main point to understand
• Each thread runs a copy of the kernel
• 1000s (potentially) of threads
=> 1000s of copies of the kernel
=> potentially 1000x speed-up
BUT they are “weak cores”, so each is less ‘speedy’ than a single CPU core
• the “potentially 1000x speed-up” is relative to using just 1 GPU thread (which is a very, very bad idea)
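As a minimal sketch of how a thread's unique identifier drives the parallelisation (the kernel and variable names here are illustrative, not from the lecture code):

  // each thread handles one element, selected by its unique global ID
  __global__ void scale(float *a, float factor, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // unique across the whole launch
    if (tid < n) a[tid] = factor * a[tid];             // guard: may be more threads than elements
  }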
Threads & Blocks
• but rather than think of threads, you need to think of threadblocks (or just “blocks”)
• threads in a block
• can be synchronised
• have access to memory shared between themselves
• threadblock shared memory (faster)
• also have access to the GPU “global memory” (slower)
• you define (at call to kernel – see later, and the sketch below)
• #threads in a block
• #blocks
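A minimal sketch of both ideas, assuming d_in and d_out are device arrays of length blks*threadsPerBlock (the kernel name and the within-block reversal are illustrative):

  __global__ void block_reverse(const float *d_in, float *d_out) {
    __shared__ float buf[128];                        // threadblock shared memory (fast, per block)
    int gid = threadIdx.x + blockIdx.x * blockDim.x;  // index into GPU "global memory" (slower)
    buf[threadIdx.x] = d_in[gid];
    __syncthreads();                                  // synchronise the threads of this block
    d_out[gid] = buf[blockDim.x - 1 - threadIdx.x];   // reverse within the block via shared memory
  }

  // at the kernel call you define the decomposition:
  int threadsPerBlock = 128;   // #threads in a block (must match buf[] above)
  int blks = 8;                // #blocks
  block_reverse<<<blks, threadsPerBlock>>>(d_in, d_out);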
NVIDIA GPU Hierarchy
Hardware
• CUDA core
• where threads run and the work is performed
• traditionally lots of support for integer & single precision (float)
• increasingly more double precision (double) support AND half-precision / int (for…?)
• Streaming Multiprocessor (SM, sometimes SMP)
• a number of CUDA cores
• the GPU
• a number of SMs
• “deviceQuery” (from the CUDA SDK examples) gives details of the actual GPU (see the sketch below)
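deviceQuery itself sits on top of the CUDA runtime API; a minimal sketch of querying the same details directly (compile with nvcc, which pulls in the runtime headers):

  #include <stdio.h>

  int main(void) {
    int count;
    cudaGetDeviceCount(&count);                  // how many GPUs are visible
    for (int d = 0; d < count; d++) {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, d);         // fill 'prop' with device d's details
      printf("device %d: %s with %d SMs\n", d, prop.name, prop.multiProcessorCount);
    }
    return 0;
  }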
Barkla – System Architecture
[Diagram: login nodes login1–login2 and interactive viz nodes viz01–viz02; compute/sci nodes node001…node136; high-memory nodes himem01–himem02; GPU nodes gpu01–gpu03…; Xeon Phi nodes phi01–phi04; compute nodes reached via srun / sbatch]
Users can “ssh” in to either a login node or an interactive viz node. If doing any GUI or debug runs, use one of the interactive viz nodes.
To do compute work (on CPU nodes, CPU nodes with extra RAM, nodes with batch GPUs, or the Xeon Phi nodes) you have to use the “SLURM” batch system: typically “sbatch” for batch jobs (a sketch of a batch script follows below) and “srun” for interactive runs on the compute nodes.
Login nodes are directly accessible; compute nodes are accessible only via the batch system.
Note that from outside of Barkla, the login nodes are “barkla4.liv.ac.uk” and “barkla5.liv.ac.uk”, and the interactive gpu nodes are “barkla6.liv.ac.uk” and “barkla7.liv.ac.uk”.
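A minimal sbatch script for the batch GPU nodes might look like this (the partition name “gpu” appears later in these slides; the module name and resource flags are assumptions – check Barkla's own documentation):

  #!/bin/bash
  #SBATCH -p gpu           # the "gpu" partition (batch GPU nodes)
  #SBATCH -N 1             # one node
  #SBATCH -t 00:10:00      # ten-minute time limit
  module load cuda         # assumed module name
  ./my_gpu_program         # assumed executable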
Barkla GPU
interactive viz nodes
• viz01, viz02
• 2 × “Quadro P4000”
Barkla GPU
batch gpu nodes
• via partition “gpu”
• e.g. “sinfo -p gpu”:
[mkbane@viz02[barkla] release]$ sinfo -p gpu
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 3-00:00:00 1 alloc gpu01
• deviceQuery (run on gpu01) gives the details of the batch GPUs
Barkla GPU: interactive vs batch
• interactive
• these GPUs are shared
• but you can always get access
• but they are less powerful than the batch GPUs
• batch
• (covid19: some resources reserved)
• exclusive use of a node, but you queue to get access
• but more powerful than the interactive/viz GPUs (see the srun sketch below)
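For an interactive session on a batch GPU node, something along these lines (the exact gres flag is an assumption; Barkla's configuration may differ):

  srun -p gpu --gres=gpu:1 --pty /bin/bash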
Threads & Blocks
Threads etc
• Hardware
• # CUDA cores in a SM
• # SM in a GPU
• 3 levels: CUDA cores, SM, GPU
• Software [CUDA runtime] mirrors this
• threads
• thread blocks (or “blocks”)
• kernel grid
[Image: https://en.wikipedia.org/wiki/Thread_block – thread blocks mapping onto CUDA cores and SMs]
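The software hierarchy shows up directly in the launch syntax; a minimal sketch (kernel and argument names illustrative) of a two-dimensional kernel grid of thread blocks:

  dim3 grid(4, 4);      // kernel grid: 4 x 4 = 16 thread blocks
  dim3 block(16, 16);   // each thread block: 16 x 16 = 256 threads
  my_kernel<<<grid, block>>>(d_data);   // 16 blocks x 256 threads = 4096 threads in total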
Threads, Blocks & Warps
• Definitive ref: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation
• For today…
• Thread blocks are the quanta to consider
• A block is placed on a single SM
• Can have multiple blocks per SM
• SM splits a thread block in to warps (32 threads)
• If a warp is blocked (e.g. waiting on a memory access), the SM switches to another warp – latency hiding
• Warps run the same instructions, in lockstep fashion (see the divergence sketch below)
KEY PRINCIPLE: threads run concurrently => parallelism
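A minimal sketch (illustrative kernel) of why lockstep matters – a branch that splits a warp makes the hardware execute both paths, each half of the warp idling in turn:

  __global__ void diverge(float *a) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid % 2 == 0) {           // splits every 32-thread warp down the middle:
      a[tid] = a[tid] * 2.0f;     // odd threads of the warp idle while this runs...
    } else {
      a[tid] = a[tid] + 1.0f;     // ...and even threads idle while this runs
    }
    // branching on (tid / 32) instead would keep whole warps together
  }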
CUDA
Steps to CUDA
1. Determine work that has inherent parallelism
2. Move (serial) work to a “kernel”
3. Invoke a parallel kernel by use of CUDA
Based upon “Steps to CUDA” © High End Compute Ltd
CUDA by Example: CUDA kernel
void serial_kernel(float *x, float *y, float *z, int num) {
  for (int i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
  }
}
start = clock();
serial_kernel(x, y, z, num);
finish = clock();

__global__ void cuda_kernel(float *x, float *y, float *z) {
  // parallel control via varying index
  int my_i = threadIdx.x + blockIdx.x*blockDim.x;
  z[my_i] = x[my_i] + y[my_i];
  // note there is NO 'for' loop over index
}
start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z);
finish = clock();
• requests “blks” thread blocks
• with “threadsPerBlock” threads per block
• and each thread running an instance of “cuda_kernel” (with the given args)
• each thread is in a block
• blocks are uniquely numbered by “blockIdx.x”
• each thread in a given block has a unique number “threadIdx.x”
• therefore “my_i” will be numbered 0,1,2,3… (each on a different thread, perhaps in a different thread block)
CUDA by Example: CUDA kernel
__global__ void cuda_kernel(float *x, float *y, float *z) {
  // parallel control via varying index
  int my_i = threadIdx.x + blockIdx.x*blockDim.x;
  z[my_i] = x[my_i] + y[my_i];
  // note there is NO ‘for’ loop over index
}
start = clock();
cuda_kernel <<<2, 5>>> (x, y, z);
finish = clock();
[Diagram: with <<<2, 5>>>, the ten threads each compute one element z[my_i] = x[my_i] + y[my_i], covering z0…z9]

block   thread   my_i = threadIdx.x + blockIdx.x*blockDim.x
0       0        0 + 0*5 = 0
0       1        1 + 0*5 = 1
0       2        2 + 0*5 = 2
0       3        3 + 0*5 = 3
0       4        4 + 0*5 = 4
1       0        0 + 1*5 = 5
1       1        1 + 1*5 = 6
1       2        2 + 1*5 = 7
1       3        3 + 1*5 = 8
1       4        4 + 1*5 = 9
What are we missing?
• getting data on to and off the GPU device
• “num”: the kernel never sees the array length, and the number of threads launched rarely equals num exactly
• e.g. num=100 with blks=3, threadsPerBlock=3 launches only 9 threads – too few
• e.g. num=120 with blks=10, threadsPerBlock=32 launches 320 threads – too many, so threads beyond the data must do nothing (a sketch of the usual recipe follows below)
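The usual recipe (a sketch; this round-up idiom is standard CUDA practice rather than anything in the slides) is to fix threadsPerBlock and round the number of blocks up, then guard inside the kernel:

  int threadsPerBlock = 32;
  int blks = (num + threadsPerBlock - 1) / threadsPerBlock;  // round up: blks*threadsPerBlock >= num
  // e.g. num=100 gives blks=4, i.e. 128 threads; threads with my_i 100..127
  // must do nothing, which is why the kernel needs num and the (my_i < num) test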
[Diagram: kernel running on GPU hardware, with data moving between host and device over PCI Express]
Sequence of data transfer slides © High End Compute Ltd (with permission solely for COMP528, U/Liverpool)
__global__ void cuda_kernel(float *x, float *y, float *z) {
  // parallel control via varying index
  int my_i = threadIdx.x + blockIdx.x*blockDim.x;
  z[my_i] = x[my_i] + y[my_i];
  // note there is NO ‘for’ loop over index
}
allocVars();
inputData();
preProcessData();
start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z);
finish = clock();
postProcessData();
saveResults();
Based upon “Steps to CUDA” © High End Compute Ltd
__global__ void cuda_kernel(float *x, float *y, float *z, int num)
{
// parallel control via varying index
int my_i = threadIdx.x + blockIdx.x*blockDim.x;
// handle my_i exceeds num
if (my_i < num) {
z[my_i] = x[my_i] + y[my_i];
}
}
Kernel will be off-loaded to the GPU.
When does the GPU start? and stop?
How does the GPU physically get the data?
start = clock();
cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z, num);
finish = clock();
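One part of the answer to “when does the GPU start and stop?”: the <<<...>>> launch is asynchronous, so the host reaches finish = clock() without waiting for the kernel. A sketch of timing that actually brackets the GPU work (cudaDeviceSynchronize is the standard runtime call; the surrounding names follow the slide):

  start = clock();
  cuda_kernel <<<blks, threadsPerBlock>>> (x, y, z, num);
  cudaDeviceSynchronize();   // block the host until the kernel (and prior GPU work) has completed
  finish = clock();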
Based upon “Steps to CUDA” © High End Compute Ltd
Memory Matters
• Typically, read/set data on the CPU
• Pre- and post- process data on the CPU
• Run some analysis on the GPU
• Requires data transfer from CPU memory to GPU memory
• Host variables (examples):
x, y, z
h_x, h_y, h_z
x_host, y_host, z_host
• Device variables (examples):
dev_x, dev_y, dev_z
x_dev, y_dev, z_dev
d_x, d_y, d_z
x_d, y_d, z_d
[Diagram: host variables x, y, z in CPU memory; device variables x_dev, y_dev, z_dev in GPU memory; the two connected via PCI Express]
Example: Z = X + Y
On the device, each element is computed independently:
z_dev[0] = x_dev[0] + y_dev[0]
z_dev[1] = x_dev[1] + y_dev[1]
…
z_dev[n-1] = x_dev[n-1] + y_dev[n-1]
Example recipe for data transfer
1. Declare both host & device variables
2. Initialise on host, copy to device
3. Perform work on the device (CUDA kernel)
4. Copy result back to host
5. Print / post-process / save to file from host
CUDA recipes © High End Compute Ltd (with permission solely for COMP528, U/Liverpool)
1. Declare & alloc both host & device variables
2. Initialise on host, copy to device
3. Perform work on the device (CUDA kernel)
int main(int argc, char *argv[]) {
  float *z, *x, *y;
  int num, i;
  // memory for device (GPU)
  float *x_dev, *y_dev, *z_dev;

  // set up arrays
  num = (int) atoi(argv[1]);
  printf("Forming z=x+y for dimension %d\n", num);
  x = (float *) malloc(num*sizeof(*x));
  y = (float *) malloc(num*sizeof(*y));
  z = (float *) malloc(num*sizeof(*z));

  // set up device memory
  cudaMalloc((void **) &x_dev, num*sizeof(float));
  cudaMalloc((void **) &y_dev, num*sizeof(float));
  cudaMalloc((void **) &z_dev, num*sizeof(float));

  // init vars on the host (example values; the slides elide this loop body)
  for (i=0; i<num; i++) {
    x[i] = (float) i;
    y[i] = 2.0f * (float) i;
  }

  // copy inputs from host to device
  cudaMemcpy(x_dev, x, num*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(y_dev, y, num*sizeof(float), cudaMemcpyHostToDevice);

  // launch enough blocks of 32 threads to cover num elements
  int threadsPerBlock = 32;
  int blks = (num + threadsPerBlock - 1) / threadsPerBlock;
  cuda_kernel<<<blks, threadsPerBlock>>>(x_dev, y_dev, z_dev, num);

  // copy results back
  cudaMemcpy(z, z_dev, num*sizeof(float), cudaMemcpyDeviceToHost);

  // see if any errors
  cudaError err = cudaGetLastError();
  if (err != cudaSuccess) {
    printf("(1) CUDA RT error: %s \n", cudaGetErrorString(err));
  }

  // use results in 'z' on local host

  /* --- CLEAN UP GPU --- */
  // release device memory
  cudaFree(x_dev); cudaFree(y_dev); cudaFree(z_dev);
}
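To build and run on a GPU node, something like the following (the source file name is an assumption; nvcc is the NVIDIA CUDA compiler driver):

  nvcc -o vecadd vecadd.cu
  ./vecadd 1000000          # num is read from argv[1]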
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane