COMP528 HAL3X accelerators (OpenCL and FPGA)
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
3X – HAL
OpenCL
• much more low-level than directive-based approaches
• perhaps even more low-level than CUDA
• but portable
• get the OpenCL compiler
• compile for various targets
• Devotees of OpenCL claim good & portable performance
• targets: CPUs, GPUs, Xeon Phi, FPGAs (see the device-query sketch below)
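As an illustration of that portability, here is a minimal sketch (not from the slides) that simply enumerates whatever OpenCL devices the system exposes; the same kernel source could then be built at run time for any of them. It assumes the OpenCL headers and library are installed; the array sizes and printed format are arbitrary choices.

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
  // Enumerate every platform, then every device on it: the same OpenCL
  // source can later be built at run time for whichever device is chosen.
  cl_platform_id platforms[8];
  cl_uint nplat = 0;
  clGetPlatformIDs(8, platforms, &nplat);

  for (cl_uint p = 0; p < nplat; p++) {
    cl_device_id devices[8];
    cl_uint ndev = 0;
    // CL_DEVICE_TYPE_ALL covers CPUs, GPUs and accelerators (e.g. FPGAs, Xeon Phi)
    clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);
    for (cl_uint d = 0; d < ndev; d++) {
      char name[256];
      clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
      printf("platform %u, device %u: %s\n", p, d, name);
    }
  }
  return 0;
}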
Principles of OpenCL
• OpenCL compiler + OpenCL run-time library
• “Open Computing Language”: an open standard from the Khronos Group and partners, 2008 onwards
• context
• defining the actual run-time hardware
• a “compute device” has 1 or more “compute units”
• a “compute unit” comprises several “processing elements”
• e.g. a CPU has several cores; an NVIDIA GPU has several Streaming Multiprocessors
• kernels to run on processing elements
• kernels are compiled at runtime (to give portability), cf. Java JIT compilation
• kernels queued up in “command queue”
– an NDRange (1, 2 or 3 dimensions) assigns work-items to processing elements (see the host-side sketch below)
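A minimal host-side sketch (my own, not from the slides) tying these pieces together: context, command queue, run-time compilation of the kernel, device buffers, and a 1D NDRange launch with one work-item per matrix row. It assumes an OpenCL 1.2-style API and the matvec kernel of the next slide embedded as a source string; error checking is omitted for brevity.

#include <stdlib.h>
#include <CL/cl.h>

// matvec kernel (as on the next slide) held as a string: compiled at run time
static const char *src =
  "__kernel void matvec(__global const float *A, __global const float *x,\n"
  "                     uint ncols, __global float *y) {\n"
  "  size_t i = get_global_id(0);\n"
  "  float sum = 0.0f;\n"
  "  for (size_t j = 0; j < ncols; j++) sum += A[i*ncols+j] * x[j];\n"
  "  y[i] = sum;\n"
  "}\n";

int main(void) {
  const size_t nrows = 1024, ncols = 1024;
  float *A = malloc(sizeof(float)*nrows*ncols);   /* ...initialise... */
  float *x = malloc(sizeof(float)*ncols);         /* ...initialise... */
  float *y = malloc(sizeof(float)*nrows);
  cl_int err;

  // context: one (default) compute device
  cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
  cl_device_id   dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
  cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);

  // command queue: kernels and data transfers are enqueued here
  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

  // kernel compiled at run time (this is where the portability comes from)
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
  clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
  cl_kernel k = clCreateKernel(prog, "matvec", &err);

  // device buffers, initialised from host memory
  cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             sizeof(float)*nrows*ncols, A, &err);
  cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             sizeof(float)*ncols, x, &err);
  cl_mem dy = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(float)*nrows, NULL, &err);

  // kernel arguments match the kernel's parameter list
  cl_uint nc = (cl_uint)ncols;
  clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
  clSetKernelArg(k, 1, sizeof(cl_mem), &dx);
  clSetKernelArg(k, 2, sizeof(cl_uint), &nc);
  clSetKernelArg(k, 3, sizeof(cl_mem), &dy);

  // 1D NDRange: one work-item per row, mapped onto the processing elements
  size_t global = nrows;
  clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

  // blocking read of the result back to the host
  clEnqueueReadBuffer(q, dy, CL_TRUE, 0, sizeof(float)*nrows, y, 0, NULL, NULL);

  clReleaseMemObject(dA); clReleaseMemObject(dx); clReleaseMemObject(dy);
  clReleaseKernel(k); clReleaseProgram(prog);
  clReleaseCommandQueue(q); clReleaseContext(ctx);
  free(A); free(x); free(y);
  return 0;
}

Note how little of this is specific to the matvec problem: almost everything except the kernel source and the buffer sizes is the re-usable “boiler plate” discussed below.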
OpenCL kernel: matrix-vector multiplication (c/o Wikipedia)
// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
uint ncols, __global float *y)
{
size_t i = get_global_id(0); // Global id, used as the row index
__global float const *a = &A[i*ncols]; // Pointer to the i’th row
float sum = 0.f; // Accumulator for dot product
  for (size_t j = 0; j < ncols; j++) {
    sum += a[j] * x[j];
  }
  y[i] = sum;
}
• cf. threadIdx.x in CUDA
• cf. CUDA __global__, __device__ and __shared__ attributes
OpenCL “boiler plate”
• https://en.wikipedia.org/wiki/OpenCL
A personal view…
• OpenCL
• complex
• but re-usable (amend as appropriate!) boiler plate
• set up context
• more portable than CUDA
• open source not proprietary
• now also supported by NVIDIA
• targets include Intel FPGA (Altera)
• your OpenCL for GPU will just run on FPGA
• but you will need to tune it!
FPGA
Field Programmable Gate Arrays
• FPGA used frequently for embedded devices
• (increasingly) for HPC also
• can program the logic
• will do one specific thing
• but can re-program
• unlike ASIC
• so more specialised than a GPU: can be more performant, but harder to program
• CPU
• general purpose, easy to program
• data and/or task parallelism
• GPU
• good for a range of specifics, okay to program
• data parallelism
• FPGA
• reprogrammable - can be very good but only for what you program
• hard to program
• dataflow
• ASIC
• burning very specific logic gates, once (no reprogramming)
• can be very, very good but only for that single app
• extremely expensive
FPGA
• Host code, offload kernel to FPGA
• Offloading via use of OpenCL
• Hand-write / use “high level synthesis” tools
• FPGA kernel
• Can write in C (or Java)
• Data flow
• Performance achieved by deep pipelines, many pipelines
• Maximising the available logic gates
• Circumventing any ‘stalls’ (one logic path longer than another)
• Hierarchical memory transfers
• Making good use of available “IP core” / libraries
• Xilinx “university program”
• Student projects: fintech, linear algebra, CFD
• Energy-efficient, performant computing
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane