Architectures
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk
Room SLB1004, SLB
Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk
Room INB2221, INB
Week  W/C    Lecture                           Workshop
1     23/01  Introduction                      –
2     30/01  Architectures                     Tutorial-1
3     06/02  Patterns 1
4     13/02  Patterns 2                        Tutorial-2
5     20/02  Patterns 3
6     27/02  Patterns 4                        Tutorial-3
7     06/03  Communication & Synchronisation
8     13/03  Algorithms 1                      Assessment support
9     20/03  Algorithms 2
10    27/03  Algorithms 3
11    03/04  Performance & Optimisation        Tutorial-4 & Demonstration
12    24/04  Parallel Libraries
13    01/05  –
Parallel Hardware – overview and classification
Graphics Processing Units
OpenCL programming model
Different solutions and programmer support
• pipelines, vector instructions in the CPU
  • limited access by the programmer; built in or provided through the compiler (implicit parallelism)
• multi-core CPUs and multi-processors
  • OS level, multithreading libraries (e.g. Boost.Thread) – see the sketch after this list
• dedicated parallel processing units (e.g. GPU)
  • libraries with different levels of granularity (e.g. OpenCL, Boost.Compute)
• distributed systems
  • distributed parallel libraries (e.g. MPI)
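As a concrete illustration of the OS-level multithreading above, here is a minimal sketch using POSIX threads (Boost.Thread offers the same model in C++); the array contents, thread count, and function names are illustrative only:

    #include <pthread.h>
    #include <stdio.h>

    /* Two threads sum their own halves of a shared array in parallel. */
    #define N 8
    static int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    static int partial[2];

    static void* sum_half(void* arg) {
        int t = *(int*)arg;                     /* thread index: 0 or 1 */
        for (int i = t * N / 2; i < (t + 1) * N / 2; i++)
            partial[t] += data[i];
        return NULL;
    }

    int main(void) {
        pthread_t threads[2];
        int ids[2] = {0, 1};
        for (int t = 0; t < 2; t++)
            pthread_create(&threads[t], NULL, sum_half, &ids[t]);
        for (int t = 0; t < 2; t++)
            pthread_join(threads[t], NULL);
        printf("sum = %d\n", partial[0] + partial[1]);   /* prints 36 */
        return 0;
    }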
Flynn's taxonomy classifies multi-processor computer architectures according to 2 independent dimensions:
• Instruction Stream
• Data Stream
Each dimension can have only one of two possible states:
• Single
• Multiple
This gives 4 possible classifications.
SISD – Single Instruction, Single Data
A serial computer – the standard von Neumann model
Single Instruction
• only one instruction stream is being acted on by the CPU during any one clock cycle
Single Data
• only one data stream is being used as input during any one clock cycle
Example
• traditional single-processor/single-core CPUs
• good for real-time applications
SIMD – Single Instruction, Multiple Data
A type of parallel computer
Single Instruction
• all processing units execute the same instruction in a given clock cycle
Multiple Data
• each processing unit can operate on a different data element
Notes
• modern CPUs (vector instructions) and GPUs – the focus of this module!
• best for problems with a high degree of regularity, e.g. image processing
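A sketch of the implicit SIMD parallelism mentioned above: every iteration applies the same instructions to a different pixel, so an optimising compiler can map the loop onto vector instructions without any explicit parallel code (the function name and clamping behaviour are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Brighten a greyscale image: one instruction stream (add, clamp),
       many data elements – exactly the SIMD pattern. Compilers typically
       auto-vectorise this loop at -O2/-O3. */
    void brighten(uint8_t* pixels, size_t n, int offset) {
        for (size_t i = 0; i < n; i++) {
            int v = pixels[i] + offset;
            pixels[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
    }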
MIMD – Multiple Instruction, Multiple Data
A type of parallel computer
Multiple Instruction
• every processor may be executing a different instruction stream
Multiple Data
• every processor may be working with a different data stream
Notes
• the most common type of parallel computer: multi-core CPUs, computing clusters and grids
• many MIMD architectures also include SIMD execution sub-components
MISD – Multiple Instruction, Single Data
A type of parallel computer
Multiple Instruction
• each processing unit operates on the data independently via separate instruction streams
Single Data
• a single data stream is fed into multiple processing units
Notes
• example: multiple cryptography algorithms attempting to crack a single coded message
• a very uncommon architecture with rare practical applications
Shared
• multiple processors can operate independently but share the same memory – a global address space
• changes a processor makes to a memory location are visible to all other processors

Distributed
• processors have their own local memory and operate independently
• memory addresses in one processor do not map to another processor – no global address space
Shared
• Pros
  • the global address space is easy to use/program
  • data sharing between tasks is fast due to the proximity of memory to CPUs
• Cons
  • adding more CPUs can increase traffic on the shared memory–CPU path
  • the programmer is responsible for synchronization constructs that ensure "correct" access to global memory (see the sketch after this list)
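A minimal sketch of that synchronization responsibility, using POSIX threads (the counter and iteration count are illustrative): two threads update one location in the shared address space, and the mutex serialises each read-modify-write so the result is deterministic; without it the increments race.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                 /* shared global memory */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void* work(void* arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);       /* enforce "correct" access */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 with the lock */
        return 0;
    }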
Distributed
• Pros
  • each processor can rapidly access its own memory without interference
  • memory is scalable: increase the number of processors and the total memory size increases with them
• Cons
  • the programmer is responsible for many of the details of data communication between processors (see the MPI sketch after this list)
  • non-uniform memory access times
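A minimal sketch of that explicit communication with MPI (mentioned earlier as a distributed parallel library); with no global address space, rank 0 must send its value and rank 1 must receive it into its own local memory. Run with at least two processes, e.g. mpirun -np 2:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = 0;                       /* each rank has its own copy */
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }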
Computer Architecture
• Serial
  • SISD – single-core CPU
• Parallel
  • SIMD – GPUs (typically), shared-memory
  • MISD – rare
  • MIMD – shared-memory multi-processors; distributed-memory multi-computers
Graphics Processing Units
Designed for manipulating computer graphics and image processing
Highly parallel structure
More efficient than CPUs when processing large blocks of visual data in parallel
Different realisations
• dedicated expansion video card (e.g. Radeon HD 7970)
• integrated into the CPU die (e.g. Intel® Core™ i7-4710MQ)
• embedded on the motherboard
1000× complexity since 1995 – Moore's Law at work (1997–2010 timeline):
• RIVA 128 – 3M transistors
• GeForce 256 – 23M transistors
• GeForce 3 – 60M transistors
• GeForce FX – 125M transistors
• GeForce 8800 – 681M transistors
• GeForce 580 GTX – 3B transistors
• Titan X – ~12B transistors
General-purpose computing on GPUs (GPGPU)
Allows the massively parallel hardware of the GPU to be exploited for applications other than graphics and image processing
Typically used where large datasets and complex computations are required
• everywhere large vectors/matrices are used
• physics simulation, AI, weather forecasting
Required some adaptation of the hardware (shaders, texture units, floating-point arithmetic) so that standard code could be executed
Also drove dedicated software frameworks that hide the graphics-specific functionality from the programmer (e.g. CUDA)
Latency – time to solution
• minimise time, at the expense of power
Throughput – quantity of tasks processed per unit of time
• minimise energy per operation
CPU
• optimised for low-latency computations
• large caches (quick access to data) and control unit (out-of-order execution)
• fewer ALUs
• good for real-time applications

GPU
• optimised for data-parallel, high-throughput computations
• smaller caches
• more transistors dedicated to computation
• good if there is enough work to hide latency
[Diagram: a CPU as a latency processor vs a GPU as a throughput processor; the GPU shown has 60 SMs (streaming multiprocessors) and 3840 cores]
Terminology:
• Host – the CPU and its memory (host memory)
• Device – the GPU and its memory (device memory)
OpenCL is a heterogeneous model
• one host and many devices (GPUs, FPGAs, CPUs)
Data-parallel portions of an algorithm are executed on the device(s) as kernels
The host defines a context to control the devices, kernels, and memory objects
Only one kernel is executed at a time on a particular device (SIMD)
Typical steps
• query and select the platforms and devices
• create a context to control devices, kernels, and memory objects
• create a command queue to schedule commands for execution on a device
• write to the device
• launch the kernel
• read from the device
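A minimal host-side sketch of these steps, using the OpenCL C API and the vector-addition kernel from the next slide; all error checking is omitted for brevity (a real program should test every returned status code), and the problem size is illustrative:

    #include <CL/cl.h>
    #include <stdio.h>

    static const char* src =
        "__kernel void add(__global const int* A,"
        "                  __global const int* B,"
        "                  __global int* C) {"
        "    int id = get_global_id(0);"
        "    C[id] = A[id] + B[id];"
        "}";

    int main(void) {
        enum { N = 1024 };
        int A[N], B[N], C[N];
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

        /* query and select a platform and device */
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        /* create a context and a command queue */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* build the kernel from source */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "add", NULL);

        /* write input data to device memory */
        cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(A), NULL, NULL);
        cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(B), NULL, NULL);
        cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(C), NULL, NULL);
        clEnqueueWriteBuffer(queue, dA, CL_TRUE, 0, sizeof(A), A, 0, NULL, NULL);
        clEnqueueWriteBuffer(queue, dB, CL_TRUE, 0, sizeof(B), B, 0, NULL, NULL);

        /* launch the kernel: one work item per vector element */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &dA);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &dB);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &dC);
        size_t global = N;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);

        /* read the result back from device memory */
        clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, sizeof(C), C, 0, NULL, NULL);
        printf("C[10] = %d\n", C[10]);   /* expect 30 */
        return 0;
    }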
Execution hierarchy: Processing Element (PE) → Compute Unit (CU) → Device
• a Work Item is executed on a Processing Element
• a Workgroup is executed on a Compute Unit
• a problem/program is executed on a Device
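Inside a kernel, each work item can query its place in this hierarchy using the standard OpenCL C built-ins (the kernel name and output scheme are illustrative):

    /* For a 1D index space, the global id decomposes into
       group id × workgroup size + local id. */
    __kernel void where_am_i(__global int* out) {
        int gid = get_global_id(0);   /* position in the whole problem (Device) */
        int lid = get_local_id(0);    /* position within the Workgroup (CU) */
        int grp = get_group_id(0);    /* which Workgroup this item belongs to */
        out[gid] = grp * (int)get_local_size(0) + lid;   /* equals gid */
    }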
Task: C = A + B
A, B, C – vectors, N – number of elements

Serial implementation in C:

    void add(const int* A, const int* B, int* C, int N) {
        for (int id = 0; id < N; id++)
            C[id] = A[id] + B[id];
    }

Equivalent OpenCL kernel:

    __kernel void add(__global const int* A,
                      __global const int* B,
                      __global int* C) {
        int id = get_global_id(0);
        C[id] = A[id] + B[id];
    }

Recommended reading:
• Structured Parallel Programming: Patterns for Efficient Computation – Section 2.4, Machine Models
• Heterogeneous Computing with OpenCL – Chapters 1 and 2, Background and Introduction to OpenCL