
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk

Room SLB1004, SLB

Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk

Room INB2221, INB

Week  W/C    Lecture                            Workshop
1     23/01  Introduction                       –
2     30/01  Architectures                      Tutorial-1
3     06/02  Patterns 1
4     13/02  Patterns 2                         Tutorial-2
5     20/02  Patterns 3
6     27/02  Patterns 4                         Tutorial-3
7     06/03  Communication & Synchronisation
8     13/03  Algorithms 1                       Assessment support
9     20/03  Algorithms 2
10    27/03  Algorithms 3
11    03/04  Performance & Optimisation         Tutorial-4 & Demonstration
12    24/04  Parallel Libraries
13    01/05  –

Parallel Hardware – overview and classification

Graphics Processing Units

OpenCL programming model

Different solutions and programmer support

• pipelines, vector instructions in CPU
  • limited access by a programmer, built-in or through the compiler (implicit parallelism)

• multi-core CPUs and multi-processors
  • OS level, multithreading libraries (e.g. Boost.Thread)

• dedicated parallel processing units (e.g. GPU)
  • libraries with different levels of granularity (e.g. OpenCL, Boost.Compute)

• distributed systems
  • distributed parallel libraries (e.g. MPI)

Classifies multi-processor computer architectures
according to 2 independent dimensions of:

• Instruction Stream

• Data Stream

Each dimension can have only one of two possible states:

• Single

• Multiple

4 possible classifications: SISD, SIMD, MISD, MIMD

A serial computer – the standard von Neumann model

Single Instruction
• only one instruction stream is being acted on by the CPU during one clock cycle

Single Data
• only one data stream is being used as input during one clock cycle

Example
• traditional single-processor/single-core CPUs
• good for real-time applications

A type of parallel computer

Single Instruction
• all processing units execute the same instruction at any given clock cycle

Multiple Data
• each processing unit can operate on a different data element

Notes
• modern CPUs (vector instructions) and GPUs – the focus of this module!
• best for problems with a high degree of regularity, e.g. image processing
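As an illustrative sketch (not part of the original slides), the C function below uses x86 SSE intrinsics so that a single instruction adds four data elements at once; it assumes an x86 CPU with SSE, a compiler providing <immintrin.h>, and that N is a multiple of 4. The name add_simd is only an example.

#include <immintrin.h>

/* SIMD on the CPU: one instruction (_mm_add_ps) operates on four
   data elements at the same time. Assumes N is a multiple of 4. */
void add_simd(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);   /* load 4 floats from A */
        __m128 b = _mm_loadu_ps(&B[i]);   /* load 4 floats from B */
        __m128 c = _mm_add_ps(a, b);      /* 4 additions in one instruction */
        _mm_storeu_ps(&C[i], c);          /* store 4 results to C */
    }
}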

A type of parallel computer

Multiple Instruction
• every processor may be executing a different instruction stream

Multiple Data
• every processor may be working with a different data stream

Notes
• the most common type of parallel computer: multi-core CPUs, computing clusters and grids
• many MIMD architectures also include SIMD execution sub-components
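As a minimal sketch (not from the slides) of MIMD execution on a multi-core CPU, the C program below runs two different instruction streams (a sum and a maximum) on two different data sets using POSIX threads; the function names are only examples.

#include <pthread.h>
#include <stdio.h>

/* MIMD: two threads execute different instruction streams,
   each on its own data stream. */
void* sum_task(void* arg) {
    const int* data = (const int*)arg;
    int s = 0;
    for (int i = 0; i < 4; i++) s += data[i];
    printf("sum = %d\n", s);
    return NULL;
}

void* max_task(void* arg) {
    const int* data = (const int*)arg;
    int m = data[0];
    for (int i = 1; i < 4; i++) if (data[i] > m) m = data[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {7, 5, 9, 2};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, a);
    pthread_create(&t2, NULL, max_task, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}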

A type of parallel computer

Multiple Instruction
• each processing unit operates on the data independently via separate instruction streams

Single Data
• a single data stream is fed into multiple processing units

Notes
• example: multiple cryptography algorithms attempting to crack a single coded message
• very uncommon architecture with rare applications

Shared
• multiple processors can operate independently but share the same memory – a global address space
• changes made to a memory location by one processor are visible to all other processors

Distributed
• processors have their own local memory and operate independently
• memory addresses in one processor do not map to another processor – there is no global address space

Shared

• Pros
  • global address space is easy to use/program
  • data sharing between tasks is fast due to the proximity of memory to the CPUs

• Cons
  • adding more CPUs can increase traffic on the shared memory–CPU path
  • the programmer is responsible for synchronisation constructs that ensure “correct” access to global memory (see the sketch below)
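As a minimal illustration (not part of the slides) of such a synchronisation construct, the C program below uses POSIX threads and a mutex so that concurrent updates to a variable in shared memory stay correct; the names are only examples.

#include <pthread.h>
#include <stdio.h>

/* All threads see the same `counter` (shared memory), so updates
   must be protected by a mutex to remain "correct". */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void* arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;                    /* safe update of shared data */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 with the mutex in place */
    return 0;
}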

Distributed

• Pros
  • each processor can rapidly access its own memory without interference
  • memory is scalable – increasing the number of processors also increases the total memory

• Cons
  • the programmer is responsible for many of the details associated with data communication between processors
  • non-uniform memory access times
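For contrast, a minimal sketch (not from the slides) of distributed-memory programming with MPI: each process owns its local memory and data moves only through explicit messages. It assumes an MPI implementation is installed; the message contents are arbitrary.

#include <mpi.h>
#include <stdio.h>

/* Distributed memory: no global address space – rank 0 must
   explicitly send its value to rank 1. */
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}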

Computer Architecture
• Serial
  • SISD – single-core CPU
• Parallel
  • SIMD – GPUs (typically shared-memory)
  • MISD – rare
  • MIMD
    • shared-memory multi-processors
    • distributed-memory multi-computers

Designed for manipulating computer
graphics and image processing

Highly parallel structure

More efficient than CPUs when processing
large blocks of visual data in parallel

Different realisations
• dedicated expansion video card
• integrated into the CPU die
• embedded on the motherboard

Examples: Radeon HD 7970 (dedicated card), Intel® Core™ i7-4710MQ (CPU with integrated graphics)

• ~1000x increase in complexity since 1995
• Moore’s Law at work

GPU transistor counts over time (1997–2010 and beyond):
• RIVA 128 – 3M transistors
• GeForce 256 – 23M transistors
• GeForce 3 – 60M transistors
• GeForce FX – 125M transistors
• GeForce 8800 – 681M transistors
• GeForce 580 GTX – 3B transistors
• Titan X – ~12B transistors

Allows the massively parallel hardware of the GPU to be exploited for applications other than graphics and image processing

Typically used where large datasets and complex computations are required
• everywhere large vectors/matrices are used
• physics simulation, AI, weather forecasting

Required some adaptation of the hardware (shaders, texture units, floating-point arithmetic) so that standard code could be executed

But also dedicated software frameworks that hide the graphics-specific functionality from the programmer (e.g. CUDA)

Latency
• time to solution
• minimise time, at the expense of power

Throughput
• quantity of tasks processed per unit of time
• minimise energy per operation

CPU
• optimised for low-latency computations
• large caches (quick access to data) and control unit (out-of-order execution)
• fewer ALUs
• good for real-time applications

GPU
• optimised for data-parallel, high-throughput computations
• smaller caches
• more transistors dedicated to computation
• good if there is enough work to hide latency

GPU organisation: 60 SMs (streaming multiprocessors), 3840 cores

CPU – latency processor; GPU – throughput processor

Terminology:

• Host – the CPU and its memory (host memory)

• Device – the GPU and its memory (device memory)

OpenCL is a heterogeneous model
• one host and many devices (GPUs, FPGAs, CPUs)

Data-parallel portions of an algorithm are executed
on the device(s) as kernels

The host defines a context to control the devices,
kernels, and memory objects

Only one kernel is executed at a time on a particular
device (SIMD)

Typical steps (sketched in host code after this list)
• query and select the platforms and devices

• create a context to control devices, kernels, and memory
objects

• create a command queue to schedule commands for
execution on device

• write to device

• launch the kernel

• read from device
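As an illustrative sketch (not part of the original slides), the plain-C host program below walks through these steps with the standard OpenCL C API, using the vector-add kernel shown at the end of this lecture; error checking is omitted for brevity and the variable names are only examples.

#include <CL/cl.h>
#include <stdio.h>

/* Illustrative kernel source for C = A + B (see the kernel slide later). */
static const char* src =
    "__kernel void add(__global const int* A, __global const int* B,"
    "                  __global int* C) {"
    "  int id = get_global_id(0);"
    "  C[id] = A[id] + B[id];"
    "}";

int main(void) {
    enum { N = 1024 };
    int A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    /* 1. query and select a platform and device */
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* 2. create a context to control devices, kernels and memory objects */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* 3. create a command queue to schedule commands on the device */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* build the kernel and create device buffers */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "add", NULL);
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(A), NULL, NULL);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(B), NULL, NULL);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(C), NULL, NULL);

    /* 4. write input data to the device */
    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, sizeof(A), A, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, sizeof(B), B, 0, NULL, NULL);

    /* 5. launch the kernel – one work item per vector element */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);
    size_t global = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* 6. read the result back from the device (blocking read) */
    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, sizeof(C), C, 0, NULL, NULL);

    printf("C[10] = %d\n", C[10]);   /* expect 30 */
    return 0;
}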

Processing Element (PE)

Compute Unit (CU)

Device

A Work Item is executed on a Processing Element

A Workgroup on a Compute Unit

A problem/program is executed on a Device
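As a small illustration (not from the slides), the hypothetical kernel below records the built-in OpenCL index functions that expose this hierarchy: each work item (running on a processing element) has a global ID within the whole problem, a local ID within its workgroup (which runs on a compute unit), and a group ID identifying that workgroup.

__kernel void show_ids(__global int* global_ids,
                       __global int* local_ids,
                       __global int* group_ids) {
    int gid = get_global_id(0);         /* position within the whole problem (NDRange) */
    global_ids[gid] = gid;              /* work item -> processing element */
    local_ids[gid]  = get_local_id(0);  /* position within the workgroup */
    group_ids[gid]  = get_group_id(0);  /* which workgroup -> which compute unit */
}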

Task: C = A + B

A, B, C – vectors, N – number of elements

void add(const int* A, const int* B, int* C, int N) {
    for (int id = 0; id < N; id++)
        C[id] = A[id] + B[id];
}

serial implementation in C

__kernel void add(__global const int* A, __global const int* B,
                  __global int* C) {
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}

equivalent OpenCL kernel

Reading
• Structured Parallel Programming: Patterns for Efficient Computation – Section 2.4 on Machine Models
• Heterogeneous Computing with OpenCL – Chapters 1 and 2, Background and Introduction to OpenCL