XJCO3221 Parallel Computation

University of Leeds


Lecture 19: Task parallelism

Previous lectures
For much of this module we have considered loop-parallel problems: the same operation applied to multiple data sets.

In Lecture 13 we looked at a work pool in MPI:
- One processing unit, the main process, sent small independent tasks to all other units.
- These worker processes performed the task and sent the result back to the main process.

We referred to this as task parallelism, since the emphasis was on parallelising tasks rather than data.

Today’s lecture
Today we will look at task parallelism in more detail:
- How GPU command queues (OpenCL) or streams (CUDA) permit:
  - overlapping device and host computation;
  - overlapping host-device transfer and device computation.
- How OpenCL's events specify dependencies.
- Task graphs derived from these inter-dependencies.
- The work-span model, which estimates the maximum speed-up from a task graph.

Firstly, we will see how to time an OpenCL program using an event.

Timing kernels in OpenCL
Code on Minerva: timedReduction.c, timedReduction.cl, helper.h
For profiling purposes we often want to time how long a kernel takes to complete, to see if modifications make it faster. Timing in OpenCL is straightforward to achieve using events:

1. Ensure the command queue supports profiling.
2. Declare an event, and attach it to the kernel when it is enqueued.
3. Extract the time taken once the kernel has finished.

Some of the previous code examples already do this.

// Ensure the queue supports profiling.
cl_command_queue queue = clCreateCommandQueue(context, device,
    CL_QUEUE_PROFILING_ENABLE, &status);

// OpenCL event.
cl_event timer;

// Enqueue the kernel with timer as the last argument.
status = clEnqueueNDRangeKernel(queue, ..., &timer);

// ... (once the kernel has finished)
cl_ulong start, end;
clGetEventProfilingInfo(timer, CL_PROFILING_COMMAND_START,
    sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(timer, CL_PROFILING_COMMAND_END,
    sizeof(cl_ulong), &end, NULL);

printf("Time: %g ms\n", 1e-6*(cl_double)(end-start));
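To guarantee the kernel has indeed finished before the profiling counters are read, one option is an explicit wait on the same event (a minimal sketch; calling clFinish(queue) would achieve the same thing):

// Block until the kernel attached to timer has completed, so that
// CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are valid.
clWaitForEvents(1, &timer);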

Blocking communication
Recall that when we copy the data from device to host at the end of the calculation, we typically use a blocking call:

clEnqueueReadBuffer(queue, device_dot, CL_TRUE, ...);

The CL_TRUE denotes blocking: the routine does not return until the copy is complete. Similar to MPI_Recv() [cf. Lecture 9].

Replacing CL_TRUE with CL_FALSE makes this a non-blocking copy command¹:
- It will return 'immediately', before the copy is complete.
- Similar to MPI_Irecv() [cf. Lecture 12].
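As a sketch, the non-blocking variant with an event for later synchronisation might look like this (host_dot and the transfer size are hypothetical; device_dot is the buffer from the call above):

cl_event readEvent;

// Non-blocking read: returns before host_dot holds valid data.
clEnqueueReadBuffer(queue, device_dot, CL_FALSE, 0, sizeof(float),
    &host_dot, 0, NULL, &readEvent);

// ... other host work ...

clWaitForEvents(1, &readEvent);   // host_dot is now safe to use.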
¹In CUDA: use cudaMemcpyAsync() rather than cudaMemcpy().

Potential consequences of non-blocking
For this example, using a non-blocking copy can mean:

1. The check on the host fails, since the check code is reached before the data has been copied to the host.
2. The timing is meaningless, since the kernel has not yet finished.

Note that the read did not start until the kernel had finished: it was enqueued on the same command queue.

Overlapping host and device computation
The queuing model means we can perform calculations on the host (CPU) and device (GPU) simultaneously.
// Enqueue kernel.
clEnqueueNDRangeKernel(...);

// Perform useful operations
// on the host.

// Blocking copy device->host.
clEnqueueReadBuffer(...);

// Device and host in sync.
[Diagram: timeline of CPU computations alongside GPU kernel execution. The host enqueues the kernel, continues its own computations, then enqueues the read; once the data is copied, host and device are synchronised.]
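A fleshed-out sketch of this pattern (queue, kernel, deviceBuf, hostBuf, globalSize, numBytes and doHostWork() are hypothetical placeholders):

// Enqueue the kernel; returns immediately and the GPU starts work.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
    0, NULL, NULL);

// Useful work on the host while the kernel runs.
doHostWork();

// Blocking copy device->host; when it returns, host and device
// are in sync and hostBuf holds the results.
clEnqueueReadBuffer(queue, deviceBuf, CL_TRUE, 0, numBytes, hostBuf,
    0, NULL, NULL);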

Overlapping computation with communication
Recall from Lecture 12 that we can reduce latency by overlapping computation with communication.
1. Start communication with MPI_Isend()/MPI_Irecv().
2. Perform calculations.
3. Synchronise using MPI_Wait().

[Diagram: communication and computation timelines; MPI_Isend() starts the transfer, computation continues in the meantime, and MPI_Wait() synchronises at the end.]
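In MPI (C), the three steps might look like the following minimal sketch (buffer, count, dest, tag and doCalculations() are hypothetical):

MPI_Request request;

// 1. Start the non-blocking send.
MPI_Isend(buffer, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &request);

// 2. Perform calculations that do not touch buffer.
doCalculations();

// 3. Synchronise; buffer may be reused after this returns.
MPI_Wait(&request, MPI_STATUS_IGNORE);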
Similar benefits can be achieved on a GPU using multiple command queues (OpenCL) / streams (CUDA).

Multiple command queues: Example
Consider the following problem:
- We have two data arrays a and b.
- We need to execute kernelA on a and kernelB on b.
- These calculations are independent.

Suppose our device supports asynchronous copy, and simultaneous data transfer and kernel execution:
- Not guaranteed, although common in modern GPUs.
- May require the device to have direct access to host memory.
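In CUDA, this capability can be queried at runtime. A minimal sketch (device 0 is assumed; asyncEngineCount reports how many copy engines can run alongside kernel execution):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // Query device 0.

    // asyncEngineCount > 0 means transfers can overlap kernels.
    printf("Copy engines: %d\n", prop.asyncEngineCount);
    return 0;
}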

OpenCL with two command queues: Outline
// Initialise two command queues.
cl_command_queue queueA = clCreateCommandQueue(...);
cl_command_queue queueB = clCreateCommandQueue(...);

// Enqueue data transfers host->device (non-blocking).
clEnqueueWriteBuffer(queueA, ..., CL_FALSE, ...);
clEnqueueWriteBuffer(queueB, ..., CL_FALSE, ...);

// Enqueue both kernels.
clEnqueueNDRangeKernel(queueA, kernelA, ...);
clEnqueueNDRangeKernel(queueB, kernelB, ...);

// Enqueue data transfers device->host (blocking).
clEnqueueReadBuffer(queueA, ..., CL_TRUE, ...);
clEnqueueReadBuffer(queueB, ..., CL_TRUE, ...);

... // Process results; clear up.
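The same outline with the elisions filled in using hypothetical names (device_a, device_b, host_a, host_b, numBytes and globalSize are assumptions; error checking omitted):

// Two in-order queues on the same context and device.
cl_command_queue queueA = clCreateCommandQueue(context, device, 0, &status);
cl_command_queue queueB = clCreateCommandQueue(context, device, 0, &status);

// Non-blocking host->device transfers; the host continues immediately.
clEnqueueWriteBuffer(queueA, device_a, CL_FALSE, 0, numBytes, host_a, 0, NULL, NULL);
clEnqueueWriteBuffer(queueB, device_b, CL_FALSE, 0, numBytes, host_b, 0, NULL, NULL);

// Each kernel waits only for the write on its own (in-order) queue.
clEnqueueNDRangeKernel(queueA, kernelA, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queueB, kernelB, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

// Blocking device->host transfers; both queues have drained on return.
clEnqueueReadBuffer(queueA, device_a, CL_TRUE, 0, numBytes, host_a, 0, NULL, NULL);
clEnqueueReadBuffer(queueB, device_b, CL_TRUE, 0, numBytes, host_b, 0, NULL, NULL);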

Program logic / On the device

[Diagram: two panels. Left: the enqueue order on queueA and queueB (write a to device, write b to device, execute the kernels, read a from device, read b from device). Right: the device's copy and execute engines, with the transfers and kernel executions on the two queues overlapping in time.]

Events in queues and streams
Code on Minerva: taskGraph.c, taskGraph.cl, helper.h
Earlier we saw how an event can be used as a timer:

cl_event timer;

In general, events are used to define dependencies between kernels and data transfers. The last arguments of the enqueue commands are:

clEnqueue...(..., numWait, waitEvents, event);

- cl_uint numWait: the number of events to wait for.
- cl_event *waitEvents: the list of events to wait for.
- cl_event *event: used to identify when this operation completes.

Example (fragment)
Link together reads, writes and kernels on multiple queues; this is not necessary when using a single queue.
cl_event writeEvent, kernelEvent, readEvent;

// Non-blocking write host->device.
clEnqueueWriteBuffer(.., 0, NULL, &writeEvent);

// Enqueue kernel.
clEnqueueNDRangeKernel(.., 1, &writeEvent, &kernelEvent);

// Non-blocking read device->host.
clEnqueueReadBuffer(.., 1, &kernelEvent, &readEvent);

// Synchronise (wait for read to complete).
clWaitForEvents(1, &readEvent);   // Sim. to MPI_Wait().

Task graphs
Events are used to link two tasks when the first must complete before the second starts.

This gives a simple example of a task graph:
- Directed, acyclic graph.
- Nodes are tasks.
- Edges denote dependencies.
- The direction denotes which task must complete before the other begins.
[Diagram: the chain Write buffer → kernel → Read buffer, with writeEvent and kernelEvent linking the stages and readEvent feeding the final clWaitForEvents().]

Earlier task graphs
In fact, we have already seen examples of task graphs. In binary tree reduction, the arrows denote which calculations must be completed before continuing [cf. Lecture 11].

Example: we must know x and y before we can calculate z = x ⊗ y. (These two dependencies are highlighted in the diagram.)

Satisfying dependencies
However many processing units there are, the dependencies must be satisfied.

[Diagram: an invalid and a valid parallel schedule of the same task graph.]

This means you must always take the 'longest path'.

Work-span model
Consider the task graph on the right:
- 9 tasks (nodes).
- 13 dependencies (arrows).
- Assume all tasks take equal time.

The performance of a parallel program represented as a task graph can be estimated once the work and span have been identified.

Work and span
Definition
The work is the total time to complete all tasks. This corresponds to a serial machine with p = 1 processing unit.

Definition
The span is the time taken on an ideal machine with p = ∞:
- As many tasks execute in parallel as possible, given the dependencies.
- The span is the longest path through the graph, executed one task after another.
- Also called the critical path.

Span example
In this example, the longest chain of tasks from start to finish, given the dependencies, is 5. The span is therefore 5 tasks.

For unevenly sized tasks, the span would instead be measured in units of seconds, clock cycles, FLOPs, etc.

[Diagram: the task graph with the tasks along the critical path highlighted.]

Work-span model
Note that the work is just the serial execution time t_s. We have argued that the parallel execution time t_p can never become less than the span. There is therefore a maximum speed-up [cf. Lecture 4]:

    S = t_s / t_p ≤ t_s / t_{p=∞} = (work) / (span)

This is the work-span model:
- An upper limit¹ for S based purely on the task graph.
- S ≤ 9/5 = 1.8 for this example.
¹There is also a lower bound provided by Brent's lemma. R.P. Brent, J. ACM 21, 201 (1974).
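For reference, Brent's lemma bounds the runtime on p processing units as

    t_p ≤ (span) + ((work) - (span)) / p

so for this example with p = 2, t_2 ≤ 5 + (9 - 5)/2 = 7 task-times, which guarantees a speed-up of at least 9/7 ≈ 1.3.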

Superscalar sequences and futures
Some parallel frameworks schedule tasks based on dependencies specified by the programmer:
- OpenCL, as seen earlier in this lecture.
- OpenMP v4.0, and especially v4.5.
- Futures in C++11 and Java.

This is sometimes referred to as a superscalar sequence¹. The benefit is that you do not need to synchronise explicitly; the runtime system synchronises when necessary, based on the dependencies you provide.
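As an illustration, a minimal OpenMP sketch in C (OpenMP 4.0 or later); the variables and values are illustrative only. The runtime runs the second task after the first because of the depend clauses, with no explicit barrier between them:

#include <omp.h>

int main(void) {
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(x) depend(out: x)
        x = 1;                   // Task A produces x.

        #pragma omp task shared(x, y) depend(in: x)
        y = x + 1;               // Task B runs only after task A.

        #pragma omp taskwait     // Wait for both tasks to complete.
    }
    return (y == 2) ? 0 : 1;     // y is always 2 here.
}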
¹McCool et al., Structured Parallel Programming (Morgan Kaufmann, 2012).

Summary of GPGPU programming
Lec.  Content              Key points
14    GPGPU architectures  SIMD cores; CUDA and OpenCL; starting with OpenCL.
15    Threads and kernels  Host vs. device; data transfer and kernel launches; work items and work groups.
16    Memory types         Global, local, private and constant.
17    Synchronisation      Barriers; breaking up kernels; subgroups advancing in lockstep; divergence.
18    Atomics              Global and local atomics; compare-and-exchange; lock-free data structures.
19    Task parallelism     GPU queues/streams; events; task graphs, work and span.

Next lecture
This penultimate lecture is the last containing new material.

In the next and last lecture, we will summarise the material by parallel concept rather than by architecture:
- An alternative perspective, focussing on transferable insights.
- It also serves as a useful summary of the module material.
