Parallel Computing
with GPUs
Dr Paul Richmond
http://paulrichmond.shef.ac.uk/teaching/COM4521/
Assignment Feedback
Last Week
We learnt about warp-level CUDA
How threads are scheduled and executed
Impacts of divergence
Atomics: Good and bad…
Do the warp shuffle!
Parallel primitives
Scan and Reduction
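For recap, a warp-level sum reduction with shuffles looks roughly like the minimal sketch below (illustrative only, not last week's exact code; it uses the newer __shfl_down_sync, whereas the CUDA 7.0 toolkit used later in this lecture provided __shfl_down):

// Minimal recap sketch: sum 32 values within a warp using shuffles.
__inline__ __device__ int warp_reduce_sum(int val)
{
    // Each step halves the number of lanes still contributing to the sum.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 holds the warp's total
}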
Credits
The code and much of the content of this lecture is based on the GTC 2016 talk by C. Angerer and J. Progsch (NVIDIA):
S6112 – CUDA Optimisation with NVIDIA Nsight for Visual Studio
Provided by NVIDIA with thanks to Joe Bungo
Content has been adapted to use Visual Profiler Guided Analysis
where possible
Additional steps and analysis have been added
http://on-demand.gputechconf.com/gtc/2016/presentation/s6112-angerer-cuda-nsight-vse.pdf
Learning Objectives
Understand the key performance metrics of GPU code.
Understand profiling metrics and relate them to approaches you have already learnt for addressing limiting factors in your code.
Appreciate memory vs compute bound code and be able to
recognise factors which contribute to this.
Profiling Introduction
The Problem
Visual Profiler Guided Analysis
Iteration 1
Iteration 2
Iteration 3
Iteration 4
The APOD Cycle
1. Assess
• Identify Performance Limiter
• Analyze Profile
• Find Indicators
2. Parallelize
3. Optimize
3b. Build Knowledge
4. Deploy and Test
https://devblogs.nvidia.com/assess-parallelize-optimize-deploy/
CUDA Profiling Options
Visual Profiler (1st choice)
Standalone, cross-platform program (Java, built on Eclipse)
Guided performance analysis
Links to CUDA best practice guide
NVProf
Command line profiler
Results can be visualised in Visual Profiler
Visual Studio Nsight Profiler
Built into visual studio
Detailed kernel and source level analysis (more than Visual Profiler)
Unguided
Changes to your code
If you want to associate profile information with source lines
Compile with the -lineinfo argument
Works in release mode
Must flush GPU buffers
Call cudaDeviceReset() at the end of the program (see the sketch below)
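A minimal sketch of what this looks like in practice (file and kernel names are placeholders, not the course code):

// Build with source-line info so the profiler can map samples back to lines
// (works in release builds):  nvcc -O3 -lineinfo example.cu -o example
// Profile with e.g.:          nvprof --export-profile profile.nvprof ./example
#include <cuda_runtime.h>

__global__ void example_kernel(int *out)   // placeholder kernel
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    example_kernel<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    cudaDeviceReset();   // flush GPU profiling buffers at the end of the program
    return 0;
}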
Conveyor belt model
Our GPU program is like a factory assembly line
Data in and data out (in a new form)
Skilled operators (multiprocessors) doing work with chunks of the data
Both the belt and people have maximum operating speed
Ideal situation
Conveyor belt runs at full speed
Skilled operators always 100% busy
What is likely to affect this model?
Potential Performance Limiters
Memory
Program limited by memory bandwidth
Can’t get data to the registers on the device fast enough
Are you using lots of global memory but not faster local memory caches?
Have you exceeded the amount of cache available?
Compute
Memory bandwidth well below peak
GPU is too busy performing compute
Have you got high levels of divergence with low warp execution efficiency?
Latency
Poor occupancy = not enough active threads
Instruction execution stalls due to poor memory access patterns (sparse or poorly
used data)
Is problem size or block size too small? Are you using the memory bandwidth
effectively (cache line utilisation)?
Profiling Introduction
The Problem
Visual Profiler Guided Analysis
Iteration 1
Iteration 2
Iteration 3
Iteration 4
[Application pipeline: Grayscale → Blur → Edges]
Introducing the Application
Grayscale Conversion
// r, g, b: Red, green, blue components of the pixel p
foreach pixel p:
p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
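As a concrete illustration, a minimal CUDA kernel for this stage might look like the sketch below (the kernel name and the uchar3 input layout are assumptions, not the course's exact code):

// Sketch: one thread per pixel, same weighted sum as the pseudocode above.
__global__ void grayscale_v0(int w, int h, const uchar3 *rgb, unsigned char *gray)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h)
        return;
    const uchar3 p = rgb[y * w + x];
    gray[y * w + x] = (unsigned char)(0.298839f * p.x + 0.586811f * p.y + 0.114350f * p.z);
}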
Introducing the Application
Blur: 7×7 Gaussian Filter
Filter weights (outer product of the binomial row 1 6 15 20 15 6 1 with itself):
  1    6   15   20   15    6    1
  6   36   90  120   90   36    6
 15   90  225  300  225   90   15
 20  120  300  400  300  120   20
 15   90  225  300  225   90   15
  6   36   90  120   90   36    6
  1    6   15   20   15    6    1
Image from Wikipedia
Introducing the Application
foreach pixel p:
G = weighted sum of p and its 48 neighbors
p = G/256
Edges: 3×3 Sobel Filters
foreach pixel p:
Gx = weighted sum of p and its 8 neighbors
Gy = weighted sum of p and its 8 neighbors
p = sqrt(Gx^2 + Gy^2)
Weights for Gx:
-1 0 1
-2 0 2
-1 0 1
Weights for Gy:
1 2 1
0 0 0
-1 -2 -1
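A minimal sketch of this edge stage (kernel and variable names are assumptions, not the course's exact code):

// Sketch: 3x3 Sobel, one thread per interior pixel.
__global__ void sobel_filter_v0(int w, int h, const unsigned char *src, unsigned char *dst)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
        return;   // skip the 1-pixel border

    // Fetch the 3x3 neighbourhood.
    int n[3][3];
    for (int j = -1; j <= 1; ++j)
        for (int i = -1; i <= 1; ++i)
            n[j + 1][i + 1] = src[(y + j) * w + (x + i)];

    // Weighted sums with the Gx and Gy masks above.
    const int gx = -n[0][0] + n[0][2] - 2 * n[1][0] + 2 * n[1][2] - n[2][0] + n[2][2];
    const int gy =  n[0][0] + 2 * n[0][1] + n[0][2] - n[2][0] - 2 * n[2][1] - n[2][2];
    const float mag = sqrtf((float)(gx * gx + gy * gy));
    dst[y * w + x] = (unsigned char)fminf(mag, 255.0f);
}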
Introducing the Application
The Starting Code
https://github.com/chmaruni/nsight-gtc
__global__ void gaussian_filter_7x7_v0(int w, int h, const uchar *src, uchar *dst)
{
// Position of the thread in the image.
const int x = blockIdx.x*blockDim.x + threadIdx.x;
const int y = blockIdx.y*blockDim.y + threadIdx.y;
// Early exit if the thread is not in the image.
if( !in_img(x, y, w, h) )
return;
// Load the 48 neighbours and myself.
int n[7][7];
for( int j = -3 ; j <= 3 ; ++j )
for( int i = -3 ; i <= 3 ; ++i )
n[j+3][i+3] = in_img(x+i, y+j, w, h) ? (int) src[(y+j)*w + (x+i)] : 0;
// Compute the convolution.
int p = 0;
for( int j = 0 ; j < 7 ; ++j )
for( int i = 0 ; i < 7 ; ++i )
p += gaussian_filter[j][i] * n[j][i];
// Store the result.
dst[y*w + x] = (uchar) (p / 256);
}
What is good and what is bad?
Profiling Machine
NVIDIA GeForce GTX 980
GM204
Compute Capability SM5.2
CUDA 7.0
Windows 7
Visual Studio 2013
Nsight Visual Studio Edition 5.0
Profiling Introduction
The Problem
Visual Profiler Guided Analysis
Iteration 1
CUDA API Calls
Device Activity
Hints
Results
We are using only a single stream
Kernels have data dependencies so can't be executed in parallel
This is a problem the guided analysis will try to address
Guided Analysis
gaussian_filter_7x7_v0 kernel has the highest rank
What is this telling us about our code?
Memory Bound Problem!
Memory vs Compute vs Latency
[Utilisation chart: compute bound = high compute, low memory utilisation; bandwidth bound = high memory, low compute utilisation; latency bound = both below ~60%; compute and bandwidth bound = both above ~60%.]
Better Occupancy might improve compute use
What about occupancy?
Occupancy: “number of active warps over max warps supported”
Increasing achieved occupancy can hide latency
More warps available for execution = more to hide latency
[Diagram: warps 0-9 on one scheduler; while one warp waits on a memory request (latency), another warp can issue, so the latency is fully covered.]
[Diagram: only warps 0-3; on some cycles no warp can issue, so latency is exposed - not enough warps.]
Occupancy
In our case we are not achieving theoretical occupancy (we have
latency)
What is the problem here?
[Diagram: warp issue/wait timeline with fewer active warps than the theoretical maximum, so the waits (latency) are not fully covered.]
Occupancy
In our case we have good occupancy but still high latency
Schedulers can't find eligible warps at every cycle
Exposed latency at high occupancy
[Diagram: warps 0-9 all waiting on memory transactions, with cycles in which no warp issues.]
More information
Transaction per access = 5:1
We are using only 20% of the effective bandwidth
Transactions per access?
Think back to Lecture 11
To get 100% efficiency our threads need to access consecutive 4 byte values
32 Threads in warp accessing 4B each
128B total via 4 L2 cache lines
Addresses from warp: 0 32 64 96 128 160 192 224 256
__global__ void copy(float *odata, float *idata)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    odata[xid] = idata[xid];
}
The profiler is telling us we could use only 1 transaction per access but are using 4-5 (only 1 transaction is required when each thread in the warp reads a single-byte char)
n[j+3][i+3] = in_img(x+i, y+j, w, h) ? (int) src[(y+j)*w + (x+i)] : 0;
Memory is indexed based on x == threadIdx.x, which suggests the access is coalesced.
The cause is not yet clear…
Analysis
The limiting factor of our code is L2 Throughput
There is nothing wrong with having high throughput
Except: There is not enough compute to hide this
We can't increase occupancy any further to hide this
Solution: We need to reduce the time it takes to get data to the SMPs to compute on it, either by:
Moving data closer to the SMPs
Making our L2 reads/writes more efficient
Currently ~4-5 Transactions/Access
Our L2 cache lines are being used ineffectively
Causes of Transaction per access: Striding?
Lecture 11 example
Strides (like above) cause poor transactions per access
In the above case 8 transactions where we could have used 4
Addresses: 0 32 64 96 128 160 192 224 256
__global__ void copy(float *odata, float *idata)
{
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    odata[xid] = idata[xid];
}
Addresses: 0 32 64 96 128 160 192 224 256
Lecture 11 Example:
If memory accesses are offset then parts of the cache line will be unused, e.g.
Use thread block sizes that are multiples of 32!
__global__ void copy(float *odata, float *idata)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + 1;
    odata[xid] = idata[xid];
}
Causes of Transaction per access: Offset?
What is our current data layout?
[Diagram: Block 1, Block 2, …; Warp 0 and Warp 1 within a block]
Warps are 8x2; blocks are 8x8
Why might this be a problem?
With 8x8 blocks each warp covers an 8x2 patch of pixels, so the addresses read within a warp are not one consecutive run in memory
Overfetch from L2 Cache
[Diagram: the 8x8 block's pixels numbered 0-63; warp 1 covers rows 0-1 and warp 2 covers rows 2-3, so each 32B cache line fetched contains only 8 useful bytes - data overfetch.]
Line 245 for i=0, j=0: src[x] //threads 0-7 only
Line 245: src[(y+j)*w + (x+i)]
Addresses: 0 8 16 24 32 40 48 56 64
Cache line (always aligned by 32B boundaries)
Overfetch with L1 Caching
[Diagram: the same 8x8 block; with L1 caching the fetch granularity is a full 128B line, so the overfetch per warp is even larger.]
Line 245 for i=0, j=0: src[y*w + x]
Line 245: src[(y+j)*w + (x+i)]
Any Ideas for improving this?
Optimisation: Improved Memory layout
Minimum block width should be 32 (each thread requires only 1 byte)
Use Layout of 32x2
[Diagram: with a 32x2 block, warp 0 covers pixels 0-31 and warp 1 covers pixels 32-63, so each warp reads one aligned 32B cache line.]
Addresses: 0 8 16 24 32 40 48 56 64
Cache line (always aligned by 32B boundaries)
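On the host side this optimisation is just a change to the launch configuration, roughly as sketched below (the helper name and the w, h, d_src, d_dst parameters are assumptions for illustration):

// Hypothetical host-side launch for this step: a 32-wide block means each warp
// reads 32 consecutive bytes, i.e. one aligned 32B cache line, instead of an 8x2 patch.
void launch_gaussian_v0(int w, int h, const uchar *d_src, uchar *d_dst)
{
    dim3 block(32, 2);                       // was dim3 block(8, 8)
    dim3 grid((w + block.x - 1) / block.x,
              (h + block.y - 1) / block.y);
    gaussian_filter_7x7_v0<<<grid, block>>>(w, h, d_src, d_dst);
}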
Deploy: Improved Memory layout
Kernel Time (ms) Speedup Rel. Speedup
Gaussian_filter (Step 0) 5.49 1.00x -
Gaussian_filter (Step 1a) 1.00 5.49x 5.49x
Break
What do we expect the analysis to look like next?
Any ideas for what else may be required?
Half time summary
The guided profiler will help us optimise the right thing
Hotspot tells us the most appropriate place to optimise
Performance Limiter tells us what to focus on to improve
Code may be Memory, Compute or Latency Bound
Improvements so far
Changed the access pattern (by changing block size)
Reduced memory dependencies?
Profiling Introduction
The Problem
Visual Profiler Guided Analysis
Iteration 1
Iteration 2
Identify the hotspot
Examine GPU Usage in Visual Profiler
Examine Individual Kernels
Gaussian filter kernel still the highest rank
Performance Limiter
Tex Instruction Units?
[Diagram: CUDA memory hierarchy - per-thread registers; per-block shared memory, constant cache and unified L1/read-only (texture) cache; L2 cache and global memory in GPU DRAM shared by all blocks.]
What are texture instruction units and
why might our code be using them?
Hint:
void gaussian_filter_7x7_v1(int w, int h, const uchar *src, uchar *dst)
The compiler is reading src as read-only data through the unified L1/read-only (texture) cache, so the loads are issued by the texture units
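As general CUDA background (not necessarily the change made in this talk): const plus __restrict__ qualifiers, or an explicit __ldg(), are what allow the compiler to route global loads through this read-only (texture) path, e.g.:

// Hypothetical illustration of an explicit read-only (texture path) load.
__global__ void copy_readonly(int n, const unsigned char * __restrict__ src,
                              unsigned char * __restrict__ dst)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = __ldg(&src[i]);   // load issued via the texture units
}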
Guided Bandwidth Analysis
We are doing lots of reading/writing through unified cache
Guided Bandwidth Analysis
Parts of the code are still reporting 2 transactions per access
Transaction per request
Line 245 for i=1, j=0: src[x+1]
Line 245: src[(y+j)*w + (x+i)]
What is wrong with this access pattern?
Hint: Cache Lines are aligned by 32B boundaries
Addresses: 0 8 16 24 32 40 48 56 64 - two 32B-aligned cache lines
We have an offset access pattern
Guided Compute Analysis
The guided analysis suggests that lots of our compute cycles are
spent issuing texture load/stores
Guided Latency Analysis: Occupancy
Register usage is very high
Occupancy currently limited by register usage
Increasing occupancy might not help us, however, as we are dominated by texture loads/stores
More work per SMP will just mean even more texture loads/stores!
We can confirm this by looking at the unguided
analysis: Kernel Latency
PC Sampling
Execution/Memory Dependency
Rank these from best to worst
Which have instruction and memory dependencies?
int a = b + c;
int d = a + e;
//b, c and e are local ints
int a = b[i];
int d = a + e;
//b is global memory
//i and e are local ints
int a = b + c;
int d = e + f;
//b, c, e and f are local ints
Instruction dependency (first snippet): the second add must wait for the first
Memory dependency (second snippet): the second add must wait for the memory request
No dependencies (third snippet): the adds are independent
Analysis
Our compute engine is dominated by load/store instructions for the
texture cache
Our texture bandwidth is good BUT
Our warps are stalling as instructions are waiting to issue texture
fetch instructions
We still have a poorly aligned access pattern within our inner loops
Solution: Reduce dependencies on texture loads
Move data closer to the SMP
Only read from global memory with nicely aligned cache lines
How?
Shared Memory
A single thread uses 7x7 = 49 values
A single block (32x4) uses 38x10 = 380 values (the block plus a 3-pixel apron on each side)
Use shared memory to store all the pixels needed by the block
(The block size has also been increased, to 32x4)
What important factor should we be considering?
__shared__ unsigned char smem_pixels[10][64];
The important factor: shared memory (SM) bank conflicts
BUT WAIT!
Wouldn't aligned char access have 4-way bank conflicts?
Not for Compute Capability 2.0+…
“A shared memory request for a warp does not generate a
bank conflict between two threads that access any
address within the same 32-bit word (even though the two
addresses fall in the same bank): In that case, for read
accesses, the word is broadcast to the requesting
threads (multiple words can be broadcast in a single
transaction) …”
I.e. a stride of less than one 4B word can be read conflict-free if threads access aligned data
[Diagram: mapping of threads 0-7 onto shared memory banks 0-7.]
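Pulling this together, a sketch of the shared-memory version might look like the following (an illustration assuming 32x4 blocks, not the talk's exact code; gaussian_filter is the constant weight table used by the starting code):

// Sketch: stage the block's 38x10 pixel tile (block + 3-pixel apron) in shared
// memory, then convolve from shared memory. Assumes blockDim = (32, 4).
__global__ void gaussian_filter_7x7_smem(int w, int h, const unsigned char *src, unsigned char *dst)
{
    __shared__ unsigned char smem_pixels[10][64];   // 10 rows x (38 used of 64) cols

    const int x = blockIdx.x * 32 + threadIdx.x;
    const int y = blockIdx.y * 4  + threadIdx.y;

    // All threads cooperatively load the tile, zeroing out-of-image pixels.
    for (int row = threadIdx.y; row < 10; row += 4)
        for (int col = threadIdx.x; col < 38; col += 32) {
            int gx = blockIdx.x * 32 + col - 3;
            int gy = blockIdx.y * 4  + row - 3;
            smem_pixels[row][col] =
                (gx >= 0 && gx < w && gy >= 0 && gy < h) ? src[gy * w + gx] : 0;
        }
    __syncthreads();

    if (x >= w || y >= h)
        return;

    // Convolution now reads from shared memory; char reads within the same
    // 32-bit word are broadcast, so there are no bank conflicts.
    int p = 0;
    for (int j = 0; j < 7; ++j)
        for (int i = 0; i < 7; ++i)
            p += gaussian_filter[j][i] * smem_pixels[threadIdx.y + j][threadIdx.x + i];
    dst[y * w + x] = (unsigned char)(p / 256);
}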
Improvement
Significant
Kernel Time (ms) Speedup Rel. Speedup
Gaussian_filter (Step 0) 5.49 1.00x -
Gaussian_filter (Step 1a) 1.00 5.49x 5.49x
Gaussian_filter (Step 40) 0.49 11.20x 2.04x
Profiling Introduction
The Problem
Visual Profiler Guided Analysis
Iteration 1
Iteration 2
Iteration 3
Identify the hotspot
Examine GPU Usage in Visual Profiler
Examine Individual Kernels
Gaussian filter kernel still the highest rank
Getting much closer though
Performance Limiter
Actually very close to the magical 60% of compute
Let's examine:
1) The compute analysis
2) The latency analysis
Guided Bandwidth Analysis
Compute Analysis
We are simply doing lots of
compute
The floating point operations graph shows no activity, i.e. all of our arithmetic instructions are integer
What are all of these integer instructions?
Compute Analysis by Line
Selecting the CUDA function from compute analysis results allows a
line by line breakdown
This will switch to unguided analysis
Also PTX instruction
breakdown provided
Guided Latency Analysis
Would changing the block size,
register usage or amount of shared
memory per block improve
occupancy?
Guided Latency Analysis
Line by Line Breakdown
Latency Overview: Other 32.25%
Stall reason 'other' generally means that there is no obvious action to take to improve performance
'Other' stalls may indicate one of the following:
1. Execution unit is busy
Solution: Potentially reduce use of low throughput integer operations if possible
2. Register bank conflicts: a compiler issue that can sometimes be made worse by heavy use of vector data types
Solution: None
3. Too few warps per scheduler
Solution: Increase occupancy, decrease latency
Guided Latency Analysis: Line by Line
Lots of time spent loading into shared memory
2nd Highest latency is on memory
Lots of time spent computing convolution
Highest latency is on execution dependency
1st Analysis
We have a reasonably well balanced use of the Compute and Memory pipes.
There is some latency in loading data to and from shared memory
Our compute cycles are dominated by Integer operations
What operations are they?
We can either examine the code and PTX instructions (from Compute or
Latency Analysis) or run additional analysis via Nsight within Visual Studio
More detailed analysis
Not guided like the visual profiler
Start profiling
Kernel Analysis
Select 'Profile CUDA Application'
Select the kernel (optional; all kernels are profiled otherwise)
Select the experiments (All)
Launch
CUDA Launches View
Performance Indicators
Achieved IOPS
No surprises…
int p = 0;
for( int j = 0 ; j < 7 ; ++j )
for( int i = 0 ; i < 7 ; ++i )
p += gaussian_filter[j][i] * n[j][i];
Pipe Utilisation
More detailed confirmation
Integer operations dominate
Issue Efficiency
This is good
We have no divergent code
2nd Analysis
We have a reasonably well balanced use of the Compute and Memory pipes.
There is some latency in loading data to shared memory and on
executions to read it back
Our compute cycles are dominated by Integer operations
There is some latency in loading data into shared memory and in the instructions that read it back
Consider a simplified problem
Each thread needs to load an r, g, b, a value into shared memory
Which has fewer shared memory load instructions?
__shared__ char sm[TPB*4];
char r, g, b, a;
r = sm[threadIdx.x*4];
g = sm[threadIdx.x*4 + 1];
b = sm[threadIdx.x*4 + 2];
a = sm[threadIdx.x*4 + 3];
__shared__ char4 sm[TPB];
char r, g, b, a;
char4 rgba;
rgba = sm[threadIdx.x];
r = rgba.x;
g = rgba.y;
b = rgba.z;
a = rgba.w;
Our compute cycles are dominated by Integer operations
Which of the following is faster?
int p = 0;
for( int j = 0 ; j < 7 ; ++j )
for( int i = 0 ; i < 7 ; ++i )
p += gaussian_filter[j][i] * n[j][i];
int a, b, c;
a = sm_a[i]; b = sm_b[i];
c += a * b;
float a, b, c;
a = sm_a[i]; b = sm_b[i];
c += a * b;
An integer multiply-add is 16 cycles
A float fused multiply-add is 4 cycles
Analysis
We have a reasonably well balanced use of the Compute and Memory pipes.
There is some latency in loading data to shared memory and on
executions to read it back
Solution 1: Reduce shared memory load/store dependencies by using wider requests,
i.e. 4B values rather than 1B (chars) - store shared memory values as 4B minimum
Our compute cycles are dominated by Integer operations
Almost all MAD operations
Solution: Change slower integer MAD instructions to faster floating point FMAD instructions,
i.e. use floating point multiplies and cast the result to uchar at the end (see the sketch below)
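A sketch of that change inside the convolution loop (a fragment of the kernel body, assuming the shared-memory version sketched earlier and a float copy of the weights, here called gaussian_filter_f):

// Illustrative only: accumulate in float so the compiler can emit FFMA
// instructions, and cast back to uchar once at the end.
float p = 0.0f;
for (int j = 0; j < 7; ++j)
    for (int i = 0; i < 7; ++i)
        p += gaussian_filter_f[j][i] * (float)smem_pixels[threadIdx.y + j][threadIdx.x + i];
dst[y * w + x] = (unsigned char)(p / 256.0f);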
Improvement
Significant
Kernel Time (ms) Speedup Rel. Speedup
Gaussian_filter (Step 0) 5.49 1.00x -
Gaussian_filter (Step 1a) 1.00 5.49x 5.49x
Gaussian_filter (Step 40) 0.49 11.20x 2.04x
Gaussian_filter (Step 5a) 0.28 19.60x 1.75x
Profiling Introduction
The Problem
Visual Profiler Guided Analysis
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Identify the hotspot
Examine GPU Usage in Visual Profiler
What should be our next step?
Examine Individual Kernels
Gaussian filter kernel no longer highest rank!
We can now optimise the sobel_filter kernel
Let's look at our Gaussian kernel anyway…
Performance Limiter
Looking good
Visual Studio Nsight IOPS/FLOPS metrics
Analysis
Our algorithm is making good use of compute and memory
Further improvement will be difficult (but not impossible)
Solution: Optimise a different kernel
sobel_filter_kernel to get the same treatment
Solution: Improve Gaussian kernel by changing the technique
(parallelise differently)
Separable filter: approximate the Gaussian by binomial coefficients and compute the horizontal and vertical convolutions separately (see the sketch below)
Ensure we apply the same optimisations to the separable filter version
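A sketch of the separable idea (assumed names; only the horizontal pass is shown, the vertical pass is analogous, and normalisation depends on the chosen weights):

// The 7x7 binomial kernel is the outer product of {1,6,15,20,15,6,1} with
// itself, so a horizontal 1D pass followed by a vertical 1D pass gives the
// same result with 7 + 7 = 14 multiply-adds per pixel instead of 49.
__constant__ float gaussian_1d[7] = { 1.f, 6.f, 15.f, 20.f, 15.f, 6.f, 1.f };

__global__ void blur_horizontal(int w, int h, const unsigned char *src, float *tmp)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h)
        return;
    float p = 0.0f;
    for (int i = -3; i <= 3; ++i) {
        const int xi = x + i;
        if (xi >= 0 && xi < w)                 // out-of-image pixels treated as 0
            p += gaussian_1d[i + 3] * src[y * w + xi];
    }
    tmp[y * w + x] = p;   // the vertical pass then convolves tmp along y
}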
Improvement
25x speedup on existing GPU code is pretty good
Companion Code: https://github.com/chmaruni/nsight-gtc
Kernel Time (ms) Speedup Rel. Speedup
Gaussian_filter (Step 0) 5.49 1.00x -
Gaussian_filter (Step 1a) 1.00 5.49x 5.49x
Gaussian_filter (Step 40) 0.49 11.20x 2.04x
Gaussian_filter (Step 5a) 0.28 19.60x 1.75x
Gaussian_filter (Step 9) 0.22 24.95x 1.27x
Summary
Profiling with the Visual Profiler will give you guided analysis of how
to improve your performance
Show you how to spot key metrics
We are trying to achieve good overall utilisation of the hardware
(compute and memory engines)
Through an appreciation of memory and compute bounds
Follow the APOD cycle
Assess: What is the limiting factor, analyse and profile
Parallelise and improve (apply the knowledge you have learnt over the
course)
Optimise
Deploy and Test
If in doubt use the lab classes to seek guidance!