
Intro to Parallel Computing
Topic 12: Intro to CUDA
COSC 407: Intro to Parallel Computing


Today’s topics:
• Intro to GPU programming
• CPU vs GPU programming
• Latency vs. Throughput
• CUDA basics: the hardware layout
• CUDA basics: program structure
• Kernel Launch
• Useful Built-in CUDA functions
• Function Declarations (global, device, host)
• Simple examples
Next Lecture:
• Error Handling
• cudaDeviceSynchronize
• Thread organization

Serial vs Parallel on CPU / GPU
Serial code:
one thread does all the work
    for(i=0; i<100; i++)
        C[i] = A[i] + B[i];

Serial vs Parallel on CPU / GPU
Parallel on CPU: divide the work among threads (assuming static scheduling; each of the four threads executes one of these loops)
    for(i=0;  i<25;  i++) C[i] = A[i] + B[i];
    for(i=25; i<50;  i++) C[i] = A[i] + B[i];
    for(i=50; i<75;  i++) C[i] = A[i] + B[i];
    for(i=75; i<100; i++) C[i] = A[i] + B[i];

Serial vs Parallel on CPU / GPU
On GPU: many threads do the work (one thread/element)
    Thread 0:   C[0]  = A[0]  + B[0];
    Thread 1:   C[1]  = A[1]  + B[1];
    Thread 2:   C[2]  = A[2]  + B[2];
    Thread 3:   C[3]  = A[3]  + B[3];
    ...
    Thread 97:  C[97] = A[97] + B[97];
    Thread 98:  C[98] = A[98] + B[98];
    Thread 99:  C[99] = A[99] + B[99];

Keep in Mind...
Amdahl's law: Max speedup = 1/s, where s is the serial (non-parallelizable) fraction of the program.
    Limitation: applies to situations where the problem size is fixed.
Gustafson's Law: Max speedup = p, the number of processes (with large enough problem sizes).
    • With more cores, larger problem sizes (datasets) can be solved within the same time.
    Limitation: doesn't apply to problems which do not have fundamentally large datasets.

Latency vs. Throughput
Latency: time to finish one task (in seconds)
Throughput: number of tasks finished per unit of time (items/sec)
Latency and throughput are not necessarily aligned:
• You are standing in a long line at a supermarket.
• YOU are aiming for less latency (to finish as soon as possible).
• The supermarket manager is optimizing for throughput (to serve as many customers as possible per hour).
    Option 1: 1 fast employee      Option 2: 10 slow employees

Latency vs. Throughput: Another Example
Required: move rocks from point A to point B
• Mission 1: 2 big rocks
• Mission 2: 200 small rocks
    Option 1: 2 strong workers     Option 2: 200 weak workers
• A worker can only carry one rock at a time. A rock can be carried by one worker.
• The time taken to move one rock (the latency) is shown in the figure.
Q1: Which option would you choose to finish each mission?
Q2: What if we have a 3rd mission for moving both big and small rocks?
[Figure: per-rock latency and overall time for each mission under each option]

CPU vs. GPU
CPUs and GPUs have fundamentally different design philosophies.
    CPU: 1 core with 4 SIMD ALUs; memory bandwidth up to 50 GB/s
    GPU: 8 cores with 16 SIMD ALUs/core; memory bandwidth up to 1 TB/s
    Device memory is managed with cudaMalloc()/cudaFree(); data is transferred between the two memories with cudaMemcpy().
(Adapted from © NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign)
CPU vs. GPU
    Intel Core i7: 4 cores with 8 SIMD ALUs per core
    Nvidia GTX 480: 15 SMs with 32 SIMD ALUs per core

Latency Oriented Design (CPU)
• Optimized for sequential code performance
• Sophisticated control
    • Branch prediction for reduced branch latency
    • Data forwarding for reduced data latency
• Minimize operation latency
    • Using powerful ALUs
• Large caches
    • Reduce cache access latency
Example: Core i7-5960X
    • 2.6 billion transistors
    • 3 GHz clock rate
    • 16 hyper-threads
    • SIMD
    http://techgage.com/article/core-i7-5960x-extreme-edition-review-intels-overdue-desktop-8-core-is-here/

Throughput Oriented Design (GPU)
• Maximize throughput
    • Long latency, but heavily pipelined for high throughput
    • Requires a massive number of threads to tolerate latencies
• Less space for control logic
    • Trade simple control for more compute
    • No branch prediction
    • No data forwarding
• Large number of simple ALUs per core
    • SIMD execution for each core: many ALUs performing the same instruction
    • You have to design your code to make use of these units
• Memory optimized for bandwidth
    • Large bandwidth allows for serving many ALUs simultaneously
(Adapted from © NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign)

Throughput Oriented Design
Using GPUs is only efficient if we have a massive number of threads working on a large amount of data.
• Is this Amdahl's or Gustafson's law?
Example: in image processing, we need to process many pixels per unit of time. It is OK for each pixel to take more time (higher latency) as long as we maximize the throughput.
• Remember that GPUs have 100s or 1000s of cores which can finish more jobs even if each job takes more time.
• Each thread may be assigned to only one pixel.

Speedup of Applications
GeForce 8800 GTX vs. 2.2 GHz AMD Opteron 248 CPU
[Figure: bar chart of GPU speedup relative to CPU for application kernels such as RC5-72, FEM, RPES, PNS, TPACF and FDTD]
§ 10x speedup in a kernel (i.e. running on the GPU) is typical, as long as the kernel can occupy enough parallel threads
§ 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized

FLOPS for CPUs vs GPUs
[Figure: theoretical FLOPS of Nvidia GPUs vs Intel CPUs (single and double precision), 2003-2019. Source: Nvidia]

GPU Computing: Winning Applications Use BOTH CPU and GPU
• CPUs for sequential parts where latency matters
    • CPUs can be 10+ x faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
    • GPUs can be 10+ x faster than CPUs for parallel code
    • GPUs are efficient in launching many threads in parallel; using threads is not expensive
• If you are not launching many threads, you should probably run your code on the CPU.
GPGPU: General-Purpose computation on GPUs
• aka GPU Computing
• Before 2007, GPUs were
    • specially designed for computer graphics and difficult to program for anything else
    • restrictive (e.g., you could only write/read "pixels" data)
• After 2007
    • General-purpose computation
    • Using industry-standard languages such as C
    • Large data arrays, streaming throughput
    • Fine-grain SIMD parallelism
    • Low-latency floating point (FP) computation
(Adapted from © NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign)

Differences Between GPU and CPU Threads
• GPU threads are extremely lightweight
    • Very little creation overhead
• Remember that a GPU needs 1000s of threads for full efficiency
    • A multi-core CPU needs only a few

What is CUDA?
• CUDA = Compute Unified Device Architecture (almost!)
• A parallel computing platform and API developed by NVIDIA
    • Released in June 2007
• Programmers can use CUDA-enabled GPUs for General-Purpose processing (GPGPU)
    • The programmer kicks off batches of threads on the GPU
    • GPU: a dedicated super-threaded, massively data-parallel co-processor
• Can significantly increase computing performance by harnessing the power of the GPU.
• CUDA works with programming languages such as C, C++ and Fortran.
Source: http://www.nvidia.ca/object/cuda_home_new.html

GPU Design
§ Massively threaded, sustains 1000s of threads per app
§ 30-100x speedup over high-end microprocessors
§ SM: streaming multiprocessor; SP: streaming processor
The figure: 8 SMs x 16 SPs = 128 SPs (CUDA cores)
[Figure: host -> input assembler -> thread execution manager; 8 SMs, each with its own parallel data cache and load/store units, sharing global memory (DRAM)]
Note: other components might exist on the GPU (e.g. Tensor Cores), but this is outside the scope of this course.

GeForce GTX 580
512 CUDA cores (16 SMs x 32 SPs)
(Adapted from © NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign)

Comparing Cards
GeForce GTX 480:
§ 480 CUDA cores
    - 15 SMs
    - Each SM features 32 SPs (CUDA cores)
        • CUDA core = one ALU (INT) + one FPU
        • # of cores determines # of calculations performed per clock cycle
    - An SM executes threads in groups of 32 called warps
§ Memory interface: 384-bit
§ Memory bandwidth: 177.4 GB/s
§ Compute capability: 2.1
A modern high-end card (for comparison):
§ CUDA cores: 10,496
§ Memory interface: 384-bit
§ Memory bandwidth: 936 GB/s
§ $$$$

Running CUDA Remotely
• On Google Colab
    • Google has made available an online environment to work with CUDA C/C++ on their GPUs for free.
    • The steps on how to do that are detailed here:
      https://www.wikihow.com/Run-CUDA-C-or-C%2B%2B-on-Jupyter-(Google-Colab)

Great References
NVIDIA official CUDA Toolkit Documentation v10.2.89: https://docs.nvidia.com/cuda/
Especially:
• Installation Guides
• CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
• CUDA C++ Best Practices Guide: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
• CUDA Samples: https://docs.nvidia.com/cuda/cuda-samples/index.html

CUDA Terminology
Host: the CPU (e.g., an Intel Core i3)
• Host code: the code that runs on the CPU
Device: a coprocessor to the CPU
• Typically a GPU
• Can also be other types of parallel processing devices
• Has its own DRAM (device memory)
• Runs many threads in parallel
Thread: the concurrent code executed on the CUDA device in parallel with other threads.
Kernel code: the data-parallel portions of an application which run on the device.

Program Structure
C Program = host code (runs on CPU) + device code (runs on GPU)
• Serial or modestly parallel parts -> host code
    • Compiled by the host's standard compiler
• Highly parallel parts -> device SPMD kernel code
    • SPMD: Single Program, Multiple Data
    • Compiled by the NVIDIA compiler
• Execution always starts on the CPU (host) and alternates between the two:
Serial Code (host)
Parallel Kernel (device), i.e. functions that will run on the GPU, called using a special format:
    kernel<<<numBlocks, threadsPerBlock>>>(args);
Serial Code (host)
Parallel Kernel (device):
    foo<<<numBlocks, threadsPerBlock>>>(args);
...
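To make this alternation concrete, here is a minimal skeleton of such a program (a sketch only; the kernel names kernelA/kernelB, the data size, and the launch configurations are illustrative, not from the slides):

    __global__ void kernelA(int *d_data) { /* ... parallel work ... */ }
    __global__ void kernelB(int *d_data) { /* ... more parallel work ... */ }

    int main(void) {
        int *d_data = NULL;                               // device pointer
        cudaMalloc((void**)&d_data, 256 * sizeof(int));   // serial host code
        kernelA<<<1, 256>>>(d_data);                      // parallel kernel on the device
        // ... serial host code (e.g. prepare the next step) ...
        kernelB<<<1, 256>>>(d_data);                      // another parallel kernel
        cudaFree(d_data);                                 // serial host code
        return 0;
    }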
Threads, Blocks and Grids
https://medium.com/analytics-vidhya/cuda-compute-unified-device-architecture-part-2-f3841c25375e

Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
• All threads run the same code (SPMD: Single Program, Multiple Data)
• Thread IDs are used to
    • decide what data to work on
    • make control decisions
[Figure: threads with IDs 0..7 each read one element of input and write one element of output]
    float x = input[threadID];
    output[threadID] = func(x);
    ...
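A minimal kernel matching this picture might look like the following sketch (the name apply_func and the squaring operation stand in for the slide's generic func; threadIdx.x, covered in more detail next lecture, supplies the thread's ID within its block):

    __global__ void apply_func(float *input, float *output) {
        int threadID = threadIdx.x;     // built-in index: 0, 1, 2, ... within the block
        float x = input[threadID];      // each thread reads exactly one input element
        output[threadID] = x * x;       // stand-in for func(x); each thread writes one output element
    }
    // host-side launch with one block of 8 threads (d_input/d_output are device pointers):
    //   apply_func<<<1, 8>>>(d_input, d_output);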
How to Write a CUDA Program
C Program = host code (runs on CPU) + device code (runs on GPU)
Execution always starts on the CPU (host) and alternates between the two:
    Host Code (serial)
    Device Code (parallel kernel)
    Host Code (serial)
    Device Code (parallel kernel)
    ...

The Host Code (Step #1)
1. Allocate space on the GPU using cudaMalloc
2. Copy CPU data to the GPU memory using cudaMemcpy
    • The CPU accesses its own memory, and the GPU accesses its own memory
3. Launch the kernel function(s) on the GPU
    • Might need to define the launch configuration
    • Pass references to the allocated GPU memory + any arguments
    • Results are stored in GPU memory
4. Copy the results from GPU to CPU using cudaMemcpy
    • Must have memory allocated in CPU RAM equal to the size of the results
5. Free GPU memory using cudaFree
Processing Flow on CUDA
(from the slide's diagram, showing main memory on the host and device memory on the GPU)
1. Allocate memory on the GPU: cudaMalloc()
2. Copy data from main memory to device memory: cudaMemcpy()
3. (a) Launch the kernel function; (b) execute in parallel on each core
4. Copy the results from device memory back to main memory: cudaMemcpy()
5. Free the allocated GPU memory: cudaFree()

A HOST Example
    int *a = 0, *d_a = 0, num_bytes = …;

    // Prepare data on CPU
    a = (int*) malloc(num_bytes);        // could also have other data b, c, d, ...
    ...                                  // put into a the data that needs to be processed on the GPU

    // Allocate memory on GPU + copy CPU data to GPU
    cudaMalloc((void**)&d_a, num_bytes); // allocate memory on GPU before the copy
    cudaMemcpy(d_a, a, num_bytes, cudaMemcpyHostToDevice);

    // Process data on GPU
    ...                                  // invoke the kernel function on the GPU here

    // Copy results from GPU to CPU
    cudaMemcpy(a, d_a, num_bytes, cudaMemcpyDeviceToHost);

    // Free CPU & GPU memory
    cudaFree(d_a);
    free(a);
The Kernel Code (Step #2)
• Write the kernel function as if it will run on a single thread
    • Use IDs to identify which piece of data is processed by this thread
    • The kernel function works on only one piece of the data
• Remember that this SAME kernel function is executed by many threads
    • This means the parallelism of threads is expressed in the host code (through the kernel launch), not inside the kernel itself
• Kernel functions must be declared with a qualifier, either __global__ or __device__
    __global__ void foo(int a, int b){ ... }

Launching the Kernel
To launch a CUDA kernel from the host, we need to:
• Specify the grid dimension and the block dimension
    • Threads are organized into blocks
    • Blocks are organized into grids
• When launching the kernel we will pass into it (more later…):
    • Number of blocks in the grid
        • Based on your data size
        • Data needs to be covered -> make sure you have one thread for each element of data
    • Number of threads per block (check your card's stats)
Example:
    foo<<<1, 1024>>>(a, b);   // run 1024 threads in 1 block
The first launch parameter is the grid size (number of blocks); the second is the block size (threads per block).
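For larger inputs, the grid size is usually computed from the data size so that every element is covered by a thread. A minimal sketch (the values of N and threadsPerBlock and the kernel name foo are illustrative, not fixed by the slides):

    int N = 100000;                                                 // number of data elements
    int threadsPerBlock = 256;                                      // block size (within the card's limit)
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // round up to cover all N elements
    foo<<<numBlocks, threadsPerBlock>>>(a, b);                      // numBlocks blocks, 256 threads each

Because the division rounds up, a few extra threads may be launched; the kernel then uses thread IDs to skip indices beyond N (next lecture).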
Built-in CUDA Functions
cudaError_t cudaMalloc(void** d_ptr, size_t n)
• Allocates n bytes of linear memory on the device and returns in *d_ptr a pointer to the allocated memory.
• d_ptr: address of a pointer to the allocated device memory
    • We need to pass this pointer by reference (i.e., pass its address) so that cudaMalloc is allowed to modify its value and store the pointer to the allocated memory
• n: size of the requested memory in bytes

cudaError_t cudaFree(void* d_ptr)
• Frees the device memory pointed to by d_ptr

Note that CUDA functions return an error code if anything goes wrong. More about this later.
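A minimal sketch of this pass-by-address pattern (the name d_a and the allocation size are illustrative):

    int *d_a = NULL;                               // device pointer, not yet pointing anywhere
    cudaMalloc((void**)&d_a, 100 * sizeof(int));   // pass the ADDRESS of d_a so cudaMalloc can
                                                   // store the device address into it
    // ... use d_a in kernel launches ...
    cudaFree(d_a);                                 // pass the pointer value itself to free the memory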

Built-in CUDA Functions
cudaError_t cudaMemcpy(void *dst, const void *src, size_t n, cudaMemcpyKind dir)
• Copies data between host and device.
• dst / src are pointers to the destination / source memory segments
• n is the number of bytes to copy
• dir is the kind of transfer; it may take different values including:
    • cudaMemcpyDefault (recommended, but not supported on all machines)
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
    • cudaMemcpyHostToDevice
• Starts copying after previous CUDA calls complete
• The CPU thread is blocked until the copy is complete, after which the method returns
Built-in CUDA Functions
cudaError_t cudaMemset(void* d_ptr, int value, size_t n)
• Fills the first n bytes of the memory area pointed to by d_ptr with the constant value value
• Note that we are filling the first n bytes, not n integers or any other type!
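For example, to zero an array of 100 ints that was allocated on the device (d_a is an assumed device pointer from an earlier cudaMalloc), n must be given in bytes:

    cudaMemset(d_a, 0, 100 * sizeof(int));   // fills 400 bytes with 0, so all 100 ints become 0
    // caution: cudaMemset(d_a, 1, 100 * sizeof(int)) sets every BYTE to 1,
    // making each int 0x01010101 (16843009), not 1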

Function Declarations
__global__
• Called from the CPU (**) and executed on the GPU. It must return void.
__device__
• Called and executed by the GPU. Cannot be called from the CPU, i.e. it is called from other __global__ and __device__ functions.
__host__
• Called and executed by the CPU.

__device__ and __host__ can be used together:
    __host__ __device__ int max(int a, int b){ return (a > b) ? a : b; }
This means both the CPU and the GPU can call max.

(**) Dynamic parallelism allows calling kernels from within other kernels on cards of compute capability 3.5 or higher. This is outside the scope of this course, and we will always call kernels from host code.
Declaration                         Executed on      Callable from
__global__ void KernelFunc()        device (GPU)     host (CPU) **
__device__ float DeviceFunc()       device (GPU)     device (GPU)
__host__   float HostFunc()         host (CPU)       host (CPU)
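A small sketch tying the three qualifiers together (the function names are illustrative, not from the slides):

    __device__ float square(float x) {            // GPU only: callable from __global__ / __device__ code
        return x * x;
    }

    __global__ void square_all(float *d_data) {   // runs on the GPU, launched from host code
        int i = threadIdx.x;
        d_data[i] = square(d_data[i]);            // a kernel may call __device__ functions
    }

    __host__ float host_square(float x) {         // ordinary CPU function (__host__ is the default)
        return x * x;
    }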
Limitations
For functions executed on the GPU:
• No recursion
• No static variable declarations inside the function
• No variable number of arguments
• Only pointers to GPU memory can be dereferenced
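The last point is a common source of bugs: a kernel must be given device addresses, never host addresses. A minimal illustration inside host code (my_kernel, h_a and d_a are hypothetical names):

    __global__ void my_kernel(int *p) { p[threadIdx.x] += 1; }   // hypothetical kernel

    // inside host code:
    int h_a[100] = {0};                               // host array (CPU DRAM)
    int *d_a = NULL;
    cudaMalloc((void**)&d_a, sizeof(h_a));
    // WRONG:  my_kernel<<<1, 100>>>(h_a);            // the kernel would dereference a CPU pointer
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    my_kernel<<<1, 100>>>(d_a);                       // RIGHT: the kernel dereferences GPU memory only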

Add Two Numbers Example 1: Serial on CPU
    #include <stdio.h>

    void add(int a, int b, int *c) {    // runs on the host
        *c = a + b;
    }

    int main(void) {
        int c;                          // c is on the host DRAM
        add(2, 7, &c);                  // execute add on the host; save the result in c on the host DRAM
        printf("2 + 7 = %d\n", c);
        return 0;
    }
Add Two Numbers
Example 1: On the GPU (Parallel)
    __global__ void add(int a, int b, int *c) {
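    // (The slide is cut off here. A minimal sketch of how the GPU version typically
    //  continues; the device pointer name d_c and the <<<1,1>>> launch configuration
    //  are assumptions, following the host-code steps above; assumes #include <stdio.h>.)
        *c = a + b;                               // executed on the GPU; c must point to device memory
    }

    int main(void) {
        int c;                                    // host copy of the result
        int *d_c = NULL;                          // device pointer
        cudaMalloc((void**)&d_c, sizeof(int));    // allocate space for the result on the GPU
        add<<<1, 1>>>(2, 7, d_c);                 // launch the kernel with 1 block of 1 thread
        cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);   // copy the result back
        printf("2 + 7 = %d\n", c);
        cudaFree(d_c);                            // free GPU memory
        return 0;
    }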
