DSCC 201/401
Tools and Infrastructure for Data Science
March 1, 2021
Parallel Programming Models
• Embarrassingly Parallel
• Shared Memory
• Pthreads
• OpenMP
• Message Passing – MPI
• Accelerator Computing – CUDA
Shared Memory
• Common physical memory that can be accessed by all processors
• Single address space that is globally accessible
• Changes in a memory location caused by one processor are visible to all other processors
• Difficult to scale to a large number of processors
• We will look at two programming libraries for shared memory systems: Pthreads and OpenMP
Shared Memory Architecture
[Diagram: four CPUs, each with its own cache, connected through a high-speed interconnect (computer bus) to a single shared memory]
Shared Memory – Threads
• A thread is an independent stream of instructions that can be scheduled to run by the operating system
• A thread is “light weight” since it exists within a parent process and uses that process’s resources – it takes advantage of the overhead the parent process has already incurred
• A thread has its own independent flow of control as long as its parent process exists
• Threads duplicate only the essential resources they need
• Threads share the resources of the process with other threads, each of which runs independently (and may also coordinate with the others)
• A thread dies if the parent process dies
• Pthreads is a library based on the IEEE POSIX 1003.1c standard that implements this thread behavior on Linux systems
Shared Memory – Pthreads
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4   /* number of threads to create (any small value works) */

/* thread function: print the id passed in through the argument */
void *print_message(void *threadid) {
    long int tid;
    tid = (long int)threadid;
    printf("Hello from thread %ld!\n", tid);
    pthread_exit(NULL);
}

int main() {
    int status;
    long int t;
    pthread_t threads[NUM_THREADS];
    for (t = 0; t < NUM_THREADS; t++) {
        status = pthread_create(&threads[t], NULL, print_message, (void *)t);
        if (status != 0) {
            printf("Error: return code from pthread_create() is %d\n", status);
            exit(-1);
        }
    }
    pthread_exit(NULL);   /* let the worker threads finish before the process exits */
    return(0);
}
gcc -o pthreads pthreads.c -pthread
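The example above relies on pthread_exit() in main to keep the process alive until the threads finish. A minimal variant (a sketch, not from the slides; the counter variable and the NUM_THREADS value are illustrative) instead waits for each thread with pthread_join() and has the threads update a variable that lives in the shared address space of the parent process, protected by a mutex:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

long counter = 0;                                  /* shared: one copy visible to every thread */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects updates to counter */

void *add_to_counter(void *threadid) {
    long int tid = (long int)threadid;
    pthread_mutex_lock(&lock);                     /* serialize access to the shared counter */
    counter += tid;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    long int t;
    for (t = 0; t < NUM_THREADS; t++) {
        pthread_create(&threads[t], NULL, add_to_counter, (void *)t);
    }
    for (t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);            /* block until thread t has finished */
    }
    printf("counter = %ld\n", counter);            /* 0 + 1 + 2 + 3 = 6 */
    return 0;
}

It compiles the same way: gcc -o join_example join_example.c -pthread (the file name is illustrative).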
Shared Memory - OpenMP
• OpenMP = Open Multi-Processing: Designed for multi-platform shared memory parallel programming
• Standard was released in 1997 and is portable to many different systems and languages
• Uses compiler directives to control execution
• Easier to implement (than Pthreads)
• OpenMP uses the fork-join model of parallel execution
[Diagram: fork-join model of execution: main() runs on a single thread, forks a team of threads at each #pragma omp parallel { } region, and joins back to a single thread when the region ends]
Shared Memory – OpenMP
#include <omp.h>
#include <stdio.h>

int main() {
    int nthreads;
    int tid;

    /* Fork a team of threads with each thread having a private tid variable */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print thread id */
        tid = omp_get_thread_num();
        printf("Hello from thread %d!\n", tid);

        /* Only main thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join main thread and terminate */
    return(0);
}
gcc -o omp omp.c -fopenmp
export OMP_NUM_THREADS=8
./omp
unset OMP_NUM_THREADS
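The thread count can also be requested from inside the program rather than through the environment; a minimal sketch using the standard omp_set_num_threads() routine (the value 4 is arbitrary):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(4);   /* overrides OMP_NUM_THREADS for later parallel regions */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d!\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return(0);
}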
OpenMP – Example: Calculation on Array
[Diagram: element-wise array addition [c] = [a] + [b]: the loop iterations are divided among the cores so that each core adds a chunk of the elements of arrays a and b]
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNKSIZE 10
#define N 100

int main () {
    int nthreads;
    int mytid;
    int i;
    int chunk;
    float a[N], b[N], c[N];

    /* initialize the arrays */
    for (i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)i;
    }
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a, b, c, nthreads, chunk) private(i, mytid)
    {
        mytid = omp_get_thread_num();
        if (mytid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d now starting...\n", mytid);

        #pragma omp for schedule(dynamic, chunk)
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", mytid, i, c[i]);
        }
    } /* end of parallel section */
    return(0);
}

gcc -o worksharing worksharing.c -fopenmp
export OMP_NUM_THREADS=8
./worksharing
unset OMP_NUM_THREADS
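Another common work-sharing pattern, shown here as a sketch rather than as part of the slide's worksharing.c, is a reduction: each thread accumulates a private partial sum and OpenMP combines the partial sums when the loop finishes.

#include <omp.h>
#include <stdio.h>

#define N 100

int main() {
    float a[N];
    float sum = 0.0f;
    int i;

    for (i = 0; i < N; i++) {
        a[i] = (float)i;
    }

    /* each thread gets a private copy of sum; OpenMP adds the copies together at the end */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
        sum += a[i];
    }

    printf("sum = %f\n", sum);   /* 0 + 1 + ... + 99 = 4950 */
    return 0;
}

Compile it the same way with gcc -fopenmp.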
Running a Batch Job Using OpenMP
#!/bin/bash
#SBATCH -p standard
#SBATCH -t 00:05:00
#SBATCH --mem=2GB
#SBATCH -c 6
./worksharing > worksharing.out
Parallel Programming Models
• Embarrassingly Parallel
• Shared Memory
• Pthreads
• OpenMP
• Message Passing – MPI
• Accelerator Computing – CUDA
Message Passing Interface (MPI)
• The Message Passing Interface was released in 1994 by the MPI Forum
• MPI uses multiple tasks to complete the work
• Each task uses its own local memory for computation
• MPI can be used to run parallel applications on shared memory and distributed memory systems
• Multiple tasks can reside on the same physical machine and across a number of physically separated machines
• Tasks exchange data through communications by sending and receiving messages (over a bus or a high speed interconnect)
• Data transfer usually requires cooperative operations to be performed by each process (e.g. a send operation must have a matching receive operation)
MPI
[Diagram: tasks 0–7 run on Node 0 and tasks 8–15 run on Node 1; a task passes data to another task through a matching MPI_Send() / MPI_Recv() pair, whether the two tasks are on the same node or on different nodes]
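As a minimal sketch of the pattern in the diagram (not code from the slides), task 0 sends one integer that task 1 receives with a matching call:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
    int rank, data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        /* send one int to task 1 with message tag 0 */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* matching receive: one int from task 0 with tag 0 */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Task 1 received %d from task 0\n", data);
    }

    MPI_Finalize();
    return 0;
}

Running it requires at least two tasks, e.g. mpirun -n 2 ./send_recv (the file name is illustrative).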
Message Passing Interface (MPI)
• MPI is a specification
• Many implementations exist: Open MPI, MPICH, MVAPICH, Intel MPI, etc.
• Strongly linked to compilers (e.g. gcc, icc, etc.) and resource management system (e.g. Slurm)
• Vendor optimized MPI implementations also exist, e.g. IBM and Cray
• Example documentation: https://www.open-mpi.org/doc/v3.1/
MPI
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
    /* set up MPI variables */
    int rank, p;
    MPI_Status status;

    /* start up MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* print messages from various tasks */
    if (rank != 0) {
        printf("Hello from task %d!\n", rank);
    }
    else {
        printf("Hello from the root task!\n");
    }

    /* shut down MPI */
    MPI_Finalize();
    return(0);
}
module load openmpi
mpicc -o hello_mpi hello_mpi.c
mpirun -n 2 ./hello_mpi
Running a Batch Job Using MPI
#!/bin/bash
#SBATCH -p standard
#SBATCH -t 00:05:00
#SBATCH -n 12
#SBATCH --mem-per-cpu=2GB
srun ./hello_mpi > hello_mpi.out
MPI – Example: Calculating Pi
• Calculate Pi from an integral
$\int_0^1 \frac{4}{1+x^2}\,dx = \pi$
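The code on the next slide approximates this integral with the midpoint rule; using the variable names from mpi_pi.c, with n intervals of width h = 1/n,

$\pi \approx h \sum_{i=1}^{n} \frac{4}{1 + \big((i - 0.5)\,h\big)^{2}}$

and each MPI task sums only the terms i = myid + 1, myid + 1 + numprocs, ... before MPI_Reduce() adds the partial sums on the root task.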
MPI – Example: Calculate Pi
#include <stdio.h>
#include <math.h>
#include "mpi.h"

/* function to integrate */
double f(double x) {
    return (4.0 / (1.0 + x*x));
}

int main(int argc, char* argv[]) {
    int myid;          /* my rank or id */
    int numprocs;      /* number of tasks */
    int i;             /* loop iterator */
    const double PIREF = 3.141592653589793238462643; /* reference value of pi */
    int n;             /* number of intervals */
    double mypi, pi, h, sum, x, error;
    double start_time, end_time, total_time;         /* timing variables */
    int namelength;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelength);
    fprintf(stderr, "task %d on %s with a total of %d processors\n",
            myid, processor_name, numprocs);

    if (myid == 0) {
        start_time = MPI_Wtime();
    }

    n = 80;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h = 1.0 / (double)n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) {
        error = ((pi - PIREF) / PIREF) * 100.0;
        printf("pi is calculated as %.16f\n relative error is %16.8e\n", pi, error);
        end_time = MPI_Wtime();
        total_time = end_time - start_time;
        printf("wall clock time = %f\n", total_time);
    }

    MPI_Finalize();
    return 0;
}

module load openmpi
mpicc -o mpi_pi mpi_pi.c
sbatch run_mpi.sh
Running a Batch Job
#!/bin/bash
#SBATCH -p standard
#SBATCH -t 00:05:00
#SBATCH -n 12
#SBATCH --mem-per-cpu=2GB
srun ./mpi_pi > mpi_pi.out
CUDA
• Application programming interface (API) created by Nvidia to program GPUs
• CUDA = Compute Unified Device Architecture
• Uses a special compiler (nvcc) to compile code to run on GPU (based on LLVM)
• Released in 2007 (version 1) with frequent updates – now at version 11
• As new hardware is released, new programming features emerge
GPU Specifications
• Nvidia GPUs designed for high-performance computing are referred to as Tesla
• Nvidia Tesla has had 5 major generations of GPUs for computing: Fermi, Kepler, Pascal, Volta, and Ampere
GPU     Generation   CUDA Cores   GPU RAM   TF (Double Precision)
C2050   Fermi        448          6 GB      0.5
K20     Kepler       2496         5 GB      1.2
K20X    Kepler       2688         6 GB      1.3
K80     Kepler       4992         24 GB     2.9
P100    Pascal       3584         16 GB     4.7
V100    Volta        5120         16 GB     7.0
A100    Ampere       6912         40 GB     19.5
CUDA
• Certain features of CUDA versions require newer hardware features
• Available hardware features are indicated by a GPU’s compute capability
• CUDA 6.5 supports compute capability 1.0 – 5.x (Tesla, Fermi, Kepler, Maxwell)
• CUDA 7.5 supports compute capability 2.0 – 5.x (Fermi, Kepler, Maxwell)
• CUDA 8 supports compute capability 2.0 – 6.x (Fermi, Kepler, Maxwell, Pascal)
• CUDA 9 supports compute capability 3.0 – 7.2 (Kepler, Maxwell, Pascal, Volta)
• CUDA 11 supports compute capability 3.5 – 8.6 (Kepler, Maxwell, Pascal, Volta, Ampere)
CUDA 6 – Unified Memory
GPU Features
• CUDA 6 – Unified Memory (up to GPU memory size)
• CUDA 7 – Accelerated libraries and C++11 support
• CUDA 7.5 – Support for FP16
• CUDA 8 – Unified Memory (up to CPU memory size)
• CUDA 9 – Support for dedicated tensor processing
GPU     Generation   Compute Capability   Earliest CUDA
C2050   Fermi        2.0                  3 (last is 8)
K20X    Kepler       3.5                  6
K80     Kepler       3.7                  6
P100    Pascal       6.0                  8
V100    Volta        7.0                  9
A100    Ampere       8.0                  11
GPU Information
interactive -p gpu-debug -t 00:30:00 --gres=gpu:1
nvidia-smi -L
nvidia-smi
nvidia-smi -q -i 0
CUDA Example
#include <stdio.h>

__global__ void hello_from_gpu(void) {
    printf("Hello from GPU!\n");
}

int main() {
    printf("Hello from CPU!\n");
    hello_from_gpu <<<1, 16>>>();   /* launch 1 block of 16 GPU threads */
    cudaDeviceReset();              /* clean up the device and flush GPU output */
    return 0;
}
module load cuda
nvcc -o cuda cuda.cu
sbatch run_cuda.sh
Running a Batch Job
#!/bin/bash
#SBATCH -p gpu
#SBATCH -t 00:05:00
#SBATCH --mem=2GB
#SBATCH --gres=gpu:2
./cuda > cuda.out
Typical GPU Program Structure
1. Allocate GPU memory
2. Copy data from CPU memory to GPU memory
3. Perform GPU computation with CUDA kernel
4. Copy data back from GPU memory to CPU memory
5. Reset GPU memory
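A sketch of these five steps for an element-wise vector addition; the kernel name vec_add, the array size, and the block size are illustrative choices, not from the slides:

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

/* kernel: each GPU thread adds one pair of elements */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    float h_a[N], h_b[N], h_c[N];
    size_t bytes = N * sizeof(float);
    for (int i = 0; i < N; i++) { h_a[i] = (float)i; h_b[i] = (float)i; }

    /* 1. allocate GPU memory */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    /* 2. copy data from CPU memory to GPU memory */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 3. perform the computation with a CUDA kernel (blocks of 256 threads) */
    vec_add<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);

    /* 4. copy the result back from GPU memory to CPU memory */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);

    /* 5. free GPU memory and reset the device */
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaDeviceReset();
    return 0;
}

It compiles with nvcc like the earlier example: nvcc -o vec_add vec_add.cu.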
[Diagram: CPU with its cache and memory (RAM) connected over the PCIe bus to the GPU and its GPU memory (RAM)]
BlueHive
Summit: Oak Ridge National Laboratory – 122 PF
IBM Power 9 + Nvidia V100 GPU
Extreme Computing – Combining Parallel Programming Libraries
• MPI + OpenMP – Multiple CPUs on multiple nodes
• MPI + CUDA – Multiple GPUs on multiple nodes
• Task 0 distributes work to tasks 1, 2, …, i, …, N.
• Task i distributes work to GPUs on compute nodes
• Task 0 gathers all results
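A minimal sketch of the MPI + OpenMP combination (not from the slides; it assumes MPI_THREAD_FUNNELED support and the illustrative file name hybrid.c): each MPI task forks a team of OpenMP threads on its node.

#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
    int provided, rank;

    /* request thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each MPI task forks its own team of OpenMP threads */
    #pragma omp parallel
    {
        printf("Hello from thread %d of MPI task %d\n", omp_get_thread_num(), rank);
    }

    MPI_Finalize();
    return 0;
}

Compile with mpicc -fopenmp -o hybrid hybrid.c and launch one MPI task per node, letting OpenMP use the cores within each node.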