
DSCC 201/401
Tools and Infrastructure for Data Science
February 24, 2021

Review of Hardware Definitions
• Storage – permanent data storage (hard drive or solid state drive), does not go away when system is powered off
• Memory – usually refers to RAM (random access memory), data goes away when system is powered off (volatile)
• Processor – a computing chip that processes data (usually refers to a CPU)
• CPU – central processing unit, the chip that does the computing, also known as a processor
• Socket – a physical location on a computer board that can house a CPU
• Core – a computing element of a CPU that can process data independently, most CPUs today have multiple cores
• Node – a physical computing unit (e.g. server) with sockets that have one or more processors, banks of memory, and a network interface
2

Software Parallelization Models and Techniques
• Serial vs. Parallel Computing
• Flynn’s Taxonomy
• Amdahl’s Law
• Parallel Programming Models
3

Serial vs. Parallel Computing
[Figure: a function() whose Tasks 1–4 run one after another (serial) versus at the same time on different processors (parallel), shown along a time axis]
4

Serial vs. Parallel Computing
• Serial Computing
• A problem is broken into a discrete series of instructions
• Instructions are executed sequentially one after another
• Only one instruction may execute at any moment in time
• Parallel Computing
• A problem is broken into many parts that can be solved at the same time (concurrently)
• Tasks from each part execute simultaneously on different processors
• An overall control or coordination mechanism is required
5

Why Are We Interested in Parallel Computing?
• Time saver for intensive computing problems (e.g. training batches of data)
• Ability to tackle difficult or more complex problems (e.g. computer vision)
• Allows for the development of more accurate and detailed models (e.g. weather forecasting)
• More efficient use of existing and future hardware (e.g. systems are multi-core)
6

Types of Parallelism
• Task Parallelism
• Different tasks are distributed to different processors and applied to the same set of data
• Run many different tasks at the same time on the same data
• Data Parallelism
• Same calculation is performed across large set of data
• Data can be structured into separate partitions and distributed to each task (see the sketch after this list)
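To make the distinction concrete, here is a minimal serial sketch in C (the array, the sum() and maxv() helpers, and the doubling operation are all illustrative; real parallel code would hand the loop partitions or the two tasks to different cores):

#include <stdio.h>

#define N 8

/* Two hypothetical, independent tasks that use the same data */
static double sum(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

static double maxv(const double *x, int n) {
    double m = x[0];
    for (int i = 1; i < n; i++)
        if (x[i] > m)
            m = x[i];
    return m;
}

int main(void) {
    double data[N] = {5, 7, 8, 3, 1, 2, 6, 4};

    /* Data parallelism: the same operation is applied to every element;
       each partition of the array could be handled by a different core */
    for (int i = 0; i < N; i++)
        data[i] = 2.0 * data[i];

    /* Task parallelism: two different operations run on the same data;
       each task could be handed to a different core */
    printf("sum = %f, max = %f\n", sum(data, N), maxv(data, N));
    return 0;
}

Doubling the elements is data parallelism (one operation, many data partitions); computing the sum and the maximum of the same array is task parallelism (different operations, same data).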
7

Flynn’s Taxonomy
SISD
Single Instruction Single Data
SIMD
Single Instruction Multiple Data
MISD
Multiple Instruction Single Data
MIMD
Multiple Instruction Multiple Data
8

SISD
• Single Instruction Single Data
• One instruction is executed on one data
• Execution is sequential and non-parallel
• “One step at a time”
• Example is single CPU system (e.g. old Cray 1 or old desktop)
[Figure: one processor executes a = 1, b = 2, c = a + b as three sequential steps along a time axis]
9

SIMD
• Single Instruction Multiple Data
• One instruction is executed simultaneously on a set of data
• All processing units execute the same command but on different pieces of data
• Example is GPU
[Figure: eight processors (0–7) each execute the same add instruction at the same time on a different pair of elements from arrays [a] and [b], producing [c] = [a] + [b] in a single step]
10

MISD
• Multiple Instruction Single Data
• Multiple instructions are executed simultaneously on single data
• All processing units execute different commands but on the same data
• Rare type of processing (difficult to construct) – may exist in obscure cryptographic applications
[Figure: four processors (0–3) apply different instructions (+, -, x, /) at the same time to the same data, a = 2 and b = 1]
11

MIMD
• Multiple Instruction Multiple Data
• Every processor can execute different instructions on different data
• Modern hybrid supercomputing architecture
• Example is Linux cluster
[Figure: eight processors (0–7) each execute a different instruction (+, x, -, /) at the same time on a different pair of operands drawn from int a, b, …, p (a+b, c x d, e+f, …)]
12

Amdahl’s Law
• Given a fixed workload and number of processors, Amdahl’s Law calculates the expected speedup
• S is the speedup, p is the fraction of work that can be parallelized, and N is the number of computing cores
• If p = 0 then none of the work is parallelized and speedup is 1
• If p = 1 then the speedup is equal to the number of cores N
• If p = 0.5 then speedup is equal to 2 in the limit of infinite number of cores
$S = \frac{1}{(1-p) + \frac{p}{N}} \qquad \lim_{N \to \infty} S = \frac{1}{1-p} \qquad \lim_{p \to 1} S = N$
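As a quick check of the formula, here is a minimal sketch in C (the amdahl_speedup() helper is my own name; the p and N values are only illustrative):

#include <stdio.h>

/* Amdahl's Law: expected speedup for parallel fraction p on n cores */
static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double fractions[] = {0.5, 0.9, 0.95, 0.99};
    double cores[] = {10, 100, 1000, 10000, 100000};

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 5; j++)
            printf("p = %.2f  N = %6.0f  S = %.2f\n",
                   fractions[i], cores[j], amdahl_speedup(fractions[i], cores[j]));
    return 0;
}

Up to rounding, the printed values match the table on the next slide.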
13

Amdahl’s Law
N         p = 0.5   p = 0.9   p = 0.95   p = 0.99
10        1.82      5.26      6.89       9.17
100       1.98      9.17      16.80      50.25
1,000     1.99      9.91      19.62      90.99
10,000    1.99      9.99      19.96      99.02
100,000   1.99      9.99      19.99      99.90

$S = \frac{1}{(1-p) + \frac{p}{N}} \qquad \lim_{N \to \infty} S = \frac{1}{1-p} \qquad \lim_{p \to 1} S = N$
14

Amdahl’s Law

$S = \frac{1}{(1-p) + \frac{p}{N}} \qquad \lim_{p \to 1} S = N$

[Figure: speedup S as a function of the parallel fraction p for N = 100 cores]
15

Amdahl’s Law

$S = \frac{1}{(1-p) + \frac{p}{N}} \qquad \lim_{N \to \infty} S = \frac{1}{1-p}$

[Figure: speedup S as a function of the number of cores N for p = 0.95, 0.90, and 0.80; each curve levels off at 1/(1-p)]
16

Speedup Example
• An application takes 100 minutes to run on a single core (wall time)
• The application is parallelized, but we do not know what fraction of the work is parallelized
• Speedup can be calculated from empirical wall time data from running the application – wall time for 1 core and wall time for N cores
N         wall time (min)   speedup
1         100               1
10        16.3              6.13
100       7.9               12.66
1,000     7.1               14.08
10,000    7.0               14.29

$S = \frac{t_1}{t_N}$
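The measured wall times also let us work backwards to the parallel fraction: solving Amdahl's Law for p gives p = (1/S - 1)/(1/N - 1). A minimal sketch in C (the estimate_p() helper is my own name; the data are the wall times from the table above):

#include <stdio.h>

/* Invert Amdahl's Law: given measured speedup s on n cores, solve for p */
static double estimate_p(double s, double n) {
    return (1.0 / s - 1.0) / (1.0 / n - 1.0);
}

int main(void) {
    double t1 = 100.0;                       /* wall time on 1 core (min) */
    double cores[] = {10, 100, 1000, 10000};
    double times[] = {16.3, 7.9, 7.1, 7.0};  /* wall times on N cores (min) */

    for (int i = 0; i < 4; i++) {
        double s = t1 / times[i];            /* S = t1 / tN */
        printf("N = %6.0f  S = %5.2f  p ~ %.3f\n",
               cores[i], s, estimate_p(s, cores[i]));
    }
    return 0;
}

For these numbers every row comes out to p of roughly 0.93, i.e. about 93% of the application's work is parallelized, which is why the speedup levels off near 14.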
17

Linear Scaling
18
[Figure: speedup versus number of cores for p = 0.995, 0.99, and 0.95 compared with ideal linear scaling]

Linear Scaling
• Difficult to achieve linear scaling
• How do we get closer to linear scaling?
• Improve the algorithm so that the program has a greater fraction of its time spent doing parallelized work
• Reduce parallel overhead (task start-up time, synchronizations, data communication, software overhead imposed by libraries and operating system, and task termination time)
• Larger data sets may help improve the fraction of time spent doing parallelized work (but may result in longer wall times)
• Consideration has to be made regarding programmer time
19

Other Considerations: Strong Scaling vs. Weak Scaling
• Strong Scaling
• The total problem size stays fixed as more cores are added to the benchmark
• Goal is to run the same problem size faster
• Weak Scaling
• The problem size per core stays fixed as more cores are added to the benchmark
• The total problem size is proportional to the number of cores used
• Goal is to run a larger problem in the same amount of time
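A small sketch of the difference in C (the base problem size and core counts are arbitrary; this only illustrates how the workload is counted, not an actual benchmark):

#include <stdio.h>

int main(void) {
    long base_size = 1000000;        /* illustrative problem size for one core */
    long cores[] = {1, 10, 100, 1000};

    for (int i = 0; i < 4; i++) {
        long n = cores[i];
        long strong_per_core = base_size / n;   /* strong: total size fixed   */
        long weak_total      = base_size * n;   /* weak: size per core fixed  */
        printf("N = %4ld   strong scaling: %8ld elements per core   "
               "weak scaling: %10ld elements total\n",
               n, strong_per_core, weak_total);
    }
    return 0;
}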
20

Parallel Programming Models
• Embarrassingly Parallel
• Shared Memory
• Pthreads
• OpenMP
• Message Passing – MPI
• Accelerator Computing – CUDA
21

Embarrassingly Parallel
• Little or no effort is needed to separate the computational problem into a number of parallel tasks
• Usually little or no dependency or need for communication between the parallel tasks or for results between them
• Problem is usually split up into N parts — multiple independent parts
• Distribution of N equal tasks to N cores
• Extremely close to linear speedup — the most practically achievable
• However, the problem and algorithm have to support this type of parallelism
22

Embarrassingly Parallel Strategy
• Many unique strategies can exist depending on the problem, but all involve splitting the data and computing on an individual component independently
• Steps:
• Divide data into N equal parts
• Distribute each segment of data to a processor
• Compute on each part
• Results can be collected independently (e.g. inference on unknown cases using a trained model) or combined statistically at the end (e.g. molecular dynamics simulation)
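A minimal sketch of this strategy in C, written serially for clarity (the partial-sum problem, chunk layout, and N_PARTS value are illustrative; in a real run each part would be computed by a different core or node, e.g. with the Pthreads or OpenMP examples that follow):

#include <stdio.h>

#define N_PARTS 4
#define N_DATA  1000

int main(void) {
    double data[N_DATA];
    double partial[N_PARTS];                 /* one independent result per part */

    for (int i = 0; i < N_DATA; i++)
        data[i] = (double)i;

    /* Divide the data into N equal parts and compute on each part
       independently; no communication is needed between parts */
    int chunk = N_DATA / N_PARTS;
    for (int p = 0; p < N_PARTS; p++) {
        double s = 0.0;
        for (int i = p * chunk; i < (p + 1) * chunk; i++)
            s += data[i];
        partial[p] = s;
    }

    /* Combine the independent results at the end */
    double total = 0.0;
    for (int p = 0; p < N_PARTS; p++)
        total += partial[p];
    printf("total = %f\n", total);
    return 0;
}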
23

Parallel Programming Models
• Embarrassingly Parallel
• Shared Memory
• Pthreads
• OpenMP
• Message Passing – MPI
• Accelerator Computing – CUDA
24

Shared Memory
• Common physical memory that can be accessed by all processors
• Single address space that is globally accessible
• Changes in a memory location caused by one processor are visible to all other processors
• Difficult to scale to large numbers of processors
• We will look at two programming libraries for shared memory systems: Pthreads and OpenMP
25

Shared Memory Architecture
[Figure: four CPUs, each with its own cache, connected by a high-speed interconnect (computer bus) to a single shared memory]
26

Shared Memory – Threads
• A thread is an independent stream of instructions that can be scheduled to run by the operating system
• A thread is “light weight” since it exists within a parent process and uses the resources of that process – it takes advantage of the overhead the parent process has already incurred
• A thread has its own independent flow of control as long as its parent process exists
• Threads duplicate only the essential resources they need
• Threads may share the process resources with other threads that act equally independently (and also dependently)
• A thread dies if the parent process dies
• Pthreads is a library based on the IEEE POSIX 1003.1c standard that implements the desired behavior of threads on a Linux system
27

Shared Memory – Pthreads
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 8

/* Thread function: print the thread id passed in by pthread_create() */
void *print_message(void *threadid) {
  long int tid;
  tid = (long int)threadid;
  printf("Hello from thread %ld!\n", tid);
  pthread_exit(NULL);
}

int main() {
  int status;
  long int t;
  pthread_t threads[NUM_THREADS];
  for (t = 0; t < NUM_THREADS; t++) {
    status = pthread_create(&threads[t], NULL, print_message, (void *)t);
    if (status != 0) {
      printf("Error: return code from pthread_create() is %d\n", status);
      exit(-1);
    }
  }
  /* Let main exit without killing the remaining threads */
  pthread_exit(NULL);
  return(0);
}

gcc -o pthreads pthreads.c -pthread
28

Shared Memory – OpenMP
• OpenMP = Open Multi-Processing: designed for multi-platform shared memory parallel programming
• Standard was released in 1997 and is portable to many different systems and languages
• Uses compiler directives to control execution
• Easier to implement (than Pthreads)
• OpenMP uses the fork-join model of parallel execution
[Figure: fork-join model: a serial main() thread forks a team of threads at each #pragma omp parallel region and joins back to a single thread when the region ends]
29

Shared Memory – OpenMP
#include <omp.h>
#include <stdio.h>
int main() {
  int nthreads;
  int tid;
  /* Fork a team of threads with each thread having a private tid variable */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello from thread %d!\n", tid);
    /* Only main thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join main thread and terminate */
  return(0);
}
gcc -o omp omp.c -fopenmp
export OMP_NUM_THREADS=8
./omp
unset OMP_NUM_THREADS
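As a small aside (not from the original slides), the thread count can also be requested from inside the program with the OpenMP runtime call omp_set_num_threads(), instead of through OMP_NUM_THREADS; a minimal sketch:

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_num_threads(8);   /* request 8 threads instead of using OMP_NUM_THREADS */
  #pragma omp parallel
  {
    printf("Hello from thread %d of %d!\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return(0);
}

It is compiled and run the same way with gcc -fopenmp.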
30

OpenMP – Example: Calculation on Array
[Figure: ten cores (0–9) each add one element of array [a] to the corresponding element of array [b] at the same time, producing [c] = [a] + [b]]
31

Shared Memory – OpenMP
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define CHUNKSIZE 10
#define N 100

int main () {
  int nthreads;
  int mytid;
  int i;
  int chunk;
  float a[N], b[N], c[N];

  /* initialize the arrays */
  for (i = 0; i < N; i++) {
    a[i] = (float)i;
    b[i] = (float)i;
  }
  chunk = CHUNKSIZE;

  /* Work-sharing: the for loop iterations are divided among the threads */
  #pragma omp parallel shared(a, b, c, nthreads, chunk) private(i, mytid)
  {
    mytid = omp_get_thread_num();
    if (mytid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d now starting...\n", mytid);
    #pragma omp for schedule(dynamic, chunk)
    for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
      printf("Thread %d: c[%d]= %f\n", mytid, i, c[i]);
    }
  } /* end of parallel section */
  return(0);
}

gcc -o worksharing worksharing.c -fopenmp
export OMP_NUM_THREADS=8
./worksharing
unset OMP_NUM_THREADS
32
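As a brief aside on the schedule clause (a sketch, not part of the original slides): schedule(dynamic, chunk) hands chunks of 10 iterations to threads as they finish their previous chunk, while schedule(static, chunk) divides the chunks among the threads in a fixed round-robin order before the loop starts. The same computation with a static schedule could look like this:

#include <omp.h>
#include <stdio.h>
#define CHUNKSIZE 10
#define N 100

int main () {
  int i;
  int chunk = CHUNKSIZE;
  float a[N], b[N], c[N];

  for (i = 0; i < N; i++) {
    a[i] = (float)i;
    b[i] = (float)i;
  }

  /* Same computation as above, but chunks of iterations are assigned to
     threads in a fixed round-robin order decided before the loop runs */
  #pragma omp parallel for schedule(static, chunk)
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
  }

  printf("c[%d] = %f\n", N - 1, c[N - 1]);
  return(0);
}

It compiles and runs the same way (gcc -fopenmp); for a uniform loop like this one, a static schedule usually carries less scheduling overhead than a dynamic one.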