Intro to Parallel Computing
Topic 1: Introduction to Computer Parallelism
COSC 407: Intro to Parallel Computing
Changing times
§ From 1986 to 2002, the performance of microprocessors increased, on average, by about 50% per year.
§ Since then, the increase dropped to about 20% per year.
§ Now it has dropped even further.
§ But we still need high performance in many applications…
– e.g., NP problems, gaming, big data analysis, …
§ Why do we need ever-increasing performance?
– Computational power is increasing, but so are our computational problems and needs.
– Problems we never dreamed of have been solved because of past increases, such as decoding the human genome.
– More complex problems are still waiting to be solved.
Open Areas…
§ Climate modeling
§ Protein folding
§ Data analysis
§ Energy research
How to Predict Like an Analyst
§ Extrapolate Moore's Law to power use:
– Conclusion #1: Your own nuclear reactor by…
§ Extrapolate Moore's Law to heat generation:
– Conclusion #2: You will need a new office by 2030
[Figures: Moore's Law extrapolations of power consumption and heat generation]
There’s Plenty of Room at the Bottom…?
§ More transistors doesn't always mean more speed or increased performance!
§ Designs are approaching the physical limitations of the fabrication process.
Source: http://www.gotw.ca/publications/concurrency-ddj.htm
Harder to Increase Frequency
§ Power consumption
§ Leakage problems
§ CPU frequency world record: approaching 9 GHz (http://valid.canardpc.com/records.php)
Harder to Increase Frequency cont’d
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/46413-amd-achieves-new-world-record-cpu-frequency.html
An intelligent solution
§ Instead of designing and building faster microprocessors, put multiple processors on a single integrated circuit.
Now it’s up to the programmers
§ Adding more processors doesn’t help much if programs aren’t aware of them…
… or don’t know how to use them.
§ Serial programs don’t benefit from this approach (in most cases).
§ Running multiple instances of a serial program often isn’t very useful.
– Think of running multiple instances of your favorite game.
– What you really want is for it to run faster!
Approaches to the Serial Problem
§ Option 1: Rewrite serial programs so that they are parallel.
§ Option 2: Write translation programs that automatically convert serial programs into parallel programs.
– This is very difficult to do, and success has been limited.
§ An efficient parallel implementation of a serial program may not be obtained by parallelizing each of its steps efficiently. Rather, the best parallelization may be obtained by stepping back and devising an entirely new algorithm.
Serial solution
§ Problem: compute n values and add them together.
§ Serial solution:

sum = 0;
for (i = 0; i < n; i++) {
    sum += compute_next_value(...);
}
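A minimal, runnable version of this serial loop is sketched below. The function compute_next_value is only a placeholder on the slide, so here it is stubbed out with a hypothetical definition that simply returns its index; only the shape of the summation loop matters.

#include <stdio.h>

/* Hypothetical stand-in for compute_next_value(...): just returns the index. */
static int compute_next_value(int i) {
    return i;
}

int main(void) {
    int n = 24;
    int sum = 0;

    for (int i = 0; i < n; i++) {
        sum += compute_next_value(i);   /* accumulate one value per iteration */
    }

    printf("sum = %d\n", sum);          /* 0 + 1 + ... + 23 = 276 */
    return 0;
}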
Concurrent Solution
§ Assume we have p cores, with p much smaller than n.
§ Each core can compute a partial sum of approximately n/p values.
[Figure: the n values are divided into p blocks, one per core: Core 0, Core 1, ..., Core p-1]
my_sum = 0;
my_a = ...;
my_b = ...;
for (my_i = my_a; my_i < my_b; my_i++) {
    my_sum += compute_next_value(...);
}

Each core uses its own private variables and executes this block of code independently of the other cores.
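The slide leaves my_a and my_b as "...". One common way to set them (an assumption here, not something the slide specifies) is a block partition based on the core's rank my_rank in 0..p-1, assuming for simplicity that n is divisible by p:

/* Hypothetical block partitioning: core my_rank handles indices [my_a, my_b). */
int chunk = n / p;              /* size of each core's block (assumes n % p == 0) */
int my_a  = my_rank * chunk;    /* first index owned by this core */
int my_b  = my_a + chunk;       /* one past the last index owned by this core */

int my_sum = 0;
for (int my_i = my_a; my_i < my_b; my_i++) {
    my_sum += compute_next_value(my_i);   /* each core sums only its own block */
}

If n is not divisible by p, the leftover n % p iterations would have to be spread over some of the cores, e.g. by giving the first n % p cores one extra element each.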
Let’s Try Some Numbers...
§ After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to compute_next_value(...).
§ Example: 8 cores, n = 24, and the calls to compute_next_value return:
– 1,4,3, 9,2,8, 5,1,1, 5,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9
§ The values stored in my_sum on cores 0 through 7 are then:
– 8, 19, 7, 14, 7, 13, 12, 14
Concurrent Solution (cont.)
§ Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated "master" core, which adds them to produce the final result.
Q: potential problems?
Concurrent Solution (cont.)
if (I'm the master core) {
    global_sum = my_sum;
    for each core other than myself {
        receive value from core;
        global_sum += value;
    }
} else {
    send my_sum to the master core;
}
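This pseudocode maps directly onto message passing. Below is a hedged sketch using MPI (MPI appears later in the course; its use here is illustrative, not part of the slide), where each process fakes a partial sum equal to its rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int my_rank, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int my_sum = my_rank;          /* stand-in for a real partial sum */

    if (my_rank == 0) {            /* master core */
        int global_sum = my_sum;
        for (int src = 1; src < p; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            global_sum += value;   /* the master does every addition */
        }
        printf("global sum = %d\n", global_sum);
    } else {                       /* all other cores just send their result */
        MPI_Send(&my_sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}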
Is this the best we can do?
Here is one of the problems...
§ Only the master will combine the results while other cores are idle. Can we improve the algorithm?
[Figure: the other cores each send their partial sum (computed from 1,4,3, 9,2,8, 5,1,1, 5,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9) to the master core, which adds them one at a time.]
Better Parallel Algorithm
§ There's a better way to compute the global sum, especially with a large number of cores.
§ Don't make the master core do all the work. Share it among the other cores! (A sketch of this pattern follows the list below.)
– Pair the cores so that
• Core 0 adds its result to core 1's result,
• Core 2 adds its result to core 3's result, etc.
– Repeat the process, now with only the even-ranked cores:
• Core 0 adds the result from core 2,
• Core 4 adds the result from core 6, etc.
– Then the cores whose ranks are divisible by 4 repeat the process, and so on, until core 0 has the final result.
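Below is a hedged sketch of this pairing pattern. It uses the eight partial sums from the earlier example and simulates the stages serially (my own illustration; on real hardware each addition within a stage would be done by a different core, in parallel):

#include <stdio.h>

int main(void) {
    /* Partial sums from the 8-core example: one entry per core. */
    int sums[] = {8, 19, 7, 14, 7, 13, 12, 14};
    int p = 8;

    /* Tree-structured combination: at each stage, "core" i adds the result
       of its partner i + stride; the number of active cores halves each stage. */
    for (int stride = 1; stride < p; stride *= 2) {
        for (int i = 0; i + stride < p; i += 2 * stride) {
            sums[i] += sums[i + stride];
        }
    }

    printf("global sum = %d\n", sums[0]);   /* prints 94 */
    return 0;
}

With p cores, this takes about log2(p) stages instead of the p - 1 sequential additions performed by the master in the first approach.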
Multiple Cores Forming a Global Sum
§ In the first example (master does all the work), the master core performs 7 receives and 7 additions.
§ In the second example (tree-structured sum), the master core performs only 3 receives and 3 additions.
– The improvement is more than a factor of 2.
§ The difference is more dramatic with a larger number of cores.
– If we have 1000 cores:
• The first example would require the master to perform 999 receives and 999 additions.
• The second example would only require 10 receives and 10 additions (since the number of stages is ceil(log2(1000)) = 10).
• That's an improvement of almost a factor of 100.
How Do We Write Parallel Programs?
§ Basic idea:
– Partition the work to be done among the cores!
§ Two approaches (a hedged preview sketch follows this list):
1. Task parallelism
• Partition the various tasks carried out in solving the problem among the cores.
2. Data parallelism
• Partition the data used in solving the problem among the cores.
• Each core carries out similar operations on its part of the data.
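As a hedged preview of data parallelism (OpenMP is covered later in the course; this specific example is my illustration, not the slide's): the pragma below divides the loop iterations, and therefore the data in a[], among the cores, and the reduction clause combines the per-core partial sums.

#include <stdio.h>

int main(void) {
    int n = 24;
    int a[24];
    for (int i = 0; i < n; i++) a[i] = i;   /* some data to sum */

    int sum = 0;
    /* Data parallelism: each core sums its own share of a[];
       reduction(+:sum) safely combines the per-core partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }

    printf("sum = %d\n", sum);   /* 276 */
    return 0;
}

Compile with OpenMP enabled (e.g., gcc -fopenmp) to actually run the loop in parallel; without it, the pragma is ignored and the loop runs serially with the same result.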
How Do We Write Parallel Programs?
§ The professor example:
– 15 questions
– 300 exams
– 3 TAs
Division of Work: Data Parallelism
– Each TA grades all 15 questions on 100 of the 300 exams (the data, i.e. the exams, is partitioned among the TAs).
Division of Work: Task Parallelism
– TA#1 grades questions 1-5, TA#2 grades questions 6-10, and TA#3 grades questions 11-15, each across all 300 exams (the tasks, i.e. the questions, are partitioned).
– All TAs total the marks.
– This raises communication, load-balancing, and synchronization issues.
Coordination
§ Cores usually need to coordinate their work.
§ Communication – one or more cores send their current partial sums to another core.
§ Load balancing – share the work evenly among the cores so that no core is overloaded.
• If one core has to compute most of the values, the other cores will finish much sooner and their computational power will be wasted.
§ Synchronization – because each core works at its own pace, make sure cores do not get too far ahead of the rest (e.g., if the tasks of one core depend on the results from another core). A small sketch follows below.
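A hedged illustration of synchronization (my example, using an OpenMP barrier, which the slide does not mention): no thread may read the other threads' results until every thread has reached the barrier.

#include <stdio.h>
#include <omp.h>

#define P 4

int main(void) {
    int partial[P] = {0};

    #pragma omp parallel num_threads(P)
    {
        int id = omp_get_thread_num();
        partial[id] = (id + 1) * 10;   /* stand-in for this core's real work */

        #pragma omp barrier            /* wait here until every core is done */

        /* Only after the barrier is it safe to read the other cores' results. */
        if (id == 0) {
            int total = 0;
            for (int i = 0; i < P; i++) total += partial[i];
            printf("total = %d\n", total);   /* 10 + 20 + 30 + 40 = 100 */
        }
    }
    return 0;
}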
A few more concepts...
§ Computer architectures can be classified by memory organization:
§ Multiprocessor with Shared Memory
– Cores share the same memory.
• Multiple cores on a single chip, e.g. a multicore CPU
• Multiple cores on separate chips, e.g. multiple processors, GPUs
§ Multi-node with Distributed Memory
– Cores don't share memory,
– but they are connected by a network.
[Figure: each processor has its own cache and memory; the nodes are connected by a network.]
Flynn’s Taxonomy
Computer architectures may also be classified into four groups:
§ Single Instruction, Single Data (SISD)
• 1 core (serial architecture)
– One instruction at a time, operating on a single data stream
§ Single Instruction, Multiple Data (SIMD, e.g. GPUs)
• Multiple cores (parallel architecture)
– All cores execute the same instruction at any time
– Each core operates on a different portion of the data
§ Multiple Instruction, Single Data (MISD)
• Multiple cores, each operating on the same data stream
• Uncommon architecture
§ Multiple Instruction, Multiple Data (MIMD)
• Multiple cores operate on multiple data streams; each core runs independent instructions
More about Flynn's taxonomy later…
(Not this Flynn… Image retrieved from: https://disney.fandom.com/wiki/Kevin_Flynn)
Recap: What We’ll Be Doing in this Course…
§ Learning to write programs that are explicitly parallel (or at least contain some parallel aspects…)
– Different types of parallelism
§ Using different extensions to C:
– POSIX Threads – Pthreads (shared memory, low level)
– OpenMP (shared memory)
– CUDA (shared memory)
– Message-passing concurrency (distributed memory)
§ We may also have a look at Java concurrency mechanisms
• Time permitting
Where Things Sit
[Figure: where Pthreads/OpenMP sit in the landscape of parallel programming approaches]
§ Read: "An Introduction to C Programming for Java Programmers"
– Link on Canvas.
§ Read the course notes on C.