Digital Systems Design 4
Parallel Computing Architecture 2
Stewart Smith
This Lecture
• Parallel Computing and Performance
  ‣ Definitions
  ‣ Speedup and Efficiency
  ‣ Return to Amdahl’s Law
  ‣ Scaling
  ‣ Load balancing
Parallel Computing
• Goal: connecting multiple computers to get higher performance
  ‣ Multiprocessors
  ‣ Scalability, availability, power efficiency
• Job-level (process-level) parallelism
  ‣ High throughput for independent jobs
• Parallel processing program
  ‣ Single program run on multiple processors
• Multicore microprocessors
  ‣ Chips with multiple processors (cores)
Hardware and Software

                      | Software: Sequential                        | Software: Concurrent
  Hardware: Serial    | Matlab routine running on a Pentium 4       | Operating System running on a Pentium 4
  Hardware: Parallel  | Matlab routine running on an Intel Core i7  | Operating System running on an Intel Core i7

• Sequential/concurrent software can run on serial/parallel hardware
  ‣ Challenge: making effective use of parallel hardware
Parallel Processing Challenges
• Scheduling
  ‣ Getting jobs to multiple processors at the right time
• Partitioning
  ‣ Dividing work up between processors efficiently
• Load balancing
  ‣ Speedup limited by the longest/biggest task
• Synchronisation and communication
  ‣ If information needs to be shared between processors
  ‣ How to bring the results from individual processors together at the end
Parallel Speedup
• T1 = time to run with one worker
• TP = time to run with P workers
• T1/TP = speedup
  ‣ The relative reduction in time to complete the same task
  ‣ Ideal case is linear speedup with P
    – i.e. 4 workers gives a best-case speedup of 4
  ‣ In real cases, speedup is often significantly less
  ‣ In rare cases, such as search, it can be super-linear
    – e.g. the amount of data per processor suddenly becomes small enough to fit in cache
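As a quick illustration, here is a minimal Python sketch of the speedup calculation; the timings used are made-up figures, not measurements from the lecture:

    def speedup(t1, tp):
        """Relative reduction in time: T1 / TP."""
        return t1 / tp

    # Hypothetical timings: 100 s with one worker, 28 s with four workers
    print(speedup(100.0, 28.0))   # ~3.6x, below the ideal 4x for P = 4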
Parallel Efficiency
• T1 = time to run with one worker
• TP = time to run with P workers
• T1/(P × TP) = efficiency
  ‣ 1 is perfect efficiency
  ‣ Like linear speedup, perfect efficiency is hard to achieve
  ‣ Note that this is not the same as “utilisation”
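Continuing the sketch above with the same made-up timings, efficiency normalises the speedup by the number of workers:

    def efficiency(t1, tp, p):
        """T1 / (P * TP): 1.0 is perfect efficiency."""
        return t1 / (p * tp)

    # Hypothetical run: 100 s with one worker, 28 s with P = 4 workers
    print(efficiency(100.0, 28.0, 4))   # ~0.89, i.e. 89% of perfect efficiency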
Amdahl’s Law
• Tsequential + Tparallel/P ≤ TP
• “You can only parallelise the parallelisable part!”
Amdahl’s Law
• Sequential part will limit speedup
• Example: 100 processors, we want 90× speedup?
  ‣ Tnew = Tparallelisable/100 + Tsequential
  ‣ Speedup = 1/((1 − Fparallel) + Fparallel/100) = 90
  ‣ Solving: Fparallel = 0.999
• Need sequential part to be 0.1% of original time
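These figures can be checked with a short Python sketch of the speedup expression above (the function name amdahl_speedup is just for illustration):

    def amdahl_speedup(f_parallel, p):
        """Speedup = 1 / ((1 - Fparallel) + Fparallel / P)."""
        return 1.0 / ((1.0 - f_parallel) + f_parallel / p)

    print(amdahl_speedup(0.999, 100))       # ~91x: 99.9% parallelisable meets the 90x target

    # Rearranging for the fraction needed to hit a given target speedup:
    target, p = 90, 100
    f_needed = (1 - 1 / target) / (1 - 1 / p)
    print(f_needed)                         # ~0.9989, so the sequential part must be ~0.1%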
Scaling Example 1
• Workload: sum of 10 scalars, and 10×10 matrix sum
  ‣ Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  ‣ Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  ‣ Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
  ‣ Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  ‣ Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors
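This slide’s arithmetic can be reproduced with a small Python sketch; the helper name scaled_time is just for illustration, and times are in units of tadd:

    def scaled_time(n_scalars, matrix_elems, p):
        """Scalar sum stays sequential; the matrix sum is split across P processors."""
        return n_scalars + matrix_elems / p

    serial = scaled_time(10, 100, 1)     # 110 x tadd on a single processor
    for p in (10, 100):
        t = scaled_time(10, 100, p)
        print(p, t, serial / t)          # 10 -> 20 x tadd, speedup 5.5; 100 -> 11 x tadd, speedup 10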
Scaling Example 2
• What if matrix size is 100×100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
  ‣ Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  ‣ Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
  ‣ Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  ‣ Speedup = 10010/110 = 91 (91% of potential)
• Assuming load balanced
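The same formula applied to the 100×100 matrix, as a quick Python check of the numbers above:

    # Sequential part is still 10 x tadd; now 10000 matrix additions are shared
    print(10 + 10000 / 10, 10010 / (10 + 10000 / 10))     # 1010 x tadd, speedup ~9.9
    print(10 + 10000 / 100, 10010 / (10 + 10000 / 100))   # 110 x tadd, speedup ~91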
Strong and Weak Scaling
• Strong scaling: problem size fixed
  ‣ As in the previous examples
• Weak scaling: problem size proportional to number of processors
  ‣ 10 processors, 10×10 matrix
    – Time = 20 × tadd
  ‣ 100 processors, 32×32 matrix
    – Time = 10 × tadd + 1024/100 × tadd ≈ 20 × tadd
  ‣ Constant performance in this example
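A quick sketch of the weak-scaling check, growing the matrix with the processor count as on the slide (times in units of tadd):

    # 10 scalar additions stay sequential; the matrix sum is split across the processors
    print(10 + 10 * 10 / 10)      # 10 processors, 10x10 matrix  -> 20.0 x tadd
    print(10 + 32 * 32 / 100)     # 100 processors, 32x32 matrix -> ~20.2 x tadd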
Load Balancing
• 100 processors on 100×100 matrix gives speedup of 91×, but requires balanced load
• If one processor has 2% load not 1%
  ‣ It does 200 additions while the rest share 9800
  ‣ Execution time = Max(9800t/99, 200t/1) + 10t = 210t
  ‣ Speedup = 10010t/210t = 48×
Load Balancing
• Situation is even worse if one processor does 5% of the work
  ‣ It does 500 additions while the rest share 9500
  ‣ Execution time = Max(9500t/99, 500t/1) + 10t = 510t
  ‣ Speedup = 10010t/510t = 20×
• Speedup is limited by the most heavily loaded processor
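Both imbalance cases can be checked with a short Python sketch; the helper name imbalanced_time is just for illustration, and times are in units of t:

    def imbalanced_time(extra_share):
        """One processor takes extra_share of the 10000 matrix additions;
        the other 99 processors split the remainder."""
        busiest = 10000 * extra_share
        others = 10000 * (1 - extra_share) / 99
        return max(busiest, others) + 10   # the 10 scalar additions stay sequential

    for share in (0.02, 0.05):
        t = imbalanced_time(share)
        print(share, t, 10010 / t)         # 2% -> 210t, speedup ~48; 5% -> 510t, speedup ~20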
Next Lecture
• Multicore processors
• Multiprocessor systems (clusters)
• Multi-threading