CS计算机代考程序代写 matlab cache Digital System Design 4 Parallel Computing Architecture 2

Digital System Design 4 Parallel Computing Architecture 2
Stewart Smith Digital Systems Design 4

This Lecture

‣ ‣ ‣ ‣ ‣
Parallel Computing and Performance
Definitions
Speedup and Efficiency Return to Amdahl’s Law Scaling
Load balancing
Stewart Smith
Digital Systems Design 4

Parallel Computing


• •


Goal: connecting multiple computers to get higher performance
Multiprocessors
‣ Scalability, availability, power efficiency
Job-level (process-level) parallelism ‣ High throughput for independent jobs
Parallel processing program
‣ Single program run on multiple processors
Multicore microprocessors
Chips with multiple processors (cores)
Stewart Smith
Digital Systems Design 4

Hardware and Software
Software
Sequential
Concurrent
Hardware
Serial
Matlab routine running on a Pentium 4
Operating System running on a Pentium 4
Parallel
Matlab routine running on a Intel Core i7
Operating System running on an Intel Core i7


Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware
Stewart Smith
Digital Systems Design 4

Parallel Processing Challenges
• •
• •
‣ ‣
Scheduling
‣ Getting jobs to multiple processors at the right time
Partioning
‣ Dividing work up between processors efficiently
Load balancing
‣ Speed up limited by the longest/biggest task
Synchronisation and communication
If information needs to be shared between processor
How to bring the results from individual processors together at the end
Stewart Smith
Digital Systems Design 4


T1 = time to run with one worker

TP = time to run with P workers

T1/TP = speedup
The relative reduction in time to complete the same task
Ideal case is linear speedup with P
i.e., 4 workers gives a best-case speedup of 4.
In real cases, speedup often significantly less
In rare cases, such as search, can be super-linear
– [e.g. Suddenly amount of stuff per processor becomes small enough to fit in cache]





Parallel Speedup
Stewart Smith
Digital Systems Design 4

Parallel Efficiency



‣ ‣

T1 = time to run with one worker
TP = time to run with P workers
T1/(PTP) = efficiency
1 is perfect efficiency
Like linear speedup, perfect efficiency is hard
to achieve
Note that this is not the same as “utilisation”
Stewart Smith
Digital Systems Design 4

Amdahl’s Law
• •
part!”
Tsequential + Tparallel /P ≤ TP
“You can only parallelise the parallelisable
Stewart Smith
Digital Systems Design 4

Amdahl’s Law


Tnew = Tparallelisable + Tsequential 100 1
Speedup = (1 − Fparallel) + Fparallel 100
‣ Solving: Fparallel = 0.999

Sequential part will limit speedup
Example: 100 processors, we want 90× speedup?
– –
= 90
Need sequential part to be 0.1% of original time
Stewart Smith
Digital Systems Design 4

Scaling Example1

• •


Workload: sum of 10 scalars, and 10×10 matrix sum ‣ Speed up from 10 to 100 processors
Single processor: Time = (10 + 100) × tadd 10 processors
‣ Time = 10 × tadd + 100/10 × tadd = 20 × tadd ‣ Speedup = 110/20 = 5.5 (55% of potential)
100 processors
‣ Time = 10 × tadd + 100/100 × tadd = 11 × tadd ‣ Speedup = 110/11 = 10 (10% of potential)
Assumes load can be balanced across processors
Stewart Smith
Digital Systems Design 4


What if matrix size is 100×100?

Single processor: Time = (10 + 10000) × tadd

10 processors


‣ Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd ‣ Speedup = 10010/1010 = 9.9 (99% of potential)
Scaling Example 2
100 processors
‣ Time = 10 × tadd + 10000/100 × tadd = 110 × tadd ‣ Speedup = 10010/110 = 91 (91% of potential)
Assuming load balanced
Stewart Smith
Digital Systems Design 4

Strong and Weak Scaling






‣ Time = 20 × tadd
Strong scaling: problem size fixed
As in example
Weak scaling: problem size proportional to number of processors
10 processors, 10 × 10 matrix
100 processors, 32 × 32 matrix
‣ Time = 10 × tadd + 1024/100 × tadd ≈ 20 × tadd Constant performance in this example
Stewart Smith
Digital Systems Design 4

Load Balancing
• •
Execution time = Max ✓ 9800t , 200t ◆ + 10t = 210t 99 1
Speedup = 10010t/210t = 48× Digital Systems Design 4
100 processors on 100×100 matrix gives speedup of 91× but requires balanced load
If one processor has 2% load not 1%
‣ It does 200 additions while the rest share 9800

Stewart Smith

Load Balancing

Situation is even worse if one processor does 5% of the work.
‣ It does 500 additions while the rest share 9500 Execution time = Max ✓ 9500t , 500t ◆ + 10t = 510t
•‣ Speedup = 10010t/510t = 20×
Speedup limited by the highest loaded processor
99 1
Stewart Smith
Digital Systems Design 4

• • •
Multicore processors Multiprocessor systems (clusters) Multi-threading
Next Lecture
Stewart Smith
Digital Systems Design 4