Parallel 2 2021

Stewart Smith Digital Systems Design 4

Digital System Design 4
Parallel Computing Architecture 2

This Lecture

• Parallel Computing and Performance
‣ Definitions
‣ Speedup and Efficiency
‣ Return to Amdahl’s Law
‣ Scaling
‣ Load balancing

Parallel Computing
• Goal: connecting multiple computers

to get higher performance
‣ Multiprocessors
‣ Scalability, availability, power efficiency
• Job-level (process-level) parallelism
‣ High throughput for independent jobs
• Parallel processing program
‣ Single program run on multiple processors
• Multicore microprocessors
‣ Chips with multiple processors (cores)

Hardware and Software

• Sequential/concurrent software can run on serial/parallel hardware
‣ Challenge: making effective use of parallel hardware

• Hardware/software combinations:
‣ Serial hardware, sequential software: Matlab routine running on a Pentium 4
‣ Serial hardware, concurrent software: Operating system running on a Pentium 4
‣ Parallel hardware, sequential software: Matlab routine running on an Intel Core i7
‣ Parallel hardware, concurrent software: Operating system running on an Intel Core i7

Parallel Processing Challenges

• Scheduling
‣ Getting jobs to multiple processors at the right time
• Partitioning
‣ Dividing work up between processors efficiently
• Load balancing
‣ Speed up limited by the longest/biggest task
• Synchronisation and communication
‣ If information needs to be shared between processors
‣ How to bring the results from individual processors together at the end

Parallel Speedup
• T1 = time to run with one worker
• TP = time to run with P workers
• T1/TP = speedup
‣ The relative reduction in time to complete the same task
‣ Ideal case is linear speedup with P
‣ i.e., 4 workers gives a best-case speedup of 4
‣ In real cases, speedup is often significantly less
‣ In rare cases, such as search, speedup can be super-linear
– e.g. the amount of data per processor suddenly becomes small enough to fit in cache

Parallel Efficiency
• T1 = time to run with one worker
• TP = time to run with P workers
• T1/(PTP) = efficiency
‣ 1 is perfect efficiency
‣ Like linear speedup, perfect efficiency is hard to achieve

‣ Note that this is not the same as “utilisation”
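These two definitions can be captured in a couple of one-line helpers (a minimal sketch; the function names and the 4-worker example are illustrative, not from the slides):

```python
def speedup(t1, tp):
    """T1/TP: relative reduction in time to complete the same task."""
    return t1 / tp

def efficiency(t1, tp, p):
    """T1/(P*TP): speedup per worker; 1.0 is perfect efficiency."""
    return t1 / (p * tp)

# Example: a task taking 100 s on one worker and 30 s on 4 workers
# achieves a speedup of ~3.33 and an efficiency of ~0.83 -- short of
# the ideal 4 and 1.0.
```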

Amdahl’s Law
• Tsequential + Tparallel/P ≤ TP
• “You can only parallelise the parallelisable part!”

Amdahl’s Law
• Sequential part will limit speedup
‣ Example: 100 processors, we want 90× speedup?

Tnew = Tparallelisable/100 + Tsequential

Speedup = 1/((1 − Fparallel) + Fparallel/100) = 90

‣ Solving: Fparallel ≈ 0.999
• Need sequential part to be 0.1% of original time
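The slide’s arithmetic can be checked numerically (a sketch; the function names are illustrative):

```python
def amdahl_speedup(f_parallel, p):
    """Speedup = 1 / ((1 - F) + F/P) for parallel fraction F on P processors."""
    return 1.0 / ((1.0 - f_parallel) + f_parallel / p)

def required_fraction(target_speedup, p):
    """Invert Amdahl's Law: the parallel fraction F needed for a target speedup.
    Solves 1/((1-F) + F/P) = S for F."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / p)

# For 90x speedup on 100 processors, F works out to ~0.999 -- i.e. the
# sequential part can be at most ~0.1% of the original time.
```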

Scaling Example 1
• Workload: sum of 10 scalars, and 10×10 matrix sum
‣ Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
‣ Time = 10 × tadd + 100/10 × tadd = 20 × tadd
‣ Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
‣ Time = 10 × tadd + 100/100 × tadd = 11 × tadd
‣ Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors

Scaling Example 2
• What if matrix size is 100×100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
‣ Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
‣ Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
‣ Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
‣ Speedup = 10010/110 = 91 (91% of potential)
• Assuming load balanced
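Both strong-scaling examples follow the same formula, which can be sketched as follows (tadd normalised to 1 time unit; function names are illustrative):

```python
def exec_time(scalars, matrix_elems, p, t_add=1.0):
    """Sequential scalar sums plus a matrix sum divided evenly across P processors."""
    return scalars * t_add + (matrix_elems / p) * t_add

def strong_speedup(scalars, matrix_elems, p):
    """Speedup for a fixed problem size on P processors."""
    return exec_time(scalars, matrix_elems, 1) / exec_time(scalars, matrix_elems, p)

# 10x10 matrix:   5.5x on 10 processors, 10x on 100.
# 100x100 matrix: ~9.9x on 10 processors, 91x on 100.
```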

Strong and Weak Scaling
• Strong scaling: problem size fixed
‣ As in the previous examples
• Weak scaling: problem size proportional to number of processors
‣ 10 processors, 10×10 matrix: Time = 20 × tadd
‣ 100 processors, 32×32 matrix: Time = 10 × tadd + 1024/100 × tadd ≈ 20 × tadd
• Constant performance in this example
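The weak-scaling comparison can be verified in the same way (a sketch; tadd normalised to 1 time unit, names illustrative):

```python
def weak_time(seq_adds, matrix_elems, p, t_add=1.0):
    """Sequential part plus the matrix work split evenly across P processors."""
    return seq_adds * t_add + (matrix_elems / p) * t_add

# Growing the matrix roughly in proportion to P keeps the time ~constant:
# 10 processors / 10x10 matrix -> 20 t_add; 100 processors / 32x32 -> ~20.2 t_add.
```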

Load Balancing
• 100 processors on a 100×100 matrix gives a speedup of 91×, but requires a balanced load
• If one processor has 2% of the load rather than 1%
‣ It does 200 additions while the remaining 99 share 9800

Execution time = Max(9800t/99, 200t/1) + 10t = 210t

‣ Speedup = 10010t/210t ≈ 48×

Load Balancing
• The situation is even worse if one processor does 5% of the work
‣ It does 500 additions while the remaining 99 share 9500

Execution time = Max(9500t/99, 500t/1) + 10t = 510t

‣ Speedup = 10010t/510t ≈ 20×
• Speedup is limited by the most heavily loaded processor
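Both load-balancing cases reduce to the same max-based formula, sketched below (tadd normalised to 1 time unit; the function name is illustrative):

```python
def imbalanced_time(total_adds, p, heavy_share, seq_adds=10, t=1.0):
    """One processor takes heavy_share of the additions; the other P-1
    processors share the rest evenly. Execution time is set by whichever
    finishes last, plus the sequential part."""
    heavy = heavy_share * total_adds * t
    rest = (1.0 - heavy_share) * total_adds * t / (p - 1)
    return max(heavy, rest) + seq_adds * t

# 2% on one of 100 processors: 210t, so speedup drops to 10010/210 ~ 48x.
# 5%: 510t, so speedup drops to 10010/510 ~ 20x.
```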

Next Lecture

• Multicore processors
• Multiprocessor systems (clusters)
• Multi-threading