Computer architecture: parallelization
Dr Fei Xia and Dr Alex Bystrov
Why Parallel systems?
• Use multiple processors in parallel to compute faster than a single processor
• Example parallel computing systems:
– Cluster computers (like supercomputers) contain multiple (up to a few hundred) PCs connected by a high-speed network – each PC has one or more processors
– Modern high-performance workstations have multiple (up to a dozen or more) processors
– Chip multiprocessors (CMP) contain multiple cores on a single chip [check your handheld phones]
• State-of-the-art multicore systems:
– AMD Opteron (4/8 64-bit x86 cores, L3 cache), Intel Nehalem (4/8 64-bit cores, L3 cache), Intel's 80-core Teraflops Research Processor, Intel Xeon Phi (61 cores × 4 threads)
– IBM Power5/6 processors with dual cores, Sun UltraSPARC T2 with eight cores, etc.
– ARM big.LITTLE (in your smartphones in 2+4, 4+4, 2+2+4, etc. configurations), Intel Xeon Phi (61-core co-processor)
• Traditionally driven by performance needs; energy consumption is an emerging need
Energy/power advantage
• Dynamic power of MOSFET circuits
P = A·F·V² (from 1st principles – can you derive this?)
If we run double the number of cores (parallelize by 2) we may get the same performance with
A_2 = 2A, F_2 = F/2, but also V_2 < V
Therefore we have P_2 < P for the same execution speed – the same workload will complete in the same amount of time
• In other words, parallelization can help improve performance, or power, or both together
– https://ieeexplore.ieee.org/document/7999145
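A quick worked check of the claim above (an added sketch, not from the original slide), substituting the two-core values into the dynamic-power formula:

\[
P_2 = A_2 F_2 V_2^2 = (2A)\cdot\frac{F}{2}\cdot V_2^2 = A F V_2^2 < A F V^2 = P \qquad \text{since } V_2 < V .
\]

The saving comes from voltage scaling: halving the clock frequency lets each core run at a lower supply voltage, and dynamic power depends on that voltage quadratically.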
• How much does the speed of a system improve after an attempt at improving the system:
– Proportional concept
– Speed of system after improvement divided by speed of system before improvement
Speedup_improvement = Speed_after / Speed_before = Time_before / Time_after ∈ [1, ∞)
– “Improvement” implies that we have better speed after than before
• Improvements may include
– Increase operating frequency (e.g. overclocking)
– Use accelerator (e.g. graphics card)
– Use pipelining and multithreading (e.g. Intel Hyper-Threading)
– Use multiple/more cores
– Change hard disk drive to solid state drive
– Use cache memory
– But do they automatically guarantee 𝑆𝑝𝑒𝑒𝑑𝑢𝑝 ≥ 1?
• How much does the speed of a system improve after an attempt at improving the system:
– Proportional concept
– Speed of system after change divided by speed of system before change
Speedup_change = Speed_after / Speed_before = Time_before / Time_after ∈ [0, ∞)
– General modification does not guarantee improvement
• Improvements may include
– Increase operating frequency (e.g. overclocking)
– Use accelerator (e.g. graphics card)
– Use pipelining and multithreading (e.g. Intel Hyper-Threading)
– Use multiple/more cores
– Change hard disk drive to solid state drive
– Use cache memory
– None of these automatically guarantee 𝑆𝑝𝑒𝑒𝑑𝑢𝑝 ≥ 1!
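For instance (illustrative numbers, not from the slides): if a job takes 10 s before a change and 8 s after it, Speedup = 10/8 = 1.25; if the change makes the job take 12.5 s instead, Speedup = 10/12.5 = 0.8 < 1 – the "improvement" is actually a slowdown. Adding cores can produce exactly this effect when synchronisation and communication overheads outweigh the parallel gain.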
The parallelizability of software
• “Standard” matrix addition code found in books:
for (c = 0; c < m; c++) {
    for (d = 0; d < n; d++) {
        sum[c][d] = first[c][d] + second[c][d];
    }
}
The additions themselves are fully parallelizable
(zero cross-dependency between different additions in the same step)
But the code is fully sequential!
(does not take advantage of multi-core hardware)
Almost all matrix operations are highly parallelizable – this is why GPUs and NPUs are highly parallel machines
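As an illustration only (an added sketch, not from the original slides): one common way to let this loop use multi-core hardware is OpenMP, assuming a C compiler with OpenMP support (e.g. gcc -fopenmp); the function name and signature below are made up for the example.

#include <omp.h>

/* Hypothetical helper: adds two m x n matrices element-wise.
   Every iteration is independent, so OpenMP can distribute the
   iterations of both loops across the available cores. */
void matrix_add(int m, int n, double sum[m][n],
                double first[m][n], double second[m][n])
{
    #pragma omp parallel for collapse(2)   /* split iterations over the cores */
    for (int c = 0; c < m; c++) {
        for (int d = 0; d < n; d++) {
            sum[c][d] = first[c][d] + second[c][d];
        }
    }
}

Without the pragma the function still compiles and runs, just sequentially; with it, the runtime decides how to map loop iterations onto the available cores.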
• Parallelism
– Parallelism indicates how parallelizable a workload/program/job is
– This can be illustrated by using task graphs
Tasks are represented by the nodes (vertices) – circles here, could be rectangles or other shapes
Task dependencies are represented by the directed arcs (edges) – an arc connects two tasks; when the first completes, the second may start
The general direction of all arcs indicates the direction of time progression - from top to bottom in this case
This type of graph is known as a DAG – directed acyclic graph. A task graph does not have loops; all loops must be unfolded to generate a task graph
[Figure: example task graph; time flows from top to bottom]
• Parallelism
– Task graphs do not have to have circles or go top-down
• Parallelism
Each step of a task graph shows the workload's instantaneous parallelism in that step
[Figure: task graph of tasks 1–12 in six steps – task 1 (p=1); tasks 2, 3 (p=2); tasks 4, 5, 6 (p=3); tasks 7, 8, 9, 10 (p=4); task 11 (p=1); task 12 (p=1)]
Intuitively, instantaneous parallelism is how many independent threads a workload has at a time step, which measures how parallelizable the workload is at that step.
For instance, you don't want to give a workload 4 cores in a time step where its parallelism is only 2, because you'd be wasting 2 cores. On the other hand, if you only have a single core available in that time step, even the parallelism of 2 cannot be exploited
We normally assume that a task is an indivisible thread – an atomic element of a workload. We also assume that each task takes the same time to run on a core (for ease of reasoning; not strictly necessary).
• Parallelism – The maths
Parallelism is the speedup of a workload if you run it on an infinite number of cores, compared to running it on one core: p = S(∞) = T_1 / T_∞
It therefore is the inherent parallelizability of the software, regardless of hardware. Because extra cores will be wasted, we also have S(n) ≤ p for any number of cores n
For instance, at time step 3 of the example task graph, if we give the workload an infinite number of cores, we'd achieve a speedup of 3 for that step, which can also be obtained by giving it just 3 cores.
• Parallelism – The maths
The overall (or average) parallelism of a workload can be derived from averaging all the instantaneous parallelisms across all of its steps, but there is a much easier way.
The ‘work’ of a workload is the sum of all of its tasks – this assumes that each task takes exactly the same amount of time to run for convenience – hence
work = T_1 = 12
The ‘span’ of a workload is the number of steps in the task graph of a workload – again assuming equal task size in terms of time for convenience – hence
span = T_∞ = 6
Hence dividing work by span gives us the overall parallelism of a workload: p = work / span = T_1 / T_∞ = 12 / 6 = 2 for this example
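A small illustrative C sketch (added here, not from the slides) of the same calculation, assuming the task graph is described simply by the number of tasks in each step:

#include <stdio.h>

/* Work = total number of tasks (T_1), span = number of steps (T_inf),
   overall parallelism = work / span. Tasks are assumed equal-sized. */
int main(void)
{
    int tasks_per_step[] = {1, 2, 3, 4, 1, 1};   /* the example task graph */
    int steps = sizeof tasks_per_step / sizeof tasks_per_step[0];

    int work = 0;
    for (int i = 0; i < steps; i++)
        work += tasks_per_step[i];

    double parallelism = (double)work / steps;
    printf("work T1 = %d, span Tinf = %d, parallelism = %.1f\n",
           work, steps, parallelism);            /* prints 12, 6, 2.0 */
    return 0;
}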
• Influence of hardware
You may not have enough hardware to take full advantage of the parallelism of a piece of software. For instance, you need 4 cores in order to fully exploit p_max = 4 in the example task graph, but what if you only have 3 cores?
With 3 cores, step 4 cannot be fully parallelized as in the task graph.
Usually you need to modify the task graph to reflect the hardware reality and scheduling constraints.
Scheduling refers to task-to-core mapping. This is usually done in system software, without the knowledge of the user, but it can also be intentionally controlled in some operating systems.
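As an added aside (not from the slides): on Linux, a program can pin the calling thread to a particular core, which is one way the task-to-core mapping can be "intentionally controlled"; this sketch assumes the GNU-specific pthread_setaffinity_np call is available.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core (core_id), so the OS scheduler
   will only run it there. Returns 0 on success. */
int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}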
• Influence of hardware
[Figure: modified task graph – task 7 delayed by one step; step parallelisms become 1, 2, 3, 3, 2, 1]
Task graph modification to fit workload onto 3 cores. The best (why?) option is to delay task 7 until the next step. This does not modify the logical relations within the task graph – causality is fully preserved.
Now with p_max = 3 the workload can be executed on 3 cores. Interestingly, neither work nor span changes from this modification. The workload still has the same overall parallelism of 2!
In other words, with 3 cores you can achieve the same speedup with this workload as with 4 cores, with the right task graph modification. The relative speedup obtained by moving from 3 to 4 cores is S(4) / S(3) = 1.
Homework: what happens if your system only has 2 cores?
Software parallelizability – an alternative view
• Amdahl’s law
– For studying in the abstract, i.e. what if you do not know the structure of the software and its inner workings, and a task graph is not available?
– Gene Amdahl, in the 1960s, proposed an abstract view of a software workload as consisting of a fully parallelizable part and a fully non-parallelizable part. In other words, a piece of software is regarded as one part with p = 1 and another part with p = ∞.
– The time taken by executing such a parallel workload on a single core can be divided into two parts, that taken by the sequential part (T_s) and that taken by the parallel part (T_p): T_1 = T_s + T_p
– Executing such a workload on n cores means that in the total time T_n, T_s stays the same, as you use one of the cores to execute the sequential part, leaving the other n−1 cores idle. However, the parallel part will take time T_p / n, as you get to use all n cores because p = ∞ and you can fill all the cores with no idling.
– The speedup achieved by moving from one core to n cores is then
S(n) = T_1 / T_n = (T_s + T_p) / (T_s + T_p / n)
Amdahl’s law
• An important parameter for Amdahl’s Law is what’s known as the ‘parallel fraction’.
• Fraction of the workload that is parallel, defined as f = T_p / T_1
(the fraction of the single-core execution time taken by the parallel part of the workload)
• Amdahl’s Law says that the speedup, with the parallel fraction, is then
S(n) = T_1 / T_n = (T_s + T_p) / (T_s + T_p / n) = 1 / (1 − f + f / n)
• The parallel part achieves linear speedup with the number of cores and the sequential part stays sequential.
• Typical scheduling decision: if you have a workload with small f and another one with large f, give the latter more cores.
S(n) = 1 / (1 − f + f / n)
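A tiny illustrative C helper (added here, not from the slides) that evaluates Amdahl's Law; the numbers in main are made up to show why a workload with a larger f benefits more from extra cores:

#include <stdio.h>

/* Amdahl's Law: speedup on n cores for a workload with parallel fraction f. */
double amdahl_speedup(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void)
{
    /* Illustrative values: a mostly-serial job vs a mostly-parallel one. */
    printf("f = 0.5, n = 4:  S = %.2f\n", amdahl_speedup(0.5, 4));  /* 1.60 */
    printf("f = 0.9, n = 4:  S = %.2f\n", amdahl_speedup(0.9, 4));  /* 3.08 */
    return 0;
}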
• Amdahl's Law examples
– Fixed workload
– 50% sequential part (f = 0.5)
– On a single core it takes 1 unit of time to complete
• Amdahl’s Law examples
– With two cores...
– Parallelizable part is distributed between the two cores
– Total time = 0.5 + 0.5/2 = 0.75
𝑆(2) = 1/0.75 = 1.333
• Amdahl's Law examples
– With three cores...
S(3) = 1 / (0.5 + 0.5/3) = 1.5
• Amdahl’s Law examples
– With an infinite number of cores...
S(∞) = 1 / (1 − f + f/∞) = 1 / (1 − f) = 2 for f = 0.5
T_∞ = T_s = 0.5
Amdahl’s law
S(n) = 1 / (1 − f + f / n)
Amdahl’s Law vs task graphs
• Both methods can be used for modelling the behaviour of workloads with regard to the hardware on which they run
• Scheduling decisions can be made to optimize speedup, energy, or a combination of both based on these models
• Task graphs are usually derivable by programmers who write workloads – could be very complicated for large workloads
• Users usually do not have access to task graphs
• The parallel fraction f can be obtained through experiments (see the sketch after this list)
• Task graphs can also be extracted from experiments (running a workload and monitoring certain system sensors)
• Which one to use in what situation is the user’s decision to make
• Models can be unified – see
http://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2018-211.pdf
• These all assume ideal hardware and software, with full scalability other than what’s limited by parallelism and parallel fraction
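As an added sketch (not from the slides) of how such an experiment can work: measure the run time on 1 core and on n cores, form the observed speedup, and invert Amdahl's Law to estimate f:

\[
S(n) = \frac{1}{1 - f + f/n}
\quad\Rightarrow\quad
f = \frac{1 - 1/S(n)}{1 - 1/n}.
\]

For example (illustrative numbers): if T_1 = 10 s and T_4 = 5 s, then S(4) = 2 and f = (1 − 1/2) / (1 − 1/4) = 2/3.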