
Computer architecture: parallelization
Dr Fei Xia and Dr Alex Bystrov

Why Parallel systems?

• Use multiple processors in parallel to compute faster than a single processor
• Example parallel computing systems:
– Cluster computers (like supercomputers) contain multiple (often a few hundred) PCs connected by a high-speed network – each PC has one or more processors
– Modern high-performance workstations have multiple (up to a dozen or more) processors
– Chip multiprocessors (CMP) contain multiple cores on a single chip [check your handheld phones]
• State-of-the-art multicore systems:
– AMD Opteron (4/8 64-bit x86 cores, L3 cache), Intel Nehalem (4/8 64-bit cores, L3 cache), Intel's 80-core Teraflops Research Processor, Intel Xeon Phi (61 cores × 4 threads)
– IBM Power5/6 processors with dual cores, Sun UltraSPARC T2 with eight cores, etc.
– ARM big.LITTLE (in your smartphones in 2+4, 4+4, 2+2+4, etc. configurations), Intel Xeon Phi (61-core co-processor)
• Traditionally driven by performance needs; energy consumption is an emerging concern

Energy/power advantage
• Dynamic power of MOSFET circuits
P = A·F·V² (from first principles – can you derive this?)
If we run double the number of cores (parallelize by 2) we may get the same performance with
A′ = 2A, F′ = F/2, but also V′ < V
Therefore we have P′ = A′·F′·V′² = A·F·V′² < P for the same execution speed – the same workload will complete in the same amount of time
• In other words, parallelization can help improve performance, or power, or both together
– https://ieeexplore.ieee.org/document/7999145
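A quick numeric check of this argument is easy to write. The sketch below is illustrative only: the values chosen for A, F and the reduced voltage V′ are hypothetical (not from the slides), picked simply to show the P′ < P conclusion.

    #include <stdio.h>

    /* Dynamic power from the slide's formula: P = A * F * V^2,
       where A lumps together switched capacitance and activity. */
    static double dynamic_power(double a, double f, double v) {
        return a * f * v * v;
    }

    int main(void) {
        /* Hypothetical baseline: one core at frequency F and voltage V. */
        double A = 1.0, F = 2.0e9, V = 1.0;
        /* Parallelized by 2: twice the cores, half the frequency each,
           which in turn allows a lower supply voltage (assumed 0.8 here). */
        double V2 = 0.8;

        double p_before = dynamic_power(A, F, V);              /* P  = A * F * V^2       */
        double p_after  = dynamic_power(2.0 * A, F / 2.0, V2); /* P' = 2A * (F/2) * V'^2 */

        printf("P  = %g (arbitrary units)\n", p_before);       /* 2.0e9              */
        printf("P' = %g (arbitrary units)\n", p_after);        /* 1.28e9, so P' < P  */
        return 0;
    }

The advantage rests entirely on V′ < V: with V′ = V the two configurations would dissipate exactly the same dynamic power, since 2A·(F/2)·V² = A·F·V².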
Speedup
• How much does the speed of a system improve after an attempt at improving the system?
– A proportional concept: speed of the system after the improvement divided by speed of the system before the improvement
Speedup_improvement = Speed_after improvement / Speed_before improvement = Time_before improvement / Time_after improvement ∈ [1, ∞)
– "Improvement" implies that we have better speed after than before
• Improvements may include
– Increase operating frequency (e.g. overclocking)
– Use an accelerator (e.g. graphics card)
– Use pipelining (e.g. Intel Hyper-Threading)
– Use multiple/more cores
– Change a hard disk drive to a solid state drive
– Use cache memory
– But do they automatically guarantee Speedup ≥ 1?
• More generally, for any change to the system – speed after the change divided by speed before the change:
Speedup_change = Speed_after / Speed_before = Time_before / Time_after ∈ [0, ∞)
– A general modification does not guarantee improvement – none of the changes listed above automatically guarantees Speedup ≥ 1!

The parallelizability of software
• "Standard" matrix addition code found in books:

    for (c = 0; c < m; c++) {
        for (d = 0; d < n; d++) {
            sum[c][d] = first[c][d] + second[c][d];
        }
    }

• The additions themselves are fully parallelizable (zero cross-dependency between different additions in the same step)
• But the code is fully sequential! It does not take advantage of multi-core hardware (see the sketch below)
• Almost all matrix operations are highly parallelizable – this is why GPUs and NPUs are highly parallel machines
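The sequential loop above can be spread across cores with very little change. The sketch below is one possible way to do it, assuming OpenMP (the slides do not prescribe any particular framework); the fixed sizes M and N are only for illustration.

    #include <omp.h>

    #define M 512
    #define N 512

    /* Element-wise matrix addition. Every sum[c][d] depends only on
       first[c][d] and second[c][d], so the M*N additions are independent
       and can be divided among the available cores with no synchronisation. */
    void matrix_add(double sum[M][N], double first[M][N], double second[M][N])
    {
        #pragma omp parallel for collapse(2)
        for (int c = 0; c < M; c++) {
            for (int d = 0; d < N; d++) {
                sum[c][d] = first[c][d] + second[c][d];
            }
        }
    }

Compiled with OpenMP enabled (e.g. gcc -fopenmp), the pragma distributes the M×N iterations over the cores; without it, the function degenerates to the sequential textbook version above.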
Parallelism
• Parallelism indicates how parallelizable a workload/program/job is
• This can be illustrated using task graphs
– Tasks are represented by nodes (vertices) – circles here; they could be rectangles or other shapes
– Task dependencies are represented by directed arcs (edges) – an arc connects two tasks: when the first completes, the second may start
– The general direction of all arcs indicates the direction of time progression – from top to bottom in this case
– This type of graph is known as a DAG – directed acyclic graph – a task graph does not have loops, so all loops must be unfolded to generate a task graph
– Task graphs do not have to use circles or go top-down
[Example task graph used below: step 1 runs task 1 (p = 1), step 2 runs tasks 2–3 (p = 2), step 3 runs tasks 4–6 (p = 3), step 4 runs tasks 7–10 (p = 4), and steps 5 and 6 run tasks 11 and 12 (p = 1 each); time runs top to bottom]
• Each step of a task graph shows the workload's instantaneous parallelism in that step
– Intuitively, instantaneous parallelism is how many independent threads a workload has at a time step, which measures how parallelizable the workload is at that step. For instance, you don't want to give a workload 4 cores if its parallelism is only 2 in some time step, because you'd be wasting 2 cores. On the other hand, if you only have a single core available in that time step, the parallelism of 2 cannot be exploited.
– We normally assume that a task is an indivisible thread – an atomic element of a workload. We also assume that each task takes the same time to run on a core (for ease of reasoning; this is not strictly necessary).
• Parallelism – the maths
– Parallelism is the speedup of a workload run on an infinite number of cores, compared to running it on one core: p = T_1 / T_∞. It is therefore the inherent parallelizability of the software, regardless of hardware.
– Because extra cores will simply be wasted, we also have S(n) ≤ p – giving the workload p cores already achieves the same speedup as an infinite number of cores. For instance, at time step 3 of the example graph, an infinite number of cores would give a speedup of 3 for that step, which can also be obtained with just 3 cores.
– The overall (or average) parallelism of a workload can be derived by averaging the instantaneous parallelisms across all of its steps, but there is a much easier way.
– The 'work' of a workload is the sum of all of its tasks – assuming for convenience that each task takes exactly the same amount of time to run – hence work = T_1 = 12 here
– The 'span' of a workload is the number of steps in its task graph – again assuming equal task sizes in terms of time – hence span = T_∞ = 6 here
– Dividing work by span gives the overall parallelism of a workload: T_1 / T_∞ = 12/6 = 2 for this example (see the short sketch below)
• Influence of hardware
– You may not have enough hardware to take full advantage of the parallelism of a piece of software. For instance, you need 4 cores to fully exploit p_max = 4 in the example graph, but what if you only have 3 cores?
– With 3 cores, step 4 cannot be fully parallelized as drawn in the task graph. Usually you need to modify the task graph to reflect the hardware reality and scheduling constraints.
– Scheduling refers to task-to-core mapping. This is usually done in system software, without the knowledge of the user, but it can also be intentionally controlled in some operating systems.
– Task graph modification to fit the workload onto 3 cores: the best (why?) option is to delay task 7 until the next step. This does not modify the logical relations within the task graph – causality is fully preserved. Now, with p_max = 3, the workload can be executed on 3 cores.
– Interestingly, neither work nor span changes from this modification – the workload still has the same overall parallelism of 2! In other words, with 3 cores you can achieve the same speedup on this workload as with 4 cores, given the right task graph modification. The relative speedup obtained by moving from 3 to 4 cores is S(4)/S(3) = 1.
– Homework: what happens if your system only has 2 cores?
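The work/span arithmetic above is easy to mechanise. A minimal sketch, assuming unit-time tasks and describing the example graph only by the number of tasks in each step (1, 2, 3, 4, 1, 1, as read off the figure):

    #include <stdio.h>

    int main(void) {
        /* Tasks per step of the example task graph: 12 tasks over 6 steps. */
        int tasks_per_step[] = {1, 2, 3, 4, 1, 1};
        int span = sizeof tasks_per_step / sizeof tasks_per_step[0]; /* T_inf: number of steps */

        int work = 0;   /* T_1: total number of tasks (time on a single core) */
        int p_max = 0;  /* largest instantaneous parallelism                  */
        for (int i = 0; i < span; i++) {
            work += tasks_per_step[i];
            if (tasks_per_step[i] > p_max)
                p_max = tasks_per_step[i];
        }

        printf("work T_1 = %d, span T_inf = %d\n", work, span);          /* 12 and 6 */
        printf("parallelism = %.1f, p_max = %d\n",
               (double)work / span, p_max);                              /* 2.0 and 4 */
        return 0;
    }

Changing the per-step counts to those of the modified graph (1, 2, 3, 3, 2, 1) leaves work and span unchanged, which is exactly why the overall parallelism stays at 2 after delaying task 7.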
Software parallelizability – an alternative view
• Amdahl's law
– For studying parallelizability in the abstract: what if you do not know the structure of the software and its inner workings, and a task graph is not available?
– Gene Amdahl, in the 1960s, proposed an abstract view of a software workload as consisting of a fully parallelizable part and a fully non-parallelizable part. In other words, a piece of software is regarded as one part with p = 1 and another part with p = ∞.
– The time taken by executing such a workload on a single core can be divided into two parts, that taken by the sequential part and that taken by the parallel part: T_1 = T_s + T_p
– Executing such a workload on n cores means that in the total time T_n, T_s stays the same, as you use one of the cores to execute the sequential part, leaving the other n−1 cores idle. However, the parallel part will take time T_p/n, as you get to use all n cores (because p = ∞, you can fill all the cores with no idling).
– The speedup achieved by moving from one core to n cores is then
S(n) = T_1 / T_n = (T_s + T_p) / (T_s + T_p/n)

Amdahl's law
• An important parameter for Amdahl's Law is what's known as the 'parallel fraction'
• This is the fraction of the workload that is parallel, defined as f = T_p / T_1 – the fraction of time taken by the parallel part of the workload
• Amdahl's Law says that the speedup, in terms of the parallel fraction, is then
S(n) = T_1 / T_n = (T_s + T_p) / (T_s + T_p/n) = 1 / ((1 − f) + f/n)
• The parallel part achieves linear speedup with the number of cores, and the sequential part stays sequential
• Typical scheduling decision: if you have a workload with small f and another one with large f, give the latter more cores

• Amdahl's Law examples
– Fixed workload, 50% sequential part (f = 0.5); on a single core it takes 1 unit of time to complete
– With two cores: the parallelizable part is distributed between the two cores, so the total time is 0.5 + 0.5/2 = 0.75 and S(2) = 1/0.75 ≈ 1.333
– With three cores: the total time is 0.5 + 0.5/3 ≈ 0.667 and S(3) = 1/(1 − 0.5 + 0.5/3) = 1.5
– With an infinite number of cores: S(∞) = 1/(1 − f + f/∞) = 1/(1 − f) = 2, and the total time is just T_∞ = T_s = 0.5 (these numbers are checked in the sketch at the end of these notes)

Amdahl's Law vs task graphs
• Both methods can be used for modelling the behaviour of workloads with regard to the hardware on which they run
• Scheduling decisions can be made to optimize speedup, energy, or a combination of both, based on these models
• Task graphs are usually derivable by the programmers who write workloads – this can be very complicated for large workloads
• Users usually do not have access to task graphs
• The parallel fraction f can be obtained through experiments
• Task graphs can also be extracted from experiments (running a workload and monitoring certain system sensors)
• Which one to use in what situation is the user's decision to make
• The models can be unified – see http://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2018-211.pdf
• These all assume ideal hardware and software, with full scalability other than what is limited by parallelism and the parallel fraction
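Finally, the Amdahl speedup formula above is simple to verify numerically. The sketch below reproduces the worked example (f = 0.5) for 2, 3 and a very large number of cores:

    #include <stdio.h>

    /* Amdahl's law: S(n) = 1 / ((1 - f) + f / n), f being the parallel fraction. */
    static double amdahl_speedup(double f, double n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        double f = 0.5;  /* 50% of the workload is parallelizable */
        printf("S(2)   = %.3f\n", amdahl_speedup(f, 2.0));   /* 1.333 */
        printf("S(3)   = %.3f\n", amdahl_speedup(f, 3.0));   /* 1.500 */
        printf("S(1e9) = %.3f\n", amdahl_speedup(f, 1e9));   /* 2.000, the 1/(1-f) limit */
        return 0;
    }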