Name: ___________________ ID: ____________
COMP326 Assignment 1 Fall 2021
Issued: September 20, 2021 Due: October 4, 2021
Submit electronically to Moodle. No extension will be granted.
1. Amdahl’s Law [18 marks]
A family {Pk | k ∈ ℝ} of programs has the following properties. When
program Pk is run on a sequential machine, portion A consumes ‘x’ seconds,
while portion B consumes ‘k * x’ seconds. On a parallel machine, portion A
has no speedup, while portion B speeds up by the number of processors.
a) What is the theoretical maximum speedup of program Pk?
b) What must the value of ‘k’ be in order that Pk speeds up by a factor of
512 on a parallel machine with 1,024 processors? What is the new time?
c) What must the value of ‘k’ be in order that Pk speeds up by a factor of
1,000 on a parallel machine with 1,024 processors? What is the new time?
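A minimal Python sketch of this model (the function names and the example
values are illustrative, not part of the problem statement):

    # Amdahl's-law model for Pk: portion A takes x seconds and does not
    # speed up; portion B takes k * x seconds and speeds up by p processors.
    def parallel_time(x, k, p):
        return x + (k * x) / p

    def speedup(x, k, p):
        sequential_time = x + k * x
        return sequential_time / parallel_time(x, k, p)

    # Example: P3 on 4 processors; note that x cancels out of the ratio.
    print(speedup(1.0, 3, 4))   # (1 + 3) / (1 + 3/4) = 2.2857...
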
2. Power and Performance I [18 marks]
There are two options to scale a multicore die's peak performance (ops/s) by
a _multiplicative_ factor of sigma: i) scale the number of cores by sigma,
or ii) scale the clock frequency by sigma (the voltage will scale by the
same amount).
In case i), the power will scale by sigma.
In case ii), the power will scale—let us say—by sigma^2.65.
We can scale clock frequency up or down. Or, we can increase or decrease
the number of (running) cores.
a) Show that core scaling is better, and clock scaling is worse, when sigma
> 1 (goal: high power efficiency, i.e., high GFLOP/s/W).
b) Show that clock scaling is better, and core scaling is worse, when sigma
< 1 (goal: high power efficiency, i.e., high GFLOP/s/W).
c) If your multicore die is fully utilized and too hot, should you scale
down voltage and frequency or reduce the number of cores? Explain.
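A minimal Python sketch of the two scaling options (the baseline values
'perf0' and 'power0' are illustrative placeholders):

    # Power efficiency (performance per watt) after scaling by sigma.
    def core_scaling_efficiency(perf0, power0, sigma):
        # Core count scaled by sigma: performance and power both scale by sigma.
        return (sigma * perf0) / (sigma * power0)

    def clock_scaling_efficiency(perf0, power0, sigma):
        # Frequency (and voltage) scaled by sigma: power scales by sigma**2.65.
        return (sigma * perf0) / (sigma ** 2.65 * power0)

    # Compare the two efficiencies above and below sigma = 1.
    for sigma in (2.0, 0.5):
        print(sigma,
              core_scaling_efficiency(100.0, 50.0, sigma),
              clock_scaling_efficiency(100.0, 50.0, sigma))
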
3. Tolerating Communication Latency [24 marks]
Consider CPU core 'alpha' dedicated to supplying input data to TPU 'beta',
where 'beta' is a tensor coprocessor. The CPU core has access to a large
CPU memory, but the TPU can only access its own 8-GB TPU memory. The TPU
processes one 4-GB block of data at a time. The CPU core takes 250 ns to
download 4 GB from CPU memory to TPU memory, and the TPU takes 950 ns to
process it. These are the only activities that consume time. There is
instantaneous hardware synchronization so that i) the CPU core can inform
the TPU that it has downloaded a block, and ii) the TPU can inform the
CPU core that it has processed the block, giving the CPU core the CPU
memory address of the next block to be downloaded at that time. The TPU
results are handed off to some entity by magic.
The timeline of the simplest case is therefore:
 cpu: 250 ns    tpu: 950 ns    cpu: 250 ns    tpu: 950 ns
|------------|---------------|------------|---------------| ...
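A minimal Python sketch of this serial schedule (the block count passed to
the function is an illustrative parameter):

    # Serial schedule: the CPU core downloads one 4-GB block (250 ns), then
    # the TPU processes it (950 ns), and only then does the next download
    # begin.
    CPU_NS, TPU_NS = 250, 950

    def utilizations(blocks):
        total_ns = blocks * (CPU_NS + TPU_NS)   # no overlap between the two
        return (blocks * CPU_NS) / total_ns, (blocks * TPU_NS) / total_ns

    print(utilizations(4))   # (CPU-core utilization, TPU utilization)
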
a) What is the utilization of CPU core 'alpha'? What is the utilization of
TPU 'beta'?
b) Suppose the CPU core knew all the successive block addresses, and their
proper order, in advance. How would you tolerate communication latency?
Asymptotically, what would the new TPU utilization be?
c) Suppose the TPU is organized as a pipeline. Drop the idea of 4-GB blocks;
assume the data is a sequence of, say, 1 billion floating-point numbers.
That is, some data portion (say, 7 sequential floating-point numbers) is
fetched from TPU memory, and processed by one TPU-internal hardware device
("box"). The results are passed to the next hardware device; and so on.
This assumes that data portions (here, 7 sequential floating-point numbers)
can be independently processed in stages. Can you think of a CPU-core
communication strategy that would improve the throughput of the TPU?
(Note that you have lost the previous TPU "I am done" signals, but can see
TPU memory).
4. The Memory-Latency Wall [15 marks]
A standard RISC 1.0 formula asserts that pipeline speedup is pipeline depth
divided by (1 + the average number of stall cycles per instruction). Consider
a five-stage pipeline in which 1/5 of instructions are loads. For simplicity,
identify the number of load-stall cycles per load with the load latency. When
the average load latency reaches _20 cycles_, all speedup from pipelining is
lost. Why? Because 5 / (1 + (1/5) * 20) = 5/5 = 1, for a speedup of 1.
Let us calculate the average time 'tav' to complete a memory reference, such
as a load, with all temporal quantities measured in _processor cycles_.
Let 't_c' and 't_m' be the D-cache and DRAM access times in processor cycles,
and let 'P' be the probability of a D-cache hit.
tav = P * t_c + (1 - P) * (t_m + t_c)
    = t_c + (1 - P) * t_m
Now, assume that t_c is always one cycle, i.e., t_c = 1. Therefore, we have:
tav = 1 + (1 - P) * t_m
In English, the average load latency is one plus the D-cache miss rate times
the DRAM latency. Again, both times are in processor cycles.
Assume a D-cache with a miss rate of 1%. Assume DRAM latency decreases by
a factor of 1.18 every year, while the processor clock cycle decreases by a
factor of 1.72 every year. If a DRAM access has a 200-cycle latency today,
after how many fractional, and how many integral, years will all pipeline
speedup be lost?
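A minimal Python sketch of this scenario, using the numbers given above
(the helper names and default arguments are illustrative):

    # Average load latency in processor cycles, with t_c = 1 and hit rate P.
    def t_av(P, t_m):
        return 1 + (1 - P) * t_m

    # RISC 1.0 speedup: depth / (1 + average stalls per instruction), where
    # 1/5 of instructions are loads and each load stalls for t_av cycles.
    def pipeline_speedup(t_m, depth=5, load_frac=1/5, P=0.99):
        return depth / (1 + load_frac * t_av(P, t_m))

    # DRAM latency in processor cycles after y years: the clock cycle shrinks
    # by 1.72x per year while DRAM latency shrinks by only 1.18x per year.
    def dram_latency_cycles(y, today=200.0):
        return today * (1.72 / 1.18) ** y

    print(pipeline_speedup(dram_latency_cycles(0)))   # today's speedup
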
5. The Memory-Bandwidth Wall [15 marks]
When a cache is used temporally, it acts as a bandwidth amplifier.
That is, the _effective_ operand bandwidth reaching the pipeline is the
actual operand bandwidth reaching the processor divided by the miss rate.
Suppose we must deliver an actual operand bandwidth of 'n' words/second to
the processor to sustain a floating-point performance of 'n/m' flops/second,
where 'm' is the miss rate of the D-cache. Typically, we need a new word
from either the cache or the memory for each flop. By assumption, 'm' is
always 0.01. Today, the actual operand bandwidth delivered to the processor
is 8 * 10^7 words/second and the peak floating-point performance is 2 * 10^9
flops/second. If peak floating-point performance increases by a factor of
1.82 every year and memory bandwidth increases by a factor of 1.13 every year,
after how many _fractional_ years will the memory bandwidth be _just
sufficient_ to sustain a floating-point performance equal to the peak
performance?
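A minimal Python sketch of this scenario, using the numbers given above
(the function names are illustrative):

    MISS_RATE = 0.01   # 'm', assumed constant

    # Actual operand bandwidth (words/s) reaching the processor after y years.
    def actual_bandwidth(y, today=8e7):
        return today * 1.13 ** y

    # Peak floating-point performance (flops/s) after y years.
    def peak_flops(y, today=2e9):
        return today * 1.82 ** y

    # Sustainable performance: the effective bandwidth is actual / miss rate,
    # and one operand word is needed per flop.
    def sustainable_flops(y):
        return actual_bandwidth(y) / MISS_RATE

    print(sustainable_flops(0), peak_flops(0))   # compare today's values
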
6. Multicore and the Memory Wall [10 marks]
Relative to each other, CPUs have few threads and enormous caches, while GPUs
have many threads and tiny caches. Generally speaking, we need to think
carefully about any hardware architecture in which many hardware threads
share a cache (usually, but not always, the last-level cache). GPUs are the
extreme example of such an architecture.
Suppose threads are cache friendly in the sense that each portion of a thread
accesses a small subset of memory, and that this subset changes relatively
slowly. Yet threads are normally interleaved with some frequency, rather
than being allowed to run to completion. The technical term for this slowly
changing subset of interesting data is the thread's _working set_.
Now suppose that the working set of a thread is roughly the same size as the
shared cache. Further suppose that the hardware normally schedules threads
much more rapidly than the speed with which their working sets evolve.
As a thought experiment, in which case below is a thread more likely to find
its data in cache?
Case i: We schedule many threads rapidly for short intervals, as described
above.
Case ii: We schedule one thread for a relatively long time, so that it owns
the cache for an interval roughly equal to one phase of the evolution of its
working set, i.e., an interval during which the working set is roughly
constant.