Name: ___________________ ID: ____________
COMP326 Assignment 1 Fall 2021
Issued: September 20, 2021 Due: October 4, 2021
Submit electronically to Moodle. No extension will be granted.
1. Amdahl’s Law [18 marks]
A family {Pk | k ∈ ℝ} of programs has the following properties. When
program Pk is run on a sequential machine, portion A consumes ‘x’ seconds,
while portion B consumes ‘k * x’ seconds. On a parallel machine, portion A
has no speedup, while portion B speeds up by the number of processors.
a) What is the theoretical maximum speedup of program Pk?
b) What must the value of ‘k’ be in order that Pk speeds up by a factor of
512 on a parallel machine with 1,024 processors? What is the new time?
c) What must the value of ‘k’ be in order that Pk speeds up by a factor of
1,000 on a parallel machine with 1,024 processors? What is the new time?
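A minimal Python sketch of this model (the function names and the example
values are illustrative, not part of the problem statement):

    # Amdahl's-law model for Pk: portion A takes x seconds and does not
    # speed up; portion B takes k * x seconds and speeds up by p processors.
    def parallel_time(x, k, p):
        return x + (k * x) / p

    def speedup(x, k, p):
        sequential_time = x + k * x
        return sequential_time / parallel_time(x, k, p)

    # Example: P3 on 4 processors; note that x cancels out of the ratio.
    print(speedup(1.0, 3, 4))   # (1 + 3) / (1 + 3/4) = 2.2857...
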
2. Power and Performance I [18 marks]
There are two options to scale a multicore die's peak performance (ops/s) by
a _multiplicative_ factor of sigma: i) scale the number of cores by sigma,
or ii) scale the clock frequency by sigma (the voltage will scale by the
same amount).
In case i), the power will scale by sigma.
In case ii), the power will scale—let us say—by sigma^2.65.
We can scale clock frequency up or down. Or, we can increase or decrease
the number of (running) cores.
a) Show that core scaling is better, and clock scaling is worse, when sigma
> 1 (goal: high power efficiency, i.e., high GFLOP/s/W).
b) Show that clock scaling is better, and core scaling is worse, when sigma
< 1 (goal: high power efficiency, i.e., high GFLOP/s/W).
c) If your multicore die is fully utilized and too hot, should you scale
down voltage and frequency or reduce the number of cores? Explain.
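A minimal Python sketch of the two scaling options (the baseline values
'perf0' and 'power0' are illustrative placeholders):

    # Power efficiency (performance per watt) after scaling by sigma.
    def core_scaling_efficiency(perf0, power0, sigma):
        # Core count scaled by sigma: performance and power both scale by sigma.
        return (sigma * perf0) / (sigma * power0)

    def clock_scaling_efficiency(perf0, power0, sigma):
        # Frequency (and voltage) scaled by sigma: power scales by sigma**2.65.
        return (sigma * perf0) / (sigma ** 2.65 * power0)

    # Compare the two efficiencies above and below sigma = 1.
    for sigma in (2.0, 0.5):
        print(sigma,
              core_scaling_efficiency(100.0, 50.0, sigma),
              clock_scaling_efficiency(100.0, 50.0, sigma))
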
3. Tolerating Communication Latency [24 marks]
Consider CPU core 'alpha' dedicated to supplying input data to TPU 'beta',
where 'beta' is a tensor coprocessor. The CPU core has access to a large
CPU memory, but the TPU can only access its own 8-GB TPU memory. The TPU
processes one 4-GB block of data at a time. The CPU core takes 250 ns to
download 4 GB from CPU memory to TPU memory, and the TPU takes 950 ns to
process it. These are the only activities that consume time. There is
instantaneous hardware synchronization so that i) the CPU core can inform
the TPU that it has downloaded a block, and ii) the TPU can inform the
CPU core that it has processed the block, giving the CPU core the CPU
memory address of the next block to be downloaded at that time. The TPU
results are handed off to some entity by magic.
The timeline of the simplest case is therefore:
 cpu: 250 ns    tpu: 950 ns    cpu: 250 ns    tpu: 950 ns
|------------|---------------|------------|---------------| ...
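A minimal Python sketch of this serial schedule (the block count passed to
the function is an illustrative parameter):

    # Serial schedule: the CPU core downloads one 4-GB block (250 ns), then
    # the TPU processes it (950 ns), and only then does the next download
    # begin.
    CPU_NS, TPU_NS = 250, 950

    def utilizations(blocks):
        total_ns = blocks * (CPU_NS + TPU_NS)   # no overlap between the two
        return (blocks * CPU_NS) / total_ns, (blocks * TPU_NS) / total_ns

    print(utilizations(4))   # (CPU-core utilization, TPU utilization)
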
a) What is the utilization of CPU core 'alpha'? What is the utilization of
TPU 'beta'?
b) Suppose the CPU core knew all the successive block addresses, and their
proper order, in advance. How would you tolerate communication latency?
Asymptotically, what would the new TPU utilization be?
c) Suppose the TPU is organized as a pipeline. Drop the idea of 4-GB blocks;
assume the data is a sequence of, say, 1 billion floating-point numbers.
That is, some data portion (say, 7 sequential floating-point numbers) is
fetched from TPU memory, and processed by one TPU-internal hardware device
("box"). The results are passed to the next hardware device; and so on.
This assumes that data portions (here, 7 sequential floating-point numbers)
can be independently processed in stages. Can you think of a CPU-core
communication strategy that would improve the throughput of the TPU?
(Note that you have lost the previous TPU "I am done" signals, but can see
TPU memory).
4. The Memory-Latency Wall [15 marks]
A standard RISC 1.0 formula asserts that pipeline speedup is pipeline depth
divided by (1 + the average number of stall cycles per instruction). Consider
a five-stage pipeline in which 1/5 of instructions are loads. For simplicity,
identify the number of load-stall cycles per load with the load latency. When
the average load latency reaches _20 cycles_, all speedup from pipelining is
lost. Why? Because 5 / (1 + (1/5) * 20) = 5/5 = 1, for a speedup of 1.
Let us calculate the average time 'tav' to complete a memory reference, such
as a load, with all temporal quantities measured in _processor cycles_.
Let 't_c' and 't_m' be the D-cache and DRAM access times in processor cycles,
and let 'P' be the probability of a D-cache hit.
tav = P * t_c + (1 - P) * (t_m + t_c)
    = t_c + (1 - P) * t_m
Now, assume that t_c is always one cycle, i.e., t_c = 1. Therefore, we have:
tav = 1 + (1 - P) * t_m
In English, the average load latency is one plus the D-cache miss rate times
the DRAM latency. Again, both times are in processor cycles.
Assume a D-cache with a miss rate of 1%. Assume DRAM latency decreases by
a factor of 1.18 every year, while the processor clock cycle decreases by a
factor of 1.72 every year. If a DRAM access has a 200-cycle latency today,
after how many fractional, and how many integral, years will all pipeline
speedup be lost?
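A minimal Python sketch of this scenario, using the numbers given above
(the helper names and default arguments are illustrative):

    # Average load latency in processor cycles, with t_c = 1 and hit rate P.
    def t_av(P, t_m):
        return 1 + (1 - P) * t_m

    # RISC 1.0 speedup: depth / (1 + average stalls per instruction), where
    # 1/5 of instructions are loads and each load stalls for t_av cycles.
    def pipeline_speedup(t_m, depth=5, load_frac=1/5, P=0.99):
        return depth / (1 + load_frac * t_av(P, t_m))

    # DRAM latency in processor cycles after y years: the clock cycle shrinks
    # by 1.72x per year while DRAM latency shrinks by only 1.18x per year.
    def dram_latency_cycles(y, today=200.0):
        return today * (1.72 / 1.18) ** y

    print(pipeline_speedup(dram_latency_cycles(0)))   # today's speedup
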
5. The Memory-Bandwidth Wall [15 marks]
When a cache is used temporally, it acts as a bandwidth amplifier.
That is, the _effective_ operand bandwidth reaching the pipeline is the
actual operand bandwidth reaching the processor divided by the miss rate.
Suppose we must deliver an actual operand bandwidth of 'n' words/second to
the processor to sustain a floating-point performance of 'n/m' flops/second,
where 'm' is the miss rate of the D-cache. Typically, we need a new word
from either the cache or the memory for each flop. By assumption, 'm' is
always 0.01. Today, the actual operand bandwidth delivered to the processor
is 8 * 10^7 words/second and the peak floating-point performance is 2 * 10^9
flops/second. If peak floating-point performance increases by a factor of
1.82 every year and memory bandwidth increases by a factor of 1.13 every year,
after how many _fractional_ years will the memory bandwidth be _just
sufficient_ to sustain a floating-point performance equal to the peak
performance?
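A minimal Python sketch of this scenario, using the numbers given above
(the function names are illustrative):

    MISS_RATE = 0.01   # 'm', assumed constant

    # Actual operand bandwidth (words/s) reaching the processor after y years.
    def actual_bandwidth(y, today=8e7):
        return today * 1.13 ** y

    # Peak floating-point performance (flops/s) after y years.
    def peak_flops(y, today=2e9):
        return today * 1.82 ** y

    # Sustainable performance: the effective bandwidth is actual / miss rate,
    # and one operand word is needed per flop.
    def sustainable_flops(y):
        return actual_bandwidth(y) / MISS_RATE

    print(sustainable_flops(0), peak_flops(0))   # compare today's values
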
6. Multicore and the Memory Wall [10 marks]
Relative to each other, CPUs have few threads and enormous caches, while GPUs
have many threads and tiny caches. Generally speaking, we need to think
carefully about any hardware architecture in which many hardware threads
share a cache (usually, but not always, the last-level cache). GPUs are the
extreme example of such an architecture.
Suppose threads are cache friendly in the sense that each portion of a thread
accesses a small subset of memory, and that this subset changes relatively
slowly. Yet threads are normally interleaved with some frequency, rather
than being allowed to run to completion. The technical term for this slowly
changing subset of interesting data is the thread's _working set_.
Now suppose that the working set of a thread is roughly the same size as the
shared cache. Further suppose that the hardware normally schedules threads
much more rapidly than the speed with which their working sets evolve.
As a thought experiment, in which case below is a thread more likely to find
its data in cache?
Case i: We schedule many threads rapidly for short intervals, as described
above.
Case ii: We schedule one thread for a relatively long time, so that it owns
the cache for an interval roughly equal to one phase of the evolution of its
working set, i.e., an interval during which the working set is roughly
constant.