Name: ___________________ ID: ____________
COMP326 Assignment 2 Fall 2021
Issued: October 4, 2021 Due: October 25, 2021
Copyright By PowCoder代写 加微信 powcoder
Submit electronically to Moodle. No extension will be granted.
1. [24 marks] In-Order Superscalar Pipeline Plus Its Abstraction
Consider a four-lane, abstract parallel pipeline, which models a four-lane
RISC 1.0 _superscalar_ pipeline. In the latter, we lay down, on the same chip,
four one-lane ‘fdxmw’ pipelines parallel to each other. This gives us a
concurrency of 5 in time, and a concurrency of 4 in space.
+—+—+—+—+—+
b/w = 1 | | | | | | b/w = 1 Two of the four sequential pipelines
+—+—+—+—+—+ are shown.
… … Operations move to the right.
+—+—+—+—+—+
b/w = 1 | | | | | | b/w = 1
+—+—+—+—+—+
a) Suppose that this computer is running four infinite-length, _independent_
threads (e.g., behavior at equilibrium of a fully data-parallel program).
Suppose, moreover, that there are no intrathread stalls from other sources.
In this case, what is the speedup of the four-lane parallel pipeline over the
one-lane, _unpipelined_, five-workstation sequential workbench?
b) By what factor must the input bandwidth be increased for a four-lane
superscalar pipeline compared to a one-lane scalar pipeline? By what
factor must the output bandwidth be increased?
c) Synchronization cost is proportional to register-file size. Register-file
size is chosen to allow to allow threads to build up _thread state_ while
running. In a parallel pipeline, multiple threads contribute to building up
_processor state_. In part a) above, how does the thread state that must be
saved for synchronization change when we move from i) having one program
counter control the parallel pipeline to ii) having four program counters
control the pipeline? Hint: could you partition the register file? Fetching
multiple operations at the same time does not mean that they are part of some
larger operation.
d) Continuing c), with four program counters, when an individual thread blocks,
must the whole processor block? Explain.
e) Return to conventional single-program-counter processors. There may be
_intrathread_ stalls. Data-parallel program P4 has the following average
number of stalls per instruction, respectively, in each of the four lanes:
<0.15, 0.24, 0.12, 0.27>. What is the speedup of the parallel pipeline on
P4 over the unpipelined (stall-free) sequential workbench (as in a))?
f) Drop the restriction to data-parallel programs. Now, threads are likely to
communicate and/or synchronize with each other. Unlike cores, boxes cannot
be turned off. How does the power efficiency (e.g., GFs/s/W) of running four
threads of a general program compare to running four threads of a data-parallel
program? Note: Threads are created by decomposing programs.
2. [25 marks] In-Order Scalar Pipeline
+-+ +-+ +-+ +-+
| | | | | | | |
+-+ +-+ +-+ +-+
f/d d/x x/m m/w
loop: lw r1,0(r2)
addi r1,r1,1
sw r1,0(r2)
addi r2,r2,4
sub r4,r3,r2
bnez r4,loop
a) How many register-mediated flow dependences are there in this program?
Make a list. Are there data dependences in this program that are not register
mediated? ___ (yes/no) Make a list. Indicate the types of any such data
dependences.
b) Draw the space-time diagram for the first iteration of this loop. Include
all six instructions.
c) In a 5-stage pipeline, the longest stage requires 0.8 ps for its compute
portion, and 0.1 ps for its save-to-latch portion. What is the clock-cycle
time of this 5-stage pipeline? If a 10-stage pipeline is produced by
splitting all five stages in half, what is the cycle time of the 10-stage
pipeline? Note: you cannot split the save-to-latch time.
d) The 10-stage pipeline has some subtleties. First, each pair of boxes is a
mini pipeline. A mini pipeline must finish computing a complete result before
forwarding it to some other mini pipeline. Second, if the mini pipeline is
Draw the space-time diagram for the first iteration of this loop executing on
the 10-stage pipeline. Use of “dot” notation is strongly encouraged.
3. [17 marks] Loop Timing I
Consider the following code fragment:
loop: lw r1,0(r3)
lw r2,0(r4)
add r1,r1,r2
sw r1,0(r3)
addi r3,r3,4
addi r4,r4,4
sub r6,r5,r3
bnez r6,loop
The loop iterates 90 times. Draw a space-time diagram of this code.
Calculate the total execution time of the loop. Use predict not taken.
4. [18 marks] Loop Timing II
Consider the following code fragment:
loop: fld f0,0(r2)
fld f2,0(r3)
fmul f4,f0,f2
std f0,0(r2)
fsub f4,f4,f0
fadd f4,f4,f4
addi r2,r2,8
addi r3,r3,8
sub r4,r5,r3
bnez r4,loop
‘fsub’ and ‘fadd’ have 3 x-boxes. ‘fmul’ has 4 x-boxes. Functional units are
fully pipelined. The loop iterates 90 times. Draw the space-time diagram of
this code. Calculate the total execution time of the loop. Use predict not
5. [16 marks] Pipeline Boxes and Pipeline Latches
+————————————————————————-+ P
|
| | o
|
| +-+ +-+ +-+ +-+ | e
|
| +-+ +-+ +-+ +-+ | s
| f/d d/x x/m m/w | o
+————————————————————————-+ r
fld f0,0(r1) Memory references have two m-boxes.
fld f2,0(r2) Floating-point multiplies have four x-boxes.
fmul f4,f0,f6 Ignore integer operands.
fmul f8,f2,f4
fsd f8,0(r1)
a) In the order
does the x1-box in the third instruction receive its two operands at the start
of the cycle in which it computes? _____ _____
b) In the order
does the x1-box in the fourth instruction receive its two operands at the
start of the cycle in which it computes? _____ _____
c) From which latch does the m1-box in the fifth instruction receive its
single floating-point operand at the start of the cycle in which it computes?
d) Indicate the number of stalls in the third, fourth, and fifth instructions.
Show the space-time diagram. _____ _____ _____
_______________________________________________
comp326-f21 mailing list
https://mailman.encs.concordia.ca/mailman/listinfo/comp326-f21
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com