SCHOOL OF ENGINEERING
DIGITAL SYSTEM DESIGN 4 ELEE10007
Exam Date: 03/05/2016 From and To: 09.30-11.30 Exam Diet: Apr/May 2016 Please read full instructions before commencing writing
Exam paper information
• Paper consists of 2 Sections
• Candidates to answer THREE questions
• Section A: (One question) Answer whole Section
• Section B: Answer TWO out of THREE questions
Special instructions
•
•
•
•
•
Students should assume reasonable values for any data not given in a question nor available on a datasheet, and should make any such assumptions clear on their script.
Students in any doubt as to the interpretation of the wording of a question, should make their own decision, and should state it clearly on their script.
Please write your name in the space indicated at the right hand side on the front cover of the answer book. Also enter you examination number in the appropriate space on the front cover.
Write ONLY your examination number on any extra sheets or worksheets used and firmly attach these to the answer book(s).
This examination will be marked anonymously.
Special items
• None
Convenor of Board of Examiners: Professor R Cheung
External Examiners: Professor P Mellor and Professor J Robinson
SECTION A
Question A1
a)
b)
Define the term RISC and explain, with examples, the four design
principles upon which it was constructed. (3)
c)
d)
e)
Describe the Von-Neumann bottleneck. Use a diagram to show the
physical origin of the bottleneck and, again using diagrams, describe
how the two Harvard architectures solve this problem. (5)
How can the bit (branch if-less-than) pseudoinstruction be
implemented using the MIPS core instruction set? (2)
The roofline model was introduced in the lectures as a method of characterising the performance of different processor systems undertaking a range of tasks (kernels) with different characteristics.
(i) Describe what aspects of an application or processor task are defined by the arithmetic intensity, and give an example of a
task with low arithmetic intensity. (2)
(ii) Referring to Figure A 1e, state which processor has the highest performance for a kernel which requires one word of data to be accessed (read or write) for every two floating point operations. (1)
(CONTINUED OVER)
The latencies for each of the five stages in a MIPS processor are given below.
Stage IF
ID
E X MEM WB
Latency (ps) 170
350
100
160 150
(i) Calculate the maximum clock frequency for a single-cycle processor that uses these stages.
(ii) Calculate the maximum clock frequency for a pipelined processor that uses these stages.
(iii) Calculate the relative performance of the pipelined processor as compared to the single-cycle processor, assuming that the number of instructions is much greater than the number of pipeline stages. Use diagrams to show your working.
(1)
(1)
(1)
(2)
(iv) If you could split one stage into two new stages (each with half the latency) which would you split, and what would be the new theoretical maximum clock frequency? Discuss any potential problems.
ELEE10007 Digital System Design 4 – May 2016
64
32
u
QJ
Ill
……..
Ill 16
0..
0
-‘
u..
(.!)
8
.c ttl c:
ttl
+J
~
2
QJ
4
__…,.. Processor A – & – Processor B
(iii) What aspect of the processor with the lower performance
should be improved for it to equal the processor with the higher performance, and by how much? (2)
1
0 1/8 1/4 1/2 1 2 4 8 16 32
Arithmetic Intensity: FLOPs/Byte Ratio
Figure A1e
ELEE10007 Digital System Design 4 – May 2016
SECTION B Question 81
a) Give two reasons why modern computers use a virtual memory
addressing system. (2)
b) For each of the two situations described below, state which form of locality this represents and describe how a cache memory is set up to exploit it.
(i) A loop in a programme which reads and writes a particular
variable during each cycle around the loop. (3)
(ii) A programme which sequentially accesses items in a data array structure. (3)
c) If a 16 KiB cache with 4-way set associativity has a block size of 32 words, define the following characteristics of the cache if the address is 32 bits wide:
(i) Number of blocks; (1)
(ii) Number of sets; (1)
(iii) Number of index bits in each address; (1) (iv) Number of tag bits in each address. (1)
d) If the cache described in 81 (c) is fully associative instead of being set associative:
(i) Describe the structure of the address with a suitable diagram; (2)
(ii) State which cache you would expect to have the lower hit-time,
and give a reason for your answer; (2)
(iii) If the block size is doubled, describe the effects you would expect
to observe on the miss rate and the hit time of the cache; (2)
(iv) Describe an appropriate method for deciding which block of data
in this fully associative cache should be replaced if a memory
access results in a miss at this level of the cache. (2)
ELEE10007 Digital System Design 4 – May 2016
Question 82 a)
b)
c)
Explain the key concept in parallel computing which is described and quantified in “Amdahl’s Law”. (2)
Two main forms for a parallel computing system were defined in this course: “message passing” and “shared memory”.
(i) With the aid of an appropriate diagram describe the key
components of each architecture. (6)
(ii) Describe a possible application for each type of parallel system. (2)
Multiprocessor systems with shared memory can have either uniform
(UMA) or non-uniform memory access (NUMA).
(i) Describe one advantage and one disadvantage of each type. (4)
An application with two threads can be run on a dual-processor system where each thread needs to access an entire array of 4096 words. For each of the two systems described below calculate the minimum number of cycles taken to run the application:
(ii) UMA, two processors with 1 GiB shared memory and memory
access latency of 35 cycles per word; (1)
(iii) NUMA, with two processors, each with local memory of 512 MiB
where local memory access time is 20 cycles/word, and remote memory access latency is 60 cycles per word. (3)
Imagine that you have been asked to design a new 2-core NUMA system with local memory latency of 15 cycles per word.
(iv) What is the maximum remote memory access latency required to
run the same application in fewer cycles than required by the
fastest system above? (2)
ELEE10007 Digital System Design 4 – May 2016
Question 83 a)
Consider a 32-bit MIPS architecture and assume the programme counter is currently at OxOOOOOOOO. Consider two instructions: An i-format instruction which uses PC-relative addressing; and a j-format instruction which uses pseudo-direct addressing.
(i) Explain, with the aid of diagrams how the two addressing
modes are physically implemented. Label the width of all
busses. (6)
(ii) Calculate the maximum forward address (in hexadecimal) that
can be reached for each of the two instructions. (2)
Describe how to design an arithmetic logic unit (ALU) which
implements overflow and zero-detection. (2)
An attractive feature of the simplified MIPS single-cycle architecture, that we have studied in-class, is that the main-controller can be implemented using purely combinational logic.
(i) Using the data from Table B3c, show with the help of a
diagram, how you would implement a single-cycle main
controller on a programmable logic array PLA using a sum of (6) products representation.
(ii) Explain, using diagrams, how this controller could easily be extended to provide the correct signals for a pipelined (4) processor. Assume all hazards are eliminated by the compiler.
b)
c)
Instruction
R-type lw
SW beq
Op5:0 )> ::a ::a )> CJ
s:s:s:)> CD CD CD c:
3330
c: CD CD c: DJ
(/) …,
C’)
cc cc (/)
C’)
0
…, en ::r
:2:
::a
:2: 0 ~
;:::+ …..
…, CD ;:::+ I l l CD a .
::a 0
CD cc
CD
…, :::J C’)
…..
000000 0 1 1 0 0 0 0 0 10 100011 1 1 0 1 0 0 1 1 00 101011 1 0 x 1 0 1 0 x 00 000100 0 0 x 0 1 0 0 x 01
Table B3c
END OF PAPER
ELEE10007 Digital System Design 4 – May 2016
“i:’: E
R and R
R[rt] = R[rs] + SignExtlmm
R[rt) = R[rs] + SignExtlmm
R[rd] = R[rs] + R[rt]
R[rd] = R[rs] & R[rt)
R[rt) = R[rs] & ZeroExtlmm
if(R[rs]=R[rt])
::l 0
addu
~
(!) “O ·;;;
E
0
t::
0 ..0
“O
~
r-i “E
•(xis eq, 1t, or le) (op is=,<, or<=) (y is 32, Jc, or Je)
MIPSReference Data CORE INSTRUCTION SET
0
FOR- NAME, MNEMONIC MAT
Branch On FP True b e ! t Fl Branch On FP False be! f Fl
OPERATION if(FPcond)PC=PC+4+BranchAddr (4) if(!FPcond)PC=PC+4+BranchAddr(4)
FOR- NAME, MNEMONIC MAT
Add add R Add Immediate addi
Add Imm. Unsigned addiu
OPERATION (in Verilog) = R[rs] + R[rt]
OPCODE /FUNCT (Hex)
0 / 20hcx
Shex
9hcx 0/21hcx 0 I 24hcx
Divide
Divide Unsigned FP Add Single
FP Add
Double
FP Compare Single FP Compare Double
d i v R d i vu R add.s FR
a d d . d FR c.<.s • FR cs.ct• FR
Lo=R[rs]/R[rt]; Hi=R[rs]%R[rt] Lo=R[rs]/R[rt]; Hi=R[rs)%R(rt) F(fd ]= F[fs] + F[ft) {F[fd],F[fd+l]l = {F[fs],F[fs+I]}
{F[ft),F(ft+l)} FPcond = (F[fs] op F[ft]l? I : 0
FPcond=({F[fs],F[fs+l]l op {F[ft],F[ft+I]})? I : 0
(6) +
Add Unsigned
And
And Immediate andi
Branch On Equal heq
Branch On Not Equal b n e Jump
Jump And Link Jump Register
j a I j r
R
F[fd]=F[fs] - F[ft]
{F(fd],F(fd+I]} = {F[fs],F[fs+I)} -
Load Byte Unsigned 1bu Load Halfword
lwcl
Ide!
{F[ft],F[ft+I)} F[rt]=M[R[rs]+SignExtlmm] F(rt]=M(R[rs]+SignExtlmm];
(2) (2)
(6)
(2) (2)
ell lhu u Unsigned
(!)
R[rd] =Hi
R[rd] =Lo
R[rd] = CR[rs]
{Hi,Lo} = R[rs] • R(rt)
{Hi,Lo} = R[rs] * R(rt]
R[rd] = R(rt) » > shamt M[R[rs]+SignExtlmm] = F(rt) M[R[rs]+SignExtlmm] = F(rt); M[R[rs]+SignExtlmm+4) = F(rt+I)
Load Linked 11
~ Load Upper Imm. lui 0.
(!) Load Word lw
R(rt) = M[R[rs]+SignExtlmm] R[rt) = {imm, 16’b0}
R(rt) = M[R[rs]+SignExtlmm] R[rd] = – (R[rs] IR[rt])
R[rd] = R[rs] IR(rt]
R(rt] = R[rs] IZeroExtlmm
R[rd] = (R[rs] < R[rt]l? I : 0
(2,7)
(2)
(3)
30hcx fhcx 23hcx 0 I 27hcx 0 / 25hcx dhcx
0 / 2ahcx ahcx
Multiply
Multiply Unsigned Shift Right Arith. Store FP Single Store FP
Double
m u l t R m u l t u R sra R swcl I
sdcl
"' Nor B
nor R
i::OrorR 0
·~
.,E .... (!)
0. 0.0
0
:i Q.,
Or Immediate ori
SetLessThan sit R
Set Less Than Imm. sl ti
R[rt] = (R[rs] < SignExtlmm)? I : 0 (2)
FLOATING-POINT INSTRUCTION FORMATS
I ft
21 20
Shift Left Logical Shift Right Logical
Store Byte
Store Conditional
Store Halfword
Store Word Subtract
Subtract Unsigned
R[rd] = R[rt] « shamt
R[rd] = R[rt] » shamt M[R[rs]+SignExtlmm](7:0) =
R[rt)(7:0) M[R[rs]+SignExtlmm] - R(rt]; R[rt) - (atomic)? I : 0
BASIC INSTRUCTION FORMATS
RII
IIII
funct
31
I
31
J
26 25
opcode I 26 25
rs
21 n
21 20
rt
161'
l61'i
11 Ill 6 5 immediate
Ssp 29 Sfp 30 Sra 31
opcode
rs rt rd
shamt
( I ) ( 1,2) (2)
(3)
(4) PC=PC+4+BranchAddr (4)
PC=JumpAddr (5)
R[3 l]=PC+S;PC=JumpAddr (5)
PC=R[rs]
R[rt]={24'b0,M[R[rs] +SignExtlmm](7:0) l (2)
R[rt]={ 16'b0,M[R[rs]
+SignExtlmm]( 15:0) l (2)
R[rd]
PC=PC+4+BranchAddr
FP Divide Single FP Divide Double
FP Multiply
div.s FR div.d FR mul . s FR mul.d FR sub. s FR sub.ct FR
F[fd] = F[fs] I F[ft]
{F[fd),F[fd+I]} = {F[fs],F[fs+I)} /
{F[ft],F[ft+l]l
if(R[rs]!=R[rt]l
F[fd] = F[fs] * F[ft]
{F[fd),F[fd+I]} = {F[fs],F[fs+I]} *
R(rt] = (R[rs] < SignExtlmm)
FR I opcode I fmt
fun ct I 0
SetLessThanImm. sltiu
i:: Unsigned ?I:0 26 "
I fs 16 15
fd
ca SetLessThanUnsig.sltu R R[rd]=(R[rs]
if(R[rs]<=R[rt]) P C = Label if(R[rs]>=R[rt]) P C = Label R[rd] =immediate
R[rd] = R[rs]
ARITHMETIC CORE INSTRUCTION SET
0 OPCODE / FMT/FT I FUNCT
(Hex) 11/8/1/- 11/8/0/-
0/–/–/la 0/–/–/lb 11110/–/0
11/11/–/0
11/10/–~I’ 11111/–~)’
11/10/–/3
11/11/–/3
11/10/–/2
11/11/–/2
11/10/–11
11/11/–/1
31/–/–/-
351–1–1-
01–1–110 01–1–112 10/0/–/0 0/–/–/18 0/–/–/19 01–1–13 391-1-1–
3d/–/–/–
0IOOhcx JI 2625 2120 1615 0
Single
FP Multiply
Double
FP Subtract Single
FP Subtract
Double
Load FP Single
Load FP
Double F(rt+I)=M(R[rs)+SignExtlmm+4] Move From Hi mfhi R
Move From Lo mflo R
Move From Control mfcO R
26 25
11 IO
The Constant Value 0 N.A. Assembler Temporary No
Values for Function Results
and Expression Evaluation
Argtunents No Temporaries No Saved Temporaries Yes Temporaries No Reserved for OS Kernel No Global Pointer Yes Stack Pointer Yes Frame Pointer Yes Return Address No
{F[ft],F[ft+l]l
No
opcode (31 :26)
( )
( ) funct
(5:0) s l l
( )
exa- A Char-
Exponent 0
0
I to MAX-I
Fraction 0 ;tO
Object ±0
± Denorm
j srl
jal sra
beq sllv
bne absf blez srlv movf bgtz srav negf addi jr
addiu jalr slti. movz sltiu movn andi syscall ori break xori
lui sync mfhi (2) mthi
mflo movzf mtlo movnf
anything ±Fl. Pt. Num MAX 0 ±~
sqrt.
MAX ;tO NaN S.P. MAX – 255, D.P. MAX – 2047
mult
multu y div z divu [
lb add
lh addu a lwl sub b lw subu
!bu and cvt.w.
lhu or e lwr xor f
W ord
Halfword I Halfword
Byte IByte IByte IByte
I 2 .l 4 5 6 7
Value ofthree least significant bits ofbyte address (Big Endian)
EXCEPTION CONTROL REGISTERS: CAUSE AND STATUS
nor g
sh JI15 sh
swl sit
SWsltu
m swr n cache 0 11 tge p lwcl tgcu q
lwc2 tit pref tltu
teq c.olt.
Ide! c.ultf u ldc2 tne c.olef v
c . u l e f w SC c.sf. x
EXCEPTION CODES
Number Name Cause of Exception
Number Name
Cause of Exception Breakpoint Exception Reserved Instruction Exception Coprocessor Unimplemented Arithmetic Overflow Exception
Trap
Floating Point Exceptmn
swcl c. nglef swc2 c. seqf c. nglf
c. lt. sdcl c.ngef
(I) opcode(31:26) = 0
(2) opcode(31:26) = 171en (llhexl; if fmt(25:21)=1610n (IOhexlf= s (single);
iffmt(25:21)=1710n (I lhex)f= d (double)
13 Syscall Exception 15
II 1101
sdc2 c.lef 111110 62 c.ngtf II 1111 63
PRE- FIX Kilo- Mega- Giga- Tera-
PRE- PRE- FIX SIZE FIX Peta- 10·3 milli-
Exa- 10·6 micro- Zetta- 10·9 nano- Votta- 10·12 pico-
PRE- FIX femto- atto- zepto- yocto-
61
SIZE lOl, 210
106, 220 109, 230 1012. 240
SIZE 1015, 250
1018, 260 1021. 210 1024. 2so
SIZE 10-15
10-18 10-21 10-24
c 125 7d > 126 7e
SIZE PREFIXES (10″ for Disk, Communication; 2• for Memory)
? 127 7f DEL
M N 0
Q
R s
u
v
w
Stack
i
DynalcData Static Data
Text
Reserved
…
-Amument6 Argument 5
Higher Memory Addresses
Stack Grows
i
Lower Memory Addresses
y
The symbol for each prefix 1s JUSt us first letter, except μ 1s used for micro. Copyright 2009 by Elsevier, Inc., All rights reserved. From Patterson and Hennessy, Complller Organi=ation and Design, 4th ed.
IEEE 754 FLOATING-POINT STANDARD
(-l)s x (I+ Fraction) x 2