Your University No.:
Date:
Brand & Type of Calculator:
香港大學
THE UNIVERSITY OF HONG KONG
Bachelor of Engineering
Faculty of Engineering
Department of Electrical & Electronic Engineering
ELEC3441 Computer Architecture
Final Examination
7 May, 2019 Time: 2:30pm – 5:30pm
• Answer ALL questions. You must submit this question paper.
• Write your answers on the question paper.
• Do NOT use a separate answer book.
• In each question, space is provided for your answer. Your answer should be concise and to the point.
• The amount of space provided in each question does NOT necessarily reflect the amount of information required in your answer.
• State clearly any assumptions that you have made.

Use of Electronic Calculators:
Only approved calculators as announced by the Examinations Secretary can be used in this examination. It is candidates' responsibility to ensure that their calculator operates satisfactorily, and candidates must record the name and type of the calculator used on the front page of the examination script.

For official use:

Q        Marks
1        /30
2        /32
3        /32
4        /30
5        /26
Question 1 Short Questions [30 pts]
Part(a) True-False Questions [20 pts]
(a) SRAM requires periodic refreshing to retain its content.
(b) The overall CPI of a quad-core processor is always 0.25.
(c) The size of the issue window determines the number of instructions that may be issued in one cycle.
(d) Increasing page size increases the penalty of a page fault.
(e) In a processor with infinite hardware but no register renaming, the number of registers defined in an ISA determines the maximum number of concurrently executing instructions.
(f) In an inclusive 2-level cache, there will never be any read miss on L1 that also misses in L2.
(g) The number of 2nd-level page tables used by a process cannot be determined from the program file before it gets executed.
(h) Running the same program on 2 processors that implement the exact same ISA always results in the same average CPI.
(i) A pipelined processor without a full forwarding path will need to have a mechanism to stall the pipeline due to data hazards.
(j) Performance of a single-cycle processor will not improve by introducing a branch predictor.
For each statement, circle T or F.
Part(b) Iron Law [10 pts]
When running on processor A, a program P executes the following numbers of instructions:
Class                     ALU    Load/Store    Branch/Jumps
Cycles per instruction      1            10               5
# instructions in P       100           200             150

1. What is the average CPI of program P when running on processor A?
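For reference, the Iron Law relates these quantities in the usual way (IC_c and CPI_c denote the instruction count and cycles-per-instruction of class c):

    T_{exec} = IC_{total} \times CPI_{avg} \times T_{clk}, \quad CPI_{avg} = \sum_c (IC_c \times CPI_c) / \sum_c IC_c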
2. Processor B implements an extended ISA of processor A. Processor B supports all the instructions of A, and also supports an additional set of accelerated instruction extensions. The same program P can thus be implemented differently as follows:
Class                     ALU    Load/Store    Branch/Jumps    Extension
Cycles per instruction      1            15               8            1
# instructions in P       100            50              50          200
Also, processor A runs at 1 GHz while processor B runs at 800 MHz. Comparing the two processors when running program P, which one has higher performance, and by how much? If you believe it is impossible to perform a fair comparison between A and B, explain why it is the case.
3. You are allowed to improve the speed of only 1 of the three instruction types (ALU, Load/Store, Branch/Jump) of Processor A. Assuming the clock frequency remains unchanged, what is the theoretical maximum speedup possible for program P?
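One standard bound here is Amdahl's Law, assuming that in the limit the improved class contributes negligible time:

    Speedup_{max} = 1 / (1 - f)

where f is the fraction of processor A's total cycle count spent executing the improved instruction class.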
Maximum speedup:
Question 2 TLB and Cache [32 pts]
You are designing a paged memory system for a high-performance 16-bit embedded processor. The initial virtual memory subsystem design has the following characteristics:
• 16-bit virtual address
• 32-bit physical address, with 4 GiB physical memory always installed
• 4 KiB page size
• 8-entry, 2-way set associative TLB (i.e. 4 rows with 2 sets each), true LRU
• A single linear page table for each user process
• Page fault penalty: 5000 cycles
• On TLB hit and cache hit, combined TLB and cache access takes 1 cycle
• Page tables are not cached
• Single-level cache: physically tagged, direct map, 8-word line size, 1 MiB capacity, cache hit time: 1 cycle
• Cache miss penalty: 200 cycles
Part(a)[10 pts] Answer the following:
1 Number of bits needed for a virtual page number
2 Number of bits needed for a physical page number
3 Number of bits needed for page offset
4 Number of lines in the cache
5 Size of a tag for each line of the cache
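For reference, the standard relations between these quantities, assuming byte-addressed memory and the parameters listed above:

    page offset bits  = \log_2(page size in bytes)
    VPN bits          = (virtual address bits) - (page offset bits)
    PPN bits          = (physical address bits) - (page offset bits)
    number of lines   = (cache capacity) / (line size in bytes)
    tag bits          = (physical address bits) - \log_2(number of lines) - \log_2(line size in bytes)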
Part(b)[3 pts]
What is the TLB reach of the system?
Part(c)[5 pts]
Ignore cache access timing for this part (i.e. assume all cache accesses result in a hit regardless of the virtual translation). If a memory access results in a TLB hit, then the memory access time is 1 cycle. To fetch an entry from the page table to load into the TLB takes 200 cycles. To fetch an entry from the hard disk into the page table takes 5000 cycles.
After profiling a program B, you found that out of all the memory accesses, 5% result in a TLB miss. Among the misses, 1% result in a page fault. As a result of these address translations, what is the average memory access time (in cycles) for B?
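One common way to set up this calculation, assuming each miss penalty simply adds to the 1-cycle hit time described above:

    AMAT = t_{hit} + r_{TLB\,miss} \times ( t_{page\,table} + r_{fault} \times t_{page\,fault} )

where r_{TLB miss} is the TLB miss rate over all accesses, r_{fault} is the page-fault rate among TLB misses, t_{page table} = 200 cycles, and t_{page fault} = 5000 cycles.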
Part(d) [5 pts]
You found that in general the data cache has a miss rate of 5%. Your processor always completes address translations before accessing the data cache. That is, the system handles any TLB misses or page faults before accessing the cache. If a cache miss occurs, the penalty is added on top of the time needed to translate an address.
Given this access mechanism, what is the average memory access time (AMAT) of the system, which includes both the time needed for address translation and cache access?
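Because translation always completes before the cache is accessed and a miss penalty is simply added on top, one way to combine the pieces (assuming the Part(c) figure already includes the 1-cycle access on a hit) is:

    AMAT_{total} = AMAT_{translation} + r_{cache\,miss} \times t_{cache\,miss}

with r_{cache miss} = 5% and t_{cache miss} = 200 cycles here.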
Part(e)[9 pts]
You now have an opportunity to redesign the TLB with 2 new possible designs. Together with the original TLB, you now have 3 design choices:
(i) 16-entry direct map TLB
(ii) 4-entry fully-associative TLB
(iii) (original) 2-way set associative TLB with 4 rows (8 entries in total)
In all cases, the physically-tagged cache remains the same.
You do not know what programs will be running on this processor, but it is confirmed that the programs running on this processor generally make repeated accesses to all the addressable memory. Also, most of the time there are at most 3 processes running at the same time.

Now, considering both the access time for the TLB and the cache, out of the 3 TLB designs (2 new and the original), which one would likely result in the best performance in general? Give examples and quantitative measurements (based on the size and organization of the TLB, the cache, hit/miss timing, etc.) to support your answer.
Question 3 Vector Extension [32 pts]
You are working with a new processor with a vector extension unit. The vector extension unit provides SIMD support for vector operations on a set of dedicated 512-bit registers:
• 16 SIMD registers, 512 bits wide, named xr0 to xr15
• Supports SIMD operations on 16 single-precision floating-point numbers in each cycle
• Special instruction to load/store data from main memory to/from SIMD registers
• Special instruction to pass 32-bit values to/from the general purpose register file
Part(a)[9 pts]
The following code forms the performance bottleneck of your application:
for (i = 0; i < N; i++) {
    A[i] = A[i] * B[i] + C[i];
}
# x5 = N
I1   loop: fld   f0, 0(x2)    # load A[i]
I2         fld   f1, 0(x3)    # load B[i]
I3         fmult f0, f0, f1
I4         fld   f2, 0(x4)    # load C[i]
I5         fadd  f0, f0, f2
I6         addi  x1, x1, 1    # incr i
I7         addi  x2, x2, 4
I8         addi  x3, x3, 4
I9         addi  x4, x4, 4
I10        bne   x1, x5, loop
Ignore effects from the pipeline and cache in this part of the question. Assume fmult takes 10 cycles, fadd takes 5 cycles, and all the remaining instructions take 1 cycle. What are the values of (i) the average CPI of the above code segment; (ii) the total run time in terms of N and cycles; and (iii) the number of floating point operations (add or multiply) per cycle?
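For reference, the quantities asked for are defined in the usual way:

    average CPI = (total cycles to execute the loop) / (total instructions executed)
    total run time in cycles = (total instructions executed) \times (average CPI)
    floating point ops per cycle = (number of fadd and fmult operations executed) / (total cycles)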
Average CPI:

Total run time:

Number of floating point ops per cycle:
Part(b)[6 pts]
The vector extension provides the following instructions:
Instruction                  Description                                                       Latency
vxlf  vxD, offset(xs1)       Load 16 single-precision floating point values from memory       16
                             location offset + (xs1) into vector extension register vxD
vxsf  vxS, offset(xs1)       Store 16 single-precision floating point values from vector      16
                             extension register vxS to memory location offset + (xs1)
vxadd vxD, vxS1, vxS2        Add vectors in vxS1 and vxS2 and store results in destination    5
                             vxD
vxmult vxD, vxS1, vxS2       Multiply each element of vxS1 and vxS2 and store results in      10
                             destination vxD
Assume N is a multiple of 16. Complete the following code using the new vector extension instructions:
# x5 = N
vxloop: vxlf vx0, 0(x2)    # load A[i] to A[i+15]

        bne  x1, x5, vxloop
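For reference, whatever vector instructions are filled in above should be semantically equivalent to the plain-C sketch below, which handles one group of 16 elements per outer iteration (the function and variable names are illustrative only):

    /* Scalar sketch of the intended vectorized loop.  Each outer iteration
     * corresponds to one group of vxlf/vxmult/vxadd (and, if used, vxsf)
     * operations covering 16 elements.  Assumes N is a multiple of 16. */
    void fma_blocks(float *A, float *B, float *C, int N)
    {
        for (int i = 0; i < N; i += 16) {
            for (int k = 0; k < 16; k++) {   /* one SIMD lane per value of k */
                A[i + k] = A[i + k] * B[i + k] + C[i + k];
            }
        }
    }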
Part(c)[9 pts]
Based on your code from the previous part, by using the new vector extension, what are the values of (i) the average CPI of the above code segment; (ii) the total run time in terms of N and cycles; and (iii) the number of floating point operations (add or multiply) per cycle?
Average CPI:
Total run time:
Number of floating point ops per cycle:
Part(d)[8 pts]
Your processor has a 1 MiB physically tagged data cache (direct map, 4-word line size). The 3 arrays are located at the following physical addresses:
Array    Address
A[]      0x10000000
B[]      0x10000000 + 4N
C[]      0x10000000 + 8N
A vxlf instruction is equivalent to loading 16 consecutive words from memory into the vector extension register with 1 instruction. In contrast, the original code loads the data 1 word per iteration.
Comparing the 2 code segments, after executing the entire loop, which code segment is likely to spend more time loading data due to the effect of the cache? Explain your answer in terms of spatial and temporal locality, and use a concrete example to support your answer. Consider 2 cases: N = 1024 and N = 2^20.
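One way to reason about which cache line a given physical address maps to is sketched below in plain C (the constants reflect the 1 MiB direct-mapped cache with 4-word lines described above; the helper itself is illustrative only):

    #include <stdint.h>

    #define LINE_BYTES   16u                      /* 4 words x 4 bytes per word */
    #define CACHE_BYTES  (1u << 20)               /* 1 MiB capacity             */
    #define NUM_LINES    (CACHE_BYTES / LINE_BYTES)

    /* Index of the direct-mapped cache line that physical address paddr maps to. */
    static uint32_t cache_index(uint32_t paddr)
    {
        return (paddr / LINE_BYTES) % NUM_LINES;
    }

Two addresses whose difference is an exact multiple of the cache capacity map to the same line, which is why the spacing between A[], B[] and C[] matters here.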
Question 4 Streaming Data [30 pts]
You are working on a new processor design to process high-speed streaming data. Your data stream has the following characteristics:
• Every second, one packet of data arrives. Each data block is 10,000 words.
• Every word of the data packet is read sequentially exactly 1 time by the user application (P1).
The above data processing loop repeats indefinitely as long as the system is powered on. P1 processes each data block with 1,000,000 non-memory instructions. Each instruction takes 1 cycle to complete.
packet_start: lw x1, 0(x2)    # load 1 word from new data packet
              # process data
              bne x3, x4, packet_start    # loop 10,000 times
# >> Total number of non-memory instructions above: 1,000,000 <<
Depending on the request size, the main memory is capable of returning data either as
(i) a single word in 250 ns; OR
(ii) a burst of 4 words in 300 ns.
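For reference, the per-word cost of the two access modes follows directly from the figures above:

    single-word request: 250 ns per word
    4-word burst: 300 ns / 4 = 75 ns per word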
Part(a)[4 pts]
In the first version of the processor design (V1), the data source shares access to the main memory with the processor. The data source writes each data block into a predefined memory area starting at address 0xA0000000. There is NO DATA CACHE in the V1 processor. Processor V1 runs at 1 GHz.
Figure 1: V1 architecture
Given the above configuration, how long does it take to complete processing one data block?
Part(b)[8 pts]
Your project partner suggests adding a cache to the processor to improve performance in a new version of the processor (V2). The cache is 64 KiB, direct map, with a line size of 4 words. On a read miss, the entire line is read from the main memory as a burst.
Figure 2: V2 architecture
Discuss, in words and with specific supporting data, how the addition of the cache will affect the performance of V1 in processing the streaming data blocks.
Part(c)[6 pts]
Unfortunately, the program starts to produce incorrect results after the introduction of the cache. You can assume there is no hardware problem. Explain why adding the cache causes the program to fail. Suggest a software solution to correct this problem.
Part(d)[6 pts]
Another colleague suggested a new processor architecture (V3) that connects the data source directly to a special register in the processor pipeline called ioreg. A new instruction ioregrd rd copies the data from ioreg to the general purpose register rd, so that it can be used by the remaining normal operations. Each read of the ioreg register returns a new word from the streaming data block. V3 also features the same data cache as V2, but V3 runs at 0.5 GHz.
Figure 3: V3 architecture
Comparing the design of V3 against V2 and V1, which processor will result in the highest performance in processing the streaming data?
Part(e)[6 pts]
Now consider a different program (P2) that processes the streaming data. Similar to P1, P2 also reads the entire data block exactly once, sequentially. However, 50% of the computing code requires access to main memory locations that span a random range. That is, 500,000 instructions are non-memory instructions that take 1 cycle to complete, and 500,000 instructions are memory operations unrelated to the streaming data block.

When using P2, which of the 3 processors will result in the best performance? You do not need to provide an exact calculation, but you should support your argument with as much data as possible.
Question 5 Branch Predictor & Branch Target Buffer [26 pts]
As the chief architect of your company, you are evaluating the performance of your branch target buffer (BTB) and branch history table (BHT). Your target application is the following particularly important benchmark program F, which contains a 2-level nested loop.
void F(int N) {
    int i = 0;
    int j = 0;
    for (i = 0; i < N; i++) {
        for (j = 0; j <= i; j++) {
            // code for loop body
            if ((i + j) % 2) {    // (i + j) is an ODD number
                // some code for (i+j) being ODD
            }
        }
    }
}
## a0 contains value of N
A00|        addi x1, zero, 0     # initialize i = 0
A04| loopI: addi x2, zero, 0     # initialize j = 0
A08| loopJ: addi t0, x1, 1       # t0 contains (i+1) for comparison
            ## code for loop J body
B00|        beq  x3, zero, skip  # x3 contains (i+j) % 2
B24| skip:  addi x2, x2, 1
B28|        bne  x2, t0, loopJ
B2C|        addi x1, x1, 1
B30|        bne  x1, a0, loopI

Part(a)[6 pts]
Assume N = 3 for now. Trace through the execution of the above code and fill in the two columns labeled Branch Instruction Address and Branch Taken? in Table 1.
(i) Branch Instruction Address: the sequence of branch instructions that get executed;
(ii) Branch Taken?: the direction of the execution of that branch. Write 'Y' for a branch that is taken and 'N' for a branch that is not taken.
Part(b)[6 pts]
You use a branch prediction scheme that always follows the last branch direction. That is, for each branch instruction, if that branch was taken last time, the scheme will predict it to be taken the next time it is encountered, and vice versa for not-taken branches. Initially it predicts NOT TAKEN the first time a branch instruction is encountered.

Fill in the column in Table 1 with the header "Correct Prediction? (Y/N)". Put a 'Y' for a branch that is correctly predicted. Put an 'N' for a branch that is mispredicted.
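A minimal sketch of this last-direction scheme in C is shown below (the table size and indexing are hypothetical; one entry per branch, initialized to not-taken):

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 64u                      /* hypothetical table size */

    static bool last_dir[BHT_ENTRIES];           /* false = not taken (initial state) */

    static unsigned bht_index(uint32_t pc)
    {
        return (pc >> 2) % BHT_ENTRIES;          /* word-aligned PC, direct-indexed */
    }

    /* Prediction: repeat whatever this branch did last time. */
    static bool predict_taken(uint32_t pc)
    {
        return last_dir[bht_index(pc)];
    }

    /* Update with the actual outcome once the branch resolves. */
    static void update_predictor(uint32_t pc, bool taken)
    {
        last_dir[bht_index(pc)] = taken;
    }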
Event Seq    Branch Instruction Address    Branch Taken? (Y/N)    Correct Prediction? (Y/N)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Table 1: Put your answers to Part(a) and Part(b) in this table.
Part(c)[2 pts]
What is the branch misprediction rate of this prediction scheme?
Part(d)[6 pts]
The following shows a perfect combined BTB and BHT. It includes an entry for every branch/jump instruction in the program.

Continuing from the previous parts, fill in the target address as well as the current prediction for each of the entries after completely executing F once.
Instruction Address    Target Address    Branch Prediction (Y/N)
0xB00
0xB28
0xB30
Part(e)[6 pts]
If the 2-bit branch predictor shown in class is used instead, how will the misprediction rate of the 3 branches be affected?
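For comparison, a minimal sketch of a 2-bit saturating-counter predictor of the kind referred to here (again with a hypothetical direct-indexed table; counters start at 0, i.e. strongly not taken, purely as an assumption for the sketch):

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 64u                      /* hypothetical table size */

    /* 2-bit counters: 0,1 predict not taken; 2,3 predict taken. */
    static uint8_t counters[BHT_ENTRIES];

    static unsigned bht_index(uint32_t pc)
    {
        return (pc >> 2) % BHT_ENTRIES;
    }

    static bool predict_taken(uint32_t pc)
    {
        return counters[bht_index(pc)] >= 2;
    }

    static void update_predictor(uint32_t pc, bool taken)
    {
        uint8_t *c = &counters[bht_index(pc)];
        if (taken  && *c < 3) (*c)++;            /* saturate at 3 (strongly taken)     */
        if (!taken && *c > 0) (*c)--;            /* saturate at 0 (strongly not taken) */
    }

A mispredicted branch now has to be wrong twice in a row before its prediction flips, which is what changes the behaviour relative to the last-direction scheme of Part(b).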
*** END OF PAPER***