
UNIVERSITY OF EDINBURGH

COLLEGE OF SCIENCE AND ENGINEERING

SCHOOL OF INFORMATICS

INFR1115 HPC ARCHITECTURES

Friday 11th December 2020

2 hours

13:00 to 15:00

Answers must be submitted to Turnitin by 16:00

INSTRUCTIONS TO CANDIDATES

1. Note that THERE ARE THREE COMPULSORY 25-MARK QUESTIONS.
Different sub-questions may have different numbers of total marks.
Take note of this in allocating time to questions.

2. THIS EXAMINATION IS AN OPEN-BOOK ASSESSMENT. You may refer to
material from your notes, course material, or beyond to assist you.
You should not copy any text or images into your answer, however, as
your answer must remain your own work. If you refer to material from
outside the course it must be referenced properly.

3. THIS IS A REMOTE EXAMINATION. As stated in the Own Work
Declaration, for the duration of the assessment you must not
communicate with any other person about your work either electronically
or by word or sign, nor let your work be seen by any other person.

4. Please refer to the guidance in Learn under Examination Information
if you have any difficulties.

EPCC Courses

Convener: Dr Andreas Pieris

External Examiner: Prof Matt Probert

THIS EXAMINATION WILL BE MARKED ANONYMOUSLY

1. (a) Define the term fully associative and explain why this design is normally
not used in large data caches. [5 marks ]

A 2.0 GHz processor has a 64 Kbyte, 4-way set associative cache with
32 byte blocks and an LRU replacement policy.

(b) Calculate the number of sets in the cache. [2 marks ]

(c) Explain why you might expect typical applications to perform better with
an LRU replacement policy instead of a random replacement policy. [4 marks ]

(d) The following code fragment is executed on the processor, which has a fused
multiply-add floating point unit:

    for (int i=0; i<1000; i++){
        sum += a[i];
    }
    for (int i=0; i<1000; i++){
        ta = a[i];
        sumsq += ta * ta;
    }

where a, ta, sum and sumsq are 64-bit floating point values, and none of
the elements of a are initially stored in the cache. By timing the two
loops separately, the performance of the first loop is calculated as
125 Mflop/s and that of the second loop as 1 Gflop/s.

i. For each loop, calculate the average time in nanoseconds taken to
execute one loop iteration. Show your reasoning. [6 marks ]

ii. Using your answer to the previous part, or otherwise, estimate the
cost of a cache miss in clock cycles, showing your reasoning and clearly
stating any assumptions. [8 marks ]

2. (a) Name three aspects of efficiency in HPC applications, and describe
the relationship between them. [5 marks ]

(b) You have an application that runs only on a CPU, but you would like
to know whether accelerating it with either a GPU or an FPGA will give
you better energy efficiency.

i. Give an example of a measurement method that you could use to get the
power draw of a CPU. [1 mark ]

On a CPU, the application takes 100 s to run and during the computation
the CPU draws 60 W. The GPU would draw an additional 180 W and the FPGA
would draw an additional 60 W. What performance do you need to achieve
for the accelerated version of the application to be more energy
efficient than the CPU-only version when using:

ii. CPU+FPGA [3 marks ]

iii. CPU+GPU [3 marks ]

(c) By configuring the hardware to behave like an application at the
electronics level, Field Programmable Gate Arrays (FPGAs) have the
potential to deliver both performance and energy efficiency.

i. Describe how, in terms of their features and design, FPGAs can
provide both competitive performance and reduced energy consumption when
executing codes, in comparison with other architectures such as CPUs or
GPUs. [5 marks ]

ii. Briefly describe a situation where FPGAs might not provide energy
efficiency benefits over CPUs, and why. [2 marks ]

(d) In terms of the chips themselves, FPGAs are typically composed of a
number of basic technologies. Briefly explain what the following are and
the role they play in an FPGA:

i. Look Up Table (LUT)

ii. Digital Signal Processing Slice (DSP)

iii. Block RAM (BRAM) [6 marks ]

3. (a) Briefly describe the functions that a Resource Manager undertakes
and contrast these with a Batch Scheduler. [4 marks ]

(b) Describe what scheduling algorithms do, what backfill scheduling is,
and what benefits and drawbacks backfill scheduling may have. [7 marks ]

(c) As well as the compute nodes in an HPC system, other components are
required for good performance across a range of applications. One such
component is the filesystem. Describe the main features of a parallel
filesystem, and discuss how they provide high-performance I/O for
applications. [6 marks ]

(d) Parallel filesystems may provide variable performance for
applications. Discuss why performance may be variable and what
approaches users can take to optimise I/O performance. [5 marks ]

(e) One option for increasing I/O performance for parallel filesystems
is to use specialised hardware such as a global Burst Buffer or
non-volatile memory in compute nodes. Discuss what extra functionality
may be required in the scheduling system to enable such resources to be
efficiently used by users and applications. [3 marks ]