UNIVERSITY OF EDINBURGH
COLLEGE OF SCIENCE AND ENGINEERING
SCHOOL OF INFORMATICS
INFR11175 HPC ARCHITECTURES
Thursday 12th December 2019
09:30 to 11:30
INSTRUCTIONS TO CANDIDATES
1. Note that ALL QUESTIONS ARE COMPULSORY.
2. EACH QUESTION IS WORTH 25 MARKS. Different sub-questions
may have different numbers of total marks. Take note of this in
allocating time to questions.
3. CALCULATORS MAY BE USED IN THIS EXAMINATION.
EPCC Courses
Convener: M.Mistry
External Examiner: Matt Probert
THIS EXAMINATION WILL BE MARKED ANONYMOUSLY
1. A processor has an 8 Kbyte, 2-way set associative cache with 128-byte blocks and
a Least Recently Used (LRU) replacement policy.
(a) Briefly explain the following terms:
i. Cache block.
ii. 2-way set associative.
iii. LRU replacement policy.
[6 marks]
(b) Describe the method used in this particular cache to determine the set in
which a data item is stored.
[6 marks]
(c) A program uses a one-dimensional array a of 64-bit integers. Assuming that
the first element of the array is aligned with the start of a cache block, show
that the following array elements will all be stored in the same set: a[0],
a[12], a[1024], a[1026], a[2048].
[6 marks]
(d) Initially, none of the array elements are stored in the cache. The
processor loads the following sequence of array elements: a[0], a[1024],
a[0], a[2048], a[0], a[1024], a[0]. State, with reasoning, which of
these accesses will result in a cache miss.
[7 marks]
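The set-mapping arithmetic behind part (c) can be sketched in a few lines of Python. This is an illustrative sketch, not a model answer: it assumes 1 Kbyte = 1024 bytes and a conventional mapping in which the set index is the block number modulo the number of sets. The names `set_index` and `base_block` are hypothetical; `base_block` is the memory block number at which the array happens to start.

```python
# Illustrative sketch under stated assumptions; parameters come from Question 1.
CACHE_BYTES = 8 * 1024   # assumes 1 Kbyte = 1024 bytes
BLOCK_BYTES = 128
WAYS = 2
ELEMENT_BYTES = 8        # one 64-bit integer

# Each set holds WAYS blocks, so the cache is divided into this many sets.
NUM_SETS = CACHE_BYTES // (BLOCK_BYTES * WAYS)


def set_index(element: int, base_block: int = 0) -> int:
    """Return the set that a[element] maps to.

    base_block is the (hypothetical) block number of a[0] in memory;
    the question only guarantees that a[0] starts a block.
    """
    byte_offset = element * ELEMENT_BYTES        # offset from a[0]
    block_number = base_block + byte_offset // BLOCK_BYTES
    return block_number % NUM_SETS               # assumed modulo mapping


if __name__ == "__main__":
    for i in (0, 12, 1024, 1026, 2048):
        print(f"a[{i}] -> set {set_index(i)}")
```

Changing `base_block` shifts every result by the same amount, so whether the listed elements share a set does not depend on where the array starts.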
2. (a) The following are different approaches that implement parallelism within a
processor:
i. Pipeline
ii. Super-scalar
iii. SIMD-vector functional units
Describe each approach, including how and to what extent it implements
parallelism, and to what extent hardware is replicated.
[15 marks]
(b) The following pseudo-code fragment represents a sequence of floating point
instructions operating on registers:
(1) f1 + f2 → f1
(2) f3 × f4 → f3
(3) f1 + f5 → f1
(4) f3 × f6 → f3
(5) f1 + f7 → f1
(6) f3 × f7 → f3
i. Draw the dependency graph for this fragment, labelling the nodes and
edges with the instructions and register names as appropriate.
ii. Assume a single floating-point unit with a pipeline length of 6. On
which clock cycle (relative to the start) will each instruction issue and
complete? How many cycles will it take to complete the entire fragment?
iii. How much faster would the fragment execute if running on a super-scalar
processor with separate floating-point addition and multiplication units
each with a pipeline length of 6?
[10 marks]
3. (a) Describe the main differences in hardware design between CPUs and GPUs.
[8 marks]
(b) Explain the advantages and disadvantages of including GPUs in an HPC
system.
[9 marks]
(c) Suppose you have a parallel application code that runs only on a CPU.
There are two systems available, with the following power characteristics:
System A One CPU which consumes 30W idle and 130W loaded.
System B One CPU which consumes 25W idle and 110W loaded, and one
GPU which consumes 20W idle and 200W loaded.
The code takes 100 seconds to execute on System A and 80 seconds on
System B.
i. State, with reasons, which system is more power efficient for this code.
ii. State, with reasons, which system is more energy efficient for this code.
iii. Explain what you could do to make these systems more power efficient.
[8 marks]
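The distinction between power and energy in part (c) comes down to energy = power × time. The sketch below shows one possible reading of the figures, not a model answer. It assumes each CPU draws its loaded power for the whole run and that the GPU in System B stays idle throughout, since the code runs only on the CPU. The helper name `energy_joules` is hypothetical.

```python
# Illustrative sketch under stated assumptions; figures come from Question 3(c).

def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy consumed at a constant power draw over a given time."""
    return power_watts * seconds


# System A: one CPU, assumed fully loaded for the whole run.
power_a = 130.0
time_a = 100.0

# System B: CPU assumed fully loaded; GPU assumed idle (code is CPU-only).
power_b = 110.0 + 20.0
time_b = 80.0

energy_a = energy_joules(power_a, time_a)
energy_b = energy_joules(power_b, time_b)

if __name__ == "__main__":
    print(f"System A: {power_a} W for {time_a} s = {energy_a} J")
    print(f"System B: {power_b} W for {time_b} s = {energy_b} J")
```

Under different assumptions, for example a CPU that is only partly loaded, the power figures would change and the comparison would need to be redone.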
4. (a) Name the four different types of computing architecture described by Flynn’s
taxonomy and state which is the most common in today’s HPC systems. [5 marks]
(b) Modern processors are generally multi-core. Describe what it means to be a
multi-core processor, compared to a single-core processor, and discuss why
multi-core processors have become the most common form of processor used
today. [6 marks]
(c) In addition to multiple cores, there are two other common ways that
individual processors provide parallelism to applications. One of these is
vectorisation. Name and briefly describe the other way of providing parallelism
within a single processor, including some discussion of performance benefits
that may be achieved for typical computational simulation applications. [4 marks]
(d) Another approach for providing parallelism to applications is to use a custom
processor such as an FPGA. Describe how FPGAs can achieve parallelism
and very high throughput even though they run at a lower clock speed and
use less energy than CPUs. [4 marks]
(e) Describe how CPU and FPGA architectures differ. Specifically:
i. How does each architecture process instructions and data?
ii. What challenges does programming each architecture involve? [6 marks]