Student ID number: ____________________
Instructions:
UNIVERSITY OF TASMANIA
EXAMINATIONS FOR DEGREES AND DIPLOMAS
October 2012
KXG365/KXA437 Multi-core Architecture and Programming
Examiners: Dr Ian Lewis
Time Allowed: TWO (2) hours
There is a total of ONE HUNDRED (100) marks available.
Attempt ALL TEN (10) questions from Section A, EIGHT (8) questions from Section B, and TWO (2) questions from Section C.
Each question in Section A is worth ONE (1) mark, each question in Section B is worth FIVE (5) marks, and each question in Section C is worth TWENTY-FIVE (25) marks.
Section A is worth TEN (10) marks overall, Section B is worth FOURTY (40) marks overall, and Section C is worth FIFTY (50) marks overall.
____________________________________________________________________________________
Pages: 11 Questions: 23
KXG365/KXA437 Multi-core Architecture and Programming Page 2/11
SECTION A — MULTIPLE CHOICE QUESTIONS
Answer ALL of the following TEN (10) questions on the sheet provided at the back of this examination and hand it in with your examination booklet.
Each question is worth 1 mark. This section is worth 10 marks.
Question 1
Which of the following isn’t a typical characteristic of CISC processors?
A Fixed-length instructions
B A large number of addressing modes
C Arbitrary alignment of data
D A large and varied set of instructions
E Cache
Question 2
Which of the following is a description of simultaneous multithreading?
A Control is passed between threads and each thread is given a small amount of time to execute
B Each thread executes until completion
C Multiple threads are executed simultaneously within the same pipeline
D Threads are executed simultaneously on multiple cores
E All of the above
Question 3
The following are the execution times of a multithreaded program (performing a task that can be split up into fully independent portions) being run with a various number of threads:
Which of the following architectures is it most likely that the program is being run on?
A A single-core system with only SIMD instructions
B A homogeneous two-core system
C A heterogeneous four-core system
D A single-core system with 8-way simultaneous multithreading
E A superscalar architecture with two ALUs and a single Load/Store unit
Threads
Time (msecs)
1
1000
2
717
4
599
8
545
16
666
Continued…
KXG365/KXA437 Multi-core Architecture and Programming
Page 3/11
Question 4
Cache coherency attempts to:
A
B C D E
Question 5
Which of the architecture?
A B C D E
Question 6
Which of the
A B C D E
Question 7
Which of the
A B C D E
Question 8
Which of the
A B C D E
Distribute memory reads across non-shared caches in a multi-processor system
Distribute memory reads/writes between different levels of cache Distribute memory writes across non-shared caches in a multicore processor Distribute the frequency of memory accesses to increase efficiency Distribute the load of virtual to real address translation across caches
following memory resources are not shared between cores in a typical multi-core
L1 cache
L2 cache
Memory controller Memory bus
Main memory
following is the best description of a Cell processor?
A multi-core heterogeneous processor
A multi-core homogeneous processor
A multi-core RISC processor
An embedded multi-core processor with only fixed-size on-chip memory An asynchronous multi-core processor without cache
following is an approach suggested by data-oriented design:
Where there is many, there is one Multiple instructions, multiple data Pure functional programming Object-Orientation
Structures of Arrays
following is a valid DMA transfer?
mfc_put(0x20000, 0xA0000000FABA8880, 0x18, 31, 0, 0); mfc_put(0x200000, 0xA0000000FABA8888, 0x10, 4, 0, 0); mfc_put(0x20008, 0xA0000000FABA8888, 0x1000, 1, 0, 0); mfc_put(0x20000, 0xA0000000FABA8880, 0x10000, 8, 0, 0); mfc_put(0x20008, 0xA0000000FABA8880, 0x8, 7, 0, 0);
Continued…
KXG365/KXA437 Multi-core Architecture and Programming Page 4/11
Question 9
Which of the following instrinsics always creates a vector containing four single precision floating-point values each with same value?
A spu_promote
B spu_shuffle
C spu_splats
D spu_extract
E spu_insert
Question 10
What is the value of result after execution of the following code? vector unsigned int a = (vector unsigned int)
{ 0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210 };
vector unsigned int b = (vector unsigned int)
{ 0x00FF11EE, 0x22DD33CC, 0x44BB55AA, 0x66997788 };
vector unsigned int index = (vector unsigned int)
{ 0x13121110, 0x80C0E000, 0x0C0D0E0F, 0x14051606 };
vector
unsigned int result = spu_shuffle(a, b, index);
A { 0x67452301, 0x80C0E000, 0x66997788, 0x22AB33CD }
B { 0xEE11FF00, 0x00FF8001, 0x76543210, 0x22AB33EF }
C { 0x67452301, 0x00FF8000, 0x66997788, 0x22AB33EF }
D { 0xEE11FF00, 0x00FF8001, 0x76543210, 0x22AB33CD }
E { 0xEE11FF00, 0x0080FF00, 0x76543210, 0x22AB33CD }
Continued…
KXG365/KXA437 Multi-core Architecture and Programming Page 5/11
SECTION B — SHORT ANSWER QUESTIONS
Answer EIGHT (8) of the following TEN (10) questions.
Each question is worth 5 marks. This section is worth 40 marks.
Question 11: Pipelined Architectures
Explain in your own words the difference between a non-pipelined and pipelined architecture.
[5 marks]
Question 12: Superscalar Architectures
Explain in your own words the difference between a pipelined and superscalar architecture.
[5 marks]
Question 13: Multi-core Architecture Rationale
Name and describe the three “walls” that have slowed superscalar development and have led to the development of multi-core architectures.
[5 marks]
Question 14: Cell Rationale
Describe the major architectural features that differentiate the Cell processor from more standard designs (eg. a Core 2 Duo). Explain the rationale behind these design decisions and the consequences of them.
[5 marks]
Question 15: Dependencies
Consider the following pseudo-code program in the context of executing it on an out-of-order superscalar architecture:
R1 = R2 * R3 + R4
R4 = R5
R1 = R2 – R4
For each dependency present in this code:
name the type of dependency;
explain why the dependency affects execution; and
suggest techniques for removing the dependency (if possible).
[5 marks]
Continued…
KXG365/KXA437 Multi-core Architecture and Programming Page 6/11
Question 15: Loop Unrolling
Describe the technique of loop unrolling and explain how this can improve the efficiency of programs.
[5 marks]
Question 17: Alignment
Rewrite the following PPU datastructure so that it will be equivalent when compiled on both the PPU and SPU, and rewrite the associated PPU-side variable declaration so that it is in an appropriate form for DMA (you can assume that each string will be appropriately aligned and sized upon allocation):
struct TextTree
{
char* text;
TextTree* left;
TextTree* right;
};
TextTree branchyTextPPU;
Question 18: DMA
[5 marks]
Building on your answer to Question 17, write a function to fully transfer (using DMA) a TextTree datastructure from main memory to the SPU’s local store (you can assume that each string will be stored in a fixed size buffer of 128 bytes):
TextTree transfer(unsigned long long ea);
Question 19: AoS Versus SoA
Rewrite the following code fragment using Structures of Arrays (SoA).
typedef struct point2D
{
int x, y;
float channel[4];
};
point2D dots[TOTAL];
for (int i = 0; i < TOTAL; ++i)
{
[5 marks]
}
dots[i].x += 2;
dots[i].y = -dots[i].y;
dots[i].colour[3] = (dots[i].colour[0] + dots[i].colour[1] +
dots[i].colour[2]) / 3;
[5 marks]
Continued...
KXG365/KXA437 Multi-core Architecture and Programming Page 7/11
Question 20: SIMD
ROT13 encryption is a transformation of text that shifts all alphabetic characters by 13 places in the alphabet, wrapping back to the beginning if necessary (e.g. A becomes N, B becomes O, and so on up to M becoming Z, then N becomes A, etc.). For example:
“EBG13: Vg’f abg sbe frpergf!” → “ROT13: It’s not for secrets!”
The following code performs a ROT13 conversion on the string in and stores the result in the string out. Rewrite the code using vectors to store the strings and SIMD instructions and ensure it contains no branches by use of SPU intrinsics for comparison and selection.
const int LENGTH = 40;
char in[LENGTH] = "I’ve had it with this #$%$ing exam! ;)\0"; char out[LENGTH];
for (int i = 0; i < LENGTH; i++)
{
if (in[i] >= ‘a’ && in[i] <= 'z')
{
out[i] = (in[i] – 'a' + 13) % 26 + 'a';
}
else if (in[i] >= ‘A’ && in[i] <= 'Z')
{
out[i] = (in[i] – 'A' + 13) % 26 + 'A';
}
else {
out[i] = in[i];
}
}
[5 marks]
Continued...
KXG365/KXA437 Multi-core Architecture and Programming Page 8/11
SECTION C — LONG ANSWER QUESTIONS
Answer TWO (2) of the following THREE (3) questions.
Each question is worth 25 marks. This section is worth 50 marks.
Question 21: Multi-SPE Programming
Write a Cell program that converts characters to lowercase (ie. changes uppercase letters to lowercase and leaves all other characters unchanged) in a ten million element array of unsigned chars via use of all available SPEs. Your program should consist of three files:
a. A header file containing any shared datastructures or constants.
[2 marks]
b. A PPE program that creates, distributes work amongst, and manages the SPE tasks. (Note: your code is not required to handle errors, and the task does not require sophisticated load balancing).
[10 marks]
c. An SPE program that performs the lowercase conversion. This program should use buffered input when reading from and writing to the array.
[13 marks]
Question 22: SIMD
The following C function performs a Sobel vertical edge detection (each new pixel value is calculated by applying a mask calculation to itself and the surrounding pixels) on a width height image where each pixel is a floating-point value. (Note: pixels on the edge of the image are not calculated and can be given any value).
void detect_edge(float* in, float* out, int width, int height) {
float vertical_total[width];
for (int y = 1; y < height - 1; y++)
{
for (int x = 0; x < width; x++)
{
vertical_total[x] = in[x + (y - 1) * width] +
2 * in[x + y * width] + in[x + (y + 1) * width];
}
for (int x = 1; x < width - 1; x++)
{
out[x] = vertical_total[x + 1] - vertical_total[x - 1]; }
} }
Rewrite the detect_edge function to use vector types for its array parameters and SIMD mathematical operations and intrinsics in its body. (Hint: using the spu_shuffle function is the most effective approach for the second loop over x).
[25 marks]
Continued...
KXG365/KXA437 Multi-core Architecture and Programming Page 9/11
Question 23: Software Pipelining
The following C function calculates the volume of ice-cream in a list of ice-cream cones (i.e. a cone shape with a half-sphere on top) described by the slanted edge length of the cone s and radius of the ice- cream scoop r. This is calculated by (somewhat naïvely) applying the formula area = 2⁄3r3 + 1⁄3r2h where h = (s2 - r2), ie. the volume of the half-sphere, plus one-third the area of the base times the height of the cone.
void cone_surface_areas(vector float* s, vector float* r, vector float* volume, int length)
{
const float PI = 3.141592f;
for (int i = 0; i < length; i++)
{
vector float height = sqrtf4_fast(s[i] * s[i] - r[i] * r[i]);
vector float scoop_vol = spu_splats(2.0f) / spu_splats(3.0f) * spu_splats(PI) * r[i] * r[i] * r[i];
vector float cone_vol = spu_splats(1.0f) / spu_splats(3.0f) * spu_splats(PI) * r[i] * r[i] * height;
volume[i] = scool_vol + cone_vol;
}
}
a. Using the table of assembly instructions on the following page as a reference, rewrite the cone_surface_areas function in SPU assembly taking care to remove any duplicate or unnecessary calculations, using as few instructions as possible, and optimising the use of the loop counter. (Note: a 12-bits of precision estimate of the square-root is sufficient accuracy).
[15 marks]
b. Rewrite the loop from part a using the software pipelining approach using the table of assembly instructions on the following page for instruction cycle timing information. (Note: you only need to write the main body of the pipelined loop and not the prologue or epilogue).
[10 marks]
Continued...
KXG365/KXA437 Multi-core Architecture and Programming
Page 10/11
Assembly Instructions (Subset) Reference Table:
SH: Shuffle
Shuffle Bytes
shufb i,j,k,mask
Reciprocal Estimate
frest a,x
frsqest a,x
FI: Floating-point / Integer
Reciprocal Estimate
fi b,x,a
BR: Branch
Unconditional
bra brTo
Conditional (Relative)
brz i,brTo brnz i,brTo brhz i,brTo brhnz i,brTo LS: Load/Store Loads
lqa i,label18 lqd i,qoff(j) lqr i,label14 lqx i,j,k Stores
stqa i,label18
stqd i,qoff(j)
stqr i,label14
stqx i,j,k
SP: Single Precision
ODD (4)
shuffle bytes a.wn=1/x.wn (first step)
a.wn=1/x.wn (first step)
EVEN (7)
reciprocal or reciprocal square root (second step)
br brTo
ODD (0/4/18)
(0)
(0, +18 on misprediction)
goto brTo (relative) bi i goto i.w0
goto brTo (absolute)
if (i.w0 == 0) goto brTo if (i.w0!= 0) goto brTo if (i.h1 == 0) goto brTo if (i.h1!= 0) goto brTo
ODD (6)
i = mem[label18]
i = mem[qoff * 16 + j.w0]
i = mem[extend(label14) + pc] i = mem[j.w0 + k.w0]
mem[label18] = i
mem[qoff * 16 + j.w0] = i mem[extend(label14) + pc] = i mem[j.w0 + k.w0] = i
fa
fs
fm
fma
fms
fnms
FX: FiXed Arithmetic a
a,b,c
a,b,c
a,b,c
a,b,c,d
a,b,c,d
a,b,c,d
a.fn = b.fn + c.fn
a.fn = b.fn - c.fn
a.fn = b.fn * c.fn
a.fn = b.fn * c.fn
a.fn = b.fn * c.fn
a.fn = -(b.fn * c.fn - d.fn)
i.wn = j.wn + k.wn
i.wn = j.wn + ext(s10) i.wn = -j.wn + k.wn
i.wn = -j.wn + ext(s10)
EVEN (6)
EVEN (2)
+ d.fn - d.fn
i,j,k
i,j,s10
i,j,k
i,j,s10
ai
sf
sfi
Immediate Loads
il i,s16
ilh i,u16
ila i,u18
ilhu i,u16 i.wn=u16<<16
i.wn = ext(s16) i.hn = u16
i.wn = u18
iohl i,u16
Comparisons (Words)
ceq i,j,k ceqi i,j,s10 cgt i,j,k cgti i,j,s10 Select Bits
selb i,j,k,l
i.wn |= u16
i.wn = (j.wn == k.wn)
i.wn = (j.wn == s10)
i.wn = (j.wn > k.wn) (signed) i.wn = (j.wn > s10) (signed)
i=(l==0)?j:k
Continued…
KXG365 Multi-core Architecture and Programming Page 11/11
Answer Sheet for SECTION A. Hand this sheet in together with your examination answer book(s).
Student Number: _____________________________
Your selection must be written as a BLOCK CAPITAL in the box provided for each question. For example:
Question 1
A
Question 1
Question 2
Question 3
Question 4
Question 5
Question 6
Question 7
Question 8
Question 9
Question 10
____________________