CS代考 CS 15-213, Fall 2005 Final Exam

Andrew login ID: Full Name:
CS 15-213, Fall 2005 Final Exam
Friday Dec 16, 2005
• Make sure that your exam is not missing any sheets, then write your full name and Andrew login ID on the front.

Copyright By PowCoder代写 加微信 powcoder

• Write your answers in the space provided below the problem. If you make a mess, clearly indicate your final answer.
• The exam has a maximum score of 92 points.
• This exam is OPEN BOOK. You may use any books or notes you like. You may use a calculator, but
no other electronic devices are allowed. Good luck!
TOTAL (92):
Page 1 of 17

Problem 1. (6 points):
Address spaces. Suppose you have a computer system with:
• A 1 GB byte-addressable virtual address space,
• A 256 MB byte-addressable physical address space, and • A virtual memory page size of 4 KB.
A. What is the minimum number of address bits needed to represent the virtual address space? __________. B. What is the minumum number of bits needed to represent the physical address space? __________
C. What is the total number of page table entries? __________ (Express your answer in the form 2x ).
Page 2 of 17

Problem 2. (12 points):
Data representation. Consider the following two 9-bit floating point representations based on the IEEE floating point format.
1. Format A
• There is one sign bit.
• There are k = 5 exponent bits. The exponent bias is 15. • There are n = 3 fraction bits.
2. Format B
• There is one sign bit.
• There are k = 4 exponent bits. The exponent bias is 7. • There are n = 4 fraction bits.
Numeric values are encoded in both of these formats as a value of the form V = (−1)S × M × 2E , where S is the sign bit, E is exponent after biasing, and M is the significand value. The fraction bits encode the significand value M using either a denormalized (exponent field 0) or a normalized representation (exponent field nonzero).
Below, you are given some bit patterns in Format A, and your task is to convert them to the closest value in Format B. If rounding is necessary you should round toward +∞. In addition, give the values of numbers given by the Format A and Format B bit patterns. Give these as whole numbers (e.g., 17) or as fractions (e.g., 17/64 or 17/26).
1 01111 001
1 0111 0010
0 10110 011
1 00111 010
1 11100 000
0 10111 100
Page 3 of 17

Problem 3. (8 points):
Array indexing. Consider the source code below, where M and N are constants declared with #define. int array1[M][N];
int array2[N][M];
int copy(int i, int j)
array1[i][j] = array2[j][i];
The above code generates the following assembly code on a 64-bit Pentium:
Arguments: i is in %edi, j is in %esi
movslq %esi,%rsi
movslq %edi,%rdi
leaq 0(,%rsi,8), %rax
leaq (%rdi,%rdi,4), %rdx subq %rsi, %rax
addq %rsi, %rdx
addq %rdi, %rax
movl array2(,%rax,4), %eax movl %eax, array1(,%rdx,4) ret
Assuming that sizeof(int) == 4, what are the values of M and N? M=
Page 4 of 17

Problem 4. (10 points):
Machine-level code. Consider the following function’s assembly code:
00000000004004f8 : 4004f8: 53
4004f9: 89 f8 4004fb: 83 ff 4004fe: 76 21 400500: b8 01 400505: b9 00 40050a: ba 02 40050f: 39 fa 400511: 77 0e 400513: 01 c8 400515: 89 c3 400517: 29 cb 400519: 89 d9 40051b: ff c2 40051d: 39 fa 40051f: 76 f2 400521: 5b
400522: c3
Please fill in the corresponding C code:
int foo (unsigned int x) {
int a, b, i;
if(__________)
_____________;
a = 1; b = 0;
mov %edi,%eax
cmp $0x1,%edi
jbe 400521 mov $0x1,%eax
mov $0x0,%ecx
mov $0x2,%edx
cmp %edi,%edx
ja 400521 add %ecx,%eax
mov %eax,%ebx
sub %ecx,%ebx
mov %ebx,%ecx
cmp %edi,%edx
jbe 400513 pop %rbx
000000 000000 000000
for(_______ ; ________ ; _______) {
a = ____________;
b = ____________;
return _____;
Page 5 of 17

Problem 5. (12 points):
Performance Evaluation. We saw in class how loop unrolling can be used to improve the performance of a piece of code. This problem will test your ability to analyze the performance improvements offered by this technique.
Assume that multiplication has a latency of 7 cycles and addition has a latency of 5 cycles.
A. Alice has written the code below to compute the dot product of two vectors, computing one element per iteration.
data_t dot_prod(data_t A[], data_t B[], int size) {
data_t result = 0;
for (i = size-1; i >= 0; i–) {
result = result + (A[i] * B[i]); }
return result;
What is the optimal CPE achieved by the code above? Assume that there are an unlimited number of functional units.
CPE = _____________
Page 6 of 17

B. Now suppose Alice unrolls the loop, computing two elements per iteration. What is the resulting optimal CPE? Once again, assume that there are an unlimited number of functional units.
/* Unroll 2x */
data_t dot_product2(data_t A[], data_t B[], int size) {
data_t result = 0;
/* Unroll by 2X */
for (i = size-1; i >= 1; i -= 2) {
data_t t1 = A[i] * B[i]; data_t t2 = A[i-1] * B[i-1]; result = result + (t1 + t2);
/* Finish off remaining element(s) */ for (; i >= 0; i -= 1) {
result = result + (A[i] * B[i]); }
return result;
CPE = _____________
C. By what factor would Alice need to unroll the loop to get an optimal CPE of 0.5? Once again, assume
that there are an unlimited number of functional units.
Unroll by _____________
D. One of the reasons why we get diminishing returns from unrolling on a real CPU is that there are a limited number of functional units, and therefore, a limit to how many operations we can perform in parallel. So Alice is very excited to learn that the CS department is getting a new machine with one million floating point units. In anticipation, she unrolls her dot product code by 10,000. Give one reason why her new code may actually perform worse than her old code (which unrolled by 2).
You may assume that memory latencies and cache sizes on the new machine are the same as on the old one.
_________________________________________________________________
Page 7 of 17

Problem 6. (12 points):
Cache memories. This problem requires you to analyze both high-level and low-level aspects of caches. You will be required to perform part of a a cache translation, determine individual hits and misses, and analyze overall cache performance.
• Memory is byte addressable
• Physical addesses are 14 bits wide
• The cache is direct-mapped with a 16 byte block-size and 4 sets
• sizeof(int) = 4 bytes
A. The following question will deal with a 5 × 5 int matrix arr[5][5]. Assume that the array has already been initialized.
(a) The box below shows the format of a physical address. Indicate (by labeling the diagram) the fields that would be used to determine the following:
CO The block offset within the cache line
CI The set index
CT The cache tag
13 12 11 10 9 8 7 6 5 4 3 2 1 0
(b) Given the address of arr[0][0] has value 0x2A6C, perform a cache address translation to deter- mine the block offset and set index for the first item in the array.
CI = _______
CO = _______
(c) For each element in the matrix, label the diagram below with the set index that it will map to.
arr col 0 row 0
row 1 row 2 row 3 row 4
Page 8 of 17

B. Thefollowingquestionswillalsodealwiththe5×5matrixarrandthecachedefinedatthebeginning of the problem. Assume that the cache begins cold and then the following five cells in the matrix are accessed, in the order they are listed:
arr[0][0] arr[1][1] arr[2][2] arr[3][3] arr[4][4]
The following four cells are then accessed, in order. You need to determine if each access will be a hit (H) or a miss (M). Remember that the cache is not cold but is setup as it would be after accessing the elements above. Also remember that the elements below are also being accessed in order and the contents of the cache should change following a miss. This part will be graded based on the set mapping in (c) above so even if you aren’t confident in your mapping you can still get full credit for this section (assuming your mapping is rational).
arr[1][2] = _______ arr[0][0] = _______ arr[4][0] = _______ arr[0][3] = _______
The following region can be used as scrap space:
Page 9 of 17

C. The following questions will deal with an N × N int matrix arr except that N = 1024. Assuming i, j, and sum are all stored in registers, analyze the cache performance of the following piece of code:
#define N 1024
int arr[N][N];
int arr_sum()
int sum = 0;
for(i=0; iCS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com