Microsoft Word – Final_sample.docx
1
CEG3310/5310 Final Practice Exam A (90-min version)
1. (30%) A tiny RISC machine has 8-bit memory address. It has a D-cache that is a direct-mapped cache of 4
Bytes. Suppose each cache block contains 2 bytes. Assume write-with-allocation is used.
(1) Draw the cache organization, compute the total number of tag bits in the whole cache directory (not just one
cache block), and show how the physical addresses are partitioned and related to cache addresses.
(2) The following operation is to be executed on the RISC machine.
for(i=0; i<3; i++) A[i + 2] = A[i] + A[4−i]; where A[ ] is an array originally stored in memory, not in cache. Each A[ ] element occupies one byte of memory space. Assume the cache is initially empty. Calculate the hit ratio for the D-cache by clearly identify for each loop iteration whether each array element access is a hit or a miss. Show the final cache contents (3) If the cache is initially empty, show the final cache contents and clearly identify from the following memory address sequence the one(s) that will result in cache hits: 0x22, 0x21, 0x31, 0x23, 0x24, 0x22, 0x31, 0x30. 2. (20%) Consider a computer system with the following specification: • The processor uses a 5-stage ARM pipe, running at 2 GHz. • The I-cache miss ratio is 0.5% and the D-cache miss ratio is 10%. There is a 15% chance for an instruction to access a D-cache. Each cache has a cache block size of 8 words. • The memory is not interleaved. The memory bus clock is 500 MHz and the bus/memory is two-word wide. To access the memory, it takes one bus clock cycle to send the address, 4 bus clock cycles for each memory access initiated (DRAM latency), and one bus clock cycle to send one word of data. Assume the CPI (cycle per instruction) for the processor is 1.6 when no cache miss is considered. (1) What is the miss penalty, which is the amount of time to fill a cache block, in terms of CPU clock cycles? (2) What is the CPI, in terms of the CPU clock cycles, when all cache misses are considered? (3) How much execution time in seconds does it take to execute ten billion instructions? (4) What is the miss penalty in terms of BUS clock cycles if the memory is interleaved with only four memory banks? 3. (20%) Draw a 5-stage ARM instruction pipe diagram that includes one major component in each step of the datapath. Clearly specify all actions performed by each stage of the hardware to facilitate the execution of the instruction, ADD X1, X2, X3. 4. (30%) Briefly answer the following questions. (1) Given a 4-stage pipe with 1,000,000 tasks going through the pipe, what is the theoretical speedup in terms of latency, compared to a non-pipelined version? How about throughput? (2) Write down two lines of ARM assembly code that would introduce a write-after-write hazard. (3) In a multiple-processor system that shares a single memory bus, if each processor has its own two-level cache, which write-hit policy should be used for the L1 cache? How about the L2 cache? (4) Specify the characteristics of the hardware needed to support “delayed branch”. (5) Identify the term used when a processor execution switches to a different thread after every clock cycle. (6) Identify two reasons that GPUs become popular for general purpose parallel computing. 2 CEG3310/5310 Final Practice Exam B (90-min version) 1. (30%) A tiny RISC machine has 12-bit memory address. It has a D-cache that is a two-way set associative cache of 64 Bytes. Suppose each cache block contains 16 bytes. Assume FIFO replacement and write-with- allocation are used. (1) Draw the cache organization, compute the total number of tag bits in the whole cache directory (not just one cache block), and show how the physical addresses are partitioned and related to cache addresses. (2) The following operation is to be executed on the RISC machine. for(i=0; i<3; i++) A[i + 4] = A[i] + A[i+8]; where A[ ] is an array originally stored in memory, not in cache. Each A[ ] element occupies 8 bytes of memory space. Calculate the hit ratio for the D-cache by clearly identify for each loop iteration whether each array element access is a hit or a miss. (3) If the cache is initially empty, show the final cache contents and clearly identify from the following memory address sequence the one(s) that will result in cache hits: 0xA21, 0x228, 0x2A4, 0x2A6, 0xA24, 0xA23, 0xA2B, 0xB20, 0xA20. 2. (20%) Consider a computer system with the following specification: • The processor uses a 5-stage instruction pipe, running at 500 MHz. • The I-cache miss ratio is 5% and the D-cache miss ratio is 10%. There is a 40% chance for an instruction to access a D-cache. Each cache has a cache block size of 2 words. • No memory interleaving is used. The memory bus clock is 250 MHz and the bus is one-word wide. To access the memory, it takes one bus clock cycle to send the address, 9 bus clock cycles for each memory access initiated (DRAM latency), and one bus clock cycle to send one word of data. Assume the CPI (cycle per instruction) for the processor is 1.2 when no cache miss is considered. (1) What is the miss penalty, which is the amount of time to fill a cache block, in terms of CPU clock cycles? (2) What is the CPI, in terms of the CPU clock cycles, when all cache misses are considered? (3) How much execution time in seconds does it take to execute twenty billion instructions? (4) What is the miss penalty in terms of CPU clock cycles if the memory interleaving is used? 3. (20%) Draw a 5-stage ARM instruction pipe diagram that includes one major component in each step and also includes other components that are necessary to execute the ARM CBZ instruction. Given your diagram, how many clock cycles the pipeline needs to stall after a CBZ instruction? Why? 4. (30%) Briefly answer the following questions. (1) Write down one line of ARM assembly code that does use the write back stage of the 5-stage pipe. (2) Write down two lines of ARM assembly code that would still stall the pipe even with the data forwarding technique. (3) Is loop-unrolling useful on a superscalar processor? If yes, why? If no, why not? (4) What is the maximal speedup on an application that contains 50% sequential part and 50% parallelizable part? (5) What is the hardware feature that is needed to support simultaneous multithreading? (6) Identify two main reasons why special instructions are added to accelerate multi-media applications. 3 CEG3310/5310 Final Practice Exam C (90-min version) 1. (30%) A tiny RISC machine has 8-bit memory address. It has a D-cache that is a direct mapped cache of 16 Bytes. Suppose each cache block contains 4 bytes. Assume write-with-allocation is used. (1) Draw the cache organization, compute the total number of tag bits in the whole cache directory (not just one cache block), and show how the physical addresses are partitioned and related to cache addresses. (2) The following operation is to be executed on the RISC machine. for(i=0; i<3; i++) A[i + 4] = A[i] + A[i+8]; where A[ ] is an array originally stored in memory, not in cache. Each A[ ] element occupies two bytes of memory space. Calculate the hit ratio for the D-cache by clearly identify for each loop iteration whether each array element access is a hit or a miss. (3) If the cache is initially empty, show the final cache contents and clearly identify from the following memory address sequence the one(s) that will result in cache hits: 0x21, 0x28, 0xA4, 0xA6, 0x24, 0x23, 0x2B, 0x20. 2. (20%) Consider a computer system with the following specification: • The processor uses a 5-stage instruction pipe, running at 1 GHz. • The I-cache miss ratio is 2% and the D-cache miss ratio is 20%. There is a 40% chance for an instruction to access a D-cache. Each cache has a cache block size of 2 words. • Memory interleaving is used. The memory bus clock is 250 MHz and the bus is one-word wide. To access the memory, it takes one bus clock cycle to send the address, 7 bus clock cycles for each memory access initiated (DRAM latency), and one bus clock cycle to send one word of data. Assume the CPI (cycle per instruction) for the processor is 1.5 when no cache miss is considered. (1) What is the miss penalty, which is the amount of time to fill a cache block, in terms of CPU clock cycles? (2) What is the CPI, in terms of the CPU clock cycles, when all cache misses are considered? (3) How much execution time in seconds does it take to execute ten billion instructions? (4) What is the miss penalty in terms of CPU clock cycles if no memory interleaving is used? 3. (20%) Draw a 5-stage ARM instruction pipe diagram that includes one major component in each step of the datapath and also includes the most important component of the control path. For the WB stage, identify how it is controlled for an LDUR instruction, an ADDI instruction, and an STUR instruction, respectively. 4. (30%) Briefly answer the following questions. (1) Write down two lines of ARM assembly code that would introduce control hazard. (2) Write down two lines of ARM assembly code that would stall the pipe unless the data forwarding technique is used. (3) Describe one C coding technique that is essential in improving performance when adding one to every element of a huge 2-dimensional matrix. (4) What is the maximal speedup on an application that contains 10% sequential part and 90% parallelizable part? (5) What is the difference between a fine multithreading and an SMT? (6) Identify the terminology where unused idling times of computers on the internet can be jointly used for a parallel computing effort. 4 CEG3310/5310 Final Practice Exam D (90-min version) 1. (30%) A tiny RISC machine has 10-bit memory address. It has a D-cache that is a two-way set associative cache of 16 Bytes. Suppose each cache block contains 4 bytes. Assume FIFO replacement and write-with- allocation are used. (1) Draw the cache organization, compute the total number of tag bits in the whole cache directory (not just one cache block), and show how the physical addresses are partitioned and related to cache addresses. (2) The following operation is to be executed on the RISC machine. for(i=0; i<3; i++) A[i + 2] = A[i] + A[12−i]; where A[ ] is an array originally stored in memory, not in cache. Each A[ ] element occupies two bytes of memory space. Assume the cache is initially empty. Calculate the hit ratio for the D-cache by clearly identify for each loop iteration whether each array element access is a hit or a miss. (3) If the cache is initially empty, show the final cache contents and clearly identify from the following memory address sequence the one(s) that will result in cache hits: 0x121, 0x028, 0x027, 0x026, 0x124, 0x128, 0x120, 0x126. 2. (20%) Consider a computer system with the following specification: • The processor uses a 5-stage ARM pipe, running at 1 GHz. • The I-cache miss ratio is 2% and the D-cache miss ratio is 10%. There is a 20% chance for an instruction to access a D-cache. Each cache has a cache block size of 8 words. • The memory is interleaved with eight memory banks. The memory bus clock is 200 MHz and the bus is one-word wide. To access the memory, it takes one bus clock cycle to send the address, 16 bus clock cycles for each memory access initiated (DRAM latency), and one bus clock cycle to send one word of data. Assume the CPI (cycle per instruction) for the processor is 2 when no cache miss is considered. (1) What is the miss penalty, which is the amount of time to fill a cache block, in terms of CPU clock cycles? (2) What is the CPI, in terms of the CPU clock cycles, when all cache misses are considered? (3) How much execution time in seconds does it take to execute ten billion instructions? (4) What is the miss penalty in terms of BUS clock cycles if the memory is interleaved with only two memory banks? 3. (20%) Draw a 5-stage ARM instruction pipe diagram that includes one major component in each step of the datapath. In addition, identify in the diagram the detailed hardware needed to facilitate the execution of the instruction, LDUR X1, [X2, 1], so that the correct result is loaded into the destination register. 4. (30%) Briefly answer the following questions. (1) What is the benefit of using pipelining versus using parallel processing to achieve the same speedup? (2) Write down two lines of ARM assembly code that would introduce data hazards. (3) A MIPS processor supports only 32 registers even though it could have many more, with huge number of on-chip transistors. Why? (4) Why is it possible for hardware to predict if a specific conditional branch instruction will branch or not? (5) Why is it not a good idea for a system to use a single large cache (vs multiple levels)? (6) Identify the special purpose hardware that becomes popular for general purpose parallel computing because it contains many cores and evolves faster than general purpose CPUs. 5 CEG3310/5310 Final Practice Exam E (90-min version) 1. (30%) A tiny RISC machine has 12-bit memory address. It has a D-cache that is a two-way set associative cache of 128 Bytes. Suppose each cache block contains 16 bytes. Assume FIFO replacement and write- with-allocation are used. (1) Draw the cache organization, compute the total number of tag bits in the whole cache directory (not just one cache block), and show how the physical addresses are partitioned and related to cache addresses. (2) The following operation is to be executed on the RISC machine. for(i=0; i<4; i++) A[11-i] = A[i] + A[i+2]; where A[ ] is an array originally stored in memory, not in cache. Each A[ ] element occupies eight bytes of memory space. Assume the cache is initially empty. Calculate the hit ratio for the D-cache by clearly identify for each loop iteration whether each array element access is a hit or a miss. (3) If the cache is initially empty, show the final cache contents and clearly identify from the following memory address sequence the one(s) that will result in cache hits: 0x121, 0x118, 0x027, 0x166, 0x123, 0x028, 0x110, 0x126. 2. (20%) Consider a computer system with the following specification: • The processor uses a 5-stage ARM pipe, running at 2 GHz. • The I-cache miss ratio is 1% and the D-cache miss ratio is 20%. There is a 20% chance for an instruction to access a D-cache. Each cache has a cache block size of 8 words. • The memory is interleaved with eight memory banks. The memory bus clock is 400 MHz and the bus is one-word wide. To access the memory, it takes one bus clock cycle to send the address, 11 bus clock cycles for each memory access initiated (DRAM latency), and one bus clock cycle to send one word of data. Assume the CPI (cycle per instruction) for the processor is 1.35 when no cache miss is considered. (1) What is the miss penalty, which is the amount of time to fill a cache block, in terms of CPU clock cycles? (2) What is the CPI, in terms of the CPU clock cycles, when all cache misses are considered? (3) How much execution time in seconds does it take to execute twenty billion instructions? (4) What is the miss penalty in terms of BUS clock cycles if the memory is interleaved with only two memory banks? 3. (20%) Draw a 5-stage ARM instruction pipe diagram that includes one major component in each step and also includes other components that are necessary to execute the ARM “BR X30” instruction. Given your diagram, how many clock cycles the pipeline needs to stall after the BR instruction? Why? 4. (30%) Briefly answer the following questions. (1) In a 5-stage ARM instruction pipe, what is the most important component of the control path? (2) Write down two lines of ARM assembly code that would not benefit from using data forwarding at all. (3) Draw a diagram to illustrate a superscalar version of the ARM five-stage pipe. (4) What happens in a processor using speculative execution when the prediction is wrong? (5) Given the same cache size, which cache organization (between direct-mapped and 4-way set associative) would normally have a higher hit ratio? Why? (6) Identify the term for computer hardware that is used in companies such as Google and Amazon to handle huge number of independent transactions every second.