
EECS 645
Test 2

Spring 2020

____Peng Zhang____ __2905964_____
Name Student Number

Instructions

• When complete, email this exam to benewy@ku.edu, no later than 4:00pm, May 12, 2020
• Open Book
• Open Notes
• Calculators are allowed; open Internet searching is NOT.
• Be sure to clearly mark your final answer.
• Please show all work so that partial credit can be evaluated
• Feel free to insert more pages/white space anywhere if you need more space.
• Where indicated provide justification for your answers.
• Feel free to email questions about the exam. benewy@ku.edu
• If you feel that a problem is unclear, contradictory, incomplete, or ambiguous, clearly state the assumptions you used to solve the problem.

Each student is expected to complete the exam individually using only course notes and the textbook, and without aid from other outside sources.

I assert that I have neither provided help nor accepted help from another student in completing this exam. As such, the work herein is mine and mine alone.

–––––––––––––––––––––––––––––––––––– ––––––––
Signature Date

1. /15

2. /20

3. /15

4. /20

5. /10

6. /20

Total /100

Problem #1 Total of 15 points.

Consider the MIPS32 program below.
.text
start:  li   $s0, 2
loop:   move $a0, $s0
        j    iam
next:   add  $s2, $s2, $v0
        addi $s0, $s0, -1
        bne  $s0, $zero, loop
        j    done
iam:    addi $v0, $a0, 2
        j    next
done:   nop

Assume all registers are initialized to zero. Assume that the first line executed is the one with the label “start”. For each row below fill in the instruction executed at each clock cycle, and a description of what the results of that execution are. Descriptions should be things like “jump to label foo”, “fall through branch”, or “$t1 = 42”.

Instruction              Description
li $s0, 2                $s0 = 2
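One possible completed trace, continuing the table above (all registers start at zero, so $s2 accumulates (2+2) + (1+2) = 7):

move $a0, $s0            $a0 = 2
j iam                    jump to label iam
addi $v0, $a0, 2         $v0 = 4
j next                   jump to label next
add $s2, $s2, $v0        $s2 = 4
addi $s0, $s0, -1        $s0 = 1
bne $s0, $zero, loop     branch taken to label loop
move $a0, $s0            $a0 = 1
j iam                    jump to label iam
addi $v0, $a0, 2         $v0 = 3
j next                   jump to label next
add $s2, $s2, $v0        $s2 = 7
addi $s0, $s0, -1        $s0 = 0
bne $s0, $zero, loop     fall through branch
j done                   jump to label done
nop                      program complete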

Problem #2 Total of 20 points.
Instruction Level Parallelism involves many techniques for getting as much work done at the same time as possible. Reduced Instruction Set Computers (RISC), such as the MIPS32, embody a philosophy of building an Instruction Set Architecture (ISA) that is as simple as possible, making it feasible to implement these parallelism techniques in either the hardware or the software of the time period.

Hazards cause pipeline stalls and prevent issuing more than one instruction at the same time (aka superscalar execution).

Static scheduling techniques by compilers can be used to overcome some hazards.

10 points. Consider the following instruction sequence to add all elements in an array.

loop: lw $t2, 0($a3)
      add $s3, $s3, $t2
      addi $a3, $a3, 4
      bne $a3, $a1, loop

The lw instruction takes 2 cycles to get its result, all other instructions take a single cycle. This instruction sequence as written takes 5 cycles per array element. Rewrite this using loop unrolling to take just 3 cycles per array element.
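One possible answer, unrolling by a factor of two so that the second load fills the first load's delay cycle (this assumes the array length is a multiple of two elements):

loop: lw   $t2, 0($a3)      # load element i; result ready in 2 cycles
      lw   $t3, 4($a3)      # load element i+1 while the first load finishes
      add  $s3, $s3, $t2    # $t2 is available by now, so no stall
      add  $s3, $s3, $t3
      addi $a3, $a3, 8      # advance past both elements
      bne  $a3, $a1, loop   # 6 cycles for 2 elements = 3 cycles/element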

Problem #2 Continued.

Branch prediction can be used to enable runtime speculation, letting the hardware execute past predicted branches and effectively assemble deeper, dynamically unrolled instruction sequences.

Dynamic scheduling is a collection of techniques for flexibly allocating processor resources to overcome hazards. These approaches have enabled Complex Instruction Set Computers (CISC) such as the Intel x86 family to continue improving their performance, albeit at a cost in hardware and power. In power-limited environments (aka your phone), RISC processors such as the ARM64 dominate the market.
10 points. Why does a dynamic out-of-order processor need to use register renaming? Provide an example where register renaming enables more out of order parallelism.
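A hypothetical illustration (the register choices are made up for the example):

lw   $t0, 0($t1)      # suppose this load misses in the cache (long latency)
add  $t3, $t0, $t4    # RAW on $t0: must wait for the load to return
sub  $t4, $t5, $t6    # WAR on $t4 with the add above; operands are ready now

Without renaming, the sub cannot safely write $t4 until the add has read it, so the machine sits idle behind the cache miss. With register renaming, the sub's destination $t4 is mapped to a fresh physical register, removing the false WAR dependence and letting the sub (and later instructions that use the new $t4) execute out of order while the load is outstanding. Renaming similarly removes WAW hazards between two writers of the same architectural register.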

Problem #3 Total of 15 points.
Data Level Parallelism includes a number of methods for acting on large sets of data in the same way. We discussed Vector processors, SIMD extensions for traditional processors, and GPUs.

Vector processors group instructions into convoys to improve parallelism, in a way that is analogous to how superscalar traditional processors issue multiple instructions at the same time.

5 points. For the following stream of vector operations, group the instructions into convoys.

L.D F2, 0(Rx)
MUL.D F2, F2, F0
L.D F4, 0(Ry)
ADD.D F4, F4, F2
S.D F4, 8(Ry)
DSUBU R20, R4, Rx

Convoy 1: L.D F2, 0(Rx) and MUL.D F2, F2, F0 (assuming chaining, the multiply chains off the load)

Convoy 2: L.D F4, 0(Ry) and ADD.D F4, F4, F2 (the add chains off the load; this convoy must follow Convoy 1, which produces F2)

Convoy 3: S.D F4, 8(Ry) (the store needs the ADD.D result; DSUBU is scalar loop bookkeeping and runs on the scalar unit, outside the vector convoys)

Problem #3 Continued.
Vector and GPU processors have unique methods for moving the large amounts of data in various data structures to the processing elements.

5 pts. Describe how stride is used and what kinds of problems it solves.

Stride lets a vector load or store access memory locations that are a fixed, non-unit distance apart and gather them into a dense vector register. It solves problems with vector loads and stores when elements are not adjacent in memory, such as accessing a column of a row-major multidimensional array.
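For example, in a VMIPS-style vector ISA (the matrix size here is made up; LVWS is the textbook's load-vector-with-stride instruction):

# Load column j of a 100x100 double precision matrix A stored in
# row-major order. Consecutive column elements are 100 * 8 = 800
# bytes apart, so one strided load packs them into a dense register.
li   R2, 800          # stride in bytes between column elements
LVWS V1, (R1, R2)     # V1[i] = Mem[R1 + i * 800], where R1 = &A[0][j]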

5 pts. Describe what scatter-gather memory operations are, and how they are used in GPUs.

Scatter-gather handles sparse matrices: it lets us convert between a compressed representation and a sparse one. A gather uses an index vector to collect the sparse elements and pack them into a vector register, and a scatter spreads the results back out to memory as a sparse matrix. In GPUs, all GPU loads are gather instructions and all GPU stores are scatter instructions.
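A VMIPS-style sketch of the pattern for updating A[K[i]] (Ra is the base address of the sparse array A, Rk the base of the index vector K; mnemonics follow the textbook's vector ISA):

LV      Vk, Rk        # load the index vector K
LVI     Va, (Ra+Vk)   # gather: Va[i] = A[K[i]], packed into a dense register
MULVS.D Va, Va, F0    # operate on the packed values at full vector speed
SVI     (Ra+Vk), Va   # scatter: A[K[i]] = Va[i], spread back out to memory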
Problem #4 Total of 20 points.

Arithmetic intensity is a method for quantifying the ratio of processing to memory bandwidth that either a particular computer is capable of or a particular algorithm requires. For example, if a particular supercomputer were being designed primarily to run a specific algorithm (a kernel of computation), the arithmetic intensity of the algorithm would serve as a requirement for the supercomputer itself: you want to tailor the capabilities of the supercomputer so that it is not unbalanced for the application, i.e., not always waiting on data from memory, but operating efficiently.

For the following code, assume the vectors are single precision floating point.

for (i = 0; i < 1000; i++) {
    Res_a[i] = pos_x[i] * vel_x[i] + off_x[i] * pos_x[i];
    Res_b[i] = pos_y[i] * vel_y[i] + off_y[i] * pos_y[i];
    Res_c[i] = pos_z[i] * vel_z[i] + off_z[i] * pos_z[i];
}

3 points. How many floating point operations per loop?

3 points. How many bytes are transferred to and from memory per loop?

4 points. What is the arithmetic intensity of this kernel?

Problem #4 Continued.

5 points. Given the following roofline model for a supercomputer, would this kernel be memory bound, CPU bound, or balanced?

GPUs are collections of SIMD cores that each run the identical instruction at the same time, with many high speed interfaces (lanes) to memory. They take advantage of predicate masks to handle per-core variations (divergences) in data (sparse matrices) and control flow (branches).

We are given an NVIDIA GTX260 GPU with 192 1.2 GHz SIMD processors that can each produce 32 results every 4 cycles. Assume that 75% of instructions are single precision arithmetic, and that the average CPI is 1.2 cycles/instruction. Our workload has divergent branches that limit utilization to 85% of cores on average.

5 points. Compute the throughput in GFLOPs/sec for this workload on this GPU. Please show all work.
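A worked sketch of these calculations, reading "per loop" as per loop iteration and assuming the GPU derating factors combine multiplicatively:

FLOPs: each statement performs 2 multiplies and 1 add, so 3 statements × 3 FLOPs = 9 FLOPs per iteration (9,000 for the full loop).

Bytes: each statement reads three single precision arrays (the repeated pos read can be kept in a register) and writes one result, so 4 accesses × 4 bytes = 16 bytes per statement, or 48 bytes per iteration (48,000 total).

Arithmetic intensity: 9 FLOPs / 48 bytes = 0.1875 FLOPs/byte.

GPU throughput: peak = 192 processors × (32 results / 4 cycles) × 1.2 GHz = 1,843.2 GFLOPs/sec. Derated for the workload: 1,843.2 × 0.75 (FP fraction) × 0.85 (utilization) / 1.2 (CPI) ≈ 979 GFLOPs/sec.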
Problem #5 Total of 10 points.

Symmetric Multiprocessors with shared memory have become common today with our multicore chips. Each core has its own L1 and L2 cache, and typically they share an L3 cache and main memory. Ensuring that each processor has a coherent and consistent view of the memory hierarchy is important for deterministic program execution, particularly in multi-threaded applications.

Single core processors have compulsory (data being accessed for the first time) and conflict (data was in the cache and replaced by something else) cache misses.

2 points. Name two other types of cache misses that symmetric multicore shared memory systems with write invalidate logic introduce, and describe when they occur.

4 points. For each of those types, list some approaches to reduce the number of misses.

4 points. Write updates on a broadcast bus limit the speed at which the system can operate. Using an exclusive state in a snoopy bus protocol is one approach to increasing performance, while using a directory based implementation is another. Compare the two approaches.

Problem #6 Total of 20 points.

Warehouse Sized Computers (aka clouds) are large collections of commodity computers that can throw thousands of CPUs at a problem, and then shift the CPUs to other tasks as they complete. Getting data to and from the distributed computers effectively adds additional layers to the memory hierarchy, with steps at the local disk, then a disk in the same rack, then a disk in a different rack, and ultimately a disk in a different data center.

There are many financial considerations when designing a datacenter, and tradeoffs can be explored through the Total Cost of Ownership model.

6 points. A WSC has 45,000 servers that cost $1400 each and consume 200W apiece, which costs $400,000 in monthly electricity bills. The servers can be placed in a medium power mode which results in 75% of the performance with 60% of the power used. If I run the servers in medium mode, and get additional servers to have the equivalent performance, what are the changes in CAPEX for server hardware and OPEX for electricity bills?

In order to have a high amount of request level parallelism, algorithms must be designed to provide independent workloads that can be highly distributed and tolerant of failure, redundancy, and varying latency.

The map reduce paradigm is a very powerful and popular method for providing a large amount of request level parallelism. A 400 GB dataset must be processed by a map reduce job. Assume this dataset is located in another rack initially. The BW for copying data from another rack is 10 MB/sec. The Map rate is 15 sec/GB, and the reduce rate is 20 sec/GB. Assume the reduce will take place on the same nodes as the map, the intermediate data is the same size as the initial data, and the intermediate data can be accessed at 200 MB/sec.

Problem #6 Continued.

3 points. How long does this job take with 10 nodes?

3 points. How long with 40 nodes?

8 points. Why is the Raspberry Pi not vulnerable to the Spectre and Meltdown vulnerabilities?
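A worked sketch, assuming the 10 MB/sec rack-copy bandwidth is per node and the data divides evenly across nodes:

CAPEX: equivalent performance at 75% per server requires 45,000 / 0.75 = 60,000 servers, i.e., 15,000 additional servers. Change in CAPEX = 15,000 × $1,400 = +$21,000,000 (from $63.0M to $84.0M).

OPEX: each server now draws 60% of 200W = 120W, so total power is 60,000 × 120W = 7.2 MW versus the original 45,000 × 200W = 9.0 MW, or 80% of the original bill: $400,000 × 0.80 = $320,000/month, a savings of $80,000/month.

10 nodes: 40 GB per node. Copy: 40,000 MB / 10 MB/sec = 4,000 sec; map: 40 GB × 15 sec/GB = 600 sec; intermediate access: 40,000 MB / 200 MB/sec = 200 sec; reduce: 40 GB × 20 sec/GB = 800 sec. Total ≈ 5,600 sec.

40 nodes: 10 GB per node, so 1,000 + 150 + 50 + 200 ≈ 1,400 sec.

Raspberry Pi: its ARM cores (e.g., the Cortex-A53) execute in order and do not speculate, so the speculative-execution side channels that Spectre and Meltdown exploit are not present.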