UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering
EEC 170: Introduction to Computer Architecture
Winter 2022
Due: 11:59pm Friday, March 11th
Refer to "Getting Started with RARS" for tips and suggestions. This project has 3 parts:
1. Write two functions to multiply C = A x B + C, where A, B, and C are all 64×64 matrices of 32-bit unsigned integers. The first function is unblocked and computes the output C in strict row-major order. The second function is blocked, with a blocking factor of your choice. Multiplying two matrices together is called "GEMM" (GEneral Matrix Multiply).
2. Analyze the instruction trace from your runs.
3. Using the memory trace from your runs, design, implement, and analyze at least two 8 kB data caches that would accelerate running your 2 (or more) GEMM routines. One cache must be direct-mapped and the second cache must be n-way set associative, where n >= 2.
For steps 2 and 3, you can use any tool/language that you choose. Your instructor gave you simple C++ code that should be a helpful scaffold, but if you want to do your instruction and memory trace analysis in plain C or Python or Matlab or any other environment where you feel comfortable, that's perfectly fine.
Every place in this assignment that we use the term “kB”, we mean “kibibytes” (a power of two) not “kilobytes” (a power of ten). 4 kB, for instance, is 4096 bytes.
1. WRITE TWO GEMM ROUTINES
You will write TWO GEMM functions named "naive_gemm" and "blocked_gemm", each of which will take three arguments in a0, a1, and a2 (pointers to the first elements of A, B, and C, each stored in row-major order). These functions must be written inside the file "gemm.asm". Assume the memory at these locations has already been allocated and initialized to 0s (because it will be). You should not need additional memory or stack space to write these routines.
Your routines will each compute A x B + C and store the output in C. The "+C" makes the problem easier by letting you just add your results to the existing values in C's memory space. Both functions should have the same output. The unsigned integer inputs will never have their high bit set (bit 31 will always be 0). You can assume that no results at any point in this project will overflow, and you may use any of the 'a', 's', and 't' registers. A file named "main.asm" is provided for you to test each of your GEMM routines to see if they give a correct output. Do not use the label "main" in any of your routines, or any of the other labels that are used in "main.asm" and "file_read.asm". Follow RISC-V calling conventions: save the return address at the beginning of each GEMM routine and restore it at the end.
A, B, and C are stored in row-major order, meaning each matrix will occupy 64×64=4096 words (16384 bytes) of space in memory, with the first row of each matrix stored in the first 64 words of the memory allocated for each matrix, the second row in the next 64 words, and so on.
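The row-major layout above implies a simple address calculation for element (i, j). A minimal C sketch (the base address below is whatever RARS actually allocates for your matrix; the hex value in the comment is only illustrative):

```c
#include <stdint.h>

/* Byte offset of element (row, col) in a 64x64 row-major matrix of
   32-bit words: each row occupies 64 words = 256 bytes.
   Example: row 1, col 0 of a matrix at base b lives at b + 256. */
uint32_t elem_addr(uint32_t base, int row, int col) {
    return base + (uint32_t)(row * 64 + col) * 4;
}
```

In your assembly this becomes a shift-and-add: (row << 6 | col) << 2, added to the base pointer.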
1. Your first routine will compute the output matrix C in row-major order, first computing the first word in C (0,0), then the next word in C (0,1), up to (0,63); then (1,0) and so on.
2. Your second routine will traverse C in a blocked order. With a blocked approach, you choose a square region of C smaller than the entire matrix (this is called a tile) and compute it completely before moving on. For instance, you might divide C into a 2×2 grid of tiles and first compute the top-left tile (of dimension 32×32), then the top-right tile (also of dimension 32×32), and so on. Or you might divide C into a 4×4 grid of tiles, so each tile is 16×16. You will find ample detail on the Internet if you search for "GEMM blocking".
For the second implementation, you can choose the blocking factor. While it is possible to choose non-square tile sizes, we strongly recommend you look only at square tiles. Drawing a diagram of the matrices with the tile boundaries traced inside them can make this implementation much easier to program.
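The two traversal orders can be sketched in C before committing to assembly. This is a reference sketch only, not the required implementation, and BS = 16 is just one example blocking factor (any power-of-two divisor of 64 works):

```c
#include <stdint.h>

#define N  64
#define BS 16   /* example blocking factor; pick your own */

/* Unblocked: produces C in strict row-major order, (0,0), (0,1), ... */
void naive_gemm(uint32_t A[N][N], uint32_t B[N][N], uint32_t C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Blocked: produces C one BSxBS tile at a time; the three outer loops
   step over tiles, the three inner loops work within one tile. */
void blocked_gemm(uint32_t A[N][N], uint32_t B[N][N], uint32_t C[N][N]) {
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        for (int k = kk; k < kk + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Because the "+C" accumulation is order-independent (no overflow is possible per the assumptions above), both routines produce identical output.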
You may use a compiler or other tools to help you write your functions, but the instructional staff suggests you will get the most out of this assignment if you write the functions in assembly yourself (or use a compiler to help you optimize assembly that you wrote).
2. GENERATE AN INSTRUCTION/MEMORY TRACE
You will do this for both of your GEMM implementations.
Your instructor wrote RARS's trace generator in Java. Please use the most recent version of RARS. When you run a program in (the latest version of) RARS, you can generate a trace by doing the following:
1. Open your source file in RARS.
2. Tools menu, Instruction/Memory Dump.
3. Change filename to a filename of your choice.
4. Click button: Connect to Program
5. Run, Assemble.
6. Run, Go.
7. Go back to the Instruction/Memory Dump window and click "Dump Log". This saves the dump to the file you specified in step 3.
These steps are pretty brittle (i.e., do them in this exact order) because your instructor doesn't know how to use Swing very well, but they seem to work well in practice. If you have suggestions, the instructional staff is happy to hear them.
The file you generate has one line per datum. The four kinds of data you will see in the trace are:
'I': The address of an access into instruction memory
'i': A 32-bit RISC-V instruction (the trace dumps the address first, then the instruction)
'L': The address of a load from data memory
'S': The address of a store to data memory (we don't think you need the contents of loads/stores for this project, so they aren't in the trace)
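A trace analyzer in C can peel these records apart line by line. The sketch below assumes each line is a one-character tag followed by a hexadecimal value; check a few lines of your own dump and adjust the format string to whatever RARS actually emits:

```c
#include <stdio.h>

/* Parse one trace line of the assumed form "<tag> <hex>", e.g.
   "L 10010000". Returns 1 on success, 0 on a malformed line.
   The leading space in the format skips any whitespace/newlines. */
int parse_trace_line(const char *line, char *tag, unsigned *value) {
    return sscanf(line, " %c %x", tag, value) == 2;
}
```

In your main loop you would read lines with fgets and dispatch on the tag: 'I'/'i' pairs feed the instruction analysis of part 3, while 'L'/'S' addresses feed the cache simulator of part 4.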
Your instructor put a C++ file that you can use as a scaffold for your analysis, as well as a sample log file, in the Project 3 directory in Canvas.
3. ANALYZE THE INSTRUCTION TRACE
For this section you only need to look at the 'I' and 'i' entries in your trace, and you need only report results for your unblocked GEMM implementation. In your report, answer the following questions.
1. What is the instruction mix your GEMM function actually executes: what fraction of instructions are loads, stores, branches, arithmetic operations, etc.? Use your judgment on how best to report this mix.
2. Answer one of the following 3 questions (specify which question you are answering).
a. List the top 5 most frequent dependent instruction pairs that executed in your trace.
By "dependent instruction pair", we mean a pair of instructions that run in sequence where the first instruction produces a register value that is read by the second. You can ignore register numbers for this problem except to tell whether the second instruction depends on the first (so your answer should be 5 pairs like "lw-add", not "lw x2, ... - add x3, x2, x1").
b. Given our 5-stage RISC-V pipeline, what fraction of instructions executed have at least one register argument that must be forwarded (for both implementations)?
c. How effective is static branch prediction? That is, what percentage of the time will a static branch prediction algorithm (forward branches are not taken; backward branches are taken) make the right prediction?
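Options (a) and (b) both require the register fields of each raw instruction word. In the RV32I base encoding these sit at fixed bit positions (rd in bits 11:7, rs1 in 19:15, rs2 in 24:20), so extraction is a shift and mask. A small C sketch; note that not every field is meaningful for every opcode, so check the opcode before trusting rd or rs2:

```c
#include <stdint.h>

/* RV32I base-encoding field extraction. Field positions are shared
   across R/I/S/B-type formats; which fields exist depends on the
   opcode (e.g. S/B-type have no rd, I-type has no rs2). */
static inline uint32_t rv_opcode(uint32_t insn) { return insn & 0x7f; }
static inline uint32_t rv_rd(uint32_t insn)     { return (insn >> 7)  & 0x1f; }
static inline uint32_t rv_rs1(uint32_t insn)    { return (insn >> 15) & 0x1f; }
static inline uint32_t rv_rs2(uint32_t insn)    { return (insn >> 20) & 0x1f; }
```

To find dependent pairs, compare the previous instruction's rd against the current instruction's rs1/rs2 (and skip pairs where rd is x0, which is hard-wired to zero and creates no dependence).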
4. DESIGN A CACHE
You will design two 8 kB data caches, implement them (in C), and analyze their performance using the traces you produce. The most straightforward approach is to allocate a data structure in your trace analyzer that holds exactly the same state as the cache diagrams we showed in class (for example, "Example: Intrinsity FastMath" in the lecture memory2.pdf). For that example, you would keep track of a valid bit and a tag for each entry in the cache. You do NOT have to keep track of the data itself, because we don't need it for this assignment; think through why we say this so you understand the design. You are only modeling data accesses (loads/stores), not instruction accesses.
In your simulator, every time you see a data memory access, you will check the cache to see whether it is a hit or a miss, record whatever data you want for your analysis, and update the cache. The size of your cache is fixed (8 kB). You must implement at least two caches, one direct-mapped and one n-way set-associative with n >= 2. You will want to study different cache line sizes. You may also want to vary the replacement policy and the write policy (write-through vs. write-back).
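As a starting point, here is a minimal sketch of the direct-mapped case, assuming 32-byte lines (one of the line sizes you might try; 8192 / 32 = 256 lines). Per the note above, only the valid bit and tag are tracked, since hit/miss classification never needs the data:

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32                    /* example line size */
#define NUM_LINES  (8192 / LINE_BYTES)   /* 8 kB total -> 256 lines */

typedef struct {
    int      valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
} Cache;

void cache_init(Cache *c) { memset(c, 0, sizeof *c); }

/* Returns 1 on hit, 0 on miss; on a miss the line is filled,
   evicting whatever previously mapped to that index. */
int cache_access(Cache *c, uint32_t addr) {
    uint32_t line  = addr / LINE_BYTES;   /* strip the block offset */
    uint32_t index = line % NUM_LINES;    /* which cache line       */
    uint32_t tag   = line / NUM_LINES;    /* remaining high bits    */
    if (c->valid[index] && c->tag[index] == tag)
        return 1;
    c->valid[index] = 1;
    c->tag[index]   = tag;
    return 0;
}
```

The set-associative version replaces each single (valid, tag) entry with n ways per index plus replacement-policy state (e.g. LRU counters), and the index is computed modulo the number of sets rather than the number of lines.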
Your goal is to maximize system performance. In your report, quantify how much your cache helps system performance vs. having no cache. Maximizing system performance will primarily mean maximizing hit rate. Justify your design and your conclusions in your report succinctly but thoroughly. To measure system performance, assume the following:
– All non-memory instructions are free (take no time).
– Cache hits complete in a single cycle.
– Cache misses cost an additional 50 cycles (to go to memory).
- If you implement an associative cache, it affects the clock cycle of the machine. Any set-associative cache has a 10% clock-cycle penalty compared to a direct-mapped cache, and each additional doubling of set associativity adds another 2%. So a 2-way set-associative cache has a 10% penalty, a 4-way has a 12% penalty, an 8-way 14%, a 16-way 16%, etc.
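One way to combine these assumptions into a single figure of merit is shown below. It reads the associativity penalty as stretching the effective time of every cycle, which is our interpretation of "affects the clock cycle of the machine"; state your own interpretation in your report if it differs:

```c
#include <stdint.h>

/* Runtime model from the assumptions above: non-memory instructions
   are free, each data access costs 1 cycle, each miss costs 50 extra
   cycles, and associativity adds 10% (2-way) plus 2% per further
   doubling to the clock period. ways == 1 means direct-mapped. */
double runtime(uint64_t accesses, uint64_t misses, int ways) {
    double cycles  = (double)accesses + 50.0 * (double)misses;
    double penalty = 0.0;
    if (ways >= 2) {
        penalty = 0.10;                  /* base set-associative cost */
        for (int w = ways; w > 2; w /= 2)
            penalty += 0.02;             /* +2% per extra doubling    */
    }
    return cycles * (1.0 + penalty);
}
```

This makes the design trade-off concrete: a 2-way cache must cut enough misses that the 50-cycle savings outweigh a 10% tax on every access.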
5. EXTRA CREDIT
We will award a small amount of extra credit if you convince us that you have found the "optimal" design (the one that minimizes runtime), which means you choose both an optimal blocking factor and an optimal cache design. In practice, this means you will want to explore ~all possible blocking factors across ~all cache designs, which is not particularly difficult but will take some thought on your part about how to parameterize blocking factors and cache designs in your code.
WHAT TO SUBMIT
1. A PDF writeup (technical report) of no more than 2 pages that summarizes your findings. These findings involve the instruction-trace analysis of unblocked GEMM and 4 cache experiments: {unblocked, blocked} GEMM with a {direct-mapped, set-associative} cache.
2. Your source code "gemm.asm", containing your two GEMM implementations. If you re-download the project from Canvas and drop this file in, "main.asm" should run with no issues.
3. Your C source code that you used to write your direct-mapped and set-associative caches.