ECE 463 Project 1
NC State University
Department of Electrical and Computer Engineering
ECE 463/563
Project #1: Cache Design, Memory Hierarchy Design (Version 3.0) Due: Fri., Oct. 14, 11:59 PM
1. Preliminary Information
1.1. Academic integrity
• Academic integrity:
o Source code: Each student must design and write their source code alone. They must do
this (design and write their source code) without the assistance of any other person in ECE 463/563 or not in ECE 463/563. They must do this (design and write their source code) without searching the web for past semesters’ projects and without searching the web for source code with similar goals (e.g., modeling computer architecture components such as caches, predictors, pipelines, etc.), which is strictly forbidden. They must do this (design and write their source code) without looking at anyone else’s source code, without obtaining electronic or printed copies of anyone else’s source code, etc.
o Explicit debugging: With respect to “explicit debugging” as part of the coding process (e.g., using a debugger, inspecting code for bugs, inserting prints in the code, iteratively applying fixes, etc.), each student must explicitly debug their code without the assistance of any other person, whether in ECE 463/563 or not.
o Report: Each student must run their own experiments, collect and process their own data, and write their own report. Plagiarism (lifting text, graphs, illustrations, etc., from someone else, whether one adjusts the originals or not) is prohibited. Using someone else’s data (in raw form and/or graph form) is prohibited. Fabricating data is prohibited.
o Sanctions: The sanctions for violating the academic integrity policy are (1) a score of 0 on the project and (2) academic integrity probation for a first-time offense or suspension for a subsequent offense (the latter sanctions are administered by the Office of Student Conduct). Note that, when it comes to academic integrity violations, both the giver and receiver of aid are responsible and both are sanctioned. Please see the following RAIV form which has links to various policies and procedures and gives a sense of how academic integrity violations are confronted, documented, and sanctioned: RAIV form.
o Enforcement: The TAs will scan source code (from current and past semesters) through tools available to us for detecting cheating. The outputs from these tools, combined with in-depth manual analysis of these outputs, will be the basis for investigating suspected academic integrity violations. TAs will identify suspected plagiarism and/or data fabrication and these cases will be investigated.
• Reasonable assistance: If a student has any doubts or questions, or if a student is stumped by a bug, the student is encouraged to seek assistance using both of the following channels.
o Students may receive assistance from the TAs and instructor.
o Students are encouraged to post their doubts, questions, and obstacles on the Moodle message board for this project. The instructor and TA will moderate the message board to ensure that reasonable assistance is provided to the student. Other students are encouraged to contribute answers so long as no source code is posted.
* An example of reasonable assistance via the message board: Student A: “I’m encountering the following problem: I’m getting fewer writebacks from L2 to main memory than the validation runs. Has anyone else encountered something like this?” Student B: “Yes, I encountered something similar, and you might want to take a look at how you are doing XYZ because the problem has to do with such-and-such.”
* Another example of a reasonable exchange: Student A: “I’m unsure how to split the address into tag and index, and how to discard the block offset bits. I’ve successfully computed the # bits for each but I am stuck on how to manipulate the address in C/C++. Do you have any advice?” Instructor/TA/Student B: “We suggest you use an unsigned integer type for the address and use bitwise manipulation, such as ANDs (&), ORs (|), and left/right shifts (<<, >>), to extract values from the unsigned integer, the same as one would do in Verilog.” (A sketch of this technique appears after this list.)
* Another example of a reasonable exchange: Student A: “I’m unsure how to dynamically allocate memory, such as dynamically allocating a 1D array of structs/classes or (more appropriately for a cache) a 2D array of structs/classes. Can you point me to some references on this?” Instructor/TA/Student B: “Sure, here is a web site or reference book that discusses dynamic memory allocation including 1D and 2D arrays.”
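To make the two exchanges above concrete, here is a minimal C++ sketch combining both techniques: extracting the tag and index with bitwise operations, and dynamically allocating a 2D array of block structs. All names (CacheBlock, blockOffsetBits, numSets, etc.) and the example parameters are illustrative assumptions, not requirements of this specification.

    #include <cstdint>
    #include <vector>

    // Illustrative per-block state; the field names are assumptions.
    struct CacheBlock {
        bool     valid = false;
        bool     dirty = false;
        uint32_t tag   = 0;
    };

    int main() {
        // Extracting index and tag with bitwise operations. Example geometry
        // (an assumption): BLOCKSIZE = 32 (5 offset bits), 64 sets (6 index bits).
        uint32_t addr            = 0x400341A0;  // example address
        unsigned blockOffsetBits = 5;
        unsigned indexBits       = 6;
        uint32_t index = (addr >> blockOffsetBits) & ((1u << indexBits) - 1);
        uint32_t tag   = addr >> (blockOffsetBits + indexBits);

        // Dynamically allocating a 2D array (sets x ways) of block structs.
        unsigned numSets = 64, assoc = 4;
        std::vector<std::vector<CacheBlock>> cache(
            numSets, std::vector<CacheBlock>(assoc));

        (void)index; (void)tag; (void)cache;  // silence unused warnings
        return 0;
    }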
• The intent of the academic integrity policy is to ensure students code and explicitly debug their code by themselves. It is NOT our intent to stifle robust, interesting, and insightful discussion and Q&A that is helpful for students (and the instructor and TA) to learn together. We also would like to help students get past bugs by offering advice on where they may be doing things incorrectly, or where they are making incorrect assumptions, etc., from an academic and conceptual standpoint.
1.2. Reduced project scope for ECE 463 students
The project scope is reduced but still substantial for ECE 463 students, as detailed in this specification.
1.3. Programming languages for this project
You must implement your project using the C, C++, or Java languages, for two reasons. First, these languages are preferred for computer architecture performance modeling. Second, our Gradescope autograder only supports compilation of these languages.
1.4. Responsibility for self-grading your project via Gradescope
You will submit, validate, and SELF-GRADE your project via Gradescope; the TAs will only manually grade the report. While you are developing your simulator, you are required to frequently check via Gradescope that your code compiles, runs, and gives expected outputs with respect to your current progress. This is necessary to resolve porting issues in a timely fashion (i.e., well before the deadline), caused by different compiler versions in your programming environment and the Gradescope backend. This is also necessary to resolve non-compliance issues (i.e., how you specify the simulator’s command-line arguments, how you format the simulator’s outputs, etc.) in a timely fashion (i.e., well before the deadline).
2. Project Description
In this project, you will implement a flexible cache and memory hierarchy simulator and use it to compare the performance, area, and energy of different memory hierarchy configurations, using a subset of the SPEC 2006 benchmark suite, SPEC 2017 benchmark suite, and/or microbenchmarks.
3. Specification of Memory Hierarchy
Design a generic cache module that can be used at any level in a memory hierarchy. For example, this cache module can be “instantiated” as an L1 cache, an L2 cache, an L3 cache, and so on. Since it can be used at any level of the memory hierarchy, it will be referred to generically as CACHE throughout this specification.
3.1. Configurable parameters
CACHE should be configurable in terms of supporting any cache size, associativity, and block size, specified at the beginning of simulation:
o SIZE: Total bytes of data storage.
o ASSOC: The associativity of the cache. (ASSOC = 1 is a direct-mapped cache. ASSOC = # blocks in the cache = SIZE/BLOCKSIZE is a fully-associative cache.)
o BLOCKSIZE: The number of bytes in a block.
There are a few constraints on the above parameters: 1) BLOCKSIZE is a power of two and 2) the number of sets is a power of two. Note that ASSOC (and, therefore, SIZE) need not be a power of two. As you know, the number of sets is determined by the following equation:
# sets = SIZE / (ASSOC × BLOCKSIZE)
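As a hedged illustration of this equation and the constraints above, the set count and power-of-two checks might be computed as follows (the parameter values are arbitrary examples):

    #include <cassert>
    #include <cstdio>

    // Returns true if x is a nonzero power of two.
    static bool isPowerOfTwo(unsigned x) {
        return x != 0 && (x & (x - 1)) == 0;
    }

    int main() {
        unsigned SIZE = 8192, ASSOC = 4, BLOCKSIZE = 32;  // example values
        unsigned numSets = SIZE / (ASSOC * BLOCKSIZE);

        // Constraints from Section 3.1: BLOCKSIZE and # sets must be powers
        // of two; ASSOC (and therefore SIZE) need not be.
        assert(isPowerOfTwo(BLOCKSIZE));
        assert(isPowerOfTwo(numSets));
        printf("# sets = %u\n", numSets);  // 8192 / (4 * 32) = 64
        return 0;
    }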
3.2. Replacement policy
CACHE should use the LRU (least-recently-used) replacement policy.
3.3. Write policy
CACHE should use the WBWA (write-back + write-allocate) write policy.
o Write-allocate: A write that misses in CACHE will cause a block to be allocated in CACHE. Therefore, both write misses and read misses cause blocks to be allocated in CACHE.
o Write-back: A write updates the corresponding block in CACHE, making the block dirty.
It does not update the next level in the memory hierarchy (next level of cache or memory). If a dirty block is evicted from CACHE, a writeback (i.e., a write of the entire block) will be sent to the next level in the memory hierarchy.
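For illustration only, the write-back half of WBWA amounts to setting a dirty bit instead of writing through. The following minimal sketch assumes an illustrative Block struct and is not a prescribed interface.

    // Minimal sketch of WBWA hit behavior; names are illustrative assumptions.
    struct Block { bool valid = false; bool dirty = false;
                   unsigned tag = 0; unsigned lru = 0; };

    void onHit(Block &b, bool isWrite) {
        if (isWrite)
            b.dirty = true;  // write-back: defer the update to the next level
        // A read hit leaves the dirty bit unchanged; either way, the LRU
        // state is updated as described in Section 3.5.
    }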
3.4. Allocating a block: Sending requests to next level in the memory hierarchy
Your simulator must be capable of modeling one or more instances of CACHE to form an overall memory hierarchy, as shown in Figure 1.
CACHE receives a read or write request from whatever is above it in the memory hierarchy (either the CPU or another cache). The only situation where CACHE must interact with the next level below it (either another CACHE or main memory) is when the read or write request misses in CACHE. When the read or write request misses in CACHE, CACHE must “allocate” the requested block so that the read or write can be performed.
Thus, let us think in terms of allocating a requested block X in CACHE. The allocation of requested block X is actually a two-step process. The two steps must be performed in the following order.
1. Make space for the requested block X. If there is at least one invalid block in the set, then there is already space for the requested block X and no further action is required (go to step 2). On the other hand, if all blocks in the set are valid, then a victim block V must be singled out for eviction, according to the replacement policy (Section 3.2). If this victim block V is dirty, then a write of the victim block V must be issued to the next level of the memory hierarchy.
2. Bring in the requested block X. Issue a read of the requested block X to the next level of the memory hierarchy and put the requested block X in the appropriate place in the set (as per step 1).
To summarize, when allocating a block, CACHE issues a write request (only if there is a victim block and it is dirty) followed by a read request, both to the next level of the memory hierarchy. Note that each of these two requests could themselves miss in the next level of the memory hierarchy (if the next level is another CACHE), causing a cascade of requests in subsequent levels. Fortunately, you only need to correctly implement the two steps for an allocation locally within CACHE. If an allocation is correctly implemented locally (steps 1 and 2, above), the memory hierarchy as a whole will automatically handle cascaded requests globally.
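The two-step allocation can be expressed locally within CACHE as in the following hedged sketch. The struct layout, the LRU-counter convention, and the issueWriteToNextLevel/issueReadFromNextLevel placeholders are all assumptions for illustration, not a required design.

    #include <cstdint>
    #include <vector>

    // Hedged sketch of the two-step allocation; every name is illustrative.
    struct Block { bool valid = false; bool dirty = false;
                   uint32_t tag = 0; uint32_t lru = 0; };

    struct Cache {
        unsigned assoc, offsetBits, indexBits;
        std::vector<std::vector<Block>> blocks;  // [set][way]

        // Assumed placeholders for traffic to the next level of the hierarchy.
        void issueWriteToNextLevel(uint32_t /*addr*/) {}
        void issueReadFromNextLevel(uint32_t /*addr*/) {}

        void allocate(uint32_t addr) {
            unsigned set = (addr >> offsetBits) & ((1u << indexBits) - 1);
            std::vector<Block> &s = blocks[set];

            // Step 1: make space. Prefer an invalid way; otherwise evict the
            // LRU victim V, writing V back first if it is dirty.
            int way = -1;
            for (unsigned w = 0; w < assoc; ++w)
                if (!s[w].valid) { way = (int)w; break; }
            if (way < 0) {
                way = 0;  // assumed convention: largest lru counter = LRU
                for (unsigned w = 1; w < assoc; ++w)
                    if (s[w].lru > s[way].lru) way = (int)w;
                if (s[way].dirty) {
                    uint32_t victim = (s[way].tag << (offsetBits + indexBits))
                                    | (set << offsetBits);
                    issueWriteToNextLevel(victim);  // writeback of dirty victim
                }
            }

            // Step 2: bring in the requested block X.
            issueReadFromNextLevel(addr);
            s[way].valid = true;
            s[way].dirty = false;  // a subsequent write would set it
            s[way].tag   = addr >> (offsetBits + indexBits);
        }
    };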
3.5. Updating state
After servicing a read or write request, whether the corresponding block was in the cache already (hit) or had just been allocated (miss), remember to update other state. This state includes LRU counters affiliated with the set as well as the valid and dirty bits affiliated with the requested block.
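One common way to maintain the LRU counters is the counter-based scheme sketched below, reusing the illustrative Block struct from the allocation sketch above. The convention that 0 means most-recently-used is an assumption; other schemes (e.g., timestamps) work equally well.

    #include <vector>

    // Counter-based LRU update; Block is the same illustrative struct as in
    // the allocation sketch. Assumed convention: lru == 0 is most recently
    // used, and larger counters are older.
    struct Block { bool valid = false; bool dirty = false;
                   unsigned tag = 0; unsigned lru = 0; };

    void updateLRU(std::vector<Block> &set, unsigned way) {
        for (Block &b : set)
            if (b.valid && b.lru < set[way].lru)
                ++b.lru;          // age every block that was more recent
        set[way].lru = 0;         // the accessed block becomes MRU
    }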
[Figure 1 depicts a memory hierarchy: the CPU sends a read or write request to the first CACHE, each CACHE sends read or write requests to the CACHE below it, and the last CACHE sends read or write requests to Main Memory.]
Figure 1. Your simulator must be capable of modeling one or more instances of CACHE to form an overall memory hierarchy.
4. ECE 563 Students: Augment CACHE with Stream-Buffer Prefetching
Students enrolled in ECE 563 must additionally augment CACHE with a prefetch unit. The prefetch unit implements Stream Buffers.
In this project, consider the prefetch unit to be an extension implemented within CACHE. This preserves the clean abstraction of one or more instances of CACHE interacting in an overall memory hierarchy (see Figure 1), where each CACHE may have a prefetch unit within it.
4.1. CACHE should support a configurable prefetch unit
Your generic implementation of CACHE should support a configurable prefetch unit as follows. The prefetch unit has N Stream Buffers. Each Stream Buffer contains M memory blocks. Both N and M should be configurable. Setting N=0 disables the prefetch unit.
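One hedged way to represent this configurable prefetch unit is sketched below; the field names (valid, blockAddrs, lru) are assumptions, not a mandated layout.

    #include <cstdint>
    #include <vector>

    // Hedged sketch of the prefetch unit's state; field names are assumptions.
    struct StreamBuffer {
        bool valid = false;                // one valid bit for the whole buffer
        std::vector<uint32_t> blockAddrs;  // M consecutive block addresses
        unsigned lru = 0;                  // recency among the N buffers
    };

    struct PrefetchUnit {
        std::vector<StreamBuffer> buffers;  // N buffers of M blocks each
        PrefetchUnit(unsigned N, unsigned M) : buffers(N) {
            for (StreamBuffer &sb : buffers) sb.blockAddrs.resize(M);
        }
        bool enabled() const { return !buffers.empty(); }  // N = 0 disables it
    };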
4.2. Operation of a single Stream Buffer
A Stream Buffer is a simple queue that is capable of holding M consecutive memory blocks.
A Stream Buffer has a single valid bit that indicates the validity of the buffer as a whole. If its valid bit is 0, it means the Stream Buffer is empty and doesn’t contain a prefetch stream. If its valid bit is 1, it means the Stream Buffer is full and contains a prefetch stream (M consecutive memory blocks).
When CACHE receives a read or write request for block X, both CACHE and its Stream Buffer are checked for a hit. Note that all Stream Buffer entries, not just the first entry (as in the original Stream Buffer paper), are searched for block X. There are four possible scenarios:
1. Scenario #1 (create a new prefetch stream): Requested block X misses in CACHE and misses in Stream Buffer: Handle the miss in CACHE as usual (see Section 3.4). In addition to fetching the requested block X into CACHE, prefetch the next M consecutive memory blocks into the Stream Buffer. That is, prefetch memory blocks X+1, X+2, …, X+M into the Stream Buffer, thereby replacing the entire contents of the Stream Buffer. Note that prefetches are implemented by issuing read requests to the next level in the memory hierarchy.1 Also note that replacing the contents of the Stream Buffer does not involve any writebacks from the Stream Buffer: this will be explained in Section 4.4.
2. Scenario #2 (benefit from and continue a prefetch stream): Requested block X misses in CACHE and hits in the Stream Buffer: Perform an allocation in CACHE as follows. First, make space in CACHE for the requested block X (as described in Section 3.4). Second, instead of fetching the requested block X from the next level in the memory hierarchy, copy the requested block X from the Stream Buffer into CACHE (since the Stream Buffer contains the requested block X, in this scenario). Next, manage the Stream Buffer as illustrated in Figure 2. Notice in the “before” picture, the fourth entry of the Stream Buffer hit (it contained the requested block X). As shown in the “after” picture, all blocks before and including block X (X-3, X-2, X-1, X) are removed from the Stream Buffer, the blocks after block X (X+1, X+2) are “shifted up”, and the newly freed entries are refilled by prefetching the next consecutive blocks (issue prefetches of blocks X+3, X+4, X+5, X+6). A non-shifting circular buffer implementation, based on a head pointer that points to the least block address in the prefetch stream, is more efficient in real hardware and in software simulators, and is illustrated in Figure 3.
3. Scenario #3 (do nothing): Requested block X hits in CACHE and misses in the Stream Buffer: In this case, nothing happens with respect to the Stream Buffer.
4. Scenario #4 (continue prefetch stream to stay in sync with demand stream): Requested block X hits in CACHE and hits in the Stream Buffer: Manage the Stream Buffer identically to Scenario #2. The only difference is that the requested block X hit in CACHE, so there is no transfer from Stream Buffer to CACHE.
1 For accurate performance accounting using the Average Access Time (AAT) expression, you will need to convey to the next level in the memory hierarchy that these read requests are prefetches. This will enable the next level in the memory hierarchy to distinguish between 1) its read misses that originated from normal read requests versus 2) its read misses that originated from prefetch read requests. Note that this is only needed for accurate performance accounting.
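The four scenarios reduce to a two-by-two decision on (hit in CACHE) × (hit in Stream Buffer). The following hedged sketch shows only the dispatch; every helper function is an assumed placeholder, not a prescribed interface.

    #include <cstdint>

    // Assumed placeholder helpers; not a prescribed interface.
    void handleCacheMiss(uint32_t blockX);      // Section 3.4 miss handling
    void makeSpaceInCache(uint32_t blockX);     // step 1 of Section 3.4
    void copyFromStreamBuffer(uint32_t blockX); // transfer hit block to CACHE
    void createStream(uint32_t blockX);         // prefetch X+1 .. X+M
    void continueStream(uint32_t blockX);       // manage buffer as in Figure 2

    void onRequest(uint32_t blockX, bool hitInCache, bool hitInStreamBuffer) {
        if (!hitInCache && !hitInStreamBuffer) {          // Scenario #1
            handleCacheMiss(blockX);
            createStream(blockX);
        } else if (!hitInCache && hitInStreamBuffer) {    // Scenario #2
            makeSpaceInCache(blockX);
            copyFromStreamBuffer(blockX);   // no fetch from the next level
            continueStream(blockX);
        } else if (hitInCache && !hitInStreamBuffer) {    // Scenario #3
            // Do nothing with respect to the Stream Buffer.
        } else {                                          // Scenario #4
            continueStream(blockX);         // no transfer to CACHE needed
        }
    }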
Figure 2. Managing the Stream Buffer when there is a hit in the Stream Buffer (scenarios #2 and #4).
Figure 3. A non-shifting circular buffer implementation is more efficient in hardware and in software simulators.
4.3. Multiple Stream Buffers
The operation of a single Stream Buffer, described in the previous section, extends to multiple Stream Buffers. The main difference is that all Stream Buffers are checked for a hit.
For Scenario #1 (request misses in CACHE and misses in all Stream Buffers), one of the Stream Buffers must be chosen for the new prefetch stream: select the least-recently-used Stream Buffer, i.e., apply the LRU policy to the Stream Buffers as a whole. When a new stream is prefetched into a particular Stream Buffer (Scenario #1), or a particular Stream Buffer supplies a requested block to CACHE (Scenario #2), or we are keeping a Stream Buffer in sync (Scenario #4), that Stream Buffer becomes the most-recently-used buffer.
Policy for multiple Stream Buffer hits:
It is possible for two or more Stream Buffers to have some blocks in common (redundancy). For example, suppose all Stream Buffers are initially invalid and CACHE is empty; the CPU requests block X which creates the prefetch stream X+1 to X+6 in a first Stream Buffer (assume M=6); and then the CPU requests block X-2 which creates the prefetch stream X-1 to X+4 in a second Stream Buffer; thus, after these initial two misses, the Stream Buffers have X+1 to X+4 in common. Other scenarios create redundancy as well, such as one continuing prefetch stream reaching the start of another prefetch stream.
Redundancy means that a given request may hit in multiple Stream Buffers. Managing multiple Stream Buffers as in Figure 2, for the same hit, results in redundant prefetches because the multiple Stream Buffers will all try to continue their overlapping streams. A simple solution is to only consider the hit to the most-recently-used Stream Buffer among those that hit and ignore the other hits. From a simulator standpoint, this could mean (for example) searching Stream Buffers for a hit in recency order, and stopping at the first hit. Only that Stream Buffer is managed as shown in Figure 2, i.e., only that Stream Buffer continues its prefetch stream.
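One hedged way to implement “only consider the hit to the most-recently-used Stream Buffer” is to search the buffers in recency order and stop at the first hit, as sketched below (reusing the illustrative StreamBuffer struct from the Section 4.1 sketch; the recency convention is an assumption).

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Same illustrative StreamBuffer as in the Section 4.1 sketch; the
    // convention that a smaller lru value means more recently used is an
    // assumption.
    struct StreamBuffer {
        bool valid = false;
        std::vector<uint32_t> blockAddrs;
        unsigned lru = 0;
    };

    // Search buffers in recency order (MRU first) and stop at the first hit,
    // ignoring hits in less-recently-used buffers. Returns the buffer index
    // or -1 on a miss in all Stream Buffers.
    int findHitMRUFirst(std::vector<StreamBuffer> &buffers, uint32_t blockX) {
        std::vector<int> order(buffers.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return buffers[a].lru < buffers[b].lru;  // MRU first
        });
        for (int i : order) {
            StreamBuffer &sb = buffers[i];
            if (!sb.valid) continue;
            if (std::find(sb.blockAddrs.begin(), sb.blockAddrs.end(), blockX)
                    != sb.blockAddrs.end())
                return i;  // most-recently-used hit wins; ignore other hits
        }
        return -1;
    }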
4.4. Assume Stream Buffers are updated by writebacks with no effect on recency (no explicit modeling required in your simulator)
A Stream Buffer never contains dirty blocks, that is, it never contains a block whose content differs from the same block in the next level of the memory hierarchy. The benefit of this design is that replacing the contents of the Stream Buffer will never require writebacks from the Stream Buffer.
In this section, we discuss a Stream Buffer complication that we will handle conceptually. The problem and solution are only discussed out of academic interest. The solution does not require any explicit support in the simulator.
Consider that a dirty copy of block Y may exist in CACHE while a clean copy of block Y exists in a Stream Buffer. Here is a simple example of how we can get into this situation (assume M=6 for the example):
• Write request to block Y misses in CACHE. Block Y is allocated in CACHE, write is performed, and block Y is dirty in CACHE. Prefetch stream Y+1 to Y+6 is created in a first Stream Buffer, although this is not germane for this example.
• Then, a request to block Y-2 misses in CACHE. Prefetch stream Y-1 to Y+4 is created in a second Stream Buffer. Thus, at this point, CACHE has a dirty copy of block Y and the second Stream Buffer has a clean copy of block Y.
Now, suppose CACHE evicts its dirty copy of block Y (e.g., it is replaced by a missed block Z) before referencing it again (fyi: referencing it as a hit might wipe it from the Stream Buffer to keep the latter’s prefetch stream in sync with demand references, as per scenario #4). Stale block Y still exists in the Stream Buffer which could lead to incorrect operation in the future, namely, when the CPU requests block Y again and hits on the stale copy in the Stream Buffer.
We will assume a solution that does NOT require any code in your simulator. When a dirty block Y is evicted from CACHE (i.e., when there is a writeback), assume that any Stream Buffer containing a copy of block Y receives the written-back data, with no effect on that Stream Buffer’s recency. Consequently, Stream Buffers never hold stale blocks, and this behavior requires no explicit modeling in your simulator.