
Name:
ID#: X.500:
CS 2021: Practice Exam 3 SOLUTION
Fall 2021 University of Minnesota
Exam period: 20 minutes Points available: 40
Problem 1 (10 pts):
Below is an initial memory/cache configuration along with several memory load operations. Indicate whether these load operations result in cache hits or misses and show the state of the cache after these loads complete.

-------------------- SOLUTION --------------------

MAIN MEMORY
| Addr | Addr Bits      | Value |
|      | Tag Set Offset |       |
|------+----------------+-------|
|  10  | 000 10  000    |    10 |
|  14  | 000 10  100    |    11 |
|  18  | 000 11  000    |    12 |
|  1C  | 000 11  100    |    13 |
|  20  | 001 00  000    |    20 |
|  24  | 001 00  100    |    21 |
|  28  | 001 01  000    |    22 |
|  2C  | 001 01  100    |    23 |
|  30  | 001 10  000    |   100 |
|  34  | 001 10  100    |   101 |
|  38  | 001 11  000    |   102 |
|  3C  | 001 11  100    |   103 |
|  40  | 010 00  000    |   200 |
|  44  | 010 00  100    |   201 |
|  48  | 010 01  000    |   202 |
|  4C  | 010 01  100    |   203 |

DIRECT-MAPPED Cache, 8-byte lines
4 Sets, 8-bit Address = 3-bit tag, 2-bit set index, 3-bit block offset

INITIAL CACHE STATE
|     |   |     | Blocks/Line |
| Set | V | Tag | 0-3     4-7 |
|-----+---+-----+-------------|
| 00  | 1 | 010 | 200     201 |
| 01  | 1 | 001 |  22      23 |
| 10  | 1 | 000 |  10      11 |
| 11  | 0 |  -  |   -       - |

HITS OR MISSES?
| OPERATION    | HIT/MISS? |
|--------------+-----------|
| 1. Load 0x48 | Miss      |
| 2. Load 0x4C | Hit       |
| 3. Load 0x24 | Miss      |

FINAL CACHE STATE
|     |   |     | Blocks/Line |          |
| Set | V | Tag | 0-3     4-7 | Changed? |
|-----+---+-----+-------------+----------|
| 00  | 1 | 001 |  20      21 | ***      |
| 01  | 1 | 010 | 202     203 | ***      |
| 10  | 1 | 000 |  10      11 |          |
| 11  | 0 |  -  |   -       - |          |

Problem 2 (5 pts):
Pyra read in a free online blog post "Memory for Morons" that there is no need to invest much money in buying RAM. Instead, one can configure the operating system's virtual memory system to use disk space as main memory, leading to a much less expensive computer with a seemingly large memory. Pyra is quite excited about this as some programs she wants to execute fast need a lot of main memory, and it would be nice to save some cash. Advise her on any risks or performance drawbacks she may encounter using such an approach.

SOLUTION: Disks are many orders of magnitude slower than the DRAM that is typically used for main memory. Disks are near the bottom of the memory hierarchy, offering many gigabytes of storage per dollar at the expense of speed. If Pyra needs speed, she is better off investing in more DRAM, as her system will crawl if it attempts to use disk for main memory. In addition, she should consider getting a CPU with a large cache, which is more expensive than DRAM but even faster.

Problem 3 (15 pts): Nearby is the definition for base_scalevec() which scales a vector by multiplying each element by a number. Write an optimized version of this function in the space provided. Mention in comments why you performed certain transformations.

int vget(vector_t vec, int idx){
  return vec.data[idx];
}

void vset(vector_t vec, int idx, int x){
  vec.data[idx] = x;
}

void base_scalevec(vector_t *vec, int *scale){
  for(int i=0; i < vec->len; i++){
    int cur = vget(*vec,i);
    int new = cur * (*scale);
    vset(*vec,i,new);
  }
}

SOLUTION:

void opt_scalevec(vector_t *vec, int *scale){
  // locals to avoid repeated memory access
  int *data = vec->data, len = vec->len;
  int scal = (*scale), i;
  // unroll x2 with duplicate vars to enable pipelining
  for(i=0; i < len-2; i+=2){
    // no function calls - inline bodies to improve register use
    int cur0 = data[i+0];
    int new0 = cur0 * scal;
    data[i+0] = new0;
    int cur1 = data[i+1];
    int new1 = cur1 * scal;
    data[i+1] = new1;
  }
  // cleanup loop for any remaining elements
  for(; i < len; i++){
    data[i] = data[i] * scal;
  }
}

Problem 4 (10 pts): Examine the two functions below which add elements of a row or column vector to all corresponding rows or columns of a matrix. Consider the benchmark timing of these two provided.

1. Explain which of these two functions is faster and why.
2. Suggest a way to increase the speed of the slower function with only moderate changes to the code.

SOLUTION: The addrow() version is clearly faster than addcol() at all sizes, and this disparity increases as the sizes of the matrices go up. At the largest size, addrow() takes about 0.2 seconds while addcol() takes 1.5 seconds, roughly a seven-fold difference.

The reason is the layout of the matrix, which favors traversal of rows: each row is contiguous in memory, so loading an element brings nearby elements of the row into cache, speeding up their subsequent accesses. The column version jumps non-contiguously through memory, getting much less benefit from the cache.

Re-writing addcol() to move across rows instead would greatly improve its memory access pattern, leading to greater efficiency. This would involve inverting the inner and outer loops to for(i) / for(j) as in the row version. This, along with slight modifications to how elements are set, would yield speedups.