
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING UNIVERSITY OF BRITISH COLUMBIA
CPEN 211 Introduction to Microcomputers, Fall 2019
Lab 11: Caches, Performance Counters, and Floating-Point
The hand-in deadline is 9:59 PM the evening before your lab section during the week of Nov 25 to 29.
REMINDER: As outlined in the CPEN 211 Lab Academic Integrity Policy, until all students involved have a grade for Lab 10 you must NOT share or describe any code you write for this assignment with anyone except your authorized lab partner for Lab 10, NOT ask for or use any code offered to you by anyone other than your authorized lab partner, and NOT look at or use solution code from anywhere. If you are repeating CPEN 211 you may not reuse any of your code submitted in a prior year. Using a compiler to generate ARM from C code for this lab, or Lab 10 or 11, is considered cheating. Use of a compiler will be considered cheating regardless of whether the compiler-generated code, or even a portion of it, is used directly, adapted with changes, or a student just looks at the compiler-generated code to compare with their own. Promptly report cases of misconduct you have firsthand knowledge of to the instructor. Your partner is authorized to work with you for Lab 10 if https://cpen211.ece.ubc.ca/cwl/labpartners.php says they are your current lab partner at the time you start working together on Lab 10, up until you demo your code. The deadline to sign up or change lab partners using the above URL is 96 hours before your lab section. Your code will be checked for plagiarism using very effective plagiarism detection tools. As per UBC policy, all suspected cases of academic misconduct must be reported to the APSC Dean's office. Examples of outcomes for misconduct cases at UBC can be found online; see e.g.: https://universitycounsel.ubc.ca/files/2016/07/SD20142015.pdf
1 Introduction
The ARM processor in your DE1-SoC has the Cortex-A9 microarchitecture. The specifications for the Cortex-A9 include an 8-stage pipeline, an L1 instruction cache, a separate L1 data cache, and a unified L2 cache. In this lab we explore factors that impact program performance with a focus on the L1 data cache.
1.1 Caches inside the DE1-SoC
Both L1 caches hold 32 KB and are 4-way set associative with 32-byte blocks and pseudo-random replacement. Using the initialization code provided for this lab, addresses between 0x00000000 and 0x3FFFFFFF are configured to be cached. Addresses larger than 0x3FFFFFFF are configured to bypass the cache, meaning accesses to these addresses are not cached in the L1 or L2 caches.
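As a sanity check on these parameters, the short C sketch below (illustrative only, not part of the lab code) derives the number of blocks, sets, and address-field widths implied by a 32 KB, 4-way set associative cache with 32-byte blocks; the constants come from the paragraph above, everything else is just arithmetic.

#include <stdio.h>

/* Illustrative sketch only: derive the set count and address-field widths
   for the L1 data cache geometry described above (32 KB, 4-way, 32-byte blocks). */
int main(void) {
    const unsigned cache_bytes = 32 * 1024;   /* total capacity        */
    const unsigned block_bytes = 32;          /* bytes per cache block */
    const unsigned ways        = 4;           /* associativity         */

    unsigned blocks = cache_bytes / block_bytes;   /* 1024 blocks */
    unsigned sets   = blocks / ways;               /* 256 sets    */

    /* A 32-bit address therefore splits into 5 block-offset bits (32-byte blocks),
       8 index bits (256 sets) and 19 tag bits. */
    printf("blocks=%u sets=%u (offset=5 bits, index=8 bits, tag=19 bits)\n",
           blocks, sets);
    return 0;
}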
Why bypass a cache? One reason we wish accesses to certain addresses not to be cached is that these addresses may correspond to registers in I/O peripherals. For example, consider what would happen if instead, when software on the DE1-SoC reads SW_BASE at address 0xFF200040, the values it reads from the control register were allowed to be cached in the L1 data cache: the first LDR instruction to read from address 0xFF200040 would cause a cache block to be allocated in the L1 data cache. This cache block would contain the value of the switches at the time this first LDR instruction was executed. Now, if the settings of the switches change after the first LDR executes but that cache block remains in the cache, subsequent LDR instructions reading from address 0xFF200040 will read the old or stale value for the switch settings that is in the cache. Thus, it will seem to the software like the switches have not changed even though they have. Without an understanding of caches such behavior would be very surprising and hard to explain.
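To make the stale-value problem concrete, here is a small C sketch of a polling loop (the pointer declaration and function are illustrative, not code from this lab); if loads from 0xFF200040 were allowed to hit in the L1 data cache, every read inside the loop could return the originally cached value and the loop might never observe that the switches changed.

#include <stdint.h>

/* Illustrative only: poll the switches until their value changes.
   If reads of SW_BASE were cached, the loads below could keep returning the
   stale block allocated by the first read. Note that 'volatile' only stops
   the compiler from removing the repeated loads; it cannot bypass a hardware cache. */
volatile uint32_t *const SW_BASE = (volatile uint32_t *)0xFF200040;

uint32_t wait_for_switch_change(void) {
    uint32_t first = *SW_BASE;      /* first read: would allocate an L1 cache block   */
    while (*SW_BASE == first)       /* later reads: could hit the stale cached block  */
        ;                           /* spin until the switch settings appear to change */
    return *SW_BASE;
}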
In addition, the initialization code provided for this lab configures the L1 data cache so that store instructions (e.g., STR and FSTD) are handled as write-back with write-allocate. By write-allocate we mean that if the cache block accessed by the store was not in the L1 data cache then it will be brought into the cache, possibly evicting another block. By write-back we mean that if a cache block in the L1 is written to by a store instruction then only the copy of the block in the L1 is modified; lower levels of the memory hierarchy are updated only when the block is later evicted.
1.2 Performance Counters
How can you increase the performance of a software program? One common approach is to profile the program to identify which lines of code it spends the most time executing. Standard developer tools such as Visual Studio include such basic profiling capabilities [1]. Using this form of profiling you can identify where making algorithmic changes, such as using a hash table instead of a linked list, is worth the effort.
To obtain the highest performance it is also necessary to know how a program interacts with the microarchitecture of the CPU. One of the most important questions is: does the program incur many cache misses? The software that runs in datacenters, such as those operated by Google, Facebook, Amazon, Microsoft and others, typically suffers many cache misses. Google reports that half of the cycles in their datacenters are spent stalled on caches [2]. Most modern microprocessors include special counter registers that can be configured to measure how often events such as cache misses occur. Special profiling tools such as Intel's VTune [3] can use these hardware performance counters. Hardware counters can also be used for runtime optimization of software and to enable the operating system to select lower power modes.
The Cortex-A9 processor supports six hardware counters. Each counter can be configured to track one of 58 different events. In this lab you will use these counters to measure clock cycles, load instructions executed, and L1 data cache misses caused by either loads or stores. You will use these three counters to analyze the performance as you make changes to programs. These performance counters are a standard feature of the ARMv7 architecture and are implemented as part of coprocessor 15 (CP15). CP15 also includes functionality for controlling the caches and virtual memory hardware. For this lab we provide you ARM assembly code to enable the L1 data cache and L2 unified cache using CP15. The L2 cache is accessed when a load or store instruction does not find a matching cache block in the L1 data cache. Enabling the data caches on the Cortex-A9 also requires enabling virtual memory. So, the code we provide for Lab 11 (pagetable.s) also does this for you using a one-level page table with 1 MB pages (called sections in ARM terminology). You do not need to know how virtual memory works to complete this lab. However, for those who are interested, Bonus 1 and Bonus 2 ask you to modify pagetable.s.
In Part 1 of this lab you run an example assembly program that helps illustrate how to access the performance counters. In Part 2, you write a matrix-multiply function using floating-point instructions and study its cache behavior using the performance counters. In Part 3, you modify your matrix-multiply to improve cache performance.
To enable the caches on your DE1-SoC, your assembly needs to call the function CONFIG_VIRTUAL_MEMORY defined inside pagetable.s to enable the caches. After virtual memory is enabled the Altera Monitor Program will not correctly download changes to your code without first power cycling the DE1-SoC. To save time during debugging (e.g., in Parts 2 and 3) enable virtual memory only after you get your code working. Also, note that resetting the ARM processor through the Altera Monitor Program does not flush the contents of the caches. Thus, you will need to power cycle your DE1-SoC each time you want to make a new performance measurement.
The ARM coprocessor model was briefly described in Slide Set 13. The Cortex-A9 contains a Performance Monitor Unit (PMU) inside of Coprocessor 15. While there are 58 different events that can be tracked on the Cortex-A9, the PMU contains only six performance counters with which to track them. These are called PMXEVCNTR0 through PMXEVCNTR5, which we will abbreviate to PMN0 through PMN5. These counters are controlled through several additional registers inside the PMU.
The specific PMU registers you will need to use in this lab are listed in Table 1. Recall that the MCR, or move to coprocessor from an ARM register, instruction moves a value to the coprocessor (i.e., the PMU) from an ARM register (R0-R14). The MRC, or move to ARM register from a coprocessor, instruction copies a value from a coprocessor (i.e., the PMU) into an ARM register. Certain registers in the PMU are used to configure the performance counter hardware before using the counters PMN0 through PMN5 to actually count hardware events. The relationship between the different PMU registers, the hardware events and the performance counter registers is partly illustrated in Figure 1. The operation of this hardware is described below. You will measure the three events listed in Table 2. The other 55 possible events can be found in ARM documents that are available on ARM's website [4].

Figure 1: Hardware Organization of Cortex-A9 Performance Monitor Unit

[1] https://msdn.microsoft.com/en-CA/library/ms182372.aspx
[2] Kanev et al., "Profiling a warehouse-scale computer," ACM/IEEE Intl. Symp. on Computer Architecture, 2015.
[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
To use one of the performance counters you need to complete the following steps:
1. Select counter PMNx by putting the value x in a regular register (e.g., R0-R12) and then executing the ARM code shown in Table 1 for Set PMSELR, replacing Rt with the register you put the value x in (e.g., R0). This puts the value in Rt into the register labeled PMSELR in Figure 1, which controls the demultiplexer labeled (1).
2. Select the event that PMNx should count by putting the event number from the first column of Table 2 into a regular register (e.g., R0-R12) and then executing the ARM code in Table 1 for Set PMXEVTYPER, replacing Rt with the register you put the value in. This causes the value of Rt to be placed into the corresponding register named PMXEVTYPER0 through PMXEVTYPER5 in Figure 1. Which one of PMXEVTYPER0 through PMXEVTYPER5 is updated depends upon the value in PMSELR set in the prior step.
3. Repeat steps 1 and 2 for up to six counters.
4. Enable each PMNx by setting bit x in a regular register (e.g., R0-R12) to 1 and then executing the ARM code in Table 1 for Set PMCNTENSET, replacing Rt with the register containing the bits set to 1. This sets the register labeled PMCNTENSET in Figure 1.
[4] ARM Architecture Reference Manual (ARMv7-A and ARMv7-R edition); Cortex-A9 Technical Reference Manual.
CPEN 211 Lab 11 3 of 11

Set PMSELR (MCR p15, 0, Rt, c9, c12, 5): Value in ARM register Rt specifies the performance counter (PMN0 through PMN5) that will either be configured using a PMXEVTYPER operation or read using a PMXEVCNTR operation.

Set PMXEVTYPER (MCR p15, 0, Rt, c9, c13, 1): Lower 8 bits of Rt configure which event increments the counter selected by PMSELR.

Set PMCNTENSET (MCR p15, 0, Rt, c9, c12, 1): A 1 in bit 0 through bit 5 of Rt enables performance counter 0 through 5, respectively.

Set PMCR (MCR p15, 0, Rt, c9, c12, 0): If bit 1 of Rt is 1 this instruction clears all six performance counters. If bit 0 of Rt is 1 this instruction starts any performance counters enabled by PMCNTENSET. If bit 0 of Rt is 0 this instruction stops all performance counters.

Read PMXEVCNTR (MRC p15, 0, Rt, c9, c13, 2): Copies the current value of the counter selected by PMSELR into Rt.

Table 1: ARM Cortex-A9 Performance Monitor Interface (NOTE: in the ARM code, replace Rt with an ARM register)
Event number    Event description
0x3             Level 1 data cache misses
0x6             Number of load instructions executed (counted if condition code passed)
0x11            CPU cycles

Table 2: Event Numbers
5. Reset all counters and start those that are enabled by putting the value 3 into a regular register (e.g., R0-R12) and then executing the ARM code in Table 1 for Set PMCR, replacing Rt with the register containing 3. In Figure 1, this step resets the counters PMN0 through PMN5 and allows them to begin counting the events passed through the multiplexers that connect to the hardware event signals labeled (5).
6. Run the code you wish to measure the performance of (e.g., matrix multiply). During this step the counters PMN0 through PMN5 shown in Figure 1 will be incremented whenever a configured event occurs.
7. Stop the performance counters by putting the value 0 into a regular register (e.g., R0-R12) and then executing the ARM code in Table 1 for Set PMCR, replacing Rt with the register containing 0.
8. For each counter you wish to read, follow steps 9 and 10 below.
9. Select counter PMNx by putting the value x in a regular register (e.g., R0-R12) and then executing the ARM code shown in Table 1 for Set PMSELR, replacing Rt with the register you put the value x in.
10. Read PMNx by executing the ARM code shown in Table 1 for Read PMXEVCNTR after replacing Rt with the register (e.g., R0-R12) you want to copy the performance counter value into. This corresponds to reading the counters via the multiplexer labeled (10) in Figure 1.
These steps are illustrated in the example in Figure 2, which is described in more detail in the following section.
.text
.global _start
_start:
    BL CONFIG_VIRTUAL_MEMORY

    // Step 1-3: configure PMN0 to count cycles
    MOV R0, #0                      // Write 0 into R0 then PMSELR
    MCR p15, 0, R0, c9, c12, 5      // Write 0 into PMSELR (selects PMN0)
    MOV R1, #0x11                   // Event 0x11 is CPU cycles
    MCR p15, 0, R1, c9, c13, 1      // Write 0x11 into PMXEVTYPER (PMN0 measures CPU cycles)

    // Step 4: enable PMN0
    mov R0, #1                      // PMN0 is bit 0 of PMCNTENSET
    MCR p15, 0, R0, c9, c12, 1      // Setting bit 0 of PMCNTENSET enables PMN0

    // Step 5: clear all counters and start counters
    mov r0, #3                      // bits 0 (start counters) and 1 (reset counters)
    MCR p15, 0, r0, c9, c12, 0      // Setting PMCR to 3

    // Step 6: code we wish to profile using hardware counters
    mov r1, #0x00100000             // base of array
    mov r2, #0x100                  // iterations of inner loop
    mov r3, #2                      // iterations of outer loop
    mov r4, #0                      // i=0 outer loop counter
L_outer_loop:
    mov r5, #0                      // j=0 inner loop counter
L_inner_loop:
    ldr r6, [r1, r5, LSL #2]        // read data from memory
    add r5, r5, #1                  // j=j+1
    cmp r5, r2                      // compare j with 256
    blt L_inner_loop                // branch if less than
    add r4, r4, #1                  // i=i+1
    cmp r4, r3                      // compare i with 2
    blt L_outer_loop                // branch if less than

    // Step 7: stop counters
    mov r0, #0
    MCR p15, 0, r0, c9, c12, 0      // Write 0 to PMCR to stop counters

    // Step 8-10: Select PMN0 and read out result into R3
    mov r0, #0                      // PMN0
    MCR p15, 0, R0, c9, c12, 5      // Write 0 to PMSELR
    MRC p15, 0, R3, c9, c13, 2      // Read PMXEVCNTR into R3

end: b end                          // wait here
Figure 2: Example 1. NOTE: CONFIG_VIRTUAL_MEMORY is defined in pagetable.s
2 Lab Procedure
Follow the steps below.
2.1 Part 1 (4 marks): Performance Measurement using Example Code
Run the ARM assembly code in Figure 2 on your DE1-SoC (this code must be run on real hardware). Note that you should not single-step while collecting performance counter measurements. Set a breakpoint on the line "end: b end" and run to it without single-stepping. This code measures the number of cycles to execute a nested loop that repeatedly iterates over elements of a one-dimensional array. You will notice that the code in Figure 2 does not actually use the values loaded from memory by the line:
ldr r6, [r1, r5, LSL #2]        // read data from memory
The reason is that in this example we are concerned only with how many cache hits or misses are generated by a program that repeatedly reads values from an array.
Next, modify the code from Figure 2 to also measure the number of L1 data cache misses and the number of load instructions executed. NOTE: The measured CPU cycle count will decrease each time you rerun the program (e.g., if using Actions > Restart). This occurs because, if you do NOT power cycle your DE1-SoC and download the program again, the cache blocks brought into the cache by one run of the code remain valid in the cache, reducing the cache misses in subsequent runs.
Measure all three performance counters and compute the three factors in the processor performance equation discussed in Slide Set 14:
Execution Time = Instruction Count × CPI × Cycle Time        (1)
CPI is the average cycles per instruction and can be obtained by dividing the cycle count by the instruction count. To obtain cycle time you need to know the clock frequency, which is 800 MHz. Surprisingly, the ARM Cortex-A9 does not have a counter that measures all instructions executed (the ARMv7 documentation says this is mandatory; the Cortex-A9 documentation says it is not implemented!). So you will need to compute the instruction count by analyzing the program. Create a table using your favorite document editor or spreadsheet program to record the values measured by each of the performance counters. Note that hardware performance counters are usually not perfect and may slightly under- or over-count events versus what you expect.
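To see how the three measured and hand-computed values plug into Equation 1, here is a small C sketch; the counter values in it are invented placeholders (only the 800 MHz clock frequency comes from this handout), so substitute your own measured cycle count and hand-computed instruction count.

#include <stdio.h>

/* Hypothetical example of turning raw counts into the factors of Equation 1.
   The cycle and instruction counts below are made up for illustration only. */
int main(void) {
    double cycles            = 12000.0;  /* from the CPU cycles counter (event 0x11)  */
    double instruction_count = 4100.0;   /* computed by hand from the program listing */
    double clock_hz          = 800e6;    /* DE1-SoC ARM clock frequency (800 MHz)     */

    double cpi        = cycles / instruction_count;           /* average cycles per instruction */
    double cycle_time = 1.0 / clock_hz;                       /* seconds per clock cycle        */
    double exec_time  = instruction_count * cpi * cycle_time; /* Equation 1                     */

    printf("CPI = %.3f, cycle time = %.3g s, execution time = %.3g s\n",
           cpi, cycle_time, exec_time);
    return 0;
}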
Then, try increasing the value of the shift parameter (the #2 in LSL #2) in the following line to at least one other value and repeat the measurements:
ldr r6, [r1, r5, LSL #2]        // read data from memory
Your mark for Part 1 will be:
4/4 If you measure all three counters for two values of the left shift parameter, compute the three terms in the processor performance equation (Equation 1), and can explain the results.
3/4 If you measure all three counters for at least two values of the left shift parameter and compute the three terms in the processor performance equation, but have difficulty explaining the results to your TA.
2/4 If you measure all three counters for the default value of the left shift parameter.
1/4 If you measure at least two counter values.
2.2 Part 2 (4 marks): Matrix Multiply
In this part you will write ARM assembly code equivalent to the C code shown in Figure 3.
This code multiplies the matrix A times B and puts the result in matrix C. Matrix multiplication is an important computational kernel in many important applications today (e.g., machine learning algorithms such as deep belief networks used in speech recognition, self-driving cars, etc.). Note that the + and * operations in the code in Figure 3 should be double-precision floating-point (Slide Set 13). Two-dimensional C arrays are stored in memory in row-major format: the elements in a row are placed adjacent in memory. For example, consider the array with 2 rows and 3 columns declared as follows:
#define N 128
double A[N][N], B[N][N], C[N][N];

void matrix_multiply(void)
{
    int i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            for (k = 0; k < N; k++)
                sum = sum + A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}

Figure 3: Matrix Multiply C code
address    data
0x100      1.1
0x108      1.2
0x110      1.3
0x118      2.1
0x120      2.2
0x128      2.3

Figure 4: Layout of two-dimensional array in memory
double myarray[2][3] = {1.1, 1.2, 1.3, 2.1, 2.2, 2.3};
Drawn as a matrix, myarray looks like:
1.1 1.2 1.3
2.1 2.2 2.3
This means myarray[0][0] contains 1.1, myarray[0][1] contains 1.2, myarray[0][2] contains 1.3, myarray[1][0] contains 2.1, and so on. Assume the base address of myarray is 0x100. Then, the above six elements of myarray would be placed in memory as shown in Figure 4 (recall IEEE double-precision floating-point uses 64 bits, which is 8 bytes).
You can use the .double directive to initialize the contents of the array. For example, myarray can be specified to have the initial contents shown in the above example using ARM assembly as follows:
myarray: .double 1.1
         .double 1.2
         .double 1.3
         .double 2.1
         .double 2.2
         .double 2.3
To avoid conflicts with the memory used by CONFIG_VIRTUAL_MEMORY, place your arrays below address 0x01000000.
Use the above information about how arrays are placed in memory to help you compute the address to load from for A[i][k] and B[k][j] and the address to store to for C[i][j]. You need to use N in your address calculation. There is an example of ARM assembly code performing matrix multiply on pages 250-253 in Chapter 3 of COD4e (PDF on Connect) with N=32. You can use this code as a starting point provided you add a citation to it in your .s file in a comment. Alternatively, you can write the code yourself. Either way, due to the limitations of the Altera Monitor Program (described below), you will need to hand-encode the floating-point operations, and your assembly code should support arbitrary values of N.
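To make the address arithmetic concrete, the C sketch below spells out the row-major offset formula that your assembly needs to reproduce with multiply, add, and shift instructions; the function name and arguments are illustrative only, not part of the required solution.

#include <stdint.h>

/* Illustrative sketch only: byte address of element [i][k] of a row-major
   array of doubles with 'cols' columns per row, starting at byte address 'base'.
   The assembly code must perform the equivalent of base + (i*cols + k)*8.       */
uint32_t element_addr(uint32_t base, uint32_t i, uint32_t k, uint32_t cols) {
    return base + (i * cols + k) * 8;   /* 8 bytes per IEEE double */
}

/* For the 2 x 3 myarray example above (base 0x100, cols = 3), element [1][2]
   (value 2.3) is at 0x100 + (1*3 + 2)*8 = 0x128, matching Figure 4.
   For the N x N matrices A, B and C of Figure 3, cols is simply N.              */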
Before enabling virtual memory and caches, run your code with a small value of N and with A and B matrices with values of your choosing to verify that the results of your matrix multiply code are correct. To check the results you will need to look at the values stored into memory for the output array using the memory tab in the Altera Monitor Program. NOTE: The Altera Monitor Program does not know how to display floating-point numbers. Instead, use the following URL to find the hexadecimal encoding for a double-precision number: http://www.binaryconvert.com/result_double.html?decimal=049
Rerun the code with virtual memory and caches enabled, with N set to 128 and then N set to 16. Use the performance counters to help you compute the average CPI in both cases and be prepared to explain them. When using larger values of N (e.g., 16 and 128) to measure cache performance you do NOT need to explicitly initialize the input matrices unless you want to.
A challenge you will encounter is that the Altera Monitor Program is not set up to support programs that use floating-point, even though the ARM Cortex-A9 on the DE1-SoC has very good support for floating-point. There are two issues: one is that the Altera Monitor Program does not show the contents of the floating-point registers (e.g., D0-D15). Another is that it will not compile floating-point assembly mnemonics such as FMULD (i.e., double-precision floating-point multiply). We will work around the lack of support for floating-point in the Altera Monitor Program in the following way: you will manually assemble the FLDD, FMULD, FADDD and FSTD instructions into 1s and 0s and then place them into memory at the appropriate location in your assembly program using the .word directive.
The encodings for these four instructions are summarized below in Figures 5-8. Recall that ARM floating-point operations are implemented using the coprocessor model described in Slide Set 11 on Slide 13. The double-precision floating-point coprocessor number is 11 (CP11).
For example, the instruction FLDD D0, [R8] can be specified using .word 0xED980B00.
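If you want to double-check your hand-assembled .word values, the C sketch below packs the FLDD fields from Figure 5 into a 32-bit word and reproduces the 0xED980B00 example above; the helper function and its arguments are illustrative only (cond = 0xE for "always", U = 1 to add the offset).

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: assemble an FLDD Dd, [Rn, #imm8] word using the field
   layout of Figure 5: cond | 1101 | U 0 0 1 | Rn | Dd | 1011 | imm8.       */
uint32_t encode_fldd(uint32_t dd, uint32_t rn, uint32_t imm8) {
    uint32_t cond = 0xE;   /* AL: execute always */
    uint32_t u    = 1;     /* add imm8 to Rn     */
    return (cond << 28) | (0xD << 24) | (u << 23) | (1 << 20) |
           (rn << 16) | (dd << 12) | (0xB << 8) | (imm8 & 0xFF);
}

int main(void) {
    /* FLDD D0, [R8] with a zero offset: should print 0xED980B00. */
    printf("0x%08X\n", encode_fldd(0, 8, 0));
    return 0;
}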
Your mark for Part 2 will be:
4/4 If you write the code and you can show it correctly computes results for a small value of N that is not a power of 2, and you run it with N=128 and N=16, collect all three hardware counters, compute the average CPI for both cases, and can explain the results you get.
3/4 If you write the code and you can show it correctly computes results for a small value of N other than 32.
2/4 If you wrote code for this part and it runs without triggering an illegal instruction fault (and jumping to address 0x00000004) on your hand-coded floating-point assembly, and it stores values for the output matrix in memory, but the result looks wrong.
1/4 If you did not do this part, or your code will not compile, or it compiles but triggers an illegal instruction fault and jumps to address 0x00000004, and/or it does not write the results to memory.
2.3 Part 3: Blocked Matrix Multiply
Look at the C code in Figure 5.21 of COD (ARM edition), page 429. If you do not have the second textbook it is available on short-term loan in the library. This code performs a blocked matrix multiply, which helps improve performance by ensuring values are used multiple times after they are brought into the cache. Implement this same strategy in assembly code and measure the difference in performance.
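For orientation, here is a generic blocked (tiled) matrix-multiply sketch in C. It illustrates the idea of reusing a small tile of each matrix while it is resident in the cache, but it is not copied from Figure 5.21, so follow the textbook version for the exact loop structure; N, the array names, and the accumulation into C mirror Figure 3, while BLOCKSIZE and the loop nest are assumptions of this sketch.

#define N 128
#define BLOCKSIZE 8   /* tile size; small enough that a few tiles fit in the L1 data cache */

double A[N][N], B[N][N], C[N][N];   /* same arrays as Figure 3; C starts zeroed (globals) */

/* Generic blocked (tiled) matrix multiply: each (si, sj, sk) step works on
   BLOCKSIZE x BLOCKSIZE tiles, so values brought into the cache are reused
   BLOCKSIZE times instead of once. Assumes N is a multiple of BLOCKSIZE.    */
void blocked_matrix_multiply(void)
{
    int si, sj, sk, i, j, k;
    for (si = 0; si < N; si += BLOCKSIZE)
        for (sj = 0; sj < N; sj += BLOCKSIZE)
            for (sk = 0; sk < N; sk += BLOCKSIZE)
                for (i = si; i < si + BLOCKSIZE; i++)
                    for (j = sj; j < sj + BLOCKSIZE; j++) {
                        double sum = C[i][j];   /* partial sum accumulated across sk tiles */
                        for (k = sk; k < sk + BLOCKSIZE; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}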
Your mark for Part 3 will be:
2/2 If you code up blocked matrix multiply in assembly and can show it computes the correct result for small values of N and you use the performance counters to verify it improves average CPI for N=128.
bits:    31-28   27-24   23   22   21   20   19-16   15-12   11-8   7-0
fields:  cond    1101    U    0    0    1    Rn      Dd      1011   imm8

Figure 5: Floating-Point Double Precision Load, FLDD Dd, [Rn, #imm8]. If U (bit 23) is 1, then imm8 is added to the contents of Rn to form the effective address. If U is 0, imm8 is subtracted from Rn. See also the documentation on LDC in COD4e Appendix B-1/B-2 (coproc = 11).

bits:    31-28   27-24   23-20   19-16   15-12   11-8   7-4    3-0
fields:  cond    1110    0010    Dn      Dd      1011   0000   Dm

Figure 6: FP Double Precision Multiply, FMULD Dd, Dn, Dm. See also CDP in COD4e Appendix B-1/B-2 (op1 = 2, op2 = 0, coproc = 11).

bits:    31-28   27-24   23-20   19-16   15-12   11-8   7-4    3-0
fields:  cond    1110    0011    Dn      Dd      1011   0000   Dm

Figure 7: FP Double Precision Addition, FADDD Dd, Dn, Dm. See also CDP in COD4e Appendix B-1/B-2 (op1 = 3, op2 = 0, coproc = 11).

bits:    31-28   27-24   23   22   21   20   19-16   15-12   11-8   7-0
fields:  cond    1101    U    0    0    0    Rn      Dd      1011   imm8

Figure 8: Floating-Point Double Precision Store, FSTD Dd, [Rn, #imm8]. If U (bit 23) is 1, then imm8 is added to the contents of Rn to form the effective address. If U is 0, imm8 is subtracted from Rn. See also the documentation on STC in COD4e Appendix B-1/B-2 (coproc = 11).
1/2 If you coded something up and it looks to the TA like it might have a chance of working but it does not actually work.
2.4 Bonus 1 of 2 (4 marks): Two-Level Page Table and TLB Performance Events
Both bonuses require knowledge of virtual memory, which you can learn by going through the last flipped lecture on Virtual Memory early. If you plan to attempt these bonus questions you will need to sign up to ARM's website so you can download additional ARM documentation.
Modify the code in pagetable.s to create a working two-level page table implementation with 4 KB pages. You will likely need to consult the ARM Architecture Reference Manual (ARMv7-A and ARMv7-R edition), available from the ARM website, to complete this (you will need to register with ARM to access it). You will also want to read ahead about virtual memory in the textbook; we will cover virtual memory in class too, but not necessarily before your lab section. Once you think you have the two-level page table working, make sure you extend the testing approach in pagetable.s to verify that it does (you need to figure out how to do this). Look up the event numbers for translation lookaside buffer (TLB) misses and measure them on the code from Part 2, but with N set to a large enough value to trigger TLB misses with 4 KB pages. Your mark for Bonus 1 will be:
4/4 If you complete all of the aspects described in the paragraph above.
3/4 If you don't get the TLB performance counter part done but otherwise get everything done.
2/4 If you code up the two-level page table and it runs but you don't have any testing code or your test isn't convincing.
1/4 If you code up most of the changes needed for the two-level page table but it is not working.
2.5 Bonus 2 of 2 (4 marks): Mini Operating System
Modify pagetable.s and combine it with the task switching code from Part 4 of Lab 10 to create a simple operating system that provides virtual memory protection as well as preemptive multitasking for applications that use floating-point. You may use code from another student's Lab 10 Part 4 provided both you and they have demoed and submitted Lab 10 using handin, they give you permission to do so, and you acknowledge them in a CONTRIBUTIONS file that you submit with your code. Process 0 and Process 1 must each have their own page table. For Process 0 and Process 1, virtual addresses between 0x00000000 and 0x0FFFFFFF should map to different physical locations. Other virtual addresses should be marked invalid in the page table. Be sure to consider the impact of the TLBs when virtual-to-physical mappings change. Not required, just for fun: use the SWI instruction to enable your OS to expose I/O safely to software. Your mark for Bonus 2 will be:
4/4 If you complete the aspects described in the paragraph above and you can convince your TA your code works, or at your TA's discretion otherwise.
3 Lab Submission
Submit all files by the deadline on Page 1 using handin. Use the same procedure outlined at the end of the Lab 3 handout except that now you are submitting Lab 11, so use:
handin cpen211 Lab11section
where section should be replaced by your lab section. Remember you can overwrite previous or trial submissions using -o.
To ensure the demo proceeds quickly, your lab11.zip file should include all files including your assembly source code AND your project files.
4 Lab Demonstration Procedure
As in prior labs we will be dividing each lab section into two one-hour sessions (details on Connect). Your TA will have your submitted code with them and will have set up a TA marking station where you will go when it is your turn to be marked. Please bring your DE1-SoC.