5SIA0 – 2018/2019
Lab 3: using GEM5 for architecture exploration
Patrick Wijnings December 13, 2018

Contents
1 Introduction
1.1 Goal of the assignment
1.2 Simulated system
1.3 Application
1.4 GEM5
2 Installation
2.1 Virtual machine
2.2 Quick tour of the software
2.3 Hello world
3 Power model
3.1 CPU power model
3.2 Cache power model
3.3 Main memory power model
4 Performance
5 Application optimization

1 Introduction
1.1 Goal of the assignment
In this lab, you’ll get some hands-on experience using the GEM5 simulator. The main goal of this lab is to learn how a simulator can be used to explore the impact of architectural modifications on performance, and in particular how design choices in the processor and cache hierarchy affect the trade-off between energy efficiency and performance. Furthermore, you will learn: how to do basic power modelling of a full system; how to deal with a large number of simulation parameters and big output data files; and how to write C code that optimally exploits a given cache hierarchy.
Note: this lab contains 7 exercises. Some are easy, others take more time. Do all the exercises yourself: questions about what you learn will be asked at the written, online exam.
1.2 Simulated system
We are going to simulate a system based on the 28 nm Samsung Exynos 5410, which was used in the Samsung Galaxy S4. It was the first mobile ‘big.LITTLE’ processor with cores of different sizes: four ‘big’ ARM Cortex A15 cores, and four ‘little’ ARM Cortex A7 cores. Each core has a private level 1 instruction (32 KiB) and data (32 KiB) cache. Furthermore, the ‘big’ cores share a 2 MiB level 2 cache, and the ‘little’ cores have a shared 512 KiB level 2 cache. The level 2 caches are connected to LPDDR3 main memory. For our particular system, we selected a Micron EDF8132A1MC of 512 MiB.
The memory hierarchy of the system is visualized below:
[Figure: memory hierarchy. Four ARM Cortex A15 cores and four ARM Cortex A7 cores, each with private L1I and L1D caches; one shared L2 cache per cluster; both L2 caches connect to main memory (LPDDR3).]
The instruction pipelines of the processors, i.e. what’s inside each green box, look like this (taken from [1]):
[Figure: instruction pipelines of the ARM Cortex A7 and the ARM Cortex A15.]
In this exercise, we are only going to simulate a single core of the system, but you get to choose if it will be a big A15 or a little A7 core, as well as the parameters of the cache hierarchy.
1.3 Application
The application we will use to benchmark the system is the neural network implementation from the previous GPU lab assignment. However, we have replaced the neural network layout with LeNet. This is a much smaller neural network that can detect handwritten numbers (from the MNIST test set).
The input of the network is a 28 × 28 pixel image, and the output is the detected number (from 0 to 9). You can visualize the network in 3D and find out whether it can also decode your handwriting in your browser. In the C implementation, the network layers are named like this:
[Figure: names of the network layers in the C implementation.]

1.4 GEM5
The processor, caches and system memory are simulated using GEM5. Two operating modes are available:
Full system mode In this mode, GEM5 actually boots a Linux kernel in the simulator, like a virtual machine. This means that all advanced features of the Linux kernel (e.g. the thread scheduler and dynamic frequency scaling) are available. The downside is that it is very slow. We will not be using this mode in this assignment.
Syscall emulation mode In this mode, GEM5 executes a statically-linked ARM binary and emulates the kernel functions it relies on (e.g. for fopen). This is much faster than the full system mode, but unfortunately, not all kernel functions are properly emulated. For example, no multithreading support is currently available. Because this is the mode we will use, the benchmark application can only run on a single processor core.
While executing your binary, GEM5 periodically writes output statistics such as number of cycles, instruction mix, cache hits and misses, and power usage (to a file stats.txt). You can find more background information on the Learning gem5 webpage.
2 Installation
2.1 Virtual machine
We have prepared a virtual machine image with Ubuntu 18.10 and all the tools installed, which you can use directly. Download gem2018.ova from OnCourse and import it into VirtualBox.
Username and password of the virtual machine are both eca.
2.2 Quick tour of the software
In the home directory, there are the following folders:
gem5 Contains the GEM5 source files, configuration, and the binaries we have already compiled for you.
cacti Contains the CACTI cache power model.
benchmark Contains the LeNet neural network benchmark application.
2.3 Hello world
To test that everything is working correctly, run your first simulation of the full system by executing the following in the Terminal Emulator:
# Compile the benchmark application (natively)
cd ~/benchmark
make -f make.native clean lenet
# Run it natively (for full test set)
make -f make.native check
# Now compile for ARM
make -f make.arm clean lenet.arm
# Run it in GEM5 (for single image)
# This should take about five minutes
make -f make.arm check_1
# View the output
cd ~/gem5/m5out
(atril config.dot.pdf &)
(mousepad stats.txt &)

3 Power model
Each block in the system (see the memory hierarchy figure in section 1.2) has its own instance of a power model (with its own parameters). There are three types of power models: for CPU (green), for cache (red), and finally for main memory (blue). They will be investigated in the subsections below.
3.1 CPU power model
We have adapted a power model from [1]. It models the total energy dissipated in a single clock cycle as follows:
$$E(f)\,[\mathrm{pJ}] = \mathrm{EPC}(f) + \mathrm{EPI}(f) \cdot \mathrm{IPC}. \tag{3.1}$$
EPC stands for base energy per cycle. This energy is always spent, even when the CPU is idle. EPI is the extra energy per instruction, which is spent each time an instruction is executed. Of course, its value depends on the specific instruction executed, but in our simplified model we use an average EPI. Finally, IPC is the number of executed instructions per cycle, which will be provided for us by GEM5 when we run the system simulation. The clock frequency of the processor is $f$ [MHz].
For the A15, the following approximations hold for clock frequencies between 800 MHz and 1500 MHz:
$$\mathrm{EPC}(f) = 0.0107 f^2 + 0.7543 f + 224.0700,$$
$$\mathrm{EPI}(f) = 0.0071 f^2 + 0.6924 f + 138.0900.$$
For the A7, different approximations are used:
$$\mathrm{EPC}(f) = 0.0032 f^2 - 0.0843 f + 28.0000,$$
$$\mathrm{EPI}(f) = 0.0042 f^2 + 0.0748 f + 47.1670.$$
These are valid for frequencies between 500 MHz and 1200 MHz.
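As a rough illustration, equation 3.1 can be evaluated with a few lines of Python. This is only a sketch: the coefficients and validity ranges are copied from the text above, while the names `COEFFS`, `poly` and `energy_per_cycle` are made up for this example.

```python
# Sketch of equation 3.1: E(f) = EPC(f) + EPI(f) * IPC, with f in MHz and E in pJ.
# Coefficients (a, b, c) of a*f^2 + b*f + c are copied from the text;
# all function and variable names are illustrative only.
COEFFS = {
    "A15": {
        "EPC": (0.0107, 0.7543, 224.0700),
        "EPI": (0.0071, 0.6924, 138.0900),
        "f_range": (800, 1500),
    },
    "A7": {
        "EPC": (0.0032, -0.0843, 28.0000),
        "EPI": (0.0042, 0.0748, 47.1670),
        "f_range": (500, 1200),
    },
}


def poly(coeffs, f):
    """Evaluate a*f^2 + b*f + c."""
    a, b, c = coeffs
    return a * f * f + b * f + c


def energy_per_cycle(core, f_mhz, ipc):
    """Return the energy per clock cycle in pJ for core 'A7' or 'A15'."""
    model = COEFFS[core]
    lo, hi = model["f_range"]
    if not lo <= f_mhz <= hi:
        raise ValueError(f"{core} model is only valid for {lo}-{hi} MHz")
    return poly(model["EPC"], f_mhz) + poly(model["EPI"], f_mhz) * ipc


if __name__ == "__main__":
    # Tabulate E at 1000 MHz for a few IPC values.
    for core in ("A7", "A15"):
        for ipc in (0.0, 0.5, 1.0):
            print(core, ipc, round(energy_per_cycle(core, 1000, ipc), 2))
```

The range check is a design choice for the sketch: it refuses to extrapolate the fitted polynomials outside the frequency interval for which the text states they hold.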
Exercise 1. Based on the instruction pipelines of the A7 and A15 (see the figures in section 1.2), what IPC value do you expect? What are limiting factors in a practical system?
Exercise 2. Plot equation 3.1 as a function of IPC, for a fixed frequency of 1000 MHz. Compare the A7 and A15. Interpret these graphs: what type of workload is each processor optimized for?
3.2 Cache power model
The cache power model is based on CACTI 6.5 [2]. This software tool requires a configuration file with:
• Cache size
• Cache associativity
• Type and number of ports:
– Read-only (e.g. instruction cache)
– Read-write (e.g. data cache)
– Write-only
• Memory cell type:
– High performance
– Low standby power
– Low operating power
• Other advanced parameters are outside the scope of this assignment. For example, cache line size is fixed to 64 bytes.
It then derives a suitable cache circuit and reports back its timing, power and area properties. CACTI can be used for the L1D, L1I and L2 caches (but of course the configuration will in general be different).
Exercise 3. We have prepared a configuration file for the default A15 L2 cache. Run CACTI as follows:
cd ~/cacti
./cacti -infile cache.cfg > properties.txt
(mousepad properties.txt &)
Investigate the output file, and extract the two values related to power from the Cache Parameters section, as well as the overall data and tag delays from the Time Components section.
Now change the (four) memory cell types to low standby power in cache.cfg and repeat the experiment. Also try with the cell types set to high performance.
Explain why all four extracted values are relevant for the GEM5 simulation. Finally, draw an analogy with the CPU power model.
Exercise 4. Make 2D image plots of the four properties from the previous exercise as a function of cache size and associativity. Choose sensible axes that cover both the L1 and L2 cache design space. Also make sure to include the values in your plot, because you will need them later. You can use the plotting tool of your choice. For example, if you use Excel, each plot should look like this:
[Figure: example 2D image plot made in Excel.]
Repeat for all three memory cell types. Note that, for some combinations, CACTI is not able to build a circuit and returns an error message. Leave those entries in the plots empty.
Hint: you might want to consider automating (part of) this exercise using a script. But be careful to avoid the automation trap:
[Figure: comic about the automation trap. Source: xkcd]
If you decide to automate, please include the scripts in your submission.
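One possible starting point for such a script is sketched below. It only enumerates the design space and rewrites a configuration text; it does not run CACTI itself. It assumes, without having checked the provided cache.cfg, that the file contains one line starting with `-size (bytes)` and one starting with `-associativity`; verify this against your own file. The function names are made up for this example.

```python
import re


def design_space(sizes_kib, associativities):
    """Return all (size in bytes, associativity) combinations to try."""
    return [(s * 1024, a) for s in sizes_kib for a in associativities]


def set_cache_params(cfg_text, size_bytes, associativity):
    """Return cfg_text with the cache size and associativity replaced.

    ASSUMPTION: the configuration has exactly one '-size (bytes) N' line and
    exactly one '-associativity N' line. Anything else raises ValueError,
    so a mismatch with the real file format is noticed immediately.
    """
    cfg_text, n_size = re.subn(
        r"(?m)^-size \(bytes\)\s+\d+", f"-size (bytes) {size_bytes}", cfg_text
    )
    cfg_text, n_assoc = re.subn(
        r"(?m)^-associativity\s+\d+", f"-associativity {associativity}", cfg_text
    )
    if n_size != 1 or n_assoc != 1:
        raise ValueError("expected exactly one size and one associativity line")
    return cfg_text


if __name__ == "__main__":
    # Tiny stand-in for a real configuration file.
    sample = "// example\n-size (bytes) 2097152\n-associativity 8\n"
    for size, assoc in design_space([32, 64], [2, 4]):
        print(set_cache_params(sample, size, assoc))
```

Each generated text could then be written to a file and passed to `./cacti -infile` as in exercise 3; combinations for which CACTI reports an error stay empty in the plots.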
3.3 Main memory power model
The LPDDR3 main memory power model is already included in GEM5, based on the currents listed in the datasheet of the Micron EDF8132A1MC.
Exercise 5. Open the stats.txt from sec. 2.3. The energy dissipated by the system memory is returned as system.mem_ctrls_0.___Energy and consists of many different components. We prepared a script ./extract_memory_power.sh for you in the ~/gem5/m5out directory that extracts these statistics over time to space-separated values (SSV) format. To help you interpret these components, a (simplified) state diagram of the system memory is shown below:
[Figure: simplified state diagram with two states, IDLE and ACTIVE. Transitions: Refresh (IDLE to IDLE), Activate (IDLE to ACTIVE), Read / write (ACTIVE to ACTIVE), Precharge (ACTIVE to IDLE).]
Map the statistics to the transitions of this state diagram (some transitions have more than one statistic), and sum them over time. Make a pie plot, which should look like this:
[Figure: example pie plot of memory energy per state transition.]

Now use this plot and the state diagram to explain whether cache prefetching reduces or increases main memory energy dissipation.
Hint: how does cache prefetching influence the occurrence of each state transition? Note that, in the default GEM5 configuration, L2 cache prefetching is turned on.
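Summing the extracted statistics over time, as asked in exercise 5, could be sketched as follows. The layout of the SSV output is an assumption here: a header row with one name per column, followed by one row of numbers per time step. The column names `comp_a`, `comp_b` and `comp_c` are placeholders, not the real statistic names.

```python
def sum_columns(ssv_text):
    """Sum every column of space-separated text over all rows (time steps).

    ASSUMPTION: the first non-empty line is a header with one name per column
    and every other line holds one number per column.
    """
    lines = [ln.split() for ln in ssv_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    totals = {name: 0.0 for name in header}
    for row in rows:
        for name, field in zip(header, row):
            totals[name] += float(field)
    return totals


def shares(totals):
    """Return each column's fraction of the grand total (pie plot slices)."""
    grand = sum(totals.values())
    return {name: value / grand for name, value in totals.items()}


if __name__ == "__main__":
    # Made-up numbers with placeholder column names.
    example = "comp_a comp_b comp_c\n1.0 2.0 3.0\n3.0 2.0 1.0\n"
    print(shares(sum_columns(example)))
```

Columns that belong to the same state transition would still need to be added together by hand before plotting.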

4 Performance
Now we are going to run the benchmark application for different system configurations. The workflow is as follows:
1. Choose the CPU type, frequency, and cache sizes and associativities. Set the values in ~/gem5/configs/simulation_parameters.py.
2. Run CACTI for each cache, and update the energy and timing values in ~/gem5/configs/simulation_parameters.py.
3. Build the benchmark application:
# ARM build
cd ~/benchmark
make -f make.arm clean lenet.arm
4. Run the benchmark application inside GEM5:
cd ~/gem5
./build/ARM/gem5.opt configs/se_mode.py '/home/eca/benchmark/lenet.arm /home/eca/benchmark/mnist/1.png'
Execution should take about five minutes depending on your hardware and VM configuration.
5. Make sure the application is still functionally correct by checking the detected class label. You can get the ‘ground truth’ class labels by running the application natively for the full test set, as described in sec. 2.3.
6. Extract the relevant statistics from the stats.txt file. We have prepared a script ./extract_stats.sh for you (in the same directory) that does so in SSV format. You can also modify this script to extract other values that might help you with optimizing the system configuration.
The static power statistics are related to EPC (for CPU) and leakage (for caches). The dynamic power statistics are related to EPI ⋅ IPC (for CPU) and energy per access (for caches). The many components of the main memory have already been summed for you in an average power statistic.
7. Sum the sim_seconds statistic over time to get overall runtime (in seconds). Sum all the power statistics (over time and over the different statistics) to get overall energy (in joules).
Hint: pay close attention to the units in stats.txt. Values are reported in different units (e.g. joule, watt or milliwatt). Use the sim_seconds statistic to convert everything to joules.
Exercise 6. Find a system configuration that is Pareto-optimal. This means you have to make it likely that (within the design space we are considering) there are no systems that result in (significantly) shorter runtime for the same energy budget, as well as that there are no systems that require (significantly) less energy for the same runtime.
Plot your chosen system configuration together with the other configurations you considered on a Pareto plot with energy (in joules) on the x-axis and runtime (in seconds) on the y-axis. Include as many configurations as you think are necessary to make Pareto-optimality likely. Clearly label each configuration in the plot, and include a table with the simulation parameters and references to the labels. The plot should look like this:
[Figure: example Pareto plot with energy (in joules) on the x-axis and runtime (in seconds) on the y-axis.]
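The two coordinates of a configuration in the Pareto plot follow from workflow step 7. A minimal sketch of that conversion is given below. It assumes that every statistics dump covers one interval with its own sim_seconds value and that you have already noted the unit of each statistic; the unit labels "J", "W" and "mW" and all function names are choices made for this example.

```python
# Factor that converts a power value in the given unit to watts.
TO_WATT = {"W": 1.0, "mW": 1e-3}


def interval_energy_joule(value, unit, seconds):
    """Convert one statistic of one interval to joules.

    Energy values ("J") are returned unchanged; power values are multiplied
    by the length of the interval (energy = power * time).
    """
    if unit == "J":
        return value
    return value * TO_WATT[unit] * seconds


def totals(intervals):
    """Return (total energy in J, total runtime in s).

    `intervals` is a list of (sim_seconds, [(value, unit), ...]) tuples,
    one tuple per statistics dump.
    """
    runtime = sum(seconds for seconds, _ in intervals)
    energy = sum(
        interval_energy_joule(value, unit, seconds)
        for seconds, stats in intervals
        for value, unit in stats
    )
    return energy, runtime


if __name__ == "__main__":
    # Made-up numbers: two intervals with mixed units.
    example = [
        (0.5, [(2.0, "W"), (100.0, "mW")]),
        (0.25, [(4.0, "W"), (0.1, "J")]),
    ]
    print(totals(example))
```

An unknown unit label raises a KeyError, which is deliberate in this sketch: a silently ignored unit would distort the total energy.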

Next, make an energy breakdown for your chosen system configuration, which should look like this:
[Figure: example energy breakdown.]
Finally, give (theoretical) arguments for why your chosen system configuration is Pareto-optimal for this benchmark application.
Hint: again, you might want to consider automating (part of) this exercise to save time.
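When many configurations have been simulated, a small filter can tell which of them are not dominated by any other. The sketch below treats a configuration as dominated if another one is at least as good in both energy and runtime and strictly better in at least one; the function name and the example labels are made up. It does not model the "(significantly)" tolerance from exercise 6.

```python
def pareto_front(points):
    """Return the labels of all non-dominated configurations.

    `points` maps a label to (energy in J, runtime in s); lower is better
    for both values. Labels are returned in insertion order.
    """
    front = []
    for label, (energy, runtime) in points.items():
        dominated = any(
            e2 <= energy and t2 <= runtime and (e2 < energy or t2 < runtime)
            for other, (e2, t2) in points.items()
            if other != label
        )
        if not dominated:
            front.append(label)
    return front


if __name__ == "__main__":
    # Made-up results for four hypothetical configurations.
    results = {
        "cfg_a": (1.0, 3.0),
        "cfg_b": (2.0, 2.0),
        "cfg_c": (2.5, 2.5),  # worse than cfg_b in both energy and runtime
        "cfg_d": (3.0, 1.0),
    }
    print(pareto_front(results))
```

Note that the filter only compares the configurations you actually simulated; it cannot show that no better configuration exists elsewhere in the design space.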

5 Application optimization
Exercise 7. For the system configuration you chose in exercise 6, optimize the C code of the benchmark application to improve runtime. For example, you can consider interchanging for-loops for better locality of memory accesses. However, make sure the application is still functionally correct!
Make sure to commit the intermediate steps you tested to git (so that you can roll back if you break something), and submit the whole benchmark folder including version history.
Finally, add the runtime and total energy of each step to the Pareto plot from the previous exercise, which should now look like this:
[Figure: example Pareto plot including the optimization steps.]
Include a table which relates the plot labels to the changes you made.

Bibliography
[1] Evangelos Vasilakis (2015), An Instruction Level Energy Characterization of ARM Processors. Technical report, Computer Architecture and VLSI Systems (CARV) Laboratory, Institute of Computer Science (ICS), Foundation of Research and Technology Hellas (FORTH).
[2] Naveen Muralimanohar et al. (2009), CACTI 6.0: A Tool to Model Large Caches. Technical report, HP Laboratories.