
CSCI 2021: Memory Systems

Last Updated: Mon Nov 22 01:41:21 PM CST 2021

Logistics
Reading: Bryant/O’Hallaron
▶ Ch 4: Finish / Skim
▶ Ch 6: Memory

Assignments
▶ Lab 11: clock() function
▶ HW 11: Memory Optimization
▶ P4: Overview Video Tue

Goals
▶ Memory efficient programs
▶ Permanent Storage
Schedule
Date        Event
Mon 11/22   Storage
Tue 11/23   P4 Overview Video
Wed 11/24   Micro Opts; Lab: Preprocessor
Thu/Fri     Break
Mon 11/29   Micro Opts
Wed 12/01   Review; Lab: Review
Fri 12/03   P4 Due; Exam 3

Architecture Performance
// LOOP 1: row-wise sum (reconstructed sketch; the original loop bodies were lost)
for(i=0; i<rows; i++)
  for(j=0; j<cols; j++)
    sumR += matrix[i][j];

// LOOP 2: column-wise sum (reconstructed sketch)
for(j=0; j<cols; j++)
  for(i=0; i<rows; i++)
    sumC += matrix[i][j];

> gcc -Og matrix_timing.c
> a.out 50000 10000
sumR: 1711656320 row-wise CPU time: 0.265 sec, Wall time: 0.265
sumC: 1711656320 col-wise CPU time: 1.307 sec, Wall time: 1.307
▶ sumR runs about 5 times faster than sumC (0.265 s vs 1.307 s)
▶ Understanding why requires knowledge of the memory hierarchy and cache behavior
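For reference, a self-contained sketch of the same experiment appears below. It is a hedged reconstruction: the matrix dimensions, variable names, and the use of clock() are illustrative assumptions, not the actual matrix_timing.c.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 8000
#define COLS 4000

int main(void){
  // allocate a ROWS x COLS matrix as one contiguous row-major block
  int *matrix = malloc(sizeof(int) * ROWS * COLS);
  for(long i = 0; i < (long)ROWS * COLS; i++){
    matrix[i] = i % 100;
  }

  // row-wise traversal: consecutive memory addresses, cache friendly
  clock_t begin = clock();
  long sumR = 0;
  for(int i = 0; i < ROWS; i++){
    for(int j = 0; j < COLS; j++){
      sumR += matrix[i * COLS + j];
    }
  }
  double row_time = ((double)(clock() - begin)) / CLOCKS_PER_SEC;

  // column-wise traversal: jumps COLS ints between consecutive accesses, cache hostile
  begin = clock();
  long sumC = 0;
  for(int j = 0; j < COLS; j++){
    for(int i = 0; i < ROWS; i++){
      sumC += matrix[i * COLS + j];
    }
  }
  double col_time = ((double)(clock() - begin)) / CLOCKS_PER_SEC;

  printf("sumR: %ld row-wise CPU time: %.3f sec\n", sumR, row_time);
  printf("sumC: %ld col-wise CPU time: %.3f sec\n", sumC, col_time);
  free(matrix);
  return 0;
}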

Measuring Time in Code
▶ Measure CPU time with the standard clock() function; measure time difference and convert to seconds
▶ Measure Wall (real) time with gettimeofday() or related functions; fills a struct with info on the time of day (duh)
CPU Time
#include <time.h>
clock_t begin, end;
begin = clock(); // current cpu moment
do_something();
end = clock(); // later moment
double cpu_time =
((double) (end-begin)) / CLOCKS_PER_SEC;
Real (Wall) Time
#include <sys/time.h>
struct timeval tv1, tv2;
gettimeofday(&tv1, NULL);  // early time
do_something();
gettimeofday(&tv2, NULL);  // later time
double wall_time =
  ((tv2.tv_sec  - tv1.tv_sec)) +
  ((tv2.tv_usec - tv1.tv_usec) / 1000000.0);
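Putting the two together, here is a minimal runnable sketch; the workload() busy loop is a stand-in for do_something() and is not from the course code.

#include <stdio.h>
#include <time.h>        // clock(), CLOCKS_PER_SEC
#include <sys/time.h>    // gettimeofday(), struct timeval

void workload(void){
  volatile long sink = 0;               // volatile so the loop is not optimized away
  for(long i = 0; i < 100000000L; i++){
    sink += i;
  }
}

int main(void){
  clock_t cpu_begin = clock();          // CPU time start
  struct timeval tv1, tv2;
  gettimeofday(&tv1, NULL);             // wall time start

  workload();

  clock_t cpu_end = clock();            // CPU time end
  gettimeofday(&tv2, NULL);             // wall time end

  double cpu_time  = ((double)(cpu_end - cpu_begin)) / CLOCKS_PER_SEC;
  double wall_time = (tv2.tv_sec - tv1.tv_sec)
                   + (tv2.tv_usec - tv1.tv_usec) / 1000000.0;
  printf("CPU time: %.3f sec, Wall time: %.3f sec\n", cpu_time, wall_time);
  return 0;
}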

Tools to Measure Performance: perf
▶ The Linux perf tool is useful to measure performance of an entire program
▶ Shows variety of statistics tracked by the kernel about things like memory performance
▶ Examine examples involving the matrix_timing program: sumR vs sumC
▶ Determine which statistics explain the performance gap between these two

Exercise: perf stats for sumR vs sumC, what’s striking?
> perf stat $perfopts ./matrix_timing 8000 4000 row   ## RUN sumR ROW SUMMING
sumR: 1227611136 row-wise CPU time: 0.019 sec, Wall time: 0.019

Performance counter stats for ‘./matrix_timing 8000 4000 row’:

  135,161,407   cycles:u
  417,889,646   instructions:u            # 3.09 insn per cycle
   56,413,529   L1-dcache-loads:u
    3,843,602   L1-dcache-load-misses:u   # 6.81% of all L1-dcache hits  (50.41%)
   28,153,429   L1-dcache-stores:u
          125   L1-icache-load-misses:u
    3,473,211   cache-references:u        # last level of cache
    1,161,006   cache-misses:u            # 33.427 % of all cache refs

> perf stat $perfopts ./matrix_timing 8000 4000 col   # RUN sumC COLUMN SUMMING
sumC: 1227611136 col-wise CPU time: 0.086 sec, Wall time: 0.086

Performance counter stats for ‘./matrix_timing 8000 4000 col’:

  372,203,024   cycles:u
  404,821,793   instructions:u            # 1.09 insn per cycle
   61,990,626   L1-dcache-loads:u
   39,281,370   L1-dcache-load-misses:u   # 63.37% of all L1-dcache hits  (45.66%)
   23,886,332   L1-dcache-stores:u
        2,486   L1-icache-load-misses:u
   32,582,656   cache-references:u        # last level of cache
    1,894,514   cache-misses:u            # 5.814 % of all cache refs  (60.38%)

Answers: perf stats for sumR vs sumC, what’s striking?
Observations
▶ Similar number of instructions between row/col versions
▶ #cycles lower for row version → higher insn per cycle
▶ L1-dcache-misses: marked difference between row/col version
▶ Last Level Cache Refs : many, many more in col version
▶ Col version: much time spent waiting for memory system to feed in data to the processor
Notes
▶ The right-side percentages like (50.41%) indicate how much of the time this feature is measured; some items can’t be monitored all the time.
▶ Specific perf invocation is in 10-memory-systems-code/measure-cache.sh

Exercise: Time and Throughput
Consider the following simple loop to sum elements of an array from stride_throughput.c
int *data = …; // global array
int sum_simple(int len, int stride){
  int sum = 0;
  for(int i=0; i<len; i+=stride){   // loop body reconstructed: strided sum
    sum += data[i];
  }
  return sum;
}

Cache address breakdown (16-byte lines, 4 sets):
  +-> Offset: 4 bits
  +-> Set: 2 bits
  +-> Tag: Remaining bits
▶ 0x20, in the same line, will also be loaded into set #2
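Returning to the sum_simple() exercise above, below is a hedged sketch of how it might be timed across several strides; the array length, the stride values, and the use of clock() are illustrative assumptions, not taken from stride_throughput.c.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LEN (1 << 24)            // 16M ints = 64 MB (illustrative size)

int *data;                       // global array, as in the slide

int sum_simple(int len, int stride){
  int sum = 0;
  for(int i = 0; i < len; i += stride){
    sum += data[i];
  }
  return sum;
}

int main(void){
  data = malloc(sizeof(int) * LEN);
  for(int i = 0; i < LEN; i++){
    data[i] = 1;                 // sum equals the number of elements touched
  }
  for(int stride = 1; stride <= 16; stride *= 2){
    clock_t begin = clock();
    int sum = sum_simple(LEN, stride);
    double secs = ((double)(clock() - begin)) / CLOCKS_PER_SEC;
    printf("stride %2d: sum=%d (elements touched), %.4f sec\n",
           stride, sum, secs);
  }
  free(data);
  return 0;
}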

Exercises: Anatomy of a Simple CPU Cache
MAIN MEMORY
| Addr | AddrBits | Value |
|------+----------+-------|
|  00  | 00000000 |  331  |
|  08  | 00001000 |  332  |
|  10  | 00010000 |  333  |
|  18  | 00011000 |  334  |
|  20  | 00100000 |  335  |
|  28  | 00101000 |  336  |
|  30  | 00110000 |  337  |
|  38  | 00111000 |  338  |
|  ... |          |       |
|  C0  | 11000000 |  551  |
|  C8  | 11001000 |  552  |
|  D0  | 11010000 |  553  |
|  D8  | 11011000 |  554  |
|  E0  | 11100000 |  555  |
|  E8  | 11101000 |  556  |
|  F0  | 11110000 |  557  |
|  F8  | 11111000 |  558  |
|------+----------+-------|
|      | Tag|Set|Offset   |

DIRECT-MAPPED CACHE
|     |   |     | Blocks/Line |
| Set | V | Tag | 0-7    8-15 |
|-----+---+-----+-------------|
| 00  | 0 | -   |  -          |
| 01  | 1 | 00  | 333    334  |
| 10  | 1 | 11  | 555    556  |
| 11  | 1 | 00  | 337    338  |
|-----+---+-----+-------------|

– Direct-mapped: 1 Line per Set
– 16-byte lines = 4-bit offset
– 4 Sets = 2-bit index
– 8-bit Address = 2-bit tag
– Total Cache Size = 64 bytes (4 sets * 16 bytes)

HITS OR MISSES? Show effects
1. Load 0x08
2. Load 0xF0
3. Load 0x18

Answers: Anatomy of a Simple CPU Cache
MAIN MEMORY
| Addr | AddrBits | Value |
|------+----------+-------|
|  00  | 00000000 |  331  |
|  08  | 00001000 |  332  |
|  10  | 00010000 |  333  |
|  18  | 00011000 |  334  |
|  20  | 00100000 |  335  |
|  28  | 00101000 |  336  |
|  30  | 00110000 |  337  |
|  38  | 00111000 |  338  |
|  ... |          |       |
|  C0  | 11000000 |  551  |
|  C8  | 11001000 |  552  |
|  D0  | 11010000 |  553  |
|  D8  | 11011000 |  554  |
|  E0  | 11100000 |  555  |
|  E8  | 11101000 |  556  |
|  F0  | 11110000 |  557  |
|  F8  | 11111000 |  558  |
|------+----------+-------|
|      | Tag|Set|Offset   |

DIRECT-MAPPED CACHE (after the three loads; * marks changed lines)
|     |   |     | Blocks/Line |
| Set | V | Tag | 0-7    8-15 |
|-----+---+-----+-------------|
| 00  | 1 | *00 | 331    332  |
| 01  | 1 |  00 | 333    334  |
| 10  | 1 |  11 | 555    556  |
| 11  | 1 | *11 | 557    558  |
|-----+---+-----+-------------|

– Direct-mapped: 1 line per set
– 16-byte lines = 4-bit offset
– 4 Sets = 2-bit index
– 8-bit Address = 2-bit tag
– Total Cache Size = 64 bytes (4 sets * 16 bytes)

HITS OR MISSES? Show effects
1. Load 0x08: MISS to set 00
2. Load 0xF0: MISS, overwrite set 11
3. Load 0x18: HIT in set 01, no change
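The tag/set/offset arithmetic from this example can be checked with a few lines of C. The sketch below models only the valid bits and tags of the 4 sets (the data blocks and the memory table are omitted); the struct and names are illustrative.

#include <stdio.h>

typedef struct {
  int valid;
  unsigned tag;
} line_t;

int main(void){
  line_t cache[4] = {
    {0, 0},      // set 00: invalid
    {1, 0x0},    // set 01: tag 00 (333, 334)
    {1, 0x3},    // set 10: tag 11 (555, 556)
    {1, 0x0},    // set 11: tag 00 (337, 338)
  };
  unsigned loads[] = {0x08, 0xF0, 0x18};

  for(int i = 0; i < 3; i++){
    unsigned addr   = loads[i];
    unsigned offset = addr & 0xF;         // low 4 bits
    unsigned set    = (addr >> 4) & 0x3;  // next 2 bits
    unsigned tag    = (addr >> 6) & 0x3;  // top 2 bits
    if(cache[set].valid && cache[set].tag == tag){
      printf("Load 0x%02X: HIT  in set %u (offset %u)\n", addr, set, offset);
    } else {
      printf("Load 0x%02X: MISS in set %u, line filled with tag %u\n",
             addr, set, tag);
      cache[set].valid = 1;
      cache[set].tag   = tag;
    }
  }
  return 0;
}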

Direct vs Associative Caches

Direct Mapped
One line per set
|     |   |     | Blocks/Line |
| Set | V | Tag | 0-7    8-15 |
|-----+---+-----+-------------|
| 00  | 0 | -   |  -          |
| 01  | 1 | 00  | 333    334  |
| 10  | 1 | 11  | 555    556  |
| 11  | 1 | 00  | 337    338  |
|-----+---+-----+-------------|
▶ Simple circuitry
▶ Conflict misses may result: 1 slot for many possible tags
▶ Thrashing: repeated accesses to addresses that map to the same set but have different tags
          vv
0x10 = 00 01 0000 : in cache
0xD8 = 11 01 1000 : conflict
          ^^
N-Way Associative Cache
Ex: 2-way = 2 lines per set
|     |   |     | Blocks      |
| Set | V | Tag | 0-7    8-15 |
|-----+---+-----+-------------|
| 00  | 0 | -   |  -          | Line 1
|     | 1 | 11  | 551    552  | Line 2
|-----+---+-----+-------------|
| 01  | 1 | 00  | 333    334  | Line 1
|     | 1 | 11  | 553    554  | Line 2
|-----+---+-----+-------------|
| 10  | 1 | 11  | 555    556  | Line 1
|     | 0 | -   |  -          | Line 2
|-----+---+-----+-------------|
| 11  | 1 | 00  | 337    338  | Line 1
|     | 1 | 11  | 557    558  | Line 2
|-----+---+-----+-------------|
▶ Complex circuitry → $$
▶ Requires an eviction policy, usually least recently used
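To make the eviction policy concrete, below is a hedged sketch of a single 2-way set with one LRU bit. The struct layout and LRU scheme are illustrative only, not how real cache hardware is organized.

#include <stdio.h>

typedef struct {
  int valid[2];
  unsigned tag[2];
  int lru;              // index of the least recently used line (0 or 1)
} set2_t;

// Returns 1 on hit, 0 on miss (a miss evicts the LRU line).
int access_set(set2_t *s, unsigned tag){
  for(int way = 0; way < 2; way++){
    if(s->valid[way] && s->tag[way] == tag){
      s->lru = 1 - way;              // the other way becomes LRU
      return 1;                      // hit
    }
  }
  int victim = s->lru;               // miss: evict least recently used line
  s->valid[victim] = 1;
  s->tag[victim]   = tag;
  s->lru = 1 - victim;
  return 0;
}

int main(void){
  set2_t set01 = { {1, 1}, {0x0, 0x3}, 0 };  // set 01 holding tags 00 and 11
  // 0x10 (tag 00) and 0xD8 (tag 11) now coexist instead of thrashing
  printf("0x10 -> %s\n", access_set(&set01, 0x0) ? "HIT" : "MISS");
  printf("0xD8 -> %s\n", access_set(&set01, 0x3) ? "HIT" : "MISS");
  return 0;
}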

How big is your cache? Check Linux System special Files
lscpu Utility
Handy Linux program that summarizes info on CPU(s)
> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
Address sizes:         36 bits physical, 48 bits virtual
CPU(s):                4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz
...
L1d cache:             64 KiB
L1i cache:             64 KiB
L2 cache:              512 KiB
L3 cache:              4 MiB
...
Vulnerability Meltdown:   Mitigation; ...
Vulnerability Spectre v1: Mitigation ...

Detailed Hardware Info
Files under /sys/devices/… show hardware info (caches)
> cd /sys/devices/system/cpu/cpu0/cache/
> ls
index0  index1  index2  index3 ...
> ls index0/
number_of_sets  type  level  size  ways_of_associativity ...
> cd index0
> cat level type number_* ways_* size
1 Data 64 8 32K
> cd ../index1
> cat level type number_* ways_* size
1 Instruction 64 8 32K
> cd ../index3
> cat level type number_* ways_* size
3 Unified 8192 20 10240K
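The same information can be read programmatically. Below is a minimal sketch, assuming a Linux system where the /sys paths shown above exist; it prints a few of the index0 description files.

#include <stdio.h>

int main(void){
  const char *files[] = {"level", "type", "size",
                         "number_of_sets", "ways_of_associativity"};
  char path[256], value[64];

  for(int i = 0; i < 5; i++){
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cache/index0/%s", files[i]);
    FILE *f = fopen(path, "r");
    if(f == NULL){
      continue;                       // file may not exist on every system
    }
    if(fgets(value, sizeof(value), f) != NULL){
      printf("%-22s: %s", files[i], value);   // values end with a newline
    }
    fclose(f);
  }
  return 0;
}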

Disks: Persistent Block Storage
▶ Have discussed a variety of fast memories which are small
▶ At the bottom of the pyramid are disks: slow but large memories
▶ These are persistent: when powered off, they retain information
Using Disk as Main Memory
▶ Operating Systems can create the illusion that main memory is larger than it is in reality
▶ Ex: 2GB DRAM + 6GB of disk space = 8GB Main Memory
▶ Disk file is called swap or a swap file
▶ Naturally much slower than RAM so OS will try to limit its use
▶ A Virtual Memory system manages RAM/Disk as main memory, will discuss later in the course

Flavors of Permanent Storage
▶ Permanent storage often referred to as a “drive”
▶ Comes in many variants but these 3 are worth knowing about in the modern era
1. Rotating Disk Drive
2. Solid State Drive
3. Magnetic Tape Drive
▶ Surveyed in the slides that follow

Ye Olde Rotating Disk
▶ Store bits “permanently” as magnetized areas on special platters
▶ Magnetic disks: moving parts → slow
▶ Cheap per GB of space
Source: CS:APP Slides
Source: Realtechs.net

Rotating Disk Drive: Features of Interest

Measures of Quality
▶ Capacity: bigger is usually better
▶ Seek Time: delay before a head assembly reaches an arbitrary track of the disk that contains data
▶ Rotational Latency: time for the disk to spin around to the correct position; faster rotation → lower latency
▶ Transfer Rate: once the correct read/write position is found, how fast data moves between disk and RAM
Sequential vs Random Access
Due to the rotational nature of Magnetic Disks…
▶ Sequential reads/writes comparatively FAST
▶ Random reads/writes comparatively very SLOW

Solid State Drives
▶ No moving parts → speed
▶ Most use “flash” memory, non-volatile circuitry
▶ Major drawback: limited number of writes, disk wears out eventually
▶ Reads faster than writes
▶ Sequential somewhat faster than random access
▶ Expensive:
A 1TB internal 2.5-inch hard drive costs between $40 and $50, but as of this writing, an SSD of the same capacity and form factor starts at $250. That translates into
– 4 to 5 cents/GB for the HDD
– 25 cents/GB for the SSD
(PC Magazine, “SSD vs HDD”, 2018)

Tape Drives
▶ Slowest yet: store bits as magnetic field on a piece of “tape” a la 1980’s cassette tape / video recorder
▶ Extremely cheap per GB so mostly used in backup systems
▶ Ex: CSELabs does nightly backups of home directories, recoverable from tape by request to the operator

The I/O System Connects CPU and Peripherals

Terminology
Bus: A collection of wires which allows communication between parts of the computer. May be serial (single wire) or parallel (several wires); must have a communication protocol over it.

Bus Speed: Frequency of the clock signal on a particular bus; usually differs between components/buses, requiring interface chips.
CPU Frequency > Memory Bus > I/O Bus

Interface/Bridge: Computing chips that manage communications across the bus, possibly routing signals to the correct part of the computer and adapting to differing speeds of components.

Motherboard: A printed circuit board used to connect the CPU to RAM chips and peripherals. Has buses present on it to allow communication between parts. Form factor dictates which components can be handled.

The Motherboard
Source: Wikipedia

Memory Mapped I/O
▶ Modern systems are a collection of devices and microprocessors
▶ CPU usually uses memory-mapped I/O: reads/writes to certain memory addresses are translated into communication with devices on the I/O bus
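As a rough illustration of the idea (not a runnable desktop program): the device addresses and register layout below are invented, but the pattern of writing through a volatile pointer to a fixed address is how memory-mapped device registers are typically accessed in OS and embedded code.

#include <stdint.h>

// Hypothetical device registers at made-up addresses (illustration only)
#define FAKE_UART_DATA   ((volatile uint32_t *) 0x10000000u)
#define FAKE_UART_STATUS ((volatile uint32_t *) 0x10000004u)
#define TX_READY 0x1u

void mmio_putc(char c){
  while((*FAKE_UART_STATUS & TX_READY) == 0){
    // spin until the device reports it can accept a byte
  }
  *FAKE_UART_DATA = (uint32_t)c;   // the store travels over the I/O bus to the device
}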

Direct Memory Access
▶ Communication is received by other microprocessors like a Disk Controller or a Memory Management Unit (MMU)
▶ Other controllers may talk to each other: e.g. the Disk Controller loads data directly into Main Memory via direct memory access

Interrupts and I/O
Recall access times
Place      Time
L1 cache   0.5 ns
RAM        100 ns
Disk       10,000,000 ns
▶ While running Program X, CPU reads an int from disk into %rax
▶ Communicates to disk controller to read from file
▶ Rather than wait, OS puts Program X to “sleep”, starts running program Y
▶ When disk controller completes read, signals the CPU via an interrupt, electrical signals indicating an event
▶ OS handles interrupt, schedules Program X as “ready to run”

Interrupts from Outside and Inside
▶ Examples of events that generate interrupts
  ▶ Integer divide by 0
  ▶ I/O Operation complete
  ▶ Memory address not in RAM (Page Fault)
  ▶ User generated: x86 instruction int 0x80
▶ Interrupts are mainly the business of the Operating System
▶ Usually cause generating program to immediately transfer control to the OS for handling
▶ When building your own OS, must write “interrupt handlers” to deal with above situations
  ▶ Divide by 0: signal program, usually terminating it
  ▶ I/O Complete: schedule requesting program to run
  ▶ Page Fault: sleep program until page loaded
  ▶ User generated: perform system call
▶ User-level programs will sometimes get a little access to interrupts via signals, a topic for CSCI 4061
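As a small preview, the sketch below installs a handler for SIGFPE, the signal delivered when the divide-by-zero interrupt above reaches a user program. It is illustrative only, with error handling kept minimal.

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>

void handle_fpe(int signum){
  printf("Caught signal %d: arithmetic fault, exiting\n", signum);
  exit(1);   // returning from the handler would re-execute the faulting divide
}

int main(void){
  signal(SIGFPE, handle_fpe);   // install handler for arithmetic faults
  volatile int zero = 0;        // volatile so the divide is not folded away at compile time
  printf("dividing by zero...\n");
  printf("%d\n", 100 / zero);   // raises the divide-error interrupt, delivered as SIGFPE
  return 0;
}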