PERSISTENCE: THE UNWRITTEN CONTRACT OF SSDs
Andrea Arpaci-Dusseau CS 537, Fall 2019
ADMINISTRIVIA
Project 7: xv6 File Systems: Improvements + Checker
Specification Quiz – Worth Project Points
7a due tomorrow, 7b due Thursday
Topic of tomorrow's Discussion Sections
Can still request project partner if needed
Final Exam
Friday, December 13th 7:25-9:25 pm
Two Rooms: Last name A-H in SOC SCI 5206, I-Z in Humanities 3650
Slightly cumulative (some T/F from Virtualization and Concurrency – 25%)
Exam Review
Next Tuesday: you ask the questions to cover (submit them by Monday at 5:00 pm)
Next Wednesday: discussion sections
AGENDA / LEARNING OUTCOMES
What is the unwritten contract of SSDs?
• How are SSDs structured?
• How should they be used to obtain best performance?
Cost: HDD vs. SSD
Enterprise SSD revenue is expected to exceed enterprise HDD revenue in 2017
[Chart: enterprise HDD vs. SSD revenue over time, crossing over in 2017]
Source: Gartner, Stifel Estimates
https://www.theregister.co.uk/2016/01/07/gartner_enterprise_ssd_hdd_revenue_crossover_in_2017/
Pros and Cons of Block-Based Devices
[Figure: applications and file systems built on the block interface – which of them still work well when the device underneath is an SSD?]
The consequences of misusing SSDs
– Performance degradation
– Performance fluctuation
– Early end of device life
http://crestingwave.com/sites/default/files/collateral/velobit_whitepaper_ssdperformancetips.pdf
S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST '10), San Jose, California, February 2010.
Block Device Interface:
HDD Unwritten Contract
• Sequential accesses are best
• Nearby accesses are more efficient than farther ones
Interface: read(), write()
SSD Unwritten Contract
?
What is the right way to use SSDs?
SSD Unwritten Contract
1. Request Scale
2. Locality
3. Aligned Sequentiality
4. Grouping by Death Time
5. Uniform Data Lifetime
Background
SSD Background
Controller runs the FTL:
– address mapping
– garbage collection
– wear-leveling
RAM holds the Mapping Table and a Data Cache
[Figure: controller and RAM in front of several flash channels; each channel holds many blocks, each block many pages]
Flash
Holds charge in cells. No moving parts, so no seeks!
Inherently parallel
Parallelism
Flash devices are divided into banks (aka planes or channels)
Banks can be accessed in parallel
[Figure: two reads serviced concurrently by different banks (Bank 0 – Bank 3)]
SLC: Single-Level Cell
– each NAND cell stores one bit (two charge levels: 1 or 0)
MLC: Multi-Level Cell
– each NAND cell stores two bits (four charge levels: 00, 01, 10, 11)
Single- vs. Multi-Level Cell
For the same total capacity, which is the most expensive?
Which is the most robust (fewest errors)?
– SLC: expensive, but robust
– MLC: cheap, but sensitive to errors (more charge levels must be distinguished per cell)
Wearout
Problem: flash cells wear out after being overwritten too many times.
– SLC: ~100K times
– MLC: ~10K times
Usage strategy: Wear leveling (see the sketch below)
– Prevents some cells from wearing out while others are still fresh
– Need to track how many times each block has been written/erased and move long-lived data onto the worn blocks
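A minimal sketch of a wear-leveling policy, assuming a hypothetical FTL that simply tracks per-block erase counts (real controllers use richer policies, e.g., hot/cold separation):

# Toy wear-leveling policy; block counts and the threshold are illustrative only.
class WearLeveler:
    def __init__(self, num_blocks):
        self.erase_count = [0] * num_blocks        # wear seen by each physical block
        self.free_blocks = set(range(num_blocks))  # erased blocks ready for new writes

    def pick_block_for_write(self):
        # Dynamic wear leveling: steer new writes to the least-worn free block.
        b = min(self.free_blocks, key=lambda blk: self.erase_count[blk])
        self.free_blocks.remove(b)
        return b

    def erase(self, block):
        self.erase_count[block] += 1
        self.free_blocks.add(block)

    def needs_static_leveling(self, gap=100):
        # Static wear leveling: if the wear gap grows too large, relocate cold
        # data off the least-worn block so that block can absorb future writes.
        return max(self.erase_count) - min(self.erase_count) > gap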
Flash Writes
Writing 0's:
– fast, fine-grained [pages]
– called "program"
Writing 1's:
– slow, coarse-grained [blocks]
– called "erase"
Each bank contains many "blocks"; each block contains many "pages"
[Figure: an erased block has every page all 1s; "program" flips selected bits of one page from 1 to 0 (e.g., 1111 1111 → 1001 1111); "erase" resets every page of the block back to all 1s]
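A toy model of these semantics (a sketch, not any real device's command set): program can only clear bits of one page from 1 to 0, and the only way back to 1s is erasing the whole block.

PAGES_PER_BLOCK = 8               # toy sizes; real blocks have 64-256 pages of 2-8 KB
PAGE_BITS = 8
ERASED = (1 << PAGE_BITS) - 1     # an erased page is all 1s

class Block:
    def __init__(self):
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.programmed = [False] * PAGES_PER_BLOCK

    def program(self, page_no, data):
        # Fast, fine-grained: flips selected bits of one page from 1 to 0.
        assert not self.programmed[page_no], "must erase block before reprogramming a page"
        self.pages[page_no] &= data        # AND: bits can only be cleared, never set
        self.programmed[page_no] = True

    def erase(self):
        # Slow, coarse-grained: every page in the block goes back to all 1s.
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.programmed = [False] * PAGES_PER_BLOCK

b = Block()
b.program(3, 0b10011111)   # ok: page 3 now holds 1001 1111
b.erase()                  # required before page 3 can hold different data again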
API Comparison: Blocks or Pages?
            read          write
disk:       read sector   write sector
flash:      read page     program page (0's) / erase block (1's)
Flash Hierarchy – Pick Plane, Block, or Page:
Plane: accessed in parallel; contains 1024 to 4096 blocks
Block: unit of erase; contains 64 to 256 pages
Page:  unit of read and program; 2 to 8 KB
Hard Disk vs. Flash Performance
Throughput:
– disk:  ~130 MB/s (sequential only)
– flash: ~200 MB/s
Latency:
– disk:  ~10 ms (one op)
– flash: read 10-50 us, program 200-500 us, erase ~2 ms
Implications: random reads are cheap (no seeks), but updating data in place is costly because a slow block erase must happen before reprogramming
File System and Flash Interactions
Traditional File Systems
[Figure: File System layered on top of a Storage Device]
Traditional API:
– read sector
– write sector
Not the same as the flash API (read/program/erase)
File System Options
1. Build/use new file systems for flash (using program/erase directly)
   – JFFS/JFFS2, YAFFS
   – a lot of work!
2. Build the traditional API over the flash API
   – then use FFS, LFS, whatever we want
Traditional API on Flash
read(addr):
    return flash_read(addr)

write(addr, data):
    block_copy = flash_read(block of addr)      # read every page of the block
    modify block_copy with data                 # update the target page in memory
    flash_erase(block of addr)                  # reset the whole block to 1s
    flash_program(block of addr, block_copy)    # reprogram every page
Example: the FS wants to write 0001 into one page of block 0
[Figure walkthrough over blocks 0-2:]
1. Read all other pages in the block into memory
2. Modify the target page in memory (it now holds 0001)
3. Erase the block (every page back to 1s)
4. Program all pages in the block from the in-memory copy
Write Amplification
One write is amplified into many additional writes
Writing one 2 KB page may cause:
– read, erase, and program of a 256 KB block
Random writes are extremely expensive!
(A sequential workload could modify all pages in one block at one time…)
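Back-of-the-envelope arithmetic for the numbers above (a worked example, not a measurement):

page_kb, block_kb = 2, 256
pages_per_block = block_kb // page_kb      # 128 pages per block
# Naive in-place update: one 2 KB logical write forces a read-modify-write of the block.
flash_pages_programmed = pages_per_block   # erase, then reprogram all 128 pages
write_amplification = flash_pages_programmed / 1
print(pages_per_block, write_amplification)    # 128 128.0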
Implication for File System:
Would FFS or LFS be better on flash?
COW FS over Flash
A copy-on-write FS may avoid some expensive random writes
What about wear leveling? LFS won't do this
What if we want to use some other FS?
Better Solution
Add a copy-on-write layer between the FS and flash
Avoids the RMW (read-modify-write) cycle
FTL: Flash Translation Layer
Translates logical device addresses to physical addresses
Some similarities to RAID translation…
Should the FTL use math (a fixed formula) or a data structure (a mapping table)?
Flash Translation Layer
Example: write 1101 to a logical page
The FTL programs a free physical page with the new data and updates the logical-to-physical mapping; the old physical copy becomes invalid and must eventually be garbage collected
[Figure: logical pages 0-7 mapped onto physical pages in blocks 0-1; each physical page labeled valid, invalid, or free]

States of Physical Pages
free –program→ valid –relocate or TRIM→ invalid –erase→ free
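A sketch of a page-level FTL built around these states (hypothetical data structures; a real FTL adds caching, parallelism, and smarter allocation): each write is remapped to a free physical page, the old copy is marked invalid, and garbage collection later reclaims the block with the most invalid pages.

class PageFTL:
    def __init__(self, num_blocks, pages_per_block):
        self.ppb = pages_per_block
        self.map = {}                                    # logical page -> physical page
        self.state = ["free"] * (num_blocks * pages_per_block)
        self.next_free = 0                               # toy allocator: program pages in order

    def write(self, lpn):
        if lpn in self.map:
            self.state[self.map[lpn]] = "invalid"        # old copy becomes garbage
        ppn = self.next_free                             # program a free page with the new data
        self.next_free += 1
        self.state[ppn] = "valid"
        self.map[lpn] = ppn                              # remember the new translation

    def gc_victim(self):
        # Reclaim the block with the most invalid pages:
        # fewest valid pages to relocate before erasing it.
        def invalid_in(blk):
            pages = self.state[blk * self.ppb:(blk + 1) * self.ppb]
            return pages.count("invalid")
        return max(range(len(self.state) // self.ppb), key=invalid_in)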
How big are mapping tables?
Assume a 200 GB device, 2 KB pages, and a 4-byte FTL entry per page
RAM needed: (200 GB / 2 KB) * 4 bytes = 400 MB
Too big; RAM is expensive!
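The same arithmetic as a quick check (using binary GB/KB for simplicity):

device_bytes = 200 * 2**30               # 200 GB device
page_bytes, entry_bytes = 2 * 2**10, 4   # 2 KB pages, 4-byte mapping entries
entries = device_bytes // page_bytes     # ~105 million pages need translations
print(entries * entry_bytes / 2**20)     # ~400 MB of RAM just for the mapping table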
Two solutions:
1. Keep the (unused) portion of the Mapping Table in flash (→ Locality rule)
2. Hybrid-Mapping FTL (→ Aligned Sequentiality rule)
SSD Background (recap): controller running the FTL (address mapping, garbage collection, wear-leveling), RAM holding the Mapping Table and Data Cache, and flash channels of blocks and pages
Unwritten Contract for SSDs
Rules of the Unwritten Contract
#1 Request Scale
#2 Locality
#3 Aligned Sequentiality
#4 Grouping by Death Time
#5 Uniform Data Lifetime
Rule #1: Request Scale
SSD clients should issue large data requests or multiple outstanding data requests.
[Figure: requests spread across Channels 0-3 → high internal parallelism]
Rule #1: Request Scale – Violation
If you violate the rule:
– small, serialized requests keep most channels idle → low internal parallelism
Performance impact:
– 18x read bandwidth
– 10x write bandwidth
F. Chen, R. Lee, and X. Zhang. Essential Roles of Exploiting Internal Parallelism of Flash Memory Based Solid State Drives in High-speed Data Processing. In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA-11), pages 266-277, San Antonio, Texas, February 2011.
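A sketch of "multiple outstanding requests" from user space, assuming a hypothetical device path and plain Python threads (real applications might instead use larger I/Os, deeper queue depths, or async interfaces such as libaio/io_uring, but the goal is the same: keep several flash channels busy at once):

import os
from concurrent.futures import ThreadPoolExecutor

DEV = "/dev/sdX"            # hypothetical device/file to read; substitute something safe
IO_SIZE = 128 * 1024        # bigger requests also let the SSD stripe work across channels

def read_at(fd, offset):
    return os.pread(fd, IO_SIZE, offset)   # positional read, safe to issue concurrently

fd = os.open(DEV, os.O_RDONLY)
offsets = [i * IO_SIZE for i in range(64)]

# Keep many requests outstanding so the SSD can service them on different channels.
with ThreadPoolExecutor(max_workers=16) as pool:
    data = list(pool.map(lambda off: read_at(fd, off), offsets))
os.close(fd)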
Rule #2: Locality
SSD clients should access with locality
With locality, lookups hit the translation cache → high cache hit ratio
[Figure: logical-to-physical Mapping Table; hot entries cached in SSD RAM, the rest kept in flash]
Rule #2: Locality – Violation
If you violate the rule:
– translation cache misses
– more translation entry reads/writes to flash
Performance impact: 2.2x average response time
Y. Zhou, F. Wu, P. Huang, X. He, C. Xie, and J. Zhou. An Efficient Page-level FTL to Optimize Address Translation in Flash Memory. In Proceedings of the EuroSys Conference (EuroSys '15), Bordeaux, France, April 2015.
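A sketch of why locality matters to the FTL, using a hypothetical LRU cache over the logical-to-physical table: a stream with reuse mostly hits cached translations, while a random stream forces the FTL to fetch mapping entries from flash on nearly every access.

from collections import OrderedDict
import random

class TranslationCache:
    def __init__(self, capacity):
        self.entries = OrderedDict()   # cached subset of the logical-to-physical table
        self.capacity = capacity
        self.hits = self.misses = 0

    def lookup(self, lpn):
        if lpn in self.entries:
            self.entries.move_to_end(lpn)   # LRU refresh
            self.hits += 1
        else:
            self.misses += 1                # would require reading the entry from flash
            self.entries[lpn] = lpn         # toy: identity stands in for the real physical page
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)

local, rnd = TranslationCache(1024), TranslationCache(1024)
for i in range(100_000):
    local.lookup((i % 4096) // 4)               # small working set, lots of reuse
    rnd.lookup(random.randrange(1_000_000))     # no locality: almost every lookup misses
print(local.hits / 100_000, rnd.hits / 100_000)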
Rule #3: Aligned Sequentiality
Hybrid mapping FTL:
To save space, compromise between page and block-level mappings
Aligned Sequentiality:
Performs best if clients write entire blocks, starting from the 1st page of a block
Per-Page Translations
[Figure: logical pages 0-7, each individually mapped to a physical page in blocks 0-1]
2-Page Translations (generalize to coarser granularities…)
[Figure: logical pages 0-7 mapped onto physical pages in 2-page groups]
Advantage: reduces the number of translations!
Disadvantage?
2-Page Translations: write 1011
[Figure: overwriting one page of a 2-page group forces the FTL to copy the group's other page to the new location as well]
Disadvantages?
– more read-modify-write updates
– more garbage
– less flexibility for placement
Hybrid FTL
Compromise between page- and block-level mappings; gets advantages of each approach
Use coarse-grained mapping for most (e.g., 95%) of the data
– map at block level
– most data → most space savings
Use fine-grained mapping for recent data
– map at page level
– changing rapidly → needs performance efficiency
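A sketch of the resulting lookup path (hypothetical structure): recent writes are found through a small page-level map over the log blocks; everything else falls back to the compact block-level map.

class HybridFTL:
    def __init__(self, pages_per_block):
        self.ppb = pages_per_block
        self.page_map = {}     # fine-grained: logical page -> physical page (log blocks only)
        self.block_map = {}    # coarse-grained: logical block -> physical block (most data)

    def translate(self, lpn):
        # 1. Recently written pages live in log blocks and are mapped per page.
        if lpn in self.page_map:
            return self.page_map[lpn]
        # 2. Everything else: block-level entry plus the page's offset within the block.
        lbn, offset = divmod(lpn, self.ppb)
        return self.block_map[lbn] * self.ppb + offset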
Log Blocks
Write changed pages to designated log blocks in flash
After log blocks become full, merge the changes with the old data
Type of merge depends on the pattern of the written data:
– full merge (worst performance)
– partial merge
– switch merge (best performance)
Eventually garbage collect old pages
Full Merge
write D2 (a new version of logical page 3): D2 is appended to the log block
Eventually, the FTL needs to get rid of page mappings because they are expensive:
– read the still-valid pages A, B, C from block 0
– program A, B, C, D2 into a fresh block (block 2)
– block 0 and the log block become garbage
[Figure: logical pages 0-3 → block 0 holding A, B, C, D; block 1 (log) holding D2; block 2 receiving A, B, C, D2]
Merging
Three merge types:
– full merge (worst performance)
– partial merge
– switch merge (best performance)
Partial Merge
write D2 (a new version of logical page 3): D2 is programmed into page 3 of the log block, i.e., at its aligned offset
– copy the remaining valid pages A, B, C from block 0 into pages 0-2 of the log block
– the log block becomes the new data block; block 0 becomes garbage
[Figure: logical pages 0-3 → block 0 holding A, B, C, D; block 1 (log) ends up holding A, B, C, D2]
Merging
Three merge types:
– full merge (worst performance)
– partial merge
– switch merge (best performance)
Switch Merge
writes A2, B2, C2, D2 arrive in order and fill the log block sequentially, starting from its 1st page
– no copying is needed: the log block simply becomes the new data block (a "switch")
– block 0 becomes garbage
[Figure: logical pages 0-3 → block 0 holding A, B, C, D; block 1 (log) fills with A2, B2, C2, D2 and replaces block 0]
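A sketch of how the reclaim cost could be classified (a toy, assuming the log block is described by the logical offset programmed into each of its pages, or None if still erased): a switch merge needs no copies, a partial merge copies only the pages the client never rewrote, and anything out of order forces a full merge.

def classify_merge(log_pages, pages_per_block):
    """log_pages[i] = logical offset written into page i of the log block, or None."""
    written = [p for p in log_pages if p is not None]
    aligned = all(log_pages[i] in (None, i) for i in range(pages_per_block))
    if aligned and len(written) == pages_per_block:
        return "switch merge", 0                    # log block simply becomes the data block
    if aligned:
        return "partial merge", pages_per_block - len(written)   # copy only the missing pages
    return "full merge", pages_per_block            # gather the latest copy of every page into a new block

print(classify_merge([0, 1, 2, 3], 4))              # ('switch merge', 0)
print(classify_merge([None, None, None, 3], 4))     # ('partial merge', 3)
print(classify_merge([3, None, None, None], 4))     # ('full merge', 4)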
Rule #3: Aligned Sequentiality
Hybrid mapping FTL: compromise between page- and block-level mappings
Performs best if clients write entire blocks, starting from the 1st page of a block
– the FTL then does not need to copy pages from one block to the next (a switch merge suffices)
Rule #4: Grouping By Death Time
Data with similar death times should be placed in the same block.
[Figure: pages that die at 1pm fill one block and pages that die at 2pm fill another; when 1pm arrives, a whole block becomes garbage and can be erased with no data movement 😀]
Rule #4: Grouping By Death Time – Violation
If you violate the rule (1pm and 2pm data mixed within each block):
– data movement during garbage collection!
– performance penalty
– write amplification
Performance impact: 4.8x write bandwidth, 1.6x throughput, 1.8x block erasure count
C. Lee, D. Sim, J.-Y. Hwang, and S. Cho. F2FS: A New File System for Flash Storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST '15), Santa Clara, California, February 2015.
J.-U. Kang, J. Hyun, H. Maeng, and S. Cho. The Multi-streamed Solid-State Drive. In 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '14), Philadelphia, PA, June 2014.
Y. Cheng, F. Douglis, P. Shilane, G. Wallace, P. Desnoyers, and K. Li. Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality. In 2016 USENIX Annual Technical Conference (USENIX ATC '16), pages 379-392, Denver, CO, 2016. USENIX Association.
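A sketch of the underlying idea (in the spirit of multi-streamed SSDs and F2FS's hot/cold separation, but not their actual code): route data expected to die soon to one open block and long-lived data to another, so that whole blocks tend to become garbage together.

class DeathTimeGrouper:
    """Toy write allocator with one open block per stream (hot vs. cold)."""
    def __init__(self, pages_per_block, hot_cutoff_s=3600):
        self.ppb = pages_per_block
        self.hot_cutoff_s = hot_cutoff_s
        self.open = {"hot": [], "cold": []}   # logical pages buffered per open block
        self.sealed = []                      # (stream, pages) for blocks that filled up

    def write(self, lpn, expected_lifetime_s):
        # Lifetime hint from the client (or a heuristic): short-lived data -> hot stream.
        stream = "hot" if expected_lifetime_s < self.hot_cutoff_s else "cold"
        blk = self.open[stream]
        blk.append(lpn)
        if len(blk) == self.ppb:              # block is full: seal it and start a new one
            self.sealed.append((stream, blk.copy()))
            blk.clear()

# Pages in a sealed "hot" block tend to be invalidated around the same time,
# so the whole block can later be erased with little or no data movement.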
Rule #5: Uniform Data Lifetime
Clients of SSDs should create data with similar lifetimes
Example: all data lives ~1 day
– per-block usage counts stay similar
– no wear-leveling needed
[Figure: per-block usage counts all within a small range]
Rule #5: Uniform Data Lifetime – Violation
If you violate the rule (some data lives 1 day, other data lives 1000 years):
– some blocks wear out sooner (per-block usage counts diverge wildly, e.g., 365*1000 vs. ~10)
– frequent wear-leveling needed!
– performance penalty and write amplification
Performance impact: 1.6x write latency
S. Boboila and P. Desnoyers. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Symposium on File and Storage Technologies (FAST '10), San Jose, California, February 2010.
Impact of Garbage Collection + Wear Leveling
Performance fluctuation: background data movement competes with the client's requests
Summary
Flash is much faster than disk, but…
– more expensive
– not a drop-in replacement: needs a complex layer (FTL) to emulate the hard disk API
Applications and file systems must carefully interact with flash to obtain the best performance:
#1 Request Scale
#2 Locality
#3 Aligned Sequentiality
#4 Grouping by Death Time
#5 Uniform Data Lifetime