File Systems
File Systems
DSCI 551
Wensheng Wu
1
Roadmap
• Files and directories
– CRUD operations via system calls
• Implementing CRUD
– Data structures, e.g., organization of blocks
– Access methods: turn system calls into operations
on data structures
2
Files and directories
• File content stored in blocks on storage device
– Has user defined name: hello.txt
– & low-level name, e.g., inode number: 410689
• Files are organized into directories (folders)
– each may have a list of files and/or subdirectories
– That is, directories can be nested
3
Example
4
Root directory
Empty directory
Operations on files
• Create
– open(), write()
• Read
– open(), read(), lseek()
• Update
– write(), lseek()
• Delete
– unlink()
5
System calls:
calls to functions in the API
provided by OS
Create
• User interface via GUI or touch command in Linux
• Implementation, e.g., via a C program with a
system call: open()
• int fd = open(“foo”, O_CREAT | O_WRONLY |
O_TRUNC);
– Open with flags indicating the specifics
– O_CREAT: create a file
– O_WRONLY: write only
– O_TRUNC: remove existing contents if exits
6
|: Bitwise OR operator
File descriptor
• Note open() returns a file descriptor
– Typically an integer
– Reserved fds: stdin 0, stdout, 1, stderr 2
7
Read
• read(fd, buffer, size)
– Read from file “fd”
– And store them in buffer
• Read starts from the current offset of fd
– Initially 0
8
Write
• write(fd, buffer, size)
– Write to file fd
buffer
– Also start writing from the current offset
9
Random read and write
• off_t lseek(int fd, off_t offset, int whence)
– If whence is SEEK_SET, the offset is set to
bytes from the beginning of file
– If whence is SEEK_CUR, the offset is set to its
current location plus
– If whence is SEEK_END, the offset is set to the size
of the file plus
negative, e.g., -8 for 8 bytes from the end)
• whence: from where
10
11
Copy a file
“0” starts an octal number
=> permissions:
110 (owner) rw-
100 (group) r–
100 (others) r–
Pointer to a character array
File permission mode
12
rw-r–r–
=> 110 (owner permission) 100 (group) 100 (others)
Resources for system calls
• https://en.wikipedia.org/wiki/System_call
• open:
https://en.wikipedia.org/wiki/Open_(system_call
)
• read:
https://en.wikipedia.org/wiki/Read_(system_call)
• write:
https://en.wikipedia.org/wiki/Write_(system_call
)
• close:
https://en.wikipedia.org/wiki/Close_(system_call)
13
https://en.wikipedia.org/wiki/System_call
https://en.wikipedia.org/wiki/Open_(system_call)
https://en.wikipedia.org/wiki/Read_(system_call)
https://en.wikipedia.org/wiki/Write_(system_call)
https://en.wikipedia.org/wiki/Close_(system_call)
Resources for system calls
• man –S 2 read
– Find it in the Section 2 of the manual
14
Install gcc on EC2
• sudo yum install gcc
• Usage:
– gcc -o copy2 copy2.c
15
File and directory
• When creating a file
– Bookkeeping data structure (inode) created:
recording size of file, location of its blocks, etc.
– Linking a human-readable name to the file
– Putting the link in a directory
16
Info about file (stored in inode)
struct stat {
dev_t st_dev; /* ID of device containing file */
ino_t st_ino; /* inode number */
mode_t st_mode; /* protection */
nlink_t st_nlink; /* number of (hard) links */
uid_t st_uid; /* user ID of owner */
gid_t st_gid; /* group ID of owner */
dev_t st_rdev; /* device ID (if special device file, e.g., /etc/tty) */
off_t st_size; /* total size, in bytes */
blksize_t st_blksize; /* blocksize for filesystem I/O */
blkcnt_t st_blocks; /* number of blocks allocated */
time_t st_atime; /* last time file content was accessed */
time_t st_mtime; /* last time file content was modified */
time_t st_ctime; /* last time inode was changed */
};
Execute “man -S 2 stat” for more details…
17
inode
• Stores metadata/attributes about the file
• Also stores locations of blocks holding the
content of the file
18
Example
• a.txt
abc def
abc def
abc def
19
Block size
# of blocks allocated
Device id
Inode #
User id Group id
# of (hard) links
Access permission
noatime
20
Hard links
21
Symbolic links
22
Working with directories
• Create: mkdir() system call
– Used to implement command, e.g., mkdir xyz
• Read: opendir(), readdir(), closedir()
– ls xyz
• Delete: rmdir()
23
Roadmap
• Files and directories
– CRUD operations
• Implementation
– Data structures: how to organize the blocks
– Access methods: turn system calls into operations
on data structures
24
Organization of blocks
• Array-based
– Disk consists of a list of blocks
– We will assume this
• Tree-based, e.g., SGI XFS
– Blocks are organized into variable-length extents
– Use B+-tree to quickly find free extents
25
Blocks
• Consider a disk with 64 blocks
– 4KB/block
– 512B/sector (we assume this in this lecture)
• So there are 212/29 = 23 = 8 sectors/block
– Capacity of disk = 64 * 4KB = 256KB
26
Data region
• 56 blocks (#8-63)
– used to store data/content of files
– but see later: some blocks may store pointers
27
Metadata
• For each file, file system records its metadata
– Information in the “stat” struct
– Location of blocks that stores the content of file
28
Metadata of files stored in inodes
• Index nodes
• Stored in blocks #3 — #7 (i.e., 5 blocks)
• Together they are called the “inode” table
29
How many inodes are there?
• 256 bytes/inode
• 5 blocks, 4KB/block
=> 16 inodes/block (4K/256 = 212/28 )
=> 5 blocks, 5 * 16 = 80 inodes
=> File system can store at most 80 files
30
Free space management using bitmaps
• Bitmap: a vector of bits
– 0 for free (inode/block), 1 for in-use
• Inode bitmap (imap)
– keep track of which inodes in the inode table are
available
• Data bitmap (dmap)
– Keep track of which blocks in data region are
available
31
Bitmaps
• Each bitmap is stored in a block
– Block “i”: keep track of 80 inodes (could track 32K)
– Block “d”: keep track of the 56 data blocks
32
Superblock
• Track where i/d blocks and inode table are
– E.g., inode table starts at block 3; there are 80
inodes and 56 data blocks, etc.
• Indicate type of FS & inumber of its root dir
• Will be read first when file system is mounted
33
inumber
• Each inode is identified by a number
– Low-level number of file name
• Can figure out location of inode from inumber
34
inumber => location
• inumber = 32
=> address: offset in bytes from the beginning
=> which sector?
35
inumber => location of inode
• Address: 12K + 32 * 256 = 20K
• Sector #: 20K/512 = 40
– more generally
– ہ
ۂ
(inodeStartAddress + inumber ∗ inode size)/sector
size
36
inode => location of data blocks
• A number of direct pointers
– E.g., 8 pointers, each points to a data block
– Enough for 8*4K = 32K size of file
• Also has a slot for indirect pointer
– Pointing to a data block storing direct pointers
– Assume 4 bytes for block address (e.g.,
represented in CHS), so 1024 pointers/block
– Now file can have (8 + 1024) blocks or 4,128KB
37
Multi-level index
• Pointers may be organized into multiple levels
– Indirect pointer (as in previous slide)
• Inode (pointer1, pointer2, …, indirect pointer)
• Indirect pointer -> a block of direct pointers
– Double indirect pointers
• Inode (pointer1, pointer2, …, indirect pointer)
• Indirect pointer -> a block of indirect pointers instead
-> each points to a block of direct pointers
– Triple indirect pointers
• Indirect pointer -> a block of indirect pointers
-> each points to a block of indirect pointers
-> each points to a block of direct pointers
38
Double Indirect Pointers
39
inode
Block of indirect pointers
Indirect
pointer
Indirect
pointer
Block of direct pointers
direct pointers
Advantages of multi-level index
• Grow to more levels as needed
• Direct pointers handle most of the cases
– Many files are small
40
Directory organization
• Directory itself stored as a file
• For each file in the directory, it stores:
– name, inumber, record length, string length
41
Actual length
Record length vs string length
• String length = # of characters in file name + 1
(for \0: end of string)
• Record length >= string length
– Due to entry reuse
42
Reusing directory entries
• If file is deleted (using rm command) or a
name is unlinked (using unlink command)
– File is finally deleted when its last (hard) link is
removed
• Then inumber in its directory entry set to 0
(reserved for empty entry)
– So we know it can be reused
43
Storing a directory
• Also as a file with its own inode + data block
• inode:
– file type: directory (instead of regular file)
– pointer to block(s) in data region storing directory
entries
44
Roadmap
• Files and directories
– CRUD operations
• Implementation
– Data structures: how to organize blocks, e.g., into
array/tree
– Access methods: turn system calls to operations
on data structures
45
Open for read
• fd = open(“/foo/bar”, O_RDONLY)
46
Open for read
• fd = open(“/foo/bar”, O_RDONLY)
– Need to locate inode of the file “/foo/bar”
– Assume inumber of root, say 2, is known (e.g.,
when the file system is mounted)
47
Open for read
1. Read inode and content of / (2 reads)
– Look for “foo” in / -> foo’s inumber
2. Read inode and content of /foo (2 reads)
– Look for “bar” in /foo -> bar’s inumber
3. Read inode of /foo/bar (1 read)
– Permission check + allocate file descriptor
48
Cost of open()
• Need 5 reads of inode/data block
49
File-open table per process
File descriptor File name Inumber Position offset …
3 /foo/bar 32382 0
4 /foo/more 48482 512 …
50
Reading the file
• read(fd, buffer, size)
– Note fd is maintained in per-process open-file
table
– Table translates fd -> inumber of file
51
Reading the file
• read(fd, buffer, size)
1. Consult bar’s inode to locate a block
2. Read the block
3. Update inode with newest file access time
4. Update open-file table with new offset
5. Repeat above steps until done (with reading
data of given size)
52
Cost for reading a block
• 3 I/O’s:
– read inode, read data block, write inode
53
2
Open for write
• int fd = open(“/foo/bar”, O_WRONLY)
– Or int fd = create((“/foo/bar”)
– Assume bar is a new file under foo
– (note the difference from reading chapter!)
54
Open for write
• int fd = open(“/foo/bar”, O_WRONLY)
1. Read ‘/’ inode & content
– obtain foo’s inumber
2. Read ‘/foo’ inode & content
– check if bar exists
55
Open for write
3. Read imap, to find a free inode for bar
4. Update imap, setting 1 for allocated inode
5. Write bar’s inode
56
Open for write
6. Update foo’s content block
– Adding an entry for bar
7. Update foo’s inode
– Update its modification time
57
Cost for “open for write”
• int fd = open(“/foo/bar”, O_WRONLY)
• Need 9 I/O’s
58
data
bitmap
inode
bitmap
root
inode
foo
inode
bar
inode
root
data
foo
data
bar
data[0]
bar
data[1]
bar
data[2]
read
read
read
read
read
write
write
write
write
create()
Writing the file: /foo/bar
1. Read inode of bar (by first looking up its
inumber in the file-open table)
2. Allocate new data block
– Read and write bmap
3. Write to data block of bar
4. Update bar inode
– new modification time, add pointer to block
59
Cost of writing /foo/bar
• 5 I/O’s for write a block
60
data
bitmap
inode
bitmap
root
inode
foo
inode
bar
inode
root
data
foo
data
bar
data[0]
bar
data[1]
bar
data[2]
read
read
read
read
read
write
write
write
write
read
read
write
write
write
create()
write()
Caching for read
• First read may be slow
– But subsequent ones will speed up
• Good idea to cache popular blocks
– e.g., determined via LRU strategy
61
Buffering for delayed write
• Improve write performance via:
– Batching (e.g., two updates to the same imap)
– Scheduling (reordering for better performance)
– Avoiding writes (if file created, then quickly
deleted)
• Problem: update may be lost when system
crashes
62
Example file systems
• NTFS
– New technology file system, Microsoft proprietary
• FAT
– File allocation table
– FAT 16, 32, …
– 32 bits = # of sectors a file can occupy
512B/sector => 2TB limit on file size
4KB/sector => 16TB limit
• Ext4
– fourth extended file system, common in Linux
63