System Calls Interface and Implementation
1
Learning Outcomes
• A high-level understanding of the system call interface
• Mostly from the user’s perspective
• From textbook (section 1.6)
• Understanding of how the application-kernel boundary is crossed with system calls in general
• Including an appreciation of the relationship between a case study (OS/161 system call handling) and the general case.
• Exposure to architectural details of the MIPS R3000
• Detailed understanding of the exception handling mechanism
• From “Hardware Guide” on class web site
2
System Calls Interface
3
The Structure of a Computer System
[Figure: Application and System Libraries in User Mode; OS in Kernel Mode, with Devices and Memory. Interaction is via System Calls.]
4
System Calls
• Can be viewed as special function calls
• Provide for a controlled entry into the kernel
• While in the kernel, they perform a privileged operation
• Return to the original caller with the result
• The system call interface represents the abstract machine provided by the operating system.
5
The System Call Interface: A Brief Overview
• From the user’s perspective
• Process Management
• File I/O
• Directories management
• Some other selected Calls
• There are many more
• On Linux, see man syscalls for a list
6
Some System Calls For Process Management
7
Some System Calls For File Management
8
System Calls
• A stripped down shell:
while (TRUE) {                             /* repeat forever */
    type_prompt( );                        /* display prompt */
    read_command (command, parameters);    /* input from terminal */
    if (fork() != 0) {                     /* fork off child process */
        /* Parent code */
        waitpid( -1, &status, 0);          /* wait for child to exit */
    } else {
        /* Child code */
        execve (command, parameters, 0);   /* execute command */
    }
}
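The shell loop above leaves out error handling. As a rough sketch of its fork/exec/wait core on a POSIX system, the function below runs a single command and reports its exit status. The name run_command is hypothetical, and execvp (which searches PATH) is used here instead of the slide's execve purely for convenience.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run one command the way the stripped-down shell does: fork a child,
 * exec the command in the child, and wait for it in the parent.
 * Returns the child's exit status, or -1 if something went wrong.
 * run_command is a hypothetical helper name, not taken from the slides.
 */
int run_command(char *const argv[])
{
    int status;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();                    /* fork off child process */
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child code: replace this process image with the command. */
        execvp(argv[0], argv);
        perror("execvp");            /* only reached if the exec failed */
        _exit(127);
    }

    /* Parent code: wait for that specific child to exit. */
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would build a NULL-terminated argument vector, for example { "ls", "-l", NULL }, and pass it to run_command once per line read from the terminal.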
9
System Calls
Some Win32 API calls
10
System Call Implementation Crossing user-kernel boundary
11
A Simple Model of CPU Computation
• The fetch-execute cycle
• Load memory contents from address in program counter (PC)
• The instruction
• Execute the instruction
• Increment PC
• Repeat
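To make the cycle concrete, here is a minimal sketch of a fetch-execute loop for a toy machine. The instruction set (OP_LOADI, OP_ADDI, OP_HALT), the two-word instruction format and the single accumulator register are all invented for illustration and have nothing to do with the MIPS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy instruction set, invented for this sketch. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADDI = 2 };

struct toy_cpu {
    uint32_t pc;   /* program counter: index of the next instruction */
    int32_t  acc;  /* a single general purpose register */
};

/*
 * Run the fetch-execute cycle over 'memory' until OP_HALT is fetched
 * or the PC runs off the end of the program. Each instruction occupies
 * two words: an opcode followed by one operand.
 */
int32_t toy_run(struct toy_cpu *cpu, const int32_t *memory, size_t nwords)
{
    assert(cpu != NULL && memory != NULL);

    while ((size_t)cpu->pc + 1 < nwords) {
        /* Fetch: load memory contents from the address in the PC. */
        int32_t opcode  = memory[cpu->pc];
        int32_t operand = memory[cpu->pc + 1];

        /* Execute the instruction. */
        if (opcode == OP_HALT) {
            break;
        } else if (opcode == OP_LOADI) {
            cpu->acc = operand;
        } else if (opcode == OP_ADDI) {
            cpu->acc += operand;
        }

        /* Increment the PC, then repeat. */
        cpu->pc += 2;
    }
    return cpu->acc;
}
```

Running the program { OP_LOADI, 40, OP_ADDI, 2, OP_HALT, 0 } from pc = 0 leaves 42 in the accumulator and stops with the PC pointing at the halt instruction.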
CPU Registers
PC: 0x0300
12
A Simple Model of CPU Computation
• Stack Pointer (SP)
• Status Register
• Condition codes
• Positive result
• Zero result
• Negative result
• General Purpose Registers
• Holds operands of most instructions
• Enables programmers (compiler) to minimise memory references.
CPU Registers
PC: 0x0300
SP: 0xcbf3
Status
R1
Rn
13
Privileged-mode Operation
• To protect operating system execution, two or more CPU modes of operation exist
• Privileged mode (system-, kernel-mode)
• All instructions and registers are available
• User-mode
• Uses ‘safe’ subset of the instruction set
• Only affects the state of the application itself
• They cannot be used to uncontrollably interfere with OS
• Only ‘safe’ registers are accessible
CPU Registers
Interrupt Mask
Exception Type
MMU regs
Others
PC: 0x0300
SP: 0xcbf3
Status
R1
Rn
14
Example Unsafe Instruction
• “cli” instruction on x86 architecture
• Disables interrupts
• Example exploit
cli                /* disable interrupts */
while (true)
    /* loop forever */;
15
Privileged-mode Operation
• The accessibility of addresses within an address space changes depending on operating mode
• To protect kernel code and data
• Note: The exact memory ranges are usually configurable, and vary between CPU architectures and/or operating systems.
Memory Address Space
0x80000000 to 0xFFFFFFFF: Accessible only to Kernel-mode
0x00000000 to 0x7FFFFFFF: Accessible to User- and Kernel-mode
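The mode-dependent check can be sketched as a small predicate. This is only a model of what the hardware enforces, and the boundary at 0x80000000 is assumed from the example layout shown here; as noted, real ranges vary between architectures and operating systems. The names address_accessible and KERNEL_BASE are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum cpu_mode { MODE_USER, MODE_KERNEL };

/* Assumed boundary, taken from the example address space layout. */
#define KERNEL_BASE 0x80000000u

/*
 * Model of the hardware check: kernel mode may touch any address,
 * user mode only the addresses below KERNEL_BASE.
 */
bool address_accessible(uint32_t addr, enum cpu_mode mode)
{
    if (mode == MODE_KERNEL)
        return true;
    return addr < KERNEL_BASE;
}
```

On real hardware a user-mode access that fails this check raises an exception instead of returning false.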
16
System Call
User Mode
Application
Kernel Mode
System call mechanism securely transfers from user execution to kernel execution and back.
System Call Handler
17
Questions we’ll answer
• There is only one register set
• How is register use managed?
• What does an application expect a system call to look like?
• How is the transition to kernel mode triggered?
• Where is the OS entry point (system call handler)?
• How does the OS know what to do?
18
System Call Mechanism Overview
• System call transitions triggered by special processor instructions
• User to Kernel
• System call instruction
• Kernel to User
• Return from privileged mode instruction
19
System Call Mechanism Overview
• Processor mode
• Switched from user-mode to kernel-mode
• Switched back when returning to user mode
• Stack Pointer (SP)
• User-level SP is saved and a kernel SP is initialised
• User-level SP restored when returning to user-mode
• Program Counter (PC)
• User-level PC is saved and PC set to kernel entry point
• User-level PC restored when returning to user-level
• Kernel entry via the designated entry point must be strictly enforced
20
System Call Mechanism Overview
• Registers
• Set at user-level to indicate system call type and its arguments
• A convention between applications and the kernel
• Some registers are preserved at user-level or kernel-level in order to restart user-level execution
• Depends on language calling convention etc.
• Result of system call placed in registers when returning to user-level
• Another convention
21
Why do we need system calls?
• Why not simply jump into the kernel via a function call?
• Function calls do not
• Change from user to kernel mode
• and eventually back again
• Restrict possible entry points to secure locations
• To prevent entering after any security checks
22
Steps in Making a System Call
There are 11 steps in making the system call
read (fd, buffer, nbytes)
23
The MIPS R2000/R3000
• Before looking at system call mechanics in some detail, we need a basic understanding of the MIPS R3000
24
Coprocessor 0
• The processor control registers are located in CP0
• Exception/Interrupt management registers
• Translation management registers
• CP0 is manipulated using mtc0 (move to) and mfc0 (move from) instructions
• mtc0/mfc0 are only accessible in kernel mode.
CP0
CP1 (floating point)
PC: 0x0300
HI/LO
R1
Rn
25
CP0 Registers
• Exception Management
• c0_cause
• Cause of the recent exception
• c0_status
• Current status of the CPU
• c0_epc
• Address of the instruction that caused the exception
• c0_badvaddr
• Address accessed that caused the exception
• Miscellaneous
• c0_prid
• Processor Identifier
• Memory Management
• c0_index
• c0_random
• c0_entryhi
• c0_entrylo
• c0_context
• More about these later in course
26
c0_status
• For practical purposes, you can ignore most bits
• Green background is the focus
27
c0_status
28
• IM
• Individual interrupt mask bits
• 6 external
• 2 software
• KU
• 0 = kernel
• 1 = user mode
• IE
• 0 = all interrupts masked
• 1 = interrupts enabled
• Mask determined via IM bits
• c, p, o = current, previous, old
c0_cause
• IP
• Interrupts pending
• 8 bits indicating current state of interrupt lines
• CE
• Coprocessor error
• Attempt to access disabled Copro.
• BD
• If set, the instruction that caused the exception was in a branch delay slot
• ExcCode
• The code number of the exception taken
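As a small sketch of how a kernel might pick these fields out of a c0_cause value, the function below uses the R3000 bit positions as I understand them (BD in bit 31, CE in bits 29:28, IP in bits 15:8, ExcCode in bits 6:2). The register diagram is not reproduced in this text, so treat the positions as an assumption to check against the Hardware Guide. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of c0_cause; the names are illustrative only. */
struct cause_fields {
    unsigned bd;       /* 1 if the exception was in a branch delay slot */
    unsigned ce;       /* coprocessor number for a coprocessor error */
    unsigned ip;       /* the 8 interrupt-pending bits */
    unsigned exccode;  /* code number of the exception taken */
};

/* Assumed R3000 layout: BD=31, CE=29:28, IP=15:8, ExcCode=6:2. */
struct cause_fields decode_cause(uint32_t cause)
{
    struct cause_fields f;

    f.bd      = (cause >> 31) & 0x1u;
    f.ce      = (cause >> 28) & 0x3u;
    f.ip      = (cause >> 8)  & 0xffu;
    f.exccode = (cause >> 2)  & 0x1fu;
    return f;
}
```

For example, a cause value of 0x00000020 decodes to ExcCode 8, which is the syscall exception on MIPS, with no interrupts pending.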
29
Exception Codes
30
Exception Codes
31
c0_epc
• The Exception Program Counter
• Points to address of where to restart execution after handling the exception or interrupt
• Example
• Assume sw r3,(r4) causes a restartable fault exception
Aside: We are ignoring the BD bit in c0_cause, which is also used in reality on rare occasions.
nop
sw r3 (r4)
nop
C0_epc
C0_cause
C0_status
CP1 (floating point)
PC: 0x0300
HI/LO
R1
Rn
32
Exception Vectors
33
Simple Exception Walk-through
User Mode
Application
Kernel Mode
Interrupt Handler
34
Hardware exception handling
• Let’s now walk through an exception
• Assume an interrupt occurred as the previous instruction completed
• Note: We are in user mode with interrupts enabled
PC: 0x12345678
EPC: ?
Cause: ?
Status: KUo=? IEo=? KUp=? IEp=? KUc=1 IEc=1
35
Hardware exception handling
• Instruction address at which to restart after the interrupt is transferred to EPC
PC: 0x12345678
EPC: 0x12345678
Cause: ?
Status: KUo=? IEo=? KUp=? IEp=? KUc=1 IEc=1
36
Hardware exception handling
• Kernel Mode is set, and previous mode shifted along
• Interrupts disabled and previous state shifted along
PC: 0x12345678
EPC: 0x12345678
Cause: ?
Status: KUo=? IEo=? KUp=1 IEp=1 KUc=0 IEc=0
37
Hardware exception handling
• Code for the exception placed in Cause. Note Interrupt code = 0
PC: 0x12345678
EPC: 0x12345678
Cause: 0
Status: KUo=? IEo=? KUp=1 IEp=1 KUc=0 IEc=0
38
Hardware exception handling
• Address of general exception vector placed in PC
PC: 0x80000080
EPC: 0x12345678
Cause: 0
Status: KUo=? IEo=? KUp=1 IEp=1 KUc=0 IEc=0
39
Hardware exception handling
• CPU is now running in kernel mode at 0x80000080, with interrupts disabled
• All information required to
• Find out what caused the exception
• Restart after exception handling
is in coprocessor registers
PC: 0x80000080
EPC: 0x12345678
Cause: 0
Status: KUo=? IEo=? KUp=1 IEp=1 KUc=0 IEc=0
40
Returning from an exception
• For now, let’s ignore
• how the exception is actually handled
• how user-level registers are preserved
• Let’s simply look at how we return from the exception
41
Returning from an exception
• This code to return is
lw r27, saved_epc
nop
jr r27
rfe
• Load the contents of EPC which is usually moved earlier to somewhere in memory by the exception handler
PC: 0x80001234
EPC: 0x12345678
Cause: 0
Status: KUo=? IEo=? KUp=1 IEp=1 KUc=0 IEc=0
42
Returning from an exception
• This code to return is
lw r27, saved_epc
nop
jr r27
rfe
• Store the EPC back in the PC
PC: 0x12345678
EPC: 0x12345678
Cause: 0
Status: KUo=? IEo=? KUp=1 IEp=1 KUc=0 IEc=0
43
Returning from an exception
• This code to return is
lw r27, saved_epc
nop
jr r27
rfe
• In the branch delay slot, execute a restore from exception instruction
PC: 0x12345678
EPC: 0x12345678
Cause: 0
Status: KUo=? IEo=? KUp=? IEp=? KUc=1 IEc=1
44
Returning from an exception
• We are now back in the same state we were in when the exception happened
PC: 0x12345678
EPC: 0x12345678
Cause: 0
Status: KUo=? IEo=? KUp=? IEp=? KUc=1 IEc=1
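The effect of rfe on the status register can be sketched as the reverse of the shift done on exception entry. The function below is a model under the same assumption as before, that the KU/IE stack occupies the low six bits of Status; the name rfe_status is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the rfe instruction's effect on c0_status:
 * previous -> current and old -> previous.
 * The old pair (KUo, IEo) is left unchanged.
 */
uint32_t rfe_status(uint32_t status)
{
    uint32_t old_and_prev = status & 0x3cu;   /* KUo IEo KUp IEp */

    /* Clear the previous and current pairs, then shift down by two. */
    return (status & ~0x0fu) | (old_and_prev >> 2);
}
```

Applied to the kernel-mode value from the walk-through (KUp = IEp = 1, KUc = IEc = 0, that is 0x0c in the low bits), it yields KUc = IEc = 1: user mode with interrupts enabled, as before the exception.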
45
MIPS System Calls
• System calls are invoked via a syscall instruction.
• The syscall instruction causes an exception and transfers control to the general exception handler
• A convention (an agreement between the kernel and applications) is required as to how user-level software indicates
• Which system call is required • Where its arguments are
• Where the result should go
46
OS/161 Systems Calls
• OS/161 uses the following conventions
• Arguments are passed and returned via the normal C function calling convention
• Additionally
• Reg v0 contains the system call number
• On return, reg a3 contains
• 0: if success, v0 contains successful result
• not 0: if failure, v0 has the errno.
• v0 stored in errno
• -1 returned in v0
47
[Figure: MIPS register usage at the user-kernel boundary. Convention for kernel entry: v0 = SysCall No., a0-a3 = Args in, remaining registers preserved (preserved for C calling convention). Convention for kernel exit: v0 = Result, a3 = Success?, remaining registers preserved.]
48
• Seriously low-level code follows
• This code is not for the faint hearted
User-Level System Call Walk Through - Calling read()
int read(int filehandle, void *buffer, size_t size)
• Three arguments, one return value
• Code fragment calling the read function
400124: 02602021 move a0,s3
400128: 27a50010 addiu a1,sp,16
40012c: 0c1001a3 jal 40068c
400130: 24060400 li a2,1024
400134: 00408021 move s0,v0
400138: 1a000016 blez s0,400194
• Args are loaded, return value is tested
50
Inside the read() syscall function part 1
0040068c
40068c: 08100190 j 400640 <__syscall>
400690: 24020005 li v0,5
• Appropriate registers are preserved
• Arguments (a0-a3), return address (ra), etc.
• The syscall number (5) is loaded into v0
• Jump (not jump and link) to the common syscall routine
51
The read() syscall function part 2
00400640 <__syscall>:
400640: 0000000c syscall                          /* Generate a syscall exception */
400644: 10e00005 beqz a3,40065c <__syscall+0x1c>  /* Test success, if yes, branch to return from function */
400648: 00000000 nop
40064c: 3c011000 lui at,0x1000                    /* If failure, store code in errno */
400650: ac220000 sw v0,0(at)
400654: 2403ffff li v1,-1                         /* Set read() result to -1 */
400658: 2402ffff li v0,-1
40065c: 03e00008 jr ra                            /* Return to location after where read() was called */
400660: 00000000 nop
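Read as C, the instructions after the syscall do roughly the following. This is a sketch only: v0 and a3 are passed in as ordinary parameters to stand for the register values the kernel leaves behind, and sketch_errno is a stand-in for the real errno variable, which the assembly reaches through the address built from 0x1000.

```c
#include <assert.h>

/* Stand-in for the C library's errno; hypothetical name. */
static int sketch_errno;

/*
 * What the user-level stub does after the syscall instruction returns.
 * v0 and a3 model the values the kernel left in those registers.
 */
int syscall_stub_result(int v0, int a3)
{
    if (a3 == 0) {
        /* Success: v0 already contains the successful result. */
        return v0;
    }
    /* Failure: v0 has the error code, store it in errno ... */
    sketch_errno = v0;
    /* ... and set the function's result to -1. */
    return -1;
}
```

So a caller of read() only ever sees the usual C convention: a non-negative byte count on success, or -1 with errno set on failure.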
56
Summary
• From the caller’s perspective, the read() system call behaves like a normal function call
• It preserves the calling convention of the language
• However, the actual function implements its own convention by agreement with the kernel
• Our OS/161 example assumes the kernel preserves appropriate registers (s0-s8, sp, gp, ra).
• Most languages have similar libraries that interface with the operating system.
57
System Calls – Kernel Side
• Things left to do
• Change to kernel stack
• Preserve registers by saving to memory (on the kernel stack)
• Leave saved registers somewhere accessible to
• Read arguments
• Store return values
• Do the “read()”
• Restore registers
• Switch back to user stack
• Return to application
58
OS/161 Exception Handling
• Note: The following code is from the uniprocessor variant of OS/161 (v1.x).
• Simpler, but broadly similar to current version.
59
exception:
    move k1, sp             /* Save previous stack pointer in k1 */
    mfc0 k0, c0_status      /* Get status register */
    andi k0, k0, CST_Kup    /* Check the we-were-in-user-mode bit */
    beq k0, $0, 1f          /* If clear, from kernel, already have stack */
    nop                     /* delay slot */

    /* Coming from user mode - load kernel stack into sp */
    la k0, curkstack        /* get address of "curkstack" */
    lw sp, 0(k0)            /* get its value */
    nop                     /* delay slot for the load */
1:
    mfc0 k0, c0_cause       /* Now, load the exception cause. */
    j common_exception      /* Skip to common code */
    nop                     /* delay slot */

Note: k0, k1 registers are available for kernel use
60
common_exception:
    /*
     * At this point:
     *    Interrupts are off. (The processor did this for us.)
     *    k0 contains the exception cause value.
     *    k1 contains the old stack pointer.
     *    sp points into the kernel stack.
     *    All other registers are untouched.
     */

    /*
     * Allocate stack space for 37 words to hold the trap frame,
     * plus four more words for a minimal argument block.
     */
    addi sp, sp, -164
62
    /* The order here must match mips/include/trapframe.h. */

    /*
     * These six stores are a "hack" to avoid confusing GDB.
     * You can ignore the details of why and how.
     */
    sw ra, 160(sp)      /* dummy for gdb */
    sw s8, 156(sp)      /* save s8 */
    sw sp, 152(sp)      /* dummy for gdb */
    sw gp, 148(sp)      /* save gp */
    sw k1, 144(sp)      /* dummy for gdb */
    sw k0, 140(sp)      /* dummy for gdb */

    /* The real work starts here */
    sw k1, 152(sp)      /* real saved sp */
    nop                 /* delay slot for store */

    mfc0 k1, c0_epc     /* Copr.0 reg 14 == PC for exception */
    sw k1, 160(sp)      /* real saved PC */
64
sw t9, 136(sp)
sw t8, 132(sp)
sw s7, 128(sp)
sw s6, 124(sp)
sw s5, 120(sp)
sw s4, 116(sp)
sw s3, 112(sp)
sw s2, 108(sp)
sw s1, 104(sp)
sw s0, 100(sp)
sw t7, 96(sp)
sw t6, 92(sp)
sw t5, 88(sp)
sw t4, 84(sp)
sw t3, 80(sp)
sw t2, 76(sp)
sw t1, 72(sp)
sw t0, 68(sp)
sw a3, 64(sp)
sw a2, 60(sp)
sw a1, 56(sp)
sw a0, 52(sp)
sw v1, 48(sp)
sw v0, 44(sp)
sw AT, 40(sp)
sw ra, 36(sp)
Save all the registers on the kernel stack
65
    /*
     * Save special registers.
     */
    mfhi t0
    mflo t1
    sw t0, 32(sp)
    sw t1, 28(sp)

    /*
     * Save remaining exception context information.
     */
    sw k0, 24(sp)           /* k0 was loaded with cause earlier */
    mfc0 t1, c0_status      /* Copr.0 reg 12 == status */
    sw t1, 20(sp)
    mfc0 t2, c0_vaddr       /* Copr.0 reg 8 == faulting vaddr */
    sw t2, 16(sp)

    /*
     * Pretend to save $0 for gdb's benefit.
     */
    sw $0, 12(sp)

We can now use the other registers (t0, t1) that we have preserved on the stack
66
    /*
     * Prepare to call mips_trap(struct trapframe *)
     */
    addiu a0, sp, 16        /* set argument */
    jal mips_trap           /* call it */
    nop                     /* delay slot */

Create a pointer to the base of the saved registers and state in the first argument register
67
struct trapframe {
    u_int32_t tf_vaddr;     /* vaddr register */
    u_int32_t tf_status;    /* status register */
    u_int32_t tf_cause;     /* cause register */
    u_int32_t tf_lo;
    u_int32_t tf_hi;
    u_int32_t tf_ra;        /* Saved register 31 */
    u_int32_t tf_at;        /* Saved register 1 (AT) */
    u_int32_t tf_v0;        /* Saved register 2 (v0) */
    u_int32_t tf_v1;        /* etc. */
    u_int32_t tf_a0;
    u_int32_t tf_a1;
    u_int32_t tf_a2;
    u_int32_t tf_a3;
    u_int32_t tf_t0;
    ⁞
    u_int32_t tf_t7;
    u_int32_t tf_s0;
    ⁞
    u_int32_t tf_s7;
    u_int32_t tf_t8;
    u_int32_t tf_t9;
    u_int32_t tf_k0;        /* dummy (see exception.S comments) */
    u_int32_t tf_k1;        /* dummy */
    u_int32_t tf_gp;
    u_int32_t tf_sp;
    u_int32_t tf_s8;
    u_int32_t tf_epc;       /* coprocessor 0 epc register */
};

By creating a pointer to here of type struct trapframe *, we can access the user’s saved registers as normal variables within ‘C’

Kernel Stack (top to bottom): epc, s8, sp, gp, k1, k0, t9, t8, ⁞, at, ra, hi, lo, cause, status, vaddr
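The link between the assembly's stack offsets and the C structure can be checked mechanically. The sketch below writes out all 37 fields, filling the elided t1-t6 and s1-s6 from the order of the store instructions shown earlier, and uses uint32_t in place of u_int32_t so that it compiles with a standard C11 compiler. The struct name and the TF_STACK_OFFSET macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the slide's struct trapframe, with nothing elided. */
struct trapframe_sketch {
    uint32_t tf_vaddr, tf_status, tf_cause, tf_lo, tf_hi;
    uint32_t tf_ra, tf_at, tf_v0, tf_v1;
    uint32_t tf_a0, tf_a1, tf_a2, tf_a3;
    uint32_t tf_t0, tf_t1, tf_t2, tf_t3, tf_t4, tf_t5, tf_t6, tf_t7;
    uint32_t tf_s0, tf_s1, tf_s2, tf_s3, tf_s4, tf_s5, tf_s6, tf_s7;
    uint32_t tf_t8, tf_t9;
    uint32_t tf_k0, tf_k1;
    uint32_t tf_gp, tf_sp, tf_s8;
    uint32_t tf_epc;
};

/* The trap frame is supposed to be 37 registers long. */
_Static_assert(sizeof(struct trapframe_sketch) == 37 * 4,
               "trapframe must be 37 words");

/*
 * The frame starts 16 bytes above sp (the minimal argument block),
 * so a field's offset from sp in the assembly is 16 + its struct offset.
 */
#define TF_STACK_OFFSET(field) \
    (16 + offsetof(struct trapframe_sketch, field))
```

With this layout TF_STACK_OFFSET(tf_epc) evaluates to 160 and TF_STACK_OFFSET(tf_v0) to 44, matching the "sw k1, 160(sp)" and "sw v0, 44(sp)" stores, and the total of 16 + 148 bytes accounts for the 164 subtracted from sp.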
68
Now we arrive in the ‘C’ kernel
/*
 * General trap (exception) handling function for mips.
 * This is called by the assembly-language exception handler once
 * the trapframe has been set up.
 */
void
mips_trap(struct trapframe *tf)
{
    u_int32_t code, isutlb, iskern;
    int savespl;

    /* The trap frame is supposed to be 37 registers long. */
    assert(sizeof(struct trapframe)==(37*4));

    /* Save the value of curspl, which belongs to the old context. */
    savespl = curspl;

    /* Right now, interrupts should be off. */
    curspl = SPL_HIGH;
69
What happens next?
• The kernel deals with whatever caused the exception
• Syscall
• Interrupt
• Page fault
• It potentially modifies the trapframe, etc
• E.g., Store return code in v0, zero in a3
• ‘mips_trap’ eventually returns
70
exception_return:
    /* 16(sp)  no need to restore tf_vaddr */
    lw t0, 20(sp)           /* load status register value into t0 */
    nop                     /* load delay slot */
    mtc0 t0, c0_status      /* store it back to coprocessor 0 */
    /* 24(sp)  no need to restore tf_cause */

    /* restore special registers */
lw t1, 28(sp)
lw t0, 32(sp)
mtlo t1
mthi t0
/* load the general registers */
lw ra, 36(sp)
lw AT, 40(sp)
lw v0, 44(sp)
lw v1, 48(sp)
lw a0, 52(sp)
lw a1, 56(sp)
lw a2, 60(sp)
lw a3, 64(sp)
71
lw t0, 68(sp)
lw t1, 72(sp)
lw t2, 76(sp)
lw t3, 80(sp)
lw t4, 84(sp)
lw t5, 88(sp)
lw t6, 92(sp)
lw t7, 96(sp)
lw s0, 100(sp)
lw s1, 104(sp)
lw s2, 108(sp)
lw s3, 112(sp)
lw s4, 116(sp)
lw s5, 120(sp)
lw s6, 124(sp)
lw s7, 128(sp)
lw t8, 132(sp)
lw t9, 136(sp)
/* 140(sp)  "saved" k0 was dummy garbage anyway */
/* 144(sp)  "saved" k1 was dummy garbage anyway */
72
    lw gp, 148(sp)      /* restore gp */
    /* 152(sp)  stack pointer - below */
    lw s8, 156(sp)      /* restore s8 */
    lw k0, 160(sp)      /* fetch exception return PC into k0 */
    lw sp, 152(sp)      /* fetch saved sp (must be last) */

    /* done */
    jr k0               /* jump back */
    rfe                 /* in delay slot */
    .end common_exception

Note again that only k0, k1 have been trashed
73