CONCURRENCY: INTRODUCTION
Andrea Arpaci-Dusseau CS 537, Fall 2019
ADMINISTRIVIA
Project 3 Due Tonight
– Turn in Makefile and all src code (.c and .h) to make mysh
Plan: Project 4 to be available Tuesday
Office hours instead of Lab Hours from Friday to Monday?
Midterm 1 on Thursday 7:30pm – 9:30pm
– Covers material including today's lecture
– Read textbook if you haven't
– Complete Canvas homeworks
– Take sample exams and check what you don't understand
– Look over lecture notes; can you answer all questions?
– Post questions for Tuesday lecture review
AGENDA / LEARNING OUTCOMES
Virtual memory: summary
Concurrency
What is the motivation for concurrent execution? What are some of the challenges?
RECAP
SWAPPING: Intuition
Idea: OS keeps unreferenced pages on disk
– Slower, cheaper backing store than memory
Process can run when not all pages are loaded into main memory
OS and hardware cooperate to make large disk seem like memory
– Same behavior as if all of address space were in main memory
Requirements:
– OS must have mechanism to identify location of each page in address space → in memory or on disk
– OS must have policy for determining which pages live in memory and which on disk
Virtual Memory Mechanisms
First, hardware checks TLB for virtual address
– If TLB hit, address translation is done; page is in physical memory
Else TLB miss…
– Hardware or OS walks page tables
– If PTE designates page is present, then page is in physical memory
Else page fault (i.e., present bit is cleared)…
– Trap into OS (not handled by hardware)
– OS selects victim page in memory to replace (look at use bits for clock)
  • Write victim page out to disk if modified (requires dirty bit in PTE)
– OS reads referenced page from disk into memory (faulting process is blocked)
– Page table is updated, present bit is set
– Process continues execution (process moved to ready state)
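The victim-selection step is typically the clock algorithm sweeping over use bits. A minimal user-space sketch of just that step (the frame contents and names here are invented for illustration):

#include <stdbool.h>
#include <stdio.h>

#define NFRAMES 4

// Illustrative physical frames: which page each holds, plus a use bit.
struct frame { int page; bool use; };

static struct frame frames[NFRAMES] = {
    {10, true}, {11, false}, {12, true}, {13, true}
};
static int hand = 0;  // clock hand sweeps the frames circularly

// Sweep until a frame with its use bit clear is found;
// clear use bits as we pass (each page gets a "second chance").
int choose_victim(void) {
    for (;;) {
        if (!frames[hand].use) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        frames[hand].use = false;
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void) {
    int v = choose_victim();
    printf("victim: frame %d holding page %d\n", v, frames[v].page);
    return 0;
}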
Chat: performance impact?
Assume workload of multiple processes
– Processes each alternate between CPU and I/O
– Each accesses some amount of memory
What happens to system performance (throughput = jobs/sec) as we increase the number of processes?
– What if the sum of the working sets > physical memory? (thrashing)
Bonus: What if no Hardware Support?
What can the OS do if hardware does not have a use bit (or dirty bit), and a hardware-filled TLB?
– Can the OS "emulate" these bits?
Leading question:
– How can the OS get control (i.e., generate a trap) every time the use bit should be set (i.e., when a page is accessed)? One classic trick is sketched below.
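The classic answer: mark the page inaccessible so the next access traps, record the "use" in software, then restore access. Below is a Linux user-space analogue of that trick, a sketch only; the real kernel version flips PTE bits instead, and mprotect is not formally async-signal-safe even though this pattern works in practice:

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesz;
static volatile sig_atomic_t used = 0;  // our software-emulated use bit

// First access to the protected page traps here; record the "use",
// then re-enable access so the faulting instruction retries and succeeds.
static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    used = 1;
    uintptr_t base = (uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1);
    mprotect((void *)base, pagesz, PROT_READ | PROT_WRITE);
}

int main(void) {
    pagesz = sysconf(_SC_PAGESIZE);
    // Map one page with no permissions, so any access faults.
    char *page = mmap(NULL, pagesz, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    page[0] = 'x';                          // traps once, then completes
    printf("use bit emulated: %d\n", (int)used);  // prints 1
    return 0;
}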
SUMMARY: VIRTUAL MEMORY
Abstraction: virtual address space with code, heap, stack
Address translation
– Contiguous memory: base, bounds, segmentation
– Using fixed-size pages with page tables
Challenges with paging
– Extra memory references: avoid with TLB
– Page table size: avoid with multi-level paging
Larger address spaces: Swapping mechanisms, policies (LRU, Clock)
Review: Easy Piece 1
CPU virtualization: context switches, schedulers
Memory virtualization: allocation, segmentation, paging, TLBs, multi-level page tables, swapping
CONCURRENCY
Motivation for Concurrency
CPU trend: same speed, but multiple cores
Option 0: Run many different applications on one machine
– OS scheduler ensures cores are used efficiently
Our goal: write applications that fully utilize many cores
Option 1: Use multiple processes
Build apps from many communicating processes
– Example: Chrome (process per tab)
– Communicate via pipe() or other message-passing primitives (see the sketch after this list)
Pros?
– Don't need new abstractions; good for security
Cons?
– Cumbersome programming
– High communication overheads
– Expensive context switching (why expensive?)
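A minimal sketch of this style, assuming one parent and one worker child connected by pipe(); no memory is shared between the two processes:

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) != 0) return 1;   // fds[0] = read end, fds[1] = write end

    if (fork() == 0) {              // child: the "worker" process
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("child got: %s\n", buf); }
        return 0;
    }

    close(fds[0]);                  // parent: send a message, then wait
    const char *msg = "work item #1";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    wait(NULL);
    return 0;
}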
Option 2: Divide process into threads
New abstraction: the thread
Threads are like processes, except:
– multiple threads of the same process share an address space
Divide large task across several cooperative threads
Communicate through the shared address space
Common Programming Models
Multi-threaded programs tend to be structured as:
– Producer/consumer
Multiple producer threads create data (or work) that is handled by one of the multiple consumer threads (a minimal sketch follows this list)
– Pipeline
Task is divided into a series of subtasks, each of which is handled in series by a different thread
– Defer work with background thread
One thread performs non-critical work in the background (e.g., when CPU is idle)
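A minimal producer/consumer sketch with a one-slot buffer (buffer layout and names invented for the example); it leans on locks and condition variables, primitives covered in detail later in the course:

#include <pthread.h>
#include <stdio.h>

static int slot, full = 0;   // one-slot buffer and its state
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t fill = PTHREAD_COND_INITIALIZER;
static pthread_cond_t empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) {
        pthread_mutex_lock(&m);
        while (full) pthread_cond_wait(&empty, &m);  // wait for space
        slot = i; full = 1;
        pthread_cond_signal(&fill);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) {
        pthread_mutex_lock(&m);
        while (!full) pthread_cond_wait(&fill, &m);  // wait for data
        printf("consumed %d\n", slot); full = 0;
        pthread_cond_signal(&empty);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}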
[Figure: two CPUs, each running one thread of the same process; both CPUs' PTBRs point to PageDir A in RAM, which maps the shared CODE and HEAP, but each CPU holds its own instruction pointer (IP).]
What state do threads share?
Do threads share page directories? Yes: all threads of a process use the same page directory, and thus the same address space.
Do threads share instruction pointers? No: threads share code, but each thread may be executing different code at the same time → different instruction pointers.
[Figure: the same two CPUs, now also showing a stack pointer (SP) per thread; the shared address space contains CODE, HEAP, STACK 1, and STACK 2.]
Do threads share a stack pointer? No: threads executing different functions need different stacks, so each thread has its own SP.
(But the stacks are in the same address space, so threads are trusted to be cooperative.)
THREAD VS. PROCESS
Multiple threads within a single process share:
– Process ID (PID)
– Address space: code (instructions), most data (heap)
– Open file descriptors
– Current working directory (environment variables)
– User and group id (permissions, limits)
Each thread has its own:
– Thread ID (TID)
– Set of registers, including program counter and stack pointer
– Stack for local variables and return addresses (in the same address space)
THREAD API
Variety of thread systems exist
– POSIX Pthreads
Common thread operations (a minimal example follows):
– Create
– Exit
– Join (instead of wait() for processes)
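A minimal Pthreads example exercising Create and Join (the worker function and its argument are invented for illustration):

#include <pthread.h>
#include <stdio.h>

// Thread body: receives a pointer argument, does its work, returns.
// Returning from the function has the same effect as pthread_exit(NULL).
static void *worker(void *arg) {
    int n = *(int *)arg;
    printf("hello from thread %d\n", n);
    return NULL;
}

int main(void) {
    pthread_t t;
    int arg = 42;
    pthread_create(&t, NULL, worker, &arg);  // Create
    pthread_join(t, NULL);                   // Join: the wait() analogue
    return 0;
}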
OS Support: Approach 1
User-level threads: many-to-one thread mapping
– Implemented by user-level runtime libraries, which create, schedule, and synchronize threads at user level
– OS is not aware of user-level threads; OS thinks each process contains only a single thread of control
Advantages
– Does not require OS support; portable
– Can tune scheduling policy to meet application demands
– Lower overhead thread operations since no system call
Disadvantages?
– Cannot leverage multiprocessors
– Entire process blocks when one thread blocks
OS Support: Approach 2
Kernel-level threads: One-to-one thread mapping
– OS provides each user-level thread with a kernel thread
– Each kernel thread scheduled independently
– Thread operations (creation, scheduling, synchronization) performed by OS
Advantages
– Each kernel-level thread can run in parallel on a multiprocessor
– When one thread blocks, other threads from process can be scheduled
Disadvantages
– Higher overhead for thread operations
– OS must scale well with increasing number of threads
THREADS DEMO
Thread Schedule #1
balance = balance + 1;   (balance stored at address 0x9cd4)
Registers are virtualized by the OS; each thread thinks it has its own.

The statement compiles to three instructions:
0x195 mov 0x9cd4, %eax
0x19a add $0x1, %eax
0x19d mov %eax, 0x9cd4

Initial state: 0x9cd4: 100
Process control blocks: T1 {%eax: ?, %rip: 0x195}, T2 {%eax: ?, %rip: 0x195}

T1 runs all three instructions:
– after 0x195: 0x9cd4: 100, %eax: 100, %rip: 0x19a
– after 0x19a: 0x9cd4: 100, %eax: 101, %rip: 0x19d
– after 0x19d: 0x9cd4: 101, %eax: 101, %rip: 0x1a2

Thread context switch: T1's PCB saves {%eax: 101, %rip: 0x1a2}; T2's {%eax: ?, %rip: 0x195} is loaded.

T2 runs all three instructions:
– after 0x195: 0x9cd4: 101, %eax: 101, %rip: 0x19a
– after 0x19a: 0x9cd4: 101, %eax: 102, %rip: 0x19d
– after 0x19d: 0x9cd4: 102, %eax: 102, %rip: 0x1a2

Desired result! Both increments happened: balance ends at 102.
Another schedule

Thread Schedule #2
Initial state: 0x9cd4: 100
Process control blocks: T1 {%eax: ?, %rip: 0x195}, T2 {%eax: ?, %rip: 0x195}

T1 runs, but is interrupted before its store:
– after 0x195: 0x9cd4: 100, %eax: 100, %rip: 0x19a
– after 0x19a: 0x9cd4: 100, %eax: 101, %rip: 0x19d

Thread context switch before T1 executes 0x19d: T1's PCB saves {%eax: 101, %rip: 0x19d}; T2's {%eax: ?, %rip: 0x195} is loaded.

T2 runs all three instructions:
– after 0x195: 0x9cd4: 100, %eax: 100, %rip: 0x19a
– after 0x19a: 0x9cd4: 100, %eax: 101, %rip: 0x19d
– after 0x19d: 0x9cd4: 101, %eax: 101, %rip: 0x1a2

Thread context switch back to T1: T2's PCB saves {%eax: 101, %rip: 0x1a2}; T1's {%eax: 101, %rip: 0x19d} is restored.

T1 resumes and executes its remaining store:
– after 0x19d: 0x9cd4: 101, %eax: 101, %rip: 0x1a2

WRONG result! Final value of balance is 101: one of the two increments is lost.
Timeline View #1 (T1 runs to completion, then T2)
Thread 1: mov 0x123, %eax; add $0x1, %eax; mov %eax, 0x123
Thread 2: mov 0x123, %eax; add $0x2, %eax; mov %eax, 0x123
Balance starts at 0 in every timeline below.

T1: mov 0x123, %eax   → T1-eax: 0
T1: add $0x1, %eax    → T1-eax: 1
T1: mov %eax, 0x123   → Balance: 1
T2: mov 0x123, %eax   → T2-eax: 1
T2: add $0x2, %eax    → T2-eax: 3
T2: mov %eax, 0x123   → Balance: 3
How much is added to the shared variable? 3: correct!

Timeline View #2
T2: mov 0x123, %eax   → T2-eax: 0
T1: mov 0x123, %eax   → T1-eax: 0
T1: add $0x1, %eax    → T1-eax: 1
T1: mov %eax, 0x123   → Balance: 1
T2: add $0x2, %eax    → T2-eax: 2
T2: mov %eax, 0x123   → Balance: 2
How much is added? 2: incorrect!

Timeline View #3
T1: mov 0x123, %eax   → T1-eax: 0
T2: mov 0x123, %eax   → T2-eax: 0
T2: add $0x2, %eax    → T2-eax: 2
T2: mov %eax, 0x123   → Balance: 2
T1: add $0x1, %eax    → T1-eax: 1
T1: mov %eax, 0x123   → Balance: 1
How much is added? 1: incorrect!

Timeline View #4 (T2 runs to completion, then T1)
T2: mov 0x123, %eax   → T2-eax: 0
T2: add $0x2, %eax    → T2-eax: 2
T2: mov %eax, 0x123   → Balance: 2
T1: mov 0x123, %eax   → T1-eax: 2
T1: add $0x1, %eax    → T1-eax: 3
T1: mov %eax, 0x123   → Balance: 3
How much is added? 3: correct!

Timeline View #5
T2: mov 0x123, %eax   → T2-eax: 0
T2: add $0x2, %eax    → T2-eax: 2
T1: mov 0x123, %eax   → T1-eax: 0
T1: add $0x1, %eax    → T1-eax: 1
T1: mov %eax, 0x123   → Balance: 1
T2: mov %eax, 0x123   → Balance: 2
How much is added? 2: incorrect!
Non-Determinism
Concurrency leads to non-deterministic results
– Different results even with the same inputs
– These are race conditions
Whether the bug manifests depends on the CPU schedule!
How to program: imagine the scheduler is malicious?!
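The non-determinism is easy to reproduce. A small racy program like this one (the loop count is chosen only to make the race likely) usually prints a different wrong total on each run:

#include <pthread.h>
#include <stdio.h>

static volatile int balance = 0;    // shared; updated without any lock

static void *deposit(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        balance = balance + 1;      // load, add, store: not atomic
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, deposit, NULL);
    pthread_create(&t2, NULL, deposit, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("balance = %d (expected 2000000)\n", balance);
    return 0;
}

Compile with gcc -pthread and run it a few times: the lost-update interleavings above happen constantly at this scale.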
What do we want?
Want the 3 instructions to execute as an uninterruptible group
That is, we want them to be atomic:
mov 0x123, %eax
add $0x1, %eax
mov %eax, 0x123
More general: need mutual exclusion for critical sections
– if thread A is in critical section C, thread B isn't (okay if other threads do unrelated work)
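For this one increment, hardware can do the whole read-modify-write atomically. As one illustration (a GCC/Clang-specific builtin, not something the slides require), replacing the update in the racy demo above makes it deterministic; on x86 this compiles to a locked add:

#include <pthread.h>
#include <stdio.h>

static int balance = 0;

static void *deposit(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        __atomic_fetch_add(&balance, 1, __ATOMIC_SEQ_CST); // one atomic RMW
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, deposit, NULL);
    pthread_create(&t2, NULL, deposit, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("balance = %d\n", balance);  // now reliably 2000000
    return 0;
}

Atomic instructions only cover tiny updates like this; general multi-instruction critical sections need the mutual exclusion primitives below.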
Synchronization
Build higher-level synchronization primitives in the OS
– Operations that ensure correct ordering of instructions across threads
– Use help from hardware
Motivation: build them once and get them right
[Figure, layered view: monitors, condition variables, locks, and semaphores are built on top of hardware support: loads/stores, test&set, disabling interrupts.]
CONCURRENCY SUMMARY
Concurrency is needed for high performance when using multiple cores
Threads are multiple execution streams within a single process or address space (share PID and address space, own registers and stack)
Context switches within a critical section can lead to non-deterministic bugs
Locks
Goal: provide mutual exclusion (mutex)
Allocate and initialize
– pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
Acquire
– Acquire exclusive access to the lock
– Wait if lock is not available (some other thread is in the critical section)
– Spin or block (relinquish CPU) while waiting
– pthread_mutex_lock(&mylock);
Release
– Release exclusive access to the lock; let another thread enter the critical section
– pthread_mutex_unlock(&mylock);
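Putting the three pieces together around the balance update from the racy demo (a sketch mirroring that earlier program):

#include <pthread.h>
#include <stdio.h>

static int balance = 0;
static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;  // allocate/init

static void *deposit(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&mylock);    // acquire: enter critical section
        balance = balance + 1;
        pthread_mutex_unlock(&mylock);  // release: let another thread in
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, deposit, NULL);
    pthread_create(&t2, NULL, deposit, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("balance = %d\n", balance);  // always 2000000 now
    return 0;
}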
THREADS DEMO 2