COMP 3000 Operating Systems
Containerization and Virtualization (part 2)
Lianying Zhao
Namespaces
Recall:
from_kuid_munged(); from_kgid_munged();
uid_t from_kuid_munged(struct user_namespace *to, kuid_t uid);
• Created with the unshare() or clone() system call
• unshare is also a command
• A namespace cannot be empty, hence no system call like “create
namespace”
• Network namespaces
• No longer need to worry about port conflicts
• PID namespaces: your process can be the init (PID 1) of your world
• User namespaces: UID 0 may not mean root
• Mount namespaces: you can have different root file systems
• The system starts out with a single namespace of each type
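On Linux, a process's namespace memberships are visible as symbolic links under /proc/<pid>/ns, with targets of the form pid:[4026531836]. Two processes are in the same namespace when both the type and the inode number match. The sketch below parses such link targets. The helper names (parse_ns_link, same_namespace, list_namespaces) are illustrative only, and the /proc lookup assumes a Linux host.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "net:[4026531992]".
_NS_LINK = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (namespace_type, inode_number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def same_namespace(target_a, target_b):
    """True when two link targets name the same namespace instance."""
    return parse_ns_link(target_a) == parse_ns_link(target_b)


def list_namespaces(pid="self"):
    """Map each entry in /proc/<pid>/ns to its (type, inode) pair."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        result[name] = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
    return result


if __name__ == "__main__":
    try:
        for name, (ns_type, inode) in list_namespaces().items():
            print(f"{name:20s} {ns_type}:[{inode}]")
    except (OSError, ValueError):
        print("/proc/self/ns is not available on this system")
```

Running the same listing inside and outside a container and comparing the inode numbers shows which namespace types the container actually has to itself.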
COMP 3000 (Winter 2021) 2
Control Groups
• A kernel feature for metering and limiting (quota) resource usage
• CPU, memory, disk, network, etc.
• A control group is a collection of processes
• Acts based on parameters/limits
• Can be hierarchical, e.g., inheriting limits from parent groups
• Exposed as the cgroup file system (hence cgroup namespaces)
• /sys/fs/cgroup
• Originally under the name “process containers”
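To make the hierarchy point concrete: a group can never use more than any of its ancestors allows, so the effective limit is the tightest one on the path up to the root. The sketch below models this with cgroup-v2-style values, where the string "max" means unlimited. The group names, the byte values, and the function names are hypothetical; the code does not touch /sys/fs/cgroup.

```python
import math


def parse_limit(text):
    """Parse a cgroup-v2-style limit value; 'max' means unlimited."""
    text = text.strip()
    return math.inf if text == "max" else int(text)


def effective_limit(limits, parents, group):
    """Return the tightest limit found on the path from `group` to the root.

    `limits` maps group path -> limit string, `parents` maps group path ->
    parent path (None for the root). Groups without an entry are unlimited.
    """
    tightest = math.inf
    while group is not None:
        tightest = min(tightest, parse_limit(limits.get(group, "max")))
        group = parents.get(group)
    return tightest


# Hypothetical hierarchy: /web is capped at 512 MiB, and its child asks for
# 1 GiB, which the parent's cap overrides.
PARENTS = {"/": None, "/web": "/", "/web/worker": "/web", "/batch": "/"}
LIMITS = {
    "/": "max",
    "/web": str(512 * 1024 * 1024),
    "/web/worker": str(1024 * 1024 * 1024),
    "/batch": "max",
}
```

Here effective_limit(LIMITS, PARENTS, "/web/worker") yields the parent's 512 MiB cap, not the child's own larger setting.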
Additional/optional Building Blocks
• Seccomp
• Secure Computing mode
• Put simply, to disable (deny access to) certain system calls
• AppArmor
• Application Armor, a Linux kernel security module (LSM)
• Based on per-program profiles
• Towards MAC (mandatory access control)
• Works with file paths
• SELinux
• Security-Enhanced Linux, another Linux kernel security module (LSM)
• Based on security labels
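The seccomp idea of denying certain system calls can be pictured as a lookup performed on every call before the kernel services it. The sketch below is a toy model of a deny-list filter, not the real seccomp or libseccomp interface. The class name, the action strings, and the chosen system calls are assumptions made for illustration.

```python
import errno

ALLOW = "allow"
DENY = "errno"


class SyscallFilter:
    """Toy deny-list: listed system calls fail, everything else passes."""

    def __init__(self, denied):
        # Maps a system call name to the error code the caller would see.
        self.denied = dict(denied)

    def check(self, syscall):
        """Return (action, error_code) for one attempted system call."""
        if syscall in self.denied:
            return DENY, self.denied[syscall]
        return ALLOW, None


# Hypothetical policy: a container that may not mount file systems or
# trace other processes.
container_filter = SyscallFilter({"mount": errno.EPERM, "ptrace": errno.EPERM})
```

With this policy, container_filter.check("read") is allowed, while container_filter.check("mount") is denied with EPERM.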
So What a Container Really Is
• Building blocks, supported by the kernel
• Execution drivers
• libcontainer (now runc), libvirt-lxc, OpenVZ, systemd-nspawn (or, even more crazily, BSD Jails, Solaris Zones, AIX WPARs)
• Where the container artifact actually appears
• Container management
• LXC, OpenVZ, LXD*, Docker
• How the user deals with containers
Docker
• A set of open-source tools for container management
• Positioned to support a single app per container
• Towards microservices
• Emphasis on reuse
Source: www.docker.com
Union File Systems
• Storage drivers for containers vary
• E.g., LXC is file system neutral
• Often a layered/union file system is used, e.g., UnionFS and AUFS
• Docker: OverlayFS
• Kernel module: overlay
• What gets mounted is a merged view of the layers below
• Higher layers “win” over lower ones
Source: www.docker.com
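The rule that higher layers win can be sketched as a lookup that walks the layers from top to bottom and stops at the first hit. The model below represents each layer as a dictionary from path to content and uses a marker object for deletions, loosely mirroring the whiteout entries OverlayFS uses to hide files in lower layers. The paths, contents, and function names are hypothetical; this is not how the overlay kernel module is implemented.

```python
# Marker for "this path was deleted in this layer" (a whiteout).
WHITEOUT = object()


def lookup(layers, path):
    """Resolve `path` against `layers`, ordered from top (upper) to bottom."""
    for layer in layers:
        if path in layer:
            entry = layer[path]
            if entry is WHITEOUT:
                # A higher layer hides whatever lower layers contain.
                raise FileNotFoundError(path)
            return entry
    raise FileNotFoundError(path)


def merged_view(layers):
    """Build the full merged view that would be visible at the mount point."""
    view = {}
    # Apply the bottom layer first so that each higher layer overrides it.
    for layer in reversed(layers):
        for path, entry in layer.items():
            if entry is WHITEOUT:
                view.pop(path, None)
            else:
                view[path] = entry
    return view


# Hypothetical container: a writable upper layer over a read-only image layer.
upper = {"/etc/motd": "hello from container", "/tmp/old": WHITEOUT}
lower = {"/etc/motd": "base image", "/bin/sh": "shell", "/tmp/old": "stale"}
```

In this example the merged view contains the upper layer's /etc/motd, the lower layer's /bin/sh, and no /tmp/old at all.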
Kubernetes
• An orchestration system for containers
• Cluster of containers
• Kubernetes worker nodes
• Pod = a group of containers or a single container
• A task
• Container Runtime Interface (CRI)
• E.g., containerd or CRI-O, which invoke runc underneath
Evolution: Docker and Kubernetes
Source: www.docker.com
Security for Containerization
• Operating-system-level virtualization
• A single kernel is shared, whereas under virtualization each guest has its own copy of everything, including the kernel
• E.g., shared system call interface
• Side channels, e.g., CPU load
• The achieved isolation is not as good as that of virtual machines
• Why?
UID Mapping
• What we don’t want:
UID 0 in the container = UID 0 outside of the container
• Solution: user namespaces (UID mapping)
• Privileged containers
• Container root not mapped, container runs as root
• Should be avoided
• Semi-unprivileged containers
• Container root mapped, container runs as root
• Unprivileged containers (rootless mode)
• Container root mapped, container runs as non-root
• Ideal, but more involved
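The mapping itself is published in /proc/<pid>/uid_map as lines of three numbers: first UID inside the namespace, first UID outside, and the length of the range. The sketch below models the translation in both directions, including the "munged" behaviour of from_kuid_munged(), which returns the overflow UID (65534 by default) when no mapping exists. It then classifies a container into the three cases above. Reading "container runs as root" as "the container runtime was started by host UID 0" is an interpretation made for this example, and all function names and the sample map are hypothetical.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid


def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, count) extents."""
    extents = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            extents.append((inside, outside, count))
    return extents


def to_host_uid(extents, container_uid):
    """Translate a UID seen inside the container to the host UID, or None."""
    for inside, outside, count in extents:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None


def from_host_uid_munged(extents, host_uid):
    """Translate a host UID into the container's view.

    Unmapped UIDs show up as the overflow UID, as with from_kuid_munged().
    """
    for inside, outside, count in extents:
        if outside <= host_uid < outside + count:
            return inside + (host_uid - outside)
    return OVERFLOW_UID


def classify(extents, runtime_host_uid):
    """Name the container type from its map and who started the runtime.

    Assumption: a container whose root is not the host's UID 0 counts as
    "mapped"; the runtime's host UID then separates the two remaining cases.
    """
    if to_host_uid(extents, 0) == 0:
        return "privileged"
    return "semi-unprivileged" if runtime_host_uid == 0 else "unprivileged"


# Identity map of the initial namespace, and a hypothetical remapped range.
identity = parse_uid_map("0 0 4294967295\n")
remapped = parse_uid_map("0 100000 65536\n")
```

With the remapped range, container root is host UID 100000, and the host's real root appears inside the container as the overflow UID 65534 rather than as UID 0.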
Achieving Isolation
• Hardware support (or lack thereof)
• Whether the facilities match the intended purpose
• Example:
Process – runtime and address space only
Container – runtime+persistent, IPC, OS artifacts, etc.
• Level of abstraction
• Recall the emulator example
• Pure software enforcement does NOT necessarily mean weak
• Emulation ≥ hardware virtualization > paravirtualization > containerization (not strictly)