Parallel Programming Patterns
SCHEDULERS
Adrian Jackson, Iakovos Panourgias
a. .ac.uk
@adrianjhpc
Overview
• Why do we need Schedulers and Resource
Management Systems (RMS);
• What are they;
• How do they work;
• How can we use them;
• Slurm (formerly known as the Simple Linux Utility for
Resource Management)
Schedulers and Resource Management Systems (RMS)
• Why do we need to add more complicated components to
an already complicated system?
• More points of failure.
• More configuration options.
• More options to break the system (or reduce functionality).
• Managing a medium to large infrastructure is already hard
enough.
Different types of HPC infrastructure
• SMP (small to medium) (10s to 100s cores)
• [kepler0/1 32 cores, morar 128 cores];
• MPP (medium to large) (100s to 1,000s cores)
• [Ultra 1,536 cores];
• Cluster (large) (1,000s to … cores)
• [ARCHER 118,080 cores];
• Distributed;
• Specialised (special hardware architectures, accelerators, etc…)
• [NGIO 100 TB of NVRAM].
• Let’s not forget accelerators.
Different uses of HPC infrastructure
• Single user (ideal if you are the user);
• 2-10 users (small business, some contention);
• 10s of users (large business, lots of contention);
• 100s of users (National Supercomputing Service).
• High-Throughput Computing: usually connect a large
number of nodes using low-end interconnects (HTCondor,
BOINC, SETI@home):
• Ideal for embarrassingly (or trivially/pleasantly) parallel problems
(SPMD);
• High-Performance Computing: connect more powerful
compute nodes using faster interconnects than HTC
clusters (lower latency and higher bandwidth).
Simplest use case
• Single user using a small SMP (or MPP or even a Cluster)
infrastructure:
• No need for a Scheduler or even an RMS.
• Parallel ssh tools (pssh, pscp, prsync and pnuke) can be used (see the sketch after this list).
• What about:
• Resource usage (billing): how do we charge?
• Infrastructure utilisation: how much of the infrastructure do we use?
How often is it fully utilised? Do we have a case for buying a
new/larger one?
• We can write our own scripts;
• Create a database, so that we can have historical data.
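For a scheduler-less setup like this, a minimal sketch of driving every node with parallel ssh (the host file name is an assumption):

    # hosts.txt lists one node per line, e.g. node01, node02
    pssh -h hosts.txt -i uptime                  # run a command on every node
    pscp -h hosts.txt input.dat /tmp/input.dat   # copy a file to every node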
Simple use case
• Two users using a small SMP infrastructure:
• Some contention;
• Users have to talk to each other (let’s hope that they are friendly users
and their deadlines do not clash).
• What about:
• Resource usage (billing): how do we charge?
• Infrastructure utilisation: how much of the infrastructure do we use? How
often is it fully utilised? Do we have a case for buying a new/larger one?
• Support: When (not if) it breaks and support tries to fix it, how can they
stop access to the system? What if a single node stops working?
• We can write our own scripts;
• Create a database, so that we can have historical data;
• Create some generic scripts to stop/resume access to the
Infrastructure (for users and support).
What is a Resource Management System
• Jobs compete with each other for (limited) compute
resources.
• A Resource Management System manages
competition amongst jobs.
• An RMS tries to reduce/manage the processing load
on the Infrastructure and usually has two different
components:
• A Resource Manager;
• A Batch/Job Scheduler (internal or external).
What is a Resource Manager and a (Batch/Job) Scheduler
• Batch Scheduler:
• First point of contact;
• Users submit jobs to the Batch system;
• The Scheduler decides when and where (on what resources) to launch
the job.
• The Batch/Job Scheduler communicates with the Resource Manager to
obtain up-to-date information about the cluster.
• Resource Manager:
• Launches jobs;
• Executes applications;
• Checks/monitors running applications;
• Checks/monitors the resources and has an up-to-date report of the
infrastructure (nodes, hardware, network, licenses, accelerators, etc.).
What is a Resource Manager and a (Batch/Job) Scheduler
[Diagram: Users A, B, and C make job submissions (or other interactions) to the Resource Management System's job queue(s). An internal or external Job Scheduler assigns the jobs to nodes; status updates are sent back to the Resource Manager.]
Popular RMSs and Batch Schedulers
• Popular Batch Schedulers:
• LSF (IBM);
• Moab Workload Manager (Adaptive Computing);
• PBS Pro (Altair);
• Grid Engine (formerly Sun Grid Engine) (Univa).
• Popular Resource Managers:
• Slurm (SchedMD);
• PBS (Altair);
• TORQUE (Cluster Resources).
• Slurm and the Application Level Placement Scheduler
(ALPS) each act as both a Batch Scheduler and a Resource Manager.
How do they work: Batch Scheduler
• First point of contact;
• Users submit jobs to the Batch system;
• The Scheduler decides when and where (on what
resources) to launch the job.
• The Batch/Job Scheduler communicates with the
Resource Manager to obtain up-to-date
information about the cluster.
How do they work: Batch Scheduler
• Typical Scheduler workflow: Submission (a user submits a new job) → Priority (the Scheduler assigns a priority to the job) → Schedule (a group of resources is reserved) → Wait (until resources are free) → Allocation (take control of the resources) → Execution (execute the application).
How do they work: Batch Scheduler: submitting jobs
• Submit jobs:
• Well defined set of commands (API) for interacting with the
batch system:
• srun, sbatch, qlogin, qsub, aprun, etc.
• Each Job/Batch Scheduler has a different syntax. However,
most of them support similar options;
• Job submission scripts (or commands) typically include
walltime, number of nodes, type of interconnect, group name,
partition name, etc.;
• The Batch system validates your submission (checking quota
levels, partition name, permissions, etc.);
• Some Batch systems also provide a machine API (allowing
applications to orchestrate job submission).
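As an illustration, a minimal Slurm submission script might look like the sketch below; the partition, account, and program names are assumptions:

    #!/bin/bash
    #SBATCH --job-name=my_job
    #SBATCH --time=00:20:00          # walltime
    #SBATCH --nodes=2                # number of nodes
    #SBATCH --partition=standard     # partition name (assumed)
    #SBATCH --account=my_group       # group/budget name (assumed)

    srun ./my_application            # launch the application on the allocation

It would be submitted with something like: sbatch my_job.slurm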
How do they work: Batch Scheduler: cancelling jobs
• Cancel jobs:
• Well defined set of commands (API) for cancelling jobs:
• scancel, qcancel, apkill, etc.
• Job cancel commands typically include the unique JOB_ID;
• Cancelling a job frees resources;
• Also stops charging/billing;
• Usually the application is stopped with a SIGTERM (15) or
SIGQUIT (3). If the application is still running then a
SIGKILL (9) is sent;
• Some Batch systems also provide a machine API (allowing
applications to orchestrate job cancellation).
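For example, with Slurm (the job ID is purely illustrative):

    scancel 123456                 # cancel a single job by its JOB_ID
    scancel -u $USER               # cancel all of your own jobs
    scancel --signal=TERM 123456   # send SIGTERM rather than the default signal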
How do they work: Batch Scheduler: querying jobs/status
• Querying:
• Well defined set of commands (API) for querying the batch
system:
• sstat, sinfo, sview, qhost, qstat, aprun, xtnodestat, etc.
• Job query commands typically include the unique JOB_ID or a
user name (or even a group name);
• You can also query a specific partition (e.g. see the current state
of the “Kepler” or “Production” partition);
• You can query:
• Status of nodes (active/allocated/down/drain/etc.);
• Status of jobs (pending/running/queued/etc.);
• Status of users (quota/running jobs/failed jobs/etc.).
• Some Batch systems also provide a machine API (allowing
applications to automate queries).
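A few illustrative Slurm queries (the partition name and job ID are assumptions):

    sinfo -p standard      # state of the nodes in the “standard” partition
    squeue -u $USER        # your pending/running jobs
    sstat -j 123456        # live statistics for a running job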
How do they work: Batch Scheduler: interacting
• Interacting with the Batch Scheduler:
• Well defined set of commands (API) for interacting with the
batch system:
• scontrol, showstart, pbsnodes, qalter, etc.
• Interacting commands typically include the unique JOB_ID or a
user name (or even a group name);
• You can:
• Pause a job;
• Increase/decrease the size of your allocation (if supported by the
Scheduler);
• Increase/decrease other characteristics (RAM, CPUs, accelerators,
licenses, etc.).
• Some Batch systems also provide a machine API (allowing
applications to orchestrate interactions).
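With Slurm, for instance, scontrol covers most of these interactions (the job ID is illustrative):

    scontrol hold 123456                              # hold (pause) a pending job
    scontrol release 123456                           # release the hold
    scontrol update JobId=123456 TimeLimit=01:00:00   # change a job characteristic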
How do they work: Batch Scheduler
• Typical Scheduler workflow (recap): Submission → Priority → Schedule → Wait → Allocation → Execution.
How do they work: Batch Scheduler: Priorities
• Schedulers use several different approaches to
calculate a job’s priority:
• Fairshare: assign a priority based on the user's and user group's
historical level of activity, so under-used accounts/groups
get a higher priority;
• Job size: usually jobs requesting more CPUs are favoured;
• Age: your priority increases the longer you wait in the queue;
• Partition: each partition has a different starting priority;
• Quality Of Service (QOS): if the user/group has a QoS then
their priority starts lower/higher;
• Trackable Resources (TRES): Each TRES has a different priority
factor (CPU, Energy, GPUs, Licenses, Memory, Nodes, etc.);
• Other…
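In Slurm, for example, these factors are combined by the multifactor priority plugin as a weighted sum; a hedged slurm.conf sketch (the weight values are arbitrary examples):

    PriorityType=priority/multifactor
    PriorityWeightFairshare=10000             # historical usage
    PriorityWeightAge=1000                    # time spent waiting in the queue
    PriorityWeightJobSize=500                 # number of CPUs requested
    PriorityWeightPartition=2000              # per-partition starting priority
    PriorityWeightQOS=5000                    # QoS boost/penalty
    PriorityWeightTRES=cpu=200,gres/gpu=400   # per-TRES factors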
How do they work: Batch Scheduler
• Typical Scheduler workflow (recap): Submission → Priority → Schedule → Wait → Allocation → Execution.
How do they work: Batch Scheduler: Schedule
• Scheduling algorithms can be divided in two
classes:
• Time-sharing: divide time on a processor into several
discrete intervals, or slots;
• Space-sharing: give the requested resources to a single
job until the job completes execution.
• Most cluster schedulers use space-sharing
algorithms.
How do they work: Batch Scheduler: Schedule
• Schedulers use several simple algorithms for
scheduling prioritised jobs:
• First In, First Out (FIFO): simple priority-order scheduling.
Once a job cannot be scheduled, all lower-priority ones are
ignored;
• First Come, First Served (FCFS): similar to FIFO. Both of
these work well in a (very) low job load environment;
• Round Robin (RR);
• Shortest Job First (SJF): schedules the shortest jobs first;
• Longest Job First (LJF): schedules the longest jobs first.
How do they work: Batch Scheduler: Schedule
• Schedulers use several advanced algorithms for
scheduling prioritised jobs:
• Backfill: augments the FIFO algorithm. Backfill scheduling will
initiate lower-priority jobs if doing so will not impact higher-priority
jobs (sensible job time limits must be used; see the configuration sketch after this list);
• Multiple queues: each queue has different priority settings and
(possibly) uses different resources;
• Advanced Reservation: create a window into the future (which
will change) using execution time prediction (as provided by the
users);
• Pre-emption (requires checkpointing ☺ ): lower-priority jobs
can be paused/migrated or even cancelled so that a higher-priority
job can run;
• Others …
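As an illustration, Slurm's backfill scheduler is enabled and tuned in slurm.conf; a sketch with arbitrary example values:

    SchedulerType=sched/backfill
    SchedulerParameters=bf_window=1440,bf_max_job_test=500,bf_interval=30
    # bf_window: how far into the future (in minutes) backfilling looks
    # bf_max_job_test: maximum number of queued jobs considered per cycle
    # bf_interval: seconds between backfill scheduling attempts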
How do they work: Batch Scheduler: Schedule
• Schedulers need to allocate resources and place
tasks;
• Allocation is the selection of resources (as requested by the
user) for a job;
• Each job can have multiple job steps, and each job step can
have multiple tasks (see the sketch after this list);
• Task placement is the process of assigning a subset
of the allocated resources to each task.
• Schedulers need to communicate with the
Resource Manager in order to get an up-to-date
state of the Infrastructure.
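A hedged Slurm sketch of one job containing two job steps, each made up of several tasks (the program names are assumptions):

    #!/bin/bash
    #SBATCH --nodes=4 --time=01:00:00

    srun --ntasks=8  ./preprocess    # job step 1: 8 tasks placed on the allocation
    srun --ntasks=32 ./solver        # job step 2: 32 tasks on the same allocation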
How do they work: Batch Scheduler: Schedule
• Schedulers must schedule (also called
map/mapping):
• Nodes;
• Sockets;
• Cores;
• Hardware Threads (cpus);
• Memory;
• Generic Resources (GPUs, Accelerators, File Systems,
Others);
• Licenses;
• Others.
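For example, a single Slurm launch line can pin down several of these dimensions at once (all values are illustrative, and --gres assumes GPUs are configured as a generic resource):

    srun --nodes=2 --ntasks-per-node=4 --cpus-per-task=2 \
         --mem-per-cpu=2G --gres=gpu:1 ./my_application
    # 2 nodes, 4 tasks per node, 2 hardware threads (cpus) per task,
    # 2 GB of memory per cpu, and one GPU per node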
FIFO Scheduling Example
[Figure, built up over several slides: a Gantt-style chart with Resources on the vertical axis and Time on the horizontal axis. Jobs 1–7 arrive and are placed strictly in priority order; once the next job cannot fit, everything behind it waits, leaving idle gaps in the resources.]
Backfill Scheduling Example
[Figure, built up over several slides: the same Resources-versus-Time chart scheduled with backfill. Jobs 1–6 are placed as before, but Job 7 now starts early in a gap left by the higher-priority jobs, because doing so does not delay them.]
How do they work: Batch Scheduler
• Typical Scheduler workflow (recap): Submission → Priority → Schedule → Wait → Allocation → Execution.
How do they work: Batch Scheduler: Wait
• When a job is scheduled it waits in a queue (based on the user/group);
• Time spent in the queue is called “wait time”;
• Elapsed time between job submission and job completion is called
“turnaround time”;
• “Response time” is how fast the system responds to a job submission
request (under load this can be up to several minutes or hours);
• “Resource utilisation” represents the actual useful work that a job has
performed;
• A job starts either when it has highest priority and resources are
available, or has an opportunity to backfill (i.e. start earlier without
impacting higher priority jobs).
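As a worked example: a job submitted at 09:00 that starts executing at 09:40 and completes at 10:10 has a wait time of 40 minutes and a turnaround time of 70 minutes (turnaround time = wait time + execution time).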
How do they work: Batch Scheduler: Wait
• If “Response time” is slow, we may need to optimise the
Scheduler (or buy new, faster hardware);
• Usually, high system utilisation also means a high average
response time for jobs: as utilisation climbs, the average
response time tends to increase.
• The challenge for Schedulers is to maximise
resource/system utilisation whilst maintaining
acceptable response times for the users.
How do they work: Batch Scheduler
• Typical Scheduler workflow (recap): Submission → Priority → Schedule → Wait → Allocation → Execution.
How do they work: Batch Scheduler: Allocation
• The Scheduler asks the Resource Manager to:
• Take control of the resources (nodes, GPUs, Accelerators,
Bandwidth, Licenses, etc.);
• Copy executables, data, etc. to the nodes;
• Initialise MPI, RDMA, etc.;
• Reduce the number of available licenses (if applicable);
• Other …
How do they work: Batch Scheduler
• Typical Scheduler workflow (recap): Submission → Priority → Schedule → Wait → Allocation → Execution.
How do they work: Resource Manager
• Launches jobs;
• Executes applications;
• Checks/monitors running applications;
• Checks/monitors the resources (nodes).
How do they work: Resource Manager: Launch job
• Start MPI daemons on the nodes;
• Set up MPI host files;
• Copy user files to the scratch/local disk;
• Set up the network configuration (create private networks, etc.);
• Other …
How do they work: Resource Manager: Execute application(s)
• Use mpirun (or equivalent) to start/execute an
application;
• Multiple applications can be started at the same
time (“multiple program, multiple data”, MPMD).
How do they work: Resource Manager: Checks/monitors running applications
• A daemon running on each node checks/monitors
the state of each running application;
• Detects crashes;
• Out Of Memory (OOM) errors;
• Other errors or abnormal behaviour;
• Starts a checkpoint operation (if supported);
• Monitors CPU/Memory/Filesystem/Network usage;
• Reports back to the Resource Manager controller
(either directly or using a proxy).
How do they work: Resource Manager: Checks/monitors a resource (node)
• A daemon running on each node checks/monitors the
state of each node;
• Even when the node is IDLE;
• Gathers statistics like:
• Power/Energy usage;
• Data transferred (using the network);
• Data transferred to local and remote filesystems (like Lustre);
• Memory/CPU/Other statistics.
• Health check (and what to do when it fails);
• It periodically communicates with the Resource
Manager controller (either directly or using a proxy);
Scheduling
• N-dimensional problem;
• Very hard to solve;
• HTC and HPC have different requirements.
• HTC: the main goal is to maximise throughput (jobs
completed per unit of time: day/week/month/year). Load
imbalance (especially in heterogeneous clusters) is the
main reason for low performance;
• HPC: main goal is to maximise performance and minimise
communication overheads/costs.
Slurm
• Slurm manages:
• Tianhe-2: 16,000 compute nodes / 3.1 million cores
• Sequoia: 98,304 compute nodes / 1.6 million cores
• Three main daemons:
• slurmctld: Runs on the login node and accepts input from users (jobs,
queries, etc.) and sends requests to the Scheduler and the RMS.
Sends requests to the DB daemon. For High Availability (HA) a backup
daemon is also up and running (in a hot standby configuration).
• slurmdbd: Runs on the login node and manages the Accounting DB.
Serves requests from slurmctld. For High Availability (HA) a backup
daemon is also up and running (in a hot standby configuration).
• slurmd: Runs on the compute nodes. Starts/stops applications, checks
the node, gathers statistics, sends reports to the controller (slurmctld;
either directly or using an aggregation proxy).
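A couple of hedged commands for inspecting these daemons on a Slurm system (the last line assumes a systemd-managed cluster):

    scontrol ping             # is the primary/backup slurmctld responding?
    sinfo -R                  # list nodes that are down/drained, and why
    systemctl status slurmd   # state of the compute-node daemon (run on a node)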
Slurm
• Slurmctld Main Loop: Update Node States → Terminate Jobs Exceeding Limits → Node Health Check/Ping → Schedule → Trigger Events → Send Updates to Accounting DB (and repeat).
Slurm Scheduling
• Slurm tries to find a better schedule (using quick
and simple algorithms) when:
• A job is submitted;
• A job completes;
• A configuration change takes place.
• Slurm also performs slower and more expensive
scheduling attempts less frequently;
• This design allows a nearly instant response, even
when thousands of jobs are submitted at the same
time.
Slurm Scheduling
• Slurm Generic Scheduling Loop: Build job/partition queue → Pop top-priority job → Reject non-runnable jobs → Find nodes for the job → Pre-empt jobs if needed → Start job → Update job and node states.
• Finding nodes for a job in more detail: build the job queue; pop the top-priority job; build the node candidate list; omit reserved nodes; build the pre-emptible jobs list; cycle through each feature; consider where nodes can be shared; select the “best” nodes; return the list.
Slurm – Quick Scheduling
• Slurm only checks the first X (by default 100) entries
of the queue for new scheduling opportunities;
• Once a job in a partition is left pending (i.e. no
scheduling is possible), Slurm ignores the other
jobs in that partition;
Slurm – Thorough Scheduling
• Slurm checks all jobs in the queue (or stops once a
configurable time limit is reached);
• Jobs are ordered by priority, so this operation has
low overhead;
• However, jobs in lower-priority partitions (queues)
now have more opportunities to start (see the configuration sketch below).
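Both passes are tunable; a hedged slurm.conf sketch (the values are arbitrary examples):

    SchedulerParameters=default_queue_depth=100,max_sched_time=2
    # default_queue_depth: number of jobs tested by the quick scheduling pass
    # max_sched_time: seconds the more thorough scheduling pass may run for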
Slurm
• Slurm is:
• network-aware: it tries to allocate resources that sit close
together in the cluster, sharing switches and routers;
• resource-aware: it tries to pack a job onto the
same node where possible.
• Slurm uses advanced reservation and scheduling,
so it knows when your job is expected to start (you
have to ask; see the example below);
• Slurm is open source and extensible, so you can add
new features as required.
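For example (the job ID is illustrative):

    squeue --start -j 123456   # report the job's expected start time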
Summary
• Resource Management Systems are essential to:
• Manage large-scale compute resources
• Fairly/configurably share resources between users
• Provide accounting/reporting functionality
• Launch and monitor applications
• Resource manager:
• Launches and monitors jobs
• Launches applications associated with jobs
• Monitors resources (compute nodes)
• Monitors consumable resources (licenses, storage, etc…)
• Batch scheduler
• Decides when jobs run
• Provides interface for users to define, submit, and manage jobs
• Ensures resources are efficiently used