COMP9334
Capacity Planning for Computer Systems and Networks
Week 3B: Markov Chain
COMP9334
1
Last lecture: Queues with Poisson arrivals
• Single-server
Arrivals
Departures
• Multi-server
Arrivals
1
Departures 2
m
m servers
T1,2021 COMP9334
2
This week: Markov Chain
• You can use Markov Chain to analyse
• Closedqueueingnetwork(seeexamplebelow) • Reliabilityproblem
CPU
• There are n jobs in the closed system
• What is the response time of one job?
• What is the response time if we replace the CPU with one that is twice as fast?
Disk
T1,2021
COMP9334
3
Markov chain
• The state-transition model that we have used is called a continuous-time Markov chain
• Thereisalsodiscrete-timeMarkovchain
• The transition from a state of the Markov chain to another state is characterised by an exponential distribution
• E.g.ThetransitionfromStateptoStateqisexponentialwithrate rpq, then consider a small time interval d
• Prob[TransitionfromStateptoStateqintimed|Statep]=rpq d
0
l
μ
1
l
2μ
2
T1,2021 COMP9334
4
Method for solving Markov chain
• A Markov chain can be solved by • Identifyingthestates
• Findthetransitionratebetweenthestates • Solvethesteadystateprobabilities
• You can then use the steady state probabilities as a stepping stone to find the quantity of interest (e.g. response time etc.)
• We will study two Markov chain problems in this lecture: • Problem1:ADatabaseserver
• Problem2:Datacentrereliabilityproblem
T1,2021 COMP9334 5
Problem 1: A DB server
• AdatabaseserverwithaCPU,afastdiskandaslowdisk • Atpeakdemand,therearealwaystwousersinthesystem • UsersalternatebetweentheCPUandthedisks
• Theuserswillequallylikelyfindthefileoneitherdisk
2 users
Fast Disk
CPU
Slow Disk
T1,2021 COMP9334
6
Problem 1: A DB server (cont’d)
• Fastdiskistwiceasfastastheslowdisk
• Typicaltransactionstakeonaverage10sCPUtime
• Fastdisktakesonaverage15stoserveallfilesforauser
• Slowdisktakesonaverage30stoserveallfilesforauser
• ThetimethateachtransactionrequiresfromtheCPUandthedisksis exponentially distributed
2 users
Fast Disk (15s)
CPU (10s)
Slow Disk (30s)
T1,2021 COMP9334
7
Typical capacity planning questions
• What response time can a typical user expect?
• What is the utilisation of each of the system resources?
• How will performance parameters change if number of users are doubled?
• If fast disk fails and all files are moved to slow disk, what will be the new response time?
T1,2021 COMP9334 8
Choice of states #1
• Use a 2-tuple (A,B) where
• A is the location of the first user
• B is the location of the second user
• A, B are drawn from {CPU,FD,SD}
• FD = fast disk, SD = slow disk
• Example states are:
• (CPU,CPU): both users at CPU
• (CPU, FD): 1st user at CPU, 2nd user at fast disk
• T otal 9 states
• Question: If there are n users,
• What are the states?
• How many states are there?
T1,2021 COMP9334 9
Choice of states #2
• We use a 3-tuple (X,Y,Z)
• Xis#usersatCPU
• Y is # users at fast disk
• Z is # users at slow disk
• Examples
• (2,0,0): both users at CPU
• (1,0,1): one user at CPU and one user at slow disk
• There are six possible states. Can you list them? • (2,0,0) (1,1,0) (1,0,1) (0,2,0) (0,1,1) (0,0,2)
• If there are n users, how many states are there?
Choice #2 requires less #states but loses certain information.
T1,2021 COMP9334 10
• •
• •
Identifying state transitions (1)
A state is: (#users at CPU, #users at fast disk, #users at slow disk) What is the rate of moving from State (2,0,0) to State (1,1,0)?
• This is caused by a job finishing at the CPU and move to fast disk
• Jobs complete at CPU at a rate of 6 transactions/minute
• Half of the jobs go to the fast disk
Transition rate from (2,0,0) ! (1,1,0) = 3 transactions/minute Similarly, transition rate from (2,0,0)!(1,0,1) = 3 transactions/minute
2 users
Fast Disk (15s)
CPU (10s)
Slow Disk (30s)
T1,2021 COMP9334
11
State transition diagram (2)
• Transitionratefrom(2,0,0)!(1,1,0)=3transactions/minute • Transitionratefrom(2,0,0)!(1,0,1)=3transactions/minute
3
2,0,0
3
1,0,1
1,1,0
0,1,1
• Question: What is the transition rate from (2,0,0)!(0,1,1)? T1,2021 COMP9334
0,0,2
0,2,0
12
•
•
Identifying state transitions (2)
From (1,1,0) there are 3 possible transitions
• Fast disk user goes back to CPU (2,0,0)
• CPU user goes to the fast disk (0,2,0), or
• CPU user goes to the slow disk (0,1,1)
Question: What are the transition rates in number of transactions per minute?
2 users
Fast Disk (15s)
CPU (10s)
Slow Disk (30s)
T1,2021 COMP9334
13
Completing the state transition diagram
3
4
2,0,0
3
1,1,0
3
3
0,2,0
1,0,1
0,1,1
0,0,2
T1,2021
COMP9334
14
Exercise
• The state transition diagram is still not complete. Choose any two state transitions and determine their rates.
T1,2021 COMP9334 15
Complete state transition diagram
T1,2021 COMP9334 16
2,0,0
3
3
0,2,0
1,1,0
3
2
4
2
1,0,1
4
3
4
3
0,1,1
2
3
0,0,2
Balance Equations
Define
P(2,0,0) = Probability in state (2,0,0)
P(1,1,0) = Probability in state (1,1,0) etc.
Exercise: Write down the balance equation for state (2,0,0)
2,0,0
3
3
0,2,0
1,1,0
3
2
4
2
1,0,1
4
3
4
3
0,1,1
2
3
0,0,2
T1,2021
COMP9334
17
Flow balance equations
• You can write one flow balance equation for each state:
6 P(2,0,0) – 4 P(1,1,0) – 2 P(1,0,1)+ 0 P(0,2,0) + 0 P(0,1,1) + 0 P(0,0,2)=0
-3 P(2,0,0) + 10 P(1,1,0) + 0 P(1,0,1) – 4 P(0,2,0) -2 P(0,1,1) + 0 P(0,0,2) =0 -3 P(2,0,0) + 0 P(1,1,0) + 8 P(1,0,1) + 0 P(0,2,0) – 4 P(0,1,1) – 2 P(0,0,2) =0
0 P(2,0,0) – 3 P(1,1,0) + 0 P(1,0,1) + 4 P(0,2,0) + 0 P(0,1,1) + 0 P(0,0,2) =0
0 P(2,0,0) – 3 P(1,1,0) – 3 P(1,0,1) + 0 P(0,2,0) + 6 P(0,1,1) + 0 P(0,0,2) =0 0 P(2,0,0) + 0 P(1,1,0) – 3 P(1,0,1) + 0 P(0,2,0) + 0 P(0,1,1) + 2 P(0,0,2) =0
• However, there are only 5 linearly independent equations. • Need one more equation:
T1,2021 COMP9334 18
P(2,0,0)+P(1,1,0)+P(1,0,1)+ P(0,2,0) + P(0,1,1) + P(0,0,2)=1
Steady State Probability
• You can find the steady state probabilities from 6 equations
• It’seasiertosolvetheequationsbyasoftwarepackages,e.g • Python, Matlab, Octave, etc.
• Thesolutionsare:
• P(2,0,0) = 0.1391
• P(1,1,0) = 0.1043
• P(1,0,1) = 0.2087
• P(0,2,0) = 0.0783
• P(0,1,1) = 0.1565
• P(0,0,2) = 0.3131
• I used Python (the numpy library) to solve these equations • Thefileis“data_server.py”(canbedownloadedfromthecourse
web site)
• How can we use these results for capacity planning?
T1,2021 COMP9334 19
Model interpretation
• Response time of each transaction
• UseLittle’sLawR=N/XwithN=2
• Forthissystem:
• System throughput = CPU Throughput
• Throughput=UtilisationxServicerate
• Recall Utilisation = Throughput x Service time (From Lecture 1B)
• CPUutilisation(usingstateswherethereisajobatCPU): P(2,0,0)+ P(1,1,0)+P(1,0,1)= 0.452
• Throughput=0.452×6=2.7130transactions/minute
• Responsetime(with2users)=2/2.7126=0.7372minutesper
transaction
T1,2021 COMP9334 20
Sample capacity planning problem
• What is the response time if the system has up to 4 users instead of 2 users only?
• Youcan’tusethepreviousMarkovchain • YouneedtodevelopanewMarkovchain
•
• • • • •
• •
The states are again (#users at CPU, #users at fast disk, #users at slow disk)
States are (4,0,0), (3,1,0), (1,2,1) etc.
There are 15 states
Determine the transition rates
Write down the balance equations and solve them.
Use the steady state probabilities and Little’s Law to determine the new response time
You can do this as an exercise
Throughput = 3.4768 (up 28%), response time = 60.03 seconds (up 56%)
T1,2021
COMP9334 21
Computation aspect of Markov chain
• This example shows that when there are a large number of users, the burden to build a Markov chain model is large
• 15states
• Manytransitions
• Needtosolve15equationsin15unknowns
• Is there a faster way to do this?
• Yes,wewilllookatMeanValueAnalysisinafewweeksanditcan
obtain the response time much more quickly
T1,2021 COMP9334 22
Machine working-repair cycle
• A data centre consists of a number of machines
• Machines can fail and have to be repaired
• Terminology:
• Time-to-nextfailure:Fromthetimeamachinehasbeenfixedtothe
time it next fails
• Time-to-repair=timewaitingintherepairqueue+servicetimeto repair the machine
• Time-to-repair is a response time
Machine fails at these points in time
Working
Working
time
Time-to-next-failure
Time-to-repair
T1,2021 COMP9334
23
Data centre reliability problem
• Example:Adatacentrehas10machines • Eachmachinemaygodown
• Time-to-next-failure is exponentially distributed with mean 90 days
• Servicetimetorepairisexponentiallydistributedwithmean6 hours
• Capacityplanningquestion:
•
• •
•
Can I make sure that at least 8 machines are available 99.9999% of the time?
What is the probability that at least 6 machines are available? How many repair staff are required to guarantee that at least k
machines are available with a given probability? What is the mean-time-to-repair (MTTR) a machine?
• Note: Mean-time-to-repair includes waiting time at the repair queue.
T1,2021
COMP9334 24
Data centre reliability – general problem
• Datacentrehas
• M machines
• N staff maintain and repair machine
• Assumption: M > N
• Automaticdiagnosticsystem
• Check “heartbeat” by “ping” (Failure detection)
• Staff are informed if failure is detected
• Repairwork
• If a machine fails, any one of the idle repair staff (if there is one) will attend to
it.
• If all repair staff are busy, a failed machine will need to wait until a repair staff has finished its work
• ThisisaqueueingproblemsolvablebyMarkovchain!!! • Letusdenote
• l=1/Mean-time-to-failure
• μ=1/Meanservicetimetorepairamachine
T1,2021 COMP9334 25
Queueing model for data centre example
An arrival is due to a machine failure.
A departure occurs when a machine has been repaired.
We build a Markov chain for this box.
T1,2021
COMP9334
26
Markov model for the repair queue
• State k represents k machines have failed
• Part of the state transition diagram is showed below
Ml 01
(M-1) l
2
2μ
The rate of failure for one machine is l. In State 0,
μ
there are M working machine, the failure rate is Ml.
The same argument holds for other state transition probability.
T1,2021 COMP9334 27
Markov Model for the repair queue
Note: There are (M+1) states.
Why is it Nμ? Why not (N+1)μ?
T1,2021 COMP9334
28
Solving the model
• We can solve for P(0), P(1), …, P(M)
Where
T1,2021 COMP9334 29
Using the model
• Probabilitythatexactlykmachinesareavailable=P(M-k)
• Probabilitythatatleastkmachinesareavailable
= P(0) + P(1) … + P(M-k)
• ButexpressionforP(k)’sarecomplicated,neednumerical
software
• Example: • M=120
• Mean-time-to-failure=500minutes
• Meanservicetimetorepair=20minutes
• N=2,5or10
• Theresultsareshowedinthegraphsinthenext2pages
• I used the file “data_centre.py” to do the computation, the file is available on the course web site.
T1,2021 COMP9334 30
Probability that exactly k machines operate
T1,2021 COMP9334 31
Probability that at least k machines operate
T1,2021 COMP9334 32
Think time ~ Mean-time-to-failure (MTTF) = 1 / l
Throughput
~ Mean machine failure
rate
(see next page)
Mean time to repair (MTTR)
= Queueing time for repair + actual repair time
Can compute MTTR using Little’s Law.
T1,2021 COMP9334 33
Mean machine failure rate
State
Probability
Failure rate
0
P(0)
Ml
1
P(1)
(M-1)l
2
P(2)
(M-2)l
…
…
k
P(k)
(M-k)l
…
…
M
P(M)
0
T1,2021 COMP9334 34
Continuous-time Markov chain
• Useful for analysing queues when the inter-arrival or service time distribution is exponential
• The procedure is fairly standard for obtaining the steady state probability distribution
• Identifythestate
• Findthestatetransitionrates
• Setupthebalanceequations
• Solvethesteadystateprobability
• We can use the steady state probability to obtain other performance metrics: throughput, response time etc.
• MayneedLittle’sLawetc.
• Continuous-time Markov chain is only applicable when the underlying probability distribution is exponential but the operations laws (e.g. Little’s Law) are applicable no matter what the underlying probability distributions are.
T1,2021 COMP9334 35
Markov chain
• Markov chain is big field in itself. We have touched on only continuous-time Markov chain
• TherearealsodiscretetimeMarkovchains
• Markovchainhasdiscretestate,arelatedconceptisMarkov
process whose states are continuous
• Markov chain / processes have many applications
• PagerankalgorithmfromGooglecanbeexplainedintermsof discrete-time Markov chain
• GraphicalModels(frommachinelearning)
• Transportengineering
• Mathematicalfinance
• Personally, I use Markov chains to understand how living cells process information
T1,2021 COMP9334 36
References
• Recommended reading
• ThedatabaseserverexampleistakenfromMenasceetal.,
“Performance by design”, Chapter 10
• ThedatacentreexampleistakenfromMensaceetal,“Performance
by design”, Chapter 7, Sections 1-4
• For a more in-depth, and mathematical discussion of continuous-time Markov chain, see
• AlbertoLeon-Gracia,“Probabilitiesandrandomprocessesfor Electrical Engineering”, Chapter 8.
• LeonardKleinrock,“QueueingSystems”,Volume1
T1,2021 COMP9334 37