Capacity Planning for Computer Systems and Networks
Week 3B: Markov Chain
Last lecture: Queues with Poisson arrivals
Copyright By PowCoder代写 加微信 powcoder
• Single-server
Departures
• Multi-server
Departures 2
T1,2022 COMP9334
This week: Markov Chain
• You can use Markov Chain to analyse
• Closedqueueingnetwork(seeexamplebelow) • Reliabilityproblem
• There are n jobs in the closed system
• What is the response time of one job?
• What is the response time if we replace the CPU with one that is twice as fast?
Markov chain
• The state-transition model that we have used is called a continuous-time Markov chain
• Thereisalsodiscrete-timeMarkovchain
• The transition from a state of the Markov chain to another state is characterised by an exponential distribution
• E.g.ThetransitionfromStateptoStateqisexponentialwithrate rpq, then consider a small time interval d
• Prob[TransitionfromStateptoStateqintimed|Statep]=rpq d
T1,2022 COMP9334
Method for solving Markov chain
• A Markov chain can be solved by • Identifyingthestates
• Findthetransitionratebetweenthestates • Solvethesteadystateprobabilities
• You can then use the steady state probabilities as a stepping stone to find the quantity of interest (e.g. response time etc.)
• We will study two Markov chain problems in this lecture: • Problem1:ADatabaseserver
• Problem2:Datacentrereliabilityproblem
T1,2022 COMP9334 5
Problem 1: A DB server
• AdatabaseserverwithaCPU,afastdiskandaslowdisk • Atpeakdemand,therearealwaystwousersinthesystem • UsersalternatebetweentheCPUandthedisks
• Theuserswillequallylikelyfindthefileoneitherdisk
T1,2022 COMP9334
Problem 1: A DB server (cont’d)
• Fastdiskistwiceasfastastheslowdisk
• Typicaltransactionstakeonaverage10sCPUtime
• Fastdisktakesonaverage15stoserveallfilesforauser
• Slowdisktakesonaverage30stoserveallfilesforauser
• ThetimethateachtransactionrequiresfromtheCPUandthedisksis exponentially distributed
Fast Disk (15s)
Slow Disk (30s)
T1,2022 COMP9334
Typical capacity planning questions
• What response time can a typical user expect?
• What is the utilisation of each of the system resources?
• How will performance parameters change if number of users are doubled?
• If fast disk fails and all files are moved to slow disk, what will be the new response time?
T1,2022 COMP9334 8
Choice of states #1
• Use a 2-tuple (A,B) where
• A is the location of the first user
• B is the location of the second user
• A, B are drawn from {CPU,FD,SD}
• FD = fast disk, SD = slow disk
• Example states are:
• (CPU,CPU): both users at CPU
• (CPU, FD): 1st user at CPU, 2nd user at fast disk
• T otal 9 states
• Question: If there are n users,
• What are the states?
• How many states are there?
T1,2022 COMP9334 9
Choice of states #2
• We use a 3-tuple (X,Y,Z)
• Xis#usersatCPU
• Y is # users at fast disk
• Z is # users at slow disk
• Examples
• (2,0,0): both users at CPU
• (1,0,1): one user at CPU and one user at slow disk
• There are six possible states. Can you list them? • (2,0,0) (1,1,0) (1,0,1) (0,2,0) (0,1,1) (0,0,2)
• If there are n users, how many states are there?
Choice #2 requires less #states but loses certain information.
T1,2022 COMP9334 10
Identifying state transitions (1)
A state is: (#users at CPU, #users at fast disk, #users at slow disk) What is the rate of moving from State (2,0,0) to State (1,1,0)?
• This is caused by a job finishing at the CPU and move to fast disk
• Jobs complete at CPU at a rate of 6 transactions/minute
• Half of the jobs go to the fast disk
Transition rate from (2,0,0) ! (1,1,0) = 3 transactions/minute Similarly, transition rate from (2,0,0)!(1,0,1) = 3 transactions/minute
Fast Disk (15s)
Slow Disk (30s)
T1,2022 COMP9334
State transition diagram (2)
• Transitionratefrom(2,0,0)!(1,1,0)=3transactions/minute • Transitionratefrom(2,0,0)!(1,0,1)=3transactions/minute
• Question: What is the transition rate from (2,0,0)!(0,1,1)? T1,2022 COMP9334
Identifying state transitions (2)
From (1,1,0) there are 3 possible transitions
• Fast disk user goes back to CPU (2,0,0)
• CPU user goes to the fast disk (0,2,0), or
• CPU user goes to the slow disk (0,1,1)
Question: What are the transition rates in number of transactions per minute?
Fast Disk (15s)
Slow Disk (30s)
T1,2022 COMP9334
Completing the state transition diagram
• The state transition diagram is still not complete. Choose any two state transitions and determine their rates.
T1,2022 COMP9334 15
Complete state transition diagram
T1,2022 COMP9334 16
Balance Equations
P(2,0,0) = Probability in state (2,0,0)
P(1,1,0) = Probability in state (1,1,0) etc.
Exercise: Write down the balance equation for state (2,0,0)
Flow balance equations
• You can write one flow balance equation for each state:
6 P(2,0,0) – 4 P(1,1,0) – 2 P(1,0,1)+ 0 P(0,2,0) + 0 P(0,1,1) + 0 P(0,0,2)=0
-3 P(2,0,0) + 10 P(1,1,0) + 0 P(1,0,1) – 4 P(0,2,0) -2 P(0,1,1) + 0 P(0,0,2) =0 -3 P(2,0,0) + 0 P(1,1,0) + 8 P(1,0,1) + 0 P(0,2,0) – 4 P(0,1,1) – 2 P(0,0,2) =0
0 P(2,0,0) – 3 P(1,1,0) + 0 P(1,0,1) + 4 P(0,2,0) + 0 P(0,1,1) + 0 P(0,0,2) =0
0 P(2,0,0) – 3 P(1,1,0) – 3 P(1,0,1) + 0 P(0,2,0) + 6 P(0,1,1) + 0 P(0,0,2) =0 0 P(2,0,0) + 0 P(1,1,0) – 3 P(1,0,1) + 0 P(0,2,0) + 0 P(0,1,1) + 2 P(0,0,2) =0
• However, there are only 5 linearly independent equations. • Need one more equation:
T1,2022 COMP9334 18
P(2,0,0)+P(1,1,0)+P(1,0,1)+ P(0,2,0) + P(0,1,1) + P(0,0,2)=1
Steady State Probability
• You can find the steady state probabilities from 6 equations
• It’seasiertosolvetheequationsbyasoftwarepackages,e.g • Python, Matlab, Octave, etc.
• Thesolutionsare:
• P(2,0,0) = 0.1391
• P(1,1,0) = 0.1043
• P(1,0,1) = 0.2087
• P(0,2,0) = 0.0783
• P(0,1,1) = 0.1565
• P(0,0,2) = 0.3131
• I used Python (the numpy library) to solve these equations • Thefileis“data_server.py”(canbedownloadedfromthecourse
• How can we use these results for capacity planning?
T1,2022 COMP9334 19
Model interpretation
• Response time of each transaction
• UseLittle’sLawR=N/XwithN=2
• Forthissystem:
• System throughput = CPU Throughput
• Throughput=UtilisationxServicerate
• Recall Utilisation = Throughput x Service time (From Lecture 1B)
• CPUutilisation(usingstateswherethereisajobatCPU): P(2,0,0)+ P(1,1,0)+P(1,0,1)= 0.452
• Throughput=0.452×6=2.7130transactions/minute
• Responsetime(with2users)=2/2.7126=0.7372minutesper
transaction
T1,2022 COMP9334 20
Sample capacity planning problem
• What is the response time if the system has up to 4 users instead of 2 users only?
• Youcan’tusethepreviousMarkovchain • YouneedtodevelopanewMarkovchain
The states are again (#users at CPU, #users at fast disk, #users at slow disk)
States are (4,0,0), (3,1,0), (1,2,1) etc.
There are 15 states
Determine the transition rates
Write down the balance equations and solve them.
Use the steady state probabilities and Little’s Law to determine the new response time
You can do this as an exercise
Throughput = 3.4768 (up 28%), response time = 60.03 seconds (up 56%)
COMP9334 21
Computation aspect of Markov chain
• This example shows that when there are a large number of users, the burden to build a Markov chain model is large
• 15states
• Manytransitions
• Needtosolve15equationsin15unknowns
• Is there a faster way to do this?
• Yes,wewilllookatMeanValueAnalysisinafewweeksanditcan
obtain the response time much more quickly
T1,2022 COMP9334 22
Automation
• We derived the state balance equation manually so far • It’s possible to automate that
• Seethesupplementarymaterialsonthecoursewebsiteuntil Week 3
T1,2022 COMP9334 23
Machine working-repair cycle
• A data centre consists of a number of machines
• Machines can fail and have to be repaired
• Terminology:
• Time-to-nextfailure:Fromthetimeamachinehasbeenfixedtothe
time it next fails
• Time-to-repair=timewaitingintherepairqueue+servicetimeto repair the machine
• Time-to-repair is a response time
Machine fails at these points in time
Time-to-next-failure
Time-to-repair
T1,2022 COMP9334
Data centre reliability problem
• Example:Adatacentrehas10machines • Eachmachinemaygodown
• Time-to-next-failure is exponentially distributed with mean 90 days
• Servicetimetorepairisexponentiallydistributedwithmean6 hours
• Capacityplanningquestion:
Can I make sure that at least 8 machines are available 99.9999% of the time?
What is the probability that at least 6 machines are available? How many repair staff are required to guarantee that at least k
machines are available with a given probability? What is the mean-time-to-repair (MTTR) a machine?
• Note: Mean-time-to-repair includes waiting time at the repair queue.
COMP9334 25
Data centre reliability – general problem
• Datacentrehas
• M machines
• N staff maintain and repair machine
• Assumption: M > N
• Automaticdiagnosticsystem
• Check “heartbeat” by “ping” (Failure detection)
• Staff are informed if failure is detected
• Repairwork
• If a machine fails, any one of the idle repair staff (if there is one) will attend to
• If all repair staff are busy, a failed machine will need to wait until a repair staff has finished its work
• ThisisaqueueingproblemsolvablebyMarkovchain!!! • Letusdenote
• l=1/Mean-time-to-failure
• μ=1/Meanservicetimetorepairamachine
T1,2022 COMP9334 26
Queueing model for data centre example
An arrival is due to a machine failure.
A departure occurs when a machine has been repaired.
We build a Markov chain for this box.
Markov model for the repair queue
• State k represents k machines have failed
• Part of the state transition diagram is showed below
The rate of failure for one machine is l. In State 0,
there are M working machine, the failure rate is Ml.
The same argument holds for other state transition probability.
T1,2022 COMP9334 28
Markov Model for the repair queue
Note: There are (M+1) states.
Why is it Nμ? Why not (N+1)μ?
T1,2022 COMP9334
Solving the model
• We can solve for P(0), P(1), …, P(M)
T1,2022 COMP9334 30
Using the model
• Probabilitythatexactlykmachinesareavailable=P(M-k)
• Probabilitythatatleastkmachinesareavailable
= P(0) + P(1) … + P(M-k)
• ButexpressionforP(k)’sarecomplicated,neednumerical
• Example: • M=120
• Mean-time-to-failure=500minutes
• Meanservicetimetorepair=20minutes
• N=2,5or10
• Theresultsareshowedinthegraphsinthenext2pages
• I used the file “data_centre.py” to do the computation, the file is available on the course web site.
T1,2022 COMP9334 31
Probability that exactly k machines operate
T1,2022 COMP9334 32
Probability that at least k machines operate
T1,2022 COMP9334 33
Think time ~ Mean-time-to-failure (MTTF) = 1 / l
Throughput
~ Mean machine failure
(see next page)
Mean time to repair (MTTR)
= Queueing time for repair + actual repair time
Can compute MTTR using Little’s Law.
T1,2022 COMP9334 34
Mean machine failure rate
Probability
Failure rate
T1,2022 COMP9334 35
Continuous-time Markov chain
• Useful for analysing queues when the inter-arrival or service time distribution is exponential
• The procedure is fairly standard for obtaining the steady state probability distribution
• Identifythestate
• Findthestatetransitionrates
• Setupthebalanceequations
• Solvethesteadystateprobability
• We can use the steady state probability to obtain other performance metrics: throughput, response time etc.
• MayneedLittle’sLawetc.
• Continuous-time Markov chain is only applicable when the underlying probability distribution is exponential but the operations laws (e.g. Little’s Law) are applicable no matter what the underlying probability distributions are.
T1,2022 COMP9334 36
Markov chain
• Markov chain is big field in itself. We have touched on only continuous-time Markov chain
• TherearealsodiscretetimeMarkovchains
• Markovchainhasdiscretestate,arelatedconceptisMarkov
process whose states are continuous
• Markov chain / processes have many applications
• PagerankalgorithmfromGooglecanbeexplainedintermsof discrete-time Markov chain
• GraphicalModels(frommachinelearning)
• Transportengineering
• Mathematicalfinance
• Personally, I use Markov chains to understand how living cells process information
T1,2022 COMP9334 37
References
• Recommended reading
• ThedatabaseserverexampleistakenfromMenasceetal.,
“Performance by design”, Chapter 10
• ThedatacentreexampleistakenfromMensaceetal,“Performance
by design”, Chapter 7, Sections 1-4
• For a more in-depth, and mathematical discussion of continuous-time Markov chain, see
• AlbertoLeon-Gracia,“Probabilitiesandrandomprocessesfor Electrical Engineering”, Chapter 8.
• LeonardKleinrock,“QueueingSystems”,Volume1
T1,2022 COMP9334 38
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com