程序代写代做代考 python capacity planning algorithm finance database chain matlab week04

week04

COMP9334 1

COMP9334
Capacity Planning for Computer Systems
and Networks

Week 4: Markov Chain

S1,2018 COMP9334 2

Last week: Queues with Poisson arrivals

• Single-server Arrivals Departures

• Multi-server 1

2

m

m servers

Arrivals

Departures

S1,2018 COMP9334 3

This week: Markov Chain

• You can use Markov Chain to analyse
• Closed queueing network (see example below)
• Reliability problem

CPU

Disk

• There are n jobs in the closed
system

• What is the response time of
one job?

• What is the response time if
we replace the CPU with one
that is twice as fast?

S1,2018 COMP9334 4

This lecture: Road Map

• A recap on the methodology that we used to analyse
Poisson queues last week
• You were using Markov Chain without knowing it

• Analysing closed queueing networks
• Analysing reliability problem

S1,2018 COMP9334 5

Recap: Properties of exponential distribution

• Exponential inter-arrival time and service time gives rise to
the following two properties

• Inter-arrival time is exponential with mean rate l,
• Consider a small time interval d
• Probability [ no arrival in d ] = 1 – l d
• Probability [ 1 arrival in d ] = l d
• Probability [ 2 or more arrivals in d ] » 0

• Service time distribution is exponential with mean rate µ
• Consider a small time interval d
• Probability [ 0 job will finish its service in next d seconds ] = 1 – µ d
• Probability [ 1 job will finish its service in next d seconds ] = µ d
• Probability [ > 2 jobs will finish its service in next d seconds ] » 0

S1,2018 COMP9334 6

Recap: M/M/2/2 queue

• A call centre analogy

Arrivals

2 operators.
No holding slot.

Call centre:

Exponential
Inter-arrivals (l)

No buffer.

Two serversExponential
Service time (µ)

• Let us recall how we can analyse this system

• Calls are accepted as
long as at least one
operator is available.

• If both operators are
busy, an arriving call is
rejected.

S1,2018 COMP9334 7

Recap: Analysing M/M/2/2

• The system can be in one of the following three states
• State 0 = 0 call in the system (= both operators are idle)
• State 1 = 1 call in the system (= one operator is busy, one is idle)
• State 2 = 2 calls in the system (= both operators are busy)

• Define the probability that a certain state occurs

S1,2018 COMP9334 8

Recap: The transition probabilities

• Consider a small time interval d
• Given the system is in State 1

• What is the probability that it will move to State 0?
• What is the probability that it will move to State 2?

• Transiting from State 1 è State 0
• This can only occur when a call finishes in time interval d
• Conditional probability for this to occur = µ d

• Transiting from State 1 è State 2
• This can only occur when a call arrives in time interval d
• Conditional probability for this to occur = l d

• Prob [State 1 è State 0 | State 1] = µ d
• Prob [State 1 è State 2 | State 1] = l d

S1,2018 COMP9334 9

Exercise: The transition probabilities

• Can you work out the following transition probabilities
• Prob [State 0 è State 1 | State 0] = l d
• Prob [State 0 è State 2 | State 0] = 0
• Prob [State 2 è State 0 | State 2] = 0
• Prob [State 2 è State 1 | State 2] = 2 µ d

S1,2018 COMP9334 10

Recap: The state transition diagram
• Given the following transition probabilities (over a small time interval d)

• Prob [State 0 è State 1 | State 0] = l d
• Prob [State 0 è State 2 | State 0] = 0
• Prob [State 1 è State 0 | State 1] = µ d
• Prob [State 1 è State 2 | State 1] = l d
• Prob [State 2 è State 0 | State 2] = 0
• Prob [State 2 è State 1 | State 2] = 2 µ d

• We draw the following state transition diagram
• Note 1: We label the arc with transition rate = transition probability / d
• Note 2: Arcs with zero rate are not drawn

0 1

l

µ

2

l

S1,2018 COMP9334 11

Recap: Setting up the balance equations (1)
• For steady state, we have

• Prob of transiting into a “box” = Prob of transiting out of a “box”
• Rate of transiting into a “box” = Rate of transiting out of a “box”

• Note a “box” can include one or more state
• The “box” is the dotted square shown below

0 1

l

µ

2

l

S1,2018 COMP9334 12

Exercise: Setting up the balance equations (2)
• Set up the balance equations for the

• Red box
• Green box

0 1

l

µ

2

l

0 1

l

µ

2

l

S1,2018 COMP9334 13

Recap: The balance equations

• There are three balance equations

• Note that these three equations are not linearly independent
• First equation + Third equation = Second equation

• There are 3 unknowns (P0, P1, P2) but we have only 2
equations

• We need 1 more equation. What is it?

S1,2018 COMP9334 14

Recap: Solving for the steady state probabilities

• An addition equation: Sum( Probabilities ) = 1
• Solve the following equations for the steady state

probabilities P0, P1, P2 :

• By solving these 3 equations, we have

S1,2018 COMP9334 15

Recap: Steady state probabilities

• By solving the equations on the previous slide, we have the
steady state probabilities are:

• If we know the values of l
and µ, we can find the
numerical values of
these probabilities

• Do the expressions
make sense?

S1,2018 COMP9334 16

Markov chain

• The state-transition model that we have used is called a
continuous-time Markov chain
• There is also discrete-time Markov chain

• The transition from a state of the Markov chain to another
state is characterised by an exponential distribution
• E.g. The transition from State p to State q is exponential with rate

rpq, then consider a small time interval d
• Prob [ Transition from State p to State q in time d | State p] = rpq d

0 1

l

µ

2

l

S1,2018 COMP9334 17

Method for solving Markov chain

• A Markov chain can be solved by
• Identifying the states
• Find the transition rate between the states
• Solve the steady state probabilities

• You can then use the steady state probabilities as a
stepping stone to find the quantity of interest (e.g. response
time etc.)

• We will study two Markov chain problems in this lecture:
• Problem 1: A Database server
• Problem 2: Data centre reliability problem

S1,2018 COMP9334 18

Problem 1: A DB server

• A database server with a CPU, a fast disk and a slow disk
• At peak demand, there are always two users in the system
• Transactions alternate between the CPU and the disks
• The transactions will equally likely find the file on either disk

CPU

Slow Disk

Fast Disk

2 users

S1,2018 COMP9334 19

Problem 1: A DB server (cont’d)

• Fast disk is twice as fast as the slow disk
• Typical transactions take on average 10s CPU time
• Fast disk takes on average 15s to serve all files for a transactions
• Slow disk takes on average 30s to serve all files for a transactions
• The time that each transaction requires from the CPU and the disks is

exponentially distributed

CPU (10s)

Slow Disk (30s)

Fast Disk (15s)

2 users

S1,2018 COMP9334 20

Typical capacity planning questions

• What response time can a typical user expect?
• What is the utilisation of each of the system resources?
• How will performance parameters change if number of

users are doubled?
• If fast disk fails and all files are moved to slow disk, what

will be the new response time?

S1,2018 COMP9334 21

Choice of states #1

• Use a 2-tuple (A,B) where
• A is the location of the first user
• B is the location of the second user
• A, B are drawn from {CPU,FD,SD}

• FD = fast disk, SD = slow disk

• Example states are:
• (CPU,CPU): both users at CPU
• (CPU, FD): 1st user at CPU, 2nd user at fast disk

• Total 9 states

• Question: If there are n users,
• What are the states?
• How many states will you need?

S1,2018 COMP9334 22

Choice of states #2

• We use a 3-tuple (X,Y,Z)
• X is # users at CPU
• Y is # users at fast disk
• Z is # users at slow disk

• Examples
• (2,0,0): both users at CPU
• (1,0,1): one user at CPU and one user at slow disk

• There are six possible states. Can you list them?
• (2,0,0) (1,1,0) (1,0,1) (0,2,0) (0,1,1) (0,0,2)

• If there are n users, how many states do you need?

Choice #2 requires less

#states but loses certain

information.

S1,2018 COMP9334 23

Identifying state transitions (1)

• A state is: (#users at CPU, #users at fast disk, #users at slow disk)
• What is the rate of moving from State (2,0,0) to State (1,1,0)?

• This is caused by a job finishing at the CPU and move to fast disk
• Jobs complete at CPU at a rate of 6 transactions/minute
• Half of the jobs go to the fast disk

• Transition rate from (2,0,0) è (1,1,0) = 3 transactions/minute
• Similarly, transition rate from (2,0,0) è (1,0,1) = 3 transactions/minute

CPU (10s)

Slow Disk (30s)

Fast Disk (15s)

2 users

S1,2018 COMP9334 24

State transition diagram (2)

• Transition rate from (2,0,0) è (1,1,0) = 3 transactions/minute
• Transition rate from (2,0,0) è (1,0,1) = 3 transactions/minute

2,0,0

1,0,11,1,0

0,2,0 0,1,1 0,0,2

3
3

• Question: What is the transition rate from (2,0,0) è (0,1,1)?

S1,2018 COMP9334 25

Identifying state transitions (2)
• From (1,1,0) there are 3 possible transitions

• Fast disk user goes back to CPU (2,0,0)
• CPU user goes to the fast disk (0,2,0), or
• CPU user goes to the slow disk (0,1,1)

• Question: What are the transition rates in number of transactions per
minute?

CPU (10s)

Slow Disk (30s)

Fast Disk (15s)

2 users

S1,2018 COMP9334 26

Completing the state transition diagram

2,0,0

1,0,11,1,0

0,2,0 0,1,1 0,0,2

3

4

3

3

3

Exercise

• The state transition diagram is still no complete. Choose
any two state transitions and determine their rates.

S1,2018 COMP9334 27

S1,2018 COMP9334 28

Complete state transition diagram

2,0,0

1,0,11,1,0

0,2,0 0,1,1 0,0,2

3

4

3

4
2

3

2

3

3

4

3

2

S1,2018 COMP9334 29

Balance Equations

2,0,0

1,0,11,1,0

0,2,0 0,1,1 0,0,2

3

4

3

4
2

3

2

3

3

4

3

2

Define
P(2,0,0) = Probability in state (2,0,0)
P(1,1,0) = Probability in state (1,1,0) etc.
Exercise: Write down the balance equation for state (2,0,0)

S1,2018 COMP9334 30

Flow balance equations

• You can write one flow balance equation for each state:

P(2,0,0)+P(1,1,0)+P(1,0,1)+ P(0,2,0) + P(0,1,1) + P(0,0,2)=1

6 P(2,0,0) – 4 P(1,1,0) – 2 P(1,0,1)+ 0 P(0,2,0) + 0 P(0,1,1) + 0 P(0,0,2)=0

-3 P(2,0,0) + 10 P(1,1,0) + 0 P(1,0,1) – 4 P(0,2,0) -2 P(0,1,1) + 0 P(0,0,2) =0

-3 P(2,0,0) + 0 P(1,1,0) + 8 P(1,0,1) + 0 P(0,2,0) – 4 P(0,1,1) – 2 P(0,0,2) =0

0 P(2,0,0) – 3 P(1,1,0) + 0 P(1,0,1) + 4 P(0,2,0) + 0 P(0,1,1) + 0 P(0,0,2) =0

0 P(2,0,0) – 3 P(1,1,0) – 3 P(1,0,1) + 0 P(0,2,0) + 6 P(0,1,1) + 0 P(0,0,2) =0

0 P(2,0,0) + 0 P(1,1,0) – 3 P(1,0,1) + 0 P(0,2,0) + 0 P(0,1,1) + 2 P(0,0,2) =0

• However, there are only 5 linearly independent equations.
• Need one more equation:

S1,2018 COMP9334 31

Steady State Probability

• You can find the steady state probabilities from 6 equations
• It�s easier to solve the equations by a software packages, e.g

• Matlab, Octave, Python etc.
• See �Software� under course web page

• The solutions are:
• P(2,0,0) = 0.1391
• P(1,1,0) = 0.1043
• P(1,0,1) = 0.2087
• P(0,2,0) = 0.0783
• P(0,1,1) = 0.1565
• P(0,0,2) = 0.3131

• I used Matlab to solve these equations
• The file is �dataserver.m� (can be downloaded from the course web

site)
• How can we use these results for capacity planning?

S1,2018 COMP9334 32

Model interpretation
• Response time of each transaction

• Use Little�s Law R = N/X with N = 2
• For this system:

• System throughput = CPU Throughput

• Throughput = Utilisation x Service rate
• Recall Utilisation = Throughput x Service time (From Lecture 2)

• CPU utilisation (using states where there is a job at CPU):
P(2,0,0)+ P(1,1,0)+P(1,0,1)= 0.452

• Throughput = 0.452 x 6 = 2.7130 transactions / minute

• Response time (with 2 users) = 2 /2.7126 = 0.7372 minutes per
transaction

S1,2018 COMP9334 33

Sample capacity planning problem

• What is the response time if the system has up to 4 users
instead of 2 users only?
• You can’t use the previous Markov chain
• You need to develop a new Markov chain

• The states are again (#users at CPU, #users at fast disk, #users at
slow disk)

• States are (4,0,0), (3,1,0), (1,2,1) etc.
• There are 15 states
• Determine the transition rates
• Write down the balance equations and solve them.
• Use the steady state probabilities and Little’s Law to determine the

new response time
• You can do this as an exercise
• Throughput = 3.4768 (up 28%), response time = 60.03 seconds (up

56%)

S1,2018 COMP9334 34

Computation aspect of Markov chain

• This example shows that when there are a large number of
users, the burden to build a Markov chain model is large
• 15 states
• Many transitions
• Need to solve 15 equations in 15 unknowns

• Is there a faster way to do this?
• Yes, we will look at Mean Value Analysis in a few weeks and it can

obtain the response time much more quickly

S1,2018 COMP9334 35

Reliability problem using Markov chain

• Consider the working-repair cycle of a machine
• “Failure” is an arrival to the repair workshop
• “Repair” time is the service time to repair the machine
• Let us assume

• “Time-to-next-failure” and “Repair time” are exponentially distributed

WorkingWorking

time
Time-to-next-failure Time-to-repair

Machine fails at these points in time

Working

• Note: Mean-time-to-repair includes waiting (or queueing) time for repair
and actual time under repair

S1,2018 COMP9334 36

Question
• If there is only one machine, what are the possible states of

the machine?

WorkingWorking

time
Time-to-next-failure Time-to-repair

Machine fails at these points in time

Working

S1,2018 COMP9334 37

Data centre reliability problem

• Example: A data centre has 10 machines
• Each machine may go down

• Time-to-next-failure is exponentially distributed with mean 90
days

• Repair time is exponentially distributed with mean 6 hours
• Capacity planning question:

• Can I make sure that at least 8 machines are available
99.9999% of the time?

• What is the probability that at least 6 machines are available?
• How many repair staff are required to guarantee that at least k

machines are available with a given probability?

• What is the mean time to repair (MTTR) a machine?
• Note: Mean-time-to-repair includes waiting time at the repair

queue.

S1,2018 COMP9334 38

Data centre reliability – general problem

• Data centre has
• M machines
• N staff maintain and repair machine
• Assumption: M > N

• Automatic diagnostic system
• Check �heartbeat� by �ping� (Failure detection)
• Staff are informed if failure is detected

• Repair work
• If a machine fails, any one of the idle repair staff (if there is one) will attend to

it.
• If all repair staff are busy, a failed machine will need to wait until a repair staff

has finished its work
• This is a queueing problem solvable by Markov chain!!!
• Let us denote

• l = 1 / Mean-time-to-failure
• µ = 1/ Mean repair time

S1,2018 COMP9334 39

Queueing model for data centre example

An arrival is
due to a
machine
failure.

A departure
occurs
when a
machine
has been
repaired.

We build a
Markov
chain for
this box.

S1,2018 COMP9334 40

Markov model for the repair queue

• State k represents k machines have failed
• Part of the state transition diagram is showed below

0 1 2
M l

µ

(M-1) l


The rate of failure for one machine is l. In State 0,
there are M working machine, the failure rate is Ml.

The same argument holds for other state
transition probability.

S1,2018 COMP9334 41

Markov Model for the repair queue

Note: There are only (M+1) states.

Why is it Nµ?
Why not (N+1)µ?

S1,2018 COMP9334 42

Solving the model

• We can solve for P(0), P(1), …, P(M)

Where

S1,2018 COMP9334 43

Using the model

• Probability that exactly k machines are available = P(M-k)
• Probability that at least k machines are available

= P(0) + P(1) … + P(M-k)

• But expression for P(k)’s are complicated, need numerical
software

• Example:
• M = 120
• Mean-time-to-failure = 500 minutes
• Mean repair time = 20 minutes
• N = 2, 5 or 10
• The results are showed in the graphs in the next 2 pages

• I used the file �data_centre.m� to do the computation, the file is
available on the course web site.

S1,2018 COMP9334 44

Probability that exactly k machines operate

S1,2018 COMP9334 45

Probability that at least k machines operate

Throughput
~ Mean machine failure
rate
(see next page)

Think time ~ Mean-time-to-failure (MTTF) = 1 / l

Mean time to repair
(MTTR)
= Queueing time for
repair + actual repair
time

Can compute MTTR
using Little�s Law.

S1,2018 COMP9334 46

S1,2018 COMP9334 47

Mean machine failure rate

State Probability Failure rate

0 P(0) Ml
1 P(1) (M-1)l
2 P(2) (M-2)l
… …

k P(k) (M-k)l
… …

M P(M) 0

S1,2018 COMP9334 48

Continuous-time Markov chain

• Useful for analysing queues when the inter-arrival or
service time distribution are exponential

• The procedure is fairly standard for obtaining the steady
state probability distribution
• Identify the state
• Find the state transition rates
• Set up the balance equations
• Solve the steady state probability

• We can use the steady state probability to obtain other
performance metrics: throughput, response time etc.
• May need Little’s Law etc.

• Continuous-time Markov chain is only applicable when the
underlying probability distribution is exponential but the
operations laws (e.g. Little’s Law) are applicable no matter
what the underlying probability distributions are.

Markov chain

• Markov chain is big field in itself. We have touched on only
continuous-time Markov chain
• Instead of continuous time, you can have discrete time
• Markov chain has discrete state, a related concept is Markov

process whose states are continuous
• Markov chain / processes have many applications

• Page rank algorithm from Google can be explained in terms of
discrete-time Markov chain

• Graphical Models (from machine learning)
• Transport engineering
• Mathematical finance

• Personally, I use Markov chains to design bio-inspired
communication systems

S1,2018 COMP9334 49

S1,2018 COMP9334 50

References

• Recommended reading
• The database server example is taken from Menasce et al.,
�Performance by design�, Chapter 10

• The data centre example is taken from Mensace et al, �Performance
by desing�, Chapter 7, Sections 1-4

• For a more in-depth, and mathematical discussion of
continuous-time Markov chain, see
• Alberto Leon-Gracia, �Probabilities and random processes for

Electrical Engineering�, Chapter 8.
• Leonard Kleinrock, �Queueing Systems�, Volume 1