Updates to the project, including any corrections and clarifications, will be posted on the course website. Make sure that you check the course website regularly for updates.
Change log
Version 1.01 (27 March 2024). There was a mistake in the denominators of the two probability density functions in Section 5.1.1. For g0(t), the denominator should be t raised to the power of η0 + 1; the +1 was missing. The same error appeared in g1(t), where the denominator should be t raised to the power of η1 + 1.
Version 1.00. Issued on 19 March 2024.
COMP9334 Project, Term 1, 2024: Computing clusters
Due Date: 5:00pm Friday 19 April 2024
Version 1.01
1 Introduction and learning objectives
You have learnt in Week 4A’s lecture that a high variability of inter-arrival times or service times can cause a high response time. Measurements from real computer clusters have found that the service times in these clusters have very high variability [1]. The reference paper [1] also has a number of suggestions to deal with this issue. One suggestion is to separate the jobs according to their service time requirements, and have one set of servers processing jobs with short service times and another set of servers for jobs with long service times. This arrangement is the same as supermarkets having express checkouts for customers buying not more than a certain number of items and other checkouts that do not have a limit on the number of items. You saw this theory in action in Week 4A’s revision Problem 1. We also highly recommend that you read the paper [1].
In this project, you will use simulation to study how to reduce the response time of a server farm that uses different servers to process jobs with different service time requirements.
In this project, you will learn:
1. To use discrete event simulation to simulate a computer system
2. To use simulation to solve a design problem
3. To use statistically sound methods to analyse simulation outputs
We mentioned a number of times in the lectures that simulation is not simply about writing simulation programs. While it is important to get your simulation code correct, it is also important that you use statistically sound methods to analyse simulation outputs. Therefore, roughly half of the marks of this project are allocated to the simulation program, and the other half to statistical analysis; see Section 7.2.
[Figure 1 diagram. Labels: New jobs submitted by users; Dispatcher; Server n0 – 1; Server n – 1; Jobs killed by servers in Group 0; Jobs that are killed are sent back to the dispatcher; Jobs that have completed their processing will depart the system permanently.]
Figure 1: The multi-server system for this project.
2 Support provided and computing resources
If you have problems doing this project, you can post your question on the course forum. We strongly encourage you to do this, as asking questions and trying to answer them is a great way to learn. Do not be afraid that your question may appear silly; other students may very well have the same question! Please note that if your forum post shows part of your solution or code, you must mark that forum post private.
Another way to get help is to attend a consultation (see the Timetable section of the course website for dates and times).
If you need computing resources to run your simulation program, you can do it on the VLAB remote computing facility provided by the School. Information on VLAB is available here: https://taggi.cse.unsw.edu.au/Vlab/
3 Multi-server system configuration with job isolation
The configuration of the multi-server system that you will use in this project is shown in Figure 1. The system consists of a dispatcher and n servers where n ≥ 2. The n servers are partitioned into 2 disjoint groups, called Groups 0 and 1, with at least one server in each group. The numbers of servers in Groups 0 and 1 are, respectively, n0 and n1, where n0, n1 ≥ 1 and n0 + n1 = n.
The servers in Group 0 are used to process short jobs which require a processing time of no more than a time limit of Tlimit. The servers in Group 1 do not impose any limit on service time.
The dispatcher has two queues: Queue 0 and Queue 1. The jobs in Queue i (where i = 0, 1) are destined for servers in Group i. Both queues have infinite queueing spaces.
When a user submits a job to this multi-server system, the user needs to indicate whether the job is intended for the servers in Group 0 or Group 1. The following general processing steps are common to all incoming jobs:
• When a job intended for a server in Group i (where i = 0, 1) arrives at the dispatcher, the job will be sent to a server in Group i if one is available; otherwise, the job will join Queue i.
• When a job departs from a server in Group i, the server will check whether there is a job at the head of Queue i. If yes, the job will be admitted to the available server for processing.
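The two general processing steps above can be sketched in Python as follows. This is only an illustration: the class name `Group`, its method names and the use of a simple counter of idle servers are assumptions made for this example and are not prescribed by the project.

```python
from collections import deque


class Group:
    """One server group together with its FIFO queue at the dispatcher."""

    def __init__(self, num_servers):
        self.idle_servers = num_servers   # number of servers currently free
        self.queue = deque()              # waiting jobs; the head is on the left

    def on_arrival(self, job):
        """Send the job to an idle server if there is one, else enqueue it.

        Returns True when the job starts service immediately.
        """
        if self.idle_servers > 0:
            self.idle_servers -= 1
            return True
        self.queue.append(job)
        return False

    def on_departure(self):
        """A server has just been freed; admit the head-of-queue job, if any.

        Returns the admitted job, or None when the server becomes idle.
        """
        if self.queue:
            return self.queue.popleft()
        self.idle_servers += 1
        return None
```

For instance, with `Group(1)`, a first call to `on_arrival` starts service at once, a second call places the job in the queue, and the next `on_departure` hands that queued job to the freed server.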
Recall that the servers in Group 0 have a service time limit. The intention is that the users make an estimate of the service time requirement of their submitted jobs. If a user thinks that their job should be able to complete within Tlimit, then they submit it to Group 0; otherwise, they should send it to Group 1.
Unfortunately, the service time estimated by the users is not always correct. It is possible that a user sends a job which cannot be completed within the time limit to Group 0. We will now explain how the multi-server system will process such a job. Since the user has indicated that the job is destined for Group 0, the job will be processed according to the general processing steps explained earlier. This means the job will receive processing by a server in Group 0. After this job has been processed for a time of Tlimit, the server determines that the service time limit has been reached and kills the job. The server will send the job to the dispatcher and tell it that this is a killed job. The dispatcher will check whether a server in Group 1 is available. If yes, the job will be sent to an available server; otherwise, it will join Queue 1 to wait for a server to become available. When a server in Group 1 is available to work on this job, it will process the job from the beginning, i.e., all the previous processing in a Group 0 server is lost.
If a job has completed its processing at a Group 0 server, which means its service time is less than or equal to Tlimit, then the job leaves the multi-server system permanently. Similarly, a job that has completed its processing at a Group 1 server will leave the system permanently.
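As a minimal sketch of the time-limit rule just described, the helper below works out how long a job occupies a Group 0 server and whether it is killed. The function name `group0_outcome` and its return convention are assumptions for illustration only.

```python
def group0_outcome(service_time, t_limit):
    """Return (time_on_server, killed) for a job started on a Group 0 server."""
    if service_time <= t_limit:
        # The job finishes within the limit and leaves the system permanently.
        return service_time, False
    # The job is killed after t_limit and re-circulated to Group 1,
    # where it must be processed again from the beginning.
    return t_limit, True


def total_server_time(service_time, t_limit):
    """Server time consumed by a job that the user submitted to Group 0."""
    time_in_group0, killed = group0_outcome(service_time, t_limit)
    return time_in_group0 + (service_time if killed else 0)
```

With a limit of 3, a job needing 4 time units is killed after 3 units and then needs a further 4 units in Group 1, so it consumes 7 units of server time in total, whereas a job needing exactly 3 units completes in Group 0.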
We make the following assumptions on the multi-server system in Figure 1. First, it takes the dispatcher negligible time to classify a job and to send a job to an available server. Second, it takes a negligible time for a server to send a killed job to the dispatcher. Third, it takes a negligible time for a server to inform the dispatcher of its availability. As a consequence of these assumptions: (1) If a job arriving at the dispatcher is to be sent to an available server right away, then its arrival time at the dispatcher is the same as its arrival time at the chosen server; (2) The departure time of a job from the dispatcher is the same as its arrival time at the chosen server; and (3) The departure time of a killed job from a server is the same as its arrival time at the dispatcher. Ultimately, these assumptions imply that the response time of the system depends only on the queues and the servers.
We have now completed our description of the operation of the system in Figure 1. We will provide a number of numerical examples to further explain its operation in Section 4.
You will see from the numerical examples in Section 4 that the number of Group 0 servers n0 can be used to influence the mean response time. So, a design problem that you will consider in this project is to determine the value of n0 to minimise the mean response time.
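The search over n0 can be sketched as a simple loop over all feasible splits. The function name `choose_n0` and the caller-supplied `estimate_mean_response_time` are hypothetical; in practice the latter would wrap your own simulation, and the numbers in the toy example are made up purely to show the call.

```python
def choose_n0(n, estimate_mean_response_time):
    """Evaluate every feasible split and return (best_n0, all_estimates).

    estimate_mean_response_time is a caller-supplied function n0 -> float,
    for example a wrapper around simulation runs with n0 Group 0 servers.
    """
    # Feasible values satisfy n0 >= 1 and n1 = n - n0 >= 1.
    estimates = {n0: estimate_mean_response_time(n0) for n0 in range(1, n)}
    best_n0 = min(estimates, key=estimates.get)
    return best_n0, estimates


# Toy stand-in for a simulator (made-up numbers, not project results).
toy_estimates = {1: 9.0, 2: 6.5, 3: 7.2}
best, table = choose_n0(4, toy_estimates.get)
```

Note that picking the smallest point estimate is only a starting point: since this project asks for statistically sound analysis, a real comparison between candidate values of n0 would also need to account for the variability of the simulation estimates.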
Remark 1 Some elements in the above description are realistic but some are not. Typically, users are required to specify a walltime as a service time limit when they submit their jobs to a computing cluster. If a server has already spent the specified walltime on the job, then the server will kill the job. All these are realistic.
The re-circulation of a killed job is normally not done. A user will typically have to resubmit a new job if it has been killed. If a killed job is re-circulated, then it may be given a lower priority, rather than joining the main queue which is the case here.
Some programming techniques (e.g., checkpointing) allow a killed or crashed job to restart from the last saved state rather than from the beginning. However, that may require a sizeable memory space.
In order to make this project more do-able, we have simplified many of the settings. For example, we do not use lower priority for the re-circulated killed jobs.
4 Examples
We will now present three examples to illustrate the operation of the system that you will simulate in this project. In all these examples, we assume that the system is initially empty.
4.1 Example 0: n = 3, n0 = 1, n1 = 2 and Tlimit = 3
In this example, we assume that there are n = 3 servers in the farm, with 1 (= n0) server in Group 0 and 2 (= n1) servers in Group 1. The time limit for Group 0 processing is Tlimit = 3.
Table 1 shows the attributes of the 8 jobs that we will use in this example. Each job is given an index (from 0 to 7). For each job, Table 1 shows its arrival time, service time and the server group that the user has indicated. For example, Job 1 arrives at time 10, requires 4 units of time for service and the user has indicated that this job needs to go to a Group 0 server. Since the service time requirement for this job exceeds the time limit Tlimit of 3, this job will be killed after 3 time units of service and will then be sent to the dispatcher.
Note that a job which a user sends to a Group 0 server will be completed if its service time is less than or equal to the service time limit Tlimit. So, Job 6 in Table 1, whose service time of 3 equals Tlimit, will be completed in a Group 0 server and will not be killed.
Job index   Arrival time   Service time required   Server group indicated
0           2              5                       1
1           10             4                       0
2           11             9                       0
3           12             2                       0
4           14             8                       1
5           15             5                       0
6           19             3                       0
7           20             6                       1
Table 1: Jobs for Example 0.
Remark 2 We remark that the job indices are not necessary for carrying out the discrete event simulation. We have included the job index to make it easier to refer to a job in our description below.
The events in the system in Figure 1 are
• The arrival of a new job to the dispatcher; and,
• The departure of a job from a server.
We remark that for a Group 1 server, a departed job has its service completed. However, for a Group 0 server, a departed job can be a killed job or a completed job. Note that we have not included the arrival of a re-circulated killed job to the dispatcher as an event. This is because the arrival of a re-circulated job at the dispatcher is at the same time as the departure of that job from a Group 0 server. So the simulation will handle these events together: the departure of a killed job and its handling by the dispatcher.
We will illustrate the simulation of the system in Figure 1 using “on-paper simulation”. The quantities that you need to keep track of include:
• Next arrival time is the time that the next new job (i.e., not a killed job) will arrive.
• For each server, we keep track of its server status, which can be busy or idle.
• We also keep track of the following information on the job that is being processed in the server:
– Next departure time is the time at which the job will depart from the server. If the server is idle, the next departure time is set to ∞. Note that there is a next departure time for each server.
– The time that this job arrived at the system. This is needed for calculating the response time of the job when it permanently departs from the system.
• The contents of Queues 0 and 1. Each job in the queue is identified by a 2-tuple of (arrival time, service time).
There are other additional quantities that you will need to keep track of and they will be mentioned later on.
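One possible way to hold the quantities listed above is sketched below. The class names `Server` and `SystemState`, the field names and the `next_event` helper are all assumptions for illustration; the project does not prescribe any particular data structure. The example state corresponds to a moment when Server 0 is processing a job that arrived at time 10, needs 4 time units and is scheduled to depart (be killed) at time 13, while the next new job arrives at time 11.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Server:
    group: int                            # 0 or 1
    next_departure: float = math.inf      # infinity while the server is idle
    job_arrival: Optional[float] = None   # when the job in service arrived at the system
    job_service: Optional[float] = None   # its full service time, kept in case it is killed

    @property
    def busy(self) -> bool:
        return self.next_departure != math.inf


@dataclass
class SystemState:
    servers: list                         # list of Server objects
    next_arrival: float = math.inf        # arrival time of the next new job
    # Queue 0 and Queue 1; each entry is (arrival time, service time).
    queues: tuple = field(default_factory=lambda: (deque(), deque()))

    def next_event(self):
        """Return (time, event_type, server_index); server_index is None for an arrival.

        Ties between an arrival and a departure go to the departure here,
        which is an arbitrary choice made for this sketch. A returned time
        of infinity means that no event is pending.
        """
        s = min(range(len(self.servers)), key=lambda k: self.servers[k].next_departure)
        if self.next_arrival < self.servers[s].next_departure:
            return self.next_arrival, "Arrival", None
        return self.servers[s].next_departure, "Departure", s


state = SystemState(
    servers=[Server(0, next_departure=13, job_arrival=10, job_service=4),
             Server(1), Server(1)],
    next_arrival=11,
)
```

Calling `state.next_event()` on this state selects the arrival at time 11, because it occurs before the only scheduled departure at time 13.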
The “on-paper simulation” is shown in Table 2. The notes in the last column explain what updates you need to do for each event. Recall that the two event types in this simulation are the arrival of a new job to the dispatcher and the departure from a server. We will simply refer to these two events as Arrival and Departure in the “Event type” column (i.e., second column) in Table 2.
Columns: Time | Event type | Next arrival time | Server 0 (Group 0) | Server 1 (Group 1) | Server 2 (Group 1) | Queue 0 | Queue 1 | Notes
Start | (none) | 2 | Idle, ∞ | Idle, ∞ | Idle, ∞ | – | – | We assume the servers are idle and queues are empty at the start of the simulation. The next departure times for all servers are ∞. The “–” indicates that the queues are empty.
2 | Arrival | 10 | Idle, ∞ | Busy, (2,7) | Idle, ∞ | – | – | This event is the arrival of Job 0 for a Group 1 server. Since both Group 1 servers are idle before this arrival, the job can be sent to any one of the idle servers. We have chosen to send this job to Server 1. The job requires a service time of 5, so its completion time is 7. Note that the record of the job in the server is a 2-tuple consisting of (arrival time, scheduled departure time). Lastly, we need to update the arrival time of the next job, which is 10.
7 | Departure | 10 | Idle, ∞ | Idle, ∞ | Idle, ∞ | – | – | This event is the departure of a job from Server 1. Since Queue 1 is empty, Server 1 becomes idle.
10 | Arrival | 11 | Busy, (10,13,4) | Idle, ∞ | Idle, ∞ | – | – | This event is the arrival of Job 1 for a Group 0 server. Since Server 0 is idle, the job can be sent to the idle server. This job requires a service time of 4 which exceeds the service time limit of 3 for Group 0 servers, so the simulation needs to schedule this job to depart Server 0 at time 13 because this is the time that this job will be killed by the server. We use the 3-tuple consisting of (arrival time, scheduled departure time, service time), which for this job is (10, 13, 4), to indicate that this job arrives at time 10, is scheduled to depart at time 13 and its service time requirement is 4 time units. We need to include the service time of the job because we will need it later when the job is re-circulated to a Group 1 server. Note that if you see a 3-tuple job in a Group 0 server, it means that the job will be killed and re-circulated to a Group 1 server. Lastly, we need to update the arrival time of the next job, which is 11.
11 | Arrival | 12 | Busy, (10,13,4) | Idle, ∞ | Idle, ∞ | (11,9) | – | This event is the arrival of Job 2 for a Group 0 server. Since Server 0 is busy, this job will join Queue 0. The queue stores the 2-tuple (arrival time, service time) which is (11,9) for this job. We also need to update the arrival time of the next job, which is 12.
12 | Arrival | 14 | Busy, (10,13,4) | Idle, ∞ | Idle, ∞ | (11,9), (12,2) | – | This event is the arrival of Job 3 for a Group 0 server. Since Server 0 is busy, this job will join Queue 0 with the job information (12,2). We also need to update the arrival time of the next job, which is 14.
13 | Departure | 14 | Busy, (11,16,9) | Busy, (10,17) | Idle, ∞ | (12,2) | – | This event is the departure of a killed job from Server 0. This job will be re-circulated to the dispatcher. Since both Group 1 servers are idle, this job can go to any one of them. We have chosen to send it to Server 1. Since this job requires 4 time units of service, it is scheduled to depart Server 1 at time 17. The 2-tuple (10,17) indicates that this job arrives at 10 and will depart at time 17. Since this is a departure from a Group 0 server, we will also need to check Queue 0, which has 2 jobs. So the job at the head of the queue will advance to Server 0 which is becoming available. This job requires 9 units of service time which exceeds the service time limit. So, the job will be killed at time 13 + 3 = 16 time units.
14 | Arrival | 15 | Busy, (11,16,9) | Busy, (10,17) | Busy, (14,22) | (12,2) | – | This event is the arrival of Job 4 for a Group 1 server. Since there is a Group 1 server available, this job goes to Server 2 directly. This job requires 8 units of service, so the job is scheduled to depart at time 22. We also need to update the arrival time of the next job, which is 15.
15 | Arrival | 19 | Busy, (11,16,9) | Busy, (10,17) | Busy, (14,22) | (12,2), (15,5) | – | This event is the arrival of Job 5 for a Group 0 server. Since all Group 0 servers are busy, this job joins Queue 0. We also need to update the arrival time of the next job, which is 19.
16 | Departure | 19 | Busy, (12,18) | Busy, (10,17) | Busy, (14,22) | (15,5) | (11,9) | This event is the departure of a killed job from Server 0. This job will be re-circulated to the dispatcher. Since both Group 1 servers are busy, this job will join Queue 1. The job at the head of Queue 0 will advance to Server 0. This job requires only 2 units of service which is within the limit. We use a 2-tuple to remember this job because the job is within the time limit so it will not be killed.
17 | Departure | 19 | Busy, (12,18) | Busy, (11,26) | Busy, (14,22) | (15,5) | – | This event is the departure of a finished job at Server 1. Since there is a job in Queue 1, the job will move into Server 1.
18 | Departure | 19 | Busy, (15,21,5) | Busy, (11,26) | Busy, (14,22) | – | – | This event is the departure of a finished job at Server 0. This job will depart from the system permanently. We can tell that because it is a 2-tuple in the server rather than a 3-tuple. Since there is a job in Queue 0, the job will move into Server 0.
19 | Arrival | 20 | Busy, (15,21,5) | Busy, (11,26) | Busy, (14,22) | (19,3) | – | This event is the arrival of Job 6 for a Group 0 server. Since all Group 0 servers are busy, this job joins Queue 0. We also need to update the arrival time of the next job, which is 20.
20 | Arrival | ∞ | Busy, (15,21,5) | Busy, (11,26) | Busy, (14,22) | (19,3) | (20,6) | This event is the arrival of Job 7 for a Group 1 server. Since all Group 1 servers are busy, this job joins Queue 1. Since there are no more jobs arriving, we update the next arrival time to ∞.
21 | Departure | ∞ | Busy, (19,24) | Busy, (11,26) | Busy, (14,22) | – | (20,6), (15,5) | This event is the departure of a killed job from Server 0. This job will be re-circulated to the dispatcher. Since both Group 1 servers are busy, this job will join Queue 1. The job at the head of Queue 0 will advance to Server 0. This job requires only 3 units of service which is within the limit. We only need a 2-tuple to remember that this job arrives at time 19 and will depart at time 24.
22 | Departure | ∞ | Busy, (19,24) | Busy, (11,26) | Busy, (20,28) | – | (15,5) | This event is the departure of a finished job at Server 2. Since there is a job in Queue 1, the job will move into Server 2.
24 | Departure | ∞ | Idle, ∞ | Busy, (11,26) | Busy, (20,28) | – | (15,5) | This event is the departure of a finished job at Server 0. Since Queue 0 is empty, Server 0 is now idle.
26 | Departure | ∞ | Idle, ∞ | Busy, (15,31) | Busy, (20,28) | – | – | This event is the departure of a finished job at Server 1. The job at the head of Queue 1 advances to Server 1. The queue is now empty.
28 | Departure | ∞ | Idle, ∞ | Busy, (15,31) | Idle, ∞ | – | – | This event is the departure of a finished job at Server 2. Server 2 is now idle as Queue 1 is empty.
31 | Departure | ∞ | Idle, ∞ | Idle, ∞ | Idle, ∞ | – | – | This event is the departure of a finished job at Server 1. Server 1 is now idle as Queue 1 is empty.
Table 2: “On paper simulation” illustrating the event updates of the system.
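The event updates described in the notes can be reproduced with a short trace-driven program. The following is a minimal sketch under stated assumptions, not a reference solution: the function name `simulate`, the tuple layouts and the helper names are invented for illustration; an arriving job is sent to the lowest-indexed idle server of its group (the text allows any idle server); and a tie between an arrival and a departure is resolved in favour of the departure, which the project text leaves unspecified.

```python
import math
from collections import deque


def simulate(jobs, n0, n1, t_limit):
    """Trace-driven simulation of the two-group system.

    jobs: list of (arrival time, service time, group), sorted by arrival time.
    Returns (trace, completions), where trace lists (time, event type) and
    completions lists (arrival time, departure time) of finished jobs.
    """
    n = n0 + n1
    group_of = [0] * n0 + [1] * n1      # servers 0..n0-1 form Group 0
    in_service = [None] * n             # per server: (arrival, service) or None
    next_dep = [math.inf] * n           # per server: next departure time
    queues = (deque(), deque())         # Queue 0 and Queue 1
    trace, completions = [], []
    idx = 0                             # index of the next new job

    def start(server, job, now):
        """Start job on server; a Group 0 server stops at t_limit."""
        service = job[1]
        on_server = min(service, t_limit) if group_of[server] == 0 else service
        in_service[server] = job
        next_dep[server] = now + on_server

    def dispatch(job, group, now):
        """Send job to an idle server of the group, else to its queue."""
        for s in range(n):
            if group_of[s] == group and in_service[s] is None:
                start(s, job, now)
                return
        queues[group].append(job)

    while True:
        next_arr = jobs[idx][0] if idx < len(jobs) else math.inf
        s = min(range(n), key=lambda k: next_dep[k])
        if next_arr == math.inf and next_dep[s] == math.inf:
            break                       # no pending events
        if next_arr < next_dep[s]:      # Arrival of a new job
            arrival, service, group = jobs[idx]
            idx += 1
            trace.append((arrival, "Arrival"))
            dispatch((arrival, service), group, arrival)
        else:                           # Departure from server s
            now = next_dep[s]
            trace.append((now, "Departure"))
            arrival, service = in_service[s]
            in_service[s], next_dep[s] = None, math.inf
            g = group_of[s]
            if g == 0 and service > t_limit:
                dispatch((arrival, service), 1, now)   # killed: restart in Group 1
            else:
                completions.append((arrival, now))     # leaves permanently
            if queues[g]:
                start(s, queues[g].popleft(), now)     # admit head of Queue g
    return trace, completions


# Jobs of Table 1 as (arrival time, service time, group indicated).
jobs = [(2, 5, 1), (10, 4, 0), (11, 9, 0), (12, 2, 0),
        (14, 8, 1), (15, 5, 0), (19, 3, 0), (20, 6, 1)]
trace, completions = simulate(jobs, n0=1, n1=2, t_limit=3)
mean_response_time = sum(d - a for a, d in completions) / len(completions)
```

Under these assumptions the permanent departures occur at times 7, 17, 18, 22, 24, 26, 28 and 31, in agreement with the notes for Example 0. The response time of each job is its permanent departure time minus its original arrival time, so a killed job's response time includes the time lost in Group 0.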
The above description has not explained what happens if an arrival event and a departure event are at the same time. We will leave it unspecified. If we ask you to simulate in trace driven mode, we will ensure that such situation will