Term Paper for Grid Computing
Assignment 3: PaaS
A Team-based Project
SEMESTER 2, 2019
Requirements
Create a simple platform that manages application execution requests, consisting of a master and a set of workers running on different machines.
An end user will submit a request to execute an application.
The platform will manage the execution. After the execution completes, it will send all the outputs back to the end user, together with a bill.
End User & Platform Interaction
The platform will accept only two types of application from the end user
(input: application and password, like A1)
Java-based application
A JAR file and an input file
Python-based application
A Python script and an input file
For each submission, the end user will specify
Type of application
Remote location path of the application and input file
Time by which the results are needed
After submission, the platform will generate a passcode and send it to the end user for all future communication, i.e. checking the status of the job, cancelling the job, and downloading the bill and the output file location
The user side can be implemented either as a separate client that interacts with the platform over a socket, or as a graphical interface to the platform.
The platform should be able to interact with multiple users simultaneously.
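The passcode handling described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `JobRegistry`, the status strings, and the use of `secrets.token_urlsafe` are hypothetical choices, not part of the assignment. A lock is used because several user connections may call the registry at the same time.

```python
import secrets
import threading


class JobRegistry:
    """Hypothetical passcode-to-job table, safe to use from several threads."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, app_type, app_path, input_path, deadline):
        # Only the two application types named in the requirements are accepted.
        if app_type not in ("java", "python"):
            raise ValueError("only Java and Python applications are accepted")
        passcode = secrets.token_urlsafe(8)  # random, hard-to-guess token
        with self._lock:
            self._jobs[passcode] = {
                "type": app_type,
                "app_path": app_path,
                "input_path": input_path,
                "deadline": deadline,
                "status": "QUEUED",
            }
        return passcode

    def status(self, passcode):
        """Return the job status, or None for an unknown passcode."""
        with self._lock:
            job = self._jobs.get(passcode)
            return None if job is None else job["status"]

    def cancel(self, passcode):
        """Mark the job cancelled; return False if that is not possible."""
        with self._lock:
            job = self._jobs.get(passcode)
            if job is None or job["status"] in ("DONE", "CANCELLED"):
                return False
            job["status"] = "CANCELLED"
            return True
```

The end user keeps the returned passcode and presents it with every later request; the platform never needs any other identifier for the job.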
Platform Requirements
The platform will be implemented using the master-slave model
The master node will interact with users and workers.
The master node will interact with each worker using sockets. The master should also be multi-threaded.
The master will submit user requests to workers using at least two scheduling methods, e.g.:
Round robin (push-based)
A free worker asks for a job (pull-based)
The master will also
Monitor the health status of workers (running, CPU load, active, dead, how many requests are being served)
Query the status of a job while it is executing
Cancel a running request
If a worker fails, the master node will reschedule all the affected requests to other healthy workers
If all the workers are busy for a certain amount of time, the master node will start a new worker. [elasticity]
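The two example scheduling methods named above can be contrasted in a few lines. The sketch below is a simplified illustration, assuming workers and jobs are represented by plain values; the class names `RoundRobinScheduler` and `PullScheduler` are hypothetical.

```python
import itertools
import queue


class RoundRobinScheduler:
    """Push-based: the master picks the next worker in a fixed rotation."""

    def __init__(self, workers):
        if not workers:
            raise ValueError("at least one worker is required")
        self._cycle = itertools.cycle(list(workers))

    def assign(self, job):
        # The master decides; the chosen worker simply receives the job.
        return next(self._cycle), job


class PullScheduler:
    """Pull-based: jobs wait in a queue until a free worker asks for one."""

    def __init__(self):
        self._pending = queue.Queue()

    def submit(self, job):
        self._pending.put(job)

    def request_job(self, worker):
        # Called when `worker` reports that it is idle.
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None  # nothing is waiting right now
```

With the push variant the master must track which workers exist; with the pull variant a slow worker naturally takes fewer jobs because it asks less often.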
[Figure: a master node and three worker nodes]
Worker Node Requirements
The worker will accept execution requests from the master node
It will copy the executable and input file from the given location path using FTP
It will execute the application and copy the output files to a specified location
The worker will inform the master node about the status of the request.
If the request is completed, the output location path will be sent to the master
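One possible shape for this worker life cycle is a queue drained by a single thread. In the sketch below the four callables `fetch`, `execute`, `upload`, and `report` are placeholders for the FTP download, the application run, the output copy, and the status message to the master; they and the status strings are assumptions made for illustration.

```python
import queue
import threading


class Worker:
    """Hypothetical worker: queues requests and processes them one at a time."""

    def __init__(self, fetch, execute, upload, report):
        self._fetch = fetch
        self._execute = execute
        self._upload = upload
        self._report = report
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def accept(self, request):
        # Report first, so QUEUED is always seen before RUNNING.
        self._report(request["id"], "QUEUED", None)
        self._requests.put(request)

    def stop(self):
        self._requests.put(None)  # sentinel: finish queued work, then exit
        self._thread.join()

    def _run(self):
        while True:
            request = self._requests.get()
            if request is None:
                break
            job_id = request["id"]
            try:
                self._report(job_id, "RUNNING", None)
                local_files = self._fetch(request)
                outputs = self._execute(request, local_files)
                location = self._upload(request, outputs)
                # On completion the output location goes back to the master.
                self._report(job_id, "DONE", location)
            except Exception as exc:
                self._report(job_id, "FAILED", str(exc))
```

Passing the steps in as callables keeps the queueing logic testable without an FTP server or a real application.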
Marking Scheme (Idea for dealing with complexity of the project)
Implement a worker that can [4 marks]
Maintain a request queue
Copy remote files for a request
Execute the code according to type of application
Copy the output file to a remote location
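Executing the code according to the application type mostly comes down to choosing the right command line. The sketch below assumes that the input file is passed as the first command-line argument and that standard output is the result to be saved; neither is specified in the assignment, and the function names are hypothetical.

```python
import subprocess
import sys


def build_command(app_type, app_file, input_file):
    """Return the command line for one request (input passed as an argument)."""
    if app_type == "java":
        return ["java", "-jar", app_file, input_file]
    if app_type == "python":
        # Reuse the interpreter that runs the worker itself.
        return [sys.executable, app_file, input_file]
    raise ValueError(f"unsupported application type: {app_type}")


def run_application(app_type, app_file, input_file, output_file, timeout=None):
    """Run the application, saving stdout and stderr to output_file.

    Returns the process exit code; raises subprocess.TimeoutExpired if the
    optional timeout (in seconds) is exceeded.
    """
    command = build_command(app_type, app_file, input_file)
    with open(output_file, "wb") as out:
        completed = subprocess.run(
            command, stdout=out, stderr=subprocess.STDOUT, timeout=timeout
        )
    return completed.returncode
```

A non-zero return code can be reported to the master as a failed request, and the saved output file is what the worker then copies to the remote location.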
Implement master-worker interaction [2 marks]
Submit the request to the worker
Query status of the request
Query health of the worker
Cancel the execution of a request
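The four interactions listed above map naturally onto four message types. The following sketch assumes newline-delimited JSON as the wire format and the message names `SUBMIT`, `STATUS`, `HEALTH`, and `CANCEL`; all of these are illustrative choices, and a real worker would report more health details (for example CPU load).

```python
import json


def encode(message):
    """Serialise one message as a single JSON line, ready to send on a socket."""
    return (json.dumps(message) + "\n").encode("utf-8")


def decode(line):
    """Parse one received line back into a dictionary."""
    return json.loads(line.decode("utf-8"))


class WorkerState:
    """Hypothetical worker-side handler for messages from the master."""

    def __init__(self):
        self.jobs = {}  # job id -> status string

    def handle(self, message):
        kind = message.get("type")
        job_id = message.get("job_id")
        if kind == "SUBMIT":
            self.jobs[job_id] = "QUEUED"
            return {"ok": True, "status": "QUEUED"}
        if kind == "STATUS":
            if job_id not in self.jobs:
                return {"ok": False, "error": "unknown job"}
            return {"ok": True, "status": self.jobs[job_id]}
        if kind == "CANCEL":
            if self.jobs.get(job_id) in ("QUEUED", "RUNNING"):
                self.jobs[job_id] = "CANCELLED"
                return {"ok": True, "status": "CANCELLED"}
            return {"ok": False, "error": "job not cancellable"}
        if kind == "HEALTH":
            active = sum(
                1 for s in self.jobs.values() if s in ("QUEUED", "RUNNING")
            )
            return {"ok": True, "active_requests": active}
        return {"ok": False, "error": "unknown message type"}
```

Keeping every reply in the same `{"ok": ...}` shape lets the master treat errors uniformly, whichever request caused them.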
Master implementation [5 Marks]
That can interact with multiple workers using multiple threads and sockets
Scheduling of the requests to workers using two job allocation/scheduling methods
Receive user’s requests and generate a passcode
Allow querying of the status of the request based on passcode
Generate bill
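A thread-per-connection loop is one common way to let the master talk to several workers or users at once. The sketch below assumes newline-delimited JSON messages and a caller-supplied `handle_request` function; both are assumptions, as are the function names. `socket.create_server` requires Python 3.8 or later.

```python
import json
import socket
import threading


def serve_connection(conn, handle_request):
    """Answer JSON-line requests on one connection until the peer closes it."""
    with conn, conn.makefile("rwb") as stream:
        for line in stream:
            try:
                reply = handle_request(json.loads(line))
            except Exception as exc:
                # A bad request must not take the master down.
                reply = {"ok": False, "error": str(exc)}
            stream.write((json.dumps(reply) + "\n").encode("utf-8"))
            stream.flush()


def run_master(host, port, handle_request):
    """Accept connections forever, serving each one on its own thread."""
    with socket.create_server((host, port)) as server:
        while True:
            conn, _addr = server.accept()
            threading.Thread(
                target=serve_connection,
                args=(conn, handle_request),
                daemon=True,
            ).start()
```

Any state shared between threads, such as the job table or the scheduler, would need its own locking inside `handle_request`.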
Fault tolerance & error handling [3 marks] (most important part)
If a worker fails, reschedule all of its jobs to other workers
The master should not fail if a worker fails
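One way to detect a failed worker is a heartbeat timeout: a worker that has not reported within a given interval is treated as dead and its jobs are moved. The sketch below assumes that approach; the class `WorkerMonitor`, the timeout parameter, and the least-loaded placement rule are illustrative choices rather than requirements.

```python
import time


class WorkerMonitor:
    """Hypothetical heartbeat tracker that reassigns jobs of dead workers."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}  # worker -> time of last heartbeat
        self._assigned = {}   # worker -> list of job ids

    def heartbeat(self, worker):
        self._last_seen[worker] = self._clock()
        self._assigned.setdefault(worker, [])

    def assign(self, worker, job_id):
        self._assigned.setdefault(worker, []).append(job_id)

    def dead_workers(self):
        now = self._clock()
        return [
            w for w, seen in self._last_seen.items()
            if now - seen > self._timeout
        ]

    def reschedule_failed(self):
        """Move jobs of dead workers to healthy ones; return job -> new worker."""
        orphaned = []
        for worker in self.dead_workers():
            del self._last_seen[worker]
            orphaned.extend(self._assigned.pop(worker, []))
        moves = {}
        for job_id in orphaned:
            if not self._assigned:
                moves[job_id] = None  # no healthy worker left; caller re-queues
                continue
            # Pick the healthy worker with the fewest assigned jobs.
            target = min(self._assigned, key=lambda w: len(self._assigned[w]))
            self._assigned[target].append(job_id)
            moves[job_id] = target
        return moves
```

Because the master only reads timestamps and never blocks on a dead worker, a worker failure cannot stall or crash the master itself.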
Elasticity [3.5 marks] (most important part)
If the request queue (requests waiting for execution) grows very long, or all workers are busy for more than a given time, start a new machine to reduce the waiting time
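The scale-up rule above has two triggers, which can be expressed as a small policy object. The sketch assumes concrete thresholds supplied by the caller (`max_queue_length`, `max_busy_seconds`); these names and the class `ScalingPolicy` are hypothetical, and actually starting a machine is left out because it depends on the infrastructure used.

```python
import time


class ScalingPolicy:
    """Hypothetical rule that decides when the master should add a worker."""

    def __init__(self, max_queue_length, max_busy_seconds, clock=time.monotonic):
        self._max_queue_length = max_queue_length
        self._max_busy_seconds = max_busy_seconds
        self._clock = clock
        self._all_busy_since = None  # when all workers first became busy

    def should_start_worker(self, queue_length, busy_workers, total_workers):
        now = self._clock()
        all_busy = busy_workers >= total_workers
        if all_busy:
            if self._all_busy_since is None:
                self._all_busy_since = now
        else:
            # A free worker exists again, so the busy timer restarts.
            self._all_busy_since = None
        # Trigger 1: too many requests are waiting.
        if queue_length > self._max_queue_length:
            return True
        # Trigger 2: every worker has been busy for longer than allowed.
        return all_busy and now - self._all_busy_since > self._max_busy_seconds
```

The master would call `should_start_worker` periodically; once a new worker joins, `total_workers` rises and the busy timer resets on the next call.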
Presentation [0.5 marks] and report [2 marks]
Some experimentation and evaluation is important for earning the report marks
Typical RMS Architecture
[Figure: typical RMS architecture. Labels: User Population (User 1 ... User u), job, execution results, Manager Node, Resource Manager, Job Manager, Job Scheduler, Node Status Monitor, Computation Nodes (Computation Node 1 ... Computation Node c)]