CS 162 HW 5

Fault tolerance
TABLE OF CONTENTS
1 Worker crashes
2 Reduce task failures
3 Job failures
4 Autograder
In this part, you will complete your MapReduce system by implementing fault tolerance. Specifically, you will update your coordinator to handle worker crashes and failures. Your system does not need to tolerate failures of the coordinator. You also do not need to implement Byzantine fault tolerance.
You should handle worker crashes by detecting when a worker has failed, and reassigning relevant tasks to other workers. Do not reassign tasks if a worker is still alive, but is executing slowly. (You might do this for a real system, but for this assignment, we’ll keep it simple and only expect tasks to be reassigned if the worker running them fails.)
Worker crashes
Your coordinator should be able to determine whether a worker has died. This should be implemented by checking whether the worker has sent a heartbeat in the last TASK_TIMEOUT_SECS seconds. When choosing a task to assign, you should consider failed tasks as available for (re)assignment.
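The heartbeat check described above could be sketched as follows. This is a minimal illustration, not the required design: HeartbeatTracker, WorkerId, and the method names are hypothetical, and the value given to TASK_TIMEOUT_SECS here is a placeholder for whatever your starter code defines. The current time is passed in explicitly so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Placeholder value for illustration only; use the constant from your starter code.
const TASK_TIMEOUT_SECS: u64 = 15;

/// Hypothetical worker identifier type.
type WorkerId = u32;

/// Hypothetical coordinator-side record of the last heartbeat from each worker.
#[derive(Default)]
struct HeartbeatTracker {
    last_seen: HashMap<WorkerId, Instant>,
}

impl HeartbeatTracker {
    /// Record that `worker` sent a heartbeat at time `now`.
    fn record_heartbeat(&mut self, worker: WorkerId, now: Instant) {
        self.last_seen.insert(worker, now);
    }

    /// A worker is alive if it sent a heartbeat within the last
    /// TASK_TIMEOUT_SECS seconds. Unknown workers are treated as dead.
    fn is_alive(&self, worker: WorkerId, now: Instant) -> bool {
        match self.last_seen.get(&worker) {
            Some(&seen) => {
                now.saturating_duration_since(seen) <= Duration::from_secs(TASK_TIMEOUT_SECS)
            }
            None => false,
        }
    }

    /// All known workers whose last heartbeat is too old, in ascending id order.
    fn dead_workers(&self, now: Instant) -> Vec<WorkerId> {
        let mut dead: Vec<WorkerId> = self
            .last_seen
            .keys()
            .copied()
            .filter(|&w| !self.is_alive(w, now))
            .collect();
        dead.sort_unstable();
        dead
    }
}
```

In a coordinator, record_heartbeat would be called from the heartbeat RPC handler, and dead_workers could be consulted whenever a task is about to be assigned.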
Tasks should be eligible for (re)assignment if:
• The task is a reduce task, is incomplete, and was assigned to a worker that crashed.
• The task is a map task, and was assigned to a worker that crashed.
Note that upon a worker crash, map tasks assigned to that worker must be re-executed, even if they were marked complete. This is because map task outputs are buffered in memory, and that memory is no longer accessible if a worker crashes.
Completed reduce tasks should not be reassigned, since their output is stored on disk.
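The reassignment rules above can be expressed as a small predicate. This is a sketch under assumed types: TaskKind, TaskInfo, and eligible_for_reassignment are hypothetical names, and your coordinator will likely store task state differently.

```rust
/// Hypothetical task kind.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TaskKind {
    Map,
    Reduce,
}

/// Hypothetical per-task bookkeeping kept by the coordinator.
struct TaskInfo {
    kind: TaskKind,
    complete: bool,
}

/// Decide whether a previously assigned task may be handed to another worker.
/// `worker_crashed` says whether the worker the task was assigned to is dead.
fn eligible_for_reassignment(task: &TaskInfo, worker_crashed: bool) -> bool {
    if !worker_crashed {
        // Slow but live workers keep their tasks.
        return false;
    }
    match task.kind {
        // Map output is buffered in the crashed worker's memory, so even
        // completed map tasks must be re-executed.
        TaskKind::Map => true,
        // Reduce output is on disk, so only incomplete reduce tasks are rerun.
        TaskKind::Reduce => !task.complete,
    }
}
```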
Reduce task failures
If a reduce worker cannot reach another worker to fetch map task results, the reduce worker should not crash. Instead, it should receive a new assignment, and if the worker it tried to reach is determined to be truly dead, that worker’s map tasks should be reassigned. You will likely need an additional FailTask RPC to alert the coordinator that the worker is no longer working on the failed task.
Job failures
If an error that cannot be fixed occurs, the job should fail. That is, no more tasks for the job should be assigned, and polling the job’s status with the PollJob RPC should give failed = true. It does not matter whether you set the PollJobReply’s done field to true or false.
Examples of errors that should cause a job to fail immediately include:
• Being unable to find or open an input file
• Being unable to write to an output file
• Receiving an error from an application map or reduce function
You must ensure that error messages returned by map or reduce functions are reported to clients via the errors field of the PollJobReply message. It is OK to add additional context to the error message, but you must preserve the precise message reported by the application map/reduce functions.
If you have an anyhow::Error err, you can convert it to a string using format!("{:#}", err)
This will preserve the full error chain (by default, only the most “recent” error message is included).
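The job-failure behavior above can be sketched as follows. JobStatus is a hypothetical stand-in for the coordinator's per-job state, not the actual PollJobReply type. Only the field names done, failed, and errors come from the text. The fail and can_assign_tasks helpers are assumptions for illustration. The application's message is passed as a plain string here; with an anyhow::Error you would first render it using the format string shown above.

```rust
/// Hypothetical per-job state mirroring the fields PollJobReply reports.
#[derive(Default, Debug)]
struct JobStatus {
    done: bool,
    failed: bool,
    errors: Vec<String>,
}

impl JobStatus {
    /// Mark the job as failed. `context` is optional extra detail added by the
    /// system; `app_error` is the application's message, kept verbatim.
    fn fail(&mut self, context: &str, app_error: &str) {
        self.failed = true;
        self.errors.push(format!("{}: {}", context, app_error));
    }

    /// No further tasks should be handed out once a job has failed or finished.
    fn can_assign_tasks(&self) -> bool {
        !self.failed && !self.done
    }
}
```

Prefixing context rather than rewriting the message keeps the application's exact text intact as a substring of the reported error.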
Autograder
After completing this, you should be passing all the autograder tests.
Copyright © 2022 CS 162 staff.