代写代考 Streaming and Distributed Snapshot

Streaming and Distributed Snapshot

Streaming systems
● https://youtu.be/F8l9suOJ_DM ○ Courtesy:

Copyright By PowCoder代写 加微信 powcoder

● Programming Model
○ Stream as first class citizen
○ Batch is just a bounded stream
○ Dataflow programming model
● Exactly-once Consistency (At least once)
○ Each incoming event affects the final results exactly once
○ Even in case of a machine or software failure
■ there’s no duplicate data due to resend and no data that goes unprocessed ○ Implementation
■ Asynchronous checkpointing (distributed snapshot)
● Flink application continues to process data during the checkpointing process

Dataflow programming

Dataflow programming model
● Originally:
○ Processing-Time Grouping / Window only
● Introduce windowing function to define ○ Event-time Data-oriented Grouping / Window

Grouping / Window by processing event-time

https://www.slideshare.net/imrenagi/gdg-jakarta-
meetup-streaming-analytics-with-apache-beam 10

Physically: Parallel Dataflow

Some operators are stateful
● E.g., window
○ SUM of all events

Exactly-once consistency requires distributed snapshot

Snapshot: the state of the system
● Ex: token counting in 3 processes
○ An unknown, N, # of tokens circulating
among them
○ Take a snapshot to know N at a particular moment
○ Tokens in-transit between two processes also count

Snapshot: the state of the system
● One triggers the snapshot process
○ send request messages to nodes to collect their # of
● Assume process 0 takes the snapshot
○ Process 0: Np0=1
○ Process 1 replies Np1=4
○ Process 2 replies Np2=5 1
○ N = Np0+ Np1+ Np2 =10?
● Problems
○ What if there is a token in-transit from P1 to P2?

What is a Consistent Snapshot?
● Defined based on the notion of cuts and consistent cuts.
● A cut is a set of events, at least 1 event from each process
○ Ex: {c,d,f,g,h} is a cut Ex: {a,b,c,d,g} is not
● A consistent cut is for each event it contains
○ the events that casually ordered before it
○ Letxandyare2events
■ Ify≺xandxisinaconsistentcutC
■ ThenyisalsoinC
○ Ex: Cut1 = {a,b,c,m,k} is a consistent cut
○ Ex: Cut2= {a,b,c,d,g,m,e,k,i} is not
■ Because the event that cause g, which is h, is not in cut2
● As the system makes progress, the consistent cut is updated by new incoming events ○ The frontier of a consistent cut is a snapshot
■ Ex: {c,m,k}
○ An older consistent cut is a subset of a more recent consistent cut

Chandy-Lamport Snapshot Taking Algorithm
● Assume FIFO channel (e.g., TCP)
● CL1: Initiator
○ Record its own state as ci
○ Send a marker ci along all its outgoing channels
■ ci means the i–th checkpoint
● CL2: On receiving ci the first time
○ Record its own state as ci
○ Forwards marker ci to all its outgoing channels
■ (so the initiator will get it back)
Remember, the channels have no computational / storage power
– It is the job of the process to record the state of the channels So, let
– Sent(p, q) be messages sent by p to q //save it as part of the state
– Received(q, p) be messages received by q from p //save it as part of the
– The state of channels is: Sent(p,q)-Received(q,p)
https://github.com/FREEDM-DGI/FREEDM/wiki/State-Collection-Interface-Documentation-(Version-before-2.0)

Chandy-Lamport Snapshot Taking Algorithm
● Snapshot taking finish when: ○ Every process
■ has received marker ci from all its incoming channels
● Next time, the initiator triggers snapshot by sending out marker ci+1
● CL algorithm does not include the phase to collect the local snapshots of each node
● Initiator:

Latest: Apache Beam

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com