FIT5202 Big Data Processing
Data Streaming using Apache Kafka and Structured Streaming
Aggregations on Windows over Event Time; Handling Late Events with Watermarking
Week 11 Agenda
• Week 10 Review
• Structured Streaming
• Integration with Kafka
• This Week:
• Aggregations on Windows over Event Time
• Handling late data with watermarking
• “Parquet” sink
• Checkpointing
Aggregations on Windows over Event Time
In many cases (e.g., moving averages), we want aggregations over data bucketed by time windows (e.g., every 5 minutes) rather than over the entire stream
Bucket data into windows based on event time (i.e., the time the data was generated at the producer) – grouping uses the window function
Non-overlapping windows
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
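The bucketing that Spark's window function performs for non-overlapping (tumbling) windows can be illustrated in plain Python. This is an illustrative sketch, not Spark code; the function name and example timestamp are made up:

```python
from datetime import datetime, timedelta

def tumbling_window(event_time: datetime, size: timedelta):
    """Return the [start, end) tumbling window that event_time falls into."""
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % size   # position inside the window
    start = event_time - offset
    return start, start + size

# A 12:07 event falls into the 12:05-12:10 five-minute bucket.
start, end = tumbling_window(datetime(2020, 10, 17, 12, 7),
                             timedelta(minutes=5))
```

Every event maps to exactly one bucket, which is why tumbling windows are called non-overlapping.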
Aggregations on Windows over Event Time
Overlapping windows
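With overlapping (sliding) windows, one event can belong to several windows; Spark emits one row per window the event falls into. A minimal sketch of that mapping, with made-up names and times:

```python
from datetime import datetime, timedelta

def sliding_windows(event_time: datetime, size: timedelta, slide: timedelta):
    """All [start, end) windows of length `size`, starting every `slide`,
    that contain event_time."""
    epoch = datetime(1970, 1, 1)
    # latest window start at or before the event
    start = event_time - (event_time - epoch) % slide
    windows = []
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return list(reversed(windows))

# A 12:07 event with 10-minute windows sliding every 5 minutes
# belongs to both 12:00-12:10 and 12:05-12:15.
wins = sliding_windows(datetime(2020, 10, 17, 12, 7),
                       timedelta(minutes=10), timedelta(minutes=5))
```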
Stateless vs Stateful Stream Processing
Stateless:
• Each record is processed independently of all other records
• e.g., operations like map, filter, and joins with static data
Stateful:
• Processing a record depends on the results of previously processed records
• Requires maintaining “intermediate information” for processing – called “state”
• e.g., operations like aggregating a count of records
State of Progress
• Keeping track of how much of the incoming data has been processed so far
• Done by checkpointing, i.e., saving the offsets of the incoming data
State of Data
• Intermediate information derived from the data processed so far
• e.g., an intermediate count
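The difference between the two processing styles can be sketched in plain Python (illustrative only; the function names and record shapes are made up). The stateless function handles each record on its own, while the stateful one must carry a running count across micro-batches:

```python
def stateless_filter(records):
    """Stateless: each record is handled on its own (like filter/map)."""
    return [r for r in records if r["status"] == 200]

def stateful_counts(batches):
    """Stateful: running counts must be carried across micro-batches."""
    state = {}                         # the 'state of data' (intermediate counts)
    snapshots = []
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
        snapshots.append(dict(state))  # result table after each batch
    return snapshots
```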
aaa87b6c9d31
Watermarking
Structured Streaming can maintain the intermediate state for partial aggregates (e.g., an intermediate count) for a period of time, so that late data can correctly update the aggregates of old windows
• To handle events that arrive late at the application
• E.g., a word is generated at 12:04 (event time) but received by the application at 12:11
• The application should use the time 12:04, not 12:11, to update the counts for the 12:00 – 12:10 window
• Watermarking lets the engine automatically track the current event time in the data and clean up/update old state accordingly
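The 12:04/12:11 example above can be checked in plain Python: grouping by event time, not arrival time, decides which window a late record updates. An illustrative sketch with made-up names:

```python
from datetime import datetime, timedelta

def target_window(event_time: datetime, size=timedelta(minutes=10)):
    """The [start, end) window an event belongs to, by event time."""
    epoch = datetime(1970, 1, 1)
    start = event_time - (event_time - epoch) % size
    return start, start + size

generated = datetime(2020, 10, 17, 12, 4)   # event time
received  = datetime(2020, 10, 17, 12, 11)  # arrival (processing) time
# Grouping by event time updates the 12:00-12:10 window,
# not the 12:10-12:20 window that the arrival time would suggest.
```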
Watermarking
Two parameters define the watermark of a query:
(1) the event-time column
(2) a threshold specifying how late data is allowed to be (in event time)
Late data within the threshold is aggregated, but data arriving later than the threshold starts getting dropped.
Max event time – the latest event time seen by the engine; the watermark is the max event time minus the threshold.
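The drop rule can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not Spark's real implementation; the class name and times are made up:

```python
from datetime import datetime, timedelta

class Watermark:
    """Sketch of the watermark rule: watermark = max event time - threshold."""
    def __init__(self, threshold: timedelta):
        self.threshold = threshold
        self.max_event_time = datetime.min

    def accept(self, event_time: datetime) -> bool:
        # track the latest event time seen by the engine
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.threshold
        return event_time >= watermark   # within threshold: kept; older: dropped

wm = Watermark(threshold=timedelta(minutes=10))
```

After an event at 12:20 is seen, the watermark sits at 12:10: a late event at 12:15 is still aggregated, but one at 12:05 is dropped.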
https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9
Watermarking
Window-based aggregation on event time – window size = 10 min, slide = 5 min
The engine keeps updating the counts of a window in the Result Table until the window is older than the watermark, which lags behind the current max event time by 10 minutes.
Recovering from Failures with Checkpointing
• In case of failure, the query can recover its previous progress and state and continue where it left off.
• Enable checkpointing with the checkpointLocation option on the query.
• This saves all progress information (i.e., the range of offsets processed in each trigger) and the running aggregates (“state”) to the checkpoint location.
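A checkpointed query might be started as in the following sketch. It assumes an existing streaming DataFrame (here called `counts_df`) and a hypothetical checkpoint path; both names are illustrative, not from the lecture:

```python
# Sketch only: `counts_df` is an existing streaming DataFrame,
# and the checkpoint path is a hypothetical example.
query = (counts_df.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/wordcount")
         .start())
```

On restart with the same checkpoint location, Spark resumes from the saved offsets and state instead of reprocessing the stream from scratch.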
Clickstream Watermarking (with late events)
Pipeline from the lecture diagram:
• A Kafka producer sends clickstream impressions, some arriving late; each record carries an event timestamp (e.g., 1602944650)
• Spark subscribes to the topic and reads the data
• Cast the binary value to String
• Parse the JSON string according to the schema to obtain a DataFrame
• Output the result table
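The read-cast-parse steps of the clickstream demo might look like the following PySpark sketch. The topic name, bootstrap servers, and schema fields are assumptions (the lecture's exact values are not in the text), and `spark` is an existing SparkSession:

```python
# Sketch only: topic, servers, and schema fields are assumptions.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("impression", StringType()),
    StructField("ts", LongType()),            # unix seconds, e.g. 1602944650
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "clickstream")    # hypothetical topic name
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")   # bytes -> string
          .select(F.from_json("json", schema).alias("d"))
          .select("d.*")
          .withColumn("ts", F.col("ts").cast("timestamp")))  # epoch -> timestamp
```

A watermark and windowed count can then be added with `events.withWatermark("ts", ...)` followed by a `groupBy` on the window function.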
Aggregations on Windows over Event-time
For the aggregation query, use the ‘complete’ output mode
Granularity Reduction DEMO
(Refer to Lecture)
Aggregation on windows over event time is another way of understanding the concept of granularity reduction.
Lab Task: Access Log – Window-based Aggregation
Aggregations on window over event-time
logs/access_log.txt
tsExp = r'(\d{10})\s'
– Extract event timestamp
Data preparation
Aggregation on window
Output sink (Console)
access_log
Append an event-time column, ts
Using the window function, find the number of logs for each status in a window of 30 seconds. Set the window slide to 10 seconds.
Write the output to the console sink.
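The timestamp-extraction step of the lab task can be tried in plain Python before wiring it into Spark. The log line below is made up; only the `tsExp` pattern comes from the lab sheet:

```python
import re
from datetime import datetime, timezone

tsExp = r'(\d{10})\s'   # ten-digit unix timestamp followed by whitespace

def extract_ts(line: str):
    """Extract the event timestamp from one access-log line, or None."""
    m = re.search(tsExp, line)
    return (datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
            if m else None)

line = '10.0.0.1 - - 1602944650 "GET /index.html" 200'  # made-up log line
ts = extract_ts(line)
```

In the Spark job, the same pattern would be applied with `regexp_extract` and the result cast to a timestamp column before grouping by the 30-second window.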
References
• https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
• https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example
• https://docs.databricks.com/spark/latest/structured-streaming/production.html
• http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-7/