Data Stream Processing
COMPSCI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Slides are collected and edited from https://vasia.github.io/dspa20/index.html and https://homepages.cwi.nl/~boncz/bads/stream.shtml
Auckland, Aug 17, 2020
1
Outline
Introduction
Definition, applications, challenges
Data stream processing
Data stream modelling
Data stream processing
2
What is a data stream?
Large data volume, likely structured, arriving at a very high rate (high enough that the machine cannot keep up with it)
Definition (Golab & Ozsu, 2003):
A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items.
It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.
3
Massive data stream
We do not know the entire data set in advance.
Data arrives in a streaming fashion, and if it is not processed immediately, it is lost forever.
4
Why do we need data stream processing?
Online, real-time processing
Potential objectives:
Event detection and reaction
Fast and potentially approximate online aggregation and analytics at different granularities
Various applications:
Network management, telecommunications
Load balancing in distributed systems
Stock monitoring, finance, fraud detection
Online data mining (click stream analysis)
5
Data stream applications
Sensor networks:
Many sensors feeding into a central controller.
What occurs if the PDA device does not have enough computational resources for real-time event detection?
Source: Vahid Ayatollahitafti et al. PLOS’16
6
Data stream applications
Google query streams:
What queries are more frequent today than yesterday?
Estimate influenza activity by aggregating Google search queries
7
Data stream applications
Yahoo click streams:
What pages are getting an unusual number of hits in the past hour?
Mining social network:
Look for trending topics on Twitter, Facebook based on news feed
IP packets monitored at a switch in the network:
Detect denial-of-service attacks
Gather information for optimal routing
8
IP network monitoring application
24×7 IP packet/flow data streams at network elements
Massive stream arriving at rapid rates
AT&T collects ~ 1 TB of Netflow data per day.
Source      Destination   Duration   Bytes   Protocol
10.1.0.2    16.2.3.7      12         20K     http
18.6.7.1    12.4.0.3      16         24K     http
13.9.4.3    11.6.8.2      15         20K     http
15.2.2.9    17.1.2.1      19         40K     http
12.4.3.8    14.8.7.4      26         58K     http
10.5.1.3    13.0.0.1      27         100K    ftp
11.1.0.6    10.3.4.5      32         300K    ftp
19.7.1.2    16.5.5.8      18         80K     ftp
Off-line analysis is very slow and expensive.
9
Stream processing model
[Figure: the generic stream-processing model (source: mmds.org, chapter 4)]
Streams arrive over time, each composed of elements/tuples, e.g. . . . 1, 5, 2, 7, 0, 9, 3 . . . , . . . a, r, v, t, y, h, b . . . , . . . 0, 0, 1, 0, 1, 1, 0 . . .
A processor runs standing/continuous queries over the arriving elements and produces output
Limited working storage: collects summaries of the streams; queries are answered from this storage
Archival storage: holds the raw streams, but it is impossible to answer queries from this storage
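To make the picture concrete, here is a minimal Python sketch (an illustration added here, not from the slides) of a standing query answered entirely from limited working storage: a running average over the numeric stream shown above, computed without ever keeping the elements themselves.

```python
# Minimal sketch: a standing query ("current running average?") answered from
# a tiny working store, while the raw elements are never retained.
def running_average(stream):
    count, total = 0, 0.0            # the entire "limited working storage"
    for x in stream:                 # elements arrive one at a time
        count += 1
        total += x
        yield total / count          # the standing query's up-to-date answer

for answer in running_average([1, 5, 2, 7, 0, 9, 3]):
    print(answer)
```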
Structure of data stream
Infinite sequence of items (elements) that carry the same structured information, e.g.
Tuple: (Source, Destination, Duration, Bytes, Protocol)
Object: a user ID or a search query
Timestamp:
Explicit (date/time field in data)
Implicit (arrival time as a sequence of integers)
Unbounded length, no control of arrival order and data rate
11
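As an illustration of such an element (a hypothetical representation chosen here, not prescribed by the slides), the IP-monitoring tuple above could be modelled in Python as follows; an explicit timestamp is just an extra field, while dropping it and relying on arrival order gives an implicit timestamp.

```python
# Hypothetical representation of one stream element from the IP-monitoring
# example; fields mirror (Source, Destination, Duration, Bytes, Protocol).
from typing import NamedTuple

class FlowRecord(NamedTuple):
    source: str
    destination: str
    duration: int        # seconds
    bytes: int
    protocol: str
    timestamp: float     # explicit timestamp; omit it to rely on arrival order

item = FlowRecord("10.1.0.2", "16.2.3.7", 12, 20_000, "http", 1597600000.0)
```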
Data management approaches
Source: https://vasia.github.io/dspa20/index.html
12
DSMS vs. DBMS
Source: https://homepages.cwi.nl/~boncz/bads/streaming.shtml
Data stream management system (DSMS): Voluminous streams-in, reduced streams-out
Database management system (DBMS): Output of DSMS can be treated as data feeds to database
13
DSMS vs. DBMS
Source: https://vasia.github.io/dspa20/index.html
14
Keep the data moving
Process messages in-stream, without any requirement to store them
Query using SQL on Streams
Support a high-level StreamSQL language with built-in extensible stream-oriented primitives and operators
Handle stream imperfections (delayed, missing and out-of-order data)
Have built-in mechanisms to provide resiliency against stream issues that are commonly present in real-world data streams
Generate predictable outcomes
Must guarantee predictable and repeatable outcomes
15
Integrate stored and streaming data
Efficiently store, access, and modify state information, and combine it with live streaming data
For seamless integration, the system should use a uniform language when dealing with either type of data
Guarantee data safety and availability
Ensure that applications are up and available, and that data integrity is maintained at all times, despite failures
Partition and scale applications automatically
Distribute processing across multiple processors and machines to achieve incremental scalability
Process and respond instantaneously
Have a highly optimized, minimal-overhead execution engine to deliver real-time responses for high-volume applications
16
Outline
Introduction
Definition, applications, challenges
Data stream processing
Data stream modelling
Data stream processing
17
Modeling data stream
A stream can be viewed as a never-ending sequence of updates to a one-dimensional vector A[1…N] with an unknown N
The j-th update, of the form (k, c[j]), modifies the k-th entry of A as A[k] ← A[k] + c[j]
Example: up-to-date frequencies of specific (srcIP, destIP) pairs
Source: https://vasia.github.io/dspa20/index.html
18
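A minimal sketch of this update model, assuming the vector A is held in a dictionary so that the unknown size N never has to be fixed up front. (Keeping A explicitly is exactly what a streaming algorithm tries to avoid; the code only illustrates the semantics of an update.)

```python
# Sketch of the general update model: the j-th update (k, c[j]) performs
# A[k] <- A[k] + c[j]; entries are implicitly 0 until first touched.
from collections import defaultdict

A = defaultdict(int)

def process_update(k, c):
    A[k] += c

# e.g. maintaining up-to-date frequencies of (srcIP, destIP) pairs
for key, c in [(("10.1.0.2", "16.2.3.7"), 1),
               (("18.6.7.1", "12.4.0.3"), 1),
               (("10.1.0.2", "16.2.3.7"), 1)]:
    process_update(key, c)

print(A[("10.1.0.2", "16.2.3.7")])   # 2
```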
Time-series model
The j-th update (j, A[j])
Updates arrive in increasing order of j, as a time series
Entries of A are observed in order of increasing index
Drawback:
Cannot change past entries
Only of practical value for modelling time-series data (sensor data, stock data, etc.)
Source: https://www.mfe.govt.nz/fresh-water
19
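A tiny illustrative sketch of the time-series restriction: the j-th update is simply the j-th observation, indices arrive in increasing order, and past entries are never revised.

```python
# Time-series model: updates (j, A[j]) arrive with j = 1, 2, 3, ...
readings = []                            # A[1], A[2], ... in arrival order

def process_update(j, value):
    assert j == len(readings) + 1, "indices must arrive in increasing order"
    readings.append(value)               # past entries are never changed

for j, temperature in enumerate([18.2, 18.4, 18.1], start=1):   # e.g. sensor data
    process_update(j, temperature)
```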
Cash-register model
Allow multiple updates that only increment an entry in A:
The j-th update (k, c[j]) increases A[k] by a value c[j] ≥ 0.
Drawback:
Only of practical value for insertion-only data streams, since the old data are still in the history
Source: https://vasia.github.io/dspa20/index.html
20
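A small illustrative sketch of the cash-register restriction (not from the slides): updates may only increase an entry, as when accumulating byte counts per flow in the IP-monitoring example.

```python
# Cash-register model: each update (k, c) has c >= 0, so A[k] only grows.
from collections import Counter

bytes_per_flow = Counter()

def process_update(key, c):
    if c < 0:
        raise ValueError("cash-register model allows increments only")
    bytes_per_flow[key] += c

process_update(("10.1.0.2", "16.2.3.7"), 20_000)
process_update(("10.1.0.2", "16.2.3.7"), 40_000)
print(bytes_per_flow[("10.1.0.2", "16.2.3.7")])   # 60000
```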
Turnstile model
Allow multiple updates that increment/decrement an entry in A:
The j-th update (k, c[j]) updates A[k] by A[k] + c[j] for any c[j]
Drawback:
It is challenging to design space- and time-efficient algorithms for this general model
Source: DOS attack, Obinna Igbe et al.’17
21
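A small illustrative sketch of the turnstile model: updates may increment or decrement an entry, e.g. tracking the number of currently open connections per (srcIP, destIP) pair as connections open and close.

```python
# Turnstile model: each update (k, c) may have c of either sign.
from collections import defaultdict

open_connections = defaultdict(int)

def process_update(key, c):              # c may be negative
    open_connections[key] += c

process_update(("10.1.0.2", "16.2.3.7"), +1)   # connection opened
process_update(("10.1.0.2", "16.2.3.7"), -1)   # connection closed
print(open_connections[("10.1.0.2", "16.2.3.7")])   # 0
```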
Outline
Introduction
Definition, applications, challenges
Data stream processing
Data stream modelling
Data stream processing
22
DSMS architecture
[Figure: DSMS architecture, serving both online queries and offline queries]
23
Challenges
Stream properties:
Infinite and non-stationary
Rapid rate and never-ending, so it is impossible to store the entire stream in an accessible form.
How do we process the stream using a distributed dataflow to answer queries as quickly as possible?
Exact solutions are possible by parsing and manipulating inputs on-the-fly
Distributed dataflow systems: Spark, TensorFlow, Flink, etc.
Source: https://en.wikipedia.org/wiki/Distributed_data_flow (written by Krzysztof Ostrowski)
24
Challenges
Stream properties:
Infinite and non-stationary
Rapid rate and never-ending, so it is impossible to store the entire stream in an accessible form.
How do we process the stream using a limited amount of memory to answer queries in real time?
Exact solutions are often impossible without storing the data in their entirety. An approximate answer with theoretical guarantees suffices.
Source: https://blog.fastforwardlabs.com/2016/11/23/probabilistic-data-structure-showdown-cuckoo.html
25
Basic probabilistic algorithms
Sampling:
Hashing:
Source: Graham Cormode, Marios Hadjieleftheriou, CACM 2009
26
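As one concrete example of the sampling idea (an illustration added here, not taken from the cited paper), classic reservoir sampling maintains a uniform random sample of fixed size k over an unbounded stream using only O(k) memory:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)      # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item       # replace a uniformly chosen slot
    return reservoir

print(reservoir_sample(range(1_000_000), 10))
```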
References
Lukasz Golab and M. Tamer Özsu. Issues in data stream management. SIGMOD Rec. 32, 2 (June 2003).
Michael Stonebraker, Ugur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec. 34, 4 (December 2005).
Data stream lectures:
https://homepages.cwi.nl/~boncz/bads/stream.shtml
https://vasia.github.io/dspa20/lectures/dspa20-1.pdf
27