程序代写代做代考 database data mining finance distributed system FTP algorithm html Data Stream Processing

Data Stream Processing
COMPSCI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Slides are collected and edited from https://vasia.github.io/dspa20/index.html and https://homepages.cwi.nl/~boncz/bads/stream.shtml
Auckland, Aug 17, 2020
1

Outline
 Introduction
 Definition, applications, challenges
 Data stream processing  Data stream modelling
 Data stream processing
2

What is data stream?
 Large data volume, likely structured, arriving at a very high rate (high enough that machine cannot keep up it)
 Definition (Golab & Ozsu, 2003):
 A data stream is a real-time, continuous, ordered (implicitly by arrival
time of explicitly by timestamp) sequence of items.
 It is impossible to control the order in which items arrive, nor it is feasible to locally store a stream in its entirety.
3

Massive data stream
 We do not know entire data set in advance.
 Data arrives in the streaming fashion and if it is not processed immediately, it will be lost forever.
4

Why do we need data stream?
 Online, real-time processing
 Potential objectives:
 Event detection and reaction
 Fast and potentially approximate online aggregation and analytics at different granularities
 Various applications:
 Network management, telecommunications
 Load balancing in distributed systems
 Stock monitoring, finance, fraud detection  Online data mining (click stream analysis)
5

Data stream applications
 Sensor networks:
 Many sensors feeding into a central controller.
 What occurs if the PDA device does not have enough computational resources for real-time event detection?
Source: Vahid Ayatollahitafti et al. PLOS’16
6

Data stream applications
 Google query streams:
 What queries are more frequent today than yesterday?
 Estimate influenza activities by aggregating Google search queries
7

Data stream applications
 Yahoo click streams:
 What pages are getting an unusual number of hits in the past hour?
 Mining social network:
 Look for trending topics on Twitter, Facebook based on news feed
 IP packets monitored at a switch in network:  Detect denial-of-service attacks
 Gather information for optimal routing
8

IP network monitoring application
 24×7 IP packet/flow data streams at network elements
 Massive stream arriving at rapid rates
 AT&T collects ~ 1 TB of Netflow data per day.
Protocol 12 20K http 16 24K http 15 20K http 19 40K http 26 58K http
 Off-line analysis is very slow and expensive.
16.2.3.7 12.4.0.3 11.6.8.2 17.1.2.1 14.8.7.4 13.0.0.1 10.3.4.5 16.5.5.8
Source 10.1.0.2 18.6.7.1 13.9.4.3 15.2.2.9 12.4.3.8 10.5.1.3 11.1.0.6 19.7.1.2
Destination Duration Bytes
27 100K ftp 32 300K ftp 18 80K ftp
9

Stream processing model
Streams arriving
Continuous queries
Hard queries to answer at any time
Each of stream is composed of elements/tuples
Easy queries to answer at any time
. . . 1, 5, 2, 7, 0, 9, 3 … a,r,v,t,y,h,b . . . 0, 0, 1, 0, 1, 1, 0
Standing Queries
Output
Collect summaries of streams to answer queries from this storage.
Limited Working Storage
Archival Storage
Impossible to answer queries from this storage
time
Processor
Source: mmds.org (chapter 4)

Structure of data stream
 Infinite sequence of items (elements) which have the same structured information, e.g.
 Tuple: (Source, Destination, Duration, Bytes, Protocol)  Object: a user ID or a search query
 Timestamp:
 Explicit (date/time field in data)
 Implicit (arrival time as a sequence of integers)
 Unbounded length, no control of arrival order and data rate
11

Data management approaches
Source: https://vasia.github.io/dspa20/index.html
12

DSMS vs. DBMS
Source: https://homepages.cwi.nl/~boncz/bads/streaming.shtml
 Data stream management system (DSMS): Voluminous streams-in, reduced streams-out
 Database management system (DBMS): Output of DSMS can be treated as data feeds to database
13

DSMS vs. DBMS
Source: https://vasia.github.io/dspa20/index.html
14

 Keep the data moving
 Process messages in-stream, without any requirement to store them
 Query using SQL on Streams
 Support a high-level StreamSQL language with built-in extensible stream-
oriented primitives and operators
 Handle stream imperfections (delayed, missing and out-of- order data)
 Have built-in mechanisms to provide resiliency against stream issues which are commonly present in real-word data streams
 Generate predictable outcomes
 Must guarantee predictable and repeatable outcomes
15

 Integrate stored and streaming data
 Efficiently store, access, and modify state information, and combine it
with live streaming data
 For seamless integration, system should use a uniform language when dealing with either type of data
 Guarantee data safety and availability
 Ensure the applications are up and available, and the integrity of the data
maintained at all times, despite failures
 Partition and scale applications automatically
 Distribute processing across multiple processors and machines to achieve incremental scalability
 Process and respond instantaneously
 Have a highly-optimized, minimal-overhead execution engine to deliver real-time response for high-volume application
16

Outline
 Introduction
 Definition, applications, challenges
 Data stream processing  Data stream modelling
 Data stream processing
17

Modeling data stream
 A stream can be viewed as never-ending updates of one- dimensional vector A[1…N] with an unknown N
 The j-th update with the form (k, c[j]) modifies the k-th entry of A as A[k]←A[k] + c[j]
up-to-date frequencies of specific (srcIP, destIP) Source: https://vasia.github.io/dspa20/index.html
18

Time-series model
 The j-th update (j, A[j])
 Updates arriving in increasing
order of j as a time series
 Entries of A are observed with
the increasing index  Drawback:
 Cannot change the past entries
 Only practical value for modeling time series data (sensor data, stock data, etc.)
Source: https://www.mfe.govt.nz/fresh-water
19

Cash-register model
 Allow multiple updates that only increment an entry in A:
 The j-th update (k, c[j]) increases A[k] by a value c[j] ≥ 0.
 Drawback:
 Only practical value for insertion- only data stream since the old data are still in the history
Source: https://vasia.github.io/dspa20/index.html
20

Turnstile model
 Allow multiple updates that increment/decrement an entry in A:
 The j-th update (k, c[j]) updates A[k] by A[k] + c[j] for any c[j]
 Drawback:
 It is challenging to design space and time-efficient algorithms for this general model
Source: DOS attack, Obinna Igbe et al.’17
21

Outline
 Introduction
 Definition, applications, challenges
 Data stream processing  Data stream modelling
 Data stream processing
22

DSMS architecture
Online queries
Offline queries
23

Challenges
 Stream properties:
 Infinite and non-stationary
 Rapid rate, never-ending, so impossible to store the entire stream accessibly.
 How do we process the stream using a distributed dataflow to answer queries as quickly as possible?
 Exact solutions are possible by parsing and manipulating inputs on-the-fly
 Distributed dataflow systems: Spark, TensorFlow, Flink, etc.
Source: https://en.wikipedia.org/wiki/Distributed_data_flow 24 (written by Krzysztof Ostrowski)

Challenges
 Stream properties:
 Infinite and non-stationary
 Rapid rate, never-ending, so impossible to store the entire stream accessibly.
 How do we process the stream using a limited amount of memory to answer queries in real-time manner?
 Exact solutions are often impossible without entirely storing data.  An approximation answer suffices with theoretical guarantees.
Source: https://blog.fastforwardlabs.com/2016/11/23/probabilistic-data-structure-showdown-cuckoo.html
25

Basic probabilistic algorithms
 Sampling:
 Hashing:
Source: Graham Cormode, Marios Hadjieleftheriou, CACM 2009
26

References
 Lukasz Golab and M. Tamer Özsu. Issues in data stream management. SIGMOD Rec. 32, 2 (June 2003).
 Michael Stonebraker, Ugur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec. 34, 4 (December 2005).
 Data stream lectures:
 https://homepages.cwi.nl/~boncz/bads/stream.shtml
 https://vasia.github.io/dspa20/lectures/dspa20-1.pdf
27