


Spark Streaming
Supports real-time processing of streaming data
Using the Spark core for parallelization & fault tolerance
Combine streaming with batch and interactive queries.


Steps of a Spark Streaming application
Define the input sources by creating input DStreams.
Define the streaming computations by applying transformation and output operations to DStreams.
Start receiving data and processing it using streamingContext.start().
Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
The processing can be manually stopped using streamingContext.stop().
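A minimal PySpark sketch of these steps, assuming a text source on localhost port 9999 (for example started with nc -lk 9999); the application name and port are placeholders:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Use at least 2 local cores: one for the receiver, one for processing.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)                        # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)      # 1. define the input DStream
counts = (lines.flatMap(lambda line: line.split())   # 2. define the computation
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # output operation

ssc.start()                                          # 3. start receiving and processing
ssc.awaitTermination()                               # 4. wait; stop manually with ssc.stop()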

Points to note
Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
Interrupting the streaming program in the shell / Jupyter does not properly clean up the streaming tasks (even after running ssc.stop()). You will have to shut down the kernel and re-open the notebook.
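For example, a small sketch of stopping only the StreamingContext and then reusing the SparkContext (ssc and sc as in the earlier example):

ssc.stop(stopSparkContext=False)    # stops the streaming computation, keeps sc alive
ssc2 = StreamingContext(sc, 10)     # a new StreamingContext with 10-second batches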

Discretized Streams (DStreams)
DStream: a continuous stream of data. 
DStreams can be created
from input data streams
by applying high-level operations on other DStreams.
Internally, a DStream is represented as a sequence of RDDs.

Any operation applied on a DStream translates to operations on the underlying RDDs.

Input DStreams and Receivers
Input DStreams are DStreams representing the stream of input data received from streaming sources.
Every input DStream is associated with a Receiver
Note: Each Receiver requires a core, so you need to allocate more cores than the number of input DStreams.
Two categories of built-in streaming sources:
Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, socket connections, RDD queues.
Advanced sources: Other streaming systems like Kafka, Flume, Kinesis, etc.
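For illustration, the basic sources above can be created as follows (host, port and paths are placeholders):

lines_from_socket = ssc.socketTextStream("localhost", 9999)       # socket connection
lines_from_files = ssc.textFileStream("hdfs:///data/incoming")    # new files in a directory
rdd_queue = [sc.parallelize(range(i * 100, (i + 1) * 100)) for i in range(3)]
stream_from_queue = ssc.queueStream(rdd_queue)                    # queue of RDDs (useful for testing)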

Transformations on DStreams
DStreams support many of the transformations available on normal Spark RDDs. See the online manual for details.
transform(): Invoke any RDD operation on a DStream
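A small sketch of transform(), assuming counts is a DStream of (word, count) pairs as in the earlier example; sorting by value is not a built-in DStream transformation, so it is done with an RDD operation:

sorted_counts = counts.transform(
    lambda rdd: rdd.sortBy(lambda kv: kv[1], ascending=False))    # any RDD operation per batch
sorted_counts.pprint()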

UpdateStateByKey
Allows you to maintain arbitrary state while continuously updating it with new information
Steps:
Define the initial state: An RDD of (key, value) pairs
Define the state update function – specify how to update the state using the previous value and the new values from the input stream.
In every batch, Spark will apply the state update function for all existing keys, regardless of whether they have new data in a batch or not
If the update function returns None then the key-value pair will be eliminated.
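A sketch of running word counts with updateStateByKey, assuming pairs is a DStream of (word, 1) pairs; stateful operations require a checkpoint directory (the path is a placeholder):

ssc.checkpoint("checkpoint")                    # required for stateful transformations

def update_count(new_values, running_count):
    # new_values: the new counts for this key in the current batch (possibly empty)
    # running_count: the previous state, or None if the key has no state yet
    # returning None here would remove the key from the state
    return sum(new_values) + (running_count or 0)

running_counts = pairs.updateStateByKey(update_count)
running_counts.pprint()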

Streaming Algorithms


Missing Card
I take one card from a deck of 52 cards and pass the rest to you. Suppose you only have a (very basic) calculator and bad memory; how can you find out which card is missing with just one pass over the 51 cards?

What if there are two missing cards?
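One standard answer (a sketch, not the only solution): encode the cards as the numbers 1..52 and keep running sums in a single pass; two missing cards additionally need the running sum of squares.

def find_missing_one(cards):                   # cards: the 51 remaining card numbers
    return 52 * 53 // 2 - sum(cards)           # full-deck sum minus what we saw

def find_missing_two(cards):                   # cards: the 50 remaining card numbers
    s = q = 0
    for x in cards:                            # one pass, O(1) memory
        s += x
        q += x * x
    p = 52 * 53 // 2 - s                       # a + b
    t = sum(x * x for x in range(1, 53)) - q   # a^2 + b^2
    d = int(round((2 * t - p * p) ** 0.5))     # a - b, since (a-b)^2 = 2(a^2+b^2) - (a+b)^2
    return (p + d) // 2, (p - d) // 2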

A data stream algorithm …
Makes one pass over the input data
Uses a small amount of memory (much smaller than the input data)
Computes something


Reservoir Sampling
Maintain a sample of size s drawn (without replacement) from all elements in the stream so far
Keep the first s elements in the stream, set n = s
Algorithm for a new element (the n-th element in the stream):
With probability s/n, use it to replace an item in the current sample chosen uniformly at random
With probability 1 - s/n, throw it away
Perhaps the first “streaming” algorithm
[Waterman ??; Knuth’s book]
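A minimal sketch of the algorithm in Python (sample size s; stream is any iterable):

import random

def reservoir_sample(stream, s):
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)                   # keep the first s elements
        elif random.random() < s / n:          # with probability s/n ...
            sample[random.randrange(s)] = x    # ... replace a uniformly random slot
        # with probability 1 - s/n the new element is thrown away
    return sample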

Correctness Proof
By induction on n
n = s: trivially correct
Assume each element so far is sampled with probability s/n
Consider element n+1:
The new element is sampled with probability s/(n+1)
Any element in the current sample remains sampled with probability (s/n) · (n/(n+1)) = s/(n+1). Yeah!
This is a wrong (incomplete) proof
Each element being sampled with probability s/n is not a sufficient condition for random sampling
Counterexample: Divide the n elements into groups of size s and pick one group uniformly at random



Reservoir Sampling Correctness Proof
Many “proofs” found online are actually wrong
They only show that each item is sampled with probability s/n
Need to show that every subset of size s has the same probability of being the sample
The correct proof relates to the Fisher-Yates shuffle (sketched below)
[Figure: step-by-step shuffle of the elements a, b, c, d illustrating the correspondence with reservoir sampling for s = 2]
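For reference, a minimal sketch of the Fisher-Yates shuffle; the correct proof uses the fact that the first s positions of a uniformly shuffled sequence form a uniform random size-s subset:

import random

def fisher_yates_shuffle(items):
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)               # uniform position in a[0..i]
        a[i], a[j] = a[j], a[i]
    return a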

Majority
Given a sequence of n items, find the majority element (one that occurs more than n/2 times), if there is one

A A B C D B A A B B A A A A A A C C C D A B A A A
Answer: A

Trivial if we have O(n) memory
Can you do it with O(1) memory and two passes?
First pass: find the possible candidate
Second pass: compute its frequency and verify that it is > n/2
How about one pass?
Unfortunately, no
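A sketch of the two-pass approach above (the Boyer-Moore voting algorithm); items is a list so that it can be scanned twice:

def majority(items):
    candidate, count = None, 0
    for x in items:                            # pass 1: find the only possible candidate
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    freq = sum(1 for x in items if x == candidate)    # pass 2: verify the frequency
    return candidate if freq > len(items) / 2 else None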

Heavy hitters
Misra-Gries (MG) algorithm finds up to k items that occur more than a 1/k fraction of the time in a stream
Estimates their frequencies with additive error N/(k+1)
Keep k different candidates in hand. For each item in the stream:
If the item is monitored, increase its counter
Else, if < k items are monitored, add the new item with count 1
Else, decrease all counts by 1

[Figure: example stream of items 1 2 3 4 5 6 7 8 9 processed with k = 5 counters]

Streaming MG analysis
N = total input size
Error analysis
True count is in [counter, counter + # decrements]
Each decrement corresponds to deleting (k+1) distinct items from the stream
At most N/(k+1) decrements on each unique key
So error ≤ N/(k+1)
Note: We can easily keep track of # decrements, so the actual error guarantee can be smaller than N/(k+1)
On real data sets, the true count is usually closer to the upper bound, i.e., counter + # decrements

Implementing the MG algorithm in Spark Streaming
The classical MG algorithm is sequential. Need to turn it into a parallel algorithm.
Idea 1: The increment operations and decrement operations can be decoupled.
Idea 2: Both increment operations and decrement operations can be performed in batches. We may exceed our space budget sometimes, but at most by a batch size.

Parallel MG Algorithm
All counters = 0
total_decrement = 0
threshold = 0
For each batch do
Count the words in this batch
For each item in the batch, do the following in parallel:
Add its count to the corresponding counter
Subtract threshold from the counter
If the counter becomes 0 or negative, remove it
If # counters > k, threshold = the k-th largest counter, else threshold = 0
total_decrement += threshold
Final estimate = (counter, counter + total_decrement)
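A plain-Python sketch mirroring the per-batch update above; counters, threshold and total_decrement live on the driver, and batch_counts stands in for the word counts of one batch (which Spark would compute in parallel, e.g. with reduceByKey):

def process_batch(batch_counts, counters, threshold, total_decrement, k):
    for item, cnt in batch_counts.items():                # done in parallel in Spark
        new_val = counters.get(item, 0) + cnt - threshold # add count, subtract threshold
        if new_val > 0:
            counters[item] = new_val
        else:
            counters.pop(item, None)                      # remove counters that drop to 0 or below
    if len(counters) > k:
        threshold = sorted(counters.values(), reverse=True)[k - 1]   # k-th largest counter
    else:
        threshold = 0
    total_decrement += threshold
    return counters, threshold, total_decrement
# estimate for an item: (counters[item], counters[item] + total_decrement)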

Sliding windows
Note:
The aggregation function must be associative and commutative so that it can be computed correctly in parallel.
Inverse function is optional but can improve efficiency
Question: For what aggregates do we not have an inverse function?
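For example, a sketch of a windowed word count with an inverse function, assuming pairs is a DStream of (word, 1) pairs (checkpointing must be enabled when an inverse function is supplied):

windowed_counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,        # associative and commutative aggregation
    lambda a, b: a - b,        # inverse function: subtract counts leaving the window
    windowDuration=30,         # window length (seconds)
    slideDuration=10)          # slide interval (seconds)
windowed_counts.pprint()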

Structured Streaming (Alpha)

Structured Streaming = Spark Streaming on top of Spark SQL

Window Operations on Event Time

from pyspark.sql.functions import window

windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
