程序代写代做代考 algorithm Trees

Trees

Algorithm Design
for Big Data Systems
Divide and conquer
Graph algorithms
Streaming algorithms

Embarrassingly Parallel Problems
Problems that can be easily decomposed into many independent problems.
Examples.
Word count
k-means
PageRank
2

Classical Divide-and-Conquer
Classical D&C
Divide problem into 2 parts
Recursively solve each part
Combine the results together
D&C under big data systems
Divide problem into partitions, where (ideally) is the number of executors in the system
Solve the problem on each partition
Combine the results together
Example: sum(), reduce()

Prefix Sums
Input: Sequence x of n elements, binary associative operator +
Output: Sequence y of n elements, with
yk = x1 + … + xk
Example:
x = [1, 4, 3, 5, 6, 7, 0, 1]
y = [1, 5, 8, 13, 19, 26, 26, 27]
Algorithm:
Compute sum for each partition
Compute the prefix sums of the sums
Compute prefix sums in each partition
4

Variants of Prefix Sums
Assign consecutive id’s for each element
zipWithIndex()
Given a list of words, find the first appearance of “spark”
Given two long strings, compare them lexicographically
Given a sequence of integers, check whether these numbers are moronically decreasing.
5

Sorting (Sample Sort)
Step 1: Sampling
Collect a sample of elements
Step 2: Broadcast splitters
Pick every -th element in the sample as splitters
Broadcast them to all machines
Step 3: Partition the elements using splitters
Claim: The partitions are balanced. More precisely, with high probability, each contains at most elements.
Step 4: Sort each partition
6

Distributed Sampling
Q: How to sample one element uniformly from elements stored on servers?
A:
First randomly sample a server
Then ask that server to return an element randomly chosen from its elements.
The probability of each element being sampled is
Q: How to sample many elements at once?
A: Do each of the two steps above in batch mode
First sample servers with replacement (this can be done at the master node).
If a server is sampled times, we ask that server to return samples (with replacement) from its local data.
7

Balls and Bins
Theorem
Suppose we throw balls into bins randomly. Then with high probability every bin will receive balls.
Use this theorem for Sample Sort:
Imagine that all elements have been sorted in ascending order
Divide them into bins with each bin containing elements
The algorithm samples elements balls
No server receives more than elements as long as every bin contains at least one splitter
This is the case as long as every bin has sampled elements, i.e., balls

The Maximum Subarray Problem
Year 1 2 3 4 5 6 7 8 9
Profit (M$) -3 2 1 -4 5 2 -1 3 -1

9
Input: Profit history of a company of the years.
Problem: Find the span of years in which the company earned the most
Answer: Year 5-8 , 9 M$

Formal definition:
Input: An array of numbers , both positive and negative
Output: Find the maximum , where

A divide-and-conquer algorithm
10
Year 1 2 3 4 5 6 7 8 9
Profit (M$) -3 2 1 -4 5 2 -1 3 -1

Idea:
Cut the array into two halves
All subarrays can be classified into three cases:
Case 1: entirely in the first half
Case 2: entirely in the second half
Case 3: crosses the cut
Largest of three cases is final solution
The optimal solutions for case 1 and 2 can be found recursively.
Only need to consider case 3.

Solving case 3
11
Year 1 2 3 4 5 6 7 8 9
Profit (M$) -3 2 1 -4 5 2 -1 3 -1

Idea:
Let
Any case 3 subarray must have starting position , and ending position
Such a subarray can be divided into two parts and
, for some and
Just need to maximize each of them separately
Maximize and :
Let i’, j’, be the indices that maximize the values.
i’, j’ can be found using linear scans to left and right of q
A[i’,j’] has largest value of all subarrays that cross q

The (binary) divide-and-conquer algorithm
Analysis:
Recurrence:

So,
12
MaxSubarray():
if then return

MaxSubarray()
MaxSubarray()

for downto

if then

for to

if then
return

First call: MaxSubarray()

If we use the same algorithm on Spark:
Level 1:
Naively: 2 executors are working, all others idle
time =
Smarter: and can be found by the prefix-sum algorithm
Can use all executors, time =
Level 2:
We have 4 subarrays, and solve two prefix-sums for each subarray
Each subarray has size , and we make sure that each has the same number of partitions
Time =
Level 3: Time =
Stop recursion when each subarray is one partition.
Total time:

A linear-time algorithm?
Define:

14
Year 1 2 3 4 5 6 7 8 9
Profit (M$) -3 2 1 -4 5 2 -1 3 -1
-3 -1 0 -4 1 3 2 5

Observations:

For fixed j, finding largest is same as finding the index , for which is smallest
Idea: doing this for each then find overall largest

A linear-time (Θ(𝑛)) algorithm?
Define:
Goal: Find
15

Algorithm:
For each , needs to know i < j that minimizes (i.e., maximizes ) (Then maximize over all j) Algorithm increases by +1 each step Keeps track of smallest so far Could be old smallest one or it could be current Year 1 2 3 4 5 6 7 8 9 Profit (M$) -3 2 1 -4 5 2 -1 3 -1 -3 -1 0 -4 1 3 2 5 The linear-time algorithm 16 for to do if then if then return for to do if then if then return Even “simpler”: Observation: iff Because No need to actually store ! keeps track of smallest so far. contains difference between current and smallest so far A simpler and more efficient algorithm For each partition, solve the problem directly using one executor and the linear-time algorithm Now it remains to solve the “cross the boundary” case Find the and for each partition, as well as its sum For each contiguous subsets of partitions , the optimal solution with left boundary in partition and right boundary in partition is Total time: 17 /docProps/thumbnail.jpeg

Related Posts