Cloud Computing
• Apache Spark and its Characteristics
• Resilient Distributed Dataset (RDD)
• RDD Operations – Transform & Action
• Lazy evaluation and RDD Lineage graph
• Terms in Spark
• Narrow and Wide Dependencies
• How Spark works
What is Apache Spark?
Limitations of MapReduce in Hadoop
1. Slow Processing Speed
• Reads and writes data to and from the disk
• Batch-oriented
2. Difficult to Use
• Every operation must be hand-coded
• No interactive mode, hard to debug
3. No Iterative Processing
• No support for cyclic data flow
Characteristics of Spark – Speed
Apache Spark achieves high performance in terms of processing speed:
• processes data mainly in the memory of worker nodes;
• avoids unnecessary I/O operations on disks.
Performance:
– Sorting benchmark (2014), 100 TB of data:
  Spark: 206 nodes in 23 mins
  MapReduce: 2,000 nodes in 72 mins
  i.e. 3x the speed with 1/10 of the resources
– Machine learning algorithms:
  100x faster than Hadoop for logistic regression
https://spark.apache.org
Characteristics of Spark – Ease of Use
Spark provides many operators that make it easy to build parallel apps:
• map / reduce
• filter / groupByKey / join
• and more.
Highly accessible, with several supported languages and an interactive mode:
• Scala (interactive, fast)
• Python (interactive, slow)
• Java (non-interactive, fast)
• R and SQL shells (interactive, slow)
https://spark.apache.org
Characteristics of Spark – Iterative Processing
• Suitable for iterative machine learning algorithms
• Execution is modelled as a Directed Acyclic Graph (DAG)
• Allows programs to load data into memory and query it repeatedly
https://spark.apache.org
• Apache Spark and its Characteristics
• Resilient Distributed Dataset (RDD)
• RDD Operations – Transform & Action
• Lazy evaluation and RDD Lineage graph
• Terms in Spark
• Narrow and Wide Dependencies
• How Spark works
Resilient Distributed Dataset (RDD)
[Figure: input data is loaded into an RDD of partitions 1–4; a transformation produces a new RDD of partitions 5–8]
• RDD is a fundamental data structure of Spark.
• An RDD is a read-only (i.e. immutable) distributed collection of objects/elements.
• Resilient: an RDD can recover by itself in case of failure (a destroyed partition can be rebuilt).
• Distributed: each dataset in an RDD is divided into logical partitions, which are computed by many worker nodes (computers) in the cluster.
• Datasets: JSON files, CSV files, text files, etc.
Resilient Distributed Dataset (RDD)
[Figure: the same partitioned-RDD transformation diagram as above]
• In Spark, all work is expressed as:
1. creating new RDDs,
2. transforming existing RDDs, or
3. performing actions on RDDs to compute a result.
• Data manipulation in Spark is heavily based on RDDs.
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
• Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
RDD Creation: textFile() method
Two ways to create RDDs:
– load an external dataset, or
– parallelize a collection of objects (e.g., a list or set) in the driver program.
• External: Spark's textFile(URI) method loads data from a filesystem into a newly created RDD.
– The URI (Uniform Resource Identifier) can refer to any supported storage source:
  your local file system (a local path on the machine),
  HDFS (hdfs://),
  Cassandra, HBase, Amazon S3, etc.
– Once created, the resulting RDD (e.g., distFile) can be acted on by dataset operations (transformations and actions).
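A minimal sketch in the interactive Scala shell (spark-shell), where sc is the SparkContext the shell provides; the file name is illustrative:

  val distFile = sc.textFile("data.txt")       // RDD[String], one element per line
  val lineLengths = distFile.map(_.length)     // transformation (lazy, nothing runs yet)
  val totalLength = lineLengths.reduce(_ + _)  // action: triggers the actual computation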
RDD Creation: textFile() method
• The external file must also be accessible at the same path on the worker nodes:
– either copy the file to all workers, or
– use a network-mounted shared file system.
• Supports running on directories, compressed files, and wildcards:
▪ textFile("/my/directory")
▪ textFile("/my/directory/*.txt")
▪ textFile("/my/directory/*.gz")
• An optional second argument controls the number of partitions of the resulting RDD.
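A hedged sketch of that second argument (the path and the partition count are illustrative); it requests a minimum number of partitions for the resulting RDD:

  val logs = sc.textFile("/my/directory/*.txt", 8)  // ask for at least 8 partitions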
RDD Creation: parallelize() method
• Another simple way to create RDDs is to take an existing collection in your program and pass it to SparkContext's parallelize() method:

  val array = Array(1, 2, 3, 4, 5)
  val rdd = sc.parallelize(array, 5)  // distribute the collection into 5 partitions

• This approach is very useful when you are learning Spark, since you can quickly create your own RDDs in the shell and perform operations on them.
RDD Partition
Partition policy:
• By default, the number of partitions equals the number of CPU cores in the cluster.
• The default differs across deploy modes of Spark, via spark.default.parallelism:
– local: the number of CPU cores, or N for local[N]
– Mesos: 8 partitions
– Standalone or Apache YARN: max(2, total_cpus_in_cluster)
Create partitions:
• Set the number of partitions when creating an RDD:
– textFile(path, partitionNum) or parallelize(collection, partitionNum)
Cluster modes for Spark:
• Standalone
• Apache Mesos
• Hadoop YARN
• Kubernetes
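A small sketch (assuming spark-shell) of how partition counts can be set and inspected; getNumPartitions is a standard RDD method:

  val rdd1 = sc.parallelize(1 to 100)     // partition count from spark.default.parallelism
  val rdd2 = sc.parallelize(1 to 100, 8)  // explicit partition count
  println(rdd1.getNumPartitions)
  println(rdd2.getNumPartitions)          // 8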
RDD Partition
Set the number of partitions using the repartition() method:
• repartition() actually creates a new RDD with the specified number of partitions; the original RDD is unchanged.
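A minimal sketch (names are illustrative):

  val rdd = sc.parallelize(1 to 100, 2)
  val rdd8 = rdd.repartition(8)   // new RDD with 8 partitions
  println(rdd.getNumPartitions)   // still 2
  println(rdd8.getNumPartitions)  // 8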
• Apache Spark and its Characteristics
• Resilient Distributed Dataset (RDD)
• RDD Operations – Transform & Action
• Lazy evaluation and RDD Lineage graph
• Terms in Spark
• Narrow and Wide Dependencies
• How Spark works
RDD Operations – Transformation
There are two types of RDD operations: transformations and actions.
Transformations:
• Transformations are operations on RDDs that return a new RDD.
• They introduce dependencies between RDDs, forming the lineage graph.
• Many transformations are element-wise (working on one element at a time).
• Transformed RDDs are computed lazily (only when you use them in an action).
Common transformations (15+) supported by Spark:
• map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
• filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
• flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
• sortByKey([ascending], [numPartitions]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
• join(otherDataset, [numPartitions]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
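A brief hedged sketch of the two pair transformations (values are illustrative):

  val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
  pairs.sortByKey().collect()   // Array((a,1), (b,2), (c,3))
  val other = sc.parallelize(Seq(("a", "x"), ("c", "y")))
  pairs.join(other).collect()   // (a,(1,x)) and (c,(3,y)); result order may vary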
Directed Acyclic Graph
Example: a huge log file (100 TB), log.txt, contains lines of messages, including normal, warning, and error messages. We want to display the warning and error messages:
• use a filter() transformation from inputRDD to errorsRDD (keep only the error lines);
• use a filter() transformation from inputRDD to warningsRDD (keep only the warning lines);
• use a union() transformation and display the final results (see the sketch below).
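A hedged sketch of this pipeline (the file name and the way messages are matched are assumptions):

  val inputRDD = sc.textFile("log.txt")
  val errorsRDD = inputRDD.filter(line => line.contains("error"))
  val warningsRDD = inputRDD.filter(line => line.contains("warning"))
  val badLinesRDD = errorsRDD.union(warningsRDD)  // still lazy until an action runs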
RDD Operations – Transformation
Element-wise transformations:
• map(func): Return a new RDD formed by passing each element of the source through a function func.
• filter(func): Return a new RDD formed by selecting those elements of the source on which func returns true.
For example:

  val array = Array(1, 2, 3, 4, 5)
  val rdd1 = sc.parallelize(array)
  val rdd2 = rdd1.map(x => x + 10)    // {11, 12, 13, 14, 15}
  val rdd3 = rdd2.filter(x => x > 12) // {13, 14, 15}

Note: (x) => (x + 10) is equivalent to func(x) { return x + 10 }
RDD Operations – Transformation
• Sometimes we want to produce multiple output elements for each input element. The operation that does this is called flatMap().
– flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
– A simple use of flatMap() is splitting an input string into words. Given data.txt with the lines:

  MapReduce is good
  Spark is fast
  Spark is better than MapReduce

  val lines = sc.textFile("data.txt")                 // RDD (lines)
  val words = lines.flatMap(line => line.split(" "))  // RDD (words)
  // words: "MapReduce", "is", "good", "Spark", "is", "fast",
  //        "Spark", "is", "better", "than", "MapReduce"
RDD Operations – Transformation
Pseudo-set transformations:
• distinct(): produces a new RDD with only distinct items, but is expensive because it requires a shuffle.
• union(): gives back an RDD consisting of the data from both sources.
– Unlike the mathematical union, if there are duplicates in the input RDDs, the result of Spark's union() will contain duplicates (which we can fix if desired with distinct()).
• intersection(): returns only the elements found in both RDDs; it also removes all duplicates (including duplicates from a single RDD).
– intersection() and union() are similar concepts, but the performance of intersection() is much worse, since it requires a shuffle over the network to identify the common elements.
• subtract(other): takes in another RDD and returns an RDD that has only the values present in the first RDD and not the second RDD (also requires a shuffle).
For example, with RDD1 = {coffee, coffee, panda, monkey, tea} and RDD2 = {coffee, monkey, kitty}:
  RDD1.distinct()         → {coffee, panda, monkey, tea}
  RDD1.union(RDD2)        → {coffee, coffee, coffee, panda, monkey, monkey, tea, kitty}
  RDD1.intersection(RDD2) → {coffee, monkey}
  RDD1.subtract(RDD2)     → {panda, tea}
RDD Operations – Transformation
• cartesian(other) transformation:
– returns all possible pairs (a, b), where a is in the source RDD and b is in the other RDD;
– similar to a cross join in SQL;
– the Cartesian product can be useful, but is very expensive for large RDDs.
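A small sketch (values are illustrative):

  val nums = sc.parallelize(Seq(1, 2))
  val letters = sc.parallelize(Seq("a", "b"))
  nums.cartesian(letters).collect()  // (1,a), (1,b), (2,a), (2,b); order may vary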
RDD Operations – Action
Actions:
• must return a final value;
• the results of actions are returned to the driver or stored in an external storage system;
• trigger job execution, forcing the evaluation of all the transformations the result depends on;
• bring the laziness of RDDs into motion.
For example:
• the count() action returns the count as a number;
• the take() action collects a number of elements from the RDD.
RDD Operations – Action
To print out some information about badLinesRDD (from the log example above):
• the count() action returns the count as a number;
• the take() action collects a number of elements from the RDD.
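A hedged continuation of the earlier log sketch (the message text is an assumption):

  println("Input had " + badLinesRDD.count() + " concerning lines")
  badLinesRDD.take(10).foreach(println)  // print up to 10 offending lines on the driver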
Common actions (10+) supported by Spark:
• collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
• count(): Return the number of elements in the dataset.
• reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
• first(): Return the first element of the dataset (similar to take(1)).
• take(n): Return an array with the first n elements of the dataset.
• foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
RDD Operations – Action
• collect() action:
– returns the entire RDD's contents to our driver program;
– when an RDD is distributed over worker nodes, collect() gathers all the data onto one node so the complete result can be displayed.
• reduce(func) action:
– takes a function func that operates on two elements of the type in your RDD;
– returns a new element of the same type.
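A minimal sketch of both actions:

  val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
  rdd.collect()                          // Array(1, 2, 3, 4, 5), gathered on the driver
  val sum = rdd.reduce((a, b) => a + b)  // 15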
RDD Operations – Action
• Given an RDD rdd = {1, 2, 3, 4, 5} and reduce(func) with f(a, b) { return a + b }:
– the elements are combined pairwise: f(f(f(f(1, 2), 3), 4), 5) = ((((1 + 2) + 3) + 4) + 5) = 15.
RDD Operations – Action
• take(n):
– returns n elements from the RDD and attempts to minimize the number of partitions accessed;
– it may therefore represent a biased collection (it may not return the elements in the expected order);
– useful for unit tests and quick debugging.
• top(): extracts the top elements from an RDD (requires a defined ordering).
• takeSample(withReplacement, num, seed): allows us to take a sample of our data.
• foreach(func): performs computations on each element in the RDD.
• count(): returns a count of the elements.
• countByValue(): returns a map of each unique value to its count.
RDD Operations – Action
Examples, given an RDD rdd containing {1, 2, 3, 3}:
  rdd.reduce((x, y) => x + y)  ->  9
  rdd.collect()                ->  {1, 2, 3, 3}
  rdd.count()                  ->  4
  rdd.take(2)                  ->  {1, 2}
  rdd.top(2)                   ->  {3, 3}
  rdd.countByValue()           ->  {(1,1), (2,1), (3,2)}
  rdd.foreach(println)         ->  1, 2, 3, 3
Working with Key/Value Pairs
• RDDs of key/value pairs are a common data type required for many operations in Spark.
– e.g. ("Spark", 2), ("is", 3), ("is", (1,1,1)), etc.
• Key/value RDDs are commonly used to perform aggregations:
– counting up reviews for each product;
– grouping together data with the same key;
– grouping together two different RDDs.
• Spark provides special operations on RDDs containing key/value pairs:
– the reduceByKey() method can aggregate data separately for each key;
– the join() method can merge two RDDs together by grouping elements with the same key.
• ._1 and ._2 stand for the key and the value, respectively.
– e.g. ._1 stands for "Spark" or "is", while ._2 stands for 2 or 3.
Creating Pair RDDs
Use the map() method to generate key/value pairs: words.map(x => (x, 1))
• x is each element in the RDD words.
• => (the right arrow, or fat arrow) separates the parameters of a function from the function body.
  E.g. (x, y) => (x + y) is equivalent to func(x, y) { return x + y }
  x => (x, 1) is thus equivalent to func(x) { return (x, 1) }, a key/value pair.
  In this way, you can conveniently write anonymous functions.
Applied to the RDD words from the flatMap example:

  val wordspair = words.map(x => (x, 1))
  // RDD (wordspair): ("MapReduce",1), ("is",1), ("good",1),
  //                  ("Spark",1), ("is",1), ("fast",1),
  //                  ("Spark",1), ("is",1), ("better",1), ("than",1), ("MapReduce",1)
Transformations on Pair RDDs
• reduceByKey(func, [numTasks]) transformation:
– is quite similar to reduce(): both take a function and use it to combine values;
– runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key;
– unlike reduce(), reduceByKey() is not implemented as an action that returns a value to the user program;
– instead, it returns a new RDD consisting of each key and the reduced value for that key.
• Examples:
– given the grades of all the courses in ITEE, calculate the average GPA for each student;
– given a text, count each word's occurrences.
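A complete hedged word-count sketch combining the steps above (the file name is illustrative):

  val counts = sc.textFile("data.txt")
                 .flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
  counts.collect().foreach(println)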
RDD Operations – Transformation
• reduceByKey(func): called on a dataset of (K, V) pairs,
• returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func,
• func must be of type (V, V) => V, e.g. (a, b) => a + b.
Continuing the word-count example:

  val wordspair = words.map(x => (x, 1))
  val reducewords = wordspair.reduceByKey((a, b) => a + b)
  // RDD (reducewords): ("MapReduce",2), ("Spark",2), ("is",3),
  //                    ("good",1), ("fast",1), ("better",1), ("than",1)
RDD Operations – Transformation
• reduceByKey(func):
– wordspair.reduceByKey((a, b) => a + b) on {("is", 1), ("is", 1), ("is", 1), ("is", 1), ("is", 1)}
– with f(a, b) { return a + b }, the values (1, 1, 1, 1, 1) are combined step by step: 1+1 -> 2, 2+1 -> 3, 3+1 -> 4, 4+1 -> 5, giving ("is", 5).
Transformations on Pair RDDs
• reduceByKey(_+_): elements must share the same key for their values to be aggregated with the func f(a, b) { return a + b }:
– {("s123456", 78), ("s123456", 80), ("s123456", 65), ("s123456", 90),
– ("s654321", 80), ("s654321", 40), ("s654321", 50), ("s654321", 90), ("s654321", 80)}
– for the key "s123456", the values 78, 80, 65, 90 will be aggregated with func;
– for the key "s654321", the values 80, 40, 50, 90, 80 will be aggregated with func.
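A hedged sketch of the average-GPA-per-student idea from the earlier example (one common pattern, not the only one): an average needs a count as well as a sum, so each grade is first paired with 1:

  val grades = sc.parallelize(Seq(
    ("s123456", 78), ("s123456", 80), ("s123456", 65), ("s123456", 90),
    ("s654321", 80), ("s654321", 40), ("s654321", 50), ("s654321", 90), ("s654321", 80)))
  val sums = grades.mapValues(v => (v, 1))
                   .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // (sum, count) per key
  val avgs = sums.mapValues { case (total, n) => total.toDouble / n }
  // avgs: ("s123456", 78.25), ("s654321", 68.0)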
Transformations on Pair RDDs
reduceByKey(func) vs reduce(func):
• reduce(func) and reduceByKey(func) are both aggregating operations;
• reduceByKey(func) is a transformation on pair RDDs, while reduce(func) is an action on regular RDDs;
• reduce(func) can be performed on non-pair RDDs;
• reduceByKey(func) needs to match keys first;
• the aggregation function func works in much the same way in both.