Cloud Computing
• Apache Spark and its Characteristics
• Resilient Distributed Dataset (RDD)
• RDD Operations – Transform & Action
• Lazy evaluation and RDD Lineage graph
• Terms in Spark
• Narrow and Wide Dependencies
• How Spark works
What is Apache Spark?
Limitations of MapReduce in Hadoop
1. Slow Processing Speed
• Reads and writes data to and from the disk
• Batch-oriented
2. Difficult to Use
• Every operation must be hand-coded
• No interactive mode, hard to debug
3. No Iterative Processing
• No support for cyclic data flow
Characteristics of Spark – Speed
Apache Spark achieves high performance in terms of processing speed:
• processes data mainly in the memory of worker nodes;
• avoids unnecessary I/O operations on disks.
Performance:
– Sorting benchmark (2014), 100 TB of data:
  Spark: 206 nodes in 23 mins
  MapReduce: 2,000 nodes in 72 mins
  i.e. 3x the speed with 1/10 of the resources
– Machine learning algorithms:
  100x faster than Hadoop for logistic regression
https://spark.apache.org
Characteristics of Spark – Ease of Use
Spark provides many operators that make it easy to build parallel apps:
• map / reduce
• filter / groupByKey / join
• and more.
Highly accessible, with several supported languages and an interactive mode:
• Scala (interactive, fast)
• Python (interactive, slow)
• Java (non-interactive, fast)
• R and SQL shells (interactive, slow)
https://spark.apache.org
Characteristics of Spark – Iterative Processing
• Suitable for iterative machine learning algorithms
• Execution is modelled as a Directed Acyclic Graph (DAG)
• Allows programs to load data into memory and query it repeatedly
https://spark.apache.org
• Apache Spark and its Characteristics
• Resilient Distributed Dataset (RDD)
• RDD Operations – Transform & Action
• Lazy evaluation and RDD Lineage graph
• Terms in Spark
• Narrow and Wide Dependencies
• How Spark works
Resilient Distributed Dataset (RDD)
[Figure: input data is loaded into an RDD of partitions 1–4; a transformation produces a new RDD of partitions 5–8]
• RDD is a fundamental data structure of Spark.
• An RDD is a read-only (i.e. immutable) distributed collection of objects/elements.
• Resilient: an RDD can recover by itself in case of failure (a destroyed partition can be rebuilt).
• Distributed: each dataset in an RDD is divided into logical partitions, which are computed by many worker nodes (computers) in the cluster.
• Datasets: JSON files, CSV files, text files, etc.
Resilient Distributed Dataset (RDD)
[Figure: the same partitioned-RDD transformation diagram as above]
• In Spark, all work is expressed as:
1. creating new RDDs,
2. transforming existing RDDs, or
3. performing actions on RDDs to compute a result.
• Data manipulation in Spark is heavily based on RDDs.
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
• Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
RDD Creation: textFile() method
Two ways to create RDDs:
– load an external dataset, or
– parallelize a collection of objects (e.g., a list or set) in the driver program.
• External: Spark's textFile(URI) method loads data from a filesystem into a newly created RDD.
– The URI (Uniform Resource Identifier) can refer to any supported storage source:
  your local file system (a local path on the machine),
  HDFS (hdfs://),
  Cassandra, HBase, Amazon S3, etc.
– Once created, the resulting RDD (e.g., distFile) can be acted on by dataset operations (transformations and actions).
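A minimal sketch in the interactive Scala shell (spark-shell), where sc is the SparkContext the shell provides; the file name is illustrative:

  val distFile = sc.textFile("data.txt")       // RDD[String], one element per line
  val lineLengths = distFile.map(_.length)     // transformation (lazy, nothing runs yet)
  val totalLength = lineLengths.reduce(_ + _)  // action: triggers the actual computation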
RDD Creation: textFile() method
• The external file must also be accessible at the same path on the worker nodes:
– either copy the file to all workers, or
– use a network-mounted shared file system.
• Supports running on directories, compressed files, and wildcards:
▪ textFile("/my/directory")
▪ textFile("/my/directory/*.txt")
▪ textFile("/my/directory/*.gz")
• An optional second argument controls the number of partitions of the resulting RDD.
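A hedged sketch of that second argument (the path and the partition count are illustrative); it requests a minimum number of partitions for the resulting RDD:

  val logs = sc.textFile("/my/directory/*.txt", 8)  // ask for at least 8 partitions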
RDD Creation: parallelize() method
• Another simple way to create RDDs is to take an existing collection in your program and pass it to SparkContext's parallelize() method:

  val array = Array(1, 2, 3, 4, 5)
  val rdd = sc.parallelize(array, 5)  // distribute the collection into 5 partitions

• This approach is very useful when you are learning Spark, since you can quickly create your own RDDs in the shell and perform operations on them.
RDD Partition
Partition policy:
• By default, the number of partitions equals the number of CPU cores in the cluster.
• The default differs across deploy modes of Spark, via spark.default.parallelism:
– local: the number of CPU cores, or N for local[N]
– Mesos: 8 partitions
– Standalone or Apache YARN: max(2, total_cpus_in_cluster)
Create partitions:
• Set the number of partitions when creating an RDD:
– textFile(path, partitionNum) or parallelize(collection, partitionNum)
Cluster modes for Spark:
• Standalone
• Apache Mesos
• Hadoop YARN
• Kubernetes
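A small sketch (assuming spark-shell) of how partition counts can be set and inspected; getNumPartitions is a standard RDD method:

  val rdd1 = sc.parallelize(1 to 100)     // partition count from spark.default.parallelism
  val rdd2 = sc.parallelize(1 to 100, 8)  // explicit partition count
  println(rdd1.getNumPartitions)
  println(rdd2.getNumPartitions)          // 8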
RDD Partition
Set the number of partitions using the repartition() method:
• repartition() actually creates a new RDD with the specified number of partitions; the original RDD is unchanged.
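A minimal sketch (names are illustrative):

  val rdd = sc.parallelize(1 to 100, 2)
  val rdd8 = rdd.repartition(8)   // new RDD with 8 partitions
  println(rdd.getNumPartitions)   // still 2
  println(rdd8.getNumPartitions)  // 8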
• Apache Spark and its Characteristics
• Resilient Distributed Dataset (RDD)
• RDD Operations – Transform & Action
• Lazy evaluation and RDD Lineage graph
• Terms in Spark
• Narrow and Wide Dependencies
• How Spark works
RDD Operations – Transformation
There are two types of RDD operations: transformations and actions.
Transformations:
• Transformations are operations on RDDs that return a new RDD.
• They introduce dependencies between RDDs, forming the lineage graph.
• Many transformations are element-wise (working on one element at a time).
• Transformed RDDs are computed lazily (only when you use them in an action).
Common transformations (15+) supported by Spark:
• map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
• filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
• flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
• sortByKey([ascending], [numPartitions]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
• join(otherDataset, [numPartitions]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
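A brief hedged sketch of the two pair transformations (values are illustrative):

  val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
  pairs.sortByKey().collect()   // Array((a,1), (b,2), (c,3))
  val other = sc.parallelize(Seq(("a", "x"), ("c", "y")))
  pairs.join(other).collect()   // (a,(1,x)) and (c,(3,y)); result order may vary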
Directed Acyclic Graph
Example: a huge log file (100 TB), log.txt, contains lines of messages, including normal, warning, and error messages. We want to display the warning and error messages:
• use a filter() transformation from inputRDD to errorsRDD (keep only the error lines);
• use a filter() transformation from inputRDD to warningsRDD (keep only the warning lines);
• use a union() transformation and display the final results (see the sketch below).
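A hedged sketch of this pipeline (the file name and the way messages are matched are assumptions):

  val inputRDD = sc.textFile("log.txt")
  val errorsRDD = inputRDD.filter(line => line.contains("error"))
  val warningsRDD = inputRDD.filter(line => line.contains("warning"))
  val badLinesRDD = errorsRDD.union(warningsRDD)  // still lazy until an action runs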
RDD Operations – Transformation
Element-wise transformations:
• map(func): Return a new RDD formed by passing each element of the source through a function func.
• filter(func): Return a new RDD formed by selecting those elements of the source on which func returns true.
For example:

  val array = Array(1, 2, 3, 4, 5)
  val rdd1 = sc.parallelize(array)
  val rdd2 = rdd1.map(x => x + 10)    // {11, 12, 13, 14, 15}
  val rdd3 = rdd2.filter(x => x > 12) // {13, 14, 15}

Note: (x) => (x + 10) is equivalent to func(x) { return x + 10 }
RDD Operations – Transformation
• Sometimes we want to produce multiple output elements for each input element. The operation that does this is called flatMap().
– flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
– A simple use of flatMap() is splitting an input string into words. Given data.txt with the lines:

  MapReduce is good
  Spark is fast
  Spark is better than MapReduce

  val lines = sc.textFile("data.txt")                 // RDD (lines)
  val words = lines.flatMap(line => line.split(" "))  // RDD (words)
  // words: "MapReduce", "is", "good", "Spark", "is", "fast",
  //        "Spark", "is", "better", "than", "MapReduce"
RDD Operations – Transformation
Pseudo-set transformations:
• distinct(): produces a new RDD with only distinct items, but is expensive because it requires a shuffle.
• union(): gives back an RDD consisting of the data from both sources.
– Unlike the mathematical union, if there are duplicates in the input RDDs, the result of Spark's union() will contain duplicates (which we can fix if desired with distinct()).
• intersection(): returns only the elements found in both RDDs; it also removes all duplicates (including duplicates from a single RDD).
– intersection() and union() are similar concepts, but the performance of intersection() is much worse, since it requires a shuffle over the network to identify the common elements.
• subtract(other): takes in another RDD and returns an RDD that has only the values present in the first RDD and not the second RDD (also requires a shuffle).
For example, with RDD1 = {coffee, coffee, panda, monkey, tea} and RDD2 = {coffee, monkey, kitty}:
  RDD1.distinct()         → {coffee, panda, monkey, tea}
  RDD1.union(RDD2)        → {coffee, coffee, coffee, panda, monkey, monkey, tea, kitty}
  RDD1.intersection(RDD2) → {coffee, monkey}
  RDD1.subtract(RDD2)     → {panda, tea}
RDD Operations – Transformation
• cartesian(other) transformation:
– returns all possible pairs (a, b), where a is in the source RDD and b is in the other RDD;
– similar to a cross join in SQL;
– the Cartesian product can be useful, but is very expensive for large RDDs.
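A small sketch (values are illustrative):

  val nums = sc.parallelize(Seq(1, 2))
  val letters = sc.parallelize(Seq("a", "b"))
  nums.cartesian(letters).collect()  // (1,a), (1,b), (2,a), (2,b); order may vary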
RDD Operations – Action
Actions:
• must return a final value;
• the results of actions are returned to the driver or stored in an external storage system;
• trigger job execution, forcing the evaluation of all the transformations the result depends on;
• bring the laziness of RDDs into motion.
For example:
• the count() action returns the count as a number;
• the take() action collects a number of elements from the RDD.
RDD Operations – Action
To print out some information about badLinesRDD (from the log example above):
• the count() action returns the count as a number;
• the take() action collects a number of elements from the RDD.
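A hedged continuation of the earlier log sketch (the message text is an assumption):

  println("Input had " + badLinesRDD.count() + " concerning lines")
  badLinesRDD.take(10).foreach(println)  // print up to 10 offending lines on the driver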
Common actions (10+) supported by Spark:
• collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
• count(): Return the number of elements in the dataset.
• reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
• first(): Return the first element of the dataset (similar to take(1)).
• take(n): Return an array with the first n elements of the dataset.
• foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
RDD Operations – Action
• collect() action:
– returns the entire RDD's contents to our driver program;
– when an RDD is distributed over worker nodes, collect() gathers all the data onto one node so the complete result can be displayed.
• reduce(func) action:
– takes a function func that operates on two elements of the type in your RDD;
– returns a new element of the same type.
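A minimal sketch of both actions:

  val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
  rdd.collect()                          // Array(1, 2, 3, 4, 5), gathered on the driver
  val sum = rdd.reduce((a, b) => a + b)  // 15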
RDD Operations – Action
• Given an RDD rdd = {1, 2, 3, 4, 5} and reduce(func) with f(a, b) { return a + b }:
– the elements are combined pairwise: f(f(f(f(1, 2), 3), 4), 5) = ((((1 + 2) + 3) + 4) + 5) = 15.
RDD Operations – Action
• take(n):
– returns n elements from the RDD and attempts to minimize the number of partitions accessed;
– it may therefore represent a biased collection (it may not return the elements in the expected order);
– useful for unit tests and quick debugging.
• top(): extracts the top elements from an RDD (requires a defined ordering).
• takeSample(withReplacement, num, seed): allows us to take a sample of our data.
• foreach(func): performs computations on each element in the RDD.
• count(): returns a count of the elements.
• countByValue(): returns a map of each unique value to its count.
RDD Operations – Action
Examples, given an RDD rdd containing {1, 2, 3, 3}:
  rdd.reduce((x, y) => x + y)  ->  9
  rdd.collect()                ->  {1, 2, 3, 3}
  rdd.count()                  ->  4
  rdd.take(2)                  ->  {1, 2}
  rdd.top(2)                   ->  {3, 3}
  rdd.countByValue()           ->  {(1,1), (2,1), (3,2)}
  rdd.foreach(println)         ->  1, 2, 3, 3
Working with Key/Value Pairs
• RDDs of key/value pairs are a common data type required for many operations in Spark.
– e.g. ("Spark", 2), ("is", 3), ("is", (1,1,1)), etc.
• Key/value RDDs are commonly used to perform aggregations:
– counting up reviews for each product;
– grouping together data with the same key;
– grouping together two different RDDs.
• Spark provides special operations on RDDs containing key/value pairs:
– the reduceByKey() method can aggregate data separately for each key;
– the join() method can merge two RDDs together by grouping elements with the same key.
• ._1 and ._2 stand for the key and the value, respectively.
– e.g. ._1 stands for "Spark" or "is", while ._2 stands for 2 or 3.
Creating Pair RDDs
Use the map() method to generate key/value pairs: words.map(x => (x, 1))
• x is each element in the RDD words.
• => (the right arrow, or fat arrow) separates the parameters of a function from the function body.
  E.g. (x, y) => (x + y) is equivalent to func(x, y) { return x + y }
  x => (x, 1) is thus equivalent to func(x) { return (x, 1) }, a key/value pair.
  In this way, you can conveniently write anonymous functions.
Applied to the RDD words from the flatMap example:

  val wordspair = words.map(x => (x, 1))
  // RDD (wordspair): ("MapReduce",1), ("is",1), ("good",1),
  //                  ("Spark",1), ("is",1), ("fast",1),
  //                  ("Spark",1), ("is",1), ("better",1), ("than",1), ("MapReduce",1)
Transformations on Pair RDDs
• reduceByKey(func, [numTasks]) transformation:
– is quite similar to reduce(): both take a function and use it to combine values;
– runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key;
– unlike reduce(), reduceByKey() is not implemented as an action that returns a value to the user program;
– instead, it returns a new RDD consisting of each key and the reduced value for that key.
• Examples:
– given the grades of all the courses in ITEE, calculate the average GPA for each student;
– given a text, count each word's occurrences.
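A complete hedged word-count sketch combining the steps above (the file name is illustrative):

  val counts = sc.textFile("data.txt")
                 .flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
  counts.collect().foreach(println)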
RDD Operations – Transformation
• reduceByKey(func): called on a dataset of (K, V) pairs,
• returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func,
• func must be of type (V, V) => V, e.g. (a, b) => a + b.
Continuing the word-count example:

  val wordspair = words.map(x => (x, 1))
  val reducewords = wordspair.reduceByKey((a, b) => a + b)
  // RDD (reducewords): ("MapReduce",2), ("Spark",2), ("is",3),
  //                    ("good",1), ("fast",1), ("better",1), ("than",1)
RDD Operations – Transformation
• reduceByKey(func):
– wordspair.reduceByKey((a, b) => a + b) on {("is", 1), ("is", 1), ("is", 1), ("is", 1), ("is", 1)}
– with f(a, b) { return a + b }, the values (1, 1, 1, 1, 1) are combined step by step: 1+1 -> 2, 2+1 -> 3, 3+1 -> 4, 4+1 -> 5, giving ("is", 5).
Transformations on Pair RDDs
• reduceByKey(_+_): elements must share the same key for their values to be aggregated with the func f(a, b) { return a + b }:
– {("s123456", 78), ("s123456", 80), ("s123456", 65), ("s123456", 90),
– ("s654321", 80), ("s654321", 40), ("s654321", 50), ("s654321", 90), ("s654321", 80)}
– for the key "s123456", the values 78, 80, 65, 90 will be aggregated with func;
– for the key "s654321", the values 80, 40, 50, 90, 80 will be aggregated with func.
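A hedged sketch of the average-GPA-per-student idea from the earlier example (one common pattern, not the only one): an average needs a count as well as a sum, so each grade is first paired with 1:

  val grades = sc.parallelize(Seq(
    ("s123456", 78), ("s123456", 80), ("s123456", 65), ("s123456", 90),
    ("s654321", 80), ("s654321", 40), ("s654321", 50), ("s654321", 90), ("s654321", 80)))
  val sums = grades.mapValues(v => (v, 1))
                   .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // (sum, count) per key
  val avgs = sums.mapValues { case (total, n) => total.toDouble / n }
  // avgs: ("s123456", 78.25), ("s654321", 68.0)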
Transformations on Pair RDDs
reduceByKey(func) vs reduce(func):
• reduce(func) and reduceByKey(func) are both aggregating operations;
• reduceByKey(func) is a transformation on pair RDDs, while reduce(func) is an action on regular RDDs;
• reduce(func) can be performed on non-pair RDDs;
• reduceByKey(func) needs to match keys first;
• the aggregation function func works in much the same way in both.