Introduction to Apache Spark
ECA5372 Big Data and Technologies
Apache Spark
Background information
• Apache Spark is a powerful open-source distributed querying and processing engine.
• The first version of Spark was released in 2012.
• The Spark codebase was later donated to the Apache Software Foundation.
• Spark was originally developed by Matei Zaharia as part of his PhD thesis at UC Berkeley.
• In 2013, Zaharia co-founded Databricks.
• He is currently an Assistant Professor at Stanford University.
Apache Spark
What is it?
• Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer).
• Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.
• As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster.
• Parallel computation can make certain types of programming tasks much faster.
Apache Spark
What is it?
• Apache Spark provides the flexibility and extensibility of MapReduce, but at significantly higher speeds: up to 100 times faster than Apache Hadoop MapReduce when data is stored in memory, and up to 10 times faster when accessing disk.
• Apache Spark allows the user to read, transform, and aggregate data, as well as train and deploy sophisticated statistical models with ease.
• The Spark APIs are accessible in Java, Scala, Python, R and SQL.
• Apache Spark can be used to build applications or package them up as libraries to be deployed on a cluster or perform quick analytics interactively through notebooks (e.g. Jupyter notebooks).
Apache Spark
Components
• Spark SQL: lets you treat data as a SQL database and issue SQL queries against it (see the sketch below).
• Useful if you are already familiar with SQL.
• Spark Streaming: a library that lets you process data in real time. Data can flow into a server continuously, and Spark Streaming helps you process that data as it arrives.
SQL: Structured Query Language
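A minimal sketch of the Spark SQL component in PySpark, assuming a local Spark installation; the file name people.json and its columns are hypothetical.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the entry point for Spark SQL.
spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Hypothetical input file: one JSON object per line, e.g. {"name": "Ann", "age": 34}
people = spark.read.json("people.json")

# Register the DataFrame as a temporary view so plain SQL can be issued against it.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()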
Apache Spark
Components
• MLlib: a machine learning library that lets you run common machine learning algorithms, with Spark under the hood distributing the processing across a cluster (see the sketch after this list).
• You can perform machine learning on much larger datasets than you could otherwise.
• GraphX: this is not for making pretty charts and pictorial graphs; it refers to graphs in the network-theory sense. Think about a social network; that is an example of a graph.
• GraphX just has a few functions that let you analyse the properties of a graph of information.
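A hedged sketch of MLlib via the DataFrame-based pyspark.ml API, assuming a local Spark session; the tiny in-line training data is invented purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Invented toy data: (label, feature vector) rows.
train = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5])),
], ["label", "features"])

# Fit a logistic regression model; Spark distributes the work across the cluster.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients)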
Apache Spark
Different modes
• Apache Spark can run locally on a laptop, yet it can also be deployed in standalone mode or over YARN – either on a local cluster or in the cloud.
• It can read from and write to diverse data sources, including (but not limited to) the following (see the sketch after this list):
• HDFS,
• Apache Cassandra,
• Apache HBase,
• Amazon Simple Storage Service (Amazon S3)
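A hedged sketch of reading from and writing to different storage systems; the HDFS and S3 paths below are hypothetical, and S3 access additionally assumes the hadoop-aws connector and credentials are configured.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasource-sketch").getOrCreate()

# Hypothetical HDFS path: read a CSV file with a header row, inferring column types.
df = spark.read.csv("hdfs:///data/transactions.csv", header=True, inferSchema=True)

# Hypothetical S3 bucket: write the same data back out as Parquet.
df.write.parquet("s3a://my-bucket/transactions_parquet")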
Cluster Mode Overview
http://spark.apache.org/docs/latest/cluster-overview.html
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
SparkContext
Entry point for Spark
• Main entry point for Spark functionality.
• The most important step of any Spark driver application is to create a SparkContext. It allows your Spark application to access the Spark cluster with the help of the cluster manager.
• A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
• Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one (see the sketch below).
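A minimal sketch of creating and stopping a SparkContext in PySpark; the application name and the local[*] master are illustrative choices, not requirements.

from pyspark import SparkConf, SparkContext

# Describe the application and where it should run ("local[*]" = all local cores;
# on a cluster this would instead point at the cluster manager).
conf = SparkConf().setAppName("sparkcontext-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10))
print(rdd.count())

# Only one SparkContext may be active per JVM, so stop it when finished.
sc.stop()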
Apache Spark versus Hadoop MapReduce
Comparison
Apache Spark                                                        | Hadoop MapReduce
Parallel processing framework                                       | Parallel processing framework
Real-time or batch processing for big data analysis                 | Batch processing for big data analysis
Written in the Scala language                                       | Written in Java
Stores data in-memory (RAM)                                         | Reads and writes data from disk
Up to 100x faster in-memory, or 10x faster on disk, than MapReduce  | Fast, but slower than Spark
Cost is higher due to RAM (memory) storage requirements             | Cost is lower due to disk storage
Spark APIs (Application Programming Interfaces): History
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Spark APIs (Application Programming Interfaces)
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Resilient Distributed Datasets
Spark RDD
• Resilient because RDDs are immutable (they cannot be modified once created) and fault tolerant. Spark makes sure that if you are running a job on a cluster and one of the nodes in the cluster goes down, it can automatically recover from that and retry.
• Distributed because the data is distributed across the cluster. The whole point of using Spark is that you can use it for big data problems, distributing the processing across the combined CPU and memory of a cluster of computers. The work scales horizontally, so you can throw as many computers as you want at a given problem.
• Datasets because it holds data.
RDDs are automatically distributed across the network by means of Partitions.
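A hedged sketch of creating RDDs, assuming a SparkContext named sc is already available (as in the earlier sketch); the file path is hypothetical.

# From an in-memory Python collection: Spark splits it into partitions
# and spreads them across the cluster.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# From a text file (hypothetical path): one RDD element per line.
lines = sc.textFile("hdfs:///data/server.log")

print(numbers.count(), lines.count())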
Partitions
Spark RDD
• RDDs are divided into smaller chunks called Partitions, and when you execute some action, a task is launched per partition.
• The more partitions there are, the more parallelism you get.
• Spark automatically decides the number of partitions that an RDD has to be divided into, but you can also specify the number of partitions when creating an RDD.
• These partitions of an RDD are distributed across all the nodes in the network.
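A small sketch of inspecting and controlling partitions, again assuming an existing SparkContext sc; the partition count of 4 is arbitrary.

# Let Spark choose the number of partitions.
rdd_auto = sc.parallelize(range(1000))
print(rdd_auto.getNumPartitions())

# Explicitly request 4 partitions when creating the RDD.
rdd_four = sc.parallelize(range(1000), 4)
print(rdd_four.getNumPartitions())   # 4; one task per partition is launched per action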
Transformations and Actions
Spark RDD
• There are 2 types of operations that you can perform on an RDD:
• Transformations
• Actions
• A transformation is a function that produces a new RDD from an existing RDD. It takes an RDD as input and produces one or more RDDs as output. Every time we apply a transformation, a new RDD is created.
• The input RDDs cannot be changed, since RDDs are immutable by nature.
• The new RDD keeps a pointer to its parent RDD.
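A brief sketch showing that a transformation returns a new RDD while the parent is left unchanged, assuming an existing SparkContext sc.

parent = sc.parallelize([1, 2, 3, 4])

# map() is a transformation: it returns a new RDD and leaves the parent untouched.
child = parent.map(lambda x: x * 10)

print(parent.collect())  # [1, 2, 3, 4]
print(child.collect())   # [10, 20, 30, 40]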
Transformations and Actions
Spark RDD
• When you call a transformation, Spark does not execute it immediately; instead, it creates a lineage.
• A lineage keeps track of all the transformations that have to be applied on that RDD, including where it has to read the data from.
• For example, sc.textFile() and rdd.filter() do not get executed immediately; they only get executed once you call an action on the RDD – in this case, filtered.count().
• An action is used to either save the result to some location or to display it.
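A hedged sketch of lazy evaluation matching the bullets above; the log file path and the "ERROR" filter are hypothetical, and sc is an existing SparkContext.

# Neither of these lines triggers any work yet; Spark only records the lineage.
lines = sc.textFile("hdfs:///data/server.log")       # hypothetical path
filtered = lines.filter(lambda line: "ERROR" in line)

# count() is an action: only now does Spark read the file and apply the filter.
print(filtered.count())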
Lazy Evaluation
Directed Acyclic Graph (DAG) and Lineage
• An RDD lineage, also known as an RDD operator graph or RDD dependency graph, is a logical execution plan, i.e., a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.
• Transformations are lazy in nature, i.e., they only get executed when we call an action; they are not executed immediately.
• After a transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. union()) or the same size (e.g. map()).
• The plan only gets executed once you call an action on the RDD.
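A small sketch of inspecting an RDD's lineage (its DAG) via toDebugString(), assuming an existing SparkContext sc; in PySpark the method typically returns bytes, hence the decode step.

rdd = sc.parallelize(range(100))
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Print the lineage that Spark has recorded so far; nothing has executed yet.
lineage = evens.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)

# An action finally triggers execution of the whole plan.
print(evens.count())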
Transformations
Transformations and Actions
Transformations are ways of taking an RDD and transforming every row in that RDD into a new value, based on a function that you provide (see the sketch after this list). Some of those functions include:
• map()
• It applies a function to each element of the RDD and returns the result as a new RDD. In the map operation the developer can define custom logic, and the same logic is applied to all the elements of the RDD.
• filter()
• Can be used when all you want to do is apply a Boolean function to each row that says “should this row be preserved or not? Yes or no.”
• distinct()
• A transformation that returns only the distinct values within your RDD.
• sample()
• This function lets you take a random sample from your RDD.
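A hedged sketch of the four transformations above, assuming an existing SparkContext sc; the input numbers are invented.

rdd = sc.parallelize([1, 2, 2, 3, 4, 4, 5])

squared = rdd.map(lambda x: x * x)          # transform every element
odds = rdd.filter(lambda x: x % 2 == 1)     # keep rows where the function returns True
unique = rdd.distinct()                     # drop duplicate values
subset = rdd.sample(False, 0.5, seed=42)    # random sample, without replacement

# Transformations are lazy; collect() is the action that triggers them here.
print(squared.collect(), odds.collect(), unique.collect(), subset.collect())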
Actions
Transformations and Actions
You can also perform actions on an RDD when you want to actually get a result (see the sketch after this list). Some of those functions include:
• reduce()
• Lets you define a way of combining together all the values in the RDD into a single result. It is very much similar in spirit to MapReduce.
• collect()
• You can call collect() on an RDD, which will give you back a plain old Python object that you can then iterate through to print the results, save them to a file, or whatever else you want to do.
• count()
• Counts how many entries are in the RDD at this point.
• countByValue()
• This function gives you a breakdown of how many times each unique value occurs within the RDD.
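A hedged sketch of the four actions above, assuming an existing SparkContext sc; the input numbers are invented.

rdd = sc.parallelize([1, 2, 2, 3, 3, 3])

total = rdd.reduce(lambda a, b: a + b)   # combine all values into one result -> 14
as_list = rdd.collect()                  # bring the data back as a plain Python list
n = rdd.count()                          # number of entries -> 6
histogram = rdd.countByValue()           # {1: 1, 2: 2, 3: 3}

print(total, as_list, n, dict(histogram))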
Spark APIs (Application Programming Interfaces)
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
DataFrames
Understanding Spark
• A DataFrame is an immutable, distributed collection of data that is organized into rows, where each row consists of a set of columns, and each column has a name and an associated type.
• In other words, this distributed collection of data has a structure defined by a schema.
• In simple terms, it is the same as a table in a relational database, or an Excel worksheet with column headers.
DataFrames
Understanding Spark
A DataFrame in Apache Spark can be created in multiple ways (see the sketch after this list):
• It can be created using different data formats.
• For example, loading the data from JSON or CSV files.
• Loading data from an existing RDD.
• Programmatically specifying a schema.
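A hedged sketch of the three creation routes above, assuming an existing SparkSession spark and SparkContext sc; the file names and column names are hypothetical.

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. From a data format such as JSON or CSV (hypothetical files).
df_json = spark.read.json("people.json")
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)

# 2. From an existing RDD of Row objects.
rdd = sc.parallelize([Row(name="Ann", age=34), Row(name="Bob", age=29)])
df_rdd = spark.createDataFrame(rdd)

# 3. By programmatically specifying a schema for an RDD of plain tuples.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_schema = spark.createDataFrame(sc.parallelize([("Ann", 34), ("Bob", 29)]), schema)

df_schema.printSchema()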
Spark APIs (Application Programming Interfaces)
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Datasets
Understanding Spark
• A Dataset is a distributed collection of data.
• Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
• A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
• The Dataset API is available in Scala and Java. Python does not have support for the Dataset API.
https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes