COMP5349 – Cloud Computing
Week 9: Spark Machine Learning
Dr. Ying Zhou, School of Computer Science
Last Week

- Data sharing in Spark
  - Closure
  - Broadcast variables
  - Accumulators
- RDD persistence
- Spark SQL and the DataFrame API
  - DataFrames are easier to manipulate
    - Access to individual columns
  - More efficient, with a built-in optimization mechanism
  - DataFrames are also immutable
  - DataFrames and RDDs can be converted to each other
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 09-2
Outline
- Spark Machine Learning
  - Machine learning libraries
  - Basic data types
  - Standardized API
- Assignment Brief Intro
Spark Machine Learning Libraries
- Spark has two versions of its machine learning library:
  - RDD based
    - Package/module name: mllib
  - DataFrame based
    - Package/module name: ml
    - The primary API, with new features introduced in each new release
- Most machine learning algorithms expect data in vector or matrix form
  - Both libraries provide data types representing those; make sure you always use the ones in the ml package
Basic Data Types
- The linear algebra sub-package/module linalg defines the basic data structures used in most machine learning algorithms
- Vector is the general class, and it is a local data type
  - We can have RDDs with Vector as part of their elements
    - RDD: (Integer, (String, Vector))
  - We can have DataFrames with Vector as a column
    - DataFrame: (Integer, String, Vector, Vector)
Vectors
- Concrete vectors can be expressed in two forms:
  - A dense vector is like an array; it may be implemented using different structures in different languages
  - A sparse vector is an efficient way to store vectors with many zeros
    - It remembers the size of the vector and all non-zero (index, value) pairs
  - e.g. the vector (0.5, 0.0, 0.3)
    - Dense format: [0.5, 0.0, 0.3]
    - Sparse format: (3, {0: 0.5, 2: 0.3}) or (3, [0, 2], [0.5, 0.3])
- Spark provides its own type hierarchy for Vector, to ensure a consistent API across languages and to provide useful methods such as:
  - dot product of two vectors
  - squared distance between two vectors
DataFrame of Vectors
- Many algorithms expect the input DataFrame to contain a particular column of type Vector; this column is usually called 'features'
Creating Vectors from stored data
- Most data sets are stored as CSV or text files, where the values are separated by some delimiter
- The default read mechanism reads these into a DataFrame with many columns, each representing a single value
- Spark provides a utility called VectorAssembler that can merge specified columns to create a new column of Vector type
- Example: one row of the MNIST data set, which contains 28×28 grayscale images of handwritten digits; each image is flattened into a 784-dimensional vector
VectorAssembler usage
- The assembled output column is displayed in sparse vector format
Machine Learning Algorithms
- Many common algorithms are implemented in Spark ML and are grouped into a few packages/modules:
  - Clustering
  - Classification
  - Feature
  - Etc.
- The DataFrame-based library uses a standardized API for all algorithms
- Many analytic problems need more than one algorithm; these can be combined into a single pipeline
- Most of the concepts are inspired by or borrowed from scikit-learn
Pipeline Component: Transformers
- Transformer
  - An abstract term that includes feature transformers and learned models
  - A transformer implements a method transform(), which converts one DataFrame (input) into another (output), generally by appending one or more columns (to hold a prediction or other result)
  - VectorAssembler is a transformer; it does the relatively simple task of assembling specified input columns into a single output column of Vector type
    - e.g. after the transform of the MNIST frame, we get a DataFrame with 785 columns; we usually only want to retain the newly created column
Pipeline Component: Estimators and Pipeline
- Estimators
  - An Estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data
  - An Estimator implements a method fit(), which accepts a DataFrame (the training set) and produces a Model, which is a Transformer
  - The actual model's representation varies greatly between algorithms
- Pipeline
  - A Pipeline represents a workflow consisting of a sequence of stages (transformers and estimators)
  - A Pipeline usually takes a DataFrame as input and produces a model as output
  - Once built and trained, a pipeline can be used to transform data
Example Pipeline
[Figure: pipeline fitting and use. Fitting: transformer → transformer → estimator, which produces a transformer (the fitted pipeline model). Prediction: transformer → transformer → transformer, which produces the result.]
Pipeline Code Example
Training input:

| id | text               | label |
|----|--------------------|-------|
| 0  | "a b c d e spark"  | 1.0   |
| 1  | "b d"              | 0.0   |
| 2  | "spark f g h"      | 1.0   |
| 3  | "Hadoop mapreduce" | 0.0   |
Pipeline Code Example

Each stage appends new columns (words, features) rather than modifying its input; a DataFrame is also immutable, so every stage produces a new DataFrame:

| id | text               | label | words | features |
|----|--------------------|-------|-------|----------|
| 0  | "a b c d e spark"  | 1.0   |       |          |
| 1  | "b d"              | 0.0   |       |          |
| 2  | "spark f g h"      | 1.0   |       |          |
| 3  | "Hadoop mapreduce" | 0.0   |       |          |
Pipeline Code Example

After the tokenizer and hashing stages run, the new columns are filled in (a few intermediate DataFrames are generated along the way):

| id | text               | label | words                   | features  |
|----|--------------------|-------|-------------------------|-----------|
| 0  | "a b c d e spark"  | 1.0   | ["a", "b", …]           | [1, 1, …] |
| 1  | "b d"              | 0.0   | ["b", "d", …]           | [0, 1, …] |
| 2  | "spark f g h"      | 1.0   | ["spark", "f", …]       | [0, 0, …] |
| 3  | "Hadoop mapreduce" | 0.0   | ["Hadoop", "mapreduce"] | [0, 0, …] |

Fitting the pipeline outputs a model, which includes all transformers in the pipeline as well as the learned logistic function.
Pipeline Code Example

Applying the fitted model to new, unlabelled data appends probability and prediction columns:

| id | text                 | words         | features  | probability  | prediction |
|----|----------------------|---------------|-----------|--------------|------------|
| 4  | "spark i j k"        | [spark, i, …] | [0, 0, …] | [0.16, 0.84] | 1          |
| 5  | "l m n"              | [l, m, n]     | [0, 0, …] | [0.84, 0.16] | 0          |
| 6  | "spark Hadoop spark" | [spark, …]    | [0, 0, …] | [0.07, 0.93] | 1          |
| 7  | "apache hadoop"      | [apache, …]   | [0, 0, …] | [0.98, 0.02] | 0          |
Parameters
- Different algorithms take different parameters
- Parameters can be specified as literals when creating algorithm instances
- They can also be put together in a hashmap-like structure, ParamMap, and passed to algorithm instances
  - e.g. ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)
Outline
- Spark Machine Learning
  - Machine learning libraries
  - Basic data types
  - Standardized API
- Assignment Brief Intro
  - Various ways of using Spark in machine learning workloads
  - Assignment intro
  - Next week's lab
Various options of using Spark in ML
- Both MapReduce and Spark are good candidates for exploratory data analytics workloads
- Spark is also designed for predictive workloads, either with basic statistical models or with machine learning models
- There are various options for using Spark in ML:
  - Use Spark RDD and SQL to explore or prepare the input data
  - Use the Spark ML API to perform predictive analysis
  - Use Spark together with external ML packages
  - Write customized Spark ML algorithms
- The assignment covers the first three options
- The week 10 lab demonstrates the second and third options, based on this week's lecture
Assignment Brief Intro
- Data set:
  - Text corpus: Multi-Genre Natural Language Inference (MultiNLI)
  - Each data point is a pair of sentences: a premise and a hypothesis
  - Sentences are selected from 10 domains (genres)
- Tasks:
  - Explore vocabulary coverage among genres
    - Mostly Spark RDD and/or Spark SQL
  - Explore and compare sentence vector representation methods
    - Spark ML together with Spark RDD and/or Spark SQL
Representation Generation
- Embedding, or representation learning, is a technique for converting some form of input into a fixed-size vector
  - The output vectors can be used as feature vectors in traditional machine learning algorithms or in deep learning algorithms
  - The input can be of various formats
- Many ML algorithms require measuring the distance/similarity between data points; a vector representation provides many options for distance computation
- Deep learning models have proved effective in representation generation
[Figure: representation generation examples - text (e.g. "…School of computer science is formerly called school of information technologies…") mapped to vectors by TF-IDF or similar, and by Word2Vec or a sentence encoder; images (224×224×3, 220×220×3) mapped to vectors (e.g. 4096-d, 128-d) by any image classification model and by FaceNet (Google, 2015); graph nodes mapped to vectors (e.g. 100-d) by Node2Vec or similar.]
Spark Provided Representation Generation
- Spark ML provides some representation generation mechanisms for text data under the category "Feature Extractors":
  - TF-IDF
  - Word2Vec
  - CountVectorizer
  - FeatureHasher
Spark Feature Extractor: TF-IDF
- Term Frequency - Inverse Document Frequency
  - Each document is represented by a fixed d-dimensional vector
  - Each dimension represents a term (word) in the vocabulary
  - The value is computed from the frequency of that term in the document and the inverse document frequency of that term in the corpus
- Term frequency and term-index mapping:
  - CountVectorizer
    - Maintains a term-index mapping: {term1: 0, term2: 1, …}
    - The value d can be the vocabulary size or a smaller given number
    - When d is less than the vocabulary size, the top d most frequent terms are used
  - HashingTF
    - Uses a hash function to map each term to an index
    - No need to maintain a large dictionary
    - d should be larger than the vocabulary size to reduce hash collisions
Spark and Third Party Representation Generation
- Spark can be used together with a trained model to generate features
  - This is an embarrassingly parallel workload
  - We can run the generation tasks in parallel on different partitions of the data set, on multiple nodes
  - The generation model can be treated as a black box; we only need to know how to prepare its input and collect its output
- Spark has many map-like operators (e.g. map, filter, flatMap, mapToPair); it also has a mapPartitions operator
  - map, filter, flatMap, etc. operate on a single element of a partition and produce an output for each element
  - mapPartitions operates on the whole partition and generates one output (an iterator) for the whole partition
- In representation generation we need to apply the model to each input element, but the model itself is usually large; it is more efficient to load it once per partition and apply it to many inputs
Week 10 Lab
- The lab exercises use two typical data sources in machine learning: images and text
- Basic usage of the Spark machine learning API is demonstrated with the MNIST image data set
- An example shows how to use Spark with deep learning models to generate sentence encoding vectors
Spark and Representation Generation
References
- Spark machine learning library documentation: https://spark.apache.org/docs/latest/ml-pipeline.html