Spark MLlib
Spark’s scalable machine learning library
linear SVM and logistic regression
classification and regression trees
recommendation via alternating least squares
clustering via k-means, Gaussian mixtures, and power iteration clustering
…
High-quality algorithms, up to 100x faster than MapReduce
Most machine learning algorithms are iterative, so they benefit from Spark keeping data in memory between iterations
Note: As of Spark 2.0, the primary machine learning API for Spark is the DataFrame-based API (spark.ml). The RDD-based API (spark.mllib) has entered maintenance mode and was, at that time, expected to be removed in Spark 3.0.
Unsupervised learning
Examples
Clustering
Probability distribution estimation
Association rule mining
Dimension reduction
Supervised learning
Examples
Prediction
Classification, regression
ML Pipelines
Inspired by scikit-learn
DataFrame
Pipeline components:
Transformer
Estimator
Parameters
Transformers
Converts one DataFrame into another
Must implement a method transform()
Examples:
Model:
DataFrame[id: int, feature_vector: Vector] =>
DataFrame[id: int, label: string]
Feature transformer:
DataFrame[id: int, text: string] =>
DataFrame[id: int, feature_vector: Vector]
Estimators
Input: DataFrame
Output: Model
Must implement a method fit()
Example:
LogisticRegression is an Estimator.
Calling fit() trains a LogisticRegressionModel, which is a Model (hence also a Transformer).
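The transform() and fit() contracts above can be sketched in plain Python, without Spark. This is only an analogy: the classes WordCounter, ThresholdClassifier and ThresholdModel are hypothetical stand-ins (not Spark classes), and a "DataFrame" is modelled as a list of dicts.

```python
# Plain-Python sketch of the Transformer / Estimator contract.
# All class names are hypothetical; a "DataFrame" is a list of dicts.


class WordCounter:
    """Feature transformer: adds a numeric 'feature' column derived from 'text'."""

    def transform(self, rows):
        return [{**r, "feature": float(len(r["text"].split()))} for r in rows]


class ThresholdModel:
    """Model (hence also a transformer): adds a 'prediction' column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            {**r, "prediction": 1 if r["feature"] > self.threshold else 0}
            for r in rows
        ]


class ThresholdClassifier:
    """Estimator: fit() takes labelled rows and returns a ThresholdModel."""

    def fit(self, rows):
        pos = [r["feature"] for r in rows if r["label"] == 1]
        neg = [r["feature"] for r in rows if r["label"] == 0]
        # Midpoint between the two class means (a deliberately simple rule).
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return ThresholdModel(threshold)


train = [
    {"id": 0, "text": "spark", "label": 0},
    {"id": 1, "text": "spark ml", "label": 0},
    {"id": 2, "text": "spark ml pipelines are neat", "label": 1},
    {"id": 3, "text": "estimators produce models that transform data", "label": 1},
]

features = WordCounter().transform(train)    # DataFrame => DataFrame
model = ThresholdClassifier().fit(features)  # DataFrame => Model
predictions = model.transform(features)      # the Model is itself a transformer
```

Note that the estimator never produces predictions itself; it only produces the model, and the model does the transforming.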
Parameters
Both transformers and estimators can have parameters
Set parameters:
lr = LogisticRegression()
lr.setMaxIter(10)
Pass a ParamMap to fit() or transform().
A ParamMap is a set of (parameter, value) pairs.
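The two ways of setting parameters can be illustrated with a plain-Python sketch. ToyEstimator is a hypothetical class, not part of Spark, and its string-keyed dict only stands in for a ParamMap. In PySpark the param map is a dict keyed by Param objects, for example {lr.maxIter: 30}, and values passed this way override those set earlier through setters.

```python
# Plain-Python sketch of setter-based vs. ParamMap-based parameters.
# ToyEstimator and its "model" dict are hypothetical illustrations.


class ToyEstimator:
    """Estimator with a single parameter, maxIter."""

    def __init__(self):
        self.params = {"maxIter": 100}  # default chosen for this sketch

    def setMaxIter(self, value):
        self.params["maxIter"] = value
        return self  # returning self allows call chaining

    def fit(self, rows, param_map=None):
        # Values in param_map take precedence, for this call only.
        effective = {**self.params, **(param_map or {})}
        return {"iterations_run": effective["maxIter"]}


est = ToyEstimator().setMaxIter(10)

m1 = est.fit([])                   # uses the value set via the setter
m2 = est.fit([], {"maxIter": 30})  # the param map overrides it for this call
m3 = est.fit([])                   # the stored value is unchanged afterwards
```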
Logistic regression
Training data: feature vectors with binary labels
The trained model is a nonlinear function f(x) that maps testing data to [0, 1]
Returns 1 if f(x) > 0.5
Returns 0 if f(x) < 0.5
See example:
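As a minimal illustration of the decision rule above (not the example the slides refer to), the sketch below assumes the usual logistic form f(x) = 1 / (1 + exp(-(w·x + b))). The weights and bias are made-up values rather than the result of training, and treating f(x) = 0.5 as class 0 is an assumption, since the slides do not say how ties are handled.

```python
import math


def logistic(z):
    """Numerically stable logistic (sigmoid) function; output lies in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def f(x, weights, bias):
    """Model output: logistic function applied to a linear score w.x + b."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return logistic(z)


def predict(x, weights, bias, threshold=0.5):
    """Return 1 if f(x) exceeds the threshold, else 0 (ties go to 0 here)."""
    return 1 if f(x, weights, bias) > threshold else 0


# Made-up parameters for illustration; a real model would learn these.
weights = [2.0, -1.0]
bias = -0.5

p_high = f([1.5, 0.5], weights, bias)  # linear score is  2.0
p_low = f([0.0, 1.0], weights, bias)   # linear score is -1.5

label_high = predict([1.5, 0.5], weights, bias)
label_low = predict([0.0, 1.0], weights, bias)
```

The function is nonlinear in x, but the decision boundary f(x) = 0.5 corresponds to w·x + b = 0, which is linear.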
Training Pipeline (Estimator)
DataFrame => transformer => transformer => transformer => estimator
Trained PipelineModel (transformer)
Replace the estimator in the training Pipeline with the trained model, which is a transformer
See example
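The step from a training Pipeline to a trained PipelineModel can be sketched in plain Python. This is an analogy under stated assumptions, not Spark's implementation: the Pipeline and PipelineModel classes below are simplified re-creations of the idea, and Lowercase, Tokenize, KeywordClassifier and KeywordModel are hypothetical stages operating on lists of dicts.

```python
# Plain-Python sketch: fitting a pipeline replaces each estimator with its model.
# All classes here are hypothetical illustrations, not Spark classes.


class Lowercase:
    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class Tokenize:
    def transform(self, rows):
        return [{**r, "words": r["text"].split()} for r in rows]


class KeywordModel:
    """Model: predicts 1 if a row contains any learned keyword."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**r, "prediction": 1 if self.keywords & set(r["words"]) else 0}
            for r in rows
        ]


class KeywordClassifier:
    """Estimator: learns the words that occur only in positive rows."""

    def fit(self, rows):
        pos, neg = set(), set()
        for r in rows:
            (pos if r["label"] == 1 else neg).update(r["words"])
        return KeywordModel(pos - neg)


class PipelineModel:
    """Trained pipeline: every stage is a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Training pipeline: itself an estimator, since it exposes fit()."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator => trained model
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"id": 0, "text": "Spark is fast", "label": 1},
    {"id": 1, "text": "Hadoop MapReduce is slow", "label": 0},
]
test = [{"id": 2, "text": "SPARK rocks"}, {"id": 3, "text": "mapreduce jobs"}]

pipeline_model = Pipeline([Lowercase(), Tokenize(), KeywordClassifier()]).fit(train)
predictions = pipeline_model.transform(test)
```

Because the fitted pipeline reuses the same feature transformers, test data goes through exactly the same preprocessing as the training data.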
Cross Validation
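Train-test split: the data is divided once into a training set and a testing set
5-Fold Cross Validation: the data is divided into 5 folds; each fold is used once for testing while the other 4 are used for training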
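How the folds partition the data can be sketched with a small helper. The function k_fold_indices is a hypothetical name introduced here for illustration. It splits contiguous index ranges without shuffling, which keeps the output easy to read; practical implementations typically assign rows to folds at random.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k folds over the indices 0..n-1.

    Fold sizes differ by at most one when n is not divisible by k.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


# 10 rows, 5 folds: each round tests on 2 rows and trains on the other 8.
folds = list(k_fold_indices(n=10, k=5))

for round_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"round {round_number}: test={test_idx} train={train_idx}")
```

Every row is used for testing exactly once, so averaging the per-round metric uses all of the data for evaluation, at the cost of training 5 models instead of 1.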