Introduction to
Machine Learning
ECA5372 Big Data and Technologies 1

Machine Learning
What is it?
Machine Learning is the science of getting computers to learn patterns and trends, and improve their learning over time in an autonomous and iterative fashion, by providing them with data and information in the form of observations and real-world interactions.
• It is a method of analysing data that automates model building.
• It allows computers to find hidden insights without being explicitly
programmed step-by-step.

Uses of machine learning
Applications
• Fraud detection
• Product recommendations
• Natural Language Processing
• Predicting customer or employee churn
• Customer segmentation
• Image recognition and object detection
• New Pricing Models
• Financial Modelling

Machine Learning Process
The different stages

Types of Machine Learning
• Supervised Learning
• Supervised learning is the machine learning task of learning a function that maps an input to an output based on examples of input-output pairs.
• Unsupervised Learning
• Unsupervised learning is a type of machine learning
algorithm used to draw inferences from datasets consisting of input data without labelled responses.
• Reinforcement Learning
• Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Supervised Learning
Two types
• Classification
• Regression

Supervised Learning
What is the supervisor?
• Datasets suitable for supervised learning contain known labels: class labels for classification, or numerical target values for regression.
• Datasets without known labels can only be used for unsupervised learning.

Classification versus Regression
Supervised Learning
• Classification is used when the prediction outcome is categorical, e.g. cats or dogs.
• Regression is used when the prediction outcome is numerical, e.g. stock prices, temperature.

Supervised Learning Workflow
The different stages

Model Validation or Evaluation
Supervised Learning
Model Validation approaches:
• Train, Validation, Test split
• k-fold cross validation
• Train, Validation split

Model Validation or Evaluation
Supervised Learning
Train, Validation, Test Split
We need to randomly split the rows in the original dataset into subsets. Typically, the training set comprises about 70% of the original number of rows. The remaining rows are split randomly between the validation set and the test set.
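The random split described above can be sketched in plain Python. This is a minimal illustration, not a Spark implementation: the function name, the 70/15/15 proportions for the last two subsets and the seed are assumptions chosen for the example.

```python
import random


def train_validation_test_split(rows, train_frac=0.7, validation_frac=0.15, seed=42):
    """Randomly split rows into training, validation and test subsets."""
    shuffled = list(rows)
    # A fixed seed makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    # Whatever is left after the first two subsets becomes the test set.
    test = shuffled[n_train + n_validation:]
    return train, validation, test


rows = list(range(100))  # stand-in for the rows of a dataset
train, validation, test = train_validation_test_split(rows)
print(len(train), len(validation), len(test))
```

In Spark, `DataFrame.randomSplit` plays a similar role for DataFrames.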

Model Validation or Evaluation
Supervised Learning
Train, Validation, Test Split

k-fold Cross Validation
Supervised Learning
[Figure: 3-fold cross validation (k=3). Your dataset is divided into Fold 1, Fold 2 and Fold 3. Models 1, 2 and 3 are each trained on two folds (training data) and tested on the remaining fold (test data), with a different fold held out for each model.]

k-fold Cross Validation
Supervised Learning
[Figure: 5-fold cross validation (k=5). Your dataset is divided into Fold 1 to Fold 5. Models 1 to 5 are each trained on four folds (training data) and tested on the remaining fold (test data), with a different fold held out for each model.]
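The k-fold scheme in the diagrams above can be sketched as follows. This is a minimal plain-Python illustration; the function name `k_fold_indices`, the shuffling step and the seed are assumptions for the example, and no model is actually trained.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Return k (train_indices, test_indices) pairs, one per model."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled row indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]  # fold i is held out as test data
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(n_rows=10, k=5)
for model_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Model {model_number}: test rows {sorted(test_idx)}, "
          f"{len(train_idx)} training rows")
```

Every row is used as test data exactly once, and the k evaluation scores are usually averaged.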

Model Validation or Evaluation
Supervised Learning
Train, Validation Split
We randomly split the rows in the original dataset into a training set and a validation set. Typically, the training set comprises about 70% of the original number of rows. The remaining rows become the validation set.

Confusion Matrix
Model Validation or Evaluation for Supervised Learning
                   Prediction: Class 1    Prediction: Class 2
Actual: Class 1    True Positive (TP)     False Negative (FN)
Actual: Class 2    False Positive (FP)    True Negative (TN)

Metrics
Model Validation or Evaluation for Supervised Learning
                   Prediction: Class 1    Prediction: Class 2
Actual: Class 1    True Positive (TP)     False Negative (FN)
Actual: Class 2    False Positive (FP)    True Negative (TN)
With m being the sample size (that is, TP+TN+FP+FN), we have the following formulae:
• Accuracy = (TP + TN)/m
• Precision = TP/(TP+FP)
• Recall = TP/(TP+FN)
• F1 score = 2 * Precision * Recall/(Precision + Recall)
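The four formulae above translate directly into code. The sketch below is a minimal illustration; the function name and the example counts (40, 10, 20, 30) are made up for the example, and the zero-denominator cases are deliberately not handled.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    m = tp + tn + fp + fn  # sample size
    accuracy = (tp + tn) / m
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulae.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```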

Metrics
Model Validation or Evaluation for Supervised Learning
• Accuracy
• Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to total observations. It is tempting to conclude that a model with high accuracy is the best model, but accuracy is only a reliable measure when the dataset is balanced and the numbers of false positives and false negatives are similar. Otherwise, you have to look at other metrics to evaluate the performance of your model. For our model, we obtained 0.803, which means our model is approximately 80% accurate.
• Precision
• Precision attempts to answer the following question: What proportion of positive identifications was actually correct?
• Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all data points predicted as Class 1, how many actually belong to Class 1? High precision means few false positives among the predicted positives.
https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

Metrics
Model Validation or Evaluation for Supervised Learning
• Recall
• Recall attempts to answer the following question: What proportion of actual positives was identified correctly?
• Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
• F1 score
• F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
• Accuracy works best if false positives and false negatives have similar costs. If the costs of false positives and false negatives are very different, it is better to look at both Precision and Recall.
https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

Unsupervised Learning
For data without class labels

Common issues in Machine Learning
• Imbalanced datasets
• This happens when the number of data points for a particular class label is very much larger than the number of data points for other class labels.
• ‘Dirty’ data
• Almost any dataset needs some level of cleaning and pre-processing.
• Irrelevant data
• The data points (rows) in your dataset do not support the problem statement that you are trying to answer or solve.

Next class…
Spark MLlib
• Introduction to MLlib: Apache Spark’s machine learning library
• ML Pipelines
• Extracting, transforming and selecting features
• Spark MLlib
• Classification and regression
• Linear Regression
• Logistic Regression
• Random Forests
• Clustering

Apache Spark’s machine learning library http://spark.apache.org/docs/latest/ml-guide.html

MLlib
What is it?
MLlib is short for Machine Learning Library. It provides tools for
• Machine Learning Algorithms:
• common learning algorithms such as classification, regression, clustering, and collaborative filtering
• Featurization:
• feature extraction, transformation, dimensionality reduction, and selection
• Pipelines:
• tools for constructing, evaluating, and tuning ML Pipelines
• Persistence:
• saving and loading algorithms, models, and Pipelines
• Utilities:
• linear algebra, statistics, data handling, etc.

MLlib
What is it?
• Basic Statistics http://spark.apache.org/docs/latest/ml-statistics.html
• Correlation
• Hypothesis testing
• Summarizer
• Data sources http://spark.apache.org/docs/latest/ml-datasource
• Image data source
• ML Pipelines http://spark.apache.org/docs/latest/ml-pipeline.html
• Pipeline components
• Transformers
• Estimators

MLlib
Important Notice

Pipeline components
In machine learning, it is common to run a sequence of algorithms to process and learn from data. MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order.
A Pipeline chains multiple Transformers and Estimators together to specify a Machine Learning workflow.
• Transformers
• A Transformer is an algorithm which can transform one DataFrame into another DataFrame. For Transformer stages, the transform() method is called on the DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
• Estimators
• An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
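The fit()/transform() contract described above can be mimicked in plain Python. The sketch below is not Spark's API: lists of dicts stand in for DataFrames, and the classes `Squarer`, `MeanThreshold` and `MeanThresholdModel` are hypothetical stages invented for illustration. Only the names `Pipeline` and `PipelineModel` echo MLlib's terminology.

```python
class Squarer:
    """Transformer: adds a column holding the square of an input column."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**row, self.output_col: row[self.input_col] ** 2} for row in rows]


class MeanThreshold:
    """Estimator: fit() learns a column mean and returns a Transformer."""
    def __init__(self, input_col):
        self.input_col = input_col

    def fit(self, rows):
        mean = sum(row[self.input_col] for row in rows) / len(rows)
        return MeanThresholdModel(self.input_col, mean)


class MeanThresholdModel:
    """Transformer produced by MeanThreshold.fit(): adds a prediction column."""
    def __init__(self, input_col, threshold):
        self.input_col, self.threshold = input_col, threshold

    def transform(self, rows):
        return [{**row, "prediction": int(row[self.input_col] > self.threshold)}
                for row in rows]


class Pipeline:
    """Runs its stages in order; Estimators are fitted, Transformers reused."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted_stages = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # output feeds the next stage
            fitted_stages.append(model)
        return PipelineModel(fitted_stages)


class PipelineModel:
    """Fitted Pipeline: every stage is now a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training_rows = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
test_rows = [{"x": 2.0}, {"x": 3.0}]
pipeline = Pipeline(stages=[Squarer("x", "x_squared"), MeanThreshold("x_squared")])
model = pipeline.fit(training_rows)       # training phase
predictions = model.transform(test_rows)  # testing phase
print(predictions)
```

The same fitted PipelineModel is applied to the test rows, so the test data goes through exactly the steps that were learned from the training data.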

Pipeline
How it works
http://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
Training phase
Testing phase

Pipeline components
Training phase Testing phase
http://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline

Extracting, transforming and selecting features
Feature Transformers
• StringIndexer
• StringIndexer encodes a string column of labels to a column of label indices.
• OneHotEncoderEstimator
• One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values.
• VectorAssembler
• VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector.
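What the three feature transformers above compute can be illustrated in plain Python. This sketch does not call Spark: the helper names `string_index`, `one_hot` and `assemble` are hypothetical, and the sample data is made up. It assumes that indices are assigned by descending label frequency (ties broken alphabetically here) and that the last category is dropped from the one-hot vector, which is why a vector can contain no one-value at all.

```python
from collections import Counter


def string_index(values):
    """Map each string label to an index; the most frequent label gets 0."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: i for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


def one_hot(index, n_categories, drop_last=True):
    """Encode a label index as a binary vector with at most a single 1.0."""
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    if index < size:  # the dropped last category stays all zeros
        vector[index] = 1.0
    return vector


def assemble(*parts):
    """Concatenate scalars and vectors into a single feature vector."""
    vector = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            vector.extend(part)
        else:
            vector.append(float(part))
    return vector


categories = ["a", "b", "c", "a", "a", "c"]  # hypothetical string column
ages = [18, 25, 31, 40, 22, 57]              # hypothetical numeric column
indices, mapping = string_index(categories)
features = [assemble(age, one_hot(idx, len(mapping)))
            for age, idx in zip(ages, indices)]
print(indices)
print(features[0], features[1])
```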

StringIndexer

OneHotEncoderEstimator

VectorAssembler

Classification and regression
MLlib
Regression
• Linear regression
• Generalized linear regression
• Available families
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression
• Survival regression
• Isotonic regression

Classification and regression
MLlib
Classification
• Logistic regression
• Decision tree classifier
• Random forest classifier
• Gradient-boosted tree classifier
• Multilayer perceptron classifier
• Linear Support Vector Machine
• One-vs-Rest classifier (a.k.a. One-vs-All)
• Naive Bayes

Univariate Linear Regression
MLlib
• Univariate (Single variable) linear regression refers to a linear regression model where we use only one independent variable x to learn a linear function that maps x to the dependent variable y:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where
• $y_i$ is the dependent variable for the ith observation
• $\beta_0$ is the intercept coefficient
• $\beta_1$ is the regression coefficient for the single independent variable
• $x_i$ is the single independent variable for the ith observation
• $\epsilon_i$ is the error term for the ith observation

Residuals
Linear regression
• The errors that our model makes are called residuals.
• Goal: To choose regression coefficients for the independent variable(s) that minimize these residuals.
• To compute the residual, we simply subtract the predicted value from the actual value.
• We use a metric called the Sum of Squared Errors (SSE), the sum of all squared residuals over the N data points:
$$SSE = \epsilon_1^2 + \epsilon_2^2 + \dots + \epsilon_N^2$$
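One standard way to choose coefficients that minimize the SSE for a single independent variable is the closed-form ordinary least squares solution. The sketch below is an illustration under that assumption, not a description of how Spark MLlib fits its models; the function names and the four data points are made up.

```python
def fit_univariate(xs, ys):
    """Ordinary least squares estimates of the intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta_1 = sxy / sxx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1


def sum_of_squared_errors(xs, ys, beta_0, beta_1):
    """Residual = actual value minus predicted value; SSE sums their squares."""
    residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals)


xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical independent variable
ys = [3.1, 4.9, 7.2, 8.8]  # hypothetical dependent variable
beta_0, beta_1 = fit_univariate(xs, ys)
sse = sum_of_squared_errors(xs, ys, beta_0, beta_1)
print(f"intercept={beta_0:.2f}, slope={beta_1:.2f}, SSE={sse:.3f}")
```

Any other choice of intercept and slope gives a larger SSE on these points, which is what "minimizing the residuals" means here.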

Root Mean Square Error
Linear regression
• RMSE is the square root of the SSE divided by the total number of data points N:
$$SSE = \epsilon_1^2 + \epsilon_2^2 + \dots + \epsilon_N^2$$
$$RMSE = \sqrt{\frac{SSE}{N}}$$
• RMSE tends to be used more often to check the quality of a linear regression model since its units are the same as the dependent variable and it is normalized by the value of N.

R-squared
Linear regression
• The R² metric represents the proportion of variance in the dependent variable explained by the independent variable(s).
$$R^2 = 1 - \frac{SSE}{\text{Total sum of squares}}$$
• The value of R² typically ranges between 0 and 1, and the higher its value, the better the regression model fits the data. The aim is to get as close as possible to 1.
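The SSE, RMSE and R-squared definitions can be checked numerically with a few lines of Python. This is a minimal sketch; the function name and the four actual/predicted values are made up, and the total sum of squares is taken as the sum of squared deviations of the actual values from their mean.

```python
import math


def regression_metrics(actual, predicted):
    """Return (SSE, RMSE, R-squared) for paired actual and predicted values."""
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    rmse = math.sqrt(sse / n)
    mean_actual = sum(actual) / n
    # Total sum of squares: variation of the actual values around their mean.
    sst = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - sse / sst
    return sse, rmse, r_squared


actual = [2.0, 4.0, 6.0, 8.0]     # hypothetical observed values
predicted = [3.0, 3.0, 7.0, 7.0]  # hypothetical model predictions
sse, rmse, r_squared = regression_metrics(actual, predicted)
print(sse, rmse, r_squared)
```

Here every prediction is off by exactly 1, so the RMSE is 1 in the units of the dependent variable.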

Multivariate Linear Regression
MLlib
• Multivariate (or multiple) linear regression allows us to utilize more than one independent variable, in this case K independent variables:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + \epsilon_i$$

Logistic Regression
MLlib
• Logistic Regression does not predict a numerical outcome. Logistic regression models allow us to predict a categorical outcome by predicting the probability that an outcome is true.
• In logistic regression models, we also have a dependent variable y and a set of independent variables x1, x2, …, xk. In logistic regression, however, we want to learn a function that provides the probability that y=1 given a set of independent variables:
$$P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$
• The above function is called the Logistic function. It provides a number between 0 and 1, representing the probability that the dependent (outcome) variable is true.
• Our goal when developing logistic regression models is to choose coefficients that predict a high probability when y = 1 but a low probability when y = 0.
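The logistic function is easy to evaluate directly. The sketch below is a minimal illustration: the function name and the coefficient values are made up, and no coefficients are actually learned from data.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic function applied to a linear combination of the features."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two independent variables.
intercept = -3.0
coefficients = [1.0, 0.5]

p_low = predict_probability(intercept, coefficients, [0.0, 0.0])
p_mid = predict_probability(intercept, coefficients, [2.0, 2.0])
p_high = predict_probability(intercept, coefficients, [4.0, 4.0])
print(p_low, p_mid, p_high)
```

Whatever the inputs, the result stays strictly between 0 and 1. To turn the probability into a categorical prediction, a threshold (commonly 0.5) is applied.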

Decision Tree
MLlib

Random Forest
MLlib
• In Random Forest, a large number of decision trees are generated and thereafter, each tree in the forest votes on the outcome, with the majority outcome taken as the final prediction.
• To generate a random forest, a process known as bootstrapping is employed whereby the training data for each tree making up the forest is selected randomly with replacement. In addition, random forests typically consider only a random subset of the independent variables when growing each tree. Therefore, each tree is trained on different training data and a different subset of independent variables.
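Bootstrapping and majority voting, the two mechanisms described above, can be sketched in a few lines. This is only an illustration of those two steps: no decision trees are trained, and the function names, the seed and the example votes are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the outcome predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(1)
rows = list(range(10))  # stand-in for the rows of a training set
samples = [bootstrap_sample(rows, rng) for _ in range(3)]  # one per tree
for tree_number, sample in enumerate(samples, start=1):
    print(f"Tree {tree_number} trains on rows {sorted(sample)}")

# Hypothetical votes from three trees for a single data point.
print(majority_vote(["churn", "stay", "churn"]))
```

Because sampling is done with replacement, some rows usually appear more than once in a sample while others are left out, which is what makes the trees differ from each other.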

Receiver Operating Characteristic curve
MLlib
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity).
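The points of a ROC curve can be computed by sweeping the threshold over the predicted scores. The sketch below is a minimal illustration; the function name, the labels, the scores and the thresholds are all made up, and a score at or above the threshold is treated as a positive prediction.

```python
def roc_points(labels, scores, thresholds):
    """Return one (FPR, TPR) point per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in thresholds:
        predicted = [1 if score >= threshold else 0 for score in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1, 1, 0, 1, 0, 0]              # hypothetical actual classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical predicted probabilities
thresholds = [1.0, 0.75, 0.5, 0.0]
points = roc_points(labels, scores, thresholds)
for threshold, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) towards (1, 1); plotting TPR against FPR for many thresholds traces the curve.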