Introduction to Machine Learning
Developed by Prajwol Sangat
Updated by Chee-Ming Ting (3 April 2021)
Monash Information Technology
Last week:
– Parallel Aggregation
– Parallel Sort
– Parallel Group-By
This week:
– What is Machine Learning?
– Machine Learning Basics
– Types of Machine Learning
– Feature Engineering
According to a McKinsey study, 35% of what consumers purchase on Amazon and 75% of what they watch on Netflix is driven by machine learning–based product recommendations.
What is Machine Learning?
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” (Tom Mitchell, 1997)
For example, in spam filtering the task T is classifying emails as spam or non-spam, the experience E is a set of emails already labelled as spam or non-spam, and the performance measure P is the fraction of emails classified correctly.
Examples
Elements of machine learning
Data processing:
– Feature extraction, feature selection, feature transformation, feature reduction, feature scaling, feature normalization
Model Learning (Training):
– Find an optimal model by estimating model parameters using the training data
– Based on a loss function (e.g., minimize the error between true and predicted labels)
Predictive Model: ŷ = f(X)
Model Testing:
– Test the learned model in predicting unseen test data
– Use performance metrics to assess model accuracy
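To make these elements concrete, here is a minimal sketch of the workflow in plain Python. The use of scikit-learn and a synthetic dataset is my own illustrative choice, not material from the slides; the Spark equivalents appear later in this deck.

```python
# Minimal sketch (illustrative only) of the data -> train -> test workflow.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data processing: here the features are already numeric; real projects would
# do feature extraction/selection/scaling first.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model learning (training): parameters are fitted by minimising a loss on the training data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Model testing: predict unseen test data and assess with a performance metric.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```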
Illustration: Linear Regression model
Problem: Predict ice cream sales given temperature
Data (100 days of observations):
Day (i)   Temp (degrees)   Sales (RM)
1         36               200
2         31               100
3         24               50
…         …                …
100       38               250
[Figure: scatter plot of Sales (RM) against Temp (degrees) for the ice cream data]
Predictive Model:
– What is a good model f(.) to map x (Temp) to y (Sales)? Here, a linear model: f(x) = θ0 + θ1·x
Model Learning/Estimation:
– How to choose the parameters θ0 and θ1?
– Define a loss function
– Estimate θ0, θ1 using a learning algorithm
– Estimated parameters: θ0 = 50, θ1 = 2.5, so f(x) = 50 + 2.5x
Prediction:
– Given a new input x_new, predict y with the learned model: ŷ = f(x_new) = θ0 + θ1·x_new
– Predicted output for x_new = 30: ŷ = 50 + 2.5(30) = 125
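A short NumPy sketch of the same steps (my own illustration; only the four rows reproduced above are used, so a least-squares fit on them will not match the slide's illustrative θ0 = 50, θ1 = 2.5 exactly).

```python
# Hedged sketch: least-squares fit of f(x) = theta0 + theta1*x on the rows
# shown on the slide; the full 100-day dataset is not reproduced here.
import numpy as np

temp  = np.array([36, 31, 24, 38], dtype=float)     # x: temperature (degrees)
sales = np.array([200, 100, 50, 250], dtype=float)  # y: sales (RM)

theta1, theta0 = np.polyfit(temp, sales, deg=1)     # polyfit returns [slope, intercept]
print(f"fitted model: f(x) = {theta0:.1f} + {theta1:.1f}x")

# Prediction step with the slide's illustrative parameters (theta0=50, theta1=2.5):
x_new = 30
y_hat = 50 + 2.5 * x_new
print(f"predicted sales at {x_new} degrees: {y_hat}")   # 125.0
```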
Overview of machine learning
Training Stage:
– Training Data → Learning Algorithm → Learned Predictive Model
– f(.) maps X to Y; the learning algorithm finds a suitable model f given the training data
Classification/Prediction Stage:
– Test Data → Learned Predictive Model → Predicted Label ŷ = f(X)
– Assessment of prediction performance
Data
Machine Learning: Data Types
Vector
– A mathematical vector, of which there are two kinds:
– dense vectors, where every entry is stored, and
– sparse vectors, where only the nonzero entries are stored to save space.
Labeled Point
– A labeled data point for supervised learning algorithms such as classification and regression.
– Includes a feature vector and a label (which is a floating-point value).
Dense vs. sparse vector illustration: https://miro.medium.com/max/3144/1*OrsYQ6FoKq6YwxwS6LPMpg.png
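Vector and LabeledPoint correspond to Spark MLlib's local data types; a brief sketch of constructing them in PySpark (assumes PySpark is installed; the values are arbitrary):

```python
# Sketch of Spark MLlib data types (RDD-based API); no SparkContext is needed
# for these local objects.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Dense vector: every entry is stored.
dense = Vectors.dense([1.0, 0.0, 3.0])

# Sparse vector: only nonzero entries are stored (size, indices, values).
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])

# Labeled point: a label (float) plus a feature vector, for supervised learning.
lp = LabeledPoint(1.0, dense)
print(dense, sparse, lp)
```

The newer DataFrame-based API provides an equivalent Vectors class in pyspark.ml.linalg.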
Features
All learning algorithms require defining a set of features for each item, which will be fed into the learning function.
For example, for an email, features might include the server it comes from, the number of occurrences of the word "free", or the colour of the text.
In many cases, defining the right features is the most challenging part of using machine learning.
For example, in a product recommendation task, simply adding another feature (e.g., realizing that which book to recommend to a user might also depend on which movies she has watched) could give a large improvement in results.
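A hypothetical sketch of turning such features into a numeric feature vector; the helper name and feature choices are mine, mirroring the email example above.

```python
# Hypothetical sketch: turning an email into a numeric feature vector.
# The chosen features mirror the slide's examples (sending server, count of
# the word "free", text colour); names and thresholds are illustrative only.
def email_features(sender_server: str, body: str, text_colour: str) -> list:
    return [
        1.0 if sender_server.endswith("known-mail-provider.com") else 0.0,
        float(body.lower().split().count("free")),
        1.0 if text_colour == "red" else 0.0,
    ]

x = email_features("smtp.known-mail-provider.com",
                   "Get your FREE prize now, totally free", "red")
print(x)   # [1.0, 2.0, 1.0]
```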
Machine Learning Fundamentals
– Supervised and Unsupervised Models
– Bias and Variance
– Underfitting and Overfitting
Model: Types of Model Learning
– Supervised
– Unsupervised
Source: https://www.simplilearn.com/tutorials/machine-learning-tutorial/supervised-and-unsupervised-learning
Types of Model Learning: Supervised
Goal: Learn a function from labelled training data to predict the output label(s) given a new unlabelled input.
– Training data consists of input features and output information (labels)
Two types of supervised learning:
– Classification
– Regression
Supervised Machine Learning: Classification
Classification problem: To separate inputs into a discrete set of classes or labels.
– Binary classification (example: dog or not dog)
– Multinomial (multi-class) classification
Supervised Machine Learning: Classification
Multinomial classification example: Australian shepherd, golden retriever, or poodle
Supervised Machine Learning: Regression
A regression problem is one where the output variable is a real value, such as "dollars" or "weight".
Regression example: predicting ice cream sales based on temperature.
Supervised Machine Learning in Apache Spark
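As one concrete example of supervised learning in Spark, here is a minimal sketch using the DataFrame-based ML API with logistic regression; the toy data and column names are made up for illustration.

```python
# Minimal sketch of supervised classification with Spark's DataFrame-based ML API.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("supervised-demo").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.0, 0.2), (0.0, 0.8, 0.1), (1.0, 0.1, 0.9), (1.0, 0.2, 0.8)],
    ["label", "f1", "f2"])

# Assemble raw columns into the single 'features' vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

# Train a logistic regression classifier and predict on the training data.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()
```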
Types of Model Learning: Unsupervised
Goal: Explore the underlying structure of the data to extract meaningful information, without the guidance of known output information.
– Deals with unlabelled data (no output labels)
Two types of unsupervised learning:
– Clustering
– Association
Unsupervised Machine Learning: Clustering
Clustering problem: Divide the data into clusters such that data within a cluster are similar to one another and dissimilar to data belonging to other clusters.
Use clustering when you want to discover the inherent groupings in the data, e.g., grouping customers by purchasing behaviour.
Unsupervised Machine Learning: Association
Association rule learning problem: Discover the probability of co-occurrence (association) between items in a large dataset.
Use association rules when you want to discover rules that describe large portions of your data, e.g., people who buy X also tend to buy Y.
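The slide does not name a specific algorithm; one that Spark MLlib provides for this task is FP-Growth. A brief sketch with a made-up basket dataset:

```python
# Sketch of association rule mining with Spark MLlib's FP-Growth
# (one algorithm Spark offers for this; the basket data is made up).
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("association-demo").getOrCreate()

baskets = spark.createDataFrame(
    [(0, ["bread", "milk"]),
     (1, ["bread", "butter"]),
     (2, ["bread", "milk", "butter"]),
     (3, ["milk", "butter"])],
    ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(baskets)

model.freqItemsets.show()       # frequently co-occurring item sets
model.associationRules.show()   # rules such as "people who buy X also buy Y"
```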
Unsupervised Machine Learning in Apache Spark
k-means,
Latent Dirichlet Allocation (LDA), and
Gaussian mixture models.
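A minimal k-means sketch with Spark's DataFrame-based ML API; the 2-D points are made up so that they form two obvious groups.

```python
# Sketch of k-means clustering in Spark; toy data for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("clustering-demo").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.2]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())   # one centre per discovered cluster
model.transform(df).show()      # each point with its cluster assignment
```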
Machine Learning: Assessment
How to prepare the data?
– Train-test split
– K-fold cross-validation
How to measure performance?
– TP, FP, TN, FN, confusion matrix
– Accuracy, Recall, Precision, F1-score
Machine Learning: Performance Metrics
Example: Email Spam Detection
In the test set: 10 spam, 20 non-spam emails; positive class = spam.
Confusion matrix (columns = true labels, rows = predicted labels):
                          True SPAM (1)    True NON-SPAM (0)
Predicted SPAM (1)         7  (TP)          5  (FP)
Predicted NON-SPAM (0)     3  (FN)         15  (TN)
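From these counts the standard metrics follow directly; a small Python check using the usual definitions:

```python
# Worked computation for the confusion matrix above (positive = spam).
TP, FP, FN, TN = 7, 5, 3, 15

accuracy  = (TP + TN) / (TP + TN + FP + FN)                  # 22/30 ≈ 0.733
precision = TP / (TP + FP)                                   # 7/12  ≈ 0.583
recall    = TP / (TP + FN)                                   # 7/10  = 0.700
f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.636

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```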
Machine Learning: Bias and Variance
Bias is the gap between the average value predicted by the model and the actual value of the data.
Variance measures how far the predicted values are spread out relative to one another.
Machine Learning: Bias and Variance
– Low bias + high variance → Overfitting
– High bias + low variance → Underfitting
Machine Learning: Overfitting and Underfitting
Overfitting (high variance, low bias): the model performs well on the training data but generalizes poorly to any new data.
Underfitting (low variance, high bias): the model is overly simple and does not perform well even on the training data.
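A small illustration of this trade-off (my own example, not from the slides): polynomials of increasing degree are fitted to the same noisy data, showing why training error alone can be misleading.

```python
# Illustration (not from the slides): underfitting vs. overfitting by varying
# polynomial degree on noisy data generated from a sine curve.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x_train = np.linspace(0, 1, 20)
x_test  = np.linspace(0, 1, 200)
y_train = true_f(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test  = true_f(x_test)  + rng.normal(scale=0.2, size=x_test.size)

for degree in (1, 3, 12):    # too simple / reasonable / very flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse  = np.mean((y_test  - np.polyval(coeffs, x_test))  ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# A very low-degree model tends to underfit (high bias): large error even on
# the training set. A very high-degree model tends to overfit (high variance):
# small training error but larger error on new data.
```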
Machine Learning: Overfitting and Underfitting
Preventing Overfitting:
– Train with more data
– Remove features
– Early stopping
– Cross validation (e.g., K-fold cross-validation), as sketched below
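K-fold cross-validation is available in Spark's ML API as CrossValidator; a hedged sketch follows, in which the toy data and the regularization grid are made up for illustration.

```python
# Sketch of k-fold cross-validation with Spark ML's CrossValidator.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("cv-demo").getOrCreate()

# Small made-up binary dataset (label, f1, f2).
rows = ([(0.0, 0.9 - 0.05 * i, 0.1 + 0.05 * i) for i in range(6)] +
        [(1.0, 0.1 + 0.05 * i, 0.9 - 0.05 * i) for i in range(6)])
df = spark.createDataFrame(rows, ["label", "f1", "f2"])
data = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1]).build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)          # k = 3 folds
best_model = cv.fit(data).bestModel
print(best_model.coefficients)
```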
Next topic: Featurization
To be continued...