Recap
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Machine Learning
COMP9417 T1, 2021 1
Machine Learning Pipeline
COMP9417 T1, 2021 2
Regression
Regression models are used to predict a continuous value.
COMP9417 T1, 2021 3
Regression
1. Simple Linear Regression
– The most common cost function: Mean Squared Error (MSE)
– The cost function can be minimized using Gradient Descent (it also has a closed-form solution)
– Regression coefficients/weights ($\theta_i$) describe the relationship between a predictor variable ($x_i$) and the output variable ($y$)
– Regularization is applied to avoid overfitting
o It applies additional constraints to the weights, usually to keep them small (shrinkage), and can also be used for feature selection
o Most common regularization approaches:
§ Ridge (penalize $\sum_i \theta_i^2$)
§ Lasso (penalize $\sum_i |\theta_i|$)
§ Elastic Net (a combination of Ridge and Lasso)
COMP9417 T1, 2021 4
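As an illustration of the above, here is a minimal NumPy sketch (not from the course materials; the function name and toy data are my own) of minimizing the MSE cost with a Ridge-style L2 penalty by batch gradient descent:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, lr=0.01, n_iters=1000):
    """Minimise MSE + lam * ||theta||^2 by batch gradient descent."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # add a bias column
    theta = np.zeros(d + 1)
    for _ in range(n_iters):
        error = Xb @ theta - y                # predictions minus targets
        grad = (2 / n) * Xb.T @ error         # gradient of the MSE term
        grad[1:] += 2 * lam * theta[1:]       # L2 penalty (bias not penalised)
        theta -= lr * grad
    return theta

# toy usage: y is roughly 3 + 2x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 1, 100)
print(ridge_gradient_descent(X, y))
```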
Regression
2. Polynomial Regression
– Create polynomial terms from your features
– Solved in the same way as simple Linear Regression
– Model is still linear in parameters
3. Local Regression
– Use the k nearest neighbors to fit a regression line
– Produces a piecewise approximation
COMP9417 T1, 2021 5
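A small sketch of polynomial regression, showing that the model stays linear in its coefficients once the polynomial terms are created (the helper names and toy data are illustrative, not from the slides):

```python
import numpy as np

def polynomial_design(x, degree):
    """Expand a 1-D feature into [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_polynomial(x, y, degree=3):
    """Least-squares fit; the model is still linear in its coefficients."""
    X = polynomial_design(x, degree)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + np.random.normal(0, 0.1, x.size)
print(fit_polynomial(x, y, degree=3))
```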
Regression
4. Decision Tree Regression (regression tree)
– Partitioning data into homogeneous subsets
– Variance or standard deviation reduction is used to decide on splits
– The predicted value for each leaf is the average value of the samples in that leaf
5. Model Tree
– Similar to regression trees but with linear regression at each leaf
– Splitting criterion is standard deviation reduction
One option would be to use regression trees: in the drug example (effectiveness (%) vs. dosage (mg)), each leaf predicts the average drug effectiveness of the samples falling in that leaf, which does a better job than Linear Regression on such data.
COMP9417 T1, 2021 6
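A hedged sketch of a regression tree on synthetic dosage-style data, using scikit-learn's DecisionTreeRegressor; in recent scikit-learn versions the squared-error criterion corresponds to the variance-reduction splitting described above (the data and parameter choices here are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic dosage-vs-effectiveness style data, for illustration only
rng = np.random.default_rng(1)
dosage = rng.uniform(0, 60, size=(200, 1))
effectiveness = np.where((dosage[:, 0] > 10) & (dosage[:, 0] < 30), 100, 10) \
                + rng.normal(0, 5, 200)

# the squared-error criterion reduces variance at each split;
# min_samples_leaf is one of the pre-pruning controls mentioned later
tree = DecisionTreeRegressor(criterion="squared_error", min_samples_leaf=10)
tree.fit(dosage, effectiveness)
print(tree.predict([[20.0], [50.0]]))   # leaf means: high vs. low effectiveness
```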
Model Evaluation
– R-squared represents the proportion of variance in the output that has been explained by the model
COMP9417 T1, 2021 7
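For concreteness, a minimal sketch of computing R-squared from predictions (the function name is my own):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the portion of output variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
```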
Classification
Classification
Classification is prediction of categorical output from input attributes/features.
Example:
Instead of finding a discriminative line, maybe we can focus on one class at a time and build a model that describes how that class looks; and then do the same for the other class. Algorithms of this type are called generative learning algorithms.
(Figure: sea bass vs. salmon classification in the lightness-width feature space.)
COMP9417 T1, 2021 8
Classification
Two main types of classification:
– Generative algorithm: builds a model for each of the classes
o Learns $p(x|y)$
o and then estimates $p(y|x)$ using Bayes' theorem
– Discriminative algorithm: does not build models for the different classes, but rather focuses on finding a decision boundary
o Learns $p(y|x)$ directly
COMP9417 T1, 2021 9
Classification
1. Nearest Centroid Classifier
– Distance-based classifier
– $\mu_c = \frac{1}{|D_c|} \sum_{x \in D_c} x$ (the centroid of class $c$ is the mean of its training instances)
– For complex classes (e.g. multimodal, non-spherical) it may give very poor results
– Cannot handle outliers and noisy data well
– Not very accurate
COMP9417 T1, 2021 10
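A minimal sketch of a nearest centroid classifier as described above (class and attribute names are my own, loosely following scikit-learn conventions):

```python
import numpy as np

class NearestCentroid:
    """Minimal nearest-centroid classifier: one mean vector per class."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # distance from every sample to every class centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]
```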
Classification
2. k Nearest Neighbor Classifier (kNN)
– Distance-based classifier
– Find the k nearest neighbors using an appropriate distance metric (e.g. Minkowski distance)
– Predict the output based on the majority vote
– Works better with lots of training data and small number of attributes
– Can be very accurate but slow at testing time
– Curse of dimensionality
– Assumes all attributes are equally important
o Remedy: attribute selection or attribute weights
– Needs homogeneous feature types and scales
COMP9417 T1, 2021 11
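A minimal kNN sketch using the Minkowski distance and a majority vote (assumes the training data and labels are NumPy arrays; names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, p=2):
    """Classify x by majority vote among its k nearest neighbours
    under the Minkowski distance of order p (p=2 gives Euclidean)."""
    dists = np.sum(np.abs(X_train - x) ** p, axis=1) ** (1 / p)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```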
Classification
3. Bayesian Decision Theory (based on Bayes' theorem, $P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$)
– The prediction will be the most probable hypothesis if the expected loss is equal for all classes:
o Maximum a posteriori: $h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$
o If $P(h_i) = P(h_j)$, we use maximum likelihood: $h_{ML} = \arg\max_{h_i \in H} P(D|h_i)$
– If the expected loss is not the same, then we have to predict the class which minimizes the expected loss
o Expected loss: $R(\alpha_i|x) = \sum_{h \in H} \lambda(\alpha_i|h)\, P(h|x)$
COMP9417 T1, 2021 12
Classification
4. Bayes Optimal Classification: $\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\, P(h_i|D)$
– Here we are dealing with combining the decisions from multiple hypotheses
– No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average
– Bayes optimal classifier is very inefficient
COMP9417 T1, 2021 13
Classification
5. Naïve Bayes Classifier
– Using Bayesian theory
– The main difference is the strong assumption that attributes are conditionally independent: $P(x_1, x_2, \dots, x_n | v_j) = \prod_i P(x_i | v_j)$
– Prediction is based on maximum a posteriori:
$v_{NB} = \arg\max_{v_j \in V} P(x_1, x_2, \dots, x_n | v_j)\, P(v_j) = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(x_i | v_j)$
– Useful when:
o moderate or large training set available
o Attributes are conditionally independent (however, this is usually violated and NB can still do a decent job!)
– Having too many redundant attributes will decrease the performance
COMP9417 T1, 2021 14
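A hedged sketch of a Gaussian Naïve Bayes classifier, illustrating the conditional-independence assumption and the MAP prediction in log space (the function names and the per-attribute Gaussian choice are my own assumptions, not from the slides):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, means and variances (independence assumption)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    classes, priors, means, variances = model
    # log P(v_j) + sum_i log P(x_i | v_j), maximised over classes (MAP)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + (X[:, None, :] - means[None, :, :]) ** 2
                      / variances[None, :, :])
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]
```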
Classification
6. Decision Tree:
– Works in divide and conquer fashion
o Split into subsets
o Check the subset purity
§ Use entropy to measure impurity at each node ($E(S) = \sum_{c=1}^{k} -p_c \log_2 p_c$)
§ Use information gain to decide which attribute works better for that node
$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$
» However, information gain (IG) is biased towards attributes with a large number of values, so we can use the Gain Ratio instead
§ Attribute with highest information gain will be selected for the node
COMP9417 T1, 2021 15
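A short sketch of the entropy and information gain computations used to choose split attributes (function names are my own):

```python
import numpy as np

def entropy(labels):
    """E(S) = sum over classes of -p_c * log2(p_c)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        total -= (len(subset) / len(labels)) * entropy(subset)
    return total
```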
Classification
6. Decision Tree:
– Decision trees can work with any type of data (discrete and numeric)
– Can handle missing values
– One of the main advantages is interpretability
– Can almost always classify training examples perfectly if we let it grow enough, which means it can overfit
– To avoid overfitting
o Pre-pruning: stop growing when a split is not statistically significant, e.g. using a chi-squared test (suffers from early stopping). Alternatively, limit min_samples_leaf, min_impurity_decrease, max_leaf_nodes or max_depth, etc.
o Post-pruning: grow full tree, then remove sub-trees that cause overfitting based on cross validation
– Greedy algorithm (may not find the optimal tree)
COMP9417 T1, 2021 16
Classification
7. Linear Perceptron ($\hat{y} = f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$)
– Weights are updated iteratively until no mistake is made or the maximum number of iterations is reached
– Simple and fast at training
– Doesn't perform well if classes are not linearly separable
8. Non-linear Perceptron
– Map attributes into new space consisting of polynomial terms and interaction terms
– Use the kernel trick to make the computation much less expensive:
$\hat{y} = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i y_i\, \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x})\right)$
– A valid kernel function is equivalent to a dot product in some space: $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$
COMP9417 T1, 2021 17
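A minimal sketch of the linear perceptron training loop described above, with the standard mistake-driven update (assumes labels in {-1, +1}; names are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_iters=100):
    """Perceptron rule: on a mistake, w <- w + eta * y_i * x_i.
    Labels y are assumed to be in {-1, +1}; a bias column is appended."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:       # misclassified (or on the boundary)
                w += eta * yi * xi
                mistakes += 1
        if mistakes == 0:                     # converged: no mistakes this pass
            break
    return w
```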
Classification
9. Linear Support Vector Machine (maximum margin)
– $\hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} - t)$
– $\mathbf{w} = \sum_{\mathbf{x}_i \in \{\text{support vectors}\}} \alpha_i y_i \mathbf{x}_i$
– $\alpha_i$ is non-zero for support vectors
– Effective in high-dimensional data
– Effective when the number of dimensions is larger than the number of samples
10. Non-linear SVM
– As with the perceptron, the kernel trick can be applied using the dual form
– $\hat{y} = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) - t\right)$
COMP9417 T1, 2021 18
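A hedged usage sketch of a non-linear (kernel) SVM with scikit-learn's SVC on a toy non-linearly-separable dataset (the dataset and parameter values are illustrative, not from the slides):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# non-linearly separable toy data; the RBF kernel plays the role of K(x_i, x)
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)
```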
Bias-Variance Tradeoff
Good Fit
Truth
Underfitting
Overfitting
Source: Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff
COMP9417 T1, 2021 19
Bias-Variance
• Bias-variance:
– Bias: The inability of the learning algorithm to capture the true relationship between the output and the features/attributes is called bias.
o due to model choice (e.g. the model is not complex enough)
– Variance: the difference in the learning algorithm's fits between different datasets is called variance.
o due to small sample size
o high complexity of the model
COMP9417 T1, 2021 20
Bias-Variance Tradeoff
• Often, the aim is to find the balance between bias and variance such that we minimize the overall (total) error.
• There is not a quantitative way to find this balanced error point. Instead, you will need to evaluate your model (preferably on an unseen validation data set) and adjust your model's complexity until you find the iteration that minimizes overall error.
• Methods to find a good bias-variance trade-off:
o Regularization
o Ensemble learning in general
o Bagging
o Boosting
COMP9417 T1, 2021 21
Ensemble Learning
Ensemble methods: meta-algorithms that combine different models into one model
1. Simple Ensembles: combining several learning algorithms
o Majority vote or unweighted average will be used for prediction
o Using weighted average or weighted votes to predict the output
o Treat the output of each algorithm as a feature and train another learning algorithm on them
2. Mixture of Experts
o Each learning algorithm defines $\alpha_i(\mathbf{x})$, which indicates the expertise of that algorithm for that particular location of $\mathbf{x}$ in the input space
o It may use a weighted average or just pick the model with the largest expertise
COMP9417 T1, 2021 22
Ensemble Learning
3. “Bagging” method: (“Bootstrap Aggregation”)
o Training many classifiers, but each on a Bootstrapped dataset
§ Bootstrap: Create a random subset of data by sampling with replacement
§ Bagging: Repeat 𝑘 times to generate 𝑘 subsets
o Then aggregate through model averaging / majority voting
o Bagging is applied to a collection of low-bias, high-variance models
§ by averaging them the bias is not affected
§ by averaging them the variance is reduced
COMP9417 T1, 2021 23
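A hedged bagging sketch with scikit-learn, averaging many bootstrap-trained decision trees (note: the base-model argument is `estimator` in recent scikit-learn versions, `base_estimator` in older ones; the data and settings are illustrative):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# unpruned trees are low-bias / high-variance base models;
# bagging averages k bootstrap-trained copies to reduce the variance
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(),
                           n_estimators=50, random_state=0)
print(cross_val_score(bagged, X, y, cv=5).mean())
```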
Ensemble Learning
4. Add randomization to the models to introduce more diversity in the models, for example:
– For every model use a subset of features, selected randomly, e.g. in Random Forest (it can also help with training time)
– For algorithms that are dependent on initial weights, use different random initial weights
5. Boosting: a sequence of weak learners, each trying to correct its predecessor
– Learners are trained sequentially
– New learners focus on errors of earlier learners
– New learners try to get misclassified samples right by operating on a weighted training set in favor of misclassified instances
– Combine all learners in the end using weighted majority/weighted average of 𝑘 learners
COMP9417 T1, 2021 24
Ensemble Learning
– AdaBoost is a boosting algorithm using stump trees (one-level decision trees)
o Misclassified instances gain higher weights
o Correctly classified instances lose weight
– Main advantages:
o Uses very simple (weak) learners
o It boosts the performance
o Decrease bias
o Decrease variance
– Slow during training and lack of interpretability
– Gradient Boosting is a boosting algorithm using stump trees for regression
o At every step it models the residuals
COMP9417 T1, 2021 25
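A hedged AdaBoost sketch with scikit-learn, using depth-1 trees (stumps) as the weak learners (same `estimator`/`base_estimator` naming caveat as above; data and settings are illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# depth-1 trees ("stumps") as weak learners; each round re-weights the
# training set in favour of instances the previous stump got wrong
stump = DecisionTreeClassifier(max_depth=1)
boosted = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
print(cross_val_score(boosted, X, y, cv=5).mean())
```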
Neural Networks
• Neural Nets: composed of a large number of interconnected processing elements known as neurons
– They use supervised error correcting rules with back-propagation to learn a specific task
COMP9417 T1, 2021 26
Neural Networks
– Perceptron: Output is thresholded sum of products of inputs and their weights
o Perceptron learning is simply an iterative weight-update ($\mathbf{w}' = \mathbf{w} + \eta y_i \mathbf{x}_i$)
– Multilayer Perceptrons
o can represent arbitrary functions
o consists of an input, hidden and output layer each fully connected to the next, with activation feeding forward
– Neural nets are more useful when:
o Input is high dimensional
o Form of the target function is unknown
o Interpretation is not important
COMP9417 T1, 2021 27
Neural Networks
– Deep Learning: similar to regular neural nets, just with more layers
o Relies on large amounts of data
o Deeper learning architecture
– Convolutional Neural Net: among the most well-known deep learning models
o Neurons are arranged in 3 dimensions (width, height and depth)
o Uses a parameter-sharing scheme that minimizes the number of parameters
o Neurons in each layer are only connected to a small region of the layer before it (not fully connected)
o Parameters of each layer play the role of a filter which is applied locally
o The pooling layer progressively reduces the spatial size of the representation to reduce the number of parameters; therefore it helps against overfitting
COMP9417 T1, 2021 28
Neural Networks
– To avoid overfitting:
o dropout layer is used
§ In each forward pass, randomly set some neurons to zero
o Early stopping
o Reduce the network’s capacity by removing some layers
o Regularisation: adding a cost to the loss function for large weights
o Data Augmentation
§ Increase the data size
§ Rotation, cropping, scaling, flipping, Gaussian filtering
COMP9417 T1, 2021 29
Evaluation of classification
Contingency table (also called the confusion matrix)
For the two-class prediction case:

                         Predicted Positive (Yes)    Predicted Negative (No)
Actual Positive (Yes)    True Positive (TP)          False Negative (FN)
Actual Negative (No)     False Positive (FP)         True Negative (TN)

• $acc = \frac{1}{|Test|} \sum_{x \in Test} I[\hat{c}(x) = c(x)]$
• $Precision = \frac{TP}{TP + FP}$
• $Recall = \frac{TP}{TP + FN}$
• $F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
• AUC-ROC curve
COMP9417 T1, 2021 30
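A minimal sketch computing these metrics directly from the contingency-table counts (assumes 0/1-encoded labels and non-zero denominators; the function name is my own):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the two-class contingency table."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```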
Missing Values
How to handle missing values (common approaches):
o Deleting samples with missing values
o Replacing the missing value with some statistics from the data (mean, median, ...)
o Assigning a unique category
o Predicting the missing values
o Using algorithms that support missing values
COMP9417 T1, 2021 31
Model (Feature) Selection
• Taking all the features will lead to an overly complex model. There are 3 ways to reduce complexity:
– Subset-selection: feature forward selection or feature backward selection
– Shrinkage, or regularization of coefficients to zero, by optimization. There is a single model, and unimportant variables have near-zero coefficients.
– Dimensionality-reduction, by projecting points into a lower dimensional space
COMP9417 T1, 2021 32
Data Normalization
• Normalization is usually a data pre-processing step that changes the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
– Most of the distance-based machine learning algorithms require normalization as a pre-processing step if features do not have the same scale
– Most common normalization techniques:
o Min-max normalization: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
o Z-score (standardization): $x' = \frac{x - \bar{x}}{\sigma}$
COMP9417 T1, 2021 33
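A minimal sketch of both normalization techniques applied column-wise with NumPy (function names are my own):

```python
import numpy as np

def min_max_normalise(X):
    """Rescale each column to [0, 1]: (x - min) / (max - min)."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def z_score_standardise(X):
    """Centre each column and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```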
Validation
• Hold-out method:
• K-fold cross validation
COMP9417 T1, 2021 34
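A hedged sketch of both validation strategies with scikit-learn (the choice of dataset and classifier is purely illustrative):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# hold-out: a single train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross validation: average score over 5 train/validation splits
print(cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean())
```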
Unsupervised Learning
Unsupervised learning: classes are initially unknown and need to be “discovered” with their definitions from the data
• It is useful for:
– Dimensionality reduction (simplify the problem, getting rid of redundant features)
– Exploratory data analysis
– To group data instances into subsets
– To discover structure, like hierarchies of subconcepts
– To learn new “features” for later use in classification
– To track “concept drift” over time
COMP9417 T1, 2021 35
Clustering
• Goal: form homogeneous and well-separated clusters
• Success of clustering often measured subjectively
• There are two broad types of clustering:
– Hierarchical methods
– Partitioning methods
COMP9417 T1, 2021 36
Clustering
1. K-means
– Initialize k random centers from the data
– Assign each instance to the closest center and re-compute the centers using mean or weighted average and re-iterate
– Simple and can be efficient clustering method
– Not easy to predict k
– Different initializations can result in different clusters
– Sensitive to outliers
COMP9417 T1, 2021 37
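A minimal k-means (Lloyd's algorithm) sketch matching the steps above (empty clusters are not handled; names and defaults are my own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: assign to the nearest centre, then recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k random data points
    for _ in range(n_iters):
        # assign each instance to its closest centre
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # recompute each centre as the mean of its assigned instances
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return centers, labels
```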
Clustering
2. ExpectationMaximization:
– Similar to k-means
– Computes probabilities of cluster memberships based on one or more probability distributions. (e.g. mixture of Gaussian)
– The goal is to maximize the overall probability or likelihood of the data, given the (final) clusters.
– Easy with independence assumption
COMP9417 T1, 2021 38
Clustering
3. Hierarchicalclustering
– Agglomerative: starts by treating each object as a singleton cluster and gradually merges clusters based on similarity
– Divisive: it starts by including all objects in a single large cluster. At each step of iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster.
– Does not require specifying the number of clusters
– Different linkage methods can produce very different dendrograms
COMP9417 T1, 2021 39
Clustering
• Finding number of clusters:
– Elbow method: using the within-cluster dispersion
– Gap statistic: based on the within-cluster variance of the original data and B sets of resampled data ($Gap(k) = \frac{1}{B}\sum_b \log(W_{kb}) - \log(W_k)$)
– Choose the number of clusters as the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1
• Quality of clusters
– if clusters known, measure proportion of disagreements to agreements
– if unknown, measure homogeneity and separation
– silhouette method
COMP9417 T1, 2021 40
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of features/attributes
– Helps with removing redundant/correlated features
– Helps with curse of dimensionality
1. Principal Component Analysis (PCA): captures the directions of the most variation in the original data space.
– Features are not correlated in the new space (they are orthogonal)
– New dimensions are computed using eigenvectors and eigenvalues of the data matrix (rows are observations and columns are features)
– Note: Features have to be normalized before applying PCA
2. Autoencoders: a neural network model – the encoder transforms the data into a smaller dimension such that the decoder can then interpret and reconstruct it with minimum error
COMP9417 T1, 2021 41
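A minimal PCA sketch via the eigen-decomposition of the covariance matrix of normalized data (the function name is my own; a library implementation would normally be preferred):

```python
import numpy as np

def pca(X, n_components=2):
    """Project standardised data onto the top principal components."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # normalise features first
    cov = np.cov(X, rowvar=False)                      # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigen-decomposition
    order = np.argsort(eigvals)[::-1][:n_components]   # largest-variance directions
    return X @ eigvecs[:, order]                       # orthogonal new features
```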