
Recap
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain

Machine Learning

Machine Learning Pipeline

Regression
Regression models are used to predict a continuous value.

Regression
1. Simple Linear Regression
– The most common cost function: Mean Squared Error (MSE)
– The cost function can be minimized using Gradient Descent (it also has a closed-form solution); see the sketch below
– Regression coefficients/weights (θ_j) describe the relationship between a predictor variable (x_j) and the output variable (y)
– Regularization is applied to avoid overfitting
o It adds constraints to the weights, usually to keep them small (shrinkage), and can also be used for feature selection
o Most common regularization approaches:
§ Ridge (penalize ∑_j θ_j²)
§ Lasso (penalize ∑_j |θ_j|)
§ Elastic Net (a combination of Ridge and Lasso)
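A minimal NumPy sketch (not from the slides) of minimizing the MSE by gradient descent; the toy data, the function name fit_linear_regression and the l2 argument (a simple Ridge-style penalty) are illustrative assumptions.

import numpy as np

def fit_linear_regression(X, y, lr=0.1, n_iters=5000, l2=0.0):
    """Minimize MSE by gradient descent; l2 > 0 adds a Ridge-style penalty."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a bias column
    theta = np.zeros(d + 1)
    for _ in range(n_iters):
        err = Xb @ theta - y               # residuals
        grad = (2 / n) * Xb.T @ err        # gradient of the MSE
        grad[1:] += 2 * l2 * theta[1:]     # shrink the weights (not the bias)
        theta -= lr * grad
    return theta

# Tiny synthetic example: y ≈ 3x + 2 with noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, 100)
print(fit_linear_regression(X, y))   # roughly [2, 3]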

Regression
2. Polynomial Regression
– Create polynomial terms from your features
– Solved in the same way as simple Linear Regression (see the sketch below)
– The model is still linear in its parameters
3. Local regression
– Use the k nearest neighbours to fit a regression line
– Produces a piecewise approximation
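A minimal sketch of polynomial regression, assuming scikit-learn is available; the quadratic toy data is made up for illustration.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(0, 0.3, 200)

# Degree-2 polynomial terms are created from the feature, then an ordinary
# linear model is fit: the model stays linear in its parameters.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.0]]))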

Regression
4. Decision Tree Regression (regression tree)
– Partitions the data into homogeneous subsets
– Variance or standard deviation reduction is used to decide on splits
– The predicted value for each leaf is the average value of the samples in that leaf (see the sketch below the figure)
5. Model Tree
– Similar to regression trees, but with a linear regression model at each leaf
– Splitting criterion is standard deviation reduction
[Figure: drug dosage (mg) vs. effectiveness (%). One option would be to use regression trees: each leaf predicts the average drug effectiveness of its samples, which does a better job than Linear Regression on this non-linear data.]
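A minimal regression-tree sketch, assuming scikit-learn; the dosage/effectiveness numbers below are made up for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

dosage = np.array([[5], [10], [15], [20], [25], [30], [35], [40]])
effect = np.array([5, 20, 80, 100, 100, 60, 20, 5])   # non-linear response

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=1)
tree.fit(dosage, effect)
# Each leaf predicts the mean effectiveness of the training samples that fall in it.
print(tree.predict([[12], [22], [38]]))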

Model Evaluation
– R-squared represents the proportion of variance in the output that is explained by the model (see the small example below)
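As a reminder (standard definition, not spelled out on the slide), R-squared is 1 − SS_res / SS_tot; the toy arrays below are made up for illustration.

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)                       # R-squared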

Classification
Classification is the prediction of a categorical output from input attributes/features.
Example:
Instead of finding a discriminative line, we can focus on one class at a time and build a model that describes what that class looks like, and then do the same for the other class. This type of model is called a generative learning algorithm.
[Figure: sea bass vs. salmon example, plotting Width against Lightness for the two classes.]

Classification
Two main types of classification:
– Generative algorithm: builds a model for each of the classes
o Learns p(x|y)
o and then estimates p(y|x) using Bayes' theorem
– Discriminative algorithm: does not build a model for each class, but rather focuses on finding a decision boundary
o Learns p(y|x) directly

Classification
1. Nearest centroid classifier
– Distance-based classifier
– Computes the centroid of each class, μ_c = (1/|S_c|) ∑_{x∈S_c} x, and assigns a new sample to the class with the closest centroid (see the sketch below)
– For complex classes (e.g. multimodal, non-spherical) it may give very poor results
– Cannot handle outliers and noisy data well
– Not very accurate
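A minimal nearest-centroid sketch with NumPy; the toy points and the helper names fit_centroids / predict_nearest_centroid are illustrative assumptions.

import numpy as np

def fit_centroids(X, y):
    # One centroid per class: the mean of that class's training samples.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_centroid(centroids, x):
    # Assign x to the class whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array([0, 0, 1, 1])
centroids = fit_centroids(X, y)
print(predict_nearest_centroid(centroids, np.array([4.5, 4.9])))  # -> 1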

Classification
2. k nearest neighbour classifier (kNN)
– Distance-based classifier
– Find the k nearest neighbours using an appropriate distance metric (e.g. Minkowski distance)
– Predict the output based on the majority vote (see the sketch below)
– Works better with lots of training data and a small number of attributes
– Can be very accurate, but slow at testing time
– Suffers from the curse of dimensionality
– Assumes all attributes are equally important
o Remedy: attribute selection or attribute weights
– Needs homogeneous feature types and scales
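A minimal kNN sketch, assuming scikit-learn; Minkowski distance with p=2 is plain Euclidean distance, and the toy points are made up for illustration.

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)
knn.fit(X, y)
print(knn.predict([[5, 6]]))  # majority vote among the 3 nearest neighbours -> [1]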

Classification
3. Bayesian decision theory (based on Bayes' theorem, P(h|D) = P(D|h) P(h) / P(D))
– The prediction is the most probable hypothesis if the expected loss is equal for all classes:
o Maximum a posteriori: h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
o If P(h_i) = P(h_j) for all hypotheses, we use maximum likelihood: h_ML = argmax_{h_i∈H} P(D|h_i)
– If the expected loss is not the same, then we have to predict the class which minimizes the expected loss
o Expected loss: R(α_i|x) = ∑_{h∈H} λ(α_i|h) P(h|x)

Classification
4. Bayes optimal classification: argmax_{v_j∈V} ∑_{h_i∈H} P(v_j|h_i) P(h_i|D)
– Here we are combining the decisions from multiple hypotheses
– No other classification method using the same hypothesis space and the same prior knowledge can outperform this method on average
– The Bayes optimal classifier is very inefficient

Classification
5. Naïve Bayes classifier
– Uses Bayesian theory
– The main difference is the strong assumption that attributes are conditionally independent: P(x_1, x_2, ..., x_n | v_j) = ∏_i P(x_i | v_j)
– Prediction is based on maximum a posteriori (see the sketch below):
v_NB = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j) = argmax_{v_j∈V} P̂(v_j) ∏_i P̂(x_i | v_j)
– Useful when:
o a moderate or large training set is available
o attributes are conditionally independent (this is usually violated, yet NB can still do a decent job!)
– Having too many redundant attributes will decrease the performance
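A minimal Naïve Bayes sketch, assuming scikit-learn; Gaussian NB treats each attribute as conditionally independent given the class, and the toy data is made up for illustration.

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.5], [3.2, 3.3]])
y = np.array([0, 0, 1, 1])
nb = GaussianNB()
nb.fit(X, y)
# Prediction is the maximum a posteriori class: argmax_v P(v) * prod_i P(x_i | v)
print(nb.predict([[3.1, 3.4]]), nb.predict_proba([[3.1, 3.4]]))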

Classification
6. Decision tree:
– Works in a divide-and-conquer fashion
o Split into subsets
o Check the subset purity
§ Use entropy to measure impurity at each node: E(S) = ∑_{i=1}^{c} −p_i log₂ p_i
§ Use information gain to decide which attribute works best for that node (see the sketch below):
Gain(S, A) = Entropy(S) − ∑_{v∈Values(A)} (|S_v| / |S|) Entropy(S_v)
» However, IG is biased towards attributes with a large number of possible values, so we can use Gain Ratio instead
§ The attribute with the highest information gain is selected for the node
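A small sketch of the entropy and information-gain formulas above (log base 2); the toy labels and the outlook attribute are made up for illustration.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # Entropy of the parent minus the weighted entropy of each child subset.
    total = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        total -= (len(subset) / len(labels)) * entropy(subset)
    return total

y = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "rain", "sunny", "rain", "rain", "sunny"])
print(entropy(y), information_gain(y, outlook))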

Classification
6. Decision tree (continued):
– Decision trees can work with any type of data (discrete and numeric)
– Can handle missing values
– One of the main advantages is interpretability
– Can almost always classify training examples perfectly if allowed to grow enough, which means it can overfit
– To avoid overfitting:
o Pre-pruning: stop growing when a split is not statistically significant (e.g. by a chi-squared test); suffers from stopping too early. Alternatively, limit min_samples_leaf, min_impurity_decrease, max_leaf_nodes or max_depth, etc. (see the sketch below)
o Post-pruning: grow the full tree, then remove sub-trees that cause overfitting, based on cross-validation
– Greedy algorithm (may not find the optimal tree)
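A minimal pre-pruning sketch, assuming scikit-learn: the constructor arguments below cap tree growth before it can overfit; the toy data is made up for illustration.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=3,                 # never grow beyond depth 3
    min_samples_leaf=5,          # every leaf must cover at least 5 samples
    min_impurity_decrease=0.01,  # only split if impurity drops enough
)
X = [[1], [2], [3], [10], [11], [12]] * 3
y = [0, 0, 0, 1, 1, 1] * 3
clf.fit(X, y)
print(clf.predict([[2], [11]]))  # -> [0 1]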

Classification
7. Linear Perceptron (ŷ = f(x) = sgn(w·x))
– Weights are updated iteratively until no mistake is made or the maximum number of iterations is reached (see the sketch below)
– Simple and fast to train
– Doesn't perform well if the classes are not linearly separable
8. Non-linear perceptron
– Map attributes into a new space consisting of polynomial terms and interaction terms
– Use the kernel trick to make the computation much less expensive:
ŷ = sign(∑_{i=1}^{n} α_i y_i (φ(x_i)·φ(x)))
– A valid kernel function is equivalent to a dot product in some space: K(x_i, x_j) = φ(x_i)·φ(x_j)
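A minimal perceptron training sketch with NumPy: weights are updated only on mistakes, until an epoch with no mistakes or the iteration cap is reached. The toy data and the function name train_perceptron are illustrative assumptions.

import numpy as np

def train_perceptron(X, y, eta=1.0, max_iter=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        mistakes = 0
        for xi, yi in zip(X, y):          # labels are +1 / -1
            if yi * (w @ xi + b) <= 0:    # misclassified (or on the boundary)
                w += eta * yi * xi
                b += eta * yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # should match y for this linearly separable toy set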

Classification
9. Linear Support Vector Machine (maximum margin)
– ŷ = sign(w·x − t)
– w = ∑_{x_i ∈ support vectors} α_i y_i x_i
– α_i is non-zero only for the support vectors
– Effective in high-dimensional data
– Effective when the number of dimensions is bigger than the number of samples
10. Nonlinear SVM
– Similar to the perceptron, the kernel trick can be applied using the dual form (see the sketch below)
– ŷ = sign(∑_{i=1}^{n} α_i y_i K(x_i, x) − t)
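A minimal SVM sketch, assuming scikit-learn: a linear (maximum-margin) SVM and a nonlinear one using the RBF kernel trick, on toy data with a circular class boundary (made up for illustration).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
# The dual coefficients are non-zero only for the support vectors.
print(linear_svm.score(X, y), rbf_svm.score(X, y), len(rbf_svm.support_))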
Bias-Variance Tradeoff
[Figure: underfitting vs. good fit vs. overfitting compared to the truth. Source: Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff]

Bias-Variance
• Bias-variance:
– Bias: the inability of the learning algorithm to capture the true relationship between the output and the features/attributes is called bias
o due to the model choice (e.g. the model is not complex enough)
– Variance: the difference in the learning algorithm's fits between datasets is called variance
o due to small sample size
o or high complexity of the model

Bias-Variance Tradeoff
• Often, the aim is to find the balance between bias and variance such that we have a good bias-variance tradeoff and minimize the overall (total) error.
• There is no quantitative way to find this balanced error point. Instead, you will need to evaluate your model (preferably on an unseen validation data set) and adjust its complexity until you find the point that minimizes the overall error.
• Methods to find a good bias-variance trade-off:
o Regularization
o Ensemble learning in general
o Bagging
o Boosting

Ensemble Learning
Ensemble methods: meta-algorithms that combine different models into one model
1. Simple ensembles: combining several learning algorithms
o Majority vote or unweighted average is used for prediction
o Or use a weighted average / weighted vote to predict the output
o Or treat the output of each algorithm as a feature and train another learning algorithm on them
2. Mixture of experts
o Each learning algorithm defines α_i(x), which indicates the expertise of that algorithm for that particular location x in the input space
o It may use a weighted average or just pick the model with the largest expertise

Ensemble Learning
3. "Bagging" method ("Bootstrap Aggregation")
o Train many classifiers, each on a bootstrapped dataset
§ Bootstrap: create a random subset of the data by sampling with replacement
§ Bagging: repeat k times to generate k subsets
o Then aggregate through model averaging / majority voting
o Bagging is applied to a collection of low-bias, high-variance models
§ averaging them does not affect the bias
§ averaging them reduces the variance

Ensemble Learning
4. Add randomization to the models to introduce more diversity, for example:
– For every model use a subset of features, selected randomly, e.g. in Random Forest (this can also help with training time); see the sketch below
– For algorithms that depend on initial weights, use different random initial weights
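A minimal bagging sketch, assuming scikit-learn: many high-variance trees are trained on bootstrap samples and combined by majority vote; Random Forest adds feature randomization on top. The synthetic dataset is made up for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Plain bagging: 50 trees, each fit on a bootstrap sample, aggregated by voting.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagged_trees.fit(X, y)

# Random Forest also samples a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(bagged_trees.score(X, y), forest.score(X, y))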
Ensemble Learning
5. Boosting: a sequence of weak learners, each trying to correct its predecessor
– Learners are trained sequentially
– New learners focus on the errors of earlier learners
– New learners try to get misclassified samples right by operating on a training set re-weighted in favour of misclassified instances
– Combine all learners at the end using a weighted majority vote / weighted average of the k learners

Ensemble Learning
– AdaBoost is a boosting algorithm using stump trees (see the sketch below)
o Misclassified instances gain higher weights
o Correctly classified instances lose weight
– Main advantages:
o Uses very simple (weak) learners
o Boosts the performance
o Decreases bias
o Decreases variance
– Slow during training, and lacks interpretability
– Gradient Boosting is a boosting algorithm using stump trees for regression
o At every step it models the residuals
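A minimal boosting sketch, assuming scikit-learn: AdaBoost over decision stumps (trees of depth 1), where each new stump focuses on the previously misclassified samples. The synthetic dataset is made up for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(stump, n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))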
Neural Networks
• Neural Nets: composed of a large number of interconnected processing elements known as neurons
– They use supervised, error-correcting rules with back-propagation to learn a specific task

Neural Networks
– Perceptron: the output is a thresholded sum of products of the inputs and their weights
o Perceptron learning is simply an iterative weight update (w′ = w + η y_i x_i)
– Multilayer Perceptrons
o can represent arbitrary functions
o consist of an input, hidden and output layer, each fully connected to the next, with activation feeding forward
– Neural nets are most useful when:
o the input is high dimensional
o the form of the target function is unknown
o interpretability is not important

Neural Networks
– Deep Learning: similar to regular neural nets, just with more layers
o Relies on large amounts of data
o Deeper learning architecture
– Convolutional Neural Net (CNN): among the most well-known deep learning models
o Neurons are arranged in 3 dimensions (width, height and depth)
o Uses a parameter-sharing scheme that minimizes the number of parameters
o Neurons in each layer are only connected to a small region of the layer before it (not fully connected)
o The parameters of each layer play the role of a filter which is applied locally
o The pooling layer progressively reduces the spatial size of the representation to reduce the number of parameters, which also helps with overfitting

Neural Networks
– To avoid overfitting:
o Dropout layers
§ In each forward pass, randomly set some neurons to zero
o Early stopping
o Reduce the network's capacity by removing some layers
o Regularisation: adding a cost to the loss function for large weights
o Data Augmentation
§ Increases the data size
§ Rotation, cropping, scaling, flipping, Gaussian filtering

Evaluation of classification
For the two-class prediction case, results can be summarised in a contingency table (also called a confusion matrix):

                          Predicted Positive (Yes)   Predicted Negative (No)
Actual Positive (Yes)     True Positive (TP)         False Negative (FN)
Actual Negative (No)      False Positive (FP)        True Negative (TN)

• acc = (1 / |Test|) ∑_{x∈Test} I[ĉ(x) = c(x)]
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 = 2 · (Precision · Recall) / (Precision + Recall)
• AUC (area under the ROC curve)

Missing Values
How to handle missing values (common approaches):
o Deleting samples with missing values
o Replacing the missing value with some statistic of the data (mean, median, ...)
o Assigning a unique category
o Predicting the missing values
o Using algorithms that support missing values

Model (Feature) Selection
• Taking all the features will lead to an overly complex model. There are 3 ways to reduce complexity:
– Subset selection: feature forward selection or feature backward selection
– Shrinkage, or regularization of coefficients towards zero, by optimization. There is a single model, and unimportant variables have near-zero coefficients.
– Dimensionality reduction, by projecting points into a lower-dimensional space

Data Normalization
• Normalization is usually a data pre-processing step that changes the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
– Most distance-based machine learning algorithms require normalization as a pre-processing step if the features do not have the same scale
– Most common normalization techniques:
o Min-max normalization: x′ = (x − min(x)) / (max(x) − min(x))
o Z-score (standardization): x′ = (x − x̄) / σ

Validation
• Hold-out method: split the data once into a training set and a held-out test set
• K-fold cross validation: split the data into k folds; each fold is used once as the test set and the scores are averaged (see the sketch below)
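A minimal sketch of hold-out and k-fold cross-validation, assuming scikit-learn; the synthetic dataset and the choice of logistic regression are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# Hold-out: a single train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te))

# K-fold: each of the 5 folds is used once as the test set; scores are averaged.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())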
Unsupervised Learning
Unsupervised learning: classes are initially unknown and need to be "discovered", with their definitions, from the data
• It is useful for:
– Dimensionality reduction (simplify the problem, get rid of redundant features)
– exploratory data analysis
– grouping data instances into subsets
– discovering structure, like hierarchies of subconcepts
– learning new "features" for later use in classification
– tracking "concept drift" over time

Clustering
• Goal: form homogeneous and well-separated clusters
• Success of clustering is often measured subjectively
• There are two broad types of clustering:
– Hierarchical methods
– Partitioning methods

Clustering
1. K-means
– Initialize k random centers from the data
– Assign each instance to the closest center, re-compute the centers using the mean or a weighted average, and re-iterate
– Simple and can be an efficient clustering method
– Not easy to choose k
– Different initializations can result in different clusters
– Sensitive to outliers

Clustering
2. Expectation Maximization
– Similar to k-means
– Computes probabilities of cluster memberships based on one or more probability distributions (e.g. a mixture of Gaussians)
– The goal is to maximize the overall probability or likelihood of the data, given the (final) clusters
– Easy with an independence assumption

Clustering
3. Hierarchical clustering
– Agglomerative: starts by treating each object as a singleton cluster and gradually merges clusters based on similarity
– Divisive: starts by including all objects in a single large cluster. At each iteration, the most heterogeneous cluster is divided into two. The process is repeated until every object is in its own cluster.
– Does not require specifying the number of clusters in advance
– Different linkage methods can produce very different dendrograms

Clustering
• Finding the number of clusters:
– Elbow method: using the within-cluster dispersion
– Gap statistic: based on the within-cluster variance of the original data and B sets of resampled data, Gap(k) = (1/B) ∑_b log(W_kb) − log(W_k)
– Choose the number of clusters as the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1
• Quality of clusters
– if clusters are known, measure the proportion of disagreements to agreements
– if unknown, measure homogeneity and separation
– silhouette method

Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of features/attributes
– Helps with removing redundant/correlated features
– Helps with the curse of dimensionality
1. Principal Component Analysis (PCA): captures the directions of the most variation in the original data space (see the sketch below)
– Features are not correlated in the new space (they are orthogonal)
– The new dimensions are computed from the eigenvectors and eigenvalues of the covariance matrix of the data (rows are observations and columns are features)
– Note: features have to be normalized before applying PCA
2. Autoencoders: a neural network model – the encoder transforms the data into a smaller dimension such that the decoder can then reconstruct it with minimum error
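A minimal PCA sketch, assuming scikit-learn: features are standardized first, then projected onto the directions of largest variance. The synthetic data (with one deliberately redundant feature) is made up for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)   # a redundant, correlated feature

X_std = StandardScaler().fit_transform(X)            # normalize before PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
# The new dimensions are orthogonal; explained_variance_ratio_ shows how much
# of the total variance each principal component captures.
print(X_2d.shape, pca.explained_variance_ratio_)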