WiSe 2021/22
Machine Learning 1/1-X
Lecture 1 Intro + Bayes
Introduction: Machine Learning
What is Machine Learning?
• Designing algorithms (machines) that efficiently convert finite sets of data (observations) into models with predictive capability.
Why Machine Learning?
• Autonomous Decision Making: Substitute/support human decision making for achieving gains in efficiency in practical applications.
• Knowledge Discovery: Finding laws that explain empirical phenomena.
Autonomous Decision Making
Modern economies involve making good decisions in complex, multi-dimensional problems. In many cases, for efficiency reasons, these decisions must be taken autonomously, or with the help of predictive models.
• Supply Chain Management (demand forecasting, planning)
• End-User Services (recommendations, personalization, translation, monitoring)
• Manufacturing (quality control, materials optimization)
• Finance/Insurance (risk management, forecasting)
Example: Demand Forecasting (image source: IBM)
Example: Elderly Care / Fall Detection (adapted from Islam et al. (2020), Deep Learning Based Systems Developed for Fall Detection: A Review)
Example: Surface Inspection (image source: MvTec)
Example: Forecasting (prices/claims) (image source: https://www.kaggle.com/robikscube/m5-forecasting-starter-data-exploration)
1st-Gen Decision Systems
• The human programs the decision system by hand (e.g. using if/else controls) so that it replicates their own decision strategy.
• If the system performs as expected on the few available test cases, it is then run autonomously on new instances.
• What if the user is not able to translate their own decision behavior into an actual program? (e.g. how does one detect objects, such as a scratch on wood, in natural images?)

Source: https://sites.google.com/site/keremitgsnotes
2nd-Gen Decision Systems
Idea: The human collects a dataset of examples, and labels them according to their own decision strategy, e.g. good examples vs. examples with scratches.

A machine learning model is then trained to map each example (i.e. the array of pixels received as input) to the correct class.
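As an illustrative sketch of this training step (the data arrays below are random stand-ins for real labeled images, and logistic regression is just one plausible choice of model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset: N grayscale images of 32x32 pixels, flattened
# into vectors, with labels assigned by the human (0 = good, 1 = scratch)
rng = np.random.default_rng(0)
X = rng.random((200, 32 * 32))        # stand-in for real pixel data
y = rng.integers(0, 2, size=200)      # stand-in for human labels

# Train a model that maps pixel arrays to the correct class
model = LogisticRegression(max_iter=1000).fit(X, y)

# The trained model can now decide autonomously on new instances
new_image = rng.random((1, 32 * 32))
print(model.predict(new_image))       # -> array([0]) or array([1])
```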
Bayes Decision Theory
Bayesian Decision Theory
A theoretical framework for building decision models that:
• Yields optimal classifiers (under some favourable conditions).
• Helps us to understand how
  – properties of the data distribution (e.g. Gaussian-distributed data),
  – prior probabilities of classes,
  – the criterion to optimize (e.g. maximum classification accuracy),
  affect the classifier qualitatively and quantitatively.
Bayes Decision Theory
Example (from the book of Duda et al., 2000):
• Fishes of various species (salmon and sea bass) arrive on a conveyor belt.
• Sensors placed on the conveyor belt produce a collection of measurements for each observed fish (e.g. length, lightness).
• We would like to build a decision model that assigns each fish to one of the two possible classes (salmon or sea bass).
Bayesian Decision Theory
• ω1, ω2, …: the set of classes (e.g. salmon, sea bass, …),
• x ∈ ℝd: the vector of observations (e.g. x1 is the length and x2 is the lightness).

We are also given the probability laws:
• P(ωj): the prior probability of being of class j,
• p(x | ωj): the density function of measurements for each class,
• p(x): the density function of measurements (marginalized over classes),

but what we are truly interested in is:
• P(ωj | x): the probability of being of a certain class after observing x.
Bayesian Decision Theory
Bayes' Theorem:

P(ωj | x) = p(x | ωj) · P(ωj) / p(x)
Image source: Duda et al. (2000)
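As a quick illustration of the theorem, a minimal sketch (with made-up priors and univariate Gaussian class-conditional densities) computing the posterior from the quantities defined above:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1D example: two classes with known priors and
# class-conditional densities p(x | w_j) (here univariate Gaussians)
priors = np.array([0.3, 0.7])               # P(w_1), P(w_2)
densities = [norm(loc=0.0, scale=1.0),      # p(x | w_1)
             norm(loc=2.0, scale=1.0)]      # p(x | w_2)

x = 1.2                                     # observed measurement

# Numerator of Bayes' theorem for each class: p(x | w_j) * P(w_j)
joint = np.array([d.pdf(x) for d in densities]) * priors

# The evidence p(x) is obtained by marginalizing over the classes
evidence = joint.sum()

posterior = joint / evidence                # P(w_j | x), sums to 1
print(posterior)
```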
Optimally Accurate Classifier
optimal decision function: argmaxj P(ωj | x)
Image source: Duda et al. (2000)
Alternate formulations of the decision (equivalent, since dividing by p(x) > 0 and applying the monotonically increasing log do not change the arg max):

argmaxj P(ωj | x) = argmaxj [ p(x | ωj) · P(ωj) ] = argmaxj [ log p(x | ωj) + log P(ωj) ]
Multivariate Normal Distributions
p(x | ωj) = 1 / √((2π)d · det(Σj)) · exp( −½ (x − μj)ᵀ Σj⁻¹ (x − μj) )

where
• μj is the mean (the center of the data distribution),
• Σj is the covariance (the elongation of the data distribution and the correlation between dimensions).
Image source: Duda et al. (2000)
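A small sanity-check sketch (with arbitrary example parameters) showing that the formula above matches scipy's built-in implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary example parameters for one class
mu = np.array([0.0, 1.0])                     # mean mu_j
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                # covariance Sigma_j

x = np.array([0.3, 0.8])                      # observation

# Density computed directly from the formula
d = len(mu)
diff = x - mu
norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
p_manual = norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Same density via scipy
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

assert np.isclose(p_manual, p_scipy)
```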
Classifier for Gaussians (Σ1 = Σ2)
Recall: The optimal classifier is argmaxj [ log p(x | ωj) + log P(ωj) ], and we have the data distributions:

p(x | ωj) = 1 / √((2π)d · det(Σj)) · exp( −½ (x − μj)ᵀ Σj⁻¹ (x − μj) )

Taking the log and discarding the terms that are identical for both classes (the normalization constant and the quadratic term xᵀΣ⁻¹x, since Σ1 = Σ2 = Σ), the decision function becomes linear in x.
Classifier for Gaussians (Σ1 = Σ2)
Image source: Duda et al. (2000)
• The decision boundary is linear, and is oriented by the means and the covariance.
• The offset is controlled by the class prior probabilities.
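A minimal sketch of the resulting linear classifier, assuming hypothetical known parameters (μ1, μ2, a shared Σ, and the class priors):

```python
import numpy as np

# Hypothetical known parameters (shared covariance)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
P1, P2 = 0.4, 0.6                        # class priors

Sigma_inv = np.linalg.inv(Sigma)

# Expanding log p(x|w_j) + log P(w_j) and dropping shared terms
# gives a linear discriminant: decide w_1 if w @ x + b > 0
w = Sigma_inv @ (mu1 - mu2)
b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2) + np.log(P1 / P2)

x = np.array([1.5, 0.0])
print("w1" if w @ x + b > 0 else "w2")
```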
Classifier for Gaussians (Σ1 ≠ Σ2)
• When the covariances Σ1 and Σ2 are not the same, the decision boundary is a quadric instead of linear. Quadrics include circles, ellipses, parabolas, hyperbolas, and degenerate forms.
Image source: Duda et al. (2000)
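Correspondingly, a minimal sketch of the quadratic discriminant when each class carries its own covariance (all parameters below are made up):

```python
import numpy as np

# Hypothetical per-class parameters (different covariances)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 0.5]])]
log_priors = np.log([0.5, 0.5])

def g(x, mu, Sigma, log_prior):
    """log p(x|w_j) + log P(w_j), up to the shared constant -d/2 log(2*pi)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma)) + log_prior)

x = np.array([1.0, 0.5])
scores = [g(x, m, S, lp) for m, S, lp in zip(mus, Sigmas, log_priors)]
print(np.argmax(scores))   # the quadratic terms now differ across classes
```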
Classifying Non-Numerical Data
Example: Spam Classifier
• Observation: Messages M that need to be handled by the spam classifier come as text and not as numerical vectors.
• Common approach: Represent a message M as a collection of binary predicates testing for typical spam words, and form a vector containing these predicates:

x = ( 1{“gift”∈M}, 1{“relief”∈M}, 1{“pain”∈M}, …, 1{“claim”∈M} ) ∈ {0, 1}d
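A minimal sketch of this featurization (the word list is purely illustrative):

```python
import numpy as np

# Hypothetical list of typical spam words (illustrative only)
SPAM_WORDS = ["gift", "relief", "pain", "claim"]

def featurize(message: str) -> np.ndarray:
    """Map a text message to a binary vector x in {0,1}^d,
    where x_i = 1 iff the i-th spam word occurs in the message."""
    words = set(message.lower().split())
    return np.array([1 if w in words else 0 for w in SPAM_WORDS])

print(featurize("Claim your free gift now"))   # -> [1 0 0 1]
```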
Classifying Binary Data
• Assume that our data is binary, i.e. x ∈ {0, 1}d, with each dimension generated independently according to some Bernoulli distribution:

P(xi = 0 | ωj) = 1 − qij        P(xi = 1 | ωj) = qij

where the qij are the parameters.
• The probability of the whole multivariate observation can then be written as:

P(x | ωj) = ∏i=1…d [ qij · xi + (1 − qij) · (1 − xi) ]

• Question: How do we express the optimal decision boundary argmaxj P(ωj | x)?
Classifying Binary Data
Recall: The optimal classifier is argmaxj [ log p(x | ωj) + log P(ωj) ], and we have the data distributions:

P(x | ωj) = ∏i=1…d [ qij · xi + (1 − qij) · (1 − xi) ]

Since each xi is binary, taking the log gives Σi [ xi · log qij + (1 − xi) · log(1 − qij) ], so the decision function is once again linear in x.
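A minimal sketch (with made-up parameters qij and priors) of the resulting decision rule:

```python
import numpy as np

# Hypothetical parameters: q[j, i] = P(x_i = 1 | w_j)
q = np.array([[0.8, 0.1, 0.3],     # class w_1
              [0.2, 0.7, 0.6]])    # class w_2
log_priors = np.log([0.5, 0.5])

def classify(x: np.ndarray) -> int:
    """Return arg max_j [log P(x|w_j) + log P(w_j)] for binary x."""
    log_lik = (x * np.log(q) + (1 - x) * np.log(1 - q)).sum(axis=1)
    return int(np.argmax(log_lik + log_priors))

print(classify(np.array([1, 0, 0])))   # -> 0 (class w_1)
```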
Minimum Cost Decisions
Example: Buying a Car
• Suppose you would like to purchase a second-hand car. After observing the car (collecting a vector of measurements x), you assess that it has a defect with probability

P(defect | x) = 0.1        P(no defect | x) = 0.9

• Concretely, the decision you need to take is not to classify whether the car has a defect or not, but whether to buy the car or not.
• For this, we need to evaluate the cost of each scenario, e.g.

cost(buy | defect) = 100.0        cost(buy | no defect) = −20.0
cost(not buy | defect) = 0.0      cost(not buy | no defect) = 0.0

and take the action with the lowest expected cost.
Minimum Cost Decisions
General problem formulation:
• Let (αk)k be the set of actions. The expected cost “λ” of taking a certain action is given by

λ(αk | x) = Σj λ(αk | ωj) · P(ωj | x)

where λ(αk | ωj) is the cost of taking action αk given the class ωj.
• The optimal action to take is therefore: argmink λ(αk | x)
Car example (cf. previous slide):

λ(buy | x) = 100.0 · 0.1 + (−20.0) · 0.9 = −8.0
λ(not buy | x) = 0.0 · 0.1 + 0.0 · 0.9 = 0.0

Since λ(buy | x) < λ(not buy | x), the expected-cost-minimizing action is to buy the car.
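The same computation as a minimal sketch:

```python
import numpy as np

# Rows: actions (buy, not buy); columns: classes (defect, no defect)
cost = np.array([[100.0, -20.0],
                 [  0.0,   0.0]])
posterior = np.array([0.1, 0.9])        # P(defect|x), P(no defect|x)

expected_cost = cost @ posterior        # lambda(a_k | x) for each action
best = np.argmin(expected_cost)
print(expected_cost, ["buy", "not buy"][best])   # [-8. 0.] buy
```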
Classification Accuracy Special Case
Exercise: Show that the problem of maximum accuracy classification is a special instance of expected cost minimization, with a particular set of actions (αk)k and a particular cost function λ(αk | ωj).

Recall: λ(αk | x) = Σj λ(αk | ωj) · P(ωj | x)
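One possible solution sketch: choose one action per class, αk = “decide ωk”, and the 0/1 cost λ(αk | ωj) = 1 if k ≠ j and 0 if k = j. Then λ(αk | x) = Σj≠k P(ωj | x) = 1 − P(ωk | x), and argmink λ(αk | x) = argmaxk P(ωk | x), which is exactly the maximally accurate classifier.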
Measuring Classification Error
• So far, we have studied what the decision boundary should be in order to predict optimally.
• However, in certain cases, it is also important to determine the expected error of the classifier (e.g. to determine whether the classifier is good enough for practical use).
• The expected error is the probability that the data is of a different class than the one predicted, e.g. for a binary classifier:

P(Err | x) = P(ω1 | x) if we decide ω2,  and  P(Err | x) = P(ω2 | x) if we decide ω1.

• For the Bayes optimal classifier, this reduces to

P(Err | x) = min{ P(ω1 | x), P(ω2 | x) }
Measuring Classification Error
• The expected error of this maximally accurate classifier is computed as the integral of its error probability over the distribution p(x):

P(Err) = ∫ P(Err | x) · p(x) dx = ∫ min{ P(ω1 | x), P(ω2 | x) } · p(x) dx

This is also known as the Bayes error rate.
• Generally, this integral cannot be solved analytically because of the min function. The error must instead be evaluated numerically/empirically, or it can be bounded analytically.
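Since the integral is generally intractable, here is a minimal Monte Carlo sketch (for two made-up univariate Gaussian classes with equal priors) that estimates the Bayes error rate empirically:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical two-class 1D problem
priors = np.array([0.5, 0.5])
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]

# Sample x ~ p(x): first sample the class, then x given the class
n = 100_000
c = rng.choice(2, size=n, p=priors)
x = np.where(c == 0, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

# P(Err | x) = min_j P(w_j | x); average it over the samples
joint = np.stack([d.pdf(x) * P for d, P in zip(densities, priors)])
posteriors = joint / joint.sum(axis=0)
print(posteriors.min(axis=0).mean())   # ~0.159 here (exactly Phi(-1))
```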
Bounding the Error of the Classifier
Very basic bound:
• Observe for binary classification that P(ω2 | x) = 1 − P(ω1 | x).
• The error of an optimal binary classifier can then be bounded as:

P(Err) = ∫ min{ P(ω1 | x), P(ω2 | x) } · p(x) dx
       = ∫ min{ P(ω1 | x), 1 − P(ω1 | x) } · p(x) dx
       ≤ ∫ 0.5 · p(x) dx = 0.5

i.e. the classifier predicts the correct class at least 50% of the time.
• Note that, unlike an empirical evaluation, this result is general and independent of the data distributions.
Bounding the Error of the Classifier
Another simple bound:

P(Err) = ∫ min{ P(ω1 | x), P(ω2 | x) } · p(x) dx
       = ∫ min{ P(ω1 | x) · p(x), P(ω2 | x) · p(x) } dx
       = ∫ min{ p(x | ω1) · P(ω1), p(x | ω2) · P(ω2) } dx        (recall: P(ωj | x) · p(x) = p(x | ωj) · P(ωj))
       ≤ ∫ min{ Σj p(x | ωj) · P(ω1), Σj p(x | ωj) · P(ω2) } dx
       = ( Σj ∫ p(x | ωj) dx ) · min{ P(ω1), P(ω2) }
       = 2 · min{ P(ω1), P(ω2) }

Additional insight: The optimal classifier improves its accuracy when one class prior probability strongly dominates the other.
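And a quick numerical check of this bound, reusing the Monte Carlo estimator above with strongly skewed priors (so that 2 · min{P(ω1), P(ω2)} = 0.2 in this made-up setting):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
priors = np.array([0.9, 0.1])              # strongly skewed priors
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]

n = 100_000
c = rng.choice(2, size=n, p=priors)
x = np.where(c == 0, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

joint = np.stack([d.pdf(x) * P for d, P in zip(densities, priors)])
bayes_error = (joint.min(axis=0) / joint.sum(axis=0)).mean()
print(bayes_error, "<=", 2 * priors.min())  # ~0.07 <= 0.2: bound holds
```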
Machine Learning
• Paradigm that provides a solution to the practically highly relevant problem of autonomous decision making.
• Spares the user the task of specifying the decision function by hand, and instead infers it automatically from the data.
Bayes Decision Theory
• Framework that allows us to build optimal machine learning classifiers, assuming full knowledge of the class probabilities and the data distributions.
• Bayesian decision theory highlights the effect of the class priors, of the parameters of the data distribution, and of the specification of the cost function, on the optimal decision function and the expected cost.