What is MACHINE LEARNING?
Prof. Dan A. Simovici
UMB
Outline
1 A Formal Model
2 Empirical Risk Minimization (ERM)
3 ERM with Inductive Bias
4 An Example : Regression
What is Machine Learning?
Machine learning (ML) studies the construction and analysis of algorithms that learn from data.
ML algorithms construct models starting from samples of data and use these models to make predictions or decisions.
ML and its applied counterpart, data mining, mostly deal with problems whose complexity makes it difficult to formulate algorithms that can be readily translated into programs.
ML techniques avoid a difficulty of standard problem-solving techniques, namely the requirement of a complete understanding of the data at the beginning of the problem-solving process.
Typical ML Activities
Example
finding diagnoses for patients starting from a series of their symptoms;
determining the credit worthiness of customers based on their demographics and credit history;
classifying documents based on their main topic;
speech recognition;
applications in computational biology.
Supervised Learning
Often ML aims to compute a label for each analyzed piece of data, a label that depends on the characteristics of the data.
The general approach, known as supervised learning, is to begin with a number of labelled examples (for which the answers are known or are provided by a supervisor), known as the training set.
The goal is to generate an algorithm that computes the function giving the labels of the remaining examples.
Unsupervised Learning
In unsupervised learning the challenge is to identify structure hidden in the data, e.g., to identify groups of data such that objects belonging to the same group are strongly similar, while objects belonging to different groups are sufficiently distinct.
This activity is known as clustering and it is a typical example of unsupervised learning.
The term “unsupervised” refers to the fact that this type of learning does not require operator intervention. Other machine learning activities of this type include outlier identification, density estimation, etc.
Semi-supervised Learning
An intermediate type of activity, referred to as semi-supervised learning, requires limited involvement of the operator.
For example, in the case of clustering, this may allow the operator to specify pairs of objects that must belong to the same group and pairs of objects that may not belong to the same group.
Quality of the Learning Process
The quality of the learning process is assessed through its capability for generalization, that is, the capacity of the produced algorithm for computing correct labels for yet unseen examples.
the correct behavior of an algorithm on the training data is, in general, no guarantee of its ability to generalize;
sometimes the pursuit of a perfect fit of the learning algorithm to the training data leads to overfitting; this term describes the situation in which the algorithm acts correctly on the training data but is unable to predict the labels of unseen data;
in an extreme case, a rote learner will memorize the labels of its training data and nothing else. Such a learner is perfectly accurate on its training data but completely lacks any generalization capability.
Active and Reinforcement Learning
A machine learning algorithm can achieve greater accuracy with fewer training examples if it is allowed to choose the data from which it learns, that is, to apply active learning.
An active learner may pose queries, soliciting a human operator to label a data instance. Since unlabelled data is abundant and, in many cases, easily obtained, there are good reasons to use this learning paradigm.
Reinforcement learning is a machine-learning paradigm, inspired by psychology, that emphasizes learning by an agent from its direct interaction with the data in order to attain certain learning goals, e.g., accuracy of label prediction.
The framework of this type of learning makes use of states and actions of an agent, and their rewards, and deals with uncertainty and nondeterminism.
A Formal Model
The Learner’s Input
The domain set: X consists of the objects that we wish to label; usually objects are represented by a vector of features. We refer to these objects as instances.
The label set: Y is generally a finite set, e.g. {0, 1} or {−1, 1}.
Training data: S = {(x1,y1),…,(xm,ym)} is a finite sequence of pairs in X × Y, that is, a sequence of labelled objects. Each pair (xi , yi ) is a training example.
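As a small illustration of these ingredients, a minimal sketch in MATLAB/Octave syntax (matching the ML statements used later in these notes; the names X and y and the feature values are invented for illustration): a training sample with m = 4 two-dimensional instances and labels in {−1, 1} can be stored as a feature matrix together with a label vector.
% Each row of X is an instance x_i in R^2; y(i) is the label of x_i.
X = [1.0 2.0;
     0.5 1.5;
     3.0 0.2;
     2.5 1.0];
y = [1; 1; -1; -1];
m = size(X, 1);   % number of training examples in S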
The Learner’s Output
The learner is required to produce a function h : X −→ Y starting from the values h(x1) = y1, h(x2) = y2, . . . , h(xm) = ym provided by the training data S. This function is known as a
predictor, or a hypothesis, or a classifier.
A Data Generation Model
Assumptions:
the data has a probability distribution D;
the learner does not know the probability distribution D;
there exists some correct labelling function f : X −→ Y, which we seek to determine knowing that f(xi) = yi for 1 ≤ i ≤ m.
Measures of Success
Definition
The true error of a prediction rule h : X −→ Y is
$$L_{(D,f)}(h) = D(\{x \mid h(x) \neq f(x)\}).$$
L(D,f)(h) is a loss measure: it evaluates the probability that the prediction rule h produces a result that differs from the one given by the labelling function f.
The error is measured with respect to probability distribution D and the correct labelling function f .
Alternative terms used for L(D,f)(h):
generalization error;
risk;
the true error of h.
The true error L(D,f )(h) of h is not known to the learner because D and f are unknown.
Empirical Risk Minimization (ERM)
The Training Error
The empirical error of a predictor h can be considered only in relation with a labelled sample.
Definition
Let S = {(x1, y1), . . . , (xm, ym)} be a labelled sample. The empirical error of a predictor h on the sample S is the number
$$L_S(h) = \frac{|\{i \mid 1 \leqslant i \leqslant m,\ h(x_i) \neq y_i\}|}{m}.$$
An alternative name for LS (h) is empirical risk.
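A minimal sketch of how LS(h) can be computed, in MATLAB/Octave syntax; the names empirical_error, h, X, and y are assumptions for illustration (h is a function handle, the rows of X are the instances, y holds their labels).
% Empirical error (empirical risk) of the predictor h on the sample S = (X, y).
function err = empirical_error(h, X, y)
  m = size(X, 1);
  predictions = zeros(m, 1);
  for i = 1:m
    predictions(i) = h(X(i, :));    % label predicted for the instance x_i
  end
  err = sum(predictions ~= y) / m;  % fraction of misclassified training examples
end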
Empirical Risk Minimization (ERM)
ERM is an approach that seeks a predictor that minimizes the training error LS(h).
Let h be the predictor defined by h(xi) = yi for every xi that appears in S, and h(x) = k for any other x, where k is a label that does not label any object in S. The empirical error of h on S is 0, but h will fail miserably on unseen data. This phenomenon is called overfitting: designing a predictor that merely fits the sample.
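The memorizing predictor described above can be written down explicitly. The sketch below (MATLAB/Octave; the helper names memorizing_predictor and predict_one are hypothetical) returns the stored label when the input coincides with a training instance and the fixed label k otherwise, so its empirical error on S is 0 while it learns nothing that generalizes.
% Build a predictor that memorizes the sample (X, y) and answers k elsewhere.
function h = memorizing_predictor(X, y, k)
  h = @(x) predict_one(x, X, y, k);
end

function label = predict_one(x, X, y, k)
  for i = 1:size(X, 1)
    if isequal(x, X(i, :))
      label = y(i);   % instance seen in S: return its stored label
      return;
    end
  end
  label = k;          % unseen instance: output a label not occurring in S
end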
ERM with Inductive Bias
Inductive Bias
ERM can lead to overfitting. Therefore, we seek supplementary conditions that ensure that ERM will not overfit (conditions under which a predictor with good performance on the training data will have good performance on unseen data).
Common solution:
Use a restricted hypothesis class H chosen in advance, that is, before seeing the data.
This approach is known as the inductive bias.
For a given class H (known as the hypothesis class) and a training sample S, the hypothesis
h = ERMH(S)
is obtained by using the ERM rule to choose a predictor h ∈ H with the lowest possible error over S.
Both large values of LS(h) and a strong inductive bias are undesirable; the question is how to achieve a balance between these factors.
Let argminh∈H LS(h) be the set of hypotheses in H that achieve the minimum value of LS(h). This approach aims to have
ERMH(S) ∈ argminh∈H LS(h).
Definition
hS is the hypothesis that results from applying ERMH to the sample S, namely
hS ∈ argminh∈H LS(h).
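For a finite class H, the rule ERMH can be implemented by exhaustive search over H. The following sketch (MATLAB/Octave; the names erm_finite and hypotheses are assumptions) takes H as a cell array of function handles and returns a hypothesis hS that minimizes LS(h) over H.
% Exhaustive ERM over a finite hypothesis class given as a cell array of handles.
function h_S = erm_finite(hypotheses, X, y)
  m = size(X, 1);
  best_err = Inf;
  h_S = hypotheses{1};
  for j = 1:numel(hypotheses)
    h = hypotheses{j};
    predictions = arrayfun(@(i) h(X(i, :)), (1:m)');
    err = sum(predictions ~= y) / m;   % empirical error L_S(h)
    if err < best_err                  % keep a hypothesis with minimal empirical error
      best_err = err;
      h_S = h;
    end
  end
end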
Finite Hypothesis Classes
A simple inductive bias: the hypothesis class H is finite.
Definition
The Realizability Assumption: there exists h∗ ∈ H whose true error is zero, that is, L(D,f)(h∗) = 0.
This implies that, with probability 1 over random samples S whose instances are sampled according to D and labelled by f, we have LS(h∗) = 0.
The realizability assumption implies that for every ERM hypothesis hS we have LS(hS) = 0 with probability 1.
Samples are obtained by drawing values from a distribution D independently of each other.
Since samples are drawn randomly from D, the true error L(D,f )(hS ) is a random variable.
We cannot predict with certainty that a sample S will suffice to direct the learner towards a good classifier.
Example
To predict whether a papaya is tasty or not without ever tasting papayas, one needs to decide which features of a papaya the prediction should be based on.
On the basis of your past experience with other fruits, two features will be used: the papaya’s color, ranging from dark green, through orange and red to dark brown, and the papaya’s softness, ranging from rock hard to mushy.
The input for figuring out the prediction rule is a sample of papayas that are examined for color and softness and then tasted and found out whether they were tasty or not.
Example
There is always some (small) chance that all the papayas we have happened to taste were not tasty (a non-representative sample), in spite of the fact that, say, 70% of the papayas are tasty. In such a case, ERMH(S) may be the constant function that labels every papaya as “not tasty” (and has 70% error on the true distribution of papayas).
We will therefore address the probability of sampling a training set for which L(D,f)(hS) is not too large. Usually, we denote the probability of getting a nonrepresentative sample by δ, and refer to 1 − δ as the confidence parameter of our prediction.
Approximately Correct Predictors
Since we cannot guarantee perfect label prediction, we introduce another parameter for the quality of prediction, the accuracy parameter, denoted by ε.
The probability of getting a non-representative sample is denoted by δ. 1 − δ is the confidence parameter.
The accuracy parameter ε: the event that the true error satisfies
L(D,f)(h) > ε
is interpreted as a failure of the learner.
If L(D,f)(h) ≤ ε (the true error is at most ε), then the output of the learner is an approximately correct predictor.
Fix f and seek an upper bound for the probability of sampling m instances that will lead to a failure of the learner.
Let S = (x1, . . . , xm). We would like to upper bound the probability that the true error is larger than ε, that is, Dm({S | L(D,f)(hS) > ε}).
The set Hb of bad hypotheses is defined as
$$H_b = \{h \in H \mid L_{(D,f)}(h) > \varepsilon\}.$$
Note that if h ∈ Hb we have 1 − L(D,f)(h) < 1 − ε. Define the set of misleading examples:
$$M = \{S \mid \exists h \in H_b,\ L_S(h) = 0\}.$$
M contains those samples that produce empirical error 0 on some bad hypothesis, that is, samples that make a bad hypothesis look like a good hypothesis.
Note that if a sample S is such that L(D,f)(hS) > ε, then there exists a bad hypothesis h ∈ Hb such that LS(h) = 0.
Goal: to bound the probability of the event L(D,f )(hS ) > ε.
The realizability assumption implies LS(hS) = 0. Therefore, the event L(D,f)(hS) > ε can happen only if for some h ∈ Hb we have LS(h) = 0, that is, only if
$$\{S \mid L_{(D,f)}(h_S) > \varepsilon\} \subseteq M.$$
The set of misleading examples M can be written as
$$M = \bigcup_{h \in H_b} \{S \mid L_S(h) = 0\},$$
hence
$$D^m(\{S \mid L_{(D,f)}(h_S) > \varepsilon\}) \leqslant D^m(M) = D^m\Bigl(\bigcup_{h \in H_b} \{S \mid L_S(h) = 0\}\Bigr).$$
Elementary probability implies that
$$P\Bigl(\bigcup_{i=1}^{k} A_i\Bigr) \leqslant \sum_{i=1}^{k} P(A_i).$$
Therefore, by elementary probability theory we have
$$D^m\Bigl(\bigcup_{h \in H_b} \{S \mid L_S(h) = 0\}\Bigr) \leqslant \sum_{h \in H_b} D^m(\{S \mid L_S(h) = 0\}).$$
Fix some bad hypothesis h ∈ Hb. The event LS(h) = 0 is equivalent to h(xi) = f(xi) for 1 ≤ i ≤ m. Since the examples in the training set are sampled independently and identically distributed (iid), we get
$$D^m(\{S \mid L_S(h) = 0\}) = D^m(\{S \mid h(x_i) = f(x_i) \text{ for } 1 \leqslant i \leqslant m\}) = \prod_{i=1}^{m} D(\{x_i \mid h(x_i) = f(x_i)\}).$$
For each individual sampling we have
$$D(\{x_i \mid h(x_i) = f(x_i)\}) = 1 - L_{(D,f)}(h) \leqslant 1 - \varepsilon,$$
where the last inequality follows from the fact that h ∈ Hb. Note that 1 − ε ≤ e^{−ε}.
Thus,
$$D^m(\{S \mid L_S(h) = 0\}) \leqslant (1 - \varepsilon)^m \leqslant e^{-\varepsilon m}.$$
Since
$$D^m\Bigl(\bigcup_{h \in H_b} \{S \mid L_S(h) = 0\}\Bigr) \leqslant \sum_{h \in H_b} D^m(\{S \mid L_S(h) = 0\}),$$
we conclude that
$$D^m\Bigl(\bigcup_{h \in H_b} \{S \mid L_S(h) = 0\}\Bigr) \leqslant |H_b|\, e^{-\varepsilon m}.$$
Theorem
Let H be a finite hypothesis class, let δ ∈ (0, 1) and ε > 0, and let m be an integer such that
$$m \geqslant \frac{\log(|H|/\delta)}{\varepsilon}.$$
Then, for any labelling function f and for any distribution D for which the realizability assumption holds, with probability at least 1 − δ over the choice of an iid sample S of size m, we have that for every ERM hypothesis hS it holds that L(D,f)(hS) ≤ ε.
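A small numeric illustration of the bound (the particular values |H| = 1000, ε = 0.05, and δ = 0.01 are chosen arbitrarily), using the natural logarithm:
% Sample size sufficient for |H| = 1000, accuracy epsilon = 0.05, confidence delta = 0.01:
m = ceil(log(1000 / 0.01) / 0.05)   % log(10^5)/0.05 is roughly 230.3, so m = 231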
Thus, for sufficiently large m, the ERM rule over a finite hypothesis class H will be probably (with confidence 1 − δ) approximately correct (up to an error of ε).
An Example : Regression
Define the class of affine functions Ld that consists of functions of the form hw,b : Rd −→ R, where
$$h_{w,b}(x) = (w, x) + b = \sum_{i=1}^{d} w_i x_i + b.$$
Note that:
Each function hw,b is parametrized by w ∈ Rd and b ∈ R.
Each function hw,b takes as input a vector x and returns a scalar (w,x)+b.
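A one-line check of this parametrization in MATLAB/Octave (the particular values of w, b, and x are arbitrary):
w = [2; -1; 0.5];          % parameters of h_{w,b} in L_3
b = 0.3;
x = [1; 4; 2];             % an input vector
h = dot(w, x) + b          % h_{w,b}(x) = 2*1 - 1*4 + 0.5*2 + 0.3 = -0.7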
Different hypothesis classes are constructed using functions from Ld.
THE CLASS OF HALFSPACES:
The class of halfspaces is designed for binary classification problems, where X = Rd and Y = {−1, 1}.
The realizability assumption means that it is possible to separate all positive examples from all negative examples with a hyperplane; this is the separable case.
The ERM problem for halfspaces in the realizable case can be expressed as a linear programming problem.
Let S = {(xi, yi) | 1 ≤ i ≤ m} be a training set of size m. Since we assume the realizable case, an ERM predictor should have zero errors on the training set. Thus, we are looking for w ∈ Rd such that
$$\mathrm{sign}((w, x_i)) = y_i \text{ for } 1 \leqslant i \leqslant m.$$
Equivalently, we are seeking w such that
$$y_i (w, x_i) > 0 \text{ for } 1 \leqslant i \leqslant m.$$
Since we assume realizability, such a vector w must exist. Let w∗ be one.
Define
$$\gamma = \min_i y_i (w^*, x_i) \quad\text{and}\quad \tilde{w} = \frac{1}{\gamma} w^*.$$
For all i we have
$$y_i (\tilde{w}, x_i) = \frac{1}{\gamma}\, y_i (w^*, x_i) \geqslant 1.$$
Thus, there exists a vector, namely w̃, that is an ERM predictor.
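In the realizable case, finding a vector w with yi (w, xi) ≥ 1 for all i is a feasibility problem that can be handed to a linear-programming solver, as noted above. A sketch in MATLAB syntax follows (it assumes the Optimization Toolbox function linprog; X holds the instances as rows and y the ±1 labels, both names being assumptions, not part of the slides).
% Feasibility LP: find w such that y_i * (w, x_i) >= 1 for every i,
% written in the linprog form  minimize f'*w  subject to  A*w <= c.
[m, d] = size(X);
A = -diag(y) * X;        % row i is  -y_i * x_i'
c = -ones(m, 1);         % so A*w <= c  means  y_i * (w, x_i) >= 1
f = zeros(d, 1);         % any feasible point will do; the objective is irrelevant
w = linprog(f, A, c);
% In the separable case, sign(X*w) reproduces the labels y on the training set.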
Linear Regression
X is a subset of Rd and the label set Y is a subset of R.
The goal is to learn a linear function h : Rd −→ R that best approximates the relationship between the variables (e.g., predicting the weight of a baby as a function of its age and its weight at birth).
The hypothesis class is
$$H_{\mathrm{reg}} = L_d = \{h : \mathbb{R}^d \longrightarrow \mathbb{R} \mid h(x) = w'x + b\}.$$
A loss function for linear regression could be l(h, (x, y)) = (h(x) − y)^2 (the squared loss).
Least Squares
Least squares is an algorithm that solves the ERM problem for the linear regression predictors with respect to the squared loss.
The ERM problem starts with a training set S and seeks to find w, where
$$w = \mathop{\mathrm{argmin}}_{w} L_S(h) = \mathop{\mathrm{argmin}}_{w} \frac{1}{m} \sum_{i=1}^{m} (w'x_i - y_i)^2.$$
The minimum condition amounts to ∂LS(h)/∂wp = 0 for 1 ≤ p ≤ d.
We have
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} (w'x_i - y_i)^2 = \frac{1}{m} \sum_{i=1}^{m} (w_1 x_{1i} + \cdots + w_d x_{di} - y_i)^2.$$
Therefore,
$$\frac{\partial L_S(h)}{\partial w_p} = \frac{2}{m} \sum_{i=1}^{m} (w_1 x_{1i} + \cdots + w_d x_{di} - y_i)\, x_{pi} = 0,$$
which implies
$$\sum_{i=1}^{m} (w_1 x_{1i} x_{pi} + \cdots + w_d x_{di} x_{pi} - y_i x_{pi}) = 0,$$
for every p, 1 ≤ p ≤ d.
The last equalities can be written in compact form as XX′w = Xy, where X is the d × m matrix whose columns are the instances x1, . . . , xm and y is the vector of labels; this requires solving the equation Aw = b, where A = XX′ and b = Xy.
Note that:
if A is invertible, the solution of the ERM problem is w = A−1b;
b = Xy is a linear combination of the columns of X.
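A minimal sketch of the resulting least-squares computation in MATLAB/Octave syntax; here X is assumed to be the d × m matrix whose columns are the instances x1, . . . , xm (matching the compact form XX′w = Xy above) and y is the m × 1 vector of labels.
% Solve the normal equations (X*X') * w = X*y for the least-squares predictor.
A = X * X';
b = X * y;
if rank(A) == size(A, 1)
  w = A \ b;              % A invertible: the unique ERM solution w = A^{-1} b
else
  w = pinv(A) * b;        % otherwise: a minimizer obtained via the pseudoinverse
end
y_hat = X' * w;           % predictions on the training instances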
If the linear system Aw = b has no solution, the “next best thing” is to find a vector c ∈ Rn such that ∥Ac − b∥2 ≤ ∥Aw − b∥2 for every w ∈ Rn, an approach known as the least squares method.
We will refer to the triple (A, w, b) as an instance of the least squares problem.
Note that Aw ∈ range(A) for any w ∈ Rn. Thus, solving this problem amounts to finding a vector in the subspace range(A) that is as close to b as possible.
Let A ∈ Rm×n be a full-rank matrix such that m > n, so the rank of A is n. The symmetric square matrix A′A ∈ Rn×n has the same rank n as the matrix A. Therefore, the system (A′A)w = A′b has a unique solution s. Moreover, A′A is positive definite because
$$w'A'Aw = (Aw)'Aw = \| Aw \|^2 > 0 \text{ for } Aw \neq 0.$$
Theorem
Let A ∈ Rm×n be a full-rank matrix such that m > n and let b ∈ Rm. If s is the unique solution of the system (A′A)w = A′b, then As is the projection of the vector b on the subspace range(A).
Proof
The n columns of the matrix A = (v1 · · · vn) constitute a basis of the subspace range(A). Therefore, we seek the projection c of b on range(A) as a linear combination c = At, which allows us to reduce this problem to the minimization of the function
$$f(t) = \|At - b\|^2 = (At - b)'(At - b) = (t'A' - b')(At - b) = t'A'At - b'At - t'A'b + b'b.$$
The necessary condition for the minimum is
$$(\nabla f)(t) = 2A'At - 2A'b = 0,$$
which implies A′At = A′b.
The linear system (A′A)t = A′b is known as the system of normal equations of A and b.
We represent (using the function plot of ML) the number of calories consumed by a person per day vs. the gross domestic product per person in European countries:
[Figure: scatter plot of calories per day vs. gdp per person in $10K units; the points are labelled by country codes (PT, RO, CY, BA, NO, NL, FI, SK, YU, ...).]
ccode  gdp   cal       ccode  gdp   cal
'AL'   0.74  2824      'IT'   3.07  3685
'AT'   4.03  3651      'LV'   1.43  3029
'BY'   1.34  2895      'LT'   1.59  3397
'BE'   3.79  3698      'LU'   8.18  3778
'BA'   0.66  2950      'MK'   0.94  2881
'BG'   1.28  2813      'MT'   2.51  3535
'HR'   1.75  2937      'MD'   0.25  2841
'CY'   1.21  3208      'NL'   4.05  3240
'CZ'   2.56  3346      'NO'   5.91  3448
'DK'   3.67  3391      'PL'   1.88  3375
'EE'   1.90  3086      'PT'   2.30  3593
'FI'   3.53  3195      'RO'   1.15  3474
'FR'   3.33  3602      'RU'   1.59  3100
'GE'   0.48  2475      'YU'   1.10  2689
'DE'   3.59  3491      'SK'   2.22  2825
'GR'   3.02  3694      'SI'   2.84  3271
'HU'   1.90  3420      'ES'   2.95  3329
'IS'   3.67  3279      'CH'   4.29  3400
'IE'   3.76  3685
We seek to approximate the calorie intake as a linear function of the gdp of the form
$$\mathrm{cal} = r_1 + r_2\, \mathrm{gdp}.$$
This amounts to solving a linear system that consists of 37 equations and two unknowns:
$$r_1 + 0.74\, r_2 = 2824, \quad \ldots, \quad r_1 + 4.29\, r_2 = 3400,$$
and, clearly, such a system is inconsistent.
We augment the data sample matrix by a column that consists of 1s to accommodate a constant term r1; thus, we work with the data sample matrix B ∈ R37×2 given by
$$B = \begin{pmatrix} 1 & 0.74 \\ \vdots & \vdots \\ 1 & 4.29 \end{pmatrix},$$
whose second column consists of the countries’ gross domestic products in $10K units.
The matrix C = B′B is
$$C = \begin{pmatrix} 37.0000 & 94.4600 \\ 94.4600 & 333.6592 \end{pmatrix}.$$
Solving the normal system using the ML statement w = C \ (B′ ∗ b), where b is the vector of calorie values, yields
$$w = \begin{pmatrix} 2894.2 \\ 142.3 \end{pmatrix},$$
so the regression line is cal = 142.3 · gdp + 2894.2.
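The computation on this slide can be reproduced with a few lines of MATLAB/Octave (a sketch; the vectors gdp and cal below hold the 37 values from the table above, read column by column, and cal plays the role of the vector b).
% Gross domestic product per person (in $10K units) and calories per day.
gdp = [0.74 4.03 1.34 3.79 0.66 1.28 1.75 1.21 2.56 3.67 1.90 3.53 3.33 ...
       0.48 3.59 3.02 1.90 3.67 3.76 3.07 1.43 1.59 8.18 0.94 2.51 0.25 ...
       4.05 5.91 1.88 2.30 1.15 1.59 1.10 2.22 2.84 2.95 4.29]';
cal = [2824 3651 2895 3698 2950 2813 2937 3208 3346 3391 3086 3195 3602 ...
       2475 3491 3694 3420 3279 3685 3685 3029 3397 3778 2881 3535 2841 ...
       3240 3448 3375 3593 3474 3100 2689 2825 3271 3329 3400]';
B = [ones(37, 1) gdp];        % augment with a column of 1s for the constant term r1
C = B' * B;                   % the matrix C = B'B shown above
w = C \ (B' * cal);           % normal system; w(1) is the intercept, w(2) the slope
plot(gdp, cal, 'o', gdp, B * w, '-');   % data points and the fitted regression line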
[Figure: calories per day vs. gdp per person in $10K units, with the fitted regression line.]