Discriminant Functions
A discriminant function is a scalar function, g, defined over the feature vector, x:
We define a separate function for each class:
– gi(x), for i = 1, …, c
And use the outputs of these functions to make categorisation decisions:
– assign feature vector x to class ωj if: gj(x) > gi(x) ∀ i ≠ j
Discriminant Functions
Assign feature vector x to class ωj if: gj(x) > gi(x) ∀ i ≠ j
This is a very general idea as we can use any parameterized function as the discriminant function.
● Discriminant Functions
– Decision Regions and Decision Boundaries
– Dichotomizers
● Linear Discriminant Functions
– Generalised Linear Discriminant Functions
● Learning Decision Boundaries
– perceptron learning
– minimum squared error (MSE) procedures
● k-Nearest Neighbours Classifier
Decision Regions and Decision Boundaries
Discriminant functions divide feature space into regions, e.g.:
– Region R1, in which g1(x) > gi(x) ∀ i ≠ 1
– Region R2, in which g2(x) > gi(x) ∀ i ≠ 2
– Region R3, in which g3(x) > gi(x) ∀ i ≠ 3
Regions are separated by boundaries, e.g. the boundary where g1(x) = g2(x).
Decision Regions and Decision Boundaries
Discriminant functions divide feature space into regions, e.g. for feature vectors xi = [lightnessi, widthi]t:
– Region in which gsalmon(x) > gsea bass(x)
– Region in which gsalmon(x) < gsea bass(x)
– Boundary where gsea bass(x) = gsalmon(x)
Discriminant Functions
Assign feature vector x to class ωj if: gj(x) > gi(x) ∀ i ≠ j
Note, changing the discriminant functions, such that:
● gi(x) ← gi(x)+a ∀ i
● gi(x) ← gi(x) × b ∀ i (where b > 0)
● gi(x) ← f(gi(x)) ∀ i (where f is a monotonically increasing function)
has no effect on classification
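As a quick check of this invariance, the short sketch below (Python/NumPy, not part of the original slides) applies each of the three transformations to a made-up set of discriminant values and confirms that the winning class is unchanged.

```python
import numpy as np

# Made-up discriminant values g_i(x) for a single input x (illustrative only).
g = np.array([0.2, 1.7, -0.5])

# The classification decision is argmax_i g_i(x); each transformation below
# preserves the ordering of the g_i, so the argmax does not change.
assert np.argmax(g) == np.argmax(g + 3.0)        # g_i(x) + a
assert np.argmax(g) == np.argmax(g * 2.5)        # g_i(x) * b, with b > 0
assert np.argmax(g) == np.argmax(np.exp(g))      # f(g_i(x)), f monotonically increasing

print("selected class index:", np.argmax(g))
```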
Dichotomizers
There are only two decision regions, but they are not necessarily contiguous:
– Region R1, in which g1(x) > g2(x), i.e. g(x) > 0
– Region R2, in which g2(x) > g1(x), i.e. g(x) < 0
– Boundary where g1(x) = g2(x), i.e. g(x) = 0
Discriminant functions divide feature space into regions, e.g. for feature vectors xi = [lightnessi, widthi]t:
– Region in which gfish(x) > 0
– Region in which gfish(x) < 0
– Boundary where gfish(x) = 0
Dichotomizers
A special case, where we have only two categories.
● Categorisation rule is
– assign x to ω1 if g1(x) > g2(x)
– otherwise assign x to ω2
More conveniently, we can define a single function
– g(x)=g1(x)–g2(x)
● Categorisation rule is then
– assign x to ω1 if g(x) > 0
– otherwise assign x to ω2
Equivalently, the sign of g(x) gives the class label, if we equate +1 with ω1 and -1 with ω2
● Discriminant Functions
– Decision Regions and Decision Boundaries
– Dichotomizers
● Linear Discriminant Functions
– Generalised Linear Discriminant Functions
● Learning Decision Boundaries
– perceptron learning
– minimum squared error (MSE) procedures
● k-Nearest Neighbours Classifier
Linear Discriminant Functions
A linear discriminant function uses a linear combination of feature values:
– g(x) = wtx + w0
● where:
– x is the feature vector
– w is a column vector of parameter or weight values
– w0 is a scalar called the bias or threshold weight
– Superscript “t” or “T” means transpose
Linear Discriminant Functions
When the discriminant functions are linear, feature space is divided into exactly c contiguous regions
– Region R1, in which g1(x) > gi(x) ∀ i ≠ 1
– Region R2, in which g2(x) > gi(x) ∀ i ≠ 2
– Boundary where g1(x) = g2(x)
Boundaries are straight lines (in 2D), planes (in 3D) or hyperplanes (in nD)
Dichotomizers
As before, when we have only two categories, we can define a single function
– g(x) = g1(x) – g2(x)
       = (w1tx + w10) – (w2tx + w20)
       = (w1 – w2)tx + (w10 – w20)
       = wtx + w0
● Categorisation rule is then
– assign x to ω1 if g(x) > 0, otherwise assign x to ω2 or
– assign x to ω1 if wtx > -w0, otherwise assign x to ω2
Linear Discriminant Functions
As before, the classification rule is
– assign feature vector x to class ωj if: gj(x) > gi(x) ∀ i ≠ j
– i.e. choose ωj if wjtx + wj0 > witx + wi0 ∀ i ≠ j
This classifier is called a “Linear Machine”
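A minimal sketch of such a linear machine is given below (Python/NumPy). The three weight vectors and biases are made up purely for illustration; the classifier simply evaluates gj(x) = wjtx + wj0 for every class and returns the argmax.

```python
import numpy as np

# Hypothetical 3-class linear machine over 2-D features (weights are made up).
W = np.array([[ 1.0,  0.5],      # w_1
              [-0.3,  1.2],      # w_2
              [ 0.8, -0.7]])     # w_3
w0 = np.array([0.1, -0.2, 0.4])  # biases w_10, w_20, w_30

def linear_machine(x):
    """Return the index j of the class with the largest g_j(x) = w_j^t x + w_j0."""
    g = W @ x + w0               # evaluate all discriminant functions at once
    return int(np.argmax(g))     # 0-based class index

print(linear_machine(np.array([0.5, 2.0])))
```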
Augmented Vectors
To simplify maths, “augment” weight and data vectors
– Augmentation means to extend a vector (or matrix) by
appending elements to it
● Instead of defining g(x) = wtx + w0
● We define g(x) = aty where: a = [w0 wt]t and y=[1 xt]t
● e.g. if:
w = (2, 3)t, w0 = 4, x = (3, 5)t
then:
g(x) = wtx + w0 = (2 3)(3 5)t + 4 = 6 + 15 + 4 = 25
g(x) = aty = (4 2 3)(1 3 5)t = 4 + 6 + 15 = 25
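The sketch below (not from the slides) reproduces this calculation with NumPy and confirms that the augmented form aty gives the same value as wtx + w0.

```python
import numpy as np

# Same numbers as the slide: w = (2, 3)^t, w0 = 4, x = (3, 5)^t.
w, w0 = np.array([2.0, 3.0]), 4.0
x = np.array([3.0, 5.0])

a = np.concatenate(([w0], w))    # a = [w0 w^t]^t = [4, 2, 3]
y = np.concatenate(([1.0], x))   # y = [1 x^t]^t  = [1, 3, 5]

print(w @ x + w0)                # 25.0
print(a @ y)                     # 25.0 -- identical by construction
```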
Linear Discriminant Functions
Equivalent ways of defining a linear discriminant function:
g(x) = wtx + w0
g(x) = w0 + Σi wi xi
Augmented Vectors
To simplify maths, “augment” weight and data vectors
– Augmentation means to extend a vector (or matrix) by appending elements to it
● Instead of defining g(x) = wtx + w0
● We define g(x) = aty where: a = [w0 wt]t and y=[1 xt]t
Linear Discriminant Functions
Location of decision boundary is defined by the weights.
The distance of the hyperplane from the origin is |w0| / ‖w‖ (so w0 determines the location of the hyperplane).
e.g. if w0 = −5, w = (2, 1)t, then |w0| / ‖w‖ = 5 / √(2² + 1²)
The vector normal to the hyperplane is w (so w determines the orientation of the hyperplane).
Linear Discriminant Functions
Location of decision boundary is defined by the weights.
e.g. if w0 = −5, w = (2, 1)t
Linear Discriminant Functions
Location of decision boundary is defined by the weights. In 2D feature space the hyperplane w0 + w1x1 + w2x2 = 0 is the line:
x2 = m x1 + c = −(w1/w2) x1 − (w0/w2)
e.g. if w0 = −5, w = (2, 1)t, then x2 = −(2/1) x1 + (5/1) = −2x1 + 5
Its distance from the origin is |w0| / ‖w‖.
Linear Discriminant Functions
Location of decision boundary is defined by the weights. The value of g(x) provides a measure of how far x is from the decision boundary. The actual distance is given by |g(x)| / ‖w‖.
e.g. if w0 = −5, w = (2, 1)t, then:
when x = (2, 2)t, g(x) = 4 + 2 − 5 = +1
when x = (3, 2)t, g(x) = 6 + 2 − 5 = +3
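The short sketch below (Python/NumPy, assuming the reconstructed points x = (2, 2)t and x = (3, 2)t above) evaluates g(x) and the corresponding distance |g(x)| / ‖w‖ for the slide's weights.

```python
import numpy as np

# Weights from the slide: w0 = -5, w = (2, 1)^t.
w, w0 = np.array([2.0, 1.0]), -5.0

def g(x):
    return w @ x + w0

# Distance of x from the boundary g(x) = 0 is |g(x)| / ||w||.
for x in (np.array([2.0, 2.0]), np.array([3.0, 2.0])):
    dist = abs(g(x)) / np.linalg.norm(w)
    print(f"x = {x}, g(x) = {g(x):+.1f}, distance = {dist:.3f}")
# g(x) = +1 and +3; distances 1/sqrt(5) and 3/sqrt(5)
```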
Linear Discriminant Functions
Location of decision boundary is defined by the weights.
These properties are also true for linear discriminant functions in higher-dimensional space.
e.g. for 3D feature space, the distance of x from the decision boundary (a plane) is still |g(x)| / ‖w‖
Linear Discriminant Functions
Location of decision boundary is defined by the weights. The vector normal to the hyperplane points towards feature
vectors for which g(x) is positive.
e.g. if w0 = −5, w = (2, 1)t, then:
when x = (2, 2)t, g(x) = 4 + 2 − 5 = +1
when x = (1, 2)t, g(x) = 2 + 2 − 5 = −1
Generalised Linear Discriminant Functions
Boundaries can now be quadratic curves or hyperquadratic surfaces.
Useful, as more complex boundaries may be required for separating classes in some tasks.
Generalised Linear Discriminant Functions
But more complex boundaries can be generated using linear discriminant functions in expanded feature space, e.g.:
– In the original feature space (2D), decision boundaries are quadratic, as would result from using a quadratic discriminant function.
– In the expanded feature space, the decision surface is a hyperplane.
Generalised Linear Discriminant Functions
A linear discriminant function can be written as:
g(x) = w0 + Σi wi xi
A quadratic discriminant function can be written as:
g(x) = w0 + Σi wi xi + Σi,j wij xi xj
Generalised Linear Discriminant Functions
A generalized discriminant function can be written as: g(x) = w0 + w1f1(x) + w2f2(x) + … + wNfN(x)
where fi(x) for i=1 to N can be any scalar function of x
This allows us to define any arbitrary decision surface in the original feature space, x.
This is equivalent to defining decision surfaces that are hyperplanes (i.e. linear) in the expanded feature space, y:
– y=[1 f1(x) f2(x) … fN(x)]t
Learning Discriminant Functions: Perceptron Learning
Generalised Linear Discriminant Functions
A quadratic discriminant function g(x) = w0 + Σi wi xi + Σi,j wij xi xj
Can be thought of as a linear discriminant function in an expanded feature space:
● g(x) = aty (in augmented vector notation)
– where: a = [w0 w1 w2 w11 w12 w22 …]t
– and y = [1 x1 x2 x1x1 x1x2 x2x2 …]t
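The sketch below (weight values made up for illustration) shows the same idea in code: the feature vector is expanded to y = [1, x1, x2, x1x1, x1x2, x2x2]t and the discriminant is then just a dot product aty, i.e. linear in y but quadratic in x.

```python
import numpy as np

def expand(x):
    """Map x = (x1, x2) to the expanded feature vector y used on the slide."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

# a = [w0 w1 w2 w11 w12 w22]^t -- these values are illustrative only.
a = np.array([-1.0, 0.5, 0.5, 1.0, -2.0, 1.0])

def g(x):
    return a @ expand(x)          # linear in y, quadratic in x

print(g(np.array([1.0, 2.0])))
```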
Learning Linear Discriminant Functions
So far we have assumed we know g(x), and hence, that we can determine the class of a new input vector x.
But typically, we only have training data and we need to determine the appropriate form of g(x) from this data.
● We only need to consider how to learn linear discriminant functions, as we can learn arbitrary decision surfaces by changing the feature space (e.g. to [1 f1(x) f2(x) … fN(x)]t) and learning a linear discriminant function in the new feature space.
Learning Linear Discriminant Functions
Given a training dataset:
(x1, ω’1), (x2, ω’2), …, (xn, ω’n)
Determine weights:
a1, a2, …, am
Such that gj(xk) > gi(xk) ∀ i≠j where j=ω’k for all k
If solution exists, dataset is linearly separable.
● Discriminant Functions
– Decision Regions and Decision Boundaries
– Dichotomizers
● Linear Discriminant Functions
– Generalised Linear Discriminant Functions
● Learning Decision Boundaries
– perceptron learning
– minimum squared error (MSE) procedures
● k-Nearest Neighbours Classifier
Sample Normalisation
If we replace all samples from class ω2 by their negatives (called sample normalisation):
y ← -y ∀y ∈ ω2
Then, a sample yk is correctly classified if:
atyk > 0 and yk is labelled ω1, and
atyk > 0 and yk is labelled ω2
This allows us to ignore class labels and look for a vector a such that atyk > 0 ∀k
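The sketch below (made-up augmented samples and a made-up candidate weight vector) shows sample normalisation in code: after negating the ω2 samples, checking that a classifies every sample correctly reduces to checking atyk > 0 for all k.

```python
import numpy as np

# Made-up augmented samples [1, x1, x2] and labels (+1 for omega_1, -1 for omega_2).
Y = np.array([[1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0]])
labels = np.array([1, 1, -1, -1])

Y_norm = Y * labels[:, None]      # y <- -y for every y in omega_2

a = np.array([0.5, 4.0, -3.0])    # candidate weight vector (illustrative)
print(np.all(Y_norm @ a > 0))     # True iff a correctly classifies every sample
```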
Solutions and Margins
There are potentially many possible solutions for a that satisfy:
atyk > 0 ∀k
There are fewer solutions that satisfy: atyk > b ∀k
where b is a positive number, called the “margin”
Two-category Linearly Separable Case
Suppose we have a set of n linearly separable samples y1, y2,…, yn, where some are labelled ω1 and the others ω2.
We want to use these samples to determine the weight vector a of a linear discriminant function (a dichotomizer).
A sample yk is correctly classified if:
atyk > 0 and yk is labelled ω1, or
atyk < 0 and yk is labelled ω2
Gradient Descent Procedures
Define a cost function J(a) that is minimised if a is a solution vector.
This type of problem can be solved by a Gradient Descent Procedure.
Basic gradient descent:
– Initialise a to arbitrary solution
– Compute gradient vector at a: ∇J(a)
– Move solution in direction of steepest descent:
a ← a – η∇J(a) (negative to go down gradient)
where η is the learning rate (step size)
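A generic sketch of this procedure is shown below; the quadratic cost used here is only a stand-in (not a cost from the slides) so that the update loop has something concrete to minimise.

```python
import numpy as np

def J(a):
    """Stand-in cost function with its minimum at a = (3, 3)."""
    return np.sum((a - 3.0) ** 2)

def grad_J(a):
    """Gradient of the stand-in cost."""
    return 2.0 * (a - 3.0)

eta = 0.1                        # learning rate (step size)
a = np.zeros(2)                  # arbitrary initial solution
for _ in range(100):
    a = a - eta * grad_J(a)      # move in the direction of steepest descent
print(a)                         # approaches the minimiser [3, 3]
```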
Number Misclassified Criterion Function
● Simplest approach would be to define a cost function based on the number of samples misclassified by a.
● However, this cost would be piecewise constant and would therefore provide no useful gradient to follow (it is also not convex).
Gradient Descent Procedures
Define a cost function J(a) that is minimised if a is a solution vector.
[Figure: the cost of possible solutions plotted over the space of possible solutions; a lower-cost solution is found by following the gradient, and the gradient is zero at the optimum solution.]
Note: a “cost function” is also called an “objective function” or a “loss function”.
Batch Perceptron Learning Algorithm
The gradient of the perceptron cost function is:
∇Jp(a) = Σy∈χ (−y)
Hence the update rule becomes:
a ← a + η Σy∈χ y
Converges when χ is empty, and hence ∇Jp = 0 (i.e. when there are no more errors).
Will only converge if data is linearly separable.
Batch Perceptron Learning Algorithm
Gradient descent with the Batch Perceptron cost function is:
– Set value of hyper-parameter (η)
– Initialise a to arbitrary solution
– Compute gradient vector at a: ∇Jp(a) = Σy∈χ (−y)
– Move solution in direction of steepest descent: a ← a + η Σy∈χ y
The new weight vector is obtained by summing the feature vectors associated with all misclassified samples, and adding some multiple of this to the current weight vector.
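A minimal sketch of batch perceptron learning is given below (Python/NumPy). The data are made up and assumed to be already augmented and sample-normalised, so convergence means aty > 0 for every sample; as noted above, the loop would never terminate on data that are not linearly separable.

```python
import numpy as np

# Made-up augmented, sample-normalised data (omega_2 rows already negated).
Y = np.array([[ 1.0,  1.0,  0.0],   # omega_1
              [ 1.0,  2.0,  1.0],   # omega_1
              [-1.0,  0.0, -1.0],   # omega_2 (negated)
              [-1.0, -1.0, -2.0]])  # omega_2 (negated)

eta = 1.0
a = np.zeros(Y.shape[1])            # arbitrary initial solution

while True:
    chi = Y[Y @ a <= 0]             # set of misclassified samples
    if len(chi) == 0:               # converged: no more errors
        break
    a = a + eta * chi.sum(axis=0)   # a <- a + eta * sum over chi of y
print(a)
```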
Perceptron Criterion Function
A better choice is to define a cost function based on the sum of the distances to the decision boundary for all misclassified samples:
Jp(a) = Σy∈χ (−aty)
where χ is the set of samples misclassified by a
Sequential Perceptron Learning Algorithm
Perceptron updates move weights towards misclassified exemplars. This only works for exemplars in ω2 because we applied sample normalisation (multiplied those exemplars by -1).
Alternatively, we can avoid sample normalisation and instead change the sign of the update rule to move weights:
● towards misclassified exemplars from ω1, and
● away from misclassified exemplars from ω2.
Sequential Perceptron Learning Algorithm
● towards misclassified exemplars from ω1 (let ω1 = +1)
● away from misclassified exemplars from ω2 (let ω2 = -1)
The batch update rule becomes: a ← a + η Σy∈χ ω′y
The sequential update rule becomes: a ← a + η ω′k yk
where ω′ is the class label (±1) associated with each misclassified exemplar.
Sequential Perceptron Learning Algorithm
Update based only on single sample errors (updating of weight vector is “sequential” or “online”)
– Set value of hyper-parameter (η)
– Initialise a to arbitrary solution
– For each sample, yk, in the dataset in turn
● If yk is misclassified
– Move solution towards yk: a←a+η yk
Run through all samples in turn as many times as necessary for convergence (until all samples correctly classified)
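The sketch below implements the sequential rule in Python/NumPy. The toy dataset was reverse-engineered from the numbers in the worked example that follows, so treat the exact points and their ordering as an assumption; with this ordering, a initialised to [-1.5, 5, -1] and η = 1, the algorithm converges to a = [0.5, 4, -3], matching the final slide.

```python
import numpy as np

# Augmented samples y = [1, x1, x2] and labels omega' in {+1, -1}
# (dataset inferred from the worked example -- an assumption, not given explicitly).
Y = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0]])
labels = np.array([1, 1, 1, -1, -1])

eta = 1.0
a = np.array([-1.5, 5.0, -1.0])        # initial solution from the slides

converged = False
while not converged:
    converged = True
    for y_k, w_k in zip(Y, labels):
        if w_k * (a @ y_k) <= 0:       # misclassified: g(x) has the wrong sign
            a = a + eta * w_k * y_k    # a <- a + eta * omega'_k * y_k
            converged = False
print(a)                               # [0.5, 4, -3] with this ordering
```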
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
g(x) = aty, with at = [-1.5, 5, -1]: g(x) = -1.5
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
a = [-1.5, 5, -1] + 1 × 1 × [1, 0, 0] = [-0.5, 5, -1] (predicted class = -1)
g(x) = aty, with at = [-0.5, 5, -1]: g(x) = 4.5
a unchanged at [-0.5, 5, -1] (predicted class = 1)
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution and select learning rate
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a←a+ηω’k yk
a initialised to [-1.5, 5, -1]
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
g(x) = aty, with at = [-0.5, 5, -1]: g(x) = -1.5
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
a unchanged at [-0.5, 5, -1] (predicted class = -1)
g(x) = aty, with at = [-0.5, 5, -1]: g(x) = 2.5
a = [-0.5, 5, -1] + 1 × -1 × [1, 1, 2] = [-1.5, 4, -3] (predicted class = 1)
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
g(x) = aty, with at = [-0.5, 5, -1]: g(x) = 8.5
a unchanged at [-0.5, 5, -1] (predicted class = 1)
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
g(x) = aty, with at = [-0.5, 4, -3]: g(x) = 3.5
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
a unchanged at [-0.5, 4, -3] (predicted class = 1)
g(x) = aty, with at = [-0.5, 4, -3]: g(x) = 4.5
a unchanged at [-0.5, 4, -3] (predicted class = 1)
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
g(x) = aty, with at = [-1.5, 4, -3]: g(x) = -1.5
a = [-1.5, 4, -3] + 1 × 1 × [1, 0, 0] = [-0.5, 4, -3] (predicted class = -1)
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
g(x) = aty, with at = [-0.5, 4, -3]: g(x) = -2.5
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
a unchanged at [-0.5, 4, -3] (predicted class = -1)
g(x) = aty, with at = [-0.5, 4, -3]: g(x) = -0.5
a = [-0.5, 4, -3] + 1 × 1 × [1, 0, 0] = [0.5, 4, -3] (predicted class = -1)
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a ← a + η ω′k yk
g(x) = aty, with at = [-0.5, 4, -3]: g(x) = -3.5
a unchanged at [-0.5, 4, -3] (predicted class = -1)
Learning Algorithm
Assign feature vector x to class ωj if: gj(x) > gi(x) ∀ i ≠ j
If classification is wrong, adjust weights of discriminant functions:
● move weights for required class towards input
● move weights of wrongly selected class away from input
Learning Algorithm
– Set value of hyper-parameter (η)
– For each possible class, c, initialise ac to an arbitrary solution
– For each sample, (yk, ωk), in the dataset in turn
● Classify: c′ = argmaxc gc(xk)
● If yk is misclassified (i.e. c′ ≠ ωk)
– Move aωk towards yk: aωk ← aωk + η yk
– Move ac′ away from yk: ac′ ← ac′ − η yk
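Below is a minimal sketch of this multi-class update rule (Python/NumPy). The two-class toy data and the zero initialisation are made up; on each error the weights of the required class move towards yk and the weights of the wrongly chosen class move away from it.

```python
import numpy as np

# Made-up augmented samples [1, x1, x2] and their (0-based) class indices.
Y = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 2.0],
              [1.0, 2.0, 3.0]])
labels = np.array([0, 0, 1, 1])

eta = 1.0
A = np.zeros((2, 3))                     # one weight vector a_c per class (rows)

for _ in range(100):                     # bounded number of passes for simplicity
    errors = 0
    for y_k, c_true in zip(Y, labels):
        c_pred = int(np.argmax(A @ y_k)) # c' = argmax_c g_c(x_k)
        if c_pred != c_true:
            A[c_true] += eta * y_k       # move a_{omega_k} towards y_k
            A[c_pred] -= eta * y_k       # move a_{c'} away from y_k
            errors += 1
    if errors == 0:                      # all samples correctly classified
        break
print(A)
```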
Sequential Perceptron Learning Algorithm
● Initialise a to arbitrary solution and select learning rate
● Until convergence (all samples correctly classified)
● For each sample, yk, in the dataset in turn
– If yk is misclassified: a←a+ηω’k yk
at = [0.5, 4, -3]
● Discriminant Functions
– Decision Regions and Decision Boundaries
– Dichotomizers
● Linear Discriminant Functions
– Generalised Linear Discriminant Functions
● Learning Decision Boundaries
– perceptron learning
– minimum squared error (MSE) procedures
● k-Nearest Neighbours Classifier
Minimum Squared Error (MSE) Procedures
Perceptron Learning
● Uses gradient descent procedures to calculate suitable linear discriminant functions
● Learning driven by misclassified exemplars
● Converges only if data is linearly separable
● Solves linear inequalities atyk > 0 ∀k
MSE Procedures
● Solves linear equations atyk = bk ∀k
● Gives result even if data is not linearly separable
● Learning depends on all exemplars
● Uses matrix inversion (or gradient descent) to calculate suitable linear discriminant functions
Learning Discriminant Functions: Minimum Squared Error
MSE via Pseudoinverse
Let Y be a matrix, each row of which contains one exemplar, ykt, from the training data (in augmented and sample-normalised format).
Then the set of simultaneous linear equations atyk = bk can be expressed in matrix form as:
[ y10 y11 … y1d ]   [ a0 ]   [ b1 ]
[  ⋮            ]   [ a1 ]   [ ⋮  ]
[ yk0 yk1 … ykd ]   [ ⋮  ] = [ bk ]
[  ⋮            ]   [ ad ]   [ ⋮  ]
[ yn0 yn1 … ynd ]            [ bn ]

i.e. Ya = b
Typically Y is rectangular, so the solution is given by: a = Y†b
where Y† is the matrix pseudo-inverse (or Moore–Penrose inverse) of Y
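The sketch below (made-up, sample-normalised data; b taken as a vector of ones, which is a common but arbitrary choice of margins) computes a = Y†b with NumPy's pseudo-inverse.

```python
import numpy as np

# Made-up augmented, sample-normalised training data (omega_2 rows negated).
Y = np.array([[ 1.0,  1.0,  0.0],
              [ 1.0,  2.0,  1.0],
              [-1.0,  0.0, -1.0],
              [-1.0, -1.0, -2.0]])
b = np.ones(Y.shape[0])            # target margins b_k (all set to 1 here)

a = np.linalg.pinv(Y) @ b          # a = Y† b (Moore-Penrose pseudo-inverse)
print(a)                           # [0, 1, -1] for this data
print(Y @ a)                       # the least-squares fit to b (exactly b here)
```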
MSE via Gradient Descent
The pseudo-inverse provides a closed-form solution to find a which minimises the squared error:
e² = ‖Ya − b‖²
An alternative method to find a solution is to define a cost function J(a) that is to be minimised, and use Gradient Descent.
Js(a) = ½ ‖Ya − b‖²
– Note the 1⁄2 is a mathematical convenience (it will be absorbed into the learning rate)
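For comparison, the sketch below minimises the same cost by gradient descent. The gradient of Js(a) = ½‖Ya − b‖² works out to Yt(Ya − b); the data, learning rate and iteration count are made up, and with them the iterate approaches the pseudo-inverse solution.

```python
import numpy as np

# Same made-up normalised data and targets as in the pseudo-inverse sketch.
Y = np.array([[ 1.0,  1.0,  0.0],
              [ 1.0,  2.0,  1.0],
              [-1.0,  0.0, -1.0],
              [-1.0, -1.0, -2.0]])
b = np.ones(Y.shape[0])

eta = 0.05                             # learning rate (small enough to converge here)
a = np.zeros(Y.shape[1])               # arbitrary initial solution
for _ in range(500):
    a = a - eta * (Y.T @ (Y @ a - b))  # a <- a - eta * grad J_s(a), grad = Y^t(Ya - b)
print(a)                               # approaches the pseudo-inverse solution [0, 1, -1]
```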
Two-category Case with Sample Normalisation
Suppose we have a set of n samples y1, y2,…, yn, where some are labelled ω1 and the others ω2.
If we replace all samples from class ω2 by their negatives: y ← -y ∀y ∈ ω2