Naïve Bayes Classification
AI lecture: Machine Learning
Naïve Bayes Classification
— Basic Machine Learning Model
Material borrowed (and modified) from Jonathan Huang, from I. H. Witten and E. Frank's "Data Mining", from Jeremy Wyatt, and from others; revised by C. C. Hung
*
Outline
Probability and Machine Learning
Bayesian Classification
Naïve Bayesian Classifier
Examples
Model parameters
Evaluating classification algorithms
*
Things We’d Like to Do
Spam Classification
Given an email, predict whether it is spam or not
Medical Diagnosis
Given a list of symptoms, predict whether a patient has disease X or not
Weather
Based on temperature, humidity, etc… predict if it will rain tomorrow
*
Recall: The machine learning framework
Apply a prediction function to a feature representation of the image to get the desired output:
f(image of an apple) = "apple"
f(image of a tomato) = "tomato"
f(image of a cow) = "cow"
Slide credit: L. Lazebnik
*
Recall: The machine learning framework
y = f(x)   (y: the output, f: the prediction function, x: the image feature)
Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
Slide credit: L. Lazebnik
*
Bayesian Classification
Problem statement:
Given features X1, X2, …, Xn
Predict a label Y for a new sample
*
Another Application
Digit Recognition
Features: X1, …, Xn ∈ {0,1} (blue vs. red pixels)
Label: Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
[Figure: an image of a handwritten digit is fed to the classifier, which outputs "5". Blue pixels: background; red pixels: digit.]
*
The Bayes Classifier
A good strategy is to predict the most probable label given the features, i.e. the y that maximizes P(Y = y | X1, …, Xn)
(for example: what is the probability that the image represents a 5, given its pixels?)
So … How do we compute that?
*
Bayes Theorem
Use Bayes rule!
P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)
Here P(X1, …, Xn | Y) is the likelihood, P(Y) is the prior, and P(X1, …, Xn) is the normalization constant.
Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label (i.e. the likelihood).
*
The Bayes Classifier
Let's expand this for our digit recognition task: compare
P(Y=5 | X1, …, Xn) = P(X1, …, Xn | Y=5) P(Y=5) / P(X1, …, Xn)   versus   P(Y=6 | X1, …, Xn) = P(X1, …, Xn | Y=6) P(Y=6) / P(X1, …, Xn)
To classify, we'll simply compute these two probabilities and predict based on which one is greater.
*
Model Parameters
For the Bayes classifier, we need to “learn” two functions, the likelihood and the prior.
How many parameters are required to specify the prior for our digit recognition example? (Just one: P(Y = 5), since P(Y = 6) = 1 − P(Y = 5).)
*
Model Parameters
How many parameters are required to specify the likelihood P(X1, …, Xn | Y)?
(Supposing that each image is 30×30 pixels, so n = 900 binary features — on the order of 2^900 parameters per class.)
*
Model Parameters
The problem with explicitly modeling P(X1,…,Xn|Y) is that there are usually way too many parameters:
We’ll run out of space
We’ll run out of time
And we’ll need tons of training data (which is usually not available)
*
How many parameters must we estimate?
Suppose X = (X1, X2, …, Xn), where each Xi is boolean.
To estimate P(Y | X1, X2, …, Xn), 2^n quantities need to be estimated!
If we have 30 boolean Xi's: P(Y | X1, X2, …, X30) requires 2^30 ≈ 1 billion!
How many parameters for P(X1, X2, … Xn|Y) ?
Hence, we need lots of data or a very small n.
*
How many parameters must we estimate?
Consider the number of parameters we must estimate when Y is boolean and X is a vector of n boolean attributes. In this case, we need to estimate a set of parameters θij = P(X = xi | Y = yj).
*
How many parameters must we estimate?
Reasoning: the index i takes on 2^n possible values (one for each possible vector value of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^(n+1) parameters.
To calculate the exact number of required parameters, note that for any fixed j, the sum over i of θij must be one. Therefore, for any particular value yj and the 2^n possible values xi, we need to compute 2^n − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n − 1) such θij parameters.
*
The Naïve Bayes Model
The Naïve Bayes assumption: assume that all features are independent given the class label Y. Then
P(X1, …, Xn | Y)
= P(X1 | X2, …, Xn, Y) P(X2 | X3, …, Xn, Y) ⋯ P(Xn | Y)
= P(X1 | Y) P(X2 | Y) ⋯ P(Xn | Y)
Note that the second line follows from a general property of probabilities (the chain rule) and the third line follows from the definition of conditional independence.
*
The Naïve Bayes Model
The Naïve Bayes Assumption: Assume that all features are independent given the class label Y
Equationally speaking: P(X1, …, Xn | Y) = P(X1 | Y) P(X2 | Y) ⋯ P(Xn | Y)
(We will discuss the validity of this assumption later)
*
Bayes’ Rule: An example
Our experiment consists of selecting a bowl with the prior probabilities given below and then drawing a ball at random from that bowl; call R the event that the ball drawn is red.
Bayes’ Rule: An example
Prior probability (given):
Bowl   Red balls   White balls   P(select bowl)
A      2           4             1/3
B      1           2             1/6
C      5           4             1/2
Bayes’ Rule: An example
The event R (a red ball is drawn) is the union of the mutually exclusive events A∩R, B∩R, and C∩R, i.e.
P(R) = P(A∩R) + P(B∩R) + P(C∩R)
     = 1/3 × 2/6 + 1/6 × 1/3 + 1/2 × 5/9 = 8/18
Bayes’ Rule: An example
Suppose that the outcome of event R is a red ball, but we do not know from which bowl it was drawn.
Accordingly, we compute the conditional probability that the red ball was drawn from bowl A, namely P(A|R), by using Bayes' rule.
Similarly for B and C, i.e. P(B|R) and P(C|R).
Bayes’ Rule: An example
P(A∩R) = 1/3 × 2/6 = 2/18
P(B∩R) = 1/6 × 1/3 = 1/18
P(C∩R) = 1/2 × 5/9 = 5/18
Dividing each by P(R) = 8/18 gives the posteriors P(A|R), P(B|R), and P(C|R).
Bayes’ Rule: An example
Bowl   Red balls   White balls   Prior P(bowl)   Posterior P(bowl|R)
A      2           4             1/3             2/8
B      1           2             1/6             1/8
C      5           4             1/2             5/8
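These numbers are easy to check with a few lines of Python (plain arithmetic only; the bowl contents and priors are taken from the table above):

# Bowls: (red balls, white balls, prior probability of selecting the bowl)
bowls = {"A": (2, 4, 1/3), "B": (1, 2, 1/6), "C": (5, 4, 1/2)}

# P(R) = sum over bowls of P(bowl) * P(red | bowl)
p_red = sum(prior * red / (red + white) for red, white, prior in bowls.values())
print(p_red)  # 8/18 ≈ 0.444

# Posterior P(bowl | R) = P(bowl) * P(red | bowl) / P(R)
for name, (red, white, prior) in bowls.items():
    print(name, prior * red / (red + white) / p_red)  # A: 2/8, B: 1/8, C: 5/8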
“Essentially, all models are wrong, but some are useful.”
— George E.P. Box (1919 – 2013)
University of Wisconsin
*
Why is this useful?
# of parameters for modeling P(X1, …, Xn | Y): too many — 2(2^n − 1)
# of parameters for modeling P(X1|Y), …, P(Xn|Y): much better — just 2n
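To make the gap concrete, here is a small Python check of the two parameter counts from this slide, for a few feature sizes:

# Number of parameters for the full joint likelihood vs. the Naive Bayes factorization
# (binary class Y, n binary features)
for n in (10, 30, 900):
    full_joint = 2 * (2**n - 1)   # P(X1,...,Xn | Y): 2^n - 1 free parameters per class
    naive_bayes = 2 * n           # P(Xi | Y): one parameter per feature per class
    print(f"n={n}: full joint {full_joint:.3g} vs. Naive Bayes {naive_bayes}")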
*
Naïve Bayes Training
Now that we’ve decided to use a Naïve Bayes classifier, we need to train it with some data:
MNIST Training Data
*
Naïve Bayes Training
Training in Naïve Bayes is easy:
Estimate the prior P(Y = v) as the fraction of records with Y = v
Estimate the likelihood P(Xi = u | Y = v) as the fraction of records with Y = v for which Xi = u
(This corresponds to maximum likelihood estimation of the model parameters; a counting sketch follows below.)
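A minimal counting sketch of this training procedure in plain Python, for binary features; the tiny dataset here is made up for illustration (it is not the MNIST data from the previous slide):

# Toy labeled data: each record is (features, label) with binary features
data = [([1, 0, 1], 5), ([1, 1, 1], 5), ([0, 0, 1], 6), ([0, 1, 0], 6), ([1, 0, 0], 5)]

labels = [y for _, y in data]
prior = {v: labels.count(v) / len(labels) for v in set(labels)}   # P(Y = v)

likelihood = {}                                                   # P(Xi = 1 | Y = v)
for v in set(labels):
    records = [x for x, y in data if y == v]
    likelihood[v] = [sum(col) / len(records) for col in zip(*records)]

print(prior)       # e.g. {5: 0.6, 6: 0.4}
print(likelihood)  # per-class fraction of records with Xi = 1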
*
Naïve Bayes Training: if zero
In practice, some of these counts can be zero.
Fix this by adding "virtual" counts to every possible value of each feature before computing the likelihood fractions (see the sketch below).
(This is like putting a prior on the parameters and doing MAP estimation instead of MLE.)
This is called smoothing.
MAP: maximum a posteriori
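One standard concrete form of the "virtual counts" idea is add-one (Laplace) smoothing; a minimal sketch, assuming binary-valued features:

def smoothed_likelihood(count_xi_and_y, count_y, n_values=2, virtual=1):
    # P(Xi = u | Y = v) with `virtual` pseudo-counts added to every possible value of Xi,
    # so the estimate is never exactly zero even when the raw count is zero
    return (count_xi_and_y + virtual) / (count_y + virtual * n_values)

print(smoothed_likelihood(0, 9))   # 1/11 ≈ 0.09 instead of 0
print(smoothed_likelihood(3, 9))   # 4/11 ≈ 0.36 instead of 3/9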
*
Naïve Bayes Training
For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.
*
Naïve Bayes Classification
To classify a new example, predict the label with the largest score: ŷ = argmax_v P(Y = v) ∏i P(Xi = xi | Y = v)
*
Another Example of the Naïve Bayes Classifier
The weather data, with counts and probabilities (yes / no):
Counts:
  outlook:      sunny 2/3,  overcast 4/0,  rainy 3/2
  temperature:  hot 2/2,  mild 4/2,  cool 3/1
  humidity:     high 3/4,  normal 6/1
  windy:        false 6/2,  true 3/3
  play:         yes 9,  no 5
Probabilities:
  outlook:      sunny 2/9, 3/5;  overcast 4/9, 0/5;  rainy 3/9, 2/5
  temperature:  hot 2/9, 2/5;  mild 4/9, 2/5;  cool 3/9, 1/5
  humidity:     high 3/9, 4/5;  normal 6/9, 1/5
  windy:        false 6/9, 2/5;  true 3/9, 3/5
  play:         9/14, 5/14
A new day: outlook = sunny, temperature = cool, humidity = high, windy = true, play = ?
Weather Example
Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Therefore, the prediction is No
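The same computation as a short Python sketch, with the probabilities hard-coded from the table above and then normalized so the two scores sum to one:

# P(attribute value | play) read off the weather table, for the new day:
# outlook=sunny, temperature=cool, humidity=high, windy=true
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

print(p_yes, p_no)
print("P(play=yes | new day) =", p_yes / (p_yes + p_no))   # ≈ 0.205
print("P(play=no  | new day) =", p_no  / (p_yes + p_no))   # ≈ 0.795 -> predict no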
The Naive Bayes Classifier for Data Sets with Numerical Attribute Values
One common practice to handle numerical attribute values is to assume normal distributions for numerical attributes.
The numeric weather data with summary statistics (yes / no):
  outlook (counts):        sunny 2/3,  overcast 4/0,  rainy 3/2
  temperature (values):    yes: 83, 70, 68, 64, 69, 75, 75, 72, 81;  no: 85, 80, 65, 72, 71
  humidity (values):       yes: 86, 96, 80, 65, 70, 80, 70, 90, 75;  no: 85, 90, 70, 95, 91
  windy (counts):          false 6/2,  true 3/3
  play (counts):           yes 9,  no 5
  outlook (probabilities): sunny 2/9, 3/5;  overcast 4/9, 0/5;  rainy 3/9, 2/5
  temperature:             mean 73 / 74.6,  std dev 6.2 / 7.9
  humidity:                mean 79.1 / 86.2,  std dev 10.2 / 9.7
  windy (probabilities):   false 6/9, 2/5;  true 3/9, 3/5
  play (probabilities):    9/14, 5/14
Weather Example with numerical data
Let x1, x2, …, xn be the values of a numerical attribute in the training data set. For each class, estimate
μ = (1/n) Σi xi
σ² = (1/(n − 1)) Σi (xi − μ)²
and model the attribute with the normal density f(x) = 1/(√(2π) σ) · e^(−(x − μ)² / (2σ²))
Weather Example with numerical data
For example, for a new day with outlook = sunny, temperature = 66, humidity = 90, windy = true:
f(temperature = 66 | yes) = 1/(√(2π) · 6.2) · e^(−(66 − 73)² / (2 · 6.2²)) = 0.0340
Likelihood of Yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of No = 3/5 × 0.0291 × 0.038 × 3/5 × 5/14 = 0.000136
Therefore, the prediction is again No.
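A small Python sketch of the same calculation, using the means and standard deviations from the summary table above:

import math

def normal_density(x, mean, std):
    # f(x) = 1 / (sqrt(2*pi)*sigma) * exp(-(x - mean)^2 / (2*sigma^2))
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

print(normal_density(66, 73, 6.2))     # ≈ 0.0340  (temperature = 66 given yes)

# New day: outlook=sunny, temperature=66, humidity=90, windy=true
like_yes = (2/9) * normal_density(66, 73, 6.2) * normal_density(90, 79.1, 10.2) * (3/9) * (9/14)
like_no  = (3/5) * normal_density(66, 74.6, 7.9) * normal_density(90, 86.2, 9.7) * (3/5) * (5/14)
print(like_yes, like_no)               # ≈ 3.6e-05 vs ≈ 1.4e-04 -> predict no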
Outputting Probabilities
What’s nice about Naïve Bayes (and generative models in general) is that it returns probabilities
These probabilities can tell us how confident the algorithm is
So… don’t throw away those probabilities!
*
Performance on a Test Set
Naïve Bayes is often a good choice if you don’t have much training data!
[Plot: classification accuracy as a function of the size of the training set.]
*
Naïve Bayes Assumption
Recall the Naïve Bayes assumption:
that all features are independent given the class label Y
Does this hold for the digit recognition problem?
*
Exclusive-OR Example
For an example where conditional independence fails:
Y=XOR(X1,X2)
X1   X2   P(Y=0 | X1,X2)   P(Y=1 | X1,X2)
0    0    1                0
0    1    0                1
1    0    0                1
1    1    1                0
*
Naïve Bayes assumption
Actually, the Naïve Bayes assumption is almost never true.
Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold.
*
Numerical Stability: zero issue
It is often the case that machine learning algorithms need to work with very small numbers
Imagine computing the probability of 2000 independent coin flips
MATLAB thinks that (0.5)^2000 = 0
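The same effect is easy to reproduce in Python, and working with logs avoids it:

import math

p = 0.5 ** 2000
print(p)                      # 0.0 -- the true value underflows to zero

log_p = 2000 * math.log(0.5)  # work in log space instead
print(log_p)                  # ≈ -1386.29, perfectly representable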
*
Underflow Prevention
Multiplying lots of probabilities can result in floating-point underflow.
Recall: log(xy) = log(x) + log(y), so it is better to sum logs of probabilities than to multiply the probabilities themselves.
*
Underflow Prevention
The class with the highest final un-normalized log probability score is still the most probable (take the log of Bayes rule):
c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]
*
Numerical Stability
Instead of comparing P(Y=5|X1,…,Xn) with P(Y=6|X1,…,Xn),
Compare their logarithms
*
Recap
We defined a Bayes classifier but saw that it’s intractable to compute P(X1,…,Xn|Y).
We then used the Naïve Bayes assumption – that everything is independent given the class label Y.
A natural question: is there some happy compromise where we only assume that some features are conditionally independent?
*
Pros and cons of Naïve Bayes
Advantages:
It is relatively simple to understand and build.
It is easily trained, even with a small dataset.
It is fast.
It is not sensitive to irrelevant features.
Disadvantages:
It assumes every feature is independent, which is not always the case.
*
An Example of Naïve Bayes
(http://blog.aylien.com/naive-bayes-for-dummies-a-simple-explanation/)
So, let's say we have data on 1,000 pieces of fruit. Each fruit is a Banana, an Orange, or some Other fruit, and we know three features of each fruit: whether it is Long, Sweet, and Yellow, as displayed in the table below:
Fruit    Long   Sweet   Yellow   Total
Banana   400    350     450      500
Orange     0    150     300      300
Other    100    150      50      200
Total    500    650     800     1000
*
An Example of Naïve Bayes
So from the table what do we already know?
50% of the fruits are bananas
30% are oranges
20% are other fruits
Prior
*
An Example of Naïve Bayes
Based on our training set we can also say the following:
From 500 bananas 400 (0.8) are Long, 350 (0.7) are Sweet and 450 (0.9) are Yellow
Out of 300 oranges 0 are Long, 150 (0.5) are Sweet and 300 (1) are Yellow
From the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet and 50 (0.25) are Yellow
Which should provide enough evidence to predict the class of another fruit as it’s introduced.
Likelihood
*
An Example of Naïve Bayes
So let's say we're given the features of a piece of fruit and we need to predict its class.
If we're told that the additional fruit is Long, Sweet, and Yellow, we can classify it by plugging the values for each class (Banana, Orange, Other fruit) into Bayes' rule.
The class with the highest probability (score) is the winner.
*
An Example of Naïve Bayes
Banana:
P(Banana | Long, Sweet, Yellow) ∝ P(Long|Banana) × P(Sweet|Banana) × P(Yellow|Banana) × P(Banana) = 0.8 × 0.7 × 0.9 × 0.5 = 0.252
*
An Example of Naïve Bayes
Orange:
P(Orange | Long, Sweet, Yellow) ∝ 0.0 × 0.5 × 1.0 × 0.3 = 0
*
An Example of Naïve Bayes
Other Fruit:
P(Other | Long, Sweet, Yellow) ∝ 0.5 × 0.75 × 0.25 × 0.2 = 0.01875, so the fruit is classified as a Banana.
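The three scores can be reproduced with a few lines of Python (probabilities hard-coded from the fruit counts above; the scores are unnormalized, as in the slides):

# P(Long|class), P(Sweet|class), P(Yellow|class), P(class)
fruit = {
    "Banana": (0.8, 0.7, 0.9, 0.5),
    "Orange": (0.0, 0.5, 1.0, 0.3),
    "Other":  (0.5, 0.75, 0.25, 0.2),
}

scores = {name: long * sweet * yellow * prior
          for name, (long, sweet, yellow, prior) in fruit.items()}
print(scores)                       # Banana 0.252, Orange 0.0, Other 0.01875
print(max(scores, key=scores.get))  # -> Banana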
*
Evaluating classification algorithms
*
Evaluating classification algorithms
You have designed a new classifier.
You give it to me, and I try it on my image dataset.
*
Evaluating classification algorithms
I tell you that it achieved 95% accuracy on my data.
Is your technique a success?
*
Types of errors
But suppose that:
the 95% refers to correctly classified pixels,
only 5% of the pixels are actually edges,
and the classifier misses all of the edge pixels.
How do we count the effect of different types of error?
Evaluation for Classification
Evaluation Metrics
Confusion matrix: shows the performance of an algorithm, especially its predictive capability,
rather than how fast it classifies or builds models, or how well it scales.
Type I error (false positive): the truth is No, but you say Yes.
Type II error (false negative): the truth is Yes, but you say No.
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how fast it classifies or builds models, its scalability, etc.
Confusion matrix:
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)
a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)
Limitation of Accuracy
Consider a 2-class problem
Number of Class 1 examples = 9990
Number of Class 2 examples = 10
If model predicts everything to be class 1, accuracy is 9990/10000 = 99.9 %
Accuracy is misleading because model does not detect any class 2 example
Cost Matrix
C(i|j): cost of misclassifying a class j example as class i
                          PREDICTED CLASS
                          Class=Yes     Class=No
ACTUAL CLASS  Class=Yes   C(Yes|Yes)    C(No|Yes)
              Class=No    C(Yes|No)     C(No|No)
Computing Cost of Classification
Cost matrix:
                          PREDICTED CLASS
                          +       –
ACTUAL CLASS   +          -1      100
               –           1        0

Model M1:
                          PREDICTED CLASS
                          +       –
ACTUAL CLASS   +          150     40
               –           60    250
Accuracy = 80%, Cost = 3910

Model M2:
                          PREDICTED CLASS
                          +       –
ACTUAL CLASS   +          250     45
               –            5    200
Accuracy = 90%, Cost = 4255
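A short Python check of these accuracy and cost figures, with the confusion matrices and the cost matrix hard-coded from the tables above:

# Confusion matrices as (TP, FN, FP, TN) for the positive class "+"
models = {"M1": (150, 40, 60, 250), "M2": (250, 45, 5, 200)}
# Cost matrix: C(predicted | actual)
cost = {("+", "+"): -1, ("-", "+"): 100, ("+", "-"): 1, ("-", "-"): 0}

for name, (tp, fn, fp, tn) in models.items():
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    total_cost = (tp * cost[("+", "+")] + fn * cost[("-", "+")]
                  + fp * cost[("+", "-")] + tn * cost[("-", "-")])
    print(name, f"accuracy={accuracy:.0%}", f"cost={total_cost}")
# M1: accuracy=80%, cost=3910;  M2: accuracy=90%, cost=4255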
Cost vs Accuracy
Count:
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a            b
              Class=No    c            d

Cost:
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   p            q
              Class=No    q            p

N = a + b + c + d
Accuracy = (a + d) / N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N − a − d)
     = qN − (q − p)(a + d)
     = N [q − (q − p) Accuracy]
Accuracy is proportional to cost if
1. C(Yes|No)=C(No|Yes) = q
2. C(Yes|Yes)=C(No|No) = p
Cost-Sensitive Measures
Precision and recall are two widely used metrics in applications where successful detection of one of the classes is considered more important than detection of the other classes.
Precision (p) = a / (a + c) = TP / (TP + FP)
Recall (r) = a / (a + b) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
Precision is biased towards C(Yes|Yes) & C(Yes|No); Recall is biased towards C(Yes|Yes) & C(No|Yes); F-measure is biased towards all except C(No|No).
Evaluation Metrics
Sensitivity or True Positive Rate (TPR) = TP / (TP + FN)
A parameter describing the success in finding a particular type of target (also called hit rate).
Specificity or True Negative Rate (TNR) = TN / (FP + TN)
A term that is important in medicine: the proportion of healthy patients who are correctly told after the test that they are not ill.
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)
Evaluation Metrics
Recall (= sensitivity= TPR)
TP/(TP+FN)
A term used when describing the success in finding an item in a database (example: information retrieval).
Discriminability
TP/(TP+FP)
A term used when describing the success in differentiating a particular type of target from a similar type of target.
Precision or Positive Predictive Value (PPV)
TP/(TP+FP)
A term describing the accuracy in picking out a particular type of target from any distractors, including noise and clutter.
Evaluation Metrics
Sensitivity = Recall
TP/(TP+FN)
Discriminability = Precision
TP/(TP+FP)
Negative Predictive Value (NPV)
TN/(TN+FN)
Accuracy
(TP+TN)/(TP+FP+TN+FN)
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
              Class=No    c (FP)       d (TN)
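All of these metrics are one-liners once the four counts are known; a minimal Python sketch (the example counts are the edge-detection numbers used later in these slides):

def metrics(tp, fn, fp, tn):
    # Standard confusion-matrix metrics from the slides
    return {
        "sensitivity/recall (TPR)": tp / (tp + fn),
        "specificity (TNR)":        tn / (tn + fp),
        "precision (PPV)":          tp / (tp + fp),
        "negative pred. value":     tn / (tn + fn),
        "accuracy":                 (tp + tn) / (tp + fn + fp + tn),
    }

print(metrics(tp=60, fn=30, fp=20, tn=80))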
ROC (Receiver Operating Characteristic)
A good classification model should be located as close as possible to the upper left corner of the diagram.
The performance of each classifier is represented as a point in ROC space;
changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point.
ROC curve
Receiver Operating Characteristic (ROC)
Graphical approach for displaying the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) of a classifier
TPR = positives correctly classified/total positives
FPR = negatives incorrectly classified/total negatives
TPR on y-axis and FPR on x-axis
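A tiny pure-Python sketch of how the points of an ROC curve arise from sweeping the threshold over a classifier's scores; the scores and labels here are made-up illustration data:

# Made-up classifier scores with true labels (1 = positive, 0 = negative)
scored = [(0.95, 1), (0.85, 1), (0.80, 0), (0.70, 1), (0.55, 0), (0.40, 1), (0.30, 0), (0.10, 0)]

pos = sum(1 for _, y in scored if y == 1)
neg = len(scored) - pos

for t in sorted({s for s, _ in scored}, reverse=True):
    tp = sum(1 for s, y in scored if s >= t and y == 1)
    fp = sum(1 for s, y in scored if s >= t and y == 0)
    print(f"threshold={t:.2f}  TPR={tp / pos:.2f}  FPR={fp / neg:.2f}")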
ROC Curve
– A 1-dimensional data set containing 2 classes (positive and negative)
– Any point located at x > t is classified as positive
At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC curve
Points of interest, written as (TPR, FPR):
(0, 0): everything is classified as negative
(1, 1): everything is classified as positive
(1, 0): perfect (ideal)
Diagonal line: random guessing (50%)
Area Under the Curve (AUC): measures how good the model is on average; useful for comparing it with other methods.
ROC Curve
(TPR, FPR):
(0, 0): declare everything to be the negative class
(1, 1): declare everything to be the positive class
(1, 0): ideal
Diagonal line: random guessing
Below the diagonal line: the prediction is the opposite of the true class
ROC Curve
A model that is strictly better than another would have a larger area under the ROC curve.
If the model is perfect, then its area under the ROC curve would equal 1.
If the model simply performs random guessing, then its area under the ROC curve would equal 0.5.
Types of errors (TP, FP, FN and TN)
Each prediction for a pixel is labelled with two parts: what we said (positive = edge, negative = not edge) and whether we got it correct (true) or not (false):
                          Ground truth: edge     Ground truth: not edge
Prediction: edge          TP (true positive)     FP (false positive)
Prediction: not edge      FN (false negative)    TN (true negative)
For example, "True Positive" means we said "positive" (i.e. edge) and we did get it correct;
"False Negative" means we said "negative" (i.e. not edge) and we did not get it correct.
Sensitivity and Specificity
Count up the total number of each label (TP, FP, TN, FN) over a large dataset. In ROC analysis, we use two statistics:
Sensitivity = TP / (TP + FN)
Can be thought of as the likelihood of spotting a positive case when presented with one; or the proportion of edges we find.
Specificity = TN / (TN + FP)
Can be thought of as the likelihood of spotting a negative case when presented with one; or the proportion of non-edges that we find.
Worked example:
                          Ground truth: edge (1)    Ground truth: non-edge (0)
Prediction: edge (1)      TP = 60                   FP = 20
Prediction: non-edge (0)  FN = 30                   TN = 80
80 + 20 = 100 cases in the dataset were class 0 (non-edge)
60 + 30 = 90 cases in the dataset were class 1 (edge)
90 + 100 = 190 examples (pixels) in the data overall
Sensitivity = TP / (TP + FN) = 60 / 90 ≈ 0.67
Specificity = TN / (TN + FP) = 80 / 100 = 0.80
The ROC space
[Plot: ROC space, with sensitivity on the y-axis (0.0 to 1.0) and 1 − specificity on the x-axis (0.0 to 1.0); edge detectors A and B are shown as points.]
Note
ROC: Receiver Operating Characteristic
ROC
In statistics, a receiver operating characteristic curve, i.e. ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
Receiver operating characteristic – Wikipedia
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The ROC Curve
Draw a "convex hull" around many points in ROC space (sensitivity vs. 1 − specificity).
[Plot: detector points and their convex hull; one point lies below the hull, i.e. not on the convex hull.]
ROC Analysis
[Plot: ROC space (sensitivity vs. 1 − specificity) showing the convex hull of many detector points.]
All the optimal detectors lie on the convex hull.
Which of these is best depends on the ratio of edges to non-edges, and on the different costs of misclassification.
Any detector below the diagonal can be turned into a better detector by flipping its output.
Take-home point : You should always quote sensitivity and specificity for your algorithm, if possible plotting an ROC graph. Remember also though, any statistic you quote should be an average over a suitable range of tests for your algorithm.
Holdout estimation
What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training.
Usually: one third for testing, the rest for training
Holdout estimation
Problem: the samples might not be representative
Example: class might be missing in the test data
Advanced version uses stratification
Ensures that each class is represented with approximately equal proportions in both subsets
Repeated holdout method
Repeat the process with different subsamples → more reliable.
In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
The error rates on the different iterations are averaged to yield an overall error rate.
Repeated holdout method
Still not optimum: the different test sets overlap
Can we prevent overlapping?
Of course!
Holdout
Split dataset into two groups for training and test
Training dataset: used to train the model
Test dataset: used to estimate the error rate of the model
Drawback: when an "unfortunate split" happens, the holdout estimate of the error rate will be misleading.
[Diagram: the entire dataset is split into two parts, a training set and a test set.]
Random Subsampling
Split the data set into two groups
Randomly selects a number of samples without replacement
Usually, one third for testing, the rest for training
K-Fold Cross-validation
K-fold partition: partition the data into K equal-sized subsets (folds).
Use K − 1 folds for training and the remaining one for testing; repeat so that each fold serves as the test set once (a small index-splitting sketch follows below).
[Diagram: 5 experiments; in each experiment a different fold is held out as the test set.]
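A minimal sketch of the K-fold split in Python (index bookkeeping only, with no shuffling or stratification):

def k_fold_indices(n_samples, k=5):
    # Yield (train_indices, test_indices) for each of the k experiments
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size : (i + 1) * fold_size] if i < k - 1 else indices[i * fold_size :]
        train = [j for j in indices if j not in test]
        yield train, test

for fold, (train, test) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"Experiment {fold}: train on {train}, test on {test}")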
*
Cross-validation
Cross-validation avoids overlapping test sets
First step: split data into k subsets of equal size
Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation
Cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
Empirical evidence supports this as a good choice to get an accurate estimate
There is also some theoretical evidence for this
Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
Leave-One-Out cross-validation
Leave-One-Out:
a particular form of cross-validation:
Set number of folds to number of training instances
I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
(exception: NN)
Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
It guarantees a non-stratified sample because there is only one instance in the test set!
Conclusions
Naïve Bayes is:
Really easy to implement and often works well
Often a good first thing to try
Commonly used as a “punching bag” for smarter algorithms
Evaluate classification algorithms
TP, FP, FN, TN
ROC
Cross validation
*
Questions & Suggestions?
The End
*
Appendix
*
Recovering the Probabilities (skip)
What if we want the probabilities though??
Suppose that for some constant K, we only know the shifted log scores log[K · P(Y=5 | X1, …, Xn)] and log[K · P(Y=6 | X1, …, Xn)].
How would we recover the original probabilities?
*
Recovering the Probabilities (skip)
Given the unnormalized log scores bi = log(K · pi) for each class i:
Then for any constant C, pi = exp(bi + C) / Σj exp(bj + C).
One suggestion: set C so that the greatest bi is shifted to zero, i.e. C = −maxi bi (the log-sum-exp trick).
See https://stats.stackexchange.com/questions/105602/example-of-how-the-log-sum-exp-trick-works-in-naive-bayes?noredirect=1&lq=1
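A minimal Python sketch of this normalization, assuming the unnormalized log scores are collected in a list (this is the usual log-sum-exp trick):

import math

def normalize_log_scores(log_scores):
    # Shift so the largest score becomes 0, exponentiate, then normalize
    c = -max(log_scores)
    exps = [math.exp(s + c) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

print(normalize_log_scores([-1005.2, -1001.7]))  # ≈ [0.029, 0.971]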
*
Detour: Model Parameters
In the context of a mathematical model, such as a probability distribution, the distinction between variables and parameters was described by Bard as follows: We refer to the relations which supposedly describe a certain physical situation, as a model. Typically, a model consists of one or more equations.
Parameter – Wikipedia
https://en.wikipedia.org/wiki/Parameter
*
Detour: Model Parameters
Mathematical functions have one or more arguments that are designated in the definition by variables. A function definition can also contain parameters, but unlike variables, parameters are not listed among the arguments that the function takes. When parameters are present, the definition actually defines a whole family of functions, one for every valid set of values of the parameters.
Parameter – Wikipedia
https://en.wikipedia.org/wiki/Parameter
*
Example: Model Parameters
For instance, one could define a general quadratic function by declaring
f(x) = a·x² + b·x + c
Here, the variable x designates the function's argument, but a, b, and c are parameters that determine which particular quadratic function is being considered.
Parameter – Wikipedia
https://en.wikipedia.org/wiki/Parameter
*