Logistic Regression
What is logistic regression?
• It is a linear model for classification (contrary to its name!)
Recall the difference:
• In regression, the targets are real values
• In classification, the targets are categories, and they are called labels
Logistic regression – outline
We will go through the same conceptual journey as before:
1) Model formulation
2) Cost function
3) Learning algorithm by gradient descent
• We want to put a boundary between 2 classes
• If x has a single attribute, we can do it with a point (a threshold)
• If x has 2 attributes, we can do it with a line
• If x has 3 attributes, we can do it with a plane
• If x has more than 3 attributes, we can do it with a hyperplane (can't draw it anymore)
• If the classes are linearly separable, the training error will be 0.
Q: Can you plug classification data into linear regression?
A: Yes, but it may not perform very well: there is no ordering between categories, unlike between real numbers. We need a better model.
We change the linear model slightly by passing it through a nonlinearity.
– If x has 1 attribute, we will have
h(x; w) = σ(w_0 + w_1 x) = 1 / (1 + e^−(w_0 + w_1 x))
The function σ(u) = 1 / (1 + e^−u) is called the sigmoid function or logistic function.
Sigmoid function
It is a smoothed version of a step function – note the step function would make optimisation difficult, since its gradient is zero everywhere except at the jump.
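For concreteness, here is a minimal NumPy sketch comparing the sigmoid with the step function it smooths (illustrative only; the function names and the evaluation range are assumptions, not taken from the slides):

import numpy as np

def sigmoid(u):
    # logistic (sigmoid) function: smooth, differentiable, output strictly in (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

def step(u):
    # hard threshold: 1 if u >= 0 else 0 -- not useful for gradient-based optimisation
    return (u >= 0).astype(float)

u = np.linspace(-6, 6, 13)
print(np.round(sigmoid(u), 3))  # rises smoothly from near 0 to near 1
print(step(u))                  # jumps abruptly from 0 to 1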
Play around with the logistic model
• Go to https://www.desmos.com/calculator
• Type: y = 1 / (1 + exp(−(w_0 + w_1 x)))
• Change the values of the free parameters to see their effect
• Imagine how this function could fit this data better than a line did.
• What if your data happens to have class 1 on the left, and class 0 on the right?
• w_1 can be negative, so the same model works.
– If x has d attributes, that is x = (x_1, x_2, …, x_d), we will write
h(x; w) = σ(w_0 + w_1 x_1 + ⋯ + w_d x_d) = 1 / (1 + e^−(w^T x))
All components of w are free parameters.
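A minimal sketch of this hypothesis function in NumPy; the convention of prepending a constant 1 to x so that the bias w_0 is absorbed into w^T x is one common choice assumed here, not something the slides prescribe:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def h(x, w):
    # h(x; w) = sigmoid(w_0 + w_1*x_1 + ... + w_d*x_d)
    # x: array of d attributes; w: array of d+1 weights, with w[0] playing the role of w_0
    x_aug = np.concatenate(([1.0], x))   # prepend a constant 1 so w^T x_aug includes w_0
    return sigmoid(w @ x_aug)

# example with d = 2 attributes
print(h(np.array([2.0, 0.5]), np.array([-1.0, 1.0, 1.0])))   # sigmoid(-1 + 2 + 0.5) ≈ 0.82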
Meaning of the sigmoid function
• The sigmoid function takes a single argument (note, w^T x is one number).
• It always returns a value between 0 and 1. The meaning of this value is the probability that the label is 1.
σ(w^T x) = P(y = 1 | x; w)
– If this is smaller than 0.5, then we predict label 0.
– If this is larger than 0.5, then we predict label 1.
– There is a slim chance that the sigmoid outputs exactly 0.5. The set of all possible inputs for which this happens is called the decision boundary.
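A sketch of this prediction rule, reusing the hypothetical h(x, w) function from the earlier sketch:

def predict(x, w):
    # predict label 1 when P(y = 1 | x; w) = h(x; w) is larger than 0.5, otherwise label 0
    return 1 if h(x, w) > 0.5 else 0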
Check your understanding
• Can you express the probability that the label is 0 using the sigmoid?
σ(w^T x) = P(y = 1 | x; w)
⇒ 1 − σ(w^T x) = 1 − P(y = 1 | x; w) = P(y = 0 | x; w)
• In fact we can write both in one line as:
P(y | x; w) = σ(w^T x)^y · (1 − σ(w^T x))^(1−y)   // y given x has a Bernoulli distribution
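A quick numeric check (illustrative only; the value of w^T x is made up) that the one-line Bernoulli form reproduces the two separate cases:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

z = 0.3                          # some value of w^T x
p1 = sigmoid(z)                  # P(y = 1 | x; w)
for y in (0, 1):
    one_line = sigmoid(z) ** y * (1 - sigmoid(z)) ** (1 - y)
    two_cases = p1 if y == 1 else 1 - p1
    print(y, one_line, two_cases)    # the two expressions agree for y = 0 and y = 1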
Worked example
• Suppose we have 2 input attributes, so our model is h(x; w) = σ(w_0 + w_1 x_1 + w_2 x_2).
• Suppose we know that w_0 = −1, w_1 = 1, w_2 = 1.
• When do we predict 1? What is the decision boundary?
• We predict 1 precisely when P(y = 1 | x; w) > 0.5. That is, when h(x; w) > 0.5.
• This happens precisely when the argument of the sigmoid is positive!
• Decision boundary: −1 + x_1 + x_2 = 0. This is a line.
• Q: Is the decision boundary of logistic regression always linear? A: Yes.
[Figure: the decision boundary −1 + x_1 + x_2 = 0 in the (x_1, x_2) plane]
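A small sketch checking the worked example numerically; the weights match the slide, while the test points and function names are made up for illustration:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

w0, w1, w2 = -1.0, 1.0, 1.0

def predict(x1, x2):
    # predict 1 exactly when the argument of the sigmoid, -1 + x1 + x2, is positive
    return 1 if sigmoid(w0 + w1 * x1 + w2 * x2) > 0.5 else 0

print(predict(2.0, 0.5))   # -1 + 2.0 + 0.5 = 1.5 > 0   -> predict 1
print(predict(0.2, 0.3))   # -1 + 0.2 + 0.3 = -0.5 < 0  -> predict 0
print(predict(0.5, 0.5))   # exactly on the boundary: sigmoid(0) = 0.5, not > 0.5 -> predict 0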
2) Cost function
• We need a new cost function: with the new hypothesis function, the mean squared error used in linear regression becomes a very wiggly (non-convex) function of the weights, which would be difficult to optimise.
• But as before we will still have that:
• each data point contributes a cost, and the overall cost function is the average
• the cost is a function of the free parameters of the model
Logistic cost function
For each (x,y) pair,
Cost(h(x; w), y) = −log h(x; w)          if y = 1
Cost(h(x; w), y) = −log(1 − h(x; w))     if y = 0
Each case is convex (easy to minimise).
Overall cost:  g(w) = (1/N) Σ_{n=1}^{N} Cost(h(x^(n); w), y^(n))
Writing the cost function in a single line
g(w) = (1/N) Σ_{n=1}^{N} Cost(h(x^(n); w), y^(n))
Cost(h(x; w), y) = −log h(x; w)          if y = 1
Cost(h(x; w), y) = −log(1 − h(x; w))     if y = 0
g(w) = −(1/N) Σ_{n=1}^{N} ( y^(n) log h(x^(n); w) + (1 − y^(n)) log(1 − h(x^(n); w)) )
This is also called the cross-entropy.
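A minimal NumPy sketch of this cross-entropy cost; the bias is again absorbed via a column of ones, and the small epsilon guarding against log(0) is an added safeguard, not part of the slides:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cross_entropy_cost(X, y, w, eps=1e-12):
    # g(w) = -(1/N) sum_n ( y^(n) log h(x^(n); w) + (1 - y^(n)) log(1 - h(x^(n); w)) )
    # X: (N, d) attribute matrix, y: (N,) labels in {0, 1}, w: (d+1,) weights, w[0] = bias w_0
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1s so w[0] acts as the bias
    p = sigmoid(X_aug @ w)                              # h(x^(n); w) for every example n
    p = np.clip(p, eps, 1 - eps)                        # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# tiny example
X = np.array([[0.0, 0.0], [1.5, 1.0]])
y = np.array([0, 1])
print(cross_entropy_cost(X, y, np.array([-1.0, 1.0, 1.0])))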
Logistic regression – what we want to do
• Given training data
(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))
• Fit the model
• By minimising the cross-entropy cost function
g(w) = −(1/N) Σ_{n=1}^{N} ( y^(n) log h(x^(n); w) + (1 − y^(n)) log(1 − h(x^(n); w)) )
y = h(x; w) = σ(w^T x)
3) Learning algorithm by gradient descent
• We use gradient descent (again!) to minimise the cost function, i.e. to find the best weight values.
• The gradient vector is*:
∇g(w) = −(1/N) Σ_{n=1}^{N} ( y^(n) − h(x^(n); w) ) · x^(n)
where w = (w_1, w_2, …, w_d)^T and x = (x_1, x_2, …, x_d)^T ∈ R^d
We plug this into the general gradient descent algorithm given last week.
* This follows after differentiating the cost function w.r.t. weights – we omit the lengthy math!
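A sketch of this gradient in NumPy, under the same assumed conventions as the cost sketch above (bias absorbed via a column of ones):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gradient(X, y, w):
    # gradient of the cross-entropy cost: -(1/N) * sum_n (y^(n) - h(x^(n); w)) x^(n)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1s for the bias w_0
    errors = y - sigmoid(X_aug @ w)                     # y^(n) - h(x^(n); w) for every n
    return -(X_aug.T @ errors) / X.shape[0]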
Learning algorithm for logistic regression
While not converged
For n = 1,…,N // each example in the training set
w = w + α (y^(n) − h(x^(n); w)) · x^(n)
Learning algorithm for logistic regression
The same, written component-wise:
While not converged
For n = 1,…,N // each example in the training set
w_0 = w_0 + α (y^(n) − h(x^(n); w))
For i = 1,…,d
w_i = w_i + α (y^(n) − h(x^(n); w)) x_i^(n)
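Putting the pieces together, here is a minimal runnable Python sketch of this per-example (stochastic) gradient descent loop; the fixed number of epochs stands in for a real convergence check, and the function names and toy data are illustrative assumptions:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def h(x, w):
    # h(x; w) = sigmoid(w_0 + w_1*x_1 + ... + w_d*x_d), with w[0] as the bias w_0
    return sigmoid(w[0] + w[1:] @ x)

def train_logistic_regression(X, y, alpha=0.1, epochs=100):
    # per-example gradient descent updates, matching the component-wise algorithm above
    N, d = X.shape
    w = np.zeros(d + 1)
    for _ in range(epochs):                # stands in for "while not converged"
        for n in range(N):                 # each example in the training set
            error = y[n] - h(X[n], w)      # y^(n) - h(x^(n); w)
            w[0] += alpha * error          # update the bias w_0
            w[1:] += alpha * error * X[n]  # update each w_i using x_i^(n)
    return w

# toy 2D data: class 1 roughly where x_1 + x_2 is large
X = np.array([[0.0, 0.2], [0.3, 0.1], [1.5, 1.0], [2.0, 1.8]])
y = np.array([0, 0, 1, 1])
w = train_logistic_regression(X, y)
print(w, [round(float(h(x, w)), 3) for x in X])   # fitted weights and predicted probabilities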
Extensions
• We studied logistic regression for linear binary classification
• There are extensions, such as:
• Nonlinear logistic regression: instead of a linear function as the argument of the sigmoid, we can use polynomial functions of the input attributes
• Multi-class logistic regression: uses a multi-class generalisation of the sigmoid (the softmax function)
• Details of these extensions are beyond the scope of this module
Examples of application of logistic regression
o Face detection: classes consist of images that contain a face and images without a face
o Sentiment analysis: classes consist of written product-reviews expressing a positive or a negative opinion
o Automatic diagnosis of medical conditions: classes consist of medical data of patients who either do or do not have a specific disease