
Lecture 4. Logistic Regression. Basis Expansion
COMP90051 Statistical Machine Learning
Semester 2, 2019. Lecturer: Ben Rubinstein
Copyright: University of Melbourne

This lecture

• Logistic regression
  ∗ Workhorse of binary classification
• Basis expansion
  ∗ Extending model expressiveness via data transformation
  ∗ Examples for linear and logistic regression
  ∗ Theoretical notes

Logistic Regression Model

A workhorse of binary classification

Binary classification: Example

• Example: given body mass index (BMI), does a patient have type 2 diabetes (T2D)?
• This type of problem is called binary classification
• One can use linear regression
  ∗ Fit a line/hyperplane to data (find weights $\boldsymbol{w}$)
  ∗ Denote $s \equiv \boldsymbol{x}'\boldsymbol{w}$
  ∗ Predict "Yes" if $s \ge 0.5$; predict "No" if $s < 0.5$

[Figure: T2D (0 or 1) plotted against BMI with a fitted line; the threshold 0.5 splits "predict no" from "predict yes"]

Approaches to classification

• This approach can be susceptible to outliers
• Overall, the least-squares criterion looks unnatural in this setting: confident predictions are penalised
• There are many methods developed specifically with binary classification in mind
• Examples include logistic regression, perceptron, support vector machines (SVM)

[Figure: the same T2D vs BMI plot; an outlier at high BMI drags the fitted line away from the bulk of the data]

Logistic regression model

• Probabilistic approach to classification
  ∗ $P(Y=1|\boldsymbol{x}) = f(\boldsymbol{x}) = ?$
  ∗ Use a linear function? E.g., $s(\boldsymbol{x}) = \boldsymbol{x}'\boldsymbol{w}$
• Problem: the probability needs to be between 0 and 1
• Logistic function $f(s) = \frac{1}{1 + \exp(-s)}$ maps the reals to probabilities, so
  $P(Y=1|\boldsymbol{x}) = \frac{1}{1 + \exp(-\boldsymbol{x}'\boldsymbol{w})}$
• Logistic regression model: equivalent to a linear model for the log-odds ratio
  $\log\frac{P(Y=1|\boldsymbol{x})}{P(Y=0|\boldsymbol{x})} = \boldsymbol{x}'\boldsymbol{w}$
• Note: here we do not use sum of squared errors for fitting

[Figure: the logistic function $f(s)$, an S-shaped curve from reals $s \in [-10, 10]$ to probabilities in $(0,1)$; fitted to the T2D vs BMI data, it again splits "predict no" from "predict yes"]

Is logistic regression a linear method?

Logistic regression is a linear classifier

• Logistic regression model:
  $P(Y=1|\boldsymbol{x}) = \frac{1}{1 + \exp(-\boldsymbol{x}'\boldsymbol{w})}$
• Classification rule:
  if $P(Y=1|\boldsymbol{x}) > \frac{1}{2}$ then class "1", else class "0"
• Decision boundary:
  $\frac{1}{1 + \exp(-\boldsymbol{x}'\boldsymbol{w})} = \frac{1}{2}$
• Solving gives $\boldsymbol{x}'\boldsymbol{w} = 0$: the boundary is a hyperplane, so the classifier is linear (see the sketch below)
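To make the rule concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original slides; the weights and data are made up) showing that thresholding $P(Y=1|\boldsymbol{x})$ at $\frac{1}{2}$ is the same as thresholding $\boldsymbol{x}'\boldsymbol{w}$ at 0:

```python
import numpy as np

def sigmoid(s):
    """Logistic function f(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def predict(X, w):
    """Classify each row of X: class 1 iff P(Y=1|x) > 1/2."""
    p = sigmoid(X @ w)               # P(Y=1|x) for each point
    return (p > 0.5).astype(int)

# Hypothetical weights; w[0] acts as the bias on a constant first feature
w = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.2], [1.0, 0.9]])
print(predict(X, w))                 # [0 1]
# Thresholding P(Y=1|x) at 1/2 is equivalent to thresholding x'w at 0:
print((X @ w > 0).astype(int))       # [0 1], same result
```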

Effect of parameter vector (2D problem)

• Decision boundary is the line where $P(Y=1|\boldsymbol{x}) = 0.5$
  ∗ In higher dimensional problems, the decision boundary is a plane or hyperplane
• Vector $\boldsymbol{w}$ is perpendicular to the decision boundary (see supplemental LMS vector slides)
  ∗ That is, $\boldsymbol{w}$ is a normal to the decision boundary
  ∗ Note: in this illustration we assume $w_0 = 0$ for simplicity

[Figure: Murphy, Fig 8.1, p. 246]

Linear vs. logistic probabilistic models

• Linear regression assumes a Normal distribution with a fixed variance and mean given by the linear model
  $p(y|\boldsymbol{x}) = \mathrm{Normal}(\boldsymbol{x}'\boldsymbol{w}, \sigma^2)$
• Logistic regression assumes a Bernoulli distribution with parameter given by the logistic transform of the linear model
  $p(y|\boldsymbol{x}) = \mathrm{Bernoulli}(\mathrm{logistic}(\boldsymbol{x}'\boldsymbol{w}))$
• Recall that the Bernoulli distribution is defined as
  $p(1) = \theta$ and $p(0) = 1 - \theta$ for $\theta \in [0,1]$
• Equivalently $p(y) = \theta^y (1-\theta)^{1-y}$ for $y \in \{0,1\}$ (both model assumptions are evaluated in the sketch below)
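A minimal sketch contrasting the two conditional models numerically (an added illustration, not from the slides; it uses scipy.stats, and the weights and data point are made up):

```python
import numpy as np
from scipy.stats import norm, bernoulli

w = np.array([0.5, -1.0])    # hypothetical weights
x = np.array([1.0, 2.0])     # one data point (first entry = bias feature)
s = x @ w                    # linear score x'w = -1.5

# Linear regression: y | x ~ Normal(x'w, sigma^2), a density over the reals
sigma = 1.0
print(norm.pdf(0.3, loc=s, scale=sigma))

# Logistic regression: y | x ~ Bernoulli(logistic(x'w)), a pmf over {0, 1}
theta = 1.0 / (1.0 + np.exp(-s))
print(bernoulli.pmf(1, theta), bernoulli.pmf(0, theta))  # sums to 1
```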

Training as Max Likelihood Estimation

• Assuming independence, the probability of the data is
  $p(y_1, \dots, y_n | \boldsymbol{x}_1, \dots, \boldsymbol{x}_n) = \prod_{i=1}^n p(y_i|\boldsymbol{x}_i)$
• Assuming a Bernoulli distribution, we have
  $p(y_i|\boldsymbol{x}_i) = \theta(\boldsymbol{x}_i)^{y_i} (1 - \theta(\boldsymbol{x}_i))^{1-y_i}$
  where $\theta(\boldsymbol{x}_i) = \frac{1}{1 + \exp(-\boldsymbol{x}_i'\boldsymbol{w})}$
• Training: maximise this expression with respect to weights $\boldsymbol{w}$

Apply log trick, simplify

• Instead of maximising the likelihood, maximise its logarithm:

$\log \prod_{i=1}^n p(y_i|\boldsymbol{x}_i) = \sum_{i=1}^n \log p(y_i|\boldsymbol{x}_i)$
$\quad = \sum_{i=1}^n \log\left[\theta(\boldsymbol{x}_i)^{y_i} (1 - \theta(\boldsymbol{x}_i))^{1-y_i}\right]$
$\quad = \sum_{i=1}^n \left[ y_i \log\theta(\boldsymbol{x}_i) + (1 - y_i)\log(1 - \theta(\boldsymbol{x}_i)) \right]$
$\quad = \sum_{i=1}^n \left[ (y_i - 1)\,\boldsymbol{x}_i'\boldsymbol{w} - \log\left(1 + \exp(-\boldsymbol{x}_i'\boldsymbol{w})\right) \right]$
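The final line is convenient to implement. A NumPy sketch of it (an added illustration; `np.logaddexp(0, -s)` computes $\log(1 + \exp(-s))$ without overflow), checked against the $\theta$-based form on made-up data:

```python
import numpy as np

def log_likelihood(w, X, y):
    """Sum over i of (y_i - 1) x_i'w - log(1 + exp(-x_i'w))."""
    s = X @ w
    return np.sum((y - 1.0) * s - np.logaddexp(0.0, -s))

# Toy check against y log(theta) + (1 - y) log(1 - theta)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, 0.3])
theta = 1.0 / (1.0 + np.exp(-(X @ w)))
direct = np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
print(np.isclose(log_likelihood(w, X, y), direct))  # True
```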

Iterative optimisation

• Training logistic regression amounts to finding $\boldsymbol{w}$ that maximises the log-likelihood
• Analytical approach: set derivatives of the objective function to zero and solve for $\boldsymbol{w}$
• Bad news: no closed-form solution; an iterative method is necessary (e.g., gradient descent, Newton-Raphson, or iteratively-reweighted least squares; a gradient-based sketch follows below)
• Good news: the problem is strictly convex (like a bowl) if there are no irrelevant features, so optimisation is guaranteed to work!
  ∗ Look ahead (Lecture 5): regularisation helps with irrelevant features

[Figure: Murphy, Fig 8.3, p. 247 — contours of the objective over $(w_1, w_2)$]
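As a hedged illustration of the iterative approach (my own sketch, not the lecture's reference implementation), the code below maximises the log-likelihood by plain gradient ascent, using the standard gradient $\sum_{i=1}^n (y_i - \theta(\boldsymbol{x}_i))\,\boldsymbol{x}_i$:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Maximise the log-likelihood by gradient ascent.
    Gradient of the log-likelihood w.r.t. w is X' (y - theta)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = sigmoid(X @ w)         # P(Y=1|x_i) under current w
        w += lr * X.T @ (y - theta)    # ascend the concave objective
    return w

# Toy 1D problem with a bias column; larger feature => more likely class 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
print((sigmoid(X @ w) > 0.5).astype(int))  # expect [0 0 1 1]
```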

Logistic Regression: Decision-Theoretic View
Where loss is cross entropy

Side note: Cross entropy

• Cross entropy is a method for comparing two distributions
• Cross entropy is a measure of divergence between a reference distribution $g_{\mathrm{ref}}(a)$ and an estimated distribution $g_{\mathrm{est}}(a)$. For discrete distributions (a direct implementation follows below):
  $H(g_{\mathrm{ref}}, g_{\mathrm{est}}) = -\sum_{a \in A} g_{\mathrm{ref}}(a) \log g_{\mathrm{est}}(a)$
  where $A$ is the support of the distributions, e.g., $A = \{0, 1\}$
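The definition translates directly into code. A minimal sketch (an added illustration; both arguments are probability vectors over the same finite support):

```python
import numpy as np

def cross_entropy(g_ref, g_est):
    """H(g_ref, g_est) = -sum_a g_ref(a) * log(g_est(a)),
    for two distributions over the same finite support."""
    g_ref, g_est = np.asarray(g_ref), np.asarray(g_est)
    return -np.sum(g_ref * np.log(g_est))

# Support A = {0, 1}: reference puts all mass on 1, estimate is (0.2, 0.8)
print(cross_entropy([0.0, 1.0], [0.2, 0.8]))  # -log(0.8), about 0.223
```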

Training as cross-entropy minimisation

• Consider the log-likelihood for a single data point:
  $\log p(y_i|\boldsymbol{x}_i) = y_i \log\theta(\boldsymbol{x}_i) + (1 - y_i)\log(1 - \theta(\boldsymbol{x}_i))$
• This expression is the negative cross entropy
  $H(g_{\mathrm{ref}}, g_{\mathrm{est}}) = -\sum_a g_{\mathrm{ref}}(a) \log g_{\mathrm{est}}(a)$
• The reference (true) distribution is
  $g_{\mathrm{ref}}(1) = y_i$ and $g_{\mathrm{ref}}(0) = 1 - y_i$
• Logistic regression aims to estimate this distribution as
  $g_{\mathrm{est}}(1) = \theta(\boldsymbol{x}_i)$ and $g_{\mathrm{est}}(0) = 1 - \theta(\boldsymbol{x}_i)$
• It finds $\boldsymbol{w}$ that minimises the sum of cross entropies over training points (checked numerically below)
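A quick numeric check of this claim (an added illustration with made-up values $y_i = 1$, $\theta(\boldsymbol{x}_i) = 0.7$): the per-point negative log-likelihood equals the cross entropy between the two distributions above.

```python
import numpy as np

def cross_entropy(g_ref, g_est):
    return -np.sum(np.asarray(g_ref) * np.log(np.asarray(g_est)))

y_i, theta_i = 1.0, 0.7           # label and model probability for one point
g_ref = [1.0 - y_i, y_i]          # reference: g_ref(0), g_ref(1)
g_est = [1.0 - theta_i, theta_i]  # estimate:  g_est(0), g_est(1)
neg_log_lik = -(y_i * np.log(theta_i) + (1 - y_i) * np.log(1 - theta_i))
print(np.isclose(cross_entropy(g_ref, g_est), neg_log_lik))  # True
```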

Basis Expansion
Extending the utility of models via data transformation

Basis expansion for linear regression

• Let's take a step back, to linear regression and least squares
• Real data is likely to be non-linear
• What if we still wanted to use a linear regression?
  ∗ It's simple, easier to understand, computationally efficient, etc.
• How to marry non-linear data to a linear method? If you can't beat 'em, join 'em

[Figure: a non-linear scatter of $y$ against $x$]

Transform the data

• The trick is to transform the data: map the data onto another feature space, such that the data is linear in that space
• Denote this transformation $\varphi: \mathbb{R}^m \to \mathbb{R}^k$. If $\boldsymbol{x}$ is the original set of features, $\varphi(\boldsymbol{x})$ denotes the new feature set
• Example: suppose there is just one feature $x$, and the data is scattered around a parabola rather than a straight line

[Figure: parabola-shaped scatter of $y$ against $x$]

Example: Polynomial regression

• No worries, mate: define
  $\varphi_1(x) = x$
  $\varphi_2(x) = x^2$
• Next, apply linear regression to $\varphi_1, \varphi_2$:
  $y = w_0 + w_1 \varphi_1(x) + w_2 \varphi_2(x) = w_0 + w_1 x + w_2 x^2$
  and here you have quadratic regression (see the sketch below)
• More generally, we obtain polynomial regression if the new set of attributes are powers of $x$

[Figure: parabolic data fitted by quadratic regression]
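A minimal NumPy sketch of quadratic regression via basis expansion and ordinary least squares (an added illustration; the data-generating coefficients are made up):

```python
import numpy as np

def poly_features(x, degree=2):
    """Map scalar feature x to [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

# Noisy parabola: y = 1 + 2x + 3x^2 + noise
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1 + 2 * x + 3 * x**2 + rng.normal(scale=0.3, size=x.shape)

Phi = poly_features(x, degree=2)             # expanded design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # ordinary least squares
print(np.round(w, 2))                        # close to [1, 2, 3]
```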

Basis expansion

• Data transformation, also known as basis expansion, is a general technique
  ∗ We'll see more examples throughout the course
• It can be applied for both regression and classification
• There are many possible choices of $\varphi$

Basis expansion for logistic regression

• Example binary classification problem: the dataset is not linearly separable (an XOR-style pattern in $x_1, x_2$)

  x1  x2  y
  0   0   Class A
  0   1   Class B
  1   0   Class B
  1   1   Class A

• Define the transformation as
  $\varphi_i(\boldsymbol{x}) = \|\boldsymbol{x} - \boldsymbol{z}_i\|$, where the $\boldsymbol{z}_i$ are some pre-defined constants
• Choose $\boldsymbol{z}_1 = (0,0)'$, $\boldsymbol{z}_2 = (0,1)'$, $\boldsymbol{z}_3 = (1,0)'$, $\boldsymbol{z}_4 = (1,1)'$

  φ1   φ2   φ3   φ4   y
  0    1    1    √2   Class A
  1    0    √2   1    Class B
  1    √2   0    1    Class B
  √2   1    1    0    Class A

• There exist weights $(w_1, w_2, w_3, w_4)$ that make the new data separable; for instance $\boldsymbol{w} = (1, -1, -1, 1)'$ gives $\boldsymbol{\varphi}'\boldsymbol{w} = \sqrt{2} - 2 < 0$ for every Class A point and $2 - \sqrt{2} > 0$ for every Class B point (verified in the sketch below)

The transformed data is linearly separable!
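A short sketch reproducing this example (an added illustration; the separating weights $(1, -1, -1, 1)'$ are one valid choice, not necessarily the slide's):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])   # 0 = Class A, 1 = Class B (XOR pattern)

Z = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # centres z_1..z_4

# phi_i(x) = ||x - z_i||: each row of Phi is one transformed point
Phi = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
print(np.round(Phi, 3))

# One separating weight vector in the new space (sign of phi'w gives the class)
w = np.array([1.0, -1.0, -1.0, 1.0])
print(Phi @ w)                     # negative for Class A, positive for Class B
print((Phi @ w > 0).astype(int))   # matches y
```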

Radial basis functions

• The above transformation is an example of the use of radial basis functions (RBFs)
  ∗ Their use has been motivated from approximation theory, where sums of RBFs are used to approximate given functions
• A radial basis function is a function of the form $\varphi(\boldsymbol{x}) = \psi(\|\boldsymbol{x} - \boldsymbol{z}\|)$, where $\boldsymbol{z}$ is a constant
• Examples (the second is sketched in code below):
  ∗ $\varphi(\boldsymbol{x}) = \|\boldsymbol{x} - \boldsymbol{z}\|$
  ∗ $\varphi(\boldsymbol{x}) = \exp\left(-\frac{1}{\sigma}\|\boldsymbol{x} - \boldsymbol{z}\|^2\right)$

[Figure: a bell-shaped Gaussian RBF $\varphi(x)$ centred at $z$]
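The second example is the Gaussian RBF. A one-function sketch (an added illustration; `sigma` controls the width of the bell):

```python
import numpy as np

def gaussian_rbf(x, z, sigma=1.0):
    """phi(x) = exp(-(1/sigma) * ||x - z||^2): equals 1 when x = z,
    and decays towards 0 as x moves away from the centre z."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z))**2) / sigma)

print(gaussian_rbf([1.0, 1.0], [1.0, 1.0]))   # 1.0 at the centre
print(gaussian_rbf([2.0, 1.0], [1.0, 1.0]))   # exp(-1), about 0.368
```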

Challenges of basis expansion

• Basis expansion can significantly increase the utility of methods, especially linear methods
• In the above examples, one limitation is that the transformation needs to be defined beforehand
  ∗ If using RBFs, need to choose the centres $\boldsymbol{z}_i$
  ∗ Need to choose the size of the new feature set
• Regarding the $\boldsymbol{z}_i$, one can choose uniformly spaced points, or cluster the training data and use the cluster centroids
• Another popular idea is to use the training points themselves as centres, $\boldsymbol{z}_i \equiv \boldsymbol{x}_i$
  ∗ E.g., $\varphi_i(\boldsymbol{x}) = \psi(\|\boldsymbol{x} - \boldsymbol{x}_i\|)$
  ∗ However, for large datasets, this results in a large number of features ⇒ computational hurdle (compare the two strategies in the sketch below)
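A sketch contrasting the two strategies for choosing centres (an added illustration; the uniform grid of centres and the Gaussian $\psi$ are assumptions, not something the lecture prescribes):

```python
import numpy as np

def rbf_design_matrix(X, Z, sigma=1.0):
    """Columns phi_i(x) = exp(-(1/sigma) ||x - z_i||^2), one per centre z_i."""
    sq_dists = np.sum((X[:, None, :] - Z[None, :, :])**2, axis=2)
    return np.exp(-sq_dists / sigma)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # toy training data

# Strategy 1: every training point is a centre -> n = 500 features
Phi_full = rbf_design_matrix(X, X)
# Strategy 2: a few fixed centres (uniform grid; cluster centroids also work)
grid = np.array([[a, b] for a in (-1, 0, 1) for b in (-1, 0, 1)], dtype=float)
Phi_small = rbf_design_matrix(X, grid)
print(Phi_full.shape, Phi_small.shape)   # (500, 500) vs (500, 9)
```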

Further directions

• There are several avenues for taking the idea of basis expansion to the next level
  ∗ Will be covered later in this subject
• One idea is to learn the transformation $\varphi$ from data
  ∗ E.g., Artificial Neural Networks
• Another powerful extension is the use of the kernel trick
  ∗ "Kernelised" methods, e.g., kernelised perceptron
• Finally, in sparse kernel machines, training depends only on a few data points
  ∗ E.g., SVM

Summary

• Logistic regression
  ∗ Workhorse linear binary classifier
• Basis expansion
  ∗ Extending model expressiveness via data transformation
  ∗ Examples for linear and logistic regression
  ∗ Theoretical notes

Next time: regularisation for avoiding overfitting and ill-posed optimisation, with example algorithms