
Linear Regression

Example of a regression problem


• Let’s look at some fun data. Can we predict the gold-medal-winning long-jump distance 125 years after the first modern Olympic Games?
Data from: Rogers & Girolami. A First Course in Machine Learning. 2nd edition. Chapman & Hall/CRC, 2017

Example of a regression problem
• Let’s look at more fun data. Can we predict people’s weight from their height?
https://becominghuman.ai/univariate-linear-regression-clearly-explained-with-example-4164e83ca2ee

Regression
• Regression means learning a function that captures the “trend” between input and output
• We then use this function to predict target values for new inputs

Univariate linear regression
• Visually, there appears to be a trend
• A reasonable model seems to be the class of linear functions (lines)
• We have one input attribute (year) – hence the name univariate
y = f(x; w_0, w_1) = w_1 x + w_0
Here y is the dependent variable, x is the independent variable, and w_0, w_1 are the free parameters.
• Any line is described by this equation by specifying values for w_0, w_1.

Check your understanding
Suppose that from historical data someone has already calculated the parameters of our linear model: w_0 = 1.68, w_1 = 0.44. A new person (James) has height x = 178 cm.
Using our model, we predict James’ weight as f(178; w_0, w_1) = 0.44 · 178 + 1.68 = 80 kg.
https://becominghuman.ai/univariate-linear-regression-clearly-explained-with-example-4164e83ca2ee
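As a quick sanity check in code, here is a minimal Python sketch of this prediction using the parameter values above (the function name predict is just for illustration):

```python
# Minimal sketch: predict weight from height with a univariate linear model
# y = f(x; w0, w1) = w1 * x + w0, using the parameter values from the example above.

def predict(x, w0=1.68, w1=0.44):
    """Return the model output w1 * x + w0 for input x."""
    return w1 * x + w0

print(predict(178))  # ~80.0 kg for James (height 178 cm)
```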

Play around with linear functions
• Go to https://www.desmos.com/calculator
• Type: y = w_1 x + w_0
• Plug in some values for the free parameters, or use the slider to see their effect
• What is the role of the free parameters?
• w_1 is the slope of the line
• w_0 is the intercept with the y-axis
• Fixing concrete numbers for these parameters gives you specific lines

Equation of a straight line with slope m and intercept c: f(x) = m x + c, where the slope is m = Δy/Δx.
This is why:
f(x + Δx) = m(x + Δx) + c = m x + m·Δx + c = f(x) + m·Δx  ⇒  m = (f(x + Δx) − f(x)) / Δx = Δy/Δx

Our goal: Find the “best” line
• Which is the ”best” line? The one that captures the trend in the data.
• Determine the “best” values for w_0, w_1.

Loss functions (or cost functions)
• We need a criterion that, given the data, tells us for any given line how bad that line is.
• Such a criterion is called a loss function. It is a function of the free parameters!
Terminology
• Loss function = cost function = loss = cost = error function

We average the losses on all training examples
• For each training example (point) n = 1, …, N, the loss on the n-th point is the mismatch between the model’s output for this point, f(x^(n); w_0, w_1), and the observed target y^(n).
• Average these losses.

Square loss (L2 loss)
• The loss expresses an error, so it must be always non-negative
• Square loss is a sensible choice to measure mismatch for regression
• Mean Square Error (MSE):
g(w_0, w_1) = (1/N) Σ_{n=1}^{N} (f(x^(n); w_0, w_1) − y^(n))^2
where (f(x^(n); w_0, w_1) − y^(n))^2 is the loss for the n-th training example, and recall that, for any x, we have f(x; w_0, w_1) = w_1 x + w_0.

Cost function depends on the free parameters

Check your understanding
○ Suppose a linear function with parameters w_0 = 0.5, w_1 = 0.5
○ Compute the loss function value for this line at the training example: (1, 3).
● f(x^(1); 0.5, 0.5) = 0.5 · 1 + 0.5 = 1 (output of the model)
● y^(1) = 3 (actual target)
● Square loss for this point: (1 − 3)^2 = 4.
● Cost = 4.
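To make this concrete in code, here is a minimal Python sketch of the MSE cost from the previous slide, reproducing this one-point example (the helper name mse_cost and the general form over a list of points are my additions):

```python
# Sketch: Mean Square Error cost g(w0, w1) for a univariate linear model.

def mse_cost(xs, ys, w0, w1):
    """Average of the squared errors (w1 * x + w0 - y)^2 over the training set."""
    n = len(xs)
    return sum((w1 * x + w0 - y) ** 2 for x, y in zip(xs, ys)) / n

# The "check your understanding" example: one training point (1, 3), w0 = w1 = 0.5.
print(mse_cost([1], [3], w0=0.5, w1=0.5))  # (1 - 3)^2 = 4.0
```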

Univariate linear regression – what we want to do
• Given training data
(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))
• Fit the model
y = f(x; w_0, w_1) = w_1 x + w_0
• By minimising the cost function
g(w_0, w_1) = (1/N) Σ_{n=1}^{N} (w_1 x^(n) + w_0 − y^(n))^2

Univariate linear regression – what we want to do
● Every combination of w_0 and w_1 has an associated cost.
● To find the ‘best fit’ we need to find values for w_0 and w_1 such that the cost is minimum.

Gradient Descent

Gradient Descent
• A general strategy to minimise cost functions.
Goal: Minimise cost function g(w_0, w_1)
Start at, say, w_0 := 0, w_1 := 0
Repeat until no change occurs:
Update w_0, w_1 by taking a small step in the direction of the steepest descent
Return w_0, w_1

Gradient descent – the general algorithm
• Goal: Minimise cost function g(𝒘), where 𝒘 = (w_0, w_1, …)
Input: α > 0
Initialise 𝒘 // at 0 or some random value
Repeat until convergence
𝒘 := 𝒘 − α∇g(𝒘)
Return 𝒘
α is called the “learning rate” (or “step size”), for instance 0.01.
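To make the recipe concrete, here is a hedged Python sketch of this general loop applied to a toy cost g(w) = (w − 3)^2, whose gradient is 2(w − 3) (the toy function, tolerance, and iteration cap are illustrative assumptions, not part of the slides):

```python
# Sketch of the general gradient descent loop: w := w - alpha * grad_g(w) (scalar case).

def gradient_descent(grad_g, w, alpha=0.01, tol=1e-8, max_iters=100_000):
    """Repeat w := w - alpha * grad_g(w) until the update becomes negligibly small."""
    for _ in range(max_iters):
        step = alpha * grad_g(w)
        w = w - step
        if abs(step) < tol:   # "repeat until convergence"
            break
    return w

# Toy example: minimise g(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
print(gradient_descent(lambda w: 2 * (w - 3), w=0.0))   # converges to roughly 3.0
```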

How to find the best direction?
• First, recall from calculus that the derivative of a function measures how the function value changes as the argument of the function changes by an infinitesimally small amount.
• The derivative evaluated at a given location gives us the slope of the tangent line at that point.
• The negative of the slope points towards the minimum point. Check!
df/dx = lim_{Δx→0} Δy/Δx = lim_{Δx→0} (f(x + Δx) − f(x)) / Δx
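One way to do the suggested check numerically is the tiny Python sketch below, again using the assumed toy cost g(w) = (w − 3)^2, whose minimum is at w = 3:

```python
# Check: for g(w) = (w - 3)^2 (minimum at w = 3), the negative of the slope
# always points towards the minimum.

def slope(g, w, dw=1e-6):
    """Finite-difference approximation of the derivative: (g(w + dw) - g(w)) / dw."""
    return (g(w + dw) - g(w)) / dw

g = lambda w: (w - 3) ** 2

print(slope(g, 0.0))   # negative -> stepping against it moves right, towards w = 3
print(slope(g, 5.0))   # positive -> stepping against it moves left, towards w = 3
```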

Demo example for gradient descent algorithm

• The partial derivative with respect to w_0 is ∂g(w_0, w_1)/∂w_0. It is the derivative of g(w_0, w_1) when w_1 is treated as a constant.
• The partial derivative with respect to w_1 is ∂g(w_0, w_1)/∂w_1. It is the derivative of g(w_0, w_1) when w_0 is treated as a constant.
• The vector of partial derivatives is called the gradient:
∇g(𝒘) = [∂g(w_0, w_1)/∂w_0, ∂g(w_0, w_1)/∂w_1]ᵀ, where 𝒘 = [w_0, w_1]ᵀ
• The negative of the gradient evaluated at a location (ŵ_0, ŵ_1) gives us the direction of the steepest descent from that location.
• We take a small step in that direction.

Gradient Descent applied to solving Univariate Linear Regression

Computing the gradient for our L2 loss
• Recall the cost function g(w_0, w_1) = (1/N) Σ_{n=1}^{N} (w_1 x^(n) + w_0 − y^(n))^2
• Using the chain rule, we have*:
∂g(w_0, w_1)/∂w_0 = (2/N) Σ_{n=1}^{N} (w_1 x^(n) + w_0 − y^(n))
∂g(w_0, w_1)/∂w_1 = (2/N) Σ_{n=1}^{N} (w_1 x^(n) + w_0 − y^(n)) x^(n)
*For a very detailed explanation of all the steps, watch: https://www.youtube.com/watch?v=sDv4f4s2SB8

Algorithm for univariate linear regression using GD
• Goal: Minimise g(w_0, w_1) = (1/N) Σ_{n=1}^{N} (w_1 x^(n) + w_0 − y^(n))^2
Input: α > 0, training set {(x^(n), y^(n)): n = 1, …, N}
Initialise w_0 := 0, w_1 := 0
Repeat
For n = 1, …, N // more efficient to update after each data point
w_0 := w_0 − α·(w_1 x^(n) + w_0 − y^(n))
w_1 := w_1 − α·(w_1 x^(n) + w_0 − y^(n)) x^(n)
Until change remains below a very small threshold
Return w_0, w_1
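A plain-Python sketch of this algorithm follows (the synthetic data, learning rate, tolerance, and epoch cap are illustrative assumptions; as in the pseudocode, the factor of 2 from the gradient is absorbed into α):

```python
# Sketch: univariate linear regression fitted with per-point gradient descent updates.

def fit_univariate(xs, ys, alpha=0.01, tol=1e-9, max_epochs=100_000):
    w0, w1 = 0.0, 0.0                                   # initialise at zero
    for _ in range(max_epochs):
        old_w0, old_w1 = w0, w1
        for x, y in zip(xs, ys):                        # update after each data point
            error = w1 * x + w0 - y
            w0 = w0 - alpha * error
            w1 = w1 - alpha * error * x
        if abs(w0 - old_w0) + abs(w1 - old_w1) < tol:   # change below a small threshold
            break
    return w0, w1

# Tiny synthetic example: points generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(fit_univariate(xs, ys))   # roughly (1.0, 2.0)
```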

Effect of the learning rate

Extensions & variants of regression problems
• We change the model
• The loss and cost functions remain the same

Multivariate linear regression

Univariate nonlinear regression
y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + … + w_m x^m = 𝐰ᵀ𝐱, with 𝐰 = (w_0, w_1, …, w_m) and 𝐱 = (1, x, x^2, …, x^m)
Figure from: https://www.r-bloggers.com/first-steps-with-non-linear-regression-in-r
This is an m-th order polynomial regression model.

Advantages of vector notation
• Vector notation is concise
• With the vectors 𝒘 and 𝐱 populated appropriately (and differently in each case, as on the previous 2 slides), these models are still linear in the parameter vector.
• The cost function is the L2 (square) cost, as before
• So in both cases the gradient for the n-th training example is:
∇g(𝐰) = 2(𝐰ᵀ𝐱^(n) − y^(n)) 𝐱^(n)
• Ready to be plugged into the general gradient descent algorithm
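As an illustration, here is a short NumPy sketch that builds the feature vector 𝐱 = (1, x, x^2, …, x^m) and plugs the per-example gradient above into gradient descent (the polynomial degree, data, learning rate, and epoch count are assumptions for demonstration):

```python
# Sketch: polynomial regression y = w^T x_vec with x_vec = (1, x, x^2, ..., x^m),
# fitted with the per-example gradient 2 * (w^T x_vec - y) * x_vec.
import numpy as np

def poly_features(x, m):
    """Map a scalar input x to the feature vector (1, x, x^2, ..., x^m)."""
    return np.array([x ** k for k in range(m + 1)])

def fit_polynomial(xs, ys, m=2, alpha=0.01, epochs=5_000):
    w = np.zeros(m + 1)                                 # initialise the parameter vector
    for _ in range(epochs):
        for x, y in zip(xs, ys):                        # per-example gradient descent step
            x_vec = poly_features(x, m)
            w = w - alpha * 2 * (w @ x_vec - y) * x_vec
    return w

# Tiny synthetic example: points generated from y = 1 + 2x + 3x^2.
xs = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
ys = 1 + 2 * xs + 3 * xs ** 2
print(fit_polynomial(xs, ys))   # roughly [1, 2, 3]
```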

Don’t get too carried away with nonlinearity

Reference Acknowledgement
Several figures and animations on these slides are taken from:
• Watt, Borhani & Katsaggelos. Machine Learning Refined. Cambridge University Press, 2020. https://github.com/jermwatt/machine_learning_refined
