Recitation 1
Statistical Learning Theory and Intro to Gradient Descent
Jan. 25, 2022
Motivation
In data science, we generally need to make a decision about a problem. To do this, we need to understand:
The setup of the problem
The possible actions
The effect of the actions
The evaluation of the results
How do we translate the problem into the language of DS/modeling?
Statistical Learning Theory
Formalization
The Spaces
X : input space
Y : outcome space
A : action space
Prediction Function
A prediction function f gets an input x ∈ X and produces an action a ∈ A: f : X → A
Loss Function
A loss function l(a, y ) evaluates an action a ∈ A in the context of an outcome y ∈ Y:
l : A × Y → R
Risk Function
Given a loss function l, how can we evaluate the “average performance” of a prediction function f ?
To do so, we first need to assume that there is a data-generating distribution P_{X,Y}.
Then the expected loss of f on P_{X,Y} will reflect the notion of "average performance".
Definition
The risk of a prediction function f : X → A is
R(f) = E[l(f(x), y)]
It is the expected loss of f on a new sample (x, y) drawn from P_{X,Y}.
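A minimal sketch in Python (not from the slides): the risk can be estimated empirically by averaging the loss over samples drawn from an assumed data-generating distribution. The square loss and the particular linear rule below are illustrative choices.

    import numpy as np

    # Example loss: square loss l(a, y) = (a - y)^2
    def square_loss(a, y):
        return (a - y) ** 2

    # Example prediction function f : X -> A (here a fixed linear rule)
    def f(x):
        return 2.0 * x + 1.0

    # Assume a data-generating distribution P_{X,Y} we can sample from
    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=10_000)

    # Monte Carlo estimate of the risk R(f) = E[l(f(x), y)]
    risk_estimate = np.mean(square_loss(f(x), y))
    print(risk_estimate)   # roughly 0.25, the variance of the noise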
Finding ’best’ function
Definition
F is the family of functions to which we restrict our model. Examples: linear functions, quadratic functions, decision trees, two-layer neural nets, …
Definition
f_F is the 'best' function one can obtain within F.
Definition
f̂_n is the 'best' function one can obtain using the given data.
Definition
f̃_n is the function actually obtained using the given data.
The Bayes Prediction Function
Definition
A Bayes prediction function f* : X → A is a function that achieves the minimal risk among all possible functions:
f* ∈ argmin_f R(f),
where the minimum is taken over all functions that map from X to A. The risk of a Bayes prediction function is called the Bayes risk.
For example, under the square loss l(a, y) = (a − y)², the Bayes prediction function is the conditional mean f*(x) = E[Y | X = x].
Error Decomposition
Approximation Error
Caused by the choice of the family of functions, i.e., the capacity of the model.
Remedy: expand the capacity of the model.
Estimation Error
Caused by having only a finite amount of data.
Remedy: obtain more data and/or add a regularizer.
Optimization Error
Caused by not being able to find the best parameters within the family.
Remedy: try different optimization algorithms, learning rates, etc.
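Using the functions defined above, these three errors sum to the excess risk of the learned function f̃_n (a standard decomposition, written out here for reference):

    R(f̃_n) − R(f*) = [R(f_F) − R(f*)] + [R(f̂_n) − R(f_F)] + [R(f̃_n) − R(f̂_n)]
                   = approximation error + estimation error + optimization error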
Therefore, instead of solving exactly for the best parameters, we just need to approximate them well enough.
Gradient Descent
Motivation:
Our goal is to find f̂_n, the best possible model given the data.
Naive approach: take the gradient of the loss function and solve for the parameters that make it zero.
Computationally intractable.
Often impossible to compute in closed form due to complex function structure.
The optimal parameters for linear regression: β̂ = (X⊤X)⁻¹ X⊤Y.
When X's dimension reaches the millions, computing the inverse is essentially intractable (see the sketch below).
But we do not need f̂_n; a close f̃_n is good enough for decision making.
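A minimal sketch of the closed-form computation (illustrative only; the variable names and problem sizes are made up). Solving the d × d system is the step that becomes intractable as the number of features d grows.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1_000, 5                      # n samples, d features (tiny here)
    X = rng.normal(size=(n, d))
    beta_true = rng.normal(size=d)
    Y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Closed-form least-squares solution: beta_hat = (X^T X)^{-1} X^T Y.
    # Solving the d x d system costs roughly O(d^3), which is hopeless when d
    # is in the millions; this is why we settle for an approximate f̃_n.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    print(beta_hat)                      # close to beta_true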
Given any starting parameters, the gradient indicates the direction of locally maximal increase.
If we obtain new parameters by moving the old parameters along the negative gradient, the new ones will give a smaller loss (if we are careful with the step size).
We can repeat this procedure until we are happy with the result.
Contour Graphs
Imagine we are solving a simple linear regression problem, y = θ_0 + θ_1 x, with loss function:
J(θ_0, θ_1) = Σ_i (y_i − (θ_0 + θ_1 x_i))²
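As a sketch (assuming arrays x and y of observed data are given; the function names are illustrative), the loss above and its two partial derivatives can be computed as follows. These are the quantities used by the gradient steps illustrated next.

    import numpy as np

    def loss(theta0, theta1, x, y):
        # J(theta0, theta1) = sum_i (y_i - (theta0 + theta1 * x_i))^2
        residual = y - (theta0 + theta1 * x)
        return np.sum(residual ** 2)

    def gradient(theta0, theta1, x, y):
        # Partial derivatives of J with respect to theta0 and theta1
        residual = y - (theta0 + theta1 * x)
        d_theta0 = -2.0 * np.sum(residual)
        d_theta1 = -2.0 * np.sum(residual * x)
        return np.array([d_theta0, d_theta1])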
Negative Gradient Steps
[A sequence of figures, not reproduced here, showing successive negative-gradient steps on the contour plot of J(θ_0, θ_1).]
Gradient Descent Algorithm
Goal: find θ* = argmin_θ J(θ)
θ_0 := [initial condition] (can be randomly chosen)
i := 0
while not [termination condition]:
    compute ∇J(θ_i)
    α_i := [learning rate at iteration i]
    θ_{i+1} := θ_i − α_i ∇J(θ_i)
    i := i + 1
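A minimal Python sketch of this algorithm (the function names, the fixed learning rate, and the stopping rule are illustrative choices, not prescribed by the slides), applied to the simple linear regression loss J from the contour-graph example:

    import numpy as np

    def gradient_descent(grad_J, theta_init, alpha=0.001, tol=1e-6, max_iter=10_000):
        # Repeatedly step against the gradient: theta_{i+1} = theta_i - alpha * grad J(theta_i)
        theta = np.asarray(theta_init, dtype=float)
        for _ in range(max_iter):
            g = grad_J(theta)
            if np.linalg.norm(g) < tol:   # termination condition: gradient is (nearly) zero
                break
            theta = theta - alpha * g
        return theta

    # Example: minimize J(theta0, theta1) = sum_i (y_i - (theta0 + theta1 * x_i))^2
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 3.0 + 2.0 * x + 0.1 * rng.normal(size=100)

    def grad_J(theta):
        residual = y - (theta[0] + theta[1] * x)
        return np.array([-2.0 * np.sum(residual), -2.0 * np.sum(residual * x)])

    print(gradient_descent(grad_J, theta_init=[0.0, 0.0]))   # approximately [3.0, 2.0]

In practice the learning rate α can also be chosen per iteration (for example, with a decaying schedule or a line search), which is what the pseudocode above allows.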
Things to review
Gradient Descent
Gradients, taking (partial) derivatives
Linear Algebra
Matrix computation, matrix derivatives
Example: compute ∂(x⊤Ax)/∂x, where A is a matrix and x is a vector
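For reference, the answer to this example is a standard matrix-calculus identity:

    ∂(x⊤Ax)/∂x = (A + A⊤)x, which reduces to 2Ax when A is symmetric.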