
The general paradigm of supervised learning
- The goal of a large family of machine learning methods is to minimize the prediction errors of the model.
- Ideally we want to minimize the true errors, i.e., the errors the model makes when it is used in realistic scenarios.
- That is hard to do, so the common practice is to minimize the errors on a training sample.
- To do that we need to define a loss function, a metric that measures the errors in the predictions of the model.
  - Examples: Cross-Entropy Loss, Squared Error Loss, Hinge Loss
- In other cases it is more natural to think of the goal of learning as optimizing an objective function, e.g., Maximum Likelihood.
- Whether we call it a loss function or an objective function, there is no difference in how they are optimized.
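Each of the losses named above can be written in a couple of lines. A minimal sketch (the function names here are ours, not from any standard library); each takes the quantities it needs and returns a single non-negative number:

```python
import math

def squared_error(y_true, y_pred):
    # penalizes large deviations quadratically
    return (y_true - y_pred) ** 2

def cross_entropy(p_true_class):
    # negative log of the probability the model assigns to the correct class
    return -math.log(p_true_class)

def hinge(margin):
    # margin = score(correct label) - score(best wrong label);
    # the loss is zero once the margin reaches 1
    return max(0.0, 1.0 - margin)
```

A confident correct prediction drives all three toward zero: `squared_error(1, 1)`, `cross_entropy(1.0)`, and `hinge(2.0)` all evaluate to 0.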

Commonly used loss and objective functions in NLP
- Naïve Bayes: maximize the joint probability of a training set of labeled samples:

  \hat{\theta} = \arg\max_\theta p(x_{1:N}, y_{1:N}; \theta)

- Logistic Regression: the weights are estimated by Maximum Conditional Likelihood:

  \hat{\theta} = \arg\max_\theta \log p(y_{1:N} \mid x_{1:N}; \theta)

- SVM: the weights are estimated by minimizing the margin loss:

  \hat{\theta} = \arg\min_\theta \sum_{i=1}^{N} \ell(\theta; x^{(i)}, y^{(i)})

Note: letters in bold indicate vectors: \theta, x, f. Alternative notations: \vec{\theta}, \vec{x}, \vec{f}.

Naïve Bayes Objective

- Naïve Bayes: maximize the joint probability of a training set of labeled samples, in a process called Maximum Likelihood Estimation:

  \hat{\theta} = \arg\max_\theta p(x_{1:N}, y_{1:N}; \theta)
              = \arg\max_\theta \prod_{i=1}^{N} p(x_i, y_i; \theta)
              = \arg\max_\theta \sum_{i=1}^{N} \log p(x_i, y_i; \theta)
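The step from the product to the sum of logs relies on the logarithm being monotonic: the θ that maximizes one maximizes the other. The correspondence between the two forms is easy to check numerically (the probabilities below are made up):

```python
import math

# hypothetical joint probabilities p(x_i, y_i; theta) for a toy training set
probs = [0.2, 0.5, 0.1, 0.4]

joint = math.prod(probs)                     # product form
log_joint = sum(math.log(p) for p in probs)  # sum-of-logs form

# exponentiating the log form recovers the product form
assert abs(joint - math.exp(log_joint)) < 1e-12
```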

Logistic Regression Objective
- Logistic Regression: the weights are estimated by Maximum Conditional Likelihood:

  \hat{\theta} = \arg\max_\theta \log p(y_{1:N} \mid x_{1:N}; \theta)
              = \arg\max_\theta \sum_{i=1}^{N} \Big( \theta \cdot f(x^{(i)}, y^{(i)}) - \log \sum_{y \in \mathcal{Y}} \exp\big(\theta \cdot f(x^{(i)}, y)\big) \Big)

  or, equivalently, by minimizing the logistic loss:

  \hat{\theta} = \arg\min_\theta \sum_{i=1}^{N} \Big( -\theta \cdot f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}} \exp\big(\theta \cdot f(x^{(i)}, y)\big) \Big)
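Per training example, the logistic loss inside the sum can be computed directly. A sketch, assuming a hypothetical feature function `feats(x, y)` that returns a vector; the inner log-sum-exp is computed with the usual max-shift for numerical stability:

```python
import math

def logistic_loss(theta, feats, x, y_true, labels):
    """-theta . f(x, y_true) + log sum_y exp(theta . f(x, y)) for one example."""
    def score(y):
        return sum(t * f for t, f in zip(theta, feats(x, y)))

    scores = [score(y) for y in labels]
    m = max(scores)  # shift by the max so the exponentials cannot overflow
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -score(y_true) + log_z
```

The loss is always non-negative and shrinks as the correct label's score grows relative to the other labels' scores.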

Support Vector Machine Objective
- SVM: the weights are estimated by minimizing the margin loss:

  \hat{\theta} = \arg\min_\theta \sum_{i=1}^{N} \ell(\theta; x^{(i)}, y^{(i)})
              = \arg\min_\theta \sum_{i=1}^{N} \Big( \max_{y \in \mathcal{Y}} \big( \theta \cdot f(x^{(i)}, y) + c(y^{(i)}, y) \big) - \theta \cdot f(x^{(i)}, y^{(i)}) \Big)_+

  where c(y^{(i)}, y) is the cost of predicting y when the true label is y^{(i)}, and (\cdot)_+ denotes \max(0, \cdot).

These look rather daunting, don't they?
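Less daunting in code: the quantity inside the SVM sum can be sketched per example. Hypothetical helpers again: `feats(x, y)` returns a feature vector and `cost(y_true, y)` is, e.g., a 0/1 cost:

```python
def svm_example_loss(theta, feats, x, y_true, labels, cost):
    """(max_y [theta . f(x, y) + cost(y_true, y)] - theta . f(x, y_true))_+"""
    def score(y):
        return sum(t * f for t, f in zip(theta, feats(x, y)))

    # cost-augmented best competitor
    augmented = max(score(y) + cost(y_true, y) for y in labels)
    # clip at zero: no loss once the correct label wins by the required cost
    return max(0.0, augmented - score(y_true))
```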

How do we minimize a function?
In order to minimize a function, we need to be able to compute its derivative, the rate of change of the function.
Let's start with a much simpler function, f(x) = x^2 + 1, whose derivative is:

  \frac{d}{dx} f(x) = \frac{d}{dx} (x^2 + 1) = 2x

"The derivative of the function f(x) with respect to (w.r.t.) x." This looks like magic, but it's really just calculus.
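We can sanity-check the calculus with a finite-difference approximation: for small h, (f(x+h) - f(x-h)) / 2h should be close to 2x. A quick sketch:

```python
def f(x):
    return x ** 2 + 1

def numerical_derivative(g, x, h=1e-6):
    # central finite difference approximates dg/dx at x
    return (g(x + h) - g(x - h)) / (2 * h)

# analytic derivative is 2x, so at x = 3 we expect about 6
assert abs(numerical_derivative(f, 3.0) - 6.0) < 1e-4
```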

How do we find the minimum of a function with the derivative?
- The derivative of a function can be interpreted as the slope of the function at a given point.
- At a point that is a minimum (or maximum) of the function, the slope is level, i.e., zero.
- We can find the minimum of the function by setting its derivative to zero: 2x = 0, so x = 0.
- For this particular function there is a closed-form solution. Most models in NLP don't have a closed-form solution, but some do, e.g., Naïve Bayes.

Plot the function

Finding the minimum iteratively
For functions that don't have a closed-form solution, we find the minimum iteratively. We subtract (a fraction of) the derivative from the input x so that the value of the function decreases. Suppose we start at the point x = 1 and set the fraction (the learning rate) \eta = 0.1, with the update \Delta x = \eta \frac{d}{dx} f(x). So:

  x = 1 - 0.1 \times 2 = 0.8,          f(0.8) = 0.8^2 + 1 = 1.64
  x = 0.8 - 0.1 \times 1.6 = 0.64,      f(0.64) = 0.64^2 + 1 = 1.4096
  x = 0.64 - 0.1 \times 1.28 = 0.512,   f(0.512) = 0.512^2 + 1 = 1.262144

As x approaches 0, f(x) approaches the minimum, which is 1.
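The updates above are exactly what a few lines of code produce; a minimal sketch of the loop:

```python
def grad(x):
    # derivative of f(x) = x^2 + 1
    return 2 * x

x, eta = 1.0, 0.1  # starting point and learning rate
trace = []
for _ in range(3):
    x = x - eta * grad(x)  # move against the slope
    trace.append(x)
# trace is approximately [0.8, 0.64, 0.512], matching the derivation above
```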

Finding the minimum iteratively

What if we try to learn fast using a larger learning rate?
Let's still start at x = 1, but set the learning rate \eta = 1 instead, and see what happens:

  x = 1 - 1 \times 2 = -1,       f(-1) = (-1)^2 + 1 = 2
  x = -1 - 1 \times (-2) = 1,    f(1) = 1^2 + 1 = 2
  x = 1 - 1 \times 2 = -1,       f(-1) = (-1)^2 + 1 = 2

So x will just swing back and forth between 1 and -1 without ever reaching the minimum.
Setting the right learning rate is thus very important. If it is set improperly, we may never reach the minimum, or at least take much longer than necessary.
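Running the same loop with the larger learning rate shows the oscillation; a minimal sketch:

```python
def grad(x):
    # derivative of f(x) = x^2 + 1
    return 2 * x

x, eta = 1.0, 1.0  # learning rate now too large for this function
seen = []
for _ in range(4):
    x = x - eta * grad(x)
    seen.append(x)
# x bounces between -1 and 1 and never approaches the minimum at 0
```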

Trying to learn fast with a larger learning rate

Derivative Rules
Common derivatives:

  \frac{d}{dx}(C) = 0                   e.g., \frac{d}{dx}(91) = 0
  \frac{d}{dx}(x) = 1                   e.g., \frac{d}{dx}(4x) = 4
  \frac{d}{dx}(x^n) = n x^{n-1}          e.g., \frac{d}{dx}(x^4) = 4x^3
  \frac{d}{dx}(a^x) = a^x \ln(a)         e.g., \frac{d}{dx}(5^x) = 5^x \ln(5)
  \frac{d}{dx}(e^x) = e^x

Note: ln is the "natural logarithm", the logarithm to the base of the mathematical constant e, where e = 2.71828...
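Each rule in the table can be checked against a finite-difference approximation; a quick sketch for the power and exponential rules:

```python
import math

def num_d(g, x, h=1e-6):
    # central finite difference approximates dg/dx at x
    return (g(x + h) - g(x - h)) / (2 * h)

x = 1.3
# d/dx x^n = n x^(n-1)
assert abs(num_d(lambda t: t ** 4, x) - 4 * x ** 3) < 1e-4
# d/dx a^x = a^x ln(a)
assert abs(num_d(lambda t: 5 ** t, x) - 5 ** x * math.log(5)) < 1e-3
```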

Derivative rules
More common derivatives:

  \frac{d}{dx}(\ln(x)) = \frac{1}{x}, \quad x > 0
  \frac{d}{dx}(\ln|x|) = \frac{1}{x}, \quad x \neq 0
  \frac{d}{dx}(\log_a(x)) = \frac{1}{x \ln(a)}, \quad x > 0
  \frac{d}{dx}(\sin(x)) = \cos(x)
  \frac{d}{dx}(\cos(x)) = -\sin(x)
  \frac{d}{dx}(\tan(x)) = \sec^2(x)

Note: when x \le 0, \ln(x) is undefined. That is, you can't raise the constant e to any power and get zero or a negative number.

Derivatives of functions

"The derivative of the function with respect to x":

  \frac{d}{dx}(c f(x)) = c \frac{d}{dx} f(x)
  \frac{d}{dx}(f(x) \pm g(x)) = \frac{d}{dx} f(x) \pm \frac{d}{dx} g(x)
  \frac{d}{dx}(f(x) g(x)) = g(x) \frac{d}{dx} f(x) + f(x) \frac{d}{dx} g(x)    (Product rule)
  \frac{d}{dx}\Big(\frac{f(x)}{g(x)}\Big) = \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g(x)^2}    (Quotient rule)
  \frac{d}{dx} f(g(x)) = \frac{d}{d g(x)} f(g(x)) \cdot \frac{d}{dx} g(x)    (Chain rule)

Breaking down the derivative of complex functions

Using these fundamental derivative rules, and particularly the chain rule, you can break down more complicated functions:

  \frac{d}{dx} (f(x))^n = n (f(x))^{n-1} \frac{d}{dx} f(x)
  \frac{d}{dx} e^{f(x)} = e^{f(x)} \frac{d}{dx} f(x)
  \frac{d}{dx} \ln(f(x)) = \frac{1}{f(x)} \frac{d}{dx} f(x)

Partial Derivatives

- We don't normally deal with single-variable functions in NLP. A typical NLP model (function) has tens of thousands or millions of variables (features), so we need to compute partial derivatives.
- Fortunately, computing partial derivatives is relatively simple: hold all the other variables constant (treat them as constants) and take the derivative with respect to the given variable.

  \frac{\partial}{\partial x} f(x, y), \quad \frac{\partial}{\partial y} f(x, y)

More on partial derivatives

  f(x, y) = x^2 + y^2
  \frac{\partial}{\partial x} f(x, y) = 2x
  \frac{\partial}{\partial y} f(x, y) = 2y

More on partial derivatives

  f(x, y) = \min(x, y) = \begin{cases} x & \text{if } x < y \\ y & \text{if } x \ge y \end{cases}

  \frac{\partial}{\partial x} f(x, y) = \begin{cases} 1 & \text{if } x < y \\ 0 & \text{if } x > y \end{cases}
  \qquad
  \frac{\partial}{\partial y} f(x, y) = \begin{cases} 0 & \text{if } x < y \\ 1 & \text{if } x > y \end{cases}

The function is not differentiable when x = y.
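The two partials of f(x, y) = x^2 + y^2 can be checked numerically by bumping one argument at a time while the other stays fixed; a small sketch:

```python
def f(x, y):
    return x ** 2 + y ** 2

def partial_x(g, x, y, h=1e-6):
    # hold y fixed, differentiate with respect to x
    return (g(x + h, y) - g(x - h, y)) / (2 * h)

def partial_y(g, x, y, h=1e-6):
    # hold x fixed, differentiate with respect to y
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

# analytic answers: df/dx = 2x, df/dy = 2y
assert abs(partial_x(f, 1.5, -2.0) - 3.0) < 1e-4
assert abs(partial_y(f, 1.5, -2.0) + 4.0) < 1e-4
```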

Plot multi-variable functions

Gradient
The gradient of a function, \nabla f, is the vector of partial derivatives of the function:

  \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}
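Collecting the partials gives the gradient vector; a minimal numerical sketch that works for any number of variables:

```python
def gradient(g, xs, h=1e-6):
    """Numerical gradient of g at point xs: one partial derivative per coordinate."""
    out = []
    for i in range(len(xs)):
        up, down = list(xs), list(xs)
        up[i] += h    # bump only coordinate i upward
        down[i] -= h  # and downward, holding the rest constant
        out.append((g(up) - g(down)) / (2 * h))
    return out

f = lambda v: v[0] ** 2 + 3 * v[1]  # analytic gradient is [2 * v[0], 3]
```

For example, `gradient(f, [2.0, 5.0])` is approximately `[4.0, 3.0]`.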

Properties of Logarithms
  \log(xy) = \log(x) + \log(y)
  \log(x/y) = \log(x) - \log(y)
  \ln(e^x) = x
  e^{\ln(x)} = x, \quad x > 0
  \log(x^y) = y \log(x)
  \log\Big(\prod_i x_i\Big) = \sum_i \log(x_i)

- It is common practice to map probabilities to logarithmic space to avoid underflow (when a value gets too close to zero for the computer to represent it):
  \ln(0.0001) = -9.2103403...
- You can map the log values back to probabilities using the exponent:
  e^{-9.2103403} = 0.0001
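The underflow problem is easy to demonstrate: multiplying many small probabilities hits zero, while the sum of logs stays finite. A small sketch:

```python
import math

probs = [1e-30] * 20  # twenty very small probabilities

direct = math.prod(probs)  # 1e-600 is far below the smallest representable float
log_space = sum(math.log(p) for p in probs)  # about -1381.6, perfectly finite

assert direct == 0.0  # the product underflowed to zero
assert math.isfinite(log_space)
```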

Convexity of functions
- Intuitively, a convex (concave) function is a continuous function with a single minimum (maximum).
- A mathematical definition: a convex function is a continuous function whose value at the midpoint of every interval in its domain does not exceed the arithmetic mean of its values at the ends of the interval.
- How to decide whether a function is convex: if f(x) has a second derivative on [a, b], then a necessary and sufficient condition for it to be convex on that interval is that the second derivative f''(x) \ge 0 for all x in [a, b].
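The second-derivative test can itself be approximated numerically with a central difference; a quick sketch comparing a convex and a non-convex function:

```python
def second_derivative(g, x, h=1e-4):
    # central-difference approximation of g''(x)
    return (g(x + h) - 2 * g(x) + g(x - h)) / (h * h)

convex = lambda x: x ** 2 + 1  # f''(x) = 2 everywhere, so convex
cubic = lambda x: x ** 3       # f''(x) = 6x, negative for x < 0, so not convex on all of R

assert second_derivative(convex, -1.0) > 0
assert second_derivative(cubic, -1.0) < 0
```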

Example convex functions

Example non-convex functions