程序代写代做代考 algorithm AI deep learning Seminar presentation

Seminar presentation

CS918 NLP Seminar

Mathematical primer for
neural networks

Elena Kochkina

University of Warwick

Outline

• Linear algebra

• Differentiation

• Cost function

• Optimisation via Gradient Descent

Refreshing linear algebra
A =


↵11 ↵12
↵21 ↵22

I =

2

66
4

1 0 .. 0
0 1 .. 0
.. .. .. ..
0 0 .. 1

3

77
5 D =

2

66
4

↵ 0 .. 0
0 � .. 0
.. .. .. ..
0 0 .. �

3

77
5

Identity matrix Diagonal matrix

Matrix transpose(AT )ij = Aji

Refreshing linear algebra

A =


↵11 ↵12
↵21 ↵22


AT =


↵11 ↵21
↵12 ↵22

If A = AT then matrix A is symmetric

What kind of matrices can be symmetric ?

If then matrix A is antisymmetricA = �AT

Refreshing linear algebra
Properties of matrix transpose

(AT )T = A

(AB)T = BTAT

(A+B)T = AT +BT

Refreshing linear algebra
Conjugate transpose

The conjugate transpose or Hermitian transpose of an m-
by-n matrix A with complex entries is the n-by-m matrix A*
obtained from A by taking the transpose and then taking the
complex conjugate of each entry (i.e., negating their
imaginary parts but not their real parts)

(A⇤)ij = Aji

Refreshing linear algebra
Matrix multiplication

A =


a b c

x y z


B =

2

4
↵ ⇢
� �
� ⌧

3

5

AB =


a b c

x y z

�2

4
↵ ⇢

� �

� ⌧

3

5 =

a↵+ b� + c� a⇢+ b� + c⌧
x↵+ y� + z� x⇢+ y� + z⌧

•Matrix multiplication is associative:

•Matrix multiplication is distributive:

•Matrix multiplication is not commutative:

Refreshing linear algebra
Properties of matrix multiplication

(AB)C = A(BC)

A(B + C) = AB +AC

AB 6= BA

AI = A = IA

Refreshing linear algebra

detA = � =

����
a c
b d

���� = ad� bc

Determinant

Refreshing linear algebra
Determinant

Laplace expansion of a determinant by complementary minors

� =
nX

j=1

(�1)1+ja1jM̄1j

Refreshing linear algebra

A�1A = I = AA�1
A�1 – inverse of a square matrix A

Properties:

(A�1)�1 = A

(AB)�1 = B�1A�1

(A�1)T = (AT )�1

Refreshing linear algebra
How to find inverse of the matrix?

Adjoint method

A

�1 =
1

detA

(adjoint of A)

A

�1 =
1

detA

(cofactor of A)T

A matrix with elements that are the cofactors, term-by-term, of
a given square matrix.

A =

2

4
a b c

x y z

m l p

3

5

Refreshing linear algebra
How to find the cofactor matrix?

etc..

C11 =

����
y z
l p

���� = yp� zl

C12 =

����
x z

m p

���� = xp� zm

Unitary matrix

Refreshing linear algebra

A complex square matrix U is unitary if:

U⇤U = UU⇤ = I

U⇤ = U�1

equivalently:

Refreshing linear algebra
Eigenvalues and Eigenvectors

(�I �A)x = 0, x 6= 0

� – eigenvalue x – corresponding eigenvector

|�I �A| = 0

(�iI �A)x = 0
To find the n eigenvalues solve:

To find the i-th eigenvector solve:

Singular Value Decomposition (SVD)

Refreshing linear algebra

M is a matrix of real or complex numbersm⇥ n
Then there exists a factorisation, called a Singular Value
Decomposition of of the formM

M = U⌃V ⇤

where U

V ⇤

is a

is a

is a
m⇥m
m⇥ n

n⇥ n

unitary matrix

unitary matrix

diagonal matrix with non-
negative real numbers

Refreshing linear algebra

Calculating the SVD consists of finding the eigenvalues
and eigenvectors of and .

The eigenvectors of make up the columns of V .

The eigenvectors of  make up the columns of U.

The singular values are square roots of eigenvalues from
. or .  

The diagonal entries of are known as the singular
values of the matrix M.

MM⇤ M⇤M

MM⇤

M⇤M

M⇤M MM⇤

Differentiable function – a function whose derivative exists at
each point in its domain, i.e. exists the limit

Differentiation reminder

For a function f to have a derivative it is necessary for the
function f to be continuous, but continuity alone is not
sufficient.

What is the example of continuous but not differentiable
function?

Derivatives of common activation functions

https://en.wikipedia.org/wiki/Activation_function

https://en.wikipedia.org/wiki/Activation_function

Derivatives of common activation functions

https://en.wikipedia.org/wiki/Activation_function

https://en.wikipedia.org/wiki/Activation_function

Derivatives of common activation functions

Derivatives cheat sheet

Rules for combined functions

Rules for combined functions

Partial derivative of a function of several variables –
derivative with respect to one of those variables, with the
others held constant

@f

@xi
(a1, .., ai, .., an) = lim

h!0

f(a1, .., ai + h, .., an)� f(a1, .., ai, .., an)
h

Cost function

The goal of your classifier (e.g. neural network) is to
minimise loss function by choosing the optimal set of
parameters

Cost function, loss function or objective – function that
measures the quality of the output produced by the model

Desirable features:

• Low computational complexity
• Differentiable
• Convex to ensure the uniqueness of the solution

Examples of cost functions
Loss Function Usage

Miscalssification
error

Squared

Binary cross-
entropy

Categorical
cross-entropy

Binary hinge

classification

binary
classification

multi-class
classification

regression

binary
classification

c(x, y, f(x)) = �y log(f(x))� (1� y) log(1� f(x))

c(x, y, f(x)) = (f(x)� y)2

c(x, y, f(x)) =

(
0, if y == f(x).

1, otherwise.

c(x, y, f(x)) = �
X

j

yi,j log(f(x))i,j

c(x, y, f(x)) = max(0, � � yf(x))

f(x) – prediction; y – label; c(x,y,f(x)) – loss function

Optimisation
The goal of optimization is to find the set of parameters W
that minimizes the loss function.

Random search is not a good strategy.

The idea is to start with set of random parameter and
iteratively refine them over time to get lower loss

How?

We can compute the direction in the parameter space
that is mathematically guaranteed to be the direction of the
steepest descend of the loss function.

Gradient

The gradient of is a vector whose
components are the n partial derivatives of f.

f(x1, .., xn)

f(x1, .., xn)

– differentiable, scalar-valued function

The gradient points in the direction of the greatest rate of
increase of the function

rf(x1, .., xn) = (
@f

@x1
, ..,

@f

@xn
)

Gradient
Two ways to compute the gradient:

• numerical gradient (slow, approximate, easy)

Using finite difference formulas:

• analytic gradient (fast, exact, not always easy)

f

0(x) =
f(x+ h)� f(x)

h

@f

@xi
(a1, .., ai, .., an) =

f(a1, .., ai + h, .., an)� f(a1, .., ai, .., an)
h

Choosing the step size (learning rate) is one of the most
important hyperparameter settings in training a neural
network.

Gradient descent

And we are moving (updating weights) in negative gradient
direction

Loss function can be viewed as a surface in parameter space

�W = rloss

W = W � learn.rate ·�W

Repeat until convergence

https://en.wikipedia.org/wiki/Gradient_descent

x 0

x 1

x 2

x 3
x 4

*

*

Illustrations of Gradient Descent

https://en.wikipedia.org/wiki/Gradient_descent

Gradient descent
Updates can be performed using loss function calculated
over

• the whole dataset,
• individual examples (online or stochastic gradient

descent),
• mini-batches (very commonly used now, especially in

large-scale applications).

img source: http://imgur.com/a/Hqolp

Illustrations of different optimisation
algorithms

Resources

Linear Algebra Review and Reference (http://cs229.stanford.edu/section/cs229-linalg.pdf)
Deep Learning for Natural Language Processing (http://cs224d.stanford.edu/index.html)
Loss function (http://cs231n.github.io/linear-classify/#loss)
Gradient descent (http://cs231n.github.io/optimization-1/#opt3)
Nice blog post – overview of gradient descent optimisation algorithms (http://
sebastianruder.com/optimizing-gradient-descent/index.html)

http://cs229.stanford.edu/section/cs229-linalg.pdf
http://cs231n.github.io/linear-classify/#loss
http://cs231n.github.io/optimization-1/#opt3