Seminar presentation
CS918 NLP Seminar
Mathematical primer for
neural networks
Elena Kochkina
University of Warwick
Outline
• Linear algebra
• Differentiation
• Cost function
• Optimisation via Gradient Descent
Refreshing linear algebra
A =
↵11 ↵12
↵21 ↵22
�
I =
2
66
4
1 0 .. 0
0 1 .. 0
.. .. .. ..
0 0 .. 1
3
77
5 D =
2
66
4
↵ 0 .. 0
0 � .. 0
.. .. .. ..
0 0 .. �
3
77
5
Identity matrix Diagonal matrix
Matrix transpose(AT )ij = Aji
Refreshing linear algebra
A =
↵11 ↵12
↵21 ↵22
�
AT =
↵11 ↵21
↵12 ↵22
�
If A = AT then matrix A is symmetric
What kind of matrices can be symmetric ?
If then matrix A is antisymmetricA = �AT
Refreshing linear algebra
Properties of matrix transpose
(AT )T = A
(AB)T = BTAT
(A+B)T = AT +BT
Refreshing linear algebra
Conjugate transpose
The conjugate transpose or Hermitian transpose of an m-
by-n matrix A with complex entries is the n-by-m matrix A*
obtained from A by taking the transpose and then taking the
complex conjugate of each entry (i.e., negating their
imaginary parts but not their real parts)
(A⇤)ij = Aji
Refreshing linear algebra
Matrix multiplication
A =
a b c
x y z
�
B =
2
4
↵ ⇢
� �
� ⌧
3
5
AB =
a b c
x y z
�2
4
↵ ⇢
� �
� ⌧
3
5 =
a↵+ b� + c� a⇢+ b� + c⌧
x↵+ y� + z� x⇢+ y� + z⌧
�
•Matrix multiplication is associative:
•Matrix multiplication is distributive:
•Matrix multiplication is not commutative:
Refreshing linear algebra
Properties of matrix multiplication
(AB)C = A(BC)
A(B + C) = AB +AC
AB 6= BA
AI = A = IA
Refreshing linear algebra
detA = � =
����
a c
b d
���� = ad� bc
Determinant
Refreshing linear algebra
Determinant
Laplace expansion of a determinant by complementary minors
� =
nX
j=1
(�1)1+ja1jM̄1j
Refreshing linear algebra
A�1A = I = AA�1
A�1 – inverse of a square matrix A
Properties:
(A�1)�1 = A
(AB)�1 = B�1A�1
(A�1)T = (AT )�1
Refreshing linear algebra
How to find inverse of the matrix?
Adjoint method
A
�1 =
1
detA
(adjoint of A)
A
�1 =
1
detA
(cofactor of A)T
A matrix with elements that are the cofactors, term-by-term, of
a given square matrix.
A =
2
4
a b c
x y z
m l p
3
5
Refreshing linear algebra
How to find the cofactor matrix?
etc..
C11 =
����
y z
l p
���� = yp� zl
C12 =
����
x z
m p
���� = xp� zm
Unitary matrix
Refreshing linear algebra
A complex square matrix U is unitary if:
U⇤U = UU⇤ = I
U⇤ = U�1
equivalently:
Refreshing linear algebra
Eigenvalues and Eigenvectors
(�I �A)x = 0, x 6= 0
� – eigenvalue x – corresponding eigenvector
|�I �A| = 0
(�iI �A)x = 0
To find the n eigenvalues solve:
To find the i-th eigenvector solve:
Singular Value Decomposition (SVD)
Refreshing linear algebra
M is a matrix of real or complex numbersm⇥ n
Then there exists a factorisation, called a Singular Value
Decomposition of of the formM
M = U⌃V ⇤
where U
⌃
V ⇤
is a
is a
is a
m⇥m
m⇥ n
n⇥ n
unitary matrix
unitary matrix
diagonal matrix with non-
negative real numbers
Refreshing linear algebra
Calculating the SVD consists of finding the eigenvalues
and eigenvectors of and .
The eigenvectors of make up the columns of V .
The eigenvectors of make up the columns of U.
The singular values are square roots of eigenvalues from
. or .
The diagonal entries of are known as the singular
values of the matrix M.
⌃
MM⇤ M⇤M
MM⇤
M⇤M
M⇤M MM⇤
Differentiable function – a function whose derivative exists at
each point in its domain, i.e. exists the limit
Differentiation reminder
For a function f to have a derivative it is necessary for the
function f to be continuous, but continuity alone is not
sufficient.
What is the example of continuous but not differentiable
function?
Derivatives of common activation functions
https://en.wikipedia.org/wiki/Activation_function
https://en.wikipedia.org/wiki/Activation_function
Derivatives of common activation functions
https://en.wikipedia.org/wiki/Activation_function
https://en.wikipedia.org/wiki/Activation_function
Derivatives of common activation functions
Derivatives cheat sheet
Rules for combined functions
Rules for combined functions
Partial derivative of a function of several variables –
derivative with respect to one of those variables, with the
others held constant
@f
@xi
(a1, .., ai, .., an) = lim
h!0
f(a1, .., ai + h, .., an)� f(a1, .., ai, .., an)
h
Cost function
The goal of your classifier (e.g. neural network) is to
minimise loss function by choosing the optimal set of
parameters
Cost function, loss function or objective – function that
measures the quality of the output produced by the model
Desirable features:
• Low computational complexity
• Differentiable
• Convex to ensure the uniqueness of the solution
Examples of cost functions
Loss Function Usage
Miscalssification
error
Squared
Binary cross-
entropy
Categorical
cross-entropy
Binary hinge
classification
binary
classification
multi-class
classification
regression
binary
classification
c(x, y, f(x)) = �y log(f(x))� (1� y) log(1� f(x))
c(x, y, f(x)) = (f(x)� y)2
c(x, y, f(x)) =
(
0, if y == f(x).
1, otherwise.
c(x, y, f(x)) = �
X
j
yi,j log(f(x))i,j
c(x, y, f(x)) = max(0, � � yf(x))
f(x) – prediction; y – label; c(x,y,f(x)) – loss function
Optimisation
The goal of optimization is to find the set of parameters W
that minimizes the loss function.
Random search is not a good strategy.
The idea is to start with set of random parameter and
iteratively refine them over time to get lower loss
How?
We can compute the direction in the parameter space
that is mathematically guaranteed to be the direction of the
steepest descend of the loss function.
Gradient
The gradient of is a vector whose
components are the n partial derivatives of f.
f(x1, .., xn)
f(x1, .., xn)
– differentiable, scalar-valued function
The gradient points in the direction of the greatest rate of
increase of the function
rf(x1, .., xn) = (
@f
@x1
, ..,
@f
@xn
)
Gradient
Two ways to compute the gradient:
• numerical gradient (slow, approximate, easy)
Using finite difference formulas:
• analytic gradient (fast, exact, not always easy)
f
0(x) =
f(x+ h)� f(x)
h
@f
@xi
(a1, .., ai, .., an) =
f(a1, .., ai + h, .., an)� f(a1, .., ai, .., an)
h
Choosing the step size (learning rate) is one of the most
important hyperparameter settings in training a neural
network.
Gradient descent
And we are moving (updating weights) in negative gradient
direction
Loss function can be viewed as a surface in parameter space
�W = rloss
W = W � learn.rate ·�W
Repeat until convergence
https://en.wikipedia.org/wiki/Gradient_descent
x 0
x 1
x 2
x 3
x 4
*
*
Illustrations of Gradient Descent
https://en.wikipedia.org/wiki/Gradient_descent
Gradient descent
Updates can be performed using loss function calculated
over
• the whole dataset,
• individual examples (online or stochastic gradient
descent),
• mini-batches (very commonly used now, especially in
large-scale applications).
img source: http://imgur.com/a/Hqolp
Illustrations of different optimisation
algorithms
Resources
Linear Algebra Review and Reference (http://cs229.stanford.edu/section/cs229-linalg.pdf)
Deep Learning for Natural Language Processing (http://cs224d.stanford.edu/index.html)
Loss function (http://cs231n.github.io/linear-classify/#loss)
Gradient descent (http://cs231n.github.io/optimization-1/#opt3)
Nice blog post – overview of gradient descent optimisation algorithms (http://
sebastianruder.com/optimizing-gradient-descent/index.html)
http://cs229.stanford.edu/section/cs229-linalg.pdf
http://cs231n.github.io/linear-classify/#loss
http://cs231n.github.io/optimization-1/#opt3