Review of Linear Algebra
April 1, 2021
1 Basic Concepts and Notation
2 Matrix Multiplication
3 Operations and Properties
4 Matrix Calculus
Basic Concepts and Notation
Basic Notation
– By x ∈ R^n, we denote a vector with n entries.

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$

– By A ∈ R^{m×n} we denote a matrix with m rows and n columns, where the entries of A are real numbers.

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}
= \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix}.$$
The Identity Matrix
The identity matrix, denoted I ∈ R^{n×n}, is a square matrix with ones on the diagonal and zeros everywhere else. That is,

$$I_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$

It has the property that for all A ∈ R^{n×n},

$$AI = A = IA.$$
Diagonal matrices
A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically denoted D = diag(d_1, d_2, ..., d_n), with

$$D_{ij} = \begin{cases} d_i & i = j \\ 0 & i \neq j. \end{cases}$$

Clearly, I = diag(1, 1, ..., 1).
Matrix Multiplication
Vector-Vector Product
– inner product or dot product: for x, y ∈ R^n,

$$x^T y \in \mathbb{R} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^n x_i y_i.$$

– outer product: for x ∈ R^m, y ∈ R^n,

$$xy^T \in \mathbb{R}^{m\times n} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}
= \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}.$$
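Both products are easy to sanity-check numerically; here is a minimal NumPy sketch (NumPy is an assumption added for this review, not part of the original slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y           # x^T y = sum_i x_i y_i, a scalar
outer = np.outer(x, y)  # x y^T, a 3x3 matrix with entries x_i * y_j

assert inner == sum(x[i] * y[i] for i in range(3))
assert outer.shape == (3, 3) and outer[1, 2] == x[1] * y[2]
```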
Matrix-Vector Product
– If we write A by rows, then we can express Ax as

$$y = Ax = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix} x
= \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}.$$

– If we write A by columns, then we have:

$$y = Ax = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
= a_1 x_1 + a_2 x_2 + \dots + a_n x_n. \tag{1}$$

y is a linear combination of the columns of A.
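The row view and the column view give the same result; a small NumPy check (library choice is an added assumption):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])  # A in R^{3x2}
x = np.array([10., 1.])

# Row view: y_i = a_i^T x, one inner product per row
y_rows = np.array([A[i] @ x for i in range(3)])
# Column view: y = x_1 a_1 + x_2 a_2, a linear combination of the columns
y_cols = x[0] * A[:, 0] + x[1] * A[:, 1]

assert np.allclose(y_rows, A @ x)
assert np.allclose(y_cols, A @ x)
```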
Matrix-Vector Product
It is also possible to multiply on the left by a row vector.
– If we write A by columns, then we can express x^T A as

$$y^T = x^T A = x^T \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{bmatrix}.$$

– Expressing A in terms of rows we have:

$$y^T = x^T A = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix}
\begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix}
= x_1 \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \end{bmatrix} + x_2 \begin{bmatrix} \text{---}\; a_2^T \;\text{---} \end{bmatrix} + \dots + x_m \begin{bmatrix} \text{---}\; a_m^T \;\text{---} \end{bmatrix}$$

y^T is a linear combination of the rows of A.
Matrix-Matrix Multiplication (different views)
1. As a set of vector-vector products:

$$C = AB = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix}
\begin{bmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_p \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\ \vdots & \vdots & \ddots & \vdots \\ a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{bmatrix}.$$
Matrix-Matrix Multiplication (different views)
2. As a sum of outer products:

$$C = AB = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}
\begin{bmatrix} \text{---}\; b_1^T \;\text{---} \\ \text{---}\; b_2^T \;\text{---} \\ \vdots \\ \text{---}\; b_n^T \;\text{---} \end{bmatrix}
= \sum_{i=1}^n a_i b_i^T.$$
Matrix-Matrix Multiplication (different views)
3. As a set of matrix-vector products:

$$C = AB = A \begin{bmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_p \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} | & | & & | \\ Ab_1 & Ab_2 & \cdots & Ab_p \\ | & | & & | \end{bmatrix}. \tag{2}$$

Here the i-th column of C is given by the matrix-vector product with the vector on the right, c_i = Ab_i. These matrix-vector products can in turn be interpreted using both viewpoints given in the previous subsection.
Matrix-Matrix Multiplication (different views)
4. As a set of vector-matrix products:

$$C = AB = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_m^T \;\text{---} \end{bmatrix} B
= \begin{bmatrix} \text{---}\; a_1^T B \;\text{---} \\ \text{---}\; a_2^T B \;\text{---} \\ \vdots \\ \text{---}\; a_m^T B \;\text{---} \end{bmatrix}.$$
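All of these views compute the same product; a NumPy sketch verifying three of them against `A @ B` (the library is an assumption added for this review):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

C = A @ B
# View 1: entries as inner products, C_ij = a_i^T b_j
C1 = np.array([[A[i] @ B[:, j] for j in range(2)] for i in range(3)])
# View 2: sum of outer products of columns of A with rows of B
C2 = sum(np.outer(A[:, k], B[k]) for k in range(4))
# View 3: columns of C are the matrix-vector products A b_j
C3 = np.column_stack([A @ B[:, j] for j in range(2)])

assert np.allclose(C1, C) and np.allclose(C2, C) and np.allclose(C3, C)
```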
Matrix-Matrix Multiplication (properties)
– Associative: (AB)C = A(BC).
– Distributive: A(B + C) = AB + AC.
– In general, not commutative; that is, it can be the case that AB ≠ BA. (For example, if A ∈ R^{m×n} and B ∈ R^{n×q}, the matrix product BA does not even exist if m and q are not equal!)
Operations and Properties
The Transpose
The transpose of a matrix results from “flipping” the rows and columns. Given a matrix A ∈ R^{m×n}, its transpose, written A^T ∈ R^{n×m}, is the n × m matrix whose entries are given by

$$(A^T)_{ij} = A_{ji}.$$

The following properties of transposes are easily verified:
– (A^T)^T = A
– (AB)^T = B^T A^T
– (A + B)^T = A^T + B^T
Trace
The trace of a square matrix A ∈ R^{n×n}, denoted tr A, is the sum of the diagonal elements of the matrix:

$$\operatorname{tr} A = \sum_{i=1}^n A_{ii}.$$

The trace has the following properties:
– For A ∈ R^{n×n}, tr A = tr A^T.
– For A, B ∈ R^{n×n}, tr(A + B) = tr A + tr B.
– For A ∈ R^{n×n}, t ∈ R, tr(tA) = t tr A.
– For A, B such that AB is square, tr AB = tr BA.
– For A, B, C such that ABC is square, tr ABC = tr BCA = tr CAB, and so on for the product of more matrices.
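The cyclic property is easy to check numerically, even when AB and BA have different sizes; a NumPy sketch (an added assumption, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))
C = rng.standard_normal((3, 3))

# tr(AB) = tr(BA) even though AB is 3x3 and BA is 5x5
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# Cyclic property for three factors: tr(ABC) = tr(BCA) = tr(CAB)
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
```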
Norms
A norm of a vector, ‖x‖, is informally a measure of the “length” of the vector.
More formally, a norm is any function f : R^n → R that satisfies 4 properties:
1. For all x ∈ R^n, f(x) ≥ 0 (non-negativity).
2. f(x) = 0 if and only if x = 0 (definiteness).
3. For all x ∈ R^n, t ∈ R, f(tx) = |t| f(x) (homogeneity).
4. For all x, y ∈ R^n, f(x + y) ≤ f(x) + f(y) (triangle inequality).
Examples of Norms
The commonly-used Euclidean or ℓ2 norm,

$$\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$$

The ℓ1 norm,

$$\|x\|_1 = \sum_{i=1}^n |x_i|.$$

The ℓ∞ norm,

$$\|x\|_\infty = \max_i |x_i|.$$

In fact, all three norms presented so far are examples of the family of ℓp norms, parameterized by a real number p ≥ 1, and defined as

$$\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}.$$
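A quick NumPy check of the three norms on a concrete vector (NumPy is an added assumption):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l2 = np.sqrt(np.sum(x**2))   # Euclidean norm
l1 = np.sum(np.abs(x))       # l1 norm
linf = np.max(np.abs(x))     # l-infinity norm

assert l2 == 5.0 and l1 == 7.0 and linf == 4.0
# The first two are l_p norms with p = 2 and p = 1; l-infinity is the p -> inf limit
for p, val in [(2, l2), (1, l1)]:
    assert np.isclose(np.sum(np.abs(x)**p) ** (1 / p), val)
assert np.isclose(np.linalg.norm(x, np.inf), linf)
```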
Matrix Norms
Norms can also be defined for matrices, such as the Frobenius norm,

$$\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2} = \sqrt{\operatorname{tr}(A^T A)}.$$

Many other norms exist, but they are beyond the scope of this review.
Linear Independence
A set of vectors {x_1, x_2, ..., x_n} ⊂ R^m is said to be (linearly) dependent if one vector belonging to the set can be represented as a linear combination of the remaining vectors; that is, if

$$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$$

for some scalar values α_1, ..., α_{n−1} ∈ R; otherwise, the vectors are (linearly) independent.
Example:

$$x_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 4 \\ 1 \\ 5 \end{bmatrix} \qquad x_3 = \begin{bmatrix} 2 \\ -3 \\ -1 \end{bmatrix}$$

are linearly dependent because x_3 = −2x_1 + x_2.
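A NumPy check of this example (an added sketch; the minus signs lost in extraction are restored here assuming the standard version of this example, with x_3 = [2, −3, −1]^T and x_3 = −2x_1 + x_2):

```python
import numpy as np

x1 = np.array([1., 2., 3.])
x2 = np.array([4., 1., 5.])
x3 = np.array([2., -3., -1.])

# x3 is a linear combination of x1 and x2, so the set is dependent
assert np.allclose(-2 * x1 + x2, x3)
# Equivalently, the matrix with these vectors as columns has rank < 3
assert np.linalg.matrix_rank(np.column_stack([x1, x2, x3])) == 2
```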
Rank of a Matrix
– The column rank of a matrix A ∈ R^{m×n} is the size of the largest subset of columns of A that constitute a linearly independent set.
– The row rank is the largest number of rows of A that constitute a linearly independent set.
– For any matrix A ∈ R^{m×n}, it turns out that the column rank of A is equal to the row rank of A (prove it yourself!), and so both quantities are referred to collectively as the rank of A, denoted as rank(A).
Properties of the Rank
– For A ∈ R^{m×n}, rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank.
– For A ∈ R^{m×n}, rank(A) = rank(A^T).
– For A ∈ R^{m×n}, B ∈ R^{n×p}, rank(AB) ≤ min(rank(A), rank(B)).
– For A, B ∈ R^{m×n}, rank(A + B) ≤ rank(A) + rank(B).
The Inverse of a Square Matrix
– The inverse of a square matrix A ∈ R^{n×n} is denoted A^{-1}, and is the unique matrix such that

$$A^{-1} A = I = A A^{-1}.$$

– We say that A is invertible or non-singular if A^{-1} exists, and non-invertible or singular otherwise.
– In order for a square matrix A to have an inverse A^{-1}, A must be full rank.
– Properties (assuming A, B ∈ R^{n×n} are non-singular):
  – (A^{-1})^{-1} = A
  – (AB)^{-1} = B^{-1} A^{-1}
  – (A^{-1})^T = (A^T)^{-1}. For this reason this matrix is often denoted A^{-T}.
Orthogonal Matrices
– Two vectors x, y ∈ R^n are orthogonal if x^T y = 0.
– A vector x ∈ R^n is normalized if ‖x‖_2 = 1.
– A square matrix U ∈ R^{n×n} is orthogonal if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being orthonormal).
– Properties:
  – The inverse of an orthogonal matrix is its transpose:

$$U^T U = I = U U^T.$$

  – Operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e., ‖Ux‖_2 = ‖x‖_2 for any x ∈ R^n and orthogonal U ∈ R^{n×n}.
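These properties can be checked on a random orthogonal matrix built via QR decomposition; a NumPy sketch (added for this review):

```python
import numpy as np

rng = np.random.default_rng(2)
# The Q factor of a QR decomposition of a random square matrix is orthogonal
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))

assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(np.linalg.inv(U), U.T)  # the inverse is the transpose

x = rng.standard_normal(4)
# Multiplying by U preserves the Euclidean norm
assert np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x))
```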
Span and Projection
– The span of a set of vectors {x_1, x_2, ..., x_n} is the set of all vectors that can be expressed as a linear combination of {x_1, ..., x_n}. That is,

$$\operatorname{span}(\{x_1, \dots, x_n\}) = \left\{ v : v = \sum_{i=1}^n \alpha_i x_i,\ \alpha_i \in \mathbb{R} \right\}.$$

– The projection of a vector y ∈ R^m onto the span of {x_1, ..., x_n} is the vector v ∈ span({x_1, ..., x_n}) such that v is as close as possible to y, as measured by the Euclidean norm ‖v − y‖_2:

$$\operatorname{Proj}(y; \{x_1, \dots, x_n\}) = \operatorname*{argmin}_{v \in \operatorname{span}(\{x_1, \dots, x_n\})} \|y - v\|_2.$$
Range
– The range or the columnspace of a matrix A ∈ R^{m×n}, denoted R(A), is the span of the columns of A. In other words,

$$R(A) = \{ v \in \mathbb{R}^m : v = Ax,\ x \in \mathbb{R}^n \}.$$

– Assuming A is full rank and n < m, the projection of a vector y ∈ R^m onto the range of A is given by

$$\operatorname{Proj}(y; A) = \operatorname*{argmin}_{v \in R(A)} \|v - y\|_2.$$
Null space
The nullspace of a matrix A ∈ R^{m×n}, denoted N(A), is the set of all vectors that equal 0 when multiplied by A, i.e.,

$$N(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}.$$
The Determinant
The determinant of a square matrix A ∈ R^{n×n} is a function det : R^{n×n} → R, and is denoted |A| or det A. Given a matrix

$$A = \begin{bmatrix} \text{---}\; a_1^T \;\text{---} \\ \text{---}\; a_2^T \;\text{---} \\ \vdots \\ \text{---}\; a_n^T \;\text{---} \end{bmatrix},$$

consider the set of points S ⊂ R^n as follows:

$$S = \left\{ v \in \mathbb{R}^n : v = \sum_{i=1}^n \alpha_i a_i \ \text{where}\ 0 \le \alpha_i \le 1,\ i = 1, \dots, n \right\}.$$

The absolute value of the determinant of A, it turns out, is a measure of the “volume” of the set S.
The Determinant: intuition
For example, consider the 2 × 2 matrix

$$A = \begin{bmatrix} 1 & 3 \\ 3 & 2 \end{bmatrix}. \tag{3}$$

Here, the rows of the matrix are

$$a_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix} \qquad a_2 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}.$$

The set S is then the parallelogram spanned by a_1 and a_2, whose area is |det A| = |1 · 2 − 3 · 3| = 7.
The determinant
Algebraically, the determinant satisfies the following three properties:
1. The determinant of the identity is 1, |I| = 1. (Geometrically, the volume of a unit hypercube is 1.)
2. Given a matrix A ∈ R^{n×n}, if we multiply a single row in A by a scalar t ∈ R, then the determinant of the new matrix is t|A|. (Geometrically, multiplying one of the sides of the set S by a factor t causes the volume to increase by a factor t.)
3. If we exchange any two rows a_i^T and a_j^T of A, then the determinant of the new matrix is −|A|.
In case you are wondering, it is not immediately obvious that a function satisfying the above three properties exists. In fact, though, such a function does exist, and is unique (which we will not prove here).
The Determinant: Properties
– For A ∈ R^{n×n}, |A| = |A^T|.
– For A, B ∈ R^{n×n}, |AB| = |A||B|.
– For A ∈ R^{n×n}, |A| = 0 if and only if A is singular (i.e., non-invertible). (If A is singular then it does not have full rank, and hence its columns are linearly dependent. In this case, the set S corresponds to a “flat sheet” within the n-dimensional space and hence has zero volume.)
– For A ∈ R^{n×n} and A non-singular, |A^{-1}| = 1/|A|.
The determinant: formula
Let A ∈ R^{n×n}, and let A_{\i,\j} ∈ R^{(n−1)×(n−1)} be the matrix that results from deleting the i-th row and j-th column from A.
The general (recursive) formula for the determinant is

$$|A| = \sum_{i=1}^n (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}| \qquad \text{(for any } j \in 1, \dots, n\text{)}$$

$$\phantom{|A|} = \sum_{j=1}^n (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}| \qquad \text{(for any } i \in 1, \dots, n\text{)}$$

with the initial case that |A| = a_{11} for A ∈ R^{1×1}. If we were to expand this formula completely for A ∈ R^{n×n}, there would be a total of n! (n factorial) different terms. For this reason, we hardly ever explicitly write the complete equation of the determinant for matrices bigger than 3 × 3.
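The recursive formula translates directly into code; a NumPy sketch of cofactor expansion along the first row, checked against `np.linalg.det` (added for this review, and O(n!) — for exposition only, never for real computation):

```python
import numpy as np

def det(A):
    """Cofactor (Laplace) expansion along the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]  # initial case: |A| = a_11 for a 1x1 matrix
    # (-1)^(0+j) * a_0j * |A with row 0 and column j deleted|
    return sum((-1) ** j * A[0, j] * det(np.delete(np.delete(A, 0, 0), j, 1))
               for j in range(n))

A = np.array([[1., 3.], [3., 2.]])
assert np.isclose(det(A), np.linalg.det(A))  # 1*2 - 3*3 = -7

B = np.arange(9.0).reshape(3, 3) + np.eye(3)
assert np.isclose(det(B), np.linalg.det(B))
```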
The determinant: examples
However, the equations for determinants of matrices up to size 3 × 3 are fairly common, and it is good to know them:

$$|[a_{11}]| = a_{11}$$

$$\left| \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \right| = a_{11} a_{22} - a_{12} a_{21}$$

$$\left| \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \right|
= a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + a_{13} a_{21} a_{32} - a_{11} a_{23} a_{32} - a_{12} a_{21} a_{33} - a_{13} a_{22} a_{31}$$
Quadratic Forms
Given a square matrix A ∈ R^{n×n} and a vector x ∈ R^n, the scalar value x^T A x is called a quadratic form. Written explicitly, we see that

$$x^T A x = \sum_{i=1}^n x_i (Ax)_i = \sum_{i=1}^n x_i \left( \sum_{j=1}^n A_{ij} x_j \right) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j.$$

We often implicitly assume that the matrices appearing in a quadratic form are symmetric, since

$$x^T A x = (x^T A x)^T = x^T A^T x = x^T \left( \tfrac{1}{2} A + \tfrac{1}{2} A^T \right) x,$$

i.e., only the symmetric part of A contributes to the quadratic form.
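A NumPy check that only the symmetric part of A matters in a quadratic form (an added sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))  # not symmetric in general
x = rng.standard_normal(3)

q = x @ A @ x
# The symmetric part (A + A^T)/2 gives the same quadratic form
S = 0.5 * (A + A.T)
assert np.isclose(q, x @ S @ x)
# Entry-wise definition: sum over i, j of A_ij x_i x_j
assert np.isclose(q, sum(A[i, j] * x[i] * x[j]
                         for i in range(3) for j in range(3)))
```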
Positive Semidefinite Matrices
A symmetric matrix A ∈ S^n is:
– positive definite (PD), denoted A ≻ 0, if for all non-zero vectors x ∈ R^n, x^T A x > 0.
– positive semidefinite (PSD), denoted A ⪰ 0, if for all vectors x ∈ R^n, x^T A x ≥ 0.
– negative definite (ND), denoted A ≺ 0, if for all non-zero x ∈ R^n, x^T A x < 0.
– negative semidefinite (NSD), denoted A ⪯ 0, if for all x ∈ R^n, x^T A x ≤ 0.
– indefinite, if it is neither positive semidefinite nor negative semidefinite — i.e., if there exist x_1, x_2 ∈ R^n such that x_1^T A x_1 > 0 and x_2^T A x_2 < 0.
Positive Semidefinite Matrices
– One important property of positive definite and negative definite matrices is that they are always full rank, and hence, invertible.
– Given any matrix A ∈ R^{m×n} (not necessarily symmetric or even square), the matrix G = A^T A (sometimes called a Gram matrix) is always positive semidefinite. Further, if m ≥ n and A is full rank, then G = A^T A is positive definite.
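A NumPy check that the Gram matrix is positive (semi)definite (an added sketch; a random tall matrix is full rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))  # m >= n, full rank almost surely
G = A.T @ A                      # Gram matrix

# G is symmetric with non-negative (here strictly positive) eigenvalues
eigvals = np.linalg.eigvalsh(G)
assert np.allclose(G, G.T)
assert np.all(eigvals > 0)
# Directly: x^T G x = ||Ax||^2 >= 0 for any x
x = rng.standard_normal(3)
assert np.isclose(x @ G @ x, np.linalg.norm(A @ x) ** 2)
```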
Eigenvalues and Eigenvectors
Given a square matrix A ∈ R^{n×n}, we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is the corresponding eigenvector if

$$Ax = \lambda x, \qquad x \neq 0.$$

Intuitively, this definition means that multiplying A by the vector x results in a new vector that points in the same direction as x, but scaled by a factor λ.
Eigenvalues and Eigenvectors
We can rewrite the equation above to state that (λ, x) is an eigenvalue-eigenvector pair of A if

$$(\lambda I - A)x = 0, \qquad x \neq 0.$$

But (λI − A)x = 0 has a non-zero solution x if and only if (λI − A) has a non-empty nullspace, which is only the case if (λI − A) is singular, i.e.,

$$|(\lambda I - A)| = 0.$$

We can now use the previous definition of the determinant to expand this expression |(λI − A)| into a (very large) polynomial in λ, which will have degree n. It is often called the characteristic polynomial of the matrix A.
Properties of eigenvalues and eigenvectors
– The trace of A is equal to the sum of its eigenvalues,

$$\operatorname{tr} A = \sum_{i=1}^n \lambda_i.$$

– The determinant of A is equal to the product of its eigenvalues,

$$|A| = \prod_{i=1}^n \lambda_i.$$

– The rank of A is equal to the number of non-zero eigenvalues of A.
– Suppose A is non-singular with eigenvalue λ and an associated eigenvector x. Then 1/λ is an eigenvalue of A^{-1} with an associated eigenvector x, i.e., A^{-1}x = (1/λ)x.
– The eigenvalues of a diagonal matrix D = diag(d_1, ..., d_n) are just the diagonal entries d_1, ..., d_n.
Eigenvalues and Eigenvectors of Symmetric Matrices
Throughout this section, let's assume that A is a symmetric real matrix (i.e., A^T = A). We have the following properties:
1. All eigenvalues of A are real numbers. We denote them by λ_1, ..., λ_n.
2. There exists a set of eigenvectors u_1, ..., u_n such that (i) for all i, u_i is an eigenvector with eigenvalue λ_i and (ii) u_1, ..., u_n are unit vectors and orthogonal to each other.
New Representation for Symmetric Matrices
– Let U be the orthonormal matrix that contains the u_i's as columns:

$$U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix}$$

– Let Λ = diag(λ_1, ..., λ_n) be the diagonal matrix that contains λ_1, ..., λ_n.
– We can verify that

$$AU = \begin{bmatrix} | & | & & | \\ Au_1 & Au_2 & \cdots & Au_n \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} | & | & & | \\ \lambda_1 u_1 & \lambda_2 u_2 & \cdots & \lambda_n u_n \\ | & | & & | \end{bmatrix}
= U \operatorname{diag}(\lambda_1, \dots, \lambda_n) = U\Lambda$$

– Recalling that an orthonormal matrix U satisfies UU^T = I, we can diagonalize the matrix A:

$$A = AUU^T = U\Lambda U^T \tag{4}$$
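A NumPy check of the eigendecomposition of a symmetric matrix (an added sketch; `eigh` is NumPy's routine for symmetric/Hermitian matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = 0.5 * (M + M.T)  # a symmetric matrix

lam, U = np.linalg.eigh(A)  # real eigenvalues, orthonormal eigenvectors
assert np.allclose(U.T @ U, np.eye(4))        # columns of U are orthonormal
assert np.allclose(A @ U, U @ np.diag(lam))   # A U = U Lambda
assert np.allclose(A, U @ np.diag(lam) @ U.T) # A = U Lambda U^T
```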
Background: representing a vector w.r.t. another basis
– Any orthonormal matrix

$$U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix}$$

defines a new basis of R^n.
– Any vector x ∈ R^n can be represented as a linear combination of u_1, ..., u_n with coefficients x̂_1, ..., x̂_n:

$$x = \hat{x}_1 u_1 + \dots + \hat{x}_n u_n = U\hat{x}$$

– Indeed, such an x̂ exists uniquely:

$$x = U\hat{x} \iff U^T x = \hat{x}$$

In other words, the vector x̂ = U^T x can serve as another representation of the vector x w.r.t. the basis defined by U.
“Diagonalizing” matrix-vector multiplication.
– Left-multiplying by the matrix A can be viewed as left-multiplying by a diagonal matrix w.r.t. the basis of the eigenvectors.
– Suppose x is a vector and x̂ is its representation w.r.t. the basis of U.
– Let z = Ax be the matrix-vector product.
– The representation of z w.r.t. the basis of U is:

$$\hat{z} = U^T z = U^T A x = U^T U \Lambda U^T x = \Lambda \hat{x}
= \begin{bmatrix} \lambda_1 \hat{x}_1 \\ \lambda_2 \hat{x}_2 \\ \vdots \\ \lambda_n \hat{x}_n \end{bmatrix}$$

– We see that left-multiplying by the matrix A in the original space is equivalent to left-multiplying by the diagonal matrix Λ w.r.t. the new basis, which is merely scaling each coordinate by the corresponding eigenvalue.
“Diagonalizing” matrix-vector multiplication.
Under the new basis, multiplying by a matrix multiple times becomes much simpler as well. For example, suppose q = AAAx. Then

$$\hat{q} = U^T q = U^T AAAx = U^T U \Lambda U^T U \Lambda U^T U \Lambda U^T x = \Lambda^3 \hat{x}
= \begin{bmatrix} \lambda_1^3 \hat{x}_1 \\ \lambda_2^3 \hat{x}_2 \\ \vdots \\ \lambda_n^3 \hat{x}_n \end{bmatrix}$$
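A NumPy check that applying A three times just cubes each eigenvalue in the eigenbasis (an added sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
A = 0.5 * (M + M.T)  # symmetric
lam, U = np.linalg.eigh(A)
x = rng.standard_normal(4)

# Representation of x in the eigenbasis, then scale each coordinate by lambda_i^3
x_hat = U.T @ x
q_hat = lam ** 3 * x_hat
# Mapping back to the original basis recovers A A A x
assert np.allclose(U @ q_hat, A @ A @ A @ x)
```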
“Diagonalizing” quadratic form.
As a direct corollary, the quadratic form x^T A x can also be simplified under the new basis:

$$x^T A x = x^T U \Lambda U^T x = \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^n \lambda_i \hat{x}_i^2$$

(Recall that with the old representation, x^T A x = Σ_{i,j=1}^n x_i x_j A_{ij} involves a sum of n² terms instead of the n terms in the equation above.)
The definiteness of the matrix A depends entirely on the sign of its eigenvalues.
1. If all λ_i > 0, then the matrix A is positive definite because x^T A x = Σ_{i=1}^n λ_i x̂_i² > 0 for any x̂ ≠ 0.¹
2. If all λ_i ≥ 0, it is positive semidefinite because x^T A x = Σ_{i=1}^n λ_i x̂_i² ≥ 0 for all x̂.
3. Likewise, if all λ_i < 0 or all λ_i ≤ 0, then A is negative definite or negative semidefinite, respectively.
4. Finally, if A has both positive and negative eigenvalues, say λ_i > 0 and λ_j < 0, then it is indefinite. This is because if we let x̂ satisfy x̂_i = 1 and x̂_k = 0 for all k ≠ i, then x^T A x = Σ_{k=1}^n λ_k x̂_k² > 0. Similarly, we can let x̂ satisfy x̂_j = 1 and x̂_k = 0 for all k ≠ j; then x^T A x = Σ_{k=1}^n λ_k x̂_k² < 0.

¹ Note that x̂ ≠ 0 ⟺ x ≠ 0.
Matrix Calculus
The Gradient
Suppose that f : R^{m×n} → R is a function that takes as input a matrix A of size m × n and returns a real value. Then the gradient of f (with respect to A ∈ R^{m×n}) is the matrix of partial derivatives, defined as:

$$\nabla_A f(A) \in \mathbb{R}^{m\times n} = \begin{bmatrix}
\frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\
\frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}}
\end{bmatrix},$$

i.e., an m × n matrix with

$$(\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}.$$
The Gradient
Note that the size of ∇_A f(A) is always the same as the size of A. So if, in particular, A is just a vector x ∈ R^n,

$$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}.$$

It follows directly from the equivalent properties of partial derivatives that:
– ∇_x (f(x) + g(x)) = ∇_x f(x) + ∇_x g(x).
– For t ∈ R, ∇_x (t f(x)) = t ∇_x f(x).
The Hessian
Suppose that f : R^n → R is a function that takes a vector in R^n and returns a real number. Then the Hessian matrix with respect to x, written ∇²_x f(x) or simply as H, is the n × n matrix of partial derivatives,

$$\nabla_x^2 f(x) \in \mathbb{R}^{n\times n} = \begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2}
\end{bmatrix}.$$

In other words, ∇²_x f(x) ∈ R^{n×n}, with

$$(\nabla_x^2 f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.$$

Note that the Hessian is always symmetric, since

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}.$$
Gradients of Linear Functions
For x ∈ R^n, let f(x) = b^T x for some known vector b ∈ R^n. Then

$$f(x) = \sum_{i=1}^n b_i x_i,$$

so

$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n b_i x_i = b_k.$$

From this we can easily see that ∇_x b^T x = b. This should be compared to the analogous situation in single variable calculus, where ∂/∂x (ax) = a.
Gradients of Quadratic Functions
Now consider the quadratic function f(x) = x^T A x for A ∈ S^n. Remember that

$$f(x) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j.$$

To take the partial derivative, we'll consider the terms including x_k and x_k² factors separately:

$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j$$

$$= \frac{\partial}{\partial x_k} \left[ \sum_{i \neq k} \sum_{j \neq k} A_{ij} x_i x_j + \sum_{i \neq k} A_{ik} x_i x_k + \sum_{j \neq k} A_{kj} x_k x_j + A_{kk} x_k^2 \right]$$

$$= \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k$$

$$= \sum_{i=1}^n A_{ik} x_i + \sum_{j=1}^n A_{kj} x_j = 2 \sum_{i=1}^n A_{ki} x_i,$$

where the last equality follows since A is symmetric. Hence ∇_x x^T A x = 2Ax.
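The identity ∇_x x^T A x = 2Ax can be verified against a finite-difference gradient; a NumPy sketch (added for this review):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((4, 4))
A = 0.5 * (M + M.T)  # symmetric, as the derivation assumes
x = rng.standard_normal(4)

f = lambda v: v @ A @ v
# Central finite differences approximate the gradient; it should match 2Ax
eps = 1e-6
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(4)])
assert np.allclose(grad, 2 * A @ x, atol=1e-5)
```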
Hessian of Quadratic Functions
Finally, let's look at the Hessian of the quadratic function f(x) = x^T A x. In this case,

$$\frac{\partial^2 f(x)}{\partial x_k \partial x_\ell} = \frac{\partial}{\partial x_k} \left[ \frac{\partial f(x)}{\partial x_\ell} \right] = \frac{\partial}{\partial x_k} \left[ 2 \sum_{i=1}^n A_{\ell i} x_i \right] = 2 A_{\ell k} = 2 A_{k \ell}.$$

Therefore, it should be clear that ∇²_x x^T A x = 2A, which should be entirely expected (and again analogous to the single-variable fact that ∂²/∂x² (ax²) = 2a).
Recap
– ∇_x b^T x = b
– ∇²_x b^T x = 0
– ∇_x x^T A x = 2Ax (if A symmetric)
– ∇²_x x^T A x = 2A (if A symmetric)
Matrix Calculus Example: Least Squares
– Given a full-rank matrix A ∈ R^{m×n} and a vector b ∈ R^m such that b ∉ R(A), we want to find a vector x such that Ax is as close as possible to b, as measured by the square of the Euclidean norm ‖Ax − b‖²_2.
– Using the fact that ‖x‖²_2 = x^T x, we have

$$\|Ax - b\|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b$$

– Taking the gradient with respect to x we have:

$$\nabla_x (x^T A^T A x - 2 b^T A x + b^T b) = \nabla_x x^T A^T A x - \nabla_x 2 b^T A x + \nabla_x b^T b = 2 A^T A x - 2 A^T b$$

– Setting this last expression equal to zero and solving for x gives the normal equations

$$x = (A^T A)^{-1} A^T b$$
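The normal-equations solution can be compared against NumPy's least-squares solver; a sketch added for this review (in practice `lstsq` or a QR-based solve is preferred numerically over forming A^T A):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 3))  # full rank, m > n
b = rng.standard_normal(6)       # generically not in R(A)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # library least squares
assert np.allclose(x_normal, x_lstsq)
# At the minimizer, the gradient 2 A^T A x - 2 A^T b vanishes
assert np.allclose(A.T @ (A @ x_normal - b), 0, atol=1e-10)
```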